1. Introduction
Apache Kafka is a messaging platform. With it, we can exchange data between different applications at scale.
Spring Cloud Stream is a framework for building message-driven applications. It can simplify the integration of Kafka into our services.
Conventionally, Kafka is used with the Avro message format, supported by a schema registry. In this tutorial, we’ll use the Confluent Schema Registry. We’ll try both Spring’s implementation of integration with the Confluent Schema Registry and also the Confluent native libraries.
2. Confluent Schema Registry
Kafka represents all data as bytes, so it’s common to use an external schema and serialize and deserialize into bytes according to that schema. Rather than supply a copy of that schema with each message, which would be an expensive overhead, it’s also common to keep the schema in a registry and supply just an id with each message.
Confluent Schema Registry provides an easy way to store, retrieve and manage schemas. It exposes several useful RESTful APIs.
Schemata are stored by subject, and by default, the registry does a compatibility check before allowing a new schema to be uploaded against a subject.
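For example, assuming the registry is running on localhost:8081 and already holds a version under the hypothetical subject employee-details-value, we can ask whether a candidate schema would pass the compatibility check before we start producing with it:

```
curl -X POST -H "Content-Type: application/vnd.schemaregistry.v1+json" \
  --data '{"schema": "{\"type\": \"string\"}"}' \
  http://localhost:8081/compatibility/subjects/employee-details-value/versions/latest
```

The response indicates whether the submitted schema is compatible with the latest registered version for that subject.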
Each producer will know the schema it’s producing with, and each consumer should be able to either consume data in ANY format or should have a specific schema it prefers to read in. The producer consults the registry to establish the correct ID to use when sending a message. The consumer uses the registry to fetch the sender’s schema.
When the consumer knows both the sender’s schema and its own desired message format, the Avro library can convert the data into the consumer’s desired format.
3. Apache Avro
Apache Avro is a data serialization system.
It uses a JSON structure to define the schema, providing for serialization between bytes and structured data.
One strength of Avro is its support for evolving messages written in one version of a schema into the format defined by a compatible alternative schema.
The Avro toolset is also able to generate classes to represent the data structures of these schemata, making it easy to serialize in and out of POJOs.
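For instance, a reader's schema can add a field with a default value and still consume records written before that field existed. A hypothetical evolved version of an Employee record schema might look like this (the department field and its default are our own illustration):

```
{
    "type": "record",
    "name": "Employee",
    "namespace": "com.baeldung.schema",
    "fields": [
        { "name": "id", "type": "int" },
        { "name": "firstName", "type": "string" },
        { "name": "lastName", "type": "string" },
        { "name": "department", "type": "string", "default": "UNKNOWN" }
    ]
}
```

Records written without department still deserialize cleanly; the missing field simply takes its default value.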
4. Setting up the Project
To use a schema registry with Spring Cloud Stream, we need the Spring Cloud Kafka Binder and schema registry Maven dependencies:
<dependency>
    <groupId>org.springframework.cloud</groupId>
    <artifactId>spring-cloud-stream-binder-kafka</artifactId>
</dependency>
<dependency>
    <groupId>org.springframework.cloud</groupId>
    <artifactId>spring-cloud-stream-schema</artifactId>
</dependency>
For Confluent’s serializer, we need:
<dependency>
    <groupId>io.confluent</groupId>
    <artifactId>kafka-avro-serializer</artifactId>
    <version>4.0.0</version>
</dependency>
And Confluent's serializer is in their repository:
<repositories>
    <repository>
        <id>confluent</id>
        <url>https://packages.confluent.io/maven/</url>
    </repository>
</repositories>
Also, let’s use a Maven plugin to generate the Avro classes:
<build>
    <plugins>
        <plugin>
            <groupId>org.apache.avro</groupId>
            <artifactId>avro-maven-plugin</artifactId>
            <version>1.8.2</version>
            <executions>
                <execution>
                    <id>schemas</id>
                    <phase>generate-sources</phase>
                    <goals>
                        <goal>schema</goal>
                        <goal>protocol</goal>
                        <goal>idl-protocol</goal>
                    </goals>
                    <configuration>
                        <sourceDirectory>${project.basedir}/src/main/resources/</sourceDirectory>
                        <outputDirectory>${project.basedir}/src/main/java/</outputDirectory>
                    </configuration>
                </execution>
            </executions>
        </plugin>
    </plugins>
</build>
For testing, we can use either an existing Kafka and Schema Registry setup or a dockerized Confluent and Kafka.
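For the dockerized option, a minimal docker-compose.yml along these lines can work (the image tags, ports, and wiring below are assumptions to adjust for your environment):

```
version: '2'
services:
  zookeeper:
    image: confluentinc/cp-zookeeper:4.0.0
    environment:
      ZOOKEEPER_CLIENT_PORT: 2181
  kafka:
    image: confluentinc/cp-kafka:4.0.0
    depends_on:
      - zookeeper
    ports:
      - "9092:9092"
    environment:
      KAFKA_ZOOKEEPER_CONNECT: zookeeper:2181
      KAFKA_ADVERTISED_LISTENERS: PLAINTEXT://localhost:9092
      KAFKA_OFFSETS_TOPIC_REPLICATION_FACTOR: 1
  schema-registry:
    image: confluentinc/cp-schema-registry:4.0.0
    depends_on:
      - kafka
    ports:
      - "8081:8081"
    environment:
      SCHEMA_REGISTRY_HOST_NAME: schema-registry
      SCHEMA_REGISTRY_KAFKASTORE_BOOTSTRAP_SERVERS: PLAINTEXT://kafka:9092
```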
5. Spring Cloud Stream
Now that we’ve got our project set up, let’s next write a producer using Spring Cloud Stream. It’ll publish employee details on a topic.
Then, we’ll create a consumer which will read events from the topic and write them out in a log statement.
5.1. Schema
First, let’s define a schema for employee details. We can name it employee-schema.avsc.
We can keep the schema file in src/main/resources:
{
    "type": "record",
    "name": "Employee",
    "namespace": "com.baeldung.schema",
    "fields": [
        { "name": "id", "type": "int" },
        { "name": "firstName", "type": "string" },
        { "name": "lastName", "type": "string" }
    ]
}
After creating the above schema, we need to build the project. Then, the Apache Avro code generator will create a POJO named Employee under the package com.baeldung.schema.
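Since the plugin is bound to the generate-sources phase, any build that reaches that phase will regenerate the classes, for example:

```
mvn clean generate-sources
```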
5.2. Producer
Spring Cloud Stream provides the Processor interface. This provides us with an output and input channel.
Let’s use this to make a producer that sends Employee objects to the employee-details Kafka topic:
@Autowired
private Processor processor;

public void produceEmployeeDetails(int empId, String firstName, String lastName) {
    // creating employee details
    Employee employee = new Employee();
    employee.setId(empId);
    employee.setFirstName(firstName);
    employee.setLastName(lastName);

    Message<Employee> message = MessageBuilder.withPayload(employee)
      .build();

    processor.output()
      .send(message);
}
5.3. Consumer
Now, let’s write our consumer:
@StreamListener(Processor.INPUT) public void consumeEmployeeDetails(Employee employeeDetails) { logger.info("Let's process employee details: {}", employeeDetails); }
This consumer will read events published on the employee-details topic. Let’s direct its output to the log to see what it does.
5.4. Kafka Bindings
So far we’ve only been working against the input and output channels of our Processor object. These channels need configuring with the correct destinations.
Let’s use application.yml to provide the Kafka bindings:
spring:
  cloud:
    stream:
      bindings:
        input:
          destination: employee-details
          content-type: application/*+avro
        output:
          destination: employee-details
          content-type: application/*+avro
We should note that, in this case, destination means the Kafka topic. It may be slightly confusing that it is called destination since it is the input source in this case, but it’s a consistent term across consumers and producers.
5.5. Entry Point
Now that we have our producer and consumer, let's expose an API to take input from a user and pass it to the producer:
@Autowired
private AvroProducer avroProducer;

@PostMapping("/employees/{id}/{firstName}/{lastName}")
public String producerAvroMessage(@PathVariable int id, @PathVariable String firstName,
  @PathVariable String lastName) {
    avroProducer.produceEmployeeDetails(id, firstName, lastName);
    return "Sent employee details to consumer";
}
5.6. Enable the Confluent Schema Registry and Bindings
Finally, to make our application apply both the Kafka and schema registry bindings, we’ll need to add @EnableBinding and @EnableSchemaRegistryClient on one of our configuration classes:
@SpringBootApplication
@EnableBinding(Processor.class)
@EnableSchemaRegistryClient
public class AvroKafkaApplication {

    public static void main(String[] args) {
        SpringApplication.run(AvroKafkaApplication.class, args);
    }
}
And we should provide a ConfluentSchemaRegistryClient bean:
@Value("${spring.cloud.stream.kafka.binder.producer-properties.schema.registry.url}")
private String endPoint;

@Bean
public SchemaRegistryClient schemaRegistryClient() {
    ConfluentSchemaRegistryClient client = new ConfluentSchemaRegistryClient();
    client.setEndpoint(endPoint);
    return client;
}
The endPoint is the URL for the Confluent Schema Registry.
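For that @Value expression to resolve, the property has to be present in our configuration. Assuming the registry runs locally on its default port 8081, we can add it to application.yml:

```
spring:
  cloud:
    stream:
      kafka:
        binder:
          producer-properties:
            schema.registry.url: http://localhost:8081
```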
5.7. Testing Our Service
Let’s test the service with a POST request:
curl -X POST localhost:8080/employees/1001/Harry/Potter
The logs tell us that this has worked:
2019-06-11 18:45:45.343 INFO 17036 --- [container-0-C-1] com.baeldung.consumer.AvroConsumer : Let's process employee details: {"id": 1001, "firstName": "Harry", "lastName": "Potter"}
5.8. What Happened During Processing?
Let’s try to understand what exactly happened with our example application:
- The producer built the Kafka message using the Employee object
- The producer registered the employee schema with the schema registry to get a schema version ID; the registry either created a new ID or reused the existing one for that exact schema
- Avro serialized the Employee object using the schema
- Spring Cloud put the schema-id in the message headers
- The message was published on the topic
- When the message came to the consumer, it read the schema-id from the header
- The consumer used schema-id to get the Employee schema from the registry
- The consumer found a local class that could represent that object and deserialized the message into it
6. Serialization/Deserialization Using Native Kafka Libraries
Spring Boot provides a few out-of-the-box message converters. By default, Spring Boot uses the Content-Type header to select an appropriate one.
In our example, the Content-Type is application/*+avro, so Spring used AvroSchemaMessageConverter to read and write Avro formats. However, Confluent recommends using KafkaAvroSerializer and KafkaAvroDeserializer for message conversion.
While Spring's own format works well, it has some drawbacks around partitioning, and it isn't interoperable with the Confluent standard, which some non-Spring services on our Kafka instance may require.
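Part of that standard is the wire format itself: Confluent's serializers don't use message headers at all, but prepend a single magic byte (0) and a 4-byte, big-endian schema id to the Avro payload. Here's a minimal sketch of framing and unframing that layout (the class and method names are our own, for illustration):

```java
import java.nio.ByteBuffer;

// Sketch of the Confluent wire format: one magic byte (0), a 4-byte
// big-endian schema id, then the Avro-encoded payload bytes.
class ConfluentWireFormat {

    static byte[] frame(int schemaId, byte[] avroPayload) {
        return ByteBuffer.allocate(5 + avroPayload.length)
          .put((byte) 0)       // magic byte
          .putInt(schemaId)    // schema id assigned by the registry
          .put(avroPayload)    // Avro binary data
          .array();
    }

    static int schemaIdOf(byte[] framedMessage) {
        ByteBuffer buffer = ByteBuffer.wrap(framedMessage);
        if (buffer.get() != 0) {
            throw new IllegalArgumentException("Unknown magic byte");
        }
        return buffer.getInt(); // the next four bytes are the schema id
    }
}
```

Because the id travels inside the message bytes, any Confluent-compatible consumer can look up the writer's schema, regardless of which framework produced the message.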
Let’s update our application.yml to use the Confluent converters:
spring:
  cloud:
    stream:
      default:
        producer:
          useNativeEncoding: true
        consumer:
          useNativeEncoding: true
      bindings:
        input:
          destination: employee-details
          content-type: application/*+avro
        output:
          destination: employee-details
          content-type: application/*+avro
      kafka:
        binder:
          producer-properties:
            key.serializer: io.confluent.kafka.serializers.KafkaAvroSerializer
            value.serializer: io.confluent.kafka.serializers.KafkaAvroSerializer
            schema.registry.url: http://localhost:8081
          consumer-properties:
            key.deserializer: io.confluent.kafka.serializers.KafkaAvroDeserializer
            value.deserializer: io.confluent.kafka.serializers.KafkaAvroDeserializer
            schema.registry.url: http://localhost:8081
            specific.avro.reader: true
We've enabled useNativeEncoding, which forces Spring Cloud Stream to delegate serialization to the provided classes.
We should also note that we can provide native Kafka settings to Spring Cloud Stream using kafka.binder.producer-properties and kafka.binder.consumer-properties.
7. Consumer Groups and Partitions
A consumer group is a set of consumers belonging to the same application. Consumers in the same consumer group share the same group name.
Let’s update application.yml to add a consumer group name:
spring:
  cloud:
    stream:
      # ...
      bindings:
        input:
          destination: employee-details
          content-type: application/*+avro
          group: group-1
      # ...
The consumers in a group divide the topic partitions among themselves evenly. Messages in different partitions can be processed in parallel.
In a consumer group, the max number of consumers reading messages at a time is equal to the number of partitions. So we can configure the number of partitions and consumers to get the desired parallelism. In general, we should have more partitions than the total number of consumers across all replicas of our service.
7.1. Partition Key
When processing our messages, the order in which they are processed may matter. When messages are processed in parallel, the sequence of processing is hard to control.
Kafka guarantees that within a given partition, messages are always processed in the order they arrived. So, where it matters that certain messages are processed in the right order, we ensure that they land in the same partition as each other.
We can provide a partition key while sending a message to a topic. The messages with the same partition key will always go to the same partition. If the partition key is not present, messages will be partitioned in round-robin fashion.
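Kafka's default partitioner makes this deterministic by hashing the serialized key and taking the result modulo the partition count, so equal keys always map to the same partition. Here's a simplified, self-contained sketch of the idea (Kafka's real implementation hashes the key bytes with murmur2; String.hashCode() is a stand-in to keep the sketch self-contained):

```java
// Simplified illustration of key-based partitioning: equal keys always
// land in the same partition, so relative order is preserved per key.
class KeyPartitioning {

    static int partitionFor(String partitionKey, int partitionCount) {
        // mask off the sign bit so the result is a valid partition index
        return (partitionKey.hashCode() & Integer.MAX_VALUE) % partitionCount;
    }
}
```

With three partitions, every message carrying the same key ends up in the same partition, and so is consumed in the order it was produced.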
Let's try to understand this with an example. Imagine we're receiving multiple messages for an employee, and we want to process all the messages for that employee in sequence. The department name and employee id identify an employee uniquely.
So let's define the partition key with the employee's id and department name:
{
    "type": "record",
    "name": "EmployeeKey",
    "namespace": "com.baeldung.schema",
    "fields": [
        { "name": "id", "type": "int" },
        { "name": "departmentName", "type": "string" }
    ]
}
After building the project, the EmployeeKey POJO will get generated under the package com.baeldung.schema.
Let’s update our producer to use the EmployeeKey as a partition key:
public void produceEmployeeDetails(int empId, String firstName, String lastName) {
    // creating employee details
    Employee employee = new Employee();
    employee.setId(empId);
    // ...

    // creating partition key for kafka topic
    EmployeeKey employeeKey = new EmployeeKey();
    employeeKey.setId(empId);
    employeeKey.setDepartmentName("IT");

    Message<Employee> message = MessageBuilder.withPayload(employee)
      .setHeader(KafkaHeaders.MESSAGE_KEY, employeeKey)
      .build();

    processor.output()
      .send(message);
}
Here, we’re putting the partition key in the message header.
Now, the same partition will receive the messages with the same employee id and department name.
7.2. Consumer Concurrency
Spring Cloud Stream allows us to set the concurrency for a consumer in application.yml:
spring:
  cloud:
    stream:
      # ...
      bindings:
        input:
          destination: employee-details
          content-type: application/*+avro
          group: group-1
          concurrency: 3
Now our consumers will read three messages from the topic concurrently. In other words, Spring will spawn three different threads to consume independently.
8. Conclusion
In this article, we integrated a producer and consumer against Apache Kafka with Avro schemas and the Confluent Schema Registry.
We did this in a single application, but the producer and consumer could have been deployed in different applications and would have been able to have their own versions of the schemas, kept in sync via the registry.
We looked at how to use Spring’s implementation of Avro and Schema Registry client, and then we saw how to switch over to the Confluent standard implementation of serialization and deserialization for the purposes of interoperability.
Finally, we looked at how to partition our topic and ensure we have the correct message keys to enable safe parallel processing of our messages.
The complete code used for this article can be found over on GitHub.