What is Avro?
Apache Avro is a data serialization framework designed for efficiency and flexibility. It is widely used in Big Data ecosystems such as Hadoop and Spark and in event-driven architectures built on Apache Kafka. Avro serializes and deserializes structured data compactly and quickly, making it an excellent choice for applications that require high-performance data exchange.
Types of Avro Encoding
Avro supports two main types of encoding:
Binary Encoding – Compact, fast, and efficient; ideal for performance-critical applications.
JSON Encoding – Human-readable and easy to debug, but larger and slower than binary (a quick size comparison follows this list).
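To make the difference concrete, here is a minimal sketch that encodes the same record both ways and prints the resulting sizes. The one-field schema and values are illustrative only:

import java.io.ByteArrayOutputStream;
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.BinaryEncoder;
import org.apache.avro.io.DatumWriter;
import org.apache.avro.io.EncoderFactory;
import org.apache.avro.io.JsonEncoder;

public class EncodingComparison {
    public static void main(String[] args) throws Exception {
        Schema schema = new Schema.Parser().parse(
            "{\"type\":\"record\",\"name\":\"Ping\",\"fields\":[{\"name\":\"msg\",\"type\":\"string\"}]}");
        GenericRecord record = new GenericData.Record(schema);
        record.put("msg", "hello");

        DatumWriter<GenericRecord> writer = new GenericDatumWriter<>(schema);

        // Binary encoding: field names are not written, only the values
        ByteArrayOutputStream binaryOut = new ByteArrayOutputStream();
        BinaryEncoder binaryEncoder = EncoderFactory.get().binaryEncoder(binaryOut, null);
        writer.write(record, binaryEncoder);
        binaryEncoder.flush();

        // JSON encoding: self-describing text, noticeably larger
        ByteArrayOutputStream jsonOut = new ByteArrayOutputStream();
        JsonEncoder jsonEncoder = EncoderFactory.get().jsonEncoder(schema, jsonOut);
        writer.write(record, jsonEncoder);
        jsonEncoder.flush();

        System.out.println("binary bytes: " + binaryOut.size());
        System.out.println("json bytes:   " + jsonOut.size());
    }
}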
What is Schema Registry?
Schema Registry is a central repository that stores and manages Avro schemas used in serialization and deserialization processes. It ensures that producers and consumers can share a common schema without embedding it in every message, reducing redundancy and improving efficiency.
How Schema Registry Works in Avro-Based Systems:
1. Producer registers schema – When a producer sends data, it first registers the schema with Schema Registry (if not already present). Instead of sending the full schema with each message, the producer includes only a schema ID in the message (see the wire-format sketch after this list).
2. Producer sends serialized message – Kafka, Pulsar, or another messaging system transmits the serialized message along with the schema ID.
3. Consumer fetches schema – Upon receiving the message, the consumer reads the schema ID and queries Schema Registry for the corresponding schema.
4. Consumer deserializes message using schema – The consumer uses the fetched schema to deserialize the message correctly.
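For reference, the Confluent serializers implement this with a small wire format: byte 0 is a magic byte (always 0), bytes 1–4 are the schema ID as a big-endian int, and the remainder is the Avro binary payload. A minimal sketch of reading the schema ID back out of a message:

import java.nio.ByteBuffer;

public class WireFormat {
    // Extracts the Schema Registry ID from a message produced by KafkaAvroSerializer
    public static int extractSchemaId(byte[] message) {
        ByteBuffer buffer = ByteBuffer.wrap(message);
        byte magicByte = buffer.get();   // always 0 in the current wire format
        if (magicByte != 0) {
            throw new IllegalArgumentException("Unknown magic byte: " + magicByte);
        }
        return buffer.getInt();          // 4-byte schema ID, big-endian
    }
}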
Avro schema example
User.avsc
{
  "type": "record",
  "name": "User",
  "namespace": "com.example.avro",
  "fields": [
    {"name": "id", "type": "int"},
    {"name": "name", "type": "string"},
    {"name": "email", "type": ["null", "string"], "default": null}
  ]
}
The email field is a union of null and string with a default of null, which is Avro's idiom for an optional field.
Generating classes from Avro schema
To generate Java classes from an Avro schema, add the following dependencies and plugin to your pom.xml:
<dependencies>
  <dependency>
    <groupId>org.apache.avro</groupId>
    <artifactId>avro</artifactId>
    <version>1.11.1</version>
  </dependency>
</dependencies>

<build>
  <plugins>
    <plugin>
      <groupId>org.apache.avro</groupId>
      <artifactId>avro-maven-plugin</artifactId>
      <version>1.11.1</version>
      <executions>
        <execution>
          <phase>generate-sources</phase>
          <goals>
            <goal>schema</goal>
          </goals>
          <configuration>
            <sourceDirectory>${project.basedir}/src/main/resources/avro</sourceDirectory>
            <outputDirectory>${project.build.directory}/generated-sources</outputDirectory>
          </configuration>
        </execution>
      </executions>
    </plugin>
  </plugins>
</build>
Run the following command:
mvn clean compile
This will generate Java classes from the Avro schema files in src/main/resources/avro/.
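Each record in the schema becomes a Java class with a fluent builder. A quick sketch (the field values are illustrative):

User user = User.newBuilder()
        .setId(1)
        .setName("Alice")
        .build();   // email is optional and defaults to null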
Serializing and Deserializing Data (Binary & JSON)
// Imports used by the classes below (each class would normally live in its own file):
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import org.apache.avro.io.BinaryDecoder;
import org.apache.avro.io.BinaryEncoder;
import org.apache.avro.io.DatumReader;
import org.apache.avro.io.DatumWriter;
import org.apache.avro.io.DecoderFactory;
import org.apache.avro.io.EncoderFactory;
import org.apache.avro.io.JsonDecoder;
import org.apache.avro.io.JsonEncoder;
import org.apache.avro.specific.SpecificDatumReader;
import org.apache.avro.specific.SpecificDatumWriter;
import com.example.avro.User;

public class AvroBinarySerializer {
    public static byte[] serialize(User user) throws IOException {
        DatumWriter<User> writer = new SpecificDatumWriter<>(User.class);
        ByteArrayOutputStream outputStream = new ByteArrayOutputStream();
        BinaryEncoder encoder = EncoderFactory.get().binaryEncoder(outputStream, null);
        writer.write(user, encoder);
        encoder.flush();
        return outputStream.toByteArray();
    }
}

public class AvroBinaryDeserializer {
    public static User deserialize(byte[] data) throws IOException {
        DatumReader<User> reader = new SpecificDatumReader<>(User.class);
        BinaryDecoder decoder = DecoderFactory.get().binaryDecoder(data, null);
        return reader.read(null, decoder);
    }
}

public class AvroJsonSerializer {
    public static String serializeToJson(User user) throws IOException {
        DatumWriter<User> writer = new SpecificDatumWriter<>(User.class);
        // jsonEncoder expects an OutputStream, not a Writer
        ByteArrayOutputStream outputStream = new ByteArrayOutputStream();
        JsonEncoder encoder = EncoderFactory.get().jsonEncoder(User.getClassSchema(), outputStream);
        writer.write(user, encoder);
        encoder.flush();
        return outputStream.toString("UTF-8");
    }
}

public class AvroJsonDeserializer {
    public static User deserializeFromJson(String json) throws IOException {
        DatumReader<User> reader = new SpecificDatumReader<>(User.class);
        JsonDecoder decoder = DecoderFactory.get().jsonDecoder(User.getClassSchema(), json);
        return reader.read(null, decoder);
    }
}
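A quick round trip using the helpers above (values are illustrative):

User user = User.newBuilder().setId(1).setName("Alice").build();

byte[] bytes = AvroBinarySerializer.serialize(user);
User fromBinary = AvroBinaryDeserializer.deserialize(bytes);

String json = AvroJsonSerializer.serializeToJson(user);   // {"id":1,"name":"Alice","email":null}
User fromJson = AvroJsonDeserializer.deserializeFromJson(json);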
Integrating Avro with Spring Boot and Kafka
Add the following dependencies to pom.xml. Note that kafka-avro-serializer is published to the Confluent Maven repository (https://packages.confluent.io/maven/), so you may need to declare that repository in your build:
<dependencies>
  <dependency>
    <groupId>org.springframework.boot</groupId>
    <artifactId>spring-boot-starter</artifactId>
  </dependency>
  <dependency>
    <groupId>org.springframework.kafka</groupId>
    <artifactId>spring-kafka</artifactId>
  </dependency>
  <dependency>
    <groupId>io.confluent</groupId>
    <artifactId>kafka-avro-serializer</artifactId>
    <version>7.0.1</version>
  </dependency>
  <dependency>
    <groupId>org.apache.avro</groupId>
    <artifactId>avro</artifactId>
    <version>1.11.1</version>
  </dependency>
</dependencies>
Then register the producer and consumer factories. The bootstrap address and group ID below are placeholders; note the extra settings the snippet needs at runtime (bootstrap servers, a consumer group, and specific-record deserialization):

@Configuration
public class KafkaAvroConfig {

    @Bean
    public ProducerFactory<String, User> producerFactory() {
        Map<String, Object> config = new HashMap<>();
        config.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // adjust to your broker
        config.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class);
        config.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, KafkaAvroSerializer.class);
        config.put("schema.registry.url", "http://localhost:8081");
        return new DefaultKafkaProducerFactory<>(config);
    }

    @Bean
    public KafkaTemplate<String, User> kafkaTemplate() {
        return new KafkaTemplate<>(producerFactory());
    }

    @Bean
    public ConsumerFactory<String, User> consumerFactory() {
        Map<String, Object> config = new HashMap<>();
        config.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // adjust to your broker
        config.put(ConsumerConfig.GROUP_ID_CONFIG, "user-consumer");           // example group id
        config.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class);
        config.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, KafkaAvroDeserializer.class);
        config.put("schema.registry.url", "http://localhost:8081");
        // Deserialize into the generated User class rather than GenericRecord
        config.put("specific.avro.reader", true);
        return new DefaultKafkaConsumerFactory<>(config);
    }
}
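With these beans in place, here is a minimal sketch of sending and receiving User records. The topic name "users" is an assumption for illustration, and the listener relies on Spring Boot auto-configuring its container factory from the consumerFactory bean:

@Service
public class UserProducer {
    private final KafkaTemplate<String, User> kafkaTemplate;

    public UserProducer(KafkaTemplate<String, User> kafkaTemplate) {
        this.kafkaTemplate = kafkaTemplate;
    }

    public void send(User user) {
        // Key by user id; KafkaAvroSerializer handles the value
        kafkaTemplate.send("users", String.valueOf(user.getId()), user);
    }
}

@Service
public class UserListener {
    @KafkaListener(topics = "users")
    public void onUser(User user) {
        System.out.println("Received user: " + user.getName());
    }
}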
Avro use cases
Apache Kafka
Used in Kafka to serialize messages efficiently.
Works with Schema Registry to manage schema evolution.
Big Data Processing (Hadoop, Spark, Flink)
Avro container files are commonly stored in HDFS.
The format is splittable and compressible, which suits distributed processing.
Microservices Communication
Enables efficient RPC communication between microservices.
Reduces overhead compared to JSON or XML.
Database Storage
Used in NoSQL databases like HBase for efficient data storage.
Provides schema evolution benefits.