Storing Null Values in Avro Files

1. Introduction

In this tutorial, we’ll explore two ways of handling null values and writing them to a file when working with Apache Avro in Java. Along the way, we’ll also discuss best practices for handling nullable fields.

2. The Problem With Null Values in Avro

Apache Avro is a data serialization framework that provides rich data structures and a compact, fast, binary data format. However, the use of null values in Avro requires special attention.

Let’s go over a common scenario where we might encounter issues:

GenericRecord record = new GenericData.Record(schema);
record.put("email", null);
// This might throw NullPointerException when writing to file

By default, Avro fields aren’t nullable. Attempting to store a null value in such a field results in a NullPointerException during serialization.

Before we take a look at the first solution, let’s set up our project with the correct dependency:
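To see the failure in isolation, here is a minimal sketch that encodes a record in memory instead of writing a file. The Contact schema and the NonNullableFieldSketch class are hypothetical names used only for illustration. The sketch catches RuntimeException rather than a specific type, since the exact exception class can differ depending on the Avro version and on whether the write goes through DataFileWriter:

```java
import org.apache.avro.Schema;
import org.apache.avro.SchemaBuilder;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.BinaryEncoder;
import org.apache.avro.io.EncoderFactory;

import java.io.ByteArrayOutputStream;
import java.io.IOException;

class NonNullableFieldSketch {

    // Hypothetical one-field schema: "email" is a plain string, so null isn't allowed
    static Schema schema() {
        return SchemaBuilder.record("Contact")
          .fields()
          .requiredString("email")
          .endRecord();
    }

    // Returns true if Avro rejects the record while encoding it in memory
    static boolean failsToSerialize(GenericRecord record) throws IOException {
        GenericDatumWriter<GenericRecord> writer = new GenericDatumWriter<>(record.getSchema());
        BinaryEncoder encoder = EncoderFactory.get().binaryEncoder(new ByteArrayOutputStream(), null);
        try {
            writer.write(record, encoder);
            encoder.flush();
            return false;
        } catch (RuntimeException e) {
            // Typically a NullPointerException that names the offending field
            return true;
        }
    }

    public static void main(String[] args) throws IOException {
        GenericRecord record = new GenericData.Record(schema());
        record.put("email", null); // put() itself doesn't validate the value
        System.out.println("Rejected: " + failsToSerialize(record));
    }
}
```

Note that the put() call succeeds. The problem only surfaces once the writer walks the schema and finds a null where a string is required.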

<dependency>
    <groupId>org.apache.avro</groupId>
    <artifactId>avro</artifactId>
    <version>1.12.0</version>
</dependency>

3. Solutions for Handling Null Values

In this section, we’ll explore two main approaches for handling null values in Avro: schema definition and annotation-based.

3.1. Defining a Schema in Three Possible Ways

We can define an Avro schema with acceptable null values in three ways. First, let’s look at the JSON string approach:

private static final String SCHEMA_JSON = """
    {
        "type": "record",
        "name": "User",
        "namespace": "com.baeldung.apache.avro.storingnullvaluesinavrofile",
        "fields": [
            {"name": "id", "type": "long"},
            {"name": "name", "type": "string"},
            {"name": "active", "type": "boolean"},
            {"name": "lastUpdatedBy", "type": ["null", "string"], "default": null},
            {"name": "email", "type": "string"}
        ]
    }""";

public static Schema createSchemaFromJson() {
    return new Schema.Parser().parse(SCHEMA_JSON);
}

Here, we’ve made the lastUpdatedBy field nullable using the union type syntax: ["null", "string"]. The email field remains a plain string, so it still rejects null values.
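The "default": null entry matters as well, because it lets a record be built without setting the field at all. As a rough illustration, the sketch below uses Avro’s GenericRecordBuilder with a trimmed-down, hypothetical version of the User schema. The class and method names are assumptions made for this example:

```java
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.generic.GenericRecordBuilder;

class NullDefaultSketch {

    // Trimmed-down, hypothetical version of the User schema with one nullable field
    static final String SCHEMA_JSON = """
        {
            "type": "record",
            "name": "User",
            "fields": [
                {"name": "id", "type": "long"},
                {"name": "lastUpdatedBy", "type": ["null", "string"], "default": null}
            ]
        }""";

    static GenericRecord buildWithoutLastUpdatedBy() {
        Schema schema = new Schema.Parser().parse(SCHEMA_JSON);
        // lastUpdatedBy is never set, so build() falls back to its declared default
        return new GenericRecordBuilder(schema)
          .set("id", 1L)
          .build();
    }

    public static void main(String[] args) {
        GenericRecord record = buildWithoutLastUpdatedBy();
        System.out.println(record.get("lastUpdatedBy") == null);
    }
}
```

Without a default, the builder would refuse to build the record until every field had been set explicitly.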

Next, we’ll use the SchemaBuilder approach for a more programmatic way to define our schema:

public static Schema createSchemaWithOptionalFields1() {
    return SchemaBuilder
      .record("User")
      .namespace("com.baeldung.apache.avro.storingnullvaluesinavrofile")
      .fields()
      .requiredLong("id")
      .requiredString("name")
      .requiredBoolean("active")
      .name("lastUpdatedBy")
      .type() // Start of configuration
      .unionOf()
      .nullType()
      .and()
      .stringType()
      .endUnion()
      .nullDefault() // End of configuration
      .requiredString("email")
      .endRecord();
}

In this example, we’re using SchemaBuilder to create a schema where the lastUpdatedBy field can be either null or a string value.

Finally, let’s create another schema, similar to the one above but with a different approach:

public static Schema createSchemaWithOptionalFields2() {
    return SchemaBuilder
      .record("User")
      .namespace("com.baeldung.apache.avro.storingnullvaluesinavrofile")
      .fields()
      .requiredLong("id")
      .requiredString("name")
      .requiredBoolean("active")
      .requiredString("lastUpdatedBy")
      .optionalString("email")  // Using optional field
      .endRecord();
}

Instead of using the type().unionOf().nullType().and().stringType().endUnion().nullDefault() chain, we’ve used optionalString(). Note that in this version the nullable field is email, while lastUpdatedBy is required.

Let’s quickly compare the last two ways of defining a schema, since they’re very similar.

The longer version offers more control over how the nullable field is configured, while the shorter version is syntactic sugar offered by SchemaBuilder. In essence, they do the same thing.
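One way to check this equivalence is to build the same field both ways and compare the results. The following is a minimal sketch with a hypothetical class name, reduced to a single email field:

```java
import org.apache.avro.Schema;
import org.apache.avro.SchemaBuilder;

class OptionalFieldComparisonSketch {

    // The explicit union chain, as in createSchemaWithOptionalFields1()
    static Schema.Field explicitUnion() {
        return SchemaBuilder.record("User").fields()
          .name("email").type().unionOf().nullType().and().stringType().endUnion().nullDefault()
          .endRecord()
          .getField("email");
    }

    // The shorthand, as in createSchemaWithOptionalFields2()
    static Schema.Field shorthand() {
        return SchemaBuilder.record("User").fields()
          .optionalString("email")
          .endRecord()
          .getField("email");
    }

    public static void main(String[] args) {
        Schema.Field a = explicitUnion();
        Schema.Field b = shorthand();
        // Both fields should carry the same union schema and a default value
        System.out.println(a.schema().equals(b.schema()));
        System.out.println(a.hasDefaultValue() && b.hasDefaultValue());
    }
}
```

The explicit chain becomes useful when the union needs more than two branches or a different ordering, which the shorthand can’t express.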

3.2. Using @Nullable Annotation

The next approach uses Avro’s built-in @Nullable annotation:

public class AvroUser {
    private long id;
    private String name;
    @Nullable
    private Boolean active;  
    private String lastUpdatedBy;  
    private String email; 
    // rest of code
}

This annotation, found in the org.apache.avro.reflect package, tells Avro’s reflection-based schema generation that the field can accept null values.
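To illustrate the effect of the annotation, the sketch below asks ReflectData to derive a schema from a small class. The Account class is a hypothetical stand-in for AvroUser, reduced to two fields, and isn’t part of the original code:

```java
import org.apache.avro.Schema;
import org.apache.avro.reflect.Nullable;
import org.apache.avro.reflect.ReflectData;

class ReflectNullableSketch {

    // Hypothetical stand-in for AvroUser, reduced to two fields
    static class Account {
        private long id;
        @Nullable
        private Boolean active;
    }

    // Derives the Avro schema from the class via reflection
    static Schema reflectedSchema() {
        return ReflectData.get().getSchema(Account.class);
    }

    public static void main(String[] args) {
        Schema schema = reflectedSchema();
        // The annotated field becomes a union that includes "null"
        System.out.println(schema.getField("active").schema());
        // The unannotated primitive field stays non-nullable
        System.out.println(schema.getField("id").schema().isNullable());
    }
}
```

Only the annotated field is turned into a union. Every other field keeps its plain type and still rejects null values.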

4. Implementation of Writing To File

Now, let’s look at how we’ll serialize the Record that contains the null value:

public static void writeToAvroFile(Schema schema, GenericRecord record, String filePath) throws IOException {
    DatumWriter<GenericRecord> datumWriter = new GenericDatumWriter<>(schema);
    try (DataFileWriter<GenericRecord> dataFileWriter = new DataFileWriter<>(datumWriter)) {
        dataFileWriter.create(schema, new File(filePath));
        dataFileWriter.append(record);
    }
}

First, we initialize a GenericDatumWriter, the DatumWriter implementation that works with GenericRecord objects. We pass the schema as a constructor argument so that it knows how to serialize the data.

Then, we initialize a DataFileWriter, the class that handles the actual writing of data to the Avro file. It also handles the file’s metadata and compression.

Next, using the create() method, we create the Avro file with the specified schema. At this point, the file header, which includes the schema and other metadata, is written.

Finally, we write the actual record in the file. If the record contains null values in fields marked with @Nullable or of the union type, these will be serialized correctly.
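To confirm that the null actually survives the round trip, we could read the file back. The following sketch is not part of the original code: the class name, the method name, and the two-field schema are assumptions chosen to keep the example short. It writes one record and then reads it with DataFileReader:

```java
import org.apache.avro.Schema;
import org.apache.avro.SchemaBuilder;
import org.apache.avro.file.DataFileReader;
import org.apache.avro.file.DataFileWriter;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;

import java.io.File;
import java.io.IOException;

class NullRoundTripSketch {

    static GenericRecord writeAndReadBack(File file) throws IOException {
        // Hypothetical two-field schema with one nullable string
        Schema schema = SchemaBuilder.record("User").fields()
          .requiredLong("id")
          .optionalString("lastUpdatedBy")
          .endRecord();

        GenericRecord record = new GenericData.Record(schema);
        record.put("id", 1L);
        record.put("lastUpdatedBy", null);

        try (DataFileWriter<GenericRecord> writer = new DataFileWriter<>(new GenericDatumWriter<>(schema))) {
            writer.create(schema, file);
            writer.append(record);
        }

        // The reader picks up the writer's schema from the file header
        try (DataFileReader<GenericRecord> reader = new DataFileReader<>(file, new GenericDatumReader<>())) {
            return reader.next();
        }
    }

    public static void main(String[] args) throws IOException {
        File file = File.createTempFile("users", ".avro");
        file.deleteOnExit();
        GenericRecord restored = writeAndReadBack(file);
        System.out.println(restored.get("lastUpdatedBy") == null);
    }
}
```

Because the schema travels in the file header, the reader doesn’t need to be told about the nullable field in advance.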

5. Testing Our Solution

Now, let’s check that our implementations work correctly:

@Test
void whenSerializingUserWithNullPropFromStringSchema_thenSuccess(@TempDir Path tempDir) {
    user.setLastUpdatedBy(null);
    schema = AvroUser.createSchemaFromJson(); // use the JSON-defined schema for both the record and the writer
    String filePath = tempDir.resolve("test.avro").toString();
    GenericRecord record = AvroUser.createRecord(schema, user);
    assertDoesNotThrow(() -> AvroUser.writeToAvroFile(schema, record, filePath));
    File avroFile = new File(filePath);
    assertTrue(avroFile.exists());
    assertTrue(avroFile.length() > 0);
}

In this test, we first set the lastUpdatedBy field to null. Then, we create the schema from the JSON string declared at the beginning and build the record from it.

The write completes without throwing, so the record is successfully serialized with a null value. Next, let’s run the same check against the first SchemaBuilder variant:

@Test
void givenSchemaBuilderWithOptionalFields1_whenCreatingSchema_thenSupportsNull(@TempDir Path tempDir) {
    user.setLastUpdatedBy(null);
    String filePath = tempDir.resolve("test.avro").toString();
    schema = AvroUser.createSchemaWithOptionalFields1();
    GenericRecord record = AvroUser.createRecord(schema, user);
    assertTrue(schema.getField("lastUpdatedBy").schema().isNullable(),
        "Union type field should be nullable");
    assertDoesNotThrow(() -> AvroUser.writeToAvroFile(schema, record, filePath));
    File avroFile = new File(filePath);
    assertTrue(avroFile.exists());
    assertTrue(avroFile.length() > 0);
}

This second test follows the same pattern, this time using the SchemaBuilder schema with the longer configuration for the nullable field. It also asserts that the field’s schema reports itself as nullable.

Finally, let’s test the second SchemaBuilder version, which has the shorter null field configuration:

@Test
void givenSchemaBuilderWithOptionalFields2_whenCreatingSchema_thenSupportsNull(@TempDir Path tempDir) {
    user.setEmail(null);
    String filePath = tempDir.resolve("test.avro").toString();
    schema = AvroUser.createSchemaWithOptionalFields2();
    GenericRecord record = AvroUser.createRecord(schema, user);
    assertTrue(schema.getField("email").schema().isNullable(),
        "Union type field should be nullable");
    assertDoesNotThrow(() -> AvroUser.writeToAvroFile(schema, record, filePath));
    File avroFile = new File(filePath);
    assertTrue(avroFile.exists());
    assertTrue(avroFile.length() > 0);
}

6. Conclusion

In this article, we explored two main approaches to handling null values in Apache Avro. First, we saw how to define a schema in three ways. Then, we implemented the @Nullable annotation directly on the class property.

Both methods are valid. However, the schema approach offers more granularity and is generally preferred for production systems.

As always, the code is available over on GitHub.

The post Storing Null Values in Avro Files first appeared on Baeldung.
