Channel: Baeldung

Storing Null Values in Avro Files


1. Introduction

In this tutorial, we’ll explore two ways of handling null values and writing them to a file when working with Apache Avro in Java. These approaches will also allow us to discuss best practices for handling nullable fields.

2. The Problem With Null Values in Avro

Apache Avro is a data serialization framework that provides rich data structures and a compact, fast, binary data format. However, the use of null values in Avro requires special attention.

Let’s go over a common scenario where we might encounter issues:

GenericRecord record = new GenericData.Record(schema);
record.put("email", null);
// This might throw NullPointerException when writing to file

By default, Avro fields aren’t nullable. Attempting to store a null value in such a field results in a NullPointerException during serialization.

Before we take a look at the first solution, let’s set up our project with the correct dependency:

<dependency>
    <groupId>org.apache.avro</groupId>
    <artifactId>avro</artifactId>
    <version>1.12.0</version>
</dependency>
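
With the dependency in place, the failure described above can be reproduced in isolation. The following is a minimal sketch, not the article's code: the class name NonNullableWriteDemo and the single-field schema are assumptions made for illustration, and it writes to an in-memory BinaryEncoder rather than to a file. The exact exception type and message may vary between Avro versions, so the sketch only reports whether the write failed.

```java
import java.io.ByteArrayOutputStream;

import org.apache.avro.Schema;
import org.apache.avro.SchemaBuilder;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.BinaryEncoder;
import org.apache.avro.io.EncoderFactory;

// Hypothetical demo class, not part of the article's code base.
final class NonNullableWriteDemo {

    // Returns true if serializing a null into a required string field fails.
    static boolean nullInRequiredFieldFails() {
        Schema schema = SchemaBuilder.record("User")
          .fields()
          .requiredString("email")
          .endRecord();

        GenericRecord record = new GenericData.Record(schema);
        record.put("email", null); // put() stores the value without validating it

        ByteArrayOutputStream out = new ByteArrayOutputStream();
        BinaryEncoder encoder = EncoderFactory.get().binaryEncoder(out, null);
        try {
            new GenericDatumWriter<GenericRecord>(schema).write(record, encoder);
            encoder.flush();
            return false; // the write unexpectedly succeeded
        } catch (Exception e) {
            // Typically a NullPointerException raised while encoding the field
            return true;
        }
    }
}
```

Note that the problem only surfaces at write time: building the record and calling put() with null succeeds.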

3. Solutions for Handling Null Values

In this section, we’ll explore two main approaches for handling null values in Avro: schema definition and annotation-based.

3.1. Defining a Schema in Three Possible Ways

We can define an Avro schema with acceptable null values in three ways. First, let’s look at the JSON string approach:

private static final String SCHEMA_JSON = """
    {
        "type": "record",
        "name": "User",
        "namespace": "com.baeldung.apache.avro.storingnullvaluesinavrofile",
        "fields": [
            {"name": "id", "type": "long"},
            {"name": "name", "type": "string"},
            {"name": "active", "type": "boolean"},
            {"name": "lastUpdatedBy", "type": ["null", "string"], "default": null},
            {"name": "email", "type": "string"}
        ]
    }""";
public static Schema createSchemaFromJson() {
    return new Schema.Parser().parse(SCHEMA_JSON);
}

Here, we defined a nullable field, lastUpdatedBy, using the union type syntax: ["null", "string"]. The email field remains a plain, non-nullable string.
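
To confirm which fields of a parsed schema accept null, we can ask each field's schema directly. This is a small sketch under assumptions: the class name NullableFieldInspection is hypothetical, and the JSON is a trimmed-down version of the schema above that keeps only the two string fields.

```java
import org.apache.avro.Schema;

// Hypothetical helper, shown only to illustrate inspecting a parsed schema.
final class NullableFieldInspection {

    // Trimmed-down variant of the article's schema (assumption: two fields only)
    static final String SCHEMA_JSON =
        "{\"type\": \"record\", \"name\": \"User\", \"fields\": ["
      + "{\"name\": \"lastUpdatedBy\", \"type\": [\"null\", \"string\"], \"default\": null},"
      + "{\"name\": \"email\", \"type\": \"string\"}"
      + "]}";

    // True when the field's schema is a union that contains the null type.
    static boolean isNullable(String fieldName) {
        Schema schema = new Schema.Parser().parse(SCHEMA_JSON);
        return schema.getField(fieldName).schema().isNullable();
    }
}
```

With this schema, isNullable("lastUpdatedBy") returns true, while isNullable("email") returns false.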

Next, we’ll use the SchemaBuilder approach for a more programmatic way to define our schema:

public static Schema createSchemaWithOptionalFields1() {
    return SchemaBuilder
      .record("User")
      .namespace("com.baeldung.apache.avro.storingnullvaluesinavrofile")
      .fields()
      .requiredLong("id")
      .requiredString("name")
      .requiredBoolean("active")
      .name("lastUpdatedBy")
      .type() // Start of configuration
      .unionOf()
      .nullType()
      .and()
      .stringType()
      .endUnion()
      .nullDefault() // End of configuration
      .requiredString("email")
      .endRecord();
}

In this example, we’re using SchemaBuilder to create a schema where the lastUpdatedBy field can be either null or a string value.

Finally, let’s create another schema, similar to the one above but with a different approach:

public static Schema createSchemaWithOptionalFields2() {
    return SchemaBuilder
      .record("User")
      .namespace("com.baeldung.apache.avro.storingnullvaluesinavrofile")
      .fields()
      .requiredLong("id")
      .requiredString("name")
      .requiredBoolean("active")
      .requiredString("lastUpdatedBy")
      .optionalString("email")  // Using optional field
      .endRecord();
}

Instead of using the type().unionOf().nullType().and().stringType().endUnion().nullDefault() chain, we’ve used optionalString(), this time for the email field.

Let’s quickly compare the last two ways of defining a schema, since they’re very similar.

The longer version gives us more control over the union, for example over its member types and their order. The shorter version is syntactic sugar offered by SchemaBuilder for a union of null and string with a null default. In essence, they do the same thing.
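
One way to check that claim is to build the same field both ways and compare the resulting schemas. The sketch below is illustrative only: the class BuilderEquivalenceCheck and its single-field record are assumptions, not part of the article's code.

```java
import org.apache.avro.Schema;
import org.apache.avro.SchemaBuilder;

// Hypothetical check that both builder styles yield the same schema.
final class BuilderEquivalenceCheck {

    // Explicit union of null and string with a null default
    static Schema longForm() {
        return SchemaBuilder.record("User")
          .fields()
          .name("email")
          .type()
          .unionOf()
          .nullType()
          .and()
          .stringType()
          .endUnion()
          .nullDefault()
          .endRecord();
    }

    // Shortcut offered by SchemaBuilder
    static Schema shortForm() {
        return SchemaBuilder.record("User")
          .fields()
          .optionalString("email")
          .endRecord();
    }

    static boolean equivalent() {
        return longForm().equals(shortForm());
    }
}
```

Comparing with Schema.equals() covers the field name, its union type, and its default value, so a true result means the two definitions are interchangeable for this field.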

3.2. Using @Nullable Annotation

The next approach uses Avro’s built-in @Nullable annotation:

public class AvroUser {
    private long id;
    private String name;
    @Nullable
    private Boolean active;  
    private String lastUpdatedBy;  
    private String email; 
    // rest of code
}

This annotation tells Avro’s reflection-based schema generation that the field can accept null values.
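
To see the effect of the annotation, we can let Avro derive a schema from a class through ReflectData and inspect the result. This is a hedged sketch: ReflectUser is a hypothetical, trimmed-down stand-in for the AvroUser class above, and ReflectNullableSketch is an illustrative helper.

```java
import org.apache.avro.Schema;
import org.apache.avro.reflect.Nullable;
import org.apache.avro.reflect.ReflectData;

// Hypothetical stand-in for AvroUser with only three fields.
class ReflectUser {
    long id;
    String name;
    @Nullable
    Boolean active;
}

// Hypothetical helper that derives the schema via reflection.
final class ReflectNullableSketch {

    static Schema schema() {
        return ReflectData.get().getSchema(ReflectUser.class);
    }

    // True when the reflected field schema is a union that contains null.
    static boolean isNullable(String fieldName) {
        return schema().getField(fieldName).schema().isNullable();
    }
}
```

Here, only the annotated active field is reported as nullable; the unannotated name field is not.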

4. Implementation of Writing To File

Now, let’s look at how we’ll serialize the Record that contains the null value:

public static void writeToAvroFile(Schema schema, GenericRecord record, String filePath) throws IOException {
    DatumWriter<GenericRecord> datumWriter = new GenericDatumWriter<>(schema);
    try (DataFileWriter<GenericRecord> dataFileWriter = new DataFileWriter<>(datumWriter)) {
        dataFileWriter.create(schema, new File(filePath));
        dataFileWriter.append(record);
    }
}

We start by initializing a GenericDatumWriter, the DatumWriter implementation that works with GenericRecord objects. We pass the schema as a constructor argument so that it knows how to serialize the data.

Then, we initialize a DataFileWriter, the class that handles the actual writing of records to the Avro file. It also handles the file’s metadata and compression.

Then, using the create() method, we create the Avro file with the specified schema. This call also writes the file header, which contains the schema and other metadata.

Finally, we write the actual record in the file. If the record contains null values in fields marked with @Nullable or of the union type, these will be serialized correctly.
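
It can help to see what "serialized correctly" means at the byte level for a ["null", "string"] union. In Avro's binary encoding, a union value is written as the zero-based index of the chosen branch (a zigzag varint), followed by the value encoded for that branch; null itself occupies zero bytes. The sketch below is a hand-rolled illustration using only the JDK, not Avro's own encoder, and the class name is an assumption. It covers just the field encoding and ignores the file header and block structure that DataFileWriter adds.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical illustration of how a ["null", "string"] value is encoded.
final class NullableStringEncodingSketch {

    // Writes a number as a zigzag-encoded variable-length integer.
    static void writeLong(long n, ByteArrayOutputStream out) {
        long z = (n << 1) ^ (n >> 63);
        while ((z & ~0x7FL) != 0) {
            out.write((int) ((z & 0x7F) | 0x80));
            z >>>= 7;
        }
        out.write((int) z);
    }

    static byte[] encode(String value) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        if (value == null) {
            writeLong(0, out); // branch 0 is "null"; the null value adds no bytes
        } else {
            writeLong(1, out); // branch 1 is "string"
            byte[] utf8 = value.getBytes(StandardCharsets.UTF_8);
            writeLong(utf8.length, out); // string length in bytes
            out.write(utf8, 0, utf8.length);
        }
        return out.toByteArray();
    }
}
```

With this sketch, encode(null) produces the single byte 0x00, and encode("a") produces 0x02 0x02 0x61: the branch index, the length, and the UTF-8 byte of the character.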

5. Testing Our Solution

Now, let’s check that our implementations work correctly:

@Test
void whenSerializingUserWithNullPropFromStringSchema_thenSuccess(@TempDir Path tempDir) {
    user.setLastUpdatedBy(null);
    // Use the JSON-defined schema for both the record and the writer
    schema = AvroUser.createSchemaFromJson();
    String filePath = tempDir.resolve("test.avro").toString();
    GenericRecord record = AvroUser.createRecord(schema, user);
    assertDoesNotThrow(() -> AvroUser.writeToAvroFile(schema, record, filePath));
    File avroFile = new File(filePath);
    assertTrue(avroFile.exists());
    assertTrue(avroFile.length() > 0);
}

In this test, we first set the lastUpdatedBy field to null. Then, we create the schema from the JSON string declared at the beginning.

As we can see from the tests, the record is successfully serialized with a null value:

@Test
void givenSchemaBuilderWithOptionalFields1_whenCreatingSchema_thenSupportsNull(@TempDir Path tempDir) {
    user.setLastUpdatedBy(null);
    String filePath = tempDir.resolve("test.avro").toString();
    schema = AvroUser.createSchemaWithOptionalFields1();
    GenericRecord record = AvroUser.createRecord(schema, user);
    assertTrue(schema.getField("lastUpdatedBy").schema().isNullable(),
        "Union type field should be nullable");
    assertDoesNotThrow(() -> AvroUser.writeToAvroFile(schema, record, filePath));
    File avroFile = new File(filePath);
    assertTrue(avroFile.exists());
    assertTrue(avroFile.length() > 0);
}

The second test above follows the same pattern, this time using the SchemaBuilder schema with the longer configuration for the nullable field.

Finally, we test the second SchemaBuilder version, which has the shorter nullable field configuration:

@Test
void givenSchemaBuilderWithOptionalFields2_whenCreatingSchema_thenSupportsNull(@TempDir Path tempDir) {
    user.setEmail(null);
    String filePath = tempDir.resolve("test.avro").toString();
    schema = AvroUser.createSchemaWithOptionalFields2();
    GenericRecord record = AvroUser.createRecord(schema, user);
    assertTrue(schema.getField("email").schema().isNullable(),
        "Union type field should be nullable");
    assertDoesNotThrow(() -> AvroUser.writeToAvroFile(schema, record, filePath));
    File avroFile = new File(filePath);
    assertTrue(avroFile.exists());
    assertTrue(avroFile.length() > 0);
}
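
The tests above assert that a non-empty file is produced. A stricter check could read the file back and confirm that the null survived the round trip. The following is a sketch under assumptions: NullRoundTripSketch and its two-field schema are hypothetical, and it writes to a temporary file rather than using the article's AvroUser helpers.

```java
import java.io.File;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;

import org.apache.avro.Schema;
import org.apache.avro.SchemaBuilder;
import org.apache.avro.file.DataFileReader;
import org.apache.avro.file.DataFileWriter;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;

// Hypothetical round-trip check, not part of the article's code base.
final class NullRoundTripSketch {

    static GenericRecord writeAndReadBack() {
        Schema schema = SchemaBuilder.record("User")
          .fields()
          .requiredLong("id")
          .optionalString("email")
          .endRecord();

        GenericRecord record = new GenericData.Record(schema);
        record.put("id", 1L);
        record.put("email", null);

        try {
            File file = Files.createTempFile("users", ".avro").toFile();
            file.deleteOnExit();

            // Write a single record, as in writeToAvroFile()
            try (DataFileWriter<GenericRecord> writer =
                   new DataFileWriter<>(new GenericDatumWriter<GenericRecord>(schema))) {
                writer.create(schema, file);
                writer.append(record);
            }

            // Read it back; the reader picks up the schema from the file header
            try (DataFileReader<GenericRecord> reader =
                   new DataFileReader<>(file, new GenericDatumReader<GenericRecord>())) {
                return reader.next();
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

After the round trip, calling get("email") on the returned record yields null, and get("id") yields the original long value.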

6. Conclusion

In this article, we explored two main approaches to handling null values in Apache Avro. First, we saw how to define a schema with nullable fields in three ways. Then, we applied the @Nullable annotation directly to a class field.

Both methods are valid. However, the schema approach offers more granularity and is generally preferred for production systems.

As always, the code is available over on GitHub.

The post Storing Null Values in Avro Files first appeared on Baeldung.
       
