
Unions and default values in Apache Avro serialization and deserialization

The initial Avro schema (user.avsc) defines a User record with only a name field.

{
  "namespace": "com.bawi.avro.model",
  "type": "record",
  "name": "User",
  "fields": [
    {
      "name": "name",
      "type": "string"
    }
  ]
}

The Maven pom.xml declares the Avro dependency

        <dependency>
            <groupId>org.apache.avro</groupId>
            <artifactId>avro</artifactId>
            <version>1.8.1</version>
        </dependency>

so we can serialize the User data in Java to disk, into the target/user.avro file:

        Schema schema = new Schema.Parser().parse(new File("user.avsc"));
        File avroFile = new File("target/user.avro");
        GenericRecord user = new GenericData.Record(schema);
        user.put("name", "Alyssa");
        DatumWriter<GenericRecord> datumWriter = new GenericDatumWriter<>(schema);
        DataFileWriter<GenericRecord> dataFileWriter = new DataFileWriter<>(datumWriter);
        dataFileWriter.create(schema, avroFile);
        dataFileWriter.append(user);
        dataFileWriter.close();

We can read (deserialize) the User from disk using the same schema, either in Java

        DatumReader<GenericRecord> datumReader = new GenericDatumReader<>(schema);
        DataFileReader<GenericRecord> dataFileReader = new DataFileReader<>(avroFile, datumReader);
        GenericRecord user2 = null;
        while (dataFileReader.hasNext()) {
            // pass the previous record in so the reader can reuse the object
            user2 = dataFileReader.next(user2);
            System.out.println(user2);
        }
        dataFileReader.close();

or with the avro-tools jar, which Maven downloads when it is declared as a test dependency:

        <dependency>
            <groupId>org.apache.avro</groupId>
            <artifactId>avro-tools</artifactId>
            <version>1.8.1</version>
            <scope>test</scope>
        </dependency>

and running it with the tojson argument:

me@MacBook:~/dev/my-projects/my-avro$ java -jar /Users/me/.m2/repository/org/apache/avro/avro-tools/1.8.1/avro-tools-1.8.1.jar tojson users.avro 
{"name":"Alyssa"}

Next we add a new favorite_number field to the schema:

{
  "namespace": "com.bawi.avro.model",
  "type": "record",
  "name": "User",
  "fields": [
    {
      "name": "name",
      "type": "string"
    },
    {
      "name": "favorite_number",
      "type": "int"
    }
  ]
}

but we do not yet set favorite_number in the Java code.

When we try to write, we get

org.apache.avro.file.DataFileWriter$AppendWriteException: java.lang.NullPointerException: null of int in field favorite_number of com.bawi.avro.model.User

since the favorite_number field is required by the Avro schema but was never set on the record, so the writer found null where it expected an int.

Declaring the field as a union of null and int fixes the write problem (a union of int and null also works):

    {
      "name": "favorite_number",
      "type": [
        "null",
        "int"
      ]
    }

or

    {
      "name": "favorite_number",
      "type": [
        "int",
        "null"
      ]
    }
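
One way to see the difference between the plain int field and the union before the writer throws is to validate the record against the schema. The following is only a minimal sketch, assuming the same Avro dependency as above; the class name UnionValidationSketch, its helper methods and the inline schema strings are made up for illustration and are not part of the original project.

```java
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;

public class UnionValidationSketch {

    // Illustrative helper: builds the User schema with the given JSON type for favorite_number.
    static Schema userSchema(String favoriteNumberType) {
        String json = "{\"namespace\":\"com.bawi.avro.model\",\"type\":\"record\",\"name\":\"User\","
                + "\"fields\":[{\"name\":\"name\",\"type\":\"string\"},"
                + "{\"name\":\"favorite_number\",\"type\":" + favoriteNumberType + "}]}";
        // A fresh parser per call, so the User name is never defined twice in one parser.
        return new Schema.Parser().parse(json);
    }

    // Returns true if a record that sets only "name" is a valid datum for the schema.
    static boolean nameOnlyRecordIsValid(Schema schema) {
        GenericRecord user = new GenericData.Record(schema);
        user.put("name", "Alyssa"); // favorite_number is left unset, i.e. null
        return GenericData.get().validate(schema, user);
    }

    public static void main(String[] args) {
        System.out.println("int:            " + nameOnlyRecordIsValid(userSchema("\"int\"")));
        System.out.println("[null, int]:    " + nameOnlyRecordIsValid(userSchema("[\"null\",\"int\"]")));
        System.out.println("[int, null]:    " + nameOnlyRecordIsValid(userSchema("[\"int\",\"null\"]")));
    }
}
```

The plain int schema should be reported as invalid for this record, while both union orderings should be accepted, which matches the write behaviour described above.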

If the Avro file was written with a schema that includes favorite_number and the value was written as null, it will always be read back as null, regardless of what the read schema looks like. A default value only affects fields that are missing from the schema used for writing: the schema used for reading must define the field (including its default), and the schema used for writing must not define the field at all.

Let's assume a different scenario, where the write schema has only the name field (without favorite_number):

{
  "namespace": "com.bawi.avro.model",
  "type": "record",
  "name": "User",
  "fields": [
    {
      "name": "name",
      "type": "string"
    }
  ]
}

and we write only the name field to the Avro file.

Let's assume we want favorite_number to be read as -1 (say, because of a new requirement that favorite_number is always populated, so that we do not have to check for null when reading the Avro file or a Hive table on top of it). Then let's modify the read schema to include a default of -1.
user_with_default_favourite_number.avsc:

{
  "namespace": "com.bawi.avro.model",
  "type": "record",
  "name": "User",
  "fields": [
    {
      "name": "name",
      "type": "string"
    },
    {
      "name": "favorite_number",
      "type": [
        "int",
        "null"
      ],
      "default": -1
    }
  ]
}

and read the file with this schema:

File file2 = new File("user_with_default_favourite_number.avsc");
Schema schema2 = new Schema.Parser().parse(file2);
DatumReader<GenericRecord> datumReader = new GenericDatumReader<>(schema2);

and the output is:

{"name": "Alyssa", "favorite_number": -1}
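
The two cases (field missing from the write schema versus field written as null) can be compared side by side. The sketch below is an assumption-laden illustration rather than the code used above: it encodes to an in-memory byte array with the binary encoder instead of writing a data file, so the write schema is passed to the reader explicitly, and the class name ReaderDefaultSketch, the roundTrip helper and the schema constants are invented for this example.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.BinaryDecoder;
import org.apache.avro.io.BinaryEncoder;
import org.apache.avro.io.DecoderFactory;
import org.apache.avro.io.EncoderFactory;

public class ReaderDefaultSketch {

    // Common start of all three schemas: the User record with its name field.
    static final String HEAD = "{\"namespace\":\"com.bawi.avro.model\",\"type\":\"record\","
            + "\"name\":\"User\",\"fields\":[{\"name\":\"name\",\"type\":\"string\"}";

    // Write schema without favorite_number.
    static final Schema NAME_ONLY = new Schema.Parser().parse(HEAD + "]}");
    // Write schema with a nullable favorite_number.
    static final Schema NULLABLE_NUMBER = new Schema.Parser().parse(HEAD
            + ",{\"name\":\"favorite_number\",\"type\":[\"null\",\"int\"]}]}");
    // Read schema with default -1, as in user_with_default_favourite_number.avsc.
    static final Schema READER_WITH_DEFAULT = new Schema.Parser().parse(HEAD
            + ",{\"name\":\"favorite_number\",\"type\":[\"int\",\"null\"],\"default\":-1}]}");

    // Writes a record that sets only "name" with writerSchema, then reads it back with readerSchema.
    static GenericRecord roundTrip(Schema writerSchema, Schema readerSchema) {
        try {
            GenericRecord user = new GenericData.Record(writerSchema);
            user.put("name", "Alyssa");

            ByteArrayOutputStream out = new ByteArrayOutputStream();
            BinaryEncoder encoder = EncoderFactory.get().binaryEncoder(out, null);
            new GenericDatumWriter<GenericRecord>(writerSchema).write(user, encoder);
            encoder.flush();

            BinaryDecoder decoder = DecoderFactory.get().binaryDecoder(out.toByteArray(), null);
            return new GenericDatumReader<GenericRecord>(writerSchema, readerSchema).read(null, decoder);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Field absent from the write schema: the reader's default is applied.
        System.out.println(roundTrip(NAME_ONLY, READER_WITH_DEFAULT).get("favorite_number"));
        // Field present and written as null: the default is not applied.
        System.out.println(roundTrip(NULLABLE_NUMBER, READER_WITH_DEFAULT).get("favorite_number"));
    }
}
```

With these schemas the first call is expected to yield -1 and the second null, because the default is only consulted for fields that the write schema does not contain.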

If we change the read schema for favorite_number to an invalid one:

    {
      "name": "favorite_number",
      "type": [
        "null",
        "int"
      ],
      "default": -1
    }

then we get:

org.apache.avro.AvroTypeException: Non-null default value for null type: -1

so if a non-null default value is given, null must come second in the union: the default has to match the first type listed in the union.

If we want "default": null, then null must come first in the union:

    {
      "name": "favorite_number",
      "type": [
        "null",
        "int"
      ],
      "default": null
    }

since for the invalid schema:

    {
      "name": "favorite_number",
      "type": [
        "int",
        "null"
      ],
      "default": null
    }

we will get

org.apache.avro.AvroTypeException: Non-numeric default value for int: null

as described in https://avro.apache.org/docs/1.7.7/spec.html#Unions
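
The errors above only appear when a record is read. To catch a mismatched default earlier, the schema can be parsed with default validation switched on. This is a hedged sketch: it assumes that Schema.Parser#setValidateDefaults is available in the Avro version in use (as far as I know it is off by default in 1.8.x, and newer releases may validate defaults without it). The class name DefaultValidationSketch and its helpers are illustrative only.

```java
import org.apache.avro.AvroRuntimeException;
import org.apache.avro.Schema;

public class DefaultValidationSketch {

    // Illustrative helper: User schema whose favorite_number has the given union and default.
    static String userSchemaJson(String unionType, String defaultValue) {
        return "{\"namespace\":\"com.bawi.avro.model\",\"type\":\"record\",\"name\":\"User\","
                + "\"fields\":[{\"name\":\"name\",\"type\":\"string\"},"
                + "{\"name\":\"favorite_number\",\"type\":" + unionType
                + ",\"default\":" + defaultValue + "}]}";
    }

    // Returns true if the schema parses while default values are being validated.
    static boolean parsesWithValidDefault(String unionType, String defaultValue) {
        try {
            new Schema.Parser()
                    .setValidateDefaults(true) // assumption: supported by the Avro version in use
                    .parse(userSchemaJson(unionType, defaultValue));
            return true;
        } catch (AvroRuntimeException e) {
            // Covers AvroTypeException and SchemaParseException, both unchecked.
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(parsesWithValidDefault("[\"int\",\"null\"]", "-1"));   // default matches first type
        System.out.println(parsesWithValidDefault("[\"null\",\"int\"]", "-1"));   // default does not match null
        System.out.println(parsesWithValidDefault("[\"null\",\"int\"]", "null")); // default matches first type
        System.out.println(parsesWithValidDefault("[\"int\",\"null\"]", "null")); // default does not match int
    }
}
```

Only the combinations where the default matches the first union type should parse, which mirrors the two valid and two invalid schemas shown above.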