The initial Avro schema (user.avsc) defines a User record with a name field only.
{
  "namespace": "com.bawi.avro.model",
  "type": "record",
  "name": "User",
  "fields": [
    { "name": "name", "type": "string" }
  ]
}
The Maven pom.xml declares the Avro dependency:
<dependency>
  <groupId>org.apache.avro</groupId>
  <artifactId>avro</artifactId>
  <version>1.8.1</version>
</dependency>
so we can serialize the User data in Java to disk, into the user.avro file:
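import java.io.File;
import org.apache.avro.Schema;
import org.apache.avro.file.DataFileWriter;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.DatumWriter;

// Parse the schema and build a record that conforms to it
Schema schema = new Schema.Parser().parse(new File("user.avsc"));
File avroFile = new File("target/user.avro");
GenericRecord user = new GenericData.Record(schema);
user.put("name", "Alyssa");

// Write the schema and the record to the Avro file
DatumWriter<GenericRecord> datumWriter = new GenericDatumWriter<>(schema);
DataFileWriter<GenericRecord> dataFileWriter = new DataFileWriter<>(datumWriter);
dataFileWriter.create(schema, avroFile);
dataFileWriter.append(user);
dataFileWriter.close();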
We can read (deserialize) the User from disk using the same schema, either in Java:
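import org.apache.avro.file.DataFileReader;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.io.DatumReader;

DatumReader<GenericRecord> datumReader = new GenericDatumReader<>(schema);
DataFileReader<GenericRecord> dataFileReader = new DataFileReader<>(avroFile, datumReader);
GenericRecord user2 = null;
while (dataFileReader.hasNext()) {
    // Pass the previous record to next() so the object is reused
    user2 = dataFileReader.next(user2);
    System.out.println(user2);
}
dataFileReader.close();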
or with the avro-tools jar, which Maven downloads when it is declared as a test dependency:
<dependency>
  <groupId>org.apache.avro</groupId>
  <artifactId>avro-tools</artifactId>
  <version>1.8.1</version>
  <scope>test</scope>
</dependency>
and running it with the tojson argument:
me@MacBook:~/dev/my-projects/my-avro$ java -jar /Users/me/.m2/repository/org/apache/avro/avro-tools/1.8.1/avro-tools-1.8.1.jar tojson users.avro
{"name":"Alyssa"}
Next we add a new favorite_number field to the schema:
{
  "namespace": "com.bawi.avro.model",
  "type": "record",
  "name": "User",
  "fields": [
    { "name": "name", "type": "string" },
    { "name": "favorite_number", "type": "int" }
  ]
}
but do not yet set favorite_number in the Java code. When trying to write, we get:
org.apache.avro.file.DataFileWriter$AppendWriteException: java.lang.NullPointerException: null of int in field favorite_number of com.bawi.avro.model.User
since the favorite_number field is required by the Avro schema but was not set by the writer.
Changing the type to a union of null and int fixes the writing problem (a union of int and null also works):
{ "name": "favorite_number", "type": [ "null", "int" ] }
or
{ "name": "favorite_number", "type": [ "int", "null" ] }
If the schema of the written Avro file contains favorite_number and the field is written as null, it will always be read as null, regardless of what the read schema looks like. A default value only affects reading fields that were not defined in the schema used for writing, so that no value (not even null) was written for them. For the default to apply, only the schema used for reading should define the field (including its default); the schema used for writing should not define the field at all.
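The rule in the previous paragraph can be sketched as a small decision function. This is only a hypothetical pure-Java model (the class SchemaResolutionSketch and the method readField are made up for illustration and do not call the Avro library); it assumes the behaviour described above, where a stored value always wins and the reader's default is used only when the write schema lacks the field.

```java
// Hypothetical model of the rule described in the text; not part of the Avro API.
final class SchemaResolutionSketch {

    /**
     * Returns the value a reader would see for a single field.
     *
     * @param writerHasField   whether the write schema defines the field
     * @param writtenValue     the stored value (may be null; ignored if the writer lacks the field)
     * @param readerHasDefault whether the read schema declares a default for the field
     * @param readerDefault    the default from the read schema (may be null)
     */
    static Object readField(boolean writerHasField, Object writtenValue,
                            boolean readerHasDefault, Object readerDefault) {
        if (writerHasField) {
            // The field was written, so the stored value wins, even when it is null.
            return writtenValue;
        }
        if (readerHasDefault) {
            // The field is absent from the written data: fall back to the reader's default.
            return readerDefault;
        }
        throw new IllegalStateException(
                "field missing in written data and no default in read schema");
    }

    public static void main(String[] args) {
        // Write schema has favorite_number, written as null; read schema default is -1.
        System.out.println(readField(true, null, true, -1));   // prints "null"
        // Write schema has only name; read schema adds favorite_number with default -1.
        System.out.println(readField(false, null, true, -1));  // prints "-1"
    }
}
```

The first call corresponds to a file where favorite_number was written as null, the second to a file written with a schema that has no favorite_number field at all.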
Let's assume a different scenario where the write schema has only the name field (without favorite_number):
{
  "namespace": "com.bawi.avro.model",
  "type": "record",
  "name": "User",
  "fields": [
    { "name": "name", "type": "string" }
  ]
}
and we write only the name field into the Avro file.
Let's assume we want favorite_number to be set to -1 (say, because of a new requirement to always populate favorite_number in the Java code, since we do not want to check favorite_number for null when reading the Avro file or a Hive table on top of it). Then let's modify the read schema to include a default of -1:
user_with_default_favourite_number.avsc:
{
  "namespace": "com.bawi.avro.model",
  "type": "record",
  "name": "User",
  "fields": [
    { "name": "name", "type": "string" },
    { "name": "favorite_number", "type": [ "int", "null" ], "default": -1 }
  ]
}
with
File file2 = new File("user_with_default_favourite_number.avsc");
Schema schema2 = new Schema.Parser().parse(file2);
// Read with schema2; the write schema is taken from the Avro file itself
DatumReader<GenericRecord> datumReader = new GenericDatumReader<>(schema2);
and the output is:
{"name": "Alyssa", "favorite_number": -1}
If we change the read schema for favorite_number to the invalid:
{ "name": "favorite_number", "type": [ "null", "int" ], "default": -1 }
then we get:
org.apache.avro.AvroTypeException: Non-null default value for null type: -1
so if a non-null default value is given, null must come second in the union.
If we want "default": null, then null must come first in the union:
{ "name": "favorite_number", "type": [ "null", "int" ], "default": null }
since for the invalid:
{ "name": "favorite_number", "type": [ "int", "null" ], "default": null }
we will get
org.apache.avro.AvroTypeException: Non-numeric default value for int: null
as described in https://avro.apache.org/docs/1.7.7/spec.html#Unions
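The ordering rule can be summed up as "the default must match the first branch of the union". Below is a minimal sketch of that check in plain Java, assuming the rule as stated in the 1.7.7 spec linked above. The class UnionDefaultSketch and its methods are hypothetical names used for illustration only; Avro performs its own validation, and this sketch covers just the null, int and string types.

```java
// Hypothetical checker for illustration only; it does not use the Avro library.
final class UnionDefaultSketch {

    /** True if the value is an instance of the named type (only null, int, string are covered). */
    static boolean matches(Object value, String type) {
        switch (type) {
            case "null":   return value == null;
            case "int":    return value instanceof Integer;
            case "string": return value instanceof String;
            default:
                throw new IllegalArgumentException("type not covered by this sketch: " + type);
        }
    }

    /** A union default is accepted only if it matches the FIRST branch of the union. */
    static boolean isValidDefault(Object defaultValue, String... unionBranches) {
        if (unionBranches.length == 0) {
            throw new IllegalArgumentException("a union needs at least one branch");
        }
        return matches(defaultValue, unionBranches[0]);
    }

    public static void main(String[] args) {
        System.out.println(isValidDefault(-1, "int", "null"));    // prints "true"
        System.out.println(isValidDefault(-1, "null", "int"));    // prints "false"
        System.out.println(isValidDefault(null, "null", "int"));  // prints "true"
        System.out.println(isValidDefault(null, "int", "null"));  // prints "false"
    }
}
```

The two false results correspond to the two AvroTypeException cases shown above.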