avro tricks and pitfalls
Use Avro reflection to serialize/deserialize an object (as of version 1.8.1):
Schema schema = ReflectData.AllowNull.get().getSchema(obj.getClass());
byte[] arr = null;
final DatumWriter<Object> writer = new ReflectDatumWriter<>(schema);
final ByteArrayOutputStream out = new ByteArrayOutputStream(10 * 1024);
final BinaryEncoder encoder = EncoderFactory.get().binaryEncoder(out, null);
writer.write(obj, encoder);
encoder.flush();
arr = out.toByteArray();
Schema schema = ReflectData.AllowNull.get().getSchema(targetClass);
final DatumReader<Object> reader = new ReflectDatumReader<>(schema);
final Decoder decoder = DecoderFactory.get().binaryDecoder(arr, null);
Object readObj = reader.read(null, decoder);
By default, ReflectData.get().getSchema() cannot handle null values for attributes that are objects or collections of objects: a NullPointerException will be thrown during writing. Note that ReflectDatumWriter uses reflection on fields directly. Use ReflectData.AllowNull.get() instead, which wraps each field schema in a union with null.
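The difference is visible by comparing the two schemas for a small POJO. A minimal sketch, assuming Avro 1.8+ on the classpath; `Item` is a hypothetical class used only for illustration:

```java
import org.apache.avro.Schema;
import org.apache.avro.reflect.ReflectData;

public class AllowNullDemo {
    // Hypothetical POJO; "name" may legitimately be null at runtime.
    public static class Item {
        String name;
        int count;
    }

    public static void main(String[] args) {
        // Default ReflectData maps String to a plain "string" schema,
        // so writing an Item whose name is null throws NullPointerException.
        Schema strict = ReflectData.get().getSchema(Item.class);
        // AllowNull wraps each reference-typed field in a union with "null"
        // (primitive fields like "count" stay non-nullable).
        Schema lenient = ReflectData.AllowNull.get().getSchema(Item.class);
        System.out.println(strict.getField("name").schema().getType());   // STRING
        System.out.println(lenient.getField("name").schema().getType());  // UNION
    }
}
```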
By default, ReflectDatumWriter does not handle cyclic object graphs, i.e. class A contains an attribute of class B and class B contains an attribute of class A. A StackOverflowError will be thrown.
See: https://issues.apache.org/jira/browse/AVRO-695
For collections, List and Map are fully supported. However, a Set attribute is only partially supported with reflection: you need to explicitly declare a concrete Set type in the class field declaration, because deserialization instantiates the declared type via its no-arg constructor and the Set interface has none.
Eg.
private Set<String> components
Error: java.lang.RuntimeException: java.lang.NoSuchMethodException: java.util.Set.<init>()
    at org.apache.avro.specific.SpecificData.newInstance(SpecificData.java:344)
    at org.apache.avro.reflect.ReflectDatumReader.newArray(ReflectDatumReader.java:100)
    at org.apache.avro.reflect.ReflectDatumReader.readArray(ReflectDatumReader.java:133)
private HashSet<String> components
Works fine
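A minimal round trip illustrating the workaround. A sketch, assuming Avro 1.8+; `Holder` is a hypothetical class:

```java
import java.io.ByteArrayOutputStream;
import java.util.HashSet;
import org.apache.avro.Schema;
import org.apache.avro.io.BinaryDecoder;
import org.apache.avro.io.BinaryEncoder;
import org.apache.avro.io.DecoderFactory;
import org.apache.avro.io.EncoderFactory;
import org.apache.avro.reflect.ReflectData;
import org.apache.avro.reflect.ReflectDatumReader;
import org.apache.avro.reflect.ReflectDatumWriter;

public class SetRoundTrip {
    public static class Holder {
        // Declared as HashSet (a concrete class), not Set: ReflectDatumReader
        // can instantiate it via its no-arg constructor when reading.
        HashSet<String> components = new HashSet<>();
    }

    public static Holder roundTrip(Holder in) throws Exception {
        Schema schema = ReflectData.AllowNull.get().getSchema(Holder.class);

        ByteArrayOutputStream out = new ByteArrayOutputStream();
        BinaryEncoder encoder = EncoderFactory.get().binaryEncoder(out, null);
        new ReflectDatumWriter<Holder>(schema).write(in, encoder);
        encoder.flush();

        BinaryDecoder decoder = DecoderFactory.get().binaryDecoder(out.toByteArray(), null);
        return new ReflectDatumReader<Holder>(schema).read(null, decoder);
    }

    public static void main(String[] args) throws Exception {
        Holder in = new Holder();
        in.components.add("a");
        in.components.add("b");
        System.out.println(roundTrip(in).components); // contains a and b
    }
}
```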
Some native Java types such as Date and BigDecimal were not supported until recent versions of Avro. Avro has introduced logical types, which enhance primitive types with additional information. E.g. date is a logical type backed by int, and time-micros is a logical type backed by long.
To write to a ByteArrayOutputStream, BinaryEncoder.flush() must be called after the write operation; otherwise you are likely to get an empty byte array, because the default binary encoder buffers its output.
ReflectDatumWriter accepts two kinds of constructors: one taking a Schema and one taking a Class. The former is more flexible, as you can customize the schema building yourself. ReflectData.getSchema() already uses an internal schema cache to boost performance. Profiling shows that building a schema is quite expensive, so it is worth building the schema at system start rather than inside each serialization operation.
EncoderFactory has two configuration parameters, bufferSize and blockSize. A large buffer size can improve performance when serializing large objects. Three binary encoder variants are available:
DirectBinaryEncoder: no write buffering; not recommended for writing large data.
BinaryEncoder (the buffered default returned by binaryEncoder()): buffers writes for performance; flush() is required before reading the underlying stream.
BlockingBinaryEncoder: writes arrays and maps as length-prefixed blocks, so very large collections need not be held in memory at once.
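A sketch of configuring a dedicated factory (the shared `EncoderFactory.get()` singleton cannot be reconfigured); the buffer and block sizes are illustrative values, not recommendations:

```java
import java.io.ByteArrayOutputStream;
import org.apache.avro.io.BinaryEncoder;
import org.apache.avro.io.EncoderFactory;

public class EncoderConfigDemo {
    public static void main(String[] args) throws Exception {
        EncoderFactory factory = new EncoderFactory()
                .configureBufferSize(64 * 1024)  // used by binaryEncoder()
                .configureBlockSize(8 * 1024);   // used by blockingBinaryEncoder()

        ByteArrayOutputStream out = new ByteArrayOutputStream();

        // Buffered variant: fast, but data sits in the buffer until flush().
        BinaryEncoder buffered = factory.binaryEncoder(out, null);
        buffered.writeInt(42);
        System.out.println(out.size()); // 0: nothing reached the stream yet
        buffered.flush();
        System.out.println(out.size()); // 1: zigzag varint for 42 is one byte

        // Unbuffered variant: writes through immediately.
        BinaryEncoder direct = factory.directBinaryEncoder(new ByteArrayOutputStream(), null);

        // Blocking variant: emits arrays/maps in length-prefixed blocks.
        BinaryEncoder blocking = factory.blockingBinaryEncoder(new ByteArrayOutputStream(), null);
    }
}
```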
Thread safety: Encoder and Decoder instances are not thread-safe, but DatumReader and DatumWriter are. A single DatumReader or DatumWriter instance may therefore be shared across multiple threads, as long as each thread uses its own Encoder/Decoder.
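The safe pattern is therefore one shared DatumWriter plus a per-call (or per-thread) encoder. A sketch with a hypothetical `Point` class:

```java
import java.io.ByteArrayOutputStream;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import org.apache.avro.Schema;
import org.apache.avro.io.BinaryEncoder;
import org.apache.avro.io.EncoderFactory;
import org.apache.avro.reflect.ReflectData;
import org.apache.avro.reflect.ReflectDatumWriter;

public class SharedWriterDemo {
    public static class Point { int x; int y; }

    // Schema and DatumWriter are thread-safe: build once, share everywhere.
    static final Schema SCHEMA = ReflectData.get().getSchema(Point.class);
    static final ReflectDatumWriter<Point> WRITER = new ReflectDatumWriter<>(SCHEMA);

    public static byte[] serialize(Point p) throws Exception {
        // Encoders are NOT thread-safe: create one per call (or cache per thread).
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        BinaryEncoder encoder = EncoderFactory.get().binaryEncoder(out, null);
        WRITER.write(p, encoder);
        encoder.flush();
        return out.toByteArray();
    }

    public static void main(String[] args) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(4);
        List<Future<byte[]>> results = pool.invokeAll(Collections.nCopies(8,
                (Callable<byte[]>) () -> serialize(new Point())));
        for (Future<byte[]> f : results) System.out.println(f.get().length);
        pool.shutdown();
    }
}
```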
Advanced Avro techniques such as schema reuse, inheritance, etc.:
https://www.infoq.com/articles/ApacheAvro
Customizing serialization/deserialization for a special Java class that is not natively supported by Avro (e.g. Date) requires a conversion class.
Eg
GenericData genericData = new GenericData();
genericData.addLogicalTypeConversion(new DateConversion());
DatumWriter<GenericRecord> datumWriter = new GenericDatumWriter<GenericRecord>(schema, genericData);
DatumReader<GenericRecord> datumReader = new GenericDatumReader<GenericRecord>(schema, schema, genericData);
Multiple schemas:
By default, ReflectData creates nested schemas, which are very lengthy and hard to maintain.
Avro supports multiple schema definitions in one schema file, provided that earlier type definitions in the file do not depend on later ones. E.g.
{"type" : "record",
"name" :"TestObject3",
"namespace": "de.hybris.core.network.serialization",
"fields" :[ {"name" : "components",
"type" : [ "null",{"type" : "array",
"items" : "string",
"java-class" :"java.util.HashSet"
}
],
"default" : null
},
{"name" : "parent",
"type" : [ "null", de.hybris.core.network.serialization.TestObject1],
"default" : null
}
]
},
{"type" : "record",
"name" : "TestObject1",
"namespace" : "de.hybris.core.network.serialization",
…
}
This will throw an exception when parsing the schema file, because TestObject1 is referenced before it is defined. Another major limitation of a single schema file is that only the fields of the first schema are accessible.
An alternative is to create multiple schema definition files and write a utility class to auto-expand them into nested form, as explained in https://www.infoq.com/articles/ApacheAvro
Still, cyclic schema definition dependencies are not allowed.
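If the definitions are kept in separate files, a single Schema.Parser instance can stitch them together, since it remembers named types across parse() calls. A sketch with inline schema strings standing in for the files; `Child` and `Parent` are hypothetical record names:

```java
import org.apache.avro.Schema;

public class MultiFileSchemaDemo {
    public static void main(String[] args) {
        // One parser instance for all files: types parsed earlier stay
        // registered and can be referenced by name in later files.
        Schema.Parser parser = new Schema.Parser();

        Schema child = parser.parse(
            "{\"type\":\"record\",\"name\":\"Child\"," +
            "\"fields\":[{\"name\":\"id\",\"type\":\"int\"}]}");

        // "Child" below resolves against the record parsed above; parsing
        // this schema with a fresh parser would fail with an undefined name.
        Schema parent = parser.parse(
            "{\"type\":\"record\",\"name\":\"Parent\"," +
            "\"fields\":[{\"name\":\"child\",\"type\":\"Child\"}]}");

        System.out.println(parent.getField("child").schema().getName()); // Child
    }
}
```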
Another thing worth noting: in these tests, Avro did not accept quotes around the type reference in the "items" attribute of an array type or the "values" attribute of a map type.
Performance:
Method | Serialization time | Deserialization time | Binary data size
Java serialization | 13 | 4 | 2647
Avro reflect datum serialization | 25 | 22 | 1158
Avro generic record datum serialization | 2 | 3 | 1230
As you can see, Avro has a great advantage over Java serialization in terms of data size. However, Avro reflection serialization/deserialization is even slower than Java's. Avro generic record serialization/deserialization yields the best performance, but a substantial amount of coding effort is needed, especially when the object structure is complex.