Avro tricks and pitfalls

Use Avro reflection to serialize/deserialize objects (as of version 1.8.1):

Schema schema = ReflectData.AllowNull.get().getSchema(obj.getClass());

final DatumWriter<Object> writer = new ReflectDatumWriter<>(schema);

final ByteArrayOutputStream out = new ByteArrayOutputStream(10 * 1024);

final BinaryEncoder encoder = EncoderFactory.get().binaryEncoder(out, null);

writer.write(obj, encoder);

encoder.flush();

byte[] arr = out.toByteArray();

 

Schema schema = ReflectData.AllowNull.get().getSchema(targetClass);

final DatumReader<Object> reader = new ReflectDatumReader<>(schema);

final Decoder decoder = DecoderFactory.get().binaryDecoder(arr, null);

Object readObj = reader.read(null, decoder);

 

By default, ReflectData.get().getSchema() cannot handle null values for attributes of reference types (objects or collections of objects); a NullPointerException will be thrown. Note that ReflectDatumWriter uses reflection on fields directly. Use ReflectData.AllowNull.get() instead, which wraps such fields in a union with null.
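A minimal sketch of the difference, assuming Avro 1.8 on the classpath; the Item class and its fields are invented for illustration:

```java
import org.apache.avro.Schema;
import org.apache.avro.reflect.ReflectData;

public class AllowNullDemo {
    static class Item {
        String name;       // may be null at runtime
        Integer quantity;  // boxed type, may also be null
    }

    public static void main(String[] args) {
        // ReflectData.get() maps String to a plain, non-nullable "string"
        // schema, so writing an Item whose name is null fails with an NPE.
        Schema strict = ReflectData.get().getSchema(Item.class);

        // ReflectData.AllowNull.get() wraps every reference-typed field in
        // a union ["null", ...], so null values serialize cleanly.
        Schema nullable = ReflectData.AllowNull.get().getSchema(Item.class);

        System.out.println(strict.getField("name").schema().getType());
        System.out.println(nullable.getField("name").schema().getType());
    }
}
```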

 

By default, ReflectDatumWriter does not handle cyclic object graphs, i.e. class A contains an attribute of class B and class B contains an attribute of class A. A StackOverflowError will be thrown.

See: https://issues.apache.org/jira/browse/AVRO-695

 

For collections, List and Map are fully supported. However, Set attributes are only partially supported with ReflectDatumWriter: you need to declare a concrete Set type in the class field declaration.

E.g.

private Set<String> components

Error: java.lang.RuntimeException: java.lang.NoSuchMethodException: java.util.Set.<init>()

at org.apache.avro.specific.SpecificData.newInstance(SpecificData.java:344)

at org.apache.avro.reflect.ReflectDatumReader.newArray(ReflectDatumReader.java:100)

at org.apache.avro.reflect.ReflectDatumReader.readArray(ReflectDatumReader.java:133)

 

private HashSet<String> components

works fine.

 

Some common Java types like Date, BigDecimal, etc. were not supported until recent versions of Avro. Avro has introduced logical types, which enhance primitive types with additional information. E.g. date is a logical type backed by int, and time-micros is a logical type backed by long.
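A small sketch of the LogicalTypes API introduced in Avro 1.8:

```java
import org.apache.avro.LogicalTypes;
import org.apache.avro.Schema;

public class LogicalTypeDemo {
    public static void main(String[] args) {
        // A date is an int (days since epoch) annotated with the
        // "date" logical type.
        Schema dateSchema = LogicalTypes.date()
                .addToSchema(Schema.create(Schema.Type.INT));
        System.out.println(dateSchema);
        // {"type":"int","logicalType":"date"}

        // time-micros is a long annotated with "time-micros".
        Schema timeMicros = LogicalTypes.timeMicros()
                .addToSchema(Schema.create(Schema.Type.LONG));
        System.out.println(timeMicros);
    }
}
```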

 

To write to a ByteArrayOutputStream, BinaryEncoder.flush() must be called after the write operation is performed; otherwise you are likely to get an empty byte array.

 

ReflectDatumWriter accepts two kinds of constructors: one taking a schema and one taking a class. The former is more flexible, as you can customize the schema building yourself. ReflectData.getSchema() already uses an internal schema cache to boost performance. From analysis, building a schema is quite expensive, so it is worth building schemas at system start rather than inside the serialization operation.
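One way that startup-time schema building could look; SchemaRegistry is a made-up helper class for this sketch, not an Avro API:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import org.apache.avro.Schema;
import org.apache.avro.reflect.ReflectData;

// Precompute reflection schemas once, instead of per serialization call.
public final class SchemaRegistry {
    private static final Map<Class<?>, Schema> CACHE = new ConcurrentHashMap<>();

    public static Schema schemaFor(Class<?> type) {
        // ReflectData keeps its own internal cache too, but resolving the
        // schema eagerly moves the one-time cost out of the request path.
        return CACHE.computeIfAbsent(type,
                t -> ReflectData.AllowNull.get().getSchema(t));
    }

    private SchemaRegistry() {}
}
```

Classes expected to be serialized can be passed through schemaFor() during application startup, so serialization calls only pay for the map lookup.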

 

EncoderFactory has two configuration parameters: bufferSize and blockSize. A large buffer size can improve performance when serializing large objects.

DirectBinaryEncoder: no write buffering; not recommended for writing large data

BinaryEncoder (from EncoderFactory.binaryEncoder): buffered writes; the usual choice

BlockingBinaryEncoder: writes arrays and maps in length-prefixed blocks (controlled by blockSize), so readers can skip data without holding a whole collection in memory
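The buffering difference can be sketched like this (buffer and block sizes are illustrative values):

```java
import java.io.ByteArrayOutputStream;
import org.apache.avro.io.BinaryEncoder;
import org.apache.avro.io.EncoderFactory;

public class EncoderChoices {
    public static void main(String[] args) throws Exception {
        EncoderFactory factory = new EncoderFactory()
                .configureBufferSize(64 * 1024)   // used by binaryEncoder
                .configureBlockSize(8 * 1024);    // used by blockingBinaryEncoder

        ByteArrayOutputStream out = new ByteArrayOutputStream();

        // Buffered encoder: bytes stay in the buffer until flush().
        BinaryEncoder buffered = factory.binaryEncoder(out, null);
        buffered.writeInt(1);
        System.out.println(out.size()); // 0 -- still buffered
        buffered.flush();
        System.out.println(out.size()); // 1

        // Unbuffered encoder: each write goes straight to the stream.
        BinaryEncoder direct = factory.directBinaryEncoder(out, null);
        direct.writeInt(1);
        System.out.println(out.size()); // 2
    }
}
```

This is also why a missing flush() yields an empty byte array with the default (buffered) encoder, as noted above.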

 

Thread safety: DatumReader and DatumWriter instances may be shared across multiple threads, but Encoder and Decoder instances are not thread-safe.
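A sketch of how that division of labor might look; SharedWriter is an invented helper, not part of Avro:

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import org.apache.avro.Schema;
import org.apache.avro.io.BinaryEncoder;
import org.apache.avro.io.EncoderFactory;
import org.apache.avro.reflect.ReflectData;
import org.apache.avro.reflect.ReflectDatumWriter;

public class SharedWriter<T> {
    // The DatumWriter is safe to share across threads...
    private final ReflectDatumWriter<T> writer;

    public SharedWriter(Class<T> type) {
        Schema schema = ReflectData.AllowNull.get().getSchema(type);
        this.writer = new ReflectDatumWriter<>(schema);
    }

    public byte[] serialize(T value) throws IOException {
        // ...but each call builds its own stream and encoder, because
        // BinaryEncoder instances are not thread-safe.
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        BinaryEncoder encoder = EncoderFactory.get().binaryEncoder(out, null);
        writer.write(value, encoder);
        encoder.flush();
        return out.toByteArray();
    }
}
```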

 

Advanced Avro techniques such as schema reuse, inheritance, etc.:

https://www.infoq.com/articles/ApacheAvro

 

Customizing serialization/deserialization for a Java class that is not natively supported by Avro (e.g. Date) requires a conversion class. E.g.

GenericData genericData = new GenericData();

genericData.addLogicalTypeConversion(new DateConversion());

DatumWriter<GenericRecord> datumWriter = new GenericDatumWriter<GenericRecord>(schema, genericData);

DatumReader<GenericRecord> datumReader = new GenericDatumReader<GenericRecord>(schema, schema, genericData);

 

Multiple schemas:

By default, ReflectData creates nested schemas, which are very lengthy and hard to maintain.

 

Avro supports multiple schema definitions in one schema file, provided that earlier type definitions in the file do not depend on later ones. E.g.

{"type" : "record",
 "name" : "TestObject3",
 "namespace" : "de.hybris.core.network.serialization",
 "fields" : [ {"name" : "components",
               "type" : [ "null", {"type" : "array",
                                   "items" : "string",
                                   "java-class" : "java.util.HashSet"} ],
               "default" : null},
              {"name" : "parent",
               "type" : [ "null", "de.hybris.core.network.serialization.TestObject1" ],
               "default" : null}
            ]
},
{"type" : "record",
 "name" : "TestObject1",
 "namespace" : "de.hybris.core.network.serialization",
 …
}

This will throw an exception when parsing the schema file, because TestObject3 references TestObject1 before it is defined. Another major limitation with a single schema file is that only the fields of the first schema are accessible.

An alternative is to create multiple schema definition files and write a utility class to auto-expand them into nested form, as explained in https://www.infoq.com/articles/ApacheAvro
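The same ordering rule applies when parsing schemas programmatically: a single Schema.Parser remembers every named type it has parsed, so a dependency parsed first can be referenced by name later. A sketch with made-up field contents:

```java
import org.apache.avro.Schema;

public class ParserOrderingDemo {
    public static void main(String[] args) {
        Schema.Parser parser = new Schema.Parser();

        // Parse the dependency first; the parser now knows TestObject1.
        parser.parse(
            "{\"type\":\"record\",\"name\":\"TestObject1\"," +
            "\"namespace\":\"de.hybris.core.network.serialization\"," +
            "\"fields\":[{\"name\":\"id\",\"type\":\"string\"}]}");

        // A later schema can reference TestObject1 by its full name.
        Schema obj3 = parser.parse(
            "{\"type\":\"record\",\"name\":\"TestObject3\"," +
            "\"namespace\":\"de.hybris.core.network.serialization\"," +
            "\"fields\":[{\"name\":\"parent\",\"type\":" +
            "[\"null\",\"de.hybris.core.network.serialization.TestObject1\"]," +
            "\"default\":null}]}");

        // The name reference resolved to the full record parsed earlier.
        System.out.println(
            obj3.getField("parent").schema().getTypes().get(1).getName());
        // TestObject1
    }
}
```

Parsing the two strings in the opposite order with a fresh parser fails, which mirrors the single-file ordering restriction above.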

Still, cyclic schema definition dependencies are not allowed.

Another thing worth noting is that Avro does not allow enclosing quotes around type references in the "items" attribute of an array type or the "values" attribute of a map type.

 

Performance:


                                          Serialization time   Deserialization time   Binary form data size
Java serialization                                13                     4                    2647
Avro reflection datum serialization               25                    22                    1158
Avro generic record datum serialization            2                     3                    1230

 

As you can see, Avro has a great advantage in terms of data size over Java serialization. However, Avro reflection serialization/deserialization is even slower than Java's. Avro generic record serialization/deserialization yields the best performance, but a substantial amount of coding effort is needed, especially when the object structure is complex.
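To illustrate the coding-effort trade-off, here is a minimal generic record serialization; the Point schema and its fields are invented for this sketch:

```java
import java.io.ByteArrayOutputStream;
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.BinaryEncoder;
import org.apache.avro.io.EncoderFactory;

public class GenericRecordDemo {
    public static void main(String[] args) throws Exception {
        Schema schema = new Schema.Parser().parse(
            "{\"type\":\"record\",\"name\":\"Point\",\"fields\":[" +
            "{\"name\":\"x\",\"type\":\"int\"},{\"name\":\"y\",\"type\":\"int\"}]}");

        // Every field is populated by hand; for deeply nested objects this
        // is where the "substantial coding effort" comes in.
        GenericRecord record = new GenericData.Record(schema);
        record.put("x", 1);
        record.put("y", 2);

        ByteArrayOutputStream out = new ByteArrayOutputStream();
        GenericDatumWriter<GenericRecord> writer = new GenericDatumWriter<>(schema);
        BinaryEncoder encoder = EncoderFactory.get().binaryEncoder(out, null);
        writer.write(record, encoder);
        encoder.flush();
        System.out.println(out.toByteArray().length); // 2 -- one zigzag byte per int
    }
}
```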

 

