【Hadoop】Why the Writable Interface

When studying the RPC system earlier, we saw that the client and server serialize their communication using Writable types. In fact, the Writable interface is Hadoop's built-in serialization mechanism, and the keys and values in MapReduce must all be Writable.


At first I did not quite understand why Hadoop needed to define its own serialization mechanism instead of reusing Java's built-in one. *Hadoop: The Definitive Guide* gives the following explanation:

The problem is that Java Serialization doesn't meet the criteria for a serialization format listed earlier: compact, fast, extensible, and interoperable.


As mentioned in http://blog.csdn.net/tragicjun/article/details/8897096, a serialization mechanism has two aspects: primitive type serialization and constructed type serialization. Below, we look at both to understand why Writable is more compact than Java Serialization.


Primitive type serialization

To serialize a primitive type in Java, you can use the java.io.DataOutput interface. For example, to serialize an int:

    ByteArrayOutputStream out = new ByteArrayOutputStream();
    DataOutputStream dataOut = new DataOutputStream(out);
    dataOut.writeInt(163);
    dataOut.close();
Here out.toByteArray().length is 4. Now let's look at how Hadoop's IntWritable encodes itself:

  public void write(DataOutput out) throws IOException {
    out.writeInt(value);
  }
Clearly, it too simply uses DataOutput. So for primitive types there is no difference between Writable and Java serialization; in fact, the former is built directly on the latter.

Constructed type serialization

To serialize a constructed type in Java, you can use the java.io.ObjectOutput interface. For example, to serialize a CustomObject:

    ByteArrayOutputStream out = new ByteArrayOutputStream();
    ObjectOutputStream objectOut = new ObjectOutputStream(out);
    objectOut.writeObject(customObject);
    objectOut.close();
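To make the size difference concrete, here is a minimal runnable sketch (plain JDK, no Hadoop needed) that serializes a hypothetical one-field class with ObjectOutputStream and compares it to writing the same field raw with DataOutputStream. The Point class and helper method names are illustrative, not from the original post:

```java
import java.io.*;

public class Main {
    // Hypothetical example class: a single int field, marked Serializable
    static class Point implements Serializable {
        int x = 163;
    }

    // Full Java serialization: class name, signature, headers + the data
    static int javaSerializedSize(Object o) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        ObjectOutputStream objectOut = new ObjectOutputStream(out);
        objectOut.writeObject(o);
        objectOut.close();
        return out.toByteArray().length;
    }

    // Raw DataOutput: just the 4 bytes of the int itself
    static int dataOutputSize(Point p) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        DataOutputStream dataOut = new DataOutputStream(out);
        dataOut.writeInt(p.x);
        dataOut.close();
        return out.toByteArray().length;
    }

    public static void main(String[] args) throws IOException {
        System.out.println("ObjectOutputStream: " + javaSerializedSize(new Point()) + " bytes");
        System.out.println("DataOutputStream:   " + dataOutputSize(new Point()) + " bytes");
    }
}
```

The raw encoding is exactly 4 bytes, while the ObjectOutputStream encoding is several times larger because it also records the class name, signature, and stream headers.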

However, the Writable interface only involves DataOutput, never ObjectOutput:

public interface Writable {
  void write(DataOutput out) throws IOException;
  void readFields(DataInput in) throws IOException;
}
This is exactly where the difference lies. First, look at the description in the ObjectOutputStream javadoc:

The class of each serializable object is encoded including the class name and signature of the class, the values of the object's fields and arrays, and the closure of any other objects referenced from the initial objects.

Primitive data, excluding serializable fields and externalizable data, is written to the ObjectOutputStream in block-data records. A block data record is composed of a header and data. The block data header consists of a marker and the number of bytes to follow the header. Consecutive primitive data writes are merged into one block-data record.


The extra overhead of ObjectOutput when serializing a class object can be understood from two angles: first, it includes the class name and class signature; second, the primitive data contained in the object is wrapped with additional header information such as a marker and a byte count.
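The second overhead, the block-data header, can be observed with the JDK alone. The following sketch writes the same int both ways; by my reading of the serialization stream format, the ObjectOutputStream result is the 4-byte stream header, plus a 2-byte block-data header (marker and length), plus the 4 data bytes:

```java
import java.io.*;

public class Main {
    // Size of an int written through the raw DataOutput path: 4 bytes
    static int dataOutputSize() throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        DataOutputStream dataOut = new DataOutputStream(out);
        dataOut.writeInt(163);
        dataOut.close();
        return out.toByteArray().length;
    }

    // Size of the same int written through ObjectOutputStream:
    // 4-byte stream header + marker byte + length byte + the 4 data bytes
    static int objectOutputSize() throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        ObjectOutputStream objectOut = new ObjectOutputStream(out);
        objectOut.writeInt(163);
        objectOut.close();
        return out.toByteArray().length;
    }

    public static void main(String[] args) throws IOException {
        System.out.println("DataOutputStream:   " + dataOutputSize() + " bytes");
        System.out.println("ObjectOutputStream: " + objectOutputSize() + " bytes");
    }
}
```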


By contrast, when Writable serializes a class object, it serializes the contained primitive data directly with DataOutput, with no extra informational overhead. What Writable sacrifices is generality: ObjectOutput and ObjectInput are general-purpose interfaces that can encode and decode any Java object implementing Serializable (Serializable is just a marker interface requiring no methods), whereas Writable requires each object to implement its own encoding and decoding, so every Writable object must override the write and readFields methods. For Hadoop, the serialized size of Writables affects the performance of the whole system, so this sacrifice is well worth it.
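Here is a sketch of what implementing such a Writable looks like. To keep it runnable without Hadoop on the classpath, it declares a local stand-in for the Writable interface with the same two methods; PointWritable and its fields are hypothetical examples:

```java
import java.io.*;

public class Main {
    // Local stand-in mirroring Hadoop's Writable interface,
    // so this sketch runs without Hadoop on the classpath
    interface Writable {
        void write(DataOutput out) throws IOException;
        void readFields(DataInput in) throws IOException;
    }

    // Hypothetical example: a 2-D point serialized as exactly two ints
    static class PointWritable implements Writable {
        int x, y;
        PointWritable() {}
        PointWritable(int x, int y) { this.x = x; this.y = y; }
        public void write(DataOutput out) throws IOException {
            out.writeInt(x);   // no class name, no signature, no block headers
            out.writeInt(y);
        }
        public void readFields(DataInput in) throws IOException {
            x = in.readInt();  // fields must be read back in the same order
            y = in.readInt();
        }
    }

    static byte[] serialize(Writable w) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        DataOutputStream dataOut = new DataOutputStream(out);
        w.write(dataOut);
        dataOut.close();
        return out.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        byte[] bytes = serialize(new PointWritable(3, 4));
        System.out.println(bytes.length + " bytes");  // two ints, nothing else

        PointWritable back = new PointWritable();
        back.readFields(new DataInputStream(new ByteArrayInputStream(bytes)));
        System.out.println(back.x + ", " + back.y);
    }
}
```

Note the burden this shifts onto the implementer: write and readFields must agree on field order, which is exactly the generality that ObjectOutput's self-describing format would otherwise provide.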


Composite Pattern

A final question: since Writable's primitive type serialization is essentially implemented with Java serialization, why not use int and long directly instead of wrapping them in IntWritable and LongWritable? The answer is that this is a composite design pattern, whose benefit is to "let clients treat individual objects and compositions of objects uniformly." Whether primitive type or constructed type, everything is encoded and decoded through the same write()/readFields() methods.
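The uniformity the pattern buys can be sketched as follows: a leaf wrapping a single int and a composite built from two such leaves expose the identical write()/readFields() interface, so client code never needs to distinguish them. As before, a local stand-in interface keeps the sketch runnable without Hadoop, and PairWritable is a hypothetical composite, not a Hadoop class:

```java
import java.io.*;

public class Main {
    // Local stand-in for Hadoop's Writable interface
    interface Writable {
        void write(DataOutput out) throws IOException;
        void readFields(DataInput in) throws IOException;
    }

    // Leaf: wraps a single primitive
    static class IntWritable implements Writable {
        int value;
        public void write(DataOutput out) throws IOException { out.writeInt(value); }
        public void readFields(DataInput in) throws IOException { value = in.readInt(); }
    }

    // Composite (hypothetical): built from two leaves,
    // yet exposing the very same interface
    static class PairWritable implements Writable {
        IntWritable first = new IntWritable(), second = new IntWritable();
        public void write(DataOutput out) throws IOException {
            first.write(out);    // the composite simply delegates
            second.write(out);   // to its children, in order
        }
        public void readFields(DataInput in) throws IOException {
            first.readFields(in);
            second.readFields(in);
        }
    }

    // Client code: never cares whether w is a leaf or a composite
    static byte[] encode(Writable w) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        DataOutputStream dataOut = new DataOutputStream(out);
        w.write(dataOut);
        dataOut.close();
        return out.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        PairWritable p = new PairWritable();
        p.first.value = 163;
        p.second.value = 7;
        System.out.println(encode(p).length + " bytes");  // two ints back to back
    }
}
```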


