[Hadoop Source Code Walkthrough] (5) MapReduce: The Writable Classes

I was out yesterday, so today I'm picking things back up.

Having covered InputFormat earlier, I'll take this chance to talk about Writable too, even though it really belongs with the HDFS material.

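When an object has to be passed between processes or persisted, it must be serialized into a byte stream; conversely, a byte stream received from another process or read from disk must be deserialized back into an object. Writable is Hadoop's serialization format, and Hadoop defines it as the following Writable interface.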

public interface Writable {
  void write(DataOutput out) throws IOException;
  void readFields(DataInput in) throws IOException;
}

A class only needs to implement this interface to be serializable. The hierarchy of the Writable classes is shown in a figure borrowed from Hadoop: The Definitive Guide.
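To make "just implement this interface" concrete, here is a minimal sketch of a custom type and a round trip through a byte array. It is not taken from Hadoop: `SketchWritable` is a local stand-in for the interface above so the code compiles without Hadoop on the classpath, and `PointWritable` and its helper methods are hypothetical names made up for illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Local stand-in for org.apache.hadoop.io.Writable (assumption: same two methods).
interface SketchWritable {
  void write(DataOutput out) throws IOException;
  void readFields(DataInput in) throws IOException;
}

// Hypothetical example type: a point with two int fields.
class PointWritable implements SketchWritable {
  int x;
  int y;

  PointWritable() {}                       // empty instance to be filled by readFields()
  PointWritable(int x, int y) { this.x = x; this.y = y; }

  @Override
  public void write(DataOutput out) throws IOException {
    out.writeInt(x);                       // fields are written in a fixed order
    out.writeInt(y);
  }

  @Override
  public void readFields(DataInput in) throws IOException {
    x = in.readInt();                      // and read back in the same order
    y = in.readInt();
  }

  // Serialize this object into a byte array.
  byte[] toBytes() {
    try {
      ByteArrayOutputStream buf = new ByteArrayOutputStream();
      write(new DataOutputStream(buf));
      return buf.toByteArray();
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  // Rebuild an object from bytes produced by toBytes().
  static PointWritable fromBytes(byte[] bytes) {
    try {
      PointWritable p = new PointWritable();
      p.readFields(new DataInputStream(new ByteArrayInputStream(bytes)));
      return p;
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  public static void main(String[] args) {
    byte[] bytes = new PointWritable(3, -7).toBytes();
    PointWritable copy = fromBytes(bytes);
    // Two ints are written, so the serialized form is 8 bytes long.
    System.out.println(bytes.length + " bytes -> (" + copy.x + ", " + copy.y + ")");
  }
}
```

The point to notice is that readFields() fills in an existing instance instead of returning a new one, which is why the no-argument constructor is there.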

Let's go through them one at a time, starting with IntWritable and LongWritable.

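The WritableComparable interface extends both Writable and Comparable so that instances can be compared. As the hierarchy figure shows, the basic types such as IntWritable, LongWritable and ByteWritable all implement it. The readFields() methods of IntWritable and LongWritable read binary data straight from an input stream that implements DataInput and rebuild an int or a long respectively, while write() converts the int or long directly into binary. IntWritable and LongWritable each contain a nested Comparator class, which compares records directly in the byte stream without deserializing them into objects. This is an optimization, because no objects have to be created. Here is the relevant fragment of IntWritable: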

public class IntWritable implements WritableComparable {
  private int value;

   //…… other methods
  public static class Comparator extends WritableComparator {
    public Comparator() {
      super(IntWritable.class);
    }

    public int compare(byte[] b1, int s1, int l1,
                       byte[] b2, int s2, int l2) {
      int thisValue = readInt(b1, s1);
      int thatValue = readInt(b2, s2);
      return (thisValue<thatValue ? -1 : (thisValue==thatValue ? 0 : 1));
    }
  }

  static {                                        // register this comparator
    WritableComparator.define(IntWritable.class, new Comparator());
  }
}

The static block calls WritableComparator's static define() method to register the Comparator above, that is, to add it to WritableComparator's comparators member, which is a static HashMap. This tells WritableComparator that a call to WritableComparator.get(IntWritable.class) should return the Comparator I registered [for IntWritable, that is IntWritable.Comparator]. I can then call comparator.compare(byte[] b1, int s1, int l1, byte[] b2, int s2, int l2) to compare b1 and b2 without deserializing them into objects [as in the code below]. The readInt() used inside that compare() method is inherited from WritableComparator; it rebuilds the IntWritable's value from the byte array by bit shifting.

//params byte[] b1, byte[] b2
RawComparator<IntWritable> comparator = WritableComparator.get(IntWritable.class);
comparator.compare(b1,0,b1.length,b2,0,b2.length);
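The shift-based readInt() and the raw comparison can be tried out without Hadoop. The following is only a sketch under that assumption: `RawIntCompareSketch` and its methods are names I made up, the bytes are produced with the JDK's DataOutputStream.writeInt() (4 bytes, big-endian), and readInt() here is my own rendering of the shifting idea rather than a copy of Hadoop's code.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

class RawIntCompareSketch {
  // Rebuild a big-endian int from four bytes by shifting.
  // "& 0xff" removes the sign extension that happens when a byte is widened to int.
  static int readInt(byte[] bytes, int start) {
    return ((bytes[start] & 0xff) << 24)
         | ((bytes[start + 1] & 0xff) << 16)
         | ((bytes[start + 2] & 0xff) << 8)
         | (bytes[start + 3] & 0xff);
  }

  // Same parameter shape as RawComparator.compare(); the lengths are unused
  // here because a serialized int always occupies exactly 4 bytes.
  static int compare(byte[] b1, int s1, int l1, byte[] b2, int s2, int l2) {
    int thisValue = readInt(b1, s1);
    int thatValue = readInt(b2, s2);
    return thisValue < thatValue ? -1 : (thisValue == thatValue ? 0 : 1);
  }

  // Serialize an int the way DataOutput.writeInt() does.
  static byte[] serialize(int value) {
    try {
      ByteArrayOutputStream buf = new ByteArrayOutputStream();
      new DataOutputStream(buf).writeInt(value);
      return buf.toByteArray();
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  public static void main(String[] args) {
    byte[] b1 = serialize(-5);
    byte[] b2 = serialize(300);
    // -5 < 300, so the result is -1, and no Integer or IntWritable object was created.
    System.out.println(compare(b1, 0, b1.length, b2, 0, b2.length));
  }
}
```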

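Note that when no Comparator has been registered in comparators for the class being compared, a default Comparator is returned. Comparing b1 and b2 with that default Comparator's compare(byte[] b1, int s1, int l1, byte[] b2, int s2, int l2) still deserializes the bytes into objects; see the more detailed discussion of WritableComparator later on.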

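LongWritable's methods are essentially the same as IntWritable's. The differences are that its value is a long and that it has an extra LongWritable.DecreasingComparator, which extends LongWritable.Comparator and simply returns the opposite [negated] result of LongWritable.Comparator. It is presumably there for sorting in descending order.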

public class LongWritable implements WritableComparable {
  private long value;
  //……others
  /** A decreasing Comparator optimized for LongWritable. */ 
  public static class DecreasingComparator extends Comparator {
    public int compare(WritableComparable a, WritableComparable b) {
      return -super.compare(a, b);
    }
    public int compare(byte[] b1, int s1, int l1, byte[] b2, int s2, int l2) {
      return -super.compare(b1, s1, l1, b2, s2, l2);
    }
  }
  static {                                       // register default comparator
    WritableComparator.define(LongWritable.class, new Comparator());
  }
}
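The effect of negating the comparison result is easy to see on plain long values. This is a small sketch with hypothetical names (`DecreasingSketch`, `compareDecreasing`, `sortDecreasing`), not Hadoop code. Negation is safe here because the ascending comparison only ever returns -1, 0 or 1.

```java
import java.util.Arrays;

class DecreasingSketch {
  // Ascending comparison, written the same way as in the Writable classes.
  static int compare(long a, long b) {
    return a < b ? -1 : (a == b ? 0 : 1);
  }

  // Descending order is simply the negated ascending result.
  static int compareDecreasing(long a, long b) {
    return -compare(a, b);
  }

  // Sort a copy of the input in descending order.
  static Long[] sortDecreasing(Long... values) {
    Long[] copy = values.clone();
    Arrays.sort(copy, (a, b) -> compareDecreasing(a, b));
    return copy;
  }

  public static void main(String[] args) {
    // prints [10, 3, -2]
    System.out.println(Arrays.toString(sortDecreasing(3L, 10L, -2L)));
  }
}
```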

ByteWritable, BooleanWritable, FloatWritable and DoubleWritable all work in much the same way.

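Next come VIntWritable and VLongWritable. The two classes are almost identical, and VIntWritable encodes [and decodes] its value with the same methods that VLongWritable uses. The main difference is that VIntWritable holds an int value while VLongWritable holds a long, which follows from their value ranges. Unlike the classes above, neither of them has a Comparator.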

Looking at VLongWritable alone is enough, so let's first see what its source looks like.

public class VLongWritable implements WritableComparable {
  private long value;

  public VLongWritable() {}

  public VLongWritable(long value) { set(value); }

  /** Set the value of this LongWritable. */
  public void set(long value) { this.value = value; }

  /** Return the value of this LongWritable. */
  public long get() { return value; }

  public void readFields(DataInput in) throws IOException {
    value = WritableUtils.readVLong(in);
  }

  public void write(DataOutput out) throws IOException {
    WritableUtils.writeVLong(out, value);
  }

  /** Returns true iff <code>o</code> is a VLongWritable with the same value. */
  public boolean equals(Object o) {
    if (!(o instanceof VLongWritable))
      return false;
    VLongWritable other = (VLongWritable)o;
    return this.value == other.value;
  }

  public int hashCode() {
    return (int)value;
  }

  /** Compares two VLongWritables. */
  public int compareTo(Object o) {
    long thisValue = this.value;
    long thatValue = ((VLongWritable)o).value;
    return (thisValue < thatValue ? -1 : (thisValue == thatValue ? 0 : 1));
  }

  public String toString() {
    return Long.toString(value);
  }

}

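As the code above shows, encoding goes through WritableUtils.writeVLong(). WritableUtils is a collection of encoding and decoding helpers; for now we only look at the parts that concern VIntWritable and VLongWritable.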

The value of a VIntWritable is in fact also encoded with writeVLong():

  public static void writeVInt(DataOutput stream, int i) throws IOException {
    writeVLong(stream, i);
  }

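First, a VIntWritable is encoded in 1 to 5 bytes and a VLongWritable in 1 to 9 bytes. A value in [-112, 127] takes 1 byte, and that byte stores the value itself. {Page 91 of the Chinese edition of the Definitive Guide gives the range as [-127, 127]; my guess is that the encoding has been updated since.} Outside that range more bytes are needed: the first byte stores the length [and the sign], and the remaining bytes store the value.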
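The length rule above can be checked with a few lines of Java. This is a sketch of my own, assuming the rule exactly as stated (one byte for [-112, 127], otherwise one length byte plus as many bytes as the magnitude needs); `VLongSizeSketch` and `encodedSize` are hypothetical names.

```java
class VLongSizeSketch {
  // Total encoded length in bytes, including the leading length byte when there is one.
  static int encodedSize(long i) {
    if (i >= -112 && i <= 127) {
      return 1;                          // the single byte is the value itself
    }
    if (i < 0) {
      i ^= -1L;                          // one's complement, so the stored magnitude is non-negative
    }
    int dataBits = Long.SIZE - Long.numberOfLeadingZeros(i);
    return (dataBits + 7) / 8 + 1;       // value bytes, rounded up, plus one length byte
  }

  public static void main(String[] args) {
    long[] samples = {0, 127, 128, -112, -113, 65535, 65536,
                      Integer.MAX_VALUE, Long.MAX_VALUE, Long.MIN_VALUE};
    for (long v : samples) {
      System.out.println(v + " -> " + encodedSize(v) + " byte(s)");
    }
  }
}
```

Integer.MAX_VALUE comes out at 5 bytes and Long.MAX_VALUE at 9, which matches the [1-5] and [1-9] ranges quoted above.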

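The steps of writeVLong() are shown in the figure, with explanations added as comments in the code [I'm not sure I explain it clearly enough; if it is hard to follow, I personally don't think you need to understand every detail].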

Source of WritableUtils.writeVLong():

  public static void writeVLong(DataOutput stream, long i) throws IOException {
    if (i >= -112 && i <= 127) {
      stream.writeByte((byte)i);
      return;  // values in [-112, 127] use only one byte
    }

    int len = -112;
    if (i < 0) {
      i ^= -1L; // take one's complement: XOR with -1 (all bits set) flips every bit,
                // giving i_2 with i_2 + 1 = |i| [two's complement: negate = flip all bits, then add 1]
      len = -120;
    }

    long tmp = i;  // at this point i is non-negative, in the range [0, 2^63 - 1]
    // Count the bytes needed: the larger i is, the more bytes it takes,
    // and the further len drops below its starting value (-112 or -120)
    while (tmp != 0) {
      tmp = tmp >> 8;
      len--;
    }
    // len now encodes the length (and the sign): just look at how far it is from its starting value
    stream.writeByte((byte)len);

    // recover the number of value bytes, not counting the first [length] byte
    len = (len < -120) ? -(len + 120) : -(len + 112);

    // take i's binary form 8 bits at a time, most significant byte first, and write it to the stream
    for (int idx = len; idx != 0; idx--) {
      int shiftbits = (idx - 1) * 8;
      long mask = 0xFFL << shiftbits;
      stream.writeByte((byte)((i & mask) >> shiftbits));
    }
  }
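To see the bytes this produces, here is a compact re-implementation of the same steps that writes into a byte array and prints it in hex. It is a sketch for experimenting, not Hadoop's code: `VLongEncodeSketch`, `encode` and `hex` are names I made up, and the logic is my condensed reading of writeVLong() above.

```java
import java.io.ByteArrayOutputStream;

class VLongEncodeSketch {
  // Encode a long following the same steps as writeVLong() above.
  static byte[] encode(long i) {
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    if (i >= -112 && i <= 127) {
      out.write((int) i);                  // single byte: the value itself
      return out.toByteArray();
    }
    int len = -112;                        // starting marker for positive values
    if (i < 0) {
      i ^= -1L;                            // one's complement
      len = -120;                          // starting marker for negative values
    }
    for (long tmp = i; tmp != 0; tmp >>= 8) {
      len--;                               // one step per value byte
    }
    out.write(len);                        // first byte: sign and length marker
    int valueBytes = (len < -120) ? -(len + 120) : -(len + 112);
    for (int idx = valueBytes; idx != 0; idx--) {
      out.write((int) (i >> ((idx - 1) * 8)));   // write() keeps only the low 8 bits
    }
    return out.toByteArray();
  }

  // Format bytes as space-separated uppercase hex.
  static String hex(byte[] bytes) {
    StringBuilder sb = new StringBuilder();
    for (byte b : bytes) {
      sb.append(String.format("%02X ", b & 0xFF));
    }
    return sb.toString().trim();
  }

  public static void main(String[] args) {
    System.out.println(hex(encode(127)));   // 7F
    System.out.println(hex(encode(-112)));  // 90
    System.out.println(hex(encode(128)));   // 8F 80
    System.out.println(hex(encode(-113)));  // 87 70
    System.out.println(hex(encode(300)));   // 8E 01 2C
  }
}
```

For 300 the first byte 8E is -114, two steps below -112, so two value bytes follow. For -113 the first byte 87 is -121, one step below -120, and the value byte 70 is 112, the one's complement of -113.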

Now that we know how a value is written out, let's see how it is read back in, which is obviously the reverse process.

WritableUtils.readVLong():

  public static long readVLong(DataInput stream) throws IOException {
    byte firstByte = stream.readByte();
    int len = decodeVIntSize(firstByte);
    if (len == 1) {
      return firstByte;
    }
    long i = 0;
    for (int idx = 0; idx < len-1; idx++) {
      byte b = stream.readByte();
      i = i << 8;
      i = i | (b & 0xFF);
    }
    return (isNegativeVInt(firstByte) ? (i ^ -1L) : i);
  }

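This reads the first byte, which gives the encoded length [including the length byte itself], and then reads the remaining bytes from the input stream one at a time. The & 0xFF masks off the sign extension that would otherwise happen when the byte is automatically widened. Finally, for a negative value, ^ -1L flips every bit, sign bit included, which undoes the complement taken when writing.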

WritableUtils.decodeVIntSize() returns the encoded length:

  public static int decodeVIntSize(byte value) {
    if (value >= -112) {
      return 1;
    } else if (value < -120) {
      return -119 - value;
    }
    return -111 - value;
  }

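Clearly this is just the reverse of the process in the figure above. The constants -119 and -111 are used [rather than -120 and -112] so that the result is the total encoded length and not the length of the value itself [which does not include the first byte that carries the length].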
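The decoding side can be exercised on hand-written byte arrays. Again this is only a sketch: `VLongDecodeSketch` and `decode` are hypothetical names, decode() works on a byte array instead of a DataInput, and isNegative() is my assumption about what isNegativeVInt() checks, since its source is not shown in this post.

```java
class VLongDecodeSketch {
  // Total encoded length, taken from the first byte (same logic as decodeVIntSize() above).
  static int decodeVIntSize(byte value) {
    if (value >= -112) {
      return 1;
    } else if (value < -120) {
      return -119 - value;
    }
    return -111 - value;
  }

  // Assumption: negative values are marked by a first byte below -120,
  // or are single-byte values in [-112, -1].
  static boolean isNegative(byte firstByte) {
    return firstByte < -120 || (firstByte >= -112 && firstByte < 0);
  }

  // Decode one value from the start of a byte array.
  static long decode(byte[] bytes) {
    byte firstByte = bytes[0];
    int len = decodeVIntSize(firstByte);
    if (len == 1) {
      return firstByte;                        // single-byte value
    }
    long i = 0;
    for (int idx = 1; idx < len; idx++) {
      i = (i << 8) | (bytes[idx] & 0xFF);      // & 0xFF avoids sign extension
    }
    return isNegative(firstByte) ? (i ^ -1L) : i;
  }

  public static void main(String[] args) {
    System.out.println(decode(new byte[] {(byte) 0x8E, 0x01, 0x2C}));   // 300
    System.out.println(decode(new byte[] {(byte) 0x87, 0x70}));         // -113
    System.out.println(decode(new byte[] {(byte) 0x8F, (byte) 0x80}));  // 128
    System.out.println(decode(new byte[] {(byte) 0x90}));               // -112
  }
}
```

The third case shows why the mask matters: the byte 0x80 is -128 as a Java byte, and without & 0xFF it would turn the result negative instead of giving 128.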

Back to the WritableComparator mentioned earlier: it implements the RawComparator interface, and RawComparator is nothing more than a single compare() method.

public interface RawComparator<T> extends Comparator<T> {
  public int compare(byte[] b1, int s1, int l1, byte[] b2, int s2, int l2);
}

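WritableComparator is a factory for RawComparator instances [for the Writable implementations that have registered one]. It provides those Writable implementations with helper methods for deserialization. These are all fairly simple; the harder ones, readVInt() and readVLong(), follow the process described above. WritableComparator also provides a default implementation of compare(), which deserializes before comparing. If WritableComparator.get() finds no registered Comparator, it creates a new one [actually an instance of WritableComparator itself]; when you then call public int compare(byte[] b1, int s1, int l1, byte[] b2, int s2, int l2), it uses the readFields() method of the Writable implementation being compared to read the values out.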

For example, VIntWritable is not registered, so get() constructs a WritableComparator and sets up key1, key2, buffer and keyClass. When you call public int compare(byte[] b1, int s1, int l1, byte[] b2, int s2, int l2), it uses VIntWritable.readFields() to read the values from the encoded byte[] and then compares them.
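The fallback path can be sketched without Hadoop as follows. Everything here is hypothetical and simplified: `SketchKey` stands in for WritableComparable, `ShortKey` is an invented key type with no raw comparator, and `FallbackComparatorSketch` only mimics the idea of keeping two reusable key objects and filling them with readFields() before calling compareTo(). Hadoop's own class differs in detail, for example in how it buffers the input.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.function.Supplier;

// Local stand-in for WritableComparable so the sketch compiles without Hadoop.
interface SketchKey<T> extends Comparable<T> {
  void write(DataOutput out) throws IOException;
  void readFields(DataInput in) throws IOException;
}

// Hypothetical key type that has no byte-level comparator of its own.
class ShortKey implements SketchKey<ShortKey> {
  short value;
  ShortKey() {}
  ShortKey(short value) { this.value = value; }
  public void write(DataOutput out) throws IOException { out.writeShort(value); }
  public void readFields(DataInput in) throws IOException { value = in.readShort(); }
  public int compareTo(ShortKey other) { return Short.compare(value, other.value); }
}

class FallbackComparatorSketch<T extends SketchKey<T>> {
  private final T key1;   // reused for every comparison
  private final T key2;

  FallbackComparatorSketch(Supplier<T> factory) {
    key1 = factory.get();
    key2 = factory.get();
  }

  // Deserialize both byte ranges into the reusable keys, then compare the objects.
  int compare(byte[] b1, int s1, int l1, byte[] b2, int s2, int l2) {
    try {
      key1.readFields(new DataInputStream(new ByteArrayInputStream(b1, s1, l1)));
      key2.readFields(new DataInputStream(new ByteArrayInputStream(b2, s2, l2)));
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
    return key1.compareTo(key2);
  }

  // Serialize a key into a byte array.
  static byte[] serialize(SketchKey<?> key) {
    try {
      ByteArrayOutputStream buf = new ByteArrayOutputStream();
      key.write(new DataOutputStream(buf));
      return buf.toByteArray();
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  public static void main(String[] args) {
    FallbackComparatorSketch<ShortKey> cmp = new FallbackComparatorSketch<>(ShortKey::new);
    byte[] a = serialize(new ShortKey((short) 7));
    byte[] b = serialize(new ShortKey((short) -3));
    System.out.println(cmp.compare(a, 0, a.length, b, 0, b.length) > 0);  // true, since 7 > -3
  }
}
```

Compared with the registered IntWritable.Comparator, every call here pays for two readFields() calls, which is the cost the raw comparators are designed to avoid.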

Next up: ArrayWritable and TwoDArrayWritable, AbstractMapWritable

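ArrayWritable and TwoDArrayWritable wrap a one-dimensional and a two-dimensional array respectively. As you would expect, each holds an array of Writables together with the type of its elements, and serialization and deserialization rely on the write() and readFields() methods of the wrapped Writable implementation.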

public class TwoDArrayWritable implements Writable {
  private Class valueClass;
  private Writable[][] values;

  //……others
  public void readFields(DataInput in) throws IOException {
    // construct matrix
    values = new Writable[in.readInt()][];          
    for (int i = 0; i < values.length; i++) {
      values[i] = new Writable[in.readInt()];
    }

    // construct values
    for (int i = 0; i < values.length; i++) {
      for (int j = 0; j < values[i].length; j++) {
        Writable value;                             // construct value
        try {
          value = (Writable)valueClass.newInstance();
        } catch (InstantiationException e) {
          throw new RuntimeException(e.toString());
        } catch (IllegalAccessException e) {
          throw new RuntimeException(e.toString());
        }
        value.readFields(in);                       // read a value
        values[i][j] = value;                       // store it in values
      }
    }
  }

  public void write(DataOutput out) throws IOException {
    out.writeInt(values.length);                 // write values
    for (int i = 0; i < values.length; i++) {
      out.writeInt(values[i].length);
    }
    for (int i = 0; i < values.length; i++) {
      for (int j = 0; j < values[i].length; j++) {
        values[i][j].write(out);
      }
    }
  }
}
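The wire layout used above (row count, then every row length, then the elements row by row) can be tried on a plain ragged int[][]. This is a simplified sketch with hypothetical names (`TwoDArrayLayoutSketch`, `write`, `read`); it stores raw ints instead of delegating to wrapped Writable objects.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.Arrays;

class TwoDArrayLayoutSketch {
  // Write: row count, then each row's length, then all elements row by row.
  static byte[] write(int[][] values) {
    try {
      ByteArrayOutputStream buf = new ByteArrayOutputStream();
      DataOutputStream out = new DataOutputStream(buf);
      out.writeInt(values.length);
      for (int[] row : values) {
        out.writeInt(row.length);
      }
      for (int[] row : values) {
        for (int v : row) {
          out.writeInt(v);
        }
      }
      return buf.toByteArray();
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  // Read: rebuild the (possibly ragged) matrix shape first, then fill in the elements.
  static int[][] read(byte[] bytes) {
    try {
      DataInputStream in = new DataInputStream(new ByteArrayInputStream(bytes));
      int[][] values = new int[in.readInt()][];
      for (int i = 0; i < values.length; i++) {
        values[i] = new int[in.readInt()];
      }
      for (int[] row : values) {
        for (int j = 0; j < row.length; j++) {
          row[j] = in.readInt();
        }
      }
      return values;
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  public static void main(String[] args) {
    int[][] ragged = {{1, 2, 3}, {4}, {}};
    byte[] bytes = write(ragged);
    // 1 row count + 3 row lengths + 4 elements = 8 ints = 32 bytes
    System.out.println(bytes.length + " bytes, round trip ok: "
        + Arrays.deepEquals(ragged, read(bytes)));
  }
}
```

Writing all the row lengths up front is what lets the reader allocate the whole matrix before reading any element, and it also means rows do not need to have the same length.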

That's about all there is to it; nothing more to say here.

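There are also TupleWritable, AbstractMapWritable->{MapWritable, SortedMapWritable}, DBWritable, CompressedWritable, VersionedWritable, GenericWritable and the like. I'll come back to them when there is a need; they work along much the same lines and only differ in what they are for.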

References:

[1] Hadoop: The Definitive Guide, Chinese edition, 2nd edition
