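[hadoop2.7.1] I/O: A Step-by-Step Analysis of Text (Background and Comparison with String)

Hadoop's Text class closely resembles Java's String class, and many of its methods have direct counterparts there. Because the two serve different purposes, they also differ in important ways. Before analyzing Text, this article first reviews java.lang.String.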


1. The String class in Java:


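The String class represents character strings. All string literals in Java programs, such as "abc", are implemented as instances of this class.

Strings are constant; their values cannot be changed after they are created. String buffers support mutable strings. Because String objects are immutable, they can be shared. For example: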

     String str = "abc";
 is equivalent to:
     char data[] = {'a', 'b', 'c'};
     String str = new String(data);
 Here are some more examples of how strings can be used:

     System.out.println("abc");
     String cde = "cde";
     System.out.println("abc" + cde);
     String c = "abc".substring(2,3);
     String d = cde.substring(1, 2);
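 The String class includes methods for examining individual characters of the sequence, comparing strings, searching strings, extracting substrings, and creating a copy of a string with all characters converted to uppercase or lowercase. Case mapping is based on the Unicode Standard version specified by the Character class.

The Java language provides special support for the string concatenation operator ("+") and for converting other objects to strings. String concatenation is implemented through the StringBuilder (or StringBuffer) class and its append method. String conversion is implemented through the toString method, which is defined by Object and inherited by every class in Java. For more information on string concatenation and conversion, see The Java Language Specification by Gosling, Joy, and Steele.

Unless otherwise noted, passing a null argument to a constructor or method of this class causes a NullPointerException to be thrown.

A String represents a string in UTF-16 format, in which supplementary characters are represented by surrogate pairs (see the section Unicode Character Representations in the Character class for details). Index values refer to char code units, so a supplementary character occupies two positions in a String.

The String class provides methods for dealing with Unicode code points (that is, characters) as well as Unicode code units (that is, char values).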
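The difference between code units and code points can be seen with a small sketch that uses only the JDK. The class name `CodeUnitsDemo` is made up for illustration; the string contains the supplementary character U+20C30, which is built from its code point so that the example does not depend on the source file encoding.

```java
class CodeUnitsDemo {
    // "a", then the supplementary character U+20C30, then "b"
    static final String S = "a" + new String(Character.toChars(0x20C30)) + "b";

    // Number of UTF-16 code units (char values) in the string
    static int codeUnits(String s) {
        return s.length();
    }

    // Number of Unicode code points (characters) in the string
    static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        System.out.println(codeUnits(S));   // U+20C30 counts as two code units
        System.out.println(codePoints(S));  // but as a single code point
        // Index 1 holds the high surrogate; codePointAt combines the pair
        System.out.println(Character.isHighSurrogate(S.charAt(1)));
        System.out.println(Integer.toHexString(S.codePointAt(1)));
    }
}
```

Because indexes count char code units, `charAt(1)` returns only half of the supplementary character, while `codePointAt(1)` returns the whole code point.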


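2. Unicode

   Both Hadoop's Text class and Java's String class are based on the Unicode standard, but they differ in encoding: Text stores its data as standard UTF-8 (as its class Javadoc states, and not the modified UTF-8 used by java.io.DataOutput.writeUTF), whereas String uses UTF-16. The rest of this section describes Unicode in more detail.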

 Unicode

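 Unicode is a character encoding standard used on computers. It was created to overcome the limitations of traditional character encoding schemes: it assigns every character of every language a single, unique number, so that text can be converted and processed across languages and platforms. Using Unicode involves two steps. The first is a specification that assigns each character a unique number (its code point). The second is deciding how to store that number in a computer, and only this step determines how many bytes a character occupies.
     In other words, Unicode represents every character as a number between 0 and 0x10FFFF. The numbers 0 to 127 denote the same characters as ASCII. The first 65,536 (2^16) code points, 0 to 65535, form the Basic Multilingual Plane, which holds most characters in common use. That is the first step. The second step is turning these numbers into bit strings stored in the computer, which can be done in several ways. This is where the UTF (Unicode Transformation Format) encodings come in: UTF-8, UTF-16, and UTF-32.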

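UTF-8, UTF-16, UTF-32


So what exactly are the differences between UTF-8, UTF-16, and UTF-32?

    UTF-16 is the easiest to understand: every code point below 0x10000 is stored in two bytes, and every other code point is stored in four bytes as a surrogate pair. A common misconception is to equate Unicode with UTF-16. For text that consists only of English letters, UTF-16 is clearly wasteful: why use two bytes for a character that fits in one?

   Hence UTF-8. The 8 here is easy to misread: it does not mean that every character takes one byte. UTF-8 is a variable-length encoding in which a character takes one, two, three, or four bytes, and never more than four, depending on the size of its code point. ASCII characters are encoded in a single byte that is identical to their ASCII representation. All other Unicode characters need at least two bytes. In a multi-byte sequence, the first byte starts with n consecutive 1 bits followed by a 0 bit, where n is the number of bytes in the sequence, and every following byte starts with the bits 10.

   The trade-off between UTF-8 and UTF-16 is now easy to see: for text that is entirely English, or mostly English mixed with other scripts, UTF-8 saves a great deal of space compared with UTF-16.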

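For example, the code points of "汉字" are 0x6C49 and 0x5B57, and the encoded data is:
  BYTE data_utf8[] = {0xE6, 0xB1, 0x89, 0xE5, 0xAD, 0x97}; // UTF-8 encoding
  WORD data_utf16[] = {0x6c49, 0x5b57}; // UTF-16 encoding
  DWORD data_utf32[] = {0x6c49, 0x5b57}; // UTF-32 encoding
  Here BYTE, WORD, and DWORD denote unsigned 8-bit, 16-bit, and 32-bit integers, which are the code units of UTF-8, UTF-16, and UTF-32 respectively. The UTF-8 encoding of "汉字" takes 6 bytes. Its UTF-16 encoding takes two WORDs, 4 bytes in total. Its UTF-32 encoding takes two DWORDs, 8 bytes in total. Depending on byte order, UTF-16 can be serialized as UTF-16LE or UTF-16BE, and UTF-32 as UTF-32LE or UTF-32BE.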
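The byte counts above can be checked with the JDK's own encoders. This is a minimal sketch: the class `EncodingSizes` and its helper methods are made up for illustration, and the UTF-32 size is computed as four bytes per code point rather than by calling a UTF-32 charset.

```java
import java.nio.charset.StandardCharsets;

class EncodingSizes {
    // Formats bytes as space-separated two-digit hex values
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    static String utf8Hex(String s) {
        return hex(s.getBytes(StandardCharsets.UTF_8));
    }

    static String utf16BeHex(String s) {
        return hex(s.getBytes(StandardCharsets.UTF_16BE));
    }

    // UTF-32 always uses 4 bytes per code point
    static int utf32Size(String s) {
        return 4 * s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        String s = "\u6C49\u5B57"; // the two characters U+6C49 and U+5B57
        System.out.println(utf8Hex(s));
        System.out.println(utf16BeHex(s));
        System.out.println(utf32Size(s));
    }
}
```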

The following sections describe UTF-8, UTF-16, and UTF-32 in turn.

UTF-8 (note: Text uses standard UTF-8 encoding)


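UTF-8 encodes Unicode in units of bytes. The mapping from Unicode code points to UTF-8 is:
  Unicode code point (hex) ║ UTF-8 byte stream (binary)
  000000 - 00007F ║ 0xxxxxxx
  000080 - 0007FF ║ 110xxxxx 10xxxxxx
  000800 - 00FFFF ║ 1110xxxx 10xxxxxx 10xxxxxx
  010000 - 10FFFF ║ 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
  The distinguishing feature of UTF-8 is that it uses encodings of different lengths for different ranges of code points. For code points 0x00-0x7F, the UTF-8 encoding is identical to ASCII. The maximum length of a UTF-8 encoding is 4 bytes. As the table shows, the 4-byte template has 21 x's, so it can hold 21 binary digits, and the largest Unicode code point, 0x10FFFF, needs only 21 bits.
  Example 1: the code point of "汉" is 0x6C49. Since 0x6C49 lies between 0x0800 and 0xFFFF, the 3-byte template applies: 1110xxxx 10xxxxxx 10xxxxxx. Written in binary, 0x6C49 is 0110 1100 0100 1001. Substituting these bits in order for the x's in the template gives 11100110 10110001 10001001, that is, E6 B1 89.
  Example 2: the code point 0x20C30 lies between 0x010000 and 0x10FFFF, so the 4-byte template applies: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx. Written as a 21-bit binary number (padded with leading zeros if shorter), 0x20C30 is 0 0010 0000 1100 0011 0000. Substituting these bits in order for the x's in the template gives 11110000 10100000 10110000 10110000, that is, F0 A0 B0 B0.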
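The templates in the table translate directly into shifts and masks. The following is a minimal sketch of that mapping for a single code point; `Utf8Encoder` is a made-up name, and this is not how Hadoop or the JDK implement encoding internally. The `main` method compares the result with the JDK's UTF-8 encoder for the two worked examples.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class Utf8Encoder {
    // Encodes one Unicode code point into UTF-8 using the four templates
    static byte[] encode(int cp) {
        if (cp < 0 || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) {
            throw new IllegalArgumentException("not a Unicode scalar value: " + cp);
        }
        if (cp < 0x80) {            // 0xxxxxxx
            return new byte[] {(byte) cp};
        }
        if (cp < 0x800) {           // 110xxxxx 10xxxxxx
            return new byte[] {
                (byte) (0xC0 | (cp >> 6)),
                (byte) (0x80 | (cp & 0x3F))};
        }
        if (cp < 0x10000) {         // 1110xxxx 10xxxxxx 10xxxxxx
            return new byte[] {
                (byte) (0xE0 | (cp >> 12)),
                (byte) (0x80 | ((cp >> 6) & 0x3F)),
                (byte) (0x80 | (cp & 0x3F))};
        }
        return new byte[] {         // 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
            (byte) (0xF0 | (cp >> 18)),
            (byte) (0x80 | ((cp >> 12) & 0x3F)),
            (byte) (0x80 | ((cp >> 6) & 0x3F)),
            (byte) (0x80 | (cp & 0x3F))};
    }

    public static void main(String[] args) {
        for (int cp : new int[] {0x6C49, 0x20C30}) {
            byte[] mine = encode(cp);
            byte[] jdk = new String(Character.toChars(cp)).getBytes(StandardCharsets.UTF_8);
            System.out.println(Integer.toHexString(cp) + " matches JDK: " + Arrays.equals(mine, jdk));
        }
    }
}
```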

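UTF-16


  UTF-16 encodes in units of 16-bit unsigned integers. Let U denote a Unicode code point. The encoding rules are:
  If U < 0x10000, the UTF-16 encoding of U is U itself as a 16-bit unsigned integer (for brevity, a 16-bit unsigned integer is written WORD below).
  If U ≥ 0x10000, first compute U' = U - 0x10000 and write U' in binary as yyyy yyyy yyxx xxxx xxxx. The UTF-16 encoding of U (in binary) is then 110110yyyyyyyyyy 110111xxxxxxxxxx.
  Why does U' fit in 20 bits? The largest Unicode code point is 0x10FFFF; subtracting 0x10000 gives a maximum U' of 0xFFFFF, which fits in 20 bits. For example, subtracting 0x10000 from the code point 0x20C30 gives 0x10C30, which in binary is 0001 0000 1100 0011 0000. Substituting the first 10 bits for the y's in the template and the last 10 bits for the x's gives 1101100001000011 1101110000110000, that is, 0xD843 0xDC30.
  Under these rules, the UTF-16 encoding of a code point in 0x10000-0x10FFFF consists of two WORDs. The top 6 bits of the first WORD are 110110 and the top 6 bits of the second WORD are 110111. The first WORD therefore ranges (in binary) from 11011000 00000000 to 11011011 11111111, that is, 0xD800-0xDBFF, and the second WORD ranges from 11011100 00000000 to 11011111 11111111, that is, 0xDC00-0xDFFF.
  To distinguish one-WORD UTF-16 encodings from two-WORD ones, the designers of Unicode reserved 0xD800-0xDFFF and called it the surrogate area:
  D800-DB7F ║ High Surrogates
  DB80-DBFF ║ High Private Use Surrogates
  DC00-DFFF ║ Low Surrogates
  A high surrogate is a value in this range that serves as the first WORD of a two-WORD UTF-16 encoding, and a low surrogate is a value that serves as the second WORD. What, then, is a high private use surrogate? Answering this question also shows how to derive a code point from its UTF-16 encoding.
  If the first WORD of a character's UTF-16 encoding lies between 0xDB80 and 0xDBFF, in what range is its code point? The second WORD ranges over 0xDC00-0xDFFF, so the UTF-16 encoding of such a character ranges from 0xDB80 0xDC00 to 0xDBFF 0xDFFF. In binary, this range is:
  1101101110000000 1101110000000000 - 1101101111111111 1101111111111111
  Reversing the encoding steps, take the low 10 bits of the high and low WORDs and concatenate them:
  1110 0000 0000 0000 0000 - 1111 1111 1111 1111 1111
  that is, 0xE0000-0xFFFFF. Adding 0x10000 back, again reversing the encoding steps, gives 0xF0000-0x10FFFF. This is the range of code points whose first UTF-16 WORD lies between 0xDB80 and 0xDBFF, namely planes 15 and 16. Because the Unicode standard designates planes 15 and 16 as private use areas, the reserved values 0xDB80-0xDBFF are called high private use surrogates.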
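The encoding rule and its reverse fit in a few lines of bit arithmetic. This is a sketch under the rules stated above; `SurrogatePairs` and its method names are made up for illustration, and the JDK offers equivalent functionality through `Character.toChars` and `Character.toCodePoint`.

```java
class SurrogatePairs {
    // Splits a supplementary code point into a high and a low surrogate
    static char[] toSurrogates(int cp) {
        if (cp < 0x10000 || cp > 0x10FFFF) {
            throw new IllegalArgumentException("not a supplementary code point: " + cp);
        }
        int u = cp - 0x10000;                      // 20-bit value U'
        char high = (char) (0xD800 | (u >> 10));   // 110110 + top 10 bits
        char low = (char) (0xDC00 | (u & 0x3FF));  // 110111 + low 10 bits
        return new char[] {high, low};
    }

    // Reverses the encoding: joins the low 10 bits of each WORD, then adds 0x10000
    static int toCodePoint(char high, char low) {
        return (((high & 0x3FF) << 10) | (low & 0x3FF)) + 0x10000;
    }

    public static void main(String[] args) {
        char[] pair = toSurrogates(0x20C30);
        System.out.printf("%04X %04X%n", (int) pair[0], (int) pair[1]);
        // Bounds of the high private use surrogate range
        System.out.printf("%X - %X%n",
            toCodePoint((char) 0xDB80, (char) 0xDC00),
            toCodePoint((char) 0xDBFF, (char) 0xDFFF));
    }
}
```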

UTF-32


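  UTF-32 encodes in units of 32-bit unsigned integers. The UTF-32 encoding of a code point is simply the code point as a 32-bit unsigned integer.
  Byte order
  Depending on byte order, UTF-16 can be serialized as UTF-16LE or UTF-16BE, and UTF-32 as UTF-32LE or UTF-32BE. For example:
  Unicode code point ║ UTF-16LE ║ UTF-16BE ║ UTF-32LE ║ UTF-32BE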
  0x006C49 ║ 49 6C ║ 6C 49 ║ 49 6C 00 00 ║ 00 00 6C 49 
  0x020C30 ║ 43 D8 30 DC ║ D8 43 DC 30 ║ 30 0C 02 00 ║ 00 02 0C 30
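The rows of this table can be reproduced by writing the code units into a `java.nio.ByteBuffer` with an explicit byte order. This is an illustrative sketch; the class `ByteOrderDemo` and its methods are made-up names.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

class ByteOrderDemo {
    // Formats bytes as space-separated two-digit hex values
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    // Writes the UTF-16 code units of a code point in the given byte order
    static String utf16(int cp, ByteOrder order) {
        char[] units = Character.toChars(cp);
        ByteBuffer buf = ByteBuffer.allocate(units.length * 2).order(order);
        for (char c : units) {
            buf.putChar(c);
        }
        return hex(buf.array());
    }

    // Writes the code point as one 32-bit unit in the given byte order
    static String utf32(int cp, ByteOrder order) {
        return hex(ByteBuffer.allocate(4).order(order).putInt(cp).array());
    }

    public static void main(String[] args) {
        for (int cp : new int[] {0x6C49, 0x20C30}) {
            System.out.println(utf16(cp, ByteOrder.LITTLE_ENDIAN) + " | "
                + utf16(cp, ByteOrder.BIG_ENDIAN) + " | "
                + utf32(cp, ByteOrder.LITTLE_ENDIAN) + " | "
                + utf32(cp, ByteOrder.BIG_ENDIAN));
        }
    }
}
```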

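UTF-32 uses exactly 32 bits for every Unicode code point. Because every character takes 4 bytes, UTF-32 is very inefficient in terms of space. In particular, characters outside the Basic Multilingual Plane are so rare in most documents that they can be ignored when comparing sizes, so UTF-32 text is usually two to four times as large as the same text in other encodings. Although a fixed number of bytes per code point looks convenient, UTF-32 is not as widely used as the other Unicode encodings.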

The Text class


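Text provides methods to serialize, deserialize, and compare text at the byte level. Its length is an int that is serialized in a zero-compressed (variable-length) format. It also provides methods for traversing a string without converting the byte array to a String.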
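To give a feel for what traversing UTF-8 bytes without building a String involves, here is a JDK-only sketch. It is not Hadoop's code: `Utf8Walker` and its methods are made-up names, and the sketch assumes well-formed input (it checks lead bytes and sequence bounds but does not validate continuation bytes). It also shows that lengths and positions are counted in bytes rather than in char values.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

class Utf8Walker {
    // Length of the sequence introduced by a lead byte, or -1 if the byte
    // is a continuation byte or cannot start a valid sequence
    static int sequenceLength(int lead) {
        if (lead < 0x80) return 1;
        if (lead < 0xC0) return -1; // 10xxxxxx: continuation byte
        if (lead < 0xE0) return 2;
        if (lead < 0xF0) return 3;
        if (lead < 0xF8) return 4;
        return -1;
    }

    // Decodes the first `length` bytes of `utf8` into code points
    static List<Integer> codePoints(byte[] utf8, int length) {
        List<Integer> out = new ArrayList<>();
        int i = 0;
        while (i < length) {
            int lead = utf8[i] & 0xFF;
            int n = sequenceLength(lead);
            if (n < 0 || i + n > length) {
                throw new IllegalArgumentException("malformed input at byte " + i);
            }
            // Keep only the payload bits of the lead byte
            int cp = (n == 1) ? lead : lead & (0xFF >> (n + 1));
            for (int k = 1; k < n; k++) {
                cp = (cp << 6) | (utf8[i + k] & 0x3F); // 6 payload bits per trail byte
            }
            out.add(cp);
            i += n;
        }
        return out;
    }

    public static void main(String[] args) {
        String s = "a\u6C49" + new String(Character.toChars(0x20C30));
        byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
        System.out.println("chars=" + s.length() + " bytes=" + utf8.length);
        for (int cp : codePoints(utf8, utf8.length)) {
            System.out.println(Integer.toHexString(cp));
        }
    }
}
```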

Source code of Text in Hadoop 2.7.1:

/**
 * Licensed to the Apache Software Foundation (ASF) under one
 * or more contributor license agreements.  See the NOTICE file
 * distributed with this work for additional information
 * regarding copyright ownership.  The ASF licenses this file
 * to you under the Apache License, Version 2.0 (the
 * "License"); you may not use this file except in compliance
 * with the License.  You may obtain a copy of the License at
 *
 *     http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */

package org.apache.hadoop.io;

import java.io.IOException;
import java.io.DataInput;
import java.io.DataOutput;
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.MalformedInputException;
import java.text.CharacterIterator;
import java.text.StringCharacterIterator;
import java.util.Arrays;

import org.apache.avro.reflect.Stringable;

import org.apache.hadoop.classification.InterfaceAudience;
import org.apache.hadoop.classification.InterfaceStability;

/** This class stores text using standard UTF8 encoding.  It provides methods
 * to serialize, deserialize, and compare texts at byte level.  The type of
 * length is integer and is serialized using zero-compressed format.  <p>In
 * addition, it provides methods for string traversal without converting the
 * byte array to a string.  <p>Also includes utilities for
 * serializing/deserialing a string, coding/decoding a string, checking if a
 * byte array contains valid UTF8 code, calculating the length of an encoded
 * string.
 */
@Stringable
@InterfaceAudience.Public
@InterfaceStability.Stable
public class Text extends BinaryComparable
    implements WritableComparable<BinaryComparable> {
  
  private static ThreadLocal<CharsetEncoder> ENCODER_FACTORY =
    new ThreadLocal<CharsetEncoder>() {
      @Override
      protected CharsetEncoder initialValue() {
        return Charset.forName("UTF-8").newEncoder().
               onMalformedInput(CodingErrorAction.REPORT).
               onUnmappableCharacter(CodingErrorAction.REPORT);
    }
  };
  
  private static ThreadLocal<CharsetDecoder> DECODER_FACTORY =
    new ThreadLocal<CharsetDecoder>() {
    @Override
    protected CharsetDecoder initialValue() {
      return Charset.forName("UTF-8").newDecoder().
             onMalformedInput(CodingErrorAction.REPORT).
             onUnmappableCharacter(CodingErrorAction.REPORT);
    }
  };
  
  private static final byte [] EMPTY_BYTES = new byte[0];
  
  private byte[] bytes;
  private int length;

  public Text() {
    bytes = EMPTY_BYTES;
  }

  /** Construct from a string. 
   */
  public Text(String string) {
    set(string);
  }

  /** Construct from another text. */
  public Text(Text utf8) {
    set(utf8);
  }

  /** Construct from a byte array.
   */
  public Text(byte[] utf8)  {
    set(utf8);
  }
  
  /**
   * Get a copy of the bytes that is exactly the length of the data.
   * See {@link #getBytes()} for faster access to the underlying array.
   */
  public byte[] copyBytes() {
    byte[] result = new byte[length];
    System.arraycopy(bytes, 0, result, 0, length);
    return result;
  }
  
  /**
   * Returns the raw bytes; however, only data up to {@link #getLength()} is
   * valid. Please use {@link #copyBytes()} if you
   * need the returned array to be precisely the length of the data.
   */
  @Override
  public byte[] getBytes() {
    return bytes;
  }

  /** Returns the number of bytes in the byte array */ 
  @Override
  public int getLength() {
    return length;
  }
  
  /**
   * Returns the Unicode Scalar Value (32-bit integer value)
   * for the character at <code>position</code>. Note that this
   * method avoids using the converter or doing String instantiation
   * @return the Unicode scalar value at position or -1
   *          if the position is invalid or points to a
   *          trailing byte
   */
  public int charAt(int position) {
    if (position > this.length) return -1; // too long
    if (position < 0) return -1; // duh.
      
    ByteBuffer bb = (ByteBuffer)ByteBuffer.wrap(bytes).position(position);
    return bytesToCodePoint(bb.slice());
  }
  
  public int find(String what) {
    return find(what, 0);
  }
  
  /**
   * Finds any occurence of <code>what</code> in the backing
   * buffer, starting as position <code>start</code>. The starting
   * position is measured in bytes and the return value is in
   * terms of byte position in the buffer. The backing buffer is
   * not converted to a string for this operation.
   * @return byte position of the first occurence of the search
   *         string in the UTF-8 buffer or -1 if not found
   */
  public int find(String what, int start) {
    try {
      ByteBuffer src = ByteBuffer.wrap(this.bytes,0,this.length);
      ByteBuffer tgt = encode(what);
      byte b = tgt.get();
      src.position(start);
          
      while (src.hasRemaining()) {
        if (b == src.get()) { // matching first byte
          src.mark(); // save position in loop
          tgt.mark(); // save position in target
          boolean found = true;
          int pos = src.position()-1;
          while (tgt.hasRemaining()) {
            if (!src.hasRemaining()) { // src expired first
              tgt.reset();
              src.reset();
              found = false;
              break;
            }
            if (!(tgt.get() == src.get())) {
              tgt.reset();
              src.reset();
              found = false;
              break; // no match
            }
          }
          if (found) return pos;
        }
      }
      return -1; // not found
    } catch (CharacterCodingException e) {
      // can't get here
      e.printStackTrace();
      return -1;
    }
  }  
  /** Set to contain the contents of a string. 
   */
  public void set(String string) {
    try {
      ByteBuffer bb = encode(string, true);
      bytes = bb.array();
      length = bb.limit();
    }catch(CharacterCodingException e) {
      throw new RuntimeException("Should not have happened ", e); 
    }
  }

  /** Set to a utf8 byte array
   */
  public void set(byte[] utf8) {
    set(utf8, 0, utf8.length);
  }
  
  /** copy a text. */
  public void set(Text other) {
    set(other.getBytes(), 0, other.getLength());
  }

  /**
   * Set the Text to range of bytes
   * @param utf8 the data to copy from
   * @param start the first position of the new string
   * @param len the number of bytes of the new string
   */
  public void set(byte[] utf8, int start, int len) {
    setCapacity(len, false);
    System.arraycopy(utf8, start, bytes, 0, len);
    this.length = len;
  }

  /**
   * Append a range of bytes to the end of the given text
   * @param utf8 the data to copy from
   * @param start the first position to append from utf8
   * @param len the number of bytes to append
   */
  public void append(byte[] utf8, int start, int len) {
    setCapacity(length + len, true);
    System.arraycopy(utf8, start, bytes, length, len);
    length += len;
  }

  /**
   * Clear the string to empty.
   *
   * <em>Note</em>: For performance reasons, this call does not clear the
   * underlying byte array that is retrievable via {@link #getBytes()}.
   * In order to free the byte-array memory, call {@link #set(byte[])}
   * with an empty byte array (For example, <code>new byte[0]</code>).
   */
  public void clear() {
    length = 0;
  }

  /*
   * Sets the capacity of this Text object to <em>at least</em>
   * <code>len</code> bytes. If the current buffer is longer,
   * then the capacity and existing content of the buffer are
   * unchanged. If <code>len</code> is larger
   * than the current capacity, the Text object's capacity is
   * increased to match.
   * @param len the number of bytes we need
   * @param keepData should the old data be kept
   */
  private void setCapacity(int len, boolean keepData) {
    if (bytes == null || bytes.length < len) {
      if (bytes != null && keepData) {
        bytes = Arrays.copyOf(bytes, Math.max(len,length << 1));
      } else {
        bytes = new byte[len];
      }
    }
  }
   
  /** 
   * Convert text back to string
   * @see java.lang.Object#toString()
   */
  @Override
  public String toString() {
    try {
      return decode(bytes, 0, length);
    } catch (CharacterCodingException e) { 
      throw new RuntimeException("Should not have happened " , e); 
    }
  }
  
  /** deserialize 
   */
  @Override
  public void readFields(DataInput in) throws IOException {
    int newLength = WritableUtils.readVInt(in);
    readWithKnownLength(in, newLength);
  }
  
  public void readFields(DataInput in, int maxLength) throws IOException {
    int newLength = WritableUtils.readVInt(in);
    if (newLength < 0) {
      throw new IOException("tried to deserialize " + newLength +
          " bytes of data!  newLength must be non-negative.");
    } else if (newLength >= maxLength) {
      throw new IOException("tried to deserialize " + newLength +
          " bytes of data, but maxLength = " + maxLength);
    }
    readWithKnownLength(in, newLength);
  }

  /** Skips over one Text in the input. */
  public static void skip(DataInput in) throws IOException {
    int length = WritableUtils.readVInt(in);
    WritableUtils.skipFully(in, length);
  }

  /**
   * Read a Text object whose length is already known.
   * This allows creating Text from a stream which uses a different serialization
   * format.
   */
  public void readWithKnownLength(DataInput in, int len) throws IOException {
    setCapacity(len, false);
    in.readFully(bytes, 0, len);
    length = len;
  }

  /** serialize
   * write this object to out
   * length uses zero-compressed encoding
   * @see Writable#write(DataOutput)
   */
  @Override
  public void write(DataOutput out) throws IOException {
    WritableUtils.writeVInt(out, length);
    out.write(bytes, 0, length);
  }

  public void write(DataOutput out, int maxLength) throws IOException {
    if (length > maxLength) {
      throw new IOException("data was too long to write!  Expected " +
          "less than or equal to " + maxLength + " bytes, but got " +
          length + " bytes.");
    }
    WritableUtils.writeVInt(out, length);
    out.write(bytes, 0, length);
  }

  /** Returns true iff <code>o</code> is a Text with the same contents.  */
  @Override
  public boolean equals(Object o) {
    if (o instanceof Text)
      return super.equals(o);
    return false;
  }

  @Override
  public int hashCode() {
    return super.hashCode();
  }

  /** A WritableComparator optimized for Text keys. */
  public static class Comparator extends WritableComparator {
    public Comparator() {
      super(Text.class);
    }

    @Override
    public int compare(byte[] b1, int s1, int l1,
                       byte[] b2, int s2, int l2) {
      int n1 = WritableUtils.decodeVIntSize(b1[s1]);
      int n2 = WritableUtils.decodeVIntSize(b2[s2]);
      return compareBytes(b1, s1+n1, l1-n1, b2, s2+n2, l2-n2);
    }
  }

  static {
    // register this comparator
    WritableComparator.define(Text.class, new Comparator());
  }

  /// STATIC UTILITIES FROM HERE DOWN
  /**
   * Converts the provided byte array to a String using the
   * UTF-8 encoding. If the input is malformed,
   * replace by a default value.
   */
  public static String decode(byte[] utf8) throws CharacterCodingException {
    return decode(ByteBuffer.wrap(utf8), true);
  }
  
  public static String decode(byte[] utf8, int start, int length) 
    throws CharacterCodingException {
    return decode(ByteBuffer.wrap(utf8, start, length), true);
  }
  
  /**
   * Converts the provided byte array to a String using the
   * UTF-8 encoding. If <code>replace</code> is true, then
   * malformed input is replaced with the
   * substitution character, which is U+FFFD. Otherwise the
   * method throws a MalformedInputException.
   */
  public static String decode(byte[] utf8, int start, int length, boolean replace) 
    throws CharacterCodingException {
    return decode(ByteBuffer.wrap(utf8, start, length), replace);
  }
  
  private static String decode(ByteBuffer utf8, boolean replace) 
    throws CharacterCodingException {
    CharsetDecoder decoder = DECODER_FACTORY.get();
    if (replace) {
      decoder.onMalformedInput(
          java.nio.charset.CodingErrorAction.REPLACE);
      decoder.onUnmappableCharacter(CodingErrorAction.REPLACE);
    }
    String str = decoder.decode(utf8).toString();
    // set decoder back to its default value: REPORT
    if (replace) {
      decoder.onMalformedInput(CodingErrorAction.REPORT);
      decoder.onUnmappableCharacter(CodingErrorAction.REPORT);
    }
    return str;
  }

  /**
   * Converts the provided String to bytes using the
   * UTF-8 encoding. If the input is malformed,
   * invalid chars are replaced by a default value.
   * @return ByteBuffer: bytes stores at ByteBuffer.array() 
   *                     and length is ByteBuffer.limit()
   */

  public static ByteBuffer encode(String string)
    throws CharacterCodingException {
    return encode(string, true);
  }

  /**
   * Converts the provided String to bytes using the
   * UTF-8 encoding. If <code>replace</code> is true, then
   * malformed input is replaced with the
   * substitution character, which is U+FFFD. Otherwise the
   * method throws a MalformedInputException.
   * @return ByteBuffer: bytes stores at ByteBuffer.array() 
   *                     and length is ByteBuffer.limit()
   */
  public static ByteBuffer encode(String string, boolean replace)
    throws CharacterCodingException {
    CharsetEncoder encoder = ENCODER_FACTORY.get();
    if (replace) {
      encoder.onMalformedInput(CodingErrorAction.REPLACE);
      encoder.onUnmappableCharacter(CodingErrorAction.REPLACE);
    }
    ByteBuffer bytes = 
      encoder.encode(CharBuffer.wrap(string.toCharArray()));
    if (replace) {
      encoder.onMalformedInput(CodingErrorAction.REPORT);
      encoder.onUnmappableCharacter(CodingErrorAction.REPORT);
    }
    return bytes;
  }

  static final public int DEFAULT_MAX_LEN = 1024 * 1024;

  /** Read a UTF8 encoded string from in
   */
  public static String readString(DataInput in) throws IOException {
    return readString(in, Integer.MAX_VALUE);
  }
  
  /** Read a UTF8 encoded string with a maximum size
   */
  public static String readString(DataInput in, int maxLength)
      throws IOException {
    int length = WritableUtils.readVIntInRange(in, 0, maxLength);
    byte [] bytes = new byte[length];
    in.readFully(bytes, 0, length);
    return decode(bytes);
  }
  
  /** Write a UTF8 encoded string to out
   */
  public static int writeString(DataOutput out, String s) throws IOException {
    ByteBuffer bytes = encode(s);
    int length = bytes.limit();
    WritableUtils.writeVInt(out, length);
    out.write(bytes.array(), 0, length);
    return length;
  }

  /** Write a UTF8 encoded string with a maximum size to out
   */
  public static int writeString(DataOutput out, String s, int maxLength)
      throws IOException {
    ByteBuffer bytes = encode(s);
    int length = bytes.limit();
    if (length > maxLength) {
      throw new IOException("string was too long to write!  Expected " +
          "less than or equal to " + maxLength + " bytes, but got " +
          length + " bytes.");
    }
    WritableUtils.writeVInt(out, length);
    out.write(bytes.array(), 0, length);
    return length;
  }

  ////// states for validateUTF8
  
  private static final int LEAD_BYTE = 0;

  private static final int TRAIL_BYTE_1 = 1;

  private static final int TRAIL_BYTE = 2;

  /** 
   * Check if a byte array contains valid utf-8
   * @param utf8 byte array
   * @throws MalformedInputException if the byte array contains invalid utf-8
   */
  public static void validateUTF8(byte[] utf8) throws MalformedInputException {
    validateUTF8(utf8, 0, utf8.length);     
  }
  
  /**
   * Check to see if a byte array is valid utf-8
   * @param utf8 the array of bytes
   * @param start the offset of the first byte in the array
   * @param len the length of the byte sequence
   * @throws MalformedInputException if the byte array contains invalid bytes
   */
  public static void validateUTF8(byte[] utf8, int start, int len)
    throws MalformedInputException {
    int count = start;
    int leadByte = 0;
    int length = 0;
    int state = LEAD_BYTE;
    while (count < start+len) {
      int aByte = utf8[count] & 0xFF;

      switch (state) {
      case LEAD_BYTE:
        leadByte = aByte;
        length = bytesFromUTF8[aByte];

        switch (length) {
        case 0: // check for ASCII
          if (leadByte > 0x7F)
            throw new MalformedInputException(count);
          break;
        case 1:
          if (leadByte < 0xC2 || leadByte > 0xDF)
            throw new MalformedInputException(count);
          state = TRAIL_BYTE_1;
          break;
        case 2:
          if (leadByte < 0xE0 || leadByte > 0xEF)
            throw new MalformedInputException(count);
          state = TRAIL_BYTE_1;
          break;
        case 3:
          if (leadByte < 0xF0 || leadByte > 0xF4)
            throw new MalformedInputException(count);
          state = TRAIL_BYTE_1;
          break;
        default:
          // too long! Longest valid UTF-8 is 4 bytes (lead + three)
          // or if < 0 we got a trail byte in the lead byte position
          throw new MalformedInputException(count);
        } // switch (length)
        break;

      case TRAIL_BYTE_1:
        if (leadByte == 0xF0 && aByte < 0x90)
          throw new MalformedInputException(count);
        if (leadByte == 0xF4 && aByte > 0x8F)
          throw new MalformedInputException(count);
        if (leadByte == 0xE0 && aByte < 0xA0)
          throw new MalformedInputException(count);
        if (leadByte == 0xED && aByte > 0x9F)
          throw new MalformedInputException(count);
        // falls through to regular trail-byte test!!
      case TRAIL_BYTE:
        if (aByte < 0x80 || aByte > 0xBF)
          throw new MalformedInputException(count);
        if (--length == 0) {
          state = LEAD_BYTE;
        } else {
          state = TRAIL_BYTE;
        }
        break;
      default:
        break;
      } // switch (state)
      count++;
    }
  }

  /**
   * Magic numbers for UTF-8. These are the number of bytes
   * that <em>follow</em> a given lead byte. Trailing bytes
   * have the value -1. The values 4 and 5 are presented in
   * this table, even though valid UTF-8 cannot include the
   * five and six byte sequences.
   */
  static final int[] bytesFromUTF8 =
  { 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
    0, 0, 0, 0, 0, 0, 0,
    // trail bytes
    -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
    -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
    -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
    -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, 1, 1, 1, 1, 1,
    1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
    1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 3,
    3, 3, 3, 3, 3, 3, 3, 4, 4, 4, 4, 5, 5, 5, 5 };

  /**
   * Returns the next code point at the current position in
   * the buffer. The buffer's position will be incremented.
   * Any mark set on this buffer will be changed by this method!
   */
  public static int bytesToCodePoint(ByteBuffer bytes) {
    bytes.mark();
    byte b = bytes.get();
    bytes.reset();
    int extraBytesToRead = bytesFromUTF8[(b & 0xFF)];
    if (extraBytesToRead < 0) return -1; // trailing byte!
    int ch = 0;

    switch (extraBytesToRead) {
    case 5: ch += (bytes.get() & 0xFF); ch <<= 6; /* remember, illegal UTF-8 */
    case 4: ch += (bytes.get() & 0xFF); ch <<= 6; /* remember, illegal UTF-8 */
    case 3: ch += (bytes.get() & 0xFF); ch <<= 6;
    case 2: ch += (bytes.get() & 0xFF); ch <<= 6;
    case 1: ch += (bytes.get() & 0xFF); ch <<= 6;
    case 0: ch += (bytes.get() & 0xFF);
    }
    ch -= offsetsFromUTF8[extraBytesToRead];

    return ch;
  }

  
  static final int offsetsFromUTF8[] =
  { 0x00000000, 0x00003080,
    0x000E2080, 0x03C82080, 0xFA082080, 0x82082080 };

  /**
   * For the given string, returns the number of UTF-8 bytes
   * required to encode the string.
   * @param string text to encode
   * @return number of UTF-8 bytes required to encode
   */
  public static int utf8Length(String string) {
    CharacterIterator iter = new StringCharacterIterator(string);
    char ch = iter.first();
    int size = 0;
    while (ch != CharacterIterator.DONE) {
      if ((ch >= 0xD800) && (ch < 0xDC00)) {
        // surrogate pair?
        char trail = iter.next();
        if ((trail > 0xDBFF) && (trail < 0xE000)) {
          // valid pair
          size += 4;
        } else {
          // invalid pair
          size += 3;
          iter.previous(); // rewind one
        }
      } else if (ch < 0x80) {
        size++;
      } else if (ch < 0x800) {
        size += 2;
      } else {
        // ch < 0x10000, that is, the largest char value
        size += 3;
      }
      ch = iter.next();
    }
    return size;
  }
}
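One consequence of the source above is worth illustrating: the backing array returned by getBytes() can be longer than getLength(), because set(byte[]) and clear() reuse the existing buffer. The class below is a hypothetical, heavily simplified stand-in named `MiniText` that mirrors only the buffer handling of set(byte[], int, int), append, clear, copyBytes, and setCapacity shown above. It is a sketch for illustration, not the Hadoop class.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class MiniText {
    private byte[] bytes = new byte[0];
    private int length;

    // Replaces the content; reuses the buffer when it is large enough
    void set(byte[] utf8) {
        ensureCapacity(utf8.length, false);
        System.arraycopy(utf8, 0, bytes, 0, utf8.length);
        length = utf8.length;
    }

    // Appends bytes, growing the buffer and keeping existing data if needed
    void append(byte[] utf8) {
        ensureCapacity(length + utf8.length, true);
        System.arraycopy(utf8, 0, bytes, length, utf8.length);
        length += utf8.length;
    }

    // Resets the logical length only; the buffer is kept
    void clear() {
        length = 0;
    }

    byte[] getBytes() { return bytes; }                          // raw buffer, may be longer than length
    int getLength() { return length; }                           // number of valid bytes
    byte[] copyBytes() { return Arrays.copyOf(bytes, length); }  // exact-length copy

    private void ensureCapacity(int len, boolean keepData) {
        if (bytes.length < len) {
            bytes = keepData
                ? Arrays.copyOf(bytes, Math.max(len, length << 1))
                : new byte[len];
        }
    }

    public static void main(String[] args) {
        MiniText t = new MiniText();
        t.set("hello".getBytes(StandardCharsets.UTF_8));
        t.set("hi".getBytes(StandardCharsets.UTF_8));
        // The buffer still has room for "hello"; only the first 2 bytes are valid
        System.out.println(t.getBytes().length + " " + t.getLength());
        System.out.println(new String(t.copyBytes(), StandardCharsets.UTF_8));
    }
}
```

Code that reads getBytes() should therefore always pair it with getLength(), or use copyBytes() when an exact-length array is needed.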

The methods implemented by the Text class are listed below:

 
Modifier and Type ║ Method ║ Description
void ║ append(byte[] utf8, int start, int len) ║ Append a range of bytes to the end of the given text
static int ║ bytesToCodePoint(ByteBuffer bytes) ║ Returns the next code point at the current position in the buffer.
int ║ charAt(int position) ║ Returns the Unicode Scalar Value (32-bit integer value) for the character at position.
void ║ clear() ║ Clear the string to empty.
byte[] ║ copyBytes() ║ Get a copy of the bytes that is exactly the length of the data.
static String ║ decode(byte[] utf8) ║ Converts the provided byte array to a String using the UTF-8 encoding.
static String ║ decode(byte[] utf8, int start, int length) ║
static String ║ decode(byte[] utf8, int start, int length, boolean replace) ║ Converts the provided byte array to a String using the UTF-8 encoding.
static ByteBuffer ║ encode(String string) ║ Converts the provided String to bytes using the UTF-8 encoding.
static ByteBuffer ║ encode(String string, boolean replace) ║ Converts the provided String to bytes using the UTF-8 encoding.
boolean ║ equals(Object o) ║ Returns true iff o is a Text with the same contents.
int ║ find(String what) ║
int ║ find(String what, int start) ║ Finds any occurrence of what in the backing buffer, starting at position start.
byte[] ║ getBytes() ║ Returns the raw bytes; however, only data up to getLength() is valid.
int ║ getLength() ║ Returns the number of bytes in the byte array
int ║ hashCode() ║ Return a hash of the bytes returned from getBytes().
void ║ readFields(DataInput in) ║ deserialize
void ║ readFields(DataInput in, int maxLength) ║
static String ║ readString(DataInput in) ║ Read a UTF8 encoded string from in
static String ║ readString(DataInput in, int maxLength) ║ Read a UTF8 encoded string with a maximum size
void ║ readWithKnownLength(DataInput in, int len) ║ Read a Text object whose length is already known.
void ║ set(byte[] utf8) ║ Set to a utf8 byte array
void ║ set(byte[] utf8, int start, int len) ║ Set the Text to range of bytes
void ║ set(String string) ║ Set to contain the contents of a string.
void ║ set(Text other) ║ copy a text.
static void ║ skip(DataInput in) ║ Skips over one Text in the input.
String ║ toString() ║ Convert text back to string
static int ║ utf8Length(String string) ║ For the given string, returns the number of UTF-8 bytes required to encode the string.
static void ║ validateUTF8(byte[] utf8) ║ Check if a byte array contains valid utf-8
static void ║ validateUTF8(byte[] utf8, int start, int len) ║ Check to see if a byte array is valid utf-8
void ║ write(DataOutput out) ║ serialize: write this object to out; length uses zero-compressed encoding
void ║ write(DataOutput out, int maxLength) ║
static int ║ writeString(DataOutput out, String s) ║ Write a UTF8 encoded string to out
static int ║ writeString(DataOutput out, String s, int maxLength) ║ Write a UTF8 encoded string with a maximum size to out


To verify the discussion above, a test example will be presented and analyzed next.


Reference:
http://hadoop.apache.org/docs/current/


