[hadoop2.7.1] I/O: A Step-by-Step Analysis of Text (Background and Comparison with String)

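Hadoop's Text class closely resembles Java's String class, and many of its methods have direct counterparts. Because the two serve different purposes, however, they also differ in important ways. Before analyzing Text, let us first review java.lang.String.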

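1. The String class in Java:

The String class represents character strings. All string literals in Java programs, such as "abc", are implemented as instances of this class.

Strings are constant; their values cannot be changed after they are created. String buffers support mutable strings. Because String objects are immutable, they can be shared. For example: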

     String str = "abc";
is equivalent to:
     char data[] = {'a', 'b', 'c'};
     String str = new String(data);
Here are some more examples of how strings can be used:

     System.out.println("abc");
     String cde = "cde";
     System.out.println("abc" + cde);
     String c = "abc".substring(2,3);
     String d = cde.substring(1, 2);
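The String class includes methods for examining individual characters of the sequence, comparing strings, searching strings, extracting substrings, and creating a copy of a string with all characters converted to uppercase or lowercase. Case mapping is based on the Unicode Standard version specified by the Character class.

The Java language provides special support for the string concatenation operator ("+") and for converting other objects to strings. In the JDK documentation of that era, string concatenation is described as implemented through the StringBuilder (or StringBuffer) class and its append method. String conversion is implemented through the toString method, which is defined by Object and inherited by all classes in Java. For more information on string concatenation and conversion, see The Java Language Specification by Gosling, Joy, and Steele.

Unless otherwise noted, passing a null argument to a constructor or method in this class causes a NullPointerException to be thrown.

A String represents a string in UTF-16 format, in which supplementary characters are represented by surrogate pairs (see the section "Unicode Character Representations" in the Character class for details). Index values refer to char code units, so a supplementary character occupies two positions in a String.

The String class provides methods for dealing with Unicode code points (that is, characters) as well as Unicode code units (that is, char values).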
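The difference between code units and code points is easiest to see with a supplementary character. The following is a minimal sketch using only standard java.lang methods; the class name CodeUnitDemo and its two helper methods are made up for illustration.

```java
class CodeUnitDemo {
    /** Number of UTF-16 code units, which is what String.length() reports. */
    static int codeUnits(String s) {
        return s.length();
    }

    /** Number of Unicode code points in the whole string. */
    static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        // U+20C30 lies outside the BMP, so it is stored as a surrogate pair.
        // It is built from its code point to avoid source-encoding issues.
        String s = "a" + new String(Character.toChars(0x20C30));

        System.out.println(codeUnits(s));   // 3 char values
        System.out.println(codePoints(s));  // 2 characters
        // The two char values at index 1 and 2 are the surrogate pair.
        System.out.printf("%04X %04X%n", (int) s.charAt(1), (int) s.charAt(2));
        // codePointAt reassembles the pair into the original code point.
        System.out.printf("%X%n", s.codePointAt(1));
    }
}
```

Index-based methods such as charAt and substring work on char code units, so code that must handle supplementary characters should prefer the code point methods.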

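2. Unicode

Both Hadoop's Text class and Java's String class use standard Unicode, but they differ in encoding: Text stores its data as standard UTF-8 (as the class's own Javadoc states), whereas String uses UTF-16. The rest of this section describes Unicode in more detail.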

Unicode

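Unicode (also known as the universal character set) is a character encoding standard used on computers. It was created to overcome the limitations of traditional character encoding schemes: it assigns every character of every language a single, unique number, so that text can be converted and processed across languages and platforms. Unicode involves two steps. The first is a specification that assigns each character a unique number (its code point). The second is deciding how that number is stored in a computer, and only this second step determines how many bytes a character actually occupies.

A common simplification is to say that Unicode represents all characters with numbers from 0 to 65535 (65536 is 2 to the 16th power). That range is only the Basic Multilingual Plane (BMP); the full code space runs from 0 to 0x10FFFF. The 128 characters numbered 0 to 127 are exactly the same as in ASCII. That is the first step. The second step is turning these numbers into bit strings stored in the computer, and there is more than one way to do that. This is where the UTF (Unicode Transformation Format) family comes in: UTF-8, UTF-16, and UTF-32.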

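UTF-8, UTF-16, UTF-32

So what exactly are the differences between UTF-8, UTF-16, and UTF-32?

UTF-16 is the easiest to understand: every character in the BMP is stored in two bytes, and characters beyond the BMP take four bytes (a surrogate pair). A common misconception is to equate Unicode with UTF-16. For text made up entirely of English letters, two bytes per character is clearly wasteful: why use two bytes when one is enough?

Hence UTF-8. The 8 is easy to misread: it names the 8-bit code unit and does not mean that every character takes one byte. UTF-8 is a variable-length encoding: a character takes one, two, three, or four bytes, never more than four, depending on the magnitude of its code point. Unicode characters that are also ASCII characters are encoded as a single byte that is identical to their ASCII representation. All other Unicode characters need at least two bytes. In a multi-byte sequence, the first byte begins with n consecutive 1 bits followed by a 0 bit, where n is the number of bytes in the sequence, and every following byte begins with the bits 10.

The trade-off between UTF-8 and UTF-16 is now easy to see: if the text is all English, or English mixed with other scripts but predominantly English, UTF-8 saves a lot of space compared with UTF-16. For text dominated by CJK characters the reverse holds, because each such character takes three bytes in UTF-8 but two in UTF-16.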

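For example, the code points of "汉字" are 0x6C49 and 0x5B57, and the encoded program data is:
    BYTE data_utf8[] = {0xE6, 0xB1, 0x89, 0xE5, 0xAD, 0x97}; // UTF-8 encoding
    WORD data_utf16[] = {0x6c49, 0x5b57}; // UTF-16 encoding
    DWORD data_utf32[] = {0x6c49, 0x5b57}; // UTF-32 encoding
Here BYTE, WORD, and DWORD denote unsigned 8-bit, 16-bit, and 32-bit integers respectively. UTF-8, UTF-16, and UTF-32 use BYTE, WORD, and DWORD as their code units. The UTF-8 encoding of "汉字" takes 6 bytes. Its UTF-16 encoding takes two WORDs, or 4 bytes. Its UTF-32 encoding takes two DWORDs, or 8 bytes. Depending on byte order, UTF-16 can be realized as UTF-16LE or UTF-16BE, and UTF-32 as UTF-32LE or UTF-32BE.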
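The byte counts above can be checked with the JDK's own charsets. The sketch below is only an illustration: the class EncodingSizeDemo and its helpers are invented names, and it assumes the running JDK provides a charset named "UTF-32BE" (OpenJDK does, but it is not one of the constants in StandardCharsets).

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class EncodingSizeDemo {
    /** Formats a byte array as space-separated uppercase hex pairs. */
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    /** Encodes the string with the given charset and returns the bytes as hex. */
    static String encodeHex(String s, Charset cs) {
        return hex(s.getBytes(cs));
    }

    public static void main(String[] args) {
        // The two characters with code points 0x6C49 and 0x5B57.
        String s = "\u6C49\u5B57";
        // Big-endian variants are used so that no byte order mark is emitted.
        System.out.println(encodeHex(s, StandardCharsets.UTF_8));      // 6 bytes
        System.out.println(encodeHex(s, StandardCharsets.UTF_16BE));   // 4 bytes
        System.out.println(encodeHex(s, Charset.forName("UTF-32BE"))); // 8 bytes
    }
}
```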

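The following sections introduce UTF-8, UTF-16, and UTF-32 in turn.

UTF-8 (note: Text uses standard UTF-8 encoding)

UTF-8 encodes Unicode in units of bytes. The mapping from Unicode to UTF-8 is as follows:
    Unicode code point (hex) ║ UTF-8 byte stream (binary)
    000000 - 00007F ║ 0xxxxxxx
    000080 - 0007FF ║ 110xxxxx 10xxxxxx
    000800 - 00FFFF ║ 1110xxxx 10xxxxxx 10xxxxxx
    010000 - 10FFFF ║ 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
The defining feature of UTF-8 is that characters in different ranges are encoded with different lengths. For characters between 0x00 and 0x7F, the UTF-8 encoding is identical to ASCII. The maximum length of a UTF-8 encoding is 4 bytes. As the table shows, the 4-byte template has 21 x positions, so it can hold a 21-bit binary number, and the largest Unicode code point, 0x10FFFF, needs only 21 bits.
Example 1: The Unicode code point of "汉" is 0x6C49. Because 0x6C49 lies between 0x0800 and 0xFFFF, the 3-byte template applies: 1110xxxx 10xxxxxx 10xxxxxx. Written in binary, 0x6C49 is 0110 1100 0100 1001. Substituting these bits in order for the x positions gives 11100110 10110001 10001001, that is, E6 B1 89.
Example 2: The Unicode code point 0x20C30 lies between 0x010000 and 0x10FFFF, so the 4-byte template applies: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx. Written as a 21-bit binary number (padded with leading zeros if shorter), 0x20C30 is 0 0010 0000 1100 0011 0000. Substituting these bits in order for the x positions gives 11110000 10100000 10110000 10110000, that is, F0 A0 B0 B0.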
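The template-filling procedure from the two examples can be written out with shifts and masks. This is a minimal hand-rolled sketch for illustration, not how the JDK or Hadoop encodes text; the class name Utf8EncodeSketch is hypothetical, and the main method cross-checks the result against the JDK's own UTF-8 encoder.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class Utf8EncodeSketch {
    /** Encodes one Unicode code point into UTF-8 by filling the bit templates. */
    static byte[] encode(int cp) {
        if (cp < 0 || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) {
            throw new IllegalArgumentException("not a Unicode scalar value: " + cp);
        }
        if (cp < 0x80) {          // 0xxxxxxx
            return new byte[] {(byte) cp};
        }
        if (cp < 0x800) {         // 110xxxxx 10xxxxxx
            return new byte[] {
                (byte) (0xC0 | (cp >> 6)),
                (byte) (0x80 | (cp & 0x3F))};
        }
        if (cp < 0x10000) {       // 1110xxxx 10xxxxxx 10xxxxxx
            return new byte[] {
                (byte) (0xE0 | (cp >> 12)),
                (byte) (0x80 | ((cp >> 6) & 0x3F)),
                (byte) (0x80 | (cp & 0x3F))};
        }
        return new byte[] {       // 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
            (byte) (0xF0 | (cp >> 18)),
            (byte) (0x80 | ((cp >> 12) & 0x3F)),
            (byte) (0x80 | ((cp >> 6) & 0x3F)),
            (byte) (0x80 | (cp & 0x3F))};
    }

    public static void main(String[] args) {
        for (int cp : new int[] {0x41, 0x6C49, 0x20C30}) {
            byte[] mine = encode(cp);
            byte[] jdk = new String(Character.toChars(cp)).getBytes(StandardCharsets.UTF_8);
            StringBuilder hex = new StringBuilder();
            for (byte b : mine) {
                hex.append(String.format("%02X ", b & 0xFF));
            }
            System.out.printf("U+%X -> %s(matches JDK: %b)%n",
                    cp, hex, Arrays.equals(mine, jdk));
        }
    }
}
```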

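UTF-16

UTF-16 encodes in units of 16-bit unsigned integers. Let U denote the Unicode code point. The encoding rules are:
If U < 0x10000, the UTF-16 encoding of U is the 16-bit unsigned integer equal to U (for brevity, a 16-bit unsigned integer is called a WORD below).
If U ≥ 0x10000, first compute U' = U - 0x10000, then write U' in binary as yyyy yyyy yyxx xxxx xxxx. The UTF-16 encoding of U (in binary) is 110110yyyyyyyyyy 110111xxxxxxxxxx.
Why can U' always be written in 20 bits? The largest Unicode code point is 0x10FFFF; subtracting 0x10000 gives a maximum U' of 0xFFFFF, which fits in 20 bits. For example, for the code point 0x20C30, subtracting 0x10000 gives 0x10C30, which in binary is 0001 0000 1100 0011 0000. Substituting the first 10 bits for the y positions and the last 10 bits for the x positions yields 1101100001000011 1101110000110000, that is, 0xD843 0xDC30.
Under these rules, the UTF-16 encoding of any code point from 0x10000 to 0x10FFFF consists of two WORDs: the top 6 bits of the first WORD are 110110 and the top 6 bits of the second are 110111. The first WORD therefore ranges (in binary) from 11011000 00000000 to 11011011 11111111, that is, 0xD800-0xDBFF. The second WORD ranges from 11011100 00000000 to 11011111 11111111, that is, 0xDC00-0xDFFF.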
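The encoding rule can be expressed in a few lines of Java. The sketch below uses a hypothetical class name, SurrogateEncodeSketch, and compares its result with the standard Character.toChars method, which performs the same conversion.

```java
import java.util.Arrays;

class SurrogateEncodeSketch {
    /** Encodes one code point as one or two UTF-16 code units (WORDs). */
    static char[] toUtf16(int cp) {
        if (cp < 0 || cp > 0x10FFFF) {
            throw new IllegalArgumentException("outside the Unicode range: " + cp);
        }
        if (cp < 0x10000) {
            // A BMP code point is its own 16-bit value.
            return new char[] {(char) cp};
        }
        int u = cp - 0x10000;                      // 20-bit value U'
        char high = (char) (0xD800 | (u >> 10));   // 110110 + top 10 bits
        char low = (char) (0xDC00 | (u & 0x3FF));  // 110111 + bottom 10 bits
        return new char[] {high, low};
    }

    public static void main(String[] args) {
        char[] pair = toUtf16(0x20C30);
        System.out.printf("%04X %04X%n", (int) pair[0], (int) pair[1]);
        // The JDK's own conversion should agree with the manual one.
        System.out.println(Arrays.equals(pair, Character.toChars(0x20C30)));
    }
}
```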
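To distinguish one-WORD UTF-16 encodings from two-WORD ones, the designers of Unicode reserved the range 0xD800-0xDFFF and called it the surrogate area:
    D800-DB7F ║ High Surrogates
    DB80-DBFF ║ High Private Use Surrogates
    DC00-DFFF ║ Low Surrogates
A high surrogate is a code point in the range used for the first WORD of a two-WORD UTF-16 encoding. A low surrogate is a code point in the range used for the second WORD. What, then, is a high private use surrogate? Answering this question also shows how to derive a Unicode code point from its UTF-16 encoding.
If the first WORD of a character's UTF-16 encoding lies between 0xDB80 and 0xDBFF, in what range is its Unicode code point? The second WORD ranges over 0xDC00-0xDFFF, so the UTF-16 encoding of such a character ranges from 0xDB80 0xDC00 to 0xDBFF 0xDFFF. Written in binary, this range is:
    1101101110000000 1101110000000000 - 1101101111111111 1101111111111111
Reversing the encoding steps, take the low 10 bits of the high and low WORDs and concatenate them:
    1110 0000 0000 0000 0000 - 1111 1111 1111 1111 1111
This is 0xE0000-0xFFFFF. Continuing to reverse the encoding, add 0x10000 to get 0xF0000-0x10FFFF. This is the range of Unicode code points whose first UTF-16 WORD lies between 0xDB80 and 0xDBFF, namely planes 15 and 16. Because the Unicode standard designates planes 15 and 16 as private use areas, the reserved code points from 0xDB80 to 0xDBFF are called high private use surrogates.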
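The reverse derivation can be checked with a short sketch. The class name SurrogateDecodeSketch is made up for illustration; Character.toCodePoint is the standard JDK method that performs the same calculation and is used here only as a cross-check.

```java
class SurrogateDecodeSketch {
    /** Rebuilds a code point from a high and a low surrogate. */
    static int toCodePoint(char high, char low) {
        // Take the low 10 bits of each WORD, join them, then add back 0x10000.
        return (((high & 0x3FF) << 10) | (low & 0x3FF)) + 0x10000;
    }

    public static void main(String[] args) {
        // The two ends of the high private use surrogate range.
        int first = toCodePoint((char) 0xDB80, (char) 0xDC00);
        int last = toCodePoint((char) 0xDBFF, (char) 0xDFFF);
        System.out.printf("%X - %X%n", first, last);
        // Cross-check one value against the JDK.
        System.out.println(first == Character.toCodePoint((char) 0xDB80, (char) 0xDC00));
    }
}
```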

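UTF-32

UTF-32 encodes in units of 32-bit unsigned integers. The UTF-32 encoding of a Unicode code point is simply the corresponding 32-bit unsigned integer.
Byte order
Depending on byte order, UTF-16 can be realized as UTF-16LE or UTF-16BE, and UTF-32 as UTF-32LE or UTF-32BE. For example:
    Unicode code point ║ UTF-16LE ║ UTF-16BE ║ UTF-32LE ║ UTF-32BE
    0x006C49 ║ 49 6C ║ 6C 49 ║ 49 6C 00 00 ║ 00 00 6C 49
    0x020C30 ║ 43 D8 30 DC ║ D8 43 DC 30 ║ 30 0C 02 00 ║ 00 02 0C 30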
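Byte order is simply the order in which the bytes of each 16-bit or 32-bit code unit are written out. The sketch below reproduces entries of the table with java.nio.ByteBuffer; the class ByteOrderDemo and its helper methods are illustrative names, not part of any library.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

class ByteOrderDemo {
    /** Serializes one 16-bit UTF-16 code unit in the given byte order. */
    static byte[] utf16Unit(char unit, ByteOrder order) {
        return ByteBuffer.allocate(2).order(order).putChar(unit).array();
    }

    /** Serializes one code point as a 32-bit UTF-32 unit in the given byte order. */
    static byte[] utf32(int codePoint, ByteOrder order) {
        return ByteBuffer.allocate(4).order(order).putInt(codePoint).array();
    }

    /** Formats bytes as space-separated uppercase hex pairs. */
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(hex(utf16Unit((char) 0x6C49, ByteOrder.LITTLE_ENDIAN)));
        System.out.println(hex(utf16Unit((char) 0x6C49, ByteOrder.BIG_ENDIAN)));
        System.out.println(hex(utf32(0x20C30, ByteOrder.LITTLE_ENDIAN)));
        System.out.println(hex(utf32(0x20C30, ByteOrder.BIG_ENDIAN)));
    }
}
```

For a supplementary character in UTF-16, each of the two surrogate WORDs is byte-swapped individually; the order of the two WORDs themselves does not change.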

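UTF-32 uses exactly 32 bits for every Unicode code point. Because it spends 4 bytes on every character, it is very inefficient in terms of space. In particular, characters outside the Basic Multilingual Plane are so rare in most documents that they can be ignored when estimating size, so UTF-32 text is typically two to four times as large as the same text in the other encodings. Although a fixed number of bytes per code point looks convenient, UTF-32 is not as widely used as the other Unicode encodings.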

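The Text class

Text provides methods to serialize, deserialize, and compare text at the byte level. Its length is an int, serialized in a zero-compressed (variable-length) format. In addition, it supports traversing the string without converting the underlying byte array to a String.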
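One practical consequence of comparing at the byte level is that the resulting order can differ from String.compareTo, which compares UTF-16 code units. The sketch below does not use Hadoop at all: it assumes that a byte-level comparison means an unsigned, lexicographic comparison of the UTF-8 bytes, and implements that assumption by hand in a hypothetical class named Utf8OrderDemo.

```java
import java.nio.charset.StandardCharsets;

class Utf8OrderDemo {
    /** Compares two strings by the unsigned lexicographic order of their UTF-8 bytes. */
    static int compareUtf8(String a, String b) {
        byte[] x = a.getBytes(StandardCharsets.UTF_8);
        byte[] y = b.getBytes(StandardCharsets.UTF_8);
        int n = Math.min(x.length, y.length);
        for (int i = 0; i < n; i++) {
            int diff = (x[i] & 0xFF) - (y[i] & 0xFF);
            if (diff != 0) {
                return diff;
            }
        }
        return x.length - y.length;
    }

    public static void main(String[] args) {
        String bmp = "\uFF5E";                                    // U+FF5E, UTF-8: EF BD 9E
        String supp = new String(Character.toChars(0x10000));     // U+10000, UTF-8: F0 90 80 80

        // UTF-16 order: 0xFF5E is greater than the high surrogate 0xD800.
        System.out.println(bmp.compareTo(supp) > 0);
        // UTF-8 byte order: 0xEF is less than 0xF0, so the order flips.
        System.out.println(compareUtf8(bmp, supp) < 0);
    }
}
```

UTF-8 byte order always agrees with code point order, whereas UTF-16 code unit order does not once supplementary characters are involved.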

The Text source code in hadoop 2.7.1:

/**
 * Licensed to the Apache Software Foundation (ASF) under one
 * or more contributor license agreements.  See the NOTICE file
 * distributed with this work for additional information
 * regarding copyright ownership.  The ASF licenses this file
 * to you under the Apache License, Version 2.0 (the
 * "License"); you may not use this file except in compliance
 * with the License.  You may obtain a copy of the License at
 *
 *     http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */

package org.apache.hadoop.io;

import java.io.IOException;
import java.io.DataInput;
import java.io.DataOutput;
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.MalformedInputException;
import java.text.CharacterIterator;
import java.text.StringCharacterIterator;
import java.util.Arrays;

import org.apache.avro.reflect.Stringable;

import org.apache.hadoop.classification.InterfaceAudience;
import org.apache.hadoop.classification.InterfaceStability;

/** This class stores text using standard UTF8 encoding.  It provides methods
 * to serialize, deserialize, and compare texts at byte level.  The type of
 * length is integer and is serialized using zero-compressed format.  <p>In
 * addition, it provides methods for string traversal without converting the
 * byte array to a string.  <p>Also includes utilities for
 * serializing/deserialing a string, coding/decoding a string, checking if a
 * byte array contains valid UTF8 code, calculating the length of an encoded
 * string.
 */
@Stringable
@InterfaceAudience.Public
@InterfaceStability.Stable
public class Text extends BinaryComparable
    implements WritableComparable<BinaryComparable> {
  
  private static ThreadLocal<CharsetEncoder> ENCODER_FACTORY =
    new ThreadLocal<CharsetEncoder>() {
      @Override
      protected CharsetEncoder initialValue() {
        return Charset.forName("UTF-8").newEncoder().
               onMalformedInput(CodingErrorAction.REPORT).
               onUnmappableCharacter(CodingErrorAction.REPORT);
    }
  };
  
  private static ThreadLocal<CharsetDecoder> DECODER_FACTORY =
    new ThreadLocal<CharsetDecoder>() {
    @Override
    protected CharsetDecoder initialValue() {
      return Charset.forName("UTF-8").newDecoder().
             onMalformedInput(CodingErrorAction.REPORT).
             onUnmappableCharacter(CodingErrorAction.REPORT);
    }
  };
  
  private static final byte [] EMPTY_BYTES = new byte[0];
  
  private byte[] bytes;
  private int length;

  public Text() {
    bytes = EMPTY_BYTES;
  }

  /** Construct from a string. 
   */
  public Text(String string) {
    set(string);
  }

  /** Construct from another text. */
  public Text(Text utf8) {
    set(utf8);
  }

  /** Construct from a byte array.
   */
  public Text(byte[] utf8)  {
    set(utf8);
  }
  
  /**
   * Get a copy of the bytes that is exactly the length of the data.
   * See {@link #getBytes()} for faster access to the underlying array.
   */
  public byte[] copyBytes() {
    byte[] result = new byte[length];
    System.arraycopy(bytes, 0, result, 0, length);
    return result;
  }
  
  /**
   * Returns the raw bytes; however, only data up to {@link #getLength()} is
   * valid. Please use {@link #copyBytes()} if you
   * need the returned array to be precisely the length of the data.
   */
  @Override
  public byte[] getBytes() {
    return bytes;
  }

  /** Returns the number of bytes in the byte array */ 
  @Override
  public int getLength() {
    return length;
  }
  
  /**
   * Returns the Unicode Scalar Value (32-bit integer value)
   * for the character at <code>position</code>. Note that this
   * method avoids using the converter or doing String instantiation
   * @return the Unicode scalar value at position or -1
   *          if the position is invalid or points to a
   *          trailing byte
   */
  public int charAt(int position) {
    if (position > this.length) return -1; // too long
    if (position < 0) return -1; // duh.
      
    ByteBuffer bb = (ByteBuffer)ByteBuffer.wrap(bytes).position(position);
    return bytesToCodePoint(bb.slice());
  }
  
  public int find(String what) {
    return find(what, 0);
  }
  
  /**
   * Finds any occurence of <code>what</code> in the backing
   * buffer, starting as position <code>start</code>. The starting
   * position is measured in bytes and the return value is in
   * terms of byte position in the buffer. The backing buffer is
   * not converted to a string for this operation.
   * @return byte position of the first occurence of the search
   *         string in the UTF-8 buffer or -1 if not found
   */
  public int find(String what, int start) {
    try {
      ByteBuffer src = ByteBuffer.wrap(this.bytes,0,this.length);
      ByteBuffer tgt = encode(what);
      byte b = tgt.get();
      src.position(start);
          
      while (src.hasRemaining()) {
        if (b == src.get()) { // matching first byte
          src.mark(); // save position in loop
          tgt.mark(); // save position in target
          boolean found = true;
          int pos = src.position()-1;
          while (tgt.hasRemaining()) {
            if (!src.hasRemaining()) { // src expired first
              tgt.reset();
              src.reset();
              found = false;
              break;
            }
            if (!(tgt.get() == src.get())) {
              tgt.reset();
              src.reset();
              found = false;
              break; // no match
            }
          }
          if (found) return pos;
        }
      }
      return -1; // not found
    } catch (CharacterCodingException e) {
      // can't get here
      e.printStackTrace();
      return -1;
    }
  }  
  /** Set to contain the contents of a string. 
   */
  public void set(String string) {
    try {
      ByteBuffer bb = encode(string, true);
      bytes = bb.array();
      length = bb.limit();
    }catch(CharacterCodingException e) {
      throw new RuntimeException("Should not have happened ", e); 
    }
  }

  /** Set to a utf8 byte array
   */
  public void set(byte[] utf8) {
    set(utf8, 0, utf8.length);
  }
  
  /** copy a text. */
  public void set(Text other) {
    set(other.getBytes(), 0, other.getLength());
  }

  /**
   * Set the Text to range of bytes
   * @param utf8 the data to copy from
   * @param start the first position of the new string
   * @param len the number of bytes of the new string
   */
  public void set(byte[] utf8, int start, int len) {
    setCapacity(len, false);
    System.arraycopy(utf8, start, bytes, 0, len);
    this.length = len;
  }

  /**
   * Append a range of bytes to the end of the given text
   * @param utf8 the data to copy from
   * @param start the first position to append from utf8
   * @param len the number of bytes to append
   */
  public void append(byte[] utf8, int start, int len) {
    setCapacity(length + len, true);
    System.arraycopy(utf8, start, bytes, length, len);
    length += len;
  }

  /**
   * Clear the string to empty.
   *
   * <em>Note</em>: For performance reasons, this call does not clear the
   * underlying byte array that is retrievable via {@link #getBytes()}.
   * In order to free the byte-array memory, call {@link #set(byte[])}
   * with an empty byte array (For example, <code>new byte[0]</code>).
   */
  public void clear() {
    length = 0;
  }

  /*
   * Sets the capacity of this Text object to <em>at least</em>
   * <code>len</code> bytes. If the current buffer is longer,
   * then the capacity and existing content of the buffer are
   * unchanged. If <code>len</code> is larger
   * than the current capacity, the Text object's capacity is
   * increased to match.
   * @param len the number of bytes we need
   * @param keepData should the old data be kept
   */
  private void setCapacity(int len, boolean keepData) {
    if (bytes == null || bytes.length < len) {
      if (bytes != null && keepData) {
        bytes = Arrays.copyOf(bytes, Math.max(len,length << 1));
      } else {
        bytes = new byte[len];
      }
    }
  }
   
  /** 
   * Convert text back to string
   * @see java.lang.Object#toString()
   */
  @Override
  public String toString() {
    try {
      return decode(bytes, 0, length);
    } catch (CharacterCodingException e) { 
      throw new RuntimeException("Should not have happened " , e); 
    }
  }
  
  /** deserialize 
   */
  @Override
  public void readFields(DataInput in) throws IOException {
    int newLength = WritableUtils.readVInt(in);
    readWithKnownLength(in, newLength);
  }
  
  public void readFields(DataInput in, int maxLength) throws IOException {
    int newLength = WritableUtils.readVInt(in);
    if (newLength < 0) {
      throw new IOException("tried to deserialize " + newLength +
          " bytes of data!  newLength must be non-negative.");
    } else if (newLength >= maxLength) {
      throw new IOException("tried to deserialize " + newLength +
          " bytes of data, but maxLength = " + maxLength);
    }
    readWithKnownLength(in, newLength);
  }

  /** Skips over one Text in the input. */
  public static void skip(DataInput in) throws IOException {
    int length = WritableUtils.readVInt(in);
    WritableUtils.skipFully(in, length);
  }

  /**
   * Read a Text object whose length is already known.
   * This allows creating Text from a stream which uses a different serialization
   * format.
   */
  public void readWithKnownLength(DataInput in, int len) throws IOException {
    setCapacity(len, false);
    in.readFully(bytes, 0, len);
    length = len;
  }

  /** serialize
   * write this object to out
   * length uses zero-compressed encoding
   * @see Writable#write(DataOutput)
   */
  @Override
  public void write(DataOutput out) throws IOException {
    WritableUtils.writeVInt(out, length);
    out.write(bytes, 0, length);
  }

  public void write(DataOutput out, int maxLength) throws IOException {
    if (length > maxLength) {
      throw new IOException("data was too long to write!  Expected " +
          "less than or equal to " + maxLength + " bytes, but got " +
          length + " bytes.");
    }
    WritableUtils.writeVInt(out, length);
    out.write(bytes, 0, length);
  }

  /** Returns true iff <code>o</code> is a Text with the same contents.  */
  @Override
  public boolean equals(Object o) {
    if (o instanceof Text)
      return super.equals(o);
    return false;
  }

  @Override
  public int hashCode() {
    return super.hashCode();
  }

  /** A WritableComparator optimized for Text keys. */
  public static class Comparator extends WritableComparator {
    public Comparator() {
      super(Text.class);
    }

    @Override
    public int compare(byte[] b1, int s1, int l1,
                       byte[] b2, int s2, int l2) {
      int n1 = WritableUtils.decodeVIntSize(b1[s1]);
      int n2 = WritableUtils.decodeVIntSize(b2[s2]);
      return compareBytes(b1, s1+n1, l1-n1, b2, s2+n2, l2-n2);
    }
  }

  static {
    // register this comparator
    WritableComparator.define(Text.class, new Comparator());
  }

  /// STATIC UTILITIES FROM HERE DOWN
  /**
   * Converts the provided byte array to a String using the
   * UTF-8 encoding. If the input is malformed,
   * replace by a default value.
   */
  public static String decode(byte[] utf8) throws CharacterCodingException {
    return decode(ByteBuffer.wrap(utf8), true);
  }
  
  public static String decode(byte[] utf8, int start, int length) 
    throws CharacterCodingException {
    return decode(ByteBuffer.wrap(utf8, start, length), true);
  }
  
  /**
   * Converts the provided byte array to a String using the
   * UTF-8 encoding. If <code>replace</code> is true, then
   * malformed input is replaced with the
   * substitution character, which is U+FFFD. Otherwise the
   * method throws a MalformedInputException.
   */
  public static String decode(byte[] utf8, int start, int length, boolean replace) 
    throws CharacterCodingException {
    return decode(ByteBuffer.wrap(utf8, start, length), replace);
  }
  
  private static String decode(ByteBuffer utf8, boolean replace) 
    throws CharacterCodingException {
    CharsetDecoder decoder = DECODER_FACTORY.get();
    if (replace) {
      decoder.onMalformedInput(
          java.nio.charset.CodingErrorAction.REPLACE);
      decoder.onUnmappableCharacter(CodingErrorAction.REPLACE);
    }
    String str = decoder.decode(utf8).toString();
    // set decoder back to its default value: REPORT
    if (replace) {
      decoder.onMalformedInput(CodingErrorAction.REPORT);
      decoder.onUnmappableCharacter(CodingErrorAction.REPORT);
    }
    return str;
  }

  /**
   * Converts the provided String to bytes using the
   * UTF-8 encoding. If the input is malformed,
   * invalid chars are replaced by a default value.
   * @return ByteBuffer: bytes stores at ByteBuffer.array() 
   *                     and length is ByteBuffer.limit()
   */

  public static ByteBuffer encode(String string)
    throws CharacterCodingException {
    return encode(string, true);
  }

  /**
   * Converts the provided String to bytes using the
   * UTF-8 encoding. If <code>replace</code> is true, then
   * malformed input is replaced with the
   * substitution character, which is U+FFFD. Otherwise the
   * method throws a MalformedInputException.
   * @return ByteBuffer: bytes stores at ByteBuffer.array() 
   *                     and length is ByteBuffer.limit()
   */
  public static ByteBuffer encode(String string, boolean replace)
    throws CharacterCodingException {
    CharsetEncoder encoder = ENCODER_FACTORY.get();
    if (replace) {
      encoder.onMalformedInput(CodingErrorAction.REPLACE);
      encoder.onUnmappableCharacter(CodingErrorAction.REPLACE);
    }
    ByteBuffer bytes = 
      encoder.encode(CharBuffer.wrap(string.toCharArray()));
    if (replace) {
      encoder.onMalformedInput(CodingErrorAction.REPORT);
      encoder.onUnmappableCharacter(CodingErrorAction.REPORT);
    }
    return bytes;
  }

  static final public int DEFAULT_MAX_LEN = 1024 * 1024;

  /** Read a UTF8 encoded string from in
   */
  public static String readString(DataInput in) throws IOException {
    return readString(in, Integer.MAX_VALUE);
  }
  
  /** Read a UTF8 encoded string with a maximum size
   */
  public static String readString(DataInput in, int maxLength)
      throws IOException {
    int length = WritableUtils.readVIntInRange(in, 0, maxLength);
    byte [] bytes = new byte[length];
    in.readFully(bytes, 0, length);
    return decode(bytes);
  }
  
  /** Write a UTF8 encoded string to out
   */
  public static int writeString(DataOutput out, String s) throws IOException {
    ByteBuffer bytes = encode(s);
    int length = bytes.limit();
    WritableUtils.writeVInt(out, length);
    out.write(bytes.array(), 0, length);
    return length;
  }

  /** Write a UTF8 encoded string with a maximum size to out
   */
  public static int writeString(DataOutput out, String s, int maxLength)
      throws IOException {
    ByteBuffer bytes = encode(s);
    int length = bytes.limit();
    if (length > maxLength) {
      throw new IOException("string was too long to write!  Expected " +
          "less than or equal to " + maxLength + " bytes, but got " +
          length + " bytes.");
    }
    WritableUtils.writeVInt(out, length);
    out.write(bytes.array(), 0, length);
    return length;
  }

  ////// states for validateUTF8
  
  private static final int LEAD_BYTE = 0;

  private static final int TRAIL_BYTE_1 = 1;

  private static final int TRAIL_BYTE = 2;

  /** 
   * Check if a byte array contains valid utf-8
   * @param utf8 byte array
   * @throws MalformedInputException if the byte array contains invalid utf-8
   */
  public static void validateUTF8(byte[] utf8) throws MalformedInputException {
    validateUTF8(utf8, 0, utf8.length);     
  }
  
  /**
   * Check to see if a byte array is valid utf-8
   * @param utf8 the array of bytes
   * @param start the offset of the first byte in the array
   * @param len the length of the byte sequence
   * @throws MalformedInputException if the byte array contains invalid bytes
   */
  public static void validateUTF8(byte[] utf8, int start, int len)
    throws MalformedInputException {
    int count = start;
    int leadByte = 0;
    int length = 0;
    int state = LEAD_BYTE;
    while (count < start+len) {
      int aByte = utf8[count] & 0xFF;

      switch (state) {
      case LEAD_BYTE:
        leadByte = aByte;
        length = bytesFromUTF8[aByte];

        switch (length) {
        case 0: // check for ASCII
          if (leadByte > 0x7F)
            throw new MalformedInputException(count);
          break;
        case 1:
          if (leadByte < 0xC2 || leadByte > 0xDF)
            throw new MalformedInputException(count);
          state = TRAIL_BYTE_1;
          break;
        case 2:
          if (leadByte < 0xE0 || leadByte > 0xEF)
            throw new MalformedInputException(count);
          state = TRAIL_BYTE_1;
          break;
        case 3:
          if (leadByte < 0xF0 || leadByte > 0xF4)
            throw new MalformedInputException(count);
          state = TRAIL_BYTE_1;
          break;
        default:
          // too long! Longest valid UTF-8 is 4 bytes (lead + three)
          // or if < 0 we got a trail byte in the lead byte position
          throw new MalformedInputException(count);
        } // switch (length)
        break;

      case TRAIL_BYTE_1:
        if (leadByte == 0xF0 && aByte < 0x90)
          throw new MalformedInputException(count);
        if (leadByte == 0xF4 && aByte > 0x8F)
          throw new MalformedInputException(count);
        if (leadByte == 0xE0 && aByte < 0xA0)
          throw new MalformedInputException(count);
        if (leadByte == 0xED && aByte > 0x9F)
          throw new MalformedInputException(count);
        // falls through to regular trail-byte test!!
      case TRAIL_BYTE:
        if (aByte < 0x80 || aByte > 0xBF)
          throw new MalformedInputException(count);
        if (--length == 0) {
          state = LEAD_BYTE;
        } else {
          state = TRAIL_BYTE;
        }
        break;
      default:
        break;
      } // switch (state)
      count++;
    }
  }

  /**
   * Magic numbers for UTF-8. These are the number of bytes
   * that <em>follow</em> a given lead byte. Trailing bytes
   * have the value -1. The values 4 and 5 are presented in
   * this table, even though valid UTF-8 cannot include the
   * five and six byte sequences.
   */
  static final int[] bytesFromUTF8 =
  { 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
    0, 0, 0, 0, 0, 0, 0,
    // trail bytes
    -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
    -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
    -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
    -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, 1, 1, 1, 1, 1,
    1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
    1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 3,
    3, 3, 3, 3, 3, 3, 3, 4, 4, 4, 4, 5, 5, 5, 5 };

  /**
   * Returns the next code point at the current position in
   * the buffer. The buffer's position will be incremented.
   * Any mark set on this buffer will be changed by this method!
   */
  public static int bytesToCodePoint(ByteBuffer bytes) {
    bytes.mark();
    byte b = bytes.get();
    bytes.reset();
    int extraBytesToRead = bytesFromUTF8[(b & 0xFF)];
    if (extraBytesToRead < 0) return -1; // trailing byte!
    int ch = 0;

    switch (extraBytesToRead) {
    case 5: ch += (bytes.get() & 0xFF); ch <<= 6; /* remember, illegal UTF-8 */
    case 4: ch += (bytes.get() & 0xFF); ch <<= 6; /* remember, illegal UTF-8 */
    case 3: ch += (bytes.get() & 0xFF); ch <<= 6;
    case 2: ch += (bytes.get() & 0xFF); ch <<= 6;
    case 1: ch += (bytes.get() & 0xFF); ch <<= 6;
    case 0: ch += (bytes.get() & 0xFF);
    }
    ch -= offsetsFromUTF8[extraBytesToRead];

    return ch;
  }

  
  static final int offsetsFromUTF8[] =
  { 0x00000000, 0x00003080,
    0x000E2080, 0x03C82080, 0xFA082080, 0x82082080 };

  /**
   * For the given string, returns the number of UTF-8 bytes
   * required to encode the string.
   * @param string text to encode
   * @return number of UTF-8 bytes required to encode
   */
  public static int utf8Length(String string) {
    CharacterIterator iter = new StringCharacterIterator(string);
    char ch = iter.first();
    int size = 0;
    while (ch != CharacterIterator.DONE) {
      if ((ch >= 0xD800) && (ch < 0xDC00)) {
        // surrogate pair?
        char trail = iter.next();
        if ((trail > 0xDBFF) && (trail < 0xE000)) {
          // valid pair
          size += 4;
        } else {
          // invalid pair
          size += 3;
          iter.previous(); // rewind one
        }
      } else if (ch < 0x80) {
        size++;
      } else if (ch < 0x800) {
        size += 2;
      } else {
        // ch < 0x10000, that is, the largest char value
        size += 3;
      }
      ch = iter.next();
    }
    return size;
  }
}
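The way charAt and bytesToCodePoint recover a code point from a byte position can be illustrated without Hadoop. The sketch below is a simplified reimplementation for explanation only, under these assumptions: the class CodePointAtByteSketch and its methods are invented, it handles sequences of at most 4 bytes, and like the listing above it does not validate trailing bytes. The first four offset constants are the same values as in offsetsFromUTF8: each one is the sum of the marker bits of the lead byte and trailing bytes after shifting, so subtracting it leaves only the payload bits.

```java
import java.nio.charset.StandardCharsets;

class CodePointAtByteSketch {
    // Index = number of trailing bytes. For example, for 3-byte sequences:
    // (0xE0 << 12) + (0x80 << 6) + 0x80 = 0x000E2080.
    static final int[] OFFSETS = {0x00000000, 0x00003080, 0x000E2080, 0x03C82080};

    /** Number of bytes that follow the given lead byte, or -1 if it cannot start a sequence. */
    static int trailingBytes(int lead) {
        if (lead < 0x80) return 0;   // ASCII
        if (lead < 0xC0) return -1;  // 10xxxxxx is a trailing byte
        if (lead < 0xE0) return 1;
        if (lead < 0xF0) return 2;
        if (lead < 0xF8) return 3;
        return -1;                   // longer forms are rejected in this sketch
    }

    /** Code point starting at byte position pos, or -1 if pos is invalid. */
    static int codePointAt(byte[] utf8, int pos) {
        if (pos < 0 || pos >= utf8.length) return -1;
        int extra = trailingBytes(utf8[pos] & 0xFF);
        if (extra < 0 || pos + extra >= utf8.length) return -1;
        int ch = 0;
        for (int i = 0; i <= extra; i++) {
            // Accumulate whole bytes, marker bits included.
            ch = (ch << 6) + (utf8[pos + i] & 0xFF);
        }
        // Remove the accumulated marker bits in one subtraction.
        return ch - OFFSETS[extra];
    }

    public static void main(String[] args) {
        byte[] data = "a\u6C49b".getBytes(StandardCharsets.UTF_8); // 61 E6 B1 89 62
        for (int pos = 0; pos < data.length; pos++) {
            System.out.printf("byte %d -> %X%n", pos, codePointAt(data, pos));
        }
    }
}
```

Positions are byte offsets, not character indexes: the letter b is the third character of the string but sits at byte position 4, and positions 2 and 3 point into the middle of a character and yield -1.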

The methods implemented by the Text class are listed below:


Modifier and Type	Method and Description
`void`	`append(byte[] utf8, int start, int len)` Append a range of bytes to the end of the given text
`static int`	`bytesToCodePoint(ByteBuffer bytes)` Returns the next code point at the current position in the buffer.
`int`	`charAt(int position)` Returns the Unicode Scalar Value (32-bit integer value) for the character at `position`.
`void`	`clear()` Clear the string to empty.
`byte[]`	`copyBytes()` Get a copy of the bytes that is exactly the length of the data.
`static String`	`decode(byte[] utf8)` Converts the provided byte array to a String using the UTF-8 encoding.
`static String`	`decode(byte[] utf8, int start, int length)`
`static String`	`decode(byte[] utf8, int start, int length, boolean replace)` Converts the provided byte array to a String using the UTF-8 encoding.
`static ByteBuffer`	`encode(String string)` Converts the provided String to bytes using the UTF-8 encoding.
`static ByteBuffer`	`encode(String string, boolean replace)` Converts the provided String to bytes using the UTF-8 encoding.
`boolean`	`equals(Object o)` Returns true iff `o` is a Text with the same contents.
`int`	`find(String what)`
`int`	`find(String what, int start)` Finds any occurrence of `what` in the backing buffer, starting at position `start`.
`byte[]`	`getBytes()` Returns the raw bytes; however, only data up to `getLength()` is valid.
`int`	`getLength()` Returns the number of bytes in the byte array
`int`	`hashCode()` Returns a hash of the bytes returned from `getBytes()`.
`void`	`readFields(DataInput in)` deserialize
`void`	`readFields(DataInput in, int maxLength)`
`static String`	`readString(DataInput in)` Read a UTF8 encoded string from in
`static String`	`readString(DataInput in, int maxLength)` Read a UTF8 encoded string with a maximum size
`void`	`readWithKnownLength(DataInput in, int len)` Read a Text object whose length is already known.
`void`	`set(byte[] utf8)` Set to a utf8 byte array
`void`	`set(byte[] utf8, int start, int len)` Set the Text to range of bytes
`void`	`set(String string)` Set to contain the contents of a string.
`void`	`set(Text other)` copy a text.
`static void`	`skip(DataInput in)` Skips over one Text in the input.
`String`	`toString()` Convert text back to string
`static int`	`utf8Length(String string)` For the given string, returns the number of UTF-8 bytes required to encode the string.
`static void`	`validateUTF8(byte[] utf8)` Check if a byte array contains valid utf-8
`static void`	`validateUTF8(byte[] utf8, int start, int len)` Check to see if a byte array is valid utf-8
`void`	`write(DataOutput out)` Serializes this object to `out`; the length uses zero-compressed encoding.
`void`	`write(DataOutput out, int maxLength)`
`static int`	`writeString(DataOutput out, String s)` Write a UTF8 encoded string to out
`static int`	`writeString(DataOutput out, String s, int maxLength)` Write a UTF8 encoded string with a maximum size to out
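Several of these methods, find and charAt in particular, work in byte positions within the UTF-8 buffer, whereas the corresponding String methods work in char indexes. The following sketch illustrates the difference with plain JDK code: the class ByteFindDemo and its naive search loop are made up for illustration and are not Hadoop's implementation.

```java
import java.nio.charset.StandardCharsets;

class ByteFindDemo {
    /** Byte offset of the first occurrence of what in the UTF-8 form of text, or -1. */
    static int findBytes(String text, String what) {
        byte[] src = text.getBytes(StandardCharsets.UTF_8);
        byte[] tgt = what.getBytes(StandardCharsets.UTF_8);
        outer:
        for (int i = 0; i + tgt.length <= src.length; i++) {
            for (int j = 0; j < tgt.length; j++) {
                if (src[i + j] != tgt[j]) {
                    continue outer; // mismatch, try the next start position
                }
            }
            return i;
        }
        return -1;
    }

    public static void main(String[] args) {
        // Two 3-byte characters followed by three ASCII letters.
        String text = "\u6C49\u5B57abc";
        System.out.println(text.indexOf("abc"));     // char index
        System.out.println(findBytes(text, "abc"));  // byte offset
    }
}
```

With this input the char index is 2 and the byte offset is 6, so a position obtained from one kind of API cannot be passed to the other without conversion.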

To verify the discussion above, a test example will be presented and analyzed next.

Reference:
http://hadoop.apache.org/docs/current/
