[hadoop2.7.1] I/O: a step-by-step analysis of Text (background and comparison with String)

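Hadoop's Text class closely resembles Java's String class, and many of its methods have direct counterparts there. Because the two serve different purposes, however, they also differ in important ways. Before analyzing Text, let us first review java.lang.String.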


1. The String class in Java


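The String class represents character strings. All string literals in Java programs, such as "abc", are implemented as instances of this class.

Strings are constant; their values cannot be changed after they are created. String buffers support mutable strings. Because String objects are immutable, they can be shared. For example: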

     String str = "abc";
 is equivalent to:
     char data[] = {'a', 'b', 'c'};
     String str = new String(data);
 Here are some more examples of how strings can be used:

     System.out.println("abc");
     String cde = "cde";
     System.out.println("abc" + cde);
     String c = "abc".substring(2,3);
     String d = cde.substring(1, 2);
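 The String class includes methods for examining individual characters of the sequence, for comparing strings, for searching strings, for extracting substrings, and for creating a copy of a string with all characters converted to uppercase or to lowercase. Case mapping is based on the Unicode Standard version specified by the Character class.

The Java language provides special support for the string concatenation operator ("+") and for the conversion of other objects to strings. String concatenation is implemented through the StringBuilder (or StringBuffer) class and its append method. String conversion is implemented through the toString method, which is defined by Object and inherited by all classes in Java. For more information on string concatenation and conversion, see The Java Language Specification by Gosling, Joy, and Steele.

Unless otherwise noted, passing a null argument to a constructor or method in this class causes a NullPointerException to be thrown.

A String represents a string in the UTF-16 format, in which supplementary characters are represented by surrogate pairs (see the section "Unicode Character Representations" in the Character class for details). Index values refer to char code units, so a supplementary character occupies two positions in a String.

The String class provides methods for dealing with Unicode code points (that is, characters) as well as with Unicode code units (that is, char values).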
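
The difference between code units and code points is easiest to see with a supplementary character. The following is a minimal sketch; the class name and helper methods are made up for illustration, and U+20C30 is simply a convenient character outside the Basic Multilingual Plane.

```java
public class CodeUnitsVsCodePoints {
    /** Number of UTF-16 code units, which is what String.length() reports. */
    static int codeUnits(String s) {
        return s.length();
    }

    /** Number of Unicode code points in the whole string. */
    static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        // U+20C30 is a supplementary character, stored as a surrogate pair.
        String s = "a" + new String(Character.toChars(0x20C30));
        System.out.println(codeUnits(s));                           // 3
        System.out.println(codePoints(s));                          // 2
        System.out.println(Integer.toHexString(s.codePointAt(1)));  // 20c30
        System.out.println(Integer.toHexString(s.charAt(1)));       // d843 (high surrogate only)
    }
}
```

Index 1 is the same position in both calls, but charAt returns only half of the character while codePointAt reassembles the pair.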


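2. Unicode

   Both Hadoop's Text class and Java's String class use the standard Unicode character set, but they differ in how they encode it: Text stores its data as standard UTF-8 (its class Javadoc says "standard UTF8 encoding"), whereas String uses UTF-16. The rest of this section describes Unicode in more detail.

 Unicode

 Unicode is a character encoding standard used on computers. It was created to overcome the limitations of traditional character encoding schemes: it assigns a single, unique number to every character of every language, so that text can be converted and processed across languages and platforms. Unicode involves two steps. The first is to define a standard that assigns every character a unique number (its code point). The second is to decide how that number is stored in the computer, and only this second step determines how many bytes a character actually occupies.
     Put another way, Unicode represents every character by a number between 0 and 0x10FFFF. The numbers 0 to 127 denote exactly the same characters as ASCII. The first 65,536 numbers (0 to 0xFFFF; 65,536 is 2 to the 16th power) form the Basic Multilingual Plane, which holds most characters in common use. That is the first step. The second step is how to turn these numbers into bit strings stored in the computer. There are several ways to do this, which is where the UTF (Unicode Transformation Format) family comes in: UTF-8, UTF-16, and UTF-32.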

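UTF-8, UTF-16, UTF-32


So what exactly is the difference between UTF-8, UTF-16, and UTF-32?

    UTF-16 is the easiest to understand: every character in the Basic Multilingual Plane (code points below 0x10000) is stored in two bytes, and every other character is stored in four bytes as a surrogate pair. A common misunderstanding is to equate Unicode with UTF-16. For text that consists only of English letters, though, UTF-16 is clearly wasteful: why use two bytes when one byte is enough to represent a character?

   Hence UTF-8. The 8 is easy to misread: it does not mean that every character takes one 8-bit byte. It refers to the size of the code unit; the number of bytes per character is variable. A character may take one, two, three, or four bytes, never more than four, depending on the size of its code point. Unicode characters that are also ASCII characters are encoded in one byte, identical to their ASCII representation. All other Unicode characters need at least two bytes in UTF-8. In a multi-byte sequence, the first byte starts with n consecutive 1 bits followed by a 0 bit, where n is the total number of bytes in the sequence, and every following byte starts with the bits 10.

   The trade-off between UTF-8 and UTF-16 is now easy to see: for text that is entirely English, or mixed text that is mostly English, UTF-8 saves a great deal of space compared with UTF-16.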

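For example, the code points of "汉字" are 0x6C49 and 0x5B57, and the encoded program data is:
  BYTE data_utf8[] = {0xE6, 0xB1, 0x89, 0xE5, 0xAD, 0x97}; // UTF-8 encoding
  WORD data_utf16[] = {0x6c49, 0x5b57}; // UTF-16 encoding
  DWORD data_utf32[] = {0x6c49, 0x5b57}; // UTF-32 encoding
  Here BYTE, WORD, and DWORD denote unsigned 8-bit, 16-bit, and 32-bit integers respectively. UTF-8, UTF-16, and UTF-32 use BYTE, WORD, and DWORD as their code units. The UTF-8 encoding of "汉字" takes 6 bytes. The UTF-16 encoding takes two WORDs, or 4 bytes. The UTF-32 encoding takes two DWORDs, or 8 bytes. Depending on byte order, UTF-16 can be realized as UTF-16LE or UTF-16BE, and UTF-32 as UTF-32LE or UTF-32BE.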
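
The byte counts quoted above can be checked from Java. This is a small sketch rather than anything taken from Hadoop; the class and method names are hypothetical. The UTF-32 size is computed as four bytes per code point instead of relying on a UTF-32 charset being installed.

```java
import java.nio.charset.StandardCharsets;

public class EncodedSizes {
    /** Size in bytes when the string is encoded as UTF-8. */
    static int utf8Bytes(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    /** Size in bytes as UTF-16; UTF_16BE writes no byte order mark. */
    static int utf16Bytes(String s) {
        return s.getBytes(StandardCharsets.UTF_16BE).length;
    }

    /** Size in bytes as UTF-32: always four bytes per code point. */
    static int utf32Bytes(String s) {
        return 4 * s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        String hanzi = "\u6C49\u5B57"; // the two characters used in the example above
        System.out.println(utf8Bytes(hanzi) + " " + utf16Bytes(hanzi) + " " + utf32Bytes(hanzi)); // 6 4 8
        String ascii = "hello";
        System.out.println(utf8Bytes(ascii) + " " + utf16Bytes(ascii) + " " + utf32Bytes(ascii)); // 5 10 20
    }
}
```

For Chinese text UTF-16 is the most compact of the three, while for ASCII text UTF-8 is, which matches the comparison made earlier.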

The following sections cover UTF-8, UTF-16, and UTF-32 in turn.

UTF-8 (note: Text uses the standard UTF-8 encoding)


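UTF-8 encodes Unicode in units of bytes. The mapping from Unicode to UTF-8 is as follows:
  Unicode code point (hex) ║ UTF-8 byte stream (binary)
  000000 - 00007F ║ 0xxxxxxx
  000080 - 0007FF ║ 110xxxxx 10xxxxxx
  000800 - 00FFFF ║ 1110xxxx 10xxxxxx 10xxxxxx
  010000 - 10FFFF ║ 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
  The defining feature of UTF-8 is that it uses encodings of different lengths for different ranges of characters. For characters in 0x00-0x7F, the UTF-8 encoding is identical to ASCII. The maximum length of a UTF-8 encoding is 4 bytes. As the table shows, the 4-byte template has 21 x's, so it can hold a 21-bit binary number; the largest Unicode code point, 0x10FFFF, is also only 21 bits long.
  Example 1: the code point of "汉" is 0x6C49. Since 0x6C49 lies between 0x0800 and 0xFFFF, the 3-byte template applies: 1110xxxx 10xxxxxx 10xxxxxx. Written in binary, 0x6C49 is 0110 1100 0100 1001. Substituting these bits in order for the x's in the template gives 11100110 10110001 10001001, that is, E6 B1 89.
  Example 2: the code point 0x20C30 lies between 0x010000 and 0x10FFFF, so the 4-byte template applies: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx. Written as a 21-bit binary number (padded with leading zeros if shorter than 21 bits), 0x20C30 is 0 0010 0000 1100 0011 0000. Substituting these bits in order for the x's in the template gives 11110000 10100000 10110000 10110000, that is, F0 A0 B0 B0.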
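
The template table can be turned into code almost line by line. The sketch below is a hand-written illustration of the four templates, not the encoder that Text or the JDK actually uses; the class and method names are invented for the example. It reproduces both worked examples.

```java
public class Utf8Encode {
    /** Encodes one Unicode code point using the four UTF-8 templates. */
    static byte[] encode(int cp) {
        if (cp < 0 || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) {
            throw new IllegalArgumentException("not a Unicode scalar value: " + cp);
        }
        if (cp <= 0x7F) {            // 0xxxxxxx
            return new byte[] { (byte) cp };
        }
        if (cp <= 0x7FF) {           // 110xxxxx 10xxxxxx
            return new byte[] {
                (byte) (0xC0 | (cp >> 6)),
                (byte) (0x80 | (cp & 0x3F)) };
        }
        if (cp <= 0xFFFF) {          // 1110xxxx 10xxxxxx 10xxxxxx
            return new byte[] {
                (byte) (0xE0 | (cp >> 12)),
                (byte) (0x80 | ((cp >> 6) & 0x3F)),
                (byte) (0x80 | (cp & 0x3F)) };
        }
        return new byte[] {          // 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
            (byte) (0xF0 | (cp >> 18)),
            (byte) (0x80 | ((cp >> 12) & 0x3F)),
            (byte) (0x80 | ((cp >> 6) & 0x3F)),
            (byte) (0x80 | (cp & 0x3F)) };
    }

    /** Formats bytes as space-separated uppercase hex. */
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(hex(encode(0x6C49)));   // E6 B1 89
        System.out.println(hex(encode(0x20C30)));  // F0 A0 B0 B0
    }
}
```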

UTF-16


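  UTF-16 encodes in units of 16-bit unsigned integers. Let U denote the Unicode code point. The encoding rules are as follows:
  If U<0x10000, the UTF-16 encoding of U is the 16-bit unsigned integer corresponding to U (for brevity, a 16-bit unsigned integer is written as WORD below).
  If U≥0x10000, first compute U'=U-0x10000, then write U' in binary as yyyy yyyy yyxx xxxx xxxx. The UTF-16 encoding of U (in binary) is then 110110yyyyyyyyyy 110111xxxxxxxxxx.
  Why can U' always be written in 20 bits? The largest Unicode code point is 0x10ffff; after subtracting 0x10000, the largest value of U' is 0xfffff, which fits in 20 bits. For example, for the code point 0x20C30, subtracting 0x10000 gives 0x10C30, which in binary is 0001 0000 1100 0011 0000. Substituting the first 10 bits for the y's in the template and the last 10 bits for the x's gives 1101100001000011 1101110000110000, that is, 0xD843 0xDC30.
  By these rules, the UTF-16 encoding of a code point in 0x10000-0x10FFFF consists of two WORDs. The top 6 bits of the first WORD are 110110, and the top 6 bits of the second WORD are 110111. The first WORD therefore ranges (in binary) from 11011000 00000000 to 11011011 11111111, that is, 0xD800-0xDBFF. The second WORD ranges from 11011100 00000000 to 11011111 11111111, that is, 0xDC00-0xDFFF.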
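
The two encoding rules above can be sketched as a few lines of Java. This is an illustration only (the class and method names are hypothetical); in real code Character.toChars does the same job, and the sketch can be compared against it.

```java
public class Utf16Encode {
    /** Encodes one code point as one or two UTF-16 code units (WORDs). */
    static char[] encode(int u) {
        if (u < 0 || u > 0x10FFFF) {
            throw new IllegalArgumentException("not a Unicode code point: " + u);
        }
        if (u < 0x10000) {
            // Rule 1: the code unit is the code point itself.
            return new char[] { (char) u };
        }
        // Rule 2: split U' = U - 0x10000 into two 10-bit halves.
        int v = u - 0x10000;
        char high = (char) (0xD800 | (v >> 10));    // 110110yyyyyyyyyy
        char low  = (char) (0xDC00 | (v & 0x3FF));  // 110111xxxxxxxxxx
        return new char[] { high, low };
    }

    public static void main(String[] args) {
        char[] units = encode(0x20C30);
        System.out.println(Integer.toHexString(units[0]) + " "
                + Integer.toHexString(units[1]));   // d843 dc30
    }
}
```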
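  To distinguish one-WORD UTF-16 encodings from two-WORD ones, the designers of Unicode reserved the range 0xD800-0xDFFF and called it the surrogate area:
  D800-DB7F ║ High Surrogates
  DB80-DBFF ║ High Private Use Surrogates
  DC00-DFFF ║ Low Surrogates
  A high surrogate is a code unit in the range used for the first WORD of a two-WORD UTF-16 encoding. A low surrogate is a code unit in the range used for the second WORD. What, then, is a high private use surrogate? Answering this question also shows how to derive the Unicode code point from its UTF-16 encoding.
  If the first WORD of a character's UTF-16 encoding lies between 0xDB80 and 0xDBFF, in what range does its code point fall? The second WORD ranges over 0xDC00-0xDFFF, so the UTF-16 encoding of such a character ranges from 0xDB80 0xDC00 to 0xDBFF 0xDFFF. Written in binary, this range is:
  1101101110000000 1101110000000000 - 1101101111111111 1101111111111111
  Reversing the encoding steps, take the low 10 bits of the high and low WORDs and concatenate them:
  1110 0000 0000 0000 0000 - 1111 1111 1111 1111 1111
  That is 0xe0000-0xfffff. Reversing the last encoding step, add 0x10000 to get 0xf0000-0x10ffff. This is the range of code points whose first UTF-16 WORD lies between 0xdb80 and 0xdbff, namely planes 15 and 16. Because the Unicode standard designates planes 15 and 16 as private use areas, the reserved code units 0xDB80 to 0xDBFF are called high private use surrogates.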
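
The reverse derivation just described, from a surrogate pair back to a code point, can be written out as follows. It is a sketch with hypothetical names; the JDK offers Character.toCodePoint for the same purpose.

```java
public class Utf16Decode {
    /** Rebuilds the code point from a high and a low surrogate. */
    static int decode(char high, char low) {
        if (!Character.isHighSurrogate(high) || !Character.isLowSurrogate(low)) {
            throw new IllegalArgumentException("not a surrogate pair");
        }
        // Take the low 10 bits of each WORD, join them, then add 0x10000 back.
        return (((high & 0x3FF) << 10) | (low & 0x3FF)) + 0x10000;
    }

    public static void main(String[] args) {
        // Bounds of the high private use surrogate range discussed above.
        System.out.println(Integer.toHexString(decode((char) 0xDB80, (char) 0xDC00))); // f0000
        System.out.println(Integer.toHexString(decode((char) 0xDBFF, (char) 0xDFFF))); // 10ffff
        // The earlier worked example.
        System.out.println(Integer.toHexString(decode((char) 0xD843, (char) 0xDC30))); // 20c30
    }
}
```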

UTF-32


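  UTF-32 encodes in units of 32-bit unsigned integers. The UTF-32 encoding of a Unicode code point is simply the corresponding 32-bit unsigned integer.
  Byte order
  Depending on byte order, UTF-16 can be realized as UTF-16LE or UTF-16BE, and UTF-32 as UTF-32LE or UTF-32BE. For example:
  Unicode code point ║ UTF-16LE ║ UTF-16BE ║ UTF-32LE ║ UTF-32BE
  0x006C49 ║ 49 6C ║ 6C 49 ║ 49 6C 00 00 ║ 00 00 6C 49
  0x020C30 ║ 43 D8 30 DC ║ D8 43 DC 30 ║ 30 0C 02 00 ║ 00 02 0C 30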
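
The rows of the byte order table can be reproduced with the JDK. In this sketch (class and method names are hypothetical) the UTF-16 columns come from the standard UTF_16LE and UTF_16BE charsets, and the UTF-32 columns are produced by writing the code point as a 32-bit integer into a ByteBuffer with the chosen byte order.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.charset.StandardCharsets;

public class ByteOrderDemo {
    /** Formats bytes as space-separated uppercase hex. */
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    /** UTF-16 bytes of one code point in the requested byte order. */
    static String utf16(int codePoint, boolean littleEndian) {
        String s = new String(Character.toChars(codePoint));
        return hex(s.getBytes(littleEndian
                ? StandardCharsets.UTF_16LE : StandardCharsets.UTF_16BE));
    }

    /** UTF-32 bytes of one code point: the 32-bit integer itself. */
    static String utf32(int codePoint, boolean littleEndian) {
        ByteBuffer bb = ByteBuffer.allocate(4)
                .order(littleEndian ? ByteOrder.LITTLE_ENDIAN : ByteOrder.BIG_ENDIAN);
        bb.putInt(codePoint);
        return hex(bb.array());
    }

    public static void main(String[] args) {
        for (int cp : new int[] { 0x6C49, 0x20C30 }) {
            System.out.println(utf16(cp, true) + " | " + utf16(cp, false) + " | "
                    + utf32(cp, true) + " | " + utf32(cp, false));
        }
    }
}
```

Note that for a surrogate pair the byte order applies within each 16-bit unit; the high surrogate still comes first in both UTF-16LE and UTF-16BE.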

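UTF-32 uses exactly 32 bits for every Unicode code point. Because it spends 4 bytes on every character, it is very inefficient in terms of space. In particular, characters outside the Basic Multilingual Plane are rare in most documents, so they hardly affect the size comparison, and a UTF-32 file is typically two to four times the size of the same text in another encoding. Although a fixed number of bytes per code point looks convenient, UTF-32 is not as widely used as the other Unicode encodings.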

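The Text class


Text provides methods to serialize, deserialize, and compare text at the byte level. Its length is an integer, serialized in a zero-compressed format. In addition, it supports traversing the string without converting the byte array to a String.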
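
To give a feel for what "a length serialized in a zero-compressed format, followed by the bytes" means, here is a self-contained sketch that does not depend on Hadoop. The class and method names are hypothetical. The byte layout (one byte for values up to 127, otherwise a marker byte of -112 minus the byte count, followed by the significant bytes, high byte first) is written from our understanding of WritableUtils.writeVInt and should be treated as an assumption to check against the Hadoop source; negative values are deliberately not handled.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

public class LengthPrefixSketch {
    /** Writes a non-negative length in as few bytes as possible. */
    static void writeLength(DataOutputStream out, int n) throws IOException {
        if (n < 0) throw new IllegalArgumentException("negative length");
        if (n <= 127) { out.writeByte(n); return; }       // small values: one byte
        int count = 0;
        for (int tmp = n; tmp != 0; tmp >>>= 8) count++;  // significant bytes
        out.writeByte(-112 - count);                      // marker: how many bytes follow
        for (int i = count - 1; i >= 0; i--) {
            out.writeByte((n >>> (8 * i)) & 0xFF);        // high byte first
        }
    }

    /** Reads a length written by writeLength. */
    static int readLength(DataInputStream in) throws IOException {
        byte first = in.readByte();
        if (first >= 0) return first;
        int count = -(first + 112);
        if (count < 1 || count > 4) throw new IOException("not a non-negative int length");
        int n = 0;
        for (int i = 0; i < count; i++) n = (n << 8) | in.readUnsignedByte();
        return n;
    }

    /** Length prefix followed by the UTF-8 bytes of the string. */
    static byte[] serialize(String s) {
        try {
            byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
            ByteArrayOutputStream buf = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(buf);
            writeLength(out, utf8.length);
            out.write(utf8, 0, utf8.length);
            out.flush();
            return buf.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Inverse of serialize. */
    static String deserialize(byte[] data) {
        try {
            DataInputStream in = new DataInputStream(new ByteArrayInputStream(data));
            byte[] utf8 = new byte[readLength(in)];
            in.readFully(utf8);
            return new String(utf8, StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] data = serialize("\u6C49\u5B57");
        System.out.println(data.length);       // 7: one length byte plus six UTF-8 bytes
        System.out.println(deserialize(data));
    }
}
```

The point of the compressed prefix is that short strings, which are the common case for keys, pay only one byte of overhead instead of a fixed four.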

The source of Text in hadoop 2.7.1:

/**
 * Licensed to the Apache Software Foundation (ASF) under one
 * or more contributor license agreements.  See the NOTICE file
 * distributed with this work for additional information
 * regarding copyright ownership.  The ASF licenses this file
 * to you under the Apache License, Version 2.0 (the
 * "License"); you may not use this file except in compliance
 * with the License.  You may obtain a copy of the License at
 *
 *     http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */

package org.apache.hadoop.io;

import java.io.IOException;
import java.io.DataInput;
import java.io.DataOutput;
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.MalformedInputException;
import java.text.CharacterIterator;
import java.text.StringCharacterIterator;
import java.util.Arrays;

import org.apache.avro.reflect.Stringable;

import org.apache.hadoop.classification.InterfaceAudience;
import org.apache.hadoop.classification.InterfaceStability;

/** This class stores text using standard UTF8 encoding.  It provides methods
 * to serialize, deserialize, and compare texts at byte level.  The type of
 * length is integer and is serialized using zero-compressed format.  <p>In
 * addition, it provides methods for string traversal without converting the
 * byte array to a string.  <p>Also includes utilities for
 * serializing/deserialing a string, coding/decoding a string, checking if a
 * byte array contains valid UTF8 code, calculating the length of an encoded
 * string.
 */
@Stringable
@InterfaceAudience.Public
@InterfaceStability.Stable
public class Text extends BinaryComparable
    implements WritableComparable<BinaryComparable> {
  
  private static ThreadLocal<CharsetEncoder> ENCODER_FACTORY =
    new ThreadLocal<CharsetEncoder>() {
      @Override
      protected CharsetEncoder initialValue() {
        return Charset.forName("UTF-8").newEncoder().
               onMalformedInput(CodingErrorAction.REPORT).
               onUnmappableCharacter(CodingErrorAction.REPORT);
    }
  };
  
  private static ThreadLocal<CharsetDecoder> DECODER_FACTORY =
    new ThreadLocal<CharsetDecoder>() {
    @Override
    protected CharsetDecoder initialValue() {
      return Charset.forName("UTF-8").newDecoder().
             onMalformedInput(CodingErrorAction.REPORT).
             onUnmappableCharacter(CodingErrorAction.REPORT);
    }
  };
  
  private static final byte [] EMPTY_BYTES = new byte[0];
  
  private byte[] bytes;
  private int length;

  public Text() {
    bytes = EMPTY_BYTES;
  }

  /** Construct from a string. 
   */
  public Text(String string) {
    set(string);
  }

  /** Construct from another text. */
  public Text(Text utf8) {
    set(utf8);
  }

  /** Construct from a byte array.
   */
  public Text(byte[] utf8)  {
    set(utf8);
  }
  
  /**
   * Get a copy of the bytes that is exactly the length of the data.
   * See {@link #getBytes()} for faster access to the underlying array.
   */
  public byte[] copyBytes() {
    byte[] result = new byte[length];
    System.arraycopy(bytes, 0, result, 0, length);
    return result;
  }
  
  /**
   * Returns the raw bytes; however, only data up to {@link #getLength()} is
   * valid. Please use {@link #copyBytes()} if you
   * need the returned array to be precisely the length of the data.
   */
  @Override
  public byte[] getBytes() {
    return bytes;
  }

  /** Returns the number of bytes in the byte array */ 
  @Override
  public int getLength() {
    return length;
  }
  
  /**
   * Returns the Unicode Scalar Value (32-bit integer value)
   * for the character at <code>position</code>. Note that this
   * method avoids using the converter or doing String instantiation
   * @return the Unicode scalar value at position or -1
   *          if the position is invalid or points to a
   *          trailing byte
   */
  public int charAt(int position) {
    if (position > this.length) return -1; // too long
    if (position < 0) return -1; // duh.
      
    ByteBuffer bb = (ByteBuffer)ByteBuffer.wrap(bytes).position(position);
    return bytesToCodePoint(bb.slice());
  }
  
  public int find(String what) {
    return find(what, 0);
  }
  
  /**
   * Finds any occurence of <code>what</code> in the backing
   * buffer, starting as position <code>start</code>. The starting
   * position is measured in bytes and the return value is in
   * terms of byte position in the buffer. The backing buffer is
   * not converted to a string for this operation.
   * @return byte position of the first occurence of the search
   *         string in the UTF-8 buffer or -1 if not found
   */
  public int find(String what, int start) {
    try {
      ByteBuffer src = ByteBuffer.wrap(this.bytes,0,this.length);
      ByteBuffer tgt = encode(what);
      byte b = tgt.get();
      src.position(start);
          
      while (src.hasRemaining()) {
        if (b == src.get()) { // matching first byte
          src.mark(); // save position in loop
          tgt.mark(); // save position in target
          boolean found = true;
          int pos = src.position()-1;
          while (tgt.hasRemaining()) {
            if (!src.hasRemaining()) { // src expired first
              tgt.reset();
              src.reset();
              found = false;
              break;
            }
            if (!(tgt.get() == src.get())) {
              tgt.reset();
              src.reset();
              found = false;
              break; // no match
            }
          }
          if (found) return pos;
        }
      }
      return -1; // not found
    } catch (CharacterCodingException e) {
      // can't get here
      e.printStackTrace();
      return -1;
    }
  }  
  /** Set to contain the contents of a string. 
   */
  public void set(String string) {
    try {
      ByteBuffer bb = encode(string, true);
      bytes = bb.array();
      length = bb.limit();
    }catch(CharacterCodingException e) {
      throw new RuntimeException("Should not have happened ", e); 
    }
  }

  /** Set to a utf8 byte array
   */
  public void set(byte[] utf8) {
    set(utf8, 0, utf8.length);
  }
  
  /** copy a text. */
  public void set(Text other) {
    set(other.getBytes(), 0, other.getLength());
  }

  /**
   * Set the Text to range of bytes
   * @param utf8 the data to copy from
   * @param start the first position of the new string
   * @param len the number of bytes of the new string
   */
  public void set(byte[] utf8, int start, int len) {
    setCapacity(len, false);
    System.arraycopy(utf8, start, bytes, 0, len);
    this.length = len;
  }

  /**
   * Append a range of bytes to the end of the given text
   * @param utf8 the data to copy from
   * @param start the first position to append from utf8
   * @param len the number of bytes to append
   */
  public void append(byte[] utf8, int start, int len) {
    setCapacity(length + len, true);
    System.arraycopy(utf8, start, bytes, length, len);
    length += len;
  }

  /**
   * Clear the string to empty.
   *
   * <em>Note</em>: For performance reasons, this call does not clear the
   * underlying byte array that is retrievable via {@link #getBytes()}.
   * In order to free the byte-array memory, call {@link #set(byte[])}
   * with an empty byte array (For example, <code>new byte[0]</code>).
   */
  public void clear() {
    length = 0;
  }

  /*
   * Sets the capacity of this Text object to <em>at least</em>
   * <code>len</code> bytes. If the current buffer is longer,
   * then the capacity and existing content of the buffer are
   * unchanged. If <code>len</code> is larger
   * than the current capacity, the Text object's capacity is
   * increased to match.
   * @param len the number of bytes we need
   * @param keepData should the old data be kept
   */
  private void setCapacity(int len, boolean keepData) {
    if (bytes == null || bytes.length < len) {
      if (bytes != null && keepData) {
        bytes = Arrays.copyOf(bytes, Math.max(len,length << 1));
      } else {
        bytes = new byte[len];
      }
    }
  }
   
  /** 
   * Convert text back to string
   * @see java.lang.Object#toString()
   */
  @Override
  public String toString() {
    try {
      return decode(bytes, 0, length);
    } catch (CharacterCodingException e) { 
      throw new RuntimeException("Should not have happened " , e); 
    }
  }
  
  /** deserialize 
   */
  @Override
  public void readFields(DataInput in) throws IOException {
    int newLength = WritableUtils.readVInt(in);
    readWithKnownLength(in, newLength);
  }
  
  public void readFields(DataInput in, int maxLength) throws IOException {
    int newLength = WritableUtils.readVInt(in);
    if (newLength < 0) {
      throw new IOException("tried to deserialize " + newLength +
          " bytes of data!  newLength must be non-negative.");
    } else if (newLength >= maxLength) {
      throw new IOException("tried to deserialize " + newLength +
          " bytes of data, but maxLength = " + maxLength);
    }
    readWithKnownLength(in, newLength);
  }

  /** Skips over one Text in the input. */
  public static void skip(DataInput in) throws IOException {
    int length = WritableUtils.readVInt(in);
    WritableUtils.skipFully(in, length);
  }

  /**
   * Read a Text object whose length is already known.
   * This allows creating Text from a stream which uses a different serialization
   * format.
   */
  public void readWithKnownLength(DataInput in, int len) throws IOException {
    setCapacity(len, false);
    in.readFully(bytes, 0, len);
    length = len;
  }

  /** serialize
   * write this object to out
   * length uses zero-compressed encoding
   * @see Writable#write(DataOutput)
   */
  @Override
  public void write(DataOutput out) throws IOException {
    WritableUtils.writeVInt(out, length);
    out.write(bytes, 0, length);
  }

  public void write(DataOutput out, int maxLength) throws IOException {
    if (length > maxLength) {
      throw new IOException("data was too long to write!  Expected " +
          "less than or equal to " + maxLength + " bytes, but got " +
          length + " bytes.");
    }
    WritableUtils.writeVInt(out, length);
    out.write(bytes, 0, length);
  }

  /** Returns true iff <code>o</code> is a Text with the same contents.  */
  @Override
  public boolean equals(Object o) {
    if (o instanceof Text)
      return super.equals(o);
    return false;
  }

  @Override
  public int hashCode() {
    return super.hashCode();
  }

  /** A WritableComparator optimized for Text keys. */
  public static class Comparator extends WritableComparator {
    public Comparator() {
      super(Text.class);
    }

    @Override
    public int compare(byte[] b1, int s1, int l1,
                       byte[] b2, int s2, int l2) {
      int n1 = WritableUtils.decodeVIntSize(b1[s1]);
      int n2 = WritableUtils.decodeVIntSize(b2[s2]);
      return compareBytes(b1, s1+n1, l1-n1, b2, s2+n2, l2-n2);
    }
  }

  static {
    // register this comparator
    WritableComparator.define(Text.class, new Comparator());
  }

  /// STATIC UTILITIES FROM HERE DOWN
  /**
   * Converts the provided byte array to a String using the
   * UTF-8 encoding. If the input is malformed,
   * replace by a default value.
   */
  public static String decode(byte[] utf8) throws CharacterCodingException {
    return decode(ByteBuffer.wrap(utf8), true);
  }
  
  public static String decode(byte[] utf8, int start, int length) 
    throws CharacterCodingException {
    return decode(ByteBuffer.wrap(utf8, start, length), true);
  }
  
  /**
   * Converts the provided byte array to a String using the
   * UTF-8 encoding. If <code>replace</code> is true, then
   * malformed input is replaced with the
   * substitution character, which is U+FFFD. Otherwise the
   * method throws a MalformedInputException.
   */
  public static String decode(byte[] utf8, int start, int length, boolean replace) 
    throws CharacterCodingException {
    return decode(ByteBuffer.wrap(utf8, start, length), replace);
  }
  
  private static String decode(ByteBuffer utf8, boolean replace) 
    throws CharacterCodingException {
    CharsetDecoder decoder = DECODER_FACTORY.get();
    if (replace) {
      decoder.onMalformedInput(
          java.nio.charset.CodingErrorAction.REPLACE);
      decoder.onUnmappableCharacter(CodingErrorAction.REPLACE);
    }
    String str = decoder.decode(utf8).toString();
    // set decoder back to its default value: REPORT
    if (replace) {
      decoder.onMalformedInput(CodingErrorAction.REPORT);
      decoder.onUnmappableCharacter(CodingErrorAction.REPORT);
    }
    return str;
  }

  /**
   * Converts the provided String to bytes using the
   * UTF-8 encoding. If the input is malformed,
   * invalid chars are replaced by a default value.
   * @return ByteBuffer: bytes stores at ByteBuffer.array() 
   *                     and length is ByteBuffer.limit()
   */

  public static ByteBuffer encode(String string)
    throws CharacterCodingException {
    return encode(string, true);
  }

  /**
   * Converts the provided String to bytes using the
   * UTF-8 encoding. If <code>replace</code> is true, then
   * malformed input is replaced with the
   * substitution character, which is U+FFFD. Otherwise the
   * method throws a MalformedInputException.
   * @return ByteBuffer: bytes stores at ByteBuffer.array() 
   *                     and length is ByteBuffer.limit()
   */
  public static ByteBuffer encode(String string, boolean replace)
    throws CharacterCodingException {
    CharsetEncoder encoder = ENCODER_FACTORY.get();
    if (replace) {
      encoder.onMalformedInput(CodingErrorAction.REPLACE);
      encoder.onUnmappableCharacter(CodingErrorAction.REPLACE);
    }
    ByteBuffer bytes = 
      encoder.encode(CharBuffer.wrap(string.toCharArray()));
    if (replace) {
      encoder.onMalformedInput(CodingErrorAction.REPORT);
      encoder.onUnmappableCharacter(CodingErrorAction.REPORT);
    }
    return bytes;
  }

  static final public int DEFAULT_MAX_LEN = 1024 * 1024;

  /** Read a UTF8 encoded string from in
   */
  public static String readString(DataInput in) throws IOException {
    return readString(in, Integer.MAX_VALUE);
  }
  
  /** Read a UTF8 encoded string with a maximum size
   */
  public static String readString(DataInput in, int maxLength)
      throws IOException {
    int length = WritableUtils.readVIntInRange(in, 0, maxLength);
    byte [] bytes = new byte[length];
    in.readFully(bytes, 0, length);
    return decode(bytes);
  }
  
  /** Write a UTF8 encoded string to out
   */
  public static int writeString(DataOutput out, String s) throws IOException {
    ByteBuffer bytes = encode(s);
    int length = bytes.limit();
    WritableUtils.writeVInt(out, length);
    out.write(bytes.array(), 0, length);
    return length;
  }

  /** Write a UTF8 encoded string with a maximum size to out
   */
  public static int writeString(DataOutput out, String s, int maxLength)
      throws IOException {
    ByteBuffer bytes = encode(s);
    int length = bytes.limit();
    if (length > maxLength) {
      throw new IOException("string was too long to write!  Expected " +
          "less than or equal to " + maxLength + " bytes, but got " +
          length + " bytes.");
    }
    WritableUtils.writeVInt(out, length);
    out.write(bytes.array(), 0, length);
    return length;
  }

  ////// states for validateUTF8
  
  private static final int LEAD_BYTE = 0;

  private static final int TRAIL_BYTE_1 = 1;

  private static final int TRAIL_BYTE = 2;

  /** 
   * Check if a byte array contains valid utf-8
   * @param utf8 byte array
   * @throws MalformedInputException if the byte array contains invalid utf-8
   */
  public static void validateUTF8(byte[] utf8) throws MalformedInputException {
    validateUTF8(utf8, 0, utf8.length);     
  }
  
  /**
   * Check to see if a byte array is valid utf-8
   * @param utf8 the array of bytes
   * @param start the offset of the first byte in the array
   * @param len the length of the byte sequence
   * @throws MalformedInputException if the byte array contains invalid bytes
   */
  public static void validateUTF8(byte[] utf8, int start, int len)
    throws MalformedInputException {
    int count = start;
    int leadByte = 0;
    int length = 0;
    int state = LEAD_BYTE;
    while (count < start+len) {
      int aByte = utf8[count] & 0xFF;

      switch (state) {
      case LEAD_BYTE:
        leadByte = aByte;
        length = bytesFromUTF8[aByte];

        switch (length) {
        case 0: // check for ASCII
          if (leadByte > 0x7F)
            throw new MalformedInputException(count);
          break;
        case 1:
          if (leadByte < 0xC2 || leadByte > 0xDF)
            throw new MalformedInputException(count);
          state = TRAIL_BYTE_1;
          break;
        case 2:
          if (leadByte < 0xE0 || leadByte > 0xEF)
            throw new MalformedInputException(count);
          state = TRAIL_BYTE_1;
          break;
        case 3:
          if (leadByte < 0xF0 || leadByte > 0xF4)
            throw new MalformedInputException(count);
          state = TRAIL_BYTE_1;
          break;
        default:
          // too long! Longest valid UTF-8 is 4 bytes (lead + three)
          // or if < 0 we got a trail byte in the lead byte position
          throw new MalformedInputException(count);
        } // switch (length)
        break;

      case TRAIL_BYTE_1:
        if (leadByte == 0xF0 && aByte < 0x90)
          throw new MalformedInputException(count);
        if (leadByte == 0xF4 && aByte > 0x8F)
          throw new MalformedInputException(count);
        if (leadByte == 0xE0 && aByte < 0xA0)
          throw new MalformedInputException(count);
        if (leadByte == 0xED && aByte > 0x9F)
          throw new MalformedInputException(count);
        // falls through to regular trail-byte test!!
      case TRAIL_BYTE:
        if (aByte < 0x80 || aByte > 0xBF)
          throw new MalformedInputException(count);
        if (--length == 0) {
          state = LEAD_BYTE;
        } else {
          state = TRAIL_BYTE;
        }
        break;
      default:
        break;
      } // switch (state)
      count++;
    }
  }

  /**
   * Magic numbers for UTF-8. These are the number of bytes
   * that <em>follow</em> a given lead byte. Trailing bytes
   * have the value -1. The values 4 and 5 are presented in
   * this table, even though valid UTF-8 cannot include the
   * five and six byte sequences.
   */
  static final int[] bytesFromUTF8 =
  { 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
    0, 0, 0, 0, 0, 0, 0,
    // trail bytes
    -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
    -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
    -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
    -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, 1, 1, 1, 1, 1,
    1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
    1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 3,
    3, 3, 3, 3, 3, 3, 3, 4, 4, 4, 4, 5, 5, 5, 5 };

  /**
   * Returns the next code point at the current position in
   * the buffer. The buffer's position will be incremented.
   * Any mark set on this buffer will be changed by this method!
   */
  public static int bytesToCodePoint(ByteBuffer bytes) {
    bytes.mark();
    byte b = bytes.get();
    bytes.reset();
    int extraBytesToRead = bytesFromUTF8[(b & 0xFF)];
    if (extraBytesToRead < 0) return -1; // trailing byte!
    int ch = 0;

    switch (extraBytesToRead) {
    case 5: ch += (bytes.get() & 0xFF); ch <<= 6; /* remember, illegal UTF-8 */
    case 4: ch += (bytes.get() & 0xFF); ch <<= 6; /* remember, illegal UTF-8 */
    case 3: ch += (bytes.get() & 0xFF); ch <<= 6;
    case 2: ch += (bytes.get() & 0xFF); ch <<= 6;
    case 1: ch += (bytes.get() & 0xFF); ch <<= 6;
    case 0: ch += (bytes.get() & 0xFF);
    }
    ch -= offsetsFromUTF8[extraBytesToRead];

    return ch;
  }

  
  static final int offsetsFromUTF8[] =
  { 0x00000000, 0x00003080,
    0x000E2080, 0x03C82080, 0xFA082080, 0x82082080 };

  /**
   * For the given string, returns the number of UTF-8 bytes
   * required to encode the string.
   * @param string text to encode
   * @return number of UTF-8 bytes required to encode
   */
  public static int utf8Length(String string) {
    CharacterIterator iter = new StringCharacterIterator(string);
    char ch = iter.first();
    int size = 0;
    while (ch != CharacterIterator.DONE) {
      if ((ch >= 0xD800) && (ch < 0xDC00)) {
        // surrogate pair?
        char trail = iter.next();
        if ((trail > 0xDBFF) && (trail < 0xE000)) {
          // valid pair
          size += 4;
        } else {
          // invalid pair
          size += 3;
          iter.previous(); // rewind one
        }
      } else if (ch < 0x80) {
        size++;
      } else if (ch < 0x800) {
        size += 2;
      } else {
        // ch < 0x10000, that is, the largest char value
        size += 3;
      }
      ch = iter.next();
    }
    return size;
  }
}
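
One practical consequence of the source above: set(String) keeps the array behind the encoder's ByteBuffer and records limit() as the length, so the array returned by getBytes() may be longer than the valid data. The following sketch shows the same effect with the JDK alone. The class and method names are hypothetical, and how much spare capacity the encoder allocates is a JDK implementation detail, so the code only relies on the array being at least as long as the limit.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class BackingArraySketch {
    /** Encodes to UTF-8 the way Text.encode does: the result is valid up to limit(). */
    static ByteBuffer encode(String s) {
        try {
            return StandardCharsets.UTF_8.newEncoder().encode(CharBuffer.wrap(s));
        } catch (CharacterCodingException e) {
            throw new IllegalArgumentException(e);
        }
    }

    /** Copies only the valid bytes, similar in spirit to Text.copyBytes(). */
    static byte[] validBytes(ByteBuffer bb) {
        int from = bb.arrayOffset();
        return Arrays.copyOfRange(bb.array(), from, from + bb.limit());
    }

    public static void main(String[] args) {
        ByteBuffer bb = encode("\u6C49\u5B57");
        System.out.println("valid bytes:   " + bb.limit());        // 6
        System.out.println("backing array: " + bb.array().length); // may be larger than 6
        System.out.println(new String(validBytes(bb), StandardCharsets.UTF_8));
    }
}
```

With a real Text object the equivalent rule is to read getBytes() only up to getLength(), or to call copyBytes() when an exactly sized array is needed.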

The methods implemented by the Text class are listed below:

Modifier and Type ║ Method ║ Description
void ║ append(byte[] utf8, int start, int len) ║ Append a range of bytes to the end of the given text
static int ║ bytesToCodePoint(ByteBuffer bytes) ║ Returns the next code point at the current position in the buffer.
int ║ charAt(int position) ║ Returns the Unicode Scalar Value (32-bit integer value) for the character at position.
void ║ clear() ║ Clear the string to empty.
byte[] ║ copyBytes() ║ Get a copy of the bytes that is exactly the length of the data.
static String ║ decode(byte[] utf8) ║ Converts the provided byte array to a String using the UTF-8 encoding.
static String ║ decode(byte[] utf8, int start, int length) ║
static String ║ decode(byte[] utf8, int start, int length, boolean replace) ║ Converts the provided byte array to a String using the UTF-8 encoding.
static ByteBuffer ║ encode(String string) ║ Converts the provided String to bytes using the UTF-8 encoding.
static ByteBuffer ║ encode(String string, boolean replace) ║ Converts the provided String to bytes using the UTF-8 encoding.
boolean ║ equals(Object o) ║ Returns true iff o is a Text with the same contents.
int ║ find(String what) ║
int ║ find(String what, int start) ║ Finds any occurrence of what in the backing buffer, starting at position start.
byte[] ║ getBytes() ║ Returns the raw bytes; however, only data up to getLength() is valid.
int ║ getLength() ║ Returns the number of bytes in the byte array
int ║ hashCode() ║ Return a hash of the bytes returned from getBytes().
void ║ readFields(DataInput in) ║ deserialize
void ║ readFields(DataInput in, int maxLength) ║
static String ║ readString(DataInput in) ║ Read a UTF8 encoded string from in
static String ║ readString(DataInput in, int maxLength) ║ Read a UTF8 encoded string with a maximum size
void ║ readWithKnownLength(DataInput in, int len) ║ Read a Text object whose length is already known.
void ║ set(byte[] utf8) ║ Set to a utf8 byte array
void ║ set(byte[] utf8, int start, int len) ║ Set the Text to range of bytes
void ║ set(String string) ║ Set to contain the contents of a string.
void ║ set(Text other) ║ copy a text.
static void ║ skip(DataInput in) ║ Skips over one Text in the input.
String ║ toString() ║ Convert text back to string
static int ║ utf8Length(String string) ║ For the given string, returns the number of UTF-8 bytes required to encode the string.
static void ║ validateUTF8(byte[] utf8) ║ Check if a byte array contains valid utf-8
static void ║ validateUTF8(byte[] utf8, int start, int len) ║ Check to see if a byte array is valid utf-8
void ║ write(DataOutput out) ║ serialize: write this object to out; length uses zero-compressed encoding
void ║ write(DataOutput out, int maxLength) ║
static int ║ writeString(DataOutput out, String s) ║ Write a UTF8 encoded string to out
static int ║ writeString(DataOutput out, String s, int maxLength) ║ Write a UTF8 encoded string with a maximum size to out
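
Several of these methods (find, charAt, getLength) work in byte positions within the UTF-8 buffer, not in char indexes as String does. The sketch below illustrates the difference without depending on Hadoop: findBytes is a hypothetical, naive byte-level search written for this example, not the code of Text.find, and it is compared with String.indexOf on the same input.

```java
import java.nio.charset.StandardCharsets;

public class BytePositionDemo {
    /** Byte offset of the first match of needle in the UTF-8 form of haystack, or -1. */
    static int findBytes(String haystack, String needle, int start) {
        byte[] h = haystack.getBytes(StandardCharsets.UTF_8);
        byte[] n = needle.getBytes(StandardCharsets.UTF_8);
        outer:
        for (int i = Math.max(start, 0); i + n.length <= h.length; i++) {
            for (int j = 0; j < n.length; j++) {
                if (h[i + j] != n[j]) continue outer; // mismatch, try next offset
            }
            return i;
        }
        return -1;
    }

    public static void main(String[] args) {
        String s = "\u6C49\u5B57abc"; // two 3-byte characters followed by ASCII
        System.out.println(s.indexOf("abc"));        // 2: char index in UTF-16
        System.out.println(findBytes(s, "abc", 0));  // 6: byte offset in UTF-8
    }
}
```

When mixing Text and String in the same code, an offset obtained from one must not be used to index into the other unless the text is known to be pure ASCII.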


To verify the discussion above, a test example will be presented and analyzed next.


References:
http://hadoop.apache.org/docs/current/


