hadoop中的Text類,跟java中的String類很相似,在其定義的方法上,也多有相近之處,當然,由於用途的不同,兩者之間還是有很大的區別的,那麼,在分析Text類之前,先來回顧下java.lang.String類。
1、java中的String類:
字符串是常量;它們的值在創建之後不能更改。字符串緩衝區支持可變的字符串。因爲 String 對象是不可變的,所以可以共享。例如:
String str = "abc";
等效於:
char data[] = {'a', 'b', 'c'};
String str = new String(data);
下面給出了一些如何使用字符串的更多示例:
System.out.println("abc");
String cde = "cde";
System.out.println("abc" + cde);
String c = "abc".substring(2,3);
String d = cde.substring(1, 2);
String 類包括的方法可用於檢查序列的單個字符、比較字符串、搜索字符串、提取子字符串、創建字符串副本並將所有字符全部轉換爲大寫或小寫。大小寫映射基於 Character 類指定的 Unicode 標準版。
Java 語言提供對字符串串聯符號("+")以及將其他對象轉換爲字符串的特殊支持。字符串串聯是通過 StringBuilder(或 StringBuffer)類及其 append 方法實現的。字符串轉換是通過 toString 方法實現的,該方法由 Object 類定義,並可被 Java 中的所有類繼承。有關字符串串聯和轉換的更多信息,請參閱 Gosling、Joy 和 Steele 合著的 The Java Language Specification。
除非另行說明,否則將 null 參數傳遞給此類中的構造方法或方法將拋出 NullPointerException。
String 表示一個 UTF-16 格式的字符串,其中的增補字符 由代理項對 表示(有關詳細信息,請參閱 Character 類中的 Unicode 字符表示形式)。索引值是指 char 代碼單元,因此增補字符在 String 中佔用兩個位置。
String 類提供處理 Unicode 代碼點(即字符)和 Unicode 代碼單元(即 char 值)的方法。
2、Unicode
hadoop中的Text類和java中的String類都是使用標準的Unicode,但是在編碼方式上卻有不同之處,hadoop中的Text類使用修訂的UTF-8,而java中的String類使用的是UTF-16。接下來,對於Unicode做一個較爲詳細的闡述。Unicode
Unicode(統一碼、萬國碼、單一碼)是一種在計算機上使用的字符編碼。Unicode 是爲了解決傳統的字符編碼方案的侷限而產生的,它爲每種語言中的每個字符設定了統一併且唯一的二進制編碼,以滿足跨語言、跨平臺進行文本轉換、處理的要求。Unicode涉及到兩個步驟,首先是定義一個規範,給所有的字符指定一個唯一對應的數字,第二步纔是怎麼把字符對應的數字保存在計算機中,這才涉及到實際在計算機中佔多少字節空間。UTF-8 、UTF-16、UTF-32
那麼現在,我們的問題來了,UTF-8 、UTF-16、UTF-32之間到底有何區別呢?
於是又有個UTF-8,這裏的8非常容易誤導人,8不是指一個字節,難道一個字節表示一個字符?實際上不是。當用UTF-8時表示一個字符是可變的,有可能是用一個字節表示一個字符,也可能是兩個,三個,四個,當然最多不能超過4個字節,具體根據字符對應的數字大小來確定。實際表示ASCII字符的UNICODE字符,將會編碼成1個字節,並且UTF-8表示與ASCII字符表示是一樣的。所有其他的UNICODE字符轉化成UTF-8將需要至少2個字節。每個字節由一個換碼序列開始。第一個字節由唯一的換碼序列,由n位連續的1加一位0組成,首字節連續的1的個數表示字符編碼所需的字節數。
現在UTF-8和UTF-16的優劣很容易就看出來了:如果全部英文或英文與其他文字混合,但英文佔絕大部分,用UTF-8就比UTF-16節省了很多空間。
BYTE data_utf8[] = {0xE6, 0xB1, 0x89, 0xE5, 0xAD, 0x97}; // UTF-8編碼
WORD data_utf16[] = {0x6c49, 0x5b57}; // UTF-16編碼
DWORD data_utf32[] = {0x6c49, 0x5b57}; // UTF-32編碼
這裏用BYTE、WORD、DWORD分別表示無符號8位整數,無符號16位整數和無符號32位整數。UTF-8、UTF-16、UTF-32分別以BYTE、WORD、DWORD作爲編碼單位。“漢字”的UTF-8編碼需要6個字節。“漢字”的UTF-16編碼需要兩個WORD,大小是4個字節。“漢字”的UTF-32編碼需要兩個DWORD,大小是8個字節。根據字節序的不同,UTF-16可以被實現爲UTF-16LE或UTF-16BE,UTF-32可以被實現爲UTF-32LE或UTF-32BE。
下面分別介紹UTF-8、UTF-16、UTF-32。
UTF-8(注意:Text使用修訂的標準UTF-8編碼)
UTF-8以字節爲單位對Unicode進行編碼。從Unicode到UTF-8的編碼方式如下:
Unicode編碼(16進制) ║ UTF-8 字節流(二進制)
000000 - 00007F ║ 0xxxxxxx
000080 - 0007FF ║ 110xxxxx 10xxxxxx
000800 - 00FFFF ║ 1110xxxx 10xxxxxx 10xxxxxx
010000 - 10FFFF ║ 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
UTF-8的特點是對不同範圍的字符使用不同長度的編碼。對於0x00-0x7F之間的字符,UTF-8編碼與ASCII編碼完全相同。UTF-8編碼的最大長度是4個字節。從上表可以看出,4字節模板有21個x,即可以容納21位二進制數字。Unicode的最大碼位0x10FFFF也只有21位。
例1:“漢”字的Unicode編碼是0x6C49。0x6C49在0x0800-0xFFFF之間,使用用3字節模板了:1110xxxx 10xxxxxx 10xxxxxx。將0x6C49寫成二進制是:0110 1100 0100 1001, 用這個比特流依次代替模板中的x,得到:11100110 10110001 10001001,即E6 B1 89。
例2:Unicode編碼0x20C30在0x010000-0x10FFFF之間,使用用4字節模板了:11110xxx 10xxxxxx 10xxxxxx 10xxxxxx。將0x20C30寫成21位二進制數字(不足21位就在前面補0):0 0010 0000 1100 0011 0000,用這個比特流依次代替模板中的x,得到:11110000 10100000 10110000 10110000,即F0 A0 B0 B0。
UTF-16
UTF-16編碼以16位無符號整數爲單位。我們把Unicode編碼記作U。編碼規則如下:
如果U<0x10000,U的UTF-16編碼就是U對應的16位無符號整數(爲書寫簡便,下文將16位無符號整數記作WORD)。
如果U≥0x10000,我們先計算U'=U-0x10000,然後將U'寫成二進制形式:yyyy yyyy yyxx xxxx xxxx,U的UTF-16編碼(二進制)就是:110110yyyyyyyyyy 110111xxxxxxxxxx。
爲什麼U'可以被寫成20個二進制位?Unicode的最大碼位是0x10ffff,減去0x10000後,U'的最大值是0xfffff,所以肯定可以用20個二進制位表示。例如:Unicode編碼0x20C30,減去0x10000後,得到0x10C30,寫成二進制是:0001 0000 1100 0011 0000。用前10位依次替代模板中的y,用後10位依次替代模板中的x,就得到:1101100001000011 1101110000110000,即0xD843 0xDC30。
按照上述規則,Unicode編碼0x10000-0x10FFFF的UTF-16編碼有兩個WORD,第一個WORD的高6位是110110,第二個WORD的高6位是110111。可見,第一個WORD的取值範圍(二進制)是11011000 00000000到11011011 11111111,即0xD800-0xDBFF。第二個WORD的取值範圍(二進制)是11011100 00000000到11011111 11111111,即0xDC00-0xDFFF。
爲了將一個WORD的UTF-16編碼與兩個WORD的UTF-16編碼區分開來,Unicode編碼的設計者將0xD800-0xDFFF保留下來,並稱爲代理區(Surrogate):
D800-DB7F ║ High Surrogates ║ 高位替代
DB80-DBFF ║ High Private Use Surrogates ║ 高位專用替代
DC00-DFFF ║ Low Surrogates ║ 低位替代
高位替代就是指這個範圍的碼位是兩個WORD的UTF-16編碼的第一個WORD。低位替代就是指這個範圍的碼位是兩個WORD的UTF-16編碼的第二個WORD。那麼,高位專用替代是什麼意思?我們來解答這個問題,順便看看怎麼由UTF-16編碼推導Unicode編碼。
如果一個字符的UTF-16編碼的第一個WORD在0xDB80到0xDBFF之間,那麼它的Unicode編碼在什麼範圍內?我們知道第二個WORD的取值範圍是0xDC00-0xDFFF,所以這個字符的UTF-16編碼範圍應該是0xDB80 0xDC00到0xDBFF 0xDFFF。我們將這個範圍寫成二進制:
1101101110000000 11011100 00000000 - 1101101111111111 1101111111111111
按照編碼的相反步驟,取出高低WORD的後10位,並拼在一起,得到
1110 0000 0000 0000 0000 - 1111 1111 1111 1111 1111
即0xe0000-0xfffff,按照編碼的相反步驟再加上0x10000,得到0xf0000-0x10ffff。這就是UTF-16編碼的第一個WORD在0xdb80到0xdbff之間的Unicode編碼範圍,即平面15和平面16。因爲Unicode標準將平面15和平面16都作爲專用區,所以0xDB80到0xDBFF之間的保留碼位被稱作高位專用替代。
UTF-32
UTF-32編碼以32位無符號整數爲單位。Unicode的UTF-32編碼就是其對應的32位無符號整數。
字節序
根據字節序的不同,UTF-16可以被實現爲UTF-16LE或UTF-16BE,UTF-32可以被實現爲UTF-32LE或UTF-32BE。例如:
Unicode編碼 ║ UTF-16LE ║ UTF-16BE ║ UTF32-LE ║ UTF32-BE
0x006C49 ║ 49 6C ║ 6C 49 ║ 49 6C 00 00 ║ 00 00 6C 49
0x020C30 ║ 43 D8 30 DC ║ D8 43 DC 30 ║ 30 0C 02 00 ║ 00 02 0C 30
Text類
hadoop2.7.1中的Text源碼:
/**
* Licensed to the Apache Software Foundation (ASF) under one
* or more contributor license agreements. See the NOTICE file
* distributed with this work for additional information
* regarding copyright ownership. The ASF licenses this file
* to you under the Apache License, Version 2.0 (the
* "License"); you may not use this file except in compliance
* with the License. You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
package org.apache.hadoop.io;
import java.io.IOException;
import java.io.DataInput;
import java.io.DataOutput;
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.MalformedInputException;
import java.text.CharacterIterator;
import java.text.StringCharacterIterator;
import java.util.Arrays;
import org.apache.avro.reflect.Stringable;
import org.apache.hadoop.classification.InterfaceAudience;
import org.apache.hadoop.classification.InterfaceStability;
/** This class stores text using standard UTF8 encoding. It provides methods
* to serialize, deserialize, and compare texts at byte level. The type of
* length is integer and is serialized using zero-compressed format. <p>In
* addition, it provides methods for string traversal without converting the
* byte array to a string. <p>Also includes utilities for
* serializing/deserialing a string, coding/decoding a string, checking if a
* byte array contains valid UTF8 code, calculating the length of an encoded
* string.
*/
@Stringable
@InterfaceAudience.Public
@InterfaceStability.Stable
public class Text extends BinaryComparable
implements WritableComparable<BinaryComparable> {
private static ThreadLocal<CharsetEncoder> ENCODER_FACTORY =
new ThreadLocal<CharsetEncoder>() {
@Override
protected CharsetEncoder initialValue() {
return Charset.forName("UTF-8").newEncoder().
onMalformedInput(CodingErrorAction.REPORT).
onUnmappableCharacter(CodingErrorAction.REPORT);
}
};
private static ThreadLocal<CharsetDecoder> DECODER_FACTORY =
new ThreadLocal<CharsetDecoder>() {
@Override
protected CharsetDecoder initialValue() {
return Charset.forName("UTF-8").newDecoder().
onMalformedInput(CodingErrorAction.REPORT).
onUnmappableCharacter(CodingErrorAction.REPORT);
}
};
private static final byte [] EMPTY_BYTES = new byte[0];
private byte[] bytes;
private int length;
public Text() {
bytes = EMPTY_BYTES;
}
/** Construct from a string.
*/
public Text(String string) {
set(string);
}
/** Construct from another text. */
public Text(Text utf8) {
set(utf8);
}
/** Construct from a byte array.
*/
public Text(byte[] utf8) {
set(utf8);
}
/**
* Get a copy of the bytes that is exactly the length of the data.
* See {@link #getBytes()} for faster access to the underlying array.
*/
public byte[] copyBytes() {
byte[] result = new byte[length];
System.arraycopy(bytes, 0, result, 0, length);
return result;
}
/**
* Returns the raw bytes; however, only data up to {@link #getLength()} is
* valid. Please use {@link #copyBytes()} if you
* need the returned array to be precisely the length of the data.
*/
@Override
public byte[] getBytes() {
return bytes;
}
/** Returns the number of bytes in the byte array */
@Override
public int getLength() {
return length;
}
/**
* Returns the Unicode Scalar Value (32-bit integer value)
* for the character at <code>position</code>. Note that this
* method avoids using the converter or doing String instantiation
* @return the Unicode scalar value at position or -1
* if the position is invalid or points to a
* trailing byte
*/
public int charAt(int position) {
if (position > this.length) return -1; // too long
if (position < 0) return -1; // duh.
ByteBuffer bb = (ByteBuffer)ByteBuffer.wrap(bytes).position(position);
return bytesToCodePoint(bb.slice());
}
public int find(String what) {
return find(what, 0);
}
/**
* Finds any occurence of <code>what</code> in the backing
* buffer, starting as position <code>start</code>. The starting
* position is measured in bytes and the return value is in
* terms of byte position in the buffer. The backing buffer is
* not converted to a string for this operation.
* @return byte position of the first occurence of the search
* string in the UTF-8 buffer or -1 if not found
*/
public int find(String what, int start) {
try {
ByteBuffer src = ByteBuffer.wrap(this.bytes,0,this.length);
ByteBuffer tgt = encode(what);
byte b = tgt.get();
src.position(start);
while (src.hasRemaining()) {
if (b == src.get()) { // matching first byte
src.mark(); // save position in loop
tgt.mark(); // save position in target
boolean found = true;
int pos = src.position()-1;
while (tgt.hasRemaining()) {
if (!src.hasRemaining()) { // src expired first
tgt.reset();
src.reset();
found = false;
break;
}
if (!(tgt.get() == src.get())) {
tgt.reset();
src.reset();
found = false;
break; // no match
}
}
if (found) return pos;
}
}
return -1; // not found
} catch (CharacterCodingException e) {
// can't get here
e.printStackTrace();
return -1;
}
}
/** Set to contain the contents of a string.
*/
public void set(String string) {
try {
ByteBuffer bb = encode(string, true);
bytes = bb.array();
length = bb.limit();
}catch(CharacterCodingException e) {
throw new RuntimeException("Should not have happened ", e);
}
}
/** Set to a utf8 byte array
*/
public void set(byte[] utf8) {
set(utf8, 0, utf8.length);
}
/** copy a text. */
public void set(Text other) {
set(other.getBytes(), 0, other.getLength());
}
/**
* Set the Text to range of bytes
* @param utf8 the data to copy from
* @param start the first position of the new string
* @param len the number of bytes of the new string
*/
public void set(byte[] utf8, int start, int len) {
setCapacity(len, false);
System.arraycopy(utf8, start, bytes, 0, len);
this.length = len;
}
/**
* Append a range of bytes to the end of the given text
* @param utf8 the data to copy from
* @param start the first position to append from utf8
* @param len the number of bytes to append
*/
public void append(byte[] utf8, int start, int len) {
setCapacity(length + len, true);
System.arraycopy(utf8, start, bytes, length, len);
length += len;
}
/**
* Clear the string to empty.
*
* <em>Note</em>: For performance reasons, this call does not clear the
* underlying byte array that is retrievable via {@link #getBytes()}.
* In order to free the byte-array memory, call {@link #set(byte[])}
* with an empty byte array (For example, <code>new byte[0]</code>).
*/
public void clear() {
length = 0;
}
/*
* Sets the capacity of this Text object to <em>at least</em>
* <code>len</code> bytes. If the current buffer is longer,
* then the capacity and existing content of the buffer are
* unchanged. If <code>len</code> is larger
* than the current capacity, the Text object's capacity is
* increased to match.
* @param len the number of bytes we need
* @param keepData should the old data be kept
*/
private void setCapacity(int len, boolean keepData) {
if (bytes == null || bytes.length < len) {
if (bytes != null && keepData) {
bytes = Arrays.copyOf(bytes, Math.max(len,length << 1));
} else {
bytes = new byte[len];
}
}
}
/**
* Convert text back to string
* @see java.lang.Object#toString()
*/
@Override
public String toString() {
try {
return decode(bytes, 0, length);
} catch (CharacterCodingException e) {
throw new RuntimeException("Should not have happened " , e);
}
}
/** deserialize
*/
@Override
public void readFields(DataInput in) throws IOException {
int newLength = WritableUtils.readVInt(in);
readWithKnownLength(in, newLength);
}
public void readFields(DataInput in, int maxLength) throws IOException {
int newLength = WritableUtils.readVInt(in);
if (newLength < 0) {
throw new IOException("tried to deserialize " + newLength +
" bytes of data! newLength must be non-negative.");
} else if (newLength >= maxLength) {
throw new IOException("tried to deserialize " + newLength +
" bytes of data, but maxLength = " + maxLength);
}
readWithKnownLength(in, newLength);
}
/** Skips over one Text in the input. */
public static void skip(DataInput in) throws IOException {
int length = WritableUtils.readVInt(in);
WritableUtils.skipFully(in, length);
}
/**
* Read a Text object whose length is already known.
* This allows creating Text from a stream which uses a different serialization
* format.
*/
public void readWithKnownLength(DataInput in, int len) throws IOException {
setCapacity(len, false);
in.readFully(bytes, 0, len);
length = len;
}
/** serialize
* write this object to out
* length uses zero-compressed encoding
* @see Writable#write(DataOutput)
*/
@Override
public void write(DataOutput out) throws IOException {
WritableUtils.writeVInt(out, length);
out.write(bytes, 0, length);
}
public void write(DataOutput out, int maxLength) throws IOException {
if (length > maxLength) {
throw new IOException("data was too long to write! Expected " +
"less than or equal to " + maxLength + " bytes, but got " +
length + " bytes.");
}
WritableUtils.writeVInt(out, length);
out.write(bytes, 0, length);
}
/** Returns true iff <code>o</code> is a Text with the same contents. */
@Override
public boolean equals(Object o) {
if (o instanceof Text)
return super.equals(o);
return false;
}
@Override
public int hashCode() {
return super.hashCode();
}
/** A WritableComparator optimized for Text keys. */
public static class Comparator extends WritableComparator {
public Comparator() {
super(Text.class);
}
@Override
public int compare(byte[] b1, int s1, int l1,
byte[] b2, int s2, int l2) {
int n1 = WritableUtils.decodeVIntSize(b1[s1]);
int n2 = WritableUtils.decodeVIntSize(b2[s2]);
return compareBytes(b1, s1+n1, l1-n1, b2, s2+n2, l2-n2);
}
}
static {
// register this comparator
WritableComparator.define(Text.class, new Comparator());
}
/// STATIC UTILITIES FROM HERE DOWN
/**
* Converts the provided byte array to a String using the
* UTF-8 encoding. If the input is malformed,
* replace by a default value.
*/
public static String decode(byte[] utf8) throws CharacterCodingException {
return decode(ByteBuffer.wrap(utf8), true);
}
public static String decode(byte[] utf8, int start, int length)
throws CharacterCodingException {
return decode(ByteBuffer.wrap(utf8, start, length), true);
}
/**
* Converts the provided byte array to a String using the
* UTF-8 encoding. If <code>replace</code> is true, then
* malformed input is replaced with the
* substitution character, which is U+FFFD. Otherwise the
* method throws a MalformedInputException.
*/
public static String decode(byte[] utf8, int start, int length, boolean replace)
throws CharacterCodingException {
return decode(ByteBuffer.wrap(utf8, start, length), replace);
}
private static String decode(ByteBuffer utf8, boolean replace)
throws CharacterCodingException {
CharsetDecoder decoder = DECODER_FACTORY.get();
if (replace) {
decoder.onMalformedInput(
java.nio.charset.CodingErrorAction.REPLACE);
decoder.onUnmappableCharacter(CodingErrorAction.REPLACE);
}
String str = decoder.decode(utf8).toString();
// set decoder back to its default value: REPORT
if (replace) {
decoder.onMalformedInput(CodingErrorAction.REPORT);
decoder.onUnmappableCharacter(CodingErrorAction.REPORT);
}
return str;
}
/**
* Converts the provided String to bytes using the
* UTF-8 encoding. If the input is malformed,
* invalid chars are replaced by a default value.
* @return ByteBuffer: bytes stores at ByteBuffer.array()
* and length is ByteBuffer.limit()
*/
public static ByteBuffer encode(String string)
throws CharacterCodingException {
return encode(string, true);
}
/**
* Converts the provided String to bytes using the
* UTF-8 encoding. If <code>replace</code> is true, then
* malformed input is replaced with the
* substitution character, which is U+FFFD. Otherwise the
* method throws a MalformedInputException.
* @return ByteBuffer: bytes stores at ByteBuffer.array()
* and length is ByteBuffer.limit()
*/
public static ByteBuffer encode(String string, boolean replace)
throws CharacterCodingException {
CharsetEncoder encoder = ENCODER_FACTORY.get();
if (replace) {
encoder.onMalformedInput(CodingErrorAction.REPLACE);
encoder.onUnmappableCharacter(CodingErrorAction.REPLACE);
}
ByteBuffer bytes =
encoder.encode(CharBuffer.wrap(string.toCharArray()));
if (replace) {
encoder.onMalformedInput(CodingErrorAction.REPORT);
encoder.onUnmappableCharacter(CodingErrorAction.REPORT);
}
return bytes;
}
static final public int DEFAULT_MAX_LEN = 1024 * 1024;
/** Read a UTF8 encoded string from in
*/
public static String readString(DataInput in) throws IOException {
return readString(in, Integer.MAX_VALUE);
}
/** Read a UTF8 encoded string with a maximum size
*/
public static String readString(DataInput in, int maxLength)
throws IOException {
int length = WritableUtils.readVIntInRange(in, 0, maxLength);
byte [] bytes = new byte[length];
in.readFully(bytes, 0, length);
return decode(bytes);
}
/** Write a UTF8 encoded string to out
*/
public static int writeString(DataOutput out, String s) throws IOException {
ByteBuffer bytes = encode(s);
int length = bytes.limit();
WritableUtils.writeVInt(out, length);
out.write(bytes.array(), 0, length);
return length;
}
/** Write a UTF8 encoded string with a maximum size to out
*/
public static int writeString(DataOutput out, String s, int maxLength)
throws IOException {
ByteBuffer bytes = encode(s);
int length = bytes.limit();
if (length > maxLength) {
throw new IOException("string was too long to write! Expected " +
"less than or equal to " + maxLength + " bytes, but got " +
length + " bytes.");
}
WritableUtils.writeVInt(out, length);
out.write(bytes.array(), 0, length);
return length;
}
////// states for validateUTF8
private static final int LEAD_BYTE = 0;
private static final int TRAIL_BYTE_1 = 1;
private static final int TRAIL_BYTE = 2;
/**
* Check if a byte array contains valid utf-8
* @param utf8 byte array
* @throws MalformedInputException if the byte array contains invalid utf-8
*/
public static void validateUTF8(byte[] utf8) throws MalformedInputException {
validateUTF8(utf8, 0, utf8.length);
}
/**
* Check to see if a byte array is valid utf-8
* @param utf8 the array of bytes
* @param start the offset of the first byte in the array
* @param len the length of the byte sequence
* @throws MalformedInputException if the byte array contains invalid bytes
*/
public static void validateUTF8(byte[] utf8, int start, int len)
throws MalformedInputException {
int count = start;
int leadByte = 0;
int length = 0;
int state = LEAD_BYTE;
while (count < start+len) {
int aByte = utf8[count] & 0xFF;
switch (state) {
case LEAD_BYTE:
leadByte = aByte;
length = bytesFromUTF8[aByte];
switch (length) {
case 0: // check for ASCII
if (leadByte > 0x7F)
throw new MalformedInputException(count);
break;
case 1:
if (leadByte < 0xC2 || leadByte > 0xDF)
throw new MalformedInputException(count);
state = TRAIL_BYTE_1;
break;
case 2:
if (leadByte < 0xE0 || leadByte > 0xEF)
throw new MalformedInputException(count);
state = TRAIL_BYTE_1;
break;
case 3:
if (leadByte < 0xF0 || leadByte > 0xF4)
throw new MalformedInputException(count);
state = TRAIL_BYTE_1;
break;
default:
// too long! Longest valid UTF-8 is 4 bytes (lead + three)
// or if < 0 we got a trail byte in the lead byte position
throw new MalformedInputException(count);
} // switch (length)
break;
case TRAIL_BYTE_1:
if (leadByte == 0xF0 && aByte < 0x90)
throw new MalformedInputException(count);
if (leadByte == 0xF4 && aByte > 0x8F)
throw new MalformedInputException(count);
if (leadByte == 0xE0 && aByte < 0xA0)
throw new MalformedInputException(count);
if (leadByte == 0xED && aByte > 0x9F)
throw new MalformedInputException(count);
// falls through to regular trail-byte test!!
case TRAIL_BYTE:
if (aByte < 0x80 || aByte > 0xBF)
throw new MalformedInputException(count);
if (--length == 0) {
state = LEAD_BYTE;
} else {
state = TRAIL_BYTE;
}
break;
default:
break;
} // switch (state)
count++;
}
}
/**
* Magic numbers for UTF-8. These are the number of bytes
* that <em>follow</em> a given lead byte. Trailing bytes
* have the value -1. The values 4 and 5 are presented in
* this table, even though valid UTF-8 cannot include the
* five and six byte sequences.
*/
static final int[] bytesFromUTF8 =
{ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0,
// trail bytes
-1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
-1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
-1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
-1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 3,
3, 3, 3, 3, 3, 3, 3, 4, 4, 4, 4, 5, 5, 5, 5 };
/**
* Returns the next code point at the current position in
* the buffer. The buffer's position will be incremented.
* Any mark set on this buffer will be changed by this method!
*/
public static int bytesToCodePoint(ByteBuffer bytes) {
bytes.mark();
byte b = bytes.get();
bytes.reset();
int extraBytesToRead = bytesFromUTF8[(b & 0xFF)];
if (extraBytesToRead < 0) return -1; // trailing byte!
int ch = 0;
switch (extraBytesToRead) {
case 5: ch += (bytes.get() & 0xFF); ch <<= 6; /* remember, illegal UTF-8 */
case 4: ch += (bytes.get() & 0xFF); ch <<= 6; /* remember, illegal UTF-8 */
case 3: ch += (bytes.get() & 0xFF); ch <<= 6;
case 2: ch += (bytes.get() & 0xFF); ch <<= 6;
case 1: ch += (bytes.get() & 0xFF); ch <<= 6;
case 0: ch += (bytes.get() & 0xFF);
}
ch -= offsetsFromUTF8[extraBytesToRead];
return ch;
}
static final int offsetsFromUTF8[] =
{ 0x00000000, 0x00003080,
0x000E2080, 0x03C82080, 0xFA082080, 0x82082080 };
/**
* For the given string, returns the number of UTF-8 bytes
* required to encode the string.
* @param string text to encode
* @return number of UTF-8 bytes required to encode
*/
public static int utf8Length(String string) {
CharacterIterator iter = new StringCharacterIterator(string);
char ch = iter.first();
int size = 0;
while (ch != CharacterIterator.DONE) {
if ((ch >= 0xD800) && (ch < 0xDC00)) {
// surrogate pair?
char trail = iter.next();
if ((trail > 0xDBFF) && (trail < 0xE000)) {
// valid pair
size += 4;
} else {
// invalid pair
size += 3;
iter.previous(); // rewind one
}
} else if (ch < 0x80) {
size++;
} else if (ch < 0x800) {
size += 2;
} else {
// ch < 0x10000, that is, the largest char value
size += 3;
}
ch = iter.next();
}
return size;
}
}
下面是Text類實現的方法:
Modifier and Type | Method and Description |
---|---|
void |
append(byte[] utf8, int start, int len)
Append a range of bytes to the end of the given text
|
static int |
bytesToCodePoint(ByteBuffer bytes)
Returns the next code point at the current position in the buffer.
|
int |
charAt(int position)
Returns the Unicode Scalar Value (32-bit integer value) for the character at
position . |
void |
clear()
Clear the string to empty.
|
byte[] |
copyBytes()
Get a copy of the bytes that is exactly the length of the data.
|
static
String |
decode(byte[] utf8)
Converts the provided byte array to a String using the UTF-8 encoding.
|
static
String |
decode(byte[] utf8, int start, int length) |
static
String |
decode(byte[] utf8, int start, int length, boolean replace)
Converts the provided byte array to a String using the UTF-8 encoding.
|
static
ByteBuffer |
encode(String string)
Converts the provided String to bytes using the UTF-8 encoding.
|
static
ByteBuffer |
encode(String string,
boolean replace)
Converts the provided String to bytes using the UTF-8 encoding.
|
boolean |
equals(Object o)
Returns true iff
o is a Text with the same contents. |
int |
find(String what) |
int |
find(String what,
int start)
Finds any occurence of
what in the backing buffer, starting as positionstart . |
byte[] |
getBytes()
Returns the raw bytes; however, only data up to
getLength() is valid. |
int |
getLength()
Returns the number of bytes in the byte array
|
int |
hashCode()
Return a hash of the bytes returned from {#getBytes()}.
|
void |
readFields(DataInput in)
deserialize
|
void |
readFields(DataInput in,
int maxLength) |
static
String |
readString(DataInput in)
Read a UTF8 encoded string from in
|
static
String |
readString(DataInput in,
int maxLength)
Read a UTF8 encoded string with a maximum size
|
void |
readWithKnownLength(DataInput in,
int len)
Read a Text object whose length is already known.
|
void |
set(byte[] utf8)
Set to a utf8 byte array
|
void |
set(byte[] utf8, int start, int len)
Set the Text to range of bytes
|
void |
set(String string)
Set to contain the contents of a string.
|
void |
set(Text other)
copy a text.
|
static void |
skip(DataInput in)
Skips over one Text in the input.
|
String |
toString()
Convert text back to string
|
static int |
utf8Length(String string)
For the given string, returns the number of UTF-8 bytes required to encode the string.
|
static void |
validateUTF8(byte[] utf8)
Check if a byte array contains valid utf-8
|
static void |
validateUTF8(byte[] utf8, int start, int len)
Check to see if a byte array is valid utf-8
|
void |
write(DataOutput out)
serialize write this object to out length uses zero-compressed encoding
|
void |
write(DataOutput out,
int maxLength) |
static int |
writeString(DataOutput out,String s)
Write a UTF8 encoded string to out
|
static int |
writeString(DataOutput out,String s,
int maxLength)
Write a UTF8 encoded string with a maximum size to out
|
爲對上面闡述進行驗證,接下來將列舉一個測試實例,並進行分析。
參考:
http://hadoop.apache.org/docs/current/