1 Problem Statement
The minimum unit of input data for a Map task is an InputSplit. When a text file of line-oriented records is larger than 128 MB, HDFS stores it as multiple blocks, and the block boundaries do not necessarily fall at line endings. This raises two questions:
1. A Hadoop block is 128 MB by default, so for a line-oriented text file, can a single record line end up divided between two blocks?
2. When the file is read back from its blocks and carved into splits, can a single record line be divided between two InputSplits? If so, one InputSplit would contain an incomplete line — would the Map task processing that InputSplit then produce an incorrect result?
To answer these two questions, two concepts must be kept clearly distinct: Block and InputSplit:
1. A Block is the unit in which HDFS stores a file (128 MB by default).
2. An InputSplit is the unit of input that MapReduce processes and computes over. It is purely a logical concept: an InputSplit does not physically cut the file; it merely records the location of the data to be processed (the file's path and hosts) and its extent (determined by start and length).
Therefore, in a line-oriented text file, a single record line may indeed be split across different Blocks, and even across different DataNodes. And by analyzing the getSplits method of FileInputFormat, we can see that a record line may likewise be split across different InputSplits.
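To make the first point concrete, here is a tiny standalone sketch (a hypothetical helper, not part of Hadoop) that computes, from plain byte-offset arithmetic, which 128 MB blocks a record's byte range touches:

```java
// Toy arithmetic (not a Hadoop API): which 128 MB HDFS blocks does a
// record occupy, given its byte offset and length within the file?
public class BlockSpan {
    static final long BLOCK_SIZE = 128L * 1024 * 1024;

    // First and last block index covered by the range [offset, offset+len-1].
    static long[] blocksFor(long offset, long len) {
        return new long[]{offset / BLOCK_SIZE, (offset + len - 1) / BLOCK_SIZE};
    }

    public static void main(String[] args) {
        // A 100-byte record that starts 40 bytes before the first block boundary
        long[] span = blocksFor(BLOCK_SIZE - 40, 100);
        System.out.println("blocks " + span[0] + " to " + span[1]); // blocks 0 to 1
    }
}
```

Any record that starts within `len - 1` bytes of a block boundary straddles two blocks, and since block placement ignores record structure, those two blocks may live on different DataNodes.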
2 Source Code Analysis
```java
/**
 * Generate the list of files and make them into FileSplits.
 * @param job the job context
 * @throws IOException
 */
public List<InputSplit> getSplits(JobContext job) throws IOException {
  long minSize = Math.max(getFormatMinSplitSize(), getMinSplitSize(job));
  long maxSize = getMaxSplitSize(job);

  // generate splits
  List<InputSplit> splits = new ArrayList<InputSplit>();
  List<FileStatus> files = listStatus(job);
  for (FileStatus file: files) {
    Path path = file.getPath();
    long length = file.getLen();
    if (length != 0) {
      BlockLocation[] blkLocations;
      if (file instanceof LocatedFileStatus) {
        blkLocations = ((LocatedFileStatus) file).getBlockLocations();
      } else {
        FileSystem fs = path.getFileSystem(job.getConfiguration());
        blkLocations = fs.getFileBlockLocations(file, 0, length);
      }
      if (isSplitable(job, path)) {
        long blockSize = file.getBlockSize();
        long splitSize = computeSplitSize(blockSize, minSize, maxSize);

        long bytesRemaining = length;
        while (((double) bytesRemaining)/splitSize > SPLIT_SLOP) {
          int blkIndex = getBlockIndex(blkLocations, length-bytesRemaining);
          splits.add(makeSplit(path, length-bytesRemaining, splitSize,
                               blkLocations[blkIndex].getHosts()));
          bytesRemaining -= splitSize;
        }

        if (bytesRemaining != 0) {
          int blkIndex = getBlockIndex(blkLocations, length-bytesRemaining);
          splits.add(makeSplit(path, length-bytesRemaining, bytesRemaining,
                               blkLocations[blkIndex].getHosts()));
        }
      } else { // not splitable
        splits.add(makeSplit(path, 0, length, blkLocations[0].getHosts()));
      }
    } else {
      // Create empty hosts array for zero length files
      splits.add(makeSplit(path, 0, length, new String[0]));
    }
  }
  // Save the number of input files for metrics/loadgen
  job.getConfiguration().setLong(NUM_INPUT_FILES, files.size());
  LOG.debug("Total # of splits: " + splits.size());
  return splits;
}
```
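The split arithmetic in getSplits can be condensed into a small standalone sketch (hypothetical class and method names, not the Hadoop API; SPLIT_SLOP = 1.1 matches the constant in FileInputFormat):

```java
// Standalone sketch of how getSplits derives split boundaries.
import java.util.ArrayList;
import java.util.List;

public class SplitSketch {
    // Same formula as FileInputFormat.computeSplitSize.
    static long computeSplitSize(long blockSize, long minSize, long maxSize) {
        return Math.max(minSize, Math.min(maxSize, blockSize));
    }

    // SPLIT_SLOP = 1.1: the final chunk may be up to 10% larger than
    // splitSize instead of spawning a tiny trailing split.
    static final double SPLIT_SLOP = 1.1;

    // Returns {offset, length} pairs, mirroring the loop in getSplits.
    static List<long[]> splits(long fileLength, long splitSize) {
        List<long[]> result = new ArrayList<>();
        long bytesRemaining = fileLength;
        while (((double) bytesRemaining) / splitSize > SPLIT_SLOP) {
            result.add(new long[]{fileLength - bytesRemaining, splitSize});
            bytesRemaining -= splitSize;
        }
        if (bytesRemaining != 0) {
            result.add(new long[]{fileLength - bytesRemaining, bytesRemaining});
        }
        return result;
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        // A 300 MB file with the default 128 MB block/split size
        // yields splits of 128 MB, 128 MB, and 44 MB:
        for (long[] s : splits(300 * mb, computeSplitSize(128 * mb, 1, Long.MAX_VALUE))) {
            System.out.println("offset=" + s[0] + " length=" + s[1]);
        }
    }
}
```

Note the effect of SPLIT_SLOP: a 130 MB file produces a single 130 MB split rather than a 128 MB split plus a 2 MB fragment, because 130/128 ≈ 1.016 ≤ 1.1.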
As the code above shows, splitting a file is actually quite simple: obtain the file's path and Block information on HDFS, then cut the file according to splitSize, where splitSize = computeSplitSize(blockSize, minSize, maxSize). maxSize, minSize, and blockSize are all configurable; by default, splitSize equals the default blockSize (128 MB). FileInputFormat cuts the file strictly by byte offset, so a long record line may well be split across different InputSplits. This does not affect the Map tasks, however: even though a line may span InputSplits, the RecordReader associated with FileInputFormat is designed to be robust enough that, when a line crosses an InputSplit boundary, it keeps reading into the next InputSplit until the whole line has been read. Let us take the most common case, TextInputFormat, and analyze in its source code how a line spanning InputSplits is handled. TextInputFormat is associated with LineRecordReader; first, look at the file-reading code in LineRecordReader's nextKeyValue method:
```java
public boolean nextKeyValue() throws IOException {
  if (key == null) {
    key = new LongWritable();
  }
  key.set(pos);
  if (value == null) {
    value = new Text();
  }
  int newSize = 0;
  // We always read one extra line, which lies outside the upper
  // split limit i.e. (end - 1)
  while (getFilePosition() <= end) {
    newSize = in.readLine(value, maxLineLength,
        Math.max(maxBytesToConsume(pos), maxLineLength));
    pos += newSize;
    if (newSize < maxLineLength) {
      break;
    }
    // line too long. try again
    LOG.info("Skipped line of size " + newSize + " at pos " +
             (pos - newSize));
  }
  if (newSize == 0) {
    key = null;
    value = null;
    return false;
  } else {
    return true;
  }
}
```
1. The file is read via the readLine method of LineReader (`in` is a LineReader instance). The key logic lives in this readLine method, and it boils down to three points:
A. Data is always read from the buffer; once the buffer is exhausted, the next batch of data is first loaded into it.
B. The buffer is searched for the end-of-line marker, and the data from the start position up to the line end is copied into str (which ultimately becomes the Value). If no line end is found, more data is loaded into the buffer and the search continues.
C. The crucial point: the data loaded into the buffer is read directly from the file, with no regard whatsoever for the split boundary; reading continues until the current line ends.
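Points A–C can be illustrated with a deliberately simplified toy reader (hypothetical code, not Hadoop's LineReader): it keeps refilling a small buffer from the underlying stream and appends to the current line until it sees '\n', never consulting any split boundary.

```java
// Minimal sketch of buffered line reading across refills; the tiny
// buffer forces a long line to span multiple refill cycles, just as a
// line spanning an InputSplit spans multiple buffer loads in Hadoop.
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;

public class TinyLineReader {
    private final InputStream in;
    private final byte[] buffer = new byte[8]; // tiny on purpose, to force refills
    private int bufferPosn = 0, bufferLength = 0;

    TinyLineReader(InputStream in) { this.in = in; }

    // Returns the next line without its terminator, or null at EOF.
    String readLine() throws IOException {
        StringBuilder line = new StringBuilder();
        boolean sawAny = false;
        while (true) {
            if (bufferPosn >= bufferLength) {      // A: buffer exhausted -> refill
                bufferLength = in.read(buffer);
                bufferPosn = 0;
                if (bufferLength <= 0) break;      // EOF
            }
            sawAny = true;
            byte b = buffer[bufferPosn++];
            if (b == '\n') return line.toString(); // B: found the line end
            line.append((char) b);                 // C: keep consuming past any boundary
        }
        return sawAny ? line.toString() : null;
    }

    public static void main(String[] args) throws IOException {
        String data = "a long line crossing buffers\nshort\n";
        TinyLineReader r = new TinyLineReader(new ByteArrayInputStream(data.getBytes()));
        System.out.println(r.readLine()); // a long line crossing buffers
        System.out.println(r.readLine()); // short
    }
}
```

The real readDefaultLine below follows the same skeleton, with extra logic for CR/CRLF terminators, maxLineLength, and maxBytesToConsume.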
```java
/**
 * Read a line terminated by one of CR, LF, or CRLF.
 */
private int readDefaultLine(Text str, int maxLineLength, int maxBytesToConsume)
    throws IOException {
  /* We're reading data from in, but the head of the stream may be
   * already buffered in buffer, so we have several cases:
   * 1. No newline characters are in the buffer, so we need to copy
   *    everything and read another buffer from the stream.
   * 2. An unambiguously terminated line is in buffer, so we just
   *    copy to str.
   * 3. Ambiguously terminated line is in buffer, i.e. buffer ends
   *    in CR. In this case we copy everything up to CR to str, but
   *    we also need to see what follows CR: if it's LF, then we
   *    need consume LF as well, so next call to readLine will read
   *    from after that.
   * We use a flag prevCharCR to signal if previous character was CR
   * and, if it happens to be at the end of the buffer, delay
   * consuming it until we have a chance to look at the char that
   * follows.
   */
  str.clear();
  int txtLength = 0; //tracks str.getLength(), as an optimization
  int newlineLength = 0; //length of terminating newline
  boolean prevCharCR = false; //true if prev char was CR
  long bytesConsumed = 0;
  do {
    int startPosn = bufferPosn; //starting from where we left off the last time
    // If the buffer has been fully consumed, load the next batch of data first
    if (bufferPosn >= bufferLength) {
      startPosn = bufferPosn = 0;
      if (prevCharCR) {
        ++bytesConsumed; //account for CR from previous read
      }
      bufferLength = in.read(buffer);
      if (bufferLength <= 0) {
        break; // EOF
      }
    }
    // Note: different operating systems define the line terminator differently:
    //   UNIX:    '\n'   (LF)
    //   Mac:     '\r'   (CR)
    //   Windows: '\r\n' (CR)(LF)
    // To identify the end of a line correctly, the logic is:
    // 1. If the current character is LF, we are certainly at a line end, but the
    //    previous character matters: if it was CR, this is a Windows file and the
    //    terminator length (the variable newlineLength) is 2; otherwise it is a
    //    UNIX file and the terminator length is 1.
    // 2. If the current character is not LF but the previous one was CR, the
    //    previous character already ended the line: this is a Mac file.
    // 3. If the current character is CR, the terminator length depends on whether
    //    the next character is LF, so we only mark prevCharCR = true and decide
    //    when reading the next character.
    for (; bufferPosn < bufferLength; ++bufferPosn) { //search for newline
      if (buffer[bufferPosn] == LF) { // found the '\n' line-feed character
        newlineLength = (prevCharCR) ? 2 : 1;
        ++bufferPosn; // at next invocation proceed from following byte
        break;
      }
      if (prevCharCR) { //CR + notLF, we are at notLF
        newlineLength = 1;
        break;
      }
      prevCharCR = (buffer[bufferPosn] == CR); // found the '\r' carriage-return character
    }
    int readLength = bufferPosn - startPosn;
    if (prevCharCR && newlineLength == 0) {
      --readLength; //CR at the end of the buffer
    }
    bytesConsumed += readLength;
    int appendLength = readLength - newlineLength;
    if (appendLength > maxLineLength - txtLength) {
      appendLength = maxLineLength - txtLength;
    }
    if (appendLength > 0) {
      str.append(buffer, startPosn, appendLength);
      txtLength += appendLength;
    }
    // newlineLength == 0 means no line end has been found yet, so the loop keeps
    // pulling more data from the file input stream.
    // A crucial detail: `in` is created in the method
    // org.apache.hadoop.mapreduce.lib.input.LineRecordReader.initialize(InputSplit, TaskAttemptContext)
    // via: FSDataInputStream fileIn = fs.open(split.getPath()); From this we can see:
    // when LineRecordReader reads a line, it always reads the complete line,
    // entirely unaffected by the FileSplit, because it reads from the file the
    // split belongs to rather than being confined to the split's boundaries.
    // Hence no "broken line" can ever be produced!
  } while (newlineLength == 0 && bytesConsumed < maxBytesToConsume);

  if (bytesConsumed > (long)Integer.MAX_VALUE) {
    throw new IOException("Too many bytes before newline: " + bytesConsumed);
  }
  return (int)bytesConsumed;
}
```
2. Given readLine's behavior above, when a line crosses a split boundary the reader pulls data from the beginning of the next split to assemble the complete line. How, then, does the next split's reader know whether its opening line has already been consumed by the previous split's LineRecordReader, so that lines are neither missed nor read twice? LineRecordReader uses a simple but clever scheme: since there is no way to tell whether the first line of a split is a self-contained line or the tail of a line that was cut, it simply skips the first line of every split (except, of course, the first split) and starts reading from the second line; then, on reaching the end of the split, it always reads one extra line. The data thus joins up seamlessly while sidestepping the broken-line problem. Here is the relevant source code:
```java
// If this is not the first split, we always throw away first record
// because we always (except the last split) read one extra line in
// next() method.
if (start != 0) { // every InputSplit except the first discards its first line
  start += in.readLine(new Text(), 0, maxBytesToConsume(start));
}
this.pos = start;
```
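The whole boundary protocol — skip the first line unless start == 0, and always read one extra line at the end of the split — can be simulated end to end with a toy reader over an in-memory file (hypothetical code with byte-offset splits; not the Hadoop API):

```java
// Simulates two adjacent splits over the same byte array and shows that
// every line is emitted by exactly one split, even when a line straddles
// the split boundary.
import java.util.ArrayList;
import java.util.List;

public class SplitProtocol {
    // Read the lines that belong to split [start, end).
    static List<String> readSplit(byte[] data, long start, long end) {
        List<String> out = new ArrayList<>();
        long pos = start;
        if (start != 0) {                  // not the first split: discard the (possibly partial) first line
            pos += lineLength(data, pos);
        }
        while (pos <= end && pos < data.length) { // <= end: keep the line that begins at the boundary
            int len = lineLength(data, pos);
            out.add(new String(data, (int) pos, stripNl(data, (int) pos, len)));
            pos += len;
        }
        return out;
    }

    // Bytes consumed by the line starting at pos, including its '\n' if present.
    static int lineLength(byte[] data, long pos) {
        int i = (int) pos;
        while (i < data.length && data[i] != '\n') i++;
        return i - (int) pos + (i < data.length ? 1 : 0);
    }

    static int stripNl(byte[] data, int pos, int len) {
        return (len > 0 && data[pos + len - 1] == '\n') ? len - 1 : len;
    }

    public static void main(String[] args) {
        byte[] data = "aaaa\nbbbb\ncccc\ndddd\n".getBytes();
        // Two 10-byte splits; the first reads one extra line ("cccc", which
        // starts exactly at offset 10), and the second skips that same line.
        System.out.println(readSplit(data, 0, 10));  // [aaaa, bbbb, cccc]
        System.out.println(readSplit(data, 10, 20)); // [dddd]
    }
}
```

Together the two rules are complementary: whatever the second reader skips is exactly what the first reader picked up as its extra line, so coverage is complete and duplicate-free.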
3. Correspondingly, in the method LineRecordReader uses to decide whether there is a next line, org.apache.hadoop.mapreduce.lib.input.LineRecordReader.nextKeyValue(), the condition of the while loop guarantees correct reading across the InputSplit boundary: the loop runs while the current position is less than or equal to the end of the split. In other words, even when the current position sits exactly at the end of the split, the while body executes one more time, and what it reads then is clearly the opening line of the next split.
```java
public boolean nextKeyValue() throws IOException {
  if (key == null) {
    key = new LongWritable();
  }
  key.set(pos);
  if (value == null) {
    value = new Text();
  }
  int newSize = 0;
  // We always read one extra line, which lies outside the upper
  // split limit i.e. (end - 1)
  while (getFilePosition() <= end) { // this condition handles reading across the split boundary
    newSize = in.readLine(value, maxLineLength,
        Math.max(maxBytesToConsume(pos), maxLineLength));
    pos += newSize;
    if (newSize < maxLineLength) {
      break;
    }
    // line too long. try again
    LOG.info("Skipped line of size " + newSize + " at pos " +
             (pos - newSize));
  }
  if (newSize == 0) {
    key = null;
    value = null;
    return false;
  } else {
    return true;
  }
}
```
With the source analysis above, we now understand clearly how TextInputFormat handles lines that span Blocks and InputSplits. Consequently, when we implement our own InputFormat, we will face the same continuity problem when parsing records across split boundaries.
Original post: http://liangjf85-163-com.iteye.com/blog/2122583