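Hadoop Source Code Analysis: File Splitting

Introduction to InputFormat

When writing a MapReduce program, we always set an input format so that Hadoop can read and process the data correctly according to the configured file format. The setting usually looks like this: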

job.setInputFormatClass(TextInputFormat.class);

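The code above ensures that the input files are read in the format we want. Every input format extends InputFormat, an abstract class whose subclasses include FileInputFormat for reading ordinary files, DBInputFormat for reading from databases, TableInputFormat for reading from HBase, and so on. InputFormat is defined as follows: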

public abstract class InputFormat<K, V> {
  /** Split the input into InputSplits **/
  public abstract List<InputSplit> getSplits(JobContext context) 
      throws IOException, InterruptedException;
    
  /** Create the reader that produces the map input **/
  public abstract RecordReader<K,V> createRecordReader(
      InputSplit split, TaskAttemptContext context) 
      throws IOException, InterruptedException;
}

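From the definition we can see that InputFormat does two things. First, it splits the input files according to the input format; each resulting split is handled by its own Mapper. Second, it produces the Mapper's input data: based on the input format it chooses a suitable RecordReader, which keeps reading records from the file and passes them to our custom map function. This article analyzes the source code behind file splitting for TextInputFormat in depth, so that readers can understand how it works.

TextInputFormat

TextInputFormat extends FileInputFormat. It leaves the file-splitting logic unchanged and customizes only the RecordReader.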

org.apache.hadoop.mapreduce.lib.input.FileInputFormat
org.apache.hadoop.mapreduce.lib.input.TextInputFormat

File Splitting

How File Splitting Works

public static final String SPLIT_MAXSIZE = "mapreduce.input.fileinputformat.split.maxsize";
public static final String SPLIT_MINSIZE = "mapreduce.input.fileinputformat.split.minsize";

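FileInputFormat uses three parameters that together control the split size:

Minimum number of bytes in a file split: mapreduce.input.fileinputformat.split.minsize
Maximum number of bytes in a file split: mapreduce.input.fileinputformat.split.maxsize
HDFS block size: dfs.blocksize

These three parameters determine the split size through the formula splitSize = max(minimumSize, min(maximumSize, blockSize)), so the final split size can be tuned by changing any of them.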
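
As a quick check of the formula, here is a minimal sketch that re-implements the calculation in plain Java so that it runs without Hadoop on the classpath. The class name SplitSizeSketch and the sample sizes are assumptions made for illustration. The values used when neither limit is set (a minimum of 1 byte and a maximum of Long.MAX_VALUE) are what we understand FileInputFormat to fall back to, so treat them as assumptions too.

```java
class SplitSizeSketch {
    static final long MB = 1024L * 1024L;

    /** Same formula as in the text: max(minimumSize, min(maximumSize, blockSize)). */
    static long computeSplitSize(long blockSize, long minSize, long maxSize) {
        return Math.max(minSize, Math.min(maxSize, blockSize));
    }

    public static void main(String[] args) {
        // Neither limit set (assumed defaults): the split size equals the block size.
        System.out.println(computeSplitSize(128 * MB, 1L, Long.MAX_VALUE) / MB);
        // maxsize below the block size: splits shrink to maxsize.
        System.out.println(computeSplitSize(128 * MB, 1L, 64 * MB) / MB);
        // minsize above the block size: splits grow past one block.
        System.out.println(computeSplitSize(128 * MB, 256 * MB, Long.MAX_VALUE) / MB);
    }
}
```

The three calls print 128, 64 and 256. In other words, lowering maxsize is the way to get more, smaller splits, and raising minsize is the way to get fewer, larger ones.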

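File Splitting Source Code Analysis

Let us start with a concrete example. Suppose the input file is 128 MB and the HDFS block size is 40 MB, so the file occupies 4 blocks (three of 40 MB and one of 8 MB). Suppose the maximum split size is set to 30 MB and the minimum split size is no larger than that. By the formula max(minimumSize, min(maximumSize, blockSize)), the split size is 30 MB. Next we look at how the file is cut to produce InputSplit objects, which mainly means analyzing the getSplits function.

First, obtain minimumSize and maximumSize: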

    /** Minimum split length: max(1, mapreduce.input.fileinputformat.split.minsize) **/
    long minSize = Math.max(getFormatMinSplitSize(), getMinSplitSize(job));
    /** Maximum split length: mapreduce.input.fileinputformat.split.maxsize **/
    long maxSize = getMaxSplitSize(job);

Get the status information of every file under the input directories:
```
List<FileStatus> files = listStatus(job);
```

Iterate over the files and split each one:

for (FileStatus file: files) {
    Path path = file.getPath();
    long length = file.getLen();
    ...
}

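When the file length is not 0, first get the file's block location information. When the file length is 0, a split with an empty hosts array is created for it directly.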

BlockLocation[] blkLocations; 
/** Get the block locations of this file **/
if (file instanceof LocatedFileStatus) {
    blkLocations = ((LocatedFileStatus) file).getBlockLocations();
} else {
    FileSystem fs = path.getFileSystem(job.getConfiguration());
    blkLocations = fs.getFileBlockLocations(file, 0, length);
}

Then, if the file is splittable, get the blockSize, compute the split size, and cut the file into splits of that size.

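/** Block size of the file: 128 MB by default **/
long blockSize = file.getBlockSize();
/** Determine the split size from the block size and the minimum and maximum split sizes.
    Formula: max(minimumSize, min(maximumSize, blockSize)) **/
long splitSize = computeSplitSize(blockSize, minSize, maxSize);

long bytesRemaining = length; // file length that still has to be split
/** SPLIT_SLOP is 1.1, which avoids creating a separate split (and map task) for a small leftover tail **/
while (((double) bytesRemaining)/splitSize > SPLIT_SLOP) {
    /** Find the block that contains this offset: length-bytesRemaining is the offset (total length minus the length not yet split) **/
    int blkIndex = getBlockIndex(blkLocations, length-bytesRemaining);
    /** Create the split, which records the metadata of what this map task reads:
    file path -- start offset -- length to read -- hosts holding the block -- hosts holding a cached copy of the block **/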
    splits.add(makeSplit(path, length-bytesRemaining, splitSize,
                         blkLocations[blkIndex].getHosts(),
                         blkLocations[blkIndex].getCachedHosts()));
    bytesRemaining -= splitSize;
}

/** The loop condition no longer holds, but some data is still left **/
if (bytesRemaining != 0) {
    int blkIndex = getBlockIndex(blkLocations, length-bytesRemaining);
    splits.add(makeSplit(path, length-bytesRemaining, bytesRemaining,
                         blkLocations[blkIndex].getHosts(),
                         blkLocations[blkIndex].getCachedHosts()));
}
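
To make the effect of the loop and of SPLIT_SLOP concrete, the following sketch replays the same logic in plain Java on the 128 MB / 30 MB example. It is an illustration only: the class name SplitLoopSketch is made up, sizes are expressed in MB rather than bytes, and the host information is left out.

```java
import java.util.ArrayList;
import java.util.List;

class SplitLoopSketch {
    static final double SPLIT_SLOP = 1.1; // same value as in the loop above

    /** Returns {offset, length} pairs for one file; hosts are omitted in this sketch. */
    static List<long[]> split(long length, long splitSize) {
        List<long[]> splits = new ArrayList<>();
        long bytesRemaining = length;
        // Cut full-size splits while more than 1.1 split sizes remain.
        while (((double) bytesRemaining) / splitSize > SPLIT_SLOP) {
            splits.add(new long[] {length - bytesRemaining, splitSize});
            bytesRemaining -= splitSize;
        }
        // Whatever is left becomes the final split.
        if (bytesRemaining != 0) {
            splits.add(new long[] {length - bytesRemaining, bytesRemaining});
        }
        return splits;
    }

    public static void main(String[] args) {
        // Sizes are in MB to keep the numbers readable.
        for (long[] s : split(128, 30)) {
            System.out.println("offset=" + s[0] + " length=" + s[1]);
        }
        // 32/30 is about 1.07, which is not above SPLIT_SLOP, so the file stays in one split.
        System.out.println(split(32, 30).size());
    }
}
```

For the 128 MB file this yields five splits: four of 30 MB starting at offsets 0, 30, 60 and 90, and a final 8 MB split at offset 120. The 32 MB file stays in a single 32 MB split instead of producing a 30 MB split plus a 2 MB one.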

The block index is obtained through the getBlockIndex function. Let us walk through it with the example from the beginning:

protected int getBlockIndex(BlockLocation[] blkLocations, 
                              long offset) {
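    /** Find the block that contains offset.
     * Assume a 40 MB block size, a 30 MB split size and a 128 MB file, so there are 4 blocks and blkLocations has length 4
     *    1st offset is 0:   0 <= 0 < 40, in the first block
     *    2nd offset is 30:  0 <= 30 < 40, still in the first block
     *    3rd offset is 60:  40 <= 60 < 80, in the second block
     *    4th offset is 90:  80 <= 90 < 120, in the third block
     *    5th offset is 120: 120 <= 120 < 128, in the fourth block
     * **/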
    for (int i = 0 ; i < blkLocations.length; i++) {
      // is the offset inside this block?
      if ((blkLocations[i].getOffset() <= offset) &&
          (offset < blkLocations[i].getOffset() + blkLocations[i].getLength())){
        return i;
      }
    }
    BlockLocation last = blkLocations[blkLocations.length -1];
    long fileLength = last.getOffset() + last.getLength() -1;
    throw new IllegalArgumentException("Offset " + offset + 
                                       " is outside of file (0.." +
                                       fileLength + ")");
  }
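
The lookup can be tried out without a cluster using the sketch below. It is a simplified stand-in, not the Hadoop method itself: the class name BlockIndexSketch is hypothetical, and two plain arrays of block offsets and lengths (in MB) take the place of BlockLocation[].

```java
class BlockIndexSketch {
    /** Returns the index of the block whose [offset, offset + length) range contains the given offset. */
    static int getBlockIndex(long[] blockOffsets, long[] blockLengths, long offset) {
        for (int i = 0; i < blockOffsets.length; i++) {
            if (blockOffsets[i] <= offset && offset < blockOffsets[i] + blockLengths[i]) {
                return i;
            }
        }
        int last = blockOffsets.length - 1;
        long fileLength = blockOffsets[last] + blockLengths[last] - 1;
        throw new IllegalArgumentException(
            "Offset " + offset + " is outside of file (0.." + fileLength + ")");
    }

    public static void main(String[] args) {
        // Four blocks of a 128 MB file with a 40 MB block size (sizes in MB).
        long[] offsets = {0, 40, 80, 120};
        long[] lengths = {40, 40, 40, 8};
        // Start offsets of the five splits from the example.
        for (long offset : new long[] {0, 30, 60, 90, 120}) {
            System.out.println(offset + " -> block " + getBlockIndex(offsets, lengths, offset));
        }
    }
}
```

The offsets map to block indexes 0, 0, 1, 2 and 3. Note that the lookup only considers the start offset of a split: the second split covers 30 MB to 60 MB and therefore crosses into the second block, yet its hosts are taken from the first block, where it starts.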

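This completes the analysis of the splitting process. What is finally returned is the metadata of each split: the file path, the start offset to read from, the length to read, the hosts where the block resides, and so on.

References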

https://www.jianshu.com/p/12c66b6f5c57
