In MapReduce practice we often face many small files. Each file gets its own mapper, which wastes resources, and if the job has no reduce phase it also emits many small output files, so the file count explodes and hurts downstream Hive jobs.
We therefore want the mapper stage to combine several files into one split as input, and CombineFileInputFormat does exactly that.
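A minimal sketch of the three configuration knobs that drive the combining logic (they map to maxSize, minSizeNode and minSizeRack in the source walked through below). The old mapred.* key names match the driver code at the end of this article; the byte values are illustrative assumptions, not recommendations:

```java
import org.apache.hadoop.conf.Configuration;

// Illustrative values only; tune to your block size and cluster.
Configuration conf = new Configuration();
// Upper bound for one combined split (maxSize in the code below).
conf.set("mapred.max.split.size", "268435456");           // 256 MB
// Per-datanode minimum split size (minSizeNode below).
conf.set("mapred.min.split.size.per.node", "134217728");  // 128 MB
// Per-rack minimum split size (minSizeRack below).
conf.set("mapred.min.split.size.per.rack", "134217728");  // 128 MB
```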
How CombineFileInputFormat works (a good summary found online):
Pass 1: build splits from the blocks residing on the same DN, as follows:
1. Iterate over nodeToBlocks to find which blocks each DN holds.
2. Loop over that node's block list.
3. Remove each block from blockToNodes, so that the same block cannot end up in more than one split.
4. Add the block to validBlocks, a list that records which blocks have been removed from blockToNodes, so they can be put back later if needed.
5. Add the block's size to the running total curSplitSize.
6. Check whether curSplitSize has reached the configured maxSize:
a) if so, create and record a split, then reset curSplitSize and validBlocks;
b) if not, continue with the next block (back to step 2).
7. When the node's block list is exhausted, check whether the leftover blocks may still form a split (is their combined size at least the per-node minimum split size?):
a) if so, create and record a split;
b) if not, return these leftover blocks to blockToNodes.
8. Reset the temporaries.
9. Move on to the next DN (back to step 1).
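The nine steps above can be sketched outside Hadoop as a small greedy packing routine over plain collections. This is a toy model for illustration only: the class name NodeLocalCombine, the string block names, and the MB sizes are made up, and real blocks carry host lists rather than bare names.

```java
import java.util.*;

// Toy model of the node-local pass: pack each DN's unassigned blocks into
// splits of at most maxSize; leftovers below minSizeNode go back to the pool.
class NodeLocalCombine {
    static List<List<String>> combine(Map<String, List<String>> nodeToBlocks,
                                      Map<String, Long> blockSize,
                                      Set<String> unassigned,
                                      long maxSize, long minSizeNode) {
        List<List<String>> splits = new ArrayList<>();
        for (List<String> blocks : nodeToBlocks.values()) {   // step 1
            List<String> valid = new ArrayList<>();
            long cur = 0;
            for (String b : blocks) {                         // step 2
                if (!unassigned.contains(b)) continue;        // already placed
                unassigned.remove(b);                         // step 3
                valid.add(b);                                 // step 4
                cur += blockSize.get(b);                      // step 5
                if (maxSize != 0 && cur >= maxSize) {         // step 6a: flush
                    splits.add(new ArrayList<>(valid));
                    valid.clear();
                    cur = 0;
                }
            }
            if (minSizeNode != 0 && cur >= minSizeNode) {     // step 7a
                splits.add(new ArrayList<>(valid));
            } else {
                unassigned.addAll(valid);                     // step 7b: give back
            }
        }                                                     // steps 8-9
        return splits;
    }
}
```

With maxSize = 150 and minSizeNode = 50, a node holding blocks of 100, 100 and 30 MB yields one 200 MB split from the first two blocks, while the 30 MB block returns to the pool for the rack-level pass.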
    // Process all nodes and create splits that are local to a node.
    for (Iterator<Map.Entry<String, List<OneBlockInfo>>> iter =
           nodeToBlocks.entrySet().iterator(); iter.hasNext();) {
      Map.Entry<String, List<OneBlockInfo>> one = iter.next();
      nodes.add(one.getKey());
      List<OneBlockInfo> blocksInNode = one.getValue();

      // For each block, copy it into validBlocks. Delete it from
      // blockToNodes so that the same block does not appear in
      // two different splits.
      for (OneBlockInfo oneblock : blocksInNode) {
        if (blockToNodes.containsKey(oneblock)) {
          validBlocks.add(oneblock);
          blockToNodes.remove(oneblock);
          curSplitSize += oneblock.length;

          // If the accumulated split size exceeds the maximum, then
          // create this split.
          if (maxSize != 0 && curSplitSize >= maxSize) {
            // Create a split out of the combined blocks and add it
            // to the splits list.
            addCreatedSplit(job, splits, nodes, validBlocks);
            // Reset.
            curSplitSize = 0;
            validBlocks.clear();
          }
        }
      }
      // If there were any blocks left over and their combined size is
      // larger than minSplitNode, then combine them into one split.
      // Otherwise add them back to the unprocessed pool. It is likely
      // that they will be combined with other blocks from the same rack later on.
      // In other words, there are two cases here:
      // 1. The node still holds uncombined blocks whose total size reaches the
      //    per-node minimum (without having hit the maximum); they become one split.
      // 2. The leftover blocks are still too small; they are returned to
      //    blockToNodes to be handled together later.
      if (minSizeNode != 0 && curSplitSize >= minSizeNode) {
        // Create an input split and add it to the splits array.
        addCreatedSplit(job, splits, nodes, validBlocks);
      } else {
        for (OneBlockInfo oneblock : validBlocks) {
          blockToNodes.put(oneblock, oneblock.hosts);
        }
      }
      validBlocks.clear();
      nodes.clear();
      curSplitSize = 0;
    }
Pass 2: combine the blocks that are not on the same DN but are on the same rack (only the blocks left over from pass 1):
    // If blocks in a rack are below the specified minimum size, then keep them
    // in 'overflow'. After the processing of all racks is complete, these
    // overflow blocks will be combined into splits.
    ArrayList<OneBlockInfo> overflowBlocks = new ArrayList<OneBlockInfo>();
    ArrayList<String> racks = new ArrayList<String>();

    // Process all racks over and over again until there is no more work to do.
    // At this point we are no longer handling blocks on the same DN; those were
    // dealt with by the code above. Only the not-yet-placed blocks remain.
    while (blockToNodes.size() > 0) {
      // Create one split for this rack before moving over to the next rack.
      // Come back to this rack after creating a single split for each of the
      // remaining racks.
      // Process one rack location at a time: combine all possible blocks that
      // reside on this rack into one split (constrained by the minimum and
      // maximum split size).

      // Iterate over all racks.
      for (Iterator<Map.Entry<String, List<OneBlockInfo>>> iter =
             rackToBlocks.entrySet().iterator(); iter.hasNext();) {
        Map.Entry<String, List<OneBlockInfo>> one = iter.next();
        racks.add(one.getKey());
        List<OneBlockInfo> blocks = one.getValue();

        // For each block, copy it into validBlocks. Delete it from
        // blockToNodes so that the same block does not appear in
        // two different splits.
        boolean createdSplit = false;
        for (OneBlockInfo oneblock : blocks) {
          // Important: at this point blockToNodes tells us which blocks
          // have not yet been placed into any split.
          if (blockToNodes.containsKey(oneblock)) {
            validBlocks.add(oneblock);
            blockToNodes.remove(oneblock);
            curSplitSize += oneblock.length;

            // If the accumulated split size exceeds the maximum, then
            // create this split.
            if (maxSize != 0 && curSplitSize >= maxSize) {
              // Create an input split and add it to the splits array.
              addCreatedSplit(job, splits, getHosts(racks), validBlocks);
              createdSplit = true;
              break;
            }
          }
        }

        // If we created a split, then just go to the next rack.
        if (createdSplit) {
          curSplitSize = 0;
          validBlocks.clear();
          racks.clear();
          continue;
        }

        // There are still blocks that have not been placed into a split.
        // If their combined size reaches the per-rack minimum, create a split;
        // otherwise leave them for the final pass.
        if (!validBlocks.isEmpty()) {
          if (minSizeRack != 0 && curSplitSize >= minSizeRack) {
            // If there is a minimum size specified, then create a single split;
            // otherwise, store these blocks into the overflow data structure.
            addCreatedSplit(job, splits, getHosts(racks), validBlocks);
          } else {
            // There were a few blocks in this rack that remained to be processed.
            // Keep them in the 'overflow' block list. These will be combined later.
            overflowBlocks.addAll(validBlocks);
          }
        }
        curSplitSize = 0;
        validBlocks.clear();
        racks.clear();
      }
    }
Finally, the blocks that are on neither the same DN nor the same rack (whatever is left after the first two passes) are combined; that part of the source is straightforward, so it is not quoted here.
Source summary:
Combining happens in three passes: same DN → same rack but different DN → different racks.
Blocks that can be combined end up in the same split.
Below is the code from my own practice. The originals are small sequence files of roughly 70 MB each (some smaller), so a custom RecordReader is needed (for Text input it would be much simpler); the key type is BytesWritable. The goal is to reduce the number of files so that each output file is close to the block size.
The custom CombineSequenceFileInputFormat:
package com.hadoop.combineInput;

import java.io.IOException;

import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.CombineFileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.CombineFileRecordReader;
import org.apache.hadoop.mapreduce.lib.input.CombineFileSplit;

public class CombineSequenceFileInputFormat<K, V> extends CombineFileInputFormat<K, V> {
  @SuppressWarnings({ "unchecked", "rawtypes" })
  @Override
  public RecordReader<K, V> createRecordReader(InputSplit split, TaskAttemptContext context) throws IOException {
    return new CombineFileRecordReader((CombineFileSplit) split, context, CombineSequenceFileRecordReader.class);
  }
}
Implementing CombineSequenceFileRecordReader:
package com.hadoop.combineInput;

import java.io.IOException;

import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.CombineFileSplit;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;
import org.apache.hadoop.mapreduce.lib.input.SequenceFileRecordReader;
import org.apache.hadoop.util.ReflectionUtils;

public class CombineSequenceFileRecordReader<K, V> extends RecordReader<K, V> {
  private CombineFileSplit split;
  private TaskAttemptContext context;
  private int index;
  private RecordReader<K, V> rr;

  @SuppressWarnings("unchecked")
  public CombineSequenceFileRecordReader(CombineFileSplit split, TaskAttemptContext context, Integer index)
      throws IOException, InterruptedException {
    this.index = index;
    this.split = split;
    this.context = context;
    this.rr = ReflectionUtils.newInstance(SequenceFileRecordReader.class, context.getConfiguration());
  }

  @SuppressWarnings("unchecked")
  @Override
  public void initialize(InputSplit curSplit, TaskAttemptContext curContext) throws IOException, InterruptedException {
    this.split = (CombineFileSplit) curSplit;
    this.context = curContext;

    if (null == rr) {
      rr = ReflectionUtils.newInstance(SequenceFileRecordReader.class, context.getConfiguration());
    }

    // Turn the index-th file of the combined split into an ordinary FileSplit
    // and let the wrapped SequenceFileRecordReader handle it.
    FileSplit fileSplit = new FileSplit(this.split.getPath(index),
        this.split.getOffset(index), this.split.getLength(index),
        this.split.getLocations());
    this.rr.initialize(fileSplit, this.context);
  }

  @Override
  public float getProgress() throws IOException, InterruptedException {
    return rr.getProgress();
  }

  @Override
  public void close() throws IOException {
    if (null != rr) {
      rr.close();
      rr = null;
    }
  }

  @Override
  public K getCurrentKey() throws IOException, InterruptedException {
    return rr.getCurrentKey();
  }

  @Override
  public V getCurrentValue() throws IOException, InterruptedException {
    return rr.getCurrentValue();
  }

  @Override
  public boolean nextKeyValue() throws IOException, InterruptedException {
    return rr.nextKeyValue();
  }
}
Reference: http://sourceforge.net/p/openimaj/code/HEAD/tree/trunk/hadoop/core-hadoop/src/main/java/org/openimaj/hadoop/sequencefile/combine/CombineSequenceFileRecordReader.java
The main function is simple; it is included here for my own future reference:
package com.hadoop.combineInput;

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class MergeFiles extends Configured implements Tool {
  public static class MapClass extends Mapper<BytesWritable, Text, BytesWritable, Text> {
    // Identity mapper: records are passed through; only the file layout changes.
    public void map(BytesWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      context.write(key, value);
    }
  } // END: MapClass

  public int run(String[] args) throws Exception {
    Configuration conf = new Configuration();
    conf.set("mapred.max.split.size", "157286400"); // 150 MB per combined split
    conf.setBoolean("mapred.output.compress", true);
    Job job = new Job(conf);
    job.setJobName("MergeFiles");
    job.setJarByClass(MergeFiles.class);

    job.setMapperClass(MapClass.class);
    job.setInputFormatClass(CombineSequenceFileInputFormat.class);
    job.setOutputFormatClass(SequenceFileOutputFormat.class);
    job.setOutputKeyClass(BytesWritable.class);
    job.setOutputValueClass(Text.class);

    FileInputFormat.addInputPaths(job, args[0]);
    FileOutputFormat.setOutputPath(job, new Path(args[1]));

    job.setNumReduceTasks(0); // map-only job: one output file per mapper

    return job.waitForCompletion(true) ? 0 : 1;
  } // END: run

  public static void main(String[] args) throws Exception {
    int ret = ToolRunner.run(new MergeFiles(), args);
    System.exit(ret);
  } // END: main
} // END: MergeFiles
Performance test: 2,000 compressed sequence files of about 70 MB each were merged into 700 compressed sequence files averaging 200 MB (tunable), with a block size of 256 MB; the job took about two and a half to three minutes.
Open issue:
- After combining, mappers may lose data locality, which adds per-mapper overhead; this trade-off needs to be weighed.