Contents
- 1. Why custom input is needed
- 2. Analysis of the default input source code
- 2.1 org.apache.hadoop.mapreduce.Mapper
- 2.2 org.apache.hadoop.mapred.MapTask
- 2.3 org.apache.hadoop.mapreduce.lib.map.WrappedMapper
- 2.4 org.apache.hadoop.mapreduce.task.MapContextImpl
- 2.5 org.apache.hadoop.mapred.MapTask.NewTrackingRecordReader
- 2.6 org.apache.hadoop.mapreduce.JobContext#getInputFormatClass
- 2.7 org.apache.hadoop.mapreduce.task.JobContextImpl#getInputFormatClass
- 2.8 org.apache.hadoop.mapreduce.lib.input.TextInputFormat
- 2.9 org.apache.hadoop.mapreduce.lib.input.LineRecordReader
- 2.10 Source code summary
- 3. Custom input
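1. Why custom input is needed
The NameNode stores file metadata, and at runtime all of that metadata is held in memory, so the number of files an HDFS cluster can store is limited by the NameNode's memory.
Each block corresponds to one record in the NameNode (typically about 150 bytes per block), so a large number of small files consumes a lot of memory. In addition, the number of map tasks is determined by the number of splits, so processing many small files with MapReduce produces too many map tasks, and the overhead of starting and managing those tasks lengthens the job. Processing a large number of small files is much slower than processing the same amount of data stored as large files, which is why Hadoop recommends storing large files.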
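To make the memory argument concrete, here is a small back-of-the-envelope calculation. It is only a sketch: it takes the roughly 150 bytes per block quoted above as given, counts block records only, and the class and method names are made up for this illustration.

```java
// Rough estimate of NameNode memory used by block records only.
// Assumption: about 150 bytes per block record (the rule of thumb quoted in the text).
class NameNodeMemoryEstimate {
    static final long BYTES_PER_BLOCK_RECORD = 150L;

    /** Number of blocks needed to store one file of the given size (ceiling division). */
    static long blocksFor(long fileSizeBytes, long blockSizeBytes) {
        return (fileSizeBytes + blockSizeBytes - 1) / blockSizeBytes;
    }

    /** Estimated bytes of block metadata for fileCount files of equal size. */
    static long metadataBytes(long fileCount, long fileSizeBytes, long blockSizeBytes) {
        return fileCount * blocksFor(fileSizeBytes, blockSizeBytes) * BYTES_PER_BLOCK_RECORD;
    }

    public static void main(String[] args) {
        long mib = 1024L * 1024;
        long blockSize = 128 * mib;
        // The same 1 TiB of data, stored as 1 MiB files or as 128 MiB files.
        long asSmallFiles = metadataBytes(1024L * 1024, mib, blockSize);
        long asLargeFiles = metadataBytes(8192, blockSize, blockSize);
        System.out.println(asSmallFiles); // 157286400
        System.out.println(asLargeFiles); // 1228800
    }
}
```

Under these assumptions the small-file layout needs 128 times as much block metadata for the same amount of data.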
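We can set the input format to CombineTextInputFormat in code, but this has limits:
- It only loads multiple small files into one map task at run time
- The physical storage is still a large number of small files
- The pressure on the HDFS NameNode remains high
It is configured like this: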
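// Combine small files into splits according to the split size
job.setInputFormatClass(CombineTextInputFormat.class);
// Set the maximum split size (slightly above 128 MB); combining is driven by the maximum split size, not the minimum
CombineTextInputFormat.setMaxInputSplitSize(job, 130L * 1024 * 1024);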
FileInputFormat.addInputPath(job, new Path("/in"));
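So the solution for a large number of small files is to merge them with MapReduce:
multiple small files are merged into one large file.
Analysis:
- By default, map() is called once per line
- We need a custom input that reads a whole file at once, so that map() is called once per file and sends the content straight to the reduce side
- The reduce side merges all the files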
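2. Analysis of the default input source code
First find the entry point of the input. When a mapper runs, map() is executed once per line by default, so let's look at what Mapper does.
Mapper has four methods:
- setup(): executed once when the map task starts
- cleanup(): executed once when the map task ends
- map(): executed once per line by default
- run(): calls the three methods above according to its loop condition
So the analysis starts from run() in the Mapper code below.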
2.1 org.apache.hadoop.mapreduce.Mapper
public class Mapper<KEYIN, VALUEIN, KEYOUT, VALUEOUT> {
/**
* The <code>Context</code> passed on to the {@link Mapper} implementations.
*/
public abstract class Context
implements MapContext<KEYIN,VALUEIN,KEYOUT,VALUEOUT> {
}
/**
* Called once at the beginning of the task.
*/
protected void setup(Context context
) throws IOException, InterruptedException {
// NOTHING
}
/**
* Called once for each key/value pair in the input split. Most applications
* should override this, but the default is the identity function.
*/
@SuppressWarnings("unchecked")
protected void map(KEYIN key, VALUEIN value,
Context context) throws IOException, InterruptedException {
context.write((KEYOUT) key, (VALUEOUT) value);
}
/**
* Called once at the end of the task.
*/
protected void cleanup(Context context
) throws IOException, InterruptedException {
// NOTHING
}
/**
* Expert users can override this method for more complete control over the
* execution of the Mapper.
* @param context
* @throws IOException
*/
public void run(Context context) throws IOException, InterruptedException {
setup(context);
try {
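/*
* context.nextKeyValue() checks whether there is another line
* context.getCurrentKey() gets the byte offset of the current line
* context.getCurrentValue() gets the content of the current line
* The key is to trace the context parameter,
* i.e. to find who calls run(context)
*/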
while (context.nextKeyValue()) {
map(context.getCurrentKey(), context.getCurrentValue(), context);
}
} finally {
cleanup(context);
}
}
}
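The control flow of run() can be tried out without a cluster. The following is a plain-Java stand-in, not Hadoop code: the Context interface and the ListContext class are hypothetical and carry only the three methods that run() relies on, with the line index used as the key instead of a byte offset.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Plain-Java imitation of the Mapper.run() loop; none of these types are Hadoop classes.
class RunLoopSketch {
    /** Hypothetical minimal counterpart of the three Context methods used by run(). */
    interface Context {
        boolean nextKeyValue();
        long getCurrentKey();
        String getCurrentValue();
    }

    /** Context backed by an in-memory list of lines; the key is the line index. */
    static class ListContext implements Context {
        private final List<String> lines;
        private int index = -1;

        ListContext(List<String> lines) {
            this.lines = lines;
        }

        public boolean nextKeyValue() {
            return ++index < lines.size();
        }

        public long getCurrentKey() {
            return index;
        }

        public String getCurrentValue() {
            return lines.get(index);
        }
    }

    /** Same shape as Mapper.run(): setup once, map once per record, cleanup once. */
    static List<String> run(Context context) {
        List<String> calls = new ArrayList<>();
        calls.add("setup");
        try {
            while (context.nextKeyValue()) {
                calls.add("map(" + context.getCurrentKey() + ", " + context.getCurrentValue() + ")");
            }
        } finally {
            calls.add("cleanup");
        }
        return calls;
    }

    public static void main(String[] args) {
        System.out.println(run(new ListContext(Arrays.asList("hello", "world"))));
        // prints [setup, map(0, hello), map(1, world), cleanup]
    }
}
```

Whatever object supplies nextKeyValue(), getCurrentKey() and getCurrentValue() decides what one call to map() receives, which is why the rest of the analysis traces where the context comes from.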
Following the usages of run() shows that it is called in MapTask as mapper.run(mapperContext).
The key part of the code is as follows:
2.2 org.apache.hadoop.mapred.MapTask
// make the input format
// Create the object by reflection (what matters is its type, taskContext.getInputFormatClass()), then look at inputFormat.createRecordReader(split, taskContext)
org.apache.hadoop.mapreduce.InputFormat<INKEY,INVALUE> inputFormat =
(org.apache.hadoop.mapreduce.InputFormat<INKEY,INVALUE>)
ReflectionUtils.newInstance(taskContext.getInputFormatClass(), job);
//this.real=inputFormat.createRecordReader(split, taskContext)
org.apache.hadoop.mapreduce.RecordReader<INKEY,INVALUE> input =
new NewTrackingRecordReader<INKEY,INVALUE>
(split, inputFormat, reporter, taskContext);
job.setBoolean(JobContext.SKIP_RECORDS, isSkipping());
org.apache.hadoop.mapreduce.RecordWriter output = null;
// get an output object
if (job.getNumReduceTasks() == 0) {
output =
new NewDirectOutputCollector(taskContext, job, umbilical, reporter);
} else {
output = new NewOutputCollector(taskContext, job, umbilical, reporter);
}
org.apache.hadoop.mapreduce.MapContext<INKEY, INVALUE, OUTKEY, OUTVALUE>
// mapContext ---> input
mapContext =
new MapContextImpl<INKEY, INVALUE, OUTKEY, OUTVALUE>(job, getTaskID(),
input, output,
committer,
reporter, split);
org.apache.hadoop.mapreduce.Mapper<INKEY,INVALUE,OUTKEY,OUTVALUE>.Context
// See what WrappedMapper.getMapContext returns
// mapperContext ---> mapContext
mapperContext =
new WrappedMapper<INKEY, INVALUE, OUTKEY, OUTVALUE>().getMapContext(
mapContext);
try {
input.initialize(split, mapperContext);
// This mapper is an instance of the class passed to job.setMapperClass()
// mapperContext must provide the three methods nextKeyValue, getCurrentKey and getCurrentValue
mapper.run(mapperContext);
mapPhase.complete();
setPhase(TaskStatus.Phase.SORT);
statusUpdate(umbilical);
input.close();
input = null;
output.close(mapperContext);
output = null;
} finally {
closeQuietly(input);
closeQuietly(output, mapperContext);
}
}
What does WrappedMapper.getMapContext return?
It returns a Context object that has the three methods getCurrentKey, getCurrentValue and nextKeyValue.
2.3 org.apache.hadoop.mapreduce.lib.map.WrappedMapper
/**
* Get a wrapped {@link Mapper.Context} for custom implementations.
* @param mapContext <code>MapContext</code> to be wrapped
* @return a wrapped <code>Mapper.Context</code> for custom implementations
*/
public Mapper<KEYIN, VALUEIN, KEYOUT, VALUEOUT>.Context
getMapContext(MapContext<KEYIN, VALUEIN, KEYOUT, VALUEOUT> mapContext) {
// Returns a Context object; this Context should have the three methods used in Mapper: nextKeyValue, getCurrentKey and getCurrentValue
return new Context(mapContext);
}
// The Context below does have getCurrentKey, getCurrentValue and nextKeyValue
// Their return values come from this.mapContext = mapContext
// Tracing the mapContext argument back, MapTask creates a MapContextImpl object, so that class is the next one to look at
@InterfaceStability.Evolving
public class Context
extends Mapper<KEYIN, VALUEIN, KEYOUT, VALUEOUT>.Context {
protected MapContext<KEYIN, VALUEIN, KEYOUT, VALUEOUT> mapContext;
// Constructor
public Context(MapContext<KEYIN, VALUEIN, KEYOUT, VALUEOUT> mapContext) {
this.mapContext = mapContext;
}
/**
* Get the input split for this map.
*/
public InputSplit getInputSplit() {
return mapContext.getInputSplit();
}
@Override
public KEYIN getCurrentKey() throws IOException, InterruptedException {
return mapContext.getCurrentKey();
}
@Override
public VALUEIN getCurrentValue() throws IOException, InterruptedException {
return mapContext.getCurrentValue();
}
@Override
public boolean nextKeyValue() throws IOException, InterruptedException {
return mapContext.nextKeyValue();
}
@Override
public Counter getCounter(Enum<?> counterName) {
return mapContext.getCounter(counterName);
}
@Override
public Counter getCounter(String groupName, String counterName) {
return mapContext.getCounter(groupName, counterName);
}
@Override
public OutputCommitter getOutputCommitter() {
return mapContext.getOutputCommitter();
}
@Override
public void write(KEYOUT key, VALUEOUT value) throws IOException,
InterruptedException {
mapContext.write(key, value);
}
@Override
public String getStatus() {
return mapContext.getStatus();
}
@Override
public TaskAttemptID getTaskAttemptID() {
return mapContext.getTaskAttemptID();
}
@Override
public void setStatus(String msg) {
mapContext.setStatus(msg);
}
@Override
public Path[] getArchiveClassPaths() {
return mapContext.getArchiveClassPaths();
}
@Override
public String[] getArchiveTimestamps() {
return mapContext.getArchiveTimestamps();
}
@Override
public URI[] getCacheArchives() throws IOException {
return mapContext.getCacheArchives();
}
@Override
public URI[] getCacheFiles() throws IOException {
return mapContext.getCacheFiles();
}
@Override
public Class<? extends Reducer<?, ?, ?, ?>> getCombinerClass()
throws ClassNotFoundException {
return mapContext.getCombinerClass();
}
@Override
public Configuration getConfiguration() {
return mapContext.getConfiguration();
}
@Override
public Path[] getFileClassPaths() {
return mapContext.getFileClassPaths();
}
@Override
public String[] getFileTimestamps() {
return mapContext.getFileTimestamps();
}
@Override
public RawComparator<?> getCombinerKeyGroupingComparator() {
return mapContext.getCombinerKeyGroupingComparator();
}
@Override
public RawComparator<?> getGroupingComparator() {
return mapContext.getGroupingComparator();
}
@Override
public Class<? extends InputFormat<?, ?>> getInputFormatClass()
throws ClassNotFoundException {
return mapContext.getInputFormatClass();
}
@Override
public String getJar() {
return mapContext.getJar();
}
@Override
public JobID getJobID() {
return mapContext.getJobID();
}
@Override
public String getJobName() {
return mapContext.getJobName();
}
@Override
public boolean getJobSetupCleanupNeeded() {
return mapContext.getJobSetupCleanupNeeded();
}
@Override
public boolean getTaskCleanupNeeded() {
return mapContext.getTaskCleanupNeeded();
}
@Override
public Path[] getLocalCacheArchives() throws IOException {
return mapContext.getLocalCacheArchives();
}
@Override
public Path[] getLocalCacheFiles() throws IOException {
return mapContext.getLocalCacheFiles();
}
@Override
public Class<?> getMapOutputKeyClass() {
return mapContext.getMapOutputKeyClass();
}
@Override
public Class<?> getMapOutputValueClass() {
return mapContext.getMapOutputValueClass();
}
@Override
public Class<? extends Mapper<?, ?, ?, ?>> getMapperClass()
throws ClassNotFoundException {
return mapContext.getMapperClass();
}
@Override
public int getMaxMapAttempts() {
return mapContext.getMaxMapAttempts();
}
@Override
public int getMaxReduceAttempts() {
return mapContext.getMaxReduceAttempts();
}
@Override
public int getNumReduceTasks() {
return mapContext.getNumReduceTasks();
}
@Override
public Class<? extends OutputFormat<?, ?>> getOutputFormatClass()
throws ClassNotFoundException {
return mapContext.getOutputFormatClass();
}
@Override
public Class<?> getOutputKeyClass() {
return mapContext.getOutputKeyClass();
}
@Override
public Class<?> getOutputValueClass() {
return mapContext.getOutputValueClass();
}
@Override
public Class<? extends Partitioner<?, ?>> getPartitionerClass()
throws ClassNotFoundException {
return mapContext.getPartitionerClass();
}
@Override
public Class<? extends Reducer<?, ?, ?, ?>> getReducerClass()
throws ClassNotFoundException {
return mapContext.getReducerClass();
}
@Override
public RawComparator<?> getSortComparator() {
return mapContext.getSortComparator();
}
@Override
public boolean getSymlink() {
return mapContext.getSymlink();
}
@Override
public Path getWorkingDirectory() throws IOException {
return mapContext.getWorkingDirectory();
}
@Override
public void progress() {
mapContext.progress();
}
@Override
public boolean getProfileEnabled() {
return mapContext.getProfileEnabled();
}
@Override
public String getProfileParams() {
return mapContext.getProfileParams();
}
@Override
public IntegerRanges getProfileTaskRange(boolean isMap) {
return mapContext.getProfileTaskRange(isMap);
}
@Override
public String getUser() {
return mapContext.getUser();
}
@Override
public Credentials getCredentials() {
return mapContext.getCredentials();
}
@Override
public float getProgress() {
return mapContext.getProgress();
}
}
2.4 org.apache.hadoop.mapreduce.task.MapContextImpl
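In MapTask, mapContext is created with new MapContextImpl.
MapContextImpl also has getCurrentKey(), getCurrentValue() and nextKeyValue(), and their return values come from the reader field.
The reader field is set from the third constructor argument of MapContextImpl, RecordReader<KEYIN,VALUEIN> reader,
so we go back to MapTask to see where that third argument comes from.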
public class MapContextImpl<KEYIN,VALUEIN,KEYOUT,VALUEOUT>
extends TaskInputOutputContextImpl<KEYIN,VALUEIN,KEYOUT,VALUEOUT>
implements MapContext<KEYIN, VALUEIN, KEYOUT, VALUEOUT> {
private RecordReader<KEYIN,VALUEIN> reader;
private InputSplit split;
public MapContextImpl(Configuration conf, TaskAttemptID taskid,
RecordReader<KEYIN,VALUEIN> reader,
RecordWriter<KEYOUT,VALUEOUT> writer,
OutputCommitter committer,
StatusReporter reporter,
InputSplit split) {
super(conf, taskid, writer, committer, reporter);
this.reader = reader;
this.split = split;
}
/**
* Get the input split for this map.
*/
public InputSplit getInputSplit() {
return split;
}
@Override
public KEYIN getCurrentKey() throws IOException, InterruptedException {
return reader.getCurrentKey();
}
@Override
public VALUEIN getCurrentValue() throws IOException, InterruptedException {
return reader.getCurrentValue();
}
@Override
public boolean nextKeyValue() throws IOException, InterruptedException {
return reader.nextKeyValue();
}
}
2.5 org.apache.hadoop.mapred.MapTask.NewTrackingRecordReader
In MapTask, input =
new NewTrackingRecordReader<INKEY,INVALUE>
(split, inputFormat, reporter, taskContext);
So we look at the three methods getCurrentKey(), getCurrentValue() and nextKeyValue() in NewTrackingRecordReader.
Their return values come from the field org.apache.hadoop.mapreduce.RecordReader<K,V> real, which is assigned by
this.real = inputFormat.createRecordReader(split, taskContext);
and inputFormat is the second constructor argument of NewTrackingRecordReader,
so we have to go back to MapTask once more to find inputFormat.
static class NewTrackingRecordReader<K,V>
extends org.apache.hadoop.mapreduce.RecordReader<K,V> {
private final org.apache.hadoop.mapreduce.RecordReader<K,V> real;
private final org.apache.hadoop.mapreduce.Counter inputRecordCounter;
private final org.apache.hadoop.mapreduce.Counter fileInputByteCounter;
private final TaskReporter reporter;
private final List<Statistics> fsStats;
NewTrackingRecordReader(org.apache.hadoop.mapreduce.InputSplit split,
org.apache.hadoop.mapreduce.InputFormat<K, V> inputFormat,
TaskReporter reporter,
org.apache.hadoop.mapreduce.TaskAttemptContext taskContext)
throws InterruptedException, IOException {
this.reporter = reporter;
this.inputRecordCounter = reporter
.getCounter(TaskCounter.MAP_INPUT_RECORDS);
this.fileInputByteCounter = reporter
.getCounter(FileInputFormatCounter.BYTES_READ);
List <Statistics> matchedStats = null;
if (split instanceof org.apache.hadoop.mapreduce.lib.input.FileSplit) {
matchedStats = getFsStatistics(((org.apache.hadoop.mapreduce.lib.input.FileSplit) split)
.getPath(), taskContext.getConfiguration());
}
fsStats = matchedStats;
long bytesInPrev = getInputBytes(fsStats);
this.real = inputFormat.createRecordReader(split, taskContext);
long bytesInCurr = getInputBytes(fsStats);
fileInputByteCounter.increment(bytesInCurr - bytesInPrev);
}
@Override
public void close() throws IOException {
long bytesInPrev = getInputBytes(fsStats);
real.close();
long bytesInCurr = getInputBytes(fsStats);
fileInputByteCounter.increment(bytesInCurr - bytesInPrev);
}
@Override
public K getCurrentKey() throws IOException, InterruptedException {
return real.getCurrentKey();
}
@Override
public V getCurrentValue() throws IOException, InterruptedException {
return real.getCurrentValue();
}
@Override
public float getProgress() throws IOException, InterruptedException {
return real.getProgress();
}
@Override
public void initialize(org.apache.hadoop.mapreduce.InputSplit split,
org.apache.hadoop.mapreduce.TaskAttemptContext context
) throws IOException, InterruptedException {
long bytesInPrev = getInputBytes(fsStats);
real.initialize(split, context);
long bytesInCurr = getInputBytes(fsStats);
fileInputByteCounter.increment(bytesInCurr - bytesInPrev);
}
@Override
public boolean nextKeyValue() throws IOException, InterruptedException {
long bytesInPrev = getInputBytes(fsStats);
boolean result = real.nextKeyValue();
long bytesInCurr = getInputBytes(fsStats);
if (result) {
inputRecordCounter.increment(1);
}
fileInputByteCounter.increment(bytesInCurr - bytesInPrev);
reporter.setProgress(getProgress());
return result;
}
private long getInputBytes(List<Statistics> stats) {
if (stats == null) return 0;
long bytesRead = 0;
for (Statistics stat: stats) {
bytesRead = bytesRead + stat.getBytesRead();
}
return bytesRead;
}
}
2.6 org.apache.hadoop.mapreduce.JobContext#getInputFormatClass
From MapTask:
inputFormat =
(org.apache.hadoop.mapreduce.InputFormat<INKEY,INVALUE>)
ReflectionUtils.newInstance(taskContext.getInputFormatClass(), job);
getInputFormatClass() turns out to be declared in an interface (JobContext), and its return type is a subclass of InputFormat.
So we continue to the class that implements getInputFormatClass().
/**
* Get the {@link InputFormat} class for the job.
*
* @return the {@link InputFormat} class for the job.
*/
public Class<? extends InputFormat<?,?>> getInputFormatClass()
throws ClassNotFoundException;
2.7 org.apache.hadoop.mapreduce.task.JobContextImpl#getInputFormatClass
/**
* Get the {@link InputFormat} class for the job.
*
* @return the {@link InputFormat} class for the job.
*/
@SuppressWarnings("unchecked")
public Class<? extends InputFormat<?,?>> getInputFormatClass()
throws ClassNotFoundException {
return (Class<? extends InputFormat<?,?>>)
// conf is loaded from job.xml
//public static final String INPUT_FORMAT_CLASS_ATTR = "mapreduce.job.inputformat.class";
// mapred-default.xml does not define mapreduce.job.inputformat.class, so getClass returns its second argument, TextInputFormat.class
// Next, see what createRecordReader returns in TextInputFormat
conf.getClass(INPUT_FORMAT_CLASS_ATTR, TextInputFormat.class);
}
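The fallback behaviour of conf.getClass(name, defaultValue) can be imitated in plain Java. This is a sketch under assumptions: a Map stands in for the job configuration, the two nested classes are hypothetical stand-ins for TextInputFormat and a custom input format, and lookupClass is a made-up helper, not a Hadoop method.

```java
import java.util.HashMap;
import java.util.Map;

// Plain-Java imitation of "use the configured class, otherwise the default".
class InputFormatLookupSketch {
    static final String INPUT_FORMAT_CLASS_ATTR = "mapreduce.job.inputformat.class";

    /** Hypothetical stand-in for the default (TextInputFormat). */
    static class DefaultFormatStandIn {}

    /** Hypothetical stand-in for a user-defined input format. */
    static class CustomFormatStandIn {}

    /** Returns the class named under key, or defaultClass when the key is not set. */
    static Class<?> lookupClass(Map<String, String> conf, String key, Class<?> defaultClass) {
        String className = conf.get(key);
        if (className == null) {
            return defaultClass;
        }
        try {
            return Class.forName(className);
        } catch (ClassNotFoundException e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        Map<String, String> conf = new HashMap<>();
        // Nothing configured: the default is returned.
        System.out.println(lookupClass(conf, INPUT_FORMAT_CLASS_ATTR,
                DefaultFormatStandIn.class).getSimpleName());
        // After the job sets an input format class, that class wins.
        conf.put(INPUT_FORMAT_CLASS_ATTR, CustomFormatStandIn.class.getName());
        System.out.println(lookupClass(conf, INPUT_FORMAT_CLASS_ATTR,
                DefaultFormatStandIn.class).getSimpleName());
    }
}
```

This is the hook that job.setInputFormatClass() relies on: it stores a class name under that key, and the task later picks it up instead of the default.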
2.8 org.apache.hadoop.mapreduce.lib.input.TextInputFormat
We can see that createRecordReader returns new LineRecordReader(recordDelimiterBytes),
so LineRecordReader is where the real implementations of nextKeyValue(), getCurrentKey() and getCurrentValue() should be.
public class TextInputFormat extends FileInputFormat<LongWritable, Text> {
@Override
public RecordReader<LongWritable, Text>
createRecordReader(InputSplit split,
TaskAttemptContext context) {
String delimiter = context.getConfiguration().get(
"textinputformat.record.delimiter");
byte[] recordDelimiterBytes = null;
if (null != delimiter)
recordDelimiterBytes = delimiter.getBytes(Charsets.UTF_8);
return new LineRecordReader(recordDelimiterBytes);
}
@Override
protected boolean isSplitable(JobContext context, Path file) {
final CompressionCodec codec =
new CompressionCodecFactory(context.getConfiguration()).getCodec(file);
if (null == codec) {
return true;
}
return codec instanceof SplittableCompressionCodec;
}
}
2.9 org.apache.hadoop.mapreduce.lib.input.LineRecordReader
This class is where reading line by line is actually implemented.
public class LineRecordReader extends RecordReader<LongWritable, Text> {
private static final Log LOG = LogFactory.getLog(LineRecordReader.class);
public static final String MAX_LINE_LENGTH =
"mapreduce.input.linerecordreader.line.maxlength";
private long start;
// Position: the byte offset of the start of the current line
private long pos;
private long end;
private SplitLineReader in;
private FSDataInputStream fileIn;
private Seekable filePosition;
private int maxLineLength;
// Starting byte offset of each line
private LongWritable key;
// Content of each line
private Text value;
private boolean isCompressedInput;
private Decompressor decompressor;
private byte[] recordDelimiterBytes;
public LineRecordReader() {
}
public LineRecordReader(byte[] recordDelimiter) {
this.recordDelimiterBytes = recordDelimiter;
}
public void initialize(InputSplit genericSplit,
TaskAttemptContext context) throws IOException {
FileSplit split = (FileSplit) genericSplit;
Configuration job = context.getConfiguration();
this.maxLineLength = job.getInt(MAX_LINE_LENGTH, Integer.MAX_VALUE);
start = split.getStart();
end = start + split.getLength();
final Path file = split.getPath();
// open the file and seek to the start of the split
final FileSystem fs = file.getFileSystem(job);
fileIn = fs.open(file);
CompressionCodec codec = new CompressionCodecFactory(job).getCodec(file);
if (null!=codec) {
isCompressedInput = true;
decompressor = CodecPool.getDecompressor(codec);
if (codec instanceof SplittableCompressionCodec) {
final SplitCompressionInputStream cIn =
((SplittableCompressionCodec)codec).createInputStream(
fileIn, decompressor, start, end,
SplittableCompressionCodec.READ_MODE.BYBLOCK);
in = new CompressedSplitLineReader(cIn, job,
this.recordDelimiterBytes);
start = cIn.getAdjustedStart();
end = cIn.getAdjustedEnd();
filePosition = cIn;
} else {
in = new SplitLineReader(codec.createInputStream(fileIn,
decompressor), job, this.recordDelimiterBytes);
filePosition = fileIn;
}
} else {
fileIn.seek(start);
in = new UncompressedSplitLineReader(
fileIn, job, this.recordDelimiterBytes, split.getLength());
filePosition = fileIn;
}
// If this is not the first split, we always throw away first record
// because we always (except the last split) read one extra line in
// next() method.
if (start != 0) {
start += in.readLine(new Text(), 0, maxBytesToConsume(start));
}
this.pos = start;
}
private int maxBytesToConsume(long pos) {
return isCompressedInput
? Integer.MAX_VALUE
: (int) Math.max(Math.min(Integer.MAX_VALUE, end - pos), maxLineLength);
}
private long getFilePosition() throws IOException {
long retVal;
if (isCompressedInput && null != filePosition) {
retVal = filePosition.getPos();
} else {
retVal = pos;
}
return retVal;
}
private int skipUtfByteOrderMark() throws IOException {
// Strip BOM(Byte Order Mark)
// Text only support UTF-8, we only need to check UTF-8 BOM
// (0xEF,0xBB,0xBF) at the start of the text stream.
int newMaxLineLength = (int) Math.min(3L + (long) maxLineLength,
Integer.MAX_VALUE);
int newSize = in.readLine(value, newMaxLineLength, maxBytesToConsume(pos));
// Even we read 3 extra bytes for the first line,
// we won't alter existing behavior (no backwards incompat issue).
// Because the newSize is less than maxLineLength and
// the number of bytes copied to Text is always no more than newSize.
// If the return size from readLine is not less than maxLineLength,
// we will discard the current line and read the next line.
pos += newSize;
int textLength = value.getLength();
byte[] textBytes = value.getBytes();
if ((textLength >= 3) && (textBytes[0] == (byte)0xEF) &&
(textBytes[1] == (byte)0xBB) && (textBytes[2] == (byte)0xBF)) {
// find UTF-8 BOM, strip it.
LOG.info("Found UTF-8 BOM and skipped it");
textLength -= 3;
newSize -= 3;
if (textLength > 0) {
// It may work to use the same buffer and not do the copyBytes
textBytes = value.copyBytes();
value.set(textBytes, 3, textLength);
} else {
value.clear();
}
}
return newSize;
}
public boolean nextKeyValue() throws IOException {
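// Initialize key if it is null
if (key == null) {
key = new LongWritable();
}
// Assign the current offset to key; pos starts at 0 for the first split
key.set(pos);
// pos is advanced by the number of bytes read for each line, e.g. after reading the line "hello"
// Initialize value
if (value == null) {
value = new Text();
}
// Number of bytes read for the current line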
int newSize = 0;
// We always read one extra line, which lies outside the upper
// split limit i.e. (end - 1)
while (getFilePosition() <= end || in.needAdditionalRecordAfterSplit()) {
// Assign newSize
if (pos == 0) {
newSize = skipUtfByteOrderMark();
} else {
newSize = in.readLine(value, maxLineLength, maxBytesToConsume(pos));
// Advance pos
pos += newSize;
}
if ((newSize == 0) || (newSize < maxLineLength)) {
break;
}
// line too long. try again
LOG.info("Skipped line of size " + newSize + " at pos " +
(pos - newSize));
}
// The split has been read completely
if (newSize == 0) {
// Drop the references so they can be garbage collected
key = null;
value = null;
return false;
} else {
return true;
}
}
// Simply returns the field,
// i.e. the starting offset of the current line
@Override
public LongWritable getCurrentKey() {
return key;
}
// The content of one line
@Override
public Text getCurrentValue() {
return value;
}
/**
* Get the progress within the split
*/
public float getProgress() throws IOException {
if (start == end) {
return 0.0f;
} else {
return Math.min(1.0f, (getFilePosition() - start) / (float)(end - start));
}
}
public synchronized void close() throws IOException {
try {
if (in != null) {
in.close();
}
} finally {
if (decompressor != null) {
CodecPool.returnDecompressor(decompressor);
decompressor = null;
}
}
}
}
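The two split rules in the code above (throw away the first line when the split does not start at offset 0, and keep reading while the position is still <= end) can be checked with a simplified plain-Java model. This is only a sketch: it works on an in-memory byte array, uses '\n' as the only delimiter, and leaves out compression, the BOM check and the maximum line length. The names readSplit and lineLength are made up for this illustration.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Simplified model of how a line reader assigns lines to splits; not the Hadoop implementation.
class SplitLineSketch {
    /** Bytes from 'from' through the next '\n' (inclusive), or up to the end of the data. */
    static int lineLength(byte[] data, int from) {
        int i = from;
        while (i < data.length && data[i] != '\n') {
            i++;
        }
        if (i < data.length) {
            i++; // include the newline itself
        }
        return i - from;
    }

    /** Returns "offset:line" for every line that belongs to the split [start, start + length). */
    static List<String> readSplit(byte[] data, int start, int length) {
        int end = start + length;
        int pos = start;
        if (start != 0) {
            // Not the first split: the first (possibly partial) line belongs to the previous split.
            pos += lineLength(data, pos);
        }
        List<String> records = new ArrayList<>();
        while (pos <= end) { // "<=" means one extra line past the split end is read
            int consumed = lineLength(data, pos);
            if (consumed == 0) {
                break; // end of data
            }
            int textLength = data[pos + consumed - 1] == '\n' ? consumed - 1 : consumed;
            records.add(pos + ":" + new String(data, pos, textLength, StandardCharsets.UTF_8));
            pos += consumed;
        }
        return records;
    }

    public static void main(String[] args) {
        byte[] data = "hello\nworld\nfoo\n".getBytes(StandardCharsets.UTF_8);
        // Two 8-byte splits; the boundary falls in the middle of "world".
        System.out.println(readSplit(data, 0, 8)); // [0:hello, 6:world]
        System.out.println(readSplit(data, 8, 8)); // [12:foo]
    }
}
```

Each line is read exactly once even though the split boundary cuts a line in half: the first split reads past its end to finish "world", and the second split skips the partial line it starts in.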
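2.10 Source code summary
Throughout the whole walk-through, the thread to hold on to is finding which step really implements getCurrentKey(), getCurrentValue() and nextKeyValue().
In the end they turn out to be implemented in the LineRecordReader returned by TextInputFormat.createRecordReader().
A custom input just has to follow the same pattern.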
public void run(Context context) throws IOException, InterruptedException {
setup(context);
try {
while (context.nextKeyValue()) {
map(context.getCurrentKey(), context.getCurrentValue(), context);
}
} finally {
cleanup(context);
}
}
Default input:
- FileInputFormat
- TextInputFormat (extends FileInputFormat)
- RecordReader
- LineRecordReader (extends RecordReader)
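The delegation chain traced in this section can be condensed into a plain-Java sketch. All types here are hypothetical stand-ins: RealReader plays the role of LineRecordReader, TrackingReader the role of NewTrackingRecordReader, and PassThrough the roles of MapContextImpl and WrappedMapper.Context, which only forward the calls.

```java
import java.util.Arrays;
import java.util.Iterator;

// Plain-Java model of the call chain mapperContext -> mapContext -> input -> real reader.
class DelegationChainSketch {
    interface KeyValueSource {
        boolean nextKeyValue();
        String getCurrentValue();
    }

    /** Innermost layer: actually produces the records. */
    static class RealReader implements KeyValueSource {
        private final Iterator<String> lines;
        private String current;

        RealReader(String... lines) {
            this.lines = Arrays.asList(lines).iterator();
        }

        public boolean nextKeyValue() {
            if (!lines.hasNext()) {
                return false;
            }
            current = lines.next();
            return true;
        }

        public String getCurrentValue() {
            return current;
        }
    }

    /** Delegates to the real reader and counts the records that pass through. */
    static class TrackingReader implements KeyValueSource {
        private final KeyValueSource real;
        int recordCount = 0;

        TrackingReader(KeyValueSource real) {
            this.real = real;
        }

        public boolean nextKeyValue() {
            boolean result = real.nextKeyValue();
            if (result) {
                recordCount++;
            }
            return result;
        }

        public String getCurrentValue() {
            return real.getCurrentValue();
        }
    }

    /** Forwards every call unchanged to the wrapped object. */
    static class PassThrough implements KeyValueSource {
        private final KeyValueSource inner;

        PassThrough(KeyValueSource inner) {
            this.inner = inner;
        }

        public boolean nextKeyValue() {
            return inner.nextKeyValue();
        }

        public String getCurrentValue() {
            return inner.getCurrentValue();
        }
    }

    /** Reads everything through the given source and joins the values with '|'. */
    static String drain(KeyValueSource source) {
        StringBuilder out = new StringBuilder();
        while (source.nextKeyValue()) {
            if (out.length() > 0) {
                out.append('|');
            }
            out.append(source.getCurrentValue());
        }
        return out.toString();
    }

    public static void main(String[] args) {
        TrackingReader input = new TrackingReader(new RealReader("a", "b"));
        KeyValueSource mapContext = new PassThrough(input);
        KeyValueSource mapperContext = new PassThrough(mapContext);
        System.out.println(drain(mapperContext)); // a|b
        System.out.println(input.recordCount);    // 2
    }
}
```

Only the innermost reader decides what a record is, so replacing it is enough to change what map() receives.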
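3. Custom input
Now let's work through merging many small files into one large file.
Custom input:
- Create a class that extends FileInputFormat
- Override createRecordReader()
- Create the actual file reader, a class that extends RecordReader
- Override getCurrentKey(), getCurrentValue() and nextKeyValue()
- Specify the custom input class in the job
job.setInputFormatClass(MyFileInputFormat.class);
Requirement:
merge the three small files 1.txt, 2.txt and 3.txt
into one large file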
MyFileInputFormat.java
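import java.io.IOException;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

/**
 * The type parameters are the key and value types that are read,
 * i.e. the mapper's input types.
 *
 * One whole file is read per record:
 * the file content is a Text,
 * and the key can be NullWritable here.
 */
public class MyFileInputFormat extends FileInputFormat<NullWritable, Text> {
    /** Each file is read as a single record, so it must not be split. */
    @Override
    protected boolean isSplitable(JobContext context, Path file) {
        return false;
    }

    /**
     * Returns the record reader.
     *
     * The input paths come from FileInputFormat.addInputPath(job, ...) via job.xml.
     */
    @Override
    public RecordReader<NullWritable, Text> createRecordReader(InputSplit split, TaskAttemptContext context) throws IOException, InterruptedException {
        // initialize() is not called here: MapTask calls input.initialize(split, mapperContext) itself
        return new MyRecordReader();
    }
}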
MyRecordReader.java
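import java.io.IOException;

import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

/**
 * The record reader: does the actual file reading.
 * It opens an HDFS input stream:
 * FileSystem fs
 * fs.open(path)
 * Note:
 * the framework always calls nextKeyValue() first to check whether there is anything to read,
 * and only then calls getCurrentKey() and getCurrentValue().
 */
public class MyRecordReader extends RecordReader<NullWritable, Text> {
    private int length;
    private FSDataInputStream fsDataInputStream;
    private final Text value = new Text();
    // true once the file has been read; false (the default) while it is still unread
    private boolean processed;

    /**
     * Initialization: opens the HDFS input stream.
     * @param split the input split
     * @param context the task context
     * @throws IOException
     * @throws InterruptedException
     */
    @Override
    public void initialize(InputSplit split, TaskAttemptContext context) throws IOException, InterruptedException {
        // Get the file path from the split
        FileSplit fsplit = (FileSplit) split;
        Path path = fsplit.getPath();
        // Length of the split, which is the whole file when the file is not split (assumes it fits in an int)
        length = (int) fsplit.getLength();
        // context.getConfiguration() returns the job configuration
        FileSystem fs = path.getFileSystem(context.getConfiguration());
        // Open the stream
        fsDataInputStream = fs.open(path);
    }

    /**
     * Reads the whole file as one record on the first call.
     * @return true if a record was read and is available, false when there is nothing left to read
     * @throws IOException
     * @throws InterruptedException
     */
    @Override
    public boolean nextKeyValue() throws IOException, InterruptedException {
        if (!processed) { // the file has not been read yet
            byte[] buf = new byte[length];
            fsDataInputStream.readFully(buf, 0, length);
            // Put the content into value
            value.set(buf);
            // The whole file is read in one go, so mark it as done
            processed = true;
            return true;
        } else {
            return false;
        }
    }

    @Override
    public NullWritable getCurrentKey() throws IOException, InterruptedException {
        return NullWritable.get();
    }

    @Override
    public Text getCurrentValue() throws IOException, InterruptedException {
        return this.value;
    }

    @Override
    public float getProgress() throws IOException, InterruptedException {
        // The file is either completely read or not read at all
        return processed ? 1.0f : 0.0f;
    }

    /**
     * Closes the stream.
     * @throws IOException
     */
    @Override
    public void close() throws IOException {
        if (fsDataInputStream != null) {
            fsDataInputStream.close();
        }
        // The FileSystem is not closed here: instances are cached and shared, and closing one can break other users
    }
}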
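The "one record per file" protocol of the reader can be tried locally without HDFS. The sketch below is a plain-Java analogue, not the Hadoop class: it reads from the local file system with java.nio, and WholeFileReaderSketch and readAllRecords are hypothetical names used only here.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

// Local-file analogue of a whole-file record reader: exactly one record per file.
class WholeFileReaderSketch {
    private final Path path;
    private String value;
    private boolean processed = false;

    WholeFileReaderSketch(Path path) {
        this.path = path;
    }

    /** Reads the whole file on the first call; every later call reports "no more records". */
    boolean nextKeyValue() {
        if (processed) {
            return false;
        }
        try {
            value = new String(Files.readAllBytes(path), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        processed = true;
        return true;
    }

    String getCurrentValue() {
        return value;
    }

    float getProgress() {
        return processed ? 1.0f : 0.0f;
    }

    /** Writes content to a temporary file and drains it through the reader. */
    static List<String> readAllRecords(String content) {
        try {
            Path tmp = Files.createTempFile("whole-file", ".txt");
            try {
                Files.write(tmp, content.getBytes(StandardCharsets.UTF_8));
                WholeFileReaderSketch reader = new WholeFileReaderSketch(tmp);
                List<String> records = new ArrayList<>();
                while (reader.nextKeyValue()) {
                    records.add(reader.getCurrentValue());
                }
                return records;
            } finally {
                Files.deleteIfExists(tmp);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        List<String> records = readAllRecords("line 1\nline 2\n");
        System.out.println(records.size()); // 1
    }
}
```

A two-line file yields a single record, so a mapper driven by such a reader is called once per file rather than once per line.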
MergeFiles.java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

/**
 * By default, map() is called once per line.
 * With the custom input, a whole file is read at once, map() is called once per file,
 * and the content is sent straight to the reduce side.
 * The reduce side merges all the files.
 */
public class MergeFiles {
    static class MergeFilesMapper extends Mapper<NullWritable, Text, Text, NullWritable> {
        // Called once per file
        @Override
        protected void map(NullWritable key, Text value, Context context) throws IOException, InterruptedException {
            // Send the file content straight to the reduce side
            context.write(value, NullWritable.get());
        }
    }

    static class MergeFilesReducer extends Reducer<Text, NullWritable, Text, NullWritable> {
        @Override
        protected void reduce(Text key, Iterable<NullWritable> values, Context context) throws IOException, InterruptedException {
            // Files with identical content share one key; write the key once per value so duplicates are kept
            for (NullWritable v : values) {
                context.write(key, NullWritable.get());
            }
        }
    }

    public static void main(String[] args) throws IOException, ClassNotFoundException, InterruptedException {
        System.setProperty("HADOOP_USER_NAME", "hdp01");
        Configuration conf = new Configuration();
        conf.set("mapreduce.framework.name", "local");
        conf.set("fs.defaultFS", "hdfs://10.211.55.20:9000");
        Job job = Job.getInstance(conf);
        job.setJarByClass(MergeFiles.class);
        job.setMapperClass(MergeFilesMapper.class);
        job.setReducerClass(MergeFilesReducer.class);
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(NullWritable.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(NullWritable.class);
        // Specify the custom input class
        job.setInputFormatClass(MyFileInputFormat.class);
        FileInputFormat.addInputPath(job, new Path("/tmpin/invetedIndex"));
        FileSystem fs = FileSystem.get(conf);
        Path outPath = new Path("/tmpout/mergeFiles");
        if (fs.exists(outPath)) { // delete the output directory if it already exists
            fs.delete(outPath, true);
        }
        FileOutputFormat.setOutputPath(job, outPath);
        job.waitForCompletion(true);
    }
}
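The data flow of this job can be imitated in plain Java to see how the merged result is ordered. This is a rough sketch under assumptions: a TreeMap stands in for the shuffle (keys sorted, equal keys grouped), String ordering stands in for Text ordering (equivalent for ASCII content), and the way the output format writes records is not modelled. MergeFlowSketch and merge are hypothetical names.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java imitation of map -> shuffle -> reduce for the merge job.
class MergeFlowSketch {
    static String merge(List<String> fileContents) {
        // "Map" and "shuffle": each whole-file content becomes a key;
        // the TreeMap sorts the keys and counts how many files share the same content.
        Map<String, Integer> grouped = new TreeMap<>();
        for (String content : fileContents) {
            grouped.merge(content, 1, Integer::sum);
        }
        // "Reduce": write the key once per grouped value, so duplicate files are kept.
        StringBuilder out = new StringBuilder();
        for (Map.Entry<String, Integer> entry : grouped.entrySet()) {
            for (int i = 0; i < entry.getValue(); i++) {
                out.append(entry.getKey());
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.print(merge(Arrays.asList("b\n", "a\n", "a\n")));
        // prints the lines a, a, b
    }
}
```

The merged content follows the sort order of the file contents, not the order of the file names, because the content itself is used as the key.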
Input files:
[hdp01@hdp01 tmpfiles]$ cat 1.txt
A friend in need is a friend indeed
Good is good but better carries it
[hdp01@hdp01 tmpfiles]$ cat 2.txt
A good name is better than riches
Time is a bird for ever on the wing
Adversity is a good disciple
[hdp01@hdp01 tmpfiles]$ cat 3.txt
Doubt is the key to knowledge
Output file:
[hdp01@hdp01 tmpfiles]$ hdfs dfs -cat /tmpout/mergeFiles/part-r-00000
A friend in need is a friend indeed
Good is good but better carries it
A good name is better than riches
Time is a bird for ever on the wing
Adversity is a good disciple
Doubt is the key to knowledge
PS: the default number of reduce tasks is 1, so only one output file is produced.