【MapReduce】MapReduce Basics (Part 4): Custom Input

1. Why Do We Need a Custom Input Format

As we know, the NameNode stores the filesystem metadata and keeps all of it in memory at runtime, so the total number of files HDFS can hold is limited by the NameNode's heap size.
Each block corresponds to one record in the NameNode (roughly 150 bytes of metadata per block), so a large number of small files consumes a lot of memory; as a rough illustration, 10 million small files means at least 10 million blocks, i.e. on the order of 1.5 GB of NameNode heap just for block records. At the same time, the number of map tasks is determined by the input splits, so running MapReduce over many small files spawns far too many map tasks, and the task-management overhead inflates the job time. Processing a large number of small files is much slower than processing the same amount of data stored in large files, which is why Hadoop recommends storing large files.

We can configure the job to use CombineTextInputFormat instead, but:

  • it only packs multiple small files into a single map task at run time;
  • the data is still physically stored as many small files;
  • the pressure on the HDFS NameNode remains just as high.

It is set up like this:

//when combining, merge input files according to the split size
job.setInputFormatClass(CombineTextInputFormat.class);
//set the minimum split size (> 128 MB)
CombineTextInputFormat.setMinInputSplitSize(job, 130*1024*1024);
FileInputFormat.addInputPath(job, new Path("/in"));
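
For reference, the knob most often tuned alongside this is the maximum combined split size. The sketch below is only a hedged illustration (the 128 MB cap and the /in path are assumptions, not the article's original settings); it is a fragment of a driver's main(), in the same style as the snippet above:

//Hypothetical driver fragment: pack small files under /in into combined splits of at most ~128 MB.
//setMaxInputSplitSize is the static helper inherited from FileInputFormat; it caps the size of
//each combined split that CombineTextInputFormat builds.
Job job = Job.getInstance(new Configuration(), "combine-small-files");
job.setInputFormatClass(CombineTextInputFormat.class);
CombineTextInputFormat.setMaxInputSplitSize(job, 128 * 1024 * 1024);
FileInputFormat.addInputPath(job, new Path("/in"));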

The real fix for a large number of small files is to merge them with a MapReduce job:
combine many small files into one large file.
Analysis:

  • the default map() method is called once per line;
  • with a custom input format we read an entire file at a time, call map() once per file, and send the content straight to the reduce side;
  • the reduce side then merges all the files.

2. Source Code Analysis of the Default Input

First, find the entry point for input. When the mapper runs, map() is called once per line by default, so let's start by looking at what Mapper actually does.

The Mapper class defines four methods:

  • setup(): called once when the map task starts
  • cleanup(): called once when the map task finishes
  • map(): called once per line by default
  • run(): drives the three methods above

So we start the analysis from run() in the Mapper source below.

2.1 org.apache.hadoop.mapreduce.Mapper

public class Mapper<KEYIN, VALUEIN, KEYOUT, VALUEOUT> {

  /**
   * The <code>Context</code> passed on to the {@link Mapper} implementations.
   */
  public abstract class Context
    implements MapContext<KEYIN,VALUEIN,KEYOUT,VALUEOUT> {
  }
  
  /**
   * Called once at the beginning of the task.
   */
  protected void setup(Context context
                       ) throws IOException, InterruptedException {
    // NOTHING
  }

  /**
   * Called once for each key/value pair in the input split. Most applications
   * should override this, but the default is the identity function.
   */
  @SuppressWarnings("unchecked")
  protected void map(KEYIN key, VALUEIN value, 
                     Context context) throws IOException, InterruptedException {
    context.write((KEYOUT) key, (VALUEOUT) value);
  }

  /**
   * Called once at the end of the task.
   */
  protected void cleanup(Context context
                         ) throws IOException, InterruptedException {
    // NOTHING
  }
  
  /**
   * Expert users can override this method for more complete control over the
   * execution of the Mapper.
   * @param context
   * @throws IOException
   */
  public void run(Context context) throws IOException, InterruptedException {
    setup(context);
    try {    	
    /*
     * context.nextKeyValue()    checks whether there is another line to read
     * context.getCurrentKey()   returns the offset of the current line
     * context.getCurrentValue() returns the content of the current line
     * The key is to trace the context parameter,
     * i.e. to find out who calls run(context).
     */
      while (context.nextKeyValue()) {
        map(context.getCurrentKey(), context.getCurrentValue(), context);
      }
    } finally {
      cleanup(context);
    }
  }
}

Clicking into run(), we find it is invoked in MapTask as mapper.run(mapperContext).
The relevant part of the code is shown below:

2.2 org.apache.hadoop.mapred.MapTask

// make the input format
//create the InputFormat via reflection; the key is the class returned by taskContext.getInputFormatClass(), then follow inputFormat.createRecordReader(split, taskContext)
org.apache.hadoop.mapreduce.InputFormat<INKEY,INVALUE> inputFormat =
      (org.apache.hadoop.mapreduce.InputFormat<INKEY,INVALUE>)
        ReflectionUtils.newInstance(taskContext.getInputFormatClass(), job);

//this.real=inputFormat.createRecordReader(split, taskContext)
org.apache.hadoop.mapreduce.RecordReader<INKEY,INVALUE> input =
      new NewTrackingRecordReader<INKEY,INVALUE>
        (split, inputFormat, reporter, taskContext);
    
    job.setBoolean(JobContext.SKIP_RECORDS, isSkipping());
    org.apache.hadoop.mapreduce.RecordWriter output = null;
    
    // get an output object
    if (job.getNumReduceTasks() == 0) {
      output = 
        new NewDirectOutputCollector(taskContext, job, umbilical, reporter);
    } else {
      output = new NewOutputCollector(taskContext, job, umbilical, reporter);
    }
    
org.apache.hadoop.mapreduce.MapContext<INKEY, INVALUE, OUTKEY, OUTVALUE> 
	//mapContext ---> input
    mapContext = 
      new MapContextImpl<INKEY, INVALUE, OUTKEY, OUTVALUE>(job, getTaskID(), 
          input, output, 
          committer, 
          reporter, split);
    org.apache.hadoop.mapreduce.Mapper<INKEY,INVALUE,OUTKEY,OUTVALUE>.Context 
    	//see what WrappedMapper.getMapContext returns
    	//mapperContext ---> mapContext
        mapperContext = 
          new WrappedMapper<INKEY, INVALUE, OUTKEY, OUTVALUE>().getMapContext(
              mapContext);

    try {
      input.initialize(split, mapperContext);
      //this mapper is the instance of the class set via job.setMapperClass()
      //look inside mapperContext: it must provide nextKeyValue(), getCurrentKey(), getCurrentValue()
      mapper.run(mapperContext);
      mapPhase.complete();
      setPhase(TaskStatus.Phase.SORT);
      statusUpdate(umbilical);
      input.close();
      input = null;
      output.close(mapperContext);
      output = null;
    } finally {
      closeQuietly(input);
      closeQuietly(output, mapperContext);
    }
  }

So what does WrappedMapper.getMapContext() return?
It returns a Context object, and that object provides the three methods getCurrentKey(), getCurrentValue(), and nextKeyValue().

2.3 org.apache.hadoop.mapreduce.lib.map.WrappedMapper

/**
   * Get a wrapped {@link Mapper.Context} for custom implementations.
   * @param mapContext <code>MapContext</code> to be wrapped
   * @return a wrapped <code>Mapper.Context</code> for custom implementations
   */
  public Mapper<KEYIN, VALUEIN, KEYOUT, VALUEOUT>.Context
  getMapContext(MapContext<KEYIN, VALUEIN, KEYOUT, VALUEOUT> mapContext) {
  	//returns a Context object, which should contain the three methods our Mapper relies on: nextKeyValue(), getCurrentKey(), getCurrentValue()
    return new Context(mapContext);
  }
  
  //the Context below indeed has getCurrentKey(), getCurrentValue(), and nextKeyValue();
  //their return values all come from this.mapContext = mapContext.
  //Tracing the mapContext parameter back to where it is passed in, MapTask creates it as a MapContextImpl object, so we look at that class next.
  @InterfaceStability.Evolving
  public class Context 
      extends Mapper<KEYIN, VALUEIN, KEYOUT, VALUEOUT>.Context {

    protected MapContext<KEYIN, VALUEIN, KEYOUT, VALUEOUT> mapContext;

	// constructor
    public Context(MapContext<KEYIN, VALUEIN, KEYOUT, VALUEOUT> mapContext) {
      this.mapContext = mapContext;
    }

    /**
     * Get the input split for this map.
     */
    public InputSplit getInputSplit() {
      return mapContext.getInputSplit();
    }

    @Override
    public KEYIN getCurrentKey() throws IOException, InterruptedException {
      return mapContext.getCurrentKey();
    }

    @Override
    public VALUEIN getCurrentValue() throws IOException, InterruptedException {
      return mapContext.getCurrentValue();
    }

    @Override
    public boolean nextKeyValue() throws IOException, InterruptedException {
      return mapContext.nextKeyValue();
    }

    @Override
    public Counter getCounter(Enum<?> counterName) {
      return mapContext.getCounter(counterName);
    }

    @Override
    public Counter getCounter(String groupName, String counterName) {
      return mapContext.getCounter(groupName, counterName);
    }

    @Override
    public OutputCommitter getOutputCommitter() {
      return mapContext.getOutputCommitter();
    }

    @Override
    public void write(KEYOUT key, VALUEOUT value) throws IOException,
        InterruptedException {
      mapContext.write(key, value);
    }

    @Override
    public String getStatus() {
      return mapContext.getStatus();
    }

    @Override
    public TaskAttemptID getTaskAttemptID() {
      return mapContext.getTaskAttemptID();
    }

    @Override
    public void setStatus(String msg) {
      mapContext.setStatus(msg);
    }

    @Override
    public Path[] getArchiveClassPaths() {
      return mapContext.getArchiveClassPaths();
    }

    @Override
    public String[] getArchiveTimestamps() {
      return mapContext.getArchiveTimestamps();
    }

    @Override
    public URI[] getCacheArchives() throws IOException {
      return mapContext.getCacheArchives();
    }

    @Override
    public URI[] getCacheFiles() throws IOException {
      return mapContext.getCacheFiles();
    }

    @Override
    public Class<? extends Reducer<?, ?, ?, ?>> getCombinerClass()
        throws ClassNotFoundException {
      return mapContext.getCombinerClass();
    }

    @Override
    public Configuration getConfiguration() {
      return mapContext.getConfiguration();
    }

    @Override
    public Path[] getFileClassPaths() {
      return mapContext.getFileClassPaths();
    }

    @Override
    public String[] getFileTimestamps() {
      return mapContext.getFileTimestamps();
    }

    @Override
    public RawComparator<?> getCombinerKeyGroupingComparator() {
      return mapContext.getCombinerKeyGroupingComparator();
    }

    @Override
    public RawComparator<?> getGroupingComparator() {
      return mapContext.getGroupingComparator();
    }

    @Override
    public Class<? extends InputFormat<?, ?>> getInputFormatClass()
        throws ClassNotFoundException {
      return mapContext.getInputFormatClass();
    }

    @Override
    public String getJar() {
      return mapContext.getJar();
    }

    @Override
    public JobID getJobID() {
      return mapContext.getJobID();
    }

    @Override
    public String getJobName() {
      return mapContext.getJobName();
    }

    @Override
    public boolean getJobSetupCleanupNeeded() {
      return mapContext.getJobSetupCleanupNeeded();
    }

    @Override
    public boolean getTaskCleanupNeeded() {
      return mapContext.getTaskCleanupNeeded();
    }

    @Override
    public Path[] getLocalCacheArchives() throws IOException {
      return mapContext.getLocalCacheArchives();
    }

    @Override
    public Path[] getLocalCacheFiles() throws IOException {
      return mapContext.getLocalCacheFiles();
    }

    @Override
    public Class<?> getMapOutputKeyClass() {
      return mapContext.getMapOutputKeyClass();
    }

    @Override
    public Class<?> getMapOutputValueClass() {
      return mapContext.getMapOutputValueClass();
    }

    @Override
    public Class<? extends Mapper<?, ?, ?, ?>> getMapperClass()
        throws ClassNotFoundException {
      return mapContext.getMapperClass();
    }

    @Override
    public int getMaxMapAttempts() {
      return mapContext.getMaxMapAttempts();
    }

    @Override
    public int getMaxReduceAttempts() {
      return mapContext.getMaxReduceAttempts();
    }

    @Override
    public int getNumReduceTasks() {
      return mapContext.getNumReduceTasks();
    }

    @Override
    public Class<? extends OutputFormat<?, ?>> getOutputFormatClass()
        throws ClassNotFoundException {
      return mapContext.getOutputFormatClass();
    }

    @Override
    public Class<?> getOutputKeyClass() {
      return mapContext.getOutputKeyClass();
    }

    @Override
    public Class<?> getOutputValueClass() {
      return mapContext.getOutputValueClass();
    }

    @Override
    public Class<? extends Partitioner<?, ?>> getPartitionerClass()
        throws ClassNotFoundException {
      return mapContext.getPartitionerClass();
    }

    @Override
    public Class<? extends Reducer<?, ?, ?, ?>> getReducerClass()
        throws ClassNotFoundException {
      return mapContext.getReducerClass();
    }

    @Override
    public RawComparator<?> getSortComparator() {
      return mapContext.getSortComparator();
    }

    @Override
    public boolean getSymlink() {
      return mapContext.getSymlink();
    }

    @Override
    public Path getWorkingDirectory() throws IOException {
      return mapContext.getWorkingDirectory();
    }

    @Override
    public void progress() {
      mapContext.progress();
    }

    @Override
    public boolean getProfileEnabled() {
      return mapContext.getProfileEnabled();
    }

    @Override
    public String getProfileParams() {
      return mapContext.getProfileParams();
    }

    @Override
    public IntegerRanges getProfileTaskRange(boolean isMap) {
      return mapContext.getProfileTaskRange(isMap);
    }

    @Override
    public String getUser() {
      return mapContext.getUser();
    }

    @Override
    public Credentials getCredentials() {
      return mapContext.getCredentials();
    }
    
    @Override
    public float getProgress() {
      return mapContext.getProgress();
    }
  }

2.4 org.apache.hadoop.mapreduce.task.MapContextImpl

In MapTask, mapContext is created as new MapContextImpl(...).
MapContextImpl also has getCurrentKey(), getCurrentValue(), and nextKeyValue(), and all three simply delegate to reader,
which is the third parameter of the MapContextImpl constructor (RecordReader<KEYIN,VALUEIN> reader).
So we go back to MapTask to see where that third argument comes from.

public class MapContextImpl<KEYIN,VALUEIN,KEYOUT,VALUEOUT> 
    extends TaskInputOutputContextImpl<KEYIN,VALUEIN,KEYOUT,VALUEOUT> 
    implements MapContext<KEYIN, VALUEIN, KEYOUT, VALUEOUT> {
  private RecordReader<KEYIN,VALUEIN> reader;
  private InputSplit split;

  public MapContextImpl(Configuration conf, TaskAttemptID taskid,
                        RecordReader<KEYIN,VALUEIN> reader,
                        RecordWriter<KEYOUT,VALUEOUT> writer,
                        OutputCommitter committer,
                        StatusReporter reporter,
                        InputSplit split) {
    super(conf, taskid, writer, committer, reporter);
    this.reader = reader;
    this.split = split;
  }

  /**
   * Get the input split for this map.
   */
  public InputSplit getInputSplit() {
    return split;
  }

  @Override
  public KEYIN getCurrentKey() throws IOException, InterruptedException {
    return reader.getCurrentKey();
  }

  @Override
  public VALUEIN getCurrentValue() throws IOException, InterruptedException {
    return reader.getCurrentValue();
  }

  @Override
  public boolean nextKeyValue() throws IOException, InterruptedException {
    return reader.nextKeyValue();
  }

}
   

2.5 org.apache.hadoop.mapred.MapTask.NewTrackingRecordReader

In MapTask, input =
new NewTrackingRecordReader<INKEY,INVALUE>
(split, inputFormat, reporter, taskContext);
so we look at the NewTrackingRecordReader class: its getCurrentKey(), getCurrentValue(), and nextKeyValue()
all delegate to org.apache.hadoop.mapreduce.RecordReader<K,V> real,
which is assigned by this.real = inputFormat.createRecordReader(split, taskContext).
The inputFormat comes from the second constructor parameter,
so we return to MapTask once more to find where inputFormat is created.

static class NewTrackingRecordReader<K,V> 
    extends org.apache.hadoop.mapreduce.RecordReader<K,V> {
    private final org.apache.hadoop.mapreduce.RecordReader<K,V> real;
    private final org.apache.hadoop.mapreduce.Counter inputRecordCounter;
    private final org.apache.hadoop.mapreduce.Counter fileInputByteCounter;
    private final TaskReporter reporter;
    private final List<Statistics> fsStats;
    
    NewTrackingRecordReader(org.apache.hadoop.mapreduce.InputSplit split,
        org.apache.hadoop.mapreduce.InputFormat<K, V> inputFormat,
        TaskReporter reporter,
        org.apache.hadoop.mapreduce.TaskAttemptContext taskContext)
        throws InterruptedException, IOException {
      this.reporter = reporter;
      this.inputRecordCounter = reporter
          .getCounter(TaskCounter.MAP_INPUT_RECORDS);
      this.fileInputByteCounter = reporter
          .getCounter(FileInputFormatCounter.BYTES_READ);

      List <Statistics> matchedStats = null;
      if (split instanceof org.apache.hadoop.mapreduce.lib.input.FileSplit) {
        matchedStats = getFsStatistics(((org.apache.hadoop.mapreduce.lib.input.FileSplit) split)
            .getPath(), taskContext.getConfiguration());
      }
      fsStats = matchedStats;

      long bytesInPrev = getInputBytes(fsStats);
      this.real = inputFormat.createRecordReader(split, taskContext);
      long bytesInCurr = getInputBytes(fsStats);
      fileInputByteCounter.increment(bytesInCurr - bytesInPrev);
    }

    @Override
    public void close() throws IOException {
      long bytesInPrev = getInputBytes(fsStats);
      real.close();
      long bytesInCurr = getInputBytes(fsStats);
      fileInputByteCounter.increment(bytesInCurr - bytesInPrev);
    }

    @Override
    public K getCurrentKey() throws IOException, InterruptedException {
      return real.getCurrentKey();
    }

    @Override
    public V getCurrentValue() throws IOException, InterruptedException {
      return real.getCurrentValue();
    }

    @Override
    public float getProgress() throws IOException, InterruptedException {
      return real.getProgress();
    }

    @Override
    public void initialize(org.apache.hadoop.mapreduce.InputSplit split,
                           org.apache.hadoop.mapreduce.TaskAttemptContext context
                           ) throws IOException, InterruptedException {
      long bytesInPrev = getInputBytes(fsStats);
      real.initialize(split, context);
      long bytesInCurr = getInputBytes(fsStats);
      fileInputByteCounter.increment(bytesInCurr - bytesInPrev);
    }

    @Override
    public boolean nextKeyValue() throws IOException, InterruptedException {
      long bytesInPrev = getInputBytes(fsStats);
      boolean result = real.nextKeyValue();
      long bytesInCurr = getInputBytes(fsStats);
      if (result) {
        inputRecordCounter.increment(1);
      }
      fileInputByteCounter.increment(bytesInCurr - bytesInPrev);
      reporter.setProgress(getProgress());
      return result;
    }

    private long getInputBytes(List<Statistics> stats) {
      if (stats == null) return 0;
      long bytesRead = 0;
      for (Statistics stat: stats) {
        bytesRead = bytesRead + stat.getBytesRead();
      }
      return bytesRead;
    }
  }
  

2.6 org.apache.hadoop.mapreduce.JobContext#getInputFormatClass

In MapTask,
inputFormat =
(org.apache.hadoop.mapreduce.InputFormat<INKEY,INVALUE>)
ReflectionUtils.newInstance(taskContext.getInputFormatClass(), job);
getInputFormatClass() turns out to be only an abstract declaration whose return type is a subclass of InputFormat,
so we continue to the implementation to see which InputFormat subclass it actually returns.

/**
   * Get the {@link InputFormat} class for the job.
   * 
   * @return the {@link InputFormat} class for the job.
   */
  public Class<? extends InputFormat<?,?>> getInputFormatClass() 
     throws ClassNotFoundException;

2.7 org.apache.hadoop.mapreduce.task.JobContextImpl#getInputFormatClass

/**
   * Get the {@link InputFormat} class for the job.
   * 
   * @return the {@link InputFormat} class for the job.
   */
  @SuppressWarnings("unchecked")
  public Class<? extends InputFormat<?,?>> getInputFormatClass() 
     throws ClassNotFoundException {
    return (Class<? extends InputFormat<?,?>>) 
      //conf corresponds to job.xml
      //public static final String INPUT_FORMAT_CLASS_ATTR = "mapreduce.job.inputformat.class";
      //mapred-default.xml does not define mapreduce.job.inputformat.class, so getClass() falls back to the second argument, TextInputFormat.class
      //next, check what TextInputFormat.createRecordReader() returns
      conf.getClass(INPUT_FORMAT_CLASS_ATTR, TextInputFormat.class);
  }
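
This fallback is also why setting an input format in the driver takes effect: Job.setInputFormatClass() stores the class under mapreduce.job.inputformat.class, which is exactly the key read here. A minimal hedged sketch (the class name is simply the one we define in section 3):

//Hypothetical driver lines: override the default TextInputFormat.
//Job.setInputFormatClass() writes the class under mapreduce.job.inputformat.class,
//so the getClass() call above returns it instead of TextInputFormat.class.
Job job = Job.getInstance(new Configuration());
job.setInputFormatClass(MyFileInputFormat.class);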
  

2.8 org.apache.hadoop.mapreduce.lib.input.TextInputFormat

We can see that createRecordReader() returns new LineRecordReader(recordDelimiterBytes),
so LineRecordReader is where the real implementations of nextKeyValue(), getCurrentKey(), and getCurrentValue() should live.

public class TextInputFormat extends FileInputFormat<LongWritable, Text> {

  @Override
  public RecordReader<LongWritable, Text> 
    createRecordReader(InputSplit split,
                       TaskAttemptContext context) {
    String delimiter = context.getConfiguration().get(
        "textinputformat.record.delimiter");
    byte[] recordDelimiterBytes = null;
    if (null != delimiter)
      recordDelimiterBytes = delimiter.getBytes(Charsets.UTF_8);
    return new LineRecordReader(recordDelimiterBytes);
  }

  @Override
  protected boolean isSplitable(JobContext context, Path file) {
    final CompressionCodec codec =
      new CompressionCodecFactory(context.getConfiguration()).getCodec(file);
    if (null == codec) {
      return true;
    }
    return codec instanceof SplittableCompressionCodec;
  }

}
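
A side note this code suggests: createRecordReader() reads textinputformat.record.delimiter from the configuration, so a job can redefine what counts as a "line" without writing a custom InputFormat. A small hedged sketch (the blank-line delimiter is only an illustration, not something the article uses):

//Hypothetical: make each record a blank-line-separated paragraph instead of a single line.
Configuration conf = new Configuration();
conf.set("textinputformat.record.delimiter", "\n\n");
Job job = Job.getInstance(conf, "paragraph-records");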

2.9 org.apache.hadoop.mapreduce.lib.input.LineRecordReader

This is where the actual line-by-line reading is implemented.

public class LineRecordReader extends RecordReader<LongWritable, Text> {
  private static final Log LOG = LogFactory.getLog(LineRecordReader.class);
  public static final String MAX_LINE_LENGTH = 
    "mapreduce.input.linerecordreader.line.maxlength";

  private long start;
  //pos: byte offset of the beginning of the current line
  private long pos;
  private long end;
  private SplitLineReader in;
  private FSDataInputStream fileIn;
  private Seekable filePosition;
  private int maxLineLength;
  //starting byte offset of each line
  private LongWritable key;
  //content of each line
  private Text value;
  private boolean isCompressedInput;
  private Decompressor decompressor;
  private byte[] recordDelimiterBytes;

  public LineRecordReader() {
  }

  public LineRecordReader(byte[] recordDelimiter) {
    this.recordDelimiterBytes = recordDelimiter;
  }

  public void initialize(InputSplit genericSplit,
                         TaskAttemptContext context) throws IOException {
    FileSplit split = (FileSplit) genericSplit;
    Configuration job = context.getConfiguration();
    this.maxLineLength = job.getInt(MAX_LINE_LENGTH, Integer.MAX_VALUE);
    start = split.getStart();
    end = start + split.getLength();
    final Path file = split.getPath();

    // open the file and seek to the start of the split
    final FileSystem fs = file.getFileSystem(job);
    fileIn = fs.open(file);
    
    CompressionCodec codec = new CompressionCodecFactory(job).getCodec(file);
    if (null!=codec) {
      isCompressedInput = true;	
      decompressor = CodecPool.getDecompressor(codec);
      if (codec instanceof SplittableCompressionCodec) {
        final SplitCompressionInputStream cIn =
          ((SplittableCompressionCodec)codec).createInputStream(
            fileIn, decompressor, start, end,
            SplittableCompressionCodec.READ_MODE.BYBLOCK);
        in = new CompressedSplitLineReader(cIn, job,
            this.recordDelimiterBytes);
        start = cIn.getAdjustedStart();
        end = cIn.getAdjustedEnd();
        filePosition = cIn;
      } else {
        in = new SplitLineReader(codec.createInputStream(fileIn,
            decompressor), job, this.recordDelimiterBytes);
        filePosition = fileIn;
      }
    } else {
      fileIn.seek(start);
      in = new UncompressedSplitLineReader(
          fileIn, job, this.recordDelimiterBytes, split.getLength());
      filePosition = fileIn;
    }
    // If this is not the first split, we always throw away first record
    // because we always (except the last split) read one extra line in
    // next() method.
    if (start != 0) {
      start += in.readLine(new Text(), 0, maxBytesToConsume(start));
    }
    this.pos = start;
  }
  

  private int maxBytesToConsume(long pos) {
    return isCompressedInput
      ? Integer.MAX_VALUE
      : (int) Math.max(Math.min(Integer.MAX_VALUE, end - pos), maxLineLength);
  }

  private long getFilePosition() throws IOException {
    long retVal;
    if (isCompressedInput && null != filePosition) {
      retVal = filePosition.getPos();
    } else {
      retVal = pos;
    }
    return retVal;
  }

  private int skipUtfByteOrderMark() throws IOException {
    // Strip BOM(Byte Order Mark)
    // Text only support UTF-8, we only need to check UTF-8 BOM
    // (0xEF,0xBB,0xBF) at the start of the text stream.
    int newMaxLineLength = (int) Math.min(3L + (long) maxLineLength,
        Integer.MAX_VALUE);
    int newSize = in.readLine(value, newMaxLineLength, maxBytesToConsume(pos));
    // Even we read 3 extra bytes for the first line,
    // we won't alter existing behavior (no backwards incompat issue).
    // Because the newSize is less than maxLineLength and
    // the number of bytes copied to Text is always no more than newSize.
    // If the return size from readLine is not less than maxLineLength,
    // we will discard the current line and read the next line.
    pos += newSize;
    int textLength = value.getLength();
    byte[] textBytes = value.getBytes();
    if ((textLength >= 3) && (textBytes[0] == (byte)0xEF) &&
        (textBytes[1] == (byte)0xBB) && (textBytes[2] == (byte)0xBF)) {
      // find UTF-8 BOM, strip it.
      LOG.info("Found UTF-8 BOM and skipped it");
      textLength -= 3;
      newSize -= 3;
      if (textLength > 0) {
        // It may work to use the same buffer and not do the copyBytes
        textBytes = value.copyBytes();
        value.set(textBytes, 3, textLength);
      } else {
        value.clear();
      }
    }
    return newSize;
  }

  public boolean nextKeyValue() throws IOException {
	//if key is null, initialize it
    if (key == null) {
      key = new LongWritable();
    }
    //assign the current offset to the key (pos starts at 0)
    key.set(pos);
    //pos is later advanced by the number of bytes read, e.g. 5 for "hello"
    
    //initialize value
    if (value == null) {
      value = new Text();
    }
    //newSize counts the bytes of the current line
    int newSize = 0;
    // We always read one extra line, which lies outside the upper
    // split limit i.e. (end - 1)
    while (getFilePosition() <= end || in.needAdditionalRecordAfterSplit()) {
    //assign newSize
      if (pos == 0) {
        newSize = skipUtfByteOrderMark();
      } else {
        newSize = in.readLine(value, maxLineLength, maxBytesToConsume(pos));
        //advance pos
        pos += newSize;
      }

      if ((newSize == 0) || (newSize < maxLineLength)) {
        break;
      }

      // line too long. try again
      LOG.info("Skipped line of size " + newSize + " at pos " + 
               (pos - newSize));
    }
    //the whole split has been read
    if (newSize == 0) {
    //null out key/value as a hint to the garbage collector
      key = null;
      value = null;
      return false;
    } else {
      return true;
    }
  }

  //simply returns the field:
  //the starting offset of the current line
  @Override
  public LongWritable getCurrentKey() {
    return key;
  }

  //the content of the current line
  @Override
  public Text getCurrentValue() {
    return value;
  }

  /**
   * Get the progress within the split
   */
  public float getProgress() throws IOException {
    if (start == end) {
      return 0.0f;
    } else {
      return Math.min(1.0f, (getFilePosition() - start) / (float)(end - start));
    }
  }
  
  public synchronized void close() throws IOException {
    try {
      if (in != null) {
        in.close();
      }
    } finally {
      if (decompressor != null) {
        CodecPool.returnDecompressor(decompressor);
        decompressor = null;
      }
    }
  }
}

2.10 Source Code Summary

Walking through the whole flow, the thread to hold on to is: keep asking which step really implements getCurrentKey(), getCurrentValue(), and nextKeyValue().
In the end they are implemented by the LineRecordReader returned from TextInputFormat.createRecordReader().
Writing a custom input format follows exactly the same pattern.

public void run(Context context) throws IOException, InterruptedException {
    setup(context);
    try {
      while (context.nextKeyValue()) {
        map(context.getCurrentKey(), context.getCurrentValue(), context);
      }
    } finally {
      cleanup(context);
    }
  }

The default input chain:

  • FileInputFormat
    • TextInputFormat
      • RecordReader
        • LineRecordReader

3. Custom Input

Now let's implement merging a large number of small files into one big file.
Steps for a custom input format:

  • create a class that extends FileInputFormat
    and override createRecordReader()
  • create the real file reader by extending RecordReader
    and override getCurrentKey(), getCurrentValue(), nextKeyValue()
  • register the custom input class in the job:
    job.setInputFormatClass(MyFileInputFormat.class);

Requirement:
merge three small files, 1.txt, 2.txt and 3.txt,
into one large file.

MyFileInputFormat.java

/**
 * The generic types are the key/value types this input format produces,
 * i.e. the input to the mapper.
 *
 * We read one whole file at a time:
 *   the file content goes into a Text value,
 *   and the key can simply be NullWritable.
 */
public class MyFileInputFormat extends FileInputFormat<NullWritable, Text> {
    /**
     * Create the file reader.
     *
     * The input paths come from FileInputFormat.addInputPath(job, ...), stored in job.xml.
     *
     * Parameters: InputSplit split, TaskAttemptContext context
     */
    public RecordReader<NullWritable, Text> createRecordReader(InputSplit split, TaskAttemptContext context) throws IOException, InterruptedException {
        MyRecordReader mr = new MyRecordReader();
        //pass in the split and context (the framework will also call initialize() on the returned reader)
        mr.initialize(split, context);
        return mr;
    }
}
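
One refinement worth noting (it is not in the original code above, so treat it as a hedged suggestion): FileInputFormat may split a large file across several map tasks, which would break the one-file-per-map() assumption. Overriding isSplitable() to return false guarantees every file lands in exactly one split:

    //Hypothetical addition to MyFileInputFormat: never split an input file,
    //so each file is always read as a single record by MyRecordReader.
    @Override
    protected boolean isSplitable(JobContext context, Path filename) {
        return false;
    }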

MyRecordReader.java

/**
 * The file reader: this is where the actual reading happens.
 * It opens an HDFS input stream:
 *     FileSystem fs
 *     fs.open(path)
 * Note:
 *   when reading, the framework first calls nextKeyValue() to check whether there is
 *   anything left to read, and only then calls getCurrentKey() / getCurrentValue().
 */
public class MyRecordReader extends RecordReader<NullWritable, Text> {
    FileSystem fs;
    int length;
    FSDataInputStream fsDataInputStream;
    Text value = new Text();
    //flag marking whether the file has been read: default false; true = already read, false = not yet read
    boolean isReader;


    /**
     * Initialization: create the HDFS input stream.
     * @param split the input split
     * @param context the task context
     * @throws IOException
     * @throws InterruptedException
     */
    public void initialize(InputSplit split, TaskAttemptContext context) throws IOException, InterruptedException {
        //initialize the FileSystem object; context.getConfiguration() returns the job configuration
        fs = FileSystem.get(context.getConfiguration());
        //get the file path
        FileSplit fsplit = (FileSplit) split;
        Path path = fsplit.getPath();
        //get the actual file length
        length = (int) fsplit.getLength();
        //open the stream
        fsDataInputStream = fs.open(path);
    }

    /**
     * Decide whether the current split still has content to read.
     * @return true if there is a key/value pair to process, false once the file has been consumed
     * @throws IOException
     * @throws InterruptedException
     */
    public boolean nextKeyValue() throws IOException, InterruptedException {
        if(!isReader){//there is still content to read
            //read the whole file
            byte[] buf = new byte[length];
            fsDataInputStream.readFully(buf,0,length);
            //put the content into value
            value.set(buf);
            //mark the file as read: we consume the entire file in one call, so set the flag to true
            isReader = true;
            return true;
        }else {
            return false;
        }
    }

    public NullWritable getCurrentKey() throws IOException, InterruptedException {
        return NullWritable.get();
    }

    public Text getCurrentValue() throws IOException, InterruptedException {
        return this.value;
    }

    public float getProgress() throws IOException, InterruptedException {
        //the file is either fully read or not read at all
        return isReader?1.0f:0.0f;
    }

    /**
     * Close the streams.
     * @throws IOException
     */
    public void close() throws IOException {
        if(fsDataInputStream!=null){
            fsDataInputStream.close();
        }
        if(fs!=null){
            fs.close();
        }
    }
}
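
Because the reader buffers an entire file into one byte[], this pattern only suits genuinely small files, and the cast to int in initialize() silently truncates anything over 2 GB. A hedged refinement (hypothetical, not part of the original class) would guard against that:

        //Hypothetical guard inside initialize(): refuse files that cannot fit in a single byte[].
        long rawLength = fsplit.getLength();
        if (rawLength > Integer.MAX_VALUE) {
            throw new IOException("File too large to merge as a single record: " + path);
        }
        length = (int) rawLength;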

MergeFiles.java

/**
 * The default map() method is called once per line.
 * With our custom input format, map() is called once per file and receives the whole
 * file content, which is sent straight to the reduce side.
 * The reduce side then merges all the files.
 */
public class MergeFiles {


    static class MergeFilesMapper extends Mapper<NullWritable, Text, Text, NullWritable> {
        //called once for each file that is read
        @Override
        protected void map(NullWritable key, Text value, Context context) throws IOException, InterruptedException {
            //send the file content straight to the reduce side
            context.write(value, NullWritable.get());
        }
    }

    static class MergeFilesReducer extends Reducer<Text, NullWritable, Text, NullWritable> {
        @Override
        protected void reduce(Text key, Iterable<NullWritable> values, Context context) throws IOException, InterruptedException {
            for (NullWritable v : values) {
                context.write(key,NullWritable.get());
            }
        }
    }

    public static void main(String[] args) throws IOException, ClassNotFoundException, InterruptedException {
        System.setProperty("HADOOP_USER_NAME","hdp01");
        Configuration conf = new Configuration();
        conf.set("mapperduce.framework.name","local");
        conf.set("fs.defaultFS","hdfs://10.211.55.20:9000");

        Job job = Job.getInstance(conf);

        job.setJarByClass(MergeFiles.class);
        job.setMapperClass(MergeFilesMapper.class);
        job.setReducerClass(MergeFilesReducer.class);

        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(NullWritable.class);

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(NullWritable.class);

        //register the custom input format class
        job.setInputFormatClass(MyFileInputFormat.class);

        FileInputFormat.addInputPath(job,new Path("/tmpin/invetedIndex"));

        FileSystem fs= FileSystem.get(conf);
        Path outPath = new Path("/tmpout/mergeFiles");
        if(fs.exists(outPath)){//delete the output path if it already exists
            fs.delete(outPath,true);
        }
        FileOutputFormat.setOutputPath(job,outPath);

        job.waitForCompletion(true);

    }
}

Input files:

[hdp01@hdp01 tmpfiles]$ cat 1.txt 
A friend in need is a friend indeed
Good is good but better carries it
[hdp01@hdp01 tmpfiles]$ cat 2.txt 
A good name is better than riches
Time is a bird for ever on the wing
Adversity is a good disciple
[hdp01@hdp01 tmpfiles]$ cat 3.txt 
Doubt is the key to knowledge

Output file:

[hdp01@hdp01 tmpfiles]$ hdfs dfs -cat /tmpout/mergeFiles/part-r-00000
A friend in need is a friend indeed
Good is good but better carries it

A good name is better than riches
Time is a bird for ever on the wing
Adversity is a good disciple

Doubt is the key to knowledge

PS: the default number of reduce tasks is 1, so there is only one output file.
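
If you want to make that explicit rather than rely on the default, a single hedged driver line does it:

        //Hypothetical driver line: pin the job to one reducer so exactly one output file is produced.
        job.setNumReduceTasks(1);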
