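How a MapReduce Program Reads Its Input Files: A Detailed Walkthrough

Hadoop's InputFormat, together with its subclasses and the RecordReader they create, is a key part of how a MapTask reads its input data.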

I. Computing the splits (the number of mappers)

1. JobClient's submitJobInternal generates the splits and obtains the number of mappers

Java code

public RunningJob submitJobInternal {
    return ugi.doAs(new PrivilegedExceptionAction<RunningJob>() {
....
int maps = writeSplits(context, submitJobDir); // generate the splits and get the number of mappers
....
}}

JobClient's writeSplits method

Java code

private int writeSplits(org.apache.hadoop.mapreduce.JobContext job,
      Path jobSubmitDir) throws IOException,
      InterruptedException, ClassNotFoundException {
    JobConf jConf = (JobConf)job.getConfiguration();
    int maps;
    if (jConf.getUseNewMapper()) {
      maps = writeNewSplits(job, jobSubmitDir); // jobs written against the new API take this path
    } else {
      maps = writeOldSplits(jConf, jobSubmitDir);
    }
    return maps;
  }

2. writeNewSplits (the new-API path) instantiates the InputFormat class by reflection, calls its getSplits method to obtain the splits, sorts them, and returns the number of mappers

Java code

private <T extends InputSplit>
  int writeNewSplits(JobContext job, Path jobSubmitDir) throws IOException,
      InterruptedException, ClassNotFoundException {
    Configuration conf = job.getConfiguration();
    InputFormat<?, ?> input =
      ReflectionUtils.newInstance(job.getInputFormatClass(), conf); // instantiate the configured InputFormat by reflection
    List<InputSplit> splits = input.getSplits(job); // call getSplits as implemented by the InputFormat subclass
    // Convert the list to an array, which is what Arrays.sort and createSplitFiles
    // below consume. It looks needlessly convoluted for such a simple step, and
    // I don't see why it was written this way.
    T[] array = (T[]) splits.toArray(new InputSplit[splits.size()]);
    // sort the splits into order based on size, so that the biggest
    // go first
    Arrays.sort(array, new SplitComparator()); // sort the splits
    JobSplitWriter.createSplitFiles(jobSubmitDir, conf,
        jobSubmitDir.getFileSystem(conf), array);
    return array.length; // number of mappers
  }
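The largest-first ordering used above can be sketched in plain Java without any Hadoop dependency. Everything here is an assumption made for illustration: `SketchSplit` is a hypothetical stand-in for InputSplit that only carries a name and a length, and `LARGEST_FIRST` merely imitates the effect of SplitComparator rather than reproducing its source. The ordering presumably exists so that the longest-running map tasks can be scheduled early.

```java
import java.util.Arrays;
import java.util.Comparator;

// Hypothetical stand-in for InputSplit: only the length matters for ordering.
class SketchSplit {
    final String name;
    final long length;

    SketchSplit(String name, long length) {
        this.name = name;
        this.length = length;
    }
}

class SplitSortSketch {
    // Orders splits by length, largest first (imitates the effect of SplitComparator).
    static final Comparator<SketchSplit> LARGEST_FIRST =
        (a, b) -> Long.compare(b.length, a.length);

    // Returns the split names in the order the sort would produce.
    static String[] sortedNames(SketchSplit[] splits) {
        SketchSplit[] copy = splits.clone(); // leave the caller's array untouched
        Arrays.sort(copy, LARGEST_FIRST);
        String[] names = new String[copy.length];
        for (int i = 0; i < copy.length; i++) {
            names[i] = copy[i].name;
        }
        return names;
    }

    public static void main(String[] args) {
        SketchSplit[] splits = {
            new SketchSplit("a", 44L),
            new SketchSplit("b", 128L),
            new SketchSplit("c", 64L)
        };
        System.out.println(String.join(",", sortedNames(splits))); // prints "b,c,a"
    }
}
```

The length of the sorted array is what writeNewSplits returns as the mapper count, so three splits here would mean three map tasks.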

3. Here is the getSplits method of FileInputFormat, the most commonly used InputFormat

Java code

public List<InputSplit> getSplits(JobContext job
                                    ) throws IOException {
    long minSize = Math.max(getFormatMinSplitSize(), getMinSplitSize(job));
    long maxSize = getMaxSplitSize(job);
    // generate splits
    List<InputSplit> splits = new ArrayList<InputSplit>();
    List<FileStatus> files = listStatus(job);
    for (FileStatus file: files) {
      Path path = file.getPath();
      FileSystem fs = path.getFileSystem(job.getConfiguration());
      long length = file.getLen();
      BlockLocation[] blkLocations = fs.getFileBlockLocations(file, 0, length);
      if ((length != 0) && isSplitable(job, path)) {
        long blockSize = file.getBlockSize();
        long splitSize = computeSplitSize(blockSize, minSize, maxSize); // split size: max(minSize, min(maxSize, blockSize))
        long bytesRemaining = length;
        while (((double) bytesRemaining)/splitSize > SPLIT_SLOP) { // carve a large file into splitSize pieces
          int blkIndex = getBlockIndex(blkLocations, length-bytesRemaining);
          splits.add(new FileSplit(path, length-bytesRemaining, splitSize,
                                   blkLocations[blkIndex].getHosts()));
          bytesRemaining -= splitSize;
        }
        if (bytesRemaining != 0) {
          splits.add(new FileSplit(path, length-bytesRemaining, bytesRemaining,
                     blkLocations[blkLocations.length-1].getHosts()));
        }
      } else if (length != 0) {
        splits.add(new FileSplit(path, 0, length, blkLocations[0].getHosts()));
      } else {
        //Create empty hosts array for zero length files
        splits.add(new FileSplit(path, 0, length, new String[0]));
      }
    }
    // Save the number of input files in the job-conf
    job.getConfiguration().setLong(NUM_INPUT_FILES, files.size());
    LOG.debug("Total # of splits: " + splits.size());
    return splits;
  }
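The arithmetic in the while loop above is easier to follow with concrete numbers. The following is a minimal sketch that reproduces only the size calculation, with no Hadoop classes involved. `SplitSizeSketch` and `splitLengths` are hypothetical names introduced for illustration; the sketch assumes SPLIT_SLOP is 1.1 and that computeSplitSize is max(minSize, min(maxSize, blockSize)), and it ignores block locations and the separate zero-length branch entirely.

```java
import java.util.ArrayList;
import java.util.List;

class SplitSizeSketch {
    // Assumed value: a trailing piece up to 10% over splitSize stays in one split.
    static final double SPLIT_SLOP = 1.1;

    // Assumed formula: clamp the block size between minSize and maxSize.
    static long computeSplitSize(long blockSize, long minSize, long maxSize) {
        return Math.max(minSize, Math.min(maxSize, blockSize));
    }

    // Returns the lengths of the splits produced for one non-empty, splittable file.
    static List<Long> splitLengths(long length, long splitSize) {
        List<Long> lengths = new ArrayList<>();
        long bytesRemaining = length;
        // Keep cutting full-size splits while the remainder exceeds 1.1 * splitSize.
        while (((double) bytesRemaining) / splitSize > SPLIT_SLOP) {
            lengths.add(splitSize);
            bytesRemaining -= splitSize;
        }
        // Whatever is left (at most 1.1 * splitSize) becomes the last split.
        if (bytesRemaining != 0) {
            lengths.add(bytesRemaining);
        }
        return lengths;
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        long splitSize = computeSplitSize(128 * mb, 1L, Long.MAX_VALUE);
        System.out.println(splitLengths(300 * mb, splitSize).size()); // 3 splits: 128, 128, 44 MB
        System.out.println(splitLengths(140 * mb, splitSize).size()); // 1 split: 140 MB is within the slop
    }
}
```

With a 128 MB block size and default limits, a 300 MB file yields three splits, while a 140 MB file stays in a single split because 140/128 is below the 1.1 threshold. That slop is why the number of mappers is not always the file size divided by the block size, rounded up.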

II. Reading key/value pairs

1. Instantiate the InputFormat and initialize the reader

In the runNewMapper method of the MapTask class, the InputFormat and the RecordReader are created and initialized, and then the mapper is run.

MapTask$NewTrackingRecordReader wraps a RecordReader and acts as a proxy for it.

Java code

 private <INKEY,INVALUE,OUTKEY,OUTVALUE>
  void runNewMapper {
    // instantiate the user-configured InputFormat
    org.apache.hadoop.mapreduce.InputFormat<INKEY,INVALUE> inputFormat =
      (org.apache.hadoop.mapreduce.InputFormat<INKEY,INVALUE>)
        ReflectionUtils.newInstance(taskContext.getInputFormatClass(), job);
.....
    // create the RecordReader (wrapped in the NewTrackingRecordReader proxy)
    org.apache.hadoop.mapreduce.RecordReader<INKEY,INVALUE> input =
      new NewTrackingRecordReader<INKEY,INVALUE>
          (split, inputFormat, reporter, job, taskContext);
.....
    // initialize the RecordReader
    input.initialize(split, mapperContext);
.....
    // run the mapper
    mapper.run(mapperContext);
   }

2. While the mapper runs, it asks the context for each key and value, and the context delegates to the reader. The proxy class MapTask$NewTrackingRecordReader sits in between, updating the input counters and reporting progress.

Mapper code:

Java code

public void run(Context context) throws IOException, InterruptedException {
   setup(context);
   while (context.nextKeyValue()) {
     map(context.getCurrentKey(), context.getCurrentValue(), context);
   }
   cleanup(context);
 }

MapContext code:

Java code

  @Override
  public boolean nextKeyValue() throws IOException, InterruptedException {
    return reader.nextKeyValue();
  }

  @Override
  public KEYIN getCurrentKey() throws IOException, InterruptedException {
    return reader.getCurrentKey();
  }

  @Override
  public VALUEIN getCurrentValue() throws IOException, InterruptedException {
    return reader.getCurrentValue();
  }
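The three methods above form the whole pull-style contract between the mapper loop and the reader. As a rough illustration, here is a toy reader that serves lines from memory together with a loop shaped like Mapper.run. All names are hypothetical (`SketchReader`, `LineListReader`, `RunLoopSketch`); the real RecordReader is an abstract class with additional methods such as initialize, getProgress and close, and the offset arithmetic below assumes one byte per character and a one-byte line terminator.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical, simplified version of the RecordReader read contract.
interface SketchReader<K, V> {
    boolean nextKeyValue(); // advance; false when the input is exhausted
    K getCurrentKey();
    V getCurrentValue();
}

// Toy reader: key = offset of the line, value = the line itself.
class LineListReader implements SketchReader<Long, String> {
    private final List<String> lines;
    private int index = -1;      // index of the current line
    private long offset = 0;     // offset of the current line
    private long nextOffset = 0; // offset where the next line starts

    LineListReader(List<String> lines) {
        this.lines = lines;
    }

    public boolean nextKeyValue() {
        if (index + 1 >= lines.size()) {
            return false;
        }
        index++;
        offset = nextOffset;
        nextOffset += lines.get(index).length() + 1; // +1 for the assumed '\n'
        return true;
    }

    public Long getCurrentKey() {
        return offset;
    }

    public String getCurrentValue() {
        return lines.get(index);
    }
}

class RunLoopSketch {
    // Same shape as Mapper.run: advance, then fetch the current key and value.
    static List<String> run(SketchReader<Long, String> reader) {
        List<String> out = new ArrayList<>();
        while (reader.nextKeyValue()) {
            out.add(reader.getCurrentKey() + ":" + reader.getCurrentValue());
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(run(new LineListReader(Arrays.asList("ab", "cde")))); // prints "[0:ab, 3:cde]"
    }
}
```

The key point is that the reader holds the "current" record as state: nextKeyValue moves the cursor, and the two getters only return what the cursor points at.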

Code of MapTask$NewTrackingRecordReader:

Java code

   @Override
   public boolean nextKeyValue() throws IOException, InterruptedException {
     boolean result = false;
     try {
       long bytesInPrev = getInputBytes(fsStats);
       result = real.nextKeyValue(); // the wrapped RecordReader does the actual reading
       long bytesInCurr = getInputBytes(fsStats);
       if (result) {
         inputRecordCounter.increment(1); // count the record just read
         fileInputByteCounter.increment(bytesInCurr - bytesInPrev); // count the bytes just read
       }
       reporter.setProgress(getProgress()); // sets the reporter's progress flag to true so the progress gets pushed
     } catch (IOException ioe) {
       if (inputSplit instanceof FileSplit) {
         FileSplit fileSplit = (FileSplit) inputSplit;
         LOG.error("IO error in map input file "
             + fileSplit.getPath().toString());
         throw new IOException("IO error in map input file "
             + fileSplit.getPath().toString(), ioe);
       }
       throw ioe;
     }
     return result;
   }
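The proxy idea used by NewTrackingRecordReader can be reduced to a few lines: wrap the real reader, forward every call, and do the bookkeeping around the forwarded call. The sketch below is not the Hadoop implementation; `SimpleReader`, `IteratorReader` and `CountingReader` are hypothetical names, and a plain long field stands in for the Hadoop counters and the progress reporter.

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

// Hypothetical, minimal reader contract used only for this sketch.
interface SimpleReader {
    boolean nextKeyValue();
    String getCurrentValue();
}

// The "real" reader: walks over an in-memory list.
class IteratorReader implements SimpleReader {
    private final Iterator<String> it;
    private String current;

    IteratorReader(List<String> values) {
        this.it = values.iterator();
    }

    public boolean nextKeyValue() {
        if (!it.hasNext()) {
            return false;
        }
        current = it.next();
        return true;
    }

    public String getCurrentValue() {
        return current;
    }
}

// The proxy: delegates reading and counts successful reads on the side.
class CountingReader implements SimpleReader {
    private final SimpleReader real;
    private long records = 0; // stands in for the input record counter

    CountingReader(SimpleReader real) {
        this.real = real;
    }

    public boolean nextKeyValue() {
        boolean result = real.nextKeyValue(); // the wrapped reader does the work
        if (result) {
            records++; // only count when a record was actually produced
        }
        return result;
    }

    public String getCurrentValue() {
        return real.getCurrentValue();
    }

    long getRecordCount() {
        return records;
    }
}

class CountingReaderSketch {
    // Drains the reader the way the mapper loop would and returns the count.
    static long countAll(List<String> values) {
        CountingReader reader = new CountingReader(new IteratorReader(values));
        while (reader.nextKeyValue()) {
            reader.getCurrentValue();
        }
        return reader.getRecordCount();
    }

    public static void main(String[] args) {
        System.out.println(countAll(Arrays.asList("a", "b", "c"))); // prints "3"
    }
}
```

Because the caller only sees the shared interface, the bookkeeping stays invisible to the mapper, which is the same reason the mapper code above never touches a counter when reading input.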

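3. When mapper.run finishes, control returns to MapTask, which closes the reader

Java code

mapper.run(mapperContext);
input.close(); // close the RecordReader
output.close(mapperContext);

The two stages do not run in the same thread, or even in the same process: after generating the splits, the submitting client moves on to monitoring the job, while the records are read inside each MapTask.

Together, the steps above exercise every method of the abstract InputFormat class: getSplits on the client side and createRecordReader inside the map task.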
