Hadoop's InputFormat, together with the RecordReader it creates, is the key component a MapTask uses to read its input data.
Part 1: Generating the splits and determining the number of mappers
1. JobClient's submitJobInternal generates the splits and obtains the number of mappers
- public RunningJob submitJobInternal(...) {
- return ugi.doAs(new PrivilegedExceptionAction<RunningJob>() {
- ....
- int maps = writeSplits(context, submitJobDir); // generate the splits and get the mapper count
- ....
- }}
- private int writeSplits(org.apache.hadoop.mapreduce.JobContext job,
- Path jobSubmitDir) throws IOException,
- InterruptedException, ClassNotFoundException {
- JobConf jConf = (JobConf)job.getConfiguration();
- int maps;
- if (jConf.getUseNewMapper()) {
- maps = writeNewSplits(job, jobSubmitDir); // the new API takes this path
- } else {
- maps = writeOldSplits(jConf, jobSubmitDir);
- }
- return maps;
- }
- private <T extends InputSplit>
- int writeNewSplits(JobContext job, Path jobSubmitDir) throws IOException,
- InterruptedException, ClassNotFoundException {
- Configuration conf = job.getConfiguration();
- InputFormat<?, ?> input =
- ReflectionUtils.newInstance(job.getInputFormatClass(), conf); // instantiate the job's InputFormat via reflection
- List<InputSplit> splits = input.getSplits(job); // call the getSplits method implemented by the InputFormat subclass
- T[] array = (T[]) splits.toArray(new InputSplit[splits.size()]); // convert the list to an array (an oddly roundabout way to do something simple)
- // sort the splits into order based on size, so that the biggest
- // go first
- Arrays.sort(array, new SplitComparator()); // sort the splits, largest first
- JobSplitWriter.createSplitFiles(jobSubmitDir, conf,
- jobSubmitDir.getFileSystem(conf), array);
- return array.length; // the number of mappers
- }
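The descending sort above can be illustrated without Hadoop types. In the sketch below, the FakeSplit record and sortBiggestFirst helper are hypothetical stand-ins for InputSplit and the private SplitComparator:

```java
import java.util.Arrays;
import java.util.Comparator;

public class SortSplitsDemo {
    // Toy stand-in for an InputSplit; only the length matters here.
    record FakeSplit(String path, long length) {}

    // Mirrors what writeNewSplits' SplitComparator achieves: order the
    // splits descending by length, "so that the biggest go first".
    static FakeSplit[] sortBiggestFirst(FakeSplit[] splits) {
        FakeSplit[] sorted = splits.clone();
        Arrays.sort(sorted, Comparator.comparingLong(FakeSplit::length).reversed());
        return sorted;
    }

    public static void main(String[] args) {
        FakeSplit[] sorted = sortBiggestFirst(new FakeSplit[] {
            new FakeSplit("part-0", 64L),
            new FakeSplit("part-1", 300L),
            new FakeSplit("part-2", 128L),
        });
        System.out.println(sorted[0].path()); // part-1, the largest split
    }
}
```

Scheduling the largest splits first helps the longest map tasks start early instead of straggling at the end of the job.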
2. The getSplits method of FileInputFormat, the most commonly used implementation
- public List<InputSplit> getSplits(JobContext job
- ) throws IOException {
- long minSize = Math.max(getFormatMinSplitSize(), getMinSplitSize(job));
- long maxSize = getMaxSplitSize(job);
- // generate splits
- List<InputSplit> splits = new ArrayList<InputSplit>();
- List<FileStatus> files = listStatus(job);
- for (FileStatus file: files) {
- Path path = file.getPath();
- FileSystem fs = path.getFileSystem(job.getConfiguration());
- long length = file.getLen();
- BlockLocation[] blkLocations = fs.getFileBlockLocations(file, 0, length);
- if ((length != 0) && isSplitable(job, path)) {
- long blockSize = file.getBlockSize();
- long splitSize = computeSplitSize(blockSize, minSize, maxSize); // split size for this file: max(minSize, min(maxSize, blockSize))
- long bytesRemaining = length;
- while (((double) bytesRemaining)/splitSize > SPLIT_SLOP) { // break up large files; SPLIT_SLOP lets the last split run slightly over
- int blkIndex = getBlockIndex(blkLocations, length-bytesRemaining);
- splits.add(new FileSplit(path, length-bytesRemaining, splitSize,
- blkLocations[blkIndex].getHosts()));
- bytesRemaining -= splitSize;
- }
- if (bytesRemaining != 0) {
- splits.add(new FileSplit(path, length-bytesRemaining, bytesRemaining,
- blkLocations[blkLocations.length-1].getHosts()));
- }
- } else if (length != 0) {
- splits.add(new FileSplit(path, 0, length, blkLocations[0].getHosts()));
- } else {
- //Create empty hosts array for zero length files
- splits.add(new FileSplit(path, 0, length, new String[0]));
- }
- }
- // Save the number of input files in the job-conf
- job.getConfiguration().setLong(NUM_INPUT_FILES, files.size());
- LOG.debug("Total # of splits: " + splits.size());
- return splits;
- }
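The split math above can be sketched in isolation: computeSplitSize returns max(minSize, min(maxSize, blockSize)), and SPLIT_SLOP (1.1) stops a small tail from becoming its own split. The SplitMath class below is a simplified stand-in for the real FileInputFormat logic, not Hadoop code:

```java
public class SplitMath {
    static final double SPLIT_SLOP = 1.1; // same constant as FileInputFormat

    // max(minSize, min(maxSize, blockSize)), as in FileInputFormat
    static long computeSplitSize(long blockSize, long minSize, long maxSize) {
        return Math.max(minSize, Math.min(maxSize, blockSize));
    }

    // Reproduces the loop in getSplits: keep carving off splitSize-byte
    // chunks while more than SPLIT_SLOP * splitSize remains, then fold the
    // remainder into one final split.
    static int countSplits(long fileLength, long splitSize) {
        int splits = 0;
        long bytesRemaining = fileLength;
        while (((double) bytesRemaining) / splitSize > SPLIT_SLOP) {
            splits++;
            bytesRemaining -= splitSize;
        }
        if (bytesRemaining != 0) {
            splits++; // last split, possibly up to 1.1x the split size
        }
        return splits;
    }

    public static void main(String[] args) {
        long block = 128L << 20; // a 128 MB block size
        long splitSize = computeSplitSize(block, 1L, Long.MAX_VALUE);
        System.out.println(countSplits(300L << 20, splitSize)); // 3
        System.out.println(countSplits(135L << 20, splitSize)); // 1 (within the slop)
    }
}
```

Note the slop effect: a 135 MB file yields a single 135 MB split rather than a 128 MB split plus a tiny 7 MB one, because 135/128 is below 1.1.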
Part 2: Reading the key/value pairs
1. Instantiating the InputFormat and initializing the reader
In MapTask's runNewMapper method, the InputFormat and RecordReader are created and initialized, and then the mapper is run.
MapTask$NewTrackingRecordReader wraps a RecordReader and acts as its proxy.
- private <INKEY,INVALUE,OUTKEY,OUTVALUE>
- void runNewMapper(...) {
- // instantiate the user-configured InputFormat
- org.apache.hadoop.mapreduce.InputFormat<INKEY,INVALUE> inputFormat =
- (org.apache.hadoop.mapreduce.InputFormat<INKEY,INVALUE>)
- ReflectionUtils.newInstance(taskContext.getInputFormatClass(), job);
- .....
- // create the custom RecordReader (wrapped in NewTrackingRecordReader)
- org.apache.hadoop.mapreduce.RecordReader<INKEY,INVALUE> input =
- new NewTrackingRecordReader<INKEY,INVALUE>
- (split, inputFormat, reporter, job, taskContext);
- .....
- // initialize the RecordReader
- input.initialize(split, mapperContext);
- .....
- // run the mapper
- mapper.run(mapperContext);
- }
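ReflectionUtils.newInstance essentially looks up the configured class and invokes its no-arg constructor (it also wires in the Configuration, omitted here). A minimal sketch of the same idea, using a hypothetical helper rather than Hadoop's actual implementation:

```java
public class ReflectDemo {
    // Simplified analogue of ReflectionUtils.newInstance: construct an
    // instance via the class's no-argument constructor.
    static <T> T newInstance(Class<T> cls) {
        try {
            return cls.getDeclaredConstructor().newInstance();
        } catch (ReflectiveOperationException e) {
            throw new RuntimeException(e); // Hadoop wraps reflection failures similarly
        }
    }

    public static void main(String[] args) {
        // Any class with a public no-arg constructor works, e.g. StringBuilder.
        StringBuilder sb = newInstance(StringBuilder.class);
        System.out.println(sb.length()); // 0, freshly constructed
    }
}
```

This is why a custom InputFormat must have a no-arg constructor: the framework only knows the class name from the job configuration.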
2. While the mapper runs, the context asks the reader for each key and value; the proxy class MapTask$NewTrackingRecordReader increments the read counters and pushes progress updates.
Mapper code:
- public void run(Context context) throws IOException, InterruptedException {
- setup(context);
- while (context.nextKeyValue()) {
- map(context.getCurrentKey(), context.getCurrentValue(), context);
- }
- cleanup(context);
- }
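Mapper.run above is a template method: setup once, map per record, cleanup once. A stripped-down analogue with no Hadoop types (MiniMapper is purely illustrative, with an Iterator standing in for the context):

```java
import java.util.Iterator;
import java.util.List;

public class MiniMapper {
    StringBuilder out = new StringBuilder(); // collects output for inspection

    void setup() { out.append("["); }            // runs once, before any record
    void map(String record) { out.append(record).append(";"); } // runs per record
    void cleanup() { out.append("]"); }          // runs once, after the last record

    // Same shape as Mapper.run: setup, loop over records, cleanup.
    void run(Iterator<String> context) {
        setup();
        while (context.hasNext()) {   // mirrors context.nextKeyValue()
            map(context.next());      // mirrors map(key, value, context)
        }
        cleanup();
    }

    public static void main(String[] args) {
        MiniMapper m = new MiniMapper();
        m.run(List.of("a", "b").iterator());
        System.out.println(m.out); // [a;b;]
    }
}
```

Overriding run itself (rather than just map) is how mappers such as MultithreadedMapper change the iteration strategy while keeping the same contract.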
MapContext code:
- @Override
- public boolean nextKeyValue() throws IOException, InterruptedException {
- return reader.nextKeyValue();
- }
- @Override
- public KEYIN getCurrentKey() throws IOException, InterruptedException {
- return reader.getCurrentKey();
- }
- @Override
- public VALUEIN getCurrentValue() throws IOException, InterruptedException {
- return reader.getCurrentValue();
- }
MapTask$NewTrackingRecordReader code:
- @Override
- public boolean nextKeyValue() throws IOException, InterruptedException {
- boolean result = false;
- try {
- long bytesInPrev = getInputBytes(fsStats);
- result = real.nextKeyValue(); // the wrapped RecordReader actually reads the data
- long bytesInCurr = getInputBytes(fsStats);
- if (result) {
- inputRecordCounter.increment(1); // count the record just read
- fileInputByteCounter.increment(bytesInCurr - bytesInPrev); // count the bytes read
- }
- reporter.setProgress(getProgress()); // set the reporter's progress flag so the update gets pushed
- } catch (IOException ioe) {
- if (inputSplit instanceof FileSplit) {
- FileSplit fileSplit = (FileSplit) inputSplit;
- LOG.error("IO error in map input file "
- + fileSplit.getPath().toString());
- throw new IOException("IO error in map input file "
- + fileSplit.getPath().toString(), ioe);
- }
- throw ioe;
- }
- return result;
- }
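NewTrackingRecordReader is thus a decorator: it forwards nextKeyValue() to the real reader and updates counters around the call. The TrackingReader sketch below imitates that pattern, with an Iterator standing in for the real RecordReader and a plain long for the Hadoop counter:

```java
import java.util.Iterator;
import java.util.List;

public class TrackingReader {
    private final Iterator<String> real; // stand-in for the wrapped RecordReader
    long records = 0;                    // stand-in for inputRecordCounter

    TrackingReader(Iterator<String> real) { this.real = real; }

    // Forward to the real reader; on success, update the counter,
    // just as NewTrackingRecordReader.nextKeyValue does.
    boolean nextKeyValue() {
        boolean result = real.hasNext();
        if (result) {
            real.next();
            records++; // increment the record counter around the delegated call
        }
        return result;
    }

    public static void main(String[] args) {
        TrackingReader r = new TrackingReader(List.of("k1", "k2", "k3").iterator());
        while (r.nextKeyValue()) { }
        System.out.println(r.records); // 3
    }
}
```

The benefit of the proxy is that any user-supplied RecordReader gets byte and record accounting for free, without the user's code knowing about counters at all.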
3. After the mapper method finishes, control returns to MapTask, which closes the reader
- mapper.run(mapperContext);
- input.close(); // close the RecordReader
- output.close(mapperContext);
Note that the two phases above do not run in the same process: the splits are generated on the client during submission, after which the client moves on to monitoring the job, while the reading happens later inside each MapTask.
Taken together, these phases exercise all the abstract methods of InputFormat: getSplits on the client, and createRecordReader (along with the RecordReader methods) in the map task.