In Hadoop, a single job can run several mappers in sequence to pre-process the data before the reduce, and the reduce output can in turn be post-processed by several more mappers run in sequence. A job of this kind does not persist the intermediate results between stages, which greatly reduces I/O.
For example, a single job can execute the chain MAP1 -> MAP2 -> REDUCE -> MAP3 -> MAP4. In this chained structure, MAP2 and REDUCE form the core of the MapReduce job (they play the same role as the map and reduce of an ordinary job), and partitioning and shuffling are applied only at that point. MAP1 therefore acts as pre-processing, while MAP3 and MAP4 act as post-processing. The driver below wires this chain together with ChainMapper and ChainReducer:
Configuration conf = getConf();
JobConf job = new JobConf(conf);
job.setJobName("ChainJob");
job.setInputFormat(TextInputFormat.class);
job.setOutputFormat(TextOutputFormat.class);
FileInputFormat.setInputPaths(job, in);
FileOutputFormat.setOutputPath(job, out);

// add Map1 to the job (pre-processing)
JobConf map1Conf = new JobConf(false);
ChainMapper.addMapper(job,
                      Map1.class,
                      LongWritable.class,
                      Text.class,
                      Text.class,
                      Text.class,
                      true,
                      map1Conf);

// add Map2 (class BMap) to the job
JobConf map2Conf = new JobConf(false);
ChainMapper.addMapper(job,
                      BMap.class,
                      Text.class,
                      Text.class,
                      LongWritable.class,
                      Text.class,
                      true,
                      map2Conf);

// add the reducer to the job
JobConf reduceConf = new JobConf(false);
ChainReducer.setReducer(job,
                        Reduce.class,
                        LongWritable.class,
                        Text.class,
                        Text.class,
                        Text.class,
                        true,
                        reduceConf);

// add Map3 to the job (post-processing)
JobConf map3Conf = new JobConf(false);
ChainReducer.addMapper(job,
                       Map3.class,
                       Text.class,
                       Text.class,
                       LongWritable.class,
                       Text.class,
                       true,
                       map3Conf);

// add Map4 to the job
JobConf map4Conf = new JobConf(false);
ChainReducer.addMapper(job,
                       Map4.class,
                       LongWritable.class,
                       Text.class,
                       LongWritable.class,
                       Text.class,
                       true,
                       map4Conf);

JobClient.runJob(job);
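The Map1 through Map4, BMap, and Reduce classes are not shown above. A minimal sketch of what Map1 might look like under the old mapred API, consuming the (LongWritable, Text) file input and emitting the (Text, Text) pairs declared in the driver; the split-on-tab logic is an assumption for illustration only:

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class Map1 extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, Text> {

    @Override
    public void map(LongWritable key, Text value,
                    OutputCollector<Text, Text> output, Reporter reporter)
            throws IOException {
        // Hypothetical pre-processing: split each line "field1<TAB>rest"
        // into a (Text, Text) pair for the next mapper in the chain.
        String[] parts = value.toString().split("\t", 2);
        if (parts.length == 2) {
            output.collect(new Text(parts[0]), new Text(parts[1]));
        }
    }
}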
Note: each stage's output feeds the next stage's input, so the output key/value types of one stage must match the input key/value types of the stage that follows.
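Concretely, the types in the driver above line up like this, with each stage's output pair matching the next stage's input pair:

Map1   : (LongWritable, Text) -> (Text, Text)
BMap   : (Text, Text)         -> (LongWritable, Text)
Reduce : (LongWritable, Text) -> (Text, Text)
Map3   : (Text, Text)         -> (LongWritable, Text)
Map4   : (LongWritable, Text) -> (LongWritable, Text)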
***************************************************
Parameters of addMapper
public static <K1,V1,K2,V2> void addMapper(JobConf job,
                  Class<? extends Mapper<K1,V1,K2,V2>> klass,
                  Class<? extends K1> inputKeyClass,
                  Class<? extends V1> inputValueClass,
                  Class<? extends K2> outputKeyClass,
                  Class<? extends V2> outputValueClass,
                  boolean byValue,
                  JobConf mapperConf)

Here byValue controls whether key/value pairs are handed to the next stage by value (a serialized copy) or by reference. Passing by reference saves serialization cost, but it is only safe when the upstream stage does not reuse or modify the objects after emitting them. mapperConf is a configuration local to this one mapper; as in the driver above, it is recommended to create it with new JobConf(false) so that it carries no default values.
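Because each chained mapper gets its own mapperConf, it can receive settings that the other stages never see. A hedged sketch of this, assuming a hypothetical property name map1.field.separator that Map1 would read back in its configure() method:

JobConf map1Conf = new JobConf(false);        // no default values
map1Conf.set("map1.field.separator", "\t");   // hypothetical key, local to Map1
ChainMapper.addMapper(job, Map1.class,
                      LongWritable.class, Text.class,
                      Text.class, Text.class,
                      true, map1Conf);

Inside Map1, the value would then be read in the usual old-API way:

@Override
public void configure(JobConf conf) {
    // falls back to "\t" if the property was not set
    separator = conf.get("map1.field.separator", "\t");
}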