1. The process of using Map/Reduce from a client application is as follows:
1) Organize the data to be processed into key-value pairs and write them to a file;
2) Map these key-value pairs into another set of key-value pairs; the mapping logic (the algorithm) is encapsulated in a class that implements the Mapper interface;
public interface Mapper extends JobConfigurable, Closeable {
  void map(WritableComparable key, Writable value,
           OutputCollector output, Reporter reporter)
      throws IOException;
}
3) The key-value pairs produced by the map function in the previous step are then transformed into the final result; this logic (the algorithm) is encapsulated in a class that implements the Reducer interface;
public interface Reducer extends JobConfigurable, Closeable {
  void reduce(WritableComparable key, Iterator values,
              OutputCollector output, Reporter reporter)
      throws IOException;
}
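Together, the two interfaces describe a map → group-by-key → reduce data flow: map emits intermediate key-value pairs, the framework groups all values sharing a key, and reduce collapses each group into a final value. A minimal, Hadoop-free sketch of that flow (the MiniMapReduce class, its Mapper2/Reducer2 shapes, and the word-count logic are all illustrative stand-ins, not Hadoop API):

```java
import java.util.*;

// In-memory analogue of the map -> group-by-key -> reduce pipeline.
// Mapper2/Reducer2 mirror the shape of Hadoop's interfaces with plain Java types.
public class MiniMapReduce {
    interface Mapper2  { void map(String key, String value, Map<String, List<Integer>> output); }
    interface Reducer2 { int reduce(String key, List<Integer> values); }

    public static Map<String, Integer> run(List<String> lines, Mapper2 m, Reducer2 r) {
        // "map" phase: each input record may emit several intermediate key-value pairs,
        // which are grouped by key as they are collected
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (String line : lines)
            m.map(null, line, grouped);
        // "reduce" phase: all values for one key are handed to the reducer together
        Map<String, Integer> result = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> e : grouped.entrySet())
            result.put(e.getKey(), r.reduce(e.getKey(), e.getValue()));
        return result;
    }

    public static void main(String[] args) {
        // Classic word count: map emits (word, 1), reduce sums the ones
        Mapper2 wordMapper = (k, v, out) -> {
            for (String w : v.split("\\s+"))
                out.computeIfAbsent(w, x -> new ArrayList<>()).add(1);
        };
        Reducer2 sumReducer = (k, vs) -> vs.stream().mapToInt(Integer::intValue).sum();
        System.out.println(run(Arrays.asList("a b a", "b a"), wordMapper, sumReducer)); // {a=3, b=2}
    }
}
```

The key point the sketch makes concrete: the reducer never sees individual pairs, only a key plus the complete collection of its values, which is exactly why Hadoop's Reducer takes an Iterator of values.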
4) Create a JobConf object; as the name suggests, a JobConf is the definition of the job to be run;
JobConf genJob = new JobConf(conf);
5) Fill in the JobConf: the input file path, the Mapper and Reducer defined above, and so on (see the comments in the code below);
genJob.setInputDir(randomIns);                        // path of the input data files
genJob.setInputKeyClass(IntWritable.class);           // key type of the input data
genJob.setInputValueClass(IntWritable.class);         // value type of the input data
genJob.setInputFormat(SequenceFileInputFormat.class); // file format of the input data
genJob.setMapperClass(RandomGenMapper.class);         // the Mapper
genJob.setOutputDir(randomOuts);                      // path of the final output files
genJob.setOutputKeyClass(IntWritable.class);          // key type of the final output
genJob.setOutputValueClass(IntWritable.class);        // value type of the final output
genJob.setOutputFormat(TextOutputFormat.class);       // file format of the final output
genJob.setReducerClass(RandomGenReducer.class);       // the Reducer
genJob.setNumReduceTasks(1);                          // number of reduce tasks
6) Call JobClient's static runJob() method with the JobConf object above, then wait for the job to finish;
JobClient.runJob(genJob);
2. The implementation of JobClient.runJob() is as follows:
1) Construct a JobClient;
JobClient jc = new JobClient(job);
In the constructor, jobSubmitClient is initialized: it is either a local LocalJobRunner, or a proxy for the JobTracker of a remote Map/Reduce cluster, obtained via RPC.getProxy();
this.conf = conf;
String tracker = conf.get("mapred.job.tracker", "local");
if ("local".equals(tracker)) {
  this.jobSubmitClient = new LocalJobRunner(conf); // run the job in-process with LocalJobRunner
} else {
  this.jobSubmitClient = (JobSubmissionProtocol)
      RPC.getProxy(JobSubmissionProtocol.class,
                   JobTracker.getAddress(conf), conf);
}
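The constructor dispatches on a single configuration key: "local" (the default) selects an in-process implementation, anything else a remote proxy, and the rest of JobClient works against the common interface either way. The same pattern in a self-contained sketch (the JobSubmitter interface, both implementations, and the factory here are illustrative placeholders, not Hadoop classes):

```java
import java.util.Map;

public class SubmitterFactory {
    // Common interface, analogous to JobSubmissionProtocol
    interface JobSubmitter { String submit(String jobFile); }

    // In-process runner, analogous to LocalJobRunner
    static class LocalRunner implements JobSubmitter {
        public String submit(String jobFile) { return "local:" + jobFile; }
    }

    // Stand-in for the RPC proxy to a remote JobTracker
    static class RemoteProxy implements JobSubmitter {
        private final String tracker;
        RemoteProxy(String tracker) { this.tracker = tracker; }
        public String submit(String jobFile) { return "rpc:" + tracker + ":" + jobFile; }
    }

    // Mirrors JobClient's constructor logic: the key defaults to "local"
    static JobSubmitter create(Map<String, String> conf) {
        String tracker = conf.getOrDefault("mapred.job.tracker", "local");
        return "local".equals(tracker) ? new LocalRunner() : new RemoteProxy(tracker);
    }

    public static void main(String[] args) {
        System.out.println(create(Map.of()).submit("job.xml"));                                  // local:job.xml
        System.out.println(create(Map.of("mapred.job.tracker", "host:9001")).submit("job.xml")); // rpc:host:9001:job.xml
    }
}
```

Because both branches hide behind one interface, the submit/poll code that follows never needs to know whether the job runs in-process or on a cluster.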
2) Submit the job;
running = jc.submitJob(job);
3) Loop until the job completes, reporting progress once per second;
while (!running.isComplete()) {
  try {
    Thread.sleep(1000);
  } catch (InterruptedException e) {}
  running = jc.getJob(jobId);
  String report = " map " + Math.round(running.mapProgress() * 100) +
                  "% reduce " + Math.round(running.reduceProgress() * 100) + "%";
  if (!report.equals(lastReport)) {
    LOG.info(report);
    lastReport = report;
  }
}
if (!running.isSuccessful()) {
  throw new IOException("Job failed!");
}
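Two details of the loop are worth noting: the status object is re-fetched on every iteration (polling, not push), and a progress line is emitted only when the formatted report actually changes, so the log is not flooded with identical lines. A self-contained sketch of that pattern (FakeJob is a hypothetical stand-in for the running-job handle; the 10 ms sleep replaces the real 1000 ms only to keep the demo fast):

```java
import java.util.*;

public class PollDemo {
    // Stand-in for the running-job handle: completes after three polls
    static class FakeJob {
        private int polls = 0;
        boolean isComplete()  { return polls >= 3; }
        float mapProgress()   { return Math.min(1f, polls / 3f); }
        float reduceProgress(){ return Math.min(1f, polls / 3f); }
        void tick()           { polls++; }
    }

    static List<String> waitForCompletion(FakeJob running) throws InterruptedException {
        List<String> reports = new ArrayList<>();
        String lastReport = null;
        while (!running.isComplete()) {
            Thread.sleep(10); // the real client sleeps 1000 ms between polls
            running.tick();   // in the real client, jc.getJob(jobId) refreshes the status
            String report = " map " + Math.round(running.mapProgress() * 100) +
                            "% reduce " + Math.round(running.reduceProgress() * 100) + "%";
            if (!report.equals(lastReport)) { // emit only when progress actually changed
                reports.add(report);
                lastReport = report;
            }
        }
        return reports;
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println(waitForCompletion(new FakeJob()));
    }
}
```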
3. The implementation of jc.submitJob(job) is as follows:
1) Write the JobConf defined earlier out to a file in the file system, so that it is available when the job runs in the cluster environment;
File submitJobDir = new File(job.getSystemDir(),
    "submit_" + Integer.toString(Math.abs(r.nextInt()), 36));
File submitJobFile = new File(submitJobDir, "job.xml");
// Write job file to JobTracker's fs
FSDataOutputStream out = fileSys.create(submitJobFile);
try {
  job.write(out);
} finally {
  out.close();
}
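The submit directory gets a collision-resistant name by rendering a random int in base 36. A self-contained sketch of the same naming scheme and file write using java.io only (SubmitDirDemo and writeJobFile are illustrative names, and the placeholder string stands in for the XML that job.write(out) would produce):

```java
import java.io.*;
import java.util.Random;

public class SubmitDirDemo {
    public static File writeJobFile(File systemDir, String jobXml) throws IOException {
        Random r = new Random();
        // Same naming scheme as the client: "submit_" + a random int rendered in base 36.
        // (Note Math.abs(Integer.MIN_VALUE) stays negative; the original code shares this quirk.)
        File submitJobDir = new File(systemDir,
                "submit_" + Integer.toString(Math.abs(r.nextInt()), 36));
        submitJobDir.mkdirs();
        File submitJobFile = new File(submitJobDir, "job.xml");
        try (Writer out = new FileWriter(submitJobFile)) {
            out.write(jobXml); // the real client calls job.write(out) here
        }
        return submitJobFile;
    }

    public static void main(String[] args) throws IOException {
        File f = writeJobFile(new File(System.getProperty("java.io.tmpdir")),
                              "<configuration/>");
        System.out.println(f.getName()); // job.xml
    }
}
```

Writing the configuration into the shared file system is what decouples the client from the cluster: from this point on the cluster has everything it needs to run the job without the client.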
2) Then actually submit the job; from here on, the Map/Reduce cluster takes over;
// Now, actually submit the job (using the submit name)
//
JobStatus status = jobSubmitClient.submitJob(submitJobFile.getPath());
This is the whole process of using Map/Reduce from a client; from the client's point of view it is fairly simple.