Hadoop Source Code: How a Map/Reduce Application Runs

Original

大唐9527

2020-06-19 13:09

1. Applying Map/Reduce involves the following steps:

1) Organize the data to be processed into Key-Value pairs and write them to files;

2) Map these Key-Value pairs to a different set of Key-Value pairs; the mapping logic (the algorithm) is encapsulated in a Mapper that implements the Mapper interface;

public interface Mapper extends JobConfigurable, Closeable {

  void map(WritableComparable key, Writable value,
           OutputCollector output, Reporter reporter)
    throws IOException;
}

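3) Turn the Key-Value pairs produced by the map function in the previous step into the final result; this logic (the algorithm) is encapsulated in a Reducer that implements the Reducer interface;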

public interface Reducer extends JobConfigurable, Closeable {
  void reduce(WritableComparable key, Iterator values,
              OutputCollector output, Reporter reporter)
    throws IOException;
}

4) Create a JobConf object; as the name suggests, a JobConf is the object that defines the job to be processed;

JobConf genJob = new JobConf(conf);

5) Populate the JobConf with the input data path, the Mapper and Reducer defined above, and so on (see the comments in the code below);

genJob.setInputDir(randomIns);                          // path of the files holding the input data
genJob.setInputKeyClass(IntWritable.class);             // type of the input keys
genJob.setInputValueClass(IntWritable.class);           // type of the input values
genJob.setInputFormat(SequenceFileInputFormat.class);   // file format of the input data
genJob.setMapperClass(RandomGenMapper.class);           // the Mapper to use

genJob.setOutputDir(randomOuts);                        // path of the files for the final output
genJob.setOutputKeyClass(IntWritable.class);            // type of the output keys
genJob.setOutputValueClass(IntWritable.class);          // type of the output values
genJob.setOutputFormat(TextOutputFormat.class);         // file format of the final output
genJob.setReducerClass(RandomGenReducer.class);         // the Reducer to use
genJob.setNumReduceTasks(1);                            // run a single reduce task

6) Call the static method JobClient.runJob(), passing in the JobConf object above, and wait for the job to finish;

JobClient.runJob(genJob);
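Steps 1) to 3) above can be illustrated without Hadoop. The following is only a minimal in-memory sketch of the map, group-by-key, and reduce flow; the class MiniMapReduce and its methods are hypothetical stand-ins for illustration and are not part of the Hadoop API. It uses a word count as the example job.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class MiniMapReduce {

    // Hypothetical stand-in for Mapper.map(): emits (word, 1) for every word in a line.
    static void map(String line, List<Map.Entry<String, Integer>> output) {
        for (String word : line.trim().split("\\s+")) {
            if (!word.isEmpty()) {
                output.add(Map.entry(word, 1));
            }
        }
    }

    // Hypothetical stand-in for Reducer.reduce(): sums all values collected for one key.
    static int reduce(String key, List<Integer> values) {
        int sum = 0;
        for (int v : values) {
            sum += v;
        }
        return sum;
    }

    // Runs map over every input line, groups the mapped pairs by key, then reduces each group.
    static Map<String, Integer> run(List<String> lines) {
        List<Map.Entry<String, Integer>> mapped = new ArrayList<>();
        for (String line : lines) {
            map(line, mapped);
        }
        // Group values by key; TreeMap keeps the keys sorted.
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> e : mapped) {
            grouped.computeIfAbsent(e.getKey(), k -> new ArrayList<>()).add(e.getValue());
        }
        Map<String, Integer> result = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> e : grouped.entrySet()) {
            result.put(e.getKey(), reduce(e.getKey(), e.getValue()));
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(run(List.of("a b a", "b c"))); // prints {a=2, b=2, c=1}
    }
}
```

In a real job the grouping step is done by the framework between the map and reduce phases; here it is a plain TreeMap so that the data flow stays visible.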

2. JobClient.runJob() is implemented as follows:

1) Construct a JobClient;

JobClient jc = new JobClient(job);

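The constructor initializes jobSubmitClient based on the mapred.job.tracker setting: either a local LocalJobRunner, or a proxy for the JobTracker of a remote Map/Reduce cluster obtained through RPC.getProxy;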

      this.conf = conf;
      String tracker = conf.get("mapred.job.tracker", "local");
      if ("local".equals(tracker)) {
        this.jobSubmitClient = new LocalJobRunner(conf);		// run the job with the local LocalJobRunner
      } else {
        this.jobSubmitClient = (JobSubmissionProtocol) 
          RPC.getProxy(JobSubmissionProtocol.class,
                       JobTracker.getAddress(conf), conf);
      }

2) Submit the job;

running = jc.submitJob(job);

3) Loop until the job completes, polling once per second and logging the progress whenever it changes;

        while (!running.isComplete()) {
          try {
            Thread.sleep(1000);
          } catch (InterruptedException e) {}
          running = jc.getJob(jobId);
          String report = null;
          report = " map "+Math.round(running.mapProgress()*100)+"%  reduce " + Math.round(running.reduceProgress()*100)+"%";
          if (!report.equals(lastReport)) {
            LOG.info(report);
            lastReport = report;
          }
        }
        if (!running.isSuccessful()) {
          throw new IOException("Job failed!");
        }
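The three steps of runJob() above can be sketched end to end without a cluster. This is a rough illustration under stated assumptions, not the Hadoop implementation: the interfaces Job and SubmitClient and the class FakeJob are hypothetical stand-ins, only the "mapred.job.tracker" key and its "local" default come from the code above. Unlike the original loop, the sketch stops polling when the thread is interrupted and throws an unchecked exception on failure.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Properties;

public class RunJobSketch {

    /** Hypothetical stand-in for the running-job handle returned by submitJob(). */
    interface Job {
        boolean isComplete();
        boolean isSuccessful();
        float mapProgress();
        float reduceProgress();
    }

    /** Hypothetical stand-in for JobSubmissionProtocol. */
    interface SubmitClient {
        String describe();
        Job submit();
    }

    /** Fake job: completes after four polls; map progress goes 0%, 50%, 50%, 100%. */
    static final class FakeJob implements Job {
        private int polls = 0;
        public boolean isComplete() { return polls++ >= 4; }
        public boolean isSuccessful() { return true; }
        public float mapProgress() { return Math.min(1.0f, (polls / 2) * 0.5f); }
        public float reduceProgress() { return 0.0f; }
    }

    // Step 1: pick a local or a remote client from the configuration.
    static SubmitClient choose(Properties conf) {
        final String tracker = conf.getProperty("mapred.job.tracker", "local");
        final boolean local = "local".equals(tracker);
        return new SubmitClient() {
            public String describe() { return local ? "local" : "remote:" + tracker; }
            public Job submit() { return new FakeJob(); } // both variants run the fake job here
        };
    }

    // Step 3: poll until completion and record a report only when it changes.
    static List<String> waitFor(Job job, long pollMillis) {
        List<String> reports = new ArrayList<>();
        String lastReport = null;
        while (!job.isComplete()) {
            try {
                Thread.sleep(pollMillis);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // assumption: give up instead of ignoring
                break;
            }
            String report = " map " + Math.round(job.mapProgress() * 100)
                + "%  reduce " + Math.round(job.reduceProgress() * 100) + "%";
            if (!report.equals(lastReport)) {
                reports.add(report);
                lastReport = report;
            }
        }
        if (!job.isSuccessful()) {
            throw new IllegalStateException("Job failed!");
        }
        return reports;
    }

    public static void main(String[] args) {
        SubmitClient client = choose(new Properties());    // no tracker set, so "local"
        System.out.println("client: " + client.describe());
        for (String report : waitFor(client.submit(), 10)) { // step 2, then step 3
            System.out.println(report);
        }
    }
}
```

Comparing each report with the previous one is what keeps the log short: four polls of the fake job yield only three lines, because the repeated 50% report is suppressed.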

3. jc.submitJob(job) is implemented as follows:

1) Write the JobConf defined earlier to a file (job.xml) in the file system, so that it is available when the job runs on the cluster;

File submitJobDir = new File(job.getSystemDir(), "submit_" + Integer.toString(Math.abs(r.nextInt()), 36));
File submitJobFile = new File(submitJobDir, "job.xml");
// Write job file to JobTracker's fs        
FSDataOutputStream out = fileSys.create(submitJobFile);
try {
  job.write(out);
} finally {
  out.close();
}

2) Then actually submit the job; from here on, the Map/Reduce cluster does the rest;

// Now, actually submit the job (using the submit name)
//
JobStatus status = jobSubmitClient.submitJob(submitJobFile.getPath());
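Step 1) above, deriving a random base-36 directory name and writing the job definition into it, can be tried on the local file system. The following is a hedged sketch only: the class JobFileWriter is hypothetical, java.util.Properties stands in for JobConf, and java.nio stands in for the Hadoop file system. It also masks the sign bit instead of calling Math.abs, because Math.abs(Integer.MIN_VALUE) is still negative; this is a deliberate deviation from the original code and produces different names for negative inputs.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Properties;
import java.util.Random;

public class JobFileWriter {

    // Builds "submit_<base-36 number>"; masking the sign bit keeps the value non-negative.
    static String submitDirName(int randomValue) {
        return "submit_" + Integer.toString(randomValue & Integer.MAX_VALUE, 36);
    }

    // Creates <systemDir>/submit_xxx/job.xml and writes the job properties into it as XML.
    static Path writeJobFile(Path systemDir, Properties job, Random r) {
        Path submitDir = systemDir.resolve(submitDirName(r.nextInt()));
        Path jobFile = submitDir.resolve("job.xml");
        try {
            Files.createDirectories(submitDir);
            // try-with-resources closes the stream, like the finally block in the original
            try (OutputStream out = Files.newOutputStream(jobFile)) {
                job.storeToXML(out, "job definition");
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return jobFile;
    }

    public static void main(String[] args) {
        Properties job = new Properties();
        job.setProperty("mapred.reduce.tasks", "1"); // illustrative key, an assumption
        Path tmp = Paths.get(System.getProperty("java.io.tmpdir"));
        System.out.println(writeJobFile(tmp, job, new Random()));
    }
}
```

The random directory name gives each submission its own location, so two clients submitting at the same time are unlikely to overwrite each other's job.xml.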

That is the whole process of applying Map/Reduce from a client; on the client side it is fairly simple.
