[Hadoop] JobTracker Restart Recovery

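What happens if the JobTracker (JT) suddenly goes down while MapReduce jobs are running? The JobTracker is only responsible for job scheduling, bookkeeping, and monitoring, while the tasks themselves execute on the TaskTrackers, so it is entirely possible to restart the JT without losing the work that was already in progress. What the JT has to do is persist the job execution state to files and read those files back on restart in order to recover (the article "MapReduce Job Files" summarizes the relevant files).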

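To enable Restart Recovery, set mapred.jobtracker.restart.recover to true (the default is false). This is the key read by the JobTracker code quoted in step 1; later Hadoop releases spell it mapreduce.jobtracker.restart.recover. On restart, the JT performs the following steps in order to recover the state of running jobs: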

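1. Scan the files in the system directory and put the IDs of the jobs to recover into the set jobsToRecover. The JT creates a directory named after the job ID under the system directory for every running job, so this scan yields all jobs that were unfinished before the restart.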
JobTracker.JobTracker()
        FileStatus[] systemDirData = fs.listStatus(this.systemDir);
        // Check if the history is enabled .. as we cant have persistence with 
        // history disabled
        if (conf.getBoolean("mapred.jobtracker.restart.recover", false) 
            && systemDirData != null) {
          for (FileStatus status : systemDirData) {
            try {
              recoveryManager.checkAndAddJob(status);
            } catch (Throwable t) {
              LOG.warn("Failed to add the job " + status.getPath().getName(), 
                       t);
            }
          }
        }
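2. Iterate over jobsToRecover. For each job ID to recover, rebuild the JobInProgress object, add it to the job scheduling queue, and assemble the history file name from the job ID, user ID, job name, and similar information.
JobTracker.main() --> JobTracker.offerService() --> RecoveryManager.recover()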
  String logFileName =
      JobHistory.JobInfo.getJobHistoryFileName(job.getJobConf(), id);
  if (logFileName != null) {
    Path jobHistoryFilePath =
        JobHistory.JobInfo.getJobHistoryLogLocation(logFileName);

    JobHistory.JobInfo.recoverJobHistoryFile(job.getJobConf(), jobHistoryFilePath);

    jobHistoryFilenameMap.put(job.getJobID(), jobHistoryFilePath);
  } else {
    LOG.info("No history file found for job " + id);
    idIter.remove(); // remove from recovery list
  }

  addJob(id, job);

3. Read and parse the history file from the history directory to restore the job's runtime data.
JobTracker.main() --> JobTracker.offerService() --> RecoveryManager.recover()
        JobRecoveryListener listener = new JobRecoveryListener(pJob);
        try {
          JobHistory.parseHistoryFromFS(jobHistoryFilePath.toString(), 
                                        listener, fs);
        } catch (Throwable t) {
          LOG.info("Error reading history file of job " + pJob.getJobID() 
                   + ". Ignoring the error and continuing.", t);
        }
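
The directory scan in step 1 can be pictured with a small stand-alone sketch. It is not the Hadoop implementation: it uses java.nio on a local directory in place of the Hadoop FileSystem API, and the names RecoveryScanSketch, findJobsToRecover, and demoSystemDir, as well as the job ID pattern (inferred from IDs such as job_201310190623_0001), are assumptions made for illustration. The real checkAndAddJob may apply further checks before accepting a job.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Set;
import java.util.TreeSet;
import java.util.regex.Pattern;

class RecoveryScanSketch {
    // Assumed shape of a job ID, e.g. job_201310190623_0001.
    private static final Pattern JOB_ID = Pattern.compile("job_\\d+_\\d+");

    /** Collects the names of sub-directories of systemDir that look like job IDs. */
    static Set<String> findJobsToRecover(Path systemDir, boolean recoverEnabled) {
        Set<String> jobsToRecover = new TreeSet<>();
        // Mirrors the guard in the excerpt: recovery must be enabled and the directory must exist.
        if (!recoverEnabled || !Files.isDirectory(systemDir)) {
            return jobsToRecover;
        }
        try (DirectoryStream<Path> entries = Files.newDirectoryStream(systemDir)) {
            for (Path entry : entries) {
                String name = entry.getFileName().toString();
                // Only directories named like a job ID are candidates for recovery.
                if (Files.isDirectory(entry) && JOB_ID.matcher(name).matches()) {
                    jobsToRecover.add(name);
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return jobsToRecover;
    }

    /** Builds a throwaway directory with two job directories and one unrelated file. */
    static Path demoSystemDir() {
        try {
            Path dir = Files.createTempDirectory("mapred-system");
            Files.createDirectory(dir.resolve("job_201310190623_0001"));
            Files.createDirectory(dir.resolve("job_201310190623_0002"));
            Files.createFile(dir.resolve("not-a-job.txt"));
            return dir;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        Path systemDir = demoSystemDir();
        System.out.println(findJobsToRecover(systemDir, true));
        System.out.println(findJobsToRecover(systemDir, false));
    }
}
```

With recovery disabled the method returns an empty set, which corresponds to the JT starting fresh and ignoring whatever is left in the system directory.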

History files store their data as key-value pairs, for example:
Job JOBID="job_201310190623_0001" LAUNCH_TIME="1382181102677" TOTAL_MAPS="100" TOTAL_REDUCES="1" JOB_STATUS="PREP" .
Task TASKID="task_201310190623_0001_m_000101" TASK_TYPE="SETUP" START_TIME="1382181104661" SPLITS="" .
Job and Task are the record types; the remaining key-value pairs carry information about that job or task. The JT defines the following keys:
  public static enum Keys { 
    JOBTRACKERID,
    START_TIME, FINISH_TIME, JOBID, JOBNAME, USER, JOBCONF, SUBMIT_TIME, 
    LAUNCH_TIME, TOTAL_MAPS, TOTAL_REDUCES, FAILED_MAPS, FAILED_REDUCES, 
    FINISHED_MAPS, FINISHED_REDUCES, JOB_STATUS, TASKID, HOSTNAME, TASK_TYPE, 
    ERROR, TASK_ATTEMPT_ID, TASK_STATUS, COPY_PHASE, SORT_PHASE, REDUCE_PHASE, 
    SHUFFLE_FINISHED, SORT_FINISHED, COUNTERS, SPLITS, JOB_PRIORITY, HTTP_PORT, 
    TRACKER_NAME, STATE_STRING, VERSION, MAP_COUNTERS, REDUCE_COUNTERS,
    VIEW_JOB, MODIFY_JOB, JOB_QUEUE, FAIL_REASON
  }
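
To make the line format concrete, here is a minimal parsing sketch based only on the two sample records shown earlier. The class HistoryLineSketch and its methods are hypothetical and are not part of JobHistory. It also assumes that values contain no embedded double quotes, so any escaping the real parser may support is not handled.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class HistoryLineSketch {
    // KEY="value" pairs; values are assumed to contain no double quotes.
    private static final Pattern KEY_VALUE = Pattern.compile("(\\w+)=\"([^\"]*)\"");

    /** Returns the first token of the line, e.g. "Job" or "Task". */
    static String recordType(String line) {
        String trimmed = line.trim();
        int space = trimmed.indexOf(' ');
        return space < 0 ? trimmed : trimmed.substring(0, space);
    }

    /** Returns the key-value pairs of the line in the order they appear. */
    static Map<String, String> keyValues(String line) {
        Map<String, String> values = new LinkedHashMap<>();
        Matcher matcher = KEY_VALUE.matcher(line);
        while (matcher.find()) {
            values.put(matcher.group(1), matcher.group(2));
        }
        return values;
    }

    public static void main(String[] args) {
        String line = "Job JOBID=\"job_201310190623_0001\" LAUNCH_TIME=\"1382181102677\" "
            + "TOTAL_MAPS=\"100\" TOTAL_REDUCES=\"1\" JOB_STATUS=\"PREP\" .";
        System.out.println(recordType(line));
        System.out.println(keyValues(line));
    }
}
```

The trailing " ." that ends each record is simply skipped, because the pattern only matches KEY="value" pairs.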

With the data read from the history file, the JT can return to the state it was in before the restart.
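
One way to picture this replay is to fold the parsed records of a job into a single view in file order, with later values overriding earlier ones. The sketch below assumes exactly that; ReplaySketch is a hypothetical name, the second record (JOB_STATUS="RUNNING") is made-up illustration data, and the actual JobRecoveryListener does considerably more, such as rebuilding task state.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class ReplaySketch {
    /** Folds already-parsed records of one job into one view; later values win. */
    static Map<String, String> replay(List<Map<String, String>> records) {
        Map<String, String> state = new LinkedHashMap<>();
        for (Map<String, String> record : records) {
            state.putAll(record);
        }
        return state;
    }

    public static void main(String[] args) {
        List<Map<String, String>> records = new ArrayList<>();

        // First record: taken from the sample Job line above.
        Map<String, String> launched = new LinkedHashMap<>();
        launched.put("JOBID", "job_201310190623_0001");
        launched.put("TOTAL_MAPS", "100");
        launched.put("JOB_STATUS", "PREP");
        records.add(launched);

        // Second record: hypothetical later update for the same job.
        Map<String, String> running = new LinkedHashMap<>();
        running.put("JOBID", "job_201310190623_0001");
        running.put("JOB_STATUS", "RUNNING");
        records.add(running);

        System.out.println(replay(records));
    }
}
```

After the replay, the view keeps TOTAL_MAPS from the first record and takes JOB_STATUS from the second.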




