[Hadoop] JobTracker Restart Recovery

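What happens if the JobTracker suddenly goes down while MapReduce jobs are running? The JobTracker (JT) is only responsible for job scheduling, bookkeeping, and monitoring; the tasks themselves run on the TaskTrackers, so it is entirely possible to restart the JT without losing the tasks that were already running. What the JT has to do is back up job execution state to files and read those files at restart in order to recover (the article "MapReduce Job Files" already summarized the backup files involved).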

To enable Restart Recovery, set mapred.jobtracker.restart.recover to true (the default is false). On restart, the JT performs the following steps in order to restore job state:

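1. Scan the files in the system directory and put the ID of each job to be recovered into the set jobsToRecover. Because the JT creates a directory named after the jobId under the system directory for every running job, this scan yields all jobs that had not finished before the JT restarted.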
JobTracker.JobTracker()
        FileStatus[] systemDirData = fs.listStatus(this.systemDir);
        // Check if the history is enabled .. as we cant have persistence with 
        // history disabled
        if (conf.getBoolean("mapred.jobtracker.restart.recover", false) 
            && systemDirData != null) {
          for (FileStatus status : systemDirData) {
            try {
              recoveryManager.checkAndAddJob(status);
            } catch (Throwable t) {
              LOG.warn("Failed to add the job " + status.getPath().getName(), 
                       t);
            }
          }
          // ... (rest of the block omitted in this excerpt)
        }
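2. Iterate over the set jobsToRecover. For each JobId to be recovered, rebuild the JobInProgress object, add it to the job scheduling queue, and assemble the history filename from the JobId, userId, jobName, and related information.
JobTracker.main() --> JobTracker.offerService() --> RecoveryManager.recover()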
  String logFileName =
      JobHistory.JobInfo.getJobHistoryFileName(job.getJobConf(), id);
  if (logFileName != null) {
    Path jobHistoryFilePath =
        JobHistory.JobInfo.getJobHistoryLogLocation(logFileName);

    JobHistory.JobInfo.recoverJobHistoryFile(job.getJobConf(), jobHistoryFilePath);

    jobHistoryFilenameMap.put(job.getJobID(), jobHistoryFilePath);
  } else {
    LOG.info("No history file found for job " + id);
    idIter.remove(); // remove from recovery list
  }

  addJob(id, job);

3. Read and parse the history file from the history directory to restore the job's runtime data.
JobTracker.main() --> JobTracker.offerService() --> RecoveryManager.recover()
        JobRecoveryListener listener = new JobRecoveryListener(pJob);
        try {
          JobHistory.parseHistoryFromFS(jobHistoryFilePath.toString(), 
                                        listener, fs);
        } catch (Throwable t) {
          LOG.info("Error reading history file of job " + pJob.getJobID() 
                   + ". Ignoring the error and continuing.", t);
        }
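
To make steps 1 and 2 more concrete, here is a minimal, self-contained sketch that uses only the JDK and a local file system instead of Hadoop's FileSystem API. Everything in it is an assumption made for illustration: the class RecoveryScanSketch, its method names, the job ID pattern, and the rule that a history file name contains the job ID followed by an underscore. It is not the JobTracker implementation, only the same idea in miniature: collect job-ID-named directories, then drop any job that has no history file.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Iterator;
import java.util.Set;
import java.util.TreeSet;
import java.util.regex.Pattern;

// Hypothetical stand-in for the recovery bookkeeping; not Hadoop code.
class RecoveryScanSketch {
    // Assumed job ID shape, e.g. job_201310190623_0001.
    private static final Pattern JOB_ID = Pattern.compile("job_\\d+_\\d+");

    // Step 1: every sub-directory of the system directory that is named
    // after a job ID is treated as an unfinished job.
    static Set<String> scanSystemDir(Path systemDir) throws IOException {
        Set<String> jobsToRecover = new TreeSet<>();
        try (DirectoryStream<Path> entries = Files.newDirectoryStream(systemDir)) {
            for (Path entry : entries) {
                String name = entry.getFileName().toString();
                if (Files.isDirectory(entry) && JOB_ID.matcher(name).matches()) {
                    jobsToRecover.add(name);
                }
            }
        }
        return jobsToRecover;
    }

    // Assumption: a history file belongs to a job if its name contains "<jobId>_".
    static boolean hasHistoryFile(Path historyDir, String jobId) throws IOException {
        try (DirectoryStream<Path> files = Files.newDirectoryStream(historyDir)) {
            for (Path f : files) {
                if (f.getFileName().toString().contains(jobId + "_")) {
                    return true;
                }
            }
        }
        return false;
    }

    // Step 2 (simplified): a job without a history file is removed from the recovery set.
    static void dropJobsWithoutHistory(Set<String> jobsToRecover, Path historyDir)
            throws IOException {
        Iterator<String> idIter = jobsToRecover.iterator();
        while (idIter.hasNext()) {
            if (!hasHistoryFile(historyDir, idIter.next())) {
                idIter.remove(); // remove from recovery list
            }
        }
    }

    // Builds a throwaway directory layout and runs both steps on it.
    static Set<String> demo() {
        try {
            Path sys = Files.createTempDirectory("system");
            Path hist = Files.createTempDirectory("history");
            Files.createDirectory(sys.resolve("job_201310190623_0001"));
            Files.createDirectory(sys.resolve("job_201310190623_0002"));
            Files.createFile(sys.resolve("not_a_job.tmp"));
            // Made-up history file name; only job 0001 has one.
            Files.createFile(hist.resolve("host_1382181000000_job_201310190623_0001_alice_wordcount"));
            Set<String> jobs = scanSystemDir(sys);
            dropJobsWithoutHistory(jobs, hist);
            return jobs;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

In the demo, two job directories are found in step 1, and the one without a matching history file is dropped in step 2, mirroring the idIter.remove() branch in the excerpt above.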

The history file is stored as key-value pairs, for example:
Job JOBID="job_201310190623_0001" LAUNCH_TIME="1382181102677" TOTAL_MAPS="100" TOTAL_REDUCES="1" JOB_STATUS="PREP" .
Task TASKID="task_201310190623_0001_m_000101" TASK_TYPE="SETUP" START_TIME="1382181104661" SPLITS="" .
Job and Task are the recordType; the remaining key-value pairs hold the job-related information. The JT defines the following keys:
  public static enum Keys { 
    JOBTRACKERID,
    START_TIME, FINISH_TIME, JOBID, JOBNAME, USER, JOBCONF, SUBMIT_TIME, 
    LAUNCH_TIME, TOTAL_MAPS, TOTAL_REDUCES, FAILED_MAPS, FAILED_REDUCES, 
    FINISHED_MAPS, FINISHED_REDUCES, JOB_STATUS, TASKID, HOSTNAME, TASK_TYPE, 
    ERROR, TASK_ATTEMPT_ID, TASK_STATUS, COPY_PHASE, SORT_PHASE, REDUCE_PHASE, 
    SHUFFLE_FINISHED, SORT_FINISHED, COUNTERS, SPLITS, JOB_PRIORITY, HTTP_PORT, 
    TRACKER_NAME, STATE_STRING, VERSION, MAP_COUNTERS, REDUCE_COUNTERS,
    VIEW_JOB, MODIFY_JOB, JOB_QUEUE, FAIL_REASON
  }
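
The record format shown in the samples above (a recordType followed by KEY="value" pairs and a trailing dot) is simple enough to parse with a regular expression. The following is a rough sketch under stated assumptions, not Hadoop's own parser: the class name HistoryLineSketch and its fields are invented here, and the escaping rule (a backslash escapes the next character inside a value) is an assumption.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical parser for one history record; not the Hadoop JobHistory parser.
class HistoryLineSketch {
    // Matches KEY="value"; assumes a backslash escapes the following character.
    private static final Pattern PAIR =
        Pattern.compile("(\\w+)=\"((?:[^\"\\\\]|\\\\.)*)\"");

    final String recordType;          // e.g. "Job" or "Task"
    final Map<String, String> values; // keys in the order they appear

    HistoryLineSketch(String recordType, Map<String, String> values) {
        this.recordType = recordType;
        this.values = values;
    }

    static HistoryLineSketch parse(String line) {
        String trimmed = line.trim();
        int space = trimmed.indexOf(' ');
        if (space < 0) {
            throw new IllegalArgumentException("No record type in: " + line);
        }
        // The first token is the record type; the rest is KEY="value" pairs.
        String type = trimmed.substring(0, space);
        Map<String, String> values = new LinkedHashMap<>();
        Matcher m = PAIR.matcher(trimmed.substring(space + 1));
        while (m.find()) {
            values.put(m.group(1), m.group(2));
        }
        return new HistoryLineSketch(type, values);
    }

    public static void main(String[] args) {
        HistoryLineSketch rec = parse(
            "Job JOBID=\"job_201310190623_0001\" LAUNCH_TIME=\"1382181102677\" "
            + "TOTAL_MAPS=\"100\" TOTAL_REDUCES=\"1\" JOB_STATUS=\"PREP\" .");
        System.out.println(rec.recordType + " " + rec.values);
    }
}
```

A LinkedHashMap is used so that the keys keep the order in which they appear in the record, which makes the parsed result easy to compare with the raw line.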

From the data read out of the history file, the JT can restore the state it was in before the restart.
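
As an illustration of how replaying records can rebuild state, here is a small sketch in which a listener-style object receives each parsed record and updates an in-memory view of the job. It is only loosely modeled on the callback idea behind JobRecoveryListener. The class ReplaySketch, its fields, the kv helper, and the placeholder status "STARTED" are all assumptions made for this example, and the real listener restores far more state.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical state holder that is updated record by record; not Hadoop code.
class ReplaySketch {
    String jobId;
    String jobStatus = "UNKNOWN";
    int totalMaps;
    int totalReduces;
    final Map<String, String> taskStatus = new HashMap<>();

    // Called once per history record, in file order. Later records
    // overwrite earlier values, so the final state reflects the last update.
    void handle(String recordType, Map<String, String> values) {
        if ("Job".equals(recordType)) {
            if (values.containsKey("JOBID")) jobId = values.get("JOBID");
            if (values.containsKey("JOB_STATUS")) jobStatus = values.get("JOB_STATUS");
            if (values.containsKey("TOTAL_MAPS")) {
                totalMaps = Integer.parseInt(values.get("TOTAL_MAPS"));
            }
            if (values.containsKey("TOTAL_REDUCES")) {
                totalReduces = Integer.parseInt(values.get("TOTAL_REDUCES"));
            }
        } else if ("Task".equals(recordType)) {
            String taskId = values.get("TASKID");
            if (taskId == null) return;
            if (values.containsKey("TASK_STATUS")) {
                taskStatus.put(taskId, values.get("TASK_STATUS"));
            } else {
                // "STARTED" is a placeholder for a task seen without a status yet.
                taskStatus.putIfAbsent(taskId, "STARTED");
            }
        }
    }

    // Small helper to build a record's key-value map from alternating arguments.
    static Map<String, String> kv(String... pairs) {
        Map<String, String> m = new HashMap<>();
        for (int i = 0; i + 1 < pairs.length; i += 2) {
            m.put(pairs[i], pairs[i + 1]);
        }
        return m;
    }

    public static void main(String[] args) {
        ReplaySketch state = new ReplaySketch();
        state.handle("Job", kv("JOBID", "job_201310190623_0001",
            "TOTAL_MAPS", "100", "TOTAL_REDUCES", "1", "JOB_STATUS", "PREP"));
        state.handle("Task", kv("TASKID", "task_201310190623_0001_m_000101",
            "TASK_TYPE", "SETUP"));
        System.out.println(state.jobId + " " + state.jobStatus + " maps=" + state.totalMaps
            + " reduces=" + state.totalReduces + " tasks=" + state.taskStatus);
    }
}
```

Because every record is applied in file order, replaying the whole history file leaves the object in the last state that was written before the restart.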




