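What happens if a MapReduce job is running and the JobTracker (JT) suddenly goes down? The JobTracker only handles job scheduling, bookkeeping, and monitoring, while the tasks themselves execute on the TaskTrackers, so it is entirely possible to restart the JT without losing the tasks that were already running. To make this work, the JT has to persist the job execution state to files and read those files back on restart (the article "MapReduce Job Files" already summarizes the kinds of backup files involved).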
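To enable Restart Recovery, set mapreduce.jobtracker.restart.recover to true (the default is false). Note that the source excerpt in step 1 reads the older key name, mapred.jobtracker.restart.recover, so use whichever name matches your Hadoop version. On restart, the JT performs the following steps in order to recover the job state:
1. List the system directory, check each entry, and add the jobs that need recovery to the set jobsToRecover.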
    FileStatus[] systemDirData = fs.listStatus(this.systemDir);
    // Check if the history is enabled .. as we cant have persistence with
    // history disabled
    if (conf.getBoolean("mapred.jobtracker.restart.recover", false)
        && systemDirData != null) {
      for (FileStatus status : systemDirData) {
        try {
          recoveryManager.checkAndAddJob(status);
        } catch (Throwable t) {
          LOG.warn("Failed to add the job " + status.getPath().getName(),
                   t);
        }
      }
    }
2. Iterate over the set jobsToRecover. For each JobId to be recovered, rebuild the JobInProgress object, add it to the job scheduling queue, and assemble the history file name from the JobId, userId, jobName, and related information. If no history file is found, the job is removed from the recovery list.
    String logFileName =
      JobHistory.JobInfo.getJobHistoryFileName(job.getJobConf(), id);
    if (logFileName != null) {
      Path jobHistoryFilePath =
        JobHistory.JobInfo.getJobHistoryLogLocation(logFileName);
      JobHistory.JobInfo.recoverJobHistoryFile(job.getJobConf(), jobHistoryFilePath);
      jobHistoryFilenameMap.put(job.getJobID(), jobHistoryFilePath);
    } else {
      LOG.info("No history file found for job " + id);
      idIter.remove(); // remove from recovery list
    }
    addJob(id, job);
3. Read and parse the history file from the history directory to restore the job's runtime data. A parse error is logged and ignored, so recovery continues with the remaining jobs.
    JobRecoveryListener listener = new JobRecoveryListener(pJob);
    try {
      JobHistory.parseHistoryFromFS(jobHistoryFilePath.toString(),
                                    listener, fs);
    } catch (Throwable t) {
      LOG.info("Error reading history file of job " + pJob.getJobID()
               + ". Ignoring the error and continuing.", t);
    }
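A history file consists of records like the following, each made up of a record type followed by KEY="value" attributes:
    Job JOBID="job_201310190623_0001" LAUNCH_TIME="1382181102677" TOTAL_MAPS="100" TOTAL_REDUCES="1" JOB_STATUS="PREP" .
    Task TASKID="task_201310190623_0001_m_000101" TASK_TYPE="SETUP" START_TIME="1382181104661" SPLITS="" .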
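To make the record layout concrete, here is a minimal sketch of splitting one such line into its record type and attributes. HistoryRecordParser is a hypothetical helper written for this illustration, not a Hadoop class, and it assumes values contain no escaped quotes, which holds for the two sample lines but is a simplification of what a real history parser has to handle.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper (not part of Hadoop): splits one history record line. */
final class HistoryRecordParser {
    // Matches KEY="value"; assumes the value contains no escaped quotes.
    private static final Pattern ATTR = Pattern.compile("(\\w+)=\"([^\"]*)\"");

    /** Returns the record type, i.e. the first token on the line ("Job", "Task", ...). */
    static String recordType(String line) {
        String trimmed = line.trim();
        int space = trimmed.indexOf(' ');
        return space < 0 ? trimmed : trimmed.substring(0, space);
    }

    /** Returns the attributes in the order they appear on the line. */
    static Map<String, String> attributes(String line) {
        Map<String, String> attrs = new LinkedHashMap<>();
        Matcher m = ATTR.matcher(line);
        while (m.find()) {
            attrs.put(m.group(1), m.group(2));
        }
        return attrs;
    }

    public static void main(String[] args) {
        String line = "Job JOBID=\"job_201310190623_0001\" LAUNCH_TIME=\"1382181102677\" "
            + "TOTAL_MAPS=\"100\" TOTAL_REDUCES=\"1\" JOB_STATUS=\"PREP\" .";
        System.out.println(recordType(line) + " " + attributes(line));
    }
}
```

The attribute map preserves insertion order so that printed output follows the order in the file, which makes it easier to compare against the raw record by eye.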
The attribute names that can appear in a record are defined by the Keys enum in JobHistory:
    public static enum Keys {
      JOBTRACKERID,
      START_TIME, FINISH_TIME, JOBID, JOBNAME, USER, JOBCONF, SUBMIT_TIME,
      LAUNCH_TIME, TOTAL_MAPS, TOTAL_REDUCES, FAILED_MAPS, FAILED_REDUCES,
      FINISHED_MAPS, FINISHED_REDUCES, JOB_STATUS, TASKID, HOSTNAME, TASK_TYPE,
      ERROR, TASK_ATTEMPT_ID, TASK_STATUS, COPY_PHASE, SORT_PHASE, REDUCE_PHASE,
      SHUFFLE_FINISHED, SORT_FINISHED, COUNTERS, SPLITS, JOB_PRIORITY, HTTP_PORT,
      TRACKER_NAME, STATE_STRING, VERSION, MAP_COUNTERS, REDUCE_COUNTERS,
      VIEW_JOB, MODIFY_JOB, JOB_QUEUE, FAIL_REASON
    }
Using the data read back from the history file, the JT can return to the state it was in before the restart.
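The replay idea behind this recovery can be sketched as folding parsed records, one by one, into an in-memory job state. The class below is a hypothetical model written for illustration: RecoveredJob and its fields are assumptions, not Hadoop's JobInProgress or JobRecoveryListener, and it only handles the "Job" and "Task" record types with the attributes shown in the sample records.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical model of the state rebuilt for one job; not a Hadoop class. */
final class RecoveredJob {
    String jobId;
    String status = "UNKNOWN";
    int totalMaps;
    int totalReduces;
    // Maps task id to task type (for example SETUP, MAP, REDUCE).
    final Map<String, String> taskTypes = new HashMap<>();

    /** Applies one parsed record; values from later records overwrite earlier ones. */
    void apply(String recordType, Map<String, String> attrs) {
        if ("Job".equals(recordType)) {
            jobId = attrs.getOrDefault("JOBID", jobId);
            status = attrs.getOrDefault("JOB_STATUS", status);
            if (attrs.containsKey("TOTAL_MAPS")) {
                totalMaps = Integer.parseInt(attrs.get("TOTAL_MAPS"));
            }
            if (attrs.containsKey("TOTAL_REDUCES")) {
                totalReduces = Integer.parseInt(attrs.get("TOTAL_REDUCES"));
            }
        } else if ("Task".equals(recordType)) {
            String taskId = attrs.get("TASKID");
            String taskType = attrs.get("TASK_TYPE");
            if (taskId != null && taskType != null) {
                taskTypes.put(taskId, taskType);
            }
        }
        // Any other record type is ignored in this sketch.
    }

    public static void main(String[] args) {
        RecoveredJob job = new RecoveredJob();

        Map<String, String> jobRecord = new HashMap<>();
        jobRecord.put("JOBID", "job_201310190623_0001");
        jobRecord.put("TOTAL_MAPS", "100");
        jobRecord.put("TOTAL_REDUCES", "1");
        jobRecord.put("JOB_STATUS", "PREP");
        job.apply("Job", jobRecord);

        Map<String, String> taskRecord = new HashMap<>();
        taskRecord.put("TASKID", "task_201310190623_0001_m_000101");
        taskRecord.put("TASK_TYPE", "SETUP");
        job.apply("Task", taskRecord);

        System.out.println(job.jobId + " " + job.status + " maps=" + job.totalMaps
            + " reduces=" + job.totalReduces + " tasks=" + job.taskTypes);
    }
}
```

Because every record simply overwrites the fields it mentions, replaying the file from the beginning always ends in the last recorded state, which is the property a restart relies on.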