To understand task scheduling, first look at how FairScheduler distributes resources among pools, and within each pool, among jobs. Resource allocation happens in the update() method, which a dedicated UpdateThread invokes every updateInterval (set by mapred.fairscheduler.update.interval, default 500 ms) to keep the allocation up to date.
The allocation algorithm itself lives in SchedulingAlgorithms.computeFairShares(), which uses a binary search to find the share value that brings the total allocated slots as close as possible to the actual slot count. The source of SchedulingAlgorithms.computeFairShares() is worth reading (it is a little hard to follow; stepping through it in a debugger helps).
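The binary-search idea can be illustrated with a simplified, hypothetical sketch (not the actual SchedulingAlgorithms code): search for a ratio R such that giving each schedulable min(demand, R * weight) uses up exactly the total slot count.

```java
import java.util.Arrays;

// Simplified sketch of the fair-share idea (hypothetical, not the real
// SchedulingAlgorithms code): binary-search a ratio R so that the sum of
// min(demand, R * weight) over all schedulables matches totalSlots.
public class FairShareSketch {
    static double slotsUsed(double r, double[] weights, double[] demands) {
        double total = 0;
        for (int i = 0; i < weights.length; i++) {
            // Each schedulable never receives more than it demands
            total += Math.min(demands[i], r * weights[i]);
        }
        return total;
    }

    static double[] computeFairShares(double[] weights, double[] demands,
                                      double totalSlots) {
        double totalDemand = Arrays.stream(demands).sum();
        double lo = 0, hi = 1;
        // Grow hi until it allocates enough (or all demand is satisfied)
        while (slotsUsed(hi, weights, demands) < totalSlots
               && slotsUsed(hi, weights, demands) < totalDemand) {
            hi *= 2;
        }
        // 25 bisection steps give ample precision
        for (int i = 0; i < 25; i++) {
            double mid = (lo + hi) / 2;
            if (slotsUsed(mid, weights, demands) < totalSlots) lo = mid;
            else hi = mid;
        }
        double[] shares = new double[weights.length];
        for (int i = 0; i < weights.length; i++) {
            shares[i] = Math.min(demands[i], hi * weights[i]);
        }
        return shares;
    }

    public static void main(String[] args) {
        // Two equal-weight pools sharing 8 slots; the second only demands 2,
        // so the first absorbs the surplus (shares converge to 6 and 2)
        double[] shares = computeFairShares(new double[]{1, 1},
                                            new double[]{10, 2}, 8);
        System.out.println(Arrays.toString(shares));
    }
}
```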
Now let's look at how FairScheduler picks a task from the many candidates, i.e. task scheduling.
1. FairScheduler.assignTasks(): this method is called when the JT receives a heartbeat from a TT; while building the response, it picks tasks suited to that TT's current state (see http://blog.csdn.net/vickyway/article/details/17127559 for background). It returns a set of tasks for the given TT to execute.
// Compute total runnable maps and reduces, and currently running ones
int runnableMaps = 0;
int runningMaps = 0;
int runnableReduces = 0;
int runningReduces = 0;
for (Pool pool: poolMgr.getPools()) {
  runnableMaps += pool.getMapSchedulable().getDemand();
  runningMaps += pool.getMapSchedulable().getRunningTasks();
  runnableReduces += pool.getReduceSchedulable().getDemand();
  runningReduces += pool.getReduceSchedulable().getRunningTasks();
}
This computes totals across all pools: runnableMaps (slots demanded by all map tasks), runningMaps (maps currently running), runnableReduces (slots demanded by all reduce tasks), and runningReduces (reduces currently running).
ClusterStatus clusterStatus = taskTrackerManager.getClusterStatus();
// Compute total map/reduce slots
// In the future we can precompute this if the Scheduler becomes a
// listener of tracker join/leave events.
int totalMapSlots = getTotalSlots(TaskType.MAP, clusterStatus);
int totalReduceSlots = getTotalSlots(TaskType.REDUCE, clusterStatus);
Next it fetches the cluster status from the JT and computes totalMapSlots (all slots in the cluster that can run maps) and totalReduceSlots (all slots that can run reduces).
// Update time waited for local maps for jobs skipped on last heartbeat
updateLocalityWaitTimes(currentTime);
This updates the "time waited for local maps" counter for every job that was skipped at the last TT heartbeat.
2. FairScheduler.updateLocalityWaitTimes():
/**
 * Update locality wait times for jobs that were skipped at last heartbeat.
 */
private void updateLocalityWaitTimes(long currentTime) {
  long timeSinceLastHeartbeat =
    (lastHeartbeatTime == 0 ? 0 : currentTime - lastHeartbeatTime);
  lastHeartbeatTime = currentTime;
  for (JobInfo info: infos.values()) {
    if (info.skippedAtLastHeartbeat) {
      info.timeWaitedForLocalMap += timeSinceLastHeartbeat;
      info.skippedAtLastHeartbeat = false;
    }
  }
}
It first computes the interval since the last heartbeat (timeSinceLastHeartbeat) and records the new heartbeat time. It then walks infos (the JobInProgress -> JobInfo map): for every JobInfo with skippedAtLastHeartbeat == true, it adds timeSinceLastHeartbeat to timeWaitedForLocalMap and resets skippedAtLastHeartbeat to false. Back to FairScheduler.
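A toy trace of this accounting (with hypothetical JobInfo fields and heartbeat times, standing in for the scheduler's real state) shows how the wait credit accumulates only for skipped jobs:

```java
import java.util.Collection;
import java.util.Collections;
import java.util.List;

// Toy trace of the wait-time accounting above (hypothetical stand-in for
// the scheduler's JobInfo map; not the real FairScheduler class).
public class LocalityWaitDemo {
    static class JobInfo {
        long timeWaitedForLocalMap = 0;
        boolean skippedAtLastHeartbeat = false;
    }

    static long lastHeartbeatTime = 0;

    static void updateLocalityWaitTimes(Collection<JobInfo> infos,
                                        long currentTime) {
        long sinceLast =
            (lastHeartbeatTime == 0 ? 0 : currentTime - lastHeartbeatTime);
        lastHeartbeatTime = currentTime;
        for (JobInfo info : infos) {
            if (info.skippedAtLastHeartbeat) {
                info.timeWaitedForLocalMap += sinceLast; // credit the wait
                info.skippedAtLastHeartbeat = false;     // reset for this round
            }
        }
    }

    public static void main(String[] args) {
        JobInfo job = new JobInfo();
        List<JobInfo> infos = Collections.singletonList(job);
        updateLocalityWaitTimes(infos, 1000); // first heartbeat: nothing elapsed
        job.skippedAtLastHeartbeat = true;    // scheduler skipped this job's maps
        updateLocalityWaitTimes(infos, 4000); // 3000 ms are credited
        System.out.println(job.timeWaitedForLocalMap);
    }
}
```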
3. FairScheduler.assignTasks():
// Check for JT safe-mode
if (taskTrackerManager.isInSafeMode()) {
  LOG.info("JobTracker is in safe-mode, not scheduling any tasks.");
  return null;
}
If the JT is in safe mode, no tasks are scheduled at all.
TaskTrackerStatus tts = tracker.getStatus();
int mapsAssigned = 0; // loop counter for map in the below while loop
int reducesAssigned = 0; // loop counter for reduce in the below while
int mapCapacity = maxTasksToAssign(TaskType.MAP, tts);
int reduceCapacity = maxTasksToAssign(TaskType.REDUCE, tts);
boolean mapRejected = false; // flag used for ending the loop
boolean reduceRejected = false; // flag used for ending the loop
// Keep track of which jobs were visited for map tasks and which had tasks
// launched, so that we can later mark skipped jobs for delay scheduling
Set<JobInProgress> visitedForMap = new HashSet<JobInProgress>();
Set<JobInProgress> visitedForReduce = new HashSet<JobInProgress>();
Set<JobInProgress> launchedMap = new HashSet<JobInProgress>();
ArrayList<Task> tasks = new ArrayList<Task>();
This block initializes the variables used during scheduling: mapsAssigned and reducesAssigned count the map/reduce tasks already chosen; mapCapacity and reduceCapacity are the maximum number of map/reduce tasks this TT can accept; mapRejected and reduceRejected flag whether the TT can take any more map/reduce tasks; visitedForMap and visitedForReduce record the jobs visited while searching for a runnable task; launchedMap records the jobs whose map tasks were chosen; and tasks holds the chosen tasks. Next, let's see how maxTasksToAssign() computes the maximum number of map/reduce tasks the TT can accept.
4. FairScheduler.maxTasksToAssign():
protected int maxTasksToAssign(TaskType type, TaskTrackerStatus tts) {
  if (!assignMultiple)
    return 1;
  int cap = (type == TaskType.MAP) ? mapAssignCap : reduceAssignCap;
  int availableSlots = (type == TaskType.MAP) ?
      tts.getAvailableMapSlots(): tts.getAvailableReduceSlots();
  if (cap == -1) // Infinite cap; use the TaskTracker's slot count
    return availableSlots;
  else
    return Math.min(cap, availableSlots);
}
Here assignMultiple comes from the mapred.fairscheduler.assignmultiple parameter (default true) and says whether multiple tasks, both map and reduce, may be assigned in one heartbeat. mapAssignCap and reduceAssignCap come from mapred.fairscheduler.assignmultiple.maps and mapred.fairscheduler.assignmultiple.reduces (default -1) and cap how many map/reduce tasks a single heartbeat may assign, -1 meaning unlimited. availableSlots is the number of map/reduce slots the TT has free when it sends the heartbeat, so the number of assigned tasks cannot exceed it.
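The three cases of that method can be checked with a small standalone version (hypothetical free function; the real method reads assignMultiple and the caps from the scheduler's configuration):

```java
// Standalone version of the maxTasksToAssign() logic shown above, with the
// relevant config values passed in explicitly (hypothetical form for testing).
public class MaxTasksExample {
    static int maxTasksToAssign(boolean assignMultiple, int cap,
                                int availableSlots) {
        if (!assignMultiple)
            return 1;                          // single-assignment mode
        if (cap == -1)
            return availableSlots;             // no per-heartbeat cap
        return Math.min(cap, availableSlots);  // capped by config
    }

    public static void main(String[] args) {
        System.out.println(maxTasksToAssign(true, -1, 3)); // 3: free slots limit
        System.out.println(maxTasksToAssign(true, 2, 3));  // 2: cap limits
        System.out.println(maxTasksToAssign(false, -1, 3)); // 1: one per heartbeat
    }
}
```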
5. FairScheduler.assignTasks():
What follows is an infinite loop that exits only when certain conditions are met; let's walk through its body piece by piece.
if (!mapRejected) {
  if (mapsAssigned == mapCapacity ||
      runningMaps == runnableMaps ||
      !loadMgr.canAssignMap(tts, runnableMaps,
          totalMapSlots, mapsAssigned)) {
    eventLog.log("INFO", "Can't assign another MAP to " + trackerName);
    mapRejected = true;
  }
}
if (!reduceRejected) {
  if (reducesAssigned == reduceCapacity ||
      runningReduces == runnableReduces ||
      !loadMgr.canAssignReduce(tts, runnableReduces,
          totalReduceSlots, reducesAssigned)) {
    eventLog.log("INFO", "Can't assign another REDUCE to " + trackerName);
    reduceRejected = true;
  }
}
if (mapRejected && reduceRejected ||
    !assignMultiple && tasks.size() > 0) {
  break; // This is the only exit of the while (true) loop
}
This part decides whether to leave the loop by updating mapRejected and reduceRejected. No further map can be assigned when mapsAssigned == mapCapacity (the TT has taken as many maps as it can accept), or runningMaps == runnableMaps (every runnable map is already running), or loadMgr.canAssignMap(tts, runnableMaps, totalMapSlots, mapsAssigned) returns false (the LoadManager, by default CapBasedLoadManager, refuses further map assignment). The reduce side is symmetric. Next, let's see how the LoadManager decides whether more map/reduce tasks may be assigned.
6. CapBasedLoadManager.canAssignMap():
public boolean canAssignMap(TaskTrackerStatus tracker,
    int totalRunnableMaps, int totalMapSlots, int alreadyAssigned) {
  int cap = getCap(totalRunnableMaps, tracker.getMaxMapSlots(), totalMapSlots);
  return tracker.countMapTasks() + alreadyAssigned < cap;
}

int getCap(int totalRunnableTasks, int localMaxTasks, int totalSlots) {
  double load = maxDiff + ((double)totalRunnableTasks) / totalSlots;
  return (int) Math.ceil(localMaxTasks * Math.min(1.0, load));
}
maxDiff comes from mapred.fairscheduler.load.max.diff (default 0.0f). Using the ratio of the cluster's total runnable tasks to its total slots, getCap() works out how many tasks a single TT should be running, and on that basis canAssignMap()/canAssignReduce() decide whether to send the TT another task.
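A quick worked example of the formula (the numbers are hypothetical; the method mirrors the getCap() shown above with maxDiff fixed at its 0.0 default): with 40 runnable maps against 100 total map slots the cluster load is 0.4, so a 4-slot TT is capped at ceil(4 * 0.4) = 2 maps; once demand exceeds capacity, the load saturates at 1.0 and the TT may fill all its slots.

```java
// Worked example of CapBasedLoadManager.getCap() with maxDiff at its
// default of 0.0 (hypothetical standalone form).
public class CapExample {
    static final double MAX_DIFF = 0.0; // mapred.fairscheduler.load.max.diff

    static int getCap(int totalRunnableTasks, int localMaxTasks,
                      int totalSlots) {
        double load = MAX_DIFF + ((double) totalRunnableTasks) / totalSlots;
        return (int) Math.ceil(localMaxTasks * Math.min(1.0, load));
    }

    public static void main(String[] args) {
        // Cluster 40% loaded: a 4-slot TT is capped at ceil(1.6) = 2 maps
        System.out.println(getCap(40, 4, 100));
        // Demand exceeds capacity: load saturates, TT may use all 4 slots
        System.out.println(getCap(200, 4, 100));
    }
}
```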
Having updated mapRejected and reduceRejected as above, the loop exits either when both flags are true, or when assignMultiple == false and at least one task has already been chosen, since in single-assignment mode only one task may be picked per heartbeat.
7. FairScheduler.assignTasks():
TaskType taskType;
if (mapRejected) {
  taskType = TaskType.REDUCE;
} else if (reduceRejected) {
  taskType = TaskType.MAP;
} else {
  // If both types are available, choose the task type with fewer running
  // tasks on the task tracker to prevent that task type from starving
  if (tts.countMapTasks() + mapsAssigned <=
      tts.countReduceTasks() + reducesAssigned) {
    taskType = TaskType.MAP;
  } else {
    taskType = TaskType.REDUCE;
  }
}
This decides whether to pick a map or a reduce task. If mapRejected == true, a reduce is chosen; conversely, if reduceRejected == true, a map is chosen. If both are false, the choice goes to the type with fewer tasks on the TT, comparing (running maps + maps assigned this heartbeat) against (running reduces + reduces assigned this heartbeat); on a tie, map wins. That completes the preparation; now the actual scheduling begins.
8. FairScheduler.assignTasks():
// Get the map or reduce schedulables and sort them by fair sharing
List<PoolSchedulable> scheds = getPoolSchedulables(taskType);
Collections.sort(scheds, new SchedulingAlgorithms.FairShareComparator());
The first line fetches all PoolSchedulables of the given type. Every Pool holds two PoolSchedulables: mapSchedulable for scheduling map tasks and reduceSchedulable for reduce tasks. They are then sorted with SchedulingAlgorithms.FairShareComparator, which orders each pool (or job) by the ratio of its running tasks to its own capacity: ascending by runningTasks / min(minShare, demand) while below min share, otherwise ascending by runningTasks / weight. Let's look at the SchedulingAlgorithms.FairShareComparator class.
public static class FairShareComparator implements Comparator<Schedulable> {
  @Override
  public int compare(Schedulable s1, Schedulable s2) {
    double minShareRatio1, minShareRatio2;
    double tasksToWeightRatio1, tasksToWeightRatio2;
    int minShare1 = Math.min(s1.getMinShare(), s1.getDemand());
    int minShare2 = Math.min(s2.getMinShare(), s2.getDemand());
    boolean s1Needy = s1.getRunningTasks() < minShare1;
    boolean s2Needy = s2.getRunningTasks() < minShare2;
    minShareRatio1 = s1.getRunningTasks() / Math.max(minShare1, 1.0);
    minShareRatio2 = s2.getRunningTasks() / Math.max(minShare2, 1.0);
    tasksToWeightRatio1 = s1.getRunningTasks() / s1.getWeight();
    tasksToWeightRatio2 = s2.getRunningTasks() / s2.getWeight();
    int res = 0;
    if (s1Needy && !s2Needy)
      res = -1;
    else if (s2Needy && !s1Needy)
      res = 1;
    else if (s1Needy && s2Needy)
      res = (int) Math.signum(minShareRatio1 - minShareRatio2);
    else // Neither schedulable is needy
      res = (int) Math.signum(tasksToWeightRatio1 - tasksToWeightRatio2);
    if (res == 0) {
      // Jobs are tied in fairness ratio. Break the tie by submit time and job
      // name to get a deterministic ordering, which is useful for unit tests.
      res = (int) Math.signum(s1.getStartTime() - s2.getStartTime());
      if (res == 0)
        res = s1.getName().compareTo(s2.getName());
    }
    return res;
  }
}
A reminder first (easy to forget): compare(a, b) -> -1 orders a before b; compare(a, b) -> 1 orders b before a. The comparison itself is fairly simple: the closer a schedulable's running task count is to its capacity, the later it sorts. That is reasonable, since the least-loaded schedulable should be served first. With the list sorted, tasks can be scheduled in order.
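The ordering can be demonstrated with a toy schedulable and a cut-down version of the comparator (only the running/minShare/demand/weight fields; the tie-breaking by start time and name is omitted, and all names here are hypothetical):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Toy demonstration of the FairShareComparator ordering (simplified sketch,
// not the real Schedulable interface).
public class FairOrderDemo {
    static class Sched {
        String name; int running, minShare, demand; double weight;
        Sched(String name, int running, int minShare, int demand, double weight) {
            this.name = name; this.running = running; this.minShare = minShare;
            this.demand = demand; this.weight = weight;
        }
    }

    static final Comparator<Sched> FAIR = (s1, s2) -> {
        int min1 = Math.min(s1.minShare, s1.demand);
        int min2 = Math.min(s2.minShare, s2.demand);
        boolean needy1 = s1.running < min1;
        boolean needy2 = s2.running < min2;
        if (needy1 && !needy2) return -1;  // below min share: schedule first
        if (needy2 && !needy1) return 1;
        double r1, r2;
        if (needy1) {                      // both needy: compare min-share ratios
            r1 = s1.running / Math.max(min1, 1.0);
            r2 = s2.running / Math.max(min2, 1.0);
        } else {                           // neither needy: compare tasks/weight
            r1 = s1.running / s1.weight;
            r2 = s2.running / s2.weight;
        }
        return (int) Math.signum(r1 - r2);
    };

    public static void main(String[] args) {
        List<Sched> scheds = new ArrayList<>(Arrays.asList(
            new Sched("a", 5, 2, 10, 1.0),   // above min share, ratio 5
            new Sched("b", 1, 4, 10, 1.0),   // below min share: needy, first
            new Sched("c", 2, 2, 10, 1.0))); // at min share, ratio 2
        scheds.sort(FAIR);
        for (Sched s : scheds) System.out.println(s.name); // b, c, a
    }
}
```

The needy schedulable "b" jumps the queue, and among the non-needy ones the lower tasks-to-weight ratio wins, which matches the behavior described above.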
9. FairScheduler.assignTasks():
boolean foundTask = false;
for (Schedulable sched: scheds) { // This loop will assign only one task
  eventLog.log("INFO", "Checking for " + taskType +
      " task in " + sched.getName());
  Task task = taskType == TaskType.MAP ?
      sched.assignTask(tts, currentTime, visitedForMap) :
      sched.assignTask(tts, currentTime, visitedForReduce);
  if (task != null) {
    foundTask = true;
    JobInProgress job = taskTrackerManager.getJob(task.getJobID());
    eventLog.log("ASSIGN", trackerName, taskType,
        job.getJobID(), task.getTaskID());
    // Update running task counts, and the job's locality level
    if (taskType == TaskType.MAP) {
      launchedMap.add(job);
      mapsAssigned++;
      runningMaps++;
      updateLastMapLocalityLevel(job, task, tts);
    } else {
      reducesAssigned++;
      runningReduces++;
    }
    // Add task to the list of assignments
    tasks.add(task);
    break; // This break makes this loop assign only one task
  } // end if(task != null)
} // end for(Schedulable sched: scheds)
foundTask records whether a task was chosen. Each pass over the sorted list picks at most one task, because choosing a task changes the schedulables' state, after which the list must be re-sorted before the next pick. The task itself is obtained by calling the schedulable's assignTask() method. Schedulable has two implementations, PoolSchedulable and JobSchedulable; here the list holds PoolSchedulables. Let's look at PoolSchedulable.assignTask().
10. PoolSchedulable.assignTask():
public Task assignTask(TaskTrackerStatus tts, long currentTime,
    Collection<JobInProgress> visited) throws IOException {
  int runningTasks = getRunningTasks();
  if (runningTasks >= poolMgr.getMaxSlots(pool.getName(), taskType)) {
    return null;
  }
  SchedulingMode mode = pool.getSchedulingMode();
  Comparator<Schedulable> comparator;
  if (mode == SchedulingMode.FIFO) {
    comparator = new SchedulingAlgorithms.FifoComparator();
  } else if (mode == SchedulingMode.FAIR) {
    comparator = new SchedulingAlgorithms.FairShareComparator();
  } else {
    throw new RuntimeException("Unsupported pool scheduling mode " + mode);
  }
  Collections.sort(jobScheds, comparator);
  for (JobSchedulable sched: jobScheds) {
    Task task = sched.assignTask(tts, currentTime, visited);
    if (task != null)
      return task;
  }
  return null;
}
It first obtains the PoolSchedulable's running task count; if that already reaches the pool's maximum for this task type (map/reduce), nothing is scheduled and null is returned. It then reads the pool's scheduling mode via pool.getSchedulingMode(): when scheduling the jobs inside a pool, FairScheduler supports two modes, FIFO and FAIR. FIFO (first in, first out) schedules earlier-submitted jobs first, using the SchedulingAlgorithms.FifoComparator; FAIR schedules by the fairness rules (the same SchedulingAlgorithms.FairShareComparator used between pools). The mode is set by the schedulingMode attribute in the pool definition. A quick note on the FIFO rule.
FIFO: jobs are compared first by Hadoop's built-in job priority (five levels, from highest to lowest: VERY_HIGH, HIGH, NORMAL, LOW, VERY_LOW), set at job creation via mapred.job.priority, default NORMAL; ties are broken by the job's start time, earlier jobs first.
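That rule can be sketched as follows (a simplified, hypothetical Job holder; the real comparator is SchedulingAlgorithms.FifoComparator over JobSchedulables):

```java
import java.util.Comparator;

// Simplified sketch of the FIFO rule: higher priority first, then earlier
// start time (hypothetical Job holder, not the real FifoComparator).
public class FifoSketch {
    // Mirrors the JobPriority ordering: VERY_HIGH has ordinal 0, so the
    // enum's natural order puts higher priorities first
    enum Priority { VERY_HIGH, HIGH, NORMAL, LOW, VERY_LOW }

    static class Job {
        Priority priority; long startTime;
        Job(Priority p, long t) { priority = p; startTime = t; }
    }

    static final Comparator<Job> FIFO =
        Comparator.<Job, Priority>comparing(j -> j.priority) // priority first
                  .thenComparingLong(j -> j.startTime);      // then submit time

    public static void main(String[] args) {
        Job early = new Job(Priority.NORMAL, 100);
        Job lateButHigh = new Job(Priority.HIGH, 200);
        // Priority wins over start time
        System.out.println(FIFO.compare(lateButHigh, early) < 0); // true
    }
}
```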
FAIR works exactly as it does between pools. Whichever comparator is used, the JobSchedulables in the pool are sorted, then traversed, calling each JobSchedulable's assignTask() until one yields a task. Let's look at JobSchedulable.assignTask().
11. JobSchedulable.assignTask():
public Task assignTask(TaskTrackerStatus tts, long currentTime,
    Collection<JobInProgress> visited) throws IOException {
  if (isRunnable()) {
    visited.add(job);
    TaskTrackerManager ttm = scheduler.taskTrackerManager;
    ClusterStatus clusterStatus = ttm.getClusterStatus();
    int numTaskTrackers = clusterStatus.getTaskTrackers();
    // check with the load manager whether it is safe to
    // launch this task on this taskTracker.
    LoadManager loadMgr = scheduler.getLoadManager();
    if (!loadMgr.canLaunchTask(tts, job, taskType)) {
      return null;
    }
    if (taskType == TaskType.MAP) {
      LocalityLevel localityLevel = scheduler.getAllowedLocalityLevel(
          job, currentTime);
      scheduler.getEventLog().log(
          "ALLOWED_LOC_LEVEL", job.getJobID(), localityLevel);
      switch (localityLevel) {
        case NODE:
          return job.obtainNewNodeLocalMapTask(tts, numTaskTrackers,
              ttm.getNumberOfUniqueHosts());
        case RACK:
          return job.obtainNewNodeOrRackLocalMapTask(tts, numTaskTrackers,
              ttm.getNumberOfUniqueHosts());
        default:
          return job.obtainNewMapTask(tts, numTaskTrackers,
              ttm.getNumberOfUniqueHosts());
      }
    } else {
      return job.obtainNewReduceTask(tts, numTaskTrackers,
          ttm.getNumberOfUniqueHosts());
    }
  } else {
    return null;
  }
}
Only a running job can have tasks scheduled. visited.add(job) records that this job was visited during scheduling. LoadManager.canLaunchTask() then decides whether a task may be launched on this TT (true by default). Map tasks additionally consider locality: a map should run as close as possible to the TT holding its input data, for efficiency. LocalityLevel localityLevel = scheduler.getAllowedLocalityLevel(job, currentTime) returns the locality level allowed for the job's maps, and a different obtain...Task() method is called per level; a reduce task is simply requested directly. Let's look at FairScheduler.getAllowedLocalityLevel().
12. FairScheduler.getAllowedLocalityLevel():
JobInfo info = infos.get(job);
if (info == null) { // Job not in infos (shouldn't happen)
  LOG.error("getAllowedLocalityLevel called on job " + job
      + ", which does not have a JobInfo in infos");
  return LocalityLevel.ANY;
}
if (job.nonLocalMaps.size() > 0) { // Job doesn't have locality information
  return LocalityLevel.ANY;
}
There are four locality levels: NODE (the map's input must be on the same node as the TT running the task), NODEGROUP (same node group), RACK (same rack), and ANY (no constraint). The method first looks up the job's JobInfo; if none exists it returns LocalityLevel.ANY, and it also returns ANY if the job's nonLocalMaps list is non-empty. nonLocalMaps is filled during job initialization: any split without location information has its map task added to that list.
Pool pool = poolMgr.getPool(job);
PoolSchedulable sched = pool.getMapSchedulable();
long minShareTimeout = poolMgr.getMinSharePreemptionTimeout(pool.getName());
long fairShareTimeout = poolMgr.getFairSharePreemptionTimeout();
if (currentTime - sched.getLastTimeAtMinShare() > minShareTimeout ||
    currentTime - sched.getLastTimeAtHalfFairShare() > fairShareTimeout) {
  eventLog.log("INFO", "No delay scheduling for "
      + job.getJobID() + " because it is being starved");
  return LocalityLevel.ANY;
}
This checks whether the job's pool is starving; if so, ANY is returned immediately. The check uses the pool's minShareTimeout and fairShareTimeout. The pool's lastTimeAtMinShare and lastTimeAtHalfFairShare values are refreshed in FairScheduler.update(), which, as noted above, runs continuously on its own thread.
// In the common case, compute locality level based on time waited
switch(info.lastMapLocalityLevel) {
  case NODE: // Last task launched was node-local
    if (info.timeWaitedForLocalMap >=
        nodeLocalityDelay + rackLocalityDelay)
      return LocalityLevel.ANY;
    else if (info.timeWaitedForLocalMap >= nodeLocalityDelay)
      return LocalityLevel.RACK;
    else
      return LocalityLevel.NODE;
  case RACK: // Last task launched was rack-local
    if (info.timeWaitedForLocalMap >= rackLocalityDelay)
      return LocalityLevel.ANY;
    else
      return LocalityLevel.RACK;
  default: // Last task was non-local; can launch anywhere
    return LocalityLevel.ANY;
}
The method then uses the job's lastMapLocalityLevel, the level at which its previous map task was scheduled, to decide this one. If lastMapLocalityLevel == NODE: once the wait timeWaitedForLocalMap >= nodeLocalityDelay + rackLocalityDelay, ANY is allowed; if it is only >= nodeLocalityDelay, RACK is allowed; below nodeLocalityDelay, the job must stay at NODE. timeWaitedForLocalMap is maintained in FairScheduler.assignTasks(): before scheduling begins, every job's JobInfo is updated so that the locality level can be chosen correctly. If lastMapLocalityLevel == RACK, ANY is allowed once timeWaitedForLocalMap >= rackLocalityDelay, otherwise RACK; in the default case (the last task was non-local) ANY is returned. This yields the locality level under which the job will pick its next map task. Back to JobSchedulable.assignTask().
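The level computation just described boils down to a small pure function, sketched here with hypothetical delay values (5 seconds each; the real delays come from the scheduler's locality-delay configuration):

```java
// Standalone sketch of the delay-scheduling level computation above: the
// longer a job has waited for a local map, the looser the allowed level.
public class DelaySchedulingDemo {
    enum LocalityLevel { NODE, RACK, ANY }

    static LocalityLevel allowedLevel(LocalityLevel lastLevel, long timeWaited,
                                      long nodeDelay, long rackDelay) {
        switch (lastLevel) {
            case NODE: // last map was node-local
                if (timeWaited >= nodeDelay + rackDelay)
                    return LocalityLevel.ANY;
                else if (timeWaited >= nodeDelay)
                    return LocalityLevel.RACK;
                else
                    return LocalityLevel.NODE;
            case RACK: // last map was rack-local
                return timeWaited >= rackDelay
                    ? LocalityLevel.ANY : LocalityLevel.RACK;
            default:   // last map was non-local
                return LocalityLevel.ANY;
        }
    }

    public static void main(String[] args) {
        // With 5s node and rack delays (hypothetical values):
        System.out.println(allowedLevel(LocalityLevel.NODE, 2000, 5000, 5000));  // NODE
        System.out.println(allowedLevel(LocalityLevel.NODE, 7000, 5000, 5000));  // RACK
        System.out.println(allowedLevel(LocalityLevel.NODE, 12000, 5000, 5000)); // ANY
    }
}
```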
13. JobSchedulable.assignTask():
switch (localityLevel) {
  case NODE:
    return job.obtainNewNodeLocalMapTask(tts, numTaskTrackers,
        ttm.getNumberOfUniqueHosts());
  case RACK:
    return job.obtainNewNodeOrRackLocalMapTask(tts, numTaskTrackers,
        ttm.getNumberOfUniqueHosts());
  default:
    return job.obtainNewMapTask(tts, numTaskTrackers,
        ttm.getNumberOfUniqueHosts());
}
Depending on the locality level, a different method picks the map task: obtainNewNodeLocalMapTask(), obtainNewNodeOrRackLocalMapTask(), or obtainNewMapTask(); reduces use obtainNewReduceTask(). These four methods are fairly involved internally and deserve a deeper analysis another time.
At this point JobSchedulable.assignTask() completes and returns its Task. Back in PoolSchedulable.assignTask(), as soon as any task is obtained it is returned, so control goes back to FairScheduler.assignTasks(). There, once a task is returned, foundTask is set to true; if it is a map task, its job is added to launchedMap, and updateLastMapLocalityLevel() refreshes the job's JobInfo (lastMapLocalityLevel and timeWaitedForLocalMap) so that the next map can be chosen correctly.
14. FairScheduler.assignTasks():
if (!foundTask) {
  if (taskType == TaskType.MAP) {
    mapRejected = true;
  } else {
    reduceRejected = true;
  }
}
This is straightforward: if no map (or reduce) task was chosen, the corresponding rejected flag is set.
for (JobInProgress job: visitedForMap) {
  if (!launchedMap.contains(job)) {
    infos.get(job).skippedAtLastHeartbeat = true;
  }
}
visitedForMap records every job visited while scheduling maps. Any visited job not in launchedMap (the jobs whose map tasks were chosen) has its JobInfo's skippedAtLastHeartbeat set to true, meaning this heartbeat picked no map task from it. That flag feeds into the job's timeWaitedForLocalMap, as described in FairScheduler.updateLocalityWaitTimes() above.
That concludes this quick walk through the FairScheduler task-scheduling source. Corrections are welcome, thanks.