DAGScheduler
SparkContext has two ways of submitting a job:
1. the runJob method covered in the previous chapter
2. the submitJob method
Both end up in DAGScheduler; they are the two entry points that DAGScheduler exposes.
The difference is that DAGScheduler.runJob internally calls DAGScheduler.submitJob, takes the JobWaiter it returns, and blocks on it until the job completes or fails; submitJob, on the other hand, returns immediately, so it can be used asynchronously to check whether the job has finished or to cancel it.
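To make the blocking versus asynchronous distinction concrete, here is a minimal toy sketch (JobSubmissionSketch and its bodies are my own illustration, not the actual Spark source), in which runJob is just submitJob plus a blocking wait on the returned handle, much like waiting on a JobWaiter:

import scala.concurrent.{Await, Future}
import scala.concurrent.duration.Duration
import scala.concurrent.ExecutionContext.Implicits.global

object JobSubmissionSketch {
  // submitJob-style entry point: kick off the work and return a handle right away (asynchronous).
  def submitJob(work: () => Int): Future[Int] = Future { work() }

  // runJob-style entry point: reuse submitJob, then block until the job finishes or fails,
  // analogous to blocking on the JobWaiter returned by DAGScheduler.submitJob.
  def runJob(work: () => Int): Int = {
    val waiter = submitJob(work)
    Await.result(waiter, Duration.Inf)
  }

  def main(args: Array[String]): Unit = {
    val handle = submitJob(() => 1 + 1)          // caller keeps the handle and can check or cancel later
    println(runJob(() => 2 + 2))                 // caller blocks until the result is ready
    println(Await.result(handle, Duration.Inf))  // collect the async result eventually
  }
}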
The call chain from there is:
eventProcessActor ! JobSubmitted
DAGScheduler.handleJobSubmitted
handleJobSubmitted.newStage
newStage creates a Stage object:
new Stage(id, rdd, numTasks, shuffleDep, getParentStages(rdd, jobId), jobId, callSite)
private[spark] class Stage(
    val id: Int,
    val rdd: RDD[_],
    val numTasks: Int,
    val shuffleDep: Option[ShuffleDependency[_,_]],  // Output shuffle if stage is a map stage
    val parents: List[Stage],
    val jobId: Int,
    callSite: Option[String])
As you can see, a Stage holds an RDD, and that RDD is the last RDD of the stage.
So what are its parent stages? The following code builds the parent stage list:
private def getParentStages(rdd: RDD[_], jobId: Int): List[Stage] = {
  val parents = new HashSet[Stage]
  val visited = new HashSet[RDD[_]]
  def visit(r: RDD[_]) {  // walk the RDD dependency chain
    if (!visited(r)) {
      visited += r
      // Kind of ugly: need to register RDDs with the cache here since
      // we can't do it in its constructor because # of partitions is unknown
      for (dep <- r.dependencies) {
        dep match {
          case shufDep: ShuffleDependency[_,_] =>  // a ShuffleDependency marks a stage boundary
            parents += getShuffleMapStage(shufDep, jobId)  // add the shuffle map stage to this stage's parent list
          case _ =>
            visit(dep.rdd)  // not a ShuffleDependency: keep walking the dependency chain
        }
      }
    }
  }
  visit(rdd)
  parents.toList
}
Each Stage therefore carries a list of parent stages (List[Stage]); the process above is illustrated in the figure below.
Note that the MapPartitionsRDD, ShuffledRDD and MapPartitionsRDD in the figure are produced by the reduceByKey transformation.
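As a concrete example (assuming an existing SparkContext named sc; the input path is illustrative), a word-count style job is cut into two stages by the ShuffleDependency that reduceByKey introduces: a shuffle map stage covering textFile/flatMap/map, and a final result stage that reads the shuffled data and serves collect():

val counts = sc.textFile("hdfs://host/path/input.txt")  // illustrative input path
  .flatMap(line => line.split(" "))                     // stage 0: narrow dependencies only
  .map(word => (word, 1))                               // still stage 0
  .reduceByKey(_ + _)                                   // ShuffleDependency here => stage boundary
  .collect()                                            // runs the final (result) stage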
Once the final stage has been created, it is submitted:
private def submitStage(stage: Stage) {
  val jobId = activeJobForStage(stage)
  if (jobId.isDefined) {
    logDebug("submitStage(" + stage + ")")
    if (!waitingStages(stage) && !runningStages(stage) && !failedStages(stage)) {
      val missing = getMissingParentStages(stage).sortBy(_.id)  // parent stages that still need to be computed
      logDebug("missing: " + missing)
      if (missing == Nil) {
        logInfo("Submitting " + stage + " (" + stage.rdd + "), which has no missing parents")
        submitMissingTasks(stage, jobId.get)  // no missing parent stages: submit this stage's tasks directly
        runningStages += stage
      } else {
        for (parent <- missing) {
          submitStage(parent)  // there are missing parent stages: recurse until we reach the earliest ones
        }
        waitingStages += stage
      }
    }
  } else {
    abortStage(stage, "No active job for stage " + stage.id)
  }
}
Submitting a stage is therefore a recursive process: parent stages are submitted first via submitStage while the current stage is added to waitingStages; once a stage has no missing parent stages, its tasks are submitted.
Next, look at the submitMissingTasks method:
private def submitMissingTasks(stage: Stage, jobId: Int) {
  logDebug("submitMissingTasks(" + stage + ")")
  // Get our pending tasks and remember them in our pendingTasks entry
  val myPending = pendingTasks.getOrElseUpdate(stage, new HashSet)
  ......
  if (stage.isShuffleMap) {  // every stage other than the final one generates ShuffleMapTasks
    for (p <- 0 until stage.numPartitions if stage.outputLocs(p) == Nil) {
      val locs = getPreferredLocs(stage.rdd, p)
      tasks += new ShuffleMapTask(stage.id, stage.rdd, stage.shuffleDep.get, p, locs)
    }
  } else {
    // This is a final stage; figure out its job's missing partitions
    val job = resultStageToJob(stage)
    for (id <- 0 until job.numPartitions if !job.finished(id)) {
      val partition = job.partitions(id)
      val locs = getPreferredLocs(stage.rdd, partition)
      tasks += new ResultTask(stage.id, stage.rdd, job.func, partition, locs, id)
    }
  }
  ......  // (elided code, including the check that there are tasks to submit)
    taskScheduler.submitTasks(
      new TaskSet(tasks.toArray, stage.id, stage.newAttemptId(), stage.jobId, properties))
    stageToInfos(stage).submissionTime = Some(System.currentTimeMillis())
  } else {
    logDebug("Stage " + stage + " is actually done; %b %d %d".format(
      stage.isAvailable, stage.numAvailableOutputs, stage.numPartitions))
    runningStages -= stage
  }
}
There are two kinds of Task: ShuffleMapTask and ResultTask; the runTask method of each is what deserves attention.
Finally, the tasks are handed over to TaskSchedulerImpl; note that they are wrapped in a TaskSet when submitted.
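As a rough illustration of the difference between the two (a toy model with invented names, not Spark's actual classes or signatures): a ShuffleMapTask-style runTask partitions its records for the shuffle and reports where the output lives, while a ResultTask-style runTask applies the job's function to its partition and returns the value:

// Toy stand-ins; real Spark tasks take a TaskContext and read their partition through the RDD.
sealed trait ToyTask[T] { def runTask(): T }

// Rough analogue of MapStatus: describes where this partition's shuffle output can be fetched.
case class ToyMapStatus(location: String, numBuckets: Int)

class ToyShuffleMapTask(partition: Int, records: Seq[(String, Int)], numReducers: Int)
  extends ToyTask[ToyMapStatus] {
  def runTask(): ToyMapStatus = {
    // Bucket records by reducer; real code writes shuffle files instead of keeping a Map in memory.
    val buckets = records.groupBy { case (key, _) => math.abs(key.hashCode) % numReducers }
    ToyMapStatus(s"executor-0/partition-$partition", buckets.size)
  }
}

class ToyResultTask[U](partition: Int, records: Seq[Int], func: Iterator[Int] => U)
  extends ToyTask[U] {
  def runTask(): U = func(records.iterator)  // apply the job's function to this partition
}

object TaskSketch extends App {
  println(new ToyShuffleMapTask(0, Seq(("a", 1), ("b", 2), ("a", 3)), numReducers = 2).runTask())
  println(new ToyResultTask(0, Seq(1, 2, 3), (it: Iterator[Int]) => it.sum).runTask())
}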
TaskScheduler
override def submitTasks(taskSet: TaskSet) {
  val tasks = taskSet.tasks
  logInfo("Adding task set " + taskSet.id + " with " + tasks.length + " tasks")
  this.synchronized {
    val manager = new TaskSetManager(this, taskSet, maxTaskFailures)
    activeTaskSets(taskSet.id) = manager
    schedulableBuilder.addTaskSetManager(manager, manager.taskSet.properties)
    if (!isLocal && !hasReceivedTask) {
      starvationTimer.scheduleAtFixedRate(new TimerTask() {
        override def run() {
          if (!hasLaunchedTask) {
            logWarning("Initial job has not accepted any resources; " +
              "check your cluster UI to ensure that workers are registered " +
              "and have sufficient memory")
          } else {
            this.cancel()
          }
        }
      }, STARVATION_TIMEOUT, STARVATION_TIMEOUT)
    }
    hasReceivedTask = true
  }
  backend.reviveOffers()
}
The TaskSet is wrapped into a TaskSetManager --> SchedulableBuilder.addTaskSetManager --> rootPool.addSchedulable adds the TaskSetManager to the root pool. TaskSetManager and Pool in fact extend the same trait, Schedulable, yet their core interfaces are completely different, which personally strikes me as a weak design. A Pool is best understood as a container for TaskSetManagers, and it can also hold other Pools; a minimal sketch of this relationship follows.
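The sketch below is simplified and uses invented names (ToySchedulable, ToyPool, ToyTaskSetManager), not Spark's actual Schedulable trait: both the leaf and the container implement one small contract, and the pool just holds an ordered collection of children, which may themselves be pools.

import scala.collection.mutable.ArrayBuffer

trait ToySchedulable {
  def name: String
  // Flatten the tree into the order in which task sets should be offered resources.
  def sortedTaskSetQueue: Seq[ToySchedulable]
}

class ToyTaskSetManager(val name: String) extends ToySchedulable {
  def sortedTaskSetQueue: Seq[ToySchedulable] = Seq(this)  // a leaf only schedules itself
}

class ToyPool(val name: String) extends ToySchedulable {
  private val children = new ArrayBuffer[ToySchedulable]
  def addSchedulable(s: ToySchedulable): Unit = children += s
  // A real Pool orders its children with a FIFO or FAIR algorithm; here we keep insertion order.
  def sortedTaskSetQueue: Seq[ToySchedulable] = children.flatMap(_.sortedTaskSetQueue).toSeq
}

object PoolSketch extends App {
  val rootPool = new ToyPool("root")
  rootPool.addSchedulable(new ToyTaskSetManager("taskset-0"))
  val childPool = new ToyPool("child")
  childPool.addSchedulable(new ToyTaskSetManager("taskset-1"))
  rootPool.addSchedulable(childPool)                 // a pool can contain other pools
  println(rootPool.sortedTaskSetQueue.map(_.name))   // taskset-0, taskset-1
}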
CoarseGrainedSchedulerBackend.reviveOffers
DriverActor ! ReviveOffers
makeOffers
def makeOffers() {
  launchTasks(scheduler.resourceOffers(
    executorHost.toArray.map { case (id, host) => new WorkerOffer(id, host, freeCores(id)) }))
}
The SchedulerBackend is what offers resources to the TaskScheduler. Look first at TaskScheduler.resourceOffers, whose result is what launchTasks consumes:
def resourceOffers(offers: Seq[WorkerOffer]): Seq[Seq[TaskDescription]] = synchronized {
  ......
  val sortedTaskSets = rootPool.getSortedTaskSetQueue  // TaskSetManagers ordered according to the scheduling mode
  ......
  // Take each TaskSet in our scheduling order, and then offer it each node in increasing order
  // of locality levels so that it gets a chance to launch local tasks on all of them.
  var launchedTask = false
  for (taskSet <- sortedTaskSets; maxLocality <- TaskLocality.values) {
    do {
      launchedTask = false
      for (i <- 0 until shuffledOffers.size) {
        val execId = shuffledOffers(i).executorId
        val host = shuffledOffers(i).host
        if (availableCpus(i) >= CPUS_PER_TASK) {
          for (task <- taskSet.resourceOffer(execId, host, maxLocality)) {  // hand the most data-local task to the worker
            ......
          }
        }
      }
    } while (launchedTask)
  }
  return tasks
}
Let's first look at the structure of the TaskSetManager class:
private[spark] class TaskSetManager(
    sched: TaskSchedulerImpl,
    val taskSet: TaskSet,
    val maxTaskFailures: Int,
    clock: Clock = SystemClock)
  extends Schedulable with Logging
{
  ......
  private val pendingTasksForExecutor = new HashMap[String, ArrayBuffer[Int]]
  private val pendingTasksForHost = new HashMap[String, ArrayBuffer[Int]]
  private val pendingTasksForRack = new HashMap[String, ArrayBuffer[Int]]
  ......
  for (i <- (0 until numTasks).reverse) {  // runs at construction time; registers each task under its locality in the maps above
    addPendingTask(i)
  }
  // Figure out which locality levels we have in our TaskSet, so we can do delay scheduling
  val myLocalityLevels = computeValidLocalityLevels()  // the locality levels present in the maps above, ordered process -> node -> rack
  val localityWaits = myLocalityLevels.map(getLocalityWait)  // Time to wait at each level (default per-locality wait, read from the configuration)
  // Delay scheduling variables: we keep track of our current locality level and the time we
  // last launched a task at that level, and move up a level when localityWaits[curLevel] expires.
  // We then move down if we manage to launch a "more local" task.
  var currentLocalityIndex = 0  // current index into myLocalityLevels, starting at 0
  var lastLaunchTime = clock.getTime()  // time the last task was launched; if (current time - lastLaunchTime) exceeds the wait, currentLocalityIndex is incremented
  ......
}
Next, look at TaskSetManager.resourceOffer:
override def resourceOffer(
    execId: String,
    host: String,
    availableCpus: Int,
    maxLocality: TaskLocality.TaskLocality)
  : Option[TaskDescription] =
{
  if (tasksFinished < numTasks && availableCpus >= CPUS_PER_TASK) {  // only if some tasks are still unfinished and enough cores are available
    val curTime = clock.getTime()
    var allowedLocality = getAllowedLocalityLevel(curTime)  // bumps currentLocalityIndex if the wait expired; returns the currently allowed locality level
    if (allowedLocality > maxLocality) {  // never exceed the maxLocality the caller passed in
      allowedLocality = maxLocality  // We're not allowed to search for farther-away tasks
    }
    findTask(execId, host, allowedLocality) match {  // find a suitable task for this execId/host within allowedLocality
      case Some((index, taskLocality)) => {
        ......
        // Update our locality level for delay scheduling
        currentLocalityIndex = getLocalityIndex(taskLocality)  // update with this task's locality; it may decrease, since the caller's maxLocality may have tightened allowedLocality
        lastLaunchTime = curTime  // update lastLaunchTime
        // Serialize and return the task
        ......
        return Some(new TaskDescription(taskId, execId, taskName, index, serializedTask))  // return the task that was finally scheduled
      }
      case _ =>
    }
  }
  return None
}
To sum up: TaskSetManager.resourceOffer automatically adjusts its own locality strategy based on when the last task was successfully launched.
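The delay-scheduling bookkeeping behind this can be sketched as follows (a simplified model of what getAllowedLocalityLevel does; the level names match Spark's TaskLocality values, but the wait values and structure here are illustrative):

object DelaySchedulingSketch {
  // Locality levels from most to least local, as in myLocalityLevels.
  val levels = Array("PROCESS_LOCAL", "NODE_LOCAL", "RACK_LOCAL", "ANY")
  // How long (ms) we are willing to wait at each level before relaxing to the next one.
  val waits  = Array(3000L, 3000L, 3000L, 0L)

  var currentLocalityIndex = 0
  var lastLaunchTime = System.currentTimeMillis()

  // Called when a resource offer arrives: if we have waited longer than the current level's
  // budget since the last launch, fall back to a less local (larger-index) level.
  def allowedLocality(curTime: Long): String = {
    while (curTime - lastLaunchTime >= waits(currentLocalityIndex) &&
           currentLocalityIndex < levels.length - 1) {
      lastLaunchTime += waits(currentLocalityIndex)  // that level's wait budget has been spent
      currentLocalityIndex += 1
    }
    levels(currentLocalityIndex)
  }

  // Called after a task is actually launched: jump to the launched task's level (possibly
  // more local than before) and reset the clock, as resourceOffer does above.
  def taskLaunched(levelOfLaunchedTask: String, curTime: Long): Unit = {
    currentLocalityIndex = levels.indexOf(levelOfLaunchedTask)
    lastLaunchTime = curTime
  }
}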