Spark Source Code Analysis: DAG Division and Submission in the DAGScheduler

1. Spark Runtime Architecture

The Spark runtime architecture is shown in the figure below:

[Figure: Spark runtime architecture]
The RDDs in a job depend on one another, and these dependencies form a directed acyclic graph (DAG). The DAGScheduler divides this DAG into stages, and the rule is simple: traverse backwards from the final RDD; an RDD reached through a narrow dependency joins the current stage, while a wide (shuffle) dependency marks a stage boundary. Once the stages are divided, the DAGScheduler generates a TaskSet for each stage and submits it to the TaskScheduler. The TaskScheduler is responsible for concrete task scheduling and launches the tasks on the Worker nodes.
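For intuition, here is a minimal word-count style sketch (an illustration of the rule above, not code from the Spark source), assuming a local-mode SparkContext: map is a narrow dependency and stays in the first stage, while reduceByKey introduces a shuffle dependency and therefore opens a new stage.

import org.apache.spark.{SparkConf, SparkContext}

object StageDivisionExample {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("stage-division").setMaster("local[*]"))
    val pairs = sc.parallelize(Seq("a", "b", "a"))
      .map(w => (w, 1))                   // narrow dependency: stays in the same stage
    val counts = pairs.reduceByKey(_ + _) // wide (shuffle) dependency: stage boundary
    counts.collect()                      // the action that triggers SparkContext.runJob
    // toDebugString prints the lineage; the ShuffledRDD line marks the stage split
    println(counts.toDebugString)
    sc.stop()
  }
}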

2. Source Code Analysis: DAG Division in the DAGScheduler
When an RDD triggers an action (e.g., collect), SparkContext.runJob is executed. SparkContext.runJob calls DAGScheduler.runJob, which ultimately calls DAGScheduler.submitJob:
def submitJob[T, U](
    rdd: RDD[T],
    func: (TaskContext, Iterator[T]) => U,
    partitions: Seq[Int],
    callSite: CallSite,
    resultHandler: (Int, U) => Unit,
    properties: Properties): JobWaiter[U] = {
  // Check to make sure we are not launching a task on a partition that does not exist.
  val maxPartitions = rdd.partitions.length
  partitions.find(p => p >= maxPartitions || p < 0).foreach { p =>
    throw new IllegalArgumentException(
      "Attempting to access a non-existent partition: " + p + ". " +
        "Total number of partitions: " + maxPartitions)
  }

  val jobId = nextJobId.getAndIncrement()
  if (partitions.size == 0) {
    // Return immediately if the job is running 0 tasks
    return new JobWaiter[U](this, jobId, 0, resultHandler)
  }

  assert(partitions.size > 0)
  val func2 = func.asInstanceOf[(TaskContext, Iterator[_]) => _]
  val waiter = new JobWaiter(this, jobId, partitions.size, resultHandler)
  // Post a JobSubmitted message to eventProcessLoop
  eventProcessLoop.post(JobSubmitted(
    jobId, rdd, func2, partitions.toArray, callSite, waiter,
    SerializationUtils.clone(properties)))
  waiter
}

In its submitJob method, the DAGScheduler posts a JobSubmitted message to the eventProcessLoop object. eventProcessLoop is an instance of the DAGSchedulerEventProcessLoop class:

private[scheduler] val eventProcessLoop = new DAGSchedulerEventProcessLoop(this)

DAGSchedulerEventProcessLoop receives the various events and processes them; the processing logic lives in its doOnReceive method:

private def doOnReceive(event: DAGSchedulerEvent): Unit = event match {
  // Job submission
  case JobSubmitted(jobId, rdd, func, partitions, callSite, listener, properties) =>
    dagScheduler.handleJobSubmitted(jobId, rdd, func, partitions, callSite, listener, properties)

  case MapStageSubmitted(jobId, dependency, callSite, listener, properties) =>
    dagScheduler.handleMapStageSubmitted(jobId, dependency, callSite, listener, properties)

  case StageCancelled(stageId) =>
    dagScheduler.handleStageCancellation(stageId)

  case JobCancelled(jobId) =>
    dagScheduler.handleJobCancellation(jobId)

  case JobGroupCancelled(groupId) =>
    dagScheduler.handleJobGroupCancelled(groupId)

  case AllJobsCancelled =>
    dagScheduler.doCancelAllJobs()

  case ExecutorAdded(execId, host) =>
    dagScheduler.handleExecutorAdded(execId, host)

  case ExecutorLost(execId) =>
    dagScheduler.handleExecutorLost(execId, fetchFailed = false)

  case BeginEvent(task, taskInfo) =>
    dagScheduler.handleBeginEvent(task, taskInfo)

  case GettingResultEvent(taskInfo) =>
    dagScheduler.handleGetTaskResult(taskInfo)

  case completion: CompletionEvent =>
    dagScheduler.handleTaskCompletion(completion)

  case TaskSetFailed(taskSet, reason, exception) =>
    dagScheduler.handleTaskSetFailed(taskSet, reason, exception)

  case ResubmitFailedStages =>
    dagScheduler.resubmitFailedStages()
}

You can think of DAGSchedulerEventProcessLoop as the DAGScheduler's external functional interface: it hides the details of the internal implementation. Whether a message originates internally or externally, the DAGScheduler handles it through the same message-processing code, which keeps the logic clear and the handling uniform.
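Spark factors this pattern into the EventLoop base class (org.apache.spark.util.EventLoop), which DAGSchedulerEventProcessLoop extends. The sketch below is a simplified illustration of that pattern, not Spark's actual implementation: post only enqueues an event, and a single daemon thread drains the queue and dispatches, so every event is handled serially through one uniform code path.

import java.util.concurrent.LinkedBlockingQueue

// Simplified sketch of the event-loop pattern; names here are illustrative.
abstract class SimpleEventLoop[E](name: String) {
  private val queue = new LinkedBlockingQueue[E]()
  private val thread = new Thread(name) {
    override def run(): Unit = {
      try {
        while (true) {
          onReceive(queue.take())   // blocks until an event arrives
        }
      } catch {
        case _: InterruptedException =>   // stop() interrupts the thread
      }
    }
  }
  thread.setDaemon(true)

  def start(): Unit = thread.start()
  def stop(): Unit = thread.interrupt()
  def post(event: E): Unit = queue.put(event)   // callers never block on handling
  protected def onReceive(event: E): Unit       // dispatch logic, cf. doOnReceive
}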

Next, let's analyze the DAGScheduler's stage division. The handleJobSubmitted method first creates the ResultStage:

try {
  // Creating a new stage may throw, e.g. when an HDFS file the job depends on has been deleted
  finalStage = newResultStage(finalRDD, func, partitions, jobId, callSite)
} catch {
  case e: Exception =>
    logWarning("Creating new stage failed due to exception - job: " + jobId, e)
    listener.jobFailed(e)
    return
}

It then calls the submitStage method to carry out the stage division.

Starting from finalRDD, the scheduler fetches the parent RDD dependencies and checks their type: for a narrow dependency, the parent RDD is pushed onto a stack for further traversal; for a wide dependency, a parent stage is created.

Let's look at the concrete process in the source:

private def getMissingParentStages(stage: Stage): List[Stage] = {
  val missing = new HashSet[Stage]    // parent stages to be returned
  val visited = new HashSet[RDD[_]]   // RDDs already visited
  // Maintain the stack manually to avoid a StackOverflowError from deep recursion
  val waitingForVisit = new Stack[RDD[_]]
  def visit(rdd: RDD[_]) {
    if (!visited(rdd)) {
      visited += rdd
      val rddHasUncachedPartitions = getCacheLocs(rdd).contains(Nil)
      if (rddHasUncachedPartitions) {
        for (dep <- rdd.dependencies) {
          dep match {
            case shufDep: ShuffleDependency[_, _, _] =>
              val mapStage = getShuffleMapStage(shufDep, stage.firstJobId)
              if (!mapStage.isAvailable) {
                missing += mapStage   // wide dependency: record it as a parent stage
              }
            case narrowDep: NarrowDependency[_] =>
              waitingForVisit.push(narrowDep.rdd)   // narrow dependency: push its RDD onto the stack
          }
        }
      }
    }
  }
  // Push the starting RDD of the backward traversal onto the stack
  waitingForVisit.push(stage.rdd)
  while (waitingForVisit.nonEmpty) {
    visit(waitingForVisit.pop())
  }
  missing.toList
}

getMissingParentStages starts from the current stage and returns its missing parent stages. A parent stage is obtained via getShuffleMapStage, which ultimately calls newOrUsedShuffleStage to return a ShuffleMapStage:

private def newOrUsedShuffleStage(
    shuffleDep: ShuffleDependency[_, _, _],
    firstJobId: Int): ShuffleMapStage = {
  val rdd = shuffleDep.rdd
  val numTasks = rdd.partitions.length
  val stage = newShuffleMapStage(rdd, numTasks, shuffleDep, firstJobId, rdd.creationSite)
  if (mapOutputTracker.containsShuffle(shuffleDep.shuffleId)) {
    // The stage has already been computed; recover its results from the MapOutputTracker
    val serLocs = mapOutputTracker.getSerializedMapOutputStatuses(shuffleDep.shuffleId)
    val locs = MapOutputTracker.deserializeMapStatuses(serLocs)
    (0 until locs.length).foreach { i =>
      if (locs(i) ne null) {
        // locs(i) will be null if missing
        stage.addOutputLoc(i, locs(i))
      }
    }
  } else {
    // Kind of ugly: need to register RDDs with the cache and map output tracker here
    // since we can't do it in the RDD constructor because # of partitions is unknown
    logInfo("Registering RDD " + rdd.id + " (" + rdd.getCreationSite + ")")
    mapOutputTracker.registerShuffle(shuffleDep.shuffleId, rdd.partitions.length)
  }
  stage
}

Now that the parent stages have been divided, let's look at the stage submission logic:

/** Submits stage, but first recursively submits any missing parents. */
private def submitStage(stage: Stage) {
  val jobId = activeJobForStage(stage)
  if (jobId.isDefined) {
    logDebug("submitStage(" + stage + ")")
    if (!waitingStages(stage) && !runningStages(stage) && !failedStages(stage)) {
      val missing = getMissingParentStages(stage).sortBy(_.id)
      logDebug("missing: " + missing)
      if (missing.isEmpty) {
        logInfo("Submitting " + stage + " (" + stage.rdd + "), which has no missing parents")
        // No missing parent stages: submit the current stage
        submitMissingTasks(stage, jobId.get)
      } else {
        // Otherwise recursively submit each missing parent stage first
        for (parent <- missing) {
          submitStage(parent)
        }
        waitingStages += stage
      }
    }
  } else {
    abortStage(stage, "No active job for stage " + stage.id, None)
  }
}

The submission process is straightforward. The current stage first fetches its missing parent stages: if there are none, the current stage is the starting stage and is handed to submitMissingTasks; if parent stages do exist, submitStage is called recursively on each of them and the current stage is added to waitingStages.
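To make the recursion concrete, consider a hypothetical two-shuffle job (an illustration of mine, not the article's source). Its DAG is ShuffleMapStage 0 -> ShuffleMapStage 1 -> ResultStage 2, and the comments trace the order in which submitStage fires:

import org.apache.spark.{SparkConf, SparkContext}

object SubmitOrderExample {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("submit-order").setMaster("local[*]"))
    val result = sc.parallelize(1 to 100)
      .map(i => (i % 10, 1))
      .reduceByKey(_ + _)              // shuffle #1: ShuffleMapStage 0 ends here
      .map { case (k, v) => (v, k) }
      .reduceByKey(_ min _)            // shuffle #2: ShuffleMapStage 1 ends here
      .collect()                       // ResultStage 2
    // submitStage(ResultStage 2): missing = [ShuffleMapStage 1] -> recurse, then wait
    // submitStage(ShuffleMapStage 1): missing = [ShuffleMapStage 0] -> recurse, then wait
    // submitStage(ShuffleMapStage 0): missing = [] -> submitMissingTasks runs it;
    // as each stage's shuffle output completes, the waiting stages are resubmitted,
    // so the stages execute in the order 0 -> 1 -> 2.
    println(result.length)
    sc.stop()
  }
}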

That concludes DAG division and submission in the DAGScheduler. Next time we will look at how these stages are wrapped into TaskSets and handed to the TaskScheduler, and at the TaskScheduler's scheduling process.