DAGScheduler Source Code Walkthrough

To understand the DAGScheduler, you first need to understand the life cycle of an RDD. What is an RDD? Its full name says it all: Resilient Distributed Dataset. An RDD is a data structure that comes with many methods, and those methods fall into two categories: transformations and actions. Of the two, only actions trigger a job. Commonly used actions include collect(), count(), first(), take(), reduce(), foreach(), and saveAsTextFile().

Every action that is implemented directly triggers a job (note: some operators are implemented in terms of other operators; for example, first() is implemented by calling take()). The code that does this is:

sc.runJob(this, func,....)
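
To see where the laziness boundary sits in practice, here is a minimal, self-contained example (the object name and master setting are only for illustration): the transformations merely build up the lineage, and it is the final count() action that reaches sc.runJob and hence the DAGScheduler.

import org.apache.spark.{SparkConf, SparkContext}

object ActionDemo {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("ActionDemo").setMaster("local[2]"))
    val rdd = sc.parallelize(1 to 100)
      .map(_ * 2)         // transformation: nothing runs yet
      .filter(_ % 3 == 0) // transformation: still lazy
    val n = rdd.count()   // action: calls sc.runJob, which triggers a job
    println(s"count = $n")
    sc.stop()
  }
}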

runJob() is a method of SparkContext, and it eventually calls DAGScheduler's runJob method:

dagScheduler.runJob(rdd, cleanedFunc, partitions, callSite, resultHandler, localProperties.get)

runJob() in DAGScheduler then calls submitJob():
submitJob(rdd, func, partitions, callSite, resultHandler, properties)
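
For reference, in the Spark 1.6-era source runJob is little more than a blocking wrapper around submitJob: submitJob returns a JobWaiter, and runJob waits on it. A slightly abridged sketch (not a verbatim copy):

def runJob[T, U](
    rdd: RDD[T],
    func: (TaskContext, Iterator[T]) => U,
    partitions: Seq[Int],
    callSite: CallSite,
    resultHandler: (Int, U) => Unit,
    properties: Properties): Unit = {
  val waiter = submitJob(rdd, func, partitions, callSite, resultHandler, properties)
  waiter.awaitResult() match {
    case JobSucceeded => logInfo("Job %d finished".format(waiter.jobId))
    case JobFailed(exception) => throw exception
  }
}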

submitJob() in turn posts the job to the eventProcessLoop thread:
eventProcessLoop.post(JobSubmitted(jobId, rdd, func2, partitions.toArray, callSite, waiter, SerializationUtils.clone(properties)))
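
JobSubmitted itself is a case class of type DAGSchedulerEvent that carries everything needed to build the job; roughly (abridged from the Spark 1.6-era source):

private[scheduler] case class JobSubmitted(
    jobId: Int,
    finalRDD: RDD[_],
    func: (TaskContext, Iterator[_]) => _,
    partitions: Array[Int],
    callSite: CallSite,
    listener: JobListener,
    properties: Properties = null)
  extends DAGSchedulerEvent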
         
The DAGSchedulerEventProcessLoop class has a listener method, onReceive, which dispatches every incoming event to doOnReceive; the relevant code is:

def doOnReceive(event: DAGSchedulerEvent): Unit = event match {
  case JobSubmitted(...) =>
    dagScheduler.handleJobSubmitted(...)
  ...... // other, unrelated cases omitted
}
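
DAGSchedulerEventProcessLoop extends Spark's internal EventLoop: post() enqueues an event onto a blocking queue, and a single background thread takes events off that queue and hands each one to onReceive. Below is a minimal, illustrative sketch of that pattern (SimpleEventLoop is not a Spark class; it only mirrors the post/consume behaviour):

import java.util.concurrent.LinkedBlockingDeque

// Illustrative sketch of the event-loop pattern behind DAGSchedulerEventProcessLoop:
// post() enqueues, one daemon thread dequeues and dispatches to onReceive.
abstract class SimpleEventLoop[E](name: String) {
  private val eventQueue = new LinkedBlockingDeque[E]()
  private val eventThread = new Thread(name) {
    setDaemon(true)
    override def run(): Unit = {
      while (true) {
        onReceive(eventQueue.take()) // blocks until the next event is posted
      }
    }
  }
  def start(): Unit = eventThread.start()
  def post(event: E): Unit = eventQueue.put(event)
  protected def onReceive(event: E): Unit
}
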
handleJobSubmitted() first creates the finalStage and ultimately calls submitStage to submit it; its main code is:

def handleJobSubmitted(...) {
  // create the finalStage
  finalStage = newResultStage(finalRDD, func, partitions, jobId, callSite)
  // create the job
  val job = new ActiveJob(jobId, finalStage, callSite, listener, properties)
  // attach the newly created job to the finalStage
  finalStage.setActiveJob(job)
  // finally, submit the finalStage
  submitStage(finalStage)
  ...... // other code omitted
}
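
newResultStage builds the ResultStage for finalRDD and, while doing so, walks the RDD lineage and creates a parent ShuffleMapStage for every shuffle dependency it meets, so the number of stages is determined by the shuffle boundaries in the lineage. For example, in a simple word count (the input path is only illustrative), reduceByKey is the only shuffle, so the job ends up with exactly two stages:

// Hypothetical example: where the stage boundary falls in a word count job.
val counts = sc.textFile("input.txt")  // narrow dependencies from here ...
  .flatMap(_.split(" "))               // ... through here: all in one ShuffleMapStage
  .map(word => (word, 1))
  .reduceByKey(_ + _)                  // shuffle dependency: stage boundary
counts.collect()                       // action: the ResultStage that becomes finalStage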

Next, let's look at what submitStage does:
                 
private def submitStage(stage: Stage) {
  val jobId = activeJobForStage(stage)
  if (jobId.isDefined) {
    logDebug("submitStage(" + stage + ")")
    if (!waitingStages(stage) && !runningStages(stage) && !failedStages(stage)) {
      val missing = getMissingParentStages(stage).sortBy(_.id) // check whether this stage still has missing parent stages
      logDebug("missing: " + missing)
      if (missing.isEmpty) { // every parent stage has already been submitted, so submit this stage's tasks
        logInfo("Submitting " + stage + " (" + stage.rdd + "), which has no missing parents")
        submitMissingTasks(stage, jobId.get)
      } else {
        for (parent <- missing) {
          submitStage(parent) // recurse so that missing parent stages are submitted first
        }
        waitingStages += stage
      }
    }
  } else {
    abortStage(stage, "No active job for stage " + stage.id, None)
  }
}
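
Put differently, stages are submitted parents-first: the recursion climbs to the earliest missing ancestors, submits those, and parks the current stage in waitingStages until its parents have produced their shuffle output. The traversal order can be seen in a toy sketch (ToyStage and submit below are made up for illustration and are not Spark code):

// Toy sketch (not Spark code): parents are always handled before their child stage.
case class ToyStage(id: Int, parents: List[ToyStage])

def submit(stage: ToyStage): Unit = {
  stage.parents.foreach(submit)          // recurse into parents first
  println(s"submitting stage ${stage.id}")
}

// A stage with two parents: prints stage 0, then 1, then 2.
submit(ToyStage(2, List(ToyStage(0, Nil), ToyStage(1, Nil))))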

Note that inside getMissingParentStages(), an RDD's dependencies are handled differently depending on their type:
private def getMissingParentStages(stage: Stage): List[Stage] = {
  val missing = new HashSet[Stage]
  val visited = new HashSet[RDD[_]]

  val waitingForVisit = new Stack[RDD[_]]
  def visit(rdd: RDD[_]) {
    if (!visited(rdd)) {
      visited += rdd
      val rddHasUncachedPartitions = getCacheLocs(rdd).contains(Nil)
      if (rddHasUncachedPartitions) {
        for (dep <- rdd.dependencies) {
          dep match {
            case shufDep: ShuffleDependency[_, _, _] =>
              val mapStage = getShuffleMapStage(shufDep, stage.firstJobId) // a shuffle dependency yields a parent ShuffleMapStage for this stage
              if (!mapStage.isAvailable) {
                missing += mapStage
              }
            case narrowDep: NarrowDependency[_] =>
              waitingForVisit.push(narrowDep.rdd) // a narrow dependency creates no new stage; keep walking up the lineage
          }
        }
      }
    }
  }
  waitingForVisit.push(stage.rdd)
  while (waitingForVisit.nonEmpty) {
    visit(waitingForVisit.pop())
  }
  missing.toList
}
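
Which branch of that match a dependency falls into is decided by the operator that produced the RDD. This can be observed from user code by inspecting rdd.dependencies (a hypothetical snippet, assuming an existing SparkContext sc):

val pairs = sc.parallelize(Seq("a", "b", "a")).map((_, 1))
// map produces a OneToOneDependency (a NarrowDependency): the narrowDep branch, no new stage
println(pairs.dependencies.map(_.getClass.getSimpleName))
val reduced = pairs.reduceByKey(_ + _)
// reduceByKey produces a ShuffleDependency: the shufDep branch, hence a parent ShuffleMapStage
println(reduced.dependencies.map(_.getClass.getSimpleName))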

Next, let's see what submitMissingTasks does. Its main code is shown below (most of the body is omitted; only the key parts are listed):

private def submitMissingTasks(stage: Stage, jobId: Int) {

  runningStages += stage

  stage match {
    case s: ShuffleMapStage =>
      outputCommitCoordinator.stageStart(stage = s.id, maxPartitionId = s.numPartitions - 1)
    case s: ResultStage =>
      outputCommitCoordinator.stageStart(
        stage = s.id, maxPartitionId = s.rdd.partitions.length - 1)
  }
  val taskIdToLocations: Map[Int, Seq[TaskLocation]] = try {
    stage match {
      case s: ShuffleMapStage =>
        partitionsToCompute.map { id => (id, getPreferredLocs(stage.rdd, id))}.toMap
      case s: ResultStage =>
        val job = s.activeJob.get
        partitionsToCompute.map { id =>
          val p = s.partitions(id)
          (id, getPreferredLocs(stage.rdd, p))
        }.toMap
    }
  } catch {
    ......
  }
  // TODO: Maybe we can keep the taskBinary in Stage to avoid serializing it multiple times.
  // Broadcasted binary for the task, used to dispatch tasks to executors. Note that we broadcast
  // the serialized copy of the RDD and for each task we will deserialize it, which means each
  // task gets a different copy of the RDD. This provides stronger isolation between tasks that
  // might modify state of objects referenced in their closures. This is necessary in Hadoop
  // where the JobConf/Configuration object is not thread-safe.
  var taskBinary: Broadcast[Array[Byte]] = null
  try {
    // For ShuffleMapTask, serialize and broadcast (rdd, shuffleDep).
    // For ResultTask, serialize and broadcast (rdd, func).
    val taskBinaryBytes: Array[Byte] = stage match {
      case stage: ShuffleMapStage =>
        closureSerializer.serialize((stage.rdd, stage.shuffleDep): AnyRef).array()
      case stage: ResultStage =>
        closureSerializer.serialize((stage.rdd, stage.func): AnyRef).array()
    }

    taskBinary = sc.broadcast(taskBinaryBytes)
  } catch {
   ......
  }

  val tasks: Seq[Task[_]] = try {
    stage match {
      case stage: ShuffleMapStage =>
        partitionsToCompute.map { id =>
          val locs = taskIdToLocations(id)
          val part = stage.rdd.partitions(id)
          new ShuffleMapTask(stage.id, stage.latestInfo.attemptId,
            taskBinary, part, locs, stage.internalAccumulators)
        }

      case stage: ResultStage =>
        val job = stage.activeJob.get
        partitionsToCompute.map { id =>
          val p: Int = stage.partitions(id)
          val part = stage.rdd.partitions(p)
          val locs = taskIdToLocations(id)
          new ResultTask(stage.id, stage.latestInfo.attemptId,
            taskBinary, part, locs, id, stage.internalAccumulators)
        }
    }
  } catch {
    ......
  }

  if (tasks.size > 0) {
    stage.pendingPartitions ++= tasks.map(_.partitionId)
    taskScheduler.submitTasks(new TaskSet(
      tasks.toArray, stage.id, stage.latestInfo.attemptId, jobId, properties))
    stage.latestInfo.submissionTime = Some(clock.getTimeMillis())
  } else {
    // Because we posted SparkListenerStageSubmitted earlier, we should mark
    // the stage as completed here in case there are no tasks to run
    markStageAsFinished(stage, None)
  }
}
  1. First, add the stage being run to runningStages.
  2. Start the stage according to its type via outputCommitCoordinator.stageStart.
  3. Compute the preferred locations (getPreferredLocs) of the partitions each task will process.
  4. Serialize the task binary and broadcast it to every node; each task later deserializes its own copy of the RDD, which keeps tasks isolated from one another.
  5. Create tasks that match the stage type, attaching the locations from step 3: a ShuffleMapStage gets ShuffleMapTasks, a ResultStage gets ResultTasks.
  6. Finally, call TaskSchedulerImpl's submitTasks method (sketched below); at this point the job has been processed by the DAGScheduler and is handed over to the TaskScheduler.
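
The hand-off itself looks roughly like this (a simplified sketch based on Spark 1.6-era code, not a verbatim copy): DAGScheduler packs the tasks into a TaskSet, and TaskSchedulerImpl wraps that TaskSet in a TaskSetManager, registers it with the scheduling pool, and asks the cluster backend for resources.

// DAGScheduler side, as shown in submitMissingTasks above:
taskScheduler.submitTasks(
  new TaskSet(tasks.toArray, stage.id, stage.latestInfo.attemptId, jobId, properties))

// TaskSchedulerImpl side (simplified sketch):
override def submitTasks(taskSet: TaskSet) {
  val manager = createTaskSetManager(taskSet, maxTaskFailures) // per-TaskSet scheduling state
  schedulableBuilder.addTaskSetManager(manager, manager.taskSet.properties)
  backend.reviveOffers() // ask the cluster backend to offer resources so the tasks can launch
}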



