Spark Source Code Analysis (4): Scheduling Management, Part 2

DAGScheduler

SparkContext has two ways of submitting a job:

1. The runJob method, which was covered in the previous chapter

2. The submitJob method

Both hand the job over to the DAGScheduler; they correspond to the two entry points that DAGScheduler exposes.

The difference is that DAGScheduler.runJob internally calls DAGScheduler.submitJob to obtain a JobWaiter object and then blocks on it until the job completes or fails, whereas calling DAGScheduler.submitJob directly is suited to asynchronous use: the returned handle can be used to check whether the job has finished or to cancel it.
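
From the user's point of view the same split shows up as blocking versus asynchronous actions. Below is a minimal sketch, not taken from the article's code: countAsync comes from AsyncRDDActions (pulled in through the SparkContext._ implicits in this era of Spark), and the FutureAction it returns is the handle that can be checked for completion or cancelled.

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.SparkContext._            // async/pair-RDD implicits (older Spark)
import scala.concurrent.ExecutionContext.Implicits.global
import scala.util.{Failure, Success}

val sc = new SparkContext(new SparkConf().setAppName("async-demo").setMaster("local"))
val rdd = sc.parallelize(1 to 1000)

val n = rdd.count()                  // blocking: goes through runJob and waits for the result

val future = rdd.countAsync()        // non-blocking: a FutureAction backed by submitJob
future.onComplete {
  case Success(c) => println(s"counted $c elements")
  case Failure(e) => println(s"job failed: $e")
}
// future.cancel()                   // the same handle can cancel the running job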

The call chain from here is as follows:

eventProcessActor ! JobSubmitted
DAGScheduler.handleJobSubmitted
handleJobSubmitted.newStage
newStage creates a Stage object:

new Stage(id, rdd, numTasks, shuffleDep, getParentStages(rdd, jobId), jobId, callSite)
private[spark] class Stage(
    val id: Int, 
    val rdd: RDD[_],
    val numTasks: Int,
    val shuffleDep: Option[ShuffleDependency[_,_]],  // Output shuffle if stage is a map stage
    val parents: List[Stage],
    val jobId: Int,
    callSite: Option[String])

As you can see, a Stage contains an RDD, and this RDD is the last RDD of that stage.

So what are its parent stages? The following code generates them:

  private def getParentStages(rdd: RDD[_], jobId: Int): List[Stage] = {
    val parents = new HashSet[Stage]
    val visited = new HashSet[RDD[_]]
    def visit(r: RDD[_]) { // walk the RDD dependency chain
      if (!visited(r)) {
        visited += r
        // Kind of ugly: need to register RDDs with the cache here since
        // we can't do it in its constructor because # of partitions is unknown
        for (dep <- r.dependencies) {
          dep match {
            case shufDep: ShuffleDependency[_,_] => // a ShuffleDependency marks a stage boundary
              parents += getShuffleMapStage(shufDep, jobId) // add the shuffle map stage to this stage's parents
            case _ =>
              visit(dep.rdd) // not a ShuffleDependency: keep walking the dependency chain
          }
        }
      }
    }
    visit(rdd)
    parents.toList
  }
Each Stage therefore carries a list of parent stages (List[Stage]). The process above is illustrated in the figure below.

[Figure: the RDD dependency chain of a job, split into stages at each ShuffleDependency]

Note that the MapPartitionsRDD, ShuffledRDD, and MapPartitionsRDD in the figure are produced by the reduceByKey transformation.
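
A quick way to see this chain, and the shuffle boundary where the stage split happens, is RDD.toDebugString, which prints the lineage. A minimal sketch (the input path is made up, and the exact RDD class names printed vary a little between Spark versions):

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.SparkContext._   // pair-RDD implicits, needed for reduceByKey in older Spark

val sc = new SparkContext(new SparkConf().setAppName("lineage-demo").setMaster("local"))

val counts = sc.textFile("input.txt")    // hypothetical input file
  .flatMap(_.split(" "))
  .map(word => (word, 1))
  .reduceByKey(_ + _)                    // introduces the ShuffleDependency, i.e. the stage boundary

println(counts.toDebugString)            // the indentation in the output marks the stage split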

After the finalStage is created, the stage has to be submitted:

  private def submitStage(stage: Stage) {
    val jobId = activeJobForStage(stage)
    if (jobId.isDefined) {
      logDebug("submitStage(" + stage + ")")
      if (!waitingStages(stage) && !runningStages(stage) && !failedStages(stage)) {
        val missing = getMissingParentStages(stage).sortBy(_.id) // check whether any parent stages are still missing
        logDebug("missing: " + missing)
        if (missing == Nil) {
          logInfo("Submitting " + stage + " (" + stage.rdd + "), which has no missing parents")
          submitMissingTasks(stage, jobId.get) // no missing parents: submit this stage's tasks directly
          runningStages += stage
        } else {
          for (parent <- missing) {
            submitStage(parent) // there are missing parent stages: recurse until the earliest ancestor is reached
          }
          waitingStages += stage
        }
      }
    } else {
      abortStage(stage, "No active job for stage " + stage.id)
    }
  }
Submitting a stage is a recursive process: the parent stages are submitted first via submitStage while the current stage is added to waitingStages; once a stage has no unfinished parents, its tasks are submitted.
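
The recursion is easier to see on a toy model. The sketch below uses made-up types (SimpleStage is not a Spark class) purely to show the parent-first order; the real submitStage also maintains waitingStages/runningStages and is re-driven by task-completion events instead of recursing again:

import scala.collection.mutable

case class SimpleStage(id: Int, parents: List[SimpleStage])   // hypothetical stand-in for Stage

val submitted = mutable.Set[Int]()

def submit(stage: SimpleStage): Unit = {
  val missing = stage.parents.filterNot(p => submitted(p.id))
  if (missing.isEmpty) {
    println(s"submitting tasks for stage ${stage.id}")        // corresponds to submitMissingTasks
    submitted += stage.id
  } else {
    missing.foreach(submit)                                   // submit the parents first
    submit(stage)                                             // toy simplification of waitingStages
  }
}

val stage0 = SimpleStage(0, Nil)
val stage1 = SimpleStage(1, Nil)
val stage2 = SimpleStage(2, List(stage0, stage1))
submit(stage2)   // prints stage 0, stage 1, then stage 2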

Next, look at the submitMissingTasks method:

  private def submitMissingTasks(stage: Stage, jobId: Int) {
    logDebug("submitMissingTasks(" + stage + ")")
    // Get our pending tasks and remember them in our pendingTasks entry
    val myPending = pendingTasks.getOrElseUpdate(stage, new HashSet)
    ......
    if (stage.isShuffleMap) { // every stage except the final stage produces ShuffleMapTasks
      for (p <- 0 until stage.numPartitions if stage.outputLocs(p) == Nil) {
        val locs = getPreferredLocs(stage.rdd, p)
        tasks += new ShuffleMapTask(stage.id, stage.rdd, stage.shuffleDep.get, p, locs)
      }
    } else {
      // This is a final stage; figure out its job's missing partitions
      val job = resultStageToJob(stage)
      for (id <- 0 until job.numPartitions if !job.finished(id)) {
        val partition = job.partitions(id)
        val locs = getPreferredLocs(stage.rdd, partition)
        tasks += new ResultTask(stage.id, stage.rdd, job.func, partition, locs, id)
      }
    }
    ......
      taskScheduler.submitTasks(
        new TaskSet(tasks.toArray, stage.id, stage.newAttemptId(), stage.jobId, properties))
      stageToInfos(stage).submissionTime = Some(System.currentTimeMillis())
    } else { // the opening if of this branch sits in the elided code above
      logDebug("Stage " + stage + " is actually done; %b %d %d".format(
        stage.isAvailable, stage.numAvailableOutputs, stage.numPartitions))
      runningStages -= stage
    }
  }
There are two kinds of Task: ShuffleMapTask and ResultTask. We need to pay attention to the runTask method of each.

Finally the tasks are submitted to TaskSchedulerImpl; note that they are wrapped in a TaskSet before being submitted.

TaskScheduler

  override def submitTasks(taskSet: TaskSet) {
    val tasks = taskSet.tasks
    logInfo("Adding task set " + taskSet.id + " with " + tasks.length + " tasks")
    this.synchronized {
      val manager = new TaskSetManager(this, taskSet, maxTaskFailures)
      activeTaskSets(taskSet.id) = manager
      schedulableBuilder.addTaskSetManager(manager, manager.taskSet.properties)

      if (!isLocal && !hasReceivedTask) {
        starvationTimer.scheduleAtFixedRate(new TimerTask() {
          override def run() {
            if (!hasLaunchedTask) {
              logWarning("Initial job has not accepted any resources; " +
                "check your cluster UI to ensure that workers are registered " +
                "and have sufficient memory")
            } else {
              this.cancel()
            }
          }
        }, STARVATION_TIMEOUT, STARVATION_TIMEOUT)
      }
      hasReceivedTask = true
    }
    backend.reviveOffers()
  }
The TaskSet is wrapped in a TaskSetManager --> SchedulableBuilder.addTaskSetManager --> rootPool.addSchedulable adds the TaskSetManager to the pool. TaskSetManager and Pool both extend the same trait, Schedulable, yet their core interfaces are completely different (personally I think this is poor design). A Pool can therefore be thought of as a container of TaskSetManagers, and it can also hold other Pools.
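
A minimal model of that composite structure, using made-up types rather than the real Schedulable trait (which has many more members): both the pool and the task-set manager implement the same trait, and a pool can contain either.

import scala.collection.mutable.ArrayBuffer

trait SimpleSchedulable {                       // hypothetical, stripped-down Schedulable
  def name: String
  def sortedTaskSets: Seq[String]               // stands in for the real queue of TaskSetManagers
}

class SimpleTaskSetManager(val name: String) extends SimpleSchedulable {
  def sortedTaskSets: Seq[String] = Seq(name)   // a leaf: just itself
}

class SimplePool(val name: String) extends SimpleSchedulable {
  private val children = ArrayBuffer[SimpleSchedulable]()
  def addSchedulable(s: SimpleSchedulable): Unit = children += s        // mirrors Pool.addSchedulable
  def sortedTaskSets: Seq[String] = children.flatMap(_.sortedTaskSets)  // flattens nested pools
}

val rootPool  = new SimplePool("root")
val batchPool = new SimplePool("batch")
rootPool.addSchedulable(batchPool)
rootPool.addSchedulable(new SimpleTaskSetManager("taskSet_0"))
batchPool.addSchedulable(new SimpleTaskSetManager("taskSet_1"))
println(rootPool.sortedTaskSets)   // the flattened queue: taskSet_1, taskSet_0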

A TaskSetManager schedules tasks within a single task set and tracks the execution of each Task, while the TaskScheduler offers resources to the TaskSetManagers as the basis for their scheduling decisions. A Spark application may have several runnable task sets at the same time (with no dependencies between them). How these task sets are scheduled relative to one another is determined by the SchedulableBuilder, which selects the scheduling mode (currently FIFO or FAIR), while the scheduling Pool decides which task set is scheduled next.
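
Which of the two modes is used is purely a configuration choice. A minimal sketch using the standard properties (the pool name "batch" and the allocation-file path are only examples):

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("scheduling-demo")
  .setMaster("local")
  .set("spark.scheduler.mode", "FAIR")   // default is FIFO
  // pools, weights and minShare come from an XML allocation file:
  // .set("spark.scheduler.allocation.file", "/path/to/fairscheduler.xml")
val sc = new SparkContext(conf)

// Jobs submitted from this thread go into the "batch" pool; other threads can
// target different pools and the FAIR scheduler arbitrates between them.
sc.setLocalProperty("spark.scheduler.pool", "batch")
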
The call chain continues as follows:
CoarseGrainedSchedulerBackend.reviveOffers
DriverActor ! ReviveOffers
makeOffers
    def makeOffers() {
      launchTasks(scheduler.resourceOffers(
        executorHost.toArray.map {case (id, host) => new WorkerOffer(id, host, freeCores(id))}))
    }
The SchedulerBackend offers resources to the TaskScheduler. First look at TaskScheduler.resourceOffers, whose result is passed to launchTasks:
  def resourceOffers(offers: Seq[WorkerOffer]): Seq[Seq[TaskDescription]] = synchronized {
    ......
    val sortedTaskSets = rootPool.getSortedTaskSetQueue // the TaskSetManagers, sorted according to the scheduling mode
    ......
    // Take each TaskSet in our scheduling order, and then offer it each node in increasing order
    // of locality levels so that it gets a chance to launch local tasks on all of them.
    var launchedTask = false
    for (taskSet <- sortedTaskSets; maxLocality <- TaskLocality.values) {
      do {
        launchedTask = false
        for (i <- 0 until shuffledOffers.size) {
          val execId = shuffledOffers(i).executorId
          val host = shuffledOffers(i).host
          if (availableCpus(i) >= CPUS_PER_TASK) {
            for (task <- taskSet.resourceOffer(execId, host, maxLocality)) { // hand the task with the best data locality to this worker
            ......
            }
          }
        }
      } while (launchedTask)
    }
    return tasks
  }
First, look at the structure of the TaskSetManager class:
private[spark] class TaskSetManager(
    sched: TaskSchedulerImpl,
    val taskSet: TaskSet,
    val maxTaskFailures: Int,
    clock: Clock = SystemClock)
  extends Schedulable with Logging
{
  ......
  private val pendingTasksForExecutor = new HashMap[String, ArrayBuffer[Int]]

  private val pendingTasksForHost = new HashMap[String, ArrayBuffer[Int]]

  private val pendingTasksForRack = new HashMap[String, ArrayBuffer[Int]]
  ......
  for (i <- (0 until numTasks).reverse) { // runs at construction time: files each task into the pending collections above according to its locality
    addPendingTask(i)
  }

  // Figure out which locality levels we have in our TaskSet, so we can do delay scheduling
  val myLocalityLevels = computeValidLocalityLevels() // the locality levels present in the collections above, ordered process -> node -> rack
  val localityWaits = myLocalityLevels.map(getLocalityWait) // time to wait at each level, read from configuration

  // Delay scheduling variables: we keep track of our current locality level and the time we
  // last launched a task at that level, and move up a level when localityWaits[curLevel] expires.
  // We then move down if we manage to launch a "more local" task.
  var currentLocalityIndex = 0    // current index into myLocalityLevels, starting at 0
  var lastLaunchTime = clock.getTime()  // time the last task was launched; when (current time - lastLaunchTime) exceeds the wait threshold, currentLocalityIndex is incremented
  ......
}
Next, look at TaskSetManager.resourceOffer:
  override def resourceOffer(
      execId: String,
      host: String,
      availableCpus: Int,
      maxLocality: TaskLocality.TaskLocality)
    : Option[TaskDescription] =
  {
    if (tasksFinished < numTasks && availableCpus >= CPUS_PER_TASK) { // only if not all tasks have finished and enough cores are available
      val curTime = clock.getTime()

      var allowedLocality = getAllowedLocalityLevel(curTime) // bumps currentLocalityIndex on timeouts and returns the currently allowed locality level

      if (allowedLocality > maxLocality) { // never exceed the maxLocality limit passed in by the caller
        allowedLocality = maxLocality   // We're not allowed to search for farther-away tasks
      }

      findTask(execId, host, allowedLocality) match { // pick a suitable task based on allowedLocality, execId and host
        case Some((index, taskLocality)) => {
          ......
          // Update our locality level for delay scheduling
          currentLocalityIndex = getLocalityIndex(taskLocality) // update currentLocalityIndex with the chosen task's locality; it may decrease here, because the caller-imposed maxLocality may have lowered allowedLocality
          lastLaunchTime = curTime      // record the launch time
          // Serialize and return the task
          ......
          return Some(new TaskDescription(taskId, execId, taskName, index, serializedTask)) // return the scheduled task
        }
        case _ =>
      }
    }
    return None
  }
To summarize: TaskSetManager.resourceOffer adjusts its own locality policy based on when the last task was successfully launched.
This mainly comes down to maintaining currentLocalityIndex: if a long time has passed since the last successful launch, the locality requirement is relaxed; otherwise it stays strict.
This dynamic adjustment of the locality policy (delay scheduling) essentially exists to increase the chance that tasks run at their best locality level.
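
The wait thresholds that drive this adjustment are configurable. A minimal sketch with the standard locality-wait properties (the values shown are the documented defaults; in this era of Spark they are plain milliseconds):

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.locality.wait", "3000")          // base wait before dropping one locality level
  .set("spark.locality.wait.process", "3000")  // wait at PROCESS_LOCAL before trying NODE_LOCAL
  .set("spark.locality.wait.node", "3000")     // wait at NODE_LOCAL before trying RACK_LOCAL
  .set("spark.locality.wait.rack", "3000")     // wait at RACK_LOCAL before accepting ANY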

Back in CoarseGrainedSchedulerBackend.launchTasks: once the tasks have been obtained, they are serialized and sent to CoarseGrainedExecutorBackend, which launches them. The rest of the story is left for the next chapter.
