Spark Internals: Executor Allocation in Detail

After a user application creates a new SparkContext, the cluster allocates executors for it on the Workers. How exactly does this happen? This article walks through the process in detail, using Standalone cluster mode as the example. The steps are as follows.

1. SparkContext creates the TaskScheduler and the DAGScheduler

SparkContext is the main interface between a user application and the Spark cluster, and it is usually the first thing an application creates. If you use the Spark shell, you do not need to create it yourself; the shell automatically creates an instance named sc. Besides applying configuration settings, creating a SparkContext resolves values such as the memory each executor gets: if spark.executor.memory is set in the configuration, that value is used; otherwise the SPARK_EXECUTOR_MEMORY (or the legacy SPARK_MEM) environment variable is read; if neither is set, the default of 512 MB applies. That default is quite conservative, especially given how cheap memory is today.

private[spark] val executorMemory = conf.getOption("spark.executor.memory")
    .orElse(Option(System.getenv("SPARK_EXECUTOR_MEMORY")))
    .orElse(Option(System.getenv("SPARK_MEM")).map(warnSparkMem))
    .map(Utils.memoryStringToMb)
    .getOrElse(512)
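
If the default is too small, it can be overridden explicitly when the SparkConf is built. A minimal sketch, assuming a Standalone master at spark://master-host:7077 and an illustrative value of 2g (both are made up for this example):

  import org.apache.spark.{SparkConf, SparkContext}

  // A minimal sketch: override the conservative 512 MB default explicitly.
  // "2g" and the master URL are illustrative values, not recommendations.
  val conf = new SparkConf()
    .setAppName("executor-memory-demo")      // hypothetical application name
    .setMaster("spark://master-host:7077")   // hypothetical Standalone master URL
    .set("spark.executor.memory", "2g")      // takes precedence over SPARK_EXECUTOR_MEMORY / SPARK_MEM
  val sc = new SparkContext(conf)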

Besides loading these cluster parameters, the SparkContext constructor also creates the TaskScheduler and the DAGScheduler:

  // Create and start the scheduler
  private[spark] var taskScheduler = SparkContext.createTaskScheduler(this, master)
  private val heartbeatReceiver = env.actorSystem.actorOf(
    Props(new HeartbeatReceiver(taskScheduler)), "HeartbeatReceiver")
  @volatile private[spark] var dagScheduler: DAGScheduler = _
  try {
    dagScheduler = new DAGScheduler(this)
  } catch {
    case e: Exception => throw
      new SparkException("DAGScheduler cannot be initialized due to %s".format(e.getMessage))
  }

  // start TaskScheduler after taskScheduler sets DAGScheduler reference in DAGScheduler's
  // constructor
  taskScheduler.start()

The TaskScheduler schedules and manages tasks through a pluggable SchedulerBackend. It covers both resource allocation and task scheduling: it implements FIFO and FAIR scheduling to decide the order between different jobs, and it manages the tasks themselves, including submitting and killing them and launching speculative backup tasks for stragglers.
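
The ordering between jobs can be switched from the default FIFO to FAIR through configuration. A minimal sketch, assuming a pool named "production" has been defined in a fair scheduler allocation file (the pool name, application name and master are made up for this example):

  import org.apache.spark.{SparkConf, SparkContext}

  // A minimal sketch: switch inter-job scheduling from FIFO (the default) to FAIR.
  val conf = new SparkConf()
    .setAppName("scheduling-mode-demo")   // hypothetical application name
    .setMaster("local[4]")                // any master works for this illustration
    .set("spark.scheduler.mode", "FAIR")  // FIFO (default) or FAIR
  val sc = new SparkContext(conf)

  // Jobs submitted from this thread go to the "production" pool
  // ("production" is a hypothetical pool defined in fairscheduler.xml).
  sc.setLocalProperty("spark.scheduler.pool", "production")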

Each cluster mode, including local mode, provides its specific behavior through a different implementation of SchedulerBackend.



2. The TaskScheduler creates an AppClient through the SchedulerBackend

SparkDeploySchedulerBackend is the SchedulerBackend for Standalone mode. It creates an AppClient, which registers the Application with the Standalone Master; based on the information in the registration, the Master then assigns Workers to the Application, including the number of CPU cores to use on each Worker.

private[spark] class SparkDeploySchedulerBackend(
    scheduler: TaskSchedulerImpl,
    sc: SparkContext,
    masters: Array[String])
  extends CoarseGrainedSchedulerBackend(scheduler, sc.env.actorSystem)
  with AppClientListener
  with Logging {

  var client: AppClient = null  // Note: the Application's interface to the Master

  val maxCores = conf.getOption("spark.cores.max").map(_.toInt) // Note: the maximum total number of CPU cores the application may use across the cluster

  override def start() {
    super.start()

    // The endpoint for executors to talk to us
    val driverUrl = "akka.tcp://%s@%s:%s/user/%s".format(
      SparkEnv.driverActorSystemName,
      conf.get("spark.driver.host"),
      conf.get("spark.driver.port"),
      CoarseGrainedSchedulerBackend.ACTOR_NAME)
    // Note: no executor has been requested yet, so everything about the executors is still unknown.
    // These placeholders are substituted with real values when
    // org.apache.spark.deploy.worker.ExecutorRunner launches the ExecutorBackend.
    val args = Seq(driverUrl, "{{EXECUTOR_ID}}", "{{HOSTNAME}}", "{{CORES}}", "{{WORKER_URL}}")
    // Note: set up the environment the executors will run with
    val extraJavaOpts = sc.conf.getOption("spark.executor.extraJavaOptions")
      .map(Utils.splitCommandString).getOrElse(Seq.empty)
    val classPathEntries = sc.conf.getOption("spark.executor.extraClassPath").toSeq.flatMap { cp =>
      cp.split(java.io.File.pathSeparator)
    }
    val libraryPathEntries =
      sc.conf.getOption("spark.executor.extraLibraryPath").toSeq.flatMap { cp =>
        cp.split(java.io.File.pathSeparator)
      }

    // Start executors with a few necessary configs for registering with the scheduler
    val sparkJavaOpts = Utils.sparkJavaOpts(conf, SparkConf.isExecutorStartupConf)
    val javaOpts = sparkJavaOpts ++ extraJavaOpts
    // Note: on the Worker, org.apache.spark.deploy.worker.ExecutorRunner will launch
    // org.apache.spark.executor.CoarseGrainedExecutorBackend; here we prepare the command it needs.
    val command = Command("org.apache.spark.executor.CoarseGrainedExecutorBackend",
      args, sc.executorEnvs, classPathEntries, libraryPathEntries, javaOpts)
    // Note: org.apache.spark.deploy.ApplicationDescription carries all the information needed to register this Application.
    val appDesc = new ApplicationDescription(sc.appName, maxCores, sc.executorMemory, command,
      sc.ui.appUIAddress, sc.eventLogger.map(_.logDir))

    client = new AppClient(sc.env.actorSystem, masters, appDesc, this, conf)
    client.start()
    // Note: once the Master replies that registration succeeded, AppClient calls back this class's
    // connected(), which completes the Application's registration.
    waitForRegistration()
  }

org.apache.spark.deploy.client.AppClientListener is a trait that defines the callbacks between the SchedulerBackend and the AppClient. The AppClient invokes them to notify the SchedulerBackend in the following five situations:
  1. The Application was successfully registered with the Master, i.e. the connection to the cluster was established;
  2. The connection was lost; if SparkDeploySchedulerBackend::stop is still false at that point, the previous Master has probably failed, and the connection will be re-established once a new Master is ready;
  3. The Application was stopped by an unrecoverable error, in which case the failed TaskSets need to be resubmitted;
  4. An Executor was added; the current implementation only logs this, with no extra logic;
  5. An Executor was removed, for one of two reasons: either the Executor itself exited (its exit code is available here), or the Worker it ran on exited and took the Executor down with it. The two cases need to be handled differently.
private[spark] trait AppClientListener {
  def connected(appId: String): Unit

  /** Disconnection may be a temporary state, as we fail over to a new Master. */
  def disconnected(): Unit

  /** An application death is an unrecoverable failure condition. */
  def dead(reason: String): Unit

  def executorAdded(fullId: String, workerId: String, hostPort: String, cores: Int, memory: Int)

  def executorRemoved(fullId: String, message: String, exitStatus: Option[Int]): Unit
}
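
For illustration only, here is a minimal sketch of what an implementation of these callbacks could look like. The real listener is SparkDeploySchedulerBackend, which does considerably more (for example, telling the TaskScheduler about lost executors); also note that AppClientListener is private[spark], so any implementation has to live inside the org.apache.spark package hierarchy. The class name LoggingAppClientListener is made up:

  package org.apache.spark.deploy.client

  // Illustration only: a listener that merely prints each callback it receives.
  private[spark] class LoggingAppClientListener extends AppClientListener {
    def connected(appId: String): Unit =
      println(s"Registered with master, application id = $appId")

    def disconnected(): Unit =
      println("Disconnected from master, waiting for a new master to come up")

    def dead(reason: String): Unit =
      println(s"Application died: $reason")

    def executorAdded(fullId: String, workerId: String, hostPort: String,
        cores: Int, memory: Int): Unit =
      println(s"Executor $fullId added on worker $workerId ($hostPort): $cores cores, $memory MB")

    def executorRemoved(fullId: String, message: String, exitStatus: Option[Int]): Unit =
      println(s"Executor $fullId removed: $message (exit status $exitStatus)")
  }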


To summarize: SparkDeploySchedulerBackend assembles the parameters needed to launch Executors, creates the AppClient, and learns about Executors and connection state through the callbacks above; communication with each ExecutorBackend goes through org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend.DriverActor.

3. The AppClient submits the Application to the Master


AppClient is the Application's interface to the Master. It holds a member named actor of type org.apache.spark.deploy.client.AppClient.ClientActor, which handles all interaction with the Master. The actor first registers the Application with the Master. If no success message arrives within 20 seconds, it registers again; if it still has not succeeded after 3 retries, the submission fails.

    def tryRegisterAllMasters() {
      for (masterUrl <- masterUrls) {
        logInfo("Connecting to master " + masterUrl + "...")
        val actor = context.actorSelection(Master.toAkkaUrl(masterUrl))
        actor ! RegisterApplication(appDescription) // register with the Master
      }
    }

    def registerWithMaster() {
      tryRegisterAllMasters()
      import context.dispatcher
      var retries = 0
      registrationRetryTimer = Some { // if no success message arrives within 20s, register again
        context.system.scheduler.schedule(REGISTRATION_TIMEOUT, REGISTRATION_TIMEOUT) {
          Utils.tryOrExit {
            retries += 1
            if (registered) { // registration succeeded, cancel all further retries
              registrationRetryTimer.foreach(_.cancel())
            } else if (retries >= REGISTRATION_RETRIES) { // too many retries (3): consider the cluster unusable and give up
              markDead("All masters are unresponsive! Giving up.")
            } else { // start another round of registration attempts
              tryRegisterAllMasters()
            }
          }
        }
      }
    }


The main messages the ClientActor handles are:

  1. RegisteredApplication(appId_, masterUrl) => // from the Master: the Application was registered successfully
  2. ApplicationRemoved(message) => // from the Master: the Application was removed; whether it finished successfully or failed, it is eventually removed
  3. ExecutorAdded(id: Int, workerId: String, hostPort: String, cores: Int, memory: Int) => // from the Master: an Executor was added
  4. ExecutorUpdated(id, state, message, exitStatus) => // from the Master: an Executor's state changed; if the Executor has reached a finished state, the SchedulerBackend's executorRemoved callback is invoked
  5. MasterChanged(masterUrl, masterWebUiUrl) => // from a newly elected Master. The Master can use ZooKeeper for HA and persist the cluster metadata there, so after becoming leader the new Master restores the persisted Application, Driver and Worker information
  6. StopAppClient => // from AppClient::stop()

4. The Master selects Workers based on the AppClient's submission


When the Master receives the RegisterApplication request from the AppClient, it handles it as follows:

    case RegisterApplication(description) => {
      if (state == RecoveryState.STANDBY) {
        // ignore, don't send response // Note: the AppClient has a 20s timeout and will retry
      } else {
        logInfo("Registering app " + description.name)
        val app = createApplication(description, sender)
        // app is ApplicationInfo(now, newApplicationId(date), desc, date, driver, defaultCores); driver is the AppClient's actor
        // registerApplication stores it in the Master's bookkeeping structures, e.g.
        /* apps += app
           idToApp(app.id) = app
           actorToApp(app.driver) = app
           addressToApp(appAddress) = app
           waitingApps += app */
        registerApplication(app)

        logInfo("Registered app " + description.name + " with ID " + app.id)
        persistenceEngine.addApplication(app) // persist the app's metadata: to ZooKeeper, to the local file system, or not at all
        sender ! RegisteredApplication(app.id, masterUrl)
        schedule() // allocate resources to Applications that are still waiting; schedule() is called whenever a new Application or new resources arrive
      }
    }

schedule() allocates resources to the Applications that are still waiting for them; it is invoked every time a new Application joins or new resources become available. When choosing workers (executors) for an Application, there are currently two strategies:

  1. Spread out as much as possible: distribute an Application's executors over as many nodes as possible. This is controlled by spark.deploy.spreadOut, whose default value is true.
  2. Consolidate as much as possible: place an Application on as few nodes as possible.

For a given Application, a worker hosts at most one of its executors, although that executor may get more than one core. With strategy 1, task deployment tends to be slower than with strategy 2, but GC tends to be faster, since each executor gets fewer cores and therefore runs fewer concurrent tasks against the same executor heap. A worked example follows the scheduling code below.

The core of the logic is as follows:

    if (spreadOutApps) { // spread the load out: hand out cores one at a time, round-robin over the usable workers
      // Try to spread out each app among all the nodes, until it has all its cores
      for (app <- waitingApps if app.coresLeft > 0) { // waiting apps are served FIFO
        // A worker is usable if it is ALIVE, does not already run an executor of this Application,
        // and has enough free memory. Among the usable workers, those with more free cores come first.
        val usableWorkers = workers.toArray.filter(_.state == WorkerState.ALIVE)
          .filter(canUse(app, _)).sortBy(_.coresFree).reverse
        val numUsable = usableWorkers.length
        val assigned = new Array[Int](numUsable) // Number of cores to give on each node
        var toAssign = math.min(app.coresLeft, usableWorkers.map(_.coresFree).sum)
        var pos = 0
        while (toAssign > 0) {
          if (usableWorkers(pos).coresFree - assigned(pos) > 0) {
            toAssign -= 1
            assigned(pos) += 1
          }
          pos = (pos + 1) % numUsable
        }
        // Now that we've decided how many cores to give on each node, let's actually give them
        for (pos <- 0 until numUsable) {
          if (assigned(pos) > 0) {
            val exec = app.addExecutor(usableWorkers(pos), assigned(pos))
            launchExecutor(usableWorkers(pos), exec)
            app.state = ApplicationState.RUNNING
          }
        }
      }
    } else { // use up each worker's cores as much as possible
      // Pack each app into as few nodes as possible until we've assigned all its cores
      for (worker <- workers if worker.coresFree > 0 && worker.state == WorkerState.ALIVE) {
        for (app <- waitingApps if app.coresLeft > 0) {
          if (canUse(app, worker)) {
            val coresToUse = math.min(worker.coresFree, app.coresLeft)
            if (coresToUse > 0) {
              val exec = app.addExecutor(worker, coresToUse)
              launchExecutor(worker, exec)
              app.state = ApplicationState.RUNNING
            }
          }
        }
      }
    }
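
To make the spread-out branch concrete, here is a small self-contained sketch of the same round-robin assignment, run on made-up numbers: three usable workers with 8, 6 and 4 free cores, and an application that still needs 10 cores.

  // Standalone simulation of the spread-out core assignment above (illustrative only).
  object SpreadOutDemo {
    def assignCores(coresLeft: Int, coresFree: Array[Int]): Array[Int] = {
      val assigned = new Array[Int](coresFree.length)
      var toAssign = math.min(coresLeft, coresFree.sum)
      var pos = 0
      while (toAssign > 0) {
        if (coresFree(pos) - assigned(pos) > 0) { // this worker still has a free core
          toAssign -= 1
          assigned(pos) += 1
        }
        pos = (pos + 1) % coresFree.length        // round-robin over the usable workers
      }
      assigned
    }

    def main(args: Array[String]): Unit = {
      // Free cores of the usable workers, sorted descending as in the Master's code.
      val coresFree = Array(8, 6, 4)
      println(assignCores(10, coresFree).mkString(", ")) // prints: 4, 3, 3
    }
  }

Each of the three workers then gets one executor, with 4, 3 and 3 cores respectively. Under the consolidate strategy the same request would instead be satisfied with 8 cores on one worker and 2 on another (assuming the workers happen to be visited in that order).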


Once the workers are chosen and the number of CPU cores for each executor is decided, the Master calls launchExecutor(worker: WorkerInfo, exec: ExecutorInfo), which sends a launch request to the Worker and an ExecutorAdded message to the AppClient. It also updates the Master's bookkeeping for that worker: the executor is recorded and the allocated CPU cores and memory are subtracted. Note that the Master does not wait for the executor to actually start successfully on the worker before updating this information; if the worker fails to start the executor, it sends a FAILED message to the Master, which then updates the worker's information again. This keeps the logic simple.

  def launchExecutor(worker: WorkerInfo, exec: ExecutorInfo) {
    logInfo("Launching executor " + exec.fullId + " on worker " + worker.id)
    worker.addExecutor(exec) // update the worker's bookkeeping: subtract the cores and memory taken by this executor
    // ask the Worker to launch the executor
    worker.actor ! LaunchExecutor(masterUrl,
      exec.application.id, exec.id, exec.application.desc, exec.cores, exec.memory)
    // tell the AppClient that an executor has been added
    exec.application.driver ! ExecutorAdded(
      exec.id, worker.id, worker.hostPort, exec.cores, exec.memory)
  }


To summarize: the current allocation scheme is still fairly coarse. For example, it does not take the nodes' actual load into account. Executors may end up evenly distributed, and purely in terms of the statically assigned CPU cores and memory the load looks balanced, but in practice some executors consume far more resources than others, which can leave the cluster unbalanced. Feedback from production workloads is needed to refine the allocation strategy and achieve better resource utilization.


5. The Worker creates the Executor according to the Master's allocation


When the Worker receives the LaunchExecutor message from the Master, it creates an org.apache.spark.deploy.worker.ExecutorRunner. The Worker keeps track of its own resource usage, including the CPU cores and memory already in use, but this bookkeeping exists only for the web UI: the Master tracks each Worker's resource usage itself, so the Worker does not need to report it. The heartbeat between Worker and Master is purely a liveness signal and carries no other information.

The ExecutorRunner launches, as a separate process, the command that org.apache.spark.scheduler.cluster.SparkDeploySchedulerBackend packaged into the org.apache.spark.deploy.ApplicationDescription. When that command was built, several of its arguments were still unknown:

val args = Seq(driverUrl, "{{EXECUTOR_ID}}", "{{HOSTNAME}}", "{{CORES}}", "{{WORKER_URL}}")

The ExecutorRunner replaces these placeholders with the values that have actually been assigned:

 /** Replace variables such as {{EXECUTOR_ID}} and {{CORES}} in a command argument passed to us */
  def substituteVariables(argument: String): String = argument match {
    case "{{WORKER_URL}}" => workerUrl
    case "{{EXECUTOR_ID}}" => execId.toString
    case "{{HOSTNAME}}" => host
    case "{{CORES}}" => cores.toString
    case other => other
  }
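
For example, mapping substituteVariables over the argument template yields the concrete command-line arguments. A self-contained sketch; the executor id, host, core count and both URLs below are made up:

  // Illustrative only: resolving the placeholder arguments the way ExecutorRunner does.
  object SubstituteDemo {
    val execId = 3
    val host = "worker-1"
    val cores = 4
    val workerUrl = "akka.tcp://sparkWorker@worker-1:7078/user/Worker" // made-up worker URL

    def substituteVariables(argument: String): String = argument match {
      case "{{WORKER_URL}}" => workerUrl
      case "{{EXECUTOR_ID}}" => execId.toString
      case "{{HOSTNAME}}" => host
      case "{{CORES}}" => cores.toString
      case other => other
    }

    def main(args: Array[String]): Unit = {
      val driverUrl = "akka.tcp://spark@driver-host:51234/user/CoarseGrainedScheduler" // made-up driver URL
      val template = Seq(driverUrl, "{{EXECUTOR_ID}}", "{{HOSTNAME}}", "{{CORES}}", "{{WORKER_URL}}")
      // Prints the fully resolved arguments passed to CoarseGrainedExecutorBackend.
      println(template.map(substituteVariables).mkString(" "))
    }
  }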

It then launches the org.apache.spark.executor.CoarseGrainedExecutorBackend carried in the org.apache.spark.deploy.ApplicationDescription:

def fetchAndRunExecutor() {
    try {
      // Create the executor's working directory
      val executorDir = new File(workDir, appId + "/" + execId)
      if (!executorDir.mkdirs()) {
        throw new IOException("Failed to create directory " + executorDir)
      }

      // Launch the process
      val command = getCommandSeq
      logInfo("Launch command: " + command.mkString("\"", "\" \"", "\""))
      val builder = new ProcessBuilder(command: _*).directory(executorDir)
      val env = builder.environment()
      for ((key, value) <- appDesc.command.environment) {
        env.put(key, value)
      }
      // In case we are running this from within the Spark Shell, avoid creating a "scala"
      // parent process for the executor command
      env.put("SPARK_LAUNCH_WITH_SCALA", "0")
      process = builder.start()

Once CoarseGrainedExecutorBackend starts, it uses the driverUrl passed to it to send RegisterExecutor(executorId, hostPort, cores) to org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend.DriverActor. The DriverActor replies with RegisteredExecutor, and CoarseGrainedExecutorBackend then creates an org.apache.spark.executor.Executor. At this point the Executor is fully set up. The Executor itself is the same under Mesos, YARN and the standalone scheduler; only the way resources are allocated and managed differs.
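
For illustration, the handshake can be reduced to the following self-contained sketch. The message names mirror the ones above, but this is a schematic reduction written for this article, not the actual Spark source (which exchanges these messages over Akka remoting):

  // Schematic reduction of the executor registration handshake (illustrative only).
  object RegistrationHandshakeSketch {
    case class RegisterExecutor(executorId: String, hostPort: String, cores: Int)
    case object RegisteredExecutor
    case class RegisterExecutorFailed(reason: String)

    // Stands in for the DriverActor inside CoarseGrainedSchedulerBackend.
    class DriverStub {
      private var registered = Map.empty[String, Int]
      def receive(msg: RegisterExecutor): Any =
        if (registered.contains(msg.executorId)) {
          RegisterExecutorFailed("Duplicate executor ID: " + msg.executorId)
        } else {
          registered += msg.executorId -> msg.cores
          RegisteredExecutor
        }
    }

    def main(args: Array[String]): Unit = {
      val driver = new DriverStub
      // What CoarseGrainedExecutorBackend does right after it starts:
      driver.receive(RegisterExecutor("0", "worker-1:34567", 4)) match {
        case RegisteredExecutor =>
          println("Registered with driver; now creating org.apache.spark.executor.Executor")
        case RegisterExecutorFailed(reason) =>
          println("Registration failed: " + reason)
      }
    }
  }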
