Spark Dynamic Resource Allocation

1. Background
2. How It Works
   2.1 Executor Lifecycle
   2.2 ExecutorAllocationManager: Upstream and Downstream Interactions
3. Summary and Reflections
4. Community Feedback

1. Background

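When submitting a Spark application to YARN, the user can explicitly set the number of executors with the spark-submit option --num-executors. The ApplicationMaster then requests resources for these executors, and each executor runs as a Container on YARN. The Spark scheduler assigns Tasks to executors according to a suitable policy. Once all tasks have finished, the executors are killed and the application ends. While a job is running, every executor keeps its resources whether or not it has been given any tasks. When the workload is small and a large number of executors was explicitly requested, this easily wastes resources.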

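Before looking at how Spark solves this, consider what you would have to take into account if you solved it yourself. The rough approach is easy to come up with: if an executor has been idle for some period of time, kill it and release the resources it holds. Some details and edge cases still need thought: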

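What is the range of dynamic adjustment? Can executors shrink without limit? Grow without limit?
How fast should the executor count change? Linearly? Exponentially?
When should an Executor be removed?
When should an Executor be added? Whenever a new Task is submitted?
Executors in Spark provide not only compute but may also store persisted data. How is that data accessed after its host executor is killed?
...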

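2. How It Works

2.1 Executor Lifecycle

First, a brief look at the Executor lifecycle under Spark's static resource allocation, using a wordcount in spark-shell as the example. The commands are: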

# Run on YARN with the number of executors set to 1
$ spark-shell --master=yarn --num-executors=1

# Submit Job1: wordcount
scala> sc.textFile("file:///etc/hosts").flatMap(line => line.split(" ")).map(word => (word, 1)).reduceByKey(_ + _).count()

# Submit Job2: wordcount
scala> sc.textFile("file:///etc/profile").flatMap(line => line.split(" ")).map(word => (word, 1)).reduceByKey(_ + _).count()

# Ctrl+C to kill the JVM

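The application above starts spark-shell in YARN mode, runs wordcount twice in sequence, and finally exits spark-shell with Ctrl+C. The Executor lifecycle in this example is shown in the figure below: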

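As the figure shows, throughout the application the Executor is always either Busy (running Tasks) or Idle (waiting). The waste caused by Idle Executors was described above. The rest of this section looks at how Executors behave once Spark dynamic resource allocation is enabled.

The steps in the figure above are analyzed below: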

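(1) spark-shell Start: the spark-shell application starts, with one executor requested via --num-executors.
(2) Executor1 Start: Executor1 starts. Note: before an Executor starts, the AM has to request resources from the ResourceManager, so the Executor starts slightly later than the Driver.
(3) Job1 Start: the first wordcount job is submitted; Executor1 is now Busy.
(4) Job1 End: Job1 finishes and Executor1 returns to Idle.
(5) Executor1 timeout: after being idle for a while, Executor1 times out and is killed.
(6) Job2 Submit: the second wordcount is submitted. No active Executor is available, so Job2 is Pending.
(7) Executor2 Start: Spark detects pending tasks and starts Executor2.
(8) Job2 Start: an active Executor now exists, so Job2 is assigned to Executor2.
(9) Job2 End: Job2 finishes.
(10) Executor2 End: Ctrl+C kills the Driver, and Executor2 is killed by the RM as well.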
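
The walkthrough above only plays out this way when dynamic allocation is switched on. The sketch below collects the relevant properties and renders them as spark-shell --conf flags. The property names are real Spark settings; the object name DynamicAllocationConfSketch, the helper toSubmitArgs, and the chosen values (the defaults discussed in this article) are illustrative assumptions.

```scala
object DynamicAllocationConfSketch {
  // Property names are Spark settings; the values are example choices, not recommendations.
  val settings: Seq[(String, String)] = Seq(
    "spark.dynamicAllocation.enabled" -> "true",
    // On YARN, dynamic allocation is normally paired with the external shuffle service.
    "spark.shuffle.service.enabled" -> "true",
    "spark.dynamicAllocation.executorIdleTimeout" -> "60s",
    "spark.dynamicAllocation.schedulerBacklogTimeout" -> "1s"
  )

  // Hypothetical helper: renders each pair as a "--conf key=value" flag.
  def toSubmitArgs(conf: Seq[(String, String)]): String =
    conf.map { case (key, value) => s"--conf $key=$value" }.mkString(" ")
}
```

The resulting string can be appended to the spark-shell command shown earlier.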

Key points in this flow:

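Executor timeout: an Executor that is not running any task is marked Idle. After staying idle for a set period it is considered timed out and is killed. The period is set by spark.dynamicAllocation.executorIdleTimeout, default 60s. In the figure above it is the time between Job1 End and Executor1 timeout.
When to add Executors when resources are short: Tasks in the pending state mean there are not enough resources, so more Executors are needed. How long tasks must stay pending before a request is made is controlled by spark.dynamicAllocation.schedulerBacklogTimeout, default 1s. It corresponds to the time between step 6 and step 7 above.
How many Executors to add: the number is based mainly on the current load, that is, the number of running and pending tasks together with the current number of Executors. Let maxNumExecutorsNeeded denote the maximum number of Executors actually needed right now; the difference between maxNumExecutorsNeeded and the current Executor count is the potential number of Executors to add. Note: it is only potential because other factors also affect the final number, as analyzed later. maxNumExecutorsNeeded is computed as follows: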

  private def maxNumExecutorsNeeded(): Int = {
    val numRunningOrPendingTasks = listener.totalPendingTasks + listener.totalRunningTasks
    math.ceil(numRunningOrPendingTasks * executorAllocationRatio /
              tasksPerExecutorForFullParallelism)
      .toInt
  }

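Here numRunningOrPendingTasks is the sum of the currently running and pending tasks.
executorAllocationRatio: ideally we would add enough Executors for every waiting task to run at once, which maximizes task parallelism. This has a side effect: if the current tasks are all small, the policy wastes resources, because the small tasks may already be done before the requested Executors have even started. This value is a coefficient in the range (0, 1] that scales the request down. The default is 1.
tasksPerExecutorForFullParallelism: the maximum number of tasks one Executor can run concurrently. Roughly: CPU cores per executor (spark.executor.cores) / cores per task (spark.task.cpus).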
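
To make the formula concrete, here is a minimal standalone sketch of the same arithmetic. It is not Spark's code: the object name MaxExecutorsSketch and the explicit parameter list are assumptions made so the calculation can run without a listener or a SparkConf.

```scala
object MaxExecutorsSketch {
  // Simplified stand-in for the calculation quoted above; every input is passed explicitly.
  def maxNumExecutorsNeeded(
      pendingTasks: Int,
      runningTasks: Int,
      executorAllocationRatio: Double,
      executorCores: Int,
      taskCpus: Int): Int = {
    require(executorAllocationRatio > 0.0 && executorAllocationRatio <= 1.0,
      "executorAllocationRatio must be in (0, 1]")
    require(taskCpus > 0 && executorCores >= taskCpus,
      "an executor must be able to run at least one task")
    // Tasks one executor can run at the same time (integer division).
    val tasksPerExecutor = executorCores / taskCpus
    val numRunningOrPendingTasks = pendingTasks + runningTasks
    math.ceil(numRunningOrPendingTasks * executorAllocationRatio / tasksPerExecutor).toInt
  }
}
```

With 90 pending and 10 running tasks, 4 cores per executor and 1 core per task, a ratio of 1.0 gives 25 Executors, while a ratio of 0.5 gives 13 (12.5 rounded up).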

Question 1: What is the range of dynamic adjustment? Can executors shrink or grow without limit? How fast?

To adjust resources dynamically, the first thing to settle is the allowed range. Spark does this with the following parameters:

spark.dynamicAllocation.minExecutors: lower bound on the number of Executors (default: 0).
spark.dynamicAllocation.maxExecutors: upper bound on the number of Executors (default: Integer.MAX_VALUE).
spark.dynamicAllocation.initialExecutors: initial number of Executors (default: minExecutors).

The three must satisfy: minExecutors <= initialExecutors <= maxExecutors

Note: if num-executors is specified explicitly and is larger than initialExecutors, the value of num-executors is used as the initial number of Executors.
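
The bounds above, together with the rate question, can be illustrated with a small model. Spark's documentation describes the ramp-up as exponential: while tasks stay backlogged, each round requests more Executors than the previous one (1, 2, 4, 8, ...). The sketch below is a simplified model under that assumption, not Spark's implementation: the names Bounds, clamp and rampUp are hypothetical, and the timing of rounds is ignored.

```scala
object AllocationBoundsSketch {
  // Hypothetical holder for minExecutors / initialExecutors / maxExecutors.
  final case class Bounds(min: Int, initial: Int, max: Int) {
    require(min >= 0, "minExecutors must be non-negative")
    require(min <= initial && initial <= max,
      "must satisfy minExecutors <= initialExecutors <= maxExecutors")

    // Keeps a desired executor count inside [min, max].
    def clamp(target: Int): Int = math.max(min, math.min(max, target))
  }

  // Simplified ramp-up: the increment doubles every round and resets to 1
  // once the target reaches the ceiling allowed by the load and the bounds.
  // Returns the target after each round, starting with the initial value.
  def rampUp(bounds: Bounds, maxNeeded: Int, rounds: Int): List[Int] = {
    val ceiling = bounds.clamp(maxNeeded)
    Iterator.iterate((bounds.initial, 1)) { case (target, toAdd) =>
      // Long arithmetic avoids overflow when maxExecutors is Int.MaxValue.
      val next = math.min(ceiling.toLong, target.toLong + toAdd).toInt
      (next, if (next == ceiling) 1 else toAdd * 2)
    }.map(_._1).take(rounds + 1).toList
  }
}
```

For example, with Bounds(0, 0, 10) and a load that needs 25 Executors, five rounds produce the targets 0, 1, 3, 7, 10, 10: growth is exponential until maxExecutors caps it.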

Question 2: Executors in Spark provide storage as well as compute. What happens to data persisted in Executors that are killed on timeout?

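If an Executor holds cached data, its idle timeout is not governed by executorIdleTimeout but by spark.dynamicAllocation.cachedExecutorIdleTimeout, default Integer.MAX_VALUE, so by default such Executors are effectively never removed. If the value is set manually, Executors holding cached data are killed after that timeout, and the blocks cached on them are no longer accessible and have to be recomputed when needed. Shuffle files are handled differently: they stay readable through the External Shuffle Service running inside the NodeManager, which requires spark.shuffle.service.enabled to be turned on.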
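
The choice between the two timeouts can be sketched as follows. This is an illustration only: the object IdleTimeoutSketch and its members are hypothetical names, and the two values simply mirror the defaults quoted in this article (60s, and an effectively infinite timeout for Executors with cached data).

```scala
import scala.concurrent.duration._

object IdleTimeoutSketch {
  // Assumed values mirroring the defaults quoted in the text.
  val executorIdleTimeout: Duration = 60.seconds
  val cachedExecutorIdleTimeout: Duration = Duration.Inf

  // An executor holding cached blocks uses the (much longer) cached timeout.
  def idleTimeoutFor(hasCachedBlocks: Boolean): Duration =
    if (hasCachedBlocks) cachedExecutorIdleTimeout else executorIdleTimeout

  // Absolute time at which an executor that went idle at idleSinceMillis expires.
  // An infinite timeout is represented as Long.MaxValue, meaning "never".
  def expireTimeMillis(idleSinceMillis: Long, hasCachedBlocks: Boolean): Long =
    idleTimeoutFor(hasCachedBlocks) match {
      case finite: FiniteDuration => idleSinceMillis + finite.toMillis
      case _ => Long.MaxValue
    }
}
```

An Executor without cached data that went idle at t = 1000 ms would expire at 61000 ms; one with cached data would never expire under these defaults.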

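2.2 ExecutorAllocationManager: Upstream and Downstream Interactions

The main logic of Spark dynamic allocation is implemented by the ExecutorAllocationManager class. First, its interactions with upstream and downstream components, shown in the figure below:

The core logic is simple: ExecutorAllocationManager starts a periodic task that checks whether any Executor has timed out and removes those that have. Executor state is collected mainly through Spark's SparkListener mechanism. The periodic task looks like this: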

// Abridged excerpt; elided parts are marked with "..."
private[spark] class ExecutorAllocationManager {

  // Executor that handles the scheduling task.
  private val executor =
    ThreadUtils.newDaemonSingleThreadScheduledExecutor("spark-dynamic-executor-allocation")

  def start(): Unit = {
    // ...
    val scheduleTask = new Runnable() {
      override def run(): Unit = {
        try {
          schedule()
        } catch {...}
      }
    }
    executor.scheduleWithFixedDelay(scheduleTask, 0, intervalMillis, TimeUnit.MILLISECONDS)
    // ...
  }

  private def schedule(): Unit = synchronized {
    val now = clock.getTimeMillis
    // Sync the number of Executors currently needed
    updateAndSyncNumExecutorsTarget(now)

    val executorIdsToBeRemoved = ArrayBuffer[String]()
    // removeTimes maps executorId -> expireTime.
    removeTimes.retain { case (executorId, expireTime) =>
      val expired = now >= expireTime
      if (expired) {
        initializing = false
        executorIdsToBeRemoved += executorId
      }
      !expired
    }
    // Remove all Executors that have timed out
    if (executorIdsToBeRemoved.nonEmpty) {
      removeExecutors(executorIdsToBeRemoved)
    }
  }
}
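
The expiry check inside schedule() can be tried in isolation. The sketch below is a standalone approximation rather than Spark's code: the object IdleSweepSketch and the function sweep are hypothetical, and an immutable Map replaces the mutable removeTimes so the logic is easy to test.

```scala
object IdleSweepSketch {
  // Splits tracked executors into those whose expire time has passed
  // and those that should keep being tracked.
  // removeTimes maps executorId -> expireTime (milliseconds).
  def sweep(
      removeTimes: Map[String, Long],
      nowMillis: Long): (Set[String], Map[String, Long]) = {
    val (expired, remaining) =
      removeTimes.partition { case (_, expireTime) => nowMillis >= expireTime }
    (expired.keySet, remaining)
  }
}
```

Calling sweep(Map("1" -> 61000L, "2" -> 90000L), 61000L) marks executor "1" for removal and keeps "2" tracked, matching the now >= expireTime comparison in the excerpt.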

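That concludes the analysis of how Spark dynamic resource allocation works. For the source, see Apache Spark: ExecutorAllocationManager. For the full list of configuration parameters, see Spark Configuration: Dynamic Allocation.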

3. Summary and Reflections

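Niklaus Wirth, the father of Pascal, famously said: Programs = Algorithms + Data Structures. For Spark dynamic resource allocation we should pay more attention to the algorithm side, that is, its dynamic behavior: how are resources allocated? How do they scale up and down? What are the upstream and downstream relationships? And so on.
Giving back to the community: contributing is a form of output, and it forces the quality of our input to be high enough. It is a very effective way to build skills. The first step is the hardest, so start with the simplest typo fixes and docs improvements.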

4. Community Feedback

Improved the documentation of Executor-related parameters. SPARK-26446: Add cachedExecutorIdleTimeout docs at ExecutorAllocationManager
Bug fix: SPARK-26588: Idle executor should properly be killed when no job is submitted
