Building the SparkEnv Instance in Spark

SparkEnv

An instance of SparkEnv is created whenever a driver or an executor starts. It provides the facilities the node needs to work correctly, such as managing the local cache of exchanged data and shuffle files and tracking the output of map tasks. It instantiates all of the runtime objects a running Spark instance needs (on both the master and the worker side). User code can reach the SparkEnv instance through a global accessor, so it may be shared by multiple threads. There are several ways to obtain the instance; for example, once a SparkContext has been created you can simply call:

SparkEnv.get
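
A minimal sketch of what this gives you, assuming a SparkContext has just been created in local mode (the app name and the printed keys below are only illustrative):

import org.apache.spark.{SparkConf, SparkContext, SparkEnv}

val sc = new SparkContext(new SparkConf().setAppName("env-demo").setMaster("local[2]"))

// SparkEnv.get returns the environment of the current process; on the driver it is
// the one created by SparkContext.
val env: SparkEnv = SparkEnv.get
println(env.executorId)                 // "driver" (SparkContext.DRIVER_IDENTIFIER)
println(env.conf.get("spark.app.name")) // the configuration the env was built from

// Inside a task the same call returns that executor's environment
// (in local mode driver and executor share one process, so the id is still "driver").
sc.parallelize(1 to 2, 2).foreach(_ => println(SparkEnv.get.executorId))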

SparkEnv on Executor

When a user submits an application to the cluster, the corresponding driver process has already started on some worker, and executors are about to be launched for the application, a LaunchExecutor(masterUrl, appId, execId, appDesc, cores_, memory_) message is sent to the Worker endpoint (an RpcEndpoint). In its handler the Worker process creates and starts an ExecutorRunner (which internally holds a thread and, when started, spawns a child process running the per-executor backend CoarseGrainedExecutorBackend). When the CoarseGrainedExecutorBackend process starts it calls the following method, which creates the SparkEnv:

private[spark] object CoarseGrainedExecutorBackend extends Logging {
  private def run(
      driverUrl: String,
      executorId: String,
      hostname: String,
      cores: Int,
      appId: String,
      workerUrl: Option[String],
      userClassPath: Seq[URL]) {
      
    Utils.initDaemon(log)

    SparkHadoopUtil.get.runAsSparkUser { () =>
      // Debug code
      Utils.checkHost(hostname)

      // Bootstrap to fetch the driver's Spark properties.
      val executorConf = new SparkConf
      val fetcher = RpcEnv.create(
        "driverPropsFetcher",
        hostname,
        -1,
        executorConf,
        new SecurityManager(executorConf),
        clientMode = true)
      val driver = fetcher.setupEndpointRefByURI(driverUrl)
      val cfg = driver.askSync[SparkAppConfig](RetrieveSparkAppConfig)
      val props = cfg.sparkProperties ++ Seq[(String, String)](("spark.app.id", appId))
      fetcher.shutdown()

      // Create SparkEnv using properties we fetched from the driver.
      val driverConf = new SparkConf()
      
      for ((key, value) <- props) {
        // this is required for SSL in standalone mode
        if (SparkConf.isExecutorStartupConf(key)) {
          driverConf.setIfMissing(key, value)
        } else {
          driverConf.set(key, value)
        }
      }

      cfg.hadoopDelegationCreds.foreach { tokens =>
        SparkHadoopUtil.get.addDelegationTokens(tokens, driverConf)
      }

      val env = SparkEnv.createExecutorEnv(
        driverConf, executorId, hostname, cores, cfg.ioEncryptionKey, isLocal = false)

      env.rpcEnv.setupEndpoint("Executor", new CoarseGrainedExecutorBackend(
        env.rpcEnv, driverUrl, executorId, hostname, cores, userClassPath, env))
      workerUrl.foreach { url =>
        env.rpcEnv.setupEndpoint("WorkerWatcher", new WorkerWatcher(env.rpcEnv, url))
      }
      env.rpcEnv.awaitTermination()
    }
  }
  
}

As the code above shows, when the executor backend starts it obtains the driver-related properties by exchanging RPC messages with the driver, and passes them on to createExecutorEnv(...) when the executor-side SparkEnv is created, as shown below:

  private[spark] def createExecutorEnv(
      conf: SparkConf,
      executorId: String,
      hostname: String,
      numCores: Int,
      ioEncryptionKey: Option[Array[Byte]],
      isLocal: Boolean): SparkEnv = {
    val env = create(
      conf,
      executorId,
      hostname, // bindAddress: the local address the Netty TransportServer binds to
      hostname, // advertiseAddress: the address advertised to other endpoints for communication
      None,
      isLocal,
      numCores,
      ioEncryptionKey
    )
    SparkEnv.set(env)
    env
  }

SparkEnv on Driver

Compared with the executor side, the driver passes two additional parameters when creating SparkEnv:

listenerBus: LiveListenerBus
Asynchronously delivers SparkListenerEvents to the registered SparkListener objects (see the sketch after this list).

mockOutputCommitCoordinator: Option[OutputCommitCoordinator]
Decides, on a first-committer-wins basis, whether a task is authorized to commit its output to HDFS.
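
As a hedged illustration of the listener bus: user code never builds a LiveListenerBus itself, but it can register a SparkListener through SparkContext (reusing the sc from the earlier sketch), and the bus delivers scheduler events to it asynchronously:

import org.apache.spark.scheduler.{SparkListener, SparkListenerStageCompleted, SparkListenerTaskEnd}

// Events posted by the scheduler are queued on the LiveListenerBus and delivered
// asynchronously to every registered listener.
sc.addSparkListener(new SparkListener {
  override def onTaskEnd(taskEnd: SparkListenerTaskEnd): Unit =
    println(s"task finished in stage ${taskEnd.stageId}")
  override def onStageCompleted(stage: SparkListenerStageCompleted): Unit =
    println(s"stage ${stage.stageInfo.stageId} completed")
})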

  private[spark] def createDriverEnv(
      conf: SparkConf,
      isLocal: Boolean,
      listenerBus: LiveListenerBus,
      numCores: Int,
      mockOutputCommitCoordinator: Option[OutputCommitCoordinator] = None): SparkEnv = {
    assert(conf.contains(DRIVER_HOST_ADDRESS),
      s"${DRIVER_HOST_ADDRESS.key} is not set on the driver!")
    assert(conf.contains("spark.driver.port"), "spark.driver.port is not set on the driver!")
    val bindAddress = conf.get(DRIVER_BIND_ADDRESS)
    val advertiseAddress = conf.get(DRIVER_HOST_ADDRESS)
    val port = conf.get("spark.driver.port").toInt
    val ioEncryptionKey = if (conf.get(IO_ENCRYPTION_ENABLED)) {
      Some(CryptoStreamUtils.createKey(conf))
    } else {
      None
    }
    create(
      conf,
      SparkContext.DRIVER_IDENTIFIER,
      bindAddress,
      advertiseAddress,
      Option(port),
      isLocal,
      numCores,
      ioEncryptionKey,
      listenerBus = listenerBus,
      mockOutputCommitCoordinator = mockOutputCommitCoordinator
    )
  }

Creating the SparkEnv

Whether on the driver or on an executor, SparkEnv creation ultimately goes through the same create(...) method, shown below.

  private def create(
      conf: SparkConf,
      executorId: String, // identifies whether the current role is an executor or the driver
      bindAddress: String,
      advertiseAddress: String,
      port: Option[Int],
      isLocal: Boolean,
      numUsableCores: Int,
      ioEncryptionKey: Option[Array[Byte]],
      listenerBus: LiveListenerBus = null,
      mockOutputCommitCoordinator: Option[OutputCommitCoordinator] = None): SparkEnv = {

    val isDriver = executorId == SparkContext.DRIVER_IDENTIFIER

    // Security manager shared by the RpcEnv, the block transfer service, etc.
    val securityManager = new SecurityManager(conf, ioEncryptionKey)

    val systemName = if (isDriver) driverSystemName else executorSystemName
    val rpcEnv = RpcEnv.create(systemName, bindAddress, advertiseAddress, port.getOrElse(-1), conf,
      securityManager, numUsableCores, !isDriver)

    // Figure out which port RpcEnv actually bound to in case the original port is 0 or occupied.
    if (isDriver) {
      conf.set("spark.driver.port", rpcEnv.address.port.toString)
    }

    val serializer = instantiateClassFromConf[Serializer](
      "spark.serializer", "org.apache.spark.serializer.JavaSerializer")
    logDebug(s"Using serializer: ${serializer.getClass}")

    val serializerManager = new SerializerManager(serializer, conf, ioEncryptionKey)

    val closureSerializer = new JavaSerializer(conf)

    def registerOrLookupEndpoint(
        name: String, endpointCreator: => RpcEndpoint):
      RpcEndpointRef = {
      if (isDriver) {
        logInfo("Registering " + name)
        rpcEnv.setupEndpoint(name, endpointCreator)
      } else {
        RpcUtils.makeDriverRef(name, conf, rpcEnv)
      }
    }

    val broadcastManager = new BroadcastManager(isDriver, conf, securityManager)

    val mapOutputTracker = if (isDriver) {
      new MapOutputTrackerMaster(conf, broadcastManager, isLocal)
    } else {
      new MapOutputTrackerWorker(conf)
    }

    // Have to assign trackerEndpoint after initialization as MapOutputTrackerEndpoint
    // requires the MapOutputTracker itself
    mapOutputTracker.trackerEndpoint = registerOrLookupEndpoint(MapOutputTracker.ENDPOINT_NAME,
      new MapOutputTrackerMasterEndpoint(
        rpcEnv, mapOutputTracker.asInstanceOf[MapOutputTrackerMaster], conf))

    // Let the user specify short names for shuffle managers
    val shortShuffleMgrNames = Map(
      "sort" -> classOf[org.apache.spark.shuffle.sort.SortShuffleManager].getName,
      "tungsten-sort" -> classOf[org.apache.spark.shuffle.sort.SortShuffleManager].getName)
    val shuffleMgrName = conf.get("spark.shuffle.manager", "sort")
    val shuffleMgrClass =
      shortShuffleMgrNames.getOrElse(shuffleMgrName.toLowerCase(Locale.ROOT), shuffleMgrName)
    val shuffleManager = instantiateClass[ShuffleManager](shuffleMgrClass)

    val useLegacyMemoryManager = conf.getBoolean("spark.memory.useLegacyMode", false)
    val memoryManager: MemoryManager =
      if (useLegacyMemoryManager) {
        new StaticMemoryManager(conf, numUsableCores)
      } else {
        UnifiedMemoryManager(conf, numUsableCores)
      }

    val blockManagerPort = if (isDriver) {
      conf.get(DRIVER_BLOCK_MANAGER_PORT)
    } else {
      conf.get(BLOCK_MANAGER_PORT)
    }

    val blockTransferService =
      new NettyBlockTransferService(conf, securityManager, bindAddress, advertiseAddress,
        blockManagerPort, numUsableCores)

    val blockManagerMaster = new BlockManagerMaster(registerOrLookupEndpoint(
      BlockManagerMaster.DRIVER_ENDPOINT_NAME,
      new BlockManagerMasterEndpoint(rpcEnv, isLocal, conf, listenerBus)),
      conf, isDriver)

    // NB: blockManager is not valid until initialize() is called later.
    val blockManager = new BlockManager(executorId, rpcEnv, blockManagerMaster,
      serializerManager, conf, memoryManager, mapOutputTracker, shuffleManager,
      blockTransferService, securityManager, numUsableCores)

    val metricsSystem = if (isDriver) {
      // Don't start metrics system right now for Driver.
      // We need to wait for the task scheduler to give us an app ID.
      // Then we can start the metrics system.
      MetricsSystem.createMetricsSystem("driver", conf, securityManager)
    } else {
      // We need to set the executor ID before the MetricsSystem is created because sources and
      // sinks specified in the metrics configuration file will want to incorporate this executor's
      // ID into the metrics they report.
      conf.set("spark.executor.id", executorId)
      val ms = MetricsSystem.createMetricsSystem("executor", conf, securityManager)
      ms.start()
      ms
    }

    val outputCommitCoordinator = mockOutputCommitCoordinator.getOrElse {
      new OutputCommitCoordinator(conf, isDriver)
    }
    val outputCommitCoordinatorRef = registerOrLookupEndpoint("OutputCommitCoordinator",
      new OutputCommitCoordinatorEndpoint(rpcEnv, outputCommitCoordinator))
    outputCommitCoordinator.coordinatorRef = Some(outputCommitCoordinatorRef)

    val envInstance = new SparkEnv(
      executorId,
      rpcEnv,
      serializer,
      closureSerializer,
      serializerManager,
      mapOutputTracker,
      shuffleManager,
      broadcastManager,
      blockManager,
      securityManager,
      metricsSystem,
      memoryManager,
      outputCommitCoordinator,
      conf)

    // Add a reference to tmp dir created by driver, we will delete this tmp dir when stop() is
    // called, and we only need to do it for driver. Because driver may run as a service, and if we
    // don't delete this tmp dir when sc is stopped, then will create too many tmp dirs.
    if (isDriver) {
      val sparkFilesDir = Utils.createTempDir(Utils.getLocalDir(conf), "userFiles").getAbsolutePath
      envInstance.driverTmpDir = Some(sparkFilesDir)
    }

    envInstance
  }

If the current endpoint is SparkContext.DRIVER_IDENTIFIER, the following steps are performed:

Create a SecurityManager, which handles security concerns for the cluster and the application.

Create the RpcEnv from the given bind address and advertise address (bindAddress & advertiseAddress) and start the Netty-based server listening on it.

Create a SerializerManager, which configures serialization, compression and encryption for Spark's components, including automatically choosing the serializer for shuffle data.

Create a JavaSerializer used to serialize closures; it returns a new Java-based SerializerInstance, and a fresh instance must be used per thread.

Create a BroadcastManager (the broadcast variable manager), which manages the Broadcast objects defined in a Spark application; currently only the TorrentBroadcast implementation is used.

Create a MapOutputTrackerMaster and register it with the local SparkEnv; it tracks the output locations of all the map tasks of every stage.

Create the ShuffleManager; currently only the SortShuffleManager subclass is instantiated. When a shuffle happens it sorts incoming records by their partition ID and writes them out to a single file.

Create the MemoryManager; depending on the configuration this is either a StaticMemoryManager or a UnifiedMemoryManager. It manages local memory usage and divides it between execution and storage according to the underlying allocation policy.

Create a BlockManager and register it with the local SparkEnv in the master role; it is the interface for putting and retrieving data blocks locally or remotely (memory/disk/off-heap).

Create the MetricsSystem, which periodically polls metric sources and writes them to the configured sinks.

Create an OutputCommitCoordinator and register it with the local SparkEnv; it authorizes tasks to commit their output to HDFS on a first-committer-wins basis.

Assemble all the components created above into a new SparkEnv instance.

Create a userFiles directory under the application's working directory, used to hold the external files the user supplies via the relevant options.

If the current environment is an executor, the following steps are performed:

Create a SecurityManager, which handles security concerns for the cluster and the application.

Create the RpcEnv from the given bind address and advertise address (bindAddress & advertiseAddress), but do not start a server socket; the executor's RpcEnv only connects out as a client.

Create a SerializerManager, which configures serialization, compression and encryption for Spark's components, including automatically choosing the serializer for shuffle data.

Create a JavaSerializer used to serialize closures; it returns a new Java-based SerializerInstance, and a fresh instance must be used per thread.

Create a BroadcastManager (the broadcast variable manager), which manages the Broadcast objects defined in a Spark application; currently only the TorrentBroadcast implementation is used.

Create a MapOutputTrackerWorker and give it a reference, looked up through the local RpcEnv, to the MapOutputTrackerMaster endpoint maintained on the driver; it is used to fetch the map output information recorded on the master side.

Create the ShuffleManager; currently only the SortShuffleManager subclass is instantiated. When a shuffle happens it sorts incoming records by their partition ID and writes them out to a single file.

Create the MemoryManager; depending on the configuration this is either a StaticMemoryManager or a UnifiedMemoryManager. It manages local memory usage and divides it between execution and storage according to the underlying allocation policy.

Create a BlockManager, wire it to the reference of the master endpoint, and register it with the local SparkEnv; it is the interface for putting and retrieving data blocks locally or remotely (memory/disk/off-heap).

Create the MetricsSystem (after setting spark.executor.id) and start it immediately; it periodically polls metric sources and writes them to the configured sinks.

Create an OutputCommitCoordinator and give it a reference, looked up through the local RpcEnv, to the OutputCommitCoordinatorEndpoint registered on the driver; it authorizes tasks to commit their output to HDFS on a first-committer-wins basis.

Assemble all the components created above into a new SparkEnv instance.

Overview of the built-in components

ShuffleManager

In Spark 2.4.x there is only one ShuffleManager implementation, SortShuffleManager: a sort-based shuffle in which all records are sorted by their target partition ID and written to a single file as the map-side output. The reduce side then reads its own contiguous range of that shuffle file. Intermediate shuffle files are produced when the map output is too large to stay in memory: the sorted chunks are spilled to disk files, and those smaller files are finally merged into one.

When the DAGScheduler builds the dependency graph of a Spark application it calls registerShuffle(...) to ask the manager for an appropriate ShuffleHandle, and while a ShuffleMapTask executes it obtains the matching sort-model writer (an implementation of ShuffleWriter) from the manager and writes the data out to its destination. There are currently three kinds of ShuffleHandle to choose from:

BypassMergeSortShuffleHandle: if the RDD has fewer partitions than the value of spark.shuffle.sort.bypassMergeThreshold and no map-side aggregation is needed, the writer does not buffer and sort the records; it writes numPartitions per-partition files and finally concatenates (not merges) them into one file. This avoids serializing and deserializing the data a second time to merge spilled files, at the cost of the reduce side opening more file handles and using more buffer space.

SerializedShuffleHandle and BaseShuffleHandle: both are described in detail in the "Sort model instances" section below; how the manager chooses among the three handles is shown right after the trait definition.

The ShuffleManager trait itself:
private[spark] trait ShuffleManager {

  /**
   * Called when a ShuffleDependency is created, to ask the manager for a ShuffleHandle
   * describing how the shuffle data will be handled.
   */
  def registerShuffle[K, V, C](
      shuffleId: Int,
      numMaps: Int,
      dependency: ShuffleDependency[K, V, C]): ShuffleHandle

  /** Get a writer for a given partition. Called on executors by map tasks. */
  def getWriter[K, V](handle: ShuffleHandle, mapId: Int, context: TaskContext): ShuffleWriter[K, V]

  /**
   * Get a reader for a range of reduce partitions (startPartition to endPartition-1, inclusive).
   * Called on executors by reduce tasks.
   */
  def getReader[K, C](
      handle: ShuffleHandle,
      startPartition: Int,
      endPartition: Int,
      context: TaskContext): ShuffleReader[K, C]

  /**
   * Remove a shuffle's metadata from the ShuffleManager.
   * @return true if the metadata removed successfully, otherwise false.
   */
  def unregisterShuffle(shuffleId: Int): Boolean

  /**
   * Return a resolver capable of retrieving shuffle block data based on block coordinates.
   */
  def shuffleBlockResolver: ShuffleBlockResolver

  /** Shut down this ShuffleManager. */
  def stop(): Unit
}
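
For completeness, this is (slightly condensed) how SortShuffleManager.registerShuffle in Spark 2.4 picks one of the three handles described in this section:

  override def registerShuffle[K, V, C](
      shuffleId: Int,
      numMaps: Int,
      dependency: ShuffleDependency[K, V, C]): ShuffleHandle = {
    if (SortShuffleWriter.shouldBypassMergeSort(conf, dependency)) {
      // Few partitions and no map-side combine: write one file per partition and concatenate.
      new BypassMergeSortShuffleHandle[K, V](
        shuffleId, numMaps, dependency.asInstanceOf[ShuffleDependency[K, V, V]])
    } else if (SortShuffleManager.canUseSerializedShuffle(dependency)) {
      // Serializer supports relocation, no aggregation, partition count within the 24-bit limit:
      // buffer map output in serialized form.
      new SerializedShuffleHandle[K, V](
        shuffleId, numMaps, dependency.asInstanceOf[ShuffleDependency[K, V, V]])
    } else {
      // Otherwise buffer map output in deserialized form.
      new BaseShuffleHandle(shuffleId, numMaps, dependency)
    }
  }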

The serialized sorting model in brief

The sort-based shuffle has two different output paths for producing its files, chosen as follows:

  • Serialized sorting, used when all of the following hold:
    • the shuffle dependency specifies no aggregation and does not require sorted output
    • the shuffle serializer supports relocating serialized objects (currently only Kryo and Spark SQL's custom serializers do)
    • the shuffle produces at most 16777216 (2^24) output partitions
  • Deserialized sorting:
    • all other cases

With the serialized model, records are serialized as soon as they are handed to the shuffle writer and remain in serialized form in the buffers during sorting, which has several advantages:

  • Sorting operates on serialized binary data rather than Java objects, which reduces memory consumption and GC pressure; it does require a serializer that can reorder serialized records without deserializing them.
  • The model uses a special, cache-efficient sorter, ShuffleExternalSorter, which sorts an array of compressed record pointers and partition IDs; each record occupies only 8 bytes in that array, so far more pending data fits in the sort buffer (a small packing sketch follows below).
  • The spill-merge step operates directly on serialized blocks belonging to the same partition, without deserializing them.
  • If the spill compression codec supports concatenating compressed blocks, the merge simply concatenates the serialized, compressed partition data to produce the final partition. This allows efficient copy methods such as NIO's transferTo to be used and avoids decompressing and copying buffers.

More details on these optimizations can be found in the upstream issue SPARK-7081.
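
A hedged sketch of the 8-byte layout mentioned above: conceptually, each record is represented by one Long holding the partition ID in the upper bits and the address of the serialized record in the lower bits, so sorting the Long array by partition ID never touches the record bytes. The 24-bit constant mirrors PackedRecordPointer's partition-ID limit; the object, method names and bit split below are illustrative, not Spark's API:

// Illustrative packing: 24 bits of partition id + 40 bits of record address in one Long.
object PackedPointerSketch {
  val MaximumPartitionId: Int = (1 << 24) - 1        // 16777215, hence the 2^24 partition limit

  def pack(partitionId: Int, recordAddress: Long): Long = {
    require(partitionId <= MaximumPartitionId, "partition id does not fit in 24 bits")
    (partitionId.toLong << 40) | (recordAddress & ((1L << 40) - 1))
  }

  def partitionId(packed: Long): Int = (packed >>> 40).toInt

  def main(args: Array[String]): Unit = {
    val pointers = Array(pack(7, 0x1234L), pack(2, 0x10L), pack(7, 0x8L), pack(0, 0x99L))
    // Sorting the packed longs groups records of the same partition together
    // without deserializing anything.
    println(pointers.sorted.map(partitionId).mkString(","))   // 0,2,7,7
  }
}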

Sort model instances

  • BypassMergeSortShuffleHandle
    Corresponds to the ShuffleWriter implementation BypassMergeSortShuffleWriter. With this writer each record is appended to the file of its target partition, then the per-partition files are concatenated into a single file, which is split into regions for the reduce side to read. Records are never kept in memory; serving and consuming the data goes through an org.apache.spark.shuffle.IndexShuffleBlockResolver instance.
    The drawback is obvious: if a stage has many reduce partitions, the same number of file handles must be open and the same number of fresh serializers created at once, consuming more memory. The ShuffleManager therefore only uses this model when all of the following hold:

no ordering is required
no map-side aggregation is required
the number of partitions is below spark.shuffle.sort.bypassMergeThreshold

There are proposals to remove this model entirely; see SPARK-6026. A small configuration sketch follows.
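
For reference, the threshold is an ordinary Spark property; a hedged example of lowering it so that only very small shuffles take the bypass path (200 is the default):

import org.apache.spark.SparkConf

// Shuffles with no map-side combine and at most this many reduce partitions use
// BypassMergeSortShuffleWriter; everything else goes through the sort-based writers.
val conf = new SparkConf().set("spark.shuffle.sort.bypassMergeThreshold", "64")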

  • SerializedShuffleHandle
    For the serialized sorting scenario the ShuffleManager creates an UnsafeShuffleWriter; all incoming records are serialized, buffered in memory, and sorted directly in that form. This model requires all of the following:

the serializer supports sorting serialized objects directly (relocation)
the upstream dependency needs no map-side aggregation
the dependency has at most 2^24 partitions (the largest encodable partition ID being 2^24 - 1)

  • BaseShuffleHandle
    For every case not covered by the preferred models above, the data is sorted in deserialized form by the writer implementation SortShuffleWriter, which uses considerably more memory.

The map output tracker

MapOutputTracker

MapOutputTrackerMaster

The master RpcEndpoint is started in the driver process and registered with the driver's RpcEnv; it tracks where the map output of every stage is located.
The DAGScheduler uses this class to register and unregister the status of ShuffleMapTask output, and uses its size statistics for reduce-side data locality.
ShuffleMapStage uses it to track which partitions are available or missing and to re-run the corresponding tasks as needed.
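
A hedged sketch of the driver-side calls (these are private[spark] methods normally invoked by the DAGScheduler, shown only to illustrate the flow; the shuffleId/numMaps values and the commented-out arguments are placeholders):

val tracker = SparkEnv.get.mapOutputTracker.asInstanceOf[MapOutputTrackerMaster]

// When a ShuffleMapStage is created, the shuffle is registered with its number of map tasks.
tracker.registerShuffle(shuffleId = 0, numMaps = 4)

// As each ShuffleMapTask finishes, its MapStatus (block manager address plus per-reduce
// output sizes) is recorded so that reduce tasks can later locate their input:
//   tracker.registerMapOutput(shuffleId, mapId, mapStatus)

// For reduce locality, the scheduler can ask which executors hold most of the map output
// of a given reduce partition:
//   tracker.getPreferredLocationsForShuffle(shuffleDependency, reducePartitionId)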

/**
 * The logic of the master tracker is fairly straightforward.
 */
private[spark] class MapOutputTrackerMaster(
    conf: SparkConf,
    broadcastManager: BroadcastManager,
    isLocal: Boolean)
  extends MapOutputTracker(conf) {
  // Enables reduce-side locality, so that reduce tasks are scheduled on workers that hold
  // more of the map output
  private val shuffleLocalityEnabled = conf.getBoolean("spark.shuffle.reduceLocality.enabled", true)
  // Maps each shuffle id to its ShuffleStatus
  val shuffleStatuses = new ConcurrentHashMap[Int, ShuffleStatus]().asScala
  // Message queue holding all pending requests for shuffle status
  private val mapOutputRequests = new LinkedBlockingQueue[GetMapOutputMessage]
  // Worker thread pool (8 threads by default) that answers those requests
  private val threadpool: ThreadPoolExecutor = {
    val numThreads = conf.getInt("spark.shuffle.mapOutput.dispatcher.numThreads", 8)
    val pool = ThreadUtils.newDaemonFixedThreadPool(numThreads, "map-output-dispatcher")
    for (i <- 0 until numThreads) {
      pool.execute(new MessageLoop)
    }
    pool
  }

MapOutputTrackerWorker

An instance of this class is created when an executor process initializes its SparkEnv. It exchanges RPC messages with the MapOutputTrackerMaster endpoint bound on the driver to obtain the map output information. It plays no role in applications started in local mode, where the tracker master and tracker worker share the same process and parent class, so the results can be read directly.

private[spark] class MapOutputTrackerWorker(conf: SparkConf) extends MapOutputTracker(conf) {

  /** Caches all the map output statuses fetched from the master. */
  val mapStatuses: Map[Int, Array[MapStatus]] =
    new ConcurrentHashMap[Int, Array[MapStatus]]().asScala

  /** Shuffle ids whose map output statuses are currently being fetched, to avoid duplicate fetches. */
  private val fetching = new HashSet[Int]
}
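
On the executor side the entry point used by BlockStoreShuffleReader is getMapSizesByExecutorId; a hedged sketch, with shuffleId, startPartition and endPartition assumed to be in scope:

// On a cache miss the worker tracker sends GetMapOutputStatuses(shuffleId) to the master
// endpoint, stores the returned MapStatus array in `mapStatuses`, and converts it into
// (BlockManagerId -> (shuffle block id, size)) pairs for the block fetcher.
val blocksByAddress =
  SparkEnv.get.mapOutputTracker.getMapSizesByExecutorId(shuffleId, startPartition, endPartition)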

Another post of mine covers how tasks are launched and executed: based on the data dependencies and the number of partitions, a Spark application creates a series of tasks that run on executors. There are two kinds of task, ShuffleMapTask and ResultTask, one per partition. When run, each eventually invokes its own override def runTask(context: TaskContext): U = {} method, which obtains the partition data it needs from local or remote storage through the BlockManager held in SparkEnv. If the data it depends on has not been produced yet, the computation is triggered: the task asks the ShuffleManager for a ShuffleReader for each partition and calls ShuffleReader.read(), which fetches the data through the underlying block transfer component BlockTransferService:

  /** Read the combined key-values for this reduce task */
  override def read(): Iterator[Product2[K, C]] = {
    // Step 1: fetch the shuffle blocks for this reduce task
    val wrappedStreams = new ShuffleBlockFetcherIterator(...)
    val serializerInstance = dep.serializer.newInstance()

    // Create a key/value iterator for each stream
    val recordIter = wrappedStreams.flatMap { case (blockId, wrappedStream) =>
      // Note: the asKeyValueIterator below wraps a key/value iterator inside of a
      // NextIterator. The NextIterator makes sure that close() is called on the
      // underlying InputStream when all records have been read.
      serializerInstance.deserializeStream(wrappedStream).asKeyValueIterator
    }
    // Step 2: update the shuffle read metrics
    val readMetrics = context.taskMetrics.createTempShuffleReadMetrics()
    val metricIter = CompletionIterator[(Any, Any), Iterator[(Any, Any)]](
      recordIter.map { record =>
        readMetrics.incRecordsRead(1)
        record
      },
      context.taskMetrics().mergeShuffleReadMetrics())

    // An interruptible iterator must be used here in order to support task cancellation
    val interruptibleIter = new InterruptibleIterator[(Any, Any)](context, metricIter)
    // Step 3: if an aggregator is defined, combine the fetched map output
    val aggregatedIter: Iterator[Product2[K, C]] = if (dep.aggregator.isDefined) {
      if (dep.mapSideCombine) {
        // We are reading values that are already combined
        val combinedKeyValuesIterator = interruptibleIter.asInstanceOf[Iterator[(K, C)]]
        dep.aggregator.get.combineCombinersByKey(combinedKeyValuesIterator, context)
      } else {
        // We don't know the value type, but also don't care -- the dependency *should*
        // have made sure its compatible w/ this aggregator, which will convert the value
        // type to the combined type C
        val keyValuesIterator = interruptibleIter.asInstanceOf[Iterator[(K, Nothing)]]
        dep.aggregator.get.combineValuesByKey(keyValuesIterator, context)
      }
    } else {
      interruptibleIter.asInstanceOf[Iterator[Product2[K, C]]]
    }
    // Step 4: if a key ordering is defined, sort the aggregated result
    val resultIter = dep.keyOrdering match {
      case Some(keyOrd: Ordering[K]) =>
        // Create an ExternalSorter to sort the data.
        val sorter =
          new ExternalSorter[K, C, C](context, ordering = Some(keyOrd), serializer = dep.serializer)
        sorter.insertAll(aggregatedIter)
        context.taskMetrics().incMemoryBytesSpilled(sorter.memoryBytesSpilled)
        context.taskMetrics().incDiskBytesSpilled(sorter.diskBytesSpilled)
        context.taskMetrics().incPeakExecutionMemory(sorter.peakMemoryUsedBytes)
        // Use completion callback to stop sorter if task was finished/cancelled.
        context.addTaskCompletionListener[Unit](_ => {
          sorter.stop()
        })
        CompletionIterator[Product2[K, C], Iterator[Product2[K, C]]](sorter.iterator, sorter.stop())
      case None =>
        aggregatedIter
    }
    // Step 5: return the result
    resultIter match {
      case _: InterruptibleIterator[Product2[K, C]] => resultIter
      case _ =>
        // Use another interruptible iterator here to support task cancellation as aggregator
        // or(and) sorter may have consumed previous interruptible iterator.
        new InterruptibleIterator[Product2[K, C]](context, resultIter)
    }
  }

The block manager: BlockManager

On both the driver and the executors a new BlockManager instance is created while SparkEnv is being instantiated, together with its RpcEndpoint (BlockManagerSlaveEndpoint), which is registered with the local RpcEnv. The class is the entry point for adding, removing and querying data blocks and block metadata on the configured media: memory, disk or external storage services.
The endpoints created alongside a BlockManager for cluster-wide coordination fall into two kinds. The first is BlockManagerMasterEndpoint, instantiated only on the driver; it keeps the information about every BlockManager (without directly updating the BlockManager objects themselves), so the total number of BlockManagers equals the number of CoarseGrainedExecutorBackends + 1. Its core definition is:

class BlockManagerMasterEndpoint(
    override val rpcEnv: RpcEnv,
    val isLocal: Boolean,
    conf: SparkConf,
    listenerBus: LiveListenerBus)
  extends ThreadSafeRpcEndpoint with Logging {

  // Mapping from block manager id to the block manager's information.
  private val blockManagerInfo = new mutable.HashMap[BlockManagerId, BlockManagerInfo]

  // Mapping from executor ID to block manager ID.
  private val blockManagerIdByExecutor = new mutable.HashMap[String, BlockManagerId]

  // The data managed by a BlockManager is stored as bytes; a large cached RDD, for instance,
  // is split into blocks that are transferred between endpoints by the underlying
  // NettyBlockTransferService [extends BlockTransferService]. This map therefore records every
  // known block and its locations. The BlockId prefix distinguishes the data type; the current
  // kinds are:
  //   val RDD = "rdd_([0-9]+)_([0-9]+)".r
  //   val SHUFFLE = "shuffle_([0-9]+)_([0-9]+)_([0-9]+)".r
  //   val SHUFFLE_DATA = "shuffle_([0-9]+)_([0-9]+)_([0-9]+).data".r
  //   val SHUFFLE_INDEX = "shuffle_([0-9]+)_([0-9]+)_([0-9]+).index".r
  //   val BROADCAST = "broadcast_([0-9]+)([_A-Za-z0-9]*)".r
  //   val TASKRESULT = "taskresult_([0-9]+)".r
  //   val STREAM = "input-([0-9]+)-([0-9]+)".r
  //   val TEMP_LOCAL = "temp_local_([-A-Fa-f0-9]+)".r
  //   val TEMP_SHUFFLE = "temp_shuffle_([-A-Fa-f0-9]+)".r
  //   val TEST = "test_(.*)".r
  private val blockLocations = new JHashMap[BlockId, mutable.HashSet[BlockManagerId]]
}

The second kind is BlockManagerSlaveEndpoint. Because the driver can also compute and aggregate data, both the driver and every executor create one of these RpcEndpoint instances to update their local BlockManager.

private[storage]
class BlockManagerSlaveEndpoint(
    override val rpcEnv: RpcEnv,
    blockManager: BlockManager,
    mapOutputTracker: MapOutputTracker)
  extends ThreadSafeRpcEndpoint with Logging {

  private val asyncThreadPool =
    ThreadUtils.newDaemonCachedThreadPool("block-manager-slave-async-thread-pool", 100)
  private implicit val asyncExecutionContext = ExecutionContext.fromExecutorService(asyncThreadPool)

  // Operations that involve removing blocks may be slow and should be done asynchronously
  override def receiveAndReply(context: RpcCallContext): PartialFunction[Any, Unit] = {
    case RemoveBlock(blockId) =>
      doAsync[Boolean]("removing block " + blockId, context) {
        blockManager.removeBlock(blockId)
        true
      }

    case RemoveRdd(rddId) =>
      doAsync[Int]("removing RDD " + rddId, context) {
        blockManager.removeRdd(rddId)
      }

    case RemoveShuffle(shuffleId) =>
      doAsync[Boolean]("removing shuffle " + shuffleId, context) {
        if (mapOutputTracker != null) {
          mapOutputTracker.unregisterShuffle(shuffleId)
        }
        SparkEnv.get.shuffleManager.unregisterShuffle(shuffleId)
      }

    case RemoveBroadcast(broadcastId, _) =>
      doAsync[Int]("removing broadcast " + broadcastId, context) {
        blockManager.removeBroadcast(broadcastId, tellMaster = true)
      }

    case GetBlockStatus(blockId, _) =>
      context.reply(blockManager.getStatus(blockId))

    case GetMatchingBlockIds(filter, _) =>
      context.reply(blockManager.getMatchingBlockIds(filter))

    case TriggerThreadDump =>
      context.reply(Utils.getThreadDump())

    case ReplicateBlock(blockId, replicas, maxReplicas) =>
      context.reply(blockManager.replicateBlock(blockId, replicas.toSet, maxReplicas))

  }

  override def onStop(): Unit = {
    asyncThreadPool.shutdownNow()
  }
}
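
A hedged sketch of the put/get entry points mentioned above. BlockManager is a private[spark] class, so this mirrors what Spark's own components (for example TorrentBroadcast) do rather than a public application API; TestBlockId is used purely for illustration:

import org.apache.spark.SparkEnv
import org.apache.spark.storage.{StorageLevel, TestBlockId}

val bm = SparkEnv.get.blockManager
val id = TestBlockId("demo-block")

// Store a value locally (tellMaster = true also reports the block to the master endpoint).
bm.putSingle(id, Array.fill(1024)(1.0), StorageLevel.MEMORY_ONLY, tellMaster = true)

// Read it back from the local store; BlockResult carries the value iterator and its size.
bm.getLocalValues(id).foreach { result =>
  println(s"${result.bytes} bytes read via ${result.readMethod}")
}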

The output commit coordinator: OutputCommitCoordinator

A new instance of this class is created on every node, driver or executor, and a new OutputCommitCoordinatorEndpoint is bound to the local RpcEnv; it authorizes tasks to commit their output data to HDFS.
On executors the instance holds a reference to the driver's OutputCommitCoordinatorEndpoint, so commit requests are forwarded to the driver.
It was introduced as the fix for JIRA SPARK-4879.

In short, when spark.speculation is enabled, slow tasks are re-launched, which can lead to directory conflicts for output written to HDFS: a partition's data files are first written to a temporary directory before being moved to the target directory, and a re-launched attempt may find that its temporary directory has already been completed and deleted by another attempt, causing the task to fail.

spark.speculation
If set to “true”, performs speculative execution of tasks. This means if one or more tasks are running slowly in a stage, they will be re-launched.
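
A hedged configuration example of the scenario the coordinator arbitrates (defaults are 0.75 and 1.5 for the last two settings; values here are only illustrative):

import org.apache.spark.SparkConf

// Speculation re-launches attempts of slow tasks; only the first attempt that asks to
// commit a given partition wins the right to write its output.
val conf = new SparkConf()
  .set("spark.speculation", "true")
  .set("spark.speculation.quantile", "0.9")   // fraction of tasks that must finish before speculating
  .set("spark.speculation.multiplier", "2")   // how many times slower than the median counts as slow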

BroadcastManager

The broadcast variable manager. It is instantiated on the driver and on every executor and manages the Broadcast instances defined in a Spark application. The user defines a broadcast variable like this:

 scala> val broadcastVar = sc.broadcast(Array(1, 2, 3))
 broadcastVar: org.apache.spark.broadcast.Broadcast[Array[Int]] = Broadcast(0)
 
 scala> broadcastVar.value
 res0: Array[Int] = Array(1, 2, 3)

The code above produces a Broadcast variable, broadcastVar, which is cached in the BlockManager of the current process's SparkEnv. If the object is large it is split into a series of BlockData chunks whose size is controlled by conf.getSizeAsKb("spark.broadcast.blockSize", "4m").toInt * 1024.

When user code on an executor reads the value of a broadcast variable it calls Broadcast.value(), which looks the object up locally through the BroadcastManager's cache (backed by the BlockManager); if it is not present locally, the blocks identified by the corresponding BroadcastBlockId are fetched from the driver or from other executors and cached locally.

private[spark] class BroadcastManager(
    val isDriver: Boolean,
    conf: SparkConf,
    securityManager: SecurityManager)
  extends Logging {

  private var initialized = false
  private var broadcastFactory: BroadcastFactory = null

  initialize()

  // Called by SparkContext or Executor before using Broadcast
  private def initialize() {
    synchronized {
      if (!initialized) {
        broadcastFactory = new TorrentBroadcastFactory
        broadcastFactory.initialize(isDriver, conf, securityManager)
        initialized = true
      }
    }
  }

  def stop() {
    broadcastFactory.stop()
  }

  private val nextBroadcastId = new AtomicLong(0)

  /** Cache holding references to broadcast values. */
  private[broadcast] val cachedValues = {
    new ReferenceMap(AbstractReferenceMap.HARD, AbstractReferenceMap.WEAK)
  }

  /** Create a new broadcast variable, stored as BlockData in the local BlockManager. */
  def newBroadcast[T: ClassTag](value_ : T, isLocal: Boolean): Broadcast[T] = {
    broadcastFactory.newBroadcast[T](value_, isLocal, nextBroadcastId.getAndIncrement())
  }

  def unbroadcast(id: Long, removeFromDriver: Boolean, blocking: Boolean) {
    broadcastFactory.unbroadcast(id, removeFromDriver, blocking)
  }
}

Current Spark versions default to TorrentBroadcast, a BitTorrent-like implementation, as the broadcast object managed by the BroadcastManager.
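
A hedged, simplified sketch of what "torrent-like" means here: the driver serializes the value and cuts it into spark.broadcast.blockSize-sized chunks, each stored as its own block; an executor that needs the value fetches the chunks (possibly from peers that already cached them) and reassembles them. The helpers below only illustrate the chunking idea and are not TorrentBroadcast's actual API:

// Illustrative only: split a serialized payload into fixed-size chunks, the way
// TorrentBroadcast stores a broadcast value as a series of BroadcastBlockId pieces.
def blockify(payload: Array[Byte], blockSizeBytes: Int = 4 * 1024 * 1024): Array[Array[Byte]] =
  payload.grouped(blockSizeBytes).toArray

def unblockify(chunks: Array[Array[Byte]]): Array[Byte] =
  chunks.flatten

val original = Array.fill[Byte](10 * 1024 * 1024)(1)   // a 10 MB "serialized" value
val chunks = blockify(original)                        // 3 chunks with the 4 MB default
assert(unblockify(chunks).sameElements(original))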

MemoryManager

The memory manager divides memory between execution memory and storage memory according to fixed rules. Execution memory is the memory used while running shuffles, joins, sorts and aggregations; storage memory is the memory used to cache data and to move internal data around the cluster. There is exactly one MemoryManager instance per JVM.

MemoryManager has two concrete subclasses:

  • StaticMemoryManager
    The static memory manager. The memory shares are controlled by the spark.shuffle.memoryFraction and spark.storage.memoryFraction properties; execution memory and storage memory are managed separately and never borrow from each other. This mode does not support off-heap allocation: even with spark.memory.offHeap.enabled=true only the name used when creating the manager changes, and the logic is the same as without off-heap allocation.

  • UnifiedMemoryManager
    The unified memory manager. The region shared by execution and storage is (heap size - 300MB) scaled by spark.memory.fraction (default 0.6), and within that region the storage share is set by spark.memory.storageFraction (default 0.5); by default storage therefore gets 0.6 * 0.5 = 0.3 of (heap size - 300MB). A worked example follows this list.
    If storage runs short it borrows as much of the free execution memory as possible; but as soon as execution needs its share back, the manager evicts cached blocks and the freed space is handed to the running tasks.
    Conversely, if execution runs short it borrows as much free storage memory as possible, but it never releases that memory on its own, because eviction on that side would be complicated; the current implementation is blunt: while execution occupies the space, new blocks that ask to be cached are simply dropped.
    Off-heap allocation is supported.
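
A worked example of the default split, assuming the defaults spark.memory.fraction = 0.6 and spark.memory.storageFraction = 0.5 and a hypothetical 4 GiB executor heap (the variable names are only illustrative):

val heapMiB          = 4096L
val reservedMiB      = 300L                      // fixed reserved memory
val usableMiB        = heapMiB - reservedMiB     // 3796 MiB
val unifiedMiB       = (usableMiB * 0.6).toLong  // 2277 MiB shared by execution + storage
val storageRegionMiB = (unifiedMiB * 0.5).toLong // 1138 MiB storage region
// Storage may borrow beyond storageRegionMiB while execution is idle, but cached blocks
// are evicted again as soon as execution claims its share back.
println(s"unified = $unifiedMiB MiB, storage region = $storageRegionMiB MiB")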

When the MemoryManager is initialized it creates the following two kinds of memory pool, used to manage storage memory and execution memory respectively.

StorageMemoryPool

The memory pool for storage memory. Memory is acquired and released per block, and the acquisition logic is straightforward:

  def acquireMemory(
      blockId: BlockId,
      numBytesToAcquire: Long,
      numBytesToFree: Long): Boolean = lock.synchronized {
    assert(numBytesToAcquire >= 0)
    assert(numBytesToFree >= 0)
    assert(memoryUsed <= poolSize)
    if (numBytesToFree > 0) {
      memoryStore.evictBlocksToFreeSpace(Some(blockId), numBytesToFree, memoryMode)
    }
    // NOTE: If the memory store evicts blocks, then those evictions will synchronously call
    // back into this StorageMemoryPool in order to free memory. Therefore, these variables
    // should have been updated.
    val enoughMemory = numBytesToAcquire <= memoryFree
    if (enoughMemory) {
      _memoryUsed += numBytesToAcquire
    }
    enoughMemory
  }

ExecutionMemoryPool

The memory pool for execution memory. It adjusts each task's allocation so that with N active tasks, every waiting task is guaranteed at least 1 / 2N and at most 1 / N of the pool. For example, with a 2 GiB execution pool and 4 active tasks, each task can claim between 256 MiB and 512 MiB.
Since N changes dynamically, the pool records the memory state of every active task and recomputes the 1 / 2N and 1 / N bounds for waiting tasks whenever that state changes.

  private[memory] def acquireMemory(
      numBytes: Long,
      taskAttemptId: Long,
      maybeGrowPool: Long => Unit = (additionalSpaceNeeded: Long) => Unit,
      computeMaxPoolSize: () => Long = () => poolSize): Long = lock.synchronized {
    assert(numBytes > 0, s"invalid number of bytes requested: $numBytes")

    // TODO: clean up this clunky method signature

    // Add this task to the taskMemory map just so we can keep an accurate count of the number
    // of active tasks, to let other tasks ramp down their memory in calls to `acquireMemory`
    if (!memoryForTask.contains(taskAttemptId)) {
      memoryForTask(taskAttemptId) = 0L
      // This will later cause waiting tasks to wake up and check numTasks again
      lock.notifyAll()
    }

    // Keep looping until we're either sure that we don't want to grant this request (because this
    // task would have more than 1 / numActiveTasks of the memory) or we have enough free
    // memory to give it (we always let each task get at least 1 / (2 * numActiveTasks)).
    // TODO: simplify this to limit each task to its own slot
    while (true) {
      val numActiveTasks = memoryForTask.keys.size
      val curMem = memoryForTask(taskAttemptId)

      // In every iteration of this loop, we should first try to reclaim any borrowed execution
      // space from storage. This is necessary because of the potential race condition where new
      // storage blocks may steal the free execution memory that this task was waiting for.
      maybeGrowPool(numBytes - memoryFree)

      // Maximum size the pool would have after potentially growing the pool.
      // This is used to compute the upper bound of how much memory each task can occupy. This
      // must take into account potential free memory as well as the amount this pool currently
      // occupies. Otherwise, we may run into SPARK-12155 where, in unified memory management,
      // we did not take into account space that could have been freed by evicting cached blocks.
      val maxPoolSize = computeMaxPoolSize()
      val maxMemoryPerTask = maxPoolSize / numActiveTasks
      val minMemoryPerTask = poolSize / (2 * numActiveTasks)

      // How much we can grant this task; keep its share within 0 <= X <= 1 / numActiveTasks
      val maxToGrant = math.min(numBytes, math.max(0, maxMemoryPerTask - curMem))
      // Only give it as much memory as is free, which might be none if it reached 1 / numTasks
      val toGrant = math.min(maxToGrant, memoryFree)

      // We want to let each task get at least 1 / (2 * numActiveTasks) before blocking;
      // if we can't give it this much now, wait for other tasks to free up memory
      // (this happens if older tasks allocated lots of memory before N grew)
      if (toGrant < numBytes && curMem + toGrant < minMemoryPerTask) {
        logInfo(s"TID $taskAttemptId waiting for at least 1/2N of $poolName pool to be free")
        lock.wait()
      } else {
        memoryForTask(taskAttemptId) += toGrant
        return toGrant
      }
    }
    0L  // Never reached
  }