Spark: Building the SparkEnv Instance

SparkEnv

A SparkEnv instance is created when the Driver and each Executor start. It provides the facilities the current node needs to do its work, such as caching exchanged data locally, managing shuffle files, and tracking the output of map tasks. It instantiates all the objects a running Spark instance needs (on both the master and the worker side), and user code can obtain the SparkEnv instance through a global accessor, so it may be shared by multiple threads. There are several ways to get hold of the instance; for example, once a SparkContext has been created, the following call returns it:

SparkEnv.get
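As a minimal sketch of how this accessor might be used (assuming a SparkContext is already running, e.g. in spark-shell, so the driver-side SparkEnv exists), the returned object exposes the components discussed in the rest of this post as plain fields:

import org.apache.spark.SparkEnv

// Assumes an active SparkContext; on the driver, SparkEnv.get returns the driver-side env.
val env: SparkEnv = SparkEnv.get
println(env.executorId)                    // "driver" on the driver side
println(env.serializer.getClass.getName)   // the serializer chosen via spark.serializer
println(env.blockManager.blockManagerId)   // identity of this node's BlockManager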

SparkEnv on Executor

When a user submits an application to the cluster, the corresponding Driver process is started on some Worker. When the cluster then tries to launch an Executor for this application on a Worker node, it sends a LaunchExecutor(masterUrl, appId, execId, appDesc, cores_, memory_) message to the Worker endpoint (an RpcEndpoint). While handling that message, the Worker process creates and starts an ExecutorRunner instance, which internally runs a thread that spawns a child process hosting the per-executor backend, CoarseGrainedExecutorBackend. When the CoarseGrainedExecutorBackend process starts, it calls the following method, which creates the SparkEnv:

private[spark] object CoarseGrainedExecutorBackend extends Logging {
  private def run(
      driverUrl: String,
      executorId: String,
      hostname: String,
      cores: Int,
      appId: String,
      workerUrl: Option[String],
      userClassPath: Seq[URL]) {
      
    Utils.initDaemon(log)

    SparkHadoopUtil.get.runAsSparkUser { () =>
      // Debug code
      Utils.checkHost(hostname)

      // Bootstrap to fetch the driver's Spark properties.
      val executorConf = new SparkConf
      val fetcher = RpcEnv.create(
        "driverPropsFetcher",
        hostname,
        -1,
        executorConf,
        new SecurityManager(executorConf),
        clientMode = true)
      val driver = fetcher.setupEndpointRefByURI(driverUrl)
      val cfg = driver.askSync[SparkAppConfig](RetrieveSparkAppConfig)
      val props = cfg.sparkProperties ++ Seq[(String, String)](("spark.app.id", appId))
      fetcher.shutdown()

      // Create SparkEnv using properties we fetched from the driver.
      val driverConf = new SparkConf()
      
      for ((key, value) <- props) {
        // this is required for SSL in standalone mode
        if (SparkConf.isExecutorStartupConf(key)) {
          driverConf.setIfMissing(key, value)
        } else {
          driverConf.set(key, value)
        }
      }

      cfg.hadoopDelegationCreds.foreach { tokens =>
        SparkHadoopUtil.get.addDelegationTokens(tokens, driverConf)
      }

      val env = SparkEnv.createExecutorEnv(
        driverConf, executorId, hostname, cores, cfg.ioEncryptionKey, isLocal = false)

      env.rpcEnv.setupEndpoint("Executor", new CoarseGrainedExecutorBackend(
        env.rpcEnv, driverUrl, executorId, hostname, cores, userClassPath, env))
      workerUrl.foreach { url =>
        env.rpcEnv.setupEndpoint("WorkerWatcher", new WorkerWatcher(env.rpcEnv, url))
      }
      env.rpcEnv.awaitTermination()
    }
  }
  
}

As the code above shows, when the executor backend starts it obtains the driver-side information by exchanging RPC messages with the Driver, and passes it as arguments to createExecutorEnv(...) when creating the executor-side SparkEnv, as shown below:

  private[spark] def createExecutorEnv(
      conf: SparkConf,
      executorId: String,
      hostname: String,
      numCores: Int,
      ioEncryptionKey: Option[Array[Byte]],
      isLocal: Boolean): SparkEnv = {
    val env = create(
      conf,
      executorId,
      hostname, // bindAddress: the local address the Netty TransportServer binds to
      hostname, // advertiseAddress: the address advertised to other endpoints for communication
      None,
      isLocal,
      numCores,
      ioEncryptionKey
    )
    SparkEnv.set(env)
    env
  }

SparkEnv on Driver

Compared with the executor side, the driver passes two extra arguments when creating its SparkEnv:

listenerBus: LiveListenerBus
Asynchronously delivers SparkListenerEvent messages to the registered SparkListener objects.

mockOutputCommitCoordinator: Option[OutputCommitCoordinator]
Decides whether tasks are authorized to commit their output to HDFS, using a first-committer-wins policy; when it is None (the default), a real coordinator is created inside create().

  private[spark] def createDriverEnv(
      conf: SparkConf,
      isLocal: Boolean,
      listenerBus: LiveListenerBus,
      numCores: Int,
      mockOutputCommitCoordinator: Option[OutputCommitCoordinator] = None): SparkEnv = {
    assert(conf.contains(DRIVER_HOST_ADDRESS),
      s"${DRIVER_HOST_ADDRESS.key} is not set on the driver!")
    assert(conf.contains("spark.driver.port"), "spark.driver.port is not set on the driver!")
    val bindAddress = conf.get(DRIVER_BIND_ADDRESS)
    val advertiseAddress = conf.get(DRIVER_HOST_ADDRESS)
    val port = conf.get("spark.driver.port").toInt
    val ioEncryptionKey = if (conf.get(IO_ENCRYPTION_ENABLED)) {
      Some(CryptoStreamUtils.createKey(conf))
    } else {
      None
    }
    create(
      conf,
      SparkContext.DRIVER_IDENTIFIER,
      bindAddress,
      advertiseAddress,
      Option(port),
      isLocal,
      numCores,
      ioEncryptionKey,
      listenerBus = listenerBus,
      mockOutputCommitCoordinator = mockOutputCommitCoordinator
    )
  }

Creating the SparkEnv

Whether the SparkEnv is created on the driver or on an executor, both paths end up in the same create(...) method, shown below.

  private def create(
      conf: SparkConf,
      executorId: String, // identifies whether this env is for the driver or an executor
      bindAddress: String,
      advertiseAddress: String,
      port: Option[Int],
      isLocal: Boolean,
      numUsableCores: Int,
      ioEncryptionKey: Option[Array[Byte]],
      listenerBus: LiveListenerBus = null,
      mockOutputCommitCoordinator: Option[OutputCommitCoordinator] = None): SparkEnv = {

    val isDriver = executorId == SparkContext.DRIVER_IDENTIFIER

    // Security manager used by the RpcEnv, the BroadcastManager and the block transfer service below.
    val securityManager = new SecurityManager(conf, ioEncryptionKey)

    val systemName = if (isDriver) driverSystemName else executorSystemName
    val rpcEnv = RpcEnv.create(systemName, bindAddress, advertiseAddress, port.getOrElse(-1), conf,
      securityManager, numUsableCores, !isDriver)

    // Figure out which port RpcEnv actually bound to in case the original port is 0 or occupied.
    if (isDriver) {
      conf.set("spark.driver.port", rpcEnv.address.port.toString)
    }

    val serializer = instantiateClassFromConf[Serializer](
      "spark.serializer", "org.apache.spark.serializer.JavaSerializer")
    logDebug(s"Using serializer: ${serializer.getClass}")

    val serializerManager = new SerializerManager(serializer, conf, ioEncryptionKey)

    val closureSerializer = new JavaSerializer(conf)

    def registerOrLookupEndpoint(
        name: String, endpointCreator: => RpcEndpoint):
      RpcEndpointRef = {
      if (isDriver) {
        logInfo("Registering " + name)
        rpcEnv.setupEndpoint(name, endpointCreator)
      } else {
        RpcUtils.makeDriverRef(name, conf, rpcEnv)
      }
    }

    val broadcastManager = new BroadcastManager(isDriver, conf, securityManager)

    val mapOutputTracker = if (isDriver) {
      new MapOutputTrackerMaster(conf, broadcastManager, isLocal)
    } else {
      new MapOutputTrackerWorker(conf)
    }

    // Have to assign trackerEndpoint after initialization as MapOutputTrackerEndpoint
    // requires the MapOutputTracker itself
    mapOutputTracker.trackerEndpoint = registerOrLookupEndpoint(MapOutputTracker.ENDPOINT_NAME,
      new MapOutputTrackerMasterEndpoint(
        rpcEnv, mapOutputTracker.asInstanceOf[MapOutputTrackerMaster], conf))

    // Let the user specify short names for shuffle managers
    val shortShuffleMgrNames = Map(
      "sort" -> classOf[org.apache.spark.shuffle.sort.SortShuffleManager].getName,
      "tungsten-sort" -> classOf[org.apache.spark.shuffle.sort.SortShuffleManager].getName)
    val shuffleMgrName = conf.get("spark.shuffle.manager", "sort")
    val shuffleMgrClass =
      shortShuffleMgrNames.getOrElse(shuffleMgrName.toLowerCase(Locale.ROOT), shuffleMgrName)
    val shuffleManager = instantiateClass[ShuffleManager](shuffleMgrClass)

    val useLegacyMemoryManager = conf.getBoolean("spark.memory.useLegacyMode", false)
    val memoryManager: MemoryManager =
      if (useLegacyMemoryManager) {
        new StaticMemoryManager(conf, numUsableCores)
      } else {
        UnifiedMemoryManager(conf, numUsableCores)
      }

    val blockManagerPort = if (isDriver) {
      conf.get(DRIVER_BLOCK_MANAGER_PORT)
    } else {
      conf.get(BLOCK_MANAGER_PORT)
    }

    val blockTransferService =
      new NettyBlockTransferService(conf, securityManager, bindAddress, advertiseAddress,
        blockManagerPort, numUsableCores)

    val blockManagerMaster = new BlockManagerMaster(registerOrLookupEndpoint(
      BlockManagerMaster.DRIVER_ENDPOINT_NAME,
      new BlockManagerMasterEndpoint(rpcEnv, isLocal, conf, listenerBus)),
      conf, isDriver)

    // NB: blockManager is not valid until initialize() is called later.
    val blockManager = new BlockManager(executorId, rpcEnv, blockManagerMaster,
      serializerManager, conf, memoryManager, mapOutputTracker, shuffleManager,
      blockTransferService, securityManager, numUsableCores)

    val metricsSystem = if (isDriver) {
      // Don't start metrics system right now for Driver.
      // We need to wait for the task scheduler to give us an app ID.
      // Then we can start the metrics system.
      MetricsSystem.createMetricsSystem("driver", conf, securityManager)
    } else {
      // We need to set the executor ID before the MetricsSystem is created because sources and
      // sinks specified in the metrics configuration file will want to incorporate this executor's
      // ID into the metrics they report.
      conf.set("spark.executor.id", executorId)
      val ms = MetricsSystem.createMetricsSystem("executor", conf, securityManager)
      ms.start()
      ms
    }

    val outputCommitCoordinator = mockOutputCommitCoordinator.getOrElse {
      new OutputCommitCoordinator(conf, isDriver)
    }
    val outputCommitCoordinatorRef = registerOrLookupEndpoint("OutputCommitCoordinator",
      new OutputCommitCoordinatorEndpoint(rpcEnv, outputCommitCoordinator))
    outputCommitCoordinator.coordinatorRef = Some(outputCommitCoordinatorRef)

    val envInstance = new SparkEnv(
      executorId,
      rpcEnv,
      serializer,
      closureSerializer,
      serializerManager,
      mapOutputTracker,
      shuffleManager,
      broadcastManager,
      blockManager,
      securityManager,
      metricsSystem,
      memoryManager,
      outputCommitCoordinator,
      conf)

    // Add a reference to tmp dir created by driver, we will delete this tmp dir when stop() is
    // called, and we only need to do it for driver. Because driver may run as a service, and if we
    // don't delete this tmp dir when sc is stopped, then will create too many tmp dirs.
    if (isDriver) {
      val sparkFilesDir = Utils.createTempDir(Utils.getLocalDir(conf), "userFiles").getAbsolutePath
      envInstance.driverTmpDir = Some(sparkFilesDir)
    }

    envInstance
  }
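The create() method above relies on two small reflection helpers that are not shown in the excerpt: instantiateClassFromConf reads a class name from a configuration key and delegates to instantiateClass, which builds the component (serializer, shuffle manager, ...) reflectively. A simplified sketch of that logic, assuming the Spark 2.4 behavior of trying a (SparkConf, isDriver) constructor first, then (SparkConf), then a no-argument constructor (this is a paraphrase, not the exact source):

import org.apache.spark.SparkConf

// Simplified sketch of SparkEnv's private instantiateClass helper: prefer a
// (SparkConf, Boolean) constructor, fall back to (SparkConf), then to no-arg.
def instantiateClass[T](className: String, conf: SparkConf, isDriver: Boolean): T = {
  val cls = Class.forName(className, true, Thread.currentThread.getContextClassLoader)
  try {
    cls.getConstructor(classOf[SparkConf], java.lang.Boolean.TYPE)
      .newInstance(conf, java.lang.Boolean.valueOf(isDriver))
      .asInstanceOf[T]
  } catch {
    case _: NoSuchMethodException =>
      try {
        cls.getConstructor(classOf[SparkConf]).newInstance(conf).asInstanceOf[T]
      } catch {
        case _: NoSuchMethodException =>
          cls.getConstructor().newInstance().asInstanceOf[T]
      }
  }
}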

If the executorId passed in equals SparkContext.DRIVER_IDENTIFIER, i.e. the environment is being created for the driver, the method performs the following steps:

Create a SecurityManager, responsible for cluster- and application-level security.

Create the RpcEnv from the given bind address and advertise address (bindAddress & advertiseAddress) and start the Netty-based server that listens for incoming connections.

Create a SerializerManager, which configures serialization, compression and encryption for Spark's components and automatically picks the serializer used for shuffle data.

Create a JavaSerializer for serializing closures; it hands out SerializerInstance objects backed by Java's built-in serialization, and a fresh instance must be used per thread.

Create a BroadcastManager (the broadcast-variable manager), which manages the Broadcast objects defined in the application; currently only the TorrentBroadcast implementation is used.

Create a MapOutputTrackerMaster and register its endpoint with the local RpcEnv; it tracks the output locations of the map tasks of every stage.

Create a ShuffleManager; currently only the SortShuffleManager subclass is instantiated. During a shuffle it sorts the incoming records by their partition ID and writes them out to a single file.

Create a MemoryManager; depending on the configuration this is either a StaticMemoryManager or a UnifiedMemoryManager. It manages local memory usage, splitting it between execution and storage according to the underlying allocation policy.

Create a BlockManager, registering the master endpoint with the local RpcEnv; it provides the interface for putting and retrieving data blocks locally or remotely (memory/disk/off-heap).

Create a MetricsSystem, which periodically polls metrics from the registered sources and ships them to the configured sinks.

Create an OutputCommitCoordinator and register its endpoint with the local RpcEnv; it decides whether tasks may commit their output to HDFS, first committer wins.

Combine all of the components above into a new SparkEnv instance.

Create a userFiles directory under the application's working directory to hold the external files supplied by the user through the relevant options.

If the environment is being created for an executor, the method performs the following steps:

Create a SecurityManager, responsible for cluster- and application-level security.

Create the RpcEnv from the given bind address and advertise address (bindAddress & advertiseAddress), but in client mode: no server socket is started, the endpoint only connects out to others.

Create a SerializerManager, which configures serialization, compression and encryption for Spark's components and automatically picks the serializer used for shuffle data.

Create a JavaSerializer for serializing closures; it hands out SerializerInstance objects backed by Java's built-in serialization, and a fresh instance must be used per thread.

Create a BroadcastManager (the broadcast-variable manager), which manages the Broadcast objects defined in the application; currently only the TorrentBroadcast implementation is used.

Create a MapOutputTrackerWorker and hand it a reference, looked up through the local RpcEnv, to the MapOutputTrackerMaster endpoint maintained on the driver; it is used to fetch the map output information recorded on the master side.

Create a ShuffleManager; currently only the SortShuffleManager subclass is instantiated. During a shuffle it sorts the incoming records by their partition ID and writes them out to a single file.

Create a MemoryManager; depending on the configuration this is either a StaticMemoryManager or a UnifiedMemoryManager. It manages local memory usage, splitting it between execution and storage according to the underlying allocation policy.

Create a BlockManager, hand it the reference to the master-side endpoint, and register it with the local SparkEnv; it provides the interface for putting and retrieving data blocks locally or remotely (memory/disk/off-heap).

Create a MetricsSystem (started immediately on executors), which periodically polls metrics from the registered sources and ships them to the configured sinks.

Create an OutputCommitCoordinator and hand it a reference, looked up through the local RpcEnv, to the OutputCommitCoordinatorEndpoint registered on the driver; it decides whether tasks may commit their output to HDFS, first committer wins.

Combine all of the components above into a new SparkEnv instance.

Overview of the built-in components

ShuffleManager

In Spark 2.4.x there is a single ShuffleManager implementation, SortShuffleManager: a sort-based shuffle in which incoming records are sorted by their partition ID and written to a single file as the map-side output. The reduce side then reads its own contiguous block of that shuffle file. Because the map output is usually too large to hold in memory, sorted chunks are spilled to disk, and these smaller spill files are finally merged into one output file.

When the DAGScheduler builds the application's dependency graph and a ShuffleDependency is created, registerShuffle(...) is called to ask the manager for an appropriate handler, a ShuffleHandle. Later, when a ShuffleMapTask runs, it obtains the matching writer (an implementation of ShuffleWriter) from the manager and writes its output to the target location. Three kinds of ShuffleHandle are currently available:

BypassMergeSortShuffleHandle: if no map-side aggregation is required and the number of reduce partitions is below the value of spark.shuffle.sort.bypassMergeThreshold, map-side sorting is skipped: the writer emits numPartitions per-partition files directly and finally concatenates (not merges) them into one file. This saves the extra serialization and deserialization passes over the partition data, at the cost of keeping one open file and one serializer per reduce partition while writing, which uses more file handles and buffer memory.

SerializedShuffleHandle: records are sorted in their serialized binary form (details below).

BaseShuffleHandle: the fallback; records are sorted in deserialized form (details below).

The ShuffleManager interface that hands out these handles and the matching writers and readers looks as follows:

private[spark] trait ShuffleManager {

  /**
   * Called when a ShuffleDependency is created; asks the manager for a ShuffleHandle that
   * describes how the shuffle data will be handled.
   */
  def registerShuffle[K, V, C](
      shuffleId: Int,
      numMaps: Int,
      dependency: ShuffleDependency[K, V, C]): ShuffleHandle

  /** Get a writer for the given partition's data. Called on executors by map tasks. */
  def getWriter[K, V](handle: ShuffleHandle, mapId: Int, context: TaskContext): ShuffleWriter[K, V]

  /**
   * Get a reader for a range of reduce partitions (startPartition to endPartition-1, inclusive).
   * Called on executors by reduce tasks.
   */
  def getReader[K, C](
      handle: ShuffleHandle,
      startPartition: Int,
      endPartition: Int,
      context: TaskContext): ShuffleReader[K, C]

  /**
   * Remove a shuffle's metadata from the ShuffleManager.
   * @return true if the metadata removed successfully, otherwise false.
   */
  def unregisterShuffle(shuffleId: Int): Boolean

  /**
   * Return a resolver capable of retrieving shuffle block data based on block coordinates.
   */
  def shuffleBlockResolver: ShuffleBlockResolver

  /** Shut down this ShuffleManager. */
  def stop(): Unit
}

Serialized sorting in brief

Sort-based shuffle has two different output paths when producing its files, chosen by the following conditions:

  • Serialized sorting, used when all of the following hold:
    • the shuffle dependency specifies neither aggregation nor an ordering of the output
    • the shuffle serializer supports relocation of serialized objects (currently only Kryo and Spark SQL's custom serializers do)
    • the shuffle produces no more than 16777216 (2^24) output partitions
  • Deserialized sorting:
    • all other cases

In the serialized path, records are serialized as soon as they are handed to the shuffle writer and stay in serialized form in the in-memory buffer throughout sorting. This has several advantages:

  • Sorting operates on serialized binary data rather than Java objects, which reduces memory consumption and GC pressure; it requires a serializer that can reorder serialized records without deserializing them.
  • The sorter is a specialized, cache-efficient ShuffleExternalSorter that sorts an array of compressed record pointers and partition IDs, so each record occupies only 8 bytes in the sort array and far more records fit into memory (see the sketch after this list).
  • The spill-merge procedure operates directly on serialized blocks belonging to the same partition, without deserializing them.
  • If the spill compression codec supports concatenation of compressed blocks, the merge simply concatenates the serialized, compressed per-partition spill data to produce the final partitions. This allows efficient copy methods such as NIO's transferTo and avoids decompressing or copying through intermediate buffers.
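To make the 8-bytes-per-record point concrete, here is a small illustrative sketch of packing a partition ID together with a record address into a single Long so that sorting the longs orders the records by partition, in the spirit of Spark's PackedRecordPointer (the exact bit layout used by Spark differs; this is only an illustration):

// Illustrative 8-byte sort key: [24-bit partition id | 40-bit record address].
object PackedPointerSketch {
  private val PartitionBits = 24
  private val AddressBits   = 64 - PartitionBits
  private val AddressMask   = (1L << AddressBits) - 1

  def pack(partitionId: Int, address: Long): Long = {
    require(partitionId >= 0 && partitionId < (1 << PartitionBits))
    (partitionId.toLong << AddressBits) | (address & AddressMask)
  }

  def partitionId(packed: Long): Int = (packed >>> AddressBits).toInt
  def address(packed: Long): Long = packed & AddressMask

  def main(args: Array[String]): Unit = {
    // Sorting the packed longs orders records by partition id first,
    // which is exactly what the shuffle sorter needs.
    val records = Seq(pack(5, 1024L), pack(1, 4096L), pack(5, 16L), pack(0, 8L))
    records.sorted.foreach(p => println(s"partition=${partitionId(p)} addr=${address(p)}"))
  }
}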

More details on these optimizations can be found in the upstream issue SPARK-7081.

The handles in detail

  • BypassMergeSortShuffleHandle
    Corresponds to the ShuffleWriter implementation BypassMergeSortShuffleWriter. This writer appends each incoming record to the file of its target partition and then concatenates these per-partition files into a single file, which is divided into regions for the reduce side to read. The records are never buffered in memory as deserialized objects; serving and consuming the data goes through an org.apache.spark.shuffle.IndexShuffleBlockResolver instance.
    The drawback is obvious: if a stage has many reduce partitions, just as many files and serializers are open at the same time, which costs a lot of memory. The ShuffleManager therefore only chooses this handle when all of the following hold:

No ordering is required
No aggregation is required
The number of partitions is below the value of spark.shuffle.sort.bypassMergeThreshold

There are proposals to remove this code path entirely; see SPARK-6026.

  • SerializedShuffleHandle
    The serialized-sorting case. The ShuffleManager creates an UnsafeShuffleWriter for it; all incoming records are serialized immediately, buffered in memory, and sorted directly in that form. This path is used only when all of the following conditions hold:

The serializer in use supports sorting serialized objects directly (relocation)
The upstream dependency does not require map-side aggregation
The dependency needs at most 2^24 (16777216) output partitions

  • BaseShuffleHandle
    Used in all remaining cases: the data is sorted in deserialized form by the corresponding writer, SortShuffleWriter, which uses considerably more memory. The choice among the three handles is summarized in the sketch below.
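The conditions listed above can be condensed into one decision function. The following is a minimal, self-contained sketch of the selection rules (slightly simplified); the real logic lives in SortShuffleManager.registerShuffle and its helper predicates in Spark 2.4, and the parameter names here are illustrative stand-ins for fields of ShuffleDependency and SparkConf, not the actual Spark API:

// Illustrative summary of how SortShuffleManager picks a ShuffleHandle.
def chooseShuffleHandle(
    mapSideCombine: Boolean,               // map-side aggregation requested?
    noAggregationOrOrdering: Boolean,      // neither an aggregator nor a key ordering defined
    serializerSupportsRelocation: Boolean, // e.g. Kryo, Spark SQL's custom serializers
    numPartitions: Int,
    bypassMergeThreshold: Int = 200        // spark.shuffle.sort.bypassMergeThreshold
  ): String = {
  if (!mapSideCombine && numPartitions <= bypassMergeThreshold) {
    "BypassMergeSortShuffleHandle"   // one file per reduce partition, concatenated at the end
  } else if (serializerSupportsRelocation && noAggregationOrOrdering &&
             numPartitions <= (1 << 24)) {
    "SerializedShuffleHandle"        // sort serialized bytes (UnsafeShuffleWriter)
  } else {
    "BaseShuffleHandle"              // sort deserialized records (SortShuffleWriter)
  }
}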

Map task output tracker

MapOutputTracker

MapOutputTrackerMaster

The master-side RpcEndpoint is started in the driver process and registered with the driver's RpcEnv; it tracks where the map tasks of each stage wrote their output.
The DAGScheduler uses this class to register and unregister the output status of ShuffleMap tasks, and uses the collected size statistics to determine preferred locations for reduce tasks (reduce-side data locality).
ShuffleMapStage uses it to track which partitions are available or missing, so that only the necessary tasks are (re)run.

/**
 * The logic of the master-side tracker is fairly straightforward.
 */
private[spark] class MapOutputTrackerMaster(
    conf: SparkConf,
    broadcastManager: BroadcastManager,
    isLocal: Boolean)
  extends MapOutputTracker(conf) {
  // Whether reduce-side locality is enabled: reduce tasks are preferentially scheduled on
  // the worker nodes that hold most of the map output for their shuffle.
  private val shuffleLocalityEnabled = conf.getBoolean("spark.shuffle.reduceLocality.enabled", true)
  // Maps each shuffle id to its ShuffleStatus.
  val shuffleStatuses = new ConcurrentHashMap[Int, ShuffleStatus]().asScala
  // Message queue holding all pending requests for shuffle status.
  private val mapOutputRequests = new LinkedBlockingQueue[GetMapOutputMessage]
  // Worker thread pool (8 threads by default) that answers the queued requests.
  private val threadpool: ThreadPoolExecutor = {
    val numThreads = conf.getInt("spark.shuffle.mapOutput.dispatcher.numThreads", 8)
    val pool = ThreadUtils.newDaemonFixedThreadPool(numThreads, "map-output-dispatcher")
    for (i <- 0 until numThreads) {
      pool.execute(new MessageLoop)
    }
    pool
  }
  // ...
}

MapOutputTrackerWorker

An instance of this class is created when SparkEnv is initialized in an executor process; it exchanges RPC messages with the MapOutputTrackerMaster endpoint bound on the driver to obtain information about map task outputs. It plays no role in applications running in local mode, because there the tracker master lives in the same process (both trackers share the same parent class) and the output information can be read directly.

private[spark] class MapOutputTrackerWorker(conf: SparkConf) extends MapOutputTracker(conf) {

  /** Caches the map output statuses fetched from the master. */
  val mapStatuses: Map[Int, Array[MapStatus]] =
    new ConcurrentHashMap[Int, Array[MapStatus]]().asScala

  /** Shuffle ids for which a fetch of the map output statuses is currently in flight. */
  private val fetching = new HashSet[Int]
}

As described in another post of mine on how tasks are launched and executed, a Spark application creates a series of tasks to be executed on the executors, driven by the data dependencies and the number of partitions. There are two kinds of tasks, ShuffleMapTask and ResultTask, one task per partition. When a task runs it eventually invokes its own override def runTask(context: TaskContext): U = {} method. Through the BlockManager held in SparkEnv, this method fetches the partition data it needs from local or remote storage; if the data it depends on has not been produced yet, the upstream computation is triggered. For shuffled input it obtains a ShuffleReader for each partition from the ShuffleManager and calls ShuffleReader.read(), which pulls the blocks through the underlying block transfer component, BlockTransferService:

  /** Read the combined key-values for this reduce task */
  override def read(): Iterator[Product2[K, C]] = {
    // Step 1: fetch the shuffle blocks for this reduce partition
    val wrappedStreams = new ShuffleBlockFetcherIterator(...)
    val serializerInstance = dep.serializer.newInstance()

    // Create a key/value iterator for each stream
    val recordIter = wrappedStreams.flatMap { case (blockId, wrappedStream) =>
      // Note: the asKeyValueIterator below wraps a key/value iterator inside of a
      // NextIterator. The NextIterator makes sure that close() is called on the
      // underlying InputStream when all records have been read.
      serializerInstance.deserializeStream(wrappedStream).asKeyValueIterator
    }
    // Step 2: update the shuffle read metrics
    val readMetrics = context.taskMetrics.createTempShuffleReadMetrics()
    val metricIter = CompletionIterator[(Any, Any), Iterator[(Any, Any)]](
      recordIter.map { record =>
        readMetrics.incRecordsRead(1)
        record
      },
      context.taskMetrics().mergeShuffleReadMetrics())

    // An interruptible iterator must be used here in order to support task cancellation
    val interruptibleIter = new InterruptibleIterator[(Any, Any)](context, metricIter)
    // Step 3: if an aggregator is defined, aggregate the fetched map output
    val aggregatedIter: Iterator[Product2[K, C]] = if (dep.aggregator.isDefined) {
      if (dep.mapSideCombine) {
        // We are reading values that are already combined
        val combinedKeyValuesIterator = interruptibleIter.asInstanceOf[Iterator[(K, C)]]
        dep.aggregator.get.combineCombinersByKey(combinedKeyValuesIterator, context)
      } else {
        // We don't know the value type, but also don't care -- the dependency *should*
        // have made sure its compatible w/ this aggregator, which will convert the value
        // type to the combined type C
        val keyValuesIterator = interruptibleIter.asInstanceOf[Iterator[(K, Nothing)]]
        dep.aggregator.get.combineValuesByKey(keyValuesIterator, context)
      }
    } else {
      interruptibleIter.asInstanceOf[Iterator[Product2[K, C]]]
    }
    // Step 4: if a key ordering is defined, sort the aggregated result
    val resultIter = dep.keyOrdering match {
      case Some(keyOrd: Ordering[K]) =>
        // Create an ExternalSorter to sort the data.
        val sorter =
          new ExternalSorter[K, C, C](context, ordering = Some(keyOrd), serializer = dep.serializer)
        sorter.insertAll(aggregatedIter)
        context.taskMetrics().incMemoryBytesSpilled(sorter.memoryBytesSpilled)
        context.taskMetrics().incDiskBytesSpilled(sorter.diskBytesSpilled)
        context.taskMetrics().incPeakExecutionMemory(sorter.peakMemoryUsedBytes)
        // Use completion callback to stop sorter if task was finished/cancelled.
        context.addTaskCompletionListener[Unit](_ => {
          sorter.stop()
        })
        CompletionIterator[Product2[K, C], Iterator[Product2[K, C]]](sorter.iterator, sorter.stop())
      case None =>
        aggregatedIter
    }
    // Step 5: return the result
    resultIter match {
      case _: InterruptibleIterator[Product2[K, C]] => resultIter
      case _ =>
        // Use another interruptible iterator here to support task cancellation as aggregator
        // or(and) sorter may have consumed previous interruptible iterator.
        new InterruptibleIterator[Product2[K, C]](context, resultIter)
    }
  }

The block manager: BlockManager

A new BlockManager instance is created on both the driver and the executors when SparkEnv is instantiated, together with its RpcEndpoint (BlockManagerSlaveEndpoint), which is registered with the local RpcEnv. The class is the entry point for adding, removing and querying data blocks and block metadata on the configured media: memory, disk, or external storage services.
The RpcEndpoints created alongside the BlockManager for cluster-wide coordination fall into two categories. The first is the BlockManagerMasterEndpoint, instantiated only on the driver; it keeps the metadata of every BlockManager in the application (without updating the BlockManager objects directly), i.e. one entry per CoarseGrainedExecutorBackend plus one for the driver. Its core definition is:

class BlockManagerMasterEndpoint(
    override val rpcEnv: RpcEnv,
    val isLocal: Boolean,
    conf: SparkConf,
    listenerBus: LiveListenerBus)
  extends ThreadSafeRpcEndpoint with Logging {

  // Mapping from block manager id to the block manager's information.
  private val blockManagerInfo = new mutable.HashMap[BlockManagerId, BlockManagerInfo]

  // Mapping from executor ID to block manager ID.
  private val blockManagerIdByExecutor = new mutable.HashMap[String, BlockManagerId]

  // The data managed by a BlockManager is stored as bytes: a large cached RDD, for example, is
  // split into blocks that are transferred between endpoints by the underlying
  // NettyBlockTransferService [extends BlockTransferService]. This map records the locations of
  // every known block. A BlockId's prefix encodes the kind of data; the known kinds are:
  //   val RDD = "rdd_([0-9]+)_([0-9]+)".r
  //   val SHUFFLE = "shuffle_([0-9]+)_([0-9]+)_([0-9]+)".r
  //   val SHUFFLE_DATA = "shuffle_([0-9]+)_([0-9]+)_([0-9]+).data".r
  //   val SHUFFLE_INDEX = "shuffle_([0-9]+)_([0-9]+)_([0-9]+).index".r
  //   val BROADCAST = "broadcast_([0-9]+)([_A-Za-z0-9]*)".r
  //   val TASKRESULT = "taskresult_([0-9]+)".r
  //   val STREAM = "input-([0-9]+)-([0-9]+)".r
  //   val TEMP_LOCAL = "temp_local_([-A-Fa-f0-9]+)".r
  //   val TEMP_SHUFFLE = "temp_shuffle_([-A-Fa-f0-9]+)".r
  //   val TEST = "test_(.*)".r
  private val blockLocations = new JHashMap[BlockId, mutable.HashSet[BlockManagerId]]
}
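As a small aside, the prefixes listed in the comment above are not just documentation: the public org.apache.spark.storage.BlockId API parses a block name back into its typed form. A brief illustration:

import org.apache.spark.storage.{BlockId, RDDBlockId, ShuffleBlockId}

// BlockId.apply parses a block name using the prefixes listed above.
val rddBlock: BlockId     = BlockId("rdd_2_3")        // RDDBlockId(2, 3)
val shuffleBlock: BlockId = BlockId("shuffle_0_1_2")  // ShuffleBlockId(0, 1, 2)
println(rddBlock.isRDD)          // true
println(shuffleBlock.isShuffle)  // true
println(RDDBlockId(2, 3).name)   // "rdd_2_3"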

The second category is the BlockManagerSlaveEndpoint. Because the driver may itself compute or aggregate data, both the driver and every executor create one of these endpoints, used to update the local BlockManager.

private[storage]
class BlockManagerSlaveEndpoint(
    override val rpcEnv: RpcEnv,
    blockManager: BlockManager,
    mapOutputTracker: MapOutputTracker)
  extends ThreadSafeRpcEndpoint with Logging {

  private val asyncThreadPool =
    ThreadUtils.newDaemonCachedThreadPool("block-manager-slave-async-thread-pool", 100)
  private implicit val asyncExecutionContext = ExecutionContext.fromExecutorService(asyncThreadPool)

  // Operations that involve removing blocks may be slow and should be done asynchronously
  override def receiveAndReply(context: RpcCallContext): PartialFunction[Any, Unit] = {
    case RemoveBlock(blockId) =>
      doAsync[Boolean]("removing block " + blockId, context) {
        blockManager.removeBlock(blockId)
        true
      }

    case RemoveRdd(rddId) =>
      doAsync[Int]("removing RDD " + rddId, context) {
        blockManager.removeRdd(rddId)
      }

    case RemoveShuffle(shuffleId) =>
      doAsync[Boolean]("removing shuffle " + shuffleId, context) {
        if (mapOutputTracker != null) {
          mapOutputTracker.unregisterShuffle(shuffleId)
        }
        SparkEnv.get.shuffleManager.unregisterShuffle(shuffleId)
      }

    case RemoveBroadcast(broadcastId, _) =>
      doAsync[Int]("removing broadcast " + broadcastId, context) {
        blockManager.removeBroadcast(broadcastId, tellMaster = true)
      }

    case GetBlockStatus(blockId, _) =>
      context.reply(blockManager.getStatus(blockId))

    case GetMatchingBlockIds(filter, _) =>
      context.reply(blockManager.getMatchingBlockIds(filter))

    case TriggerThreadDump =>
      context.reply(Utils.getThreadDump())

    case ReplicateBlock(blockId, replicas, maxReplicas) =>
      context.reply(blockManager.replicateBlock(blockId, replicas.toSet, maxReplicas))

  }

  override def onStop(): Unit = {
    asyncThreadPool.shutdownNow()
  }
}

The output commit coordinator: OutputCommitCoordinator

A new instance of this class is created on every node, driver or executor, together with a new RpcEndpoint (OutputCommitCoordinatorEndpoint) bound to the local RpcEnv; it decides whether a task is authorized to commit its output data to HDFS.
On executors, the instance holds a reference to the driver-side OutputCommitCoordinatorEndpoint, so commit requests are forwarded to the driver.
It was introduced as the solution to SPARK-4879.

In short, when spark.speculation is enabled, slow tasks are re-launched, and several attempts of the same task may then race on the HDFS output: a partition's data is first written to a temporary directory before being moved to the final target directory, and a re-launched attempt may find that its temporary directory has already been completed and deleted by another attempt, causing the task to fail.

spark.speculation
If set to “true”, performs speculative execution of tasks. This means if one or more tasks are running slowly in a stage, they will be re-launched.
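For reference, speculative execution is controlled by a small family of properties; an illustrative configuration might look like the following, where the threshold values shown match the usual defaults:

import org.apache.spark.SparkConf

// Illustrative configuration enabling speculative execution.
val conf = new SparkConf()
  .set("spark.speculation", "true")
  .set("spark.speculation.interval", "100ms")  // how often to check for slow tasks
  .set("spark.speculation.multiplier", "1.5")  // how much slower than the median counts as slow
  .set("spark.speculation.quantile", "0.75")   // fraction of tasks that must finish first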

BroadcastManager

The broadcast-variable manager. It is instantiated on both the driver and the executors and manages the Broadcast instances of an application. A broadcast variable is defined by user code such as:

 scala> val broadcastVar = sc.broadcast(Array(1, 2, 3))
 broadcastVar: org.apache.spark.broadcast.Broadcast[Array[Int]] = Broadcast(0)
 
 scala> broadcastVar.value
 res0: Array[Int] = Array(1, 2, 3)

The call above produces a Broadcast variable, broadcastVar, which is cached in the BlockManager of the current process's SparkEnv. If the object is large it is split into a series of blocks (BlockData); the block size is controlled by conf.getSizeAsKb("spark.broadcast.blockSize", "4m").toInt * 1024.

When user code on an executor reads a broadcast variable, it calls Broadcast.value, which looks the object up through the local BroadcastManager; if it is not cached locally, the blocks behind the corresponding BroadcastBlockId are fetched from the driver or from other executors and cached locally.
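A typical executor-side use looks like the following sketch (assuming an active SparkContext sc, e.g. in spark-shell); every task reads the broadcast value through .value, which goes through the local cache just described:

// Each task resolves lookup.value through its local BroadcastManager/BlockManager cache,
// fetching the blocks from the driver or other executors only on first access.
val lookup = sc.broadcast(Map("a" -> 1, "b" -> 2))
val counted = sc.parallelize(Seq("a", "b", "a"))
  .map(k => (k, lookup.value.getOrElse(k, 0)))
  .collect()
// counted: Array((a,1), (b,2), (a,1))

The BroadcastManager implementation that backs this lookup is fairly small: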

private[spark] class BroadcastManager(
    val isDriver: Boolean,
    conf: SparkConf,
    securityManager: SecurityManager)
  extends Logging {

  private var initialized = false
  private var broadcastFactory: BroadcastFactory = null

  initialize()

  // Called by SparkContext or Executor before using Broadcast
  private def initialize() {
    synchronized {
      if (!initialized) {
        broadcastFactory = new TorrentBroadcastFactory
        broadcastFactory.initialize(isDriver, conf, securityManager)
        initialized = true
      }
    }
  }

  def stop() {
    broadcastFactory.stop()
  }

  private val nextBroadcastId = new AtomicLong(0)

  /** Cache holding references to the broadcast values. */
  private[broadcast] val cachedValues = {
    new ReferenceMap(AbstractReferenceMap.HARD, AbstractReferenceMap.WEAK)
  }

  /** Create a new broadcast variable, stored as BlockData in the local BlockManager. */
  def newBroadcast[T: ClassTag](value_ : T, isLocal: Boolean): Broadcast[T] = {
    broadcastFactory.newBroadcast[T](value_, isLocal, nextBroadcastId.getAndIncrement())
  }

  def unbroadcast(id: Long, removeFromDriver: Boolean, blocking: Boolean) {
    broadcastFactory.unbroadcast(id, removeFromDriver, blocking)
  }
}

Current Spark versions use TorrentBroadcast, an implementation modeled on the BitTorrent protocol, as the default broadcast object managed by BroadcastManager.

MemoryManager

The memory manager splits memory between execution and storage according to a fixed set of rules. Execution memory is the memory used while running shuffles, joins, sorts and aggregations; storage memory is the memory used to cache data and to ship internal data around the cluster. There is exactly one MemoryManager instance per JVM.

MemoryManager has two concrete subclasses:

  • StaticMemoryManager
    The static manager: the shares are controlled by spark.shuffle.memoryFraction and spark.storage.memoryFraction, and execution and storage memory are managed separately without borrowing from each other. This mode does not really support off-heap allocation: even with spark.memory.offHeap.enabled=true only the pool name differs at construction time, and the behavior is identical to running with off-heap allocation disabled.

  • UnifiedMemoryManager
    The unified manager: execution and storage share a region whose total size is (heap size - 300 MB) scaled by spark.memory.fraction (default 0.6). Within that shared region the storage share is set by spark.memory.storageFraction (default 0.5), so by default storage gets 0.6 * 0.5 = 0.3 of the allocatable (heap size - 300 MB).
    If storage runs short, it borrows as much of the unused execution memory as possible; if execution later needs that space back, the manager evicts cached blocks and returns the freed memory to the running tasks.
    Conversely, if execution runs short it borrows as much free storage memory as it can, but that memory is never reclaimed by evicting execution data, because such eviction logic would be complex; the current implementation is blunt: if a new block then asks to be cached and there is no room, it is simply dropped.
    Off-heap allocation is supported.
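As a worked example of the unified-memory formulas above (assuming a 4 GB executor heap and the default fractions; the 300 MB constant corresponds to Spark's reserved system memory):

// Worked example of the UnifiedMemoryManager sizing rules described above.
val systemMemory    = 4L * 1024 * 1024 * 1024           // roughly the JVM heap (-Xmx4g)
val reservedMemory  = 300L * 1024 * 1024                // reserved for Spark internals
val memoryFraction  = 0.6                               // spark.memory.fraction
val storageFraction = 0.5                               // spark.memory.storageFraction

val usableMemory      = systemMemory - reservedMemory             // ~3796 MB
val maxSparkMemory    = (usableMemory * memoryFraction).toLong    // ~2278 MB shared pool
val storageRegionSize = (maxSparkMemory * storageFraction).toLong // ~1139 MB soft storage share

println(s"shared pool = ${maxSparkMemory >> 20} MB, storage region = ${storageRegionSize >> 20} MB")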

When it is initialized, the MemoryManager creates the following two kinds of memory pools, used to manage storage memory and execution memory respectively.

StorageMemoryPool

The pool for storage memory; memory is acquired and released in units of blocks, and the acquisition logic is simple:

  def acquireMemory(
      blockId: BlockId,
      numBytesToAcquire: Long,
      numBytesToFree: Long): Boolean = lock.synchronized {
    assert(numBytesToAcquire >= 0)
    assert(numBytesToFree >= 0)
    assert(memoryUsed <= poolSize)
    if (numBytesToFree > 0) {
      memoryStore.evictBlocksToFreeSpace(Some(blockId), numBytesToFree, memoryMode)
    }
    // NOTE: If the memory store evicts blocks, then those evictions will synchronously call
    // back into this StorageMemoryPool in order to free memory. Therefore, these variables
    // should have been updated.
    val enoughMemory = numBytesToAcquire <= memoryFree
    if (enoughMemory) {
      _memoryUsed += numBytesToAcquire
    }
    enoughMemory
  }

ExecutionMemoryPool

The pool for execution memory. It arbitrates the memory handed to tasks so that, with N active tasks, each waiting task is guaranteed at least 1 / (2N) of the pool before it may block, and no task can ever hold more than 1 / N of it.
Because N changes dynamically, the pool tracks the memory state of every active task and, whenever the set of active tasks changes, waiting tasks recompute the 1 / (2N) and 1 / N bounds.
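A tiny worked example of that policy: suppose the execution pool is 2048 MB and N = 4 tasks are active.

// Worked example of the 1/(2N) .. 1/N bounds described above.
val poolSizeMb     = 2048
val numActiveTasks = 4
val minPerTaskMb   = poolSizeMb / (2 * numActiveTasks) // 256 MB guaranteed before a task blocks
val maxPerTaskMb   = poolSizeMb / numActiveTasks       // 512 MB hard cap per task
println(s"each active task is granted between $minPerTaskMb MB and $maxPerTaskMb MB")
// If a fifth task starts, N becomes 5 and waiting tasks recompute the bounds (204 MB .. 409 MB).

The actual allocation loop is shown below.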

  private[memory] def acquireMemory(
      numBytes: Long,
      taskAttemptId: Long,
      maybeGrowPool: Long => Unit = (additionalSpaceNeeded: Long) => Unit,
      computeMaxPoolSize: () => Long = () => poolSize): Long = lock.synchronized {
    assert(numBytes > 0, s"invalid number of bytes requested: $numBytes")

    // TODO: clean up this clunky method signature

    // Add this task to the taskMemory map just so we can keep an accurate count of the number
    // of active tasks, to let other tasks ramp down their memory in calls to `acquireMemory`
    if (!memoryForTask.contains(taskAttemptId)) {
      memoryForTask(taskAttemptId) = 0L
      // This will later cause waiting tasks to wake up and check numTasks again
      lock.notifyAll()
    }

    // Keep looping until we're either sure that we don't want to grant this request (because this
    // task would have more than 1 / numActiveTasks of the memory) or we have enough free
    // memory to give it (we always let each task get at least 1 / (2 * numActiveTasks)).
    // TODO: simplify this to limit each task to its own slot
    while (true) {
      val numActiveTasks = memoryForTask.keys.size
      val curMem = memoryForTask(taskAttemptId)

      // In every iteration of this loop, we should first try to reclaim any borrowed execution
      // space from storage. This is necessary because of the potential race condition where new
      // storage blocks may steal the free execution memory that this task was waiting for.
      maybeGrowPool(numBytes - memoryFree)

      // Maximum size the pool would have after potentially growing the pool.
      // This is used to compute the upper bound of how much memory each task can occupy. This
      // must take into account potential free memory as well as the amount this pool currently
      // occupies. Otherwise, we may run into SPARK-12155 where, in unified memory management,
      // we did not take into account space that could have been freed by evicting cached blocks.
      val maxPoolSize = computeMaxPoolSize()
      val maxMemoryPerTask = maxPoolSize / numActiveTasks
      val minMemoryPerTask = poolSize / (2 * numActiveTasks)

      // How much we can grant this task; keep its share within 0 <= X <= 1 / numActiveTasks
      val maxToGrant = math.min(numBytes, math.max(0, maxMemoryPerTask - curMem))
      // Only give it as much memory as is free, which might be none if it reached 1 / numTasks
      val toGrant = math.min(maxToGrant, memoryFree)

      // We want to let each task get at least 1 / (2 * numActiveTasks) before blocking;
      // if we can't give it this much now, wait for other tasks to free up memory
      // (this happens if older tasks allocated lots of memory before N grew)
      if (toGrant < numBytes && curMem + toGrant < minMemoryPerTask) {
        logInfo(s"TID $taskAttemptId waiting for at least 1/2N of $poolName pool to be free")
        lock.wait()
      } else {
        memoryForTask(taskAttemptId) += toGrant
        return toGrant
      }
    }
    0L  // Never reached
  }