Spark Source Code Study - SparkContext

SparkContext is the single entry point to Spark: it sits between upper-layer Spark applications and the underlying implementation, so its importance is self-evident. It is also the first step in my study of the Spark source code.
The sequence diagram borrowed from http://blog.csdn.net/OiteBody/article/details/54959608 shows the execution flow of SparkContext clearly.

During initialization, SparkContext mainly sets up the following components:

  • SparkEnv
  • DAGScheduler
  • TaskScheduler
  • SchedulerBackend
  • WebUI

The most important parameter of SparkContext is SparkConf. As the source shows, the conf held inside SparkContext is obtained by calling clone() on the config passed in, and is then put through a series of validations.

try {
    _conf = config.clone()
    _conf.validateSettings()

    if (!_conf.contains("spark.master")) {
      throw new SparkException("A master URL must be set in your configuration")
    }
    if (!_conf.contains("spark.app.name")) {
      throw new SparkException("An application name must be set in your configuration")
    }

    // System property spark.yarn.app.id must be set if user code ran by AM on a YARN cluster
    // yarn-standalone is deprecated, but still supported
    if ((master == "yarn-cluster" || master == "yarn-standalone") &&
        !_conf.contains("spark.yarn.app.id")) {
      throw new SparkException("Detected yarn-cluster mode, but isn't running on a cluster. " +
        "Deployment to YARN is not supported directly by SparkContext. Please use spark-submit.")
    }

    if (_conf.getBoolean("spark.logConf", false)) {
      logInfo("Spark configuration:\n" + _conf.toDebugString)
    }

    // Set Spark driver host and port system properties
    _conf.setIfMissing("spark.driver.host", Utils.localHostName())
    _conf.setIfMissing("spark.driver.port", "0")

    _conf.set("spark.executor.id", SparkContext.DRIVER_IDENTIFIER)

Step 1: Create SparkEnv

SparkEnv is Spark's execution environment object and holds the many objects that Executor execution depends on. In local mode the Driver creates the Executor itself; in local-cluster or Standalone deployments the Executor is created inside the CoarseGrainedExecutorBackend process launched by the Worker. A SparkEnv therefore lives in either the Driver process or a CoarseGrainedExecutorBackend process.

_env = createSparkEnv(_conf, isLocal, listenerBus)
SparkEnv.set(_env)

private[spark] def createSparkEnv(
      conf: SparkConf,
      isLocal: Boolean,
      listenerBus: LiveListenerBus): SparkEnv = {
    SparkEnv.createDriverEnv(conf, isLocal, listenerBus, SparkContext.numDriverCores(master))
  }

SparkEnv.createDriverEnv() ultimately calls SparkEnv.create():

private def create(
      conf: SparkConf,
      executorId: String,
      hostname: String,
      port: Int,
      isDriver: Boolean,
      isLocal: Boolean,
      numUsableCores: Int,
      listenerBus: LiveListenerBus = null,
      mockOutputCommitCoordinator: Option[OutputCommitCoordinator] = None): SparkEnv = {

      ...

     val envInstance = new SparkEnv(
      executorId,
      rpcEnv,
      actorSystem,
      serializer,
      closureSerializer,
      cacheManager,
      mapOutputTracker,
      shuffleManager,
      broadcastManager,
      blockTransferService,
      blockManager,
      securityManager,
      sparkFilesDir,
      metricsSystem,
      memoryManager,
      outputCommitCoordinator,
      conf)
    if (isDriver) {
      envInstance.driverTmpDirToDelete = Some(sparkFilesDir)
    }

    envInstance

 }

Its sole purpose is to return a SparkEnv instance; the large amount of work in between merely prepares the arguments for the constructor. So let's first look at the SparkEnv constructor parameters.

class SparkEnv (
    val executorId: String,
    private[spark] val rpcEnv: RpcEnv,
    _actorSystem: ActorSystem, // TODO Remove actorSystem
    val serializer: Serializer,
    val closureSerializer: Serializer,
    val cacheManager: CacheManager,
    val mapOutputTracker: MapOutputTracker,
    val shuffleManager: ShuffleManager,
    val broadcastManager: BroadcastManager,
    val blockTransferService: BlockTransferService,
    val blockManager: BlockManager,
    val securityManager: SecurityManager,
    val sparkFilesDir: String,
    val metricsSystem: MetricsSystem,
    val memoryManager: MemoryManager,
    val outputCommitCoordinator: OutputCommitCoordinator,
    val conf: SparkConf) extends Logging {

Let's look at what these parameters are used for (a small access sketch follows the list):

  • rpcEnv: network communication, Netty by default; this is fairly involved and will be analyzed separately later
  • ActorSystem: one of Spark's most fundamental facilities; Spark uses it both to send distributed messages and for concurrent programming (note the TODO in the source: it is being removed)
  • cacheManager: stores intermediate computation results
  • mapOutputTracker: caches MapStatus information and can fetch it from the MapOutputTrackerMaster on the driver
  • shuffleManager: manages shuffle, essentially maintaining the routing table for shuffle output
  • broadcastManager: manages broadcast variables
  • blockManager: block management
  • securityManager: security management
  • sparkFilesDir: file storage directory
  • metricsSystem: metrics collection
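
These components can be reached from user code through SparkEnv.get, which returns the SparkEnv of the current process (the driver in the sketch below). This is only a minimal illustration of how the pieces are wired together; the object name is made up and the printed values depend on the Spark version and configuration.

import org.apache.spark.{SparkConf, SparkContext, SparkEnv}

object EnvPeek {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setMaster("local[2]").setAppName("env-peek"))

    val env = SparkEnv.get                    // the driver's SparkEnv, created in step 1
    println(env.executorId)                   // the driver identifier set via spark.executor.id
    println(env.serializer.getClass.getName)  // JavaSerializer unless spark.serializer overrides it
    println(env.blockManager)                 // the driver-side block manager

    sc.stop()
  }
}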

Step 2: Create the TaskScheduler

The SchedulerBackend is chosen according to Spark's run mode, and the TaskScheduler is started at the same time. This step is crucial.

val (sched, ts) = SparkContext.createTaskScheduler(this, master)
_schedulerBackend = sched
_taskScheduler = ts
_dagScheduler = new DAGScheduler(this)
_heartbeatReceiver.ask[Boolean](TaskSchedulerIsSet)
_taskScheduler.start()

private def createTaskScheduler(
      sc: SparkContext,
      master: String): (SchedulerBackend, TaskScheduler) = {
    import SparkMasterRegex._

    // When running locally, don't try to re-execute tasks on failure.
    val MAX_LOCAL_TASK_FAILURES = 1

    master match {
      case "local" =>
        val scheduler = new TaskSchedulerImpl(sc, MAX_LOCAL_TASK_FAILURES, isLocal = true)
        val backend = new LocalBackend(sc.getConf, scheduler, 1)
        scheduler.initialize(backend)
        (backend, scheduler)

      case LOCAL_N_REGEX(threads) =>
        def localCpuCount: Int = Runtime.getRuntime.availableProcessors()
        // local[*] estimates the number of cores on the machine; local[N] uses exactly N threads.
        val threadCount = if (threads == "*") localCpuCount else threads.toInt
        if (threadCount <= 0) {
          throw new SparkException(s"Asked to run locally with $threadCount threads")
        }
        val scheduler = new TaskSchedulerImpl(sc, MAX_LOCAL_TASK_FAILURES, isLocal = true)
        val backend = new LocalBackend(sc.getConf, scheduler, threadCount)
        scheduler.initialize(backend)
        (backend, scheduler)

        ...
    }
  }

The key point of createTaskScheduler is that it inspects the master string to determine how Spark is currently deployed, and then instantiates the corresponding SchedulerBackend subclass.
taskScheduler.start() starts that SchedulerBackend and, when speculation is enabled, also starts a timer that periodically checks for speculatable tasks. Taking LocalBackend as an example:

override def start() {
    val rpcEnv = SparkEnv.get.rpcEnv
    val executorEndpoint = new LocalEndpoint(rpcEnv, userClassPath, scheduler, this, totalCores)
    localEndpoint = rpcEnv.setupEndpoint("LocalBackendEndpoint", executorEndpoint)
    listenerBus.post(SparkListenerExecutorAdded(
      System.currentTimeMillis,
      executorEndpoint.localExecutorId,
      new ExecutorInfo(executorEndpoint.localExecutorHostname, totalCores, Map.empty)))
    launcherBackend.setAppId(appId)
    launcherBackend.setState(SparkAppHandle.State.RUNNING)
  }
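
To make the dispatch on the master string concrete, here is a simplified, standalone imitation of the pattern match. The regexes and the object name are simplified stand-ins of my own, not copies of Spark's SparkMasterRegex patterns:

// Simplified illustration of how a master string maps to a deployment mode.
// The real logic lives in SparkContext.createTaskScheduler / SparkMasterRegex.
object MasterDispatch {
  private val LocalN = """local\[([0-9]+|\*)\]""".r
  private val SparkUrl = """spark://(.+)""".r

  def describe(master: String): String = master match {
    case "local"            => "single-threaded local mode (LocalBackend, 1 core)"
    case LocalN(threads)    => s"local mode with $threads thread(s) (LocalBackend)"
    case SparkUrl(hostPort) => s"standalone cluster at $hostPort (SparkDeploySchedulerBackend)"
    case other              => s"some other cluster manager: $other"
  }

  def main(args: Array[String]): Unit = {
    Seq("local", "local[*]", "local[4]", "spark://node1:7077", "yarn-cluster")
      .foreach(m => println(f"$m%-20s -> ${describe(m)}"))
  }
}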

Step 3: Create the DAGScheduler

Going straight to the code:

_dagScheduler = new DAGScheduler(this)

Now look at the DAGScheduler constructor:

def this(sc: SparkContext) = this(sc, sc.taskScheduler)

It is clear that the DAGScheduler is created with the taskScheduler from the previous step as a parameter.
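
To see the DAGScheduler at work, a shuffle in the lineage is enough. The following sketch (local mode, example app and object names) runs as a single job that the DAGScheduler splits into two stages, a ShuffleMapStage and a ResultStage, before handing the tasks to the taskScheduler:

import org.apache.spark.{SparkConf, SparkContext}

object TwoStageJob {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setMaster("local[2]").setAppName("two-stages"))

    val counts = sc.parallelize(Seq("a", "b", "a", "c"))
      .map(word => (word, 1))
      .reduceByKey(_ + _)   // shuffle dependency -> stage boundary
      .collect()            // action: submitted to the DAGScheduler as one job

    println(counts.mkString(", "))
    sc.stop()
  }
}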

Step 4: Start the WebUI

_ui =
      if (conf.getBoolean("spark.ui.enabled", true)) {
        Some(SparkUI.createLiveUI(this, _conf, listenerBus, _jobProgressListener,
          _env.securityManager, appName, startTime = startTime))
      } else {
        // For tests, do not enable the UI
        None
      }
    // Bind the UI before starting the task scheduler to communicate
    // the bound port to the cluster manager properly
    _ui.foreach(_.bind())
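
The spark.ui.enabled flag read here, together with spark.ui.port (default 4040), can be set from user code before the SparkContext is created. A minimal sketch; the port value and object name are only examples:

import org.apache.spark.{SparkConf, SparkContext}

object UiDemo {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setMaster("local[2]")
      .setAppName("ui-demo")
      .set("spark.ui.enabled", "true") // "false" skips creating the UI, as Spark's own tests do
      .set("spark.ui.port", "4050")    // default is 4040

    val sc = new SparkContext(conf)
    // While the application runs, the live UI is reachable at http://<driver-host>:4050
    Thread.sleep(60000) // keep the application alive for a minute so the UI can be inspected
    sc.stop()
  }
}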