Spark Source Code Study: SparkContext

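SparkContext is the entry point to all of Spark and the hub between upper-layer applications and the underlying implementation, so its importance goes without saying. It is also the first step in my study of the Spark source code.
The sequence diagram in the post at http://blog.csdn.net/OiteBody/article/details/54959608 shows the SparkContext execution flow clearly.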

During initialization, SparkContext mainly sets up the following components:

SparkEnv
DAGScheduler
TaskScheduler
SchedulerBackend
WebUI

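The most important parameter of SparkContext is SparkConf. The source shows that the conf held inside SparkContext is obtained by calling clone() on the one passed in, and it then goes through a series of validations.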

try {
    _conf = config.clone()
    _conf.validateSettings()

    if (!_conf.contains("spark.master")) {
      throw new SparkException("A master URL must be set in your configuration")
    }
    if (!_conf.contains("spark.app.name")) {
      throw new SparkException("An application name must be set in your configuration")
    }

    // System property spark.yarn.app.id must be set if user code ran by AM on a YARN cluster
    // yarn-standalone is deprecated, but still supported
    if ((master == "yarn-cluster" || master == "yarn-standalone") &&
        !_conf.contains("spark.yarn.app.id")) {
      throw new SparkException("Detected yarn-cluster mode, but isn't running on a cluster. " +
        "Deployment to YARN is not supported directly by SparkContext. Please use spark-submit.")
    }

    if (_conf.getBoolean("spark.logConf", false)) {
      logInfo("Spark configuration:\n" + _conf.toDebugString)
    }

    // Set Spark driver host and port system properties
    _conf.setIfMissing("spark.driver.host", Utils.localHostName())
    _conf.setIfMissing("spark.driver.port", "0")

    _conf.set("spark.executor.id", SparkContext.DRIVER_IDENTIFIER)
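
The clone-then-validate order above can be tried out without a Spark dependency. The following is only a minimal sketch: MiniConf and ConfPrep are hypothetical stand-ins (not Spark classes), require throws IllegalArgumentException rather than SparkException, and the "driver" value is assumed to be what SparkContext.DRIVER_IDENTIFIER holds.

```scala
import scala.collection.mutable

// Hypothetical stand-in for SparkConf; not the real Spark class.
final class MiniConf(initial: Map[String, String] = Map.empty) {
  private val settings = mutable.Map.empty[String, String] ++= initial

  def set(key: String, value: String): MiniConf = { settings(key) = value; this }

  def setIfMissing(key: String, value: String): MiniConf = {
    if (!settings.contains(key)) settings(key) = value
    this
  }

  def contains(key: String): Boolean = settings.contains(key)
  def get(key: String): Option[String] = settings.get(key)

  // Copy the settings so later changes to either object stay independent.
  def copy(): MiniConf = new MiniConf(settings.toMap)
}

object ConfPrep {
  // Same order as the excerpt: clone, check required keys, fill in defaults.
  def prepare(userConf: MiniConf): MiniConf = {
    val conf = userConf.copy()
    require(conf.contains("spark.master"), "A master URL must be set in your configuration")
    require(conf.contains("spark.app.name"), "An application name must be set in your configuration")
    conf.setIfMissing("spark.driver.port", "0")
    // Assumption: "driver" stands in for SparkContext.DRIVER_IDENTIFIER.
    conf.set("spark.executor.id", "driver")
  }
}
```

Because prepare works on a copy, the defaults it fills in never leak back into the conf object the caller still holds, which is presumably why SparkContext clones first.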

Step 1: Create SparkEnv

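SparkEnv is Spark's execution environment object; it holds the many objects that Executors need at run time. In local mode the Driver creates an Executor. In local-cluster or Standalone deployment mode, the CoarseGrainedExecutorBackend process launched by a Worker also creates an Executor. SparkEnv therefore exists in the Driver and in CoarseGrainedExecutorBackend processes.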

_env = createSparkEnv(_conf, isLocal, listenerBus)
SparkEnv.set(_env)

private[spark] def createSparkEnv(
      conf: SparkConf,
      isLocal: Boolean,
      listenerBus: LiveListenerBus): SparkEnv = {
    SparkEnv.createDriverEnv(conf, isLocal, listenerBus, SparkContext.numDriverCores(master))
  }

SparkEnv.createDriverEnv() in turn ends up calling SparkEnv.create():

private def create(
      conf: SparkConf,
      executorId: String,
      hostname: String,
      port: Int,
      isDriver: Boolean,
      isLocal: Boolean,
      numUsableCores: Int,
      listenerBus: LiveListenerBus = null,
      mockOutputCommitCoordinator: Option[OutputCommitCoordinator] = None): SparkEnv = {

      ...

     val envInstance = new SparkEnv(
      executorId,
      rpcEnv,
      actorSystem,
      serializer,
      closureSerializer,
      cacheManager,
      mapOutputTracker,
      shuffleManager,
      broadcastManager,
      blockTransferService,
      blockManager,
      securityManager,
      sparkFilesDir,
      metricsSystem,
      memoryManager,
      outputCommitCoordinator,
      conf)
    if (isDriver) {
      envInstance.driverTmpDirToDelete = Some(sparkFilesDir)
    }

    envInstance

 }

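Its only purpose is to return a SparkEnv instance; the bulk of the work in between merely prepares the constructor arguments. So let's look at the SparkEnv constructor parameters first.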

class SparkEnv (
    val executorId: String,
    private[spark] val rpcEnv: RpcEnv,
    _actorSystem: ActorSystem, // TODO Remove actorSystem
    val serializer: Serializer,
    val closureSerializer: Serializer,
    val cacheManager: CacheManager,
    val mapOutputTracker: MapOutputTracker,
    val shuffleManager: ShuffleManager,
    val broadcastManager: BroadcastManager,
    val blockTransferService: BlockTransferService,
    val blockManager: BlockManager,
    val securityManager: SecurityManager,
    val sparkFilesDir: String,
    val metricsSystem: MetricsSystem,
    val memoryManager: MemoryManager,
    val outputCommitCoordinator: OutputCommitCoordinator,
    val conf: SparkConf) extends Logging {

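Here is what these parameters are for:

rpcEnv: network communication, netty-based by default; it is fairly complex and will be covered separately later
ActorSystem: the Akka actor system, one of the most basic facilities in Spark, used both for sending distributed messages and for concurrent programming (note the TODO in the constructor marking it for removal)
cacheManager: caches computed RDD partitions so that intermediate results can be reused
mapOutputTracker: caches MapStatus information and can fetch it from the MapOutputTrackerMaster
shuffleManager: manages shuffle, that is, how map output is written and how reduce tasks read it
broadcastManager: manages broadcast variables
blockManager: manages storage blocks
securityManager: security management
sparkFilesDir: directory where files are stored
metricsSystem: metrics collection and reporting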

Step 2: Create TaskScheduler

The SchedulerBackend is chosen according to Spark's run mode, and the TaskScheduler is started along with it. This step is critical.

val (sched, ts) = SparkContext.createTaskScheduler(this, master)
_schedulerBackend = sched
_taskScheduler = ts
_dagScheduler = new DAGScheduler(this)
_heartbeatReceiver.ask[Boolean](TaskSchedulerIsSet)
_taskScheduler.start()

private def createTaskScheduler(
      sc: SparkContext,
      master: String): (SchedulerBackend, TaskScheduler) = {
    import SparkMasterRegex._

    // When running locally, don't try to re-execute tasks on failure.
    val MAX_LOCAL_TASK_FAILURES = 1

    master match {
      case "local" =>
        val scheduler = new TaskSchedulerImpl(sc, MAX_LOCAL_TASK_FAILURES, isLocal = true)
        val backend = new LocalBackend(sc.getConf, scheduler, 1)
        scheduler.initialize(backend)
        (backend, scheduler)

      case LOCAL_N_REGEX(threads) =>
        def localCpuCount: Int = Runtime.getRuntime.availableProcessors()
        // local[*] estimates the number of cores on the machine; local[N] uses exactly N threads.
        val threadCount = if (threads == "*") localCpuCount else threads.toInt
        if (threadCount <= 0) {
          throw new SparkException(s"Asked to run locally with $threadCount threads")
        }
        val scheduler = new TaskSchedulerImpl(sc, MAX_LOCAL_TASK_FAILURES, isLocal = true)
        val backend = new LocalBackend(sc.getConf, scheduler, threadCount)
        scheduler.initialize(backend)
        (backend, scheduler)

            ...
    }
  }

The key point of createTaskScheduler is that it uses the master URL string to determine how Spark is currently deployed, and then creates the matching SchedulerBackend subclass.
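
The dispatch on the master string can be reproduced in isolation with a regex extractor. This is a rough sketch only: MasterUrl and localThreads are hypothetical names, the pattern is assumed to have the same shape as LOCAL_N_REGEX, and where Spark throws a SparkException for a non-positive thread count this version simply returns None.

```scala
object MasterUrl {
  // Assumed to mirror the shape of LOCAL_N_REGEX: local[N] or local[*].
  private val LocalN = """local\[([0-9]+|\*)\]""".r

  // Returns the thread count for a local master URL,
  // or None for anything this sketch does not handle.
  def localThreads(master: String): Option[Int] = master match {
    case "local" => Some(1)
    case LocalN("*") => Some(Runtime.getRuntime.availableProcessors())
    case LocalN(n) => scala.util.Try(n.toInt).toOption.filter(_ > 0)
    case _ => None
  }
}
```

A regex used as a pattern only matches when the whole string fits, so a value such as "local[4]x" falls through to the default case instead of being read as four threads.
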
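taskScheduler.start() starts the corresponding SchedulerBackend and, when speculative execution is enabled, a timer that periodically checks for tasks to speculate. Taking LocalBackend as the example: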

override def start() {
    val rpcEnv = SparkEnv.get.rpcEnv
    val executorEndpoint = new LocalEndpoint(rpcEnv, userClassPath, scheduler, this, totalCores)
    localEndpoint = rpcEnv.setupEndpoint("LocalBackendEndpoint", executorEndpoint)
    listenerBus.post(SparkListenerExecutorAdded(
      System.currentTimeMillis,
      executorEndpoint.localExecutorId,
      new ExecutorInfo(executorEndpoint.localExecutorHostname, totalCores, Map.empty)))
    launcherBackend.setAppId(appId)
    launcherBackend.setState(SparkAppHandle.State.RUNNING)
  }

Step 3: Create and start DAGScheduler

Straight to the code:

_dagScheduler = new DAGScheduler(this)

Now look at the DAGScheduler constructor:

def this(sc: SparkContext) = this(sc, sc.taskScheduler)

Clearly, the DAGScheduler is created with the taskScheduler from the previous step as its argument.
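
The same wiring can be shown with a small auxiliary-constructor sketch. All three classes below are hypothetical stand-ins, not Spark classes; the only thing borrowed from the source is the shape of the one-argument constructor.

```scala
// Hypothetical stand-ins; none of these are Spark classes.
final class MiniTaskScheduler(val name: String)

final class MiniSparkContext(val taskScheduler: MiniTaskScheduler)

final class MiniDagScheduler(val sc: MiniSparkContext, val taskScheduler: MiniTaskScheduler) {
  // Auxiliary constructor: take the scheduler from the context, the same shape as
  // `def this(sc: SparkContext) = this(sc, sc.taskScheduler)`.
  def this(sc: MiniSparkContext) = this(sc, sc.taskScheduler)
}
```

Since the one-argument constructor reads taskScheduler from the context right away, the context has to hold a scheduler before the DAG scheduler is built, which matches the order of the steps here.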

Step 4: Start the WebUI

_ui =
      if (conf.getBoolean("spark.ui.enabled", true)) {
        Some(SparkUI.createLiveUI(this, _conf, listenerBus, _jobProgressListener,
          _env.securityManager, appName, startTime = startTime))
      } else {
        // For tests, do not enable the UI
        None
      }
    // Bind the UI before starting the task scheduler to communicate
    // the bound port to the cluster manager properly
    _ui.foreach(_.bind())
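
The optional-UI pattern used here, build the UI only when it is enabled and then bind it through foreach, can be sketched on its own. MiniUI and UiSetup are hypothetical names; the real SparkUI.bind() starts a web server, whereas this stand-in only records that it was called.

```scala
// Hypothetical stand-in for SparkUI; it only records whether bind() was called.
final class MiniUI {
  private var _bound = false
  def bind(): Unit = { _bound = true }
  def isBound: Boolean = _bound
}

object UiSetup {
  // Build the UI only when enabled, then bind it if it exists.
  def createAndBind(enabled: Boolean): Option[MiniUI] = {
    val ui = if (enabled) Some(new MiniUI) else None
    ui.foreach(_.bind()) // no-op when the UI is disabled
    ui
  }
}
```

Keeping the UI in an Option means the disabled case needs no null checks: foreach simply does nothing on None.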
