SparkContext is the main entry point for Spark functionality. A SparkContext represents a connection to a Spark cluster and can be used to create RDDs, accumulators, and broadcast variables on that cluster. Only one SparkContext may be active per JVM; you must stop() the active SparkContext before constructing a new one. This limitation may eventually be removed.
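For orientation, here is a minimal usage sketch (standard public API; the app name and master URL are illustrative):

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf().setAppName("demo-app").setMaster("local[2]")
val sc = new SparkContext(conf) // the construction path we trace below
// ... create RDDs, accumulators, broadcast variables ...
sc.stop()                       // stop the active context before creating another one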
Now let's walk through what Spark does when we create a SparkContext with val sc = new SparkContext(sparkConf).
Passing a SparkConf like this invokes the primary constructor directly; the auxiliary constructor shown below covers the no-argument case, building a default SparkConf from system properties and delegating to the primary constructor:
/**
* Create a SparkContext that loads settings from system properties (for instance, when
* launching with ./bin/spark-submit).
*/
def this() = this(new SparkConf())
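For illustration, the no-argument form picks up its settings from JVM system properties (a minimal sketch; the property values are illustrative):

// new SparkConf() loads any "spark.*" JVM system properties
System.setProperty("spark.master", "local[2]")
System.setProperty("spark.app.name", "demo-app")
val sc = new SparkContext() // the auxiliary constructor: this(new SparkConf())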
Now let's look at what SparkContext initialization actually does:
class SparkContext(config: SparkConf) extends Logging {

  // The call site where this SparkContext was constructed.
  private val creationSite: CallSite = Utils.getCallSite()

  // If true, log warnings instead of throwing exceptions when multiple SparkContexts are active
  private val allowMultipleContexts: Boolean =
    config.getBoolean("spark.driver.allowMultipleContexts", false)

  // In order to prevent multiple SparkContexts from being active at the same time, mark this
  // context as having started construction.
  // NOTE: this must be placed at the beginning of the SparkContext constructor.
  SparkContext.markPartiallyConstructed(this, allowMultipleContexts)

  val startTime = System.currentTimeMillis()

  private[spark] val stopped: AtomicBoolean = new AtomicBoolean(false)
1. Create creationSite: CallSite, which records where in user code this SparkContext was constructed.
2. Create allowMultipleContexts: Boolean — whether multiple SparkContexts may be active at the same time. The default is false; when set to true, Spark logs a warning instead of throwing an exception when multiple active SparkContexts are detected (see the configuration sketch after this list).
3. Call SparkContext.markPartiallyConstructed(this, allowMultipleContexts) to mark this context as under construction, which prevents multiple SparkContexts from being active at once (this call must come first in the constructor).
4. Create the startTime constant, the construction timestamp.
5. Create the stopped constant, an AtomicBoolean marking whether this SparkContext has been stopped.
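For reference, the escape hatch from step 2 is set like any other configuration entry (the key comes from the source above; enabling it is rarely a good idea):

// Log a warning instead of throwing when a second SparkContext becomes active
val conf = new SparkConf()
  .setAppName("demo-app")
  .setMaster("local[*]")
  .set("spark.driver.allowMultipleContexts", "true")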
// log out Spark Version in Spark driver log
logInfo(s"Running Spark version $SPARK_VERSION")
/* ------------------------------------------------------------------------------------- *
| Private variables. These variables keep the internal state of the context, and are |
| not accessible by the outside world. They're mutable since we want to initialize all |
| of them to some neutral value ahead of time, so that calling "stop()" while the |
| constructor is still running is safe. |
* ------------------------------------------------------------------------------------- */
private var _conf: SparkConf = _
private var _eventLogDir: Option[URI] = None
private var _eventLogCodec: Option[String] = None
private var _listenerBus: LiveListenerBus = _
private var _env: SparkEnv = _
private var _statusTracker: SparkStatusTracker = _
private var _progressBar: Option[ConsoleProgressBar] = None
private var _ui: Option[SparkUI] = None
private var _hadoopConfiguration: Configuration = _
private var _executorMemory: Int = _
private var _schedulerBackend: SchedulerBackend = _
private var _taskScheduler: TaskScheduler = _
private var _heartbeatReceiver: RpcEndpointRef = _
@volatile private var _dagScheduler: DAGScheduler = _
private var _applicationId: String = _
private var _applicationAttemptId: Option[String] = None
private var _eventLogger: Option[EventLoggingListener] = None
private var _executorAllocationManager: Option[ExecutorAllocationManager] = None
private var _cleaner: Option[ContextCleaner] = None
private var _listenerBusStarted: Boolean = false
private var _jars: Seq[String] = _
private var _files: Seq[String] = _
private var _shutdownHookRef: AnyRef = _
private var _statusStore: AppStatusStore = _
6. Log the Spark version being run in the driver log.
7. Declare the private variables that hold SparkContext's internal state:
7.1 _conf: the SparkConf.
7.2 _eventLogDir: the event log directory.
7.3 _eventLogCodec: the event log compression codec.
7.4 _listenerBus: a LiveListenerBus, which asynchronously delivers SparkListenerEvents to registered SparkListeners. Until its start() method is called, posted events are merely buffered; only after startup are they delivered to the attached listeners. The bus is shut down with stop(), after which any further events are dropped. Internally, LiveListenerBus keeps its listeners in a thread-safe ArrayList variant (CopyOnWriteArrayList). (A minimal listener sketch follows this list.)
7.5 _env: the Spark execution environment (SparkEnv).
7.6 _statusTracker: a low-level status-reporting API for monitoring job and stage progress.
7.7 _progressBar: the console stage progress bar; when multiple stages run, it shows their combined progress.
7.8 _ui: the top-level user interface (SparkUI) of the Spark application.
7.9 _hadoopConfiguration: the Hadoop configuration.
7.10 _executorMemory: executor memory in MB; the default is 1024 MB.
7.11 _schedulerBackend: the backend interface of the scheduling system, which allows different backends to be plugged in under TaskSchedulerImpl.
7.12 _taskScheduler: the low-level task scheduling interface; currently its only implementation is org.apache.spark.scheduler.TaskSchedulerImpl (examined in detail later).
7.13 _heartbeatReceiver: a reference to a remote RpcEndpoint. RpcEndpointRef is thread-safe.
7.14 _dagScheduler: the DAG scheduler, a high-level scheduler that implements stage-oriented scheduling. For each job it builds a DAG of stages, tracks which RDDs and stage outputs have been materialized, and finds a minimal schedule for running the job; it then submits the stages to the underlying TaskScheduler as TaskSets (very important; examined in detail later).
7.15 _applicationId: the unique identifier of the Spark application.
7.16 _applicationAttemptId: the unique attempt identifier for this run of the application.
7.17 _eventLogger: a SparkListener that persists events to log files.
7.18 _executorAllocationManager: an agent that dynamically adds and removes executors based on workload.
7.19 _cleaner: an asynchronous cleaner for RDD, shuffle, and broadcast state.
7.20 _listenerBusStarted: whether the LiveListenerBus has been started.
7.21 _jars: paths of user-submitted jars, comma-separated when there are several.
7.22 _files: user-submitted files.
7.23 _shutdownHookRef: a reference to the shutdown hook.
7.24 _statusStore: a wrapper around a key-value store that provides methods for retrieving API data.
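To make 7.4 concrete, here is a minimal listener sketch (SparkListener and SparkContext.addSparkListener are public Spark API; the class name and printed message are mine):

import org.apache.spark.scheduler.{SparkListener, SparkListenerJobEnd}

// Events reach this listener asynchronously via the LiveListenerBus once the bus has started
class JobEndLogger extends SparkListener {
  override def onJobEnd(jobEnd: SparkListenerJobEnd): Unit =
    println(s"Job ${jobEnd.jobId} finished: ${jobEnd.jobResult}")
}

sc.addSparkListener(new JobEndLogger)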
// Used to store a URL for each static file/jar together with the file's local timestamp
private[spark] val addedFiles = new ConcurrentHashMap[String, Long]().asScala
private[spark] val addedJars = new ConcurrentHashMap[String, Long]().asScala

// Keeps track of all persisted RDDs
private[spark] val persistentRdds = {
  val map: ConcurrentMap[Int, RDD[_]] = new MapMaker().weakValues().makeMap[Int, RDD[_]]()
  map.asScala
}
8. Create addedFiles: a ConcurrentHashMap storing the files that have been added.
9. Create addedJars: a ConcurrentHashMap storing the jars that have been added.
10. Create persistentRdds: a ConcurrentMap with weak values (built via Guava's MapMaker) that tracks all persisted RDDs.
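A quick illustration of what ends up in persistentRdds (standard RDD API; the data is arbitrary):

val rdd = sc.parallelize(1 to 100).cache() // registers rdd.id -> rdd in persistentRdds
rdd.count()                                // the first action actually materializes the cache
rdd.unpersist()                            // drops the blocks; thanks to weak values, RDDs
                                           // with no remaining references can also be GC'd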
// Environment variables to pass to our executors.
private[spark] val executorEnvs = HashMap[String, String]()
// Set SPARK_USER for user who is running SparkContext.
val sparkUser = Utils.getCurrentUserName()
11. Create executorEnvs: a HashMap holding the environment variables to pass to executors (see the sketch below).
12. Create sparkUser: the user running this SparkContext.
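User-defined executor environment variables set through SparkConf are merged into this map later in the constructor (setExecutorEnv is real SparkConf API; the variable and value are illustrative):

val conf = new SparkConf()
  .setAppName("demo-app")
  .setMaster("local[*]")
  .setExecutorEnv("JAVA_TOOL_OPTIONS", "-Dfile.encoding=UTF-8")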
private[spark] var checkpointDir: Option[String] = None

// Thread Local variable that can be used by users to pass information down the stack
protected[spark] val localProperties = new InheritableThreadLocal[Properties] {
  override protected def childValue(parent: Properties): Properties = {
    // Note: make a clone such that changes in the parent properties aren't reflected in
    // those of the children threads, which has confusing semantics (SPARK-10563).
    SerializationUtils.clone(parent)
  }
  override protected def initialValue(): Properties = new Properties()
}
13. Declare checkpointDir: the checkpoint directory.
14. Create localProperties: thread-local properties that users can use to pass information down the call stack (see the example below).
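For example, local properties set on the driver thread are inherited, as a clone, by threads it spawns (setLocalProperty/getLocalProperty are public API; the pool name is illustrative):

sc.setLocalProperty("spark.scheduler.pool", "reporting")
new Thread(new Runnable {
  override def run(): Unit = {
    // this child thread sees a cloned copy of the parent's properties (SPARK-10563)
    println(sc.getLocalProperty("spark.scheduler.pool")) // prints "reporting"
  }
}).start()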
Now for the important part! (The explanations are inline as code comments.)
try {
  // Clone the SparkConf
  _conf = config.clone()
  // Check for illegal or deprecated settings: illegal settings throw an exception,
  // deprecated ones are translated into their currently supported equivalents
  _conf.validateSettings()

  // A master URL must be configured, otherwise throw
  if (!_conf.contains("spark.master")) {
    throw new SparkException("A master URL must be set in your configuration")
  }
  // An application name must be configured, otherwise throw
  if (!_conf.contains("spark.app.name")) {
    throw new SparkException("An application name must be set in your configuration")
  }

  // log out spark.app.name in the Spark driver logs
  logInfo(s"Submitted application: $appName")

  // System property spark.yarn.app.id must be set if user code is run by the AM on a YARN cluster
  if (master == "yarn" && deployMode == "cluster" && !_conf.contains("spark.yarn.app.id")) {
    throw new SparkException("Detected yarn cluster mode, but isn't running on a cluster. " +
      "Deployment to YARN is not supported directly by SparkContext. Please use spark-submit.")
  }

  // If spark.logConf is enabled, print all configuration settings
  if (_conf.getBoolean("spark.logConf", false)) {
    logInfo("Spark configuration:\n" + _conf.toDebugString)
  }

  // Set Spark driver host and port system properties. This explicitly sets the configuration
  // instead of relying on the default value of the config constant.
  _conf.set(DRIVER_HOST_ADDRESS, _conf.get(DRIVER_HOST_ADDRESS))
  // If the driver port is not configured, default it to 0 (bind to a random free port)
  _conf.setIfMissing("spark.driver.port", "0")

  // Mark this process as the driver in the executor-id slot
  _conf.set("spark.executor.id", SparkContext.DRIVER_IDENTIFIER)

  // Load the user-submitted jars
  _jars = Utils.getUserJars(_conf)
  // Load the user-submitted files
  _files = _conf.getOption("spark.files").map(_.split(",")).map(_.filter(_.nonEmpty))
    .toSeq.flatten

  // If event logging is enabled, resolve the event log directory
  _eventLogDir =
    if (isEventLogEnabled) {
      val unresolvedDir = conf.get("spark.eventLog.dir", EventLoggingListener.DEFAULT_LOG_DIR)
        .stripSuffix("/")
      Some(Utils.resolveURI(unresolvedDir))
    } else {
      None
    }

  // Initialize the event log codec
  _eventLogCodec = {
    val compress = _conf.getBoolean("spark.eventLog.compress", false)
    if (compress && isEventLogEnabled) {
      Some(CompressionCodec.getCodecName(_conf)).map(CompressionCodec.getShortName)
    } else {
      None
    }
  }

  // Initialize the LiveListenerBus, which delivers posted events to the registered listeners
  _listenerBus = new LiveListenerBus(_conf)

  // Initialize the app status store and listener before SparkEnv is created so that it gets
  // all events.
  _statusStore = AppStatusStore.createLiveStore(conf)
  listenerBus.addToStatusQueue(_statusStore.listener.get)

  // Create the Spark execution environment (cache, map output tracker, etc)
  _env = createSparkEnv(_conf, isLocal, listenerBus)
  SparkEnv.set(_env)
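The event-logging branches above are driven by plain configuration; a typical setup looks like this (the keys appear in the source above; the directory is illustrative):

val conf = new SparkConf()
  .set("spark.eventLog.enabled", "true")           // checked via isEventLogEnabled
  .set("spark.eventLog.dir", "hdfs:///spark-logs") // resolved into _eventLogDir
  .set("spark.eventLog.compress", "true")          // selects the codec for _eventLogCodec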
Let's take a closer look at _env = createSparkEnv(_conf, isLocal, listenerBus):
private[spark] def createSparkEnv(
    conf: SparkConf,
    isLocal: Boolean,
    listenerBus: LiveListenerBus): SparkEnv = {
  // What actually gets created is the driver-side environment
  SparkEnv.createDriverEnv(conf, isLocal, listenerBus, SparkContext.numDriverCores(master, conf))
}
As you can see, this in turn calls SparkEnv.createDriverEnv(conf, isLocal, listenerBus, SparkContext.numDriverCores(master, conf)), so what gets created is the driver-side SparkEnv. We'll look at the internals of the driver environment later.
Now back to SparkContext.
  // If running the REPL, register the repl's output dir with the file server.
  _conf.getOption("spark.repl.class.outputDir").foreach { path =>
    val replUri = _env.rpcEnv.fileServer.addDirectory("/classes", new File(path))
    _conf.set("spark.repl.class.uri", replUri)
  }

  // Initialize the status tracker
  _statusTracker = new SparkStatusTracker(this, _statusStore)

  // Initialize the console progress bar
  _progressBar =
    if (_conf.get(UI_SHOW_CONSOLE_PROGRESS) && !log.isInfoEnabled) {
      Some(new ConsoleProgressBar(this))
    } else {
      None
    }

  // Initialize the SparkUI
  _ui =
    if (conf.getBoolean("spark.ui.enabled", true)) {
      Some(SparkUI.create(Some(this), _statusStore, _conf, _env.securityManager, appName, "",
        startTime))
    } else {
      // For tests, do not enable the UI
      None
    }
  // Bind the UI before starting the task scheduler to communicate
  // the bound port to the cluster manager properly
  _ui.foreach(_.bind())

  // Load the Hadoop configuration
  _hadoopConfiguration = SparkHadoopUtil.get.newConfiguration(_conf)

  // Add each JAR given through the constructor as a task dependency
  if (jars != null) {
    jars.foreach(addJar)
  }
  // Add each file that every node should download
  if (files != null) {
    files.foreach(addFile)
  }

  // Initialize executor memory, defaulting to 1024 MB
  _executorMemory = _conf.getOption("spark.executor.memory")
    .orElse(Option(System.getenv("SPARK_EXECUTOR_MEMORY")))
    .orElse(Option(System.getenv("SPARK_MEM"))
      .map(warnSparkMem))
    .map(Utils.memoryStringToMb)
    .getOrElse(1024)

  // Convert java options to env vars as a work around
  // since we can't set env vars directly in sbt.
  for { (envKey, propKey) <- Seq(("SPARK_TESTING", "spark.testing"))
      value <- Option(System.getenv(envKey)).orElse(Option(System.getProperty(propKey)))} {
    executorEnvs(envKey) = value
  }
  Option(System.getenv("SPARK_PREPEND_CLASSES")).foreach { v =>
    executorEnvs("SPARK_PREPEND_CLASSES") = v
  }
  // The Mesos scheduler backend relies on this environment variable to set executor memory.
  // TODO: Set this only in the Mesos scheduler.
  executorEnvs("SPARK_EXECUTOR_MEMORY") = executorMemory + "m"
  executorEnvs ++= _conf.getExecutorEnv
  executorEnvs("SPARK_USER") = sparkUser

  // We need to register "HeartbeatReceiver" before "createTaskScheduler" because Executor will
  // retrieve "HeartbeatReceiver" in the constructor. (SPARK-6640)
  _heartbeatReceiver = env.rpcEnv.setupEndpoint(
    HeartbeatReceiver.ENDPOINT_NAME, new HeartbeatReceiver(this))

  // Create and start the scheduler: scheduler backend, task scheduler, and DAG scheduler
  val (sched, ts) = SparkContext.createTaskScheduler(this, master, deployMode)
  _schedulerBackend = sched
  _taskScheduler = ts
  _dagScheduler = new DAGScheduler(this)
  _heartbeatReceiver.ask[Boolean](TaskSchedulerIsSet)

  // start TaskScheduler after taskScheduler sets DAGScheduler reference in DAGScheduler's
  // constructor
  _taskScheduler.start()
A lot gets initialized here; the important pieces are the UI, _hadoopConfiguration, executorMemory, and the TaskScheduler. createTaskScheduler inspects the given master URL, creates the matching TaskScheduler, and returns a tuple of (SchedulerBackend, TaskScheduler); a simplified sketch of that dispatch follows.
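A simplified, illustrative sketch of the dispatch (not the real implementation; the regex names mirror those in SparkContext.createTaskScheduler):

val LOCAL_N_REGEX = """local\[([0-9]+|\*)\]""".r
val SPARK_REGEX = """spark://(.*)""".r

def describeScheduler(master: String): String = master match {
  case "local"          => "local mode, one task thread"
  case LOCAL_N_REGEX(n) => s"local mode, $n task threads"
  case SPARK_REGEX(url) => s"standalone cluster at $url (StandaloneSchedulerBackend)"
  case other            => s"delegated to an external cluster manager: $other"
}

println(describeScheduler("local[4]"))          // local mode, 4 task threads
println(describeScheduler("spark://host:7077")) // standalone cluster at host:7077 ...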
  // Record the application id obtained from the task scheduler
  _applicationId = _taskScheduler.applicationId()
  // Record the application attempt id
  _applicationAttemptId = taskScheduler.applicationAttemptId()
  // Save the application id into the SparkConf
  _conf.set("spark.app.id", _applicationId)
  // Configure the UI reverse-proxy base if enabled
  if (_conf.getBoolean("spark.ui.reverseProxy", false)) {
    System.setProperty("spark.ui.proxyBase", "/proxy/" + _applicationId)
  }
  // Propagate the application id to the UI
  _ui.foreach(_.setAppId(_applicationId))
  // Initialize the block manager
  _env.blockManager.initialize(_applicationId)

  // The metrics system for Driver need to be set spark.app.id to app ID.
  // So it should start after we get app ID from the task scheduler and set spark.app.id.
  _env.metricsSystem.start()
  // Attach the driver metrics servlet handler to the web ui after the metrics system is started.
  _env.metricsSystem.getServletHandlers.foreach(handler => ui.foreach(_.attachHandler(handler)))

  // Create the event logging listener and add it to the listener bus
  _eventLogger =
    if (isEventLogEnabled) {
      val logger =
        new EventLoggingListener(_applicationId, _applicationAttemptId, _eventLogDir.get,
          _conf, _hadoopConfiguration)
      logger.start()
      listenerBus.addToEventLogQueue(logger)
      Some(logger)
    } else {
      None
    }

  // Optionally scale number of executors dynamically based on workload. Exposed for testing.
  val dynamicAllocationEnabled = Utils.isDynamicAllocationEnabled(_conf)
  // Initialize the executor allocation manager if dynamic allocation is enabled
  _executorAllocationManager =
    if (dynamicAllocationEnabled) {
      schedulerBackend match {
        case b: ExecutorAllocationClient =>
          Some(new ExecutorAllocationManager(
            schedulerBackend.asInstanceOf[ExecutorAllocationClient], listenerBus, _conf,
            _env.blockManager.master))
        case _ =>
          None
      }
    } else {
      None
    }
  // Start the executor allocation manager
  _executorAllocationManager.foreach(_.start())

  // Initialize and start the context cleaner
  _cleaner =
    if (_conf.getBoolean("spark.cleaner.referenceTracking", true)) {
      Some(new ContextCleaner(this))
    } else {
      None
    }
  _cleaner.foreach(_.start())

  // Set up and start the listener bus (mainly registering extra user-supplied listeners)
  setupAndStartListenerBus()
  // Post the environment-update event
  postEnvironmentUpdate()
  // Post the application-start event
  postApplicationStart()

  // Post init
  _taskScheduler.postStartHook()
  _env.metricsSystem.registerSource(_dagScheduler.metricsSource)
  _env.metricsSystem.registerSource(new BlockManagerSource(_env.blockManager))
  _executorAllocationManager.foreach { e =>
    _env.metricsSystem.registerSource(e.executorAllocationManagerSource)
  }

  // Make sure the context is stopped if the user forgets about it. This avoids leaving
  // unfinished event logs around after the JVM exits cleanly. It doesn't help if the JVM
  // is killed, though.
  logDebug("Adding shutdown hook") // force eager creation of logger
  _shutdownHookRef = ShutdownHookManager.addShutdownHook(
    ShutdownHookManager.SPARK_CONTEXT_SHUTDOWN_PRIORITY) { () =>
    logInfo("Invoking stop() from shutdown hook")
    try {
      stop()
    } catch {
      case e: Throwable =>
        logWarning("Ignoring Exception while stopping SparkContext from shutdown hook", e)
    }
  }
} catch {
  case NonFatal(e) =>
    logError("Error initializing SparkContext.", e)
    try {
      stop()
    } catch {
      case NonFatal(inner) =>
        logError("Error stopping SparkContext after init error.", inner)
    } finally {
      throw e
    }
}
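For completeness, Utils.isDynamicAllocationEnabled above reacts to ordinary configuration; a typical setup that activates the ExecutorAllocationManager looks like this (real config keys; the executor counts are illustrative):

val conf = new SparkConf()
  .set("spark.dynamicAllocation.enabled", "true")
  .set("spark.shuffle.service.enabled", "true") // dynamic allocation needs the external shuffle service
  .set("spark.dynamicAllocation.minExecutors", "1")
  .set("spark.dynamicAllocation.maxExecutors", "10")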
In summary, initializing a SparkContext creates a whole set of objects the runtime needs: for validating configuration, listening to events, scheduling tasks, and more.
In follow-up posts I'll dig into the heavyweight classes created during SparkContext initialization: SparkEnv, TaskScheduler, and DAGScheduler.
Note: please credit the source when reposting.