Spark Source Code Walkthrough: SparkContext Initialization

Initializing a SparkContext is the prerequisite for submitting and running a Driver application. Here we use local mode to walk through the SparkContext initialization process.
This article takes
val conf = new SparkConf().setAppName("mytest").setMaster("local[2]")
val sc = new SparkContext(conf)
as the running example, steps through it in debug mode, and analyzes what happens.
I. Overview of SparkConf

SparkContext is initialized from a SparkConf, which maintains Spark's configuration properties. The official description:

Configuration for a Spark application. Used to set various Spark parameters as key-value pairs.
A quick look at the SparkConf source:
class SparkConf(loadDefaults: Boolean) extends Cloneable with Logging {

import SparkConf._

/** Create a SparkConf that loads defaults from system properties and the classpath */
def this() = this(true)

private val settings = new ConcurrentHashMap[String, String]()

if (loadDefaults) {
// Load any spark.* system properties
for ((key, value) <- Utils.getSystemProperties if key.startsWith("spark.")) {
set(key, value)
}
}

/** Set a configuration variable. */
def set(key: String, value: String): SparkConf = {
if (key == null) {
throw new NullPointerException("null key")
}
if (value == null) {
throw new NullPointerException("null value for " + key)
}
logDeprecationWarning(key)
settings.put(key, value)
this
}

/**
* The master URL to connect to, such as "local" to run locally with one thread, "local[4]" to
* run locally with 4 cores, or "spark://master:7077" to run on a Spark standalone cluster.
*/
def setMaster(master: String): SparkConf = {
set("spark.master", master)
}

/** Set a name for your application. Shown in the Spark web UI. */
def setAppName(name: String): SparkConf = {
set("spark.app.name", name)
}
// omitted
}

SparkConf internally uses a ConcurrentHashMap to hold all configuration entries. Since every setter returns this, i.e. the SparkConf object itself, properties can be set in a chained style,
for example: new SparkConf().setAppName("mytest").setMaster("local[2]")
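
For completeness, here is a minimal sketch (not taken from the Spark source) of how the chained setters and the corresponding getters behave; get, getOption and contains are standard SparkConf methods:

import org.apache.spark.SparkConf

object SparkConfChainingDemo {
  def main(args: Array[String]): Unit = {
    // Every setter returns `this`, so configuration calls can be chained fluently.
    val conf = new SparkConf()
      .setAppName("mytest")
      .setMaster("local[2]")
      .set("spark.executor.memory", "1g")

    // Entries are plain key-value strings held in the internal ConcurrentHashMap.
    println(conf.get("spark.master"))               // local[2]
    println(conf.getOption("spark.app.name"))       // Some(mytest)
    println(conf.contains("spark.executor.memory")) // true
  }
}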

II. SparkContext Initialization

The initialization of SparkContext mainly consists of the following steps:
1) Copy and validate the SparkConf configuration
2) Create the JobProgressListener and the SparkEnv
3) Create the MetadataCleaner, SparkStatusTracker and Spark UI
4) Load the Hadoop configuration and the Executor environment variables
5) Register HeartbeatReceiver, then create and start the TaskScheduler and DAGScheduler

  1. Copy the SparkConf configuration, then validate it or add new configuration entries

The primary constructor of SparkContext takes a SparkConf as its parameter:
class SparkContext(config: SparkConf) extends Logging with ExecutorAllocationClient {

// The call site where this SparkContext was constructed.
private val creationSite: CallSite = Utils.getCallSite()

// If true, log warnings instead of throwing exceptions when multiple SparkContexts are active
private val allowMultipleContexts: Boolean =
config.getBoolean("spark.driver.allowMultipleContexts", false)

// In order to prevent multiple SparkContexts from being active at the same time, mark this
// context as having started construction.
// NOTE: this must be placed at the beginning of the SparkContext constructor.
SparkContext.markPartiallyConstructed(this, allowMultipleContexts)
// omitted
}

The getCallSite method returns a CallSite object, which records the user class closest to the top of the thread stack and the Scala or Spark core class closest to the bottom of the stack. By default only one SparkContext instance is allowed, controlled by the property "spark.driver.allowMultipleContexts". The markPartiallyConstructed method enforces this uniqueness and marks the current SparkContext as being under construction.
After that a number of other fields are initialized. For example, an internal SparkConf field _conf is created by cloning the config that was passed in, and the configuration is then validated.

private var _conf: SparkConf = _

_conf = config.clone()
_conf.validateSettings() // Checks for illegal or deprecated config settings

if (!_conf.contains("spark.master")) {
throw new SparkException("A master URL must be set in your configuration")
}
if (!_conf.contains("spark.app.name")) {
throw new SparkException("An application name must be set in your configuration")
}

As the code above shows, spark.master (the master URL / deploy mode) and spark.app.name (the application name) must both be set, otherwise an exception is thrown.
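
As an illustration only (assuming the application is launched directly, without spark-submit supplying a master), omitting setMaster makes the constructor fail with exactly this exception:

import org.apache.spark.{SparkConf, SparkContext, SparkException}

object MissingMasterDemo {
  def main(args: Array[String]): Unit = {
    // spark.master is never set, so validation in the SparkContext constructor fails.
    val conf = new SparkConf().setAppName("mytest")
    try {
      new SparkContext(conf)
    } catch {
      case e: SparkException =>
        println(e.getMessage) // "A master URL must be set in your configuration"
    }
  }
}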

  2. Create SparkEnv

    SparkEnv holds the runtime environment of a Spark application. The official description:

    Holds all the runtime environment objects for a running Spark instance (either master or worker), including the serializer, Akka actor system, block manager, map output tracker, etc. Currently Spark code finds the SparkEnv through a global variable, so all the threads can access the same SparkEnv. It can be accessed by SparkEnv.get (e.g. after creating a SparkContext).
      In other words, SparkEnv holds the runtime environment objects of a Spark instance (master or worker), including the serializer, the Akka actor system, the block manager, the map output tracker, and so on.
      Part of the SparkContext code that creates the SparkEnv instance:
      // An asynchronous listener bus for Spark events
      private[spark] val listenerBus = new LiveListenerBus

// This function allows components created by SparkEnv to be mocked in unit tests:
private[spark] def createSparkEnv(
conf: SparkConf,
isLocal: Boolean,
listenerBus: LiveListenerBus): SparkEnv = {
SparkEnv.createDriverEnv(conf, isLocal, listenerBus, SparkContext.numDriverCores(master))
}

// "_jobProgressListener" should be set up before creating SparkEnv because when creating
// "SparkEnv", some messages will be posted to "listenerBus" and we should not miss them.
_jobProgressListener = new JobProgressListener(_conf)
listenerBus.addListener(jobProgressListener)

// Create the Spark execution environment (cache, map output tracker, etc)
_env = createSparkEnv(_conf, isLocal, listenerBus)
SparkEnv.set(_env)

In other words, SparkEnv.createDriverEnv is eventually called:
/**
* Create a SparkEnv for the driver.
*/
private[spark] def createDriverEnv(
conf: SparkConf,
isLocal: Boolean,
listenerBus: LiveListenerBus,
numCores: Int,
mockOutputCommitCoordinator: Option[OutputCommitCoordinator] = None): SparkEnv = {
assert(conf.contains("spark.driver.host"), "spark.driver.host is not set on the driver!")
assert(conf.contains("spark.driver.port"), "spark.driver.port is not set on the driver!")
val hostname = conf.get("spark.driver.host")
val port = conf.get("spark.driver.port").toInt
create(
conf,
SparkContext.DRIVER_IDENTIFIER,
hostname,
port,
isDriver = true,
isLocal = isLocal,
numUsableCores = numCores,
listenerBus = listenerBus,
mockOutputCommitCoordinator = mockOutputCommitCoordinator
)
}

private def create(
conf: SparkConf,
executorId: String,
hostname: String,
port: Int,
isDriver: Boolean,
isLocal: Boolean,
numUsableCores: Int,
listenerBus: LiveListenerBus = null,
mockOutputCommitCoordinator: Option[OutputCommitCoordinator] = None): SparkEnv = {

val envInstance = new SparkEnv(
  executorId,
  rpcEnv,
  actorSystem,           // Akka-based distributed messaging system
  serializer,
  closureSerializer,
  cacheManager,          // cache manager
  mapOutputTracker,      // tracks map task output
  shuffleManager,        // shuffle manager
  broadcastManager,      // broadcast manager
  blockTransferService,  // block transfer service
  blockManager,          // block manager
  securityManager,       // security manager
  sparkFilesDir,         // directory for files added via SparkContext.addFile
  metricsSystem,         // metrics system
  memoryManager,         // memory manager
  outputCommitCoordinator,
  conf)

envInstance
}

SparkEnv.createDriverEnv calls the private create method to build the serializer, closureSerializer, cacheManager and the other components, and then constructs the SparkEnv object from them.
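
A minimal sketch of how the resulting environment can be reached afterwards; SparkEnv.get is the global accessor mentioned in the doc comment above, and executorId and conf are public constructor parameters of SparkEnv:

import org.apache.spark.{SparkConf, SparkContext, SparkEnv}

object SparkEnvAccessDemo {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("mytest").setMaster("local[2]"))
    // After the SparkContext is constructed, the driver-side SparkEnv is globally visible.
    val env = SparkEnv.get
    println(env.executorId)                 // the driver's executor id (SparkContext.DRIVER_IDENTIFIER)
    println(env.conf.get("spark.app.name")) // mytest
    sc.stop()
  }
}
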
  3. Create MetadataCleaner

    MetadataCleaner periodically cleans up metadata. There are six types of metadata, enumerated in MetadataCleanerType:
    private[spark] object MetadataCleanerType extends Enumeration {

    val MAP_OUTPUT_TRACKER, SPARK_CONTEXT, HTTP_BROADCAST, BLOCK_MANAGER,
    SHUFFLE_BLOCK_MANAGER, BROADCAST_VARS = Value

    type MetadataCleanerType = Value

    def systemProperty(which: MetadataCleanerType.MetadataCleanerType): String = {
    "spark.cleaner.ttl." + which.toString
    }
    }
    MAP_OUTPUT_TRACKER: metadata of map task output; SPARK_CONTEXT: RDDs persisted in memory; HTTP_BROADCAST: metadata of broadcasts distributed over HTTP; BLOCK_MANAGER: non-broadcast blocks held by the BlockManager; SHUFFLE_BLOCK_MANAGER: shuffle output data; BROADCAST_VARS: metadata of Torrent-style broadcasts, which rely on the BlockManager underneath.
    During SparkContext initialization a MetadataCleaner is created to clean up the RDDs persisted in memory (a configuration sketch follows the snippet below).
    // Keeps track of all persisted RDDs
    private[spark] val persistentRdds = new TimeStampedWeakValueHashMap[Int, RDD[_]]

/** Called by MetadataCleaner to clean up the persistentRdds map periodically */
private[spark] def cleanup(cleanupTime: Long) {
persistentRdds.clearOldValues(cleanupTime)
}

_metadataCleaner = new MetadataCleaner(MetadataCleanerType.SPARK_CONTEXT, this.cleanup, _conf)
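
A minimal configuration sketch, assuming the TTL-based cleaner is wanted: the cleaner only runs when a TTL is configured, and the per-type key below is derived from MetadataCleanerType.systemProperty shown earlier.

import org.apache.spark.SparkConf

// Sketch only: spark.cleaner.ttl enables periodic metadata cleaning globally,
// while "spark.cleaner.ttl." + type overrides the TTL for a single cleaner type.
val conf = new SparkConf()
  .setAppName("mytest")
  .setMaster("local[2]")
  .set("spark.cleaner.ttl", "3600")               // global TTL: one hour
  .set("spark.cleaner.ttl.SPARK_CONTEXT", "1800") // tighter TTL just for persisted RDDs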

  4. Create SparkStatusTracker

    SparkStatusTracker is a low-level status reporting API for monitoring jobs and stages.

  5. Initialize the Spark UI

    SparkUI backs Spark's web monitoring UI, which covers the Spark environment and the whole life cycle of jobs and tasks.

  6. HadoopConfiguration

    Since Spark uses HDFS as its distributed file system by default, the Hadoop-related configuration has to be loaded.
    _hadoopConfiguration = SparkHadoopUtil.get.newConfiguration(_conf)
    Let's see what newConfiguration does.
    def newConfiguration(conf: SparkConf): Configuration = {
    val hadoopConf = new Configuration()

    // Note: this null check is around more than just access to the "conf" object to maintain
    // the behavior of the old implementation of this code, for backwards compatibility.
    if (conf != null) {
    // Explicitly check for S3 environment variables
    if (System.getenv("AWS_ACCESS_KEY_ID") != null &&
    System.getenv("AWS_SECRET_ACCESS_KEY") != null) {
    val keyId = System.getenv("AWS_ACCESS_KEY_ID")
    val accessKey = System.getenv("AWS_SECRET_ACCESS_KEY")

    hadoopConf.set("fs.s3.awsAccessKeyId", keyId)
    hadoopConf.set("fs.s3n.awsAccessKeyId", keyId)
    hadoopConf.set("fs.s3a.access.key", keyId)
    hadoopConf.set("fs.s3.awsSecretAccessKey", accessKey)
    hadoopConf.set("fs.s3n.awsSecretAccessKey", accessKey)
    hadoopConf.set("fs.s3a.secret.key", accessKey)
    }
    // Copy any "spark.hadoop.foo=bar" system properties into conf as "foo=bar"
    conf.getAll.foreach { case (key, value) =>
    if (key.startsWith("spark.hadoop.")) {
    hadoopConf.set(key.substring("spark.hadoop.".length), value)
    }
    }
    val bufferSize = conf.get("spark.buffer.size", "65536")
    hadoopConf.set("io.file.buffer.size", bufferSize)
    }

    hadoopConf
    }
    1) Load the Amazon S3 AccessKeyId and SecretAccessKey into the Hadoop Configuration.
    2) Copy every SparkConf property that starts with spark.hadoop. into the Hadoop Configuration, with the prefix stripped.
    3) Copy SparkConf's spark.buffer.size into the Hadoop Configuration as io.file.buffer.size.
    A short usage sketch follows.
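
A short usage sketch (the HDFS address is a made-up example) showing the spark.hadoop. prefix and spark.buffer.size being reflected in sc.hadoopConfiguration:

import org.apache.spark.{SparkConf, SparkContext}

object HadoopConfDemo {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("mytest")
      .setMaster("local[2]")
      .set("spark.hadoop.fs.defaultFS", "hdfs://namenode:8020") // copied as fs.defaultFS
      .set("spark.buffer.size", "131072")                       // copied as io.file.buffer.size

    val sc = new SparkContext(conf)
    println(sc.hadoopConfiguration.get("fs.defaultFS"))        // hdfs://namenode:8020
    println(sc.hadoopConfiguration.get("io.file.buffer.size")) // 131072
    sc.stop()
  }
}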

  7. ExecutorEnvs

    The environment variables held in executorEnvs are sent to the Master when the application registers; after the Master sends the scheduling decision to the Workers, each Worker ultimately uses the information in executorEnvs to launch its Executors.
    private[spark] val executorEnvs = HashMap[String, String]()

// Convert java options to env vars as a work around
// since we can't set env vars directly in sbt.
for { (envKey, propKey) <- Seq(("SPARK_TESTING", "spark.testing"))
value <- Option(System.getenv(envKey)).orElse(Option(System.getProperty(propKey)))} {
executorEnvs(envKey) = value
}
Option(System.getenv("SPARK_PREPEND_CLASSES")).foreach { v =>
executorEnvs("SPARK_PREPEND_CLASSES") = v
}
// The Mesos scheduler backend relies on this environment variable to set executor memory.
// TODO: Set this only in the Mesos scheduler.
executorEnvs("SPARK_EXECUTOR_MEMORY") = executorMemory + "m"
executorEnvs ++= _conf.getExecutorEnv
executorEnvs("SPARK_USER") = sparkUser
As the code above shows, the Executor memory size can be specified with spark.executor.memory, or through the environment variables SPARK_EXECUTOR_MEMORY or SPARK_MEM.
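
A minimal configuration sketch; the property takes precedence, and the environment variables are only consulted as fallbacks:

import org.apache.spark.SparkConf

// Sketch only: request 2 GB per Executor through the configuration key.
// SPARK_EXECUTOR_MEMORY / SPARK_MEM would only be used if this key were absent.
val conf = new SparkConf()
  .setAppName("mytest")
  .setMaster("local[2]")
  .set("spark.executor.memory", "2g")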

  8. Register HeartbeatReceiver

    We need to register “HeartbeatReceiver” before “createTaskScheduler” because Executor will retrieve “HeartbeatReceiver” in the constructor. (SPARK-6640)
    _heartbeatReceiver = env.rpcEnv.setupEndpoint(
    HeartbeatReceiver.ENDPOINT_NAME, new HeartbeatReceiver(this))

  9. Create TaskScheduler

    TaskScheduler is Spark's task scheduler: Spark submits tasks through it and asks the cluster to schedule them. TaskScheduler matches the configured master against the deploy mode, creates a TaskSchedulerImpl, and creates a different SchedulerBackend for each cluster-manager mode (local, local[n], standalone, local-cluster, mesos, YARN).
    val (sched, ts) = SparkContext.createTaskScheduler(this, master)
    _schedulerBackend = sched
    _taskScheduler = ts
    The createTaskScheduler method pattern-matches on the master URL to create the appropriate TaskSchedulerImpl and backend. Since local mode is used here, a LocalBackend is returned (a simplified sketch of this dispatch follows the snippet below).
    case LOCAL_N_REGEX(threads) =>
    def localCpuCount: Int = Runtime.getRuntime.availableProcessors()
    // local[*] estimates the number of cores on the machine; local[N] uses exactly N threads.
    val threadCount = if (threads == "*") localCpuCount else threads.toInt
    if (threadCount <= 0) {
    throw new SparkException(s"Asked to run locally with $threadCount threads")
    }
    val scheduler = new TaskSchedulerImpl(sc, MAX_LOCAL_TASK_FAILURES, isLocal = true)
    val backend = new LocalBackend(sc.getConf, scheduler, threadCount)
    scheduler.initialize(backend)
    (backend, scheduler)
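
A simplified re-creation of this dispatch (illustrative only, not the Spark source): the regex mirrors SparkContext.LOCAL_N_REGEX, and the match shows why local[2] ends up with two threads and local[*] with one per core.

object MasterUrlDispatchDemo {
  // Mirrors SparkContext.LOCAL_N_REGEX: local[N] or local[*]
  private val LocalN = """local\[([0-9]+|\*)\]""".r

  def describe(master: String): String = master match {
    case "local"     => "local mode, 1 thread"
    case LocalN("*") => s"local mode, ${Runtime.getRuntime.availableProcessors()} threads"
    case LocalN(n)   => s"local mode, $n threads"
    case m if m.startsWith("spark://") => "standalone cluster"
    case other       => s"handled by another cluster manager: $other"
  }

  def main(args: Array[String]): Unit = {
    Seq("local", "local[2]", "local[*]", "spark://master:7077").foreach { m =>
      println(s"$m -> ${describe(m)}")
    }
  }
}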

9.1 Creating TaskSchedulerImpl

TaskSchedulerImpl is constructed as follows:
1) Read configuration from SparkConf, including the number of CPUs allocated per task and the scheduling mode (FAIR or FIFO, FIFO by default).

val conf = sc.conf

// How often to check for speculative tasks
val SPECULATION_INTERVAL_MS = conf.getTimeAsMs("spark.speculation.interval", "100ms")

private val speculationScheduler =
ThreadUtils.newDaemonSingleThreadScheduledExecutor("task-scheduler-speculation")

// Threshold above which we warn user initial TaskSet may be starved
val STARVATION_TIMEOUT_MS = conf.getTimeAsMs("spark.starvation.timeout", "15s")

// CPUs to request per task
val CPUS_PER_TASK = conf.getInt("spark.task.cpus", 1)

// default scheduler is FIFO
private val schedulingModeConf = conf.get("spark.scheduler.mode", "FIFO")
2) Create a TaskResultGetter, whose role is:
Runs a thread pool that deserializes and remotely fetches (if necessary) task results.
It processes, through a thread pool, the task results that the Executors on the Workers send back. By default a pool of 4 threads whose names start with task-result-getter is created via Executors.newFixedThreadPool.
private val THREADS = sparkEnv.conf.getInt("spark.resultGetter.threads", 4)
private val getTaskResultExecutor = ThreadUtils.newDaemonFixedThreadPool(
THREADS, "task-result-getter")
TaskSchedulerImpl supports two scheduling modes, FAIR and FIFO. The actual scheduling of tasks is ultimately carried out by the SchedulerBackend; in local mode the SchedulerBackend is a LocalBackend. A sketch for switching the mode follows.
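
A minimal sketch, assuming FAIR scheduling is wanted instead of the default; the allocation file path is a hypothetical example:

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .setAppName("mytest")
  .setMaster("local[2]")
  .set("spark.scheduler.mode", "FAIR")                                  // FIFO is the default
  .set("spark.scheduler.allocation.file", "/path/to/fairscheduler.xml") // pool definitions (hypothetical path)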

9.2 Initializing TaskSchedulerImpl

After TaskSchedulerImpl and LocalBackend have been created, TaskSchedulerImpl's initialize method is called. Taking the default FIFO scheduling as an example, initialization proceeds as follows.

def initialize(backend: SchedulerBackend) {
this.backend = backend
// temporarily set rootPool name to empty
rootPool = new Pool("", schedulingMode, 0, 0)
schedulableBuilder = {
schedulingMode match {
case SchedulingMode.FIFO =>
new FIFOSchedulableBuilder(rootPool)
case SchedulingMode.FAIR =>
new FairSchedulableBuilder(rootPool, conf)
}
}
schedulableBuilder.buildPools()
}
1) Give TaskSchedulerImpl a reference to the LocalBackend.
2) Create the root Pool, which caches the scheduling queue, the scheduling algorithm, the set of TaskSetManagers, and related information.
3) Create a FIFOSchedulableBuilder, which is used to manipulate the scheduling queue inside the Pool.

  10. Create DAGScheduler

    The main job of DAGScheduler is the preparation work done before TaskSchedulerImpl actually submits tasks: creating Jobs, dividing the RDDs of the DAG into Stages, submitting Stages, and so on. DAGScheduler is created as follows (an illustrative action that exercises it appears after the snippet):
    _dagScheduler = new DAGScheduler(this)
    _heartbeatReceiver.ask[Boolean](TaskSchedulerIsSet)

// start TaskScheduler after taskScheduler sets DAGScheduler reference in DAGScheduler's
// constructor
_taskScheduler.start()
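
Illustrative only: once the SparkContext is up, any action drives this machinery. Something like the following reduce goes through SparkContext.runJob to the DAGScheduler, which splits the work into stages and hands the tasks to TaskSchedulerImpl/LocalBackend:

import org.apache.spark.{SparkConf, SparkContext}

object FirstJobDemo {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("mytest").setMaster("local[2]"))
    val sum = sc.parallelize(1 to 100, 4) // 4 partitions -> 4 tasks in the single ResultStage
      .map(_ * 2)
      .reduce(_ + _)                      // action: triggers one job via DAGScheduler
    println(sum)                          // 10100
    sc.stop()
  }
}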

  11. Start TaskScheduler

    Starting the TaskScheduler actually calls the start method of the backend (a configuration sketch for enabling speculation follows the snippet).
    override def start() {
    backend.start()

    if (!isLocal && conf.getBoolean("spark.speculation", false)) {
    logInfo("Starting speculative execution thread")
    speculationScheduler.scheduleAtFixedRate(new Runnable {
    override def run(): Unit = Utils.tryOrStopSparkContext(sc) {
    checkSpeculatableTasks()
    }
    }, SPECULATION_INTERVAL_MS, SPECULATION_INTERVAL_MS, TimeUnit.MILLISECONDS)
    }
    }
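
A configuration sketch, assuming a non-local deployment so that the speculation thread above is actually started; the interval, quantile and multiplier keys are the standard speculation settings:

import org.apache.spark.SparkConf

// Sketch only: speculation is skipped in local mode (isLocal), so a cluster master is assumed.
val conf = new SparkConf()
  .setAppName("mytest")
  .setMaster("spark://master:7077")
  .set("spark.speculation", "true")
  .set("spark.speculation.interval", "100ms")  // how often to check for slow tasks
  .set("spark.speculation.quantile", "0.75")   // fraction of tasks that must finish first
  .set("spark.speculation.multiplier", "1.5")  // how much slower than the median counts as slow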
