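SparkContext is the single entry point to Spark and the bridge between user applications and the underlying implementation, so its importance is self-evident. It is also the first stop in my reading of the Spark source code.
Borrowing the sequence diagram from the post at http://blog.csdn.net/OiteBody/article/details/54959608, we can see SparkContext's execution flow clearly.
During initialization, SparkContext mainly creates the following components: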
- SparkEnv
- DAGScheduler
- TaskScheduler
- SchedulerBackend
- WebUI
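The most important parameter of SparkContext is SparkConf. As the source shows, the conf held inside SparkContext is obtained by calling clone() on the one passed in, and is then put through a series of validations.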
    try {
      _conf = config.clone()
      _conf.validateSettings()

      if (!_conf.contains("spark.master")) {
        throw new SparkException("A master URL must be set in your configuration")
      }
      if (!_conf.contains("spark.app.name")) {
        throw new SparkException("An application name must be set in your configuration")
      }

      // System property spark.yarn.app.id must be set if user code ran by AM on a YARN cluster
      // yarn-standalone is deprecated, but still supported
      if ((master == "yarn-cluster" || master == "yarn-standalone") &&
          !_conf.contains("spark.yarn.app.id")) {
        throw new SparkException("Detected yarn-cluster mode, but isn't running on a cluster. " +
          "Deployment to YARN is not supported directly by SparkContext. Please use spark-submit.")
      }

      if (_conf.getBoolean("spark.logConf", false)) {
        logInfo("Spark configuration:\n" + _conf.toDebugString)
      }

      // Set Spark driver host and port system properties
      _conf.setIfMissing("spark.driver.host", Utils.localHostName())
      _conf.setIfMissing("spark.driver.port", "0")

      _conf.set("spark.executor.id", SparkContext.DRIVER_IDENTIFIER)
      ...
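The clone-then-validate order above can be tried in isolation. The following is only a minimal sketch, not Spark's code: `MiniConf` and `validated` are hypothetical stand-ins for SparkConf and the checks in the excerpt, only the two required keys are checked, and `require` (which throws IllegalArgumentException) replaces SparkException.

```scala
import scala.collection.mutable

// Hypothetical stand-in for SparkConf: a mutable key/value store that can be copied.
final class MiniConf(private val settings: mutable.Map[String, String] = mutable.Map.empty) {
  def set(key: String, value: String): MiniConf = { settings(key) = value; this }

  def setIfMissing(key: String, value: String): MiniConf = {
    if (!settings.contains(key)) settings(key) = value
    this
  }

  def contains(key: String): Boolean = settings.contains(key)

  def get(key: String): Option[String] = settings.get(key)

  // Copy the entries so later changes to the original are not visible in the copy.
  def copy(): MiniConf = new MiniConf(settings.clone())
}

object ConfCloneSketch {
  // Same order as the excerpt: copy first, check required keys, then fill in defaults.
  def validated(userConf: MiniConf): MiniConf = {
    val conf = userConf.copy()
    require(conf.contains("spark.master"), "A master URL must be set in your configuration")
    require(conf.contains("spark.app.name"), "An application name must be set in your configuration")
    conf.setIfMissing("spark.driver.port", "0")
  }

  def main(args: Array[String]): Unit = {
    val userConf = new MiniConf().set("spark.master", "local").set("spark.app.name", "demo")
    val internal = validated(userConf)
    userConf.set("spark.app.name", "renamed") // does not reach the internal copy
    println(internal.get("spark.app.name"))
    println(internal.get("spark.driver.port"))
  }
}
```

Working on a copy means the caller can keep mutating its own conf after the context has been built without changing what the running context sees.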
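Step 1: Create SparkEnv
SparkEnv is Spark's execution environment object; it holds the many objects that Executors need in order to run. In local mode the Driver creates the Executor. In local-cluster or Standalone deployment mode, an Executor is also created inside the CoarseGrainedExecutorBackend process that the Worker launches. A SparkEnv therefore exists in the Driver and in every CoarseGrainedExecutorBackend process.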
    _env = createSparkEnv(_conf, isLocal, listenerBus)
    SparkEnv.set(_env)

    private[spark] def createSparkEnv(
        conf: SparkConf,
        isLocal: Boolean,
        listenerBus: LiveListenerBus): SparkEnv = {
      SparkEnv.createDriverEnv(conf, isLocal, listenerBus, SparkContext.numDriverCores(master))
    }
SparkEnv.createDriverEnv() ultimately calls SparkEnv.create():
    private def create(
        conf: SparkConf,
        executorId: String,
        hostname: String,
        port: Int,
        isDriver: Boolean,
        isLocal: Boolean,
        numUsableCores: Int,
        listenerBus: LiveListenerBus = null,
        mockOutputCommitCoordinator: Option[OutputCommitCoordinator] = None): SparkEnv = {
      ...
      val envInstance = new SparkEnv(
        executorId,
        rpcEnv,
        actorSystem,
        serializer,
        closureSerializer,
        cacheManager,
        mapOutputTracker,
        shuffleManager,
        broadcastManager,
        blockTransferService,
        blockManager,
        securityManager,
        sparkFilesDir,
        metricsSystem,
        memoryManager,
        outputCommitCoordinator,
        conf)

      if (isDriver) {
        envInstance.driverTmpDirToDelete = Some(sparkFilesDir)
      }

      envInstance
    }
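Its only purpose is to return a SparkEnv instance; the bulk of the work in between merely prepares the arguments for the SparkEnv constructor. So let us look at the constructor's parameters first.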
    class SparkEnv (
        val executorId: String,
        private[spark] val rpcEnv: RpcEnv,
        _actorSystem: ActorSystem, // TODO Remove actorSystem
        val serializer: Serializer,
        val closureSerializer: Serializer,
        val cacheManager: CacheManager,
        val mapOutputTracker: MapOutputTracker,
        val shuffleManager: ShuffleManager,
        val broadcastManager: BroadcastManager,
        val blockTransferService: BlockTransferService,
        val blockManager: BlockManager,
        val securityManager: SecurityManager,
        val sparkFilesDir: String,
        val metricsSystem: MetricsSystem,
        val memoryManager: MemoryManager,
        val outputCommitCoordinator: OutputCommitCoordinator,
        val conf: SparkConf) extends Logging {
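Here is what these parameters are for:
- rpcEnv: network communication, Netty-based by default. It is fairly complex and will be covered separately later.
- ActorSystem: the Akka actor system, long a piece of basic infrastructure in Spark, used both for sending distributed messages and for concurrent programming. Note the "TODO Remove actorSystem" comment in the constructor above.
- cacheManager: stores intermediate computation results (cached RDD partitions) so they can be reused
- mapOutputTracker: caches MapStatus information and provides a way to fetch it from the MapOutputTrackerMaster
- shuffleManager: the shuffle manager, which registers shuffles and provides the writers and readers for shuffle data
- broadcastManager: manages broadcast variables
- blockManager: manages storage blocks
- securityManager: security management
- sparkFilesDir: the directory where files are stored
- metricsSystem: metrics collection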
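Once the environment has been assembled, it is published with SparkEnv.set(_env) so that other components can fetch it later through SparkEnv.get. A minimal sketch of that "build once, publish, read anywhere" holder follows. `MiniEnv` is a hypothetical, heavily reduced stand-in with two illustrative fields, and failing with an exception when nothing has been set is an assumption of this sketch, not a statement about Spark's behaviour.

```scala
// Hypothetical, heavily reduced stand-in for SparkEnv: only two illustrative fields.
final case class MiniEnv(executorId: String, filesDir: String)

object MiniEnv {
  // A single process-wide reference, written during start-up and read afterwards.
  @volatile private var current: Option[MiniEnv] = None

  def set(env: MiniEnv): Unit = { current = Some(env) }

  // Assumption of this sketch: fail loudly if nothing has been published yet.
  def get: MiniEnv = current.getOrElse(
    throw new IllegalStateException("environment has not been set"))
}

object EnvHolderSketch {
  def main(args: Array[String]): Unit = {
    MiniEnv.set(MiniEnv("driver", "/tmp/spark-files"))
    println(MiniEnv.get.executorId)
  }
}
```

The holder keeps constructor signatures short, at the cost of a hidden global dependency: code that calls `get` only works after `set` has run in the same process.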
Step 2: Create the TaskScheduler
SparkContext picks the SchedulerBackend that matches Spark's run mode and starts the TaskScheduler. This step is critical.
    val (sched, ts) = SparkContext.createTaskScheduler(this, master)
    _schedulerBackend = sched
    _taskScheduler = ts
    _dagScheduler = new DAGScheduler(this)
    _heartbeatReceiver.ask[Boolean](TaskSchedulerIsSet)
    _taskScheduler.start()
    private def createTaskScheduler(
        sc: SparkContext,
        master: String): (SchedulerBackend, TaskScheduler) = {
      import SparkMasterRegex._

      // When running locally, don't try to re-execute tasks on failure.
      val MAX_LOCAL_TASK_FAILURES = 1

      master match {
        case "local" =>
          val scheduler = new TaskSchedulerImpl(sc, MAX_LOCAL_TASK_FAILURES, isLocal = true)
          val backend = new LocalBackend(sc.getConf, scheduler, 1)
          scheduler.initialize(backend)
          (backend, scheduler)

        case LOCAL_N_REGEX(threads) =>
          def localCpuCount: Int = Runtime.getRuntime.availableProcessors()
          // local[*] estimates the number of cores on the machine; local[N] uses exactly N threads.
          val threadCount = if (threads == "*") localCpuCount else threads.toInt
          if (threadCount <= 0) {
            throw new SparkException(s"Asked to run locally with $threadCount threads")
          }
          val scheduler = new TaskSchedulerImpl(sc, MAX_LOCAL_TASK_FAILURES, isLocal = true)
          val backend = new LocalBackend(sc.getConf, scheduler, threadCount)
          scheduler.initialize(backend)
          (backend, scheduler)
        ...
      }
    }
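The key point of createTaskScheduler is that it uses the master URL to determine how Spark is currently deployed, and then creates the matching SchedulerBackend subclass.
taskScheduler.start() starts the corresponding SchedulerBackend and, when speculation is enabled outside local mode, a timer that periodically checks for speculatable tasks. Taking LocalBackend as an example: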
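The dispatch on the master string can be sketched without any Spark classes. This is an illustrative reduction only: `MasterUrlSketch` and `localThreads` are hypothetical names, the regular expression is written from the behaviour described in the local[N] / local[*] comment and may differ in detail from the one in SparkMasterRegex, and only the two local cases are handled.

```scala
object MasterUrlSketch {
  // Matches local[N] and local[*]; the capture group holds either digits or "*".
  private val LocalN = """local\[([0-9]+|\*)\]""".r

  // Returns the number of local threads for a local master URL, or None for anything else.
  def localThreads(master: String): Option[Int] = master match {
    case "local" => Some(1)
    case LocalN("*") => Some(Runtime.getRuntime.availableProcessors())
    case LocalN(n) =>
      val count = n.toInt
      require(count > 0, s"Asked to run locally with $count threads")
      Some(count)
    case _ => None // cluster URLs would select other backends; not modelled here
  }

  def main(args: Array[String]): Unit = {
    Seq("local", "local[4]", "local[*]", "spark://host:7077").foreach { m =>
      println(s"$m -> ${localThreads(m)}")
    }
  }
}
```

Using a regex as a pattern-match extractor keeps each deployment mode in its own `case`, which is the same shape the real method uses to pick a backend.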
    override def start() {
      val rpcEnv = SparkEnv.get.rpcEnv
      val executorEndpoint = new LocalEndpoint(rpcEnv, userClassPath, scheduler, this, totalCores)
      localEndpoint = rpcEnv.setupEndpoint("LocalBackendEndpoint", executorEndpoint)
      listenerBus.post(SparkListenerExecutorAdded(
        System.currentTimeMillis,
        executorEndpoint.localExecutorId,
        new ExecutorInfo(executorEndpoint.localExecutorHostname, totalCores, Map.empty)))
      launcherBackend.setAppId(appId)
      launcherBackend.setState(SparkAppHandle.State.RUNNING)
    }
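Step 3: Create and start the DAGScheduler
Straight to the code:
    _dagScheduler = new DAGScheduler(this)
Now look at DAGScheduler's constructor:
    def this(sc: SparkContext) = this(sc, sc.taskScheduler)
Clearly, the DAGScheduler is created with the taskScheduler from the previous step as its argument.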
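The one-argument constructor is an ordinary Scala auxiliary constructor that fills in the second argument from the context. A minimal sketch with hypothetical stand-in classes (`MiniContext`, `MiniTaskScheduler`, `MiniDagScheduler`; the real classes take many more collaborators) shows the delegation:

```scala
// Hypothetical stand-ins; the real classes take many more collaborators.
final class MiniTaskScheduler(val name: String)

final class MiniContext(val taskScheduler: MiniTaskScheduler)

final class MiniDagScheduler(val ctx: MiniContext, val taskScheduler: MiniTaskScheduler) {
  // Auxiliary constructor: defaults the scheduler to the one already held by the context.
  def this(ctx: MiniContext) = this(ctx, ctx.taskScheduler)
}

object AuxConstructorSketch {
  def main(args: Array[String]): Unit = {
    val ctx = new MiniContext(new MiniTaskScheduler("local"))
    val dag = new MiniDagScheduler(ctx)
    // Both references point at the same scheduler instance.
    println(dag.taskScheduler eq ctx.taskScheduler)
  }
}
```

Because the constructor reads the scheduler from the context, the context's task scheduler has to be assigned first, which matches the order in the Step 2 excerpt, where _taskScheduler = ts comes before new DAGScheduler(this).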
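Step 4: Start the WebUI
In the source, this block runs before the task scheduler is created, which is what the comment at its end refers to.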
    _ui =
      if (conf.getBoolean("spark.ui.enabled", true)) {
        Some(SparkUI.createLiveUI(this, _conf, listenerBus, _jobProgressListener,
          _env.securityManager, appName, startTime = startTime))
      } else {
        // For tests, do not enable the UI
        None
      }
    // Bind the UI before starting the task scheduler to communicate
    // the bound port to the cluster manager properly
    _ui.foreach(_.bind())
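The UI is held as an Option, so every later use has to cope with it being absent. A minimal sketch of the same pattern follows. `MiniUi` and `createUi` are hypothetical, the settings are a plain Map rather than a SparkConf, and the port number is only illustrative.

```scala
// Hypothetical stand-in for the UI object; bind() only records that it was called.
final class MiniUi(val port: Int) {
  @volatile var bound: Boolean = false
  def bind(): Unit = { bound = true }
}

object OptionalUiSketch {
  // Build the UI only when the flag is on; None stands for "UI disabled".
  def createUi(settings: Map[String, String]): Option[MiniUi] = {
    // A missing key counts as enabled, like getBoolean("spark.ui.enabled", true).
    val enabled = settings.get("spark.ui.enabled").forall(_.toBoolean)
    if (enabled) Some(new MiniUi(4040)) else None
  }

  def main(args: Array[String]): Unit = {
    val ui = createUi(Map("spark.ui.enabled" -> "true"))
    ui.foreach(_.bind()) // does nothing when the UI is disabled
    println(ui.map(_.bound))
  }
}
```

With `foreach`, the bind call needs no explicit null or flag check, which is why the excerpt can call _ui.foreach(_.bind()) unconditionally.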