1 Package the application into a jar.
2 Invoke the spark-submit script to submit it to the cluster.
3 Run SparkSubmit's main method. In this method, an instance of the main class we wrote is created via reflection, and its main method is invoked to start executing our code. (The driver of a Spark program runs inside the SparkSubmit process.)
- Run SparkSubmit's main method.
//Source: SparkSubmit.scala
override def main(args: Array[String]): Unit = {
  val submit = new SparkSubmit() {
    override def doSubmit(args: Array[String]): Unit = {
      try {
        super.doSubmit(args)
      } catch {
        case e: SparkUserAppException =>
          exitFn(e.exitCode)
      }
    }
  }
  submit.doSubmit(args)
}
- Enter the doSubmit() method, which executes the submit(appArgs, uninitLog) statement.
def doSubmit(args: Array[String]): Unit = {
  val uninitLog = initializeLogIfNecessary(true, silent = true)
  //parse the command-line arguments into a SparkSubmitArguments object
  val appArgs = parseArguments(args)
  appArgs.action match {
    case SparkSubmitAction.SUBMIT => submit(appArgs, uninitLog)
    case SparkSubmitAction.KILL => kill(appArgs)
    case SparkSubmitAction.REQUEST_STATUS => requestStatus(appArgs)
    case SparkSubmitAction.PRINT_VERSION => printVersion()
  }
}
After entering the submit() method, prepareSubmitEnvironment resolves the name of the class to run, and then the runMain() statement executes.
private def submit(args: SparkSubmitArguments, uninitLog: Boolean): Unit = {
  //childMainClass is the name of the class we wrote ourselves, i.e. the class to execute
  val (childArgs, childClasspath, sparkConf, childMainClass) = prepareSubmitEnvironment(args)
  def doRunMain(): Unit = {
    if (args.proxyUser != null) {
      val proxyUser = UserGroupInformation.createProxyUser(args.proxyUser,
        UserGroupInformation.getCurrentUser())
      try {
        proxyUser.doAs(new PrivilegedExceptionAction[Unit]() {
          override def run(): Unit = {
            //execute runMain
            runMain(childArgs, childClasspath, sparkConf, childMainClass, args.verbose)
          }
        })
      } catch {
        case e: Exception => ... //error handling elided in this excerpt
      }
    } else {
      //execute runMain
      runMain(childArgs, childClasspath, sparkConf, childMainClass, args.verbose)
    }
  }
  ...
}
After entering runMain, an instance of the class we wrote (e.g. WordCount) is created via reflection; then the app is created and started.
private def runMain(
    childArgs: Seq[String],
    childClasspath: Seq[String],
    sparkConf: SparkConf,
    childMainClass: String,
    verbose: Boolean): Unit = {
  var mainClass: Class[_] = null
  try {
    //load the class via reflection
    mainClass = Utils.classForName(childMainClass)
  } catch {
    case e: Exception => ... //error handling elided in this excerpt
  }
  //create the app
  val app: SparkApplication = if (classOf[SparkApplication].isAssignableFrom(mainClass)) {
    mainClass.newInstance().asInstanceOf[SparkApplication]
  } else {
    if (classOf[scala.App].isAssignableFrom(mainClass)) {
      logWarning("Subclasses of scala.App may not work correctly. Use a main() method instead.")
    }
    new JavaMainApplication(mainClass)
  }
  //and start the app
  try {
    app.start(childArgs.toArray, sparkConf)
  } catch {
    case t: Throwable =>
      throw findCause(t)
  }
}
After the app is started, step into the source of start. Every subclass of SparkApplication must be executed through the start() method. JavaMainApplication extends SparkApplication and overrides start: inside start it looks up the class's static main method via reflection and invokes it. (The app.start(childArgs.toArray, sparkConf) call in the previous excerpt is what executes JavaMainApplication's start method.)
//Source: SparkApplication.scala
/**
* Entry point for a Spark application. Implementations must provide a no-argument constructor.
*/
private[spark] trait SparkApplication {
  def start(args: Array[String], conf: SparkConf): Unit
}
/**
* Implementation of SparkApplication that wraps a standard Java class with a "main" method.
*
* Configuration is propagated to the application via system properties, so running multiple
* of these in the same JVM may lead to undefined behavior due to configuration leaks.
*/
private[deploy] class JavaMainApplication(klass: Class[_]) extends SparkApplication {
  override def start(args: Array[String], conf: SparkConf): Unit = {
    //look up the class's static main method via reflection
    val mainMethod = klass.getMethod("main", new Array[String](0).getClass)
    ...
    //invoke the main method (the receiver is null because main is static)
    mainMethod.invoke(null, args)
  }
}
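The reflective launch above can be sketched outside of Spark. In the sketch below, DemoApp is a hypothetical stand-in for a user-written main class such as WordCount; the lookup and invoke mirror what JavaMainApplication.start does:

```java
import java.lang.reflect.Method;

public class ReflectiveLaunch {
    // Hypothetical stand-in for a user-written application class (e.g. WordCount).
    public static class DemoApp {
        public static String lastArg = null;
        public static void main(String[] args) { lastArg = args[0]; }
    }

    public static void main(String[] args) throws Exception {
        // In Spark the Class object comes from Utils.classForName(childMainClass);
        // here we reference the class directly to keep the sketch self-contained.
        Class<?> klass = DemoApp.class;
        // Same lookup as JavaMainApplication: a method named "main" taking String[].
        Method mainMethod = klass.getMethod("main", String[].class);
        // Receiver is null because main is static; the cast prevents varargs expansion.
        mainMethod.invoke(null, (Object) new String[] {"hello"});
        System.out.println(DemoApp.lastArg); // prints hello
    }
}
```

Note the `(Object)` cast: without it, the String[] would be spread as individual varargs arguments to invoke and the call would fail.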
Note: adapted from the article spark-源碼-submit命令 by scandly on 簡書 (Jianshu).
4 When execution reaches the point where the SparkContext object is created, initialization of the SparkContext begins.
5 While the SparkContext object is being initialized, two especially important objects are created: TaskScheduler and DAGScheduler.
TaskScheduler is responsible for submitting tasks and asking the cluster manager to schedule them.
DAGScheduler is responsible for submitting jobs: it cuts the DAG of RDD dependencies into stages, then submits each stage to the TaskScheduler as a TaskSet.
For details on TaskScheduler, see my other article, TaskScheduler詳解及源碼介紹-張之海.
In the SparkContext.scala source below, the taskScheduler and DAGScheduler are created, and in the DAGScheduler constructor, the DAGScheduler passes itself (this) to the taskScheduler.
//Source: SparkContext.scala
//create the taskScheduler
val (sched, ts) = SparkContext.createTaskScheduler(this, master, deployMode)
_schedulerBackend = sched
_taskScheduler = ts
//create the DAGScheduler
_dagScheduler = new DAGScheduler(this)
_heartbeatReceiver.ask[Boolean](TaskSchedulerIsSet)
// start TaskScheduler after taskScheduler sets DAGScheduler reference in DAGScheduler's constructor
//Start the taskScheduler. As the English comment above notes, DAGScheduler's constructor
//sets the DAGScheduler on the taskScheduler, i.e. taskScheduler.setDAGScheduler(this).
_taskScheduler.start()
//Source: DAGScheduler.scala
private[spark] class DAGScheduler(
    private[scheduler] val sc: SparkContext,
    private[scheduler] val taskScheduler: TaskScheduler,
    listenerBus: LiveListenerBus,
    mapOutputTracker: MapOutputTrackerMaster,
    blockManagerMaster: BlockManagerMaster,
    env: SparkEnv,
    clock: Clock = new SystemClock())
  extends Logging {
  ...
  private[spark] val eventProcessLoop = new DAGSchedulerEventProcessLoop(this)
  //through this call, the taskScheduler holds a reference to the DAGScheduler
  taskScheduler.setDAGScheduler(this)
  ...
}
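The initialization order matters here: the DAGScheduler constructor hands the TaskScheduler its DAGScheduler reference before _taskScheduler.start() runs. A minimal sketch of that wiring, using hypothetical toy types rather than Spark's real classes:

```java
public class SchedulerWiring {
    // Toy stand-ins for TaskScheduler and DAGScheduler (names hypothetical).
    static class ToyTaskScheduler {
        ToyDAGScheduler dag; // set from the DAGScheduler's constructor
        void setDAGScheduler(ToyDAGScheduler d) { dag = d; }
        void start() {
            // By the time start() runs, the back-reference must already be set,
            // mirroring the ordering comment in SparkContext.scala.
            if (dag == null) throw new IllegalStateException("DAGScheduler not set");
        }
    }
    static class ToyDAGScheduler {
        // The constructor passes itself (this) to the task scheduler.
        ToyDAGScheduler(ToyTaskScheduler ts) { ts.setDAGScheduler(this); }
    }

    public static void main(String[] args) {
        ToyTaskScheduler ts = new ToyTaskScheduler();
        new ToyDAGScheduler(ts); // constructor wires the back-reference
        ts.start();              // safe: the reference is already in place
        System.out.println(ts.dag != null); // prints true
    }
}
```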
In the DAGScheduler.scala source above, note the line eventProcessLoop = new DAGSchedulerEventProcessLoop(this). DAGSchedulerEventProcessLoop is an EventLoop whose single dedicated thread handles DAGSchedulerEvent messages (see 核心服務 DAGSchedulerEventProcessLoop). I'm not being lazy; it will get its own article later, haha~
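The pattern behind that event loop can be sketched as a single consumer thread draining a blocking queue. Everything below is a hypothetical simplification, not Spark's API; real events are DAGSchedulerEvent objects, and shutdown is handled differently:

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingDeque;

public class MiniEventLoop {
    private final BlockingQueue<String> queue = new LinkedBlockingDeque<>();
    private final StringBuilder handled = new StringBuilder();
    // Single dedicated thread: events are processed one at a time, in order,
    // so the handler logic never needs its own locking.
    private final Thread eventThread = new Thread(() -> {
        try {
            while (true) {
                String event = queue.take();        // block until an event is posted
                if ("STOP".equals(event)) return;   // shutdown marker (hypothetical)
                handled.append(event).append(";");  // the single-threaded "onReceive"
            }
        } catch (InterruptedException ignored) { }
    });

    public void start() { eventThread.start(); }

    // Any thread may post; only the event thread consumes.
    public void post(String event) { queue.offer(event); }

    // Stop the loop and return everything it handled, in posting order.
    public String stopAndJoin() throws InterruptedException {
        queue.offer("STOP");
        eventThread.join(); // join() makes the event thread's writes visible here
        return handled.toString();
    }

    public static void main(String[] args) throws Exception {
        MiniEventLoop loop = new MiniEventLoop();
        loop.start();
        loop.post("JobSubmitted");
        loop.post("CompletionEvent");
        System.out.println(loop.stopAndJoin()); // prints JobSubmitted;CompletionEvent;
    }
}
```

The single-consumer design is the key property: DAGScheduler's internal state can be mutated without locks because only the event-loop thread ever touches it.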
6 When the TaskScheduler is created and started, two more very important objects come into being: DriverActor and ClientActor (in newer Spark versions these are the ClientEndpoint in StandaloneAppClient and the DriverEndpoint in CoarseGrainedSchedulerBackend).
ClientActor's role: register the user-submitted application with the Master.
DriverActor's role: accept the executors' reverse registration, then submit tasks to the executors.
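The two roles can be sketched with plain objects. Everything below is hypothetical and ignores RPC entirely; in real Spark these are messages exchanged between endpoints over the network:

```java
import java.util.ArrayList;
import java.util.List;

public class RegistrationSketch {
    // Toy Master: assigns an application id on registration (id is made up).
    static class Master {
        String registerApplication(String appName) { return "app-001"; }
    }
    // Toy driver endpoint: executors reverse-register here, then receive tasks.
    static class DriverEndpoint {
        final List<String> executors = new ArrayList<>();
        void registerExecutor(String execId) { executors.add(execId); }
        String launchTask(String task) { return task + " -> " + executors.get(0); }
    }

    public static void main(String[] args) {
        Master master = new Master();
        // ClientActor's job: register the application with the Master.
        String appId = master.registerApplication("WordCount");
        DriverEndpoint driver = new DriverEndpoint();
        // DriverActor's job: accept the executor's reverse registration...
        driver.registerExecutor("executor-0");
        // ...and hand tasks to registered executors.
        System.out.println(appId + ": " + driver.launchTask("task-0"));
    }
}
```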