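1. Introduction
At a high level, a Spark application runs as follows. The Driver program starts and creates a SparkContext, which then communicates with the ClusterManager. Based on what the application requests, the ClusterManager launches Executors on suitable Workers. Finally, the Driver communicates with the Executors and dispatches tasks to them for execution. Many details lie in between, such as task scheduling, the DAGScheduler, and the Shuffle stage; later posts will cover them. This post covers only how the Driver is started, and is based on the spark-2.4.0 source code.
2. The Driver startup flow
A ClientApp is created, and the work begins in the Client's onStart method: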
override def onStart(): Unit = {
  driverArgs.cmd match {
    case "launch" =>
      // TODO: We could add an env variable here and intercept it in `sc.addJar` that would
      //       truncate filesystem paths similar to what YARN does. For now, we just require
      //       people call `addJar` assuming the jar is in the same directory.
      val mainClass = "org.apache.spark.deploy.worker.DriverWrapper"
      val classPathConf = "spark.driver.extraClassPath"
      val classPathEntries = sys.props.get(classPathConf).toSeq.flatMap { cp =>
        cp.split(java.io.File.pathSeparator)
      }
      val libraryPathConf = "spark.driver.extraLibraryPath"
      val libraryPathEntries = sys.props.get(libraryPathConf).toSeq.flatMap { cp =>
        cp.split(java.io.File.pathSeparator)
      }
      val extraJavaOptsConf = "spark.driver.extraJavaOptions"
      val extraJavaOpts = sys.props.get(extraJavaOptsConf)
        .map(Utils.splitCommandString).getOrElse(Seq.empty)
      val sparkJavaOpts = Utils.sparkJavaOpts(conf)
      val javaOpts = sparkJavaOpts ++ extraJavaOpts
      val command = new Command(mainClass,
        Seq("{{WORKER_URL}}", "{{USER_JAR}}", driverArgs.mainClass) ++ driverArgs.driverOptions,
        sys.env, classPathEntries, libraryPathEntries, javaOpts)
      val driverDescription = new DriverDescription(
        driverArgs.jarUrl,
        driverArgs.memory,
        driverArgs.cores,
        driverArgs.supervise,
        command)
      asyncSendToMasterAndForwardReply[SubmitDriverResponse](
        RequestSubmitDriver(driverDescription))
    case "kill" =>
      val driverId = driverArgs.driverId
      asyncSendToMasterAndForwardReply[KillDriverResponse](RequestKillDriver(driverId))
  }
}
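The classpath and library-path handling in the "launch" branch above follows one small pattern: an optional setting is split on the platform path separator, and a missing setting becomes an empty list. Below is a minimal sketch of that pattern in isolation. The object and method names (ClassPathSplit, entries) are hypothetical and are not part of Spark.

```scala
import java.io.File

object ClassPathSplit {
  // Split an optional path-list value into its entries.
  // None yields an empty Seq, mirroring sys.props.get(...).toSeq.flatMap(...).
  def entries(value: Option[String]): Seq[String] =
    value.toSeq.flatMap(_.split(File.pathSeparator))
}
```

Calling ClassPathSplit.entries(sys.props.get("spark.driver.extraClassPath")) would reproduce what onStart does for the classpath setting.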
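The onStart method above first builds a driverDescription and then sends the Master a request to submit the Driver. In other words, once we submit an application, the Client that gets created asks the Master to start a Driver. The Master then handles that message. Now into the Master: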
override def receiveAndReply(context: RpcCallContext): PartialFunction[Any, Unit] = {
  // The Master pattern-matches on the incoming message
  case RequestSubmitDriver(description) =>
    // First check whether the Master is in the ALIVE state
    if (state != RecoveryState.ALIVE) {
      val msg = s"${Utils.BACKUP_STANDALONE_MASTER_PREFIX}: $state. " +
        "Can only accept driver submissions in ALIVE state."
      // The Master is not ALIVE, so reply with a failure response
      context.reply(SubmitDriverResponse(self, false, None, msg))
    } else {
      logInfo("Driver submitted " + description.command.mainClass)
      // Create the driver from the DriverDescription
      val driver = createDriver(description)
      // Persist the driver information
      persistenceEngine.addDriver(driver)
      // Add the driver to the waiting queue
      waitingDrivers += driver
      // Add the driver to the drivers HashSet
      drivers.add(driver)
      // Run scheduling
      schedule()
      // TODO: It might be good to instead have the submission client poll the master to determine
      //       the current status of the driver. For now it's simply "fire and forget".
      context.reply(SubmitDriverResponse(self, true, Some(driver.id),
        s"Driver successfully submitted as ${driver.id}"))
    }
}
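When the Master receives the Client's submit request, it first creates a Driver record and then adds it to the waiting queue, where it waits to be scheduled. Here is how the Driver record is created: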
private def createDriver(desc: DriverDescription): DriverInfo = {
  val now = System.currentTimeMillis()
  // Create a Date for the submission time
  val date = new Date(now)
  // Wrap the driver information in a DriverInfo object
  new DriverInfo(now, newDriverId(date), desc, date)
}
Once the Driver record exists, it is added to the queue, and scheduling runs last. Here is the scheduling method, schedule:
private def schedule(): Unit = {
  if (state != RecoveryState.ALIVE) {
    return
  }
  // Drivers take strict precedence over executors
  // Randomly shuffle the ALIVE workers in the cluster into shuffledAliveWorkers
  val shuffledAliveWorkers = Random.shuffle(workers.toSeq.filter(_.state == WorkerState.ALIVE))
  // Number of ALIVE workers
  val numWorkersAlive = shuffledAliveWorkers.size
  var curPos = 0
  // Iterate over the waiting drivers
  for (driver <- waitingDrivers.toList) { // iterate over a copy of waitingDrivers
    // We assign workers to each waiting driver in a round-robin fashion. For each driver, we
    // start from the last worker that was assigned a driver, and continue onwards until we have
    // explored all alive workers.
    var launched = false
    // Counts the workers visited so far for this driver
    var numWorkersVisited = 0
    while (numWorkersVisited < numWorkersAlive && !launched) {
      // Take the worker at position curPos in shuffledAliveWorkers
      val worker = shuffledAliveWorkers(curPos)
      // One more worker visited
      numWorkersVisited += 1
      // If this worker has enough free memory and cores for the driver
      if (worker.memoryFree >= driver.desc.mem && worker.coresFree >= driver.desc.cores) {
        // launch the driver on this worker
        launchDriver(worker, driver)
        // Remove the launched driver from the waiting queue
        waitingDrivers -= driver
        // Mark the driver as launched
        launched = true
      }
      // Advance to the next worker, wrapping around at the end
      curPos = (curPos + 1) % numWorkersAlive
    }
  }
  // Start executors on the workers
  startExecutorsOnWorkers()
}
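The round-robin placement in schedule can be tried out in isolation. The sketch below is a simplified model, not Spark code: Worker and Driver are hypothetical stand-ins for WorkerInfo and DriverInfo, the random shuffle is left out so the result is deterministic, and the sketch assumes that placing a driver immediately reduces the worker's free memory and cores.

```scala
object DriverPlacementSketch {
  // Hypothetical stand-ins for Spark's WorkerInfo and DriverInfo.
  final case class Worker(id: String, var memoryFree: Int, var coresFree: Int)
  final case class Driver(id: String, mem: Int, cores: Int)

  // Returns driverId -> workerId for every driver that could be placed.
  // Drivers that fit on no worker are simply left out of the result.
  def place(workers: Seq[Worker], waiting: Seq[Driver]): Map[String, String] = {
    var assignments = Map.empty[String, String]
    val numWorkers = workers.size
    var curPos = 0
    for (driver <- waiting) {
      var launched = false
      var visited = 0
      // Visit each worker at most once per driver, starting where we left off.
      while (visited < numWorkers && !launched) {
        val w = workers(curPos)
        visited += 1
        if (w.memoryFree >= driver.mem && w.coresFree >= driver.cores) {
          // Assumption: resources are reserved as soon as the driver is placed.
          w.memoryFree -= driver.mem
          w.coresFree -= driver.cores
          assignments += (driver.id -> w.id)
          launched = true
        }
        curPos = (curPos + 1) % numWorkers
      }
    }
    assignments
  }
}
```

Because curPos is shared across drivers, consecutive drivers tend to land on different workers, which spreads drivers across the cluster rather than filling one worker first.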
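During scheduling, the ALIVE workers in the cluster are shuffled randomly into a sequence, and that sequence is traversed. As soon as a worker is found whose free resources satisfy the Driver's requirements, the Driver is launched on that worker. Next, the launch itself, in the launchDriver method: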
private def launchDriver(worker: WorkerInfo, driver: DriverInfo) {
  logInfo("Launching driver " + driver.id + " on worker " + worker.id)
  // Record the driver on the worker
  worker.addDriver(driver)
  // Record the worker on the driver
  driver.worker = Some(worker)
  // Send a LaunchDriver message to that worker
  worker.endpoint.send(LaunchDriver(driver.id, driver.desc))
  // Mark the driver as RUNNING
  driver.state = DriverState.RUNNING
}
After the worker is recorded on the Driver, the Master sends that worker a message to launch the Driver. When the worker receives it, it runs the launch procedure. Here is how the Worker handles the message:
// In the Worker's receive method, pattern matching leads to the code below
case LaunchDriver(driverId, driverDesc) =>
  logInfo(s"Asked to launch driver $driverId")
  // Wrap the driver information in a DriverRunner object
  val driver = new DriverRunner(
    conf,
    driverId,
    workDir,
    sparkHome,
    driverDesc.copy(command = Worker.maybeUpdateSSLSettings(driverDesc.command, conf)),
    self,
    workerUri,
    securityMgr)
  // Register the DriverRunner under its driverId
  drivers(driverId) = driver
  // Start the driver
  driver.start()
  // Update the number of cores used on this worker
  coresUsed += driverDesc.cores
  // Update the memory used on this worker
  memoryUsed += driverDesc.mem
With the DriverRunner object built, its start method is called to launch the Driver:
/** Starts a thread to run and manage the driver. */
private[worker] def start() = {
  // Start the driver on a new thread
  new Thread("DriverRunner for " + driverId) {
    override def run() {
      var shutdownHook: AnyRef = null
      try {
        shutdownHook = ShutdownHookManager.addShutdownHook { () =>
          logInfo(s"Worker shutting down, killing driver $driverId")
          kill()
        }
        // prepare driver jars and run driver
        // The exit code determines the driver's final state
        val exitCode = prepareAndRunDriver()
        // set final state depending on if forcibly killed and process exit code
        finalState = if (exitCode == 0) {
          Some(DriverState.FINISHED)
        } else if (killed) {
          Some(DriverState.KILLED)
        } else {
          Some(DriverState.FAILED)
        }
      } catch {
        case e: Exception =>
          kill()
          finalState = Some(DriverState.ERROR)
          finalException = Some(e)
      } finally {
        if (shutdownHook != null) {
          ShutdownHookManager.removeShutdownHook(shutdownHook)
        }
      }
      // notify worker of final driver state, possible exception
      // Send the driver's final state to the worker
      worker.send(DriverStateChanged(driverId, finalState.get, finalException))
    }
  }.start()
}
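The mapping from exit code to final state in start is easy to misread, because the check for exit code 0 comes before the check for killed. The sketch below isolates that decision as a pure function. The State type and its members are hypothetical stand-ins for DriverState, and the ERROR case (set in the catch block) is left out.

```scala
object FinalStateSketch {
  // Hypothetical stand-ins for the DriverState values used in start().
  sealed trait State
  case object Finished extends State
  case object Killed extends State
  case object Failed extends State

  // Same order of checks as in start(): a zero exit code wins,
  // then the killed flag, and anything else counts as a failure.
  def finalState(exitCode: Int, killed: Boolean): State =
    if (exitCode == 0) Finished
    else if (killed) Killed
    else Failed
}
```

One consequence of this ordering is that a process which exits with 0 is reported as finished even if the killed flag happens to be set.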
Next, the prepareAndRunDriver method:
private[worker] def prepareAndRunDriver(): Int = {
  // Create the driver's working directory
  val driverDir = createWorkingDirectory()
  // Download the user jar into that directory
  val localJarFilename = downloadUserJar(driverDir)
  // Replace placeholder arguments with their actual values
  def substituteVariables(argument: String): String = argument match {
    case "{{WORKER_URL}}" => workerUrl
    case "{{USER_JAR}}" => localJarFilename
    case other => other
  }
  // TODO: If we add ability to submit multiple jars they should also be added here
  // Build a ProcessBuilder for the command that launches the driver
  val builder = CommandUtils.buildProcessBuilder(driverDesc.command, securityManager,
    driverDesc.mem, sparkHome.getAbsolutePath, substituteVariables)
  // Run the command to start the driver
  runDriver(builder, driverDir, driverDesc.supervise)
}
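The placeholders that the Client put into the command ("{{WORKER_URL}}" and "{{USER_JAR}}") are resolved here on the Worker. A minimal sketch of that substitution applied to an argument list follows. PlaceholderSketch is a hypothetical name, and the URL, jar path, and class name used in the usage note are made-up values for illustration only.

```scala
object PlaceholderSketch {
  // Replace the two known placeholders; pass every other argument through unchanged.
  def substitute(workerUrl: String, localJar: String)(arg: String): String = arg match {
    case "{{WORKER_URL}}" => workerUrl
    case "{{USER_JAR}}"   => localJar
    case other            => other
  }

  // Apply the substitution to a whole argument list.
  def resolve(args: Seq[String], workerUrl: String, localJar: String): Seq[String] =
    args.map(a => substitute(workerUrl, localJar)(a))
}
```

For example, resolving Seq("{{WORKER_URL}}", "{{USER_JAR}}", "com.example.Main") with a worker URL of "spark://Worker@host:7078" and a jar path of "/work/driver-1/app.jar" yields those two values followed by "com.example.Main".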
The code above mainly prepares the Driver's runtime environment and builds the command that launches it, then calls runDriver. Into that method:
private def runDriver(builder: ProcessBuilder, baseDir: File, supervise: Boolean): Int = {
  // Set the driver's working directory
  builder.directory(baseDir)
  def initialize(process: Process): Unit = {
    // Redirect stdout and stderr to files
    // Create the stdout file and redirect the process's standard output
    // (read through getInputStream) to it
    val stdout = new File(baseDir, "stdout")
    CommandUtils.redirectStream(process.getInputStream, stdout)
    // Create the stderr file, which will hold the error log
    val stderr = new File(baseDir, "stderr")
    // Format the builder's command as a list of quoted arguments
    val formattedCommand = builder.command.asScala.mkString("\"", "\" \"", "\"")
    val header = "Launch Command: %s\n%s\n\n".format(formattedCommand, "=" * 40)
    // Write the launch-command header to stderr, then redirect the
    // process's error stream to the same file
    Files.append(header, stderr, StandardCharsets.UTF_8)
    CommandUtils.redirectStream(process.getErrorStream, stderr)
  }
  runCommandWithRetry(ProcessBuilderLike(builder), initialize, supervise)
}
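The header written at the top of the stderr file is built with two plain string operations, so its exact shape can be checked without starting a process. The sketch below reuses the same mkString and format calls as runDriver. LaunchHeaderSketch is a hypothetical name, and the command in the usage note is made up.

```scala
object LaunchHeaderSketch {
  // Quote each argument, join with spaces, then append a 40-character rule.
  def header(command: Seq[String]): String = {
    val formattedCommand = command.mkString("\"", "\" \"", "\"")
    "Launch Command: %s\n%s\n\n".format(formattedCommand, "=" * 40)
  }
}
```

For Seq("java", "-cp", "app.jar"), the first line of the header reads: Launch Command: "java" "-cp" "app.jar".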
The code above mainly creates the files that hold the Driver's logs, then calls runCommandWithRetry:
private[worker] def runCommandWithRetry(
    command: ProcessBuilderLike, initialize: Process => Unit, supervise: Boolean): Int = {
  // Initial exit code is -1
  var exitCode = -1
  // Time to wait between submission retries.
  var waitSeconds = 1
  // A run of this many seconds resets the exponential back-off.
  val successfulRunDuration = 5
  // keepTrying starts as true unless the driver has already been killed
  var keepTrying = !killed
  while (keepTrying) {
    logInfo("Launch Command: " + command.command.mkString("\"", "\" \"", "\""))
    synchronized {
      // If the driver has been killed, return the current exit code
      if (killed) { return exitCode }
      // This is where the command is actually executed and the driver process starts
      process = Some(command.start())
      initialize(process.get)
    }
    val processStart = clock.getTimeMillis()
    exitCode = process.get.waitFor()
    // check if attempting another run
    keepTrying = supervise && exitCode != 0 && !killed
    if (keepTrying) {
      if (clock.getTimeMillis() - processStart > successfulRunDuration * 1000L) {
        waitSeconds = 1
      }
      logInfo(s"Command exited with status $exitCode, re-launching after $waitSeconds s.")
      sleeper.sleep(waitSeconds)
      waitSeconds = waitSeconds * 2 // exponential back-off
    }
  }
  // Return the exit code
  exitCode
}
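The retry loop combines three rules: retry only when supervise is set and the exit code is non-zero, double the wait after each failure, and reset the wait to 1 second after a run that lasted longer than 5 seconds. The sketch below models just those rules with the process and the sleeper replaced by plain functions. RetrySketch, runOnce, sleep, and maxAttempts are hypothetical; in particular, maxAttempts is an addition so the sketch always terminates, whereas the loop above keeps retrying until the driver succeeds or is killed.

```scala
object RetrySketch {
  // runOnce returns (exitCode, how long the run lasted in milliseconds).
  // sleep receives the number of seconds to wait before the next attempt.
  def runWithRetry(
      supervise: Boolean,
      runOnce: () => (Int, Long),
      sleep: Int => Unit,
      maxAttempts: Int): Int = {
    var exitCode = -1
    var waitSeconds = 1
    val successfulRunDuration = 5
    var attempts = 0
    var keepTrying = true
    while (keepTrying && attempts < maxAttempts) {
      val (code, runMillis) = runOnce()
      attempts += 1
      exitCode = code
      // Retry only for supervised drivers that exited with a failure.
      keepTrying = supervise && exitCode != 0
      if (keepTrying) {
        // A sufficiently long run resets the back-off.
        if (runMillis > successfulRunDuration * 1000L) {
          waitSeconds = 1
        }
        sleep(waitSeconds)
        waitSeconds = waitSeconds * 2 // exponential back-off
      }
    }
    exitCode
  }
}
```

Passing a sleep function that records its argument instead of sleeping makes the back-off sequence visible, which is the same idea as the injectable sleeper and clock in the original method.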
Back in the start method, the exit code determines whether the Driver ends up FINISHED, KILLED, or FAILED. That state is then sent to the Worker. Here is how the Worker handles the message:
case driverStateChanged @ DriverStateChanged(driverId, state, exception) =>
  handleDriverStateChanged(driverStateChanged)
This calls handleDriverStateChanged. Into that method:
private[worker] def handleDriverStateChanged(driverStateChanged: DriverStateChanged): Unit = {
  // Get the driver's ID
  val driverId = driverStateChanged.driverId
  val exception = driverStateChanged.exception
  // Get the driver's state
  val state = driverStateChanged.state
  // Log a message according to the state
  state match {
    case DriverState.ERROR =>
      logWarning(s"Driver $driverId failed with unrecoverable exception: ${exception.get}")
    case DriverState.FAILED =>
      logWarning(s"Driver $driverId exited with failure")
    case DriverState.FINISHED =>
      logInfo(s"Driver $driverId exited successfully")
    case DriverState.KILLED =>
      logInfo(s"Driver $driverId was killed by user")
    case _ =>
      logDebug(s"Driver $driverId changed state to $state")
  }
  // Forward the driver's state change to the Master
  sendToMaster(driverStateChanged)
  // Remove the driver from the running drivers
  val driver = drivers.remove(driverId).get
  // Record it among the finished drivers
  finishedDrivers(driverId) = driver
  // Trim the retained finished drivers if necessary
  trimFinishedDriversIfNecessary()
  // Release the memory and cores the driver was using
  memoryUsed -= driver.driverDesc.mem
  coresUsed -= driver.driverDesc.cores
}
The key step here is that the Worker forwards the Driver's state change to the Master, which then handles it.
In the Master:
case DriverStateChanged(driverId, state, exception) =>
  state match {
    case DriverState.ERROR | DriverState.FINISHED | DriverState.KILLED | DriverState.FAILED =>
      // All four of these states lead to removeDriver
      removeDriver(driverId, state, exception)
    case _ =>
      throw new Exception(s"Received unexpected state update for driver $driverId: $state")
  }
On receiving the message, the Master calls removeDriver to remove the Driver:
private def removeDriver(
    driverId: String,
    finalState: DriverState,
    exception: Option[Exception]) {
  drivers.find(d => d.id == driverId) match {
    case Some(driver) =>
      logInfo(s"Removing driver: $driverId")
      // Remove the driver from the drivers set
      drivers -= driver
      if (completedDrivers.size >= RETAINED_DRIVERS) {
        val toRemove = math.max(RETAINED_DRIVERS / 10, 1)
        completedDrivers.trimStart(toRemove)
      }
      // Add the driver to the completedDrivers buffer
      completedDrivers += driver
      // Also remove the driver from the persistence engine
      persistenceEngine.removeDriver(driver)
      // Set the driver's state to its final state
      driver.state = finalState
      driver.exception = exception
      // Remove the driver from its worker
      driver.worker.foreach(w => w.removeDriver(driver))
      // Finally, schedule again
      schedule()
    case None =>
      logWarning(s"Asked to remove unknown driver: $driverId")
  }
}
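The retention rule in removeDriver keeps the completedDrivers buffer bounded: once it holds RETAINED_DRIVERS entries, the oldest tenth (at least one entry) is dropped before the new entry is appended. The sketch below shows that rule on a buffer of driver IDs. RetentionSketch and addCompleted are hypothetical names, the limit is passed in as a parameter, and remove(0, n) is used in place of trimStart(n) to drop the first n elements.

```scala
import scala.collection.mutable.ArrayBuffer

object RetentionSketch {
  // Append driverId, first dropping the oldest entries if the buffer is full.
  def addCompleted(completed: ArrayBuffer[String], driverId: String, retained: Int): Unit = {
    require(retained > 0, "retained must be positive")
    if (completed.size >= retained) {
      // Drop a tenth of the limit at once (at least 1) so trimming is infrequent.
      val toRemove = math.max(retained / 10, 1)
      completed.remove(0, toRemove)
    }
    completed += driverId
  }
}
```

Dropping a batch rather than a single entry means the buffer is trimmed once every few insertions instead of on every insertion after the limit is reached.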
That covers the whole Driver lifecycle: how it is started, and how it is removed once it finishes.