Spark startup process and communication: the messages involved

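1. Introduction

At a high level it works like this: the Driver program starts and creates a SparkContext, which then talks to the ClusterManager. The ClusterManager launches Executors on suitable Workers based on the application's needs, and finally the Driver communicates with the Executors and dispatches tasks to them for execution. There are many details in between, such as task scheduling, the DAGScheduler, and the shuffle stage; later posts will cover them. This post only covers how the Driver is launched, and is based on the spark-2.4.0 source.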

2. Driver launch flow

Submission begins by creating a ClientApp; the interesting part is the Client endpoint's onStart method:

override def onStart(): Unit = {
  driverArgs.cmd match {
    case "launch" =>
      // TODO: We could add an env variable here and intercept it in `sc.addJar` that would
      //       truncate filesystem paths similar to what YARN does. For now, we just require
      //       people call `addJar` assuming the jar is in the same directory.
      val mainClass = "org.apache.spark.deploy.worker.DriverWrapper"

      val classPathConf = "spark.driver.extraClassPath"
      val classPathEntries = sys.props.get(classPathConf).toSeq.flatMap { cp =>
        cp.split(java.io.File.pathSeparator)
      }

      val libraryPathConf = "spark.driver.extraLibraryPath"
      val libraryPathEntries = sys.props.get(libraryPathConf).toSeq.flatMap { cp =>
        cp.split(java.io.File.pathSeparator)
      }

      val extraJavaOptsConf = "spark.driver.extraJavaOptions"
      val extraJavaOpts = sys.props.get(extraJavaOptsConf)
        .map(Utils.splitCommandString).getOrElse(Seq.empty)
      val sparkJavaOpts = Utils.sparkJavaOpts(conf)
      val javaOpts = sparkJavaOpts ++ extraJavaOpts
      val command = new Command(mainClass,
        Seq("{{WORKER_URL}}", "{{USER_JAR}}", driverArgs.mainClass) ++ driverArgs.driverOptions,
        sys.env, classPathEntries, libraryPathEntries, javaOpts)

      val driverDescription = new DriverDescription(
        driverArgs.jarUrl,
        driverArgs.memory,
        driverArgs.cores,
        driverArgs.supervise,
        command)
      asyncSendToMasterAndForwardReply[SubmitDriverResponse](
        RequestSubmitDriver(driverDescription))

    case "kill" =>
      val driverId = driverArgs.driverId
      asyncSendToMasterAndForwardReply[KillDriverResponse](RequestKillDriver(driverId))
  }
}
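
How an optional classpath property turns into a list of entries can be shown in isolation. The sketch below is not Spark code: `splitEntries` is a hypothetical helper that only repeats the `Option` to `Seq` to `flatMap` pattern used in `onStart`.

```scala
import java.io.File

object ClassPathSketch {
  // Hypothetical helper mirroring the pattern in onStart: an unset property
  // (None) yields an empty Seq, a set property is split on the platform's
  // path separator (":" on Unix-like systems, ";" on Windows).
  def splitEntries(value: Option[String]): Seq[String] =
    value.toSeq.flatMap(cp => cp.split(File.pathSeparator))
}
```

Because `None.toSeq` is empty, no separate "is the property set?" check is needed.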

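In onStart above, the client first builds a driverDescription and then sends the Master a RequestSubmitDriver message. In other words, once we submit an application, the Client that gets created asks the Master to launch a Driver. The Master then handles that message, so let's move into Master:
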
override def receiveAndReply(context: RpcCallContext): PartialFunction[Any, Unit] = {
  // The Master pattern-matches on the incoming message
  case RequestSubmitDriver(description) =>
    // First check whether the Master is ALIVE; if not, reject the submission
    if (state != RecoveryState.ALIVE) {
      val msg = s"${Utils.BACKUP_STANDALONE_MASTER_PREFIX}: $state. " +
        "Can only accept driver submissions in ALIVE state."
      // The Master is not ALIVE, so reply with a failure message
      context.reply(SubmitDriverResponse(self, false, None, msg))
    } else {
      logInfo("Driver submitted " + description.command.mainClass)
      // Create the driver record from the information in the DriverDescription
      val driver = createDriver(description)
      // Persist the driver information
      persistenceEngine.addDriver(driver)
      // Add the driver to the waiting queue
      waitingDrivers += driver
      // Add the driver to the drivers HashSet
      drivers.add(driver)
      // Run scheduling
      schedule()

      // TODO: It might be good to instead have the submission client poll the master to determine
      //       the current status of the driver. For now it's simply "fire and forget".

      context.reply(SubmitDriverResponse(self, true, Some(driver.id),
        s"Driver successfully submitted as ${driver.id}"))
    }
}
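
The request/reply shape of this handler can be reduced to a toy model. This is only a sketch under simplifying assumptions: the message classes below are cut-down stand-ins for Spark's, `ToyMaster` is hypothetical, and the reply is returned directly instead of going through an `RpcCallContext`.

```scala
object SubmitSketch {
  // Simplified stand-ins for Spark's messages; the real classes carry more fields.
  sealed trait Message
  final case class RequestSubmitDriver(mainClass: String) extends Message
  final case class SubmitDriverResponse(
      success: Boolean, driverId: Option[String], message: String)

  sealed trait State
  case object Alive extends State
  case object Standby extends State

  // A toy master: rejects submissions unless it is Alive, otherwise queues the driver.
  final class ToyMaster(var state: State) {
    val waitingDrivers = scala.collection.mutable.ArrayBuffer.empty[String]
    private var nextId = 0

    def receiveAndReply(msg: Message): SubmitDriverResponse = msg match {
      case RequestSubmitDriver(mainClass) =>
        if (state != Alive) {
          SubmitDriverResponse(false, None,
            s"Master is $state; can only accept driver submissions in Alive state.")
        } else {
          val id = f"driver-$nextId%04d" // made-up ID scheme for this sketch
          nextId += 1
          waitingDrivers += id
          SubmitDriverResponse(true, Some(id), s"Driver $mainClass submitted as $id")
        }
    }
  }
}
```

A standby master answers with `success = false` and leaves its queue untouched, which is the same "fail fast, do not queue" behaviour as the real handler.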

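After the Master receives the client's driver-submission message, it first creates a driver record and then adds it to the waiting queue for later scheduling. Here is how the driver record is created: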

private def createDriver(desc: DriverDescription): DriverInfo = {
  val now = System.currentTimeMillis()
  // Build a Date from the current timestamp
  val date = new Date(now)
  // Wrap the driver information in a DriverInfo object
  new DriverInfo(now, newDriverId(date), desc, date)
}
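
The post does not show `newDriverId`, so here is a guess at what such an ID generator could look like. The layout `driver-<timestamp>-<counter>` and the `yyyyMMddHHmmss` pattern are assumptions modelled on the driver IDs that standalone mode prints, not a copy of the Spark method.

```scala
import java.text.SimpleDateFormat
import java.util.{Date, Locale, TimeZone}

object DriverIdSketch {
  private var nextDriverNumber = 0

  // Assumed ID layout: "driver-<yyyyMMddHHmmss>-<4-digit counter>".
  // UTC is fixed here only to make the output reproducible.
  def newDriverId(submitDate: Date): String = {
    val fmt = new SimpleDateFormat("yyyyMMddHHmmss", Locale.US)
    fmt.setTimeZone(TimeZone.getTimeZone("UTC"))
    val id = "driver-%s-%04d".format(fmt.format(submitDate), nextDriverNumber)
    nextDriverNumber += 1
    id
  }
}
```

The counter keeps IDs unique even when two drivers are submitted within the same second.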

Once the DriverInfo has been created, it is added to the waiting queue. Scheduling runs last; here is the schedule method:

private def schedule(): Unit = {
  if (state != RecoveryState.ALIVE) {
    return
  }
  // Drivers take strict precedence over executors
  // Randomly shuffle the ALIVE workers in the cluster into shuffledAliveWorkers
  val shuffledAliveWorkers = Random.shuffle(workers.toSeq.filter(_.state == WorkerState.ALIVE))
  // Count the ALIVE workers
  val numWorkersAlive = shuffledAliveWorkers.size
  var curPos = 0
  // Iterate over the waiting drivers
  for (driver <- waitingDrivers.toList) { // iterate over a copy of waitingDrivers
    // We assign workers to each waiting driver in a round-robin fashion. For each driver, we
    // start from the last worker that was assigned a driver, and continue onwards until we have
    // explored all alive workers.
    var launched = false
    // Number of workers visited so far for this driver
    var numWorkersVisited = 0
    while (numWorkersVisited < numWorkersAlive && !launched) {
      // Take the worker at the current position in shuffledAliveWorkers
      val worker = shuffledAliveWorkers(curPos)
      // One more worker visited
      numWorkersVisited += 1
      // If this worker has enough free memory and cores for the driver
      if (worker.memoryFree >= driver.desc.mem && worker.coresFree >= driver.desc.cores) {
        // launch the driver on this worker
        launchDriver(worker, driver)
        // Remove the launched driver from the waiting queue
        waitingDrivers -= driver
        // Mark the driver as launched
        launched = true
      }
      // Advance to the next worker, wrapping around
      curPos = (curPos + 1) % numWorkersAlive
    }
  }
  // Start executors on the workers
  startExecutorsOnWorkers()
}
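
The round-robin placement is easier to follow with the RPC and shuffling stripped away. The following is a simplified simulation, not the Spark implementation: `Worker` and `Driver` are hypothetical stand-ins, the worker order is supplied by the caller so the result is deterministic, and the resource deduction stands in for the accounting that the real Master does when it records a driver on a worker.

```scala
object ScheduleSketch {
  final case class Worker(id: String, var memoryFree: Int, var coresFree: Int)
  final case class Driver(id: String, mem: Int, cores: Int)

  // Returns (driverId -> workerId) for every driver that could be placed.
  // `aliveWorkers` is taken as already shuffled.
  def assign(aliveWorkers: IndexedSeq[Worker], waiting: Seq[Driver]): Map[String, String] = {
    var placed = Map.empty[String, String]
    var curPos = 0 // shared across drivers, so each search resumes where the last one stopped
    for (driver <- waiting) {
      var launched = false
      var visited = 0
      while (visited < aliveWorkers.size && !launched) {
        val worker = aliveWorkers(curPos)
        visited += 1
        if (worker.memoryFree >= driver.mem && worker.coresFree >= driver.cores) {
          // Deduct the resources so later drivers see the reduced capacity.
          worker.memoryFree -= driver.mem
          worker.coresFree -= driver.cores
          placed += (driver.id -> worker.id)
          launched = true
        }
        curPos = (curPos + 1) % aliveWorkers.size
      }
    }
    placed
  }
}
```

A driver that fits on no worker is simply left unplaced, which corresponds to staying in `waitingDrivers` until the next call to `schedule()`.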

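During scheduling, the cluster's ALIVE workers are shuffled into a sequence, which is then traversed; as soon as a worker has enough resources for the driver, the driver is launched on it. Next, the launch itself, in the launchDriver method: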

private def launchDriver(worker: WorkerInfo, driver: DriverInfo) {
  logInfo("Launching driver " + driver.id + " on worker " + worker.id)
  // Record the driver on the worker
  worker.addDriver(driver)
  // Record the worker on the driver
  driver.worker = Some(worker)
  // Send a LaunchDriver message to that worker
  worker.endpoint.send(LaunchDriver(driver.id, driver.desc))
  // Mark the driver as RUNNING
  driver.state = DriverState.RUNNING
}

After the worker has been recorded on the driver, the Master sends that worker a LaunchDriver message. When the worker receives it, it starts the driver. Here is how the Worker handles the message:

// In the Worker's receive method, pattern matching leads to the following case
case LaunchDriver(driverId, driverDesc) =>
  logInfo(s"Asked to launch driver $driverId")
  // Wrap the driver information in a DriverRunner object
  val driver = new DriverRunner(
    conf,
    driverId,
    workDir,
    sparkHome,
    driverDesc.copy(command = Worker.maybeUpdateSSLSettings(driverDesc.command, conf)),
    self,
    workerUri,
    securityMgr)
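  // Register the DriverRunner in the drivers map under its driverId
  drivers(driverId) = driver
  // Start the driver
  driver.start()
  // Update the number of cores in use on this worker
  coresUsed += driverDesc.cores
  // Update the memory in use on this worker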
  memoryUsed += driverDesc.mem

Once the DriverRunner object is built, its start method is called to launch the driver:

/** Starts a thread to run and manage the driver. */
private[worker] def start() = {
  // Start the driver on a new thread
  new Thread("DriverRunner for " + driverId) {
    override def run() {
      var shutdownHook: AnyRef = null
      try {
        shutdownHook = ShutdownHookManager.addShutdownHook { () =>
          logInfo(s"Worker shutting down, killing driver $driverId")
          kill()
        }

        // prepare driver jars and run driver
        // Get the exit code; the driver's final state is derived from it
        val exitCode = prepareAndRunDriver()

        // set final state depending on if forcibly killed and process exit code
        finalState = if (exitCode == 0) {
          Some(DriverState.FINISHED)
        } else if (killed) {
          Some(DriverState.KILLED)
        } else {
          Some(DriverState.FAILED)
        }
      } catch {
        case e: Exception =>
          kill()
          finalState = Some(DriverState.ERROR)
          finalException = Some(e)
      } finally {
        if (shutdownHook != null) {
          ShutdownHookManager.removeShutdownHook(shutdownHook)
        }
      }

      // notify worker of final driver state, possible exception
      // Send the driver's final state to the worker
      worker.send(DriverStateChanged(driverId, finalState.get, finalException))
    }
  }.start()
}
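
The mapping from exit code to final state has one subtlety: the exit code is checked before the `killed` flag. A minimal sketch of just that decision, with hypothetical state objects standing in for `DriverState`:

```scala
object FinalStateSketch {
  sealed trait DriverState
  case object Finished extends DriverState
  case object Killed extends DriverState
  case object Failed extends DriverState

  // Same precedence as in DriverRunner.start: a zero exit code wins over the
  // killed flag, so a driver that exits cleanly counts as Finished even if
  // kill() was requested around the same time.
  def finalState(exitCode: Int, killed: Boolean): DriverState =
    if (exitCode == 0) Finished
    else if (killed) Killed
    else Failed
}
```

The ERROR state is not covered here because it comes from the `catch` branch rather than from the exit code.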

Next, the prepareAndRunDriver method:

private[worker] def prepareAndRunDriver(): Int = {
  // Create the driver's working directory
  val driverDir = createWorkingDirectory()
  // Download the user jar into that directory
  val localJarFilename = downloadUserJar(driverDir)
  // Replace placeholder arguments with their actual values
  def substituteVariables(argument: String): String = argument match {
    case "{{WORKER_URL}}" => workerUrl
    case "{{USER_JAR}}" => localJarFilename
    case other => other
  }

  // TODO: If we add ability to submit multiple jars they should also be added here
  // Build a ProcessBuilder for the command that launches the driver
  val builder = CommandUtils.buildProcessBuilder(driverDesc.command, securityManager,
    driverDesc.mem, sparkHome.getAbsolutePath, substituteVariables)
  // Run the command to launch the driver
  runDriver(builder, driverDir, driverDesc.supervise)
}
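
The placeholders `{{WORKER_URL}}` and `{{USER_JAR}}` were put into the command by the Client's onStart and are only resolved here, on the worker. A small sketch of the substitution applied to an argument list; `substituter` is a hypothetical helper, and the URL and paths in the usage note are made-up values.

```scala
object SubstituteSketch {
  // Builds a substitution function for the two placeholders the Client puts
  // into the driver command. Only arguments that are exactly a placeholder
  // are replaced; everything else passes through unchanged.
  def substituter(workerUrl: String, localJar: String): String => String = {
    case "{{WORKER_URL}}" => workerUrl
    case "{{USER_JAR}}" => localJar
    case other => other
  }
}
```

Mapping `Seq("{{WORKER_URL}}", "{{USER_JAR}}", "com.example.Main")` through `substituter("spark://Worker@host:7078", "/work/driver-1/app.jar")` replaces the first two entries and leaves the main class name as it is.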

The code above mainly prepares the driver's runtime environment and builds the command that launches it, then calls runDriver. Here is that method:

private def runDriver(builder: ProcessBuilder, baseDir: File, supervise: Boolean): Int = {
  // Set the driver's working directory
  builder.directory(baseDir)
  def initialize(process: Process): Unit = {
    // Redirect stdout and stderr to files
    // Create the stdout file and redirect the process's output stream to it
    val stdout = new File(baseDir, "stdout")
    CommandUtils.redirectStream(process.getInputStream, stdout)
    // Create the stderr file that will hold the error log
    val stderr = new File(baseDir, "stderr")
    // Format the launch command, quoting each token
    val formattedCommand = builder.command.asScala.mkString("\"", "\" \"", "\"")
    val header = "Launch Command: %s\n%s\n\n".format(formattedCommand, "=" * 40)
    // Write the header to the stderr file, then redirect the process's error stream to it
    Files.append(header, stderr, StandardCharsets.UTF_8)
    CommandUtils.redirectStream(process.getErrorStream, stderr)
  }
  runCommandWithRetry(ProcessBuilderLike(builder), initialize, supervise)
}
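
The header written at the top of the stderr file is plain string formatting, and it is what you see first when you open a driver's stderr log. A self-contained sketch of the same formatting; `header` is a hypothetical helper name.

```scala
object LaunchHeaderSketch {
  // Quotes every token of the command and joins them with spaces, then adds
  // a 40-character separator line, matching the format strings in runDriver.
  def header(command: Seq[String]): String = {
    val formatted = command.mkString("\"", "\" \"", "\"")
    "Launch Command: %s\n%s\n\n".format(formatted, "=" * 40)
  }
}
```

Quoting each token makes arguments that contain spaces unambiguous in the log.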

The code above mainly sets up the files that hold the driver's logs, then calls runCommandWithRetry:

private[worker] def runCommandWithRetry(
    command: ProcessBuilderLike, initialize: Process => Unit, supervise: Boolean): Int = {
  // Start with an exit code of -1
  var exitCode = -1
  // Time to wait between submission retries.
  var waitSeconds = 1
  // A run of this many seconds resets the exponential back-off.
  val successfulRunDuration = 5
  // Keep trying unless the driver has already been killed
  var keepTrying = !killed

  while (keepTrying) {
    logInfo("Launch Command: " + command.command.mkString("\"", "\" \"", "\""))

    synchronized {
      // If the driver was killed in the meantime, return the current exit code
      if (killed) { return exitCode }
      // This is where the command is actually started, launching the driver process
      process = Some(command.start())
      initialize(process.get)
    }

    val processStart = clock.getTimeMillis()
    exitCode = process.get.waitFor()

    // check if attempting another run
    keepTrying = supervise && exitCode != 0 && !killed
    if (keepTrying) {
      if (clock.getTimeMillis() - processStart > successfulRunDuration * 1000L) {
        waitSeconds = 1
      }
      logInfo(s"Command exited with status $exitCode, re-launching after $waitSeconds s.")
      sleeper.sleep(waitSeconds)
      waitSeconds = waitSeconds * 2 // exponential back-off
    }
  }
  // Return the exit code
  exitCode
}
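
The retry timing can be simulated without starting any process. This sketch assumes `supervise = true` and no kill request, and takes each run as a hypothetical `(exitCode, runSeconds)` pair; it only reproduces the arithmetic of the back-off, not the process handling.

```scala
object BackoffSketch {
  // Each run is (exitCode, runSeconds). Returns the waits, in seconds, that
  // would be slept between runs under supervise = true and no kill request.
  def waits(runs: Seq[(Int, Int)], successfulRunDuration: Int = 5): Seq[Int] = {
    val out = scala.collection.mutable.ArrayBuffer.empty[Int]
    var waitSeconds = 1
    val it = runs.iterator
    var keepTrying = true
    while (keepTrying && it.hasNext) {
      val (exitCode, runSeconds) = it.next()
      keepTrying = exitCode != 0
      if (keepTrying) {
        // A run longer than the threshold resets the back-off to 1 second.
        if (runSeconds > successfulRunDuration) waitSeconds = 1
        out += waitSeconds
        waitSeconds *= 2 // exponential back-off
      }
    }
    out.toList
  }
}
```

Three quick failures wait 1, 2, and 4 seconds; a failure after a run longer than 5 seconds starts again from 1.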

Back in start, the exit code determines whether the driver ends up FINISHED, KILLED, or FAILED. That state is sent to the Worker; here is how the Worker handles the message:

case driverStateChanged @ DriverStateChanged(driverId, state, exception) =>
  handleDriverStateChanged(driverStateChanged)

This calls handleDriverStateChanged:

private[worker] def handleDriverStateChanged(driverStateChanged: DriverStateChanged): Unit = {
  // Get the driver's ID
  val driverId = driverStateChanged.driverId
  val exception = driverStateChanged.exception
  // Get the driver's state
  val state = driverStateChanged.state
  // Log a message appropriate to the state
  state match {
    case DriverState.ERROR =>
      logWarning(s"Driver $driverId failed with unrecoverable exception: ${exception.get}")
    case DriverState.FAILED =>
      logWarning(s"Driver $driverId exited with failure")
    case DriverState.FINISHED =>
      logInfo(s"Driver $driverId exited successfully")
    case DriverState.KILLED =>
      logInfo(s"Driver $driverId was killed by user")
    case _ =>
      logDebug(s"Driver $driverId changed state to $state")
  }
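  // Forward the driver's state change to the Master
  sendToMaster(driverStateChanged)
  // Remove the driver from the map of running drivers
  val driver = drivers.remove(driverId).get
  // Record it among the finished drivers
  finishedDrivers(driverId) = driver
  // Trim the oldest finished drivers if too many are retained
  trimFinishedDriversIfNecessary()
  // Release the memory and cores the driver was using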
  memoryUsed -= driver.driverDesc.mem
  coresUsed -= driver.driverDesc.cores
}
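
The worker's resource accounting is symmetric: what LaunchDriver adds, the state change subtracts. A toy model of that bookkeeping, with hypothetical `ToyWorker` and `DriverDesc` types in place of the real Worker and DriverRunner:

```scala
object WorkerBookkeepingSketch {
  final case class DriverDesc(mem: Int, cores: Int)

  final class ToyWorker {
    val drivers = scala.collection.mutable.HashMap.empty[String, DriverDesc]
    val finishedDrivers = scala.collection.mutable.LinkedHashMap.empty[String, DriverDesc]
    var memoryUsed = 0
    var coresUsed = 0

    // Corresponds to the LaunchDriver case: register and reserve resources.
    def launch(id: String, desc: DriverDesc): Unit = {
      drivers(id) = desc
      memoryUsed += desc.mem
      coresUsed += desc.cores
    }

    // Corresponds to handleDriverStateChanged: move to finished and release.
    def stateChanged(id: String): Unit = {
      // remove returns an Option; like the original, this assumes the driver is known.
      val desc = drivers.remove(id).get
      finishedDrivers(id) = desc
      memoryUsed -= desc.mem
      coresUsed -= desc.cores
    }
  }
}
```

Note that `.get` on the removed entry throws if the ID is unknown, so a state change for a driver the worker never launched would fail in this sketch.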

The key step is that the Worker sends the driver's state change to the Master, which then handles it. In Master:

case DriverStateChanged(driverId, state, exception) =>
  state match {
    case DriverState.ERROR | DriverState.FINISHED | DriverState.KILLED | DriverState.FAILED =>
      // All four of these states lead to removeDriver
      removeDriver(driverId, state, exception)
    case _ =>
      throw new Exception(s"Received unexpected state update for driver $driverId: $state")
  }

On receiving the message, the Master calls removeDriver to remove the driver:

private def removeDriver(
    driverId: String,
    finalState: DriverState,
    exception: Option[Exception]) {
  drivers.find(d => d.id == driverId) match {
    case Some(driver) =>
      logInfo(s"Removing driver: $driverId")
      // Remove the driver from the set of active drivers
      drivers -= driver
      if (completedDrivers.size >= RETAINED_DRIVERS) {
        val toRemove = math.max(RETAINED_DRIVERS / 10, 1)
        completedDrivers.trimStart(toRemove)
      }
      // Add the driver to the completedDrivers buffer
      completedDrivers += driver
      // Remove the driver from the persistence engine as well
      persistenceEngine.removeDriver(driver)
      // Set the driver's state to its final state
      driver.state = finalState
      driver.exception = exception
      // Remove the driver from its worker
      driver.worker.foreach(w => w.removeDriver(driver))
      // Finally, run scheduling again
      schedule()
    case None =>
      logWarning(s"Asked to remove unknown driver: $driverId")
  }
}
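
The trimming of `completedDrivers` drops the oldest tenth of the buffer (at least one entry) once the retention limit is reached. A sketch of that policy on a plain `ArrayBuffer`; `addCompleted` is a hypothetical helper, and `remove(0, n)` is used here in place of the `trimStart` call in the Spark code.

```scala
import scala.collection.mutable.ArrayBuffer

object RetainSketch {
  // Appends `item`, first dropping the oldest max(retained / 10, 1) entries
  // if the buffer has already reached `retained` elements.
  def addCompleted[A](completed: ArrayBuffer[A], item: A, retained: Int): Unit = {
    if (completed.size >= retained) {
      val toRemove = math.min(math.max(retained / 10, 1), completed.size)
      completed.remove(0, toRemove)
    }
    completed += item
  }
}
```

Dropping a batch rather than a single entry means the trim does not run on every removal once the buffer is full.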

That is the whole process of launching a Driver and removing it once it finishes, that is, its full lifecycle.
