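Spark version: 2.0.0
1. Introduction
From the previous article on spark-submit, we know that in client mode what ultimately runs is the main method of the class given by --class, and that this method is the entry point of the job. Below, a simple example shows how a job is executed.
2. Job execution process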
val sparkConf = new SparkConf().setAppName("earthquake").setMaster("local[2]")
val sc = new SparkContext(sparkConf)
val lines = sc.textFile("home/haha/helloSpark.txt")
lines.flatMap(_.split(" ")).count()
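This is a very common example. flatMap is only a transformation and does not trigger job execution; what actually triggers it is the count action, which calls SparkContext.runJob (through a chain of overloads that ends in the version shown below).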
RDD.scala
--------------
def count(): Long = sc.runJob(this, Utils.getIteratorSize _).sum
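To make the "apply a function to every partition, then sum" idea concrete, here is a minimal sketch in plain Scala, without Spark. The names CountSketch, runJob, iteratorSize and countWords are hypothetical stand-ins, and partitions are modeled as ordinary in-memory sequences, which is an assumption made only for illustration.

```scala
object CountSketch {
  // Hypothetical stand-in for SparkContext.runJob: apply `func` to every partition.
  def runJob[T, U](partitions: Seq[Seq[T]], func: Iterator[T] => U): Seq[U] =
    partitions.map(p => func(p.iterator))

  // Same idea as Utils.getIteratorSize: consume the iterator and count its elements.
  def iteratorSize[T](it: Iterator[T]): Long = {
    var n = 0L
    while (it.hasNext) { it.next(); n += 1 }
    n
  }

  // flatMap on an Iterator is lazy; the words are only produced while counting.
  def countWords(partitions: Seq[Seq[String]]): Long =
    runJob(partitions,
      (it: Iterator[String]) => iteratorSize(it.flatMap(_.split(" ").iterator))).sum
}
```

Each partition yields one partial count, and the driver-side sum combines them, which mirrors the shape of count() above.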
SparkContext.scala
-----------------
def runJob[T, U: ClassTag](
    rdd: RDD[T],
    func: (TaskContext, Iterator[T]) => U,
    partitions: Seq[Int],
    resultHandler: (Int, U) => Unit): Unit = {
  if (stopped.get()) {
    throw new IllegalStateException("SparkContext has been shutdown")
  }
  // Walk the call stack to find the user code that called into Spark,
  // and the Spark method it called
  val callSite = getCallSite
  // clean delegates to ClosureCleaner.clean, which strips references the closure does not need
  // (they may not be serializable) so that the closure can be serialized and sent over the network
  val cleanedFunc = clean(func)
  logInfo("Starting job: " + callSite.shortForm)
  if (conf.getBoolean("spark.logLineage", false)) {
    logInfo("RDD's recursive dependencies:\n" + rdd.toDebugString)
  }
  // Hand the job over to the DAGScheduler
  dagScheduler.runJob(rdd, cleanedFunc, partitions, callSite, resultHandler, localProperties.get)
  // Mark the console progress bar as finished
  progressBar.foreach(_.finishAll())
  rdd.doCheckpoint()
}
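While running a job, this method resolves the call site, updates the progress display, performs checkpointing and so on, but the core is dagScheduler.runJob, which runs the job through the DAGScheduler. Here is its implementation: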
def runJob[T, U](
    rdd: RDD[T],
    func: (TaskContext, Iterator[T]) => U,
    partitions: Seq[Int],
    callSite: CallSite,
    resultHandler: (Int, U) => Unit,
    properties: Properties): Unit = {
  val start = System.nanoTime
  // Submit the job; the returned JobWaiter is used to wait for its completion
  val waiter = submitJob(rdd, func, partitions, callSite, resultHandler, properties)
  val awaitPermission = null.asInstanceOf[scala.concurrent.CanAwait]
  // Block until the job completes, whether it succeeds or fails
  waiter.completionFuture.ready(Duration.Inf)(awaitPermission)
  // Inspect the outcome
  waiter.completionFuture.value.get match {
    case scala.util.Success(_) =>
      logInfo("Job %d finished: %s, took %f s".format
        (waiter.jobId, callSite.shortForm, (System.nanoTime - start) / 1e9))
    case scala.util.Failure(exception) =>
      logInfo("Job %d failed: %s, took %f s".format
        (waiter.jobId, callSite.shortForm, (System.nanoTime - start) / 1e9))
      val callerStackTrace = Thread.currentThread().getStackTrace.tail
      exception.setStackTrace(exception.getStackTrace ++ callerStackTrace)
      throw exception
  }
}
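The submit-then-block pattern in this method can be modeled with a Promise. The following is only a simplified sketch: Waiter and runAndWait are hypothetical names, and the real JobWaiter also carries the job id and the result handler.

```scala
import scala.concurrent.{Await, Future, Promise}
import scala.concurrent.duration.Duration
import scala.util.Try

object JobWaiterSketch {
  // Hypothetical, simplified waiter: succeeds once every task has reported success,
  // and fails as soon as a failure is reported.
  final class Waiter(totalTasks: Int) {
    private val promise = Promise[Unit]()
    private var finished = 0
    if (totalTasks == 0) promise.success(())

    def taskSucceeded(): Unit = synchronized {
      finished += 1
      if (finished == totalTasks) { promise.trySuccess(()); () }
    }
    def jobFailed(e: Exception): Unit = { promise.tryFailure(e); () }
    def completionFuture: Future[Unit] = promise.future
  }

  // Submits work through `submit`, then blocks the calling thread until the job is done.
  def runAndWait(totalTasks: Int)(submit: Waiter => Unit): Try[Unit] = {
    val waiter = new Waiter(totalTasks)
    submit(waiter)
    Try(Await.result(waiter.completionFuture, Duration.Inf))
  }
}
```

The caller sees a plain blocking call, while the work itself may finish on any other thread that holds a reference to the waiter.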
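After submitting the job, runJob simply waits for it to finish and then handles the result, so from the caller's point of view job execution is synchronous. The method to focus on is therefore submitJob.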
def submitJob[T, U](
    rdd: RDD[T],
    func: (TaskContext, Iterator[T]) => U,
    partitions: Seq[Int],
    callSite: CallSite,
    resultHandler: (Int, U) => Unit,
    properties: Properties): JobWaiter[U] = {
  // Check to make sure we are not launching a task on a partition that does not exist.
  val maxPartitions = rdd.partitions.length
  // Reject any requested partition index that is out of range
  partitions.find(p => p >= maxPartitions || p < 0).foreach { p =>
    throw new IllegalArgumentException(
      "Attempting to access a non-existent partition: " + p + ". " +
        "Total number of partitions: " + maxPartitions)
  }
  // Generate the job id (auto-incrementing)
  val jobId = nextJobId.getAndIncrement()
  // With no partitions to compute there are no tasks, so return a JobWaiter right away
  if (partitions.size == 0) {
    // Return immediately if the job is running 0 tasks
    return new JobWaiter[U](this, jobId, 0, resultHandler)
  }
  assert(partitions.size > 0)
  val func2 = func.asInstanceOf[(TaskContext, Iterator[_]) => _]
  // The waiter is the object through which completion is reported back
  val waiter = new JobWaiter(this, jobId, partitions.size, resultHandler)
  // Post the job to the event queue, where it waits to be processed
  eventProcessLoop.post(JobSubmitted(
    jobId, rdd, func2, partitions.toArray, callSite, waiter,
    SerializationUtils.clone(properties)))
  waiter
}
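The two checks at the top of submitJob (partition range validation, then id allocation) can be isolated in a small sketch. SubmitChecksSketch and its submit method are hypothetical names, and returning an Either instead of throwing is a choice made here only to keep the example easy to test.

```scala
import java.util.concurrent.atomic.AtomicInteger

object SubmitChecksSketch {
  private val nextJobId = new AtomicInteger(0)

  // Left(message) for an out-of-range partition index, otherwise Right(new job id).
  // As in the listing above, an id is only consumed after validation has passed.
  def submit(requested: Seq[Int], totalPartitions: Int): Either[String, Int] =
    requested.find(p => p >= totalPartitions || p < 0) match {
      case Some(p) =>
        Left(s"Attempting to access a non-existent partition: $p. " +
          s"Total number of partitions: $totalPartitions")
      case None =>
        Right(nextJobId.getAndIncrement())
    }
}
```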
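submitJob mainly performs validation and assigns a unique job id, then wraps the job in a JobSubmitted event and posts it to the event queue of eventProcessLoop (a DAGSchedulerEventProcessLoop). DAGSchedulerEventProcessLoop does not define post itself; the method is inherited from its parent class EventLoop.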
EventLoop.scala
-----------------
/**
 * Put an event into the event queue
 */
def post(event: E): Unit = {
  eventQueue.put(event)
}
/**
 * Take events from the queue and handle them as they arrive
 */
private val eventThread = new Thread(name) {
  setDaemon(true)
  override def run(): Unit = {
    try {
      while (!stopped.get) {
        val event = eventQueue.take()
        try {
          onReceive(event)
        } catch {
          case NonFatal(e) =>
            try {
              onError(e)
            } catch {
              case NonFatal(e) => logError("Unexpected error in " + name, e)
            }
        }
      }
    } catch {
      case ie: InterruptedException => // exit even if eventQueue is not empty
      case NonFatal(e) => logError("Unexpected error in " + name, e)
    }
  }
}
The eventThread field above is a daemon thread: whenever an event is put into eventQueue, this background thread takes it and passes it to onReceive.
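The queue-plus-daemon-thread pattern can be tried in isolation with the following sketch. MiniEventLoop and handleAll are hypothetical names; error handling is reduced to printing to stderr, which is an assumption and not what Spark does.

```scala
import java.util.concurrent.{CountDownLatch, LinkedBlockingQueue, TimeUnit}
import java.util.concurrent.atomic.AtomicBoolean
import scala.collection.mutable.ArrayBuffer
import scala.util.control.NonFatal

object EventLoopSketch {
  // Hypothetical miniature event loop: post() enqueues, one daemon thread dequeues.
  final class MiniEventLoop[E](name: String)(onReceive: E => Unit) {
    private val queue = new LinkedBlockingQueue[E]()
    private val stopped = new AtomicBoolean(false)
    private val thread = new Thread(name) {
      setDaemon(true)
      override def run(): Unit =
        try {
          while (!stopped.get) {
            val event = queue.take() // blocks until an event is available
            try onReceive(event)
            catch { case NonFatal(e) => System.err.println(s"error in $name: $e") }
          }
        } catch { case _: InterruptedException => () } // interrupted by stop()
    }
    def start(): Unit = thread.start()
    def post(event: E): Unit = queue.put(event)
    def stop(): Unit = if (stopped.compareAndSet(false, true)) thread.interrupt()
  }

  // Posts the events and returns them in the order the loop thread handled them.
  def handleAll(events: Seq[String]): Seq[String] = {
    val seen = ArrayBuffer.empty[String]
    val done = new CountDownLatch(events.size)
    val loop = new MiniEventLoop[String]("mini-loop")({ e =>
      seen.synchronized { seen += e }
      done.countDown()
    })
    loop.start()
    events.foreach(loop.post)
    done.await(5, TimeUnit.SECONDS)
    loop.stop()
    seen.synchronized { seen.toList }
  }
}
```

Because a single thread drains a FIFO queue, events are handled one at a time in posting order, so the handler itself needs no locking against other events.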
// Receive and handle an event
override def onReceive(event: DAGSchedulerEvent): Unit = {
  val timerContext = timer.time()
  try {
    doOnReceive(event)
  } finally {
    timerContext.stop()
  }
}
/**
 * The actual event-handling logic
 * @param event
 */
private def doOnReceive(event: DAGSchedulerEvent): Unit = event match {
  case JobSubmitted(jobId, rdd, func, partitions, callSite, listener, properties) =>
    dagScheduler.handleJobSubmitted(jobId, rdd, func, partitions, callSite, listener, properties)
Event handling ends up in dagScheduler.handleJobSubmitted, which creates the final stage, notifies listeners that the job has started, and then submits the stages.
/**
 * Handle a job that has been submitted
 */
private[scheduler] def handleJobSubmitted(jobId: Int,
    finalRDD: RDD[_],
    func: (TaskContext, Iterator[_]) => _,
    partitions: Array[Int],
    callSite: CallSite,
    listener: JobListener,
    properties: Properties) {
  var finalStage: ResultStage = null
  try {
    // First create the ResultStage (the last stage of the job)
    finalStage = newResultStage(finalRDD, func, partitions, jobId, callSite)
  } catch {
    case e: Exception =>
      logWarning("Creating new stage failed due to exception - job: " + jobId, e)
      listener.jobFailed(e)
      return
  }
  // The ResultStage was created, so the job can run; represent it as an ActiveJob
  val job = new ActiveJob(jobId, finalStage, callSite, listener, properties)
  // Clear the cached RDD partition locations
  clearCacheLocs()
  logInfo("Got job %s (%s) with %d output partitions".format(
    job.jobId, callSite.shortForm, partitions.length))
  logInfo("Final stage: " + finalStage + " (" + finalStage.name + ")")
  logInfo("Parents of final stage: " + finalStage.parents)
  logInfo("Missing parents: " + getMissingParentStages(finalStage))
  val jobSubmissionTime = clock.getTimeMillis()
  // jobId -> ActiveJob
  jobIdToActiveJob(jobId) = job
  activeJobs += job
  finalStage.setActiveJob(job)
  val stageIds = jobIdToStageIds(jobId).toArray
  val stageInfos = stageIds.flatMap(id => stageIdToStage.get(id).map(_.latestInfo))
  // Notify listeners that the job has started (used for progress reporting, logging, etc.)
  listenerBus.post(
    SparkListenerJobStart(job.jobId, jobSubmissionTime, stageInfos, properties))
  // Submit the stage (this is what triggers task execution)
  submitStage(finalStage)
  submitWaitingStages()
}
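When a stage is submitted, any missing parent stages are submitted first; only a stage with no missing parents has its tasks submitted (one stage corresponds to multiple tasks).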
private def submitStage(stage: Stage) {
  // Look up the job id that this stage belongs to
  val jobId = activeJobForStage(stage)
  if (jobId.isDefined) {
    logDebug("submitStage(" + stage + ")")
    // Skip stages that are already waiting, running or failed
    if (!waitingStages(stage) && !runningStages(stage) && !failedStages(stage)) {
      // Find the parent stages whose output is still missing
      val missing = getMissingParentStages(stage).sortBy(_.id)
      logDebug("missing: " + missing)
      // If no parent is missing, submit this stage's tasks directly
      if (missing.isEmpty) {
        logInfo("Submitting " + stage + " (" + stage.rdd + "), which has no missing parents")
        // Submit the set of tasks
        submitMissingTasks(stage, jobId.get)
      } else {
        // Recursively submit every missing parent stage first
        for (parent <- missing) {
          submitStage(parent)
        }
        // Then park this stage in the waiting set
        waitingStages += stage
      }
    }
  } else {
    abortStage(stage, "No active job for stage " + stage.id, None)
  }
}
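The recursion above is easier to follow on a toy stage graph. In the sketch below, Stage, firstPass and the done set are hypothetical; "missing" is simplified to "a direct parent whose id is not in done", whereas the real getMissingParentStages walks RDD dependencies.

```scala
import scala.collection.mutable

object SubmitStageSketch {
  // Hypothetical stage: an id plus its direct parent stages.
  final case class Stage(id: Int, parents: List[Stage])

  // Returns (ids whose tasks are submitted on the first pass, ids left waiting).
  def firstPass(finalStage: Stage, done: Set[Int]): (List[Int], Set[Int]) = {
    val submitted = mutable.ListBuffer.empty[Int]
    val waiting = mutable.LinkedHashSet.empty[Int]

    def submitStage(stage: Stage): Unit = {
      if (!waiting(stage.id) && !submitted.contains(stage.id)) {
        val missing = stage.parents.filterNot(p => done(p.id)).sortBy(_.id)
        if (missing.isEmpty) {
          submitted += stage.id        // no missing parents: run this stage's tasks
        } else {
          missing.foreach(submitStage) // parents first
          waiting += stage.id          // then wait for them
        }
      }
    }

    submitStage(finalStage)
    (submitted.toList, waiting.toSet)
  }
}
```

For a chain 0 <- 1 <- 2 with nothing computed yet, only stage 0 is submitted and stages 1 and 2 wait; once 0 and 1 are done, submitting stage 2 runs it directly.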
submitMissingTasks mainly turns a ShuffleMapStage into a set of ShuffleMapTasks and a ResultStage into a set of ResultTasks, and then hands them to the taskScheduler for execution.
DAGScheduler.scala
-----------------------
.......
/**
 * stage -> set of tasks
 */
val tasks: Seq[Task[_]] = try {
  stage match {
    // ShuffleMapStage
    case stage: ShuffleMapStage =>
      partitionsToCompute.map { id =>
        val locs = taskIdToLocations(id)
        val part = stage.rdd.partitions(id)
        new ShuffleMapTask(stage.id, stage.latestInfo.attemptId,
          taskBinary, part, locs, stage.latestInfo.taskMetrics, properties)
      }
    // ResultStage
    case stage: ResultStage =>
      val job = stage.activeJob.get
      partitionsToCompute.map { id =>
        val p: Int = stage.partitions(id)
        val part = stage.rdd.partitions(p)
        val locs = taskIdToLocations(id)
        new ResultTask(stage.id, stage.latestInfo.attemptId,
          taskBinary, part, locs, id, properties, stage.latestInfo.taskMetrics)
      }
  }
} catch {
  case NonFatal(e) =>
    abortStage(stage, s"Task creation failed: $e\n${Utils.exceptionString(e)}", Some(e))
    runningStages -= stage
    return
}
if (tasks.size > 0) {
  logInfo("Submitting " + tasks.size + " missing tasks from " + stage + " (" + stage.rdd + ")")
  stage.pendingPartitions ++= tasks.map(_.partitionId)
  logDebug("New pending partitions: " + stage.pendingPartitions)
  // Wrap the tasks in a TaskSet and hand it to the taskScheduler
  taskScheduler.submitTasks(new TaskSet(
    tasks.toArray, stage.id, stage.latestInfo.attemptId, jobId, properties))
  stage.latestInfo.submissionTime = Some(clock.getTimeMillis())
} else {
  // Nothing to compute: mark the stage as finished
  markStageAsFinished(stage, None)
  val debugString = stage match {
    case stage: ShuffleMapStage =>
      s"Stage ${stage} is actually done; " +
        s"(available: ${stage.isAvailable}," +
        s"available outputs: ${stage.numAvailableOutputs}," +
        s"partitions: ${stage.numPartitions})"
    case stage : ResultStage =>
      s"Stage ${stage} is actually done; (partitions: ${stage.numPartitions})"
  }
  logDebug(debugString)
}
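The mapping from a stage to its tasks comes down to "one task per partition that still has to be computed". The sketch below uses hypothetical, stripped-down Stage and Task types (no task binary, locality or metrics) just to show the shape of that pattern match.

```scala
object StageToTasksSketch {
  sealed trait Stage { def id: Int }
  final case class ShuffleMapStage(id: Int, numPartitions: Int) extends Stage
  // A result stage may cover only a subset of the RDD's partitions.
  final case class ResultStage(id: Int, partitions: Seq[Int]) extends Stage

  sealed trait Task { def stageId: Int; def partitionId: Int }
  final case class ShuffleMapTask(stageId: Int, partitionId: Int) extends Task
  final case class ResultTask(stageId: Int, partitionId: Int, outputId: Int) extends Task

  // One task per entry in partitionsToCompute.
  def tasksFor(stage: Stage, partitionsToCompute: Seq[Int]): Seq[Task] = stage match {
    case s: ShuffleMapStage =>
      partitionsToCompute.map(id => ShuffleMapTask(s.id, id))
    case s: ResultStage =>
      // For a result stage, `id` indexes into the job's partition list.
      partitionsToCompute.map(id => ResultTask(s.id, s.partitions(id), id))
  }
}
```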
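To summarize so far: once a job is submitted, the DAGScheduler splits it into stages, converts each stage into a TaskSet, and hands the TaskSet to the taskScheduler for execution.
Now look at the implementation of submitTasks; note that the concrete class being called is TaskSchedulerImpl.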
override def submitTasks(taskSet: TaskSet) {
  val tasks = taskSet.tasks
  logInfo("Adding task set " + taskSet.id + " with " + tasks.length + " tasks")
  this.synchronized {
    // maxTaskFailures: the maximum number of failures allowed per task
    val manager = createTaskSetManager(taskSet, maxTaskFailures)
    val stage = taskSet.stageId
    val stageTaskSets =
      taskSetsByStageIdAndAttempt.getOrElseUpdate(stage, new HashMap[Int, TaskSetManager])
    stageTaskSets(taskSet.stageAttemptId) = manager
    // Check whether the same stage has more than one active TaskSet
    val conflictingTaskSet = stageTaskSets.exists { case (_, ts) =>
      ts.taskSet != taskSet && !ts.isZombie
    }
    if (conflictingTaskSet) {
      throw new IllegalStateException(s"more than one active taskSet for stage $stage:" +
        s" ${stageTaskSets.toSeq.map{_._2.taskSet.id}.mkString(",")}")
    }
    // Register the TaskSetManager with the scheduling pool
    schedulableBuilder.addTaskSetManager(manager, manager.taskSet.properties)
    // Not local mode and no task set received before: start a timer that keeps
    // warning until at least one task has been launched
    if (!isLocal && !hasReceivedTask) {
      starvationTimer.scheduleAtFixedRate(new TimerTask() {
        override def run() {
          if (!hasLaunchedTask) {
            logWarning("Initial job has not accepted any resources; " +
              "check your cluster UI to ensure that workers are registered " +
              "and have sufficient resources")
          } else {
            this.cancel()
          }
        }
      }, STARVATION_TIMEOUT_MS, STARVATION_TIMEOUT_MS)
    }
    hasReceivedTask = true
  }
  // For how backend is created, see org.apache.spark.SparkContext.createTaskScheduler
  backend.reviveOffers()
}
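The "at most one active TaskSet per stage" rule can be reproduced with a nested map. Everything in this sketch (TaskSetRegistrySketch, TaskSet, Manager, register) is a hypothetical simplification; in particular, a manager here is just a holder for the isZombie flag.

```scala
import scala.collection.mutable

object TaskSetRegistrySketch {
  final case class TaskSet(stageId: Int, stageAttemptId: Int)
  final class Manager(val taskSet: TaskSet) { @volatile var isZombie = false }

  // stageId -> (stageAttemptId -> manager)
  private val byStageAndAttempt =
    mutable.HashMap.empty[Int, mutable.HashMap[Int, Manager]]

  // Registers a manager for the task set and fails if another non-zombie
  // task set is already registered for the same stage.
  def register(taskSet: TaskSet): Manager = synchronized {
    val manager = new Manager(taskSet)
    val attempts = byStageAndAttempt.getOrElseUpdate(
      taskSet.stageId, mutable.HashMap.empty[Int, Manager])
    attempts(taskSet.stageAttemptId) = manager
    val conflict = attempts.exists { case (_, m) => m.taskSet != taskSet && !m.isZombie }
    if (conflict) {
      throw new IllegalStateException(
        s"more than one active taskSet for stage ${taskSet.stageId}")
    }
    manager
  }
}
```

A new attempt for a stage is therefore only accepted after the previous attempt's manager has been marked as a zombie.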
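The last line, backend.reviveOffers(), triggers task submission and execution. Which backend object this is depends on the master passed in by the user; here we use standalone mode as the example, where the backend is a StandaloneSchedulerBackend. Its reviveOffers is inherited from CoarseGrainedSchedulerBackend, so let's look there.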
CoarseGrainedSchedulerBackend.scala
------------------------------
// Sends a ReviveOffers message to the driver endpoint
override def reviveOffers() {
  driverEndpoint.send(ReviveOffers)
}
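The request is handled by driverEndpoint, so first look at how the driver endpoint is defined. CoarseGrainedSchedulerBackend contains an inner class DriverEndpoint that implements it. As the earlier article on the RPC mechanism explained, an endpoint's onStart method runs when it starts, and incoming messages are dispatched to receive, receiveAndReply and so on.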
CoarseGrainedSchedulerBackend.scala
-------------------
override def onStart() {
  // Periodically revive offers to allow delay scheduling to work
  val reviveIntervalMs = conf.getTimeAsMs("spark.scheduler.revive.interval", "1s")
  // Periodically send ReviveOffers to itself; when there is nothing to schedule,
  // the message has no effect
  reviveThread.scheduleAtFixedRate(new Runnable {
    override def run(): Unit = Utils.tryLogNonFatalError {
      Option(self).foreach(_.send(ReviveOffers))
    }
  }, 0, reviveIntervalMs, TimeUnit.MILLISECONDS)
}
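The periodic self-message in onStart relies on a scheduled executor. A minimal sketch of that mechanism follows; ReviveTimerSketch, awaitTicks and sendRevive are hypothetical names, and the latch exists only so the example can stop after a known number of ticks.

```scala
import java.util.concurrent.{CountDownLatch, Executors, TimeUnit}

object ReviveTimerSketch {
  // Calls `sendRevive` at a fixed rate and returns true once it has fired
  // `ticks` times (or false if that does not happen within 5 seconds).
  def awaitTicks(ticks: Int, intervalMs: Long)(sendRevive: () => Unit): Boolean = {
    val latch = new CountDownLatch(ticks)
    val reviveThread = Executors.newSingleThreadScheduledExecutor()
    try {
      reviveThread.scheduleAtFixedRate(new Runnable {
        override def run(): Unit = { sendRevive(); latch.countDown() }
      }, 0, intervalMs, TimeUnit.MILLISECONDS)
      latch.await(5, TimeUnit.SECONDS)
    } finally {
      reviveThread.shutdownNow()
    }
  }
}
```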
So the endpoint also sends ReviveOffers to itself. Here is how that message type is handled:
override def receive: PartialFunction[Any, Unit] = {
  .......
  case ReviveOffers =>
    makeOffers()
It simply calls its own makeOffers method.
/**
 * The driver handles the request
 *
 */
private def makeOffers() {
  // Filter out executors under killing
  // All executors that are still usable
  val activeExecutors = executorDataMap.filterKeys(executorIsAlive)
  val workOffers = activeExecutors.map { case (id, executorData) =>
    new WorkerOffer(id, executorData.executorHost, executorData.freeCores)
  }.toSeq
  // Let the scheduler assign tasks to the offered resources, then launch them
  launchTasks(scheduler.resourceOffers(workOffers))
}
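To illustrate what "offer resources, get back task assignments" means, here is a toy allocator. It is not the algorithm used by TaskSchedulerImpl.resourceOffers (which also considers locality and scheduling pools); the round-robin policy, the one-core-per-task rule and all names in OffersSketch are assumptions.

```scala
import scala.collection.mutable

object OffersSketch {
  final case class WorkerOffer(executorId: String, host: String, cores: Int)
  final case class TaskDescription(taskId: Long, executorId: String)

  // Hands pending task ids to the offers round-robin, one core per task,
  // until either the tasks or the free cores run out.
  def resourceOffers(offers: Seq[WorkerOffer], pending: Seq[Long]): Seq[TaskDescription] = {
    val free = mutable.Map.empty[String, Int]
    offers.foreach(o => free(o.executorId) = o.cores)
    val assigned = mutable.ListBuffer.empty[TaskDescription]
    var remaining = pending.toList
    var progressed = true
    while (remaining.nonEmpty && progressed) {
      progressed = false
      for (o <- offers if remaining.nonEmpty && free(o.executorId) > 0) {
        assigned += TaskDescription(remaining.head, o.executorId)
        remaining = remaining.tail
        free(o.executorId) -= 1
        progressed = true
      }
    }
    assigned.toList
  }
}
```

Tasks that do not fit stay pending; in the real scheduler they get another chance on the next ReviveOffers.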
/**
 * launchTasks checks the size of each serialized task and then sends a LaunchTask
 * message through executorEndpoint, asking the executor to run the task
 */
private def launchTasks(tasks: Seq[Seq[TaskDescription]]) {
  for (task <- tasks.flatten) {
    // Serialize the task
    val serializedTask = ser.serialize(task)
    // Does it exceed the maximum RPC message size?
    if (serializedTask.limit >= maxRpcMessageSize) {
      scheduler.taskIdToTaskSetManager.get(task.taskId).foreach { taskSetMgr =>
        try {
          var msg = "Serialized task %s:%d was %d bytes, which exceeds max allowed: " +
            "spark.rpc.message.maxSize (%d bytes). Consider increasing " +
            "spark.rpc.message.maxSize or using broadcast variables for large values."
          msg = msg.format(task.taskId, task.index, serializedTask.limit, maxRpcMessageSize)
          taskSetMgr.abort(msg)
        } catch {
          case e: Exception => logError("Exception in error callback", e)
        }
      }
    }
    else {
      val executorData = executorDataMap(task.executorId)
      executorData.freeCores -= scheduler.CPUS_PER_TASK
      logInfo(s"Launching task ${task.taskId} on executor id: ${task.executorId} hostname: " +
        s"${executorData.executorHost}.")
      // Tell the executor to run the task; the endpoint belongs to CoarseGrainedExecutorBackend
      executorData.executorEndpoint.send(LaunchTask(new SerializableBuffer(serializedTask)))
    }
  }
}
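The size guard in launchTasks compares the serialized buffer's limit with the maximum RPC message size. The sketch below isolates that check; LaunchSizeCheckSketch, Outcome, Sent and Aborted are hypothetical, and "serialization" is just UTF-8 encoding of a string, which is an assumption for the sake of a runnable example.

```scala
import java.nio.ByteBuffer
import java.nio.charset.StandardCharsets

object LaunchSizeCheckSketch {
  sealed trait Outcome
  final case class Sent(bytes: Int) extends Outcome
  final case class Aborted(reason: String) extends Outcome

  // Serializes the payload and refuses to send it when its size
  // reaches the maximum message size (note the >= comparison).
  def launch(payload: String, maxRpcMessageSize: Int): Outcome = {
    val serialized = ByteBuffer.wrap(payload.getBytes(StandardCharsets.UTF_8))
    if (serialized.limit() >= maxRpcMessageSize) {
      Aborted(s"Serialized task was ${serialized.limit()} bytes, " +
        s"which exceeds max allowed: $maxRpcMessageSize bytes")
    } else {
      Sent(serialized.limit())
    }
  }
}
```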
executorEndpoint refers to CoarseGrainedExecutorBackend, where the LaunchTask message is handled.
CoarseGrainedExecutorBackend.scala
-----------
override def receive: PartialFunction[Any, Unit] = {
  /**
   * This branch only deserializes the task description; the real work is done by executor.launchTask
   */
  case LaunchTask(data) =>
    if (executor == null) {
      exitExecutor(1, "Received LaunchTask command but executor was null")
    } else {
      // Deserialize the task description
      val taskDesc = ser.deserialize[TaskDescription](data.value)
      logInfo("Got assigned task " + taskDesc.taskId)
      executor.launchTask(this, taskId = taskDesc.taskId, attemptNumber = taskDesc.attemptNumber,
        taskDesc.name, taskDesc.serializedTask)
    }
Executor.scala
-----------------
/**
 * Wrap the task and put it into the thread pool; what actually runs is TaskRunner.run
 */
def launchTask(
    context: ExecutorBackend,
    taskId: Long,
    attemptNumber: Int,
    taskName: String,
    serializedTask: ByteBuffer): Unit = {
  val tr = new TaskRunner(context, taskId = taskId, attemptNumber = attemptNumber, taskName,
    serializedTask)
  // Record the task as running
  runningTasks.put(taskId, tr)
  // Execute it on the thread pool
  threadPool.execute(tr)
}
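The "register, run on a pool thread, always deregister" pattern can be sketched as follows. MiniExecutor and its methods are hypothetical; the task body is an arbitrary by-name block rather than a deserialized Spark task.

```scala
import java.util.concurrent.{ConcurrentHashMap, Executors, TimeUnit}

object ExecutorSketch {
  // Hypothetical miniature executor: tracks running tasks and runs each on a pool thread.
  final class MiniExecutor {
    private val runningTasks = new ConcurrentHashMap[Long, Runnable]()
    private val threadPool = Executors.newCachedThreadPool()

    def launchTask(taskId: Long)(body: => Unit): Unit = {
      val runner: Runnable = new Runnable {
        override def run(): Unit =
          try body
          finally runningTasks.remove(taskId) // deregister even if the body throws
      }
      runningTasks.put(taskId, runner) // record before handing off to the pool
      threadPool.execute(runner)
    }

    def running: Int = runningTasks.size

    // Stops accepting tasks and waits for the submitted ones to finish.
    def shutdownAndWait(): Boolean = {
      threadPool.shutdown()
      threadPool.awaitTermination(5, TimeUnit.SECONDS)
    }
  }
}
```

Putting the entry into the map before calling execute guarantees the removal in the finally block can never run ahead of the registration.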
TaskRunner is where the task finally runs. The code is long, so only the core parts are shown here:
override def run(): Unit = {
  // Memory manager for this task
  val taskMemoryManager = new TaskMemoryManager(env.memoryManager, taskId)
  .....
  // Tell the driver endpoint that the task is now RUNNING
  execBackend.statusUpdate(taskId, TaskState.RUNNING, EMPTY_BYTE_BUFFER)
  var taskStart: Long = 0
  startGCTime = computeTotalGcTime()
  try {
    // Deserialize the files, jars and properties the task needs, plus the task bytes
    val (taskFiles, taskJars, taskProps, taskBytes) =
      Task.deserializeWithDependencies(serializedTask)
    Executor.taskDeserializationProps.set(taskProps)
    // Update dependencies according to the file/jar metadata
    updateDependencies(taskFiles, taskJars)
    task = ser.deserialize[Task[Any]](taskBytes, Thread.currentThread.getContextClassLoader)
    task.localProperties = taskProps
    task.setTaskMemoryManager(taskMemoryManager)
    .....
    // Run the actual task and measure its runtime.
    taskStart = System.currentTimeMillis()
    var threwException = true
    val value = try {
      // Run the task, which is either a ShuffleMapTask or a ResultTask.
      // A ShuffleMapTask writes shuffle output for the tasks of the next stage to read,
      // while a ResultTask returns its result directly.
      val res = task.run(
        taskAttemptId = taskId,
        attemptNumber = attemptNumber,
        metricsSystem = env.metricsSystem)
      threwException = false
      res
    } finally {
      ........
    }
    val taskFinish = System.currentTimeMillis()
    .....
    // Serialize the result
    val directResult = new DirectTaskResult(valueBytes, accumUpdates)
    val serializedDirectResult = ser.serialize(directResult)
    val resultSize = serializedDirectResult.limit
    /**
     * Choose how to return the serialized result based on its size
     */
    val serializedResult: ByteBuffer = {
      if (maxResultSize > 0 && resultSize > maxResultSize) {
        // Larger than maxResultSize (1 GB by default): drop the result and send back only its metadata
        logWarning(s"Finished $taskName (TID $taskId). Result is larger than maxResultSize " +
          s"(${Utils.bytesToString(resultSize)} > ${Utils.bytesToString(maxResultSize)}), " +
          s"dropping it.")
        ser.serialize(new IndirectTaskResult[Any](TaskResultBlockId(taskId), resultSize))
      } else if (resultSize > maxDirectResultSize) {
        // Within maxResultSize but larger than maxDirectResultSize: store the result in the
        // BlockManager and send back only its metadata
        val blockId = TaskResultBlockId(taskId)
        env.blockManager.putBytes(
          blockId,
          new ChunkedByteBuffer(serializedDirectResult.duplicate()),
          StorageLevel.MEMORY_AND_DISK_SER)
        logInfo(
          s"Finished $taskName (TID $taskId). $resultSize bytes result sent via BlockManager)")
        ser.serialize(new IndirectTaskResult[Any](blockId, resultSize))
      } else {
        // Otherwise (no larger than maxDirectResultSize): send the result directly
        logInfo(s"Finished $taskName (TID $taskId). $resultSize bytes result sent to driver")
        serializedDirectResult
      }
    }
    // Tell the driver that the task has finished
    execBackend.statusUpdate(taskId, TaskState.FINISHED, serializedResult)
  } catch {
    .....
  } finally {
    // Remove the task from the running set
    runningTasks.remove(taskId)
  }
}
}
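The three-way decision on how a result travels back to the driver is the part of run() that is easiest to get wrong when reading, so here it is as a pure function. ResultRoutingSketch, Route and route are hypothetical names; the thresholds are plain parameters and the values used with it are illustrative, not Spark's configured defaults.

```scala
object ResultRoutingSketch {
  sealed trait Route
  case object Dropped extends Route         // only metadata goes back to the driver
  case object ViaBlockManager extends Route // stored on the executor, metadata goes back
  case object Direct extends Route          // the bytes travel inside the status update

  // Mirrors the order of the checks in TaskRunner.run; a non-positive
  // maxResultSize disables the first check.
  def route(resultSize: Long, maxResultSize: Long, maxDirectResultSize: Long): Route =
    if (maxResultSize > 0 && resultSize > maxResultSize) Dropped
    else if (resultSize > maxDirectResultSize) ViaBlockManager
    else Direct
}
```

Note that a result exactly equal to maxDirectResultSize is still sent directly, because both comparisons are strict.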