In this installment:
1. JobScheduler internals
2. Deeper thoughts on JobScheduler
JobScheduler is the scheduling core of Spark Streaming. Its role is comparable to the DAGScheduler at the center of Spark Core's scheduling, so it is very important!
Every Batch Duration, JobGenerator dynamically generates a JobSet and submits it to JobScheduler. How does JobScheduler process a JobSet once it receives one?
Generating the Jobs
/** Generate jobs and perform checkpoint for the given `time`. */
private def generateJobs(time: Time) {
  // Set the SparkEnv in this thread, so that job generation code can access the environment
  // Example: BlockRDDs are created in this thread, and it needs to access BlockManager
  // Update: This is probably redundant after threadlocal stuff in SparkEnv has been removed.
  SparkEnv.set(ssc.env)
  Try {
    jobScheduler.receiverTracker.allocateBlocksToBatch(time) // allocate received blocks to batch
    graph.generateJobs(time) // generate jobs using allocated block
  } match {
    case Success(jobs) =>
      val streamIdToInputInfos = jobScheduler.inputInfoTracker.getInfo(time)
      jobScheduler.submitJobSet(JobSet(time, jobs, streamIdToInputInfos))
    case Failure(e) =>
      jobScheduler.reportError("Error generating jobs for time " + time, e)
  }
  eventLoop.post(DoCheckpoint(time, clearCheckpointDataLater = false))
}
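The generate-then-submit flow above hinges on Scala's `Try { ... } match` pattern: the whole generation step is captured as a `Success` or `Failure`, and only a successful batch is submitted. The sketch below illustrates just that control flow; `generateJobsFor`, `submit`, and `generateAndSubmit` are hypothetical stand-ins, not Spark APIs.

```scala
import scala.util.{Failure, Success, Try}

// Hypothetical stand-in for graph.generateJobs(time): produce the batch's jobs.
def generateJobsFor(time: Long): Seq[String] =
  Seq(s"job-$time-0", s"job-$time-1")

// Hypothetical stand-in for jobScheduler.submitJobSet(...).
def submit(time: Long, jobs: Seq[String]): String =
  s"submitted ${jobs.size} jobs for time $time"

// Mirrors generateJobs: wrap generation in Try, submit on Success,
// report the error on Failure instead of crashing the generator loop.
def generateAndSubmit(time: Long): String =
  Try(generateJobsFor(time)) match {
    case Success(jobs) => submit(time, jobs)
    case Failure(e)    => s"Error generating jobs for time $time: ${e.getMessage}"
  }
```

Note that even a `Failure` is handled without throwing, so one bad batch does not stop subsequent batches from being generated.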
Handling the generated JobSet
def submitJobSet(jobSet: JobSet) {
  if (jobSet.jobs.isEmpty) {
    logInfo("No jobs added for time " + jobSet.time)
  } else {
    listenerBus.post(StreamingListenerBatchSubmitted(jobSet.toBatchInfo))
    jobSets.put(jobSet.time, jobSet)
    jobSet.jobs.foreach(job => jobExecutor.execute(new JobHandler(job)))
    logInfo("Added jobs for time " + jobSet.time)
  }
}
Here a new JobHandler is created for each job and handed to jobExecutor to run. The key processing logic is `job => jobExecutor.execute(new JobHandler(job))`: each job is wrapped in a `new JobHandler` and executed on the jobExecutor thread pool.
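The jobExecutor is a thread pool, so jobs within (and across) batches can run concurrently if the pool has more than one thread. The following is a minimal sketch of that pattern, not Spark's actual code; `JobHandlerSketch` and the latch are illustrative, though the pool-size setting `spark.streaming.concurrentJobs` (default 1) is real.

```scala
import java.util.concurrent.{CountDownLatch, Executors, TimeUnit}

// Pool size plays the role of spark.streaming.concurrentJobs (default 1).
val concurrentJobs = 1
val jobExecutor = Executors.newFixedThreadPool(concurrentJobs)

// Sketch of a JobHandler: a Runnable wrapping one job's execution.
class JobHandlerSketch(jobId: Int, done: CountDownLatch) extends Runnable {
  override def run(): Unit = {
    // In Spark this is where JobStarted is posted, job.run() is called,
    // and JobCompleted is posted afterwards.
    done.countDown()
  }
}

// Mirrors jobSet.jobs.foreach(job => jobExecutor.execute(new JobHandler(job))).
val latch = new CountDownLatch(3)
(0 until 3).foreach(i => jobExecutor.execute(new JobHandlerSketch(i, latch)))
latch.await(5, TimeUnit.SECONDS)
jobExecutor.shutdown()
```

With the default pool size of 1, the handlers run one at a time in submission order; raising `spark.streaming.concurrentJobs` allows multiple jobs to run in parallel.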
First, let's look at JobHandler's main processing logic for a Job (excerpted from JobHandler.run; the closing brace of the outer `if` is restored here for completeness):
var _eventLoop = eventLoop
if (_eventLoop != null) {
  _eventLoop.post(JobStarted(job, clock.getTimeMillis()))
  // Disable checks for existing output directories in jobs launched by the streaming
  // scheduler, since we may need to write output to an existing directory during checkpoint
  // recovery; see SPARK-4835 for more details.
  PairRDDFunctions.disableOutputSpecValidation.withValue(true) {
    job.run()
  }
  _eventLoop = eventLoop
  if (_eventLoop != null) {
    _eventLoop.post(JobCompleted(job, clock.getTimeMillis()))
  }
}
In other words, apart from some status bookkeeping (posting the JobStarted and JobCompleted events), JobHandler's main purpose is to call `job.run()`! This ties back to our earlier analysis of how DStreams generate RDD instances: `ForEachDStream.generateJob(time)` only defines the Job's processing logic, i.e. `Job.func`; it is here in JobHandler that `Job.run()` is actually invoked, which in turn triggers the real execution of `Job.func`:
def run() { _result = Try(func()) }
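This separation between defining `func` and executing it can be sketched as follows. `SketchJob`, `succeeded`, and `sideEffect` are illustrative names, not Spark's; only the `run() { _result = Try(func()) }` shape comes from the source above.

```scala
import scala.util.Try

// Sketch of the Job abstraction: constructing the Job only records func;
// nothing executes until run() is called (as JobHandler does).
class SketchJob(func: () => Unit) {
  private var _result: Try[Unit] = _
  def run(): Unit = { _result = Try(func()) }
  def succeeded: Boolean = _result != null && _result.isSuccess
}

var sideEffect = 0
val job = new SketchJob(() => sideEffect += 1) // defined, not yet executed
// At this point sideEffect is still 0: generateJob only built the closure.
job.run()                                      // now func() actually fires
```

Wrapping the call in `Try` also means a failing `func` is captured as a `Failure` result rather than escaping as an exception, so the scheduler can report it through its event loop.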
Reference blog: http://lqding.blog.51cto.com/9123978/1773391
Note: This material comes from DT_大數據夢工廠 (Spark release customization course).