Spark源碼閱讀——streaming模塊作業生成和提交

通常我們開發spark-streaming都會用到如下代碼：

val sparkConf = new SparkConf()
    .set("xxx", "")
    ...
val sc = new SparkContext(sparkConf)
val ssc = new StreamingContext(sc, Seconds(5))

//然後我們使用ssc來創建一個InputDStream
val stream = ssc.socketTextStream("localhost", 9090)
stream.map(...)
    .filter(...)
    .reduce(...)
    .foreachRDD(...)
ssc.start()
ssc.awaitTermination()

我們知道stream的一系列調用鏈實際上是構建dag的過程，我們深入foreachRDD代碼中，查看到：

  private def foreachRDD(
      foreachFunc: (RDD[T], Time) => Unit,
      displayInnerRDDOps: Boolean): Unit = {
    new ForEachDStream(this,
      context.sparkContext.clean(foreachFunc, false), displayInnerRDDOps).register()
  }

DStream::register

  private[streaming] def register(): DStream[T] = {
    ssc.graph.addOutputStream(this)
    this
  }

DStreamGraph::addOutputStream

  def addOutputStream(outputStream: DStream[_]) {
    this.synchronized {
      outputStream.setGraph(this)
      outputStreams += outputStream
    }
  }

這裏DStream將自己註冊到DStreamGraph裏。後續生成作業的時候會遍歷這個outputStreams。到這裏dag已經構建並註冊完畢，下面我們看StreamingContext啓動的之後(start)是如何每隔一段時間生成一個作業的。

  def start(): Unit = synchronized {
    state match {
      case INITIALIZED =>
        startSite.set(DStream.getCreationSite())
        StreamingContext.ACTIVATION_LOCK.synchronized {
          StreamingContext.assertNoOtherContextIsActive()
          try {
            validate()

            // Start the streaming scheduler in a new thread, so that thread local properties
            // like call sites and job groups can be reset without affecting those of the
            // current thread.
            ThreadUtils.runInNewThread("streaming-start") {
              sparkContext.setCallSite(startSite.get)
              sparkContext.clearJobGroup()
              sparkContext.setLocalProperty(SparkContext.SPARK_JOB_INTERRUPT_ON_CANCEL, "false")
              savedProperties.set(SerializationUtils.clone(sparkContext.localProperties.get()))
              scheduler.start()
            }
            state = StreamingContextState.ACTIVE
          } catch {
            case NonFatal(e) =>
              logError("Error starting the context, marking it as stopped", e)
              scheduler.stop(false)
              state = StreamingContextState.STOPPED
              throw e
          }
          StreamingContext.setActiveContext(this)
        }
        shutdownHookRef = ShutdownHookManager.addShutdownHook(
          StreamingContext.SHUTDOWN_HOOK_PRIORITY)(stopOnShutdown)
        // Registering Streaming Metrics at the start of the StreamingContext
        assert(env.metricsSystem != null)
        env.metricsSystem.registerSource(streamingSource)
        uiTab.foreach(_.attach())
        logInfo("StreamingContext started")
      case ACTIVE =>
        logWarning("StreamingContext has already been started")
      case STOPPED =>
        throw new IllegalStateException("StreamingContext has already been stopped")
    }
  }

這裏最核心的代碼就是調用了JobScheduler.start()，JobScheduler是在StreamingContext創建的時候初始化的。

  //JobScheduler::start
  def start(): Unit = synchronized {
    if (eventLoop != null) return // scheduler has already been started

    logDebug("Starting JobScheduler")
    eventLoop = new EventLoop[JobSchedulerEvent]("JobScheduler") {
      override protected def onReceive(event: JobSchedulerEvent): Unit = processEvent(event)

      override protected def onError(e: Throwable): Unit = reportError("Error in job scheduler", e)
    }
    eventLoop.start()

    // attach rate controllers of input streams to receive batch completion updates
    for {
      inputDStream <- ssc.graph.getInputStreams
      rateController <- inputDStream.rateController
    } ssc.addStreamingListener(rateController)

    listenerBus.start()
    receiverTracker = new ReceiverTracker(ssc)
    inputInfoTracker = new InputInfoTracker(ssc)
    executorAllocationManager = ExecutorAllocationManager.createIfEnabled(
      ssc.sparkContext,
      receiverTracker,
      ssc.conf,
      ssc.graph.batchDuration.milliseconds,
      clock)
    executorAllocationManager.foreach(ssc.addStreamingListener)
    receiverTracker.start()
    jobGenerator.start()
    executorAllocationManager.foreach(_.start())
    logInfo("Started JobScheduler")
  }

這裏我們關注jobGenerator的啓動。來看下JobGenerator.start

  def start(): Unit = synchronized {
    if (eventLoop != null) return // generator has already been started

    // Call checkpointWriter here to initialize it before eventLoop uses it to avoid a deadlock.
    // See SPARK-10125
    checkpointWriter

    eventLoop = new EventLoop[JobGeneratorEvent]("JobGenerator") {
      override protected def onReceive(event: JobGeneratorEvent): Unit = processEvent(event)

      override protected def onError(e: Throwable): Unit = {
        jobScheduler.reportError("Error in job generator", e)
      }
    }
    eventLoop.start()

    if (ssc.isCheckpointPresent) {
      restart()
    } else {
      startFirstTime()
    }
  }

我們看到這裏也初始化並啓動了一個事件循環線程，接受JobGeneratorEvent類型的事件，同時第一次啓動時調用了startFirstTime方法，

  private def startFirstTime() {
    val startTime = new Time(timer.getStartTime())
    graph.start(startTime - graph.batchDuration)
    timer.start(startTime.milliseconds)
    logInfo("Started JobGenerator at " + startTime)
  }

在這個方法中啓動了DStreamGraph和timer，我們看timer是個什麼東西：

  private val timer = new RecurringTimer(clock, ssc.graph.batchDuration.milliseconds,
    longTime => eventLoop.post(GenerateJobs(new Time(longTime))), "JobGenerator")

發現它實際上是一個定時器，每隔一段時間就向事件循環線程中發送一個GenerateJobs事件，而這一段時間就是我們創建StreamingContext時傳進來的時間間隔參數。然後我們來看事件循環線程是如何處理這個事件的：

  private def processEvent(event: JobGeneratorEvent) {
    logDebug("Got event " + event)
    event match {
      case GenerateJobs(time) => generateJobs(time)
      case ClearMetadata(time) => clearMetadata(time)
      case DoCheckpoint(time, clearCheckpointDataLater) =>
        doCheckpoint(time, clearCheckpointDataLater)
      case ClearCheckpointData(time) => clearCheckpointData(time)
    }
  }
  
  private def generateJobs(time: Time) {
    // Checkpoint all RDDs marked for checkpointing to ensure their lineages are
    // truncated periodically. Otherwise, we may run into stack overflows (SPARK-6847).
    ssc.sparkContext.setLocalProperty(RDD.CHECKPOINT_ALL_MARKED_ANCESTORS, "true")
    Try {
      jobScheduler.receiverTracker.allocateBlocksToBatch(time) // allocate received blocks to batch
      graph.generateJobs(time) // generate jobs using allocated block
    } match {
      case Success(jobs) =>
        val streamIdToInputInfos = jobScheduler.inputInfoTracker.getInfo(time)
        jobScheduler.submitJobSet(JobSet(time, jobs, streamIdToInputInfos))
      case Failure(e) =>
        jobScheduler.reportError("Error generating jobs for time " + time, e)
        PythonDStream.stopStreamingContextIfPythonProcessIsDead(e)
    }
    eventLoop.post(DoCheckpoint(time, clearCheckpointDataLater = false))
}

這裏重點關注調用了DStreamGraph的generateJobs方法返回job集合和後續調用JobScheduler.submitJobSet方法

  //DStreamGraph::generateJobs
  def generateJobs(time: Time): Seq[Job] = {
    logDebug("Generating jobs for time " + time)
    val jobs = this.synchronized {
      outputStreams.flatMap { outputStream =>
        val jobOption = outputStream.generateJob(time)
        jobOption.foreach(_.setCallSite(outputStream.creationSite))
        jobOption
      }
    }
    logDebug("Generated " + jobs.length + " jobs for time " + time)
    jobs
  }

看到這裏遍歷了上面提到的outputStreams，針對每個註冊的DStream調用其generateJob方法生成job

  //DStream::generateJob
  private[streaming] def generateJob(time: Time): Option[Job] = {
    getOrCompute(time) match {
      case Some(rdd) =>
        val jobFunc = () => {
          val emptyFunc = { (iterator: Iterator[T]) => {} }
          context.sparkContext.runJob(rdd, emptyFunc)
        }
        Some(new Job(time, jobFunc))
      case None => None
    }
  }

getOrCompute方法內部會調用compute方法來返回一個RDD，而這個compute方法是一個抽象方法，需要子類去實現，大家有興趣可以看一下對接kafka模塊的DirectInputDStream是如何實現的。這裏返回一個rdd之後，創建了一個閉包封裝在Job對象中返回，閉包中走了SparkContext提交job的邏輯，那麼我們接下來看Job中的閉包是什麼時候執行的。從這裏返回一些列調用棧回到JobGenerator.generateJobs方法的JobScheduler.submitJobSet處

  def submitJobSet(jobSet: JobSet) {
    if (jobSet.jobs.isEmpty) {
      logInfo("No jobs added for time " + jobSet.time)
    } else {
      listenerBus.post(StreamingListenerBatchSubmitted(jobSet.toBatchInfo))
      jobSets.put(jobSet.time, jobSet)
      jobSet.jobs.foreach(job => jobExecutor.execute(new JobHandler(job)))
      logInfo("Added jobs for time " + jobSet.time)
    }
  }

在submitJobSet中，依次將job封裝在JobHandler中，提交到jobExecutor來執行，JobHandler是Runnable接口的一個實現，它的run方法中，比較核心的邏輯如下：

var _eventLoop = eventLoop
if (_eventLoop != null) {
  _eventLoop.post(JobStarted(job, clock.getTimeMillis()))
  PairRDDFunctions.disableOutputSpecValidation.withValue(true) {
    job.run()
  }
  _eventLoop = eventLoop
  if (_eventLoop != null) {
    _eventLoop.post(JobCompleted(job, clock.getTimeMillis()))
  }
}

這裏調用了job.run方法，而在run方法中實際執行的就是我們傳遞給Job對象的閉包。整個job的運行是一個阻塞的過程，會獨佔jobExecutor的一個線程，而JobExecutor的線程數在初始化的時候是通過判斷spark.streaming.concurrentJobs參數來指定的，默認爲1。好了，streaming模塊生成作業的邏輯已經大體分析完成了。

Spark源碼閱讀——streaming模塊作業生成和提交原

Spark源碼閱讀——streaming模塊作業生成和提交

再談23種設計模式（3）：行爲型模式（學習筆記）

Power Automate Desktop 安裝完，登錄後老是提示one driver 錯誤

微前端學習筆記(4):從微前端到微模塊之EMP與hel-micro方案探索

微前端學習筆記（1）：微前端總體架構概述，從微服務發微

985 碩士程序員，空窗 4 個月沒有 Offer！

一文搞懂 Spring 循環依賴

賽博鬥地主——使用大語言模型扮演Agent智能體玩牌類遊戲。

VScode右鍵打開(添加到右鍵)

記一次 .NET某工控視覺自動化系統卡死分析

WindowsServer--SQL Server搭建主從同步實現讀寫分離 - 事務性分發

Spark 內存管理原

記Structured Streaming 2.3.1的OOM排查過程原薦

Spark源碼閱讀——streaming模塊作業生成和提交原

jstorm源碼閱讀（2） —— supervisor簡介原

spring源碼閱讀筆記（一）原

Mac下配置sublime實現LaTeX

https://yachay.unat.edu.pe/blog/index.php?comment_area=format_blog&comment_component=blog&comment_co

linux以太網驅動總結

Spark源碼閱讀——streaming模塊作業生成和提交 原

Spark源碼閱讀——streaming模塊作業生成和提交

Spark源碼閱讀——streaming模塊作業生成和提交原