Lesson 6: Spark Streaming Source Code Walkthrough: Dynamic Job Generation and Deeper Reflections


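1. There are three kinds of DStream:
1) Input DStreams: Kafka, Socket, Flume;
2) Output DStreams: an output operation is a logical-level action defined by the Spark Streaming framework. Underneath, it is still translated into a physical-level action, which means an RDD action;
3) Intermediate transformations, which carry the business logic.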

2. A DStream can be produced in two ways:
1) built directly from a data source;
2) derived by applying an operation to another DStream.

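3. Any data that is not stream-processed, or that has nothing to do with stream processing, is data without value.
Spark Streaming is stream processing triggered by time.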

val ssc = new StreamingContext(conf, Seconds(5))
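Every 5 seconds, JobGenerator produces a Job at the logical level.
A logical-level Job describes what to do but does not do it. The actual execution is triggered by the underlying physical-level action.
DStream actions are logical-level as well. They are wrapped behind the Runnable interface instead of generating a physical-level job directly, which leaves room for scheduling and optimization.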

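DStream dependencies and RDD dependencies
The dependencies between DStreams are translated into dependencies between RDDs. The last operation in a DStream dependency chain is always an action, so after translation the last operation on the RDD side is an action too. If that action ran directly at translation time, it would generate a Job immediately: there would be no queue, and the Job would be outside the scheduler's control.
Instead, the translated RDD code is placed inside a function, and at that point only the function's definition exists. The physical-level RDD operations really have been produced by the translation, but they all sit inside a function that has not been invoked yet, so they cannot run. When JobScheduler is ready to schedule the Job, it takes a thread from its thread pool and runs the wrapped function.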
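
The difference between defining the wrapped function and running it can be sketched in plain Scala, without Spark. The names `LogicalJob` and `DeferredJobDemo` are hypothetical and only mirror the shape of a Spark Streaming Job; this is a minimal illustration under that assumption, not the framework's implementation.

```scala
import java.util.concurrent.{Executors, TimeUnit}
import java.util.concurrent.atomic.AtomicInteger

object DeferredJobDemo {
  // Hypothetical stand-in for a logical-level job: it only stores the function.
  final class LogicalJob(val time: Long, func: () => Unit) {
    def run(): Unit = func()
  }

  // Returns (count before the job is scheduled, count after it has run).
  def demo(): (Int, Int) = {
    val executed = new AtomicInteger(0)
    // Defining the job does not run the body; the body plays the role of the wrapped RDD action.
    val job = new LogicalJob(5000L, () => { executed.incrementAndGet(); () })
    val before = executed.get()

    // Later, a scheduler hands the job to a thread taken from a pool.
    val pool = Executors.newFixedThreadPool(1)
    pool.execute(new Runnable { def run(): Unit = job.run() })
    pool.shutdown()
    pool.awaitTermination(10, TimeUnit.SECONDS)
    (before, executed.get())
  }
}
```

The counter stays at zero until the pool thread calls `run()`, which is the property that lets a scheduler queue and order jobs before anything executes.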

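Source code
JobGenerator: responsible for generating Jobs;
JobScheduler: responsible for scheduling Jobs;
ReceiverTracker: responsible for managing the receivers and tracking the data they receive;

JobGenerator and ReceiverTracker are in fact members of JobScheduler.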

// The start method of JobScheduler makes these two calls
    receiverTracker.start()
    jobGenerator.start()

// The start method of JobGenerator
    if (eventLoop != null) return // generator has already been started
  ……
    // Anonymous inner class
    eventLoop = new EventLoop[JobGeneratorEvent]("JobGenerator") {
      // Overrides the abstract methods of EventLoop
      override protected def onReceive(event: JobGeneratorEvent): Unit = processEvent(event)
      override protected def onError(e: Throwable): Unit = {
        jobScheduler.reportError("Error in job generator", e)
      }
    }
    eventLoop.start() // start the event loop

    if (ssc.isCheckpointPresent) {
      restart()
    } else {
      startFirstTime()
    }
  }

/** Starts the generator for the first time */
private def startFirstTime() {
  val startTime = new Time(timer.getStartTime())
  graph.start(startTime - graph.batchDuration)
  timer.start(startTime.milliseconds)
  logInfo("Started JobGenerator at " + startTime)
}

// The timer's start method, which starts the timer thread
def start(startTime: Long): Long = synchronized {
  nextTime = startTime
  thread.start()
  logInfo("Started timer for " + name + " at time " + nextTime)
  nextTime
}

// Below is the definition of EventLoop and its start method
 /**
 * An event loop to receive events from the caller and process all events in the event thread. It
 * will start an exclusive event thread to process all events.
 *
 * Note: The event queue will grow indefinitely. So subclasses should make sure `onReceive` can
 * handle events in time to avoid the potential OOM.
 */
private[spark] abstract class EventLoop[E](name: String) extends Logging {

  private val eventQueue: BlockingQueue[E] = new LinkedBlockingDeque[E]()

  private val stopped = new AtomicBoolean(false)

  private val eventThread = new Thread(name) {
    setDaemon(true)  // run as a daemon (background) thread

    override def run(): Unit = {
      try {
        while (!stopped.get) {
          val event = eventQueue.take()  // keep taking events from the queue (blocks while it is empty)
          try {
            onReceive(event)
          } catch {
            case NonFatal(e) => {
              try {
                onError(e)
              } catch {
                case NonFatal(e) => logError("Unexpected error in " + name, e)
              }
            }
          }
        }
      } catch {
        case ie: InterruptedException => // exit even if eventQueue is not empty
        case NonFatal(e) => logError("Unexpected error in " + name, e)
      }
    }

  }

  def start(): Unit = {
    if (stopped.get) {
      throw new IllegalStateException(name + " has already been stopped")
    }
    // Call onStart before starting the event thread to make sure it happens before onReceive
    onStart()
    eventThread.start()
  }

/**
   * Invoked in the event thread when polling events from the event queue.
   *
   * Note: Should avoid calling blocking actions in `onReceive`, or the event thread will be blocked
   * and cannot process events in time. If you want to call some blocking actions, run them in
   * another thread.
   */
  protected def onReceive(event: E): Unit
……
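
The event loop pattern in the excerpt above can be tried out with a stripped-down re-creation. `MiniEventLoop` and its `post` and `stop` methods are hypothetical simplifications written for this note; they are not Spark's `EventLoop` class, only a sketch of the same idea: one daemon thread blocking on a queue and handing each event to `onReceive`.

```scala
import java.util.concurrent.{CountDownLatch, LinkedBlockingDeque, TimeUnit}
import java.util.concurrent.atomic.AtomicBoolean
import scala.collection.mutable.ArrayBuffer
import scala.util.control.NonFatal

object MiniEventLoopDemo {
  // Simplified, hypothetical re-creation of the event loop pattern.
  abstract class MiniEventLoop[E](name: String) {
    private val queue = new LinkedBlockingDeque[E]()
    private val stopped = new AtomicBoolean(false)
    private val thread = new Thread(name) {
      setDaemon(true)
      override def run(): Unit = {
        try {
          while (!stopped.get) {
            val event = queue.take() // blocks until an event is posted
            try onReceive(event) catch { case NonFatal(e) => onError(e) }
          }
        } catch {
          case _: InterruptedException => // stop() interrupts the blocking take()
        }
      }
    }
    def start(): Unit = thread.start()
    def post(event: E): Unit = queue.put(event)
    def stop(): Unit =
      if (stopped.compareAndSet(false, true)) {
        thread.interrupt()
        thread.join()
      }
    protected def onReceive(event: E): Unit
    protected def onError(e: Throwable): Unit
  }

  // Posts three events and returns them in the order the event thread handled them.
  def demo(): List[Int] = {
    val handled = ArrayBuffer.empty[Int]
    val done = new CountDownLatch(3)
    val loop = new MiniEventLoop[Int]("demo-loop") {
      override protected def onReceive(event: Int): Unit = {
        handled.synchronized { handled += event }
        done.countDown()
      }
      override protected def onError(e: Throwable): Unit = ()
    }
    loop.start()
    (1 to 3).foreach(loop.post)
    done.await(10, TimeUnit.SECONDS)
    loop.stop()
    handled.synchronized { handled.toList }
  }
}
```

Because a single thread drains the queue, events are handled in the order they were posted, which is why a slow `onReceive` delays everything behind it.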

// The processEvent method of JobGenerator
/** Processes all events */
  private def processEvent(event: JobGeneratorEvent) {
    logDebug("Got event " + event)
    event match {
      case GenerateJobs(time) => generateJobs(time)  // the batch time is passed as the argument
      case ClearMetadata(time) => clearMetadata(time)
      case DoCheckpoint(time, clearCheckpointDataLater) =>
        doCheckpoint(time, clearCheckpointDataLater)
      case ClearCheckpointData(time) => clearCheckpointData(time)
    }
  }
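
The dispatch in processEvent is ordinary pattern matching over a family of event case classes. A small stand-alone sketch, assuming simplified hypothetical event types that carry a `Long` instead of Spark's `Time`, and handlers that only describe the work:

```scala
object EventDispatchDemo {
  // Hypothetical, simplified event types modeled on the generator's events.
  sealed trait GeneratorEvent
  final case class GenerateJobs(time: Long) extends GeneratorEvent
  final case class ClearMetadata(time: Long) extends GeneratorEvent
  final case class DoCheckpoint(time: Long, clearCheckpointDataLater: Boolean)
      extends GeneratorEvent

  // Each event type is routed to its own branch; the branches just describe the work here.
  def processEvent(event: GeneratorEvent): String = event match {
    case GenerateJobs(t)        => s"generate jobs for batch $t"
    case ClearMetadata(t)       => s"clear metadata for batch $t"
    case DoCheckpoint(t, later) => s"checkpoint batch $t (clear later: $later)"
  }
}
```

Sealing the trait lets the compiler warn when a new event type is added but not handled in the match.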

/** Generate jobs and perform checkpoint for the given `time`.  */
private def generateJobs(time: Time) {
  // Set the SparkEnv in this thread, so that job generation code can access the environment
  // Example: BlockRDDs are created in this thread, and it needs to access BlockManager
  // Update: This is probably redundant after threadlocal stuff in SparkEnv has been removed.
  SparkEnv.set(ssc.env)
  Try {
    jobScheduler.receiverTracker.allocateBlocksToBatch(time) // allocate received blocks to batch
    graph.generateJobs(time) // generate jobs using allocated block
  } match {
    case Success(jobs) =>
      val streamIdToInputInfos = jobScheduler.inputInfoTracker.getInfo(time)
      jobScheduler.submitJobSet(JobSet(time, jobs, streamIdToInputInfos)) // *** submitJobSet
    case Failure(e) =>
      jobScheduler.reportError("Error generating jobs for time " + time, e)
  }
  eventLoop.post(DoCheckpoint(time, clearCheckpointDataLater = false))
}
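
The `Try { ... } match` shape used by generateJobs can be isolated in a few lines. In this sketch `generate` is a hypothetical stand-in for the job generation step, and the returned strings stand in for submitting the job set or reporting the error; none of these names come from Spark.

```scala
import scala.util.{Failure, Success, Try}

object TryGenerateDemo {
  // Hypothetical stand-in for job generation: it fails for negative batch times.
  def generate(time: Long): Seq[String] = {
    require(time >= 0, s"invalid batch time $time")
    Seq(s"job-0@$time", s"job-1@$time")
  }

  // Run generation inside Try, then either submit the result or report the failure.
  def generateAndSubmit(time: Long): String =
    Try(generate(time)) match {
      case Success(jobs) => s"submitted ${jobs.size} jobs for time $time"
      case Failure(e)    => s"error generating jobs for time $time: ${e.getMessage}"
    }
}
```

Wrapping the generation step in `Try` means an exception for one batch is turned into a value and reported, instead of killing the thread that generates jobs for later batches.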

The generateJobs method of graph
def generateJobs(time: Time): Seq[Job] = {
  logDebug("Generating jobs for time " + time)
  val jobs = this.synchronized {
    outputStreams.flatMap { outputStream =>
      val jobOption = outputStream.generateJob(time) // the last DStream in the whole chain
      jobOption.foreach(_.setCallSite(outputStream.creationSite))
      jobOption
    }
  }
  logDebug("Generated " + jobs.length + " jobs for time " + time)
  jobs
}
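
Because generateJob returns an `Option`, the `flatMap` over the output streams keeps only the streams that actually produced a job for the batch. A minimal sketch with hypothetical `OutputStream` and `Job` case classes (the `hasData` flag is an assumption made purely for illustration):

```scala
object FlatMapJobsDemo {
  final case class Job(time: Long, streamName: String)

  // Hypothetical output stream: it yields a job only when it has data for the batch.
  final case class OutputStream(name: String, hasData: Boolean) {
    def generateJob(time: Long): Option[Job] =
      if (hasData) Some(Job(time, name)) else None
  }

  // flatMap keeps the Some values and drops the None values.
  def generateJobs(streams: Seq[OutputStream], time: Long): Seq[Job] =
    streams.flatMap(_.generateJob(time))
}
```

With two streams where only the first has data, the result contains a single job for the first stream.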

The generateJob method of DStream
/**
 * Generate a SparkStreaming job for the given time. This is an internal method that
 * should not be called directly. This default implementation creates a job
 * that materializes the corresponding RDD. Subclasses of DStream may override this
 * to generate their own jobs.
 */
private[streaming] def generateJob(time: Time): Option[Job] = {
  getOrCompute(time) match {
    case Some(rdd) => {
      val jobFunc = () => {
        val emptyFunc = { (iterator: Iterator[T]) => {} }
        context.sparkContext.runJob(rdd, emptyFunc)
      }
      Some(new Job(time, jobFunc))
    }
    case None => None
  }
}

private val timer = new RecurringTimer(clock, ssc.graph.batchDuration.milliseconds,
  longTime => eventLoop.post(GenerateJobs(new Time(longTime))), "JobGenerator")

Definition of RecurringTimer
class RecurringTimer(clock: Clock, period: Long, callback: (Long) => Unit, name: String)
  extends Logging {

  private val thread = new Thread("RecurringTimer - " + name) {
    setDaemon(true)
    override def run() { loop }
  }

  /**
   * Start at the given start time.
   */
  def start(startTime: Long): Long = synchronized {
    nextTime = startTime
    thread.start()
    logInfo("Started timer for " + name + " at time " + nextTime)
    nextTime
  }

  /**
   * Repeatedly call the callback every interval.
   */
  private def loop() {
    try {
      while (!stopped) {
        triggerActionForNextInterval()
      }
      triggerActionForNextInterval()
    } catch {
      case e: InterruptedException =>
    }
  }
}
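
The excerpt above omits how the tick times are computed, so the following is only a sketch under two stated assumptions: the first tick is the next multiple of the period after the current time, and each loop iteration invokes the callback and then advances by one period. `alignedStart` and `simulate` are hypothetical helpers, and the simulation does not sleep, so it shows only the sequence of times handed to the callback.

```scala
object RecurringTicksDemo {
  // Assumed alignment rule: the first tick is the next multiple of `period` after `now`.
  def alignedStart(now: Long, period: Long): Long =
    (math.floor(now.toDouble / period) + 1).toLong * period

  // Simulates `count` loop iterations without sleeping:
  // invoke the callback with the current tick, then advance by one period.
  def simulate(now: Long, period: Long, count: Int)(callback: Long => Unit): Unit = {
    var nextTime = alignedStart(now, period)
    for (_ <- 1 to count) {
      callback(nextTime)
      nextTime += period
    }
  }

  // Collects the tick times a 5-second timer would deliver when started at t = 12300 ms.
  def demo(): List[Long] = {
    val posted = scala.collection.mutable.ListBuffer.empty[Long]
    simulate(now = 12300L, period = 5000L, count = 3)(t => { posted += t; () })
    posted.toList
  }
}
```

In JobGenerator the callback is the function that posts `GenerateJobs(new Time(longTime))` to the event loop, so each tick becomes one batch.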

*** The submitJobSet method
def submitJobSet(jobSet: JobSet) {
  if (jobSet.jobs.isEmpty) {
    logInfo("No jobs added for time " + jobSet.time)
  } else {
    listenerBus.post(StreamingListenerBatchSubmitted(jobSet.toBatchInfo))
    jobSets.put(jobSet.time, jobSet)
    jobSet.jobs.foreach(job => jobExecutor.execute(new JobHandler(job)))
    logInfo("Added jobs for time " + jobSet.time)
  }
}

JobHandler implements the Runnable interface
private class JobHandler(job: Job) extends Runnable with Logging {
  import JobScheduler._

  def run() {
    try {
      val formattedTime = UIUtils.formatBatchTime(
        job.time.milliseconds, ssc.graph.batchDuration.milliseconds, showYYYYMMSS = false)
      val batchUrl = s"/streaming/batch/?id=${job.time.milliseconds}"
      val batchLinkText = s"[output operation ${job.outputOpId}, batch time ${formattedTime}]"

      ssc.sc.setJobDescription(
        s"""Streaming job from <a href="$batchUrl">$batchLinkText</a>""")
      ssc.sc.setLocalProperty(BATCH_TIME_PROPERTY_KEY, job.time.milliseconds.toString)
      ssc.sc.setLocalProperty(OUTPUT_OP_ID_PROPERTY_KEY, job.outputOpId.toString)

      // We need to assign `eventLoop` to a temp variable. Otherwise, because
      // `JobScheduler.stop(false)` may set `eventLoop` to null when this method is running, then
      // it's possible that when `post` is called, `eventLoop` happens to be null.
      var _eventLoop = eventLoop
      if (_eventLoop != null) {
        _eventLoop.post(JobStarted(job, clock.getTimeMillis()))
        // Disable checks for existing output directories in jobs launched by the streaming
        // scheduler, since we may need to write output to an existing directory during checkpoint
        // recovery; see SPARK-4835 for more details.
        PairRDDFunctions.disableOutputSpecValidation.withValue(true) {
          job.run()
        }
        _eventLoop = eventLoop
        if (_eventLoop != null) {
          _eventLoop.post(JobCompleted(job, clock.getTimeMillis()))
        }
      } else {
        // JobScheduler has been stopped.
      }
    } finally {
      ssc.sc.setLocalProperty(JobScheduler.BATCH_TIME_PROPERTY_KEY, null)
      ssc.sc.setLocalProperty(JobScheduler.OUTPUT_OP_ID_PROPERTY_KEY, null)
    }
  }
}
}
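
The core of JobHandler, stripped of the UI and property bookkeeping, is: report that the job started, run it, report that it completed, all on a pool thread. A minimal sketch follows; `Handler`, the simplified `Job`, and the event queue are hypothetical stand-ins, and a single-threaded pool is assumed only to make the event order deterministic.

```scala
import java.util.concurrent.{ConcurrentLinkedQueue, Executors, TimeUnit}

object JobHandlerDemo {
  sealed trait SchedulerEvent
  final case class JobStarted(id: Int) extends SchedulerEvent
  final case class JobCompleted(id: Int) extends SchedulerEvent

  // Hypothetical job: an id plus the deferred function to execute.
  final case class Job(id: Int, body: () => Unit) { def run(): Unit = body() }

  // Hypothetical handler: report start, run the job, report completion.
  final class Handler(job: Job, events: ConcurrentLinkedQueue[SchedulerEvent])
      extends Runnable {
    def run(): Unit = {
      events.add(JobStarted(job.id))
      job.run()
      events.add(JobCompleted(job.id))
      ()
    }
  }

  // Submits two jobs and returns the events in the order they were recorded.
  def demo(): List[SchedulerEvent] = {
    val events = new ConcurrentLinkedQueue[SchedulerEvent]()
    val pool = Executors.newFixedThreadPool(1) // one thread keeps the order deterministic
    Seq(Job(1, () => ()), Job(2, () => ())).foreach(j => pool.execute(new Handler(j, events)))
    pool.shutdown()
    pool.awaitTermination(10, TimeUnit.SECONDS)

    // Copy the recorded events into an immutable list.
    val out = List.newBuilder[SchedulerEvent]
    val it = events.iterator()
    while (it.hasNext) out += it.next()
    out.result()
  }
}
```

With more than one pool thread, the started and completed events of different jobs may interleave, which is why the real handler tags each event with the job it belongs to.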


