1. StreamingContext Initialization
StreamingContext is the entry point to most streaming functionality; for example, it provides the methods for creating DStreams from a variety of data sources. When a StreamingContext is created, it builds the following key components:
1.1 Creating the DStreamGraph and setting the interval at which DStreams are converted into RDDs
private[streaming] val graph: DStreamGraph = {
  // if a checkpoint is present, restore the graph from it
  if (isCheckpointPresent) {
    _cp.graph.setContext(this)
    _cp.graph.restoreCheckpointData()
    _cp.graph
  } else {
    require(_batchDur != null, "Batch duration for StreamingContext cannot be null")
    // otherwise, create a new DStreamGraph
    val newGraph = new DStreamGraph()
    // and set its batch duration
    newGraph.setBatchDuration(_batchDur)
    newGraph
  }
}
1.2 Creating the JobScheduler
/**
 * The JobGenerator generates one job per batch interval, and the JobScheduler
 * then schedules and submits those jobs. Underneath, this still runs on Spark's
 * core compute engine: DAGScheduler, TaskScheduler, Workers, Executors, and Tasks.
 * An operator like reduceByKey still incurs a shuffle; the underlying data is
 * still read and written through the Executor's BlockManager, and persisted
 * data is still managed by the CacheManager.
 */
private[streaming] val scheduler = new JobScheduler(this)
2. Creating and Transforming DStreams
Once the StreamingContext has finished initializing and created its key components (DStreamGraph, JobScheduler, and so on), the application calls a method such as StreamingContext.socketTextStream to create an input DStream, applies a series of transformations to it, and finally invokes an output operation, which triggers and runs the job for each batch.
When a DStream object is created, its parent class InputDStream is initialized first; InputDStream adds itself to the DStreamGraph:
ssc.graph.addInputStream(this)
Every InputDStream subclass provides a getReceiver() method, which returns the Receiver object used to ingest data.
def getReceiver(): Receiver[T] = {
  new SocketReceiver(host, port, bytesToObjects, storageLevel)
}
Note that, just like RDD transformations, DStream operators are lazily evaluated. Only when an output operator (such as print) is invoked is the DStream wrapped in a ForEachDStream, whose register method adds it as an output stream to the outputStreams list of the DStreamGraph.
def print(num: Int): Unit = ssc.withScope {
  def foreachFunc: (RDD[T], Time) => Unit = {
    (rdd: RDD[T], time: Time) => {
      val firstNum = rdd.take(num + 1)
      // scalastyle:off println
      println("-------------------------------------------")
      println(s"Time: $time")
      println("-------------------------------------------")
      firstNum.take(num).foreach(println)
      if (firstNum.length > num) println("...")
      println()
      // scalastyle:on println
    }
  }
  foreachRDD(context.sparkContext.clean(foreachFunc), displayInnerRDDOps = false)
}
private def foreachRDD(
    foreachFunc: (RDD[T], Time) => Unit,
    displayInnerRDDOps: Boolean): Unit = {
  new ForEachDStream(this,
    context.sparkContext.clean(foreachFunc, false), displayInnerRDDOps).register()
}
What sets ForEachDStream apart from other DStreams is that it overrides the generateJob method.
/**
 * Every DStream output operation ends up calling ForEachDStream's generateJob()
 * method, which ultimately triggers job submission.
 */
override def generateJob(time: Time): Option[Job] = {
  parent.getOrCompute(time) match {
    case Some(rdd) =>
      val jobFunc = () => createRDDWithLocalProperties(time, displayInnerRDDOps) {
        foreachFunc(rdd, time)
      }
      Some(new Job(time, jobFunc))
    case None => None
  }
}
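The lazy-registration flow described in this section can be sketched with a deliberately tiny toy model (ToyGraph and ToyDStream are invented names, not Spark classes): transformations only chain compute functions, an output operation registers the stream in the graph, and only registered streams produce anything when a batch's job is generated.

```scala
import scala.collection.mutable.ArrayBuffer

// Hypothetical toy model of the lazy DStream pipeline (not Spark code).
class ToyGraph {
  val outputStreams = ArrayBuffer[ToyDStream[_]]() // analogue of DStreamGraph.outputStreams
}

class ToyDStream[T](graph: ToyGraph, compute: Long => Seq[T]) {
  // transformation: builds a new stream lazily, nothing is computed yet
  def map[U](f: T => U): ToyDStream[U] =
    new ToyDStream[U](graph, time => compute(time).map(f))

  // output operation: analogue of ForEachDStream.register()
  def register(): this.type = { graph.outputStreams += this; this }

  // analogue of generateJob(time): runs the whole chain for one batch
  def generateJob(time: Long): Seq[T] = compute(time)
}

val graph = new ToyGraph
val input = new ToyDStream[Int](graph, _ => Seq(1, 2, 3))
val mapped = input.map(_ * 10)   // lazy: nothing registered, nothing computed
mapped.register()                // what print()/foreachRDD do under the hood
val result = mapped.generateJob(0L)
```

The key point the toy captures: until register() runs, the graph has no output stream, so nothing would be scheduled for any batch.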
3. Startup
After the initialization above, one method must still be called: StreamingContext's start() method, which launches the Spark Streaming application. This method creates two further StreamingContext components, the ReceiverTracker and the JobGenerator, and starts the Receiver for each of the application's input DStreams inside an Executor on some Worker node of the Spark cluster.
The most important thing StreamingContext's start() method does is call JobScheduler's start() method.
def start(): Unit = synchronized {
  state match {
    case INITIALIZED =>
      startSite.set(DStream.getCreationSite())
      StreamingContext.ACTIVATION_LOCK.synchronized {
        StreamingContext.assertNoOtherContextIsActive()
        try {
          validate()
          ThreadUtils.runInNewThread("streaming-start") {
            sparkContext.setCallSite(startSite.get)
            sparkContext.clearJobGroup()
            sparkContext.setLocalProperty(SparkContext.SPARK_JOB_INTERRUPT_ON_CANCEL, "false")
            savedProperties.set(SerializationUtils.clone(sparkContext.localProperties.get()))
            // the key step: call JobScheduler's start()
            scheduler.start()
          }
          state = StreamingContextState.ACTIVE
          scheduler.listenerBus.post(
            StreamingListenerStreamingStarted(System.currentTimeMillis()))
        } catch {
          case NonFatal(e) =>
            logError("Error starting the context, marking it as stopped", e)
            scheduler.stop(false)
            state = StreamingContextState.STOPPED
            throw e
        }
        StreamingContext.setActiveContext(this)
      }
      logDebug("Adding shutdown hook") // force eager creation of logger
      shutdownHookRef = ShutdownHookManager.addShutdownHook(
        StreamingContext.SHUTDOWN_HOOK_PRIORITY)(() => stopOnShutdown())
      assert(env.metricsSystem != null)
      env.metricsSystem.registerSource(streamingSource)
      uiTab.foreach(_.attach())
      logInfo("StreamingContext started")
    case ACTIVE =>
      logWarning("StreamingContext has already been started")
    case STOPPED =>
      throw new IllegalStateException("StreamingContext has already been stopped")
  }
}
Let's look more closely at how the JobScheduler is created and run.
The JobScheduler schedules jobs to run on Spark: it uses the JobGenerator to generate the jobs and then runs them by submitting them to a thread pool, so that they can execute concurrently.
Creating the JobScheduler
- The JobScheduler is created by the StreamingContext, which also triggers the call to its start() method.
- When the JobScheduler is initialized, it creates a thread pool (jobExecutor) and a jobGenerator.
Here:
jobExecutor is used to submit jobs; the number of threads in the pool is the job concurrency, set by "spark.streaming.concurrentJobs" (default 1).
jobGenerator is a JobGenerator instance, used to generate jobs from the DStreams.
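The jobExecutor idea can be sketched with plain java.util.concurrent primitives; this is a hypothetical simplification, not Spark's JobScheduler code, but it shows how the "spark.streaming.concurrentJobs" property (the config key mentioned above) would bound job concurrency.

```scala
import java.util.concurrent.{Executors, TimeUnit}
import java.util.concurrent.atomic.AtomicInteger

// Pool size = job concurrency; defaults to 1, so batches' jobs run serially.
val concurrentJobs =
  sys.props.getOrElse("spark.streaming.concurrentJobs", "1").toInt
val jobExecutor = Executors.newFixedThreadPool(concurrentJobs)

val completed = new AtomicInteger(0)
// submit three "jobs"; with the default pool size of 1 they execute one at a time
(1 to 3).foreach { _ =>
  jobExecutor.execute(new Runnable {
    override def run(): Unit = completed.incrementAndGet()
  })
}
jobExecutor.shutdown()
jobExecutor.awaitTermination(5, TimeUnit.SECONDS)
```

With concurrentJobs > 1, jobs of different batches can overlap, which trades strict batch ordering for throughput.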
Running the JobScheduler
When its start() method runs, it creates and starts the following services:
- eventLoop: an EventLoop[JobSchedulerEvent] object that receives and processes events. Callers register an event by calling its post method, which places it on the event queue. When the EventLoop starts, it spawns a daemon thread to process the events in that queue. EventLoop is an abstract class; when the JobScheduler initializes it, it implements its onReceive method so that every received event is handled by processEvent(event).
- receiverTracker: a ReceiverTracker object that manages the execution of the receivers of the ReceiverInputDStreams. It must be created only after all input streams have been added to the DStreamGraph, because its constructor extracts the InputDStreams from the DStreamGraph so that their Receivers can be extracted when it starts.
- jobGenerator: created when the JobScheduler is instantiated, and started here.
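The EventLoop pattern described above (a queue drained by a daemon thread, with onReceive left abstract) can be sketched as follows; ToyEventLoop is an invented, heavily simplified stand-in for Spark's EventLoop, not its actual implementation.

```scala
import java.util.concurrent.{LinkedBlockingQueue, TimeUnit}

// Hypothetical simplification of the EventLoop pattern.
abstract class ToyEventLoop[E](name: String) {
  private val queue = new LinkedBlockingQueue[E]()
  @volatile private var stopped = false

  private val thread = new Thread(name) {
    setDaemon(true) // the real EventLoop's worker is also a daemon thread
    override def run(): Unit =
      while (!stopped) {
        // poll with a timeout so stop() can be observed promptly
        val event = queue.poll(50, TimeUnit.MILLISECONDS)
        if (event != null) onReceive(event)
      }
  }

  def start(): Unit = thread.start()
  def stop(): Unit = { stopped = true; thread.join() }
  def post(event: E): Unit = queue.put(event) // callers register events here

  // JobScheduler implements this to delegate to processEvent(event)
  protected def onReceive(event: E): Unit
}
```

Usage mirrors the JobScheduler: instantiate with an onReceive that routes events to a handler, call start(), and post events from other threads.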
def start(): Unit = synchronized {
  if (eventLoop != null) return // scheduler has already been started
  logDebug("Starting JobScheduler")
  eventLoop = new EventLoop[JobSchedulerEvent]("JobScheduler") {
    override protected def onReceive(event: JobSchedulerEvent): Unit = processEvent(event)
    override protected def onError(e: Throwable): Unit = reportError("Error in job scheduler", e)
  }
  eventLoop.start()
  for {
    inputDStream <- ssc.graph.getInputStreams
    rateController <- inputDStream.rateController
  } ssc.addStreamingListener(rateController)
  listenerBus.start()
  // create the ReceiverTracker
  receiverTracker = new ReceiverTracker(ssc)
  inputInfoTracker = new InputInfoTracker(ssc)
  val executorAllocClient: ExecutorAllocationClient = ssc.sparkContext.schedulerBackend match {
    case b: ExecutorAllocationClient => b.asInstanceOf[ExecutorAllocationClient]
    case _ => null
  }
  executorAllocationManager = ExecutorAllocationManager.createIfEnabled(
    executorAllocClient,
    receiverTracker,
    ssc.conf,
    ssc.graph.batchDuration.milliseconds,
    clock)
  executorAllocationManager.foreach(ssc.addStreamingListener)
  // start the ReceiverTracker
  receiverTracker.start()
  // start the JobGenerator
  jobGenerator.start()
  executorAllocationManager.foreach(_.start())
  logInfo("Started JobScheduler")
}
4. How Receivers Are Started
As noted above, when the JobScheduler's start method runs, it creates the ReceiverTracker and then calls the ReceiverTracker's start method. Let's now walk through the source of ReceiverTracker.start().
/** Create a ReceiverTrackerEndpoint and start launching the receivers. */
def start(): Unit = synchronized {
  if (isTrackerStarted) {
    throw new SparkException("ReceiverTracker already started")
  }
  if (!receiverInputStreams.isEmpty) {
    endpoint = ssc.env.rpcEnv.setupEndpoint(
      "ReceiverTracker", new ReceiverTrackerEndpoint(ssc.env.rpcEnv))
    if (!skipReceiverLaunch) launchReceivers()
    logInfo("ReceiverTracker started")
    trackerState = Started
  }
}
private def launchReceivers(): Unit = {
  val receivers = receiverInputStreams.map { nis =>
    val rcvr = nis.getReceiver()
    rcvr.setReceiverId(nis.id)
    rcvr
  }
  runDummySparkJob()
  logInfo("Starting " + receivers.length + " receivers")
  // launch all the receivers
  endpoint.send(StartAllReceivers(receivers))
}
case StartAllReceivers(receivers) =>
  val scheduledLocations = schedulingPolicy.scheduleReceivers(receivers, getExecutors)
  for (receiver <- receivers) {
    val executors = scheduledLocations(receiver.streamId)
    updateReceiverScheduledExecutors(receiver.streamId, executors)
    // record the receiver's preferred location
    receiverPreferredLocations(receiver.streamId) = receiver.preferredLocation
    // start the receiver on an executor
    startReceiver(receiver, executors)
  }
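To illustrate what scheduleReceivers produces, here is a hypothetical, heavily simplified stand-in that just spreads receivers round-robin across the available executors; the real ReceiverSchedulingPolicy is considerably smarter (it honors each receiver's preferredLocation and balances load), so treat this only as a sketch of the input/output shape.

```scala
// Hypothetical round-robin stand-in for schedulingPolicy.scheduleReceivers:
// maps each receiver stream id to the executor(s) it should run on.
def scheduleReceivers(
    receiverIds: Seq[Int],
    executors: Seq[String]): Map[Int, Seq[String]] =
  receiverIds.zipWithIndex.map { case (id, i) =>
    id -> Seq(executors(i % executors.length)) // cycle through executors
  }.toMap

val plan = scheduleReceivers(Seq(0, 1, 2), Seq("host1-exec1", "host2-exec2"))
```

With three receivers and two executors, the third receiver wraps around to the first executor, which is the basic spreading behavior the policy aims for.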
private def startReceiver(
    receiver: Receiver[_],
    scheduledLocations: Seq[TaskLocation]): Unit = {
  def shouldStartReceiver: Boolean = {
    // It's okay to start when trackerState is Initialized or Started
    !(isTrackerStopping || isTrackerStopped)
  }
  val receiverId = receiver.streamId
  if (!shouldStartReceiver) {
    onReceiverJobFinish(receiverId)
    return
  }
  val checkpointDirOption = Option(ssc.checkpointDir)
  val serializableHadoopConf =
    new SerializableConfiguration(ssc.sparkContext.hadoopConfiguration)
  // Function to start the receiver on the worker node
  val startReceiverFunc: Iterator[Receiver[_]] => Unit =
    (iterator: Iterator[Receiver[_]]) => {
      if (!iterator.hasNext) {
        throw new SparkException(
          "Could not start receiver as object not found.")
      }
      if (TaskContext.get().attemptNumber() == 0) {
        val receiver = iterator.next()
        assert(iterator.hasNext == false)
        val supervisor = new ReceiverSupervisorImpl(
          receiver, SparkEnv.get, serializableHadoopConf.value, checkpointDirOption)
        supervisor.start()
        supervisor.awaitTermination()
      } else {
        // It's restarted by TaskScheduler, but we want to reschedule it again. So exit it.
      }
    }
  // Create the RDD using the scheduledLocations to run the receiver in a Spark job
  val receiverRDD: RDD[Receiver[_]] =
    if (scheduledLocations.isEmpty) {
      ssc.sc.makeRDD(Seq(receiver), 1)
    } else {
      val preferredLocations = scheduledLocations.map(_.toString).distinct
      ssc.sc.makeRDD(Seq(receiver -> preferredLocations))
    }
  receiverRDD.setName(s"Receiver $receiverId")
  ssc.sparkContext.setJobDescription(s"Streaming job running receiver $receiverId")
  ssc.sparkContext.setCallSite(Option(ssc.getStartSite()).getOrElse(Utils.getCallSite()))
  val future = ssc.sparkContext.submitJob[Receiver[_], Unit, Unit](
    receiverRDD, startReceiverFunc, Seq(0), (_, _) => Unit, ())
  // We will keep restarting the receiver job until ReceiverTracker is stopped
  future.onComplete {
    case Success(_) =>
      if (!shouldStartReceiver) {
        onReceiverJobFinish(receiverId)
      } else {
        logInfo(s"Restarting Receiver $receiverId")
        self.send(RestartReceiver(receiver))
      }
    case Failure(e) =>
      if (!shouldStartReceiver) {
        onReceiverJobFinish(receiverId)
      } else {
        logError("Receiver has been stopped. Try to restart it.", e)
        logInfo(s"Restarting Receiver $receiverId")
        self.send(RestartReceiver(receiver))
      }
  }(ThreadUtils.sameThread)
  logInfo(s"Receiver ${receiver.streamId} started")
}
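The keep-restarting behavior in the onComplete callback above can be sketched with plain Scala futures; this is a hypothetical simplification (runReceiverJob, trackerStopped, and attempts are invented names), not the ReceiverTracker's actual code, but it shows the same pattern: whenever the "receiver job" completes, it is resubmitted unless the tracker has been stopped.

```scala
import java.util.concurrent.atomic.{AtomicBoolean, AtomicInteger}
import scala.concurrent.Future
import scala.concurrent.ExecutionContext.Implicits.global

// Stand-ins for tracker state and for counting how many times the job ran.
val trackerStopped = new AtomicBoolean(false)
val attempts = new AtomicInteger(0)

def runReceiverJob(): Unit =
  Future {
    Thread.sleep(10) // stand-in for the lifetime of the submitted receiver task
    attempts.incrementAndGet()
  }.onComplete { _ =>
    // analogue of self.send(RestartReceiver(receiver)) in the source above
    if (!trackerStopped.get) runReceiverJob()
  }
```

The net effect is a supervision loop: a receiver that dies (or whose task exits) is rescheduled again and again until the ReceiverTracker itself shuts down.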
References:
https://blog.csdn.net/Anbang713/article/details/82049637