StreamingContext initialization
When a StreamingContext is initialized, it creates two important components, the DStreamGraph and the JobScheduler, as shown below:
// Initialize an important component, the DStreamGraph. It holds the
// dependencies among all the DStreams in the Spark Streaming application,
// along with the operators applied between them.
private[streaming] val graph: DStreamGraph = {
  if (isCheckpointPresent) {
    cp_.graph.setContext(this)
    cp_.graph.restoreCheckpointData()
    cp_.graph
  } else {
    require(batchDur_ != null, "Batch duration for StreamingContext cannot be null")
    val newGraph = new DStreamGraph()
    newGraph.setBatchDuration(batchDur_)
    newGraph
  }
}

// Initialize the JobScheduler, which handles job scheduling: the jobs
// produced by the JobGenerator are scheduled and submitted through it,
// ultimately running on the Spark core engine.
private[streaming] val scheduler = new JobScheduler(this)
After the StreamingContext is initialized, let us follow a WordCount program as it continues to execute. As analyzed earlier for Spark Core, a job is triggered by an output operation (i.e., an action). Take the simple print() action as an example: internally it calls print(10), printing the first 10 elements of each RDD, and it does so via foreachRDD. That call in turn invokes ForEachDStream's register() method, and eventually generateJob() is called, which is where job submission is triggered.
However, the above only triggers job submission; it does not cover how jobs are generated or how a Receiver ingests data. Those are set in motion by calling StreamingContext's start() method. So in Spark Streaming, the program does nothing until start() is called; of course it also does nothing without an output operation, since no job would be submitted.
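The WordCount driver referred to above can be sketched end to end as follows (a minimal sketch: the host, port, and batch interval are illustrative; running it requires a Spark deployment and a socket source such as `nc -lk 9999`):

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object StreamingWordCount {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("StreamingWordCount")
    val ssc = new StreamingContext(conf, Seconds(5))      // batch interval

    // Creates a socket input DStream backed by a SocketReceiver
    val lines = ssc.socketTextStream("localhost", 9999)
    val counts = lines.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _)

    // Output operation: registers a ForEachDStream, enabling job generation
    counts.print()

    ssc.start()             // launches JobScheduler, ReceiverTracker, JobGenerator
    ssc.awaitTermination()
  }
}
```

Note that without both the output operation (counts.print()) and the ssc.start() call, nothing would execute.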
Let us now focus on the start() method.
StreamingContext's start() method
def start(): Unit = synchronized {
  state match {
    case INITIALIZED =>
      startSite.set(DStream.getCreationSite())
      // Take a lock to ensure only one StreamingContext is active at a time
      StreamingContext.ACTIVATION_LOCK.synchronized {
        // Fail if another StreamingContext is already running
        StreamingContext.assertNoOtherContextIsActive()
        try {
          // Check that the initialized components are valid, whether a
          // checkpoint is configured, and so on
          validate()
          // Start the streaming application in a separate thread
          ThreadUtils.runInNewThread("streaming-start") {
            sparkContext.setCallSite(startSite.get)
            sparkContext.clearJobGroup()
            sparkContext.setLocalProperty(SparkContext.SPARK_JOB_INTERRUPT_ON_CANCEL, "false")
            // Call JobScheduler's start() method, which starts the Receivers
            scheduler.start()
          }
          // Update the current state
          state = StreamingContextState.ACTIVE
        } catch {
          case NonFatal(e) =>
            logError("Error starting the context, marking it as stopped", e)
            scheduler.stop(false)
            state = StreamingContextState.STOPPED
            throw e
        }
        StreamingContext.setActiveContext(this)
      }
    // remaining cases omitted
    ...
  }
}
As the code above shows, the real work is done by JobScheduler's start() method, so let us look inside it.
JobScheduler's start() method
// What StreamingContext's start() really invokes is JobScheduler's start()
def start(): Unit = synchronized {
  // If this JobScheduler has already been started, return
  // (this can happen, e.g., on a failure restart)
  if (eventLoop != null) return // scheduler has already been started

  logDebug("Starting JobScheduler")
  // Create an event loop for receiving messages
  eventLoop = new EventLoop[JobSchedulerEvent]("JobScheduler") {
    override protected def onReceive(event: JobSchedulerEvent): Unit = processEvent(event)
    override protected def onError(e: Throwable): Unit = reportError("Error in job scheduler", e)
  }
  // Start receiving messages (these are in-process, local messages)
  eventLoop.start()

  // Attach the rate controllers of the input DStreams.
  // This matters in practice: when receiving data via Kafka direct mode
  // (or an ordinary Receiver), a maximum ingestion rate can be enforced
  for {
    inputDStream <- ssc.graph.getInputStreams
    rateController <- inputDStream.rateController
  } ssc.addStreamingListener(rateController)

  listenerBus.start(ssc.sparkContext)
  // Create the ReceiverTracker, the component responsible for data ingestion
  receiverTracker = new ReceiverTracker(ssc)
  // Track metadata about the input DStreams' data, so Streaming can monitor it
  inputInfoTracker = new InputInfoTracker(ssc)
  // Start the receiverTracker; this launches the Receivers
  // associated with the input DStreams
  receiverTracker.start()
  // The JobGenerator was created when the JobScheduler was constructed;
  // it is started here
  jobGenerator.start()
  logInfo("Started JobScheduler")
}
Let us walk through the code above. First it creates an event loop to receive local messages. Then it attaches each input DStream's rate controller, which is where input-rate limiting comes in.
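The event-loop pattern used here can be sketched in plain Scala (the class name SimpleEventLoop is hypothetical; Spark's real EventLoop in org.apache.spark.util adds lifecycle checks and error handling). Callers post events to a queue, and a single daemon thread consumes them, so onReceive never runs concurrently with itself:

```scala
import java.util.concurrent.LinkedBlockingQueue

abstract class SimpleEventLoop[E](name: String) {
  private val queue = new LinkedBlockingQueue[E]()

  private val thread = new Thread(name) {
    setDaemon(true)
    override def run(): Unit = {
      try {
        while (true) {
          // Block until an event is posted, then process it
          onReceive(queue.take())
        }
      } catch {
        case _: InterruptedException => // stop() interrupts us; just exit
      }
    }
  }

  def start(): Unit = thread.start()

  // Callers post events without waiting for them to be processed
  def post(event: E): Unit = queue.put(event)

  def stop(): Unit = { thread.interrupt(); thread.join() }

  // Subclasses define how each event is handled
  protected def onReceive(event: E): Unit
}
```

JobScheduler's eventLoop follows this shape: processEvent plays the role of onReceive, and components like the JobGenerator post JobSchedulerEvents to it.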
A brief note on Receiver rate limiting: if cluster resources are limited and not large enough to process data as fast as the Receiver ingests it, data backs up on the Receiver side. To keep that backlog from growing too large, it is necessary to throttle the ingestion rate, which can be done with two settings: spark.streaming.receiver.maxRate and spark.streaming.kafka.maxRatePerPartition.
The former throttles ordinary Receivers; the latter applies to Kafka. Since Spark 1.5, however, a backpressure mechanism has been available (including for Kafka direct mode), so a fixed Receiver rate limit is no longer necessary: Spark automatically estimates a reasonable ingestion rate and adjusts it dynamically. To enable this mechanism, just set spark.streaming.backpressure.enabled to true.
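The settings just discussed, collected on a SparkConf (the rate values are illustrative; choose limits that match your cluster's capacity):

```scala
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .setAppName("RateLimitedStreamingApp")
  // static cap, in records/second, for each ordinary Receiver
  .set("spark.streaming.receiver.maxRate", "1000")
  // static cap, in records/second, per Kafka partition (direct mode)
  .set("spark.streaming.kafka.maxRatePerPartition", "500")
  // Spark 1.5+: let Spark estimate and adapt the ingestion rate itself
  .set("spark.streaming.backpressure.enabled", "true")
```

With backpressure enabled, the static caps act as upper bounds rather than the operating rate.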
Returning to the code above: it then creates two important components, the ReceiverTracker and the JobGenerator, and starts them. Let us analyze ReceiverTracker's start() method first.
ReceiverTracker's start() method
Since ReceiverTracker's start() method in effect just calls launchReceivers(), let us look at that method directly:
private def launchReceivers(): Unit = {
  // Collect all the Receivers: for every receiver input DStream created in
  // the program, call its getReceiver() method to obtain its Receiver
  val receivers = receiverInputStreams.map(nis => {
    val rcvr = nis.getReceiver()
    // Tag the Receiver with its stream's ID
    rcvr.setReceiverId(nis.id)
    rcvr
  })
  runDummySparkJob()
  logInfo("Starting " + receivers.length + " receivers")
  // Send a StartAllReceivers message to the ReceiverTrackerEndpoint;
  // this is simply local, in-process messaging
  endpoint.send(StartAllReceivers(receivers))
}
As the code shows, the Receivers are started by sending a local StartAllReceivers message. Let us look at how that message is handled:
override def receive: PartialFunction[Any, Unit] = {
  // Local messages
  // Start all the Receivers
  case StartAllReceivers(receivers) =>
    // Compute where each Receiver should start, i.e., decide which
    // executor each Receiver will run on
    val scheduledLocations = schedulingPolicy.scheduleReceivers(receivers, getExecutors)
    for (receiver <- receivers) {
      // Get the executors this Receiver is scheduled to start on
      val executors = scheduledLocations(receiver.streamId)
      // Record them in the ReceiverInfo
      updateReceiverScheduledExecutors(receiver.streamId, executors)
      // Remember each Receiver's preferred location
      receiverPreferredLocations(receiver.streamId) = receiver.preferredLocation
      // Start the Receiver, passing in the executor locations it should run on
      startReceiver(receiver, executors)
    }
  // code omitted
}
The actual launch here is done by the startReceiver() function. Note its arguments: the Receiver to be started, and that Receiver's scheduled locations (i.e., on which Worker's executor it should start).
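The placement decision made by scheduleReceivers above can be sketched as follows (a hypothetical simplification: Spark's real ReceiverSchedulingPolicy also balances load across executors and handles relaunches; only the core idea is shown):

```scala
object ReceiverPlacement {
  // Spread receivers over the registered executors, honoring a preferred
  // location where one is declared, otherwise assigning round-robin.
  def scheduleReceivers(
      receiverIds: Seq[Int],
      preferred: Map[Int, String],   // receiverId -> preferred executor, if any
      executors: Seq[String]): Map[Int, String] = {
    require(executors.nonEmpty, "no executors registered yet")
    receiverIds.zipWithIndex.map { case (id, idx) =>
      // Use the declared preference if present, else pick round-robin
      id -> preferred.getOrElse(id, executors(idx % executors.size))
    }.toMap
  }
}
```

The point of the policy is that receivers occupy executor cores for the lifetime of the application, so distributing them evenly matters.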
Let us now focus on the startReceiver() method.
ReceiverTracker's startReceiver() method
private def startReceiver(
    receiver: Receiver[_],
    scheduledLocations: Seq[TaskLocation]): Unit = {
  // Check whether the Receiver has already been started or shut down
  val receiverId = receiver.streamId
  if (!shouldStartReceiver) {
    onReceiverJobFinish(receiverId)
    return
  }

  // Whether checkpointing has been configured
  val checkpointDirOption = Option(ssc.checkpointDir)
  val serializableHadoopConf =
    new SerializableConfiguration(ssc.sparkContext.hadoopConfiguration)

  /**
   * The core Receiver startup logic is defined here.
   * Note: this and everything that follows is only a definition; it is not
   * executed on the driver. It defines a function whose execution, and
   * everything downstream of it, happens on an executor. To stress the
   * point: Receivers are never started on the driver, only on executors.
   */
  // The function run on the executor to start the Receiver
  val startReceiverFunc: Iterator[Receiver[_]] => Unit =
    (iterator: Iterator[Receiver[_]]) => {
      if (!iterator.hasNext) {
        throw new SparkException(
          "Could not start receiver as object not found.")
      }
      if (TaskContext.get().attemptNumber() == 0) {
        // Take the Receiver out of the iterator
        val receiver = iterator.next()
        assert(iterator.hasNext == false)
        // Wrap the Receiver in a ReceiverSupervisorImpl and start it
        val supervisor = new ReceiverSupervisorImpl(
          receiver, SparkEnv.get, serializableHadoopConf.value, checkpointDirOption)
        // This calls start() on its parent class, ReceiverSupervisor
        supervisor.start()
        supervisor.awaitTermination()
      } else {
        // It's restarted by TaskScheduler, but we want to reschedule it again. So exit it.
      }
    }

  // An optimization: the RDD wrapping the Receiver is given preferred
  // locations, so the Receiver is launched on the node it was scheduled to
  val receiverRDD: RDD[Receiver[_]] =
    if (scheduledLocations.isEmpty) {
      ssc.sc.makeRDD(Seq(receiver), 1)
    } else {
      val preferredLocations = scheduledLocations.map(_.toString).distinct
      ssc.sc.makeRDD(Seq(receiver -> preferredLocations))
    }
  receiverRDD.setName(s"Receiver $receiverId")
  ssc.sparkContext.setJobDescription(s"Streaming job running receiver $receiverId")
  ssc.sparkContext.setCallSite(Option(ssc.getStartSite()).getOrElse(Utils.getCallSite()))

  // submitJob is what actually ships the Receiver startup function out to
  // the executors on the Worker nodes for execution
  val future = ssc.sparkContext.submitJob[Receiver[_], Unit, Unit](
    receiverRDD, startReceiverFunc, Seq(0), (_, _) => Unit, ())
  // We will keep restarting the receiver job until ReceiverTracker is stopped
  // Inspect the completion status of the receiver job
  future.onComplete {
    case Success(_) =>
      if (!shouldStartReceiver) {
        onReceiverJobFinish(receiverId)
      } else {
        logInfo(s"Restarting Receiver $receiverId")
        self.send(RestartReceiver(receiver))
      }
    case Failure(e) =>
      if (!shouldStartReceiver) {
        onReceiverJobFinish(receiverId)
      } else {
        logError("Receiver has been stopped. Try to restart it.", e)
        logInfo(s"Restarting Receiver $receiverId")
        self.send(RestartReceiver(receiver))
      }
  }(submitJobThreadPool)
  logInfo(s"Receiver ${receiver.streamId} started")
}
The key part of the code above is startReceiverFunc, the startup function wrapped around each Receiver. Inside it, the Receiver is wrapped in a ReceiverSupervisorImpl, whose start() method then launches the Receiver. The Receiver is also wrapped into a one-element receiverRDD; and, most importantly, the pair is submitted through SparkContext's submitJob, which ships them out to an executor on a Worker node for execution.
Note that the Receiver is actually started on a Worker node's executor, not on the driver: the driver only wraps the Receiver and ships it to the executors for startup.
Let us now see how a Receiver is started on each Worker's executor; the first call is ReceiverSupervisorImpl's start() method.
ReceiverSupervisorImpl's start() method
ReceiverSupervisorImpl does not define start() itself; it is implemented in its parent class, so let us look at ReceiverSupervisor's start() method:
def start() {
  // Calls ReceiverSupervisorImpl's onStart() method
  onStart()
  startReceiver()
}
start() contains just two calls: onStart(), shown below, which starts the BlockGenerators (analyzed later); and startReceiver(), which starts the Receiver.
override protected def onStart() {
  // The BlockGenerators started here are very important: on the worker's
  // executor they handle storing the data after it is received, in
  // coordination with the ReceiverTracker. So on the executor, a
  // Receiver's associated BlockGenerators are started before the Receiver
  // itself. Here, all registered BlockGenerators are started
  registeredBlockGenerators.foreach { _.start() }
}
The Receiver itself is started in startReceiver().
startReceiver(): starting the Receiver
// This is where the Receiver gets started
def startReceiver(): Unit = synchronized {
  try {
    // First register with the ReceiverTracker by sending it a
    // receiver-start message
    if (onReceiverStart()) {
      logInfo("Starting receiver")
      receiverState = Started
      // Start the Receiver here
      receiver.onStart()
      logInfo("Called receiver onStart")
    } else {
      // The driver refused us
      stop("Registered unsuccessfully because Driver refused to start receiver " + streamId, None)
    }
  } catch {
    case NonFatal(t) =>
      stop("Error starting receiver " + streamId, Some(t))
  }
}
It first sends a receiver-start message to the ReceiverTracker to register. Once that succeeds, the Receiver is started by calling its onStart() method. Let us take the socket receiver as an example; other Receivers start in much the same way.
// Start the Receiver
def onStart() {
  // Start the thread that receives data over a connection
  new Thread("Socket Receiver") {
    setDaemon(true)
    override def run() { receive() }
  }.start()
}

// Establish a socket connection and receive data over it
def receive() {
  var socket: Socket = null
  try {
    logInfo("Connecting to " + host + ":" + port)
    socket = new Socket(host, port)
    logInfo("Connected to " + host + ":" + port)
    val iterator = bytesToObjects(socket.getInputStream())
    while (!isStopped && iterator.hasNext) {
      store(iterator.next)
    }
    if (!isStopped()) {
      restart("Socket data stream had no more data")
    } else {
      logInfo("Stopped receiving")
    }
  } catch {
    case e: java.net.ConnectException =>
      restart("Error connecting to " + host + ":" + port, e)
    case NonFatal(e) =>
      logWarning("Error receiving data", e)
      restart("Error receiving data", e)
  } finally {
    if (socket != null) {
      socket.close()
      logInfo("Closed socket to " + host + ":" + port)
    }
  }
}
As can be clearly seen above, the Socket Receiver started on a Worker node's executor mainly opens a socket connection to the data source, receives the data, and saves it through its BlockManager; from there the subsequent operator pipeline takes over.
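A user-defined receiver follows the same pattern as the socket receiver walked through above. The sketch below mirrors that structure (it follows the pattern of Spark's custom-receiver guide; the class name LineReceiver is hypothetical, and this is a simplified sketch rather than SocketReceiver itself). onStart() spawns a daemon thread, receive() hands data to store(), and restart() triggers the supervisor's restart logic on failure or end of stream:

```scala
import java.io.{BufferedReader, InputStreamReader}
import java.net.Socket
import java.nio.charset.StandardCharsets

import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.receiver.Receiver

class LineReceiver(host: String, port: Int)
    extends Receiver[String](StorageLevel.MEMORY_AND_DISK_2) {

  override def onStart(): Unit = {
    // onStart must return quickly, so do the blocking work on a new thread
    new Thread("Line Receiver") {
      setDaemon(true)
      override def run(): Unit = receive()
    }.start()
  }

  override def onStop(): Unit = { /* the reader thread checks isStopped() */ }

  private def receive(): Unit = {
    var socket: Socket = null
    try {
      socket = new Socket(host, port)
      val reader = new BufferedReader(
        new InputStreamReader(socket.getInputStream, StandardCharsets.UTF_8))
      var line = reader.readLine()
      while (!isStopped && line != null) {
        store(line)                     // hand the record to the BlockGenerator
        line = reader.readLine()
      }
      restart("Socket stream ended, trying to reconnect")
    } catch {
      case e: java.io.IOException => restart("Error receiving data", e)
    } finally {
      if (socket != null) socket.close()
    }
  }
}
```

Such a receiver is attached with ssc.receiverStream(new LineReceiver(host, port)), after which it goes through exactly the startReceiverFunc / ReceiverSupervisorImpl path analyzed above.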
To summarize: we analyzed the initialization of the StreamingContext and its start() method, whose main work is the creation of four important components: JobScheduler, DStreamGraph, ReceiverTracker and JobGenerator. We focused on how Receivers are started: each Receiver is wrapped into the startup function startReceiverFunc, then distributed via SparkContext's submitJob to a Worker node's executor to be started there. Startup goes through ReceiverSupervisor's startReceiver() method, which first sends a start message to the ReceiverTracker and then calls the receiver's onStart() method to begin receiving data. We used the Socket Receiver as a brief example.