1, Once again we start from this example:
import org.apache.spark.SparkConf
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.{Seconds, StreamingContext}

object NetworkWordCount {
  def main(args: Array[String]): Unit = {
    if (args.length < 2) {
      System.err.println("Usage: NetworkWordCount <hostname> <port>")
      System.exit(1)
    }

    // StreamingExamples is the helper object from the Spark examples package
    StreamingExamples.setStreamingLogLevels()

    // Create the context with a 40 second batch size
    val sparkConf = new SparkConf().setAppName("NetworkWordCount").setMaster("local[5]")
    val ssc = new StreamingContext(sparkConf, Seconds(40))

    // Create a socket stream on target ip:port and count the
    // words in input stream of \n delimited text (eg. generated by 'nc')
    // Note that no duplication in storage level only for running locally.
    // Replication necessary in distributed scenario for fault tolerance.
    // (Host and port are hard-coded here instead of being read from args.)
    val lines = ssc.socketTextStream("192.168.4.41", 9999, StorageLevel.MEMORY_AND_DISK_SER)
    val words = lines.flatMap(_.split(" "))
    val wordCounts = words.map(x => (x, 1)).reduceByKey(_ + _)
    wordCounts.print()
    ssc.start()
    ssc.awaitTermination()
  }
}
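To make clear what each batch of this job computes, here is a minimal Spark-free sketch of the same transformation chain on an ordinary collection. The object BatchWordCount and its count method are hypothetical names used only for illustration, and groupBy stands in for the grouping that reduceByKey performs on a cluster.

```scala
// Simplified, Spark-free model of what one batch of NetworkWordCount computes.
// BatchWordCount is a hypothetical helper, not part of Spark.
object BatchWordCount {
  /** Counts the words found in the lines of a single batch. */
  def count(lines: Seq[String]): Map[String, Int] =
    lines
      .flatMap(_.split(" "))   // same split as lines.flatMap(_.split(" "))
      .map(word => (word, 1))  // same pairing as words.map(x => (x, 1))
      .groupBy(_._1)           // local stand-in for the grouping done by reduceByKey
      .map { case (word, pairs) => word -> pairs.map(_._2).sum }
}
```

In the real job the lines of a batch are whatever the receiver stored during the 40 second batch interval, which is the part the rest of this article traces.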
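2, The article "Spark Streaming example HdfsWordCount: the full path from DStream to RDD" explained in detail how the DStreamGraph is traversed backwards to generate RDDs. A short recap:
a, DStream.print() ==> the generateJob(time: Time) method of the corresponding ForEachDStream is called by DStreamGraph.generateJobs(time).
b, ForEachDStream's generateJob(time: Time) { parent.getOrCompute(time) …. } follows the parent DStreams all the way back to the compute method of FileInputDStream, which generates the RDD.
===> Here NetworkWordCount is backed by a ReceiverInputDStream instead. Backtracking through the DStreamGraph to generate the RDD works the same way; the difference is that ReceiverInputDStream builds its RDD from data that SocketReceiver has already stored in Spark's BlockManager.
===> (The article "startReceiver() in ReceiverSupervisorImpl: how a Receiver stores data" analyzed how the Receiver stores that data.)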
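The backtracking recapped above can be pictured with a toy model. This is only a sketch under simplifying assumptions: ToyStream, ToyInputStream and ToyMappedStream are hypothetical classes, batch times are plain Long values, and a Seq stands in for an RDD. What it tries to show is that a downstream stream asks its parent through getOrCompute, and that the input stream at the end of the chain only serves data that was stored beforehand.

```scala
import scala.collection.mutable

// Toy model of DStream backtracking; every class name here is hypothetical.
abstract class ToyStream[T] {
  // Batches that were already generated, keyed by batch time
  private val generated = mutable.HashMap.empty[Long, Seq[T]]

  /** Produces the data of one batch, or None if there is nothing for it. */
  protected def compute(time: Long): Option[Seq[T]]

  /** Returns the cached batch if present, otherwise computes and caches it. */
  final def getOrCompute(time: Long): Option[Seq[T]] =
    generated.get(time).orElse {
      val result = compute(time)
      result.foreach(data => generated.put(time, data))
      result
    }
}

/** Input stream: serves whatever was stored for the batch beforehand. */
class ToyInputStream[T](stored: Map[Long, Seq[T]]) extends ToyStream[T] {
  protected def compute(time: Long): Option[Seq[T]] = stored.get(time)
}

/** Transformed stream: asks its parent first, then applies f to each element. */
class ToyMappedStream[T, U](parent: ToyStream[T], f: T => U) extends ToyStream[U] {
  protected def compute(time: Long): Option[Seq[U]] =
    parent.getOrCompute(time).map(_.map(f))
}
```

Calling getOrCompute on the last stream of a chain therefore walks back to the input stream, which mirrors how the output operation ends up in the compute method of the input DStream.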
3, Let's look directly at the compute method of ReceiverInputDStream, the parent class of SocketInputDStream.
abstract class ReceiverInputDStream[T: ClassTag](ssc_ : StreamingContext)
  extends InputDStream[T](ssc_) {
  // ...

  /**
   * Generates RDDs with blocks received by the receiver of this stream.
   */
  override def compute(validTime: Time): Option[RDD[T]] = {
    val blockRDD = {
      if (validTime < graph.startTime) {
        // If this is called for any time before the start time of the context,
        // then this returns an empty RDD. This may happen when recovering from a
        // driver failure without any write ahead log to recover pre-failure data.
        // (In other words: the driver was restarted after a failure and no WAL was kept.)
        new BlockRDD[T](ssc.sc, Array.empty)
      } else {
        // Otherwise, ask the tracker for all the blocks that have been allocated to this stream
        // for this batch
        val receiverTracker = ssc.scheduler.receiverTracker
        // receiverTracker.getBlocksOfBatch(validTime) returns, for the current batch,
        // a Map[receiverId, Seq[ReceivedBlockInfo]] covering every receiver.
        // The id of this input stream corresponds to its receiverId.
        val blockInfos = receiverTracker.getBlocksOfBatch(validTime).getOrElse(id, Seq.empty)

        // Register the input blocks information into InputInfoTracker
        // StreamInputInfo holds the id of this ReceiverInputDStream and the total number
        // of records stored in BlockManager for this batch
        val inputInfo = StreamInputInfo(id, blockInfos.flatMap(_.numRecords).sum)
        ssc.scheduler.inputInfoTracker.reportInfo(validTime, inputInfo)

        // Create the BlockRDD
        createBlockRDD(validTime, blockInfos)
      }
    }
    Some(blockRDD)
  }
  // ...
}
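4, This piece of code is well worth studying. Its goal is to obtain the Seq[ReceivedBlockInfo] recorded for each receiverId in the current batch and then pick out the entry that belongs to this stream.
==> The ReceivedBlockInfo class holds metadata such as the streamId, the number of records passed to store(), and the BlockId.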
val blockInfos = receiverTracker.getBlocksOfBatch(validTime).getOrElse(id, Seq.empty)
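The two collection idioms used here, getOrElse(id, Seq.empty) on the per-batch map and flatMap(_.numRecords).sum for the record count, can be tried out without Spark. The sketch below assumes a hypothetical ToyBlockInfo with an optional record count, mirroring only the fields that matter for this step.

```scala
// Spark-free sketch; ToyBlockInfo is a hypothetical stand-in for ReceivedBlockInfo.
final case class ToyBlockInfo(streamId: Int, numRecords: Option[Long])

object BatchLookup {
  /** Blocks of one stream in a batch; empty when the stream received nothing. */
  def blocksOf(batch: Map[Int, Seq[ToyBlockInfo]], streamId: Int): Seq[ToyBlockInfo] =
    batch.getOrElse(streamId, Seq.empty)

  /** Total record count; blocks whose count is unknown (None) contribute nothing. */
  def totalRecords(blocks: Seq[ToyBlockInfo]): Long =
    blocks.flatMap(_.numRecords).sum
}
```

The flatMap over an Option silently drops blocks without a known record count, so the reported total is a lower bound whenever some counts are missing.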
a, In ReceiverTracker, the getBlocksOfBatch method returns the block information of all input streams for a batch.
private[streaming]
class ReceiverTracker(ssc: StreamingContext, skipReceiverLaunch: Boolean = false) extends Logging {
  // ...
  private val receivedBlockTracker = new ReceivedBlockTracker(
    ssc.sparkContext.conf,
    ssc.sparkContext.hadoopConfiguration,
    receiverInputStreamIds,
    ssc.scheduler.clock,
    ssc.isCheckpointPresent,
    Option(ssc.checkpointDir)
  )
  // ...

  /** Get the blocks for the given batch and all input streams. */
  def getBlocksOfBatch(batchTime: Time): Map[Int, Seq[ReceivedBlockInfo]] = {
    receivedBlockTracker.getBlocksOfBatch(batchTime)
  }
b, The code below shows that the block information of every batch is kept in the AllocatedBlocks values of the timeToAllocatedBlocks map.
private[streaming] class ReceivedBlockTracker(
    conf: SparkConf,
    hadoopConf: Configuration,
    streamIds: Seq[Int],
    clock: Clock,
    recoverFromWriteAheadLog: Boolean,
    checkpointDirOption: Option[String])
  extends Logging {

  private type ReceivedBlockQueue = mutable.Queue[ReceivedBlockInfo]

  private val streamIdToUnallocatedBlockQueues = new mutable.HashMap[Int, ReceivedBlockQueue]
  private val timeToAllocatedBlocks = new mutable.HashMap[Time, AllocatedBlocks]
  // ...

  /**
   * Get the blocks allocated to the given batch.
   * The returned Map holds every receiverId of the batch and its Seq[ReceivedBlockInfo].
   */
  def getBlocksOfBatch(batchTime: Time): Map[Int, Seq[ReceivedBlockInfo]] = synchronized {
    timeToAllocatedBlocks.get(batchTime).map { _.streamIdToAllocatedBlocks }.getOrElse(Map.empty)
  }
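b, Since the data is read from timeToAllocatedBlocks, who puts the current batch's data in there?
==> JobGenerator only starts working after the receivers have been launched and are able to store data into Spark's BlockManager. The start method of JobScheduler shows this order: receiverTracker.start() is called before jobGenerator.start().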
def start(): Unit = synchronized {
  // ...
  listenerBus.start(ssc.sparkContext)
  // receiverTracker handles the sources backed by ReceiverInputDStream, such as
  // SocketInputDStream, FlumePollingInputDStream and FlumeInputDStream
  // (see the subclasses of ReceiverInputDStream)
  receiverTracker = new ReceiverTracker(ssc)
  inputInfoTracker = new InputInfoTracker(ssc)
  receiverTracker.start()
  jobGenerator.start()
  logInfo("Started JobScheduler")
}
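c, Tracing further into JobGenerator.generateJobs, the key call is
receiverTracker.allocateBlocksToBatch(time). According to its comment, this method allocates the received blocks to the current batch.
===> allocateBlocksToBatch is called first; compute of ReceiverInputDStream is called afterwards.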
/** Generate jobs and perform checkpoint for the given `time`. */
private def generateJobs(time: Time) {
  // Set the SparkEnv in this thread, so that job generation code can access the environment
  // Example: BlockRDDs are created in this thread, and it needs to access BlockManager
  // Update: This is probably redundant after threadlocal stuff in SparkEnv has been removed.
  SparkEnv.set(ssc.env)
  Try {
    jobScheduler.receiverTracker.allocateBlocksToBatch(time) // allocate received blocks to batch
    // Call generateJobs on the graph. Scala's Try.apply wraps the result as Success(jobs)
    // or Failure(e), where jobs is the collection of Job objects returned by that call.
    // If job creation succeeds, JobScheduler.submitJobSet submits the jobs to the cluster.
    graph.generateJobs(time) // generate jobs using allocated block
  } match {
    case Success(jobs) =>
      // streamIdToInputInfos is the metadata of the data received through store()
      val streamIdToInputInfos = jobScheduler.inputInfoTracker.getInfo(time)
      // JobSet represents the jobs of one batch duration. It is a plain object holding
      // the jobs not yet submitted, the submission time, processing start and end times, etc.
      jobScheduler.submitJobSet(JobSet(time, jobs, streamIdToInputInfos))
    case Failure(e) =>
      jobScheduler.reportError("Error generating jobs for time " + time, e)
  }
  // Post a DoCheckpoint event; one is posted per streaming batch interval
  eventLoop.post(DoCheckpoint(time, clearCheckpointDataLater = false))
}
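The control flow of generateJobs, allocation first, generation second, both wrapped in one Try, can be reduced to a few lines. The sketch below is an assumption-laden simplification: GenerateJobsFlow and its run method are hypothetical, jobs are represented by plain strings, and the result is returned as an Either instead of being submitted or reported.

```scala
import scala.util.{Failure, Success, Try}

// Spark-free sketch of the control flow; every name below is hypothetical.
object GenerateJobsFlow {
  /** Runs allocate, then generate; returns the jobs or an error message. */
  def run(allocate: () => Unit, generate: () => Seq[String]): Either[String, Seq[String]] =
    Try {
      allocate()  // must happen before job generation reads the batch
      generate()  // only reached if allocation did not throw
    } match {
      case Success(jobs) => Right(jobs)
      case Failure(e)    => Left("Error generating jobs: " + e.getMessage)
    }
}
```

Because both steps share one Try, a failure during allocation is reported the same way as a failure during job generation, and generation is skipped entirely in that case.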
d, How is ReceiverTracker.allocateBlocksToBatch(time) implemented?
==> It allocates all unallocated blocks to the given batch.
/** Allocate all unallocated blocks to the given batch. */
def allocateBlocksToBatch(batchTime: Time): Unit = {
  if (receiverInputStreams.nonEmpty) {
    receivedBlockTracker.allocateBlocksToBatch(batchTime)
  }
}
f, Once again we end up in ReceivedBlockTracker:
The allocateBlocksToBatch method fills timeToAllocatedBlocks, a HashMap[Time, AllocatedBlocks]. The key is the batch time and the value is an AllocatedBlocks. AllocatedBlocks(streamIdToAllocatedBlocks: Map[Int, Seq[ReceivedBlockInfo]]) maps every receiverId of the batch to its Seq[ReceivedBlockInfo].
private[streaming] class ReceivedBlockTracker(
    conf: SparkConf,
    hadoopConf: Configuration,
    streamIds: Seq[Int],
    clock: Clock,
    recoverFromWriteAheadLog: Boolean,
    checkpointDirOption: Option[String])
  extends Logging {

  private type ReceivedBlockQueue = mutable.Queue[ReceivedBlockInfo]

  private val streamIdToUnallocatedBlockQueues = new mutable.HashMap[Int, ReceivedBlockQueue]
  private val timeToAllocatedBlocks = new mutable.HashMap[Time, AllocatedBlocks]
  private var lastAllocatedBatchTime: Time = null
  // ...

  /**
   * Allocate all unallocated blocks to the given batch.
   * This event will get written to the write ahead log (if enabled).
   */
  def allocateBlocksToBatch(batchTime: Time): Unit = synchronized {
    if (lastAllocatedBatchTime == null || batchTime > lastAllocatedBatchTime) {
      // Collect every receiver id (streamId is the receiver's id) together with its
      // ReceivedBlockInfo into a Map[streamId, Seq[ReceivedBlockInfo]]
      val streamIdToBlocks = streamIds.map { streamId =>
        // 1. dequeueAll passes each queued ReceivedBlockInfo (the block metadata) to the
        //    predicate; elements for which it returns true are returned and removed from
        //    the queue.
        //    Because the elements are removed, it does not matter if blocks meant for the
        //    next batch are already in the queue: they are simply handled as part of the
        //    current batch and then removed.
        // 2. The queue contains data because the receiver first stores it into BlockManager
        //    through store(); the executor then notifies the driver's ReceiverTrackerEndpoint
        //    with AddBlock(ReceivedBlockInfo), which puts each ReceivedBlockInfo into the
        //    value of a HashMap[Int, ReceivedBlockQueue] keyed by receiverId.
        (streamId, getReceivedBlockQueue(streamId).dequeueAll(x => true))
      }.toMap
      // Wrap streamIdToBlocks: Map[streamId, Seq[ReceivedBlockInfo]] in AllocatedBlocks,
      // the container for the blocks of one batch
      val allocatedBlocks = AllocatedBlocks(streamIdToBlocks)
      // BatchAllocationEvent is the ReceivedBlockTracker event stating that the allocation
      // for the batch is complete (the data is already in BlockManager); it is meant for the WAL.
      // writeToLog also returns true when nothing is written because the WAL is not enabled.
      if (writeToLog(BatchAllocationEvent(batchTime, allocatedBlocks))) {
        // timeToAllocatedBlocks is a HashMap[Time, AllocatedBlocks]
        timeToAllocatedBlocks.put(batchTime, allocatedBlocks)
        // lastAllocatedBatchTime is of type Time
        lastAllocatedBatchTime = batchTime
      } else {
        logInfo(s"Possibly processed batch $batchTime need to be processed again in WAL recovery")
      }
    } else {
      // This situation occurs when:
      // 1. WAL is ended with BatchAllocationEvent, but without BatchCleanupEvent,
      // possibly processed batch job or half-processed batch job need to be processed again,
      // so the batchTime will be equal to lastAllocatedBatchTime.
      // 2. Slow checkpointing makes recovered batch time older than WAL recovered
      // lastAllocatedBatchTime.
      // This situation will only occurs in recovery time.
      logInfo(s"Possibly processed batch $batchTime need to be processed again in WAL recovery")
    }
  }
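The allocation logic above is easier to follow in a stripped-down model. The sketch below is not Spark's implementation: ToyBlock and ToyBlockTracker are hypothetical, batch times are plain Long values, and the write-ahead-log step is left out entirely. It keeps only the three ideas discussed here: one queue of unallocated blocks per stream, draining the queues with dequeueAll(_ => true), and ignoring a batch time that is not newer than the last allocated one.

```scala
import scala.collection.mutable

// Spark-free model of the allocation step; the "Toy" names are hypothetical.
final case class ToyBlock(streamId: Int, blockId: Long)

class ToyBlockTracker(streamIds: Seq[Int]) {
  private val unallocated = mutable.HashMap.empty[Int, mutable.Queue[ToyBlock]]
  private val allocated = mutable.HashMap.empty[Long, Map[Int, Seq[ToyBlock]]]
  private var lastAllocatedBatchTime: Option[Long] = None

  private def queueOf(streamId: Int): mutable.Queue[ToyBlock] =
    unallocated.getOrElseUpdate(streamId, mutable.Queue.empty[ToyBlock])

  /** Called when a receiver reports a newly stored block. */
  def addBlock(block: ToyBlock): Unit = synchronized {
    queueOf(block.streamId) += block
    ()
  }

  /** Moves every queued block into the given batch, unless the batch is not newer. */
  def allocateBlocksToBatch(batchTime: Long): Unit = synchronized {
    if (lastAllocatedBatchTime.forall(batchTime > _)) {
      val streamIdToBlocks = streamIds.map { id =>
        // dequeueAll(_ => true) drains the queue and returns what it removed
        (id, queueOf(id).dequeueAll(_ => true).toList)
      }.toMap
      allocated.put(batchTime, streamIdToBlocks)
      lastAllocatedBatchTime = Some(batchTime)
    }
  }

  /** Blocks allocated to a batch, or an empty map for an unknown batch. */
  def getBlocksOfBatch(batchTime: Long): Map[Int, Seq[ToyBlock]] = synchronized {
    allocated.getOrElse(batchTime, Map.empty[Int, Seq[ToyBlock]])
  }
}
```

A block that arrives after a batch has been allocated simply waits in its queue and is picked up by the next, newer batch time, which matches the behaviour described in the comments above.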
5, So, back in ReceiverInputDStream:
receiverTracker.getBlocksOfBatch(time).getOrElse(id, Seq.empty) returns the Seq[ReceivedBlockInfo] for this receiverId, that is, the metadata of the data that SocketReceiver stored into Spark.
abstract class ReceiverInputDStream[T: ClassTag](ssc_ : StreamingContext)
  extends InputDStream[T](ssc_) {
  // ...
  override def compute(validTime: Time): Option[RDD[T]] = {
    val blockRDD = {
      if (validTime < graph.startTime) {
        // ...
      } else {
        // Otherwise ask the ReceiverTracker for all ReceivedBlockInfo of the current batch
        val receiverTracker = ssc.scheduler.receiverTracker
        // receiverTracker.getBlocksOfBatch(validTime) returns, for the current batch,
        // a Map[receiverId, Seq[ReceivedBlockInfo]] covering every receiver.
        // The id of this input stream corresponds to its receiverId.
        val blockInfos = receiverTracker.getBlocksOfBatch(validTime).getOrElse(id, Seq.empty)

        // Register the input blocks information into InputInfoTracker
        // StreamInputInfo holds the id of this ReceiverInputDStream and the total number
        // of records stored in BlockManager for this batch
        val inputInfo = StreamInputInfo(id, blockInfos.flatMap(_.numRecords).sum)
        ssc.scheduler.inputInfoTracker.reportInfo(validTime, inputInfo)

        // Create the BlockRDD
        createBlockRDD(validTime, blockInfos)
      }
    }
    Some(blockRDD)
  }
6, Next, createBlockRDD is called. It builds a BlockRDD from the blocks of the current batch that are stored in Spark's BlockManager.
// Takes the current batch time and the Seq[ReceivedBlockInfo] describing the blocks stored in Spark
private[streaming] def createBlockRDD(time: Time, blockInfos: Seq[ReceivedBlockInfo]): RDD[T] = {

  if (blockInfos.nonEmpty) {
    // In this example this yields an Array[StreamBlockId(streamId: Int, uniqueId: Long)]
    val blockIds = blockInfos.map { _.blockId.asInstanceOf[BlockId] }.toArray

    // Are WAL record handles present with all the blocks
    val areWALRecordHandlesPresent = blockInfos.forall { _.walRecordHandleOption.nonEmpty }

    if (areWALRecordHandlesPresent) {
      // If all the blocks have WAL record handle, then create a WALBackedBlockRDD
      val isBlockIdValid = blockInfos.map { _.isBlockIdValid() }.toArray
      val walRecordHandles = blockInfos.map { _.walRecordHandleOption.get }.toArray
      new WriteAheadLogBackedBlockRDD[T](
        ssc.sparkContext, blockIds, walRecordHandles, isBlockIdValid)
    } else {
      // Else, create a BlockRDD. However, if there are some blocks with WAL info but not
      // others then that is unexpected and log a warning accordingly.
      if (blockInfos.find(_.walRecordHandleOption.nonEmpty).nonEmpty) {
        if (WriteAheadLogUtils.enableReceiverLog(ssc.conf)) {
          logError("Some blocks do not have Write Ahead Log information; " +
            "this is unexpected and data may not be recoverable after driver failures")
        } else {
          logWarning("Some blocks have Write Ahead Log information; this is unexpected")
        }
      }
      // Ask the BlockManagerMaster whether each StreamBlockId still exists in the cluster
      val validBlockIds = blockIds.filter { id =>
        ssc.sparkContext.env.blockManager.master.contains(id)
      }
      // Log a warning if the recorded Array[StreamBlockId] does not match what the cluster holds
      if (validBlockIds.size != blockIds.size) {
        logWarning("Some blocks could not be recovered as they were not found in memory. " +
          "To prevent such data loss, enabled Write Ahead Log (see programming guide " +
          "for more details.")
      }
      // Create the BlockRDD from the StreamBlockIds that the cluster actually holds
      new BlockRDD[T](ssc.sc, validBlockIds)
    }
  } else {
    // If no block is ready now, creating WriteAheadLogBackedBlockRDD or BlockRDD
    // according to the configuration
    if (WriteAheadLogUtils.enableReceiverLog(ssc.conf)) {
      new WriteAheadLogBackedBlockRDD[T](
        ssc.sparkContext, Array.empty, Array.empty, Array.empty)
    } else {
      new BlockRDD[T](ssc.sc, Array.empty)
    }
  }
}
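The branching inside createBlockRDD can be summarized as a small decision function. The sketch below only models the decision, under stated assumptions: ToyInfo, ToyRdd, WalBackedRdd, PlainBlockRdd and BlockRddChoice are hypothetical names, block ids and WAL handles are plain strings, walEnabled stands in for the receiver WAL configuration check, and isStored stands in for the question put to the BlockManagerMaster. The logging done for the mixed case is omitted.

```scala
// Spark-free sketch of the branching; all types here are hypothetical stand-ins.
final case class ToyInfo(blockId: String, walHandle: Option[String])

sealed trait ToyRdd
final case class WalBackedRdd(blockIds: Seq[String], handles: Seq[String]) extends ToyRdd
final case class PlainBlockRdd(blockIds: Seq[String]) extends ToyRdd

object BlockRddChoice {
  /**
   * @param infos      block metadata allocated to the batch
   * @param walEnabled stand-in for the receiver WAL configuration flag
   * @param isStored   stand-in for asking the block manager master about a block id
   */
  def choose(infos: Seq[ToyInfo], walEnabled: Boolean, isStored: String => Boolean): ToyRdd =
    if (infos.isEmpty) {
      // No blocks yet: the type of the empty RDD follows the configuration
      if (walEnabled) WalBackedRdd(Seq.empty, Seq.empty) else PlainBlockRdd(Seq.empty)
    } else if (infos.forall(_.walHandle.nonEmpty)) {
      // Every block has a WAL handle, so the data can be re-read from the log
      WalBackedRdd(infos.map(_.blockId), infos.flatMap(_.walHandle))
    } else {
      // Otherwise keep only the blocks that are still stored somewhere
      PlainBlockRdd(infos.map(_.blockId).filter(isStored))
    }
}
```

The last branch is where data can be lost: a block id that is no longer stored is dropped from the RDD, which is why the original warning recommends enabling the write ahead log.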
This completes the flow in which ReceiverInputDStream periodically fetches the data that SocketReceiver stored in Spark's BlockManager beforehand.