Spark Streaming case study: NetworkWordCount -- how ReceiverInputDStream's compute method retrieves the data that the SocketReceiver stored in the BlockManager ahead of time

1. Let's start from this example again:

import org.apache.spark.SparkConf
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.{Seconds, StreamingContext}

object NetworkWordCount {
  def main(args: Array[String]) {
    if (args.length < 2) {
      System.err.println("Usage: NetworkWordCount <hostname> <port>")
      System.exit(1)
    }

    StreamingExamples.setStreamingLogLevels()
    // Create the context with a 40 second batch interval
    val sparkConf = new SparkConf().setAppName("NetworkWordCount").setMaster("local[5]")
    val ssc = new StreamingContext(sparkConf, Seconds(40))

    // Create a socket stream on target ip:port and count the
    // words in the input stream of \n delimited text (e.g. generated by 'nc').
    // Note that skipping replication in the storage level is fine only when running
    // locally; replication is necessary in a distributed setup for fault tolerance.
    val lines = ssc.socketTextStream("192.168.4.41", 9999, StorageLevel.MEMORY_AND_DISK_SER)
    val words = lines.flatMap(_.split(" "))
    val wordCounts = words.map(x => (x, 1)).reduceByKey(_ + _)
    wordCounts.print()
    ssc.start()
    ssc.awaitTermination()
  }
}
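To drive the example, start a netcat server on the source host (here, nc -lk 9999 on 192.168.4.41) and type lines of text; each 40-second batch is split on spaces and the word counts are printed on the driver.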

2. The article "Spark Streaming example HdfsWordCount -- the full path from DStream to RDD" explained in detail how the DStreamGraph traces back through the DStream lineage to generate RDDs. A quick recap:

a. DStream.print() ==> the generateJob(time: Time) method of the corresponding ForEachDStream is invoked by DStreamGraph.generateJobs(time).

b. Inside ForEachDStream's generateJob(time: Time) { parent.getOrCompute(time) ... }, the call walks up through each parent DStream until it reaches the compute method of FileInputDStream, which generates the RDD (a simplified sketch of getOrCompute follows below).

===> In NetworkWordCount the input stream is a ReceiverInputDStream. Tracing back through the DStreamGraph to generate the RDD works exactly the same way; the difference is that ReceiverInputDStream builds its RDD from data that the SocketReceiver has already stored in Spark's BlockManager.

===> (The article "startReceiver() in ReceiverSupervisorImpl -- how the Receiver store()s data" analyzed how a Receiver puts its data into the BlockManager.)
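For orientation, here is a simplified sketch of DStream.getOrCompute (persistence and checkpointing handling elided): the per-batch RDD is looked up in a cache and, on a miss within a valid time, compute(time) is called and the result memoized.

// Simplified sketch of DStream.getOrCompute (cache bookkeeping only;
// storage-level and checkpointing details of the real method are elided).
private[streaming] final def getOrCompute(time: Time): Option[RDD[T]] = {
  generatedRDDs.get(time).orElse {
    if (isTimeValid(time)) {
      val rddOption = compute(time)          // subclass-specific RDD generation
      rddOption.foreach(rdd => generatedRDDs.put(time, rdd))
      rddOption
    } else {
      None                                   // time outside this DStream's validity window
    }
  }
}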

3. Let's look directly at the compute method of ReceiverInputDStream, the parent class of SocketInputDStream.

abstract class ReceiverInputDStream[T: ClassTag](ssc_ : StreamingContext)
  extends InputDStream[T](ssc_) {
  // ...
  /**
   * Generates RDDs with blocks received by the receiver of this stream. */
  override def compute(validTime: Time): Option[RDD[T]] = {
    val blockRDD = {

      if (validTime < graph.startTime) {
        // If this is called for any time before the start time of the context,
        // then this returns an empty RDD. This may happen when recovering from a
        // driver failure without any write ahead log to recover pre-failure data.
        new BlockRDD[T](ssc.sc, Array.empty)
      } else {
        // Otherwise, ask the tracker for all the blocks that have been allocated
        // to this stream for this batch, i.e. fetch every ReceivedBlockInfo of the
        // current batch from the ReceiverTracker.
        val receiverTracker = ssc.scheduler.receiverTracker
        // receiverTracker.getBlocksOfBatch(validTime) returns, for the current batch,
        // every receiverId together with its Seq[ReceivedBlockInfo], i.e. a
        // Map[receiverId, Seq[ReceivedBlockInfo]].
        // The InputStream's id corresponds one-to-one with the receiverId.
        val blockInfos = receiverTracker.getBlocksOfBatch(validTime).getOrElse(id, Seq.empty)
        // Register the input blocks information into InputInfoTracker: the metadata is
        // wrapped in a StreamInputInfo, where id identifies this ReceiverInputDStream and
        // numRecords is the total number of records stored in the BlockManager.
        val inputInfo = StreamInputInfo(id, blockInfos.flatMap(_.numRecords).sum)
        ssc.scheduler.inputInfoTracker.reportInfo(validTime, inputInfo)
        // Create the BlockRDD
        createBlockRDD(validTime, blockInfos)
      }
    }
    Some(blockRDD)
  }

4. This fragment is worth studying. The goal is to obtain the Seq[ReceivedBlockInfo] for every receiverId in the current batch:

==> A ReceivedBlockInfo holds the streamId, the total number of records store()d, the BlockId, and other metadata.

val blockInfos = receiverTracker.getBlocksOfBatch(validTime).getOrElse(id, Seq.empty)
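For reference, ReceivedBlockInfo is roughly the following case class (abridged from the Spark source; the exact fields vary slightly between versions):

// Abridged sketch of ReceivedBlockInfo (fields may differ slightly across Spark versions).
private[streaming] case class ReceivedBlockInfo(
    streamId: Int,                               // which input stream produced the block
    numRecords: Option[Long],                    // record count, if known
    metadataOption: Option[Any],                 // optional user-supplied metadata
    blockStoreResult: ReceivedBlockStoreResult   // where/how the block was stored
  ) {
  def blockId: StreamBlockId = blockStoreResult.blockId

  def walRecordHandleOption: Option[WriteAheadLogRecordHandle] = blockStoreResult match {
    case walStoreResult: WriteAheadLogBasedStoreResult => Some(walStoreResult.walRecordHandle)
    case _ => None
  }
}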

a. ReceiverTracker's getBlocksOfBatch method returns the received-block metadata for all input streams:

private[streaming]
class ReceiverTracker(ssc: StreamingContext, skipReceiverLaunch: Boolean = false) extends Logging {
  // ...
  private val receivedBlockTracker = new ReceivedBlockTracker(
    ssc.sparkContext.conf,
    ssc.sparkContext.hadoopConfiguration,
    receiverInputStreamIds,
    ssc.scheduler.clock,
    ssc.isCheckpointPresent,
    Option(ssc.checkpointDir)
  )
  // ...
  /** Get the blocks for the given batch and all input streams. */
  def getBlocksOfBatch(batchTime: Time): Map[Int, Seq[ReceivedBlockInfo]] = {
    receivedBlockTracker.getBlocksOfBatch(batchTime)
  }

b. The code below shows that the block metadata for every batch lives in timeToAllocatedBlocks, a map whose values are AllocatedBlocks instances.

private[streaming] class ReceivedBlockTracker(
    conf: SparkConf,
    hadoopConf: Configuration,
    streamIds: Seq[Int],
    clock: Clock,
    recoverFromWriteAheadLog: Boolean,
    checkpointDirOption: Option[String])
  extends Logging {

  private type ReceivedBlockQueue = mutable.Queue[ReceivedBlockInfo]
  private val streamIdToUnallocatedBlockQueues = new mutable.HashMap[Int, ReceivedBlockQueue]
  private val timeToAllocatedBlocks = new mutable.HashMap[Time, AllocatedBlocks]
  // ...
  /**
   * Get the blocks allocated to the given batch: a map from every receiverId
   * in that batch to its Seq[ReceivedBlockInfo].
   */
  def getBlocksOfBatch(batchTime: Time): Map[Int, Seq[ReceivedBlockInfo]] = synchronized {
    timeToAllocatedBlocks.get(batchTime).map { _.streamIdToAllocatedBlocks }.getOrElse(Map.empty)
  }
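AllocatedBlocks itself is just a thin wrapper around that map, essentially as defined in the Spark source:

// Per-batch container: the blocks allocated to a batch, keyed by streamId (= receiverId).
private[streaming] case class AllocatedBlocks(
    streamIdToAllocatedBlocks: Map[Int, Seq[ReceivedBlockInfo]]) {
  def getBlocksOfStream(streamId: Int): Seq[ReceivedBlockInfo] = {
    streamIdToAllocatedBlocks.getOrElse(streamId, Seq.empty)
  }
}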

c. Since the data is read out of timeToAllocatedBlocks, who put it there in the first place?

==> The JobGenerator only starts working after the Receiver has store()d its data into Spark's BlockManager. The execution order in JobScheduler's start method confirms this:

def start(): Unit = synchronized {
  // ...
  listenerBus.start(ssc.sparkContext)
  // Handles ReceiverInputDStream data sources such as SocketInputDStream,
  // FlumePollingInputDStream, FlumeInputDStream, etc. (see the subclasses of
  // ReceiverInputDStream).
  receiverTracker = new ReceiverTracker(ssc)
  inputInfoTracker = new InputInfoTracker(ssc)
  receiverTracker.start()
  jobGenerator.start()
  logInfo("Started JobScheduler")
}

d. Tracing into JobGenerator.generateJobs, the key call is

ReceiverTracker.allocateBlocksToBatch(time). Its doc comment tells us what it does: allocate the received blocks to the current batch.

===> allocateBlocksToBatch is called first; ReceiverInputDStream's compute is called afterwards.

/** Generate jobs and perform checkpoint for the given `time`.  */
private def generateJobs(time: Time) {
  // Set the SparkEnv in this thread, so that job generation code can access the environment
  // Example: BlockRDDs are created in this thread, and it needs to access BlockManager
  // Update: This is probably redundant after threadlocal stuff in SparkEnv has been removed.
  SparkEnv.set(ssc.env)
  Try {
    jobScheduler.receiverTracker.allocateBlocksToBatch(time) // allocate received blocks to batch
    // graph.generateJobs runs inside scala's Try.apply, which yields Success(jobs)
    // or Failure(e); jobs is the collection of Job objects returned by that method.
    // If job creation succeeds, JobScheduler.submitJobSet submits them to the cluster.
    graph.generateJobs(time) // generate jobs using allocated blocks
  } match {
    case Success(jobs) =>
      val streamIdToInputInfos = jobScheduler.inputInfoTracker.getInfo(time)
      // streamIdToInputInfos is the metadata describing the store()d input data.
      // A JobSet represents the batch of jobs for one batch duration: a plain object
      // holding the unsubmitted jobs, the submission time, start/end times, etc.
      jobScheduler.submitJobSet(JobSet(time, jobs, streamIdToInputInfos))
    case Failure(e) =>
      jobScheduler.reportError("Error generating jobs for time " + time, e)
  }
  // Post a DoCheckpoint event; this fires once per batch interval.
  eventLoop.post(DoCheckpoint(time, clearCheckpointDataLater = false))
}
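The Try/match pattern above is plain Scala; here is a minimal standalone illustration (the job names are made up):

import scala.util.{Failure, Success, Try}

// Minimal illustration of the Try pattern used by generateJobs: the block's
// result is wrapped in Success, any thrown exception in Failure.
def demo(): Unit = {
  Try {
    Seq("job-1", "job-2") // stands in for graph.generateJobs(time)
  } match {
    case Success(jobs) => println(s"submitting ${jobs.size} jobs")
    case Failure(e)    => println("error generating jobs: " + e.getMessage)
  }
}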

e. How is ReceiverTracker.allocateBlocksToBatch(time) implemented?

==> It allocates all unallocated blocks to the given batch:

/** Allocate all unallocated blocks to the given batch. */
def allocateBlocksToBatch(batchTime: Time): Unit = {
  if (receiverInputStreams.nonEmpty) {
    receivedBlockTracker.allocateBlocksToBatch(batchTime)
  }
}

f. Back into ReceivedBlockTracker:

The allocateBlocksToBatch method fills timeToAllocatedBlocks, a HashMap[Time, AllocatedBlocks]. Each key is a batch time; each value is an AllocatedBlocks(streamIdToAllocatedBlocks: Map[Int, Seq[ReceivedBlockInfo]]) mapping every receiverId of that batch to its Seq[ReceivedBlockInfo]. The dequeueAll trick used below is illustrated after the listing.

private[streaming] class ReceivedBlockTracker(
    conf: SparkConf,
    hadoopConf: Configuration,
    streamIds: Seq[Int],
    clock: Clock,
    recoverFromWriteAheadLog: Boolean,
    checkpointDirOption: Option[String])
  extends Logging {

  private type ReceivedBlockQueue = mutable.Queue[ReceivedBlockInfo]
  private val streamIdToUnallocatedBlockQueues = new mutable.HashMap[Int, ReceivedBlockQueue]
  private val timeToAllocatedBlocks = new mutable.HashMap[Time, AllocatedBlocks]
  private var lastAllocatedBatchTime: Time = null
  // ...
  /**
   * Allocate all unallocated blocks to the given batch.
   * This event will get written to the write ahead log (if enabled).
   */
  def allocateBlocksToBatch(batchTime: Time): Unit = synchronized {
    if (lastAllocatedBatchTime == null || batchTime > lastAllocatedBatchTime) {
      // Collect every receiver's id (streamId is the receiver's id) and its
      // ReceivedBlockInfo into a Map[streamId, Seq[ReceivedBlockInfo]].
      val streamIdToBlocks = streamIds.map { streamId =>
          // 1. dequeueAll walks every ReceivedBlockInfo in the queue, passes it to the
          //    predicate, returns the elements for which it is true, and removes them
          //    from the queue.
          //    ==> Because the elements are removed, even if block info for the next
          //    batch has already landed in this queue, it is simply swept into the
          //    current batch and then removed.
          // 2. The queue has data because the receiver first store()s the data into the
          //    BlockManager, after which the executor notifies the driver's
          //    ReceiverTrackerEndpoint with AddBlock(ReceivedBlockInfo); the tracker then
          //    appends each ReceivedBlockInfo to a HashMap[Int, ReceivedBlockQueue]
          //    keyed by receiverId.
          (streamId, getReceivedBlockQueue(streamId).dequeueAll(x => true))
      }.toMap
      // Wrap streamIdToBlocks: Map[streamId, Seq[ReceivedBlockInfo]] in the per-batch
      // container class AllocatedBlocks.
      val allocatedBlocks = AllocatedBlocks(streamIdToBlocks)
      // BatchAllocationEvent records that this ReceivedBlockTracker has finished
      // allocating the batch, i.e. the data is already in the BlockManager; it exists
      // for the WAL. writeToLog returns true whether or not a WAL is actually enabled.
      if (writeToLog(BatchAllocationEvent(batchTime, allocatedBlocks))) {
        // timeToAllocatedBlocks is the HashMap[Time, AllocatedBlocks]
        timeToAllocatedBlocks.put(batchTime, allocatedBlocks)
        // lastAllocatedBatchTime is of type Time
        lastAllocatedBatchTime = batchTime
      } else {
        logInfo(s"Possibly processed batch $batchTime need to be processed again in WAL recovery")
      }
    } else {
      // This situation occurs when:
      // 1. WAL is ended with BatchAllocationEvent, but without BatchCleanupEvent,
      // possibly processed batch job or half-processed batch job need to be processed again,
      // so the batchTime will be equal to lastAllocatedBatchTime.
      // 2. Slow checkpointing makes recovered batch time older than WAL recovered
      // lastAllocatedBatchTime.
      // This situation will only occurs in recovery time.
      logInfo(s"Possibly processed batch $batchTime need to be processed again in WAL recovery")
    }
  }
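A standalone illustration of the dequeueAll trick (plain Scala collections, nothing Spark-specific):

import scala.collection.mutable

// dequeueAll(_ => true) returns every element and removes them from the queue,
// which is exactly how allocateBlocksToBatch drains the unallocated-block queue.
val queue = mutable.Queue("block-1", "block-2", "block-3")
val drained = queue.dequeueAll(_ => true)
assert(drained == Seq("block-1", "block-2", "block-3"))
assert(queue.isEmpty)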

5. So, back in ReceiverInputDStream:

receiverTracker.getBlocksOfBatch(validTime).getOrElse(id, Seq.empty) is precisely this receiverId's Seq[ReceivedBlockInfo] -- the metadata for the data the SocketReceiver stored into Spark.

abstract class ReceiverInputDStream[T: ClassTag](ssc_ : StreamingContext)
  extends InputDStream[T](ssc_) {
  // ...
  override def compute(validTime: Time): Option[RDD[T]] = {
    val blockRDD = {
      if (validTime < graph.startTime) {
        // ...
      } else {
        // Ask the ReceiverTracker for all ReceivedBlockInfo of the current batch.
        val receiverTracker = ssc.scheduler.receiverTracker
        // getBlocksOfBatch(validTime) returns Map[receiverId, Seq[ReceivedBlockInfo]];
        // the InputStream's id corresponds to the receiverId.
        val blockInfos = receiverTracker.getBlocksOfBatch(validTime).getOrElse(id, Seq.empty)
        // Register the input blocks information into InputInfoTracker: id identifies this
        // ReceiverInputDStream, numRecords is the total record count in the BlockManager.
        val inputInfo = StreamInputInfo(id, blockInfos.flatMap(_.numRecords).sum)
        ssc.scheduler.inputInfoTracker.reportInfo(validTime, inputInfo)

        // Create the BlockRDD
        createBlockRDD(validTime, blockInfos)
      }
    }
    Some(blockRDD)
  }

6. Next comes the call to createBlockRDD, which builds a BlockRDD out of the blocks of the current batch held in Spark's BlockManager:

// Takes the current batch time and its Seq[ReceivedBlockInfo], i.e. the metadata of the
// blocks that were stored into Spark.
private[streaming] def createBlockRDD(time: Time, blockInfos: Seq[ReceivedBlockInfo]): RDD[T] = {

  if (blockInfos.nonEmpty) {
    // In this example this yields an Array of StreamBlockId(streamId: Int, uniqueId: Long)
    val blockIds = blockInfos.map { _.blockId.asInstanceOf[BlockId] }.toArray

    // Are WAL record handles present with all the blocks
    val areWALRecordHandlesPresent = blockInfos.forall { _.walRecordHandleOption.nonEmpty }
    if (areWALRecordHandlesPresent) {
      // If all the blocks have WAL record handle, then create a WALBackedBlockRDD
      val isBlockIdValid = blockInfos.map { _.isBlockIdValid() }.toArray
      val walRecordHandles = blockInfos.map { _.walRecordHandleOption.get }.toArray
      new WriteAheadLogBackedBlockRDD[T](
        ssc.sparkContext, blockIds, walRecordHandles, isBlockIdValid)
    } else {
      // Else, create a BlockRDD. However, if there are some blocks with WAL info but not
      // others then that is unexpected and log a warning accordingly.
      if (blockInfos.find(_.walRecordHandleOption.nonEmpty).nonEmpty) {
        if (WriteAheadLogUtils.enableReceiverLog(ssc.conf)) {
          logError("Some blocks do not have Write Ahead Log information; " +
            "this is unexpected and data may not be recoverable after driver failures")
        } else {
          logWarning("Some blocks have Write Ahead Log information; this is unexpected")
        }
      }
      // Let the BlockManagerMaster check whether each StreamBlockId still exists in the cluster.
      val validBlockIds = blockIds.filter { id =>
        ssc.sparkContext.env.blockManager.master.contains(id)
      }
      // If the recorded StreamBlockIds disagree with the blocks actually in the cluster, warn.
      if (validBlockIds.size != blockIds.size) {
        logWarning("Some blocks could not be recovered as they were not found in memory. " +
          "To prevent such data loss, enable Write Ahead Log (see programming guide " +
          "for more details.")
      }
      // Create the BlockRDD from the StreamBlockIds that the cluster actually holds.
      new BlockRDD[T](ssc.sc, validBlockIds)
    }
  } else {
    // If no block is ready now, creating WriteAheadLogBackedBlockRDD or BlockRDD
    // according to the configuration
    if (WriteAheadLogUtils.enableReceiverLog(ssc.conf)) {
      new WriteAheadLogBackedBlockRDD[T](
        ssc.sparkContext, Array.empty, Array.empty, Array.empty)
    } else {
      new BlockRDD[T](ssc.sc, Array.empty)
    }
  }
}
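To close the loop: when a task later runs over the BlockRDD, each partition pulls its block out of the BlockManager. A simplified sketch of BlockRDD's compute (abridged; error handling and exact types follow the Spark source only approximately):

// Simplified sketch of BlockRDD.compute: one partition per block, and each task
// fetches its block's iterator from the (local or remote) BlockManager.
override def compute(split: Partition, context: TaskContext): Iterator[T] = {
  val blockManager = SparkEnv.get.blockManager
  val blockId = split.asInstanceOf[BlockRDDPartition].blockId
  blockManager.get(blockId) match {
    case Some(block) => block.data.asInstanceOf[Iterator[T]]
    case None => throw new Exception(s"Could not compute split, block $blockId not found")
  }
}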

With that, the flow is complete: ReceiverInputDStream periodically picks up the data that the SocketReceiver stored into Spark's BlockManager ahead of time.

 
