Spark Streaming — StreamingContext Initialization and Receiver Startup

StreamingContext Initialization

  When StreamingContext is initialized, it creates two important components, DStreamGraph and JobScheduler, as shown below:

  // Here an important component, DStreamGraph, is initialized.
  // It holds the dependencies between all the DStreams in the Spark Streaming application,
  // as well as the operators applied between them.
  private[streaming] val graph: DStreamGraph = {
    if (isCheckpointPresent) {
      cp_.graph.setContext(this)
      cp_.graph.restoreCheckpointData()
      cp_.graph
    } else {
      require(batchDur_ != null, "Batch duration for StreamingContext cannot be null")
      val newGraph = new DStreamGraph()
      newGraph.setBatchDuration(batchDur_)
      newGraph
    }
  }
  // Initialize the JobScheduler, which handles job scheduling: the jobs produced by the JobGenerator
  // are scheduled and submitted through it. Under the hood it still relies on the Spark Core engine.
  private[streaming] val scheduler = new JobScheduler(this)

  After StreamingContext is initialized, let's follow the execution using the WordCount program as an example. As analyzed earlier for Spark Core, a job is triggered by an output operation (i.e. an action). Take the simple print() action: it calls print(10), which prints the first 10 elements of each RDD and internally calls foreachRDD; that in turn calls ForEachDStream's register() method, and eventually generateJob() is invoked, which is where job submission is triggered.
However, the above only sets up job submission; it involves neither job generation nor data reception by Receivers. Those are triggered by calling StreamingContext's start() method, so in Spark Streaming nothing runs until start() is called. Of course, nothing runs without an output operation either, because no job would be submitted.
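  For reference, here is a minimal sketch of the WordCount program referred to above (host, port and batch interval are placeholder values):

  import org.apache.spark.SparkConf
  import org.apache.spark.streaming.{Seconds, StreamingContext}

  val conf = new SparkConf().setAppName("WordCount")
  // Creating the StreamingContext builds the DStreamGraph and JobScheduler shown above
  val ssc = new StreamingContext(conf, Seconds(5))

  // Input DStream backed by a socket Receiver
  val lines = ssc.socketTextStream("localhost", 9999)
  val counts = lines.flatMap(_.split(" ")).map(word => (word, 1)).reduceByKey(_ + _)
  // Output operation (action): registers a ForEachDStream on the DStreamGraph
  counts.print()

  // Nothing is generated or received until start() is called
  ssc.start()
  ssc.awaitTermination()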
  Below we focus on the start() method:

StreamingContext's start() method
  def start(): Unit = synchronized {
    state match {
      case INITIALIZED =>
        startSite.set(DStream.getCreationSite())
        // Take the lock to ensure only one StreamingContext is active in the current JVM at a time
        StreamingContext.ACTIVATION_LOCK.synchronized {
          // Check that no other StreamingContext is already active
          StreamingContext.assertNoOtherContextIsActive()
          try {
            // Validate that the initialized components are legal, that checkpointing is configured, etc.
            validate()
            
            // Launch the streaming application in a separate thread
            ThreadUtils.runInNewThread("streaming-start") {
              sparkContext.setCallSite(startSite.get)
              sparkContext.clearJobGroup()
              sparkContext.setLocalProperty(SparkContext.SPARK_JOB_INTERRUPT_ON_CANCEL, "false")
              // Call JobScheduler's start() method, which starts the Receivers
              scheduler.start()
            }
            // Update the current state
            state = StreamingContextState.ACTIVE
          } catch {
            case NonFatal(e) =>
              logError("Error starting the context, marking it as stopped", e)
              scheduler.stop(false)
              state = StreamingContextState.STOPPED
              throw e
          }
          StreamingContext.setActiveContext(this)
        }
        // ... remaining code omitted
    }
  }

  As the code above shows, it ends up calling JobScheduler's start() method, so let's look inside that method.

JobScheduler's start() method
  // What StreamingContext's start() method really invokes is JobScheduler's start() method
  def start(): Unit = synchronized {
    // If the scheduler has already been started (e.g. after a failure restart), return immediately
    if (eventLoop != null) return // scheduler has already been started

    logDebug("Starting JobScheduler")
    // Create an event loop (backed by a message queue) for receiving scheduler events
    eventLoop = new EventLoop[JobSchedulerEvent]("JobScheduler") {
      override protected def onReceive(event: JobSchedulerEvent): Unit = processEvent(event)

      override protected def onError(e: Throwable): Unit = reportError("Error in job scheduler", e)
    }
    // Start receiving events (these are local, in-process messages)
    eventLoop.start()

    // Register the rate controllers of the input DStreams.
    // This matters quite a bit: when receiving data via the Kafka direct approach (or an ordinary Receiver),
    // a maximum ingestion rate can be set, i.e. rate limiting.
    for {
      inputDStream <- ssc.graph.getInputStreams
      rateController <- inputDStream.rateController
    } ssc.addStreamingListener(rateController)

    listenerBus.start(ssc.sparkContext)
    // Create the ReceiverTracker, the component responsible for data reception
    receiverTracker = new ReceiverTracker(ssc)
    // Track the input DStreams' data statistics so that Streaming can monitor them
    inputInfoTracker = new InputInfoTracker(ssc)
    // Start the receiverTracker, which launches the Receivers associated with the input DStreams
    receiverTracker.start()
    // The JobGenerator was created right when the JobScheduler was constructed; it is started here
    jobGenerator.start()
    logInfo("Started JobScheduler")
  }

  Let's walk through the code above. First an event loop is created to receive local messages. Then the rate controllers of the input DStreams are registered, which is where input rate limiting comes in.
  A few words on Receiver rate limiting: if cluster resources are limited and cannot process data as fast as the Receiver ingests it, data piles up on the Receiver side. To keep the backlog from growing too large, the ingestion rate can be throttled via two parameters: spark.streaming.receiver.maxRate and spark.streaming.kafka.maxRatePerPartition.
  The former applies to ordinary Receivers, the latter to Kafka. Since Spark 1.5, however, the Kafka direct approach supports a backpressure mechanism, so there is no need to set a fixed rate: Spark automatically estimates a reasonable ingestion rate and adjusts it dynamically. To enable it, simply set spark.streaming.backpressure.enabled to true.
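  As a hedged illustration (the numeric values are placeholders), these settings could be configured like this:

  import org.apache.spark.SparkConf

  val conf = new SparkConf()
    .setAppName("WordCount")
    // Cap each ordinary Receiver at 10,000 records per second
    .set("spark.streaming.receiver.maxRate", "10000")
    // Cap each Kafka partition (direct approach) at 2,000 records per second
    .set("spark.streaming.kafka.maxRatePerPartition", "2000")
    // Or, since Spark 1.5, let backpressure estimate the rate dynamically
    .set("spark.streaming.backpressure.enabled", "true")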
  Continuing with the code: two important components, ReceiverTracker and JobGenerator, are then created and started. Let's start with ReceiverTracker's start() method.

ReceiverTracker's start() method

  ReceiverTracker's start() method ultimately calls launchReceivers(), so let's look at that method:

  private def launchReceivers(): Unit = {
    // Collect all the Receivers
    val receivers = receiverInputStreams.map(nis => {
      // For every input DStream created in the program, call getReceiver() to obtain its Receiver
      val rcvr = nis.getReceiver()
      // Set the Receiver's ID
      rcvr.setReceiverId(nis.id)
      rcvr
    })
    runDummySparkJob()

    logInfo("Starting " + receivers.length + " receivers")
    // Send a StartAllReceivers message to the ReceiverTrackerEndpoint;
    // this is really just local, in-process message passing
    endpoint.send(StartAllReceivers(receivers))
  }

  As the code above shows, the Receivers are launched by sending a local StartAllReceivers message. Let's look at how that message is handled:

 override def receive: PartialFunction[Any, Unit] = {
      // Local messages
      // Start all the Receivers
      case StartAllReceivers(receivers) =>

        // Compute where each Receiver should be launched, i.e. on which executor it will run
        val scheduledLocations = schedulingPolicy.scheduleReceivers(receivers, getExecutors)
        for (receiver <- receivers) {
          // Get the executors on which this Receiver should be launched
          val executors = scheduledLocations(receiver.streamId)
          // Update the Receiver's scheduled executors in its tracking info
          updateReceiverScheduledExecutors(receiver.streamId, executors)
          // Record each Receiver's preferred location
          receiverPreferredLocations(receiver.streamId) = receiver.preferredLocation
          // Start the Receiver, passing in the executor locations where it should run
          startReceiver(receiver, executors)
        }
        // ... remaining cases omitted
    }

  The actual launch is done by startReceiver(). Note the parameters passed in: the Receiver to start, and the locations where it should be launched (i.e. on which Worker's executor it will run).
  Let's focus on the startReceiver() method:

ReceiverTracker's startReceiver() method
private def startReceiver(
        receiver: Receiver[_],
        scheduledLocations: Seq[TaskLocation]): Unit = {
    // Check whether this Receiver has already been started or has been shut down
      val receiverId = receiver.streamId
      if (!shouldStartReceiver) {
        onReceiverJobFinish(receiverId)
        return
      }
      // Whether a checkpoint directory has been set
      val checkpointDirOption = Option(ssc.checkpointDir)
      val serializableHadoopConf =
        new SerializableConfiguration(ssc.sparkContext.hadoopConfiguration)
      /**
        * The core logic for launching the Receiver is defined here.
        * Note: this and everything that follows is only a definition; it is not executed on the Driver.
        * Only a function is defined here; its execution, and everything after it,
        * happens on the executors. To emphasize: Receivers are never started on the Driver,
        * they are started on the Executors.
        */
        // The function that starts the Receiver handed to this task
      val startReceiverFunc: Iterator[Receiver[_]] => Unit =
        (iterator: Iterator[Receiver[_]]) => {
          if (!iterator.hasNext) {
            throw new SparkException(
              "Could not start receiver as object not found.")
          }
          if (TaskContext.get().attemptNumber() == 0) {
            // Take the single Receiver out of the iterator
            val receiver = iterator.next()
            assert(iterator.hasNext == false)
            // Wrap the Receiver in a ReceiverSupervisorImpl and start it via its start() method
            val supervisor = new ReceiverSupervisorImpl(
              receiver, SparkEnv.get, serializableHadoopConf.value, checkpointDirOption)
            // This calls start() as defined in its parent class, ReceiverSupervisor
            supervisor.start()
            supervisor.awaitTermination()
          } else {
            // It's restarted by TaskScheduler, but we want to reschedule it again. So exit it.
          }
        }

      // Optimization: wrap the Receiver in a one-element RDD whose preferred locations are the
      // executors chosen above, so that the launch task runs on the node where the Receiver should live
      val receiverRDD: RDD[Receiver[_]] =
        if (scheduledLocations.isEmpty) {
          ssc.sc.makeRDD(Seq(receiver), 1)
        } else {
          val preferredLocations = scheduledLocations.map(_.toString).distinct
          ssc.sc.makeRDD(Seq(receiver -> preferredLocations))
        }
      receiverRDD.setName(s"Receiver $receiverId")
      ssc.sparkContext.setJobDescription(s"Streaming job running receiver $receiverId")
      ssc.sparkContext.setCallSite(Option(ssc.getStartSite()).getOrElse(Utils.getCallSite()))

      // submitJob actually ships the Receiver launch function to an Executor on a Worker node for execution
      val future = ssc.sparkContext.submitJob[Receiver[_], Unit, Unit](
        receiverRDD, startReceiverFunc, Seq(0), (_, _) => Unit, ())
      // We will keep restarting the receiver job until ReceiverTracker is stopped
      // Inspect the outcome of the Receiver job
      future.onComplete {
        case Success(_) =>
          if (!shouldStartReceiver) {
            onReceiverJobFinish(receiverId)
          } else {
            logInfo(s"Restarting Receiver $receiverId")
            self.send(RestartReceiver(receiver))
          }
        case Failure(e) =>
          if (!shouldStartReceiver) {
            onReceiverJobFinish(receiverId)
          } else {
            logError("Receiver has been stopped. Try to restart it.", e)
            logInfo(s"Restarting Receiver $receiverId")
            self.send(RestartReceiver(receiver))
          }
      }(submitJobThreadPool)
      logInfo(s"Receiver ${receiver.streamId} started")
    }

  The important part of the code above is startReceiverFunc, the launch function wrapped around the Receiver: inside it, the Receiver is wrapped in a ReceiverSupervisorImpl whose start() method is then called to start the Receiver. Next, the Receiver itself is wrapped in receiverRDD, and, most importantly, the two are submitted through SparkContext's submitJob so that the launch function runs on an executor on a Worker node.
Note that the Receiver is actually started on a Worker node's executor, not on the Driver; the Driver only wraps the Receiver and ships it to the executors to be started.
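  To make the "one-element RDD plus preferred location" trick concrete, here is a minimal standalone sketch (the hostname and payload are made up, and it uses the synchronous runJob instead of submitJob for brevity):

  // A one-element RDD with an explicit preferred location; "host-1" is a placeholder hostname
  val pinned = sc.makeRDD(Seq("receiver-0" -> Seq("host-1")))
  sc.runJob(pinned, (iter: Iterator[String]) => {
    // Like startReceiverFunc, this closure runs inside an executor on (or near) host-1
    println(s"launched ${iter.next()} on this executor")
  })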
  Now let's look at how the Receiver is started on each Worker's executor. The first call is ReceiverSupervisorImpl's start() method.

ReceiverSupervisorImpl's start() method

  ReceiverSupervisorImpl does not define start() itself; it is implemented in its parent class, so let's look at ReceiverSupervisor's start() method:

  def start() {
    // Calls onStart(), which is overridden in ReceiverSupervisorImpl
    onStart()
    startReceiver()
  }

  The start() method makes only two calls: onStart(), shown below, which starts the registered BlockGenerators, and startReceiver(), which starts the Receiver.

override protected def onStart() {
    // The BlockGenerators are started here. They are very important: on the worker's executor they
    // handle storing the data after it is received and coordinate with the ReceiverTracker.
    // So on the Executor, before the Receiver itself is started, the BlockGenerators associated
    // with it are started first. Here all registered BlockGenerators are started.
    registeredBlockGenerators.foreach { _.start() }
  }

  startReceiver() then starts the Receiver.

Starting the Receiver with startReceiver()
// This is where the Receiver gets started
  def startReceiver(): Unit = synchronized {
    try {
      // First register with the ReceiverTracker by sending it a message that this Receiver is starting
      if (onReceiverStart()) {
        logInfo("Starting receiver")
        // Mark the Receiver as started
        receiverState = Started
        // Actually start the Receiver
        receiver.onStart()
        logInfo("Called receiver onStart")
      } else {
        // The driver refused us
        stop("Registered unsuccessfully because Driver refused to start receiver " + streamId, None)
      }
    } catch {
      case NonFatal(t) =>
        stop("Error starting receiver " + streamId, Some(t))
    }
  }

  It first sends a message to the ReceiverTracker to register the Receiver that is about to start. Once registration succeeds, the Receiver is started by calling its onStart() method. We take the socket receiver as an example here; other Receivers start in much the same way.

// Start the Receiver
  def onStart() {
    // Start the thread that receives data over a connection
    new Thread("Socket Receiver") {
      setDaemon(true)
      override def run() { receive() }
    }.start()
  }
 
 // This mainly establishes a socket connection for receiving data
def receive() {
    var socket: Socket = null
    try {
      logInfo("Connecting to " + host + ":" + port)
      socket = new Socket(host, port)
      logInfo("Connected to " + host + ":" + port)
      val iterator = bytesToObjects(socket.getInputStream())
      while(!isStopped && iterator.hasNext) {
        store(iterator.next)
      }
      if (!isStopped()) {
        restart("Socket data stream had no more data")
      } else {
        logInfo("Stopped receiving")
      }
    } catch {
      case e: java.net.ConnectException =>
        restart("Error connecting to " + host + ":" + port, e)
      case NonFatal(e) =>
        logWarning("Error receiving data", e)
        restart("Error receiving data", e)
    } finally {
      if (socket != null) {
        socket.close()
        logInfo("Closed socket to " + host + ":" + port)
      }
    }
  }

  From the above it is clear that the Socket Receiver started on a Worker node's executor simply opens a socket connection to the data source, receives data, and stores it via store() into its BlockManager, after which the downstream operators take over.
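  Other Receivers follow the same pattern: onStart() spawns its own receiving thread, and each record is handed to store(). As a minimal custom Receiver sketch (the class name and data source here are purely illustrative):

  import org.apache.spark.storage.StorageLevel
  import org.apache.spark.streaming.receiver.Receiver

  class DummyReceiver extends Receiver[String](StorageLevel.MEMORY_AND_DISK_SER) {
    override def onStart(): Unit = {
      // Spawn the receiving thread, just like the Socket Receiver above
      new Thread("Dummy Receiver") {
        setDaemon(true)
        override def run(): Unit = {
          while (!isStopped()) {
            // store() hands the record to the ReceiverSupervisor / BlockGenerator for storage
            store("record-" + System.currentTimeMillis())
            Thread.sleep(100)
          }
        }
      }.start()
    }

    // The receiving thread checks isStopped(), so there is nothing extra to clean up here
    override def onStop(): Unit = {}
  }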
  To summarize: we analyzed how StreamingContext is initialized and what its start() method does. The main outcome is the creation of four important components: JobScheduler, DStreamGraph, ReceiverTracker and JobGenerator. We focused on how Receivers are started: each Receiver is wrapped in the launch function startReceiverFunc and shipped via SparkContext's submitJob to a Worker node's executor, where ReceiverSupervisor's startReceiver() method first sends a registration message to the ReceiverTracker and then calls the receiver's onStart() method to begin receiving data. We used the Socket Receiver as a simple example.
