Source Code Analysis of Spark's Kafka Integration: How Spark Streaming Receives Data from Kafka

Overview:
Getting Spark Streaming to receive data from Kafka takes the following steps (the classes involved live in the package org.apache.spark.streaming.kafka):
1. Call the createStream() function, whose return type is ReceiverInputDStream; at the end, createStream() constructs and returns a KafkaInputDStream object.
2. KafkaInputDStream extends ReceiverInputDStream and implements its getReceiver() function, in which a KafkaReceiver object is constructed.
3. KafkaReceiver is the class that does the real work. The layers before it do little substantive work themselves; much like a project where layers of managers plan and assign tasks while a handful of people at the bottom actually do the work, they mainly keep the overall structure clear. KafkaReceiver:
  a. sets the Kafka-related parameters
  b. sets the address of the ZooKeeper that stores Kafka metadata and connects to it
  c. sets the classes that deserialize the data stored in Kafka
  d. calls the Kafka consumer API to fetch the data
  e. creates a thread pool that stores the fetched stream data into Spark
  f. shuts down the thread pool

The main work of receiving data from Kafka in Spark Streaming is:
1. In the Receiver:
  a. consume the data in the message queue, record by record;
  b. call the Receiver's store() function to store the data into Spark's memory.
2. Keep the relationships between createStream, ReceiverInputDStream, KafkaInputDStream, KafkaReceiver and Receiver straight. (A minimal sketch of the Receiver contract mentioned in point 1 follows below.)
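
To illustrate the Receiver contract that KafkaReceiver follows, here is a minimal, made-up DummyReceiver sketch (not Spark source code; the class name and the fake records are invented for illustration): implement onStart() and onStop(), and hand each record to Spark with store().

import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.receiver.Receiver

// A made-up receiver that generates fake (key, value) records.
class DummyReceiver(storageLevel: StorageLevel)
  extends Receiver[(String, String)](storageLevel) {

  def onStart() {
    // A real receiver opens its connection here and reads on a background thread.
    new Thread("dummy-receiver") {
      override def run() {
        var i = 0
        while (!isStopped()) {
          store(("key-" + i, "value-" + i)) // push one record into Spark's memory
          i += 1
          Thread.sleep(100)
        }
      }
    }.start()
  }

  def onStop() {
    // Nothing to release here; the loop above exits once isStopped() is true.
  }
}

A custom receiver like this would be plugged in with ssc.receiverStream(new DummyReceiver(StorageLevel.MEMORY_AND_DISK_SER_2)); KafkaInputDStream achieves the same effect by returning its receiver from getReceiver(), as shown later.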


Detailed analysis:

1. The KafkaWordCount example from the official Spark website:

import org.apache.spark.SparkConf
import org.apache.spark.streaming._
import org.apache.spark.streaming.kafka._

// StreamingExamples is a log-level helper that lives in Spark's examples package.
object KafkaWordCount {
  def main(args: Array[String]) {
    if (args.length < 4) {
      System.err.println("Usage: KafkaWordCount <zkQuorum> <group> <topics> <numThreads>")
      System.exit(1)
    }

    StreamingExamples.setStreamingLogLevels()

    val Array(zkQuorum, group, topics, numThreads) = args
    val sparkConf = new SparkConf().setAppName("KafkaWordCount")
    val ssc = new StreamingContext(sparkConf, Seconds(2))
    ssc.checkpoint("checkpoint")

    val topicMap = topics.split(",").map((_, numThreads.toInt)).toMap
    val lines = KafkaUtils.createStream(ssc, zkQuorum, group, topicMap).map(_._2)
    val words = lines.flatMap(_.split(" "))
    val wordCounts = words.map(x => (x, 1L))
      .reduceByKeyAndWindow(_ + _, _ - _, Minutes(10), Seconds(2), 2)
    wordCounts.print()

    ssc.start()
    ssc.awaitTermination()
  }
}
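
Two pieces of this example matter for the rest of the analysis: the createStream(...) call and the .map(_._2) right after it. The following illustrative sketch (reusing ssc, zkQuorum, group and topicMap from the example above) only adds type annotations to make the shape of the data explicit:

import org.apache.spark.streaming.dstream.{DStream, ReceiverInputDStream}
import org.apache.spark.streaming.kafka.KafkaUtils

// createStream returns a DStream of (key, value) pairs ...
val pairs: ReceiverInputDStream[(String, String)] =
  KafkaUtils.createStream(ssc, zkQuorum, group, topicMap)
// ... and .map(_._2) keeps only the message value, dropping the key.
val lines: DStream[String] = pairs.map(_._2)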
2. The main thing to analyse is the createStream() function in KafkaUtils.createStream(ssc, zkQuorum, group, topicMap).map(_._2):

/**
   * Create an input stream that pulls messages from a Kafka Broker.
   * @param ssc       StreamingContext object
   * @param zkQuorum  Zookeeper quorum (hostname:port,hostname:port,..)
   * @param groupId   The group id for this consumer
   * @param topics    Map of (topic_name -> numPartitions) to consume. Each partition is consumed
   *                  in its own thread
   * @param storageLevel  Storage level to use for storing the received objects
   *                      (default: StorageLevel.MEMORY_AND_DISK_SER_2)
   */
  def createStream(
      ssc: StreamingContext,
      zkQuorum: String,
      groupId: String,
      topics: Map[String, Int],
      storageLevel: StorageLevel = StorageLevel.MEMORY_AND_DISK_SER_2
    ): ReceiverInputDStream[(String, String)] = {
    val kafkaParams = Map[String, String](
      "zookeeper.connect" -> zkQuorum, "group.id" -> groupId,
      "zookeeper.connection.timeout.ms" -> "10000")
    createStream[String, String, StringDecoder, StringDecoder](
      ssc, kafkaParams, topics, storageLevel)
  }
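
This overload only fills in a minimal kafkaParams map. When extra Kafka consumer properties are needed, the typed overload below can be called directly. A hedged sketch, assuming the ssc from the KafkaWordCount example; the ZooKeeper hosts, group id and topic name are placeholders:

import kafka.serializer.StringDecoder
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.kafka.KafkaUtils

val kafkaParams = Map[String, String](
  "zookeeper.connect" -> "zk1:2181,zk2:2181",
  "group.id" -> "demo-group",
  "zookeeper.connection.timeout.ms" -> "10000",
  "auto.offset.reset" -> "smallest") // start from the earliest available offset

val stream = KafkaUtils.createStream[String, String, StringDecoder, StringDecoder](
  ssc, kafkaParams, Map("demo-topic" -> 1), StorageLevel.MEMORY_AND_DISK_SER_2)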
Step into createStream[String, String, StringDecoder, StringDecoder](ssc, kafkaParams, topics, storageLevel):
 def createStream[K: ClassTag, V: ClassTag, U <: Decoder[_]: ClassTag, T <: Decoder[_]: ClassTag](
      ssc: StreamingContext,
      kafkaParams: Map[String, String],
      topics: Map[String, Int],
      storageLevel: StorageLevel
    ): ReceiverInputDStream[(K, V)] = {
    val walEnabled = ssc.conf.getBoolean("spark.streaming.receiver.writeAheadLog.enable", false)
    new KafkaInputDStream[K, V, U, T](ssc, kafkaParams, topics, walEnabled, storageLevel)
  }
The function ends by returning a KafkaInputDStream object, so step into KafkaInputDStream.

3. The code of the KafkaInputDStream class

/**
 * Input stream that pulls messages from a Kafka Broker.
 *
 * @param kafkaParams Map of kafka configuration parameters.
 *                    See: http://kafka.apache.org/configuration.html
 * @param topics Map of (topic_name -> numPartitions) to consume. Each partition is consumed
 * in its own thread.
 * @param storageLevel RDD storage level.
 */
private[streaming]
class KafkaInputDStream[
  K: ClassTag,
  V: ClassTag,
  U <: Decoder[_]: ClassTag,
  T <: Decoder[_]: ClassTag](
    @transient ssc_ : StreamingContext,
    kafkaParams: Map[String, String],
    topics: Map[String, Int],
    useReliableReceiver: Boolean,
    storageLevel: StorageLevel
  ) extends ReceiverInputDStream[(K, V)](ssc_) with Logging {

  def getReceiver(): Receiver[(K, V)] = {
    if (!useReliableReceiver) {
      new KafkaReceiver[K, V, U, T](kafkaParams, topics, storageLevel)
    } else {
      new ReliableKafkaReceiver[K, V, U, T](kafkaParams, topics, storageLevel)
    }
  }
}
This class does two main things:
a. KafkaInputDStream extends ReceiverInputDStream[(K, V)](ssc_).
b. It implements the getReceiver() function of ReceiverInputDStream. getReceiver() can return either of two Receivers (KafkaReceiver, or ReliableKafkaReceiver when the write-ahead log is enabled); the principle is the same, so looking at KafkaReceiver is enough. (A configuration sketch for the reliable variant follows below.)
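
Whether getReceiver() hands back the plain KafkaReceiver or the ReliableKafkaReceiver is decided by the spark.streaming.receiver.writeAheadLog.enable flag read in createStream() above. A minimal configuration sketch (the application name is just the one from the example):

import org.apache.spark.SparkConf

// With the write-ahead log enabled, getReceiver() builds a ReliableKafkaReceiver.
val sparkConf = new SparkConf()
  .setAppName("KafkaWordCount")
  .set("spark.streaming.receiver.writeAheadLog.enable", "true")

The write-ahead log is stored under the streaming checkpoint directory, so ssc.checkpoint(...) must also be set, which the KafkaWordCount example already does.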

4. Look at the constructed KafkaReceiver[K, V, U, T](kafkaParams, topics, storageLevel):

private[streaming]
class KafkaReceiver[
  K: ClassTag,
  V: ClassTag,
  U <: Decoder[_]: ClassTag,
  T <: Decoder[_]: ClassTag](
    kafkaParams: Map[String, String],
    topics: Map[String, Int],
    storageLevel: StorageLevel
  ) extends Receiver[(K, V)](storageLevel) with Logging {

  // Connection to Kafka
  var consumerConnector: ConsumerConnector = null

  def onStop() {
    if (consumerConnector != null) {
      consumerConnector.shutdown()
      consumerConnector = null
    }
  }

  def onStart() {

    logInfo("Starting Kafka Consumer Stream with group: " + kafkaParams("group.id"))

    // Kafka connection properties
    val props = new Properties()
    kafkaParams.foreach(param => props.put(param._1, param._2))

    val zkConnect = kafkaParams("zookeeper.connect")
    // Create the connection to the cluster
    logInfo("Connecting to Zookeeper: " + zkConnect)
    val consumerConfig = new ConsumerConfig(props)
    consumerConnector = Consumer.create(consumerConfig)
    logInfo("Connected to " + zkConnect)

    val keyDecoder = classTag[U].runtimeClass.getConstructor(classOf[VerifiableProperties])
      .newInstance(consumerConfig.props)
      .asInstanceOf[Decoder[K]]
    val valueDecoder = classTag[T].runtimeClass.getConstructor(classOf[VerifiableProperties])
      .newInstance(consumerConfig.props)
      .asInstanceOf[Decoder[V]]

    // Create threads for each topic/message Stream we are listening
    val topicMessageStreams = consumerConnector.createMessageStreams(
      topics, keyDecoder, valueDecoder)

    val executorPool = Utils.newDaemonFixedThreadPool(topics.values.sum, "KafkaMessageHandler")
    try {
      // Start the messages handler for each partition
      topicMessageStreams.values.foreach { streams =>
        streams.foreach { stream => executorPool.submit(new MessageHandler(stream)) }
      }
    } finally {
      executorPool.shutdown() // Just causes threads to terminate after work is done
    }
  }

  // Handles Kafka messages
  private class MessageHandler(stream: KafkaStream[K, V])
    extends Runnable {
    def run() {
      logInfo("Starting MessageHandler.")
      try {
        val streamIterator = stream.iterator()
        while (streamIterator.hasNext()) {
          val msgAndMetadata = streamIterator.next()
          store((msgAndMetadata.key, msgAndMetadata.message)) //Store a single item of received data to Spark's memory.
        }
      } catch {
        case e: Throwable => logError("Error handling message; exiting", e)
      }
    }
  }
}
The KafkaReceiver object constructed here does the main work. It extends Receiver[(K, V)](storageLevel), so it must implement the Receiver's onStart() and onStop() functions.

The job of onStart() is to move the data in Kafka into Spark:
a. set the Kafka-related parameters
b. set the address of the ZooKeeper that stores Kafka metadata and connect to it
c. set the classes that deserialize the data stored in Kafka
d. call the Kafka consumer API to fetch the data
e. create a thread pool that stores the fetched stream data into Spark. store((msgAndMetadata.key, msgAndMetadata.message)) is a function of the Receiver class; it stores a single received message into Spark's memory as a key-value pair. This key-value storage is exactly why KafkaUtils.createStream(ssc, zkQuorum, group, topicMap) returns a stream of key-value pairs. When I previously wrote Spark code in Java that received Kafka data and called this API, the key-value return type puzzled me; now it is clear that the handling here is what produces it (see the standalone sketch after this list).
f. shut down the thread pool
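
To make steps a through f concrete, here is a standalone, hedged sketch that performs the same steps with the Kafka 0.8 high-level consumer API directly, outside Spark. The ZooKeeper address, group id, topic name and object name are placeholders, and println stands in for the Receiver's store():

import java.util.Properties
import java.util.concurrent.Executors

import kafka.consumer.{Consumer, ConsumerConfig}
import kafka.serializer.StringDecoder
import kafka.utils.VerifiableProperties

object PlainHighLevelConsumer {
  def main(args: Array[String]) {
    // a. Kafka consumer parameters
    val props = new Properties()
    props.put("zookeeper.connect", "localhost:2181") // b. ZooKeeper holding Kafka metadata
    props.put("group.id", "demo-group")
    props.put("zookeeper.connection.timeout.ms", "10000")
    val consumerConfig = new ConsumerConfig(props)
    val connector = Consumer.create(consumerConfig) // connect through ZooKeeper

    // c. Decoders that deserialize the bytes stored in Kafka into Strings
    val decoder = new StringDecoder(new VerifiableProperties(props))

    // d. Ask the high-level consumer API for one stream of the topic
    val topicMessageStreams = connector.createMessageStreams(
      Map("demo-topic" -> 1), decoder, decoder)

    // e. One task per stream in a thread pool; println stands in for store()
    val executorPool = Executors.newFixedThreadPool(1)
    topicMessageStreams.values.foreach { streams =>
      streams.foreach { stream =>
        executorPool.submit(new Runnable {
          def run() {
            val it = stream.iterator()
            while (it.hasNext()) {
              val msgAndMetadata = it.next()
              println(msgAndMetadata.key + " -> " + msgAndMetadata.message)
            }
          }
        })
      }
    }

    // f. Let the worker threads finish their current messages and terminate.
    // (A real program would eventually call connector.shutdown(), which ends the streams.)
    executorPool.shutdown()
  }
}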


onStop() simply closes the consumer's connection to Kafka.

Control then returns layer by layer, and in the end the data can be obtained from the object returned by createStream().

This concludes the workflow of Spark receiving and consuming data from Kafka.


