Spark + Kafka Direct Approach: Committing Offsets to Zookeeper (repost)

 Apache Spark 1.3.0 introduced the Direct API, which uses Kafka's low-level consumer API to read data from the Kafka cluster and keeps track of the offsets inside Spark Streaming itself. This achieves zero data loss and is more efficient than the Receiver-based approach. However, because Spark Streaming maintains the Kafka read offsets on its own and never commits them to Zookeeper, offset-based Kafka monitoring tools (such as Kafka Web Console and KafkaOffsetMonitor) stop working. This article addresses that problem, so that our Spark Streaming program automatically updates the Kafka offsets in Zookeeper after every batch of data it receives.
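For reference, here is a minimal sketch of how the ssc, kafkaParams and topicsSet values used in the snippets below might be set up; the application name, batch interval, broker list and group id are placeholder assumptions, not part of the original article:

import kafka.serializer.StringDecoder
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

val sparkConf = new SparkConf().setAppName("DirectKafkaToZookeeper")
val ssc = new StreamingContext(sparkConf, Seconds(10))

// The Direct API talks to the brokers directly, so "metadata.broker.list" is required;
// "group.id" is only used when we commit the offsets to Zookeeper ourselves.
val kafkaParams = Map[String, String](
  "metadata.broker.list" -> "broker1:9092,broker2:9092",
  "group.id" -> "iteblog-group")
val topicsSet = Set("iteblog")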

  From Spark's official documentation we know that the Kafka offset information Spark maintains internally is stored in the offsetRanges field of the HasOffsetRanges trait, and we can read it inside a Spark Streaming program:

val offsetsList = rdd.asInstanceOf[HasOffsetRanges].offsetRanges

This gives us the consumption information for all partitions; we only need to iterate over offsetsList and send the information to Zookeeper to update the Kafka consumer offsets. The complete code snippet is as follows:

val messages = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
  ssc, kafkaParams, topicsSet)

messages.foreachRDD(rdd => {
  // Offsets of the partitions consumed in this batch.
  val offsetsList = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
  val kc = new KafkaCluster(kafkaParams)
  for (offsets <- offsetsList) {
    val topicAndPartition = TopicAndPartition("iteblog", offsets.partition)
    // args(0) is the consumer group id; untilOffset is the last offset read in this batch.
    val o = kc.setConsumerOffsets(args(0), Map((topicAndPartition, offsets.untilOffset)))
    if (o.isLeft) {
      println(s"Error updating the offset to Kafka cluster: ${o.left.get}")
    }
  }
})
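Note that the loop above hard-codes the topic name "iteblog" and takes the consumer group from args(0). A variant of the same loop, sketched here under the assumption that kafkaParams carries a "group.id" entry, reads the topic from each OffsetRange instead:

messages.foreachRDD { rdd =>
  val kc = new KafkaCluster(kafkaParams)
  val groupId = kafkaParams.getOrElse("group.id", "iteblog-group")
  rdd.asInstanceOf[HasOffsetRanges].offsetRanges.foreach { offsets =>
    // Use the topic carried by the OffsetRange instead of a hard-coded name,
    // which also works when topicsSet contains several topics.
    val tp = TopicAndPartition(offsets.topic, offsets.partition)
    kc.setConsumerOffsets(groupId, Map(tp -> offsets.untilOffset)) match {
      case Left(errs) => println(s"Error updating the offset in Zookeeper: $errs")
      case Right(_)   => // committed successfully
    }
  }
}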

  The KafkaCluster class is a utility for operations that talk to the Kafka cluster. We build a map with the desired offset for each partition of the topic, Map((topicAndPartition, offsets.untilOffset)), and then call KafkaCluster's setConsumerOffsets method to update the information stored in Zookeeper. Once the Kafka offsets are updated there, tools such as KafkaOffsetMonitor can again monitor the consumption of the corresponding topics. The screenshot below shows what this looks like in KafkaOffsetMonitor:



[Figure: KafkaOffsetMonitor dashboard showing the per-partition consumer offsets of the topic]

  As the screenshot shows, KafkaOffsetMonitor can now see the consumption of the relevant Kafka partitions, which matters for monitoring the whole Spark Streaming program because we can tell at any moment how fast Spark is reading. The complete code of the KafkaCluster utility class is listed below:

package org.apache.spark.streaming.kafka

import kafka.api.OffsetCommitRequest
import kafka.common.{ErrorMapping, OffsetMetadataAndError, TopicAndPartition}
import kafka.consumer.SimpleConsumer
import org.apache.spark.SparkException
import org.apache.spark.streaming.kafka.KafkaCluster.SimpleConsumerConfig

import scala.collection.mutable.ArrayBuffer
import scala.util.Random
import scala.util.control.NonFatal

/**
 * User: 過往記憶
 * Date: 2015-06-02
 * Time: 23:46
 * blog: https://www.iteblog.com
 * Original post: https://www.iteblog.com/archives/1381
 * 過往記憶 (iteblog), a blog focusing on Hadoop, Hive, Spark, Shark and Flume.
 * WeChat public account: iteblog_hadoop
 */
class KafkaCluster(val kafkaParams: Map[String, String]) extends Serializable {
  type Err = ArrayBuffer[Throwable]

  @transient private var _config: SimpleConsumerConfig = null

  def config: SimpleConsumerConfig = this.synchronized {
    if (_config == null) {
      _config = SimpleConsumerConfig(kafkaParams)
    }
    _config
  }

  // Commit the given offsets for the consumer group, wrapping each offset in
  // OffsetMetadataAndError as required by the offset commit API.
  def setConsumerOffsets(groupId: String,
                         offsets: Map[TopicAndPartition, Long]
                          ): Either[Err, Map[TopicAndPartition, Short]] = {
    setConsumerOffsetMetadata(groupId, offsets.map { kv =>
      kv._1 -> OffsetMetadataAndError(kv._2)
    })
  }

  // Send an OffsetCommitRequest to the brokers and collect the per-partition results.
  def setConsumerOffsetMetadata(groupId: String,
                                metadata: Map[TopicAndPartition, OffsetMetadataAndError]
                                 ): Either[Err, Map[TopicAndPartition, Short]] = {
    var result = Map[TopicAndPartition, Short]()
    val req = OffsetCommitRequest(groupId, metadata)
    val errs = new Err
    val topicAndPartitions = metadata.keySet
    withBrokers(Random.shuffle(config.seedBrokers), errs) { consumer =>
      val resp = consumer.commitOffsets(req)
      val respMap = resp.requestInfo
      val needed = topicAndPartitions.diff(result.keySet)
      needed.foreach { tp: TopicAndPartition =>
        respMap.get(tp).foreach { err: Short =>
          if (err == ErrorMapping.NoError) {
            result += tp -> err
          } else {
            errs.append(ErrorMapping.exceptionFor(err))
          }
        }
      }
      if (result.keys.size == topicAndPartitions.size) {
        return Right(result)
      }
    }
    val missing = topicAndPartitions.diff(result.keySet)
    errs.append(new SparkException(s"Couldn't set offsets for ${missing}"))
    Left(errs)
  }

  // Try each broker in turn, collecting connection errors instead of failing fast.
  private def withBrokers(brokers: Iterable[(String, Int)], errs: Err)
                         (fn: SimpleConsumer => Any): Unit = {
    brokers.foreach { hp =>
      var consumer: SimpleConsumer = null
      try {
        consumer = connect(hp._1, hp._2)
        fn(consumer)
      } catch {
        case NonFatal(e) =>
          errs.append(e)
      } finally {
        if (consumer != null) {
          consumer.close()
        }
      }
    }
  }

  def connect(host: String, port: Int): SimpleConsumer =
    new SimpleConsumer(host, port, config.socketTimeoutMs,
      config.socketReceiveBufferBytes, config.clientId)
}
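Outside a streaming job, the same class can also be used on its own, for example to seed or correct the committed offset of a single partition. In this sketch the broker list, group id, topic, partition and offset value are all placeholders:

import kafka.common.TopicAndPartition
import org.apache.spark.streaming.kafka.KafkaCluster

val kc = new KafkaCluster(Map("metadata.broker.list" -> "broker1:9092"))
val tp = TopicAndPartition("iteblog", 0)
// Commit offset 12345 for partition 0 of topic "iteblog" under the group "iteblog-group".
kc.setConsumerOffsets("iteblog-group", Map(tp -> 12345L)) match {
  case Right(ok)  => println(s"Committed offsets: $ok")
  case Left(errs) => println(s"Failed to commit offsets: $errs")
}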
