Spark Streaming with Kafka 0.10+: Saving Offsets to ZooKeeper

Versions:
spark-streaming-kafka-0-10_2.11
  version: 2.4.0
kafka-clients
  version: 0.11.0.0

Problem: we had been saving offsets the 0.8 way, but since the production Kafka cluster runs 0.11.0.0, the way offsets are stored has changed significantly.

The official documentation's approach (which commits offsets back to Kafka itself via commitAsync, not to ZooKeeper):

import org.apache.kafka.clients.consumer.ConsumerRecord
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.TaskContext
import org.apache.spark.streaming.kafka010._
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe

val kafkaParams = Map[String, Object](
  "bootstrap.servers" -> "localhost:9092,anotherhost:9092",
  "key.deserializer" -> classOf[StringDeserializer],
  "value.deserializer" -> classOf[StringDeserializer],
  "group.id" -> "use_a_separate_group_id_for_each_stream",
  "auto.offset.reset" -> "latest",
  "enable.auto.commit" -> (false: java.lang.Boolean)
)

val topics = Array("topicA", "topicB")
val stream = KafkaUtils.createDirectStream[String, String](
  streamingContext,
  PreferConsistent,
  Subscribe[String, String](topics, kafkaParams)
)

stream.foreachRDD { rdd =>
  val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
  rdd.foreachPartition { iter =>
    val o: OffsetRange = offsetRanges(TaskContext.get.partitionId)
    println(s"${o.topic} ${o.partition} ${o.fromOffset} ${o.untilOffset}")
  }
  // some time later, after outputs have completed
  stream.asInstanceOf[CanCommitOffsets].commitAsync(offsetRanges)
}
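The snippet above assumes a streamingContext that is already built. For completeness, a minimal driver skeleton around it might look like this (the app name and batch interval are placeholders, not from the original post):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setAppName("Kafka010OffsetDemo") // hypothetical app name
val streamingContext = new StreamingContext(conf, Seconds(5)) // batch interval is an assumption

// ... create the stream and register the foreachRDD shown above ...

streamingContext.start()
streamingContext.awaitTermination()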

My requirement, however, is to apply a window operation plus map and filter transformations before foreachRDD. Once those transformations run, the resulting RDD is no longer the KafkaRDD produced directly by the stream, so the asInstanceOf[HasOffsetRanges] cast fails and the offsetRanges can no longer be obtained inside foreachRDD. The revised code:

// Global, defined at class/object level
var offsetRanges: Array[OffsetRange] = Array[OffsetRange]()

// Inside the method body
val kafkaParams = Map[String, Object](
    "bootstrap.servers" -> PropertiesUtil.getPropertiesToStr("kafka.hosts"),
    "key.deserializer" -> classOf[StringDeserializer],
    "value.deserializer" -> classOf[StringDeserializer],
    "group.id" -> "xxx",
    "auto.offset.reset" -> "latest",
    "enable.auto.commit" -> (false: java.lang.Boolean)
)

val topics = Array("topicA")

val messages: InputDStream[ConsumerRecord[String, String]] = KafkaOffsetUtil.createMyZookeeperDirectKafkaStream(
    ssc,
    kafkaParams,
    topics,
    PropertiesUtil.getPropertiesToStr("zookeeper.group.name"),
    PropertiesUtil.getPropertiesToStr("zookeeper.path"))

val dataOriginDStream = messages.transform {
    rdd =>
        offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
        rdd
}.filter(kv => {
    ...
})

dataOriginDStream.foreachRDD(rdd => {
    if (!rdd.isEmpty()) {
        ...
    }
// Save the new offsets
    KafkaOffsetUtil.saveOffsets(
        PropertiesUtil.getPropertiesToStr("zookeeper.path"),
        offsetRanges,
        PropertiesUtil.getPropertiesToStr("zookeeper.group.name"))
})

My approach: use transform to capture the offsetRanges into a variable first, then read that variable back when foreachRDD runs. This works because both the transform closure and the foreachRDD closure run on the driver once per batch, and transform is evaluated before the output operation of the same batch, so the variable always holds the current batch's ranges when saveOffsets is called. Note that this relies on batches executing one at a time, which is the default (spark.streaming.concurrentJobs = 1).
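The post calls KafkaOffsetUtil.createMyZookeeperDirectKafkaStream and KafkaOffsetUtil.saveOffsets but never shows them. Below is a minimal sketch of what such a helper could look like, assuming Apache Curator for ZooKeeper access and a znode layout of <zkPath>/<group>/<topic>/<partition>; this is an illustration under those assumptions, not the author's actual implementation:

import org.apache.curator.framework.CuratorFrameworkFactory
import org.apache.curator.retry.ExponentialBackoffRetry
import org.apache.kafka.clients.consumer.ConsumerRecord
import org.apache.kafka.common.TopicPartition
import org.apache.spark.streaming.StreamingContext
import org.apache.spark.streaming.dstream.InputDStream
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
import org.apache.spark.streaming.kafka010.{KafkaUtils, OffsetRange}

import scala.collection.JavaConverters._

object KafkaOffsetUtil {

  // Hypothetical connect string; the post presumably reads it via PropertiesUtil.
  private val zkClient = CuratorFrameworkFactory.newClient(
    "zk1:2181,zk2:2181", new ExponentialBackoffRetry(1000, 3))
  zkClient.start()

  // One znode per partition: <zkPath>/<group>/<topic>/<partition>, value = next offset to read.
  private def partitionPath(zkPath: String, group: String, topic: String, partition: Int) =
    s"$zkPath/$group/$topic/$partition"

  def saveOffsets(zkPath: String, offsetRanges: Array[OffsetRange], group: String): Unit =
    offsetRanges.foreach { o =>
      val path = partitionPath(zkPath, group, o.topic, o.partition)
      val data = o.untilOffset.toString.getBytes("UTF-8")
      if (zkClient.checkExists().forPath(path) == null) {
        zkClient.create().creatingParentsIfNeeded().forPath(path, data)
      } else {
        zkClient.setData().forPath(path, data)
      }
    }

  def readOffsets(zkPath: String, group: String,
                  topics: Array[String]): Map[TopicPartition, Long] =
    topics.flatMap { topic =>
      val topicPath = s"$zkPath/$group/$topic"
      if (zkClient.checkExists().forPath(topicPath) == null) {
        Seq.empty[(TopicPartition, Long)]
      } else {
        zkClient.getChildren.forPath(topicPath).asScala.map { p =>
          val offset = new String(zkClient.getData.forPath(s"$topicPath/$p"), "UTF-8").toLong
          new TopicPartition(topic, p.toInt) -> offset
        }
      }
    }.toMap

  def createMyZookeeperDirectKafkaStream(
      ssc: StreamingContext,
      kafkaParams: Map[String, Object],
      topics: Array[String],
      group: String,
      zkPath: String): InputDStream[ConsumerRecord[String, String]] = {
    val fromOffsets = readOffsets(zkPath, group, topics)
    if (fromOffsets.nonEmpty) {
      // Resume from the offsets previously saved in ZooKeeper.
      KafkaUtils.createDirectStream[String, String](
        ssc, PreferConsistent, Subscribe[String, String](topics, kafkaParams, fromOffsets))
    } else {
      // First run: nothing in ZooKeeper yet; fall back to auto.offset.reset.
      KafkaUtils.createDirectStream[String, String](
        ssc, PreferConsistent, Subscribe[String, String](topics, kafkaParams))
    }
  }
}

An alternative with Kafka 0.10+ is to skip ZooKeeper entirely and let commitAsync store offsets in Kafka's internal __consumer_offsets topic; storing them in ZooKeeper, as here, keeps the layout compatible with the 0.8-era approach the post migrated from.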
