Spark Streaming x Kafka
Computing statistics in real time calls for Spark Streaming together with Kafka. I will not dwell on Spark versions here; Kafka currently splits mainly into the 0.8.x.x and 0.10.x.x lines, and calling what looks like the same consumption API reveals differences between the two, so this post records them. For creating the stream we pick the common Direct Approach (no receiver), which simplifies parallelism and makes Streaming more stable when ingesting data.
0.8.x.x Maven dependency and consumption
When constructing the StreamingContext you can also skip creating a SparkContext yourself and pass the SparkConf directly to StreamingContext; here sc is kept around so it can be used to read other files.
maven
<dependency>
    <groupId>org.apache.kafka</groupId>
    <artifactId>kafka-clients</artifactId>
    <version>0.8.x.x</version>
</dependency>
<!-- KafkaUtils.createDirectStream lives in the Spark integration artifact, not kafka-clients -->
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-streaming-kafka-0-8_2.11</artifactId>
    <version>${spark.version}</version>
</dependency>
Consuming the topic
import kafka.serializer.StringDecoder
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

val kafkaParams = Map(
  "metadata.broker.list" -> KAFKA_BROKERS,
  "group.id" -> KAFKA_GROUP_ID,
  "auto.offset.reset" -> kafka.api.OffsetRequest.LargestTimeString
)
val topicsSet = KAFKA_TOPIC.split(",").toSet
val sparkConf = if (local) {
  new SparkConf()
    .setMaster(SPARK_LOCAL_HOST)
    .setAppName(appName)
} else {
  new SparkConf().setAppName(appName)
}
val sc = new SparkContext(sparkConf)
val ssc = new StreamingContext(sc, Seconds(SPARK_STREAMING_INTERVAL.toInt))
// In the 0.8 API the stream elements are (key, value) tuples,
// so pass the value (._2) to Execute
val messages = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](ssc, kafkaParams, topicsSet)
messages.foreachRDD(rdd => {
  rdd.foreachPartition(partition => {
    partition.foreach(line => {
      Execute(line._2)
    })
  })
})
ssc.start()
ssc.awaitTermination()
0.10.x.x Maven dependency and consumption
The main differences from the 0.8.x.x consumer are the Kafka configuration and the API that creates the DStream; the main processing logic still goes in the Execute function.
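The post never shows Execute itself; as a hedged illustration only, a minimal per-record handler might look like the sketch below (the tab-separated field layout and the println sink are assumptions, not from the original; substitute your real statistics logic):

```scala
// Hypothetical Execute: parse one Kafka record payload and act on it.
object Processor {
  def Execute(line: String): Unit = {
    // Assumed record layout: "timestamp<TAB>userId<TAB>event"
    val fields = line.split("\t", -1)
    if (fields.length >= 3) {
      val event = fields(2)
      // Update whatever real-time statistic you track here,
      // e.g. increment a counter in an external store.
      println(s"event=$event")
    }
  }
}
```

Keeping Execute free of Spark types like this is what lets the same function back both the 0.8 and 0.10 consumers, since only the value-extraction step (`._2` vs `.value()`) differs.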
maven
<dependency>
    <groupId>org.apache.kafka</groupId>
    <artifactId>kafka-clients</artifactId>
    <version>0.10.x.x</version>
</dependency>
<!-- KafkaUtils, LocationStrategies and ConsumerStrategies come from the Spark integration artifact -->
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-streaming-kafka-0-10_2.11</artifactId>
    <version>${spark.version}</version>
</dependency>
Consuming the topic
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka010.{ConsumerStrategies, KafkaUtils, LocationStrategies}

val kafkaParameters = Map[String, Object](
  "bootstrap.servers" -> KAFKA_BROKERS,
  "group.id" -> KAFKA_GROUP_ID,
  "enable.auto.commit" -> (true: java.lang.Boolean),
  "auto.offset.reset" -> "latest",
  "key.deserializer" -> classOf[StringDeserializer],
  "value.deserializer" -> classOf[StringDeserializer],
  "security.protocol" -> "SASL_PLAINTEXT",
  "fetch.min.bytes" -> "4096",
  "sasl.mechanism" -> "PLAIN"
)
val sparkConf = if (local) {
  new SparkConf()
    .setMaster(SPARK_LOCAL_HOST)
    .setAppName(appName)
} else {
  new SparkConf().setAppName(appName)
}
val sc = new SparkContext(sparkConf)
val ssc = new StreamingContext(sc, Seconds(SPARK_STREAMING_INTERVAL.toInt))
val kafkaStream = KafkaUtils.createDirectStream[String, String](
  ssc,
  LocationStrategies.PreferConsistent,
  ConsumerStrategies.Subscribe[String, String](Array(KAFKA_TOPIC), kafkaParameters)
)
// In the 0.10 API the stream elements are ConsumerRecord[String, String],
// so read the payload with value()
kafkaStream.foreachRDD(rdd => {
  rdd.foreachPartition(partition => {
    partition.foreach(line => {
      Execute(line.value())
    })
  })
})
ssc.start()
ssc.awaitTermination()
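One caveat with the configuration above: with `enable.auto.commit` set to true, offsets are committed on the consumer's own timer regardless of whether the batch was actually processed. If at-least-once semantics matter, a common variation (sketched below against the spark-streaming-kafka-0-10 offset API, reusing the `kafkaStream` and `Execute` names from above) is to disable auto-commit and commit offsets yourself after each batch:

```scala
import org.apache.spark.streaming.kafka010.{CanCommitOffsets, HasOffsetRanges}

// Assumes kafkaStream was created as above, but with
// "enable.auto.commit" -> (false: java.lang.Boolean) in kafkaParameters.
kafkaStream.foreachRDD { rdd =>
  // Capture the offset ranges of this batch before processing
  val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
  rdd.foreachPartition(partition => partition.foreach(line => Execute(line.value())))
  // Commit only after this batch's processing has been issued
  kafkaStream.asInstanceOf[CanCommitOffsets].commitAsync(offsetRanges)
}
```

`commitAsync` stores the offsets back in Kafka under the same `group.id`, so on restart with `auto.offset.reset` the job resumes from the last committed batch instead of jumping to the latest offset.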