Original post: https://dongkelun.com/2018/06/14/updateStateBykeyWordCount/
Preface
This article uses Spark Streaming and Kafka to implement a stateful (history-accumulating) real-time word count. The usual Spark Streaming word count, such as the example on the official site, only counts each word within the latest batch interval; it cannot accumulate counts across batches. After working through a tutorial, I implemented the Kafka version myself and am writing it up here. There is nothing difficult about it: a single updateStateByKey operator is enough, and since this was my first time using that operator, it was a good chance to learn it.
1. Data
The data is just a few messages I produced at random in Kafka, with words separated by spaces.
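For illustration, something like the following (these exact lines are hypothetical; any space-separated words will do, and they are reused in the expected output shown at the end):

hello spark
hello kafka
hello spark streaming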
2. Kafka topic
First, create the topic the program will use in Kafka: UpdateStateBykeyWordCount
bin/kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 1 --partitions 1 --topic UpdateStateBykeyWordCount
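You can confirm the topic was created with the same script's --describe (or --list) option; the ZooKeeper address matches the create command above:

bin/kafka-topics.sh --describe --zookeeper localhost:2181 --topic UpdateStateBykeyWordCount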
3. Create the HDFS checkpoint directory
My directory is /spark/dkl/kafka/wordcount_checkpoint
hadoop fs -mkdir -p /spark/dkl/kafka/wordcount_checkpoint
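To verify the directory exists:

hadoop fs -ls /spark/dkl/kafka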
4. Spark code
Start the following program:
package com.dkl.leanring.spark.kafka

import org.apache.spark.streaming.StreamingContext
import org.apache.spark.sql.SparkSession
import org.apache.spark.streaming.Seconds
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.streaming.kafka010.KafkaUtils
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe

object UpdateStateBykeyWordCount {
  def main(args: Array[String]): Unit = {
    // Initialization: create the SparkSession
    val spark = SparkSession.builder().appName("sskt").master("local[2]").enableHiveSupport().getOrCreate()
    // Initialization: get the SparkContext
    val sc = spark.sparkContext
    // Initialization: create the StreamingContext with a batchDuration of 5 seconds
    val ssc = new StreamingContext(sc, Seconds(5))
    // Enable checkpointing (required by updateStateByKey)
    ssc.checkpoint("hdfs://ambari.master.com:8020/spark/dkl/kafka/wordcount_checkpoint")
    // Kafka broker address
    val server = "ambari.master.com:6667"
    // Consumer configuration
    val kafkaParams = Map[String, Object](
      "bootstrap.servers" -> server, // Kafka broker address
      "key.deserializer" -> classOf[StringDeserializer],
      "value.deserializer" -> classOf[StringDeserializer],
      "group.id" -> "UpdateStateBykeyWordCount", // consumer group id
      "auto.offset.reset" -> "latest", // latest resets the offset to the newest; other options: earliest, none
      "enable.auto.commit" -> (false: java.lang.Boolean)) // if true, this consumer's offsets are committed automatically in the background
    val topics = Array("UpdateStateBykeyWordCount") // topic(s) to consume
    // Create the DStream with the direct approach
    val stream = KafkaUtils.createDirectStream(ssc, PreferConsistent, Subscribe[String, String](topics, kafkaParams))
    // The word count itself:
    // split each record on spaces and map each word to the (word, 1) form
    val words = stream.flatMap(_.value().split(" ")).map((_, 1))
    val wordCounts = words.updateStateByKey(
      // this function is called for every key on every batch
      // the first argument holds the key's new values in this batch; there may be
      //   several, e.g. for (hello,1)(hello,1) values is Seq(1, 1)
      // the second argument is the key's previous state
      (values: Seq[Int], state: Option[Int]) => {
        var newValue = state.getOrElse(0)
        values.foreach(newValue += _)
        Option(newValue)
      })
    wordCounts.print()
    ssc.start()
    ssc.awaitTermination()
  }
}
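For completeness, the program needs the Kafka 0.10 integration on the classpath in addition to Spark itself. A minimal build.sbt sketch, assuming Scala 2.11 and Spark 2.3.0 (adjust both to match your cluster):

name := "spark-kafka-wordcount"
scalaVersion := "2.11.8"

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-sql" % "2.3.0",
  "org.apache.spark" %% "spark-streaming" % "2.3.0",
  // Kafka 0.10 direct-stream integration used by the code above
  "org.apache.spark" %% "spark-streaming-kafka-0-10" % "2.3.0",
  // only needed because the code calls enableHiveSupport(); otherwise both can be dropped
  "org.apache.spark" %% "spark-hive" % "2.3.0"
)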
5. Produce a few messages
Just type in a few lines; anything will do:
bin/kafka-console-producer.sh --broker-list ambari.master.com:6667 --topic UpdateStateBykeyWordCount
6. Result
As the result shows, the printed counts include the historical words as well, not just those from the latest batch.
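With the hypothetical sample lines from step 1, once all of them have been consumed the output of wordCounts.print() would look roughly like this (the Time value is just the batch timestamp, so yours will differ):

-------------------------------------------
Time: 1528958965000 ms
-------------------------------------------
(hello,3)
(spark,2)
(kafka,1)
(streaming,1)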