Spark Streaming + Flume Integration (Pull-based Approach): Word Count
See the Spark documentation:
http://spark.apache.org/docs/2.2.0/streaming-flume-integration.html
The Flume sink.type configuration is shown in the screenshot.
My Flume configuration is as follows:
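Since the original configuration screenshot is not reproduced here, below is a minimal sketch of a pull-based agent config. The agent name simple-agent and the sink address hadoop001:41414 come from the startup command and the Spark code in this post; the netcat source on port 44444 is an assumption for the telnet test.

```properties
# Sketch of flume-pull-streaming.conf (netcat source port 44444 is an assumption)
simple-agent.sources = netcat-source
simple-agent.sinks = spark-sink
simple-agent.channels = memory-channel

# Netcat source: receives lines sent via telnet
simple-agent.sources.netcat-source.type = netcat
simple-agent.sources.netcat-source.bind = hadoop001
simple-agent.sources.netcat-source.port = 44444

# SparkSink: buffers events until Spark Streaming polls them
simple-agent.sinks.spark-sink.type = org.apache.spark.streaming.flume.sink.SparkSink
simple-agent.sinks.spark-sink.hostname = hadoop001
simple-agent.sinks.spark-sink.port = 41414

# Wire source and sink through a memory channel
simple-agent.channels.memory-channel.type = memory
simple-agent.sources.netcat-source.channels = memory-channel
simple-agent.sinks.spark-sink.channel = memory-channel
```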
Developing the Spark Streaming program:
from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.flume import FlumeUtils

"""Spark Streaming + Flume (Pull-based Approach) word count"""
# sc is created automatically in the pyspark shell; uncomment when running standalone:
# sc = SparkContext(master="local[2]", appName="FlumePullWordCount")
ssc = StreamingContext(sc, 5)

# Address of the Flume SparkSink to poll from
addresses = [("hadoop001", 41414)]
flumeStream = FlumeUtils.createPollingStream(ssc, addresses)

# Each element is a (headers, body) pair; take the body and count words
counts = flumeStream.map(lambda x: x[1]) \
    .flatMap(lambda line: line.split(" ")) \
    .map(lambda word: (word, 1)) \
    .reduceByKey(lambda a, b: a + b)
counts.pprint()

ssc.start()
ssc.awaitTermination()
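For reference, the map/flatMap/reduceByKey chain above computes, per batch, the same result as this plain-Python sketch (no Spark required; the sample lines are made up):

```python
from collections import Counter

# Simulated event bodies from one batch (what x[1] yields in the DStream)
batch = ["hello spark", "hello flume"]

# flatMap: split every line into words; map + reduceByKey: count them
words = [word for line in batch for word in line.split(" ")]
counts = Counter(words)

print(sorted(counts.items()))  # [('flume', 1), ('hello', 2), ('spark', 1)]
```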
With the pull-based approach, Flume must be started first; data is buffered in the sink and then pulled by Spark Streaming:
./flume-ng agent \
--name simple-agent \
--conf $FLUME_HOME/conf \
--conf-file $FLUME_HOME/conf/flume-pull-streaming.conf \
-Dflume.root.logger=INFO,console &
Copy the finished Spark Streaming program into the pyspark shell and run it, then start telnet and send some data.
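The program depends on the Flume integration jar, which is not on pyspark's default classpath. A sketch of how the shell and telnet might be launched, assuming Spark 2.2.0 with Scala 2.11 (matching the docs linked above) and a netcat source on port 44444:

```shell
# Launch pyspark with the Spark Streaming Flume integration package
pyspark --master local[2] \
  --packages org.apache.spark:spark-streaming-flume_2.11:2.2.0

# In another terminal, send test data to the Flume netcat source
telnet hadoop001 44444
```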
Check the word-count output.
The word count is done. If you're interested, think about how to strip the newline characters from the results.