You need to download the matching spark-streaming-kafka-0-8-assembly jar (the versions must correspond to your setup).
Download link:
https://mvnrepository.com/artifact/org.apache.spark/spark-streaming-kafka-0-8-assembly_2.11
Be sure to download the matching assembly version, otherwise it will not be recognized.
Version naming example: spark-streaming-kafka-0-8-assembly_2.11-2.4.4.jar
Here 2.11 is the Scala version and 2.4.4 is the Spark version.
To check the Kafka version, run find / -name \*kafka_\* | head -1 | grep -o '\kafka[^\n]*' and you will see a file like kafka_2.10-0.8.2-beta.jar, where 2.10 is the Scala version and 0.8.2-beta is the Kafka version.
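To illustrate the naming scheme, the two versions can be pulled straight out of the jar file name. This is a minimal Python sketch; the variable names are just for illustration:

```python
# Parse the Scala and Spark versions out of the assembly jar name.
jar = "spark-streaming-kafka-0-8-assembly_2.11-2.4.4.jar"

# Everything after the last "_" and before the first "-" that follows it
# is the Scala version; the remainder (minus ".jar") is the Spark version.
name = jar[:-len(".jar")]
_, _, rest = name.rpartition("_")
scala_version, _, spark_version = rest.partition("-")
print(scala_version)  # 2.11
print(spark_version)  # 2.4.4
```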
Code:
When starting the streaming program, ZooKeeper and Kafka must also be up and running.
Run:
spark-submit --jars /Us****oads/spark-streaming-kafka-0-8-assembly_2.11-2.4.4.jar pyspark01/pyspark_steaming02.py
from pyspark import SparkContext
from pyspark.streaming import StreamingContext
import os
import json
from pyspark.streaming.kafka import KafkaUtils
os.environ["PYSPARK_PYTHON"]="/Users/lonng/opt/anaconda3/python.app/Contents/MacOS/python"
# Create a local StreamingContext with two working threads and a batch interval of 5 seconds
sc = SparkContext("local[2]", "NetworkWordCount")
sc.setLogLevel("OFF")
ssc = StreamingContext(sc, 5)
# A checkpoint directory must be set, otherwise an error is raised
ssc.checkpoint("./")
zookeeper = "localhost:2181"
topic = {"test1": 1}
group_id = "test"
line1 = KafkaUtils.createStream(ssc, zookeeper, group_id, topic)
print(line1)
# lines = KafkaUtils.createDirectStream(ssc, ["hello"], {"metadata.broker.list": "127.0.0.1:9092"})
# Extract the message value from each (key, value) pair
lines = line1.map(lambda x: x[1])

# Split each line into words and count each word in each batch
counts = lines.flatMap(lambda line: line.split(" ")) \
    .map(lambda word: (word, 1)) \
    .reduceByKey(lambda a, b: a + b)

# Print the first ten elements of each RDD generated in this DStream to the console
counts.pprint()
ssc.start() # Start the computation
ssc.awaitTermination() # Wait for the computation to terminate
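The per-batch transformation above (flatMap to split lines, map to (word, 1) pairs, reduceByKey to sum) can be sketched without Spark to show what each 5-second batch computes. batch_word_count is a hypothetical helper for illustration only, not part of the PySpark API:

```python
# Spark-free sketch of the word-count pipeline applied to one batch of lines.
from collections import Counter

def batch_word_count(lines):
    # flatMap(lambda line: line.split(" "))
    words = [word for line in lines for word in line.split(" ")]
    # map(lambda word: (word, 1)) followed by reduceByKey(lambda a, b: a + b)
    return dict(Counter(words))

print(batch_word_count(["hello world", "hello kafka"]))
# {'hello': 2, 'world': 1, 'kafka': 1}
```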