Using PySpark Streaming with Kafka and setting offsets manually

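Spark Streaming currently consists of two APIs: DStreams and Structured Streaming. The former is programmed against RDDs, the latter against DataFrames or Datasets. Structured Streaming is the one the official documentation now recommends, but you can pick whichever suits your needs. This post covers how to connect PySpark streaming to Kafka and how to work around not being able to use a group id.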

Versions used: Spark 2.4.3, Scala 2.11, a local Kafka 2.1.0 and a production Kafka 0.10.

DStreams

test.py:

import json

from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils

if __name__ == "__main__":
    sc = SparkContext(appName="test")
    sc.setLogLevel("WARN")
    ssc = StreamingContext(sc, 20)  # 20-second batch interval
    kafka_params = {"metadata.broker.list": "xxxxx:9092,xxxxx:9092,xxxxx:9092"}
    # Decode each message value from UTF-8 JSON into a dict
    kafkaStream = KafkaUtils.createDirectStream(ssc, ["mytopic"], kafka_params,
                                                valueDecoder=lambda x: json.loads(x.decode("utf-8")))
    # Each record is a (key, value) pair; x[1] is the decoded value
    kafkaStream.map(lambda x: (x[1].get("userId"), len(x[1].get("lifeIds")), x[1].get("createDate"))).pprint()

    ssc.start()
    ssc.awaitTermination()

Run:
spark-submit --packages org.apache.spark:spark-streaming-kafka-0-8_2.11:2.4.3 test.py
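The KafkaUtils class used above has been deprecated since Spark 2.3.0 but still works. It does require org.apache.spark:spark-streaming-kafka-0-8_2.11, though; org.apache.spark:spark-streaming-kafka-0-10_2.11 cannot be used, because the Python Spark Streaming API does not support the Kafka 0.10 integration. To use the Kafka 0.10 integration from Python you need Structured Streaming.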
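To make the valueDecoder and the map step concrete, here is a small Spark-free sketch of what happens to a single message. The sample payload and its values are invented for illustration; only the field names userId, lifeIds and createDate come from the script above, and the helper names decode_value and summarize are hypothetical.

```python
import json


def decode_value(raw):
    """Same logic as the valueDecoder lambda: UTF-8 JSON bytes -> dict."""
    return json.loads(raw.decode("utf-8"))


def summarize(record):
    """Same projection as the map step; record is a (key, value) pair."""
    value = record[1]
    return (value.get("userId"), len(value.get("lifeIds")), value.get("createDate"))


# Hypothetical message payload, invented for this example.
raw_value = json.dumps(
    {"userId": "u1", "lifeIds": [3, 5, 8], "createDate": "2019-07-01"}
).encode("utf-8")

# The direct stream yields (key, decoded value) pairs; the key is unused here.
record = (None, decode_value(raw_value))
summary = summarize(record)
print(summary)
```

Note that a message without a lifeIds field would make len() fail on None, so a real job would probably want a default such as value.get("lifeIds", []).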

Structured Streaming

test.py:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

if __name__ == "__main__":
    spark = SparkSession \
        .builder \
        .appName("StructuredNetworkWordCount") \
        .getOrCreate()

    # Subscribe to the "test" topic on the local broker
    df = spark \
        .readStream \
        .format("kafka") \
        .option("kafka.bootstrap.servers", "localhost:9092") \
        .option("subscribe", "test") \
        .load()

    # Kafka keys and values arrive as binary, so cast them to strings
    df = df.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
    # Split each value on spaces and emit one row per word
    df = df.withColumn("s", F.split(df['value'], " "))
    df = df.withColumn('e', F.explode(df['s']))

    q = df.writeStream \
        .format("console") \
        .trigger(processingTime='30 seconds') \
        .start()

    q.awaitTermination()

Run:
spark-submit --packages org.apache.spark:spark-sql-kafka-0-10_2.11:2.4.3 test.py

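The main problem with both approaches is that you cannot set a group id, so there is no way to track the offset you last consumed. In testing, every time the script started it consumed from the current point in time, which means data produced earlier may never be consumed. On top of that, PySpark DStreams does not support the Kafka 0.10 integration for the production cluster. The approach I settled on is therefore to set and save the offsets manually.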
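For the Structured Streaming source specifically, the Spark Kafka integration guide documents a startingOffsets option that accepts "earliest", "latest", or a JSON string of per-partition offsets (with -2 meaning earliest and -1 meaning latest). It only applies when a new query starts. The sketch below shows one way such a string could be built from saved offsets; the helper name starting_offsets_json and the sample offsets are hypothetical, and this is an alternative to consider rather than what the rest of this post does.

```python
import json


def starting_offsets_json(topic_name, partition_offsets):
    """Build the JSON string form accepted by the startingOffsets option.

    partition_offsets maps partition id -> offset (-2 = earliest, -1 = latest).
    """
    per_partition = {str(p): int(o) for p, o in partition_offsets.items()}
    return json.dumps({topic_name: per_partition})


# Hypothetical offsets, e.g. loaded from wherever they were last saved.
offsets_option = starting_offsets_json("test", {0: 23, 1: -1})
print(offsets_option)

# It would then be passed to the reader along these lines:
#   spark.readStream.format("kafka") \
#       .option("startingOffsets", offsets_option) ...
```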

Setting and saving offsets manually

The idea behind setting offsets manually:
(1) Save the Kafka offsets in HBase. Create an HBase table to hold the Kafka offset data, laid out as follows:

# DDL:
create 'stream_kafka_offsets', {NAME=>'offsets', TTL=>2592000}

# ROW LAYOUT:
row:              <TOPIC_NAME>_<GROUP_ID>
column family:    offsets
qualifier:        <PARTITION_ID>
value:            <OFFSET_ID>

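The TTL is set to 30 days (2592000 seconds). The row key combines the topic and the group id, the column family is offsets, the column qualifier is the partition id, and the value is the offset.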
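As a rough illustration of this layout, the following sketch encodes a set of partition offsets into the row format and decodes it back. No HBase connection is involved: the stored row is simulated with a plain dict of bytes, on the assumption that the HBase client returns bytes keys and values. The function names row_key, encode_row and decode_row are hypothetical.

```python
def row_key(topic_name, group_id):
    """Row key: <TOPIC_NAME>_<GROUP_ID>."""
    return topic_name + "_" + group_id


def encode_row(partition_offsets):
    """{partition id: offset} -> {"offsets:<partition>": "<offset>"}."""
    return {"offsets:" + str(p): str(o) for p, o in partition_offsets.items()}


def decode_row(row):
    """Inverse of encode_row; accepts str or bytes keys and values."""
    result = {}
    for k, v in row.items():
        k = k.decode("utf-8") if isinstance(k, bytes) else k
        v = v.decode("utf-8") if isinstance(v, bytes) else v
        # The qualifier after "offsets:" is the partition id.
        result[int(k.split(":")[1])] = int(v)
    return result


key = row_key("mytopic", "test-group-2")
row = encode_row({0: 120, 1: 98})
# Simulate what a read from HBase might return (bytes on both sides).
stored = {k.encode("utf-8"): v.encode("utf-8") for k, v in row.items()}
decoded = decode_row(stored)
print(key, row, decoded)
```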

(2) For every batch consumed from Kafka, save the latest offsets to HBase

def save_offsets(topic_name, group_id, offset_ranges, hbase_table_name):
    happybase_util = HappyBaseUtil()
    for offset in offset_ranges:
        happybase_util.put(hbase_table_name, topic_name+"_"+group_id, {"offsets:"+str(offset.partition): str(offset.untilOffset)})

This step is straightforward: it simply writes the end offset of each partition in the batch to HBase.

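(3) Get the latest offsets
There are several cases to consider:

First run. How should the first run start: from the current time, from some latest offset, from offset 0, or something else? Starting from offset 0 can raise an Offset Out of Range error, because Kafka may already have purged the early data; starting from the current time may leave a lot of earlier data unconsumed. The current approach is to take the latest offset as the starting offset for the first run. Getting it takes some extra work and requires the third-party Python package kafka (kafka-python). In the code that follows, this is done with an offset request for time -1, which returns the latest offset of each partition.
Restarted after running for a while, with new partitions added in the meantime. In this case the latest offsets are still read from HBase, and the new partitions start from offset 0.
Restarted after running for a while, with no new partitions added. Here it is enough to read the latest offsets from HBase.
So the key question remains how to handle the first run, and there may well be a better way to do it.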
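The three cases above can be sketched as a pure function that is independent of Kafka and HBase, which makes the decision logic easy to check in isolation. This is only an illustration under simplified assumptions: offsets are plain dicts keyed by partition id, and the name resolve_start_offsets and the sample numbers are hypothetical.

```python
def resolve_start_offsets(stored, broker_partition_ids, latest):
    """Pick the starting offset for every partition.

    stored: {partition: offset} read from the offset store (empty on first run)
    broker_partition_ids: partition ids currently reported by Kafka
    latest: {partition: latest offset}, only consulted on the first run
    """
    if not stored:
        # Case 1: first run, start from the latest offset of every partition.
        return {p: latest[p] for p in broker_partition_ids}

    offsets = dict(stored)
    for p in broker_partition_ids:
        # Case 2: partitions added since the last run start from offset 0.
        offsets.setdefault(p, 0)
    # Case 3: with no new partitions, the stored offsets come back unchanged.
    return offsets


first_run = resolve_start_offsets({}, [0, 1], {0: 500, 1: 480})
new_partition = resolve_start_offsets({0: 120, 1: 98}, [0, 1, 2], {})
restart = resolve_start_offsets({0: 120, 1: 98}, [0, 1], {})
print(first_run, new_partition, restart)
```

Checking each partition id individually, rather than comparing counts, also covers the situation where the stored row is missing a partition for some other reason.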

def get_last_committed_offsets(topic_name, group_id, hbase_table_name):
    # client = SimpleClient('localhost:9092')
    client = SimpleClient(["xxxxx:9092","xxxxx:9092","xxxxx:9092"])
    # Partition ids of the topic as currently reported by the Kafka brokers
    topic_partition_ids = client.get_partition_ids_for_topic(topic_name)

    happybase_util = HappyBaseUtil()
    # Partition offsets for this topic and group previously saved in HBase
    partition_offset_values = happybase_util.get_row(hbase_table_name, row=topic_name+"_"+group_id)

    if len(partition_offset_values) == 0:
        # First run: ask Kafka for the latest offset (time=-1) of every partition
        partitions = client.topic_partitions[topic_name]
        offset_requests = [OffsetRequestPayload(topic_name, p, -1, 1) for p in partitions.keys()]
        offsets_responses = client.send_offset_request(offset_requests)
        offsets = dict((TopicAndPartition(topic_name, r.partition), r.offsets[0]) for r in offsets_responses)

    elif len(partition_offset_values) < len(topic_partition_ids):
        # HBase holds fewer partitions than Kafka reports, so partitions were added;
        # the new partitions start from offset 0
        offsets = dict((TopicAndPartition(topic_name, int(k.decode("utf-8").split(":")[1])), int(v))
                       for k, v in partition_offset_values.items())
        extra_partitions = dict((TopicAndPartition(topic_name, i), 0)
                                for i in range(len(partition_offset_values), len(topic_partition_ids)))
        offsets.update(extra_partitions)
    else:
        # No new partitions: resume from the offsets saved in HBase
        offsets = dict((TopicAndPartition(topic_name, int(k.decode("utf-8").split(":")[1])), int(v))
                       for k, v in partition_offset_values.items())

    return offsets

(4) Next comes processing the data: get the offsets of each batch and save them

if __name__ == "__main__":
    sc = SparkContext(appName="test")
    sc.setLogLevel("WARN")
    ssc = StreamingContext(sc, 5)
    # kafka_params = {"metadata.broker.list": "localhost:9092"}
    kafka_params = {"metadata.broker.list": "xxxxx:9092,xxxxx:9092,xxxxx:9092"}

    # fromOffset = get_last_committed_offsets("test", "test-id", "stream_kafka_offsets")
    fromOffset = get_last_committed_offsets("mytopic", "test-group-2", "stream_kafka_offsets")

    # kafkaStream = KafkaUtils.createDirectStream(ssc, ["test"], kafka_params, fromOffsets=fromOffset)
    kafkaStream = KafkaUtils.createDirectStream(ssc, ["mytopic"], kafka_params, fromOffsets=fromOffset)

    def inner_func(rdd):
        rdd.foreach(lambda x: print(x))
        save_offsets("mytopic", "test-group-2", rdd.offsetRanges(),"stream_kafka_offsets")

    kafkaStream.foreachRDD(inner_func)

    ssc.start()
    ssc.awaitTermination()

Then run:
spark-submit --packages org.apache.spark:spark-streaming-kafka-0-8_2.11:2.4.3 test.py

Complete code:
test.py:

from kafka import SimpleClient
from kafka.structs import OffsetRequestPayload

from pyspark.streaming.kafka import KafkaUtils, TopicAndPartition
from pyspark.streaming import StreamingContext
from pyspark import SparkContext

from algo_core.utils.hbase_util import HappyBaseUtil  # my own wrapper around happybase

def save_offsets(topic_name, group_id, offset_ranges, hbase_table_name):
    happybase_util = HappyBaseUtil()
    for offset in offset_ranges:
        happybase_util.put(hbase_table_name, topic_name+"_"+group_id, {"offsets:"+str(offset.partition): str(offset.untilOffset)})

def get_last_committed_offsets(topic_name, group_id, hbase_table_name):
    # client = SimpleClient('localhost:9092')
    client = SimpleClient(["xxxxx:9092","xxxxx:9092","xxxxx:9092"])
    # Partition ids of the topic as currently reported by the Kafka brokers
    topic_partition_ids = client.get_partition_ids_for_topic(topic_name)

    happybase_util = HappyBaseUtil()
    # Partition offsets for this topic and group previously saved in HBase
    partition_offset_values = happybase_util.get_row(hbase_table_name, row=topic_name+"_"+group_id)

    if len(partition_offset_values) == 0:
        # First run: ask Kafka for the latest offset (time=-1) of every partition
        partitions = client.topic_partitions[topic_name]
        offset_requests = [OffsetRequestPayload(topic_name, p, -1, 1) for p in partitions.keys()]
        offsets_responses = client.send_offset_request(offset_requests)
        offsets = dict((TopicAndPartition(topic_name, r.partition), r.offsets[0]) for r in offsets_responses)

    elif len(partition_offset_values) < len(topic_partition_ids):
        # HBase holds fewer partitions than Kafka reports, so partitions were added;
        # the new partitions start from offset 0
        offsets = dict((TopicAndPartition(topic_name, int(k.decode("utf-8").split(":")[1])), int(v))
                       for k, v in partition_offset_values.items())
        extra_partitions = dict((TopicAndPartition(topic_name, i), 0)
                                for i in range(len(partition_offset_values), len(topic_partition_ids)))
        offsets.update(extra_partitions)
    else:
        # No new partitions: resume from the offsets saved in HBase
        offsets = dict((TopicAndPartition(topic_name, int(k.decode("utf-8").split(":")[1])), int(v))
                       for k, v in partition_offset_values.items())

    return offsets


if __name__ == "__main__":
    sc = SparkContext(appName="test")
    sc.setLogLevel("WARN")
    ssc = StreamingContext(sc, 5)
    # kafka_params = {"metadata.broker.list": "localhost:9092"}
    kafka_params = {"metadata.broker.list": "xxxxx:9092,xxxxx:9092,xxxxx:9092"}

    # fromOffset = get_last_committed_offsets("test", "test-id", "stream_kafka_offsets")
    fromOffset = get_last_committed_offsets("mytopic", "test-group-2", "stream_kafka_offsets")

    # kafkaStream = KafkaUtils.createDirectStream(ssc, ["test"], kafka_params, fromOffsets=fromOffset)
    kafkaStream = KafkaUtils.createDirectStream(ssc, ["mytopic"], kafka_params, fromOffsets=fromOffset)

    def inner_func(rdd):
        rdd.foreach(lambda x: print(x))
        save_offsets("mytopic", "test-group-2", rdd.offsetRanges(),"stream_kafka_offsets")

    kafkaStream.foreachRDD(inner_func)

    ssc.start()
    ssc.awaitTermination()

littlely_ll
