A fix for Spark Streaming repeatedly consuming data from a message queue
Problem: when using Spark Streaming on E-MapReduce to consume data from Alibaba Cloud LogService (which can be treated as a Kafka-like message queue for producing and consuming data), every batch consumes all of the data sent so far, over and over again.
As the figure shows: after 16 records were sent to LogService, every batch consumed all of them.
The code is as follows:
import java.util.HashMap;

// Connector classes; package names follow the emr-logservice / loghub client SDKs.
import com.aliyun.openservices.loghub.client.config.LogHubCursorPosition;
import org.apache.spark.SparkConf;
import org.apache.spark.streaming.Duration;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.aliyun.logservice.LoghubUtils;
import org.apache.spark.streaming.api.java.JavaInputDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;

public class javaStreamingDirect {
    public static void main(String[] args) throws InterruptedException {
        // LogService project, logstore and consumer group to read from
        String logServiceProject = "teststreaming04";
        String logStoreName = "teststreaming4logstore";
        String loghubConsumerGroupName = "filter_info_count";
        //String loghubEndpoint = "teststreaming04.cn-hangzhou.log.aliyuncs.com";
        String loghubEndpoint = "teststreaming04.cn-hangzhou-intranet.log.aliyuncs.com";
        String accessKeyId = "";
        String accessKeySecret = "";

        Duration batchInterval = Durations.seconds(5);
        SparkConf conf = new SparkConf().setAppName("javaStreamingDirect");
        JavaStreamingContext ssc = new JavaStreamingContext(conf, batchInterval);
        ssc.checkpoint("oss://test/checkpoint/javaStreamingDirect");
        //ssc.checkpoint("D:/SparkData/streamingCheckPoint/directStream");

        HashMap<String, String> zkParam = new HashMap<>();
        zkParam.put("zookeeper.connect", "emr-worker-1,emr-header-2,emr-header-3:2181");
        //zkParam.put("zookeeper.connect", "192.168.96.119,192.168.96.118,192.168.96.117:2181");
        zkParam.put("enable.auto.commit", "false");

        // Direct stream from LogService, starting from the end cursor
        JavaInputDStream<String> javaLoghubstream = LoghubUtils.createDirectStream(
                ssc,
                logServiceProject,
                logStoreName,
                loghubConsumerGroupName,
                accessKeyId,
                accessKeySecret,
                loghubEndpoint,
                zkParam,
                LogHubCursorPosition.END_CURSOR);

        javaLoghubstream.print();
        ssc.start();
        ssc.awaitTermination();
    }
}
Digging through the official docs, I found the following in the Spark Streaming Kafka Direct integration section of the API documentation:
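The relevant part is the commitAsync example; paraphrased from the Kafka 0.10 integration guide, it looks roughly like this, where stream is assumed to be the JavaInputDStream<ConsumerRecord<String, String>> returned by KafkaUtils.createDirectStream and HasOffsetRanges, CanCommitOffsets and OffsetRange come from org.apache.spark.streaming.kafka010:

stream.foreachRDD(rdd -> {
    // Grab the offset ranges that this batch read
    OffsetRange[] offsetRanges = ((HasOffsetRanges) rdd.rdd()).offsetRanges();
    // some time later, after outputs have completed
    ((CanCommitOffsets) stream.inputDStream()).commitAsync(offsetRanges);
});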
Roughly, it says that Kafka commits offsets automatically by default, but because the code above set enable.auto.commit to false, offsets are never committed automatically and have to be committed manually. Following the API, the code can be modified as follows:
import java.util.HashMap;

// Connector classes; package names follow the emr-logservice / loghub client SDKs.
// Note that the CanCommitOffsets used here (with a no-argument commitAsync()) is the
// LogService connector's interface; the Kafka 0.10 one takes explicit offset ranges.
import com.aliyun.openservices.loghub.client.config.LogHubCursorPosition;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.function.VoidFunction;
import org.apache.spark.streaming.Duration;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.aliyun.logservice.CanCommitOffsets;
import org.apache.spark.streaming.aliyun.logservice.LoghubUtils;
import org.apache.spark.streaming.api.java.JavaInputDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;

public class javaStreamingDirect {
    public static void main(String[] args) throws InterruptedException {
        String logServiceProject = "teststreaming04";
        String logStoreName = "teststreaming4logstore";
        String loghubConsumerGroupName = "filter_info_count";
        //String loghubEndpoint = "teststreaming04.cn-hangzhou.log.aliyuncs.com";
        String loghubEndpoint = "teststreaming04.cn-hangzhou-intranet.log.aliyuncs.com";
        String accessKeyId = "";
        String accessKeySecret = "";

        Duration batchInterval = Durations.seconds(5);
        SparkConf conf = new SparkConf().setAppName("javaStreamingDirect");
        JavaStreamingContext ssc = new JavaStreamingContext(conf, batchInterval);
        ssc.checkpoint("oss://test/checkpoint/javaStreamingDirect");
        //ssc.checkpoint("D:/SparkData/streamingCheckPoint/directStream");

        HashMap<String, String> zkParam = new HashMap<>();
        zkParam.put("zookeeper.connect", "emr-worker-1,emr-header-2,emr-header-3:2181");
        //zkParam.put("zookeeper.connect", "192.168.96.119,192.168.96.118,192.168.96.117:2181");
        zkParam.put("enable.auto.commit", "false");

        JavaInputDStream<String> javaLoghubstream = LoghubUtils.createDirectStream(
                ssc,
                logServiceProject,
                logStoreName,
                loghubConsumerGroupName,
                accessKeyId,
                accessKeySecret,
                loghubEndpoint,
                zkParam,
                LogHubCursorPosition.END_CURSOR);

        // Commit each batch's offsets once the batch has been consumed
        javaLoghubstream.foreachRDD(new VoidFunction<JavaRDD<String>>() {
            @Override
            public void call(JavaRDD<String> stringJavaRDD) throws Exception {
                ((CanCommitOffsets) javaLoghubstream.inputDStream()).commitAsync();
            }
        });

        javaLoghubstream.print();
        ssc.start();
        ssc.awaitTermination();
    }
}
With this change, after each batch finishes consuming its data, the offsets of that batch's data are committed back to the consumer group, and subsequent batches no longer re-consume the same data.
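To tie the commit to the batch's own output, the same commitAsync call can also be issued at the end of the foreachRDD that does the batch's work, so offsets only advance once that work has finished. A minimal sketch, reusing javaLoghubstream and the imports from the listing above (the processing here is just a placeholder count):

javaLoghubstream.foreachRDD(new VoidFunction<JavaRDD<String>>() {
    @Override
    public void call(JavaRDD<String> rdd) throws Exception {
        // Do this batch's work first (placeholder: count and print the records).
        long count = rdd.count();
        System.out.println("records in this batch: " + count);
        // Only then commit this batch's offsets.
        ((CanCommitOffsets) javaLoghubstream.inputDStream()).commitAsync();
    }
});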
Summary: besides committing offsets, duplicate consumption can also be avoided by using a receiver-based stream to pull the data, but to guarantee high availability you then have to enable the WAL mechanism and write the received data to HDFS, which costs some performance. Alternatively, if you do not want to commit manually, you can set enable.auto.commit to true and let offsets be committed automatically; the risk is that data gets marked as consumed before the Spark program has managed to write out its results, so consistency cannot be guaranteed. So if you use the direct approach, committing offsets manually is the better choice.
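For reference, a rough sketch of the two alternatives mentioned above; the class and app names are illustrative, and the receiver-based LoghubUtils.createStream call is left out because its exact signature depends on the SDK version:

import java.util.HashMap;

import org.apache.spark.SparkConf;

public class OffsetAlternativesSketch {
    public static void main(String[] args) {
        // Alternative 1: receiver-based stream with the write-ahead log enabled,
        // so received data is persisted to the checkpoint directory (HDFS/OSS)
        // before processing, at some cost to throughput.
        SparkConf conf = new SparkConf()
                .setAppName("receiverWithWAL")
                .set("spark.streaming.receiver.writeAheadLog.enable", "true");

        // Alternative 2: keep the direct stream but let offsets be committed
        // automatically; offsets may then advance before the batch's output
        // has actually been written.
        HashMap<String, String> zkParam = new HashMap<>();
        zkParam.put("enable.auto.commit", "true");

        System.out.println(conf.get("spark.streaming.receiver.writeAheadLog.enable")
                + " / " + zkParam.get("enable.auto.commit"));
    }
}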