Solving Duplicate Consumption of Message Queue Data in Spark Streaming

Problem: When using Spark Streaming on E-MapReduce to consume data from Alibaba Cloud LogService (which can be treated as a Kafka-like message queue for publishing and consuming data), every batch re-consumes all of the data seen so far.

As shown in the figure: after sending 16 records to LogService, every batch consumes all of them again.

The code is as follows:

import java.util.HashMap;

import org.apache.spark.SparkConf;
import org.apache.spark.streaming.Duration;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaInputDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;

// LogService (Loghub) connector classes; the exact packages depend on the
// EMR LogService SDK version in use
import com.aliyun.openservices.loghub.client.config.LogHubCursorPosition;
import org.apache.spark.streaming.aliyun.logservice.LoghubUtils;

public class javaStreamingDirect {
    public static void main(String[] args) throws InterruptedException {
        String logServiceProject = "teststreaming04";
        String logStoreName = "teststreaming4logstore";
        String loghubConsumerGroupName = "filter_info_count";
        //String loghubEndpoint = "teststreaming04.cn-hangzhou.log.aliyuncs.com";
        String loghubEndpoint = "teststreaming04.cn-hangzhou-intranet.log.aliyuncs.com";
        String accessKeyId = "";
        String accessKeySecret = "";

        Duration batchInterval = Durations.seconds(5);
        SparkConf conf = new SparkConf().setAppName("javaStreamingDirect");
        JavaStreamingContext ssc = new JavaStreamingContext(conf, batchInterval);

        ssc.checkpoint("oss://test/checkpoint/javaStreamingDirect");
        //ssc.checkpoint("D:/SparkData/streamingCheckPoint/directStream");

        HashMap<String, String> zkParam = new HashMap<>();
        zkParam.put("zookeeper.connect", "emr-worker-1,emr-header-2,emr-header-3:2181");
        //zkParam.put("zookeeper.connect", "192.168.96.119,192.168.96.118,192.168.96.117:2181");
        // Auto-commit is disabled here, which is what later turns out to cause
        // the duplicate consumption
        zkParam.put("enable.auto.commit", "false");

        JavaInputDStream<String> javaLoghubstream = LoghubUtils.createDirectStream(
                ssc,
                logServiceProject,
                logStoreName,
                loghubConsumerGroupName,
                accessKeyId,
                accessKeySecret,
                loghubEndpoint,
                zkParam,
                LogHubCursorPosition.END_CURSOR);

        javaLoghubstream.print();

        ssc.start();
        ssc.awaitTermination();
    }
}

Checking the official documentation, I found the following in the Spark Streaming + Kafka Direct integration guide:
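In essence, that part of the Kafka 0.10 Direct integration guide shows a manual-commit pattern along these lines. This is only a sketch against the plain spark-streaming-kafka-0-10 API rather than the LogService connector; the broker address, topic name and group id are placeholders:

import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.common.serialization.StringDeserializer;
import org.apache.spark.SparkConf;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaInputDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;
import org.apache.spark.streaming.kafka010.CanCommitOffsets;
import org.apache.spark.streaming.kafka010.ConsumerStrategies;
import org.apache.spark.streaming.kafka010.HasOffsetRanges;
import org.apache.spark.streaming.kafka010.KafkaUtils;
import org.apache.spark.streaming.kafka010.LocationStrategies;
import org.apache.spark.streaming.kafka010.OffsetRange;

public class KafkaManualCommitSketch {
    public static void main(String[] args) throws InterruptedException {
        SparkConf conf = new SparkConf().setAppName("kafkaManualCommitSketch");
        JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.seconds(5));

        Map<String, Object> kafkaParams = new HashMap<>();
        kafkaParams.put("bootstrap.servers", "broker1:9092");   // placeholder
        kafkaParams.put("key.deserializer", StringDeserializer.class);
        kafkaParams.put("value.deserializer", StringDeserializer.class);
        kafkaParams.put("group.id", "example-group");            // placeholder
        kafkaParams.put("enable.auto.commit", false);            // commit manually instead

        JavaInputDStream<ConsumerRecord<String, String>> stream =
                KafkaUtils.createDirectStream(
                        jssc,
                        LocationStrategies.PreferConsistent(),
                        ConsumerStrategies.<String, String>Subscribe(
                                Collections.singletonList("example-topic"), kafkaParams));

        stream.foreachRDD(rdd -> {
            // Capture this batch's offset ranges before any transformation
            // that loses the Kafka partition information
            OffsetRange[] offsetRanges = ((HasOffsetRanges) rdd.rdd()).offsetRanges();

            // ... output the batch here ...

            // Only after the output is done, commit the offsets back to Kafka
            ((CanCommitOffsets) stream.inputDStream()).commitAsync(offsetRanges);
        });

        jssc.start();
        jssc.awaitTermination();
    }
}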
The gist is that Kafka commits offsets automatically, but because the earlier code set enable.auto.commit to false, offsets were never committed automatically and have to be committed by hand. Following the API, the code can be modified as follows:

// (imports are the same as in the first version, plus the connector's
//  CanCommitOffsets and Spark's VoidFunction / JavaRDD)
public class javaStreamingDirect {
    public static void main(String[] args) throws InterruptedException {
        String logServiceProject = "teststreaming04";
        String logStoreName = "teststreaming4logstore";
        String loghubConsumerGroupName = "filter_info_count";
        //String loghubEndpoint = "teststreaming04.cn-hangzhou.log.aliyuncs.com";
        String loghubEndpoint = "teststreaming04.cn-hangzhou-intranet.log.aliyuncs.com";
        String accessKeyId = "";
        String accessKeySecret = "";

        Duration batchInterval = Durations.seconds(5);
        SparkConf conf = new SparkConf().setAppName("javaStreamingDirect");
        JavaStreamingContext ssc = new JavaStreamingContext(conf, batchInterval);

        ssc.checkpoint("oss://test/checkpoint/javaStreamingDirect");
        //ssc.checkpoint("D:/SparkData/streamingCheckPoint/directStream");

        HashMap<String, String> zkParam = new HashMap<>();
        zkParam.put("zookeeper.connect", "emr-worker-1,emr-header-2,emr-header-3:2181");
        //zkParam.put("zookeeper.connect", "192.168.96.119,192.168.96.118,192.168.96.117:2181");
        zkParam.put("enable.auto.commit", "false");

        JavaInputDStream<String> javaLoghubstream = LoghubUtils.createDirectStream(
                ssc,
                logServiceProject,
                logStoreName,
                loghubConsumerGroupName,
                accessKeyId,
                accessKeySecret,
                loghubEndpoint,
                zkParam,
                LogHubCursorPosition.END_CURSOR);

        // Commit the offsets of each batch back to the consumer group once
        // the batch has been processed
        javaLoghubstream.foreachRDD(new VoidFunction<JavaRDD<String>>() {
            @Override
            public void call(JavaRDD<String> stringJavaRDD) throws Exception {
                ((CanCommitOffsets) javaLoghubstream.inputDStream()).commitAsync();
            }
        });

        javaLoghubstream.print();

        ssc.start();
        ssc.awaitTermination();
    }
}

With this change, at the end of each batch, once its data has been consumed, the offsets of the consumed data are committed, and the next batch no longer re-consumes the same data.

Summary: Besides committing offsets, duplicate consumption can also be avoided by using the receiver-based approach to read the stream. However, to make the receiver highly available you must enable the WAL (write-ahead log) mechanism and write the received data to HDFS, which costs some performance. Alternatively, if you do not want to commit manually, you can set enable.auto.commit to true and let Kafka commit offsets automatically, but then data may be marked as consumed before the Spark job has had a chance to write its output, so consistency cannot be guaranteed. For the direct approach, committing offsets manually is therefore the recommended option.
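For reference, enabling the WAL for the receiver-based approach is mostly configuration. Below is a minimal sketch; the checkpoint path is a placeholder, and the receiver-based input stream itself is omitted because its exact creation call depends on the connector version:

import org.apache.spark.SparkConf;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaStreamingContext;

public class ReceiverWithWalSketch {
    public static void main(String[] args) throws InterruptedException {
        SparkConf conf = new SparkConf()
                .setAppName("receiverWithWAL")
                // Persist received data to a write-ahead log before it is
                // acknowledged, so it can be replayed after a driver failure
                .set("spark.streaming.receiver.writeAheadLog.enable", "true");

        JavaStreamingContext ssc = new JavaStreamingContext(conf, Durations.seconds(5));

        // The WAL is written under the checkpoint directory, so it should point
        // at a fault-tolerant filesystem such as HDFS or OSS (placeholder path)
        ssc.checkpoint("hdfs:///tmp/streaming/checkpoint");

        // ... create the receiver-based input stream here and define the
        //     processing logic ...

        ssc.start();
        ssc.awaitTermination();
    }
}

With the WAL enabled, the Spark documentation also suggests receiving data with a serialized, non-replicated storage level such as StorageLevel.MEMORY_AND_DISK_SER, since the log already provides durability.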
