A fix for Spark Streaming repeatedly consuming data from a message queue
Problem: when using Spark Streaming on E-MapReduce to consume data from Alibaba Cloud LogService (which can be treated as a Kafka-like message queue for producing and consuming data), every batch consumes all of the data sent so far, over and over again.
As the figure shows: after 16 records were sent to LogService, every batch consumed all of them.
The code is as follows:
import java.util.HashMap;

// Connector classes; package names follow the emr-logservice / loghub client SDKs.
import com.aliyun.openservices.loghub.client.config.LogHubCursorPosition;
import org.apache.spark.SparkConf;
import org.apache.spark.streaming.Duration;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.aliyun.logservice.LoghubUtils;
import org.apache.spark.streaming.api.java.JavaInputDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;

public class javaStreamingDirect {
    public static void main(String[] args) throws InterruptedException {
        // LogService project, logstore and consumer group to read from
        String logServiceProject = "teststreaming04";
        String logStoreName = "teststreaming4logstore";
        String loghubConsumerGroupName = "filter_info_count";
        //String loghubEndpoint = "teststreaming04.cn-hangzhou.log.aliyuncs.com";
        String loghubEndpoint = "teststreaming04.cn-hangzhou-intranet.log.aliyuncs.com";
        String accessKeyId = "";
        String accessKeySecret = "";

        Duration batchInterval = Durations.seconds(5);
        SparkConf conf = new SparkConf().setAppName("javaStreamingDirect");
        JavaStreamingContext ssc = new JavaStreamingContext(conf, batchInterval);
        ssc.checkpoint("oss://test/checkpoint/javaStreamingDirect");
        //ssc.checkpoint("D:/SparkData/streamingCheckPoint/directStream");

        HashMap<String, String> zkParam = new HashMap<>();
        zkParam.put("zookeeper.connect", "emr-worker-1,emr-header-2,emr-header-3:2181");
        //zkParam.put("zookeeper.connect", "192.168.96.119,192.168.96.118,192.168.96.117:2181");
        zkParam.put("enable.auto.commit", "false");

        // Direct stream from LogService, starting from the end cursor
        JavaInputDStream<String> javaLoghubstream = LoghubUtils.createDirectStream(
                ssc,
                logServiceProject,
                logStoreName,
                loghubConsumerGroupName,
                accessKeyId,
                accessKeySecret,
                loghubEndpoint,
                zkParam,
                LogHubCursorPosition.END_CURSOR);

        javaLoghubstream.print();
        ssc.start();
        ssc.awaitTermination();
    }
}
Digging through the official docs, I found the following in the Spark Streaming Kafka Direct integration section of the API documentation:
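The relevant part is the commitAsync example; paraphrased from the Kafka 0.10 integration guide, it looks roughly like this, where stream is assumed to be the JavaInputDStream<ConsumerRecord<String, String>> returned by KafkaUtils.createDirectStream and HasOffsetRanges, CanCommitOffsets and OffsetRange come from org.apache.spark.streaming.kafka010:

stream.foreachRDD(rdd -> {
    // Grab the offset ranges that this batch read
    OffsetRange[] offsetRanges = ((HasOffsetRanges) rdd.rdd()).offsetRanges();
    // some time later, after outputs have completed
    ((CanCommitOffsets) stream.inputDStream()).commitAsync(offsetRanges);
});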
Roughly, it says that Kafka commits offsets automatically by default, but because the code above set enable.auto.commit to false, offsets are never committed automatically and have to be committed manually. Following the API, the code can be modified as follows:
import java.util.HashMap;

// Connector classes; package names follow the emr-logservice / loghub client SDKs.
// Note that the CanCommitOffsets used here (with a no-argument commitAsync()) is the
// LogService connector's interface; the Kafka 0.10 one takes explicit offset ranges.
import com.aliyun.openservices.loghub.client.config.LogHubCursorPosition;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.function.VoidFunction;
import org.apache.spark.streaming.Duration;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.aliyun.logservice.CanCommitOffsets;
import org.apache.spark.streaming.aliyun.logservice.LoghubUtils;
import org.apache.spark.streaming.api.java.JavaInputDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;

public class javaStreamingDirect {
    public static void main(String[] args) throws InterruptedException {
        String logServiceProject = "teststreaming04";
        String logStoreName = "teststreaming4logstore";
        String loghubConsumerGroupName = "filter_info_count";
        //String loghubEndpoint = "teststreaming04.cn-hangzhou.log.aliyuncs.com";
        String loghubEndpoint = "teststreaming04.cn-hangzhou-intranet.log.aliyuncs.com";
        String accessKeyId = "";
        String accessKeySecret = "";

        Duration batchInterval = Durations.seconds(5);
        SparkConf conf = new SparkConf().setAppName("javaStreamingDirect");
        JavaStreamingContext ssc = new JavaStreamingContext(conf, batchInterval);
        ssc.checkpoint("oss://test/checkpoint/javaStreamingDirect");
        //ssc.checkpoint("D:/SparkData/streamingCheckPoint/directStream");

        HashMap<String, String> zkParam = new HashMap<>();
        zkParam.put("zookeeper.connect", "emr-worker-1,emr-header-2,emr-header-3:2181");
        //zkParam.put("zookeeper.connect", "192.168.96.119,192.168.96.118,192.168.96.117:2181");
        zkParam.put("enable.auto.commit", "false");

        JavaInputDStream<String> javaLoghubstream = LoghubUtils.createDirectStream(
                ssc,
                logServiceProject,
                logStoreName,
                loghubConsumerGroupName,
                accessKeyId,
                accessKeySecret,
                loghubEndpoint,
                zkParam,
                LogHubCursorPosition.END_CURSOR);

        // Commit each batch's offsets once the batch has been consumed
        javaLoghubstream.foreachRDD(new VoidFunction<JavaRDD<String>>() {
            @Override
            public void call(JavaRDD<String> stringJavaRDD) throws Exception {
                ((CanCommitOffsets) javaLoghubstream.inputDStream()).commitAsync();
            }
        });

        javaLoghubstream.print();
        ssc.start();
        ssc.awaitTermination();
    }
}
With this change, after each batch finishes consuming its data, the offsets of that batch's data are committed back to the consumer group, and subsequent batches no longer re-consume the same data.
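To tie the commit to the batch's own output, the same commitAsync call can also be issued at the end of the foreachRDD that does the batch's work, so offsets only advance once that work has finished. A minimal sketch, reusing javaLoghubstream and the imports from the listing above (the processing here is just a placeholder count):

javaLoghubstream.foreachRDD(new VoidFunction<JavaRDD<String>>() {
    @Override
    public void call(JavaRDD<String> rdd) throws Exception {
        // Do this batch's work first (placeholder: count and print the records).
        long count = rdd.count();
        System.out.println("records in this batch: " + count);
        // Only then commit this batch's offsets.
        ((CanCommitOffsets) javaLoghubstream.inputDStream()).commitAsync();
    }
});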
Summary: besides committing offsets, duplicate consumption can also be avoided by using a receiver-based stream to pull the data, but to guarantee high availability you then have to enable the WAL mechanism and write the received data to HDFS, which costs some performance. Alternatively, if you do not want to commit manually, you can set enable.auto.commit to true and let offsets be committed automatically; the risk is that data gets marked as consumed before the Spark program has managed to write out its results, so consistency cannot be guaranteed. So if you use the direct approach, committing offsets manually is the better choice.
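For reference, a rough sketch of the two alternatives mentioned above; the class and app names are illustrative, and the receiver-based LoghubUtils.createStream call is left out because its exact signature depends on the SDK version:

import java.util.HashMap;

import org.apache.spark.SparkConf;

public class OffsetAlternativesSketch {
    public static void main(String[] args) {
        // Alternative 1: receiver-based stream with the write-ahead log enabled,
        // so received data is persisted to the checkpoint directory (HDFS/OSS)
        // before processing, at some cost to throughput.
        SparkConf conf = new SparkConf()
                .setAppName("receiverWithWAL")
                .set("spark.streaming.receiver.writeAheadLog.enable", "true");

        // Alternative 2: keep the direct stream but let offsets be committed
        // automatically; offsets may then advance before the batch's output
        // has actually been written.
        HashMap<String, String> zkParam = new HashMap<>();
        zkParam.put("enable.auto.commit", "true");

        System.out.println(conf.get("spark.streaming.receiver.writeAheadLog.enable")
                + " / " + zkParam.get("enable.auto.commit"));
    }
}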