Tutorial:
http://spark.apache.org/docs/2.2.0/streaming-kafka-0-10-integration.html
pom:
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-core_2.11</artifactId>
    <version>2.3.1</version>
</dependency>
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-streaming_2.11</artifactId>
    <version>2.3.1</version>
</dependency>
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-streaming-kafka-0-10_2.11</artifactId>
    <version>2.3.1</version>
</dependency>
// SPACE (java.util.regex.Pattern) is used below to split each line into words
final Pattern SPACE = Pattern.compile(" ");
String brokers = "127.0.0.1:9092";
String topics = "helloSpark";
// Create context with a 2 seconds batch interval
SparkConf sparkConf = new SparkConf().setMaster("local").setAppName("JavaDirectKafkaWordCount");
JavaStreamingContext jssc = new JavaStreamingContext(sparkConf, Durations.seconds(2));
Set<String> topicsSet = new HashSet<>(Arrays.asList(topics.split(",")));
Map<String, Object> kafkaParams = new HashMap<>();
kafkaParams.put("bootstrap.servers", brokers);
kafkaParams.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
kafkaParams.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
kafkaParams.put("serializer.class", "kafka.serializer.StringEncoder");
kafkaParams.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
kafkaParams.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
kafkaParams.put("group.id","helloSpark");
// Create direct kafka stream with brokers and topics
JavaInputDStream<ConsumerRecord<String, String>> messages =
    KafkaUtils.createDirectStream(
        jssc,
        LocationStrategies.PreferConsistent(),
        ConsumerStrategies.<String, String>Subscribe(topicsSet, kafkaParams));
// Get the lines, split them into words, count the words and print
JavaDStream<String> lines = messages.map(ConsumerRecord::value);
JavaDStream<String> words = lines.flatMap(x -> Arrays.asList(SPACE.split(x)).iterator());
JavaPairDStream<String, Integer> wordCounts = words.mapToPair(s->new Tuple2<>(s,1)).reduceByKey((i1, i2) -> i1 + i2);
wordCounts.print();
// Start the computation
jssc.start();
jssc.awaitTermination();
jssc.stop(false);
1. LocationStrategies
The new Kafka consumer API pre-fetches messages into buffers. For performance reasons, Spark keeps cached consumers on the executors (rather than recreating them for each batch) and prefers to schedule partitions on the hosts that already hold the appropriate consumers.
In most cases, use LocationStrategies.PreferConsistent as shown above; it distributes partitions evenly across the available executors. If your executors are on the same hosts as your Kafka brokers, use PreferBrokers, which prefers to schedule partitions on the Kafka leader for each partition. If the load is significantly skewed across partitions, use PreferFixed, which lets you specify an explicit mapping of partitions to hosts (see the sketch below).
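A minimal sketch of PreferFixed, assuming the two partitions of helloSpark should be read on two specific hosts (the host names are hypothetical; TopicPartition comes from org.apache.kafka.common):

Map<TopicPartition, String> partitionHosts = new HashMap<>();
partitionHosts.put(new TopicPartition("helloSpark", 0), "kafka-host-1"); // hypothetical host
partitionHosts.put(new TopicPartition("helloSpark", 1), "kafka-host-2"); // hypothetical host
JavaInputDStream<ConsumerRecord<String, String>> pinned =
    KafkaUtils.createDirectStream(
        jssc,
        LocationStrategies.PreferFixed(partitionHosts),
        ConsumerStrategies.<String, String>Subscribe(topicsSet, kafkaParams));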
The consumer cache has a default maximum size of 64. If you expect to handle more than (64 × number of executors) Kafka partitions, you can raise it via spark.streaming.kafka.consumer.cache.maxCapacity.
To disable caching of Kafka consumers, set spark.streaming.kafka.consumer.cache.enabled to false. Disabling the cache may be needed to work around the problem described in SPARK-19185; the property may be removed in a later Spark version once SPARK-19185 is resolved.
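A small configuration sketch for these two settings (the capacity value 128 is illustrative):

SparkConf conf = new SparkConf()
    .setMaster("local").setAppName("JavaDirectKafkaWordCount")
    // raise the consumer cache above the default of 64 ...
    .set("spark.streaming.kafka.consumer.cache.maxCapacity", "128")
    // ... or disable consumer caching entirely (e.g. to work around SPARK-19185)
    .set("spark.streaming.kafka.consumer.cache.enabled", "false");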
2. ConsumerStrategies
There are three strategies for subscribing to topics:
- ConsumerStrategies.Subscribe, as shown above, subscribes to a fixed collection of topics
- SubscribePattern uses a regular expression to specify the topics of interest
- Assign specifies a fixed collection of partitions
All three strategies have overloaded constructors that let you specify the starting offset for particular partitions; the latter two are sketched below.
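A short sketch of SubscribePattern and Assign (the regex, partition, and starting offset are illustrative values, not from the original example):

// SubscribePattern: subscribe to every topic whose name matches a regex
JavaInputDStream<ConsumerRecord<String, String>> byPattern =
    KafkaUtils.createDirectStream(
        jssc,
        LocationStrategies.PreferConsistent(),
        ConsumerStrategies.<String, String>SubscribePattern(Pattern.compile("hello.*"), kafkaParams));

// Assign: consume a fixed set of partitions, here starting from offset 0 on partition 0
Map<TopicPartition, Long> fromOffsets = new HashMap<>();
fromOffsets.put(new TopicPartition("helloSpark", 0), 0L);
JavaInputDStream<ConsumerRecord<String, String>> assigned =
    KafkaUtils.createDirectStream(
        jssc,
        LocationStrategies.PreferConsistent(),
        ConsumerStrategies.<String, String>Assign(fromOffsets.keySet(), kafkaParams, fromOffsets));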
Batch processing over a specified offset range
If you have a use case that is better suited to batch processing, you can create an RDD for a defined range of offsets:
OffsetRange[] offsetRanges = {
    // topic, partition, inclusive starting offset, exclusive ending offset
    OffsetRange.create("test", 0, 0, 100),
    OffsetRange.create("test", 1, 0, 100)
};
JavaRDD<ConsumerRecord<String, String>> rdd = KafkaUtils.createRDD(
    sparkContext,
    kafkaParams,
    offsetRanges,
    LocationStrategies.PreferConsistent()
);
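The returned RDD behaves like any other batch RDD; for example, a small follow-up pulling out the message values (assuming sparkContext above is a JavaSparkContext):

JavaRDD<String> values = rdd.map(ConsumerRecord::value);
System.out.println("records in range: " + values.count());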