Kafka offset & Flink & Spark Structured Streaming

Preface

Kafka has the concept of an offset: for each groupId, the offset records the committed read position in every partition of every topic. When a consumer program fails and restarts, it can resume reading data from that position.

You can inspect a groupId's offsets with the following command:

root@h2:~# /opt/kafka/bin/kafka-consumer-groups.sh --bootstrap-server h2:9092 --group mabb-g1 --describe
Note: This will only show information about consumers that use the Java consumer API (non-ZooKeeper-based consumers).

Consumer group 'mabb-g1' has no active members.

TOPIC                          PARTITION  CURRENT-OFFSET  LOG-END-OFFSET  LAG        CONSUMER-ID                                       HOST                           CLIENT-ID
test6                          0          39156           61587           22431      -                                                 -                              -
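
The same numbers can also be read programmatically. Below is a minimal sketch using the Java client; the class name OffsetInspector is made up for illustration, and the broker address, group id and topic simply mirror the CLI example above. It prints the committed offset, log-end offset and lag for each partition:

import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.consumer.OffsetAndMetadata;
import org.apache.kafka.common.TopicPartition;
import org.apache.kafka.common.serialization.StringDeserializer;

import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Properties;

public class OffsetInspector {

    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "h2:9092");      // broker address from the example above
        props.put("group.id", "mabb-g1");                // group whose committed offsets we read
        props.put("key.deserializer", StringDeserializer.class.getName());
        props.put("value.deserializer", StringDeserializer.class.getName());

        KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);
        try {
            // all partitions of the topic
            List<TopicPartition> partitions = new ArrayList<>();
            consumer.partitionsFor("test6").forEach(p -> partitions.add(new TopicPartition(p.topic(), p.partition())));
            // LOG-END-OFFSET per partition
            Map<TopicPartition, Long> endOffsets = consumer.endOffsets(partitions);
            for (TopicPartition tp : partitions) {
                // committed() returns null if this group has never committed an offset for the partition
                OffsetAndMetadata committed = consumer.committed(tp);
                long end = endOffsets.get(tp);
                long current = committed == null ? -1L : committed.offset();
                System.out.println(tp + "  CURRENT-OFFSET=" + current
                        + "  LOG-END-OFFSET=" + end
                        + "  LAG=" + (current < 0 ? end : end - current));
            }
        } finally {
            consumer.close();
        }
    }
}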

This post covers offset-related topics for the native Kafka API, Flink, and Spark Structured Streaming.

Kafka API

Dependency:

        <dependency>
            <groupId>org.apache.kafka</groupId>
            <artifactId>kafka-clients</artifactId>
            <version>1.0.0</version>
        </dependency>

Consumer code:

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

import java.util.Arrays;
import java.util.Properties;

public class KafkaConsumerTest implements Runnable {

    private final KafkaConsumer<String, String> consumer;
    private ConsumerRecords<String, String> msgList;
    private String topic;
    private static final String GROUPID = "mabb-g1";

    public static void main(String[] args) {
        KafkaConsumerTest test1 = new KafkaConsumerTest("test6");
        Thread thread1 = new Thread(test1);
        thread1.start();
    }

    public KafkaConsumerTest(String topicName) {
        Properties props = new Properties();
        //Kafka broker addresses
        props.put("bootstrap.servers", "h2:9092,h3:9092");
        //consumer group name; a different group id can consume the same data again
        props.put("group.id", GROUPID);
        //whether to auto-commit offsets
        props.put("enable.auto.commit", "true");
        //interval between automatic offset commits
        props.put("auto.commit.interval.ms", "100");
        //session timeout
        props.put("session.timeout.ms", "30000");
        //maximum number of records returned by a single poll
        props.put("max.poll.records", 20);
//		earliest: if a partition has a committed offset, consume from it; otherwise consume from the beginning
//		latest: if a partition has a committed offset, consume from it; otherwise consume only data newly produced to that partition
//		none: if every partition of the topic has a committed offset, consume from those offsets; if any partition lacks one, throw an exception
//        props.put("auto.offset.reset", "earliest");
        //key/value deserializers
        props.put("key.deserializer", StringDeserializer.class.getName());
        props.put("value.deserializer", StringDeserializer.class.getName());
        this.consumer = new KafkaConsumer<String, String>(props);
        this.topic = topicName;
        //subscribe to the topic list
        this.consumer.subscribe(Arrays.asList(topic));
    }

    @Override
    public void run() {
        int messageNo = 1;
        try {
            for (; ; ) {
                msgList = consumer.poll(20);
                if (null != msgList && msgList.count() > 0) {
                    for (ConsumerRecord<String, String> record : msgList) {
                        System.out.println(messageNo + "=======receive: key = " + record.key() + ", value = " + record.value() + " offset===" + record.offset());
                        messageNo++;
                    }
                } else {
                    Thread.sleep(10000);
                }
            }
        } catch (InterruptedException e) {
            e.printStackTrace();
        } finally {
            consumer.close();
        }
    }
}

attention:

1. Only when ***enable.auto.commit*** is set to true does the client periodically commit the current offset to the broker.

2. If ***auto.offset.reset*** is set to none, consumption resumes from the previously committed offsets; if any partition has no committed offset, the consumer throws an exception.

Flink

Flink commits offsets to Kafka and leaves it to Kafka to keep them, so on failure recovery consumption continues from the last position.

Code:

        Properties properties = new Properties();
        properties.setProperty("bootstrap.servers", "h2:9092");
        properties.setProperty("group.id", "dev_test1");
        properties.setProperty("enable.auto.commit", "true");

        FlinkKafkaConsumer011<String> myConsumer = new FlinkKafkaConsumer011<>("topic1", new SimpleStringSchema(), properties);
        // start from the latest record
//        myConsumer.setStartFromLatest();
        // start from the earliest record
//        myConsumer.setStartFromEarliest();
        // start from the wall-clock time at which the records arrived in Kafka
//        long time = System.currentTimeMillis() - TimeUnit.MINUTES.toMillis(2);
//        myConsumer.setStartFromTimestamp(time);
        // start from the group offsets maintained in Kafka
        myConsumer.setStartFromGroupOffsets();

        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        DataStream<String> dataSource1 = env.addSource(myConsumer, "kafka-source-topic1");

The commented-out lines above show the four positions consumption can start from: latest, earliest, a physical timestamp, and the group offsets kept in Kafka.

Spark Streaming

How Spark Streaming maintains Kafka offsets is not covered in this post.

Spark Structured Streaming

Spark Structured Streaming does not use the offsets kept in Kafka; it maintains its own offsets in the checkpoint, so the Kafka offset-related settings are somewhat special. Quoting the integration guide:

Note that the following Kafka params cannot be set and the Kafka source or sink will throw an exception:

group.id: Kafka source will create a unique group id for each query automatically.

auto.offset.reset: Set the source option startingOffsets to specify where to start instead. Structured Streaming manages which offsets are consumed internally, rather than rely on the kafka Consumer to do it. This will ensure that no data is missed when new topics/partitions are dynamically subscribed. Note that startingOffsets only applies when a new streaming query is started, and that resuming will always pick up from where the query left off.

enable.auto.commit: Kafka source doesn’t commit any offset.

Pay particular attention to the description of auto.offset.reset.

Kafka's own parameters are passed as follows:

Kafka’s own configurations can be set via DataStreamReader.option with kafka. prefix, e.g, stream.option(“kafka.bootstrap.servers”, “host:port”).
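
Putting the two notes above together, a minimal reader might look like the following sketch. It uses the Java API (the Scala DataStreamReader takes the same options); the class name, app name, broker address and topic are assumptions. Kafka's own settings carry the kafka. prefix, and the start position is given via startingOffsets instead of auto.offset.reset:

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class StructuredKafkaRead {

    public static void main(String[] args) throws Exception {
        SparkSession spark = SparkSession.builder()
                .appName("structured-kafka-offset-demo")        // assumed app name
                .getOrCreate();

        // group.id / auto.offset.reset / enable.auto.commit must NOT be set here;
        // Kafka's own settings use the "kafka." prefix
        Dataset<Row> df = spark.readStream()
                .format("kafka")
                .option("kafka.bootstrap.servers", "h2:9092")   // assumed broker address
                .option("subscribe", "topic1")                  // assumed topic
                // only used the first time the query starts; on restart the checkpoint wins
                .option("startingOffsets", "earliest")
                .load();

        // write out with a checkpoint location, mirroring the Scala example below
        df.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
                .writeStream()
                .outputMode("append")
                .option("checkpointLocation", "/path/to/checkpoint")
                .format("console")
                .start()
                .awaitTermination();
    }
}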

By setting a checkpoint location, Spark maintains the offsets itself, and when the job restarts it continues consuming from where it last stopped. Code:

    val query = df2.writeStream
      .outputMode("append")
      .option("checkpointLocation", "/path/to/checkpoint")
      .format("console")
      .option("truncate", "false")
      .start()

Note that startingOffsets only applies when a new streaming query is started, and that resuming will always pick up from where the query left off.

attention

The interval at which the Kafka client sends offsets to the broker is set by ***auto.commit.interval.ms***. If records have been pulled but their offsets have not yet been committed when the program crashes, those records will be consumed again on the next run.
To avoid this when using the Kafka API directly, you can commit offsets manually, as sketched below.
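
A minimal sketch of the manual-commit variant (the class name is made up; broker, group and topic mirror the earlier example): auto commit is disabled and commitSync() is called only after the batch has been processed, so at most the current batch is replayed after a crash.

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

import java.util.Arrays;
import java.util.Properties;

public class ManualCommitConsumer {

    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "h2:9092");     // assumed broker address
        props.put("group.id", "mabb-g1");
        props.put("enable.auto.commit", "false");      // switch off periodic auto commit
        props.put("key.deserializer", StringDeserializer.class.getName());
        props.put("value.deserializer", StringDeserializer.class.getName());

        KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);
        consumer.subscribe(Arrays.asList("test6"));
        try {
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(1000);
                for (ConsumerRecord<String, String> record : records) {
                    // process the record first ...
                    System.out.println("offset=" + record.offset() + " value=" + record.value());
                }
                // ... then commit, so unprocessed records are never marked as consumed
                consumer.commitSync();
            }
        } finally {
            consumer.close();
        }
    }
}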

With Flink, offsets are committed only after a checkpoint completes, and checkpoints themselves run on an interval, so a small amount of re-read data may still occur. Code:

import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.runtime.state.filesystem.FsStateBackend;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer011;

import java.util.Properties;

public class KafkaTest {

    public static void main(String[] args) throws Exception {
        Properties properties = new Properties();
        properties.setProperty("bootstrap.servers", "h2:9092");
        properties.setProperty("group.id", "test-g1");
        properties.setProperty("enable.auto.commit", "true");

        FlinkKafkaConsumer011<String> myConsumer = new FlinkKafkaConsumer011<>("topic1", new SimpleStringSchema(), properties);
        myConsumer.setStartFromGroupOffsets();

        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // checkpoint every second; offsets are committed back to Kafka when a checkpoint completes
        env.enableCheckpointing(1000);
        env.setStateBackend(new FsStateBackend("file:///Users/mabinbin/delete/checkpoint_dir/" + KafkaTest.class.getName()));

        DataStream<String> dataSource1 = env.addSource(myConsumer, "kafka-source-topic1");
        dataSource1.print();
        env.execute("KafkaTest");
    }
}

References

Structured Streaming + Kafka Integration Guide (Kafka broker version 0.10.0 or higher)

Flink 小貼士 (2):Flink 如何管理 Kafka 消費位點
