Preface
Kafka has the concept of an offset: for each groupId, Kafka records the committed read position in every partition of every topic. When a consumer program fails and restarts, it can resume reading from that position.
You can inspect a groupId's offsets with the following command:
root@h2:~# /opt/kafka/bin/kafka-consumer-groups.sh --bootstrap-server h2:9092 --group mabb-g1 --describe
Note: This will only show information about consumers that use the Java consumer API (non-ZooKeeper-based consumers).
Consumer group 'mabb-g1' has no active members.
TOPIC PARTITION CURRENT-OFFSET LOG-END-OFFSET LAG CONSUMER-ID HOST CLIENT-ID
test6 0 39156 61587 22431 - - -
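The LAG column above is simply the difference between LOG-END-OFFSET and CURRENT-OFFSET, i.e. how many records the group still has to read:

```java
public class LagExample {
    // lag = records produced to the partition but not yet read by the group
    static long lag(long logEndOffset, long currentOffset) {
        return logEndOffset - currentOffset;
    }

    public static void main(String[] args) {
        // numbers taken from the describe output above for test6 partition 0
        System.out.println(lag(61587L, 39156L)); // prints 22431
    }
}
```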
This post covers offset-related behavior in the native Kafka API, Flink, and Spark Structured Streaming.
Kafka API
Dependency:
<dependency>
    <groupId>org.apache.kafka</groupId>
    <artifactId>kafka-clients</artifactId>
    <version>1.0.0</version>
</dependency>
Consumer code:
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;
import java.util.Arrays;
import java.util.Properties;
public class KafkaConsumerTest implements Runnable {

    private final KafkaConsumer<String, String> consumer;
    private ConsumerRecords<String, String> msgList;
    private final String topic;
    private static final String GROUPID = "mabb-g1";

    public static void main(String[] args) {
        KafkaConsumerTest test1 = new KafkaConsumerTest("test6");
        Thread thread1 = new Thread(test1);
        thread1.start();
    }

    public KafkaConsumerTest(String topicName) {
        Properties props = new Properties();
        // Kafka broker addresses
        props.put("bootstrap.servers", "h2:9092,h3:9092,h3:9092");
        // consumer group name; different groups consume the same data independently
        props.put("group.id", GROUPID);
        // whether to commit offsets automatically
        props.put("enable.auto.commit", "true");
        // interval between automatic offset commits
        props.put("auto.commit.interval.ms", "100");
        // session timeout
        props.put("session.timeout.ms", "30000");
        // maximum number of records returned by a single poll
        props.put("max.poll.records", 20);
        // earliest: resume from the committed offset if one exists; otherwise consume from the beginning
        // latest: resume from the committed offset if one exists; otherwise consume only newly produced data
        // none: resume from the committed offset if every partition has one; throw an exception as soon as any partition has no committed offset
        // props.put("auto.offset.reset", "earliest");
        // deserializers
        props.put("key.deserializer", StringDeserializer.class.getName());
        props.put("value.deserializer", StringDeserializer.class.getName());
        this.consumer = new KafkaConsumer<String, String>(props);
        this.topic = topicName;
        // subscribe to the topic list
        this.consumer.subscribe(Arrays.asList(topic));
    }

    @Override
    public void run() {
        int messageNo = 1;
        try {
            for (;;) {
                msgList = consumer.poll(20);
                if (null != msgList && msgList.count() > 0) {
                    for (ConsumerRecord<String, String> record : msgList) {
                        System.out.println(messageNo + "=======receive: key = " + record.key()
                                + ", value = " + record.value() + " offset===" + record.offset());
                        messageNo++;
                    }
                } else {
                    Thread.sleep(10000);
                }
            }
        } catch (InterruptedException e) {
            e.printStackTrace();
        } finally {
            consumer.close();
        }
    }
}
attention:
1. ***enable.auto.commit*** must be set to true for the client to commit its current offsets to the server at a fixed interval.
2. Setting ***auto.offset.reset*** to none resumes consumption from the previously committed offset, and throws an exception if any partition has no committed offset.
Flink
Flink commits offsets back to Kafka and lets Kafka track them, so on failure recovery consumption continues from the last committed position.
The code is as follows:
Properties properties = new Properties();
properties.setProperty("bootstrap.servers", "h2:9092");
properties.setProperty("group.id", "dev_test1");
properties.setProperty("enable.auto.commit", "true");
FlinkKafkaConsumer011<String> myConsumer = new FlinkKafkaConsumer011("topic1", new SimpleStringSchema(), properties);
// start from the latest record
// myConsumer.setStartFromLatest();
// start from the earliest record
// myConsumer.setStartFromEarliest();
// start from the timestamp at which records physically arrived in Kafka
// long time = System.currentTimeMillis() - TimeUnit.MINUTES.toMillis(2);
// myConsumer.setStartFromTimestamp(time);
// start from the group offsets tracked by Kafka
myConsumer.setStartFromGroupOffsets();
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
DataStream<String> dataSource1 = env.addSource(myConsumer, "kafka-source-topic1");
The lines above show the four positions from which Flink can start consuming Kafka: latest, earliest, a timestamp, and the committed group offsets.
Spark Streaming
This post does not cover how Spark Streaming maintains Kafka offsets.
Spark Structured Streaming
Spark Structured Streaming does not use Kafka-managed offsets; it maintains its own offsets in the checkpoint, so the offset-related Kafka settings behave differently:
Note that the following Kafka params cannot be set and the Kafka source or sink will throw an exception:
Parameter | Description |
---|---|
group.id | Kafka source will create a unique group id for each query automatically. |
auto.offset.reset | Set the source option startingOffsets to specify where to start instead. Structured Streaming manages which offsets are consumed internally, rather than rely on the kafka Consumer to do it. This will ensure that no data is missed when new topics/partitions are dynamically subscribed. Note that startingOffsets only applies when a new streaming query is started, and that resuming will always pick up from where the query left off. |
enable.auto.commit | Kafka source doesn’t commit any offset. |
Pay particular attention to the note on auto.offset.reset.
Kafka's own parameters are configured as follows:
Kafka’s own configurations can be set via DataStreamReader.option with kafka. prefix, e.g, stream.option(“kafka.bootstrap.servers”, “host:port”).
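As a sketch (broker address and topic assumed, matching the earlier examples), the Kafka source with the startingOffsets option that replaces auto.offset.reset can be configured like this:

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class StructuredKafkaRead {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("StructuredKafkaRead")
                .getOrCreate();

        // Kafka's own settings carry the "kafka." prefix; startingOffsets is a
        // source option and only applies when a new query starts
        Dataset<Row> df = spark.readStream()
                .format("kafka")
                .option("kafka.bootstrap.servers", "h2:9092")
                .option("subscribe", "topic1")
                .option("startingOffsets", "latest")
                .load();
    }
}
```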
By configuring a checkpoint location, Spark maintains the offsets itself, and a restarted job resumes from where it stopped:
val query = df2.writeStream
.outputMode("append")
.option("checkpointLocation", "/path/to/checkpoint")
.format("console")
.option("truncate", "false")
.start()
Note that startingOffsets only applies when a new streaming query is started, and that resuming will always pick up from where the query left off.
attention
The Kafka client sends offsets to the server at the interval specified by ***auto.commit.interval.ms***. If the program crashes after fetching data but before committing its offsets, that data is consumed again on the next restart.
To avoid this when using the Kafka API directly, you can commit offsets manually.
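A minimal sketch of manual committing, assuming the same broker, topic, and group as above: disable auto-commit and call commitSync() only after the fetched batch has been fully processed.

```java
import java.util.Arrays;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class ManualCommitTest {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "h2:9092");
        props.put("group.id", "mabb-g1");
        // turn off automatic commits; offsets only advance via commitSync() below
        props.put("enable.auto.commit", "false");
        props.put("key.deserializer", StringDeserializer.class.getName());
        props.put("value.deserializer", StringDeserializer.class.getName());

        KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);
        consumer.subscribe(Arrays.asList("test6"));
        try {
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(100);
                for (ConsumerRecord<String, String> record : records) {
                    System.out.println("offset=" + record.offset() + ", value=" + record.value());
                }
                // commit only after the whole batch has been processed, so a crash
                // before this line re-reads the batch instead of skipping it
                if (!records.isEmpty()) {
                    consumer.commitSync();
                }
            }
        } finally {
            consumer.close();
        }
    }
}
```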
With Flink, offsets are committed only after a checkpoint completes, and checkpoints themselves run on a fixed interval, so a small amount of duplicate reads can still occur. The code is as follows:
public class KafkaTest {
    public static void main(String[] args) throws Exception {
        Properties properties = new Properties();
        properties.setProperty("bootstrap.servers", "h2:9092");
        properties.setProperty("group.id", "test-g1");
        properties.setProperty("enable.auto.commit", "true");
        FlinkKafkaConsumer011<String> myConsumer = new FlinkKafkaConsumer011("topic1", new SimpleStringSchema(), properties);
        myConsumer.setStartFromGroupOffsets();
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.enableCheckpointing(1000);
        env.setStateBackend(new FsStateBackend("file:///Users/mabinbin/delete/checkpoint_dir/" + KafkaTest.class.getName()));
        DataStream<String> dataSource1 = env.addSource(myConsumer, "kafka-source-test9");
        dataSource1.print();
        env.execute("KafkaTest");
    }
}
References
Structured Streaming + Kafka Integration Guide (Kafka broker version 0.10.0 or higher)