Streaming Connectors
Predefined Sources and Sinks
- File-based sources
- readTextFile(path)
- readFile(fileInputFormat, path)
- File-based sinks
- writeAsText
- writeAsCsv
- Socket-based source
- socketTextStream
- Socket-based sink
- writeToSocket
- Collection- and iterator-based sources
- fromCollection, fromElements
- Standard output / standard error sinks
- print, printToErr
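As a quick illustration, the minimal sketch below wires a few of these predefined sources and sinks together; the file paths, host, and port are placeholders.
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class PredefinedSourcesAndSinks {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        // File-based source: each line of the text file becomes one String record
        DataStream<String> lines = env.readTextFile("/path/to/input.txt");
        // Collection-based source: convenient for tests and demos
        DataStream<String> words = env.fromElements("flink", "kafka", "connector");
        // Socket-based source: newline-delimited text read from a socket
        DataStream<String> socketLines = env.socketTextStream("localhost", 9999, "\n");
        // File-based sink, standard output and standard error sinks
        lines.writeAsText("/path/to/output.txt");
        words.print();
        socketLines.printToErr();
        env.execute("Predefined sources and sinks");
    }
}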
A SocketWindowWordCount example
public class SocketWindowWordCount {
public static void main(String[] args) throws Exception {
// The port number, declared final so it can only be assigned once
final int port;
try {
final ParameterTool params = ParameterTool.fromArgs(args);
port = params.getInt("port");
} catch (Exception e) {
System.err.println("No port specified. Please run 'SocketWindowWordCount --port <port>'");
return;
}
// (1) Obtain the execution environment
final StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
// (2) Obtain the data stream; here, text entered by the user on the given socket port
DataStream<String> text = env.socketTextStream("localhost", port, "\n");
// (3) Transformations: apply the computation to the data stream
DataStream<WordWithCount> windowCounts = text
//Split the input text on whitespace into individual words and emit each one into the Collector named out
.flatMap(new FlatMapFunction<String, WordWithCount>() {
public void flatMap(String value, Collector<WordWithCount> out) {
for (String word : value.split("\\s")) {
out.collect(new WordWithCount(word, 1L));
}
}
})
//Partition the input into disjoint groups, each containing elements with the same key; identical words end up in the same group, and the reduce in the next step counts them
.keyBy("word")
//Sliding window: every 1 second, compute over the last 5 seconds
.timeWindow(Time.seconds(5), Time.seconds(1))
//A "rolling" reduce on the KeyedStream: it combines the previously reduced value with the current element and emits the new value.
//Here, the two input objects are merged by summing the counts for the word
.reduce(new ReduceFunction<WordWithCount>() {
public WordWithCount reduce(WordWithCount a, WordWithCount b) {
return new WordWithCount(a.word, a.count + b.count);
}
});
// Print the result with a parallelism of 1 (a single output thread)
windowCounts.print().setParallelism(1);
env.execute("Socket Window WordCount");
}
// Data type for the word counts: two fields and three methods
public static class WordWithCount {
//The word and the number of times it has been seen
public String word;
public long count;
public WordWithCount() {}
public WordWithCount(String word, long count) {
this.word = word;
this.count = count;
}
@Override
public String toString() {
return word + " : " + count;
}
}
}
Bundled Connectors
- Apache Kafka(source/sink)
- Apache Cassandra(sink)
- Amazon Kinesis Streams(source/sink)
- Elasticsearch(sink)
- Hadoop FileSystem(sink)
- RabbitMQ(source/sink)
- Apache NiFi(source/sink)
- Twitter Streaming API(source)
The streaming connectors above are part of the Flink project, but they are not included in the binary distribution
Connectors in Apache Bahir
- Apache ActiveMQ(source/sink)
- Apache Flume(sink)
- Redis(sink)
- Akka(sink)
- Netty(source)
Async I/O
- Connectors are not the only way to get data into or out of Flink
- Async I/O can be used instead of blocking lookups in Map/FlatMap to read from external databases and similar systems (see the sketch below)
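The sketch below shows the Async I/O pattern with a placeholder lookup standing in for a real asynchronous database client; it uses the RichAsyncFunction/ResultFuture names from Flink 1.5+ (earlier releases pass an AsyncCollector instead).
import java.util.Collections;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.TimeUnit;
import org.apache.flink.streaming.api.datastream.AsyncDataStream;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.async.ResultFuture;
import org.apache.flink.streaming.api.functions.async.RichAsyncFunction;

public class AsyncDatabaseLookup {
    // Asynchronously "looks up" each key; the supplyAsync body stands in for a real async client call
    public static class AsyncLookupFunction extends RichAsyncFunction<String, String> {
        private static final long serialVersionUID = 1L;

        @Override
        public void asyncInvoke(String key, ResultFuture<String> resultFuture) {
            CompletableFuture
                    .supplyAsync(() -> key + " -> value-from-external-store")
                    .thenAccept(value -> resultFuture.complete(Collections.singleton(value)));
        }
    }

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        DataStream<String> keys = env.fromElements("a", "b", "c");
        // At most 100 in-flight requests, 5 second timeout per request, results may arrive unordered
        DataStream<String> enriched = AsyncDataStream.unorderedWait(
                keys, new AsyncLookupFunction(), 5, TimeUnit.SECONDS, 100);
        enriched.print();
        env.execute("Async I/O lookup");
    }
}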
Flink Kafka Connector
Flink Kafka Consumer
- Deserializing data
- Configuring the start position for consumption
- Dynamic topic and partition discovery
- Offset commit modes
- Timestamp extraction / watermark generation
Flink Kafka Producer
- Producer partitioning
- Fault tolerance
Producer example
public class WriteIntoKafka {
public static void main(String[] args) throws Exception {
// create execution environment
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
Map<String, String> properties = new HashMap<>();
properties.put("bootstrap.servers", "/* broker address */");
properties.put("topic", "/* topic */");
// parse user parameters
ParameterTool parameterTool = ParameterTool.fromMap(properties);
// add a simple source which is writing some strings
DataStream<String> messageStream = env.addSource(new SimpleStringGenerator());
// write stream to Kafka
messageStream.addSink(new FlinkKafkaProducer010<>(parameterTool.getRequired("bootstrap.servers"),
parameterTool.getRequired("topic"),
new SimpleStringSchema()));
// also print the produced records to standard out
messageStream.print();
env.execute();
}
public static class SimpleStringGenerator implements SourceFunction<String> {
// explicit serialVersionUID for the serializable source function
private static final long serialVersionUID = 1L;
volatile boolean running = true;
long counter = 0;
@Override
public void run(SourceContext<String> ctx) throws Exception {
while(running) {
// emit a simple generated string; a real job would produce meaningful records here
ctx.collect("message-" + counter++);
Thread.sleep(100);
}
}
@Override
public void cancel() {
running = false;
}
}
}
Consumer example
public class ReadFromKafka {
public static void main(String[] args) throws Exception {
// create execution environment
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
Map<String, String> properties = new HashMap<>();
properties.put("bootstrap.servers", "/* broker address */");
properties.put("group.id", "test");
properties.put("enable.auto.commit", "true");
properties.put("auto.commit.interval.ms", "1000");
properties.put("auto.offset.reset", "earliest");
properties.put("session.timeout.ms", "30000");
properties.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
properties.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
properties.put("topic", "/* topic */");
// parse user parameters
ParameterTool parameterTool = ParameterTool.fromMap(properties);
FlinkKafkaConsumer010<String> consumer010 = new FlinkKafkaConsumer010<>(
parameterTool.getRequired("topic"), new SimpleStringSchema(), parameterTool.getProperties());
DataStream<String> messageStream = env
.addSource(consumer010);
// print() will write the contents of the stream to the TaskManager's standard out stream
// the rebalance call causes a repartitioning of the data so that all machines
// see the messages (for example in cases when "num kafka partitions" < "num flink operators")
messageStream.rebalance().map(new MapFunction<String, String>() {
private static final long serialVersionUID = 1L;
@Override
public String map(String value) throws Exception {
return value;
}
}).print();
env.execute();
}
}
Kafka Consumer: deserializing data
- Converts the binary data stored in Kafka into concrete Java/Scala objects
- DeserializationSchema: T deserialize(byte[] message) (a custom implementation is sketched after this list)
- KeyedDeserializationSchema: T deserialize(byte[] messageKey, byte[] message, String topic, int partition, long offset) - used when the Kafka key/value and the topic, partition, and offset need to be accessed
Commonly used schemas
- SimpleStringSchema: serializes and deserializes messages as strings
- TypeInformationSerializationSchema: builds the schema from Flink's TypeInformation
- JsonDeserializationSchema: uses Jackson to deserialize JSON messages and returns an ObjectNode; fields can be accessed with .get("property")
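A minimal sketch of a custom DeserializationSchema, assuming the Kafka values are UTF-8 strings of the form "id,payload" and using a hypothetical Event POJO; the exact package of the schema interfaces varies slightly between Flink versions.
import java.io.IOException;
import java.io.Serializable;
import java.nio.charset.StandardCharsets;
import org.apache.flink.api.common.serialization.DeserializationSchema;
import org.apache.flink.api.common.typeinfo.TypeInformation;
import org.apache.flink.api.java.typeutils.TypeExtractor;

public class EventDeserializationSchema implements DeserializationSchema<EventDeserializationSchema.Event> {
    private static final long serialVersionUID = 1L;

    // Hypothetical target type for the deserialized records
    public static class Event implements Serializable {
        public String id;
        public String payload;
        public Event() {}
        public Event(String id, String payload) { this.id = id; this.payload = payload; }
    }

    @Override
    public Event deserialize(byte[] message) throws IOException {
        // Assumes the Kafka value is a UTF-8 string of the form "id,payload";
        // a real schema would usually parse JSON, Avro, etc.
        String[] parts = new String(message, StandardCharsets.UTF_8).split(",", 2);
        return new Event(parts[0], parts.length > 1 ? parts[1] : "");
    }

    @Override
    public boolean isEndOfStream(Event nextElement) {
        return false; // the Kafka stream is unbounded
    }

    @Override
    public TypeInformation<Event> getProducedType() {
        return TypeExtractor.getForClass(Event.class);
    }
}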
Kafka Consumer: start position for consumption
- setStartFromGroupOffsets (default)
- Starts from the offsets committed in Kafka for the group.id; if none exist, the auto.offset.reset setting is used
- setStartFromEarliest
- Reads from the earliest offset in Kafka
- setStartFromLatest
- Reads from the latest data in Kafka
- setStartFromTimestamp(long)
- Starts from the first record whose timestamp is greater than or equal to the given timestamp
- setStartFromSpecificOffsets
- Starts each partition from the specified offset; partitions missing from the given offsets map start from the group offsets
When a job recovers automatically from a checkpoint, or is restored from a manually taken savepoint, the consuming position comes from the saved state and is unaffected by these settings
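A sketch of the start-position methods on FlinkKafkaConsumer010; the broker address, topic, and offsets are placeholders, and a real job would pick only one of the start modes.
import java.util.HashMap;
import java.util.Map;
import java.util.Properties;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer010;
import org.apache.flink.streaming.connectors.kafka.internals.KafkaTopicPartition;
import org.apache.flink.streaming.util.serialization.SimpleStringSchema;

public class StartPositionExample {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        Properties props = new Properties();
        props.setProperty("bootstrap.servers", "localhost:9092");
        props.setProperty("group.id", "test");

        FlinkKafkaConsumer010<String> consumer =
                new FlinkKafkaConsumer010<>("topic", new SimpleStringSchema(), props);

        consumer.setStartFromGroupOffsets();   // default: committed group offsets, falling back to auto.offset.reset
        consumer.setStartFromEarliest();       // ignore committed offsets, read from the beginning
        consumer.setStartFromLatest();         // read only records arriving after the job starts
        consumer.setStartFromTimestamp(1500000000000L); // first record with timestamp >= the given epoch millis

        // Explicit per-partition offsets; partitions missing from the map use the group offsets
        Map<KafkaTopicPartition, Long> specificOffsets = new HashMap<>();
        specificOffsets.put(new KafkaTopicPartition("topic", 0), 23L);
        consumer.setStartFromSpecificOffsets(specificOffsets);

        env.addSource(consumer).print();
        env.execute("Kafka start position example");
    }
}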
Kafka Consumer: automatic topic and partition discovery
Mechanism: a dedicated internal thread periodically fetches Kafka metadata and updates the partition list
flink.partition-discovery.interval-millis: the discovery interval; discovery is disabled by default and is enabled by setting a non-negative value
Partition discovery
- The consumed Kafka topic has had partitions added (scaled out)
- Newly discovered partitions are read from the earliest offset
Topic discovery
- Topic names can be specified with a regular expression, for example:
Pattern topicPattern = java.util.regex.Pattern.compile("topic[0-9]");
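A sketch of enabling partition and topic discovery on the consumer; the broker address and the 30-second interval are placeholder values.
import java.util.Properties;
import java.util.regex.Pattern;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer010;
import org.apache.flink.streaming.util.serialization.SimpleStringSchema;

public class TopicDiscoveryExample {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        Properties props = new Properties();
        props.setProperty("bootstrap.servers", "localhost:9092");
        props.setProperty("group.id", "test");
        // Partition/topic discovery: check Kafka metadata every 30 seconds (disabled if the key is not set)
        props.setProperty("flink.partition-discovery.interval-millis", "30000");

        // Subscribe to every topic matching the pattern; matching topics created later are picked up as well
        FlinkKafkaConsumer010<String> consumer = new FlinkKafkaConsumer010<>(
                Pattern.compile("topic[0-9]"), new SimpleStringSchema(), props);

        env.addSource(consumer).print();
        env.execute("Topic discovery example");
    }
}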
Kafka Consumer: offset commit modes
Checkpointing disabled
- Relies on the Kafka client's auto commit to periodically commit offsets
- enable.auto.commit and auto.commit.interval.ms must be set in the consumer properties
Checkpointing enabled
- Offsets are managed and made fault-tolerant in Flink's checkpoint state; committing them to Kafka only serves as an external monitor of consumption progress
- setCommitOffsetsOnCheckpoints controls whether offsets are committed to Kafka after a checkpoint completes (see the sketch below)
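A sketch of the checkpointing-enabled case; the checkpoint interval, broker address, and topic are placeholders.
import java.util.Properties;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer010;
import org.apache.flink.streaming.util.serialization.SimpleStringSchema;

public class CommitOffsetsOnCheckpointExample {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        // With checkpointing enabled, offsets are managed in Flink's checkpoint state
        env.enableCheckpointing(5000); // checkpoint every 5 seconds

        Properties props = new Properties();
        props.setProperty("bootstrap.servers", "localhost:9092");
        props.setProperty("group.id", "test");

        FlinkKafkaConsumer010<String> consumer =
                new FlinkKafkaConsumer010<>("topic", new SimpleStringSchema(), props);
        // Commit offsets back to Kafka when a checkpoint completes, purely for external progress monitoring
        consumer.setCommitOffsetsOnCheckpoints(true);

        env.addSource(consumer).print();
        env.execute("Commit offsets on checkpoints");
    }
}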
Kafka Consumer: timestamp extraction and watermark generation
Per-Kafka-partition watermarks
- assignTimestampsAndWatermarks: one assigner per partition; the watermark emitted downstream is the aligned (minimum) value across partitions (sketched below)
- Generating the watermark downstream of the Kafka source instead can cause some records to be dropped as late
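A sketch of assigning timestamps and watermarks directly on the consumer so that each Kafka partition gets its own assigner; the record format (epoch milliseconds before the first comma) and the 5-second out-of-orderness bound are assumptions for illustration.
import java.util.Properties;
import org.apache.flink.streaming.api.TimeCharacteristic;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.timestamps.BoundedOutOfOrdernessTimestampExtractor;
import org.apache.flink.streaming.api.windowing.time.Time;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer010;
import org.apache.flink.streaming.util.serialization.SimpleStringSchema;

public class PerPartitionWatermarkExample {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime);

        Properties props = new Properties();
        props.setProperty("bootstrap.servers", "localhost:9092");
        props.setProperty("group.id", "test");

        FlinkKafkaConsumer010<String> consumer =
                new FlinkKafkaConsumer010<>("topic", new SimpleStringSchema(), props);
        // Assign watermarks on the source: one assigner per Kafka partition,
        // and the emitted watermark is the minimum across all partitions
        consumer.assignTimestampsAndWatermarks(
                new BoundedOutOfOrdernessTimestampExtractor<String>(Time.seconds(5)) {
                    private static final long serialVersionUID = 1L;
                    @Override
                    public long extractTimestamp(String element) {
                        // Assumed record format: "<epochMillis>,<payload>"
                        return Long.parseLong(element.split(",", 2)[0]);
                    }
                });

        env.addSource(consumer).print();
        env.execute("Per-partition watermark example");
    }
}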
Kafka Producer: partitioning
- FlinkFixedPartitioner (default): partition = parallelInstanceId % partitions.length
- Partitioner set to null: Kafka's own round-robin partitioner is used, which maintains connections to many partitions/brokers
- Custom partitioner: user-defined partitioning logic (see the sketch below)
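A sketch of plugging a custom partitioner into the 0.10 producer; the HashPartitioner below is a hypothetical example that routes records by the hash of their value.
import java.util.Properties;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaProducer010;
import org.apache.flink.streaming.connectors.kafka.partitioner.FlinkKafkaPartitioner;
import org.apache.flink.streaming.util.serialization.SimpleStringSchema;

public class CustomPartitionerExample {
    // Route each record to a Kafka partition based on the hash of its value
    public static class HashPartitioner extends FlinkKafkaPartitioner<String> {
        private static final long serialVersionUID = 1L;
        @Override
        public int partition(String record, byte[] key, byte[] value, String targetTopic, int[] partitions) {
            return partitions[Math.abs(record.hashCode() % partitions.length)];
        }
    }

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        Properties props = new Properties();
        props.setProperty("bootstrap.servers", "localhost:9092");

        DataStream<String> stream = env.fromElements("a", "b", "c");
        // Omitting the partitioner argument uses FlinkFixedPartitioner;
        // passing null falls back to Kafka's own round-robin partitioner
        stream.addSink(new FlinkKafkaProducer010<>(
                "topic", new SimpleStringSchema(), props, new HashPartitioner()));
        env.execute("Custom partitioner example");
    }
}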
Kafka Producer: fault tolerance
Kafka 0.11: FlinkKafkaProducer011 combines a two-phase-commit sink with Kafka transactions and can guarantee end-to-end exactly-once delivery
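A sketch of the exactly-once producer for Kafka 0.11; checkpointing must be enabled, and the producer's transaction.timeout.ms must not exceed the broker's transaction.max.timeout.ms. The broker address and topic are placeholders.
import java.util.Properties;
import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaProducer011;
import org.apache.flink.streaming.util.serialization.KeyedSerializationSchemaWrapper;

public class ExactlyOnceProducerExample {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        // The two-phase commit is driven by checkpoints, so checkpointing must be on
        env.enableCheckpointing(5000);

        Properties props = new Properties();
        props.setProperty("bootstrap.servers", "localhost:9092");
        // Keep the transaction timeout within the broker's transaction.max.timeout.ms
        props.setProperty("transaction.timeout.ms", "60000");

        DataStream<String> stream = env.fromElements("a", "b", "c");
        stream.addSink(new FlinkKafkaProducer011<>(
                "topic",
                new KeyedSerializationSchemaWrapper<>(new SimpleStringSchema()),
                props,
                FlinkKafkaProducer011.Semantic.EXACTLY_ONCE));
        env.execute("Exactly-once Kafka producer");
    }
}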