Flink Connectors

Streaming Connectors

Predefined Sources and Sinks

  • File-based sources
    • readTextFile(path)
    • readFile(fileInputFormat, path)
  • File-based sinks
    • writeAsText
    • writeAsCsv
  • Socket-based source
    • socketTextStream
  • Socket-based sink
    • writeToSocket
  • Collection- and iterator-based sources
    • fromCollection, fromElements
  • Standard output / standard error sinks
    • print, printToError
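
A quick sketch of the predefined file-, collection- and print-based APIs listed above. The input and output paths, as well as the class name, are placeholders.

import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class PredefinedSourceSinkDemo {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // file-based source: read a text file line by line
        DataStream<String> lines = env.readTextFile("/path/to/input.txt");

        // file-based sink: write each record as a line of text
        lines.writeAsText("/path/to/output");

        // collection-based source and standard-output sink
        env.fromElements("a", "b", "c").print();

        env.execute("Predefined source/sink demo");
    }
}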

A SocketWindowWordCount example

import org.apache.flink.api.common.functions.FlatMapFunction;
import org.apache.flink.api.common.functions.ReduceFunction;
import org.apache.flink.api.java.utils.ParameterTool;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.windowing.time.Time;
import org.apache.flink.util.Collector;

public class SocketWindowWordCount {
    public static void main(String[] args) throws Exception {

        // the port number, declared final so it is assigned exactly once
        final int port;
        try {
            final ParameterTool params = ParameterTool.fromArgs(args);
            port = params.getInt("port");
        } catch (Exception e) {
            System.err.println("No port specified. Please run 'SocketWindowWordCount --port <port>'");
            return;
        }

        // (1) obtain the execution environment
        final StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // (2) obtain the data stream; here, the text typed into the socket on the given port
        DataStream<String> text = env.socketTextStream("localhost", port, "\n");

        // (3) transformations: apply the computation to the data stream
        DataStream<WordWithCount> windowCounts = text
                // split the input text on whitespace into individual words and emit each one into the Collector named out
                .flatMap(new FlatMapFunction<String, WordWithCount>() {
                    public void flatMap(String value, Collector<WordWithCount> out) {
                        for (String word : value.split("\\s")) {
                            out.collect(new WordWithCount(word, 1L));
                        }
                    }
                })
                // partition the stream so that elements with the same key (the same word) go to the same partition; the following reduce then counts each word
                .keyBy("word")
                // sliding window: every 1 second, compute over the last 5 seconds of data
                .timeWindow(Time.seconds(5), Time.seconds(1))
                // a rolling reduce on the keyed stream: combine the previously reduced value with the current element and emit the result
                // here, the two objects are merged by summing the counts for the same word
                .reduce(new ReduceFunction<WordWithCount>() {
                    public WordWithCount reduce(WordWithCount a, WordWithCount b) {
                        return new WordWithCount(a.word, a.count + b.count);
                    }
                });

        // print the results with parallelism 1 (a single output thread)
        windowCounts.print().setParallelism(1);

        env.execute("Socket Window WordCount");
    }

    // data type holding a word and its count
    public static class WordWithCount {
        // the word and how many times it has been seen
        public String word;
        public long count;
        
        public WordWithCount() {}
        
        public WordWithCount(String word, long count) {
            this.word = word;
            this.count = count;
        }

        @Override
        public String toString() {
            return word + " : " + count;
        }
    }
}

Bundled Connectors

  • Apache Kafka(source/sink)
  • Apache Cassandra(sink)
  • Amazon Kinesis Streams(source/sink)
  • Elasticsearch(sink)
  • Hadoop FileSystem(sink)
  • RabbitMQ(source/sink)
  • Apache NiFi(source/sink)
  • Twitter Streaming API(source)

The streaming connectors above are part of the Flink project, but they are not included in the binary distribution.

Connectors in Apache Bahir

  • Apache ActiveMQ(source/sink)
  • Apache Flume(sink)
  • Redis(sink)
  • Akka(sink)
  • Netty(source)

Async I/O

  • Connectors are not the only way to get data into and out of Flink
  • Async I/O can be used in operators such as Map and FlatMap to query external databases and similar systems (see the sketch below)
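
A minimal sketch of the Async I/O pattern, assuming the Flink 1.5+ API (AsyncDataStream, RichAsyncFunction, ResultFuture). The external lookup is simulated with a CompletableFuture; in practice the asynchronous client of the external database would be called instead, and all names here are illustrative.

import java.util.Collections;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.TimeUnit;

import org.apache.flink.streaming.api.datastream.AsyncDataStream;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.async.ResultFuture;
import org.apache.flink.streaming.api.functions.async.RichAsyncFunction;

public class AsyncIODemo {

    // asynchronous "lookup" against an external system, simulated with a CompletableFuture
    public static class AsyncLookup extends RichAsyncFunction<String, String> {
        private static final long serialVersionUID = 1L;

        @Override
        public void asyncInvoke(String key, ResultFuture<String> resultFuture) {
            CompletableFuture
                    .supplyAsync(() -> "value-for-" + key) // replace with a real async client call
                    .thenAccept(value -> resultFuture.complete(Collections.singleton(value)));
        }
    }

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        DataStream<String> keys = env.fromElements("k1", "k2", "k3");

        // at most 100 requests in flight, 1 second timeout; results may arrive out of input order
        DataStream<String> enriched =
                AsyncDataStream.unorderedWait(keys, new AsyncLookup(), 1, TimeUnit.SECONDS, 100);

        enriched.print();
        env.execute("Async I/O demo");
    }
}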

Flink Kafka Connector

Flink Kafka Consumer

  • Deserializing data
  • Configuring the consumption start position
  • Dynamic topic and partition discovery
  • Offset commit modes
  • Timestamp extraction / watermark generation

Flink Kafka Producer

  • Producer partitioning
  • Fault tolerance

Producer example

import java.util.HashMap;
import java.util.Map;

import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.api.java.utils.ParameterTool;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.source.SourceFunction;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaProducer010;
import org.apache.flink.streaming.util.serialization.SimpleStringSchema;

public class WriteIntoKafka {
    public static void main(String[] args) throws Exception {
        // create execution environment
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        Map<String, String> properties = new HashMap<>();
        properties.put("bootstrap.servers", "/* broker address */");
        properties.put("topic", "/* topic */");
        
        // parse user parameters
        ParameterTool parameterTool = ParameterTool.fromMap(properties);
 
        // add a simple source which is writing some strings
        DataStream<String> messageStream = env.addSource(new SimpleStringGenerator());
 
        // write stream to Kafka
        messageStream.addSink(new FlinkKafkaProducer010<>(parameterTool.getRequired("bootstrap.servers"),
                parameterTool.getRequired("topic"),
                new SimpleStringSchema()));
 
        // forward the stream unchanged and print it; rebalance() redistributes the
        // records evenly across all parallel subtasks
        messageStream.rebalance().map(new MapFunction<String, String>() {
            // serialVersionUID for the serializable anonymous function
            private static final long serialVersionUID = 1L;

            @Override
            public String map(String value) throws Exception {
                return value;
            }
        }).print();
 
        env.execute();
    }
 
    public static class SimpleStringGenerator implements SourceFunction<String> {
        // serialVersionUID for serialization
        private static final long serialVersionUID = 1L;
        boolean running = true;
        long i = 0;

        @Override
        public void run(SourceContext<String> ctx) throws Exception {
            while (running) {
                // emit a simple generated message; replace with real payload generation
                ctx.collect("element-" + (i++));
                Thread.sleep(10);
            }
        }
 
        @Override
        public void cancel() {
            running = false;
        }
    }
}

Consumer example

import java.util.HashMap;
import java.util.Map;

import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.api.java.utils.ParameterTool;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer010;
import org.apache.flink.streaming.util.serialization.SimpleStringSchema;

public class ReadFromKafka {
 
    public static void main(String[] args) throws Exception {
        // create execution environment
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        Map<String, String> properties = new HashMap<>();
        properties.put("bootstrap.servers", "/* broker address */");
        properties.put("group.id", "test");
        properties.put("enable.auto.commit", "true");
        properties.put("auto.commit.interval.ms", "1000");
        properties.put("auto.offset.reset", "earliest");
        properties.put("session.timeout.ms", "30000");
        properties.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        properties.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        properties.put("topic", "/* topic */");
        // parse user parameters
      
        ParameterTool parameterTool = ParameterTool.fromMap(properties);
 
        FlinkKafkaConsumer010<String> consumer010 = new FlinkKafkaConsumer010<>(
                parameterTool.getRequired("topic"), new SimpleStringSchema(), parameterTool.getProperties());

        DataStream<String> messageStream = env.addSource(consumer010);
 
        // print() writes the contents of the stream to the TaskManager's standard out;
        // rebalance() repartitions the data so that all subtasks see messages
        // (for example when "num kafka partitions" < "num flink operators")
        messageStream.rebalance().map(new MapFunction<String, String>() {
            private static final long serialVersionUID = 1L;

            @Override
            public String map(String value) throws Exception {
                return value;
            }
        }).print();

        env.execute();
    }
}

Kafka Consumer: deserializing data

  • Converts the binary data in Kafka into concrete Java/Scala objects
  • DeserializationSchema: T deserialize(byte[] message)
  • KeyedDeserializationSchema: T deserialize(byte[] messageKey, byte[] message, String topic, int partition, long offset), for access to the Kafka key/value and message metadata

Commonly used schemas

  • SimpleStringSchema: serializes and deserializes messages as plain strings
  • TypeInformationSerializationSchema: builds a schema from Flink's TypeInformation
  • JSONDeserializationSchema: uses Jackson to deserialize JSON messages and returns an ObjectNode; fields can be accessed with .get("property"). A custom schema is sketched below.
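
When the built-in schemas are not enough, a custom DeserializationSchema can be supplied to the consumer. The sketch below wraps each raw Kafka message in a hypothetical Event POJO; the schema interface lives in slightly different packages depending on the Flink version, so treat the import as an assumption.

import java.io.IOException;
import java.nio.charset.StandardCharsets;

import org.apache.flink.api.common.typeinfo.TypeInformation;
import org.apache.flink.streaming.util.serialization.DeserializationSchema;

public class EventSchema implements DeserializationSchema<EventSchema.Event> {
    private static final long serialVersionUID = 1L;

    // hypothetical event type: the raw message wrapped in a small POJO
    public static class Event {
        public String payload;
        public Event() {}
        public Event(String payload) { this.payload = payload; }
    }

    @Override
    public Event deserialize(byte[] message) throws IOException {
        // turn the raw Kafka bytes into a Java object
        return new Event(new String(message, StandardCharsets.UTF_8));
    }

    @Override
    public boolean isEndOfStream(Event nextElement) {
        // a normal Kafka source never ends
        return false;
    }

    @Override
    public TypeInformation<Event> getProducedType() {
        return TypeInformation.of(Event.class);
    }
}

The schema is then passed to the consumer in place of SimpleStringSchema, e.g. new FlinkKafkaConsumer010<>(topic, new EventSchema(), properties).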

Kafka Consumer: consumption start position

  • setStartFromGroupOffsets (default)
    • Start from the offsets committed in Kafka for the configured group.id; if there are none, fall back to the auto.offset.reset setting
  • setStartFromEarliest
    • Start from the earliest offsets in Kafka
  • setStartFromLatest
    • Start from the latest data in Kafka
  • setStartFromTimestamp(long)
    • Start from the first record whose timestamp is greater than or equal to the given timestamp
  • setStartFromSpecificOffsets
    • Start from the given offset for each specified partition; partitions not present in the map fall back to the group offsets

When a job automatically recovers from a checkpoint, or is restored from a manually taken savepoint, the consumption position comes from the saved state and these settings are ignored (see the sketch below).
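
A minimal sketch of configuring the start position on a FlinkKafkaConsumer010, assuming the 0.10 connector; broker address, topic, offsets and the class name are placeholders, and only one setStartFrom* call should be active.

import java.util.HashMap;
import java.util.Map;
import java.util.Properties;

import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer010;
import org.apache.flink.streaming.connectors.kafka.internals.KafkaTopicPartition;
import org.apache.flink.streaming.util.serialization.SimpleStringSchema;

public class StartPositionDemo {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        Properties props = new Properties();
        props.setProperty("bootstrap.servers", "/* broker address */");
        props.setProperty("group.id", "test");

        FlinkKafkaConsumer010<String> consumer =
                new FlinkKafkaConsumer010<>("/* topic */", new SimpleStringSchema(), props);

        // default: start from the offsets committed for group.id; if none exist,
        // fall back to the auto.offset.reset setting
        consumer.setStartFromGroupOffsets();

        // alternatives (only the last call before execution takes effect):
        // consumer.setStartFromEarliest();
        // consumer.setStartFromLatest();
        // consumer.setStartFromTimestamp(1546300800000L); // records with timestamp >= this epoch-millis value

        // explicit per-partition offsets; partitions not listed fall back to the group offsets
        Map<KafkaTopicPartition, Long> specificOffsets = new HashMap<>();
        specificOffsets.put(new KafkaTopicPartition("/* topic */", 0), 23L);
        // consumer.setStartFromSpecificOffsets(specificOffsets);

        DataStream<String> stream = env.addSource(consumer);
        stream.print();
        env.execute("Start position demo");
    }
}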

Kafka Consumer: automatic topic and partition discovery

How it works: a dedicated internal thread periodically fetches Kafka metadata and updates the partition assignment.

flink.partition-discovery.interval-millis: the discovery interval. Discovery is disabled by default; set a non-negative value to enable it.

Partition discovery

  • Triggered when the consumed Kafka topic has partitions added (partition scale-out)
  • Newly discovered partitions are read from the earliest offset

Topic discovery

  • Topic names can be described with a regular expression, for example (see the sketch below):
Pattern topicPattern = java.util.regex.Pattern.compile("topic[0-9]");
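
A minimal sketch of enabling discovery, assuming Flink 1.4+ where the pattern-based consumer constructor and the flink.partition-discovery.interval-millis property are available; broker address and the class name are placeholders.

import java.util.Properties;
import java.util.regex.Pattern;

import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer010;
import org.apache.flink.streaming.util.serialization.SimpleStringSchema;

public class TopicDiscoveryDemo {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        Properties props = new Properties();
        props.setProperty("bootstrap.servers", "/* broker address */");
        props.setProperty("group.id", "test");
        // enable partition/topic discovery: check Kafka metadata every 30 seconds
        props.setProperty("flink.partition-discovery.interval-millis", "30000");

        // subscribe to every topic matching the pattern; newly created matching topics
        // and newly added partitions are picked up automatically (read from earliest)
        Pattern topicPattern = Pattern.compile("topic[0-9]");
        FlinkKafkaConsumer010<String> consumer =
                new FlinkKafkaConsumer010<>(topicPattern, new SimpleStringSchema(), props);

        env.addSource(consumer).print();
        env.execute("Topic discovery demo");
    }
}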

Kafka Consumer: offset commit modes

Checkpointing disabled

  • Relies on the Kafka client's auto commit to periodically commit offsets
  • Requires setting enable.auto.commit and auto.commit.interval.ms in the consumer properties

Checkpointing enabled

  • Offsets are managed and made fault-tolerant in Flink's checkpoint state; committing them to Kafka only exposes consumption progress for external monitoring
  • setCommitOffsetsOnCheckpoints controls whether offsets are committed to Kafka after a successful checkpoint (see the sketch below)
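
A minimal sketch of the two commit modes on one consumer, assuming the 0.10 connector; the auto-commit properties only matter when checkpointing is disabled, and setCommitOffsetsOnCheckpoints only matters when it is enabled. Broker address, topic and the class name are placeholders.

import java.util.Properties;

import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer010;
import org.apache.flink.streaming.util.serialization.SimpleStringSchema;

public class OffsetCommitDemo {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        Properties props = new Properties();
        props.setProperty("bootstrap.servers", "/* broker address */");
        props.setProperty("group.id", "test");
        // only relevant when checkpointing is disabled: the Kafka client commits periodically
        props.setProperty("enable.auto.commit", "true");
        props.setProperty("auto.commit.interval.ms", "1000");

        FlinkKafkaConsumer010<String> consumer =
                new FlinkKafkaConsumer010<>("/* topic */", new SimpleStringSchema(), props);

        // with checkpointing enabled, offsets live in Flink's checkpoint state;
        // committing them back to Kafka only exposes progress to external monitoring
        env.enableCheckpointing(5000);
        consumer.setCommitOffsetsOnCheckpoints(true);

        env.addSource(consumer).print();
        env.execute("Offset commit demo");
    }
}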

Kafka Consumer: timestamp extraction and watermark generation

Per-Kafka-partition watermarks

  • Call assignTimestampsAndWatermarks on the consumer itself: one assigner runs per partition, and the source watermark is the aligned (minimum) value across partitions
  • Generating the watermark downstream of the Kafka source instead can cause some records to be dropped as late data (see the sketch below)
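
A minimal sketch of attaching the watermark assigner directly to the Kafka consumer so that watermarks are generated per partition; the assumption that each record is an epoch-millisecond timestamp string, and all other names, are purely illustrative.

import java.util.Properties;

import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.timestamps.BoundedOutOfOrdernessTimestampExtractor;
import org.apache.flink.streaming.api.windowing.time.Time;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer010;
import org.apache.flink.streaming.util.serialization.SimpleStringSchema;

public class PerPartitionWatermarkDemo {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        Properties props = new Properties();
        props.setProperty("bootstrap.servers", "/* broker address */");
        props.setProperty("group.id", "test");

        FlinkKafkaConsumer010<String> consumer =
                new FlinkKafkaConsumer010<>("/* topic */", new SimpleStringSchema(), props);

        // assign the watermark generator on the consumer itself, so that one assigner runs
        // per Kafka partition and the source watermark is the minimum across partitions
        consumer.assignTimestampsAndWatermarks(
                new BoundedOutOfOrdernessTimestampExtractor<String>(Time.seconds(10)) {
                    private static final long serialVersionUID = 1L;

                    @Override
                    public long extractTimestamp(String value) {
                        // hypothetical: each record is assumed to be an epoch-millis timestamp
                        return Long.parseLong(value.trim());
                    }
                });

        env.addSource(consumer).print();
        env.execute("Per-partition watermark demo");
    }
}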

Kafka Producer partitioning

  • FlinkFixedPartitioner (default): parallelInstanceId % partitions.length
  • Partitioner set to null: Kafka's round-robin partitioner is used, which keeps many connections open
  • Custom partitioner: implement your own partitioning logic (see the sketch below)
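
A minimal sketch of a custom FlinkKafkaPartitioner, assuming the 0.10 producer constructor that accepts a partitioner; the hash-based routing and all names are just examples.

import java.util.Properties;

import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaProducer010;
import org.apache.flink.streaming.connectors.kafka.partitioner.FlinkKafkaPartitioner;
import org.apache.flink.streaming.util.serialization.SimpleStringSchema;

public class CustomPartitionerDemo {

    // route each record to a Kafka partition derived from the record itself
    public static class HashPartitioner extends FlinkKafkaPartitioner<String> {
        private static final long serialVersionUID = 1L;

        @Override
        public int partition(String record, byte[] key, byte[] value, String targetTopic, int[] partitions) {
            return Math.abs(record.hashCode() % partitions.length);
        }
    }

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        DataStream<String> stream = env.fromElements("a", "b", "c");

        Properties props = new Properties();
        props.setProperty("bootstrap.servers", "/* broker address */");

        // pass the custom partitioner instead of the default FlinkFixedPartitioner;
        // passing null here would delegate partitioning to the Kafka producer itself
        stream.addSink(new FlinkKafkaProducer010<>(
                "/* topic */", new SimpleStringSchema(), props, new HashPartitioner()));

        env.execute("Custom partitioner demo");
    }
}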

Kafka Producer fault tolerance

Kafka 0.11: FlinkKafkaProducer011 combines a two-phase-commit sink with Kafka transactions, which can guarantee end-to-end exactly-once delivery (see the sketch below).
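
A minimal sketch of an exactly-once sink with FlinkKafkaProducer011 and Semantic.EXACTLY_ONCE. Checkpointing must be enabled, and the producer's transaction.timeout.ms must fit within the broker's transaction.max.timeout.ms; the broker address, topic, timeout value and class name are placeholders.

import java.util.Properties;

import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaProducer011;
import org.apache.flink.streaming.util.serialization.KeyedSerializationSchemaWrapper;
import org.apache.flink.streaming.util.serialization.SimpleStringSchema;

public class ExactlyOnceProducerDemo {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        // exactly-once sinks require checkpointing
        env.enableCheckpointing(5000);

        DataStream<String> stream = env.fromElements("a", "b", "c");

        Properties props = new Properties();
        props.setProperty("bootstrap.servers", "/* broker address */");
        // the Kafka transaction timeout must cover the maximum checkpoint duration
        props.setProperty("transaction.timeout.ms", "60000");

        stream.addSink(new FlinkKafkaProducer011<>(
                "/* topic */",
                new KeyedSerializationSchemaWrapper<>(new SimpleStringSchema()),
                props,
                FlinkKafkaProducer011.Semantic.EXACTLY_ONCE));

        env.execute("Exactly-once producer demo");
    }
}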

An Overview of End-to-End Exactly-Once Processing in Apache Flink® (with Apache Kafka, too!), www.ververica.com
