Learning Flink from 0 to 1: Flink Data Transformation

toc: true
title: Learning Flink from 0 to 1: Flink Data Transformation
date: 2018-11-04
tags:

Flink
Big Data
Stream Processing

Preface

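In the first article of this series, "Learning Flink from 0 to 1: Introduction to Apache Flink", I described the structure of a Flink program.

A Flink application is structured as shown in the figure above: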

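1. Source: the data source. For both stream and batch processing, Flink sources fall into roughly four categories: sources based on local collections, file-based sources, network-socket-based sources, and custom sources. Common custom sources include Apache Kafka, Amazon Kinesis Streams, RabbitMQ, Twitter Streaming API, and Apache NiFi, and you can of course define your own source as well.

2. Transformation: the various data transformation operations, including Map / FlatMap / Filter / KeyBy / Reduce / Fold / Aggregations / Window / WindowAll / Union / Window join / Split / Select / Project. There are many of them, and together they let you transform and compute the data into the form you need.

3. Sink: the receiver, that is, the place where Flink sends the data after transformation and computation, typically because you need to store it. Common Flink sinks fall into roughly these categories: writing to a file, printing, writing to a socket, and custom sinks. Common custom sinks include Apache Kafka, RabbitMQ, MySQL, ElasticSearch, Apache Cassandra, and Hadoop FileSystem. Likewise, you can also define your own sink.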

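The previous four articles covered Source and Sink:

1. "Learning Flink from 0 to 1: Introduction to Data Source"

2. "Learning Flink from 0 to 1: How to Write a Custom Data Source"

3. "Learning Flink from 0 to 1: Introduction to Data Sink"

4. "Learning Flink from 0 to 1: How to Write a Custom Data Sink"

So in this article let's look at Flink Data Transformation. There are quite a few transformation operations, and they deserve a careful walkthrough.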

Transformation

Map

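This is one of the simplest transformations: the input is a data stream and the output is also a data stream, with each input element mapped to exactly one output element:

Continuing with the example from the previous article, we apply a map transformation to the data: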

SingleOutputStreamOperator<Student> map = student.map(new MapFunction<Student, Student>() {
    @Override
    public Student map(Student value) throws Exception {
        Student s1 = new Student();
        s1.id = value.id;
        s1.name = value.name;
        s1.password = value.password;
        s1.age = value.age + 5;
        return s1;
    }
});
map.print();

This adds 5 to every student's age and leaves the other fields unchanged.

FlatMap

FlatMap takes one record and emits zero, one, or more records.

SingleOutputStreamOperator<Student> flatMap = student.flatMap(new FlatMapFunction<Student, Student>() {
    @Override
    public void flatMap(Student value, Collector<Student> out) throws Exception {
        if (value.id % 2 == 0) {
            out.collect(value);
        }
    }
});
flatMap.print();

Here, only the records with an even id are emitted; records with an odd id produce no output.
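To make the "zero, one, or more" part concrete, here is a minimal plain-Java sketch of flatMap semantics. It does not use the Flink API: the class `FlatMapSketch`, the word-splitting function, and the sample lines are all hypothetical, and a `Consumer` stands in for Flink's `Collector`.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Plain-Java model of flatMap semantics; this is not Flink API.
class FlatMapSketch {

    // Hypothetical flatMap function: emits one output per whitespace-separated word.
    static void splitWords(String line, Consumer<String> out) {
        for (String word : line.trim().split("\\s+")) {
            if (!word.isEmpty()) {
                out.accept(word); // called zero, one, or many times per input record
            }
        }
    }

    // Applies splitWords to every input record and gathers everything that was emitted.
    static List<String> flatMapAll(List<String> lines) {
        List<String> result = new ArrayList<>();
        for (String line : lines) {
            splitWords(line, result::add);
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> lines = new ArrayList<>();
        lines.add("");             // emits nothing
        lines.add("flink");        // emits one record
        lines.add("map flat map"); // emits three records
        System.out.println(flatMapAll(lines)); // [flink, map, flat, map]
    }
}
```

The even-id example above is the degenerate case where each input yields either zero outputs or one.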

Filter

The Filter function evaluates a condition for each record and keeps only the records for which it returns true.

SingleOutputStreamOperator<Student> filter = student.filter(new FilterFunction<Student>() {
    @Override
    public boolean filter(Student value) throws Exception {
        return value.id > 95;
    }
});
filter.print();

Here, the records with an id greater than 95 are kept and then printed.

KeyBy

KeyBy logically partitions a stream by key. Internally, it partitions the stream with a hash function. It returns a KeyedStream.

KeyedStream<Student, Integer> keyBy = student.keyBy(new KeySelector<Student, Integer>() {
    @Override
    public Integer getKey(Student value) throws Exception {
        return value.age;
    }
});
keyBy.print();

The code above partitions the student stream by age using KeyBy.
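The idea of hash partitioning by key can be sketched in plain Java as follows. This is a deliberately simplified model, not Flink's actual assignment logic (Flink routes keys through key groups, which is more involved). The class `KeyBySketch`, its methods, and the sample ages are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Simplified model of hash partitioning by key; not Flink's actual assignment logic.
class KeyBySketch {

    // Maps a key to one of `parallelism` partitions using its hashCode.
    // floorMod keeps the result non-negative even for negative hash codes.
    static int partitionFor(Object key, int parallelism) {
        return Math.floorMod(key.hashCode(), parallelism);
    }

    // Groups ages (the key in the example above) by the partition they land in.
    static Map<Integer, List<Integer>> partitionAges(List<Integer> ages, int parallelism) {
        Map<Integer, List<Integer>> partitions = new TreeMap<>();
        for (Integer age : ages) {
            partitions.computeIfAbsent(partitionFor(age, parallelism), p -> new ArrayList<>()).add(age);
        }
        return partitions;
    }

    public static void main(String[] args) {
        List<Integer> ages = Arrays.asList(18, 19, 18, 20, 19);
        // Records with the same key always end up in the same partition.
        System.out.println(partitionAges(ages, 2)); // {0=[18, 18, 20], 1=[19, 19]}
    }
}
```

The property that matters for the operations that follow is that all records with the same key land in the same partition, so per-key state can be kept locally.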

Reduce

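Reduce performs a rolling reduce on a KeyedStream: it combines the last reduced value with the current element and emits the result, so a new value is produced for every element processed. Common aggregations such as average, sum, min, max, and count can all be implemented with reduce (for a true average, the reduced value has to carry a running sum and count).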

SingleOutputStreamOperator<Student> reduce = student.keyBy(new KeySelector<Student, Integer>() {
    @Override
    public Integer getKey(Student value) throws Exception {
        return value.age;
    }
}).reduce(new ReduceFunction<Student>() {
    @Override
    public Student reduce(Student value1, Student value2) throws Exception {
        Student student1 = new Student();
        student1.name = value1.name + value2.name;
        student1.id = (value1.id + value2.id) / 2;
        student1.password = value1.password + value2.password;
        student1.age = (value1.age + value2.age) / 2;
        return student1;
    }
});
reduce.print();

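The code above first applies keyBy, because reduce can only be applied to a KeyedStream, and then averages the age of the student objects. Note that this computes a pairwise average of the previous result and the new element on every step, which is not the same as the arithmetic mean of all elements.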
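The difference between a rolling pairwise average and a true mean is easy to see in a small plain-Java model of the reduce above. This is a sketch, not Flink API: the class `RollingReduceSketch` and the sample values are hypothetical, and the list stands for the elements of a single key.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Plain-Java model of a rolling reduce over the elements of one key; not Flink API.
class RollingReduceSketch {

    // Emits one value per input element: the first element as-is, then the
    // pairwise (integer) average of the previous result and the new element.
    static List<Integer> rollingPairwiseAverage(List<Integer> values) {
        List<Integer> emitted = new ArrayList<>();
        Integer current = null;
        for (Integer value : values) {
            current = (current == null) ? value : (current + value) / 2;
            emitted.add(current);
        }
        return emitted;
    }

    // Arithmetic mean, for comparison: it needs the running sum and the count.
    static double mean(List<Integer> values) {
        long sum = 0;
        for (int v : values) {
            sum += v;
        }
        return values.isEmpty() ? 0.0 : (double) sum / values.size();
    }

    public static void main(String[] args) {
        List<Integer> values = Arrays.asList(10, 20, 60);
        System.out.println(rollingPairwiseAverage(values)); // [10, 15, 37]
        System.out.println(mean(values));                   // 30.0
    }
}
```

In a real job, one way to get a true mean would be to reduce over a value that carries both sum and count and divide afterwards.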

Fold

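Fold performs a rolling fold on a KeyedStream, starting from an initial value: it combines the last folded value with the current record and emits the result. It returns a data stream.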

KeyedStream.fold("1", new FoldFunction<Integer, String>() {
    @Override
    public String fold(String accumulator, Integer value) throws Exception {
        return accumulator + "=" + value;
    }
});
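What the fold function above emits can be traced with a plain-Java model. This is a sketch rather than Flink API: the class `RollingFoldSketch` and the input sequence are hypothetical, and the initial value "1" and the "=" separator are taken from the snippet above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Plain-Java model of a rolling fold over the elements of one key; not Flink API.
class RollingFoldSketch {

    // Starts from `initial` and emits the updated accumulator after every element.
    static List<String> rollingFold(String initial, List<Integer> values) {
        List<String> emitted = new ArrayList<>();
        String accumulator = initial;
        for (Integer value : values) {
            accumulator = accumulator + "=" + value; // same body as the FoldFunction above
            emitted.add(accumulator);
        }
        return emitted;
    }

    public static void main(String[] args) {
        System.out.println(rollingFold("1", Arrays.asList(1, 2, 3))); // [1=1, 1=1=2, 1=1=2=3]
    }
}
```

Unlike reduce, the accumulator type (String here) can differ from the element type (Integer), which is why fold needs an explicit initial value.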

Aggregations

The DataStream API supports various aggregations such as min, max, and sum. These functions can be applied to a KeyedStream to produce rolling aggregations.

KeyedStream.sum(0) 
KeyedStream.sum("key") 
KeyedStream.min(0) 
KeyedStream.min("key") 
KeyedStream.max(0) 
KeyedStream.max("key") 
KeyedStream.minBy(0) 
KeyedStream.minBy("key") 
KeyedStream.maxBy(0) 
KeyedStream.maxBy("key")

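The difference between max and maxBy is that max returns the maximum value of the field, whereas maxBy returns the element that has the maximum value in that field. The same holds for min and minBy.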
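The distinction can be illustrated with a plain-Java model of the two rolling aggregations. This is a sketch, not Flink API: the `Score` type and the sample data are hypothetical. It also assumes that with max the non-aggregated fields stay as they were in the first record seen, which is my understanding of Flink's behaviour and should be checked against the Flink version in use.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Plain-Java model of rolling max vs. maxBy for one key; not Flink API.
class MaxVsMaxBySketch {

    // Hypothetical record: `name` is not aggregated, `points` is.
    static final class Score {
        final String name;
        final int points;

        Score(String name, int points) {
            this.name = name;
            this.points = points;
        }

        @Override
        public String toString() {
            return name + ":" + points;
        }
    }

    // max: only the aggregated field is updated; the other field is kept
    // from the first record (assumption, see the paragraph above).
    static List<String> rollingMax(List<Score> input) {
        List<String> emitted = new ArrayList<>();
        Score current = null;
        for (Score next : input) {
            if (current == null) {
                current = next;
            } else if (next.points > current.points) {
                current = new Score(current.name, next.points);
            }
            emitted.add(current.toString());
        }
        return emitted;
    }

    // maxBy: the whole element holding the maximum is kept.
    static List<String> rollingMaxBy(List<Score> input) {
        List<String> emitted = new ArrayList<>();
        Score current = null;
        for (Score next : input) {
            if (current == null || next.points > current.points) {
                current = next;
            }
            emitted.add(current.toString());
        }
        return emitted;
    }

    static List<Score> sample() {
        return Arrays.asList(new Score("a", 3), new Score("b", 7), new Score("c", 5));
    }

    public static void main(String[] args) {
        System.out.println(rollingMax(sample()));   // [a:3, a:7, a:7]
        System.out.println(rollingMaxBy(sample())); // [a:3, b:7, b:7]
    }
}
```

If you need to know which record produced the maximum, maxBy is the one to reach for.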

Window

The Window function groups the elements of an existing KeyedStream by time or by other criteria. The following aggregates over a 10-second time window:

inputStream.keyBy(0).window(TumblingProcessingTimeWindows.of(Time.seconds(10)));

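Flink defines slices of data so that (potentially) unbounded data streams can be processed. These slices are called windows. Slicing makes it possible to apply transformations to bounded chunks of data. To window a stream, we need to assign a key by which the stream can be distributed and a function that describes which transformation to perform on the windowed stream.

To slice a stream into windows, we can use the window assigners that ship with Flink. The options include tumbling windows, sliding windows, global windows, and session windows. Flink also lets you write a custom window assigner by extending the WindowAssigner class. How these different windows work is left for the next article.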
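As a rough intuition for the simplest case, a tumbling window, the following plain-Java sketch assigns timestamps to fixed 10-second buckets. It is not Flink API and ignores window offsets, watermarks, and triggers. The class `TumblingWindowSketch` and the sample timestamps are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java model of tumbling-window assignment; not Flink API.
class TumblingWindowSketch {

    // Start of the fixed-size window that contains the given timestamp.
    static long windowStart(long timestampMillis, long windowSizeMillis) {
        return timestampMillis - Math.floorMod(timestampMillis, windowSizeMillis);
    }

    // Groups timestamps by the start of the window they fall into.
    static Map<Long, List<Long>> assign(List<Long> timestamps, long windowSizeMillis) {
        Map<Long, List<Long>> windows = new TreeMap<>();
        for (Long ts : timestamps) {
            windows.computeIfAbsent(windowStart(ts, windowSizeMillis), s -> new ArrayList<>()).add(ts);
        }
        return windows;
    }

    public static void main(String[] args) {
        List<Long> timestamps = Arrays.asList(1000L, 9999L, 10000L, 25000L);
        // Each element belongs to exactly one window; windows do not overlap.
        System.out.println(assign(timestamps, 10_000L)); // {0=[1000, 9999], 10000=[10000], 20000=[25000]}
    }
}
```

Sliding windows differ in that one element can belong to several overlapping windows, and session windows are bounded by gaps of inactivity instead of fixed sizes.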

WindowAll

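The windowAll function groups a regular (non-keyed) data stream into windows. This is usually a non-parallel transformation, because it runs on a non-partitioned data stream.

Just as there are functions for regular data streams, there are functions for windowed data streams. The only difference is that they operate on windowed streams. So window reduce works like the Reduce function, window fold works like the Fold function, and aggregations are available as well.

inputStream.windowAll(TumblingProcessingTimeWindows.of(Time.seconds(10)));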

Union

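The Union function merges two or more data streams into one. This lets data streams be combined in parallel. If a stream is unioned with itself, every record appears twice in the output.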

inputStream.union(inputStream1, inputStream2, ...);

Window join

We can join two data streams on some key within the same window.

inputStream.join(inputStream1)
           .where(0).equalTo(1)
           .window(TumblingProcessingTimeWindows.of(Time.seconds(5)))
           .apply (new JoinFunction () {...});

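The example above joins two streams within a 5-second window, where the join condition is that the first attribute of the first stream equals the second attribute of the other stream.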
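The "same key, same window" condition can be spelled out in a small plain-Java model. It is a sketch, not Flink API: the `Event` type, the sample events, and the nested-loop join are hypothetical and chosen for readability, and timestamps are assumed to be non-negative so that integer division identifies the window.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Plain-Java model of a tumbling-window join; not Flink API.
class WindowJoinSketch {

    // Hypothetical event with a join key, a timestamp in milliseconds, and a payload.
    static final class Event {
        final String key;
        final long timestamp;
        final String payload;

        Event(String key, long timestamp, String payload) {
            this.key = key;
            this.timestamp = timestamp;
            this.payload = payload;
        }
    }

    // Pairs every left event with every right event that has the same key
    // and falls into the same fixed-size window.
    static List<String> join(List<Event> left, List<Event> right, long windowSizeMillis) {
        List<String> joined = new ArrayList<>();
        for (Event l : left) {
            for (Event r : right) {
                boolean sameKey = l.key.equals(r.key);
                boolean sameWindow = l.timestamp / windowSizeMillis == r.timestamp / windowSizeMillis;
                if (sameKey && sameWindow) {
                    joined.add(l.payload + "+" + r.payload);
                }
            }
        }
        return joined;
    }

    // Sample data: only (L1, R1) and (L3, R2) share both key and window.
    static List<String> demo() {
        List<Event> left = Arrays.asList(
                new Event("a", 1000L, "L1"), new Event("b", 2000L, "L2"), new Event("a", 7000L, "L3"));
        List<Event> right = Arrays.asList(
                new Event("a", 4000L, "R1"), new Event("a", 6000L, "R2"), new Event("c", 1000L, "R3"));
        return join(left, right, 5000L);
    }

    public static void main(String[] args) {
        System.out.println(demo()); // [L1+R1, L3+R2]
    }
}
```

Note that L1 and R2 share a key but fall into different 5-second windows, so they are not joined.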

Split

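This function splits a stream into two or more streams according to a condition. It is useful when you receive a mixed stream and want to process each kind of data separately.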

SplitStream<Integer> split = inputStream.split(new OutputSelector<Integer>() {
    @Override
    public Iterable<String> select(Integer value) {
        List<String> output = new ArrayList<String>(); 
        if (value % 2 == 0) {
            output.add("even");
        }
        else {
            output.add("odd");
        }
        return output;
    }
});

Select

This function lets you select specific streams from a split stream.

SplitStream<Integer> split;
DataStream<Integer> even = split.select("even"); 
DataStream<Integer> odd = split.select("odd"); 
DataStream<Integer> all = split.select("even","odd");
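How Split and Select work together can be modelled in plain Java: every element is tagged with one or more names, and a select keeps the elements carrying at least one of the requested names. This is a sketch, not Flink API. The class `SplitSelectSketch` is hypothetical, and the "even"/"odd" tags mirror the OutputSelector above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Plain-Java model of split/select; not Flink API.
class SplitSelectSketch {

    // Same tagging rule as the OutputSelector above.
    static List<String> tagsFor(int value) {
        return Collections.singletonList(value % 2 == 0 ? "even" : "odd");
    }

    // Keeps, in input order, every element that carries at least one requested tag.
    static List<Integer> select(List<Integer> input, String... names) {
        Set<String> wanted = new HashSet<>(Arrays.asList(names));
        List<Integer> selected = new ArrayList<>();
        for (Integer value : input) {
            for (String tag : tagsFor(value)) {
                if (wanted.contains(tag)) {
                    selected.add(value);
                    break; // emit each element at most once
                }
            }
        }
        return selected;
    }

    public static void main(String[] args) {
        List<Integer> input = Arrays.asList(1, 2, 3, 4, 5, 6);
        System.out.println(select(input, "even"));        // [2, 4, 6]
        System.out.println(select(input, "odd"));         // [1, 3, 5]
        System.out.println(select(input, "even", "odd")); // [1, 2, 3, 4, 5, 6]
    }
}
```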

Project

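The Project function lets you select a subset of attributes from an event stream and send only the selected elements to the next processing stage. It applies to streams of Tuples.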

DataStream<Tuple4<Integer, Double, String, String>> in = // [...] 
DataStream<Tuple2<String, String>> out = in.project(3,2);

The function above selects fields 3 and 2 (zero-based), in that order, from each record. Here are sample input and output records:

(1,10.0,A,B)=> (B,A)
(2,20.0,C,D)=> (D,C)
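The index handling is the only subtle part, so here is a plain-Java model of the projection. It is a sketch, not Flink API: the class `ProjectSketch` is hypothetical, and a `List<Object>` stands in for a Flink Tuple.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Plain-Java model of project(fieldIndexes...); not Flink API.
class ProjectSketch {

    // Builds a new record from the given zero-based field indexes, in the order given.
    static List<Object> project(List<Object> record, int... fieldIndexes) {
        List<Object> projected = new ArrayList<>();
        for (int index : fieldIndexes) {
            projected.add(record.get(index));
        }
        return projected;
    }

    public static void main(String[] args) {
        List<Object> first = Arrays.<Object>asList(1, 10.0, "A", "B");
        List<Object> second = Arrays.<Object>asList(2, 20.0, "C", "D");
        System.out.println(project(first, 3, 2));  // [B, A]
        System.out.println(project(second, 3, 2)); // [D, C]
    }
}
```

The order of the indexes determines the order of the output fields, which is why project(3,2) yields (B,A) and not (A,B).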

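Conclusion

This article introduced the common Flink data transformations: Map, FlatMap, Filter, KeyBy, Reduce, Fold, Aggregations, Window, WindowAll, Union, Window Join, Split, Select, and Project. Simple demos showed how to use each one. How to transform a data stream into the format you need in a real project still depends on the actual situation.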