Chapter 1 Introduction to CDC

1.1 What Is CDC

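CDC is short for Change Data Capture. The core idea is to monitor and capture changes to a database (inserts, updates, and deletes of data or tables), record them completely in the order in which they occur, and write them to message middleware so that other services can subscribe to and consume them.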
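To make the "ordered change records" idea concrete, here is a minimal sketch of a consumer that replays captured changes to rebuild the table state downstream. The `ChangeEvent` class, the `Op` enum, and the `replay` method are hypothetical names invented for this illustration; they are not part of any CDC library.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only: none of these names come from a CDC library.
class ChangeLogSketch {
    enum Op { INSERT, UPDATE, DELETE }

    static final class ChangeEvent {
        final Op op;
        final int id;        // primary key of the affected row
        final String value;  // new row value; null for DELETE

        ChangeEvent(Op op, int id, String value) {
            this.op = op;
            this.id = id;
            this.value = value;
        }
    }

    // Replays the events in their original order to rebuild the table state.
    static Map<Integer, String> replay(List<ChangeEvent> events) {
        Map<Integer, String> replica = new LinkedHashMap<>();
        for (ChangeEvent e : events) {
            switch (e.op) {
                case INSERT:
                case UPDATE:
                    replica.put(e.id, e.value);
                    break;
                case DELETE:
                    replica.remove(e.id);
                    break;
            }
        }
        return replica;
    }

    public static void main(String[] args) {
        List<ChangeEvent> log = new ArrayList<>();
        log.add(new ChangeEvent(Op.INSERT, 1, "alice"));
        log.add(new ChangeEvent(Op.INSERT, 2, "bob"));
        log.add(new ChangeEvent(Op.UPDATE, 1, "alice-v2"));
        log.add(new ChangeEvent(Op.DELETE, 2, null));
        System.out.println(replay(log)); // prints {1=alice-v2}
    }
}
```

Because the events are applied in the order they occurred, the replica ends in the same state as the source table; replaying them out of order could, for example, resurrect a deleted row.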

1.2 Types of CDC

CDC implementations fall into two main categories: query-based and binlog-based (log-based).

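Query-based CDC:

Runs as an offline scheduled query job (batch processing). To synchronize a table to another system, each run queries the table for its latest data;
Cannot guarantee data consistency, because the data may change several times while the query is running;
Cannot guarantee real-time delivery, because offline scheduling has inherent latency.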

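Log-based CDC:

Consumes the log in real time (stream processing). For example, the MySQL binlog records every change in the database in full, so the binlog file can be treated as a streaming data source;
Guarantees data consistency, because the binlog file contains the details of all historical changes;
Guarantees real-time delivery, because log files such as the binlog can be consumed as a stream and so provide real-time data.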
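The difference between the two approaches can be sketched with a small in-memory model. This is an assumption-laden illustration, not how any real tool is implemented: `simulate` is a hypothetical helper, a single string stands in for a table row, and a list stands in for the binlog.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical in-memory model; no real database or CDC tool is involved.
class PollingVsLogSketch {

    // Applies the writes to one row and returns
    // {changes visible to polling, changes visible in the log}.
    static int[] simulate(String initialValue, List<String> writes) {
        String row = initialValue;
        List<String> log = new ArrayList<>();   // stands in for the binlog

        String snapshotBefore = row;            // first scheduled query
        for (String newValue : writes) {
            row = newValue;                     // the table keeps only the latest value
            log.add(newValue);                  // the log keeps every write
        }
        String snapshotAfter = row;             // next scheduled query

        // Polling can only compare two snapshots, so it sees at most one change per row.
        int seenByPolling = snapshotBefore.equals(snapshotAfter) ? 0 : 1;
        return new int[] {seenByPolling, log.size()};
    }

    public static void main(String[] args) {
        List<String> writes = new ArrayList<>();
        writes.add("v1");
        writes.add("v2");
        writes.add("v3");
        int[] seen = simulate("v0", writes);
        System.out.println("polling=" + seen[0] + ", log=" + seen[1]); // prints polling=1, log=3
    }
}
```

Three writes between two polls collapse into a single visible change for the query-based approach, while the log-based approach sees all three. If the row is changed and then changed back before the next poll, polling sees nothing at all.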

	Query-based CDC	Binlog-based CDC
Open-source products	Sqoop, DataX, Kafka JDBC Source	Canal, Maxwell, Debezium
Execution mode	Batch	Streaming
Captures every data change	No	Yes
Latency	High	Low
Adds load to the database	Yes	No

1.3 Flink-CDC

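Flink CDC (Flink Change Data Capture) is a data integration framework built on database-log CDC technology that reads full (snapshot) and incremental data in one unified pipeline. Combined with the Flink compute engine, Flink CDC can integrate massive volumes of data in real time efficiently.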

Source repository: https://github.com/ververica/flink-cdc-connectors

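Debezium is a CDC component widely used outside China. It reads on a single node, so its read capacity does not scale out as well as that of a distributed reader. Flink CDC's architecture, by contrast, plugs in well to the many distributed frameworks of the big-data ecosystem (Hive, Hudi, and so on).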

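DataX has no real support for incremental synchronization. If the rows added to a MySQL table each day belong to different dates and there is no way to determine when each row was produced, getting that data into the data warehouse becomes a problem worth thinking about. DataX supports full-table synchronization as well as import and export driven by SQL queries. Full synchronization on every run is not viable, and the SQL-query approach is not a good incremental solution either when no column identifies the incremental rows.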

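Canal is middleware written in Java that parses a database's incremental log and provides subscription to and consumption of incremental data. Canal mainly supports parsing the MySQL binlog and writes the incremental data to middleware such as Kafka or RocketMQ. It cannot synchronize historical data, however, because rows that existed before the binlog it reads never reach it as change events.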

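Sqoop is mainly used to transfer data between Hadoop (Hive) and traditional databases (MySQL, PostgreSQL, and so on). Sqoop translates an import or export command into a MapReduce program. The drawback is that Sqoop can only do batch imports, with all-or-nothing consistency: if the MapReduce job succeeds, the synchronization succeeds; if it fails, the whole synchronization fails.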

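Apache SeaTunnel is another data integration and synchronization component that is currently very popular. It supports both full and incremental synchronization, and unified stream and batch processing. SeaTunnel is very simple to use: no code is required, only a configuration file submitted with a command. It also uses a distributed architecture and can rely on Flink, Spark, or its own Zeta engine to run one job across multiple nodes. Internally it has a state-saving mechanism similar to Flink checkpoints for failure recovery, and a two-phase commit in the sink stage can provide exactly-once delivery. SeaTunnel handles most scenarios well, but it supports only simple data transformation logic; complex transformations still require Flink or Spark jobs.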

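Flink CDC addresses most of the shortcomings of the frameworks above. It synchronizes a database's full and incremental data to message queues and data warehouses in one unified pipeline. It can also be used for real-time data integration, loading database data into data lakes and warehouses in real time. Unlike other CDC tools, it needs no separate deployment on a server, which lowers maintenance cost and shortens the pipeline. It also fits directly into Flink programs: the stream captured by CDC feeds straight into Flink for processing, so a single codebase handles extraction, transformation, and output, written either with Flink's DataStream API or with the higher-level Flink SQL API.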

Connectors supported as of Flink CDC 2.2:

Supported Flink versions:

Chapter 2 Flink CDC Hands-On Examples

2.1 Enable the MySQL Binlog and Restart MySQL

2.2 Using the DataStream API

2.2.1 Add the Dependencies

<dependencies>
    <dependency>
        <groupId>org.apache.flink</groupId>
        <artifactId>flink-java</artifactId>
        <version>1.13.0</version>
    </dependency>

    <dependency>
        <groupId>org.apache.flink</groupId>
        <artifactId>flink-streaming-java_2.12</artifactId>
        <version>1.13.0</version>
    </dependency>

    <dependency>
        <groupId>org.apache.flink</groupId>
        <artifactId>flink-clients_2.12</artifactId>
        <version>1.13.0</version>
    </dependency>

    <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-client</artifactId>
        <version>3.1.3</version>
    </dependency>

    <dependency>
        <groupId>mysql</groupId>
        <artifactId>mysql-connector-java</artifactId>
        <version>8.0.16</version>
    </dependency>

    <dependency>
        <groupId>com.ververica</groupId>
        <artifactId>flink-connector-mysql-cdc</artifactId>
        <version>2.1.0</version>
    </dependency>

    <!-- Without a flink-table dependency the job fails with:
         Caused by: java.lang.ClassNotFoundException:
         org.apache.flink.connector.base.source.reader.RecordEmitter
         Adding the dependency below resolves this (some other flink-table
         dependencies work as well). -->
    <dependency>
        <groupId>org.apache.flink</groupId>
        <artifactId>flink-table-api-java-bridge_2.12</artifactId>
        <version>1.13.0</version>
    </dependency>

    <dependency>
        <groupId>com.alibaba</groupId>
        <artifactId>fastjson</artifactId>
        <version>1.2.68</version>
    </dependency>
</dependencies>
<build>
    <plugins>
        <plugin>
            <groupId>org.apache.maven.plugins</groupId>
            <artifactId>maven-assembly-plugin</artifactId>
            <version>3.0.0</version>
            <configuration>
                <descriptorRefs>
                    <descriptorRef>jar-with-dependencies</descriptorRef>
                </descriptorRefs>
            </configuration>
            <executions>
                <execution>
                    <id>make-assembly</id>
                    <phase>package</phase>
                    <goals>
                        <goal>single</goal>
                    </goals>
                </execution>
            </executions>
        </plugin>
    </plugins>
</build>

2.2.2 Write the Code

import com.ververica.cdc.connectors.mysql.source.MySqlSource;
import com.ververica.cdc.connectors.mysql.table.StartupOptions;
import com.ververica.cdc.debezium.JsonDebeziumDeserializationSchema;
import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.api.common.restartstrategy.RestartStrategies;
import org.apache.flink.api.common.time.Time;
import org.apache.flink.runtime.state.hashmap.HashMapStateBackend;
import org.apache.flink.streaming.api.CheckpointingMode;
import org.apache.flink.streaming.api.datastream.DataStreamSource;
import org.apache.flink.streaming.api.environment.CheckpointConfig;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

/**
 * description:
 * Created by 鐵盾 on 2022/4/6
 */
public class FlinkCDC_01_DS {
    public static void main(String[] args) throws Exception {
        // TODO 1. Prepare the stream execution environment
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.setParallelism(1);

        // TODO 2. Enable checkpointing. Flink CDC stores the binlog position it has read
        // as state in the checkpoint (CK). To resume from where the job stopped,
        // start the program from a checkpoint or savepoint.
        // 2.1 Trigger a checkpoint every 3 seconds and set the consistency semantics
        env.enableCheckpointing(3000L, CheckpointingMode.EXACTLY_ONCE);
        // 2.2 Set the checkpoint timeout to 1 minute
        env.getCheckpointConfig().setCheckpointTimeout(60 * 1000L);
        // 2.3 Set the minimum pause between two checkpoints
        env.getCheckpointConfig().setMinPauseBetweenCheckpoints(3000L);
        // 2.4 Retain the last checkpoint when the job is cancelled
        env.getCheckpointConfig().enableExternalizedCheckpoints(
                CheckpointConfig.ExternalizedCheckpointCleanup.RETAIN_ON_CANCELLATION);
        // 2.5 Failure-rate restart strategy: failure rate 3, measured over 1 day,
        // with a 1-minute delay between restart attempts
        env.setRestartStrategy(RestartStrategies.failureRateRestart(
                3, Time.days(1L), Time.minutes(1L)
        ));
        // 2.6 Set the state backend and the checkpoint storage location
        env.setStateBackend(new HashMapStateBackend());
        env.getCheckpointConfig().setCheckpointStorage(
                "hdfs://hadoop102:8020/flinkCDC"
        );
        // 2.7 Set the user name used to access HDFS
        System.setProperty("HADOOP_USER_NAME", "atguigu");

        // TODO 3. Create the Flink MySQL CDC source
        // Startup options:
        // initial: performs an initial snapshot of the monitored database tables on first
        //     startup, then continues reading the latest binlog.
        // earliest: never performs a snapshot of the monitored tables on first startup;
        //     reads from the beginning of the binlog. Use with care: this is only valid
        //     when the binlog is guaranteed to contain the entire history of the database.
        // latest: never performs a snapshot on first startup; reads from the end of the
        //     binlog, so only changes made after the connector started are captured.
        // specificOffset: never performs a snapshot on first startup; reads the binlog
        //     directly from the specified offset.
        // timestamp: never performs a snapshot on first startup; reads the binlog from the
        //     specified timestamp. The consumer traverses the binlog from the beginning and
        //     ignores change events whose timestamp is earlier than the specified one.
        MySqlSource<String> mySqlSource = MySqlSource.<String>builder()
                .hostname("hadoop102")
                .port(3306)
                .databaseList("gmall_config") // set captured database
                .tableList("gmall_config.t_user") // set captured table
                .username("root")
                .password("000000")
                .deserializer(new JsonDebeziumDeserializationSchema()) // converts SourceRecord to JSON String
                .startupOptions(StartupOptions.initial())
                .build();

        // TODO 4. Read data from MySQL with the CDC source
        DataStreamSource<String> mysqlDS =
                env.fromSource(
                        mySqlSource,
                        WatermarkStrategy.noWatermarks(),
                        "MysqlSource");

        // TODO 5. Print the output
        mysqlDS.print();

        // TODO 6. Execute the job
        env.execute();
    }
}
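The startup options described in the comments above differ mainly in where reading begins. The following is a simplified conceptual model covering three of them, not the connector's actual implementation: `Mode`, `eventsRead`, and the three input lists are hypothetical names introduced only for this sketch.

```java
import java.util.ArrayList;
import java.util.List;

// Conceptual model only; the real connector logic is considerably more involved.
class StartupModeSketch {
    enum Mode { INITIAL, EARLIEST, LATEST }

    // tableRows:      rows present in the table when the job first starts
    // retainedBinlog: binlog events written before the job started and still available
    // newChanges:     binlog events written after the job started
    static List<String> eventsRead(Mode mode,
                                   List<String> tableRows,
                                   List<String> retainedBinlog,
                                   List<String> newChanges) {
        List<String> out = new ArrayList<>();
        switch (mode) {
            case INITIAL:
                // Snapshot the current table contents first.
                for (String row : tableRows) {
                    out.add("snapshot:" + row);
                }
                break;
            case EARLIEST:
                // No snapshot; replay the binlog from its beginning.
                out.addAll(retainedBinlog);
                break;
            case LATEST:
                // No snapshot and no history; start at the end of the binlog.
                break;
        }
        out.addAll(newChanges); // every mode keeps following new changes
        return out;
    }

    public static void main(String[] args) {
        List<String> rows = new ArrayList<>();
        rows.add("id=1");
        rows.add("id=2");
        List<String> retained = new ArrayList<>();
        retained.add("insert id=1");
        retained.add("insert id=2");
        List<String> fresh = new ArrayList<>();
        fresh.add("update id=1");

        System.out.println(eventsRead(Mode.INITIAL, rows, retained, fresh));
        // prints [snapshot:id=1, snapshot:id=2, update id=1]
        System.out.println(eventsRead(Mode.EARLIEST, rows, retained, fresh));
        // prints [insert id=1, insert id=2, update id=1]
        System.out.println(eventsRead(Mode.LATEST, rows, retained, fresh));
        // prints [update id=1]
    }
}
```

The model also shows why `earliest` needs care: it reproduces the table correctly only if the retained binlog really contains every change since the table was created.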

2.2.3 Test the Example

1) Package the project and upload the jar to Linux

2) Start the HDFS cluster

[atguigu@hadoop102 flink-local]$ start-dfs.sh

3) Start the Flink cluster

[atguigu@hadoop102 flink-local]$ bin/start-cluster.sh

4) Start the program

[atguigu@hadoop102 flink-local]$ bin/flink run -m hadoop102:8081 -c com.atguigu.cdc.FlinkCDC_01_DS ./gmall-flink-cdc.jar

5) Check the TaskManager log; the table data is read from the beginning

6) Create a savepoint for the running Flink program

[atguigu@hadoop102 flink-local]$ bin/flink savepoint JobId hdfs://hadoop102:8020/flinkCDC/save

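7) Cancel the job in the Web UI

8) Insert, update, or delete rows in the gmall_config.t_user table in MySQL

9) Restart the program from the savepoint

[atguigu@hadoop102 flink-standalone]$ bin/flink run -s hdfs://hadoop102:8020/flinkCDC/save/... -c com.atguigu.cdc.FlinkCDC_01_DS ./gmall-flink-cdc.jar

10) Check the TaskManager log; the job resumes from the position stored in the savepoint, so it picks up the changes made while it was stopped instead of reading the table from the beginning again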
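The savepoint test works because the read position is saved as state and restored on restart. Below is a minimal sketch of that idea under simplifying assumptions: the `Reader` class and its methods are hypothetical, a list stands in for the binlog, and an integer index stands in for the binlog position that the real connector stores in Flink state.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical model of resuming from a saved read position; not Flink's API.
class OffsetResumeSketch {

    static final class Reader {
        private final List<String> binlog;
        private int offset; // index of the next event to read

        Reader(List<String> binlog, int startOffset) {
            this.binlog = binlog;
            this.offset = startOffset;
        }

        // Returns every event not yet read and advances the position.
        List<String> poll() {
            List<String> batch = new ArrayList<>(binlog.subList(offset, binlog.size()));
            offset = binlog.size();
            return batch;
        }

        // What a savepoint would persist for this reader.
        int snapshotState() {
            return offset;
        }
    }

    public static void main(String[] args) {
        List<String> binlog = new ArrayList<>();
        binlog.add("e0");
        binlog.add("e1");

        Reader first = new Reader(binlog, 0);
        System.out.println(first.poll());     // prints [e0, e1]
        int saved = first.snapshotState();    // take the "savepoint", then cancel the job

        // Changes made in MySQL while the job is stopped.
        binlog.add("e2");
        binlog.add("e3");

        Reader restored = new Reader(binlog, saved);
        System.out.println(restored.poll());  // prints [e2, e3]
    }
}
```

Starting the second reader at offset 0 instead of the saved offset would correspond to submitting the job without `-s`: everything would be read again from the start.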

2.3 Using Flink SQL

2.3.1 Add the Dependency

<dependency>
  <groupId>org.apache.flink</groupId>
  <artifactId>flink-table-planner-blink_2.12</artifactId>
  <version>1.13.0</version>
</dependency>

2.3.2 Implementation

import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.table.api.bridge.java.StreamTableEnvironment;

/**
 * description:
 * Created by 鐵盾 on 2022/4/6
 */
public class FlinkCDC_02_SQL {
    public static void main(String[] args) throws Exception {
        // TODO 1. Prepare the environments
        // 1.1 Stream execution environment
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.setParallelism(1);
        // 1.2 Table environment
        StreamTableEnvironment tableEnv = StreamTableEnvironment.create(env);

        // TODO 2. Create the dynamic table
        tableEnv.executeSql("CREATE TABLE user_info (\n" +
                "id INT,\n" +
                "name STRING,\n" +
                "age INT,\n" +
                "primary key(id) not enforced\n" +
                ") WITH (" +
                "'connector' = 'mysql-cdc'," +
                "'hostname' = 'hadoop102'," +
                "'port' = '3306'," +
                "'username' = 'root'," +
                "'password' = '000000'," +
                "'database-name' = 'gmall_config'," +
                "'table-name' = 't_user'" +
                ")");

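        // TODO 3. Query the table and print the changelog.
        // executeSql() submits the job itself, and print() blocks while the unbounded
        // result streams in, so no env.execute() call is needed (calling it here would
        // fail because no DataStream operators are defined).
        tableEnv.executeSql("select * from user_info").print();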
    }
}
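When the query result is printed, each row carries a changelog marker such as +I or -U. The sketch below shows how Debezium operation codes correspond to those markers. The class and method names are hypothetical and written only for illustration; the op codes (c, r, u, d) and the row-kind short strings (+I, -U, +U, -D) are the ones used by Debezium and Flink respectively.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Illustrative mapping only; this is not the connector's own deserializer.
class OpToRowKindSketch {

    // Maps a Debezium "op" code to the changelog row kinds emitted for it.
    static List<String> toRowKinds(String op) {
        switch (op) {
            case "c": // row created
            case "r": // row read during the initial snapshot
                return Collections.singletonList("+I");
            case "u": // update: the old row is retracted, then the new row is emitted
                return Arrays.asList("-U", "+U");
            case "d": // row deleted
                return Collections.singletonList("-D");
            default:
                throw new IllegalArgumentException("unknown op: " + op);
        }
    }

    public static void main(String[] args) {
        System.out.println(toRowKinds("r")); // prints [+I]
        System.out.println(toRowKinds("u")); // prints [-U, +U]
        System.out.println(toRowKinds("d")); // prints [-D]
    }
}
```

This is why updating one row in t_user produces two printed rows: one with the old values marked -U and one with the new values marked +U.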
