Foreword
One important reason Flink SQL is concise, easy to use, and powerful is its rich set of Connector components. A Connector is the vehicle through which Flink interacts with external systems, and comes in two flavors: Sources, which read data, and Sinks, which write it. However, the Connectors built into Flink SQL cannot always cover every real-world business requirement, so sometimes we need to build our own. Fortunately, the community provides a standardized, easily extensible framework: as long as we program against the prescribed interfaces, building a custom Connector is straightforward. In this article we implement a SQL-ready Redis Connector step by step on top of the existing Bahir Flink project.
Introducing DynamicTableSource/Sink
The diagram below sketches the current (Flink 1.11+) architecture of Flink SQL Connectors; the design document is FLIP-95.
Dynamic tables have long been the central concept unifying streaming and batch in Flink SQL, and they are the core of the Planning phase in the architecture above. Writing a custom Connector mainly means implementing a dynamic-table-based Source/Sink, together with the upstream factory that produces it and the downstream RuntimeProvider that actually executes the Source/Sink logic in the Runtime phase. Table metadata in the Metadata phase is maintained by the Catalog.
Heads up: a lot of code follows.
Implementing RedisDynamicTableFactory
A DynamicTableFactory is responsible for:
- defining and validating the options passed in when a table is created;
- obtaining the table's metadata;
- defining encoding/decoding formats for reading and writing data (optional);
- creating ready-to-use DynamicTable[Source/Sink] instances.
The skeleton of a factory class implementing the DynamicTable[Source/Sink]Factory interfaces looks like this.
public class RedisDynamicTableFactory implements DynamicTableSourceFactory, DynamicTableSinkFactory {
    @Override
    public DynamicTableSource createDynamicTableSource(Context context) { }

    @Override
    public DynamicTableSink createDynamicTableSink(Context context) { }

    @Override
    public String factoryIdentifier() { }

    @Override
    public Set<ConfigOption<?>> requiredOptions() { }

    @Override
    public Set<ConfigOption<?>> optionalOptions() { }
}
First, let's define the options the Redis Connector needs, using the built-in ConfigOption/ConfigOptions classes. Their meanings are self-explanatory, so we will not go through them one by one.
public static final ConfigOption<String> MODE = ConfigOptions
    .key("mode")
    .stringType()
    .defaultValue("single");

public static final ConfigOption<String> SINGLE_HOST = ConfigOptions
    .key("single.host")
    .stringType()
    .defaultValue(Protocol.DEFAULT_HOST);

public static final ConfigOption<Integer> SINGLE_PORT = ConfigOptions
    .key("single.port")
    .intType()
    .defaultValue(Protocol.DEFAULT_PORT);

public static final ConfigOption<String> CLUSTER_NODES = ConfigOptions
    .key("cluster.nodes")
    .stringType()
    .noDefaultValue();

public static final ConfigOption<String> SENTINEL_NODES = ConfigOptions
    .key("sentinel.nodes")
    .stringType()
    .noDefaultValue();

public static final ConfigOption<String> SENTINEL_MASTER = ConfigOptions
    .key("sentinel.master")
    .stringType()
    .noDefaultValue();

public static final ConfigOption<String> PASSWORD = ConfigOptions
    .key("password")
    .stringType()
    .noDefaultValue();

public static final ConfigOption<String> COMMAND = ConfigOptions
    .key("command")
    .stringType()
    .noDefaultValue();

public static final ConfigOption<Integer> DB_NUM = ConfigOptions
    .key("db-num")
    .intType()
    .defaultValue(Protocol.DEFAULT_DATABASE);

public static final ConfigOption<Integer> TTL_SEC = ConfigOptions
    .key("ttl-sec")
    .intType()
    .noDefaultValue();

public static final ConfigOption<Integer> CONNECTION_TIMEOUT_MS = ConfigOptions
    .key("connection.timeout-ms")
    .intType()
    .defaultValue(Protocol.DEFAULT_TIMEOUT);

public static final ConfigOption<Integer> CONNECTION_MAX_TOTAL = ConfigOptions
    .key("connection.max-total")
    .intType()
    .defaultValue(GenericObjectPoolConfig.DEFAULT_MAX_TOTAL);

public static final ConfigOption<Integer> CONNECTION_MAX_IDLE = ConfigOptions
    .key("connection.max-idle")
    .intType()
    .defaultValue(GenericObjectPoolConfig.DEFAULT_MAX_IDLE);

public static final ConfigOption<Boolean> CONNECTION_TEST_ON_BORROW = ConfigOptions
    .key("connection.test-on-borrow")
    .booleanType()
    .defaultValue(GenericObjectPoolConfig.DEFAULT_TEST_ON_BORROW);

public static final ConfigOption<Boolean> CONNECTION_TEST_ON_RETURN = ConfigOptions
    .key("connection.test-on-return")
    .booleanType()
    .defaultValue(GenericObjectPoolConfig.DEFAULT_TEST_ON_RETURN);

public static final ConfigOption<Boolean> CONNECTION_TEST_WHILE_IDLE = ConfigOptions
    .key("connection.test-while-idle")
    .booleanType()
    .defaultValue(GenericObjectPoolConfig.DEFAULT_TEST_WHILE_IDLE);

public static final ConfigOption<String> LOOKUP_ADDITIONAL_KEY = ConfigOptions
    .key("lookup.additional-key")
    .stringType()
    .noDefaultValue();

public static final ConfigOption<Integer> LOOKUP_CACHE_MAX_ROWS = ConfigOptions
    .key("lookup.cache.max-rows")
    .intType()
    .defaultValue(-1);

public static final ConfigOption<Integer> LOOKUP_CACHE_TTL_SEC = ConfigOptions
    .key("lookup.cache.ttl-sec")
    .intType()
    .defaultValue(-1);
Next, override requiredOptions() and optionalOptions(), which return the sets of required and optional connector options respectively.
@Override
public Set<ConfigOption<?>> requiredOptions() {
    Set<ConfigOption<?>> requiredOptions = new HashSet<>();
    requiredOptions.add(MODE);
    requiredOptions.add(COMMAND);
    return requiredOptions;
}

@Override
public Set<ConfigOption<?>> optionalOptions() {
    Set<ConfigOption<?>> optionalOptions = new HashSet<>();
    optionalOptions.add(SINGLE_HOST);
    optionalOptions.add(SINGLE_PORT);
    // the other 14 options are omitted for brevity...
    optionalOptions.add(LOOKUP_CACHE_TTL_SEC);
    return optionalOptions;
}
Then override createDynamicTableSource() and createDynamicTableSink() to create the DynamicTableSource and DynamicTableSink instances. Before creating them, we can use the built-in TableFactoryHelper utility to validate the supplied options, or write our own validation logic. The table's metadata is also available through the associated Context object. The code is below; the concrete Source/Sink classes will be written shortly.
@Override
public DynamicTableSource createDynamicTableSource(Context context) {
    FactoryUtil.TableFactoryHelper helper = FactoryUtil.createTableFactoryHelper(this, context);
    helper.validate();
    ReadableConfig options = helper.getOptions();
    validateOptions(options);
    TableSchema schema = context.getCatalogTable().getSchema();
    return new RedisDynamicTableSource(options, schema);
}

@Override
public DynamicTableSink createDynamicTableSink(Context context) {
    FactoryUtil.TableFactoryHelper helper = FactoryUtil.createTableFactoryHelper(this, context);
    helper.validate();
    ReadableConfig options = helper.getOptions();
    validateOptions(options);
    TableSchema schema = context.getCatalogTable().getSchema();
    return new RedisDynamicTableSink(options, schema);
}

private void validateOptions(ReadableConfig options) {
    switch (options.get(MODE)) {
        case "single":
            if (StringUtils.isEmpty(options.get(SINGLE_HOST))) {
                throw new IllegalArgumentException("Parameter single.host must be provided in single mode");
            }
            break;
        case "cluster":
            if (StringUtils.isEmpty(options.get(CLUSTER_NODES))) {
                throw new IllegalArgumentException("Parameter cluster.nodes must be provided in cluster mode");
            }
            break;
        case "sentinel":
            if (StringUtils.isEmpty(options.get(SENTINEL_NODES)) || StringUtils.isEmpty(options.get(SENTINEL_MASTER))) {
                throw new IllegalArgumentException("Parameters sentinel.nodes and sentinel.master must be provided in sentinel mode");
            }
            break;
        default:
            throw new IllegalArgumentException("Invalid Redis mode. Must be single/cluster/sentinel");
    }
}
factoryIdentifier() returns the factory's identifier, which is exactly the value that must be supplied for the connector option when creating a table.
@Override
public String factoryIdentifier() {
    return "redis";
}
As covered in an earlier article, Flink SQL uses the Java SPI mechanism to discover and load table factories. So, finally, don't forget to create a file named org.apache.flink.table.factories.Factory under the META-INF/services directory on the classpath, containing the fully qualified name of our factory class, e.g. org.apache.flink.streaming.connectors.redis.dynamic.RedisDynamicTableFactory.
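Concretely, the finished service file contains just one line, the fully qualified class name (the path below assumes a standard Maven layout; the package name matches this article's example):

```
# src/main/resources/META-INF/services/org.apache.flink.table.factories.Factory
org.apache.flink.streaming.connectors.redis.dynamic.RedisDynamicTableFactory
```

ServiceLoader provider-configuration files permit `#` comments, so the first line above is legal; still, many projects keep the file to the class name alone.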
Implementing RedisDynamicTableSink
The Bahir Flink project already provides a DataStream-API-based RedisSink, which we can reuse to build RedisDynamicTableSink directly and avoid duplicated work. The skeleton of the class implementing the DynamicTableSink interface is shown below.
public class RedisDynamicTableSink implements DynamicTableSink {
    private final ReadableConfig options;
    private final TableSchema schema;

    public RedisDynamicTableSink(ReadableConfig options, TableSchema schema) {
        this.options = options;
        this.schema = schema;
    }

    @Override
    public ChangelogMode getChangelogMode(ChangelogMode changelogMode) { }

    @Override
    public SinkRuntimeProvider getSinkRuntimeProvider(Context context) { }

    @Override
    public DynamicTableSink copy() { }

    @Override
    public String asSummaryString() { }
}
getChangelogMode() must return the kinds of changelog rows this Sink accepts. Since data written to Redis can be append-only as well as retractable (e.g. all kinds of aggregates), the INSERT, UPDATE_BEFORE and UPDATE_AFTER kinds are supported.
@Override
public ChangelogMode getChangelogMode(ChangelogMode changelogMode) {
    return ChangelogMode.newBuilder()
        .addContainedKind(RowKind.INSERT)
        .addContainedKind(RowKind.UPDATE_BEFORE)
        .addContainedKind(RowKind.UPDATE_AFTER)
        .build();
}
Next we implement the SinkRuntimeProvider, i.e. write the SinkFunction for the underlying runtime to call. Since RedisSink is already a ready-made SinkFunction, all we need is a generic RedisMapper, plus some upfront validation (such as checking the table's column count and data types). The getSinkRuntimeProvider() method and the RedisMapper are shown below; they are easy to follow.
@Override
public SinkRuntimeProvider getSinkRuntimeProvider(Context context) {
    Preconditions.checkNotNull(options, "No options supplied");
    FlinkJedisConfigBase jedisConfig = Util.getFlinkJedisConfig(options);
    Preconditions.checkNotNull(jedisConfig, "No Jedis config supplied");
    RedisCommand command = RedisCommand.valueOf(options.get(COMMAND).toUpperCase());
    int fieldCount = schema.getFieldCount();
    if (fieldCount != (needAdditionalKey(command) ? 3 : 2)) {
        throw new ValidationException("Redis sink only supports 2 or 3 columns");
    }
    DataType[] dataTypes = schema.getFieldDataTypes();
    for (int i = 0; i < fieldCount; i++) {
        if (!dataTypes[i].getLogicalType().getTypeRoot().equals(LogicalTypeRoot.VARCHAR)) {
            throw new ValidationException("Redis connector only supports STRING type");
        }
    }
    RedisMapper<RowData> mapper = new RedisRowDataMapper(options, command);
    RedisSink<RowData> redisSink = new RedisSink<>(jedisConfig, mapper);
    return SinkFunctionProvider.of(redisSink);
}

private static boolean needAdditionalKey(RedisCommand command) {
    return command.getRedisDataType() == RedisDataType.HASH || command.getRedisDataType() == RedisDataType.SORTED_SET;
}
public static final class RedisRowDataMapper implements RedisMapper<RowData> {
    private static final long serialVersionUID = 1L;

    private final ReadableConfig options;
    private final RedisCommand command;

    public RedisRowDataMapper(ReadableConfig options, RedisCommand command) {
        this.options = options;
        this.command = command;
    }

    @Override
    public RedisCommandDescription getCommandDescription() {
        return new RedisCommandDescription(command, "default-additional-key");
    }

    @Override
    public String getKeyFromData(RowData data) {
        return data.getString(needAdditionalKey(command) ? 1 : 0).toString();
    }

    @Override
    public String getValueFromData(RowData data) {
        return data.getString(needAdditionalKey(command) ? 2 : 1).toString();
    }

    @Override
    public Optional<String> getAdditionalKey(RowData data) {
        return needAdditionalKey(command) ? Optional.of(data.getString(0).toString()) : Optional.empty();
    }

    @Override
    public Optional<Integer> getAdditionalTTL(RowData data) {
        return options.getOptional(TTL_SEC);
    }
}
The remaining copy() and asSummaryString() methods are straightforward.
@Override
public DynamicTableSink copy() {
    return new RedisDynamicTableSink(options, schema);
}

@Override
public String asSummaryString() {
    return "Redis Dynamic Table Sink";
}
Implementing RedisDynamicTableSource
Unlike DynamicTableSink, DynamicTableSource comes in two flavors: ScanTableSource and LookupTableSource. As the names suggest, the former scans all or part of the data in the external system and supports features such as predicate pushdown and partition pushdown, while the latter never sees the full data set; instead, it performs point lookups by one or more keys and returns the matching results.
Since Redis usually serves as a dimension store in a data warehouse, the interface to implement here is LookupTableSource. The RedisDynamicTableSource class implementing it is shown below; its overall structure resembles the Sink.
public class RedisDynamicTableSource implements LookupTableSource {
    private final ReadableConfig options;
    private final TableSchema schema;

    public RedisDynamicTableSource(ReadableConfig options, TableSchema schema) {
        this.options = options;
        this.schema = schema;
    }

    @Override
    public LookupRuntimeProvider getLookupRuntimeProvider(LookupContext context) {
        Preconditions.checkArgument(context.getKeys().length == 1 && context.getKeys()[0].length == 1, "Redis source only supports lookup by single key");
        int fieldCount = schema.getFieldCount();
        if (fieldCount != 2) {
            throw new ValidationException("Redis source only supports 2 columns");
        }
        DataType[] dataTypes = schema.getFieldDataTypes();
        for (int i = 0; i < fieldCount; i++) {
            if (!dataTypes[i].getLogicalType().getTypeRoot().equals(LogicalTypeRoot.VARCHAR)) {
                throw new ValidationException("Redis connector only supports STRING type");
            }
        }
        return TableFunctionProvider.of(new RedisRowDataLookupFunction(options));
    }

    @Override
    public DynamicTableSource copy() {
        return new RedisDynamicTableSource(options, schema);
    }

    @Override
    public String asSummaryString() {
        return "Redis Dynamic Table Source";
    }
}
Flink requires the LookupRuntimeProvider used for point lookups to be either a TableFunction (synchronous) or an AsyncTableFunction (asynchronous). Because Jedis, the client used by the Bahir Flink project, is synchronous, this article only presents the synchronous version; for an asynchronous version, switch to another client such as Redisson or the Vert.x Redis Client. The code of RedisRowDataLookupFunction follows.
public static class RedisRowDataLookupFunction extends TableFunction<RowData> {
    private static final long serialVersionUID = 1L;

    private final ReadableConfig options;
    private final String command;
    private final String additionalKey;
    private final int cacheMaxRows;
    private final int cacheTtlSec;

    private RedisCommandsContainer commandsContainer;
    private transient Cache<RowData, RowData> cache;

    public RedisRowDataLookupFunction(ReadableConfig options) {
        Preconditions.checkNotNull(options, "No options supplied");
        this.options = options;
        command = options.get(COMMAND).toUpperCase();
        Preconditions.checkArgument(command.equals("GET") || command.equals("HGET"), "Redis table source only supports GET and HGET commands");
        additionalKey = options.get(LOOKUP_ADDITIONAL_KEY);
        cacheMaxRows = options.get(LOOKUP_CACHE_MAX_ROWS);
        cacheTtlSec = options.get(LOOKUP_CACHE_TTL_SEC);
    }

    @Override
    public void open(FunctionContext context) throws Exception {
        super.open(context);
        FlinkJedisConfigBase jedisConfig = Util.getFlinkJedisConfig(options);
        commandsContainer = RedisCommandsContainerBuilder.build(jedisConfig);
        commandsContainer.open();
        if (cacheMaxRows > 0 && cacheTtlSec > 0) {
            cache = CacheBuilder.newBuilder()
                .expireAfterWrite(cacheTtlSec, TimeUnit.SECONDS)
                .maximumSize(cacheMaxRows)
                .build();
        }
    }

    @Override
    public void close() throws Exception {
        if (cache != null) {
            cache.invalidateAll();
        }
        if (commandsContainer != null) {
            commandsContainer.close();
        }
        super.close();
    }

    public void eval(Object obj) {
        RowData lookupKey = GenericRowData.of(obj);
        if (cache != null) {
            RowData cachedRow = cache.getIfPresent(lookupKey);
            if (cachedRow != null) {
                collect(cachedRow);
                return;
            }
        }
        StringData key = lookupKey.getString(0);
        String value = command.equals("GET") ? commandsContainer.get(key.toString()) : commandsContainer.hget(additionalKey, key.toString());
        RowData result = GenericRowData.of(key, StringData.fromString(value));
        // guard against NPE when caching is disabled
        if (cache != null) {
            cache.put(lookupKey, result);
        }
        collect(result);
    }
}
Three points deserve attention:
- Redis dimension data is usually stored as String or Hash, so the supported commands are GET and HGET. With the Hash type, its key must be supplied as an extra option; unlike the Sink, it cannot be specified dynamically per row;
- To avoid querying Redis for every incoming record, a cache is needed; the code above uses Guava Cache. Keys not found in Redis must be cached as well, to guard against cache penetration;
- A TableFunction must expose a method with the signature eval(Object) or eval(Object...). In this example the actual output type is ROW<STRING, STRING>, which in the Flink Table type system is represented as a RowData of (StringData, StringData).
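The second point, caching misses as well as hits, can be illustrated outside of Flink. Below is a minimal, self-contained sketch of the idea; all names here are hypothetical, a LinkedHashMap-based LRU stands in for Guava Cache, and a plain Map stands in for Redis. Once a lookup finds nothing, the miss itself is cached, so repeated probes for the same absent key never reach the backend again.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Optional;

// Hypothetical sketch, not the connector's API: a lookup cache that also
// remembers misses, so absent keys cannot cause "cache penetration".
class MissCachingLookup {
    private final Map<String, String> backend;          // stands in for Redis
    private final Map<String, Optional<String>> cache;  // Optional.empty() = cached miss
    public int backendCalls = 0;                        // for observing the effect

    public MissCachingLookup(Map<String, String> backend, int maxRows) {
        this.backend = backend;
        // access-ordered LinkedHashMap doubles as a tiny LRU cache
        this.cache = new LinkedHashMap<String, Optional<String>>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, Optional<String>> eldest) {
                return size() > maxRows;
            }
        };
    }

    public String lookup(String key) {
        Optional<String> cached = cache.get(key);
        if (cached != null) {
            return cached.orElse(null);               // hit, possibly a cached miss
        }
        backendCalls++;
        String value = backend.get(key);              // the real code calls GET/HGET here
        cache.put(key, Optional.ofNullable(value));   // cache the miss too
        return value;
    }
}
```

In the connector above, Guava Cache plays this role; note that a production cache should also bound how long misses stay cached (the connector's lookup.cache.ttl-sec).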
Using Redis SQL Connector
Let's put it to use. First, create a Redis Sink table representing a Hash structure.
CREATE TABLE rtdw_dws.redis_test_order_stat_dashboard (
  hashKey STRING,
  cityId STRING,
  data STRING,
  PRIMARY KEY (hashKey) NOT ENFORCED
) WITH (
  'connector' = 'redis',
  'mode' = 'single',
  'single.host' = '172.16.200.124',
  'single.port' = '6379',
  'db-num' = '10',
  'command' = 'HSET',
  'ttl-sec' = '86400',
  'connection.max-total' = '5',
  'connection.timeout-ms' = '5000',
  'connection.test-while-idle' = 'true'
)
Then read the order stream from Kafka, compute some simple statistics, and write them into Redis.
/*
tableEnvConfig.setBoolean("table.dynamic-table-options.enabled", true)
tableEnvConfig.setBoolean("table.exec.emit.early-fire.enabled", true)
tableEnvConfig.setString("table.exec.emit.early-fire.delay", "5s")
tableEnv.createTemporarySystemFunction("MapToJsonString", classOf[MapToJsonString])
*/
INSERT INTO rtdw_dws.redis_test_order_stat_dashboard
SELECT
  CONCAT('dashboard:city_stat:', p.orderDay) AS hashKey,
  CAST(p.cityId AS STRING) AS cityId,
  MapToJsonString(MAP[
    'subOrderNum', CAST(p.subOrderNum AS STRING),
    'buyerNum', CAST(p.buyerNum AS STRING),
    'gmv', CAST(p.gmv AS STRING)
  ]) AS data
FROM (
  SELECT
    cityId,
    SUBSTR(tss, 0, 10) AS orderDay,
    COUNT(1) AS subOrderNum,
    COUNT(DISTINCT userId) AS buyerNum,
    SUM(quantity * merchandisePrice) AS gmv
  FROM rtdw_dwd.kafka_order_done_log /*+ OPTIONS('scan.startup.mode'='latest-offset','properties.group.id'='fsql_redis_test_order_stat_dashboard') */
  GROUP BY TUMBLE(procTime, INTERVAL '1' DAY), cityId, SUBSTR(tss, 0, 10)
) p
And observe the results~
Now let's see Redis used as a dimension table, again with the Hash structure.
CREATE TABLE rtdw_dim.redis_test_city_info (
  cityId STRING,
  cityName STRING
) WITH (
  'connector' = 'redis',
  'mode' = 'single',
  'single.host' = '172.16.200.124',
  'single.port' = '6379',
  'db-num' = '9',
  'command' = 'HGET',
  'connection.timeout-ms' = '5000',
  'connection.test-while-idle' = 'true',
  'lookup.additional-key' = 'rtdw_dim:test_city_info',
  'lookup.cache.max-rows' = '1000',
  'lookup.cache.ttl-sec' = '600'
)
To make the output easy to observe, create a Print Sink table, then temporal-join the Kafka stream table with the Redis dimension table. The SQL is as follows.
CREATE TABLE test.print_redis_test_dim_join (
  tss STRING,
  cityId BIGINT,
  cityName STRING
) WITH (
  'connector' = 'print'
)

INSERT INTO test.print_redis_test_dim_join
SELECT a.tss, a.cityId, b.cityName
FROM rtdw_dwd.kafka_order_done_log /*+ OPTIONS('scan.startup.mode'='latest-offset','properties.group.id'='fsql_redis_source_test') */ AS a
LEFT JOIN rtdw_dim.redis_test_city_info FOR SYSTEM_TIME AS OF a.procTime AS b ON CAST(a.cityId AS STRING) = b.cityId
WHERE a.orderType = 12
The output looks like this:
4> +I(2021-03-04 20:44:48,10264,漳州市)
3> +I(2021-03-04 20:45:26,10030,常德市)
4> +I(2021-03-04 20:45:23,10332,桂林市)
7> +I(2021-03-04 20:45:26,10031,九江市)
9> +I(2021-03-04 20:45:23,10387,惠州市)
4> +I(2021-03-04 20:45:19,10607,蕪湖市)
3> +I(2021-03-04 20:45:25,10364,無錫市)
The End
With the examples above, readers should now be able to customize Flink SQL Connectors flexibly for their own needs. The topics not covered in detail here, ScanTableSource, the asynchronous LookupTableSource, and Encoding/Decoding Formats, will be discussed in future articles.
The early-spring chill is still with us; everyone, dress accordingly.
Good night, good night.