1. Regular Joins (dual-stream join)
This kind of join has to retain the state of both streams persistently, without ever clearing it. Every record on each side remains visible to the other stream, so the data must be kept in state indefinitely. Since state cannot be allowed to grow without bound, this join is only suitable for bounded streams, or for unbounded streams combined with a state TTL. Its syntax closely resembles offline batch SQL:
LEFT JOIN, RIGHT JOIN, FULL JOIN, INNER JOIN
CREATE TABLE NOC (
agent_id STRING,
codename STRING
)
WITH (
'connector' = 'kafka'
);
CREATE TABLE RealNames (
agent_id STRING,
name STRING
)
WITH (
'connector' = 'kafka'
);
SELECT
name,
codename
FROM NOC
INNER JOIN RealNames ON NOC.agent_id = RealNames.agent_id;
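As noted above, an unbounded regular join is usually paired with a state TTL so that idle join state is eventually cleaned up. A minimal sketch (the TTL value here is an illustrative assumption, not from the original text):

```sql
-- Expire join state that has not been accessed for 24 hours.
-- Records evicted from state can no longer participate in the join,
-- so pick a TTL that matches the activity window of your data.
SET 'table.exec.state.ttl' = '24 h';
```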
2. Interval Joins
An interval join adds a time-window constraint: for two streams to join, the records must not only share the same join key, but one stream's timestamp must also fall within a bounded range of the other stream's timestamp. Because of this bound, data outside the time range can be cleaned up, so there is no need to retain the full state. Interval joins support both processing time and event time for defining the window. With processing time, Flink internally uses the system clock to bound the window and perform the corresponding state cleanup; with event time, it uses the watermark mechanism to do the same.
CREATE TABLE orders (
id INT,
order_time AS TIMESTAMPADD(DAY, CAST(FLOOR(RAND()*(1-5+1)+5)*(-1) AS INT), CURRENT_TIMESTAMP)
)
WITH (
'connector' = 'kafka'
);
CREATE TABLE shipments (
id INT,
order_id INT,
shipment_time AS TIMESTAMPADD(DAY, CAST(FLOOR(RAND()*(1-5+1)) AS INT), CURRENT_TIMESTAMP)
)
WITH (
'connector' = 'kafka'
);
SELECT
o.id AS order_id,
o.order_time,
s.shipment_time,
TIMESTAMPDIFF(DAY,o.order_time,s.shipment_time) AS day_diff
FROM orders o
JOIN shipments s ON o.id = s.order_id
WHERE
o.order_time BETWEEN s.shipment_time - INTERVAL '3' DAY AND s.shipment_time;
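Note that in streaming mode Flink only plans this as an interval join, with bounded state cleanup, when the columns compared in the BETWEEN predicate are time attributes; otherwise the query falls back to a regular join with a time filter and unbounded state. A sketch of declaring the order-time column as an event-time attribute (the watermark delay is an illustrative assumption):

```sql
-- Declaring order_time as an event-time attribute lets the planner
-- recognize the BETWEEN predicate as an interval join and use
-- watermarks to expire out-of-range state.
CREATE TABLE orders (
    id INT,
    order_time TIMESTAMP(3),
    WATERMARK FOR order_time AS order_time - INTERVAL '5' SECOND
)
WITH (
    'connector' = 'kafka'
);
```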
3. Temporal Table Join
In an interval join, both input streams must have a lower time bound, beyond which records become inaccessible. This does not suit many dimension-table join workloads, because dimension tables often have no time bound at all. To address this, Flink provides the Temporal Table Join. It is similar to a hash join in that the inputs are divided into a build table and a probe table: the former is typically the changelog of a dimension table, the latter is the business data stream, and the probe side's volume is typically far larger than the build side's. In a temporal table join, the build table is a time-versioned view over an append-only stream, hence the name temporal table. A temporal table requires a primary key and a versioning field (usually the event-time column) so that it can reflect a record's contents at different points in time.
CREATE TEMPORARY TABLE currency_rates (
`currency_code` STRING,
`eur_rate` DECIMAL(6,4),
`rate_time` TIMESTAMP(3),
WATERMARK FOR `rate_time` AS rate_time - INTERVAL '15' SECONDS,
PRIMARY KEY (currency_code) NOT ENFORCED
) WITH (
'connector' = 'upsert-kafka',
'topic' = 'currency_rates',
'properties.bootstrap.servers' = 'localhost:9092',
'key.format' = 'raw',
'value.format' = 'json'
);
CREATE TEMPORARY TABLE transactions (
`id` STRING,
`currency_code` STRING,
`total` DECIMAL(10,2),
`transaction_time` TIMESTAMP(3),
WATERMARK FOR `transaction_time` AS transaction_time - INTERVAL '30' SECONDS
) WITH (
'connector' = 'kafka',
'topic' = 'transactions',
'properties.bootstrap.servers' = 'localhost:9092',
'key.format' = 'raw',
'key.fields' = 'id',
'value.format' = 'json',
'value.fields-include' = 'ALL'
);
SELECT
t.id,
t.total * c.eur_rate AS total_eur,
t.total,
c.currency_code,
t.transaction_time
FROM transactions t
JOIN currency_rates FOR SYSTEM_TIME AS OF t.transaction_time AS c
ON t.currency_code = c.currency_code;
4. Lookup Joins (dimension table join)
The JDBC connector can serve as a lookup source (also known as a dimension table) in a temporal join; currently only synchronous lookup mode is supported. The lookup cache is disabled by default; you can enable it by setting the lookup.cache.max-rows and lookup.cache.ttl options. Its main purpose is to improve the performance of temporal joins against the JDBC connector. With the cache disabled, every request goes to the external database. When it is enabled, each process (i.e., each TaskManager) maintains its own cache: Flink checks the cache first and only queries the external database on a miss, updating the cache with the returned rows. The oldest rows in the cache are expired once it reaches lookup.cache.max-rows, or once a row exceeds lookup.cache.ttl. Cached records may therefore be stale; setting lookup.cache.ttl to a smaller value refreshes data more often, but may increase the number of requests sent to the database, so balance throughput against correctness.
CREATE TABLE user_log (
user_id STRING
,item_id STRING
,category_id STRING
,behavior STRING
,ts TIMESTAMP(3)
,process_time as proctime()
, WATERMARK FOR ts AS ts - INTERVAL '5' SECOND
) WITH (
'connector' = 'kafka'
,'topic' = 'user_behavior'
,'properties.bootstrap.servers' = 'localhost:9092'
,'properties.group.id' = 'user_log'
,'scan.startup.mode' = 'group-offsets'
,'format' = 'json'
);
CREATE TEMPORARY TABLE mysql_behavior_conf (
id int
,code STRING
,map_val STRING
,update_time TIMESTAMP(3)
-- ,primary key (id) not enforced
-- ,WATERMARK FOR update_time AS update_time - INTERVAL '5' SECOND
) WITH (
'connector' = 'jdbc'
,'url' = 'jdbc:mysql://localhost:3306/venn'
,'table-name' = 'lookup_join_config'
,'username' = 'root'
,'password' = '123456'
,'lookup.cache.max-rows' = '1000'
,'lookup.cache.ttl' = '1 minute' -- cache TTL; rows are evicted after this interval even while still being accessed
);
SELECT a.user_id, a.item_id, a.category_id, a.behavior, c.map_val, a.ts
FROM user_log a
left join mysql_behavior_conf FOR SYSTEM_TIME AS OF a.process_time AS c
ON a.behavior = c.code
where a.behavior is not null;
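For reference, the MySQL table backing mysql_behavior_conf might look like the following. This is a hypothetical sketch: the column types are assumptions inferred from the Flink DDL above, not taken from the original text.

```sql
-- Hypothetical MySQL-side definition of the table referenced
-- as 'lookup_join_config' in the Flink DDL above.
CREATE TABLE lookup_join_config (
    id INT PRIMARY KEY,
    code VARCHAR(64),         -- join key matched against user_log.behavior
    map_val VARCHAR(255),     -- mapped value returned by the lookup
    update_time TIMESTAMP(3)  -- fractional-second precision to match the Flink schema
);
```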
5. Lateral Table Join
A lateral table join is roughly equivalent to SQL Server's CROSS APPLY, but is functionally more powerful.
1. Joining a table with a table
CREATE TABLE People (
id INT,
city STRING,
state STRING,
arrival_time TIMESTAMP(3),
WATERMARK FOR arrival_time AS arrival_time - INTERVAL '1' MINUTE
) WITH (
'connector' = 'kafka'
);
CREATE TEMPORARY VIEW CurrentPopulation AS
SELECT
city,
state,
COUNT(*) as population
FROM (
SELECT
city,
state,
ROW_NUMBER() OVER (PARTITION BY id ORDER BY arrival_time DESC) AS rownum
FROM People
)
WHERE rownum = 1
GROUP BY city, state;
SELECT
state,
city,
population
FROM
(SELECT DISTINCT state FROM CurrentPopulation) States,
LATERAL (
SELECT city, population
FROM CurrentPopulation
WHERE state = States.state
ORDER BY population DESC
LIMIT 2
);
2. Joining a table with a table function
SELECT
data, name, age
FROM
userTab,
LATERAL TABLE(splitTVF(data)) AS T(name, age)
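The query above assumes a user-defined table function splitTVF has already been registered. A hedged sketch of the registration DDL follows; the class name is hypothetical, standing in for an implementation (in Java/Scala, for example) that splits a delimited string into name and age fields:

```sql
-- 'com.example.udf.SplitTVF' is a hypothetical implementation class
-- that would emit (name STRING, age INT) rows parsed from the input
-- string, e.g. 'alice#18'.
CREATE TEMPORARY FUNCTION splitTVF AS 'com.example.udf.SplitTVF';
```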