Flink ETL

I. Regular Joins (two-stream joins)

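This kind of join has to keep state for both streams, and that state is retained indefinitely and never cleaned up. Every record on either side is visible to the whole of the other stream, so all of it has to stay in state. Because state cannot be allowed to grow without bound, this join is only suitable for bounded streams or when combined with a state TTL. Its syntax looks much like SQL for offline batch processing.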

LEFT JOIN, RIGHT JOIN, FULL JOIN, and INNER JOIN are supported.

CREATE TABLE NOC (
  agent_id STRING,
  codename STRING
)
WITH (
  'connector' = 'kafka'
);
 
CREATE TABLE RealNames (
  agent_id STRING,
  name     STRING
)
WITH (
  'connector' = 'kafka'
);
 
SELECT
    name,
    codename
FROM NOC
INNER JOIN RealNames ON NOC.agent_id = RealNames.agent_id;
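
The state TTL mentioned above can be set for the session. As a minimal sketch, assuming the job is submitted through the Flink SQL client and that one hour of retention is acceptable for the business logic (the value is only an illustration), idle join state can be expired with the `table.exec.state.ttl` option:

```sql
-- Assumption: state that has not been updated for 1 hour may be dropped.
-- A row arriving after its match has expired will no longer join.
SET 'table.exec.state.ttl' = '1 h';

SELECT
    name,
    codename
FROM NOC
INNER JOIN RealNames ON NOC.agent_id = RealNames.agent_id;
```

With a TTL in place the result is no longer guaranteed to match the batch result, so the value should be longer than the largest expected delay between matching records.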

II. Interval Joins

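An interval join adds a time-window constraint: two records join only if they share the same join key and the timestamp of one falls within a given range around the timestamp of the other. Because of this constraint, records that have fallen outside the time range can be cleaned up, so the full state no longer has to be kept. Interval joins support both processing time and event time. With processing time, Flink uses the system clock to delimit the window and clean up the related state. With event time, it relies on the watermark mechanism to delimit the window and clean up state.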

CREATE TABLE orders (
  id INT,
  order_time AS TIMESTAMPADD(DAY, CAST(FLOOR(RAND()*(1-5+1)+5)*(-1) AS INT), CURRENT_TIMESTAMP)
)
WITH (
  'connector' = 'kafka'
);
 
 
CREATE TABLE shipments (
  id INT,
  order_id INT,
  shipment_time AS TIMESTAMPADD(DAY, CAST(FLOOR(RAND()*(1-5+1)) AS INT), CURRENT_TIMESTAMP)
)
WITH (
 'connector' = 'kafka'
);
 
SELECT
  o.id AS order_id,
  o.order_time,
  s.shipment_time,
  TIMESTAMPDIFF(DAY,o.order_time,s.shipment_time) AS day_diff
FROM orders o
JOIN shipments s ON o.id = s.order_id
WHERE
    o.order_time BETWEEN s.shipment_time - INTERVAL '3' DAY AND s.shipment_time;
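
For event time, the interval condition is evaluated on columns declared as time attributes through a WATERMARK clause. The following is only a sketch: the table names `orders_et` and `shipments_et`, the topics, the broker address, the JSON format, and the 5-second watermark delay are all assumptions made for illustration.

```sql
-- Hypothetical tables with event-time attributes; connector settings are placeholders.
CREATE TABLE orders_et (
  id INT,
  order_time TIMESTAMP(3),
  -- Declares order_time as the event-time attribute, tolerating 5 s of disorder.
  WATERMARK FOR order_time AS order_time - INTERVAL '5' SECOND
) WITH (
  'connector' = 'kafka',
  'topic' = 'orders',
  'properties.bootstrap.servers' = 'localhost:9092',
  'scan.startup.mode' = 'earliest-offset',
  'format' = 'json'
);

CREATE TABLE shipments_et (
  id INT,
  order_id INT,
  shipment_time TIMESTAMP(3),
  WATERMARK FOR shipment_time AS shipment_time - INTERVAL '5' SECOND
) WITH (
  'connector' = 'kafka',
  'topic' = 'shipments',
  'properties.bootstrap.servers' = 'localhost:9092',
  'scan.startup.mode' = 'earliest-offset',
  'format' = 'json'
);

-- An order matches a shipment made at most 3 days after it.
-- State older than the interval can be dropped once the watermark has passed it.
SELECT
  o.id AS order_id,
  o.order_time,
  s.shipment_time
FROM orders_et o
JOIN shipments_et s ON o.id = s.order_id
WHERE
    o.order_time BETWEEN s.shipment_time - INTERVAL '3' DAY AND s.shipment_time;
```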

III. Temporal Table Join

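In an interval join, both input streams must have a lower time bound, and records older than that bound can no longer be accessed. This does not suit many dimension-table joins, because a dimension table often has no time bound at all. Flink provides the Temporal Table Join for this case. A Temporal Table Join resembles a hash join in that it splits the inputs into a Build Table and a Probe Table. The former is usually the changelog of a dimension table and the latter is usually the business data stream; typically the latter carries far more data than the former. In a Temporal Table Join, the Build Table is a time-versioned view over an append-only stream, which is why it is also called a Temporal Table. A Temporal Table requires a primary key and a versioning field (usually the event-time field) so that it can reflect the content of a record at different points in time.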

CREATE TEMPORARY TABLE currency_rates (
  `currency_code` STRING,
  `eur_rate` DECIMAL(6,4),
  `rate_time` TIMESTAMP(3),
  WATERMARK FOR `rate_time` AS rate_time - INTERVAL '15' SECONDS,
  PRIMARY KEY (currency_code) NOT ENFORCED
) WITH (
  'connector' = 'upsert-kafka',
  'topic' = 'currency_rates',
  'properties.bootstrap.servers' = 'localhost:9092',
  'key.format' = 'raw',
  'value.format' = 'json'
);
 
CREATE TEMPORARY TABLE transactions (
  `id` STRING,
  `currency_code` STRING,
  `total` DECIMAL(10,2),
  `transaction_time` TIMESTAMP(3),
  WATERMARK FOR `transaction_time` AS transaction_time - INTERVAL '30' SECONDS
) WITH (
  'connector' = 'kafka',
  'topic' = 'transactions',
  'properties.bootstrap.servers' = 'localhost:9092',
  'key.format' = 'raw',
  'key.fields' = 'id',
  'value.format' = 'json',
  'value.fields-include' = 'ALL'
);
 
SELECT
  t.id,
  t.total * c.eur_rate AS total_eur,
  t.total,
  c.currency_code,
  t.transaction_time
FROM transactions t
JOIN currency_rates FOR SYSTEM_TIME AS OF t.transaction_time AS c
ON t.currency_code = c.currency_code;
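
When the rates arrive as a plain append-only topic rather than an upsert changelog, a versioned view can be derived with a deduplication query. This is a sketch under assumptions: the table `currency_rates_history`, its topic, and the view name `versioned_rates` are hypothetical and are not part of the example above; only `transactions` is reused from it.

```sql
-- Hypothetical append-only history of rate updates.
CREATE TEMPORARY TABLE currency_rates_history (
  `currency_code` STRING,
  `eur_rate` DECIMAL(6,4),
  `rate_time` TIMESTAMP(3),
  WATERMARK FOR `rate_time` AS rate_time - INTERVAL '15' SECOND
) WITH (
  'connector' = 'kafka',
  'topic' = 'currency_rates_history',
  'properties.bootstrap.servers' = 'localhost:9092',
  'scan.startup.mode' = 'earliest-offset',
  'format' = 'json'
);

-- Keeping the latest row per currency_code, ordered by the event-time attribute,
-- lets the planner infer currency_code as the primary key and rate_time as the version.
CREATE TEMPORARY VIEW versioned_rates AS
SELECT currency_code, eur_rate, rate_time
FROM (
  SELECT *,
    ROW_NUMBER() OVER (PARTITION BY currency_code ORDER BY rate_time DESC) AS rownum
  FROM currency_rates_history
)
WHERE rownum = 1;

-- Each transaction is joined with the rate that was valid at transaction_time.
SELECT
  t.id,
  t.total * c.eur_rate AS total_eur,
  t.transaction_time
FROM transactions t
JOIN versioned_rates FOR SYSTEM_TIME AS OF t.transaction_time AS c
ON t.currency_code = c.currency_code;
```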

IV. Lookup Joins (dimension table joins)

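The JDBC connector can be used in a temporal join as a lookup source (also called a dimension table); currently only the synchronous lookup mode is supported. By default the lookup cache is disabled, so every request is sent to the external database. You can enable it by setting lookup.cache.max-rows and lookup.cache.ttl. The main purpose of the lookup cache is to improve the performance of temporal joins against the JDBC connector. When the cache is enabled, each process (that is, each TaskManager) maintains its own cache. Flink looks in the cache first, sends a request to the external database only on a cache miss, and updates the cache with the rows returned. When the cache reaches its maximum size lookup.cache.max-rows, or when a row exceeds its maximum time to live lookup.cache.ttl, the oldest rows in the cache are marked as expired. Cached records may not be the latest; setting lookup.cache.ttl to a smaller value gives fresher data but may increase the number of requests sent to the database, so throughput has to be balanced against correctness.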

CREATE TABLE user_log (
  user_id STRING
  ,item_id STRING
  ,category_id STRING
  ,behavior STRING
  ,ts TIMESTAMP(3)
  ,process_time as proctime()
  , WATERMARK FOR ts AS ts - INTERVAL '5' SECOND
) WITH (
  'connector' = 'kafka'
  ,'topic' = 'user_behavior'
  ,'properties.bootstrap.servers' = 'localhost:9092'
  ,'properties.group.id' = 'user_log'
  ,'scan.startup.mode' = 'group-offsets'
  ,'format' = 'json'
);
 
CREATE TEMPORARY TABLE mysql_behavior_conf (
   id int
  ,code STRING
  ,map_val STRING
  ,update_time TIMESTAMP(3)
--   ,primary key (id) not enforced
--   ,WATERMARK FOR update_time AS update_time - INTERVAL '5' SECOND
) WITH (
   'connector' = 'jdbc'
   ,'url' = 'jdbc:mysql://localhost:3306/venn'
   ,'table-name' = 'lookup_join_config'
   ,'username' = 'root'
   ,'password' = '123456'
   ,'lookup.cache.max-rows' = '1000'
   ,'lookup.cache.ttl' = '1 minute' -- cache TTL; a row is evicted after this time even if it is still being accessed
);
 
SELECT a.user_id, a.item_id, a.category_id, a.behavior, c.map_val, a.ts
FROM user_log a
  left join mysql_behavior_conf FOR SYSTEM_TIME AS OF a.process_time AS c
  ON a.behavior = c.code
where a.behavior is not null;
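
Newer releases of the JDBC connector expose the same cache through a different set of options. The sketch below assumes a connector version that provides `lookup.cache` and the `lookup.partial-cache.*` options (to my knowledge, Flink 1.16 and later, where the two options used above are deprecated); the table name `mysql_behavior_conf_v2` is hypothetical, and the values simply mirror the example above.

```sql
-- Hypothetical variant of mysql_behavior_conf using the newer cache options.
CREATE TEMPORARY TABLE mysql_behavior_conf_v2 (
   id INT
  ,code STRING
  ,map_val STRING
  ,update_time TIMESTAMP(3)
) WITH (
   'connector' = 'jdbc'
  ,'url' = 'jdbc:mysql://localhost:3306/venn'
  ,'table-name' = 'lookup_join_config'
  ,'username' = 'root'
  ,'password' = '123456'
  ,'lookup.cache' = 'PARTIAL'                           -- cache looked-up rows per TaskManager
  ,'lookup.partial-cache.max-rows' = '1000'             -- evict the oldest rows beyond this size
  ,'lookup.partial-cache.expire-after-write' = '1 min'  -- expire a row 1 minute after it was cached
  ,'lookup.max-retries' = '3'                           -- retries when a database lookup fails
);
```

A shorter expiry gives fresher lookups at the cost of more database requests, which is the same throughput-versus-correctness trade-off described above.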

V. Lateral Table Join

A Lateral Table Join is roughly equivalent to CROSS APPLY in SQL Server, but it is more powerful.

1. Joining a table with a table

CREATE TABLE People (
    id           INT,
    city         STRING,
    state        STRING,
    arrival_time TIMESTAMP(3),
    WATERMARK FOR arrival_time AS arrival_time - INTERVAL '1' MINUTE
) WITH (
    'connector' = 'kafka'
);
 
CREATE TEMPORARY VIEW CurrentPopulation AS
SELECT
    city,
    state,
    COUNT(*) as population
FROM (
    SELECT
        city,
        state,
        ROW_NUMBER() OVER (PARTITION BY id ORDER BY arrival_time DESC) AS rownum
    FROM People
)
WHERE rownum = 1
GROUP BY city, state;
 
SELECT
    state,
    city,
    population
FROM
    (SELECT DISTINCT state FROM CurrentPopulation) States,
    LATERAL (
        SELECT city, population
        FROM CurrentPopulation
        WHERE state = States.state
        ORDER BY population DESC
        LIMIT 2
);

2. Joining a table with a table function

SELECT
    data, name, age
FROM
    userTab,
    LATERAL TABLE(splitTVF(data)) AS T(name, age);
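
The query above relies on `splitTVF` already being registered as a table function. As a hedged sketch, assuming a user-defined Java TableFunction is available on the job classpath (the class name `com.example.udf.SplitTVF` is hypothetical, as is the assumption that it emits `(name, age)` rows parsed from `data`), registration and the two join forms might look like this:

```sql
-- Assumption: com.example.udf.SplitTVF is a hypothetical Java TableFunction
-- on the classpath that emits (name, age) rows parsed from `data`.
CREATE TEMPORARY FUNCTION splitTVF AS 'com.example.udf.SplitTVF' LANGUAGE JAVA;

-- Inner form: rows of userTab for which the function emits nothing are dropped.
SELECT data, name, age
FROM userTab, LATERAL TABLE(splitTVF(data)) AS T(name, age);

-- Left outer form: such rows are kept, with name and age set to NULL.
SELECT data, name, age
FROM userTab
LEFT JOIN LATERAL TABLE(splitTVF(data)) AS T(name, age) ON TRUE;
```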
