Flink in Practice at Palfish: How We Ensure Data Accuracy

{"type":"doc","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"隨着伴魚業務的快速發展,離線數據日漸無法滿足運營同學的需求,數據的實時性要求越來越高。之前的實時任務是通過實時同步至 TiDB 的數據,利用 TiDB 進行微批計算。隨着越來越多的實時場景湧現出來,TiDB 已經無法滿足實時數據計算場景,計算和查詢都在一套集羣中,導致集羣壓力過大,可能影響正常的業務使用。根據業務形態搭建實時數倉已經是必要的建設了。伴魚實時數倉主要以 Flink 爲計算引擎,搭配 Redis ,Kafka 等分佈式數據存儲介質,以及 ClickHouse 等多維分析引擎。"}]},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"伴魚實時作業應用場景"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"基於平臺提供了穩定的環境(統一調度方式,統一管理,統一監控等)。我們構建了一些實時服務,通過服務化的方式去支持各個業務方。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"實時數倉:數據同步,業務數據清洗去重,相關主題業務數據關聯拼接,以及數據聚合提煉等,逐步構建多維度,多覆蓋面的實時數倉體系。"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"實時特徵平臺:實時數據提取,計算,以及特徵回寫。"}]}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"簡單介紹下:目前數據在伴魚內的流動架構圖:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/b5\/b5fa8c87427a1d8d66fa26bbf949e5be.png","alt":"fink-practice","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"下面主要介伴魚實時數倉的建設體系"}]},{"type":"paragrap
h","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"ODS 層數據平臺統一進行數據解析處理,寫入 Kafka 。"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"DWD 比較關鍵,會將來自同一業務域數據表對應的多條數據流,按最細粒度關聯成一條完整的日誌,並關聯相應維度,描述一個完整事實。"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"DWS 將每個小業務域數據按相同維度進行聚合,寫入 TiDB 和 ClickHouse 。在 TiDB ,ClickHouse ,再次進行關聯,形成跨業務域聚合數據。供業務和分析人員使用。"}]}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"如圖:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/3e\/3e2d06601f6580b4afe9e41c87d66d86.png","alt":"fink-practice","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"DWD 層複雜場景數據處理方案"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"數據從 ODS 層採集後,數據的處理和加工主要集中在 DWD 層,我們的場景中面臨了很多複雜的加工邏輯,本章重點對 DWD 層數據處理方案進行詳細的闡述。"}]},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"1. 
數據的去重"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"由於伴魚內部業務大面積使用 MongoDB ,MongoDB 本身存儲的是半結構化的數據,它不具有固定的 schema 。在同步 Mongo 的 oplog 時,實時數倉的 dwd 層並不需要所有字段參與,我們只會抽取日常使用率相對較高的字段進行表建設。這就可能由於不相干的數據發生變化,我們也會收到一條相同的數據記錄。例如在對用戶訂單金額進行分類分析時,如果用戶訂單地址發生了變化,我們同樣也會收到一條業務日誌,因爲我們並不關注地址維度,所以這條日誌是無用的重複數據。這種未經處理的數據是不方便BI工程師直接使用的,甚至直接影響計算結果的準確性。所以我們針對這種非 Append-only 數據,我們進行了定製化的日誌格式。在經由平臺方解析後的 binlog 或者 oplog ,我們仍然定製化加入了一些元數據信息,用來讓 BI 工程師更好的理解這條數據進入實時計算引擎時,對應的時間點到底發生了什麼事情,這件實事到底是否參與計算。所以,我們加入了 metadata_table (原始表名), metadata_changes(修改字段名) , metadata_op_type (DML類型) ,metadata_commit_ts (修改時間戳)等字段,輔助我們對業務上認爲的重複數據,做更好的過濾。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"如圖:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/89\/891705c17ab66611d6b7979253a7327c.jpeg","alt":"fink-practice","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"2. 
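The metadata-based filtering described above boils down to a simple predicate: drop an update record when none of the fields a job cares about appear in `metadata_changes`. A minimal sketch (field names and values here are illustrative, not the actual Palfish schema):

```python
def is_relevant(record, fields_of_interest):
    """Keep a changelog record only if it touches a field this job uses.

    `record` is a parsed oplog/binlog entry carrying the metadata fields
    described above (metadata_op_type, metadata_changes, ...).
    Inserts and deletes always pass; updates pass only when at least one
    changed field is relevant to this job.
    """
    if record["metadata_op_type"] in ("insert", "delete"):
        return True
    return bool(set(record["metadata_changes"]) & set(fields_of_interest))


# Example: a job analyzing order amounts ignores address-only updates.
update_addr = {"metadata_table": "orders", "metadata_op_type": "update",
               "metadata_changes": ["address"], "metadata_commit_ts": 1624940000}
update_amt = {"metadata_table": "orders", "metadata_op_type": "update",
              "metadata_changes": ["amount"], "metadata_commit_ts": 1624940001}

assert not is_relevant(update_addr, {"amount", "status"})
assert is_relevant(update_amt, {"amount", "status"})
```

In a Flink job this predicate would typically sit in a filter operator right after deserialization, so duplicates are discarded before any stateful processing.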
### 2. Join scenarios

Real-time computation differs from offline computation in that data is transient: once a record has flowed past, it is hard to find again unless it was deliberately recorded, which hurts job accuracy. This is a pain point of stream processing, and it shows up mainly in multi-stream joins, where matching records accurately is difficult. Below is how multi-stream joins are handled inside Palfish. The commonly used joins are inner join and left join; an inner join can be viewed approximately as a left join plus a where filter.

From the time dimension, joins fall into two cases:

- joining two live data streams;
- joining a live stream with fact data from the past.

#### Joining two live data streams

**Using Redis's in-memory storage and its ability to sustain a large QPS with fast access:**

- First observe the data over some range to gauge how out of order it is along the time dimension, then set the allowed data lateness and the cache duration.
- Palfish's services are fairly stable, so disorder is at most on the order of seconds. We usually pick the higher-volume stream as the main stream and add a short window wait to it (the window need not be long, e.g. 10 s), while the other stream writes its records into a Redis cache (minute-level TTL). By the time the main stream's window fires, the right-hand records are guaranteed to be cached in Redis. This effectively achieves memory sharing across operators within a Flink job. Pros: simple and general; the Flink job maintains no business state, so it stays lightweight and runs stably. Cons: as data volume and the number of jobs grow, the Redis cluster comes under considerable pressure.

As shown:

![fink-practice](https://static001.geekbang.org/infoq/fc/fc84c2b0731aab49ff655541f299b4e3.png)
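The cache-then-wait pattern above can be sketched as follows. This is a toy in-memory model (a dict with TTLs stands in for Redis; with a real Redis client the write would be a `SET key value EX ttl`), not a full Flink job:

```python
import time


class CacheJoin:
    """Toy sketch of the Redis-cache join described above: records from
    the right-hand stream are cached with a TTL, and the main stream
    looks the cache up once its short window wait has elapsed."""

    def __init__(self, ttl_seconds=60):
        self.ttl = ttl_seconds
        self.cache = {}  # join_key -> (record, expiry_timestamp)

    def on_right(self, key, record, now=None):
        if now is None:
            now = time.time()
        self.cache[key] = (record, now + self.ttl)

    def on_left(self, key, record, now=None):
        """Called when the main stream's window fires: return the joined
        pair, or None if the right side never arrived or has expired."""
        if now is None:
            now = time.time()
        hit = self.cache.get(key)
        if hit is None or hit[1] < now:
            return None
        return (record, hit[0])


# The right stream's record arrives first; the main stream joins
# against the cache when its 10 s window fires.
j = CacheJoin(ttl_seconds=60)
j.on_right("order-1", {"teacher": "t42"}, now=0)
assert j.on_left("order-1", {"amount": 99}, now=10) == ({"amount": 99}, {"teacher": "t42"})
assert j.on_left("order-2", {"amount": 50}, now=10) is None
```

The minute-level TTL bounds memory in Redis; anything that arrives later than the TTL simply fails to join, which is why the lateness observed in the first step must fit comfortably inside the cache duration.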
**Inside a Flink job, complete user-state management is available, including state initialization, state updates, state snapshots, and state recovery:**

- Tag leftStream and rightStream differently, combine them with the connect operator, and key (group by) on the join condition; within a process function, cache each group's records in state and emit matches. Downstream then receives exactly the records that joined.
- Alongside state updates, register timers; for example, a timer at midnight each day can clear the state. The exact trigger strategy depends on the business scenario.
- **Pros:** all processing logic runs inside Flink, with no dependency on any external storage system.
- **Cons:** when several streams must be joined, the job's code volume grows and development cost rises; with the data held in Flink state, the job's memory load is high, and at large data volumes checkpoints get big, which affects the job's overall stability.

**The Flink community has recognized the pain of multi-stream joins and provides join types distinct from offline SQL:**

- Register watermarks on leftStream and rightStream (preferably using event time).
- Interval-join the two streams. (In stream-stream joins, a window join can only match records inside corresponding windows; records across windows never match, so we abandoned it. Interval join has no window concept: it uses timestamps directly as the join condition, which is more expressive. Its implementation logic is conceptually simple, resting mainly on TimeBoundedStreamJoin, whose core logic covers record caching, handling of the different join types, and record cleanup, though it is not simple to implement. One record in a stream may need to match several records in the other stream, so stream-stream joins usually need a join condition that bounds the time difference between the two streams.)
- **Pros:** simple to code; state access and updates within the job are handled entirely by Flink's own code, so the state load is smaller than with hand-written state handling.
- **Cons:** these special join types are heavily constrained by the scenario.

As shown:

![fink-practice](https://static001.geekbang.org/infoq/aa/aa8b437f9722defe41fa47703242da96.jpeg)
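The interval-join semantics above — buffer both sides, and match a left record at time t with right records whose timestamps fall in [t - lower, t + upper] — can be modeled roughly as follows. This is a simplified in-memory sketch, not Flink's actual TimeBoundedStreamJoin (which additionally evicts buffered records as watermarks advance):

```python
class IntervalJoin:
    """Toy model of an interval join on keyed streams: a left record at
    time t matches right records with timestamps in [t - lower, t + upper]."""

    def __init__(self, lower, upper):
        self.lower, self.upper = lower, upper
        self.left_buf, self.right_buf = [], []  # (key, ts, payload)

    def _match(self, left_ts, right_ts):
        return left_ts - self.lower <= right_ts <= left_ts + self.upper

    def on_left(self, key, ts, payload):
        self.left_buf.append((key, ts, payload))
        return [(payload, r[2]) for r in self.right_buf
                if r[0] == key and self._match(ts, r[1])]

    def on_right(self, key, ts, payload):
        self.right_buf.append((key, ts, payload))
        return [(l[2], payload) for l in self.left_buf
                if l[0] == key and self._match(l[1], ts)]


ij = IntervalJoin(lower=5, upper=5)
assert ij.on_left("u1", 100, "book") == []            # right side not here yet
assert ij.on_right("u1", 103, "attend") == [("book", "attend")]
assert ij.on_right("u1", 120, "late") == []           # outside the interval
```

Because every buffered record can match multiple records on the other side, the state kept by an interval join is bounded only by the interval width and the watermark, which is exactly why registering sensible watermarks is the first step above.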
**Flink Table & SQL temporal tables (Temporal Table):**

- Starting with Flink 1.7, the concept of a temporal table was introduced. A Temporal Table can simplify and speed up such queries and reduce state usage. It interprets the rows appended to an append-only table, according to a configured primary key and time attribute (e.g. productID and updatedAt above), as a changelog, and provides each key's version of the data as of a specified time.
- When using temporal tables, note the following:

```
A Temporal Table can provide the data as of some point in history.
A Temporal Table tracks versions by time.
A Temporal Table requires a time attribute and a primary key.
A Temporal Table is generally used together with the keyword LATERAL TABLE.
With a ProcessingTime time attribute, only the latest version per key is kept.
With an EventTime time attribute, all versions per key from the last watermark up to the current time are kept.
An append-only table joined against a Temporal Table is still left-table-driven: the key (and a possibly historical time) taken from the left row is used to look up the Temporal Table.
Temporal table joins currently support inner join only.
In a temporal table join, the right-hand Temporal Table returns the latest matching version of the data.
```

For example:

![temporal_join](https://static001.geekbang.org/infoq/cb/cbc1c8f7b389809688f7b570f2280a6d.jpeg)

#### Joining historical data

- First analyze when the historical data expires. In Palfish's scenarios, for example, a user's class-booking event and the matching class-attendance event may be days apart (users book next week's classes in advance). The data's expiry then needs special attention and handling: for the event that happens first, we can compute a precise expiry time. For instance, if the class officially starts three days later, we can cache the booking in Redis for (3+1)*24 h, ensuring the booking record is still warm in memory when the user attends class.
- When the historical data's expiry cannot be determined — for example, in Palfish's scenarios we often need to join an important user action (placing an order) with the user's level and the bound teacher, details that are both common and important — we can only cache such dimensions permanently in Redis for fact data to look up and join against.
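The computed-expiry idea for the earlier event can be sketched as below: derive the cache TTL from when the later event is expected, plus a day of slack, as in the (3+1)*24 h example above (with a real Redis client this TTL would be passed to `SET key value EX ttl`):

```python
DAY = 24 * 3600


def booking_ttl(class_start_ts, booking_ts, slack_days=1):
    """TTL (seconds) for caching a booking record: keep it until the
    class starts, plus `slack_days` of slack, so the booking is still
    warm in the cache when the attendance event arrives."""
    return (class_start_ts - booking_ts) + slack_days * DAY


# A class booked now and starting three days out is cached for 4 days.
assert booking_ttl(class_start_ts=3 * DAY, booking_ts=0) == 4 * DAY
```

Dimensions with no determinable expiry (user level, bound teacher) skip this calculation entirely and are cached without a TTL, as described above.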
### 3. Joins viewed by data shape

By the shape of the join, there are three cases: one-to-one, many-to-one, and many-to-many.

- For one-to-one and many-to-one, we only need to cache the single-record side's stream in Redis or in state.
- For many-to-many joins, we can only connect leftStream and rightStream and cache the data, day by day, in Redis or in job memory. Whichever side a record arrives on, it is first cached, then the already-arrived records on the other side are scanned and the matches are emitted downstream.
- Many-to-many left joins are the most complex case. Again we can only connect the two streams and cache the data in job memory or Redis; an arriving record on either side is first cached, then matched against the other side's already-arrived records and emitted downstream. The complication is that, for the unmatched output, downstream cannot easily tell whether a record truly has no match or merely failed to match because of differences in when records reached the operator. In that case we write the data into TiDB or ClickHouse — OLAP engines that can recompute a day's worth of data quickly — and filter out the records that failed to join only because of operator-arrival timing.
- Note: if you use Flink operator state, you must register timers, or use Flink's state TTL, to clean the state regularly, or the job will OOM. If you use Redis, set expirations on the data or delete it periodically with offline scripts.

## Data processing at the DWS layer

An offline warehouse usually stores coarse-grained cross-domain data at this layer, and Palfish's real-time warehouse stores it the same way, except that joins across business domains are not computed in the Flink engine but pushed down to TiDB or ClickHouse. In Flink we compute only the current domain's aggregate metrics and tag each record with the dimensions it was aggregated by and at what granularity (on the time dimension, for example, we usually aggregate in small units of 5 or 10 minutes). To query the current day's cross-domain data, views predefined in TiDB or ClickHouse first sum each individual domain's data for the day, then join the domains on the dimension tags marked in the data in advance, yielding the cross-domain aggregate metrics.

## Future work

- We will continue to compare Storm, Spark Streaming, Flink and other stacks for their trade-offs in usability and performance. As the Flink ecosystem matures, we will try to put features such as Flink CDC, Flink ML, and Flink CEP to work in our warehouse.
- Flink SQL has been iterating rapidly over its recent releases. With Alibaba's support for the Flink planner, unified batch-stream processing is getting closer to reality; we will try using Flink as the processing engine for the offline warehouse as well, and roll out Flink SQL across the company's data team.
- Continue improving the real-time platform's monitoring of Flink jobs and optimizing resource management.

**References:**

https://ci.apache.org/projects/Flink/Flink-docs-release-1.13/
https://blog.csdn.net/wangpei1949/article/details/103541939

Author: Li Zhen

Original (Chinese): https://tech.ipalfish.com/blog/2021/06/29/flink_practice/ — "Flink 在伴魚的實踐:如何保障數據的準確性"

Source: Palfish tech blog. Copyright belongs to the author; contact the author for permission before commercial reuse, and credit the source for non-commercial reuse.