Deep Dive | Cracking Real-Time Data Aggregation

{"type":"doc","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"實時數據分析一直是個熱門話題,需要實時數據分析的場景也越來越多,如金融支付中的風控,基礎運維中的監控告警,實時大盤之外,AI模型也需要消費更爲實時的聚合結果來達到很好的預測效果。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"實時數據分析如果講的更加具體些,基本上會牽涉到數據聚合分析。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"數據聚合分析在實時場景下,面臨的新問題是什麼,要解決的很好,大致有哪些方面的思路和框架可供使用,本文嘗試做一下分析和釐清。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"在實時數據分析場景下,最大的制約因素是時間,時間一變動,所要處理的源頭數據會發生改變,處理的結果自然也會因此而不同。在此背景下,引申出來的三大子問題就是:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"通過何種機制觀察到變化的數據 
"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"通過何種方式能最有效的處理變化數據,將結果併入到原先的聚合分析結果中"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"分析後的數據如何讓使用方及時感知並獲取"}]}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"可以說,"},{"type":"text","marks":[{"type":"strong"}],"text":"數據新鮮性和處理及時性"},{"type":"text","text":"是實時數據處理中的一對基本矛盾。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/7c\/7c4f2c7a2d59d363c82dcd3d05b54964.png","alt":"圖片","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"另外實時是一個相對的概念,在不同場景下對應的時延也差異很大,借用Uber給出的定義,大體來區分一下實時處理所能接受的時延範圍。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/1a\/1a43575116ca0752abc36c20734399bb.png","alt":"圖片","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"一、數據新鮮性"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"爲簡單起見,把數據分成兩大類,一類是關鍵的交易性數據,以存儲在關係型數據庫爲主,另一類是日誌型數據,以存儲在日誌型消息隊列(如kafka)爲主。"}]},{"type":"paragraph","attrs":{"indent":0,"
number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"第二類數據,消費端到感知到最新的變化數據,採用內嵌的pull機制,比較容易實現,同時日誌類數據,絕大部分是append-only,不涉及到刪改,無論是採用ClickHouse還是使用TimeScaleDB都可以達到很好的實時聚合效果,這裏就不再贅述。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"針對第一類存儲在數據庫中的數據,要想實時感知到變化的數據(這裏的變化包含有增\/刪\/改三種操作類型),有兩種打法。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"打法一:"},{"type":"text","text":"基於時間戳方式的數據同步,假設在表設計時,每張表中都有datachange_lasttime字段表示最近一次操作發生的時間,同步程序會定期掃描目標表,把datachange_lasttime不小於上次同步時間的數據拉出進行同步。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"這種處理方式的主要缺點是無法感知到數據刪除操作,爲了規避這個不足,可以採用邏輯刪除的表設計方式。數據刪除並不是採取物理刪除,只是修改表示數據已經刪除的列中的值標記爲刪除或無效。使用這種方法雖然讓同步程序可以感知到刪除操作,但額外的成本是讓應用程序在刪除和查詢時,操作語句和邏輯都變得複雜,降低了數據庫的可維護性。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"打法一的變種是基於觸發器方式,把變化過的數據推送給同步程序。這種方式的成本,一方面是需要設計實現觸發器,另一方面是了降低了insert\/update\/delete操作的性能, 提升了時延,降低了吞吐量。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"打法二:"},{"type":"text","text":"基於CDC(Change Data 
Capture). This approach is the least intrusive to the database design and has the lowest performance impact, while enjoying rich open-source support: Canal works well with MySQL, and Debezium supports PostgreSQL. Using such components, the changed data is written into Kafka for downstream real-time analysis to process.

## 2. Data Joins

Once fresh data is obtained, the first common step is data enrichment, which naturally involves joins across multiple tables. Here lies a pain point: the rows to join against are not necessarily present in the incremental data themselves. For example, when the status of a flight order changes, we need the flight-segment information for that order; orders and segments are maintained in two different tables, so joining only within the incremental data may fail to find the segment rows. This is a classic case of joining real-time data with historical data.

An obvious solution is to join incoming real-time data against the historical rows in the database as they arrive. But this increases database traffic and load, and the join latency grows considerably. To make historical data quickly reachable, the natural next step is to add a cache. A cache does cut the join latency, but it easily introduces inconsistency between the cached data and the database, and cache capacity is hard to estimate, raising costs.

Is there another trick worth trying? There has to be.

The data can be made complete on the database side first: pivot rows into columns to form a wide table that is self-contained, then use the CDC mechanism to let the outside world perceive changes to the wide table in real time.

## 3. Computation Timeliness

With real-time change detection and data completeness solved, we reach the most critical step: aggregation analysis. To balance result accuracy against processing timeliness, there are two broad methods: full computation and incremental computation.

### 3.1 Full Computation (1m
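To make the full-versus-incremental distinction concrete, here is a small illustrative sketch (not from the original article): both paths maintain a per-user order-amount SUM, one by rescanning every row on each run, the other by folding only CDC-style change events (insert/update/delete) into the previous result. The event shape used here is a simplified stand-in, not the actual wire format of Canal or Debezium.

```python
from collections import defaultdict

def full_aggregate(rows):
    """Full computation: rescan every row on each run."""
    totals = defaultdict(float)
    for row in rows:
        totals[row["user"]] += row["amount"]
    return dict(totals)

def apply_event(totals, event):
    """Incremental computation: fold one change event into prior totals."""
    op, row = event["op"], event["row"]
    if op == "insert":
        totals[row["user"]] = totals.get(row["user"], 0.0) + row["amount"]
    elif op == "delete":
        totals[row["user"]] -= row["amount"]
    elif op == "update":
        old = event["old_row"]
        totals[old["user"]] -= old["amount"]
        totals[row["user"]] = totals.get(row["user"], 0.0) + row["amount"]
    return totals

rows = [{"user": "a", "amount": 10.0}, {"user": "b", "amount": 5.0}]
totals = full_aggregate(rows)                 # {'a': 10.0, 'b': 5.0}

# An update arrives: user a's 10.0 order becomes 12.0.
event = {"op": "update",
         "old_row": {"user": "a", "amount": 10.0},
         "row": {"user": "a", "amount": 12.0}}
totals = apply_event(totals, event)

# The incremental result matches a full recomputation over the new rows.
rows[0] = {"user": "a", "amount": 12.0}
assert totals == full_aggregate(rows)
print(totals)  # {'a': 12.0, 'b': 5.0}
```

The trade-off the sketch surfaces: full recomputation touches every row but is trivially correct, while the incremental path does O(1) work per event but must handle each operation type, which is exactly why detecting deletes and updates upstream (Section 1) matters.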