Data Warehouse Series | Application and Implementation of Flink Windows

**Introduction:** This article is compiled from the Apache Flink live-streaming series, presented by Zhang Jun, Apache Flink Contributor and head of big-data platform R&D at OPPO. It covers:

1. Overall approach and learning path
2. Application scenarios and programming model
3. Workflow and implementation mechanisms

Author | Zhang Jun (head of big-data platform R&D, OPPO)

Compiled by | Zhu Shang (Flink community volunteer)

Proofread by | Zou Zhiye (Flink community volunteer)

Tips: follow the link below for more live videos in the Data Warehouse series.

**Data Warehouse series live streams:**
https://ververica.cn/developers/flink-training-course-data-warehouse/

## Overall Approach and Learning Path

![640 1.png](https://static001.geekbang.org/infoq/d6/d6f34297440e5c72a45a9997ace01836.png)

When we encounter a new technology, how should we learn it and put it to use? In my view, the learning path splits into two halves: application and implementation. Start from the application, then go deeper into the implementation.

The application side has three parts. First, understand the usage scenarios, for example the typical scenarios for windows. Then move on to the programming interfaces, and finally the abstract concepts, since any framework's programming model is made up of its interfaces and abstractions. Reading the documentation is a good way to get familiar with all three. Once we have a basic grasp of the application side, we can study the implementation by reading the code.

The implementation side also has three stages. Start with the workflow, which you can learn by drilling down from the API layer. Next comes the overall design: frameworks that manage to build a mature ecosystem usually have distinctive design patterns that give them good extensibility. Last come the data structures and algorithms: to process massive data at high performance, these must have something special, and they are worth studying in depth.

That is roughly the learning path. Understanding the implementation also feeds back into the application: the doubts that arise when you first encounter a concept are often resolved once you understand how it is implemented.

### Why care about the implementation

An example:

![640 2.png](https://static001.geekbang.org/infoq/36/36fd64e984186426fdcdcc1bc4963c6b.png)

Looking at this example, we may have some questions:

- Why doesn't the ReduceFunction have to compute the aggregate value for every key itself?
- When the key cardinality is very large, how is the window computation for each key triggered efficiently?
- Where are the intermediate results of window computation stored, and when are they cleaned up?
- How does window computation tolerate late data?

Once you understand the implementation and come back to the application side, these things tend to click into place.

## Application Scenarios and Programming Model

### Typical architectures for a real-time data warehouse

![640 3.png](https://static001.geekbang.org/infoq/a6/a6231a5f944503286f311d01e931d858.png)

**■ The first and simplest architecture:** Kafka data in the ODS layer is processed by Flink ETL and written to Kafka in the DW layer, then aggregated by Flink and written to MySQL in the ADS layer to serve real-time reports.

**Drawback:** since MySQL can hold only a limited amount of data, the aggregation time granularity cannot be too fine and the dimension combinations cannot be too many.

**■ The second architecture** introduces an OLAP engine on top of the first; instead of using Flink for aggregation, it relies on Druid's rollup.

**Drawback:** Druid is a storage and query engine, not a compute engine. With huge data volumes, say tens or hundreds of billions of records per day, the ingestion pressure on Druid grows sharply.

**■ The third architecture** builds on the second: Flink performs the aggregation and writes to Kafka, and the results finally land in Druid.

**Drawback:** when the window granularity is long, the results are emitted with noticeable delay.

**■ The fourth architecture** builds on the third by combining Flink aggregation with Druid rollup: Flink does light pre-aggregation and Druid does the rollup summarization. The benefit is that Druid can see Flink's aggregation results in real time.

### Window application scenarios

![640 4.png](https://static001.geekbang.org/infoq/a9/a9f46651208fd9f04871aea366a7683d.png)

**■ Aggregation statistics:** read data from Kafka, run 1-minute or 5-minute aggregations along different dimensions, and write the results to MySQL or Druid.

**■ Record merging:** merge multiple Kafka sources within a window and write the results to ES. For example, with per-user behavior data, merging each user's events reduces the volume written downstream and lowers the write pressure on ES.

**■ Dual-stream join:** for two-stream joins, a full join is prohibitively expensive, so joins are instead performed within a window.

### Window abstractions

![640 5.png](https://static001.geekbang.org/infoq/c7/c7b9f09c4e8fbe1eb2160467708c53ab.png)

**■ TimestampAssigner:** the timestamp assigner. When using EventTime semantics, it tells the Flink framework which field of an element is the event time, for use in later window computation.

**■ KeySelector:** the key selector, which tells the Flink framework which dimensions to aggregate on.

**■ WindowAssigner:** the window assigner, which determines which windows each piece of data is assigned to.

**■ State:** the state that stores the elements in a window; if an AggregateFunction is used, it stores the intermediate result of incremental aggregation instead.

**■ AggregateFunction (optional):** the incremental aggregate function, used for incremental window computation to reduce the storage pressure on the window's State.

**■ Trigger:** the trigger, which determines when the window computation fires.

**■ Evictor (optional):** the evictor, which filters out data matching eviction conditions before (or after) the window function runs.

**■ WindowFunction:** the window function, which computes over the data in the window.

**■ Collector:** the collector, which emits the window's computation results downstream.

The parts shown in red in the figure above are all customizable modules; by combining custom implementations of them we can build advanced window applications. Flink also ships built-in implementations that cover simple use cases.

### Window programming interface

```
stream
  .assignTimestampsAndWatermarks(…)
```
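The programming-interface snippet above is truncated in the source. To make the abstractions concrete anyway, here is a minimal Python sketch, not Flink's actual implementation and with all names ours, of how they fit together: a tumbling event-time WindowAssigner, per-(key, window) State updated incrementally in AggregateFunction style, and an event-time Trigger that fires once the watermark passes a window's end.

```python
# Minimal sketch (not Flink's code) of the window abstractions working together.
# SIZE_MS, window_of, on_element, on_watermark are all illustrative names.
from collections import defaultdict

SIZE_MS = 5_000  # hypothetical 5-second tumbling window

def window_of(ts_ms: int) -> tuple[int, int]:
    """WindowAssigner: map a timestamp to its tumbling window [start, end)."""
    start = ts_ms - (ts_ms % SIZE_MS)  # align start to a window boundary
    return (start, start + SIZE_MS)

# State: (key, window) -> running sum. The sum is maintained incrementally,
# so the full element list never has to be stored -- this is exactly the
# storage saving an AggregateFunction provides.
state: dict = defaultdict(int)

def on_element(key: str, ts_ms: int, value: int) -> None:
    """Route an element to its window and update the incremental aggregate."""
    state[(key, window_of(ts_ms))] += value

def on_watermark(watermark_ms: int) -> list:
    """Trigger: emit and clean up every window whose end <= watermark."""
    fired = [(k, w, acc) for (k, w), acc in state.items() if w[1] <= watermark_ms]
    for k, w, _ in fired:
        del state[(k, w)]  # clear the window's state once it has been emitted
    return fired

on_element("user_a", 1_000, 3)
on_element("user_a", 4_000, 2)
on_element("user_b", 6_000, 7)
results = on_watermark(5_000)  # fires the [0, 5000) window for user_a only
```

In real Flink the same pipeline would look roughly like `stream.keyBy(...).window(TumblingEventTimeWindows.of(Time.seconds(5))).aggregate(...)`, with the state management, triggering, and cleanup handled by the framework.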