Note: WatermarkGenerator subsumes the functionality of the earlier AssignerWithPunctuatedWatermarks and\n * AssignerWithPeriodicWatermarks (both now deprecated)\n */\n@Public\npublic interface WatermarkGenerator<T>
Flink EventTime and Watermark
{"type":"doc","content":[{"type":"heading","attrs":{"align":null,"level":1},"content":[{"type":"text","text":"Time Concepts","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Time plays a central role in stream processing. Flink offers three notions of time: EventTime, ProcessingTime, and IngestionTime (as of version 1.13 the documentation no longer mentions IngestionTime). Under the hood there are only two kinds, Processing Time and Event Time, because Ingestion Time is essentially a form of Processing Time. The three are described below (see the figure):","attrs":{}}]},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/fe/fee03206246c4b7846a1d338d74b0ef6.jpeg","alt":null,"title":"","style":[{"key":"width","value":"100%"},{"key":"bordertype","value":"none"}],"href":"","fromPaste":false,"pastePass":false}},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"EventTime","attrs":{}},{"type":"text","text":": the time at which the event was created, i.e. the timestamp the record carries from the moment it is produced.","attrs":{}}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"IngestionTime","attrs":{}},{"type":"text","text":": the time at which the event enters Flink, i.e. the timestamp assigned when it reaches the source operator.","attrs":{}}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"ProcessingTime","attrs":{}},{"type":"text","text":": the local machine time of each operator that executes a window operation.","attrs":{}}]}]}],"attrs":{}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"blockquote","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"The following two blog posts and one paper 
help in understanding these time concepts and are also officially recommended: ","attrs":{}},{"type":"link","attrs":{"href":"https://www.oreilly.com/ideas/the-world-beyond-batch-streaming-101","title":"","type":null},"content":[{"type":"text","text":"https://www.oreilly.com/ideas/the-world-beyond-batch-streaming-101","attrs":{}}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"link","attrs":{"href":"https://www.oreilly.com/ideas/the-world-beyond-batch-streaming-102","title":"","type":null},"content":[{"type":"text","text":"https://www.oreilly.com/ideas/the-world-beyond-batch-streaming-102","attrs":{}}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"link","attrs":{"href":"https://static.googleusercontent.com/media/research.google.com/zh-CN//pubs/archive/43864.pdf","title":"","type":null},"content":[{"type":"text","text":"https://static.googleusercontent.com/media/research.google.com/zh-CN//pubs/archive/43864.pdf","attrs":{}}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"The author's translations: ","attrs":{}},{"type":"link","attrs":{"href":"https://www.jianshu.com/p/2a8422f3a467","title":"","type":null},"content":[{"type":"text","text":"Translation 1","attrs":{}}]},{"type":"text","text":" ","attrs":{}},{"type":"link","attrs":{"href":"https://www.jianshu.com/p/695dd06d811f","title":"","type":null},"content":[{"type":"text","text":"Translation 2","attrs":{}}]},{"type":"text","text":" ","attrs":{}},{"type":"link","attrs":{"href":"https://www.jianshu.com/p/29a51ade587f","title":"","type":null},"content":[{"type":"text","text":"Translation 3","attrs":{}}]}]}],"attrs":{}},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"How does Flink set the time domain?","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Call 
","attrs":{}},{"type":"codeinline","content":[{"type":"text","text":"setStreamTimeCharacteristic","attrs":{}}],"attrs":{}},{"type":"text","text":" to set the time domain. The enum ","attrs":{}},{"type":"codeinline","content":[{"type":"text","text":"TimeCharacteristic","attrs":{}}],"attrs":{}},{"type":"text","text":" predefines three time domains; when none is set explicitly, the default is ","attrs":{}},{"type":"codeinline","content":[{"type":"text","text":"TimeCharacteristic.EventTime","attrs":{}}],"attrs":{}},{"type":"text","text":" (before version 1.12 the default was ","attrs":{}},{"type":"codeinline","content":[{"type":"text","text":"TimeCharacteristic.ProcessingTime","attrs":{}}],"attrs":{}},{"type":"text","text":").","attrs":{}}]},{"type":"codeblock","attrs":{"lang":"java"},"content":[{"type":"text","text":"StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();\n// Deprecated since Flink 1.12\nenv.setStreamTimeCharacteristic(TimeCharacteristic.ProcessingTime);","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"From version 1.12 onward the default is EventTime. To use ProcessingTime explicitly, you can disable watermarks by setting the automatic watermark interval to 0:","attrs":{}}]},{"type":"codeblock","attrs":{"lang":"java"},"content":[{"type":"text","text":"env.getConfig().setAutoWatermarkInterval(0);","attrs":{}}]},{"type":"heading","attrs":{"align":null,"level":1},"content":[{"type":"text","text":"EventTime and Watermarks","attrs":{}}]},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"Why must event time be handled?","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"In most cases messages enter the system out of order (the network, hardware, and distributed logic can all contribute), and some arrive late (for example, in mobile scenarios a phone without signal sends a batch of operation messages only after it reconnects). If computation is based on the time at which messages enter the system, the results can deviate badly from what actually happened. Ideally event time and processing time coincide (an event is processed the moment it occurs), but in practice they do not, and the two exhibit 
skew.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"A stream processor that supports event time therefore needs a way to measure the progress of event time. For example, an operator that builds hourly windows must be notified once event time has passed the end of the hour so that it can close the window in progress.","attrs":{}}]},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"What is a Watermark","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Flink's mechanism for measuring event-time progress is the Watermark. Watermarks flow as part of the data stream and carry a timestamp ","attrs":{}},{"type":"codeinline","content":[{"type":"text","text":"t","attrs":{}}],"attrs":{}},{"type":"text","text":". A ","attrs":{}},{"type":"codeinline","content":[{"type":"text","text":"Watermark(t)","attrs":{}}],"attrs":{}},{"type":"text","text":" declares that the stream should contain no more elements with an event time earlier than ","attrs":{}},{"type":"codeinline","content":[{"type":"text","text":"t","attrs":{}}],"attrs":{}},{"type":"text","text":" (this is only what the system assumes; it does not necessarily reflect what actually happens).","attrs":{}}]},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"Watermarks help with out-of-order events","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"The figure below shows watermarks in an in-order event stream; w(x) denotes the updated watermark value.","attrs":{}}]},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/de/de5c71dd9836a52b123629575f7e98da.png","alt":null,"title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"The figure below shows watermarks in an out-of-order event stream. A watermark indicates that all data with an event timestamp earlier than the watermark's timestamp has already been processed, and any element with an event time earlier than the watermark 
should no longer appear (if one does, the system may discard the message or handle it in some other way). This is of course only a heuristic conclusion (an estimate based on several sources of information).","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/a2/a2cb836cc2527a4074fbe1ea39ba85df.png","alt":null,"title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"Watermarks in parallel streams","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Watermarks are generated at, or directly after, source functions. Each parallel subtask of a source function usually generates its watermarks independently. These watermarks define the event time at that particular parallel source.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"As watermarks flow through the streaming program, they advance the event time at the operators where they arrive. Whenever an operator advances its event time, it generates a new watermark for its downstream operators.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Some operators consume multiple input streams, for example operators that follow a keyBy/partition function. Such an operator's current event time is the minimum of its input streams' event times. As its input streams update their event times, so does the operator.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"The figure below shows an example of events and watermarks flowing through parallel streams, with operators tracking event time.","attrs":{}}]},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/a5/a59899cbfc8e7ebb4c64d5d000706781.jpeg","alt":null,"title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"Late records (Late 
Elements)","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Some records may violate the watermark condition: their event time is earlier than ","attrs":{}},{"type":"codeinline","content":[{"type":"text","text":"t","attrs":{}}],"attrs":{}},{"type":"text","text":" yet they arrive after ","attrs":{}},{"type":"codeinline","content":[{"type":"text","text":"Watermark(t)","attrs":{}}],"attrs":{}},{"type":"text","text":". In real deployments events can be delayed arbitrarily, so it is impossible to name a point in time by which all events up to a given event timestamp are guaranteed to have been processed. Moreover, even when the lateness is bounded, delaying watermarks too much is undesirable because it adds too much latency to the evaluation of event-time windows.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"The system lets you configure an allowed lateness: if a record's distance from ","attrs":{}},{"type":"codeinline","content":[{"type":"text","text":"t","attrs":{}}],"attrs":{}},{"type":"text","text":" is within the allowed lateness, it is still processed; otherwise it is dropped.","attrs":{}}]},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/32/32f36e2402f173e8159b5abe948c87c4.png","alt":null,"title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"Windowing","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Aggregations over events (such as 
count and sum) work differently on streams than in batch processing. For example, it is impossible to count all elements in a stream, because a stream is treated as unbounded and its data as infinite. Aggregations on streams are therefore scoped by windows, such as a count over the last 5 minutes or the sum of the last 100 elements.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Windows can be time-driven or data-driven. One common classification distinguishes tumbling windows (no overlap), sliding windows (with overlap), and session windows (delimited by a gap of inactivity).","attrs":{}}]},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/11/11aaa439b9e9ae34a782c35feb31a3fd.png","alt":null,"title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"heading","attrs":{"align":null,"level":1},"content":[{"type":"text","text":"Generating Timestamps and Watermarks","attrs":{}}]},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"Watermark strategies (WatermarkStrategy)","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"To work with event time, Flink needs to know the event timestamps, which means every element in the stream must have its event timestamp assigned. This is usually done by using a TimestampAssigner to extract the timestamp from some field of the element.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Timestamp assignment goes hand in hand with watermark generation, which tells the system about progress in event time. You configure it by specifying a WatermarkGenerator.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"The Flink API expects a WatermarkStrategy that contains both a TimestampAssigner and a WatermarkGenerator. A number of common strategies are available out of the box as static methods on WatermarkStrategy, and users can build their own when needed.","attrs":{}}]},{"type":"codeblock","attrs":{"lang":"java"},"content":[{"type":"text","text":"public interface WatermarkStrategy<T>\n    extends TimestampAssignerSupplier<T>,\n        WatermarkGeneratorSupplier<T> {\n\n    /**\n     * Instantiates a 
TimestampAssigner that assigns timestamps to elements\n     */\n    @Override\n    TimestampAssigner<T> createTimestampAssigner(TimestampAssignerSupplier.Context context);\n\n    /**\n     * Instantiates a WatermarkGenerator that generates watermarks\n     */\n    @Override\n    WatermarkGenerator<T> createWatermarkGenerator(WatermarkGeneratorSupplier.Context context);\n}","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"You usually do not implement this interface yourself; instead you use the common watermark strategies that WatermarkStrategy provides as static helpers. For example, to use bounded-out-of-orderness watermarks with a lambda function as the timestamp assigner:","attrs":{}}]},{"type":"codeblock","attrs":{"lang":"java"},"content":[{"type":"text","text":"WatermarkStrategy\n    .<Tuple2<Long, String>>forBoundedOutOfOrderness(Duration.ofSeconds(20))\n    .withTimestampAssigner((event, timestamp) -> event.f0);\n// Specifying a TimestampAssigner is optional; in some cases it is not needed, for example when timestamps can be taken directly from the Kafka records","attrs":{}}]},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"Using a WatermarkStrategy","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"There are two places in a Flink application where a WatermarkStrategy can be used: 1. 
directly on the Source operator, or 2. after a non-Source operator.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"The first option is preferable, because it lets the Source exploit knowledge about shards/partitions/splits in the watermarking logic. Sources can usually track watermarks at a finer granularity, and the overall watermark a Source produces is then more accurate. Specifying a watermark strategy directly on the source usually means you have to use a source-specific interface; a later section covers how this works with the Kafka Connector, including details on per-partition watermarks.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"The second option (setting a WatermarkStrategy after an arbitrary operation) should be used only when the strategy cannot be set directly on the Source:","attrs":{}}]},{"type":"codeblock","attrs":{"lang":"java"},"content":[{"type":"text","text":"final StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();\n\n// MyEvent is a placeholder event type; watermarkStrategy is a WatermarkStrategy<MyEvent> defined elsewhere\nDataStream<MyEvent> stream = ...;\n\nDataStream<MyEvent> withTimestampsAndWatermarks =\n    stream.assignTimestampsAndWatermarks(watermarkStrategy);\n\nwithTimestampsAndWatermarks\n    .keyBy( (event) -> event.getGroup() )\n    .window(TumblingEventTimeWindows.of(Time.seconds(10)))\n    .reduce( (a, b) -> a.add(b) )\n    .addSink(...);","attrs":{}}]},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"Handling idle sources","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"If one of the input splits/partitions/shards carries no events for a while, the WatermarkGenerator receives no new information on which to base a watermark. Such an input is called an idle input or idle source.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"This is a problem because other partitions may still be carrying events; in that case the watermark does not advance (because the watermark of an operator with multiple inputs is computed over the watermarks of all its inputs, 
taking their minimum).","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"To deal with this, you can use a WatermarkStrategy that detects idleness and marks the input as idle:","attrs":{}}]},{"type":"codeblock","attrs":{"lang":"java"},"content":[{"type":"text","text":"WatermarkStrategy\n    .<Tuple2<Long, String>>forBoundedOutOfOrderness(Duration.ofSeconds(20))\n    .withIdleness(Duration.ofMinutes(1));","attrs":{}}]},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"WatermarkGenerator","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"A TimestampAssigner is a simple function that extracts a field from an event, so it needs no detailed discussion. A WatermarkGenerator, on the other hand, is somewhat more involved. This is the WatermarkGenerator interface:","attrs":{}}]},{"type":"codeblock","attrs":{"lang":"java"},"content":[{"type":"text","text":"/**\n * Generates watermarks either based on events or periodically\n *