Flink EventTime and Watermark

{"type":"doc","content":[{"type":"heading","attrs":{"align":null,"level":1},"content":[{"type":"text","text":"Notions of Time","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Time plays a central role in stream processing. Flink provides three notions of time: EventTime, ProcessingTime, and IngestionTime (the 1.13 documentation no longer mentions IngestionTime). Under the hood there are only two, Processing Time and Event Time: Ingestion Time takes its timestamps from the processing-time clock at the source, so it is essentially a variant of Processing Time. The three are described below (see the figure):","attrs":{}}]},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/fe/fee03206246c4b7846a1d338d74b0ef6.jpeg","alt":null,"title":"","style":[{"key":"width","value":"100%"},{"key":"bordertype","value":"none"}],"href":"","fromPaste":false,"pastePass":false}},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"EventTime","attrs":{}},{"type":"text","text":": the time at which the event was created, i.e. the timestamp the record carries from the moment it is produced.","attrs":{}}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"IngestionTime","attrs":{}},{"type":"text","text":": the time at which the event enters Flink, i.e. the timestamp assigned when it enters the source operator.","attrs":{}}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"ProcessingTime","attrs":{}},{"type":"text","text":": the local system time of the machine that executes a time-based operation such as a window.","attrs":{}}]}]}],"attrs":{}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"blockquote","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"The following two blog posts and one paper, which the official documentation also recommends, help in understanding these notions of time: ","attrs":{}},{"type":"link","attrs":{"href":"https://www.oreilly.com/ideas/the-world-beyond-batch-streaming-101","title":"","type":null},"content":[{"type":"text","text":"https://www.oreilly.com/ideas/the-world-beyond-batch-streaming-101","attrs":{}}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"link","attrs":{"href":"https://www.oreilly.com/ideas/the-world-beyond-batch-streaming-102","title":"","type":null},"content":[{"type":"text","text":"https://www.oreilly.com/ideas/the-world-beyond-batch-streaming-102","attrs":{}}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"link","attrs":{"href":"https://static.googleusercontent.com/media/research.google.com/zh-CN//pubs/archive/43864.pdf","title":"","type":null},"content":[{"type":"text","text":"https://static.googleusercontent.com/media/research.google.com/zh-CN//pubs/archive/43864.pdf","attrs":{}}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"The author's translations: ","attrs":{}},{"type":"link","attrs":{"href":"https://www.jianshu.com/p/2a8422f3a467","title":"","type":null},"content":[{"type":"text","text":"Translation 1","attrs":{}}]},{"type":"text","text":" ","attrs":{}},{"type":"link","attrs":{"href":"https://www.jianshu.com/p/695dd06d811f","title":"","type":null},"content":[{"type":"text","text":"Translation 2","attrs":{}}]},{"type":"text","text":" ","attrs":{}},{"type":"link","attrs":{"href":"https://www.jianshu.com/p/29a51ade587f","title":"","type":null},"content":[{"type":"text","text":"Translation 3","attrs":{}}]}]}],"attrs":{}},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"How to set the time characteristic in Flink?","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Call ","attrs":{}},{"type":"codeinline","content":[{"type":"text","text":"setStreamTimeCharacteristic","attrs":{}}],"attrs":{}},{"type":"text","text":" to set the time characteristic. The enum ","attrs":{}},{"type":"codeinline","content":[{"type":"text","text":"TimeCharacteristic","attrs":{}}],"attrs":{}},{"type":"text","text":" predefines three values; if none is set explicitly, the default is ","attrs":{}},{"type":"codeinline","content":[{"type":"text","text":"TimeCharacteristic.EventTime","attrs":{}}],"attrs":{}},{"type":"text","text":" (before version 1.12 the default was ","attrs":{}},{"type":"codeinline","content":[{"type":"text","text":"TimeCharacteristic.ProcessingTime","attrs":{}}],"attrs":{}},{"type":"text","text":").","attrs":{}}]},{"type":"codeblock","attrs":{"lang":"java"},"content":[{"type":"text","text":"import org.apache.flink.streaming.api.TimeCharacteristic;\nimport org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;\n\nStreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();\n// Deprecated since Flink 1.12\nenv.setStreamTimeCharacteristic(TimeCharacteristic.ProcessingTime);","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"From version 1.12 onward, EventTime is the default, and processing-time windows and timers still work when they are used explicitly. To switch watermark generation off for a pure ProcessingTime job, set the automatic watermark interval to 0:","attrs":{}}]},{"type":"codeblock","attrs":{"lang":"java"},"content":[{"type":"text","text":"env.getConfig().setAutoWatermarkInterval(0);","attrs":{}}]},{"type":"heading","attrs":{"align":null,"level":1},"content":[{"type":"text","text":"EventTime and Watermarks","attrs":{}}]},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"Why is handling event time necessary?","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"In most cases messages enter the system out of order (the network, hardware, and distributed logic can all contribute), and some arrive late (for example, a phone without signal sends a batch of operation messages only after it reconnects). If results are computed by the time messages enter the system, they can deviate badly from what actually happened. Ideally event time and processing time would coincide (an event is processed the moment it occurs), but in practice they do not, and there is skew between the two.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"A stream processor that supports event time therefore needs a way to measure the progress of event time. For example, an operator that builds hourly windows must be notified when event time has passed the end of an hour so that it can close the window in progress.","attrs":{}}]},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"What is a Watermark?","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"The mechanism Flink uses to measure progress in event time is the watermark. Watermarks flow as part of the data stream and carry a timestamp ","attrs":{}},{"type":"codeinline","content":[{"type":"text","text":"t","attrs":{}}],"attrs":{}},{"type":"text","text":". A ","attrs":{}},{"type":"codeinline","content":[{"type":"text","text":"Watermark(t)","attrs":{}}],"attrs":{}},{"type":"text","text":" declares that the stream should contain no more elements with an event time less than or equal to ","attrs":{}},{"type":"codeinline","content":[{"type":"text","text":"t","attrs":{}}],"attrs":{}},{"type":"text","text":" (this is what the system assumes, not necessarily what actually happens).","attrs":{}}]},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"Watermarks help with out-of-order streams","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"The figure below shows watermarks in an in-order event stream; w(x) denotes a watermark with value x.","attrs":{}}]},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/de/de5c71dd9836a52b123629575f7e98da.png","alt":null,"title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"The figure below shows watermarks in an out-of-order event stream. A watermark indicates that all data with event timestamps below the watermark's timestamp has been seen, so no element with an event time earlier than the watermark should appear afterwards (if one does, the system can drop it or handle it in some other way). This is, of course, only a heuristic conclusion, inferred from whatever information is available.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/a2/a2cb836cc2527a4074fbe1ea39ba85df.png","alt":null,"title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"Watermarks in parallel streams","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Watermarks are generated at, or directly after, source functions. Each parallel subtask of a source function usually generates its watermarks independently. These watermarks define the event time at that particular parallel source.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"As watermarks flow through the streaming program, they advance the event time at each operator they reach. Whenever an operator advances its event time, it generates a new watermark for its downstream operators.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Some operators consume multiple input streams, for example operators that follow a keyBy or partition function. The current event time of such an operator is the minimum of its input streams' event times. As its input streams update their event times, so does the operator.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"The figure below shows an example of events and watermarks flowing through parallel streams, with operators tracking event time.","attrs":{}}]},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/a5/a59899cbfc8e7ebb4c64d5d000706781.jpeg","alt":null,"title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"Late Elements","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Some records may violate the watermark condition: their event time is no greater than ","attrs":{}},{"type":"codeinline","content":[{"type":"text","text":"t","attrs":{}}],"attrs":{}},{"type":"text","text":", yet they arrive after ","attrs":{}},{"type":"codeinline","content":[{"type":"text","text":"Watermark(t)","attrs":{}}],"attrs":{}},{"type":"text","text":". In real deployments an event can be delayed arbitrarily long, so it is impossible to name a time by which all events up to a given event timestamp will have arrived. Moreover, even when the lateness is bounded, delaying the watermarks by too much is undesirable, because it delays the evaluation of event-time windows.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"The system allows a tolerable lateness to be configured: data that arrives within that allowed lateness after ","attrs":{}},{"type":"codeinline","content":[{"type":"text","text":"t","attrs":{}}],"attrs":{}},{"type":"text","text":" is still processed, and anything later is dropped.","attrs":{}}]},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/32/32f36e2402f173e8159b5abe948c87c4.png","alt":null,"title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"Windowing","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Aggregating events (for example count or sum) works differently on streams than in batch processing. It is impossible, for instance, to count all elements of a stream, because streams are in general unbounded. Aggregates on streams are instead scoped by windows, such as a count over the last 5 minutes or the sum of the last 100 elements.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Windows can be time-driven or data-driven. One common classification is tumbling windows (no overlap), sliding windows (with overlap), and session windows (delimited by a gap of inactivity).","attrs":{}}]},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/11/11aaa439b9e9ae34a782c35feb31a3fd.png","alt":null,"title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"heading","attrs":{"align":null,"level":1},"content":[{"type":"text","text":"Generating Timestamps and Watermarks","attrs":{}}]},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"Watermark strategies (WatermarkStrategy)","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"To work with event time, Flink needs to know the event timestamps, which means every element in the stream must have its event timestamp assigned. This is usually done by using a TimestampAssigner to extract the timestamp from some field of the element.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Timestamp assignment goes hand in hand with watermark generation. Watermarks tell the system about progress in event time, and how they are produced is configured by specifying a WatermarkGenerator.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"The Flink API expects a WatermarkStrategy that contains both a TimestampAssigner and a WatermarkGenerator. A number of common strategies are available out of the box as static methods on WatermarkStrategy, and users can build their own when required.","attrs":{}}]},{"type":"codeblock","attrs":{"lang":"java"},"content":[{"type":"text","text":"public interface WatermarkStrategy<T>\n        extends TimestampAssignerSupplier<T>, WatermarkGeneratorSupplier<T> {\n\n    /**\n     * Instantiates a TimestampAssigner that generates/assigns timestamps.\n     */\n    @Override\n    TimestampAssigner<T> createTimestampAssigner(TimestampAssignerSupplier.Context context);\n\n    /**\n     * Instantiates a WatermarkGenerator that generates watermarks.\n     */\n    @Override\n    WatermarkGenerator<T> createWatermarkGenerator(WatermarkGeneratorSupplier.Context context);\n}","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"You usually do not implement this interface yourself. Instead, use the static helper methods on WatermarkStrategy for common watermark strategies. For example, to use bounded-out-of-orderness watermarks with a lambda function as the timestamp assigner:","attrs":{}}]},{"type":"codeblock","attrs":{"lang":"java"},"content":[{"type":"text","text":"// Tuple2<Long, String> is an example element type whose first field holds the timestamp\nWatermarkStrategy\n        .<Tuple2<Long, String>>forBoundedOutOfOrderness(Duration.ofSeconds(20))\n        .withTimestampAssigner((event, timestamp) -> event.f0);\n// Specifying a TimestampAssigner is optional; some sources need none, for example when the timestamp can be taken directly from the Kafka record","attrs":{}}]},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"Using a WatermarkStrategy","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"There are two places in a Flink application where a WatermarkStrategy can be used: 1. directly on a source, and 2. after a non-source operation.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"The first option is preferable, because it lets the source exploit knowledge about shards/partitions/splits in the watermarking logic. Sources can usually track watermarks at a finer granularity, so the overall watermark a source emits is more accurate. Specifying a WatermarkStrategy directly on the source usually means using a source-specific interface; a later section shows how this works with the Kafka connector and gives more detail on per-partition watermarking.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"The second option (setting a WatermarkStrategy after an arbitrary operation) should be used only when the strategy cannot be set directly on the source:","attrs":{}}]},{"type":"codeblock","attrs":{"lang":"java"},"content":[{"type":"text","text":"final StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();\n\nDataStream<MyEvent> stream = ...;\n\n// <watermark strategy> stands for any WatermarkStrategy<MyEvent>\nDataStream<MyEvent> withTimestampsAndWatermarks =\n        stream.assignTimestampsAndWatermarks(<watermark strategy>);\n\nwithTimestampsAndWatermarks\n        .keyBy((event) -> event.getGroup())\n        .window(TumblingEventTimeWindows.of(Time.seconds(10)))\n        .reduce((a, b) -> a.add(b))\n        .addSink(...);","attrs":{}}]},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"Dealing with idle sources","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"If one of the input splits/partitions/shards carries no events for a while, the WatermarkGenerator receives no new information on which to base a watermark either. Such an input is called an idle input or idle source.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"This is a problem because other partitions may still be carrying events. In that case the watermark is held back, because the watermark of an operator with multiple inputs is computed as the minimum over the watermarks of all its inputs.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"To deal with this, use a WatermarkStrategy that detects idleness and marks the input as idle:","attrs":{}}]},{"type":"codeblock","attrs":{"lang":"java"},"content":[{"type":"text","text":"WatermarkStrategy\n        .<Tuple2<Long, String>>forBoundedOutOfOrderness(Duration.ofSeconds(20))\n        .withIdleness(Duration.ofMinutes(1));","attrs":{}}]},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"WatermarkGenerator","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"A TimestampAssigner is a simple function that extracts a field from an event, so it needs no detailed discussion. A WatermarkGenerator, on the other hand, is somewhat more involved. This is the WatermarkGenerator interface:","attrs":{}}]},{"type":"codeblock","attrs":{"lang":"java"},"content":[{"type":"text","text":"/**\n * Generates watermarks either based on events or periodically.\n *\n * Note: WatermarkGenerator subsumes the functionality of the earlier\n * AssignerWithPunctuatedWatermarks and AssignerWithPeriodicWatermarks (both deprecated).\n */\n@Public\npublic interface WatermarkGenerator<T> {\n\n    /**\n     * Called for every event; allows the watermark generator to examine and remember\n     * the event timestamps, or to emit a watermark based on the event itself.\n     */\n    void onEvent(T event, long eventTimestamp, WatermarkOutput output);\n\n    /**\n     * Called periodically; might emit a new watermark, or not.\n     *\n     * The call interval depends on the value of ExecutionConfig#getAutoWatermarkInterval().\n     */\n    void onPeriodicEmit(WatermarkOutput output);\n}","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"There are two styles of watermark generation: periodic and punctuated.","attrs":{}}]},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"A periodic generator observes incoming events through ","attrs":{}},{"type":"codeinline","content":[{"type":"text","text":"onEvent()","attrs":{}}],"attrs":{}},{"type":"text","text":" and emits a watermark when the framework calls ","attrs":{}},{"type":"codeinline","content":[{"type":"text","text":"onPeriodicEmit()","attrs":{}}],"attrs":{}},{"type":"text","text":".","attrs":{}}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"A punctuated generator observes incoming events through ","attrs":{}},{"type":"codeinline","content":[{"type":"text","text":"onEvent()","attrs":{}}],"attrs":{}},{"type":"text","text":" and waits for special marker events or punctuations in the stream that carry watermark information. When it sees one, it emits a watermark immediately.","attrs":{}}]}]}],"attrs":{}},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"Periodic WatermarkGenerator","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"A periodic generator observes stream events and generates watermarks at regular intervals (possibly depending on the stream elements, or based purely on processing time).","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"The interval at which watermarks are generated is defined via ","attrs":{}},{"type":"codeinline","content":[{"type":"text","text":"ExecutionConfig.setAutoWatermarkInterval()","attrs":{}}],"attrs":{}},{"type":"text","text":". The generator's ","attrs":{}},{"type":"codeinline","content":[{"type":"text","text":"onPeriodicEmit()","attrs":{}}],"attrs":{}},{"type":"text","text":" method is called each time, and a watermark emitted from it takes effect only if it is larger than the previous watermark.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Two simple examples follow. Note that Flink ships with BoundedOutOfOrdernessWatermarks, a watermark generator that works much like the first example; the official documentation describes it in more detail.","attrs":{}}]},{"type":"codeblock","attrs":{"lang":"java"},"content":[{"type":"text","text":"import org.apache.flink.api.common.eventtime.Watermark;\nimport org.apache.flink.api.common.eventtime.WatermarkGenerator;\nimport org.apache.flink.api.common.eventtime.WatermarkOutput;\n\n// Event-time watermarks\n// Assumes events arrive out of order only to a bounded degree: the latest element for a\n// timestamp t arrives at most n milliseconds after the earliest element for timestamp t.\npublic class BoundedOutOfOrdernessGenerator implements WatermarkGenerator<MyEvent> {\n\n    // Maximum tolerated out-of-orderness: 3.5 seconds\n    private final long maxOutOfOrderness = 3500;\n\n    private long currentMaxTimestamp;\n\n    @Override\n    public void onEvent(MyEvent event, long eventTimestamp, WatermarkOutput output) {\n        currentMaxTimestamp = Math.max(currentMaxTimestamp, eventTimestamp);\n    }\n\n    @Override\n    public void onPeriodicEmit(WatermarkOutput output) {\n        // The watermark trails the highest timestamp seen by the out-of-orderness bound\n        output.emitWatermark(new Watermark(currentMaxTimestamp - maxOutOfOrderness - 1));\n    }\n}\n\n// Processing-time watermarks\n// The generated watermark lags behind processing time by a fixed amount.\npublic class TimeLagWatermarkGenerator implements WatermarkGenerator<MyEvent> {\n\n    // Fixed lag: 5 seconds\n    private final long maxTimeLag = 5000;\n\n    @Override\n    public void onEvent(MyEvent event, long eventTimestamp, WatermarkOutput output) {\n        // Nothing to do: this generator works on processing time and ignores events\n    }\n\n    @Override\n    public void onPeriodicEmit(WatermarkOutput output) {\n        output.emitWatermark(new Watermark(System.currentTimeMillis() - maxTimeLag));\n    }\n}","attrs":{}}]},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"Punctuated WatermarkGenerator","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"A punctuated generator observes the event stream and emits a watermark whenever it encounters an element that carries watermark information.","attrs":{}}]},{"type":"codeblock","attrs":{"lang":"java"},"content":[{"type":"text","text":"public class PunctuatedAssigner implements WatermarkGenerator<MyEvent> {\n\n    @Override\n    public void onEvent(MyEvent event, long eventTimestamp, WatermarkOutput output) {\n        // Emit a watermark whenever an event carries the special marker\n        if (event.hasWatermarkMarker()) {\n            output.emitWatermark(new Watermark(event.getWatermarkTimestamp()));\n        }\n    }\n\n    @Override\n    public void onPeriodicEmit(WatermarkOutput output) {\n        // Nothing to do: watermarks are emitted in reaction to events\n    }\n}","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"Note","attrs":{}},{"type":"text","text":": it is possible to generate a watermark on every single event. However, because each watermark triggers some computation downstream, an excessive number of watermarks degrades performance.","attrs":{}}]},{"type":"heading","attrs":{"align":null,"level":1},"content":[{"type":"text","text":"Using a WatermarkStrategy with the Kafka Connector","attrs":{}}]},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"Timestamps per Kafka partition","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"When Kafka is used as a data source, each partition may have a simple event-time pattern (ascending timestamps, for example). When streams from Kafka are consumed, however, multiple partitions are usually consumed in parallel, which interleaves the events from the partitions and destroys the per-partition pattern.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"In that case you can use Flink's Kafka-partition-aware watermark generation. With this feature, watermarks are generated inside the Kafka consumer, per Kafka partition, and the per-partition watermarks are merged in the same way that watermarks are merged on stream shuffles.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"For example, if event timestamps are strictly ascending within each Kafka partition, generating per-partition watermarks with the ","attrs":{}},{"type":"link","attrs":{"href":"https://ci.apache.org/projects/flink/flink-docs-release-1.13/docs/dev/datastream/event-time/generating_watermarks/event_timestamp_extractors.html#assigners-with-ascending-timestamps","title":"","type":null},"content":[{"type":"text","text":"ascending timestamps watermark generator","attrs":{}}]},{"type":"text","text":" yields perfect overall watermarks. The figure below shows how watermarks are generated per Kafka partition and how they propagate through the streaming dataflow in that case.","attrs":{}}]},{"type":"codeblock","attrs":{"lang":"java"},"content":[{"type":"text","text":"FlinkKafkaConsumer<MyEvent> kafkaSource = new FlinkKafkaConsumer<>(\"myTopic\", schema, props);\n\nkafkaSource.assignTimestampsAndWatermarks(\n        WatermarkStrategy\n                .<MyEvent>forBoundedOutOfOrderness(Duration.ofSeconds(20)));\n\nDataStream<MyEvent> stream = env.addSource(kafkaSource);","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/ca/ca1f90a9b54975b45b7d35ad979d82f4.jpeg","alt":null,"title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"heading","attrs":{"align":null,"level":1},"content":[{"type":"text","text":"How Operators Process Watermarks","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"As a general rule, operators are required to process a given watermark completely before forwarding it downstream. For example, WindowOperator first evaluates all windows that should fire, and only after producing all of the output triggered by the watermark is the watermark itself sent downstream.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"The same rule applies to TwoInputStreamOperator. In that case, however, the operator's current watermark is defined as the minimum of the watermarks of its two inputs.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"For details, see the implementations of ","attrs":{}},{"type":"codeinline","content":[{"type":"text","text":"OneInputStreamOperator#processWatermark","attrs":{}}],"attrs":{}},{"type":"text","text":", ","attrs":{}},{"type":"codeinline","content":[{"type":"text","text":"TwoInputStreamOperator#processWatermark1","attrs":{}}],"attrs":{}},{"type":"text","text":", and ","attrs":{}},{"type":"codeinline","content":[{"type":"text","text":"TwoInputStreamOperator#processWatermark2","attrs":{}}],"attrs":{}},{"type":"text","text":".","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"horizontalrule","attrs":{}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"References:","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"link","attrs":{"href":"https://ci.apache.org/projects/flink/flink-docs-release-1.13/docs/concepts/time/","title":null,"type":null},"content":[{"type":"text","text":"https://ci.apache.org/projects/flink/flink-docs-release-1.13/docs/concepts/time/","attrs":{}}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"link","attrs":{"href":"https://ci.apache.org/projects/flink/flink-docs-release-1.13/docs/dev/datastream/event-time/generating_watermarks/","title":null,"type":null},"content":[{"type":"text","text":"https://ci.apache.org/projects/flink/flink-docs-release-1.13/docs/dev/datastream/event-time/generating_watermarks/","attrs":{}}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"link","attrs":{"href":"https://ci.apache.org/projects/flink/flink-docs-release-1.6/dev/event_time.html","title":null,"type":null},"content":[{"type":"text","text":"https://ci.apache.org/projects/flink/flink-docs-release-1.6/dev/event_time.html","attrs":{}}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"link","attrs":{"href":"https://ci.apache.org/projects/flink/flink-docs-release-1.6/dev/event_timestamps_watermarks.html","title":null,"type":null},"content":[{"type":"text","text":"https://ci.apache.org/projects/flink/flink-docs-release-1.6/dev/event_timestamps_watermarks.html","attrs":{}}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"link","attrs":{"href":"https://ci.apache.org/projects/flink/flink-docs-release-1.6/dev/event_timestamp_extractors.html","title":null,"type":null},"content":[{"type":"text","text":"https://ci.apache.org/projects/flink/flink-docs-release-1.6/dev/event_timestamp_extractors.html","attrs":{}}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"link","attrs":{"href":"http://vishnuviswanath.com/flink_eventtime.html","title":null,"type":null},"content":[{"type":"text","text":"http://vishnuviswanath.com/flink_eventtime.html","attrs":{}}]}]}]}
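The bounded-out-of-orderness rule from the Periodic WatermarkGenerator section can be tried out without a Flink cluster. The sketch below is a plain-Java model, not Flink code: the class names `BoundedOutOfOrdernessModel` and `WatermarkDemo` and the method `lateEvents` are hypothetical, and as a simplification the watermark is re-evaluated before every event rather than on a periodic timer. It only illustrates the arithmetic `watermark = maxTimestamp - bound - 1` and what it means for an event to arrive late.

```java
import java.util.ArrayList;
import java.util.List;

/** Plain-Java model of a bounded-out-of-orderness watermark; not a Flink class. */
class BoundedOutOfOrdernessModel {
    private final long maxOutOfOrderness;
    private long currentMaxTimestamp;

    BoundedOutOfOrdernessModel(long maxOutOfOrdernessMillis) {
        this.maxOutOfOrderness = maxOutOfOrdernessMillis;
        // Start high enough that the first watermark equals Long.MIN_VALUE without underflow.
        this.currentMaxTimestamp = Long.MIN_VALUE + maxOutOfOrdernessMillis + 1;
    }

    /** Mirrors onEvent(): remember the highest event timestamp seen so far. */
    void onEvent(long eventTimestamp) {
        currentMaxTimestamp = Math.max(currentMaxTimestamp, eventTimestamp);
    }

    /** Mirrors onPeriodicEmit(): the watermark trails the maximum by the bound. */
    long currentWatermark() {
        return currentMaxTimestamp - maxOutOfOrderness - 1;
    }

    /** An event counts as late if its timestamp is not above the current watermark. */
    boolean isLate(long eventTimestamp) {
        return eventTimestamp <= currentWatermark();
    }
}

class WatermarkDemo {
    /** Returns the timestamps that count as late for the given arrival order. */
    static List<Long> lateEvents(long[] arrivals, long maxOutOfOrdernessMillis) {
        BoundedOutOfOrdernessModel model = new BoundedOutOfOrdernessModel(maxOutOfOrdernessMillis);
        List<Long> late = new ArrayList<>();
        for (long ts : arrivals) {
            if (model.isLate(ts)) {
                late.add(ts);
            }
            model.onEvent(ts);
        }
        return late;
    }

    public static void main(String[] args) {
        long[] arrivals = {1000, 3000, 2000, 9000, 5000, 6000};
        // After 9000 arrives the watermark is 9000 - 3500 - 1 = 5499, so 5000 is late and 6000 is not.
        System.out.println(lateEvents(arrivals, 3500)); // prints [5000]
    }
}
```

With a bound of 3500 ms, the event at 2000 is out of order but still accepted, because the watermark at that point is only -501. Raising the bound tolerates more disorder at the cost of holding windows open longer.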

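The rule that an operator with several inputs takes the minimum of its inputs' watermarks, and the reason an idle input stalls it, can also be modeled in a few lines. This is a minimal sketch under stated assumptions, not Flink's implementation: `MultiInputWatermarkModel`, `onWatermark`, and `markIdle` are hypothetical names, and idleness is set by hand here, whereas in Flink it is detected through `withIdleness`.

```java
import java.util.Arrays;

/** Plain-Java model of how a multi-input operator combines watermarks; not a Flink class. */
class MultiInputWatermarkModel {
    private final long[] inputWatermarks;
    private final boolean[] idle;
    private long lastEmitted = Long.MIN_VALUE;

    MultiInputWatermarkModel(int numberOfInputs) {
        inputWatermarks = new long[numberOfInputs];
        Arrays.fill(inputWatermarks, Long.MIN_VALUE);
        idle = new boolean[numberOfInputs];
    }

    /** Records a watermark from one input (marking it active) and returns the operator's watermark. */
    long onWatermark(int input, long watermark) {
        idle[input] = false;
        inputWatermarks[input] = Math.max(inputWatermarks[input], watermark);
        return advance();
    }

    /** Marks an input as idle so that it no longer holds the minimum back. */
    long markIdle(int input) {
        idle[input] = true;
        return advance();
    }

    /** Operator watermark = minimum over all non-idle inputs, never moving backwards. */
    private long advance() {
        long min = Long.MAX_VALUE;
        boolean anyActive = false;
        for (int i = 0; i < inputWatermarks.length; i++) {
            if (!idle[i]) {
                anyActive = true;
                min = Math.min(min, inputWatermarks[i]);
            }
        }
        if (anyActive && min > lastEmitted) {
            lastEmitted = min;
        }
        return lastEmitted;
    }
}
```

For example, with two inputs at watermarks 200 and 40 the operator stays at 40 however far the first input advances. Once the second input is marked idle, the operator jumps to 200, and it does not move backwards if that input later resumes with an older watermark.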