Flink SQL FileSystem Connector: Partition Commit and a Custom Small-File Compaction Policy
{"type":"doc","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"To fit into the broader Flink-Hive integration, the Flink SQL FileSystem Connector has received many improvements, the most visible of which is the partition commit mechanism."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"This article first walks through the source code behind the two elements of partition commit, the trigger and the policy, and then uses small-file compaction as an example of how to write a custom partition commit policy."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"heading","attrs":{"align":null,"level":1},"content":[{"type":"text","text":"PartitionCommitTrigger"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"codeblock","attrs":{"lang":null},"content":[{"type":"text","text":"path\n└── datetime=2019-08-25\n    └── hour=11\n        ├── part-0.parquet\n        ├── part-1.parquet\n    └── hour=12\n        ├── part-0.parquet\n└── datetime=2019-08-26\n    └── hour=6\n        ├── part-0.parquet"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"So when does partition data that has already been written become visible downstream? That depends on how partition commit is triggered. According to the official documentation, two options control the trigger:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"bulletedlist","content":[{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"sink.partition-commit.trigger"},{"type":"text","text":": either process-time (trigger on processing time) or partition-time (trigger on the partition time extracted from event time)."}]}]},{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"sink.partition-commit.delay"},{"type":"text","text":": the delay before a partition is committed. With the process-time trigger, the delay is measured from the system timestamp at which the partition was created, and the partition is committed once the delay has elapsed. With the partition-time trigger, it is measured from the event-time timestamp carried by the partition itself, and the partition is committed once the watermark has passed that timestamp plus the delay."}]}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"As this shows, the process-time trigger cannot cope with jitter during processing: once data arrives late or the job fails and restarts, data can no longer be assigned to the correct partition by event time. In practice we therefore almost always choose the partition-time trigger and generate watermarks ourselves. We also need the partition.time-extractor.* options to specify how the partition time is extracted (PartitionTimeExtractor); the official documentation covers this clearly, so it is not repeated here."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"In the source code, the class diagram of PartitionCommitTrigger looks like this."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/e8/e8628dd73fb5cfba75c08cc46016cec4.png","alt":null,"title":null,"style":null,"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Taking the partition-time-based PartitionTimeCommitTrigger as an example, let's look briefly at how it works. Here is the full code of the class."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"codeblock","attrs":{"lang":null},"content":[{"type":"text","text":"public class PartitionTimeCommitTigger implements PartitionCommitTrigger {\n private static final ListStateDescriptor
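The class listing above is cut off in this copy, so as a rough illustration of the partition-time rule described earlier (a partition becomes committable once the watermark has passed its partition time plus the configured delay), here is a minimal standalone sketch. It is not Flink's implementation: the class `PartitionCommitSketch`, the method `committablePartitions`, and the use of a plain map in place of Flink state are all assumptions made for illustration, and partition times are simplified to milliseconds since midnight.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch, not Flink code: models the partition-time commit rule only.
public class PartitionCommitSketch {

    /**
     * Returns the partitions for which the watermark has passed
     * (partition time + delay) and removes them from the pending map.
     */
    public static List<String> committablePartitions(
            Map<String, Long> pendingPartitionTimes, long watermark, long delayMillis) {
        List<String> ready = new ArrayList<>();
        Iterator<Map.Entry<String, Long>> it = pendingPartitionTimes.entrySet().iterator();
        while (it.hasNext()) {
            Map.Entry<String, Long> entry = it.next();
            // Strict comparison: the watermark must be past the deadline, not equal to it.
            if (watermark > entry.getValue() + delayMillis) {
                ready.add(entry.getKey());
                it.remove();
            }
        }
        return ready;
    }

    public static void main(String[] args) {
        long hour = 3_600_000L;
        Map<String, Long> pending = new LinkedHashMap<>();
        pending.put("datetime=2019-08-25/hour=11", 11 * hour);
        pending.put("datetime=2019-08-25/hour=12", 12 * hour);

        // Watermark at 12:30 with a 1h delay: hour=11 is ready, hour=12 is not yet.
        long watermark = 12 * hour + 1_800_000L;
        System.out.println(committablePartitions(pending, watermark, hour));
        // prints [datetime=2019-08-25/hour=11]
    }
}
```

With a one-hour delay, the `hour=12` partition stays pending until the watermark moves past 13:00, which is why late data within that window can still land in the partition before it is committed.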