# Experience with Flink + the Iceberg Data Lake

{"type":"doc","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"heading","attrs":{"align":null,"level":5},"content":[{"type":"text","text":"本文作者:餘東,2021年加入Qunar,主要負責數據平臺Flink的運維與平臺開發。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/52/527d8ef75bb65347aa9378b5d0c9c122.png","alt":null,"title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"horizontalrule","attrs":{}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"本文導讀","attrs":{}}]},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/62/6293d2b03e41380278fec2504e82b0d0.jpeg","alt":null,"title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"horizontalrule","attrs":{}},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"一. 背景及特點","attrs":{}}]},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"1. 背景","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"在使用 Flink 做實時數倉以及數據傳輸過程中,遇到了一些問題:比如 Kafka 數據丟失,Flink 結合 Hive 的近實時數倉性能等。Iceberg 0.11 的新特性解決了這些業務場景碰到的問題。對比 Kafka 來說,Iceberg 在某些特定場景有自己的優勢。","attrs":{}}]},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"2. 原架構方案","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"原先的架構採用 Kafka 存儲實時數據。然後用 Flink SQL 或者 Flink datastream 消費數據進行流轉。內部自研了提交 SQL 和 Datastream 的平臺,通過該平臺提交實時作業。","attrs":{}}]},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"3. 痛點","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Kafka 存儲成本高且數據量大。Kafka 由於壓力大將數據過期時間設置的比較短,當數據產生反壓,積壓等情況時,如果在一定的時間內沒消費數據導致數據過期,會造成數據丟失。Flink 在 Hive 上做了近實時的讀寫支持。爲了分擔 Kafka 壓力,將一些實時性不太高的數據放入 Hive,讓 Hive 做分鐘級的分區。但是隨着元數據不斷增加,Hive metadata 的壓力日益顯著,查詢也變得更慢,且存儲 Hive 元數據的數據庫壓力也變大。","attrs":{}}]},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"二. 背景及特點","attrs":{}}]},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/ae/ae35d4d52301ec45f9ee60bdcaa57cb7.png","alt":null,"title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"1. 
術語解析","attrs":{}}]},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"數據文件 ( data files )","attrs":{}}]}]}],"attrs":{}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Iceberg 表真實存儲數據的文件,一般存儲在data目錄下,以\".parquet\"結尾。","attrs":{}}]},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"清單文件 ( Manifest file )","attrs":{}}]}]}],"attrs":{}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"每行都是每個數據文件的詳細描述,包括數據文件的狀態、文件路徑、分區信息、列級別的統計信息(比如每列的最大最小值、空值數等)、通過該文件、可過濾掉無關數據、提高檢索速度。","attrs":{}}]},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"快照( Snapshot )","attrs":{}}]}]}],"attrs":{}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"快照代表一張表在某個時刻的狀態。每個快照版本包含某個時刻的所有數據文件列表。Data files 是存儲在不同的 manifest files 裏面, manifest files 是存儲在一個 Manifest list 文件裏面,而一個 Manifest list 文件代表一個快照。","attrs":{}}]},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"2. Iceberg 查詢計劃","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"查詢計劃是在表中查找查詢所需文件的過程。","attrs":{}}]},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"元數據過濾","attrs":{}}]}]}],"attrs":{}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"清單文件包括分區數據元組和每個數據文件的列級統計信息。在計劃期間,查詢謂詞會自動轉換爲分區數據上的謂詞,並首先應用於過濾數據文件。接下來,使用列級值計數,空計數,下限和上限來消除與查詢謂詞不匹配的文件。","attrs":{}}]},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Snapshot ID","attrs":{}}]}]}],"attrs":{}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"每個Snapshot ID會關聯到一組manifest files、而每一組manifest files包含很多manifest file。","attrs":{}}]},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"manifest files文件列表","attrs":{}}]}]}],"attrs":{}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"每個manifest files又記錄了當前data數據塊的元數據信息,其中就包含了文件列的最大值和最小值,然後根據這個元數據信息,索引到具體的文件塊,從而更快的查詢到數據。","attrs":{}}]},{"type":"horizontalrule","attrs":{}},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"三. 痛點一:Kafka 數據丟失","attrs":{}}]},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"1. 
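To make the snapshot → manifest list → manifest file → data file hierarchy concrete, here is a minimal inspection sketch using the Iceberg Java API of that era. It assumes a Hadoop-backed table; the `tablePath` below is a hypothetical location, not one from this article.

```java
// Walk Iceberg's metadata hierarchy: snapshot -> manifest files -> data files.
// A minimal sketch, assuming a HadoopTables-managed table and the 0.11-era API;
// `tablePath` is a hypothetical placeholder.
import org.apache.hadoop.conf.Configuration;
import org.apache.iceberg.DataFile;
import org.apache.iceberg.ManifestFile;
import org.apache.iceberg.ManifestFiles;
import org.apache.iceberg.Snapshot;
import org.apache.iceberg.Table;
import org.apache.iceberg.hadoop.HadoopTables;
import org.apache.iceberg.io.CloseableIterable;

public class InspectMetadata {
  public static void main(String[] args) throws Exception {
    String tablePath = "hdfs:///warehouse/iceberg_db/tbl1";    // hypothetical
    Table table = new HadoopTables(new Configuration()).load(tablePath);

    // One manifest list file == one snapshot.
    Snapshot snapshot = table.currentSnapshot();
    System.out.println("snapshot-id = " + snapshot.snapshotId());

    // allManifests() is the 0.11-era accessor (newer versions take a FileIO argument).
    for (ManifestFile manifest : snapshot.allManifests()) {
      System.out.println("manifest: " + manifest.path());
      try (CloseableIterable<DataFile> files = ManifestFiles.read(manifest, table.io())) {
        for (DataFile f : files) {   // one manifest row per data file
          // Partition tuple + column-level stats: exactly what planning prunes on.
          System.out.printf("  data file=%s partition=%s records=%d%n",
              f.path(), f.partition(), f.recordCount());
          System.out.println("  lower bounds: " + f.lowerBounds());  // Map<fieldId, ByteBuffer>
          System.out.println("  upper bounds: " + f.upperBounds());
        }
      }
    }
  }
}
```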
---

## III. Pain Point 1: Kafka Data Loss

### 1. The problem

We typically choose Kafka for the real-time warehouse and for log transport. Kafka's storage cost is high and its retention is time-bounded: once consumption falls behind and data passes its expiration time, that data is lost without ever having been consumed.

### 2. Solution

Put business data with relaxed real-time requirements into the lake, say, data that can tolerate 1-10 minutes of latency. Iceberg 0.11 also supports real-time reads through SQL, and it retains historical data. This relieves the load on the online Kafka clusters while guaranteeing that data is not lost and can still be read in (near) real time.

### 3. Why can Iceberg only do near-real-time ingestion?

![](https://static001.geekbang.org/infoq/76/76e965976011aed11935df2a63b39ebe.png)

① Iceberg commits transactions at file granularity. Committing a transaction every second is therefore not feasible; it would make the number of files balloon.

② There is no online service node. For high-throughput, low-latency real-time writes, a truly real-time response is not possible.

③ Flink writes in units of checkpoints. After physical data is written to Iceberg it cannot be queried directly; the metadata file is only written when a checkpoint is triggered, at which point the data turns from invisible to visible. And each checkpoint execution takes a certain amount of time.

### 4. How Flink writes into the lake

![](https://static001.geekbang.org/infoq/11/111c47445828da78b6f74bdba801478d.png)

#### Components

- IcebergStreamWriter

Writes incoming records to the corresponding Avro, Parquet, or ORC files, generates an Iceberg DataFile for each, and sends it to the downstream operator.

- IcebergFilesCommitter

When a checkpoint arrives, collects all the DataFiles and commits a transaction to Apache Iceberg, completing that checkpoint's data write. It maintains a DataFile list per checkpointId, i.e. a map, so even if the transaction for some checkpoint fails to commit, its DataFiles are still held in state and can be committed to the Iceberg table by a later checkpoint.
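On the DataStream side, these two operators are wired up internally by the `FlinkSink` builder in the iceberg-flink module. A minimal sketch, assuming the 0.11-era API; the Kafka source helper and the table location below are placeholders, not code from this article.

```java
// DataStream sketch of the write path described above. FlinkSink internally
// chains IcebergStreamWriter -> IcebergFilesCommitter. Assumes iceberg-flink 0.11-era API.
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.table.data.RowData;
import org.apache.iceberg.flink.TableLoader;
import org.apache.iceberg.flink.sink.FlinkSink;

public class IcebergIngestJob {
  public static void main(String[] args) throws Exception {
    StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
    // Commits (and thus data visibility) happen per checkpoint -- see the pitfall in section 6.
    env.enableCheckpointing(60_000);

    DataStream<RowData> source = buildKafkaSource(env);  // hypothetical helper

    TableLoader tableLoader =
        TableLoader.fromHadoopTable("hdfs:///warehouse/iceberg_db/tbl1");  // placeholder

    FlinkSink.forRowData(source)
        .tableLoader(tableLoader)
        .writeParallelism(2)   // parallel writers; the committer runs as a single operator
        .build();

    env.execute("kafka-to-iceberg");
  }

  private static DataStream<RowData> buildKafkaSource(StreamExecutionEnvironment env) {
    throw new UnsupportedOperationException("wire up your Kafka source here");
  }
}
```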
### 5. Flink SQL demo

![](https://static001.geekbang.org/infoq/db/db9c5d50749676f9d3e3590381971778.png)

#### 5.1 Setup

- Enable streaming execution: `set execution.type = streaming`
- Enable table SQL hints so the OPTIONS clause can be used: `set table.dynamic-table-options.enabled=true`
- Register an Iceberg catalog for operating on Iceberg tables:

```sql
CREATE CATALOG Iceberg_catalog WITH (
  'type'='iceberg',
  'catalog-type'='hive',
  'uri'='thrift://localhost:9083'
);
```

Streaming Kafka data into the lake:

```sql
insert into Iceberg_catalog.Iceberg_db.tbl1
  select * from Kafka_tbl;
```

Streaming data between lake tables, tbl1 -> tbl2:

```sql
insert into Iceberg_catalog.Iceberg_db.tbl2
  select * from Iceberg_catalog.Iceberg_db.tbl1
  /*+ OPTIONS('streaming'='true',
      'monitor-interval'='10s',
      'start-snapshot-id'='3821550127947089987') */;
```

#### 5.2 Parameters

- monitor-interval: the interval at which to continuously monitor for newly committed data files (default: 1s).
- start-snapshot-id: read data starting from the given snapshot ID. Each snapshot ID is associated with a group of manifest files, and each manifest file maps to its own real data files; through the snapshot ID, a specific version of the data is read.

### 6. Pitfall

I once wrote data to Iceberg from the SQL Client: the data directory kept being updated, but there was nothing in the metadata directory, so queries returned no data, because Iceberg queries need the metadata to index the real data. The SQL Client does not enable checkpointing by default; it has to be enabled through the configuration file. That is why data files were written to the data directory while no metadata was written to the metadata directory. (A configuration sketch follows after the data samples below.)

PS: Checkpointing must be enabled whether you ingest via SQL or via DataStream.

### 7. Data sample

The two screenshots below show the effect of querying Iceberg in real time: the data one second before, and one second after.

##### One second before

![](https://static001.geekbang.org/infoq/5a/5ae3025dfff9d24c01f76c46ec76f029.png)

##### Refreshed one second later

![](https://static001.geekbang.org/infoq/04/048fd0deb6492fa501c400f8cddbcac2.png)
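Following up on the pitfall in section 6: for SQL Client jobs of that era, checkpointing comes from the cluster configuration rather than a session `SET`. A minimal `flink-conf.yaml` sketch, assuming Flink 1.11+ configuration keys; the values are illustrative, not the article's actual settings.

```yaml
# flink-conf.yaml -- assumed keys for Flink 1.11+; values are illustrative.
# Each completed checkpoint commits an Iceberg transaction and makes data visible,
# so without this the metadata directory is never written.
execution.checkpointing.interval: 60s
execution.checkpointing.mode: EXACTLY_ONCE
```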
---

## IV. Pain Point 2: Flink + Hive Near-Real-Time Keeps Getting Slower

### 1. The problem

The near-real-time Flink + Hive architecture does support real-time reads and writes, but as tables and partitions multiply it faces the following problems:

- Too much metadata. Switching Hive partitions to hour/minute granularity improves data freshness, but the pressure on the metastore is obvious: too much metadata slows query-plan generation and can also destabilize other online services.
- Growing database pressure. As metadata grows, so does the load on the database that stores the Hive metadata; after a while that database needs to be scaled up, for example its storage.

![](https://static001.geekbang.org/infoq/e0/e0e011f5726071ba85419224f9e34043.jpeg)

![](https://static001.geekbang.org/infoq/f8/f8eddeeb1fbbc5de1dcf0e244cb3f930.png)

### 2. Solution

Migrate the near-real-time Hive pipeline to Iceberg. Why can Iceberg handle large volumes of metadata while Hive easily bottlenecks when metadata grows?

Iceberg keeps its metadata on a scalable distributed file system; there is no centralized metadata system. Hive, by contrast, keeps the partition-level metadata in the metastore (too many partitions puts enormous pressure on MySQL), while the metadata inside a partition actually lives in the files themselves (starting a job requires listing large numbers of files just to decide which files need to be scanned, and the whole process is very time-consuming).

![](https://static001.geekbang.org/infoq/f3/f39678658d98224490f1dd32ae258cf1.png)

---

## V. Optimization in Practice
### 1. Small-file handling

Before Iceberg 0.11, small-file compaction was done by periodically triggering the batch API. That does merge files, but it means maintaining a set of Actions code, and the merging is not real-time. (An operational scheduling sketch follows at the end of this section.)

```java
// Batch compaction via the Iceberg Actions API (imports added for completeness;
// the Actions class here is from the Flink module -- adjust the import to the engine in use).
import org.apache.iceberg.Table;
import org.apache.iceberg.flink.actions.Actions;

Table table = findTable(options, conf);
Actions.forTable(table)
    .rewriteDataFiles()
    .targetSizeInBytes(10 * 1024) // 10 KB (demo value; real targets are usually far larger)
    .execute();
```

Iceberg 0.11 adds support for streaming small-file compaction: data is written with a hash shuffle on the partition/bucket key, merging files at the source. The benefit is that one task handles one partition's data and commits its own DataFiles, i.e. a task only commits for the partition it owns. This avoids many tasks committing many small files, and requires no extra maintenance code: just set the write.distribution-mode property when creating the table. The parameter is shared with other engines, such as Spark.

```sql
CREATE TABLE city_table (
  province BIGINT,
  city STRING
) PARTITIONED BY (province, city) WITH (
  'write.distribution-mode'='hash'
);
```
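As a sketch of what the pre-0.11 approach looked like operationally, the batch compaction above would be driven by a timer. The table-loading helper and the interval below are assumptions for illustration, not the article's actual maintenance code.

```java
// Periodic batch compaction -- the kind of "maintain your own Actions code + schedule"
// setup that 0.11's write.distribution-mode makes unnecessary.
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import org.apache.iceberg.Table;
import org.apache.iceberg.flink.actions.Actions;

public class CompactionScheduler {
  public static void main(String[] args) {
    ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
    scheduler.scheduleAtFixedRate(() -> {
      Table table = loadTable();  // hypothetical helper
      Actions.forTable(table)
          .rewriteDataFiles()
          .targetSizeInBytes(128L * 1024 * 1024)  // e.g. 128 MB target files
          .execute();
    }, 0, 30, TimeUnit.MINUTES);  // assumed interval
  }

  private static Table loadTable() {
    throw new UnsupportedOperationException("load the Iceberg table here");
  }
}
```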
### 2. Sorting in Iceberg 0.11

#### 2.1 Introduction

Before Iceberg 0.11, Flink did not support Iceberg's sort feature, so sorting could only be done with Spark in batch mode. 0.11 adds sort support, which means we can now get this benefit in real-time pipelines as well.

The essence of sorting is faster scans: once data is clustered by the sort key, everything is laid out in ascending order, and max-min pruning can filter out large amounts of irrelevant data.

![](https://static001.geekbang.org/infoq/40/40dd0331c7c80871ec093a7f64148523.png)

#### 2.2 Sort demo

```sql
insert into Iceberg_table select days, province_id from Kafka_tbl order by days, province_id;
```

### 3. The manifest after sorting

![](https://static001.geekbang.org/infoq/8d/8d7793fb4ba6900d4517e352eef03afd.png)

#### Fields:

- file_path: the physical file location.
- partition: the partition the file belongs to.
- lower_bounds: the minimum values of the sort fields in this file; in the figure, the minimums of my days and province_id.
- upper_bounds: the maximum values of the sort fields in this file; in the figure, the maximums of my days and province_id. The partition and the columns' upper/lower bounds determine whether a file_path's file needs to be read at all. After sorting, the column information is also recorded in the metadata; query planning locates files from the manifest, with no need to record this information in the Hive metastore, which relieves the metastore and improves query efficiency.

Using the 0.11 sort feature, take day as the partition and sort by day, hour, and minute; the manifest files then record this sort order, which speeds up retrieval. We keep the retrieval benefits of Hive partitioning while avoiding the pressure of excessive Hive metastore metadata.
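To show how a reader benefits from these sorted bounds, here is a minimal scan-planning sketch: Iceberg converts the predicate to partition predicates first, then checks each data file's column bounds and skips files whose [lower, upper] range cannot match. It assumes the column names from the demo above; `loadTable()` is a hypothetical helper.

```java
// Scan planning with predicate pushdown against the manifest bounds.
import org.apache.iceberg.FileScanTask;
import org.apache.iceberg.Table;
import org.apache.iceberg.TableScan;
import org.apache.iceberg.expressions.Expressions;
import org.apache.iceberg.io.CloseableIterable;

public class PruningDemo {
  public static void main(String[] args) throws Exception {
    Table table = loadTable();  // hypothetical helper

    TableScan scan = table.newScan()
        .filter(Expressions.equal("days", "2021-02-01"))
        .filter(Expressions.greaterThanOrEqual("province_id", 20));

    // Only files whose partition and column bounds can satisfy the predicates
    // show up here; everything else is pruned without being opened.
    try (CloseableIterable<FileScanTask> tasks = scan.planFiles()) {
      for (FileScanTask task : tasks) {
        System.out.println("will read: " + task.file().path());
      }
    }
  }

  private static Table loadTable() {
    throw new UnsupportedOperationException("load the Iceberg table here");
  }
}
```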
---

## Summary

Compared with earlier versions, Iceberg 0.11 adds many practical features. Against the older version we had been using, the summary is:

### 1. Flink + Iceberg sorting

Before Iceberg 0.11, sorting was integrated with Spark but not with Flink; at the time we migrated a batch of Hive tables with Spark + Iceberg 0.10. The benefit on the BI side: BI had originally built multi-level partitions to speed up Hive queries, which led to too many small files and too much metadata. During ingestion into the lake we used Spark to sort on the conditions BI queries most often, combined with hidden partitioning. That improved BI retrieval speed while eliminating the small-file problem, and since Iceberg carries its own metadata, it also reduced the pressure on the Hive metastore. Iceberg 0.11's support for Flink sorting is a very practical feature: we can move the old Flink + Hive partitioning into Iceberg sorting, achieving the effect of Hive partitions while reducing small files and improving query efficiency.

### 2. Real-time reads

Data can be read in real time simply by writing SQL. The benefit: data with relaxed latency requirements, say, business that can accept a 1-10 minute delay, can be put into Iceberg, which relieves Kafka while enabling near-real-time reads and also retaining the historical data.

### 3. Real-time small-file compaction

Before Iceberg 0.11, small-file compaction had to be maintained through Iceberg's compaction API, which takes the table information and scheduling information, and the merging ran batch by batch, not in real time. In code terms, that added maintenance and development cost; in freshness terms, it was not real-time. 0.11 merges data in real time at the source via hashing: just set the ('write.distribution-mode'='hash') property when creating the table in SQL, with no manual maintenance needed.