如何構建智能湖倉架構?亞馬遜工程師的代碼實踐

{"type":"doc","content":[{"type":"heading","attrs":{"align":null,"level":1}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"數據倉庫的數據體系嚴格、治理容易,業務規模越大,ROI 越高;數據湖的數據種類豐富,治理困難,業務規模越大,ROI 越低,但勝在靈活。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"現在,魚和熊掌我都想要,應該怎麼辦?"},{"type":"link","attrs":{"href":"https:\/\/www.infoq.cn\/article\/rKSdSdeEQhgNJVk4sxsz","title":"xxx","type":null},"content":[{"type":"text","text":"湖倉一體架構"}]},{"type":"text","text":"就在這種情況下,快速在產業內普及。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"要構建湖倉一體架構並不容易,需要解決非常多的數據問題。比如,計算層、存儲層、異構集羣層都要打通,對元數據要進行統一的管理和治理。對於很多業內技術團隊而言,已經是個比較大的挑戰。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"可即便如此,在亞馬遜雲科技技術專家潘超看來,也未必最能貼合企業級大數據處理的最新理念。在 11 月 18 日晚上 20:00 的直播中,潘超詳細分享了亞馬遜雲科技眼中的智能湖倉架構,以及以流式數據接入爲主的最佳實踐。"}]},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"現代化數據平臺架構的關鍵指標"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"傳統湖倉一體架構的不足之處是,着重解決點的問題,也就是“湖”和“倉”的打通,而忽視了面的問題:數據在整個數據平臺的自由流轉。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"潘超認爲,"},{"type":"link","attrs":{"href":"https:\/\/www.infoq.cn\/article\/Jk5dZg81oMD5RU1izh0k","title":"xxx","type":null},"content":[{"type":"text","text":"現代數據平臺架構"}]},{"type":"text","text":"應該具有幾個關鍵特徵:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"numberedlist","attrs":{"start":1,"normalizeStart":1},"content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":1,"align":null,"origin":null},"content":[{"type":"text","text":"以任何規模來存儲數據;"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":2,"align":null,"origin":null},"content":[{"type":"text","text":"在整套架構涉及的所有產品體系中,獲得最佳性價比;"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":3,"align":null,"origin":null},"content":[{"type":"text","text":"實現無縫的數據訪問,實現數據的自由流動;"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":4,"align":null,"origin":null},"content":[{"type":"text","text":"實現數據的統一治理;"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":5,"align":null,"origin":null},"content":[{"type":"text","text":"用 AI\/ML 解決業務難題;"}]}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/wechat\/images\/d5\/d5f6f46f7634a357ac61c46cc748a1af.png","alt":null,"title":null,"style":null,"href":null,"fromPaste":false,"pastePass":false}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"在構建企業級現代數據平臺架構時,這五個關鍵特徵,實質上覆蓋了三方視角 ——"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"對於架構師而言,第一點和第二點值得引起注意。前者是遷移上雲的一大核心訴求,後者是架構評審一定會過問的核心事項;"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"對於開發者而言,第三點和第四點尤爲重要,對元數據的管理最重要實現的是數據在整個系統內的自由流動和訪問,而不僅僅是打通數據湖和數據倉庫;"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"對於產品經理而言,第五點點明瞭當下大數據平臺的價值導向,即數據的收集和治理,應以解決業務問題爲目標。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"爲了方便理解,也方便通過 Demo 演示,潘超將這套架構體系,同等替換爲了亞馬遜雲科技現有產品體系,包括:Amazon Athena、Amazon Aurora 、Amazon MSK、Amazon EMR 等,而流式數據入湖,重點涉及 Amazon MSK、Amazon EMR,以及另一個核心服務:Apache Hudi。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/wechat\/images\/37\/37c5f628d7f440ff2f7c1fde52a2041c.png","alt":null,"title":null,"style":null,"href":null,"fromPaste":false,"pastePass":false}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"Amazon MSK 的擴展能力與最佳實踐"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Amazon MSK 是亞馬遜託管的高可用、強安全的 Kafka 服務,是數據分析領域,負責消息傳遞的基礎,也因此在流式數據入湖部分舉足輕重。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"之所以以 Amazon MSK 舉例,而不是修改 Kafka 代碼直接構建這套系統,是爲了最大程度將開發者的注意力聚焦於流式應用本身,而不是管理和維護基礎設施。況且,一旦你決定從頭構建 PaaS 層基礎設施,涉及到的工作就不僅僅是拉起一套 Kafka 集羣了。一張圖可以很形象地反映這個問題:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/wechat\/images\/5b\/5b921eea4fbc7b23e3de8da326ab204b.png","alt":null,"title":null,"style":null,"href":null,"fromPaste":false,"pastePass":false}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"這張圖從左至右,依次爲不使用任何雲服務的工作列表,使用 EC2 的工作列表,以及使用 MSK 的工作列表,工作量和 ROI 高下立現。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"而對於 MSK 來說,擴展能力是其重要特性。MSK 可以自動擴容,也可以手動 API 擴容。但如果對自己的“動手能力”沒有充足的信心,建議選擇自動擴容。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Amazon MSK 的自動擴容可以根據存儲利用率來設定閾值,建議設定 50%-60%。自動擴容每次擴展 Max(10GB,10%* 集羣存儲空間),同時自動擴展每次有"},{"type":"codeinline","content":[{"type":"text","text":"6 個小時"}]},{"type":"text","text":"的冷卻時間。一次如果一次需要擴容更大的容量,可以使用手動擴容。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"這種擴容既包括橫向擴容 —— 通過 API 或者控制檯向集羣添加新的 Brokers,期間不會影響集羣的可用性,也包括縱向擴容 —— 調整集羣 Broker 節點的 EC2 實例類型。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"但無論是自動還是手動,是橫向還是縱向,前提都是你已經做好了磁盤監控,可以使用 CloudWatch 雲監控集成的監控服務,也可以在 MSK 裏勾選其他的監控服務 (Prometheus),最終監控結果都能可視化顯示。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"需要注意的是,MSK 集羣增加 Broker,每個舊 Topic 的分區如果想重分配,需要手動執行。重分配的時候,會帶來額外的帶寬,有可能會影響業務,所以可以通過一些參數控制 Broker 間流量帶寬,防止過程當中對業務造成太大的影響。當然像 Cruise 一樣的開源工具,也可以多多用起來。Cruise 是做大規模集羣的管理的 MSK 工具,它可以幫你做 Broker 間負載的 Re-balance 。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"關於 MSK 集羣的高可用,有三點需要注意:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"numberedlist","attrs":{"start":1,"normalizeStart":1},"content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":1,"align":null,"origin":null},"content":[{"type":"text","text":"對於兩 AZ 部署的集羣,副本因子至少保證爲 3。如果只有 1,那麼當集羣滾動升級的時候,就不能對外提供服務了;"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":2,"align":null,"origin":null},"content":[{"type":"text","text":"最小的 ISR(in-sync replicas)最多設置爲 RF - 1,不然也會影響集羣的滾動升級;"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":3,"align":null,"origin":null},"content":[{"type":"text","text":"當客戶端連接 Broker 節點時,雖然配置一個 Broker 節點的連接地址就可以,但還是建議配置多個。MSK 故障節點自動替換以及在滾動升級的過程中,如果客戶端只配備了一個 Broker 節點,可能會鏈接超時。如果配置了多個,還可以重試連接。"}]}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"在 CPU 層面,CloudWatch 裏有兩個關於 MSK 的指標值得注意,一個是 CpuSystem,另一個是 CpuUser,推薦保持在 60% 以下,這樣在 MSK 升級維護時,都有足夠的 CPU 資源可用。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"如果 CPU 利用率過高,觸發報警,則可以通過以下幾種方式來擴展 MSK 集羣:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"numberedlist","attrs":{"start":1,"normalizeStart":1},"content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":1,"align":null,"origin":null},"content":[{"type":"text","text":"垂直擴展,通過滾動升級進行替換。每個 Broker 的替換大概需要 10-15 分鐘的時間。當然,是否替換集羣內所有機器,要根據實際情況做選擇,以免造成資源浪費;"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":2,"align":null,"origin":null},"content":[{"type":"text","text":"橫向拓展,Topic 增加分區數;"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":3,"align":null,"origin":null},"content":[{"type":"text","text":"添加 Broker 到集羣,之前創建的 Topic 進行 reassign Partitions,重分配會消耗集羣資源,當然這是可控的。"}]}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"最後,關於 ACK 參數的設置也值得注意,ACK = 2 意味着在生產者發送消息後,等到所有副本都接收到消息,才返回成功。這雖然保證了消息的可靠性,但吞吐率最低。比如日誌類數據,參考業務具體情況,就可以酌情設置 ACK = 1,容忍數據丟失的可能,但大幅提高了吞吐率。"}]},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"Amazon EMR 存算分離及資源動態擴縮"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Amazon EMR 是託管的 Hadoop 生態,常用的 Hadoop 組件在 EMR 上都會有,但是 EMR 核心特徵有兩點,一是存算分離,二是資源動態擴縮。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/wechat\/images\/b6\/b664de6e556ac1c4bc1d4dbc75e33657.png","alt":null,"title":null,"style":null,"href":null,"fromPaste":false,"pastePass":false}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"在大數據領域,存算分離概念的熱度,不下於流批一體、湖倉一體。以亞馬遜雲科技產品棧爲例,實現存算分離後,數據是在 S3 上存儲,EMR 只是一個計算集羣,是一個無狀態的數據。而數據與元數據都在外部,集羣簡化爲無狀態的計算資源,用的時候打開,不用的時候關閉就可以。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"舉個例子,凌晨 1 點到 5 點,大批 ETL 作業,開啓集羣。其他時間則完全不用開啓集羣。用時開啓,不用關閉,對於上雲企業而言,交服務費就像交電費,格外節省。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"而資源的動態擴縮主要是指根據不同的工作負載,動態擴充節點,按使用量計費。但如果數據是在 HDFS 上做存算分離與動態擴縮,就不太容易操作了,擴縮容如果附帶 DataNote 數據,就會引發數據的 Re-balance,非常影響效率。如果單獨擴展 NodeManager,在雲下的場景,資源不再是彈性的,集羣也一般是預製好的,與雲上有本質區別。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"EMR 有三類節點,第一類是 Master 主節點,部署着 Resource Manager 等服務;Core 核心節點,有 DataNote,NodeManager, 依然可以選用 HDFS;第三類是任務節點,運行着 EMR 的 NodeManager 服務,是一個計算節點。所以,EMR 的擴縮,在於核心節點與任務節點的擴縮,可以根據 YARN 上 Application 的個數、CPU 的利用率等指標配置擴縮策略。也可以使用 EMR 提供 Managed Scaling 策略其內置了智能算法來實現自動擴縮,也是推薦的方式,對開發者而言是無感的。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/wechat\/images\/ff\/ff1b841ef286e5a36b2e577135010e71.png","alt":null,"title":null,"style":null,"href":null,"fromPaste":false,"pastePass":false}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"EMR Flink Hudi 構建數據湖及 CDC 同步方案"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"那麼應該如何利用 MSK 和 EMR 做數據湖的入湖呢?其詳細架構圖如下,分作六步詳解:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/wechat\/images\/79\/792d952c04c9ef4023bdfde550ea0d4b.png","alt":null,"title":null,"style":null,"href":null,"fromPaste":false,"pastePass":false}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"圖中標號 1:日誌數據和業務數據發送⾄MSK(Kafka),通過 Flink(TableAPI) 建立Kafka 表,消費 Kafka 數據,Hive Metastore 存儲 Schema;"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"圖中標號 2:RDS(MySQL) 中的數據通過 Flink CDC(flink-cdc-connector) 直接消費 Binlog 數據,⽆需搭建其他消費 Binlog 的服務 (⽐如 Canal,Debezium)。注意使⽤flink-cdc-connector 的 2.x 版本,⽀持parallel reading"},{"type":"codeinline","content":[{"type":"text","text":","}]},{"type":"text","text":" lock-free and checkpoint feature;"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"圖中標號 3:使用Flink Hudi Connector, 將數據寫⼊Hudi(S3) 表, 對於⽆需 Update 的數據使⽤Insert 模式寫⼊,對於需要 Update 的 數據 (業務數據和 CDC 數據) 使用Upsert 模式寫⼊;"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"圖中標號 4:使用Presto 作爲查詢引擎,對外提供查詢服務。此條數據鏈路的延遲取決於入Hudi 的延遲及 Presto 查詢的延遲,總體在分鐘級別;"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"圖中標號 5:對於需要秒級別延遲的指標,直接在 Flink 引擎中做計算,計算結果輸出到 RDS 或者 KV 數據庫,對外提供 API 查詢服務;"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"圖中標號 6:使用QuickSight 做數據可視化,支持多種數據源接入。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"當然,在具體的實踐過程中,仍需要開發者對數據湖方案有足夠的瞭解,才能切合場景選擇合適的調參配置。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"Q\/A 問答"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"1. 如何從 Apache Kafka 遷移至 Amazon MSK?"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"MSK 託管的是 Apache Kafka,其 API 是完全兼容的,業務應用代碼不需要調整,更換爲 MSK 的鏈接地址即可。如果已有的 Kafka 集羣數據要遷移到 MSK,可以使用 MirrorMaker2 做數據同步,然後切換應用鏈接地址即可。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"參考文檔:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"link","attrs":{"href":"https:\/\/docs.aws.amazon.com\/msk\/latest\/developerguide\/migration.htmlhttps:\/\/d1.awsstatic.com\/whitepapers\/amazon-msk-migration-guide.pdf?did=wp_card&trk=wp_card","title":"","type":null},"content":[{"type":"text","text":"https:\/\/docs.aws.amazon.com\/msk\/latest\/developerguide\/migration.htmlhttps:\/\/d1.awsstatic.com\/whitepapers\/amazon-msk-migration-guide.pdf?did=wp_card&trk=wp_card"}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"2. MSK 支持 schema registry 嗎?"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"MSK 支持 Schema Registry, 不僅支持使用 AWS Glue 作爲 Schema Registry, 還支持第三方的比如 confluent-schema-registry"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"3.MySQL cdc 到 hudi 的延遲如何?"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"總體來講是分鐘級別延遲。和數據量,選擇的 Hudi 表類型,計算資源都有關係。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"4. Amazon EMR 比標準 Apache Spark 快多少?"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Amazon EMR 比標準 Apache Spark 快 3 倍以上。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Amazon EMR 在 Spark3.0 上比開源 Spark 快 1.7 倍,在 TPC-DS 3TB 數據的測試。"}]}]},{"type":"listitem","attrs":{"listStyle":"none"},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"參見:"}]}]},{"type":"listitem","attrs":{"listStyle":"none"},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"link","attrs":{"href":"https:\/\/aws.amazon.com\/cn\/blogs\/big-data\/run-apache-spark-3-0-workloads-1-7-times-faster-with-amazon-emr-runtime-for-apache-spark\/","title":"","type":null},"content":[{"type":"text","text":"https:\/\/aws.amazon.com\/cn\/blogs\/big-data\/run-apache-spark-3-0-workloads-1-7-times-faster-with-amazon-emr-runtime-for-apache-spark\/"}]}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Amazon EMR 在 Spark 2.x 上比開源 Spark 快 2~3 倍以上"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Amazon Presto 比開源的 PrestoDB 快 2.6 倍。"}]}]},{"type":"listitem","attrs":{"listStyle":"none"},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"參見:"}]}]},{"type":"listitem","attrs":{"listStyle":"none"},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"link","attrs":{"href":"https:\/\/aws.amazon.com\/cn\/blogs\/big-data\/amazon-emr-introduces-emr-runtime-for-prestodb-which-provides-a-2-6-times-speedup\/","title":"","type":null},"content":[{"type":"text","text":"https:\/\/aws.amazon.com\/cn\/blogs\/big-data\/amazon-emr-introduces-emr-runtime-for-prestodb-which-provides-a-2-6-times-speedup\/"}]}]}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"5. 智能湖倉和湖倉一體的區別是什麼?"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"這在本次分享中的現代化數據平臺建設和 Amazon 的智能湖倉架構圖中都有所體現,Amazon 的智能湖倉架構靈活擴展,安全可靠 ; 專門構建,極致性能 ; 數據融合,統一治理 ; 敏捷分析,深度智能 ; 擁抱開源,開發共贏。湖倉一體只是開始,智能湖倉纔是終極。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"horizontalrule"},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"re:Invent 正在拉斯維加斯同步舉辦,今年是 re:Invent 的第十年,今天 Keynote 上 Adam 爲我們提出的從未設想過的“數據之旅”,從數據湖擴展與安全出發,發佈了 Row and cell-level security for Lake Formation,實現了海量數據的精細化管理。同時,順應雲計算時代無服務器化的技術發展趨勢,同時發佈了四款當前雲上專門數據分析工具的無服務器版本。成爲了行業內全棧式無服務器數據分析的先行者。大數據分析的易用性再次提升,體驗在雲上操作數據舉重若輕的感覺。相關發佈內容可以註冊觀看大會直播。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/wechat\/images\/f8\/f827de795aa283d5840540caf2631210.png","alt":null,"title":null,"style":null,"href":null,"fromPaste":false,"pastePass":false}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"5 附錄:操作代碼實施"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"1. 創建 EMR 集羣"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"codeblock","attrs":{"lang":"makefile"},"content":[{"type":"text","text":"log_uri=\"s3:\/\/*****\/emr\/log\/\"\nkey_name=\"****\"\njdbc=\"jdbc:mysql:\\\/\\\/*****.ap-southeast-1.rds.amazonaws.com:3306\\\/hive_metadata_01?\ncreateDatabaseIfNotExist=true\"\ncluster_name=\"tech-talk-001\"\n\naws emr create-cluster \\\n--termination-protected \\\n--region ap-southeast-1 \\\n--applications Name=Hadoop Name=Hive Name=Flink Name=Tez Name=Spark\nName=JupyterEnterpriseGateway Name=Presto Name=HCatalog \\\n--scale-down-behavior TERMINATE_AT_TASK_COMPLETION \\\n--release-label emr-6.4.0 \\\n--ebs-root-volume-size 50 \\\n--service-role EMR_DefaultRole \\\n--enable-debugging \\\n--instance-groups InstanceGroupType=MASTER,InstanceCount=1,InstanceType=m5.xlarge\nInstanceGroupType=CORE,InstanceCount=2,InstanceType=m5.xlarge \\\n--managed-scaling-policy\nComputeLimits='{MinimumCapacityUnits=2,MaximumCapacityUnits=5,MaximumOnDemandCapacityUnits=2,Ma\nximumCoreCapacityUnits=2,UnitType=Instances}' \\\n--name \"${cluster_name}\" \\\n--log-uri \"${log_uri}\" \\\n--ec2-attributes '{\"KeyName\":\"'${key_name}'\",\"SubnetId\":\"subnet-\n0f79e4471cfa74ced\",\"InstanceProfile\":\"EMR_EC2_DefaultRole\"}' \\\n--configurations '[{\"Classification\": \"hive-site\",\"Properties\":\n{\"javax.jdo.option.ConnectionURL\": \"'${jdbc}'\",\"javax.jdo.option.ConnectionDriverName\":\n\"org.mariadb.jdbc.Driver\",\"javax.jdo.option.ConnectionUserName\":\n\"admin\",\"javax.jdo.option.ConnectionPassword\": \"xxxxxx\"}}]'\n"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"2. 創建 MSK 集羣"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"codeblock","attrs":{"lang":"ruby"},"content":[{"type":"text","text":"# MSK集羣創建可以通過CLI, 也可以通過Console創建\n# 下載kafka,創建topic寫⼊數據\nwget https:\/\/dlcdn.apache.org\/kafka\/2.6.2\/kafka_2.12-2.6.2.tgz\n# msk zk地址,broker 地址\nzk_servers=*****.c3.kafka.ap-southeast-1.amazonaws.com:2181\nbootstrap_server=******.5ybaio.c3.kafka.ap-southeast-1.amazonaws.com:9092\ntopic=tech-talk-001\n# 創建tech-talk-001 topic\n.\/bin\/kafka-topics.sh --create --zookeeper ${zk_servers} --replication-factor 2 --partitions 4\n--topic ${topic}\n# 寫⼊消息\n.\/bin\/kafka-console-producer.sh --bootstrap-server ${bootstrap_server} --topic ${topic}\n{\"id\":\"1\",\"name\":\"customer\"}\n{\"id\":\"2\",\"name\":\"aws\"}\n# 消費消息\n.\/bin\/kafka-console-consumer.sh --bootstrap-server ${bootstrap_server} --topic ${topic}\n"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"3.EMR 啓動 Flink"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"codeblock","attrs":{"lang":"properties"},"content":[{"type":"text","text":"# 啓動flink on yarn session cluster\n# 下載kafka connector\nsudo wget -P \/usr\/lib\/flink\/lib\/ https:\/\/repo1.maven.org\/maven2\/org\/apache\/flink\/flink-sql?\nconnector-kafka_2.12\/1.13.1\/flink-sql-connector-kafka_2.12-1.13.1.jar && sudo chown flink:flink\n\/usr\/lib\/flink\/lib\/flink-sql-connector-kafka_2.12-1.13.1.jar\n# hudi-flink-bundle 0.10.0\nsudo wget -P \/usr\/lib\/flink\/lib\/ https:\/\/dxs9dnjebzm6y.cloudfront.net\/tmp\/hudi-flink?\nbundle_2.12-0.10.0-SNAPSHOT.jar && sudo chown flink:flink \/usr\/lib\/flink\/lib\/hudi-flink?\nbundle_2.12-0.10.0-SNAPSHOT.jar\n# 下載 cdc connector\nsudo wget -P \/usr\/lib\/flink\/lib\/ https:\/\/repo1.maven.org\/maven2\/com\/ververica\/flink-sql?\nconnector-mysql-cdc\/2.0.0\/flink-sql-connector-mysql-cdc-2.0.0.jar && sudo chown flink:flink\n\/usr\/lib\/flink\/lib\/flink-sql-connector-mysql-cdc-2.0.0.jar\n# flink session\nflink-yarn-session -jm 1024 -tm 4096 -s 2 \\\n-D state.checkpoints.dir=s3:\/\/*****\/flink\/checkpoints \\\n-D state.backend=rocksdb \\\n-D state.checkpoint-storage=filesystem \\\n-D execution.checkpointing.interval=60000 \\\n-D state.checkpoints.num-retained=5 \\\n-D execution.checkpointing.mode=EXACTLY_ONCE \\\n-D execution.checkpointing.externalized-checkpoint-retention=RETAIN_ON_CANCELLATION \\\n-D state.backend.incremental=true \\\n-D execution.checkpointing.max-concurrent-checkpoints=1 \\\n-D rest.flamegraph.enabled=true \\\n-d\n"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"4.Flink SQL 客戶端"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"codeblock","attrs":{"lang":"bash"},"content":[{"type":"text","text":"# 這是使⽤flink sql client寫SQL提交作業\n# 啓動client\n\/usr\/lib\/flink\/bin\/sql-client.sh -s application_*****\n# result-mode\nset sql-client.execution.result-mode=tableau;\n# set default parallesim\nset 'parallelism.default' = '1';\n"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"5. 消費 Kafka 寫⼊Hudi"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"codeblock","attrs":{"lang":"sql"},"content":[{"type":"text","text":"# 創建kafka表\nCREATE TABLE kafka_tb_001 (\nid string,\nname string,\n`ts` TIMESTAMP(3) METADATA FROM 'timestamp'\n) WITH (\n'connector' = 'kafka',\n'topic' = 'tech-talk-001',\n'properties.bootstrap.servers' = '****:9092',\n'properties.group.id' = 'test-group-001',\n'scan.startup.mode' = 'latest-offset',\n'format' = 'json',\n'json.ignore-parse-errors' = 'true',\n'json.fail-on-missing-field' = 'false',\n'sink.parallelism' = '2'\n);\n# 創建flink hudi表\nCREATE TABLE flink_hudi_tb_106(\nuuid string,\nname string,\nts TIMESTAMP(3),\nlogday VARCHAR(255),\nhh VARCHAR(255)\n)PARTITIONED BY (`logday`,`hh`)\nWITH (\n'connector' = 'hudi',\n'path' = 's3:\/\/*****\/teck-talk\/flink_hudi_tb_106\/',\n'table.type' = 'COPY_ON_WRITE',\n'write.precombine.field' = 'ts',\n'write.operation' = 'upsert',\n'hoodie.datasource.write.recordkey.field' = 'uuid',\n'hive_sync.enable' = 'true',\n'hive_sync.metastore.uris' = 'thrift:\/\/******:9083',\n'hive_sync.table' = 'flink_hudi_tb_106',\n'hive_sync.mode' = 'HMS',\n'hive_sync.username' = 'hadoop',\n'hive_sync.partition_fields' = 'logday,hh',\n'hive_sync.partition_extractor_class' = 'org.apache.hudi.hive.MultiPartKeysValueExtractor'\n);\n# 插⼊數據\ninsert into flink_hudi_tb_106 select id as uuid,name,ts,DATE_FORMAT(CURRENT_TIMESTAMP, 'yyyy?\nMM-dd') as logday, DATE_FORMAT(CURRENT_TIMESTAMP, 'hh') as hh from kafka_tb_001;\n# 除了在創建表是指定同步數據的⽅式,也可以通過cli同步hudi表元數據到hive,但要注意分區格式\n.\/run_sync_tool.sh --jdbc-url jdbc:hive2:\\\/\\\/*****:10000 --user hadop --pass hadoop --\npartitioned-by logday --base-path s3:\/\/****\/ --database default --table *****\n# presto 查詢數據\npresto-cli --server *****:8889 --catalog hive --schema default\n"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"6.mysql cdc 同步到 hudi"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"codeblock","attrs":{"lang":"sql"},"content":[{"type":"text","text":"# 創建mysql CDC表\nCREATE TABLE mysql_cdc_002 (\nid INT NOT NULL,\nname STRING,\ncreate_time TIMESTAMP(3),\nmodify_time TIMESTAMP(3),\nPRIMARY KEY(id) NOT ENFORCED\n) WITH (\n'connector' = 'mysql-cdc',\n'hostname' = '*******',\n'port' = '3306',\n'username' = 'admin',\n'password' = '*****',\n'database-name' = 'cdc_test_db',\n'table-name' = 'test_tb_01',\n'scan.startup.mode' = 'initial'\n);\n# 創建hudi表\nCREATE TABLE hudi_cdc_002 (\nid INT ,\nname STRING,\ncreate_time TIMESTAMP(3),\nmodify_time TIMESTAMP(3)\n) WITH (\n'connector' = 'hudi',\n'path' = 's3:\/\/******\/hudi_cdc_002\/',\n'table.type' = 'COPY_ON_WRITE',\n'write.precombine.field' = 'modify_time',\n'hoodie.datasource.write.recordkey.field' = 'id',\n'write.operation' = 'upsert',\n'write.tasks' = '2',\n'hive_sync.enable' = 'true',\n'hive_sync.metastore.uris' = 'thrift:\/\/*******:9083',\n'hive_sync.table' = 'hudi_cdc_002',\n'hive_sync.db' = 'default',\n'hive_sync.mode' = 'HMS',\n'hive_sync.username' = 'hadoop'\n);\n# 寫⼊數據\ninsert into hudi_cdc_002 select * from mysql_cdc_002;\n"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"7. sysbench"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"codeblock","attrs":{"lang":"properties"},"content":[{"type":"text","text":"# sysbench 寫⼊mysql數據\n# 下載sysbench\ncurl -s https:\/\/packagecloud.io\/install\/repositories\/akopytov\/sysbench\/script.rpm.sh | sudo\nbash\nsudo yum -y install sysbench\n# 注意當前使用的“lua”並未提供構建,請根據自身情況定義,上述⽤到表結構如下\nCREATE TABLE if not exists `test_tb_01` (\n`id` int NOT NULL AUTO_INCREMENT,\n`name` varchar(155) DEFAULT NULL,\n`create_time` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP,\n`modify_time` timestamp NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE\nCURRENT_TIMESTAMP,\nPRIMARY KEY (`id`)\n) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4;\n# 創建表\nsysbench creates.lua --mysql-user=admin --mysql-password=admin123456 --mysql?\nhost=****.rds.amazonaws.com --mysql-db=cdc_test_db --report-interval=1 --events=1 run\n# 插⼊數據\nsysbench insert.lua --mysql-user=admin --mysql-password=admin123456 --mysql?\nhost=****.rds.amazonaws.com --mysql-db=cdc_test_db --report-interval=1 --events=500 --\ntime=0 --threads=1 --skip_trx=true run\n# 更新數據\nsysbench update.lua --mysql-user=admin --mysql-password=admin123456 --mysql?\nhost=****.rds.amazonaws.com --mysql-db=cdc_test_db --report-interval=1 --events=1000 --\ntime=0 --threads=10 --skip_trx=true --update_id_min=3 --update_id_max=500 run\n# 刪除表\nsysbench drop.lua --mysql-user=admin --mysql-password=admin123456 --mysql?\nhost=****.rds.amazonaws.com --mysql-db=cdc_test_db --report-interval=1 --events=1 run\n"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}}]}
發表評論
所有評論
還沒有人評論,想成為第一個評論的人麼? 請在上方評論欄輸入並且點擊發布.
相關文章