如何構建智能湖倉架構？亞馬遜工程師的代碼實踐

原創

2021-12-01 18:44

{"type":"doc","content":[{"type":"heading","attrs":{"align":null,"level":1}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"數據倉庫的數據體系嚴格、治理容易，業務規模越大，ROI 越高；數據湖的數據種類豐富，治理困難，業務規模越大，ROI 越低，但勝在靈活。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"現在，魚和熊掌我都想要，應該怎麼辦？"},{"type":"link","attrs":{"href":"https:\/\/www.infoq.cn\/article\/rKSdSdeEQhgNJVk4sxsz","title":"xxx","type":null},"content":[{"type":"text","text":"湖倉一體架構"}]},{"type":"text","text":"就在這種情況下，快速在產業內普及。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"要構建湖倉一體架構並不容易，需要解決非常多的數據問題。比如，計算層、存儲層、異構集羣層都要打通，對元數據要進行統一的管理和治理。對於很多業內技術團隊而言，已經是個比較大的挑戰。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"可即便如此，在亞馬遜雲科技技術專家潘超看來，也未必最能貼合企業級大數據處理的最新理念。在 11 月 18 日晚上 20：00 的直播中，潘超詳細分享了亞馬遜雲科技眼中的智能湖倉架構，以及以流式數據接入爲主的最佳實踐。"}]},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"現代化數據平臺架構的關鍵指標"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"傳統湖倉一體架構的不足之處是，着重解決點的問題，也就是“湖”和“倉”的打通，而忽視了面的問題：數據在整個數據平臺的自由流轉。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"潘超認爲，"},{"type":"link","attrs":{"href":"https:\/\/www.infoq.cn\/article\/Jk5dZg81oMD5RU1izh0k","title":"xxx","type":null},"content":[{"type":"text","text":"現代數據平臺架構"}]},{"type":"text","text":"應該具有幾個關鍵特徵："}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"numberedlist","attrs":{"start":1,"normalizeStart":1},"content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":1,"align":null,"origin":null},"content":[{"type":"text","text":"以任何規模來存儲數據；"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":2,"align":null,"origin":null},"content":[{"type":"text","text":"在整套架構涉及的所有產品體系中，獲得最佳性價比；"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":3,"align":null,"origin":null},"content":[{"type":"text","text":"實現無縫的數據訪問，實現數據的自由流動；"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":4,"align":null,"origin":null},"content":[{"type":"text","text":"實現數據的統一治理；"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":5,"align":null,"origin":null},"content":[{"type":"text","text":"用 AI\/ML 解決業務難題；"}]}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/wechat\/images\/d5\/d5f6f46f7634a357ac61c46cc748a1af.png","alt":null,"title":null,"style":null,"href":null,"fromPaste":false,"pastePass":false}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"在構建企業級現代數據平臺架構時，這五個關鍵特徵，實質上覆蓋了三方視角 ——"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"對於架構師而言，第一點和第二點值得引起注意。前者是遷移上雲的一大核心訴求，後者是架構評審一定會過問的核心事項；"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"對於開發者而言，第三點和第四點尤爲重要，對元數據的管理最重要實現的是數據在整個系統內的自由流動和訪問，而不僅僅是打通數據湖和數據倉庫；"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"對於產品經理而言，第五點點明瞭當下大數據平臺的價值導向，即數據的收集和治理，應以解決業務問題爲目標。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"爲了方便理解，也方便通過 Demo 演示，潘超將這套架構體系，同等替換爲了亞馬遜雲科技現有產品體系，包括：Amazon Athena、Amazon Aurora 、Amazon MSK、Amazon EMR 等，而流式數據入湖，重點涉及 Amazon MSK、Amazon EMR，以及另一個核心服務：Apache Hudi。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/wechat\/images\/37\/37c5f628d7f440ff2f7c1fde52a2041c.png","alt":null,"title":null,"style":null,"href":null,"fromPaste":false,"pastePass":false}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"Amazon MSK 的擴展能力與最佳實踐"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Amazon MSK 是亞馬遜託管的高可用、強安全的 Kafka 服務，是數據分析領域，負責消息傳遞的基礎，也因此在流式數據入湖部分舉足輕重。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"之所以以 Amazon MSK 舉例，而不是修改 Kafka 代碼直接構建這套系統，是爲了最大程度將開發者的注意力聚焦於流式應用本身，而不是管理和維護基礎設施。況且，一旦你決定從頭構建 PaaS 層基礎設施，涉及到的工作就不僅僅是拉起一套 Kafka 集羣了。一張圖可以很形象地反映這個問題："}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/wechat\/images\/5b\/5b921eea4fbc7b23e3de8da326ab204b.png","alt":null,"title":null,"style":null,"href":null,"fromPaste":false,"pastePass":false}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"這張圖從左至右，依次爲不使用任何雲服務的工作列表，使用 EC2 的工作列表，以及使用 MSK 的工作列表，工作量和 ROI 高下立現。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"而對於 MSK 來說，擴展能力是其重要特性。MSK 可以自動擴容，也可以手動 API 擴容。但如果對自己的“動手能力”沒有充足的信心，建議選擇自動擴容。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Amazon MSK 的自動擴容可以根據存儲利用率來設定閾值，建議設定 50%-60%。自動擴容每次擴展 Max(10GB,10%* 集羣存儲空間)，同時自動擴展每次有"},{"type":"codeinline","content":[{"type":"text","text":"6 個小時"}]},{"type":"text","text":"的冷卻時間。一次如果一次需要擴容更大的容量，可以使用手動擴容。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"這種擴容既包括橫向擴容 —— 通過 API 或者控制檯向集羣添加新的 Brokers，期間不會影響集羣的可用性，也包括縱向擴容 —— 調整集羣 Broker 節點的 EC2 實例類型。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"但無論是自動還是手動，是橫向還是縱向，前提都是你已經做好了磁盤監控，可以使用 CloudWatch 雲監控集成的監控服務，也可以在 MSK 裏勾選其他的監控服務 (Prometheus)，最終監控結果都能可視化顯示。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"需要注意的是，MSK 集羣增加 Broker，每個舊 Topic 的分區如果想重分配，需要手動執行。重分配的時候，會帶來額外的帶寬，有可能會影響業務，所以可以通過一些參數控制 Broker 間流量帶寬，防止過程當中對業務造成太大的影響。當然像 Cruise 一樣的開源工具，也可以多多用起來。Cruise 是做大規模集羣的管理的 MSK 工具，它可以幫你做 Broker 間負載的 Re-balance 。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"關於 MSK 集羣的高可用，有三點需要注意："}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"numberedlist","attrs":{"start":1,"normalizeStart":1},"content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":1,"align":null,"origin":null},"content":[{"type":"text","text":"對於兩 AZ 部署的集羣，副本因子至少保證爲 3。如果只有 1，那麼當集羣滾動升級的時候，就不能對外提供服務了；"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":2,"align":null,"origin":null},"content":[{"type":"text","text":"最小的 ISR（in-sync replicas）最多設置爲 RF - 1，不然也會影響集羣的滾動升級；"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":3,"align":null,"origin":null},"content":[{"type":"text","text":"當客戶端連接 Broker 節點時，雖然配置一個 Broker 節點的連接地址就可以，但還是建議配置多個。MSK 故障節點自動替換以及在滾動升級的過程中，如果客戶端只配備了一個 Broker 節點，可能會鏈接超時。如果配置了多個，還可以重試連接。"}]}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"在 CPU 層面，CloudWatch 裏有兩個關於 MSK 的指標值得注意，一個是 CpuSystem，另一個是 CpuUser，推薦保持在 60% 以下，這樣在 MSK 升級維護時，都有足夠的 CPU 資源可用。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"如果 CPU 利用率過高，觸發報警，則可以通過以下幾種方式來擴展 MSK 集羣："}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"numberedlist","attrs":{"start":1,"normalizeStart":1},"content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":1,"align":null,"origin":null},"content":[{"type":"text","text":"垂直擴展，通過滾動升級進行替換。每個 Broker 的替換大概需要 10-15 分鐘的時間。當然，是否替換集羣內所有機器，要根據實際情況做選擇，以免造成資源浪費；"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":2,"align":null,"origin":null},"content":[{"type":"text","text":"橫向拓展，Topic 增加分區數；"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":3,"align":null,"origin":null},"content":[{"type":"text","text":"添加 Broker 到集羣，之前創建的 Topic 進行 reassign Partitions，重分配會消耗集羣資源，當然這是可控的。"}]}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"最後，關於 ACK 參數的設置也值得注意，ACK = 2 意味着在生產者發送消息後，等到所有副本都接收到消息，才返回成功。這雖然保證了消息的可靠性，但吞吐率最低。比如日誌類數據，參考業務具體情況，就可以酌情設置 ACK = 1，容忍數據丟失的可能，但大幅提高了吞吐率。"}]},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"Amazon EMR 存算分離及資源動態擴縮"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Amazon EMR 是託管的 Hadoop 生態，常用的 Hadoop 組件在 EMR 上都會有，但是 EMR 核心特徵有兩點，一是存算分離，二是資源動態擴縮。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/wechat\/images\/b6\/b664de6e556ac1c4bc1d4dbc75e33657.png","alt":null,"title":null,"style":null,"href":null,"fromPaste":false,"pastePass":false}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"在大數據領域，存算分離概念的熱度，不下於流批一體、湖倉一體。以亞馬遜雲科技產品棧爲例，實現存算分離後，數據是在 S3 上存儲，EMR 只是一個計算集羣，是一個無狀態的數據。而數據與元數據都在外部，集羣簡化爲無狀態的計算資源，用的時候打開，不用的時候關閉就可以。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"舉個例子，凌晨 1 點到 5 點，大批 ETL 作業，開啓集羣。其他時間則完全不用開啓集羣。用時開啓，不用關閉，對於上雲企業而言，交服務費就像交電費，格外節省。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"而資源的動態擴縮主要是指根據不同的工作負載，動態擴充節點，按使用量計費。但如果數據是在 HDFS 上做存算分離與動態擴縮，就不太容易操作了，擴縮容如果附帶 DataNote 數據，就會引發數據的 Re-balance，非常影響效率。如果單獨擴展 NodeManager，在雲下的場景，資源不再是彈性的，集羣也一般是預製好的，與雲上有本質區別。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"EMR 有三類節點，第一類是 Master 主節點，部署着 Resource Manager 等服務；Core 核心節點，有 DataNote，NodeManager, 依然可以選用 HDFS；第三類是任務節點，運行着 EMR 的 NodeManager 服務，是一個計算節點。所以，EMR 的擴縮，在於核心節點與任務節點的擴縮，可以根據 YARN 上 Application 的個數、CPU 的利用率等指標配置擴縮策略。也可以使用 EMR 提供 Managed Scaling 策略其內置了智能算法來實現自動擴縮，也是推薦的方式，對開發者而言是無感的。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/wechat\/images\/ff\/ff1b841ef286e5a36b2e577135010e71.png","alt":null,"title":null,"style":null,"href":null,"fromPaste":false,"pastePass":false}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"EMR Flink Hudi 構建數據湖及 CDC 同步方案"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"那麼應該如何利用 MSK 和 EMR 做數據湖的入湖呢？其詳細架構圖如下，分作六步詳解："}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/wechat\/images\/79\/792d952c04c9ef4023bdfde550ea0d4b.png","alt":null,"title":null,"style":null,"href":null,"fromPaste":false,"pastePass":false}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"圖中標號 1：日誌數據和業務數據發送⾄MSK(Kafka)，通過 Flink(TableAPI) 建立Kafka 表，消費 Kafka 數據，Hive Metastore 存儲 Schema；"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"圖中標號 2：RDS(MySQL) 中的數據通過 Flink CDC(flink-cdc-connector) 直接消費 Binlog 數據，⽆需搭建其他消費 Binlog 的服務 (⽐如 Canal,Debezium)。注意使⽤flink-cdc-connector 的 2.x 版本，⽀持parallel reading"},{"type":"codeinline","content":[{"type":"text","text":","}]},{"type":"text","text":" lock-free and checkpoint feature；"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"圖中標號 3：使用Flink Hudi Connector, 將數據寫⼊Hudi(S3) 表, 對於⽆需 Update 的數據使⽤Insert 模式寫⼊，對於需要 Update 的數據 (業務數據和 CDC 數據) 使用Upsert 模式寫⼊；"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"圖中標號 4：使用Presto 作爲查詢引擎，對外提供查詢服務。此條數據鏈路的延遲取決於入Hudi 的延遲及 Presto 查詢的延遲，總體在分鐘級別；"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"圖中標號 5：對於需要秒級別延遲的指標，直接在 Flink 引擎中做計算，計算結果輸出到 RDS 或者 KV 數據庫，對外提供 API 查詢服務；"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"圖中標號 6：使用QuickSight 做數據可視化，支持多種數據源接入。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"當然，在具體的實踐過程中，仍需要開發者對數據湖方案有足夠的瞭解，才能切合場景選擇合適的調參配置。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"Q\/A 問答"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"1. 如何從 Apache Kafka 遷移至 Amazon MSK？"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"MSK 託管的是 Apache Kafka，其 API 是完全兼容的，業務應用代碼不需要調整，更換爲 MSK 的鏈接地址即可。如果已有的 Kafka 集羣數據要遷移到 MSK，可以使用 MirrorMaker2 做數據同步，然後切換應用鏈接地址即可。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"參考文檔："}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"link","attrs":{"href":"https:\/\/docs.aws.amazon.com\/msk\/latest\/developerguide\/migration.htmlhttps:\/\/d1.awsstatic.com\/whitepapers\/amazon-msk-migration-guide.pdf?did=wp_card&trk=wp_card","title":"","type":null},"content":[{"type":"text","text":"https:\/\/docs.aws.amazon.com\/msk\/latest\/developerguide\/migration.htmlhttps:\/\/d1.awsstatic.com\/whitepapers\/amazon-msk-migration-guide.pdf?did=wp_card&trk=wp_card"}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"2. MSK 支持 schema registry 嗎？"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"MSK 支持 Schema Registry, 不僅支持使用 AWS Glue 作爲 Schema Registry, 還支持第三方的比如 confluent-schema-registry"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"3.MySQL cdc 到 hudi 的延遲如何？"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"總體來講是分鐘級別延遲。和數據量，選擇的 Hudi 表類型，計算資源都有關係。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"4. Amazon EMR 比標準 Apache Spark 快多少？"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Amazon EMR 比標準 Apache Spark 快 3 倍以上。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Amazon EMR 在 Spark3.0 上比開源 Spark 快 1.7 倍，在 TPC-DS 3TB 數據的測試。"}]}]},{"type":"listitem","attrs":{"listStyle":"none"},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"參見："}]}]},{"type":"listitem","attrs":{"listStyle":"none"},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"link","attrs":{"href":"https:\/\/aws.amazon.com\/cn\/blogs\/big-data\/run-apache-spark-3-0-workloads-1-7-times-faster-with-amazon-emr-runtime-for-apache-spark\/","title":"","type":null},"content":[{"type":"text","text":"https:\/\/aws.amazon.com\/cn\/blogs\/big-data\/run-apache-spark-3-0-workloads-1-7-times-faster-with-amazon-emr-runtime-for-apache-spark\/"}]}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Amazon EMR 在 Spark 2.x 上比開源 Spark 快 2~3 倍以上"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Amazon Presto 比開源的 PrestoDB 快 2.6 倍。"}]}]},{"type":"listitem","attrs":{"listStyle":"none"},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"參見："}]}]},{"type":"listitem","attrs":{"listStyle":"none"},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"link","attrs":{"href":"https:\/\/aws.amazon.com\/cn\/blogs\/big-data\/amazon-emr-introduces-emr-runtime-for-prestodb-which-provides-a-2-6-times-speedup\/","title":"","type":null},"content":[{"type":"text","text":"https:\/\/aws.amazon.com\/cn\/blogs\/big-data\/amazon-emr-introduces-emr-runtime-for-prestodb-which-provides-a-2-6-times-speedup\/"}]}]}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"5. 智能湖倉和湖倉一體的區別是什麼？"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"這在本次分享中的現代化數據平臺建設和 Amazon 的智能湖倉架構圖中都有所體現，Amazon 的智能湖倉架構靈活擴展，安全可靠 ; 專門構建，極致性能 ; 數據融合，統一治理 ; 敏捷分析，深度智能 ; 擁抱開源，開發共贏。湖倉一體只是開始，智能湖倉纔是終極。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"horizontalrule"},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"re:Invent 正在拉斯維加斯同步舉辦，今年是 re:Invent 的第十年，今天 Keynote 上 Adam 爲我們提出的從未設想過的“數據之旅”，從數據湖擴展與安全出發，發佈了 Row and cell-level security for Lake Formation，實現了海量數據的精細化管理。同時，順應雲計算時代無服務器化的技術發展趨勢，同時發佈了四款當前雲上專門數據分析工具的無服務器版本。成爲了行業內全棧式無服務器數據分析的先行者。大數據分析的易用性再次提升，體驗在雲上操作數據舉重若輕的感覺。相關發佈內容可以註冊觀看大會直播。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/wechat\/images\/f8\/f827de795aa283d5840540caf2631210.png","alt":null,"title":null,"style":null,"href":null,"fromPaste":false,"pastePass":false}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"5 附錄：操作代碼實施"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"1. 創建 EMR 集羣"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"codeblock","attrs":{"lang":"makefile"},"content":[{"type":"text","text":"log_uri=\"s3:\/\/*****\/emr\/log\/\"\nkey_name=\"****\"\njdbc=\"jdbc:mysql:\\\/\\\/*****.ap-southeast-1.rds.amazonaws.com:3306\\\/hive_metadata_01?\ncreateDatabaseIfNotExist=true\"\ncluster_name=\"tech-talk-001\"\n\naws emr create-cluster \\\n--termination-protected \\\n--region ap-southeast-1 \\\n--applications Name=Hadoop Name=Hive Name=Flink Name=Tez Name=Spark\nName=JupyterEnterpriseGateway Name=Presto Name=HCatalog \\\n--scale-down-behavior TERMINATE_AT_TASK_COMPLETION \\\n--release-label emr-6.4.0 \\\n--ebs-root-volume-size 50 \\\n--service-role EMR_DefaultRole \\\n--enable-debugging \\\n--instance-groups InstanceGroupType=MASTER,InstanceCount=1,InstanceType=m5.xlarge\nInstanceGroupType=CORE,InstanceCount=2,InstanceType=m5.xlarge \\\n--managed-scaling-policy\nComputeLimits='{MinimumCapacityUnits=2,MaximumCapacityUnits=5,MaximumOnDemandCapacityUnits=2,Ma\nximumCoreCapacityUnits=2,UnitType=Instances}' \\\n--name \"${cluster_name}\" \\\n--log-uri \"${log_uri}\" \\\n--ec2-attributes '{\"KeyName\":\"'${key_name}'\",\"SubnetId\":\"subnet-\n0f79e4471cfa74ced\",\"InstanceProfile\":\"EMR_EC2_DefaultRole\"}' \\\n--configurations '[{\"Classification\": \"hive-site\",\"Properties\":\n{\"javax.jdo.option.ConnectionURL\": \"'${jdbc}'\",\"javax.jdo.option.ConnectionDriverName\":\n\"org.mariadb.jdbc.Driver\",\"javax.jdo.option.ConnectionUserName\":\n\"admin\",\"javax.jdo.option.ConnectionPassword\": \"xxxxxx\"}}]'\n"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"2. 創建 MSK 集羣"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"codeblock","attrs":{"lang":"ruby"},"content":[{"type":"text","text":"# MSK集羣創建可以通過CLI, 也可以通過Console創建\n# 下載kafka,創建topic寫⼊數據\nwget https:\/\/dlcdn.apache.org\/kafka\/2.6.2\/kafka_2.12-2.6.2.tgz\n# msk zk地址，broker 地址\nzk_servers=*****.c3.kafka.ap-southeast-1.amazonaws.com:2181\nbootstrap_server=******.5ybaio.c3.kafka.ap-southeast-1.amazonaws.com:9092\ntopic=tech-talk-001\n# 創建tech-talk-001 topic\n.\/bin\/kafka-topics.sh --create --zookeeper ${zk_servers} --replication-factor 2 --partitions 4\n--topic ${topic}\n# 寫⼊消息\n.\/bin\/kafka-console-producer.sh --bootstrap-server ${bootstrap_server} --topic ${topic}\n{\"id\":\"1\",\"name\":\"customer\"}\n{\"id\":\"2\",\"name\":\"aws\"}\n# 消費消息\n.\/bin\/kafka-console-consumer.sh --bootstrap-server ${bootstrap_server} --topic ${topic}\n"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"3.EMR 啓動 Flink"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"codeblock","attrs":{"lang":"properties"},"content":[{"type":"text","text":"# 啓動flink on yarn session cluster\n# 下載kafka connector\nsudo wget -P \/usr\/lib\/flink\/lib\/ https:\/\/repo1.maven.org\/maven2\/org\/apache\/flink\/flink-sql?\nconnector-kafka_2.12\/1.13.1\/flink-sql-connector-kafka_2.12-1.13.1.jar && sudo chown flink:flink\n\/usr\/lib\/flink\/lib\/flink-sql-connector-kafka_2.12-1.13.1.jar\n# hudi-flink-bundle 0.10.0\nsudo wget -P \/usr\/lib\/flink\/lib\/ https:\/\/dxs9dnjebzm6y.cloudfront.net\/tmp\/hudi-flink?\nbundle_2.12-0.10.0-SNAPSHOT.jar && sudo chown flink:flink \/usr\/lib\/flink\/lib\/hudi-flink?\nbundle_2.12-0.10.0-SNAPSHOT.jar\n# 下載 cdc connector\nsudo wget -P \/usr\/lib\/flink\/lib\/ https:\/\/repo1.maven.org\/maven2\/com\/ververica\/flink-sql?\nconnector-mysql-cdc\/2.0.0\/flink-sql-connector-mysql-cdc-2.0.0.jar && sudo chown flink:flink\n\/usr\/lib\/flink\/lib\/flink-sql-connector-mysql-cdc-2.0.0.jar\n# flink session\nflink-yarn-session -jm 1024 -tm 4096 -s 2 \\\n-D state.checkpoints.dir=s3:\/\/*****\/flink\/checkpoints \\\n-D state.backend=rocksdb \\\n-D state.checkpoint-storage=filesystem \\\n-D execution.checkpointing.interval=60000 \\\n-D state.checkpoints.num-retained=5 \\\n-D execution.checkpointing.mode=EXACTLY_ONCE \\\n-D execution.checkpointing.externalized-checkpoint-retention=RETAIN_ON_CANCELLATION \\\n-D state.backend.incremental=true \\\n-D execution.checkpointing.max-concurrent-checkpoints=1 \\\n-D rest.flamegraph.enabled=true \\\n-d\n"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"4.Flink SQL 客戶端"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"codeblock","attrs":{"lang":"bash"},"content":[{"type":"text","text":"# 這是使⽤flink sql client寫SQL提交作業\n# 啓動client\n\/usr\/lib\/flink\/bin\/sql-client.sh -s application_*****\n# result-mode\nset sql-client.execution.result-mode=tableau;\n# set default parallesim\nset 'parallelism.default' = '1';\n"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"5. 消費 Kafka 寫⼊Hudi"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"codeblock","attrs":{"lang":"sql"},"content":[{"type":"text","text":"# 創建kafka表\nCREATE TABLE kafka_tb_001 (\nid string,\nname string,\n`ts` TIMESTAMP(3) METADATA FROM 'timestamp'\n) WITH (\n'connector' = 'kafka',\n'topic' = 'tech-talk-001',\n'properties.bootstrap.servers' = '****:9092',\n'properties.group.id' = 'test-group-001',\n'scan.startup.mode' = 'latest-offset',\n'format' = 'json',\n'json.ignore-parse-errors' = 'true',\n'json.fail-on-missing-field' = 'false',\n'sink.parallelism' = '2'\n);\n# 創建flink hudi表\nCREATE TABLE flink_hudi_tb_106(\nuuid string,\nname string,\nts TIMESTAMP(3),\nlogday VARCHAR(255),\nhh VARCHAR(255)\n)PARTITIONED BY (`logday`,`hh`)\nWITH (\n'connector' = 'hudi',\n'path' = 's3:\/\/*****\/teck-talk\/flink_hudi_tb_106\/',\n'table.type' = 'COPY_ON_WRITE',\n'write.precombine.field' = 'ts',\n'write.operation' = 'upsert',\n'hoodie.datasource.write.recordkey.field' = 'uuid',\n'hive_sync.enable' = 'true',\n'hive_sync.metastore.uris' = 'thrift:\/\/******:9083',\n'hive_sync.table' = 'flink_hudi_tb_106',\n'hive_sync.mode' = 'HMS',\n'hive_sync.username' = 'hadoop',\n'hive_sync.partition_fields' = 'logday,hh',\n'hive_sync.partition_extractor_class' = 'org.apache.hudi.hive.MultiPartKeysValueExtractor'\n);\n# 插⼊數據\ninsert into flink_hudi_tb_106 select id as uuid,name,ts,DATE_FORMAT(CURRENT_TIMESTAMP, 'yyyy?\nMM-dd') as logday, DATE_FORMAT(CURRENT_TIMESTAMP, 'hh') as hh from kafka_tb_001;\n# 除了在創建表是指定同步數據的⽅式，也可以通過cli同步hudi表元數據到hive,但要注意分區格式\n.\/run_sync_tool.sh --jdbc-url jdbc:hive2:\\\/\\\/*****:10000 --user hadop --pass hadoop --\npartitioned-by logday --base-path s3:\/\/****\/ --database default --table *****\n# presto 查詢數據\npresto-cli --server *****:8889 --catalog hive --schema default\n"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"6.mysql cdc 同步到 hudi"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"codeblock","attrs":{"lang":"sql"},"content":[{"type":"text","text":"# 創建mysql CDC表\nCREATE TABLE mysql_cdc_002 (\nid INT NOT NULL,\nname STRING,\ncreate_time TIMESTAMP(3),\nmodify_time TIMESTAMP(3),\nPRIMARY KEY(id) NOT ENFORCED\n) WITH (\n'connector' = 'mysql-cdc',\n'hostname' = '*******',\n'port' = '3306',\n'username' = 'admin',\n'password' = '*****',\n'database-name' = 'cdc_test_db',\n'table-name' = 'test_tb_01',\n'scan.startup.mode' = 'initial'\n);\n# 創建hudi表\nCREATE TABLE hudi_cdc_002 (\nid INT ,\nname STRING,\ncreate_time TIMESTAMP(3),\nmodify_time TIMESTAMP(3)\n) WITH (\n'connector' = 'hudi',\n'path' = 's3:\/\/******\/hudi_cdc_002\/',\n'table.type' = 'COPY_ON_WRITE',\n'write.precombine.field' = 'modify_time',\n'hoodie.datasource.write.recordkey.field' = 'id',\n'write.operation' = 'upsert',\n'write.tasks' = '2',\n'hive_sync.enable' = 'true',\n'hive_sync.metastore.uris' = 'thrift:\/\/*******:9083',\n'hive_sync.table' = 'hudi_cdc_002',\n'hive_sync.db' = 'default',\n'hive_sync.mode' = 'HMS',\n'hive_sync.username' = 'hadoop'\n);\n# 寫⼊數據\ninsert into hudi_cdc_002 select * from mysql_cdc_002;\n"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"7. sysbench"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"codeblock","attrs":{"lang":"properties"},"content":[{"type":"text","text":"# sysbench 寫⼊mysql數據\n# 下載sysbench\ncurl -s https:\/\/packagecloud.io\/install\/repositories\/akopytov\/sysbench\/script.rpm.sh | sudo\nbash\nsudo yum -y install sysbench\n# 注意當前使用的“lua”並未提供構建，請根據自身情況定義,上述⽤到表結構如下\nCREATE TABLE if not exists `test_tb_01` (\n`id` int NOT NULL AUTO_INCREMENT,\n`name` varchar(155) DEFAULT NULL,\n`create_time` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP,\n`modify_time` timestamp NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE\nCURRENT_TIMESTAMP,\nPRIMARY KEY (`id`)\n) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4;\n# 創建表\nsysbench creates.lua --mysql-user=admin --mysql-password=admin123456 --mysql?\nhost=****.rds.amazonaws.com --mysql-db=cdc_test_db --report-interval=1 --events=1 run\n# 插⼊數據\nsysbench insert.lua --mysql-user=admin --mysql-password=admin123456 --mysql?\nhost=****.rds.amazonaws.com --mysql-db=cdc_test_db --report-interval=1 --events=500 --\ntime=0 --threads=1 --skip_trx=true run\n# 更新數據\nsysbench update.lua --mysql-user=admin --mysql-password=admin123456 --mysql?\nhost=****.rds.amazonaws.com --mysql-db=cdc_test_db --report-interval=1 --events=1000 --\ntime=0 --threads=10 --skip_trx=true --update_id_min=3 --update_id_max=500 run\n# 刪除表\nsysbench drop.lua --mysql-user=admin --mysql-password=admin123456 --mysql?\nhost=****.rds.amazonaws.com --mysql-db=cdc_test_db --report-interval=1 --events=1 run\n"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}}]}

發表評論

所有評論

還沒有人評論，想成為第一個評論的人麼? 請在上方評論欄輸入並且點擊發布.

相關文章

RocketMQ 之 IoT 消息解析：物聯網需要什麼樣的消息技術?

前言：從初代開源消息隊列崛起，到 PC 互聯網、移動互聯網爆發式發展，再到如今 IoT、雲計算、雲原生引領了新的技術趨勢，消息中間件的發展已經走過了 30 多個年頭。目前，消息中間件在國內許多行業的關鍵應用中扮演着至關重要的角色。隨着數

2024-04-24 23:40:04

“企業創新新引擎”數據庫專項賦能會，讓雲原生技術普惠千行百業！

本文分享自華爲雲社區《“企業創新新引擎”數據庫專項賦能會，讓雲原生技術普惠千行百業！》，作者： GaussDB 數據庫。 4月19日，由福州軟件園科技創新發展公司和華爲技術有限公司聯合主辦的HCDG城市行福州站——“企業創新新引擎”數據庫專

2024-04-24 10:32:53

沙特2030年願景和對中國IT企業的市場機會分析

沙特2030年願景和對中國IT企業的市場機會分析前言：最近“開源老DJ，帶你去沙特”欄目第一期已經播出，收到了不錯的反響。見COPU官網的回顧。（https://mp.weixin.qq.com/s/3B0jNVhybxTF1xPiy

2024-04-23 22:24:54

日誌架構演進：從集中式到分佈式的Kubernetes日誌策略

當我們沒有使用雲原生方案部署應用時採用的日誌方案往往是 ELK 技術棧。這套技術方案比較成熟，穩定性也很高，所以幾乎成爲了當時的標配。可是隨着我們使用 kubernetes 步入雲原生的時代後， kubernetes 把以往的操作系統

2024-04-23 11:47:10

【案例+PPT】普元信息臧一超：海量數據下“流批一體”的數據平臺演進路線

“數據中臺新範式”雲端峯會，深入解析湖倉一體、批流一體、治理與運營“三位一體”的數據中臺新範式特徵，普元信息數智研究院副院長臧一超在峯會發表演講《海量數據下的高性能流批一體數據開發平臺》。 18分鐘完整回放視頻見文末，拎幾個特別精彩的內

2024-04-23 11:43:51

世界讀書日 | 開發者必讀書單重磅來襲，華爲雲DTSE專家天團力薦

本文分享自華爲雲社區《世界讀書日 | 開發者必讀書單重磅來襲，華爲雲DTSE專家天團力薦》，作者：華爲雲社區精選。春色恰如許，讀書正當時。讀書，就像解鎖一把神祕鑰匙，爲開發者洞開新世界的大門，賦予他們破譯複雜難題的能力、挑戰未知領域的

2024-04-23 10:32:58

雲原生週刊：Kubernetes v1.30 發佈｜ 2024.4.22

開源項目推薦 pv-migrate pv-migrate 是一個 CLI 工具/kubectl 插件，可輕鬆將一個 Kubernetes 的內容遷移 PersistentVolumeClaim 到另一個 Kubernetes。 Claudi

2024-04-22 22:46:27

活動回顧丨雲原生開源開發者沙龍北京站回放 & PPT 下載

“零信任架構” 是一種安全概念，它要求在任何時候不對任何請求默認信任，無論它的來源內部還是外部。服務安全性已成爲企業的核心關切，4 月 13 日，雲原生開源開發者沙龍在北京順利開展。阿里雲一線工程師圍繞《微服務面臨的安全挑戰、趨勢與解決方

2024-04-22 21:12:01

西安站開營！AI 編碼助手通義靈碼幫大學生“整活兒”

如何更好地與 AI 爲伴，做時代的先進開發者？4 月 17 日，阿里雲推出的 AI 編程助手通義靈碼與雲工開物“高校訓練營”走進西安多所高校開啓實操培訓，結合 AI 輔助編程的發展背景、通義靈碼的具體能力和應用實操，幫助在校大學生了解人工智

2024-04-24 21:12:06

對接HiveMetaStore，擁抱開源大數據

本文分享自華爲雲社區《對接HiveMetaStore，擁抱開源大數據》，作者：睡覺是大事。 1. 前言適用版本：9.1.0及以上在大數據融合分析時代，面對海量的數據以及各種複雜的查詢，性能是我們使用一款數據處理引擎最重要的考量

2024-04-24 22:33:08

重磅新品發佈！雲耀數據庫HRDS，享受輕量級的極致體驗

本文分享自華爲雲社區《重磅新品發佈！雲耀數據庫HRDS，享受輕量級的極致體驗！》，作者：GaussDB 數據庫。所謂，凡有井水處，即能歌柳詞。大數據時代，凡有數據處，必有數據庫。隨着業務需求的不斷擴大和數據量的激增，數

2024-04-23 22:32:33

03-爲啥大模型LLM還沒能完全替代你？

1 不具備記憶能力的它是零狀態的，我們平常在使用一些大模型產品，尤其在使用他們的API的時候，我們會發現那你和它對話，尤其是多輪對話的時候，經過一些輪次後，這些記憶就消失了，因爲它也記不住那麼多。 2 上下文窗口的限制大模型對其inpu

2024-04-23 01:07:00

入職3年-我如何做一名AI產品經理

引言從2021年校招加入京東開始，我一直從事AI產品經理的工作，有幸見證了AI行業的熱情從一臺臺服務器燒到了全世界各個角落，也見證了京東AI中臺團隊的影響力如何一步步的擴大。從21年的迷茫到24年的堅定，很慶幸我正走在適合自己的道路上，

2024-04-22 11:16:31

如何增強Java API 的導入和導出性能

前言 GrapeCity Documents for Excel (以下簡稱GcExcel) 是葡萄城公司的一款服務端表格組件，它提供了一組全面的 API 以編程方式生成 Excel (XLSX) 電子表格文檔的功能，支持爲多個平臺創建、操

2024-04-23 10:23:02

SLS 查詢新範式：使用 SPL 對日誌進行交互式探索

作者：無哲引言在構建現代數據和業務系統的過程中，可觀測性已經變得至關重要，日誌服務（SLS）爲 Log/Trace/Metric 數據提供了大規模、低成本、高性能的一站式平臺服務，並提供數據採集、加工、投遞、分析、告警、可視化等功能，從

2024-04-22 21:12:05

24小時熱門文章

最新文章

最新評論文章