雲端數智新引擎,騰訊雲原生數據湖計算重磅發佈

{"type":"doc","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"一、數據湖的前世今生"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":"br"}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"2010 年,Pentaho 公司的創始人兼首席技術官詹姆斯·狄克遜(James Dixon)首次提出數據湖的概念。他把數據湖中的數據比作原生態的水:它未經處理、原汁原味。數據湖中的水從源頭流入湖中,各種用戶都可以來湖裏獲取、蒸餾提純這些水(數據)。此時,大家主要把數據湖理解爲一個集中式的存儲系統,允許存儲任意規模的結構化和非結構化數據。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"隨着 HDFS 和對象存儲等技術的發展,海量數據的低成本存儲問題得以解決,用戶對湖中數據價值萃取的訴求愈發強烈。至此,數據湖的重點從存儲轉向數據的計算分析,核心在於提升數據分析的敏捷性、增強對數據的洞察力。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"2017 年前後,興起了新一輪的 AI 熱潮。深度學習和超大規模的神經網絡更離不開對海量數據文件的敏捷處理。藉助數據湖架構,可以更好地打通數據之間的壁壘,支撐 AI 模型訓練、推理以及數據預處理。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"發展至今,
數據湖已經不再侷限於某個技術、某個軟件產品,而是涵蓋數據湖存儲、數據湖計算、數據湖AI的多元化數據架構,滿足企業級用戶的生產管理需求。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/62\/629654a1c893fe0821bfe71058b4df68.png","alt":"Image","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"騰訊技術和產品發展至今,幾乎任何一個與用戶相關的業務數據量都在億級別,每日系統調用次數從億到百億,對海量異構數據的低成本存儲和高敏捷分析是最重要的關注點。我們認爲:“數據湖是企業新一代數據技術架構,可以賦予客戶更高的數據敏捷度、更低的分析成本,而云是數據湖的最佳實踐場所”。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"二、騰訊雲原生數據湖架構"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":"br"}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"選擇 Cloud 還是 Local 的諸多討論和實踐中,成本一直是繞不開的話題。“在雲端部署數據架構不如想象的便宜”,國內不少剛開始接觸雲服務的企業會有如此感嘆。反觀國外很多中大型企業(例如 Netflix,Pinterest),或者體量較大的中國出海公司(Shareit,Mobvista)更偏向於選擇公有云服務。其核心差異是雲原生技術的普及和落地,如何更好的利用雲服務的優勢,達到比本地自建大數據平臺更低的IT成本,是雲服務廠商和企業用戶共同探索的關鍵點。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/2d\/2dab74e4b4fe4f12696491f5f27bc30d.webp","alt":"Image","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"爲了解決海量異構數據的存儲和敏捷分析問題,騰訊雲推出了雲端數據湖體系,其包含:海量異構數據的存儲能力、面向多元化場景的分析能力、音視圖文的 AI 
智能化能力。客戶藉助於騰訊雲“數據雲原生”能力,高效構建企業級數據湖架構,降低企業數據成本、提升企業數據敏捷性,助力企業數字化決策。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/6b\/6b913dbdc95c87a0942be02c0ce0062d.webp","alt":"Image","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":"br"}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"騰訊雲數據湖體系圍繞數據湖存儲、數據湖計算、數據湖 AI,覆蓋數據業務全場景,形成綜合性雲端數據湖解決方案。目前,騰訊雲數據湖體系已服務衆多內外部客戶,算力彈性資源池達 500 萬核,存儲數據超過 100PB,日採集數據量超 500TB,每日分析任務數達 1500 萬,每日實時計算次數超過萬億,能支持上億維度的數據訓練。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"三、雲原生數據湖計算"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":"br"}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"通常使用大數據分析組件對對象存儲中的數據進行分析時,
會面臨兩個核心問題:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"如何基於雲服務兼容特性屏蔽底層架構,降低計算成本?"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"如何加速和優化存儲側的性能瓶頸?"}]}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/ef\/ef380a1dfb20d848276b24d4b08c4933.png","alt":"Image","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"爲了解決數據湖敏捷高效的分析和計算問題,騰訊雲推出一款開箱即用的數據湖分析服務——騰訊雲數據湖計算(Data Lake Compute,DLC)(官網介紹:https:\/\/cloud.tencent.com\/product\/dlc)。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"該服務採用 Serverless 架構,用戶無需關注底層架構或維護計算資源,使用標準 SQL 即可完成對象存儲服務(COS)及其他雲端數據設施的聯合分析計算。藉助於該服務,用戶無需進行傳統的數據分層建模,大幅縮減了海量數據分析的準備時間,有效提升企業數據敏捷度。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/1a\/1a81d9820cf17c7f9de638bc19d37a99.png","alt":"Image","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"騰訊雲 DLC 
服務聯合騰訊多個團隊深耕核心技術, 以提供一款高性能數據計算服務爲目標,實現瞭如下幾個關鍵技術特徵:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"1. 數據湖高性能計算"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"騰訊雲 DLC 引入高性能 serverless presto 引擎,針對數據湖底層存儲的特點,在穩定性和性能方面做了大量的優化。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"數據傾斜多年來一直是數據工程的宿敵,對雲原生數據湖架構而言卻是個好消息:在數據 scan 階段,數據熱度的巨大差異可以用很少的緩存來撬動很好的加速效果。在騰訊常見的大數據場景中,我們發現 read-only 的請求的緩存命中率高達75%-85%,甚至可能更高。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"除了緩存加速,減少數據文件的掃描量在數據湖架構下更重要,如何做好數據排布需要新一代的建模技術。除了分區,分桶等傳統技術,稀疏索引在數據湖扮演非常重要的作用。AP 向 TP存儲格式設計的靠攏大大加速了分析性能,可以看到一些高性能數倉技術如 clickhouse 都會引入稀疏索引技術,在不過分消耗存儲的基礎上大大提升了查詢性能。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"2. 
數據湖存儲透明加速"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"客戶最關注的問題是:如何把數據快速輸送給大數據引擎,讓引擎高效率工作。這是騰訊雲工程師們一直在思考的問題。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"對象存儲服務 COS 作爲數據湖統一存儲服務,在確保數據安全、可靠、無限擴展能力的基礎上,針對大數據業務 IO 特點做了進一步性能優化,分別在計算端、AZ 端、存儲端提供了性能加速能力。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/87\/87fa6e3884e03978a3bb6d8b28c9ba55.png","alt":"Image","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"這三級加速位於數據湖計算引擎和 COS 持久化存儲之間,爲數據分析和存儲系統建了橋樑, 將數據從 COS 對象存儲移動到距離數據應用更近的位置,使數據能夠更容易被訪問到。層次化的加速架構,使得數據的訪問速度能比現有方案有數量級的提升。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"3. 
低成本,無限算力雲原生數據湖"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"相對於傳統固定規模集羣,騰訊數據湖技術完全基於"},{"type":"link","attrs":{"href":"https:\/\/cloud.tencent.com\/document\/product\/457\/39804","title":"xxx","type":null},"content":[{"type":"text","text":"騰訊雲彈性容器技術"}]},{"type":"text","text":"構建,理論上“無限”的計算資源隨時可供秒級調度,滿足不同規模的計算任務,使用者再也不用關心底層資源的部署和運維。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"在傳統基於物理機\/虛擬機的大數據架構下,往往要維護一個規模相對固定的計算集羣,資源成本存在巨大的浪費。而云數據湖技術真正做到了隨用隨棄,充分利用彈性計算資源。計算引擎資源的創建、自動擴縮容、刪除、秒級監控等功能全部交由 EKS 的控制模塊來負責,用戶只需直接提交計算任務即可。當 DLC 預測到當前算力即將不足時,動態擴容計算資源以補充算力,作業無須重新執行,大幅度減少集羣空閒時的成本浪費,同時又能快速響應各種臨時 \/backfill 需求。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/67\/671b2fa9fc6e4b6add50844cc06df519.png","alt":"Image","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"四、騰訊雲原生數據湖技術未來展望"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":"br"}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"隨着企業對數據驅動業務需求的加深,也隨着海量數據分析技術的成熟,傳統單一的數據架構也沒法滿足多變的數據分析需求。騰訊雲推出雲原生數據湖體系,一方面降低數據存儲和分析的成本, 
另一方面大幅度提升數據分析的敏捷性。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"騰訊雲數據湖體系架構未來將在如下幾個方面繼續深耕,進一步推動雲端數據湖的技術發展。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"1. 靈活高效的計算引擎調度"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"在大數據領域,沒有一個萬能的 SQL 執行引擎,不同的計算引擎擅長不同的任務。基於騰訊大數據漂移計算技術,可以智能選擇最合適的計算引擎,支持數據源下推和 CBO 優化,提供更佳的分析性能。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"2. 增強數據湖入湖能力"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"提供更優的數據入湖能力,支持 ACID 事務能力,可以大幅縮短數據入湖操作流程,提升 ETL 處理效率。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"3. 更優的流批處理能力"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"提供流式增量和批式全量處理能力,使用相同的高性能存儲模型,數據不再孤立,架構更簡單。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"4. 
更好的兼容性和擴展性"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"更好的適配支持 Hadoop 生態,對象存儲的語義,結合 Cache 能力解決對象存儲性能問題。支持智能行列混存,針對讀\/寫不同場景下有更好的性能。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"5. 更低成本的 Serverless 算力支持"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"EKS 即將推出更具成本優勢的競價型容器服務, 進一步減少數據湖計算資源的成本消耗,從而更降低用戶使用數據湖分析的價格。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"horizontalrule"},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"頭圖:Unsplash"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"作者:騰訊雲大數據"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"原文:https:\/\/mp.weixin.qq.com\/s\/YnFoAEGxljo9pbyVAebCtA"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"原文:雲端數智新引擎,騰訊雲原生數據湖計算重磅發佈"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"來源:雲加社區 - 微信公衆號 [ID:QcloudCommunity]"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"轉載:著作權歸作者所有。商業轉載請聯繫作者獲得授權,非商業轉載請註明出處。"}]}]}