數據中臺與湖倉一體能碰出怎樣的火花?網易數帆實時數據湖Arctic的新探索

{"type":"doc","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"作者 | 蔡芳芳"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"採訪嘉賓 | 馬進 網易數帆平臺開發專家"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" "}]},{"type":"blockquote","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"數據中臺也要從離線爲主走向實時化,湖倉一體是第一步。"}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" "}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"數據從離線到實時是當前一個很大的趨勢,但要建設實時數據、應用實時數據還面臨兩個難題。首先是實時和離線的技術棧不統一,導致系統和研發重複投入,在這之上的數據模型、代碼也不能統一;其次是缺少數據治理,實時數據通常沒有納入數據中臺管理,沒有建模規範、數據質量差。針對這兩個問題,網易數帆近日推出了實時數據湖引擎Arctic。據介紹,Arctic具備實時數據更新和導入的能力,能夠無縫對接數據中臺,將數據治理帶入實時領域,同時支持批量查詢和增量消費,可以做到流表和批表的一體。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" 
"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"這是作爲網易公司基礎軟件團隊的網易數帆首次對外發布在湖倉一體方向的進展,同時宣佈的還有網易數帆有數實時數據中臺戰略。爲了深入瞭解網易數帆在湖倉一體方向的探索和思考,以及實時數據湖引擎Arctic的設計思路和產品定位,InfoQ採訪了網易數帆平臺開發專家馬進,圍繞以上問題逐一展開探討。"}]},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"網易數帆要做什麼樣的湖倉一體?"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" "}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"湖倉一體(Lakehouse)最初指的是數據湖和數據倉庫融合、兼具兩者優點的新興數據架構,但如今它已經不只是一個純粹的技術概念,而是被賦予了更多與廠商產品層面相關的含義。在湖倉一體越來越火的同時,不同廠商也爲它做出了各自的解讀。在進一步探討網易的湖倉一體實踐之前,我們有必要先了解一下網易數帆是怎麼理解“湖倉一體”的。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" "}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"網易數帆團隊開展湖倉一體工作主要源於現實應用場景中的一個痛點,即在大數據場景下的實時數據和離線數據的處理鏈路是割裂開的,而且實時數據和離線數據的存儲也分別採用了兩套不同的存儲方案。一方面,重複建設和維護的成本比較大,另一方面,雙方的研究成果也沒有得到很好的複用。所以,團隊一開始的目標其實是爲了實現流批一體,也就是將實時數據和離線數據的處理和存儲統一起來。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" 
"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"那爲什麼後面演變成湖倉一體呢?馬進將流批一體劃分爲三個層次,分別是存儲流批一體、開發流批一體和工具流批一體,並給出了這樣一個等式:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" "}]},{"type":"blockquote","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"“存儲流批一體 = 湖倉一體 = 基於數據湖實現所有數倉功能”"}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" "}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"離線數倉存儲從本質上來講,對應的就是數據湖技術,比如Hadoop生態的Hive;相應的,實時數倉對應的就是傳統數倉所具備的技術能力,像Greenplum、Teradata、Oracle這樣的商用數據庫,其實都具備流式更新和ACID的能力,可以完成一些實時報表的工作。"},{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}},{"type":"strong"}],"text":"網易數帆團隊希望讓基於數據湖概念的離線數倉技術具備實時計算的能力以及ACID的保障,也就是具備傳統數倉的能力"},{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":",因此,數據湖和傳統數倉各項能力的結合,就是網易數帆團隊要做的湖倉一體。基於這個目標,網易數帆打造了實時數據湖引擎Arctic。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/6d\/6d261e23d4df28f8ba25089227e11c3d.png","alt":null,"title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{
"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" "}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"從實現路徑來看,Arctic的原始需求是基於數據湖解決“倉”的問題,團隊對它的規劃是先要具備“倉”的功能,“倉”的相關工作做好後,再去延展實現“湖”的功能。"}]},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"邏輯數據湖和湖倉一體,同一場景的兩種解法"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" "}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"除了湖倉一體,InfoQ注意到,此前網易數帆還多次在公開場合提到另一個概念,即邏輯數據湖。網易數據科學中心總監、網易數帆有數產品總經理餘利華曾在接受"},{"type":"link","attrs":{"href":"https:\/\/www.infoq.cn\/article\/U5Qv45rsFxGGbXme2ydL","title":null,"type":null},"content":[{"type":"text","text":"InfoQ採訪"}],"marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}]},{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"時表示,邏輯數據湖是一種性價比更高的方式。這也給我們帶來了一些疑惑:邏輯數據湖這個概念因什麼而出現?它和湖倉一體、數據中臺之間的關係要怎麼理解?"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" 
"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"馬進表示,"},{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}},{"type":"strong"}],"text":"邏輯數據湖與湖倉一體是同一場景下的兩個解決方案,本質上來說都是爲中臺服務的。"},{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"邏輯數據湖是“物理分散、邏輯統一”,而湖倉一體是“物理統一”,二者是同一問題下的兩個分支。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" "}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"數據中臺提供的是一套數據治理和數據研發的方法論,主要面向業務,其中的數據建模、數據研發包括數據運維,它們的治理體系是一套。但是從中臺模塊產品往下看,就會有不同方案的拆分。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" "}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"其中,邏輯數據湖比較尊重業務以往的歷史負擔,比如之前用了Greenplum、Oracle這種數據倉庫,希望數據中臺能夠直接基於這些數據倉庫建設,不做數據遷移。從業務方來看,數據建模、數據開發與中臺的治理體系是一套,但底層的數據存儲可以不同。邏輯數據湖嘗試用技術把一個個數據孤島打通,比如對不同的數據倉庫做聯邦Join,可以認爲它是爲了解決這種不統一的一個方案。“物理分散”,即底層的存儲可以分開,但“邏輯統一”,上層的中臺邏輯是統一的。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" 
"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"據瞭解,邏輯數據湖方案主要是爲了滿足網易數帆部分企業客戶的需求,網易集團內部其實沒有太多這樣的負擔,甚至可以說這種負擔幾乎沒有,因爲網易內部一開始就是基於Hadoop自建數據湖去實現的。但對很多企業客戶來說,他們以前採購了不同的數據庫,後來要構建自己的數據湖和數據中臺體系,網易數帆就給他們提供了邏輯數據湖的方案,客戶可以繼續使用原有方案,同時網易數帆給他們提供一個整套的中臺入口,統一管理不同的數據孤島。這是邏輯數據湖主要適用的場景。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" "}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"相比較之下,湖倉一體的解決方案更加徹底。對於沒有歷史負擔的業務場景或企業客戶,他們所有新建業務都可以基於湖倉一體的方案來建設。基於湖倉一體方案,底層存儲在物理上就是統一的,都基於數據湖,上層也必然是統一的。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" "}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"可以認爲,這兩個方案都是服務於整個中臺,去構建一個統一的數據中臺的治理邏輯。馬進解釋道,兩種方案的收益不同,邏輯數據湖可以讓用戶快速上手,更好地覆蓋企業的歷史負擔;而湖倉一體可以用更低的成本去解決業務上的痛點,如果把時間線拉長,未來當雲計算更大範圍普及之後,基於雲端對象存儲建設數據湖,跟使用傳統商用數據庫或商用數倉相比,節省的成本可能高達幾十倍甚至幾百倍。"}]},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"實時數據湖Arctic的設計思路和定位"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" 
"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"網易數帆建設湖倉一體的核心技術原理與Hive離線數倉方案最大的不同是對數據的管理粒度更加細化,Hive的管理粒度在Partition級別,而網易數帆湖倉一體方案的管理粒度細化到文件。由於上層承接數據中臺體系,湖倉一體需要爲上層提供體系化的文件管理方案,涵蓋文件治理和文件合併等功能,因此具備細粒度的文件管理能力是首要需求。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" "}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"經過調研,團隊最終選擇使用Apache Iceberg,主要考慮是因爲Iceberg本身的元數據管理是面向文件的,有非常全的manifest機制,可以把表中的所有文件管理起來,Iceberg作爲底座提供了ACID的事務保證以及MVCC功能,可以保證數據的一致性,同時又具有可擴展性。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" "}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"在Iceberg的基礎上,團隊又自研了實時攝取、文件索引、數據合併,以及一整套元數據管理服務。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/4b\/4b2f0f0abaa8b025460cd2117d94ecb9.jpeg","alt":null,"title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"技術選型"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" 
"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"據馬進介紹,在最早做技術選型的時候,團隊也調研過與Iceberg同類型的開源項目Apache Hudi和Delta Lake,但最終都因爲一些原因而放棄選用。在做調研時,Hudi還是相對比較封閉的狀態(它對自己的定位是Spark的一個Lib,去年年底到今年纔開始真正把支持Flink作爲優先級比較高的工作),而網易數帆需要一個開放的解決方案來適配高度定製化需求。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" "}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"除此之外,也有一些技術細節的考量。比如數據格式方面的問題,Hudi的文件索引採用的是Bloomfilter以及HBase的機制,這兩種機制都不是特別理想,HBase需要引入第三方KV數據庫,對商業輸出不利,而Bloomfilter比較重,會讓實時性大打折扣,因此都不太適合網易數帆的技術選型。網易數帆對Arctic核心功能的想法和設計,也跟Hudi有出入。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" "}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"而沒選Delta Lake則是因爲它對實時性並不是看得很重,馬進團隊通過研究相關論文發現,Delta Lake更多還是把Spark的生態作爲第一優先級,這與團隊做湖倉一體的目標還是有一些區別。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" 
"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"相比之下,Iceberg相對更開放,對計算引擎的集成、對上層元數據的集成、對不同系統的集成都做得比較好,可以滿足團隊高度定製化的需求。因此團隊最終選擇了Iceberg,能更好地落實自己的想法,並做出網易數帆獨有的功能特色。"}]},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"基於Iceberg,但不侷限於Iceberg"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" "}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"雖然Arctic以Iceberg爲底座,但馬進認爲,從社區定位來看,Hudi纔是跟Arctic最像的。數據湖倉有一個非常重要的功能,即能夠基於主鍵進行行級更新,Hudi在功能上與Arctic比較匹配,只是在核心設計上二者存在分歧,在實時入湖這一方面Hudi也最具有代表性。所以Arctic在做性能對比測試的時候,也是拿Hudi來對比,而不是Iceberg。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" "}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"實際上,網易數帆團隊在一開始做Arctic這個產品時,並不打算綁定任何一個開源的數據湖方案,包括Iceberg。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" "}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"最初團隊更希望基於數據湖做一個流批一體的湖倉,通過制定一個管理Base數據(即存量數據)和Change數據(即增量數據或實時數據)的方案,做到對兩種數據的解耦,不管底層使用什麼數據湖技術,無論Iceberg還是Delta 
Lake,對外暴露的都是同一套湖倉一體方案。這是Arctic最初的定位,即不跟任何一家數據湖基座做高度綁定,但要做到這一點需要極高的研發投入,很難一步到位。因此前期團隊對於Arctic的定位首先是滿足網易湖倉一體的業務目標,把上層實時入湖功能涉及的讀合併、異步合併、元數據服務、小文件治理等等在一個數據湖基座上管理起來,有了數據湖的基座,就可以基於此再去做上層的服務,然後再考慮增加在不同數據湖上構建湖倉一體的能力。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" "}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"這看起來似乎又在已經相當複雜的數據系統中增加了一個服務層,不過馬進表示並非如此。首先做數據中臺,本身就是在Hive之上加了一層;其次增加的這些功能實際上是引擎端的適配,會有一個單獨的治理服務,而這個治理服務是偏中臺的模塊,可以認爲是整套數據中臺體系中的一部分。這個治理服務能夠把湖倉的元數據管理起來,類似Hive中的HMS,同時也可以做一些數據合併的規劃,還能對接不同的計算引擎,比如Presto、Impala、Spark SQL以及Flink。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" "}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"據馬進透露,團隊預計會在明年Q2將Arctic開源出去。其實團隊一直有在考慮如何將自研的東西貢獻回開源社區。從去年開始,網易數帆團隊就嘗試跟一些頭部互聯網公司共建Iceberg社區,希望能引導社區往湖倉一體的方向去發展。但社區本身對發展方向有自己的規劃,包括"},{"type":"link","attrs":{"href":"https:\/\/www.infoq.cn\/article\/tfCDPeN2L9S7KzGdfogY","title":null,"type":null},"content":[{"type":"text","text":"社區創始人前不久也已經從Netflix離職自己出來創業、圍繞Iceberg成立了商業公司"}],"marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}]},{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":",想要推動社區往一定的方向走成本很高,進度也會比較慢。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" 
"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"因此網易數帆團隊目前更希望先把所有的想法在Arctic上落實好,讓整個湖倉一體方案運轉起來,然後將做出來的成果開放出來,再進一步跟社區溝通,看哪些東西可以貢獻回社區。馬進認爲,最重要的是希望Arctic至少能夠在網易長期經營起來。"}]},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"落地情況和挑戰"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" "}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"目前網易數帆已經有部分客戶在使用Arctic,集團內部也有不少業務接入了Arctic。馬進透露,根據前段時間中期彙報的統計數據,網易集團內部已經有大約600TB規模的數據在使用Arctic,並且陸續有新的業務開始嘗試。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" "}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"按照數據來源,馬進將Arctic的用戶場景分爲兩大類,不同使用場景採用的數據架構不同,引入Arctic時使用的改造方案也有所不同。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" 
"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"第一類場景的數據主要來源於日誌,比如網易雲音樂、網易傳媒,還有電商的部分數倉系統,他們的數據都以日誌爲主。對於日誌數據,業務線在幾年前已經構建了非常健全的T+1數據處理解決方案,現在他們希望將原有的T+1的離線業務改造成實時業務。但是改造成實時鏈路後,又擔心數據的準確性,因爲日誌數據比較容易出現數據亂序和重複。"},{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}},{"type":"strong"}],"text":"針對這種日誌型數據的場景,更多使用的是Lambda架構,Arctic針對Hive提供了原地升級的方案"},{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":",即將Hive的離線數倉表通過特定方式升級爲Arctic表,升級後就可以通過實時計算引擎進行數據寫入,而離線數倉還保持了批量寫入的能力。Arctic表會自動根據場景做實時和離線的切換來面對不同的業務場景。網易集團內部主推Lambda架構,因爲集團內部日誌型數據場景更多一些。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" "}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"Kappa架構則更多面向企業用戶,像金融、製造業、物流等傳統行業,他們的數據不管是實時數據還是全量數據,主要來源是數據庫。數據庫裏存儲的數據很少出現亂序重複的問題,相對比較準確,也有完整的機制保證數據一致性。這種情況通常不需要用離線鏈路來兜底,日常用一個實時鏈路就可以。但有時候也會出現數據庫表變更的情況,比如增加一列或減少一列數據、數據表結構發生變化,或者一些數據出現了錯誤,需要大規模修正,這時就需要對原始數據進行批量計算回補,同時需要離線鏈路去發揮作用。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" 
"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"綜上,網易集團內部主要進行Lambda架構的改造,而針對企業客戶,主要的實踐是Kappa架構。網易集團內部的互聯網業務和傳統企業客戶的業務,數據處理的場景和方式不一樣,不過兩者沒有絕對的界限,網易集團內部也有一些潛在的使用Kappa架構的場景,比如嚴選電商就有很多實時數據來自數據庫的場景。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/26\/26b91eb5979fb17c0e453861d6a3aeba.png","alt":null,"title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"對於湖倉一體方案的實施和落地,Kappa架構是最理想的方案,因爲天然具備實時性,也沒有歷史負擔,建設湖倉一體的成本低;而對於Lambda架構來說,就可能面臨已有離線鏈路、但離線做的不夠規範,本身就需要一定的改造,這種情況升級改造的成本會比較高,技術實現上也需要更多磨合。當前團隊推動湖倉一體方案落地更多會選擇一些基於Kappa架構的場景,Lambda架構主要與集團內部大的業務共建,過程相對比較緩慢。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" "}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"除了前面所說的歷史負擔問題,企業嘗試採用湖倉一體技術還面臨另一個挑戰,就是組織上的問題。在馬進看來,目前數據中臺整個方案“離線”的基因很重,實時相對來說是一個比較獨立的分支,而且實時計算不光用在大數據場景,在線場景也經常涉及。如果希望通過實時給整個數據中臺賦能,就需要侵入到數據中臺的架構體系裏面去。這就涉及不同團隊的磨合以及目標統一的問題,在推進上有一定的困難。這與前兩年企業推行數據中臺戰略面臨的挑戰是類似的。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" 
"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"馬進坦言,去年準備做湖倉一體的時候,就面臨比較大的阻力,因爲數據中臺團隊也有自己的規劃,比如前面提到的邏輯數據湖,而湖倉一體是從另外一個角度去解決問題。這就需要公司的決策層在這件事情上有非常精準的判斷並制定相應的戰略目標。今年在網易數字+大會上正式宣佈將實時數據中臺作爲戰略來推進,是網易數帆在推進湖倉一體過程中有優勢的一點。"}]},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"流批一體的最終目標還有多遠?"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" "}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"對於網易數帆來說,湖倉一體(即存儲的流批一體)是最終實現流批一體必經的一步,最終願景是用一個邏輯一套代碼去覆蓋離線和實時兩個場景。如果實時和離線是兩套存儲、用到兩張表,就不可能用一套代碼解決,因此要優先解決存儲的流批一體,然後再基於此做開發的流批一體。把工具和團隊統一之後,中臺的模塊如數據模型、數據資產、數據質量等等也都可以做流批一體了,從原先只有離線的功能,到具備實時功能,這被稱爲工具流批一體,更確切的說法是中臺模塊的流批一體,最終給前端業務呈現的就是實時數據中臺。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/cb\/cb56bac9b209932ec5bcdae2a42d4f47.png","alt":null,"title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"流批一體是網易數帆團隊一直以來的戰術方向,即做到大數據平臺的實時化,而不是將實時計算獨立出來做。前述流批一體的三個層次都是網易大數據平臺未來的重點改進方向。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" 
"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"針對存儲的流批一體,現在已經有實時數據湖引擎Arctic,後續團隊的工作重點主要包括性能優化和自研特色功能,比如實時數據攝取、數據合併、元數據管理服務等,整體有一個長期的研究規劃。未來Arctic也將適配更多計算引擎,除了已經適配的Flink、Spark,Impala的適配工作也在進行中,明年Arctic開源的時候,也會做好Presto的適配。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" "}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"同時,開發流批一體和工具流批一體方面的工作也在緊鑼密鼓地展開。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" "}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"開發的流批一體主要是負責Flink的小團隊在跟進,目前主要在實踐階段。馬進表示,計算流批一體的社區成熟度要比存儲流批一體好很多,網易更多是在業務側實踐,爭取明年可以推出開發流批一體的工具和平臺。工具流批一體則是整個數據中臺團隊在推進,整體進度已經完成了20%~30%,不過暫時還沒有對外發布。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" "}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"在馬進看來,未來實時和離線技術必然會收斂到一起,從技術實現來看相對樂觀,目前網易數帆也已經有相對應的解決方案,但大規模的業務落地需要更多時間。至少還需要兩年時間,纔會有更多業務把流批一體和湖倉一體作爲一個比較標準的方案,過程的快慢與每個業務自身對存算分離的訴求的急迫程度有關。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" 
"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"客觀來說,現階段湖倉一體技術在開源技術裏還不是很成熟。馬進表示,企業需要對大方向保持關注,但到底要不要採用,還得看企業的發展情況。如果企業的自研能力相對缺乏,可以繼續觀望,等待更加成熟的解決方案出現。在他看來,現在的解決方案大多數都還處在體驗嚐鮮的階段,遠沒有達到廣泛應用的階段。對於有一定技術實力的企業,可以先基於集團內部場景推廣使用,這也是很多頭部企業的做法,比如阿里、騰訊以及字節,網易也是先基於集團內部場景孵化一些解決方案。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" "}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"不過網易數帆的工作重點是給企業做私有化解決方案,相比之下,阿里和騰訊的工作重點是公有云,更希望能將客戶的解決方案壟斷在自己的生態之中,而網易數帆則更傾向於背靠開源,然後強於開源,做技術的破壁者。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" "}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"過去一年,圍繞湖倉一體和流批一體話題,InfoQ"},{"type":"link","attrs":{"href":"https:\/\/www.infoq.cn\/theme\/106","title":null,"type":null},"content":[{"type":"text","text":"採訪"}],"marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}]},{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"了數位大數據平臺領域專家,雖然每家公司的解讀和實現路徑各有不同,但對於湖倉一體和流批一體未來長期的發展趨勢基本能夠達成一致。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" 
"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"從長遠來看,不管是阿里雲、騰訊雲還是Databricks,未來的湖倉一體發展趨勢都是趨同的,即基於廉價的存儲設施,把數倉的能力建設好,短期內可能由於公司發展戰略及自身定位的差異,研究方向存在一定差異。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" "}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"雖然如此,技術更迭的過程仍免不了曲折。對於很多企業來說,之前已經做了實時計算並構建出一套比較獨立的架構,並沒有很強的動力去做架構升級和更新。這有點類似於過去數據庫領域常常提到的“自己革自己命”,數據中臺也面臨這樣的困擾。但以發展的眼光來看,這樣的突破非常有必要。如果進行了革命,最終實時和離線統一到一起,未來大數據平臺會更加簡單精練,工具的專業門檻會越來越高,但使用成本會越來越低,用戶使用工具的投入走向收斂。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" "}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"大數據平臺以及數據業務全面的實時化,必然會對當前的生產關係以及組織架構帶來一定的調整訴求。這是一個自我改革的驅動,需要企業有一定的魄力。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" "}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}},{"type":"strong"}],"text":"採訪嘉賓介紹:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" 
"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"馬進,網易數帆平臺開發專家,網易數據科學中心在線數據和實時計算團隊負責人,負責網易集團分佈式數據庫,數據傳輸平臺,實時計算平臺,實時數據湖等項目,長期從事中間件、大數據基礎設施方面的研究和實踐,目前帶領團隊聚焦在流批一體、湖倉一體的平臺方案和技術演進上。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"專題推薦:"},{"type":"link","attrs":{"href":"https:\/\/www.infoq.cn\/theme\/106","title":"xxx","type":null},"content":[{"type":"text","text":"《數據湖與數據倉庫融合架構實踐》"}]}]}]}