OPPO數倉與數據湖融合架構升級的實踐與思考

{"type":"doc","content":[{"type":"blockquote","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"過去幾年,數據倉庫和數據湖方案在快速演進和彌補自身缺陷的同時,二者之間的邊界也逐漸淡化。雲原生的新一代數據架構不再遵循數據湖或數據倉庫的單一經典架構,而是在一定程度上結合二者的優勢重新構建。在雲廠商和開源技術方案的共同推動之下,2021 年我們將會看到更多“湖倉一體”的實際落地案例。InfoQ希望通過選題的方式對數據湖和數倉融合架構在不同企業的落地情況、實踐過程、改進優化方案等內容進行呈現。"},{"type":"text","marks":[{"type":"strong"}],"text":"本文,InfoQ採訪了OPPO雲數架構部部長鮑永成,請他與我們分享OPPO引入數據湖和數倉融合架構的探索工作和實踐中的一些思考。"}]}]},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"當我們談數據湖,談的是什麼?"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}},{"type":"strong"}],"text":"InfoQ:數據湖和數倉融合架構是當下大數據領域非常重要的議題之一,不僅各大雲廠商先後提出了自己的技術方案,開源社區也有一些項目(包括DeltaLake、Iceberg和Hudi)非常活躍。其實數據湖這個概念誕生至今有挺長時間了,在您看來,目前業內對數據湖的定義和重要性是否已經達成一致?雲廠商的產品和開源項目之間有什麼差異嗎?"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" "}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}},{"type":"strong"}],"text":"鮑永成:"},{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"回答這個問題之前,我們得明確倉與湖的主要區別。倉裏的數據,有明確的表、字段定義,表與表之間的關係清晰。湖裏的數據,樣式就多了,有結構化、半結構化(JSON、XML等)、非結構化(圖片、視頻、音樂)。 數據入倉,我們要預先定義好schema。 數據入湖,則沒有這樣的要求,只需要將原始數據寫入指定存儲即可(通常是對象存儲),當真正需要使用的時候,我們再設法定義schema,進行分析應用。顯然,數據入湖比入倉要方便快捷。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" "}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"回到我們的話題,雲廠商的數據湖產品,通常積極推廣他們的低成本雲存儲(S3、OSS等),吸引用戶將數據上雲。一旦數據上雲,進而吸引用戶使用他們多年完善的大數據體系產品(計算引擎、依賴調度、質量管理、血緣管理、數據治理等),爲用戶提供數據開發、分析、應用的附加服務。其次,很多企業出於數據安全的考慮,並不願意將自己的數據上雲,一套兼容各類存儲的倉湖融合方案,是雲廠商對市場的迎合。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" "}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"開源的幾個數據產品Delta Lake、Apache Iceberg、Apache Hudi 更多可以理解爲一種TableFormat,這種TableFormat可以靈活定製Schema,對對象存儲友好,同時可以實時處理數據,支持Update、Insert、Upsert特性。企業應用好他們,可以助力自身構建批流一體、倉湖融合的大數據架構。"}]},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"倉湖融合架構升級的三個階段"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}},{"type":"strong"}],"text":"InfoQ:OPPO是什麼時候決定要引入數據湖和數倉融合架構的?能否介紹下當時的整個背景情況?是希望解決什麼樣的問題或痛點?"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" "}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}},{"type":"strong"}],"text":"鮑永成:"},{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"OPPO在2020年初引入數據湖的架構方案,這是建立在OPPO進軍IoT領域的大背景下。隨着公司不斷推出IoT產品,IoT設備產生的數據源源不斷,設備的智能化服務需要我們對這些數據做更深層次的挖掘。海量的數據如何高效存儲、高效利用是大數據部門必須要解決的問題。數據湖方案可以幫助我們快速收集保存數據,有效支持我們做數據分析、市場預測,以及智能服務的算法訓練。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" "}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}},{"type":"strong"}],"text":"InfoQ:爲什麼選擇引入Apache Iceberg?你們前期做了哪些技術選型和評估工作?"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" "}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}},{"type":"strong"}],"text":"鮑永成:"},{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"引入Iceberg構建我們的數據湖方案,主要出於兩點考慮。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" "}]},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"一.雲數融合:OPPO已經基於K8S,構建了自己的雲平臺,主要數據存在對象存儲OCS上。大數據依靠Yarn調度,HDFS做存儲,目前雲與大數據正在逐步完成融合,做到調度融合,存儲融合,統一運維,進而降低成本。"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"二是在與Delta Lake、Hudi,Iceberg對比中,雖然Hudi的實時特性最完備,但與Spark耦合太緊,Schema的定義缺乏靈活性。Iceberg對Upsert、Insert、Delete的支持沒有Hudi完備,但社區在積極跟進。擺脫了具體計算引擎的束縛,"},{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}},{"type":"strong"}],"text":"Iceberg專注於數據湖TableFormat標準的制定,這個標準正在慢慢被各家計算引擎所接受,未來會贏得更多的用戶認可"},{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"。"}]}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" "}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}},{"type":"strong"}],"text":"InfoQ:能否詳細介紹一下OPPO整體的數據平臺架構或數據處理流水線?在引入Iceberg前後,有哪些變化和演進?"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" "}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}},{"type":"strong"}],"text":"鮑永成:"},{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"下圖是OPPO的大數據架構,我們目前主要在推進兩項工作。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" "}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/27\/27cfc4c6e7efbbdfd175feef2d8a6748.png","alt":null,"title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" "}]},{"type":"numberedlist","attrs":{"start":null,"normalizeStart":1},"content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":1,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}},{"type":"strong"}],"text":"雲數融合"},{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":":OPPO已經基於K8s構建了自己的雲平臺,主要數據存在對象存儲OCS上。大數據平臺依靠Yarn調度,HDFS做存儲,後續二者將統一調度與存儲,統一運維,降低成本。雲數融合的一期重點,會通過我們自研的ADLS支持數據湖存儲。ADLS的"},{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}},{"type":"strong"}],"text":"冷熱分層,全局緩存"},{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"特性在數據湖落地的過程中,可以有效降低存儲成本,減輕存算分離後的性能影響。"}]}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" "}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/5e\/5e69b1f51cf390cc4f0e2b61af534c8c.png","alt":null,"title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" "}]},{"type":"numberedlist","attrs":{"start":"2","normalizeStart":"2"},"content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":2,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}},{"type":"strong"}],"text":"流批一體、倉湖融合的架構升級"}]}]}]},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/9f\/9fdbd0cd556eb5cef1a06f01af0c27d1.png","alt":null,"title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"對於傳統的Lambda架構,流與批是單獨的兩條數據處理鏈路,維護成本高,且容易出現數據不一致的情況。新的Kappa架構使用Kafka作爲存儲,簡化了架構,但Kafka數據承載能力有限,且數據格式並不利於計算引擎分析。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" "}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"我們在以下兩個方面,對傳統架構進行了改造。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" "}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"1)針對公司原來的數據埋點鏈路,我們引入了Iceberg,將實時數倉庫的Kafka替換成IcebergFormat,通過Spark\/Presto引擎做數據查詢,增加數據倉庫的實時性。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" "}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/6d\/6dc52e8c7913c83320e2ef39d6a1e3cb.png","alt":null,"title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"2)公司原有的數據庫入倉鏈路通過Flinkx完成數據同步, 無法支持CDC。目前Flink已經原生支持CDC數據解析,binlog 通過flink-cdc-connector拉取可以自動轉化成"},{"type":"text","marks":[{"type":"color","attrs":{"color":"#121212","name":"user"}}],"text":"INSERT、DELETE、UPDATE_BEFORE、UPDATE_AFTER消息,結合Iceberg支持的Update、Delete特性,可以高效準確地將數據庫同步到數倉,方便計算引擎進行分析。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" "}]},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/27\/27729b1e9af0b79c0cb111ebaf3d771a.png","alt":null,"title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" OPPO的一些數據,存儲在Oracle,SqlServer等數據庫中,Flink對這些數據庫的數據的拉取並沒有提供支持。我們封裝了Obus-DB的組件,來適配各類數據庫,將數據同步到Kafka中,支持後續數據入湖的消費。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" "}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}},{"type":"strong"}],"text":"InfoQ:你們是如何將基於Apache Iceberg的數據湖方案,與OPPO已有的數據倉庫融合的(分哪些階段、關鍵工作等)?改造過程中存在哪些難點和挑戰?你們是如何解決的?"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" "}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}},{"type":"strong"}],"text":"鮑永成:"},{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"倉湖融合可以分爲3個階段,目標是做到3個統一:統一數據、統一元數據、統一計算引擎。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" "}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}},{"type":"strong"}],"text":"1)統一數據"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" "}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"大數據Lambda架構分實時離線兩條鏈路,兩條鏈路可能造成數據不一致。Kappa架構雖然統一了鏈路,但kafka的特徵註定了這套架構對歷史分析並不友好。引入Iceberg換掉kafka,可以認爲是Kappa架構的升級,簡稱Kappa+,可以做到在一條鏈路上完成數據入倉,支撐流批數據分析。Iceberg Commit Snapshot架構實時性不及Kafka。目前我們採用Iceberg備份Kafka數據,同時縮短Kafka數據的存儲時間,以滿足用戶的實時性需求,併爲後續的Iceberg優化預留空間。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" "}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}},{"type":"strong"}],"text":"2)統一元數據"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" "}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"目前大數據的元數據基本都存儲在HiveMetaStore中,Iceberg構建表,需要融合其中。Flink 1.9 版本以前,是自己管理元數據,基於此,OPPO的元數據也分離線和實時兩套;Flink 1.11 版本後,對HiveMetaStore的集成已經比較成熟,我們正在將實時鏈路的元數據逐部遷移到HiveMetaStore。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" "}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}},{"type":"strong"}],"text":"3)統一引擎"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" "}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"Iceberg目前可以支撐Flink、Spark、Presto的查詢,這雖然給了用戶更多的選擇,但同時也給用戶帶來了選擇困難。在引擎層,可以做到一套引擎出口,根據數據特徵通過HBO靈活選擇執行引擎。這項工作目前正在規劃中。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" "}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"目前數據埋點入倉與數據庫CDC入倉兩條鏈路已經完成了數據湖架構改造,但OPPO每天入倉的數據量巨大,Iceberg性能還需要優化。"}]},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"沒有完備的數據體系,空談湖倉之爭沒意義"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}},{"type":"strong"}],"text":"InfoQ:您認爲現階段Iceberg還存在哪些不足待改進?你們有沒有什麼踩坑經驗可以分享?"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" "}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}},{"type":"strong"}],"text":"鮑永成:"},{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"Iceberg目前雖然有基於文件的統計信息,方便做謂詞下推的數據過濾,但還缺少細緻的索引規範來加速文件檢索。同時Row-level delete的性能也不夠高效,還有優化空間。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" "}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}},{"type":"strong"}],"text":"InfoQ:接下來對於數據湖引擎和調度平臺的建設,以及湖倉融合架構改造,你們還有哪些規劃?"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" "}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}},{"type":"strong"}],"text":"鮑永成:"},{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"談論數據湖和數據倉庫,立足點應該建立提供更好的數據服務上。完備的數據體系,包括數據存儲、多模態計算引擎、依賴調度、質量管理、血緣管理、任務診斷、數據集成、統一元數據、數據安全、數據服務等多方面內容,只有這些方面的能力完備,才能提供優質的數據服務,這是OPPO數據中心的核心工作。無論是數據湖,還是數據倉庫的數據,只有運轉在這套體系下,才能得到高效利用。在上述能力具備的條件下,解決好湖數據快速構建schema、湖與倉的元數據統一問題,倉湖自然融合。上述能力不完備,空談湖與倉之爭,沒有太多意義,孤島問題不可避免,數據利用率低,使用成本高。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" "}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}},{"type":"strong"}],"text":"InfoQ:您怎麼看數據湖和數倉融合架構未來的發展趨勢?企業或業務團隊該從哪些方面去評估自己是否需要引入湖倉融合架構?"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" "}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}},{"type":"strong"}],"text":"鮑永成:"},{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"倉湖融合的架構是個必然趨勢。數據時代,人們產生和接觸的數據越來越多樣,數據服務的要求也越來越高。快速而又低成本的利用數據,數據湖有着較爲明顯的優勢。如果企業與團隊面臨這樣的挑戰,可以引入倉湖融合的架構。但要做到倉湖融合,可以結合自身的情況,參考上一個問題的回答。"}]},{"type":"heading","attrs":{"align":null,"level":4},"content":[{"type":"text","text":"採訪嘉賓介紹"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" "}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"鮑永成,OPPO雲數架構&個人雲負責人。曾服務於土豆網、思科、京東、頭條等公司。長期負責雲計算平臺、大數據平臺的研發與技術演進。帶領團隊先後在京東商城、OPPO實施完成了全量業務上雲戰略。目前致力於打造OPPO開放的雲與大數據能力,在跨場景、多終端的超大空間爲用戶提供智慧服務。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"相關文章:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"link","attrs":{"href":"https:\/\/www.infoq.cn\/article\/SJr7sHxxVBczcl5ArCiF","title":"xxx","type":null},"content":[{"type":"text","text":"從自研到 Delta 到 Iceberg,網易嚴選數據湖建設實踐"}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"link","attrs":{"href":"https:\/\/www.infoq.cn\/article\/WolODSeUzp1iv0G1ojmI","title":"xxx","type":null},"content":[{"type":"text","text":"Flink 集成 Iceberg 在同程藝龍的實踐"}]}]}]}
發表評論
所有評論
還沒有人評論,想成為第一個評論的人麼? 請在上方評論欄輸入並且點擊發布.
相關文章