開源數據湖方案Apache Iceberg 成立公司,CEO:我們將消除數據維護和優化難題

{"type":"doc","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" "}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" "}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"當地時間8月29日,"},{"type":"link","attrs":{"href":"https:\/\/iceberg.apache.org\/","title":null,"type":null},"content":[{"type":"text","text":"Apache Iceberg"}],"marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}]},{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"的創建者 "},{"type":"link","attrs":{"href":"https:\/\/tabular.io\/blog\/introducing-tabular\/","title":null,"type":null},"content":[{"type":"text","text":"Ryan Blue"}],"marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}]},{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"、Dan Weeks和Netflix 數據架構總監Jason Reid 宣佈從風投 "},{"type":"link","attrs":{"href":"https:\/\/a16z.com\/","title":null,"type":null},"content":[{"type":"text","text":"a16z"}],"marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}]},{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" 處拿到了A輪融資,正式成立圍繞 Apache Iceberg 構建新型數據平臺的商業公司 "},{"type":"link","attrs":{"href":"https:\/\/tabular.io\/","title":null,"type":null},"content":[{"type":"text","text":"Tabular"}],"marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}]},{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" "}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Apache Iceberg是一個通用的表格式(數據組織格式),可以適配Presto、Spark等引擎提供高性能的讀寫和元數據管理功能。目前已被Netflix、蘋果、Adobe、LinkedIn、Expedia、Stripe等公司採用。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" "}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"“從根本上構建一個獨立、雲原生並且可以積極管理數據的平臺,是我和其他聯合創始人創建 Tabular 的初衷。”現任 Tabular 首席執行官的 Ryan Blue 表示。Ryan Blue 在其文章中指出當前數據基礎設施主要存在兩大缺點:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" "}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"首先,數據湖充滿陷阱和挫折,這讓人們不得不成爲解決各種古怪限制的專家,而不能專注在把事情做好上。刪除一列數據可能會悄悄破壞查詢結果,不知道應該向查詢添加冗餘過濾器可能會浪費分析師數天的時間,更不用說還增加了雲成本。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" "}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"其次,大數據生態系統一直在把問題推給錯誤的人。使用這些技術的人應該專注於構建相關和可靠的數據產品,但他們不得不浪費時間擔心 SQL 會生成多少文件。數據基礎設施應該做得更多,而不是要靠人來彌補它的許多差距。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" "}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"Apache Iceberg 認爲,節省時間和消除令人頭痛的問題是數據基礎設施的關鍵下一步。Blue 表示,Iceberg 哲學的核心是讓人們開心:數據基礎設施應該在沒有令人不快的意外情況下正常工作。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" "}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"Blue表示,Tabular將消除數據維護和優化難題。開發者可以使用 Iceberg 安全地自主構建管理表。數據平臺可以提供更多的功能,包括壓縮、集羣、配置、索引等。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" "}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"此前在 Netflix,Iceberg 使得從 Kafka 流入的數據在幾分鐘內便可以使用,而非原來的數小時。在此過程中,Netflix 將 Iceberg "},{"type":"link","attrs":{"href":"https:\/\/github.com\/apache\/iceberg","title":null,"type":null},"content":[{"type":"text","text":"開源"}],"marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}]},{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"出來並捐贈給了"},{"type":"link","attrs":{"href":"https:\/\/apache.org\/","title":null,"type":null},"content":[{"type":"text","text":"Apache 軟件基金會"}],"marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}]},{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"。Tabular承諾永遠不會控制或傷害Iceberg,並將爲開源社區作出貢獻。“Iceberg 的持續成功取決於建立了一個通用和開放標準的多元化"},{"type":"link","attrs":{"href":"https:\/\/iceberg.apache.org\/community\/","title":null,"type":null},"content":[{"type":"text","text":"社區"}],"marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}]},{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"。”Blue表示。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" "}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"目前市面上流行的三大開源數據湖方案分別爲:Delta、Apache Iceberg和Apache Hudi。其中Iceberg 以自身獨特的優勢被越來越多開發者關注。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" "}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"首先,Iceberg 的架構和實現沒有綁定到某一特定引擎,實現了通用的數據組織格式,利用此格式可以與不同引擎(如 Flink、Hive、Spark)對接。其次,Iceberg 還有良好的架構和開放格式。相比於 Hudi、Delta Lake,Iceberg 的架構實現更爲優雅,同時對於數據格式、類型系統有完備的定義和可進化的設計。此外,Iceberg 在數據組織方式上充分考慮了對象存儲的特性,避免耗時的 listing 和 rename 操作,使其在基於對象存儲的數據湖架構適配上更有優勢。"}]}]}
發表評論
所有評論
還沒有人評論,想成為第一個評論的人麼? 請在上方評論欄輸入並且點擊發布.
相關文章