Hadoop 生態裏,爲什麼 Hive 活下來了?

{"type":"doc","content":[{"type":"blockquote","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Apache Hive 能在下一輪“淘汰”中倖存下來嗎?"}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Apache Hive 在 2010 年作爲 Hadoop 生態系統的一部分嶄露頭角,當時 Hadoop 是一種新穎而創新的大數據分析方法。Hive 的功能就是實現 Hadoop 的 SQL 接口。它的架構包括兩個主要服務:一是"},{"type":"text","marks":[{"type":"strong"}],"text":"查詢引擎"},{"type":"text","text":":負責執行 SQL 語句;二是"},{"type":"text","marks":[{"type":"strong"}],"text":"元存儲"},{"type":"text","text":":負責在 HDFS 中將數據收集虛擬化爲表。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/wechat\/images\/3f\/3fe1ac610fa1440630ce6b5e45f1fbe7.jpeg","alt":null,"title":null,"style":null,"href":null,"fromPaste":false,"pastePass":false}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"HDFS 上的 Hive 的主要組成部分,包括用戶界面、驅動程序和元存儲。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"Hadoop 背後的概念是革命的"},{"type":"text","text":"。分佈式文件系統("},{"type":"text","marks":[{"type":"strong"}],"text":"HDFS"},{"type":"text","text":")中存儲大量的數據集,運行於商業硬件集羣之上。並行執行計算作業時使用 "},{"type":"text","marks":[{"type":"strong"}],"text":"MapReduce"},{"type":"text","text":" 的數據。這類任務的分配由 "},{"type":"text","marks":[{"type":"strong"}],"text":"Yarn"},{"type":"text","text":" 管理。主接口是一種編程語言,最初是 "},{"type":"text","marks":[{"type":"strong"}],"text":"Java"},{"type":"text","text":" 或 "},{"type":"text","marks":[{"type":"strong"}],"text":"Scala"},{"type":"text","text":"。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"這些組件在 "},{"type":"text","marks":[{"type":"strong"}],"text":"Apache 基金會"},{"type":"text","text":"下開源,可以免費使用。這套技術已經成爲幾年來大規模分析的標準。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"但是,這個技術棧已經逐漸被一些新技術淘汰……"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"HDFS 被對象存儲所取代,由 AWS S3 主導。MapReduce 已經被 Spark 所取代,Spark 也逐漸減少了對 Hadoop 的依賴性。Yarn 正在被像 Kubernetes 這樣的技術取代。此外,Hive 的查詢引擎組件在性能和採用方面已經被 "},{"type":"text","marks":[{"type":"strong"}],"text":"Presto\/Trino"},{"type":"text","text":" 超越。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"雖然有這些改變,但大多數以數據湖爲特色的組織仍然將活躍的 "},{"type":"text","marks":[{"type":"strong"}],"text":"Hive Metastore"},{"type":"text","text":" 部署作爲其架構的一部分。與 Hadoop 的同類產品相比,你可能會想,“Hive Metastore 有什麼特別之處?”"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"要回答這個問題,讓我們深入瞭解一下 Hive Metastore 目前提供了什麼功能,以及正在出現什麼技術來取代它。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"Hive Metastore 做了什麼?"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"在將新數據存入對象存儲中時,我們會在 Hive Metastore 中註冊一個元存儲 API,該 API 來自於任何數據應用或編排工具。此生命性階段將一組對象從對象存儲重映射到 Hive 公開的表。部分註冊包含指定文件中保存的表的模式,以及描述這些列的元數據。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"以這種方式使用 Hive Metastore 有四個主要好處:虛擬化、可發現性、模式演化、性能。讓我們來詳細討論一下。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"heading","attrs":{"align":null,"level":4},"content":[{"type":"text","text":"虛擬化"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"數據分析師使用 SQL 通常不關心對象存儲的細節和其訪問模式。他們只是想要得到他們的表。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"當其他 Hadoop 組件被取代時,Hive Metastore 將扮演不可替代的角色。每種新技術的引入都確保了對 Hive Metastore 的支持,從而避免了依賴於 Hive 中定義的表對象的關鍵分析工作流。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"heading","attrs":{"align":null,"level":4},"content":[{"type":"text","text":"可發現性"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"當公開新數據並更新數據時,Hive Metastore 會變成包含在對象存儲中的所有集合的目錄。如果維護得當,就可以發現可供查詢的數據集。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"另外,補充性信息可以保存在元存儲中,以便提供關於數據的有用信息,比如其更新頻率,誰擁有它,等等。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"heading","attrs":{"align":null,"level":4},"content":[{"type":"text","text":"模式演化"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"管理數據集所面臨的挑戰之一就是其可變性。在描述其屬性的現有列時,記錄可以隨時間而改變。也有可能是屬性集本身會隨時間改變,從而導致表的模式發生改變。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"上述的註冊過程爲每一個屬於表的附加數據文件提供了模式的記錄。這就是說,如果模式在某一時刻發生了變化,那麼它將被記錄到 Hive Metastore 中。在訪問數據時,可以使用合適的模式進行訪問。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"這也爲驗證一個模式提供了一個很好的基礎,如果它不應該被改變,並對它發出警報。Hive 保存着創建此類測試的信息。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"heading","attrs":{"align":null,"level":4},"content":[{"type":"text","text":"性能"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"因爲 Hive Metastore 將表映射到了底層對象上,所以它可以基於對象存儲支持的主鍵來表示分區。當分區均衡且數量合理時,分區的粒度可以由用戶設置,這種映射可以提高查詢性能。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"這通常被稱爲“分區修剪”(partition pruning),它允許查詢引擎識別哪些數據文件可以被跳過。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"Hive 會在下一次革命中倖存嗎?"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"目前還沒有直接取代元存儲的候選者,但如果現有的一些趨勢佔據上風併發揮好作用,那麼它可能會被淘汰。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"讓我們來看看領先的繼任者們。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"heading","attrs":{"align":null,"level":4},"content":[{"type":"text","text":"開放表格式"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Iceberg、Hudi 和 Delta Lake 是這個類別中的三個參與者。每一個都是爲了滿足不同的需求而創建的,但是隨着時間的推移,它們全部匯聚在一起,涵蓋了一系列的特性。這些特性允許:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"可變性"},{"type":"text","text":"(Hudi、Delta)"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"訪問大表的效率"},{"type":"text","text":"(Iceberg)"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"模式實施和演化"},{"type":"text","text":"(Delta)"}]}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"由於 Hive Metastore 是一個所有應用程序都支持的通用接口,因此使用開放表格式的組織仍然依賴 Hive 來進行虛擬化,以及 \/ 或用於格式未涵蓋的其他用例。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"heading","attrs":{"align":null,"level":4},"content":[{"type":"text","text":"數據目錄"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"在過去的一年多時間裏,我們目睹了由數據工程領域的領導者發佈的 10 多個開源發現工具的閃電戰,這表明了對組織級數據目錄的需求。這些新來者加入了其他現有的商業數據目錄產品,如 Allation。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"目錄支持對象存儲與目前使用的大多數數據庫的映射。如有可能,許多發現工具將利用已經在 Hive Metastore 中的數據,否則就會進入對象存儲。毫不奇怪,隨着時間的推移,這些工具很有可能取代 Hive Metastore 的編目功能。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"heading","attrs":{"align":null,"level":4},"content":[{"type":"text","text":"可觀察性工具"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"可觀察性工具的主要目的是在運行中監控數據管道的質量和數據本身。一些工具專注於前者(如 Databand),而另一些工具側重於後者(Great Expectations 或 Monte Carlo)。如果可觀察性工具在整個數據生命週期內實施,它可以動態地更新數據目錄,並將 Hive Metastore 替換爲目錄。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"結  語"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"許多技術已經開始在改進 Hive 的功能方面有所突破。但是現在還沒有任何一種技術足夠成熟,也沒有就成功去除 Hive Metastore 的組合達成共識。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"這並不意味着它應該或將繼續成爲數據架構的一部分。實際上,它在可用性和性能方面都存在着明顯的不足。值得關注的是 Hive Metastore:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"難以安裝和維護。"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"非雲原生架構,使得管理服務的實施變得複雜。"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"因依賴關係型數據庫而受到可擴展性限制。"}]}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"綜合這些因素,我們可以預測 Hive Metastore 不會在下一個數據架構的演進中倖存下來。這種情況不會自動發生——它需要來自社區內部的力量。願你與我們攜手共創美好未來!"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"作者介紹:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Einat Orr 博士,Treeverse 合夥創始人兼 CEO,lakeFS 合夥創始人。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"原文鏈接:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"link","attrs":{"href":"https:\/\/lakefs.io\/hive-metastore-why-its-still-here-and-what-can-replace-it\/","title":"","type":null},"content":[{"type":"text","text":"https:\/\/lakefs.io\/hive-metastore-why-its-still-here-and-what-can-replace-it\/"}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}}]}
發表評論
所有評論
還沒有人評論,想成為第一個評論的人麼? 請在上方評論欄輸入並且點擊發布.
相關文章