Hadoop 的“遺產”

{"type":"doc","content":[{"type":"blockquote","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"宣佈“Hadoop 已死”已成爲一種時尚。但,Hadoop 讓企業失去了對大數據的恐懼。Hadoop 反過來又釋放出一種創新的良性循環,爲我們今天所知的雲分析和人工智能服務帶來了大量市場。"}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"最近ZDNet 的 Big on Data 專欄撰稿人 Andrew Brust發表的一篇關於"},{"type":"link","attrs":{"href":"https:\/\/www.zdnet.com\/article\/apache-software-foundation-retires-slew-of-hadoop-related-projects\/?fileGuid=f7uaCPa9UPkktDUb","title":"","type":null},"content":[{"type":"text","text":"Hadoop 項目“春季大掃除"}]},{"type":"text","text":"”的文章閱讀數爆表,顯然觸動了人們的神經。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"迄今爲止,"},{"type":"link","attrs":{"href":"https:\/\/hadoop.apache.org\/?fileGuid=f7uaCPa9UPkktDUb","title":"","type":null},"content":[{"type":"text","text":"Apache Hadoop"}]},{"type":"text","text":"項目系列不再像十年前那樣是大數據的中心,事實上,有關"},{"type":"link","attrs":{"href":"https:\/\/www.infoworld.com\/article\/2922720\/hadoop-demand-falls-as-other-big-data-tech-rises.html?fileGuid=f7uaCPa9UPkktDUb","title":"","type":null},"content":[{"type":"text","text":"Hadoop 已死"}]},{"type":"text","text":"的論調已經流傳很久,以至於聽起來更像是“"},{"type":"link","attrs":{"href":"https:\/\/amp.en.google-info.org\/28819\/1\/generalissimo-francisco-franco-is-still-dead.html?fileGuid=f7uaCPa9UPkktDUb","title":"","type":null},"content":[{"type":"text","text":"弗朗西斯科・弗朗哥最後還是死了"}]},{"type":"text","text":"”這則老標語的最新版本。("},{"type":"text","marks":[{"type":"strong"}],"text":"譯註"},{"type":"text","text":":弗朗西斯科・弗朗哥,Francisco Franco,西班牙國家元首、最高統帥、大獨裁者,1975 年 9 月 27 日被宣告政治死亡,1975 年 11 月 20 日,死於帕金森症。)"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"如果你想進一步瞭解情況,看看招聘信息就知道了。最近,"},{"type":"link","attrs":{"href":"https:\/\/terenceshin.medium.com\/?fileGuid=f7uaCPa9UPkktDUb","title":"","type":null},"content":[{"type":"text","text":"Terence Shin"}]},{"type":"text","text":"在上個月發佈的"},{"type":"link","attrs":{"href":"https:\/\/towardsdatascience.com\/the-most-in-demand-skills-for-data-scientists-in-2021-4b2a808f4005?fileGuid=f7uaCPa9UPkktDUb","title":"","type":null},"content":[{"type":"text","text":"一項調查"}]},{"type":"text","text":"(如下圖所示),通過網絡搜索整理了超過 15000 個數據科學家的工作清單,清單顯示,僱主對 Hadoop 技能的需求正在急劇下降,C++、Hive 和一些遺留的專用語言也在其中。順便說一下,對了,Spark 和 Java 也同樣在清單中。如果向數據工程師提出同樣的問題,結果是否會有所不同?"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"解決方案“Hadoop”被認爲是 2014 年的事情。對於大數據來說,這個世界也在不斷髮展。大數據之所以被貼上這樣的標籤,是因爲在那個時候,很少有人會對多達 TB 或 PB 級的數據進行梳理,對非關係數據進行分析的能力也有限。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"如今,多模型數據庫已經變得越來越普遍,而大多數關係數據倉庫(Data Warehouse)也增加了解析 JSON 數據和疊加圖形數據視圖的功能。在雲存儲中直接查詢數據和 \/ 或從數據倉庫進行聯合查詢的功能也已變得司空見慣。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"正如Andrew 所說,“春季大掃除”旨在“清除蜘蛛網”。和傳統的觀念相反,Hadoop 並沒有死亡。在"},{"type":"link","attrs":{"href":"https:\/\/www.cloudera.com\/products\/cloudera-data-platform.html?fileGuid=f7uaCPa9UPkktDUb","title":"","type":null},"content":[{"type":"text","text":"Cloudera Data Platform"}]},{"type":"text","text":"(CDP)中,Hadoop 生態系統的一些核心項目仍在繼續,這一產品非常有活力。因爲倖存下來的是 CDP 之前還沒有出現的打包平臺,所以我們不再稱之爲 Hadoop 了。現在,動物園裏的動物都安全地關進了籠子。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/e6\/e67f01822ed28a4d78d6988ca685a9cb.jpeg","alt":null,"title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"用幾個甚至更多的獨立開源項目來組建自己的集羣的想法已經過時了。既然有其他可供選擇的方案(我們不僅僅是在討論 CDP),爲什麼還要浪費時間手工實現 Apache"},{"type":"link","attrs":{"href":"https:\/\/hadoop.apache.org\/docs\/current\/hadoop-mapreduce-client\/hadoop-mapreduce-client-core\/MapReduceTutorial.html?fileGuid=f7uaCPa9UPkktDUb","title":"","type":null},"content":[{"type":"text","text":"MapReduce"}]},{"type":"text","text":"、"},{"type":"link","attrs":{"href":"https:\/\/hive.apache.org\/?fileGuid=f7uaCPa9UPkktDUb","title":"","type":null},"content":[{"type":"text","text":"Hive"}]},{"type":"text","text":"、"},{"type":"link","attrs":{"href":"https:\/\/ranger.apache.org\/?fileGuid=f7uaCPa9UPkktDUb","title":"","type":null},"content":[{"type":"text","text":"Ranger"}]},{"type":"text","text":"或者"},{"type":"link","attrs":{"href":"https:\/\/atlas.apache.org\/?fileGuid=f7uaCPa9UPkktDUb","title":"","type":null},"content":[{"type":"text","text":"Atlas"}]},{"type":"text","text":"呢?至少在過去 30年裏,這一直是數據庫領域的常態;當你購買 Oracle 時,你不必分別安裝查詢優化器和存儲引擎。爲什麼我們用來調用Hadoop的數據平臺會有所不同呢?"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"到 2020 年,對於新的項目,你的組織可能計劃實施雲服務,而非安裝打包的軟件。儘管推動雲計算最初是爲了轉移成本,但現在更多是關於公共控制平面下的操作簡化和敏捷性。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"現在,有多種方法可以分析過去稱爲“"},{"type":"link","attrs":{"href":"https:\/\/bigdataldn.com\/intelligence\/big-data-the-3-vs-explained\/?fileGuid=f7uaCPa9UPkktDUb","title":"","type":null},"content":[{"type":"text","text":"三個 V"}]},{"type":"text","text":"”("},{"type":"text","marks":[{"type":"strong"}],"text":"V"},{"type":"text","text":"olum、"},{"type":"text","marks":[{"type":"strong"}],"text":"V"},{"type":"text","text":"elocity、"},{"type":"text","marks":[{"type":"strong"}],"text":"V"},{"type":"text","text":"ariety,即體積、速度、種類)的數據。如今,你可以隨時訪問位於雲對象存儲中的數據,也就是事實上的數據湖(Data Lake)。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"通過使用"},{"type":"link","attrs":{"href":"https:\/\/aws.amazon.com\/athena\/?fileGuid=f7uaCPa9UPkktDUb","title":"","type":null},"content":[{"type":"text","text":"Amazon Athena"}]},{"type":"text","text":"等服務進行特別查詢,可以實現這一點;在大多數雲數據倉庫服務中利用可選的聯合查詢功能;使用"},{"type":"link","attrs":{"href":"https:\/\/databricks.com\/?fileGuid=f7uaCPa9UPkktDUb","title":"","type":null},"content":[{"type":"text","text":"Databricks"}]},{"type":"text","text":"等專用服務或"},{"type":"link","attrs":{"href":"https:\/\/azure.microsoft.com\/en-us\/services\/synapse-analytics\/?fileGuid=f7uaCPa9UPkktDUb","title":"","type":null},"content":[{"type":"text","text":"Azure Synapse Analytics"}]},{"type":"text","text":"等雲數據倉庫服務對數據運行"},{"type":"link","attrs":{"href":"https:\/\/spark.apache.org\/?fileGuid=f7uaCPa9UPkktDUb","title":"","type":null},"content":[{"type":"text","text":"Spark"}]},{"type":"text","text":"。由於數據倉庫和數據湖之間的界限越來越模糊,現在很多人採用了模糊術語"},{"type":"link","attrs":{"href":"https:\/\/databricks.com\/blog\/2020\/01\/30\/what-is-a-data-lakehouse.html?fileGuid=f7uaCPa9UPkktDUb","title":"","type":null},"content":[{"type":"text","text":"數據湖屋"}]},{"type":"text","text":"(Data Dakehouses),或者整合跨數據倉庫和數據湖的訪問,或者把數據湖變成 80% 的數據倉庫。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"而且我們甚至還沒有涉及到人工智能和機器學習。就像早期的 Hadoop 只屬於數據科學家(在數據工程師的幫助下)一樣,最初機器學習和更廣泛的人工智能也是如此。如今,數據科學家擁有許多工具和框架來管理他們所創建模型的生命週期。對於公民數據科學家而言,"},{"type":"link","attrs":{"href":"https:\/\/cloud.google.com\/automl?fileGuid=f7uaCPa9UPkktDUb","title":"","type":null},"content":[{"type":"text","text":"AutoML 服務"}]},{"type":"text","text":"使構建機器學習模型變得觸手可及,而云數據倉庫正在增加它自己的"},{"type":"link","attrs":{"href":"https:\/\/cloud.google.com\/bigquery-ml\/docs\/introduction?fileGuid=f7uaCPa9UPkktDUb","title":"","type":null},"content":[{"type":"text","text":"預打包機器學習模型"}]},{"type":"text","text":",可以"},{"type":"link","attrs":{"href":"https:\/\/aws.amazon.com\/redshift\/features\/redshift-ml\/?fileGuid=f7uaCPa9UPkktDUb","title":"","type":null},"content":[{"type":"text","text":"通過 SQL 命令"}]},{"type":"text","text":"來觸發。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"可能人們很容易忘記,僅僅在十年前,這一切似乎還是不可能發生的事情。谷歌的創新研究推動了這一領域的發展。藉助"},{"type":"link","attrs":{"href":"https:\/\/research.google.com\/archive\/gfs-sosp2003.pdf?fileGuid=f7uaCPa9UPkktDUb","title":"","type":null},"content":[{"type":"text","text":"谷歌文件系統"}]},{"type":"text","text":",這家互聯網巨頭設計出了一個僅限於附加的文件系統,利用廉價磁盤的優勢,突破了傳統存儲網絡的限制。通過"},{"type":"link","attrs":{"href":"https:\/\/research.google.com\/archive\/mapreduce-osdi04.pdf?fileGuid=f7uaCPa9UPkktDUb","title":"","type":null},"content":[{"type":"text","text":"MapReduce"}]},{"type":"text","text":",谷歌破解了這一密碼,它在商品硬件上實現了幾乎線性的可擴展性。在當時廣泛採用的擴展性 SMP 架構中,很難做到這一點。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"谷歌發表了這些論文,這對"},{"type":"link","attrs":{"href":"https:\/\/www.linkedin.com\/in\/cutting\/?fileGuid=f7uaCPa9UPkktDUb","title":"","type":null},"content":[{"type":"text","text":"Doug Cutting"}]},{"type":"text","text":"和"},{"type":"link","attrs":{"href":"https:\/\/web.eecs.umich.edu\/~michjc\/?fileGuid=f7uaCPa9UPkktDUb","title":"","type":null},"content":[{"type":"text","text":"Mike Cafarella"}]},{"type":"text","text":"來說是件好事,他們當時正在開發一個能夠索引至少 10 億頁的"},{"type":"link","attrs":{"href":"https:\/\/www.geeksforgeeks.org\/hadoop-history-or-evolution\/?fileGuid=f7uaCPa9UPkktDUb","title":"","type":null},"content":[{"type":"text","text":"搜索引擎"}]},{"type":"text","text":"項目,發現了一條開源之路,可以大幅降低實現這樣一個系統的成本。後來,社區的其他成員接過了 Cutting 和 Cafarella 的工作,例如,Facebook 開發了"},{"type":"link","attrs":{"href":"https:\/\/en.wikipedia.org\/wiki\/Apache_Hive?fileGuid=f7uaCPa9UPkktDUb","title":"","type":null},"content":[{"type":"text","text":"Hive"}]},{"type":"text","text":",它提供了一種類似於 SQL 的編程語言,用來在 PB 級別梳理 PB 級各種數據集。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"如今,隨着經典 Hadoop 項目的採用率的下降,人們很容易忘記,Hadoop 項目的發現帶來了一個良性循環,創新吞噬了年輕一代。在 Hadoop 出現的時候,數據就變得如此龐大,以至於我們不得不對數據進行計算。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"隨着雲原生架構的出現,廉價、大量的帶寬使之成爲可能:再一次將更多的存儲層、計算層和數據層分開。並不是說這兩種方法都是正確或錯誤的,而是說它們適用於當時設計時就已存在的技術。這就是科技創新的週期性本質。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"從 Hadoop 學到的經驗突破了規模化處理的限制,從而促生了一個循環,很多舊的假設,比如 GPU 嚴格用於圖形處理,都被拋在了一邊。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Hadoop 的“遺產”不僅在於它所催生的創新良性循環,還在於它使企業能夠克服對數據處理的恐懼,而且還是海量數據。"}]},{"type":"heading","attrs":{"align":null,"level":4},"content":[{"type":"text","text":"作者介紹:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Tony Baer,dbInsight LLC 的負責人,負責大數據和數據管理以及一些系統工程。領導 Ovum 的大數據研究領域,從事該行業的 25 年中,研究了數據集成、軟件和數據架構、中間件和應用開發等問題。擁有多學科背景,涉及企業軟件的不同層次。與他人合作出版了一些關於 Java 和 .NET 框架的早期書籍,並在多家雜誌社發表過許多文章。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"原文鏈接:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"https:\/\/www.zdnet.com\/article\/hadoops-legacy-no-more-fear-of-data\/#ftag=RSSbaffb68"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}}]}
發表評論
所有評論
還沒有人評論,想成為第一個評論的人麼? 請在上方評論欄輸入並且點擊發布.
相關文章