大數據技術發展(二):Hadoop 技術生態圈的發展

{"type":"doc","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"size","attrs":{"size":14}}],"text":"大家好,這裏是抖碼課堂,抖碼課堂專注提升互聯網技術人的軟硬實力。在抖碼課堂的公衆號中可以聽這篇文章的音頻,體驗更好~~~~"}]},{"type":"heading","attrs":{"align":null,"level":1},"content":[{"type":"text","text":"google 的\"三駕馬車\""}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"size","attrs":{"size":14}}],"text":"我們在上一篇文章中知道了,google 爲了解決數據量越來越大的問題,開發了分佈式存儲技術 GFS 和分佈式計算技術 MapReduce,這兩個技術奠定了大數據技術的發展。如果 google 對這兩個技術不開放出來的話,它的影響力也不會很大,可能很多人就不會知道這兩個技術,但是 google 分別在 2003 年和 2004 年將這兩個技術以論文的方式發佈出來了,從而奠定開源大數據技術的發展,也就是我們現在免費使用的大數據技術 (Hadoop)。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"size","attrs":{"size":14}}],"text":"要了解 Hadoop 的發展史,我們得先從 google 的\"三駕馬車\"開始說起,google 分別在 2003 年、2004 年以及 2006 年發佈了三篇論文:"}]},{"type":"numberedlist","attrs":{"start":null,"normalizeStart":1},"content":[{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":1,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"size","attrs":{"size":14}}],"text":"The Google File System,簡稱 GFS"}]}]},{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":2,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"size","attrs":{"size":14}}],"text":"MapReduce:Simplified Data Processing on Large Clusters"}]}]},{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":3,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"size","attrs":{"size":14}}],"text":"Bigtable:A Distributed Storage System for Structured Data"}]}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"size","attrs":{"size":14}}],"text":"這三篇論文闡述的是解決大數據處理這個問題的思想,思想的發展是緩慢的,到今天爲止,我們仍然使用這三篇論文提供的思想解決大數據處理的問題,但是我們發現技術的發展是不斷迭代更新的,比如大數據的技術從 Hadoop 到 Spark 再到 Flink 等。所以說,如果我們要學習一項技術的話,最好的學習方法就是徹底掌握這個技術解決問題的思想,這樣一來,即使技術不斷的更新發展,我們也能很容易的跟上。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"size","attrs":{"size":14}}],"text":"說完 google 的\"三駕馬車\",繼續回到我們的主題,即 Hadoop 的發展。很多人都說 Hadoop 起源於 google 的\"三駕馬車\",其實這個是不嚴謹的,我們可以說成:Hadoop 的實現是參考 google \"三駕馬車\" 中的 GFS 和 MapReduce 這 \"兩駕馬車\" 解決大數據處理問題的思想。如果把思想和技術分開的話,下面的說話應該更加的準確:"}]},{"type":"bulletedlist","content":[{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"size","attrs":{"size":14}}],"text":"從思想上,Hadoop 起源於google \"三駕馬車\" 中的 GFS 和 MapReduce 這 \"兩駕馬車\""}]}]},{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"size","attrs":{"size":14}}],"text":"從技術上,Hadoop 實際上是起源於 Doug Cutting 研發的網絡搜索引擎技術,而非 google 的搜索引擎技術"}]}]}]},{"type":"heading","attrs":{"align":null,"level":1},"content":[{"type":"text","text":"Hadoop 技術的起源"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"size","attrs":{"size":14}}],"text":"故事要從 1997 年的一個晴朗的下午開始說起,Doug Cutting 從這個時間開始開發第一個版本的 Lucene。Lucene 是一個全文搜索庫,它要解決的問題本質上就是我們在前一篇文章提到的倒排索引。Lucene 其實是倒排索引的技術實現了,有了 Lucene ,我們就可以在海量的文章中快速的搜索包含指定關鍵詞的文章。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"size","attrs":{"size":14}}],"text":"Doug Cutting 花了 3 個月的時間開發完 Lucene,然後一直優化 Lucene,爲了讓更多人的使用 Lucene,Doug Cutting 在開源社區 Source Forge 中開源 Lucene,使用 Lucene 的人越來越多,反響越來越好,在 2001 年的時候 Lucene 在 Apache 上開源,用的人就越來越多了。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"size","attrs":{"size":14}}],"text":"完成了 Lucene 項目後,Cutting 開始和 Mike Cafarella(卡法雷拉) 致力於對全網的網頁進行索引,從而可以快速的通過關鍵字搜索(其實做的事情和 google 做的事情差不多),通過他倆的努力,研發出了Apache Nutch 這個項目,這個項目是 Lucene 的子項目,它其實就是網絡爬蟲的技術實現。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"size","attrs":{"size":14}}],"text":"一開始,他們在一臺服務器上部署 Nutch,然後去爬取網絡上所有的網頁,當爬取到 1 億個網頁的時候,單臺機器存儲不下了,很明顯,單臺機器不可能存儲的下整個網絡上所有的網頁。然後他們將服務器的數量提升到 4 臺,雖然加服務器可以存儲更多的網頁數據,但是帶來的複雜度確實非常的大,他們需要手動的處理服務器之間的數據交換、以及手動的處理磁盤空間(因爲磁盤會滿),而且每次增加服務器的時候,複雜度都會呈指數級增長。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"size","attrs":{"size":14}}],"text":"爲了解決服務器擴展的問題,他們也嘗試了很多方法,直到 2003 年十月份,google 發佈了 GFS 分佈式存儲文件系統的論文,Cutting 和 Cafarella(卡法雷拉) 兩個人閱讀到這篇論文後,發現這篇論文解決的就是他們現在遇到的問題。隨後,他倆就按照這篇論文的思路實現了一個分佈式存儲文件系統,就是 Nutch Distributed File System (簡稱 NDFS)"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"size","attrs":{"size":14}}],"text":"解決了存儲的問題,Cutting 和 Cafarella 同樣也需要面對怎麼樣對分佈式存儲的數據進行計算的問題。NDFS 分佈式存儲的特點,也就是數據分塊存儲在多臺機器上,那麼計算需要並行的進行計算,這樣的話可以利用分佈式存儲的特點,從而提高計算的性能。在他們摸索的時候,同樣,在 2004 年 google 發佈了另一篇 MapReduce 的論文,Cutting 和 Cafarella(卡法雷拉) 看完這個論文後,發現這個解決方案也就是他們想要的,所以,他們就按照這篇論文的思想重新實現了一個分佈式計算的技術,也命名爲 MapReduce。並在 2005 年,將 MapReduce 集成到 Nutch 項目中。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"size","attrs":{"size":14}}],"text":"在 2006 年的二月,Cutting 將 NDFS 和 MapReduce 從 Nutch 項目中剝離出來,單獨創建一個名爲 Hadoop 的項目,這個 Hadoop 項目包含了 Hadoop Common、HDFS (其實就是 NDFS,就是改了下名字而已) 和 MapReduce 三個模塊。這樣 Hadoop 就出世了。"}]},{"type":"heading","attrs":{"align":null,"level":1},"content":[{"type":"text","text":"Hadoop 技術的發展"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"size","attrs":{"size":14}}],"text":"在這個時候,Yahoo 看到了 google 的 GFS 和 MapReduce 的好處,也想使用這個分佈式技術來實現他們的數據存儲和處理,他們決定聘用 Doug Cutting 來幫助他們使用 Hadoop 來代替他們之前的數據存儲和處理方案。事實證明,這是一個明智的選擇,在 2007 年二月份的時候,Yahoo 的 Hadoop 集羣的節點就已經達到了 1000 個了。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"size","attrs":{"size":14}}],"text":"2007 年和 2008 年是 Hadoop 發展最快的兩年,在這兩年內,很多大公司,比如 Twitter、Facebook、LinkedIn 等都在使用 Hadoop,並且他們還貢獻了很多的以 Hadoop 爲核心的大數據處理分析工具,比如分佈式協調工具 Zookeeper、Pig、Hive 、HBase 等。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"size","attrs":{"size":14}}],"text":"2009 年的亞馬遜開始提供 MapReduce 計算服務。同年,Cutting 離開 Yahoo,以首席架構師的身份去了 Cloudera (也就是 Hadoop 商用的組織,CDH 就是他們研發的)"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"size","attrs":{"size":14}}],"text":"2012 年,Yahoo 的 Hadoop 集羣的節點數達到了 42000 個,同年,Hadoop 的 contributors 也達到了最多的 1200 個。"}]},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"Hadoop 2.x"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"size","attrs":{"size":14}}],"text":"以上我們聊的是 1.x 版本的 Hadoop,在 Hadoop 1.x 中的 MapReduce 有一個很大的問題,那就是分佈式數據處理和資源管理是在一起的,這樣當數據量特別大,或者當數據處理任務特別多的時候會存在性能瓶頸,爲了解決這個問題,在 2012 年,Hadoop 團隊將分佈式數據處理的邏輯和資源管理的邏輯拆分開,形成了一個獨立的分佈式資源管理技術,也就是 Yarn,這代表着 Hadoop 進入 Hadoop 2.x 時代。"}]},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"Hadoop 3.x"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"size","attrs":{"size":14}}],"text":"爲了提高 Hadoop 的性能,在 2017 年 12 月份,Hadoop 發佈了 Hadoop 3.0.0 版本,預示着 Hadoop 進入到 Hadoop 3.x 時代。相對於 Hadoop 2.x,Hadoop 3.x 做了很多性能和功能上的優化。至今 Hadoop 3.x 版本仍然在迭代更新。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"size","attrs":{"size":14}}],"text":"這篇文章就分享到這裏,感謝你的閱讀和收聽,如果你覺着這篇文章對你有所幫助的話,你可以分享給你更多的朋友喲~~"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"系統學習大數據技術入口:"},{"type":"link","attrs":{"href":"http://douma.ke.qq.com","title":""},"content":[{"type":"text","text":"http://douma.ke.qq.com"}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}}]}
發表評論
所有評論
還沒有人評論,想成為第一個評論的人麼? 請在上方評論欄輸入並且點擊發布.
相關文章