終於有人把大數據架構講明白了
{"type":"doc","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"本文分享自百度開發者中心","attrs":{}},{"type":"link","attrs":{"href":"https://developer.baidu.com/article.html#/articleDetailPage?id=293510?from=010727","title":"","type":null},"content":[{"type":"text","text":"終於有人把大數據架構講明白了","attrs":{}}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"大數據技術其實是分佈式技術在數據處理領域的創新性應用,其本質和此前講到的分佈式技術思路一脈相承,即用更多的計算機組成一個集羣,提供更多的計算資源,從而滿足更大的計算壓力要求。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"大數據技術討論的是,如何利用更多的計算機滿足大規模的數據計算要求。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"大數據就是將各種數據統一收集起來進行計算,發掘其中的價值。這些數據,既包括數據庫的數據,也包括日誌數據,還包括專門採集的用戶行爲數據;既包括企業內部自己產生的數據,也包括從第三方採購的數據,還包括使用網絡爬蟲獲取的各種互聯網公開數據。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"面對如此龐大的數據,如何存儲、如何利用大規模的服務器集羣處理計算纔是大數據技術的核心。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"heading","attrs":{"align":null,"level":4},"content":[{"type":"text","text":"01 HDFS分佈式文件存儲架構","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"大規模的數據計算首先要解決的是大規模數據的存儲問題。如何將數百TB或數百PB的數據存儲起來,通過一個文件系統統一管理,這本身就是一項極大的挑戰。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"HDFS的架構,如圖31-1所示。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/cc/ccd0bdf04b7db0640002846956718e06.jpeg","alt":null,"title":"","style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":"","fromPaste":false,"pastePass":false}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"▲圖31-1 HDFS架構","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"HDFS可以將數千臺服務器組成一個統一的文件存儲系統,其中NameNode服務器充當文件控制塊的角色,進行文件元數據管理,即記錄文件名、訪問權限、數據存儲地址等信息,而真正的文件數據則存儲在DataNode服務器上。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"DataNode以塊爲單位存儲數據,所有的塊信息,比如塊ID、塊所在的服務器IP地址等,都記錄在NameNode服務器上,而具體的塊數據則存儲在DataNode服務器上。理論上,NameNode可以將所有DataNode服務器上的所有數據塊都分配給一個文件,也就是說,一個文件可以使用所有服務器的硬盤存儲空間。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"此外,HDFS爲了保證不會因爲硬盤或者服務器損壞而導致文件損壞,還會對數據塊進行復制,每個數據塊都會存儲在多臺服務器上,甚至多個機架上。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"heading","attrs":{"align":null,"level":4},"content":[{"type":"text","text":"02 MapReduce大數據計算架構","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"數據存儲在HDFS上的最終目標還是爲了計算,通過數據分析或者機器學習獲得有益的結果。但是如果像傳統的應用程序那樣把HDFS當作普通文件,從文件中讀取數據後進行計算,那麼對於需要一次計算數百TB數據的大數據計算場景,就不知道要算到什麼時候了。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"大數據處理的經典計算框架是MapReduce。MapReduce的核心思想是對數據進行分片計算。既然數據是以塊爲單位分佈存儲在很多臺服務器組成的集羣上的,那麼能不能就在這些服務器上針對每個數據塊進行分佈式計算呢?","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"事實上,MapReduce可以在分佈式集羣的多臺服務器上啓動同一個計算程序,每個服務器上的程序進程都可以讀取本服務器上要處理的數據塊進行計算,因此,大量的數據就可以同時進行計算了。但是這樣一來,每個數據塊的數據都是獨立的,如果這些數據塊需要進行關聯計算怎麼辦?","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"MapReduce將計算過程分成兩個部分:","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"一部分是map過程,每個服務器上會啓動多個map進程,map優先讀取本地數據進行計算,計算後輸出一個集合;另一部分是reduce過程,MapReduce在每個服務器上都會啓動多個reduce進程,然後對所有map輸出的集合進行shuffle操作。所謂的shuffle就是將相同的key發送到同一個reduce進程中,在reduce中完成數據關聯計算。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"下面以經典的WordCount,即統計所有數據中相同單詞的詞頻數據爲例,來認識map和reduce的處理過程,如圖31-2所示。","attrs":{}}]},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/46/46a97c1744e7cf1ecc7f4e9570254b99.jpeg","alt":null,"title":"","style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":"","fromPaste":false,"pastePass":false}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"▲圖31-2 詞頻統計程序WordCount的MapReduce處理過程","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"假設原始數據有兩個數據塊,MapReduce框架啓動了兩個map進程進行處理,它們分別讀入數據。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"map函數會對輸入數據進行分詞處理,然後針對每個單詞輸出這樣的結果。然後MapReduce框架進行shuffle操作,相同的key發送給同一個reduce進程,reduce的輸入就是這樣的結構,即相同key的value合併成了一個value列表。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"在這個示例中,這個value列表就是由很多個1組成的列表。reduce對這些1進行求和操作,就得到每個單詞的詞頻結果了。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"具體的MapReduce程序如下:","attrs":{}}]},{"type":"codeblock","attrs":{"lang":"text"},"content":[{"type":"text","text":"public class WordCount {\n\n public static class TokenizerMapper\n extends Mapper
發表評論
所有評論
還沒有人評論,想成為第一個評論的人麼? 請在上方評論欄輸入並且點擊發布.