Someone Has Finally Explained Big Data Architecture Clearly

{"type":"doc","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"本文分享自百度開發者中心","attrs":{}},{"type":"link","attrs":{"href":"https://developer.baidu.com/article.html#/articleDetailPage?id=293510?from=010727","title":"","type":null},"content":[{"type":"text","text":"終於有人把大數據架構講明白了","attrs":{}}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"大數據技術其實是分佈式技術在數據處理領域的創新性應用,其本質和此前講到的分佈式技術思路一脈相承,即用更多的計算機組成一個集羣,提供更多的計算資源,從而滿足更大的計算壓力要求。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"大數據技術討論的是,如何利用更多的計算機滿足大規模的數據計算要求。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"大數據就是將各種數據統一收集起來進行計算,發掘其中的價值。這些數據,既包括數據庫的數據,也包括日誌數據,還包括專門採集的用戶行爲數據;既包括企業內部自己產生的數據,也包括從第三方採購的數據,還包括使用網絡爬蟲獲取的各種互聯網公開數據。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"面對如此龐大的數據,如何存儲、如何利用大規模的服務器集羣處理計算纔是大數據技術的核心。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"heading","attrs":{"align":null,"level":4},"content":[{"type":"text","text":"01 HDFS分佈式文件存儲架構","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"大規模的數據計算首先要解決的是大規模數據的存儲問題。如何將數百TB或數百PB的數據存儲起來,通過一個文件系統統一管理,這本身就是一項極大的挑戰。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"HDFS的架構,如圖31-1所示。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/cc/ccd0bdf04b7db0640002846956718e06.jpeg","alt":null,"title":"","style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":"","fromPaste":false,"pastePass":false}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"▲圖31-1 
HDFS can combine thousands of servers into a single unified file storage system. The NameNode server plays the role of the file control block and manages file metadata: file names, access permissions, the locations of the data, and so on. The actual file data is stored on the DataNode servers.

A DataNode stores data in blocks. All block information, such as the block ID and the IP address of the server holding the block, is recorded on the NameNode, while the block contents themselves are stored on the DataNodes. In principle, the NameNode can assign every block on every DataNode to a single file; in other words, one file can use the disk space of all the servers in the cluster.

In addition, to prevent a failed disk or server from corrupting a file, HDFS also replicates data blocks: every block is stored on several servers, and even across multiple racks.
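To make this division of labor between NameNode and DataNode concrete, here is a minimal sketch of reading a file through the HDFS Java API. The client asks the NameNode where the blocks live and then streams the bytes from the DataNodes; the cluster address and file path below are hypothetical.

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsReadExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Hypothetical NameNode address; in a real cluster this usually comes from core-site.xml.
        conf.set("fs.defaultFS", "hdfs://namenode:8020");

        try (FileSystem fs = FileSystem.get(conf);
             FSDataInputStream in = fs.open(new Path("/data/logs/2021-01-01.log"));
             BufferedReader reader = new BufferedReader(
                     new InputStreamReader(in, StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                System.out.println(line);   // block data is streamed from the DataNodes
            }
        }
    }
}
```

The application only ever sees one logical file; the mapping from file offsets to blocks and DataNodes stays hidden behind the FileSystem abstraction.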
02 MapReduce: Big Data Computing Architecture

The ultimate goal of storing data on HDFS is still computation: obtaining useful results through data analysis or machine learning. But if we treated HDFS like an ordinary file system, reading the data into a conventional program and then computing on it, a big data job that needs to process hundreds of terabytes in one pass would take practically forever.

The classic computing framework for big data processing is MapReduce. Its core idea is to compute on the data in shards. Since the data is already stored, block by block, across the many servers of the cluster, why not run distributed computation on each data block right where it lives?

Indeed, MapReduce can launch the same computing program on many servers of a distributed cluster. The program's process on each server reads the data blocks stored on that server and computes on them, so a huge amount of data can be processed in parallel. But this makes the data in each block independent of the others; what if the blocks need to be combined in a joint computation?

MapReduce splits the computation into two phases:

The first is the map phase. Several map processes are started on each server; each map preferentially reads local data, computes on it, and emits a collection of key-value pairs. The second is the reduce phase. MapReduce also starts several reduce processes on each server and performs a shuffle over all the map outputs. Shuffling means sending every pair with the same key to the same reduce process, and the reduce is where the key-based, cross-block aggregation is completed.

Let's use the classic WordCount program, which counts the frequency of every word across the whole data set, to see how map and reduce work together, as shown in Figure 31-2.

▲Figure 31-2 The MapReduce processing flow of the word-frequency program WordCount (image: https://static001.geekbang.org/infoq/46/46a97c1744e7cf1ecc7f4e9570254b99.jpeg)

Suppose the raw data consists of two blocks. The MapReduce framework starts two map processes, each reading its own block.

The map function tokenizes its input and emits a <word, 1> pair for every word. The MapReduce framework then performs the shuffle, sending pairs with the same key to the same reduce process, so the input to reduce has the form <word, <1, 1, 1, ...>>: the values of identical keys have been merged into one value list.

In this example the value list is simply a list of 1s. Reduce sums them, and the result is the frequency of each word.

The concrete MapReduce program is as follows:

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCount {

  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {

    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      // Split the input line into words and emit a <word, 1> pair for each one.
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);
      }
    }
  }

  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {

    private IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      // After the shuffle, all counts for the same word arrive together; sum them up.
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }
}
```
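The snippet above only defines the map and reduce logic. To actually run it, a job driver has to configure and submit the job; the following is a minimal sketch modeled on the standard Hadoop WordCount example, with hypothetical input and output paths passed on the command line.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCount.class);

        job.setMapperClass(WordCount.TokenizerMapper.class);
        job.setCombinerClass(WordCount.IntSumReducer.class);   // optional local pre-aggregation before the shuffle
        job.setReducerClass(WordCount.IntSumReducer.class);

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        // Hypothetical HDFS paths, e.g. /input/logs and /output/wordcount.
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```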
The above describes how the map and reduce processes cooperate to complete the data processing. But how are these processes started on the servers of a distributed cluster, and how does the data flow through them until the computation finishes? Let's look at this using MapReduce1 as an example, as shown in Figure 31-3.

▲Figure 31-3 The MapReduce1 processing flow (image: https://static001.geekbang.org/infoq/b2/b29062fd5e37359dce5da15ba5265ff6.jpeg)

MapReduce1 has two main process roles, JobTracker and TaskTracker. There is only one JobTracker in a MapReduce cluster, while a TaskTracker is started alongside the DataNode on every server in the cluster.

After a MapReduce application (the JobClient) starts, it submits the job to the JobTracker. Based on the input file paths of the job, the JobTracker works out which servers the map processes need to run on and sends task commands to the TaskTrackers on those servers.

When a TaskTracker receives a task, it starts a TaskRunner process to download the program for that task, loads the map function via reflection, reads the data blocks assigned to the task, and performs the map computation. When the map computation finishes, the TaskTracker shuffles the map output, and the TaskRunner loads the reduce function for the subsequent computation.

HDFS and MapReduce are both components of Hadoop.

03 Hive: Big Data Warehouse Architecture

Although MapReduce offers only the map and reduce functions, it can satisfy almost any big data analysis or machine learning workload. However, a complex computation may require several jobs, and these jobs have to be orchestrated according to their dependencies, which makes development fairly involved.

Traditionally, data analysis has mostly been done in SQL. If MapReduce jobs could be generated automatically from SQL, the barrier to applying big data technology to data analysis would drop dramatically.

Hive is exactly such a tool. Let's look at how Hive turns the following common SQL statement into a MapReduce computation:

```sql
SELECT pageid, age, count(1) FROM pv_users GROUP BY pageid, age;
```

This is a typical analytical SQL statement: it counts how often users of each age visit each page, revealing the page preferences of different age groups. Sample input data and the execution result are shown in Figure 31-4.

▲Figure 31-4 Sample input data and execution result of the SQL analysis (image: https://static001.geekbang.org/infoq/d0/d09ecc7bcda4bd77d38d2964d71ccf1c.jpeg)

Looking at this example, you will notice that the computation is very similar to WordCount. Indeed it is, and we can carry out this SQL with MapReduce, as shown in Figure 31-5.

▲Figure 31-5 Example of processing the SQL with MapReduce (image: https://static001.geekbang.org/infoq/81/8143fa9e4d4c0d8fbd28c25bdcd00d58.jpeg)

The key emitted by the map function is the row of the table (the pageid and age columns) and the value is 1; the reduce function counts identical rows, that is, it sums the values that share the same key, which yields the output of the SQL.
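That translation can be pictured as a WordCount variant whose key is the (pageid, age) combination. The following is an illustrative hand-written sketch of such a mapper and reducer, not the code Hive actually generates; the tab-separated input layout and field positions are assumptions made for the example.

```java
import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class PageAgeCount {

  // Emits <"pageid\tage", 1> for every row of pv_users.
  public static class GroupByMapper
      extends Mapper<LongWritable, Text, Text, IntWritable> {

    private final static IntWritable one = new IntWritable(1);
    private final Text groupKey = new Text();

    public void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      String[] fields = value.toString().split("\t");   // assumed layout: pageid \t age
      groupKey.set(fields[0] + "\t" + fields[1]);
      context.write(groupKey, one);
    }
  }

  // After the shuffle, all the 1s for the same (pageid, age) arrive together and are summed,
  // which is exactly count(1) ... GROUP BY pageid, age.
  public static class CountReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {

    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int count = 0;
      for (IntWritable v : values) {
        count += v.get();
      }
      context.write(key, new IntWritable(count));
    }
  }
}
```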
What Hive does is translate SQL into MapReduce program code. Internally, Hive has many built-in Operators, each of which implements one specific step of the computation. Hive assembles these Operators into a directed acyclic graph (DAG) and then, depending on whether a shuffle is needed between Operators, packs them into map or reduce functions, after which the result can be submitted to MapReduce for execution.

A DAG of Operators is shown in Figure 31-6; this one corresponds to a SQL statement with a WHERE clause, and the WHERE condition maps to a FilterOperator.

▲Figure 31-6 MapReduce DAG of the example SQL (image: https://static001.geekbang.org/infoq/84/84f1c127fc3da87f3da0d774b54ef24b.jpeg)

Hive's overall architecture is shown in Figure 31-7. Hive table data is stored in HDFS, while the table structure, such as the table name, column names, and column delimiters, is stored in the Metastore. A user submits SQL through a Client to the Driver; the Driver asks the Compiler to compile the SQL into a DAG execution plan like the one above, and the plan is then handed to Hadoop for execution.

▲Figure 31-7 Hive overall architecture (image: https://static001.geekbang.org/infoq/5d/5d53ac4785db332c46cf721365519048.jpeg)
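From the application's point of view, the whole pipeline in Figure 31-7 hides behind an ordinary SQL interface. A minimal sketch of submitting the example query through Hive's JDBC driver (HiveServer2) might look like the following; the connection URL and credentials are hypothetical, and the hive-jdbc dependency is assumed to be on the classpath.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveQueryExample {
    public static void main(String[] args) throws Exception {
        // Hypothetical HiveServer2 endpoint.
        String url = "jdbc:hive2://hiveserver:10000/default";

        try (Connection conn = DriverManager.getConnection(url, "hive", "");
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery(
                     "SELECT pageid, age, count(1) FROM pv_users GROUP BY pageid, age")) {
            while (rs.next()) {
                // Each result row is one (pageid, age) group and its count,
                // produced by the job Hive generated and submitted to the cluster.
                System.out.printf("%s\t%s\t%d%n",
                        rs.getString(1), rs.getString(2), rs.getLong(3));
            }
        }
    }
}
```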
04 Spark: Fast Big Data Computing Architecture

MapReduce mainly uses disk to store the intermediate data of a computation, which is reliable but slow.

Moreover, MapReduce programs can only be written with the map and reduce functions. They can express all kinds of big data computation, but programming with them is cumbersome, and because the map/reduce programming model is so simple, complex computations have to be composed from multiple MapReduce jobs, which raises the programming difficulty even further.

Spark improves on MapReduce by keeping intermediate computation data mainly in memory, which shortens execution time considerably; in some cases performance improves by a factor of a hundred or more.

Spark's main programming model is the RDD (Resilient Distributed Dataset). Many common big data computing functions are defined on RDDs, and with them quite complex computations can be expressed in very little code. The WordCount example above takes only three lines of code when written in Spark:

```scala
val textFile = sc.textFile("hdfs://...")
val counts = textFile.flatMap(line => line.split(" "))
                 .map(word => (word, 1))
                 .reduceByKey(_ + _)
counts.saveAsTextFile("hdfs://...")
```

First, data is read from HDFS to build the RDD textFile. Then three operations are performed on this RDD:

first, each line of the input text is split into words on spaces; second, each word is transformed, word → (word, 1), producing a key-value pair structure; third, pairs with the same key are aggregated by summing their values. Finally, the resulting RDD counts is written back to HDFS, completing the output.

flatMap, map, and reduceByKey in the code above are all Spark RDD transformation functions. The result of an RDD transformation is again an RDD, so the three calls can be chained into a single expression, and what comes out is still an RDD.

Spark generates an execution plan for the computation from the transformation functions in the program, and this execution plan is a DAG. Spark can complete very complex big data computations within a single job; a Spark DAG example is shown in Figure 31-8.

▲Figure 31-8 Example of a Spark RDD DAG (image: https://static001.geekbang.org/infoq/ca/cabcee882e73cba247891aa9430c870f.jpeg)

In Figure 31-8, A, C, and E are RDDs loaded from HDFS. A is transformed by groupBy into RDD B; C is transformed by map into RDD D; D and E are combined by the union transformation into RDD F; and B and F are combined by the join transformation into the final result, RDD G.
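As a rough illustration of how such a DAG arises from ordinary transformation calls, the sketch below builds the same shape of lineage (groupBy, map, union, join) with Spark's Java API; the file paths, key-extraction logic, and the per-row map are made up purely for this example and are not the computation behind Figure 31-8.

```java
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import scala.Tuple2;

public class DagExample {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("dag-example");
        try (JavaSparkContext sc = new JavaSparkContext(conf)) {
            JavaRDD<String> a = sc.textFile("hdfs://.../A");
            JavaRDD<String> c = sc.textFile("hdfs://.../C");
            JavaRDD<String> e = sc.textFile("hdfs://.../E");

            // A --groupBy--> B : group rows of A by their first comma-separated field
            JavaPairRDD<String, Iterable<String>> b = a.groupBy(row -> row.split(",")[0]);

            // C --map--> D : a simple per-row transformation
            JavaRDD<String> d = c.map(String::trim);

            // D + E --union--> F
            JavaRDD<String> f = d.union(e);

            // B + F --join--> G : join on the first field of F's rows
            JavaPairRDD<String, Tuple2<Iterable<String>, String>> g =
                    b.join(f.mapToPair(row -> new Tuple2<>(row.split(",")[0], row)));

            // Nothing has executed yet; the calls above only build the DAG.
            // The action below triggers Spark to schedule and run it.
            g.saveAsTextFile("hdfs://.../G");
        }
    }
}
```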
05 Big Data Stream Computing Architecture

Although Spark is much faster than MapReduce, in most scenarios a job still takes minutes to finish; this kind of computation is generally called big data batch processing. In practice, however, there are cases where an endless stream of incoming data must be processed within milliseconds, for example monitoring and analyzing camera feeds in real time. This is what is known as big data stream computing.

The best-known early streaming engine was Storm. Later, as Spark took off, Spark's streaming engine, Spark Streaming, also became popular. The architectural idea behind Spark Streaming is to cut the incoming real-time stream into small batches of data and hand these mini-batches to Spark for execution.

Because each batch of data is small and Spark Streaming is resident in the system and does not need to be restarted for every batch, a batch can be finished in milliseconds, so it looks very much like real-time computation, as shown in Figure 31-9.

▲Figure 31-9 Spark Streaming turns a real-time data stream into small batch computations (image: https://static001.geekbang.org/infoq/56/56695c7d9b8c37d0a77e079102b98292.jpeg)
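For a sense of what this mini-batch model looks like in code, here is a minimal word-count sketch using Spark Streaming's Java API, closely following the standard example from the Spark documentation: the stream is cut into one-second batches and each batch is processed with the familiar RDD-style operations. The socket source on localhost:9999 is just a placeholder data source.

```java
import java.util.Arrays;

import org.apache.spark.SparkConf;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaDStream;
import org.apache.spark.streaming.api.java.JavaPairDStream;
import org.apache.spark.streaming.api.java.JavaReceiverInputDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;
import scala.Tuple2;

public class StreamingWordCount {
    public static void main(String[] args) throws InterruptedException {
        SparkConf conf = new SparkConf().setAppName("streaming-word-count");
        // Cut the incoming stream into batches of one second each.
        JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.seconds(1));

        // Placeholder source: lines of text arriving on a TCP socket.
        JavaReceiverInputDStream<String> lines = jssc.socketTextStream("localhost", 9999);

        JavaDStream<String> words =
                lines.flatMap(line -> Arrays.asList(line.split(" ")).iterator());
        JavaPairDStream<String, Integer> counts =
                words.mapToPair(word -> new Tuple2<>(word, 1))
                     .reduceByKey((x, y) -> x + y);

        counts.print();          // per-batch word counts

        jssc.start();            // start receiving and processing the stream
        jssc.awaitTermination();
    }
}
```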
Flink, a big data engine that has become very popular in recent years, is quite similar to Spark Streaming in its architectural principles. It can work against different data sources and, depending on the data volume and the requirements of the computation, adapt flexibly to both stream computing and batch processing.

06 Summary

Big data technology can be regarded as a branch of distributed technology: both face heavy computing loads and both solve the problem with clusters of distributed servers. The difference is that the data handled by big data technology is interrelated, so a central server is needed to coordinate it; the NameNode and the JobTracker are examples of such central servers.

High-concurrency distributed internet systems, by contrast, usually avoid such central servers in their architecture in order to improve availability, remove the potential bottleneck a central server represents, and improve performance.

Author: Li Zhihui (李智慧)

Source: 大數據DT (ID: hzdashuju)

About the author: Li Zhihui is a senior architecture expert and chief architect for transportation at Tongcheng Travel (同程旅行). He has worked as an architect at NEC, Alibaba, Intel, and other well-known companies, and served as CTO at companies such as WiFi萬能鑰匙 (WiFi Master Key). He has long been engaged in big data and large-scale website architecture and development, has led the architecture design of several internet systems with more than ten million daily active users, and has designed and built many kinds of software, including web server firewalls, distributed NoSQL systems, data warehouse engines, and reactive programming frameworks.

>>We look forward to your joining us

>>The Baidu Developer Center has opened its call for submissions. Developers are welcome to submit articles via 了不起的開發者 (https://developer.baidu.com/article.html#/articleDetailPage?id=293357?from=0707?from=010727); high-quality articles will receive generous rewards and promotion resources.