Big Data Fundamentals: How Spark Works and Its Basic Concepts

{"type":"doc","content":[{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"I. Spark Overview and Ecosystem"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Spark is a general-purpose distributed parallel computing framework open-sourced by UC Berkeley's AMPLab, and it is now a top-level project of the Apache Software Foundation. The reasons to learn Spark can be summarized in the following three points:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.infoq.cn\/resource\/image\/f3\/54\/f3a9e785b965ec58d6b9537390615554.png","alt":null,"title":null,"style":null,"href":null,"fromPaste":true,"pastePass":false}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"1. Advantages of Spark over Hadoop"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"(1) High performance"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Spark retains the advantages of Hadoop MapReduce (MR). Hadoop MR writes the intermediate results of each job to disk (to HDFS between chained jobs), whereas Spark can keep intermediate results in memory and process the data there, avoiding much of that disk I/O."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"(2) High fault tolerance"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Lineage-based data recovery: Spark introduces the Resilient Distributed Dataset (RDD) abstraction, a read-only collection of data partitioned across a set of nodes. RDDs are resilient and depend on one another, so if part of a dataset is lost, it can be rebuilt from its lineage, the recorded chain of operations that produced it."}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Checkpoint fault tolerance: RDD computations can also be protected by checkpointing. Fault tolerance generally takes one of two forms: replicating data, or logging update operations. The doCheckpoint method on an RDD corresponds to the first form, saving a redundant copy of the data, while lineage achieves fault tolerance by logging coarse-grained update operations. Checkpointing complements lineage: it keeps recovery from becoming too expensive when the lineage grows too long."}]}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"(3) Generality"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Spark is a general-purpose big data computing framework that covers a wider range of use cases than Hadoop."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Beyond the two operations of Hadoop MapReduce, map and reduce, Spark offers a richer set of operations, divided into actions (collect, reduce, save…) and transformations (map, union, join, filter…). In the inter-node communication model, beyond Hadoop's shuffle, Spark also supports partitioning, control over the storage of intermediate results, materialized views, and so on."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"2. The Spark Ecosystem"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.infoq.cn\/resource\/image\/42\/e5\/429375f11b37d89eef77e32fe9becae5.png","alt":null,"title":null,"style":null,"href":null,"fromPaste":true,"pastePass":false}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Spark supports multiple programming languages, including Java, Python, R, and Scala. At the resource scheduling layer it supports local mode, standalone mode, YARN mode, Kubernetes (k8s), and others."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}}]}
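To make the ideas of lineage, transformations, and actions from section 1 more concrete, here is a minimal toy sketch in plain Python. It is not Spark's API: the class `ToyRDD` and its methods `lose_cache` and `lineage` are hypothetical names invented for illustration, and the data lives in an ordinary in-memory list rather than in partitions spread across nodes. The sketch only shows the general pattern: transformations lazily record how a dataset is derived, actions trigger the computation, and a lost in-memory result can be rebuilt by replaying the recorded lineage.

```python
# Toy model only: ToyRDD is a hypothetical class, NOT part of PySpark.
import functools


class ToyRDD:
    """A read-only dataset that remembers how it was derived (its lineage)."""

    def __init__(self, source=None, parent=None, op=None, name="source"):
        self._source = list(source) if source is not None else None
        self._parent = parent   # the dataset this one depends on
        self._op = op           # function turning the parent's rows into ours
        self._name = name
        self._cache = None      # in-memory copy of the result, if any

    # Transformations: lazy, they only extend the lineage.
    def map(self, f):
        return ToyRDD(parent=self, op=lambda rows: [f(x) for x in rows], name="map")

    def filter(self, pred):
        return ToyRDD(parent=self, op=lambda rows: [x for x in rows if pred(x)],
                      name="filter")

    def lineage(self):
        """Return the chain of operation names from the source to this dataset."""
        chain = [] if self._parent is None else self._parent.lineage()
        return chain + [self._name]

    def _compute(self):
        # Use the in-memory copy if present, otherwise replay the lineage.
        if self._cache is not None:
            return self._cache
        if self._parent is None:
            return list(self._source)
        return self._op(self._parent._compute())

    def cache(self):
        """Keep the computed rows in memory for later reuse."""
        self._cache = self._compute()
        return self

    def lose_cache(self):
        """Simulate losing the in-memory copy (hypothetical helper)."""
        self._cache = None

    # Actions: they trigger the actual computation.
    def collect(self):
        return list(self._compute())

    def reduce(self, f):
        return functools.reduce(f, self._compute())


numbers = ToyRDD(source=range(1, 6))
evens_squared = numbers.map(lambda x: x * x).filter(lambda x: x % 2 == 0).cache()

print(evens_squared.lineage())   # ['source', 'map', 'filter']
evens_squared.lose_cache()       # the cached result is gone...
print(evens_squared.collect())   # ...but lineage rebuilds it: [4, 16]
```

In this sketch nothing is computed when `map` or `filter` is called; work happens only in `cache`, `collect`, or `reduce`. A real checkpoint would additionally write the data to reliable storage and cut the lineage at that point, which this toy model does not attempt.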