Book of the Week: Big Data Analytics with Spark and Hadoop

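Big Data Analytics with Spark and Hadoop gives a fairly systematic account of how to perform big data analytics with Hadoop, Spark, and the tools in their ecosystems. It covers the fundamentals of Apache Spark and Hadoop and explores all the Spark components in depth: Spark Core, Spark SQL, DataFrame, Dataset, Spark Streaming, Structured Streaming, MLlib, and GraphX, as well as the core Hadoop components (HDFS, MapReduce, and YARN). With detailed implementation examples throughout, it is a thorough reference for quickly getting up to speed on big data analytics infrastructure and how to implement it.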

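The book has 10 chapters. Chapter 1 explains the concept of big data analytics from a high-level perspective and introduces the tools and techniques used on the Hadoop and Spark platforms, along with some common use cases. Chapter 2 covers the fundamentals of the Hadoop and Spark platforms. Chapter 3 takes a deep dive into Spark. Chapter 4 mainly introduces the Data Sources API, the DataFrame API, and the new Dataset API. Chapter 5 explains how to perform real-time analytics with Spark Streaming. Chapter 6 introduces notebooks and dataflows for use with Spark and Hadoop. Chapter 7 covers machine learning techniques on Spark and Hadoop. Chapter 8 shows how to build recommendation systems. Chapter 9 shows how to perform graph analytics with GraphX. Chapter 10 shows how to use SparkR.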

Contents:

Chapter 1: A High-Level View of Big Data Analytics

1.1 Big data analytics and the role of Hadoop and Spark

1.1.1 Life cycle of a typical big data analytics project

1.1.2 The role of Hadoop and Spark

1.2 Big data science and the role of Hadoop and Spark

1.2.1 The fundamental shift from data analytics to data science

1.2.2 Life cycle of a typical data science project

1.2.3 The role of Hadoop and Spark

1.3 Tools and techniques

1.4 Real-world use cases

1.5 Summary

Chapter 2: Getting Started with Apache Hadoop and Apache Spark

2.1 Overview of Apache Hadoop

2.1.1 Hadoop Distributed File System

2.1.2 Features of HDFS

2.1.3 MapReduce

2.1.4 Features of MapReduce

2.1.5 MapReduce v1 versus MapReduce v2

2.1.6 YARN

2.1.7 Storage options on Hadoop

2.2 Overview of Apache Spark

2.2.1 History of Spark

2.2.2 What Apache Spark is

2.2.3 What Apache Spark is not

2.2.4 Problems with MapReduce

2.2.5 Architecture of Spark

2.3 Why use Hadoop and Spark together

2.3.1 Features of Hadoop

2.3.2 Features of Spark

2.4 Installing Hadoop and Spark clusters

2.5 Summary

Chapter 3: A Deep Dive into Apache Spark

3.1 Starting the Spark daemons

3.1.1 Using CDH

3.1.2 Using HDP, MapR, and Spark pre-built packages

3.2 Learning the core concepts of Spark

3.2.1 Ways to work with Spark

3.2.2 Resilient Distributed Datasets

3.2.3 Spark context

3.2.4 Transformations and actions

3.2.5 Parallelism in RDDs

3.2.6 Lazy evaluation

3.2.7 Lineage graph

3.2.8 Serialization

3.2.9 Using Hadoop file formats in Spark

3.2.10 Data locality

3.2.11 Shared variables

3.2.12 Pair RDDs

3.3 Life cycle of a Spark program

3.3.1 Pipelining

3.3.2 Summary of Spark execution

3.4 Spark applications

3.4.1 Spark Shell and Spark applications

3.4.2 Creating a Spark context

3.4.3 SparkConf

3.4.4 SparkSubmit

3.4.5 Precedence order of Spark configuration settings

3.4.6 Important application configurations

3.5 Persistence and caching

3.5.1 Storage levels

3.5.2 Which storage level to choose

3.6 Spark resource managers: Standalone, YARN, and Mesos

3.6.1 Local and cluster modes

3.6.2 Cluster resource managers

3.7 Summary

Chapter 4: Big Data Analytics with Spark SQL, DataFrame, and Dataset

4.1 History of Spark SQL

4.2 Architecture of Spark SQL

4.3 Introducing the four components of Spark SQL

4.4 Evolution of DataFrame and Dataset

4.4.1 What is wrong with RDDs

4.4.2 RDD transformations versus Dataset and DataFrame transformations

4.5 Why use Dataset and DataFrame

4.5.1 Optimization

4.5.2 Speed

4.5.3 Automatic schema discovery

4.5.4 Multiple data sources, multiple programming languages

4.5.5 Interoperability between RDDs and other APIs

4.5.6 Selecting and reading only the necessary data

4.6 When to use RDD, Dataset, and DataFrame

4.7 Analytics with DataFrame

4.7.1 Creating a SparkSession

4.7.2 Creating a DataFrame

4.7.3 Converting a DataFrame to an RDD

4.7.4 Common Dataset/DataFrame operations

4.7.5 Caching data

4.7.6 Performance optimization

4.8 Analytics with the Dataset API

4.8.1 Creating a Dataset

4.8.2 Converting a DataFrame to a Dataset

4.8.3 Accessing metadata using the catalog

4.9 Data Sources API

4.9.1 Read and write functions

4.9.2 Built-in data sources

4.9.3 External data sources

4.10 Using Spark SQL as a distributed SQL engine

4.10.1 Using the Spark SQL Thrift server for JDBC/ODBC access

4.10.2 Querying data with the beeline client

4.10.3 Querying data from Hive with the spark-sql CLI

4.10.4 Integration with BI tools

4.11 Hive on Spark

4.12 Summary

Chapter 5: Real-Time Analytics with Spark Streaming and Structured Streaming

5.1 Overview of real-time processing

5.1.1 Pros and cons of Spark Streaming

5.1.2 History of Spark Streaming

5.2 Architecture of Spark Streaming

5.2.1 Spark Streaming application flow

5.2.2 Stateless and stateful stream processing

5.3 Spark Streaming transformations and actions

5.3.1 union

5.3.2 join

5.3.3 transform operation

5.3.4 updateStateByKey

5.3.5 mapWithState

5.3.6 Window operations

5.3.7 Output operations

5.4 Input data sources and output stores

5.4.1 Basic data sources

5.4.2 Advanced data sources

5.4.3 Custom data sources

5.4.4 Receiver reliability

5.4.5 Output stores

5.5 Spark Streaming with Kafka and HBase

5.5.1 Receiver-based approach

5.5.2 Direct approach (no receivers)

5.5.3 Integration with HBase

5.6 Advanced concepts of Spark Streaming

5.6.1 Using DataFrame

5.6.2 MLlib operations

5.6.3 Caching/persistence

5.6.4 Fault tolerance in Spark Streaming

5.6.5 Performance tuning of Spark Streaming applications

5.7 Monitoring applications

5.8 Overview of Structured Streaming

5.8.1 Workflow of a Structured Streaming application

5.8.2 Streaming Dataset and streaming DataFrame

5.8.3 Operations on streaming Dataset and streaming DataFrame

5.9 Summary

Chapter 6: Notebooks and Dataflows with Spark and Hadoop

6.1 Overview of web-based notebooks

6.2 Overview of Jupyter

6.2.1 Installing Jupyter

6.2.2 Analytics with Jupyter

6.3 Overview of Apache Zeppelin

6.3.1 Jupyter versus Zeppelin

6.3.2 Installing Apache Zeppelin

6.3.3 Analytics with Zeppelin

6.4 The Livy REST job server and Hue Notebook

6.4.1 Installing and setting up the Livy server and Hue

6.4.2 Using the Livy server

6.4.3 Using Livy with Hue Notebook

6.4.4 Using Livy with Zeppelin

6.5 Overview of Apache NiFi for dataflows

6.5.1 Installing Apache NiFi

6.5.2 Using NiFi for dataflows and analytics

6.6 Summary

Chapter 7: Machine Learning with Spark and Hadoop

7.1 Overview of machine learning

7.2 Machine learning on Spark and Hadoop

7.3 Machine learning algorithms

7.3.1 Supervised learning

7.3.2 Unsupervised learning

7.3.3 Recommendation systems

7.3.4 Feature extraction and transformation

7.3.5 Optimization

7.3.6 Spark MLlib data types

7.4 An example of a machine learning algorithm

7.5 Building machine learning pipelines

7.5.1 An example of a pipeline workflow

7.5.2 Building an ML pipeline

7.5.3 Saving and loading models

7.6 Machine learning with H2O and Spark

7.6.1 Why use Sparkling Water

7.6.2 An application flow on YARN

7.6.3 Getting started with Sparkling Water

7.7 Overview of Hivemall

7.8 Overview of Hivemall for Spark

7.9 Summary

Chapter 8: Building Recommendation Systems with Spark and Mahout

8.1 Building recommendation systems

8.1.1 Content-based filtering

8.1.2 Collaborative filtering

8.2 Limitations of recommendation systems

8.3 Implementing a recommendation system with MLlib

8.3.1 Preparing the environment

8.3.2 Creating RDDs

8.3.3 Exploring the data with DataFrame

8.3.4 Creating training and test datasets

8.3.5 Creating a model

8.3.6 Making predictions

8.3.7 Evaluating the model with test data

8.3.8 Checking the accuracy of the model

8.3.9 Explicit and implicit feedback

8.4 Integrating Mahout and Spark

8.4.1 Installing Mahout

8.4.2 Exploring the Mahout shell

8.4.3 Building a universal recommendation system with Mahout and a search tool

8.5 Summary

Chapter 9: Graph Analytics with GraphX

9.1 Overview of graph processing

9.1.1 What a graph is

9.1.2 Graph databases and graph processing systems

9.1.3 Overview of GraphX

9.1.4 Graph algorithms

9.2 Getting started with GraphX

9.2.1 Basic operations of GraphX

9.2.2 Graph transformations

9.2.3 GraphX algorithms

9.3 Analyzing flight data with GraphX

9.4 Overview of GraphFrames

9.4.1 Motif finding

9.4.2 Loading and saving GraphFrames

9.5 Summary

Chapter 10: Interactive Analytics with SparkR

10.1 Overview of the R language and SparkR

10.1.1 What the R language is

10.1.2 Overview of SparkR

10.1.3 Architecture of SparkR

10.2 Getting started with SparkR

10.2.1 Installing and configuring R

10.2.2 Using the SparkR shell

10.2.3 Using SparkR scripts

10.3 Using DataFrame in SparkR

10.4 Using SparkR in RStudio

10.5 Machine learning with SparkR

10.5.1 Using the naive Bayes model

10.5.2 Using the k-means model

10.6 Using SparkR in Zeppelin

10.7 Summary

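To get the download link, visit the official website of the Training Center of the Institute of Computing Technology, Chinese Academy of Sciences (http://www.tcict.cn/) and add the WeChat customer service account listed there to request it.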
