Spark
1. Overview
Apache Spark™ is a unified analytics engine for large-scale data processing.
Features
- Speed: run workloads up to 100x faster.
- Ease of use: write applications quickly in Java, Scala, Python, R, and SQL.
- Generality: combine SQL, streaming, and complex analytics.
- Runs everywhere: Spark runs on Hadoop, Apache Mesos, Kubernetes, standalone, or in the cloud, and can access diverse data sources.
2. Environment Setup
Note: Spark supports four cluster managers: Standalone, Apache Mesos, Hadoop YARN, and Kubernetes.
The setup below builds a Standalone cluster.
Requirements
- Linux/macOS
- JDK
- Scala
- Hadoop HDFS (the version must match the Spark build)
Download
https://spark.apache.org/downloads.html
Install
gaozhy@gaozhydeMacBook-Pro ~ tar -zxvf Downloads/spark-2.4.3-bin-hadoop2.7.tgz -C software/
Configure
gaozhy@gaozhydeMacBook-Pro ~ cd software/spark-2.4.3-bin-hadoop2.7
gaozhy@gaozhydeMacBook-Pro ~ cp conf/spark-env.sh.template conf/spark-env.sh
gaozhy@gaozhydeMacBook-Pro ~ vim conf/spark-env.sh
# Add the following to the config file
export SPARK_MASTER_HOST=spark
export SPARK_MASTER_PORT=7077
gaozhy@gaozhydeMacBook-Pro ~/software/spark-2.4.3-bin-hadoop2.7 cp conf/slaves.template conf/slaves
gaozhy@gaozhydeMacBook-Pro ~/software/spark-2.4.3-bin-hadoop2.7 vim conf/slaves
# Replace localhost in the file with spark
spark
gaozhy@gaozhydeMacBook-Pro ~/software/spark-2.4.3-bin-hadoop2.7 sudo vim /etc/hosts
# Append to the end of the file
127.0.0.1 spark
gaozhy@gaozhydeMacBook-Pro ~/software/spark-2.4.3-bin-hadoop2.7 sudo vim /etc/profile
# Add the Spark environment variables
export SCALA_HOME=/Users/gaozhy/software/scala-2.11.8
export SPARK_HOME=/Users/gaozhy/software/spark-2.4.3-bin-hadoop2.7
export PATH=$PATH:$SCALA_HOME/bin:$SPARK_HOME/bin:$SPARK_HOME/sbin
gaozhy@gaozhydeMacBook-Pro ~/software/spark-2.4.3-bin-hadoop2.7 source /etc/profile
Start the Spark services
gaozhy@gaozhydeMacBook-Pro ~/software/spark-2.4.3-bin-hadoop2.7 start-all.sh
Note: to stop the services, run
stop-all.sh
Spark Web UI
URL: http://spark:8080
3. Quick Start
Interactive analysis with the Spark shell
Start the Spark shell
gaozhy@gaozhydeMacBook-Pro ~ spark-shell
Basic operations
Prepare a data file at /Users/gaozhy/test.json:
{"id":1,"name":"zs","sex":"男"}
{"id":2,"name":"ls","sex":"女"}
{"id":3,"name":"ww","sex":"男"}
Operations in the Spark shell
scala> val dataset = spark.read.json("/Users/gaozhy/test.json")
dataset: org.apache.spark.sql.DataFrame = [id: bigint, name: string ... 1 more field]
Note: spark.read.json returns a DataFrame, which in Spark 2.x is an alias for Dataset[Row] — a distributed collection of data organized into named columns (the underlying distributed abstraction is the RDD).
DataFrame operations
# Number of records
scala> dataset.count()
res0: Long = 3
# First record in the dataset
scala> dataset.first
res1: org.apache.spark.sql.Row = [1,zs,男]
# Number of male records
scala> dataset.filter(row => row(2).equals("男")).count()
res4: Long = 2
# Count male and female users
scala> dataset.rdd.map(row => (row(2),1)).reduceByKey(_+_).saveAsTextFile("/Users/gaozhy/result2")
# Output written to /Users/gaozhy/result2:
(男,2)
(女,1)
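The map/reduceByKey step above pairs each row's gender with a 1 and then sums the 1s per key. As a sanity check on that logic, the same per-key summation can be sketched with plain Scala collections (the sample values below are hard-coded to mirror the three records in test.json; Spark itself is not needed for this illustration):

```scala
// Genders of the three records in test.json
val genders = Seq("男", "女", "男")

// Equivalent of rdd.map(row => (row(2), 1)).reduceByKey(_ + _),
// expressed with Scala collections: pair each value with 1,
// group by key, then sum the 1s per group
val counts: Map[String, Int] =
  genders.map(g => (g, 1))
    .groupBy(_._1)
    .map { case (k, pairs) => (k, pairs.map(_._2).sum) }

// counts("男") == 2, counts("女") == 1
```

In Spark the same combine-per-key happens in parallel across partitions, which is why reduceByKey takes an associative function.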
Analysis via the Spark API
Create a Maven project and add the dependencies
<dependencies>
    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-core_2.11</artifactId>
        <version>2.4.3</version>
    </dependency>
    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-sql_2.11</artifactId>
        <version>2.4.3</version>
    </dependency>
</dependencies>
Test code
import org.apache.spark.sql.SparkSession
import org.apache.spark.{SparkConf, SparkContext}

object SimpleApp2 {
  def main(args: Array[String]): Unit = {
    // Submit the job to the Spark cluster
    val sparkConf = new SparkConf()
      .setMaster("spark://spark:7077")
      .setAppName("simple app")
      .setJars(Array("/Users/gaozhy/workspace/20180429/spark01/target/spark01-1.0-SNAPSHOT.jar"))
    // Local mode (for testing):
    // val sparkConf = new SparkConf().setMaster("local[*]").setAppName("simple app")
    val sparkContext = new SparkContext(sparkConf)
    val spark = SparkSession.builder.getOrCreate
    val dataset = spark.read.json("/Users/gaozhy/test.json")
    dataset.rdd.map(row => (row(2), 1)).reduceByKey(_ + _).saveAsTextFile("/Users/gaozhy/result2")
    spark.stop()
  }
}
Result
Analysis by submitting a jar
Package the Spark application as a jar
Submit the job with spark-submit
gaozhy@gaozhydeMacBook-Pro ~ spark-submit --class "SimpleApp2" --master spark://spark:7077 /Users/gaozhy/workspace/20180429/spark01/target/spark01-1.0-SNAPSHOT.jar
Official example (computing Pi)
Submit the job
gaozhy@gaozhydeMacBook-Pro ~/software/spark-2.4.3-bin-hadoop2.7 bin/spark-submit \
--class org.apache.spark.examples.SparkPi \
--master spark://spark:7077 \
--executor-memory 1G \
--total-executor-cores 2 \
./examples/jars/spark-examples_2.11-2.4.3.jar \
100
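SparkPi estimates π by Monte Carlo sampling: it throws random points into the square [-1, 1] × [-1, 1] and counts the fraction that land inside the unit circle, so 4 × count / n ≈ π. A minimal single-machine sketch of the same idea (without Spark's parallelism — in the real example the sampling is distributed across executors, with the `100` argument controlling the number of partitions):

```scala
import scala.util.Random

// Monte Carlo estimate of Pi, the same idea as
// org.apache.spark.examples.SparkPi, but on one machine.
val rng = new Random(42L) // fixed seed so the run is repeatable
val n = 100000

// Count points (x, y) in [-1, 1]^2 that fall inside the unit circle
val inside = (1 to n).count { _ =>
  val x = rng.nextDouble() * 2 - 1
  val y = rng.nextDouble() * 2 - 1
  x * x + y * y <= 1
}

val piEstimate = 4.0 * inside / n
println(f"Pi is roughly $piEstimate%.4f")
```

More samples (a larger n, or more partitions in the Spark version) give a tighter estimate, which is why the distributed version scales the work across the cluster.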