Learning Spark (Part 1): Overview, Environment Setup, Basic Operations

Spark

1. Overview

http://spark.apache.org/

Apache Spark™ is a unified analytics engine for large-scale data processing.

Features

  • Fast: Run workloads 100x faster (the project's own headline claim).
  • Easy to use: Write applications quickly in Java, Scala, Python, R, and SQL.
  • General: Combine SQL, streaming, and complex analytics.
  • Runs everywhere: Spark runs on Hadoop, Apache Mesos, Kubernetes, standalone, or in the cloud. It can access diverse data sources.

2. Environment Setup

Note: Spark supports four cluster types: Standalone, Apache Mesos, Hadoop YARN, and Kubernetes.

The setup below is for a Standalone cluster.
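
Each cluster type is selected through the master URL given to Spark. The following is a minimal sketch of the URL formats as I understand them from the Spark documentation; the object name MasterUrls, the helper masterUrl, and the host and port values are illustrative assumptions, not part of Spark.

```scala
object MasterUrls {
  // The four cluster managers listed above
  sealed trait ClusterManager
  case object Standalone extends ClusterManager
  case object Mesos extends ClusterManager
  case object Yarn extends ClusterManager
  case object Kubernetes extends ClusterManager

  // Build the master URL string for a given cluster manager.
  // host and port are placeholders supplied by the caller.
  def masterUrl(manager: ClusterManager, host: String, port: Int): String = manager match {
    case Standalone => s"spark://$host:$port"
    case Mesos      => s"mesos://$host:$port"
    case Yarn       => "yarn" // the cluster location is read from the Hadoop configuration
    case Kubernetes => s"k8s://https://$host:$port"
  }

  def main(args: Array[String]): Unit =
    println(masterUrl(Standalone, "spark", 7077)) // prints spark://spark:7077
}
```

For the Standalone setup in this article, the resulting URL is the one passed later to --master and setMaster.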

Requirements

  • Linux or macOS
  • JDK
  • Scala
  • Hadoop HDFS environment (the version must match the Spark build)

Download

https://spark.apache.org/downloads.html

Installation

gaozhy@gaozhydeMacBook-Pro  ~ tar -zxvf Downloads/spark-2.4.3-bin-hadoop2.7.tgz -C software/

Configuration

gaozhy@gaozhydeMacBook-Pro  ~  cd software/spark-2.4.3-bin-hadoop2.7
gaozhy@gaozhydeMacBook-Pro  ~  cp conf/spark-env.sh.template conf/spark-env.sh
gaozhy@gaozhydeMacBook-Pro  ~  vim conf/spark-env.sh

# 在配置文件中添加如下配置
export SPARK_MASTER_HOST=spark
export SPARK_MASTER_PORT=7077

gaozhy@gaozhydeMacBook-Pro  ~/software/spark-2.4.3-bin-hadoop2.7  cp conf/slaves.template conf/slaves
gaozhy@gaozhydeMacBook-Pro  ~/software/spark-2.4.3-bin-hadoop2.7  vim conf/slaves

# Change localhost to spark in the configuration file
spark

gaozhy@gaozhydeMacBook-Pro  ~/software/spark-2.4.3-bin-hadoop2.7  sudo vim /etc/hosts

# Append the following to the end of the file
127.0.0.1	spark

gaozhy@gaozhydeMacBook-Pro  ~/software/spark-2.4.3-bin-hadoop2.7  sudo vim /etc/profile
# Add the Spark environment variables
export SCALA_HOME=/Users/gaozhy/software/scala-2.11.8
export SPARK_HOME=/Users/gaozhy/software/spark-2.4.3-bin-hadoop2.7
export PATH=$PATH:$SCALA_HOME/bin:$SPARK_HOME/bin:$SPARK_HOME/sbin

gaozhy@gaozhydeMacBook-Pro  ~/software/spark-2.4.3-bin-hadoop2.7  source /etc/profile

Starting the Spark Services

gaozhy@gaozhydeMacBook-Pro  ~/software/spark-2.4.3-bin-hadoop2.7  start-all.sh

Note: use stop-all.sh to shut the services down.

Spark Web UI

URL: http://spark:8080

3. Quick Start

Interactive Analysis with the Spark Shell

Start the Spark shell

gaozhy@gaozhydeMacBook-Pro  ~ spark-shell

Basic Operations

Prepare the data file /Users/gaozhy/test.json (in the sex field, 男 means male and 女 means female)

{"id":1,"name":"zs","sex":"男"}
{"id":2,"name":"ls","sex":"女"}
{"id":3,"name":"ww","sex":"男"}

Operations in spark-shell

scala> val dataset = spark.read.json("/Users/gaozhy/test.json")
dataset: org.apache.spark.sql.DataFrame = [id: bigint, name: string ... 1 more field]

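Note: a Dataset is Spark's distributed collection abstraction, built on top of the RDD (resilient distributed dataset). The DataFrame returned here is a Dataset[Row].

Working with the Dataset

# Number of records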
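
To make the DataFrame versus Dataset distinction concrete, here is a minimal sketch that views the same JSON file as a typed Dataset. The case class Person and the object TypedDatasetSketch are hypothetical names introduced for illustration, and the sketch assumes a local master rather than the cluster configured above.

```scala
import org.apache.spark.sql.{Dataset, SparkSession}

// Hypothetical case class mirroring the fields of test.json
// (id is inferred as bigint, which maps to Long)
case class Person(id: Long, name: String, sex: String)

object TypedDatasetSketch {
  // Read the JSON file as a DataFrame, then view it as a Dataset[Person].
  def readPeople(spark: SparkSession, path: String): Dataset[Person] = {
    import spark.implicits._ // brings the Encoder for Person into scope
    spark.read.json(path).as[Person]
  }

  def main(args: Array[String]): Unit = {
    // Assumption: local mode, so the sketch runs without a cluster
    val spark = SparkSession.builder.master("local[*]").appName("typed dataset sketch").getOrCreate()
    val people = readPeople(spark, "/Users/gaozhy/test.json")
    // Fields are accessed by name and checked at compile time
    println(people.filter(_.sex == "男").count())
    spark.stop()
  }
}
```

With a typed Dataset, a misspelled field name is a compile error, whereas positional access on a Row only fails at run time.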
scala> dataset.count()
res0: Long = 3

# Get the first record of the dataset
scala> dataset.first
res1: org.apache.spark.sql.Row = [1,zs,男]

# Number of records whose sex is male (男)
scala> dataset.filter(row => row(2).equals("男")).count()
res4: Long = 2

# Count male and female users
scala> dataset.rdd.map(row => (row(2),1)).reduceByKey(_+_).saveAsTextFile("/Users/gaozhy/result2")
# Contents of the files written to /Users/gaozhy/result2
(男,2)
(女,1)
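
The positional access row(2) used above depends on sex being the third column. As a hedged alternative, the same two counts can be sketched with the DataFrame API, which addresses the column by name. The object SexCounts and its helper names are assumptions made for this example, and the main method assumes a local master.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

object SexCounts {
  // Count records whose "sex" column equals the given value.
  def countWhereSex(spark: SparkSession, path: String, sex: String): Long =
    spark.read.json(path).filter(col("sex") === sex).count()

  // Count records per distinct value of the "sex" column.
  def countBySex(spark: SparkSession, path: String): Map[String, Long] =
    spark.read.json(path)
      .groupBy("sex")
      .count() // result columns: sex, count
      .collect()
      .map(row => row.getString(0) -> row.getLong(1))
      .toMap

  def main(args: Array[String]): Unit = {
    // Assumption: local mode, so the sketch runs without a cluster
    val spark = SparkSession.builder.master("local[*]").appName("sex counts").getOrCreate()
    val path = "/Users/gaozhy/test.json"
    println(countWhereSex(spark, path, "男"))
    println(countBySex(spark, path))
    spark.stop()
  }
}
```

In spark-shell the bodies of the two helpers can be typed directly against the existing spark session.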

Analysis with the Spark API

Create a Maven project and add the development dependencies

<dependencies>
  <dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-core_2.11</artifactId>
    <version>2.4.3</version>
  </dependency>
  <dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-sql_2.11</artifactId>
    <version>2.4.3</version>
  </dependency>
</dependencies>

Test code

import org.apache.spark.sql.SparkSession
import org.apache.spark.{SparkConf, SparkContext}


object SimpleApp2 {

  def main(args: Array[String]): Unit = {
    // Submit the job to the Spark cluster
    val sparkConf = new SparkConf()
      .setMaster("spark://spark:7077")
      .setAppName("simple app")
      .setJars(Array("/Users/gaozhy/workspace/20180429/spark01/target/spark01-1.0-SNAPSHOT.jar"))
    // Local mode, for running without a cluster
    // val sparkConf = new SparkConf().setMaster("local[*]").setAppName("simple app")
    val sparkContext = new SparkContext(sparkConf)
    val spark = SparkSession.builder.getOrCreate
    val dataset = spark.read.json("/Users/gaozhy/test.json")
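    // Count records per sex (row(2) is the sex column); saveAsTextFile fails if the output directory already exists
    dataset.rdd.map(row => (row(2),1)).reduceByKey(_+_).saveAsTextFile("/Users/gaozhy/result2")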
    spark.stop()
  }
}
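
The pipeline in the test code, map to (key, 1) pairs followed by reduceByKey(_ + _), can be imitated on a plain Scala collection to see what it computes. This is only a local stand-in written for illustration: the object ReduceByKeySketch and its reduceByKey helper are not Spark APIs, and nothing here is distributed.

```scala
object ReduceByKeySketch {
  // Local stand-in for RDD.reduceByKey: merge all values of each key with `op`.
  def reduceByKey[K, V](pairs: Seq[(K, V)])(op: (V, V) => V): Map[K, V] =
    pairs
      .groupBy(_._1) // key -> all pairs with that key
      .map { case (key, group) => key -> group.map(_._2).reduce(op) }

  def main(args: Array[String]): Unit = {
    val sexes = Seq("男", "女", "男")       // the sex column of test.json
    val pairs = sexes.map(sex => (sex, 1)) // same shape as row => (row(2), 1)
    println(reduceByKey(pairs)(_ + _))
  }
}
```

Spark applies the same merge function, but first within each partition and then across partitions, which is why the function passed to reduceByKey should be associative and commutative.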

Result

Analysis by Submitting a Jar

Package the Spark application as a jar

Submit the job with spark-submit

gaozhy@gaozhydeMacBook-Pro  ~ spark-submit --class "SimpleApp2" --master spark://spark:7077 /Users/gaozhy/workspace/20180429/spark01/target/spark01-1.0-SNAPSHOT.jar

Official Example (Computing Pi)

Submit the job

 gaozhy@gaozhydeMacBook-Pro  ~/software/spark-2.4.3-bin-hadoop2.7  bin/spark-submit \
--class org.apache.spark.examples.SparkPi \
--master spark://spark:7077 \
--executor-memory 1G \
--total-executor-cores 2 \
./examples/jars/spark-examples_2.11-2.4.3.jar \
100
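
SparkPi estimates Pi with a Monte Carlo method: it samples random points in a square and counts how many land inside the inscribed circle. The trailing argument 100 is the number of slices (partitions) the sampling is split into. Below is a minimal single-machine sketch of the same idea, without Spark; the object LocalPi, the helper estimatePi, and the fixed seed are assumptions made for illustration.

```scala
import scala.util.Random

object LocalPi {
  // Estimate Pi by sampling points in the square [-1, 1] x [-1, 1]
  // and counting how many fall inside the unit circle.
  def estimatePi(samples: Int, rng: Random): Double = {
    val inside = (1 to samples).count { _ =>
      val x = rng.nextDouble() * 2 - 1
      val y = rng.nextDouble() * 2 - 1
      x * x + y * y <= 1
    }
    // circle area / square area = Pi / 4
    4.0 * inside / samples
  }

  def main(args: Array[String]): Unit =
    println(s"Pi is roughly ${estimatePi(1000000, new Random(42))}")
}
```

In the Spark version the samples are spread across partitions and the per-partition counts are summed with a reduce, so more slices means more parallel tasks and more samples.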

Result
