Spark-Python Installation and Configuration

In the earlier post on pseudo-distributed Hadoop configuration and Python testing on CentOS, we already installed Hadoop. Now let's install Spark.

Scala Installation and Configuration

Since Spark is written in Scala, we install Scala first.

$ wget -P ~/download/ https://downloads.lightbend.com/scala/2.11.8/scala-2.11.8.tgz
$ cd ~/download/
$ tar zxf scala-2.11.8.tgz
$ mv scala-2.11.8 /opt/scala
$ echo 'export SCALA_HOME=/opt/scala' >> /etc/bashrc

Spark Installation and Configuration

The Apache site currently only offers 2.x releases of Spark; we choose spark-2.4.2 to install.

$ wget -P ~/download/ http://mirror.bit.edu.cn/apache/spark/spark-2.4.2/spark-2.4.2-bin-hadoop2.6.tgz
$ tar zxf spark-2.4.2-bin-hadoop2.6.tgz -C /opt
$ cd /opt/
$ mv spark-2.4.2-bin-hadoop2.6/ spark
$ echo 'export SPARK_HOME=/opt/spark' >> /etc/bashrc
$ echo 'alias spark-shell=$SPARK_HOME/bin/spark-shell' >> /etc/bashrc
$ source /etc/bashrc

Updating Java

If we launch spark-shell at this point, it may fail with an error:

$ spark-shell
# Exception in thread "main" java.lang.UnsupportedClassVersionError: org/apache/spark/launcher/Main : Unsupported major.minor version 52.0

This is because spark-2.4.2 requires a newer Java: class file major version 52.0 corresponds to Java 8, so we upgrade Java to 1.8.0.

$ yum -y install java-1.8.0-openjdk*
$ ls -lrt /etc/alternatives/java
# lrwxrwxrwx 1 root root 73 May  1 20:57 /etc/alternatives/java -> /usr/lib/jvm/java-1.8.0-openjdk-1.8.0.212.b04-0.el7_6.x86_64/jre/bin/java
$ vi /etc/bashrc

Using vi, update JAVA_HOME in /etc/bashrc:

...
# export JAVA_HOME=/usr/lib/jvm/java-1.7.0-openjdk-1.7.0.221-2.6.18.0.el7_6.x86_64/
export JAVA_HOME=/usr/lib/jvm/java-1.8.0-openjdk-1.8.0.212.b04-0.el7_6.x86_64/
...

After saving the file and sourcing it again, spark-shell starts successfully.

$ source /etc/bashrc
$ spark-shell

Scala Script Test

scala> import org.apache.spark.SparkConf
scala> import org.apache.spark.SparkContext
scala> import org.apache.spark.SparkContext._
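# Note: spark-shell already provides a SparkContext as sc; run sc.stop() first if you create your own, or simply use the built-in sc and skip the next two lines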
scala> val conf = new SparkConf().setMaster("localhost").setAppName("App Name")
scala> val sc = new SparkContext(conf)
# Read the input files
scala> val input = sc.textFile("/root/p*.txt")
# Split into words
scala> val words = input.flatMap(line => line.split(" "))
# Count the frequency of each word
scala> val count = words.map((_, 1)).reduceByKey(_+_)
# Print the results
scala> count.collect().foreach(println)
# (university,1)
# (priority,1)
# (next,2)
# (hence,,1)
# (low-priced.,1)
# (its,9)
# (others,1)
# (customized,1)
# (extraordinary,1)
# (have,6)
# ...

# Remember to stop the SparkContext
scala> sc.stop()

Python Script Test

Create the file spark_test.py:

import sys
# Make the pyspark package shipped with Spark importable from the system Python
sys.path.append("/opt/spark/python")
sys.path.append("/opt/spark/python/lib/py4j-0.10.7-src.zip")
from pyspark import SparkContext, SparkConf

# Run locally with as many worker threads as there are cores
conf = SparkConf().setMaster("local[*]").setAppName("App Name")
sc = SparkContext(conf=conf)

# Read the input files, split into words, and count word frequencies
text = sc.textFile("/root/p*.txt")
words = text.flatMap(lambda line: line.split(" "))
count = words.map(lambda word: (word, 1)).reduceByKey(lambda x, y: x + y)

# Print the results
for each in count.collect():
    print(each[0], each[1])

sc.stop()
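
As a side note, SparkSession has been the recommended entry point since Spark 2.0; the same word count could also be written against it. The following is only a minimal sketch under the same sys.path assumptions as above, not the script used in this test:

import sys
sys.path.append("/opt/spark/python")
sys.path.append("/opt/spark/python/lib/py4j-0.10.7-src.zip")
from pyspark.sql import SparkSession

# Build (or reuse) a local session; sparkContext exposes the same RDD API as above
spark = SparkSession.builder.master("local[*]").appName("App Name").getOrCreate()
sc = spark.sparkContext

count = (sc.textFile("/root/p*.txt")
           .flatMap(lambda line: line.split(" "))
           .map(lambda word: (word, 1))
           .reduceByKey(lambda x, y: x + y))
for word, n in count.collect():
    print(word, n)

spark.stop()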

Configure pyspark:

$ echo 'export PATH=$SPARK_HOME/bin:$PATH' >> /etc/bashrc
$ echo 'export PYSPARK_PYTHON=/usr/bin/python3.6' >> /etc/bashrc
$ echo 'export PYSPARK_DRIVER_PYTHON=/usr/bin/python3.6' >> /etc/bashrc
$ source /etc/bashrc 
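
With these variables in place, the interactive pyspark shell gives a quick way to sanity-check the setup; it already provides a SparkContext as sc, so nothing has to be constructed by hand. A minimal sketch with made-up sample data (output order may vary):

$ pyspark
>>> rdd = sc.parallelize(["hello world", "hello spark"])
>>> counts = rdd.flatMap(lambda line: line.split(" ")).map(lambda w: (w, 1)).reduceByKey(lambda x, y: x + y)
>>> counts.collect()
# e.g. [('world', 1), ('hello', 2), ('spark', 1)]
>>> exit()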

Run the test; the word counts match, but the output order differs from the Scala result:

$ python spark_test.py
# 19/05/02 15:56:27 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
# Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
# Setting default log level to "WARN".
# To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
# police 7
# with 12
# of 56
# cards 6
# murdered 2
#  48
# Shocking 1
# details 1
# from 7
# university 1
# kept 1
# ...
$ 
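
The ordering differs because the reduceByKey output simply follows the partitioning, not any sort order. If a deterministic ordering is wanted, the pairs can be sorted before printing, for example with takeOrdered. A minimal sketch reusing the same /root/p*.txt input (the top-10 cutoff is arbitrary):

import sys
sys.path.append("/opt/spark/python")
sys.path.append("/opt/spark/python/lib/py4j-0.10.7-src.zip")
from pyspark import SparkContext, SparkConf

conf = SparkConf().setMaster("local[*]").setAppName("Word Count Sorted")
sc = SparkContext(conf=conf)

count = (sc.textFile("/root/p*.txt")
           .flatMap(lambda line: line.split(" "))
           .map(lambda word: (word, 1))
           .reduceByKey(lambda x, y: x + y))

# Ten most frequent words, highest count first
for word, n in count.takeOrdered(10, key=lambda kv: -kv[1]):
    print(word, n)

sc.stop()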
