1. Switch from Python 2 to Python 3
sudo apt-get install python3
alias python=python3  # only affects the current shell; add this line to ~/.bashrc to make it permanent
2. Configure Spark to use Python 3 (assumes the Spark environment is already set up)
vim ~/.bashrc
export PYTHONPATH=$SPARK_HOME/python:$SPARK_HOME/python/lib/py4j-0.10.4-src.zip:$PYTHONPATH  # the py4j version must match the zip shipped under $SPARK_HOME/python/lib/
export PYSPARK_PYTHON=python3
3. Start Spark
cd /usr/local/hadoop
./sbin/start-dfs.sh
cd /usr/local/spark
./bin/pyspark  # enter the interactive pyspark shell (local mode)
4. Common errors
No module named py4j
Fix: look in $SPARK_HOME/python/lib/ and make sure the py4j version number there matches the version you added to ~/.bashrc.
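A quick way to see which py4j zip your installation actually ships, so the PYTHONPATH entry can be written to match it. This is a small sketch that assumes the /usr/local/spark layout used in this guide as the fallback path:

```python
# Print the py4j source zip(s) bundled with the local Spark install.
# Assumes the /usr/local/spark layout from this guide when SPARK_HOME is unset.
import glob
import os

spark_home = os.environ.get("SPARK_HOME", "/usr/local/spark")
matches = glob.glob(os.path.join(spark_home, "python", "lib", "py4j-*-src.zip"))
print(matches)  # e.g. ['/usr/local/spark/python/lib/py4j-0.10.4-src.zip']
```

Whatever filename this prints is the one that belongs in the PYTHONPATH export above.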
5. Examples
Example 1: count the lines containing a and b
Workflow: change the log level (to stop the flood of INFO heartbeat messages, change INFO to ERROR on the log4j.rootCategory line), create the Python file, run it with spark-submit, and check the output.
cd /usr/local/spark/conf
ll
cp log4j.properties.template log4j.properties
vim log4j.properties
cd /usr/local/spark
mkdir -p mycode/python  # -p is needed to create the parent directory as well
vim WordCount.py
# Count the lines containing a and the lines containing b
from pyspark import SparkConf, SparkContext

conf = SparkConf().setMaster("local").setAppName("My App")  # configure the application
sc = SparkContext(conf=conf)  # create the SparkContext, the driver's entry point
logFile = "file:///usr/local/spark/README.md"  # local files need the file:/// prefix
logData = sc.textFile(logFile, 2).cache()  # read the file into an RDD (2 partitions) and cache it
numAs = logData.filter(lambda line: 'a' in line).count()
numBs = logData.filter(lambda line: 'b' in line).count()
print('Lines with a: %s, lines with b: %s' % (numAs, numBs))
# Run it with spark-submit
/usr/local/spark/bin/spark-submit /usr/local/spark/mycode/python/WordCount.py
Output:
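The two filters are plain substring tests, so the counting logic can be checked in plain Python without Spark. The sample lines below are made up purely for illustration (the real job reads README.md):

```python
# Plain-Python check of the same predicates used in the Spark filters.
# These sample lines are hypothetical, not the contents of README.md.
sample = ["Apache Spark", "big data analytics", "hello world"]
numAs = sum(1 for line in sample if 'a' in line)
numBs = sum(1 for line in sample if 'b' in line)
print('Lines with a: %s, lines with b: %s' % (numAs, numBs))  # Lines with a: 2, lines with b: 1
```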
Example 2: take the top 5 values from the files in a directory
from pyspark import SparkConf, SparkContext

conf = SparkConf().setMaster("local").setAppName("ReadHBase")
sc = SparkContext(conf=conf)
lines = sc.textFile("file:///usr/local/spark/mycode/rdd/file")  # note: this is a directory; every file in it is read
result1 = lines.filter(lambda line: (len(line.strip()) > 0) and (len(line.split(',')) == 4))  # drop blank or malformed lines
result2 = result1.map(lambda x: x.split(",")[2])  # keep the third column
result3 = result2.map(lambda x: (int(x), ""))  # key by the integer value so sortByKey can order it
result4 = result3.repartition(1)  # a single partition guarantees a global order
result5 = result4.sortByKey(False)  # sort keys in descending order
result6 = result5.map(lambda x: x[0])
result7 = result6.take(5)
for i in result7:
    print(i)
One sample file from that directory:
1,1768,40,155
2,22,34,22
3,34,55,23
4,45,33,242
5,33,67,345
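To sanity-check what the pipeline computes, the same top-5 logic can be run in plain Python against the five sample rows above, with no Spark needed:

```python
# Same logic as the RDD pipeline, applied to the sample file above:
# keep 4-column rows, take the third column as an int, sort descending, take 5.
rows = """1,1768,40,155
2,22,34,22
3,34,55,23
4,45,33,242
5,33,67,345""".splitlines()

valid = [r for r in rows if r.strip() and len(r.split(',')) == 4]
values = sorted((int(r.split(',')[2]) for r in valid), reverse=True)
top5 = values[:5]
print(top5)  # [67, 55, 40, 34, 33]
```

In Spark itself, result1.map(lambda x: int(x.split(",")[2])).takeOrdered(5, key=lambda x: -x) would give the same result without the repartition(1) step.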