Write a PySpark wordcount program in Python, then run it with spark-submit in both local and YARN modes.
1.1. Create the test file
- Local file
$ cd ~/pyspark/PythonProject
$ mkdir data
$ cd data/
$ vim word.txt
$ tail word.txt
hadoop spark hive
hive java python
spark perl hadoop
python RDD spark
RDD
- HDFS file
$ cd ~/pyspark/PythonProject
$ hadoop fs -put data /user/input/
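To confirm the upload landed where the program will look for it (same paths as above), list the directory:
$ hadoop fs -ls /user/input/data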
1.2. Write the Spark wordcount program
- Write the wordcount program
$ vim wordcount.py
#!/usr/bin/env python
# -*- coding:utf-8 -*-
from pyspark import SparkContext, SparkConf

def CreateSparkContext():
    """Create the SparkConf, set the app name, and build the SparkContext."""
    conf = SparkConf().setAppName("WordCount").set("spark.ui.showConsoleProgress", "false")
    sc = SparkContext(conf=conf)
    SetLogger(sc)
    SetPath(sc)
    return sc

def SetLogger(sc):
    """Quiet the logs: only show ERROR-level messages."""
    logger = sc._jvm.org.apache.log4j
    logger.LogManager.getLogger("org").setLevel(logger.Level.ERROR)
    logger.LogManager.getLogger("akka").setLevel(logger.Level.ERROR)
    logger.LogManager.getRootLogger().setLevel(logger.Level.ERROR)

def SetPath(sc):
    """Set the global input path: local filesystem in local mode, HDFS otherwise."""
    global Path
    if sc.master[0:5] == "local":
        # Note the trailing slash: Path is concatenated with "data/word.txt" below.
        Path = "file:/home/hadoop/pyspark/PythonProject/"
    else:
        Path = "hdfs://node:9000/user/input/"

if __name__ == "__main__":
    print("Starting wordcount...")
    sc = CreateSparkContext()
    print("Reading the input file...")
    textFile = sc.textFile(Path + "data/word.txt")
    print("Running the map/reduce computation...")
    stringRDD = textFile.flatMap(lambda line: line.split(" "))
    countsRDD = stringRDD.map(lambda word: (word, 1)).reduceByKey(lambda x, y: x + y)
    print("Saving the result...")
    try:
        countsRDD.saveAsTextFile(Path + "data/output")
    except Exception:
        print("The output directory already exists; please delete it first!")
    print("Done...")
    sc.stop()
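If you only want to inspect the counts rather than write them to disk or HDFS, a minimal sketch (added before sc.stop(); takeOrdered is a standard RDD action) prints the ten most frequent words:

    # Sketch: show the top 10 words by count, highest first.
    for word, count in countsRDD.takeOrdered(10, key=lambda kv: -kv[1]):
        print(word, count)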
1.3. Run the program with spark-submit
1.3.1. spark-submit in local mode
- Run the job
$ spark-submit wordcount.py
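Without a --master flag, spark-submit runs the job locally (local[*] by default); the number of worker threads can also be set explicitly, for example four:
$ spark-submit --master local[4] wordcount.py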
- Check the result
$ cd ~/pyspark/PythonProject/data/
$ tree
.
├── output
│ ├── part-00000
│ └── _SUCCESS
└── word.txt
1 directory, 3 files
$ tail output/part-00000
('hadoop', 2)
('spark', 3)
('hive', 2)
('java', 1)
('python', 2)
('perl', 1)
('RDD', 2)
('', 1)
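The ('', 1) entry suggests word.txt contains a stray extra space, which split(" ") turns into an empty token. A one-line sketch of a fix, filtering empty tokens in the flatMap step of wordcount.py:

    # Drop empty tokens produced by consecutive or trailing spaces.
    stringRDD = textFile.flatMap(lambda line: line.split(" ")).filter(lambda word: word != "")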
1.3.2. spark-submit in YARN mode
- Run the job
$ HADOOP_CONF_DIR=/opt/local/hadoop/etc/hadoop spark-submit --master yarn --deploy-mode client wordcount.py
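Executor resources can be tuned on the same command line; the values below are only illustrative (--num-executors, --executor-memory, and --executor-cores are standard spark-submit options):
$ HADOOP_CONF_DIR=/opt/local/hadoop/etc/hadoop spark-submit --master yarn --deploy-mode client \
    --num-executors 2 --executor-memory 1g --executor-cores 1 wordcount.py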
- YARN job status
$ yarn application -list -appStates ALL
Total number of applications (application-types: [] and states: [NEW, NEW_SAVING, SUBMITTED, ACCEPTED, RUNNING, FINISHED, FAILED, KILLED]):1
Application-Id Application-Name Application-Type User Queue State Final-State Progress Tracking-URL
application_1530328746140_0001 WordCount SPARK hadoop default FINISHED SUCCEEDED 100% N/A
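To inspect a finished (or failed) run in detail, YARN's aggregated logs can be fetched with the Application-Id from the listing above:
$ yarn logs -applicationId application_1530328746140_0001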
- Check the result
$ hadoop fs -cat /user/input/data/output/part-0000*
('python', 2)
('', 1)
('hadoop', 2)
('hive', 2)
('java', 1)
('spark', 3)
('perl', 1)
('RDD', 2)
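saveAsTextFile refuses to overwrite an existing directory (this is what triggers the error branch in wordcount.py), so remove the output directory before rerunning the job:
$ hadoop fs -rm -r /user/input/data/output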