PySpark WordCount

Write a PySpark WordCount program in Python, then run it with spark-submit in both local and YARN modes.

1.1. Create the test files

  • Local file
$ cd ~/pyspark/PythonProject
$ mkdir data
$ cd data/
$ vim word.txt
$ tail word.txt 
hadoop spark hive
hive java python
spark perl hadoop
python RDD spark
RDD 
  • HDFS file
$ cd ~/pyspark/PythonProject
$ hadoop fs -put data /user/input/

1.2. Write the Spark WordCount program

  • Write the wordcount program
$ vim wordcount.py 

#!/usr/bin/env python
# -*- coding:utf-8 -*-

from pyspark import SparkContext, SparkConf

def CreateSparkContext():
    """Create the SparkConf, set the app name, and build the SparkContext."""
    conf = SparkConf().setAppName("WordCount").set("spark.ui.showConsoleProgress", "false")
    sc = SparkContext(conf=conf)
    SetLogger(sc)
    SetPath(sc)
    return sc

def SetLogger(sc):
    """設置日誌顯示方式"""
    logger = sc._jvm.org.apache.log4j
    logger.LogManager.getLogger("org").setLevel(logger.Level.ERROR)
    logger.LogManager.getLogger("akka").setLevel(logger.Level.ERROR)
    logger.LogManager.getRootLogger().setLevel(logger.Level.ERROR)

def SetPath(sc):
    """Define the global Path prefix according to the run mode."""
    global Path
    if sc.master.startswith("local"):
        # Trailing slash matters: the prefix is concatenated with "data/..." below.
        Path = "file:/home/hadoop/pyspark/PythonProject/"
    else:
        Path = "hdfs://node:9000/user/input/"

if __name__ == "__main__":
    print("Starting wordcount...")
    sc = CreateSparkContext()
    print("Reading the input file...")
    textFile = sc.textFile(Path + "data/word.txt")
    print("Running the map/reduce computation...")
    stringRDD = textFile.flatMap(lambda line: line.split(" "))
    countsRDD = stringRDD.map(lambda word: (word, 1)).reduceByKey(lambda x, y: x + y)
    print("Saving the results...")
    try:
        countsRDD.saveAsTextFile(Path + "data/output")
    except Exception:
        print("The output directory already exists; please delete it first!")
    print("Done...")
    sc.stop()
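
The heart of the job is the flatMap → map → reduceByKey pipeline. Before submitting the full script, the pipeline can be sanity-checked interactively; the sketch below is only an illustrative example for the pyspark shell (where `sc` already exists), using two sample lines from word.txt above.

from operator import add

# Two sample lines from word.txt, turned into a small RDD.
lines = sc.parallelize(["hadoop spark hive", "hive java python"])
counts = (lines.flatMap(lambda line: line.split(" "))  # split each line into words
               .map(lambda word: (word, 1))            # pair each word with a count of 1
               .reduceByKey(add))                      # sum the counts per word
print(counts.collect())
# collect() order is not deterministic, e.g. [('hive', 2), ('hadoop', 1), ...]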

1.3. Run the program with spark-submit

1.3.1. spark-submit in local mode

  • Run the command
$ spark-submit wordcount.py
  • Check the results
$ cd ~/pyspark/PythonProject/data/
$ tree
.
├── output
│   ├── part-00000
│   └── _SUCCESS
└── word.txt

1 directory, 3 files
$ tail output/part-00000 
('hadoop', 2)
('spark', 3)
('hive', 2)
('java', 1)
('python', 2)
('perl', 1)
('RDD', 2)
('', 1)
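
The ('', 1) pair above is not a real word: the last line of word.txt ends with a trailing space after "RDD", so splitting that line on " " yields an empty token. A minimal fix, sketched here rather than folded into the script above, is to filter out empty strings before counting:

stringRDD = textFile.flatMap(lambda line: line.split(" ")) \
                    .filter(lambda word: word)  # drop empty tokens left by trailing spaces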

1.3.2. spark-submit in YARN mode

  • Run the command
$ HADOOP_CONF_DIR=/opt/local/hadoop/etc/hadoop spark-submit --master yarn --deploy-mode client wordcount.py 
  • YARN application status
$ yarn application -list -appStates ALL

Total number of applications (application-types: [] and states: [NEW, NEW_SAVING, SUBMITTED, ACCEPTED, RUNNING, FINISHED, FAILED, KILLED]):1
                Application-Id      Application-Name        Application-Type          User       Queue               State         Final-State         Progress                        Tracking-URL
application_1530328746140_0001             WordCount                   SPARK        hadoop     default            FINISHED           SUCCEEDED             100%                                 N/A
  • Check the results
$ hadoop fs -cat /user/input/data/output/part-0000*
('python', 2)
('', 1)
('hadoop', 2)
('hive', 2)
('java', 1)
('spark', 3)
('perl', 1)
('RDD', 2)
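
saveAsTextFile writes one line per element using each tuple's repr(), which is why the part files contain lines like ('spark', 3). If the counts are needed again later, the saved text can be re-parsed; a minimal sketch, assuming the same HDFS layout as above and an active SparkContext `sc`:

import ast

# Read all part files back and turn each "('word', n)" line into a tuple.
saved = sc.textFile("hdfs://node:9000/user/input/data/output")
pairs = saved.map(ast.literal_eval)
print(pairs.collect())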