PySpark Primer Series - 01: Counting the Words in a Document

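Import the SparkConf and SparkContext classes. Every Spark program starts with a SparkContext, and initializing a SparkContext requires a SparkConf object, which holds the configuration parameters for the Spark cluster. Once it is initialized, the methods of the SparkContext object can be used to create and operate on RDDs and shared variables.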

from pyspark import SparkConf
from pyspark import SparkContext
conf = SparkConf().setMaster('local').setAppName('read_txt')
sc = SparkContext(conf=conf)
# Read the txt file from the data directory
rdd = sc.textFile('../data/eclipse_license.txt')
# Count the number of elements (lines)
print(rdd.count())
70

filter

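A transformation: the function passed in defines the filtering rule, and only the elements for which it returns True are kept.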

# Use filter to select the lines that contain 'License', then look at the first one
pythonline = rdd.filter(lambda line: 'License' in line)
pythonline.first() 
'Eclipse Public License - v 1.0'
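# Count word frequencies and print the top 10
# 1. flatMap: split each line, then flatten all the resulting lists into a single collection
# 2. map: apply a function to every element
# 3. reduceByKey: merge the values that share the same key
# 4. sortBy: sort the elements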
result = rdd.flatMap(lambda x: x.split(' ')).map(lambda x: (x, 1)).reduceByKey(lambda a, b: a + b)
sorted_result = result.sortBy(lambda x: x[1], ascending=False)
print(sorted_result.collect()[0:10])
# sorted_result.saveAsTextFile('.result')  # save the result to a file
[('the', 98), ('to', 56), ('of', 54), ('and', 48), ('Contributor', 31), ('a', 30), ('in', 28), ('this', 27), ('', 26), ('or', 24)]
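
The `('', 26)` entry in the output above is a side effect of `split(' ')`, which produces empty strings for blank lines and for runs of consecutive spaces. The following is a minimal plain-Python sketch of the same flatMap, map, reduceByKey, sortBy pipeline. It needs no Spark installation, and the helper name `word_count` and the sample `lines` are made up for illustration. It compares `split(' ')` with `split()`:

```python
from collections import Counter
from itertools import chain


def word_count(lines, sep=None):
    """Plain-Python stand-in for flatMap -> map -> reduceByKey -> sortBy."""
    # flatMap: split every line and flatten the pieces into one stream
    words = chain.from_iterable(line.split(sep) for line in lines)
    # map + reduceByKey: pair each word with 1 and sum the 1s per word
    counts = Counter()
    for word, one in ((w, 1) for w in words):
        counts[word] += one
    # sortBy(..., ascending=False): highest count first
    return sorted(counts.items(), key=lambda kv: kv[1], reverse=True)


# Hypothetical sample input, not taken from eclipse_license.txt
lines = ["the program and the license", "", "the  contributor"]

with_empty = word_count(lines, sep=" ")   # same splitting as split(' ')
without_empty = word_count(lines)         # split() drops empty strings

print(with_empty)
print(without_empty)
```

If the empty token is unwanted in the Spark job, using `x.split()` in the `flatMap` call is one way to avoid it.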

# Persist frequently accessed data in memory (intermediate results that will be reused)
sorted_result.persist()  

PythonRDD[11] at collect at <ipython-input-13-722d464a9f79>:8
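
`persist()` only marks the RDD for caching. The data is stored the first time an action computes it after that call, and later actions reuse the stored copy. The sketch below imitates this behaviour in plain Python. `LazyDataset` is a hypothetical toy class written for this illustration, not part of PySpark, and it ignores partitions and storage levels:

```python
class LazyDataset:
    """Toy stand-in for an RDD: recomputes on every action unless persisted."""

    def __init__(self, compute):
        self._compute = compute      # function that produces the data
        self._persist = False
        self._cache = None
        self.compute_calls = 0       # how many times the data was rebuilt

    def persist(self):
        # Like RDD.persist(): only sets a flag, nothing is computed yet
        self._persist = True
        return self

    def _materialize(self):
        if self._persist and self._cache is not None:
            return self._cache
        self.compute_calls += 1
        data = list(self._compute())
        if self._persist:
            self._cache = data
        return data

    def count(self):
        return len(self._materialize())

    def collect(self):
        return self._materialize()


words = ["the", "to", "of", "the"]

uncached = LazyDataset(lambda: (w.upper() for w in words))
uncached.count()
uncached.collect()

cached = LazyDataset(lambda: (w.upper() for w in words)).persist()
calls_after_persist = cached.compute_calls   # still 0: persist is lazy
cached.count()
cached.collect()

print(uncached.compute_calls, calls_after_persist, cached.compute_calls)
```

In the snippet above, `persist()` is called after `collect()` has already run, so the cache is only filled by the next action. Calling `persist()` before the first action saves one recomputation.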

union(): combine two RDDs

errorsRDD = rdd.filter(lambda x: "error" in x)  
LicenseRDD = rdd.filter(lambda x: "License" in x)
unionLinesRDD = errorsRDD.union(LicenseRDD)
unionLinesRDD.collect()

['EXCEPT AS EXPRESSLY SET FORTH IN THIS AGREEMENT, THE PROGRAM IS PROVIDED ON AN "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, EITHER EXPRESS OR IMPLIED INCLUDING, WITHOUT LIMITATION, ANY WARRANTIES OR CONDITIONS OF TITLE, NON-INFRINGEMENT, MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE. Each Recipient is solely responsible for determining the appropriateness of using and distributing the Program and assumes all risks associated with its exercise of rights under this Agreement , including but not limited to the risks and costs of program errors, compliance with applicable laws, damage to or loss of data, programs or equipment, and unavailability or interruption of operations.',
 'Eclipse Public License - v 1.0',
 '"Licensed Patents" mean patent claims licensable by a Contributor which are necessarily infringed by the use or sale of its Contribution alone or when combined with the Program.',
 'b) Subject to the terms of this Agreement, each Contributor hereby grants Recipient a non-exclusive, worldwide, royalty-free patent license under Licensed Patents to make, use, sell, offer to sell, import and otherwise transfer the Contribution of such Contributor, if any, in source code and object code form. This patent license shall apply to the combination of the Contribution and the Program if, at the time the Contribution is added by the Contributor, such addition of the Contribution causes such combination to be covered by the Licensed Patents. The patent license shall not apply to any other combinations which include the Contribution. No hardware per se is licensed hereunder.']
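
Note that `union()` does not remove duplicates: a line that matches both filters appears twice in the result, and `distinct()` can be applied afterwards if that is not wanted. A minimal plain-Python sketch of this behaviour, using made-up sample lines and list operations as stand-ins for the RDD calls:

```python
# Hypothetical sample lines, not taken from eclipse_license.txt
lines = [
    "Eclipse Public License - v 1.0",
    "program errors under this License",
    "no match here",
]

errors = [line for line in lines if "error" in line]      # like rdd.filter(...)
licenses = [line for line in lines if "License" in line]

# union(): plain concatenation, duplicates are kept
union_lines = errors + licenses

# like distinct(); order is preserved here only because this is a plain list
distinct_lines = list(dict.fromkeys(union_lines))

print(len(union_lines), len(distinct_lines))
```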

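Retrieving values: first, take, collect

  • first: returns the first element
  • take: returns the first n elements
  • collect: brings the entire dataset into the driver's memory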
unionLinesRDD.take(2)
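['EXCEPT AS EXPRESSLY SET FORTH IN THIS AGREEMENT, THE PROGRAM IS PROVIDED ON AN "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, EITHER EXPRESS OR IMPLIED INCLUDING, WITHOUT LIMITATION, ANY WARRANTIES OR CONDITIONS OF TITLE, NON-INFRINGEMENT, MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE. Each Recipient is solely responsible for determining the appropriateness of using and distributing the Program and assumes all risks associated with its exercise of rights under this Agreement , including but not limited to the risks and costs of program errors, compliance with applicable laws, damage to or loss of data, programs or equipment, and unavailability or interruption of operations.',
 'Eclipse Public License - v 1.0']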
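
The difference between these three actions matters mostly for large datasets, where `collect` can exhaust the driver's memory while `first` and `take` fetch only a few elements. The following plain-Python sketch is a rough analogy only: a generator stands in for the RDD, and the `consumed` list (a made-up helper) records how many elements each call actually produces. Spark itself works per partition, so it may compute more than n elements for `take(n)`.

```python
from itertools import islice

consumed = []


def source():
    """Yield 1000 fake lines, recording each one that is actually produced."""
    for i in range(1000):
        consumed.append(i)
        yield f"line {i}"


first = next(source())                 # like first(): one element
after_first = len(consumed)

consumed.clear()
taken = list(islice(source(), 2))      # like take(2): stop after two elements
after_take = len(consumed)

consumed.clear()
everything = list(source())            # like collect(): materialise everything
after_collect = len(consumed)

print(after_first, after_take, after_collect)
```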

RDD transformations

[Figure: RDD transformations (image not available)]

RDD actions

[Figure: RDD actions (image not available)]

Using reduce to compute the total number of words in the txt file above

tmp_rdd = rdd.flatMap(lambda line:line.split(' ')).map(lambda x: (x, 1))
tmp_rdd1 = tmp_rdd.map(lambda x: x[1])
tmp_rdd1.reduce(lambda x, y: x + y)

1724

Using fold to compute the total number of words in the txt file above

tmp_rdd1.fold(zeroValue=0, op=lambda x, y: x + y)
1724
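
`reduce` and `fold` give the same total here because 0 is the neutral element for addition. With `fold`, the zero value is applied once per partition and once more when the partial results are merged, so a non-neutral value changes the result. The sketch below imitates that in plain Python. `fold_partitions` is a hypothetical helper written for this illustration, and the partitioning of the data is made up:

```python
from functools import reduce
from operator import add


def fold_partitions(partitions, zero_value, op):
    """Mimic RDD.fold: fold each partition from zero_value, then fold the partials."""
    partials = [reduce(op, part, zero_value) for part in partitions]
    return reduce(op, partials, zero_value)


# Hypothetical data split into two partitions
partitions = [[1, 1, 1], [1, 1]]

total = fold_partitions(partitions, 0, add)     # neutral zero value
skewed = fold_partitions(partitions, 10, add)   # 10 is added 2 + 1 times
empty = fold_partitions([[], []], 0, add)       # folding empty data is fine

try:
    reduce(add, [])                             # no initial value to fall back on
    reduce_on_empty = "ok"
except TypeError:
    reduce_on_empty = "error"

print(total, skewed, empty, reduce_on_empty)
```

The last case is the practical reason to prefer `fold` when the data may be empty: `fold` returns the zero value, whereas a reduce without an initial value has nothing to return. PySpark's `RDD.reduce` likewise raises an error on an empty RDD.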