Common RDD operations in pyspark

Setup:

import pyspark
from pyspark import SparkContext
from pyspark import SparkConf

conf = SparkConf().setAppName("lg").setMaster('local[4]')    # local[4] means run locally with 4 cores
sc=SparkContext.getOrCreate(conf)

1. parallelize and collect

The parallelize function converts a Python list into an RDD; collect() returns the RDD's contents as a plain Python list.

words = sc.parallelize(
    ["scala",
     "java",
     "spark",
     "hadoop",
     "spark",
     "akka",
     "spark vs hadoop",
     "pyspark",
     "pyspark and spark"
     ])
print(words)
print(words.collect())
ParallelCollectionRDD[139] at parallelize at PythonRDD.scala:184
['scala', 'java', 'spark', 'hadoop', 'spark', 'akka', 'spark vs hadoop', 'pyspark', 'pyspark and spark']

2. Two ways to create an RDD: parallelize and textFile

The first way is the parallelize method shown above; the second is to read a file directly with textFile. Note that if the path is a directory, every file in that directory is read (and an error is raised if the directory itself contains subdirectories).

path = 'G:\\pyspark\\rddText.txt'  
rdd = sc.textFile(path)
rdd.collect()

3. Setting and inspecting partitions: repartition, defaultParallelism and glom

The global default number of partitions can be set through SparkContext.defaultParallelism; the partition count of a specific RDD can be changed with repartition.

Calling glom() before collect() groups the results by partition.

SparkContext.defaultParallelism=5
print(sc.parallelize([0, 2, 3, 4, 6]).glom().collect())
SparkContext.defaultParallelism=8
print(sc.parallelize([0, 2, 3, 4, 6]).glom().collect())
rdd = sc.parallelize([0, 2, 3, 4, 6])
rdd.repartition(2).glom().collect()
[[0], [2], [3], [4], [6]]
[[], [0], [], [2], [3], [], [4], [6]]
[[2, 4], [0, 3, 6]]

Note: setting SparkContext.defaultParallelism only affects RDDs defined afterwards; RDDs created before the change keep their existing partitioning.
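A quick way to verify this is getNumPartitions(), which reports an RDD's partition count. A minimal sketch (the sample RDDs and the value 6 are made up for illustration):

rdd_before = sc.parallelize([1, 2, 3])   # created before changing the default
SparkContext.defaultParallelism = 6
rdd_after = sc.parallelize([1, 2, 3])    # created after changing the default
print(rdd_before.getNumPartitions())     # keeps its original partition count
print(rdd_after.getNumPartitions())      # should reflect the new default (6)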

4. count and countByValue

count returns the number of elements in the RDD as an int. countByValue returns, as a dictionary, how many times each distinct value occurs. countByKey treats each element as a key-value pair; since the elements here are strings (an int element would raise an error), it takes the first character of each string as the key and counts how often each key appears.

counts = words.count()
print("Number of elements in RDD -> %i" % counts)
print("Number of every elements in RDD -> %s" % words.countByKey())
print("Number of every elements in RDD -> %s" % words.countByValue())
Number of elements in RDD -> 9
Count by key in RDD -> defaultdict(<class 'int'>, {'s': 4, 'j': 1, 'h': 1, 'a': 1, 'p': 2})
Count by value in RDD -> defaultdict(<class 'int'>, {'scala': 1, 'java': 1, 'spark': 2, 'hadoop': 1, 'akka': 1, 'spark vs hadoop': 1, 'pyspark': 1, 'pyspark and spark': 1})

5. The filter function

filter(func) keeps the elements for which func returns True; the filtering happens inside each partition, so glom() below shows the surviving elements partition by partition.

words_filter = words.filter(lambda x: 'spark' in x)
filtered = words_filter.glom().collect()
print("Fitered RDD -> %s" % (filtered))
Fitered RDD -> [[], ['spark'], ['spark'], ['spark vs hadoop', 'pyspark', 'pyspark and spark']]

6. map and flatMap

map applies a function to every element of the RDD and returns a new RDD. The difference between flatMap() and map(): flatMap() flattens its results, so it returns an RDD of the individual elements of the produced sequences rather than an RDD of lists.

words_map = words.map(lambda x: (x, len(x)))
mapping = words_map.collect()
print("Key value pair -> %s" % (mapping))
words.flatMap(lambda x: (x, len(x))).collect()
Key value pair -> [('scala', 5), ('java', 4), ('spark', 5), ('hadoop', 6), ('spark', 5), ('akka', 4), ('spark vs hadoop', 15), ('pyspark', 7), ('pyspark and spark', 17)]

['scala', 5, 'java', 4, 'spark', 5, 'hadoop', 6, 'spark', 5, 'akka', 4, 'spark vs hadoop', 15, 'pyspark', 7, 'pyspark and spark', 17]

7. reduce and fold

The reduce function applies the specified commutative and associative binary operation to the elements of the RDD and returns the aggregated result.

For example, given a list of integers [x1, x2, x3] and an add operation, reduce starts with sum = x1, then adds x2 to get sum = x1 + x2, and finally adds x3 to get sum = x1 + x2 + x3.

The difference between fold and reduce: fold takes one extra parameter, a zero value. In nums.fold(1, add) below, the 1 is used as the starting accumulator inside every partition and once more when the per-partition results are merged; with the 8 partitions in effect here, that is why the result is 15 + 8×1 + 1 = 24 instead of 15.

def add(a,b):
    c = a + b
    print(str(a) + ' + ' + str(b) + ' = ' + str(c))
    return c
nums = sc.parallelize([1, 2, 3, 4, 5])
adding = nums.reduce(add)
print("Adding all the elements -> %i" % (adding))
adding2 = nums.fold(1, add)   # the first argument 1 is the zero value: it seeds the accumulator in every partition and again in the final merge
print("Adding all the elements -> %i" % (adding2))
1 + 2 = 3
3 + 3 = 6
6 + 4 = 10
10 + 5 = 15
Adding all the elements -> 15
1 + 1 = 2
2 + 2 = 4
4 + 1 = 5
5 + 3 = 8
8 + 4 = 12
12 + 1 = 13
13 + 5 = 18
18 + 6 = 24
Adding all the elements -> 24

8. distinct (remove duplicates)
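distinct() returns a new RDD with duplicate elements removed. A minimal sketch (the sample data is made up for illustration):

rdd_dup = sc.parallelize(["spark", "hadoop", "hive", "spark"])
print(rdd_dup.distinct().collect())   # ['spark', 'hadoop', 'hive'], though the order is not guaranteed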

9. union, intersection, subtract and cartesian between RDDs

These correspond to set union, intersection, difference and Cartesian product. Note that union keeps duplicates, as the output below shows, while intersection removes them.

rdd1 = sc.parallelize(["spark","hadoop","hive","spark"])
rdd2 = sc.parallelize(["spark","hadoop","hbase","hadoop"])
rdd3 = rdd1.union(rdd2)
rdd3.collect()

['spark', 'hadoop', 'hive', 'spark', 'spark', 'hadoop', 'hbase', 'hadoop']
rdd3 = rdd1.intersection(rdd2)
rdd3.collect()

['spark', 'hadoop']
rdd3 = rdd1.subtract(rdd2)
rdd3.collect()

['hive']
rdd3 = rdd1.cartesian(rdd2)
rdd3.collect()

[('spark', 'spark'),
 ('spark', 'hadoop'),
 ('spark', 'hbase'),
 ('spark', 'hadoop'),
 ('hadoop', 'spark'),
 ('hadoop', 'hadoop'),
 ('hadoop', 'hbase'),
 ('hadoop', 'hadoop'),
 ('hive', 'spark'),
 ('hive', 'hadoop'),
 ('hive', 'hbase'),
 ('hive', 'hadoop'),
 ('spark', 'spark'),
 ('spark', 'hadoop'),
 ('spark', 'hbase'),
 ('spark', 'hadoop')]

10. top, take and takeOrdered

These return a plain list, so no collect() is needed. take returns the first n elements in the RDD's original order (no sorting); top returns the n largest elements sorted in descending order; takeOrdered returns the n smallest elements sorted in ascending order.

rdd1 = sc.parallelize(["spark","hadoop","hive","spark","kafka"])
print(rdd1.top(3))
print(rdd1.take(3))
print(rdd1.takeOrdered(3))

['spark', 'spark', 'kafka']
['spark', 'hadoop', 'hive']
['hadoop', 'hive', 'kafka']

11. The join operation

join(other, numPartitions=None) returns an RDD of (key, (value1, value2)) pairs, one for every key that appears in both RDDs and every combination of that key's values.

x = sc.parallelize([("spark", 1), ("hadoop", 4)])
y = sc.parallelize([("spark", 2), ("hadoop", 5)])
joined = x.join(y)
final = joined.collect()
print( "Join RDD -> %s" % (final))

Join RDD -> [('hadoop', (4, 5)), ('spark', (1, 2))]
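When a key has several matching values, join emits one output pair per combination, and keys that appear in only one RDD are dropped. A minimal sketch (sample data made up for illustration):

x2 = sc.parallelize([("spark", 1), ("spark", 3)])
y2 = sc.parallelize([("spark", 2), ("hive", 7)])
print(x2.join(y2).collect())   # expect [('spark', (1, 2)), ('spark', (3, 2))] (order may vary); 'hive' has no match in x2, so it is dropped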

12. aggregate

aggregate(zeroValue, seqOp, combOp): the first function (seqOp) is applied to the elements inside each partition, the second (combOp) merges the per-partition results, and zeroValue is used as the initial value in both steps.

def add2(a,b):
    c = a + b
    print(str(a) + " add " + str(b) + ' = ' + str(c))
    return c
def mul(a,b):
    c = a*b
    print(str(a) + " mul " + str(b) + ' = ' + str(c))
    return c
print(nums.glom().collect())

# First, the values in each partition are summed starting from 2, giving [[3], [4], [5], [11]]
# Then 2 is multiplied with each partition result in turn: 2*3=6, 6*4=24, 24*5=120, 120*11=1320
print(nums.aggregate(2, add2, mul))
# First, the values in each partition are multiplied starting from 2, giving [[2], [4], [6], [40]] (40 = 2*4*5)
# Then 2 is added to each partition result in turn: 2+2=4, 4+4=8, 8+6=14, 14+40=54
print(nums.aggregate(2, mul, add2))

[[1], [2], [3], [4, 5]]
2 mul 3 = 6
6 mul 4 = 24
24 mul 5 = 120
120 mul 11 = 1320
1320
2 add 2 = 4
4 add 4 = 8
8 add 6 = 14
14 add 40 = 54
54

 
