pyspark notes

Preface: I'm writing this blog, first, to help systematize and organize my own learning, and second, to share it with everyone...


I. Creating RDDs
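The snippets below assume an existing SparkSession named spark. A minimal sketch of creating one (the app name is an arbitrary illustrative value):

from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName('pyspark-notes') \
    .getOrCreate()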

sc = spark.sparkContext
############## 1. Creating an RDD with parallelize

d1 = sc.parallelize(range(0,10)) # a range of sequential integers:  “[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]”
d2 = sc.parallelize(((1,2),(3,4))) # nested tuples become a list of tuples:  “[(1, 2), (3, 4)]”
d3 = sc.parallelize((1,2,3)) # a tuple becomes a list:  “[1, 2, 3]”
d4 = sc.parallelize({'name','age','gender'}) # a set becomes a list (order not guaranteed):  “['gender', 'age', 'name']”
d5 = sc.parallelize({1:2,3:4}) # a dict: only the keys are kept:  “[1, 3]”

d6 = sc.parallelize(range(0,10),10) # the second argument creates 10 partitions

print(d1.collect())
print(d2.collect())
print(d3.collect())
print(d4.collect())
print(d5.collect())
print(d6.collect())
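To verify that d6 really has 10 partitions, getNumPartitions reports the count and glom groups the elements by partition:

print(d6.getNumPartitions()) # 10
print(d6.glom().collect()) # one sub-list per partition, e.g. [[0], [1], ..., [9]]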
############## 2. Creating an RDD from a file
td1 = sc.textFile('hdfs:.../user/test_data/id/*') # the path can point to HDFS or to a local file
print(td1.collect()) # each line of the file becomes one element of the list: [u'846240617068\t10002\t50']
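Each element is a raw tab-separated line, so a follow-up map is usually needed to split it into fields (a sketch, assuming the three-column layout of the sample line above):

fields = td1.map(lambda line: line.split('\t')) # [[u'846240617068', u'10002', u'50'], ...]
print(fields.collect())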
############## 3. Creating an RDD by reading a table
from pyspark.sql.functions import col # needed for the col() column reference

city = 'beijing' # placeholder value; the original snippet assumes city is defined elsewhere
tab = spark.sql("select * from u.table").filter(col("city") == city) # note: this is a DataFrame, not an RDD
print(tab.collect())
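spark.sql actually returns a DataFrame rather than an RDD; the underlying RDD of Row objects is one attribute access away:

rows = tab.rdd # RDD of pyspark.sql.Row objects
print(rows.map(lambda row: row['city']).take(5)) # project one column out of each Row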
############## 4. Creating an RDD by reading a MySQL table via JDBC
mtab = (spark.read.format("jdbc")
        .options(
            url='jdbc:mysql://xxx.com:9080',
            driver='com.mysql.jdbc.Driver',
            user='root',
            password='123456',
            dbtable='u.t')
        .load()) # also a DataFrame
print(mtab.collect())
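A common follow-up is to register the JDBC DataFrame as a temporary view so it can be joined or queried with SQL (createOrReplaceTempView is the Spark 2.x name; on 1.x it was registerTempTable):

mtab.createOrReplaceTempView('mysql_t') # 'mysql_t' is an arbitrary view name
print(spark.sql("select count(*) from mysql_t").collect())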

II. Operations

############## 1. Transformation: map

d11 = d1.map(lambda x: x+2)
print(d11.collect()) # [2, 3, 4, 5, 6, 7, 8, 9, 10, 11]

d12 = d1.map(lambda x: (x,x))
print(d12.collect()) # [(0, 0), (1, 1), (2, 2), (3, 3), (4, 4), (5, 5), (6, 6), (7, 7), (8, 8), (9, 9)]

d13 = d1.map(lambda x: (x,(x,x)))
print(d13.collect()) # [(0, (0, 0)), (1, (1, 1)), (2, (2, 2)), (3, (3, 3)), (4, (4, 4)), (5, (5, 5)), (6, (6, 6)), (7, (7, 7)), (8, (8, 8)), (9, (9, 9))]
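Mapping each element to a (key, value) tuple, as in d12 and d13, is what makes key-based transformations such as reduceByKey available. A quick sketch (every key in d12 is unique, so the values pass through unmerged):

d14 = d12.reduceByKey(lambda a, b: a + b)
print(d14.collect()) # [(0, 0), (1, 1), ..., (9, 9)], though the order may vary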

############## 2. Transformation: filter

d21 = d1.map(lambda x: x+2).filter(lambda x: x != 2)
print(d21.collect()) # [3, 4, 5, 6, 7, 8, 9, 10, 11]

d22 = d1.map(lambda x: x+2).filter(lambda x: x == 3)
print(d22.collect()) # [3]
############## 3. Transformation: flatMap
d31 = d4.flatMap(lambda word: word) # a string is itself an iterable of characters, so each word is flattened into characters
print(d31.collect()) # ['g', 'e', 'n', 'd', 'e', 'r', 'a', 'g', 'e', 'n', 'a', 'm', 'e']
d32 = d4.flatMap(lambda word: word.split()) # split() wraps each word in a one-element list, which flatMap unwraps again
print(d32.collect()) # ['gender', 'age', 'name']
d33 = d4.flatMap(lambda word: len(word)) # len(word) is an int, not an iterable
print(d33.collect()) # raises an error: flatMap requires the function to return an iterable
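The fix is either to wrap the result in an iterable or to simply use map, which accepts any return value (set ordering still applies to the output):

d34 = d4.flatMap(lambda word: [len(word)])
print(d34.collect()) # e.g. [6, 3, 4]
d35 = d4.map(lambda word: len(word)) # map is the natural choice here
print(d35.collect()) # same result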
############## 4. Transformation: mapPartitions
def squareFunc(a):
    # mapPartitions passes an iterator over one partition's elements; yield one result per element
    for i in a:
        yield i*i
d41 = d1.mapPartitions(squareFunc)
print(d41.collect()) # [0, 1, 4, 9, 16, 25, 36, 49, 64, 81]
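mapPartitions pays off when there is per-element setup that can be hoisted to run once per partition instead. A sketch with a stand-in for an expensive initialization step:

def scaleFunc(it):
    factor = 10 # imagine expensive setup here, e.g. opening a DB connection once per partition
    for i in it:
        yield i * factor
print(d1.mapPartitions(scaleFunc).collect()) # [0, 10, 20, ..., 90]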
############## 5. Transformation: sample
d51 = d1.sample(False,0.5,0) # sample(withReplacement, fraction, seed); fraction is an expected proportion, not an exact count
print(d51.collect()) # [0, 2, 4, 6, 7, 9]
d52 = d1.sample(True,0.6,0) # the last parameter is the random seed; with replacement, the same element can appear twice
print(d52.collect()) # [1, 3, 5, 8, 8]
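Because the seed fixes the outcome, rerunning with the same arguments reproduces the same sample. For a fixed-size sample returned directly to the driver, the takeSample action is the alternative:

print(d1.sample(False,0.5,0).collect()) # identical to d51: same RDD, same seed, same result
print(d1.takeSample(False, 3, 0)) # takeSample(withReplacement, num, seed) returns exactly 3 elements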

 
