Creating RDDs
From an in-memory collection, parallelize():
lines = sc.parallelize(["pandas", "I like pandas"])
From a text file, textFile():
inputRDD = sc.textFile("log.txt")
Mapping RDDs
One-to-one mapping, map(): transforms each element and returns the transformed elements as the result
nums = sc.parallelize([1, 2, 3, 4])
squared = nums.map(lambda x: x * x).collect()
for num in squared:
    print("%i" % (num))
One-to-many mapping, flatMap(): transforms each element into several elements, then returns all of them together as the result
nums = sc.parallelize(["1 2", "2 3 4"])
words = nums.flatMap(lambda x: x.split(" ")).collect()
for word in words:
    print(word)
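The difference between map() and flatMap() can be modeled in plain Python without a Spark context. This is only a sketch of the semantics using ordinary lists, not the Spark API; the variable names are illustrative.

```python
from itertools import chain

lines = ["1 2", "2 3 4"]

# map-like: one output element per input element (here, one list per line)
mapped = [line.split(" ") for line in lines]

# flatMap-like: the per-line lists are flattened into a single sequence
flat_mapped = list(chain.from_iterable(line.split(" ") for line in lines))

print(mapped)       # [['1', '2'], ['2', '3', '4']]
print(flat_mapped)  # ['1', '2', '2', '3', '4']
```

With map() the result keeps one list per input line; with flatMap() the individual tokens become the elements.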
Filtering RDDs, filter():
inputRDD = sc.textFile("log.txt")
errorsRDD = inputRDD.filter(lambda x:"error" in x)
Combining RDDs, union():
inputRDD = sc.textFile("log.txt")
errorsRDD = inputRDD.filter(lambda x:"error" in x)
warningRDD = inputRDD.filter(lambda x:"warning" in x)
badLinesRDD = errorsRDD.union(warningRDD)
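One detail worth noting is that union() does not remove duplicates, so a line matching both filters appears twice in the result (distinct() can be applied afterwards if that is unwanted). The following plain-Python sketch models this with lists; the sample log lines are made up for illustration and no Spark context is involved.

```python
# Hypothetical log content; the last line matches both filters.
log_lines = [
    "info: started",
    "error: disk full",
    "warning: low memory",
    "error after warning: retry failed",
]

errors = [line for line in log_lines if "error" in line]
warnings = [line for line in log_lines if "warning" in line]

# union-like: plain concatenation, duplicates are kept
bad_lines = errors + warnings

print(len(bad_lines))       # 4
print(len(set(bad_lines)))  # 3
```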
Counting the elements of an RDD, count()
print("Input had " + str(badLinesRDD.count()) + " concerning lines")
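Reading RDD contents back to the local driver
Partial read, take(n): fetches n records from the RDD to the local driver
for line in badLinesRDD.take(10):
    print(line)
Full read, collect(): fetches every record, so the whole RDD must fit in local memory
for line in badLinesRDD.collect():
    print(line)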
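aggregate(): a more general reduce whose result type may differ from the element type
from pyspark import SparkContext, SparkConf
sc = SparkContext("local[2]", "Simple App")
nums = sc.parallelize([1, 2, 3, 4, 5])
# The accumulator is a (sum, count) pair, starting from (0, 0)
sumCount = nums.aggregate((0, 0),
    # Fold one element into a partition's accumulator
    (lambda acc, value: (acc[0] + value, acc[1] + 1)),
    # Merge the accumulators of two partitions
    (lambda acc1, acc2: (acc1[0] + acc2[0], acc1[1] + acc2[1])))
# Average = sum / count
print(sumCount[0] / float(sumCount[1]))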
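To see how the two functions passed to aggregate() cooperate, the computation can be modeled in plain Python. This is a sketch under assumptions: the helper aggregate() below is hypothetical, not the Spark implementation, and the split of the data into two partitions is chosen arbitrarily for illustration.

```python
from functools import reduce

def aggregate(partitions, zero, seq_op, comb_op):
    """Plain-Python model of aggregate(); `partitions` is a list of lists."""
    # Fold each partition separately, starting from the zero value.
    partials = [reduce(seq_op, part, zero) for part in partitions]
    # Merge the per-partition results, again starting from the zero value.
    return reduce(comb_op, partials, zero)

# Assumed partitioning of [1, 2, 3, 4, 5] into two partitions.
partitions = [[1, 2], [3, 4, 5]]

total, count = aggregate(
    partitions,
    (0, 0),
    lambda acc, value: (acc[0] + value, acc[1] + 1),
    lambda a, b: (a[0] + b[0], a[1] + b[1]),
)

print(total / float(count))  # 3.0
```

The first function only ever sees elements of a single partition, and the second only ever sees accumulators, which is why the accumulator type, here a (sum, count) tuple, can differ from the element type.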