Python aggregate() example

Similar to reduce(), but the result may have a different type from the RDD's elements. It accepts three parameters:

zeroValue : the initial value of the accumulated result in each partition for the seqOp operator, and also the initial value when combining the results of different partitions with the combOp operator

seqOp : an operator used to accumulate results within a partition

combOp: an associative operator used to combine results from different partitions
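
The role of the three parameters can be imitated locally without Spark. The following is a minimal plain-Python sketch, not Spark's implementation: local_aggregate and the partition layouts are hypothetical names and data chosen for illustration, and the sketch assumes each partition is folded with seqOp starting from zeroValue, after which the partial results are folded with combOp, again starting from zeroValue.

```python
from functools import reduce

def local_aggregate(partitions, zero_value, seq_op, comb_op):
    """Plain-Python imitation of aggregate(): fold each partition, then fold the partial results."""
    # seqOp runs inside every partition, starting from zeroValue each time.
    partials = [reduce(seq_op, part, zero_value) for part in partitions]
    # combOp merges the per-partition results, again starting from zeroValue.
    return reduce(comb_op, partials, zero_value)

add = lambda x, y: x + y
one_partition = [[12, 2, 6, 2, 12, 2]]        # hypothetical layout: 1 partition
two_partitions = [[12, 2, 6], [2, 12, 2]]     # hypothetical layout: 2 partitions

sum_neutral = local_aggregate(two_partitions, 0, add, add)    # 36
sum_one = local_aggregate(one_partition, 100, add, add)       # 36 + 100 * 2 = 236
sum_two = local_aggregate(two_partitions, 100, add, add)      # 36 + 100 * 3 = 336
print(sum_neutral, sum_one, sum_two)
```

Under these assumptions a zeroValue that is not neutral for the operators is counted once per partition plus once in the final combine, so the result depends on the number of partitions.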

nums = sc.parallelize([12, 2, 6, 2, 12, 2])
nums.reduce(lambda x,y:x+y)
Result: 36

nums.aggregate(0, lambda x, y: x + y, lambda x, y: x + y)
Result: 36

Source documentation: http://spark.apache.org/docs/2.2.1/api/python/pyspark#pyspark.RDD


nums.aggregate((0, 0), lambda x, y: (x[0] + y, x[1] + 1), lambda x, y: (x[0] + y[0], x[1] + y[1]))
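
The call above with zeroValue (0, 0) accumulates a (sum, count) pair, which is the usual way to get a mean in one pass. A small sketch in plain Python, assuming the sample data happens to be split into the two hypothetical partitions shown; it only mirrors what the two lambdas do and is not a Spark call.

```python
from functools import reduce

# seqOp: fold one element y into the running (sum, count) pair x.
seq_op = lambda x, y: (x[0] + y, x[1] + 1)
# combOp: merge two (sum, count) pairs coming from different partitions.
comb_op = lambda x, y: (x[0] + y[0], x[1] + y[1])

# Hypothetical split of the sample data into two partitions.
partitions = [[12, 2, 6], [2, 12, 2]]
partials = [reduce(seq_op, part, (0, 0)) for part in partitions]
total, count = reduce(comb_op, partials, (0, 0))
mean = total / count
print(total, count, mean)  # 36 6 6.0
```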

nums.repartition(1).aggregate(100, lambda x, y: x + y, lambda x, y: x + y)

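Result: 236 (with a single partition, zeroValue is added once by seqOp inside the partition and once more by combOp when the partition results are combined: 36 + 100 + 100)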

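repartition() redistributes the data so that the partitions are roughly equal in size.

Why can data end up unbalanced? Records are assigned to partitions by a hash algorithm during the shuffle. (For example, with three partitions, every value whose hash leaves remainder 0 when divided by 3 goes to the same partition, so identical values always land together.)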
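
The modulo idea can be illustrated locally. This is a plain-Python sketch only: hash_partition is a hypothetical helper, it uses Python's built-in hash(), and it is not meant to reproduce the partitioner Spark actually applies in repartition().

```python
from collections import defaultdict

def hash_partition(values, num_partitions):
    """Assign each value to bucket hash(value) % num_partitions."""
    buckets = defaultdict(list)
    for v in values:
        buckets[hash(v) % num_partitions].append(v)
    # Return one list per partition, including empty ones.
    return [buckets[i] for i in range(num_partitions)]

parts = hash_partition([12, 2, 6, 2, 12, 2], 3)
print(parts)  # [[12, 6, 12], [], [2, 2, 2]]
```

With this sample data one bucket stays empty, which is the kind of imbalance described above.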

RDD element retrieval operations

take(n): Return the first n elements of the RDD

top(num): Return the num largest elements of the RDD, in descending order

first(): Return the first element

collect(): Return all elements of the RDD

foreach(func): Apply the provided function to each element of the RDD

takeSample(withReplacement, num, [seed]): Return a fixed-size random sample of the RDD

RDD retrieval examples

nums = sc.parallelize([12, 2, 6, 2, 12, 2])
print (nums.top(3))
print (nums.take(2))
print (nums.first())

[12, 12, 6]

[12, 2]

12
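
The example above does not exercise takeSample(). Its two sampling modes can be sketched with Python's random module; local_take_sample is a hypothetical local analogue, not the Spark call, and the elements it picks will differ from Spark's for the same seed.

```python
import random

data = [12, 2, 6, 2, 12, 2]

def local_take_sample(values, with_replacement, num, seed=None):
    """Plain-Python analogue of takeSample(): a fixed-size random sample."""
    rng = random.Random(seed)
    if with_replacement:
        # The same element may be drawn more than once.
        return rng.choices(values, k=num)
    # Without replacement the sample cannot exceed the data size.
    return rng.sample(values, min(num, len(values)))

with_repl = local_take_sample(data, True, 8, seed=1)
without_repl = local_take_sample(data, False, 3, seed=1)
print(len(with_repl), len(without_repl))  # 8 3
```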

How much money did Tencent collect in total?
lines.map(lambda x: int(x.split(',')[2])).reduce(lambda x, y: x + y)
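
The same map-then-reduce pattern can be tried locally. The records below are hypothetical sample data, and the "region,name,amount" layout is an assumption inferred from how the fields are used in the rest of this post.

```python
from functools import reduce

# Hypothetical sample records in the assumed "region,name,amount" layout.
lines = [
    "north,alice,30",
    "north,bob,20",
    "south,alice,50",
]

# Extract the third field as an integer, then add everything up.
amounts = map(lambda x: int(x.split(',')[2]), lines)
total = reduce(lambda x, y: x + y, amounts)
print(total)  # 100
```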

 


# How much did each person spend in each region in total? Sort by region descending, then by name ascending
rdd=lines.map(lambda x:((x.split(',')[0],x.split(',')[1]),int(x.split(',')[2]))).reduceByKey(lambda x,y:x+y)
rdd=rdd.collect()

class Reversinator1(object):  # changes the comparison rule: region descending, name ascending
    def __init__(self, obj):
        self.obj = obj
    def __lt__(self, other):
        if self.obj[0]==other.obj[0]:
            return self.obj[1]<other.obj[1]
        else:
            return self.obj[0]>other.obj[0]

rs=sorted(rdd,key=lambda x:Reversinator1(x[0]))
print(rs)
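
A possible alternative to the wrapper class, sketched under the assumption that the collected data is a list of ((region, name), total) pairs: because Python's sort is stable, even with reverse=True, sorting by the secondary key first and the primary key second gives the same mixed ordering. The records are hypothetical.

```python
# Hypothetical ((region, name), total) pairs, as collect() would return them.
records = [
    (("north", "bob"), 20),
    (("south", "alice"), 50),
    (("north", "alice"), 30),
    (("south", "carl"), 10),
]

# Pass 1: secondary key (name) ascending.
by_name = sorted(records, key=lambda kv: kv[0][1])
# Pass 2: primary key (region) descending; ties keep the order from pass 1.
result = sorted(by_name, key=lambda kv: kv[0][0], reverse=True)
print(result)
```
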
# How much did each person spend in each region? Sort by region descending; within the same region, sort by amount spent ascending
rdd=lines.map(lambda x:((x.split(',')[0],x.split(',')[1]),int(x.split(',')[2]))).reduceByKey(lambda x,y:x+y)
rdd=rdd.collect()

class Reversinator1(object):  # changes the comparison rule: region descending, amount ascending
    def __init__(self, obj):
        self.obj = obj
    def __lt__(self, other):
        if self.obj[0][0]==other.obj[0][0]:
            return self.obj[1]<other.obj[1]
        else:
            return self.obj[0][0]>other.obj[0][0]

rs=sorted(rdd,key=lambda x:Reversinator1(x))
print(rs)
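
When the secondary key is numeric, as the amount is here, a shorter variant is possible. This is a sketch with hypothetical records, assuming the same ((region, name), total) layout: reverse=True flips both parts of the key, and negating the amount flips that part back.

```python
# Hypothetical ((region, name), total) pairs.
records = [
    (("north", "bob"), 20),
    (("south", "alice"), 50),
    (("north", "alice"), 30),
    (("south", "carl"), 10),
]

# reverse=True flips both parts of the key; negating the amount flips it back,
# leaving region descending and amount ascending.
result = sorted(records, key=lambda kv: (kv[0][0], -kv[1]), reverse=True)
print(result)
```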

# How much did each person spend in each region? Use a composite key
rdd=lines.map(lambda x:((x.split(',')[0],x.split(',')[1]),int(x.split(',')[2]))).reduceByKey(lambda x,y:x+y)
rdd=rdd.collect()
# Sort by region first, then by name within the same region, using Python's sorted()
rs=sorted(rdd)
print(rs)
# Descending order
rs=sorted(rdd,reverse=True)
print(rs)
class Reversinator1(object):  # inverts the comparison rule of the wrapped value
    def __init__(self, obj):
        self.obj = obj
    def __lt__(self, other):
        return other.obj < self.obj
class Reversinator2(object):  # inverts the comparison rule of the wrapped value
    def __init__(self, obj):
        self.obj = obj
    def __lt__(self, other):
        return other.obj < self.obj
rs=sorted(rs,key=lambda x:(x[0][0],Reversinator2(x[0][1])),reverse=True)
print(rs)
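
The wrapper classes above define only __lt__, so inside a tuple key two wrappers never compare equal unless they are the same object. A variant that also defines equality can be sketched with functools.total_ordering. Descending is a hypothetical name and the records are sample data; the region is wrapped directly, so no reverse flag is needed.

```python
from functools import total_ordering

@total_ordering
class Descending:
    """Hypothetical wrapper that inverts the ordering of the wrapped value."""
    def __init__(self, obj):
        self.obj = obj
    def __eq__(self, other):
        return self.obj == other.obj
    def __lt__(self, other):
        # Inverted on purpose: "smaller" means the wrapped value is larger.
        return other.obj < self.obj

# Hypothetical ((region, name), total) pairs.
records = [
    (("north", "bob"), 20),
    (("south", "alice"), 50),
    (("north", "alice"), 30),
    (("south", "carl"), 10),
]

# Region descending (wrapped), name ascending (plain).
result = sorted(records, key=lambda kv: (Descending(kv[0][0]), kv[0][1]))
print(result)
```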

 


