Python aggregate() examples

Similar to reduce(), but the result may have a different type from the RDD's elements. It accepts three parameters:

zeroValue : the initial value of the accumulated result in each partition for the seqOp operator, and also the initial value when the combOp operator combines the results from different partitions

seqOp : an operator used to accumulate results within a partition

combOp : an associative operator used to combine results from different partitions
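The way the three parameters interact can be modelled without a cluster. The following is a minimal pure-Python sketch, not Spark's own code: the helper name simulate_aggregate and the split into two partitions are assumptions made for illustration. It folds each partition with seqOp starting from zeroValue, then merges the partial results with combOp, again starting from zeroValue.

```python
from functools import reduce

def simulate_aggregate(partitions, zero_value, seq_op, comb_op):
    """Pure-Python model of aggregate(): fold each partition, then combine."""
    # seq_op starts from zero_value inside every partition.
    partials = [reduce(seq_op, part, zero_value) for part in partitions]
    # comb_op also starts from zero_value when merging the partial results.
    return reduce(comb_op, partials, zero_value)

# Hypothetical split of [12, 2, 6, 2, 12, 2] into two partitions.
partitions = [[12, 2, 6], [2, 12, 2]]
add = lambda x, y: x + y

total = simulate_aggregate(partitions, 0, add, add)
shifted = simulate_aggregate(partitions, 100, add, add)
print(total)    # 36
print(shifted)  # 336: 36 plus 100 per partition (2) plus 100 for the combine
```

Because the zero value is used once per partition and once more in the combine step, it should normally be the neutral element of the operation (0 for addition), otherwise the result depends on the number of partitions.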

nums = sc.parallelize([12, 2, 6, 2, 12, 2])
nums.reduce(lambda x,y:x+y)
Result: 36

nums.aggregate(0, lambda x, y: x + y, lambda x, y: x + y)
Result: 36

API documentation: http://spark.apache.org/docs/2.2.1/api/python/pyspark#pyspark.RDD


# Accumulate a (sum, count) pair; the result type differs from the element type
nums.aggregate((0, 0), lambda x, y: (x[0] + y, x[1] + 1), lambda x, y: (x[0] + y[0], x[1] + y[1]))
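The (sum, count) accumulator above is the usual way to compute a mean in one pass. As a rough local sketch (the two-partition split is an assumption, and functools.reduce stands in for the Spark machinery), the same seqOp and combOp can be traced by hand:

```python
from functools import reduce

# Hypothetical split of the sample data into two partitions.
partitions = [[12, 2, 6], [2, 12, 2]]

# seq_op folds one element into a (sum, count) accumulator.
seq_op = lambda acc, value: (acc[0] + value, acc[1] + 1)
# comb_op merges two (sum, count) accumulators.
comb_op = lambda a, b: (a[0] + b[0], a[1] + b[1])

partials = [reduce(seq_op, part, (0, 0)) for part in partitions]
total, count = reduce(comb_op, partials, (0, 0))
mean = total / count
print(partials)            # [(20, 3), (16, 3)]
print(total, count, mean)  # 36 6 6.0
```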

nums.repartition(1).aggregate(100, lambda x, y: x + y, lambda x, y: x + y)

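Result: 236 (the 36 from the data, plus the zero value 100 added once inside the single partition and once more when combOp merges the partition results)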

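repartition() redistributes the data so that the partitions are roughly equal in size.

Why can the data be unbalanced across partitions? Elements are assigned to partitions by a hash algorithm when the data is shuffled. (For example, with 3 partitions, every value whose hash leaves remainder 0 when divided by 3 goes to the same partition, so identical values always land in the same partition.)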
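The skew described above can be reproduced locally. This is a simplified sketch that assumes a plain "hash modulo number of partitions" rule; assign_partitions is a hypothetical helper, not a Spark function, and Spark's real partitioner may differ in detail.

```python
from collections import defaultdict

def assign_partitions(keys, num_partitions):
    """Place each key in partition hash(key) % num_partitions."""
    buckets = defaultdict(list)
    for key in keys:
        buckets[hash(key) % num_partitions].append(key)
    # Return the partitions in index order, including empty ones.
    return [buckets[i] for i in range(num_partitions)]

layout = assign_partitions([12, 2, 6, 2, 12, 2], 3)
print(layout)  # [[12, 6, 12], [], [2, 2, 2]]
```

With this sample, partition 1 stays empty while the other two hold three elements each, and all copies of the same value sit together.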

RDD actions for retrieving elements

take(n): Return the first n elements of the RDD

top(num): Return the num largest elements of the RDD, in descending order

first(): Return the first element

collect(): Return all elements of the RDD

foreach(func): Apply the provided function to each element of the RDD

takeSample(withReplacement, num, [seed]): Return a random sample of num elements, drawn with or without replacement

Examples of retrieving RDD elements

nums = sc.parallelize([12, 2, 6, 2, 12, 2])
print (nums.top(3))
print (nums.take(2))
print (nums.first())

Output:
[12, 12, 6]
[12, 2]
12
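To check such outputs without a Spark session, the same semantics can be approximated on a plain Python list. This is only a local analogy (heapq, slicing and random are standard-library stand-ins, not Spark calls), and the random sample will not match the one Spark would draw for the same seed.

```python
import heapq
import random

data = [12, 2, 6, 2, 12, 2]

top3 = heapq.nlargest(3, data)   # like top(3): largest values first
first2 = data[:2]                # like take(2): elements in original order
first = data[0]                  # like first()
rng = random.Random(42)          # fixed seed, like the optional seed argument
sample = rng.sample(data, 3)     # like takeSample(False, 3, seed)

print(top3)         # [12, 12, 6]
print(first2)       # [12, 2]
print(first)        # 12
print(len(sample))  # 3
```

The point of the comparison is that top() orders the data before selecting, while take() and first() simply read elements in their existing order.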

How much money did Tencent collect in total?
lines.map(lambda x: int(x.split(',')[2])).reduce(lambda x, y: x + y)
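The post does not show what lines contains. The sketch below assumes each record is a comma-separated string of the form "zone,name,amount" (inferred from the later sorting examples) and uses made-up records; it mirrors the map and reduce steps with Python's built-ins.

```python
from functools import reduce

# Hypothetical records in "zone,name,amount" form.
lines = [
    "zone1,alice,30",
    "zone2,bob,20",
    "zone1,alice,15",
    "zone2,carol,5",
]

# map step: extract the third field as an integer.
amounts = [int(line.split(",")[2]) for line in lines]
# reduce step: add the amounts pairwise.
total = reduce(lambda x, y: x + y, amounts)
print(total)  # 70
```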

 


# How much did each person spend in each zone? Sort by zone descending, then by name ascending
rdd=lines.map(lambda x:((x.split(',')[0],x.split(',')[1]),int(x.split(',')[2]))).reduceByKey(lambda x,y:x+y)
rdd=rdd.collect()

class Reversinator1(object):  # key wrapper: zone descending, then name ascending
    def __init__(self, obj):
        self.obj = obj
    def __lt__(self, other):
        if self.obj[0]==other.obj[0]:
            return self.obj[1]<other.obj[1]
        else:
            return self.obj[0]>other.obj[0]

rs=sorted(rdd,key=lambda x:Reversinator1(x[0]))
print(rs)
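An alternative to a custom comparison class is to rely on the stability of Python's sort: sort by the secondary key first, then by the primary key. The records below are hypothetical stand-ins for the collected ((zone, name), total) pairs.

```python
# Hypothetical collected result: ((zone, name), total_spent).
records = [
    (("zone1", "bob"), 20),
    (("zone2", "alice"), 45),
    (("zone1", "alice"), 30),
    (("zone2", "carol"), 5),
]

# Pass 1: secondary key, name ascending.
by_name = sorted(records, key=lambda r: r[0][1])
# Pass 2: primary key, zone descending; equal zones keep the name order from pass 1.
result = sorted(by_name, key=lambda r: r[0][0], reverse=True)
print(result)
# [(('zone2', 'alice'), 45), (('zone2', 'carol'), 5),
#  (('zone1', 'alice'), 30), (('zone1', 'bob'), 20)]
```

This costs two sorts instead of one, but it avoids writing __lt__ by hand and works for any mix of ascending and descending keys.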
# How much did each person spend in each zone? Sort by zone descending; within the same zone, sort by amount spent ascending
rdd=lines.map(lambda x:((x.split(',')[0],x.split(',')[1]),int(x.split(',')[2]))).reduceByKey(lambda x,y:x+y)
rdd=rdd.collect()

class Reversinator1(object):  # key wrapper: zone descending, then amount ascending
    def __init__(self, obj):
        self.obj = obj
    def __lt__(self, other):
        if self.obj[0][0]==other.obj[0][0]:
            return self.obj[1]<other.obj[1]
        else:
            return self.obj[0][0]>other.obj[0][0]

rs=sorted(rdd,key=lambda x:Reversinator1(x))
print(rs)
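When the field that must run in the opposite direction is numeric, negating it in the key is a shorter option than a wrapper class. A small sketch with hypothetical records:

```python
# Hypothetical collected result: ((zone, name), total_spent).
records = [
    (("zone1", "bob"), 20),
    (("zone2", "alice"), 45),
    (("zone1", "alice"), 30),
    (("zone2", "carol"), 5),
]

# reverse=True flips both key fields, so negating the amount
# turns its order back into ascending.
result = sorted(records, key=lambda r: (r[0][0], -r[1]), reverse=True)
print(result)
# [(('zone2', 'carol'), 5), (('zone2', 'alice'), 45),
#  (('zone1', 'bob'), 20), (('zone1', 'alice'), 30)]
```

The trick only applies to values that can be negated; for strings, a wrapper class or a two-pass stable sort is still needed.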

# How much did each person spend in each zone? Use a composite key
rdd=lines.map(lambda x:((x.split(',')[0],x.split(',')[1]),int(x.split(',')[2]))).reduceByKey(lambda x,y:x+y)
rdd=rdd.collect()
# Sort by zone first, then by name within the same zone, using Python's built-in sort
rs=sorted(rdd)
print(rs)
# Descending order
rs=sorted(rdd,reverse=True)
print(rs)
class Reversinator2(object):  # key wrapper that inverts the comparison of the wrapped value
    def __init__(self, obj):
        self.obj = obj
    def __lt__(self, other):
        return other.obj < self.obj
# reverse=True makes the zone descending; the wrapper flips the name back to ascending
rs=sorted(rs,key=lambda x:(x[0][0],Reversinator2(x[0][1])),reverse=True)
print(rs)
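A wrapper that defines only __lt__ works when it is the last element of the key tuple, because tuple comparison uses == to find the first differing position. If the wrapper is placed earlier in the tuple, it also needs __eq__. The sketch below shows one possible variant using functools.total_ordering; the class name Desc and the sample records are made up for illustration.

```python
from functools import total_ordering

@total_ordering
class Desc:
    """Wrap a value so that it sorts in the opposite order."""
    def __init__(self, value):
        self.value = value
    def __eq__(self, other):
        return self.value == other.value
    def __lt__(self, other):
        # Inverted on purpose: larger wrapped values sort first.
        return other.value < self.value

# Hypothetical collected result: ((zone, name), total_spent).
records = [
    (("zone1", "bob"), 20),
    (("zone2", "alice"), 45),
    (("zone1", "alice"), 30),
    (("zone2", "carol"), 5),
]

# Zone descending (wrapped), name ascending (plain), no reverse flag needed.
result = sorted(records, key=lambda r: (Desc(r[0][0]), r[0][1]))
print(result)
# [(('zone2', 'alice'), 45), (('zone2', 'carol'), 5),
#  (('zone1', 'alice'), 30), (('zone1', 'bob'), 20)]
```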

 


