Python aggregate operation examples

Original

2018-09-02 22:39

Similar to reduce(), but the result may have a different type from the RDD's elements. It accepts three parameters:

zeroValue: the initial value of the accumulated result in each partition for the seqOp operator, and also the initial value when the results from different partitions are combined by the combOp operator

seqOp: an operator used to accumulate results within a partition

combOp: an associative operator used to combine results from different partitions
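
To make the roles of the three parameters concrete, here is a minimal plain-Python sketch of the same aggregation, assuming the data has already been split into partitions. The helper name simulate_aggregate and the two-partition layout are hypothetical; the sketch mirrors the behaviour described in the parameter list rather than Spark's actual implementation.

```python
from functools import reduce

def simulate_aggregate(partitions, zero_value, seq_op, comb_op):
    """Illustrative stand-in for RDD.aggregate on pre-split data."""
    # seqOp folds the elements of each partition, starting from zeroValue.
    partials = [reduce(seq_op, part, zero_value) for part in partitions]
    # combOp merges the per-partition results, again starting from zeroValue.
    return reduce(comb_op, partials, zero_value)

# Hypothetical split of the sample data into two partitions.
partitions = [[12, 2, 6], [2, 12, 2]]
add = lambda x, y: x + y

total = simulate_aggregate(partitions, 0, add, add)
shifted = simulate_aggregate(partitions, 10, add, add)
print(total)    # 36
print(shifted)  # 66: zeroValue 10 is counted once per partition and once in the combine
```

A zeroValue that is not neutral for the operators (such as 10 for addition) is therefore counted once per partition plus once more, which is why 0 is the usual choice for sums.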

nums = sc.parallelize([12, 2, 6, 2, 12, 2])
nums.reduce(lambda x,y:x+y)
Result: 36

nums.aggregate(0, lambda x, y: x + y, lambda x, y: x + y)
Result: 36

API documentation: http://spark.apache.org/docs/2.2.1/api/python/pyspark#pyspark.RDD

nums.aggregate((0, 0), lambda x, y: (x[0] + y, x[1] + 1), lambda x, y: (x[0] + y[0], x[1] + y[1]))
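
The (sum, count) accumulator in this call is a common way to compute a mean in a single pass. The following plain-Python sketch applies the same two operators, assuming a hypothetical two-partition split of the sample data; it illustrates the idea and is not Spark code.

```python
from functools import reduce

# seqOp: fold one element into a (sum, count) accumulator.
seq_op = lambda acc, value: (acc[0] + value, acc[1] + 1)
# combOp: merge two (sum, count) accumulators from different partitions.
comb_op = lambda a, b: (a[0] + b[0], a[1] + b[1])

partitions = [[12, 2, 6], [2, 12, 2]]  # hypothetical partition layout
partials = [reduce(seq_op, part, (0, 0)) for part in partitions]
total, count = reduce(comb_op, partials, (0, 0))
mean = total / count
print(total, count, mean)  # 36 6 6.0
```

Note that the accumulator is a tuple while the elements are integers, which is the case reduce() alone cannot express.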

nums.repartition(1).aggregate(100, lambda x, y: x + y, lambda x, y: x + y)

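Result: 236 (with a single partition, zeroValue 100 is added once inside the partition and once more in the final combine: 100 + 100 + 36)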

repartition redistributes the data so that the partitions are roughly equal in size.

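Why can the data end up unbalanced? Elements are assigned to partitions by a hash algorithm when the data is shuffled. (For example, taking each value modulo 3: elements with the same remainder, such as 0, are placed in the same partition.)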
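
A toy illustration of that imbalance, assuming the simple "value modulo number of partitions" rule from the example. The function assign_partitions is hypothetical and is not Spark's actual partitioner.

```python
from collections import defaultdict

def assign_partitions(values, num_partitions):
    """Toy partitioner: place each value in bucket value % num_partitions."""
    buckets = defaultdict(list)
    for v in values:
        buckets[v % num_partitions].append(v)
    # Report every partition, including the ones that received nothing.
    return {p: buckets.get(p, []) for p in range(num_partitions)}

layout = assign_partitions([12, 2, 6, 2, 12, 2], 3)
print(layout)  # {0: [12, 6, 12], 1: [], 2: [2, 2, 2]}
```

With the sample data, two partitions hold three elements each and one stays empty, because equal values always hash to the same place.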

RDD element retrieval operations

take(n): Return the first n elements of the RDD

top(num): Return the num largest elements of the RDD, in descending order

first(): Return the first element

collect(): Return all elements of the RDD

foreach(func): Apply the provided function to each element of the RDD

takeSample(withReplacement, num, [seed]): Return a fixed-size random sample of the RDD

RDD retrieval examples

nums = sc.parallelize([12, 2, 6, 2, 12, 2])
print(nums.top(3))
print(nums.take(2))
print(nums.first())

Output:

[12, 12, 6]
[12, 2]
12
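
For readers without a Spark session at hand, these operations have rough plain-Python analogues on an ordinary list. This is only a sketch for comparing semantics: the variable names are illustrative, and the random samples will not match the values Spark would return for the same seed.

```python
import heapq
import random

data = [12, 2, 6, 2, 12, 2]

first = data[0]                    # like first()
taken = data[:2]                   # like take(2): elements in stored order
largest = heapq.nlargest(3, data)  # like top(3): largest values first
everything = list(data)            # like collect()

rng = random.Random(42)               # fixed seed, similar in spirit to the seed argument
sample_no_repl = rng.sample(data, 3)  # like takeSample(False, 3, seed)
sample_repl = rng.choices(data, k=3)  # like takeSample(True, 3, seed)

print(first, taken, largest)  # 12 [12, 2] [12, 12, 6]
```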

How much money did Tencent collect in total?
lines.map(lambda x: int(x.split(',')[2])).reduce(lambda x, y: x + y)
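
The same total can be checked in plain Python on a few records. The sample lines below are made up for illustration, and the "zone,name,amount" field order is an assumption inferred from the sorting examples that follow.

```python
# Hypothetical sample records in "zone,name,amount" form.
lines = ["east,alice,30", "east,bob,20", "west,alice,50"]

# Equivalent of map(...): parse the third comma-separated field as an integer.
amounts = [int(line.split(",")[2]) for line in lines]
# Equivalent of reduce(lambda x, y: x + y).
total = sum(amounts)
print(total)  # 100
```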

# How much did each person spend in each zone in total? Sort by zone descending, then name ascending
rdd=lines.map(lambda x:((x.split(',')[0],x.split(',')[1]),int(x.split(',')[2]))).reduceByKey(lambda x,y:x+y)
rdd=rdd.collect()

class Reversinator1(object):  # wraps a (zone, name) key: zone compares descending, name ascending
    def __init__(self, obj):
        self.obj = obj
    def __lt__(self, other):
        if self.obj[0]==other.obj[0]:
            return self.obj[1]<other.obj[1]
        else:
            return self.obj[0]>other.obj[0]

rs=sorted(rdd,key=lambda x:Reversinator1(x[0]))
print(rs)
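
As an alternative to a wrapper class, the same ordering can be sketched with two stable sorts: sort by the secondary key first, then by the primary key. The records below are hypothetical stand-ins for the collected reduceByKey output.

```python
# Hypothetical collected output of reduceByKey: ((zone, name), total).
records = [(("east", "bob"), 20), (("west", "alice"), 50),
           (("east", "alice"), 30), (("west", "carol"), 10)]

# Pass 1: secondary key, name ascending.
by_name = sorted(records, key=lambda kv: kv[0][1])
# Pass 2: primary key, zone descending. Python's sort is stable, also with
# reverse=True, so names stay ascending within each zone.
rs = sorted(by_name, key=lambda kv: kv[0][0], reverse=True)
print(rs)
# [(('west', 'alice'), 50), (('west', 'carol'), 10), (('east', 'alice'), 30), (('east', 'bob'), 20)]
```
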
# How much did each person spend in each zone? Sort by zone descending; within the same zone, sort by amount spent ascending
rdd=lines.map(lambda x:((x.split(',')[0],x.split(',')[1]),int(x.split(',')[2]))).reduceByKey(lambda x,y:x+y)
rdd=rdd.collect()

class Reversinator1(object):  # wraps a whole record: zone compares descending, amount ascending
    def __init__(self, obj):
        self.obj = obj
    def __lt__(self, other):
        if self.obj[0][0]==other.obj[0][0]:
            return self.obj[1]<other.obj[1]
        else:
            return self.obj[0][0]>other.obj[0][0]

rs=sorted(rdd,key=lambda x:Reversinator1(x))
print(rs)
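
Because the amount is numeric, this ordering can also be sketched without a wrapper class by negating the amount in the sort key. The records are hypothetical sample data.

```python
# Hypothetical collected output of reduceByKey: ((zone, name), total).
records = [(("east", "bob"), 20), (("west", "alice"), 50),
           (("east", "alice"), 30), (("west", "carol"), 10)]

# reverse=True makes the zone descending; negating the amount turns the
# reversed order back into ascending amounts within each zone.
rs = sorted(records, key=lambda kv: (kv[0][0], -kv[1]), reverse=True)
print(rs)
# [(('west', 'carol'), 10), (('west', 'alice'), 50), (('east', 'bob'), 20), (('east', 'alice'), 30)]
```

The negation trick only works for numeric fields; for strings, a wrapper class or two stable sorts are needed.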

# How much did each person spend in each zone in total? Use a composite key
rdd=lines.map(lambda x:((x.split(',')[0],x.split(',')[1]),int(x.split(',')[2]))).reduceByKey(lambda x,y:x+y)
rdd=rdd.collect()
# Sort by zone first, then by name within the same zone, using Python's built-in sort
rs=sorted(rdd)
print(rs)
# Descending order
rs=sorted(rdd,reverse=True)
print(rs)
class Reversinator1(object):  # wrapper that inverts the comparison order
    def __init__(self, obj):
        self.obj = obj
    def __lt__(self, other):
        return other.obj < self.obj
class Reversinator2(object):  # wrapper that inverts the comparison order
    def __init__(self, obj):
        self.obj = obj
    def __lt__(self, other):
        return other.obj < self.obj
# Zone descending; the inverted name combined with reverse=True gives name ascending
rs=sorted(rs,key=lambda x:(x[0][0],Reversinator2(x[0][1])),reverse=True)
print(rs)
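
The map and reduceByKey step shared by all three examples can be emulated in plain Python with a dictionary keyed by the composite (zone, name) tuple. The input lines are made-up sample data and the "zone,name,amount" layout is an assumption.

```python
from collections import defaultdict

# Hypothetical sample records in "zone,name,amount" form.
lines = ["east,alice,30", "west,alice,50", "east,alice,5", "east,bob,20"]

totals = defaultdict(int)
for line in lines:
    zone, name, amount = line.split(",")
    # Same effect as reduceByKey(lambda x, y: x + y) on ((zone, name), amount) pairs.
    totals[(zone, name)] += int(amount)

# Tuples compare field by field, so this sorts by zone, then by name.
rs = sorted(totals.items())
print(rs)
# [(('east', 'alice'), 35), (('east', 'bob'), 20), (('west', 'alice'), 50)]
```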
