Python aggregate operation examples

Original

2018-09-02 22:39

aggregate() is similar to reduce(), but it can return a result whose type differs from the element type. It accepts three parameters:

zeroValue: the initial value of the accumulated result within each partition (used by seqOp), and also the initial value when the results of different partitions are combined (used by combOp)

seqOp: an operator used to accumulate results within a partition

combOp: an associative operator used to combine results from different partitions
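
To make the roles of the three parameters concrete, here is a minimal pure-Python sketch of how they fit together. It does not use Spark: simulate_aggregate and the hand-made partition lists are assumptions for illustration only, modeled on the documented behavior of aggregate().

```python
from functools import reduce

def simulate_aggregate(partitions, zero_value, seq_op, comb_op):
    """Mimic RDD.aggregate on a list of lists (hypothetical helper, not a Spark API)."""
    # seqOp folds the elements of each partition, starting from zeroValue.
    per_partition = [reduce(seq_op, part, zero_value) for part in partitions]
    # combOp merges the per-partition results, again starting from zeroValue.
    return reduce(comb_op, per_partition, zero_value)

# Two hand-made partitions holding the numbers used in this post (assumed split).
partitions = [[12, 2, 6], [2, 12, 2]]

# The result type (sum, count) differs from the element type (int).
total, count = simulate_aggregate(
    partitions,
    (0, 0),
    lambda acc, x: (acc[0] + x, acc[1] + 1),
    lambda a, b: (a[0] + b[0], a[1] + b[1]),
)
print(total, count, total / count)
```

Carrying a (sum, count) pair through both operators is what lets a mean be computed in a single pass, something reduce() alone cannot do because its result must have the same type as the elements.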

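from pyspark import SparkContext
sc = SparkContext.getOrCreate()  # already available as sc in the pyspark shell

nums = sc.parallelize([12, 2, 6, 2, 12, 2])
nums.reduce(lambda x, y: x + y)
Result: 36

nums.aggregate(0, lambda x, y: x + y, lambda x, y: x + y)
Result: 36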

Source documentation: http://spark.apache.org/docs/2.2.1/api/python/pyspark#pyspark.RDD

# Accumulate a (sum, count) pair in one pass
nums.aggregate((0, 0), lambda x, y: (x[0] + y, x[1] + 1), lambda x, y: (x[0] + y[0], x[1] + y[1]))

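nums.repartition(1).aggregate(100, lambda x, y: x + y, lambda x, y: x + y)

Result: 236 (with a single partition, zeroValue is added once by seqOp and once by combOp: 100 + 36 + 100)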
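
The effect of a non-zero zeroValue can be checked by hand: it is added once per partition by seqOp and once more by combOp in the final merge. The sketch below is plain Python with a hypothetical helper, aggregate_sum; it assumes the elements are split into the partitions shown, whereas real Spark decides the split itself.

```python
from functools import reduce
from operator import add

def aggregate_sum(partitions, zero_value):
    """Sum with a zero value, following aggregate() semantics (hypothetical helper)."""
    # Each partition starts its running sum from zero_value.
    partials = [reduce(add, part, zero_value) for part in partitions]
    # The final merge starts from zero_value once more.
    return reduce(add, partials, zero_value)

nums_local = [12, 2, 6, 2, 12, 2]

one_partition = aggregate_sum([nums_local], 100)
two_partitions = aggregate_sum([nums_local[:3], nums_local[3:]], 100)

print(one_partition)   # 100 + 36 + 100
print(two_partitions)  # 2 * 100 + 36 + 100
```

In general the total is the plain sum plus zeroValue times (number of partitions + 1), which is why a neutral value such as 0 is normally used for sums.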

repartition redistributes the data across partitions so that the partitions are roughly equal in size.

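Why can the data become unbalanced? Data is assigned to partitions by a hash algorithm during the shuffle. (For example, with 3 partitions, every value whose hash leaves remainder 0 when divided by 3 goes to the same partition, so identical values always land together.)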
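
A hash partitioner can be sketched in a few lines. The example below is plain Python, not Spark code: assign_partitions is a hypothetical helper, and it assumes small non-negative integer keys, for which Python's built-in hash() returns the integer itself.

```python
from collections import defaultdict

def assign_partitions(keys, num_partitions):
    """Group keys by hash(key) % num_partitions (hypothetical helper)."""
    buckets = defaultdict(list)
    for key in keys:
        # Equal keys always produce the same index, so they share a partition.
        buckets[hash(key) % num_partitions].append(key)
    return dict(buckets)

keys = [12, 2, 6, 2, 12, 2]
layout = assign_partitions(keys, 3)
print(layout)
```

With these keys, indexes 0 and 2 each receive three elements and index 1 receives none, which illustrates how the distribution of key values, rather than the number of records, determines how full each partition is.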

RDD element retrieval operations

take(n): Return the first n elements of the RDD

top(num): Return the num largest elements of the RDD, in descending order

first(): Return the first element

collect(): Return all elements of the RDD as a list

foreach(func): Apply the provided function to each element of the RDD

takeSample(withReplacement, num, [seed]): Return a random sample of num elements as a list

RDD retrieval examples

nums = sc.parallelize([12, 2, 6, 2, 12, 2])
print (nums.top(3))
print (nums.take(2))
print (nums.first())

[12, 12, 6]

[12, 2]

12
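
For intuition, the same results can be reproduced on a local list. This is a sketch in plain Python rather than Spark; the mapping to heapq.nlargest, slicing, and random sampling is an analogy chosen for illustration, not a description of how Spark implements these actions.

```python
import heapq
import random

data = [12, 2, 6, 2, 12, 2]

top3 = heapq.nlargest(3, data)  # like top(3): largest elements, descending
first2 = data[:2]               # like take(2): first elements, in order
first = data[0]                 # like first()

# Like takeSample(False, 2, seed): sampling without replacement, seeded for repeatability.
sample = random.Random(42).sample(data, 2)

print(top3)
print(first2)
print(first)
print(len(sample))
```

The key difference to remember is that take() preserves the RDD's order while top() sorts, so the two only agree when the data happens to be in descending order already.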

# How much money did Tencent collect in total?
lines.map(lambda x: int(x.split(',')[2])).reduce(lambda x, y: x + y)

# How much did each person spend in each region in total? Sort by region descending, then by name ascending
rdd=lines.map(lambda x:((x.split(',')[0],x.split(',')[1]),int(x.split(',')[2]))).reduceByKey(lambda x,y:x+y)
rdd=rdd.collect()

class Reversinator1(object):  # custom ordering: region descending, name ascending within a region
    def __init__(self, obj):
        self.obj = obj
    def __lt__(self, other):
        if self.obj[0]==other.obj[0]:
            return self.obj[1]<other.obj[1]
        else:
            return self.obj[0]>other.obj[0]

rs=sorted(rdd,key=lambda x:Reversinator1(x[0]))
print(rs)
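
An alternative to a wrapper class, when the sort directions are mixed, is to rely on the stability of Python's sort: sort by the secondary key first, then by the primary key. The records below are made-up sample data in the ((region, name), amount) shape used above; they are an assumption, not the dataset from this post.

```python
# Hypothetical records shaped like the collected reduceByKey output.
records = [
    (("east", "bob"), 30),
    (("west", "amy"), 10),
    (("east", "amy"), 20),
    (("west", "bob"), 40),
]

# Pass 1: name ascending (secondary key).
by_name = sorted(records, key=lambda r: r[0][1])

# Pass 2: region descending (primary key). The sort is stable, even with
# reverse=True, so names stay in ascending order within each region.
result = sorted(by_name, key=lambda r: r[0][0], reverse=True)

for row in result:
    print(row)
```

Two passes cost a little more than one, but each key function stays a simple field lookup and no comparison class is needed.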
# How much did each person spend in each region? Sort by region descending; within the same region, sort by amount spent ascending
rdd=lines.map(lambda x:((x.split(',')[0],x.split(',')[1]),int(x.split(',')[2]))).reduceByKey(lambda x,y:x+y)
rdd=rdd.collect()

class Reversinator1(object):  # custom ordering: region descending, amount ascending within a region
    def __init__(self, obj):
        self.obj = obj
    def __lt__(self, other):
        if self.obj[0][0]==other.obj[0][0]:
            return self.obj[1]<other.obj[1]
        else:
            return self.obj[0][0]>other.obj[0][0]

rs=sorted(rdd,key=lambda x:Reversinator1(x))
print(rs)
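
The same ordering can also be written as an explicit comparison function and adapted with functools.cmp_to_key. This is a sketch under assumptions: the records are invented sample data, and compare is a hypothetical name introduced here.

```python
from functools import cmp_to_key

# Hypothetical ((region, name), amount) records.
records = [
    (("east", "bob"), 30),
    (("west", "amy"), 50),
    (("east", "amy"), 20),
    (("west", "bob"), 40),
]

def compare(a, b):
    """Region descending, then amount ascending (hypothetical comparator)."""
    region_a, region_b = a[0][0], b[0][0]
    if region_a != region_b:
        # A negative result means a sorts first: larger region names come first.
        return -1 if region_a > region_b else 1
    # Same region: the smaller amount sorts first.
    return a[1] - b[1]

result = sorted(records, key=cmp_to_key(compare))
print(result)
```

A comparator keeps the whole rule in one function, which can be easier to read than a class with __lt__ when there are several tie-breaking levels.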

# How much did each person spend in each region? Use a composite key
rdd=lines.map(lambda x:((x.split(',')[0],x.split(',')[1]),int(x.split(',')[2]))).reduceByKey(lambda x,y:x+y)
rdd=rdd.collect()
# Sort by region first, then by name within the same region, using Python's sorted()
rs=sorted(rdd)
print(rs)
# Descending order
rs=sorted(rdd,reverse=True)
print(rs)
class Reversinator1(object):  # inverts the comparison so the wrapped value sorts in reverse order
    def __init__(self, obj):
        self.obj = obj
    def __lt__(self, other):
        return other.obj < self.obj
class Reversinator2(object):  # inverts the comparison so the wrapped value sorts in reverse order
    def __init__(self, obj):
        self.obj = obj
    def __lt__(self, other):
        return other.obj < self.obj
rs=sorted(rs,key=lambda x:(x[0][0],Reversinator2(x[0][1])),reverse=True)
print(rs)
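
When a reversing wrapper is not the last field of a tuple key, it also needs __eq__: tuple comparison first looks for the first pair of unequal fields, and without __eq__ two wrappers holding the same value are never considered equal, so later fields are not used as tie-breakers. A minimal sketch, assuming a hypothetical Desc class and invented sample records:

```python
from functools import total_ordering

@total_ordering
class Desc:
    """Wrap a value so that it sorts in descending order (hypothetical helper)."""
    def __init__(self, value):
        self.value = value

    def __eq__(self, other):
        return self.value == other.value

    def __lt__(self, other):
        # Inverted on purpose: a larger wrapped value counts as "smaller".
        return other.value < self.value

# Hypothetical ((region, name), amount) records.
records = [
    (("east", "bob"), 30),
    (("west", "bob"), 40),
    (("east", "amy"), 20),
    (("west", "amy"), 10),
]

# Region descending, name ascending, in one pass and without reverse=True.
result = sorted(records, key=lambda r: (Desc(r[0][0]), r[0][1]))
print(result)
```

Wrapping only the field that should be reversed keeps the direction of every other field obvious from the key tuple itself.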
