Spark RDD *ByKey Operations

*ByKey Operations

These operations work on data in (key, value) form.

Operation            Description
sortByKey            sort the data by key
reduceByKey          merge the values for each key
reduceByKeyLocally   merge the values for each key and return the result to the master as a dictionary
sampleByKey          return a sampled subset of the RDD, stratified by key
subtractByKey        return the pairs whose keys do not appear in the other RDD
aggregateByKey       aggregate the values for each key
combineByKey         combine the values for each key with custom functions
countByKey           count the number of elements for each key
foldByKey            fold the values for each key
groupByKey           group the values for each key
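
The examples below assume a live SparkSession named spark and its SparkContext sc; a minimal setup sketch (the app name is arbitrary):

from pyspark.sql import SparkSession

# the later examples use `spark` and its SparkContext `sc`
spark = SparkSession.builder.appName("bykey-demo").getOrCreate()
sc = spark.sparkContext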

sortByKey

sortByKey(ascending=True, numPartitions=None, keyfunc=<function RDD.<lambda>>)
Example:

sc = spark.sparkContext
tmp = [('a', 1), ('b', 2), ('1', 3), ('d', 4), ('2', 5), ('-2', 'test')]
res = sc.parallelize(tmp).sortByKey(False)
print(res.collect())
res = sc.parallelize(tmp).sortByKey(True)
print(res.collect())

Output:

[('d', 4), ('b', 2), ('a', 1), ('2', 5), ('1', 3), ('-2', 'test')]
[('-2', 'test'), ('1', 3), ('2', 5), ('a', 1), ('b', 2), ('d', 4)]
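
The signature also accepts a keyfunc, applied to each key before comparison; a small sketch sorting case-insensitively (this example mirrors the one in the official PySpark docs):

tmp = [('Mary', 1), ('had', 2), ('a', 3), ('little', 4), ('lamb', 5)]
# keyfunc lowercases each key before comparing, so 'Mary' sorts last
print(sc.parallelize(tmp).sortByKey(True, keyfunc=lambda k: k.lower()).collect())

Output:

[('a', 3), ('had', 2), ('lamb', 5), ('little', 4), ('Mary', 1)]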

reduceByKey

Merge the values for each key using an associative and commutative reduce function.

Example:

from operator import add
rdd = sc.parallelize([("a", 1), ("b", 1), ("a", 1)])
print(sorted(rdd.reduceByKey(add).collect()))

Result:

[('a', 2), ('b', 1)]

reduceByKeyLocally(func)

Note: this returns the result to the master node.
Merge the values for each key using an associative and commutative reduce function, but return the results immediately to the master as a dictionary.

from operator import add  # same helper as in the reduceByKey example

rdd = sc.parallelize([("a", 1), ("b", 1), ("a", 1)])
print(rdd.reduceByKeyLocally(add))

{'a': 2, 'b': 1}

sampleByKey

sampleByKey(withReplacement, fractions, seed=None)

  • withReplacement — whether to sample with replacement
  • fractions — the sampling rate for each key
  • seed — seed for the random number generator

Return a subset of this RDD sampled by key (via stratified sampling). Create a sample of this RDD using variable sampling rates for different keys as specified by fractions, a key to sampling rate map.

fractions = {"a": 0.2, "b": 0.1}
rdd = sc.parallelize(fractions.keys()).cartesian(sc.parallelize(range(0, 10)))
sample = rdd.sampleByKey(True, fractions, 2)
print(sample.collect())
sample = rdd.sampleByKey(False, fractions, 2)
print(sample.collect())

[('a', 7), ('b', 1)]
[('a', 2), ('a', 9), ('b', 8)]

groupByKey

groupByKey(numPartitions=None, partitionFunc=<function portable_hash>)

Group the values for each key in the RDD into a single sequence. Hash-partitions the resulting RDD with numPartitions partitions.

rdd = sc.parallelize([("a", 1), ("b", 1), ("a", 1)])
print(sorted(rdd.groupByKey().mapValues(len).collect()))
print(sorted(rdd.groupByKey().mapValues(list).collect()))

[('a', 2), ('b', 1)]
[('a', [1, 1]), ('b', [1])]

An interesting demo from the official docs:

fractions = {"a": 0.2, "b": 0.1}
rdd = sc.parallelize(fractions.keys()).cartesian(sc.parallelize(range(0, 10)))
sample = dict(rdd.sampleByKey(False, fractions, 2).groupByKey().collect())

print(sample)
print(type(sample['a']))
print(list(sample['a']))

{'b': <pyspark.resultiterable.ResultIterable object at 0x1159ff890>, 'a': <pyspark.resultiterable.ResultIterable object at 0x1159ff2d0>}
<class 'pyspark.resultiterable.ResultIterable'>
[2, 9]
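
One note the original post omits (it comes from the Spark docs): if you are grouping only to aggregate afterwards, reduceByKey or aggregateByKey will perform much better, because values are combined map-side before the shuffle. A sketch of the same sum both ways:

from operator import add

rdd = sc.parallelize([("a", 1), ("b", 1), ("a", 1)])
# groupByKey ships every value across the shuffle, then sums on the reduce side
print(sorted(rdd.groupByKey().mapValues(sum).collect()))
# reduceByKey pre-combines within each partition, shuffling less data
print(sorted(rdd.reduceByKey(add).collect()))
# both print [('a', 2), ('b', 1)]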

subtractByKey(other, numPartitions=None)

Return each (key, value) pair in self that has no pair with matching key in other.

>>> x = sc.parallelize([("a", 1), ("b", 4), ("b", 5), ("a", 2)])
>>> y = sc.parallelize([("a", 3), ("c", None)])
>>> sorted(x.subtractByKey(y).collect())
[('b', 4), ('b', 5)]

aggregateByKey

aggregateByKey(zeroValue, seqFunc, combFunc, numPartitions=None, partitionFunc=<function portable_hash>)
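
The original section stops at the signature. Roughly: zeroValue is the initial accumulator created for each key, seqFunc folds a value into the accumulator within a partition, and combFunc merges accumulators across partitions; unlike reduceByKey, the accumulator type may differ from the value type. A minimal sketch (my own example, not from the original post) computing a per-key mean through (sum, count) pairs:

rdd = sc.parallelize([("a", 1), ("b", 1), ("a", 2)])

def seq_func(acc, v):
    # fold one value into the per-partition accumulator
    return (acc[0] + v, acc[1] + 1)

def comb_func(a, b):
    # merge two accumulators from different partitions
    return (a[0] + b[0], a[1] + b[1])

sums = rdd.aggregateByKey((0, 0), seq_func, comb_func)
print(sorted(sums.mapValues(lambda c: c[0] / c[1]).collect()))

Output:

[('a', 1.5), ('b', 1.0)]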

combineByKey

combineByKey(createCombiner, mergeValue, mergeCombiners, numPartitions=None, partitionFunc=<function portable_hash>)

Merge the values for each key using a set of user-defined functions.
Three functions are involved:

  • createCombiner — turns the first value V seen for a key into an accumulator C
  • mergeValue — merges a value V into an accumulator C within a partition
  • mergeCombiners — merges two accumulators C across partitions

x = sc.parallelize([("a", 1), ("b", 1), ("a", 2)])

def to_list(a):
    return [a]

def append(a, b):
    a.append(b)
    return a

def extend(a, b):
    a.extend(b)
    return a

res = sorted(x.combineByKey(to_list, append, extend).collect())
print(res)

[('a', [1, 2]), ('b', [1])]
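
A second sketch (my addition, not from the original post) where the accumulator type differs from the value type: a per-key average built from (sum, count) pairs:

x = sc.parallelize([("a", 1), ("b", 1), ("a", 2)])
avg = x.combineByKey(
    lambda v: (v, 1),                               # createCombiner: V -> (sum, count)
    lambda c, v: (c[0] + v, c[1] + 1),              # mergeValue: fold a V into a C
    lambda c1, c2: (c1[0] + c2[0], c1[1] + c2[1]),  # mergeCombiners: merge two Cs
).mapValues(lambda c: c[0] / c[1])
print(sorted(avg.collect()))

Output:

[('a', 1.5), ('b', 1.0)]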

countByKey

Counts the number of elements for each key and returns the result to the master node as a dictionary.

rdd = sc.parallelize([("a", 1), ("b", 1), ("a", 1)])
print(rdd.countByKey())

defaultdict(<class 'int'>, {'a': 2, 'b': 1})

foldByKey

foldByKey(zeroValue, func, numPartitions=None, partitionFunc=<function portable_hash>)

Folds the values for each key.

  • zeroValue — the neutral initial value folded into the values of each key
  • func — the function applied to the values

Merge the values for each key using an associative function "func" and a neutral "zeroValue" which may be added to the result an arbitrary number of times, and must not change the result (e.g., 0 for addition, or 1 for multiplication).

Example:

rdd = sc.parallelize([("a", 3), ("b", 100), ("a", 5)])
print(sorted(rdd.foldByKey(0, add).collect()))

[('a', 8), ('b', 100)]
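
To see why zeroValue must be neutral, the same fold with multiplication uses 1 instead of 0 (a small sketch of my own):

from operator import mul

rdd = sc.parallelize([("a", 3), ("b", 100), ("a", 5)])
# 1 is the neutral element for multiplication, as the docstring suggests
print(sorted(rdd.foldByKey(1, mul).collect()))

Output:

[('a', 15), ('b', 100)]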
