Spark RDD *ByKey Operations

*ByKey Operations

These operations work on data in (key, value) form.

Operation            Description
sortByKey            sort the data by key
reduceByKey          merge the values for each key
reduceByKeyLocally   merge the values for each key and return the result to the master as a dictionary
sampleByKey          return a sampled subset of the RDD, stratified by key
subtractByKey        return the pairs whose keys do not appear in the other RDD
aggregateByKey       aggregate the values for each key
combineByKey         combine the values for each key with custom functions
countByKey           count the number of elements for each key
foldByKey            fold the values for each key
groupByKey           group the values for each key
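
The examples below assume a live SparkSession named spark and its SparkContext sc; a minimal setup sketch (the app name is arbitrary):

from pyspark.sql import SparkSession

# the later examples use `spark` and its SparkContext `sc`
spark = SparkSession.builder.appName("bykey-demo").getOrCreate()
sc = spark.sparkContext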

sortByKey

sortByKey(ascending=True, numPartitions=None, keyfunc=<function RDD.<lambda>>)
Example:

sc = spark.sparkContext
tmp = [('a', 1), ('b', 2), ('1', 3), ('d', 4), ('2', 5), ('-2', 'test')]
res = sc.parallelize(tmp).sortByKey(False)
print(res.collect())
res = sc.parallelize(tmp).sortByKey(True)
print(res.collect())

Output:

[('d', 4), ('b', 2), ('a', 1), ('2', 5), ('1', 3), ('-2', 'test')]
[('-2', 'test'), ('1', 3), ('2', 5), ('a', 1), ('b', 2), ('d', 4)]
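
The signature also accepts a keyfunc, applied to each key before comparison; a small sketch sorting case-insensitively (this example mirrors the one in the official PySpark docs):

tmp = [('Mary', 1), ('had', 2), ('a', 3), ('little', 4), ('lamb', 5)]
# keyfunc lowercases each key before comparing, so 'Mary' sorts last
print(sc.parallelize(tmp).sortByKey(True, keyfunc=lambda k: k.lower()).collect())

Output:

[('a', 3), ('had', 2), ('lamb', 5), ('little', 4), ('Mary', 1)]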

reduceByKey

Merge the values for each key using an associative and commutative reduce function.

Example:

from operator import add
rdd = sc.parallelize([("a", 1), ("b", 1), ("a", 1)])
print(sorted(rdd.reduceByKey(add).collect()))

Result:

[('a', 2), ('b', 1)]

reduceByKeyLocally(func)

Note: this returns the result to the master node.
Merge the values for each key using an associative and commutative reduce function, but return the results immediately to the master as a dictionary.

from operator import add  # same helper as in the reduceByKey example

rdd = sc.parallelize([("a", 1), ("b", 1), ("a", 1)])
print(rdd.reduceByKeyLocally(add))

{'a': 2, 'b': 1}

sampleByKey

sampleByKey(withReplacement, fractions, seed=None)

  • withReplacement — whether to sample with replacement
  • fractions — the sampling rate for each key
  • seed — seed for the random number generator

Return a subset of this RDD sampled by key (via stratified sampling). Create a sample of this RDD using variable sampling rates for different keys as specified by fractions, a key to sampling rate map.

fractions = {"a": 0.2, "b": 0.1}
rdd = sc.parallelize(fractions.keys()).cartesian(sc.parallelize(range(0, 10)))
sample = rdd.sampleByKey(True, fractions, 2)
print(sample.collect())
sample = rdd.sampleByKey(False, fractions, 2)
print(sample.collect())

[('a', 7), ('b', 1)]
[('a', 2), ('a', 9), ('b', 8)]

groupByKey

groupByKey(numPartitions=None, partitionFunc=<function portable_hash>)

Group the values for each key in the RDD into a single sequence. Hash-partitions the resulting RDD with numPartitions partitions.

rdd = sc.parallelize([("a", 1), ("b", 1), ("a", 1)])
print(sorted(rdd.groupByKey().mapValues(len).collect()))
print(sorted(rdd.groupByKey().mapValues(list).collect()))

[('a', 2), ('b', 1)]
[('a', [1, 1]), ('b', [1])]

An interesting demo from the official docs:

fractions = {"a": 0.2, "b": 0.1}
rdd = sc.parallelize(fractions.keys()).cartesian(sc.parallelize(range(0, 10)))
sample = dict(rdd.sampleByKey(False, fractions, 2).groupByKey().collect())

print(sample)
print(type(sample['a']))
print(list(sample['a']))

{'b': <pyspark.resultiterable.ResultIterable object at 0x1159ff890>, 'a': <pyspark.resultiterable.ResultIterable object at 0x1159ff2d0>}
<class 'pyspark.resultiterable.ResultIterable'>
[2, 9]
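
One note the original post omits (it comes from the Spark docs): if you are grouping only to aggregate afterwards, reduceByKey or aggregateByKey will perform much better, because values are combined map-side before the shuffle. A sketch of the same sum both ways:

from operator import add

rdd = sc.parallelize([("a", 1), ("b", 1), ("a", 1)])
# groupByKey ships every value across the shuffle, then sums on the reduce side
print(sorted(rdd.groupByKey().mapValues(sum).collect()))
# reduceByKey pre-combines within each partition, shuffling less data
print(sorted(rdd.reduceByKey(add).collect()))
# both print [('a', 2), ('b', 1)]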

subtractByKey(other, numPartitions=None)

Return each (key, value) pair in self that has no pair with matching key in other.

>>> x = sc.parallelize([("a", 1), ("b", 4), ("b", 5), ("a", 2)])
>>> y = sc.parallelize([("a", 3), ("c", None)])
>>> sorted(x.subtractByKey(y).collect())
[('b', 4), ('b', 5)]

aggregateByKey

aggregateByKey(zeroValue, seqFunc, combFunc, numPartitions=None, partitionFunc=<function portable_hash>)
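
The original section stops at the signature. Roughly: zeroValue is the initial accumulator created for each key, seqFunc folds a value into the accumulator within a partition, and combFunc merges accumulators across partitions; unlike reduceByKey, the accumulator type may differ from the value type. A minimal sketch (my own example, not from the original post) computing a per-key mean through (sum, count) pairs:

rdd = sc.parallelize([("a", 1), ("b", 1), ("a", 2)])

def seq_func(acc, v):
    # fold one value into the per-partition accumulator
    return (acc[0] + v, acc[1] + 1)

def comb_func(a, b):
    # merge two accumulators from different partitions
    return (a[0] + b[0], a[1] + b[1])

sums = rdd.aggregateByKey((0, 0), seq_func, comb_func)
print(sorted(sums.mapValues(lambda c: c[0] / c[1]).collect()))

Output:

[('a', 1.5), ('b', 1.0)]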

combineByKey

combineByKey(createCombiner, mergeValue, mergeCombiners, numPartitions=None, partitionFunc=<function portable_hash>)

Merge the values for each key using a set of user-defined functions.
Three functions are involved:

  • createCombiner — turns the first value V seen for a key into an accumulator C
  • mergeValue — merges a value V into an accumulator C within a partition
  • mergeCombiners — merges two accumulators C across partitions

x = sc.parallelize([("a", 1), ("b", 1), ("a", 2)])

def to_list(a):
    return [a]

def append(a, b):
    a.append(b)
    return a

def extend(a, b):
    a.extend(b)
    return a

res = sorted(x.combineByKey(to_list, append, extend).collect())
print(res)

[('a', [1, 2]), ('b', [1])]
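
A second sketch (my addition, not from the original post) where the accumulator type differs from the value type: a per-key average built from (sum, count) pairs:

x = sc.parallelize([("a", 1), ("b", 1), ("a", 2)])
avg = x.combineByKey(
    lambda v: (v, 1),                               # createCombiner: V -> (sum, count)
    lambda c, v: (c[0] + v, c[1] + 1),              # mergeValue: fold a V into a C
    lambda c1, c2: (c1[0] + c2[0], c1[1] + c2[1]),  # mergeCombiners: merge two Cs
).mapValues(lambda c: c[0] / c[1])
print(sorted(avg.collect()))

Output:

[('a', 1.5), ('b', 1.0)]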

countByKey

Counts the number of elements for each key and returns the result to the master node as a dictionary.

rdd = sc.parallelize([("a", 1), ("b", 1), ("a", 1)])
print(rdd.countByKey())

defaultdict(<class 'int'>, {'a': 2, 'b': 1})

foldByKey

foldByKey(zeroValue, func, numPartitions=None, partitionFunc=<function portable_hash>)

Folds the values for each key.

  • zeroValue — the neutral initial value folded into the values of each key
  • func — the function applied to the values

Merge the values for each key using an associative function "func" and a neutral "zeroValue" which may be added to the result an arbitrary number of times, and must not change the result (e.g., 0 for addition, or 1 for multiplication).

Example:

rdd = sc.parallelize([("a", 3), ("b", 100), ("a", 5)])
print(sorted(rdd.foldByKey(0, add).collect()))

[('a', 8), ('b', 100)]
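
To see why zeroValue must be neutral, the same fold with multiplication uses 1 instead of 0 (a small sketch of my own):

from operator import mul

rdd = sc.parallelize([("a", 3), ("b", 100), ("a", 5)])
# 1 is the neutral element for multiplication, as the docstring suggests
print(sorted(rdd.foldByKey(1, mul).collect()))

Output:

[('a', 15), ('b', 100)]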
