All pair-RDD transformation operators are listed below:
mapValues, flatMapValues, sortByKey, combineByKey, foldByKey, groupByKey, reduceByKey, aggregateByKey, cogroup, join, leftOuterJoin, rightOuterJoin
Of course, pair RDDs can also use all the ordinary RDD transformation operators; for details see: https://blog.csdn.net/qq_23146763/article/details/100988127
Detailed explanations and examples
1. mapValues
Applies a map to each value in the pair RDD without changing the keys
import org.apache.spark.{SparkConf, SparkContext}

val sparkConf = new SparkConf().setAppName("transformations examples").setMaster("local[*]")
val sc = new SparkContext(sparkConf)
val list = List(("zhangsan", 22), ("lisi", 20), ("wangwu", 23))
val rdd = sc.parallelize(list)
val mapValuesRDD = rdd.mapValues(_ + 2)
mapValuesRDD.foreach(println)
sc.stop()
2. flatMapValues
Applies a flatMap to each value in the pair RDD without changing the keys
val sparkConf = new SparkConf().setAppName("transformations examples").setMaster("local[*]")
val sc = new SparkContext(sparkConf)
val list = List(("zhangsan", "GD SZ"), ("lisi", "HN YY"), ("wangwu", "JS NJ"))
val rdd = sc.parallelize(list)
val flatMapValuesRDD = rdd.flatMapValues(v => v.split(" "))
flatMapValuesRDD.foreach(println)
sc.stop()
3. sortByKey
1. If the key type has an ordering defined, returns an RDD of (K, V) pairs sorted by key; ascending = true means ascending order, false means descending; numPartitions sets the number of partitions to raise job parallelism
2. Note that foreach prints partitions in a nondeterministic order, so the printed output only appears fully sorted when the parallelism is set to 1; sortByKey itself always produces a globally sorted, range-partitioned RDD
val sparkConf = new SparkConf().setAppName("transformations examples").setMaster("local[*]")
val sc = new SparkContext(sparkConf)
val arr = List((1, 20), (1, 10), (2, 20), (2, 10), (3, 20), (3, 10))
val rdd = sc.parallelize(arr)
val sortByKeyRDD = rdd.sortByKey(true, 1)
sortByKeyRDD.foreach(println)
sc.stop()
4. combineByKey
Combines the values that share the same key, allowing the result type to differ from the value type
combineByKey(createCombiner, mergeValue, mergeCombiners, partitioner, mapSideCombine)
createCombiner: the combiner-creation function, invoked the first time a key is seen; converts a V-type value from the RDD into a C-type combiner (V => C)
Note: this happens the first time each key appears within each partition
mergeValue: the merge-value function (runs independently within each partition); when the same key is seen again, merges the C-type value produced by createCombiner with the newly arrived V-type value into a C-type value ((C, V) => C)
mergeCombiners: the merge-combiners function (merges the per-partition results for each key); merges pairs of C-type values into a single C-type value ((C, C) => C)
partitioner: an existing or custom partitioner; defaults to HashPartitioner
mapSideCombine: whether to combine on the map side; defaults to true
val sparkConf = new SparkConf().setAppName("transformations examples").setMaster("local[*]")
val sc = new SparkContext(sparkConf)
// simple aggregation: sum the values per key
val input = sc.parallelize(List(("coffee", 1), ("coffee", 2), ("panda", 4), ("panda", 5)))
val result = input.combineByKey(
(v) => (v),
(acc: (Int), v) => (acc + v),
(acc1: (Int), acc2: (Int)) => (acc1 + acc2)
)
result.foreach(println)
// compute the average per key
val result2 = input.combineByKey(
(v) => (v, 1),
(acc: (Int, Int), v) => (acc._1 + v, acc._2 + 1),
(acc1: (Int, Int), acc2: (Int, Int)) => (acc1._1 + acc2._1, acc1._2 + acc2._2)
).map { case (key, value) => (key, value._1 / value._2.toFloat) }
result2.collectAsMap().map(println)
sc.stop()
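To make the average computation concrete, here is a plain-Scala sketch with local collections (not the Spark API) of what combineByKey does on a single partition: foldLeft plays the combined role of createCombiner (the (0, 0) seed) and mergeValue.

```scala
// Local emulation of the average-per-key combineByKey example above.
val input = List(("coffee", 1), ("coffee", 2), ("panda", 4), ("panda", 5))

val avgs: Map[String, Float] = input
  .groupBy(_._1)                                   // gather each key's pairs
  .map { case (key, pairs) =>
    // accumulate (sum, count), mirroring (v) => (v, 1) and the mergeValue step
    val (sum, count) = pairs.map(_._2)
      .foldLeft((0, 0)) { case ((s, c), v) => (s + v, c + 1) }
    (key, sum / count.toFloat)
  }

println(avgs("coffee")) // 1.5
println(avgs("panda"))  // 4.5
```

With several partitions, Spark would additionally run mergeCombiners to add up the per-partition (sum, count) pairs before dividing.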
5. foldByKey
foldByKey, groupByKey, and reduceByKey are all ultimately implemented by calling combineByKey
Effect: folds the values of each key starting from zeroValue, then aggregates by key
zeroValue: initializes the fold for each key; internally this is combineByKey's createCombiner, which maps V => func(zeroValue, V). Note that it is applied once per key per partition, not once per value; in the example below, the first value of each key (within each partition) can be seen as V => 2 + V
func: values are merged by key through func (internally this is combineByKey's mergeValue and mergeCombiners, which here happen to be the same function)
val sparkConf = new SparkConf().setAppName("transformations examples").setMaster("local[*]")
val sc = new SparkContext(sparkConf)
val people = List(("Mobin", 2), ("Mobin", 1), ("Lucy", 2), ("Amy", 1), ("Lucy", 3))
val rdd = sc.parallelize(people)
val foldByKeyRDD = rdd.foldByKey(2)((_ + _))
foldByKeyRDD.foreach(println)
sc.stop()
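As a plain-Scala sketch (not the Spark API) of foldByKey(2)(_ + _), assuming all of a key's values sit in a single partition so zeroValue = 2 is applied exactly once per key; with several partitions, Spark applies zeroValue once per key per partition, so the totals can come out larger.

```scala
// Local emulation of foldByKey(2)(_ + _) on a single "partition".
val people = List(("Mobin", 2), ("Mobin", 1), ("Lucy", 2), ("Amy", 1), ("Lucy", 3))

val folded: Map[String, Int] = people
  .groupBy(_._1)                                          // gather values by key
  .map { case (key, pairs) =>
    (key, pairs.map(_._2).foldLeft(2)(_ + _))             // seed each key's fold with 2
  }

println(folded("Mobin")) // 5  (= 2 + 2 + 1)
println(folded("Lucy"))  // 7  (= 2 + 2 + 3)
println(folded("Amy"))   // 3  (= 2 + 1)
```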
6. groupByKey
Groups by key, returning an RDD of (K, Iterable[V]); numPartitions sets the number of partitions to raise job parallelism
val sparkConf = new SparkConf().setAppName("transformations examples").setMaster("local[*]")
val sc = new SparkContext(sparkConf)
val seq = Seq(("a", 1), ("a", 2), ("a", 3), ("b", 1), ("b", 2))
val rdd = sc.parallelize(seq)
val groupRDD = rdd.groupByKey(3)
groupRDD.foreach(println)
sc.stop()
7. reduceByKey
Groups by key and aggregates the values with the given func; numPartitions sets the number of partitions to raise job parallelism
val sparkConf = new SparkConf().setAppName("transformations examples").setMaster("local[*]")
val sc = new SparkContext(sparkConf)
val seq = Seq(("a", 1), ("a", 2), ("a", 3), ("b", 1), ("b", 2))
val rdd = sc.parallelize(seq)
val reduceRdd = rdd.reduceByKey((v1, v2) => v1 + v2, 3)
reduceRdd.foreach(println)
sc.stop()
8. aggregateByKey
1. Concept
Aggregates the values sharing the same key in a pair RDD, likewise using a neutral initial value during the aggregation.
2. Differences from aggregate
1. aggregate folds all elements of each partition, and then the partition results, down to a single value; aggregateByKey aggregates the values of each key and returns a pair RDD
2. in both cases zeroValue is applied once per partition (once per key per partition for aggregateByKey), so when zeroValue is not an identity element for the functions, the result can depend on how the data is partitioned
3. Example walkthrough
1. Grouping by key, seq compares each value within a partition against the running result (initialized to zeroValue = 3) and keeps the smaller one; comb then sums the per-partition results by key, so each key finally yields one output pair
val sparkConf = new SparkConf().setAppName("transformations examples").setMaster("local[*]")
val sc = new SparkContext(sparkConf)
val rdd = sc.parallelize(Seq((1, 2), (1, 3), (1, 4), (2, 5)))
def seq(a: Int, b: Int): Int = {
println("seq: " + a + "\t " + b)
math.min(a, b)
}
def comb(a: Int, b: Int): Int = {
println("comb: " + a + "\t " + b)
a + b
}
rdd.aggregateByKey(3, 2)(seq, comb).foreach(println)
sc.stop()
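The example above can be traced by hand with plain Scala collections (not the Spark API), assuming Spark splits the four-element sequence into the two partitions shown below:

```scala
// Local walkthrough of aggregateByKey(3, 2)(seq, comb) with seq = min, comb = sum.
val partitions = List(
  List((1, 2), (1, 3)), // partition 0
  List((1, 4), (2, 5))  // partition 1
)

// within each partition: fold each key's values with min, seeded by zeroValue = 3
val perPartition: List[Map[Int, Int]] = partitions.map { part =>
  part.groupBy(_._1).map { case (k, vs) =>
    (k, vs.map(_._2).foldLeft(3)(math.min))
  }
}

// across partitions: merge the per-key partial results with comb = _ + _
val combined: Map[Int, Int] = perPartition
  .flatten
  .groupBy(_._1)
  .map { case (k, vs) => (k, vs.map(_._2).sum) }

println(combined(1)) // 5  (min(3,2,3) + min(3,4) = 2 + 3)
println(combined(2)) // 3  (min(3,5) = 3)
```

Note that with a single partition the result for key 1 would instead be min(3, 2, 3, 4) = 2, which illustrates the partitioning caveat above.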
9. cogroup
For two RDDs (e.g. (K, V) and (K, W)), first groups each RDD's elements by key separately, then returns an RDD of (K, (Iterable[V], Iterable[W])); numPartitions sets the number of partitions to raise job parallelism
val sparkConf = new SparkConf().setAppName("transformations examples").setMaster("local[*]")
val sc = new SparkContext(sparkConf)
val arr1 = List(("A", 1), ("B", 2), ("A", 2), ("B", 3))
val arr2 = List(("A", "A1"), ("B", "B1"), ("A", "A2"), ("B", "B2"))
val rdd1 = sc.parallelize(arr1, 3)
val rdd2 = sc.parallelize(arr2, 3)
val cogroupRDD = rdd1.cogroup(rdd2)
cogroupRDD.foreach(println)
sc.stop()
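A plain-Scala sketch (not the Spark API) of what the cogroup result contains for this input: each key maps to the pair of value collections gathered from the two sides, with an empty collection when a key is absent from one side.

```scala
// Local emulation of cogroup on the two lists from the example above.
val arr1 = List(("A", 1), ("B", 2), ("A", 2), ("B", 3))
val arr2 = List(("A", "A1"), ("B", "B1"), ("A", "A2"), ("B", "B2"))

val keys = (arr1.map(_._1) ++ arr2.map(_._1)).distinct
val cogrouped: Map[String, (List[Int], List[String])] = keys.map { k =>
  (k, (arr1.filter(_._1 == k).map(_._2),   // values for k from the first list
       arr2.filter(_._1 == k).map(_._2)))  // values for k from the second list
}.toMap

println(cogrouped("A")) // (List(1, 2),List(A1, A2))
println(cogrouped("B")) // (List(2, 3),List(B1, B2))
```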
10. join
Inner join. Only keys present in both pair RDDs are output. When a key has multiple values in either input, the resulting pair RDD contains every combination of the corresponding values from the two inputs
val sparkConf = new SparkConf().setAppName("transformations examples").setMaster("local[*]")
val sc = new SparkContext(sparkConf)
val arr1 = List(("A", 1), ("B", 2), ("A", 2), ("B", 3))
val arr2 = List(("A", "A1"), ("B", "B1"), ("A", "A2"), ("B", "B2"))
val rdd1 = sc.parallelize(arr1, 3)
val rdd2 = sc.parallelize(arr2, 3)
val joinRDD = rdd1.join(rdd2)
joinRDD.foreach(println)
sc.stop()
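The "every combination of corresponding values" behavior can be sketched with a plain-Scala for-comprehension (not the Spark API):

```scala
// Local emulation of the inner join above: keep only matching keys,
// emitting one pair per combination of left and right values.
val arr1 = List(("A", 1), ("B", 2), ("A", 2), ("B", 3))
val arr2 = List(("A", "A1"), ("B", "B1"), ("A", "A2"), ("B", "B2"))

val joined: List[(String, (Int, String))] = for {
  (k1, v) <- arr1
  (k2, w) <- arr2
  if k1 == k2
} yield (k1, (v, w))

println(joined.size)               // 8: each key has 2 x 2 value combinations
println(joined.count(_._1 == "A")) // 4
```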
11. leftOuterJoin
Left outer join. Every key of the source pair RDD has a record in the result; the value from the second RDD is wrapped in an Option, and is None when the second RDD has no record for that key
val sparkConf = new SparkConf().setAppName("transformations examples").setMaster("local[*]")
val sc = new SparkContext(sparkConf)
val arr1 = List(("A", 1), ("B", 2), ("A", 2), ("B", 3))
val arr2 = List(("A", "A1"), ("C", "C1"), ("A", "A2"), ("C", "C2"))
val rdd1 = sc.parallelize(arr1, 3)
val rdd2 = sc.parallelize(arr2, 3)
val leftOuterJoinRDD = rdd1.leftOuterJoin(rdd2)
leftOuterJoinRDD.foreach(println)
sc.stop()
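The Option semantics can be sketched with plain Scala collections (not the Spark API); since key "B" exists only on the left, both of its records pair with None.

```scala
// Local emulation of leftOuterJoin: every left record survives, with the
// right-hand value wrapped in Some(...) when matched and None otherwise.
val arr1 = List(("A", 1), ("B", 2), ("A", 2), ("B", 3))
val arr2 = List(("A", "A1"), ("C", "C1"), ("A", "A2"), ("C", "C2"))

val left: List[(String, (Int, Option[String]))] = arr1.flatMap { case (k, v) =>
  val matches = arr2.filter(_._1 == k).map(_._2)
  if (matches.isEmpty) List((k, (v, None: Option[String])))
  else matches.map(w => (k, (v, Some(w))))
}

println(left.count(_._2._2.isEmpty)) // 2: both "B" records pair with None
println(left.count(_._1 == "A"))     // 4: 2 left values x 2 right values
```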
12. rightOuterJoin
Right outer join. Every key of the second pair RDD has a record in the result; the value from the source RDD is wrapped in an Option, and is None when the source RDD has no record for that key
val sparkConf = new SparkConf().setAppName("transformations examples").setMaster("local[*]")
val sc = new SparkContext(sparkConf)
val arr1 = List(("A", 1), ("B", 2), ("A", 2), ("B", 3))
val arr2 = List(("A", "A1"), ("C", "C1"), ("A", "A2"), ("C", "C2"))
val rdd1 = sc.parallelize(arr1, 3)
val rdd2 = sc.parallelize(arr2, 3)
val rightOuterJoinRDD = rdd1.rightOuterJoin(rdd2)
rightOuterJoinRDD.foreach(println)
sc.stop()