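When working with key-value pair RDDs in Spark, two of the most commonly used methods are reduceByKey and groupByKey. Below we read through the source code to see how each is used and how it works internally.
Official source and comments: the three forms of reduceByKey
Taken together, the doc comments on the three overloads below say roughly the following:
Using the function supplied by the caller, all values belonging to each key K of the (K, V) pairs are merged (what the merge actually does depends on that function), and the merge is first performed locally on the mapper side before the results are sent to the reducer nodes. Each overload then extends this meaning according to the parameters it takes, as explained below: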
/**
 * Merge the values for each key using an associative and commutative reduce function. This will
 * also perform the merging locally on each mapper before sending results to a reducer, similarly
 * to a "combiner" in MapReduce.
 * Takes a partitioner and partitions the output according to it
 */
def reduceByKey(partitioner: Partitioner, func: (V, V) => V): RDD[(K, V)] = self.withScope {
  combineByKeyWithClassTag[V]((v: V) => v, func, func, partitioner)
}
/**
 * Merge the values for each key using an associative and commutative reduce function. This will
 * also perform the merging locally on each mapper before sending results to a reducer, similarly
 * to a "combiner" in MapReduce. Output will be hash-partitioned with numPartitions partitions.
 * Sets a new number of partitions
 */
def reduceByKey(func: (V, V) => V, numPartitions: Int): RDD[(K, V)] = self.withScope {
  reduceByKey(new HashPartitioner(numPartitions), func)
}
/**
 * Merge the values for each key using an associative and commutative reduce function. This will
 * also perform the merging locally on each mapper before sending results to a reducer, similarly
 * to a "combiner" in MapReduce. Output will be hash-partitioned with the existing partitioner/
 * parallelism level.
 * Uses the default partitioner
 */
def reduceByKey(func: (V, V) => V): RDD[(K, V)] = self.withScope {
  reduceByKey(defaultPartitioner(self), func)
}
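As a rough illustration of how the three overloads are called, the sketch below runs each one on a small pair RDD and reports the result together with the number of output partitions. It assumes Spark is on the classpath and uses a local master; the object name `ReduceByKeyOverloads`, the sample data, and the partition counts are made up for the example.

```scala
import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}

object ReduceByKeyOverloads {
  // Returns (sorted result, number of partitions) for each of the three overloads.
  def run(): Seq[(Seq[(String, Int)], Int)] = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("reduceByKey-overloads")
    val sc = new SparkContext(conf)
    try {
      val pairs = sc.parallelize(Seq(("a", 1), ("a", 2), ("b", 3), ("c", 4)), 2)
      val add = (x: Int, y: Int) => x + y

      // 1. Explicit partitioner: the caller decides how keys map to partitions.
      val withPartitioner = pairs.reduceByKey(new HashPartitioner(3), add)
      // 2. Partition count only: wrapped in a HashPartitioner internally.
      val withNumPartitions = pairs.reduceByKey(add, 4)
      // 3. Function only: falls back to defaultPartitioner(self).
      val withDefault = pairs.reduceByKey(add)

      Seq(withPartitioner, withNumPartitions, withDefault).map { rdd =>
        (rdd.collect().sorted.toSeq, rdd.getNumPartitions)
      }
    } finally sc.stop()
  }

  def main(args: Array[String]): Unit = run().foreach(println)
}
```

All three calls compute the same sums per key; only the partitioning of the output differs.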
Reading on, the main logic of reduceByKey lives in the call combineByKeyWithClassTag[V]((v: V) => v, func, func, partitioner). Its source is:
def combineByKeyWithClassTag[C](
    createCombiner: V => C, // turns a V into a C
    mergeValue: (C, V) => C, // merges a V into an existing C
    mergeCombiners: (C, C) => C, // merges two Cs into one
    partitioner: Partitioner,
    mapSideCombine: Boolean = true,
    serializer: Serializer = null)(implicit ct: ClassTag[C]): RDD[(K, C)] = self.withScope {
  require(mergeCombiners != null, "mergeCombiners must be defined") // required as of Spark 0.9.0
  // Array keys are rejected when map-side combining is enabled; an exception is thrown
  if (keyClass.isArray) {
    if (mapSideCombine) {
      throw new SparkException("Cannot use map-side combining with array keys.")
    }
    // A HashPartitioner cannot be applied to array keys either
    if (partitioner.isInstanceOf[HashPartitioner]) {
      throw new SparkException("HashPartitioner cannot partition array keys.")
    }
  }
  val aggregator = new Aggregator[K, V, C](
    self.context.clean(createCombiner),
    self.context.clean(mergeValue),
    self.context.clean(mergeCombiners))
  // If the RDD is already partitioned by the given partitioner, combine inside each partition without a shuffle
  if (self.partitioner == Some(partitioner)) {
    self.mapPartitions(iter => {
      val context = TaskContext.get()
      new InterruptibleIterator(context, aggregator.combineValuesByKey(iter, context))
    }, preservesPartitioning = true)
  } else {
    // Otherwise return a new ShuffledRDD
    new ShuffledRDD[K, V, C](self, partitioner)
      .setSerializer(serializer)
      .setAggregator(aggregator)
      .setMapSideCombine(mapSideCombine)
  }
}
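To make the roles of `createCombiner`, `mergeValue`, and `mergeCombiners` more concrete, here is a plain-Scala sketch, with no Spark dependency, that mimics the two phases on in-memory collections. The object `CombineByKeySketch`, its helper names, and the hand-made "partitions" are illustrative assumptions; this is not how Spark's `Aggregator` is implemented internally, only a model of what the three functions are asked to do.

```scala
object CombineByKeySketch {
  // Phase 1: combine the records of one partition, as map-side combining would.
  def combinePartition[K, V, C](
      records: Seq[(K, V)],
      createCombiner: V => C,
      mergeValue: (C, V) => C): Map[K, C] =
    records.foldLeft(Map.empty[K, C]) { case (acc, (k, v)) =>
      acc.get(k) match {
        case Some(c) => acc.updated(k, mergeValue(c, v))  // key seen before: fold V into its C
        case None    => acc.updated(k, createCombiner(v)) // first value for this key
      }
    }

  // Phase 2: merge the per-partition combiners, as happens after the shuffle.
  def mergePartitions[K, C](
      partials: Seq[Map[K, C]],
      mergeCombiners: (C, C) => C): Map[K, C] =
    partials.flatten.foldLeft(Map.empty[K, C]) { case (acc, (k, c)) =>
      acc.updated(k, acc.get(k).map(mergeCombiners(_, c)).getOrElse(c))
    }

  // reduceByKey(func) corresponds to createCombiner = identity and
  // mergeValue = mergeCombiners = func.
  def reduceByKey[K, V](partitions: Seq[Seq[(K, V)]], func: (V, V) => V): Map[K, V] =
    mergePartitions(partitions.map(p => combinePartition[K, V, V](p, v => v, func)), func)

  val samplePartitions: Seq[Seq[(String, Int)]] = Seq(
    Seq(("a", 1), ("b", 1), ("a", 2)),
    Seq(("a", 3), ("b", 2), ("c", 1)))

  def main(args: Array[String]): Unit =
    println(reduceByKey[String, Int](samplePartitions, _ + _))
}
```

Because `func` is applied both inside a partition and across partitions, in whatever order records happen to arrive, it has to be associative and commutative, which is exactly what the doc comment on reduceByKey requires.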
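Official source and comments: the three forms of groupByKey
The three methods differ only in the parameters they take; the functionality they provide is the same. To control how the result is partitioned, use the overload that takes a partitioner; to set a new number of partitions, use the overload that takes a partition count; to go with the defaults, use the no-argument overload.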
/**
 * Group the values for each key in the RDD into a single sequence. Hash-partitions the
 * resulting RDD with the existing partitioner/parallelism level. The ordering of elements
 * within each group is not guaranteed, and may even differ each time the resulting RDD is
 * evaluated.
 *
 * @note This operation may be very expensive. If you are grouping in order to perform an
 * aggregation (such as a sum or average) over each key, using `PairRDDFunctions.aggregateByKey`
 * or `PairRDDFunctions.reduceByKey` will provide much better performance.
 * The overload that uses the default settings
 */
def groupByKey(): RDD[(K, Iterable[V])] = self.withScope {
  groupByKey(defaultPartitioner(self))
}
/**
 * Group the values for each key in the RDD into a single sequence. Allows controlling the
 * partitioning of the resulting key-value pair RDD by passing a Partitioner.
 * The ordering of elements within each group is not guaranteed, and may even differ
 * each time the resulting RDD is evaluated.
 *
 * @note This operation may be very expensive. If you are grouping in order to perform an
 * aggregation (such as a sum or average) over each key, using `PairRDDFunctions.aggregateByKey`
 * or `PairRDDFunctions.reduceByKey` will provide much better performance.
 *
 * @note As currently implemented, groupByKey must be able to hold all the key-value pairs for any
 * key in memory. If a key has too many values, it can result in an [[OutOfMemoryError]].
 * The overload that takes a partitioner
 */
def groupByKey(partitioner: Partitioner): RDD[(K, Iterable[V])] = self.withScope {
  // groupByKey shouldn't use map side combine because map side combine does not
  // reduce the amount of data shuffled and requires all map side data be inserted
  // into a hash table, leading to more objects in the old gen.
  val createCombiner = (v: V) => CompactBuffer(v)
  val mergeValue = (buf: CompactBuffer[V], v: V) => buf += v
  val mergeCombiners = (c1: CompactBuffer[V], c2: CompactBuffer[V]) => c1 ++= c2
  val bufs = combineByKeyWithClassTag[CompactBuffer[V]](
    createCombiner, mergeValue, mergeCombiners, partitioner, mapSideCombine = false)
  bufs.asInstanceOf[RDD[(K, Iterable[V])]]
}
/**
 * Group the values for each key in the RDD into a single sequence. Hash-partitions the
 * resulting RDD with into `numPartitions` partitions. The ordering of elements within
 * each group is not guaranteed, and may even differ each time the resulting RDD is evaluated.
 *
 * @note This operation may be very expensive. If you are grouping in order to perform an
 * aggregation (such as a sum or average) over each key, using `PairRDDFunctions.aggregateByKey`
 * or `PairRDDFunctions.reduceByKey` will provide much better performance.
 *
 * @note As currently implemented, groupByKey must be able to hold all the key-value pairs for any
 * key in memory. If a key has too many values, it can result in an [[OutOfMemoryError]].
 * The overload that takes a number of partitions
 */
def groupByKey(numPartitions: Int): RDD[(K, Iterable[V])] = self.withScope {
  groupByKey(new HashPartitioner(numPartitions))
}
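`CompactBuffer` is private to Spark, but the wiring shown in groupByKey(partitioner) can be approximated through the public `combineByKey` method. The sketch below assumes a local Spark setup and substitutes `ArrayBuffer` for `CompactBuffer`; it builds a groupByKey-like result by hand and returns it next to the built-in one so the two can be compared. The object name `GroupByKeyByHand` and the sample data are made up for the example.

```scala
import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}
import scala.collection.mutable.ArrayBuffer

object GroupByKeyByHand {
  // Returns (hand-built groups, built-in groupByKey groups); values are sorted
  // because the order inside a group is not guaranteed.
  def run(): (Map[String, Seq[Int]], Map[String, Seq[Int]]) = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("groupByKey-by-hand")
    val sc = new SparkContext(conf)
    try {
      val pairs = sc.parallelize(Seq(("a", 1), ("a", 3), ("b", 1), ("c", 1), ("c", 3)), 2)
      val partitioner = new HashPartitioner(2)

      val byHand = pairs.combineByKey[ArrayBuffer[Int]](
        (v: Int) => ArrayBuffer(v),                                // createCombiner
        (buf: ArrayBuffer[Int], v: Int) => buf += v,               // mergeValue
        (b1: ArrayBuffer[Int], b2: ArrayBuffer[Int]) => b1 ++= b2, // mergeCombiners
        partitioner,
        mapSideCombine = false) // same choice as groupByKey: buffering map-side saves no traffic

      val builtIn = pairs.groupByKey(partitioner)

      (byHand.mapValues(_.sorted.toSeq).collect().toMap,
       builtIn.mapValues(_.toSeq.sorted).collect().toMap)
    } finally sc.stop()
  }

  def main(args: Array[String]): Unit = println(run())
}
```

Note that the combiner here only appends values to a buffer, so it never shrinks the data; that is why the source turns map-side combining off for groupByKey.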
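groupByKey gathers all values that share a key into a single sequence. For example, given (a,1),(a,3),(b,1),(c,1),(c,3), the grouped result is (a,[1,3]),(b,[1]),(c,[1,3]). The order of elements within each group is not guaranteed; the values for key a could just as well come back as [3,1].
groupByKey is a comparatively expensive operation, meaning that it consumes a lot of resources. If the goal is to sum or average the values of each key after grouping, use reduceByKey or aggregateByKey from PairRDDFunctions instead; both offer much better performance.
As currently implemented, groupByKey must hold all the key-value pairs of any single key in memory, so a key with too many values can cause an OutOfMemoryError. This deserves particular attention.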
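For the averaging case mentioned above, one possible shape of the aggregateByKey alternative is sketched here. It assumes a local Spark setup; the object name `AverageByKey`, the sample scores, and the (sum, count) accumulator are illustrative choices rather than anything prescribed by the source.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object AverageByKey {
  def run(): Map[String, Double] = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("average-by-key")
    val sc = new SparkContext(conf)
    try {
      val scores = sc.parallelize(
        Seq(("a", 1.0), ("a", 3.0), ("b", 4.0), ("c", 2.0), ("c", 6.0)), 2)

      // The accumulator is a (sum, count) pair, so only two numbers per key
      // are kept and shuffled instead of every individual value.
      val zero = (0.0, 0L)
      val sumAndCount = scores.aggregateByKey(zero)(
        (acc: (Double, Long), v: Double) => (acc._1 + v, acc._2 + 1L),         // fold a value in
        (x: (Double, Long), y: (Double, Long)) => (x._1 + y._1, x._2 + y._2))  // merge accumulators

      sumAndCount.mapValues { case (sum, count) => sum / count }.collect().toMap
    } finally sc.stop()
  }

  def main(args: Array[String]): Unit = println(run())
}
```

A plain reduceByKey cannot average directly, because an average of averages is not the overall average; carrying the count alongside the sum avoids that problem.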
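reduceByKey compared with groupByKey
- Different return types: reduceByKey returns RDD[(K, V)], whereas groupByKey returns RDD[(K, Iterable[V])]. As an example, take an RDD containing (a,1),(a,2),(a,3),(b,1),(b,2),(c,1) and sum per key with each method. The intermediate result produced by reduceByKey is (a,6),(b,3),(c,1), while the one produced by groupByKey is (a,[1,2,3]),(b,[1,2]),(c,[1]) (both being the intermediate result within a single partition). The groupByKey result clearly takes more resources.
- Different purposes: reduceByKey reduces the values of each key with an associative and commutative function such as a sum or an XOR; groupByKey mainly groups, although an aggregation can still be applied to the groups afterwards.
- Different handling of each key's values on the map side: reduceByKey combines them locally before the shuffle, whereas groupByKey disables map-side combining (mapSideCombine = false) and ships every value unchanged.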
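Word count as an illustration of the difference
// `sc` is the SparkContext that spark-shell provides
val words = Array("a", "a", "a", "b", "b", "b")
val wordPairsRDD = sc.parallelize(words).map(word => (word, 1))
// reduceByKey: partial sums are computed per partition before the shuffle
val wordCountsWithReduce = wordPairsRDD.reduceByKey(_ + _)
// groupByKey: every (word, 1) pair is shuffled, then summed per key
val wordCountsWithGroup = wordPairsRDD.groupByKey().map(t => (t._1, t._2.sum))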
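Both approaches give the same result, but the computation differs considerably. A comparison diagram borrowed from the web illustrates this:
- reduceByKey sums the values of each key within every partition before any data is moved, and then reduces again across partitions to combine each key's partial results. The diagram shows the whole process.
- groupByKey moves every key-value pair of every partition across the network and only then computes the sums. This makes traffic between cluster nodes heavy and transfers inefficient, and because all values of a key then have to be held in memory on one node, it is also what leads to the out-of-memory errors mentioned above.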
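The difference in data movement can be approximated without a cluster. The plain-Scala sketch below models three hand-made "partitions" of (word, 1) pairs and counts how many records would cross the shuffle boundary with and without map-side combining. The object `ShuffleVolumeSketch` and its data are hypothetical; real Spark shuffles also involve serialization, spilling, and partitioning, none of which is modelled here.

```scala
object ShuffleVolumeSketch {
  type Record = (String, Int)

  // Three made-up partitions holding ten (word, 1) records in total.
  val partitions: Seq[Seq[Record]] = Seq(
    Seq(("a", 1), ("a", 1), ("b", 1)),
    Seq(("a", 1), ("b", 1), ("b", 1)),
    Seq(("a", 1), ("a", 1), ("b", 1), ("b", 1)))

  // reduceByKey style: each partition ships one pre-summed record per distinct key.
  def shuffledWithCombine(parts: Seq[Seq[Record]]): Seq[Record] =
    parts.flatMap { part =>
      part.groupBy(_._1).map { case (k, recs) => (k, recs.map(_._2).sum) }.toSeq
    }

  // groupByKey style: every record is shipped unchanged.
  def shuffledWithoutCombine(parts: Seq[Seq[Record]]): Seq[Record] =
    parts.flatten

  // Reduce side: sum whatever arrived for each key.
  def reduceSide(shuffled: Seq[Record]): Map[String, Int] =
    shuffled.groupBy(_._1).map { case (k, recs) => (k, recs.map(_._2).sum) }

  def main(args: Array[String]): Unit = {
    val combined = shuffledWithCombine(partitions)
    val raw = shuffledWithoutCombine(partitions)
    println(s"records shuffled with map-side combine: ${combined.size}")
    println(s"records shuffled without it: ${raw.size}")
    println(s"same final counts: ${reduceSide(combined) == reduceSide(raw)}")
  }
}
```

With only two distinct keys the saving is modest; in this model the gap grows with the number of repeated keys per partition.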
Reposted from: https://blog.csdn.net/do_yourself_go_on/article/details/76033252 (please get in touch in case of any copyright infringement)