First, the source code:
/** RDD.scala
* Return an RDD of grouped items. Each group consists of a key and a sequence of elements
* mapping to that key. The ordering of elements within each group is not guaranteed, and
* may even differ each time the resulting RDD is evaluated.
*
* @note This operation may be very expensive. If you are grouping in order to perform an
* aggregation (such as a sum or average) over each key, using `PairRDDFunctions.aggregateByKey`
* or `PairRDDFunctions.reduceByKey` will provide much better performance.
*/
def groupBy[K](f: T => K, p: Partitioner)(implicit kt: ClassTag[K], ord: Ordering[K] = null)
: RDD[(K, Iterable[T])] = withScope {
val cleanF = sc.clean(f)
this.map(t => (cleanF(t), t)).groupByKey(p)
}
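As a quick illustration (my own sketch, not from the original post; `sc` is an assumed SparkContext), grouping integers by parity. The first call uses the simpler groupBy(f) overload, which picks a default partitioner; the second matches the signature above with an explicit Partitioner:
import org.apache.spark.HashPartitioner
val nums = sc.parallelize(1 to 10)
// groupBy(f): key each element by f(element), then groupByKey
val byParity = nums.groupBy(n => n % 2) // RDD[(Int, Iterable[Int])]
// groupBy(f, p): same, but with an explicit partitioner
val byParity4 = nums.groupBy((n: Int) => n % 2, new HashPartitioner(4))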
/************************* groupByKey *****************************/
/**
* PairRDDFunctions.scala
*/
def groupByKey(): RDD[(K, Iterable[V])] = self.withScope {
groupByKey(defaultPartitioner(self))
}
def groupByKey(numPartitions: Int): RDD[(K, Iterable[V])] = self.withScope {
groupByKey(new HashPartitioner(numPartitions))
}
/**
* @note As currently implemented, groupByKey must be able to hold all the key-value pairs for any
* key in memory. If a key has too many values, it can result in an `OutOfMemoryError`.
*/
def groupByKey(partitioner: Partitioner): RDD[(K, Iterable[V])] = self.withScope {
// groupByKey shouldn't use map side combine because map side combine does not
// reduce the amount of data shuffled and requires all map side data be inserted
// into a hash table, leading to more objects in the old gen.
val createCombiner = (v: V) => CompactBuffer(v)
val mergeValue = (buf: CompactBuffer[V], v: V) => buf += v
val mergeCombiners = (c1: CompactBuffer[V], c2: CompactBuffer[V]) => c1 ++= c2
val bufs = combineByKeyWithClassTag[CompactBuffer[V]](
createCombiner, mergeValue, mergeCombiners, partitioner, mapSideCombine = false)
bufs.asInstanceOf[RDD[(K, Iterable[V])]]
}
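From the caller's side, the CompactBuffer shows up inside the grouped values; a minimal sketch with illustrative data (again assuming `sc`):
val pairs = sc.parallelize(Seq(("a", 1), ("a", 2), ("b", 3)))
val grouped = pairs.groupByKey() // RDD[(String, Iterable[Int])]
// grouped.collect() => Array((a,CompactBuffer(1, 2)), (b,CompactBuffer(3)))
// All values of a key sit in a single in-memory buffer, which is why a
// heavily skewed key can hit the OutOfMemoryError mentioned in the @note above.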
@Experimental
def combineByKeyWithClassTag[C](
createCombiner: V => C, // turns the first value seen for a key into the initial combiner of type C
mergeValue: (C, V) => C, // folds another V into a C: the map-side pre-aggregation step
mergeCombiners: (C, C) => C, // merges two combiners: the reduce step
partitioner: Partitioner, // the partitioner to use
mapSideCombine: Boolean = true, // whether to enable map-side combine; on by default
serializer: Serializer = null)(implicit ct: ClassTag[C]): RDD[(K, C)] = self.withScope {
require(mergeCombiners != null, "mergeCombiners must be defined") // required as of Spark 0.9.0
if (keyClass.isArray) {
if (mapSideCombine) {
throw new SparkException("Cannot use map-side combining with array keys.")
}
if (partitioner.isInstanceOf[HashPartitioner]) {
throw new SparkException("HashPartitioner cannot partition array keys.")
}
}
val aggregator = new Aggregator[K, V, C](
self.context.clean(createCombiner),
self.context.clean(mergeValue),
self.context.clean(mergeCombiners))
if (self.partitioner == Some(partitioner)) {
self.mapPartitions(iter => {
val context = TaskContext.get()
new InterruptibleIterator(context, aggregator.combineValuesByKey(iter, context))
}, preservesPartitioning = true)
} else {
new ShuffledRDD[K, V, C](self, partitioner)
.setSerializer(serializer)
.setAggregator(aggregator)
.setMapSideCombine(mapSideCombine)
}
}
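To make the three function parameters concrete, here is a sketch of a per-key average through combineByKey, the stable public wrapper around combineByKeyWithClassTag (illustrative data; the combiner type C is a (sum, count) tuple):
val scores = sc.parallelize(Seq(("a", 1.0), ("a", 3.0), ("b", 4.0)))
val sumCount = scores.combineByKey(
(v: Double) => (v, 1L), // createCombiner: first value of a key in a partition
(c: (Double, Long), v: Double) => (c._1 + v, c._2 + 1), // mergeValue: fold in another value
(c1: (Double, Long), c2: (Double, Long)) => (c1._1 + c2._1, c1._2 + c2._2)) // mergeCombiners: merge partitions
val avg = sumCount.mapValues { case (sum, n) => sum / n }
// avg.collect() => Array((a,2.0), (b,4.0))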
/************************* reduceByKey *****************************/
def reduceByKey(func: (V, V) => V, numPartitions: Int): RDD[(K, V)] = self.withScope {
reduceByKey(new HashPartitioner(numPartitions), func)
}
/**
* Merge the values for each key using an associative and commutative reduce function. This will
* also perform the merging locally on each mapper before sending results to a reducer, similarly
* to a "combiner" in MapReduce.
*/
def reduceByKey(partitioner: Partitioner, func: (V, V) => V): RDD[(K, V)] = self.withScope {
combineByKeyWithClassTag[V]((v: V) => v, func, func, partitioner)
}
Both reduceByKey and groupByKey are implemented on top of combineByKeyWithClassTag,
but they call it with different parameters and have different return types.
- Return types first: groupByKey() returns RDD[(K, Iterable[V])], holding the grouped values for each key, while reduceByKey() returns RDD[(K, C)] with C instantiated to V, i.e. an ordinary pair RDD[(K, V)].
- Now the call parameters. groupByKey instantiates the type parameter as CompactBuffer[V]:
combineByKeyWithClassTag[CompactBuffer[V]](
createCombiner, mergeValue, mergeCombiners, partitioner, mapSideCombine = false)
Note that groupByKey sets mapSideCombine to false, turning map-side pre-aggregation off!
- reduceByKey instantiates the type parameter as V:
combineByKeyWithClassTag[V]((v: V) => v, func, func, partitioner)
Its createCombiner is the identity function (v: V) => v; mergeValue and mergeCombiners are both func;
the partitioner is passed through unchanged, and mapSideCombine keeps its default value of true.
The biggest difference between reduceByKey and groupByKey is therefore the mapSideCombine flag, which decides whether a combine pass runs on each node before the data is shuffled.
From the two implementations we can see that reduceByKey merges the multiple values of each key with a user-defined function, and, crucially, performs that merging locally on the map side.
groupByKey accepts no such function: we must first build the grouped RDD with groupByKey and only then apply our own function via map.
For example, the two operators are typically used like this:
val wordCountsWithReduce = wordPairsRDD.reduceByKey(_ + _)
val wordCountsWithGroup = wordPairsRDD.groupByKey().map(t => (t._1, t._2.sum))
reduceByKey pre-aggregates with a user-defined function such as "_ + _"; groupByKey has no such parameter.
When groupByKey is called, every key-value pair is moved across the network: each map task's output for a key is shipped to the node that owns that key and shuffled there, so the transfer overhead between cluster nodes is large.
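And as the @note at the top recommends, aggregateByKey is another alternative that keeps the map-side combine; a sketch of the same word count (zero value 0, with addition as both seqOp and combOp):
val wordCountsWithAggregate = wordPairsRDD.aggregateByKey(0)(_ + _, _ + _)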
That's all.
Ref:
https://blog.csdn.net/zongzhiyuan/article/details/49965021