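Recently I have been revisiting the Spark basics I learned earlier, this time by reading the source code. In the section on RDD dependencies, one question keeps coming up: does a given transformation produce a Narrow Dependency or a Shuffle Dependency?
For operations such as map and filter it is clear that the dependency is narrow. For more complex or less obvious transformations, such as groupByKey(), the kind of dependency is harder to tell. Many blog posts simply state that this transformation is a wide dependency. Is it really?
Let's look at the source: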
def groupByKey(): RDD[(K, Iterable[V])] = self.withScope {
  groupByKey(defaultPartitioner(self))
}
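It obtains the default partitioner (which in general reuses the RDD's existing partitioner if it has one) and passes it to the overload of groupByKey that takes a partitioner: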
def groupByKey(partitioner: Partitioner): RDD[(K, Iterable[V])] = self.withScope {
  // groupByKey shouldn't use map side combine because map side combine does not
  // reduce the amount of data shuffled and requires all map side data be inserted
  // into a hash table, leading to more objects in the old gen.
  val createCombiner = (v: V) => CompactBuffer(v)
  val mergeValue = (buf: CompactBuffer[V], v: V) => buf += v
  val mergeCombiners = (c1: CompactBuffer[V], c2: CompactBuffer[V]) => c1 ++= c2
  val bufs = combineByKeyWithClassTag[CompactBuffer[V]](
    createCombiner, mergeValue, mergeCombiners, partitioner, mapSideCombine = false)
  bufs.asInstanceOf[RDD[(K, Iterable[V])]]
}
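To make the roles of the three combiner functions concrete, here is a minimal sketch in plain Scala with no Spark dependency. It is only an illustration: `ArrayBuffer` stands in for Spark's internal `CompactBuffer`, and `combinePartition` is a hypothetical helper that imitates folding one partition's records into per-key buffers. It is not how Spark's `Aggregator` is actually implemented.

```scala
import scala.collection.mutable

object GroupByKeyCombinerSketch {
  // Assumption: ArrayBuffer stands in for Spark's internal CompactBuffer.
  type Buf[V] = mutable.ArrayBuffer[V]

  // First value seen for a key: start a new buffer.
  def createCombiner[V](v: V): Buf[V] = mutable.ArrayBuffer(v)

  // Another value for a key already seen: append it to the buffer.
  def mergeValue[V](buf: Buf[V], v: V): Buf[V] = buf += v

  // Two buffers for the same key (e.g. from different sources): concatenate them.
  def mergeCombiners[V](c1: Buf[V], c2: Buf[V]): Buf[V] = c1 ++= c2

  // Hypothetical helper: fold the records of one partition into per-key buffers.
  def combinePartition[K, V](records: Iterator[(K, V)]): mutable.Map[K, Buf[V]] = {
    val combined = mutable.LinkedHashMap.empty[K, Buf[V]]
    records.foreach { case (k, v) =>
      combined.get(k) match {
        case Some(buf) => mergeValue(buf, v)
        case None      => combined(k) = createCombiner(v)
      }
    }
    combined
  }

  def main(args: Array[String]): Unit = {
    val grouped = combinePartition(Iterator(("a", 1), ("b", 2), ("a", 3)))
    println(grouped)
  }
}
```

In this sketch the key "a" ends up with a buffer holding 1 and 3, and "b" with a buffer holding 2. Nothing is reduced along the way, which is consistent with the source comment that map-side combine would not shrink the data for groupByKey.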
In the end it calls combineByKeyWithClassTag. Let's look at that function:
def combineByKeyWithClassTag[C](
    createCombiner: V => C,
    mergeValue: (C, V) => C,
    mergeCombiners: (C, C) => C,
    partitioner: Partitioner,
    mapSideCombine: Boolean = true,
    serializer: Serializer = null)(implicit ct: ClassTag[C]): RDD[(K, C)] = self.withScope {
  require(mergeCombiners != null, "mergeCombiners must be defined") // required as of Spark 0.9.0
  if (keyClass.isArray) {
    if (mapSideCombine) {
      throw new SparkException("Cannot use map-side combining with array keys.")
    }
    if (partitioner.isInstanceOf[HashPartitioner]) {
      throw new SparkException("HashPartitioner cannot partition array keys.")
    }
  }
  val aggregator = new Aggregator[K, V, C](
    self.context.clean(createCombiner),
    self.context.clean(mergeValue),
    self.context.clean(mergeCombiners))
  if (self.partitioner == Some(partitioner)) {
    self.mapPartitions(iter => {
      val context = TaskContext.get()
      new InterruptibleIterator(context, aggregator.combineValuesByKey(iter, context))
    }, preservesPartitioning = true)
  } else {
    new ShuffledRDD[K, V, C](self, partitioner)
      .setSerializer(serializer)
      .setAggregator(aggregator)
      .setMapSideCombine(mapSideCombine)
  }
}
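As the function above shows, if the RDD's own partitioner equals the partitioner passed in (self.partitioner == Some(partitioner)), it calls mapPartitions, which produces a MapPartitionsRDD, and that is a narrow dependency. Only when the partitioners differ, or the RDD has no partitioner at all, does it create a ShuffledRDD, which is a wide dependency.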
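One way to check this branch empirically is to inspect `dependencies` on the RDD that groupByKey() returns. The following is a minimal sketch, assuming a local Spark setup with spark-core on the classpath. The object name, the `isShuffle` and `check` helpers, and the sample data are made up for illustration.

```scala
import org.apache.spark.{HashPartitioner, ShuffleDependency, SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object GroupByKeyDependencyCheck {
  // True when the RDD depends on its parent through a shuffle.
  def isShuffle(rdd: RDD[_]): Boolean =
    rdd.dependencies.exists(_.isInstanceOf[ShuffleDependency[_, _, _]])

  // Returns (shuffle when the parent has no partitioner?,
  //          shuffle when the parent is already hash-partitioned?)
  def check(sc: SparkContext): (Boolean, Boolean) = {
    val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)), numSlices = 4)

    // Parent has no partitioner, so the ShuffledRDD branch is taken.
    val plain = pairs.groupByKey()

    // Parent already carries a HashPartitioner; groupByKey() picks it up
    // through defaultPartitioner and takes the mapPartitions branch.
    val prePartitioned = pairs.partitionBy(new HashPartitioner(4))
    val grouped = prePartitioned.groupByKey()

    (isShuffle(plain), isShuffle(grouped))
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("groupByKey-dependency-check")
    val sc = new SparkContext(conf)
    try println(check(sc)) finally sc.stop()
  }
}
```

With the source shown above, `check` should return (true, false). Note that partitionBy itself shuffles, so the data movement is not avoided overall; it was simply paid for earlier, and groupByKey() adds no second shuffle on top of it.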
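groupByKey also has two other overloads: groupByKey(numPartitions: Int) and groupByKey(partitioner: Partitioner). With these two the caller specifies the partitioning explicitly (the Int overload wraps the number in a HashPartitioner), so they usually introduce a new partitioning. The same equality check in combineByKeyWithClassTag still applies to them, though.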
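The explicit overloads can be probed the same way. The sketch below assumes a local Spark setup and relies on two HashPartitioner instances comparing equal when their partition counts match. The object and helper names are hypothetical.

```scala
import org.apache.spark.{HashPartitioner, ShuffleDependency, SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object GroupByKeyOverloadCheck {
  // True when the RDD depends on its parent through a shuffle.
  def isShuffle(rdd: RDD[_]): Boolean =
    rdd.dependencies.exists(_.isInstanceOf[ShuffleDependency[_, _, _]])

  // Returns (shuffle for groupByKey(4)?, shuffle for groupByKey(8)?)
  // on a parent that is already partitioned by HashPartitioner(4).
  def check(sc: SparkContext): (Boolean, Boolean) = {
    val parent = sc
      .parallelize(Seq(("a", 1), ("b", 2), ("a", 3)), numSlices = 4)
      .partitionBy(new HashPartitioner(4))

    // Same partition count: the new HashPartitioner(4) equals the parent's.
    val samePartitioning = parent.groupByKey(4)

    // Different partition count: the partitioners differ, so a shuffle is needed.
    val newPartitioning = parent.groupByKey(8)

    (isShuffle(samePartitioning), isShuffle(newPartitioning))
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("groupByKey-overload-check")
    val sc = new SparkContext(conf)
    try println(check(sc)) finally sc.stop()
  }
}
```

Here `check` should return (false, true): asking for the partitioning the parent already has stays narrow, while asking for a different one produces a ShuffledRDD.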
So groupByKey() is not necessarily a wide dependency after all!