Preface

What is an RDD
RDD partitions: partitions()
final def partitions: Array[Partition] = {
  checkpointRDD.map(_.partitions).getOrElse {
    if (partitions_ == null) {
      partitions_ = getPartitions
    }
    partitions_
  }
}
Note that checkpointRDD is consulted first: if this RDD has been checkpointed, CheckpointRDD's partitions are returned (covered in more detail later); otherwise the RDD's own getPartitions method is called. Each RDD subclass implements its own getPartitions.
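As a rough illustration (not taken from the Spark source, and with hypothetical class names), a minimal custom RDD might implement getPartitions like this; the final partitions method above then caches the result in partitions_:

import org.apache.spark.{Partition, SparkContext, TaskContext}
import org.apache.spark.rdd.RDD

// Hypothetical partition class: it only records its index.
class RangePartition(override val index: Int) extends Partition

// Hypothetical RDD producing the numbers 0 until count across numSlices partitions.
class RangeLikeRDD(sc: SparkContext, numSlices: Int, count: Long)
  extends RDD[Long](sc, Nil) {

  // one Partition object per slice
  override protected def getPartitions: Array[Partition] =
    Array.tabulate[Partition](numSlices)(i => new RangePartition(i))

  // each partition computes its own contiguous slice of the range
  override def compute(split: Partition, context: TaskContext): Iterator[Long] = {
    val start = split.index * count / numSlices
    val end   = (split.index + 1) * count / numSlices
    (start until end).iterator
  }
}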
RDD preferred locations: preferredLocations()

final def preferredLocations(split: Partition): Seq[String] = {
  checkpointRDD.map(_.getPreferredLocations(split)).getOrElse {
    getPreferredLocations(split)
  }
}
It also checks checkpointRDD first; if the RDD has not been checkpointed, its own getPreferredLocations method is called. This method is invoked from DAGScheduler.submitMissingTasks, which walks the lineage back to the first RDD and returns the nodes where the data is local.
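A small sketch of what this looks like in practice (spark-shell style, assuming an existing SparkContext sc and a hypothetical HDFS path): for a HadoopRDD, preferredLocations returns the hosts that store each block.

import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapred.TextInputFormat

// build a HadoopRDD directly with the old mapred API (the path is hypothetical)
val hadoopRdd = sc.hadoopFile("hdfs://namenode:9000/user/test/input",
  classOf[TextInputFormat], classOf[LongWritable], classOf[Text])

// print the data-local hosts for every partition
hadoopRdd.partitions.foreach { p =>
  println(s"partition ${p.index} -> ${hadoopRdd.preferredLocations(p)}")
}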
RDD dependencies: dependencies
final def dependencies: Seq[Dependency[_]] = {
  checkpointRDD.map(r => List(new OneToOneDependency(r))).getOrElse {
    if (dependencies_ == null) {
      dependencies_ = getDependencies
    }
    dependencies_
  }
}
As above, checkpointRDD is checked first for a dependency. deps is passed in when the RDD instance is created; it holds the RDDs this RDD depends on (see the sketch after the list below).
- Narrow dependency: each partition of the parent RDD is used by at most one partition of the child RDD
- Wide dependency: multiple partitions of the child RDD depend on the same partition of the parent RDD
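A small sketch (spark-shell style, assuming an existing SparkContext sc) that contrasts the two kinds of dependency:

val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)), 2)

// map is a narrow transformation: the child has a OneToOneDependency on pairs
val mapped = pairs.map { case (k, v) => (k, v * 10) }
println(mapped.dependencies)   // e.g. List(org.apache.spark.OneToOneDependency@...)

// reduceByKey needs a shuffle: a ShuffleDependency (wide) appears in the lineage
val reduced = pairs.reduceByKey(_ + _)
println(reduced.toDebugString) // the lineage shows a ShuffledRDD stage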
RDD partition computation: compute()
def compute(split: Partition, context: TaskContext): Iterator[T]
In Spark every RDD is computed one partition at a time: the compute function runs the user-supplied logic and finally returns an iterator over that partition's data.
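A hypothetical sketch, in the spirit of Spark's MapPartitionsRDD, of what a typical compute implementation looks like: it applies a user function to the parent partition's iterator and returns the resulting iterator (the class name and constructor here are made up for illustration).

import scala.reflect.ClassTag
import org.apache.spark.{Partition, TaskContext}
import org.apache.spark.rdd.RDD

class MapLikeRDD[U: ClassTag, T: ClassTag](prev: RDD[T], f: Iterator[T] => Iterator[U])
  extends RDD[U](prev) {

  // narrow dependency: reuse the parent's partitions unchanged
  override protected def getPartitions: Array[Partition] = firstParent[T].partitions

  // per-partition computation: run f over the parent's iterator for this split
  override def compute(split: Partition, context: TaskContext): Iterator[U] =
    f(firstParent[T].iterator(split, context))
}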
RDD partitioner: partitioner

For key-value RDDs, partitioner (for example HashPartitioner) decides which partition each key is assigned to; it is defined only for RDDs that have been partitioned, such as the output of a shuffle.

Transformations between RDDs
val hdfsFile = sc.textFile(args(1))
val flatMapRdd = hdfsFile.flatMap(s => s.split(" "))
val filterRdd = flatMapRdd.filter(_.length == 2)
val mapRdd = filterRdd.map(word => (word, 1))
val reduce = mapRdd.reduceByKey(_ + _)
reduce.cache()
reduce.saveAsTextFile("hdfs://...")
The first line is a creation operation: it reads data from a storage system such as HDFS or HBase and turns it into a HadoopRDD. The reduceByKey at the end is built on combineByKey, whose implementation is shown below.

def combineByKey[C](createCombiner: V => C,
    mergeValue: (C, V) => C,
    mergeCombiners: (C, C) => C,
    partitioner: Partitioner,
    mapSideCombine: Boolean = true,
    serializer: Serializer = null): RDD[(K, C)] = {
  require(mergeCombiners != null, "mergeCombiners must be defined") // required as of Spark 0.9.0
  if (keyClass.isArray) {
    if (mapSideCombine) {
      throw new SparkException("Cannot use map-side combining with array keys.")
    }
    if (partitioner.isInstanceOf[HashPartitioner]) {
      throw new SparkException("Default partitioner cannot partition array keys.")
    }
  }
  val aggregator = new Aggregator[K, V, C](createCombiner, mergeValue, mergeCombiners)
  if (self.partitioner == Some(partitioner)) {
    self.mapPartitionsWithContext((context, iter) => {
      new InterruptibleIterator(context, aggregator.combineValuesByKey(iter, context))
    }, preservesPartitioning = true)
  } else if (mapSideCombine) {
    val combined = self.mapPartitionsWithContext((context, iter) => {
      aggregator.combineValuesByKey(iter, context)
    }, preservesPartitioning = true) // first do a map-side combine inside each partition, returning a MapPartitionsRDD
    val partitioned = new ShuffledRDD[K, C, (K, C)](combined, partitioner)
      .setSerializer(serializer) // the ShuffledRDD performs the shuffle
    partitioned.mapPartitionsWithContext((context, iter) => {
      new InterruptibleIterator(context, aggregator.combineCombinersByKey(iter, context))
    }, preservesPartitioning = true) // after the shuffle, combine once more on the reduce side, returning a MapPartitionsRDD
  } else {
    // Don't apply map-side combiner.
    val values = new ShuffledRDD[K, V, (K, V)](self, partitioner).setSerializer(serializer)
    values.mapPartitionsWithContext((context, iter) => {
      new InterruptibleIterator(context, aggregator.combineValuesByKey(iter, context))
    }, preservesPartitioning = true)
  }
}
Here C may be just a simple type, but it is often a collection, for example going from (Int, Int) to (Int, Seq[Int]). A usage sketch follows this list.
- createCombiner: V => C. Used when no C exists yet for a key, for example creating a Seq C from the first V.
- mergeValue: (C, V) => C. Used when a C already exists and a new value must be merged in, for example appending item V to the Seq C, or accumulating it.
- mergeCombiners: (C, C) => C. Merges two Cs.
- partitioner: the Partitioner needed during the shuffle.
- mapSideCombine: Boolean = true. To reduce the amount of data transferred, much of the combining (for example summation) can be done on the map side first: within one partition, all values of the same key are combined before the shuffle.
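A minimal usage sketch (assuming an existing SparkContext sc) that matches the (Int, Int) to (Int, Seq[Int]) example above, collecting all values of a key into a Seq:

val data = sc.parallelize(Seq((1, 10), (1, 20), (2, 30)))

val grouped = data.combineByKey(
  (v: Int) => Seq(v),                        // createCombiner: first value of a key in a partition
  (c: Seq[Int], v: Int) => c :+ v,           // mergeValue: another value of the same key in the same partition
  (c1: Seq[Int], c2: Seq[Int]) => c1 ++ c2   // mergeCombiners: merge per-partition results after the shuffle
)

grouped.collect()   // e.g. Array((1, Seq(10, 20)), (2, Seq(30))), ordering may vary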
The shuffle itself is carried out by ShuffledRDD:

class ShuffledRDD[K, V, P <: Product2[K, V] : ClassTag](
    @transient var prev: RDD[P],
    part: Partitioner)
  extends RDD[P](prev.context, Nil) {

  private var serializer: Serializer = null

  def setSerializer(serializer: Serializer): ShuffledRDD[K, V, P] = {
    this.serializer = serializer
    this
  }

  override def getDependencies: Seq[Dependency[_]] = {
    List(new ShuffleDependency(prev, part, serializer))
  }

  override val partitioner = Some(part)

  override def getPartitions: Array[Partition] = {
    Array.tabulate[Partition](part.numPartitions)(i => new ShuffledRDDPartition(i))
  }

  override def compute(split: Partition, context: TaskContext): Iterator[P] = {
    val shuffledId = dependencies.head.asInstanceOf[ShuffleDependency[K, V]].shuffleId
    val ser = Serializer.getSerializer(serializer)
    SparkEnv.get.shuffleFetcher.fetch[P](shuffledId, split.index, context, ser)
  }

  override def clearDependencies() {
    super.clearDependencies()
    prev = null
  }
}
The compute function here mainly performs the shuffle read, which is carried out by the shuffleFetcher.
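As a small sketch (reusing the word-count reduce RDD defined earlier): the number of partitions after the shuffle is decided by the Partitioner, exactly as ShuffledRDD.getPartitions builds one ShuffledRDDPartition per partitioner partition, and the resulting RDD exposes that partitioner.

println(reduce.partitioner)        // typically Some(HashPartitioner)
println(reduce.partitions.length)  // equals the partitioner's numPartitions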