Preface
What Is an RDD
An RDD (Resilient Distributed Dataset) is Spark's core abstraction: an immutable, partitioned collection of records that can be computed on in parallel. The rest of this post walks through the key members of the RDD class.
RDD Partitions: partitions()
final def partitions: Array[Partition] = {
  checkpointRDD.map(_.partitions).getOrElse {
    if (partitions_ == null) {
      partitions_ = getPartitions
    }
    partitions_
  }
}
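As a quick usage sketch (assuming `sc` is an existing SparkContext; the variable names are not from the original): the partition count fixed at creation time is exactly what partitions reports.
// Minimal sketch: assumes `sc` is an existing SparkContext
val rdd = sc.parallelize(1 to 100, numSlices = 4)
println(rdd.partitions.length) // 4: ParallelCollectionRDD.getPartitions builds one Partition per slice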
Note that partitions consults checkpointRDD first: if this RDD has been checkpointed, CheckpointRDD.partitions is used (covered in more detail later); otherwise the RDD's own getPartitions method is called, and each RDD subclass implements getPartitions differently.
RDD Preferred Locations: preferredLocations()
final def preferredLocations(split: Partition): Seq[String] = {
  checkpointRDD.map(_.getPreferredLocations(split)).getOrElse {
    getPreferredLocations(split)
  }
}
This, too, looks in checkpointRDD first, and falls back to the RDD's own getPreferredLocations method if there is no checkpoint. The method is called from DAGScheduler.submitMissingTasks, which traces back to the first RDD in the lineage and returns the nodes where the data is local.
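For instance, the locality information of each partition of an HDFS-backed RDD can be inspected directly (a minimal sketch; the hdfs path is only a placeholder):
// Minimal sketch: the hdfs path is a placeholder, `sc` is an existing SparkContext
val lines = sc.textFile("hdfs://...")
lines.partitions.foreach { p =>
  // for a HadoopRDD this is the set of hosts holding the corresponding block
  println(s"partition ${p.index}: ${lines.preferredLocations(p)}")
}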
RDD Dependencies: dependencies()
final def dependencies: Seq[Dependency[_]] = {
  checkpointRDD.map(r => List(new OneToOneDependency(r))).getOrElse {
    if (dependencies_ == null) {
      dependencies_ = getDependencies
    }
    dependencies_
  }
}
As before, checkpointRDD is consulted first for the dependencies. deps is passed in when the RDD instance is created, and it is deps that holds the RDDs this one depends on. Dependencies come in two kinds (see the sketch after this list):
- Narrow dependency: each partition of the parent RDD is used by at most one partition of the child RDD.
- Wide dependency: multiple partitions of the child RDD depend on the same partition of the parent RDD.
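A quick way to see the difference (a minimal sketch; `sc` is assumed to be an existing SparkContext):
// Minimal sketch: map keeps a narrow dependency, groupByKey introduces a wide one
val pairs   = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)), 4)
val mapped  = pairs.map { case (k, v) => (k, v * 2) }
println(mapped.dependencies)  // a OneToOneDependency: narrow
val grouped = mapped.groupByKey()
println(grouped.dependencies) // a ShuffleDependency: wide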
RDD Partition Computation: compute()
def compute(split: Partition, context: TaskContext): Iterator[T]
Every RDD in Spark is computed at the granularity of a partition: compute runs the user-supplied code and ultimately returns an iterator over the corresponding partition's data.
RDD Partitioner: partitioner
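The original only names this member, so as a brief note: partitioner is an Option[Partitioner] that is defined only for key-value RDDs whose partitioning scheme is known, typically after a shuffle (ShuffledRDD below sets override val partitioner = Some(part)). A minimal sketch, assuming `sc` is an existing SparkContext:
import org.apache.spark.HashPartitioner

// Minimal sketch: partitioner is None until a Partitioner is applied
val kv     = sc.parallelize(Seq((1, "a"), (2, "b"), (3, "c")))
println(kv.partitioner)     // None: no partitioning scheme is known yet
val hashed = kv.partitionBy(new HashPartitioner(4))
println(hashed.partitioner) // Some(...): fixed by the shuffle, reusable downstream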
Transformations Between RDDs
val hdfsFile = sc.textFile(args(1))
val flatMapRdd = hdfsFile.flatMap(s => s.split(" "))
val filterRdd = flatMapRdd.filter(_.length == 2)
val mapRdd = filterRdd.map(word => (word, 1))
val reduce = mapRdd.reduceByKey(_ + _)
reduce.cache()
reduce.saveAsTextFile("hdfs://...")
The first line is a creation operation: it reads data from a storage system such as HDFS or HBase and produces a HadoopRDD. The flatMap, filter, map and reduceByKey lines are transformation operations, cache() is a control operation, and saveAsTextFile is an action that actually triggers the job. The reduceByKey step is built on top of combineByKey:
def combineByKey[C](createCombiner: V => C,
    mergeValue: (C, V) => C,
    mergeCombiners: (C, C) => C,
    partitioner: Partitioner,
    mapSideCombine: Boolean = true,
    serializer: Serializer = null): RDD[(K, C)] = {
  require(mergeCombiners != null, "mergeCombiners must be defined") // required as of Spark 0.9.0
  if (keyClass.isArray) {
    if (mapSideCombine) {
      throw new SparkException("Cannot use map-side combining with array keys.")
    }
    if (partitioner.isInstanceOf[HashPartitioner]) {
      throw new SparkException("Default partitioner cannot partition array keys.")
    }
  }
  val aggregator = new Aggregator[K, V, C](createCombiner, mergeValue, mergeCombiners)
  if (self.partitioner == Some(partitioner)) {
    // already partitioned the right way: just combine within each partition, no shuffle
    self.mapPartitionsWithContext((context, iter) => {
      new InterruptibleIterator(context, aggregator.combineValuesByKey(iter, context))
    }, preservesPartitioning = true)
  } else if (mapSideCombine) {
    val combined = self.mapPartitionsWithContext((context, iter) => {
      aggregator.combineValuesByKey(iter, context)
    }, preservesPartitioning = true) // map-side combine inside each partition first, yielding a MapPartitionsRDD
    val partitioned = new ShuffledRDD[K, C, (K, C)](combined, partitioner)
      .setSerializer(serializer) // ShuffledRDD: performs the shuffle
    partitioned.mapPartitionsWithContext((context, iter) => {
      new InterruptibleIterator(context, aggregator.combineCombinersByKey(iter, context))
    }, preservesPartitioning = true) // after the shuffle, combine once more on the reduce side, yielding a MapPartitionsRDD
  } else {
    // Don't apply map-side combiner.
    val values = new ShuffledRDD[K, V, (K, V)](self, partitioner).setSerializer(serializer)
    values.mapPartitionsWithContext((context, iter) => {
      new InterruptibleIterator(context, aggregator.combineValuesByKey(iter, context))
    }, preservesPartitioning = true)
  }
}
Here C may be just a simple type, but it is often a collection, e.g. going from (Int, Int) to (Int, Seq[Int]).
- createCombiner: V => C, used when no C exists for the key yet, e.g. creating a Seq C from V.
- mergeValue: (C, V) => C, used when a C already exists and a value must be merged in, e.g. appending item V to Seq C, or accumulating.
- mergeCombiners: (C, C) => C, merges two Cs.
- partitioner: the partitioning function, i.e. the Partitioner required by the shuffle.
- mapSideCombine: Boolean = true, to reduce the amount of data transferred, much of the combining can be done on the map side first; for an accumulation, for example, all values with the same key within a partition can be summed before the shuffle.
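A usage sketch matching the (Int, Seq[Int]) example above (assumes `sc` is an existing SparkContext; this is what groupByKey effectively does for you):
// Minimal sketch: build (Int, Seq[Int]) by hand with combineByKey
val kvPairs = sc.parallelize(Seq((1, 10), (1, 20), (2, 30)))
val grouped = kvPairs.combineByKey[Seq[Int]](
  (v: Int) => Seq(v),                       // createCombiner: no C for this key yet
  (c: Seq[Int], v: Int) => c :+ v,          // mergeValue: fold one more V into the partial C
  (c1: Seq[Int], c2: Seq[Int]) => c1 ++ c2  // mergeCombiners: merge map-side partial Cs
)
// grouped: RDD[(Int, Seq[Int])], e.g. (1, Seq(10, 20)) and (2, Seq(30))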
class ShuffledRDD[K, V, P <: Product2[K, V] : ClassTag](
    @transient var prev: RDD[P],
    part: Partitioner)
  extends RDD[P](prev.context, Nil) { // Nil: dependencies are supplied via getDependencies instead
  private var serializer: Serializer = null
  def setSerializer(serializer: Serializer): ShuffledRDD[K, V, P] = {
    this.serializer = serializer
    this
  }
  // a single ShuffleDependency on the parent RDD
  override def getDependencies: Seq[Dependency[_]] = {
    List(new ShuffleDependency(prev, part, serializer))
  }
  override val partitioner = Some(part)
  // one ShuffledRDDPartition per partition of the target Partitioner
  override def getPartitions: Array[Partition] = {
    Array.tabulate[Partition](part.numPartitions)(i => new ShuffledRDDPartition(i))
  }
  // shuffle read: fetch this partition's data from the map outputs
  override def compute(split: Partition, context: TaskContext): Iterator[P] = {
    val shuffledId = dependencies.head.asInstanceOf[ShuffleDependency[K, V]].shuffleId
    val ser = Serializer.getSerializer(serializer)
    SparkEnv.get.shuffleFetcher.fetch[P](shuffledId, split.index, context, ser)
  }
  override def clearDependencies() {
    super.clearDependencies()
    prev = null
  }
}
The compute function here is essentially the shuffle read, and the actual fetching is carried out by the shuffleFetcher.
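To see these RDDs appear in a real lineage, toDebugString prints the chain that reduceByKey builds (a sketch; assumes `sc` is an existing SparkContext, and the exact RDD names in the output vary by Spark version):
// Sketch: inspect the lineage produced by a shuffle
val counts = sc.parallelize(Seq("a b", "b c"))
  .flatMap(_.split(" "))
  .map((_, 1))
  .reduceByKey(_ + _)
println(counts.toDebugString)
// in the version shown above: a reduce-side MapPartitionsRDD over a ShuffledRDD
// over the map-side MapPartitionsRDD chain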