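Reference articles
The coalesce() and repartition() methods
Transformations

repartitionAndSortWithinPartitions
- Explanation
- Returns
- Source code
coalesce and repartition
- Explanation
- Returns
- Source code
pipe
- Explanation
- Returns
- Source code
cartesian
- Explanation
- Returns
- Source code
cogroup
- Explanation
- Source code
join
- Explanation
- Returns
- Source code
sortByKey
- Explanation
- Returns
- Source code
aggregateByKey
- Explanation
- Returns
- Source code
reduceByKey
- Explanation
- Returns
- Source code
groupByKey
- Explanation
- Returns
- Source code
distinct
- Explanation
- Returns
- Source code
intersection
- Explanation
- Returns
- Source code
union
- Explanation
- Returns
- Source code
sample
- Explanation
- Returns
- Source code
map
- Explanation
- Returns
- Source code
mapPartitions
- Explanation
- Returns
- Source code
mapPartitionsWithIndex
- Returns
- Source code
flatMap
- Explanation
- Returns
- Source code
filter
- Explanation
- Returns
- Source code
Core function: combineByKeyWithClassTag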

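When I first started writing Spark, I only skimmed the Transformations without really digesting them; see my earlier post on RDD operations for details.
Today I am using some free time to go through these RDD transformations again properly and deepen my understanding.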

repartitionAndSortWithinPartitions

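Explanation

Taken literally, it sorts the data inside each partition while repartitioning. The parameter is a partitioner (I will cover the partitioner system in the next post). The official documentation says this is more efficient than calling repartition and then sorting within each partition, because the sorting is pushed down into the shuffle machinery.

Returns

ShuffledRDD

Source code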

OrderedRDDFunctions.scala

def repartitionAndSortWithinPartitions(partitioner: Partitioner): RDD[(K, V)] = self.withScope {
    new ShuffledRDD[K, V, V](self, partitioner).setKeyOrdering(ordering)
  }

The logic is fairly simple: it creates a ShuffledRDD and sets a key ordering on it.
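To make the behaviour concrete, here is a minimal sketch in plain Scala collections, with no Spark dependency. It only models the observable result (records routed to partitions by key, each partition sorted by key); it is not how Spark implements the shuffle. The names RepartitionAndSortModel, partitionOf and repartitionAndSort are hypothetical, and partitionOf is an assumed stand-in for a hash-based partitioner.

```scala
object RepartitionAndSortModel {
  // Assumed stand-in for a hash partitioner: non-negative hashCode modulo partition count.
  def partitionOf[K](key: K, numPartitions: Int): Int = {
    val raw = key.hashCode % numPartitions
    if (raw < 0) raw + numPartitions else raw
  }

  // Route each record to a partition by key, then sort every partition by key.
  def repartitionAndSort[K: Ordering, V](
      records: Seq[(K, V)],
      numPartitions: Int): Vector[Seq[(K, V)]] = {
    val grouped = records.groupBy { case (k, _) => partitionOf(k, numPartitions) }
    Vector.tabulate(numPartitions) { p =>
      grouped.getOrElse(p, Seq.empty[(K, V)]).sortBy(_._1)
    }
  }
}
```

Note that the result is sorted within each partition only; there is no ordering guarantee across partitions.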

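coalesce and repartition

Explanation

Why cover these two together? Because the source shows that repartition simply calls coalesce with shuffle = true.
That makes things easy: we only need to understand coalesce. It changes the number of partitions, and its second parameter controls whether a shuffle is performed while doing so. Without a shuffle, coalesce can only reduce the partition count.

Returns

CoalescedRDD

Source code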

def repartition(numPartitions: Int)(implicit ord: Ordering[T] = null): RDD[T] = withScope {
    coalesce(numPartitions, shuffle = true)
  }

  def coalesce(numPartitions: Int, shuffle: Boolean = false,
               partitionCoalescer: Option[PartitionCoalescer] = Option.empty)
              (implicit ord: Ordering[T] = null)
      : RDD[T] = withScope {
    require(numPartitions > 0, s"Number of partitions ($numPartitions) must be positive.")
    if (shuffle) {
      /** Distributes elements evenly across output partitions, starting from a random partition. */
      val distributePartition = (index: Int, items: Iterator[T]) => {
        var position = (new Random(index)).nextInt(numPartitions)
        items.map { t =>
          // Note that the hash code of the key will just be the key itself. The HashPartitioner
          // will mod it with the number of total partitions.
          position = position + 1
          (position, t)
        }
      } : Iterator[(Int, T)]

      // include a shuffle step so that our upstream tasks are still distributed
      new CoalescedRDD(
        new ShuffledRDD[Int, T, T](mapPartitionsWithIndex(distributePartition),
        new HashPartitioner(numPartitions)),
        numPartitions,
        partitionCoalescer).values
    } else {
      new CoalescedRDD(this, numPartitions, partitionCoalescer)
    }
  }
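The shuffle branch above is easier to follow with a small model. The sketch below, in plain Scala without Spark, reproduces the idea of distributePartition: start from a pseudo-random position seeded by the input partition index and advance by one per element, so that taking the key modulo the partition count spreads elements evenly. DistributeModel and its method names are hypothetical, and targetPartition is an assumption about what HashPartitioner does with a non-negative Int key.

```scala
import scala.util.Random

object DistributeModel {
  // Mirrors distributePartition: random start seeded by the partition index, then +1 per element.
  def assignKeys[T](index: Int, items: Seq[T], numPartitions: Int): Seq[(Int, T)] = {
    var position = new Random(index).nextInt(numPartitions)
    items.map { t =>
      position = position + 1
      (position, t)
    }
  }

  // Assumed model of hash partitioning for a non-negative Int key: key modulo partition count.
  def targetPartition(key: Int, numPartitions: Int): Int = key % numPartitions

  // How many elements of one input partition land in each output partition.
  def countsPerPartition[T](index: Int, items: Seq[T], numPartitions: Int): Map[Int, Int] =
    assignKeys(index, items, numPartitions)
      .groupBy { case (key, _) => targetPartition(key, numPartitions) }
      .map { case (p, xs) => (p, xs.size) }
}
```

Because the keys are consecutive integers, nine elements spread over three output partitions give exactly three elements each, whatever the random starting position is.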

pipe

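Explanation

Put simply, it runs an external command, feeds it the elements of the RDD, and turns the command's output into an RDD[String]. Many people use this to run scripts in other languages such as PHP or Python and have them interoperate with Scala.

Returns

PipedRDD

Source code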

 /**
   * Return an RDD created by piping elements to a forked external process.
   */
  def pipe(command: String): RDD[String] = withScope {
    // Similar to Runtime.exec(), if we are given a single string, split it into words
    // using a standard StringTokenizer (i.e. by spaces)
    pipe(PipedRDD.tokenize(command))
  }

  /**
   * Return an RDD created by piping elements to a forked external process.
   */
  def pipe(command: String, env: Map[String, String]): RDD[String] = withScope {
    // Similar to Runtime.exec(), if we are given a single string, split it into words
    // using a standard StringTokenizer (i.e. by spaces)
    pipe(PipedRDD.tokenize(command), env)
  }


  def pipe(
      command: Seq[String],
      env: Map[String, String] = Map(),
      printPipeContext: (String => Unit) => Unit = null,
      printRDDElement: (T, String => Unit) => Unit = null,
      separateWorkingDir: Boolean = false,
      bufferSize: Int = 8192,
      encoding: String = Codec.defaultCharsetCodec.name): RDD[String] = withScope {
    new PipedRDD(this, command, env,
      if (printPipeContext ne null) sc.clean(printPipeContext) else null,
      if (printRDDElement ne null) sc.clean(printRDDElement) else null,
      separateWorkingDir,
      bufferSize,
      encoding)
  }

cartesian

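Explanation

Computes the Cartesian product with the data of another RDD. This scenario is rare in practice, so I will only mention it in passing.

Returns

CartesianRDD

Source code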

def cartesian[U: ClassTag](other: RDD[U]): RDD[(T, U)] = withScope {
    new CartesianRDD(sc, this, other)
  }

cogroup

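Explanation

This one also works on pair RDDs. For each key K it gathers the values from every participating RDD and produces a tuple of Iterables, one per RDD, so the arity of the tuple equals the number of RDDs being cogrouped.

For example, if three RDDs contain (A,1), (A,2) and (A,3) respectively, cogroup yields (A, ([1], [2], [3])).

Source code

There are nine cogroup overloads; I list only one of them here: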

def cogroup[W1, W2](other1: RDD[(K, W1)], other2: RDD[(K, W2)], partitioner: Partitioner)
      : RDD[(K, (Iterable[V], Iterable[W1], Iterable[W2]))] = self.withScope {
    if (partitioner.isInstanceOf[HashPartitioner] && keyClass.isArray) {
      throw new SparkException("HashPartitioner cannot partition array keys.")
    }
    val cg = new CoGroupedRDD[K](Seq(self, other1, other2), partitioner)
    cg.mapValues { case Array(vs, w1s, w2s) =>
      (vs.asInstanceOf[Iterable[V]],
        w1s.asInstanceOf[Iterable[W1]],
        w2s.asInstanceOf[Iterable[W2]])
    }
  }
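As a rough illustration of the result shape, here is a plain-Scala sketch of a two-way cogroup over ordinary sequences. It is a model of the semantics only, not of CoGroupedRDD; CogroupModel and its cogroup method are hypothetical names.

```scala
object CogroupModel {
  // For every key present in either input, collect the values from each input separately.
  def cogroup[K, V, W](left: Seq[(K, V)], right: Seq[(K, W)]): Map[K, (Seq[V], Seq[W])] = {
    val keys = (left.map(_._1) ++ right.map(_._1)).distinct
    keys.map { k =>
      val vs = left.collect { case (key, v) if key == k => v }
      val ws = right.collect { case (key, w) if key == k => w }
      k -> ((vs, ws))
    }.toMap
  }
}
```

A key that appears on only one side still shows up in the result, paired with an empty sequence for the other side.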

join

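Explanation

Similar to an inner join in MySQL.

Returns

CoGroupedRDD

Source code

Since it mirrors joins in MySQL, there are naturally inner, left outer and right outer variants. The inner join in the source looks like this: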

def join[W](other: RDD[(K, W)], partitioner: Partitioner): RDD[(K, (V, W))] = self.withScope {
    this.cogroup(other, partitioner).flatMapValues( pair =>
      for (v <- pair._1.iterator; w <- pair._2.iterator) yield (v, w)
    )
  }

The source shows that it actually calls cogroup and then pairs up the values of both sides for each key.
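The same idea can be sketched with plain Scala collections: group both sides by key (the part cogroup plays), then emit every (v, w) pairing per key with the same nested for-comprehension as the source. JoinModel and its join method are hypothetical names, and the sketch ignores partitioning entirely.

```scala
object JoinModel {
  // Inner join: group both sides by key, then emit every (v, w) combination per key.
  def join[K, V, W](left: Seq[(K, V)], right: Seq[(K, W)]): Seq[(K, (V, W))] = {
    val leftByKey = left.groupBy(_._1)
    val rightByKey = right.groupBy(_._1)
    val keys = (left.map(_._1) ++ right.map(_._1)).distinct
    keys.flatMap { k =>
      val vs = leftByKey.getOrElse(k, Seq.empty[(K, V)]).map(_._2)
      val ws = rightByKey.getOrElse(k, Seq.empty[(K, W)]).map(_._2)
      // An empty side yields no pairs, which is exactly the inner-join behaviour.
      for (v <- vs; w <- ws) yield (k, (v, w))
    }
  }
}
```

Keys that exist on only one side drop out because one of the two value lists is empty.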

sortByKey

Explanation

For RDDs of (K, V) pairs, sorts by K; the ascending parameter selects ascending or descending order.

Returns

ShuffledRDD

Source code

In OrderedRDDFunctions:

def sortByKey(ascending: Boolean = true, numPartitions: Int = self.partitions.length)
      : RDD[(K, V)] = self.withScope
  {
    val part = new RangePartitioner(numPartitions, self, ascending)
    new ShuffledRDD[K, V, V](self, part)
      .setKeyOrdering(if (ascending) ordering else ordering.reverse)
  }

aggregateByKey

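Explanation

Aggregates values by key, starting from a zero value: seqOp folds each value into the accumulator, and combOp merges two accumulators.

Returns

ShuffledRDD

Source code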

def aggregateByKey[U: ClassTag](zeroValue: U, partitioner: Partitioner)(seqOp: (U, V) => U,
      combOp: (U, U) => U): RDD[(K, U)] = self.withScope {
    // Serialize the zero value to a byte array so that we can get a new clone of it on each key
    val zeroBuffer = SparkEnv.get.serializer.newInstance().serialize(zeroValue)
    val zeroArray = new Array[Byte](zeroBuffer.limit)
    zeroBuffer.get(zeroArray)

    lazy val cachedSerializer = SparkEnv.get.serializer.newInstance()
    val createZero = () => cachedSerializer.deserialize[U](ByteBuffer.wrap(zeroArray))

    // We will clean the combiner closure later in `combineByKey`
    val cleanedSeqOp = self.context.clean(seqOp)
    combineByKeyWithClassTag[U]((v: V) => cleanedSeqOp(createZero(), v),
      cleanedSeqOp, combOp, partitioner)
  }
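Here is a minimal plain-Scala sketch of what aggregateByKey computes, assuming the input is already split into partitions (modelled as a Seq of Seqs). AggregateByKeyModel is a hypothetical name. Unlike the real method, which clones the zero value per key through serialization, this sketch reuses zeroValue directly and is therefore only safe for immutable zero values.

```scala
object AggregateByKeyModel {
  // partitions: each inner Seq stands for the records of one partition.
  def aggregateByKey[K, V, U](partitions: Seq[Seq[(K, V)]], zeroValue: U)(
      seqOp: (U, V) => U,
      combOp: (U, U) => U): Map[K, U] = {
    // Within a partition, fold every value into a per-key accumulator starting at zeroValue.
    val perPartition: Seq[Map[K, U]] = partitions.map { part =>
      part.foldLeft(Map.empty[K, U]) { case (acc, (k, v)) =>
        acc.updated(k, seqOp(acc.getOrElse(k, zeroValue), v))
      }
    }
    // Across partitions, merge the per-key accumulators with combOp.
    perPartition.foldLeft(Map.empty[K, U]) { (merged, partial) =>
      partial.foldLeft(merged) { case (acc, (k, u)) =>
        acc.updated(k, acc.get(k).map(combOp(_, u)).getOrElse(u))
      }
    }
  }
}
```

A typical use is computing a (sum, count) pair per key, where the accumulator type U differs from the value type V.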

reduceByKey

Explanation

Aggregates by key, merging the values of each key; the merge function is supplied through the func parameter.

Returns

ShuffledRDD

Source code


def reduceByKey(partitioner: Partitioner, func: (V, V) => V): RDD[(K, V)] = self.withScope {
    combineByKeyWithClassTag[V]((v: V) => v, func, func, partitioner)
  }

groupByKey

Explanation

An operation on RDDs of (K, V) pairs: groups the data by key and repartitions it.

Returns

ShuffledRDD

Source code

def groupByKey(partitioner: Partitioner): RDD[(K, Iterable[V])] = self.withScope {
    // groupByKey shouldn't use map side combine because map side combine does not
    // reduce the amount of data shuffled and requires all map side data be inserted
    // into a hash table, leading to more objects in the old gen.
    val createCombiner = (v: V) => CompactBuffer(v)
    val mergeValue = (buf: CompactBuffer[V], v: V) => buf += v
    val mergeCombiners = (c1: CompactBuffer[V], c2: CompactBuffer[V]) => c1 ++= c2
    val bufs = combineByKeyWithClassTag[CompactBuffer[V]](
      createCombiner, mergeValue, mergeCombiners, partitioner, mapSideCombine = false)
    bufs.asInstanceOf[RDD[(K, Iterable[V])]]
  }

distinct

Explanation

Removes duplicate elements.

Returns

Same as the parent RDD

Source code

def distinct(numPartitions: Int)(implicit ord: Ordering[T] = null): RDD[T] = withScope {
    map(x => (x, null)).reduceByKey((x, y) => x, numPartitions).map(_._1)
  }
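The three steps in that one-liner can be spelled out with plain Scala collections. This is only a sketch of the logic; DistinctModel is a hypothetical name, and Unit is used here as an assumed stand-in for the null placeholder in the source.

```scala
object DistinctModel {
  def distinct[T](items: Seq[T]): Seq[T] = {
    // Step 1: turn every element into a (key, placeholder) pair, like map(x => (x, null)).
    val pairs: Seq[(T, Unit)] = items.map(x => (x, ()))
    // Step 2: keep a single placeholder per key, like reduceByKey((x, y) => x).
    val reduced = pairs.foldLeft(Map.empty[T, Unit]) { case (acc, (k, v)) =>
      acc.updated(k, acc.getOrElse(k, v))
    }
    // Step 3: drop the placeholder and keep only the keys, like map(_._1).
    reduced.keys.toSeq
  }
}
```

As with the real method, the order of the returned elements is not guaranteed.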

intersection

Explanation

Returns the intersection of two RDDs, with duplicates removed.

Returns

Same as the parent RDD

Source code

  def intersection(other: RDD[T]): RDD[T] = withScope {
    this.map(v => (v, null)).cogroup(other.map(v => (v, null)))
        .filter { case (_, (leftGroup, rightGroup)) => leftGroup.nonEmpty && rightGroup.nonEmpty }
        .keys
  }

  /**
   * Return the intersection of this RDD and another one. The output will not contain any duplicate
   * elements, even if the input RDDs did.
   *
   * @note This method performs a shuffle internally.
   *
   * @param partitioner Partitioner to use for the resulting RDD
   */
  def intersection(
      other: RDD[T],
      partitioner: Partitioner)(implicit ord: Ordering[T] = null): RDD[T] = withScope {
    this.map(v => (v, null)).cogroup(other.map(v => (v, null)), partitioner)
        .filter { case (_, (leftGroup, rightGroup)) => leftGroup.nonEmpty && rightGroup.nonEmpty }
        .keys
  }

  /**
   * Return the intersection of this RDD and another one. The output will not contain any duplicate
   * elements, even if the input RDDs did.  Performs a hash partition across the cluster
   *
   * @note This method performs a shuffle internally.
   *
   * @param numPartitions How many partitions to use in the resulting RDD
   */
  def intersection(other: RDD[T], numPartitions: Int): RDD[T] = withScope {
    intersection(other, new HashPartitioner(numPartitions))
  }
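The filter on leftGroup and rightGroup is the heart of all three overloads. A plain-Scala sketch of that idea, without Spark and with the hypothetical name IntersectionModel, could look like this:

```scala
object IntersectionModel {
  def intersection[T](left: Seq[T], right: Seq[T]): Set[T] = {
    // Group each side by element, playing the role of the cogroup on (v, null) pairs.
    val leftGroups = left.groupBy(identity)
    val rightGroups = right.groupBy(identity)
    // Keep only the keys whose groups are non-empty on both sides.
    (leftGroups.keySet ++ rightGroups.keySet).filter { k =>
      leftGroups.getOrElse(k, Seq.empty[T]).nonEmpty &&
      rightGroups.getOrElse(k, Seq.empty[T]).nonEmpty
    }
  }
}
```

Returning the keys only is what removes duplicates, even when both inputs contain repeated elements.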

union

Explanation

Merges RDDs without removing duplicates.

Returns

UnionRDD/PartitionerAwareUnionRDD

Source code

def union[T: ClassTag](rdds: Seq[RDD[T]]): RDD[T] = withScope {
    val partitioners = rdds.flatMap(_.partitioner).toSet
    if (rdds.forall(_.partitioner.isDefined) && partitioners.size == 1) {
      new PartitionerAwareUnionRDD(this, rdds)
    } else {
      new UnionRDD(this, rdds)
    }
  }

sample

Explanation

Takes a sample of the data, with or without replacement.

Returns

Same as the parent RDD

Source code

 def sample(
      withReplacement: Boolean,
      fraction: Double,
      seed: Long = Utils.random.nextLong): RDD[T] = {
    require(fraction >= 0,
      s"Fraction must be nonnegative, but got ${fraction}")

    withScope {
      require(fraction >= 0.0, "Negative fraction value: " + fraction)
      if (withReplacement) {
        new PartitionwiseSampledRDD[T, T](this, new PoissonSampler[T](fraction), true, seed)
      } else {
        new PartitionwiseSampledRDD[T, T](this, new BernoulliSampler[T](fraction), true, seed)
      }
    }
  }

map

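Explanation

The simplest transformation: it applies the given function to every element of the parent RDD, producing the child RDD one-to-one, so parent and child contain the same number of elements.

Returns

MapPartitionsRDD

Source code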

 def map[U: ClassTag](f: T => U): RDD[U] = withScope {
    val cleanF = sc.clean(f)
    new MapPartitionsRDD[U, T](this, (context, pid, iter) => iter.map(cleanF))
  }

mapPartitions

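Explanation

Performs the map once per partition: the function receives an iterator over all elements of a partition and returns a new iterator.

Returns

MapPartitionsRDD

Source code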

def mapPartitions[U: ClassTag](
      f: Iterator[T] => Iterator[U],
      preservesPartitioning: Boolean = false): RDD[U] = withScope {
    val cleanedF = sc.clean(f)
    new MapPartitionsRDD(
      this,
      (context: TaskContext, index: Int, iter: Iterator[T]) => cleanedF(iter),
      preservesPartitioning)
  }

mapPartitionsWithIndex

Like mapPartitions, but the function also receives the index of the partition.

Returns

MapPartitionsRDD

Source code

def mapPartitionsWithIndex[U: ClassTag](
      f: (Int, Iterator[T]) => Iterator[U],
      preservesPartitioning: Boolean = false): RDD[U] = withScope {
    val cleanedF = sc.clean(f)
    new MapPartitionsRDD(
      this,
      (context: TaskContext, index: Int, iter: Iterator[T]) => cleanedF(index, iter),
      preservesPartitioning)
  }

flatMap

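Explanation

First applies the given function to turn each element into zero or more elements, then flattens the results.

Returns

MapPartitionsRDD

Source code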

 def flatMap[U: ClassTag](f: T => TraversableOnce[U]): RDD[U] = withScope {
    val cleanF = sc.clean(f)
    new MapPartitionsRDD[U, T](this, (context, pid, iter) => iter.flatMap(cleanF))
  }

filter

Explanation

Filters the parent RDD with the given predicate; elements that satisfy it flow into the child RDD.

Returns

MapPartitionsRDD

Source code

 def filter(f: T => Boolean): RDD[T] = withScope {
    val cleanF = sc.clean(f)
    new MapPartitionsRDD[T, T](
      this,
      (context, pid, iter) => iter.filter(cleanF),
      preservesPartitioning = true)
  }

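Core function: combineByKeyWithClassTag

While explaining groupByKey, aggregateByKey, reduceByKey and the other operations on (K, V) RDDs, we saw that the source always ends up in combineByKeyWithClassTag, so it is well worth understanding this method.

Reference article: combineByKey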

def combineByKeyWithClassTag[C](
      createCombiner: V => C,
      mergeValue: (C, V) => C,
      mergeCombiners: (C, C) => C,
      partitioner: Partitioner,
      mapSideCombine: Boolean = true,
      serializer: Serializer = null)(implicit ct: ClassTag[C]): RDD[(K, C)] = self.withScope {
    require(mergeCombiners != null, "mergeCombiners must be defined") // required as of Spark 0.9.0
    if (keyClass.isArray) {
      if (mapSideCombine) {
        throw new SparkException("Cannot use map-side combining with array keys.")
      }
      if (partitioner.isInstanceOf[HashPartitioner]) {
        throw new SparkException("HashPartitioner cannot partition array keys.")
      }
    }
    val aggregator = new Aggregator[K, V, C](
      self.context.clean(createCombiner),
      self.context.clean(mergeValue),
      self.context.clean(mergeCombiners))
    if (self.partitioner == Some(partitioner)) {
      self.mapPartitions(iter => {
        val context = TaskContext.get()
        new InterruptibleIterator(context, aggregator.combineValuesByKey(iter, context))
      }, preservesPartitioning = true)
    } else {
      new ShuffledRDD[K, V, C](self, partitioner)
        .setSerializer(serializer)
        .setAggregator(aggregator)
        .setMapSideCombine(mapSideCombine)
    }
  }

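At its core there are three functions:

createCombiner: initializes the combiner from the first value.
mergeValue: folds each remaining value into that combiner, one at a time.
mergeCombiners: merges combiners when the data for a key sits in different partitions.

The method converts an RDD[(K, V)] into an RDD[(K, C)]. V is the value type of the parent RDD and K is its key type; what we do is turn the V values of each key into a C. C can be any type, including V itself.

Everything happens after the data has been grouped by key, and different keys never see each other. The explanation below describes how each per-key group is processed.

The first function, createCombiner, defines the shape of C. Its signature is V => C: it takes a V and returns a C. It is an initialization function, applied to the V of the first record seen for a key within a partition to produce a C. The second function, mergeValue, has the form (C, V) => C: it merges the initialized C with each further value of the same key, ending up with a single C. The third function, mergeCombiners, is called only when a key's data is spread over different partitions, to merge the results from all of them. Its form is (C, C) => C, meaning two combiners are merged into one.

One highly abstract function handles many different higher-level behaviours: pass in different functions and the method's effect and purpose change.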
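To see the three functions working together, here is a minimal plain-Scala sketch that models the semantics on ordinary collections, with partitions represented as a Seq of Seqs. It is not Spark's Aggregator or ShuffledRDD; CombineByKeyModel is a hypothetical name, and the sketch leaves out partitioners, serializers and map-side-combine switches.

```scala
object CombineByKeyModel {
  def combineByKey[K, V, C](partitions: Seq[Seq[(K, V)]])(
      createCombiner: V => C,
      mergeValue: (C, V) => C,
      mergeCombiners: (C, C) => C): Map[K, C] = {
    // Inside one partition: the first value seen for a key goes through createCombiner,
    // every later value for that key goes through mergeValue.
    val perPartition = partitions.map { part =>
      part.foldLeft(Map.empty[K, C]) { case (acc, (k, v)) =>
        val combined = acc.get(k) match {
          case Some(c) => mergeValue(c, v)
          case None    => createCombiner(v)
        }
        acc.updated(k, combined)
      }
    }
    // Across partitions: combiners for the same key are merged with mergeCombiners.
    perPartition.foldLeft(Map.empty[K, C]) { (merged, partial) =>
      partial.foldLeft(merged) { case (acc, (k, c)) =>
        acc.updated(k, acc.get(k).map(mergeCombiners(_, c)).getOrElse(c))
      }
    }
  }
}
```

With C = (Int, Int) as a (sum, count) pair, createCombiner would be v => (v, 1), mergeValue would add the value and increment the count, and mergeCombiners would add two pairs component-wise. Passing v => v and the same function twice gives the reduceByKey behaviour shown earlier.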

Spark Growth Path (3): Revisiting RDD Transformations
