Growing with Spark (3): More on RDD Transformations

References
The coalesce() and repartition() methods
Transformations

When I first started writing Spark code I only skimmed the Transformations; see my earlier post on RDD operations for details.
Today I'm using some spare time to walk through these RDD transformations again and deepen my understanding.

repartitionAndSortWithinPartitions

Explanation

Literally: while re-assigning records to partitions, also sort the data within each partition. The parameter is a partitioner (I'll cover partitioners in the next post). The official documentation notes that this is more efficient than calling repartition and then sorting inside each partition, because the sorting is pushed down into the shuffle machinery.

Returns

ShuffledRDD

Source

OrderedRDDFunctions.scala

def repartitionAndSortWithinPartitions(partitioner: Partitioner): RDD[(K, V)] = self.withScope {
    new ShuffledRDD[K, V, V](self, partitioner).setKeyOrdering(ordering)
  }

The logic is fairly simple: it creates a ShuffledRDD and sets a key ordering on it.
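
A quick usage sketch, assuming `sc` is an existing SparkContext (as in spark-shell); the sample data is made up for illustration:

import org.apache.spark.HashPartitioner

val pairs = sc.parallelize(Seq((3, "c"), (1, "a"), (4, "d"), (2, "b")), 2)
// Shuffle into 2 partitions and sort by key inside each partition in the same pass.
val sorted = pairs.repartitionAndSortWithinPartitions(new HashPartitioner(2))
// glom() exposes each partition as an array so the per-partition ordering is visible.
sorted.glom().collect().foreach(p => println(p.mkString(", ")))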

coalesce and repartition

Explanation

Why cover these two together? Because the source shows that repartition is just coalesce called with shuffle = true.
That makes things simple: we only need to understand coalesce. It changes the number of partitions, and the second parameter controls whether a shuffle is performed while repartitioning.

Returns

CoalescedRDD

Source

def repartition(numPartitions: Int)(implicit ord: Ordering[T] = null): RDD[T] = withScope {
    coalesce(numPartitions, shuffle = true)
  }

  def coalesce(numPartitions: Int, shuffle: Boolean = false,
               partitionCoalescer: Option[PartitionCoalescer] = Option.empty)
              (implicit ord: Ordering[T] = null)
      : RDD[T] = withScope {
    require(numPartitions > 0, s"Number of partitions ($numPartitions) must be positive.")
    if (shuffle) {
      /** Distributes elements evenly across output partitions, starting from a random partition. */
      val distributePartition = (index: Int, items: Iterator[T]) => {
        var position = (new Random(index)).nextInt(numPartitions)
        items.map { t =>
          // Note that the hash code of the key will just be the key itself. The HashPartitioner
          // will mod it with the number of total partitions.
          position = position + 1
          (position, t)
        }
      } : Iterator[(Int, T)]

      // include a shuffle step so that our upstream tasks are still distributed
      new CoalescedRDD(
        new ShuffledRDD[Int, T, T](mapPartitionsWithIndex(distributePartition),
        new HashPartitioner(numPartitions)),
        numPartitions,
        partitionCoalescer).values
    } else {
      new CoalescedRDD(this, numPartitions, partitionCoalescer)
    }
  }
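
A small sketch of the difference, assuming `sc` is an existing SparkContext:

val rdd = sc.parallelize(1 to 100, 8)
// Shrinking without a shuffle: coalesce merges existing partitions (shuffle = false by default).
val merged = rdd.coalesce(2)
// Growing (or rebalancing) needs a shuffle, which is what repartition(n) = coalesce(n, shuffle = true) does.
val spread = rdd.repartition(16)
println(s"coalesce -> ${merged.getNumPartitions}, repartition -> ${spread.getNumPartitions}") // 2, 16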

pipe

Explanation

Simply put, it runs an external command against the RDD's elements and captures the command's output as an RDD[String]. Many people use this to invoke scripts in other languages such as PHP or Python, getting cross-language interop with Scala.

Returns

PipedRDD

Source

 /**
   * Return an RDD created by piping elements to a forked external process.
   */
  def pipe(command: String): RDD[String] = withScope {
    // Similar to Runtime.exec(), if we are given a single string, split it into words
    // using a standard StringTokenizer (i.e. by spaces)
    pipe(PipedRDD.tokenize(command))
  }

  /**
   * Return an RDD created by piping elements to a forked external process.
   */
  def pipe(command: String, env: Map[String, String]): RDD[String] = withScope {
    // Similar to Runtime.exec(), if we are given a single string, split it into words
    // using a standard StringTokenizer (i.e. by spaces)
    pipe(PipedRDD.tokenize(command), env)
  }


  def pipe(
      command: Seq[String],
      env: Map[String, String] = Map(),
      printPipeContext: (String => Unit) => Unit = null,
      printRDDElement: (T, String => Unit) => Unit = null,
      separateWorkingDir: Boolean = false,
      bufferSize: Int = 8192,
      encoding: String = Codec.defaultCharsetCodec.name): RDD[String] = withScope {
    new PipedRDD(this, command, env,
      if (printPipeContext ne null) sc.clean(printPipeContext) else null,
      if (printRDDElement ne null) sc.clean(printRDDElement) else null,
      separateWorkingDir,
      bufferSize,
      encoding)
  }
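
A minimal sketch, assuming `sc` is an existing SparkContext and that the standard Unix utility `rev` is available on the worker nodes:

// Each element is written to the external process's stdin as one line,
// and each line of its stdout becomes one element of the resulting RDD[String].
val words = sc.parallelize(Seq("spark", "pipe", "rdd"))
words.pipe("rev").collect().foreach(println) // kraps, epip, ddr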

cartesian

Explanation

Computes the Cartesian product with another RDD. This scenario is fairly rare in practice, so I'll only touch on it briefly.

Returns

CartesianRDD

Source

def cartesian[U: ClassTag](other: RDD[U]): RDD[(T, U)] = withScope {
    new CartesianRDD(sc, this, other)
  }
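
A tiny sketch, assuming `sc` is an existing SparkContext; note the result has m × n elements, so this blows up quickly on large inputs:

val letters = sc.parallelize(Seq("a", "b"))
val numbers = sc.parallelize(Seq(1, 2, 3))
// 2 x 3 = 6 pairs: (a,1), (a,2), (a,3), (b,1), (b,2), (b,3)
letters.cartesian(numbers).collect().foreach(println)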

cogroup

Explanation

This also targets pair RDDs. For each key it groups the values coming from this RDD and from the other RDD(s), producing one Iterable per input RDD; cogrouping with two other RDDs therefore yields a 3-tuple of Iterables.

For example, cogrouping an RDD containing (A,1),(A,2) with another containing (A,3) gives (A, ([1,2], [3])).

Source

cogroup has nine overloads; I'll only list one of them here:


def cogroup[W1, W2](other1: RDD[(K, W1)], other2: RDD[(K, W2)], partitioner: Partitioner)
      : RDD[(K, (Iterable[V], Iterable[W1], Iterable[W2]))] = self.withScope {
    if (partitioner.isInstanceOf[HashPartitioner] && keyClass.isArray) {
      throw new SparkException("HashPartitioner cannot partition array keys.")
    }
    val cg = new CoGroupedRDD[K](Seq(self, other1, other2), partitioner)
    cg.mapValues { case Array(vs, w1s, w2s) =>
      (vs.asInstanceOf[Iterable[V]],
        w1s.asInstanceOf[Iterable[W1]],
        w2s.asInstanceOf[Iterable[W2]])
    }
  }
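
An illustrative sketch with made-up data, assuming `sc` is an existing SparkContext:

val scores  = sc.parallelize(Seq(("A", 1), ("A", 2), ("B", 5)))
val bonuses = sc.parallelize(Seq(("A", 3), ("C", 7)))
// One Iterable per input RDD for every key; a key missing on one side gets an empty Iterable.
scores.cogroup(bonuses).collect().foreach { case (k, (vs, ws)) =>
  println(s"$k -> [${vs.mkString(",")}] [${ws.mkString(",")}]")
}
// A -> [1,2] [3]    B -> [5] []    C -> [] [7]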

join

Explanation

Analogous to an inner join in MySQL.

Returns

CoGroupedRDD

Source

Since we compare it to SQL joins, join naturally comes in inner, left outer, and right outer (and full outer) flavours, so the source contains join, leftOuterJoin, rightOuterJoin, and fullOuterJoin. The inner-join overload is shown below:


def join[W](other: RDD[(K, W)], partitioner: Partitioner): RDD[(K, (V, W))] = self.withScope {
    this.cogroup(other, partitioner).flatMapValues( pair =>
      for (v <- pair._1.iterator; w <- pair._2.iterator) yield (v, w)
    )
  }

As the source shows, join is actually implemented on top of cogroup.
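
A small sketch of the variants, with made-up data and assuming `sc` is an existing SparkContext:

val users  = sc.parallelize(Seq((1, "alice"), (2, "bob"), (3, "carol")))
val orders = sc.parallelize(Seq((1, "book"), (1, "pen"), (3, "lamp")))
// Inner join: only keys present in both RDDs survive.
users.join(orders).collect().foreach(println)          // contains (1,(alice,book)), (1,(alice,pen)), (3,(carol,lamp))
// Left outer join: unmatched left-side keys appear with None on the right.
users.leftOuterJoin(orders).collect().foreach(println) // also contains (2,(bob,None))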

sortByKey

Explanation

For (K, V) RDDs, sorts the data by key; a parameter selects ascending or descending order.

Returns

ShuffledRDD

Source

In OrderedRDDFunctions:

def sortByKey(ascending: Boolean = true, numPartitions: Int = self.partitions.length)
      : RDD[(K, V)] = self.withScope
  {
    val part = new RangePartitioner(numPartitions, self, ascending)
    new ShuffledRDD[K, V, V](self, part)
      .setKeyOrdering(if (ascending) ordering else ordering.reverse)
  }
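
A quick sketch, assuming `sc` is an existing SparkContext:

val pairs = sc.parallelize(Seq((3, "c"), (1, "a"), (2, "b")))
pairs.sortByKey().collect().foreach(println)                  // ascending: (1,a) (2,b) (3,c)
pairs.sortByKey(ascending = false).collect().foreach(println) // descending: (3,c) (2,b) (1,a)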

aggregateByKey

Explanation

Aggregates values by key, starting from a zero value: seqOp folds each value into the per-key accumulator within a partition, and combOp merges accumulators coming from different partitions.

Returns

ShuffledRDD

Source


def aggregateByKey[U: ClassTag](zeroValue: U, partitioner: Partitioner)(seqOp: (U, V) => U,
      combOp: (U, U) => U): RDD[(K, U)] = self.withScope {
    // Serialize the zero value to a byte array so that we can get a new clone of it on each key
    val zeroBuffer = SparkEnv.get.serializer.newInstance().serialize(zeroValue)
    val zeroArray = new Array[Byte](zeroBuffer.limit)
    zeroBuffer.get(zeroArray)

    lazy val cachedSerializer = SparkEnv.get.serializer.newInstance()
    val createZero = () => cachedSerializer.deserialize[U](ByteBuffer.wrap(zeroArray))

    // We will clean the combiner closure later in `combineByKey`
    val cleanedSeqOp = self.context.clean(seqOp)
    combineByKeyWithClassTag[U]((v: V) => cleanedSeqOp(createZero(), v),
      cleanedSeqOp, combOp, partitioner)
  }
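
A sketch computing a per-key average, assuming `sc` is an existing SparkContext; the zero value and the two functions are the seqOp/combOp pair described above:

val sales = sc.parallelize(Seq(("a", 3.0), ("a", 5.0), ("b", 2.0)))
// zeroValue (0.0, 0): seqOp folds one value into the (sum, count) accumulator within a partition,
// combOp merges the partial accumulators coming from different partitions.
val sumAndCount = sales.aggregateByKey((0.0, 0))(
  (acc, v) => (acc._1 + v, acc._2 + 1),
  (a, b)   => (a._1 + b._1, a._2 + b._2))
sumAndCount.mapValues { case (sum, cnt) => sum / cnt }.collect().foreach(println) // (a,4.0), (b,2.0)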

reduceByKey

Explanation

Aggregates by key, merging the values for each key with the user-supplied merge function.

Returns

ShuffledRDD

Source



def reduceByKey(partitioner: Partitioner, func: (V, V) => V): RDD[(K, V)] = self.withScope {
    combineByKeyWithClassTag[V]((v: V) => v, func, func, partitioner)
  }
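
The classic word count, as a minimal sketch assuming `sc` is an existing SparkContext:

val words = sc.parallelize(Seq("a", "b", "a", "c", "b", "a"))
// Pair each word with 1, then merge the counts per key; the merge also runs map-side before the shuffle.
words.map(w => (w, 1)).reduceByKey(_ + _).collect().foreach(println) // (a,3), (b,2), (c,1)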

groupByKey

Explanation

An operation on (K, V) RDDs: groups the data by key, repartitioning it in the process.

Returns

ShuffledRDD

Source


def groupByKey(partitioner: Partitioner): RDD[(K, Iterable[V])] = self.withScope {
    // groupByKey shouldn't use map side combine because map side combine does not
    // reduce the amount of data shuffled and requires all map side data be inserted
    // into a hash table, leading to more objects in the old gen.
    val createCombiner = (v: V) => CompactBuffer(v)
    val mergeValue = (buf: CompactBuffer[V], v: V) => buf += v
    val mergeCombiners = (c1: CompactBuffer[V], c2: CompactBuffer[V]) => c1 ++= c2
    val bufs = combineByKeyWithClassTag[CompactBuffer[V]](
      createCombiner, mergeValue, mergeCombiners, partitioner, mapSideCombine = false)
    bufs.asInstanceOf[RDD[(K, Iterable[V])]]
  }

distinct

Explanation

Removes duplicate elements.

Returns

An RDD of the same element type as the parent

Source

def distinct(numPartitions: Int)(implicit ord: Ordering[T] = null): RDD[T] = withScope {
    map(x => (x, null)).reduceByKey((x, y) => x, numPartitions).map(_._1)
  }

intersection

Explanation

Returns the intersection of two RDDs, with duplicates removed.

Returns

An RDD of the same element type as the parent

Source

  def intersection(other: RDD[T]): RDD[T] = withScope {
    this.map(v => (v, null)).cogroup(other.map(v => (v, null)))
        .filter { case (_, (leftGroup, rightGroup)) => leftGroup.nonEmpty && rightGroup.nonEmpty }
        .keys
  }

  /**
   * Return the intersection of this RDD and another one. The output will not contain any duplicate
   * elements, even if the input RDDs did.
   *
   * @note This method performs a shuffle internally.
   *
   * @param partitioner Partitioner to use for the resulting RDD
   */
  def intersection(
      other: RDD[T],
      partitioner: Partitioner)(implicit ord: Ordering[T] = null): RDD[T] = withScope {
    this.map(v => (v, null)).cogroup(other.map(v => (v, null)), partitioner)
        .filter { case (_, (leftGroup, rightGroup)) => leftGroup.nonEmpty && rightGroup.nonEmpty }
        .keys
  }

  /**
   * Return the intersection of this RDD and another one. The output will not contain any duplicate
   * elements, even if the input RDDs did.  Performs a hash partition across the cluster
   *
   * @note This method performs a shuffle internally.
   *
   * @param numPartitions How many partitions to use in the resulting RDD
   */
  def intersection(other: RDD[T], numPartitions: Int): RDD[T] = withScope {
    intersection(other, new HashPartitioner(numPartitions))
  }

union

Explanation

Concatenates two RDDs without removing duplicates.

Returns

UnionRDD / PartitionerAwareUnionRDD

Source

def union[T: ClassTag](rdds: Seq[RDD[T]]): RDD[T] = withScope {
    val partitioners = rdds.flatMap(_.partitioner).toSet
    if (rdds.forall(_.partitioner.isDefined) && partitioners.size == 1) {
      new PartitionerAwareUnionRDD(this, rdds)
    } else {
      new UnionRDD(this, rdds)
    }
  }
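
A short sketch, assuming `sc` is an existing SparkContext; rdd.union(other) ends up in the method shown above:

val a = sc.parallelize(Seq(1, 2, 3))
val b = sc.parallelize(Seq(3, 4, 5))
// Duplicates are kept; append .distinct() if set semantics are needed.
println(a.union(b).collect().sorted.mkString(", ")) // 1, 2, 3, 3, 4, 5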

sample

Explanation

Takes a random sample of the RDD, with or without replacement.

Returns

PartitionwiseSampledRDD

Source

 def sample(
      withReplacement: Boolean,
      fraction: Double,
      seed: Long = Utils.random.nextLong): RDD[T] = {
    require(fraction >= 0,
      s"Fraction must be nonnegative, but got ${fraction}")

    withScope {
      require(fraction >= 0.0, "Negative fraction value: " + fraction)
      if (withReplacement) {
        new PartitionwiseSampledRDD[T, T](this, new PoissonSampler[T](fraction), true, seed)
      } else {
        new PartitionwiseSampledRDD[T, T](this, new BernoulliSampler[T](fraction), true, seed)
      }
    }
  }
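
A short sketch, assuming `sc` is an existing SparkContext. Note that fraction is a probability, not an exact size, so the sampled count is only approximate:

val rdd = sc.parallelize(1 to 1000)
// Bernoulli sampling without replacement: each element is kept with probability 0.1.
val sampled = rdd.sample(withReplacement = false, fraction = 0.1, seed = 42L)
println(sampled.count()) // roughly 100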

map

Explanation

The simplest transformation: applies the supplied function to every element of the parent RDD, one to one, so the child RDD has exactly as many elements as the parent.

Returns

MapPartitionsRDD

Source

 def map[U: ClassTag](f: T => U): RDD[U] = withScope {
    val cleanF = sc.clean(f)
    new MapPartitionsRDD[U, T](this, (context, pid, iter) => iter.map(cleanF))
  }

mapPartitions

Explanation

Applies a map-style function per partition: the function receives an Iterator over a whole partition's elements and returns a new Iterator.

Returns

MapPartitionsRDD

Source

def mapPartitions[U: ClassTag](
      f: Iterator[T] => Iterator[U],
      preservesPartitioning: Boolean = false): RDD[U] = withScope {
    val cleanedF = sc.clean(f)
    new MapPartitionsRDD(
      this,
      (context: TaskContext, index: Int, iter: Iterator[T]) => cleanedF(iter),
      preservesPartitioning)
  }

mapPartitionsWithIndex

Explanation

Like mapPartitions, except the function also receives the partition index.

Returns

MapPartitionsRDD

Source

def mapPartitionsWithIndex[U: ClassTag](
      f: (Int, Iterator[T]) => Iterator[U],
      preservesPartitioning: Boolean = false): RDD[U] = withScope {
    val cleanedF = sc.clean(f)
    new MapPartitionsRDD(
      this,
      (context: TaskContext, index: Int, iter: Iterator[T]) => cleanedF(index, iter),
      preservesPartitioning)
  }

flatMap

Explanation

First turns each element into zero or more elements using the supplied function, then flattens the results into a single RDD.

Returns

MapPartitionsRDD

Source

 def flatMap[U: ClassTag](f: T => TraversableOnce[U]): RDD[U] = withScope {
    val cleanF = sc.clean(f)
    new MapPartitionsRDD[U, T](this, (context, pid, iter) => iter.flatMap(cleanF))
  }

filter

Explanation

A filtering operation: the parent RDD is filtered with the given predicate, and only elements that satisfy it flow into the child RDD.

Returns

MapPartitionsRDD

Source

 def filter(f: T => Boolean): RDD[T] = withScope {
    val cleanF = sc.clean(f)
    new MapPartitionsRDD[T, T](
      this,
      (context, pid, iter) => iter.filter(cleanF),
      preservesPartitioning = true)
  }
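
map, flatMap and filter compose naturally; a tiny combined sketch, assuming `sc` is an existing SparkContext:

val lines = sc.parallelize(Seq("hello spark", "hello rdd"))
lines.flatMap(_.split(" "))      // one line becomes several words
     .map(_.toUpperCase)         // one-to-one transformation
     .filter(_ != "HELLO")       // keep only the elements that pass the predicate
     .collect().foreach(println) // SPARK, RDD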

The core function: combineByKeyWithClassTag


When explaining groupByKey, aggregateByKey, reduceByKey and the other (K, V) operations, the source always ends up calling combineByKeyWithClassTag, so it's well worth understanding this method.

Reference: combineByKey

def combineByKeyWithClassTag[C](
      createCombiner: V => C,
      mergeValue: (C, V) => C,
      mergeCombiners: (C, C) => C,
      partitioner: Partitioner,
      mapSideCombine: Boolean = true,
      serializer: Serializer = null)(implicit ct: ClassTag[C]): RDD[(K, C)] = self.withScope {
    require(mergeCombiners != null, "mergeCombiners must be defined") // required as of Spark 0.9.0
    if (keyClass.isArray) {
      if (mapSideCombine) {
        throw new SparkException("Cannot use map-side combining with array keys.")
      }
      if (partitioner.isInstanceOf[HashPartitioner]) {
        throw new SparkException("HashPartitioner cannot partition array keys.")
      }
    }
    val aggregator = new Aggregator[K, V, C](
      self.context.clean(createCombiner),
      self.context.clean(mergeValue),
      self.context.clean(mergeCombiners))
    if (self.partitioner == Some(partitioner)) {
      self.mapPartitions(iter => {
        val context = TaskContext.get()
        new InterruptibleIterator(context, aggregator.combineValuesByKey(iter, context))
      }, preservesPartitioning = true)
    } else {
      new ShuffledRDD[K, V, C](self, partitioner)
        .setSerializer(serializer)
        .setAggregator(aggregator)
        .setMapSideCombine(mapSideCombine)
    }
  }

The core is three functions:

  • createCombiner: initializes the combiner from the first value seen for a key.
  • mergeValue: folds each subsequent value for that key into the combiner, iteratively.
  • mergeCombiners: merges the combiners for the same key when its data sits in different partitions.

This function turns an RDD[(K, V)] into an RDD[(K, C)]: V is the parent RDD's value type and K its key type, and what we describe is how, per key, the Vs are turned into a C. C can be any type (it may even be the same type as V or K).

Everything happens per key: values with different keys never interact. The explanation below describes what happens within the group of values that share one key.

The first function, createCombiner, defines what a C looks like. Its shape is V => C: it takes a V and returns a C. It is an initialization function: the V of the first record encountered for a key in a partition is passed through it to produce a C. The second function, mergeValue, has the shape (C, V) => C: it folds each remaining value for the key into the C obtained so far, eventually yielding a single C. The third function, mergeCombiners, is only invoked when a key's data is spread across different partitions; its shape is (C, C) => C, merging two combiners into one.

It is a highly abstract function: it covers many different higher-level use cases, and plugging in different functions yields different behavior and functionality.
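
As a worked illustration of the three functions, here is a per-key average computed with the public combineByKey wrapper (which simply delegates to combineByKeyWithClassTag); `sc` is assumed to be an existing SparkContext and the data is made up:

val scores = sc.parallelize(Seq(("a", 90), ("a", 70), ("b", 80)))
val avg = scores.combineByKey(
  (v: Int) => (v, 1),                                                 // createCombiner: first value for a key becomes (sum, count)
  (c: (Int, Int), v: Int) => (c._1 + v, c._2 + 1),                    // mergeValue: fold another value into the combiner
  (c1: (Int, Int), c2: (Int, Int)) => (c1._1 + c2._1, c1._2 + c2._2)  // mergeCombiners: merge combiners from different partitions
).mapValues { case (sum, cnt) => sum.toDouble / cnt }
avg.collect().foreach(println) // (a,80.0), (b,80.0)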
