A classic interview question: what is the difference between groupByKey, reduceByKey, and aggregateByKey?

 

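       Let's answer this from the source code, since any other explanation feels thin by comparison. The implementation is only part of it: I also think we should build the habit of reading the comments in the source. The comments in the Spark source are written with real care; they save you many detours and help you quickly understand what the code does. Let's go.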

groupByKey:

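       This operator tends to feel like something that is not much use yet a pity to throw away. We often avoid it, and in many scenarios using it is treated as something to optimize away, for example in common cases such as computing a sum or an average. Still, I believe every operator has scenarios that suit it, and the key to tuning is finding the operator that best fits the scenario.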

【PairRDDFunctions】

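   This operator is used on RDDs of key-value pairs.

   Key points from the source comments:

  1. This operation is very expensive. For a sum or an average, the comments recommend aggregateByKey or reduceByKey instead.
  2. All the values for any single key must fit in memory, so memory size matters. If one key has too many values (data skew, for example), the job can fail with an OutOfMemoryError.
  3. It passes mapSideCombine = false (the default in combineByKeyWithClassTag is true), so no map-side combine is performed.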
/**
   * Group the values for each key in the RDD into a single sequence. Allows controlling the
   * partitioning of the resulting key-value pair RDD by passing a Partitioner.
   * The ordering of elements within each group is not guaranteed, and may even differ
   * each time the resulting RDD is evaluated.
   *
   * Note: This operation may be very expensive. If you are grouping in order to perform an
   * aggregation (such as a sum or average) over each key, using [[PairRDDFunctions.aggregateByKey]]
   * or [[PairRDDFunctions.reduceByKey]] will provide much better performance.
   *
   * Note: As currently implemented, groupByKey must be able to hold all the key-value pairs for any
   * key in memory. If a key has too many values, it can result in an [[OutOfMemoryError]].
   */
  def groupByKey(partitioner: Partitioner): RDD[(K, Iterable[V])] = self.withScope {
    // groupByKey shouldn't use map side combine because map side combine does not
    // reduce the amount of data shuffled and requires all map side data be inserted
    // into a hash table, leading to more objects in the old gen.
    val createCombiner = (v: V) => CompactBuffer(v)
    val mergeValue = (buf: CompactBuffer[V], v: V) => buf += v
    val mergeCombiners = (c1: CompactBuffer[V], c2: CompactBuffer[V]) => c1 ++= c2
    val bufs = combineByKeyWithClassTag[CompactBuffer[V]](
      createCombiner, mergeValue, mergeCombiners, partitioner, mapSideCombine = false)
    bufs.asInstanceOf[RDD[(K, Iterable[V])]]
  }
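
To make the note about sum and average concrete, here is a minimal sketch that computes the same per-key sum both ways. The sample data, the object name, and the local[2] master are assumptions made for illustration; they do not come from the Spark source.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object SumByKeyExample {
  // Hypothetical sample data, not taken from the article.
  val sales: Seq[(String, Int)] =
    Seq(("a", 1), ("b", 2), ("a", 3), ("b", 4), ("a", 5))

  // groupByKey: every (key, value) record is shuffled, then summed afterwards.
  def sumWithGroupByKey(rdd: RDD[(String, Int)]): Map[String, Int] =
    rdd.groupByKey().mapValues(_.sum).collect().toMap

  // reduceByKey: values are pre-summed inside each map-side partition,
  // so at most one record per key per partition is shuffled.
  def sumWithReduceByKey(rdd: RDD[(String, Int)]): Map[String, Int] =
    rdd.reduceByKey(_ + _).collect().toMap

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("sum-by-key")
    val sc = new SparkContext(conf)
    try {
      val rdd = sc.parallelize(sales, numSlices = 2)
      println(sumWithGroupByKey(rdd))
      println(sumWithReduceByKey(rdd))
    } finally sc.stop()
  }
}
```

Both functions return a = 9 and b = 6; they differ only in how much data crosses the shuffle.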

It returns the result produced by def combineByKeyWithClassTag:

def combineByKeyWithClassTag[C](
      createCombiner: V => C,
      mergeValue: (C, V) => C,
      mergeCombiners: (C, C) => C,
      partitioner: Partitioner,
      mapSideCombine: Boolean = true,
      serializer: Serializer = null)(implicit ct: ClassTag[C]): RDD[(K, C)] = self.withScope {
    require(mergeCombiners != null, "mergeCombiners must be defined") // required as of Spark 0.9.0
    if (keyClass.isArray) {
      if (mapSideCombine) {
        throw new SparkException("Cannot use map-side combining with array keys.")
      }
      if (partitioner.isInstanceOf[HashPartitioner]) {
        throw new SparkException("HashPartitioner cannot partition array keys.")
      }
    }
    val aggregator = new Aggregator[K, V, C](
      self.context.clean(createCombiner),
      self.context.clean(mergeValue),
      self.context.clean(mergeCombiners))
    if (self.partitioner == Some(partitioner)) {
      self.mapPartitions(iter => {
        val context = TaskContext.get()
        new InterruptibleIterator(context, aggregator.combineValuesByKey(iter, context))
      }, preservesPartitioning = true)
    } else {
      new ShuffledRDD[K, V, C](self, partitioner)
        .setSerializer(serializer)
        .setAggregator(aggregator)
        .setMapSideCombine(mapSideCombine)
    }
  }
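
The roles of createCombiner, mergeValue, and mergeCombiners can be sketched without Spark at all. The following is a simplified model on plain Scala collections, not Spark's actual Aggregator or ShuffledRDD code; the object, the helper names, and the sample partitions are all hypothetical. It also shows why the comment inside groupByKey says map-side combine would not reduce the amount of shuffled data.

```scala
object CombineByKeyModel {
  // Fold one partition's records into per-key combiners (what map-side combine does).
  def combinePartition[K, V, C](
      records: Seq[(K, V)],
      createCombiner: V => C,
      mergeValue: (C, V) => C): Map[K, C] =
    records.foldLeft(Map.empty[K, C]) { case (acc, (k, v)) =>
      val combined = acc.get(k) match {
        case Some(c) => mergeValue(c, v)   // key already seen in this partition
        case None    => createCombiner(v)  // first value for this key
      }
      acc.updated(k, combined)
    }

  // Merge the per-partition combiners of each key (what happens after the shuffle).
  def mergePartitions[K, C](
      partials: Seq[Map[K, C]],
      mergeCombiners: (C, C) => C): Map[K, C] =
    partials.flatMap(_.toSeq).foldLeft(Map.empty[K, C]) { case (acc, (k, c)) =>
      acc.updated(k, acc.get(k).map(mergeCombiners(_, c)).getOrElse(c))
    }

  // Hypothetical input: two partitions of (key, value) records.
  val partitions: Seq[Seq[(String, Int)]] =
    Seq(Seq(("a", 1), ("b", 2), ("a", 3)), Seq(("a", 5), ("b", 4)))

  // reduceByKey-style combiners: identity, then the same function twice.
  val sumPartials: Seq[Map[String, Int]] =
    partitions.map(p => combinePartition[String, Int, Int](p, v => v, _ + _))
  val sums: Map[String, Int] = mergePartitions[String, Int](sumPartials, _ + _)

  // groupByKey-style combiners: buffers that only collect values.
  val groupPartials: Seq[Map[String, Vector[Int]]] =
    partitions.map(p =>
      combinePartition[String, Int, Vector[Int]](p, v => Vector(v), _ :+ _))
  val groups: Map[String, Vector[Int]] =
    mergePartitions[String, Vector[Int]](groupPartials, _ ++ _)

  // Individual values that would cross the shuffle after combining per partition.
  val shuffledForSum: Int = sumPartials.map(_.size).sum
  val shuffledForGroup: Int = groupPartials.map(_.values.map(_.size).sum).sum

  def main(args: Array[String]): Unit = {
    println(s"sums = $sums, values shuffled = $shuffledForSum")
    println(s"groups = $groups, values shuffled = $shuffledForGroup")
  }
}
```

In this model the summing combiners shrink 5 input values to 4 before the shuffle, while the buffer combiners still carry all 5, which is why groupByKey sets mapSideCombine = false.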

aggregateByKey:

 def aggregateByKey[U: ClassTag](zeroValue: U)(seqOp: (U, V) => U,
      combOp: (U, U) => U): RDD[(K, U)] = self.withScope {
    aggregateByKey(zeroValue, defaultPartitioner(self))(seqOp, combOp)
  }

reduceByKey:

Key points from the source comments:

  1. The values are first merged locally on each mapper, and only the merged results are sent to the reducer.
 /**
   * Merge the values for each key using an associative and commutative reduce function. This will
   * also perform the merging locally on each mapper before sending results to a reducer, similarly
   * to a "combiner" in MapReduce.
   */
  def reduceByKey(partitioner: Partitioner, func: (V, V) => V): RDD[(K, V)] = self.withScope {
    combineByKeyWithClassTag[V]((v: V) => v, func, func, partitioner)
  }

As you can see, all three ultimately call:

def combineByKeyWithClassTag

Compared with reduceByKey, aggregateByKey looks quite a bit more complex, and in practice it really is.

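Let's start with the simpler one, reduceByKey.

From the source we can see the call:

combineByKeyWithClassTag[V]((v: V) => v, func, func, partitioner)

Its mergeValue and mergeCombiners arguments are the same function: both are func.

Its createCombiner, (v: V) => v, does no processing: it returns the first value unchanged, so the result type has to be the value type V.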
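
As a rough check of this reading, the sketch below calls the public combineByKey with the same three arguments (identity, func, func) and compares the result with reduceByKey. The object name, the sample pairs, and the local[2] master are assumptions made for illustration.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object ReduceByKeyAsCombineByKey {
  // Hypothetical sample data.
  val pairs: Seq[(String, Int)] = Seq(("x", 1), ("y", 10), ("x", 2), ("y", 20))

  def viaReduceByKey(rdd: RDD[(String, Int)]): Map[String, Int] =
    rdd.reduceByKey(_ + _).collect().toMap

  // The same three arguments reduceByKey passes on: identity, func, func.
  def viaCombineByKey(rdd: RDD[(String, Int)]): Map[String, Int] = {
    val func = (a: Int, b: Int) => a + b
    rdd.combineByKey[Int]((v: Int) => v, func, func).collect().toMap
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("reduce-as-combine")
    val sc = new SparkContext(conf)
    try {
      val rdd = sc.parallelize(pairs, numSlices = 2)
      println(viaReduceByKey(rdd) == viaCombineByKey(rdd))
    } finally sc.stop()
  }
}
```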

Now look at aggregateByKey:

 def aggregateByKey[U: ClassTag](zeroValue: U, partitioner: Partitioner)(seqOp: (U, V) => U,
      combOp: (U, U) => U): RDD[(K, U)]

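Here the two functions are flexible: you define them yourself, and the result type U may differ from the value type V. seqOp folds the values of a key into the accumulator within each partition, and combOp merges the per-partition results into the global result. As an exercise, think about how a classic word count written with this function differs from the reduceByKey version.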
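
One possible way to write that comparison is sketched below, together with a per-key average where the accumulator type (a sum and a count) differs from the value type. The object name, function names, and sample inputs are hypothetical, and this is only one of several reasonable formulations.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object AggregateByKeyExample {
  // Word count with reduceByKey: one function, result type equals value type.
  def wordCountReduce(words: RDD[String]): Map[String, Int] =
    words.map(w => (w, 1)).reduceByKey(_ + _).collect().toMap

  // Word count with aggregateByKey: zero value, then seqOp and combOp.
  // Here both happen to be addition, so it behaves like the version above.
  def wordCountAggregate(words: RDD[String]): Map[String, Int] =
    words.map(w => (w, 1)).aggregateByKey(0)(_ + _, _ + _).collect().toMap

  // Average per key: the accumulator is (sum, count), not a Double.
  def averageByKey(scores: RDD[(String, Double)]): Map[String, Double] =
    scores
      .aggregateByKey((0.0, 0L))(
        (acc, v) => (acc._1 + v, acc._2 + 1),   // seqOp: within a partition
        (a, b) => (a._1 + b._1, a._2 + b._2))   // combOp: across partitions
      .mapValues { case (sum, n) => sum / n }
      .collect()
      .toMap

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("aggregate-by-key")
    val sc = new SparkContext(conf)
    try {
      val words = sc.parallelize(Seq("spark", "rdd", "spark"), numSlices = 2)
      println(wordCountReduce(words))
      println(wordCountAggregate(words))
      val scores = sc.parallelize(Seq(("a", 1.0), ("a", 3.0), ("b", 4.0)), numSlices = 2)
      println(averageByKey(scores))
    } finally sc.stop()
  }
}
```

For word count the two operators give the same result; aggregateByKey starts to pay off when, as in the average, the per-key result needs a different type than the input values.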

That's all for today. Later I will add a worked example to illustrate how aggregateByKey is used.

 

 

 

 
