Classic interview question: how do groupByKey, reduceByKey, and aggregateByKey differ?

Original post

2020-05-03 22:06

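Let's look at this from the source-code level; any other explanation feels thin by comparison. The implementation is one part of it, but I also think we should get into the habit of reading the comments in the source. In my opinion the comments in the Spark source are remarkably conscientious: they save you a lot of detours and help you quickly understand what the code does. Let's go.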

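groupByKey:

This operator always feels like something that is bland to use yet a pity to throw away. We often avoid it, and in many scenarios using it gets flagged as something to optimize, for example common cases such as computing a sum or an average. Still, I believe every operator has scenarios it suits, and the key to tuning is picking the operator that fits the scenario.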

【PairRDDFunctions】

This operator is defined in PairRDDFunctions, so it is used on RDDs of key-value pairs.

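Key points from the source comments:

This operation is very expensive. If you are grouping in order to compute a sum or an average, use aggregateByKey or reduceByKey instead.
All the key-value pairs for any single key must fit in memory, so memory size matters. If one key has too many values (for example under data skew), an OutOfMemoryError can result.
groupByKey passes mapSideCombine = false (the parameter defaults to true in combineByKeyWithClassTag).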

/**
   * Group the values for each key in the RDD into a single sequence. Allows controlling the
   * partitioning of the resulting key-value pair RDD by passing a Partitioner.
   * The ordering of elements within each group is not guaranteed, and may even differ
   * each time the resulting RDD is evaluated.
   *
   * Note: This operation may be very expensive. If you are grouping in order to perform an
   * aggregation (such as a sum or average) over each key, using [[PairRDDFunctions.aggregateByKey]]
   * or [[PairRDDFunctions.reduceByKey]] will provide much better performance.
   *
   * Note: As currently implemented, groupByKey must be able to hold all the key-value pairs for any
   * key in memory. If a key has too many values, it can result in an [[OutOfMemoryError]].
   */
  def groupByKey(partitioner: Partitioner): RDD[(K, Iterable[V])] = self.withScope {
    // groupByKey shouldn't use map side combine because map side combine does not
    // reduce the amount of data shuffled and requires all map side data be inserted
    // into a hash table, leading to more objects in the old gen.
    val createCombiner = (v: V) => CompactBuffer(v)
    val mergeValue = (buf: CompactBuffer[V], v: V) => buf += v
    val mergeCombiners = (c1: CompactBuffer[V], c2: CompactBuffer[V]) => c1 ++= c2
    val bufs = combineByKeyWithClassTag[CompactBuffer[V]](
      createCombiner, mergeValue, mergeCombiners, partitioner, mapSideCombine = false)
    bufs.asInstanceOf[RDD[(K, Iterable[V])]]
  }
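To make the cost difference concrete, here is a minimal sketch on plain Scala collections rather than on Spark. The object and method names (GroupVsReduceModel, sumViaGroup, sumViaReduce) are hypothetical and only model the idea: the groupByKey style buffers every value for a key before summing, while the reduceByKey style keeps a single running value per key.

```scala
import scala.collection.mutable

// Local model only; this is not Spark's implementation.
object GroupVsReduceModel {
  // groupByKey style: buffer every value per key, then sum each buffer.
  // Memory per key grows with the number of values for that key.
  def sumViaGroup(records: Seq[(String, Int)]): Map[String, Int] = {
    val buffers = mutable.Map.empty[String, mutable.ArrayBuffer[Int]]
    for ((k, v) <- records) {
      buffers.get(k) match {
        case Some(buf) => buf += v                            // plays the role of mergeValue
        case None      => buffers(k) = mutable.ArrayBuffer(v) // plays the role of createCombiner
      }
    }
    buffers.map { case (k, buf) => k -> buf.sum }.toMap
  }

  // reduceByKey style: keep exactly one running total per key.
  def sumViaReduce(records: Seq[(String, Int)]): Map[String, Int] =
    records.foldLeft(Map.empty[String, Int]) { case (acc, (k, v)) =>
      acc.updated(k, acc.get(k).map(_ + v).getOrElse(v))
    }
}
```

Both methods return the same totals; the difference is how much state is held per key while getting there.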

Its result is produced by calling combineByKeyWithClassTag:

def combineByKeyWithClassTag[C](
      createCombiner: V => C,
      mergeValue: (C, V) => C,
      mergeCombiners: (C, C) => C,
      partitioner: Partitioner,
      mapSideCombine: Boolean = true,
      serializer: Serializer = null)(implicit ct: ClassTag[C]): RDD[(K, C)] = self.withScope {
    require(mergeCombiners != null, "mergeCombiners must be defined") // required as of Spark 0.9.0
    if (keyClass.isArray) {
      if (mapSideCombine) {
        throw new SparkException("Cannot use map-side combining with array keys.")
      }
      if (partitioner.isInstanceOf[HashPartitioner]) {
        throw new SparkException("HashPartitioner cannot partition array keys.")
      }
    }
    val aggregator = new Aggregator[K, V, C](
      self.context.clean(createCombiner),
      self.context.clean(mergeValue),
      self.context.clean(mergeCombiners))
    if (self.partitioner == Some(partitioner)) {
      self.mapPartitions(iter => {
        val context = TaskContext.get()
        new InterruptibleIterator(context, aggregator.combineValuesByKey(iter, context))
      }, preservesPartitioning = true)
    } else {
      new ShuffledRDD[K, V, C](self, partitioner)
        .setSerializer(serializer)
        .setAggregator(aggregator)
        .setMapSideCombine(mapSideCombine)
    }
  }
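The effect of the mapSideCombine flag can be illustrated with a small local model. This is a sketch under stated assumptions, not Spark code: MapSideCombineModel, shuffleOutput, and shuffledRecordCount are hypothetical names, partitions are plain Seqs, and "shuffled records" simply means the pairs a partition would hand over to the shuffle.

```scala
// Local model only; partitions are plain Seqs and no real shuffle happens.
object MapSideCombineModel {
  type Partition = Seq[(String, Int)]

  // Records one partition would write to the shuffle.
  // With map-side combine, values sharing a key are merged first,
  // so at most one record per distinct key leaves the partition.
  def shuffleOutput(
      part: Partition,
      mapSideCombine: Boolean,
      merge: (Int, Int) => Int): Seq[(String, Int)] =
    if (mapSideCombine)
      part.groupBy(_._1).map { case (k, kvs) => k -> kvs.map(_._2).reduce(merge) }.toSeq
    else
      part

  // Total number of records leaving all partitions, using addition as the merge function.
  def shuffledRecordCount(parts: Seq[Partition], mapSideCombine: Boolean): Int =
    parts.map(p => shuffleOutput(p, mapSideCombine, _ + _).size).sum
}
```

With a reducing function such as addition, combining on the map side shrinks what is shuffled. For groupByKey the "combined" value is a buffer holding every original value, so combining first would not shrink anything, which is the reason given in the source comment for passing mapSideCombine = false.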

aggregateByKey:

 def aggregateByKey[U: ClassTag](zeroValue: U)(seqOp: (U, V) => U,
      combOp: (U, U) => U): RDD[(K, U)] = self.withScope {
    aggregateByKey(zeroValue, defaultPartitioner(self))(seqOp, combOp)
  }

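reduceByKey:

Key points from the source comments:

The values are first aggregated locally on each mapper, and the aggregated results are then sent to the reducer.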

 /**
   * Merge the values for each key using an associative and commutative reduce function. This will
   * also perform the merging locally on each mapper before sending results to a reducer, similarly
   * to a "combiner" in MapReduce.
   */
  def reduceByKey(partitioner: Partitioner, func: (V, V) => V): RDD[(K, V)] = self.withScope {
    combineByKeyWithClassTag[V]((v: V) => v, func, func, partitioner)
  }

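As you can see, all three ultimately call the following method (aggregateByKey gets there through its overload that takes a Partitioner):

def combineByKeyWithClassTag

Compared with reduceByKey, aggregateByKey looks considerably more complex, and in practice that is indeed the case.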

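Let's analyze the simpler reduceByKey first:

From the source we can see:

combineByKeyWithClassTag[V]((v: V) => v, func, func, partitioner)

Its mergeValue and mergeCombiners arguments are the same function: both are func.

Its createCombiner, (v: V) => v, does no aggregation; it simply returns the first value seen for a key unchanged.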
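The relationship between the three functions can be sketched with a local model of combineByKey. This is an illustration on plain Scala collections, not Spark's implementation: CombineByKeyModel and both method signatures are hypothetical, and each inner Seq stands in for one partition.

```scala
// Local model only; each inner Seq stands in for one partition.
object CombineByKeyModel {
  // createCombiner and mergeValue run inside each partition;
  // mergeCombiners merges the per-partition results for the same key.
  def combineByKey[K, V, C](parts: Seq[Seq[(K, V)]])(
      createCombiner: V => C,
      mergeValue: (C, V) => C,
      mergeCombiners: (C, C) => C): Map[K, C] = {
    val perPartition: Seq[Map[K, C]] = parts.map { part =>
      part.foldLeft(Map.empty[K, C]) { case (acc, (k, v)) =>
        acc.updated(k, acc.get(k).map(c => mergeValue(c, v)).getOrElse(createCombiner(v)))
      }
    }
    perPartition.flatMap(_.toSeq).foldLeft(Map.empty[K, C]) { case (acc, (k, c)) =>
      acc.updated(k, acc.get(k).map(prev => mergeCombiners(prev, c)).getOrElse(c))
    }
  }

  // reduceByKey in this model: identity createCombiner, and func used for both merge steps.
  def reduceByKey[K, V](parts: Seq[Seq[(K, V)]])(func: (V, V) => V): Map[K, V] =
    combineByKey[K, V, V](parts)((v: V) => v, func, func)
}
```

Because func is applied both inside a partition and across partitions, in whatever order the values happen to arrive, it has to be associative and commutative, as the source comment requires.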

Now look at aggregateByKey:

 def aggregateByKey[U: ClassTag](zeroValue: U, partitioner: Partitioner)(seqOp: (U, V) => U,
      combOp: (U, U) => U): RDD[(K, U)]

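Here the two functions are more flexible, because you define each one yourself. seqOp aggregates the values inside each partition, starting from zeroValue, while combOp is the global step that merges the per-partition results. The result type U may also differ from the value type V. As an exercise, consider how implementing the classic word count with this function differs from the reduceByKey version.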
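As a rough illustration of seqOp and combOp, the sketch below models aggregateByKey on plain Scala collections. It is not Spark code: AggregateByKeyModel and its methods are hypothetical names, each inner Seq stands in for one partition, and zeroValue is assumed to be immutable.

```scala
// Local model only; each inner Seq stands in for one partition.
object AggregateByKeyModel {
  def aggregateByKey[K, V, U](parts: Seq[Seq[(K, V)]])(zeroValue: U)(
      seqOp: (U, V) => U,
      combOp: (U, U) => U): Map[K, U] = {
    // Inside each partition: fold the values of a key into U, starting from zeroValue.
    val perPartition = parts.map { part =>
      part.foldLeft(Map.empty[K, U]) { case (acc, (k, v)) =>
        acc.updated(k, seqOp(acc.getOrElse(k, zeroValue), v))
      }
    }
    // Across partitions: merge the partial results of the same key with combOp.
    perPartition.flatMap(_.toSeq).foldLeft(Map.empty[K, U]) { case (acc, (k, u)) =>
      acc.updated(k, acc.get(k).map(prev => combOp(prev, u)).getOrElse(u))
    }
  }

  // Word count: U and V are both Int, so this is what reduceByKey(_ + _) would also give.
  def wordCount(parts: Seq[Seq[String]]): Map[String, Int] =
    aggregateByKey[String, Int, Int](parts.map(_.map(w => w -> 1)))(0)(_ + _, _ + _)

  // Per-key average: U = (sum, count) differs from V = Int.
  def averageByKey(parts: Seq[Seq[(String, Int)]]): Map[String, Double] =
    aggregateByKey[String, Int, (Int, Int)](parts)((0, 0))(
      (acc, v) => (acc._1 + v, acc._2 + 1),
      (a, b) => (a._1 + b._1, a._2 + b._2)
    ).map { case (k, (sum, count)) => k -> sum.toDouble / count }
}
```

For word count the two operators are interchangeable. The average case is where aggregateByKey earns its extra parameters, because the accumulator type (sum, count) is not the value type, which a (V, V) => V function cannot express directly.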

That's it for today. I will add an example later to show how aggregateByKey is used.
