Big Data Spark Interview: How Does distinct Deduplication Work Under the Hood?

Original

呆若喵喵

2020-07-06 09:10

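Recently a friend asked me how distinct actually implements deduplication.

In an interview, the interviewer had asked him: "Do you know the distinct operator?"

"Sure. It is a transformation operator on Spark RDDs, and it is mainly used to remove duplicates."

"Oh, so you use distinct a lot and know it well?"

"You could say that."

"Then can you explain how distinct implements deduplication?"

My friend hemmed and hawed for a while: "Well, it just removes the duplicates, this way and that way."

"And how exactly does 'this way and that way' remove duplicates?"

"I have forgotten the details." (In truth, he never knew them.)

So how does distinct implement deduplication under the hood? This question comes up often in the Spark part of an interview.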

Let's start with a piece of code that tests what distinct does:

import org.apache.spark.rdd.RDD
import org.apache.spark.{SparkConf, SparkContext}

object SparkDistinct {
  def main(args: Array[String]): Unit = {
    val conf: SparkConf = new SparkConf().setMaster("local[*]").setAppName("SparkDistinct")
    val sc: SparkContext = new SparkContext(conf)
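    // Define an array that contains duplicates
    val array: Array[Int] = Array(1, 1, 1, 2, 2, 3, 3, 4)
    // Turn the array into an RDD. The second argument, 2, is the number of partitions;
    // it could also be 3, 4, ..., or be omitted entirely.
    val line: RDD[Int] = sc.parallelize(array, 2)
    // Print the deduplicated elements: 1, 2, 3, 4 (the order may differ between runs)
    line.distinct().foreach(x => println(x))
    // Release the SparkContext once the job has finished
    sc.stop()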
  }
}

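As the code above shows, distinct removes the duplicate elements. Now let's look at the source code (as quoted from the Spark version used here; newer releases may differ in detail):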

/**
   * Return a new RDD containing the distinct elements in this RDD.
   */
  def distinct(numPartitions: Int)(implicit ord: Ordering[T] = null): RDD[T] = withScope {
    map(x => (x, null)).reduceByKey((x, y) => x, numPartitions).map(_._1)
  }

  /**
   * Return a new RDD containing the distinct elements in this RDD.
   */
  def distinct(): RDD[T] = withScope {
    distinct(partitions.length)
  }

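That is the source of distinct, which comes in two forms: one with a parameter and one without. When we call the no-argument distinct, this is the code that runs: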

def distinct(): RDD[T] = withScope {
    distinct(partitions.length)
  }

The no-argument distinct() in turn calls the parameterized version, distinct(partitions.length).

Here partitions.length is the number of partitions of the RDD. In our example, that is the 2 partitions we specified in sc.parallelize(array,2).

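The body of the parameterized distinct is easy to understand: it is essentially the word-count pattern. The difference is that distinct finishes by keeping only the first element of each tuple (the key) and discarding the value.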

map(x => (x, null)).reduceByKey((x, y) => x, numPartitions).map(_._1)

Here numPartitions is the number of partitions that reduceByKey uses for its result.
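To make the role of partitions.length and numPartitions concrete, here is a minimal sketch that compares partition counts. The object name DistinctPartitions and the helper partitionCounts are made up for illustration; getNumPartitions is the standard RDD method. It assumes a local Spark setup like the one in the test program above.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical helper object, not part of Spark.
object DistinctPartitions {
  // Returns the partition counts of the source RDD, of distinct(), and of distinct(4).
  def partitionCounts(sc: SparkContext): (Int, Int, Int) = {
    val line = sc.parallelize(Array(1, 1, 1, 2, 2, 3, 3, 4), 2)
    val sameAsParent = line.distinct() // delegates to distinct(partitions.length)
    val explicit = line.distinct(4)    // numPartitions is handed on to reduceByKey
    (line.getNumPartitions, sameAsParent.getNumPartitions, explicit.getNumPartitions)
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[*]").setAppName("DistinctPartitions")
    val sc = new SparkContext(conf)
    try println(partitionCounts(sc)) // expected: (2,2,4)
    finally sc.stop()
  }
}
```

With the 2-partition RDD from the example, distinct() should keep 2 partitions, while distinct(4) should produce 4.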

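We could also write it like this, leaving out numPartitions so that reduceByKey falls back to its default partitioning: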

map(x => (x, null)).reduceByKey((x, y) => x).map(_._1)

Or like this, in word-count style:

line.map(x =>(x,1)).reduceByKey(_+_).map(_._1)
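As a rough check that these formulations agree, the sketch below runs the built-in distinct and both hand-written pipelines on the same RDD and compares the results as sets. The object and method names (DistinctVariants, builtIn, withNull, withCount) are invented for this illustration.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

// Hypothetical helper object, not part of Spark.
object DistinctVariants {
  // The built-in operator.
  def builtIn(line: RDD[Int]): Set[Int] = line.distinct().collect().toSet

  // Same pipeline as the quoted source: null is only a placeholder value.
  def withNull(line: RDD[Int]): Set[Int] =
    line.map(x => (x, null)).reduceByKey((x, _) => x).map(_._1).collect().toSet

  // Word-count style: the per-key counts are computed and then thrown away.
  def withCount(line: RDD[Int]): Set[Int] =
    line.map(x => (x, 1)).reduceByKey(_ + _).map(_._1).collect().toSet

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[*]").setAppName("DistinctVariants")
    val sc = new SparkContext(conf)
    try {
      val line = sc.parallelize(Array(1, 1, 1, 2, 2, 3, 3, 4), 2)
      // Compare as sets, because element order after a shuffle is not guaranteed.
      println(builtIn(line) == withNull(line) && withNull(line) == withCount(line))
    } finally sc.stop()
  }
}
```

The null version is the cheaper of the two hand-written forms, since it has no counts to add up; the counting version does extra arithmetic whose result is never used.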

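With that, the flow of distinct becomes clear.

First, map turns each element into a tuple whose value is null. Next, reduceByKey merges all tuples that share the same key into a single tuple. Finally, another map takes the key out of each tuple, which yields the deduplicated elements.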
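The three steps can also be traced without a cluster. The sketch below imitates them on an ordinary Scala collection. reduceByKeyLocal is a made-up stand-in for Spark's reduceByKey (it groups in memory and involves no shuffle), and DistinctTrace and distinctLocal are likewise hypothetical names.

```scala
// Plain Scala collections only; nothing here is a Spark API.
object DistinctTrace {
  // Local stand-in for reduceByKey: group the pairs by key,
  // then combine each group's values with f.
  def reduceByKeyLocal[K, V](pairs: Seq[(K, V)])(f: (V, V) => V): Map[K, V] =
    pairs.groupBy(_._1).map { case (k, kvs) => k -> kvs.map(_._2).reduce(f) }

  def distinctLocal[T](xs: Seq[T]): Seq[T] = {
    // Step 1: pair every element with a null placeholder, e.g. (1,null), (1,null), (2,null)
    val paired: Seq[(T, Null)] = xs.map(x => (x, null))
    // Step 2: merge all pairs that share a key, leaving one entry per key
    val reduced: Map[T, Null] = reduceByKeyLocal(paired)((x, _) => x)
    // Step 3: keep only the key of each entry
    reduced.map(_._1).toSeq
  }

  def main(args: Array[String]): Unit = {
    val result = distinctLocal(Seq(1, 1, 1, 2, 2, 3, 3, 4))
    // Sorted only for stable printing; a Map does not guarantee key order.
    println(result.sorted.mkString(", ")) // prints 1, 2, 3, 4
  }
}
```

In real Spark, step 2 is where the shuffle happens: reduceByKey sends equal keys to the same partition, which is what allows duplicates that started in different partitions to be merged.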
