Do you still remember the difference between reduce and reduceByKey in Spark?

1. Method descriptions

  • def reduce(f: (T, T) ⇒ T): T

    Reduces the elements of this RDD using the specified commutative and associative binary operator.

  • def reduceByKey(func: (V, V) ⇒ V): RDD[(K, V)]

    Merge the values for each key using an associative and commutative reduce function. This will also perform the merging locally on each mapper before sending results to a reducer, similarly to a “combiner” in MapReduce. Output will be hash-partitioned with the existing partitioner/parallelism level.

2. Differences

1) First, the difference you can see right in the names is "ByKey"

  • reduce operates on an ordinary RDD and returns a single value

  • reduceByKey operates on a key-value (pair) RDD and returns another key-value RDD (a minimal sketch of both follows these two bullets)
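
A minimal sketch of this difference in return types (an illustration only, not from the original post; it assumes a SparkContext named sc and small in-memory data built with parallelize):

// reduce on an ordinary RDD[Int] returns a plain Int to the driver.
val total: Int = sc.parallelize(Seq(1, 2, 3, 4)).reduce(_ + _)   // 10

// reduceByKey on a pair RDD returns another pair RDD with the values merged per key.
val merged: org.apache.spark.rdd.RDD[(String, Int)] =
  sc.parallelize(Seq(("a", 1), ("a", 2), ("b", 5))).reduceByKey(_ + _)
merged.collect().foreach(println)   // (a,3), (b,5)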

2) Second, they are different types of operators

  • reduce is an action

    An action runs the computation and returns a value to the driver program

  • reduceByKey is a transformation

    A transformation creates a new dataset;

    In Spark, all transformations are lazy: they do not compute their results immediately, but are only evaluated when an action drives them;

    A transformation may be recomputed if several actions trigger it. For that scenario, consider calling persist or cache on the transformation's result to persist it (it can be kept in memory or on disk; cache() = persist(StorageLevel.MEMORY_ONLY)), which reduces the amount of repeated computation.
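
Below is a minimal sketch of that caching pattern (an illustration only, not from the original post; it assumes the same SparkContext sc and reuses the text-file path from the examples below, and the filter threshold is arbitrary):

// Without cache(), each of the two actions below would re-read and re-map the file.
val lineLengths = sc.textFile("C:\\bd\\data\\the soul.txt")
  .map(line => line.split(" ").length)
  .cache()    // equivalent to persist(StorageLevel.MEMORY_ONLY)

val totalWords = lineLengths.reduce(_ + _)          // first action: computes and caches
val longLines  = lineLengths.filter(_ > 10).count() // second action: reuses the cached data
println(s"totalWords=$totalWords, longLines=$longLines")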

3. Examples

  • reduce example
import org.apache.spark.SparkConf
import org.apache.spark.SparkContext


object ReduceExampleApp {

  def reduceExample(sc: SparkContext): Unit = {

    // Map each line to the number of space-separated tokens it contains, then sum
    // those per-line counts with reduce, an action that returns a single Int to the driver.
    val words = sc.textFile("C:\\bd\\data\\the soul.txt").map(x => x.split(" ").length)
    val wordCount = words.reduce((x, y) => x + y)
    println(s"wordCount=$wordCount")

  }

  def main(args: Array[String]): Unit = {

    val conf = new SparkConf().setAppName(this.getClass.getName()).setMaster("local[4]")
    val sc = new SparkContext(conf)

    reduceExample(sc)

  }

}
Output:
wordCount=131
  • reduceByKey example
import org.apache.spark.SparkConf
import org.apache.spark.SparkContext


object ReduceExampleApp {

  def reduceByKeyExample(sc: SparkContext): Unit = {

    // Split each line into words and flatten them into a single RDD of words.
    val words = sc.textFile("C:\\bd\\data\\the soul.txt").flatMap(x => x.split(" "))
    // Pair every word with 1 and merge the counts per key.
    // reduceByKey is a transformation, so wordCountDS is still an RDD, not a value.
    val wordCountDS = words.map(x => (x, 1)).reduceByKey((x, y) => x + y)
    // Printing the RDD itself only shows its description (a ShuffledRDD); nothing has run yet.
    println(wordCountDS)
    // take(3) is the action that triggers the job; print the three most frequent words.
    wordCountDS.sortBy(_._2, false).take(3).foreach(println)

  }

  def main(args: Array[String]): Unit = {

    val conf = new SparkConf().setAppName(this.getClass.getName()).setMaster("local[4]")
    val sc = new SparkContext(conf)

    reduceByKeyExample(sc)

  }

}
Output:
ShuffledRDD[4] at reduceByKey at ReduceExampleApp.scala:27
(my,15)
(the,13)
(soul,8)