I. Method descriptions
- def reduce(f: (T, T) ⇒ T): T
  Reduces the elements of this RDD using the specified commutative and associative binary operator.
- def reduceByKey(func: (V, V) ⇒ V): RDD[(K, V)]
  Merge the values for each key using an associative and commutative reduce function. This will also perform the merging locally on each mapper before sending results to a reducer, similarly to a “combiner” in MapReduce. Output will be hash-partitioned with the existing partitioner/parallelism level.
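Both descriptions require the operator to be commutative and associative. The reason is that each partition is reduced on its own and the partial results are then merged in no guaranteed order. The following is a minimal sketch of that requirement, assuming a local Spark setup; the object name OperatorPropertySketch and the helpers sumWith and diffWith are hypothetical names chosen for illustration.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical object and helper names, used only for this illustration.
object OperatorPropertySketch {
  // Sums 1..10 after splitting the data into the given number of partitions.
  def sumWith(sc: SparkContext, partitions: Int): Int =
    sc.parallelize(1 to 10, partitions).reduce(_ + _)

  // Same data, reduced with subtraction, which is neither associative nor commutative.
  def diffWith(sc: SparkContext, partitions: Int): Int =
    sc.parallelize(1 to 10, partitions).reduce(_ - _)

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("OperatorPropertySketch").setMaster("local[4]")
    val sc = new SparkContext(conf)
    try {
      // Addition: the partition count does not affect the result.
      println(s"sum, 1 partition:  ${sumWith(sc, 1)}") // 55
      println(s"sum, 4 partitions: ${sumWith(sc, 4)}") // 55
      // Subtraction: a single partition folds left to right (1 - 2 - ... - 10 = -53).
      println(s"diff, 1 partition:  ${diffWith(sc, 1)}") // -53
      // With several partitions the value depends on how the partial results
      // happen to be merged, so no expected value is given here.
      println(s"diff, 4 partitions: ${diffWith(sc, 4)}")
    } finally {
      sc.stop()
    }
  }
}
```

Subtraction appears here only as a counterexample; an operator like this should not be passed to reduce or reduceByKey.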
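II. Differences
1. First, the difference visible in the names themselves is the "ByKey" suffix
- reduce operates on an ordinary RDD and returns a single value
- reduceByKey operates on a key-value pair RDD and returns another key-value pair RDD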
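The difference in return types can be made explicit with type annotations. This is a minimal sketch assuming small in-memory data; the object name ReturnTypeSketch and the helpers totalOfValues and sumPerKey are hypothetical names, not part of the Spark API.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

// Hypothetical object and helper names, used only for this illustration.
object ReturnTypeSketch {
  // reduce collapses the whole RDD into one value on the driver.
  def totalOfValues(pairs: RDD[(String, Int)]): Int =
    pairs.map(_._2).reduce(_ + _)

  // reduceByKey keeps one entry per key and returns another pair RDD.
  def sumPerKey(pairs: RDD[(String, Int)]): RDD[(String, Int)] =
    pairs.reduceByKey(_ + _)

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("ReturnTypeSketch").setMaster("local[4]")
    val sc = new SparkContext(conf)
    try {
      val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))
      val total: Int = totalOfValues(pairs)             // a plain Int: 6
      val perKey: RDD[(String, Int)] = sumPerKey(pairs) // still an RDD
      println(s"total=$total")
      // collect() brings the per-key sums to the driver: (a,4) and (b,2)
      perKey.collect().sorted.foreach(println)
    } finally {
      sc.stop()
    }
  }
}
```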
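2. Second, they are different kinds of operators
- reduce is an action
  An action runs a computation and returns a value to the driver program.
- reduceByKey is a transformation
  A transformation creates a new dataset.
  In Spark, all transformations are lazy: they do not compute their results right away, and are only evaluated when an action requires them.
  A transformation may be recomputed if several actions trigger it. In that case, consider calling persist or cache on the result of the transformation to persist it (in memory or on disk; cache() = persist(StorageLevel.MEMORY_ONLY)), which reduces the amount of repeated computation.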
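Laziness and recomputation can be observed by counting how often a map function actually runs. The sketch below assumes a local Spark setup and uses a LongAccumulator as the counter; the object name LazyAndCacheSketch and the helper mapCalls are hypothetical names for illustration. A plain map is used rather than reduceByKey so that no shuffle stage is involved in the measurement.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical object and helper names, used only for this illustration.
object LazyAndCacheSketch {
  // Runs two actions on the same mapped RDD and returns
  // (map calls before any action, map calls after both actions).
  def mapCalls(sc: SparkContext, useCache: Boolean): (Long, Long) = {
    val calls = sc.longAccumulator("mapCalls")
    val doubled = sc.parallelize(1 to 5, 2).map { x => calls.add(1); x * 2 }
    val rdd = if (useCache) doubled.cache() else doubled

    // Only transformations so far: nothing has been computed yet.
    val before = calls.sum

    // Two actions on the same RDD.
    rdd.reduce(_ + _)
    rdd.reduce(_ + _)
    val after = calls.sum

    rdd.unpersist()
    (before, after)
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("LazyAndCacheSketch").setMaster("local[4]")
    val sc = new SparkContext(conf)
    try {
      // Without cache, the map runs once per element for each action.
      println(s"no cache:   ${mapCalls(sc, useCache = false)}")
      // With cache, the second action reuses the stored partitions.
      println(s"with cache: ${mapCalls(sc, useCache = true)}")
    } finally {
      sc.stop()
    }
  }
}
```

With five elements and enough memory to hold the cached partitions, the uncached run should report 10 map calls after both actions and the cached run 5.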
III. Examples
- reduce example
import org.apache.spark.SparkConf
import org.apache.spark.SparkContext

object ReduceExampleApp {
  def reduceExample(sc: SparkContext): Unit = {
    // Map each line to the number of space-separated tokens it contains.
    val words = sc.textFile("C:\\bd\\data\\the soul.txt").map(x => x.split(" ").length)
    // reduce is an action: it runs the job and returns the total to the driver.
    val wordCount = words.reduce((x, y) => x + y)
    println(s"wordCount=$wordCount")
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName(this.getClass.getName()).setMaster("local[4]")
    val sc = new SparkContext(conf)
    reduceExample(sc)
    sc.stop() // release the local Spark resources
  }
}
Output:
wordCount=131
- reduceByKey example
import org.apache.spark.SparkConf
import org.apache.spark.SparkContext

object ReduceExampleApp {
  def reduceExample(sc: SparkContext): Unit = {
    val words = sc.textFile("C:\\bd\\data\\the soul.txt").map(x => x.split(" ").length)
    val wordCount = words.reduce((x, y) => x + y)
    println(s"wordCount=$wordCount")
  }

  def reduceByKeyExample(sc: SparkContext): Unit = {
    // Split every line into words, then pair each word with a count of 1.
    val words = sc.textFile("C:\\bd\\data\\the soul.txt").flatMap(x => x.split(" "))
    // reduceByKey is a transformation: it returns a new RDD of (word, count) pairs.
    val wordCountDS = words.map(x => (x, 1)).reduceByKey((x, y) => x + y)
    // Printing the RDD shows its description, not its contents; nothing has run yet.
    println(wordCountDS)
    // take is an action: it triggers the job and returns the three most frequent words.
    wordCountDS.sortBy(_._2, false).take(3).foreach(println)
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName(this.getClass.getName()).setMaster("local[4]")
    val sc = new SparkContext(conf)
    reduceByKeyExample(sc)
    sc.stop() // release the local Spark resources
  }
}
Output:
ShuffledRDD[4] at reduceByKey at ReduceExampleApp.scala:27
(my,15)
(the,13)
(soul,8)