In-Depth Analysis of the Basic Principles of WordCount
When learning Spark programming, the first Spark program most people write is probably WordCount. In this post I take an in-depth look at how the WordCount program works.
WordCount program code
/**
 * Created by cuiyufei on 2018/2/13.
 */
import org.apache.spark.SparkContext
import org.apache.spark.SparkConf

object WordCount {
  private val master = "spark://spark1:7077"
  // when running on a cluster, the input file must be readable from every worker
  private val remote_file = "F:\\spark\\spark.txt"

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("WordCount")
      .setMaster(master)
    val sc = new SparkContext(conf)
    val lines = sc.textFile(remote_file)                // read the file into an RDD of lines
    val words = lines.flatMap(line => line.split(" "))  // split each line into words
    val pairs = words.map(word => (word, 1))            // pair each word with a count of 1
    val wordCounts = pairs.reduceByKey((a, b) => a + b) // sum the counts for each word
    //val wordCounts = pairs.reduceByKey(_ + _)         // equivalent shorthand
    wordCounts.foreach(println)                         // note: on a cluster this prints on the executors, not the driver
    sc.stop()
  }
}
The data flows through a chain of RDD transformations, as the figure below illustrates: textFile produces lines, flatMap produces words, map produces the (word, 1) pairs, and reduceByKey produces wordCounts.
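The same transformation chain can also be inspected at runtime with Spark's RDD.toDebugString, which prints an RDD's lineage. Below is a minimal sketch; the local[*] master and the input path "spark.txt" are assumptions for demonstration, not part of the original program.

import org.apache.spark.{SparkConf, SparkContext}

object LineageDemo {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("LineageDemo").setMaster("local[*]"))

    // Same pipeline as the WordCount program above;
    // "spark.txt" is a hypothetical input path used only for illustration.
    val wordCounts = sc.textFile("spark.txt")
      .flatMap(_.split(" "))
      .map((_, 1))
      .reduceByKey(_ + _)

    // Prints the lineage: a ShuffledRDD from reduceByKey sitting on top of
    // the MapPartitionsRDDs created by textFile, flatMap and map.
    println(wordCounts.toDebugString)
    sc.stop()
  }
}

In the printed lineage you can see that reduceByKey introduces a shuffle boundary, while flatMap and map are narrow transformations that Spark pipelines inside a single stage.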
This is Spark's distributed, in-memory, iterative computation model, and it is a key reason Spark is faster than MapReduce: Spark can keep intermediate RDDs in memory across the stages of a job, whereas a MapReduce pipeline must write each job's intermediate results to disk and read them back, so its speed inevitably suffers.
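To make the in-memory point concrete, here is a small sketch, again assuming a local[*] master and a hypothetical "spark.txt" input path: caching an RDD lets several actions reuse it from memory, where an equivalent MapReduce pipeline would have to re-read materialized intermediate data from disk.

import org.apache.spark.{SparkConf, SparkContext}

object InMemoryDemo {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("InMemoryDemo").setMaster("local[*]"))

    // "spark.txt" is a hypothetical input path used only for illustration.
    val pairs = sc.textFile("spark.txt")
      .flatMap(_.split(" "))
      .map((_, 1))
      .cache() // keep the pairs RDD in memory after it is first computed

    // Two separate actions: the second reuses the cached partitions instead
    // of re-reading and re-parsing the file. Avoiding these per-stage disk
    // round-trips is exactly where Spark gains over MapReduce.
    val totalWords = pairs.count()
    val distinctWords = pairs.reduceByKey(_ + _).count()

    println(s"total words: $totalWords, distinct words: $distinctWords")
    sc.stop()
  }
}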