Lesson 11: A Thorough Analysis of How WordCount Runs

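In this lesson:
1. WordCount from the data-flow perspective: when Spark counts words, how exactly does the data flow?
2. WordCount from the perspective of RDD dependencies. Every operation in Spark is expressed on RDDs, and each later RDD depends on the RDDs before it.
3. Thoughts on DAG and Lineage. The dependencies form a DAG.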

1. WordCount from the data-flow perspective
(1) Write the following code in IntelliJ IDEA:

package com.dt.spark

import org.apache.spark.SparkConf
import org.apache.spark.SparkContext

/**
 * A Spark WordCount program written in Scala for local testing
 * @author DT大數據夢工廠
 * http://weibo.com/ilovepains
 */
object WordCount {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
    conf.setAppName("Wow, My First Spark App!")
    conf.setMaster("local") // run locally, no cluster needed
    val sc = new SparkContext(conf)

    // Read the file with a minimum of 1 partition
    val lines = sc.textFile("D://tmp//helloSpark.txt", 1)
    // Split every line into words
    val words = lines.flatMap { line => line.split(" ") }
    // Give every word instance a count of 1
    val pairs = words.map { word => (word, 1) }
    // Add up the counts of identical words
    val wordCounts = pairs.reduceByKey(_ + _)
    // foreach is an action, so this line triggers the job
    wordCounts.foreach(wordNumberPair => println(wordNumberPair._1 + " : " + wordNumberPair._2))
    sc.stop()
  }
}
(2) In the tmp folder on drive D, create a file named helloSpark.txt with the following content:
Hello Spark Hello Scala
Hello Hadoop
Hello Flink
Spark is awesome
(3) Right-click in the WordCount code area and choose Run 'WordCount'. You should get the following output:
Flink : 1
Spark : 2
is : 1
Hello : 4
awesome : 1
Hadoop : 1
Scala : 1
Next, we analyze how the data is actually processed, from the data-flow perspective.

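Notes:
Spark has three main characteristics:
1. Distributed. Both the data and the computation are distributed. The default partitioning strategy is that a split is as large as a Block. This is not entirely accurate, though: a record may straddle two Blocks when the splits are cut, so a split is not strictly equal to the Block size. For example, if the HDFS Block size is 128 MB, a split may be a few bytes larger or smaller. In general, a split is not exactly the same size as a Block.
A split is not necessarily smaller than the Block size either: if its last record straddles two Blocks, the whole record is placed in the earlier split.
2. Memory-based (partly disk-based).
3. Iterative.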
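
The point about split sizes in item 1 can be illustrated with a minimal sketch. It assumes a deliberately simplified model (a record belongs to the split in which its first byte falls) and a tiny block size; the names SplitSketch and splitSizes are hypothetical and this is not Hadoop's actual split logic.

```scala
object SplitSketch {
  // Simplified model (an assumption, not Hadoop's real code): a record
  // belongs to the split in which its first byte falls, even if its
  // remaining bytes spill over into the next Block.
  def splitSizes(recordLengths: Seq[Int], blockSize: Int): Seq[Int] = {
    // Starting byte offset of every record in the file.
    val offsets = recordLengths.scanLeft(0)(_ + _).init
    offsets.zip(recordLengths)
      .groupBy { case (offset, _) => offset / blockSize } // Block index of the first byte
      .toSeq
      .sortBy(_._1)
      .map { case (_, records) => records.map(_._2).sum } // total bytes per split
  }

  def main(args: Array[String]): Unit = {
    // Four records of 6, 6, 6 and 2 bytes with a 10-byte Block.
    println(splitSizes(Seq(6, 6, 6, 2), blockSize = 10))
  }
}
```

With records of 6, 6, 6 and 2 bytes and a 10-byte Block, the first split comes out at 12 bytes (larger than a Block, because the second record straddles the boundary) and the second at 8 bytes (smaller than a Block).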

Source code of textFile (in SparkContext):
def textFile(
    path: String,
    minPartitions: Int = defaultMinPartitions): RDD[String] = withScope {
  assertNotStopped()
  hadoopFile(path, classOf[TextInputFormat], classOf[LongWritable], classOf[Text],
    minPartitions).map(pair => pair._2.toString)
}
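As you can see, a map operation is applied after hadoopFile.
HadoopRDD reads the distributed file from HDFS, and the data lives in the cluster as data splits (Partitions).
Source code of map (in RDD.scala):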
def map[U: ClassTag](f: T => U): RDD[U] = withScope {
  val cleanF = sc.clean(f)
  new MapPartitionsRDD[U, T](this, (context, pid, iter) => iter.map(cleanF))
}
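Each line is read as a (key, value) pair. We are not interested in the key, which is the position (offset) of the line in the file, only in the value. The anonymous function passed to map receives the pair, which is a tuple, and takes its second element.
This produces a MapPartitionsRDD. Based on the Partitions produced by the HadoopRDD, the MapPartitionsRDD drops the key of each line.
Note: a single operation may produce one RDD or several. For example, sc.textFile produces two RDDs: a HadoopRDD and a MapPartitionsRDD.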
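
The two steps inside textFile can be mimicked in plain Scala without Spark. This is only a sketch: TextFileSketch, offsetLinePairs and linesOnly are hypothetical names, and the offsets assume single-byte characters and '\n' line endings.

```scala
object TextFileSketch {
  // Mimics what the HadoopRDD yields for a text file:
  // (byte offset of the line, text of the line).
  // Assumes single-byte characters and '\n' line endings for simplicity.
  def offsetLinePairs(content: String): Seq[(Long, String)] = {
    val lines = content.split("\n").toSeq
    // Each line starts where the previous one ended, plus one byte for '\n'.
    val offsets = lines.scanLeft(0L)((offset, line) => offset + line.length + 1).init
    offsets.zip(lines)
  }

  // Mimics the map step inside textFile: keep only the value of each pair.
  def linesOnly(pairs: Seq[(Long, String)]): Seq[String] =
    pairs.map(pair => pair._2)
}
```

For the first two lines of helloSpark.txt, offsetLinePairs gives the keys 0 and 24, and linesOnly then discards them, leaving just the two strings.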

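Next step: val words = lines.flatMap { line => line.split(" ") }
flatMap splits every line in every Partition of the RDD into words and merges the results into one large collection of word instances.
Here there are 4 Partitions; after the split, the lines become individual words.
Below is the source code of flatMap (in RDD.scala):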
def flatMap[U: ClassTag](f: T => TraversableOnce[U]): RDD[U] = withScope {
  val cleanF = sc.clean(f)
  new MapPartitionsRDD[U, T](this, (context, pid, iter) => iter.flatMap(cleanF))
}
As you can see, flatMap produces another MapPartitionsRDD.
At this point, every Partition holds the words obtained from the split.
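
As a rough sketch of the per-Partition behaviour, the following plain Scala models each Partition as a Seq of lines and applies the split to each Partition's iterator separately, in the spirit of iter.flatMap(cleanF). FlatMapSketch and flatMapPartitions are hypothetical names, and two Partitions are used here purely for brevity.

```scala
object FlatMapSketch {
  // Each inner Seq stands in for one Partition of lines.
  // The split runs independently on each Partition's iterator,
  // so words never move between Partitions at this step.
  def flatMapPartitions(partitions: Seq[Seq[String]]): Seq[Seq[String]] =
    partitions.map { partition =>
      partition.iterator.flatMap(line => line.split(" ").toSeq).toSeq
    }
}
```

Note that the number of Partitions is unchanged: flatMap only changes what each Partition contains.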

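Next step: val pairs = words.map { word => (word,1) }
This turns every word instance into a pair, following word => (word,1).
In other words, the map operation gives every split word a count of 1.
From the source code we know that map produces yet another MapPartitionsRDD. In this MapPartitionsRDD, every word has become a pair, in a form such as Array(("Hello",1),("Spark",1)).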

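Next step: val wordCounts = pairs.reduceByKey(_+_)
reduceByKey performs the global word count by adding up the values of identical keys. The reduce runs both locally and on the reducer side, so after the map a count is first done locally, that is, a local-level reduce.
The Local Reduce before the shuffle is responsible for the local partial count, and writes the counted results to different files according to the partitioning strategy.
The next Stage is the Reducer. Suppose the next Stage has a parallelism of 3: after its Local Reduce, every Partition divides its data into three groups. The simplest way to do this is to take the key's hashCode modulo the number of groups.
Everything up to this point is Stage 1.
Within a Stage, the computation is iterated entirely in memory without reading from or writing to disk for every operation, which is why it is so fast.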
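
The Local Reduce and the hashCode-modulo grouping described above can be sketched in plain Scala as follows. LocalReduceSketch and its functions are hypothetical names; the sketch only illustrates the idea and is not Spark's shuffle writer.

```scala
object LocalReduceSketch {
  // Non-negative modulo, since hashCode may be negative.
  def bucketOf(key: String, numReducers: Int): Int = {
    val raw = key.hashCode % numReducers
    if (raw < 0) raw + numReducers else raw
  }

  // Local Reduce within one Partition: sum the counts of identical words.
  def localReduce(pairs: Seq[(String, Int)]): Map[String, Int] =
    pairs.groupBy(_._1).map { case (word, ps) => word -> ps.map(_._2).sum }

  // Divide the locally reduced result into one group per reducer.
  def toBuckets(reduced: Map[String, Int], numReducers: Int): Map[Int, Map[String, Int]] =
    reduced.groupBy { case (word, _) => bucketOf(word, numReducers) }
}
```

Because the group depends only on the key's hashCode, the same word lands in the same group on every Partition, which is what lets a single reducer later collect all partial counts of that word.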
Source code of reduceByKey:
def reduceByKey(partitioner: Partitioner, func: (V, V) => V): RDD[(K, V)] = self.withScope {
  combineByKeyWithClassTag[V]((v: V) => v, func, func, partitioner)
}

/**
 * Merge the values for each key using an associative reduce function. This will also perform
 * the merging locally on each mapper before sending results to a reducer, similarly to a
 * "combiner" in MapReduce. Output will be hash-partitioned with numPartitions partitions.
 */
def reduceByKey(func: (V, V) => V, numPartitions: Int): RDD[(K, V)] = self.withScope {
  reduceByKey(new HashPartitioner(numPartitions), func)
}
As you can see, reduceByKey calls combineByKeyWithClassTag internally. The source code of combineByKeyWithClassTag is as follows:
def combineByKeyWithClassTag[C](
    createCombiner: V => C,
    mergeValue: (C, V) => C,
    mergeCombiners: (C, C) => C,
    partitioner: Partitioner,
    mapSideCombine: Boolean = true,
    serializer: Serializer = null)(implicit ct: ClassTag[C]): RDD[(K, C)] = self.withScope {
  require(mergeCombiners != null, "mergeCombiners must be defined") // required as of Spark 0.9.0
  if (keyClass.isArray) {
    if (mapSideCombine) {
      throw new SparkException("Cannot use map-side combining with array keys.")
    }
    if (partitioner.isInstanceOf[HashPartitioner]) {
      throw new SparkException("Default partitioner cannot partition array keys.")
    }
  }
  val aggregator = new Aggregator[K, V, C](
    self.context.clean(createCombiner),
    self.context.clean(mergeValue),
    self.context.clean(mergeCombiners))
  if (self.partitioner == Some(partitioner)) {
    self.mapPartitions(iter => {
      val context = TaskContext.get()
      new InterruptibleIterator(context, aggregator.combineValuesByKey(iter, context))
    }, preservesPartitioning = true)
  } else {
    new ShuffledRDD[K, V, C](self, partitioner)
      .setSerializer(serializer)
      .setAggregator(aggregator)
      .setMapSideCombine(mapSideCombine)
  }
}
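As you can see, combineByKeyWithClassTag creates a new ShuffledRDD (unless the RDD is already partitioned by the same partitioner).
reduceByKey serves two purposes:
1. It performs a local-level reduce, which reduces network traffic.
2. It writes the output of the current Stage to local disk for the shuffle to use.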
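
To make the roles of the three functions passed to combineByKeyWithClassTag more concrete, here is a minimal plain-Scala sketch. CombineSketch and its methods are hypothetical names that only borrow the vocabulary of the Spark source; the real Aggregator works on spillable maps and is considerably more involved.

```scala
object CombineSketch {
  // Map-side combine for one Partition: the first value seen for a key goes
  // through createCombiner, later values are folded in with mergeValue.
  def combineValuesByKey[K, V, C](
      records: Iterator[(K, V)],
      createCombiner: V => C,
      mergeValue: (C, V) => C): Map[K, C] =
    records.foldLeft(Map.empty[K, C]) { case (acc, (k, v)) =>
      acc.get(k) match {
        case Some(c) => acc.updated(k, mergeValue(c, v))
        case None    => acc.updated(k, createCombiner(v))
      }
    }

  // Reduce side: combiners of the same key coming from different Partitions
  // are merged with mergeCombiners.
  def mergeCombinersByKey[K, C](
      partials: Seq[Map[K, C]],
      mergeCombiners: (C, C) => C): Map[K, C] =
    partials.flatten
      .groupBy(_._1)
      .map { case (k, kcs) => k -> kcs.map(_._2).reduce(mergeCombiners) }

  // reduceByKey(func) corresponds to createCombiner = identity,
  // mergeValue = func and mergeCombiners = func, as in the source above.
  def reduceByKey[K, V](partitions: Seq[Seq[(K, V)]], func: (V, V) => V): Map[K, V] =
    mergeCombinersByKey(
      partitions.map(p => combineValuesByKey[K, V, V](p.iterator, v => v, func)),
      func)
}
```

A call such as CombineSketch.reduceByKey(partitions, (a: Int, b: Int) => a + b) first combines inside each Partition and only then merges across Partitions, which is why fewer records need to cross the network.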

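The next step is the ShuffledRDD.
Producing shuffle data requires classifying the records. By the time of the MapPartitionsRDD they have in fact already been classified; the simplest classification strategy is hashing.
The ShuffledRDD needs to fetch the same word from every machine.
Where does reduceByKey happen?
Stage 2 is entirely the reduce side of reduceByKey; the local reduce already ran at the end of Stage 1.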
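
The fetch step can be sketched in plain Scala as well. In this hypothetical model, mapOutputs holds, for every map task, the partial counts it wrote for each reducer; ShuffleFetchSketch and fetchAndReduce are illustrative names, not Spark APIs.

```scala
object ShuffleFetchSketch {
  // mapOutputs(m)(r) = partial counts that map task m wrote for reducer r.
  def fetchAndReduce(
      mapOutputs: Seq[Map[Int, Map[String, Int]]],
      reducerId: Int): Map[String, Int] =
    mapOutputs
      // Fetch this reducer's group from every map task's output.
      .flatMap(output => output.getOrElse(reducerId, Map.empty[String, Int]).toSeq)
      // Merge the partial counts of the same word.
      .groupBy(_._1)
      .map { case (word, partials) => word -> partials.map(_._2).sum }
}
```

Each reducer touches only its own group in every map output, so a given word is summed in exactly one place.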

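Last step: save the data to HDFS (MapPartitionsRDD)
A counted result such as ("Hello", 4) is only a value, rather than the key "Hello" with the value 4. Writing to the file system, however, requires a key-value format. Since there is only a value now, a key has to be created.
Source code of saveAsTextFile: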
def saveAsTextFile(path: String): Unit = {
  this.map(x => (NullWritable.get(), new Text(x.toString)))
    .saveAsHadoopFile[TextOutputFormat[NullWritable, Text]](path)
}
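this.map turns the current value (x) into a tuple whose key is null (NullWritable) and whose value is the text of ("Hello", 4).
Why do it this way? Because saveAsHadoopFile requires output in this format: Hadoop needs a key-value format.
The earlier map operation discarded the key, so a key has to be generated for the output.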
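
A small plain-Scala sketch of this last mapping, with a null key standing in for NullWritable and a String standing in for Text. SaveSketch, toOutputPair and toLine are hypothetical names used only for illustration.

```scala
object SaveSketch {
  // Stand-in for (NullWritable, Text): the key carries no information,
  // the value is the record rendered as text.
  def toOutputPair[T](record: T): (Null, String) = (null, record.toString)

  // TextOutputFormat writes only the value when the key is NullWritable;
  // this models that by using the value alone as the output line.
  def toLine(pair: (Null, String)): String = pair._2
}
```

For the record ("Hello", 4), the line that ends up in the output is the tuple's text form, (Hello,4).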
Which RDDs are in the first Stage? HadoopRDD, MapPartitionsRDD, MapPartitionsRDD, MapPartitionsRDD, MapPartitionsRDD
Which RDDs are in the second Stage? ShuffledRDD, MapPartitionsRDD

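Only an action such as collect or saveAsTextFile (or the foreach in the code above) triggers a job; transformations do not trigger one, because they are lazy.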
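
Laziness can be felt without Spark by using a plain Scala Iterator as an analogy: map on an Iterator only records what to do, and nothing runs until the iterator is consumed. This is only an analogy under that assumption, not Spark's scheduling mechanism; LazySketch and run are hypothetical names.

```scala
object LazySketch {
  // Returns (evaluations before the "action", evaluations after it, total count).
  def run(): (Int, Int, Int) = {
    var evaluated = 0
    // Like a transformation: describes the work but does not run it yet.
    val pairs = Iterator("Hello", "Spark", "Hello").map { word =>
      evaluated += 1
      (word, 1)
    }
    val beforeAction = evaluated
    // Like an action: consuming the iterator forces the computation.
    val total = pairs.map(_._2).sum
    (beforeAction, evaluated, total)
  }
}
```

The counter stays at 0 after map is called and only reaches 3 once sum consumes the iterator, much as a Spark job only starts when an action is invoked.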
