# An Illustrated Guide to the Core Source Code of Spark's sortBy Operator

## 1. Example

``````
val money = ss.sparkContext.parallelize(
  List(("Alice", 9973),
    ("Bob", 6084),
    ("Charlie", 3160),
    ("David", 8588),
    ("Emma", 8241),
    ("Frank", 117),
    ("Grace", 5217),
    ("Hannah", 5811),
    ("Ivy", 4355),
    ("Jack", 2106))
)

// foreach runs on the executors, so each partition prints on its own
// and the overall output order is NOT the sorted order:
money.sortBy(x => x._2, false).foreach(println)

(Ivy,4355)
(Grace,5217)
(Jack,2106)
(Frank,117)
(Emma,8241)
(Alice,9973)
(Charlie,3160)
(Bob,6084)
(Hannah,5811)
(David,8588)
``````

``````
// collect() gathers all partitions back to the driver in partition order,
// so printing the collected array shows the globally sorted result:
money.sortBy(x => x._2, false).collect().foreach(println)

// With a single partition there is nothing to interleave, so foreach
// also prints in sorted order:
money.repartition(1).sortBy(x => x._2, false).foreach(println)

// saveAsTextFile writes one part-X file per partition, in key order:
money.sortBy(x => x._2, false).saveAsTextFile("result")

(Alice,9973)
(David,8588)
(Emma,8241)
(Bob,6084)
(Hannah,5811)
(Grace,5217)
(Ivy,4355)
(Charlie,3160)
(Jack,2106)
(Frank,117)
``````

## 2. Analyzing the sortBy Source Code

``````
def sortBy[K](
    f: (T) => K,
    ascending: Boolean = true,
    numPartitions: Int = this.partitions.length)
    (implicit ord: Ordering[K], ctag: ClassTag[K]): RDD[T] = withScope {
  this.keyBy[K](f)
    .sortByKey(ascending, numPartitions)
    .values
}
``````
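The three chained calls can be imitated on an ordinary Scala collection. This is a local, single-machine sketch of the keyBy → sortByKey → values pipeline, not Spark's distributed implementation:

``````
// Local sketch of what sortBy composes, using a plain List instead of an RDD
val money = List(("Alice", 9973), ("Bob", 6084), ("Frank", 117))

// keyBy: pair each element with its sort key
val keyed = money.map(x => (x._2, x))            // List((9973,(Alice,9973)), ...)

// sortByKey with ascending = false: sort by the extracted key, descending
val sorted = keyed.sortBy(_._1)(Ordering[Int].reverse)

// values: drop the key, keep the original element
val result = sorted.map(_._2)
println(result)  // List((Alice,9973), (Bob,6084), (Frank,117))
``````

The rest of this section walks through each of the three steps in the real RDD source.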

### 2.1 sortBy Step by Step, Part 1: this.keyBy[K](f)

The line `this.keyBy[K](f)` builds a new RDD from the `x => x._2` function passed in through `sortBy(x => x._2, false)`. Let's look at its underlying source:

``````
def keyBy[K](f: T => K): RDD[(K, T)] = withScope {
  val cleanedF = sc.clean(f)
  map(x => (cleanedF(x), x))
}
``````

Substituting `x => x._2` for `f`, keyBy first cleans the function once and then maps every element to a `(key, element)` pair:

``````
val cleanedF = sc.clean(x => x._2)
map(x => (cleanedF(x), x))
``````

The `sc.clean(x => x._2)` call cleans the closure so that it can be serialized. Because the function's result is used as the sort key and the sort is distributed across partitions on different nodes, the function itself has to travel over the network; serialization is what lets it be passed to and executed on the executors. To verify serializability, clean ultimately runs `SparkEnv.get.closureSerializer.newInstance().serialize(func)`; readers interested in the details can dig into `ClosureCleaner`.
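A Scala function literal compiles to a serializable object, which is what makes shipping it possible. Here is a minimal sketch of that round trip, using plain Java serialization rather than Spark's configured `closureSerializer`:

``````
import java.io._

// The sort-key function, as an object that can be serialized
val f: ((String, Int)) => Int = _._2

// Serialize the function to bytes (what would travel over the network)...
val bytes = new ByteArrayOutputStream()
new ObjectOutputStream(bytes).writeObject(f)

// ...and deserialize it on the "other side", where it still works
val in = new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray))
val g = in.readObject().asInstanceOf[((String, Int)) => Int]
println(g(("Alice", 9973)))  // 9973
``````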

keyBy ultimately produces a new RDD. To see what its structure looks like, just call keyBy on the original test data and print the result:

``````
val money = ss.sparkContext.parallelize(
  List(("Alice", 9973),
    ("Bob", 6084),
    ("Charlie", 3160),
    ("David", 8588),
    ("Emma", 8241),
    ("Frank", 117),
    ("Grace", 5217),
    ("Hannah", 5811),
    ("Ivy", 4355),
    ("Jack", 2106))
)
money.keyBy(x => x._2).foreach(println)

(5217,(Grace,5217))
(5811,(Hannah,5811))
(8588,(David,8588))
(8241,(Emma,8241))
(9973,(Alice,9973))
(3160,(Charlie,3160))
(4355,(Ivy,4355))
(2106,(Jack,2106))
(117,(Frank,117))
(6084,(Bob,6084))
``````

### 2.2 sortBy Step by Step, Part 2: sortByKey

``````
/**
 * Sort the RDD by key, so that each partition contains a sorted range of the elements. Calling
 * `collect` or `save` on the resulting RDD will return or output an ordered list of records
 * (in the `save` case, they will be written to multiple `part-X` files in the filesystem, in
 * order of the keys).
 */
// TODO: this currently doesn't work on P other than Tuple2!
def sortByKey(ascending: Boolean = true, numPartitions: Int = self.partitions.length)
    : RDD[(K, V)] = self.withScope
{
  val part = new RangePartitioner(numPartitions, self, ascending)
  new ShuffledRDD[K, V, V](self, part)
    .setKeyOrdering(if (ascending) ordering else ordering.reverse)
}
``````
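The key idea of RangePartitioner is that it samples the keys and computes boundary values, so that every key in partition i comes before (or, when descending, after) every key in partition i + 1. A simplified local sketch of the idea, using exact quantiles where the real implementation uses reservoir sampling:

``````
// Hypothetical, simplified range partitioning over our sample keys
val keys = List(5217, 5811, 8588, 8241, 9973, 3160, 4355, 2106, 117, 6084)
val numPartitions = 3

// In Spark the bounds come from a sample of the data; here we cheat and
// take exact quantiles of the full key set.
val sortedKeys = keys.sorted
val bounds = (1 until numPartitions)
  .map(i => sortedKeys(i * sortedKeys.length / numPartitions))

// Route a key to the partition whose range contains it
def partitionOf(k: Int): Int = bounds.indexWhere(k <= _) match {
  case -1 => numPartitions - 1
  case i  => i
}

keys.groupBy(partitionOf).toList.sortBy(_._1).foreach { case (p, ks) =>
  println(s"partition $p: ${ks.sorted}")
}
// partition 0: List(117, 2106, 3160, 4355)
// partition 1: List(5217, 5811, 6084)
// partition 2: List(8241, 8588, 9973)
``````

Because the partitions cover disjoint, ordered key ranges, sorting each one locally is enough to make the whole result globally ordered.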

We can observe which partition each record lands in (the original snippet referenced an undefined `partitionId`; `TaskContext.getPartitionId()` supplies it):

``````
import org.apache.spark.TaskContext

money.sortBy(x => x._2, false).foreachPartition(iter => {
  // TaskContext gives us the id of the partition this task is processing
  val partitionId = TaskContext.getPartitionId()
  iter.foreach(x => println("partition " + partitionId + ": " + x))
})
``````

The main flow of sortBy is as follows. Suppose the job runs with 3 partitions: when the input data is read into an RDD, the records are initially spread across those 3 partitions. The RangePartitioner then computes key-range boundaries, so that after the shuffle each partition holds one contiguous range of keys.

Inside the ShuffledRDD, the data of each partition is then sorted by key in the requested ascending or descending order, producing a result set that is ordered within every partition — and, because the partitions themselves cover ordered key ranges, ordered globally when read in partition order.
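That last point is worth making concrete: once the keys have been range-partitioned, each partition can be sorted independently, and simply reading the partitions in order yields a globally sorted result. A local sketch (the bucket contents are made up to mirror the example data, with descending ranges as for `ascending = false`):

``````
// Buckets produced by a descending range partitioner (order within each
// bucket is still arbitrary right after the shuffle)
val partitions = List(
  List(8588, 9973, 8241),          // partition 0: keys above 6084
  List(5217, 6084, 5811),          // partition 1
  List(117, 4355, 3160, 2106))     // partition 2

// Sort each partition independently, then read them in partition order:
val globallySorted = partitions.flatMap(_.sorted(Ordering[Int].reverse))
println(globallySorted)
// List(9973, 8588, 8241, 6084, 5811, 5217, 4355, 3160, 2106, 117)
``````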

### 2.3 sortBy Step by Step, Part 3: .values

In the sortBy source, `this.keyBy[K](f).sortByKey(ascending, numPartitions).values` finishes with a call to `.values`, whose implementation is simply `def values: RDD[V] = self.map(_._2)`. In other words, once sorting is done, only the `_._2` part of each pair — the original element — is kept, and the key that was generated for sorting is dropped. During the sort the RDD holds pairs like `(5217,(Grace,5217))`; after sorting, taking `_._2` leaves just `(Grace,5217)`, which is exactly the element type of the original RDD.