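This article walks through the commonly used Spark RDD transformations. For each one it gives the signature from the source code, and for some of them a usage example.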

1.map

def map[U: ClassTag](f: T => U): RDD[U]

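A one-to-one transformation.
Returns a new RDD formed by passing each input element through the function f.
For example, generate the numbers 1 to 100 and multiply each element by 2: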

val rdd = sc.makeRDD(1 to 100)
rdd.map(_*2).collect
// or, equivalently:
rdd.map(x => x*2).collect

2.filter

def filter(f: T => Boolean): RDD[T]

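Filters elements using a predicate that returns a Boolean.
Returns a new RDD made up of the input elements for which f returns true.
For example, keep the multiples of 3: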

rdd.filter(_%3==0).collect

3.flatMap

 def flatMap[U: ClassTag](f: T => TraversableOnce[U]): RDD[U]

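A one-to-many transformation whose results are flattened.
Similar to map, but each input element can be mapped to zero or more output elements (so f should return a sequence rather than a single element).
For example, expand each element n into the range 1 to n: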

rdd.flatMap(1 to _).collect

4.mapPartitions

def mapPartitions[U: ClassTag]( f: Iterator[T] => Iterator[U],preservesPartitioning: Boolean = false): RDD[U]

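Runs a function once over all the data in each partition. This can be faster than map when f has a per-call setup cost, because that cost is paid once per partition rather than once per element.
Similar to map, but runs independently on each partition of the RDD, so on an RDD of type T the function f must have type Iterator[T] => Iterator[U]. With N elements in M partitions, the function passed to map is called N times, while the function passed to mapPartitions is called M times, each call processing one whole partition.
For example, append "hello" to every multiple of 3: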

rdd.mapPartitions(x=>x.filter(_%3==0).map(_+"hello")).collect

5.mapPartitionsWithIndex

 def mapPartitionsWithIndex[U: ClassTag](f: (Int, Iterator[T]) => Iterator[U], preservesPartitioning: Boolean = false): RDD[U]

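Runs a function over the data in each partition, with the partition index available.
Similar to mapPartitions, but f takes an additional integer argument giving the index of the partition, so on an RDD of type T the function f must have type (Int, Iterator[T]) => Iterator[U].
For example, output each partition's data in the form partitionIndex:[item1,item2,…,itemN], where i is the partition index and items is the data in partition i: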

rdd.mapPartitionsWithIndex((i,items)=>Iterator(i+":["+items.mkString(",")+"]")).collect

6.sample

 def sample(withReplacement: Boolean, fraction: Double, seed: Long = Utils.random.nextLong): RDD[T]

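Mainly used for sampling; it can be useful when dealing with data skew.
Randomly samples the data using the given random seed. fraction is the expected size of the sample as a fraction of the RDD, so the number of elements returned is approximate rather than exact. withReplacement says whether sampled elements are put back: true samples with replacement (an element may be picked more than once), false samples without replacement. seed sets the seed of the random number generator.
For example, sample with replacement at a fraction of 0.3, using seed 5: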

rdd.sample(true,0.3,5).collect

7.union

 def union(other: RDD[T]): RDD[T]

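Combines this RDD with another one and returns the result as a new RDD.
Returns the union of the source RDD and the argument RDD. Duplicate elements are kept; use distinct to remove them.
For example: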

sc.makeRDD(1 to 10).union(sc.makeRDD(11 to 20)).collect

8.intersection

 def intersection(other: RDD[T]): RDD[T]

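Computes the intersection.
Returns a new RDD containing the elements that appear in both the source RDD and the argument RDD.
For example: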

sc.makeRDD(1 to 10).intersection(sc.makeRDD(10 to 20)).collect

9.distinct

def distinct(numPartitions: Int)(implicit ord: Ordering[T] = null): RDD[T] 
def distinct(): RDD[T]

Removes duplicates. This triggers a shuffle, and the result is not sorted by default.
For example:

sc.makeRDD(1 to 10).union(sc.makeRDD(10 to 20)).distinct.collect

10.partitionBy

def partitionBy(partitioner: Partitioner): RDD[(K, V)]

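Repartitions a key-value RDD using the supplied partitioner.
For example, turn rdd into pairs and split it into 5 partitions with a hash partitioner: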

rdd.map(x=>(x,x)).partitionBy(new org.apache.spark.HashPartitioner(5))

11.reduceByKey

def reduceByKey(func: (V, V) => V): RDD[(K, V)]

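Aggregates values by key. It pre-aggregates, that is, values are combined once within each partition before the shuffle, which reduces network traffic.
Called on an RDD of (K,V) pairs, it returns an RDD of (K,V) pairs in which the values for each key are combined using the given reduce function. The number of reduce tasks can be set through an optional second argument.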
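
A minimal sketch of reduceByKey, assuming it is run in spark-shell or as a script with Spark on the classpath. The names words, counts and countsIn3 are made up for illustration.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Assumption: reuse the shell's context if there is one, otherwise start a local one.
val sc = SparkContext.getOrCreate(new SparkConf().setMaster("local[2]").setAppName("sketch"))

val words = sc.makeRDD(Seq(("a", 1), ("b", 1), ("a", 1), ("c", 1), ("a", 1)), 2)

// Sum the values of each key; values are combined inside each partition first.
val counts = words.reduceByKey(_ + _)

// The optional second argument sets the number of reduce tasks (output partitions).
val countsIn3 = words.reduceByKey(_ + _, 3)

println(counts.collect.sorted.mkString(", "))   // (a,3), (b,1), (c,1)
println(countsIn3.getNumPartitions)             // 3
```

The results are sorted on the driver before printing because the order of keys after a shuffle is not guaranteed.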

12.groupByKey

def groupByKey(numPartitions: Int): RDD[(K, Iterable[V])]

Gathers the values that share a key into one group, without aggregating them any further.
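
A small sketch of groupByKey, under the same assumption of a spark-shell session or a script with Spark on the classpath. The data and the names scores, grouped and sums are illustrative only.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Assumption: reuse the shell's context if there is one, otherwise start a local one.
val sc = SparkContext.getOrCreate(new SparkConf().setMaster("local[2]").setAppName("sketch"))

val scores = sc.makeRDD(Seq(("a", 90), ("a", 80), ("b", 76), ("b", 84), ("c", 90)), 2)

// groupByKey only gathers the values of each key into an Iterable.
val grouped = scores.groupByKey(2)   // RDD[(String, Iterable[Int])]

// Any aggregation has to be applied afterwards, for example with mapValues.
val sums = grouped.mapValues(_.sum)

println(sums.collect.sorted.mkString(", "))   // (a,170), (b,160), (c,90)
```

When the goal is a sum like this one, reduceByKey is usually the better choice, since it combines values before the shuffle.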

13.combineByKey

def combineByKey[C]( createCombiner: V => C, mergeValue: (C, V) => C,mergeCombiners: (C, C) => C,numPartitions: Int): RDD[(K, C)]

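The most important one: it is the most general of the by-key aggregations.

For each key K, merges the values V into a combined value of type C.
createCombiner: combineByKey() walks through all the elements in a partition, so each element's key has either not been seen yet or is the same as the key of an earlier element. If the key is new, combineByKey() uses a function called createCombiner() to create the initial value of the accumulator for that key.
mergeValue: if the key has already been seen while processing the current partition, it uses mergeValue() to merge the accumulator's current value for that key with the new value.
mergeCombiners: because each partition is processed independently, there can be several accumulators for the same key. If two or more partitions have an accumulator for the same key, the user-supplied mergeCombiners() is used to merge the results from the partitions.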

For example, compute each student's average score:

val rdd = sc.makeRDD(Array(("a",90),("a",80),("a",60),("b",76),("b",84),("b",96),("c",90),("c",86)))
      
rdd.combineByKey(
v=>(v,1),
(c:(Int,Int),v)=>(c._1+v,c._2+1),
(c1:(Int,Int),c2:(Int,Int))=>(c1._1+c2._1,c1._2+c2._2)
).map{case (k,v:(Int,Int))=>(k,v._1/v._2)}.collect

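Or collect first and compute the averages on the driver (in both versions v._1/v._2 is integer division, so the averages are truncated):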

val result = rdd.combineByKey(
v=>(v,1),
(c:(Int,Int),v)=>(c._1+v,c._2+1),
(c1:(Int,Int),c2:(Int,Int))=>(c1._1+c2._1,c1._2+c2._2)
).collect
result.map{case (k,v:(Int,Int))=>(k,v._1/v._2)}

14.aggregateByKey

def aggregateByKey[U: ClassTag](zeroValue: U, partitioner: Partitioner)(seqOp: (U, V) => U,combOp: (U, U) => U): RDD[(K, U)]

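A simplified version of combineByKey in which the initial value is supplied directly through zeroValue.
On an RDD of key-value pairs, the values are aggregated per key in two stages. Within each partition, the values of a key are folded one at a time into an accumulator that starts at zeroValue, using seqOp. The per-partition results for the same key are then merged with combOp (the first two results are combined, that result is combined with the next one, and so on), and each key is emitted together with its final result as a new key-value pair.
In short, seqOp folds the values into the initial value within a partition, and combOp merges the results from different partitions.
The following three cases show aggregateByKey side by side with combineByKey: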

Case 1:

Take the maximum value per key within each partition, then sum those maxima per key across all partitions.

val rdd = sc.parallelize(List((1,3),(1,2),(1,4),(2,3),(3,6),(3,8)),3)
rdd.aggregateByKey(0)(math.max(_,_),_+_).collect

With combineByKey:

rdd.combineByKey(
v=>math.max(0,v),
(c:Int,v)=>math.max(c,v),
(c1:Int,c2:Int)=>c1+c2
).collect

Case 2:

Find the maximum value for each key: first the maximum within each partition, then the overall maximum.

val rdd = sc.parallelize(List((1,3),(1,2),(1,4),(2,3),(3,6),(3,8)),3)
rdd.aggregateByKey(0)(math.max(_,_),math.max(_,_)).collect

With combineByKey:

rdd.combineByKey(
v=>math.max(0,v),
(c:Int,v)=>math.max(c,v),
(c1:Int,c2:Int)=>math.max(c1,c2)
).collect

Case 3:

Compute the average per key.

val rdd1 = sc.makeRDD(Array(("a",90),("a",80),("a",60),("b",76),("b",84),("b",96),("c",90),("c",86)))
rdd1.aggregateByKey((0,0))((c:(Int,Int),v)=>(c._1+v,c._2+1),(c1:(Int,Int),c2:(Int,Int))=>(c1._1+c2._1,c1._2+c2._2)).map{case (k,v:(Int,Int))=>(k,v._1/v._2)}.collect

15.foldByKey

def foldByKey(zeroValue: V, partitioner: Partitioner)(func: (V, V) => V): RDD[(K, V)]

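A simplified version of aggregateByKey in which seqOp and combOp are the same function.
For example, find the maximum value for each key: first the maximum within each partition, then the overall maximum.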

val rdd = sc.parallelize(List((1,3),(1,2),(1,4),(2,3),(3,6),(3,8)),3)
rdd.foldByKey(0)(math.max(_,_)).collect

16.sortByKey

def sortByKey(ascending: Boolean = true, numPartitions: Int = self.partitions.length): RDD[(K, V)]

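Sorts by key. If the key type has no ordering yet, supply one so that Spark knows how to compare keys: either have the key class extend Ordered and implement compare, or put an implicit Ordering for the key type in scope.
For example: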

rdd.sortByKey().collect
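
A sketch of sorting by a key type that has no built-in ordering, assuming a spark-shell session or a script with Spark on the classpath. The Student class, the byScore ordering and the sample data are hypothetical and exist only for this illustration.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Assumption: reuse the shell's context if there is one, otherwise start a local one.
val sc = SparkContext.getOrCreate(new SparkConf().setMaster("local[2]").setAppName("sketch"))

// Hypothetical key type without a built-in ordering.
case class Student(name: String, score: Int)

// Tell Spark how to compare keys: order students by score.
implicit val byScore: Ordering[Student] = Ordering.by((s: Student) => s.score)

val byStudent = sc.makeRDD(Seq(
  (Student("x", 84), "b"), (Student("y", 60), "c"), (Student("z", 96), "a")))

val ascendingLabels  = byStudent.sortByKey().collect.map(_._2)
val descendingLabels = byStudent.sortByKey(ascending = false).collect.map(_._2)

println(ascendingLabels.mkString(","))    // c,b,a
println(descendingLabels.mkString(","))   // a,b,c
```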

17.sortBy

def sortBy[K]( f: (T) => K, ascending: Boolean = true, numPartitions: Int = this.partitions.length) (implicit ord: Ordering[K], ctag: ClassTag[K]): RDD[T]

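Sorts the RDD by the key that the function f computes for each element; that key must be sortable.
For example, sort by the remainder modulo 4: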

val rdd = sc.makeRDD(1 to 10)
scala> rdd.sortBy(x=>x%4).collect
res29: Array[Int] = Array(4, 8, 1, 5, 9, 2, 6, 10, 3, 7)

18.join

def join[W](other: RDD[(K, W)], partitioner: Partitioner): RDD[(K, (V, W))]

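Joins the values of two RDDs on matching keys.
Called on RDDs of type (K,V) and (K,W), it returns an RDD of (K,(V,W)) containing every pairing of the elements that share a key.
join: keeps only the keys present on both sides.
leftOuterJoin: keeps all the data from the left RDD.
rightOuterJoin: keeps all the data from the right RDD.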
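
The three join variants can be compared with a small sketch, assuming a spark-shell session or a script with Spark on the classpath. The RDDs left and right and their contents are made up for illustration.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Assumption: reuse the shell's context if there is one, otherwise start a local one.
val sc = SparkContext.getOrCreate(new SparkConf().setMaster("local[2]").setAppName("sketch"))

val left  = sc.makeRDD(Seq((1, "a"), (2, "b"), (3, "c")))
val right = sc.makeRDD(Seq((1, "x"), (3, "y"), (4, "z")))

// join: only keys present on both sides (1 and 3).
val inner = left.join(right).collect.sortBy(_._1)
// leftOuterJoin: every left key; a missing right value becomes None.
val leftOuter = left.leftOuterJoin(right).collect.sortBy(_._1)
// rightOuterJoin: every right key; a missing left value becomes None.
val rightOuter = left.rightOuterJoin(right).collect.sortBy(_._1)

println(inner.mkString(", "))       // (1,(a,x)), (3,(c,y))
println(leftOuter.mkString(", "))   // (1,(a,Some(x))), (2,(b,None)), (3,(c,Some(y)))
println(rightOuter.mkString(", "))  // (1,(Some(a),x)), (3,(Some(c),y)), (4,(None,z))
```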

19.cogroup

def cogroup[W](other: RDD[(K, W)], partitioner: Partitioner): RDD[(K, (Iterable[V], Iterable[W]))]

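Groups the data of each RDD by key separately: for every key, the result holds the values from this RDD and the values from other as two collections.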
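
A minimal cogroup sketch, assuming a spark-shell session or a script with Spark on the classpath. The RDDs and the name grouped are illustrative; the Iterables are converted to sorted lists only to make the printed output stable.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Assumption: reuse the shell's context if there is one, otherwise start a local one.
val sc = SparkContext.getOrCreate(new SparkConf().setMaster("local[2]").setAppName("sketch"))

val left  = sc.makeRDD(Seq((1, "a"), (1, "b"), (2, "c")))
val right = sc.makeRDD(Seq((1, "x"), (3, "y")))

// For every key seen on either side: (values from left, values from right).
val grouped = left.cogroup(right)
  .mapValues { case (vs, ws) => (vs.toList.sorted, ws.toList.sorted) }
  .collect.sortBy(_._1)

grouped.foreach(println)
// (1,(List(a, b),List(x)))
// (2,(List(c),List()))
// (3,(List(),List(y)))
```

Unlike join, a key that exists on only one side still appears, paired with an empty collection for the other side.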

20.cartesian

def cartesian[U: ClassTag](other: RDD[U]): RDD[(T, U)]

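Computes the Cartesian product. An RDD of n elements and one of m elements produce n*m pairs, so the result grows quickly.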
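
A short sketch of the n*m growth, assuming a spark-shell session or a script with Spark on the classpath. The names letters, numbers and pairs are made up for illustration.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Assumption: reuse the shell's context if there is one, otherwise start a local one.
val sc = SparkContext.getOrCreate(new SparkConf().setMaster("local[2]").setAppName("sketch"))

val letters = sc.makeRDD(Seq("a", "b"))
val numbers = sc.makeRDD(Seq(1, 2, 3))

// Every element of letters is paired with every element of numbers.
val pairs = letters.cartesian(numbers)

println(pairs.count)                           // 6, that is 2 * 3
println(pairs.collect.sorted.mkString(", "))
// (a,1), (a,2), (a,3), (b,1), (b,2), (b,3)
```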

21.pipe

def pipe(command: String): RDD[String]

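Runs an external script. The elements of each partition are written to the script's standard input, and the lines it prints become the elements of the resulting RDD.
Create the file pip.sh under /home/dendan/: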

#!/bin/sh
echo "AA"
while read LINE;do
    echo ">>>"${LINE}
done

Call it:

scala> val rdd= sc.parallelize(List("hi","hello","how","are","you"),1)
scala> rdd.pipe("/home/dendan/pip.sh").collect
res1: Array[String] = Array(AA, >>>hi, >>>hello, >>>how, >>>are, >>>you)

22.coalesce

def coalesce(numPartitions: Int, shuffle: Boolean = false, partitionCoalescer: Option[PartitionCoalescer] = Option.empty) (implicit ord: Ordering[T] = null): RDD[T]

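Reduces the number of partitions. Useful after filtering a large dataset, to make the remaining small dataset run more efficiently.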

23.repartition

def repartition(numPartitions: Int)(implicit ord: Ordering[T] = null): RDD[T]

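Redistributes the data into the given number of partitions. It always performs a shuffle, and it can either increase or decrease the partition count.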
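
coalesce and repartition can be compared by looking at partition counts. This is a sketch, assuming a spark-shell session or a script with Spark on the classpath; the names big and small and the sizes used are arbitrary.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Assumption: reuse the shell's context if there is one, otherwise start a local one.
val sc = SparkContext.getOrCreate(new SparkConf().setMaster("local[2]").setAppName("sketch"))

val big = sc.makeRDD(1 to 1000, 10)
val small = big.filter(_ % 100 == 0)   // 10 elements, still spread over 10 partitions

println(small.getNumPartitions)                                // 10
println(small.coalesce(2).getNumPartitions)                    // 2, without a shuffle
println(small.coalesce(20).getNumPartitions)                   // 10, cannot grow without a shuffle
println(small.coalesce(20, shuffle = true).getNumPartitions)   // 20
println(small.repartition(20).getNumPartitions)                // 20
```

With shuffle left at its default of false, coalesce only merges existing partitions, which is why asking for 20 partitions leaves the count at 10.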

24.repartitionAndSortWithinPartitions

def repartitionAndSortWithinPartitions(partitioner: Partitioner): RDD[(K, V)]

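If the data needs to be sorted after repartitioning, use this directly.
repartitionAndSortWithinPartitions is a variant of repartition. Unlike repartition, it partitions the data with the given partitioner and sorts the records by key within each resulting partition. This is more efficient than calling repartition and then sorting each partition, because the sorting is pushed down into the shuffle.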
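
A sketch of the per-partition ordering, assuming a spark-shell session or a script with Spark on the classpath. The data, the choice of a HashPartitioner with 2 partitions, and the names pairs and sorted are illustrative.

```scala
import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}

// Assumption: reuse the shell's context if there is one, otherwise start a local one.
val sc = SparkContext.getOrCreate(new SparkConf().setMaster("local[2]").setAppName("sketch"))

val pairs = sc.makeRDD(Seq((5, "e"), (2, "b"), (4, "d"), (1, "a"), (3, "c"), (6, "f")), 3)

// Hash-partition the keys into 2 partitions and sort by key inside each one.
val sorted = pairs.repartitionAndSortWithinPartitions(new HashPartitioner(2))

// glom exposes the contents of each partition so the ordering is visible.
sorted.glom.collect.foreach(p => println(p.mkString(", ")))
// (2,b), (4,d), (6,f)
// (1,a), (3,c), (5,e)
```

Each partition is sorted on its own; the RDD as a whole is not globally sorted, since even keys land in partition 0 and odd keys in partition 1 here.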

25.glom

def glom(): RDD[Array[T]]

Turns the elements of each partition into an array.
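
A minimal glom sketch, assuming a spark-shell session or a script with Spark on the classpath. The names nums and maxPerPartition are made up for illustration.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Assumption: reuse the shell's context if there is one, otherwise start a local one.
val sc = SparkContext.getOrCreate(new SparkConf().setMaster("local[2]").setAppName("sketch"))

val nums = sc.makeRDD(1 to 6, 3)

// One array per partition.
nums.glom.collect.foreach(a => println(a.mkString("[", ",", "]")))
// [1,2]
// [3,4]
// [5,6]

// A typical use: compute something per partition, here the maximum.
val maxPerPartition = nums.glom.map(_.max).collect
println(maxPerPartition.mkString(","))   // 2,4,6
```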

26.mapValues

def mapValues[U](f: V => U): RDD[(K, U)]

For an RDD of key-value pairs, transforms only the values and leaves the keys unchanged.
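
A sketch of mapValues next to an equivalent map, assuming a spark-shell session or a script with Spark on the classpath. The data and the names pairs and scaled are illustrative.

```scala
import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}

// Assumption: reuse the shell's context if there is one, otherwise start a local one.
val sc = SparkContext.getOrCreate(new SparkConf().setMaster("local[2]").setAppName("sketch"))

val pairs = sc.makeRDD(Seq(("a", 1), ("b", 2), ("c", 3))).partitionBy(new HashPartitioner(2))

// Only the values are transformed.
val scaled = pairs.mapValues(_ * 10)
println(scaled.collect.sorted.mkString(", "))   // (a,10), (b,20), (c,30)

// Because the keys cannot change, mapValues keeps the partitioner; map drops it.
println(scaled.partitioner == pairs.partitioner)                        // true
println(pairs.map { case (k, v) => (k, v * 10) }.partitioner.isEmpty)   // true
```

Keeping the partitioner matters when a by-key operation follows, since Spark can then skip a shuffle that map would force.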

27.subtract

def subtract(other: RDD[T]): RDD[T]

Returns the elements of this RDD that do not appear in other.
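
A short subtract sketch, assuming a spark-shell session or a script with Spark on the classpath. The names a, b and onlyInA are made up for illustration.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Assumption: reuse the shell's context if there is one, otherwise start a local one.
val sc = SparkContext.getOrCreate(new SparkConf().setMaster("local[2]").setAppName("sketch"))

val a = sc.makeRDD(1 to 10)
val b = sc.makeRDD(5 to 20)

// Elements of a that are not in b.
val onlyInA = a.subtract(b).collect.sorted
println(onlyInA.mkString(","))   // 1,2,3,4
```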
