Overview

This article walks through part of the Spark RDD API, with example code in Scala.

API reference

map

Applies a given function to every element of the RDD and returns a new RDD. Each element of the source RDD corresponds to exactly one element of the new RDD. For example:

val a = sc.parallelize(1 to 9, 3)
val b = a.map(x => x * 2)
a.collect  // Array[Int] = Array(1, 2, 3, 4, 5, 6, 7, 8, 9)
b.collect  // Array[Int] = Array(2, 4, 6, 8, 10, 12, 14, 16, 18)

filter

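Applies a given predicate to every element of the RDD and returns a new RDD that contains only the elements for which the predicate returns true. The result may therefore hold fewer elements than the source RDD.

val rdd = sc.parallelize(List(1, 2, 3, 4, 5, 6))
val filterRdd = rdd.filter(_ > 5)
filterRdd.collect()  // returns an Array of all elements greater than 5: Array(6)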

flatMap

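Similar to map. The difference is that map produces exactly one output element per input element, while flatMap can produce several output elements per input element, all of which go into the new RDD. Example: for each element x of the source RDD, produce the elements from 1 to x.

val a = sc.parallelize(1 to 4, 2)
val b = a.flatMap(x => 1 to x)
b.collect  // Array[Int] = Array(1, 1, 2, 1, 2, 3, 1, 2, 3, 4)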

mapPartitions

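A variant of map. The input function of map is applied to each element of the RDD, whereas the input function of mapPartitions is applied to each partition, so the contents of a partition are processed as a whole. It is defined as:
def mapPartitions[U: ClassTag](f: Iterator[T] => Iterator[U], preservesPartitioning: Boolean = false): RDD[U]
f is the input function that processes the contents of each partition. The contents of a partition are passed to f as an Iterator[T], and f returns an Iterator[U]. The final RDD is the combination of the results that f produces for all partitions. For example: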

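val a = sc.parallelize(1 to 9, 3)

// Pairs each element of a partition with its successor.
// Note: iter.next() throws if the partition is empty.
def myfunc[T](iter: Iterator[T]): Iterator[(T, T)] = {
  var res = List[(T, T)]()
  var pre = iter.next()
  while (iter.hasNext) {
    val cur = iter.next()
    res ::= ((pre, cur))  // prepends, so the pairs come out in reverse order
    pre = cur
  }
  res.iterator
}

a.mapPartitions(myfunc).collect()  // Array[(Int, Int)] = Array((2,3), (1,2), (5,6), (4,5), (8,9), (7,8))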

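The function myfunc in the example above pairs each element of a partition with the next element in a Tuple. Because the last element of a partition has no successor, (3,4) and (6,7) do not appear in the result. mapPartitions has several variants, such as mapPartitionsWithContext, which passes some state information about the running task to the user-supplied input function, and mapPartitionsWithIndex, which passes the partition index to the user-supplied input function.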
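As a variation on myfunc, the same pairing can be written with Iterator.sliding, which avoids the exception that iter.next() raises on an empty partition. This is only a minimal sketch: the names AdjacentPairs and adjacentPairs are made up for illustration, the function is exercised on plain iterators rather than on an RDD, and the pairs come out in forward order instead of reversed.

```scala
object AdjacentPairs {
  // Pairs each element with its successor inside a single partition.
  // sliding(2) yields windows of up to two elements; a partition with
  // fewer than two elements has no complete window, so it yields no pairs.
  def adjacentPairs[T](iter: Iterator[T]): Iterator[(T, T)] =
    iter.sliding(2).collect { case Seq(a, b) => (a, b) }
}

// Behaviour on plain iterators, one per simulated partition:
// AdjacentPairs.adjacentPairs(Iterator(1, 2, 3)).toList          gives List((1,2), (2,3))
// AdjacentPairs.adjacentPairs(List.empty[Int].iterator).toList   gives List()
```

With a SparkContext available, the function could presumably be passed to mapPartitions in place of myfunc.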

mapPartitionsWithIndex

def mapPartitionsWithIndex[U](f: (Int, Iterator[T]) => Iterator[U], preservesPartitioning: Boolean = false)(implicit arg0: ClassTag[U]): RDD[U]

Works like mapPartitions, except that the input function takes two arguments, the first of which is the index of the partition.

var rdd1 = sc.makeRDD(1 to 5, 2)
// rdd1 has two partitions
var rdd2 = rdd1.mapPartitionsWithIndex {
  (x, iter) => {
    var result = List[String]()
    var i = 0
    while (iter.hasNext) {
      i += iter.next()
    }
    result.::(x + "|" + i).iterator
  }
}

// rdd2 sums the numbers in each partition of rdd1 and prefixes each sum with the partition index
rdd2.collect()  // Array[String] = Array(0|3, 1|12)
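The result 0|3, 1|12 can be traced by hand without a cluster. The sketch below assumes that makeRDD(1 to 5, 2) places (1, 2) in partition 0 and (3, 4, 5) in partition 1, which is consistent with the sums 3 and 12. PartitionSums and sumWithIndex are hypothetical names for a more compact version of the per-partition function, applied here to plain Scala collections.

```scala
object PartitionSums {
  // Same shape as the function passed to mapPartitionsWithIndex:
  // (partition index, partition contents) => output iterator.
  def sumWithIndex(idx: Int, iter: Iterator[Int]): Iterator[String] =
    Iterator(s"$idx|${iter.sum}")

  // Assumed partition layout of makeRDD(1 to 5, 2).
  val partitions: Seq[Seq[Int]] = Seq(Seq(1, 2), Seq(3, 4, 5))

  // Apply the function to every (partition, index) pair and concatenate the outputs.
  val result: Seq[String] =
    partitions.zipWithIndex.flatMap { case (part, idx) =>
      sumWithIndex(idx, part.iterator)
    }
}
```

Here PartitionSums.result holds the two strings "0|3" and "1|12", one per partition.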

mapWith

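mapWith is another variant of map. map takes a single input function, whereas mapWith takes two. It is defined as:
def mapWith[A: ClassTag, U: ClassTag](constructA: Int => A, preservesPartitioning: Boolean = false)(f: (T, A) => U): RDD[U]
- The first function, constructA, takes the partition index of the RDD (starting from 0) as input and outputs a value of a new type A.
- The second function, f, takes the pair (T, A) as input, where T is an element of the source RDD and A is the output of the first function, and outputs a value of type U.
Example: multiply the partition index by 10, add 2, and pair the result with each element of that partition to form the new RDD.

val x = sc.parallelize(List(1, 2, 3, 4, 5, 6, 7, 8, 9, 10), 3)
x.mapWith(a => a * 10)((b, a) => (b, a + 2)).collect()

Result: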
(1,2)
(2,2)
(3,2)
(4,12)
(5,12)
(6,12)
(7,22)
(8,22)
(9,22)
(10,22)
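Note that mapWith was deprecated in Spark 1.0 in favour of mapPartitionsWithIndex and, to my knowledge, is no longer available from Spark 2.0 onward. The sketch below shows how the same per-partition logic might be written as a function for mapPartitionsWithIndex. MapWithEquivalent and tagWithIndex are hypothetical names, and the function is exercised on a plain iterator rather than on a real RDD.

```scala
object MapWithEquivalent {
  // Plays the role of both constructA and f from the mapWith example:
  // the index is transformed once per partition, then combined with every element.
  def tagWithIndex(idx: Int, iter: Iterator[Int]): Iterator[(Int, Int)] = {
    val a = idx * 10          // constructA: partition index => A
    iter.map(b => (b, a + 2)) // f: (T, A) => U
  }
}

// Intended use on an RDD (assumes a SparkContext named sc and the RDD x from the example):
// x.mapPartitionsWithIndex(MapWithEquivalent.tagWithIndex).collect()
```

For partition 1 holding 4, 5, 6, tagWithIndex yields (4,12), (5,12), (6,12), matching the result listed for mapWith.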

flatMapWith

flatMapWith is very similar to mapWith: it also takes two functions. The first takes the partition index as input and outputs a value of a new type A. The second takes the pair (T, A) as input and outputs a sequence, and the elements of these sequences make up the new RDD. It is defined as:
def flatMapWith[A: ClassTag, U: ClassTag](constructA: Int => A, preservesPartitioning: Boolean = false)(f: (T, A) => Seq[U]): RDD[U]
Example:

val a = sc.parallelize(List(1, 2, 3, 4, 5, 6, 7, 8, 9), 3)
a.flatMapWith(x => x, true)((x, y) => List(y, x)).collect()
// res58: Array[Int] = Array(0, 1, 0, 2, 0, 3, 1, 4, 1, 5, 1, 6, 2, 7, 2, 8, 2, 9)
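flatMapWith belongs to the family of *With methods that were deprecated in Spark 1.x (and, to my knowledge, removed in Spark 2.0), so newer code would typically express the same logic through mapPartitionsWithIndex. A minimal sketch of such a function follows; FlatMapWithEquivalent and interleaveIndex are hypothetical names, and the function is tested on a plain iterator.

```scala
object FlatMapWithEquivalent {
  // Emits the partition index before each element, mirroring
  // flatMapWith(x => x, true)((x, y) => List(y, x)) from the example.
  def interleaveIndex(idx: Int, iter: Iterator[Int]): Iterator[Int] =
    iter.flatMap(x => List(idx, x))
}

// Intended use on an RDD (assumes the RDD a from the example):
// a.mapPartitionsWithIndex(FlatMapWithEquivalent.interleaveIndex).collect()
```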

coalesce

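Repartitions an RDD into the given number of partitions. With shuffle = true the data is redistributed across the new partitions through a shuffle (which uses a HashPartitioner); with the default shuffle = false, existing partitions are merged without a shuffle. It is defined as:
def coalesce(numPartitions: Int, shuffle: Boolean = false)(implicit ord: Ordering[T] = null): RDD[T]
- The first parameter is the target number of partitions.
- The second parameter controls whether a shuffle is performed; it defaults to false.
Example:

val data = sc.parallelize(1 to 12, 3)
data.collect
data.partitions.size  // Int = 3
val rdd1 = data.coalesce(1)
rdd1.partitions.size  // Int = 1
val rdd2 = data.coalesce(4)
rdd2.partitions.size  // Int = 3; to end up with more partitions than the source has, shuffle must be set to true, otherwise the partition count stays unchanged
val rdd3 = data.coalesce(4, true)
rdd3.partitions.size  // Int = 4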
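The rule for the resulting partition count can be summarized in a few lines. This is a simplified model of the behaviour seen in the example, not Spark code: CoalesceModel and resultingPartitions are hypothetical names, and the model says nothing about how the data itself is assigned to partitions.

```scala
object CoalesceModel {
  // Partition count after coalesce(target, shuffle) on an RDD that currently
  // has `current` partitions: without a shuffle the count can only shrink,
  // with a shuffle it becomes exactly the requested target.
  def resultingPartitions(current: Int, target: Int, shuffle: Boolean = false): Int =
    if (shuffle) target else math.min(current, target)
}

// The three calls from the example, starting from 3 partitions:
// CoalesceModel.resultingPartitions(3, 1)        gives 1
// CoalesceModel.resultingPartitions(3, 4)        gives 3
// CoalesceModel.resultingPartitions(3, 4, true)  gives 4
```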

repartition

def repartition(numPartitions: Int)(implicit ord: Ordering[T] = null): RDD[T]

This function is simply coalesce with the second parameter (shuffle) set to true.

val data = sc.parallelize(1 to 12, 3)
data.collect
data.partitions.size  // Int = 3

val rdd1 = data.repartition(1)
rdd1.partitions.size  // Int = 1

val rdd2 = data.repartition(4)
rdd2.partitions.size  // Int = 4

randomSplit

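def randomSplit(weights: Array[Double], seed: Long = Utils.random.nextLong): Array[RDD[T]]

Splits one RDD into several RDDs according to the given weights. The function takes two parameters:

First parameter: the weights, as an array of Double.
Second parameter: the seed for the random number generator. It can usually be left at its default; fixing it makes the split reproducible.
Example:

var rdd = sc.makeRDD(1 to 10, 10)
// rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[16] at makeRDD at :21
rdd.collect  // Array[Int] = Array(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)

var splitRDD = rdd.randomSplit(Array(0.5, 0.1, 0.2, 0.2))
// splitRDD: Array[org.apache.spark.rdd.RDD[Int]] = Array(MapPartitionsRDD[17] at randomSplit at :23,
// MapPartitionsRDD[18] at randomSplit at :23,
// MapPartitionsRDD[19] at randomSplit at :23,
// MapPartitionsRDD[20] at randomSplit at :23)

// Note: the result of randomSplit is an array of RDDs
splitRDD.size  // res8: Int = 4

// Because four values are passed in weights, the source RDD is split into 4 RDDs.
// Its elements are assigned at random to these 4 RDDs with weights 0.5, 0.1, 0.2, 0.2; an RDD with a higher weight is more likely to receive a given element.
// Note: the weights are expected to sum to 1; according to the Spark API documentation they are normalized if they do not.

// Output of one run; the split is random, so results vary between runs
splitRDD(0).collect  // res10: Array[Int] = Array(1, 4)
splitRDD(1).collect  // res11: Array[Int] = Array(3)
splitRDD(2).collect  // res12: Array[Int] = Array(5, 9)
splitRDD(3).collect  // res13: Array[Int] = Array(2, 6, 7, 8, 10)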
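To see why a weight determines the expected share of a split, the idea can be modelled with plain Scala collections: the weights are normalized to sum to 1, turned into cumulative boundaries, and each element goes to the split whose range contains its random draw. This is a simplified sketch and not Spark's code: Spark samples partition by partition, so the actual elements in each split differ. RandomSplitModel, boundaries, and split are hypothetical names.

```scala
import scala.util.Random

object RandomSplitModel {
  // Normalizes the weights and returns the cumulative boundaries, e.g.
  // Array(2.0, 1.0, 1.0) becomes Array(0.0, 0.5, 0.75, 1.0).
  def boundaries(weights: Array[Double]): Array[Double] = {
    val sum = weights.sum
    weights.map(_ / sum).scanLeft(0.0)(_ + _)
  }

  // Draws one uniform number in [0, 1) per element and assigns the element
  // to the split whose [lo, hi) range contains that number.
  def split[T](data: Seq[T], weights: Array[Double], seed: Long): List[Seq[T]] = {
    val rng = new Random(seed)
    val draws = data.map(x => (x, rng.nextDouble()))
    boundaries(weights).toList.sliding(2).collect { case List(lo, hi) =>
      draws.collect { case (x, d) if d >= lo && d < hi => x }
    }.toList
  }
}

// Three splits with expected shares of 50%, 25% and 25%;
// the same seed always produces the same assignment.
// val parts = RandomSplitModel.split(1 to 100, Array(2.0, 1.0, 1.0), seed = 42L)
```

A wider range catches more random draws, which is why the split with the largest weight tends to receive the most elements, although any single run can deviate from the expected shares.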

glom

def glom(): RDD[Array[T]]

Converts the elements of type T in each partition into an Array[T], so that each partition holds exactly one array element.

var rdd = sc.makeRDD(1 to 10, 3)
// rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[38] at makeRDD at :21

rdd.partitions.size  // res33: Int = 3, the RDD has 3 partitions
rdd.glom().collect
// res35: Array[Array[Int]] = Array(Array(1, 2, 3), Array(4, 5, 6), Array(7, 8, 9, 10))
// glom puts the elements of each partition into one array, so the result consists of 3 arrays
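The grouping in this output, including the four-element last array, can be reproduced with plain Scala collections. The sketch below assumes that makeRDD cuts a sequence of length n into k contiguous slices, with slice i covering positions from i*n/k up to (but excluding) (i+1)*n/k. That rule is an assumption that matches the output shown here, and GlomModel, slice, and glom are hypothetical names.

```scala
object GlomModel {
  // Assumed slicing rule: slice i covers positions [i*n/k, (i+1)*n/k),
  // using integer division, so later slices absorb the remainder.
  def slice[T](data: Seq[T], numSlices: Int): Seq[Seq[T]] =
    (0 until numSlices).map { i =>
      val start = (i * data.length.toLong / numSlices).toInt
      val end = ((i + 1) * data.length.toLong / numSlices).toInt
      data.slice(start, end)
    }

  // Models glom: every partition becomes a single array.
  def glom(data: Seq[Int], numSlices: Int): Seq[Array[Int]] =
    slice(data, numSlices).map(_.toArray)
}

// GlomModel.glom(1 to 10, 3) gives three arrays: (1, 2, 3), (4, 5, 6), (7, 8, 9, 10)
```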

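union

Merges two datasets into one. Duplicate elements are kept.

val rdd1 = sc.parallelize(List(5, 6, 4, 3))
val rdd2 = sc.parallelize(List(1, 2, 3, 4))

// union of the two RDDs
val rdd3 = rdd1.union(rdd2)
rdd3.collect  // Array[Int] = Array(5, 6, 4, 3, 1, 2, 3, 4)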

distinct

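Removes duplicate elements from a dataset and returns a new dataset. The example applies it to the union of two datasets.

val rdd1 = sc.parallelize(List(5, 6, 4, 3))
val rdd2 = sc.parallelize(List(1, 2, 3, 4))

// union of the two RDDs
val rdd3 = rdd1.union(rdd2)

// print the result with duplicates removed
rdd3.distinct.collect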
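In the Spark versions I have looked at, distinct is built by turning every element into a key and reducing by key, roughly map(x => (x, null)).reduceByKey(...).map(_._1). The sketch below models that idea on plain Scala collections; DistinctModel and distinctByKey are hypothetical names, and groupBy stands in for the distributed reduceByKey. The result is a Set because the order of the elements after deduplication should not be relied on.

```scala
object DistinctModel {
  // Keys every element by itself, groups equal keys together,
  // and keeps one representative per key.
  def distinctByKey[T](data: Seq[T]): Set[T] =
    data.map(x => (x, null)).groupBy(_._1).keySet

  val left = List(5, 6, 4, 3)
  val right = List(1, 2, 3, 4)
  val unioned: List[Int] = left ++ right          // 8 elements; 3 and 4 appear twice
  val deduped: Set[Int] = distinctByKey(unioned)  // 6 elements
}
```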

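Summary

A journey of a thousand miles is made of small steps. Do not overlook any small problem or small bit of progress; accumulate a little every day, and eventually quantity turns into quality.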
