Spark Operations: Action Operations (Part 1)

  • Collection and scalar action operations

  • Storage action operations

 

Collection and Scalar Action Operations

  • first(): T    Returns the first element of the RDD, without sorting

  • count(): Long    Returns the number of elements in the RDD

  • reduce(f: (T, T) => T): T    Performs a binary reduction over the elements using the function f

  • collect(): Array[T]    Returns the elements of the RDD as an array on the driver

  • take(num: Int): Array[T]    Returns the first num elements of the RDD (indices 0 through num-1), without sorting

  • top(num: Int): Array[T]    Returns the top num elements, in descending order by default or according to a specified Ordering; a sketch using a custom Ordering follows the examples below
  • takeOrdered(num: Int): Array[T]    Similar to top, but returns elements in the opposite (ascending) order
scala> var rdd = sc.makeRDD(Array(("A", 1), ("A", 2), ("A", 3), ("B", 4), ("B", 5), ("C", 6), ("C", 7), ("C", 8), ("C", 9), ("D", 10)))
rdd: org.apache.spark.rdd.RDD[(String, Int)] = ParallelCollectionRDD[60] at makeRDD at <console>:24

scala> rdd.collect
res50: Array[(String, Int)] = Array((A,1), (A,2), (A,3), (B,4), (B,5), (C,6), (C,7), (C,8), (C,9), (D,10))

scala> rdd.count()
res46: Long = 10

scala> rdd.first()
res45: (String, Int) = (A,1)

scala> rdd.reduce((x, y) => (x._1 + y._1, x._2 + y._2))
res49: (String, Int) = (AACCABBCCD,55)

scala> rdd.take(2)
res51: Array[(String, Int)] = Array((A,1), (A,2))

scala> rdd.top(1)
res54: Array[(String, Int)] = Array((D,10))

scala> rdd.takeOrdered(1)
res56: Array[(String, Int)] = Array((A,1))

scala> rdd.takeOrdered(2)
res57: Array[(String, Int)] = Array((A,1), (A,2))
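
To illustrate the "specified ordering" case mentioned above, here is a minimal sketch (not from the original article) that passes an explicit Ordering so top ranks the pairs by their Int value:

// Hypothetical sketch: rank pairs by the second (Int) field instead of the default tuple ordering
rdd.top(2)(Ordering.by[(String, Int), Int](_._2))   // expected: Array((D,10), (C,9))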
  • aggregate[U](zeroValue: U)(seqOp: (U, T) => U, combOp: (U, U) => U)(implicit arg0: ClassTag[U]): U

Aggregates the elements of the RDD. seqOp first folds the T-typed elements within each partition into a value of type U; combOp then merges the per-partition U values into a single U. Note that zeroValue is applied by both seqOp (once per partition) and combOp (once when merging the partition results).

// Define the rdd with two partitions: the first holds 1,2,3,4,5 and the second holds 6,7,8,9,10
scala> var rdd = sc.makeRDD(1 to 10, 2)
rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[65] at makeRDD at <console>:24

scala> rdd.mapPartitionsWithIndex{
     |     (partIdx, iter) => {
     |         var part_map = scala.collection.mutable.Map[String, List[(Int)]]()
     |         while(iter.hasNext){
     |             var part_name = "part_" + partIdx;
     |             var elem = iter.next()
     |             if(part_map.contains(part_name)) {
     |                 var elems = part_map(part_name)
     |                 elems ::= elem
     |                 part_map(part_name) = elems
     |             }
     |             else{
     |                 part_map(part_name) = List[(Int)]{elem}
     |             }
     |         }
     |         part_map.iterator
     |     }
     | }.collect
res59: Array[(String, List[Int])] = Array((part_0,List(5, 4, 3, 2, 1)), (part_1,List(10, 9, 8, 7, 6)))

// The final result of aggregate is 58: within each partition (x: Int, y: Int) => x + y is applied iteratively, starting from the zeroValue 1,
// i.e. part_0 computes 1+1+2+3+4+5=16 and part_1 computes 1+6+7+8+9+10=41;
// the two partition results are then combined with (a: Int, b: Int) => a + b, again applying zeroValue: 1+16+41=58
scala> rdd.aggregate(1)(
     |     {(x: Int, y: Int) => x + y},
     |     {(a: Int, b: Int) => a + b}
     | )
res61: Int = 58
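
The accumulator type U does not have to match the element type T. As a hypothetical sketch (names and data are illustrative, not from the original), aggregate can compute a (sum, count) pair in one pass and derive an average:

// Hypothetical sketch: accumulator type U = (Int, Int) differs from element type T = Int
val nums = sc.makeRDD(1 to 10, 2)
val (sum, cnt) = nums.aggregate((0, 0))(
  (acc, x) => (acc._1 + x, acc._2 + 1),    // seqOp: fold one element into the (sum, count) accumulator
  (a, b)   => (a._1 + b._1, a._2 + b._2)   // combOp: merge per-partition accumulators
)
val avg = sum.toDouble / cnt               // expected: 55 / 10 = 5.5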
  • fold(zeroValue: T)(op: (T, T) => T): T

fold works like aggregate, except that seqOp and combOp are one and the same function op.

scala> rdd.fold(1)(
     | (x, y) => x + y
     | )
res63: Int = 58
  • lookup(key: K): Seq[V]

This operation applies to RDDs of (K, V) pairs and returns all V values associated with the given key K.

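A minimal lookup sketch, assuming the same (String, Int) pair data used in the other examples of this article:

// Hypothetical sketch: look up every value stored under key "A"
val pairRdd = sc.makeRDD(Array(("A", 1), ("A", 2), ("A", 3), ("B", 4), ("B", 5)))
pairRdd.lookup("A")   // expected: Seq(1, 2, 3)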
  • countByKey(): Map[K, Long]

Counts the number of elements for each key K in an RDD[(K, V)].

scala> var rdd = sc.makeRDD(Array(("A", 1), ("A", 2), ("A", 3), ("B", 4), ("B", 5), ("C", 6), ("C", 7), ("C", 8), ("C", 9), ("D", 10)))
rdd: org.apache.spark.rdd.RDD[(String, Int)] = ParallelCollectionRDD[67] at makeRDD at <console>:24

scala> rdd.countByKey()
res65: scala.collection.Map[String,Long] = Map(D -> 1, A -> 3, B -> 2, C -> 4)
  • foreach(f: (T) => Unit): Unit

  • foreachPartition(f: (Iterator[T]) => Unit): Unit

foreach applies the function f to every element of the RDD. foreachPartition is similar, except that f is invoked once per partition with an Iterator over that partition's elements; a foreachPartition sketch follows the foreach example below.

scala> var rdd = sc.makeRDD(Array(("A", 1), ("A", 2), ("A", 3), ("B", 4), ("B", 5), ("C", 6), ("C", 7), ("C", 8), ("C", 9), ("D", 10)))
rdd: org.apache.spark.rdd.RDD[(String, Int)] = ParallelCollectionRDD[67] at makeRDD at <console>:24

scala> rdd.foreach(println)
(A,1)
(A,3)
(C,8)
(C,6)
(C,9)
(B,4)
(A,2)
(B,5)
(D,10)
(C,7)
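
A minimal foreachPartition sketch (note that, as with foreach, the function runs on the executors, so println output may appear in the executor logs rather than on the driver console):

// Hypothetical sketch: process one whole partition at a time, e.g. count its elements
rdd.foreachPartition { iter =>
  // iter is an Iterator over a single partition's (String, Int) elements
  println(s"elements in this partition: ${iter.size}")
}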
  • sortBy[K](f: (T) => K, ascending: Boolean=true, numPartitions: Int=this.partitions.length): RDD[T]

Sorts the elements of the RDD according to the sort key produced by the function f; a sketch with a derived sort key follows the examples below.

scala> var rdd = sc.makeRDD(Array(("A", 1), ("A", 2), ("A", 3), ("B", 4), ("B", 5), ("C", 6), ("C", 7), ("C", 8), ("C", 9), ("D", 10)))
rdd: org.apache.spark.rdd.RDD[(String, Int)] = ParallelCollectionRDD[67] at makeRDD at <console>:24

scala> rdd.sortBy(x => x).collect
res68: Array[(String, Int)] = Array((A,1), (A,2), (A,3), (B,4), (B,5), (C,6), (C,7), (C,8), (C,9), (D,10))

scala> rdd.sortBy(x => x, false).collect
res70: Array[(String, Int)] = Array((D,10), (C,9), (C,8), (C,7), (C,6), (B,5), (B,4), (A,3), (A,2), (A,1))
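
A hypothetical sketch (illustrative data, not from the original) where f derives a sort key that differs from the natural ordering of the elements:

// Hypothetical sketch: sort Ints by their last digit rather than by value
val nums = sc.makeRDD(Seq(23, 7, 15, 42, 31))
nums.sortBy(n => n % 10).collect   // expected: Array(31, 42, 23, 15, 7)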

 

