Advanced Spark operators: mapPartitionsWithIndex, aggregate, aggregateByKey

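1: mapPartitionsWithIndex

Applies a user-defined function to each partition of an RDD, passing in the partition's index along with its elements.
       API documentation (simplified signatures):
       def mapPartitionsWithIndex[U](f: (Int, Iterator[T]) => Iterator[U])
       def mapPartitions[U](f: (Iterator[T]) => Iterator[U])

       Parameter: f is a function that takes two arguments and returns an iterator:
       (1) Int: the partition index
       (2) Iterator[T]: the elements in that partition
       (3) Return value, Iterator[U]: the results produced for that partition

       Example: print every element together with the index of the partition that holds it
        (1) Create an RDD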
             val rdd1 = sc.parallelize(List(1,2,3,4,5,6,7,8,9),2)

        (2) Define the function f that operates on the elements of each partition
             It concatenates each element with its partition index

              def func1(index:Int,iter:Iterator[Int]):Iterator[String]={
                iter.toList.map(x=>"[PartID:"+index+",value="+x+"]").iterator
              }

        (3) Call it
             rdd1.mapPartitionsWithIndex(func1).collect

        (4) Output:
            Data in partition 0:
            [PartID:0,value=1], [PartID:0,value=2], [PartID:0,value=3], [PartID:0,value=4]

            Data in partition 1:
            [PartID:1,value=5], [PartID:1,value=6], [PartID:1,value=7], [PartID:1,value=8], [PartID:1,value=9]
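
The partition layout above can be reproduced without a cluster. The following is a minimal sketch in plain Scala collections, not Spark itself: `slice` and `mapPartitionsWithIndexLocal` are hypothetical helpers introduced only for illustration, and the even-chunk splitting rule is an assumption about how `parallelize` divides a local list (it matches the outputs shown in this post).

```scala
object PartitionIndexSketch {
  // Hypothetical helper (assumption): split a local collection into numSlices
  // contiguous chunks, chunk i covering positions [i*n/numSlices, (i+1)*n/numSlices).
  def slice[T](data: Seq[T], numSlices: Int): Seq[Seq[T]] = {
    val n = data.length.toLong
    (0 until numSlices).map { i =>
      val start = (i * n / numSlices).toInt
      val end = ((i + 1) * n / numSlices).toInt
      data.slice(start, end)
    }
  }

  // Same shape as the function passed to mapPartitionsWithIndex in the post.
  def func1(index: Int, iter: Iterator[Int]): Iterator[String] =
    iter.map(x => "[PartID:" + index + ",value=" + x + "]")

  // Plain-collections stand-in for rdd.mapPartitionsWithIndex(f).collect:
  // call f once per partition, handing it the partition index and an iterator.
  def mapPartitionsWithIndexLocal[T, U](parts: Seq[Seq[T]])(
      f: (Int, Iterator[T]) => Iterator[U]): Seq[U] =
    parts.zipWithIndex.flatMap { case (part, idx) => f(idx, part.iterator).toList }

  def main(args: Array[String]): Unit = {
    val parts = slice(List(1, 2, 3, 4, 5, 6, 7, 8, 9), 2)
    // Prints one tagged element per line: values 1-4 with PartID:0, 5-9 with PartID:1
    mapPartitionsWithIndexLocal(parts)(func1).foreach(println)
  }
}
```

The point of the sketch is that f runs once per partition rather than once per element, which is why it receives an Iterator instead of a single value.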

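2: aggregate

An aggregation operation, loosely comparable to the aggregation step of a SQL GROUP BY, except that it reduces the whole RDD to a single value
        (1) It first aggregates locally (within each partition), then aggregates globally (across the partition results)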
            val rdd2 = sc.parallelize(List(1,2,3,4,5),2)

            Call func1 to see which elements are in each partition
            rdd2.mapPartitionsWithIndex(func1).collect

            Result
            [PartID:0,value=1], [PartID:0,value=2]
            [PartID:1,value=3], [PartID:1,value=4], [PartID:1,value=5]

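            Call the aggregate operation
            (1) Initial value 0
                  rdd2.aggregate(0)(math.max(_,_),_+_) Result: 7

            (2) Initial value 10
                  rdd2.aggregate(10)(math.max(_,_),_+_) Result: 30

Note: the result depends on the initial value. With an initial value of 0 the result is 7; with an initial value of 10 the result is 30. The reason is explained below.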

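The two partitions hold the following data:

[PartID:0,value=1], [PartID:0,value=2]

[PartID:1,value=3], [PartID:1,value=4], [PartID:1,value=5]

With an initial value of 10, the elements of each partition are first compared against 10, so the maximum of both partition 0 and partition 1 is 10. The two maxima are then summed together with the initial value 10 once more, giving 10 + 10 + 10 = 30.

With an initial value of 0, the comparison yields a maximum of 2 for partition 0 and 5 for partition 1. Summing them together with the initial value 0 gives 0 + 2 + 5 = 7.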
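
The arithmetic above can be checked with a small model of aggregate in plain Scala collections. This is a sketch, not Spark's implementation: `aggregateLocal` is a hypothetical helper, and the partition contents are hard-coded to match the layout shown for rdd2.

```scala
object AggregateSketch {
  // Hypothetical stand-in for rdd.aggregate(zero)(seqOp, combOp).
  def aggregateLocal[T, U](parts: Seq[Seq[T]], zero: U)(
      seqOp: (U, T) => U, combOp: (U, U) => U): U = {
    // Local step: fold every partition on its own, each fold starting from zero.
    val perPartition = parts.map(_.foldLeft(zero)(seqOp))
    // Global step: merge the partition results, again starting from zero.
    perPartition.foldLeft(zero)(combOp)
  }

  def main(args: Array[String]): Unit = {
    // Same layout as rdd2: partition 0 = (1, 2), partition 1 = (3, 4, 5)
    val parts = Seq(Seq(1, 2), Seq(3, 4, 5))
    println(aggregateLocal(parts, 0)(math.max(_, _), _ + _))  // prints 7  (0 + 2 + 5)
    println(aggregateLocal(parts, 10)(math.max(_, _), _ + _)) // prints 30 (10 + 10 + 10)
  }
}
```

Because the initial value enters once per partition and once more in the final merge, it is usually safest to pick a value that is neutral for both functions.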

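3: aggregateByKey

Similar to aggregate; the difference is that it operates on (key, value) pairs and aggregates the values of each key separately
        API reference: PairRDDFunctions.aggregateByKey (simplified signature)
        def aggregateByKey[U](zeroValue: U)(seqOp: (U, V) => U, combOp: (U, U) => U)

        Prepare the data: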
        val pairRDD = sc.parallelize(List(("cat",2),("cat", 5),("mouse", 4),("cat", 12),("dog", 12),("mouse", 2)), 2)

        Define another function, func3, to inspect the elements in each partition
        def func3(index:Int,iter:Iterator[(String,Int)]) = {
            iter.toList.map(x =>"[PartID:"+index+",value="+x+"]").iterator
        }

        pairRDD.mapPartitionsWithIndex(func3).collect

        Result
        Partition 0 (zoo 0)
        [PartID:0,value=(cat,2)], [PartID:0,value=(cat,5)], [PartID:0,value=(mouse,4)]

        Partition 1 (zoo 1)
        [PartID:1,value=(cat,12)], [PartID:1,value=(dog,12)], [PartID:1,value=(mouse,2)]

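        Operations:
        (1) For each animal, take the largest count within each zoo (partition), then sum those maxima across zoos
                pairRDD.aggregateByKey(0)(math.max(_,_),_+_).collect
                Result:
                Array((dog,12), (cat,17), (mouse,6))

        (2) Sum the counts of each animal across all zoos
                pairRDD.aggregateByKey(0)(_+_,_+_).collect
                Result:
                Array((dog,12), (cat,19), (mouse,6))

                reduceByKey can be used for this as well
                pairRDD.reduceByKey(_+_).collect
                Result: Array((dog,12), (cat,19), (mouse,6))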
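
The two results above can be traced with a plain-Scala model of aggregateByKey. This is a sketch under stated assumptions, not Spark's implementation: `aggregateByKeyLocal` is a hypothetical helper, the partition contents are hard-coded from the func3 output, and the model assumes the zero value is applied once per key within each partition and not again when partition results are merged.

```scala
object AggregateByKeySketch {
  // Hypothetical stand-in for pairRDD.aggregateByKey(zero)(seqOp, combOp).collect.
  def aggregateByKeyLocal[K, V, U](parts: Seq[Seq[(K, V)]], zero: U)(
      seqOp: (U, V) => U, combOp: (U, U) => U): Map[K, U] = {
    // Step 1: inside each partition, fold the values of every key, starting from zero.
    val perPartition: Seq[Map[K, U]] = parts.map { part =>
      part.foldLeft(Map.empty[K, U]) { case (acc, (k, v)) =>
        acc.updated(k, seqOp(acc.getOrElse(k, zero), v))
      }
    }
    // Step 2: merge the per-partition results key by key with combOp.
    // Assumption: zero is not applied a second time in this step.
    perPartition.flatten.foldLeft(Map.empty[K, U]) { case (acc, (k, u)) =>
      acc.updated(k, acc.get(k).map(combOp(_, u)).getOrElse(u))
    }
  }

  def main(args: Array[String]): Unit = {
    // Same layout as pairRDD: zoo 0 and zoo 1
    val parts = Seq(
      Seq(("cat", 2), ("cat", 5), ("mouse", 4)),
      Seq(("cat", 12), ("dog", 12), ("mouse", 2))
    )
    // Per-zoo maximum, then sum: cat = 5 + 12 = 17, mouse = 4 + 2 = 6, dog = 12
    println(aggregateByKeyLocal(parts, 0)(math.max(_, _), _ + _))
    // Plain sum: cat = 19, mouse = 6, dog = 12
    println(aggregateByKeyLocal(parts, 0)(_ + _, _ + _))
  }
}
```

When seqOp and combOp are the same function and the zero value is neutral for it, as in the sum case, the call is equivalent to reduceByKey with that function.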

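4: coalesce and repartition

Both change the number of partitions of an RDD.
Differences:

(1) coalesce: does not shuffle by default (shuffle = false)
def coalesce(numPartitions: Int, shuffle: Boolean = false)

(2) repartition: always shuffles the data (it is redistributed across the network); repartition(n) is equivalent to coalesce(n, shuffle = true)

        Example (without a shuffle, coalesce cannot increase the number of partitions, which is why rdd4 below still has 2 partitions):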
scala> val rdd1 = sc.parallelize(List(1,2,3,4,5,6,7,8,9),2)
rdd1: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[12] at parallelize at <console>:24

scala> val rdd2 = rdd1.repartition(3)
rdd2: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[16] at repartition at <console>:26

scala> rdd2.partitions.length
res14: Int = 3

scala> val rdd3 = rdd1.coalesce(3,true)
rdd3: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[20] at coalesce at <console>:26

scala> rdd3.partitions.length
res15: Int = 3

scala> val rdd4 = rdd1.coalesce(3)
rdd4: org.apache.spark.rdd.RDD[Int] = CoalescedRDD[21] at coalesce at <console>:26

scala> rdd4.partitions.length
res16: Int = 2
