Managing partitions with Spark's coalesce and repartition operators

Source code: https://github.com/apache/spark/blob/v2.4.0/core/src/main/scala/org/apache/spark/rdd/RDD.scala

repartition:

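  /**
   * Return a new RDD that has exactly numPartitions partitions.
   *
   * Can increase or decrease the level of parallelism in this RDD. Internally, this uses
   * a shuffle to redistribute data.
   *
   * If you are decreasing the number of partitions in this RDD, consider using `coalesce`,
   * which can avoid performing a shuffle.
   */
  def repartition(numPartitions: Int)(implicit ord: Ordering[T] = null): RDD[T] = withScope {
    coalesce(numPartitions, shuffle = true)
  }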

coalesce:

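  /**
   * Return a new RDD that is reduced into `numPartitions` partitions.
   *
   * This results in a narrow dependency, e.g. if you go from 1000 partitions
   * to 100 partitions, there will not be a shuffle, instead each of the 100
   * new partitions will claim 10 of the current partitions. If a larger number
   * of partitions is requested, it will stay at the current number of partitions.
   *
   * However, if you're doing a drastic coalesce, e.g. to numPartitions = 1,
   * this may result in your computation taking place on fewer nodes than
   * you like (e.g. one node in the case of numPartitions = 1). To avoid this,
   * you can pass shuffle = true. This will add a shuffle step, but means the
   * current upstream partitions will be executed in parallel (per whatever
   * the current partitioning is).
   *
   * With shuffle = true, you can actually coalesce to a larger number
   * of partitions. This is useful if you have a small number of partitions,
   * say 100, potentially with a few partitions being abnormally large. Calling
   * coalesce(1000, shuffle = true) will result in 1000 partitions with the
   * data distributed using a hash partitioner. The optional partition coalescer
   * passed in must be serializable.
   */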
 def coalesce(numPartitions: Int, shuffle: Boolean = false,
               partitionCoalescer: Option[PartitionCoalescer] = Option.empty)
              (implicit ord: Ordering[T] = null)
      : RDD[T] = withScope {
    require(numPartitions > 0, s"Number of partitions ($numPartitions) must be positive.")
    if (shuffle) {
      /** Distributes elements evenly across output partitions, starting from a random partition. */
      val distributePartition = (index: Int, items: Iterator[T]) => {
        var position = new Random(hashing.byteswap32(index)).nextInt(numPartitions)
        items.map { t =>
          // Note that the hash code of the key will just be the key itself. The HashPartitioner
          // will mod it with the number of total partitions.
          position = position + 1
          (position, t)
        }
      } : Iterator[(Int, T)]

      // include a shuffle step so that our upstream tasks are still distributed
      new CoalescedRDD(
        new ShuffledRDD[Int, T, T](
          mapPartitionsWithIndexInternal(distributePartition, isOrderSensitive = true),
          new HashPartitioner(numPartitions)),
        numPartitions,
        partitionCoalescer).values
    } else {
      new CoalescedRDD(this, numPartitions, partitionCoalescer)
    }
  }

The repartition method simply calls coalesce with shuffle = true.
By default (shuffle = false), coalesce itself does not trigger a shuffle.
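
To make the shuffle branch easier to follow, here is a minimal plain-Scala sketch of how distributePartition spreads the rows of one input partition over the output partitions. It needs no Spark dependency. RoundRobinSketch, assignKeys, targetPartition and distribution are hypothetical names, and targetPartition is only an assumed stand-in for what HashPartitioner does with an Int key.

```scala
import scala.util.Random
import scala.util.hashing

object RoundRobinSketch {
  // Mirrors distributePartition from the source above: pick a pseudo-random start
  // position seeded by the partition index, then hand out consecutive integer keys.
  def assignKeys[T](index: Int, items: Iterator[T], numPartitions: Int): Iterator[(Int, T)] = {
    var position = new Random(hashing.byteswap32(index)).nextInt(numPartitions)
    items.map { t =>
      position = position + 1
      (position, t)
    }
  }

  // Assumed stand-in for HashPartitioner: an Int key hashes to itself, so the
  // target partition is the non-negative remainder of key / numPartitions.
  def targetPartition(key: Int, numPartitions: Int): Int = {
    val raw = key % numPartitions
    if (raw < 0) raw + numPartitions else raw
  }

  // How many items of one input partition land in each output partition.
  def distribution(index: Int, itemCount: Int, numPartitions: Int): Map[Int, Int] =
    assignKeys(index, (1 to itemCount).iterator, numPartitions)
      .map { case (key, _) => targetPartition(key, numPartitions) }
      .toList
      .groupBy(identity)
      .map { case (partition, hits) => partition -> hits.size }
}
```

Calling RoundRobinSketch.distribution(0, 10, 4) spreads ten rows over four output partitions so that no partition receives more than one row more than any other. Because the keys are consecutive integers, the assignment is effectively round-robin.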

Introduction to partitions

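When spark.default.parallelism is not set in the configuration file spark-defaults.conf, its value is determined by the following rules:
1. Local mode (no executors are launched; the SparkSubmit process runs tasks concurrently on the specified number of threads):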

spark-shell       spark.default.parallelism = number of cores on the machine (the master defaults to local[*])
spark-shell --master local[N] spark.default.parallelism = N (uses N cores)
spark-shell --master local       spark.default.parallelism = 1

2. Pseudo-cluster mode (x is the number of executors started on the local machine, y is the number of cores per executor, z is the memory per executor):

spark-shell --master local-cluster[x,y,z] spark.default.parallelism = x * y

3. Other modes (mainly YARN here; standalone behaves the same way):

Others: total number of cores on all executor nodes or 2, whichever is larger
spark.default.parallelism = max(total number of cores used by all executors, 2)
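
The rules above can be condensed into a small plain-Scala sketch. It is not Spark's implementation: DefaultParallelismSketch, defaultParallelism and the totalExecutorCores parameter are hypothetical names, and the local[*] case is an addition that is not in the list above.

```scala
object DefaultParallelismSketch {
  private val LocalN = """local\[(\d+)\]""".r
  private val LocalCluster = """local-cluster\[\s*(\d+)\s*,\s*(\d+)\s*,\s*(\d+)\s*\]""".r

  // `master` is the value passed to --master. `totalExecutorCores` is only
  // consulted for real cluster managers such as YARN or standalone.
  def defaultParallelism(master: String, totalExecutorCores: Int = 0): Int = master match {
    case "local" => 1
    // Not in the list above: local[*] uses every logical core of the machine.
    case "local[*]" => Runtime.getRuntime.availableProcessors()
    case LocalN(n) => n.toInt
    case LocalCluster(executors, coresPerExecutor, _) =>
      executors.toInt * coresPerExecutor.toInt
    case _ => math.max(totalExecutorCores, 2)
  }
}
```

For example, defaultParallelism("local-cluster[2,4,1024]") evaluates to 8, and defaultParallelism("yarn", 1) evaluates to 2 because of the lower bound in rule 3.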

Suppose the shell is started with spark-shell --master local[3]:

val x = (1 to 1000).toList
val test_partitionDf = x.toDF("test_partition")

scala> test_partitionDf.rdd.partitions.size 
res0: Int = 3

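The coalesce method reduces the number of partitions in a DataFrame. Here is how to merge the data of the existing partitions into a single partition: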

val test_partitionDf2 = test_partitionDf.coalesce(1)

We can verify that coalesce produced a new DataFrame with only one partition:

scala> test_partitionDf2.rdd.partitions.size
res1: Int = 1

Trying to increase the number of partitions with coalesce has no effect:

val test_partitionDf3 = test_partitionDf.coalesce(4)

scala> test_partitionDf3.rdd.partitions.size
res2: Int = 3
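
The DataFrame coalesce only takes a partition count. The shuffle flag is available on the RDD method quoted from the source above. The following is a minimal sketch at the RDD level, assuming a local[3] master as in the session above. CoalesceShuffleSketch and run are hypothetical names.

```scala
import org.apache.spark.sql.SparkSession

object CoalesceShuffleSketch {
  // Returns (partitions after coalesce(4), partitions after coalesce(4, shuffle = true)).
  def run(): (Int, Int) = {
    val spark = SparkSession.builder()
      .master("local[3]")
      .appName("coalesce-shuffle-sketch")
      .getOrCreate()
    try {
      // Three input partitions, set explicitly instead of relying on the default.
      val rdd = spark.sparkContext.parallelize(1 to 1000, numSlices = 3)

      // Without a shuffle the partition count cannot grow, so it stays at 3.
      val narrow = rdd.coalesce(4)
      // With a shuffle the rows are redistributed into 4 partitions.
      val shuffled = rdd.coalesce(4, shuffle = true)

      (narrow.getNumPartitions, shuffled.getNumPartitions)
    } finally {
      spark.stop()
    }
  }
}
```

CoalesceShuffleSketch.run() returns (3, 4): the plain coalesce keeps the three existing partitions, while the shuffled variant produces four.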

The repartition method, by contrast, can either increase or decrease the number of partitions in a DataFrame:

scala> val test_partition_repartitionDf = test_partitionDf.repartition(6)
scala> test_partition_repartitionDf.rdd.partitions.size
res3: Int = 6
scala> val test_partition_repartitionDf = test_partitionDf.repartition(1)
scala> test_partition_repartitionDf.rdd.partitions.size
res4: Int = 1

Because repartition performs a full shuffle of the data, it can either increase or decrease the number of partitions.

Differences between coalesce and repartition

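repartition performs a full shuffle of the data and produces partitions of roughly equal size. coalesce merges existing partitions in order to avoid a full shuffle.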
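
One way to observe this difference is to count the rows in each partition after both operations. The sketch below uses deliberately skewed, made-up input (one partition of 1000 rows and three of 10 rows). PartitionSizesSketch, run and rowsPerPartition are hypothetical names.

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession

object PartitionSizesSketch {
  // Returns (partition sizes after coalesce(2), partition sizes after repartition(2)).
  def run(): (Seq[Int], Seq[Int]) = {
    val spark = SparkSession.builder()
      .master("local[4]")
      .appName("partition-sizes-sketch")
      .getOrCreate()
    try {
      // Illustrative data: four input partitions, the first far larger than the rest.
      val rowsPerPartition = Array(1000, 10, 10, 10)
      val skewed = spark.sparkContext
        .parallelize(0 until 4, numSlices = 4)
        .flatMap(i => Seq.fill(rowsPerPartition(i))(i))

      // Number of rows held by each partition of the given RDD.
      def sizes(rdd: RDD[Int]): Seq[Int] =
        rdd.mapPartitions(rows => Iterator(rows.size)).collect().toSeq

      (sizes(skewed.coalesce(2)), sizes(skewed.repartition(2)))
    } finally {
      spark.stop()
    }
  }
}
```

The coalesce result keeps whatever skew the merged input partitions carry, since whole partitions are glued together. The repartition result is close to an even split, since every row is reassigned during the shuffle.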

Repartitioning by column

val color = List((1001,"blue"),(102,"red"),(1555,"blue"),(9,"red"),(1,"blue"))
val colorDf = color.toDF("sum","color")
val test_colorDf = colorDf.repartition($"color")
test_colorDf.rdd.partitions.size
res6: Int = 200

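When repartitioning by column, Spark creates spark.sql.shuffle.partitions partitions, which is 200 by default. Inspecting the partitions shows that only two of them contain data, and that all rows within a partition share the same color value. test_colorDf therefore holds each color in its own partition, which favors operations that select rows by color. Partitioning by a column is loosely analogous to indexing a column in a relational database.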
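
If 200 partitions is too many for a small DataFrame, repartition also accepts an explicit partition count in front of the column expressions. The sketch below passes 4 as an arbitrary illustrative count and checks that all rows of a given color still end up in a single partition. RepartitionByColumnSketch and run are hypothetical names. Note that two colors may hash to the same partition, so the sketch does not assume one partition per color.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, countDistinct, spark_partition_id}

object RepartitionByColumnSketch {
  // Returns (number of partitions, largest number of partitions any single color spans).
  def run(): (Int, Long) = {
    val spark = SparkSession.builder()
      .master("local[3]")
      .appName("repartition-by-column-sketch")
      .getOrCreate()
    import spark.implicits._
    try {
      val colorDf = List((1001, "blue"), (102, "red"), (1555, "blue"), (9, "red"), (1, "blue"))
        .toDF("sum", "color")

      // An explicit count replaces the spark.sql.shuffle.partitions default.
      val byColor = colorDf.repartition(4, col("color"))

      // Tag every row with the id of the partition it lives in, then count
      // how many distinct partitions each color is spread over.
      val maxPartitionsPerColor = byColor
        .withColumn("pid", spark_partition_id())
        .groupBy("color")
        .agg(countDistinct("pid").as("partitions"))
        .collect()
        .map(_.getLong(1))
        .max

      (byColor.rdd.getNumPartitions, maxPartitionsPerColor)
    } finally {
      spark.stop()
    }
  }
}
```

RepartitionByColumnSketch.run() returns (4, 1): four partitions in total, and no color is split across more than one of them.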

Partitioning considerations

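Suppose an RDD with N partitions is to be repartitioned into M partitions.

1) N < M. Typically the data in the N partitions is unevenly distributed, and HashPartitioner is used to redistribute it into M partitions. This requires shuffle to be set to true.

2) N > M, with N and M of comparable size (say N = 1000 and M = 100). Several of the N partitions can be merged into each new partition, giving M partitions in total. Here shuffle can be set to false: no shuffle takes place, and the parent and child RDDs are connected by a narrow dependency. Note that with shuffle set to false, coalesce has no effect when M > N.

3) N > M, with a very large gap between the two. If shuffle is set to false, the parent and child RDDs are connected by a narrow dependency and run in the same stage, so the Spark job may end up with too little parallelism, which hurts performance. When M is 1 in particular, setting shuffle to true lets the operations before the coalesce run with better parallelism.

In short: with shuffle set to false, if the requested number of partitions is larger than the current number, the partition count of the RDD stays unchanged. In other words, the number of partitions of an RDD cannot be increased without a shuffle.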
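
The narrow, shuffle-free case can be pictured with a simplified plain-Scala model in which each new partition claims a contiguous range of parent partitions. This is an assumed simplification that ignores data locality, which Spark's own coalescer takes into account. NarrowCoalesceSketch and group are hypothetical names.

```scala
object NarrowCoalesceSketch {
  // For each new partition, the indices of the parent partitions it claims.
  // Without a shuffle the partition count can only shrink, so the effective
  // target is capped at the current number of partitions.
  def group(parentCount: Int, requested: Int): Seq[Seq[Int]] = {
    require(requested > 0, "requested must be positive")
    val target = math.min(parentCount, requested)
    (0 until target).map { i =>
      val start = (i.toLong * parentCount / target).toInt
      val end = ((i + 1).toLong * parentCount / target).toInt
      (start until end).toList
    }
  }
}
```

With group(1000, 100) every one of the 100 new partitions claims 10 parent partitions, matching case 2. With group(3, 4) the result still has 3 groups, which is the "no effect" behaviour described in the summary.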

References: https://blog.csdn.net/u011981433/article/details/50035851
https://blog.csdn.net/jiangsanfeng1111/article/details/78191891
