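A Partitioner decides which downstream partition each record of a key-value RDD belongs to during a shuffle, so it is closely tied to data skew. This post looks at how the commonly used HashPartitioner and RangePartitioner are implemented internally, why they can cause data skew, and why a single repartition() call sometimes fixes it.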

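When neither a Partitioner nor a partition count is specified:

Transformations are usually applied to the same RDD, so the next RDD can be thought of as defaulting to the same number of partitions as the previous one.

First, HashPartitioner:

The partition is computed as key.hashCode % partitionNum; if the result is negative, partitionNum is added to it.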

def nonNegativeMod(x: Int, mod: Int): Int = {
  val rawMod = x % mod
  rawMod + (if (rawMod < 0) mod else 0)
}

The method is simple, but when many records share the same key they all land in the same partition, which causes data skew.
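To make the skew concrete, here is a minimal sketch in plain Scala (no Spark dependency). The names partitionFor and partitionSizes are hypothetical helpers written for this illustration; only nonNegativeMod mirrors the code quoted above.

```scala
object HashPartitionSketch {
  // Same logic as the nonNegativeMod helper quoted above.
  def nonNegativeMod(x: Int, mod: Int): Int = {
    val rawMod = x % mod
    rawMod + (if (rawMod < 0) mod else 0)
  }

  // Hypothetical helper: partition index for a key, in the spirit of HashPartitioner.
  // A null key is sent to partition 0.
  def partitionFor(key: Any, numPartitions: Int): Int =
    if (key == null) 0 else nonNegativeMod(key.hashCode, numPartitions)

  // Hypothetical helper: count how many records each partition would receive.
  def partitionSizes(keys: Seq[Any], numPartitions: Int): Map[Int, Int] =
    keys.groupBy(partitionFor(_, numPartitions)).map { case (p, ks) => (p, ks.size) }
}
```

With 90 records carrying the key "hot" and 10 records carrying distinct keys, at least 90 of the 100 records fall into a single partition no matter how many partitions are configured, because equal keys always produce equal hash codes.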

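Note that the hash is taken from the key's own hashCode. If the key is a Java array, the hashCode is derived from the array object's identity rather than from its contents, so keys with identical contents may not end up in the same partition. In that case it is best to define a custom Partitioner.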
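The array pitfall can be reproduced without Spark. The sketch below also shows one possible workaround that is not taken from the original text: converting the array into an immutable Seq, whose hashCode and equals are based on its contents. The names contentKey and partitionFor are hypothetical.

```scala
object ArrayKeySketch {
  def nonNegativeMod(x: Int, mod: Int): Int = {
    val rawMod = x % mod
    rawMod + (if (rawMod < 0) mod else 0)
  }

  // Hypothetical helper: partition index derived from the key's hashCode.
  def partitionFor(key: Any, numPartitions: Int): Int =
    nonNegativeMod(key.hashCode, numPartitions)

  // Hypothetical helper: wrap the array contents in a Vector so that
  // hashCode and equals depend on the elements, not on object identity.
  def contentKey(bytes: Array[Byte]): Seq[Byte] = bytes.toVector
}
```

Two arrays with the same elements are still different objects, so they compare as unequal and usually hash differently. After contentKey they compare as equal and always map to the same partition.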

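Next, RangePartitioner:

The HashPartitioner above covers the vast majority of cases, but because it scatters keys by hash, operations that need the keys sorted are hard to do. Spark therefore also provides RangePartitioner, which makes sorting by key easier: it guarantees that keys are ordered across partitions and that the partitions hold roughly the same amount of data, but it does not guarantee that the keys inside a single partition are ordered.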

RangePartitioner splits the keys into consecutive ranges (rangeBounds), one per partition. Put simply, every key that falls within a given range is mapped to the partition that owns that range.
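The mapping from a key to its range can be sketched as a scan over the sorted upper bounds. This is a simplified illustration for ascending order only, and partitionFor is a hypothetical name rather than Spark's API.

```scala
object RangeLookupSketch {
  // bounds holds the upper bounds of the first (numPartitions - 1) partitions,
  // sorted in ascending order. The last partition has no upper bound.
  def partitionFor[K](key: K, bounds: IndexedSeq[K])(implicit ord: Ordering[K]): Int = {
    var p = 0
    // Move right while the key is greater than the current upper bound.
    while (p < bounds.length && ord.gt(key, bounds(p))) p += 1
    p
  }
}
```

With bounds 10, 20, 30 there are four partitions: keys up to 10 go to partition 0, keys in (10, 20] to partition 1, keys in (20, 30] to partition 2, and anything larger to partition 3.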

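So the core question is how the key ranges are chosen:

The key ranges for the partitions are derived from samples drawn with the "reservoir sampling" algorithm. Here is a rough outline of how it works:

Suppose there are 10,000 numbers and we want to draw 10 of them at random. Call the array of 10,000 input numbers S, and call the array of 10 sampled numbers R (for result).

First, fill R with the first 10 numbers of S.

The first iteration of the algorithm goes like this:

Start iterating at the 11th number (index 10) and generate a random integer j between 0 and 10 inclusive. If j < 10 (say j = 4), replace the 5th entry of R (R[4]) with the 11th entry of S (S[10]). Every later element S[i] is handled the same way, with j drawn from 0 to i.

Once the traversal is complete, R is the random sample we wanted.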
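The procedure above can be written as a short Scala function. This is a minimal sketch of the textbook algorithm, not Spark's SamplingUtils implementation; the object and parameter names are chosen for illustration, and the random generator is passed in so that runs can be made reproducible.

```scala
import scala.util.Random

object ReservoirSketch {
  // Draw up to k items uniformly at random from a stream of unknown length.
  def sample[T](input: Iterator[T], k: Int, rng: Random): Vector[T] = {
    val reservoir = scala.collection.mutable.ArrayBuffer.empty[T]
    var seen = 0 // number of items consumed so far, also the index of the current item
    while (input.hasNext) {
      val item = input.next()
      if (seen < k) {
        reservoir += item // fill the reservoir with the first k items
      } else {
        val j = rng.nextInt(seen + 1) // uniform in [0, seen]
        if (j < k) reservoir(j) = item // replace an existing entry with probability k / (seen + 1)
      }
      seen += 1
    }
    reservoir.toVector
  }
}
```

Calling ReservoirSketch.sample((1 to 10000).iterator, 10, new Random(42)) returns 10 distinct numbers from the input. If the input has fewer than k items, all of them are returned.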

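For a detailed explanation of the algorithm, see the second link in the references, which covers it very thoroughly.

Now let's see how Spark implements this in code:

The overall flow is: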

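1. If the number of partitions is at most 1, or the RDD contains no data, there is no need to compute range bounds; an empty Array is returned directly.

2. Compute the overall sample size sampleSize (math.min(20.0 * partitions, 1e6)): 20 samples per output partition, capped at 1M samples in total (the cap is reached at 50,000 partitions).

3. From sampleSize and the number of partitions of the input RDD, compute the per-partition sample size sampleSizePerPartition (math.ceil(3.0 * sampleSize / rdd.partitions.length).toInt). The factor of 3 over-samples, so each partition yields somewhat more than an even share of sampleSize.

4. Sample each partition with sampleSizePerPartition. This returns the total number of items, plus one (Int, Long, Array[K]) tuple per partition holding the partition ID, the partition's total item count, and the sampled keys.

5. Compute the fraction of the data that the sample should cover and identify partitions that hold far more data than average, so that skewed input does not distort the bounds. Those oversized partitions are sampled again. Each sampled key gets a weight equal to its partition's item count divided by the number of samples taken from it (re-sampled keys use 1 / fraction instead).

6. Sort the final samples and compute rangeBounds. sumWeights / partitions gives the weight span of each partition; the weights are then accumulated over the sorted samples, and a key boundary is emitted each time the running total crosses the next span.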
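The arithmetic in steps 2, 3 and 5 is easier to follow with concrete numbers. The helpers below copy the formulas from the steps above into plain Scala; the object name, the function names, and the example figures are assumptions made for this illustration.

```scala
object SampleSizeSketch {
  // Step 2: 20 samples per output partition, capped at 1e6.
  def sampleSize(partitions: Int): Double = math.min(20.0 * partitions, 1e6)

  // Step 3: over-sample by a factor of 3, spread across the input partitions.
  def sampleSizePerPartition(partitions: Int, inputPartitions: Int): Int =
    math.ceil(3.0 * sampleSize(partitions) / inputPartitions).toInt

  // Step 5: share of all items that the planned sample represents, at most 1.0.
  def fraction(partitions: Int, numItems: Long): Double =
    math.min(sampleSize(partitions) / math.max(numItems, 1L), 1.0)

  // Step 5: an input partition with n items is re-sampled when its fair share
  // of the sample exceeds what was collected from it in the first pass.
  def isImbalanced(partitions: Int, inputPartitions: Int, numItems: Long, n: Long): Boolean =
    fraction(partitions, numItems) * n > sampleSizePerPartition(partitions, inputPartitions)
}
```

For 10 output partitions and an input RDD with 4 partitions holding 1,000,000 items in total, sampleSize is 200 and sampleSizePerPartition is 150. An input partition holding 900,000 of those items would deserve about 180 samples, which is more than 150, so it is flagged for re-sampling; one holding 50,000 items would deserve about 10 and is not.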

// An array of upper bounds for the first (partitions - 1) partitions
// Overall flow for computing the key ranges
private var rangeBounds: Array[K] = {
  if (partitions <= 1) {
    Array.empty // only one partition, return immediately
  } else {
    // This is the sample size we need to have roughly balanced output partitions, capped at 1M.
    val sampleSize = math.min(20.0 * partitions, 1e6)
    // Assume the input partitions are roughly balanced and over-sample a little bit.
    val sampleSizePerPartition = math.ceil(3.0 * sampleSize / rdd.partitions.length).toInt // sample size per input partition
    val (numItems, sketched) = RangePartitioner.sketch(rdd.map(_._1), sampleSizePerPartition) // sample keys from each partition
    if (numItems == 0L) {
      Array.empty
    } else {
      // If a partition contains much more than the average number of items, we re-sample from it
      // to ensure that enough items are collected from that partition.
      val fraction = math.min(sampleSize / math.max(numItems, 1L), 1.0) // ratio of the planned sample size to the total item count, at most 1.0
      val candidates = ArrayBuffer.empty[(K, Float)] // collects the sampled keys with their weights
      val imbalancedPartitions = mutable.Set.empty[Int] // IDs of oversized partitions, which are re-sampled below
      sketched.foreach { case (idx, n, sample) =>
        if (fraction * n > sampleSizePerPartition) { // is this partition oversized? if so, it is re-sampled below
          imbalancedPartitions += idx
        } else {
          // The weight is 1 over the sampling probability.
          val weight = (n.toDouble / sample.length).toFloat // weights are accumulated later to split the key ranges
          for (key <- sample) {
            candidates += ((key, weight)) // not oversized: record the sample as is
          }
        }
      }
      if (imbalancedPartitions.nonEmpty) { // re-sample the oversized partitions
        // Re-sample imbalanced partitions with the desired sampling probability.
        val imbalanced = new PartitionPruningRDD(rdd.map(_._1), imbalancedPartitions.contains)
        val seed = byteswap32(-rdd.id - 1)
        val reSampled = imbalanced.sample(withReplacement = false, fraction, seed).collect()
        val weight = (1.0 / fraction).toFloat
        candidates ++= reSampled.map(x => (x, weight))
      }
      // Determine the key ranges
      RangePartitioner.determineBounds(candidates, partitions)
    }
  }
}

// Sampling flow

/**
 * Sketches the input RDD via reservoir sampling on each partition.
 *
 * @param rdd the input RDD to sketch
 * @param sampleSizePerPartition max sample size per partition
 * @return (total number of items, an array of (partitionId, number of items, sample))
 */
def sketch[K : ClassTag](
    rdd: RDD[K],
    sampleSizePerPartition: Int): (Long, Array[(Int, Long, Array[K])]) = {
  val shift = rdd.id
  // val classTagK = classTag[K] // to avoid serializing the entire partitioner object
  val sketched = rdd.mapPartitionsWithIndex { (idx, iter) =>
    val seed = byteswap32(idx ^ (shift << 16))
    val (sample, n) = SamplingUtils.reservoirSampleAndCount(
      iter, sampleSizePerPartition, seed)
    Iterator((idx, n, sample))
  }.collect()
  val numItems = sketched.map(_._2).sum
  (numItems, sketched)
}

// Determining the key bounds

def determineBounds[K : Ordering : ClassTag](
    candidates: ArrayBuffer[(K, Float)],
    partitions: Int): Array[K] = {
  val ordering = implicitly[Ordering[K]]
  val ordered = candidates.sortBy(_._1) // sort by key, ascending by default
  val numCandidates = ordered.size
  val sumWeights = ordered.map(_._2.toDouble).sum
  val step = sumWeights / partitions // average weight per output partition
  var cumWeight = 0.0
  var target = step
  val bounds = ArrayBuffer.empty[K]
  var i = 0
  var j = 0
  var previousBound = Option.empty[K]
  while ((i < numCandidates) && (j < partitions - 1)) { // accumulate weights; each time the total reaches the next step, emit a key bound
    val (key, weight) = ordered(i)
    cumWeight += weight
    if (cumWeight >= target) {
      // Skip duplicate values.
      if (previousBound.isEmpty || ordering.gt(key, previousBound.get)) {
        bounds += key
        target += step
        j += 1
        previousBound = Some(key)
      }
    }
    i += 1
  }
  bounds.toArray
}
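To see how determineBounds behaves on real input, here is a simplified standalone variant that can be run without Spark. It follows the same loop as the code above but returns a Vector and drops the ClassTag requirement; treat it as an illustrative sketch rather than the library code.

```scala
object BoundsSketch {
  // candidates: (key, weight) pairs; partitions: number of output partitions.
  // Returns the upper bounds of the first (partitions - 1) partitions.
  def determineBounds[K](candidates: Seq[(K, Float)], partitions: Int)(
      implicit ord: Ordering[K]): Vector[K] = {
    val ordered = candidates.sortBy(_._1).toVector
    val step = ordered.map(_._2.toDouble).sum / partitions
    var cumWeight = 0.0
    var target = step
    val bounds = Vector.newBuilder[K]
    var previous = Option.empty[K]
    var i = 0
    var j = 0
    while (i < ordered.length && j < partitions - 1) {
      val (key, weight) = ordered(i)
      cumWeight += weight
      // Emit a bound once the running weight reaches the target,
      // skipping keys that are not greater than the previous bound.
      if (cumWeight >= target && previous.forall(ord.gt(key, _))) {
        bounds += key
        target += step
        j += 1
        previous = Some(key)
      }
      i += 1
    }
    bounds.result()
  }
}
```

With the keys 1 to 9, each of weight 1, and 3 partitions, the bounds are 3 and 6, which gives three evenly sized ranges. With the keys 1, 5 (seven times) and 9, the bounds become 5 and 9: duplicate bounds are skipped, so every record with key 5 still lands in the first partition. This shows that range partitioning cannot split a single hot key either.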

References:

https://www.cnblogs.com/krcys/p/9121487.html (explanation of the reservoir sampling algorithm)

https://www.cnblogs.com/strugglion/p/6424874.html (explanation of the reservoir sampling algorithm)
