Spark Core Source Code Analysis 15: Shuffle in Detail, the Write Path

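Shuffle is a fairly complex process, so it is worth dissecting its internal write logic in detail.

There are two ShuffleManager implementations: SortShuffleManager and HashShuffleManager.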

I. SortShuffleManager

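A ShuffleMapTask does not generate a separate file for each reducer. Instead, it writes all of its results to a single local file and also generates an index file, through which each reducer can locate the data it needs to process. The direct benefit of avoiding a large number of files is lower memory usage (fewer open writers and write buffers) and the low latency that comes with sequential disk IO.

When writing partition data, it first sorts the data in a way that depends on the situation. At a minimum the data is sorted by reduce partition, so that everything one map task shuffles to the different reduce partitions can be written to the same file on disk, with simple offsets marking where each reduce partition's data sits in that file. A map task therefore only needs to generate one shuffle file, which avoids the huge number of files that HashShuffleManager (covered in the second part) can run into.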
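The single-file layout can be illustrated with a minimal sketch. It is not Spark's code: SingleFileLayoutSketch, writeMapOutput and readPartition are hypothetical names, the "file" is an in-memory string, and lengths are counted in characters rather than bytes.

```scala
object SingleFileLayoutSketch {
  // Hypothetical record type: (reducePartitionId, payload).
  type Record = (Int, String)

  /** Groups records by reduce partition and concatenates them into one "file".
    * Returns the data plus numPartitions + 1 offsets; partition p occupies
    * the range [offsets(p), offsets(p + 1)). */
  def writeMapOutput(records: Seq[Record], numPartitions: Int): (String, Array[Int]) = {
    val sorted = records.sortBy(_._1) // stable sort by partition id only
    val lengths = new Array[Int](numPartitions)
    val data = new StringBuilder
    sorted.foreach { case (p, payload) =>
      data.append(payload)
      lengths(p) += payload.length
    }
    // Cumulative sums of the lengths give the start offset of every partition.
    (data.toString, lengths.scanLeft(0)(_ + _))
  }

  /** What a reducer does conceptually: look up its slice through the offsets. */
  def readPartition(data: String, offsets: Array[Int], p: Int): String =
    data.substring(offsets(p), offsets(p + 1))
}
```

A reducer for partition p only needs two adjacent offsets to find its slice, which is why one data file plus one index file is enough for the whole map task.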

/** Get a writer for a given partition. Called on executors by map tasks. */
  override def getWriter[K, V](handle: ShuffleHandle, mapId: Int, context: TaskContext)
      : ShuffleWriter[K, V] = {
    val baseShuffleHandle = handle.asInstanceOf[BaseShuffleHandle[K, V, _]]
    shuffleMapNumber.putIfAbsent(baseShuffleHandle.shuffleId, baseShuffleHandle.numMaps)
    new SortShuffleWriter(
      shuffleBlockResolver, baseShuffleHandle, mapId, context)
  }

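shuffleMapNumber is a hash map from shuffleId to numMaps.

SortShuffleWriter provides the write interface that actually writes data to disk, and inside write it uses shuffleBlockResolver to deal with the underlying files.

Next, let's look at the call to write once the SortShuffleWriter has been obtained.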

writer.write(rdd.iterator(partition, context).asInstanceOf[Iterator[_ <: Product2[Any, Any]]])

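The argument to write is essentially the result of calling the RDD's compute method, i.e. the iterator returned for this partition (rdd.iterator falls through to compute unless the partition is cached or checkpointed).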

/** Write a bunch of records to this task's output */
  override def write(records: Iterator[Product2[K, V]]): Unit = {
    if (dep.mapSideCombine) {
      require(dep.aggregator.isDefined, "Map-side combine without Aggregator specified!")
      sorter = new ExternalSorter[K, V, C](
        dep.aggregator, Some(dep.partitioner), dep.keyOrdering, dep.serializer)
      sorter.insertAll(records)
    } else {
      // In this case we pass neither an aggregator nor an ordering to the sorter, because we don't
      // care whether the keys get sorted in each partition; that will be done on the reduce side
      // if the operation being run is sortByKey.
      sorter = new ExternalSorter[K, V, V](None, Some(dep.partitioner), None, dep.serializer)
      sorter.insertAll(records)
    }

    // Don't bother including the time to open the merged output file in the shuffle write time,
    // because it just opens a single file, so is typically too fast to measure accurately
    // (see SPARK-3570).
    val outputFile = shuffleBlockResolver.getDataFile(dep.shuffleId, mapId)
    val blockId = ShuffleBlockId(dep.shuffleId, mapId, IndexShuffleBlockResolver.NOOP_REDUCE_ID)
    val partitionLengths = sorter.writePartitionedFile(blockId, context, outputFile)
    shuffleBlockResolver.writeIndexFile(dep.shuffleId, mapId, partitionLengths)

    mapStatus = MapStatus(blockManager.shuffleServerId, partitionLengths)
  }

As you can see, when mapSideCombine is set, the aggregator and keyOrdering are passed into ExternalSorter; otherwise both of these arguments are set to None. Then insertAll is called.

def insertAll(records: Iterator[_ <: Product2[K, V]]): Unit = {
    // TODO: stop combining if we find that the reduction factor isn't high
    val shouldCombine = aggregator.isDefined

    if (shouldCombine) {
      // Combine values in-memory first using our AppendOnlyMap
      val mergeValue = aggregator.get.mergeValue
      val createCombiner = aggregator.get.createCombiner
      var kv: Product2[K, V] = null
      val update = (hadValue: Boolean, oldValue: C) => {
        if (hadValue) mergeValue(oldValue, kv._2) else createCombiner(kv._2)
      }
      while (records.hasNext) {
        addElementsRead()
        kv = records.next()
        map.changeValue((getPartition(kv._1), kv._1), update)
        maybeSpillCollection(usingMap = true)
      }
    } else if (bypassMergeSort) {
      // SPARK-4479: Also bypass buffering if merge sort is bypassed to avoid defensive copies
      if (records.hasNext) {
        spillToPartitionFiles(
          WritablePartitionedIterator.fromIterator(records.map { kv =>
            ((getPartition(kv._1), kv._1), kv._2.asInstanceOf[C])
          })
        )
      }
    } else {
      // Stick values into our buffer
      while (records.hasNext) {
        addElementsRead()
        val kv = records.next()
        buffer.insert(getPartition(kv._1), kv._1, kv._2.asInstanceOf[C])
        maybeSpillCollection(usingMap = false)
      }
    }
  }

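Let's walk through the internal logic:

(1) If shouldCombine is true, the key-value pairs are recorded in an Array whose default size is 64*2, laid out as key0, value0, key1, value1, key2, value2, .... The key here is the pair (partitionId, key), as the changeValue call shows. map.changeValue repeatedly applies mergeValue to update the value stored at the corresponding slot of the array. When the number of key-value pairs reaches 0.7 of the capacity, the array grows automatically.

maybeSpillCollection is called next. It first decides whether a spill is needed. The conditions are that the spillingEnabled flag is on (turning it off risks OOM; in fact the rehash-and-grow step above already carries an OOM risk), that the number of elements read is a multiple of 32, and that the memory currently in use exceeds the configured threshold (5 MB). If they all hold, it asks shuffleMemoryManager for more memory (shuffleMemoryManager has a limit of its own and records every request made by each shuffle task). The amount requested is twice the memory currently in use minus the threshold (5 MB), and if the request succeeds the threshold is raised accordingly. If the memory in use is still above the new threshold, a spill is mandatory; otherwise memory is considered sufficient. After an actual spill, the memory just obtained from shuffleMemoryManager is released and the threshold is reset to its initial value (5 MB).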

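The spill method: if the number of partitions is <= 200 (the default of spark.shuffle.sort.bypassMergeThreshold) and map-side combine is not set, spillToPartitionFiles is called; otherwise spillToMergeableFile is called. Both are covered below.

In this branch shouldCombine is true, so the method called is spillToMergeableFile.

Note that before a spill there is a data structure holding the data, and it can be either map or buffer. Because shouldCombine may need to update data already stored, i.e. call our mergeValue and the like, map is used.

(2) If bypassMergeSort holds (the number of partitions is <= 200 and map-side combine is not set), spillToPartitionFiles is called. This mode writes the partition files directly, so there is no in-memory caching to speak of.

(3) If neither shouldCombine nor bypassMergeSort holds, no merge is needed, so buffer is used directly as the cache before spilling. maybeSpillCollection is then called.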
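The spill decision described under (1) can be sketched as follows. This is a simplified model under stated assumptions, not Spark's implementation: MemoryPool and SpillTracker are hypothetical stand-ins for shuffleMemoryManager and the sorter's bookkeeping, the spillingEnabled flag is left out, and the exact boundary comparisons (>= versus >) are a guess.

```scala
object SpillDecisionSketch {
  // Hypothetical stand-in for the shared shuffle memory pool.
  final class MemoryPool(private var free: Long) {
    /** Grants as much of the request as is still free (possibly 0). */
    def tryToAcquire(bytes: Long): Long = {
      val granted = math.max(0L, math.min(bytes, free))
      free -= granted
      granted
    }
    def release(bytes: Long): Unit = free += bytes
    def available: Long = free
  }

  final class SpillTracker(pool: MemoryPool, initialThreshold: Long = 5L * 1024 * 1024) {
    private var threshold = initialThreshold
    def currentThreshold: Long = threshold

    /** Returns true when the caller has to spill its in-memory collection. */
    def maybeSpill(elementsRead: Long, currentMemory: Long): Boolean = {
      val checkNow = elementsRead % 32 == 0 && currentMemory >= threshold
      if (!checkNow) false
      else {
        // Ask for enough memory to double the current usage.
        threshold += pool.tryToAcquire(2 * currentMemory - threshold)
        val mustSpill = currentMemory >= threshold
        if (mustSpill) {
          // Give back everything acquired so far and start over at the initial threshold.
          pool.release(threshold - initialThreshold)
          threshold = initialThreshold
        }
        mustSpill
      }
    }
  }
}
```

As long as the pool can keep granting memory, the threshold grows and nothing is spilled; a spill only happens once a request is not granted in full and usage is still at or above the threshold.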

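Now look at spillToMergeableFile:

(1) Create a file for the shuffle write in a subdirectory under localDirs.

(2) Sort the cached data by partition ID and then by key within each partition, which gives data of the form ((partitionId_0,key_0),value_0),((partitionId_0,key_1),value_1)......((partitionId_100,key_100),value_100).

(3) Write to the file incrementally, syncing after every 10000 records, and keep a spilledFile structure in memory.

In other words, a map task produces one file per spill (a single map task may spill several times), and each file is sorted internally.

That completes one spill.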
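Steps (2) and (3) can be sketched like this. The names SpillSortSketch, sortForSpill and writeBatches are hypothetical, and the sketch assumes the keys have a natural ordering; the comparator Spark actually uses is not shown here and depends on how the sorter is configured.

```scala
object SpillSortSketch {
  /** Sorts buffered records by partition id first, then by key within a partition. */
  def sortForSpill[K: Ordering, V](buffer: Seq[((Int, K), V)]): Seq[((Int, K), V)] =
    buffer.sortBy(_._1) // the tuple ordering compares the partition id, then the key

  /** Splits sorted records into write batches; a writer would sync after each batch. */
  def writeBatches[K, V](sorted: Seq[((Int, K), V)],
                         batchSize: Int = 10000): List[Seq[((Int, K), V)]] =
    sorted.grouped(batchSize).toList
}
```

Because every spill file is sorted by the same (partitionId, key) order, several spill files can later be merged partition by partition without sorting again.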

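Now look at spillToPartitionFiles:

Each map task creates a separate file for every reduce partition, and no sorting is needed.

That covers insertAll; let's continue.

A data file is created from the shuffleId and mapId, and then writePartitionedFile is called: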

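(1) If bypassMergeSort was in effect, i.e. spillToPartitionFiles was used, the data remaining in buffer is written to the file of the corresponding reduce partition, and then all of the output files are merged into one data file.

(2) If there is no spilledFile information in memory, i.e. all of the data is still in memory, it is written straight to the data file.

(3) Otherwise, and this is the most complex case, all of the files this map task has produced are consolidated by partition into one data file, roughly in the form (partition0, all of this map task's data for partition 0),(partition1, all of this map task's data for partition 1)......

Note that in cases (2) and (3), when the data is written into a single data file, the size of each partition within the data file is recorded.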
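Case (3) is essentially a k-way merge of sorted runs. The sketch below shows the idea under simplifying assumptions: MergeSpillsSketch and its methods are hypothetical names, each run is an in-memory sequence standing in for a spill file, keys are strings, and partition sizes are counted in records rather than bytes.

```scala
import scala.collection.mutable

object MergeSpillsSketch {
  type Record = ((Int, String), Int) // ((partitionId, key), value)

  /** Merges runs that are each already sorted by (partitionId, key). */
  def mergeRuns(runs: Seq[Seq[Record]]): Seq[Record] = {
    // PriorityQueue is a max-heap, so the ordering is reversed to pop the smallest head first.
    val byHead = Ordering.by[BufferedIterator[Record], (Int, String)](_.head._1).reverse
    val heap = new mutable.PriorityQueue[BufferedIterator[Record]]()(byHead)
    runs.map(_.iterator.buffered).filter(_.hasNext).foreach(it => heap.enqueue(it))
    val out = Seq.newBuilder[Record]
    while (heap.nonEmpty) {
      val it = heap.dequeue()
      out += it.next()
      if (it.hasNext) heap.enqueue(it) // re-insert with its new head
    }
    out.result()
  }

  /** Number of records per partition, the analogue of the recorded partition sizes. */
  def partitionLengths(merged: Seq[Record], numPartitions: Int): Array[Long] = {
    val lengths = new Array[Long](numPartitions)
    merged.foreach { case ((p, _), _) => lengths(p) += 1 }
    lengths
  }
}
```

Since the merged output is ordered by partition id, each partition's data ends up contiguous, which is what makes a per-partition size enough to describe the file.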

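An index file is created to go with the data file. It records the starting offset of every partition in the data file. As you can imagine, knowing each partition's offset amounts to knowing which part of the data file each partition occupies.

Finally, the shuffleServerId (which records the host, port and executorId) and the length of each partition are wrapped into a mapStatus and returned.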

II. HashShuffleManager

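In every mapper, Spark creates one bucket per reducer and puts the RDD's computed results into these buckets. Each bucket has a DiskObjectWriter, and each write handler has a buffer of a fixed size; the write handler is used to write the map output to a file. In other words, the key-value pairs of the map output are written to disk one at a time instead of being held in memory and flushed to disk as a whole, which puts far less pressure on memory. Of course, the number of map tasks running at the same time is limited by resources, so the memory required is roughly cores * reducer num * buffer size. Even so, when the numbers of reducers and maps are large, the memory overhead is still staggering.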
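To make the estimate concrete, here is a small worked calculation. The figures used (8 cores, 1000 reducers, a 32 KB buffer, 1000 map tasks) are assumptions chosen for illustration and do not come from the text above; HashShuffleCostSketch is a hypothetical name.

```scala
object HashShuffleCostSketch {
  /** Write-buffer memory on one executor: cores * reducer num * buffer size. */
  def bufferBytes(cores: Int, reducers: Int, bufferSizeBytes: Long): Long =
    cores.toLong * reducers * bufferSizeBytes

  /** Files produced when every map task writes one file per reducer. */
  def fileCount(mapTasks: Int, reducers: Int): Long =
    mapTasks.toLong * reducers
}

// With the assumed figures: 8 * 1000 * 32 KB = 262,144,000 bytes of buffers (250 MiB),
// and 1000 map tasks * 1000 reducers = 1,000,000 shuffle files.
```

Both numbers grow with the number of reducers, which is the cost that the single-file approach of SortShuffleManager is meant to avoid.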

The write path of HashShuffleManager is much simpler by comparison.

 /** Write a bunch of records to this task's output */
  override def write(records: Iterator[Product2[K, V]]): Unit = {
    val iter = if (dep.aggregator.isDefined) {
      if (dep.mapSideCombine) {
        dep.aggregator.get.combineValuesByKey(records, context)
      } else {
        records
      }
    } else {
      require(!dep.mapSideCombine, "Map-side combine without Aggregator specified!")
      records
    }

    for (elem <- iter) {
      val bucketId = dep.partitioner.getPartition(elem._1)
      shuffle.writers(bucketId).write(elem._1, elem._2)
    }
  }

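(1) If mapSideCombine is defined, the key-value pairs are combined, much like the shouldCombine branch of insertAll above. Otherwise they are left untouched.

(2) Then, for every key-value pair, the target partition is computed and the pair is written, one at a time, to that partition's file.

This mode naturally needs no sorting, merging or other complex operations, because in the end each map task writes one file per reduce partition.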
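The two steps can be sketched as below. This is an illustration, not the HashShuffleWriter code: HashBucketWriteSketch is a hypothetical name, in-memory vectors stand in for the per-reducer files, the combine function is a plain (V, V) => V rather than Spark's Aggregator, and bucketOf assumes a hash-based partitioner.

```scala
import scala.collection.mutable

object HashBucketWriteSketch {
  /** Non-negative hash modulo, assuming a hash-based partitioner. */
  def bucketOf(key: Any, numReducers: Int): Int = {
    val mod = key.hashCode % numReducers
    if (mod < 0) mod + numReducers else mod
  }

  /** Returns one bucket per reducer; each bucket stands in for one output file. */
  def write[K, V](records: Iterator[(K, V)], numReducers: Int,
                  combine: Option[(V, V) => V]): Vector[Vector[(K, V)]] = {
    // Step (1): optional map-side combine, merging values that share a key.
    val iter: Iterator[(K, V)] = combine match {
      case Some(merge) =>
        val combined = mutable.LinkedHashMap.empty[K, V]
        records.foreach { case (k, v) =>
          combined.update(k, combined.get(k).map(old => merge(old, v)).getOrElse(v))
        }
        combined.iterator
      case None => records
    }
    // Step (2): route every pair to the bucket of its reduce partition, one by one.
    val buckets = Vector.fill(numReducers)(Vector.newBuilder[(K, V)])
    iter.foreach { case (k, v) => buckets(bucketOf(k, numReducers)) += ((k, v)) }
    buckets.map(_.result())
  }
}
```

Without a combine function the records stream straight from the input iterator into the buckets, which matches the point that nothing needs to be sorted or held back before writing.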

Finally, a mapStatus structure is assembled and returned, just as before.

That concludes the shuffle write path.

The next section covers the shuffle read path.

yueqian_zhu
