Spark: HashShuffle Source Code Analysis


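The previous post compared the two shuffle implementations; here we walk through the source, starting with HashShuffle. To recap the flow: when an Executor receives a LaunchTask message, it calls the executor's launchTask() method, which wraps the Task in a TaskRunner (a Runnable) and submits it to a thread pool. Execution eventually reaches Task.run(), which calls runTask(). runTask() is where the task actually executes, and its source was also covered earlier.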
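runTask() first obtains the ShuffleManager, which has two implementations: HashShuffleManager and SortShuffleManager. We look at HashShuffleManager first. Its getWriter() returns a HashShuffleWriter, whose write() method processes the data and writes the results to files. Here is HashShuffleWriter's write() method: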

override def write(records: Iterator[_ <: Product2[K, V]]): Unit = {
    // First decide whether a map-side (local) combine is needed.
    // For an operation such as reduceByKey, dep.aggregator.isDefined is true
    // and dep.mapSideCombine is true as well.
    val iter = if (dep.aggregator.isDefined) {
      if (dep.mapSideCombine) {
        // Perform the local combine here, e.g. (hello, 1) (hello, 1) => (hello, 2)
        dep.aggregator.get.combineValuesByKey(records, context)
      } else {
        records
      }
    } else {
      require(!dep.mapSideCombine, "Map-side combine without Aggregator specified!")
      records
    }
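    // If a combine is needed, it has already been applied above.
    // Now iterate over the records and call the partitioner (HashPartitioner by default)
    // on each key to get a bucketId. This decides which bucket each record is written to;
    // records with the same key always go to the same bucket.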
    for (elem <- iter) {
      val bucketId = dep.partitioner.getPartition(elem._1)
      // shuffle is the ShuffleWriterGroup returned by shuffleBlockManager.forMapTask();
      // pick the writer for this bucketId and write the record into that bucket
      shuffle.writers(bucketId).write(elem)
    }
  }

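The important step here is computing the bucketId. The partitioner (HashPartitioner by default) is called for every record and hashes the record's key, so records with the same key always get the same bucketId and land in the same bucket (buffer).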
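To make the bucket assignment concrete, here is a minimal sketch of the two steps in write(): a local combine followed by hash partitioning. It is an illustration only, not Spark's code. BucketSketch, bucketFor, combineByKey and route are hypothetical names, and the combine simply sums values; bucketFor follows the same idea as HashPartitioner, a non-negative modulo of the key's hashCode.

```scala
object BucketSketch {
  // Same idea as HashPartitioner: non-negative modulo of the key's hashCode.
  // A null key is sent to bucket 0.
  def bucketFor(key: Any, numBuckets: Int): Int = {
    if (key == null) 0
    else {
      val raw = key.hashCode % numBuckets
      if (raw < 0) raw + numBuckets else raw
    }
  }

  // Toy map-side combine (assumption: the aggregation is a plain sum).
  def combineByKey(records: Seq[(String, Int)]): Map[String, Int] =
    records.groupBy(_._1).map { case (k, vs) => k -> vs.map(_._2).sum }

  // Combine first, then route every combined record to its bucket.
  def route(records: Seq[(String, Int)], numBuckets: Int): Map[Int, Seq[(String, Int)]] =
    combineByKey(records).toSeq.groupBy { case (k, _) => bucketFor(k, numBuckets) }
}
```

Calling BucketSketch.route(Seq("hello" -> 1, "hello" -> 1, "spark" -> 1), 4) first merges the two hello records into (hello, 2) and then places each key in exactly one bucket, so a reducer reading one bucket sees every value for the keys assigned to it.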
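The writers come from ShuffleBlockManager's forMapTask() method, and each writer's write() method writes the data to a file on disk. ShuffleBlockManager is a trait; the implementation used here is FileShuffleBlockManager, so let's look at its forMapTask() method: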

def forMapTask(shuffleId: Int, mapId: Int, numBuckets: Int, serializer: Serializer,
      writeMetrics: ShuffleWriteMetrics) = {
    new ShuffleWriterGroup {
      shuffleStates.putIfAbsent(shuffleId, new ShuffleState(numBuckets))
      private val shuffleState = shuffleStates(shuffleId)
      private var fileGroup: ShuffleFileGroup = null

      val openStartTime = System.nanoTime
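      // This is the key part. As described earlier, HashShuffle has two modes,
      // the plain one and the optimized one, and the choice is made here.
      // If consolidation is enabled, i.e. consolidateShuffleFiles is true,
      // a bucket does not get its own independent file;
      // instead it gets a writer onto a ShuffleFileGroup.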
      val writers: Array[BlockObjectWriter] = if (consolidateShuffleFiles) {
        fileGroup = getUnusedFileGroup()
        Array.tabulate[BlockObjectWriter](numBuckets) { bucketId =>
          // First build a unique ShuffleBlockId from shuffleId, mapId and bucketId (reduceId).
          // fileGroup(bucketId) then calls ShuffleFileGroup's apply(), which returns the file
          // in the group that belongs to this bucket.
          val blockId = ShuffleBlockId(shuffleId, mapId, bucketId)
          // BlockManager's getDiskWriter() then returns a writer onto that ShuffleFileGroup file.
          // So with consolidation enabled, every bucket gets a writer onto a ShuffleFileGroup
          // rather than onto a standalone ShuffleBlockFile, which is how the output of
          // multiple ShuffleMapTasks ends up merged.
          blockManager.getDiskWriter(blockId, fileGroup(bucketId), serializer, bufferSize,
            writeMetrics)
        }
      } else {
        // Consolidation is not enabled: this is the plain shuffle path
        Array.tabulate[BlockObjectWriter](numBuckets) { bucketId =>
          // Build a ShuffleBlockId in the same way
          val blockId = ShuffleBlockId(shuffleId, mapId, bucketId)
          // Ask BlockManager's diskBlockManager for the blockFile to be written on local disk
          val blockFile = blockManager.diskBlockManager.getFile(blockId)
          // Because of previous failures, the shuffle file may already exist on this machine.
          // If so, remove it.
          // If this blockFile already exists, delete it, because each bucket maps to one blockFile
          if (blockFile.exists) {
            if (blockFile.delete()) {
              logInfo(s"Removed existing shuffle file $blockFile")
            } else {
              logWarning(s"Failed to remove existing shuffle file $blockFile")
            }
          }
          // Then call blockManager's getDiskWriter() to create a writer for that blockFile
          blockManager.getDiskWriter(blockId, blockFile, serializer, bufferSize, writeMetrics)
        }

        // With the plain shuffle, every bucket written by every ShuffleMapTask
        // gets its own separate ShuffleBlockFile on local disk
      }
      // Some code omitted
      ........
   }
 }

This method returns a ShuffleWriterGroup for each map task, and it shows clearly how the path with consolidation enabled differs from the path without it.
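With consolidation enabled, the task first obtains a file group. If no unused file group is available, a new one is created; otherwise an existing one is returned, which is how a later task reuses the file group (and therefore the same files) created by an earlier task. It then builds a unique ShuffleBlockId from shuffleId, mapId and bucketId, and asks BlockManager for a writer onto the ShuffleFileGroup. The writer carries the blockId and the file group's file, along with the buffer for the bucket being written.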
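The reuse described above can be sketched as a small pool. This is a minimal sketch, not Spark's code: FileGroup, FileGroupPool, acquire and release are hypothetical names standing in for ShuffleFileGroup and the getUnusedFileGroup() call seen in forMapTask(), and files are represented by their names only.

```scala
import java.util.concurrent.ConcurrentLinkedQueue
import java.util.concurrent.atomic.AtomicInteger

object ConsolidationSketch {
  // Hypothetical stand-in for ShuffleFileGroup: one file name per bucket.
  final case class FileGroup(fileId: Int, files: Vector[String]) {
    def apply(bucketId: Int): String = files(bucketId)
  }

  final class FileGroupPool(shuffleId: Int, numBuckets: Int) {
    private val unused = new ConcurrentLinkedQueue[FileGroup]()
    private val nextFileId = new AtomicInteger(0)

    // Reuse a group released by a finished task if there is one, else create a new group.
    def acquire(): FileGroup = {
      val existing = unused.poll() // returns null when the queue is empty
      if (existing != null) existing else newGroup()
    }

    // A task hands its group back when it finishes so that a later task can append to it.
    def release(group: FileGroup): Unit = unused.add(group)

    def groupsCreated: Int = nextFileId.get()

    private def newGroup(): FileGroup = {
      val fileId = nextFileId.getAndIncrement()
      // Illustrative file names only.
      FileGroup(fileId, Vector.tabulate(numBuckets)(b => s"merged_shuffle_${shuffleId}_${b}_${fileId}"))
    }
  }
}
```

If one task acquires a group, finishes and releases it, the next acquire() returns that same group, so the second task appends to the first task's files. A new group is only created when two tasks run at the same time.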
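Without consolidation, a ShuffleBlockId is built in the same way and a blockFile is then obtained for it. If that file already exists, it was left behind by an earlier task (for example a failed attempt), so it is deleted before being created again. A writer is then obtained in the same way.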
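The plain path can be sketched with ordinary java.io.File calls. This is an illustration under assumptions, not the BlockManager API: PlainShuffleSketch and openBucketFiles are hypothetical names, the target directory is supplied by the caller, and the file name follows the shuffle_<shuffleId>_<mapId>_<bucketId> pattern of a ShuffleBlockId.

```scala
import java.io.File

object PlainShuffleSketch {
  // Returns one file per bucket under dir. A stale file with the same name is
  // deleted first; if the delete fails, a warning is printed and the file is kept.
  def openBucketFiles(dir: File, shuffleId: Int, mapId: Int, numBuckets: Int): Vector[File] =
    Vector.tabulate(numBuckets) { bucketId =>
      val blockFile = new File(dir, s"shuffle_${shuffleId}_${mapId}_${bucketId}")
      if (blockFile.exists && !blockFile.delete()) {
        System.err.println(s"Failed to remove existing shuffle file $blockFile")
      }
      blockFile.createNewFile() // no-op if the stale file could not be removed
      blockFile
    }
}
```

Every call produces numBuckets files for a single map task, which is why the file count grows with the number of map tasks on this path.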
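The source makes the difference plain: with consolidation enabled, tasks reuse the files created by the first task, wrapped in a ShuffleFileGroup; without it, every task creates new files of its own. That is the biggest difference between the two modes.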
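This difference changes how many files the tasks create, and the number of files produced on the map side has a large effect on Spark's performance. If you are still running the HashShuffle from an older Spark version in production, enabling consolidation is strongly recommended (set spark.shuffle.consolidateFiles to true in SparkConf).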
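A rough count shows why the setting matters. The sketch below is back-of-the-envelope arithmetic under assumptions, not a measurement: ShuffleFileCount and its methods are hypothetical names, the plain path is taken to create one file per (map task, bucket) pair, and the consolidated path one file per (file group, bucket) pair, with at most one file group per concurrently running task.

```scala
object ShuffleFileCount {
  // Plain HashShuffle: every map task writes one file per bucket.
  def withoutConsolidation(mapTasks: Int, buckets: Int): Long =
    mapTasks.toLong * buckets

  // Consolidated HashShuffle: files are shared per file group, and the number of
  // file groups is bounded by how many tasks run at the same time.
  def withConsolidation(concurrentTasks: Int, buckets: Int): Long =
    concurrentTasks.toLong * buckets
}
```

With illustrative numbers of 1000 map tasks, 100 buckets and 16 concurrently running tasks, the plain path creates 1000 * 100 = 100000 files, while the consolidated path creates at most 16 * 100 = 1600.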
