Spark Internals: Shuffle in Detail (Part 2)

Original

2020-07-02 19:31

This post focuses on how the Shuffle Read of a ShuffledRDD fetches data from other nodes.

As the previous post mentioned, the strategy for how blocks are fetched lives in org.apache.spark.storage.BlockFetcherIterator.BasicBlockFetcherIterator#splitLocalRemoteBlocks. See the comments in the code.

    protected def splitLocalRemoteBlocks(): ArrayBuffer[FetchRequest] = {
      // Make remote requests at most maxBytesInFlight / 5 in length; the reason to keep them
      // smaller than maxBytesInFlight is to allow multiple, parallel fetches from up to 5
      // nodes, rather than blocking on reading output from one node.
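      // To get the data quickly, fetches are issued in parallel to at most 5 nodes;
      // each request targets about spark.reducer.maxMbInFlight (default 48 MB) / 5.
      // There are a few reasons for doing this:
      // 1. It avoids taking too much of the target machine's bandwidth. With gigabit NICs
      //    still the norm, bandwidth matters: if a single connection had to carry the full
      //    48 MB, network I/O could become the bottleneck.
      // 2. Requests can run in parallel, which greatly reduces fetch time: the total time is
      //    that of the slowest request, whereas serial requests would take the sum of all
      //    request times.
      // Capping spark.reducer.maxMbInFlight also keeps memory usage from growing too large.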
      val targetRequestSize = math.max(maxBytesInFlight / 5, 1L)
      logInfo("maxBytesInFlight: " + maxBytesInFlight + ", targetRequestSize: " + targetRequestSize)

      // Split local and remote blocks. Remote blocks are further split into FetchRequests of size
      // at most maxBytesInFlight in order to limit the amount of data in flight.
      val remoteRequests = new ArrayBuffer[FetchRequest]
      var totalBlocks = 0
      for ((address, blockInfos) <- blocksByAddress) { // address is a BlockManagerId, i.e. it identifies an executor
        totalBlocks += blockInfos.size
        if (address == blockManagerId) { // the data is local, so go straight to a local read
          // Filter out zero-sized blocks
          localBlocksToFetch ++= blockInfos.filter(_._2 != 0).map(_._1)
          _numBlocksToFetch += localBlocksToFetch.size
        } else {
          val iterator = blockInfos.iterator
          var curRequestSize = 0L
          var curBlocks = new ArrayBuffer[(BlockId, Long)]
          while (iterator.hasNext) {
            // blockId is an org.apache.spark.storage.ShuffleBlockId,
            // named as: "shuffle_" + shuffleId + "_" + mapId + "_" + reduceId
            val (blockId, size) = iterator.next()
            // Skip empty blocks
            if (size > 0) { // only non-empty blocks are requested
              curBlocks += ((blockId, size))
              remoteBlocksToFetch += blockId
              _numBlocksToFetch += 1
              curRequestSize += size
            } else if (size < 0) {
              throw new BlockException(blockId, "Negative block size " + size)
            }
            if (curRequestSize >= targetRequestSize) { // keep a single request from growing too large
              // Add this FetchRequest
              remoteRequests += new FetchRequest(address, curBlocks)
              curBlocks = new ArrayBuffer[(BlockId, Long)]
              logDebug(s"Creating fetch request of $curRequestSize at $address")
              curRequestSize = 0
            }
          }
          // Add in the final request
          if (!curBlocks.isEmpty) { // put the remaining blocks into one last request
            remoteRequests += new FetchRequest(address, curBlocks)
          }
        }
      }
      logInfo("Getting " + _numBlocksToFetch + " non-empty blocks out of " +
        totalBlocks + " blocks")
      remoteRequests
    }
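
To make the grouping rule above easier to follow, here is a minimal standalone sketch of the remote branch only. It is not Spark's code: `SplitSketch`, `Request`, and `splitRemote` are hypothetical names introduced for illustration, the address and block IDs are plain strings rather than `BlockManagerId` and `BlockId`, and the local-read branch is left out. What it keeps is the division by 5, the skipping of empty blocks, and the "close the request once it reaches the target" check.

```scala
import scala.collection.mutable.ArrayBuffer

object SplitSketch {
  // Hypothetical stand-in for Spark's FetchRequest: a node name plus (blockId, size) pairs.
  final case class Request(address: String, blocks: List[(String, Long)]) {
    def size: Long = blocks.map(_._2).sum
  }

  // Groups the blocks held by one remote node into requests of roughly
  // maxBytesInFlight / 5 bytes each, mirroring the loop in the listing above.
  def splitRemote(
      address: String,
      blocks: Seq[(String, Long)],
      maxBytesInFlight: Long): List[Request] = {
    val targetRequestSize = math.max(maxBytesInFlight / 5, 1L)
    val requests = new ArrayBuffer[Request]
    var curBlocks = new ArrayBuffer[(String, Long)]
    var curRequestSize = 0L
    for ((blockId, size) <- blocks) {
      require(size >= 0, s"Negative block size $size for $blockId")
      if (size > 0) { // empty blocks are never requested
        curBlocks += ((blockId, size))
        curRequestSize += size
      }
      if (curRequestSize >= targetRequestSize) { // target reached: close this request
        requests += Request(address, curBlocks.toList)
        curBlocks = new ArrayBuffer[(String, Long)]
        curRequestSize = 0L
      }
    }
    // Whatever is left over becomes the final request.
    if (curBlocks.nonEmpty) requests += Request(address, curBlocks.toList)
    requests.toList
  }
}
```

With a toy `maxBytesInFlight` of 50 the target is 10, so blocks of sizes 4, 0, 7, 3, 12, 2 on one node are grouped as (4, 7), (3, 12), and (2). The empty block is dropped, and a request can overshoot the target by up to one block because the check runs after the block is added. With the 48 MB default, the same division gives a target of 10,066,329 bytes per request.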
