Blog: http://blog.csdn.net/yueqian_zhu/
The shuffle read flow, like the write flow, starts from the compute method (here, ShuffledRDD.compute):
override def compute(split: Partition, context: TaskContext): Iterator[(K, C)] = {
  val dep = dependencies.head.asInstanceOf[ShuffleDependency[K, V, C]]
  SparkEnv.get.shuffleManager.getReader(dep.shuffleHandle, split.index, split.index + 1, context)
    .read()
    .asInstanceOf[Iterator[(K, C)]]
}
At this point, whether the shuffleManager is SortShuffleManager or HashShuffleManager, getReader returns a HashShuffleReader.
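For context, here is roughly what that hook looks like on the ShuffleManager trait in Spark 1.x (a simplified excerpt; it is a private[spark] internal, quoted as a reading aid rather than as user-facing API):

private[spark] trait ShuffleManager {
  // Called on reduce tasks to get a reader covering partitions
  // [startPartition, endPartition); compute above asks for exactly one
  // partition via (split.index, split.index + 1).
  def getReader[K, C](
      handle: ShuffleHandle,
      startPartition: Int,
      endPartition: Int,
      context: TaskContext): ShuffleReader[K, C]
}

// In Spark 1.x both HashShuffleManager and SortShuffleManager implement it
// along the lines of:
//   new HashShuffleReader(handle.asInstanceOf[BaseShuffleHandle[K, _, C]],
//     startPartition, endPartition, context)

Next, its read method is called: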
/** Read the combined key-values for this reduce task */
override def read(): Iterator[Product2[K, C]] = {
  val ser = Serializer.getSerializer(dep.serializer)
  val iter = BlockStoreShuffleFetcher.fetch(handle.shuffleId, startPartition, context, ser)

  val aggregatedIter: Iterator[Product2[K, C]] = if (dep.aggregator.isDefined) {
    if (dep.mapSideCombine) {
      new InterruptibleIterator(context, dep.aggregator.get.combineCombinersByKey(iter, context))
    } else {
      new InterruptibleIterator(context, dep.aggregator.get.combineValuesByKey(iter, context))
    }
  } else {
    require(!dep.mapSideCombine, "Map-side combine without Aggregator specified!")
    // Convert the Product2s to pairs since this is what downstream RDDs currently expect
    iter.asInstanceOf[Iterator[Product2[K, C]]].map(pair => (pair._1, pair._2))
  }

  // Sort the output if there is a sort ordering defined.
  dep.keyOrdering match {
    case Some(keyOrd: Ordering[K]) =>
      // Create an ExternalSorter to sort the data. Note that if spark.shuffle.spill is disabled,
      // the ExternalSorter won't spill to disk.
      val sorter = new ExternalSorter[K, C, C](ordering = Some(keyOrd), serializer = Some(ser))
      sorter.insertAll(aggregatedIter)
      context.taskMetrics.incMemoryBytesSpilled(sorter.memoryBytesSpilled)
      context.taskMetrics.incDiskBytesSpilled(sorter.diskBytesSpilled)
      sorter.iterator
    case None =>
      aggregatedIter
  }
}
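As a side note on which operators exercise which branch (based on the Spark 1.x RDD API; the exact wiring can differ across versions): reduceByKey and combineByKey install an aggregator with mapSideCombine = true, groupByKey installs one with mapSideCombine = false, and sortByKey installs only a keyOrdering, roughly as in OrderedRDDFunctions:

// Simplified excerpt of how sortByKey wires keyOrdering into the ShuffledRDD
// (Spark 1.x OrderedRDDFunctions; the RangePartitioner picks range bounds so
// that the partitions are globally ordered):
val part = new RangePartitioner(numPartitions, self, ascending)
new ShuffledRDD[K, V, V](self, part)
  .setKeyOrdering(if (ascending) ordering else ordering.reverse)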
This method first calls fetch; a few notes on what fetch does:
1. As covered in the section on task execution, when a ShuffleMapTask finishes, the mapping from its shuffleId to its MapStatus is registered with the MapOutputTracker.
2. fetch first looks in the local mapStatuses cache for this shuffleId; on a hit it uses the local copy, otherwise it asks the master's MapOutputTracker, obtaining for each map output the block manager address and the length of this partition's file segment (a toy model of this lookup follows the list).
3. With the shuffleId and those addresses, it then reads the blocks, remotely or locally, over netty/nio and returns an iterator.
4. The data behind the returned iterator is not all in memory at once: blocks are fetched up to a configured cap on in-flight bytes (spark.reducer.maxMbInFlight in Spark 1.x), and the next block is only requested once there is room under the cap.
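To make step 2 concrete, here is a toy model of the lookup (my own stand-ins, not Spark source: BlockManagerId and ShuffleBlockId mimic the real block ids, and askMaster stands in for the round trip to the driver-side MapOutputTrackerMaster):

import scala.collection.mutable

case class BlockManagerId(host: String, port: Int)
case class ShuffleBlockId(shuffleId: Int, mapId: Int, reduceId: Int)

object FetchSketch {
  // Locally cached statuses: shuffleId -> per-map (location, size of this reduce partition)
  private val mapStatuses = mutable.Map[Int, Array[(BlockManagerId, Long)]]()

  // Stand-in for the network round trip to the master's tracker
  private def askMaster(shuffleId: Int): Array[(BlockManagerId, Long)] =
    Array((BlockManagerId("worker-1", 7337), 1024L),
          (BlockManagerId("worker-2", 7337), 2048L))

  def getServerStatuses(shuffleId: Int): Array[(BlockManagerId, Long)] =
    mapStatuses.getOrElseUpdate(shuffleId, askMaster(shuffleId)) // local hit, else ask master

  // Turn the statuses into concrete block ids to fetch, grouped by address so
  // that each remote block manager can be contacted in batches.
  def blocksByAddress(shuffleId: Int, reduceId: Int): Map[BlockManagerId, Seq[ShuffleBlockId]] =
    getServerStatuses(shuffleId).zipWithIndex
      .groupBy { case ((addr, _), _) => addr }
      .map { case (addr, arr) =>
        addr -> arr.toSeq.map { case (_, mapId) => ShuffleBlockId(shuffleId, mapId, reduceId) }
      }
}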
After fetch returns the iterator, mapSideCombine determines how the fetched records are merged: if the map side already combined (mapSideCombine = true), the read side merges partial combiners with combineCombinersByKey; otherwise it builds combiners from raw values with combineValuesByKey. The merge itself works like the write flow: when the in-memory map cannot hold everything, it spills to local disk. A minimal illustration of the two paths follows.
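This plain-Scala sketch shows only the semantics of the two combine paths; it keeps everything in memory, whereas Spark's Aggregator pushes the same merge functions through an ExternalAppendOnlyMap that can spill:

// In-memory illustration of Aggregator's two combine paths (not Spark code)
def combineValuesByKey[K, V, C](iter: Iterator[(K, V)],
                                createCombiner: V => C,
                                mergeValue: (C, V) => C): Map[K, C] =
  iter.foldLeft(Map.empty[K, C]) { case (m, (k, v)) =>
    m.updated(k, m.get(k) match {
      case Some(c) => mergeValue(c, v)   // key seen before: merge the new value in
      case None    => createCombiner(v)  // first value for this key
    })
  }

def combineCombinersByKey[K, C](iter: Iterator[(K, C)],
                                mergeCombiners: (C, C) => C): Map[K, C] =
  iter.foldLeft(Map.empty[K, C]) { case (m, (k, c)) =>
    m.updated(k, m.get(k).map(mergeCombiners(_, c)).getOrElse(c))
  }

// Word count: with mapSideCombine the read side receives partial counts and
// only needs mergeCombiners; without it, it receives raw values and also
// needs createCombiner / mergeValue.
val partials = Iterator(("a", 3), ("b", 1), ("a", 2))
val counts = combineCombinersByKey(partials, (x: Int, y: Int) => x + y) // Map(a -> 5, b -> 1)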
If keyOrdering is also defined, a new ExternalSorter performs an external sort, again through the same insertAll used in the shuffle write flow. A toy sketch of the external-sort idea (spill sorted runs, then merge) is below.
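To make that idea concrete, here is a self-contained toy (my own sketch, not Spark code): it buffers records, spills a sorted run to a temp file whenever the buffer fills, and merge-sorts all runs on iteration. Spark's ExternalSorter does the same thing far more carefully, with real memory accounting, serialization, and optional aggregation during the merge:

import java.io.{File, PrintWriter}
import scala.collection.mutable.ArrayBuffer
import scala.io.Source

// Toy external sort over Int records; spillThreshold plays the role of the
// memory limit that ExternalSorter tracks for real.
class ToyExternalSorter(spillThreshold: Int = 10000) {
  private val buffer = ArrayBuffer[Int]()
  private val spills = ArrayBuffer[File]()

  def insertAll(records: Iterator[Int]): Unit = records.foreach { r =>
    buffer += r
    if (buffer.size >= spillThreshold) spill()
  }

  // Write the current buffer to disk as one sorted run
  private def spill(): Unit = {
    val f = File.createTempFile("toy-spill", ".txt")
    f.deleteOnExit()
    val out = new PrintWriter(f)
    try buffer.sorted.foreach(out.println) finally out.close()
    spills += f
    buffer.clear()
  }

  // k-way merge of the on-disk runs plus the remaining in-memory run
  def iterator: Iterator[Int] = {
    val runs = spills.map(f => Source.fromFile(f).getLines().map(_.toInt)) :+
      buffer.sorted.iterator
    val heads = runs.map(_.buffered)
    Iterator.continually {
      val nonEmpty = heads.filter(_.hasNext)
      if (nonEmpty.isEmpty) None else Some(nonEmpty.minBy(_.head).next())
    }.takeWhile(_.isDefined).map(_.get)
  }
}

// Usage: insertAll may spill several runs, yet iterator still yields a
// globally sorted stream.
//   val sorter = new ToyExternalSorter(spillThreshold = 4)
//   sorter.insertAll(Iterator(9, 1, 7, 3, 8, 2, 6))
//   sorter.iterator.foreach(println) // 1 2 3 6 7 8 9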