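This post looks at how Shuffle Read in ShuffledRDD fetches data from other nodes.
As the previous post mentioned, the strategy that decides how blocks are fetched lives in org.apache.spark.storage.BlockFetcherIterator.BasicBlockFetcherIterator#splitLocalRemoteBlocks. See the comments in the code below.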
protected def splitLocalRemoteBlocks(): ArrayBuffer[FetchRequest] = {
// Make remote requests at most maxBytesInFlight / 5 in length; the reason to keep them
// smaller than maxBytesInFlight is to allow multiple, parallel fetches from up to 5
// nodes, rather than blocking on reading output from one node.
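// To get the data quickly, remote requests are sized so that up to 5 of them can be
// in flight in parallel, against up to 5 nodes; a request is closed as soon as it
// reaches spark.reducer.maxMbInFlight (default 48 MB) / 5.
// There are several reasons for doing this:
// 1. It avoids taking too much bandwidth on the target machine. With gigabit NICs being
// mainstream today, bandwidth still matters. If a single connection had to carry the
// full 48 MB, that network IO could become the bottleneck.
// 2. The requests can run in parallel, which cuts the fetch time considerably: the total
// time is that of the slowest request. Without parallel requests, the total time would
// be the sum of all request times.
// Setting spark.reducer.maxMbInFlight also keeps the fetched data from using too much memory.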
val targetRequestSize = math.max(maxBytesInFlight / 5, 1L)
logInfo("maxBytesInFlight: " + maxBytesInFlight + ", targetRequestSize: " + targetRequestSize)
// Split local and remote blocks. Remote blocks are further split into FetchRequests of size
// at most maxBytesInFlight in order to limit the amount of data in flight.
val remoteRequests = new ArrayBuffer[FetchRequest]
var totalBlocks = 0
for ((address, blockInfos) <- blocksByAddress) { // address is a BlockManagerId, identifying the executor that holds the blocks
totalBlocks += blockInfos.size
if (address == blockManagerId) { // the data is local, so it goes straight through local read
// Filter out zero-sized blocks
localBlocksToFetch ++= blockInfos.filter(_._2 != 0).map(_._1)
_numBlocksToFetch += localBlocksToFetch.size
} else {
val iterator = blockInfos.iterator
var curRequestSize = 0L
var curBlocks = new ArrayBuffer[(BlockId, Long)]
while (iterator.hasNext) {
// blockId is an org.apache.spark.storage.ShuffleBlockId,
// named as: "shuffle_" + shuffleId + "_" + mapId + "_" + reduceId
val (blockId, size) = iterator.next()
// Skip empty blocks
if (size > 0) { // skip zero-sized blocks
curBlocks += ((blockId, size))
remoteBlocksToFetch += blockId
_numBlocksToFetch += 1
curRequestSize += size
} else if (size < 0) {
throw new BlockException(blockId, "Negative block size " + size)
}
if (curRequestSize >= targetRequestSize) { // keep a single request from growing too large
// Add this FetchRequest
remoteRequests += new FetchRequest(address, curBlocks)
curBlocks = new ArrayBuffer[(BlockId, Long)]
logDebug(s"Creating fetch request of $curRequestSize at $address")
curRequestSize = 0
}
}
// Add in the final request
if (!curBlocks.isEmpty) { // put the remaining blocks into one final request
remoteRequests += new FetchRequest(address, curBlocks)
}
}
}
logInfo("Getting " + _numBlocksToFetch + " non-empty blocks out of " +
totalBlocks + " blocks")
remoteRequests
}
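
To make the batching rule concrete, here is a minimal standalone sketch of the remote-block branch only. It is not Spark's code: `SplitSketch`, `Request`, and `split` are hypothetical names introduced for illustration, block IDs are plain strings, and the address is just a label. It assumes the same rule as the method above, namely that a request is closed once its accumulated size reaches `maxBytesInFlight / 5`.

```scala
import scala.collection.mutable.ArrayBuffer

object SplitSketch {
  // Hypothetical stand-in for FetchRequest: a node label plus (blockId, size) pairs.
  final case class Request(address: String, blocks: Seq[(String, Long)]) {
    def size: Long = blocks.map(_._2).sum
  }

  // Groups the blocks of one remote node into requests, mirroring the loop above.
  def split(address: String,
            blocks: Seq[(String, Long)],
            maxBytesInFlight: Long): Seq[Request] = {
    val targetRequestSize = math.max(maxBytesInFlight / 5, 1L)
    val requests = new ArrayBuffer[Request]
    var cur = new ArrayBuffer[(String, Long)]
    var curSize = 0L
    for ((id, size) <- blocks) {
      require(size >= 0, s"Negative block size $size for $id")
      if (size > 0) { // zero-sized blocks are skipped
        cur += ((id, size))
        curSize += size
      }
      // The check runs after the block is added, so a request can
      // overshoot the target by up to one block.
      if (curSize >= targetRequestSize) {
        requests += Request(address, cur.toSeq)
        cur = new ArrayBuffer[(String, Long)]
        curSize = 0L
      }
    }
    if (cur.nonEmpty) requests += Request(address, cur.toSeq) // leftover blocks
    requests.toSeq
  }
}
```

With the default of 48 MB, the target is 48 * 1024 * 1024 / 5 = 10066329 bytes (about 9.6 MB). Seven 4 MB blocks plus one empty block on the same node would therefore be grouped into requests of 3, 3, and 1 blocks: each of the first two requests is closed at 12 MB, the first point at which the running total reaches the target.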