Note: Spark version 2.3.1
When tuning Hive SQL, input splits have to be merged by enabling extra parameters; otherwise a large number of small splits (and hence map tasks) are produced.
So how does Spark SQL cope with a large number of small input files?
This walkthrough uses a Hive table as the example (lots of small parquet files, which are splittable).
First, we debug into org.apache.spark.sql.execution.FileSourceScanExec.
There is a pattern match here on the table's bucketing spec; our table is not bucketed, so it takes the default branch, which calls createNonBucketedReadRDD.
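For context, the dispatch in FileSourceScanExec.inputRDD looks roughly like the following (paraphrased from the Spark 2.3.x source rather than quoted verbatim, so the exact guard may differ slightly):

relation.bucketSpec match {
  case Some(bucketing) if relation.sparkSession.sessionState.conf.bucketingEnabled =>
    // Bucketed tables get one read partition per bucket.
    createBucketedReadRDD(bucketing, readFile, selectedPartitions, relation)
  case _ =>
    // Non-bucketed tables (our case) go through the split-and-coalesce logic below.
    createNonBucketedReadRDD(readFile, selectedPartitions, relation)
}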
The code of createNonBucketedReadRDD is as follows:
private def createNonBucketedReadRDD(
    readFile: (PartitionedFile) => Iterator[InternalRow],
    selectedPartitions: Seq[PartitionDirectory],
    fsRelation: HadoopFsRelation): RDD[InternalRow] = {
  val defaultMaxSplitBytes =
    fsRelation.sparkSession.sessionState.conf.filesMaxPartitionBytes
  val openCostInBytes = fsRelation.sparkSession.sessionState.conf.filesOpenCostInBytes
  val defaultParallelism = fsRelation.sparkSession.sparkContext.defaultParallelism
  val totalBytes = selectedPartitions.flatMap(_.files.map(_.getLen + openCostInBytes)).sum
  val bytesPerCore = totalBytes / defaultParallelism

  val maxSplitBytes = Math.min(defaultMaxSplitBytes, Math.max(openCostInBytes, bytesPerCore))

  val splitFiles = selectedPartitions.flatMap { partition =>
    partition.files.flatMap { file =>
      val blockLocations = getBlockLocations(file)
      if (fsRelation.fileFormat.isSplitable(
          fsRelation.sparkSession, fsRelation.options, file.getPath)) {
        (0L until file.getLen by maxSplitBytes).map { offset =>
          val remaining = file.getLen - offset
          val size = if (remaining > maxSplitBytes) maxSplitBytes else remaining
          val hosts = getBlockHosts(blockLocations, offset, size)
          PartitionedFile(
            partition.values, file.getPath.toUri.toString, offset, size, hosts)
        }
      } else {
        val hosts = getBlockHosts(blockLocations, 0, file.getLen)
        Seq(PartitionedFile(
          partition.values, file.getPath.toUri.toString, 0, file.getLen, hosts))
      }
    }
  }.toArray.sortBy(_.length)(implicitly[Ordering[Long]].reverse)

  val partitions = new ArrayBuffer[FilePartition]
  val currentFiles = new ArrayBuffer[PartitionedFile]
  var currentSize = 0L

  /** Close the current partition and move to the next. */
  def closePartition(): Unit = {
    if (currentFiles.nonEmpty) {
      val newPartition =
        FilePartition(
          partitions.size,
          currentFiles.toArray.toSeq) // Copy to a new Array.
      partitions += newPartition
    }
    currentFiles.clear()
    currentSize = 0
  }

  // Assign files to partitions using "Next Fit Decreasing"
  splitFiles.foreach { file =>
    if (currentSize + file.length > maxSplitBytes) {
      closePartition()
    }
    // Add the given file to the current partition.
    currentSize += file.length + openCostInBytes
    currentFiles += file
  }
  closePartition()

  new FileScanRDD(fsRelation.sparkSession, readFile, partitions)
}
There are a few key parameters and intermediate values here:
defaultMaxSplitBytes: the maximum size of a split (128 MB by default), controlled by "spark.sql.files.maxPartitionBytes".
openCostInBytes: the estimated cost of opening a file (4 MB by default), controlled by "spark.sql.files.openCostInBytes".
defaultParallelism: the default parallelism (if not set explicitly, usually the total number of CPU cores).
totalBytes: the total size of all parquet files, i.e. selectedPartitions.flatMap(_.files.map(_.getLen + openCostInBytes)).sum; as the code shows, openCostInBytes is added for every file.
bytesPerCore: the amount of data each core would process (totalBytes / defaultParallelism).
maxSplitBytes: Math.min(defaultMaxSplitBytes, Math.max(openCostInBytes, bytesPerCore)); when scanning a large amount of data this usually works out to 128 MB, as the worked example below shows.
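As a quick sanity check of the formula, here is a worked example with made-up numbers (10,000 parquet files of 1 MB each on a cluster with 200 cores; the figures are purely illustrative):

// Hypothetical workload: 10,000 parquet files of 1 MB each, 200 cores in total.
val defaultMaxSplitBytes = 128L * 1024 * 1024   // spark.sql.files.maxPartitionBytes
val openCostInBytes      = 4L * 1024 * 1024     // spark.sql.files.openCostInBytes
val defaultParallelism   = 200
val fileSizes            = Seq.fill(10000)(1L * 1024 * 1024)

val totalBytes   = fileSizes.map(_ + openCostInBytes).sum   // 10,000 * 5 MB = 50,000 MB
val bytesPerCore = totalBytes / defaultParallelism          // 250 MB per core
val maxSplitBytes =
  Math.min(defaultMaxSplitBytes, Math.max(openCostInBytes, bytesPerCore))
// bytesPerCore (250 MB) exceeds 128 MB, so maxSplitBytes is capped at 128 MB (134217728 bytes).

With very little data and many cores, bytesPerCore can drop below 128 MB, in which case maxSplitBytes shrinks accordingly (but never below openCostInBytes).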
val splitFiles = selectedPartitions.flatMap { ... }:
splitFiles is an Array of PartitionedFile entries, i.e. logical slices of the input files (path, offset and length), sorted by length in descending order.
The logic is shown below: in short, every splittable file is cut into slices no larger than maxSplitBytes (128 MB here).
(0L until file.getLen by maxSplitBytes).map { offset =>
  val remaining = file.getLen - offset
  val size = if (remaining > maxSplitBytes) maxSplitBytes else remaining
  val hosts = getBlockHosts(blockLocations, offset, size)
  PartitionedFile(
    partition.values, file.getPath.toUri.toString, offset, size, hosts)
}
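For instance, a single 300 MB splittable file with maxSplitBytes = 128 MB is sliced into three pieces. A minimal sketch of just the offset/size arithmetic (the PartitionedFile construction and block-host lookup are omitted; the 300 MB figure is made up):

// Hypothetical 300 MB file, maxSplitBytes of 128 MB.
val fileLen       = 300L * 1024 * 1024
val maxSplitBytes = 128L * 1024 * 1024

val slices = (0L until fileLen by maxSplitBytes).map { offset =>
  val remaining = fileLen - offset
  val size = if (remaining > maxSplitBytes) maxSplitBytes else remaining
  (offset, size)
}
// slices: (0, 128 MB), (128 MB, 128 MB), (256 MB, 44 MB)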
The slices are then coalesced: consecutive slices are packed into the same partition until adding the next one would push the running size past maxSplitBytes (128 MB), at which point the current partition is closed and a new one is started.
// Assign files to partitions using "Next Fit Decreasing"
splitFiles.foreach { file =>
  if (currentSize + file.length > maxSplitBytes) {
    closePartition()
  }
  // Add the given file to the current partition.
  currentSize += file.length + openCostInBytes
  currentFiles += file
}
/** Close the current partition and move to the next. */
def closePartition(): Unit = {
  if (currentFiles.nonEmpty) {
    val newPartition =
      FilePartition(
        partitions.size,
        currentFiles.toArray.toSeq) // Copy to a new Array.
    partitions += newPartition
  }
  currentFiles.clear()
  currentSize = 0
}
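To see the "Next Fit Decreasing" packing in isolation, here is a self-contained sketch with made-up slice sizes (already sorted in decreasing order, as splitFiles is); as in the real code, openCostInBytes is added to the running size for every slice:

import scala.collection.mutable.ArrayBuffer

object NextFitDecreasingDemo extends App {
  // Hypothetical slice sizes in MB, sorted in decreasing order like splitFiles.
  val sliceSizes      = Seq(128L, 90L, 60L, 44L, 10L, 5L).map(_ * 1024 * 1024)
  val maxSplitBytes   = 128L * 1024 * 1024
  val openCostInBytes = 4L * 1024 * 1024

  val partitions   = ArrayBuffer[Seq[Long]]()
  val currentFiles = ArrayBuffer[Long]()
  var currentSize  = 0L

  // Close the current partition and start a new one.
  def closePartition(): Unit = {
    if (currentFiles.nonEmpty) partitions += currentFiles.toList // copy before clearing
    currentFiles.clear()
    currentSize = 0
  }

  sliceSizes.foreach { size =>
    // Start a new partition if adding this slice would overflow the current one.
    if (currentSize + size > maxSplitBytes) closePartition()
    currentSize += size + openCostInBytes
    currentFiles += size
  }
  closePartition()

  // Resulting partitions (sizes in MB): [128], [90], [60, 44, 10], [5]
  println(partitions.map(_.map(_ / 1024 / 1024)))
}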
Finally, the method returns new FileScanRDD(fsRelation.sparkSession, readFile, partitions), with one RDD partition per packed FilePartition.
At this point the splitting and coalescing of the input files is done.
So Spark SQL does not require us to worry about whether the input contains small files, and we can concentrate on the business logic instead.
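If you still want to influence how aggressively small files are packed together, the two parameters above can be tuned per session. A minimal sketch (the table name and values are illustrative, not recommendations):

// Larger read partitions: fewer tasks when scanning many small files.
spark.conf.set("spark.sql.files.maxPartitionBytes", 256L * 1024 * 1024) // 256 MB per partition
spark.conf.set("spark.sql.files.openCostInBytes", 8L * 1024 * 1024)     // make tiny files "cost" more

val df = spark.table("db.small_file_parquet_table") // hypothetical table
println(df.rdd.getNumPartitions)                    // fewer, larger partitions than with the defaults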