SparkSQL Optimization: Do Small Input Files Need to Be Merged?

Note: Spark version 2.3.1
When tuning HiveSQL, you have to enable the merge parameters for input splits, otherwise a huge number of splits is produced.
So how does SparkSQL cope with a large number of small input files?
This walkthrough uses a Hive table as the example (many small parquet files, splittable).
First we debug into org.apache.spark.sql.execution.FileSourceScanExec.
Here relation.bucketSpec is pattern-matched (inside FileSourceScanExec.inputRDD); our table is not bucketed, so execution falls into the default case and calls createNonBucketedReadRDD.
The code is as follows:

  private def createNonBucketedReadRDD(
      readFile: (PartitionedFile) => Iterator[InternalRow],
      selectedPartitions: Seq[PartitionDirectory],
      fsRelation: HadoopFsRelation): RDD[InternalRow] = {
    val defaultMaxSplitBytes =
      fsRelation.sparkSession.sessionState.conf.filesMaxPartitionBytes
    val openCostInBytes = fsRelation.sparkSession.sessionState.conf.filesOpenCostInBytes
    val defaultParallelism = fsRelation.sparkSession.sparkContext.defaultParallelism
    val totalBytes = selectedPartitions.flatMap(_.files.map(_.getLen + openCostInBytes)).sum
    val bytesPerCore = totalBytes / defaultParallelism

    val maxSplitBytes = Math.min(defaultMaxSplitBytes, Math.max(openCostInBytes, bytesPerCore))

    val splitFiles = selectedPartitions.flatMap { partition =>
      partition.files.flatMap { file =>
        val blockLocations = getBlockLocations(file)
        if (fsRelation.fileFormat.isSplitable(
            fsRelation.sparkSession, fsRelation.options, file.getPath)) {
          (0L until file.getLen by maxSplitBytes).map { offset =>
            val remaining = file.getLen - offset
            val size = if (remaining > maxSplitBytes) maxSplitBytes else remaining
            val hosts = getBlockHosts(blockLocations, offset, size)
            PartitionedFile(
              partition.values, file.getPath.toUri.toString, offset, size, hosts)
          }
        } else {
          val hosts = getBlockHosts(blockLocations, 0, file.getLen)
          Seq(PartitionedFile(
            partition.values, file.getPath.toUri.toString, 0, file.getLen, hosts))
        }
      }
    }.toArray.sortBy(_.length)(implicitly[Ordering[Long]].reverse)

    val partitions = new ArrayBuffer[FilePartition]
    val currentFiles = new ArrayBuffer[PartitionedFile]
    var currentSize = 0L

    /** Close the current partition and move to the next. */
    def closePartition(): Unit = {
      if (currentFiles.nonEmpty) {
        val newPartition =
          FilePartition(
            partitions.size,
            currentFiles.toArray.toSeq) // Copy to a new Array.
        partitions += newPartition
      }
      currentFiles.clear()
      currentSize = 0
    }

    // Assign files to partitions using "Next Fit Decreasing"
    splitFiles.foreach { file =>
      if (currentSize + file.length > maxSplitBytes) {
        closePartition()
      }
      // Add the given file to the current partition.
      currentSize += file.length + openCostInBytes
      currentFiles += file
    }
    closePartition()

    new FileScanRDD(fsRelation.sparkSession, readFile, partitions)
  }

Several parameters and intermediate values in this method matter:

defaultMaxSplitBytes: the maximum size of a single split, 128 MB by default, controlled by "spark.sql.files.maxPartitionBytes".

openCostInBytes: the estimated cost of opening a file, 4 MB by default, controlled by "spark.sql.files.openCostInBytes".

defaultParallelism: the default parallelism (if not set explicitly, this is generally the total number of cores available to the application).

totalBytes: the total size of all parquet files, computed as selectedPartitions.flatMap(_.files.map(_.getLen + openCostInBytes)).sum; note that every file is padded with openCostInBytes before summing.

bytesPerCore: the amount of data each core would handle, totalBytes / defaultParallelism.

maxSplitBytes: Math.min(defaultMaxSplitBytes, Math.max(openCostInBytes, bytesPerCore)); when processing a large volume of data this usually comes out to 128 MB.
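
A quick worked example with made-up numbers shows how the formula behaves when the input is dominated by small files (the figures below are illustrative, not taken from the original run):

  // Standalone sketch of the maxSplitBytes formula with hypothetical inputs:
  // 1,000 parquet files of 1 MB each and defaultParallelism = 200.
  object MaxSplitBytesExample {
    def main(args: Array[String]): Unit = {
      val mb = 1024L * 1024L
      val defaultMaxSplitBytes = 128 * mb        // spark.sql.files.maxPartitionBytes
      val openCostInBytes      = 4 * mb          // spark.sql.files.openCostInBytes
      val defaultParallelism   = 200L
      val fileSizes            = Seq.fill(1000)(1 * mb)

      val totalBytes   = fileSizes.map(_ + openCostInBytes).sum  // 1000 * 5 MB = 5000 MB
      val bytesPerCore = totalBytes / defaultParallelism         // 25 MB
      val maxSplitBytes =
        Math.min(defaultMaxSplitBytes, Math.max(openCostInBytes, bytesPerCore))
      println(s"maxSplitBytes = ${maxSplitBytes / mb} MB")       // 25 MB, not 128 MB
    }
  }

So with many small files and a high parallelism, maxSplitBytes can come out well below 128 MB; the 128 MB figure above is the cap you hit once the data volume per core is large enough.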

val splitFiles = selectedPartitions.flatMap { ... }:
splitFiles is an Array of logical file slices (offset/length indices into the files, not physical copies).
As the excerpt below shows, every splittable file is cut into slices no larger than maxSplitBytes (typically 128 MB), and the slices are then sorted by length in descending order; a worked slice example follows the excerpt.

  (0L until file.getLen by maxSplitBytes).map { offset =>
    val remaining = file.getLen - offset
    val size = if (remaining > maxSplitBytes) maxSplitBytes else remaining
    val hosts = getBlockHosts(blockLocations, offset, size)
    PartitionedFile(
      partition.values, file.getPath.toUri.toString, offset, size, hosts)
  }
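
For example, a single splittable 300 MB file with maxSplitBytes = 128 MB is cut into three slices of 128 MB, 128 MB and 44 MB. A minimal sketch of just the offset/size arithmetic, with PartitionedFile replaced by a plain (offset, size) pair (the 300 MB figure is hypothetical):

  // Minimal re-implementation of the slicing loop above, with PartitionedFile
  // replaced by a plain (offset, size) tuple. Hypothetical input: one 300 MB file.
  object SliceExample {
    def main(args: Array[String]): Unit = {
      val mb = 1024L * 1024L
      val fileLen       = 300 * mb
      val maxSplitBytes = 128 * mb

      val slices = (0L until fileLen by maxSplitBytes).map { offset =>
        val remaining = fileLen - offset
        val size = if (remaining > maxSplitBytes) maxSplitBytes else remaining
        (offset, size)
      }
      slices.foreach { case (o, s) => println(s"offset=${o / mb}MB size=${s / mb}MB") }
      // offset=0MB size=128MB
      // offset=128MB size=128MB
      // offset=256MB size=44MB
    }
  }

A non-splittable file takes the else branch instead and becomes a single slice covering the whole file, no matter how large it is.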

Then the slices are merged: slices are packed into one scan partition until adding the next slice would push the accumulated size past maxSplitBytes, at which point the current partition is closed and a new one is started. A standalone simulation of this loop follows the snippet.

    // Assign files to partitions using "Next Fit Decreasing"
    splitFiles.foreach { file =>
      if (currentSize + file.length > maxSplitBytes) {
        closePartition()
      }
      // Add the given file to the current partition.
      currentSize += file.length + openCostInBytes
      currentFiles += file
    }
    /** Close the current partition and move to the next. */
    def closePartition(): Unit = {
      if (currentFiles.nonEmpty) {
        val newPartition =
          FilePartition(
            partitions.size,
            currentFiles.toArray.toSeq) // Copy to a new Array.
        partitions += newPartition
      }
      currentFiles.clear()
      currentSize = 0
    }
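
Here is that standalone simulation. It only reproduces the packing arithmetic, with PartitionedFile replaced by plain Long slice lengths; the inputs are hypothetical (200 slices of 1 MB each, maxSplitBytes = 25 MB as in the worked example above, openCostInBytes = 4 MB):

  import scala.collection.mutable.ArrayBuffer

  // Standalone simulation of the "Next Fit Decreasing" packing above, using plain
  // Long slice lengths. Hypothetical input: 200 slices of 1 MB, maxSplitBytes = 25 MB,
  // openCostInBytes = 4 MB.
  object PackingExample {
    def main(args: Array[String]): Unit = {
      val mb = 1024L * 1024L
      val maxSplitBytes   = 25 * mb
      val openCostInBytes = 4 * mb
      val splitFiles      = Seq.fill(200)(1 * mb).sorted(Ordering[Long].reverse)

      val partitions   = ArrayBuffer.empty[Seq[Long]]
      val currentFiles = ArrayBuffer.empty[Long]
      var currentSize  = 0L

      def closePartition(): Unit = {
        if (currentFiles.nonEmpty) partitions += currentFiles.toList
        currentFiles.clear()
        currentSize = 0L
      }

      splitFiles.foreach { len =>
        if (currentSize + len > maxSplitBytes) closePartition()
        currentSize += len + openCostInBytes   // every slice also pays the open cost
        currentFiles += len
      }
      closePartition()

      println(partitions.size)   // 40 partitions of 5 small files each, not 200 tasks
    }
  }

In other words, the scan side coalesces many small files into roughly maxSplitBytes-sized tasks on its own, and openCostInBytes keeps a single task from accumulating an excessive number of tiny files.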

Finally the method returns new FileScanRDD(fsRelation.sparkSession, readFile, partitions), where partitions is the list of merged FilePartitions built above.

At this point the splitting and merging of the input files is complete.

So SparkSQL does not require you to worry about whether the input side contains small files, and you can put your effort into the business logic instead.
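
If you do want to influence how aggressively the scan coalesces its input, the two session configs discussed above are the knobs. A minimal sketch, with illustrative values and a hypothetical table path (not recommendations):

  import org.apache.spark.sql.SparkSession

  // Minimal sketch: tuning the two configs behind the logic above and checking
  // how many scan partitions result. Values and the path are illustrative only.
  object TuneInputCoalescing {
    def main(args: Array[String]): Unit = {
      val spark = SparkSession.builder()
        .appName("input-coalescing-sketch")
        .master("local[*]")
        // Upper bound on a single split / scan partition (default 128 MB).
        .config("spark.sql.files.maxPartitionBytes", 256L * 1024 * 1024)
        // Estimated cost of opening one file (default 4 MB); raising it packs
        // fewer, larger partitions when the input has many small files.
        .config("spark.sql.files.openCostInBytes", 8L * 1024 * 1024)
        .getOrCreate()

      val df = spark.read.parquet("/path/to/warehouse/some_table")  // hypothetical path
      println(df.rdd.getNumPartitions)  // reflects the FilePartitions built as above
      spark.stop()
    }
  }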
