SparkSQL Optimization: Do Small Input Files Need to Be Merged?

Note: Spark version 2.3.1
When tuning HiveSQL, input splits have to be merged by turning on the relevant parameters; otherwise a huge number of splits is produced.
So how does SparkSQL cope with a large number of small input files?
This article uses a Hive table as the example (many small parquet files, which are splittable).
First, we debug into org.apache.spark.sql.execution.FileSourceScanExec.
There is a pattern match here on relation.bucketSpec; our table is not bucketed, so the default case is taken and createNonBucketedReadRDD is called.
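
For context, a rough sketch of that dispatch inside FileSourceScanExec.inputRDD, paraphrased from the Spark 2.3.x source (treat the exact wording as an approximation):

  // Paraphrased from FileSourceScanExec.inputRDD in Spark 2.3.x: bucketed tables go
  // through createBucketedReadRDD, everything else falls through to
  // createNonBucketedReadRDD, which is the path taken in this example.
  relation.bucketSpec match {
    case Some(bucketing) if relation.sparkSession.sessionState.conf.bucketingEnabled =>
      createBucketedReadRDD(bucketing, readFile, selectedPartitions, relation)
    case _ =>
      createNonBucketedReadRDD(readFile, selectedPartitions, relation)
  }
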
The code of createNonBucketedReadRDD is as follows:

  private def createNonBucketedReadRDD(
      readFile: (PartitionedFile) => Iterator[InternalRow],
      selectedPartitions: Seq[PartitionDirectory],
      fsRelation: HadoopFsRelation): RDD[InternalRow] = {
    val defaultMaxSplitBytes =
      fsRelation.sparkSession.sessionState.conf.filesMaxPartitionBytes
    val openCostInBytes = fsRelation.sparkSession.sessionState.conf.filesOpenCostInBytes
    val defaultParallelism = fsRelation.sparkSession.sparkContext.defaultParallelism
    val totalBytes = selectedPartitions.flatMap(_.files.map(_.getLen + openCostInBytes)).sum
    val bytesPerCore = totalBytes / defaultParallelism

    val maxSplitBytes = Math.min(defaultMaxSplitBytes, Math.max(openCostInBytes, bytesPerCore))

    val splitFiles = selectedPartitions.flatMap { partition =>
      partition.files.flatMap { file =>
        val blockLocations = getBlockLocations(file)
        if (fsRelation.fileFormat.isSplitable(
            fsRelation.sparkSession, fsRelation.options, file.getPath)) {
          (0L until file.getLen by maxSplitBytes).map { offset =>
            val remaining = file.getLen - offset
            val size = if (remaining > maxSplitBytes) maxSplitBytes else remaining
            val hosts = getBlockHosts(blockLocations, offset, size)
            PartitionedFile(
              partition.values, file.getPath.toUri.toString, offset, size, hosts)
          }
        } else {
          val hosts = getBlockHosts(blockLocations, 0, file.getLen)
          Seq(PartitionedFile(
            partition.values, file.getPath.toUri.toString, 0, file.getLen, hosts))
        }
      }
    }.toArray.sortBy(_.length)(implicitly[Ordering[Long]].reverse)

    val partitions = new ArrayBuffer[FilePartition]
    val currentFiles = new ArrayBuffer[PartitionedFile]
    var currentSize = 0L

    /** Close the current partition and move to the next. */
    def closePartition(): Unit = {
      if (currentFiles.nonEmpty) {
        val newPartition =
          FilePartition(
            partitions.size,
            currentFiles.toArray.toSeq) // Copy to a new Array.
        partitions += newPartition
      }
      currentFiles.clear()
      currentSize = 0
    }

    // Assign files to partitions using "Next Fit Decreasing"
    splitFiles.foreach { file =>
      if (currentSize + file.length > maxSplitBytes) {
        closePartition()
      }
      // Add the given file to the current partition.
      currentSize += file.length + openCostInBytes
      currentFiles += file
    }
    closePartition()

    new FileScanRDD(fsRelation.sparkSession, readFile, partitions)
  }

The method involves several key parameters and intermediate values:

defaultMaxSplitBytes: the maximum size of a single split, 128 MB by default, controlled by "spark.sql.files.maxPartitionBytes".

openCostInBytes: the estimated cost of opening a file, 4 MB by default, controlled by "spark.sql.files.openCostInBytes".

defaultParallelism: the default parallelism (if not set explicitly, it is generally the number of CPU cores available to the application).

totalBytes: the total size of all parquet files, computed as selectedPartitions.flatMap(_.files.map(_.getLen + openCostInBytes)).sum; note that openCostInBytes is added on top of every file's length.

bytesPerCore: the amount of data each core would handle (totalBytes / defaultParallelism).

maxSplitBytes: Math.min(defaultMaxSplitBytes, Math.max(openCostInBytes, bytesPerCore)); when processing a large amount of data this usually comes out to 128 MB. A quick worked example with these values follows.
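
A minimal sketch of how these values combine, using hypothetical numbers (10,000 parquet files of 1 MB each and defaultParallelism = 200):

  // Hypothetical inputs: 10,000 parquet files of 1 MB each, defaultParallelism = 200.
  val defaultMaxSplitBytes = 128L * 1024 * 1024                      // spark.sql.files.maxPartitionBytes (default 128 MB)
  val openCostInBytes      = 4L * 1024 * 1024                        // spark.sql.files.openCostInBytes (default 4 MB)
  val defaultParallelism   = 200
  val totalBytes    = 10000L * (1L * 1024 * 1024 + openCostInBytes)  // every file is padded by the open cost -> 50,000 MB
  val bytesPerCore  = totalBytes / defaultParallelism                // 50,000 MB / 200 = 250 MB
  val maxSplitBytes = Math.min(defaultMaxSplitBytes, Math.max(openCostInBytes, bytesPerCore))
  // = min(128 MB, max(4 MB, 250 MB)) = 128 MB, so each read partition targets roughly 128 MB of input
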

val splitFiles = selectedPartitions.flatMap { ... }:
splitFiles is an Array of PartitionedFile entries, i.e. the logical chunks produced by slicing each file.
As the snippet below shows, every splittable file is cut into chunks of at most maxSplitBytes (128 MB here); the chunks are then sorted by size in descending order (the "Decreasing" part of "Next Fit Decreasing"). A small numeric illustration follows the snippet.

  (0L until file.getLen by maxSplitBytes).map { offset =>
    val remaining = file.getLen - offset
    val size = if (remaining > maxSplitBytes) maxSplitBytes else remaining
    val hosts = getBlockHosts(blockLocations, offset, size)
    PartitionedFile(
      partition.values, file.getPath.toUri.toString, offset, size, hosts)
  }
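
For illustration, a stripped-down sketch (offsets and sizes only, no block locations) of what this loop yields for a single hypothetical 300 MB splittable file:

  // Hypothetical: one splittable 300 MB file, maxSplitBytes = 128 MB.
  val fileLen       = 300L * 1024 * 1024
  val maxSplitBytes = 128L * 1024 * 1024
  val chunks = (0L until fileLen by maxSplitBytes).map { offset =>
    val remaining = fileLen - offset
    val size = if (remaining > maxSplitBytes) maxSplitBytes else remaining
    (offset, size)                       // (start offset, chunk length)
  }
  // chunks: (0 MB, 128 MB), (128 MB, 128 MB), (256 MB, 44 MB)
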

Next, the chunks are packed into read partitions: chunks are appended to the current partition until adding the next one would push the accumulated size (which includes openCostInBytes per chunk) past maxSplitBytes (128 MB), at which point the current partition is closed and a new one is started. A standalone sketch of this packing follows the snippet.

    // Assign files to partitions using "Next Fit Decreasing"
    splitFiles.foreach { file =>
      if (currentSize + file.length > maxSplitBytes) {
        closePartition()
      }
      // Add the given file to the current partition.
      currentSize += file.length + openCostInBytes
      currentFiles += file
    }
    /** Close the current partition and move to the next. */
    def closePartition(): Unit = {
      if (currentFiles.nonEmpty) {
        val newPartition =
          FilePartition(
            partitions.size,
            currentFiles.toArray.toSeq) // Copy to a new Array.
        partitions += newPartition
      }
      currentFiles.clear()
      currentSize = 0
    }
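
A self-contained sketch of the same "Next Fit Decreasing" packing, using hypothetical chunk sizes in MB (already sorted in descending order) instead of PartitionedFile objects:

  import scala.collection.mutable.ArrayBuffer

  // Hypothetical chunk sizes in MB, already sorted in descending order.
  val maxSplitBytes = 128L
  val openCost      = 4L                       // stands in for openCostInBytes
  val chunkSizes    = Seq(128L, 128L, 44L, 30L, 20L, 10L, 5L, 1L)
  val partitions    = ArrayBuffer[Seq[Long]]()
  var current       = Vector.empty[Long]
  var currentSize   = 0L
  chunkSizes.foreach { size =>
    if (currentSize + size > maxSplitBytes && current.nonEmpty) {
      partitions += current; current = Vector.empty; currentSize = 0L
    }
    current :+= size
    currentSize += size + openCost             // every chunk also pays the open cost
  }
  if (current.nonEmpty) partitions += current
  // partitions: Seq(128), Seq(128), Seq(44, 30, 20, 10, 5), Seq(1)
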

Finally, new FileScanRDD(fsRelation.sparkSession, readFile, partitions) is returned.

At this point the splitting and coalescing of the input files is complete.

So SparkSQL does not need any extra care for small files on the input side, and we can focus on the query logic instead.
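
A minimal way to observe this behavior, assuming a hypothetical table db.small_file_table made up of many small parquet files:

  // Both settings are shown at their defaults; raise maxPartitionBytes for fewer,
  // larger read partitions, or lower it for more parallelism.
  spark.conf.set("spark.sql.files.maxPartitionBytes", 128L * 1024 * 1024)
  spark.conf.set("spark.sql.files.openCostInBytes", 4L * 1024 * 1024)

  val df = spark.table("db.small_file_table")   // hypothetical table with many small parquet files
  println(df.rdd.getNumPartitions)              // far fewer partitions than the number of files
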
