Reading the Source: How Spark Determines the Number of RDD Partitions

Original

Quan.S

2020-07-03 13:46


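This is probably a question nobody cares much about, yet plenty of people are still unclear on it. I am using it as an excuse to start reading the Spark source code.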

Determining the RDD partitions

Open the source of DataSourceScanExec and you will find two methods that produce the RDD:

createBucketedReadRDD
createNonBucketedReadRDD

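This split has a lot to do with Hive's bucket mechanism. Bucketing groups rows with the same key value together, but unlike partitioning it does not create a separate directory for every key value. There is a good answer on Stack Overflow about the difference between partition and bucket in Hive.

OK, let's focus on the common case: how partitions are formed when the RDD is created through the non-bucketed path. The code is short, so I paste it here with comments inline: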

private def createNonBucketedReadRDD(
    readFile: (PartitionedFile) => Iterator[InternalRow],
    selectedPartitions: Array[PartitionDirectory],
    fsRelation: HadoopFsRelation): RDD[InternalRow] = {
    
    // openCostInBytes is an empirical setting, 4 MB by default. It expresses the cost of
    // opening a file as the number of bytes that could be scanned in the same amount of time.
    val openCostInBytes = fsRelation.sparkSession.sessionState.conf.filesOpenCostInBytes

    // maxSplitBytes: the name alone says it matters; it is covered in detail below.
    val maxSplitBytes =
        FilePartition.maxSplitBytes(fsRelation.sparkSession, selectedPartitions)
    logInfo(s"Planning scan with bin packing, max size: $maxSplitBytes bytes, " +
            s"open cost is considered as scanning $openCostInBytes bytes.")

    // selectedPartitions corresponds to the partitions on HDFS,
    // i.e. the partition directories under the table.
    val splitFiles = selectedPartitions.flatMap { partition =>
        partition.files.flatMap { file =>
            // getPath() is very expensive so we only want to call it once in this block:
            val filePath = file.getPath
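            // `relation` is a field of the enclosing FileSourceScanExec.
            val isSplitable = relation.fileFormat.isSplitable(
                relation.sparkSession, relation.options, filePath)

            // This is where a single file gets cut into pieces; the logic lives in
            // PartitionedFileUtil.splitFiles and deserves a separate look.
            PartitionedFileUtil.splitFiles(
                sparkSession = relation.sparkSession,
                file = file,
                filePath = filePath,
                isSplitable = isSplitable,
                maxSplitBytes = maxSplitBytes,
                partitionValues = partition.values
            )
        ....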
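
The per-file cutting step delegated to PartitionedFileUtil.splitFiles can be sketched on plain numbers. This is a simplified sketch, not Spark's code: `Piece` and `splitOne` are hypothetical names, block locations and partition values are left out, and `maxSplitBytes` is simply passed in (how Spark computes it is the topic of the next section). The behaviour it illustrates is that a splittable file is cut into consecutive pieces of at most maxSplitBytes, while a non-splittable file stays in one piece.

```scala
object SplitFilesSketch {
  // Hypothetical stand-in for Spark's PartitionedFile: only an offset and a length.
  final case class Piece(offset: Long, length: Long)

  // Cut one file into pieces of at most maxSplitBytes when it is splittable;
  // otherwise return the whole file as a single piece.
  def splitOne(fileLen: Long, isSplitable: Boolean, maxSplitBytes: Long): Seq[Piece] =
    if (isSplitable) {
      (0L until fileLen by maxSplitBytes).map { offset =>
        val remaining = fileLen - offset
        Piece(offset, math.min(remaining, maxSplitBytes))
      }
    } else {
      Seq(Piece(0L, fileLen))
    }
}
```

With sizes in MB, `SplitFilesSketch.splitOne(200L, isSplitable = true, maxSplitBytes = 128L)` yields two pieces of 128 and 72, whereas the same file with `isSplitable = false` stays a single 200 MB piece.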

Determining the maximum bytes per partition

The code comes from FilePartition::maxSplitBytes:

def maxSplitBytes(
    sparkSession: SparkSession,
    selectedPartitions: Seq[PartitionDirectory]): Long = {
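    // Configured maximum bytes per partition, 128 MB by default
    val defaultMaxSplitBytes = sparkSession.sessionState.conf.filesMaxPartitionBytes

    // The empirical value already mentioned above
    val openCostInBytes = sparkSession.sessionState.conf.filesOpenCostInBytes

    // Default parallelism; in principle it equals the total number of cores in the cluster
    val defaultParallelism = sparkSession.sparkContext.defaultParallelism

    // Total bytes of all files in the selected directories, with each file's open cost added
    val totalBytes = selectedPartitions.flatMap(_.files.map(_.getLen + openCostInBytes)).sum

    // Converted to the number of bytes each core has to process
    val bytesPerCore = totalBytes / defaultParallelism

    // Here is the key part, read carefully:
    // 1. Take the max of openCostInBytes and bytesPerCore, so the floor is the open cost of a
    //    single file: that cost always lands on one core and cannot be shared out.
    // 2. Take the min of defaultMaxSplitBytes and the result of step 1.
    Math.min(defaultMaxSplitBytes, Math.max(openCostInBytes, bytesPerCore))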
}

With every setting left at its default (open cost 4 MB) and 4 cores, here are some examples of how maxSplitBytes comes out (sizes in MB):

Files	File size	Formula	maxSplitBytes
10	20	min(128, max(10*(20+4) /4, 4))	60
10	47.1	min(128, max(10*(47.1+4) /4, 4))	127.75
10	47.2	min(128, max(10*(47.2+4) /4, 4))	128
10	200	min(128, max(10*(200+4) /4, 4))	128

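OK, the four examples above should be enough to make the point.

When total data + open cost is less than cores * defaultMaxSplitBytes, maxSplitBytes = (total data + open cost) / cores
When total data + open cost is greater than cores * defaultMaxSplitBytes, maxSplitBytes = defaultMaxSplitBytes

With all the inputs ready, let's see how the partitions are formed.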
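
The formula can be checked without a cluster. The following is a minimal sketch that mirrors the arithmetic of FilePartition.maxSplitBytes on plain numbers; the object name, the `fileSizes` parameter and the default arguments are assumptions made for illustration (the defaults are the 128 MB and 4 MB values quoted above), and none of it calls into Spark.

```scala
object MaxSplitBytesSketch {
  val MB: Long = 1024L * 1024L

  // Same arithmetic as the Spark method, but fed with plain file sizes in bytes.
  def maxSplitBytes(
      fileSizes: Seq[Long],
      defaultParallelism: Int,
      defaultMaxSplitBytes: Long = 128 * MB,
      openCostInBytes: Long = 4 * MB): Long = {
    // Every file contributes its length plus the open cost.
    val totalBytes = fileSizes.map(_ + openCostInBytes).sum
    val bytesPerCore = totalBytes / defaultParallelism
    // Floor at the open cost, cap at the configured maximum.
    math.min(defaultMaxSplitBytes, math.max(openCostInBytes, bytesPerCore))
  }
}
```

For ten 20 MB files on 4 cores, `MaxSplitBytesSketch.maxSplitBytes(Seq.fill(10)(20 * MaxSplitBytesSketch.MB), 4)` gives 60 MB, and ten 200 MB files hit the 128 MB cap. A single 1 MB file shows the lower bound: the result is the 4 MB open cost.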

How files are assigned to partitions

The logic lives in the FilePartition class:

def getFilePartitions(
    sparkSession: SparkSession,
    partitionedFiles: Seq[PartitionedFile],
    maxSplitBytes: Long): Seq[FilePartition] = {
    val partitions = new ArrayBuffer[FilePartition]
    val currentFiles = new ArrayBuffer[PartitionedFile]
    var currentSize = 0L

    /** Close the current partition and move to the next. */
    def closePartition(): Unit = {
        if (currentFiles.nonEmpty) {
            // Copy to a new Array.
            val newPartition = FilePartition(partitions.size, currentFiles.toArray)
            partitions += newPartition
        }
        currentFiles.clear()
        currentSize = 0
    }

    val openCostInBytes = sparkSession.sessionState.conf.filesOpenCostInBytes
    // Assign files to partitions using "Next Fit Decreasing"
    // The core idea here is merging small files: several small files are packed
    // into one partition, up to maxSplitBytes.
    partitionedFiles.foreach { file =>
        if (currentSize + file.length > maxSplitBytes) {
            closePartition()
        }
        // Add the given file to the current partition.
        currentSize += file.length + openCostInBytes
        currentFiles += file
    }
    closePartition()
    partitions
}

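A small gripe: the code above is not especially tidy. Its core idea is to merge small files; a piece that is already close to maxSplitBytes simply becomes a partition by itself. Reading along, you might expect this method to split large files, but it does not. The splitting happens earlier, in PartitionedFileUtil.splitFiles, and only when the file format reports isSplitable as true. A non-splittable file (a gzip-compressed text file, for example) ends up in a single partition no matter how large it is.

Reinforcing the idea

How should we read these two slick steps? Broadly, the goal is to keep the CPUs busy: one task per small file would waste resources. Look back at the earlier examples and see how they are partitioned: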

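Files	File size	maxSplitBytes	Partition result
10	20	60	(20 + 4) + 20 = 44 <= 60, but (20 + 4) * 2 + 20 = 68 > 60: 2 files per partition, 5 partitions
10	47.1	127.75	(47.1 + 4) + 47.1 = 98.2 <= 127.75, but (47.1 + 4) * 2 + 47.1 = 149.3 > 127.75: 2 files per partition, 5 partitions
10	47.2	128	(47.2 + 4) + 47.2 = 98.4 <= 128, but (47.2 + 4) * 2 + 47.2 = 149.6 > 128: 2 files per partition, 5 partitions
10	200	128	200 > 128: a splittable file is cut into 128 + 72 and no two pieces fit together, 20 partitions; a non-splittable file gets one partition per file, 10 partitions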
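
The packing loop can be replayed on plain numbers to check how many partitions a given set of pieces produces. This is a sketch under stated assumptions, not Spark's code: `FilePackingSketch` and `pack` are hypothetical names, sizes are bare Longs in whatever unit you choose, and the descending sort is done inside the sketch because the "Next Fit Decreasing" comment implies the pieces arrive largest first.

```scala
import scala.collection.mutable.ArrayBuffer

object FilePackingSketch {
  // Returns the piece sizes grouped by partition.
  def pack(pieceSizes: Seq[Long], maxSplitBytes: Long, openCostInBytes: Long): Seq[Seq[Long]] = {
    val partitions = ArrayBuffer.empty[Seq[Long]]
    val current = ArrayBuffer.empty[Long]
    var currentSize = 0L

    // Close the current partition and start a new one.
    def closePartition(): Unit = {
      if (current.nonEmpty) partitions += current.toList
      current.clear()
      currentSize = 0L
    }

    // Assumption: largest pieces first, matching "Next Fit Decreasing".
    pieceSizes.sortBy(-_).foreach { size =>
      // The check uses the bare piece length; the open cost is only added afterwards.
      if (currentSize + size > maxSplitBytes) closePartition()
      currentSize += size + openCostInBytes
      current += size
    }
    closePartition()
    partitions.toList
  }
}
```

With sizes in MB, `FilePackingSketch.pack(Seq.fill(10)(20L), 60L, 4L)` returns 5 partitions of two files each. Feeding it the pieces of ten splittable 200 MB files, `Seq.fill(10)(128L) ++ Seq.fill(10)(72L)` with a limit of 128, returns 20 single-piece partitions.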

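Other notes

Seen this way, does defaultMaxSplitBytes look different now? We can read it as the upper bound on the bytes packed into one partition when small files are merged. But what is the bound for? With 100 files of 47.2 MB each, we get 50 partitions (two files per partition) rather than 4. So what?

My own view is that tasks run at different speeds because of data volume, data distribution and other factors, so cutting the data into exactly as many partitions as there are cores easily produces a long tail. That reasoning is not rock solid either; if you know the exact reason, please leave a comment.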
