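[Offline Computing - Spark|Hive] Handling Small Files on HDFS

Background

Too many small files on HDFS hurt Hadoop's scalability and stability, because the NameNode has to store and maintain metadata for every one of them.

A large number of small files also leads to poor query and analysis performance, because the query engine has to open, read, and close far too many files to execute a query.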

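Approaches to the Small-File Problem

The usual approach is to read the small files under a directory through the Spark API and merge them with Spark's repartition operator. The number of partitions passed to repartition is precomputed from the total size of the input files and the desired size of each output file.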

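The overall flow is: measure the total size of the input files, derive the partition count from the desired output file size, read the directory, repartition, and write the result.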
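The partition-count precomputation can be sketched as a ceiling division. This is a minimal illustration, not code from any library: the class name CompactionPlan, the method numPartitions, and the 128 MiB target are all assumptions made for the example.

```java
final class CompactionPlan {
    /**
     * Number of output partitions needed so that each output file is
     * roughly targetFileBytes: ceil(totalInputBytes / targetFileBytes),
     * and never less than 1.
     */
    static int numPartitions(long totalInputBytes, long targetFileBytes) {
        if (targetFileBytes <= 0) {
            throw new IllegalArgumentException("targetFileBytes must be positive");
        }
        if (totalInputBytes <= 0) {
            return 1;
        }
        // Integer ceiling division without going through floating point.
        long n = (totalInputBytes + targetFileBytes - 1) / targetFileBytes;
        return (int) Math.min(n, Integer.MAX_VALUE);
    }

    public static void main(String[] args) {
        long total = 1000L * 1024 * 1024;  // assumed: 1000 MiB of small files
        long target = 128L * 1024 * 1024;  // assumed: 128 MiB per output file
        System.out.println(numPartitions(total, target)); // prints 8
    }
}
```

In a Spark job the result would typically be passed to Dataset.repartition(int) before writing the data back out. The total input size could be obtained, for example, from Hadoop's FileSystem.getContentSummary(path).getLength().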

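This approach suits cases where a small-file problem has already been discovered and is cleaned up after the fact. The rest of this post describes how Hudi handles small files intelligently at write time.

Small-File Handling in Hudi

Hudi manages file sizes itself so that small files are not exposed to query engines; automatic file sizing plays a large part in this.

During insert/upsert operations, Hudi can keep file sizes at a configured target size.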

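Hudi's small-file handling flow: estimate the average record size from previous commits, find the small files in each partition, fill those files with incoming inserts first, and write whatever records remain into new files.

Every write follows this process, which ensures that a Hudi table does not accumulate small files.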

Core code:

Assigning inserts to files:

org.apache.hudi.table.action.commit.UpsertPartitioner#assignInserts

 // Get the partition paths
 Set<String> partitionPaths = profile.getPartitionPaths();

 // Get the average record size from records written in previous commits.
 // It is used to estimate how many records fit into one file.
 long averageRecordSize = averageBytesPerRecord(table.getMetaClient().getActiveTimeline().getCommitTimeline().filterCompletedInstants(),config);
 
 LOG.info("AvgRecordSize => " + averageRecordSize);

 // Get the small files under each partition path
 Map<String, List<SmallFile>> partitionSmallFilesMap =
        getSmallFilesForPartitions(new ArrayList<String>(partitionPaths), jsc);


for (String partitionPath : partitionPaths) {
    ...

    List<SmallFile> smallFiles = partitionSmallFilesMap.get(partitionPath);
    // Insert records that have not been assigned to a file yet
    long totalUnassignedInserts = pStat.getNumInserts();

    ...

    for (SmallFile smallFile : smallFiles) {
      // hoodie.parquet.max.file.size: maximum data file size; Hudi tries to keep files at this size.
      // Max file size minus the small file's size is the remaining capacity;
      // dividing by the average record size gives the number of records that can be appended.
      long recordsToAppend = Math.min((config.getParquetMaxFileSize() - smallFile.sizeBytes) / averageRecordSize, totalUnassignedInserts);

      // Assign records to the small file
      if (recordsToAppend > 0 && totalUnassignedInserts > 0) {
        // create a new bucket or re-use an existing bucket
        int bucket;
        if (updateLocationToBucket.containsKey(smallFile.location.getFileId())) {
          bucket = updateLocationToBucket.get(smallFile.location.getFileId());
          LOG.info("Assigning " + recordsToAppend + " inserts to existing update bucket " + bucket);
        } else {
          bucket = addUpdateBucket(partitionPath, smallFile.location.getFileId());
          LOG.info("Assigning " + recordsToAppend + " inserts to new update bucket " + bucket);
        }
        bucketNumbers.add(bucket);
        recordsPerBucket.add(recordsToAppend);
        // Subtract the records that have just been assigned
        totalUnassignedInserts -= recordsToAppend;
      }
    }

    // If records remain after all small files have been filled, create new insert buckets
    if (totalUnassignedInserts > 0) {
      // hoodie.copyonwrite.insert.split.size: number of records per insert bucket
      long insertRecordsPerBucket = config.getCopyOnWriteInsertSplitSize();
      // Whether to derive the number of records per bucket automatically
      if (config.shouldAutoTuneInsertSplits()) {
        insertRecordsPerBucket = config.getParquetMaxFileSize() / averageRecordSize;
      }

      // Number of new buckets to create
      int insertBuckets = (int) Math.ceil((1.0 * totalUnassignedInserts) / insertRecordsPerBucket);

      ...

      for (int b = 0; b < insertBuckets; b++) {
        bucketNumbers.add(totalBuckets);
        if (b == insertBuckets - 1) {
          // The last bucket takes whatever records remain
          recordsPerBucket.add(totalUnassignedInserts - (insertBuckets - 1) * insertRecordsPerBucket);
        } else {
          recordsPerBucket.add(insertRecordsPerBucket);
        }
        BucketInfo bucketInfo = new BucketInfo();
        bucketInfo.bucketType = BucketType.INSERT;
        bucketInfo.partitionPath = partitionPath;
        bucketInfo.fileIdPrefix = FSUtils.createNewFileIdPfx();
        bucketInfoMap.put(totalBuckets, bucketInfo);
        totalBuckets++;
      }
    }
}
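To make the arithmetic concrete, here is a small standalone sketch of the same assignment logic applied to plain numbers. It is not Hudi code: the class InsertAssignmentSketch and its assign method are hypothetical, the file sizes and record counts are invented, and it assumes the auto-tune path, where the records per new bucket are the max file size divided by the average record size.

```java
import java.util.ArrayList;
import java.util.List;

final class InsertAssignmentSketch {
    /**
     * Returns the number of records assigned to each bucket: first the
     * top-ups for existing small files (in input order), then new buckets.
     */
    static List<Long> assign(long[] smallFileSizes, long maxFileSize,
                             long avgRecordSize, long totalInserts) {
        List<Long> recordsPerBucket = new ArrayList<>();
        long unassigned = totalInserts;

        // Step 1: fill the remaining capacity of each small file.
        for (long size : smallFileSizes) {
            long recordsToAppend = Math.min((maxFileSize - size) / avgRecordSize, unassigned);
            if (recordsToAppend > 0 && unassigned > 0) {
                recordsPerBucket.add(recordsToAppend);
                unassigned -= recordsToAppend;
            }
        }

        // Step 2: spread whatever is left over new, full-sized buckets.
        if (unassigned > 0) {
            long perBucket = maxFileSize / avgRecordSize; // assumed auto-tuned split size
            int newBuckets = (int) Math.ceil((1.0 * unassigned) / perBucket);
            for (int b = 0; b < newBuckets; b++) {
                if (b == newBuckets - 1) {
                    // The last bucket takes the remainder.
                    recordsPerBucket.add(unassigned - (newBuckets - 1) * perBucket);
                } else {
                    recordsPerBucket.add(perBucket);
                }
            }
        }
        return recordsPerBucket;
    }

    public static void main(String[] args) {
        long mib = 1024L * 1024;
        // Assumed inputs: two small files of 40 MiB and 90 MiB, a 120 MiB
        // max file size, 1 KiB average records, 300,000 incoming inserts.
        System.out.println(assign(new long[] {40 * mib, 90 * mib}, 120 * mib, 1024, 300_000));
        // prints [81920, 30720, 122880, 64480]
    }
}
```

With these numbers the two small files absorb 81,920 and 30,720 records, and the remaining 187,360 records are split across two new buckets of 122,880 and 64,480 records.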

Finding the small files under each partition path:
org.apache.hudi.table.action.commit.UpsertPartitioner#getSmallFiles

 if (!commitTimeline.empty()) { // if we have some commits
      HoodieInstant latestCommitTime = commitTimeline.lastInstant().get();
      List<HoodieBaseFile> allFiles = table.getBaseFileOnlyView()
          .getLatestBaseFilesBeforeOrOn(partitionPath, latestCommitTime.getTimestamp()).collect(Collectors.toList());

      for (HoodieBaseFile file : allFiles) {

        // A file smaller than hoodie.parquet.small.file.limit counts as a small file
        if (file.getFileSize() < config.getParquetSmallFileLimit()) {
          String filename = file.getFileName();
          SmallFile sf = new SmallFile();
          sf.location = new HoodieRecordLocation(FSUtils.getCommitTime(filename), FSUtils.getFileId(filename));
          sf.sizeBytes = file.getFileSize();
          smallFileLocations.add(sf);
        }
      }
    }

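UpsertPartitioner extends Spark's Partitioner. When writing, Hudi uses Spark's custom-partitioning mechanism to control how records are distributed across files, so the small-file problem is corrected continuously as data is written.

Key configuration options involved:

hoodie.parquet.max.file.size: maximum size of a data file; Hudi tries to keep file sizes at this value.
hoodie.parquet.small.file.limit: any file smaller than this size is treated as a small file.
hoodie.copyonwrite.insert.split.size: number of records inserted into a single file; this value should match the number of records in one file (it can be derived from the maximum file size and the size of each record).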

How are these parameters set when writing to Hudi?

Set them through .option in the code that writes to Hudi, for example:

.option("hoodie.parquet.max.file.size", 120 * 1024 * 1024)
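As a sketch of how the three settings relate, the helper below collects them into a map keyed by the configuration names listed above. The class HudiFileSizingOptions and its build method are hypothetical, and the sizes used are illustrative assumptions rather than recommendations; the split size is derived as max file size divided by an assumed average record size.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class HudiFileSizingOptions {
    /**
     * Builds the file-sizing options as strings.
     * avgRecordBytes is an assumed estimate supplied by the caller.
     */
    static Map<String, String> build(long maxFileBytes, long smallFileLimitBytes, long avgRecordBytes) {
        if (avgRecordBytes <= 0) {
            throw new IllegalArgumentException("avgRecordBytes must be positive");
        }
        Map<String, String> opts = new LinkedHashMap<>();
        opts.put("hoodie.parquet.max.file.size", String.valueOf(maxFileBytes));
        opts.put("hoodie.parquet.small.file.limit", String.valueOf(smallFileLimitBytes));
        // Records per file = max file size / average record size.
        opts.put("hoodie.copyonwrite.insert.split.size", String.valueOf(maxFileBytes / avgRecordBytes));
        return opts;
    }

    public static void main(String[] args) {
        long mib = 1024L * 1024;
        // Assumed values: 120 MiB max file, 100 MiB small-file limit, 1 KiB records.
        build(120 * mib, 100 * mib, 1024).forEach((k, v) -> System.out.println(k + "=" + v));
    }
}
```

The resulting map could be handed to Spark's DataFrameWriter.options(Map) alongside the other options a Hudi write needs (table name, record key, and so on), which are omitted here. When auto-tuning of insert splits is enabled, the code shown earlier derives the split size itself instead of using the configured value.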

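Summary

This post covered the usual approach to handling small files and, based on reading the source code and related material, how Hudi handles the small-file problem intelligently at write time. Hudi uses Spark's custom-partitioning mechanism to control how records are distributed across files, which has the effect of merging small files as data is written.

References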

https://www.cnblogs.com/leesf456/p/14642991.html
