Preface
In Ozone, file object data is organized as Blocks, much like in HDFS. When Ozone physically stores a Block, however, it does so at a finer granularity, as chunk files. Put simply, a Block is divided into multiple chunks, and each chunk file corresponds to a relative offset within the Block. In this post I want to discuss the layout of Ozone chunks, i.e. how chunk files are stored on the Datanode. In the original implementation, the chunks of a Block are stored as separate files, which generates a large number of chunk files and is not very efficient. A recent community optimization allows a Block to be stored as a single chunk file. Below we walk through these two layout modes.
The original chunk file layout on Ozone Datanodes
Let's first look at the original layout of Ozone chunk files. How does the write path work?
1) A file is first divided into multiple Blocks according to the block size.
2) BlockOutputStream then splits the data by chunk size and writes it out as multiple chunk files.
This approach has several drawbacks:
- If the block size is large, many chunk files are generated.
- Every Block read or write involves multiple chunk files, which hurts I/O efficiency.
- The read/write path performs repeated file checks, which also costs time.
The community therefore implemented a one-chunk-file-per-Block layout to improve the efficiency of Ozone file reads and writes. The original chunk layout is named FILE_PER_CHUNK; the new one is FILE_PER_BLOCK.
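To make the file-count difference concrete, here is a small back-of-the-envelope sketch. The 256 MB block size and 4 MB chunk size are illustrative values chosen for this example, not figures from the article:

```java
public class ChunkLayoutMath {
    public static void main(String[] args) {
        long blockSize = 256L * 1024 * 1024;  // 256 MB block (illustrative)
        long chunkSize = 4L * 1024 * 1024;    // 4 MB chunk (illustrative)
        long fileSize  = 1024L * 1024 * 1024; // a 1 GB key

        // Round up: partial blocks/chunks still need a file.
        long blocks = (fileSize + blockSize - 1) / blockSize;
        long chunksPerBlock = (blockSize + chunkSize - 1) / chunkSize;

        // FILE_PER_CHUNK: one physical file per chunk.
        System.out.println("FILE_PER_CHUNK files: " + blocks * chunksPerBlock);
        // FILE_PER_BLOCK: one physical file per block; chunks become offsets.
        System.out.println("FILE_PER_BLOCK files: " + blocks);
    }
}
```

For a 1 GB key this works out to 256 chunk files under FILE_PER_CHUNK versus 4 block files under FILE_PER_BLOCK, which is the file-count explosion the drawbacks above describe.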
Ozone Datanode Chunk Layouts: FILE_PER_CHUNK and FILE_PER_BLOCK
These two chunk layout modes are implemented inside the Ozone Datanode by FilePerChunkStrategy and FilePerBlockStrategy, respectively.
The essential difference between the two policies is how the Datanode handles the ChunkBuffer data sent by BlockOutputStream:
FilePerChunkStrategy (the original behavior): write each chunk out as a new chunk file.
FilePerBlockStrategy: append the data at the current offset of the existing chunk file.
The process is easier to see in a diagram:
Chunk files 1, 2 and 3 on the FILE_PER_CHUNK side of the figure map to different segments within a single chunk file under FILE_PER_BLOCK; in FILE_PER_BLOCK mode, a Block has exactly one chunk file.
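The mapping in the figure is simple offset arithmetic; `chunkOffsetInBlockFile` is an invented helper for illustration, not part of Ozone's API:

```java
public class ChunkOffsets {
    // In FILE_PER_BLOCK, chunk i of a block (0-based) starts at
    // i * chunkSize inside the single block file.
    static long chunkOffsetInBlockFile(int chunkIndex, long chunkSize) {
        return (long) chunkIndex * chunkSize;
    }

    public static void main(String[] args) {
        long chunkSize = 4L * 1024 * 1024; // illustrative 4 MB chunk
        // chunk1, chunk2, chunk3 of the figure land at these offsets:
        for (int i = 0; i < 3; i++) {
            System.out.println("chunk" + (i + 1) + " -> offset "
                + chunkOffsetInBlockFile(i, chunkSize));
        }
    }
}
```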
Now let's look at the concrete implementation, mainly the writeChunk method.
FilePerChunkStrategy (FilePerChunkStrategy.java):
public void writeChunk(Container container, BlockID blockID, ChunkInfo info,
ChunkBuffer data, DispatcherContext dispatcherContext)
throws StorageContainerException {
checkLayoutVersion(container);
Preconditions.checkNotNull(dispatcherContext);
DispatcherContext.WriteChunkStage stage = dispatcherContext.getStage();
try {
KeyValueContainerData containerData = (KeyValueContainerData) container
.getContainerData();
HddsVolume volume = containerData.getVolume();
VolumeIOStats volumeIOStats = volume.getVolumeIOStats();
// 1) Get the path of this chunk file
File chunkFile = ChunkUtils.getChunkFile(containerData, info);
boolean isOverwrite = ChunkUtils.validateChunkForOverwrite(
chunkFile, info);
// 2) Get the temporary file path for this chunk
File tmpChunkFile = getTmpChunkFile(chunkFile, dispatcherContext);
if (LOG.isDebugEnabled()) {
LOG.debug(
"writing chunk:{} chunk stage:{} chunk file:{} tmp chunk file:{}",
info.getChunkName(), stage, chunkFile, tmpChunkFile);
}
long len = info.getLen();
// Ignore the offset in the chunk info: each chunk is a new standalone file
long offset = 0;
switch (stage) {
case WRITE_DATA:
if (isOverwrite) {
// if the actual chunk file already exists here while writing the temp
// chunk file, then it means the same ozone client request has
// generated two raft log entries. This can happen either because
// retryCache expired in Ratis (or log index mismatch/corruption in
// Ratis). This can be solved by two approaches as of now:
// 1. Read the complete data in the actual chunk file ,
// verify the data integrity and in case it mismatches , either
// 2. Delete the chunk File and write the chunk again. For now,
// let's rewrite the chunk file
// TODO: once the checksum support for write chunks gets plugged in,
// the checksum needs to be verified for the actual chunk file and
// the data to be written here which should be efficient and
// it matches we can safely return without rewriting.
LOG.warn("ChunkFile already exists {}. Deleting it.", chunkFile);
FileUtil.fullyDelete(chunkFile);
}
if (tmpChunkFile.exists()) {
// If the tmp chunk file already exists it means the raft log got
// appended, but later on the log entry got truncated in Ratis leaving
// behind garbage.
// TODO: once the checksum support for data chunks gets plugged in,
// instead of rewriting the chunk here, let's compare the checkSums
LOG.warn("tmpChunkFile already exists {}. Overwriting it.",
tmpChunkFile);
}
// 3) In the data-write stage, write the data into the temporary file
ChunkUtils.writeData(tmpChunkFile, data, offset, len, volumeIOStats,
doSyncWrite);
// No need to increment container stats here, as still data is not
// committed here.
break;
case COMMIT_DATA:
...
// 4) In the commit stage, rename the temporary file to the final chunk file
commitChunk(tmpChunkFile, chunkFile);
// Increment container stats here, as we commit the data.
containerData.updateWriteStats(len, isOverwrite);
break;
case COMBINED:
// directly write to the chunk file
ChunkUtils.writeData(chunkFile, data, offset, len, volumeIOStats,
doSyncWrite);
containerData.updateWriteStats(len, isOverwrite);
break;
default:
throw new IOException("Can not identify write operation.");
}
} catch (StorageContainerException ex) {
throw ex;
} catch (IOException ex) {
throw new StorageContainerException("Internal error: ", ex,
IO_EXCEPTION);
}
}
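The WRITE_DATA / COMMIT_DATA two-phase pattern above can be reduced to a stdlib-only sketch: write into a temporary sibling file, then atomically rename it into place on commit, so readers never observe a half-written chunk file. The file names and helper methods here are made up for illustration:

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

public class TmpFileCommit {
    // WRITE_DATA stage: write the bytes to a temporary sibling file.
    static Path writeData(Path chunkFile, byte[] data) throws IOException {
        Path tmp = chunkFile.resolveSibling(chunkFile.getFileName() + ".tmp");
        Files.write(tmp, data);
        return tmp;
    }

    // COMMIT_DATA stage: atomically promote the tmp file to the final name.
    static void commitData(Path tmp, Path chunkFile) throws IOException {
        Files.move(tmp, chunkFile, StandardCopyOption.ATOMIC_MOVE);
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("chunks");
        Path chunk = dir.resolve("example_chunk_1"); // hypothetical chunk name
        Path tmp = writeData(chunk, "hello".getBytes(StandardCharsets.UTF_8));
        commitData(tmp, chunk);
        System.out.println(Files.exists(chunk) && !Files.exists(tmp));
    }
}
```

The rename is what makes the commit safe against the Ratis retry/truncation scenarios the comments in the real code describe: a crash before commit leaves only a tmp file that can be detected and overwritten.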
Now let's look at the other chunk layout policy implementation:
FilePerBlockStrategy.java
@Override
public void writeChunk(Container container, BlockID blockID, ChunkInfo info,
ChunkBuffer data, DispatcherContext dispatcherContext)
throws StorageContainerException {
checkLayoutVersion(container);
Preconditions.checkNotNull(dispatcherContext);
DispatcherContext.WriteChunkStage stage = dispatcherContext.getStage();
...
KeyValueContainerData containerData = (KeyValueContainerData) container
.getContainerData();
// 1) Likewise, get the chunk file path
File chunkFile = getChunkFile(containerData, blockID);
boolean overwrite = validateChunkForOverwrite(chunkFile, info);
long len = info.getLen();
// 2) Get this chunk's offset within the Block, i.e. the write offset inside the block's chunk file
long offset = info.getOffset();
if (LOG.isDebugEnabled()) {
LOG.debug("Writing chunk {} (overwrite: {}) in stage {} to file {}",
info, overwrite, stage, chunkFile);
}
HddsVolume volume = containerData.getVolume();
VolumeIOStats volumeIOStats = volume.getVolumeIOStats();
// 3) Fetch the FileChannel for this chunk file from the open-file cache
FileChannel channel = files.getChannel(chunkFile, doSyncWrite);
// 4) Perform the write at the specified offset
ChunkUtils.writeData(channel, chunkFile.getName(), data, offset, len,
volumeIOStats);
containerData.updateWriteStats(len, overwrite);
}
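The append-at-offset behavior of FILE_PER_BLOCK can be sketched with a plain positional `FileChannel` write; `writeChunkAt` is an invented helper, not Ozone's `ChunkUtils.writeData`:

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class FilePerBlockWrite {
    // Write one chunk's bytes at its offset inside the single block file.
    static void writeChunkAt(FileChannel ch, long offset, byte[] chunk)
            throws IOException {
        ByteBuffer buf = ByteBuffer.wrap(chunk);
        while (buf.hasRemaining()) {
            offset += ch.write(buf, offset); // positional write, no seek needed
        }
    }

    public static void main(String[] args) throws IOException {
        Path block = Files.createTempFile("example", ".block");
        try (FileChannel ch = FileChannel.open(block, StandardOpenOption.WRITE)) {
            long chunkSize = 4; // tiny illustrative chunk size
            // Chunks 0 and 1 land at offsets 0 and 4 of the same file.
            writeChunkAt(ch, 0 * chunkSize, "AAAA".getBytes(StandardCharsets.UTF_8));
            writeChunkAt(ch, 1 * chunkSize, "BBBB".getBytes(StandardCharsets.UTF_8));
        }
        System.out.println(new String(Files.readAllBytes(block),
            StandardCharsets.UTF_8)); // AAAABBBB
    }
}
```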
Because a block file in FILE_PER_BLOCK mode may be written continuously over a period of time, a FileChannel cache is implemented here to avoid repeatedly closing and reopening the file within a short window.
private static final class OpenFiles {
private static final RemovalListener<String, OpenFile> ON_REMOVE =
event -> close(event.getKey(), event.getValue());
// Cache of OpenFile entries
private final Cache<String, OpenFile> files = CacheBuilder.newBuilder()
.expireAfterAccess(Duration.ofMinutes(10))
.removalListener(ON_REMOVE)
.build();
/**
* Get the FileChannel for a chunk file, opening the file if needed.
*/
public FileChannel getChannel(File file, boolean sync)
throws StorageContainerException {
try {
return files.get(file.getAbsolutePath(),
() -> open(file, sync)).getChannel();
} catch (ExecutionException e) {
if (e.getCause() instanceof IOException) {
throw new UncheckedIOException((IOException) e.getCause());
}
throw new StorageContainerException(e.getCause(),
ContainerProtos.Result.CONTAINER_INTERNAL_ERROR);
}
}
private static OpenFile open(File file, boolean sync) {
try {
return new OpenFile(file, sync);
} catch (FileNotFoundException e) {
throw new UncheckedIOException(e);
}
}
/**
* When an open file expires from the cache, the entry is evicted and the file is closed.
*/
public void close(File file) {
if (file != null) {
files.invalidate(file.getAbsolutePath());
}
}
...
}
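Guava's `CacheBuilder` above handles timed expiry and close-on-eviction. A stripped-down, stdlib-only analogue of the reuse-and-invalidate behavior (without timed expiry, and purely as a sketch of the idea) might look like this:

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;
import java.util.concurrent.ConcurrentHashMap;

public class OpenFileCache {
    private final ConcurrentHashMap<String, FileChannel> files =
        new ConcurrentHashMap<>();

    // Reuse an already-open channel for the same path instead of reopening.
    public FileChannel getChannel(Path file) {
        return files.computeIfAbsent(file.toAbsolutePath().toString(), p -> {
            try {
                return FileChannel.open(Paths.get(p),
                    StandardOpenOption.CREATE, StandardOpenOption.WRITE);
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
        });
    }

    // Explicit invalidation closes the channel (the Guava version also
    // does this from the removal listener on timed expiry).
    public void close(Path file) throws IOException {
        FileChannel ch = files.remove(file.toAbsolutePath().toString());
        if (ch != null) {
            ch.close();
        }
    }
}
```

Repeated `getChannel` calls for the same block file return the same open channel, which is exactly what spares FILE_PER_BLOCK the per-chunk open/close cost.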
According to the community's write benchmarks comparing these two modes, the new chunk layout is considerably more efficient than the original FILE_PER_CHUNK mode, and FILE_PER_BLOCK has already been made the default chunk layout. The related configuration is:
<property>
<name>ozone.scm.chunk.layout</name>
<value>FILE_PER_BLOCK</value>
<tag>OZONE, SCM, CONTAINER, PERFORMANCE</tag>
<description>
Chunk layout defines how chunks, blocks and containers are stored on disk.
Each chunk is stored separately with FILE_PER_CHUNK. All chunks of a
block are stored in the same file with FILE_PER_BLOCK. The default is
FILE_PER_BLOCK.
</description>
</property>
On-disk comparison of the old and new chunk layouts
I tested both chunk layout modes on a test cluster to see how chunks are actually stored. The results:
FILE_PER_BLOCK layout mode:
[hdfs@lyq containerDir0]$ cd 11/chunks/
[hdfs@lyq chunks]$ ll
total 16384
-rw-rw-r-- 1 hdfs hdfs 16777216 Mar 14 08:32 103822128652419072.block
FILE_PER_CHUNK layout mode:
[hdfs@lyq ~]$ ls -l /tmp/hadoop-hdfs/dfs/data/hdds/762187f8-3d8d-4c2c-8659-9ca66987c829/current/containerDir0/4/chunks/103363337595977729_chunk_1
-rw-r--r-- 1 hdfs hdfs 12 Dec 24 07:56 /tmp/hadoop-hdfs/dfs/data/hdds/762187f8-3d8d-4c2c-8659-9ca66987c829/current/containerDir0/4/chunks/103363337595977729_chunk_1
References
[1].https://issues.apache.org/jira/browse/HDDS-2717