Preface
In Ozone, file object data is organized as Blocks, much like in HDFS. When Ozone physically stores a Block, however, it does so at a finer granularity, as chunk files. Put simply, a Block is divided into multiple chunks, and each chunk file corresponds to a relative offset within the Block. In this post I want to discuss the layout of Ozone chunks, i.e. how chunk files are stored on the Datanode. In the original implementation, the chunks of a Block are stored as separate files, which generates a large number of chunk files and is not very efficient. A recent community optimization allows a Block to be stored as a single chunk file. Below we walk through these two layout modes.
The original chunk file layout on Ozone Datanodes
Let's first look at the original layout of Ozone chunk files. How does the write path work?
1) A file is first divided into multiple Blocks according to the block size.
2) BlockOutputStream then splits the data by chunk size and writes it out as multiple chunk files.
This approach has several drawbacks:
- If the block size is large, many chunk files are generated.
- Every Block read or write involves multiple chunk files, which hurts I/O efficiency.
- The read/write path performs repeated file checks, which also costs time.
The community therefore implemented a one-chunk-file-per-Block layout to improve the efficiency of Ozone file reads and writes. The original chunk layout is named FILE_PER_CHUNK; the new one is FILE_PER_BLOCK.
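To make the file-count difference concrete, here is a small back-of-the-envelope sketch. The 256 MB block size and 4 MB chunk size are illustrative values chosen for this example, not figures from the article:

```java
public class ChunkLayoutMath {
    public static void main(String[] args) {
        long blockSize = 256L * 1024 * 1024;  // 256 MB block (illustrative)
        long chunkSize = 4L * 1024 * 1024;    // 4 MB chunk (illustrative)
        long fileSize  = 1024L * 1024 * 1024; // a 1 GB key

        // Round up: partial blocks/chunks still need a file.
        long blocks = (fileSize + blockSize - 1) / blockSize;
        long chunksPerBlock = (blockSize + chunkSize - 1) / chunkSize;

        // FILE_PER_CHUNK: one physical file per chunk.
        System.out.println("FILE_PER_CHUNK files: " + blocks * chunksPerBlock);
        // FILE_PER_BLOCK: one physical file per block; chunks become offsets.
        System.out.println("FILE_PER_BLOCK files: " + blocks);
    }
}
```

For a 1 GB key this works out to 256 chunk files under FILE_PER_CHUNK versus 4 block files under FILE_PER_BLOCK, which is the file-count explosion the drawbacks above describe.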
Ozone Datanode Chunk Layouts: FILE_PER_CHUNK and FILE_PER_BLOCK
These two chunk layout modes are implemented inside the Ozone Datanode by FilePerChunkStrategy and FilePerBlockStrategy, respectively.
The essential difference between the two policies is how the Datanode handles the ChunkBuffer data sent by BlockOutputStream:
FilePerChunkStrategy (the original behavior): write each chunk out as a new chunk file.
FilePerBlockStrategy: append the data at the current offset of the existing chunk file.
The process is easier to see in a diagram:
Chunk files 1, 2 and 3 on the FILE_PER_CHUNK side of the figure map to different segments within a single chunk file under FILE_PER_BLOCK; in FILE_PER_BLOCK mode, a Block has exactly one chunk file.
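The mapping in the figure is simple offset arithmetic; `chunkOffsetInBlockFile` is an invented helper for illustration, not part of Ozone's API:

```java
public class ChunkOffsets {
    // In FILE_PER_BLOCK, chunk i of a block (0-based) starts at
    // i * chunkSize inside the single block file.
    static long chunkOffsetInBlockFile(int chunkIndex, long chunkSize) {
        return (long) chunkIndex * chunkSize;
    }

    public static void main(String[] args) {
        long chunkSize = 4L * 1024 * 1024; // illustrative 4 MB chunk
        // chunk1, chunk2, chunk3 of the figure land at these offsets:
        for (int i = 0; i < 3; i++) {
            System.out.println("chunk" + (i + 1) + " -> offset "
                + chunkOffsetInBlockFile(i, chunkSize));
        }
    }
}
```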
Now let's look at the concrete implementation, mainly the writeChunk method.
FilePerChunkStrategy (FilePerChunkStrategy.java):
public void writeChunk(Container container, BlockID blockID, ChunkInfo info,
ChunkBuffer data, DispatcherContext dispatcherContext)
throws StorageContainerException {
checkLayoutVersion(container);
Preconditions.checkNotNull(dispatcherContext);
DispatcherContext.WriteChunkStage stage = dispatcherContext.getStage();
try {
KeyValueContainerData containerData = (KeyValueContainerData) container
.getContainerData();
HddsVolume volume = containerData.getVolume();
VolumeIOStats volumeIOStats = volume.getVolumeIOStats();
// 1) Get the path of this chunk file
File chunkFile = ChunkUtils.getChunkFile(containerData, info);
boolean isOverwrite = ChunkUtils.validateChunkForOverwrite(
chunkFile, info);
// 2) Get the temporary file path for this chunk
File tmpChunkFile = getTmpChunkFile(chunkFile, dispatcherContext);
if (LOG.isDebugEnabled()) {
LOG.debug(
"writing chunk:{} chunk stage:{} chunk file:{} tmp chunk file:{}",
info.getChunkName(), stage, chunkFile, tmpChunkFile);
}
long len = info.getLen();
// Ignore the offset in the chunk info: each chunk is a new standalone file
long offset = 0;
switch (stage) {
case WRITE_DATA:
if (isOverwrite) {
// if the actual chunk file already exists here while writing the temp
// chunk file, then it means the same ozone client request has
// generated two raft log entries. This can happen either because
// retryCache expired in Ratis (or log index mismatch/corruption in
// Ratis). This can be solved by two approaches as of now:
// 1. Read the complete data in the actual chunk file ,
// verify the data integrity and in case it mismatches , either
// 2. Delete the chunk File and write the chunk again. For now,
// let's rewrite the chunk file
// TODO: once the checksum support for write chunks gets plugged in,
// the checksum needs to be verified for the actual chunk file and
// the data to be written here which should be efficient and
// it matches we can safely return without rewriting.
LOG.warn("ChunkFile already exists {}. Deleting it.", chunkFile);
FileUtil.fullyDelete(chunkFile);
}
if (tmpChunkFile.exists()) {
// If the tmp chunk file already exists it means the raft log got
// appended, but later on the log entry got truncated in Ratis leaving
// behind garbage.
// TODO: once the checksum support for data chunks gets plugged in,
// instead of rewriting the chunk here, let's compare the checkSums
LOG.warn("tmpChunkFile already exists {}. Overwriting it.",
tmpChunkFile);
}
// 3) In the data-write stage, write the data into the temporary file
ChunkUtils.writeData(tmpChunkFile, data, offset, len, volumeIOStats,
doSyncWrite);
// No need to increment container stats here, as still data is not
// committed here.
break;
case COMMIT_DATA:
...
// 4) In the commit stage, rename the temporary file to the final chunk file
commitChunk(tmpChunkFile, chunkFile);
// Increment container stats here, as we commit the data.
containerData.updateWriteStats(len, isOverwrite);
break;
case COMBINED:
// directly write to the chunk file
ChunkUtils.writeData(chunkFile, data, offset, len, volumeIOStats,
doSyncWrite);
containerData.updateWriteStats(len, isOverwrite);
break;
default:
throw new IOException("Can not identify write operation.");
}
} catch (StorageContainerException ex) {
throw ex;
} catch (IOException ex) {
throw new StorageContainerException("Internal error: ", ex,
IO_EXCEPTION);
}
}
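The WRITE_DATA / COMMIT_DATA two-phase pattern above can be reduced to a stdlib-only sketch: write into a temporary sibling file, then atomically rename it into place on commit, so readers never observe a half-written chunk file. The file names and helper methods here are made up for illustration:

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

public class TmpFileCommit {
    // WRITE_DATA stage: write the bytes to a temporary sibling file.
    static Path writeData(Path chunkFile, byte[] data) throws IOException {
        Path tmp = chunkFile.resolveSibling(chunkFile.getFileName() + ".tmp");
        Files.write(tmp, data);
        return tmp;
    }

    // COMMIT_DATA stage: atomically promote the tmp file to the final name.
    static void commitData(Path tmp, Path chunkFile) throws IOException {
        Files.move(tmp, chunkFile, StandardCopyOption.ATOMIC_MOVE);
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("chunks");
        Path chunk = dir.resolve("example_chunk_1"); // hypothetical chunk name
        Path tmp = writeData(chunk, "hello".getBytes(StandardCharsets.UTF_8));
        commitData(tmp, chunk);
        System.out.println(Files.exists(chunk) && !Files.exists(tmp));
    }
}
```

The rename is what makes the commit safe against the Ratis retry/truncation scenarios the comments in the real code describe: a crash before commit leaves only a tmp file that can be detected and overwritten.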
Now let's look at the other chunk layout policy implementation:
FilePerBlockStrategy.java
@Override
public void writeChunk(Container container, BlockID blockID, ChunkInfo info,
ChunkBuffer data, DispatcherContext dispatcherContext)
throws StorageContainerException {
checkLayoutVersion(container);
Preconditions.checkNotNull(dispatcherContext);
DispatcherContext.WriteChunkStage stage = dispatcherContext.getStage();
...
KeyValueContainerData containerData = (KeyValueContainerData) container
.getContainerData();
// 1) Likewise, get the chunk file path
File chunkFile = getChunkFile(containerData, blockID);
boolean overwrite = validateChunkForOverwrite(chunkFile, info);
long len = info.getLen();
// 2) Get this chunk's offset within the Block, i.e. the write offset inside the block's chunk file
long offset = info.getOffset();
if (LOG.isDebugEnabled()) {
LOG.debug("Writing chunk {} (overwrite: {}) in stage {} to file {}",
info, overwrite, stage, chunkFile);
}
HddsVolume volume = containerData.getVolume();
VolumeIOStats volumeIOStats = volume.getVolumeIOStats();
// 3) Fetch the FileChannel for this chunk file from the open-file cache
FileChannel channel = files.getChannel(chunkFile, doSyncWrite);
// 4) Perform the write at the specified offset
ChunkUtils.writeData(channel, chunkFile.getName(), data, offset, len,
volumeIOStats);
containerData.updateWriteStats(len, overwrite);
}
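The append-at-offset behavior of FILE_PER_BLOCK can be sketched with a plain positional `FileChannel` write; `writeChunkAt` is an invented helper, not Ozone's `ChunkUtils.writeData`:

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class FilePerBlockWrite {
    // Write one chunk's bytes at its offset inside the single block file.
    static void writeChunkAt(FileChannel ch, long offset, byte[] chunk)
            throws IOException {
        ByteBuffer buf = ByteBuffer.wrap(chunk);
        while (buf.hasRemaining()) {
            offset += ch.write(buf, offset); // positional write, no seek needed
        }
    }

    public static void main(String[] args) throws IOException {
        Path block = Files.createTempFile("example", ".block");
        try (FileChannel ch = FileChannel.open(block, StandardOpenOption.WRITE)) {
            long chunkSize = 4; // tiny illustrative chunk size
            // Chunks 0 and 1 land at offsets 0 and 4 of the same file.
            writeChunkAt(ch, 0 * chunkSize, "AAAA".getBytes(StandardCharsets.UTF_8));
            writeChunkAt(ch, 1 * chunkSize, "BBBB".getBytes(StandardCharsets.UTF_8));
        }
        System.out.println(new String(Files.readAllBytes(block),
            StandardCharsets.UTF_8)); // AAAABBBB
    }
}
```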
Because a block file in FILE_PER_BLOCK mode may be written continuously over a period of time, a FileChannel cache is implemented here to avoid repeatedly closing and reopening the file within a short window.
private static final class OpenFiles {
private static final RemovalListener<String, OpenFile> ON_REMOVE =
event -> close(event.getKey(), event.getValue());
// Cache of OpenFile entries
private final Cache<String, OpenFile> files = CacheBuilder.newBuilder()
.expireAfterAccess(Duration.ofMinutes(10))
.removalListener(ON_REMOVE)
.build();
/**
* Get the FileChannel for a chunk file, opening the file if needed.
*/
public FileChannel getChannel(File file, boolean sync)
throws StorageContainerException {
try {
return files.get(file.getAbsolutePath(),
() -> open(file, sync)).getChannel();
} catch (ExecutionException e) {
if (e.getCause() instanceof IOException) {
throw new UncheckedIOException((IOException) e.getCause());
}
throw new StorageContainerException(e.getCause(),
ContainerProtos.Result.CONTAINER_INTERNAL_ERROR);
}
}
private static OpenFile open(File file, boolean sync) {
try {
return new OpenFile(file, sync);
} catch (FileNotFoundException e) {
throw new UncheckedIOException(e);
}
}
/**
* When an open file expires from the cache, the entry is evicted and the file is closed.
*/
public void close(File file) {
if (file != null) {
files.invalidate(file.getAbsolutePath());
}
}
...
}
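Guava's `CacheBuilder` above handles timed expiry and close-on-eviction. A stripped-down, stdlib-only analogue of the reuse-and-invalidate behavior (without timed expiry, and purely as a sketch of the idea) might look like this:

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;
import java.util.concurrent.ConcurrentHashMap;

public class OpenFileCache {
    private final ConcurrentHashMap<String, FileChannel> files =
        new ConcurrentHashMap<>();

    // Reuse an already-open channel for the same path instead of reopening.
    public FileChannel getChannel(Path file) {
        return files.computeIfAbsent(file.toAbsolutePath().toString(), p -> {
            try {
                return FileChannel.open(Paths.get(p),
                    StandardOpenOption.CREATE, StandardOpenOption.WRITE);
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
        });
    }

    // Explicit invalidation closes the channel (the Guava version also
    // does this from the removal listener on timed expiry).
    public void close(Path file) throws IOException {
        FileChannel ch = files.remove(file.toAbsolutePath().toString());
        if (ch != null) {
            ch.close();
        }
    }
}
```

Repeated `getChannel` calls for the same block file return the same open channel, which is exactly what spares FILE_PER_BLOCK the per-chunk open/close cost.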
According to the community's write benchmarks comparing these two modes, the new chunk layout is considerably more efficient than the original FILE_PER_CHUNK mode, and FILE_PER_BLOCK has already been made the default chunk layout. The related configuration is:
<property>
<name>ozone.scm.chunk.layout</name>
<value>FILE_PER_BLOCK</value>
<tag>OZONE, SCM, CONTAINER, PERFORMANCE</tag>
<description>
Chunk layout defines how chunks, blocks and containers are stored on disk.
Each chunk is stored separately with FILE_PER_CHUNK. All chunks of a
block are stored in the same file with FILE_PER_BLOCK. The default is
FILE_PER_BLOCK.
</description>
</property>
On-disk comparison of the old and new chunk layouts
I tested both chunk layout modes on a test cluster to see how chunks are actually stored. The results:
FILE_PER_BLOCK layout mode:
[hdfs@lyq containerDir0]$ cd 11/chunks/
[hdfs@lyq chunks]$ ll
total 16384
-rw-rw-r-- 1 hdfs hdfs 16777216 Mar 14 08:32 103822128652419072.block
FILE_PER_CHUNK layout mode:
[hdfs@lyq ~]$ ls -l /tmp/hadoop-hdfs/dfs/data/hdds/762187f8-3d8d-4c2c-8659-9ca66987c829/current/containerDir0/4/chunks/103363337595977729_chunk_1
-rw-r--r-- 1 hdfs hdfs 12 Dec 24 07:56 /tmp/hadoop-hdfs/dfs/data/hdds/762187f8-3d8d-4c2c-8659-9ca66987c829/current/containerDir0/4/chunks/103363337595977729_chunk_1
References
[1].https://issues.apache.org/jira/browse/HDDS-2717