Hadoop Analysis, Part 2: Mechanisms of the Metadata Backup Scheme

Original

kntao

2020-02-21 11:21

1. Scenario analysis: loading metadata when the NameNode starts

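Inside the NameNode, FSNamesystem is invoked to read dfs.namenode.name.dir and dfs.namenode.edits.dir and to build the FSDirectory.
In the FSImage class, recoverTransitionRead checks the metadata, loads it, and merges it in memory, while saveNameSpace persists the metadata to storage.
saveNameSpace writes the metadata to disk in three steps: first it renames the current directory to lastcheckpoint.tmp; then it creates a new current directory and saves the files into it; finally it renames lastcheckpoint.tmp to previous.checkpoint.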
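The three-step directory rotation described above can be sketched with java.nio.file. This is only an illustration under stated assumptions, not Hadoop's code: the class StorageRotationSketch, the method rotateAndSave, and passing the image as a byte array are hypothetical. Only the names current, lastcheckpoint.tmp, and previous.checkpoint come from the text.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Comparator;
import java.util.stream.Stream;

// Hypothetical helper, not part of Hadoop.
final class StorageRotationSketch {

    static void rotateAndSave(Path storageDir, byte[] image) throws IOException {
        Path current = storageDir.resolve("current");
        Path tmp = storageDir.resolve("lastcheckpoint.tmp");
        Path previous = storageDir.resolve("previous.checkpoint");

        // Step 1: move the existing current directory out of the way.
        if (Files.exists(current)) {
            Files.move(current, tmp);
        }

        // Step 2: create a fresh current directory and write the new image.
        Files.createDirectories(current);
        Files.write(current.resolve("fsimage"), image);

        // Step 3: the old state becomes previous.checkpoint.
        if (Files.exists(tmp)) {
            deleteRecursively(previous);
            Files.move(tmp, previous);
        }
    }

    // Deletes a directory tree, children first; does nothing if it is absent.
    static void deleteRecursively(Path dir) throws IOException {
        if (!Files.exists(dir)) {
            return;
        }
        try (Stream<Path> walk = Files.walk(dir)) {
            walk.sorted(Comparator.reverseOrder()).forEach(p -> {
                try {
                    Files.delete(p);
                } catch (IOException e) {
                    throw new UncheckedIOException(e);
                }
            });
        }
    }
}
```

If the process dies between steps 1 and 3, lastcheckpoint.tmp is still on disk, so the old state is never lost; that leftover directory is what makes recovery possible.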
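The checkpoint process: the Secondary NameNode tells the NameNode to start a new edit log file, edits.new, and from then on all log operations are written to edits.new. Next, the Secondary NameNode downloads the fsimage and edits files from the NameNode and merges them into a new fsimage.ckpt, which it then uploads to the NameNode. Finally, the NameNode renames fsimage.ckpt to fsimage and edits.new to edits.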
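The checkpoint sequence above can be modeled in memory to make the ordering visible. This is a toy sketch, not Hadoop's implementation: the class CheckpointSketch and its methods are hypothetical, the namespace is reduced to a set of paths, and each edit is reduced to a single created path.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical in-memory model of the fsimage / edits / edits.new roll.
final class CheckpointSketch {
    Set<String> fsimage = new TreeSet<>();   // stands in for the fsimage file
    List<String> edits = new ArrayList<>();  // stands in for the edits file
    List<String> editsNew = null;            // stands in for edits.new, absent until a roll

    // Log an operation; while a checkpoint is in progress it goes to edits.new.
    void logMkdir(String path) {
        (editsNew != null ? editsNew : edits).add(path);
    }

    // Requested by the Secondary NameNode: start writing to edits.new.
    void rollEditLog() {
        if (editsNew == null) {
            editsNew = new ArrayList<>();
        }
    }

    // Done on the Secondary NameNode: fsimage + edits -> fsimage.ckpt.
    static Set<String> merge(Set<String> image, List<String> editLog) {
        Set<String> ckpt = new TreeSet<>(image);
        ckpt.addAll(editLog);
        return ckpt;
    }

    // After the upload: fsimage.ckpt becomes fsimage, edits.new becomes edits.
    void rollFsImage(Set<String> ckpt) {
        if (editsNew == null) {
            throw new IllegalStateException("no checkpoint in progress");
        }
        fsimage = ckpt;
        edits = editsNew;
        editsNew = null;
    }
}
```

In this model, operations logged after rollEditLog are not part of the merged image; they stay in the new edits and are picked up by the next checkpoint.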

2. Scenario analysis: metadata updates and log writes

Taking mkdir as an example:

Analysis of the logSync code:

Code:

public void logSync() throws IOException {
  ArrayList<EditLogOutputStream> errorStreams = null;
  long syncStart = 0;

  // Fetch the transactionId of this thread.
  long mytxid = myTransactionId.get().txid;
  EditLogOutputStream streams[] = null;
  boolean sync = false;
  try {
    synchronized (this) {
      assert editStreams.size() > 0 : "no editlog streams";
      printStatistics(false);
      // if somebody is already syncing, then wait
      while (mytxid > synctxid && isSyncRunning) {
        try {
          wait(1000);
        } catch (InterruptedException ie) {
        }
      }
      //
      // If this transaction was already flushed, then nothing to do
      //
      if (mytxid <= synctxid) {
        numTransactionsBatchedInSync++;
        if (metrics != null) // Metrics is non-null only when used inside name node
          metrics.transactionsBatchedInSync.inc();
        return;
      }
      // now, this thread will do the sync
      syncStart = txid;
      isSyncRunning = true;
      sync = true;
      // swap buffers
      for (EditLogOutputStream eStream : editStreams) {
        eStream.setReadyToFlush();
      }
      streams =
        editStreams.toArray(new EditLogOutputStream[editStreams.size()]);
    }
    // do the sync
    long start = FSNamesystem.now();
    for (int idx = 0; idx < streams.length; idx++) {
      EditLogOutputStream eStream = streams[idx];
      try {
        eStream.flush();
      } catch (IOException ie) {
        FSNamesystem.LOG.error("Unable to sync edit log.", ie);
        //
        // remember the streams that encountered an error.
        //
        if (errorStreams == null) {
          errorStreams = new ArrayList<EditLogOutputStream>(1);
        }
        errorStreams.add(eStream);
      }
    }
    long elapsed = FSNamesystem.now() - start;
    processIOError(errorStreams, true);
    if (metrics != null) // Metrics non-null only when used inside name node
      metrics.syncs.inc(elapsed);
  } finally {
    synchronized (this) {
      synctxid = syncStart;
      if (sync) {
        isSyncRunning = false;
      }
      this.notifyAll();
    }
  }
}
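The "swap buffers" step in logSync is what lets other threads keep logging while one thread performs the slow flush. The idea can be sketched as a double buffer. This is a minimal illustration and an assumption about what setReadyToFlush and flush do inside each stream, not Hadoop's code: the class DoubleBufferSketch and its fields are hypothetical.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;

// Hypothetical double-buffered log writer, not part of Hadoop.
final class DoubleBufferSketch {
    private ByteArrayOutputStream bufCurrent = new ByteArrayOutputStream(); // receives new records
    private ByteArrayOutputStream bufReady = new ByteArrayOutputStream();   // batch being flushed
    private final OutputStream sink;                                        // stands in for the edits file

    DoubleBufferSketch(OutputStream sink) {
        this.sink = sink;
    }

    // Fast path: append a record in memory while holding the lock.
    synchronized void write(byte[] record) {
        bufCurrent.write(record, 0, record.length);
    }

    // Swap the buffers while holding the lock; later writes land in the empty buffer.
    synchronized void setReadyToFlush() {
        if (bufReady.size() != 0) {
            throw new IllegalStateException("previous batch has not been flushed");
        }
        ByteArrayOutputStream tmp = bufReady;
        bufReady = bufCurrent;
        bufCurrent = tmp;
    }

    // Slow path: the actual I/O runs without the lock, so writers are not blocked.
    // Assumes a single flushing thread at a time, as isSyncRunning enforces in logSync.
    void flush() throws IOException {
        ByteArrayOutputStream ready;
        synchronized (this) {
            ready = bufReady;
        }
        ready.writeTo(sink);
        sink.flush();
        ready.reset();
    }
}
```

Records written after setReadyToFlush are not part of the batch being flushed; they go out with the next swap, which is why a thread whose transaction id is already covered by a finished sync can return without doing any I/O.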

3. Analysis of the Backup Node checkpoint process:

/**
 * Create a new checkpoint
 */
void doCheckpoint() throws IOException {
  long startTime = FSNamesystem.now();
  NamenodeCommand cmd =
    getNamenode().startCheckpoint(backupNode.getRegistration());
  CheckpointCommand cpCmd = null;
  switch (cmd.getAction()) {
    case NamenodeProtocol.ACT_SHUTDOWN:
      shutdown();
      throw new IOException("Name-node " + backupNode.nnRpcAddress
                            + " requested shutdown.");
    case NamenodeProtocol.ACT_CHECKPOINT:
      cpCmd = (CheckpointCommand) cmd;
      break;
    default:
      throw new IOException("Unsupported NamenodeCommand: " + cmd.getAction());
  }

  CheckpointSignature sig = cpCmd.getSignature();
  assert FSConstants.LAYOUT_VERSION == sig.getLayoutVersion() :
    "Signature should have current layout version. Expected: "
    + FSConstants.LAYOUT_VERSION + " actual " + sig.getLayoutVersion();
  assert !backupNode.isRole(NamenodeRole.CHECKPOINT) ||
    cpCmd.isImageObsolete() : "checkpoint node should always download image.";
  backupNode.setCheckpointState(CheckpointStates.UPLOAD_START);
  if (cpCmd.isImageObsolete()) {
    // First reset storage on disk and memory state
    backupNode.resetNamespace();
    downloadCheckpoint(sig);
  }

  BackupStorage bnImage = getFSImage();
  bnImage.loadCheckpoint(sig);
  sig.validateStorageInfo(bnImage);
  bnImage.saveCheckpoint();

  if (cpCmd.needToReturnImage())
    uploadCheckpoint(sig);

  getNamenode().endCheckpoint(backupNode.getRegistration(), sig);

  bnImage.convergeJournalSpool();
  backupNode.setRegistration(); // keep registration up to date
  if (backupNode.isRole(NamenodeRole.CHECKPOINT))
    getFSImage().getEditLog().close();
  LOG.info("Checkpoint completed in "
           + (FSNamesystem.now() - startTime) / 1000 + " seconds."
           + " New Image Size: " + bnImage.getFsImageName().length());
}

4. Metadata reliability mechanisms

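Multiple backup paths can be configured. When the NameNode updates the log or performs a checkpoint, it writes the metadata to several directories.
An output stream is created for every metadata file that has to be saved. Output streams that raise exceptions during access are removed, and at a suitable later time the removed streams are checked again to see whether they have recovered. This keeps a failure in one backup output stream from affecting the others.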
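The remove-then-recheck handling of failed streams can be sketched as follows. This is an illustration under assumptions, not Hadoop's processIOError: the class RedundantWriterSketch and its methods are hypothetical, and the health probe is simply a call to flush.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

// Hypothetical writer that fans out to several metadata streams; not part of Hadoop.
final class RedundantWriterSketch {
    private final List<OutputStream> active = new ArrayList<>();
    private final List<OutputStream> removed = new ArrayList<>();

    RedundantWriterSketch(List<OutputStream> streams) {
        active.addAll(streams);
    }

    // Write to every active stream; a stream that fails is set aside, not retried inline.
    void writeAll(byte[] record) throws IOException {
        Iterator<OutputStream> it = active.iterator();
        while (it.hasNext()) {
            OutputStream s = it.next();
            try {
                s.write(record);
                s.flush();
            } catch (IOException e) {
                it.remove();
                removed.add(s);
            }
        }
        if (active.isEmpty()) {
            throw new IOException("no usable metadata stream left");
        }
    }

    // Called at a convenient later time: probe removed streams and re-admit healthy ones.
    // A real system must also bring the restored copy up to date before reusing it;
    // that step is omitted here.
    void tryRestore() {
        Iterator<OutputStream> it = removed.iterator();
        while (it.hasNext()) {
            OutputStream s = it.next();
            try {
                s.flush();
                it.remove();
                active.add(s);
            } catch (IOException e) {
                // still failing, keep it in the removed list
            }
        }
    }

    int activeCount() {
        return active.size();
    }

    int removedCount() {
        return removed.size();
    }
}
```

The write only fails as a whole when no stream is left, so a single bad disk degrades redundancy without stopping the service.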
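Several mechanisms are used to keep the metadata reliable. For example, the checkpoint process is divided into stages, and distinct file names identify the current stage, which makes recovery possible after a failed save.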
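How file names can encode the stage of an interrupted save may be easier to see in code. The sketch below is a simplification under assumptions, not Hadoop's storage analysis: the class CheckpointRecoverySketch and its State values are hypothetical, and it assumes that a VERSION file is the last thing written into current, so that its presence marks the new directory as complete.

```java
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical startup check that infers the save stage from names on disk.
final class CheckpointRecoverySketch {

    enum State {
        NOT_FORMATTED,        // nothing usable on disk
        NORMAL,               // last save finished cleanly
        COMPLETE_CHECKPOINT,  // new current is complete: finish the rename to previous.checkpoint
        ROLL_BACK_CHECKPOINT  // new current is incomplete: restore lastcheckpoint.tmp as current
    }

    static State analyze(Path storageDir) {
        boolean tmpExists = Files.isDirectory(storageDir.resolve("lastcheckpoint.tmp"));
        // Assumption: VERSION is written last, so it marks a complete current directory.
        boolean currentComplete = Files.exists(storageDir.resolve("current").resolve("VERSION"));

        if (tmpExists) {
            return currentComplete ? State.COMPLETE_CHECKPOINT : State.ROLL_BACK_CHECKPOINT;
        }
        return currentComplete ? State.NORMAL : State.NOT_FORMATTED;
    }
}
```

Because every stage leaves a distinct combination of names behind, a restart can decide whether to roll the save forward or back without any extra bookkeeping.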

5. Metadata consistency mechanisms

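First, when the NameNode starts, it checks each backup directory: whether it has been formatted, whether its metadata file names are correct, and so on. This ensures that the metadata files are in a consistent state with one another. It then loads the most recent copy into memory, so that the current HDFS state matches the state at the last shutdown.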
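Selecting the most recent copy among several backup directories can be sketched as below. This is an illustration only: the class LatestImageSketch, the StorageDir type, and the use of a per-directory checkpoint time as the ordering key are assumptions, not details taken from the text.

```java
import java.util.Comparator;
import java.util.List;
import java.util.Optional;

// Hypothetical selection of the newest valid storage directory; not part of Hadoop.
final class LatestImageSketch {

    static final class StorageDir {
        final String path;
        final boolean formatted;    // result of the startup check
        final long checkpointTime;  // assumed ordering key for "most recent"

        StorageDir(String path, boolean formatted, long checkpointTime) {
            this.path = path;
            this.formatted = formatted;
            this.checkpointTime = checkpointTime;
        }
    }

    // Ignore directories that failed the check, then take the newest one.
    static Optional<StorageDir> pickLatest(List<StorageDir> dirs) {
        return dirs.stream()
                .filter(d -> d.formatted)
                .max(Comparator.comparingLong((StorageDir d) -> d.checkpointTime));
    }
}
```

An empty result means no directory passed the check, in which case startup cannot proceed from local metadata.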
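Second, the handling of failed output streams ensures that the data in the healthy output streams stays consistent.
Finally, synchronization ensures that the output streams remain consistent with one another.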
