Improving HDFS Missing Block Diagnostic Information

Preface


In a storage system, data safety is without question the top priority, so when data is lost, the most urgent task is to quickly locate it and restore it. In this article I want to talk about tracking down lost data on HDFS. In HDFS, data is stored as multiple replicas across DataNodes. If all replicas of a file's block are lost at the same time, we get the familiar missing block situation. However, HDFS does not keep a record of the nodes where that data was last stored, which means we have no way of knowing which nodes held the data before it went missing, and that makes fast recovery considerably harder. This article describes an improvement for this, based on the Hadoop 2.6 code base.

The removal logic for HDFS block replica storage locations


Before introducing my improvement for missing blocks, it is worth understanding how HDFS currently removes a block's storage locations: when are a block's storage locations removed, leaving no location recorded at all and making it very hard for an administrator to recover the lost data?

The main scenario is the dead-node handling path. After a DataNode becomes a dead node, the NameNode runs its removeDeadDatanode logic and removes the storage locations of every block associated with that dead node.

DatanodeManager's removeDeadDatanode method:

  /** Remove a dead datanode. */
  void removeDeadDatanode(final DatanodeID nodeID) {
      synchronized(datanodeMap) {
        DatanodeDescriptor d;
        try {
          d = getDatanode(nodeID);
        } catch(IOException e) {
          d = null;
        }
        if (d != null && isDatanodeDead(d)) {
          NameNode.stateChangeLog.info(
              "BLOCK* removeDeadDatanode: lost heartbeat from " + d);
          removeDatanode(d);
        } else {
        	LOG.warn("datanode is timeout but not removed " + d);
        }
      }
  }

  /**
   * Remove a datanode descriptor.
   * @param nodeInfo datanode descriptor.
   */
  private void removeDatanode(DatanodeDescriptor nodeInfo) {
    assert namesystem.hasWriteLock();
    heartbeatManager.removeDatanode(nodeInfo);
    // This is where the blocks' dead storage locations get removed
    blockManager.removeBlocksAssociatedTo(nodeInfo);
    networktopology.remove(nodeInfo);
    decrementVersionCount(nodeInfo.getSoftwareVersion());
    blockManager.getBlockReportLeaseManager().unregister(nodeInfo);

    if (LOG.isDebugEnabled()) {
      LOG.debug("remove datanode " + nodeInfo);
    }
    namesystem.checkSafeMode();
  }

The call chain eventually reaches BlockInfoContiguous#removeStorage; BlockInfoContiguous is the class that ultimately holds the location information of a block's replicas:

  /**
   * Remove {@link DatanodeStorageInfo} location for a block
   */
  boolean removeStorage(DatanodeStorageInfo storage) {
    int dnIndex = findStorageInfo(storage);
    if(dnIndex < 0) // the node is not found
      return false;
    assert getPrevious(dnIndex) == null && getNext(dnIndex) == null : 
      "Block is still in the list and must be removed first.";
    // find the last not null node
    int lastNode = numNodes()-1; 
    // replace current node triplet by the lastNode one 
    setStorageInfo(dnIndex, getStorageInfo(lastNode));
    setNext(dnIndex, getNext(lastNode)); 
    setPrevious(dnIndex, getPrevious(lastNode)); 
    // set the last triplet to null
    setStorageInfo(lastNode, null);
    setNext(lastNode, null); 
    setPrevious(lastNode, null);
    return true;
  }

The internal object array triplets in BlockInfoContiguous holds both the storage location references of the 3 replicas and, for each replica, references to the previous and next block in that storage's block list.

The BlockInfoContiguous class:

  /**
   * This array contains triplets of references. For each i-th storage, the
   * block belongs to triplets[3*i] is the reference to the
   * {@link DatanodeStorageInfo} and triplets[3*i+1] and triplets[3*i+2] are
   * references to the previous and the next blocks, respectively, in the list
   * of blocks belonging to this storage.
   * 
   * Using previous and next in Object triplets is done instead of a
   * {@link LinkedList} list to efficiently use memory. With LinkedList the cost
   * per replica is 42 bytes (LinkedList#Entry object per replica) versus 16
   * bytes using the triplets.
   */
  private Object[] triplets;
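
To make the index arithmetic concrete, here is a minimal, self-contained sketch of the triplets layout; the class and accessor names below are hypothetical stand-ins for illustration, not the actual BlockInfoContiguous code:

// Illustrative sketch of the triplets layout described above.
// For a block with replication factor r, the array has 3*r slots:
//   triplets[3*i]   -> DatanodeStorageInfo of the i-th replica
//   triplets[3*i+1] -> previous block in that storage's block list
//   triplets[3*i+2] -> next block in that storage's block list
class TripletsLayoutSketch {
  private final Object[] triplets;

  TripletsLayoutSketch(short replication) {
    this.triplets = new Object[3 * replication];
  }

  Object getStorageInfo(int i) {
    return triplets[i * 3];        // storage holding replica i
  }

  Object getPrevious(int i) {
    return triplets[i * 3 + 1];    // previous block on the same storage
  }

  Object getNext(int i) {
    return triplets[i * 3 + 2];    // next block on the same storage
  }
}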

Now the question: why does the NameNode remove a dead node's location from its blocks as soon as the node becomes dead? If all 3 replica nodes go dead at the same moment, won't we end up unable to find the replicas at all?

Exactly, losing every location when all replica nodes die at once is a downside of this handling. The removeStorage logic itself is not wrong, though: a block needs to keep its replica locations up to date and drop stale storage info in time, otherwise, at a scale of tens of millions of blocks, the stale location information alone would add up to GB-level memory usage. That is why the NameNode cleans up the storage references right away.

Optimizing the last stored location of an HDFS block


That said, if all we achieve by keeping location data absolutely minimal is that a missing block's last locations are wiped out, we are left with no useful information at all. Can we additionally keep the last locations the block was stored at? Since this information is stored separately, it does not disturb the existing block handling logic, so the risk is small.

Before settling on this optimization, I implemented two candidate options:

  • Option 1: crudely record the current replica locations for every block. To save space, only the IP address of each replica is kept, stored as a byte array (a String stores every character in a 2-byte char, while a byte array needs just 1 byte per character; see the sketch right after this list). This option preserves complete replica location details, at the cost of a fairly large amount of extra heap.
  • Option 2: store only the last location, and only for blocks of the missing-block kind, again encoding the IP address as bytes. The rule is: whenever the last remaining location of a block is removed, record that location. This option has almost no impact on the NameNode's heap usage, because most blocks have valid current locations and therefore never need a last location recorded.
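
Both options rely on the same space trick of encoding the IP address as raw bytes. Below is a rough sketch of that round trip; it assumes DFSUtil's string/byte helpers behave as in the Hadoop 2.x code base and is only meant to illustrate the trade-off, not code taken from either patch:

import org.apache.hadoop.hdfs.DFSUtil;

public class IpEncodingSketch {
  public static void main(String[] args) {
    String ip = "10.0.0.1";

    // A String keeps each character in a 2-byte char plus object overhead;
    // encoding the ASCII address with DFSUtil.string2Bytes() costs one byte
    // per character instead.
    byte[] compact = DFSUtil.string2Bytes(ip);

    // Decode the address back when it has to be reported, e.g. by fsck.
    String restored = DFSUtil.bytes2String(compact);

    System.out.println(compact.length + " bytes for \"" + restored + "\"");
  }
}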

OK, let's first look at the implementation of option 1; only the diff changes are shown here.

diff --git a/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/BlockInfoContiguous.java b/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/BlockInfoContiguous.java
index bc020aff6a..185827f1bf 100644
--- a/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/BlockInfoContiguous.java
+++ b/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/BlockInfoContiguous.java
@@ -20,7 +20,9 @@
 import java.util.LinkedList;
 
 import org.apache.hadoop.classification.InterfaceAudience;
+import org.apache.hadoop.hdfs.DFSUtil;
 import org.apache.hadoop.hdfs.protocol.Block;
+import org.apache.hadoop.hdfs.protocol.DatanodeID;
 import org.apache.hadoop.hdfs.server.common.HdfsServerConstants.BlockUCState;
 import org.apache.hadoop.util.LightWeightGSet;
 
@@ -56,18 +58,27 @@
    */
   private Object[] triplets;
 
+  /**
+   * Last locations where this block was stored. These locations would
+   * otherwise be lost once all replica DataNodes are removed after becoming
+   * dead. Byte arrays are used to store the ip addresses to reduce memory cost.
+   */
+  private byte[][] lastStoredLocations;
+
   /**
    * Construct an entry for blocksmap
    * @param replication the block's replication factor
    */
   public BlockInfoContiguous(short replication) {
     this.triplets = new Object[3*replication];
+    this.lastStoredLocations = new byte[replication][];
     this.bc = null;
   }
   
   public BlockInfoContiguous(Block blk, short replication) {
     super(blk);
     this.triplets = new Object[3*replication];
+    this.lastStoredLocations = new byte[replication][];
     this.bc = null;
   }
 
@@ -79,6 +90,7 @@ public BlockInfoContiguous(Block blk, short replication) {
   protected BlockInfoContiguous(BlockInfoContiguous from) {
     super(from);
     this.triplets = new Object[from.triplets.length];
+    this.lastStoredLocations = new byte[from.lastStoredLocations.length][];
     this.bc = from.bc;
   }
 
@@ -101,6 +113,25 @@ DatanodeStorageInfo getStorageInfo(int index) {
     return (DatanodeStorageInfo)triplets[index*3];
   }
 
+  /**
+   * Get the last stored node addresses where this block was stored; this
+   * method is currently used by the hdfs fsck command.
+   */
+  public String[] getLastStoredNodes() {
+    if (lastStoredLocations != null) {
+      String[] nodeAddresses = new String[lastStoredLocations.length];
+      for (int i = 0; i < lastStoredLocations.length; i++) {
+        if (lastStoredLocations[i] != null) {
+          nodeAddresses[i] = DFSUtil.bytes2String(lastStoredLocations[i]);
+        }
+      }
+
+      return nodeAddresses;
+    } else {
+      return null;
+    }
+  }
+
   private BlockInfoContiguous getPrevious(int index) {
     assert this.triplets != null : "BlockInfo is not initialized";
     assert index >= 0 && index*3+1 < triplets.length : "Index is out of bound";
@@ -182,6 +213,42 @@ private int ensureCapacity(int num) {
     return last;
   }
 
+  /**
+   * Refresh the last stored locations from the current triplets, resizing
+   * the array if the replication factor has been increased.
+   */
+  private void updateLastStoredLocations() {
+    if (lastStoredLocations != null) {
+      if (lastStoredLocations.length * 3 == triplets.length) {
+        setLastStoredLocations();
+      } else {
+        // This is for the case of increasing replica number from users.
+        lastStoredLocations = new byte[triplets.length / 3][];
+        setLastStoredLocations();
+      }
+    }
+  }
+
+  /**
+   * Reset block last stored locations from triplets objects.
+   */
+  private void setLastStoredLocations() {
+    int storedIndex = 0;
+    // DatanodeStorageInfo is stored in position of triplets[i*3]
+    for (int i = 0; i < triplets.length; i += 3) {
+      if (triplets[i] != null) {
+        DatanodeStorageInfo currentDN = (DatanodeStorageInfo) triplets[i];
+        if (currentDN.getDatanodeDescriptor() != null) {
+          String ipAddress = currentDN.getDatanodeDescriptor().getIpAddr();
+          if (ipAddress != null) {
+            lastStoredLocations[storedIndex] = DFSUtil.string2Bytes(ipAddress);
+            storedIndex++;
+          }
+        }
+      }
+    }
+  }
+
   /**
    * Count the number of data-nodes the block belongs to.
    */
@@ -204,6 +271,10 @@ boolean addStorage(DatanodeStorageInfo storage) {
     setStorageInfo(lastNode, storage);
     setNext(lastNode, null);
     setPrevious(lastNode, null);
+    // We only need to update the last stored locations when a storage is
+    // added; an added storage means a replica was written successfully, so
+    // there is at least one valid block location at this point.
+    updateLastStoredLocations();
     return true;
   }
 
diff --git a/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/NamenodeFsck.java b/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/NamenodeFsck.java
index 1ee7d879fa..82c49d7d35 100644
--- a/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/NamenodeFsck.java
+++ b/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/NamenodeFsck.java
@@ -676,7 +676,17 @@ private void collectBlocksSummary(String parent, HdfsFileStatus file, Result res
           report.append(" " + sb.toString());
         }
       }
-      report.append('\n');
+
+      // Print last location the block stored.
+      BlockInfoContiguous blockInfo = bm.getStoredBlock(block.getLocalBlock());
+      String[] nodeAddresses = blockInfo.getLastStoredNodes();
+      report.append("\nBlock last stored locations: [");
+      if (nodeAddresses != null) {
+        for (String node : nodeAddresses) {
+          report.append(node).append(", ");
+        }
+      }
+      report.append("].\n");
       blockNumber++;
     }

The core of this option is that every time a new storage is added, we refresh the currently valid storage locations into the internal lastStoredLocations field. If storages are removed in between, the next addStorage call (triggered by replication, balancing, or a block report from a re-registered node) brings lastStoredLocations back to the latest locations.

I then exposed lastStoredLocations through the fsck tool, which works quite well in practice. While testing this option, however, I found the extra heap consumption too large: storing one block's 3 IPs as byte arrays costs 3 * 16 bytes (assuming up to 16 characters per IP address) = 48 bytes, and 48 B x 50 million blocks ≈ 2.24 GB. Applied to a cluster with over a hundred million blocks, the change would add 10 GB+ of heap usage, which some cluster operators may find unacceptable.
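
As a sanity check, the back-of-envelope estimate above can be reproduced as follows; note it only counts the raw IP bytes and ignores per-array object headers and references, which would push the real footprint even higher:

public class HeapEstimateSketch {
  public static void main(String[] args) {
    int bytesPerIp = 16;           // assume up to 16 characters per IP address
    int replicas = 3;              // option 1 keeps all replica locations
    long blocks = 50_000_000L;     // 50 million blocks

    long perBlock = (long) replicas * bytesPerIp;                 // 48 bytes
    double totalGb = perBlock * blocks / (1024.0 * 1024 * 1024);
    System.out.printf("%d bytes per block, about %.2f GB in total%n",
        perBlock, totalGb);        // 48 bytes per block, about 2.24 GB
  }
}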

So I implemented the improvement as the second option discussed above. Let's look at its implementation:

diff --git a/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/BlockInfoContiguous.java b/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/BlockInfoContiguous.java
index bc020aff6a..5e27b45d8b 100644
--- a/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/BlockInfoContiguous.java
+++ b/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/BlockInfoContiguous.java
@@ -20,6 +20,7 @@
 import java.util.LinkedList;
 
 import org.apache.hadoop.classification.InterfaceAudience;
+import org.apache.hadoop.hdfs.DFSUtil;
 import org.apache.hadoop.hdfs.protocol.Block;
 import org.apache.hadoop.hdfs.server.common.HdfsServerConstants.BlockUCState;
 import org.apache.hadoop.util.LightWeightGSet;
@@ -56,18 +57,27 @@
    */
   private Object[] triplets;
 
+  /**
+   * Last location where this block was stored. This location would otherwise
+   * be lost once all replica DataNodes are removed after becoming dead. A byte
+   * array is used to store the ip address to reduce memory cost.
+   */
+  private byte[] lastStoredLocation;
+
   /**
    * Construct an entry for blocksmap
    * @param replication the block's replication factor
    */
   public BlockInfoContiguous(short replication) {
     this.triplets = new Object[3*replication];
+    this.lastStoredLocation = null;
     this.bc = null;
   }
   
   public BlockInfoContiguous(Block blk, short replication) {
     super(blk);
     this.triplets = new Object[3*replication];
+    this.lastStoredLocation = null;
     this.bc = null;
   }
 
@@ -79,6 +89,7 @@ public BlockInfoContiguous(Block blk, short replication) {
   protected BlockInfoContiguous(BlockInfoContiguous from) {
     super(from);
     this.triplets = new Object[from.triplets.length];
+    this.lastStoredLocation = from.lastStoredLocation;
     this.bc = from.bc;
   }
 
@@ -101,6 +112,18 @@ DatanodeStorageInfo getStorageInfo(int index) {
     return (DatanodeStorageInfo)triplets[index*3];
   }
 
+  /**
+   * Get the last stored node address where this block was stored; this
+   * method is currently used by the hdfs fsck command.
+   */
+  public String getLastStoredNodes() {
+    if (lastStoredLocation != null) {
+      return DFSUtil.bytes2String(lastStoredLocation);
+    } else {
+      return null;
+    }
+  }
+
   private BlockInfoContiguous getPrevious(int index) {
     assert this.triplets != null : "BlockInfo is not initialized";
     assert index >= 0 && index*3+1 < triplets.length : "Index is out of bound";
@@ -182,6 +205,31 @@ private int ensureCapacity(int num) {
     return last;
   }
 
+  /**
+   * Record the removed storage as the block's last stored location if it was
+   * the last remaining replica location of the block.
+   */
+  private void updateLastStoredLocation(DatanodeStorageInfo removedStorage) {
+    if (triplets != null) {
+      for (int i = 0; i < triplets.length; i += 3) {
+        if (triplets[i] != null) {
+          // There still exists other storage location and given removed
+          // storage isn't last block storage location.
+          return;
+        }
+      }
+
+      // If given removed storage is the block last storage, we need to
+      // store the ip address of this storage node.
+      if (removedStorage.getDatanodeDescriptor() != null) {
+        String ipAddress = removedStorage.getDatanodeDescriptor().getIpAddr();
+        if (ipAddress != null) {
+          lastStoredLocation = DFSUtil.string2Bytes(ipAddress);
+        }
+      }
+    }
+  }
+
   /**
    * Count the number of data-nodes the block belongs to.
    */
@@ -204,6 +252,12 @@ boolean addStorage(DatanodeStorageInfo storage) {
     setStorageInfo(lastNode, storage);
     setNext(lastNode, null);
     setPrevious(lastNode, null);
+
+    if (lastStoredLocation != null) {
+      // Reset last stored location to null since we have new
+      // valid storage location added.
+      lastStoredLocation = null;
+    }
     return true;
   }
 
@@ -225,7 +279,9 @@ assert getPrevious(dnIndex) == null && getNext(dnIndex) == null :
     // set the last triplet to null
     setStorageInfo(lastNode, null);
     setNext(lastNode, null); 
-    setPrevious(lastNode, null); 
+    setPrevious(lastNode, null);
+    // update last block stored location
+    updateLastStoredLocation(storage);
     return true;
   }
 
diff --git a/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/NamenodeFsck.java b/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/NamenodeFsck.java
index 1ee7d879fa..20d3a31c7d 100644
--- a/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/NamenodeFsck.java
+++ b/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/NamenodeFsck.java
@@ -676,7 +676,11 @@ private void collectBlocksSummary(String parent, HdfsFileStatus file, Result res
           report.append(" " + sb.toString());
         }
       }
-      report.append('\n');
+
+      // Print last location the block stored.
+      BlockInfoContiguous blockInfo = bm.getStoredBlock(block.getLocalBlock());
+      report.append("\nBlock last stored location: [" + blockInfo.getLastStoredNodes());
+      report.append("].\n");
       blockNumber++;
     }
 
diff --git a/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/blockmanagement/TestBlockInfo.java b/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/blockmanagement/TestBlockInfo.java
index f57dece80e..47c7c299c3 100644
--- a/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/blockmanagement/TestBlockInfo.java
+++ b/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/blockmanagement/TestBlockInfo.java
@@ -172,4 +172,37 @@ public void testBlockListMoveToHead() throws Exception {
           blockInfoList.get(j), dd.getBlockListHeadForTesting());
     }
   }
+
+  @Test
+  public void testLastStoredLocation() {
+    BlockInfoContiguous blockInfo = new BlockInfoContiguous((short) 3);
+
+    String lastStoredLocation = "127.0.0.3";
+    DatanodeStorageInfo storage1 = DFSTestUtil.createDatanodeStorageInfo(
+        "storageID1", "127.0.0.1");
+    DatanodeStorageInfo storage2 = DFSTestUtil.createDatanodeStorageInfo(
+        "storageID2", "127.0.0.2");
+    DatanodeStorageInfo storage3 = DFSTestUtil.createDatanodeStorageInfo(
+        "storageID3", lastStoredLocation);
+
+    blockInfo.addStorage(storage1);
+    blockInfo.addStorage(storage2);
+    blockInfo.addStorage(storage3);
+
+    Assert.assertEquals(storage1, blockInfo.getStorageInfo(0));
+    Assert.assertEquals(storage2, blockInfo.getStorageInfo(1));
+    Assert.assertEquals(storage3, blockInfo.getStorageInfo(2));
+    Assert.assertNull(blockInfo.getLastStoredNodes());
+
+    blockInfo.removeStorage(storage1);
+    blockInfo.removeStorage(storage2);
+    blockInfo.removeStorage(storage3);
+
+    Assert.assertNull(blockInfo.getDatanode(0));
+    Assert.assertNull(blockInfo.getDatanode(1));
+    Assert.assertNull(blockInfo.getDatanode(2));
+
+    Assert.assertNotNull(blockInfo.getLastStoredNodes());
+    Assert.assertEquals(lastStoredLocation, blockInfo.getLastStoredNodes());
+  }
 }
\ No newline at end of file

The key point of option 2 is that when the last storage location of a block is removed, we record that removed storage's location. If those dead nodes come back later, addStorage is triggered and lastStoredLocation is cleared again. In other words, option 2 only records extra location information for blocks that are actually missing. Because the logic lives inside removeStorage, normal file deletion also ends up writing lastStoredLocation, but the impact on the delete path itself is acceptable.

Testing lastStoredLocation for HDFS missing blocks


I tested the above changes with the fsck command, and the result looks quite good, as shown below:
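
The exact invocation is not recorded in the original write-up; it was presumably something along the lines of the following, which asks fsck to print per-file block and location details (flags assumed):

hdfs fsck /tmp/testfile3 -files -blocks -locations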

FSCK started by hdfs from /xx.xx.xx.xx for path /tmp/testfile3 at Tue Mar 03 00:32:55 GMT-07:00 2020
/tmp/testfile3 12 bytes, 1 block(s):  OK
0. BP-xx.xx.xx.xx:blk_1075951756_2217202 len=12 repl=3 [DatanodeInfoWithStorage[xx.xx.xx.xx:50010,DS-1c6de9cf-4440-426e-b5a6-c6f3b212d39b,DISK], DatanodeInfoWithStorage[xx.xx.xx.xx:50010,DS-7ea05055-025d-4a96-ad17-b87c677ff421,DISK], DatanodeInfoWithStorage[xx.xx.xx.xx:50010,DS-8ce45298-7109-4d9d-82cf-dfaece62263f,DISK]]
Block last stored location: [null].

// After all the replica DataNodes have become dead at the same time
FSCK started by hdfs from /xx.xx.xx.xx for path /tmp/testfile3 at Tue Mar 03 00:52:14 GMT-07:00 2020
/tmp/testfile3 12 bytes, 1 block(s):
/tmp/testfile3: CORRUPT blockpool BP-xx.xx.xx.xx block blk_1075951756
 MISSING 1 blocks of total size 12 B
0. BP-xx.xx.xx.xx:blk_1075951756_2217202 len=12 MISSING!
Block last stored location: [xx.xx.xx.xx].

So far I have only verified these changes on a test cluster rather than in production, but I believe this improvement will be a real help to cluster maintainers.
