HDFS unusable because disks could not be recognized

Environment:

Hadoop version: 2.7.2

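Symptoms:

After a configuration upgrade and restart, the available HDFS capacity dropped sharply.

HDFS reported the state INCONSISTENT and could not be used normally; the DataNode process disappeared right after starting.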

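Problem analysis:

Possible causes:

1. The Hadoop cluster had been expanded, so its nodes are heterogeneous and hdfs-site.xml differs between them. A configuration file may have been overwritten with the wrong version when it was copied with scp, leaving some disks unrecognized.

2. Some disks are damaged, so their data cannot be read.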

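Troubleshooting:

1. Checked hdfs-site.xml. The configuration was correct, which ruled out the first cause.

2. Checked the logs on each DataNode, which showed the following.

The immediate error was: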

2018-07-23 10:22:16,959 FATAL org.apache.hadoop.hdfs.server.datanode.DataNode: Initialization failed for Block pool <registering> (Datanode Uuid unassigned) service to hb1/192.168.10.32:9000. Exiting.
org.apache.hadoop.util.DiskChecker$DiskErrorException: Too many failed volumes - current valid volumes: 6, volumes configured: 10, volumes failed: 4, volume failures tolerated: 0
        at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.<init>(FsDatasetImpl.java:285)
        at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetFactory.newInstance(FsDatasetFactory.java:34)
        at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetFactory.newInstance(FsDatasetFactory.java:30)
        at org.apache.hadoop.hdfs.server.datanode.DataNode.initStorage(DataNode.java:1371)
        at org.apache.hadoop.hdfs.server.datanode.DataNode.initBlockPool(DataNode.java:1323)
        at org.apache.hadoop.hdfs.server.datanode.BPOfferService.verifyAndSetNamespaceInfo(BPOfferService.java:317)
        at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.connectToNNAndHandshake(BPServiceActor.java:223)
        at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.run(BPServiceActor.java:802)
        at java.lang.Thread.run(Thread.java:745)

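My first reaction was that a disk had failed and HDFS could no longer recognize it. HDFS here is configured so that a single unrecognized disk makes the whole DataNode unavailable (the log shows "volume failures tolerated: 0"), which is why the node could not be used. I ran smartctl -H /dev/sdb, and the same command for every other disk, to check disk health. Each one returned SMART Health Status: OK, so the disks themselves were not the problem.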
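Checking every disk by hand is tedious on a node with ten volumes, so the health check can be scripted. The following is a minimal sketch, not the procedure used in this post: it assumes smartctl is installed and run with sufficient privileges, and the device names passed to check_disks are placeholders. The first marker string is the one quoted above; the second is the wording smartctl uses for ATA/SATA disks.

```python
import subprocess

# Strings that smartctl -H prints for a healthy disk. The first is the one
# quoted above (SCSI/SAS disks); the second is reported by ATA/SATA disks.
HEALTHY_MARKERS = (
    "SMART Health Status: OK",
    "SMART overall-health self-assessment test result: PASSED",
)


def is_healthy(smartctl_output):
    """Return True if the smartctl -H output contains a healthy marker."""
    return any(marker in smartctl_output for marker in HEALTHY_MARKERS)


def check_disks(devices):
    """Run smartctl -H on each device and map device -> healthy (bool).

    The device list is supplied by the caller; the names are hypothetical.
    """
    results = {}
    for dev in devices:
        proc = subprocess.run(
            ["smartctl", "-H", dev], capture_output=True, text=True
        )
        results[dev] = is_healthy(proc.stdout)
    return results
```

For example, check_disks(["/dev/sdb", "/dev/sdc"]) returns a dictionary with one boolean per device. A False value only means that no healthy marker was found, so the raw smartctl output should still be read before concluding that a disk is bad.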

Reading further back in the log revealed the real problem:

2018-07-23 10:22:16,731 WARN org.apache.hadoop.hdfs.server.common.Storage: org.apache.hadoop.hdfs.server.common.InconsistentFSStateException: Directory /data6 is in an inconsistent state: Root /data6: DatanodeUuid=4310e0f4-6667-4cdc-bc1b-931cba355606, does not match f385b27e-6a2f-4077-8468-40fd3dec7dc2 from other StorageDirectory.
2018-07-23 10:22:16,760 INFO org.apache.hadoop.hdfs.server.common.Storage: Lock on /data7/in_use.lock acquired by nodename 173507@hd12
2018-07-23 10:22:16,770 WARN org.apache.hadoop.hdfs.server.common.Storage: org.apache.hadoop.hdfs.server.common.InconsistentFSStateException: Directory /data7 is in an inconsistent state: Root /data7: DatanodeUuid=4310e0f4-6667-4cdc-bc1b-931cba355606, does not match f385b27e-6a2f-4077-8468-40fd3dec7dc2 from other StorageDirectory.
2018-07-23 10:22:16,778 INFO org.apache.hadoop.hdfs.server.common.Storage: Lock on /data8/in_use.lock acquired by nodename 173507@hd12
2018-07-23 10:22:16,803 INFO org.apache.hadoop.hdfs.server.common.Storage: Analyzing storage directories for bpid BP-2078449412-192.168.10.32-1497544230293
2018-07-23 10:22:16,803 INFO org.apache.hadoop.hdfs.server.common.Storage: Locking is disabled for /data8/current/BP-2078449412-192.168.10.32-1497544230293
2018-07-23 10:22:16,837 INFO org.apache.hadoop.hdfs.server.common.Storage: Lock on /data9/in_use.lock acquired by nodename 173507@hd12
2018-07-23 10:22:16,862 WARN org.apache.hadoop.hdfs.server.common.Storage: org.apache.hadoop.hdfs.server.common.InconsistentFSStateException: Directory /data9 is in an inconsistent state: Root /data9: DatanodeUuid=4310e0f4-6667-4cdc-bc1b-931cba355606, does not match f385b27e-6a2f-4077-8468-40fd3dec7dc2 from other StorageDirectory.
2018-07-23 10:22:16,886 INFO org.apache.hadoop.hdfs.server.common.Storage: Lock on /data10/in_use.lock acquired by nodename 173507@hd12
2018-07-23 10:22:16,947 WARN org.apache.hadoop.hdfs.server.common.Storage: org.apache.hadoop.hdfs.server.common.InconsistentFSStateException: Directory /data10 is in an inconsistent state: Root /data10: DatanodeUuid=4310e0f4-6667-4cdc-bc1b-931cba355606, does not match f385b27e-6a2f-4077-8468-40fd3dec7dc2 from other StorageDirectory.

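That is exactly 4 disks (/data6, /data7, /data9 and /data10), which matches the 4 failed volumes in the error message. The DatanodeUuid stored in these directories differs from the one used by the other storage directories, and this was the real cause of the problem.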
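The affected directories can also be pulled out of the log automatically instead of by eye. This is a minimal sketch that assumes each InconsistentFSStateException message sits on a single line of the log file and has the wording shown above; the function name and the regular expression are illustrative, not part of Hadoop.

```python
import re

# Matches the InconsistentFSStateException message as it appears in the
# DataNode log excerpt above (one message per line is assumed).
MISMATCH = re.compile(
    r"Directory (?P<dir>\S+) is in an inconsistent state: "
    r"Root \S+ DatanodeUuid=(?P<found>[0-9a-f-]+), "
    r"does not match (?P<expected>[0-9a-f-]+)"
)


def find_mismatched_dirs(log_text):
    """Map each inconsistent directory to (uuid_on_disk, uuid_expected)."""
    return {
        m.group("dir"): (m.group("found"), m.group("expected"))
        for m in MISMATCH.finditer(log_text)
    }
```

Run against the DataNode log, the result lists every directory whose UUID has to be corrected together with the value it should be changed to.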

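Resolution:

Go to the current directory under each affected disk directory (the directories listed in the dfs.datanode.data.dir property of hdfs-site.xml) and open the VERSION file. Change the datanodeUuid property to the value given in the error log and save the file. After HDFS was restarted, the problem was resolved. (In my case, 4310e0f4-6667-4cdc-bc1b-931cba355606 was changed to f385b27e-6a2f-4077-8468-40fd3dec7dc2.)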
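With four directories to fix, the edit can be scripted so that every VERSION file gets the same value. The following is a minimal sketch of that manual step, assuming the VERSION file is a plain key=value text file with a datanodeUuid entry and that the DataNode is stopped while it is edited. The function name and the .bak backup copy are additions for illustration and were not part of the original fix.

```python
import shutil
from pathlib import Path

KEY = "datanodeUuid="


def set_datanode_uuid(version_path, new_uuid):
    """Rewrite the datanodeUuid entry in a VERSION file; return the old value."""
    path = Path(version_path)
    lines = path.read_text().splitlines()
    old_uuid = None
    for i, line in enumerate(lines):
        if line.startswith(KEY):
            old_uuid = line[len(KEY):]
            lines[i] = KEY + new_uuid
    if old_uuid is None:
        raise ValueError(f"no datanodeUuid entry in {path}")
    # Keep a copy of the original next to it before overwriting
    # (an added precaution, not a step from the original fix).
    shutil.copy2(path, path.with_name(path.name + ".bak"))
    path.write_text("\n".join(lines) + "\n")
    return old_uuid
```

For the case described here, the call would be set_datanode_uuid("/data6/current/VERSION", "f385b27e-6a2f-4077-8468-40fd3dec7dc2"), repeated for /data7, /data9 and /data10 before restarting HDFS.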

 
