CDH "Can't scan a pre-transactional edit log" / "Timed out waiting 120000ms": recovering a cluster from corrupted JournalNode data files

Overview:
On a CDH 5.11 cluster, a power outage (or a full disk) took down every node. After the restart, HDFS failed to come up, and because HDFS was down, applications built on it, such as HBase, failed as well. The recovery procedure is described below.

Error details:
Excerpts from the logs in my case are shown below.
The NameNode reported:

2017-07-03 13:53:10,377 FATAL org.apache.hadoop.hdfs.server.namenode.FSEditLog: Error: recoverUnfinalizedSegments failed for required journal (JournalAndStream(mgr=QJM to [192.168.60.43:8485, 192.168.60.45:8485, 192.168.60.46:8485], stream=null))
java.io.IOException: Timed out waiting 120000ms for a quorum of nodes to respond.
at org.apache.hadoop.hdfs.qjournal.client.AsyncLoggerSet.waitForWriteQuorum(AsyncLoggerSet.java:137)
at org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager.createNewUniqueEpoch(QuorumJournalManager.java:183)
at org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager.recoverUnfinalizedSegments(QuorumJournalManager.java:441)
at org.apache.hadoop.hdfs.server.namenode.JournalSet$8.apply(JournalSet.java:624)
at org.apache.hadoop.hdfs.server.namenode.JournalSet.mapJournalsAndReportErrors(JournalSet.java:393)
at org.apache.hadoop.hdfs.server.namenode.JournalSet.recoverUnfinalizedSegments(JournalSet.java:621)
at org.apache.hadoop.hdfs.server.namenode.FSEditLog.recoverUnclosedStreams(FSEditLog.java:1478)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startActiveServices(FSNamesystem.java:1236)
at org.apache.hadoop.hdfs.server.namenode.NameNode$NameNodeHAContext.startActiveServices(NameNode.java:1771)
at org.apache.hadoop.hdfs.server.namenode.ha.ActiveState.enterState(ActiveState.java:61)
at org.apache.hadoop.hdfs.server.namenode.ha.HAState.setStateInternal(HAState.java:64)
at org.apache.hadoop.hdfs.server.namenode.ha.StandbyState.setState(StandbyState.java:49)
at org.apache.hadoop.hdfs.server.namenode.NameNode.transitionToActive(NameNode.java:1644)
at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.transitionToActive(NameNodeRpcServer.java:1378)
at org.apache.hadoop.ha.protocolPB.HAServiceProtocolServerSideTranslatorPB.transitionToActive(HAServiceProtocolServerSideTranslatorPB.java:107)
at org.apache.hadoop.ha.proto.HAServiceProtocolProtos$HAServiceProtocolService$2.callBlockingMethod(HAServiceProtocolProtos.java:4460)
at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:617)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1073)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2220)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2216)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1920)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2214)
2017-07-03 13:53:10,382 INFO org.apache.hadoop.util.ExitUtil: Exiting with status 1
2017-07-03 13:53:10,384 INFO org.apache.hadoop.hdfs.server.namenode.NameNode: SHUTDOWN_MSG:

The JournalNode log files reported:

2017-07-03 14:06:36,898 WARN org.apache.hadoop.hdfs.server.namenode.FSImage: Caught exception after scanning through 0 ops from /home/journal/nameservice1/current/edits_inprogress_0000000000002539938 while determining its valid length. Position was 1048576
java.io.IOException: Can't scan a pre-transactional edit log.
at org.apache.hadoop.hdfs.server.namenode.FSEditLogOp$LegacyReader.scanOp(FSEditLogOp.java:4610)
at org.apache.hadoop.hdfs.server.namenode.EditLogFileInputStream.scanNextOp(EditLogFileInputStream.java:245)
at org.apache.hadoop.hdfs.server.namenode.EditLogFileInputStream.scanEditLog(EditLogFileInputStream.java:355)
at org.apache.hadoop.hdfs.server.namenode.FileJournalManager$EditLogFile.scanLog(FileJournalManager.java:551)
at org.apache.hadoop.hdfs.qjournal.server.Journal.scanStorageForLatestEdits(Journal.java:193)
at org.apache.hadoop.hdfs.qjournal.server.Journal.&lt;init&gt;(Journal.java:153)
at org.apache.hadoop.hdfs.qjournal.server.JournalNode.getOrCreateJournal(JournalNode.java:93)
at org.apache.hadoop.hdfs.qjournal.server.JournalNode.getOrCreateJournal(JournalNode.java:102)
at org.apache.hadoop.hdfs.qjournal.server.JournalNodeRpcServer.getEditLogManifest(JournalNodeRpcServer.java:186)
at org.apache.hadoop.hdfs.qjournal.protocolPB.QJournalProtocolServerSideTranslatorPB.getEditLogManifest(QJournalProtocolServerSideTranslatorPB.java:236)
at org.apache.hadoop.hdfs.qjournal.protocol.QJournalProtocolProtos$QJournalProtocolService$2.callBlockingMethod(QJournalProtocolProtos.java:25431)
at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:617)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1073)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2220)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2216)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1920)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2214)
2017-07-03 14:14:59,948 INFO org.apache.hadoop.hdfs.qjournal.server.JournalNode: STARTUP_MSG:

Recovery procedure:
1. Initial analysis: Reading through the logs, my conclusion was that the edits files maintained by the JournalNodes had been corrupted. I have three JournalNodes in total; two of them were logging the JournalNode error shown above, while the third showed no errors at all. The initial assessment was therefore that only those two nodes were damaged, and that copying the data from the third, intact JournalNode over to them should be enough to recover. The steps follow.
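As a sketch of this triage step: the check can be scripted as a small helper that looks for the fatal string in a JournalNode's log directory. The function name is mine, and the CDH log path in the usage comment is a typical guess, not something the original post specifies:

```shell
# Hypothetical helper: returns success (0) if the given JournalNode log
# directory contains the fatal "pre-transactional edit log" error.
jn_has_scan_error() {
  grep -rq "Can't scan a pre-transactional edit log" "$1" 2>/dev/null
}

# Example use on each JournalNode host (log path is an assumption):
#   jn_has_scan_error /var/log/hadoop-hdfs && echo "edits corrupted on this node"
```

Running this on every JournalNode tells you which nodes need repair and which one can serve as the copy source.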
2. Steps:
① First, stop all cluster services.
② To be safe, back up the data maintained by every JournalNode. My JournalNode data directory is /home/journal, so I packed it up with tar (do this on every JournalNode):

cd /home
tar -zcvf journal.bak.tar.gz ./journal
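Before deleting anything in the next step, it is worth confirming the archive is actually readable. A minimal check (the helper name is mine; the example path assumes the backup commands above were run from /home):

```shell
# Returns success (0) only if the tarball exists and lists cleanly.
backup_ok() {
  tar -tzf "$1" >/dev/null 2>&1
}

# Example:
#   backup_ok /home/journal.bak.tar.gz && echo "backup readable"
```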

③ Delete the corrupted data: on each JournalNode whose log showed the error above, go into its data directory and delete the contents (do this only on those JournalNodes):

cd  /home/journal/nameservice1/current
rm -rf *
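If you would rather keep the broken files around than delete them outright, the same step can be done as a move into a quarantine directory instead of an `rm`. This is a variation of mine, not part of the original procedure:

```shell
# Same effect as the rm above, but recoverable: move the corrupted
# contents into a timestamped sibling directory instead of deleting.
quarantine_dir() {
  # $1 = directory whose contents should be set aside
  q="$1.bad.$(date +%Y%m%d%H%M%S)"
  mkdir -p "$q" && mv "$1"/* "$q"/ 2>/dev/null
  echo "$q"
}

# Example:
#   quarantine_dir /home/journal/nameservice1/current
```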

④ Copy the data: use scp to copy the data from the intact JournalNode to each JournalNode that had errors (the commands below, run on the intact node, copy to one broken node; repeat for the other):

cd  /home/journal/nameservice1/current
scp ./* [email protected]:/home/journal/nameservice1/current/
scp -r ./paxos/ [email protected]:/home/journal/nameservice1/current/
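Optionally, you can verify that the copy is byte-identical on both nodes before moving on. One way to do that (a sketch; the helper is not part of Hadoop) is to reduce a directory to a single checksum and compare the printed value on the healthy and repaired nodes:

```shell
# Prints one checksum summarizing every regular file under the directory;
# identical directory contents produce identical output.
dir_checksum() {
  ( cd "$1" && find . -type f -exec cksum {} \; | sort | cksum )
}

# Run on both nodes and compare the printed values:
#   dir_checksum /home/journal/nameservice1/current
```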

⑤ Fix permissions: the file owner and group must be restored to what they were before, because files transferred with scp take on the owner and group of the transfer user. Check the owner and group on the intact JournalNode, then set the same values on the repaired nodes.
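A sketch of this step; hdfs:hdfs is a typical CDH owner for the journal directory, but that is an assumption, so read the real owner off the healthy node first:

```shell
# On the healthy node: note the owner and group.
#   ls -ld /home/journal/nameservice1/current
# On each repaired node: apply the same owner:group recursively.
# The helper below just wraps the recursive chown.
fix_owner() {
  # $1 = owner:group copied from the healthy node, $2 = directory to fix
  chown -R "$1" "$2"
}

# Example (assuming the healthy node showed hdfs:hdfs):
#   fix_owner hdfs:hdfs /home/journal/nameservice1/current
```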
⑥ Restart the services. Everything should now come back up.
