Error: recoverUnfinalizedSegments failed for required journal

Reposted from: https://blog.csdn.net/dudefu011/article/details/78463207#

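1. Problem Description
After HA was configured as planned and the cluster was started, the NameNode would not stay up. Right after startup, jps showed the NameNode process, but a minute or two later it was gone.
Further testing revealed two cases:
1) Starting the JournalNodes first and then starting HDFS: the NameNode starts and runs normally.
2) Starting everything with start-dfs.sh: all the services come up, but the NameNode exits after about two minutes. Starting it again on its own with hadoop-daemon.sh start namenode succeeds, and the NameNode then runs stably.

Now look at the NameNode log. Don't be put off by its length; all the clues to the error are in there: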
2016-03-09 10:50:27,123 INFO org.apache.hadoop.hdfs.server.namenode.NameNode: STARTUP_MSG:
/************************************************************
STARTUP_MSG: Starting NameNode
STARTUP_MSG: host = node1/192.168.56.201
STARTUP_MSG: args = []
STARTUP_MSG: version = 2.5.1
STARTUP_MSG: build = Unknown -r Unknown; compiled by 'root' on 2014-10-20T05:53Z
STARTUP_MSG: java = 1.7.0_09
************************************************************/
2016-03-09 10:50:27,132 INFO org.apache.hadoop.hdfs.server.namenode.NameNode: registered UNIX signal handlers for [TERM, HUP, INT]
2016-03-09 10:50:27,138 INFO org.apache.hadoop.hdfs.server.namenode.NameNode: createNameNode []
2016-03-09 10:50:27,465 INFO org.apache.hadoop.metrics2.impl.MetricsConfig: loaded properties from hadoop-metrics2.properties
2016-03-09 10:50:27,623 INFO org.apache.hadoop.metrics2.impl.MetricsSystemImpl: Scheduled snapshot period at 10 second(s).
2016-03-09 10:50:27,623 INFO org.apache.hadoop.metrics2.impl.MetricsSystemImpl: NameNode metrics system started
2016-03-09 10:50:27,625 INFO org.apache.hadoop.hdfs.server.namenode.NameNode: fs.defaultFS is hdfs://hadoopha
2016-03-09 10:50:27,626 INFO org.apache.hadoop.hdfs.server.namenode.NameNode: Clients are to use hadoopha to access this namenode/service.
2016-03-09 10:50:28,048 INFO org.apache.hadoop.hdfs.DFSUtil: Starting web server as: ${dfs.web.authentication.kerberos.principal}
2016-03-09 10:50:28,048 INFO org.apache.hadoop.hdfs.DFSUtil: Starting Web-server for hdfs at: http://node1:50070
2016-03-09 10:50:28,121 INFO org.mortbay.log: Logging to org.slf4j.impl.Log4jLoggerAdapter(org.mortbay.log) via org.mortbay.log.Slf4jLog
2016-03-09 10:50:28,128 INFO org.apache.hadoop.http.HttpRequestLog: Http request log for http.requests.namenode is not defined
2016-03-09 10:50:28,145 INFO org.apache.hadoop.http.HttpServer2: Added global filter 'safety' (class=org.apache.hadoop.http.HttpServer2$QuotingInputFilter)
2016-03-09 10:50:28,149 INFO org.apache.hadoop.http.HttpServer2: Added filter static_user_filter (class=org.apache.hadoop.http.lib.StaticUserWebFilter$StaticUserFilter) to context hdfs
2016-03-09 10:50:28,149 INFO org.apache.hadoop.http.HttpServer2: Added filter static_user_filter (class=org.apache.hadoop.http.lib.StaticUserWebFilter$StaticUserFilter) to context static
2016-03-09 10:50:28,149 INFO org.apache.hadoop.http.HttpServer2: Added filter static_user_filter (class=org.apache.hadoop.http.lib.StaticUserWebFilter$StaticUserFilter) to context logs
2016-03-09 10:50:28,209 INFO org.apache.hadoop.http.HttpServer2: Added filter 'org.apache.hadoop.hdfs.web.AuthFilter' (class=org.apache.hadoop.hdfs.web.AuthFilter)
2016-03-09 10:50:28,211 INFO org.apache.hadoop.http.HttpServer2: addJerseyResourcePackage: packageName=org.apache.hadoop.hdfs.server.namenode.web.resources;org.apache.hadoop.hdfs.web.resources, pathSpec=/webhdfs/v1/*
2016-03-09 10:50:28,268 INFO org.apache.hadoop.http.HttpServer2: Jetty bound to port 50070
2016-03-09 10:50:28,269 INFO org.mortbay.log: jetty-6.1.26
2016-03-09 10:50:28,580 WARN org.apache.hadoop.security.authentication.server.AuthenticationFilter: 'signature.secret' configuration not set, using a random value as secret
2016-03-09 10:50:28,648 INFO org.mortbay.log: Started HttpServer2$SelectChannelConnectorWithSafeStartup@node1:50070
2016-03-09 10:50:28,687 WARN org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Only one image storage directory (dfs.namenode.name.dir) configured. Beware of data loss due to lack of redundant storage directories!
2016-03-09 10:50:28,741 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: fsLock is fair:true
2016-03-09 10:50:28,802 INFO org.apache.hadoop.hdfs.server.blockmanagement.DatanodeManager: dfs.block.invalidate.limit=1000
2016-03-09 10:50:28,802 INFO org.apache.hadoop.hdfs.server.blockmanagement.DatanodeManager: dfs.namenode.datanode.registration.ip-hostname-check=true
2016-03-09 10:50:28,805 INFO org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: dfs.namenode.startup.delay.block.deletion.sec is set to 000:00:00:00.000
2016-03-09 10:50:28,807 INFO org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: The block deletion will start around 2016 Mar 09 10:50:28
2016-03-09 10:50:28,810 INFO org.apache.hadoop.util.GSet: Computing capacity for map BlocksMap
2016-03-09 10:50:28,810 INFO org.apache.hadoop.util.GSet: VM type = 64-bit
2016-03-09 10:50:28,813 INFO org.apache.hadoop.util.GSet: 2.0% max memory 966.7 MB = 19.3 MB
2016-03-09 10:50:28,813 INFO org.apache.hadoop.util.GSet: capacity = 2^21 = 2097152 entries
2016-03-09 10:50:28,852 INFO org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: dfs.block.access.token.enable=false
2016-03-09 10:50:28,852 INFO org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: defaultReplication = 3
2016-03-09 10:50:28,852 INFO org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: maxReplication = 512
2016-03-09 10:50:28,852 INFO org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: minReplication = 1
2016-03-09 10:50:28,853 INFO org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: maxReplicationStreams = 2
2016-03-09 10:50:28,853 INFO org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: shouldCheckForEnoughRacks = false
2016-03-09 10:50:28,853 INFO org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: replicationRecheckInterval = 3000
2016-03-09 10:50:28,853 INFO org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: encryptDataTransfer = false
2016-03-09 10:50:28,853 INFO org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: maxNumBlocksToLog = 1000
2016-03-09 10:50:28,859 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: fsOwner = hadoop (auth:SIMPLE)
2016-03-09 10:50:28,859 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: supergroup = supergroup
2016-03-09 10:50:28,859 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: isPermissionEnabled = true
2016-03-09 10:50:28,865 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Determined nameservice ID: hadoopha
2016-03-09 10:50:28,865 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: HA Enabled: true
2016-03-09 10:50:28,866 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Append Enabled: true
2016-03-09 10:50:29,120 INFO org.apache.hadoop.util.GSet: Computing capacity for map INodeMap
2016-03-09 10:50:29,120 INFO org.apache.hadoop.util.GSet: VM type = 64-bit
2016-03-09 10:50:29,120 INFO org.apache.hadoop.util.GSet: 1.0% max memory 966.7 MB = 9.7 MB
2016-03-09 10:50:29,120 INFO org.apache.hadoop.util.GSet: capacity = 2^20 = 1048576 entries
2016-03-09 10:50:29,174 INFO org.apache.hadoop.hdfs.server.namenode.NameNode: Caching file names occuring more than 10 times
2016-03-09 10:50:29,186 INFO org.apache.hadoop.util.GSet: Computing capacity for map cachedBlocks
2016-03-09 10:50:29,186 INFO org.apache.hadoop.util.GSet: VM type = 64-bit
2016-03-09 10:50:29,186 INFO org.apache.hadoop.util.GSet: 0.25% max memory 966.7 MB = 2.4 MB
2016-03-09 10:50:29,186 INFO org.apache.hadoop.util.GSet: capacity = 2^18 = 262144 entries
2016-03-09 10:50:29,188 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: dfs.namenode.safemode.threshold-pct = 0.9990000128746033
2016-03-09 10:50:29,188 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: dfs.namenode.safemode.min.datanodes = 0
2016-03-09 10:50:29,188 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: dfs.namenode.safemode.extension = 30000
2016-03-09 10:50:29,190 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Retry cache on namenode is enabled
2016-03-09 10:50:29,190 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Retry cache will use 0.03 of total heap and retry cache entry expiry time is 600000 millis
2016-03-09 10:50:29,194 INFO org.apache.hadoop.util.GSet: Computing capacity for map NameNodeRetryCache
2016-03-09 10:50:29,194 INFO org.apache.hadoop.util.GSet: VM type = 64-bit
2016-03-09 10:50:29,194 INFO org.apache.hadoop.util.GSet: 0.029999999329447746% max memory 966.7 MB = 297.0 KB
2016-03-09 10:50:29,194 INFO org.apache.hadoop.util.GSet: capacity = 2^15 = 32768 entries
2016-03-09 10:50:29,199 INFO org.apache.hadoop.hdfs.server.namenode.NNConf: ACLs enabled? false
2016-03-09 10:50:29,199 INFO org.apache.hadoop.hdfs.server.namenode.NNConf: XAttrs enabled? true
2016-03-09 10:50:29,199 INFO org.apache.hadoop.hdfs.server.namenode.NNConf: Maximum size of an xattr: 16384
2016-03-09 10:50:29,208 INFO org.apache.hadoop.hdfs.server.common.Storage: Lock on /home/hadoop/hadoop/tmp/dfs/name/in_use.lock acquired by nodename 4394@node1
2016-03-09 10:50:29,610 WARN org.apache.hadoop.security.ssl.FileBasedKeyStoresFactory: The property 'ssl.client.truststore.location' has not been set, no TrustStore will be loaded
2016-03-09 10:50:31,053 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: node2/192.168.56.202:8485. Already tried 0 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2016-03-09 10:50:31,054 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: node3/192.168.56.203:8485. Already tried 0 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2016-03-09 10:50:31,054 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: node4/192.168.56.204:8485. Already tried 0 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2016-03-09 10:50:32,055 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: node2/192.168.56.202:8485. Already tried 1 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
(N repeated lines omitted here)
2016-03-09 10:50:35,807 INFO org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited 6001 ms (timeout=20000 ms) for a response for selectInputStreams. No responses yet.
(N repeated lines omitted here)
2016-03-09 10:50:39,812 INFO org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited 10006 ms (timeout=20000 ms) for a response for selectInputStreams. No responses yet.

2016-03-09 10:50:40,065 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: node3/192.168.56.203:8485. Already tried 9 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2016-03-09 10:50:40,065 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: node4/192.168.56.204:8485. Already tried 9 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2016-03-09 10:50:40,065 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: node2/192.168.56.202:8485. Already tried 9 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2016-03-09 10:50:40,069 WARN org.apache.hadoop.hdfs.server.namenode.FSEditLog: Unable to determine input streams from QJM to [192.168.56.202:8485, 192.168.56.203:8485, 192.168.56.204:8485]. Skipping.
org.apache.hadoop.hdfs.qjournal.client.QuorumException: Got too many exceptions to achieve quorum size 2/3. 3 exceptions thrown:
192.168.56.202:8485: Call From node1/192.168.56.201 to node2:8485 failed on connection exception: java.net.ConnectException: Connection refused; For more details see: http://wiki.apache.org/hadoop/ConnectionRefused
192.168.56.203:8485: Call From node1/192.168.56.201 to node3:8485 failed on connection exception: java.net.ConnectException: Connection refused; For more details see: http://wiki.apache.org/hadoop/ConnectionRefused
192.168.56.204:8485: Call From node1/192.168.56.201 to node4:8485 failed on connection exception: java.net.ConnectException: Connection refused; For more details see: http://wiki.apache.org/hadoop/ConnectionRefused
at org.apache.hadoop.hdfs.qjournal.client.QuorumException.create(QuorumException.java:81)
at org.apache.hadoop.hdfs.qjournal.client.QuorumCall.rethrowException(QuorumCall.java:223)
at org.apache.hadoop.hdfs.qjournal.client.AsyncLoggerSet.waitForWriteQuorum(AsyncLoggerSet.java:142)
at org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager.selectInputStreams(QuorumJournalManager.java:471)
at org.apache.hadoop.hdfs.server.namenode.JournalSet.selectInputStreams(JournalSet.java:260)
at org.apache.hadoop.hdfs.server.namenode.FSEditLog.selectInputStreams(FSEditLog.java:1430)
at org.apache.hadoop.hdfs.server.namenode.FSEditLog.selectInputStreams(FSEditLog.java:1450)
at org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSImage(FSImage.java:636)
at org.apache.hadoop.hdfs.server.namenode.FSImage.recoverTransitionRead(FSImage.java:279)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.loadFSImage(FSNamesystem.java:955)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.loadFromDisk(FSNamesystem.java:700)
at org.apache.hadoop.hdfs.server.namenode.NameNode.loadNamesystem(NameNode.java:529)
at org.apache.hadoop.hdfs.server.namenode.NameNode.initialize(NameNode.java:585)
at org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:751)
at org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:735)
at org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:1407)
at org.apache.hadoop.hdfs.server.namenode.NameNode.main(NameNode.java:1473)
2016-03-09 10:50:40,071 INFO org.apache.hadoop.hdfs.server.namenode.FSImage: No edit log streams selected.
2016-03-09 10:50:40,116 INFO org.apache.hadoop.hdfs.server.namenode.FSImageFormatPBINode: Loading 1 INodes.
2016-03-09 10:50:40,174 INFO org.apache.hadoop.hdfs.server.namenode.FSImageFormatProtobuf: Loaded FSImage in 0 seconds.
2016-03-09 10:50:40,174 INFO org.apache.hadoop.hdfs.server.namenode.FSImage: Loaded image for txid 0 from /home/hadoop/hadoop/tmp/dfs/name/current/fsimage_0000000000000000000
2016-03-09 10:50:40,184 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Need to save fs image? false (staleImage=true, haEnabled=true, isRollingUpgrade=false)
2016-03-09 10:50:40,185 INFO org.apache.hadoop.hdfs.server.namenode.NameCache: initialized with 0 entries 0 lookups
2016-03-09 10:50:40,185 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Finished loading FSImage in 10986 msecs
2016-03-09 10:50:40,408 INFO org.apache.hadoop.hdfs.server.namenode.NameNode: RPC server is binding to node1:8020
2016-03-09 10:50:40,414 INFO org.apache.hadoop.ipc.CallQueueManager: Using callQueue class java.util.concurrent.LinkedBlockingQueue
2016-03-09 10:50:40,429 INFO org.apache.hadoop.ipc.Server: Starting Socket Reader #1 for port 8020
2016-03-09 10:50:40,461 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Registered FSNamesystemState MBean
2016-03-09 10:50:40,474 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Number of blocks under construction: 0
2016-03-09 10:50:40,474 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Number of blocks under construction: 0
2016-03-09 10:50:40,475 INFO org.apache.hadoop.hdfs.StateChange: STATE* Leaving safe mode after 11 secs
2016-03-09 10:50:40,475 INFO org.apache.hadoop.hdfs.StateChange: STATE* Network topology has 0 racks and 0 datanodes
2016-03-09 10:50:40,475 INFO org.apache.hadoop.hdfs.StateChange: STATE* UnderReplicatedBlocks has 0 blocks
2016-03-09 10:50:40,536 INFO org.apache.hadoop.ipc.Server: IPC Server Responder: starting
2016-03-09 10:50:40,539 INFO org.apache.hadoop.ipc.Server: IPC Server listener on 8020: starting
2016-03-09 10:50:40,542 INFO org.apache.hadoop.hdfs.server.namenode.NameNode: NameNode RPC up at: node1/192.168.56.201:8020
2016-03-09 10:50:40,542 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Starting services required for standby state
2016-03-09 10:50:40,545 INFO org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer: Will roll logs on active node at node5/192.168.56.205:8020 every 120 seconds.
2016-03-09 10:50:40,550 INFO org.apache.hadoop.hdfs.server.namenode.ha.StandbyCheckpointer: Starting standby checkpoint thread…
Checkpointing active NN at http://node5:50070
Serving checkpoints at http://node1:50070
2016-03-09 10:50:41,551 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: node2/192.168.56.202:8485. Already tried 0 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
(N repeated lines omitted here)
2016-03-09 10:50:50,557 INFO org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited 10007 ms (timeout=20000 ms) for a response for selectInputStreams. No responses yet.

2016-03-09 10:50:50,561 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: node3/192.168.56.203:8485. Already tried 9 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2016-03-09 10:50:50,626 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: node4/192.168.56.204:8485. Already tried 9 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2016-03-09 10:50:50,676 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: node2/192.168.56.202:8485. Already tried 9 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2016-03-09 10:50:50,677 WARN org.apache.hadoop.hdfs.server.namenode.FSEditLog: Unable to determine input streams from QJM to [192.168.56.202:8485, 192.168.56.203:8485, 192.168.56.204:8485]. Skipping.
org.apache.hadoop.hdfs.qjournal.client.QuorumException: Got too many exceptions to achieve quorum size 2/3. 3 exceptions thrown:
192.168.56.202:8485: Call From node1/192.168.56.201 to node2:8485 failed on connection exception: java.net.ConnectException: Connection refused; For more details see: http://wiki.apache.org/hadoop/ConnectionRefused
192.168.56.203:8485: Call From node1/192.168.56.201 to node3:8485 failed on connection exception: java.net.ConnectException: Connection refused; For more details see: http://wiki.apache.org/hadoop/ConnectionRefused
192.168.56.204:8485: Call From node1/192.168.56.201 to node4:8485 failed on connection exception: java.net.ConnectException: Connection refused; For more details see: http://wiki.apache.org/hadoop/ConnectionRefused
at org.apache.hadoop.hdfs.qjournal.client.QuorumException.create(QuorumException.java:81)
at org.apache.hadoop.hdfs.qjournal.client.QuorumCall.rethrowException(QuorumCall.java:223)
at org.apache.hadoop.hdfs.qjournal.client.AsyncLoggerSet.waitForWriteQuorum(AsyncLoggerSet.java:142)
at org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager.selectInputStreams(QuorumJournalManager.java:471)
at org.apache.hadoop.hdfs.server.namenode.JournalSet.selectInputStreams(JournalSet.java:260)
at org.apache.hadoop.hdfs.server.namenode.FSEditLog.selectInputStreams(FSEditLog.java:1430)
at org.apache.hadoop.hdfs.server.namenode.FSEditLog.selectInputStreams(FSEditLog.java:1450)
at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer.doTailEdits(EditLogTailer.java:212)
at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.doWork(EditLogTailer.java:324)
at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.access$200(EditLogTailer.java:282)
at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread$1.run(EditLogTailer.java:299)
at org.apache.hadoop.security.SecurityUtil.doAsLoginUserOrFatal(SecurityUtil.java:411)
at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.run(EditLogTailer.java:295)
2016-03-09 10:50:50,677 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Stopping services started for standby state
2016-03-09 10:50:50,678 WARN org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer: Edit log tailer interrupted
java.lang.InterruptedException: sleep interrupted
at java.lang.Thread.sleep(Native Method)
at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.doWork(EditLogTailer.java:337)
at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.access$200(EditLogTailer.java:282)
at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread$1.run(EditLogTailer.java:299)
at org.apache.hadoop.security.SecurityUtil.doAsLoginUserOrFatal(SecurityUtil.java:411)
at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.run(EditLogTailer.java:295)
2016-03-09 10:50:50,682 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Starting services required for active state
2016-03-09 10:50:50,684 WARN org.apache.hadoop.security.ssl.FileBasedKeyStoresFactory: The property 'ssl.client.truststore.location' has not been set, no TrustStore will be loaded
2016-03-09 10:50:50,690 INFO org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Starting recovery process for unclosed journal segments…
2016-03-09 10:50:51,698 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: node3/192.168.56.203:8485. Already tried 0 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
(N repeated lines omitted here)
2016-03-09 10:51:00,715 FATAL org.apache.hadoop.hdfs.server.namenode.FSEditLog: Error: recoverUnfinalizedSegments failed for required journal (JournalAndStream(mgr=QJM to [192.168.56.202:8485, 192.168.56.203:8485, 192.168.56.204:8485], stream=null))
org.apache.hadoop.hdfs.qjournal.client.QuorumException: Got too many exceptions to achieve quorum size 2/3. 3 exceptions thrown:
192.168.56.203:8485: Call From node1/192.168.56.201 to node3:8485 failed on connection exception: java.net.ConnectException: Connection refused; For more details see: http://wiki.apache.org/hadoop/ConnectionRefused
192.168.56.202:8485: Call From node1/192.168.56.201 to node2:8485 failed on connection exception: java.net.ConnectException: Connection refused; For more details see: http://wiki.apache.org/hadoop/ConnectionRefused
192.168.56.204:8485: Call From node1/192.168.56.201 to node4:8485 failed on connection exception: java.net.ConnectException: Connection refused; For more details see: http://wiki.apache.org/hadoop/ConnectionRefused
at org.apache.hadoop.hdfs.qjournal.client.QuorumException.create(QuorumException.java:81)
at org.apache.hadoop.hdfs.qjournal.client.QuorumCall.rethrowException(QuorumCall.java:223)
at org.apache.hadoop.hdfs.qjournal.client.AsyncLoggerSet.waitForWriteQuorum(AsyncLoggerSet.java:142)
at org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager.createNewUniqueEpoch(QuorumJournalManager.java:182)
at org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager.recoverUnfinalizedSegments(QuorumJournalManager.java:436)
at org.apache.hadoop.hdfs.server.namenode.JournalSet$7.apply(JournalSet.java:590)
at org.apache.hadoop.hdfs.server.namenode.JournalSet.mapJournalsAndReportErrors(JournalSet.java:359)
at org.apache.hadoop.hdfs.server.namenode.JournalSet.recoverUnfinalizedSegments(JournalSet.java:587)
at org.apache.hadoop.hdfs.server.namenode.FSEditLog.recoverUnclosedStreams(FSEditLog.java:1361)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startActiveServices(FSNamesystem.java:1068)
at org.apache.hadoop.hdfs.server.namenode.NameNode$NameNodeHAContext.startActiveServices(NameNode.java:1624)
at org.apache.hadoop.hdfs.server.namenode.ha.ActiveState.enterState(ActiveState.java:61)
at org.apache.hadoop.hdfs.server.namenode.ha.HAState.setStateInternal(HAState.java:63)
at org.apache.hadoop.hdfs.server.namenode.ha.StandbyState.setState(StandbyState.java:49)
at org.apache.hadoop.hdfs.server.namenode.NameNode.transitionToActive(NameNode.java:1502)
at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.transitionToActive(NameNodeRpcServer.java:1197)
at org.apache.hadoop.ha.protocolPB.HAServiceProtocolServerSideTranslatorPB.transitionToActive(HAServiceProtocolServerSideTranslatorPB.java:107)
at org.apache.hadoop.ha.proto.HAServiceProtocolProtos$HAServiceProtocolService$2.callBlockingMethod(HAServiceProtocolProtos.java:4460)
at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:585)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:928)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2013)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2009)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1614)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2007)
2016-03-09 10:51:00,717 INFO org.apache.hadoop.util.ExitUtil: Exiting with status 1
2016-03-09 10:51:00,718 INFO org.apache.hadoop.hdfs.server.namenode.NameNode: SHUTDOWN_MSG:
/************************************************************
SHUTDOWN_MSG: Shutting down NameNode at node1/192.168.56.201
************************************************************/

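2. Problem Analysis
The log is long, so let's break it down, focusing on the key lines (highlighted in color in the original post).
It is clear that the NameNode fails not because of a misconfiguration, but because it cannot connect to the JournalNodes.
The JournalNode logs show no problems, so the issue lies with the NameNode, which acts as the JournalNodes' client.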
2016-03-09 10:50:31,053 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: node2/192.168.56.202:8485. Already tried 0 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
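Consider what this log line says:
The NameNode, acting as a JournalNode client, issued a connection request and it failed. It then tried the other JournalNodes in turn, and those requests failed too, until the maximum number of retries was reached.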
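To make the size of that retry window concrete, here is a minimal sketch that pulls the retry policy out of the log line quoted above and multiplies the two numbers. The helper name `retry_budget_ms` is hypothetical, and treating `maxRetries * sleepTime` as the total window is an approximation that ignores the time spent on each connection attempt.

```python
import re

# Matches the "Retrying connect to server" lines that appear in the NameNode log above.
RETRY_POLICY = re.compile(
    r"Retrying connect to server: (?P<server>\S+)\. "
    r"Already tried (?P<tried>\d+) time\(s\); retry policy is "
    r"RetryUpToMaximumCountWithFixedSleep\(maxRetries=(?P<max>\d+), "
    r"sleepTime=(?P<sleep>\d+) MILLISECONDS\)"
)


def retry_budget_ms(log_line):
    """Return (server, attempts so far, approximate total retry window in ms)."""
    m = RETRY_POLICY.search(log_line)
    if m is None:
        raise ValueError("not a connect-retry log line")
    # Approximation: fixed sleep between attempts, connection time ignored.
    budget = int(m.group("max")) * int(m.group("sleep"))
    return m.group("server"), int(m.group("tried")), budget


line = (
    "2016-03-09 10:50:31,053 INFO org.apache.hadoop.ipc.Client: "
    "Retrying connect to server: node2/192.168.56.202:8485. Already tried 0 time(s); "
    "retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, "
    "sleepTime=1000 MILLISECONDS)"
)
print(retry_budget_ms(line))  # ('node2/192.168.56.202:8485', 0, 10000)
```

With the policy shown in the log, each round of connection attempts gives up after roughly 10 seconds, which matches the timestamps between the first retry and the QuorumException.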

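The experiments show that starting the JournalNodes first, or starting the NameNode a second time, makes the problem go away. This indicates that the JournalNodes were not yet ready when the NameNode had already used up all of its retries.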
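One way to confirm this diagnosis is to check whether the JournalNode RPC ports accept connections before the NameNode is started. The following is a rough sketch, not part of Hadoop: the function names `port_open` and `wait_for_quorum`, the 120-second deadline, and the polling interval are all assumptions chosen for illustration. The majority rule mirrors the "quorum size 2/3" wording in the log.

```python
import socket
import time


def port_open(host, port, timeout=1.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def wait_for_quorum(endpoints, deadline_s=120.0, poll_s=2.0):
    """Poll until a majority of endpoints accept TCP connections, or give up."""
    needed = len(endpoints) // 2 + 1  # e.g. 2 of 3 JournalNodes
    stop_at = time.monotonic() + deadline_s
    while True:
        reachable = sum(port_open(host, port) for host, port in endpoints)
        if reachable >= needed:
            return True
        if time.monotonic() >= stop_at:
            return False
        time.sleep(poll_s)
```

On the cluster from the log, this could be called as `wait_for_quorum([("node2", 8485), ("node3", 8485), ("node4", 8485)])` before running hadoop-daemon.sh start namenode. An open port only shows that the JournalNode process is listening, so this is a quick check rather than a full health probe.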

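3. Solution
Modify the IPC parameters in core-site.xml:

<property>
  <name>ipc.client.connect.max.retries</name>
  <value>100</value>
  <description>Indicates the number of retries a client will make to establish
  a server connection.</description>
</property>

<property>
  <name>ipc.client.connect.retry.interval</name>
  <value>10000</value>
  <description>Indicates the number of milliseconds a client will wait for
  before retrying to establish a server connection.</description>
</property>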

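These two parameters set the retry count and the retry interval for the IPC connection requests that the NameNode makes to the JournalNodes. On my virtual-machine cluster it takes about 2 minutes before the NameNode can connect to the JournalNodes. Once connected, it is stable.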
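As a sanity check, the retry window implied by these settings can be estimated straight from the configuration file. The sketch below is illustrative only: `ipc_retry_budget_s` is a hypothetical helper, the embedded XML is a stand-in for a real core-site.xml, the fallback values of 10 retries and 1000 ms are the ones visible in the log's retry policy, and the product ignores the time each connection attempt itself takes.

```python
import xml.etree.ElementTree as ET

# Stand-in for the relevant part of core-site.xml (values from this article).
SAMPLE_CORE_SITE = """<configuration>
  <property>
    <name>ipc.client.connect.max.retries</name>
    <value>100</value>
  </property>
  <property>
    <name>ipc.client.connect.retry.interval</name>
    <value>10000</value>
  </property>
</configuration>"""


def ipc_retry_budget_s(core_site_xml, default_retries=10, default_interval_ms=1000):
    """Approximate the worst-case connect retry window, in seconds."""
    props = {
        prop.findtext("name"): prop.findtext("value")
        for prop in ET.fromstring(core_site_xml).iter("property")
    }
    retries = int(props.get("ipc.client.connect.max.retries", default_retries))
    interval_ms = int(props.get("ipc.client.connect.retry.interval", default_interval_ms))
    return retries * interval_ms / 1000.0


print(ipc_retry_budget_s("<configuration/>"))  # values seen in the log: 10.0
print(ipc_retry_budget_s(SAMPLE_CORE_SITE))    # tuned values: 1000.0
```

Roughly 10 seconds is far short of the 2 minutes the JournalNodes needed here, while about 1000 seconds leaves a wide margin. Smaller values that still cover the observed startup time would presumably work as well.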

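Note: adjusting the IPC parameters in core-site.xml only helps with connection timeouts caused by a service that has not finished starting yet. If the target service itself failed to start, adjusting these parameters has no effect.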
