Too Many WAL Files in HBase Causing ZooKeeper Errors

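Recently HBase began throwing a large number of KeeperErrorCode = ConnectionLoss for /hbase/splitWAL exceptions, and it could not be started again after a restart. Careful diagnosis showed that HBase had accumulated an enormous volume of WAL files (about 30 TB). As a result, the HBase znode in ZooKeeper that stores WAL file information grew beyond the default 4096*1024-byte limit and could no longer be served, so the HBase master failed to start. Raising the ZooKeeper buffer size parameter and cleaning up the WAL files eventually resolved the problem.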

Symptoms
The logs report that HBase cannot connect to the /hbase/splitWAL znode in ZooKeeper:

2019-12-03 05:48:05,797 ERROR [SplitLogWorker-HDPC238160:60020] zookeeper.RecoverableZooKeeper: ZooKeeper getChildren failed after 4 attempts
2019-12-03 05:48:05,798 WARN [SplitLogWorker-HDPC238160:60020] zookeeper.ZKUtil: regionserver:60020-0x16bdfc9dd27ac74, quorum=HDPC238162:2181,HDPC238160:2181,HDPC238161:2181, baseZNode=/hbase Unable to list children of znode /hbase/splitWAL
org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /hbase/splitWAL
at org.apache.zookeeper.KeeperException.create(KeeperException.java:99)
at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
at org.apache.zookeeper.ZooKeeper.getChildren(ZooKeeper.java:1472)
at org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper.getChildren(RecoverableZooKeeper.java:296)
at org.apache.hadoop.hbase.zookeeper.ZKUtil.listChildrenAndWatchForNewChildren(ZKUtil.java:518)
2019-12-03 05:48:05,798 ERROR [SplitLogWorker-HDPC238160:60020] zookeeper.ZooKeeperWatcher: regionserver:60020-0x16bdfc9dd27ac74, quorum=HDPC238161:2181,HDPC238160:2181,HDPC238162:2181, baseZNode=/hbase Received unexpected KeeperException, re-throwing exception
org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /hbase/splitWAL
at org.apache.zookeeper.KeeperException.create(KeeperException.java:99)
at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
at org.apache.zookeeper.ZooKeeper.getChildren(ZooKeeper.java:1472)
at org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper.getChildren(RecoverableZooKeeper.java:296)
at org.apache.hadoop.hbase.zookeeper.ZKUtil.listChildrenAndWatchForNewChildren(ZKUtil.java:518)
org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /hbase/splitWAL
at org.apache.zookeeper.KeeperException.create(KeeperException.java:99)
at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
at org.apache.zookeeper.ZooKeeper.getChildren(ZooKeeper.java:1472)
at org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper.getChildren(RecoverableZooKeeper.java:296)
at org.apache.hadoop.hbase.zookeeper.ZKUtil.listChildrenAndWatchForNewChildren(ZKUtil.java:518)

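Analysis and resolution
Judging from the errors above, the HBase master fails to start because the znode that HBase keeps in ZooKeeper for WAL file information has grown past a size limit, so the master cannot read it and startup fails. Normally HBase does not retain many WALs, and the znode should stay well below 4 MB. First, check the total size of the HBase WAL files.

The WAL files total more than 30 TB, which is clearly abnormal: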

[hadoop@HDPC238160 logs]$ hadoop fs -count /hbase/WALs

      70       244915     31591336406185 /hbase/WALs
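The columns printed by hadoop fs -count are DIR_COUNT, FILE_COUNT, CONTENT_SIZE (in bytes) and PATHNAME. As a small sketch, the helper below (summarize_count is a made-up name, not part of Hadoop) turns that line into labeled values and converts the byte count to decimal TB and binary TiB:

```shell
#!/usr/bin/env bash
# Summarize `hadoop fs -count` output read from stdin.
# Expected columns: DIR_COUNT FILE_COUNT CONTENT_SIZE PATHNAME
# summarize_count is a hypothetical helper name used only for this sketch.
summarize_count() {
  awk '{
    # 1e12 bytes per TB (decimal), 1024^4 bytes per TiB (binary)
    printf "dirs=%d files=%d size_tb=%.2f size_tib=%.2f path=%s\n",
           $1, $2, $3 / 1e12, $3 / (1024 ^ 4), $4
  }'
}

# On a live cluster (not run here):
#   hadoop fs -count /hbase/WALs | summarize_count
# Here the line from the article is fed in directly.
echo "      70       244915     31591336406185 /hbase/WALs" | summarize_count
```

Read this way, the output above means 70 directories and 244,915 WAL files under /hbase/WALs, which is about 31.6 TB.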

A single ZooKeeper node cannot exceed 4096*1024 bytes (4,194,304), and the observed size of 10980003 bytes is already over that threshold.

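Based on the analysis above, we can be fairly confident that HBase has too many WAL files, which pushes the znode /hbase/splitWAL to nearly 10 MB, past the 4 MB limit. The HBase master therefore cannot read /hbase/splitWAL and fails to start.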
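The comparison behind this conclusion is simple arithmetic. A minimal sketch, using the two numbers quoted in the article (10980003 bytes observed, 4096*1024 bytes allowed); check_buffer is a made-up helper name:

```shell
#!/usr/bin/env bash
# Compare an observed payload size against a buffer limit, both in bytes.
# The default limit is the 4096*1024 figure quoted in the text.
# check_buffer is a hypothetical helper name used only for this sketch.
check_buffer() {
  local observed=$1
  local limit=${2:-$((4096 * 1024))}
  if [ "$observed" -gt "$limit" ]; then
    echo "exceeds limit by $((observed - limit)) bytes"
  else
    echo "within limit"
  fi
}

check_buffer 10980003
```

Passing a second argument checks the same size against a different limit, for example check_buffer 10980003 104857600 for a 100 MB buffer.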

Fix
In hbase-env.sh, raise the jute.maxbuffer parameter to 100 MB:

export HBASE_REGIONSERVER_OPTS="$HBASE_REGIONSERVER_OPTS -Djute.maxbuffer=104857600"

Finally, restart HBase for the change to take effect.
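After the restart it can be worth confirming that the flag really ended up in the JVM options and that its value covers the size seen earlier. The following is only a sketch: jute_maxbuffer_from_opts is a made-up helper, and the -Xmx8g in the sample string is a placeholder rather than a value from this cluster.

```shell
#!/usr/bin/env bash
# Print the value of -Djute.maxbuffer found in an options string, if any.
# jute_maxbuffer_from_opts is a hypothetical helper name used only for this sketch.
jute_maxbuffer_from_opts() {
  printf '%s\n' "$1" \
    | tr ' ' '\n' \
    | sed -n 's/^-Djute\.maxbuffer=\([0-9][0-9]*\)$/\1/p' \
    | tail -n 1
}

# Sample options string; -Xmx8g is an assumed placeholder.
opts="-Xmx8g -Djute.maxbuffer=104857600"
value=$(jute_maxbuffer_from_opts "$opts")

# 10980003 is the size reported in the article.
if [ -n "$value" ] && [ "$value" -gt 10980003 ]; then
  echo "jute.maxbuffer=$value covers the observed 10980003 bytes"
else
  echo "jute.maxbuffer is missing or too small"
fi
```

On a real host the options string could come from the running process command line instead of a hard-coded variable.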

Follow-up
The root cause of why HBase accumulated so many WAL files still needs to be identified, which requires further analysis.
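One possible starting point for that analysis is to see which directories under /hbase/WALs hold the most files. A rough sketch, assuming per-directory counts are obtained with hadoop fs -count on a glob; top_wal_dirs is a made-up helper, and the rs-a, rs-b, rs-c rows are invented sample data, not output from this cluster:

```shell
#!/usr/bin/env bash
# Rank `hadoop fs -count` rows (read from stdin) by file count, largest first.
# $1: number of rows to keep (default 5).
# top_wal_dirs is a hypothetical helper name used only for this sketch.
top_wal_dirs() {
  local n=${1:-5}
  awk '{ print $2, $3, $4 }' \
    | sort -k1,1nr \
    | head -n "$n" \
    | awk '{ printf "%s files=%s bytes=%s\n", $3, $1, $2 }'
}

# On a live cluster (not run here):
#   hadoop fs -count '/hbase/WALs/*' | top_wal_dirs 10
# Invented sample rows, for illustration only:
printf '%s\n' \
  "1 120 1000 /hbase/WALs/rs-a" \
  "1 90000 5000000 /hbase/WALs/rs-b" \
  "1 300 2000 /hbase/WALs/rs-c" \
  | top_wal_dirs 2
```

A directory that dominates the ranking would point at a specific region server whose WALs are not being archived or cleaned up.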
