Environment:
Hadoop 2.7.2 + HBase 1.2.2 + ZooKeeper 3.4.10
11 servers, 1 master and 10 workers; base spec per node: 128 GB RAM, 2 CPUs with 12 cores / 48 threads
Every server runs HDFS (all 11), HBase (all 11), ZooKeeper (11 nodes, partly sharing cluster resources), and YARN (all 11, hosting MR and Spark jobs), plus some multi-threaded business programs
Problem description:
After starting, HBase kept logging errors and the process eventually aborted.
RegionServer log:
java.io.EOFException: Premature EOF: no length prefix available
java.io.IOException: Broken pipe
[RS_OPEN_META-hd16:16020-0-MetaLogRoller] wal.ProtobufLogWriter: Failed to write trailer, non-fatal, continuing...
java.io.IOException: Failed to replace a bad datanode on the existing pipeline due to no more good datanodes being available to try. (Nodes: current=[192.168.10.38:50010, 192.168.10.48:50010], original=[192.168.10.38:50010, 192.168.10.48:50010]). The current failed datanode replacement policy is DEFAULT, and a client may configure this via 'dfs.client.block.write.replace-datanode-on-failure.policy' in its configuration.
ERROR [main] regionserver.HRegionServerCommandLine: Region server exiting
java.lang.RuntimeException: HRegionServer Aborted
DataNode log:
java.io.EOFException: Premature EOF: no length prefix available
java.io.IOException: Broken pipe
java.io.InterruptedIOException: Interrupted while waiting for IO on channel java.nio.channels.SocketChannel[connected local=/192.168.10.52:50010 remote=/192.168.10.48:48482]. 60000 millis timeout left.
java.io.IOException: Premature EOF from inputStream
java.net.SocketTimeoutException: 60000 millis timeout while waiting for channel to be ready for read. ch
Problem analysis:
1. Suspected socket timeouts, so the timeout values were increased (hdfs-site.xml; both values are in milliseconds):
<property>
  <name>dfs.client.socket-timeout</name>
  <value>6000000</value>
</property>
<property>
  <name>dfs.datanode.socket.write.timeout</name>
  <value>6000000</value>
</property>
A restart did not help; the errors continued.
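One quick thing worth ruling out here (my addition, not part of the original troubleshooting): confirming that the new values are actually picked up, since hdfs-site.xml must also be on the classpath of the HDFS clients (including HBase). `hdfs getconf` prints the effective value of a key as seen from the local configuration:

```shell
# Print the effective client-side values on this host
hdfs getconf -confKey dfs.client.socket-timeout
hdfs getconf -confKey dfs.datanode.socket.write.timeout
```

At 6000000 ms (100 minutes) these timeouts are generous enough that, if they still expire, the pipeline is stalled rather than merely slow.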
2. Suspected badly tuned HBase JVM settings, causing overly frequent and long Full GCs and hence the timeouts. After adjusting the JVM options, Full GCs became noticeably rarer and pause times dropped from tens of seconds to a few seconds:
export HBASE_MASTER_OPTS="-Xmx2000m -Xms2000m -Xmn750m -XX:+UseParNewGC -XX:+UseConcMarkSweepGC -XX:CMSInitiatingOccupancyFraction=70"
export HBASE_REGIONSERVER_OPTS="-Xmx12800m -Xms12800m -Xmn1000m -XX:+UseParNewGC -XX:+UseConcMarkSweepGC -XX:CMSInitiatingOccupancyFraction=70 -XX:+UseCMSInitiatingOccupancyOnly"
A restart did not help; the errors continued.
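To confirm whether Full GC pauses really line up with the timeouts, GC logging can be enabled alongside the options above. A sketch for hbase-env.sh; these are standard HotSpot (JDK 7/8) flags, and the log path is my example, not from the original setup:

```shell
# hbase-env.sh -- append GC logging to the RegionServer options
export HBASE_REGIONSERVER_OPTS="$HBASE_REGIONSERVER_OPTS \
  -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps \
  -XX:+PrintGCApplicationStoppedTime \
  -Xloggc:/var/log/hbase/regionserver-gc.log"
```

Pauses in the GC log can then be matched against the timestamps of the EOF/broken-pipe errors to verify (or rule out) GC as the cause.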
3. Increased the ZooKeeper session timeout and configured HBase to restart rather than abort when the session expires:
<property>
  <name>zookeeper.session.timeout</name>
  <value>600000</value>
  <description>ZooKeeper session timeout.
    HBase passes this to the zk quorum as the suggested maximum time for a
    session. See http://hadoop.apache.org/zooke ... sions
    "The client sends a requested timeout, the server responds with the
    timeout that it can give the client. The current implementation
    requires that the timeout be a minimum of 2 times the tickTime
    (as set in the server configuration) and a maximum of 20 times
    the tickTime." Set the zk ticktime with hbase.zookeeper.property.tickTime.
    In milliseconds.
  </description>
</property>
<property>
  <name>hbase.regionserver.restart.on.zk.expire</name>
  <value>true</value>
  <description>
    A ZooKeeper session expiry normally forces the regionserver to exit.
    Enabling this makes the regionserver restart instead.
  </description>
</property>
After a restart the errors continued, only taking slightly longer to appear, so the root cause was clearly not here.
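The quoted description also explains why raising zookeeper.session.timeout alone may not take effect: the ZooKeeper server clamps the requested value to between 2x and 20x its tickTime. A minimal sketch of that clamping arithmetic (the function name is mine, not a ZooKeeper API):

```python
def effective_session_timeout(requested_ms, tick_ms):
    """ZooKeeper clamps a client's requested session timeout
    to the range [2 * tickTime, 20 * tickTime]."""
    return min(max(requested_ms, 2 * tick_ms), 20 * tick_ms)

# With ZooKeeper's default tickTime of 2000 ms, requesting 600000 ms
# only yields a 40000 ms session:
print(effective_session_timeout(600000, 2000))   # 40000
# tickTime must be at least 30000 ms for 600000 ms to be honored:
print(effective_session_timeout(600000, 30000))  # 600000
```

So for the 600000 ms setting above to actually apply, tickTime needs to be at least 30000 ms, set via hbase.zookeeper.property.tickTime when HBase manages ZooKeeper, or in zoo.cfg for a standalone ensemble like the 3.4.10 one here.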
4. After those two changes things stalled for a while. Further analysis showed the cluster's YARN load was heavy (terabyte-scale MR and Spark jobs run routinely), driving disk IO very high and causing response timeouts. Modified the HDFS configuration (hdfs-site.xml):
<property>
  <name>dfs.client.block.write.replace-datanode-on-failure.enable</name>
  <value>true</value>
</property>
<property>
  <name>dfs.client.block.write.replace-datanode-on-failure.policy</name>
  <value>ALWAYS</value>
</property>
After a restart the errors continued, only lasting slightly longer before reappearing, so the root cause was clearly not here either.
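A side note I would add here: ALWAYS is the strictest policy, so it aborts the write whenever no replacement datanode can be found, which on a small or heavily loaded cluster can make pipeline failures worse rather than better. Hadoop 2.7 also ships a best-effort switch that lets the client keep writing on the shrunken pipeline instead of failing (check the hdfs-default.xml of your exact 2.7.x build to confirm it is present):

```xml
<property>
  <name>dfs.client.block.write.replace-datanode-on-failure.best-effort</name>
  <value>true</value>
</property>
```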
5. Finally concluded the cluster was simply overloaded, IO pressure above all, and tried giving the DataNodes more memory and more connections:
<property>
  <name>dfs.datanode.max.transfer.threads</name>
  <value>8192</value>
</property>
Note: I originally set this to 16384 (some posts online claim the valid range is [1, 8192]). I could not verify that, so to be safe I went back to 8192; if anyone knows the definitive answer, please leave a comment.
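Related to this setting (my addition): each transfer thread holds file descriptors, so the open-file limit of the user running the DataNode should comfortably exceed dfs.datanode.max.transfer.threads. A quick check, with the usual Linux mechanism for raising the limit; "hadoop" is an assumed service account, adjust to your deployment:

```shell
# Check the open-file limit for the user running the DataNode
ulimit -n

# Raise it persistently via /etc/security/limits.conf
# (assumes pam_limits is enabled), e.g.:
#   hadoop  soft  nofile  65536
#   hadoop  hard  nofile  65536
```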
Also increased the DataNode memory allocation (hadoop-env.sh):
export HADOOP_HEAPSIZE=16384  # previously 8192
After this restart, HBase ran normally and the node drop-outs did not recur.