An Outage in a RocketMQ Cluster Caused by Transient Node Failures

1. Problem description

An alert came in from the alerting platform: the RocketMQ cluster was abnormal.

CPU usage on the RocketMQ nodes spiked.


Network packet loss hit 100%.


Disk I/O also spiked.


Memory and interrupt metrics were abnormal as well.



2. Inspecting the broker logs

2.1 The GC logs show nothing abnormal

$ cd /dev/shm/

2020-02-23T23:02:11.864+0800: 11683510.551: Total time for which application threads were stopped: 0.0090361 seconds, Stopping threads took: 0.0002291 seconds

2020-02-23T23:03:36.815+0800: 11683595.502: [GC pause (G1 Evacuation Pause) (young) 11683595.502: [G1Ergonomics (CSet Construction) start choosing CSet, _pending_cards: 7992, predicted base time: 5.44 ms, remaining time: 194.56 ms, target pause time: 200.00 ms]

 11683595.502: [G1Ergonomics (CSet Construction) add young regions to CSet, eden: 255 regions, survivors: 1 regions, predicted young region time: 1.24 ms]

 11683595.502: [G1Ergonomics (CSet Construction) finish choosing CSet, eden: 255 regions, survivors: 1 regions, old: 0 regions, predicted pause time: 6.68 ms, target pause time: 200.00 ms]

, 0.0080468 secs]

   [Parallel Time: 4.1 ms, GC Workers: 23]

      [GC Worker Start (ms): Min: 11683595502.0, Avg: 11683595502.3, Max: 11683595502.5, Diff: 0.5]

      [Ext Root Scanning (ms): Min: 0.6, Avg: 0.9, Max: 1.5, Diff: 0.9, Sum: 20.5]

      [Update RS (ms): Min: 0.0, Avg: 1.0, Max: 2.5, Diff: 2.5, Sum: 23.7]

         [Processed Buffers: Min: 0, Avg: 11.8, Max: 35, Diff: 35, Sum: 271]

      [Scan RS (ms): Min: 0.0, Avg: 0.0, Max: 0.1, Diff: 0.1, Sum: 0.7]

      [Code Root Scanning (ms): Min: 0.0, Avg: 0.0, Max: 0.0, Diff: 0.0, Sum: 0.2]

      [Object Copy (ms): Min: 0.0, Avg: 1.1, Max: 1.8, Diff: 1.8, Sum: 26.4]

      [Termination (ms): Min: 0.0, Avg: 0.2, Max: 0.3, Diff: 0.3, Sum: 5.4]

         [Termination Attempts: Min: 1, Avg: 3.5, Max: 7, Diff: 6, Sum: 80]

      [GC Worker Other (ms): Min: 0.0, Avg: 0.0, Max: 0.1, Diff: 0.1, Sum: 1.0]

      [GC Worker Total (ms): Min: 3.2, Avg: 3.4, Max: 3.6, Diff: 0.5, Sum: 77.9]

      [GC Worker End (ms): Min: 11683595505.6, Avg: 11683595505.6, Max: 11683595505.7, Diff: 0.1]

   [Code Root Fixup: 0.1 ms]

   [Code Root Purge: 0.0 ms]

   [Clear CT: 0.8 ms]

   [Other: 3.0 ms]

      [Choose CSet: 0.0 ms]

      [Ref Proc: 1.1 ms]

      [Ref Enq: 0.0 ms]

      [Redirty Cards: 1.0 ms]

      [Humongous Register: 0.0 ms]

      [Humongous Reclaim: 0.0 ms]

      [Free CSet: 0.3 ms]

   [Eden: 4080.0M(4080.0M)->0.0B(4080.0M) Survivors: 16.0M->16.0M Heap: 4341.0M(16.0G)->262.4M(16.0G)]

 [Times: user=0.07 sys=0.00, real=0.01 secs]


2.2 The broker logs

2020-02-23T23:02:11.864 ERROR BrokerControllerScheduledThread1 - SyncTopicConfig Exception, x.x.x.x:10911 

org.apache.rocketmq.remoting.exception.RemotingTimeoutException: wait response on the channel <x.x.x.x:10909> timeout, 3000(ms)

        at org.apache.rocketmq.remoting.netty.NettyRemotingAbstract.invokeSyncImpl(NettyRemotingAbstract.java:427) ~[rocketmq-remoting-4.5.2.jar:4.5.2]

        at org.apache.rocketmq.remoting.netty.NettyRemotingClient.invokeSync(NettyRemotingClient.java:375) ~[rocketmq-remoting-4.5.2.jar:4.5.2]

From the cluster and GC logs, the failure looks like a network timeout that broke master-slave synchronization; nothing suggests the broker itself was at fault.

The monitoring shows that the network, CPU, and disk I/O all degraded at once. But in what order? Did disk I/O drive the CPU up and then break the network, or did the CPU spike first and drag down the network and disk I/O? The machine runs only a single RocketMQ process, and its load was not high, so the application process itself was unlikely to be exhausting CPU, network, or disk I/O. Could it be jitter on Alibaba Cloud? Cloud network jitter was a possibility, but if the cloud network were at fault, why did only a few nodes in the cluster flap while the business machines in the same data center were unaffected? The next step was to dig into the Linux system logs.


3. Inspecting the system logs

# grep -ci "page allocation failure" /var/log/messages*

The system logs contain "page allocation failure. order:0, mode:0x20" — the kernel failed to allocate even a single free page; free pages had run out.
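The `order` field in that message tells you the size of the failed request: order N asks the buddy allocator for 2^N physically contiguous pages, so order:0 is the smallest possible request. A quick sketch of the sizes involved (assuming the usual 4 KiB x86-64 page size):

```shell
# Size in bytes of an order-N buddy allocation (assuming 4 KiB pages).
page_size=4096
for order in 0 1 2 3; do
  echo "order:$order -> $(( page_size << order )) bytes"
done
# order:0 -> 4096 bytes ... order:3 -> 32768 bytes
```

An order:0 failure means the kernel could not find even one free 4 KiB page — free memory was truly exhausted, not merely fragmented.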


Searching for this message turned up a Red Hat knowledge-base article:

https://access.redhat.com/solutions/90883

It recommends adjusting two kernel parameters:
- Increase vm.min_free_kbytes, for example to a value higher than a single allocation request.
- Change vm.zone_reclaim_mode to 1 if it is set to zero, so the system can reclaim memory back from cached memory.


3.1 Updating the kernel configuration

$ sed -i '/swappiness=1/a\vm.zone_reclaim_mode = 1\nvm.min_free_kbytes = 512000' /etc/sysctl.conf && sysctl -p /etc/sysctl.conf

zone_reclaim_mode defaults to 0, meaning zone reclaim is disabled; setting it to 1 enables zone reclaim, so memory is reclaimed from the local NUMA node first.

min_free_kbytes sets the minimum amount of free memory (in kilobytes) that the kernel keeps in reserve.
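For context, the kernel picks its own default for min_free_kbytes at boot; on kernels of this era, the formula in init_per_zone_wmark_min() is roughly sqrt(lowmem_kbytes * 16), clamped to [128, 65536]. A sketch of what that yields for a 16 GiB broker (the RAM size here is an assumption, based on the 16 G heap in the GC log):

```shell
# Approximate the kernel's boot-time default for min_free_kbytes:
# sqrt(lowmem_kbytes * 16), clamped to [128, 65536] on older kernels.
ram_kbytes=$((16 * 1024 * 1024))   # assume a 16 GiB machine
def=$(awk -v k="$ram_kbytes" 'BEGIN {
  v = int(sqrt(k * 16));
  if (v < 128) v = 128; if (v > 65536) v = 65536;
  print v }')
echo "default min_free_kbytes ~ ${def} kB"   # ~16384 kB (16 MiB)
```

So the configured 512000 kB (~500 MiB) reserve is roughly 30x the stock default: it sacrifices some usable page cache to give kswapd a much larger buffer before allocations start failing.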


In short: RocketMQ is a heavy memory consumer (its commit log is memory-mapped, so it fills the page cache). With no kernel-enforced floor on free memory, RocketMQ keeps taking memory from the system until free memory is exhausted; once that happens, the system can no longer service requests, which shows up as dropped packets and a CPU spike.

The broker nodes run CentOS 6.10; upgrading the OS to CentOS 7 or later may avoid this problem.


4. Root-cause analysis

When free memory in a zone falls below watermark[low], the kernel wakes the kswapd thread to reclaim memory, and it keeps reclaiming until the zone's free memory reaches watermark[high]. If allocations arrive faster than kswapd can reclaim and free memory drops below watermark[min], the kernel falls back to direct reclaim: pages are reclaimed synchronously in the allocating process's own context and the freed pages satisfy the request. Direct reclaim therefore blocks the application, adds response latency, and can eventually trigger the system OOM killer. Memory below watermark[min] is the system's own reserve for special uses and is not handed out to ordinary user-space allocations.
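On kernels of this vintage (before watermark_scale_factor was added in 4.6), all three watermark lines are derived directly from min_free_kbytes: watermark[low] = min + min/4 and watermark[high] = min + min/2, with each zone receiving a share proportional to its size. A sketch with the value set in section 3.1:

```shell
# System-wide watermark totals derived from min_free_kbytes
# (pre-4.6 kernels: low = min + min/4, high = min + min/2).
min_free_kbytes=512000
low=$(( min_free_kbytes + min_free_kbytes / 4 ))
high=$(( min_free_kbytes + min_free_kbytes / 2 ))
echo "watermark[min]  = ${min_free_kbytes} kB"  # below this: direct reclaim
echo "watermark[low]  = ${low} kB"              # below this: kswapd wakes up
echo "watermark[high] = ${high} kB"             # kswapd stops reclaiming here
```

The live per-zone values can be inspected with `grep -E '(min|low|high)' /proc/zoneinfo`.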

How the size of min_free_kbytes matters

The larger min_free_kbytes is, the higher all the watermark lines sit, and the buffer between the three lines grows accordingly. kswapd then starts reclaiming earlier and reclaims more memory (it does not stop until watermark[high]), which makes the system keep more memory idle and leaves correspondingly less for applications. In the extreme, setting min_free_kbytes close to the total memory size leaves the application so little memory that OOM kills become frequent.

If min_free_kbytes is set too small, the system's reserve is too small. kswapd itself needs to allocate a little memory while reclaiming (such allocations carry the PF_MEMALLOC flag, which lets kswapd dip into the reserve). Likewise, a process selected and killed by the OOM killer may need to allocate memory on its way out, and it too may use the reserve. Allowing these two cases to use the reserve is what keeps the system out of deadlock.

Note: the analysis above is drawn from public sources.
















