Troubleshooting a service hang caused by Akka combined with an internal configuration service

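Application X runs two service clusters in production, cluster A and cluster B. Each cluster processes a different category of data, and the two do not interfere with each other.

After the release at 20:00 on November 17, both clusters ran normally.

The next morning, November 18, our data monitoring showed that cluster A had stopped processing data: the throughput metric dropped straight to 0, while cluster B kept running with normal numbers. We restarted the A service and everything recovered, but two hours later throughput dropped to 0 again.

At 16:00 on November 18 we suspected the newly added check logic, removed it, and released again. Two hours after that release, cluster A's throughput suddenly dropped to 0 once more.

A and B deploy exactly the same code and differ only in the category of data they process, so in theory a problem in the code should show up in both clusters.

We logged in to a cluster A machine and looked at the runtime log, which contained many entries like these: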

ispatcher-26] a.c.Cluster(akka://FeedSync) - Cluster Node [akka.tcp://[email protected]:15x0] - Ignoring received gossip from unreachable [UniqueAddress(akka.tcp://[email protected]:xx800,xx4997892)] 
2021-11-18 11:02:41.857 INFO  [FeedSync-akka.actor.default-dis


2021-11-17 23:59:49.635 INFO  [FeedSync-akka.actor.default-dispatcher-26] a.c.Cluster(akka://FeedSync) - Cluster Node [akka.tcp://[email protected]:15x0] - is the new leader among reachable nodes (more leaders may exist)

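A first quick look at the jstack thread dump showed a large number of threads blocked, all of them on network I/O. The service is an Akka cluster service that relies heavily on Akka, and this symptom usually means either that the Akka master node is down or that the cluster node has network trouble and has dropped out of the cluster. We tried both ping and telnet: the IPs and ports all responded normally, so there was no network fault.

That made it strange that nodes were leaving the cluster.

We checked the CPU usage of cluster A's master node: 742%, which is 2 to 3 times the normal value.

We used top -Hp PID to look at the resource usage of the threads inside the master node's service process.

Every so often a dozen or so threads flashed past at over 50% CPU, with the highest at 99%.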

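We noted the ID of the thread using the most CPU. top shows thread IDs in decimal while jstack prints them as a hexadecimal nid, so the ID has to be converted first; in hexadecimal it is 0x28260 (decimal 164448).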
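The conversion itself is a one-liner. Below is a minimal Java sketch of it; the class and method names are made up for illustration, and 164448 is simply the decimal value that corresponds to the nid 0x28260 used in this post.

```java
// Sketch: turn the decimal thread ID shown by `top -Hp` into the hex nid printed by jstack.
// ThreadIdToNid and toNid are illustrative names, not part of any tool.
final class ThreadIdToNid {

    /** Formats a decimal OS thread ID the way jstack prints it, e.g. "0x28260". */
    static String toNid(int decimalThreadId) {
        return "0x" + Integer.toHexString(decimalThreadId);
    }

    public static void main(String[] args) {
        // 164448 is the decimal form of the nid 0x28260 searched for in the thread dump.
        System.out.println(toNid(164448)); // prints 0x28260
    }
}
```

On the command line, printf "%x\n" followed by the decimal ID gives the same hexadecimal digits.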

On cluster A's master node we ran jstack to dump the information for all threads:

jstack PID > 202111118.jstack.txt

We then searched 202111118.jstack.txt for the hexadecimal nid 0x28260 to find the matching thread:

202111118.jstack.txt


"Gang worker#0 (Parallel GC Threads)" os_prio=0 tid=0x0000xxx38025000 nid=0x28260 runnable 

"Gang worker#1 (Parallel GC Threads)" os_prio=0 tid=0x0000xx438026800 nid=0x28261 runnable 

"Gang worker#2 (Parallel GC Threads)" os_prio=0 tid=0x0000xx2438028800 nid=0x28262 runnable 

"Gang worker#3 (Parallel GC Threads)" os_prio=0 tid=0x0000xxx43802a800 nid=0x28263 runnable 

"Gang worker#4 (Parallel GC Threads)" os_prio=0 tid=0x0000xxx43802c000 nid=0x28264 runnable 

"Gang worker#5 (Parallel GC Threads)" os_prio=0 tid=0x0000xxxx3802e000 nid=0x28265 runnable 

"Gang worker#6 (Parallel GC Threads)" os_prio=0 tid=0x0000xxxx8030000 nid=0x28266 runnable 

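The threads using the most CPU are clearly the Parallel GC Threads, which means GC is behind the abnormal CPU usage. That usually happens in one of two situations:

  1. The service has just started and is hit by a very large amount of traffic and data to load
  2. GC really is running continuously

To rule out the first case, we checked the process start time:

ps -eo pid,lstart,etime,cmd | grep java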

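The output shows that the cluster A service had not been restarted behind our backs and had already been up for a long time, so the only remaining explanation is that GC was driving the CPU up.

We found the GC log and took a look: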

2021-11-18T23:54:42.880+0800: 2344.443: [GC pause (G1 Evacuation Pause) (young)
Desired survivor size 109051904 bytes, new threshold 15 (max 15)
, 0.0067813 secs]
   [Parallel Time: 3.4 ms, GC Workers: 53]
      [GC Worker Start (ms): Min: 2344443.6, Avg: 2344443.8, Max: 2344444.0, Diff: 0.4]
      [Ext Root Scanning (ms): Min: 2.0, Avg: 2.2, Max: 3.1, Diff: 1.1, Sum: 116.0]
      [Update RS (ms): Min: 0.0, Avg: 0.2, Max: 0.4, Diff: 0.4, Sum: 9.1]
         [Processed Buffers: Min: 0, Avg: 0.4, Max: 2, Diff: 2, Sum: 19]
      [Scan RS (ms): Min: 0.0, Avg: 0.0, Max: 0.0, Diff: 0.0, Sum: 0.0]
      [Code Root Scanning (ms): Min: 0.0, Avg: 0.0, Max: 0.0, Diff: 0.0, Sum: 0.0]
      [Object Copy (ms): Min: 0.0, Avg: 0.0, Max: 0.1, Diff: 0.1, Sum: 1.4]
      [Termination (ms): Min: 0.0, Avg: 0.6, Max: 0.7, Diff: 0.7, Sum: 31.1]
         [Termination Attempts: Min: 1, Avg: 1.0, Max: 1, Diff: 0, Sum: 53]
      [GC Worker Other (ms): Min: 0.0, Avg: 0.0, Max: 0.0, Diff: 0.0, Sum: 0.3]
      [GC Worker Total (ms): Min: 2.8, Avg: 3.0, Max: 3.2, Diff: 0.4, Sum: 158.2]
      [GC Worker End (ms): Min: 2344446.7, Avg: 2344446.7, Max: 2344446.9, Diff: 0.2]
   [Code Root Fixup: 0.0 ms]
   [Code Root Purge: 0.0 ms]
   [Clear CT: 0.6 ms]
   [Other: 2.8 ms]
      [Choose CSet: 0.0 ms]
      [Ref Proc: 1.0 ms]
      [Ref Enq: 0.0 ms]
      [Redirty Cards: 0.6 ms]
      [Humongous Register: 0.1 ms]
      [Humongous Reclaim: 0.0 ms]
      [Free CSet: 0.0 ms]
   [Eden: 0.0B(1632.0M)->0.0B(1632.0M) Survivors: 0.0B->0.0B Heap: 32.0G(32.0G)->32.0G(32.0G)]
 [Times: user=0.14 sys=0.00, real=0.01 secs] 

2021-11-18T23:54:42.893+0800: 2344.456: [GC pause (G1 Evacuation Pause) (young)
Desired survivor size 109051904 bytes, new threshold 15 (max 15)
, 0.0073447 secs]
   [Parallel Time: 3.5 ms, GC Workers: 53]
      [GC Worker Start (ms): Min: 2344456.2, Avg: 2344456.4, Max: 2344456.6, Diff: 0.4]
      [Ext Root Scanning (ms): Min: 2.1, Avg: 2.3, Max: 3.2, Diff: 1.1, Sum: 120.7]
      [Update RS (ms): Min: 0.0, Avg: 0.1, Max: 0.2, Diff: 0.2, Sum: 4.7]
         [Processed Buffers: Min: 0, Avg: 0.2, Max: 2, Diff: 2, Sum: 10]
      [Scan RS (ms): Min: 0.0, Avg: 0.0, Max: 0.0, Diff: 0.0, Sum: 0.0]
      [Code Root Scanning (ms): Min: 0.0, Avg: 0.0, Max: 0.0, Diff: 0.0, Sum: 0.0]
      [Object Copy (ms): Min: 0.0, Avg: 0.0, Max: 0.1, Diff: 0.1, Sum: 1.3]
      [Termination (ms): Min: 0.0, Avg: 0.7, Max: 0.9, Diff: 0.9, Sum: 37.5]
         [Termination Attempts: Min: 1, Avg: 1.0, Max: 1, Diff: 0, Sum: 53]
      [GC Worker Other (ms): Min: 0.0, Avg: 0.0, Max: 0.0, Diff: 0.0, Sum: 0.3]
      [GC Worker Total (ms): Min: 2.9, Avg: 3.1, Max: 3.3, Diff: 0.5, Sum: 164.8]
      [GC Worker End (ms): Min: 2344459.4, Avg: 2344459.5, Max: 2344459.6, Diff: 0.2]
   [Code Root Fixup: 0.0 ms]
   [Code Root Purge: 0.0 ms]
   [Clear CT: 0.6 ms]
   [Other: 3.2 ms]
      [Choose CSet: 0.0 ms]
      [Ref Proc: 1.4 ms]
      [Ref Enq: 0.0 ms]
      [Redirty Cards: 0.6 ms]
      [Humongous Register: 0.1 ms]
      [Humongous Reclaim: 0.0 ms]
      [Free CSet: 0.0 ms]
   [Eden: 0.0B(1632.0M)->0.0B(1632.0M) Survivors: 0.0B->0.0B Heap: 32.0G(32.0G)->32.0G(32.0G)]
 [Times: user=0.16 sys=0.01, real=0.01 secs] 


2021-11-18T23:54:42.906+0800: 2344.469: [Full GC (Allocation Failure)  31G->26G(32G), 42.7612268 secs]
   [Eden: 0.0B(1632.0M)->0.0B(1632.0M) Survivors: 0.0B->0.0B Heap: 32.0G(32.0G)->26.2G(32.0G)], [Metaspace: 162114K->162033K(176128K)]
 [Times: user=70.17 sys=0.32, real=42.77 secs] 
2021-11-18T23:55:25.669+0800: 2387.232: [GC concurrent-mark-abort]
2021-11-18T23:55:26.021+0800: 2387.584: [GC pause (GCLocker Initiated GC) (young)
Desired survivor size 109051904 bytes, new threshold 15 (max 15)
, 0.0193987 secs]
   [Parallel Time: 11.4 ms, GC Workers: 53]
      [GC Worker Start (ms): Min: 2387586.8, Avg: 2387587.0, Max: 2387587.2, Diff: 0.5]
      [Ext Root Scanning (ms): Min: 2.1, Avg: 2.5, Max: 5.4, Diff: 3.3, Sum: 132.6]
      [Update RS (ms): Min: 0.0, Avg: 1.6, Max: 2.1, Diff: 2.1, Sum: 86.8]
         [Processed Buffers: Min: 0, Avg: 27.5, Max: 55, Diff: 55, Sum: 1457]
      [Scan RS (ms): Min: 0.0, Avg: 0.0, Max: 0.0, Diff: 0.0, Sum: 0.5]
      [Code Root Scanning (ms): Min: 0.0, Avg: 0.0, Max: 0.0, Diff: 0.0, Sum: 0.0]
      [Object Copy (ms): Min: 5.5, Avg: 6.5, Max: 6.7, Diff: 1.2, Sum: 346.4]
      [Termination (ms): Min: 0.0, Avg: 0.0, Max: 0.0, Diff: 0.0, Sum: 0.8]
         [Termination Attempts: Min: 1, Avg: 1.0, Max: 1, Diff: 0, Sum: 53]
      [GC Worker Other (ms): Min: 0.0, Avg: 0.1, Max: 0.3, Diff: 0.3, Sum: 6.5]
      [GC Worker Total (ms): Min: 10.5, Avg: 10.8, Max: 11.2, Diff: 0.8, Sum: 573.6]
      [GC Worker End (ms): Min: 2387597.7, Avg: 2387597.8, Max: 2387598.0, Diff: 0.3]
   [Code Root Fixup: 0.0 ms]
   [Code Root Purge: 0.2 ms]
   [Clear CT: 0.7 ms]
   [Other: 7.1 ms]
      [Choose CSet: 0.0 ms]
      [Ref Proc: 2.7 ms]
      [Ref Enq: 0.1 ms]
      [Redirty Cards: 0.6 ms]
      [Humongous Register: 0.1 ms]
      [Humongous Reclaim: 0.0 ms]
      [Free CSet: 0.3 ms]
   [Eden: 1648.0M(1632.0M)->0.0B(1568.0M) Survivors: 0.0B->64.0M Heap: 27.8G(32.0G)->26.2G(32.0G)]

 ...


2021-11-19T00:44:38.484+0800: 5340.038: [GC pause (G1 Evacuation Pause) (young)
Desired survivor size 109051904 bytes, new threshold 15 (max 15)
, 0.0083657 secs]
   [Parallel Time: 5.1 ms, GC Workers: 53]
      [GC Worker Start (ms): Min: 5340038.1, Avg: 5340038.4, Max: 5340038.6, Diff: 0.4]
      [Ext Root Scanning (ms): Min: 2.2, Avg: 2.5, Max: 4.6, Diff: 2.3, Sum: 132.0]
      [Update RS (ms): Min: 0.0, Avg: 0.1, Max: 0.2, Diff: 0.2, Sum: 3.3]
         [Processed Buffers: Min: 0, Avg: 0.2, Max: 1, Diff: 1, Sum: 12]
      [Scan RS (ms): Min: 0.0, Avg: 0.0, Max: 0.0, Diff: 0.0, Sum: 0.0]
      [Code Root Scanning (ms): Min: 0.0, Avg: 0.0, Max: 0.0, Diff: 0.0, Sum: 0.0]
      [Object Copy (ms): Min: 0.0, Avg: 0.0, Max: 0.1, Diff: 0.1, Sum: 1.7]
      [Termination (ms): Min: 0.0, Avg: 1.9, Max: 2.0, Diff: 2.0, Sum: 98.9]
         [Termination Attempts: Min: 1, Avg: 1.0, Max: 1, Diff: 0, Sum: 53]
      [GC Worker Other (ms): Min: 0.0, Avg: 0.0, Max: 0.0, Diff: 0.0, Sum: 0.3]
      [GC Worker Total (ms): Min: 4.1, Avg: 4.5, Max: 4.7, Diff: 0.5, Sum: 236.6]
      [GC Worker End (ms): Min: 5340042.7, Avg: 5340042.8, Max: 5340042.9, Diff: 0.2]
   [Code Root Fixup: 0.0 ms]
   [Code Root Purge: 0.0 ms]
   [Clear CT: 0.6 ms]
   [Other: 2.6 ms]
      [Choose CSet: 0.0 ms]
      [Ref Proc: 0.7 ms]
      [Ref Enq: 0.0 ms]
      [Redirty Cards: 0.6 ms]
      [Humongous Register: 0.1 ms]
      [Humongous Reclaim: 0.0 ms]
      [Free CSet: 0.0 ms]
   [Eden: 0.0B(1632.0M)->0.0B(1632.0M) Survivors: 0.0B->0.0B Heap: 32.0G(32.0G)->32.0G(32.0G)]
 [Times: user=0.21 sys=0.00, real=0.00 secs] 
2021-11-19T00:44:38.499+0800: 5340.053: [Full GC (Allocation Failure) 

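The GC log shows GC running nonstop, about twice per second on average. After startup the heap just keeps growing, and continuous GC does not help. Once the heap is full a Full GC is triggered, the Full GC fails to free enough memory (in the excerpt above a 42.8-second Full GC only gets from 31G to 26G of a 32G heap), then another Full GC runs and fails again, and so on in a loop. At that point the application is essentially paralyzed.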
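To make the "Full GC barely frees anything" reading concrete, here is a small Java sketch that pulls the numbers out of a Full GC line in the format shown in the log above. The class name, the regular expression and the assumption that sizes are printed in whole gigabytes are ours; this is not a general GC log parser.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Sketch: extract cause, heap sizes and pause time from one G1 "Full GC" log line.
final class FullGcLine {

    // Matches e.g. "[Full GC (Allocation Failure)  31G->26G(32G), 42.7612268 secs]".
    // Assumption: sizes are printed in whole gigabytes, as in the excerpt.
    private static final Pattern FULL_GC = Pattern.compile(
            "\\[Full GC \\(([^)]*)\\)\\s+(\\d+)G->(\\d+)G\\((\\d+)G\\), ([0-9.]+) secs\\]");

    final String cause;
    final int beforeGb;
    final int afterGb;
    final int capacityGb;
    final double pauseSecs;

    private FullGcLine(String cause, int beforeGb, int afterGb, int capacityGb, double pauseSecs) {
        this.cause = cause;
        this.beforeGb = beforeGb;
        this.afterGb = afterGb;
        this.capacityGb = capacityGb;
        this.pauseSecs = pauseSecs;
    }

    /** Returns the parsed entry, or null when the line is not a complete Full GC entry. */
    static FullGcLine parse(String line) {
        Matcher m = FULL_GC.matcher(line);
        if (!m.find()) {
            return null;
        }
        return new FullGcLine(m.group(1),
                Integer.parseInt(m.group(2)),
                Integer.parseInt(m.group(3)),
                Integer.parseInt(m.group(4)),
                Double.parseDouble(m.group(5)));
    }

    /** Fraction of the heap that is still occupied after the collection. */
    double occupancyAfter() {
        return (double) afterGb / capacityGb;
    }

    public static void main(String[] args) {
        FullGcLine gc = parse("2021-11-18T23:54:42.906+0800: 2344.469: "
                + "[Full GC (Allocation Failure)  31G->26G(32G), 42.7612268 secs]");
        System.out.println(gc.cause + ": freed " + (gc.beforeGb - gc.afterGb) + "G in "
                + gc.pauseSecs + " s, heap still " + (gc.occupancyAfter() * 100) + "% full");
    }
}
```

A stop-the-world pause of more than 40 seconds that leaves the heap over 80% full is also a plausible reason for the other Akka nodes to mark this node as unreachable.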

So the question is: what filled up the heap?

We first took a quick look at the top of the class histogram (the first 30 lines):

jmap -histo PID|head -n 30

That did not reveal which object was leaking, so the only option was to dump the whole heap and inspect it.

jmap -dump:live,format=b,file=/xxx/logs/202111182052.hprof PID

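This jmap run took a long time, and the resulting heap dump was 33G.

Pulling it down to a local machine and analyzing it with MAT would require more than 33G of memory just to open it, which is clearly unrealistic. MAT has to be installed on a server with plenty of CPU and memory instead, so we had to shut the cluster A service down first to free memory and CPU for the MAT analysis.

Note: the latest MAT release requires at least JDK 15 and we are on JDK 8, so we used version 1.9.2 here: MemoryAnalyzer-1.9.2.20200115-linux.gtk.x86_64.zip

After unpacking MemoryAnalyzer-1.9.2.20200115-linux.gtk.x86_64.zip, the directory contains: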

configuration  features  MemoryAnalyzer  p2  ParseHeapDump.sh  plugins  workspace

ParseHeapDump.sh needs a small change to add our JDK path: the JDK's bin directory is not on our shell PATH, so the path has to be configured by hand. Adding -vm /usr/local/xxx/jdk8-1.0/bin is enough, and the modified line looks like this:

"$(dirname -- "$0")"/MemoryAnalyzer -vm /usr/local/xxx/jdk8-1.0/bin -consolelog -application org.eclipse.mat.api.parse "$@"

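With MAT configured, the first step is to have MAT build its index files. The hprof file is 33G, so MAT needs about 40g of heap to get through the analysis without running out of memory.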

./ParseHeapDump.sh /xxx/logs/202111182052.hprof -vmargs -Xmx40g -XX:-UseGCOverheadLimit

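A large number of xxxxx.index files then appear in the same directory as the hprof file. Once the index files exist, generate the leak suspects report. For the same reason (a 33G hprof file), MAT again needs about 40g of heap so that report generation does not run out of memory.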

./ParseHeapDump.sh /xxx/logs/202111182052.hprof -vmargs -Xmx40g org.eclipse.mat.api:suspects

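This produces an analysis file named 202111182052_Leak_Suspects.zip in the same directory as the hprof file. Download the archive to a local machine, unpack it, and open index.html in Chrome. The report shows:

One group of objects occupies 29.3G of memory, 93% of the total. According to the details in the leak report, the com.xxx.springboot.property.UpdateValue class holds a List, and that list contains a huge number of objects that are never reclaimed.

We located the code of com.xxx.springboot.property.UpdateValue: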

...
...
private static final Logger logger = LoggerFactory.getLogger(UpdateValue.class);
     /**
      * All beans that carry the annotation
      */
     private static final List<Object> BEAN_LIST = new ArrayList<>();

     public static void addRainbowValueBean(Object bean) {

          BEAN_LIST.add(bean);
     }
...
...

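The class holds BEAN_LIST in a static field, so the list is never released, and every call to addRainbowValueBean appends one more entry to it. Next we looked for the caller of this method: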

public class RainbowAnnotationProcessor implements BeanPostProcessor, PriorityOrdered {


     public RainbowAnnotationProcessor() {
     }

     @Override
     public Object postProcessBeforeInitialization(Object bean, String beanName) throws BeansException {

          Class<?> clazz = bean.getClass();

          // process every annotated field
          Field[] declaredFields = clazz.getDeclaredFields();
          for (Field declaredField : declaredFields) {
               Value value = declaredField.getAnnotation(Value.class);
               if (value != null) {
                    UpdateValue.addRainbowValueBean(bean);
               }
          }

          // collect every bean marked with ConfigurationProperties
          ConfigurationProperties configurationProperties = clazz.getAnnotation(ConfigurationProperties.class);
          if (configurationProperties != null) {
               UpdateConfigurationProperties.addRainbowConfigurationPropertiesBean(bean);
          }


          return bean;
     }

     @Override
     public Object postProcessAfterInitialization(Object bean, String beanName) throws BeansException {

          return bean;
     }

     @Override
     public int getOrder() {
          //make it as late as possible
          return Ordered.LOWEST_PRECEDENCE;
     }


}

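This class implements the BeanPostProcessor interface and overrides postProcessBeforeInitialization, whose logic calls UpdateValue.addRainbowValueBean to register the bean, once for every field annotated with Value.

The BeanPostProcessor interface has two callback methods. Once a BeanPostProcessor implementation is registered with the Spring IoC container, the container calls its postProcessBeforeInitialization method for every bean instance it creates, before the bean's initialization methods (such as afterPropertiesSet or any declared init method) run, and postProcessAfterInitialization once they have finished.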
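The callback order can be modelled in a few lines of plain Java. The sketch below is a simplified stand-in, not the Spring interface: PostProcessor, RecordingBean, create and recorder are hypothetical names, and a real container does far more around these calls.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch: a toy model of the order in which a container invokes post-processor callbacks.
final class PostProcessorOrderSketch {

    /** Simplified model of the two BeanPostProcessor callbacks (not the Spring interface). */
    interface PostProcessor {
        Object beforeInit(Object bean);
        Object afterInit(Object bean);
    }

    /** A bean that records when its init method runs. */
    static final class RecordingBean {
        final List<String> events = new ArrayList<>();

        void init() {
            events.add("init");
        }
    }

    /** Mimics the container: before-callback, then init, then after-callback. */
    static RecordingBean create(PostProcessor processor) {
        RecordingBean bean = new RecordingBean();
        RecordingBean current = (RecordingBean) processor.beforeInit(bean);
        current.init();
        return (RecordingBean) processor.afterInit(current);
    }

    /** A post-processor that only records when each callback fires. */
    static PostProcessor recorder() {
        return new PostProcessor() {
            @Override
            public Object beforeInit(Object bean) {
                ((RecordingBean) bean).events.add("before");
                return bean;
            }

            @Override
            public Object afterInit(Object bean) {
                ((RecordingBean) bean).events.add("after");
                return bean;
            }
        };
    }

    public static void main(String[] args) {
        System.out.println(create(recorder()).events); // prints [before, init, after]
    }
}
```

The point that matters for this incident is that the before-callback runs for every instance the container creates, not once per class.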

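Overall, RainbowAnnotationProcessor works roughly like this: when a bean is created and its class has a field with the Value annotation, the bean is added to the BEAN_LIST list in com.xxx.springboot.property.UpdateValue. When the configuration changes, the values of the Value-annotated fields in every bean in BEAN_LIST are refreshed right away.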
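We do not have the SDK's refresh code, so the following is only a guess at the general shape of such a mechanism, written as a self-contained Java sketch. ConfigKey is a hypothetical stand-in for Spring's Value annotation, and BEANS, register, refresh and Worker are illustrative names.

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;
import java.lang.reflect.Field;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;

// Sketch: a registry that keeps beans so their annotated fields can be updated later.
final class RefreshRegistrySketch {

    /** Hypothetical stand-in for Spring's @Value; the attribute is the config key. */
    @Retention(RetentionPolicy.RUNTIME)
    @Target(ElementType.FIELD)
    @interface ConfigKey {
        String value();
    }

    /** Strongly referenced beans, mirroring the role of BEAN_LIST. */
    private static final List<Object> BEANS = new ArrayList<>();

    static void register(Object bean) {
        BEANS.add(bean);
    }

    /** Pushes new values into every annotated String field of every registered bean. */
    static void refresh(Map<String, String> newConfig) {
        for (Object bean : BEANS) {
            for (Field field : bean.getClass().getDeclaredFields()) {
                ConfigKey key = field.getAnnotation(ConfigKey.class);
                if (key == null || field.getType() != String.class
                        || !newConfig.containsKey(key.value())) {
                    continue;
                }
                try {
                    field.setAccessible(true);
                    field.set(bean, newConfig.get(key.value()));
                } catch (IllegalAccessException e) {
                    throw new IllegalStateException("cannot refresh " + field, e);
                }
            }
        }
    }

    /** Example bean with one refreshable field. */
    static final class Worker {
        @ConfigKey("batch.size")
        String batchSize = "100";
    }

    public static void main(String[] args) {
        Worker worker = new Worker();
        register(worker);
        refresh(Collections.singletonMap("batch.size", "500"));
        System.out.println(worker.batchSize); // prints 500
    }
}
```

A design like this has to keep a reference to each bean in order to update it later. With a plain static list and no removal, that reference also keeps every registered bean alive for the lifetime of the process.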

Now back to our application: it is an Akka application that also uses Spring. Every Akka actor carries the Actor annotation, and the Actor annotation in turn carries @Scope("prototype"):

@Target({ElementType.TYPE})
@Retention(RetentionPolicy.RUNTIME)
@Documented
@Component
@Scope("prototype")
public @interface Actor {

}

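In other words, every class that uses the Actor annotation is prototype-scoped.

The scope attribute of a Spring bean has the following 5 types:

  1. singleton: a single instance in the Spring container; fetching the bean from the container always returns the same instance
  2. prototype: a new object is created every time the bean is fetched
  3. request: valid within a single HTTP request
  4. session: valid within a single user session
  5. globalSession: valid within a global session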

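An actor encapsulates one complete processing action. We define many such actors, all of them handle high-concurrency, high-volume work, and every actor class has fields annotated with Value. Because of the prototype scope, a new instance is created each time an actor is requested, each new instance is appended to the list through UpdateValue.addRainbowValueBean, and nothing ever removes it, so eventually the heap fills up.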
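The mechanism can be reproduced without Spring or Akka. The Java sketch below is a simulation under our own assumptions: BEAN_LIST here is a local list playing the role of the SDK's static list, ActorBean stands in for a prototype-scoped actor, and createPrototypeBean imitates the container creating and post-processing one instance.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch: short-lived prototype instances plus a static, append-only registry.
final class PrototypeLeakSketch {

    /** Plays the role of UpdateValue.BEAN_LIST: static, append-only, never cleared. */
    static final List<Object> BEAN_LIST = new ArrayList<>();

    /** Plays the role of a prototype-scoped actor bean. */
    static final class ActorBean {
        final byte[] state = new byte[1024]; // stand-in for per-actor state
    }

    /** Simulates the container creating a fresh prototype bean and post-processing it. */
    static ActorBean createPrototypeBean() {
        ActorBean bean = new ActorBean();
        BEAN_LIST.add(bean); // what the post-processor does when it sees an annotated field
        return bean;
    }

    public static void main(String[] args) {
        for (int i = 0; i < 7000; i++) {
            createPrototypeBean(); // the caller drops its reference right away
        }
        // Every instance is still reachable through the static list, so none can be collected.
        System.out.println(BEAN_LIST.size()); // prints 7000
    }
}
```

With a singleton bean the list would receive one entry per class and stay small. With prototype scope it grows with the number of instances ever created, which is why the problem tracks the workload and not the code version.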

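To verify this theory we attached arthas to the 'healthy' cluster B application and looked at the contents and size of BEAN_LIST: it held more than 7000 Actor objects. We also pulled an hprof file from this cluster, and it shows the same memory leak. The only difference is that this cluster pulls files less often, so the problem had not fully blown up there yet.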

// use an OGNL expression to check the size of BEAN_LIST
getstatic com.xxx.springboot.property.UpdateValue BEAN_LIST size
// show the contents of BEAN_LIST
getstatic com.xxx.springboot.property.UpdateValue BEAN_LIST

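The com.xxx.springboot.property.UpdateValue class comes from a configuration service package, xxx-sdk-java-springboot-starter-3.1.3.jar. We confirmed that the version released at 20:00 on the 17th was indeed the one that introduced this package.

Temporary fix: remove this package together with the newly introduced logic that depends on it, then release again.

Long-term fix: access the configuration center over HTTP.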
