Troubleshooting a service hang caused by Akka and an internal configuration service

Application X runs two cluster deployments in production, cluster A and cluster B. The two clusters process different categories of data and do not interfere with each other.

After the release at 20:00 on the evening of November 17, both clusters were running normally.

The next morning, November 18, the data monitoring showed that cluster A had stopped processing data — its metric dropped straight to zero — while cluster B was still running fine with normal data. We restarted the cluster A service and everything recovered, but two hours later the metric dropped to zero again.

At 16:00 on November 18 we suspected the newly added conditional-logic code, removed it, and released again. Two hours after that release, cluster A's processing volume once again dropped to zero.

Clusters A and B are deployed from exactly the same code; only the category of data they process differs. In theory, if the code were at fault, both clusters should have shown the problem.

Logging onto one of cluster A's machines and checking the runtime logs, we saw a lot of entries like these:

2021-11-18 11:02:41.857 INFO  [FeedSync-akka.actor.default-dispatcher-26] a.c.Cluster(akka://FeedSync) - Cluster Node [akka.tcp://[email protected]:15x0] - Ignoring received gossip from unreachable [UniqueAddress(akka.tcp://[email protected]:xx800,xx4997892)]

2021-11-17 23:59:49.635 INFO  [FeedSync-akka.actor.default-dispatcher-26] a.c.Cluster(akka://FeedSync) - Cluster Node [akka.tcp://[email protected]:15x0] - is the new leader among reachable nodes (more leaders may exist)

A quick first pass over the jstack thread dump showed a large number of threads blocked, all on network I/O. Since the service is an Akka cluster application that leans heavily on Akka, this pattern usually means either the Akka master node is down, or the cluster node itself has a network problem and has dropped out of the cluster. We checked with ping and telnet — both the IPs and the ports were reachable — so there was no network fault.

So it was puzzling why the node had dropped out of the cluster.

CPU usage on cluster A's master node was at 742%, two to three times the normal level.

We used top -Hp PID to look at per-thread resource usage inside the master node's service process.

Every so often a dozen or more threads flashed past with CPU utilization above 50%, the highest at 99%.

We noted the ID of the thread with the highest CPU usage and converted it to hexadecimal; the hex nid to look for was 0x28260.

On cluster A's master node we ran jstack to dump all the thread information:

jstack PID > 202111118.jstack.txt

Then we searched 202111118.jstack.txt for the thread whose hex nid is 0x28260:



"Gang worker#0 (Parallel GC Threads)" os_prio=0 tid=0x0000xxx38025000 nid=0x28260 runnable 

"Gang worker#1 (Parallel GC Threads)" os_prio=0 tid=0x0000xx438026800 nid=0x28261 runnable 

"Gang worker#2 (Parallel GC Threads)" os_prio=0 tid=0x0000xx2438028800 nid=0x28262 runnable 

"Gang worker#3 (Parallel GC Threads)" os_prio=0 tid=0x0000xxx43802a800 nid=0x28263 runnable 

"Gang worker#4 (Parallel GC Threads)" os_prio=0 tid=0x0000xxx43802c000 nid=0x28264 runnable 

"Gang worker#5 (Parallel GC Threads)" os_prio=0 tid=0x0000xxxx3802e000 nid=0x28265 runnable 

"Gang worker#6 (Parallel GC Threads)" os_prio=0 tid=0x0000xxxx8030000 nid=0x28266 runnable 

It's immediately clear that the high-CPU threads are Parallel GC Threads; in other words, GC is what was driving the CPU up. This pattern generally means one of two things:

  1. the service has just started and is loading a very large burst of traffic and data, or
  2. the heap is genuinely under pressure and GC is running non-stop.

First, check whether the service had just been restarted:

ps -eo pid,lstart,etime,cmd | grep java

The output showed that the cluster A service had not been restarted — it had been up for a long time — so the only remaining explanation was that GC was driving the CPU spike.

Then we pulled up the GC log:

2021-11-18T23:54:42.880+0800: 2344.443: [GC pause (G1 Evacuation Pause) (young)
Desired survivor size 109051904 bytes, new threshold 15 (max 15)
, 0.0067813 secs]
   [Parallel Time: 3.4 ms, GC Workers: 53]
      [GC Worker Start (ms): Min: 2344443.6, Avg: 2344443.8, Max: 2344444.0, Diff: 0.4]
      [Ext Root Scanning (ms): Min: 2.0, Avg: 2.2, Max: 3.1, Diff: 1.1, Sum: 116.0]
      [Update RS (ms): Min: 0.0, Avg: 0.2, Max: 0.4, Diff: 0.4, Sum: 9.1]
         [Processed Buffers: Min: 0, Avg: 0.4, Max: 2, Diff: 2, Sum: 19]
      [Scan RS (ms): Min: 0.0, Avg: 0.0, Max: 0.0, Diff: 0.0, Sum: 0.0]
      [Code Root Scanning (ms): Min: 0.0, Avg: 0.0, Max: 0.0, Diff: 0.0, Sum: 0.0]
      [Object Copy (ms): Min: 0.0, Avg: 0.0, Max: 0.1, Diff: 0.1, Sum: 1.4]
      [Termination (ms): Min: 0.0, Avg: 0.6, Max: 0.7, Diff: 0.7, Sum: 31.1]
         [Termination Attempts: Min: 1, Avg: 1.0, Max: 1, Diff: 0, Sum: 53]
      [GC Worker Other (ms): Min: 0.0, Avg: 0.0, Max: 0.0, Diff: 0.0, Sum: 0.3]
      [GC Worker Total (ms): Min: 2.8, Avg: 3.0, Max: 3.2, Diff: 0.4, Sum: 158.2]
      [GC Worker End (ms): Min: 2344446.7, Avg: 2344446.7, Max: 2344446.9, Diff: 0.2]
   [Code Root Fixup: 0.0 ms]
   [Code Root Purge: 0.0 ms]
   [Clear CT: 0.6 ms]
   [Other: 2.8 ms]
      [Choose CSet: 0.0 ms]
      [Ref Proc: 1.0 ms]
      [Ref Enq: 0.0 ms]
      [Redirty Cards: 0.6 ms]
      [Humongous Register: 0.1 ms]
      [Humongous Reclaim: 0.0 ms]
      [Free CSet: 0.0 ms]
   [Eden: 0.0B(1632.0M)->0.0B(1632.0M) Survivors: 0.0B->0.0B Heap: 32.0G(32.0G)->32.0G(32.0G)]
 [Times: user=0.14 sys=0.00, real=0.01 secs] 

2021-11-18T23:54:42.893+0800: 2344.456: [GC pause (G1 Evacuation Pause) (young)
Desired survivor size 109051904 bytes, new threshold 15 (max 15)
, 0.0073447 secs]
   [Parallel Time: 3.5 ms, GC Workers: 53]
      [GC Worker Start (ms): Min: 2344456.2, Avg: 2344456.4, Max: 2344456.6, Diff: 0.4]
      [Ext Root Scanning (ms): Min: 2.1, Avg: 2.3, Max: 3.2, Diff: 1.1, Sum: 120.7]
      [Update RS (ms): Min: 0.0, Avg: 0.1, Max: 0.2, Diff: 0.2, Sum: 4.7]
         [Processed Buffers: Min: 0, Avg: 0.2, Max: 2, Diff: 2, Sum: 10]
      [Scan RS (ms): Min: 0.0, Avg: 0.0, Max: 0.0, Diff: 0.0, Sum: 0.0]
      [Code Root Scanning (ms): Min: 0.0, Avg: 0.0, Max: 0.0, Diff: 0.0, Sum: 0.0]
      [Object Copy (ms): Min: 0.0, Avg: 0.0, Max: 0.1, Diff: 0.1, Sum: 1.3]
      [Termination (ms): Min: 0.0, Avg: 0.7, Max: 0.9, Diff: 0.9, Sum: 37.5]
         [Termination Attempts: Min: 1, Avg: 1.0, Max: 1, Diff: 0, Sum: 53]
      [GC Worker Other (ms): Min: 0.0, Avg: 0.0, Max: 0.0, Diff: 0.0, Sum: 0.3]
      [GC Worker Total (ms): Min: 2.9, Avg: 3.1, Max: 3.3, Diff: 0.5, Sum: 164.8]
      [GC Worker End (ms): Min: 2344459.4, Avg: 2344459.5, Max: 2344459.6, Diff: 0.2]
   [Code Root Fixup: 0.0 ms]
   [Code Root Purge: 0.0 ms]
   [Clear CT: 0.6 ms]
   [Other: 3.2 ms]
      [Choose CSet: 0.0 ms]
      [Ref Proc: 1.4 ms]
      [Ref Enq: 0.0 ms]
      [Redirty Cards: 0.6 ms]
      [Humongous Register: 0.1 ms]
      [Humongous Reclaim: 0.0 ms]
      [Free CSet: 0.0 ms]
   [Eden: 0.0B(1632.0M)->0.0B(1632.0M) Survivors: 0.0B->0.0B Heap: 32.0G(32.0G)->32.0G(32.0G)]
 [Times: user=0.16 sys=0.01, real=0.01 secs] 


2021-11-18T23:54:42.906+0800: 2344.469: [Full GC (Allocation Failure)  31G->26G(32G), 42.7612268 secs]
   [Eden: 0.0B(1632.0M)->0.0B(1632.0M) Survivors: 0.0B->0.0B Heap: 32.0G(32.0G)->26.2G(32.0G)], [Metaspace: 162114K->162033K(176128K)]
 [Times: user=70.17 sys=0.32, real=42.77 secs] 
2021-11-18T23:55:25.669+0800: 2387.232: [GC concurrent-mark-abort]
2021-11-18T23:55:26.021+0800: 2387.584: [GC pause (GCLocker Initiated GC) (young)
Desired survivor size 109051904 bytes, new threshold 15 (max 15)
, 0.0193987 secs]
   [Parallel Time: 11.4 ms, GC Workers: 53]
      [GC Worker Start (ms): Min: 2387586.8, Avg: 2387587.0, Max: 2387587.2, Diff: 0.5]
      [Ext Root Scanning (ms): Min: 2.1, Avg: 2.5, Max: 5.4, Diff: 3.3, Sum: 132.6]
      [Update RS (ms): Min: 0.0, Avg: 1.6, Max: 2.1, Diff: 2.1, Sum: 86.8]
         [Processed Buffers: Min: 0, Avg: 27.5, Max: 55, Diff: 55, Sum: 1457]
      [Scan RS (ms): Min: 0.0, Avg: 0.0, Max: 0.0, Diff: 0.0, Sum: 0.5]
      [Code Root Scanning (ms): Min: 0.0, Avg: 0.0, Max: 0.0, Diff: 0.0, Sum: 0.0]
      [Object Copy (ms): Min: 5.5, Avg: 6.5, Max: 6.7, Diff: 1.2, Sum: 346.4]
      [Termination (ms): Min: 0.0, Avg: 0.0, Max: 0.0, Diff: 0.0, Sum: 0.8]
         [Termination Attempts: Min: 1, Avg: 1.0, Max: 1, Diff: 0, Sum: 53]
      [GC Worker Other (ms): Min: 0.0, Avg: 0.1, Max: 0.3, Diff: 0.3, Sum: 6.5]
      [GC Worker Total (ms): Min: 10.5, Avg: 10.8, Max: 11.2, Diff: 0.8, Sum: 573.6]
      [GC Worker End (ms): Min: 2387597.7, Avg: 2387597.8, Max: 2387598.0, Diff: 0.3]
   [Code Root Fixup: 0.0 ms]
   [Code Root Purge: 0.2 ms]
   [Clear CT: 0.7 ms]
   [Other: 7.1 ms]
      [Choose CSet: 0.0 ms]
      [Ref Proc: 2.7 ms]
      [Ref Enq: 0.1 ms]
      [Redirty Cards: 0.6 ms]
      [Humongous Register: 0.1 ms]
      [Humongous Reclaim: 0.0 ms]
      [Free CSet: 0.3 ms]
   [Eden: 1648.0M(1632.0M)->0.0B(1568.0M) Survivors: 0.0B->64.0M Heap: 27.8G(32.0G)->26.2G(32.0G)]

 ...


2021-11-19T00:44:38.484+0800: 5340.038: [GC pause (G1 Evacuation Pause) (young)
Desired survivor size 109051904 bytes, new threshold 15 (max 15)
, 0.0083657 secs]
   [Parallel Time: 5.1 ms, GC Workers: 53]
      [GC Worker Start (ms): Min: 5340038.1, Avg: 5340038.4, Max: 5340038.6, Diff: 0.4]
      [Ext Root Scanning (ms): Min: 2.2, Avg: 2.5, Max: 4.6, Diff: 2.3, Sum: 132.0]
      [Update RS (ms): Min: 0.0, Avg: 0.1, Max: 0.2, Diff: 0.2, Sum: 3.3]
         [Processed Buffers: Min: 0, Avg: 0.2, Max: 1, Diff: 1, Sum: 12]
      [Scan RS (ms): Min: 0.0, Avg: 0.0, Max: 0.0, Diff: 0.0, Sum: 0.0]
      [Code Root Scanning (ms): Min: 0.0, Avg: 0.0, Max: 0.0, Diff: 0.0, Sum: 0.0]
      [Object Copy (ms): Min: 0.0, Avg: 0.0, Max: 0.1, Diff: 0.1, Sum: 1.7]
      [Termination (ms): Min: 0.0, Avg: 1.9, Max: 2.0, Diff: 2.0, Sum: 98.9]
         [Termination Attempts: Min: 1, Avg: 1.0, Max: 1, Diff: 0, Sum: 53]
      [GC Worker Other (ms): Min: 0.0, Avg: 0.0, Max: 0.0, Diff: 0.0, Sum: 0.3]
      [GC Worker Total (ms): Min: 4.1, Avg: 4.5, Max: 4.7, Diff: 0.5, Sum: 236.6]
      [GC Worker End (ms): Min: 5340042.7, Avg: 5340042.8, Max: 5340042.9, Diff: 0.2]
   [Code Root Fixup: 0.0 ms]
   [Code Root Purge: 0.0 ms]
   [Clear CT: 0.6 ms]
   [Other: 2.6 ms]
      [Choose CSet: 0.0 ms]
      [Ref Proc: 0.7 ms]
      [Ref Enq: 0.0 ms]
      [Redirty Cards: 0.6 ms]
      [Humongous Register: 0.1 ms]
      [Humongous Reclaim: 0.0 ms]
      [Free CSet: 0.0 ms]
   [Eden: 0.0B(1632.0M)->0.0B(1632.0M) Survivors: 0.0B->0.0B Heap: 32.0G(32.0G)->32.0G(32.0G)]
 [Times: user=0.21 sys=0.00, real=0.00 secs] 
2021-11-19T00:44:38.499+0800: 5340.053: [Full GC (Allocation Failure) 

What the GC log shows is constant GC activity — roughly two collections per second on average. After the application starts, the heap keeps growing no matter how much GC runs. Once the heap is full, a Full GC fires, fails to reclaim enough space, and is immediately followed by another Full GC that also fails, over and over — at that point the application is essentially paralyzed.
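Incidentally, the same signature can also be watched from inside the JVM instead of tailing gc.log — this is not what we ran during the incident, just a small illustrative probe using the standard MemoryMXBean/GarbageCollectorMXBean APIs (the class name is hypothetical). A heap that stays pinned near its maximum while collection counts and times climb rapidly is exactly the pattern in the log above.

import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryUsage;

// Prints heap occupancy and cumulative GC counts/times every few seconds.
public class GcPressureProbe {
    public static void main(String[] args) throws InterruptedException {
        while (true) {
            MemoryUsage heap = ManagementFactory.getMemoryMXBean().getHeapMemoryUsage();
            StringBuilder gc = new StringBuilder();
            for (GarbageCollectorMXBean bean : ManagementFactory.getGarbageCollectorMXBeans()) {
                gc.append(String.format(" %s: count=%d time=%dms",
                        bean.getName(), bean.getCollectionCount(), bean.getCollectionTime()));
            }
            System.out.printf("heap used=%dM max=%dM |%s%n",
                    heap.getUsed() >> 20, heap.getMax() >> 20, gc);
            Thread.sleep(5000);
        }
    }
}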

So the question became: what was filling the heap?

First, a quick look at the top 30 lines of the class histogram:

jmap -histo PID|head -n 30

That didn't point to any obviously leaking class, so the only option left was to dump the entire heap and analyze it:

jmap -dump:live,format=b,file=/xxx/logs/202111182052.hprof PID

The jmap command ran for a long time, and the resulting heap dump file was 33 GB.

Pulling that down to a laptop and opening it in MAT would require 33 GB+ of memory, which clearly wasn't realistic, so MAT had to be installed on a server with plenty of CPU and memory. At this point we shut down the cluster A service to free up its memory and CPU for the MAT analysis.

Note: recent MAT releases require a much newer JDK than the JDK 8 we run on, so version 1.9.2 was used here: MemoryAnalyzer-1.9.2.20200115-linux.gtk.x86_64.zip

After unpacking MemoryAnalyzer-1.9.2.20200115-linux.gtk.x86_64.zip, the directory looks like this:

configuration  features  MemoryAnalyzer  p2  ParseHeapDump.sh  plugins  workspace

ParseHeapDump.sh needs a small edit to point at our JDK, because our shell does not have the JDK's bin directory on the PATH. Just add -vm /usr/local/xxx/jdk8-1.0/bin, after which the line looks like this:

"$(dirname -- "$0")"/MemoryAnalyzer  -vm /usr/local/xxx/jdk8-1.0/bin -consolelog -application org.eclipse.mat.api.parse "$@"

With MAT configured, first generate the analysis index files. Because the hprof file is 33 GB, MAT needs roughly 40 GB of heap so the parsing step itself doesn't run out of memory:

./ParseHeapDump.sh /xxx/logs/202111182052.hprof -vmargs -Xmx40g -XX:-UseGCOverheadLimit

This leaves a large number of xxxxx.index files next to the hprof. With the index files in place, generate the leak-suspects report; again, because the hprof is 33 GB, give MAT roughly 40 GB of heap so report generation doesn't run out of memory:

./ParseHeapDump.sh /xxx/logs/202111182052.hprof -vmargs -Xmx40g org.eclipse.mat.api:suspects

This produces a report archive, 202111182052_Leak_Suspects.zip, in the same directory as the hprof. Download it to a local machine, unpack it, and open index.html in Chrome:

One group of objects takes up 29.3 GB, 93% of the heap. The report details say that com.xxx.springboot.property.UpdateValue holds a List containing a huge number of objects that are never collected.

Looking at the code of com.xxx.springboot.property.UpdateValue:

...
...
private static final Logger logger = LoggerFactory.getLogger(UpdateValue.class);
     /**
      * all beans that carry the annotation
      */
     private static final List<Object> BEAN_LIST = new ArrayList<>();

     public static void addRainbowValueBean(Object bean) {

          BEAN_LIST.add(bean);
     }
...
...

The static BEAN_LIST is held forever, and every call to addRainbowValueBean keeps appending to it. Looking for the caller of this method:

public class RainbowAnnotationProcessor implements BeanPostProcessor, PriorityOrdered {


     public RainbowAnnotationProcessor() {
     }

     @Override
     public Object postProcessBeforeInitialization(Object bean, String beanName) throws BeansException {

          Class<?> clazz = bean.getClass();

          // handle all fields carrying the @Value annotation
          Field[] declaredFields = clazz.getDeclaredFields();
          for (Field declaredField : declaredFields) {
               Value value = declaredField.getAnnotation(Value.class);
               if (value != null) {
                    UpdateValue.addRainbowValueBean(bean);
               }
          }

          // collect all beans annotated with @ConfigurationProperties
          ConfigurationProperties configurationProperties = clazz.getAnnotation(ConfigurationProperties.class);
          if (configurationProperties != null) {
               UpdateConfigurationProperties.addRainbowConfigurationPropertiesBean(bean);
          }


          return bean;
     }

     @Override
     public Object postProcessAfterInitialization(Object bean, String beanName) throws BeansException {

          return bean;
     }

     @Override
     public int getOrder() {
          //make it as late as possible
          return Ordered.LOWEST_PRECEDENCE;
     }


}

This class implements the BeanPostProcessor interface and overrides postProcessBeforeInitialization, whose logic calls UpdateValue.addRainbowValueBean to register the bean.

The BeanPostProcessor interface has two callback methods. Once an implementation is registered with the Spring IoC container, then for every bean instance that container creates, postProcessBeforeInitialization is invoked before the bean's initialization methods (such as afterPropertiesSet and any declared init method), and postProcessAfterInitialization is invoked after them.
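A minimal, self-contained sketch of that ordering (hypothetical class names, not part of our application): the before-hook fires before the bean's afterPropertiesSet, and the after-hook fires after it.

import org.springframework.beans.factory.InitializingBean;
import org.springframework.beans.factory.config.BeanPostProcessor;
import org.springframework.context.annotation.AnnotationConfigApplicationContext;

public class PostProcessorOrderDemo {

    static class DemoBean implements InitializingBean {
        @Override
        public void afterPropertiesSet() { System.out.println("2. afterPropertiesSet"); }
    }

    static class LoggingProcessor implements BeanPostProcessor {
        @Override
        public Object postProcessBeforeInitialization(Object bean, String name) {
            if (bean instanceof DemoBean) System.out.println("1. postProcessBeforeInitialization");
            return bean;
        }
        @Override
        public Object postProcessAfterInitialization(Object bean, String name) {
            if (bean instanceof DemoBean) System.out.println("3. postProcessAfterInitialization");
            return bean;
        }
    }

    public static void main(String[] args) {
        // the container detects the BeanPostProcessor bean and applies it to DemoBean
        new AnnotationConfigApplicationContext(LoggingProcessor.class, DemoBean.class).close();
    }
}

Running it prints the three lines in order 1, 2, 3.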

The overall logic of RainbowAnnotationProcessor is roughly: when a bean is created, if any of its fields carries the @Value annotation, the bean is added to the BEAN_LIST inside com.xxx.springboot.property.UpdateValue; whenever a configuration change arrives, the @Value-annotated fields of every bean in BEAN_LIST are refreshed with the new values.

Back to our application: it is an Akka application that also uses Spring, and every Akka actor is annotated with our @Actor annotation, which itself carries the @Scope("prototype") annotation:

@Target({ElementType.TYPE})
@Retention(RetentionPolicy.RUNTIME)
@Documented
@Component
@Scope("prototype")
public @interface Actor {

}

In other words, every class annotated with @Actor is prototype-scoped.

A Spring bean's scope attribute has five options:

  1. singleton — one instance per Spring container; every lookup of the bean returns the same instance
  2. prototype — a new instance is created every time the bean is obtained
  3. request — valid for a single HTTP request
  4. session — valid for one user session
  5. globalSession — valid for the global session

An actor encapsulates a complete, self-contained piece of processing logic. We define a great many of these actors, they all do high-concurrency, high-volume processing, and every actor class has fields annotated with @Value. Because they are prototype-scoped, a new bean instance is created every time an actor is obtained, and every one of those instances is handed to UpdateValue.addRainbowValueBean — so the list keeps growing until the heap fills up.
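To make the mechanism concrete, here is a small self-contained sketch — the class names are hypothetical; it is neither the real SDK nor our actor code — of a post-processor that pins every bean it sees in a static list, combined with a prototype-scoped bean that is looked up once per unit of work:

import java.util.ArrayList;
import java.util.List;

import org.springframework.beans.factory.config.BeanPostProcessor;
import org.springframework.context.annotation.AnnotationConfigApplicationContext;
import org.springframework.context.annotation.Scope;

public class PrototypeLeakSketch {

    static final List<Object> BEAN_LIST = new ArrayList<>();          // never cleared, like UpdateValue.BEAN_LIST

    static class RegisteringProcessor implements BeanPostProcessor {  // like RainbowAnnotationProcessor
        @Override
        public Object postProcessBeforeInitialization(Object bean, String beanName) {
            BEAN_LIST.add(bean);                                       // registered for future config refreshes
            return bean;
        }
    }

    @Scope("prototype")
    static class SyncActor {                                           // like one of our @Actor beans
        byte[] state = new byte[1024];                                 // per-instance state the list keeps alive
    }

    public static void main(String[] args) {
        try (AnnotationConfigApplicationContext ctx =
                     new AnnotationConfigApplicationContext(RegisteringProcessor.class, SyncActor.class)) {
            for (int i = 1; i <= 10_000; i++) {
                ctx.getBean(SyncActor.class);                          // prototype scope: a NEW instance each time
                if (i % 2_000 == 0) {
                    System.out.println("beans pinned in BEAN_LIST: " + BEAN_LIST.size());
                }
            }
        }
    }
}

Every getBean call adds one more instance that can never be garbage-collected, and each pinned instance keeps alive whatever it references — which, at our data volumes, is how a 32 GB heap fills up.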

To verify this reasoning, we attached arthas to the "healthy" cluster B application and checked the size and contents of BEAN_LIST: it already held 7000+ actor objects. We also pulled an hprof from that cluster and confirmed the same memory leak; that cluster simply pulls data less often and at a lower rate, so the problem had not yet fully surfaced.

// use OGNL to check the size of BEAN_LIST
getstatic com.xxx.springboot.property.UpdateValue BEAN_LIST size
// inspect the contents of BEAN_LIST
getstatic com.xxx.springboot.property.UpdateValue BEAN_LIST

com.xxx.springboot.property.UpdateValue belongs to the configuration-service SDK xxx-sdk-java-springboot-starter-3.1.3.jar, and it was confirmed that this package was indeed newly introduced in the release that went out at 20:00 on the evening of the 17th.

The short-term fix is to strip out this package and the newly introduced functionality that depends on it, then release again.

Long-term fix: consume the configuration center over HTTP instead of through this SDK.
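For reference only — this is not what we shipped — if the SDK's processor could be patched, one possible guard, sketched here with hypothetical names, would be to register only singleton-scoped beans so that prototype actors never land in the static list:

import org.springframework.beans.BeansException;
import org.springframework.beans.factory.BeanFactory;
import org.springframework.beans.factory.BeanFactoryAware;
import org.springframework.beans.factory.config.BeanPostProcessor;
import org.springframework.beans.factory.config.ConfigurableListableBeanFactory;

public class SingletonOnlyValueProcessor implements BeanPostProcessor, BeanFactoryAware {

    private ConfigurableListableBeanFactory beanFactory;

    @Override
    public void setBeanFactory(BeanFactory beanFactory) throws BeansException {
        this.beanFactory = (ConfigurableListableBeanFactory) beanFactory;
    }

    @Override
    public Object postProcessBeforeInitialization(Object bean, String beanName) throws BeansException {
        boolean singleton = beanFactory.containsBeanDefinition(beanName)
                && beanFactory.getBeanDefinition(beanName).isSingleton();
        if (singleton) {
            // ...the existing @Value / @ConfigurationProperties registration logic would go here...
        }
        return bean;
    }
}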
