Performance Optimization - Technical Topics - Analyzing High CPU Usage with top and jstack

{"type":"doc","content":[
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#F5222D","name":"red"}},{"type":"strong"}],"text":"When we say CPU usage is “too high”, there is an implicit baseline we are comparing against. For example, if a JVM averages 40% CPU utilization under peak load, a jump to 80% can be considered abnormal."}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#FF827B","name":"pink"}},{"type":"strong"}],"text":"A typical JVM process contains many Java threads, some waiting for work and others executing tasks"},{"type":"text","text":". "},{"type":"text","marks":[{"type":"strong"}],"text":"In a simple standalone Java program the thread count can be very low, whereas an Internet backend that handles many concurrent transactions may run a large number of threads."}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"For a CPU problem, the most important thing is to find out "},{"type":"text","marks":[{"type":"strong"}],"text":"which threads are consuming CPU"},{"type":"text","text":" and then use their stack traces to pinpoint the offending code. If no individual thread shows particularly high CPU usage, suspect that thread context switching is what is driving CPU usage up. Let's walk through an example of diagnosing a CPU problem."}]},
{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"Locating the threads and code with high CPU usage"}]},
{"type":"numberedlist","attrs":{"start":null,"normalizeStart":1},"content":[{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":1,"align":null,"origin":null},"content":[{"type":"text","text":"Write a program that simulates excessive CPU usage. It creates a thread pool with 4096 threads."}]}]}]},
{"type":"paragraph","attrs":{"indent":1,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"The code is as follows:"}]},
{"type":"codeblock","attrs":{"lang":null},"content":[{"type":"text","text":"import java.util.concurrent.ExecutorService;\nimport java.util.concurrent.Executors;\nimport java.util.stream.IntStream;\n\nimport org.springframework.boot.SpringApplication;\nimport org.springframework.boot.autoconfigure.SpringBootApplication;\nimport org.springframework.scheduling.annotation.EnableScheduling;\nimport org.springframework.scheduling.annotation.Scheduled;\n\n@SpringBootApplication\n@EnableScheduling\npublic class DemoApplication {\n\n    // Thread pool with 4096 threads.\n    private ExecutorService executor = Executors.newFixedThreadPool(4096);\n\n    // Shared counter; access to it must be synchronized.\n    private int count;\n\n    // Every 10 ms, submit 1,000,000 tasks to the pool.\n    @Scheduled(fixedRate = 10)\n    public void lockContention() {\n        IntStream.range(0, 1000000)\n                .forEach(i -> executor.submit(this::incrementSync));\n    }\n\n    // The task itself: increment count (wrapping around at 10,000,000).\n    private synchronized void incrementSync() {\n        count = (count + 1) % 10000000;\n    }\n\n    public static void main(String[] args) {\n        SpringApplication.run(DemoApplication.class, args);\n    }\n}\n"}]},
{"type":"numberedlist","attrs":{"start":"2","normalizeStart":"2"},"content":[{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":2,"align":null,"origin":null},"content":[{"type":"text","text":"Start the program on Linux:"}]}]}]},
{"type":"codeblock","attrs":{"lang":null},"content":[{"type":"text","text":"java -Xss256k -jar demo-0.0.1-SNAPSHOT.jar\n"}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Note that I set the thread stack size to "},{"type":"text","marks":[{"type":"strong"}],"text":"256KB"},{"type":"text","text":" here. Because we need to create "},{"type":"text","marks":[{"type":"strong"}],"text":"4096"},{"type":"text","text":" threads, the operating system's default stack size of "},{"type":"text","marks":[{"type":"strong"}],"text":"8192KB"},{"type":"text","text":" (the usual Linux ulimit -s value) is too large for a test program. (HotSpot's own default -Xss for Java threads on x86-64 Linux is 1024KB.)"}]},
{"type":"numberedlist","attrs":{"start":"3","normalizeStart":"3"},"content":[{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":3,"align":null,"origin":null},"content":[{"type":"text","text":"Running top shows that the Java process's CPU usage has reached 262.3%. Note its process ID: 4361."}]}]}]},
{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/bd/bd0c2f396cfdaae0a4d02528f2419150.png","alt":null,"title":null,"style":null,"href":null,"fromPaste":true,"pastePass":true}},
{"type":"numberedlist","attrs":{"start":"4","normalizeStart":"4"},"content":[{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":4,"align":null,"origin":null},"content":[{"type":"text","text":"Next, use a more fine-grained top command to see how much CPU each thread in this Java process is using:"}]}]}]},
{"type":"codeblock","attrs":{"lang":null},"content":[{"type":"text","text":"top -H -p 4361"}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/c3/c325386a59359258b9591242f898b52e.png","alt":null,"title":null,"style":null,"href":null,"fromPaste":true,"pastePass":true}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"The figure shows that a thread named “scheduling-1” is using a relatively large share of CPU: 42.5%. So the next step is to find out what this thread is doing."}]},
{"type":"numberedlist","attrs":{"start":"5","normalizeStart":"5"},"content":[{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":5,"align":null,"origin":null},"content":[{"type":"text","text":"To find out what the thread is doing, use the jstack command to take a thread dump:"}]}]}]},
{"type":"codeblock","attrs":{"lang":null},"content":[{"type":"text","text":"jstack 4361"}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"The jstack output is fairly large, so you can redirect it to a file:"}]},
{"type":"codeblock","attrs":{"lang":null},"content":[{"type":"text","text":"jstack 4361 > 4361.log"}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Then open 4361.log and locate the thread named “scheduling-1” that we found in step 4. Its stack trace is:"}]},
{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/f3/f3dc1b00e59a234407cdd89dfff52c73.png","alt":null,"title":null,"style":null,"href":null,"fromPaste":true,"pastePass":true}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"In the stack we see a call to "},{"type":"text","marks":[{"type":"color","attrs":{"color":"#F5222D","name":"red"}},{"type":"strong"}],"text":"AbstractExecutorService.submit"},{"type":"text","text":". This tells us that it is the periodic task thread started by Spring Boot (the thread that runs the @Scheduled method), which submits tasks to the thread pool, and that this thread consumes a lot of CPU."}]},
{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"Further analysis: context-switching overhead"}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Usually the process above is enough to find the threads that consume a lot of CPU and the offending code, such as an infinite loop. In this example, however, notice something: the Java process is using 262.3% CPU, while the “scheduling-1” thread is using only "},{"type":"text","marks":[{"type":"color","attrs":{"color":"#F5222D","name":"red"}},{"type":"strong"}],"text":"42.5%"},{"type":"text","text":". So what is using the remaining nearly "},{"type":"text","marks":[{"type":"color","attrs":{"color":"#F5222D","name":"red"}},{"type":"strong"}],"text":"220%"},{"type":"text","text":"?"}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"You may have noticed that the thread list we got from "},{"type":"text","marks":[{"type":"color","attrs":{"color":"#F5222D","name":"red"}},{"type":"strong"}],"text":"top -H -p 4361"},{"type":"text","text":" in step 4 also contains many threads named “"},{"type":"text","marks":[{"type":"color","attrs":{"color":"#F5222D","name":"red"}},{"type":"strong"}],"text":"pool-1-thread-x"},{"type":"text","text":"”. Individually their CPU usage is low, but there seem to be a lot of them. As you may have guessed, these are the pool's worker threads. Are they what is consuming the remaining "},{"type":"text","marks":[{"type":"color","attrs":{"color":"#F5222D","name":"red"}},{"type":"strong"}],"text":"220% of CPU"},{"type":"text","text":"?"}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"To answer this, we need to look at the jstack output again, mainly to see whether these pool threads are actually doing work or just “resting”."}]},
{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/dd/ddd16aad196a2708c19a943be2f91fdc.png","alt":null,"title":null,"style":null,"href":null,"fromPaste":true,"pastePass":true}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"The figure shows that almost all of these “"},{"type":"text","marks":[{"type":"strong"}],"text":"pool-1-thread-x"},{"type":"text","text":"” threads are in the "},{"type":"text","marks":[{"type":"strong"}],"text":"WAITING"},{"type":"text","text":" state. So what is the "},{"type":"text","marks":[{"type":"strong"}],"text":"WAITING"},{"type":"text","text":" state? More generally, what states can a Java thread be in? The following figure illustrates them:"}]},
{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/4d/4d8253acfc05af1de46c747822c06ca3.png","alt":null,"title":null,"style":null,"href":null,"fromPaste":true,"pastePass":true}},
{"type":"paragraph","attrs":{"indent":1,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"As the figure shows, “Blocking” and “Waiting” are two different states (the corresponding java.lang.Thread.State constants are BLOCKED and WAITING). It is important to understand the difference:"}]},
{"type":"bulletedlist","content":[{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#F5222D","name":"red"}},{"type":"strong"}],"text":"Blocking means a thread is blocked waiting for a monitor lock so that it can enter a synchronized block or method. Note that a thread in this state has not yet obtained the lock. A thread waiting for a java.util.concurrent Lock (for example ReentrantLock) is parked instead, so it is reported as WAITING or TIMED_WAITING rather than BLOCKED."}]}]},{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#F5222D","name":"red"}},{"type":"strong"}],"text":"Waiting means a thread is waiting for another thread to perform a particular action. A thread enters the Waiting state when it calls Object.wait or Thread.join without a timeout, or LockSupport.park. Object.wait requires the thread to already hold the object's monitor, and the JVM releases that monitor when the thread starts waiting. After another thread calls Object.notify or notifyAll, the waiting thread must compete for the monitor again, and it can continue in the Runnable state only once it has reacquired it. LockSupport.park does not require holding any lock: the parked thread becomes Runnable again when another thread calls LockSupport.unpark on it (or when it is interrupted)."}]}]}]},
{"type":"paragraph","attrs":{"indent":1,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"Back to our “pool-1-thread-x” threads: they are all in the Waiting state. Their stacks show them waiting inside a call to getTask. Each thread tries to take a task from the pool's queue, finds the queue empty, and enters the Waiting state through a LockSupport.park call."}]},
{"type":"paragraph","attrs":{"indent":1,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"How many “pool-1-thread-x” threads are there? Counting them with the following command gives 4096, exactly the number of threads in the pool."}]},
{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/2c/2cc664a3f61f0a566dfbfbd64f5b1cb4.png","alt":null,"title":null,"style":null,"href":null,"fromPaste":true,"pastePass":true}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"So what is consuming the remaining 220% of CPU? At this point we should suspect context-switching overhead, because the Java process has a large number of threads. Let's use the vmstat command to look at thread context-switching activity at the operating-system level:"}]},
{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/34/34f8ade6e7ad038c80beb6eb86fb7bd7.png","alt":null,"title":null,"style":null,"href":null,"fromPaste":true,"pastePass":true}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"The cs column is "},{"type":"text","marks":[{"type":"color","attrs":{"color":"#F5222D","name":"red"}},{"type":"strong"}],"text":"the number of context switches per second"},{"type":"text","text":", and the in column is the number of interrupts per second. Both numbers are very high, which largely confirms our guess that thread context switching is consuming a lot of CPU. The next question is: which process is causing it?"}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"We stop the Spring Boot test program and run "},{"type":"text","marks":[{"type":"color","attrs":{"color":"#F5222D","name":"red"}},{"type":"strong"}],"text":"vmstat"},{"type":"text","text":" again. Both in and cs drop sharply, which confirms that the Java process 4361 is what caused the context-switching overhead."}]},
{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/77/7759382b7e6ce9a23dd8d4ebcc295182.png","alt":null,"title":null,"style":null,"href":null,"fromPaste":true,"pastePass":true}},
{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"Summary"}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"When you run into high CPU usage, first identify which process is responsible, then use top -H -p pid to narrow it down to specific threads. Next, use jstack to inspect those threads, checking how many there are and what state they are in. If there are too many threads, suspect context-switching overhead, which you can confirm with the vmstat and pidstat tools."}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}}
]}
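For step 2, a rough calculation shows why the stack size matters when the program creates 4096 threads. Multiplying the thread count by the per-thread stack size gives an upper bound on the address space reserved for the pool's stacks alone (binary units, 1 GB = 1024 × 1024 KB). This is a back-of-the-envelope sketch: the reservation is virtual, and physical memory is committed only as a stack is actually used.

```latex
\begin{aligned}
4096 \times 256\ \text{KB}  &= 1{,}048{,}576\ \text{KB} = 1\ \text{GB} \\
4096 \times 1024\ \text{KB} &= 4{,}194{,}304\ \text{KB} = 4\ \text{GB} \\
4096 \times 8192\ \text{KB} &= 33{,}554{,}432\ \text{KB} = 32\ \text{GB}
\end{aligned}
```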
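In step 4, `top -H -p 4361` provides a per-thread view from outside the process. A similar view can be approximated from inside a JVM. The sketch below is an illustration and not part of the original walkthrough. It uses the standard `java.lang.management.ThreadMXBean` to list the current JVM's threads ordered by accumulated CPU time. Note that this is cumulative CPU time since each thread started, not the instantaneous percentage that top shows. The names `ThreadCpuTop` and `topThreads` are hypothetical.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class ThreadCpuTop {

    /** One row of the report: thread id, name, state and accumulated CPU time in nanoseconds. */
    public static final class ThreadCpu {
        public final long id;
        public final String name;
        public final Thread.State state;
        public final long cpuNanos;

        ThreadCpu(long id, String name, Thread.State state, long cpuNanos) {
            this.id = id;
            this.name = name;
            this.state = state;
            this.cpuNanos = cpuNanos;
        }
    }

    /** Returns up to {@code limit} live threads of this JVM, highest accumulated CPU time first. */
    public static List<ThreadCpu> topThreads(int limit) {
        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        if (!mx.isThreadCpuTimeSupported()) {
            throw new UnsupportedOperationException("per-thread CPU time is not supported by this JVM");
        }
        if (!mx.isThreadCpuTimeEnabled()) {
            mx.setThreadCpuTimeEnabled(true); // switch measurement on if it was disabled
        }
        List<ThreadCpu> rows = new ArrayList<>();
        for (long id : mx.getAllThreadIds()) {
            long cpu = mx.getThreadCpuTime(id);     // -1 if the thread has died in the meantime
            ThreadInfo info = mx.getThreadInfo(id); // null if the thread has died in the meantime
            if (cpu < 0 || info == null) {
                continue;
            }
            rows.add(new ThreadCpu(id, info.getThreadName(), info.getThreadState(), cpu));
        }
        rows.sort(Comparator.comparingLong((ThreadCpu r) -> r.cpuNanos).reversed());
        return new ArrayList<>(rows.subList(0, Math.min(limit, rows.size())));
    }

    public static void main(String[] args) throws InterruptedException {
        // A deliberately CPU-bound thread so that something stands out in the report.
        Thread spinner = new Thread(() -> {
            long x = 0;
            while (!Thread.currentThread().isInterrupted()) {
                x++;
            }
        }, "busy-spinner");
        spinner.setDaemon(true);
        spinner.start();
        Thread.sleep(500);
        for (ThreadCpu t : topThreads(5)) {
            System.out.printf("%-20s %-14s %10.1f ms%n", t.name, t.state, t.cpuNanos / 1e6);
        }
        spinner.interrupt();
    }
}
```

To get something closer to top's percentages, you would take two snapshots a known interval apart and divide each thread's CPU-time difference by that interval. That step is left out to keep the sketch short.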
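Step 5 finds “scheduling-1” in 4361.log by its name. Sometimes you only have the numeric thread ID from `top -H`, which is the PID column shown in decimal. In that case the usual approach is to match it against the `nid` field in each thread header of the jstack output. Depending on the JDK version, `nid` is printed either in hexadecimal (for example `nid=0x1112` for thread 4370, as in JDK 8) or in decimal. The sketch below therefore accepts both forms. The class name `JstackLookup` is hypothetical. `SAMPLE` is a hand-written, abbreviated imitation of jstack output with made-up ids, not real output from the demo. Its second entry uses the decimal form only so that both branches get exercised.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class JstackLookup {

    // Native thread id in a jstack thread header: "nid=0x1112" (hex) or "nid=4370" (decimal).
    private static final Pattern NID = Pattern.compile("\\bnid=(0x[0-9a-fA-F]+|\\d+)");

    // Hand-written, abbreviated sample shaped like two jstack thread entries; all ids are made up.
    public static final String SAMPLE =
            "\"scheduling-1\" #23 prio=5 os_prio=0 tid=0x00007f0c8c2f6800 nid=0x1112 runnable\n"
          + "   java.lang.Thread.State: RUNNABLE\n"
          + "\tat java.util.concurrent.AbstractExecutorService.submit(...)\n"
          + "\n"
          + "\"pool-1-thread-1\" #24 prio=5 os_prio=0 tid=0x00007f0c8c2f8000 nid=4371 waiting on condition\n"
          + "   java.lang.Thread.State: WAITING (parking)\n";

    /** Parses a nid value as jstack prints it: hexadecimal with a 0x prefix, or plain decimal. */
    static long parseNid(String raw) {
        return raw.startsWith("0x") ? Long.parseLong(raw.substring(2), 16) : Long.parseLong(raw);
    }

    /**
     * Returns the entry (header line plus the lines up to the next blank line) of the thread whose
     * native id equals tid, i.e. the decimal PID column shown by top -H; returns null if absent.
     */
    public static String findByTid(String dump, long tid) {
        String[] lines = dump.split("\\R");
        for (int i = 0; i < lines.length; i++) {
            if (!lines[i].startsWith("\"")) {
                continue; // thread headers start with the quoted thread name
            }
            Matcher m = NID.matcher(lines[i]);
            if (m.find() && parseNid(m.group(1)) == tid) {
                StringBuilder entry = new StringBuilder(lines[i]);
                for (int j = i + 1; j < lines.length && !lines[j].trim().isEmpty(); j++) {
                    entry.append('\n').append(lines[j]);
                }
                return entry.toString();
            }
        }
        return null;
    }

    public static void main(String[] args) {
        long tid = 4370; // in practice, taken from the PID column of: top -H -p <pid>
        System.out.println("tid " + tid + " = nid 0x" + Long.toHexString(tid));
        System.out.println(findByTid(SAMPLE, tid));
    }
}
```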
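The “nearly 220%” in the context-switching section comes straight from the two top readings. In top's default (Irix) mode, a process's %CPU is summed across cores, so 100% corresponds to one fully busy core:

```latex
\begin{aligned}
262.3\% - 42.5\% &= 219.8\% \approx 220\% \\
\frac{219.8\%}{100\%\ \text{per core}} &\approx 2.2\ \text{cores}
\end{aligned}
```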
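You can observe the difference between the Blocking and Waiting states directly with `Thread.getState()`. The sketch below uses illustrative class and method names. It starts three threads:

- one trying to enter a `synchronized` block whose monitor the main thread holds;
- one calling `LockSupport.park`;
- one calling `lock()` on a `ReentrantLock` that the main thread holds.

The sketch polls each thread for up to 5 seconds until it reaches the expected state, then returns what it saw. On HotSpot it typically reports BLOCKED, WAITING and WAITING. The third case shows why a thread waiting for a `java.util.concurrent` lock appears as WAITING rather than BLOCKED in jstack output.

```java
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.locks.LockSupport;
import java.util.concurrent.locks.ReentrantLock;

public class ThreadStatesDemo {

    private static final Object MONITOR = new Object();

    /** Spins until t reaches the expected state or about 5 seconds pass; returns the last state seen. */
    static Thread.State awaitState(Thread t, Thread.State expected) {
        long deadline = System.nanoTime() + 5_000_000_000L;
        Thread.State s = t.getState();
        while (s != expected && System.nanoTime() < deadline) {
            Thread.onSpinWait();
            s = t.getState();
        }
        return s;
    }

    static void joinQuietly(Thread t) {
        try {
            t.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    /**
     * Returns the states observed for three threads: [0] waiting to enter a synchronized block,
     * [1] parked with LockSupport.park, [2] waiting to acquire a ReentrantLock.
     */
    public static Thread.State[] observe() {
        ReentrantLock lock = new ReentrantLock();
        AtomicBoolean released = new AtomicBoolean(false);

        Thread onMonitor = new Thread(() -> { synchronized (MONITOR) { } }, "on-monitor");
        Thread parked = new Thread(() -> {
            while (!released.get()) {
                LockSupport.park(); // loop because park may return spuriously
            }
        }, "parked");
        Thread onLock = new Thread(() -> { lock.lock(); lock.unlock(); }, "on-reentrant-lock");

        Thread.State[] states = new Thread.State[3];
        lock.lock();
        try {
            synchronized (MONITOR) { // main thread holds both the monitor and the ReentrantLock
                onMonitor.start();
                parked.start();
                onLock.start();
                states[0] = awaitState(onMonitor, Thread.State.BLOCKED);
                states[1] = awaitState(parked, Thread.State.WAITING);
                states[2] = awaitState(onLock, Thread.State.WAITING);
            }
        } finally {
            lock.unlock();
        }
        released.set(true);
        LockSupport.unpark(parked);
        joinQuietly(onMonitor);
        joinQuietly(parked);
        joinQuietly(onLock);
        return states;
    }

    public static void main(String[] args) {
        Thread.State[] s = observe();
        System.out.println("waiting for synchronized:  " + s[0]);
        System.out.println("LockSupport.park:          " + s[1]);
        System.out.println("waiting for ReentrantLock: " + s[2]);
    }
}
```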
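The article observes that idle “pool-1-thread-x” workers sit in WAITING inside getTask. You can reproduce this on a small scale. The sketch below builds a pool with the same settings that `Executors.newFixedThreadPool` uses: core size equal to maximum size and an unbounded `LinkedBlockingQueue`. It pre-starts the workers without giving them any work. It then reports each worker's state and whether `ThreadPoolExecutor.getTask` is on its stack. The recording `ThreadFactory` and the `IdleWorkerState` class are illustrative additions. The check for the getTask frame relies on the name of a private JDK method, which is an implementation detail that could change.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.Executors;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ThreadFactory;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

public class IdleWorkerState {

    /** What we observe about one idle worker. */
    public static final class Worker {
        public final String name;
        public final Thread.State state;
        public final boolean inGetTask;

        Worker(String name, Thread.State state, boolean inGetTask) {
            this.name = name;
            this.state = state;
            this.inGetTask = inGetTask;
        }

        @Override
        public String toString() {
            return name + " " + state + (inGetTask ? " (in ThreadPoolExecutor.getTask)" : "");
        }
    }

    /** Starts poolSize idle workers in a fixed-size pool and reports their state. */
    public static List<Worker> inspectIdleWorkers(int poolSize) {
        List<Thread> threads = new CopyOnWriteArrayList<>();
        ThreadFactory base = Executors.defaultThreadFactory(); // names threads pool-N-thread-M
        ThreadFactory recording = r -> {
            Thread t = base.newThread(r);
            threads.add(t); // keep a reference so the worker can be inspected later
            return t;
        };
        // Same settings Executors.newFixedThreadPool uses: core = max = poolSize, unbounded queue.
        ThreadPoolExecutor pool = new ThreadPoolExecutor(poolSize, poolSize, 0L, TimeUnit.MILLISECONDS,
                new LinkedBlockingQueue<>(), recording);
        pool.prestartAllCoreThreads(); // workers start with no task and go straight to getTask()
        try {
            List<Worker> result = new ArrayList<>();
            for (Thread t : threads) {
                long deadline = System.nanoTime() + 5_000_000_000L;
                while (t.getState() != Thread.State.WAITING && System.nanoTime() < deadline) {
                    Thread.onSpinWait();
                }
                boolean inGetTask = Arrays.stream(t.getStackTrace())
                        .anyMatch(f -> f.getMethodName().equals("getTask"));
                result.add(new Worker(t.getName(), t.getState(), inGetTask));
            }
            return result;
        } finally {
            pool.shutdownNow(); // interrupts the idle workers so they exit
        }
    }

    public static void main(String[] args) {
        inspectIdleWorkers(4).forEach(System.out::println);
    }
}
```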
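The article shows the command that counts the 4096 “pool-1-thread-x” threads only as a screenshot, so it is not reproduced here. As an alternative, the sketch below (a hypothetical `JstackStats` class) reads a jstack dump as text. It selects threads whose name starts with a given prefix and tallies them by their `java.lang.Thread.State` line. This answers both “how many?” and “in what state?” in one pass. `SAMPLE` is a hand-written, abbreviated imitation of jstack output, not output from the demo. For a real dump, you would pass in the contents of 4361.log, for example `Files.readString(Path.of("4361.log"))` on Java 11+.

```java
import java.util.Map;
import java.util.TreeMap;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class JstackStats {

    // The quoted thread name at the start of a thread header line.
    private static final Pattern HEADER = Pattern.compile("^\"([^\"]*)\"");
    // The Java-level state line after a header, e.g. "java.lang.Thread.State: WAITING (parking)".
    private static final Pattern STATE = Pattern.compile("^\\s*java\\.lang\\.Thread\\.State: (\\w+)");

    // Hand-written, abbreviated sample shaped like jstack output; names and ids are made up.
    public static final String SAMPLE = String.join("\n",
            "\"scheduling-1\" #23 prio=5 os_prio=0 tid=0x0000000000000001 nid=0x1112 runnable",
            "   java.lang.Thread.State: RUNNABLE",
            "",
            "\"pool-1-thread-1\" #24 prio=5 os_prio=0 tid=0x0000000000000002 nid=0x1113 waiting on condition",
            "   java.lang.Thread.State: WAITING (parking)",
            "",
            "\"pool-1-thread-2\" #25 prio=5 os_prio=0 tid=0x0000000000000003 nid=0x1114 waiting on condition",
            "   java.lang.Thread.State: WAITING (parking)",
            "",
            "\"pool-1-thread-3\" #26 prio=5 os_prio=0 tid=0x0000000000000004 nid=0x1115 waiting for monitor entry",
            "   java.lang.Thread.State: BLOCKED (on object monitor)",
            "",
            "\"VM Thread\" os_prio=0 tid=0x0000000000000005 nid=0x1116 runnable");

    /** Counts threads whose name starts with prefix, grouped by their java.lang.Thread.State. */
    public static Map<String, Integer> stateCounts(String dump, String prefix) {
        Map<String, Integer> counts = new TreeMap<>();
        boolean matching = false;
        for (String line : dump.split("\\R")) {
            Matcher header = HEADER.matcher(line);
            if (header.find()) {
                matching = header.group(1).startsWith(prefix);
                continue;
            }
            Matcher state = STATE.matcher(line);
            if (matching && state.find()) {
                counts.merge(state.group(1), 1, Integer::sum);
                matching = false; // count each thread once
            }
        }
        return counts;
    }

    /** Total number of matching threads across all states. */
    public static int total(Map<String, Integer> counts) {
        return counts.values().stream().mapToInt(Integer::intValue).sum();
    }

    public static void main(String[] args) {
        Map<String, Integer> counts = stateCounts(SAMPLE, "pool-1-thread-");
        System.out.println(counts + " total=" + total(counts)); // prints {BLOCKED=1, WAITING=2} total=3
    }
}
```

The prefix `pool-1-thread-` includes the trailing hyphen so that threads from a pool such as `pool-10-thread-1` are not counted. Threads without a Java-level state line, such as the "VM Thread" entry, are never counted.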
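The summary recommends confirming context-switching overhead with vmstat and pidstat. `pidstat -w` reports voluntary (cswch/s) and non-voluntary (nvcswch/s) context switches. On Linux, the kernel also exposes the corresponding per-thread counters as `voluntary_ctxt_switches` and `nonvoluntary_ctxt_switches` in `/proc/<pid>/task/<tid>/status`. The sketch below is Linux-only and uses the hypothetical name `CtxSwitches`. It parses those two fields and sums them over all threads of a process. It then samples twice, one second apart, to estimate how many switches that process performed in the interval. The result is approximate: threads that start or exit between the two samples skew the difference.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.stream.Stream;

public class CtxSwitches {

    /** Returns {voluntary, nonvoluntary} context switches parsed from a /proc status file; -1 if absent. */
    public static long[] parseStatus(String statusText) {
        long voluntary = -1;
        long nonVoluntary = -1;
        for (String line : statusText.split("\\R")) {
            if (line.startsWith("voluntary_ctxt_switches:")) {
                voluntary = Long.parseLong(line.substring(line.indexOf(':') + 1).trim());
            } else if (line.startsWith("nonvoluntary_ctxt_switches:")) {
                nonVoluntary = Long.parseLong(line.substring(line.indexOf(':') + 1).trim());
            }
        }
        return new long[] {voluntary, nonVoluntary};
    }

    /** Sums both counters over every thread listed under /proc/<pid>/task (Linux only). */
    public static long totalSwitches(long pid) throws IOException {
        long total = 0;
        try (Stream<Path> tasks = Files.list(Paths.get("/proc", Long.toString(pid), "task"))) {
            for (Path task : (Iterable<Path>) tasks::iterator) {
                try {
                    long[] c = parseStatus(new String(Files.readAllBytes(task.resolve("status"))));
                    total += Math.max(c[0], 0) + Math.max(c[1], 0);
                } catch (IOException e) {
                    // the thread exited between listing and reading; skip it
                }
            }
        }
        return total;
    }

    public static void main(String[] args) throws Exception {
        long pid = args.length > 0 ? Long.parseLong(args[0]) : ProcessHandle.current().pid();
        long before = totalSwitches(pid);
        Thread.sleep(1000);
        long after = totalSwitches(pid);
        System.out.println("pid " + pid + ": ~" + (after - before) + " context switches in 1 s");
    }
}
```

For the demo, you would run it as `java CtxSwitches 4361` while the Spring Boot program is running, and again after stopping it.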