今日頭條 ANR 優化實踐系列分享 - 實例剖析集錦

{"type":"doc","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"在前文,我們用了較多的篇幅介紹了"},{"type":"link","attrs":{"href":"https:\/\/mp.weixin.qq.com\/s?__biz=MzI1MzYzMjE0MQ==&mid=2247488182&idx=1&sn=6337f1b51d487057b162064c3e24c439&chksm=e9d0d954dea75042193ed09f30eb8ba0acd93870227c5d33b33361b739a03562afb685df9215&token=36916042&lang=zh_CN&scene=21#wechat_redirect","title":null,"type":null},"content":[{"type":"text","text":"ANR 設計原理及影響因素"}],"marks":[{"type":"strong"}]},{"type":"text","text":",並根據不同場景進行了分類,如:當前消息嚴重耗時,歷史消息耗時嚴重,業務異常密集執行,進程內資源搶佔,進程間資源搶佔等場景。爲了應對系統監控能力不足以及應用側獲取信息受限的情況,我們在應用側實現了一套"},{"type":"link","attrs":{"href":"https:\/\/mp.weixin.qq.com\/s?__biz=MzI1MzYzMjE0MQ==&mid=2247488182&idx=1&sn=6337f1b51d487057b162064c3e24c439&chksm=e9d0d954dea75042193ed09f30eb8ba0acd93870227c5d33b33361b739a03562afb685df9215&token=36916042&lang=zh_CN&scene=21#wechat_redirect","title":null,"type":null},"content":[{"type":"text","text":"消息調度監控工具"}],"marks":[{"type":"strong"}]},{"type":"text","text":",重點監控主線程的“過去,現在和將來”,同時結合相關日誌對 ANR 問題的分析思路進行了總結。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"爲了便於大家更好的理解上述知識,接下來我們將結合工作中遇到的一些比較有代表性的問題,並按照前文歸因分類,由淺入深進行實例解剖,下面就來看看這幾類問題,我們是如何藉助系統日誌和監控工具進行分析及定位的。"}]},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"案例一:當前業務耗時嚴重"}]},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"主線程 Trace 堆棧:"}]},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/1f\/1fe30ef8c06440f5f493fd1f28935525.png","alt":"圖片","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"分析過程:"}]},{"type":"heading","attrs":{"align":null,"level":4},"content":[{"type":"text","text":"分析堆棧:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"根據前文講到的問題分析思路,首先觀察 Trace 主線程堆棧,從上面堆棧,可以看到正在業務邏輯,但是當前這個業務邏輯是否耗時呢?如果耗時較長,是不是受到系統調度影響呢?帶着這些疑問,我們來看一下系統側的表現。"}]},{"type":"heading","attrs":{"align":null,"level":4},"content":[{"type":"text","text":"分析系統&進程負載:"}]},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/8f\/8f4a9f740bb5a534ec1691841629a661.png","alt":"圖片","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"觀察不同時段系統負載:"},{"type":"text","text":" 在 ANR Info 中我們搜索 Load 關鍵字,發現 ANR 前 1,前 5,前 15 分鐘("},{"type":"codeinline","content":[{"type":"text","text":"Load:7.45 \/ 6.92 \/ 6.84"}]},{"type":"text","text":"),系統 CPU 整體負載並不高。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"觀察進程 CPU 使用率:"},{"type":"text","text":" 進一步觀察 "},{"type":"codeinline","content":[{"type":"text","text":"CPU usage from 0 ms to 8745ms later"}]},{"type":"text","text":",看到 ANR 之後 "},{"type":"text","marks":[{"type":"strong"}],"text":"(later:表示發生 ANR 後一段時間各進程 CPU 佔比,ago 表示:ANR 之後一段時間,各進程 CPU 佔比)的這段時間,應用主進程 CPU 佔比達到 161%,而且在 user,kernel 佔比分佈上,kernel 比例高達 103%!"},{"type":"text","text":" 通過前文的介紹,我們知道對應用來說 kernel 空間 CPU 佔比較高,說明應用側應該發生了大量的系統調用,對普通應用來說,發生系統調用的多數場景都是文件 IO 操作。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"觀察線程:"},{"type":"text","text":" 繼續觀察上圖,可以看到應用進程內部的線程 CPU 佔用,可以看到"},{"type":"text","marks":[{"type":"strong"}],"text":"主線程的 CPU 佔比 93%,其中 kernel 佔用 69%,明顯高於用戶空間 23%"},{"type":"text","text":",說明 ANR 之後的這 8S 多,主線程依然在進行大量系統調用。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"系統側結論:"},{"type":"text","text":" 通過觀察系統負載和各個進程的 CPU 使用情況,我們發現系統整體負載比較正常,但是我們的主進程 CPU 佔比明顯偏高,並且集中在主線程。"}]},{"type":"heading","attrs":{"align":null,"level":4},"content":[{"type":"text","text":"業務耗時:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"根據上面的分析,我們將思路再次回到主線程,進一步觀察 Trace 堆棧,發現上面有一個明顯的系統調用(read),難道是本次系統調用導致的 kernel 佔比較高嗎?那麼當前邏輯在 ANR 發生前已經持續了多久呢?分析到這裏,如果沒有進一步的監控,或者業務不認可當前業務邏輯耗時,那麼事情到這裏可能就要告一段落,或者需要在業務層添加監控埋點進行復現了,但是如果每次遇到這類問題都要添加埋點,那麼將是一個非常糟糕的事情。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"別忘了,在前面介紹 ANR 問題分析思路時,我們還有一個殺手鐗(參見:"},{"type":"link","attrs":{"href":"https:\/\/mp.weixin.qq.com\/s?__biz=MzI1MzYzMjE0MQ==&mid=2247488182&idx=1&sn=6337f1b51d487057b162064c3e24c439&chksm=e9d0d954dea75042193ed09f30eb8ba0acd93870227c5d33b33361b739a03562afb685df9215&token=36916042&lang=zh_CN&scene=21#wechat_redirect","title":null,"type":null},"content":[{"type":"text","text":"監控工具 Raster"}],"marks":[{"type":"strong"}]},{"type":"text","text":"),通過這個監控工具可以看到很"},{"type":"text","marks":[{"type":"strong"}],"text":"清晰的看到:歷史消息耗時,當前消息持續時間以及消息隊列未調度消息 Block 耗時"},{"type":"text","text":"。結合上面這個問題,我們看到了本次主線程消息調度情況,參見下圖:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/b2\/b25e0d89c49676a399dccf5787b7f2dc.png","alt":"圖片","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"通過還原的時序圖,可以清晰看到 ANR 之前,主線程消息調度及耗時基本正常,沒有發現明顯耗時嚴重的消息,但是觀察正在調度的消息,其 Wall duration 超過 9000ms,並且 Cpu duration 長達 4800ms+,說明該消息耗時非常嚴重。而且上圖灰色部分顯示第一個消息,被 Block 了 9S 以上,因此可以進一步的說明該消息被 Block 的時長,基本都是當前正在執行的消息導致。"}]},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"問題結論:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"通過上面這些分析信息和實際數據,"},{"type":"text","marks":[{"type":"strong"}],"text":"我們可以得出如下結論:發生 ANR 問題時,系統負載良好。發生問題前,應用主線程消息調度狀態良好,但是當前正在調度的消息耗時嚴重,導致了後續消息未能及時響應,引起了 ANR"},{"type":"text","text":"。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"帶着這些結論,在和業務對接時便會清晰高效很多。這種場景,屬於比較常規和常見的問題,也是大部分同學排查問題的分析思路,過程相對輕鬆。接下來,我們再看另一種場景的 ANR 問題。"}]},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"案例二:歷史消息耗時嚴重"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"下面分析的這個案例,是大家經常遇到的,也是引起很多人困惑的一種場景:"},{"type":"text","marks":[{"type":"strong"}],"text":"發生 ANR 時,Trace 堆棧落在了 "},{"type":"codeinline","content":[{"type":"text","text":"NativePollOnce"}],"marks":[{"type":"strong"}]},{"type":"text","text":"。根據我們的理解,這個場景一般表示當前線程處於空閒或 JNI 執行完之後,進行上下文切換的過程,相關邏輯比較清晰。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"可是日常遇到的問題,有很多 Case 都屬於這類場景,如果僅僅依靠系統日誌讓我們很難進一步分析。下面就看看我們是如何應對這類問題的:"}]},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"主線程 Trace 堆棧:"}]},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/e5\/e55cfbdaf24a3ea2891bef618d175662.png","alt":"圖片","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"分析思路:"}]},{"type":"heading","attrs":{"align":null,"level":4},"content":[{"type":"text","text":"分析堆棧:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"看到上面的 Trace 堆棧,基本無從下手,線程狀態以及線程 utm,stm 時長都沒有明顯異常,但是線上確實有大量的 ANR 問題落在這個場景,難度在某種極端情況下虛擬機這部分邏輯耗時嚴重?但是從理論上來說這些假設確實不存在的。既然堆棧沒有太多信息,我們就移步到系統層面,看看是否有線索可尋。"}]},{"type":"heading","attrs":{"align":null,"level":4},"content":[{"type":"text","text":"分析系統&進程負載:"}]},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/fd\/fd4b541c8a851f85aadfa2b59d2447d4.png","alt":"圖片","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"觀察系統負載:"},{"type":"text","text":" 在 ANR Info 中查看"},{"type":"codeinline","content":[{"type":"text","text":"Load"}]},{"type":"text","text":"關鍵字,發現系統在前 1 分鐘,前 5 分鐘,前 15 分鐘各個時段負載並不算高,但是最近 1 分鐘 CPU 負載爲 11.81,比前 5,15 分鐘都高,說明負載有加重趨勢。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"觀察進程 CPU 分佈:"},{"type":"text","text":" 進一步觀察 "},{"type":"codeinline","content":[{"type":"text","text":"CPU usage from 0 ms to 5599 later"}]},{"type":"text","text":" 關鍵字,看到 ANR 之後這 5S 多的時間,我們應用主進程 CPU 佔比達到 155%,而且在 user,kernel 層面 CPU 佔比分佈上,user 佔比 124%,明顯很高,這時直覺告訴我們,當前進程應該有問題。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"觀察系統 CPU 分佈:"}]},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/33\/332191a201b17733c28a798382e8493c.png","alt":"圖片","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"進一步分析系統負載,發現整體 CPU 使用率並不高。user 佔比 17%,kernel 佔比 7.5%,iowait 佔比 0.7%,說明這段時間系統 IO 並不繁忙,整體也符合預期。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"系統側結論:"},{"type":"text","text":" 通過觀察系統負載和各個進程的 CPU 使用情況,發現系統環境比較正常,但是我們的主進程 CPU 佔比明顯偏高,但因本次 Anr Info 中未獲取到進程內部各線程 CPU 使用率,需要回到應用側進一步分析。"}]},{"type":"heading","attrs":{"align":null,"level":4},"content":[{"type":"text","text":"應用側分析:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"根據上面的分析,我們將方向轉回到當前進程,通過對比 Trace 中各線程耗時(utm+stm),發現除了主線程之外,其它部分子線程耗時(utm+stm)沒有明顯異常,因此基本排除了進程內部子線程 CPU 競爭導致的問題,因此問題進一步聚焦到主線程。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"接下來再將目光聚焦在主線程消息調度情況,通過監控工具觀察主線程的歷史消息,當前消息以及待調度消息,如下圖:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/99\/99e9c57553369115e35fa2436652140b.png","alt":"圖片","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"通過上圖可以看到當前消息調度 Wall Duration 爲 46ms。在"},{"type":"text","marks":[{"type":"strong"}],"text":"ANR 之前存在一個索引爲 46 的歷史消息耗時嚴重(wall duration:10747ms)"},{"type":"text","text":",同時還有一個索引爲 32 的歷史消息耗時也比較嚴重(wall duration:10747ms)。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"進一步觀察消息隊列第一個待調度消息,發現其實際 block 時長爲 9487ms,因此我們排除了索引爲 32 的歷史消息,認爲"},{"type":"text","marks":[{"type":"strong"}],"text":"索引爲 46 的消息耗時是導致後續消息未能及時調度的主要原因"},{"type":"text","text":"。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"在鎖定索引爲 46 的歷史消息之後,接下來利用監控工具中的"},{"type":"link","attrs":{"href":"https:\/\/mp.weixin.qq.com\/s?__biz=MzI1MzYzMjE0MQ==&mid=2247488182&idx=1&sn=6337f1b51d487057b162064c3e24c439&chksm=e9d0d954dea75042193ed09f30eb8ba0acd93870227c5d33b33361b739a03562afb685df9215&token=36916042&lang=zh_CN&scene=21#wechat_redirect","title":null,"type":null},"content":[{"type":"text","text":"消息堆棧採樣監控"}],"marks":[{"type":"strong"}]},{"type":"text","text":",發現該消息執行過程,多次堆棧採樣均命中創建 Webview 邏輯,原來業務在 UI 繪製過程直接實例化 Webview!(涉及到業務代碼,採用堆棧詳情不在此展示)"}]},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"問題結論:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"通過上面的分析並利用監控工具,我們可以很清晰看到"},{"type":"text","marks":[{"type":"strong"}],"text":"發生 ANR 問題時,NativePollOnce 場景耗時並不長,導致本次 ANR 的主要原因是歷史消息在 UI 繪製過程中同步創建 Webview,導致耗時嚴重。但是期間系統超時監控並沒有觸發,待主線程繼續調度後續消息時,系統監控客戶端響應超時,捕獲了主線程正在執行的邏輯"},{"type":"text","text":"。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"這類場景在線上線下都大量存在,但是從 ANR 設計原理和 Trace 採集流程來看,很多並不耗時的消息在調度過程中都成爲了“替罪羊”。"}]},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"案例三:業務異常密集執行"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"接下來分析的這個 Case,在多個產品都有遇到,僅從堆棧上面看,也是經常遇到並困惑我們的,現象和上面分析的案例有些類似,即業務邏輯很簡單,實際耗時很少,但是經常出現在各種 ANR Trace 上面,只是依照堆棧信息,就把這個問題草率的分配給相應業務方去解決,業務同學大概率也是一頭霧水,不知道從何下手。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"但是如果按照上面兩類問題的分析思路,可能也會陷入困惑,這時如果換個思路,可能會是另一番景象,下面就來看看我們是如何分析的。"}]},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"ANR 現場堆棧:"}]},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/c0\/c0b78f38119405250cd06b1fd63045a7.png","alt":"圖片","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"問題分析:"}]},{"type":"heading","attrs":{"align":null,"level":4},"content":[{"type":"text","text":"分析堆棧:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"根據前文分析思路,先觀察 Trace 主線程堆棧,從上面堆棧,可以看到業務邏輯,第一反應就是業務耗時?按照經驗,我們還是習慣性的再去看看系統日誌,進一步縮小或鎖定方向。"}]},{"type":"heading","attrs":{"align":null,"level":4},"content":[{"type":"text","text":"分析 CPU 負載:"}]},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/bd\/bdb58450b7018a015bf440caeb3efd2f.png","alt":"圖片","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"觀察系統負載:"},{"type":"text","text":" 在 ANR Info 中搜索 Load 關鍵字,看到系統在各個時段(前 1,5,15 分鐘)負載並不高,但是有加重趨勢。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"觀察進程 CPU 分佈:"},{"type":"text","text":" 進一步觀察\"CPU usage from 0 ms to 9460ms later\"期間各個進程 CPU 佔比情況,看到這段時間目標應用的主進程 CPU 佔比達到 153%,而且在 user,kernel 佔比分佈上,"},{"type":"text","marks":[{"type":"strong"}],"text":"user 佔比高達 127%,存在明顯異常。Kernel 佔比 25%,也有些偏高"},{"type":"text","text":"。與此同時,我們進一步觀察 kswapd,mmc 進程 CPU 使用率,發現佔用率不是太高,說明當前系統的整體內存和 IO 並沒有太嚴重問題。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"觀察系統 CPU 分佈:"},{"type":"text","text":" 爲了進一步驗證系統 IO 及內存負載是否正常,接下來再觀察一下系統整體 CPU 使用和分佈,發現 iowait 佔比 7.5%,相對來說有些偏高。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/08\/08c71cbfc2c65d74c267fbb157e5ece7.png","alt":"圖片","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"進程 CPU 再觀察:"},{"type":"text","text":" 與此同時,我們在 ANR Info 裏面還發現了一個關鍵信息,看到了另一個時段問題進程內部主要線程的 CPU 佔用情況,通過下圖我們可以看到主線程 CPU 佔用 95%,屬於明顯偏高。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/c4\/c494f17abd1e8f7759cacba1061ace21.png","alt":"圖片","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"系統側分析結論:通過上面的分析,基本可以得出如下結論:發生問題時,系統 CPU,Mem 負載比較正常,IO 負載有些偏高,發生 ANR 問題的應用進程 CPU 使用率存在異常,而且集中在主線程。"}]},{"type":"heading","attrs":{"align":null,"level":4},"content":[{"type":"text","text":"業務耗時:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"根據上面的分析,已經將排查方向鎖定在主線程 CPU 使用率較高,接下來觀察該線程的 utm+stm,發現累計耗時(1752+128)*10ms=18.8S。對比本次進程啓動時長才 22S,說明進程啓動後,主線程基本是在滿負荷執行!而且主線程的 CPU 響應能力非常不錯!"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"至此我們再次確認主線程存在嚴重耗時,難道又是當前消息或某個歷史消息耗時嚴重?於是快速切換到消息調度監控,一探究竟。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/a9\/a9241eddc4ea69a63b32dcbdd2692f9c.png","alt":"圖片","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"但是看到上面的消息調度監控時序圖,"},{"type":"text","marks":[{"type":"strong"}],"text":"發現當前消息執行時長才 300ms,並不是我們期待的耗時很嚴重哪種場景,進一步觀察 ANR 之前歷史消息調度,也沒有看到有單次耗時嚴重的消息。繼續觀察上圖待調度消息,發現確實被嚴重 Block。既然主線程沒有看到嚴重耗時,系統負載也比較正常,那麼主線程 CPU 使用率爲何會這麼高呢,這個情況與預期不符,不符合常理啊!"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"觀察 Block 消息:"},{"type":"text","text":" 之前我們介紹"},{"type":"link","attrs":{"href":"https:\/\/mp.weixin.qq.com\/s?__biz=MzI1MzYzMjE0MQ==&mid=2247488182&idx=1&sn=6337f1b51d487057b162064c3e24c439&chksm=e9d0d954dea75042193ed09f30eb8ba0acd93870227c5d33b33361b739a03562afb685df9215&token=36916042&lang=zh_CN&scene=21#wechat_redirect","title":null,"type":null},"content":[{"type":"text","text":"Raster 工具"}],"marks":[{"type":"strong"}]},{"type":"text","text":"時,介紹該工具不僅可以記錄歷史消息,標記嚴重耗時,關鍵消息之外。還有一個作用就是獲取消息隊列待調度的消息,於是我們繼續觀察這些被 Block 的消息,發現了一個非常奇怪的現象(由於時序圖只能在鼠標停留時展示單個消息詳情,因此直接截取原始數據),如下圖:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/bc\/bc95af9aebbebe5cbbac355a0158c151.png","alt":"圖片","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"通過對比,發現消息隊列待調度的消息中,除了第一個消息之外,其餘的 200 多個消息(爲了便於展示,目前只獲取最多前 300 個),竟然是同一個 Handler 對象(hash:1173da0)的消息,再進一步對比發現當前正在調度的也是該 Handler 對象的消息。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/36\/363fdcbb47a4502de310dd1b7354c297.png","alt":"圖片","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"這個情況引起了我們的注意,順着這個思路,繼續翻看歷史調度信息,發現每條歷史記錄 "},{"type":"text","marks":[{"type":"strong"}],"text":"(儘管存在多條消息耗時較少被聚合的場景,但是我們會保留最後一個消息的 msg string 信息)"},{"type":"text","text":" 的last msg 也是相同的 handler 對象,如下圖。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/81\/816a48ca82a476b69c62f59c1b874551.png","alt":"圖片","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"這麼多消息都來自同一個 Handler,這麼密集的在主線程執行,每條記錄 cpu 耗時都在 290ms 左右,每條記錄監控統計期間調度了 5 條消息。分析到這裏,我們基本就有答案了,很有可能是當前業務發生異常,導致不停的向主線程發送消息,頻繁密集的消息依次執行,嚴重阻塞了後續消息調度。"}]},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"問題結論:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"帶着上面這些分析信息和監控數據,我們得出如下結論:"},{"type":"text","marks":[{"type":"strong"}],"text":"應用在啓動之後,業務邏輯發生異常,瞬間產生大量消息,儘管單次消息執行耗時並不嚴重,但是這些消息在主線程密集執行,嚴重影響了後續消息調度,進而導致後續消息響應超時。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"對於這個問題,當我們把相關數據和結論反饋給業務同學時,業務同學進一步分析業務邏輯,發現當前邏輯確實存在隱患,改進之後,該 ANR 問題就此得到解決。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"這種場景屬於典型的業務邏輯異常造成的問題,如果沒有監控工具由點到面的聚合和展示,單獨分析某一次消息耗時,無論如何是找不到問題原因的。"}]},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"案例四:進程內 IO 負載異常"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"上面重點介紹了主線程內部環境導致問題的相關案例,介紹當前消息耗時嚴重,歷史消息耗時嚴重,以及消息密集這幾種類型的分析思路,接下來再分析一個進程內 IO 搶佔的案例,下面就來看看如何層層撥雲揭霧,尋找真相的。"}]},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"主線程 Trace 堆棧:"}]},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/69\/6949c8b2f8abe894cd53514580bdaab5.png","alt":"圖片","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"問題分析:"}]},{"type":"heading","attrs":{"align":null,"level":4},"content":[{"type":"text","text":"堆棧分析:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"上面這個 Trace 信息,也是無從下手,堆棧本身是一個很清晰的系統邏輯,但是現實中確實也有大量 ANR Trace 落在這個場景,難道是系統進入 epoll_wait 之後,沒有被及時喚醒?但是從理論上來說這些假設不存在。既然堆棧信息沒有太明顯,我們就把方向轉移到系統側,看看是否有線索。"}]},{"type":"heading","attrs":{"align":null,"level":4},"content":[{"type":"text","text":"系統&進程負載分析:"}]},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/7c\/7c48e8f9230ebdca233f74a2df376516.png","alt":"圖片","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"觀察系統負載:"},{"type":"text","text":" 在 ANR Info 中我們搜索 Load 關鍵字,發現"},{"type":"text","marks":[{"type":"strong"}],"text":"系統各個時段(前 1,5,15 分鐘)負載明顯很高,並且最近 1 分鐘負載爲 71,又明顯高於前 5,15 分鐘,說明有系統負載進一步加重!"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"觀察進程 CPU 分佈:"},{"type":"text","text":" 進一步觀察 "},{"type":"codeinline","content":[{"type":"text","text":"CPU usage from 0 ms to 21506 later"}]},{"type":"text","text":" 關鍵字,看到"},{"type":"text","marks":[{"type":"strong"}],"text":"ANR 之後這段時間,內核線程 kworker 的 CPU 佔比明顯偏高,累計佔比超過 45%!其它系統或應用進程 CPU 使用率普遍偏低。"},{"type":"text","text":" 通過前文介紹我們知道 kworker 屬於內核線程,當 IO 負載過重時會在該線程有所體現。進一步觀察 kswapd0 線程 cpu 使用率,發現只有 2%,而 kswapd0 主要應用在內存緊張的場景,說明這段時間系統內存負載基本正常。通過上面這些信息解讀,可以大致推測最近一段時間系統負載過高,應該是 IO 請求導致,至於系統內存壓力尚可接受,接下來我們繼續求證。"}]},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/05\/05444d58d418f0c569f00a948e0d3673.png","alt":"圖片","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"觀察進一步分析系統整體負載,發現 user 佔比只有 5.4%,kernel 佔比 11%,"},{"type":"text","marks":[{"type":"strong"}],"text":"但是 iowait 佔比高達 57%!明顯高於 user,kernel 使用率,說明這段時間系統 IO 負載非常嚴重。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"而這個 IO 佔比較高,也進一步實錘了我們上面的“觀察進程 CPU 分佈”的結論。那麼是哪個應用導致的呢?遺憾的,受限於系統日誌獲取能力,依靠現有的信息並沒有明顯看到異常進程,那麼 IO 發生在哪裏,是否和當前進程有關呢?於是我們將思路再次回到應用內部。"}]},{"type":"heading","attrs":{"align":null,"level":4},"content":[{"type":"text","text":"應用側分析:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"通過上面的分析,我們基本鎖定了是 IO 負載過重導致的問題,接下來便要進一步排查是否是當前進程內部存在異常,於是我們對比了各個線程的耗時(utm+stm)情況:"}]},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/7c\/7c01d6d678142d3eeee76624f7f6a118.png","alt":"圖片","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"通過上圖線程耗時對比可以清晰的發現,"},{"type":"text","marks":[{"type":"strong"}],"text":"DBHelper-AsyncOp-New 線程無論是 utm 時長,還是 stm 時長,都明顯超過其它線程,而 stm 高達 2915!"},{"type":"text","text":" 這個耗時超出了我們的預期,實際場景中我們認爲主線程才 CPU 消耗大戶,其它線程都是配角。下面我們再看一下線程詳情:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/24\/24588da35d7031feadfc7f623db7e1cc.png","alt":"圖片","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"進一步查看該線程堆棧,"},{"type":"text","marks":[{"type":"strong"}],"text":"發現存在明顯的 IO 操作,而且子線程優先級(nice=0)相對較高,stm(2915)+utm(1259)高達 4000+,換算成時長相當於 CPU 真實執行超過了 40S!"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"對比主線程耗時(utm:1035,stm:216),以及進程啓動時長(4 分 18 秒),可以更好證明了 DBHelper 線程存在異常,stm 明顯過長,說明存在大量系統調用,結合該線程業務,可以很快就猜到是 IO 讀寫引起的問題了。因爲該線程本身就是負責應用內部數據庫清理功能的。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"經過上面的分析之後,下面來看一下主線程調度時序圖,看看 IO 負載過重對主線程有多大影響。"}]},{"type":"heading","attrs":{"align":null,"level":4},"content":[{"type":"text","text":"消息調度時序圖分析"}]},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/43\/439b7f3cec0b2ceea226e91b461ccd1d.png","alt":"圖片","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"通過上圖,可以清晰看到 ANR 消息之前,有多個歷史消息耗時存在明顯異常,而且 Wall duration 與 Cpu duration 耗時比例差距較大,部分消息 cpu 時長更是小於 1ms(單位 ms,0 則表示小於 1ms),說明在此期間主線程整體調度受到很大影響,而且這些消息內部涉及 IO 訪問的邏輯將會受到更大影響。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"同時結合我們現場 checkTime 機制,發現 checkTime 調度明顯存在嚴重 delay 情況。"}]},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"問題結論:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"帶着上面這些分析信息和數據,我們可以得出如下結論:"},{"type":"text","marks":[{"type":"strong"}],"text":"通過層層分析我們可以發現,發生ANR時的當前消息耗時近2S,但並不是root case,主線程出現多個歷史消息耗時,但也不是root case,真正導致本次ANR的原因是DBHelper-AsyncOp線程在過去一段時間進行了長時間的IO操作,嚴重影響了主線程乃至進程本身的CPU調度,導致後續消息響應不及時,觸發系統超時(ANR)。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"對於該類問題,除了應用本身優化之外,也與一些機型設備差異有關,例如不同機型 IO 性能本身就存在很大差異,因此理論上無法徹底解決。同時無論是進程內部還是其他進程進行 IO 密集操作,都可能引起系統 IO 負載過重,進而導致系統乃至所有進程調度受到影響,"},{"type":"text","marks":[{"type":"strong"}],"text":"對於該類問題只能進一步的優化相關邏輯,降低 IO 訪問頻率,減少主線程 IO 訪問等等"},{"type":"text","text":"。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"這類問題,在線上比較常見,但是在開發同學的線下測試過程,性能普遍符合預期,針對線上用戶,應用場景錯綜複雜,絕非線下能模擬,並且針對不同手機設備,不同芯片平臺,甚至磁盤可用空間的差異,其 IO 性能也表現的千差萬別而這些小概率的問題,在數億萬用戶環境中,會頻頻出現。"}]},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"案例五:其它進程及系統負載異常"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"前面我們分析的幾類問題,基本都是應用進程內部因素導致的問題,如主線消息耗時,消息密集執行,子線程 IO 資源搶佔等等。線上環境中,除了進程或主線線程自身因素導致的問題外,還有一些是外部因素導致的問題,如其它進程或系統負載過重,進而影響當前進程正常調度。下面我們就來看一看這類問題是如何分析的。"}]},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"主線程 Trace 堆棧:"}]},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/4a\/4a1a54f1f35f70210e20534543afe3f5.png","alt":"圖片","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"問題分析:"}]},{"type":"heading","attrs":{"align":null,"level":4},"content":[{"type":"text","text":"堆棧分析:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"看到上面這個 Trace 信息,同樣是熟悉的味道,發生 ANR 時,系統正處於 epoll wait 狀態,線程 utm 及 stm 耗時並不算長,累計(376+340)*10=7160ms。觀察到這裏基本沒有看到太多有效信息。接下來繼續把方向轉移到系統側,看看是否有線索可循。"}]},{"type":"heading","attrs":{"align":null,"level":4},"content":[{"type":"text","text":"系統&進程負載分析:"}]},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/db\/dbc4df80d10825a33912ff17de3a7020.png","alt":"圖片","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"觀察系統負載:"},{"type":"text","text":" 在 ANR Info 中搜索 Load 關鍵字,看到系統在各個時段(前 1,5,15 分鐘)負載比較高,分別爲 37.09,39.37,49.44,呈現加重趨勢。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"觀察進程 CPU 分佈:"},{"type":"text","text":" 進一步觀察\"CPU usage from 2401 ms to -22043ms ago\"期間,各個進程 CPU 佔比情況,看到這段時間目標進程 cpu 使用率很低,只有 17%。看到其它關鍵進程或線程,如 Kswapd0 線程,cpu 佔比 20%,對於該線程來說,其出現則表示系統內存比較緊張了,而且看到了與其相關的 kworker,mmcqd 線程 cpu 佔比也比較高。這些線程同時出現,足以說明當前系統環境發生了比較大的問題。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"而系統資源緊張,一般是因爲某個或多個進程出現內存泄漏或大量 IO 讀寫導致,結合上面 Top 進程的 CPU 佔比,com.youku.phone 以及 com.android.contacts 進程可疑性最大。而發生 ANR 問題的 com.ss.android.article.news 進程其 CPU 佔比只有 17%。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"觀察系統 CPU 分佈:"}]},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/70\/70c4360a8714bc4ab3a0c657a48e471d.png","alt":"圖片","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"通過上圖信息可以看到,系統 CPU 整體使用率達到 54%,kernel 佔比 15%,iowait 佔比高達 24%,有些偏高。說明系統負載確實存在異常,其結論與我們上面分析的基本一致。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"當然在這裏比較遺憾的是,因爲是線上問題,我們無法通過拿到系統以及其它進程更多信息,否則可以更加清晰的看到發生異常的是哪個進程,"},{"type":"text","text":" 但是接下來要進一步排除是當前進程導致的系統負載問題,我們將視野再次回到應用側。"}]},{"type":"heading","attrs":{"align":null,"level":4},"content":[{"type":"text","text":"應用側分析:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"通過上面的分析,我們基本鎖定了是內存負載過重導致的問題,接下來便要進一步排查是否是當前進程內部存在異常,於是我們對比了各線程的耗時(utm+stm)情況:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/e0\/e034c760c95344969a027e02644e9948.png","alt":"圖片","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"通過上圖可以看到,排名第一的是主線程,其 utm+stm=700,換算成系統時長,700*10=7000ms,但是"},{"type":"text","marks":[{"type":"strong"}],"text":"對比觀察進程啓動時長,發現進程已經啓動 108S,我們知道對應進程來說,啓動的前 1~2 分鐘,主線程是最爲繁忙的,大量的業務和系統消息需要在主線程進行執行。同時我們進一步對比系統負載正常的情況,進程啓動 2 分鐘時主線程 CPU 執行時長明顯大於當前時長"},{"type":"text","text":"。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"下面我們再來看一下系統負載過重,對主線程消息調度的影響,如下圖:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/4a\/4acbd8dec755502fe79fc1b41caf447b.png","alt":"圖片","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"通過上圖,可以清晰看到 ANR 消息之前,有多個歷史消息耗時存在明顯異常,而且 Wall duration 與 Cpu duration 耗時比例差距較大,從側面也反映了在此期間主線程整體調度受到較大影響。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"超時 Service 消息:"},{"type":"text","text":" 從上圖可以清晰看到第一個待調度消息,其 Block 時長超過 18S 之多,接近於前面諸多耗時消息之和。同時從下圖可以清晰看到發生 ANR 的這個 service 消息在消息隊列排在第 8,消息 block 時長爲 18482ms。"}]},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/82\/82775594a036da8108e330db355ba2be.png","alt":"圖片","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"在前文應用四大組件超時歸類中提到 Service 超時時長分別爲 20S 或 200S,現在該消息在應用側 block 時長爲 18482ms,那就說明剩下 1S 多的時間,耗費在系統 AMS 服務發送到客戶端 Binder 線程接收過程,否則沒有達到 20S 超時,是不會觸發系統超時的。因此也進一步說明了系統調度性能存在問題"},{"type":"text","text":"。"}]},{"type":"heading","attrs":{"align":null,"level":4},"content":[{"type":"text","text":"問題小結"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"帶着上面這些分析信息和相關數據,我們可以得出如下結論:"},{"type":"text","marks":[{"type":"strong"}],"text":"在進程啓動前,系統負載已經很重,整個系統調度性能受到較大的影響,儘管發生 ANR 時當前堆棧耗時較長,但並不是 root case,多個歷史耗時嚴重的消息也不是 root case,而導致本次 ANR 的應該是 com.youku.phone 或 com.android.contacts 進程,在過去一段時間進行大量系統資源訪問,造成系統負載加重,嚴重影響了其他進程 CPU 調度,導致主線程消息處理不及時,觸發系統超時(ANR)"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"對於該類問題,因爲是其它進程導致系統資源緊張,進而影響了當前進程,因此我們無法從根本上解決,當然能夠很好的分析並找出原因,也是對待問題的一種態度吧。"}]},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"案例六:跨進程死鎖"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"在前面更多從應用側介紹了 ANR 案例的分析思路,接下來看看如何藉助更多系統日誌分析這類問題。當然,這類問題如果發生在線上,會因爲應用側無法獲取跨進程 Trace 的原因,可能會被誤歸類爲 IPC 耗時場景。"}]},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"主線程 Trace 堆棧:"}]},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/71\/713b1295d82c361a446404cddb36bb26.png","alt":"圖片","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"問題分析:"}]},{"type":"heading","attrs":{"align":null,"level":4},"content":[{"type":"text","text":"堆棧分析:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"根據前面講到的問題分析思路,先觀察 Trace 主線程堆棧,從上面堆棧,可以看到業務邏輯發生 Binder 調用被 Block 的情況,但是這次拿到的是完整的 Trace 日誌,那麼接下來就沿着 Binder 請求邏輯,看一下響應進程的狀態。"}]},{"type":"heading","attrs":{"align":null,"level":4},"content":[{"type":"text","text":"服務進程分析:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"首先查找客戶端主線程在和哪個 Binder 線程進行通信,搜索代理接口"},{"type":"codeinline","content":[{"type":"text","text":"setAppCallback"}]},{"type":"text","text":"(Android 命名習慣,代理端和服務端函數命名基本保持一致),發現是 Nfc 的 Binder_3 線程響應了客戶端請求:"}]},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/65\/65a3dc43652e173a44ccc99b9182a1e6.png","alt":"圖片","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"但是進一步觀察堆棧信息,發現 Binder_3 線程被當前進程的主線程 Block,那麼沿着這個線索看看主線程狀態:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/8e\/8eae9f3a2d032947041b09286b5cbf9a.png","alt":"圖片","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"觀察主線程狀態,發現此刻主線程也在執行 Binder 通信,請求"},{"type":"codeinline","content":[{"type":"text","text":"createBeamShareData"}]},{"type":"text","text":",同樣根據命名習慣,繼續搜索關鍵字"},{"type":"codeinline","content":[{"type":"text","text":"createBeamShareData"}]},{"type":"text","text":",看看這次請求是哪個進程在響應,結果發現是 ANR 所在進程的 Binder_6 線程響應此請求。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.infoq.cn\/resource\/image\/06\/7f\/06f686a7418f7e0fa3c6f3689c235b7f.jpg","alt":null,"title":"","style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":"","fromPaste":false,"pastePass":false}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"通過觀察 Binder_6 線程的堆棧和狀態,發現該線程處於 await 狀態,明顯是在等待其它線程通知喚醒!分析到這裏,就需要大家結合 Read the Fuck Code 的精神進一步分析業務邏輯了,在研究一番業務邏輯之後,發現喚醒此線程的業務邏輯,已經通過 Handler 發送到主線程,正等待主線程執行呢,但是如果時序處理的不恰當,就會出現主線程還沒來得及執行這個消息,就去監聽 NFC 狀態,進而引起了連鎖反應。至此找到了依賴鏈路:"}]},{"type":"image","attrs":{"src":"https:\/\/static001.infoq.cn\/resource\/image\/e0\/aa\/e09b70ceb473905c3c4b85c756be08aa.jpg","alt":null,"title":"","style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":"","fromPaste":false,"pastePass":false}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"通過上圖可以清晰的看到本次 ANR 原因:跨進程死鎖。"}]},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"總結:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"我們按照前文 ANR 影響因素及歸類分別選取了一個線上案例,並進行分析總結。回過頭來看,第一類問題按照多數人的\"解題思路\"可能會很快的找到答案。在面對第二類問題時,如果沒有監控工具就可能掉入“Trace 陷阱”了。第三類問題並不常見,但是在公司多個產品都有遇到過,因爲這類問題更加隱蔽,如果依靠現有系統日誌只能鎖定方向,之後需要耗費大量的時間去添加埋點分析定位,但是通過我們的監控工具直觀展示並暴露了更多細節,爲成功定位問題扮演了關鍵角色。後面兩類因爲資源搶佔的導致線程調度不及時的問題,通過監控工具很好的還原了 ANR 之前消息調度情況,更加清晰的證實了資源競爭對主線程的影響;但「barrier導致主線程假死」,「Service執行時序顛倒」等問題還有待解決。掌握了上述幾類問題的分析思路,相信可以幫助大家應對工作中遇到大部分問題,但是所謂“林子大了,什麼鳥都有”,下一篇將會介紹主線程假死,Service 執行時序顛倒等更加棘手的案例分析,敬請期待。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"本文轉載自:字節跳動技術團隊(ID:BytedanceTechBlog)"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"原文鏈接:"},{"type":"link","attrs":{"href":"https:\/\/mp.weixin.qq.com\/s\/4-_SnG4dfjMnkrb3rhgUag","title":"xxx","type":null},"content":[{"type":"text","text":"今日頭條 ANR 優化實踐系列分享 - 實例剖析集錦"}]}]}]}
發表評論
所有評論
還沒有人評論,想成為第一個評論的人麼? 請在上方評論欄輸入並且點擊發布.
相關文章