Toutiao ANR Optimization in Practice - Barrier Causes the Main Thread to Appear Frozen

{"type":"doc","content":[
{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"Overview:"}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"In the previous article, we used production cases to analyze six major scenarios that lead to ANRs; these scenarios cover most of the problems seen in production. For details, see "},{"type":"link","attrs":{"href":"https:\/\/mp.weixin.qq.com\/s?__biz=MzI1MzYzMjE0MQ==&mid=2247488243&idx=1&sn=1f948e0ef616c6dfe54513a2a94357be&scene=21#wechat_redirect","title":null,"type":null},"content":[{"type":"text","text":"ANR Case Analysis Collection"}],"marks":[{"type":"strong"}]},{"type":"text","text":". We also chose a relatively large number of "},{"type":"codeinline","content":[{"type":"text","text":"NativePollOnce"}]},{"type":"text","text":" cases to help readers understand that in those cases the "},{"type":"codeinline","content":[{"type":"text","text":"NativePollOnce"}]},{"type":"text","text":" frame seen at ANR time is not the root cause of the ANR."}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"The class of problem introduced below still shows "},{"type":"codeinline","content":[{"type":"text","text":"NativePollOnce"}]},{"type":"text","text":" in the trace, but unlike the earlier cases, this problem really does occur in the "},{"type":"codeinline","content":[{"type":"text","text":"NativePollOnce"}]},{"type":"text","text":" scenario. Let's look at what actually causes it."}]},
{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"Main thread trace stack:"}]},
{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/4a\/4a14a171318f23da84dc7afc2c189333.png","alt":"Image","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},
{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"Analysis approach:"}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"When we see this trace, our first guess is still that earlier main-thread messages took too long or that the system was overloaded, because in day-to-day work we have analyzed many ANRs with this trace and the conclusions always showed that the root cause was unrelated to this scenario. While analyzing this problem, however, a breakdown of the overall metrics showed that almost the entire increase in ANRs over a period of time fell into this scenario, which did not match our expectations. Since the trace carries limited information, we turned to the system side to look for clues."}]},
{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"Analyzing system and process load:"}]},
{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/e0\/e0da4e317d66227a806241e56c3079b6.png","alt":"Image","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"System load:"},{"type":"text","text":" Searching for the Load keyword in the ANR info shows that the system load over the previous 1, 5, and 15 minutes was not high."}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"Per-process CPU distribution:"},{"type":"text","text":" Looking further at "},{"type":"codeinline","content":[{"type":"text","text":"\"CPU usage from 0 ms to 24301 later\""}]},{"type":"text","text":", in the 24-plus seconds after the ANR the application's main process used only 15% CPU, while the com.meizu.mstore application reached 92% CPU, split into 71% user and 20% kernel. At the same time, kernel threads such as "},{"type":"codeinline","content":[{"type":"text","text":"kswapd"}]},{"type":"text","text":" and "},{"type":"codeinline","content":[{"type":"text","text":"mmc-cmdqd"}]},{"type":"text","text":" show low CPU usage, which indicates that overall system load was normal. Going by the conclusions of our earlier "},{"type":"link","attrs":{"href":"https:\/\/mp.weixin.qq.com\/s?__biz=MzI1MzYzMjE0MQ==&mid=2247488243&idx=1&sn=1f948e0ef616c6dfe54513a2a94357be&scene=21#wechat_redirect","title":null,"type":null},"content":[{"type":"text","text":"case analysis"}],"marks":[{"type":"strong"}]},{"type":"text","text":", could this be a case of the "},{"type":"codeinline","content":[{"type":"text","text":"com.meizu.mstore"}]},{"type":"text","text":" process heavily preempting the CPU? With this question in mind, we continue with the system CPU distribution."}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"System CPU distribution:"}]},
{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/f1\/f1f3910eb370b975b2ac67aaf62d4fe3.png","alt":"Image","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Looking further at system load, overall CPU usage is slightly high: user 37%, kernel 24%, iowait 6.9%, which indicates that system I\/O was indeed somewhat busy during this period."}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"System-side conclusion:"},{"type":"text","text":" The system load and per-process CPU usage show that the system environment was fairly normal, but the "},{"type":"codeinline","content":[{"type":"text","text":"com.meizu.mstore"}]},{"type":"text","text":" process had a high CPU share, and its kernel-level CPU usage (20%) was high, which may be related to the elevated system iowait (6.9%). How much does the higher I\/O load affect the current application? We return to the application side for further analysis."}]},
{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"Application-side analysis:"}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Following the analysis above, we turn back to the current process. Comparing the CPU time (utm+stm) of each thread, no thread shows an obvious anomaly. "},{"type":"text","marks":[{"type":"strong"}],"text":"The main thread's CPU time is utm:187, stm:57, which is basically normal."}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"After comparing per-thread CPU time, we focus again on the scheduling timeline from the "},{"type":"link","attrs":{"href":"https:\/\/mp.weixin.qq.com\/s?__biz=MzI1MzYzMjE0MQ==&mid=2247488182&idx=1&sn=6337f1b51d487057b162064c3e24c439&scene=21#wechat_redirect","title":null,"type":null},"content":[{"type":"text","text":"Raster monitoring tool"}],"marks":[{"type":"strong"}]},{"type":"text","text":"."}]},
{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/fb\/fb894f64c0a9aec5658c550c1e268caf.png","alt":"Image","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"From this timeline we look at three kinds of information: "},{"type":"text","marks":[{"type":"strong"}],"text":"historical messages before the ANR, the message currently executing, and the blocked messages"},{"type":"text","text":":"}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"Historical messages:"},{"type":"text","text":" The main thread has no single historical message that took a severely long time."}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"Currently executing message:"},{"type":"text","text":" The currently executing message has a Wall Duration of 21744 ms and a CPU Duration of 100 ms, so most of the time was spent waiting. If the time had been spent executing an Idle Task, or waiting for a long time inside one, the stack captured at ANR time should contain IdleTask frames. We can therefore first rule out a slow Idle Task as the cause."}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"Blocked messages:"},{"type":"text","text":" As the figure above shows, the first pending message has been blocked for 22343 ms, roughly equal to the Wall Duration of the currently executing message. This shows that the large number of blocked messages in this scenario is caused by the currently executing message."}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"At this point we were somewhat puzzled. The thread was in the "},{"type":"codeinline","content":[{"type":"text","text":"NativePollOnce"}]},{"type":"text","text":" scenario when the ANR occurred, but as mentioned many times in earlier case analyses, the condition for entering "},{"type":"codeinline","content":[{"type":"text","text":"NativePollOnce"}]},{"type":"text","text":" is this: when the message queue has no message that can be dispatched immediately, the thread conditionally enters a wait state, and it is woken up to continue when the wait times out or a new message is enqueued. Yet the figure above shows that the queue holds many pending messages, and many of them have been blocked for more than 20 s. With so many messages ready to be dispatched, why does the thread stay in "},{"type":"codeinline","content":[{"type":"text","text":"NativePollOnce"}]},{"type":"text","text":"? Could something really have gone wrong at the lower level so that the thread cannot be woken up?"}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"With this question in mind, we analyzed reports from other users in the same period and found the same pattern: the WallDuration of the "},{"type":"codeinline","content":[{"type":"text","text":"NativePollOnce"}]},{"type":"text","text":" scenario is generally long, in some cases more than 100 s, while the CPU time is very short, as shown below:"}]},
{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/03\/037fd2a096a1ba7a96748b89ad865cc2.png","alt":"Image","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Our first reaction was that the system was at fault. On closer comparison, however, this pattern only increased noticeably after a certain app version and did not appear in earlier versions. If a vendor ROM update had caused it, all app versions, and possibly all applications, should be affected, but that was not the case. The hypothesis therefore does not fit the data."}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"As we understand it, if the thread entered "},{"type":"codeinline","content":[{"type":"text","text":"NativePollOnce"}]},{"type":"text","text":" directly and was never woken up, the CPU Duration should be very small, rather than what we observe here (a CPU Duration of 100 ms or more)."}]},
{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"Targeted monitoring:"}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Given how commonly vendors in China customize their ROMs, and to determine whether the CPU time observed above came from vendor customization at the lower level, we monitored the "},{"type":"codeinline","content":[{"type":"text","text":"nativePollOnce"}]},{"type":"text","text":" interface in the Native layer through a hook proxy."}]},
{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/35\/3560e6a676c3e5b4a56b6d46bc35480a.png","alt":"Image","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/b7\/b7daa58ed63b7fc0759bcae285a286b5.png","alt":"Image","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"After a small-scale rollout to verify and reproduce the problem in production, and by examining the thread scheduling timelines for this kind of ANR, we eventually found a case in which the thread stayed in "},{"type":"codeinline","content":[{"type":"text","text":"NativePollOnce"}]},{"type":"text","text":" for as long as 100 s, as shown below:"}]},
{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/03\/037fd2a096a1ba7a96748b89ad865cc2.png","alt":"Image","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"The figure above (TYPE=5) shows that before the ANR, the main thread stalled for a long time on several occasions between the end of one message dispatch and the start of the next. Each stall consumed some CPU time, but far less than the Wall Duration. We also checked whether "},{"type":"codeinline","content":[{"type":"text","text":"NativePollOnce"}]},{"type":"text","text":" never returned while in epoll_wait during this period. The monitoring logs show the following:"}]},
{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/cf\/cf2a414432364244bca327840863c811.png","alt":"Image","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"After aligning the timestamps of the monitoring timeline with the log above, we see that while the Java layer was calling "},{"type":"codeinline","content":[{"type":"text","text":"MessageQueue.next()"}]},{"type":"text","text":" to fetch the next message, the Native-layer "},{"type":"codeinline","content":[{"type":"text","text":"NativePollOnce"}]},{"type":"text","text":" interface was called multiple times, and each time it entered epoll_wait the timeout argument was -1. This was puzzling: it is not the single uninterrupted wait we expected. What does the argument -1 mean? Read on."}]},
{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"MessageQueue code analysis:"}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Since "},{"type":"codeinline","content":[{"type":"text","text":"NativePollOnce"}]},{"type":"text","text":" was executed multiple times during the ANR window, other threads must have woken the main thread from the epoll_wait state multiple times. But the message queue already holds many pending messages, so why does the main thread remain inside "},{"type":"codeinline","content":[{"type":"text","text":"MessageQueue.next()"}]},{"type":"text","text":"? We have to return to the upper-layer code to find out where the -1 argument is set."}]},
{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/ce\/ce024bd01abda5450a8a58ad476b20c0.png","alt":"Image","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"As the figure above shows, after each message finishes executing, "},{"type":"codeinline","content":[{"type":"text","text":"NativePollOnce"}]},{"type":"text","text":" is called once before the next message is fetched, but "},{"type":"codeinline","content":[{"type":"text","text":"nextPollTimeoutMillis"}]},{"type":"text","text":" defaults to 0 rather than the -1 seen by the lower-level hook. So where is -1 passed in? Read on."}]},
{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/b6\/b613d26a0bc20e8aee6c942a9fe0c6fa.png","alt":"Image","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"The figure above shows that only one place sets "},{"type":"codeinline","content":[{"type":"text","text":"nextPollTimeoutMillis"}]},{"type":"text","text":" to -1, and the comment there clearly shows the hint "},{"type":"codeinline","content":[{"type":"text","text":"\"msg=mMessage\""}]},{"type":"text","text":": no messages? That contradicts what we observed. When the ANR occurred, the queue clearly held many messages waiting to run, yet this code indicates "},{"type":"codeinline","content":[{"type":"text","text":"\"msg=mMessage\""}]},{"type":"text","text":"."}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Looking more closely at this logic, the hint sits in the else branch. Reaching that branch means the msg object is null, yet the assignment "},{"type":"codeinline","content":[{"type":"text","text":"\"msg=mMessage\""}]},{"type":"text","text":" appears just above, and in this scenario mMessage cannot be null, because the pending messages captured at ANR time were themselves obtained by traversing from mMessage."}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Since "},{"type":"codeinline","content":[{"type":"text","text":"mMessage"}]},{"type":"text","text":" is not null, msg must be non-null right after "},{"type":"codeinline","content":[{"type":"text","text":"\"msg=mMessage\""}]},{"type":"text","text":", yet it is null further down. Some logic in between must have reassigned it. We continue the analysis."}]},
{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/7a\/7a1d75411c0388485d1829328741c469.png","alt":"Image","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"The figure above shows that this is the only place where msg can be reassigned. What does this logic do?"}]},
{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"Introduction to the Barrier mechanism:"}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"The comment above makes it immediately clear: this is the Barrier mechanism, which the Android system uses to give certain system messages priority in scheduling. The principle is simple: before returning a message, the queue checks whether it is a barrier message. "},{"type":"text","marks":[{"type":"strong"}],"text":"The defining feature of a barrier message is that its "},{"type":"codeinline","content":[{"type":"text","text":"msg.target"}],"marks":[{"type":"strong"}]},{"type":"text","marks":[{"type":"strong"}],"text":" object is null"},{"type":"text","text":", as shown below:"}]},
{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/16\/16ee809db6e0ff9b06c96808c369e1b4.png","alt":"Image","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"If it is a barrier message, the queue is traversed to find the first asynchronous message, which is then assigned to msg. If the traversal reaches the end without finding an asynchronous message, msg.next of the last message is null, so msg is set to null and the loop exits."}]},
{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/b8\/b8e4c7b4833cb424b556c62df2500ddf.png","alt":"Image","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/74\/742c8c8746b224a49845c56510ccb0e3.png","alt":"Image","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"The figure above shows the interfaces for marking a message as asynchronous and for checking whether it is asynchronous. Messages we create in everyday code do not set this property. Only in certain special scenarios, such as UI refresh, does the system post a barrier message before requesting the vsync signal, to protect the interactive experience. When the barrier message is reached, the queue is traversed for the vsync message, which is effectively moved to the front to guarantee its responsiveness. The logic for posting and removing the barrier message is as follows:"}]},
{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/88\/8834c5fab538b210f79cbe5868627df4.png","alt":"Image","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"The implementation above shows that a barrier message is never removed automatically. The code that posted the barrier must remove it once its business message has been handled; otherwise the barrier stays in the queue forever!"}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"This explains why, inside "},{"type":"codeinline","content":[{"type":"text","text":"MessageQueue.next()"}]},{"type":"text","text":", "},{"type":"codeinline","content":[{"type":"text","text":"NativePollOnce"}]},{"type":"text","text":" is called multiple times: the current "},{"type":"codeinline","content":[{"type":"text","text":"mMessage"}]},{"type":"text","text":" must be a barrier message whose associated business message never arrived, or was executed without the barrier being removed along with it. The barrier therefore stays at the head of the queue. Every attempt to fetch the next message is intercepted by the barrier, which triggers a scan for asynchronous messages: if one exists it is dispatched, and if none exists the thread waits in "},{"type":"codeinline","content":[{"type":"text","text":"nativePollOnce"}]},{"type":"text","text":", blocking the scheduling and handling of normal messages!"}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Walking through the execution logic of MessageQueue.next, the figure below clearly shows how the value -1 for nextPollTimeoutMillis, which we saw in the Native-layer hook, gets set."}]},
{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/46\/464ca17d97bbdcb0cc49b4ddc62c41f2.png","alt":"Image","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"So, for this class of ANR, is the first pending message in the queue a barrier message? We return to the case analyzed above and inspect the first blocked message in the monitoring timeline."}]},
{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/b1\/b15f05770c6ae11b3fa5afc083ce1d6d.png","alt":"Image","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"The figure above clearly shows that the target of this message is null: it is a barrier message. Case solved!"}]},
{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"Chain reaction:"}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Based on the analysis above, "},{"type":"text","marks":[{"type":"strong"}],"text":"if a barrier message is not removed in time, every query through "},{"type":"codeinline","content":[{"type":"text","text":"MessageQueue.next()"}],"marks":[{"type":"strong"}]},{"type":"text","marks":[{"type":"strong"}],"text":" can only filter for and return messages with the asynchronous property, such as input messages from user taps and vsync messages."},{"type":"text","text":" Even though user interaction and UI refresh messages can still run, a large number of business messages cannot, which may lead to abnormal UI display or broken functionality. Messages for application components such as service and receiver are not asynchronous, so they are blocked too, eventually causing a response timeout and an ANR!"}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"For the current problem, we found that the same user hit a second ANR shortly after the first one. Main thread trace:"}]},
{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/b7\/b7d12ab5fca40ea70962615f161ac260.png","alt":"Image","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"The scheduling timeline for the second ANR is as follows:"}]},
{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/a1\/a104f8784ff9593446d55acde4b21126.png","alt":"Image","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"The scheduling timelines for this user's two consecutive ANRs show that at the second ANR the record of historical messages had not changed at all. In other words, no other message was dispatched after the first ANR, yet the wall-clock interval between the two ANRs exceeds 40 s. For those 40 s the main thread really was blocked in this scenario!"}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Comparing the two ANR cases further on the timeline, "},{"type":"text","marks":[{"type":"strong"}],"text":"the CPU time of the message the main thread was executing did change, increasing from 100 ms to 450 ms"},{"type":"text","text":". Can this figure be trusted?"}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"For the two ANRs, we look at the main thread's utm+stm in each ANR trace (see the two trace stacks above):"}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"Thread state and utm, stm at the first ANR:"}]},
{"type":"codeblock","attrs":{"lang":null},"content":[{"type":"text","text":"| state=S schedstat=( 2442286229 338070603 5221 ) utm=187 stm=57 core=5 HZ=100\n"}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"Thread state and utm, stm at the second ANR:"}]},
{"type":"codeblock","attrs":{"lang":null},"content":[{"type":"text","text":"| state=S schedstat=( 2796231342 442294098 6270 ) utm=202 stm=77 core=5 HZ=100\n"}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"Subtracting the utm+stm of the first ANR from that of the second gives 202+77-(187+57)=35 ticks. utm and stm are counted in CPU time slices; with HZ=100, one tick is 10 ms, so 35*10=350 ms. This matches exactly the difference between the two CPU times recorded by the Raster monitoring tool: 450 ms - 100 ms = 350 ms."}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"This shows that several messages were added to the queue during this period. Each time a message is enqueued, the main thread is woken from epoll_wait, returns to the Java layer, and traverses the queue looking for an asynchronous message. If none is found, it enters epoll_wait again. This cycle repeats, producing the behavior described above!"}]},
{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"Initial diagnosis:"}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"The step-by-step analysis above shows that the barrier synchronization mechanism went wrong and disrupted message scheduling. Until the barrier message is removed, the main thread can only process messages with the asynchronous property, typically vsync messages used for refresh and input messages for user interaction. Normal business messages and other system messages, such as Service and Receiver messages that are subject to timeouts, cannot be dispatched, which results in an ANR."}]},
{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"Locating and fixing the problem:"}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"With the cause identified, the next step is to find and fix the change that introduced it. Since this is a barrier synchronization problem, we can add monitoring around posting and removing barriers to determine where a barrier was posted but not removed. We used a Java hook to proxy MessageQueue's "},{"type":"codeinline","content":[{"type":"text","text":"postSyncBarrier"}]},{"type":"text","text":" and "},{"type":"codeinline","content":[{"type":"text","text":"removeSyncBarrier"}]},{"type":"text","text":" interfaces to monitor barrier pairing. Unfortunately, the problem did not reproduce offline."}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"We therefore went back to the code and reviewed the related changes, and eventually found a clue in one feature commit."}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"Before the change:"},{"type":"text","text":" The asynchronous message to be force-dispatched is first removed from the queue and then force-dispatched, so that it is not blocked by the messages ahead of the barrier message, which improves responsiveness."}]},
{"type":"codeblock","attrs":{"lang":null},"content":[{"type":"text","text":"if (hasMsg) {\n    ......\n    handler.removeCallbacks(message.getCallback()); \/\/ remove first\n    handler.dispatchMessage(cloneMsg); \/\/ then force-dispatch the message\n    ......\n}\n"}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"After the change:"},{"type":"text","text":" The message is force-dispatched first and then removed from the queue."}]},
{"type":"codeblock","attrs":{"lang":null},"content":[{"type":"text","text":"if (hasMsg) {\n    ......\n    handler.dispatchMessage(newMessage); \/\/ force-dispatch first\n    handler.removeCallbacks(message.getCallback()); \/\/ then remove the message from the queue\n    ......\n}\n"}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"However, the reordering introduces a hazard. While the DoFrame message is being force-dispatched, business code may trigger UI refresh again, which posts a barrier message and issues a vsync request. If the system responds to the vsync promptly and produces a DoFrame message, the call to "},{"type":"codeinline","content":[{"type":"text","text":"removeCallbacks"}],"marks":[{"type":"strong"}]},{"type":"text","marks":[{"type":"strong"}],"text":" clears all DoFrame messages in the queue at once: both the earlier DoFrame message and the DoFrame message scheduled for the next frame. The barrier message paired with the next DoFrame message, however, is not removed."}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Why are multiple messages removed? The answer lies in the implementation of "},{"type":"codeinline","content":[{"type":"text","text":"handler.removeCallbacks"}]},{"type":"text","text":"."}]},
{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/11\/112f6b5d96851c658eda988cb43b1dc8.png","alt":"Image","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Looking further at the implementation of "},{"type":"codeinline","content":[{"type":"text","text":"messageQueue.removeMessages"}]},{"type":"text","text":", the interface traverses the queue for messages matching the given runnable and object. Because the Object passed above is null, it effectively removes every message under the current Handler that has the same runnable!"}]},
{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/00\/002c6d55a340e6232498826e9c4f7a58.png","alt":"Image","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Because of the forced refresh and the reordering, two UI doFrame messages existed in the queue at the same time and were both removed after the forced dispatch, "},{"type":"text","marks":[{"type":"strong"}],"text":"leaving an orphaned barrier message in the queue indefinitely!"}]},
{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"Other scenarios:"}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Besides the scenario above, another scenario can cause this kind of problem: asynchronous UI refresh. Android prohibits asynchronous refresh and uses the checkThread mechanism to verify the thread on UI refresh, but there is a gap. With hardware acceleration enabled, Android O and later indirectly call "},{"type":"codeinline","content":[{"type":"text","text":"onDescendantInvalidated"}],"marks":[{"type":"strong"}]},{"type":"text","marks":[{"type":"strong"}],"text":" to trigger UI refresh. This path bypasses the system's checkThread check and creates a concurrency hazard. As shown below, concurrent execution overwrites the previous thread's "},{"type":"codeinline","content":[{"type":"text","text":"mTraversalBarrier"}],"marks":[{"type":"strong"}]},{"type":"text","marks":[{"type":"strong"}],"text":", which breaks the pairing between the vsync message and its barrier"},{"type":"text","text":"."}]},
{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/11\/119e7c936c1a1a3ca89e2543baf5deea.png","alt":"Image","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"In the Android Q source, "},{"type":"codeinline","content":[{"type":"text","text":"onDescendantInvalidated"}],"marks":[{"type":"strong"}]},{"type":"text","marks":[{"type":"strong"}],"text":" contains a checkThread call, but it is commented out! The accompanying note reads roughly: re-enable after the camera is fixed, or gate it on a targetSdk check? Perhaps this TODO was simply forgotten."}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/9e\/9ee257d62b7ff0fa5773af89e8d092ac.png","alt":"Image","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},
{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"Summary:"}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"This completes the analysis and final diagnosis of this class of problem. Overall, the trace scenario ("},{"type":"codeinline","content":[{"type":"text","text":"NativePollOnce"}]},{"type":"text","text":") and the well-hidden nature of the problem made investigation very challenging. With system logs alone it is very hard to discover that something went wrong inside "},{"type":"codeinline","content":[{"type":"text","text":"MessageQueue.next()"}]},{"type":"text","text":". Here the Raster monitoring tool reconstructed the scene and provided the key clues. In hindsight, "},{"type":"text","marks":[{"type":"strong"}],"text":"this class of problem has very distinctive characteristics, in the following respects"},{"type":"text","text":":"}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"The ANR traces for this scenario cluster in "},{"type":"codeinline","content":[{"type":"text","text":"NativePollOnce"}],"marks":[{"type":"strong"}]},{"type":"text","marks":[{"type":"strong"}],"text":". The Wall Duration of this phase is very long and all subsequent normal message scheduling is blocked, so the main thread looks frozen."}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"The Raster monitoring tool shows this clearly: if the target of the first pending message in the queue is null, it is a barrier message, and the blocked duration of the messages behind it indicates the severity of the impact."}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"When this problem occurs, normal messages such as Service and Receiver messages cannot be dispatched, so the application hits ANRs repeatedly until the user kills the process."}]}]}]},
{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"Next:"}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"In the next article, we will cover ANRs caused by out-of-order timing in binder communication between system services and clients. Because the bug is in a system mechanism, all applications are affected in theory, and the symptoms are just as well hidden. What do these problems look like? Stay tuned."}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"This article is reproduced from: ByteDance Technical Team (字節跳動技術團隊, ID:toutiaotechblog)"}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Original link: "},{"type":"link","attrs":{"href":"https:\/\/mp.weixin.qq.com\/s\/OBYWrUBkWwV8o6ChSVaCvw","title":"xxx","type":null},"content":[{"type":"text","text":"Toutiao ANR Optimization in Practice - Barrier Causes the Main Thread to Appear Frozen"}]}]}
]}
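The root cause described in the article (one `removeCallbacks` call clearing two DoFrame messages and leaving a barrier with no owner) can be replayed with a small self-contained model. The sketch below is an illustration only, under stated assumptions: `BarrierSketch`, `Msg`, and `replayScenario` are hypothetical names, and the class is a simplified stand-in for the queue behaviour the article describes, not the AOSP `MessageQueue` implementation. Returning `null` from `next()` stands for the case where the real loop would wait with a timeout of -1.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

/** Simplified, hypothetical model of a message queue with sync barriers (not AOSP code). */
public class BarrierSketch {

    /** A queued message; target == null marks a barrier, as described in the article. */
    static final class Msg {
        final Object target;      // null means this entry is a barrier
        final Runnable callback;  // may be null
        final boolean async;      // only async messages may pass a barrier

        Msg(Object target, Runnable callback, boolean async) {
            this.target = target;
            this.callback = callback;
            this.async = async;
        }
    }

    final List<Msg> queue = new ArrayList<>();

    /** Returns the next dispatchable message, or null when the caller would wait indefinitely. */
    Msg next() {
        if (queue.isEmpty()) {
            return null;
        }
        if (queue.get(0).target != null) {
            return queue.remove(0);           // ordinary head message: dispatch it
        }
        // Barrier at the head: only an asynchronous message may be returned.
        for (Iterator<Msg> it = queue.iterator(); it.hasNext();) {
            Msg m = it.next();
            if (m.target != null && m.async) {
                it.remove();
                return m;
            }
        }
        return null;                          // barrier present, no async message found
    }

    /** Removes every non-barrier message with this callback, like a removal with a null token. */
    int removeCallbacks(Runnable callback) {
        int before = queue.size();
        queue.removeIf(m -> m.target != null && m.callback == callback);
        return before - queue.size();
    }

    /** Replays the reordered dispatch-then-remove sequence; true means the queue is stuck. */
    static boolean replayScenario() {
        BarrierSketch q = new BarrierSketch();
        Object handler = new Object();
        Runnable doFrame = () -> { };
        q.queue.add(new Msg(handler, doFrame, true));   // DoFrame being force-dispatched
        q.queue.add(new Msg(null, null, false));        // barrier posted during that dispatch
        q.queue.add(new Msg(handler, doFrame, true));   // next DoFrame, paired with the barrier
        q.queue.add(new Msg(handler, null, false));     // ordinary business message
        int removed = q.removeCallbacks(doFrame);       // clears BOTH DoFrame messages
        Msg next = q.next();                            // barrier at head, nothing async left
        return removed == 2 && next == null && q.queue.size() == 2;
    }

    public static void main(String[] args) {
        System.out.println("stuck behind orphaned barrier: " + replayScenario());
    }
}
```

In this model the business message stays in the queue but is never returned, which mirrors the symptom in the timelines: pending messages accumulate behind a head entry whose target is null.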