How ByteDance Systematically Governs iOS Stability Issues

{"type":"doc","content":[{"type":"blockquote","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"本文是豐亞東講師在2021 ArchSummit 全球架構師峯會中「如何系統性治理 iOS 穩定性問題」的分享全文。","attrs":{}}]}],"attrs":{}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"首先做一下自我介紹:我是豐亞東,2016 年 4 月加入字節跳動,先後負責今日頭條 App 的工程架構、基礎庫和體驗優化等基礎技術方向。2017 年 12 月至今專注在 APM 方向,從 0 到 1 參與了字節跳動 APM 中臺的建設,服務於字節的全系產品,目前主要負責 iOS 端的性能穩定性監控和優化。","attrs":{}}]},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/bc/bcb564f61480424e39ee789d4f9b3ef8.jpeg","alt":null,"title":"","style":[{"key":"width","value":"100%"},{"key":"bordertype","value":"none"}],"href":"","fromPaste":false,"pastePass":false}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"本次分享主要分爲四大章節,分別是:1.穩定性問題分類;2.穩定性問題治理方法論;3.疑難問題歸因;4.總結回顧。其中第三章節「疑難問題歸因」是本次分享的重點,大概會佔到60%的篇幅。","attrs":{}}]},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"一、穩定性問題分類","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"在講分類之前,我們先了解一下背景:大家都知道對於移動端應用而言,閃退是用戶能遇到的最嚴重的 bug,因爲在閃退之後用戶無法繼續使用產品,那麼後續的用戶留存以及產品本身的商業價值都無從談起。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"這裏有一些數據想和大家分享:有 20% 的用戶在使用移動端產品的時候,最無法忍受的問題就是閃退,這個比例僅次於不合時宜的廣告;在因爲體驗問題流失的用戶中,有 1/3 的用戶會轉而使用競品,由此可見閃退問題是非常糟糕和嚴重的。","attrs":{}}]},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/fa/fa8e5a6fe3d5a0360f6587707e0543ca.jpeg","alt":null,"title":"","style":[{"key":"width","value":"100%"},{"key":"bordertype","value":"none"}],"href":"","fromPaste":false,"pastePass":false}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"字節跳動作爲擁有像抖音、頭條等超大量級 App 的公司,對穩定性問題是非常重視的。過去幾年,我們在這方面投入了非常多的人力和資源,同時也取得了不錯的治理成果。過去兩年抖音、頭條、飛書等 App 的異常崩潰率都有 30% 以上的優化,個別產品的部分指標甚至有 80% 以上的優化。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"通過上圖中右側的餅狀圖可以看出:我們以 iOS 平臺爲例,根據穩定性問題不同的原因,將已知穩定性問題分成了這五大類,通過佔比從高到低排序:第一大類是 OOM ,就是內存佔用過大導致的崩潰,這個比例能佔到 50% 以上;其次是 Watchdog,也就是卡死,類比於安卓中的 ANR;再次是普通的 Crash;最後是磁盤 IO 異常和 CPU 
Reading this, you may wonder: what exactly did ByteDance do to achieve these results? Next I will share the methodology we have distilled for stability governance.

### 2. A Methodology for Governing Stability Issues

![](https://static001.geekbang.org/infoq/c1/c12512312275531235fc6e7a98866ebf.jpeg)

First, from the perspective of the monitoring platform, the most important thing in stability governance is complete capability coverage: for every type of stability issue listed in the previous section, the monitoring platform should be able to detect it promptly and accurately.

Second, from the perspective of business developers: stability governance must run through the entire software development lifecycle, including development, testing, integration, staged (gray) release, and launch. At every one of these stages, developers should pay attention to discovering and treating stability issues.

On the right of the slide are the two governance principles we consider most important:

The first is **contain new issues, work down the backlog**. Newly introduced stability issues tend to be the kind that flare up suddenly, with severe impact; backlog issues are more often the stubborn kind, with long fix cycles.

The second is easy to understand: **urgent before non-urgent, easy before hard.** We should prioritize fixing issues that are flaring up, as well as those that are relatively easy to resolve.

![](https://static001.geekbang.org/infoq/c7/c7a6fb2ec0a9ad2e67edf9f4874f8399.webp)

If we project the software development cycle onto stability governance specifically, we can abstract the following stages:

**The first stage is problem discovery:** whenever a user hits any type of crash online, the monitoring platform should detect and report it promptly, and, through alerting and automatic issue assignment, notify the developers immediately so the problems get fixed in time.

**The second stage is attribution:** when developers receive a stability issue, the first thing they should do is investigate its cause. Depending on the scenario, attribution divides into single-case attribution, common-pattern attribution, and outbreak attribution.

Once the cause is found, the next step is to fix it, i.e., **problem treatment**. Here we have several treatment approaches. Online, we can first deploy protections: for example, "大白" (Baymax), the online automatic Crash-repair scheme based on the OC runtime that NetEase described in an article a few years ago, lets us guard against crashes directly in production. And for stability outbreaks caused by backend service rollouts, we can roll the service back to stop the loss dynamically. Beyond these two approaches, most scenarios still require developers to fix the native code offline and ship a new release for a thorough fix.
The last stage, a rather hot topic in recent years, is **regression prevention**. This refers to the phase between development and launch, in which various stability issues can be discovered and resolved early through automated unit tests and UI automation in the infrastructure, as well as through system tools such as Xcode and Instruments and third-party tools such as WeChat's open-source MLeaksFinder.

To govern stability problems well, all developers need to pay attention to each of these stages. But among so many stages, where is the real focus? **In ByteDance's experience, the most important stage is the second: attribution of online problems.** Our internal statistics show that the issues that sit online for a long time without a conclusion or a fix do so mainly because developers never located their root cause. Hence the next part, the focus of this talk: attribution of difficult problems.

### 3. Attribution of Difficult Problems

We ordered the problem classes by how familiar developers are with them: Crash, Watchdog, OOM, and CPU & disk I/O. For each class I will present the background and the corresponding solutions, and use real-world cases to demonstrate how the attribution tools actually crack these difficult problems.

#### 3.1 The first class of difficult problems: Crash

![](https://static001.geekbang.org/infoq/4a/4a85122d103dae4cf2e6ebb4e6e50398.webp)

The pie chart on the left of the slide divides crashes into four categories by cause: Mach exceptions, Unix signal exceptions, and exceptions at the OC and C++ language levels. Mach exceptions account for the largest share, followed by signal exceptions; OC and C++ exceptions are comparatively rare.

Why this distribution?

Note the two data points at the top right. The first comes from an article published by Microsoft: more than 70% of the security patches it ships address memory-related bugs; on iOS these correspond to illegal address accesses among Mach exceptions, i.e. EXC_BAD_ACCESS. Our internal statistics show that 80% of ByteDance's online crashes have long gone without a conclusion, and of those, more than 90% are Mach exceptions or signal exceptions.

You are probably wondering: why are so many crashes unsolvable? What exactly makes them hard? We have summarized several difficulties in attributing these problems:

- First, unlike OC and C++ exceptions, the crash stack the developer gets may be a pure system call stack; such problems are obviously very hard to fix;
- Second, some crashes are sporadic rather than deterministic, making them very hard to reproduce offline; and without a reproduction it is difficult to investigate and locate them with IDE debugging;
- Third, for illegal address accesses, the crashing call stack may not be the original scene of the crime. A simple example: business A's memory allocation overflows and tramples business B's memory. Business A should be considered the main cause, but business B may use that memory at some later moment and crash. Such a problem is really caused by business A yet ends up crashing in business B's call stack, which badly misleads the developers investigating and fixing it.
Seeing this, you may ask: if these problems are so intractable, is there nothing we can do? Not quite. Below I will share two attribution tools that have proven very effective inside ByteDance against this class of difficult problems.

##### 3.1.1 Zombie detection

![](https://static001.geekbang.org/infoq/40/403e581cb6b3686f99dafb59c752776a.webp)

The first is Zombie detection. If you have used Xcode's zombie diagnostics, this will look familiar: with the Zombie Objects switch turned on before debugging, whenever a crash is caused by a dangling pointer to an OC object, Xcode prints a line in the console telling the developer which object crashed while being sent which message.

Let me restate the definition of a zombie; it is very simple: an OC object that has already been deallocated.

What are the attribution advantages of zombie monitoring? First, it pinpoints the class where the problem occurred, rather than leaving you with some random crash stack. Second, it raises the reproduction probability of sporadic problems: most sporadic problems are tied to the multithreaded runtime environment, and if we can turn a sporadic problem into a deterministic one, developers can investigate it conveniently with the IDE and debugger. The scheme does have a limited scope of application: because its underlying principle rests on the OC runtime, it only applies to memory problems caused by dangling pointers to OC objects.

![](https://static001.geekbang.org/infoq/79/7908f2ac794d8fd2042d101ec293c289.webp)

Let us review how zombie monitoring works: we hook the dealloc method of the base class NSObject. When any OC object is released, the hooked dealloc does not actually free the memory; instead it points the object's ISA to a special zombie class. Because this zombie class implements no methods, the zombie object crashes upon receiving any subsequent message, at which point we report the zombie object's class name and the method being invoked to the backend for analysis.

![](https://static001.geekbang.org/infoq/8d/8dedcdf5560825db742491c1224dbbea.webp)

Here is a real ByteDance case. In one Feishu release this was the top 1 online crash, and it had gone unresolved for two months. As you can see, the crash stack is a pure system stack; the crash type is an illegal address access occurring during a transition animation of a navigation controller. At first sight, a developer would have no clue where to start.

![](https://static001.geekbang.org/infoq/9e/9efe7dbeb42ba6f7cab8fade64a2e301.webp)

Now look at the crash stack after the Zombie feature is enabled: the error message is much richer and directly identifies the type of the dangling object — a MainTabbarController object crashed while being sent retain.

At this point you will surely object: a MainTabbarController is normally the root view controller of the home page and in theory should never be deallocated during the app's entire lifecycle. Why did it become a dangling object? Evidently such a simple error message is sometimes still not enough to locate the root cause. So we went one step further and extended the feature: we also report the call stack captured at the moment the zombie object was deallocated.
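To make the mechanism concrete, here is a minimal sketch of the idea described above. It is illustrative only, not ByteDance's production implementation: the class `BDZombieProxy` and the helper `BDSaveDeallocBacktrace` are hypothetical names, and a production version would use a method-less root class (like Apple's `_NSZombie_`) so that every selector, including `retain`, is trapped, and would cap how much memory is deliberately leaked.

```objc
#import <Foundation/Foundation.h>
#import <objc/runtime.h>
#import <execinfo.h>

// Hypothetical store for "who freed this object" backtraces.
void BDSaveDeallocBacktrace(void *obj, void **frames, int count);

// Stub whose instances stand in for freed objects. Selectors it does not
// implement fall into forwarding, where we report and crash deliberately.
@interface BDZombieProxy : NSObject
@end

@implementation BDZombieProxy
- (id)forwardingTargetForSelector:(SEL)sel {
    // The zombie class name encodes the original class, so the report
    // reads like "-[_BDZombie_MainTabbarController retain]".
    NSLog(@"Zombie hit: %@ received %@",
          NSStringFromClass(object_getClass(self)), NSStringFromSelector(sel));
    abort();
}
@end

// One zombie class per original class, created lazily at runtime, so the
// crash report preserves the original type name.
static Class BDZombieClassFor(Class original) {
    const char *name =
        [[@"_BDZombie_" stringByAppendingString:NSStringFromClass(original)] UTF8String];
    Class zombie = objc_getClass(name);
    if (!zombie) {
        zombie = objc_allocateClassPair([BDZombieProxy class], name, 0);
        objc_registerClassPair(zombie);
    }
    return zombie;
}

// Replacement for -[NSObject dealloc]. The original dealloc is intentionally
// NOT called, so the memory is leaked (kept alive as a zombie) instead of freed.
static void BDZombieDealloc(__unsafe_unretained id self, SEL _cmd) {
    void *frames[64];
    int count = backtrace(frames, 64);           // record who freed the object
    BDSaveDeallocBacktrace((__bridge void *)self, frames, count);
    object_setClass(self, BDZombieClassFor(object_getClass(self)));
}

void BDInstallZombieHook(void) {
    // @selector(dealloc) is forbidden under ARC; use sel_registerName instead.
    Method m = class_getInstanceMethod([NSObject class], sel_registerName("dealloc"));
    method_setImplementation(m, (IMP)BDZombieDealloc);
}
```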
![](https://static001.geekbang.org/infoq/6d/6db72ed7234f864451692bc7af7e26da.jpeg)

Look at the second-to-last line: it is Feishu business code — the navigation controller's gesture-recognizer delegate method — and calling it released the MainTabbarController. Having found the business call site via this stack, we only needed to read the source and figure out why it releases the TabbarController, and the cause of the problem fell out.

![](https://static001.geekbang.org/infoq/99/997334d4983452842266d5b7264c20fc.jpeg)

On the right of the slide is the simplified source (replaced by a comment for code-privacy reasons). Historically, to resolve a conflict with the interactive swipe-back gesture, a piece of trick code had been written into the gesture-recognizer delegate method of Feishu's navigation controller, and it was precisely this trick that caused the home navigation controller to be released unexpectedly.

Having traced this far, we had the root cause, and the fix was trivial: remove the trick and rely on the navigation controller's native implementation to decide whether the gesture should fire.

##### 3.1.2 Coredump

As mentioned, the zombie monitoring scheme has limits: it applies only to dangling-pointer problems on OC objects. You may ask: C and C++ code can produce dangling pointers too, and among Mach and signal exceptions there are many types beyond memory problems, such as EXC_BAD_INSTRUCTION and SIGABRT. How do we solve those difficult problems? Our answer is another tool: Coredump.

![](https://static001.geekbang.org/infoq/d1/d1cf0c69ab4e277d084ddd99aa13e69f.jpeg)

First, what is a coredump? Coredump is a special file format defined by LLDB; a coredump file can restore the complete running state of the app at a given moment (here, "running state" mainly means memory state). You can simply think of it as hitting a breakpoint at the crash scene and capturing the registers of every thread, the stack memory, and the full heap memory at that instant.

What are the attribution advantages of the coredump scheme? Because the format is defined by LLDB, it natively supports LLDB command debugging; in other words, developers can debug difficult online problems after the fact, without ever reproducing them. And because it contains all memory at crash time, it hands developers a wealth of material for analysis.

Its applicable range is also wide: it can be used to analyze any Mach exception or signal exception.
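As a rough illustration of the raw material a coredump has to capture, the sketch below reads the arm64 register state of every thread in the current task via the standard Mach APIs. It is an assumption-laden sketch, not the actual uploader: a real implementation runs inside a crash handler with threads suspended, also copies the stack and heap memory regions the registers reference, and serializes everything into the LLDB core-file format. (Field names are per `<mach/arm/_structs.h>` on plain arm64; arm64e builds use the `arm_thread_state64_get_pc()` accessors instead.)

```objc
#include <mach/mach.h>
#include <stdio.h>

void BDDumpAllThreadStates(void) {
    thread_act_array_t threads = NULL;
    mach_msg_type_number_t threadCount = 0;
    if (task_threads(mach_task_self(), &threads, &threadCount) != KERN_SUCCESS)
        return;

    for (mach_msg_type_number_t i = 0; i < threadCount; i++) {
        arm_thread_state64_t state;
        mach_msg_type_number_t stateCount = ARM_THREAD_STATE64_COUNT;
        if (thread_get_state(threads[i], ARM_THREAD_STATE64,
                             (thread_state_t)&state, &stateCount) == KERN_SUCCESS) {
            // pc/lr/sp/fp (plus x0–x28) are what later allow lldb to rebuild
            // each thread's call stack from the dumped memory.
            printf("thread %u: pc=0x%llx lr=0x%llx sp=0x%llx\n", i,
                   (unsigned long long)state.__pc,
                   (unsigned long long)state.__lr,
                   (unsigned long long)state.__sp);
        }
        mach_port_deallocate(mach_task_self(), threads[i]);
    }
    vm_deallocate(mach_task_self(), (vm_address_t)threads,
                  threadCount * sizeof(thread_act_t));
}
```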
![](https://static001.geekbang.org/infoq/64/646f7287765a987d2c9e01f0f488c487.webp)

Now a real online case. This problem appeared in every ByteDance product, at a very large magnitude in many of them — ranking top 1 or top 2 — and had gone unsolved for the previous two years.

As you can see, this crash stack again consists entirely of system library methods, finally crashing in a method inside libdispatch; the exception type is a system library assertion failure.

![](https://static001.geekbang.org/infoq/e6/e6a9e5944a95b61cf6b323669ddeb739.webp)

After uploading the coredump of this crash, we analyzed it with the LLDB debugging commands mentioned earlier. Since we have the complete memory state at crash time, we can inspect the registers, stack memory, and other information of every thread.

In the end we worked out that in frame 0 of the crashing thread (the first line of the stack), register x0 held the queue structure defined by libdispatch, and at offset 0x48 from its start address sits the queue's label attribute (roughly, the queue's name). That queue name was crucial: to fix the problem, we first had to know which queue was broken. Reading that memory directly with the memory read command, we found a C string named com.apple.CFFileDescriptor — a key piece of information. Searching our source globally for this keyword, we found the queue was created inside ByteDance's low-level network library, which also explains why every ByteDance product had this crash.
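Concretely, that inspection takes only a few LLDB commands against the loaded core file. The annotated session below is a reconstruction of the idea with made-up addresses; only the commands themselves (`register read`, `memory read`) are real LLDB:

```
(lldb) register read x0
      x0 = 0x0000000283f09c80            # hypothetical: the dispatch_queue_t in frame 0
(lldb) memory read -f pointer -c 1 0x0000000283f09cc8
0x283f09cc8: 0x00000001045a2f10          # queue base + 0x48: the label field
(lldb) memory read -f c-string 0x00000001045a2f10
0x1045a2f10: "com.apple.CFFileDescriptor"
```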
![](https://static001.geekbang.org/infoq/26/266cbce2f2fdfb6f534d89eafc2b82eb.jpeg)

In the end, investigating together with the network library team and cross-referencing the libdispatch source, we located the cause: the GCD queue's external reference count had dropped below zero — the queue was over-released — which eventually tripped the system library assertion and crashed.

![](https://static001.geekbang.org/infoq/a6/a61c307c584a517794e772531402c1fc.jpeg)

Once diagnosed, the solution was fairly simple: when this queue is created, use dispatch_source_create so that the queue's external reference count is incremented. After talking to the network library maintainers, we confirmed that this queue should never be released during the app's entire lifecycle. Fixing this single problem directly reduced the Crash rate of all ByteDance products by 8%.

#### 3.2 The second class of difficult problems: Watchdog

Now the second class of difficult problems: Watchdog, that is, the app freezing.

![](https://static001.geekbang.org/infoq/5b/5b5d14155ceb5e1b2a28f8c6b2751fd8.jpeg)

On the left of the slide are two screenshots I took on Weibo of users venting after hitting freezes. The damage to user experience is clearly considerable. So what exactly are the dangers of watchdog problems?

First, freezes typically occur during the cold-start phase of opening the app: the user may wait ten seconds without having done anything, and the app dies — which hurts the experience badly. Second, our online monitoring shows that without any targeted treatment, the volume of watchdog kills can be two to three times that of ordinary crashes. Third, the common industry practice for detecting OOM kills is by exclusion, so failing to exclude watchdog kills inflates the misjudgment rate of OOM detection accordingly.

What makes watchdog problems hard to attribute? First, the traditional approach — lag (jank) monitoring — declares a watchdog whenever the main thread has been unresponsive beyond a threshold of 3 to 5 seconds; this misreports very easily (the next slide explains why). Second, the causes of a freeze can be very complex and not necessarily singular: main-thread deadlock, lock waits, and main-thread I/O can all freeze the app. Third, deadlock is a common cause of freezes, and traditional approaches set a high bar for deadlock analysis because they depend entirely on developer experience: the developer must work out manually which thread or threads the main thread is mutually waiting with, and why the deadlock occurred.

![](https://static001.geekbang.org/infoq/7d/7d3525225bc7dab1d94bb119192e49bd.jpeg)

Here is why monitoring watchdogs on top of the traditional lag scheme misreports. The green and red segments in the figure are different time-consuming stages on the main thread. Suppose the main thread's stall has just exceeded the watchdog threshold, and this happens to fall in stage 5; if we grab the main-thread call stack at that moment, it is clearly not the main contributor to the stall. The real problem lies in stage 4, but stage 4 has already passed — a misreport that may cause developers to miss the real problem.

Against these pain points we offer two solutions. First, during watchdog monitoring we can grab the main-thread call stack multiple times, recording the state of the main thread at each moment (what thread state includes is covered on the next slide).

Second, we can automatically recognize freezes caused by deadlock, flag them as such, and automatically reconstruct the lock-waiting relationships among the threads for the developer.

![](https://static001.geekbang.org/infoq/7a/7a9ba27b3b0d40d47415ca51eab4b24a.webp)

The first attribution tool is thread state. This figure shows the main-thread call stacks at different moments; after each thread name there are three tags, corresponding to three kinds of thread state: the thread's CPU usage at that moment, its run state, and its thread flags.

On the right of the slide are explanations of the run states and flags. When reading thread states, there are two main lines of analysis. First: if the main thread's CPU usage is 0, it is in the waiting state, and it has been swapped out, we have reason to suspect the freeze may be caused by a deadlock. Second, with the opposite signature: if the main thread's CPU usage stays high and it remains in the running state, we should suspect the main thread is running some CPU-bound task such as an infinite loop.
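These three tags map directly onto what the kernel exposes. Below is a minimal sketch of how such a sampler might read them with the Mach thread_info API; the API and flags are standard, while the interpretation heuristic in the comment is the one from the talk:

```objc
#include <mach/mach.h>
#include <stdio.h>

// Sample the scheduling state of every thread: CPU usage, run state,
// and flags (TH_FLAGS_SWAPPED means the thread has been swapped out).
void BDSampleThreadStates(void) {
    thread_act_array_t threads = NULL;
    mach_msg_type_number_t count = 0;
    if (task_threads(mach_task_self(), &threads, &count) != KERN_SUCCESS)
        return;

    for (mach_msg_type_number_t i = 0; i < count; i++) {
        thread_basic_info_data_t info;
        mach_msg_type_number_t infoCount = THREAD_BASIC_INFO_COUNT;
        if (thread_info(threads[i], THREAD_BASIC_INFO,
                        (thread_info_t)&info, &infoCount) == KERN_SUCCESS) {
            // cpu_usage is scaled by TH_USAGE_SCALE (1000 == 100%).
            float cpu = info.cpu_usage / (float)TH_USAGE_SCALE * 100.f;
            int waiting = (info.run_state == TH_STATE_WAITING);
            int swapped = (info.flags & TH_FLAGS_SWAPPED) != 0;
            // Heuristic: CPU 0 + waiting + swapped out suggests a lock wait
            // (possible deadlock); persistently high CPU + running suggests
            // a busy loop on the main thread.
            printf("thread %u: cpu=%.1f%% waiting=%d swapped=%d\n",
                   i, cpu, waiting, swapped);
        }
        mach_port_deallocate(mach_task_self(), threads[i]);
    }
    vm_deallocate(mach_task_self(), (vm_address_t)threads,
                  count * sizeof(thread_act_t));
}
```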
![](https://static001.geekbang.org/infoq/33/33062378ef844c6fd9c5a4c7dc587337.webp)

The second attribution tool is deadlocked-thread analysis. This capability is fairly novel, so let me first walk you through the principle. Building on the thread states from the previous slide, at watchdog time we can take the states of all threads and filter out every thread in the waiting state; for each of these we then take the current PC address — the method being executed — and determine via symbolication whether it is a lock-wait method.

The slide lists some of the lock-wait methods we currently cover, including mutexes, read-write locks, spin locks, GCD locks, and so on. Each lock-wait method defines a parameter carrying the current lock-wait information. We can read that information out of the registers and cast it to the corresponding structure; every such structure defines a thread id field indicating which thread the current one is waiting on to release the lock. After running this procedure over every waiting thread, we obtain the complete lock-waiting relationships among all threads and can build the lock-wait graph.

![](https://static001.geekbang.org/infoq/16/16dd536249cf70865c6fe027d9477545.webp)

With this scheme we can identify deadlocked threads automatically. If we can determine that thread 0 is waiting for thread 3 to release a lock while thread 3 is waiting for thread 0 to release a lock, then these are plainly two threads waiting on each other — a deadlock.

As you can see, the main thread here is marked as deadlocked: its CPU usage is 0, its state is waiting, and it has been swapped out, which matches the thread-state methodology we discussed above.

![](https://static001.geekbang.org/infoq/cd/cd58200513486abd688485a0a834277a.jpeg)

After this analysis we can construct a complete lock-wait graph, and whether a deadlock involves two threads or more waiting on one another, it can be recognized and analyzed automatically.

![](https://static001.geekbang.org/infoq/97/9799628af54f2ab4abe3450d8681bf2f.jpeg)

Here is illustrative source code for the deadlock in the slide. The problem: the main thread holds a mutex while a child thread holds a GCD lock, and the two threads wait on each other. The advice we give: if a child thread may run time-consuming operations, try not to let it contend for locks with the main thread; and be very cautious about synchronously executing a block on a serial queue. A minimal reconstruction of the pattern follows below.
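This sketch is our own minimal reconstruction of that class of bug, not the actual business code: the main thread takes a mutex and then dispatch_syncs onto a serial queue, while a task already on that queue is blocked trying to take the same mutex — each side waits on the other forever.

```objc
#import <Foundation/Foundation.h>
#import <pthread.h>

static pthread_mutex_t gLock = PTHREAD_MUTEX_INITIALIZER;

void DeadlockDemo(void) {
    dispatch_queue_t queue =
        dispatch_queue_create("com.example.serial", DISPATCH_QUEUE_SERIAL);

    // Child-thread side: a task on the serial queue wants the mutex.
    dispatch_async(queue, ^{
        pthread_mutex_lock(&gLock);   // blocks once the main thread owns gLock
        /* ... */
        pthread_mutex_unlock(&gLock);
    });

    // Main-thread side: takes the mutex, then synchronously waits for the
    // serial queue — which can never drain, because its current task is
    // waiting for the mutex we hold. Classic A-waits-B, B-waits-A.
    pthread_mutex_lock(&gLock);
    dispatch_sync(queue, ^{           // deadlock (given the interleaving above)
        /* ... */
    });
    pthread_mutex_unlock(&gLock);
}
```

The fix follows the advice above: do not hold a lock across a dispatch_sync onto a queue whose tasks may take the same lock — for example, release the mutex first, or use dispatch_async instead.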
![](https://static001.geekbang.org/infoq/cd/cd4220e453750f3d55fe94db8338c123.jpeg)

The slide above summarizes, based on ByteDance's internal online monitoring and attribution tools, the most common causes of watchdog problems: deadlock, lock contention, main-thread I/O, and cross-process communication.

#### 3.3 The third class of difficult problems: OOM

OOM is Out Of Memory: the app's memory usage grows too high and the system force-kills it, producing a crash.

![](https://static001.geekbang.org/infoq/ac/acc810ed372e93e481ce5dcd6ace290c.webp)

What are the dangers of OOM crashes? First, we believe the longer a user uses the app, the more likely an OOM crash becomes, so OOM crashes hurt heavy users' experience the most. Statistics show that without systematic treatment, OOM volume is generally three to five times that of ordinary crashes. Finally, unlike crashes and freezes, memory problems are relatively hidden, and they regress very easily under fast iteration.

What makes OOM attribution hard? First, memory composition is very complex, and there is no clearly abnormal call stack to point at. We do have offline tools for memory problems, such as Xcode MemoryGraph and Instruments Allocations, but they do not work in the online environment; and for the same reason, simulating and reproducing an online OOM problem offline is very difficult.

![](https://static001.geekbang.org/infoq/f9/f9039288180add974b4d72222ae43b83.webp)

Our attribution tool for difficult online OOM problems is MemoryGraph — by which we mean here a MemoryGraph that can run in the online environment. It is somewhat similar to Xcode MemoryGraph, but differs in important ways. The biggest difference, of course, is that it works online; beyond that, it can aggregate and tally scattered memory nodes, making it easy for developers to spot the top memory consumers.

A quick review of how online MemoryGraph works: we periodically check the app's physical memory footprint, and when it exceeds a danger threshold we trigger a memory dump. The SDK records the symbolicated information of each memory node and the reference relations among them; where it can determine whether a reference is strong or weak, it reports that as well. Once all of this reaches the backend, it helps developers analyze the oversized allocations, memory leaks, and other anomalies present at that moment.

Once again, let us use a real case to see how MemoryGraph actually cracks an OOM problem.

![](https://static001.geekbang.org/infoq/90/9089e5045b842c01a2881fcff0c25cb6.jpeg)

The general approach to reading a MemoryGraph file is to peel the onion, working step by step toward the root cause.

The figure shows a MemoryGraph analysis, with red boxes marking the different regions: top-left is the class list, which aggregates the object count of each type and their total memory; the right side lists the addresses of all instances of the selected class; in the bottom-right region developers can manually trace an object's references (which objects reference it, and which objects it references); and the wide middle area is the reference graph.

Since it is not convenient to play the video here, let me share the key conclusions. Looking at the class list, we could not help noticing 47 objects of type ImageIO that together occupied more than 500 MB — clearly not a reasonable footprint. We opened ImageIO's instance list, took the first object as an example, and traced its references. We found it had only one incoming reference: VM Stack: Rust Client Callback, which is in fact a thread of Feishu's underlying Rust network library.

At this point you will naturally wonder: do all 47 objects share the same reference relation? Here we used the add tag feature in the bottom-right path-tracing area to automatically check exactly that. As shown in the top-right region of the slide, after filtering we confirmed that 100% of the 47 objects have the same incoming reference.

We then analyzed the VM Stack: Rust Client Callback object itself and found two very suggestively named objects among its references: ImageRequest and ImageDecoder. From the names, it is easy to infer these are image-request and image-decoding objects.

![](https://static001.geekbang.org/infoq/96/960d4c210018522030510d0c96299a52.webp)

Searching the class list for these two keywords turned up 48 ImageRequest objects and 47 ImageDecoder objects. If you remember, the biggest memory consumer on the previous slide, ImageIO, also numbered 47 — hardly a coincidence. Tracing the references of these two classes, we found that they, too, were 100% referenced by VM Stack: Rust Client Callback objects.

In the end, together with the Feishu image library team, we located the cause: 47 images were being requested and decoded concurrently at the same moment, which is not a reasonable design. The root cause was that the image library's downloader relied on an NSOperationQueue for task management and scheduling but never configured its maximum concurrency, so in extreme scenarios memory usage could spike. The corresponding fix is to set a maximum concurrency on the image downloader and to adjust priorities according to whether the image to be loaded is inside the visible area.
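Here is a minimal sketch of that kind of fix; the downloader class and its API are hypothetical stand-ins, not Feishu's real image library:

```objc
#import <Foundation/Foundation.h>

// Hypothetical downloader illustrating the fix: bound the concurrency of
// the download/decode queue, and deprioritize off-screen images.
@interface BDImageDownloader : NSObject
@property (nonatomic, strong) NSOperationQueue *queue;
@end

@implementation BDImageDownloader

- (instancetype)init {
    if ((self = [super init])) {
        _queue = [[NSOperationQueue alloc] init];
        // Without this line the queue picks its own concurrency, and dozens
        // of in-flight decodes can pile up hundreds of MB at once.
        _queue.maxConcurrentOperationCount = 4; // tune per device class
    }
    return self;
}

- (void)downloadImageAtURL:(NSURL *)url visible:(BOOL)visible {
    NSBlockOperation *op = [NSBlockOperation blockOperationWithBlock:^{
        // ... fetch and decode the image ...
    }];
    // On-screen images jump the queue; off-screen ones wait their turn.
    op.queuePriority = visible ? NSOperationQueuePriorityHigh
                               : NSOperationQueuePriorityLow;
    [self.queue addOperation:op];
}

@end
```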
![](https://static001.geekbang.org/infoq/a4/a4f96bbe64a4af078001b330d08b2dba.webp)

The slide above summarizes, again based on our internal online monitoring and attribution tools, the most common causes of OOM problems: memory leaks, which are quite common; memory pile-up, mainly autorelease pools not being drained in time; resource anomalies, such as loading one enormous image or one enormous PDF file; and finally improper memory usage, such as a memory cache designed without any eviction mechanism.

#### 3.4 The fourth class of difficult problems: CPU exceptions and disk I/O exceptions

These two classes are merged into one section because they are highly similar: both are abnormal resource usage, and unlike a crash, the cause is not something that happens in an instant — it is abnormal resource usage sustained over a period of time.

![](https://static001.geekbang.org/infoq/e8/e899f893fe29fe040996b1a3e545726d.jpeg)

What harm do abnormal CPU usage and disk I/O do? First, even when they never actually kill the app, these problems readily cause jank, device heating, and other performance complaints. Second, their volume cannot be ignored. And compared with the stability problems discussed earlier, developers are less familiar with this class and take it less seriously, so it regresses very easily.

Why is attribution hard here? First, as just mentioned, these problems last a long time, so the cause may not be singular. Likewise, because users' environments and interaction paths are complex, developers can hardly reproduce these problems offline. And if the app wanted to monitor and attribute them itself in user space, it would have to sample call stacks at high frequency over a sustained period — a monitoring approach whose performance cost is obviously prohibitive.

![](https://static001.geekbang.org/infoq/15/15ce64d91c1c5daa2f59f7b16d0e79cf.jpeg)

On the left of the slide is an excerpt (key parts only) of a CPU-exception crash log exported from an iOS device. It says: the app's CPU time over the last 3 minutes exceeded 80% — more than 144 seconds — which triggered this kill.

On the right is a screenshot from an Apple WWDC2020 session. For this class of problems, Apple officially suggests two attribution paths: Xcode Organizer, Apple's official issue-monitoring backend; and MetricKit, whose newer versions include diagnostic information for CPU exceptions.

![](https://static001.geekbang.org/infoq/a4/a44857c1fc689a23f1a9f4e89f6c7bb8.webp)

On the left is a disk-write-exception crash log, likewise exported from an iOS device and excerpted: within 24 hours the app's disk writes exceeded 1073 MB, triggering the kill.

On the right is Apple's official documentation, which gives the same two recommendations for attributing this problem: rely on Xcode Organizer, or rely on MetricKit. We ultimately chose MetricKit, mainly because we want to keep the data source in our own hands. Xcode Organizer is, after all, a black-box Apple backend that we cannot integrate with our internal platforms, which makes it inconvenient to build alerting, automatic issue assignment, issue state management, and the rest of the downstream workflow.

![](https://static001.geekbang.org/infoq/e4/e42be711c9bfb5b6c3f477ee51b6df32.webp)

**MetricKit** is Apple's official framework for performance analysis and stability diagnostics; being a system library, its performance overhead is minimal. On iOS 14 and above, MetricKit gives us convenient access to diagnostic information for CPU and disk I/O exceptions. Integration is also very easy: import the system header, set up a listener, and in the corresponding callback upload the CPU-exception and disk-write-exception diagnostics to the backend for analysis.
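A minimal sketch of that integration follows; the `uploadCallStack:type:` method is a hypothetical stand-in for your own reporting pipeline, while the MetricKit types and callbacks are the real iOS 14+ API:

```objc
#import <MetricKit/MetricKit.h>

API_AVAILABLE(ios(14.0))
@interface BDMetricSubscriber : NSObject <MXMetricManagerSubscriber>
@end

@implementation BDMetricSubscriber

+ (void)install {
    static BDMetricSubscriber *subscriber; // keep a strong reference
    subscriber = [BDMetricSubscriber new];
    [[MXMetricManager sharedManager] addSubscriber:subscriber];
}

// Delivered by the system, typically at most once per day.
- (void)didReceiveDiagnosticPayloads:(NSArray<MXDiagnosticPayload *> *)payloads {
    for (MXDiagnosticPayload *payload in payloads) {
        // CPU-exception diagnostics: sustained CPU over the system budget.
        for (MXCPUExceptionDiagnostic *d in payload.cpuExceptionDiagnostics) {
            [self uploadCallStack:[d.callStackTree JSONRepresentation]
                             type:@"cpu_exception"];
        }
        // Disk-write diagnostics: writes beyond the budget (the log above
        // showed a threshold of 1073 MB in 24 hours).
        for (MXDiskWriteExceptionDiagnostic *d in payload.diskWriteExceptionDiagnostics) {
            [self uploadCallStack:[d.callStackTree JSONRepresentation]
                             type:@"disk_write_exception"];
        }
    }
}

- (void)uploadCallStack:(NSData *)callStackJSON type:(NSString *)type {
    // Hypothetical: ship to your APM backend, where the aggregated call
    // trees can be rendered as flame graphs for analysis.
}

@end
```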
![](https://static001.geekbang.org/infoq/ab/ab44cb261d75fad5904d4445c0450f19.jpeg)

The diagnostic formats of these two exception types are themselves highly similar: both record all method calls over a period of time together with each method's time cost. Once uploaded to the backend, this data can be visualized as a very intuitive flame graph, which helps developers locate problems easily. For the flame graph on the right of the slide, you can simply read it as: the longer the rectangle, the more CPU time consumed; so we only need to find the app call stack with the longest rectangles. In the highlighted red box there is a method whose name contains the keyword animateForNext; from the name alone one can guess it is scheduling some animation.

In the end, together with the Feishu team, we located the cause: an animation in Feishu's mini-program business kept playing even while hidden, keeping CPU usage persistently high. The fix was equally simple: pause the animation whenever it is hidden.

### 4. Summary and Review

![](https://static001.geekbang.org/infoq/0e/0e9b8747ee0c09c83c2301182eda2d78.webp)

In part 2, the governance methodology, I said that "to govern stability problems well, the effort has to run through every stage of the software development cycle, including problem discovery, attribution, treatment, and regression prevention." We also believe that attribution of online problems — especially the difficult ones — is the most critical link in the whole chain. For each class of difficult problems, this talk has offered practical attribution tools: Zombie monitoring and Coredump for Crash; thread state and deadlocked-thread analysis for Watchdog; MemoryGraph for OOM; and MetricKit for CPU and disk I/O exceptions.

![](https://static001.geekbang.org/infoq/ec/ec982b569a0bf594983385b80c52e507.jpeg)

Of all the attribution approaches presented in this talk, everything except MetricKit was developed in-house at ByteDance, and the open-source community has no complete equivalent yet. These tools and platforms will subsequently be offered as a one-stop enterprise solution through [APM Plus, under ByteDance's Volcengine application development suite MARS](https://www.volcengine.com/products/apmplus). All of the capabilities covered here have been validated and polished for years across ByteDance's major products; their own stability, and the business results they deliver after integration, are plain for all to see. We welcome your continued attention.