Going the Last Mile in Locating Production Bugs

{"type":"doc","content":[{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"A Short Story"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"The weekend alarm went off at noon in a rented room in Huilongguan, where rents average 3,000 yuan. Programmer Xiao A lazily picked up his phone and swiped open the notification bar: no missed calls. He opened the blocked-messages folder: no alert SMS. Last night's release must have gone smoothly. Xiao A stretched contentedly, put on his 3,000-yuan Beats Solo Pro, and slowly closed his eyes as the music rose. The midday sun poured in through the window and fell on his thinning hair. In his mind, Xiao A was sketching out a bright future. The landlord had said that Xiao B, who lived in this room ten years ago, is now a T10 heavyweight at a certain “Du” company, and that Xiao T, who lived here five years ago, now leads a 200-person team at a certain “Tiao” company. At the thought, the corners of Xiao A's mouth curled up slightly: surely I won't do too badly either~"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Beep, beep. Two message alerts sounded in the headphones. Xiao A's heart sank, a bone-chilling cold spread through him, and the March sunshine in Beijing suddenly lost its warmth. He forced his eyes open a crack, and the avatar of Xiao C, the tester, flashed across the notification bar."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"xx Production Bug Emergency Fix Group:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Xiao C: “@Xiao A The code that went live last night seems to have a problem. Can you come in and take a look? I'll be waiting at the office.”"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"He opened the group settings, and there was the boss's avatar in the member list."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Feeling guilty, hesitant, remorseful, helpless, and angry all at once, Xiao A rolled out of bed, slipped on the 20-yuan flip-flops he had bought from a street stall, and boarded Subway Line 10 bound for Xi'erqi."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Before long, the age-old dialogue drifted out of an office in Xi'erqi: “This code was tested, and it works on my machine.” “Try restarting it.” “Maybe the code never went live?” “Did someone overwrite my code?” “Is something wrong with your test data?” ... And so an afternoon passed, an evening passed, a weekend passed, a programmer's youth passed, and a programmer's already short career passed."}]},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"A Short Summary"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"The fictional story above illustrates a simple fact: programmers spend much of their time fixing production bugs. Because production and offline environments differ, and inputs and outputs differ with them, many bugs are slow and time-consuming to locate. As a result, programmers dread production bugs, instinctively blame external factors, and start investigating only once they are sure the problem is their own. So what makes production troubleshooting so hard? Let's first look at the basic steps, which apply to most production issues."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":"br"}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Step 1: Find an input that reproduces the problem;"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Step 2: Determine whether that input can be constructed in the daily (test) environment. If it can, skip to Step 5; if not, continue with Step 3;"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Step 3: Check the production logs for error logs related to the abnormal input to help with troubleshooting;"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Step 4: Form an initial hypothesis about the cause, attempt a fix, and add more log output. Then package and release. Repeat from Step 3 until the root cause is located;"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Step 5: Construct the same input in the daily environment, debug step by step, and locate the problem."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"In practice, because production and offline environments are isolated, production inputs are often hard to construct in the daily environment. Most of the time we loop through Steps 2, 3, and 4, and time slowly slips away in the loop."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"All of these steps serve one goal: when a piece of code does not behave as expected, we want to know what its input was, what its output was, what exception it threw, and how each line of the code actually executed. Is there a product that lets users achieve this quickly and conveniently? The answer is yes."}]},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"A Look at ARMS"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Alibaba Cloud Application Real-Time Monitoring Service (ARMS) is an application performance management (APM) product. It comprises three sub-products, Application Monitoring, Prometheus Monitoring, and Browser Monitoring, and covers performance management for distributed applications, container environments, browsers, mini programs, and mobile apps, helping users achieve full-stack performance monitoring and end-to-end tracing and diagnosis."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":"br"}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"ARMS recently launched the Arthas diagnosis feature. Its first version offers four main capabilities: JVM overview, thread time analysis, method execution analysis, and performance analysis."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"JVM overview: view real-time JVM memory and GC information, as well as operating system information, environment variables, and system properties."}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Thread time analysis: view real-time time consumption per thread, along with each thread's live method stack."}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Method execution analysis: capture, in real time, the execution details, input and output parameters, and exceptions of method calls that meet specified conditions."}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Performance analysis: quickly reveal system performance bottlenecks as flame graphs."}]}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"The Arthas feature in ARMS is fairly easy to use; see the documentation (https:\/\/help.aliyun.com\/document_detail\/204809.html) for details. Below is a brief look at how to use the Arthas diagnosis capability in ARMS to locate production problems."}]},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"A Look at ARMS Arthas Diagnosis"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"The previous section introduced the capabilities of Arthas diagnosis in ARMS. Which production problems can they solve? Here we group production problems into the following four categories:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":"br"}},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"Method execution does not meet expectations: "},{"type":"text","text":"This covers method execution time, return values, and thrown exceptions. At the application level it may show up as higher response time (RT), a higher error rate, or abnormal return values for some interfaces or services."}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"Sudden spike in process CPU time: "},{"type":"text","text":"Common causes include infinite loops in code, Full GC driving up GC thread time, and concurrent use of HashMap."}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"Performance optimization: "},{"type":"text","text":"Mainly for analyzing performance bottlenecks to support optimization, including profiling of CPU time, memory allocation, lock contention, and itimer."}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"Other problems: "},{"type":"text","text":"For example, environment variables read incorrectly during initialization, a kernel version that does not meet requirements, or class conflicts."}]}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"The following demo shows how to use Arthas in ARMS to diagnose the first category, method execution that does not meet expectations. Later articles will cover how to use Arthas to diagnose the other categories."}]},{"type":"heading","attrs":{"align":null,"level":4},"content":[{"type":"text","text":"Using ARMS Arthas to Diagnose Method Execution That Does Not Meet Expectations"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Background: after a release, the average RT of the com.alibabacloud.hipstershop.productserviceapi.service.ProductService@confirmInventory interface of the product application rose to 400 ms, whereas before the release it was below 1 ms, as shown in the figure below. We now want to find out exactly where the time is being spent."}]},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/78\/78e1991eb1295a7480529bf77e2a3338.png","alt":"圖片","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":"center","origin":null},"content":[{"type":"text","text":"Figure 1"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":"br"}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"First, open the ARMS Arthas diagnosis page. To locate a bug, we first need the name of the class and method at fault; enter them as indicated by the red annotations in the screenshot. If you are an EDAS (https:\/\/help.aliyun.com\/document_detail\/42934.html) user, you can select a service or interface directly, and the backend will infer the corresponding implementation class and method. In this case, the class is com.alibabacloud.xxx.xxx.xxx.ProductService and the method is confirmInventory. After filling these in, click OK."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/cc\/cc602625915ce93436044c3e9481312e.png","alt":"圖片","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":"center","origin":null},"content":[{"type":"text","text":"Figure 2"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":"br"}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"As shown below, after you click OK you get a record of a confirmInventory execution, including its input parameters, return value, exception, and execution details."}]},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/a2\/a244ab655d948c1cab87b96aea549a58.png","alt":"圖片","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":"center","origin":null},"content":[{"type":"text","text":"Figure 3"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":"br"}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"However, this execution took only 2.89 ms, so it is not one of the slow calls we are looking for. Click the button in the upper-right corner to modify the diagnosis parameters and capture only method calls that take longer than 300 ms. (You can also set further filter conditions, such as conditions on the method arguments; see the documentation at https:\/\/help.aliyun.com\/document_detail\/204809.html for details.)"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/4a\/4a1fa8a3816caa0218b7369a2204e891.png","alt":"圖片","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":"center","origin":null},"content":[{"type":"text","text":"Figure 4"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"After clicking OK, click the refresh icon in the upper-right corner to run the diagnosis again. This time a call taking 1501 ms is captured, and it turns out that Thread.sleep() was executed during the method."}]},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/61\/61247e1472506fcdfebd69b4b5beb2ad.png","alt":"圖片","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":"center","origin":null},"content":[{"type":"text","text":"Figure 5"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":"br"}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"At this point you may wonder why sleep is called and what the logic of this code is. Click View Method Source in the upper-right corner to see the method source combined with the execution details at a glance. As shown below, each method call made inside confirmInventory is followed by its execution time, prefixed with “\/\/-”."}]},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/fc\/fcc080894c9b7dbf0f3e11eb5a7c1eb3.png","alt":"圖片","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":"center","origin":null},"content":[{"type":"text","text":"Figure 6"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"You can also click Drill Down in the Actions column at the far right of the list in Figure 5 to quickly analyze the execution of the sub-methods called by confirmInventory. This is very handy when the root cause lies deep in the call chain."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"This completes the demonstration of locating this problem."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"We believe the Arthas diagnosis feature in ARMS has left a strong impression on you and will become a powerful tool for diagnosing production problems faster and more conveniently."}]},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"Final Notes"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Try ARMS quickly and for free at https:\/\/arms.console.aliyun.com\/"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"In addition, Enterprise Distributed Application Service (EDAS) for K8s (https:\/\/help.aliyun.com\/document_detail\/199295.html) is an all-in-one product that provides application hosting and also integrates the monitoring and diagnosis capabilities of ARMS, so the Arthas diagnosis feature is available there as well. Choose whichever product suits your current situation to try it."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Note: this feature currently works only for Java applications deployed in K8s clusters; support for Java applications deployed on ECS will follow."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"This article is reposted from: Alibaba Middleware (ID: Aliware_2018)"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Original link: "},{"type":"link","attrs":{"href":"https:\/\/mp.weixin.qq.com\/s\/UBSizddouMxc27A5_PfkmQ","title":"xxx","type":null},"content":[{"type":"text","text":"Going the Last Mile in Locating Production Bugs"}]}]}]}
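
To make the article's duration filter concrete (capture only the calls that take longer than 300 ms, together with their arguments, return value, and exception), here is a minimal plain-Java sketch. Everything in it is an assumption made for illustration: the names `InvocationRecorder`, `CallRecord`, and `capture`, and the stand-in `confirmInventory` with its artificial delay, do not come from ARMS or Arthas. It is not how the product works internally; it only shows which data such a diagnosis collects and how a duration threshold decides what is kept.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.Callable;

public class InvocationRecorder {
    /** One captured call: arguments, outcome, and elapsed time (hypothetical type). */
    static final class CallRecord {
        final Object[] args;
        final Object result;
        final Throwable error;
        final long costMillis;

        CallRecord(Object[] args, Object result, Throwable error, long costMillis) {
            this.args = args;
            this.result = result;
            this.error = error;
            this.costMillis = costMillis;
        }
    }

    private final long thresholdMillis;
    private final List<CallRecord> captured = new ArrayList<>();

    InvocationRecorder(long thresholdMillis) {
        this.thresholdMillis = thresholdMillis;
    }

    /** Runs the call and keeps a record only if it took longer than the threshold. */
    <T> T capture(Callable<T> call, Object... args) throws Exception {
        long start = System.nanoTime();
        T result = null;
        Throwable error = null;
        try {
            result = call.call();
            return result;
        } catch (Exception e) {
            error = e;
            throw e;
        } finally {
            long costMillis = (System.nanoTime() - start) / 1_000_000;
            if (costMillis > thresholdMillis) {
                captured.add(new CallRecord(args, result, error, costMillis));
            }
        }
    }

    List<CallRecord> captured() {
        return captured;
    }

    /** Stand-in for the slow method in the demo; the delay parameter is invented. */
    static boolean confirmInventory(String productId, long delayMillis)
            throws InterruptedException {
        Thread.sleep(delayMillis);
        return true;
    }

    public static void main(String[] args) throws Exception {
        InvocationRecorder recorder = new InvocationRecorder(300);
        recorder.capture(() -> confirmInventory("p-1", 0), "p-1");   // fast, dropped
        recorder.capture(() -> confirmInventory("p-2", 400), "p-2"); // slow, kept
        for (CallRecord r : recorder.captured()) {
            System.out.println("slow call args=" + Arrays.toString(r.args)
                    + " result=" + r.result + " cost=" + r.costMillis + "ms");
        }
    }
}
```

A wrapper like this requires changing and redeploying code, which is exactly the loop of Steps 3 and 4 that the article wants to avoid. The point of a runtime diagnosis tool is to collect the same record from an already running application without modifying it.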