Inside Baidu Microservice Monitoring: The Evolution of Service Monitoring at Baidu Games

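{"type":"doc","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"Introduction: 'It is easy to guard against trouble before it arises, and hard to remove it once it has arisen.' (Ma Wensheng, Ming dynasty, 《添風憲以撫流民疏》)","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"As a programmer, have you noticed that on every holiday there are engineers at tourist spots opening their laptops to deal with urgent production issues? When a flood of alerts fires in production, how do we tell whether the problem lies in our own service or in a dependency? When a major incident hits at midnight, how do we wake the right people so they can respond quickly? These questions are familiar to many of us, and the importance of monitoring goes without saying. So how do we build a monitoring system that helps engineers detect problems and locate them efficiently? This article describes the microservice monitoring practice at Baidu Games. Building on Baidu's mature monitoring infrastructure, we have put together a fairly complete monitoring system, and below we walk through how we got there.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"italic","attrs":{}}],"text":"Full text: 4,583 Chinese characters; estimated reading time: 9 minutes.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"heading","attrs":{"align":null,"level":4},"content":[{"type":"text","text":"1. Background","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"As the business has grown rapidly, each game server-side engineer now maintains two to three microservices on average, and more may be introduced as business scenarios multiply. To learn the running state of the whole microservice system efficiently, and to detect problems and clear faults quickly when the business misbehaves, the game server-side engineers have tried many things in their monitoring practice.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Early monitoring relied on the company's Argus monitoring (log and server monitoring), the Monitor platform (business monitoring), and Sia monitoring (visual monitoring), which together covered some basics. But without an overall system and without integration with the business, the results were unsatisfactory: many problems were still reported by customer service and product colleagues, and the biggest headache for engineers following up was that locating a problem often took a long time, which hurt the business. We therefore reviewed the problems we faced systematically, redesigned and improved the monitoring system as a whole, and tied problem location closely to the business, which greatly improved how quickly problems are located.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs"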
:{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"The rest of this article describes how we built the monitoring system; we hope readers find it useful.","attrs":{}}]},{"type":"heading","attrs":{"align":null,"level":4}},{"type":"heading","attrs":{"align":null,"level":1},"content":[{"type":"text","text":"2. First Steps in Microservice Monitoring","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"In the early days we mainly added individual monitors on top of Baidu's monitoring infrastructure, but without an overall system the results were unsatisfactory. Although our monitoring at this stage was incomplete and weak, these scattered measures still helped engineers find many system problems and laid the groundwork for the later systematic, multi-dimensional monitoring.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"2.1 Log and Server Monitoring","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Using Baidu's Argus monitoring platform, we monitor machine status and business logs; the game microservices rely on these machine and log monitoring capabilities to cover the online services.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Our early use of Argus was largely one-dimensional and not deeply tied to business scenarios; for example, we did not initially design per-problem or per-instance thresholds or multi-dimensional alerting. The capabilities and workflow of Baidu Argus monitoring are shown below:","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"                        ","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/61/6189711348a74cef6ce17b3b347a6d8b.png","alt":"圖片","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"The overall Argus data flow is shown below; it supports alerting by phone call, SMS, and Baidu Ruliu (如流).","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"            ","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/5c/5c7efd86c87373a304f07cd0b309838a.png","alt":"圖片","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"    ","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"For log monitoring, the industry has the familiar ELK Stack (Elasticsearch + Logstash + Kibana): Beats (optional) is installed on each server as the client-side log collector, Logstash then collects, parses, and filters the logs centrally and sends the data to Elasticsearch for storage and analysis, and Kibana displays the results.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"2.2 Service Polling Monitoring","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Using Baidu's Monitor platform, core interfaces are probed by scheduled polling to help monitor online service quality. The Monitor platform supports visual configuration, but every scenario needs its own custom configuration; as the business iterated quickly, the efficiency and ease of adding monitors this way could no longer keep up with business needs.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"                ","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/45/453170ba12506daf88fc9910187623b4.png","alt":"圖片","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"2.3 Service Visualization Monitoring","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Using the company's SIA intelligent monitoring system, we visualized metrics such as service traffic, availability, and performance, so engineers can observe the online state of a service visually and be alerted on abnormal states. However, the business did not make full use of SIA's intelligent monitoring capabilities, so the visualization was of limited help and the intelligent features went unused.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/37/372212894eab24bb92ec13842df5435a.gif","alt":"圖片","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype"
,"value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":"center","origin":null},"content":[{"type":"text","text":"Figure 3. Monitoring visualization","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Among industry visualization tools, Kibana, Grafana, and others are already mature and can satisfy most of the display needs of a business; they are worth a look.","attrs":{}}]},{"type":"heading","attrs":{"align":null,"level":4}},{"type":"heading","attrs":{"align":null,"level":1},"content":[{"type":"text","text":"3. Evolution of Microservice Monitoring","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"As described above, the monitoring measures of the early stage helped engineers find and locate some problems, but many issues remained, mainly in the following four areas:","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Risks surfaced late: by the time most alerts fired, the impact had already occurred;","attrs":{}}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Monitoring lacked unified planning: monitoring items were disorganized and coverage was far from complete;","attrs":{}}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Monitoring capability was weak and could not provide useful information about anomalies;","attrs":{}}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Alerting was chaotic, and engineers were bombarded with alert messages.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":"br"}}]}],"attrs":{}},{"type":"paragraph","attrs":{"indent":0,"number":0,"
align":null,"origin":null},"content":[{"type":"text","text":"Weighing the overall cost and benefit of building the monitoring system, we chose not to throw away the existing monitoring but to improve on the basic capabilities already in place. We first designed the monitoring system as a whole from a systematic point of view, then strengthened each part of it according to that design.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"3.1 Systematic Monitoring Design","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Goals: effective prevention, timely detection, and fast loss containment.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Implementation: starting from these design goals, we broke the implementation down as follows.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"                                   ","attrs":{}}]},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/57/576218bac056acbf5fc650eadb65c8cc.png","alt":"圖片","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"In practice we designed the systematic monitoring of the microservice system around four areas: risk control, intelligent monitoring, intelligent alerting, and efficient problem location. The overall flow is as follows:","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":"br"}},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/d4/d4f6f33c3d4943731c5b697362e37133.png","alt":"圖片","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"The following sections cover risk control, intelligent monitoring, intelligent alerting, and efficient problem location in turn.","attrs":{
}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":"br"}},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"3.1.1 Risk Control Design","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"The earlier a production problem is found, the better. Engineers inevitably differ in skill, and code review alone cannot reliably keep problems out of a release, so the game engineering team invested heavily in automated test cases and in the release process to reduce problems. The main risk-control items are listed below; with these in place, problems occurring during release have been reduced by more than 95%.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"            ","attrs":{}}]},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/63/6374935326cc1e9984134eb5cae2a674.png","alt":"圖片","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"3.1.2 Intelligent Monitoring Design","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Early monitoring of the game business was added piecemeal: log monitoring used Argus, and visual monitoring used the SIA intelligent monitoring platform. Neither coverage nor coordination between the monitoring systems was considered globally, which exposed problems such as:","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"Problem 1:","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Monitoring divided by monitored object gives effective coverage along a single dimension, but how do we detect abnormal fluctuations across the whole system?","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"Problem 2:","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"When an occasional network or disk fault on one instance causes a sudden spike in pvlost, how do we get that information efficiently?","attrs":{}}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"Problem 3:","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"When system availability fluctuates, is it a problem in one data center, in a specific interface, or in calls to a downstream service?","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"heading","attrs":{"align":null,"level":4},"content":[{"type":"text","text":"(1) Intelligent anomaly detection","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Using the SIA system's intelligent anomaly detection algorithms, we brought latency, traffic, SLA metrics, revenue, and other metrics into the monitoring system, so periodic and non-periodic abnormal fluctuations can be detected efficiently. The main algorithms are introduced briefly below.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"                                 ","attrs":{}}]},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/cb/cbf4c08cba3c031a153ee23bcb032407.png","alt":"圖片","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"By combining the above with the game business's traffic, latency, revenue, and other metrics, periodic or non-periodic fluctuations, even fairly slow declines, can be detected effectively by these detection tools, which greatly improves anomaly detection coverage.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"heading","attrs":{"align":null,"level":4},"content":[{"type":"text","text":"(2) Full-scenario monitoring coverage","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"We cover monitoring across four quadrants so that no problem goes unexposed. For monitoring at, say, the service level, we also added fine-grained multi-dimensional filtering, aiming to make problems easy to spot from a macro view while also helping to locate them efficiently at the micro level.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","t
ext":"                                               ","attrs":{}}]},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/07/07edc74f90ad24f0684a4ca5743cabc5.png","alt":"圖片","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Data monitoring deserves special mention. For the particular scenarios of the game business, we spelled out the data and scenarios that need monitoring to ensure complete coverage. Some of the data-related monitoring items are shown below.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/ac/ac3cabb4bbe1a549a890136ba9f6ed12.png","alt":"圖片","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"heading","attrs":{"align":null,"level":4},"content":[{"type":"text","text":"(3) Multi-dimensional monitoring visualization","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Multi-dimensional filtering: by service, interface, error code, data center, and machine instance;","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Multi-dimensional visualization of anomalies: for example, the distribution of pvlost by interface, machine, and data center;","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Error distribution visualization: broken down by interface and by error code.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"                    ","attrs":{}}]},
{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/42/420fb5d9ba1c2d64a62f509181cb4c02.gif","alt":"圖片","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":"center","origin":null},"content":[{"type":"text","text":"Figure 6. Multi-dimensional monitoring visualization","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":"br"}},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"3.1.3 Intelligent Alerting Design","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Alerting as a whole is tiered: different scenarios get different alert recipients and alert channels, which cuts down the flood of unimportant alerts. On top of that, alerting follows the overall design below:","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"heading","attrs":{"align":null,"level":4},"content":[{"type":"text","text":"(1) Intelligent merging, filtering, and automatic escalation","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/75/750fc1130cd66be63f31eada1189e64e.png","alt":"圖片","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"Intelligent filtering: ","attrs":{}},{"type":"text","text":"screens alert messages to reduce excessive alert flooding;","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"Intelligent alert merging: ","attrs":{}},{"type":"text","text":"merges related messages to make alerts more concise, further reducing alert flooding;","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","mar
ks":[{"type":"strong","attrs":{}}],"text":"Automatic alert escalation: ","attrs":{}},{"type":"text","text":"solves the nagging problem of alerts not reaching the on-call engineer. Different thresholds widen the alert to different audiences and escalate the channel from email -> Ruliu (如流) -> SMS -> phone call, and phone alerts can be set to keep redialing until someone responds.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"heading","attrs":{"align":null,"level":4},"content":[{"type":"text","text":"(2) Customizable style and content","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"For ordinary instance or service alerts, the alert message is output in a fixed format;","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"For core logic, we added rich-text alert content definitions that show the alert and the underlying problem in full and include the context of the problem. This greatly increases the amount of information carried and gives engineers what they need to locate the problem.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/41/418fc0db3d6029696ce2a1036601d0e6.gif","alt":"圖片","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":"center","origin":null},"content":[{"type":"text","text":"Figure 8. Customized alert content and style","attrs":{}}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":"br"}},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"3.1.4 Support for Efficient Problem Location","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/9d/9d16ade95f560f62ad64f92a85970487.png","alt":"圖片","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"Efficient exposure of alert information","attrs":{}},{"type":"text","text":": for key core logic, a trace chain plus a chat bot delivers alerts efficiently with customized output, so information is passed along efficiently;","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"Efficient confirmation of alert information","attrs":{}},{"type":"text","text":": after an anomaly alert fires, engineers need to quickly retrieve the complete related online logs and the request data as it was at the time; the real-time trace system handles this efficiently.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"heading","attrs":{"align":null,"level":4},"content":[{"type":"text","text":"(1) Bot-delivered trace chain information for core logic","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"For core logic, alert exposure has largely reached minute-level alerting plus automatic problem location: from the alert message alone, engineers can see the offending line of code and the cause of the error, which greatly speeds up problem location.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"This style of alerting is still costly to implement. For example, if an exception occurs while granting items to a user after a top-up completes in the game business, we expose the request parameters, the failing function, and the specific cause of the error, so engineers can see the exact problem directly from this data. But this requires a fairly customized implementation and carries some integration cost.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"                                 ","attrs":{}}]},
{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/71/719d9bde65e317b8a9c2999b56584925.png","alt":"圖片","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"heading","attrs":{"align":null,"level":4},"content":[{"type":"text","text":"(2) Integration with the real-time trace system","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Using Baidu's trace platform, data can be collected without intruding on business code, so the integration cost is very low. For timeliness, we use Baidu's DataHub message queue and build indexes in real time with Dstream, so that data becomes searchable by key information on the fault location platform within 5 minutes of leaving the source, which greatly improves how quickly engineers locate problems.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" ","attrs":{}}]},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/26/2688e2502c9d14d1db1ce365784b33ef.png","alt":"圖片","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"heading","attrs":{"align":null,"level":1},"content":[{"type":"text","text":"4. The Full Picture of Microservice Monitoring","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"            ","attrs":{}}]},
{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/86/8630c72fe4b2459d34eef8934365d6d8.png","alt":"圖片","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"4.1 Reaching Users","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Multi-dimensional visual monitoring lets engineers work out the rough cause of a problem quickly from the dashboards alone. Intelligent alerting and business reports together cover both timeliness and detailed business health, giving engineers a full view of the system's state.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"4.2 Monitoring Tools","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"The company-provided Argus monitoring, Sia intelligent monitoring, and bot-based monitoring tools together cover the system fully. For long-cycle business data, such as daily active users, download success rate, and white-screen rate, we provide customized monitoring.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"4.3 Monitoring Metrics","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Monitoring metrics fall broadly into the categories shown in the figure above, and we use these categories to achieve effective coverage.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"4.4 Monitored Objects","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Monitored objects range from servers, business logs, and service status to business data, core business logic, and core scenarios. By inventorying monitored objects thoroughly, we have gained full control over what is monitored.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"heading","attrs":{"align":null,"level":1},"content":[{"type":"text","text":"5. Summary and Outlook","att
rs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Through systematic work on monitoring, we have reached a fairly good state in timeliness, location efficiency, and coverage. Engineers learn of major production problems right away and have thorough supporting information to help locate them efficiently. Looking back over the whole effort, our main takeaways are as follows.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"(1) Design systematically, then implement","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"A monitoring system must first be clear about what problem it solves and what goal it serves. Once the problem and goal are understood, implementation is a matter of how to solve the problem fully and reach the goal. Following this systematic analysis and breakdown, we built our monitoring system around risk control, intelligent monitoring, intelligent alerting, and efficient problem location.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"(2) Apply tiered thinking to monitoring and alerting, and concentrate effort on core logic","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"For both monitoring and alerting, we focus on important features and core logic; if one existing tool cannot meet the goal, we combine several. For general-purpose logic, the emphasis is on breadth, with existing tools providing complete coverage.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"(3) Easy to implement and roll out","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Both the company's SIA intelligent monitoring and Argus monitoring offer aggregation, so homogeneous content can be monitored in one step. Heterogeneous or differentiated services can be onboarded non-intrusively in whatever form the business already uses, which makes adding monitoring much faster.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"tex
t","marks":[{"type":"strong","attrs":{}}],"text":"(4) Make full use of existing company capabilities and combine them creatively to improve efficiency","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Each tool in the monitoring infrastructure has strengths and weaknesses, and playing to the strengths of each gives the best overall result. For monitoring core logic in particular, we combined bot alerts with customizable trace content to report and locate core-logic problems efficiently.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Although our monitoring practice has achieved fairly good results, the automation of fault handling, disaster recovery, and similar capabilities still needs work, and system resources are not yet used intelligently: scaling resources up or down still depends on manual intervention. Our next goal is full automation of fault handling and intelligent resource scaling, to improve the overall maintainability and availability of the system.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"Recommended reading:","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"link","attrs":{"href":"http://mp.weixin.qq.com/s?__biz=Mzg5MjU0NTI5OQ==&mid=2247495573&idx=1&sn=ed2ab72bdea9ac56cb3c63c2afc5fbb0&chksm=c03edfe9f74956ff87711de4647dbf0e9fc3067bacd9a1363bb08827d8257af4b59dff7a1012&scene=21#wechat_redirect","title":null,"type":null},"content":[{"type":"text","text":"|","attrs":{}}]},{"type":"link","attrs":{"href":"https://mp.weixin.qq.com/s?__biz=Mzg5MjU0NTI5OQ==&mid=2247496007&idx=1&sn=ea4e0dc518177e456ff01a2961af2842&chksm=c03ec13bf749482dd2a5d241d68d087454fd79f4f204fcbc98821dbbaecf6b2d51945c111a41&token=1402156335&lang=zh_CN#rd","title":"","type":null},"content":[{"type":"text","text":"How to Optimize User Experience Like Baidu Live (Playback Start)","attrs":{}}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"link","attrs":{"href":"http://mp.weixin.qq.com/s?__biz=Mzg5MjU0NTI5OQ==&mid=2247495266&idx=1&sn=a50ed4cf4828bb6bdc6caa58c2cdae5a&chksm=c03ede1ef74957080a72f358781b0ee656be889bf4a91705a249ab1c7e36ce78294de2974226&scene=21#wechat_redirect","title":null,"type":null},"content":[{"type":"text","text":"|The Story of Analyzing Baidu Search Stability Problems (Part 2)","attrs":{}}]}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"link","attrs":{"href":"http://mp.weixin.qq.com/s?__biz=Mzg5MjU0NTI5OQ==&mid=2247495266&idx=1&sn=a50ed4cf4828bb6bdc6caa58c2cdae5a&chksm=c03ede1ef74957080a72f358781b0ee656be889bf4a91705a249ab1c7e36ce78294de2974226&scene=21#wechat_redirect","title":null,"type":null},"content":[{"type":"text","text":"|The Story of Analyzing Baidu Search Stability Problems (Part 1)","attrs":{}}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"----------  END  ----------","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Baidu Geek Talk (百度Geek說)","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Baidu's official technology WeChat account is now live!","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Technical deep dives · Industry news · Online salons · Industry conferences","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Job openings · Referrals · Technical books · Baidu merchandise","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"You are welcome to follow us","attrs":{}}]}]}
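The automatic escalation described in section 3.1.3 (email -> Ruliu -> SMS -> phone call, with the phone redialed until someone responds) can be sketched roughly as below. This is a minimal illustration and not Argus's actual configuration or API, which the article does not show: the `EscalationPolicy` type, the minute thresholds, and the redial interval are all hypothetical.

```python
from dataclasses import dataclass

# Channels in the escalation order given in section 3.1.3.
CHANNELS = ["email", "ruliu", "sms", "phone"]


@dataclass(frozen=True)
class EscalationPolicy:
    # Hypothetical thresholds: minutes an alert may stay unacknowledged
    # before it escalates to the channel at the same index in CHANNELS.
    thresholds_min: tuple = (0, 5, 10, 15)
    # Hypothetical redial interval once the phone level has been reached.
    redial_every_min: int = 5


def channel_for(policy, unacked_min):
    """Return the highest channel whose threshold has been reached."""
    level = 0
    for i, threshold in enumerate(policy.thresholds_min):
        if unacked_min >= threshold:
            level = i
    return CHANNELS[level]


def should_notify(policy, unacked_min):
    """True at each escalation threshold, then at every redial interval."""
    if unacked_min in policy.thresholds_min:
        return True
    last = policy.thresholds_min[-1]
    return unacked_min > last and (unacked_min - last) % policy.redial_every_min == 0
```

With these assumed defaults, an alert left unacknowledged for 12 minutes is routed to SMS, and from minute 15 onward the phone is redialed every 5 minutes until the alert is acknowledged.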