The Story of Baidu Search Stability Problem Analysis (Part 2)

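{"type":"doc","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"Introduction: ","attrs":{}},{"type":"text","text":"The Baidu search system is Baidu's oldest and largest system, and using it has become part of everyday life. There is an amusing habit: many people open Baidu Search to check whether their network connection is working. This habit shows that people regard the Baidu search system as a byword for “stability”, and it is indeed stable. Why is the Baidu search system so highly available? What technologies lie behind it? Past technical articles have rarely covered this.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"This article starts from the familiar Baidu search system itself and presents the fine-grained techniques used for “stability problem analysis” in its availability governance. Following the historical thread, it describes the difficulties encountered in stability problem analysis, the ways we broke through them, and the innovations that resulted. We hope it offers readers some inspiration and, even more, sparks resonance and discussion among like-minded people.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"italic","attrs":{}}],"text":"About 5,110 characters in the original; estimated reading time 11 minutes.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Last week, in ","attrs":{}},{"type":"link","attrs":{"href":"https://xie.infoq.cn/article/78adc33fb16caa49bdadecd1a","title":"","type":null},"content":[{"type":"text","text":"The Story of Baidu Search Stability Problem Analysis (Part 1)","attrs":{}}]},{"type":"text","text":", we described how building a comprehensive data system eliminated the blind spots in troubleshooting; readers who missed it may want to read that article first. Next, we share how we automated fault analysis and made it more intelligent to improve troubleshooting efficiency.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"heading","attrs":{"align":null,"level":1},"content":[{"type":"text","text":"Chapter 4 Innovating Again: Releasing Application Value Anew","attrs":{}}]},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"4.1 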
巨浪: the “End Point” of Fault Analysis","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Rejection analysis is a qualitative process: from the logs triggered by a rejected query, we can locate the business-level cause or the module that caused the anomaly.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"This process can be abstracted into the following steps:","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"(1) Sensing the fault (rejection) signal","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"(2) Collecting the full information (logs) of the fault unit (query)","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"(3) Attributing a cause to the fault unit (query) based on the collected information","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"(4) 
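Reclassifying the causes of a batch of fault units (queries), and mining their features","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"The whole process must finish within seconds, so the timeliness requirement is high. Executing it smoothly faces the following 8 challenges:","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"Challenge 1: ","attrs":{}},{"type":"text","text":"How to retrieve logs quickly. After a rejection signal is captured, rejection analysis needs the original log text quickly; scanning production machines directly for it would clearly be neither fast nor stable enough.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"Challenge 2: ","attrs":{}},{"type":"text","text":"How to resolve the tension between real-time and accurate rejection localization. The more complete the logs, the more accurate the analysis of the rejection cause. However, because of network latency and other factors, the analysis module cannot be sure of getting all the logs right away. Starting the analysis as soon as the rejection signal arrives keeps it real-time, but accuracy is hard to guarantee. Delaying the analysis may yield more complete logs, but it hurts real-time performance.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"Challenge 3: ","attrs":{}},{"type":"text","text":"How to describe faults accurately and comprehensively. Production faults come in all shapes and sizes, and expressing and managing them one by one would be very costly to maintain. We need a scheme that manages all faults (rules) systematically, accurately and comprehensively.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"Challenge 4: ","attrs":{}},{"type":"text","text":"How to do feature engineering. Once we have the original log text, we need to decide which information to take from the logs and how to collect it, express these features in a form a program can understand, and finally associate them with rejection causes; in other words, feature selection, extraction, representation and application.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content
":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"Challenge 5: ","attrs":{}},{"type":"text","text":"How to reconstruct the scene of a query. To ensure availability, the online system retries on all key modules. When localizing a rejection, the complete scheduling tree has to be reconstructed so that we can see why each path from the root node to the leaf nodes failed; otherwise we may get contradictory results. As the figure below shows, nodes A-1, B-1 and B-2 all issued retries. If the tree is stitched together incorrectly (an instance of module C is attached under the wrong module B node), the error state of B-1 (or B-2) may contradict the log state of the module C instance attached under it, and no correct localization conclusion can be drawn.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/b0/b0a08d70a6d5212f259e0e1eaa07b666.png","alt":"圖片","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"Challenge 6: ","attrs":{}},{"type":"text","text":"How to mine rejection features in depth. Automatic localization can rarely reach the root cause; more precise localization relies on further analysis with human involvement. The analysis tool needs to find clustering features across all kinds of rejections and present them to the user in a certain priority order, providing more clues for root-cause localization or loss mitigation. The information that can be extracted from a query includes its search term (word), the IP of the client that sent it, its language, and the physical data center of the machine that handled it. For example, if all the rejections turn out to be related to attack traffic from a certain IP, that IP can be blocked to stop the loss.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/67/67bff283479274b206165f9f2bc256a3.webp","alt":"圖片","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"Challenge 7: ","attrs":{}},{"type":"text","text":"How to detect cascading failures. When a module failure causes rejections, it may trigger cascading secondary failures, which show up as a variety of direct rejection causes. The figure below shows 
a typical cascading failure: after E becomes abnormal, B issues a large number of retries to C; initial requests plus retry traffic completely overwhelm D; finally A also starts issuing a large number of retries to B. Rejected traffic may hit the rate-limiting policy in any of the modules and so appear as different rejection causes. Therefore, when a fault occurs, relying on rejection statistics from a single point in time may mask the root cause of the rejections.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/62/62b7a9f748ae08479fd6e0f0b572827b.webp","alt":"圖片","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"Challenge 8: ","attrs":{}},{"type":"text","text":"How to localize unknown faults. Faults are sporadic, and the set of categories we use to classify rejection causes cannot fully reflect every rejection cause the system may exhibit. For unknown faults, or rejections not yet covered by the localization rules, we need a means of “manufacturing” faults so as to discover faults that are unknown or have not yet been captured.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/c1/c12990a0ba9a0da6c6b446887bc5d96b.png","alt":"圖片","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Below, we describe in turn how we solved these 8 problems.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":"br"}},{"type":"heading","attrs":{"align":null,"level":4},"content":[{"type":"text","text":"4.1.1 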
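Index Mirroring","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"To retrieve logs quickly, after the online collection module extracts the log index, it not only pushes the index to the local machine for indexing but also actively pushes the subset needed for localization to a bypass index module, which writes it into an in-memory full index store keyed by the queryID of the log. The index supports sparse multi-column storage, so the locations of multiple logs with the same queryID can be appended. In this way, the location information of a single rejected query can be fetched in O(1) time; the logs are then pulled from the target machines in parallel and written to a persistent fault log store. Finally, features are extracted from these logs and the rejection cause is analyzed.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/fe/fec27dfebd5549ee25363ee5325e3715.png","alt":"圖片","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"heading","attrs":{"align":null,"level":4},"content":[{"type":"text","text":"4.1.2 Streaming Analysis","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"To solve problem 2, we borrowed the idea of streaming analysis. Whether the analysis module receives a rejection signal or an incremental rejection-log signal, it triggers an analysis of the rejection cause and updates the conclusion. There are 2 key points. First, analysis is triggered automatically: as soon as a rejection occurs online, rejection analysis starts working. Second, updates are incremental: whenever the logs of a rejected query are updated, the fault cause analysis is triggered again. For the entry module, the online collector decides from a designated field in its log whether the query was rejected, and pushes this signal together with the index to the bypass index module. On receiving the signal, the bypass index module immediately notifies the analysis center to analyze this queryID, so the analysis can be triggered before the rejection alarm is raised, maximizing the efficiency of fault localization and loss mitigation. After the bypass index module has sent an analysis request to the analysis module, it records the queryID in the pvlost table of the full index store. When an index for a non-entry-module log arrives later, the bypass index module looks up the queryID from that index in this table to decide whether an incremental analysis needs to be triggered. The incremental analysis merges all known logs and updates the conclusion.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"heading","attrs":{"align":null,"level":4},"content":[{"type":"text","text":"4.1.3 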
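Complete labelset","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"After the entry module receives a user query, the query is processed by multiple modules. Each module reads and parses the request packet, calls its backends, processes the results they return, and finally packs and sends the response. At this level of abstraction, the steps of request processing are clearly delineated, and any of them may fail and cause the query to be rejected in that module. We therefore enumerated the possible failure causes in this process and built a complete single-module template of fault causes; applying this template to all the modules that must be checked yields the complete set of fault causes.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/df/dfb53de3c1bb02c3802bfa213d682612.png","alt":"圖片","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"heading","attrs":{"align":null,"level":4},"content":[{"type":"text","text":"4.1.4 Feature Engineering","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"After fixing the complete labelset (the rejection causes), we need the program to extract and represent features automatically and map them to rejection causes. Business logs differ greatly between modules, so to solve the extraction problem we implemented a rule extraction engine whose input is the original log text plus extraction rules and whose output is the collected features. There are 2 main types of feature: whether specified content is present, and what a value is. After extracting the features, we use a vector to represent the value of each feature; when certain features in the vector satisfy specified conditions (equal to a value, within a given range, and so on), the corresponding rejection cause is reported.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/b8/b82e5753d0fa2e4534811b3fd66343bb.png","alt":"圖片","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"heading","attrs":{"align":null,"level":4},"content":[{"type":"text","text":"4.1.5 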
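Reconstructing the Scene of a Single Query","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"The logs obtained by the log analysis module are just independent nodes; analysis can start only after the scene of the query has been reconstructed. Starting from the entry module, every module of the search system prints its own span_id and the span_ids of the backends it schedules, and this information is enough to reconstruct the scheduling scene. Note that the initial request and the retries issued by a module are ordered; sorting the span_ids of a node's child nodes recovers this scheduling order. Once the scheduling tree is reconstructed, all abnormal logs on the path from the root node to a leaf node are gathered, all features are extracted from them and compared against the rule list, which gives the rejection cause of that path (call chain).","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/53/5354b8323e1bb076910cf5d502c057bf.png","alt":"圖片","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"heading","attrs":{"align":null,"level":4},"content":[{"type":"text","text":"4.1.6 Intelligent rank Algorithm","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"The difficulty of problem 6 is finding, among the features of a batch of queries across all dimensions, the dimension (or group of dimensions) that shows clear clustering. This can be restated as ranking the dimensions and finding the top-ranked one, where the ranking criterion is how concentrated the values within a dimension are. To solve this we borrowed the concept of entropy: the more the rejected queries cluster on particular values in a dimension, the lower its entropy. When building the ranking model, we transformed the values of different dimensions to make them comparable, and added human experience to set the dimension weights. As a result, when rejections occur, the tool can list in order how strongly the rejected queries cluster in each dimension, helping to localize the root cause or devise a mitigation strategy.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"4.1.7 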
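Timeline Analysis Mechanism","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"To perceive accurately how rejections evolve, we implemented a timeline mechanism. After a rejection alarm is received, the tool automatically fetches the rejection information from 巨浪, counts rejection causes at one-second granularity, and displays them in two dimensions, as shown in the figure below. The display shows the number of each kind of rejection in each second, and how the different rejection causes change over time, which helps us localize the root cause.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/f8/f8c255064c0f5c07901d55e1b63d8957.webp","alt":"圖片","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"heading","attrs":{"align":null,"level":4},"content":[{"type":"text","text":"4.1.8 Chaos Engineering","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"To solve problem 8, we introduced chaos engineering. Chaos engineering provides the ability to inject various faults precisely into online services, which yields rich, labeled samples that can be added to the localization knowledge base. This not only solves the log-sample problem but also improves our ability to predict unknown faults, moving from “mending the fold after the sheep are lost” to “repairing the roof before it rains” and heading off problems before they occur.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"These 8 techniques solve the 8 problems above well. In terms of localization results, accuracy reaches 99%; once rejections occur, module-level rejection causes are produced within seconds; and the analysis capacity can cover large-scale rejections.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"4.2 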
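Batch Analysis of the Long Tail","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"The search system has a long tail of response times. To address it, we built a routine long-tail cause analysis mechanism on top of the full tracing and logging data. The mechanism periodically fetches queries with long-tail response times from the entry module, then calls the full call-chain interface for each query to obtain its complete scheduling tree. To analyze the cause of the long tail, it starts at the entry module and advances towards the backend modules by breadth-first traversal until it finds the last module with an abnormal response time, which is taken to be the cause of the long tail.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"A module's response time is defined as abnormal when it exceeds the limit response time of normal requests while the response times of the modules it calls are normal. Once the abnormal module is identified, its logs can be fetched selectively from the full call chain, and rules applied to the logs identify the stage in which the module spent an abnormal amount of time.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/e3/e359a1fb10b532a9ddff82e6dc792b04.png","alt":"圖片","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"4.3 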
End-to-End Tracking of Abnormal States","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"To keep the user experience stable, the search team regularly analyzes queries that failed to recall the expected results. A query may fail to return the expected results because it hit a cache entry written by a “troublemaker”, or because it passed through the cache and recalled problematic results; the underlying cause may be sporadic or persistent. We need to be able to pick out the problems that can be reproduced reliably and investigate those. To do so, we first obtain the tracing and logging information of each query, from which we can:","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"(1) find which queries hit the dirty cache written by the “troublemaker”;","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"(2) find which queries passed through to the backend and were retrieved afresh.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Associating the queries that hit the cache with the queries that wrote the cache gives the result shown in the figure below: end-to-end tracking of the abnormal state. As long as cache hits are continuous for the duration of the abnormal behavior and multiple cache updates are triggered, the fault can be considered reliably reproducible during that period and is worth assigning people to investigate.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/30/305b103a24eba86b22710fbfd807ad3e.png","alt":"圖片","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"heading","attrs":{"align":null,"level":1},"content":[{"type":"text","text":"Chapter 5 
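Summary","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"This article first described the difficulties of guaranteeing Baidu Search availability: the complex system formed by a very large service scale, extremely frequent changes involving many people, massive data and request volumes, and diverse, ever-changing fault types poses a great challenge to the extremely strict availability target of only 5 minutes of downtime per year. It then described, in chronological order, our experience and lessons in guaranteeing Baidu Search availability.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"First, to eliminate blind spots in troubleshooting, we built the observability foundation: logging, tracing and metrics. This unrefined basic data solved part of the availability problem, but we found that it offered little automation or intelligence; complex problems needed large amounts of manual effort, analysis results depended heavily on human experience, and some problems could not be analyzed at all.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Going further, to improve the efficiency of availability assurance, we upgraded each component of the system so that the output of observability itself became observable, complex result-quality faults went from untraceable to traceable, and rejection analysis went from manual to automatic, accurate and efficient. In building the whole system, we moved from being consumers of data to being its producers and processors. By closing the loop across data production, processing and analysis, we left faults in Baidu Search with nowhere to hide, brought Baidu Search availability assurance out of its predicament, and sustained a good reputation among users. We also hope this article offers readers some inspiration and, even more, sparks resonance and discussion among like-minded people.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Authors of this issue | 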
ZhenZhen;LiDuo;XuZhiMing","attrs":{}}]}]}
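
The entropy-based ranking of section 4.1.6 can be illustrated with a minimal Python sketch. The article does not give the actual value transformation or the dimension weights, so everything here is an assumption made for illustration: the function names `normalized_entropy` and `rank_dimensions`, the query schema, the choice to normalize Shannon entropy by log(N) for N queries, and the score weight × (1 − normalized entropy).

```python
import math
from collections import Counter


def normalized_entropy(values):
    """Shannon entropy of the observed values, scaled into [0, 1].

    Assumption: dividing by log(N), the maximum entropy for N samples,
    stands in for the unspecified transformation that makes dimensions
    with different cardinalities comparable.
    """
    total = len(values)
    if total <= 1:
        return 0.0
    counts = Counter(values)
    entropy = -sum((c / total) * math.log(c / total) for c in counts.values())
    return entropy / math.log(total)


def rank_dimensions(queries, weights=None):
    """Rank dimensions by clustering strength, most concentrated first.

    queries: list of dicts, one per rejected query (hypothetical schema).
    weights: optional per-dimension weights encoding manual experience.
    """
    weights = weights or {}
    dimensions = sorted({key for q in queries for key in q})
    scored = []
    for dim in dimensions:
        values = [q[dim] for q in queries if dim in q]
        # Low entropy means strong clustering, so invert it for the score.
        score = weights.get(dim, 1.0) * (1.0 - normalized_entropy(values))
        scored.append((dim, score))
    return sorted(scored, key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    # Hypothetical rejected queries: one client IP, two data centers,
    # and a different search term for every query.
    rejected = [
        {"client_ip": "10.0.0.1", "idc": "bj" if i % 2 else "gz", "word": f"w{i}"}
        for i in range(8)
    ]
    for dim, score in rank_dimensions(rejected):
        print(f"{dim}: {score:.3f}")
```

In this toy data every rejection comes from the same client IP, so that dimension ranks first, which mirrors the article's example of blocking an attacking IP. A production model would presumably also handle high-cardinality dimensions and tuned weights, which this sketch does not attempt.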