Deconstructing Service Risk Governance: Practice at 好大夫在線 (Haodf Online)

{"type":"doc","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"In early 2019, Haodf (好大夫) went through several severe production incidents and faced a crisis in its middleware and service governance. Many of the incidents were caused by non-standard SQL and slow interfaces in business systems, and the worst ones cascaded until the whole site was briefly unavailable. This had to change immediately and thoroughly. After painful reflection, the System Architecture department launched the “DOA” (Dead or Alive) project, which first fixed the infrastructure to improve the stability and availability of the middleware. Right after that it launched a service risk governance project to identify service risks such as slow interfaces, non-standard SQL, and unreasonable dependencies."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"After everyone pushed through these two large projects, the stability of the whole site improved substantially. With more than two years of experience behind us, I would like to report what we learned while doing service risk governance, and I hope it offers you some inspiration."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Speaking of risk, the first thing that comes to my mind is awareness. Everyone perceives risk differently, and that perception falls roughly into the following four categories:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"numberedlist","attrs":{"start":null,"normalizeStart":1},"content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":1,"align":null,"origin":null},"content":[{"type":"text","text":"I am aware that I know;"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":2,"align":null,"origin":null},"content":[{"type":"text","text":"I am aware that I don't know;"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":3,"align":null,"origin":null},"content":[{"type":"text","text":"I am not aware that I know;"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":4,"align":null,"origin":null},"content":[{"type":"text","text":"I am not aware that I don't know."}]}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"In our daily work we collected a lot of feedback from development engineers, and what impressed me most was “I am not aware that I don't know”. The SRE team has been exploring service risk governance for almost two years and is now rolling out a new iteration. I would like to take this opportunity to talk about service risk governance in depth and widen the boundaries of what we each know."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"This talk has three parts:"}]},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Exploration: sort out the known and unknown risks that development engineers run into, and introduce the concepts and terms of service risk governance;"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Venture: how we identify, quantify, and track service risks, and how this is integrated into the platform;"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Adventure: an on-site workshop analyzing service risk tasks in practice."}]}]}]},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"Exploration"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"You may have had questions like these:"}]},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"What exactly is the deadly p99 line, and what do its relatives p50, p75, and p95 tell us?"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"My service's average response time is 30 ms, so it should be healthy. Why is it always the one that gets hurt when crawlers hit the site?"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"What do the commonly used terms mean: high-level and low-level services, upstream and downstream services, calls in a loop, bidirectional dependencies, slow interfaces, slow SQL, and so on?"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null
},"content":[{"type":"text","text":"What factors actually affect a service's health?"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Which metrics measure service health, and how were they chosen?"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Service health sounds like a concern for high-throughput services. My service only handles a few dozen QPM, so why should I care?"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Slow interfaces in scheduled jobs and asynchronous consumers don't affect users, so they don't count as risks, right?"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Will fluctuations caused by DB jitter generate risk tasks?"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"I only care about my own service's health. Can I just push the aggregation logic to the front end?"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"..."}]}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Let's see how we handle these questions."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"The ultimate goal of service risk governance is service health. Service health is a complex system with many influencing factors, but we need to grab the biggest risk at the current stage. After repeated comparison we chose to start with latency risk, which also serves the company's goal of having every page on the site open within a second. Guided by MDD (Metrics-Driven Development), we define an SLI, set an SLO, and then identify and resolve risks around that SLO. We chose interface latency at p99 as the SLI and set the SLO as follows: p99 < 100 ms for back-end services and p99 < 600 ms for front-end services."}]},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"Latency"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"People have asked why we use p99 rather than average latency, so let me explain again."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" "}]},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/08\/0803e8f66647c4faef84b63a3ebed94c.webp","alt":"圖片","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/82\/824c78d07737c2eed2572fd9c7644684.webp","alt":"圖片","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Two kinds of distribution are common in real life: the normal distribution and the power-law distribution. Two typical examples: the height of adult Chinese men follows a normal distribution, whereas programmers' income follows a power-law distribution. What kind of data has which property?"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"In general, data bounded by natural limits mostly follows a normal distribution. Human height and weight, for example, cannot be arbitrarily large or small."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Some will say that personal wealth also has an upper limit, so why is it not normally distributed? Because wealth concentrates, the gap between the head and the tail grows too wide. After a series of such concentrations wealth becomes more and more polarized and evolves into a power-law distribution. If that is still hard to follow, research suggests that power-law distributions generally arise from chain reactions; see 《失效的科學》 for details."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"In a long-tailed distribution such as a power law, the mean does not describe the data well: it says little about how bad the tail is."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Service latency is long-tailed, which is why we take p99 as the SLI. The closer p50, p75, p95, and p99 are to each other, the more stable the service. The smaller the p99, the better the service withstands load, in other words the more resilient it is. p99 is a remarkable metric that we will meet again and again."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"That should make clear why we chose p99. Next we dig for the risk points in our services with the goal of lowering p99."}]},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"Looking for risks"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"What factors affect interface latency?"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"After a long search we finally caught hold of the tail: "},{"type":"text","marks":[{"type":"italic"}],"text":"dependencies"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph"
,"attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"By monitoring the latency of a service's dependencies we can trace problems back to their source and resolve a large share of high-latency service risks. Why do we say that?"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"In a microservice architecture, risk points can hide between services, between a service and its middleware, and between a service and third-party interfaces. Once the latency of these dependencies is monitored well and we can take the pulse properly, service risk governance is half done."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Here is how we take the pulse."}]},{"type":"heading","attrs":{"align":null,"level":4},"content":[{"type":"text","text":"First taboo: unreasonable dependencies between services"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"First we need a few basic concepts: service levels, high level and low level, upstream and downstream."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/9e\/9ef21aaa6e4a7421566b0edfd367ad9e.webp","alt":"圖片","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"For the layering model we follow Clean Architecture (《架構整潔之道》): the closer a circle is to the center, the higher its level."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/19\/195589b63a195c0d0cd7a27cbbe1d836.webp","alt":"圖片","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"This is a diagram of component dependencies in which the Translate component has the highest level. Our services follow the same model."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/83\/83cf2c73ace82702b98a601a49e16091.webp","alt":"圖片","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"We define high level and low level as follows: the closer to the user side, the lower the level, and the farther from the user side, the higher the level. Put differently, the closer to the inputs and outputs, the lower the level."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Upstream and downstream follow the direction in which data flows back: from upstream to downstream, from high level to low level."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Low-level services should depend on high-level ones, and downstream services on upstream ones. Avoid unreasonable dependencies that become risk points, such as bidirectional and circular dependencies."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"So why do unreasonable dependencies affect latency?"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/f2\/f24dc5f9240f2bb3206c7bf1355480f4.webp","alt":"圖片","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Suppose there is a circular dependency B->E->C->B. Jitter in E raises the load on C, which may in turn raise the load on B, which then feeds back to E. Locating the problem at that point is very hard: all three services are alerting, the whole call chain is timing out, and recovery is painful. We should keep dependencies reasonable so that this situation cannot arise, rather than let a known risk turn into a time bomb."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"
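paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Another risk is calling a dependency in a loop. We use HTTP between services, so the network cost is relatively high: if one request takes 50 ms, ten calls in a loop take 500 ms, and the loop becomes a latency killer. Splitting services ever finer is not always better. Define service boundaries well and cut unnecessary dependencies between services. Monitoring dependencies between services relies on call-chain analysis. We can go into how that is implemented another time."}]},{"type":"heading","attrs":{"align":null,"level":4},"content":[{"type":"text","text":"Second taboo: assuming middleware is 100% available"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Many development engineers treat middleware as a black box. They either blindly assume it is 100% highly available or believe that middleware failures have nothing to do with them. Whether middleware is used properly, and whether that usage carries risk, is a question that tends to be overlooked. On top of that, frameworks hide the middleware details, which makes the risks even harder to notice. Let's set aside whether the choice of middleware is reasonable, assume that the middleware we depend on is appropriate, and analyze middleware latency."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"A middleware call generally has two phases, connecting and executing. Because our frameworks are heterogeneous, some use long-lived connections from a connection pool and others use short-lived connections. Network connections are a resource too, and a consumable one. High latency causes queuing and can even trigger an avalanche. Of course the middleware itself should have overload protection to prevent avalanches. But does that mean the service side can do nothing?"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"We should be wary of that way of thinking. At the very least we should pay attention to high-latency events. Our databases and Redis instances are now isolated per service, so their latency shows up directly in the user's request chain. If connecting takes more than 1 s, resources are probably insufficient for the traffic. If execution is slow, the usage is most likely wrong, and the main suspect is slow SQL."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"We have also distilled a number of risk tasks for different events."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"First, connection time and retry count. This matters most with short-lived connections, where frequent connecting brings large overhead. We use connection time as the metric."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Next, the slow SQL risk task. We use execution time as the metric, and statements that run longer than 1 s need close attention."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Then, most of our business scenarios are built on MySQL. Deep pagination, oversized result sets, LIKE clauses, or missing WHERE conditions are very likely to cause excessive memory use in the service and slow execution."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"There are also cache and lock dependencies. Most services use Redis for fragment caching and shared locks. Lock acquisition timeouts, cache penetration, and similar problems can drag the database down, so we need to watch the cache hit rate and the latency of Redis calls."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"For RabbitMQ consumers, mind the prefetch count: if messages are large or too many are fetched at once, the consumer can run out of memory (OOM). The heartbeat between our PHP framework and RabbitMQ is 60 s, so a PHP consumer's processing time must not exceed two heartbeat periods, that is 120 s."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"There are many details here that we will not expand on today. In short, a service must not rely on middleware being 100% available. It has to consider the possibility of failure and know a few techniques for handling it. Widen the boundaries of what you know and use middleware the right way."}]},{"type":"heading","attrs":{"align":null,"level":4},"content":[{"type":"text","text":"Third taboo: “the third party's fault is not mine to carry”"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Now for third-party dependencies, a type of problem I often run into in day-to-day troubleshooting. As soon as a third party is involved, the complaints start: it's not my fault. Business requirements inevitably involve third parties, through SDKs, HTTP, and so on. Common cases include a user request waiting synchronously for a short URL to be generated, calling a third-party speech-to-text service, calling Tencent APIs, calling SMS and telephony carrier services, and pushing through the iOS\/Umeng SDKs. The reports we hear most are that an MQ consumer has hung and stopped working, that a script has been running for more than two days, or that user requests are returning 499 in large numbers."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Most of these come from trusting the third party too much, from not realizing that the third party can fail, or from code written with only the feature in mind and no handling of failure cases. The most common remedy is to configure timeouts. With a thread pool or long-lived connections you also need a heartbeat keep-alive mechanism. Another way to make a third-party dependency more available is redundancy with disaster-recovery switchover. As long as you are aware of this and monitor a few key metrics such as latency and success rate, most of these problems can be avoided."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"o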
rigin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Handle these dependency risk points well and the overall stability of the service improves. For other aspects of service high availability, see the figure below. We can go into them when time allows."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/ed\/ed1124542bf8b5c3663bea830c686174.webp","alt":"圖片","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"Venture"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"So far we have covered how we analyze dependencies and derive risk tasks from them, and you now have a sense of service risk. Next, here is how we built the service risk governance platform. The platform as a whole is based on call-chain log analysis and integrates risk notifications, the DBA team's slow SQL optimization suggestions, and data visualization profiles."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/66\/6632f0b119b4090122b98fdaa8158eab.webp","alt":"圖片","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Below is a brief look at some of the problems we met while designing the platform."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"How do we maximize the return?"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/b5\/b5d083c9cf8808135be84e8011bdb713.webp","alt":"圖片","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"n
umber":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Let's look at a model first. Everyone's tolerance for risk is different. Some development engineers will say that a p99 of around 200 ms is perfectly healthy. So we need to provide a standard to measure against."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Other development engineers weigh the return on an optimization: some tasks are cheap to fix but yield little, and some are expensive but yield a lot. So risk tasks need to be graded, letting development engineers focus on quality rather than quantity and capture the largest return. Given where we are, interface latency is currently our biggest pain point."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"How do we identify latency risk accurately? To identify latency risk and set an SLO that maximizes the return, we ran many experiments and drew on the experience of other companies in the industry, then defined the criterion by combining p99 with the QPM of slow calls. The end goal is p99 below 100 ms for back-end services and below 600 ms for front-end services. In the expression below, x is the p99 threshold and y is the slow-call count threshold."}]},{"type":"codeblock","attrs":{"lang":null},"content":[{"type":"text","text":"sum(appslow_count>y) by(appname,method) and sum(appslow_p99>x) by (appname,method)\n"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"How do we help development engineers focus on what matters? With the SLO in place, once a latency risk is identified we assign the task a priority based on latency and traffic. The platform lets development engineers plan the optimization in stages, which makes tasks easy to track. When a service is slow because an upstream service it depends on is slow, its owner can throw a rotten egg at that upstream service to urge its owner to optimize. Every quarter we tally which services received the most rotten eggs and invite their developers to review the results together. Interfaces that are high risk or have collected many rotten eggs should therefore be optimized first."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"How do we optimize a latency risk task? Latency risks share recurring traits: some involve calls in a loop, some involve slow SQL, some involve unreasonable dependencies, and so on. We tag each risk task according to these traits and give optimization advice for each tag. For calls in a loop, the platform lists several concrete samples and links to APM trace analysis so you can go straight to the scene. For slow SQL, it connects to the DBA team's SQL optimization engine and returns optimization suggestions."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"How do we visualize the data? Risk tasks usually take a long time to optimize, so service health has to be viewed over a longer time span as well. Different roles care about different dimensions, and so do different scenarios. The platform therefore needs OLAP-style query capabilities: roll-up and drill-down, aggregation along different dimensions, and visualization of the results."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"By business unit:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/86\/86b6978d6c4ffa7b4268217f0bd28081.webp","alt":"圖片","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"By service:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/7c\/7cebf35d33b7c6e9baac6a610f888ca1.webp","alt":"圖片","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"Debt ratio trend:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/2e\/2e51af064b32a2987aed39fe36d4ffdc.webp","alt":"圖片","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"Adventure"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Finally, let's see in practice how the service risk governance platform works."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"Task list:"}]},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/05\/05c8df6040446169ca30bb5037a2cd40.webp","alt":"圖片","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"First, the service's latency lines are visible at a glance: p50, p75, p95, and p99. The closer the four lines are, the more stable the service. There is a design trick here: the data has to be downsampled over time so that a full year's trend can be viewed. For the most recent 30 minutes we support real-time aggregation at one-second granularity by querying the raw logs stored in ClickHouse directly. The logs are also converted into metrics once a minute and stored in a database that uses the GraphiteMergeTree engine. GraphiteMergeTree supports rollup rules: data up to 7 days old is rolled up to 1-hour averages, and data older than 7 days to 1-day averages."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"The task list can be filtered and aggregated by dimension and by tag. It is sorted by priority by default so that development engineers can pick up the risk tasks with the highest return, and tasks whose plan is about to expire are highlighted."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"Task details:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/7a\/7a8f61a3e3c9ce43706a28c7d163f163.webp","alt":"圖片","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"h
ref":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"We provide a detailed profile of the interface and highlight slow SQL where present. The page offers optimization suggestions and, through the APM trace entry point, leads to the scene of the problem."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/b1\/b12a757a849a1938c49a825b4db9edce.webp","alt":"圖片","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"For slow SQL, we extract the SQL fingerprint and pass it to the DBA team's SQL optimization and analysis engine, which returns optimization suggestions."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/bc\/bcf111149a1e005b4e135e54a7a1ddf4.webp","alt":"圖片","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"The hands-on work involves many details and has to be considered from the system, middleware, code, and even requirements levels, so we will not expand on it this time. We will cover it separately later, so follow along if you are interested."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"In short, optimizing risk tasks is a long-term effort. We need to make plans and send reminders so that development engineers can put this work into their OKRs ahead of time and deliver it. The platform also produces summary reports of optimization work, including weekly and quarterly reports, which are sent directly to the business and technical leads of the relevant business units."}]},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"Summary"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Following the MDD approach and taking metrics as the guide, this talk analyzed the service risk model in depth and described a general pattern for service risk governance: lower service latency and avoid risk. Step by step we explored how to turn unknown unknowns into known unknowns and finally into known knowns. I hope this helps in your daily work, and you are welcome to exchange ideas and learn together."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"About the author:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"方勇 (Fang Yong): senior engineer in the Infrastructure Architecture department at Haodf (好大夫), focusing on SRE and on the stability and availability of microservices and middleware, with overall responsibility for the design and build-out of Haodf's service governance cloud platform."}]}]}
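The article's reason for choosing p99 over average latency as the SLI can be illustrated with a small sketch. Everything here is an assumption for illustration only: the sample latencies are invented, the function names (`percentile`, `latency_profile`, `violates_slo`) are hypothetical, and the nearest-rank percentile method is a simple choice that is not necessarily what the platform uses. The only value taken from the article is the 100 ms back-end p99 target.

```python
import statistics


def percentile(samples, p):
    """Nearest-rank percentile for an integer p in 1..100.

    Returns the smallest sample such that at least p% of all samples
    are less than or equal to it.
    """
    if not samples:
        raise ValueError("samples must not be empty")
    ordered = sorted(samples)
    # Integer ceiling of p * n / 100, which avoids floating-point rounding.
    rank = (p * len(ordered) + 99) // 100
    return ordered[max(rank, 1) - 1]


def latency_profile(samples_ms):
    """Summarize latencies with the mean and the four percentile lines."""
    return {
        "mean": statistics.fmean(samples_ms),
        "p50": percentile(samples_ms, 50),
        "p75": percentile(samples_ms, 75),
        "p95": percentile(samples_ms, 95),
        "p99": percentile(samples_ms, 99),
    }


def violates_slo(samples_ms, p99_limit_ms=100):
    """True when p99 is not below the limit (the SLO is p99 < limit)."""
    return percentile(samples_ms, 99) >= p99_limit_ms


# Invented data: 97% of requests take 20 ms, 3% take 1500 ms.
samples = [20] * 970 + [1500] * 30
profile = latency_profile(samples)

print(profile)                # mean is 64.4 ms, p50/p75/p95 are 20, p99 is 1500
print(violates_slo(samples))  # True: the tail breaks the 100 ms target
```

With this invented data the mean sits comfortably under 100 ms and the p50, p75, and p95 lines coincide, yet p99 exposes the slow tail. That is the situation the article describes when a service with a healthy-looking average is still the one that suffers under crawler traffic.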