Stability Governance Practice for Domestic Hotels: Inter-System Dependency Governance

{"type":"doc","content":[
{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"Background"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"We previously described our cache governance practice; for details, see: "},{"type":"link","attrs":{"href":"http:\/\/mp.weixin.qq.com\/s?__biz=MzA3NDcyMTQyNQ==&mid=2649263379&idx=1&sn=d08b1f92781236afb7429bf1d073cf85&chksm=87675cedb010d5fb69a18f2f0219887a53c97006db41f431004a35a52feeaec233bc835e8990&scene=21#wechat_redirect","title":null,"type":null},"content":[{"type":"text","text":"Stability Governance Practice for Domestic Hotels: Cache Governance"}]},{"type":"text","text":". We did not stop after finishing cache governance."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Our applications also depend on many external components and interfaces, and expose interfaces of their own. Any of these dependencies can fail, and in some scenarios a failure can have a large impact. So after cache governance, we started stability governance with broader coverage."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"This article focuses on governing inter-system dependencies: the external components a system depends on, the external interfaces it calls, and the interfaces it exposes, such as Dubbo, HTTP, DB, and MQ."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/e6\/e683f827b4591b9de8fb53835344c952.png","alt":"Image","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"Governance Plan"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"Service Grading and Dependency Governance"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"1) Grade each application by how core its business is and how large its impact is (P1, P2, P3, with P1 > P2 > P3 in importance), and sort out the grade of each dependency."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"2) Deploy P1 applications across multiple data centers, and ensure that under normal conditions no single data center hosts more than half of any P1 application's online machines. After this adjustment, a network failure or an individual component failure in one data center has a noticeably smaller impact on core business."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"3) Weaken strong dependencies so that they can be degraded; make weak dependencies asynchronous so that they can be circuit-broken; and remove dependencies that are not necessary. Where a P1 application interface calls a P3 application interface, we add exception handling around the call in advance and support circuit breaking; then, when an online drill takes all online machines of the P3 applications out of service, the P1 application interfaces remain unaffected. Where a P1 application interface calls another P1 application interface, we prepare a degradation plan in advance and use multiple channels, multiple replicas, and similar means so that the caller can recover quickly. Only by assessing the failure impact of these strong and weak dependencies ahead of time and preparing countermeasures can we reduce the number of failures and their impact."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/52\/52d2edbf0d727a7fdf20b647969f0644.png","alt":"Image","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"Rate Limiting"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"This mainly addresses scenarios where traffic surges and the application may not be able to cope. We chose to adopt a wrapped Sentinel component across the board, mainly using the following features:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"1) Application-level, per-instance dynamic rate limiting for Dubbo: limits for all Dubbo interfaces can be configured dynamically."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"2) Application-level, per-instance dynamic rate limiting for HTTP: limits for all HTTP interfaces can be configured dynamically."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"3) Rate limiting at business instrumentation points: per-instance (or cluster-wide) limits based on specific parameter values. For core interfaces, an overall interface-level limit is not enough; we also need to limit by particular parameters. A typical example is the hotel pricing interface, where a parameter identifies whether a request comes from app or pc. Requests from app currently far outnumber those from pc, so when pc traffic rises abnormally we can add a limit on pc requests alone without affecting app requests."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"4) Application-level, cluster-wide dynamic rate limiting for interfaces: from a system protection standpoint this is not essential and is useful only in certain special scenarios; at present we mainly use the first three."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Sentinel lets us cap traffic directly at the interface level. If traffic of an abnormal magnitude does arrive, we can configure rules to reject a portion of the requests and keep serving as much as the cluster's current capacity allows. That way abnormal traffic does not overwhelm every machine in the application's cluster, leaving the application unable to serve or with sharply reduced capacity."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/cf\/cfc7683afca9c80a68f46b9917f0cbcd.webp","alt":"Image","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Note that rate limiting inherently affects request handling and may hurt user experience, so we prefer not to apply it while the system can still cope. Application traffic can rise sharply during certain holidays or promotions; this is legitimate user traffic, and what we want then is for the application to handle it normally. So besides preparing rate-limiting measures in advance, we also need to forecast traffic, run load tests, and assess whether the application needs to scale out and by how many machines."}]},
{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"Dubbo Governance"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"This mainly addresses scenarios where the Dubbo thread pool is exhausted or downstream Dubbo services time out."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"1) Dubbo thread pool monitoring: this is easily overlooked. It helps avoid a situation where we only think of adding machines or threads once the Dubbo thread pool has already run short in production, and it also supports requirement assessment. On the provider side, business threads share a single thread pool by default, so every added interface consumes resources from that default pool. Qunar's internal Dubbo thread pools have been adjusted somewhat; the consumer's business thread pool is also a single shared pool."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"2) Dubbo thread pool isolation: core interfaces can be isolated, and so can individual non-core interfaces. The main goal is to keep a misbehaving non-core interface from exhausting the Dubbo thread pool and degrading the core interfaces. A real example: a core application added a non-core Dubbo interface with high volume and long response times. It kept grabbing the provider's thread pool resources until the application's core Dubbo interfaces could not obtain connections. Moving that non-core Dubbo interface to its own thread pool stops it from affecting the core interfaces."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"3) Sensible Dubbo timeout configuration: set timeouts according to the interface's normal response time. The consumer-side timeout must always be configured and must not exceed the provider-side timeout."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/6d\/6d2e61d741ad8e186dcae4b644f05d4f.webp","alt":"Image","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"HTTP Governance"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"This mainly targets scenarios where calls to external HTTP interfaces produce large numbers of timeouts and exceptions."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"1) Timeout configuration check: configure a reasonable timeout based on the interface's actual behavior (somewhat above P99), and support changing timeout-related parameters dynamically. Timeouts are currently the most common source of application exceptions in our team; an overly long timeout tends to keep the main thread occupied and eventually drags the service down."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"2) Asynchronous calls: asynchronous invocation is recommended, with support for switching between synchronous and asynchronous modes."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"3) Retry support: the business side checks whether retries are possible and needed. Retries are not recommended for interfaces executed synchronously (they easily time out). In practice many calls succeed after 1 to 2 retries, but confirm with the downstream team that retrying is acceptable."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"4) Proper isolation: for asynchronous calls, isolate thread pools and clients so that calls to different interfaces do not affect one another."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/84\/845350abe47f6333eafd98b37a1d3a60.jpeg","alt":"Image","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"DB Governance"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"DB governance closely resembles the earlier cache governance. The focus is on fast recovery when a component fails, and on multiple replicas and shards for core data storage."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"1) High availability: store data in multiple replicas (primary/replica, DB storage plus Redis cache), and prepare fast recovery and degradation, to reduce the impact of slow queries, DB outages, and the like on service capacity."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"2) Clean up useless stored data; for example, domestic hotel services drop cached international hotel data."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"3) Monitor the average size of the data set returned per request (alert when it exceeds a threshold), implemented in an aspect through a MyBatis Interceptor."}]},
{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"MQ Governance"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"This mainly addresses scenarios where a single MQ cannot send messages normally or consumption is severely backlogged. For example, a zk (ZooKeeper) failure in one data center once caused sends to fail for some MQ topics."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"First confirm whether the MQ can switch quickly to a healthy channel, for example by excluding the failed data center from message routing or by moving messages to a healthy data center."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"For especially critical scenarios, prepare multiple channels: send messages on different topics or use two different MQ components. In that case consumption must be idempotent. With this setup, losing one cluster does not affect core business."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/a6\/a64746c91c99b39c75e603108d9b55e0.png","alt":"Image","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"Other"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"1) Check and improve monitoring: QPS, latency, and similar metrics for successful and failed operations on Dubbo interfaces, HTTP interfaces, DB-related MyBatis methods, and so on."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"2) Add appcode-level dashboards for each component and call: convenient for routine inspection and for quick checks during an incident."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"3) Check and correct timeout and similar settings: the connect timeout should generally be short, while the socket timeout can be somewhat longer depending on the business. In practice many people copy a demo or code from elsewhere and overlook whether these values are reasonable."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"Governance Process"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"The overall process closely follows the earlier cache governance: “identify scenarios → decide on a plan → develop and self-test → test → release → online drills and improvement”. There are differences too: this time, each application was released to production in two or more rounds:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"1) First add the rate-limiting component and the related component monitoring: this guards against abnormal traffic first and prepares the monitoring data for later governance."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"2) Based on the improved monitoring metrics, optimize parameters, workflows, and so on in a targeted way."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Once all this was done, we continued with online drills. Where results did not match expectations, we revised the plan, optimized further, and drilled again; where they did, we consolidated the approach into a general governance plan that serves as the standard for other applications and new dependencies."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/dd\/ddc7d97616a6c52a55cc9ad3556d5dcd.webp","alt":"Image","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"Summary"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"After every incident we hold an incident review and propose improvements. Looking back closely at these improvements, many could have been put in place before the incident, and governance could have reduced the impact or even prevented the incident altogether. That is also our starting point for stability governance: prepare solutions and procedures for common failure scenarios in advance, and reduce the number, duration, and impact of incidents as far as possible."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Inter-system dependency governance is not a one-off effort; so far we have only governed the dependencies that already exist. In the future we plan to tag these dependencies automatically and report them to a dedicated service governance (or similar) system, where countermeasures for failures are prepared in advance. This would let us both detect new dependencies dynamically and respond quickly and effectively during an incident."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"After finishing inter-system dependency governance, we went on to govern resources inside the system, mainly through degradation, circuit breaking, and isolation. We will cover that in the next article."}]},{"type":"horizontalrule"},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Header image: Unsplash"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Author: 鄭吉敏"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Original article: https:\/\/mp.weixin.qq.com\/s\/sWxbyD2srai35PARKk6ugw"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Original title: Stability Governance Practice for Domestic Hotels: Inter-System Dependency Governance"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Source: Qunar技術沙龍 - WeChat official account [ID:QunarTL]"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Reprint: Copyright belongs to the author. For commercial reprints, contact the author for authorization; for non-commercial reprints, credit the source."}]}]}
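As an illustration of the per-parameter rate limiting described for the hotel pricing interface (limiting pc requests without touching app requests), here is a minimal sketch in plain Java. It is not the wrapped Sentinel component used in the article and does not use Sentinel's API; the class `SourceRateLimiter`, its method `tryAcquire`, and the fixed one-second window are all assumptions made for this example.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.LongSupplier;

// Hypothetical sketch, not Sentinel's API: a fixed one-second window per source value.
class SourceRateLimiter {
    // Immutable snapshot of one source's counter for a given second.
    private static final class Window {
        final long second;
        final int count;

        Window(long second, int count) {
            this.second = second;
            this.count = count;
        }
    }

    // Requests allowed per second for each source; a source without an entry is not limited.
    private final Map<String, Integer> limitsPerSecond;
    private final Map<String, Window> windows = new ConcurrentHashMap<>();
    // Injectable clock (milliseconds) so the sketch can be exercised deterministically.
    private final LongSupplier clockMillis;

    SourceRateLimiter(Map<String, Integer> limitsPerSecond, LongSupplier clockMillis) {
        this.limitsPerSecond = new HashMap<>(limitsPerSecond);
        this.clockMillis = clockMillis;
    }

    // Returns true if the request may proceed, false if it should be rejected.
    boolean tryAcquire(String source) {
        Integer limit = limitsPerSecond.get(source);
        if (limit == null) {
            return true;
        }
        long nowSecond = clockMillis.getAsLong() / 1000;
        boolean[] allowed = {false};
        // compute() is atomic per key, so concurrent callers cannot exceed the limit.
        windows.compute(source, (key, w) -> {
            int used = (w == null || w.second != nowSecond) ? 0 : w.count;
            if (used < limit) {
                allowed[0] = true;
                return new Window(nowSecond, used + 1);
            }
            return new Window(nowSecond, used);
        });
        return allowed[0];
    }
}

class SourceRateLimiterDemo {
    public static void main(String[] args) {
        // Assumed rule for the demo: at most 2 pc requests per second, app unlimited.
        SourceRateLimiter limiter = new SourceRateLimiter(
                Collections.singletonMap("pc", 2), System::currentTimeMillis);
        for (int i = 0; i < 4; i++) {
            System.out.println("pc request " + i + " allowed: " + limiter.tryAcquire("pc"));
        }
        System.out.println("app request allowed: " + limiter.tryAcquire("app"));
    }
}
```

Only sources with a configured limit are counted, so a rule added for pc leaves app requests untouched, which is the behavior the article asks for. A fixed window admits short bursts at window boundaries; a production limiter would more likely use a sliding window or token bucket.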