Stability Governance Practice for Domestic Hotels: Cache Governance

{"type":"doc","content":[{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"Background"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"We hit several cache-related incidents in a row:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"1. A DBA operations mistake wiped the core base data we stored in redis. Because we could not serve quotes, an ATP incident (a sharp drop in order volume) followed. A scheduled task then took half an hour to write the data back to redis, after which the incident was resolved."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"2. Crawler traffic from the PC site reached the backend and exhausted the application's redis connection pool. A large number of synchronous redis requests each waited 500ms to obtain a connection, which in turn exhausted the application's tomcat thread pool. The service hung and could not serve PC traffic, even though the redis server itself was under no pressure at all at the time."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"There were quite a few other cache-related incidents, which we will not list one by one. While reviewing them, we realized that many core scenarios used redis as a core dependency and data store, yet in those scenarios we had done nothing to prevent or handle the problems redis might run into."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Because our core business depends heavily on redis, we ran a dedicated cache governance project, both to keep similar incidents from recurring and to be prepared before the next one."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"Governance plan"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"1. High-availability governance: this is the most important item, but it has nothing to do with the high-availability deployment of redis itself. The starting point is that the availability of the business should not depend entirely on the components it uses: a component failure should not mean a business failure."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"First, we recommend fast recovery. This approach usually suits scenarios backed by a base data set: a scheduled task or a manually triggered interface rebuilds the cached data within a short time, restoring hot data first. By 'a short time' we mean within 2 min for scenarios that can affect ATP, and within 10 min for scenarios that affect user experience."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/40\/4014442e63b1cad8e5fe3c7ddc96f507.webp","alt":"Image","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Second, consider keeping multiple copies of core data. Core business scenarios can cache data in redis clusters under different namespaces, or in more than one cache component (redis+tair, redis+memory), so that when one fails we can recover quickly by switching to the backup cache."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/f3\/f3144b7ea6ef750cc987d58062ba950a.webp","alt":"Image","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Next, consider manual degradation: bypass the cache, or switch to another channel. Prefer lossless degradation, and fall back to lossy degradation only when necessary."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/e8\/e881e134510099b8bf148b2b8da866e2.webp","alt":"Image","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Finally, we also found cases where 'application A writes and applications B and C read'. Here the upstream and downstream teams need to agree on the final plan together. We recommend fast recovery for these cases, and the consumers can make additional preparations as needed."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"2. Parameter tuning: this mainly means tuning the timeouts and thread counts in the redis client configuration."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Monitoring showed that in the vast majority of scenarios, reading or updating the cache through redis takes only a few milliseconds, connection time included. In practice, many usages neglected to set these parameters sensibly, and many simply carried a timeout of several hundred milliseconds copied from some example written years ago."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"To address this, we required every redis configuration in every application to be checked and set to reasonable values."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"3. Some additional governance details"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"1) Replace memcache with redis: we will not go into the advantages of redis over memcache; the main point is to hand all cache operations over to the company's DBAs."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"2) Unify the configuration file format: many systems currently have many configuration files in production, which makes them hard to find. During an incident we need to locate the relevant configuration quickly."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"3) Improve monitoring: make sure the call volume and latency of every business scenario's redis usage, including the volume and latency of failed calls, can be found in the monitoring system."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"Governance process"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"1. Inventory the caches used by core scenarios, mainly the core scenarios that involve caching, the impact of a failure, and the data volume."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"2. Settle the overall governance plan. Start from a rough plan, then have the owners of each system in the team review the details together and fold anything that was missed back into the overall plan. This can run in parallel with the inventory; the final plan is fixed once the inventory is complete."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"3. Review the governance details of each scenario. Developers, application owners, and QA owners review each scenario's details together and agree on explicit standards."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"4. Develop and self-test according to the plan agreed for each scenario. If a problem with the plan surfaces along the way, discuss and revise it, then proceed with the new plan."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"5. During development, write up the handling plan for each failure scenario and an emergency runbook per application."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"6. Hand over to QA and test. Developers walk QA through the governed scenarios and plans once more; QA verifies against the runbook and runs drills in the beta environment."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"7. Release and build monitoring dashboards. After the code is released, build per-application dashboards from the relevant metrics so they are quick to check during routine drills and incidents."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"8. Drill in production. During off-peak hours, verify the adjustments in the emergency runbook in production, fix whatever falls short, and schedule further drills until the results meet expectations."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"Results and summary"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"So far, most of the team's P1 systems have completed cache governance and drills, at a cost of more than 60 person-days. Along the way, the developers involved studied many redis details in depth and deepened their understanding of redis."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"The inventory at the start of the project deepened the team's understanding of its systems, and the resulting wiki is very helpful to other colleagues and newcomers."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"The monitoring dashboards produced by the project help a great deal with routine inspection and with locating problems quickly during incidents."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"The emergency runbooks produced by the project can greatly shorten the duration of a real incident."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"One case deserves special mention. Recently, hotel image data in the base data team's DB was accidentally overwritten with bad data, which indirectly caused bad data to be written into our redis as well (this data is populated by user-triggered requests). Once the DB data was restored, we used the degradation mechanism we had prepared and flipped a switch to fetch image data from the base service's DB through a dubbo interface. The incident recovered within 1 min of that operation."}]},{"type":"horizontalrule"},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Header image: Unsplash"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Author: 鄭吉敏"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Original: https:\/\/mp.weixin.qq.com\/s\/eMHsutVlfX7CVGx4hjdCQg"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Original title: 國內酒店穩定性治理實踐之緩存治理"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Source: Qunar技術沙龍 - WeChat official account [ID: QunarTL]"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Reprint: Copyright belongs to the author. For commercial reprints, contact the author for authorization; for non-commercial reprints, credit the source."}]}]}
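The multi-copy and manual-degradation ideas in the governance plan can be illustrated with a minimal sketch. Everything here is an assumption made for illustration, not the article's implementation: the class name `DegradableReader`, the `bypass_cache` and `use_backup` switches, and the use of plain dicts as stand-ins for redis clusters and for the dubbo-backed source of truth are all hypothetical.

```python
from typing import Callable, MutableMapping, Optional


class DegradableReader:
    """Read-through cache access with a backup cache and a manual bypass switch.

    All names are hypothetical; plain dicts stand in for redis clusters.
    """

    def __init__(
        self,
        primary: MutableMapping[str, str],
        backup: MutableMapping[str, str],
        load_from_source: Callable[[str], Optional[str]],
    ) -> None:
        self.primary = primary
        self.backup = backup
        self.load_from_source = load_from_source
        self.bypass_cache = False  # manual degradation: skip caches entirely
        self.use_backup = False    # manual failover: read the backup copy

    def get(self, key: str) -> Optional[str]:
        if self.bypass_cache:
            # Degraded mode: go straight to the source of truth.
            return self.load_from_source(key)
        cache = self.backup if self.use_backup else self.primary
        try:
            value = cache.get(key)
        except ConnectionError:
            # A failing cache should not fail the business call.
            return self.load_from_source(key)
        if value is None:
            value = self.load_from_source(key)
            if value is not None:
                cache[key] = value  # repopulate on a miss
        return value


# Usage: the cached copy is dirty, so the operator flips the switch.
source = {"hotel:1:image": "clean.jpg"}
reader = DegradableReader({"hotel:1:image": "dirty.jpg"}, {}, source.get)
print(reader.get("hotel:1:image"))  # served from the (dirty) primary cache
reader.bypass_cache = True
print(reader.get("hotel:1:image"))  # served from the source of truth
```

In a real service the switches would typically live in a configuration center so they can be flipped without a deploy, and the source call would need its own timeout and rate limit so that bypassing the cache does not overload the backing store.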