High-Availability Methods and Practices for Baidu's Commercial Hosted-Page System

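{"type":"doc","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"blockquote","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"Introduction: ","attrs":{}},{"type":"text","text":"At internet companies, business iteration is fast and systems change frequently, so early development tends to be slash-and-burn. As system complexity grows, stability problems come to the fore, and once they begin to hold back business growth, the cost of tearing everything down and starting over is easy to imagine. The architecture therefore has to be optimized and evolved continuously to improve stability, solving urgent problems while also guarding against future ones. Drawing on concrete practice, this article reflects on and summarizes methods for building highly available systems.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"italic","attrs":{}}],"text":"The original article is 7,986 characters long; estimated reading time is 19 minutes.","attrs":{}}]}],"attrs":{}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"heading","attrs":{"align":null,"level":1},"content":[{"type":"text","text":"1. Background","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":"br"}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"Baidu commercial hosting: ","attrs":{}},{"type":"text","text":"Baidu's initiative to build a new marketing ecosystem. Aiming at efficient connection and ad-delivery optimization, it gives commercial customers a one-stop operations base that connects services with consumers, and it marks Baidu's important shift from operating traffic to operating users. Representative products include site-building and e-commerce platforms such as Jimuyu (基木魚) and Du Xiaodian (度小店).","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"As the hosted-page business keeps growing, the scale and business complexity of the system keep increasing, and its availability faces major challenges. Moving from methods to practice, this article analyzes the thinking behind stability work and covers standards, monitoring, redundancy, degradation, contingency plans, and other measures for achieving high availability.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"Definition of the availability metric:","attrs":{}},{"type":"text","text":" 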
Ideally a system would provide service around the clock without interruption. But software systems are complex, and in distributed environments in particular, bugs, hardware and software failures, and insufficient capacity often prevent a system from delivering 100% availability. Availability is therefore usually expressed as a number of nines (N nines), a metric that also serves as the SLA standard for some foundational services. For example, four nines (99.99%) allows roughly 52.6 minutes of downtime per year.","attrs":{}}]},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/0d/0d9e06406bb8e5c447bd6396a7aadfb9.webp","alt":"圖片","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"heading","attrs":{"align":null,"level":1},"content":[{"type":"text","text":"2. Overall Approach to Availability","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/0b/0bd788b19251449b7f001e664ed2e320.webp","alt":"圖片","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Building a highly available system is a large undertaking that has to be considered from several dimensions. The overall approach can be organized around when faults are discovered, how far they spread, how often they occur, and how quickly they are handled.","attrs":{}}]},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"2.1 Detecting Faults Early","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"In terms of timing, it is crucial that engineers notice a problem before users or customers report it. After every incident we ask ourselves whether the problem could have been found earlier and which tools and methods are available for doing so. The following subsections go through them one by one.","attrs":{}}]},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"2.1.1 
Detecting Faults Early: Standardization","attrs":{}}]},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"Log standardization: ","attrs":{}},{"type":"text","text":"The core idea of standardization is to use a set of constraints to keep the whole system coordinated and consistent. The hosting platform's internal services are built on a unified microservice framework, but log formats differed widely across systems and modules, which raised costs in development, testing, and operations. The logging standard mainly governs how key business information is recorded, so that problems can be located efficiently during development and investigated during QA testing, and so that the logs give reliable guidance for data statistics and analysis.","attrs":{}}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"General global rules:","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"The global context is implemented with a unified MDC; fields are delimited by square brackets and spaces.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Every logger must set additivity=false so that messages are not printed twice.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"The msg content must be concise and easy to understand.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Do not print the same log again to console.log.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Log through the slf4j facade.","attrs":{}}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"Log levels:","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"TRACE","attrs":{}},{"type":"text","text":" 
Detailed debugging information; must not be enabled in production.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"DEBUG","attrs":{}},{"type":"text","text":" Development debugging logs; must not be enabled in production.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"WARN","attrs":{}},{"type":"text","text":" Warnings: indicates that a system module has run into a problem that does not stop the system from running.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"INFO","attrs":{}},{"type":"text","text":" Informational records: mainly used to record the system's running state and related information.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"ERROR","attrs":{}},{"type":"text","text":" Error output: once this is logged, a core module of the main system is not working correctly and must be fixed before it can work normally again.","attrs":{}}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"Log files","attrs":{}}]}]}],"attrs":{}},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/ce/ce075ce77054b6700316f6e5af26b53f.webp","alt":"圖片","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":"br"},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"logPattern","attrs":{}}]}]}],"attrs":{}},{"type":"codeblock","attrs":{"lang":"text"},"content":[{"type":"text","text":"","attrs":{}}]},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"in
dent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"Format requirements for key logs: ","attrs":{}},{"type":"text","text":"The detailed rules are too numerous to list here. They mainly concern the core context carried through the logs, which must include the source IP, request path, status code, latency, and so on.","attrs":{}}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"Alert standardization: ","attrs":{}},{"type":"text","text":"Alert standardization mainly targets alerting on error logs. It introduces tiered alert monitoring and defines the naming of monitoring items for each tier. Alerts at different levels use different collection tasks and monitoring policies, each with a matching follow-up process.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Performance monitoring alerts for common services.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Performance monitoring alerts for strong dependencies.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Monitoring alerts for abnormal service status codes.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Latency monitoring alerts for third-party services.","attrs":{}}]}]}],"attrs":{}},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/b2/b2ddcdb24df194f74a10181b092d7fae.webp","alt":"圖片","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":"br"},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"On-call standardization: ","attrs":{}},{"type":"text","text":"Core rules and procedures for on-call engineers, covering notification, loss mitigation, localization, and resolution, ensure that production problems are handled and resolved as soon as they appear.","attrs":{}}]}]}],"attrs":{}},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"2.1.2 
Detecting Faults Early: System Monitoring","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"System monitoring is about building end-to-end capability, from sensing a problem to locating it. The log standardization described earlier is the prerequisite for automated problem sensing: once logs follow the standard, unified monitoring can be built with automated tooling. Problem sensing mainly covers business metrics, business functions, system stability, and data correctness and timeliness.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/3b/3b5003afbc96ee31a2a8a870f5f19349.webp","alt":"圖片","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"Problem sensing: ","attrs":{}},{"type":"text","text":"Business metrics are the core business indicators the system cares about. They are collected in real time so that changes are detected as they happen and the business impact of a system problem can be monitored in real time.","attrs":{}}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Business functions refers to scenario-based automated testing and monitoring of core business functions.","attrs":{}}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"System stability is measured along several dimensions: availability as seen at the gateway entry point, availability of each module itself, and the stability of third-party dependencies.","attrs":{}}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Data consistency checking is essentially an offline or near-line reconciliation scenario. Most distributed microservices achieve eventual consistency through compensation, which makes data reconciliation especially important.","attrs":{}}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"o
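rigin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"Problem localization: ","attrs":{}},{"type":"text","text":"Problem localization builds alerts and monitors for abnormal indicators around core business scenarios, including traffic anomalies, average-response-time anomalies, and pvlost. For data correctness and timeliness, it covers data latency, data inconsistency, and so on.","attrs":{}}]}]}],"attrs":{}},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"2.1.3 Detecting Faults Early: Capacity Assessment","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Capacity assessment is an effective way to discover capacity problems ahead of time. In particular, before certain business events, engineers or architects need to assess capacity to decide whether the system must be scaled out, which takes considerable preparation. The two common approaches are static analysis and dynamic assessment.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Static analysis examines the dependency topology of the existing system and, given current traffic, calculates in theory the maximum traffic the system can sustain and where its bottlenecks lie. It yields only an estimate, which is not necessarily objective or accurate. Dynamic assessment runs simulated load tests against production services and judges capacity from the system's actual behavior. It is more objective and accurate, but full-link load testing in production carries some risk and can easily pollute business data. In practice the two can be combined: static analysis plus dynamic assessment.","attrs":{}}]},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/5d/5db264c216e061344516fe9a58296a6e.webp","alt":"圖片","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"Points to note for dynamic assessment:","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Simulate real production traffic as closely as possible (traffic replay), because different code branches can load the system differently. For example, load testing with one identical request and parameter set is very likely to hit the cache and make the results untrustworthy.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Before a dynamic assessment, use static analysis to identify possible business impact, and add switches to avoid disturbing users, for example to suppress the SMS messages sent to users or merchants when an order is placed.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Dynamic assessment can affect production systems, so run it during off-peak hours and make sure it can be started and stopped quickly.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Dynamic assessment needs the business systems to cooperate by tagging and cleaning up test data, so that dirty data does not affect production business.","attrs":{}}]}]}],"attrs":{}},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"2.2 Keeping the Fault Scope Small","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"In terms of scope, the main way to shrink the blast radius of a fault is isolation. Isolation means fencing off faults caused by non-core services in the microservice architecture so that non-core factors have less effect on the stability of the core business; with isolation in place, attention can focus on the stability of the core services. Put simply, do not put all your eggs in one basket. In practice this covers storage isolation, service isolation, and permission isolation.","attrs":{}}]},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"2.2.1 Keeping the Fault Scope Small: Storage Isolation","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Early on, many businesses share storage to speed up development and save resources. As the business grows, problems like the following become common:","attrs":{}}]},{"type":"numberedlist","attrs":{"start":null,"normalizeStart":1},"content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":1,"align":null,"origin":null},"content":[{"type":"text","text":"A slow SQL query from business A slows down the whole cluster, affecting businesses B, C, and D.","attrs":{}}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":2,"align":null,"origin":null},"content":[{"type":"text","text":"Adding a column to a large table in business B causes primary-replica replication lag, affecting businesses A, C, and D.","attrs":{}}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":3,"align":null,"origin":null},"content":[{"type":"text","text":"Offline statistical analysis by business C drives replica CPU usage to 100%, affecting businesses A, B, and D.","attrs":{}}]}]}]},{"type":"paragraph","attrs":{"indent":0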
,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"The main remedy for the problems above is to split the physical clusters so that businesses no longer share underlying storage and interfere with one another, which improves overall stability. The hosted-page system has two major businesses, site building and e-commerce, and production incidents caused by their sharing a MySQL cluster occurred fairly often. Physical clusters are generally migrated by business domain; the main splitting method and steps are shown in the figure below:","attrs":{}}]},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/b9/b9f5bafdd8ad3f40a18dd0fa649dac27.webp","alt":"圖片","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Capacity assessment and resource requests for the new cluster, and dual-write synchronization during the cutover, are all critical steps; after dual writes are enabled, the business must verify data accuracy promptly. The same approach applies to isolating other storage such as Redis.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"2.2.2 Keeping the Fault Scope Small: Service Isolation","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"Service isolation: ","attrs":{}},{"type":"text","text":"One form of service isolation takes the business perspective and comes down to the principles for splitting microservices. The usual guidelines are:","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Split the volatile, frequently changed parts out into their own services.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Separate applications with high concurrency levels from those with low ones.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Split and isolate services along organizational boundaries (Conway's law; vertical split).","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Consolidate common low-level foundational information and services to keep them general-purpose (horizontal split).","attrs":{
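}}]}]}],"attrs":{}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"The other form of isolation takes the redundancy perspective. For high availability, services need redundancy across multiple data centers and regions, so that when one data center or region fails, traffic can be switched and losses stopped in time. Redundancy addresses how core services stay stable as their environment changes, for instance through service, switch, network, or data-center failures. With redundancy at multiple layers plus traffic scheduling, the business can keep providing stable service by failing over when hardware or the environment changes.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Redundancy here mostly means redundancy of the access layer and of services. It is easy to achieve for stateless services, but multi-region redundancy for stateful foundational components and storage services is expensive and can be implemented scenario by scenario. For query scenarios with loose consistency requirements, multi-region storage deployment is an option; where consistency requirements are strict, a set-based (unitized) architecture should be considered. Alibaba's three-region, five-data-center architecture is a useful reference.","attrs":{}}]},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"Cleaning up legacy services: ","attrs":{}},{"type":"text","text":"Because systems keep changing and iterating, technical projects keep refactoring or rewriting existing systems, and it is not unusual at internet companies for multiple versions of an interface or system to coexist. This is especially true of low-level foundational services. Engineers and architects need to be alert and responsible here and clean up technical leftovers promptly to protect availability.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Providers of foundational services must actively push upstream systems to upgrade when old versions are being retired.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Where a system depends on legacy services that nobody maintains, re-examine the service dependency topology and replace them.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Foundational components that a system depends on should be evaluated and upgraded promptly, especially where security or performance issues are involved.","attrs":{}}]}]}],"attrs":{}},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"2.2.3 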
Keeping the Fault Scope Small: Permission Isolation","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Most system failures are caused by changes, so permission isolation is a key part of change control. Isolating release and deployment permissions relies on the capabilities of the container management platform, but the team itself must also revoke stale permissions promptly so that uninvolved people cannot cause production risk through a mistaken operation.","attrs":{}}]},{"type":"numberedlist","attrs":{"start":null,"normalizeStart":1},"content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":1,"align":null,"origin":null},"content":[{"type":"text","text":"Isolate read and write permissions on production databases and control IP authorization.","attrs":{}}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":2,"align":null,"origin":null},"content":[{"type":"text","text":"Isolate release and deployment permissions for production services and control the reviewer list for staged releases.","attrs":{}}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":3,"align":null,"origin":null},"content":[{"type":"text","text":"Isolate code repository permissions to safeguard code review (CR) quality.","attrs":{}}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":4,"align":null,"origin":null},"content":[{"type":"text","text":"Isolate permissions for service entry points and administrative functions.","attrs":{}}]}]}]},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"2.3 
Keeping Failure Frequency Low","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Any discussion of failure frequency has to mention another concept: MTBF (mean time between failures).","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/55/55ef9a16d2f6c190b3bb1ee0848e8ed7.webp","alt":"圖片","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"The time between failures is the interval from the moment the equipment last returned to normal operation (up time in the figure) to the moment it fails again (down time in the figure). MTBF can be expressed with the following formula:","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/4b/4b017f1abb94d127133dd635f6f28129.webp","alt":"圖片","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"We operate in all kinds of complex network environments, and failure frequency is an important measure of how well a system protects itself. Common methods and practices are described next.","attrs":{}}]},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"2.3.1 
Keeping Failure Frequency Low: Rate Limiting","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Every system has a maximum load it can bear; below that threshold it serves requests normally. To keep serving requests when a large burst of traffic arrives, we need flow control, especially for low-level foundational services and for business services that many applications depend on. Common rate-limiting algorithms are counters (fixed window and sliding window), the token bucket, and the leaky bucket.","attrs":{}}]},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"Common rate-limiting algorithms","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"Token bucket: ","attrs":{}},{"type":"text","text":"The token bucket algorithm keeps a bucket that holds at most a fixed number of tokens, and a mechanism adds tokens to the bucket at a set rate. Each request must first acquire a token and may proceed only once it has one; otherwise it either waits for a token to become available or is rejected immediately.","attrs":{}}]}]}],"attrs":{}},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/a6/a6f5c908de598af410d5fb85a4892292.webp","alt":"圖片","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"Leaky bucket: ","attrs":{}},{"type":"text","text":"The name is descriptive: the algorithm has a container that works like an everyday funnel. Incoming requests are like water poured into the funnel, which then drains slowly and at a constant rate through the small opening at the bottom. However large the inflow, the outflow rate stays the same.","attrs":{}}]}]}],"attrs":{}},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/3f/3f7893ee2eff83de004e42c9c885e80c.webp","alt":"圖片","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"ind
ent":0,"number":0,"align":null,"origin":null}},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"Rate-limiting approaches: ","attrs":{}},{"type":"text","text":"In the hosted-page system, rate limiting applies to two broad classes of service. The first is web services or APIs that face users directly; these usually sit behind a gateway layer (Baidu, for example, has its own BFE platform), where rate-limiting rules are easy to configure. The second is RPC services, which have to implement rate limiting themselves. A popular choice today is the Resilience4j RateLimiter (token-bucket based), which integrates well with Spring Boot; for implementation and usage details see https://resilience4j.readme.io/docs/ratelimiter","attrs":{}}]}]}],"attrs":{}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"Keep the following points in mind when configuring and implementing rate limiting:","attrs":{}}]},{"type":"numberedlist","attrs":{"start":null,"normalizeStart":1},"content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":1,"align":null,"origin":null},"content":[{"type":"text","text":"The hard part of rate limiting is choosing a reasonable threshold; it usually has to be estimated from actual production conditions together with dynamic load-test results.","attrs":{}}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":2,"align":null,"origin":null},"content":[{"type":"text","text":"Because a service exposes multiple APIs, configure both a service-wide global limit and limits on the core APIs; the global limit takes precedence over the per-API limits.","attrs":{}}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":3,"align":null,"origin":null},"content":[{"type":"text","text":"To keep rate-limiting actions timely, the system must support changing the configuration dynamically.","attrs":{}}]}]}]},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"2.3.2 
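Keeping Failure Frequency Low: Degradation and Circuit Breaking","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Circuit breaking in a distributed system works like the fuse in a household circuit: when load exceeds the threshold the system can bear, the breaker trips automatically and protects the system. With microservices spreading rapidly, dependency topologies keep getting more complex, to the point that architects struggle to draw the complete dependency graph. If one service becomes unavailable and no circuit breaker is in place, an avalanche is very likely. Guarding against this kind of catastrophic failure requires sensible circuit breaking and degradation.","attrs":{}}]},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"Sorting out strong and weak dependencies","attrs":{}}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"To do circuit breaking and degradation well, strong and weak dependencies must first be sorted out, judged mainly by their impact on the business. When placing an order, for instance, the product inventory service is a strong dependency because data consistency must be guaranteed, so it cannot be degraded. The price shown on a product detail page, by contrast, depends on the marketing price-calculation service, which can be treated as a weak dependency: even if that service is unavailable, the product can still be shown at its original price. Degradation generally affects the business, so the essential task is to define what behavior is expected after degradation and which business functions will be affected.","attrs":{}}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"Choosing a framework","attrs":{}}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"A commonly used framework for circuit breaking and degradation is Resilience4j (https://resilience4j.readme.io/docs/circuitbreaker), a lightweight circuit-breaker framework (about 60,000 lines of code) that is simple to use. Hystrix is no longer actively developed and has entered maintenance mode. Sentinel is another common choice; many comparison articles are available online, so the details are not repeated here.","attrs":{}}]}]}],"attrs":{}},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/b9/b90f99c898836c1338ef8dbb61a3e1f9.webp","alt":"圖片","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Like Hystrix, the Resilience4j circuit breaker has three states: CLOSED, HALF_OPEN, and OPEN. In addition, Resilience4j has two special states: DISABLED and FORCED_OPEN.","attrs":{}}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Resilience4j (in versions before 1.0) stores the results of calls to the protected method in a ring bit buffer, recording a successful call as a 0 bit and a failed call as a 1 bit. The ring bit buffer is a BitSet-like data structure backed by an array of long values; 16 elements (16 x 64 bits) are enough to store the results of 1,024 calls.","attrs":{}}]}]}],"attrs":{}},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"2.3.3 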
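Keeping Failure Frequency Low: Timeout Settings","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Unreasonable timeout and retry settings can also cause failures. The hosting system went through a full review and cleanup of timeouts, fault tolerance, pooling, and related settings, focusing mainly on the following areas:","attrs":{}}]},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"Reasonable timeouts","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"RPC and HTTP dependencies should both have reasonable timeouts; a practical rule is to take the dependency's production 99th-percentile latency and add a 30%-50% buffer.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"Many frameworks ship with default timeouts that need adjusting as appropriate. For example, the Redis connection pool defaults to 2000 ms for read/write and connect timeouts, the OkHttp connection pool defaults to 10 s, and Hikari's connection timeout defaults to 30 s. 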
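Many of these default timeouts are unreasonable for high-concurrency services and applications and should be weighed against the business scenario.","attrs":{}}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"Fault-tolerance strategy","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"For read operations, a failover strategy can be used, with no more than 2 retries.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"Retries for write operations need careful thought, in particular whether the downstream service guarantees idempotency. To limit risk, use failfast when the downstream cannot guarantee idempotency.","attrs":{}}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"Pool settings","attrs":{}}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Thread pools and connection pools are used all the time in development; their core purpose is to cut the overhead of repeatedly creating and destroying resources and so improve performance. A badly configured pool, however, introduces its own risks.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Do not create thread pools through Executors; use ThreadPoolExecutor and define the pool's core parameters explicitly (per the Alibaba coding guidelines).","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"The initial size of Redis and database connection pools must take into account the cluster size and the maximum number of connections the storage service allows. It must not be set too high; a poor setting can saturate the storage service the moment the application starts.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Give thread pools and connection pools distinctive names to make monitoring and logging easier.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"How large should a pool be? That depends on the throughput and average-response-time requirements. A suggested formula: concurrency (number of connections or threads) = requests per second (QPS) * processing time per request in seconds. For example, 200 QPS at 50 ms per request needs about 10 concurrent connections. CPU core count, disk, memory, and other factors also need to be taken into account, and the final value is best validated with production load tests.","attrs":{}}]}]}],"attrs":{}},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"2.4 Handling Faults Quickly","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"The key to fault handling is to stop the loss quickly. Many programmers tend to fixate on finding the root cause before fixing anything, but the longer a fault lasts, the greater its impact on the business. Every engineer therefore needs a stop-loss mindset: restore the business first, preserve the scene, and leave the deeper root-cause analysis to the post-incident review.","attrs":{}}]},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"Fast scaling","attrs":{}}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Fast scale-out and scale-in reflects the elasticity of services and systems, and depends on the elastic scaling of containerized clusters. For historical reasons, parts of the hosted-page system still run on physical machines or on old, unmaintained platforms and are essentially helpless against traffic bursts. Migrating to PaaS is therefore urgent; relying on a strong PaaS platform's fast scaling capability makes quick loss mitigation possible.","attrs":{}}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"Routine contingency plans: ","attrs":{}},{"type":"text","text":"Accumulating routine contingency plans in normal times is an effective way to handle sudden faults quickly. Ideally, fault localization and contingency plans are linked, so that when a fault occurs the developers or operators involved can quickly identify its type and execute the matching plan. Typical plans cover rate limiting under attack, data-center failover, fast scale-out, and handling procedures for common emergency cases.","attrs":{}}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Some tips on designing contingency plans:","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Review and summarize past incidents, then classify and archive them into contingency plans.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Make plans as easy as possible to trigger and execute.","attrs":{}}]},{"type"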
:"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Faults caused by releases or changes are common, so every release needs a complete rollback plan.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"For faults caused by changes in traffic or capacity, carry out production capacity assessments on a regular schedule.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"For data-center, network, and hardware failures, ensure service stability through appropriate redundancy and fast traffic switching.","attrs":{}}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"Data backup","attrs":{}}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"When a service has problems, traffic switching, restarts, or scaling out can resolve them promptly. When data has problems, such as a dropped database or lost data, recovery is extremely costly. Core data therefore needs regular backups: for example, core data in MySQL clusters should be backed up daily and be recoverable to a point in time by replaying the binlog. The business team and the DBAs generally need to agree on the backup and fast-recovery mechanisms together.","attrs":{}}]}]}],"attrs":{}},{"type":"heading","attrs":{"align":null,"level":1},"content":[{"type":"text","text":"3. Summary and Reflections","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"System stability is a very broad topic. Drawing on the stability work done for the commercial hosted pages, this article has described, by category and at a high level, common ways to safeguard system stability, covering fault detection, fault impact, failure frequency, and fault handling. Stability work has to take business, development, testing, and operations into account and needs all of them to cooperate. The author's knowledge is limited and the article was written in haste, so inaccuracies and omissions are inevitable; corrections from readers are welcome.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"---------- END ----------","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Baidu Geek Talk (百度 Geek 說)","attrs":{}}]}]}