Application and Practice of Chaos Engineering and Fault Drill Components

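{"type":"doc","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"With the rapid growth of the financial industry and constantly changing requirements, the traditional monolithic architecture can no longer meet demand, and distributed systems and microservice architectures are increasingly being adopted across the industry. Distributed systems have many advantages, but they also make the interactions between systems and the underlying infrastructure more complex, which places higher demands on availability and reliability. To verify the robustness of our distributed architecture and to study the related risks and contingency plans, we launched the research project 'Research on Software Risk Engineering (SRE) for the Distributed Microservice Platform' in early 2021, together with its applied practice. Starting from a failure-driven perspective, the project aims to uncover hidden risks in the current architecture, call chains, and systems ahead of time, verify the completeness of the infrastructure, limit the blast radius of failures, and establish the corresponding risk contingency plans. To this end, we introduced chaos engineering and fault drills, and developed and applied them systematically."}]},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"What Is Chaos Engineering?"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Chaos engineering is the discipline of experimenting on a distributed system in order to build confidence in the system's capability to withstand turbulent conditions in production. It originated with Chaos Monkey, created by Netflix in 2010, which randomly terminates instances running in production so that engineers can quickly learn whether the services they are building are robust and resilient enough to tolerate unplanned failures. From that point on, chaos engineering began to take off."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"By randomly injecting different types of faults into a system, chaos engineering identifies as many failure-prone points as possible, so that the system can be hardened and protected in a targeted way. Over the years, more and more companies have practiced chaos engineering to improve the reliability of their system architectures. Take Netflix as an example: after developing the chaos experiment tool Chaos Monkey internally in 2010, it kept working in this area, introducing Failure Injection Testing (FIT) in 2014, formally publishing the guiding principles of chaos engineering in 2015, and in 2017 open-sourcing Chaos 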
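Monkey V2. In addition, in 2016 Gremlin commercialized chaos experiment tooling, and in 2017 ChaosIQ open-sourced the chaostoolkit chaos experiment framework. Today, many companies, including Google, Amazon, Microsoft, FANG, Alibaba, Meituan, Qihoo 360, and NetEase, are working on their own chaos engineering practices. Chaos engineering is clearly attracting attention from many companies, which use various forms of chaos experiments to improve system robustness."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"What has chaos engineering achieved in practice so far? For example, ICBC ran chaos experiments on payment transactions, simulating a network disconnection and subsequent recovery at a single campus of an active-active application. It found architectural design flaws in the payment chain that caused transactions to fail under certain low-level fault scenarios, and redesigned and upgraded the payment system architecture before go-live, which effectively kept these problems from being triggered in production. Over the past year, Netflix used chaos engineering to catch 2 major and 8 minor failures ahead of time, saving the organization roughly USD 700,000 in losses. JD.com and Alibaba run two months of intensive chaos engineering fault drills before the Double 11 shopping festival to examine how well their systems and teams detect, respond to, handle, and recover from failures, improving the teams' ability to cope with large-scale failures."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"As these examples show, chaos engineering plays an important role in improving the robustness and reliability of complex distributed systems, and introducing it into our own systems is essential."}]},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"Chaos Engineering vs. Traditional High-Availability Testing"}]},{"type":"embedcomp","attrs":{"type":"table","data":{"content":"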

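Difference

Traditional high-availability testing

Chaos engineering

Experiment scope

The scope of experiments is relatively limited

The possibilities for experiments are unlimited

Experiment purpose

Tests the known possible values of system properties

Experiments to gain new knowledge about the system

Test method

Specific assertions; a test produces a binary result, either true or false

Experiments vary widely with the system architecture, and results differ accordingly

Mindset

Injects faults according to a predefined plan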

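Proactively and exploratively searches for failures"}}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"As the table shows, the main difference between chaos engineering and traditional testing is that chaos engineering is a practice for discovering new information, whereas fault injection is a method for verifying one specific condition or variable."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"When exploring how a complex system copes with anomalies, injecting communication faults (such as timeouts and errors) into its services is a good approach. Sometimes, however, we also want to explore scenarios that are not failures as such, for example traffic spikes, race conditions over resources, and the handling of unplanned or unusual combinations of messages. We need to understand how the system behaves under these conditions as well."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Traditional high-availability testing targets the points that we anticipate could break the system, but it does not explore the broader space of unpredictable yet quite likely events described above. In a traditional test we write an assertion: given a specific condition, the system should produce a specific output. Such a test generally yields only a binary result, true or false, which determines whether it passes. This process does not reveal anything unknown or unclear about the system; it merely checks the possible values of properties we already know about. The possibilities for chaos engineering experiments, by contrast, are unlimited: experiments can vary widely depending on the distributed system architecture and the core business value at stake. This is fundamentally different from existing methods that test known properties. Chaos engineering helps us gain new knowledge about the system and often opens up a much wider understanding of complex systems."}]},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"Technical Pre-Research and Tool Selection for Chaos Engineering at 建信金科 (CCB Fintech)"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Based on an analysis of the bank's financial business scenarios, distributed system architecture, and runtime environments, our chaos engineering solution needs two key characteristics: (1) it must abstract away the underlying infrastructure and cover both virtual machine and container environments; (2) it must provide a unified orchestration model that can orchestrate both virtual machine faults and container faults. However, no single tool in the industry currently meets both requirements."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"We first surveyed the existing chaos testing tools and analyzed and summarized the fault injection capabilities of each, as shown in the table below:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"embedcomp","attrs":{"type":"table","data":{"content":"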

Tool category

Tool name

Insufficient feature coverage

ChaosMonkey, KubeMonkey, Blockade, ByteMonkey

Focused on virtual machine faults

Chaosd

Focused on container faults

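Pumba, ChaosMesh, Litmus, ChaosBlade"}}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"The tools in the table differ in their runtime environments and in the focus of their fault injection capabilities. We selected the open-source frameworks ChaosMesh and Chaosd, which focus on container and virtual machine environments respectively and support a more comprehensive set of fault types. Although both frameworks are fairly complete, covering k8s, network, disk, CPU, memory, IO, and other fault types, they do not yet cover anomalies in the JVM, databases, caches, or message queues. For a distributed system, message queues, caches, and databases are critical components, so injecting faults into them is indispensable. In addition, our distributed system is a Java application, which makes JVM fault injection equally important. We therefore extended both frameworks through secondary development to broaden their fault injection capabilities."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Because ChaosMesh and Chaosd are two separate frameworks, neither can simulate virtual machine and container faults at the same time. We therefore needed to abstract and design a unified orchestration model that schedules both ChaosMesh and Chaosd, so that a single chaos component can inject faults into both container and virtual machine infrastructure."}]},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"A Fault Drill Component Built on Chaos Engineering"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Building on the ChaosMesh and Chaosd fault injection tools, we set out to build a complete chaos fault drill component."}]},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"I. Architecture of the Chaos Engineering Fault Drill Component"}]},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/02\/02573283b651184502776343d549c73b.png","alt":"圖片","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"The overall architecture of the chaos engineering fault drill component is shown in the figure above. It consists of the following main modules: front-end portal, task scheduling, fault injection agents, monitoring and alerting, custom faults, load generation, and result analysis."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"1. 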
Front-end portal: "},{"type":"text","text":"Lets users quickly orchestrate experiments, manage environments, define contingency plans, and monitor experiments in real time."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"2. Task scheduling: "},{"type":"text","text":"Dispatches and schedules all experiment tasks in batches."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"3. Fault injection agents: "},{"type":"text","text":"Receive tasks dispatched by the task scheduling framework and carry out the corresponding fault injection events."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"4. Monitoring and alerting: "},{"type":"text","text":"Collects and consolidates monitoring data, stores monitoring metrics, and periodically archives and deletes them."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"5. Custom faults: "},{"type":"text","text":"Implement, through secondary development, the faults that the open-source frameworks do not provide."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"6. Load generation: "},{"type":"text","text":"Simulates load scenarios during fault injection, integrating with the existing load generation platform ICDP."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"7. 
Result analysis: "},{"type":"text","text":"After an experiment ends, compares, analyzes, stores, and displays the results."}]},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"II. Supported Features"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/b9\/b9c76784e9d6221f8ef995d9f6241bc5.png","alt":"圖片","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"As shown in the figure above, the fault drill component covers several major fault categories: disk, process, network, stress, JVM, file, host, and container. Each category includes typical faults such as disk read and write load, disk fill, process kill, CPU saturation, and network delay. Features such as injection of custom JVM exception types and mock service delay are still under development."}]},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"III. Fault Drill Process"}]},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/ed\/eda8e027aed2dbc623b82dade1fc856b.png","alt":"圖片","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"To run a fault injection experiment, a tester goes through the following operations: environment preparation, fault task orchestration, starting fault injection, and terminating injection."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"Environment preparation: "},{"type":"text","text":"Before a fault drill begins, the target servers on which faults will be injected must be added in the portal."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"Fault task orchestration: "},{"type":"text","text":"After selecting the fault environment added above, the tester creates a new fault drill task. Each task can contain multiple fault stages, each stage can contain multiple fault steps, and each step can contain multiple fault actions. These stages, steps, and actions can be arranged sequentially, in parallel, in loops, or on a schedule
to define a complete fault task."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"Fault injection: "},{"type":"text","text":"Once the fault task has been orchestrated, clicking Start begins fault injection. The drill component uses the fault injection agents to operate on the specified target servers and start the experiment. While the experiment runs, the monitoring and analysis modules collect resource metrics and other information from each server and display them in real time."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"Terminating injection: "},{"type":"text","text":"Besides ending automatically, a fault task can also be terminated manually, in which case the fault drill component sends a command to the target servers to stop the experiment."}]},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"Application to Our Current Distributed System"}]},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/a2\/a27117c8f3cc1f89a4fbaf83b829b074.png","alt":"圖片","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" 
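The architecture of the distributed system is shown in the figure above. It is made up of multiple components: application routing, a configuration center, a Zookeeper cluster, Kafka message queues, a Redis cache, Cassandra indexes, and Mysql databases. The components interact through a combination of SDKs, APIs, and synchronous and asynchronous calls, and the clusters must also handle data consistency, disaster recovery, multi-active deployment, and similar scenarios, so the overall structure is quite complex."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Because the call relationships among the components are complex, all kinds of production failures can occur. Take the Redis cache and Kafka as examples: Redis may suffer mass cache expiration, cache penetration, or node crashes, while Kafka may have some of its topics deleted or its message queues fill up. System-level faults such as network anomalies and disk IO problems are also on the rise. In the financial industry, complex business scenarios, numerous failure scenarios, insufficient test environment resources, and a narrow set of traditional testing methods keep us from finding and fixing production issues in time, leaving the system not robust enough to handle sudden failures in production."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Chaos engineering can help verify the robustness of the distributed system. To this end, we designed mainstream application high-availability test scenarios for the routing layer, middleware layer, database layer, cache layer, and system layer. The following network fault scenario illustrates the experiment process."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Fault scenario: a network fault causes packet corruption, and the corruption rate gradually increases. After a while, the network is repaired and the fault is cleared."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Experiment steps:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"1. Use the load generation platform to apply load to the application routing layer"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"2. 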
Inject a network packet corruption fault and raise the packet corruption rate from 20% to 50%, as shown below:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/a0\/a0f29f6309e8e6b27d0a154330a5134a.png","alt":"圖片","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/a6\/a69a10145f8591370a172f0a4b63db8a.png","alt":"圖片","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"3. Hold for a period of time, then clear both injected faults"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/2d\/2daa503b9909f73bc50cafecf598aaac.png","alt":"圖片","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/48\/485d777cb448ad51b90a39ead31d6184.png","alt":"圖片","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"The experiment results are shown in the figures below:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/f4\/f427673c01d
6971782521359bfa847eb.png","alt":"圖片","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/67\/6722d0c99629ddd4560161f43ea02220.png","alt":"圖片","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"embedcomp","attrs":{"type":"table","data":{"content":"

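Expected result

Actual result

When packet corruption occurs, TPS drops and response time rises

When packet corruption occurs, TPS drops and response time rises, as marked in the figure

When the packet corruption rate increases, TPS keeps dropping

When the packet corruption rate increases, TPS keeps dropping, as marked in the figure

After the network recovers, TPS and response time return to their pre-fault levels

After the network recovers, TPS does not return to its pre-fault level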

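, and part of the latency does not recover, reaching as high as 600s"}}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"This experiment revealed architectural design flaws in the application routing layer, which we then fixed in a targeted way to improve the robustness and stability of the system. The fault drills run on the distributed system so far have uncovered more than 10 issues, which we are progressively resolving while drawing up the corresponding contingency plans, so that these problems do not occur in production."}]},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"Summary and Outlook"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"So far, we have developed, applied, and practiced chaos engineering and the fault drill component, including a chaos toolset for virtual machines\/physical machines and a chaos toolset for containers. We have applied chaos engineering to concrete testing of the distributed system, effectively uncovering system failure issues and fixing them in a targeted way. In terms of application scenarios, chaos engineering is also suitable for testing all kinds of foundational platforms, common technical components, and basic software and tools. We will continue to extend chaos engineering to these scenarios and run chaos tests on them to improve system robustness."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Meanwhile, as the business keeps growing and the distributed system keeps evolving, we will continue to improve the fault drill component on the basis of chaos engineering to meet increasing demand. The prototype of the fault drill component is shown below:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/57\/577a34c5410deaf5882965fa0f0e6e15.png","alt":"圖片","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"We are currently extending the prototype with the following 3 features:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"1. Environment management: "},{"type":"text","text":"Supports simple registration of physical nodes, and experiments on a single k8s cluster and multiple physical nodes."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"2. 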
Plan orchestration: "},{"type":"text","text":"Supports defining fault plans separately for k8s and for physical nodes."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"3. Experiment management: "},{"type":"text","text":"Supports creating, starting, stopping, and displaying experiments."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"The operation and call flow of the entire fault drill component is shown in the figure below."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/bf\/bff450d2f059d8db318f8a511f01371c.png","alt":"圖片","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Beyond fault drills on the distributed system, we will next accelerate drills that orchestrate various fault types for other business systems within the bank, steadily improving the ability of every system across the bank to handle abnormal incidents and extreme scenarios, and providing an effective safeguard for their stable development."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"We will also work to extend our chaos engineering practice on distributed systems to the wider financial industry by offering it as a service. With the accumulated collection of high-availability scenario cases and fault orchestration plans, users will be able to run automated chaos fault drills across their entire cluster environment simply by using the fault drill component, effectively avoiding incidents in production."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"References"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"[1] Casey Rosenthal, Lorin Hochstein, Aaron Blohowiak, Nora Jones, Ali 
Basiri. 混沌工程:Netflix系統穩定性之道 [M]. Beijing: 電子工業出版社, 2019."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"[2] 吳冕冠. ChaosBlade在工商銀行混沌工程體系中的應用實踐 [EB\/OL]. 2021-01-01."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"This article is reprinted from: 金科優源匯 (ID: jkyyh2020)"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Original link: "},{"type":"link","attrs":{"href":"https:\/\/mp.weixin.qq.com\/s\/x4f2FbzO0IPW_pVJIFQPpA","title":"xxx","type":null},"content":[{"type":"text","text":"Application and Practice of Chaos Engineering and Fault Drill Components"}]}]}]}
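
The fault task orchestration described in the fault drill process (a task contains stages, a stage contains steps, a step contains actions) can be pictured with a small data model. The following is only a minimal sketch under stated assumptions, not the component's actual implementation: every class, field, fault identifier, host name, and hold duration here is hypothetical, and the code does not call ChaosMesh or Chaosd. It models the packet corruption drill from the example (20%, then 50%, then recovery of both faults) and flattens it into a timeline.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class Action:
    """One fault action on one target (all names are hypothetical)."""
    target: str
    fault: str
    params: Dict[str, int] = field(default_factory=dict)


@dataclass
class Step:
    """Actions in a step start together; the step then holds for a while."""
    actions: List[Action]
    hold_seconds: int = 0


@dataclass
class Stage:
    """Steps in a stage run one after another; a stage may be repeated."""
    name: str
    steps: List[Step]
    repeat: int = 1


@dataclass
class Task:
    name: str
    stages: List[Stage]


def build_timeline(task: Task) -> List[Tuple[int, str, Action]]:
    """Flatten a task into (start offset in seconds, stage name, action)."""
    timeline: List[Tuple[int, str, Action]] = []
    clock = 0
    for stage in task.stages:
        for _ in range(stage.repeat):
            for step in stage.steps:
                # Every action of a step shares the same start offset.
                for action in step.actions:
                    timeline.append((clock, stage.name, action))
                clock += step.hold_seconds
    return timeline


def packet_corruption_drill(target: str, hold_seconds: int = 300) -> Task:
    """Corrupt 20% of packets, raise to 50%, then recover both faults.

    The 300-second hold is an assumed value, not taken from the article.
    """
    degrade = Stage("degrade", [
        Step([Action(target, "network-corrupt", {"percent": 20})], hold_seconds),
        Step([Action(target, "network-corrupt", {"percent": 50})], hold_seconds),
    ])
    recover = Stage("recover", [
        Step([Action(target, "recover", {"percent": 20}),
              Action(target, "recover", {"percent": 50})]),
    ])
    return Task("packet-corruption", [degrade, recover])


if __name__ == "__main__":
    for offset, stage_name, action in build_timeline(packet_corruption_drill("router-01")):
        print(offset, stage_name, action.fault, action.params)
```

Keeping the plan as plain data, separate from whatever executes it, is one way a single model could be dispatched to either a virtual machine agent or a container agent; how the real component does this is not described in the article.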
