Everyone Is Talking About Chaos Engineering. How Much Do You Know About Putting It into Practice?

{"type":"doc","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"Cause and effect is one of life's basic principles."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" "}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"In a developer's world it mostly comes down to this: if you do not find and fix problems early, the problems will eventually come and fix you, on a weekend or in the middle of the night."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" "}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"Countless late-night wake-up calls, weekends “recalled” by work, and the painful cost of system outages have made more and more developers and managers realize how much chaos engineering matters."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" "}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"The idea of "},{"type":"link","attrs":{"href":"https:\/\/xie.infoq.cn\/article\/f95435b7e834638fd6e97a2f7","title":"xxx","type":null},"content":[{"type":"text","text":"chaos engineering"}]},{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" is not new to the last couple of years. It was born more than a decade ago, when Netflix released a service called Chaos Monkey on Amazon Web Services. So why has chaos engineering only begun to draw wide attention in recent years? In the Tech Talk "},{"type":"link","attrs":{"href":"https:\/\/mp.weixin.qq.com\/s?__biz=MjM5MDE0Mjc4MA==&mid=2651088953&idx=4&sn=a96069cc00324ae2469a3b69a35b3c83&chksm=bdb9906a8ace197c56abbe32dad84e5965883b613638a099d1afd253277002fbf9bb18614053&scene=27#wechat_redirect","title":"xxx","type":null},"content":[{"type":"text","text":"“Exploring and Practicing Chaos Engineering at Amazon Web Services”"}]},{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":", senior developer advocate Huang Shuai addressed this question and explained in detail the difficulties of and approaches to practicing chaos engineering, as well as how to put it into practice with suitable development tools."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" "}]},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"It Is Time to Talk About Chaos Engineering"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" "}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"Why now, and not ten years ago? Huang Shuai started from the pain points we face today and grouped those facing enterprises and developers into the following five."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" "}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}},{"type":"strong"}],"text":"Pain point 1: Complexity from growing system scale"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" 
"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"隨着系統規模的增長,其複雜性類似於心臟拓撲結構。每個節點都是一個服務,錯綜相連,整個軟件的生命週期極度依賴治理手段和能力。企業對於系統進行觀測及持續維護的難度也大大增加了。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" "}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/22\/22bbb81328dd8f71a6ab98bb3b00bfe2.png","alt":null,"title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" "}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}},{"type":"strong"}],"text":"痛點2: 快與穩的煎熬"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" "}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"DevOps 、微服務、敏捷開發的廣泛使用,使軟件的迭代和新功能的發佈越來越快速。如何快步前進過程中,保持穩健,成了企業面臨的新難題。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" 
"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}},{"type":"strong"}],"text":"痛點3: 小步快跑的疑問"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" "}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"每次變更很小,提高發布頻次,一定能夠降低風險的結論值得商榷。軟件版本的快速迭代通常發生在軟件質量還沒有收斂到比較好狀態的前期,而且存在不同服務迭代節奏的差異。在這種情況下,心臟拓撲結構中哪怕很微小的增量變動都可能產生級聯故障,從而對整個系統造成重大影響。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" "}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}},{"type":"strong"}],"text":"痛點4: 穩定性測試的新難度"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" "}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"心臟拓撲結構的複雜依賴性,給集成測試、迴歸測試和浸泡測試都帶來新的挑戰。此外,整個新版本的發佈過程也是保證系統穩定性的重要環節,因此流水線上的每一個環節都需要進行驗證,給穩定性測試帶來新的難度。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" "}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}},{"type":"strong"}],"text":"痛點5: 
排障追蹤的困境"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" "}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"當系統的複雜性達到一定程度,版本更迭的速度又非常快的狀態下,很多時候我們碰到生產問題,要找出背後的根因非常困難。這中間可能是因爲,健康儀表盤吐出了不準確的服務狀態,水平擴展明明要藉助冗餘性獲得更好的可用性卻因毒化效應完全失效,告警系統因自身的穩定性故障產生大量的誤報等等。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" "}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"以上五大痛點的本質都可以用故障宿命論來解釋:人類設計的系統變得愈發複雜,逐漸超出了人類的認知範圍。近年來,隨着分佈式系統、敏捷開發、雲原生微服務架構等雲上現代化技術的廣泛應用,開發的效率和便捷性大幅提升,但是這些創新的背後隱藏的問題是傳統穩定性治理手段的落後。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" "}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"軟件可以用新技術實現彎道超車,系統複雜性引起的故障該如何解決?複雜性科學研究者、Cynefin認知框架的提出者戴夫·斯諾登認爲——理解複雜系統的唯一方法,就是與之互動。快速迭代中,誰都無法做到所有變化背後的考量全部記錄下來,系統是我們設計和構建的,但系統行爲我們卻無法準確預測,所以我們必須要在實際的運行環境中,通過實驗探索的方式(行爲預期、事件注入、系統觀測和更新假設),擴展我們對系統行爲的認識,這便是最好的“互動”。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" 
"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"亞馬遜可用性保障團隊災難大師傑西·羅賓斯,因其消防員的經驗於2004年發明了GameDay,邀請志願者藉助“實驗”與待測系統進行“互動”,探索未知的系統風險,同時也訓練了工程師團隊的應急響應能力。隨着業界更多團隊開展"},{"type":"link","attrs":{"href":"https:\/\/xie.infoq.cn\/article\/95c04fc0365f9739fb5cb3683","title":"xxx","type":null},"content":[{"type":"text","text":"GameDay"}]},{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":",亟待一種新的實踐模式實現GameDay高效實驗、自動化和流程標準化。Netflix發明的混沌猴子工具,以及後續提出的混沌工程正是源於這個思路。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" "}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"混沌工程承認人們對於系統的行爲認知是有侷限的,通過受控的故障注入實驗,觀測、記錄、分析系統,找到背後的原因,改進架構或相關的代碼和設計,從而真正提升整個系統架構的韌性,避免級聯故障發生。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" "}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/9d\/9ddd6d85ebab8da7fdce1770a2507453.png","alt":null,"title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" 
"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"混沌工程具體能產生怎樣的價值?是否能量化呢?黃帥在分享中列舉了Netflix 報告中的一個例子:“過去一年中,混沌工程提前發現了2次大故障和8次小故障,避免了整個組織大約70萬美金的損失。混沌工程團隊,總共3個成員,薪水支出15萬美金\/人。開展混沌工程實驗,本身需要1萬美金的成本。投資回報率是多少?高達52.17%。”"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" "}]},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"落地混沌工程的難點和思路"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" "}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"混沌工程帶來的價值,無疑是令人心動的。近年來,許多公司都嘗試採用某種形式的混沌工程來提高現代架構的可靠性。然而,混沌工程的探索實踐之路卻並非一帆風順。分享中提到,企業實踐混沌工程主要面臨三個方面的挑戰。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" "}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"首先,面臨穩態分析和服務透視方面的挑戰。在故障注入後,需要判定系統的穩態是否被改變。如何定義穩態?又該採用何種手段判定呢?"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" 
"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"混沌工程的核心特徵是對照觀測實驗。精細化流量控制後,分別設置實驗組和對照組。在實驗組中注入故障,通過可觀測的手段比較差異,通過鏈路追蹤的方式去判斷其強弱依賴的狀況。人工手動分析,穩態對照分析的效率和準確性需要依賴於自動檢驗算法,簡單地可以採用雙樣本的T檢驗,複雜地就需要藉助異常檢測等手段。在鏈路追蹤的過程中,需要分析它的強弱依賴,從而計算出爆炸半徑。其中,很重要的可觀測手段是服務透視。在亞馬遜雲科技上既支持已有的可觀測工具,同時也支持開源架構,整套工具開發者可直接進行使用。此外,可觀測性在促進我們對於系統行爲認知的同時,混沌工程本身也是系統可觀測性的強有力的驗證手段。通過平時的實驗可以去驗證當有故障產生的時候,哪些告警纔是真正對排障追蹤真正有效,哪些告警並不能帶來價值。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" "}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"第二,面臨爆炸半徑安全管控方面的難題。混沌工程採用的方式是用很小比例可能受到影響的用戶進行實驗,藉以提升整個系統的可靠性。這個過程需要進行安全管控,一方面,要有隨時停止的能力,俗稱一鍵關停。另一方面要合理管控實驗流量的大小,流量太小,可能會產生樣本誤差,流量太大會影響用戶。對於偏差,我們可以藉助實驗手段中存在已久的統計學方法,如多重檢驗、費舍爾方法等等。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" "}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/ff\/fff743d0fc370d6587dba2b0399d6e87.png","alt":null,"title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" 
"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"安全管控可以採用灰度對照實驗,一鍵關停的方式,對流量進行精細控制。通過亞馬遜雲科技的服務可以很細粒度的去隔離相應爆破的對象,以及進行強弱依賴識別。因爲只有知道相關服務之間的依賴關係強弱才能知道爆破半徑有多大。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" "}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"第三,是效率方面的難題。假設有10個組件,採用排列組合的方式去實驗會產生360多萬個場景。這麼多場景不可能通過有限的人力、時間和資源完成測試。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" "}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"我們需要以史爲鏡,通過歷史的故障進行一些類比引申和重現。首先,故障注入點不在於多,而是要使故障注入點的組合場景更接近現實的狀況。其次,要去實現混沌工程標準化、模板化和互操作性,不同團隊之間可以互相分享,共享一些模板,這些模板可以高效地使用起來。另外,可以使用FMECA 服務失效模式的分析手段,從故障發生的可能性、嚴重性和可觀測性三個角度,確定故障組合的優先級。通過這個優先級就可以很輕鬆地判定出哪些故障是核心故障,哪些故障是最重要的,然後把時間精力投入到重要的故障裏面就可以了。在故障組合的探索方面,可以用到STPA 分析模型。該模型認爲系統的可靠性依賴數據面、控制面、人工面三個部分,三個部分交互的點,也就是最容易產生故障的點。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" 
"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"總結來說,企業落地混沌工程的難點和思路基本圍繞穩態、安全、效率三個方面。那麼,在落地過程中是否可以藉助一些工具和服務呢?"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" "}]},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"用好一切可用的服務"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" "}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"今年3月,亞馬遜雲科技發佈了一款Amazon FIS服務,幫助大家更簡單、高效地去實踐混沌工程。該服務基於亞馬遜雲科技內部工具Amazon Gremlin技術和能力的輸出。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" "}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/9d\/9d3ec80b41ebde7593e3ce996ac2188a.png","alt":null,"title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" 
"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"早在2004年,亞馬遜就創立了基於工程師團隊的交互式和開放式的學習與訓練的“GameDay”。在2004年到2021年整整17年間亞馬遜雲科技一直實踐GameDay的玩法。2010年亞馬遜雲科技推出自研混沌工程產品Amazon Gremlin,結合GameDay成爲可用性保障實踐內容。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" "}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"在Amazon Gremlin內部實踐併爲其公司帶來價值後,亞馬遜雲科技考慮將擁有的能力賦能給自身的客戶和用戶,推出了Amazon FIS服務。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" "}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/1d\/1d13188c9272db9dc81531c8765c1f99.png","alt":null,"title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" "}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"Amazon FIS的最大的優勢體現在它的易用性。使用Amazon FIS不用集成和安裝其他工具就可以控制管理臺。同時Amazon 
FIS提供一些預設實驗的模板,用這些模板可直接做故障注入,如果產生新的想法,也可以根據自己的需要進行定製和修改。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" "}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"其次,Amazon FIS非常貼近真實的場景。很多故障場景並不代表真實世界的事件,在實際場景中很多問題是多個故障的組合引發的,Amazon FIS支持串行、並行等多種實驗靈活編排,最大程度還原真實場景。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" "}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"第三,也是非常重要的安全保障方面,Amazon FIS提供了自動停止的能力。一旦混沌工程實驗影響到實際用戶或產生負面影響,它可以自動停止。用戶也可以自行設置相應的告警,一旦觸發相應的停止條件,實驗將立即停止。在故障注入的過程中,可觀測性非常重要,Amazon FIS可集成 Amazon CloudWatch 的可觀測能力,方便用戶觀察故障注入時的系統變化。此外,Amazon FIS內置安全回滾策略,並且能夠通過細粒度 IAM 實現精細權限控制,避免混沌工程這個平臺成爲新的攻擊點。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" "}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"在Tech Talk的最後,黃帥演示了四個藉助Amazon 
FIS服務輕鬆打造雲上混沌的實驗。具體操作步驟可點擊視頻瞭解。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"embedcomp","attrs":{"type":"video","data":{"id":"423185","name":"輕鬆打造雲上混沌實驗.mp4","poster":"https:\/\/static001.infoq.cn\/resource\/image\/98\/68\/98b5a82a2bd96c6e2bdec9c2d60cee68.png","url":"https:\/\/media001.geekbang.org\/316a61e8d77a47f69107bc3049615a71\/4e285733140f4baab057c7bd165f9c9c-ebf7d638118ae95030c1dae138fdbea4-sd.m3u8"}}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}}]}
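The two figures quoted in the article (an ROI of 52.17% and more than 3.6 million scenarios for 10 components) can be checked with a few lines of arithmetic. This is only a sketch: it assumes ROI means (avoided loss - total cost) / total cost and that the scenario count is the number of orderings of 10 components (10!). Neither formula is spelled out in the article, and the function names are illustrative.

```python
from math import factorial


def chaos_roi(avoided_loss, team_size, salary_per_person, experiment_cost):
    """Return ROI as a fraction: (benefit - cost) / cost (assumed definition)."""
    total_cost = team_size * salary_per_person + experiment_cost
    return (avoided_loss - total_cost) / total_cost


def ordered_scenarios(components):
    """Number of orderings of `components` distinct components, i.e. n!."""
    return factorial(components)


# Figures quoted in the article: USD 700,000 avoided, 3 engineers at
# USD 150,000 each, USD 10,000 spent on the experiments themselves.
roi = chaos_roi(700_000, 3, 150_000, 10_000)
print(f'ROI: {roi:.2%}')
print(f'Scenarios for 10 components: {ordered_scenarios(10):,}')
```

Under these assumptions the total cost is USD 460,000 and the net gain USD 240,000, which reproduces the quoted percentage.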
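The article mentions a two-sample t-test as the simplest automated check for whether fault injection moved the steady state. A minimal sketch of that idea follows, using only the Python standard library. The latency samples, the function names, and the 0.05 threshold are hypothetical, and the p-value uses a normal approximation to the t distribution, so a real analysis with small samples would want a proper t distribution from a statistics library.

```python
from math import sqrt
from statistics import NormalDist, mean, variance


def welch_t(control, experiment):
    """Welch's two-sample t statistic (does not assume equal variances)."""
    standard_error = sqrt(variance(control) / len(control)
                          + variance(experiment) / len(experiment))
    return (mean(experiment) - mean(control)) / standard_error


def steady_state_changed(control, experiment, alpha=0.05):
    """Flag a change when the approximate two-sided p-value is below alpha.

    Assumption of this sketch: the t statistic is treated as standard
    normal, which is only a rough approximation for small samples.
    """
    t = welch_t(control, experiment)
    p_value = 2 * (1 - NormalDist().cdf(abs(t)))
    return p_value < alpha


# Hypothetical request latencies in milliseconds (not from the talk).
control_latency = [100, 102, 98, 101, 99, 100, 103, 97, 100, 101]
degraded_latency = [120, 118, 125, 122, 119, 121, 124, 117, 123, 120]

print(steady_state_changed(control_latency, control_latency[::-1]))
print(steady_state_changed(control_latency, degraded_latency))
```

The control group compared with itself shows no difference, while the clearly slower experiment group is flagged, which is the comparison between control and experiment groups that the article describes.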
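The FMECA-style prioritization described in the article ranks fault combinations by likelihood, severity, and observability. One way to sketch it is to multiply three 1-10 scores, similar to a risk priority number. The scoring scale, the multiplication rule, and the example failure modes are assumptions made for illustration and do not come from the talk.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FailureMode:
    name: str
    likelihood: int       # 1 (rare) .. 10 (frequent)
    severity: int         # 1 (minor) .. 10 (critical)
    hard_to_observe: int  # 1 (obvious) .. 10 (nearly invisible)

    @property
    def priority(self):
        # Assumed scoring rule: the product of the three scores.
        return self.likelihood * self.severity * self.hard_to_observe


def rank(modes):
    """Return failure modes sorted from highest to lowest priority."""
    return sorted(modes, key=lambda mode: mode.priority, reverse=True)


# Hypothetical failure modes and scores.
modes = [
    FailureMode('cache node restart', 6, 3, 2),
    FailureMode('database failover', 3, 9, 4),
    FailureMode('dependency latency spike', 7, 6, 7),
]

for mode in rank(modes):
    print(mode.priority, mode.name)
```

Experiments would then be designed for the top-ranked entries first, which is how the limited time and effort mentioned in the article gets focused.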