Doing Whatever It Takes to Reduce MTTR: Chaos Engineering in Practice at Huatai Securities

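{"type":"doc","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"While conducting its chaos engineering interview series, InfoQ found that enterprises generally fall into one of two camps: those that do not understand chaos engineering and arbitrarily conclude they have no use for it, and those that expect too much from it and have little tolerance for a poor return on investment."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" "}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Search for news about chaos engineering in practice and the associated keywords are “big companies”, “production environment”, and “out of control”. It looks like a spectacle reserved for big companies, and not a particularly safe one. The article “"},{"type":"link","attrs":{"href":"https:\/\/xie.infoq.cn\/article\/f95435b7e834638fd6e97a2f7","title":null,"type":null},"content":[{"type":"text","marks":[{"type":"underline"}],"text":"Five Common Misconceptions About Chaos Engineering"}]},{"type":"text","text":"”, published on the InfoQ site in April, has stayed on the weekly trending list week after week. Chaos engineering has been around for 11 years, so why does an article about basic understanding still draw this much attention?"}]},{"type":"heading","attrs":{"align":null,"level":1},"content":[{"type":"text","text":"Changing Perceptions Is the First Step"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"“Whether or not to adopt chaos engineering comes down to one core question: with or without it, failures will come when they come, regardless of differences in technology stack or business sensitivity. To respond more calmly when a real failure hits, you should try similar destructive experiments ahead of time. Moving from hiding the illness for fear of the cure to small, controlled experiments, and then to open, large-scale experiments in production, confidence builds gradually. The key is to take the first step and keep going. Of course, the reliability and ease of use of the chaos platform itself also matter a great deal along the way,” said Qiu Peng (邱朋), head of the O&M platform development team at the Operations Assurance Center of Huatai Securities' Information Technology Department."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" "}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Perception is the first hurdle chaos engineering must clear to enter an enterprise."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" "}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"What matters most when looking at chaos engineering? In Qiu Peng's view, it is effectiveness: when evaluating and planning to adopt chaos engineering, first ask how you position it, what you plan to invest in it, and what you expect to gain. Chaos engineering is an excellent practice for proactively strengthening system stability, but it is not a cure-all. Its results also depend heavily on how well it is understood, how much is invested, and how actively "},{"type":"link","attrs":{"href":"https:\/\/www.infoq.cn\/article\/J5ojibDFjTKi6RIoeND4","title":"xxx","type":null},"content":[{"type":"text","text":"SRE"}]},{"type":"text","text":" engineers cooperate and take part."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" "}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"“The change in perception starts with accepting that stability risks cannot be fully uncovered and eliminated through up-front design and development or through after-the-fact operations and analysis alone. You have to hold to the idea of ‘from production, back to production’: repeatedly run controlled experiments against the production environment, and verify how the system behaves in scenarios that could actually occur and how effective the operators' response is. Only then are the system and the emergency response capability tested under real conditions. The financial industry in particular favors steady-state operation, so a shift in the understanding and mindset of the people in charge is critical.”"}]},{"type":"heading","attrs":{"align":null,"level":1},"content":[{"type":"text","text":"Hard but Necessary: Drills at Scale"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" "}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"In Huatai Securities' experience, another major difficulty in practicing chaos engineering is running drills at scale. During an early pilot, the platform's usability and stability are still relatively immature, which can be offset by targeted coaching and hands-on support. Once the pilot shows results and a wide rollout is planned, platform problems surface in bulk just when there are not enough people to support them. Negative feedback piles up, and both platform users and platform developers can easily slide into frustration or an adversarial mood. Before a real rollout at scale, the ground needs to be prepared through incubation and warm-up."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" "}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"According to Qiu Peng, Huatai Securities improved ease of use by building integrated observability for fault drills, one-click drills, a drill scenario library, a drill knowledge base, and automated reporting. In the first half of 2021 it ran Operation Defend the Bottom (保衛波特姆行動), which began with close, hands-on coaching for the market data and account systems to build confidence through pilots, then gradually expanded to wealth management, "},{"type":"link","attrs":{"href":"https:\/\/www.infoq.cn\/article\/GVfeBvKH*7VM4u2ZTKRD","title":"xxx","type":null},"content":[{"type":"text","text":"trading"}]},{"type":"text","text":", and more than 300 other core business systems, whose owners ran drills on a self-service basis. At the peak, 272 self-service drills ran in a single day. No business loss was caused by chaos engineering during the operation, and nearly a hundred optimization points were found."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" "}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Because of the nature of the industry, Huatai Securities has to pay particular attention to certain issues when practicing chaos engineering. Given the securities industry's demand for highly stable operation, experiments cannot start directly in production, especially for trading scenarios. A more workable path is to begin in a test or simulation environment, where the blast radius can be controlled and experiments stopped quickly. Once pilots on a small number of business systems in the test environment have run their course and the chaos engineering platform is largely stable, experiments move into production on non-core systems with good reliability. After confidence has built up step by step, a dedicated campaign rolls the practice out at scale, backed by automated, integrated "},{"type":"link","attrs":{"href":"https:\/\/xie.infoq.cn\/article\/7e582a007ff0556385fdb15a1","title":"xxx","type":null},"content":[{"type":"text","text":"observability tooling"}]},{"type":"text","text":", visualization of the drill process, and automatic generation and evaluation of drill reports. This greatly reduces the effort SREs must invest and minimizes resistance to adoption."}]},{"type":"heading","attrs":{"align":null,"level":1},"content":[{"type":"text","text":"Chaos Engineering Is One Link in the Operations Assurance Toolchain"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" "}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"In other words, chaos engineering is one link in the overall operations assurance toolchain, not an isolated platform."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" "}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"“As soon as the business stalls, we can detect it immediately and handle it in time. In a stall, subsequent requests can still be processed, but response times are long; there are basically no timeouts, things are just sluggish. Or the system's processes and ports are normal, yet responses partly or entirely time out, there is no reply for a long time, or timeouts trigger retries. So we have business stalls, business unresponsiveness, complete business failure, and full-link failure. In our summary, every failure falls into one of these categories,” said Wang Shuai (王帥), senior stability engineering expert at Huatai Securities."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" "}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Huatai Securities' stability architecture comprises four functional areas: drill management, fault drills, drill automation, and drill evaluation. The Operation Defend the Bottom campaign mentioned above, run in the first half of 2021, takes its name from “波特姆”, a transliteration of “Bottom”: Huatai Securities keeps probing the bottom line of system operation to uncover technical risks. By building fault drill models and a fault matrix, linking drills with operations, and replaying historical failures, it achieves systematic, blanket drill coverage."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" "}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Measured against the four stages of chaos engineering sophistication (elementary, simple, advanced, sophisticated), Qiu Peng sees a small number of large internet companies already near the front worldwide, some of which have open-sourced part of their chaos engineering capability. Most enterprises are at the simple stage: they use tooling to inject faults in a self-service way, get feedback through manual observation and manual collation of results, and in some cases can compare experiment groups. Measured by adoption (in the shadows, investment, adoption, cultural expectation), the vast majority of enterprises are still at the “in the shadows” stage: chaos engineering is not used on important projects, covers only a few systems, has little visibility within the organization, and is run only occasionally by early adopters."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" "}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Is there a general-purpose solution in the industry today? Qiu Peng told InfoQ: “Fault construction capability, especially at the compute resource level, is already fairly generic. Some chaos engineering technology has been open-sourced and commercial high-availability offerings are available (such as AHAS and "},{"type":"link","attrs":{"href":"https:\/\/xie.infoq.cn\/article\/4228b61343890cee954541ca6","title":"xxx","type":null},"content":[{"type":"text","text":"ChaosBlade"}]},{"type":"text","text":"), which can be considered for integration or direct use. For business-level fault analysis and construction, and for building integrated, intelligent chaos engineering, each company's approach may differ and progress varies. Capabilities such as fault construction for data loss and corruption, fault drills under live traffic load, and fault drills that integrate monitoring and response are ones Huatai has planned and built around its own characteristics.”"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"blockquote","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"About the interviewee:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Qiu Peng (邱朋) is head of the O&M platform development team at the Operations Assurance Center of Huatai Securities' Information Technology Department. He has 12 years of software development and operations experience across the telecom operator, internet, and securities industries, with extensive experience building operations systems and delivering platforms. He currently focuses on building intelligent, integrated operations assurance platforms for the securities and financial industry and on the digital transformation of SRE technical operations."}]}]},{"type":"horizontalrule"},{"type":"heading","attrs":{"align":"center","level":4},"content":[{"type":"text","text":"Scan the QR code below to join the prize quiz"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":"center","origin":null},"content":[{"type":"text","text":"Take part in China's first chaos engineering survey report"}]},{"type":"image","attrs":{"src":"https:\/\/static001.infoq.cn\/resource\/image\/78\/b4\/78ca90e0bd194c1745abb38a57f3c4b4.jpeg","alt":null,"title":"","style":[{"key":"width","value":"25%"},{"key":"bordertype","value":"none"}],"href":"","fromPaste":false,"pastePass":false}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"To build a full picture of chaos engineering in China, the China Academy of Information and Communications Technology (CAICT) and the Chaos Engineering Lab have launched a questionnaire for the “China Chaos Engineering Survey Report”. It explores the current state of system stability in China and the usage, industry adoption, technical maturity, and future trends of chaos engineering, with the aim of spreading the concept of chaos engineering in China, improving the stability of domestic systems, and advancing software quality."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"The questionnaire is jointly initiated by CAICT, the Chaos Engineering Lab, InfoQ, VCEC, and the China Cloud Native Community. Participants have a chance to win gifts such as laptop bags and T-shirts. Scan the QR code above to open the questionnaire."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Members of the Chaos Engineering Lab include:"}]},{"type":"image","attrs":{"src":"https:\/\/static001.infoq.cn\/resource\/image\/05\/38\/059b75d168a41a99d5aeef0a116ee838.png","alt":null,"title":"","style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":"","fromPaste"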
:false,"pastePass":false}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}}]}