Chaos Mesh Helps Apache APISIX Improve Stability

{"type":"doc","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Apache APISIX is a top-level project of the Apache Software Foundation and has been proven in production at tens of billions of requests per day. As the community grows, Apache APISIX gains more features and interacts with more external components, and the resulting uncertainty grows exponentially. The community has also received problem reports from users; here are two examples.","attrs":{}}]},{"type":"heading","attrs":{"align":null,"level":4},"content":[{"type":"text","text":"Scenario 1","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"When unexpectedly high network latency occurs between Apache APISIX and etcd, its configuration center, can Apache APISIX still filter and forward traffic normally?","attrs":{}}]},{"type":"heading","attrs":{"align":null,"level":4},"content":[{"type":"text","text":"Scenario 2","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"A user reported in an issue that when one node in an etcd cluster fails while the cluster as a whole keeps working, requests to the Apache APISIX Admin API return errors.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Although Apache APISIX covers most scenarios in CI with unit, e2e, and fuzz tests, these tests do not yet cover interactions with external components. When hard-to-predict failures occur, such as network jitter, disk failure, or a killed process, can Apache APISIX return a suitable error message, and can it stay in, or recover by itself to, a normal running state? To cover the scenarios raised by users, and to find similar problems proactively before they reach production, the community decided after discussion to test with Chaos Mesh, the open-source chaos engineering platform from PingCAP.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Chaos engineering is a method of experimenting on a system's infrastructure to proactively find its weak points, so as to ensure that the system can withstand out-of-control conditions in production. It was first proposed by Netflix, to simulate and thereby withstand the instability of early cloud services. As the technology has evolved, today's chaos engineering platforms offer many more kinds of faults to inject, and Kubernetes makes it easier to control the blast radius. These are all important reasons why Apache APISIX chose Chaos Mesh. As an open-source community, however, Apache APISIX knows that only an active community can keep software stable and iterating quickly, and that is what makes Chaos Mesh even more attractive.","attrs":{}}]},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"How to apply chaos engineering to APISIX","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Beyond simply injecting faults, chaos engineering has gradually developed into a complete methodology. Following the recommendations of the Principles of Chaos Engineering, deploying a chaos engineering experiment takes five steps:","attrs":{}}]},{"type":"numberedlist","attrs":{"start":null,"normalizeStart":1},"content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":1,"align":null,"origin":null},"content":[{"type":"text","text":"Define the steady state: find a quantifiable metric that demonstrates normal operation.","attrs":{}}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":2,"align":null,"origin":null},"content":[{"type":"text","text":"Form a hypothesis: assume that the metric stays in the steady state in both the experimental group and the control group.","attrs":{}}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":3,"align":null,"origin":null},"content":[{"type":"text","text":"Design the experiment: introduce faults that may occur during operation.","attrs":{}}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":4,"align":null,"origin":null},"content":[{"type":"text","text":"Test the hypothesis: try to disprove it by comparing the results of the experimental group and the control group.","attrs":{}}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":5,"align":null,"origin":null},"content":[{"type":"text","text":"Fix the problems.","attrs":{}}]}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Next, taking the two user-reported scenarios above as examples, we walk through how Apache APISIX applies chaos engineering following these five steps.","attrs":{}}]},{"type":"heading","attrs":{"align":null,"level":4},"content":[{"type":"text","text":"Scenario 1","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/13/13cac25727ef5ba63f71c39026641fba.png","alt":"Image","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"The figure above describes this scenario. Following the five steps, we first need a quantifiable metric that measures whether Apache APISIX is running normally. During testing, the main approach is to monitor Apache APISIX runtime metrics with Grafana; once a measurable metric is found, CI can pull the data from Prometheus directly for comparison. Here we use the Requests per Second (RPS) of route forwarding and the connectivity of etcd as the evaluation metrics. The other part is log analysis, which for Apache APISIX means checking the Nginx error.log for errors and whether those errors are the expected ones.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"In the control group, that is, before introducing chaos, we run the experiment and check that both set route and get route succeed and that etcd is reachable, and we record the RPS at that point. We then use network chaos to add 5s of network latency and run the experiment again. This time set route fails, get route succeeds, etcd cannot be reached, and the RPS shows no significant change from before. The experiment matches expectations.","attrs":{}}]},{"type":"heading","attrs":{"align":null,"level":4},"content":[{"type":"text","text":"Scenario 2","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/a3/a32ac45309f5d0fab0a115184c83ad45.png","alt":"Image","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"After running the same control-group experiment, we introduced pod-kill chaos and reproduced the expected error. When a minority of etcd nodes in the cluster are randomly deleted, etcd connectivity comes and goes, and the log prints a large number of connection refused errors. More interestingly, setting a route returns normally when the first or third node in the etcd endpoint list is deleted, and only when the second node in the list is deleted does setting a route fail with “connection refused”.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Investigation showed that the etcd Lua API used by Apache APISIX selects endpoints sequentially rather than randomly, so operations performed with a newly created etcd client were effectively bound to a single etcd endpoint, which led to persistent failures. After fixing this problem, we also added a health check to the etcd Lua API so that it does not keep making repeated attempts against a disconnected etcd node, and added a fallback check for the case where the etcd cluster is completely disconnected, so that a flood of errors does not overwhelm the log.","attrs":{}}]},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"Future plans","attrs":{}}]},{"type":"numberedlist","attrs":{"start":null,"normalizeStart":1},"content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":1,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"Run chaos tests on scenarios simulated with e2e tests","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":2,"align":null,"origin":null},"content":[{"type":"text","text":"At present, Apache APISIX still relies mainly on people to identify likely weak points in the system, test them, and fix them. Unlike Netflix, mentioned earlier, which applies chaos engineering inside an enterprise, an open-source community tests in CI. There is no need to worry about the blast radius affecting production, but CI also cannot cover the complex and comprehensive scenarios found in production.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":3,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":4,"align":null,"origin":null},"content":[{"type":"text","text":"To cover more scenarios, the community plans to use the existing e2e tests to simulate more complete scenarios and run chaos tests that are broader in scope and more random.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":5,"align":null,"origin":null}}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":6,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"Add chaos testing to more Apache APISIX projects","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":7,"align":null,"origin":null},"content":[{"type":"text","text":"Besides finding more potential weak points in Apache APISIX, the community also plans to add chaos testing to more projects, such as Apache APISIX Dashboard and Apache APISIX Ingress Controller.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":8,"align":null,"origin":null}}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":9,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"Add features to Chaos Mesh","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":10,"align":null,"origin":null},"content":[{"type":"text","text":"While deploying Chaos Mesh we ran into some features that are not yet supported, including selecting a service as the target of network latency and specifying a container port for network chaos injection. The Apache APISIX community will help Chaos Mesh add these features in the future. We hope open-source communities everywhere keep getting better.","attrs":{}}]}]}]},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"About Apache APISIX","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Apache APISIX is a dynamic, real-time, high-performance open-source API gateway that provides rich traffic management features such as load balancing, dynamic upstream, canary release, circuit breaking, authentication, and observability. Apache APISIX helps enterprises handle API and microservice traffic quickly and securely, in roles including gateway, Kubernetes Ingress, and service mesh.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Hundreds of enterprises worldwide use Apache APISIX to handle business-critical traffic, across finance, internet, manufacturing, retail, telecom operators, and more, including the US National Aeronautics and Space Administration (NASA), the European Union's Digital Factory, TravelSky, China Mobile, Tencent, Huawei, Weibo, NetEase, Beike, 360, Taikang, and Nayuki.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"More than 200 contributors have together built Apache APISIX, the world's most active open-source gateway project. Developers, come join this active and diverse community and help bring more good things to the world!","attrs":{}}]},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Apache APISIX project repository: ","attrs":{}},{"type":"link","attrs":{"href":"https://github.com/apache/apisix","title":null,"type":null},"content":[{"type":"text","text":"https://github.com/apache/apisix","attrs":{}}]}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Apache APISIX official website: ","attrs":{}},{"type":"link","attrs":{"href":"http://apisix.apache.org/zh/","title":null,"type":null},"content":[{"type":"text","text":"http://apisix.apache.org/zh/","attrs":{}}]}]}]}],"attrs":{}}]}