NetEase Shufan Cloud-Native Fault Diagnosis System: Practice and Reflections

{"type":"doc","content":[
{"type":"blockquote","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Kubernetes is a production-grade container orchestration engine, but it is still a complex system whose faults are costly to diagnose. Over the past few years, the Qingzhou cloud-native team at NetEase Shufan has accumulated a good deal of production experience in stability assurance. We distilled that experience into the Qingzhou cloud-native fault diagnosis system, which helps products assess cluster stability and gives users optimization advice. This article shares our stability assurance practices at different stages of business adoption, along with our thinking on productizing cluster stability assurance, in the hope that readers find it useful.","attrs":{}}]}],"attrs":{}},
{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"Early containerization rollout (second half of 2018 to first half of 2019)","attrs":{}}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"In the second half of 2018, some NetEase businesses began containerizing their applications and rolling them out to production. Usage in this period was far from cloud native; many users treated containers as virtual machines. Our team's main responsibilities at the time were:","attrs":{}}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Keeping internal Kubernetes clusters stable.","attrs":{}}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Solving the problems business users ran into while adopting containers.","attrs":{}}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Solving the problems of Qingzhou's early productization.","attrs":{}}]}]}],"attrs":{}},
{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"Problems we faced early on","attrs":{}}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Our problems in this period fell into two groups: issues in Kubernetes, Docker, and the operating system, and issues caused by users using the platform in unsuitable ways.","attrs":{}}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Issues in Kubernetes, Docker, and the operating system were hard to avoid this early in the rollout. Internally we mainly ran Kubernetes ","attrs":{}},{"type":"codeinline","content":[{"type":"text","text":"1.11","attrs":{}}],"attrs":{}},{"type":"text","text":" and Docker ","attrs":{}},{"type":"codeinline","content":[{"type":"text","text":"18.06","attrs":{}}],"attrs":{}},{"type":"text","text":"; the community trackers still hold many Kubernetes and Docker issues from that period. The operating systems we maintained internally were Debian 9 and Debian 10, but some influential businesses had hard operating system requirements, so the nodes in their clusters ran CentOS 7. The Cgroups and Systemd implementations shipped with CentOS 7's 3.10 kernel are full of pitfalls for container workloads. We worked around these issues by tuning kernel parameters and by maintaining internal versions of Kubernetes and Docker with our own patches. Take the classic Kernel Memory Accounting leak as an example: we disabled the related logic in Kubelet and Docker, made CentOS 7.7 the minimum supported version, and fixed the ","attrs":{}},{"type":"codeinline","content":[{"type":"text","text":"cgroup.memory=nokmem","attrs":{}}],"attrs":{}},{"type":"text","text":" option in the boot parameters to avoid the problem.","attrs":{}}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Unsuitable usage also caused many problems. Early users did not understand how Kubernetes is designed to manage workloads, and under cost pressure from their business units many of them moved their virtual machine habits straight onto containers to containerize quickly. Some of these non-standard practices could be corrected with mechanisms Kubernetes provides; the more serious ones triggered Docker and kernel bugs in particular scenarios and hurt cluster stability. We helped users adopt containers smoothly by analyzing their failures and proposing fixes. For example, we had them use probes instead of traditional health check methods, which avoids running large numbers of Exec commands inside containers and the process reaping problems those cause when the container terminates.","attrs":{}}]},
{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"Early reflections","attrs":{}}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Between the second half of 2018 and the first half of 2019, many NetEase businesses completed large-scale containerization in production and quickly benefited from cloud-native technology in resource management and cost control. Our team gained valuable experience along the way and also took on some technical debt:","attrs":{}}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"We underestimated the importance of cloud-native evangelism: we spent a great deal of time helping users fix problems caused by poor practices, many of which they could have solved on their own by reading the documentation. But users had little motivation to learn the new technology, and the number of colleagues new to Kubernetes kept growing, so the problems never seemed to end.","attrs":{}}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Team responsibilities were not clearly divided: many early problems appeared only after containerization, and most people were unfamiliar with cloud-native technology, so many problems ended up with our team.","attrs":{}}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"A cloud-native rollout that was not very cloud native: under cost pressure, businesses had to containerize quickly, yet business users lacked experience with the technology. Against that background we built some middle-layer components to help users adopt quickly, and made some compromises that were not very cloud native.","attrs":{}}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"No standard for delivering clusters: some users ran CentOS 7, and even after we recognized the risk of running Docker on it we did not steer users' choice of operating system. We also lacked control over operating system installation and kernel parameter settings.","attrs":{}}]}]}],"attrs":{}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"This technical debt cannot be judged simply as right or wrong. It was hard to avoid given the circumstances at the time, and it gave us good direction for the work that followed.","attrs":{}}]},
{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"Business cloud-native adoption (first half of 2019 to second half of 2020)","attrs":{}}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Because the early businesses benefited from large-scale containerization, many more began using Kubernetes to orchestrate their applications, and the number of clusters we managed grew from a dozen or so to nearly a hundred. As Kubernetes spread, more people inside the company started learning cloud-native technology and more Operators were deployed into clusters. Our team's responsibilities became more varied and complex:","attrs":{}}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Mitigating stability risks in larger clusters.","attrs":{}}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Standardizing how Qingzhou product teams use Kubernetes extension points.","attrs":{}}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Managing and maintaining cluster versions.","attrs":{}}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Solving the problems business users ran into while going cloud native.","attrs":{}}]}]}],"attrs":{}},
{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"Common problems in our work","attrs":{}}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Some clusters had more than 4000 clients connected to the APIServer, including user scripts that ran full LIST requests on Pod resources to fetch data. The APIServer in these clusters consumed nearly 100G of memory and 50 CPU cores, and NIC traffic on the APIServer nodes reached 15G. We addressed this by analyzing the clients by business type, identifying those used unreasonably, and optimizing them. For example, a component running as a DaemonSet was originally built with ","attrs":{}},{"type":"codeinline","content":[{"type":"text","text":"kube-builder","attrs":{}}],"attrs":{}},{"type":"text","text":" and watched all Node resources, even though it only needed changes to the Node it ran on. We rewrote the client with the ","attrs":{}},{"type":"codeinline","content":[{"type":"text","text":"client-go","attrs":{}}],"attrs":{}},{"type":"text","text":" library so that it watches only its own Node, which avoided the capacity problem. We also briefed the Qingzhou teams on what to watch for when implementing APIServer clients, using this as an opportunity to improve the stability of the product as a whole.","attrs":{}}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Some Qingzhou features are implemented as Admission Webhook extensions, but the commercial versions running in many already delivered Qingzhou clusters had no timeout mechanism for Webhook Servers. Some Webhook Servers would hang in specific scenarios and never return, which seriously affected cluster stability. We cherry-picked the upstream Webhook features into the commercial version and pushed to standardize use of the Admission Control extension point, removing many unreasonable designs and abuses from our products.","attrs":{}}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"At first we did not manage many Kubernetes clusters, so the management cost was under control. As the user base grew, we had more and more Kubernetes clusters to maintain across a widening range of versions, covering several releases between ","attrs":{}},{"type":"codeinline","content":[{"type":"text","text":"1.11","attrs":{}}],"attrs":{}},{"type":"text","text":" and ","attrs":{}},{"type":"codeinline","content":[{"type":"text","text":"1.17","attrs":{}}],"attrs":{}},{"type":"text","text":". Early on we advised users to take nodes offline and bring them back online when upgrading, but users who had only just containerized could not tolerate rebuilding their applications, so we committed to node hot-upgrade schemes, and those schemes greatly increased our management cost. To address this we borrowed Red Hat's strategy for maintaining a commercial operating system. Having designated our internally maintained ","attrs":{}},{"type":"codeinline","content":[{"type":"text","text":"1.17","attrs":{}}],"attrs":{}},{"type":"text","text":" release as the commercial Kubernetes version, we limited internal version maintenance to the following: maintaining a commercial version by merging upstream bug fixes and the features users need; standardizing the upgrade path from one commercial version to the next; defining a compatibility matrix between Qingzhou components and Kubernetes versions to reduce software management cost; and developing and launching a meta-cluster Operator to automate cluster management. This made specific teams and individuals clearly responsible for cluster version management and operations, and reduced both the management cost and the potential risk of our Kubernetes clusters.","attrs":{}}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"As more users orchestrated applications with Kubernetes, the range of problems we had to help with widened, including helping users diagnose their business applications in a cloud-native environment. For example, the Java diagnosis methods users relied on before containerization are hard to use in a cloud-native environment, so we developed JVM memory diagnosis and management features and integrated them into the product, making it easy for users to diagnose Java applications in production.","attrs":{}}]},
{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"Reflections from the growth period","attrs":{}}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Between the first half of 2019 and the second half of 2020, the number of clusters we managed grew to nearly a hundred. Our work in this period focused on cluster operations and management and on helping user businesses go cloud native. While helping users, we found more and more problems that needed to be solved at the level of mechanism:","attrs":{}}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Because Kubernetes and business applications are closely coupled, we need a clear boundary between cluster stability assurance and application stability assurance, together with an effective assessment model. The unclear boundary of responsibility increased delivery cost and uncertainty.","attrs":{}}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"As cloud-native technology matures, users are gradually seeing the benefits of standardization, but we still have a lot of work to do in helping commercial users solve real problems.","attrs":{}}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Stability assurance work has long had a foundation inside NetEase, built by multiple teams. Using the move to cloud native to get those teams working together and to integrate their past experience into a commercial product is a new challenge.","attrs":{}}]}]}],"attrs":{}},
{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"Cloud-native commercialization (second half of 2020 to present)","attrs":{}}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Since the second half of 2020, most of NetEase's internet businesses have used Kubernetes to orchestrate applications in production, and our focus has shifted to offering the capabilities we built internally to external customers through the Qingzhou hybrid cloud product.","attrs":{}}]},
{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"Problems to solve in commercialization","attrs":{}}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Commercial scenarios raise more complex and varied problems than managing multiple clusters inside the company. For cluster stability assurance, some problems can be identified accurately from alert messages alone, but diagnosing many problems in commercial scenarios places heavy demands on how automated the infrastructure is:","attrs":{}}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"In a cloud-native environment, users need effective ways to troubleshoot their business applications. For example, a user needs to analyze heap usage when an alert fires rather than wait to investigate until after an OOM has occurred.","attrs":{}}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Diagnosing some problems requires collecting information that is time-sensitive and expensive to gather. Automating these diagnosis and analysis workflows greatly improves the product's after-sales capability.","attrs":{}}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"APM and monitoring address part of the observability problem, but experience tells us that the root cause of many user business problems is found at the system level, for example jitter in the infrastructure layer or unsuitable system settings. We need to combine data from the user's or Qingzhou's APM with system-level data to build a fault diagnosis system that works in practice.","attrs":{}}]}]}],"attrs":{}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"How to make the Kubernetes cluster stability assurance system more automated, and how to use cloud-native standardization to integrate the assurance capabilities that already exist across several technical domains, are the questions our team has focused on.","attrs":{}}]},
{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"Design of the NetEase Qingzhou cloud-native fault diagnosis system","attrs":{}}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"To address these problems, our team designed and implemented the Qingzhou cloud-native fault diagnosis system, which tackles the difficulty of preserving the fault scene, the high cost of after-sales technical support, and the difficulty of assessing product stability. When a fault occurs, an after-sales technical support process goes roughly as follows:","attrs":{}}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"numberedlist","attrs":{"start":1,"normalizeStart":1},"content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":1,"align":null,"origin":null},"content":[{"type":"text","text":"After the fault occurs, the user does some basic troubleshooting and finds the problem hard to locate.","attrs":{}}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":2,"align":null,"origin":null},"content":[{"type":"text","text":"The user contacts an after-sales technical support colleague to report the fault.","attrs":{}}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":3,"align":null,"origin":null},"content":[{"type":"text","text":"Technical support guides the user in collecting information, following the FAQ document provided by R&D.","attrs":{}}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":4,"align":null,"origin":null},"content":[{"type":"text","text":"Technical support answers the question based on the FAQ document.","attrs":{}}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":5,"align":null,"origin":null},"content":[{"type":"text","text":"The problem is more complex, so technical support brings in R&D colleagues.","attrs":{}}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":6,"align":null,"origin":null},"content":[{"type":"text","text":"R&D talks with technical support and the user to obtain useful context.","attrs":{}}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":7,"align":null,"origin":null},"content":[{"type":"text","text":"R&D obtains access to the user's environment and diagnoses the fault. (Some user environments have strict security restrictions, so R&D has to diagnose the fault on site.)","attrs":{}}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":8,"align":null,"origin":null},"content":[{"type":"text","text":"R&D finds the cause of the fault and writes up the conclusion and solution. (Sometimes the problem is very complex or the fault scene was not preserved, and R&D cannot locate the problem.)","attrs":{}}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":9,"align":null,"origin":null},"content":[{"type":"text","text":"Technical support communicates with the user and delivers the solution. (When the fault cannot be located, the user has to be persuaded.)","attrs":{}}]}]}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"This process involves three roles: the user, technical support, and R&D. The work each role does, and roughly what it costs, is as follows:","attrs":{}}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"User","attrs":{}}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":1,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Discovers the fault in the environment. (Low cost; usually an alert fires or the business shows a problem.)","attrs":{}}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":1,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Contacts after-sales technical support. (Low cost; usually by WeChat or phone.)","attrs":{}}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":1,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Collects basic information and performs basic diagnosis under technical support's guidance. (Medium cost; the user is relatively unfamiliar with the technology, so the work is slow and error-prone.)","attrs":{}}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Technical support","attrs":{}}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":1,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Guides the user in collecting information using the FAQ document. (Medium cost; the user has to be guided through relatively unfamiliar operations.)","attrs":{}}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":1,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Answers the question based on the FAQ document. (Low cost; technical support usually has some technical background and experience.)","attrs":{}}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":1,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Communicates with the user and delivers the solution. (Medium cost, especially when the fault cannot be located and the user has to be persuaded.)","attrs":{}}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"R&D","attrs":{}}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":1,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Communicates to obtain useful context. (Medium cost; R&D lacks experience dealing with users and users lack technical experience, so communication is difficult.)","attrs":{}}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":1,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Obtains access to the user's environment and diagnoses the fault. (High cost; diagnosis takes a lot of effort and sometimes needs R&D from several teams.)","attrs":{}}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":1,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Diagnoses the fault on site in special cases. (High cost; R&D travel is an extra expense.)","attrs":{}}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":1,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Writes up the conclusion and solution. (Low cost; R&D usually only needs to provide a document with the conclusion and solution.)","attrs":{}}]}]}],"attrs":{}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Most of the cost in this process falls on technical support and R&D. As the service provider, if we can build a system for this service model that raises the automation of the whole process and cuts the overhead of the medium- and high-cost tasks above, it will greatly strengthen Qingzhou's commercial offering.","attrs":{}}]},
{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"Design and implementation","attrs":{}}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"By defining the Operation, OperationSet, Trigger, and Diagnosis objects, we abstracted the problems that each role handles in the stability assurance process. The system consists of a Master and an Agent, and obtains observability data from components such as APIServer, Prometheus, and Elasticsearch to trigger a diagnosis. The deployment architecture is shown below:","attrs":{}}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/50/507c27499f2bf9bbe71f21fe99af1a30.png","alt":null,"title":"Deployment architecture of the NetEase Qingzhou cloud-native fault diagnosis system","style":[{"key":"width","value":"100%"},{"key":"bordertype","value":"none"}],"href":"","fromPaste":false,"pastePass":false}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"heading","attrs":{"align":null,"level":4},"content":[{"type":"text","text":"The Operation object","attrs":{}}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"An Operation describes a diagnosis operation and how it is registered with the fault diagnosis system. A diagnosis operation that collects Golang profiling data can be registered with the following Operation:","attrs":{}}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"codeblock","attrs":{"lang":"yaml"},"content":[{"type":"text","text":"apiVersion: diagnosis.netease.com/v1\nkind: Operation\nmetadata:\n  annotations:\n    description: This operation manages actions to profile go programs.\n    maintainer: APM Team\n  name: go-profiler\nspec:\n  processor: # Registers the server that handles diagnosis requests. If the server's IP and Port are not defined, the processor built into the diagnosis system Agent is used.\n    path: /processor/goprofiler # The diagnosis system Agent requests this path to trigger the diagnosis.\n    scheme: http # The diagnosis system Agent sends HTTP requests to this server.\n    timeoutSeconds: 60 # The diagnosis system Agent waits up to 60 seconds for this server to return the diagnosis result.\n","attrs":{}}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"The backend of an Operation is an HTTP server that implements the diagnosis logic, and different diagnosis operations are maintained by different teams. The Operation object mainly solves the following problems:","attrs":{}}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"It standardizes the interface through which diagnosis operations are integrated into the product; an Operation only needs to handle JSON data in a standard format.","attrs":{}}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"It draws responsibility boundaries for fault diagnosis work across teams; each team can implement specialized diagnosis logic for the problem scenarios it is responsible for.","attrs":{}}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"SREs and technical support do not need to understand the internal implementation of a diagnosis operation in order to manage it.","attrs":{}}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Existing infrastructure-layer assurance capabilities can be brought into Kubernetes environments at low migration cost.","attrs":{}}]}]}],"attrs":{}},
{"type":"heading","attrs":{"align":null,"level":4},"content":[{"type":"text","text":"The OperationSet object","attrs":{}}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"An OperationSet defines a fault diagnosis workflow and contains a directed acyclic graph that represents the state machine of the diagnosis process. A workflow that collects Dockerd and Containerd information can be expressed with the following OperationSet:","attrs":{}}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"codeblock","attrs":{"lang":"yaml"},"content":[{"type":"text","text":"apiVersion: diagnosis.netease.com/v1\nkind: OperationSet\nmetadata:\n  annotations:\n    description: This operation set collects debugging information for dockerd and containerd.\n    maintainer: Kubernetes Team\n  name: docker-debugger\nspec:\n  adjacencyList: # Directed acyclic graph of the diagnosis workflow.\n  - id: 0 # The first vertex marks the start of the diagnosis and contains no operation.\n    to:\n    - 1\n  - id: 1 # The second vertex collects Docker metadata.\n    operation: docker-info-collector\n    to:\n    - 2\n  - id: 2 # The third vertex collects dockerd goroutines.\n    operation: dockerd-goroutine-collector\n    to:\n    - 3\n  - id: 3 # The fourth vertex collects containerd goroutines.\n    operation: containerd-goroutine-collector\n    to:\n    - 4\n  - id: 4 # The fifth vertex marks the node as unschedulable.\n    operation: node-cordon\nstatus:\n  paths: # All diagnosis paths in the directed acyclic graph; the diagnosis system Agent runs them in order.\n  - - id: 1\n      operation: docker-info-collector\n    - id: 2\n      operation: dockerd-goroutine-collector\n    - id: 3\n      operation: containerd-goroutine-collector\n    - id: 4\n      operation: node-cordon\n  ready: true # Whether the controller has generated an up-to-date .status.paths from .spec.adjacencyList.\n","attrs":{}}]},
{"type":"heading","attrs":{"align":null,"level":4},"content":[{"type":"text","text":"The Trigger object","attrs":{}}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"A Trigger describes how a diagnosis is triggered by an external message source. A complex fault diagnosis is usually triggered by an alert, and the alert may come from a monitoring system, an APM system, or logs. The following Trigger uses the KubeletPlegDurationHigh alert to start the workflow that collects Dockerd and Containerd information:","attrs":{}}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"codeblock","attrs":{"lang":"yaml"},"content":[{"type":"text","text":"apiVersion: diagnosis.netease.com/v1\nkind: Trigger\nmetadata:\n  annotations:\n    description: This trigger collects debugging information for dockerd and containerd on alert KubeletPlegDurationHigh firing.\n    maintainer: Kubernetes Team\n  name: kubelet-pleg-duration-high\nspec:\n  operationSet: docker-debugger # Run the workflow defined in docker-debugger once triggered.\n  sourceTemplate: # Source template used to create the diagnosis.\n    prometheusAlertTemplate: # Create the diagnosis from a Prometheus alert.\n      regexp: # Regular expressions matching the Prometheus alert that triggers the diagnosis.\n        alertName: KubeletPlegDurationHigh # The triggering Prometheus alert is KubeletPlegDurationHigh.\n      nodeNameReferenceLabel: node # The value of the node label on the Prometheus alert is the name of the node that runs the diagnosis.\n","attrs":{}}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"By defining an OperationSet for the diagnosis process and a Trigger that starts the diagnosis when the problem appears, R&D automates several of the medium- and high-cost tasks, which strengthens the overall after-sales capability of Qingzhou products:","attrs":{}}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Users no longer need to collect basic information and perform basic diagnosis under technical support's guidance.","attrs":{}}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"R&D no longer needs to go back and forth to obtain useful context.","attrs":{}}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"In most cases R&D can skip steps such as obtaining access to the user's environment and diagnosing the fault there.","attrs":{}}]}]}],"attrs":{}},
{"type":"heading","attrs":{"align":null,"level":4},"content":[{"type":"text","text":"The Diagnosis object","attrs":{}}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"A Diagnosis is the API object used to manage a single diagnosis, and it contains the running state of the diagnosis workflow. A Diagnosis that collects Dockerd and Containerd information looks like this:","attrs":{}}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"codeblock","attrs":{"lang":"yaml"},"content":[{"type":"text","text":"apiVersion: diagnosis.netease.com/v1\nkind: Diagnosis\nmetadata:\n  annotations:\n    trigger: kubelet-pleg-duration-high\n    operationSet: docker-debugger\n  name: kubelet-pleg-duration-high.fc76dbd98\n  namespace: default\nspec:\n  nodeName: pri3-k8s1210.jd.163.org # Node that runs the diagnosis; set from the Trigger's .spec.sourceTemplate.prometheusAlertTemplate.nodeNameReferenceLabel.\n  operationSet: docker-debugger # Workflow to run; set from the Trigger's .spec.operationSet.\nstatus:\n  checkpoint: # Checkpoint of the operation the diagnosis is currently executing, consistent with the OperationSet's .status.paths.\n    nodeIndex: 4\n    pathIndex: 0\n  conditions: # Conditions of the diagnosis.\n  - lastTransitionTime: \"2021-04-27T07:52:24Z\"\n    message: Diagnosis is accepted by agent on node pri3-k8s1210.jd.163.org\n    reason: DiagnosisAccepted\n    status: \"True\"\n    type: Accepted\n  - lastTransitionTime: \"2021-04-27T07:52:27Z\"\n    message: Diagnosis is completed\n    reason: DiagnosisComplete\n    status: \"True\"\n    type: Complete\n  operationResults: # Results of the diagnosis run.\n    \"1\": # Result of the Docker metadata request.\n      operation: docker-info-collector\n      result: '{\"ID\":\"LJM3:UWWT:L6L3:J6RJ:QRB2:NPMT:FXNC:WA6A:S2AN:JNKV:XE6V:HL7C\",\"Containers\":167,\"ContainersRunning\":88,\"ContainersPaused\":0,\"ContainersStopped\":79,\"Images\":80,\"Driver\":\"overlay2\",\"DriverStatus\":[[\"Backing Filesystem\",\"\\u003cunknown\\u003e\"],[\"Supports d_type\",\"true\"],[\"Native Overlay Diff\",\"true\"]],\"SystemStatus\":null,\"Plugins\":{\"Volume\":[\"local\"],\"Network\":[\"bridge\",\"host\",\"ipvlan\",\"macvlan\",\"null\",\"overlay\"],\"Authorization\":null,\"Log\":[\"awslogs\",\"fluentd\",\"gcplogs\",\"gelf\",\"journald\",\"json-file\",\"local\",\"logentries\",\"splunk\",\"syslog\"]},\"MemoryLimit\":true,\"SwapLimit\":false,\"KernelMemory\":true,\"CpuCfsPeriod\":true,\"CpuCfsQuota\":true,\"CPUShares\":true,\"CPUSet\":true,\"IPv4Forwarding\":true,\"BridgeNfIptables\":true,\"BridgeNfIp6tables\":true,\"Debug\":false,\"NFd\":497,\"OomKillDisable\":true,\"NGoroutines\":392,\"SystemTime\":\"2021-04-27T16:29:29.283405124+08:00\",\"LoggingDriver\":\"json-file\",\"CgroupDriver\":\"cgroupfs\",\"NEventsListener\":0,\"KernelVersion\":\"4.15.0-142-generic\",\"OperatingSystem\":\"Ubuntu 18.04.3 LTS\",\"OSType\":\"linux\",\"Architecture\":\"x86_64\",\"IndexServerAddress\":\"https://index.docker.io/v1/\",\"RegistryConfig\":{\"AllowNondistributableArtifactsCIDRs\":[],\"AllowNondistributableArtifactsHostnames\":[],\"InsecureRegistryCIDRs\":[\"127.0.0.0/8\"],\"IndexConfigs\":{\"docker.io\":{\"Name\":\"docker.io\",\"Mirrors\":[\"https://docker.mirrors.ustc.edu.cn/\"],\"Secure\":true,\"Official\":true}},\"Mirrors\":[\"https://docker.mirrors.ustc.edu.cn/\"]},\"NCPU\":4,\"MemTotal\":11645624320,\"GenericResources\":null,\"DockerRootDir\":\"/data\",\"HttpProxy\":\"\",\"HttpsProxy\":\"\",\"NoProxy\":\"\",\"Name\":\"pri3-k8s1210.jd.163.org\",\"Labels\":[],\"ExperimentalBuild\":false,\"ServerVersion\":\"19.03.8\",\"ClusterStore\":\"\",\"ClusterAdvertise\":\"\",\"Runtimes\":{\"runc\":{\"path\":\"runc\"}},\"DefaultRuntime\":\"runc\",\"Swarm\":{\"NodeID\":\"\",\"NodeAddr\":\"\",\"LocalNodeState\":\"inactive\",\"ControlAvailable\":false,\"Error\":\"\",\"RemoteManagers\":null},\"LiveRestoreEnabled\":false,\"Isolation\":\"\",\"InitBinary\":\"docker-init\",\"ContainerdCommit\":{\"ID\":\"7ad184331fa3e55e52b890ea95e65ba581ae3429\",\"Expected\":\"7ad184331fa3e55e52b890ea95e65ba581ae3429\"},\"RuncCommit\":{\"ID\":\"dc9208a3303feef5b3839f4323d9beb36df0a9dd\",\"Expected\":\"dc9208a3303feef5b3839f4323d9beb36df0a9dd\"},\"InitCommit\":{\"ID\":\"fec3683\",\"Expected\":\"fec3683\"},\"SecurityOptions\":[\"name=apparmor\",\"name=seccomp,profile=default\"],\"Warnings\":[\"WARNING: No swap limit support\"]}'\n    \"2\": # File server address of the dockerd goroutine dump.\n      operation: dockerd-goroutine-collector\n      result: '10.180.156.129:30100/dockerd-goroutine/pri3-k8s1210.jd.163.org/goroutine-stacks-2021-04-27T155225+0800.log'\n    \"3\": # File server address of the containerd goroutine dump.\n      operation: containerd-goroutine-collector\n      result: '10.180.156.129:30100/containerd-goroutine/pri3-k8s1210.jd.163.org/containerd-goroutine-2021-04-27T155225+0800.log'\n    \"4\": # Result of marking the node as unschedulable.\n      operation: node-cordon\n      result: 'node/pri3-k8s1210.jd.163.org cordoned'\n  phase: Succeeded # Current phase of the diagnosis.\n  startTime: \"2021-04-27T07:52:24Z\"\n  succeededPath: # Diagnosis path that ran successfully.\n  - id: 1\n    operation: docker-info-collector\n  - id: 2\n    operation: dockerd-goroutine-collector\n  - id: 3\n    operation: containerd-goroutine-collector\n  - id: 4\n    operation: node-cordon\n","attrs":{}}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/89/898e6bc8ba677ec2690c66ba798eaab5.png","alt":null,"title":"Work to be completed in one after-sales technical support process","style":[{"key":"width","value":"100%"},{"key":"bordertype","value":"none"}],"href":"","fromPaste":false,"pastePass":false}},
{"type":"heading","attrs":{"align":null,"level":4},"content":[{"type":"text","text":"Master","attrs":{}}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"The Master manages the Operation, OperationSet, Trigger, and Diagnosis objects. When an OperationSet is created, the Master validates it and builds a directed acyclic graph from the user's definition; all diagnosis paths are then written to the OperationSet's status.","attrs":{}}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"The Master checks whether the PodReference or NodeName of a Diagnosis exists. If a Diagnosis defines only a PodReference, the Master computes the NodeName from it and updates the Diagnosis. The Master also queries the status of the OperationSet referenced by the Diagnosis, and marks the Diagnosis as failed if that OperationSet is abnormal. A Diagnosis can be created manually by a user, or automatically by configuring a Prometheus, Event, or Elasticsearch message template. The Master consists of the following parts:","attrs":{}}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"GraphBuilder builds the directed acyclic graph from the vertices defined in an OperationSet and computes all diagnosis paths.","attrs":{}}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Alertmanager receives Prometheus alerts and creates Diagnosis objects from the templates defined in Triggers.","attrs":{}}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Eventer receives Kubernetes Events and creates Diagnosis objects from the templates defined in Triggers.","attrs":{}}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"ElasticAlerting receives Elasticsearch alerts and creates Diagnosis objects from the templates defined in Triggers.","attrs":{}}]}]}],"attrs":{}},
{"type":"heading","attrs":{"align":null,"level":4},"content":[{"type":"text","text":"Agent","attrs":{}}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"The Agent performs the actual diagnosis and has several common diagnosis operations built in. When a Diagnosis is created, the Agent runs the diagnosis workflow of the OperationSet that the Diagnosis references; a diagnosis workflow is a collection of diagnosis operations. The Agent consists of the following parts:","attrs":{}}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Executor runs diagnosis workflows. The status of the OperationSet referenced by a Diagnosis contains the directed acyclic graph of the workflow and all of its diagnosis paths. A diagnosis path represents one line of investigation; running the diagnosis operation of the Operation at each vertex on a path investigates the problem. If every diagnosis operation on some path succeeds, the diagnosis is marked as succeeded. If all diagnosis paths fail, the diagnosis is marked as failed.","attrs":{}}]}]}],"attrs":{}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/b4/b46945ef757771327bfc719fd43d9450.png","alt":null,"title":"Sequence diagram of a diagnosis triggered by the KubeletPlegDurationHigh alert","style":[{"key":"width","value":"100%"},{"key":"bordertype","value":"none"}],"href":"","fromPaste":false,"pastePass":false}},
{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"Summary","attrs":{}}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"From the early containerization rollout to today's cloud-native commercialization, the core of our thinking has always been to solve the real pain points users meet in adopting cloud native. The NetEase Qingzhou cloud-native fault diagnosis system provides a framework for building a reliable stability assurance system into a product. Through the team's efforts we have become more efficient at helping users and reduced management cost, so that users truly benefit from cloud-native technology and we can go further in the future.","attrs":{}}]},
{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"About the author","attrs":{}}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Huang Jiuyuan is a senior development engineer at NetEase Shufan who focuses on cloud native and distributed systems. Huang has taken part in the large-scale containerization of NetEase Cloud Music, NetEase News, NetEase Yanxuan, Kaola, and other users, as well as the productization of the NetEase Qingzhou container platform, with a focus on cluster monitoring, intelligent operations, and maintenance of core Kubernetes and Docker components. Huang is currently responsible for the design, development, and commercialization of the NetEase Qingzhou cloud-native fault diagnosis system.","attrs":{}}]}
]}
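The GraphBuilder and Executor behaviour described in the Master and Agent sections can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the system's actual implementation: the names `enumerate_paths` and `run_diagnosis`, the dictionary form of the adjacency list, and the `run_operation` callback are all hypothetical. The sketch assumes that vertex 0 is the start vertex and carries no operation (as in the docker-debugger example), that a diagnosis path is any route from the start vertex to a vertex with no outgoing edges, and that paths are tried in order until one succeeds completely.

```python
from typing import Callable, Dict, List, Optional, Tuple

# Hypothetical in-memory form of .spec.adjacencyList: vertex id -> ids it points to.
AdjacencyList = Dict[int, List[int]]


def enumerate_paths(adjacency: AdjacencyList, start: int = 0) -> List[List[int]]:
    """List every path from `start` to a vertex without outgoing edges.

    The start vertex carries no operation, so it is left out of each path.
    Raises ValueError if a cycle is reachable from `start`.
    """
    paths: List[List[int]] = []

    def walk(vertex: int, trail: List[int], ancestors: frozenset) -> None:
        if vertex in ancestors:
            raise ValueError(f"cycle detected at vertex {vertex}")
        successors = adjacency.get(vertex, [])
        if not successors:
            if trail:  # ignore a lone start vertex with no operations
                paths.append(trail)
            return
        for nxt in successors:
            walk(nxt, trail + [nxt], ancestors | {vertex})

    walk(start, [], frozenset())
    return paths


def run_diagnosis(
    paths: List[List[int]],
    run_operation: Callable[[int], bool],
) -> Tuple[str, Optional[List[int]]]:
    """Try paths in order; succeed on the first path whose operations all pass."""
    for path in paths:
        # all() stops at the first failing operation on this path.
        if all(run_operation(vertex) for vertex in path):
            return "Succeeded", path
    return "Failed", None


if __name__ == "__main__":
    # Adjacency list of the docker-debugger OperationSet shown earlier.
    docker_debugger = {0: [1], 1: [2], 2: [3], 3: [4], 4: []}
    paths = enumerate_paths(docker_debugger)
    print(paths)
    print(run_diagnosis(paths, lambda vertex: True))
```

For the docker-debugger graph, `enumerate_paths` yields the single path through vertices 1 to 4, which matches the `.status.paths` of that OperationSet. A branching graph would yield several paths, and `run_diagnosis` would fall back to the next path when an operation on the current one fails.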