How to Professionally Monitor a Kubernetes Cluster?

Original

2021-06-13 07:03

Monitoring interface standard	APIService address	Use case
Resource Metrics	metrics.k8s.io	Mainly consumed by Kubernetes built-in components; usually provided by metrics-server.
Custom Metrics	custom.metrics.k8s.io	Mainly implemented by Prometheus; provides resource monitoring and custom monitoring.
External Metrics	external.metrics.k8s.io http:\/\/external.metrics.k8s.io\/	主要的實現爲雲廠商的 Provider，提供雲資源的監測指標。"}}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"Resource Metrics "},{"type":"text","text":"類對應接口 metrics.k8s.io，主要的實現就是 metrics-server，它提供資源的監測，比較常見的是節點級別、pod 級別、namespace 級別。這些指標可以通過 kubectl top 直接訪問獲取，或者通過 K8s controller 獲取，例如 HPA(Horizontal Pod Autoscaler)。系統架構以及訪問鏈路如下："}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/29\/29efb85e86ec8150187ab8a9696df5cf.png","alt":"圖片","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"Custom Metrics "},{"type":"text","text":"對應的 API 是 custom.metrics.k8s.io，主要的實現是 Prometheus。它提供的是資源監測和自定義監測，資源監測和上面的資源監測其實是有覆蓋關係的，而這個自定義監測指的是：比如應用上面想暴露一個類似像在線人數，或者說調用後面的這個數據庫的 MySQL 的慢查詢。這些其實都是可以在應用層做自己的定義的，然後並通過標準的 Prometheus 的 client，暴露出相應的 metrics，然後再被 Prometheus 進行採集。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"而這類的接口一旦採集上來也是可以通過類似像 custom.metrics.k8s.io 這樣一個接口的標準來進行數據消費的，也就是說現在如果以這種方式接入的 Prometheus，那你就可以通過 custom.metrics.k8s.io 這個接口來進行 
HPA，進行數據消費。系統架構以及訪問鏈路如下："}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/7a\/7aade4e0aaa5113650eeb8ca1d19c974.png","alt":"圖片","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"External Metrics 。因爲我們知道 K8s 現在已經成爲了雲原生接口的一個實現標準。很多時候在雲上打交道的是雲服務，比如說在一個應用裏面用到了前面的是消息隊列，後面的是 RBS 數據庫。那有時在進行數據消費的時候，同時需要去消費一些雲產品的監測指標，類似像消息隊列中消息的數目，或者是接入層 SLB 的 connection 數目，SLB 上層的 200 個請求數目等等，這些監測指標。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"那怎麼去消費呢？也是在 K8s 裏面實現了一個標準，就是 external.metrics.k8s.io。主要的實現廠商就是各個雲廠商的 provider，通過這個 provider 可以通過雲資源的監測指標。在阿里雲上面也實現了阿里巴巴 cloud metrics adapter 用來提供這個標準的 external.metrics.k8s.io 的一個實現。"}]},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"日誌（Logging）"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"概要來說包括："}]},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"主機內核的日誌。主機內核日誌可以協助開發者診斷例如：網絡棧異常，驅動異常，文件系統異常，影響節點（內核）穩定的異常。"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Runtime 日誌。最常見的運行時是 Docker，可以通過 Docker 的日誌排查例如刪除 Pod Hang 等問題。"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"K8s 組件日誌。APIServer 
日誌可以用來審計，Scheduler 日誌可以診斷調度，etcd 日誌可以查看存儲狀態，Ingress 日誌可以分析接入層流量。"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"應用日誌。可以通過應用日誌分析查看業務層的狀態，診斷異常。"}]}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"日誌的採集方式分爲被動採集和主動推送兩種，在 K8s 中，被動採集一般分爲 Sidecar 和 DaemonSet 兩種方式，主動推送有 DockerEngine 推送和業務直寫兩種方式。"}]},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"DockerEngine 本身具有 LogDriver 功能，可通過配置不同的 LogDriver 將容器的 stdout 通過 DockerEngine 寫入到遠端存儲，以此達到日誌採集的目的。這種方式的可定製化、靈活性、資源隔離性都很低，一般不建議在生產環境中使用；"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"業務直寫是在應用中集成日誌採集的 SDK，通過 SDK 直接將日誌發送到服務端。這種方式省去了落盤採集的邏輯，也不需要額外部署 Agent，對於系統的資源消耗最低，但由於業務和日誌 SDK 強綁定，整體靈活性很低，一般只有日誌量極大的場景中使用；"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"DaemonSet 方式在每個 node 節點上只運行一個日誌 agent，採集這個節點上所有的日誌。DaemonSet 相對資源佔用要小很多，但擴展性、租戶隔離性受限，比較適用於功能單一或業務不是很多的集羣；"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Sidecar 方式爲每個 POD 單獨部署日誌 agent，這個 agent 只負責一個業務應用的日誌採集。Sidecar 相對資源佔用較多，但靈活性以及多租戶隔離性較強，建議大型的 K8s 集羣或作爲 PaaS 
平臺爲多個業務方服務的集羣使用該方式。"}]}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"掛載宿主機採集、標準輸入輸出採集、Sidecar 採集。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/c0\/c039f04868e7c3e2d1b5c8ac480f27d0.png","alt":"圖片","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"總結下來："}]},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"DockerEngine 直寫一般不推薦；"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"業務直寫推薦在日誌量極大的場景中使用；"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"DaemonSet 一般在中小型集羣中使用；"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Sidecar 推薦在超大型的集羣中使用。"}]}]}]},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"事件（Event）"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"事件監測是適用於 Kubernetes 
場景的一種監測方式。事件包含了發生的時間、組件、等級（Normal、Warning）、類型、詳細信息，通過事件我們能夠知道應用的部署、調度、運行、停止等整個生命週期，也能通過事件去了解系統中正在發生的一些異常。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"K8s 中的一個設計理念，就是基於狀態機的一個狀態轉換。從正常的狀態轉換成另一個正常的狀態的時候，會發生一個 Normal 的事件，而從一個正常狀態轉換成一個異常狀態的時候，會發生一個 Warning 的事件。通常情況下，Warning 的事件是我們比較關心的。事件監測就是把 Normal 的事件或者是 Warning 事件匯聚到數據中心，然後通過數據中心的分析以及報警，把相應的一些異常通過像釘釘、短信、郵件等方式進行暴露，實現與其他監測的補充與完善。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Kubernetes中的事件是存儲在 etcd 中，默認情況下只保存 1 個小時，無法實現較長週期範圍的分析。將事件進行長期存儲以及定製化開發後，可以實現更加豐富多樣的分析與告警："}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"對系統中的異常事件做實時告警，例如 Failed、Evicted、FailedMount、FailedScheduling 
等。"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"通常問題排查可能要去查找歷史數據，因此需要去查詢更長時間範圍的事件（幾天甚至幾個月）。"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"事件支持歸類統計，例如能夠計算事件發生的趨勢以及與上一時間段（昨天\/上週\/發佈前）對比，以便基於統計指標進行判斷和決策。"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"支持不同的人員按照各種維度去做過濾、篩選。"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"支持自定義的訂閱這些事件去做自定義的監測，以便和公司內部的部署運維平臺集成。"}]}]}]},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"NPD（Node Problem Detector）框架"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Kubernetes 集羣及其運行容器的穩定性，強依賴於節點的穩定性。Kubernetes 中的相關組件只關注容器管理相關的問題，對於硬件、操作系統、容器運行時、依賴系統（網絡、存儲等）並不會提供更多的檢測能力。NPD（Node Problem Detector）針對節點的穩定性提供了診斷檢查框架，在默認檢查策略的基礎上，可以靈活擴展檢查策略，可以將節點的異常轉換爲 Node 的事件，推送到 APIServer 中，由同一的 APIServer 進行事件管理。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":"br"}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"NPD 支持多種異常檢查，例如："}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"基礎服務問題：NTP 
服務未啓動"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"硬件問題：CPU、內存、磁盤、網卡損壞"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Kernel 問題：Kernel hang，文件系統損壞"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"容器運行時問題：Docker hang，Docker 無法啓動"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"資源問題：OOM 等"}]}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/af\/af7fbc0f41a559da588a41b9ab74cc29.png","alt":"圖片","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"綜上，本章節總結了常見的 Kubernetes 可觀測性方案。在生產環境，我們通常需要綜合使用各種方案，形成立體多維度、相互補充的可觀測性體系；可觀測性方案部署後，需要基於上述方案的輸出結果快速診斷異常和錯誤，有效降低誤報率，並有能力保存、回查以及分析歷史數據；進一步延伸，數據可以提供給機器學習以及 AI 框架，實現彈性預測、異常診斷分析、智能運維 AIOps 等高級應用場景。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"這需要可觀測性最佳實踐作爲基礎，包括如何設計、插件化部署、配置、升級上述各種可觀測性方案架構，如何基於輸出結果快速準確診斷分析跟因等等。阿里雲容器服務 ACK 以及相關雲產品（監測服務 ARMS、日誌服務 SLS 
等），將雲廠商的最佳實踐通過產品化能力實現、賦能用戶，提供了完善全面的解決方案，可以讓用戶快速部署、配置、升級、掌握阿里雲的可觀測性方案，顯著提升了企業上雲和雲原生化的效率和穩定性、降低技術門檻和綜合成本。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"下面將以 ACK 最新的產品形態 ACK Pro 爲例，結合相關雲產品，介紹 ACK 的可觀測性解決方案和最佳實踐。"}]},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"ACK可觀測性能力"}]},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"指標（Metrics）可觀測性方案"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"對於指標類可觀測性，ACK 可以支持開源 Prometheus 監測和阿里雲 Prometheus 監測（阿里雲 Prometheus 監測是 ARMS 產品子產品）兩種可觀測性方案。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"開源 Prometheus 監測，以 helm 包形式提供、適配阿里雲環境、集成了釘釘告警、存儲等功能；部署入口在控制檯的應用目錄中 ack-prometheus-operator，用戶配置後可以在 ACK 控制檯一鍵部署。用戶只需要在阿里雲 ACK 控制檯配置 helm 包參數，就可以定製化部署。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/40\/4059e2dc655c5de9e312243b7c41b164.png","alt":"圖片","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"阿里雲 Prometheus監測"},{"type":"text","text":"，是 ARMS 產品子產品。應用實時監測服務 (Application Real-Time Monitoring Service, 簡稱 ARMS) 是一款應用性能管理產品，包含前端監測，應用監測和 Prometheus 
監測三大子產品。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"在 2021 年的 Gartner 的 APM 魔力象限評測中，阿里雲應用實時監測服務（ARMS）作爲阿里雲 APM 的核心產品，聯合雲監測以及日誌服務共同參與。Gartner 評價阿里雲 APM："}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"中國影響力最強"},{"type":"text","text":"：阿里雲是中國最大的雲服務提供商，阿里雲用戶可以使用雲上監測工具來滿足其可觀測性需求。"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"開源集成"},{"type":"text","text":"：阿里雲非常重視將開源標準和產品（例如 Prometheus）集成到其平臺中。"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"成本優勢"},{"type":"text","text":"：與在阿里雲上使用第三方 APM 產品相比，阿里雲 APM 產品具有更高的成本效益。"}]}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"下圖概要對比了開源 Prometheus 和阿里雲 Prometheus 的模塊劃分和數據鏈路。"}]},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/45\/45af902184308d96dfce33e582b0b739.png","alt":"圖片","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"ACK 支持 CoreDNS、集羣節點、集羣概況等 K8s 
可觀測性能力；除此之外，ACK Pro 還支持託管的管控組件 Kube API Server、Kube Scheduler 和 Etcd 的可觀測性能力，並持續迭代。用戶可以通過在阿里雲 Prometheus 中豐富的監測大盤，結合告警能力，快速發現 K8s 集羣的系統問題以及潛在風險，及時採取相應措施以保障集羣穩定性。監測大盤集成了 ACK 最佳實踐的經驗，可以幫助用戶從多維度分析分析、定位問題。下面介紹如何基於最佳實踐設計可觀測性大盤，並列舉使用監測大盤定位問題的具體案例，幫助理解如何使用可觀測性能力。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"首先來看 ACK Pro 的可觀測性能力。監測大盤入口如下："}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/e8\/e8a34ae635f97aa145e50b7caa3628ae.png","alt":"圖片","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"APIServer 是 K8s 核心組件之一，是 K8s 組件進行交互的樞紐，ACK Pro APIServer 的監測大盤設計考慮到用戶可以選擇需要監測的 APIServer Pod 來分析單一指標、聚合指標以及請求來源等，同時可以下鑽到某一種或者多種 API 資源聯動觀測 APIServer 的指標，這樣的優勢是既可以全局觀測全部 APIServer Pod 的全局視圖，又可以下鑽觀測到具體 APIServer Pod 以及具體 API 資源的監測，監測全部和局部觀測能力，對於定位問題非常有效。所以根據 ACK 的最佳實踐，實現上包含了如下 5 個模塊："}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"提供 APIServer Pod、API 資源（Pods，Nodes，ConfigMaps 
等）、分位數（0.99，0.9，0.5）、統計時間間隔的篩選框，用戶通過控制篩選框，可以聯動控制監測大盤實現聯動"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"凸顯關鍵指標以便識別系統關鍵狀態"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"展示 APIServer RT、QPS 等單項指標的監測大盤，實現單一維度指標的觀測"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"展示 APIServer RT、QPS 等聚合指標的監測大盤，實現多維度指標的觀測"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"展示對 APIServer 訪問的客戶端來源分析，實現訪問源的分析"}]}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"下面概要介紹模塊的實現。"}]},{"type":"heading","attrs":{"align":null,"level":5},"content":[{"type":"text","text":"關鍵指標"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/ec\/ec61dc49156eafd7b4cab8251f4332b8.png","alt":"圖片","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"顯示了核心的指標，包括 APIServer 總 QPS、讀請求成功率、寫請求成功率、Read Inflight Request、Mutating Inflight Request 以及單位時間丟棄請求數量 Dropped Requests 
Rate。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"這些指標可以概要展示系統狀態是否正常，例如如果 Dropped Requests Rate 不爲 NA，說明 APIServer 因爲處理請求的能力不能滿足請求出現丟棄請求，需要立即定位處理。"}]},{"type":"heading","attrs":{"align":null,"level":5},"content":[{"type":"text","text":"Cluster-Level Summary"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/0b\/0b8a866d3ffc2ab3f004927225ecdd2d.png","alt":"圖片","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"包括讀非 LIST 讀請求 RT、LIST 讀請求 RT、寫請求 RT、讀請求 Inflight Request、修改請求 Inflight Request 以及單位時間丟棄請求數量，該部分大盤的實現結合了 ACK 最佳實踐經驗。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"對於響應時間的可觀測性，可以直觀的觀察到不同時間點以及區間內，針對不同資源、不同操作、不同範圍的響應時間。可以選擇不同的分位數，來篩選。有兩個比較重要的考察點："}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"numberedlist","attrs":{"start":null,"normalizeStart":1},"content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":1,"align":null,"origin":null},"content":[{"type":"text","text":"曲線是否連續"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":2,"align":null,"origin":null},"content":[{"type":"text","text":"RT 
時間"}]}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"先來解釋曲線的連續性。通過曲線的連續性，可以很直觀的看出請求是持續的請求，還是單一的請求。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"下圖表示在採樣週期內，APIServer 收到 PUT leases 的請求，每個採樣期內 P90 RT 是 45ms。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"因爲圖中曲線是連續，說明該請求在全部採樣週期都存在，所以是持續的請求。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/ad\/ad6a20becd6545292ea86355c94abf48.png","alt":"圖片","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"下圖表示在採樣週期內，APIServer 收到 LIST daemonsets 的請求，有樣值的採樣週期內 P90 RT 是 45ms。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"因爲圖中只有一次，說明該請求只是在一次採樣週期存在。該場景來自於用戶執行 kubectl get ds --all-namespaces 
產生的請求記錄。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/0b\/0bebc650695c836890dc2339e2f83c89.png","alt":"圖片","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"再來解釋曲線體現的 RT。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"用戶執行命令創建 1MB 的 configmap，請求連接到公網 SLB"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"kubectl create configmap cm1MB --from-file=cm1MB=.\/configmap.file"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"APIServer 記錄的日誌中，該次請求 POST configmaps RT 爲 9.740961791s，該值可以落入 apiserver_request_duration_seconds_bucket 的(8, 9]區間，所以會在 apiserver_request_duration_seconds_bucket 的 le=9 對應的 bucket 中增加一個樣點，可觀測性展示中按照 90 分位數，計算得到 9.9s 並圖形化展示。這就是日誌中記錄的請求真實RT與可觀測性展示中的展示 RT 的關聯關係。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"所以監測大盤既可以與日誌可觀測功能聯合使用，又可以直觀概要的以全局視圖展示日誌中的信息，最佳實踐建議結合監測大盤和日誌可觀測性做綜合分析。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"codeblock","attrs":{"lang":"text"},"content":[{"type":"text","text":"\nI0215 23:32:19.226433 1 trace.go:116] Trace[1528486772]: \"Create\" url:\/api\/v1\/namespaces\/default\/configmaps,user-agent:kubectl\/v1.18.8 (linux\/amd64) 
kubernetes\/d2f5a0f,client:39.X.X.1,request_id:a1724f0b-39f1-40da-b36c-e447933ef37e (started: 2021-02-15 23:32:09.485986411 +0800 CST m=+114176.845042584) (total time: 9.740403082s):\nTrace[1528486772]: [9.647465583s] [9.647465583s] About to convert to expected version\nTrace[1528486772]: [9.660554709s] [13.089126ms] Conversion done\nTrace[1528486772]: [9.660561026s] [6.317µs] About to store object in database\nTrace[1528486772]: [9.687076754s] [26.515728ms] Object stored in database\nTrace[1528486772]: [9.740403082s] [53.326328ms] END\nI0215 23:32:19.226568 1 httplog.go:102] requestID=a1724f0b-39f1-40da-b36c-e447933ef37e verb=POST URI=\/api\/v1\/namespaces\/default\/configmaps latency=9.740961791s resp=201 UserAgent=kubectl\/v1.18.8 (linux\/amd64) kubernetes\/d2f5a0f srcIP=\"10.x.x.10:59256\" ContentType=application\/json:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/d6\/d631faaa2682b25454a96f5cc582ad03.png","alt":"圖片","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"下面解釋一下RT與請求的具體內容以及集羣規模有直接的關聯。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"在上述創建 configmap 的例子中，同樣是創建 1MB 的 configmap，公網鏈路受網路帶寬和時延影響，達到了 9s；而在內網鏈路的測試中，只需要 145ms，網絡因素的影響是顯著的。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"所以 RT 與請求操作的資源對象、字節尺寸、網絡等有關聯關係，網絡越慢，字節尺寸越大，RT 
越大。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"對於大規模 K8s 集羣，全量 LIST（例如 pods，nodes 等資源）的數據量有時候會很大，導致傳輸數據量增加，也會導致 RT 增加。所以對於 RT 指標，沒有絕對的健康閾值，一定需要結合具體的請求操作、集羣規模、網絡帶寬來綜合評定，如果不影響業務就可以接受。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"對於小規模 K8s 集羣，平均 RT 45ms 到 100ms 是可以接受的；對於節點規模上 100 的集羣，平均 RT 100ms 到 200ms 是可以接受的。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"但是如果 RT 持續達到秒級，甚至 RT 達到 60s 導致請求超時，多數情況下出現了異常，需要進一步定位處理是否符合預期。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"這兩個指標通過 APIServer \/metrics 對外透出，可以執行如下命令查看 inflight requests，是衡量 APIServer 處理併發請求能力的指標。如果請求併發請求過多達到 APIServer 參數 max-requests-inflight和 max-mutating-requests-inflight 指定的閾值，就會觸發 APIServer 限流。通常這是異常情況，需要快速定位並處理。"}]},{"type":"heading","attrs":{"align":null,"level":5},"content":[{"type":"text","text":"QPS & Latency"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/c6\/c661ca784d665fd8ea3486b7d4800223.png","alt":"圖片","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"該部分可以直觀顯示請求 QPS 以及 RT 按照 Verb、API 
資源進行分類的情況，以便進行聚合分析。還可以展示讀、寫請求的錯誤碼分類，可以直觀發現不同時間點下請求返回的錯誤碼類型。"}]},{"type":"heading","attrs":{"align":null,"level":5},"content":[{"type":"text","text":"Client Summary"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/3d\/3dee7c247a059f070e729e96ac28761b.png","alt":"圖片","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"該部分可以直觀顯示請求的客戶端以及操作和資源。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"QPS By Client "},{"type":"text","text":"可以按客戶端維度，統計不同客戶端的QPS值。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"QPS By Verb + Resource + Client "},{"type":"text","text":"可以按客戶端、Verb、Resource 維度，統計單位時間（1s）內的請求分佈情況。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"基於 ARMS Prometheus，除了 APIServer 大盤，ACK Pro 還提供了 Etcd 和 Kube Scheduler 的監測大盤；ACK 和 ACK Pro 還提供了 CoreDNS、K8s 集羣、K8s 節點、Ingress 等大盤，這裏不再一一介紹，用戶可以查看 ARMS 的大盤。這些大盤結合了 ACK 和 ARMS 的在生產環境的最佳實踐，可以幫助用戶以最短路徑觀測系統、發現問題根源、提高運維效率。"}]},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"日誌（Logging）可觀測性方案"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"SLS 
阿里雲日誌服務是阿里雲標準的日誌方案，對接各種類型的日誌存儲。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"對於託管側組件的日誌，ACK 支持託管集羣控制平面組件（kube-apiserver\/kube-controller-manager\/kube-scheduler）日誌透出，將日誌從 ACK 控制層採集到到用戶 SLS 日誌服務的 Log Project 中。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"對於用戶側日誌，用戶可以使用阿里雲的 logtail、log-pilot 技術方案將需要的容器、系統、節點日誌收集到 SLS 的 logstore，隨後就可以在 SLS 中方便的查看日誌。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/6f\/6f2c09099149abd22062f032d17d37ee.png","alt":"圖片","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/32\/32ecc08b3a2f5638a2b597784365afe5.png","alt":"圖片","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"事件（Event）可觀測性方案 + NPD 可觀測性方案"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Kubernetes 的架構設計基於狀態機，不同的狀態之間進行轉換則會生成相應的事件，正常的狀態之間轉換會生成 Normal 等級的事件，正常狀態與異常狀態之間的轉換會生成 Warning 等級的事件。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"ACK 提供開箱即用的容器場景事件監測方案，通過 ACK 維護的 NPD（node-problem-detector）以及包含在 NPD 中的 kube-eventer 
provides container event monitoring."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"NPD (node-problem-detector) is a Kubernetes node diagnostics tool. It converts node anomalies, such as a Docker Engine hang, a Linux kernel hang, outbound network failures, and file descriptor problems, into Node events; combined with kube-eventer, it closes the loop on node event alerting."}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"kube-eventer is an open-source Kubernetes event export tool maintained by ACK. It can ship cluster events to systems such as DingTalk, SLS, and EventBridge, and it offers filters at different severity levels to support real-time event collection, targeted alerting, and asynchronous archiving."}]}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"NPD detects node problems or failures based on its configuration and third-party plugins, and generates the corresponding cluster events. The Kubernetes cluster itself also produces events as cluster state changes, for example on abnormal conditions such as Pod eviction or image pull failure. The Kubernetes Event Center in Log Service (SLS) aggregates all Kubernetes events in real time and provides storage, query, analysis, visualization, and alerting."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.infoq.cn\/resource\/image\/26\/95\/26decc517a476ac4918fea32a0367a95.jpg","alt":null,"title":"","style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":"","fromPaste":false,"pastePass":false}},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"ACK observability outlook"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"With ACK and related cloud products, Kubernetes 
clusters already have comprehensive observability, covering metrics, logs, distributed tracing, and events. Future directions include:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Explore more application scenarios and tie them to observability to help users make better use of K8s. For example, monitor the memory\/CPU usage of the containers in a Pod over a period of time, use the historical data to assess whether the user's Kubernetes container resource requests\/limits are reasonable, and recommend requests\/limits when they are not; or monitor APIServer requests with excessive RT, then automatically analyze the cause of the abnormal requests and suggest remedies;"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Combine multiple observability approaches, for example K8s events and metrics monitoring, to provide richer, more multidimensional observability."}]}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"We believe ACK observability will keep broadening in scope and deliver ever greater technical and social value to customers!"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"About the author:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"佳旭, technical expert, Alibaba Cloud Container Service"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"This article is reposted from: 阿里巴巴中間件 (ID: Aliware_2018)"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Original link: "},{"type":"link","attrs":{"href":"https:\/\/mp.weixin.qq.com\/s\/uGZCd-rTyP0y_7pJhUVwfA","title":"xxx","type":null},"content":[{"type":"text","text":"如何專業化監測一個 Kubernetes 集羣？"}]}]}]}