如何专业化监测一个 Kubernetes 集群?
{"type":"doc","content":[{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"引言"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Kubernetes 在生产环境应用的普及度越来越广、复杂度越来越高,随之而来的稳定性保障挑战也越来越大。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"如何构建全面深入的可观测性架构和体系,是提升系统稳定性的关键之因素一。ACK将可观测性最佳实践进行沉淀,以阿里云产品功能的能力对用户透出,可观测性工具和服务成为基础设施,赋能并帮助用户使用产品功能,提升用户 Kubernetes 集群的稳定性保障和使用体验。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"本文会介绍 Kubernetes 可观测性系统的构建,以及基于阿里云云产品实现 Kubernetes 可观测系统构建的最佳实践。"}]},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"Kubernetes 系统的可观测性架构"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Kubernetes 系统对于可观测性方面的挑战包括:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"K8s 系统架构的复杂性。"},{"type":"text","text":"系统包括控制面和数据面,各自包含多个相互通信的组件,控制面和数据间之间通过 kube-apiserver 进行桥接聚合。"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"动态性。"},{"type":"text","text":"Pod、Service 等资源动态创建以及分配 IP,Pod 重建后也会分配新的资源和 IP,这就需要基于动态服务发现来获取监测对象。"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"微服务架构。"},{"type":"text","text":"应用按照微服务架构分解成多个组件,每个组件副本数可以根据弹性进行自动或者人工控制。"}]}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"针对 Kubernetes 系统可观测性的挑战,尤其在集群规模快速增长的情况下,高效可靠的 Kubernetes 系统可观测性能力,是系统稳定性保障的基石。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"那么,如何提升建设生产环境下的 Kubernetes 系统可观测性能力呢?"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Kubernetes 系统的可观测性方案包括指标、日志、链路追踪、K8s Event 事件、NPD 框架等方式。每种方式可以从不同维度透视 Kubernetes 系统的状态和数据。在生产环境,我们通常需要综合使用各种方式,有时候还要运用多种方式联动观测,形成完善立体的可观测性体系,提高对各种场景的覆盖度,进而提升 Kubernetes 系统的整体稳定性。下面会概述生产环境下对 K8s 系统的可观测性解决方案。"}]},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"指标(Metrics)"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Prometheus 是业界指标类数据采集方案的事实标准,是开源的系统监测和报警框架,灵感源自 Google 的 Borgmon 监测系统。2012 年,SoundCloud 的 Google 前员工创造了 Prometheus,并作为社区开源项目进行开发。2015 年,该项目正式发布。2016 年,Prometheus 加入 CNCF 云原生计算基金会。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Prometheus 具有以下特性:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"多维的数据模型(基于时间序列的 Key、Value 键值对)"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"灵活的查询和聚合语言 PromQL"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"提供本地存储和分布式存储"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"通过基于 HTTP 的 Pull 模型采集时间序列数据"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"可利用 Pushgateway(Prometheus 的可选中间件)实现 Push 模式"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"可通过动态服务发现或静态配置发现目标机器"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"支持多种图表和数据大盘"}]}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Prometheus 可以周期性采集组件暴露在 HTTP(s) 端点的\/metrics 下面的指标数据,并存储到 TSDB,实现基于 PromQL 的查询和聚合功能。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"对于 Kubernetes 场景下的指标,可以从如下角度分类:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"容器基础资源指标"}]}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"采集源为 kubelet 内置的 cAdvisor,提供容器内存、CPU、网络、文件系统等相关的指标,指标样例包括:"}]},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"容器当前内存使用字节数 container_memory_usage_bytes;"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"容器网络接收字节数 container_network_receive_bytes_total;"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"容器网络发送字节数 container_network_transmit_bytes_total,等等。"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"Kubernetes 节点资源指标"}]}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"采集源为 node_exporter,提供节点系统和硬件相关的指标,指标样例包括:节点总内存 node_memory_MemTotal_bytes,节点文件系统空间 node_filesystem_size_bytes,节点网络接口 ID node_network_iface_id,等等。基于该类指标,可以统计节点的 CPU\/内存\/磁盘使用率等节点级别指标。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"Kubernetes 资源指标"}]}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"采集源为 kube-state-metrics,基于 Kubernetes API 对象生成指标,提供 K8s 集群资源指标,例如 Node、ConfigMap、Deployment、DaemonSet 等类型。以 Node 类型指标为例,包括节点 Ready 状态指标 kube_node_status_condition、节点信息kube_node_info 等等。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"Kubernetes 组件指标"}]}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Kubernetes 系统组件指标。例如 kube-controller-manager, kube-apiserver,kube-scheduler, kubelet,kube-proxy、coredns 等。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Kubernetes 运维组件指标。可观测类包括 blackbox_operator, 实现对用户自定义的探活规则定义;gpu_exporter,实现对 GPU 资源的透出能力。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Kubernetes 业务应用指标。包括具体的业务 Pod在\/metrics 路径透出的指标,以便外部进行查询和聚合。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"除了上述指标,K8s 提供了通过 API 方式对外透出指标的监测接口标准,具体包括 Resource Metrics,Custom Metrics 和 External Metrics 三类。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"embedcomp","attrs":{"type":"table","data":{"content":"
發表評論
所有評論
還沒有人評論,想成為第一個評論的人麼? 請在上方評論欄輸入並且點擊發布.