Cloud-Native Automation Practice on the vivo AI Computing Platform

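{"type":"doc","content":[{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"1. Background"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"At the end of 2018, the vivo AI Research Institute began building an AI computing platform to address pain points such as providing a unified high-performance training environment, large-scale distributed training, and efficient scheduling and utilization of compute resources. After two years of continuous iteration, the platform has made great progress in construction and adoption and has become the core infrastructure platform for AI at vivo. It started out mainly serving deep learning training and has evolved into three modules, VTraining, VServing, and VContainer, which provide model training, model inference, and containerization capabilities. VContainer is the foundation of the computing platform: a container platform built on Kubernetes with core capabilities such as resource scheduling, elastic scaling, and workload co-location (零一混部). The VContainer clusters have over a thousand nodes and more than 100 PFLOPS of GPU compute. They simultaneously run over a thousand VTraining training jobs, over a hundred VServing inference services, and over a hundred online service projects. This article shares how we automated the cloud-native base components of VContainer, moving from semi-tooled manual maintenance to GUI-based (白屏化) workflows in practice."}]},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"2. Early risks and pitfalls"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"We started using rke to build k8s clusters at the end of 2018, which makes us early users of the rke project. Based on our experience, we divide k8s cluster construction and maintenance into three major steps: machine management, cluster management, and container network management. During implementation we faced some risks and fell into some pitfalls. In the early cluster-building stage, risk is hard to avoid and can appear at any step of a change. "},{"type":"text","marks":[{"type":"strong"}],"text":"But we should not fear risk, nor avoid making changes because risk exists. We should stay level-headed, respect risk, and put stability first"},{"type":"text","text":"."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Risk 1, multi-cluster scenarios: machine data lacked unified control, and a node of cluster A ended up being added to cluster B."}]}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Risk 2, cluster nodes being initialized: cluster maintenance has a standard process, but different operations in that process are done with different tools, and during initialization cluster nodes were mistakenly left in the initialization list."}]}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[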
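{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Risk 3, change configuration errors: across the three steps of cluster construction and maintenance, configuration items are repetitive and complex, and the change tools lacked validation, so configuration errors occurred, causing failures in underlying components and affecting business systems."}]}]}]},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"2.1 Machine management"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Machine management has two parts: data management and machine changes. The risks and pitfalls arise during machine changes. Early on we chose "},{"type":"link","attrs":{"href":"https:\/\/docs.ansible.com\/","title":"ansible","type":null},"content":[{"type":"text","text":"ansible"}]},{"type":"text","text":" as our batch operation tool. Three change operations are very frequent: machine initialization, machine cleanup, and other ad hoc batch operations. Initialization happens before a machine is added to a k8s cluster: it installs docker and GPU software, configures the environment, and so on. Likewise, machine cleanup uninstalls docker and wipes the related environment. "},{"type":"link","attrs":{"href":"https:\/\/docs.ansible.com\/ansible\/latest\/cli\/ansible-playbook.html","title":"ansible-playbook","type":null},"content":[{"type":"text","text":"ansible-playbook"}]},{"type":"text","text":" lets us define tasks that group the scripts for one type of operation. We created tasks for initialization and for cleanup, added the scripts for each series of operations under the corresponding tasks, and run the same command each time to carry out a change. One point needs special attention: running the tasks requires a configured machine list, as follows:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"codeblock","attrs":{"lang":"text"},"content":[{"type":"text","text":"192.168.219.10 ansible_ssh_pass=\"123456\"\n192.168.219.11 ansible_ssh_pass=\"123456\"\n192.168.219.12 ansible_ssh_pass=\"123456\"\n192.168.219.13 ansible_ssh_pass=\"123456\"\n\n[rke-prepare]\n192.168.219.10\n192.168.219.11\n192.168.219.12\n192.168.219.13"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"This carries operational risk: when steps are numerous and repetitive, or several people operate at once, the machine list can contain duplicates or omissions. We hit this pitfall:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Pitfall 1: an ansible initialization run mistakenly initialized worker or core nodes that were already in a cluster. Because the initialization machine list is configured by hand, concurrent operators or a forgotten update to the list led to cluster nodes being initialized. The consequences are severe: initializing an ordinary worker node affects business containers, and initializing a core node has a wider impact that can affect the availability of the whole cluster. The author once initialized a worker node that was running online services, leaving the node Not Ready and directly affecting the availability of production services. We later added a step to the ansible scripts that checks for k8s cluster nodes: if k8s-related components already exist on a machine, initialization is skipped."}]},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"2.2 Cluster management"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Cluster management has four core operations: cluster creation, scaling, updating, and disaster recovery. The "},{"type":"link","attrs":{"href":"https:\/\/rancher.com\/","title":"rancher","type":null},"content":[{"type":"text","text":"rancher"}]},{"type":"text","text":" open-source k8s cluster management project "},{"type":"link","attrs":{"href":"https:\/\/github.com\/rancher\/rke","title":"rke","type":null},"content":[{"type":"text","text":"rke"}]},{"type":"text","text":", mentioned earlier, meets our basic needs. Its "},{"type":"link","attrs":{"href":"https:\/\/www.rancher.cn\/products\/rke\/","title":"official introduction","type":null},"content":[{"type":"text","text":"official introduction"}]},{"type":"text","text":" says: RKE is a CNCF-certified open-source Kubernetes distribution that runs within Docker containers. It solves the most common installation complexity of Kubernetes by removing most host dependencies and providing a stable path for deployment, upgrades, and rollbacks. With RKE, Kubernetes becomes fully independent of the operating system and platform you are running, making automated Kubernetes operations easy. Like other cloud-native projects, rke is written in Go and is a command-line tool. It uses the configuration file "},{"type":"link","attrs":{"href":"https:\/\/rancher.com\/docs\/rke\/latest\/en\/example-yamls\/","title":"cluster.yml","type":null},"content":[{"type":"text","text":"cluster.yml"}]},{"type":"text","text":" to manage the k8s cluster and maintains cluster state in cluster.rkestate. The rkestate file is a k8s state file managed by the rke command line itself, and users need not pay much attention to it. cluster.yml is the configuration file with which users manage the k8s cluster: rke up changes the cluster according to this yml file, for example adding and removing nodes or updating versions. Below is an example cluster.yml:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"codeblock","attrs":{"lang":"text"},"content":[{"type":"text","text":"nodes:\n  - address: 1.2.3.4\n    user: ubuntu\n    role:\n      - controlplane\n      - etcd\n      - worker\nservices:\n  etcd:\n    image: rancher\/coreos-etcd:v3.3.10-rancher1\n  kube-api:\n    image: rancher\/hyperkube:v1.14.3-rancher1\n    extra_args: {}\n# ... ...\nnetwork:\n  plugin: calico\n  options:\n    calico_cloud_provider: none\naddons: \"\"\naddons_include: []\n# ... ..."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"The configuration shown is simplified. For details, see the "},{"type":"link","attrs":{"href":"https:\/\/docs.rancher.cn\/docs\/rke\/example-yamls\/_index","title":"YAML file examples","type":null},"content":[{"type":"text","text":"YAML file examples"}]},{"type":"text","text":". Although rke lets you manage a k8s cluster with a single yml configuration file, that file is complex and repetitive, and because we started with an early version of rke, we ran into some pitfalls:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Pitfall 1: when adding a worker node with rke, the node role was misconfigured as a core role (controlplane or etcd). In the etcd case, this causes a rolling restart of api-server, and services with in-flight requests to api-server are disconnected; services that retry are barely affected. Operations such as kubectl 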
logs and exec are interrupted and can simply be re-run. In the controlplane case, the worker nodes in the cluster restart kubelet and kube-proxy."}]}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Pitfall 2: with early versions of rke, every change prints a log saying that the certificates have changed and a forced update is required. This is a log-output bug in early versions and no cause for alarm. It has been fixed in newer versions: "},{"type":"link","attrs":{"href":"https:\/\/github.com\/rancher\/rke\/issues\/1405","title":null,"type":null},"content":[{"type":"text","text":"https:\/\/github.com\/rancher\/rke\/issues\/1405"}]}]}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Pitfall 3: also with early versions of rke up, using update-only still operates on all worker nodes. Occasionally a node would be unresponsive for a long time, blocking the whole change so that it could not complete. We added an ignore-hosts field so that rke up can skip specified machines."}]}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Pitfall 4: also with early versions of rke, during an etcd disaster recovery drill we found that rke etcd restore cleans up and then rebuilds every node in the k8s cluster. Our actual goal was to rebuild the etcd cluster quickly after it goes down, without changing the system components on worker and controlplane nodes or components such as calico and ingress-controller, since that would affect workloads. We therefore modified it so that, when restoring an etcd cluster, the rke etcd restore command by default performs only a few basic operations: cleaning the etcd nodes, rebuilding etcd, and rke 
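up."}]}]}]},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"2.3 Container network management"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"For container networking we use the calico plugin, and network flattening is accomplished by interacting with the network team's configuration system. In day-to-day maintenance we hit the following pitfalls:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Pitfall 1: ippool misconfiguration. When a new cluster had just been set up, the author filled the container network field with the host network segment while creating an ippool. As a result, strange routing rules were created on the physical hosts, and both the host network and the container network of the k8s cluster were affected to varying degrees. We later used ansible to delete the abnormal routes in batch."}]}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Pitfall 2: flattened-node cluster misconfiguration. A node of cluster A was configured onto an RR node of cluster B. Verification at the time showed that only the flat-network function of the misconfigured node was affected and other nodes were not. However, recovering from the misconfiguration was rather troublesome."}]}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"As infrastructure, the k8s clusters run a large number of online services and training jobs. During a cluster change, a small mistake can make services directly unavailable, so we must do everything we can to avoid risk and fill in every pothole within our power. Drawing on traditional operations management experience, k8s cluster operations also need automation. "},{"type":"text","marks":[{"type":"strong"}],"text":"Many people's first impression of automation is code, but the essence of automation is standards. Turning complex, repetitive, and scattered operations into standardized processes is the key to automation."}]},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"3. Automation design process"}]},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"3.1 Design approach"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"The goals of the first phase of automation are very clear: reduce manual operations, establish standardized processes, and improve operations efficiency; reduce operational risk and improve cluster stability. Based on our exploration and practice so far, the goals of the second phase are also becoming clear: design toward a high degree of automation and semi-intelligence, detecting, analyzing, and locating cluster problems and providing ways to recover quickly."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"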
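align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.infoq.cn\/resource\/image\/15\/69\/15873b066d4989430d037e81dae4f669.png","alt":null,"title":"","style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":"","fromPaste":false,"pastePass":false}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":"center","origin":null},"content":[{"type":"text","text":"Automation goals"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Given our automation goals and how we use the base components, we focus on the following design points:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"Multi-cluster"},{"type":"text","text":": manage multiple k8s clusters and all physical machine information. In the tooling stage, information for multiple clusters was scattered, so the first task of automation is to consolidate the data. This helps us work out standards for three aspects of automation, namely process, validation, and review, and design the functions to implement for each."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"Standard process"},{"type":"text","text":": standardize our repetitive and complex day-to-day cluster change operations, define standard change processes, and turn them into software and product features."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"Automatic validation"},{"type":"text","text":": identify the cases in cluster changes that require manual checks, design automated validation steps, and make these steps mandatory preconditions in the standard change process."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"Controllable process"},{"type":"text","text":": with a standard process defined, the whole change could be fully automated and completed in one go. But changes always carry risk and unknowns, so manual review and control points are designed before and after each step of the process."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"Minimal change principle"},{"type":"text","text":": because changes always carry risk, we want as few changes as possible, ideally only the necessary ones, and we avoid unrelated changes wherever we can. For example, with early rke up, scaling worker nodes still involved some operations on core nodes. To ensure the overall cluster state is healthy and available, rke up tries to validate and operate on all nodes in the cluster. From a stability standpoint, we want to change only the worker nodes concerned, without touching core nodes or any other nodes that do not need changing. Later rke up gained the ability to target changes by role (worker, controlplane, etcd), and we also customized it so that node scaling operates only on the nodes that need changing and leaves all other nodes untouched."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"The early automation work focuses on the 80% of routine cluster operations and delivers them through a GUI: machine management, cluster management, and container network management. Machine management includes syncing machine information from the CMDB, machine initialization, and machine environment cleanup. Cluster management includes node information visualization, adding and removing nodes, updating nodes, and managing rke configuration and state files. Container network management currently covers create, delete, query, and update operations for ippools and the calico network flattening process for k8s nodes."}]},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"3.2 Architecture design"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Following this design approach, below is a simple architecture diagram of our automation. The AutoRke automation platform is what we are building: at the lower layer it carries out changes to cloud-native base components such as k8s, calico, and docker, and at the upper layer it integrates with vivo's base platforms for data synchronization, process control, and other functions."}]},{"type":"image","attrs":{"src":"https:\/\/static001.infoq.cn\/resource\/image\/41\/e0\/41036b1d3b57d2bdf817927f2c8201e0.png","alt":null,"title":"","style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":"","fromPaste":false,"pastePass":false}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":"center","origin":null},"content":[{"type":"text","text":"Simple architecture of the automation practice"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"AutoRke"},{"type":"text","text":": a GUI platform that provides standard processes and integrates rke, ansible, 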
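and other command-line tools."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"Cloud-native base"},{"type":"text","text":": the objects managed by the AutoRke automation platform: k8s, calico, and docker. It installs and configures the docker environment on physical machines and deploys and manages k8s components through the docker API. It also manages calico's container network flattening configuration and the container address pools (ippools)."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"vivo base platforms: "},{"type":"text","text":"the key systems that AutoRke depends on to implement its automated processes. We sync machine information from the CMDB and use single sign-on to verify user permissions. VCalico completes the automated calico flattening configuration process through work orders. HIC is vivo's machine hardware management system and is being integrated into the k8s node fault-handling process to help us improve stability. The job platform is vivo's system for batch machine operations. It is connected to CMDB data, and we will use it for machine initialization and quick jobs."}]},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"4. Automation practice and rollout"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"The end product of automation is tooling and systems, and our goal is a GUI platform. It manages multiple k8s clusters and all physical machines, consolidates the scattered day-to-day k8s cluster change operations, and provides standard, controllable, and auditable GUI workflows for routine k8s cluster changes, improving change efficiency, reducing operational risk, and improving cluster stability."}]},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"4.1 Core technology"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Building and maintaining k8s clusters is the core work of automation, and as described earlier we use rke for it. rke deploys a k8s cluster with docker, starting the k8s components in containers. Unlike the docker command we usually use to manage container lifecycles, rke manages containers through the docker service's API. To manage the docker services of many hosts remotely in batch, rke builds a TCP connection object over ssh. When creating the "},{"type":"link","attrs":{"href":"#NewClient","title":null,"type":null},"content":[{"type":"text","marks":[{"type":"underline"}],"text":"docker client"}]},{"type":"text","text":" that operates a remote host's docker service, it uses this connection object to create the http client for the docker client and binds it to docker.sock. As shown in the figure below, rke builds a remote docker 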
client over an ssh connection and accesses the docker service through docker.sock. The bastion host stage is rke's design for meeting security requirements, where all physical machines can only be logged into over ssh via the bastion host."}]},{"type":"image","attrs":{"src":"https:\/\/static001.infoq.cn\/resource\/image\/b2\/74\/b29c0588bddab50c41a69556c7382774.png","alt":null,"title":"","style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":"","fromPaste":false,"pastePass":false}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":"center","origin":null},"content":[{"type":"text","text":"rke workflow"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"Certificate management"},{"type":"text","text":": there are two key points, certificate distribution and certificate rotation. Certificates are distributed during cluster initialization, changes to the master nodes, and changes to the etcd cluster, and distribution is done through containers. Certificate rotation reissues certificates when they are about to expire or have been leaked. It can be configured in cluster.yml or done with rke cert rotate."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"etcd cluster"},{"type":"text","text":": rke implements three important etcd operations: cluster creation, scaling, and data backup and restore. For cluster creation and scaling, configure the etcd nodes in cluster.yml and run rke 
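up to change the etcd cluster. Backup and restore covers routine backups of etcd cluster data and quickly restoring the data and the k8s cluster after a failure. In the figure above, the etcd node starts the etcd and snapshots containers. The other kubelet-related containers are components required by the worker role, which means an etcd node can also act as a worker node and run other services, though we do not recommend it. snapshots is the container that periodically backs up etcd cluster data. The etcd container is deployed as a static cluster, with the cluster size and node list configured at startup."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"Core nodes"},{"type":"text","text":": three core services must be deployed as containers: apiserver, scheduler, and controller. Service parameters are configured in cluster.yml and are read at startup and applied to the container run configuration. As the figure shows, core nodes have no nginx-proxy component. That component is a reverse proxy for connections to apiserver, and components on a core node connect to the local apiserver, so it is not needed. When rke changes the etcd cluster, the certificates and IP list for accessing etcd change, so the apiserver service on the core nodes must be restarted one after another to reload the etcd access configuration."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"Worker nodes"},{"type":"text","text":": when scaling the cluster's worker nodes, rke manages the container lifecycle (create, start, restart, delete, and query) of three k8s components on each node: kubelet, nginx-proxy, and kube-proxy. When rke changes core nodes, the configuration for accessing them changes, so the services on worker nodes likewise need to be restarted."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"Addon deployment"},{"type":"text","text":": addons are optional, and users can deploy them by other k8s deployment methods instead. rke divides addon deployment into three parts: cni, k8sAddons, and userAddons. Deployment uses a k8s client rather than connecting directly to the docker 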
API. Addons are mostly deployed to the k8s cluster as daemonsets and deployments."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"Cluster updates"},{"type":"text","text":": rke supports k8s cluster updates starting from version 1.0.0. cluster.yml gained parameters such as the maximum number of unavailable nodes per role and batch updates, but the requirements for updating are fairly strict:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"1. Only updates to an adjacent minor version or a patch version are supported"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"2. Workloads running in the k8s cluster must support health checks"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"3. k8s cluster role nodes, addon components, and business systems must be highly available"}]},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"4.2 Practice"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"We planned the work in two steps: customizing the rke tool and building the AutoRke GUI platform. Customization addresses the scenarios where rke did not behave as we expected in our early use, and the in-depth study of rke internals that it required also provides the technical basis for the GUI. The GUI phase turns changes to cloud-native components into a platform, defines standard processes, and lowers the barrier and risk of changes."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"RKE CLI customization"},{"type":"text","text":": on top of the native rke commands, we added two subcommands, calico and worker, responsible for calico container network management and k8s worker node scaling respectively. These two subcommands supported most of our k8s cluster operations work."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"1. 
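rke calico supports container network flattening configuration: add, delete, and expand. It also supports creating, deleting, querying, updating, enabling, and disabling ippools."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"2. rke worker: when scaling worker nodes, we want to make only the minimal change and not perform unnecessary operations. So the rke worker command changes only the nodes being added or removed and leaves all other nodes untouched."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"AutoRke GUI"},{"type":"text","text":": the CLI approach has shortcomings. It is only suited to one-off operations, cannot handle interactive scenarios, and the change process was not fully unified. We decided to move the CLI work into a GUI."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Current platform features center on k8s cluster management: machine management, cluster node management, network segment management, and configuration management. Machine management syncs physical machine information by pulling all machine records from the CMDB, which makes routine lookups easy and provides a data foundation for cluster changes and future stability work. Cluster node management handles routine changes to k8s worker nodes: scale-out, scale-in, updates, and container network configuration. Network segment management manages container network segments, that is, create, delete, query, and update for calico ippools. Configuration management provides version management for the cluster configuration and state files used by rke."}]},{"type":"image","attrs":{"src":"https:\/\/static001.infoq.cn\/resource\/image\/4e\/3d\/4ef7a47ba27040b4560367b5bcaf1e3d.png","alt":null,"title":"","style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":"","fromPaste":false,"pastePass":false}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":"center","origin":null},"content":[{"type":"text","text":"A brief look at the automation features: machine management"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"The figure above is the machine management page, which centrally manages machine information covering hardware, software, k8s, and networking. The machine operation drop-down on the far right currently supports three functions: adding, removing, and updating k8s cluster nodes. We briefly walk through adding nodes to a cluster. Because there are many steps, we describe the process in text:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"1. 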
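On the machine management page, select the machines to add to the cluster and create the add-machine configuration, choosing either the default or a custom configuration"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"2. Run add-machine validation to confirm whether the machines can be added to the k8s cluster"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"3. Initialize the target machines: install the required software and drivers and configure the docker runtime environment"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"4. Add the machines to the k8s cluster, sync machine labels and taints, and generate the calico network configuration"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"5. Create a calico flat-network work order by submitting a container network configuration work order to the VCalico system"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"6. Create the ippool container network segment, calling the calico SDK to configure the required segment information"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"7. Listen for the VCalico callback and update the node's container network labels"}]},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"4.3 Rollout status"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"It is used in the vivo 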
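AI computing platform across 4 k8s clusters and over a thousand physical machines. Once newly built clusters are onboarded, the number of physical machines will reach several thousand. The work followed two iteration phases: rke customization and the AutoRke GUI. Customization modified the native rke command-line tool to implement features that fit our scenarios. The AutoRke GUI brought those customized features and change processes into a GUI. Since it went live last December, more than 100 k8s cluster change work orders have been completed through it."}]},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"4.4 Improvements"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"We have also made some optimizations for pain points that came up in use:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"Retry on failure"},{"type":"text","text":": within a single node change process, some nodes may return a failure. The change process can now retry just the failed nodes, which improves the user experience and makes exception handling more efficient."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"Process splitting"},{"type":"text","text":": in calico flattening configuration, we need to interact with VCalico to complete work orders and callbacks. At first we intended the steps after work order submission to run automatically with no manual intervention. In fact, when VCalico sends its callback, the k8s administrator needs to confirm the ippool creation details. The work order for the container network request also needs manual review, rather than being submitted immediately after the configuration is generated automatically."}]},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"5. Next steps"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"The initial phase of automation delivered GUI workflows for the routine operations of cloud-native base components, improving efficiency, reducing operational risk, and improving the stability of the base components to some extent. Going forward, we hope to enrich the automation features and explore semi-intelligent directions, focusing on automation for the stability and availability of cloud-native base components."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"Inspection"},{"type":"text","text":": automatically detect problems and risk points in k8s clusters"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","a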
ttrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"Self-healing"},{"type":"text","text":": automatic analysis and localization of alerts and faults, plus fast recovery"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"Updates"},{"type":"text","text":": base component version updates, machine upgrade processes, and so on"}]}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"About the author:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"梁大钊 (Liang Dazhao) previously worked at Baidu, Venustech (启明星辰), and other companies, and is currently a senior engineer on the vivo AI computing platform team, working on scheduling, container networking, and automation for the platform."}]}]}
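The pitfalls in sections 2.1 and 2.3 and the automatic validation point in section 3.1 suggest simple pre-change checks. The Python sketch below is illustrative only and is not the AutoRke implementation: the function names, the input shapes, and the sample addresses are all assumptions made for this example. It flags initialization targets that are duplicated or already belong to a cluster, and ippool CIDRs that overlap a host network.

```python
import ipaddress
from collections import Counter


def check_init_targets(targets, cluster_nodes):
    """Check a list of machines selected for initialization.

    targets: list of IP strings to initialize (hypothetical input shape).
    cluster_nodes: dict of cluster name -> set of node IPs, e.g. synced
        from a CMDB (hypothetical input shape).
    Returns (duplicates, already_in_cluster).
    """
    counts = Counter(targets)
    # IPs listed more than once in the initialization list.
    duplicates = sorted(ip for ip, n in counts.items() if n > 1)
    # IPs that already belong to some cluster and must not be re-initialized.
    already_in_cluster = {}
    for ip in counts:
        for cluster, nodes in cluster_nodes.items():
            if ip in nodes:
                already_in_cluster[ip] = cluster
    return duplicates, already_in_cluster


def ippool_conflicts(pool_cidr, host_cidrs):
    """Return the host networks that overlap a proposed container ippool."""
    pool = ipaddress.ip_network(pool_cidr)
    return [c for c in host_cidrs if pool.overlaps(ipaddress.ip_network(c))]


if __name__ == "__main__":
    dups, in_cluster = check_init_targets(
        ["192.168.219.10", "192.168.219.11", "192.168.219.10"],
        {"cluster-a": {"192.168.219.11"}},
    )
    print("duplicates:", dups)
    print("already in a cluster:", in_cluster)
    # A pool that reuses the host segment is reported as a conflict.
    print("ippool conflicts:", ippool_conflicts("192.168.219.0/24", ["192.168.0.0/16"]))
```

A change workflow could run checks like these as a mandatory precondition and stop for manual review whenever either result is non-empty, in line with the controllable-process point in section 3.1.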