Zuoyebang: Adopting and Optimizing Kubernetes Serverless for Large-Scale Job Workloads

{"type":"doc","content":[{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"1. Background"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"During Zuoyebang's cloud-native containerization effort, the scheduled tasks that each business line used to run on virtual machines were gradually migrated to cronjobs on Kubernetes clusters. At first the number of cronjobs was small, under 1000, and everything ran normally. As the number grew past ten thousand, problems began to surface."}]},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"2. Problems"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" "}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"We faced two main problems at the time: node stability within the cluster, and low cluster resource utilization."}]},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"Problem 1: node stability within the cluster"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" "}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Many business tasks run every minute, so pods are created and destroyed very frequently. A single node averaged more than a hundred container creations and deletions per minute, and machine stability problems occurred often."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" "}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"A typical problem is that frequent pod creation leaves too many cgroups on the node. Memory cgroups in particular are not reclaimed promptly, so reading \/sys\/fs\/cgroup\/memory\/memory.stat becomes slow. Because kubelet periodically reads this file to collect the memory consumption of each cgroup, kernel-mode CPU usage gradually rises. Past a certain point, some CPU cores stay in kernel mode for long stretches, causing noticeable latency in sending and receiving network packets."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" "}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Running perf record cat \/sys\/fs\/cgroup\/memory\/memory.stat on the node and then perf report shows that CPU time is spent mainly in memcg_stat_show:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.infoq.cn\/resource\/image\/37\/d9\/371827cd5e59dafd1ece5f8ce5fcfdd9.png","alt":null,"title":"","style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":"","fromPaste":false,"pastePass":false}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"In cgroup v1, the memcg_stat_show function traverses the memcg tree multiple times for each CPU core, and once a memcg tree reaches several hundred thousand nodes the resulting cost is disastrous."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" "}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Why is a memory cgroup not released as soon as its container is destroyed? Mainly because releasing a memory cgroup requires traversing all of its cached pages, which can be slow. The kernel therefore reclaims those pages only when the memory is needed, and the memory cgroup is released only after all of its pages have been cleaned up. Overall, this strategy uses deferred reclamation to amortize the cost of reclaiming everything at once. Normally a machine does not create that many containers, and a few hundred to a few thousand are usually fine. In a large-scale scheduled-task scenario, however, a machine creates and destroys hundreds of containers every minute while the node is under no memory pressure, so memory cgroups are not reclaimed. After a while the number of memory cgroups on a machine reached several hundred thousand, a single read of memory.stat took more than ten seconds, and kernel-mode CPU usage rose sharply, causing noticeable network latency."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.infoq.cn\/resource\/image\/bf\/f5\/bf3e18ab49865596243ff073accc99f5.png","alt":null,"title":"","style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":"","fromPaste":false,"pastePass":false}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"In addition, we saw problems such as dockerd overload and slow responses, and kubelet PLEG timeouts that left nodes unready."}]},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"Problem 2: node resource utilization in the cluster"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" "}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Because we use a smart-NIC CNI network mode, the number of pods per node has an upper limit, and almost half of each node's pod quota was reserved for scheduled-task pods. Scheduled-task pods generally run for a very short time and use few resources, so much of the capacity the cluster reserved for them sat idle, which held back overall machine utilization."}]},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"Other problems: scheduling speed and isolation between services"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" "}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"At certain times, for example at 00:00 every day, several thousand Jobs need to start simultaneously. The native K8s scheduler allocates cluster resources one pod at a time; in the scheduling flow this means the filtering and scoring stages run sequentially, that is, serially. Scheduling several thousand Jobs takes several minutes, while most businesses require an exact 00:00:00 start or tolerate an error of at most 3 s."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" "}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Some service pods are compute-intensive or IO-intensive. They consume a large share of a node's CPU or IO, and because cgroup isolation is not complete, they interfere with other online services running normally."}]},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"3. Using Serverless in the K8s Cluster"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" "}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"For CronJob-type tasks we therefore need more thorough isolation, finer-grained nodes, and a faster scheduling mode."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" "}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"To solve the problems above, we considered isolating scheduled-task pods from ordinary online-service pods. However, because many scheduled tasks need to communicate with services inside the cluster, we settled on a way to isolate scheduled-task pods within the cluster: K8s serverless. We introduced virtual nodes so that K8s serverless can be used within the existing K8s system. Pods deployed on virtual nodes have the same security isolation and network connectivity as pods on the cluster's existing nodes, while requiring no reserved resources and being billed on a pay-as-you-go basis, as shown in the figure:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.infoq.cn\/resource\/image\/89\/74\/89f0dc693157abdf5ac8fa7e69c22474.png","alt":null,"title":"","style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":"","fromPaste":false,"pastePass":false}},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"Task scheduler"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" "}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"All CronJob-type workloads use the task scheduler. It schedules task pods onto Serverless nodes in parallel batches rather than serially, with scheduling latency at the millisecond level, and it can also schedule pods back to regular nodes when Serverless nodes fail or run out of resources."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" "}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"Removing differences from pods on regular nodes"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" "}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Before adopting K8s Serverless, we first had to remove the differences between Serverless pods and pods running on regular nodes, so that the change is transparent to business developers."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" "}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"1. Unified log collection"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" "}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"For log collection, virtual nodes are maintained by the cloud vendor and cannot run DaemonSets, while our log collection component runs as a DaemonSet, so logs on virtual nodes need a separate collection scheme. Cloud vendors collect container standard output into their own log services, and each vendor's log service API is different. We therefore built our own log consumption service, which integrates each vendor's log client as a plugin. It consumes logs from each vendor, normalizes them to the same format as the logs collected by the cluster's unified log component, and writes them to a unified Kafka cluster for downstream consumption."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":" "}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"2. Unified monitoring and alerting"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" "}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"For monitoring, we collect real-time CPU\/memory\/disk\/network traffic metrics for pods on Serverless, consistent with pods on regular nodes. The pod sandbox exposes a metrics export endpoint and Prometheus scrapes it centrally, so migrating to Serverless is completely transparent to the business."}]},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"Improving startup performance"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" "}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Serverless Jobs need to start within seconds to meet the startup requirements of scheduled tasks, for example an exact 00:00:00 start or an error tolerance of at most 3 s."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" "}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"Most of the time is spent in the following two steps:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" "}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"1. Creating the underlying sandbox or initializing the runtime environment"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"2. Pulling the business image"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" "}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"The main optimization is to let sandboxes of the same workload be reused, so that the remaining cost is mostly the service's own startup time. Apart from the first run, which takes longer, subsequent runs generally start within seconds."}]},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"4. Summary"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" "}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"With a custom Job scheduler, the removal of differences from pods on regular nodes, and faster Serverless pod startup, businesses switched to Serverless without noticing, and we made effective use of Serverless's maintenance-free operation, strong isolation, and pay-as-you-go billing. Scheduled tasks are now isolated from ordinary business pods, and the cluster no longer reserves machine resources for them, which freed more than ten thousand pods on the cluster's own nodes, about 10% of the total. It also avoids the problems caused by overly frequent pod creation on nodes, and businesses see more stable scheduled tasks. Migrating scheduled tasks to Serverless freed about 10% of the machines in the whole cluster and cut the resource cost of scheduled tasks by about 70%."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"About the authors:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"吕亚霖 (Lü Yalin) leads the Architecture R&D team in Zuoyebang's Infrastructure group and is responsible for the technical middle platform and infrastructure. Work at Zuoyebang includes leading the cloud-native architecture evolution and driving the containerization effort, service governance, the Go microservice framework, and DevOps practices."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"别路 (Bie Lu) is a senior R&D engineer in Zuoyebang's Infrastructure group, responsible for multi-cloud k8s cluster construction, k8s component development, and Linux kernel tuning."}]}]}
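The memory cgroup buildup described in the node-stability section can be checked on a node with a small script. The following is a minimal sketch, not the authors' tooling: it assumes a Linux node using cgroup v1, reads the per-subsystem cgroup counts from /proc/cgroups, and times one read of the memory.stat path named in the article. The function names `parse_proc_cgroups` and `time_read` are hypothetical helpers introduced only for illustration.

```python
import time


def parse_proc_cgroups(text):
    """Parse the contents of /proc/cgroups into {subsystem: num_cgroups}.

    The file has a '#'-prefixed header followed by whitespace-separated
    columns: subsys_name, hierarchy, num_cgroups, enabled.
    """
    counts = {}
    for line in text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and the header
        fields = line.split()
        if len(fields) < 4:
            continue  # ignore malformed rows
        counts[fields[0]] = int(fields[2])
    return counts


def time_read(path):
    """Return the seconds taken to read the whole file at `path` once."""
    start = time.perf_counter()
    with open(path, "rb") as f:
        f.read()
    return time.perf_counter() - start


def main():
    # Path taken from the article; it exists only on cgroup v1 hosts.
    stat_path = "/sys/fs/cgroup/memory/memory.stat"
    try:
        with open("/proc/cgroups") as f:
            counts = parse_proc_cgroups(f.read())
        print("memory cgroups:", counts.get("memory"))
        print("memory.stat read took %.3f s" % time_read(stat_path))
    except OSError as exc:
        print("not available on this host:", exc)


if __name__ == "__main__":
    main()
```

A memory cgroup count far above the number of running containers, together with a read time measured in seconds rather than milliseconds, would be consistent with the symptom the article describes.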