Zuoyebang: Adopting and Optimizing Kubernetes Serverless for Large-Scale Job Workloads

Original

2021-10-27 16:24

I. Background

As Zuoyebang moved to cloud-native containers, the scheduled tasks that each business line had been running on virtual machines were gradually migrated to CronJobs in Kubernetes clusters. At first the number of CronJobs was small, under 1,000, and everything ran normally. Once it grew past ten thousand, problems began to surface.

II. Problems

We faced two main problems at the time: node stability inside the cluster, and low cluster resource utilization.

Problem 1: node stability inside the cluster

Because the business runs many scheduled tasks at minute-level intervals, pods were created and destroyed very frequently. A single node saw on average more than a hundred containers created and destroyed per minute, and machine stability problems appeared frequently.

A typical problem: frequent pod creation leaves too many cgroups on the node. Memory cgroups in particular are not reclaimed promptly, so reading /sys/fs/cgroup/memory/memory.stat becomes slow. Because kubelet reads this file periodically to account for the memory consumption of each cgroup, kernel-mode CPU usage climbs steadily. Past a certain point, some CPU cores stay in kernel mode for long stretches, which causes noticeable latency in sending and receiving network packets.

Running perf record cat /sys/fs/cgroup/memory/memory.stat and then perf report on a node shows that CPU time is spent mainly in memcg_stat_show:

[Figure: https://static001.infoq.cn/resource/image/37/d9/371827cd5e59dafd1ece5f8ce5fcfdd9.png]

In cgroup v1, memcg_stat_show traverses the memcg tree several times for each CPU core. Once a memcg tree reaches several hundred thousand nodes, the time this takes is disastrous.

Why is a memory cgroup not released as soon as its container is destroyed? Mainly because releasing a memory cgroup requires walking all of its cached pages, which can be slow. The kernel instead reclaims those pages only when the memory is needed, and the memory cgroup is released once all of its pages have been cleaned up. Overall, this strategy uses lazy reclamation to amortize the cost of reclaiming everything at once. Normally a machine does not create many containers, and a few hundred to a few thousand cause no trouble. In the large-scale scheduled-task scenario, however, a machine creates and destroys more than a hundred containers every minute while the node is under no memory pressure, so the memory cgroups are not reclaimed. After a while the number of memory cgroups on a machine reached several hundred thousand, a single read of memory.stat took more than ten seconds, and kernel-mode CPU usage rose sharply, causing noticeable network latency.

[Figure: https://static001.infoq.cn/resource/image/bf/f5/bf3e18ab49865596243ff073accc99f5.png]

In addition, we saw problems such as dockerd being overloaded and slow to respond, and kubelet PLEG timeouts that left nodes unready.

Problem 2: node resource utilization in the cluster

Because of the smart-NIC CNI network mode we use, the number of pods on a single node has an upper limit, and almost half of each node's pod slots were reserved for scheduled-task pods. Those pods generally run very briefly and use few resources, so the resources the cluster reserved for scheduled tasks sat largely idle, which held back overall machine utilization.

Other problems: scheduling speed and isolation between services

At certain times, for example 00:00 every day, several thousand Jobs need to start at once. The native K8s scheduler allocates cluster resources one pod at a time: in the scheduling flow, the filtering and scoring phases run in sequence, that is, serially. Scheduling several thousand Jobs takes a few minutes, whereas most businesses require tasks to start at exactly 00:00:00 or accept a deviation of no more than 3 s.

Some service pods are compute-intensive or IO-intensive. They take large amounts of node CPU or IO, and because cgroup isolation is not complete, they interfere with other online services running normally.

III. Using serverless in a K8s cluster

For CronJob-type tasks we therefore needed more thorough isolation, finer-grained nodes, and a faster scheduling model.

To solve the problems above, we considered separating scheduled-task pods from ordinary online-service pods. Because many scheduled tasks need to communicate with services inside the cluster, we settled on a way to isolate scheduled-task pods within the cluster: K8s serverless. We introduced virtual nodes so that K8s serverless can be used within the existing K8s system. Pods deployed on virtual nodes have the same security isolation and network connectivity as pods on the cluster's existing nodes, while requiring no reserved resources and being billed by usage, as shown in the figure:

[Figure: https://static001.infoq.cn/resource/image/89/74/89f0dc693157abdf5ac8fa7e69c22474.png]

Job scheduler

All CronJob-type workloads use the job scheduler. It schedules job pods onto Serverless nodes in batches and in parallel rather than serially, so scheduling is fully parallel and takes milliseconds. It can also schedule pods back to regular nodes when Serverless nodes fail or run short of resources.

Eliminating differences from pods on regular nodes

Before using K8s Serverless, the differences between Serverless pods and pods running on regular nodes had to be eliminated, so that the switch is transparent to business developers.

1. Unified log collection

Virtual nodes are maintained by the cloud vendor and cannot run DaemonSets, whereas our log collection component runs as a DaemonSet, so logs on virtual nodes need a separate collection scheme. Cloud vendors collect container standard output into their own log services, and each vendor's log service interface is different. We therefore built our own log consumption service. It integrates each vendor's log client as a plugin, consumes the logs from each vendor, normalizes them together with the logs collected by the cluster's unified log component, and writes them to a single Kafka cluster for downstream consumption.

2. Unified monitoring and alerting

For monitoring, we added real-time monitoring of CPU, memory, disk, and network traffic for pods on Serverless, matching what pods on regular nodes have. The pod sandbox exposes an export interface and Prometheus scrapes it centrally, so the migration to Serverless was fully transparent to the business.

Improving startup performance

A Serverless Job must start within seconds to meet the startup requirements of scheduled tasks, for example when the business requires a start at exactly 00:00:00 or accepts a deviation of no more than 3 s.

Most of the time is spent in two steps:

1. Creating the underlying sandbox or initializing the runtime environment
2. Pulling the business image

The main optimization is to let sandboxes of the same workload be reused, so that the remaining time is mostly the service's own startup time. Apart from the first run, which takes longer, later runs start within seconds.

IV. Summary

With a custom Job scheduler, the elimination of differences from pods on regular nodes, and improved Serverless pod startup performance, we switched businesses to Serverless transparently and made effective use of its operations-free, strongly isolated, pay-as-you-go characteristics. Scheduled tasks are now isolated from ordinary business pods, the cluster no longer has to reserve machine resources for them, and more than ten thousand pod slots on the cluster's own nodes were freed, about 10% of the total. The problems caused by overly frequent pod creation on nodes are avoided, and the business sees better stability for scheduled tasks. Migrating scheduled tasks to Serverless freed about 10% of the machines in the whole cluster and cut the resource cost of scheduled tasks by about 70%.

About the authors:

Lü Yalin (吕亚霖) is head of the Architecture R&D team in Zuoyebang Infrastructure, responsible for the technology middle platform and infrastructure. At Zuoyebang, Lü has led the evolution of the cloud-native architecture and driven the rollout of containerization, service governance, the Go microservice framework, and DevOps.

Bie Lu (别路) is a senior R&D engineer in Zuoyebang Infrastructure. At Zuoyebang, Bie is responsible for multi-cloud K8s cluster construction, K8s component development, and Linux kernel tuning.
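The memory cgroup build-up described under the node-stability problem can be checked on a node with a small script. What follows is a minimal sketch, not the tooling used in the article. It assumes a cgroup v1 host, reads the `num_cgroups` column for the `memory` controller from `/proc/cgroups` (a counter that, to our understanding, can also include cgroups that were removed but not yet released), and times one read of `memory.stat`. The function names and both thresholds are hypothetical and chosen only for illustration.

```python
import time
from pathlib import Path


def memory_cgroup_count(proc_cgroups_text: str) -> int:
    """Return the num_cgroups value of the memory controller.

    Expects the text of /proc/cgroups, whose columns are:
    subsys_name, hierarchy, num_cgroups, enabled.
    """
    for line in proc_cgroups_text.splitlines():
        if line.startswith("#") or not line.strip():
            continue  # skip the header and blank lines
        fields = line.split()
        if len(fields) >= 3 and fields[0] == "memory":
            return int(fields[2])
    raise ValueError("memory controller not listed")


def timed_read(path: str) -> float:
    """Return the wall-clock seconds taken to read a file once."""
    start = time.perf_counter()
    Path(path).read_bytes()
    return time.perf_counter() - start


def looks_unhealthy(count: int, read_seconds: float,
                    max_count: int = 100_000,
                    max_seconds: float = 1.0) -> bool:
    """Flag a node using illustrative (assumed) thresholds."""
    return count > max_count or read_seconds > max_seconds


if __name__ == "__main__":
    cgroups = Path("/proc/cgroups")
    stat = Path("/sys/fs/cgroup/memory/memory.stat")
    if cgroups.exists() and stat.exists():
        count = memory_cgroup_count(cgroups.read_text())
        elapsed = timed_read(str(stat))
        print(f"memory cgroups: {count}, memory.stat read: {elapsed:.3f}s, "
              f"unhealthy: {looks_unhealthy(count, elapsed)}")
    else:
        print("cgroup v1 memory controller files not found on this host")
```

Run periodically on each node, a check like this would show the count and the read time growing together well before the read reaches the ten-second range reported above. The thresholds would need tuning for a real fleet.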
