Zuoyebang: Deploying and Optimizing Kubernetes Serverless for Large-Scale Task Workloads

{"type":"doc","content":[{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"I. Background"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"During Zuoyebang's cloud-native containerization effort, scheduled tasks that each business line had run on virtual machines were gradually migrated to CronJobs on Kubernetes clusters. At first the CronJob count was small, under 1,000, and everything ran normally. Once it grew past ten thousand, problems began to surface."}]},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"II. Problems"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" "}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"We faced two main problems at the time: node stability within the cluster, and low cluster resource utilization."}]},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"Problem 1: Node stability within the cluster"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" "}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Many of our scheduled tasks run every minute, so pods were created and destroyed very frequently. A single node averaged over a hundred container creations and deletions per minute, and machine stability problems appeared often."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" "}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"A typical problem: frequent pod creation leaves too many cgroups on the node. Memory cgroups in particular are not reclaimed promptly, so reading \/sys\/fs\/cgroup\/memory\/memory.stat becomes slow. Because kubelet reads this file periodically to account for the memory usage of each cgroup, kernel-mode CPU usage gradually climbs. Past a certain point, some CPU cores stay in kernel mode for long stretches, causing noticeable delays in sending and receiving network packets."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" "}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Running perf record cat \/sys\/fs\/cgroup\/memory\/memory.stat on the node, followed by perf report, shows that the CPU time is spent mainly in memcg_stat_show:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.infoq.cn\/resource\/image\/37\/d9\/371827cd5e59dafd1ece5f8ce5fcfdd9.png","alt":null,"title":"","style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":"","fromPaste":false,"pastePass":false}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"In cgroup v1, the memcg_stat_show function walks the memcg tree multiple times for each CPU core. Once a memcg tree holds hundreds of thousands of nodes, the resulting cost is disastrous."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" "}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Why isn't a memory cgroup released as soon as its container is destroyed? Mainly because releasing a memory cgroup requires walking all of its cached pages, which can be slow. The kernel therefore reclaims those pages only when the memory is needed, and the memory cgroup is freed only after all of its pages have been cleaned up. Overall, this strategy uses deferred reclaim to amortize the cost of reclaiming everything at once. Normally a machine does not create many containers, and a few hundred to a few thousand cause no trouble. In a large-scale scheduled-task scenario, however, a machine creates and destroys over a hundred containers every minute while the node is under no memory pressure, so the memory cgroups are never reclaimed. After a while the number of memory cgroups on a machine reached several hundred thousand, a single read of memory.stat took more than ten seconds, kernel-mode CPU usage rose sharply, and network latency became noticeable."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.infoq.cn\/resource\/image\/bf\/f5\/bf3e18ab49865596243ff073accc99f5.png","alt":null,"title":"","style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":"","fromPaste":false,"pastePass":false}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Beyond that, we also saw dockerd overload and slow responses, as well as kubelet PLEG timeouts that left nodes unready."}]},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"Problem 2: Node resource utilization in the cluster"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" "}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Because of the smart NIC CNI network mode we use, the number of pods on a single node is capped. Nearly half of each node's pod quota was reserved for scheduled-task pods, yet those pods generally run only briefly and use few resources. The resources the cluster reserved for scheduled tasks therefore sat largely idle, which held back overall machine utilization."}]},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"Other problems: scheduling speed and isolation between services"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" "}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"At certain times, such as midnight every day, several thousand Jobs need to start at once. The native K8s scheduler allocates cluster resources one pod at a time: in the scheduling flow, the filtering and scoring stages run in sequence, that is, serially. Scheduling several thousand Jobs takes a few minutes, whereas most businesses require their tasks to start at exactly 00:00:00 or tolerate an error of at most 3 s."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" "}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Some service pods are compute-intensive or IO-intensive. Such services grab large amounts of node CPU or IO, and because cgroup isolation is not complete, they interfere with other online services that are running normally."}]},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"III. Using Serverless in the K8s Cluster"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" "}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"For CronJob-type tasks, then, we needed more thorough isolation, finer-grained nodes, and a faster scheduling mode."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" "}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"To solve the problems above, we considered separating scheduled-task pods from ordinary online-service pods. However, many scheduled tasks need to communicate with services inside the cluster, so we settled on a way to isolate scheduled-task pods while keeping them in the cluster: K8s Serverless. We introduced virtual nodes so that K8s Serverless can be used within the existing K8s setup. Pods deployed on virtual nodes have the same security isolation and network connectivity as pods on the cluster's existing nodes, while requiring no reserved resources and being billed by usage, as the figure shows:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.infoq.cn\/resource\/image\/89\/74\/89f0dc693157abdf5ac8fa7e69c22474.png","alt":null,"title":"","style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":"","fromPaste":false,"pastePass":false}},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"Task scheduler"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" "}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"All CronJob-type workloads use the task scheduler. It schedules task pods onto Serverless nodes in parallel batches rather than serially, achieving fully parallel scheduling at millisecond-level speed. It also supports scheduling pods back to regular nodes when Serverless nodes fail or run short of resources."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" "}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"Removing differences from pods on regular nodes"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" "}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Before adopting K8s Serverless, we first had to eliminate the differences between Serverless pods and pods running on regular nodes, so that the change is transparent to application developers."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" "}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"1. Unified log collection"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" "}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"For log collection, virtual nodes are maintained by the cloud vendor and cannot run DaemonSets, whereas our log collection component runs as a DaemonSet. Logs on virtual nodes therefore need a separate collection scheme. Cloud vendors collect container standard output into their own log services, and each vendor's log service exposes a different interface. So we built our own log consumption service that integrates each vendor's log client as a plugin. It consumes the logs from each vendor, flattens them into the same format as the logs collected by the cluster's unified logging component, and writes them to a single Kafka cluster for downstream consumption."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":" "}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"2. Unified monitoring and alerting"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" "}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"For monitoring, we added real-time CPU, memory, disk, and network traffic monitoring for pods on Serverless, matching what pods on regular nodes have. The pod sandbox exposes an exporter interface and Prometheus scrapes everything uniformly, so the migration to Serverless was fully transparent to the business."}]},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"Improving startup performance"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" "}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Serverless Jobs need second-level startup speed to meet the startup requirements of scheduled tasks, for example tasks that must start at exactly 00:00:00 or that tolerate an error of at most 3 s."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" "}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"Startup time is spent mainly in two steps:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" "}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"1. Creating the underlying sandbox or initializing the runtime environment"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"2. Pulling the business image"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" "}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"The main optimization is to let sandboxes of the same workload be reused, so that the remaining cost is mostly the service's own startup time. Apart from the first run, which takes longer, subsequent runs generally start within seconds."}]},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"IV. Summary"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" "}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"By building a custom Job scheduler, removing the differences from pods on regular nodes, and improving Serverless pod startup performance, we switched to Serverless without the business noticing and made effective use of Serverless's operations-free, strongly isolated, pay-as-you-go nature. Scheduled tasks are now isolated from ordinary business pods, and the cluster no longer needs to reserve machine resources for them, which freed up more than ten thousand pods on the cluster's own nodes, about 10% of the total. It also avoids the problems caused by overly frequent pod creation on nodes, and the business sees better stability for scheduled tasks. Migrating scheduled tasks to Serverless freed about 10% of the machines in the whole cluster and cut the resource cost of scheduled tasks by roughly 70%."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"About the authors:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"呂亞霖, head of the Architecture R&D team in Zuoyebang Infrastructure. Responsible for the technology middle platform and infrastructure. At Zuoyebang, led the cloud-native architecture evolution and drove the rollout of containerization, service governance, a Go microservice framework, and DevOps."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"別路, senior R&D engineer in Zuoyebang Infrastructure. At Zuoyebang, responsible for building multi-cloud K8s clusters, developing K8s components, and Linux kernel optimization and tuning."}]}]}
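The memory cgroup buildup described under the node stability problem can be checked on a node with a small script. The following is a minimal sketch, not the authors' tooling: the function names are made up for illustration, and it assumes a cgroup v1 host with the memory controller mounted at /sys/fs/cgroup/memory, as in the article. It compares the memory cgroup count reported in /proc/cgroups with the number of cgroup directories that are still visible, and times one read of memory.stat. As far as we understand, the kernel's count can include cgroups whose directories are already gone but whose pages have not been reclaimed, so a large gap between the two numbers would be consistent with the delayed release described above.

```python
import os
import time


def parse_proc_cgroups(text):
    """Return {subsystem: num_cgroups} parsed from the contents of /proc/cgroups."""
    counts = {}
    for line in text.splitlines():
        # Skip blank lines and the "#subsys_name hierarchy num_cgroups enabled" header.
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.split()
        if len(fields) < 4:
            continue
        counts[fields[0]] = int(fields[2])
    return counts


def count_cgroup_dirs(root):
    """Count directories under root, including root itself (one per visible cgroup)."""
    return sum(1 for _ in os.walk(root))


def time_read(path):
    """Return (elapsed_seconds, bytes_read) for one full read of path."""
    start = time.perf_counter()
    with open(path, "rb") as f:
        size = len(f.read())
    return time.perf_counter() - start, size


if __name__ == "__main__":
    root = "/sys/fs/cgroup/memory"  # assumption: cgroup v1 mount point
    with open("/proc/cgroups") as f:
        kernel_count = parse_proc_cgroups(f.read()).get("memory", 0)
    visible = count_cgroup_dirs(root)
    seconds, _ = time_read(os.path.join(root, "memory.stat"))
    print(f"kernel-tracked memory cgroups: {kernel_count}")
    print(f"visible cgroup directories:    {visible}")
    print(f"memory.stat read time:         {seconds:.3f}s")
```

Run periodically on a node that hosts many short-lived pods, the three numbers give a rough trend: if the kernel-tracked count and the read time keep growing while the visible directory count stays flat, the node is likely accumulating unreleased memory cgroups.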