Zuoyebang's Practice of Optimizing the Native Kubernetes Scheduler

The essence of a scheduling system is to match computing services or tasks with appropriate resources so they can run stably and efficiently, and, on that basis, to further increase resource-packing density. Many factors affect how an application runs: CPU, memory, IO, specialized hardware devices, and so on. At the same time, individual and aggregate resource requests, hardware/software/policy constraints, affinity requirements, data locality, and interference between workloads — combined with scenarios such as periodic traffic, compute-intensive jobs, and mixed online/offline deployment — add many more variables to the decision.

The scheduler's goal is to do this matching both quickly and accurately, but when resources are limited these two goals often conflict and must be traded off. This article shares the problems Zuoyebang encountered in its real-world use of Kubernetes and the solutions we eventually worked out; we hope it is useful to other developers.

## Scheduler Principles and Design

The overall working framework of the default Kubernetes scheduler can be summarized by the diagram below:

![diagram](https://static001.geekbang.org/infoq/5b/5bd8c4755399fdc7ab049a99e60bb15d.webp)

### Two control loops

1. The first control loop is called the Informer Path. Its main job is to start a set of informers that watch changes to scheduling-related API objects such as Pods, Nodes, and Services. For example, when a to-be-scheduled pod is created, the scheduler adds it to the scheduling queue through the pod informer's handler. The scheduler also keeps the scheduler cache up to date and uses this cache as its reference data to improve the performance of the whole scheduling flow.

2. The second control loop is the main loop that actually schedules pods, called the Scheduling Path. It repeatedly pops a pending pod off the scheduling queue and runs a two-step algorithm to select the best node:

- From all nodes in the cluster, select every node that "can" run the pod; this step is called Predicates.
- Score the nodes selected in the previous step with a series of priority functions and pick the "best", i.e. highest-scoring, node; this step is called Priorities.

Once scheduling is complete, the scheduler sets the pod's spec.NodeName to the chosen node; this step is called Bind. To avoid touching the API server on the critical path and hurting performance, the scheduler only updates the relevant pod and node information in the scheduler cache; this optimistic style of API-object update is called Assume in Kubernetes. Only afterwards does the scheduler create a goroutine that asynchronously sends the Bind update to the API server. Even if that step fails it does no harm: once the scheduler cache is updated again, everything returns to normal.

## Problems and Challenges of Scheduling in Large Clusters

The default Kubernetes scheduling policy performs well in small clusters, but as business volume grows and workload types diversify, its limitations start to show: too few scheduling dimensions, no concurrency, performance bottlenecks, and a scheduler that keeps getting more complex.

Today a single one of our clusters has on the order of a thousand nodes and more than 100,000 pods, with overall resource allocation above 60%, including complex scenarios such as GPUs and mixed online/offline deployment. Along the way we ran into quite a few scheduling problems.

#### Problem 1: Uneven node load at peak hours

The default scheduler bases its decisions on workloads' request values. If we set requests too high we waste resources; too low, and CPU can become severely unbalanced at peak time. Affinity policies can mitigate this to some degree, but they require constantly filling in large numbers of rules, so the maintenance cost is very high. Moreover, a service's request often does not reflect its real load, which introduces an error — and at peak hours that error shows up as uneven node load.

A real-time scheduler pulls live metrics from each node into node scoring at scheduling time, but in practice real-time scheduling does not fit many scenarios, especially workloads with strong periodicity. For most of our services, evening-peak traffic is tens of times the off-peak level, so resource usage differs enormously between peak and off-peak, and deployments usually happen off-peak. With a real-time scheduler, the cluster looks balanced at deploy time but shows huge differences between nodes at the evening peak. Many real-time schedulers respond to such imbalance with a rebalancing policy that reschedules — migrating service pods during peak hours — which is unrealistic from a service-availability standpoint. Clearly, real-time scheduling falls far short of our business needs.

#### Our solution: peak-prediction scheduling

This situation calls for a predictive scheduling scheme. Using past peak-time usage of CPU, IO, network, logs, and other resources, we run regressions over optimal arrangements of services on nodes to derive a weight coefficient for each service and resource, and extend node scoring with these resource weights. In other words, we use historical peak data to predict each service's future peak usage on a node, and use that prediction to influence the node scores produced during scheduling.

#### Problem 2: More scheduling dimensions

As workloads diversify, more scheduling dimensions need to be taken into account — logs, for example. Collectors cannot ship logs at unlimited rates, and log collection is per node, so log collection rates must be balanced and must not differ too much between nodes. Some services use a moderate amount of CPU but emit a very large volume of logs; since logs are not part of the default scheduler's decision, when several such log-heavy service pods land on the same node, log shipping from that machine may fall behind.

#### Our solution: completing the scheduling decision factors

This problem clearly requires completing the scheduler's decision inputs. We extended the predictive scoring policy with a log decision factor: logs are treated as another node resource, and a score is computed from each service's log volume obtained from historical monitoring.

#### Problem 3: Scheduling latency from large-scale service scaling

As business complexity grew further, peak hours brought large numbers of cron jobs together with bursts of elastic scaling, and scheduling a large batch (thousands of pods) at once drove up scheduling latency. Both workload types are sensitive to scheduling time; for cron jobs in particular, increased scheduling delay is clearly felt. The root cause is that scheduling a pod in Kubernetes allocates cluster resources, which in the scheduling flow means the predicate and scoring phases run sequentially, one pod at a time. As a result, once the cluster grows large enough, bulk updates produce noticeable pod scheduling delays.

#### Our solution: a separate job scheduler, larger concurrent scheduling domains, batch scheduling

The most direct way to fix low throughput is to turn serial into parallel: for resource-contention scenarios, divide resources into domains as finely as possible and run the domains in parallel. Based on this strategy, we split out a dedicated Job scheduler and use Serverless as the underlying resource for running jobs. Kubernetes Serverless provisions an independent sandbox for each Job pod, so the task scheduler is fully parallel. The comparison is shown below:

![chart](https://static001.geekbang.org/infoq/2d/2dbfa4a1562ec5d1e10cdc88b6ba6a0b.webp)

Node CPU utilization at the evening peak under the native scheduler

![chart](https://static001.geekbang.org/infoq/cf/cfbdaa6b34ebba6deca8097cb083493b.webp)

Node CPU utilization at the evening peak under the optimized scheduler

## Summary

Worker-node resources, GPU resources, and Serverless resources are the three heterogeneous resource domains in our cluster. The services running on them differ by nature, so we use three schedulers — forecast-scheduler, gpu-scheduler, and job-scheduler — to manage pod scheduling in these three resource domains.

- The forecast scheduler manages most online services; it extends the resource dimensions and adds the predictive scoring policy.
- The GPU scheduler manages allocation on GPU machines, which run both online inference and offline training. The ratio between the two fluctuates constantly: during peak hours, offline training scales in while online inference scales out; off-peak, offline training scales out while online inference scales in. It also runs some offline image tasks to reuse the relatively idle CPU on GPU machines.
- The Job scheduler manages cron-job scheduling. Cron jobs are numerous, created and destroyed frequently, use resources in a highly fragmented way, and are more latency-sensitive, so we schedule these tasks onto Serverless as much as possible, shrinking the redundant machine capacity the cluster would otherwise keep in reserve to absorb large numbers of tasks, and improving resource utilization.

![diagram](https://static001.geekbang.org/infoq/2b/2b3836e05650199627daaccec925c480.webp)

## Future Directions

- Finer-grained resource domains: divide resource domains down to the node level, with per-node locking.
- Preemption and rescheduling. Normally, when a pod fails to schedule it stays Pending, waiting for a pod update or a change in cluster resources to be rescheduled. But the Kubernetes scheduler also has a preemption feature that lets a high-priority pod, on scheduling failure, evict some low-priority pods from a node so that the high-priority pod can run. So far we have not used the scheduler's preemption ability. Even with all the policies above improving scheduling accuracy, some business-driven imbalance remains unavoidable, and in those abnormal situations rescheduling has its place — it may well become our automatic remediation for such anomalies.

**About the author:**

Lyu Yalin, head of the architecture R&D team in Zuoyebang's infrastructure group. He joined Zuoyebang in 2019 and is responsible for the technology platform and infrastructure. At Zuoyebang he has led the cloud-native architecture evolution and driven containerization, service governance, a Go microservice framework, and DevOps adoption.
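The two-phase selection described in "Two control loops" — filter the nodes that *can* run the pod (Predicates), then score the survivors and pick the best (Priorities) — can be sketched as follows. The node/pod structures and the least-requested-style scoring heuristic are simplified illustrations, not the default scheduler's actual data model.

```python
from dataclasses import dataclass

@dataclass
class Node:
    name: str
    cpu_free: float   # millicores currently available
    mem_free: float   # MiB currently available

@dataclass
class Pod:
    name: str
    cpu_req: float    # requested millicores
    mem_req: float    # requested MiB

def predicates(pod, nodes):
    # Filter phase: keep only nodes that can fit the pod's requests.
    return [n for n in nodes
            if n.cpu_free >= pod.cpu_req and n.mem_free >= pod.mem_req]

def priorities(pod, feasible):
    # Score phase: prefer the node with the most CPU headroom remaining
    # after placement (a "least-requested" style heuristic, one of many
    # priority functions a real scheduler combines).
    def score(n):
        return (n.cpu_free - pod.cpu_req) / n.cpu_free
    return max(feasible, key=score)

def schedule(pod, nodes):
    feasible = predicates(pod, nodes)
    if not feasible:
        return None  # no fit: the pod stays Pending
    return priorities(pod, feasible).name
```

A real scheduler would then Assume the placement in its cache and bind asynchronously; here `schedule` simply returns the chosen node name.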
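The peak-prediction scoring idea can be sketched like this: each service carries a weight (its predicted peak resource usage, assumed to come from an offline regression over historical peak data), and a node is scored by the headroom it would still have at the predicted peak after placement. The function names and weight values are illustrative assumptions, not the production implementation.

```python
def predicted_peak_usage(services, peak_weights):
    # Sum the predicted peak CPU of every service placed on the node.
    # `peak_weights` maps service name -> predicted peak millicores,
    # assumed to be fitted offline from historical peak-hour monitoring.
    return sum(peak_weights.get(s, 0.0) for s in services)

def peak_aware_score(node_capacity, node_services, candidate, peak_weights):
    # Score a node by the fraction of capacity it would still have free
    # at the *predicted* peak after placing `candidate` (clamped at 0).
    peak = predicted_peak_usage(node_services + [candidate], peak_weights)
    return max(0.0, 1.0 - peak / node_capacity)
```

Scoring against predicted peaks rather than live usage is what lets off-peak deployments still land in a layout that stays balanced at the evening peak.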
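Adding logs as a decision factor, as described under "completing the scheduling decision factors", might look like the sketch below: the node's historical log output is treated as consumption of a per-node log-shipping capacity, and the log sub-score is blended into the existing score. The blend weight and capacity figures are hypothetical, chosen only to illustrate the mechanism.

```python
def log_aware_score(cpu_score, node_log_rate, pod_log_rate,
                    node_log_capacity, log_weight=0.3):
    # Treat log throughput as a node resource: the log sub-score falls as
    # the node's aggregate log output (e.g. MB/s, from historical
    # monitoring) approaches what its collector can ship.
    log_score = max(0.0, 1.0 - (node_log_rate + pod_log_rate) / node_log_capacity)
    # Blend with the existing (e.g. peak-prediction) score; `log_weight`
    # is an illustrative tuning knob, not a value from the article.
    return (1.0 - log_weight) * cpu_score + log_weight * log_score
```

With this shape, a pod with a heavy log footprint is steered away from nodes whose collectors are already near saturation, even when those nodes have CPU to spare.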
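The serial-to-parallel idea behind the split-out Job scheduler can be sketched as follows: because the resource domains (worker, GPU, Serverless) share no nodes, a batch of pending pods can be scheduled per domain concurrently without contending for a shared cache. The slot-based placement is a toy stand-in for real per-domain scheduling logic.

```python
from concurrent.futures import ThreadPoolExecutor

def schedule_batch(domains):
    # `domains` maps a resource-domain name to (free_slots, pending_pods).
    # Domains are disjoint resource pools, so each can be scheduled in its
    # own worker with no locking between them.
    def schedule_one_domain(item):
        name, (free_slots, pods) = item
        bound = pods[:free_slots]      # pods that fit are bound
        pending = pods[free_slots:]    # the rest stay Pending
        return name, bound, pending

    with ThreadPoolExecutor() as pool:
        return {name: (bound, pending)
                for name, bound, pending
                in pool.map(schedule_one_domain, domains.items())}
```

Narrowing domains further — down to per-node locking, as mentioned under "Future Directions" — would follow the same pattern with smaller independent units.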