伴魚分佈式調度系統 Jarvis 的設計與實現

{"type":"doc","content":[{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"背景"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"隨着伴魚課程業務需求和用戶量的快速增長,涉及到實時和延時任務的場景也越來越多。例如課程錄製、課程視頻轉碼、課程視頻上傳以及相關的課程視頻分析、老師學生行爲分析、語音識別、情緒識別等算法離線預測任務等。這些任務都需要大量的計算、存儲、網絡等資源,而且不同的場景對任務的執行時間,調度策略又有不同的要求。如果由業務方來各自管理機器資源並且監控每個任務的狀態,累積的維護成本會非常高,而且不方便統一管理。從整體來看,在資源有限的情況下,簡單的調度邏輯已無法同時滿足全部的任務需求。我們需要一個能進行任務調度、任務編排、異構資源管理、任務監控的分佈式任務調度解決方案。但是如何在合理高效管理任務的前提下去做到節約資源成本,就成了我們需要面對的一個問題。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"以錄製任務爲例,爲了滿足高峯期的需求,有二十多臺 64Core 128G Memory 的物理機全天運轉提供服務。然而實際上大部分時間機器資源都是閒置的,只有在用戶上課時纔會有任務執行,但爲了用戶體驗和實時類課程錄製,又不能臨時減少機器數量。在類似業務的背景下,我們對系統功能要求進行了整理,並對業界開源項目和第三方產品進行了調研。"}]},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"功能要求"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"支持實時任務,進行秒級調度。"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"支持併發場景下批量創建任務。"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"支持管理異構資源。物理機、K8S 集羣、ECI、EKS 
等。"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"支持與運行中任務進行通信。"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"支持任務狀態監控。"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"支持任務結果回調。"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"支持任務的失敗重試。"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"支持工作流,編排 DAG。"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"支持優先級調度,並且保證資源的優先級親緣性。"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"支持搶佔式調度。緊急任務可直接搶佔資源。"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"保證任務只會被精確調度和執行一次。"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"支持彈性資源調度。"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"支持任務資源池。"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"inde
nt":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"支持多租戶。"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"支持定時任務。Crontab、Fix delay、Fix rate 等。"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"支持日誌查詢。"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"支持 Shell\/Python\/Golang 等執行腳本。"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"支持容器創建,刪除,管理等。"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"…"}]}]}]},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"產品調研"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"業界內關於任務調度的開源項目和第三方產品有很多,我們主要調研了其中幾個產品,並進行了幾項指標的對比。"}]},{"type":"image","attrs":{"src":"https:\/\/static001.infoq.cn\/resource\/image\/b5\/a6\/b50a85f460e471978a229b8d473520a6.jpg","alt":null,"title":"","style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":"","fromPaste":false,"pastePass":false}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"基於我們自身的需求背景及產品調研結果,主要考慮到如下原因:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"bul
letedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"開源項目和第三方產品都不能完全滿足我們的需求, 比如接入ECI、EKS等彈性容器,異構資源管理,支持容器,服務治理,Kubernetes 集羣等。"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"自研更容易適配伴魚的基礎框架和技術工具。"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"自研系統的架構可靈活調整,並適配業務。"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"對開源項目做二次開發或者封裝第三方 SDK 的開發和維護成本也不低。"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"…"}]}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"因此,我們選擇自研一套能夠完全滿足內部需求的分佈式任務調度系統,並取名爲 Jarvis (鋼鐵俠中的智能管家)。在資源上,Jarvis 系統可以接管業務方指定的物理機,雲主機,K8S 集羣, ECI, EKS 等資源,不同類型的任務可以做到資源隔離和動態管理。除此之外,Jarvis 系統還藉助了 ECI 和 EKS 這些彈性容器服務的能力,在物理資源不足時,可以將容器任務調度到上面進行執行。(ECI, EKS:可以理解爲一個按使用量計費的,無限容量的 K8S 
集羣)。"}]},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"架構設計"}]},{"type":"image","attrs":{"src":"https:\/\/static001.infoq.cn\/resource\/image\/84\/cc\/84b48a86dbe2b78a009f9d7baa0a32cc.png","alt":null,"title":"","style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":"","fromPaste":false,"pastePass":false}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Jarvis 系統主要有四大模塊:JobManager、Scheduler、ResourceManager、Worker,每個模塊都以集羣方式部署。"}]},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"模塊介紹"}]},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"JobManager"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"負責管理任務的生命週期並維護任務狀態機,維護任務間的依賴關係(DAG 編排),支持實時、定時、延時任務的創建和管理,並監控任務的運行狀態。另外 Job Manager 還負責監控超時任務,對超時任務進行查殺並強制釋放資源。"}]},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"Scheduler"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"負責對任務進行調度:將任務發送給 ResourceManager 進行資源綁定,並將分配到資源的任務 dispatch 到指定位置。它是 Jarvis 調度系統的大腦,從 Job Manager 中獲取需要執行的任務,根據任務的類型、等待時間、優先級等信息,按照多種調度算法對任務進行調度,並將任務分發給合適的 Worker 執行。"}]},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"ResourceManager"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"負責管理業務方所有可用資源,包括但不限於物理機,雲主機,K8S 集羣等,並將 
Scheduler 推送過來的任務綁定最合適的資源。作爲 Jarvis 調度系統的資源管理中心,它還負責將物理機、K8S 集羣等資源註冊到緩存和數據庫,將這些資源統一管理,並監控資源的負載情況和資源使用信息。除此之外,ResourceManager 集成了資源的打分,分配,調度方案,可作爲插拔式插件進行更新。"}]},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"Worker"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"該模塊部署在宿主機上,可以使用容器或二進制形式部署,負責向 ResourceManager 上報機器資源使用情況、健康狀態、心跳檢查等,並向 Job Manager 上報任務執行狀態。最終都會通過 Worker 執行作業與任務。Jarvis 調度系統中的任務執行和分發者,接收並執行由 Scheduler 分發的任務、接收並彙報任務的運行結果。實時向 ResourceManager 回報資源的使用情況、健康狀態、心跳等信息,確保物理機資源能夠被 Jarvis 管理。"}]},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"模塊細節"}]},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"JobManager模塊"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Job Manager 並不是嚴格的去中心化設計,而是通過 Etcd 分佈式鎖選舉出 Master 節點,Master 節點相比其它 Slave 節點多了一些全局的監控工作,但不會直接與其它節點存在關聯。"}]},{"type":"heading","attrs":{"align":null,"level":4},"content":[{"type":"text","text":"任務模型"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"在 JobManager 中實現了 3 
種任務模型:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"實時任務:一次創建,一次執行。"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"定時任務:一次創建,定時執行(週期性執行或指定時間一次性執行)。"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"DAG:根據依賴關係執行。"}]}]}]},{"type":"heading","attrs":{"align":null,"level":4},"content":[{"type":"text","text":"支持的任務類型"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"目前我們支持以下任務類型:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"容器"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Golang 腳本"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Python 腳本"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Shell 
腳本"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"HTTP"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"自定義任務"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"其他"}]}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"對於腳本類的任務,需要提供具體的腳本內容。對於容器類型的任務,需要提供任務鏡像,啓動參數,環境變量等。因爲容器可以方便地限制 CPU、Memory 等資源的使用,而且在 ECI 的助力下,很少會出現資源不足的情況。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Jarvis 系統接入的第一個任務是直播中臺的課程錄製業務,業務方將原有服務中的錄製邏輯抽離出來進行了容器化,在 Jarvis 中以容器形式運行。在接入錄製任務的過程中,業務方提出了一些新的需求,比如客戶端需要切換錄製 SDK,上游服務可調用 Jarvis 的指令發送接口給任務發送切換 SDK 指令,但是 Docker 本身是不支持的,我們最終通過 Docker Exec API + IPC 打通了物理機以及 ECI 上容器任務的通信。另外 Jarvis 還支持自定義 Processor(可以理解爲任務插件),可以直接在機器上執行特定的任務。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"通過 Job Manager 創建任務的時候,可以設置限制資源(CPU、Memory、GPU等)的參數,對於容器任務,容器底層自身可以做到嚴格的資源限制,對於腳本類和自定義 Processor 任務,我們會使用 Linux Cgroup、Namespace 
技術來實現資源隔離和限制。"}]},{"type":"heading","attrs":{"align":null,"level":4},"content":[{"type":"text","text":"核心接口"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"CreateJob: 創建實時任務。"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"CreateDag: 創建 DAG 任務。"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"CreateCronJob: 創建定時任務。"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"UpdateCronJobExpression: 更新運行中 CronJob 的 cron 表達式。"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"StopCronJob: 暫時停止定時任務,可恢復。"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"KillJob: 強制終止任務。"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"SendCommandToJob: 
向運行中的任務發送指令。"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"…"}]}]}]},{"type":"heading","attrs":{"align":null,"level":4},"content":[{"type":"text","text":"任務狀態"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.infoq.cn\/resource\/image\/ff\/35\/ffc49c7d840c6b50ccbfedf23321f535.png","alt":null,"title":"","style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":"","fromPaste":false,"pastePass":false}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"任一時刻,Job 只會處於以下一種狀態"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Init (初始態):初始化 Job 狀態。"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Schedulable:可被調度狀態,Job Manager 收到的。"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Scheduling:正在調度中狀態。"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Schedulable 的 Job 被 Scheduler 拉取後,JobManager 
將其修改爲 Scheduling 狀態。"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Pending:待運行的狀態。任務已經被 dispatch 到 Worker,但還未開始執行。"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Running:任務正在運行的狀態。Worker 上報給 Job Manager 的 Job 狀態。"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Killed:任務被終止的狀態。Worker 上報的被驅逐或搶佔的 Job 狀態,如重試次數大於 0 改爲 Schedulable 狀態,否則改爲 Failed 狀態。"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Retryable(臨時態):任務正在重試的狀態。任務執行失敗或被 kill 後,如重試次數大於 0 改爲 Schedulable 狀態,否則改爲 Failed 狀態。"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Failed(最終態):任務最終執行失敗的 Job 狀態。"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Succeeded(最終態):任務最終執行完成的 Job 狀態。"}]}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Job 的狀態變化主要來自 Scheduler 和 Worker 的上報以及 Job Manager 的監控,同一個 Job 多個狀態的上報存在併發問題,可能會造成緩存與數據庫的不一致。爲此我們實現了支持自動續期的 Redis 
分佈式鎖來保證狀態變化的原子性。"}]},{"type":"heading","attrs":{"align":null,"level":4},"content":[{"type":"text","text":"任務執行流程"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"JobManager 收到客戶端提交的請求後,通過分佈式 ID 生成器生成 JobId,將其放入 Redis Set 中,並將 Job 信息持久化到 DB。創建 Job 的參數如下:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"codeblock","attrs":{"lang":null},"content":[{"type":"text","text":"type CreateJobReq struct {\n\tTeamId string \/\/ 業務方標識\n\tAppName string \/\/ 業務名稱\n\tName string \/\/ 任務名稱\n\tDescription string \/\/ 任務描述\n\tCreator string \/\/ 創建者\n\tTimezone string \/\/ 時區\n\tRetries int32 \/\/ 重試次數\n\tRetryInterval int32 \/\/ 重試間隔\n\tPriority int32 \/\/ 優先級\n\tConcurrency bool \/\/ 是否併發執行\n\tExecutor string \/\/ http、docker、eci...\n\tExecutorMode string \/\/ 執行模式\n\tExecutorConfig map[string]string \/\/ 任務參數及配置\n\tCpu float64 \/\/ cpu 需求\n\tMemory int32 \/\/ memory 需求\n\tGpu int32 \/\/ gpu 需求\n\tTimeout int32 \/\/ 超時時間\n\tCallbackUrl string \/\/ 回調 URL\n\t\/\/ ...\n}"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"此時 Job 爲 Schedulable 狀態,等待被 Scheduler 調度。Scheduler 會定時批量從 Job Manager 維護的 Redis Set 中拉取任務。Job 被拉取後,狀態變更爲 Scheduling,然後由 Scheduler 根據 Job 指定的調度策略向 Resource Manager 申請資源,成功申請到資源後,Scheduler 將 Job 指派到對應資源上的 Worker,此時 Job 狀態變更爲 Pending,在 Worker 啓動任務成功後,任務狀態變更爲 
Running。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.infoq.cn\/resource\/image\/7c\/91\/7c77df94eb4bbb7b7d6f132005734e91.png","alt":null,"title":"","style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":"","fromPaste":false,"pastePass":false}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"heading","attrs":{"align":null,"level":4},"content":[{"type":"text","text":"定時任務的實現"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"常見用來實現定時任務的 DelayQueue 和 Cron,底層都是最小堆,單次插入刪除的平均時間複雜度是 O(log n)。如果堆的大小已經達到 100 萬,那麼每次插入都需要將近 20 次操作(2^20 = 1048576)。Jarvis 的設計目標是能同時維護百萬級的定時任務,在這種情況下,用常見方式去執行創建任務、停止任務等操作將會非常耗時。爲此我們需要一種更高效的數據結構:時間輪,其插入和刪除可以達到近乎 O(1) 的時間複雜度。在海量任務場景下(百萬級別),每次插入新的任務,時間輪要比最小堆少約 19 次操作。參考 Kafka 中時間輪算法的實現,我們基於 Golang 實現了高性能的層級時間輪並應用到 Job Manager 中。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"最簡單的時間輪就是一個固定大小的循環列表,其中每格代表一個時間間隔,包含一個雙向鏈表用來維護某一時刻下的任務列表。很明顯,這種單層時間輪無法表示較大的時間跨度,且在初始化後無法管理超過跨度的定時任務。層級時間輪通過按需創建多個時間輪,並對每層時間輪設置不同的時間跨度,有效地解決了單層時間輪的缺點。當定時任務超過層級時間輪當前最大時間跨度後,會創建跨度爲當前跨度 N 倍的高層時間輪,其中的 N 是上述提到的循環列表格數。隨着時間的流逝,高層時間輪中的任務會被逐步降級插入到下層時間輪中,直到達到最底層時間輪的當前時刻指針,任務被取出,移出時間輪。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"對於單個時間輪來說,目前可以達到 ms 級的精度。在 tick = 1ms,時間輪的時間格個數 timewheelsize = 60 時,第一層時間輪的跨度爲 60ms,第二層時間輪的跨度爲 60 × 60ms = 3.6s,第三層時間輪的跨度爲 60 × 3.6s = 216s … 第七層時間輪的時間跨度約爲 88.7 
年,僅需七層就足以滿足業務上的需求。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"在實踐過程中,我們也發現了 Jarvis 直接使用時間輪算法的一些問題,例如沒有做備份,當服務器宕機時會丟失所有任務。而每個 JobManager 節點中都運行着一個時間輪,我們需要保證這個“分佈式時間輪”在服務重啓或宕機時,任務能被及時地分發到其它節點。爲了解決這個問題,我們在以下的任務監控場景引入了 Job Bucket 的新概念。"}]},{"type":"heading","attrs":{"align":null,"level":4},"content":[{"type":"text","text":"任務監控"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Job Manager 集羣管理着 Jarvis 系統中的所有 Job,爲了便於每個節點都均衡地參與到 Job 的監控和管理工作中,上游負責負載均衡,但 Job Manager 還需要先對 Job 進行分配。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"對此,我們引入了 Bucket 的概念,類似 Redis Cluster 中的哈希槽。創建 Job 時,會根據 CRC32(JobId) mod BucketCount 計算出 Job 所映射的 Bucket。Job Manager 節點在啓動時會去搶佔 Bucket,搶佔成功後纔會提供監控等服務,否則就會一直嘗試搶佔。例如,現在有 10 個 Bucket,部署了 12 個 Job Manager 節點,這樣會有 10 個節點搶到 Bucket,另外 2 個節點只會提供接口服務,並不實際維護任務。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"搶到 Bucket 的節點會監控 Bucket 下綁定的所有 Job,當發現有 Job 在某個狀態超時後,會主動執行 kill、釋放資源、重新調度等操作。節點需要通過心跳對 Bucket 續期,當某個節點出現故障時,該 Bucket 會被釋放,之前未搶到的節點會及時接管該 Bucket。Bucket 的數量是可動態配置的,一般我們會設置節點數爲 Bucket 數量 + N,其中 N 代表備用的監控節點數,可以按需調整。除了監控 Job 的狀態,Job Manager 中的 Master 節點還會監控 Bucket 的數量,當宕機的節點數超過 N 時,意味着出現了 Bucket 無節點接管的情況。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"如果節點宕機,新的節點接管其 Bucket 後,會將該 Bucket 
下的定時任務重新加入到自己的時間輪中,這樣就保證了定時任務在節點宕機重啓時也不會丟失。"}]},{"type":"heading","attrs":{"align":null,"level":4},"content":[{"type":"text","text":"DAG 任務編排"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"DAG 的每個子任務本質都是一個實時任務。在實現上,我們用有向無環圖 (DAG) 維護任務間的依賴關係,當子任務執行結束時,通知 DAG 執行一次檢查,如果已無可執行的子任務,則 DAG 執行結束。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.infoq.cn\/resource\/image\/48\/d1\/48cd7b04a64629222713560a461af2d1.png","alt":null,"title":"","style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":"","fromPaste":false,"pastePass":false}},{"type":"heading","attrs":{"align":null,"level":4},"content":[{"type":"text","text":"任務失敗策略"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"一般任務執行失敗有以下幾種情況:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"用戶主動 Kill"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"任務超時後被系統 Kill"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"任務運行失敗"}]}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"接收到任務失敗上報後,Job Manager 
會根據創建時指定的 “失敗重試次數” 參數,嘗試重新調度任務,當重試次數用完後,Job 會被標記爲最終態 Failed。"}]},{"type":"heading","attrs":{"align":null,"level":4},"content":[{"type":"text","text":"任務結果回調"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Jarvis 系統並不關心任務具體業務邏輯的對錯,我們只保證任務成功在資源上運行。如果業務方需要拿到任務執行完的結果,可以在業務邏輯中任務結束前調自己的接口。我們也提供了回調機制:在創建 Job 時可以指定 CallBack URL,任務執行結束後,Jarvis 會將任務的輸出結果進行回調。"}]},{"type":"heading","attrs":{"align":null,"level":4},"content":[{"type":"text","text":"保證某些場景下的任務冪等性"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"任務 (Job) 可以抽象成一個請求,在系統的基本邏輯下其本身具有冪等性。但由於存在重試機制和補償機制,任務 (Job) 可能被重複執行。爲了保證一個任務只會被一臺機器執行一次,Jarvis 基於 Redis Check + 分佈式 ID + CallBack 機制來保證這些場景下的「任務冪等性」。"}]},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"Scheduler 模塊"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Scheduler 的核心功能是對任務進行調度,負責任務在創建後的“綁定資源 -> 指派任務”的過程。"}]},{"type":"heading","attrs":{"align":null,"level":4},"content":[{"type":"text","text":"調度策略"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Jarvis 使用的調度算法主要有:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"First Come First 
Serve"}]}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"最簡單的一個調度算法:先來先服務。用於維護非搶佔式任務。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Multi Priority Level Queue"}]}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"多級優先級隊列調度算法。「多級」表示有多個隊列,各隊列的優先級從高到低排列。用於維護搶佔式任務:當搶佔式任務因資源不足無法執行時,會對低優先級的任務進行搶佔,被搶佔的任務會走正常的重試邏輯,直到重試次數用完。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"時間容忍性調度"}]}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"除了以上兩種調度算法外,我們還基於 ECI 實現了一個對業務場景很實用的 Feature:容忍等待時間。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"考慮如下場景:有一些任務的實時性要求不高,在 24 小時內執行完就可以,而我們只有有限的物理機資源。當瞬時創建大量這種任務時,就算全部資源滿負荷運轉,同時能跑的任務數量也很有限。Scheduler 會將這些任務積壓在隊列中,在這 24 小時裏充分利用資源,如果達到任務要求的可容忍等待時間後,資源仍然不足,Scheduler 就會直接將其調度到 ECI 
上執行。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"與“分佈式時間輪”問題類似,Scheduler 集羣各個節點維護的隊列也沒有備份。但是 Scheduler 中沒有 Bucket 的概念,它是完全無狀態、去中心化的。爲了解決這個問題,我們在監聽到 Scheduler 退出信號後,會把隊列中未處理的 Job 信息進行回調,再通過 Job Manager 轉發到其它節點。"}]},{"type":"heading","attrs":{"align":null,"level":4},"content":[{"type":"text","text":"調度流程"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Scheduler 從 Job Manager 拉取 Schedulable 狀態的 Job。"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"將 Job 推送給 ResourceManager 嘗試綁定資源。"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"如果綁定資源成功,根據資源信息將 Job 推送到指定的 Worker 上執行。"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"如果綁定資源失敗,走搶佔邏輯或者重新入隊。"}]}]}]},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"ResourceManager模塊"}]},{"type":"heading","attrs":{"align":null,"level":4},"content":[{"type":"text","text":"調度系統現狀"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Google 
的研究工作1表明,調度系統經歷了從單層調度系統到雙層調度系統再到共享狀態調度系統的演變過程:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"單層調度系統架構簡單且單調度器容易保證資源的一致性,但是其併發性能差且存在單點瓶頸問題不適於大規模集羣的調度。"}]}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"雙層調度系統將資源管理和任務調度解耦,每個上層調度器都有自己的資源視圖且可以自定義任務調度的邏輯。提高了調度系統的靈活性和可擴展性,典型代表如 Mesos 等。雙層調度系統雖然提高了擴展性,但是由於上層調度器只有局部資源視圖因此任務的資源分配不是全局最優,且任務進行資源搶佔時無法跨調度器搶佔。雖然可以有多個上層調度器較單層調度器提高了併發性,但實際採用了類似悲觀鎖的方式併發性仍有待提高。"}]}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"共享狀態調度系統中每個調度器都有全局資源視圖,能夠實現全局最優調度,且多個調度器可以同時進行調度提高了併發性能,典型代表如 Omega 等。但是併發調度實際是採用了類似樂觀鎖的併發機制,因此會導致資源分配衝突,若頻繁發生衝突而重新進行調度可能影響系統性能。"}]}]}]},{"type":"heading","attrs":{"align":null,"level":4},"content":[{"type":"text","text":"ResourceManager 處理流程"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Jarvis 屬於共享狀態調度系統其資源管理模塊 ResourceManager 負責系統中資源的統一管理和分配,它接受來自 Worker 的資源信息彙報,並把集羣中資源按照一定的策略分配給各個任務。ResourceManager 
是一個資源管理模塊,並不參與任務的具體執行(啓動、殺死、重啓等),其主要工作包括:Allocate、Report、Release。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Allocate: Scheduler 申請給 Job 分配資源,收到請求 ResourceManager 先查詢數據庫中該 Job 的狀態避免重複處理,然後根據分配策略和資源數據進行計算給 Job 分配資源。分配資源後更新 Job 的狀態和資源數據,先更新緩存然後用 Pulsar 同步數據到數據庫,至此該 Job 已經預佔了分配給它的資源狀態爲 Assumed。"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Report: ResourceManager 將資源分配結果返回,Scheduler 使用資源來執行 Job 並將 Job 使用資源的結果通知 ResourceManager。Jarvis 通過超時機制來避免 Job 申請資源後長時間不使用,導致資源浪費,資源使用結果上報後 Job 的狀態爲 Used。"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Release: Job 執行結束或超時被 kill 後 JobManager 會請求釋放該 Job 佔用的資源,ResourceManager 收到請求後更新 Job 的狀態爲 Deleted,並回收 Job 佔用的資源更新對應機器的資源數據。"}]}]}]},{"type":"heading","attrs":{"align":null,"level":4},"content":[{"type":"text","text":"ResourceManager 狀態機"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"codeblock","attrs":{"lang":"text"},"content":[{"type":"text","text":" +--------------------------------------------+ \n | Report Exist | \n | | \n + Allocate Report Success v Expire \nInitial +--------> Assumed +------------+---> Used +--------> Expired\n ^ + + +\n | | | |\n | | | | Release\n | | | |\n | | | |\n +----------------+ +---------> Deleted = 機器已分配資源量 + 任務所需資源量,不滿足該條件則淘汰。資源種類包括 
CPU、內存、GPU、磁盤等。"}]}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"排序是對上一步篩選出的機器按照多個資源分配策略打分,對每臺機器都會計算一個 0~100 之間的分數,表示當前任務放到該機器的合適程度,其中 100 表示非常合適,0 表示非常不合適。每個不同的策略都有一個權重值,最終的分數爲權重和策略計算結果的乘積,而一個機器的分數就是所有策略計算結果的加和。比如有兩種優先級函數 priorityStrategy1 和 priorityStrategy2,對應的權重分別爲 weight1 和 weight2,那麼節點 X 的最終得分是:"}]}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"finalScoreHostX = (weight1 * priorityStrategy1) + (weight2 * priorityStrategy2)"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"優選是按照排序的結果選擇合適的機器分配給當前任務,這裏有兩點要考慮:"}]}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"numberedlist","attrs":{"start":1,"normalizeStart":1},"content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":1,"align":null,"origin":null},"content":[{"type":"text","text":"選取機器前先打散,確保任務被均勻調度到每臺機器上,避免多次選中同一臺機器。"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":2,"align":null,"origin":null},"content":[{"type":"text","text":"每次選擇多臺機器其餘留作備用,以應對資源分配衝突的情況。"}]}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"伴魚的許多業務都具有較爲明顯的業務高峯時段,例如教室上課、教學音視頻錄製、音視頻轉碼、實時性數據分析等。接入這些業務後勢必會導致調度系統在某個時間段內
調度併發數飆升,爲保證調度系統較高的吞吐率 ResourceManager 會一次接收多個 Scheduler 投遞的任務,並靈活的將同類型 Job 進行資源合併後再進行資源分配。例如:JobA 和 JobB 都需要 1Core 1G ,則先合併爲一個 2 Core 2G 的 Job 再進行資源分配。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"此外 Jarvis 是一個支持異構資源的調度系統,目前 ResourceManager 管理的機器資源包括物理機集羣、K8S 集羣、ECI、EKS 等。資源分配時會優先分配公司機器資源,公司機器資源不足時分配 ECI 等資源以保證可彈性擴容。"}]},{"type":"heading","attrs":{"align":null,"level":4},"content":[{"type":"text","text":"提交資源分配結果"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"通過上一步資源分配處理已經確定任務要被分配到哪臺機器,ResourceManager 依靠事務來提交資源分配結果以保證資源的一致性:在選定的機器上模擬扣減資源,並再次檢測任務類型的滿足性、親和性等。若所有條件滿足則進行數據更新,不滿足則發生資源分配衝突。爲了減少資源分配衝突後二次調度帶來的開銷,在資源分配衝突時直接從備選機器中選擇一臺機器進行分配處理。提交事務進行資源扣減,若成功則調度成功,若失敗則回滾。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"緩存與數據庫的數據同步方案:"}]},{"type":"image","attrs":{"src":"https:\/\/static001.infoq.cn\/resource\/image\/e8\/4e\/e8dc29a0e547b2c32da79e10f7bc2d4e.png","alt":null,"title":"","style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":"","fromPaste":false,"pastePass":false}},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"使用 Redis + Tidb,讀數據先讀 Redis,如果沒有讀取到改爲從 Tidb 讀取並更新到 
Redis。"}]}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"寫數據先寫 Redis(lua 腳本保證原子性),寫入失敗,則失敗重試,寫入成功,則把消息推入 mq。單獨啓動一個協程從 mq 取數據寫入數據庫,通過 mq 的重試機制保證消息最終會被寫入 Tidb,數據最終一致。"}]}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"其中修改機器資源的 lua 腳本片段如下:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"codeblock","attrs":{"lang":null},"content":[{"type":"text","text":"local hostData = redis.call(\"HGET\", hostDataKey, hostName)\nif hostData == nil or type(hostData) ~= \"string\" then\n return redisNil\nend\n\nlocal jsonHostData = cjson.decode(hostData)\n\nlocal hostCpu = jsonHostData[\"hostcpu\"]\nlocal hostMem = jsonHostData[\"hostmem\"]\nlocal allocatedCpu = jsonHostData[\"allocatedcpu\"]\nlocal allocatedMem = jsonHostData[\"allocatedmem\"]\n\nif allocatedCpu + jobCpu > hostCpu*maxPercent then\n return resourceInsufficient\nend\n\nif allocatedMem + jobMem > hostMem*maxPercent then\n return resourceInsufficient\nend\n\njsonHostData[\"allocatedcpu\"] = allocatedCpu + jobCpu\njsonHostData[\"allocatedmem\"] = allocatedMem + jobMem\njsonHostData[\"utime\"] = tonumber(uTime)\n\nlocal strHostData = cjson.encode(jsonHostData)\n\nlocal b = redis.call(\"HSET\", hostDataKey, hostName, strHostData)\nif b ~= 0 then\n local d = redis.call(\"HDEL\", hostDataKey, hostName)\n return redisFail\nend\n\nreturn 
success"}]},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"Worker"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Worker 是部署在資源節點上的代理,其核心功能是資源負載上報和管理宿主機執行具體任務。Worker 啓動後會將宿主機註冊到 Resource Manager 中,並定時上報 CPU、Memory 等信息。Scheduler 會將綁定資源成功的任務 dispatch 到指定的 Worker,Worker 會在宿主機上啓動執行對應的腳本或容器。任務結束後,Worker 需要將結果上報給 Job Manager。"}]},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"使用場景"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Jarvis 可以幫助業務方管理不同類型的資源,將任務與業務邏輯解耦,通過我們提供的接口就可以快速創建各種類型的任務。目前直播中臺已有多個算法離線分析預測任務正式接入 Jarvis 系統,文章開始提到的錄製任務也已經跑通了接入流程,正在逐步把線上流量慢慢遷移到 Jarvis 中。"}]},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"展望"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Jarvis 目前已基本實現了我們最初設計的全部核心功能,但仍有一些不足之處,需要後續持續優化改進。"}]}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Web 界面: 支持業務方可視化管理 Job,可視化編輯 DAG 工作流等。"}]}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"資源池: 
對於實時性要求極高的任務,需要開發資源池減少任務啓動時間。"}]}]}]},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"參考資料"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"[1] Omega: flexible, scalable schedulers for large compute clusters"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"[2] Wikipedia — Scheduling"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"[3] Apache Kafka, Purgatory, and Hierarchical Timing Wheels"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"[4] Apache Kafka Timer Implement Source Code"}]}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"作者:閆雲龍、宋園園"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"原文:https:\/\/tech.ipalfish.com\/blog\/2021\/06\/07\/jarvis\/"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"原文:伴魚分佈式調度系統 Jarvis 的設計與實現"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"來源:伴魚技術博客"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"轉載:著作權歸作者所有。商業轉載請聯繫作者獲得授權,非商業轉載請註明出處。"}]}]}