I have been following and then using Argo for almost a year now. For small algorithm workflows, Argo is simple and efficient. This article introduces Argo and takes a rough look at the code behind it (writing good Argo manifests requires understanding the underlying data structures). If you are just looking for a usage tutorial, go straight to the docs on GitHub, and search the issues when you hit problems; that is the best way.
Prerequisites
Comfortable with Kubernetes and Go
Algorithm workflows
Before discussing algorithm workflows, let's look at workflows in general. A workflow, simply put, is the sequence of work needed to complete one task.
A task like the one described in the figure below is a typical workflow.
Generally speaking, a workflow is a directed acyclic graph (DAG).
An algorithm workflow is also a DAG, in which each node is a step. Among the workflows I have worked with, some are as simple as a single step, while others consist of many steps. Every step has its own inputs and outputs, and together the steps form a complete algorithm pipeline.
An algorithm workflow is made of one or more algorithm steps, but for it to execute in the expected way, a companion scheduling service is needed to ensure the steps run in order.
For scheduling algorithm workflows, the community currently has two notable open-source projects: airflow and argo.
airflow
Project page: airflow
A workflow engine incubated under Apache; it can be deployed on bare metal or in a Kubernetes cluster.
Pros
- Workflows are code (written in Python)
- Ships with a UI, which makes tracking progress convenient
Cons
- The API exposes relatively little functionality, so integrating against it is fairly hard
- Limited Kubernetes support
argo
A workflow engine written in Go and built on Kubernetes CRDs. The community is very active, and its users include Google, IBM, and Nvidia. Because Argo is built on Kubernetes, it is clearly a better fit than Airflow for teams that already run a Kubernetes stack.
The discussion below is based on version 2.3.
A look at argo through two features
The two features I currently use most:
DAG or Steps based declaration of workflows
github.com/argoproj/argo/examples/dag-diamond-steps.yaml
# The following workflow executes a diamond workflow
#
#   A
#  / \
# B   C
#  \ /
#   D
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: dag-diamond-
spec:
  entrypoint: diamond
  templates:
  - name: diamond
    dag:
      tasks:
      - name: A
        template: echo
        arguments:
          parameters: [{name: message, value: A}]
      - name: B
        dependencies: [A]
        template: echo
        arguments:
          parameters: [{name: message, value: B}]
      - name: C
        dependencies: [A]
        template: echo
        arguments:
          parameters: [{name: message, value: C}]
      - name: D
        dependencies: [B, C]
        template: echo
        arguments:
          parameters: [{name: message, value: D}]
  - name: echo
    inputs:
      parameters:
      - name: message
    container:
      image: alpine:3.7
      command: [echo, "{{inputs.parameters.message}}"]
This is a diamond-shaped DAG workflow.
Stage 1: A executes.
Stage 2: B and C execute in parallel.
Stage 3: D is triggered only after both B and C have completed successfully. (Note: Argo currently only supports AND semantics for dependencies, not OR. If D should fire when either B or C completes, that requires custom development.)
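The staged execution above can be sketched as a small scheduling loop. This is an illustrative, stdlib-only sketch of AND-dependency scheduling, not Argo's actual controller code:

```go
package main

import (
	"fmt"
	"sort"
)

// stages derives execution waves from task dependencies: a task becomes
// runnable once ALL of its dependencies are done (AND semantics, as in Argo).
// Illustrative sketch only; assumes the graph is acyclic.
func stages(deps map[string][]string) [][]string {
	done := map[string]bool{}
	var out [][]string
	for len(done) < len(deps) {
		var ready []string
		for task, ds := range deps {
			if done[task] {
				continue
			}
			ok := true
			for _, d := range ds {
				if !done[d] {
					ok = false
				}
			}
			if ok {
				ready = append(ready, task)
			}
		}
		sort.Strings(ready) // map iteration order is random; sort for stable output
		for _, t := range ready {
			done[t] = true
		}
		out = append(out, ready)
	}
	return out
}

func main() {
	diamond := map[string][]string{"A": {}, "B": {"A"}, "C": {"A"}, "D": {"B", "C"}}
	for i, s := range stages(diamond) {
		fmt.Printf("stage %d: %v\n", i+1, s) // A, then B+C, then D
	}
}
```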
Workflow data structure
The data structure behind the Workflow manifest is defined in https://github.com/argoproj/argo/blob/release-2.3/pkg/apis/workflow/v1alpha1/types.go#58
Note: all file paths below are relative to the 2.3 release.
type Workflow struct {
	metav1.TypeMeta   `json:",inline"`
	metav1.ObjectMeta `json:"metadata"`
	Spec              WorkflowSpec   `json:"spec"`
	Status            WorkflowStatus `json:"status"`
}
This is Argo's top-level data structure: TypeMeta and ObjectMeta use the standard Kubernetes types, Spec holds the data the user fills in, and Status holds the data the workflow controller updates. Thanks to the Kubernetes OpenAPI machinery, it is a very elegant design.
Next comes WorkflowSpec. Only the more important fields are shown here, otherwise it would take too much space.
// WorkflowSpec is the specification of a Workflow.
type WorkflowSpec struct {
	// Templates is a list of workflow templates used in a workflow
	Templates []Template `json:"templates"`
	// Entrypoint is a template reference to the starting point of the workflow
	Entrypoint string `json:"entrypoint"`
	// Arguments contain the parameters and artifacts sent to the workflow entrypoint
	// Parameters are referencable globally using the 'workflow' variable prefix.
	// e.g. {{workflow.parameters.myparam}}
	Arguments Arguments `json:"arguments,omitempty"`
	// ServiceAccountName is the name of the ServiceAccount to run all pods of the workflow as.
	ServiceAccountName string `json:"serviceAccountName,omitempty"`
	...
}
Templates: defines the various templates. Because the design is declarative, in production these give you direct, effective information for diagnosing problems. After a workflow finishes, it is worth persisting its manifest so future troubleshooting is easier.
Entrypoint: the entry template of the DAG.
Arguments: parameters passed across steps, visible to all steps.
ImagePullSecrets: if the kubelet has no default registry credentials configured, ImagePullSecrets is required to pull Docker images.
ServiceAccountName: argo submit can specify a service account; if none is given, the namespace's default SA is used. In practice you usually do not use default, but provide an SA with the needed (higher) privileges.
Next, let's look at Template.
// Template is a reusable and composable unit of execution in a workflow
type Template struct {
	// Name is the name of the template
	Name string `json:"name"`
	// Inputs describe what inputs parameters and artifacts are supplied to this template
	Inputs Inputs `json:"inputs,omitempty"`
	// Outputs describe the parameters and artifacts that this template produces
	Outputs Outputs `json:"outputs,omitempty"`
	// NodeSelector is a selector to schedule this step of the workflow to be
	// run on the selected node(s). Overrides the selector set at the workflow level.
	NodeSelector map[string]string `json:"nodeSelector,omitempty"`
	// Affinity sets the pod's scheduling constraints
	// Overrides the affinity set at the workflow level (if any)
	Affinity *apiv1.Affinity `json:"affinity,omitempty"`
	// Metdata sets the pods's metadata, i.e. annotations and labels
	Metadata Metadata `json:"metadata,omitempty"`
	// Deamon will allow a workflow to proceed to the next step so long as the container reaches readiness
	Daemon *bool `json:"daemon,omitempty"`
	// Steps define a series of sequential/parallel workflow steps
	Steps [][]WorkflowStep `json:"steps,omitempty"`
	// Container is the main container image to run in the pod
	Container *apiv1.Container `json:"container,omitempty"`
	// Script runs a portion of code against an interpreter
	Script *ScriptTemplate `json:"script,omitempty"`
	// Resource template subtype which can run k8s resources
	Resource *ResourceTemplate `json:"resource,omitempty"`
	// DAG template subtype which runs a DAG
	DAG *DAGTemplate `json:"dag,omitempty"`
	...
}
Worth noting is the Container field, which uses the Kubernetes API type directly. If you need to configure ImagePullPolicy, this is the field where it goes.
Artifact support - S3
This example uses the global artifact configuration. To run it, the artifact repository credentials must be configured in the controller's ConfigMap.
github.com/argoproj/argo/examples/artifact-passing.yaml
# This example demonstrates the ability to pass artifacts
# from one step to the next.
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: artifact-passing-
spec:
  entrypoint: artifact-example
  templates:
  - name: artifact-example
    steps:
    - - name: generate-artifact
        template: whalesay
    - - name: consume-artifact
        template: print-message
        arguments:
          artifacts:
          - name: message
            from: "{{steps.generate-artifact.outputs.artifacts.hello-art}}"
  - name: whalesay
    container:
      image: docker/whalesay:latest
      command: [sh, -c]
      args: ["sleep 1; cowsay hello world | tee /tmp/hello_world.txt"]
    outputs:
      artifacts:
      - name: hello-art
        path: /tmp/hello_world.txt
  - name: print-message
    inputs:
      artifacts:
      - name: message
        path: /tmp/message
    container:
      image: alpine:latest
      command: [sh, -c]
      args: ["cat /tmp/message"]
Stage 1: the generate-artifact step runs and creates a file:
cowsay hello world | tee /tmp/hello_world.txt
Stage 2: the print-message step consumes the file created in stage 1 and cats it:
cat /tmp/message
This is a simple artifact-passing example. If you want to understand how it is implemented, see the section on argo's scheduling unit, the pod.
Architecture
The design is fairly clear and simple: two informers watch their respective resources, and the interaction between pods and wf (Workflow) objects relies mainly on the two containers started from argoexec, which are covered below.
The three Argo images used in my local test environment:
argoproj/workflow-controller v2.3.0 d0f63f453544 8 months ago 37.5 MB
argoproj/argoexec v2.3.0 c5ceebf7886f 8 months ago 286 MB
argoproj/argoui v2.3.0 64f7e1f854a6 10 months ago 183 MB
The UI image is of limited use; try it out, but it is unlikely to reach production, since teams generally do not adopt a third-party UI like this (it depends on the team's frontend stack).
So only two images really matter:
- argoexec: runs the init and wait containers added to each pod
- workflow-controller: watches the CRD
Deploying Argo is fairly simple, the images involved are few, and the code is easy to trace.
Pros
- Provides artifact storage; artifacts currently support S3, HDFS, and other backends
- Provides an Argo client, which keeps the barrier for building on top of it low
Cons
- Scheduling is done at pod granularity, which gets awkward for large model training jobs (dozens of nodes)
argo's scheduling unit: the pod
One step of a workflow corresponds to one ordinary pod. The Template structure the user supplies is, in effect, a complete container launch spec.
Passing parameters between steps is simple: when creating the pod, inject the parameters with Go templating. File injection is harder. You can either bake SDK-like tooling into the image at build time, or copy files in dynamically; Argo copies files in dynamically.
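Parameter injection can be pictured as plain string substitution over the serialized template (Argo 2.x does fasttemplate-style {{...}} replacement rather than Go text/template field lookup). The sketch below is stdlib-only and illustrative, not Argo's exact code:

```go
package main

import (
	"fmt"
	"strings"
)

// substitute performs parameter injection as plain string replacement over a
// serialized template fragment. Illustrative sketch of the idea only.
func substitute(manifest string, params map[string]string) string {
	for name, val := range params {
		tag := "{{inputs.parameters." + name + "}}"
		manifest = strings.ReplaceAll(manifest, tag, val)
	}
	return manifest
}

func main() {
	step := `command: [echo, "{{inputs.parameters.message}}"]`
	fmt.Println(substitute(step, map[string]string{"message": "A"}))
	// command: [echo, "A"]
}
```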
An algorithm step usually has inputs and outputs. Argo's approach is to insert two extra containers into the pod (sidecar containers), both running the argoexec image. The argoexec functionality is implemented in argoproj/argo/cmd/argoexec/main.go:16
With these two helper containers, handling input and output files becomes simple: a Volume is visible across the whole pod, so it is just a matter of mounting the right paths (details below). This way the user's own container stays as untouched as possible. The pod ends up with three containers:
main container: where the actual business logic runs
Next, let's look at how the two Argo-related helper containers are implemented.
init container:
Its main job is to prepare the artifacts.
The entry point is github.com/argoproj/argo/cmd/argoexec/commands/init.go:34
func loadArtifacts() error {
	wfExecutor := initExecutor()
	defer wfExecutor.HandleError()
	defer stats.LogStats()
	// Download input artifacts
	err := wfExecutor.StageFiles()
	if err != nil {
		wfExecutor.AddError(err)
		return err
	}
	err = wfExecutor.LoadArtifacts()
	if err != nil {
		wfExecutor.AddError(err)
		return err
	}
	return nil
}
StageFiles stages script/resource content; that is another of Argo's core features and is worth tracing through the code.
LoadArtifacts pulls the input artifacts; it is implemented in
github.com/argoproj/argo/workflow/executor/executor.go:124
// LoadArtifacts loads artifacts from location to a container path
func (we *WorkflowExecutor) LoadArtifacts() error {
	log.Infof("Start loading input artifacts...")
	for _, art := range we.Template.Inputs.Artifacts {
		log.Infof("Downloading artifact: %s", art.Name)
		if !art.HasLocation() {
			if art.Optional {
				log.Warnf("Ignoring optional artifact '%s' which was not supplied", art.Name)
				continue
			} else {
				return errors.New("required artifact %s not supplied", art.Name)
			}
		}
		artDriver, err := we.InitDriver(art)
		if err != nil {
			return err
		}
		// Determine the file path of where to load the artifact
		if art.Path == "" {
			return errors.InternalErrorf("Artifact %s did not specify a path", art.Name)
		}
		var artPath string
		mnt := common.FindOverlappingVolume(&we.Template, art.Path)
		if mnt == nil {
			artPath = path.Join(common.ExecutorArtifactBaseDir, art.Name)
		} else {
			// If we get here, it means the input artifact path overlaps with an user specified
			// volumeMount in the container. Because we also implement input artifacts as volume
			// mounts, we need to load the artifact into the user specified volume mount,
			// as opposed to the `input-artifacts` volume that is an implementation detail
			// unbeknownst to the user.
			log.Infof("Specified artifact path %s overlaps with volume mount at %s. Extracting to volume mount", art.Path, mnt.MountPath)
			artPath = path.Join(common.ExecutorMainFilesystemDir, art.Path)
		}
		// The artifact is downloaded to a temporary location, after which we determine if
		// the file is a tarball or not. If it is, it is first extracted then renamed to
		// the desired location. If not, it is simply renamed to the location.
		tempArtPath := artPath + ".tmp"
		err = artDriver.Load(&art, tempArtPath)
		if err != nil {
			return err
		}
		if isTarball(tempArtPath) {
			err = untar(tempArtPath, artPath)
			_ = os.Remove(tempArtPath)
		} else {
			err = os.Rename(tempArtPath, artPath)
		}
		if err != nil {
			return err
		}
		log.Infof("Successfully download file: %s", artPath)
		if art.Mode != nil {
			err = os.Chmod(artPath, os.FileMode(*art.Mode))
			if err != nil {
				return errors.InternalWrapError(err)
			}
		}
	}
	return nil
}
InitDriver: initializes the artifact driver; S3, HDFS, Git, HTTP, and others are currently supported.
Once the files are downloaded, how do they reach the user's container? Again using the artifact-passing example: it has two steps, hence two pods, and the second step consumes the first step's output artifact while also running the normal wait container, so it is enough to look at the second step (the first is similar).
This is the output of argo submit with --watch:
STEP PODNAME DURATION MESSAGE
✔ artifact-passing-bgdjp
├---✔ generate-artifact artifact-passing-bgdjp-2525271161 9s
└---✔ consume-artifact artifact-passing-bgdjp-2453013743 16s
describe the pod artifact-passing-bgdjp-2453013743, which outputs:
Name: artifact-passing-bgdjp-2453013743
Namespace: argo
Priority: 0
Node: c-pc/192.168.52.128
Start Time: Sat, 15 Feb 2020 23:42:57 +0800
Labels: workflows.argoproj.io/completed=true
workflows.argoproj.io/workflow=artifact-passing-bgdjp
Annotations: cni.projectcalico.org/podIP: 100.64.201.217/32
workflows.argoproj.io/node-name: artifact-passing-bgdjp[1].consume-artifact
workflows.argoproj.io/template:
{"name":"print-message","inputs":{"artifacts":[{"name":"message","path":"/tmp/message","s3":{"endpoint":"argo-artifacts-minio.default:9000...
Status: Succeeded
IP: 100.64.201.217
IPs:
IP: 100.64.201.217
Controlled By: Workflow/artifact-passing-bgdjp
Init Containers:
init:
Container ID: docker://b646cd069376014db11b36bc14ff3b5a3ad95be1f0c7b974e3892f002915ce21
Image: argoproj/argoexec:v2.3.0
Image ID: docker-pullable://argoproj/argoexec@sha256:85132fc2c8bc373fca885df17637d5d35682a23de8d1390668a5e1c149f2f187
Port: <none>
Host Port: <none>
Command:
argoexec
init
State: Terminated
Reason: Completed
Exit Code: 0
Started: Sat, 15 Feb 2020 23:42:58 +0800
Finished: Sat, 15 Feb 2020 23:42:58 +0800
Ready: True
Restart Count: 0
Environment:
ARGO_POD_NAME: artifact-passing-bgdjp-2453013743 (v1:metadata.name)
Mounts:
/argo/inputs/artifacts from input-artifacts (rw)
/argo/podmetadata from podmetadata (rw)
/argo/secret/argo-artifacts-minio from argo-artifacts-minio (ro)
/var/run/secrets/kubernetes.io/serviceaccount from argo-admin-account-token-xbr9t (ro)
Containers:
wait:
Container ID: docker://d3b40cf3b2ee407e5508d1d528ca853e59cc09f42bcb20562a121b3343f29665
Image: argoproj/argoexec:v2.3.0
Image ID: docker-pullable://argoproj/argoexec@sha256:85132fc2c8bc373fca885df17637d5d35682a23de8d1390668a5e1c149f2f187
Port: <none>
Host Port: <none>
Command:
argoexec
wait
State: Terminated
Reason: Completed
Exit Code: 0
Started: Sat, 15 Feb 2020 23:42:59 +0800
Finished: Sat, 15 Feb 2020 23:43:13 +0800
Ready: False
Restart Count: 0
Environment:
ARGO_POD_NAME: artifact-passing-bgdjp-2453013743 (v1:metadata.name)
Mounts:
/argo/podmetadata from podmetadata (rw)
/argo/secret/argo-artifacts-minio from argo-artifacts-minio (ro)
/mainctrfs/tmp/message from input-artifacts (rw,path="message")
/var/run/docker.sock from docker-sock (ro)
/var/run/secrets/kubernetes.io/serviceaccount from argo-admin-account-token-xbr9t (ro)
main:
Container ID: docker://a2feca1462a0336c5501ce2a5e1e55486d8bf35e4da4a661442081788c848d4c
Image: alpine:latest
Image ID: docker-pullable://alpine@sha256:ab00606a42621fb68f2ed6ad3c88be54397f981a7b70a79db3d1172b11c4367d
Port: <none>
Host Port: <none>
Command:
sh
-c
Args:
cat /tmp/message
State: Terminated
Reason: Completed
Exit Code: 0
Started: Sat, 15 Feb 2020 23:43:13 +0800
Finished: Sat, 15 Feb 2020 23:43:13 +0800
Ready: False
Restart Count: 0
Environment: <none>
Mounts:
/tmp/message from input-artifacts (rw,path="message")
/var/run/secrets/kubernetes.io/serviceaccount from argo-admin-account-token-xbr9t (ro)
Conditions:
Type Status
Initialized True
Ready False
ContainersReady False
PodScheduled True
Volumes:
podmetadata:
Type: DownwardAPI (a volume populated by information about the pod)
Items:
metadata.annotations -> annotations
docker-sock:
Type: HostPath (bare host directory volume)
Path: /var/run/docker.sock
HostPathType: Socket
argo-artifacts-minio:
Type: Secret (a volume populated by a Secret)
SecretName: argo-artifacts-minio
Optional: false
input-artifacts:
Type: EmptyDir (a temporary directory that shares a pod's lifetime)
Medium:
SizeLimit: <unset>
argo-admin-account-token-xbr9t:
Type: Secret (a volume populated by a Secret)
SecretName: argo-admin-account-token-xbr9t
Optional: false
QoS Class: BestEffort
Node-Selectors: <none>
Tolerations: node.kubernetes.io/not-ready:NoExecute for 300s
node.kubernetes.io/unreachable:NoExecute for 300s
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled <unknown> default-scheduler Successfully assigned argo/artifact-passing-bgdjp-2453013743 to c-pc
Normal Pulled 18h kubelet, c-pc Container image "argoproj/argoexec:v2.3.0" already present on machine
Normal Created 18h kubelet, c-pc Created container init
Normal Started 18h kubelet, c-pc Started container init
Normal Pulled 18h kubelet, c-pc Container image "argoproj/argoexec:v2.3.0" already present on machine
Normal Created 18h kubelet, c-pc Created container wait
Normal Started 18h kubelet, c-pc Started container wait
Normal Pulling 18h kubelet, c-pc Pulling image "alpine:latest"
Normal Pulled 18h kubelet, c-pc Successfully pulled image "alpine:latest"
Normal Created 18h kubelet, c-pc Created container main
Normal Started 18h kubelet, c-pc Started container main
With real pod information at hand, the pod design becomes intuitive. There are three containers in total:
Init Containers: a single init container
Containers: the main and wait containers
First, look at the pod's volumes:
Volumes:
podmetadata: the pod's metadata, used to associate the pod with its step
docker-sock: used to support docker in docker; see the Docker website for this usage pattern. Briefly: Docker runs as a server and a client, and docker in docker here only maps the client socket, while the server is still the host's dockerd. Mounting the host's Docker socket means precise operations can be performed on the containers.
argo-artifacts-minio: the MinIO credentials Secret that was created, containing the access key and secret key
input-artifacts: the first step's output and the second step's input
argo-admin-account-token-xbr9t: the Kubernetes token; the wait container needs to talk to the API server, so the token must be mounted
Take a closer look at the input-artifacts volume:
input-artifacts:
Type: EmptyDir (a temporary directory that shares a pod's lifetime)
Medium:
SizeLimit: <unset>
Since it is an EmptyDir, its path on the host is of no concern; the directory's lifecycle is the same as the pod's.
The same volume can be mounted at different paths in different containers. From the describe output, the mount paths are:
init: /argo/inputs/artifacts
main: /tmp/message
wait: /mainctrfs/tmp/message
With this, the LoadArtifacts code above basically explains itself:
artPath = path.Join(common.ExecutorArtifactBaseDir, art.Name)
The value of the ExecutorArtifactBaseDir constant is /argo/inputs/artifacts.
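Putting the mount table and the code together, the path decision can be sketched like this (stdlib-only; function and argument names are mine, not Argo's, but the two directory constants match 2.3):

```go
package main

import (
	"fmt"
	"path"
	"strings"
)

// Values of the two constants in the 2.3 executor.
const (
	executorArtifactBaseDir   = "/argo/inputs/artifacts"
	executorMainFilesystemDir = "/mainctrfs"
)

// artifactDest sketches the path decision in LoadArtifacts: if the artifact
// path falls under a user-specified volumeMount, extract into that volume
// (seen from the executor side under /mainctrfs); otherwise stage it in the
// shared input-artifacts directory.
func artifactDest(artName, artPath string, userMounts []string) string {
	for _, m := range userMounts {
		if artPath == m || strings.HasPrefix(artPath, m+"/") {
			return path.Join(executorMainFilesystemDir, artPath)
		}
	}
	return path.Join(executorArtifactBaseDir, artName)
}

func main() {
	// No user mount overlaps /tmp/message: stage under /argo/inputs/artifacts,
	// and kubelet mounts it into main at /tmp/message via the shared EmptyDir.
	fmt.Println(artifactDest("message", "/tmp/message", nil))
	// With a user volumeMount at /tmp, the artifact must land inside that volume.
	fmt.Println(artifactDest("message", "/tmp/message", []string{"/tmp"}))
}
```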
wait container:
It has two jobs: watch for the main process to finish, so that the workflow can move on, and save the output artifacts.
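The two jobs amount to a poll-then-persist loop. The sketch below is illustrative only; the real 2.3 executor observes the main container through the mounted docker.sock rather than through callbacks like these:

```go
package main

import (
	"fmt"
	"time"
)

// waitForMain sketches the wait container's control loop: poll until the
// main container has exited, then persist its outputs. Illustrative sketch,
// not Argo's code.
func waitForMain(mainExited func() bool, saveOutputs func() error) error {
	for !mainExited() {
		time.Sleep(10 * time.Millisecond) // the real poll interval differs
	}
	return saveOutputs()
}

func main() {
	polls := 0
	err := waitForMain(
		func() bool { polls++; return polls >= 3 }, // pretend main exits on the 3rd poll
		func() error { fmt.Println("uploading output artifacts"); return nil },
	)
	if err != nil {
		panic(err)
	}
}
```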
TODO
Some practical lessons
- To protect cluster resources and security, a namespace install is enough when deploying Argo; besides, the Kubernetes admins will not grant you cluster-scope permissions anyway.
- S3 artifact credentials can be configured in the global ConfigMap, using a single tenant's bucket. If artifacts of different workflows need to be isolated, the credentials must be specified inside each workflow. Since the workflow template then carries sensitive information, pay attention to information security.
- When starting out with Argo, remember that ImagePullPolicy can be specified; otherwise, with the default IfNotPresent policy, you may find your updated image never takes effect.
- If the algorithm scripts need a large number of pods running in parallel, Argo may not be a good fit; in that case the only option is a home-grown workflow engine. As said at the beginning, Argo suits small workflows best.