[Source Code Analysis] Deep Learning Distributed Training Framework Horovod (18) --- kubeflow tf-operator

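0x00 Abstract

Horovod is a distributed training framework based on AllReduce. Thanks to its support for mainstream deep learning frameworks such as TensorFlow and PyTorch, and to its communication optimizations, Horovod is widely used for data-parallel training.

Over the previous dozen or so articles we analyzed Horovod piece by piece. The next mountain to climb is Horovod on K8S.

The purpose of this article and the next few is to use the study of Horovod on K8S as a way to sort through the related concepts and, hopefully, to uncover the design ideas behind it. The approach is to collect and study many articles from the web and then analyze the code myself. My sincere thanks to all of those authors.

This article is the appetizer and a prerequisite for Horovod on K8S: it introduces the relevant concepts and the tf-operator from the kubeflow community.

Links to the other articles in this series: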

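[Source Code Analysis] Deep Learning Distributed Training Framework Horovod (1) --- Basics

[Source Code Analysis] Deep Learning Distributed Training Framework Horovod (2) --- From the User's Perspective

[Source Code Analysis] Deep Learning Distributed Training Framework Horovod (3) --- What Horovodrun Does Behind the Scenes

[Source Code Analysis] Deep Learning Distributed Training Framework Horovod (4) --- Network Basics & Driver

[Source Code Analysis] Deep Learning Distributed Training Framework Horovod (5) --- Integrating with Frameworks

[Source Code Analysis] Deep Learning Distributed Training Framework Horovod (6) --- Background Thread Architecture

[Source Code Analysis] Deep Learning Distributed Training Framework Horovod (7) --- DistributedOptimizer

[Source Code Analysis] Deep Learning Distributed Training Framework Horovod (8) --- on spark

[Source Code Analysis] Deep Learning Distributed Training Framework Horovod (9) --- Launching on spark

[Source Code Analysis] Deep Learning Distributed Training Framework Horovod (10) --- run on spark

[Source Code Analysis] Deep Learning Distributed Training Framework Horovod (11) --- on spark --- GLOO Approach

[Source Code Analysis] Deep Learning Distributed Training Framework Horovod (12) --- Overall Architecture of Elastic Training

[Source Code Analysis] Deep Learning Distributed Training Framework Horovod (13) --- Elastic Training: Driver

[Source Code Analysis] Deep Learning Distributed Training Framework Horovod (14) --- Elastic Training: Node Discovery & State

[Source Code Analysis] Deep Learning Distributed Training Framework Horovod (15) --- Broadcast & Notification

[Source Code Analysis] Deep Learning Distributed Training Framework Horovod (16) --- Elastic Training: Worker Lifecycle

[Source Code Analysis] Deep Learning Distributed Training Framework Horovod (17) --- Elastic Training: Fault Tolerance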

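0x01 Background

1.1 Kubernetes

Kubernetes, abbreviated K8s (the "8" stands for the eight characters "ubernete"), is an open-source system for managing containerized applications across multiple hosts in a cloud platform. Its goal is to make deploying containerized applications simple and efficient, and it provides mechanisms for application deployment, scheduling, updating, and maintenance.

Kubernetes is an increasingly popular option for training deep neural networks, because containers give it the flexibility to use different machine learning frameworks and the agility to scale on demand.

When the model is complex or the dataset is large, the compute capacity of a single machine is often not enough. With a distributed training framework such as Alibaba's AiACC or the community's Horovod, a single-machine training job can be extended into a distributed one by changing only a few lines of code.

On Kubernetes, the common choices are the kubeflow community's tf-operator, which supports the TensorFlow PS mode, and mpi-operator, which supports Horovod's MPI allreduce mode.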

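1.2 Containers as the Scheduling Unit

Why use containers as the scheduling unit of a deep learning system? Because containers are quick to pull and start, and they isolate resources well. In the abstract, the container image can be distributed, scheduled, and executed as part of the job. Containerization does, of course, come with some performance cost for GPUs, networking, and so on.

For example, NVIDIA GPUs are supported in Docker: nvidia-docker can stand in for docker when executing create and run operations. [Figure: nvidia-docker architecture]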

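1.3 Kubeflow

Kubeflow is an open-source, Kubernetes-native platform for developing, orchestrating, deploying, and running scalable, portable machine learning workloads. Kubeflow can run on any Kubernetes cluster.

Kubeflow is good at managing multi-machine jobs. Its name is simple: Kubernetes + TensorFlow. It is a machine learning toolkit, a technology stack that runs on top of K8s. The stack contains many components that are only loosely coupled, so they can be used together or individually.

Kubeflow asks Kubernetes which machines it plans to allocate for running each process of a distributed job, and then tells every process the IP address and port of all the other processes. This ensures that all the processes in a job know about one another.

Why must all processes know about each other? Because TensorFlow's ps-based distribution requires it. The native distributed training of TensorFlow 1.x has every process in a job run the TensorFlow 1.x runtime. These processes communicate and coordinate with one another to form a "distributed runtime" that interprets and executes the computation graph describing the deep learning computation. At the start of distributed training, the TensorFlow runtime splits the graph into subgraphs, and each process executes one subgraph. If any single process fails (for example, because it was preempted by a higher-priority job), execution of the whole graph fails. TensorFlow's native distributed training is therefore not fault-tolerant. It is, however, fault-recoverable: the TensorFlow API provides checkpointing, so if a job fails it can be restarted and resume from the most recent checkpoint.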

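1.4 TensorFlow on Kubeflow

Kubeflow supports two different approaches to distributed training with the TensorFlow framework.

The first is the native TensorFlow architecture, which relies on centralized parameter servers to coordinate the workers.
The second is a decentralized approach in which workers communicate directly with one another through MPI AllReduce primitives, without parameter servers. NVIDIA's NCCL library already implements most of the MPI-style collective primitives efficiently on GPUs, and Uber's Horovod makes multi-GPU and multi-node training with TensorFlow straightforward. Compared with parameter servers, the second approach makes better use of bandwidth and scales better.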

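1.5 Operator

Operator is a Kubernetes concept, used mainly to package, deploy, and manage a user's workloads.

An Operator can be understood simply as CRD + Controller.

CRD (Custom Resource Definition) is Kubernetes' extension mechanism for defining user-defined resource types.
The Controller watches the custom resources that the CRD defines and acts on them.

To use a Java analogy: the operator is a class, the CRD is the class's member variables, and the Controller is the class's member methods.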

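1.6 TF-Operator

Kubeflow provides a large set of components covering every aspect of machine learning, but model training is surely its most important function. Kubeflow offers training support for a variety of machine learning frameworks by defining a variety of Operators. Their main purpose is to manage machine learning or deep learning jobs: how to manage and maintain a job's multiple nodes, how to manage the lifecycle of Pods and jobs, how to handle failures, and so on.

TF-Operator is the open-source community's extension API built on K8S that provides TensorFlow training support. As the name suggests, the implementation works in a Job-like way. Its characteristics are:

Provides multi-machine training with TensorFlow's native PS-worker architecture
Recommends starting PS and workers together
Uses Services for service discovery
The earliest Operator in the community

Because TF-Operator is the earliest Operator in the community, it is worth looking at first.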

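0x02 Distributed TensorFlow

TF-Operator exists to support the TensorFlow PS mode, so we first introduce distributed TensorFlow.

2.1 Parameter Server Architecture

In the parameter server architecture (PS architecture), the nodes of the cluster fall into two classes: parameter servers and workers. The parameter servers hold the model parameters, while the workers compute the gradients of the parameters. In each iteration, a worker fetches the parameters from the parameter servers and sends the computed gradients back; the parameter servers aggregate the gradients returned by the workers, update the parameters, and broadcast the new parameters to the workers.

The PS-worker architecture can update gradients in two ways, synchronously or asynchronously:

In synchronous training, all workers train on different mini-batches of the same batch. Only after every device has finished computing the gradients for that batch does the model perform one parameter update using all of the gradients, after which the PS sends the updated model to each device.

In asynchronous training, no device waits for the gradient computation or parameter updates of any other device; every device computes independently and pushes its gradients to the central node (the PS). Asynchronous training is generally much faster overall, but it suffers from a serious problem, stale gradients. All devices start from the same parameters, but in the asynchronous case a device that finishes a training step may find that the model parameters have already been updated by other devices, so the gradient it has just computed is out of date.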
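
To make the difference between the two update modes concrete, here is a minimal pure-Python sketch. It is not TensorFlow code: the toy loss 0.5 * (w - target)^2 and the function names are assumptions made purely for illustration. In the asynchronous case both workers compute against the same old snapshot, so the second gradient is already stale when it is applied.

```python
def grad(w, target):
    """Gradient of the toy loss 0.5 * (w - target) ** 2 with respect to w."""
    return w - target


def sync_step(w, targets, lr):
    """Synchronous update: every worker computes a gradient from the SAME
    parameters, and the PS applies the averaged gradient exactly once."""
    grads = [grad(w, t) for t in targets]
    return w - lr * sum(grads) / len(grads)


def async_step(w, targets, lr):
    """Asynchronous update: every worker pulled the same snapshot of w, but the
    PS applies each gradient as it arrives, so later gradients are stale."""
    snapshot = w
    for t in targets:
        # The gradient is computed against the old snapshot, not the current w.
        w = w - lr * grad(snapshot, t)
    return w


# Two workers whose mini-batches pull the parameter toward 1.0 and 3.0.
w_sync = sync_step(w=0.0, targets=[1.0, 3.0], lr=0.5)
w_async = async_step(w=0.0, targets=[1.0, 3.0], lr=0.5)
print(w_sync, w_async)
```

With these numbers the synchronous step moves w to the average direction once, while the asynchronous step applies two full updates based on outdated parameters and moves twice as far.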

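2.2 TensorFlow PS-Worker

2.2.1 Architecture

This is only a rough introduction, mainly for comparison with TF-Operator.

TF divides Jobs mainly into Parameter Server and Worker (different TF versions add their own special roles, such as master or chief).

Parameter Job: runs the model-related work, including storing, distributing, aggregating, and updating the model parameters. As the server side of distributed training, it waits for the terminals (supervisors) to connect.
Worker Job: referred to as supervisors in TensorFlow's code comments; runs the training-related work, including forward computation and gradient computation. If there are too many parameters for a single machine to handle, multiple Tasks are needed (seen dynamically, a Task is a process on a host; seen statically, a Task is the code we write).
Chief supervisor: among the many compute terminals, one must be chosen as the chief. It is the first of the terminals to start, and its job is to merge the learned parameters computed by each terminal and save them.
Cluster: the collection of Jobs, that is, the cluster system as a whole.

Every role has a unique network identity: it runs on a machine with a different IP, or on the same host with a different port.

In practice, the graph-construction code must be identical for every role. The flow of the PS-worker architecture is roughly:

pull: each worker pulls the latest model parameters from the PS according to the topology of the dataflow graph
feed: each worker feeds in a different batch of data
compute: each worker computes gradients from the same model parameters and a different batch, producing different gradient values
push: each worker uploads its computed gradients to the PS
update: the PS collects the gradients from all workers, averages them, and updates the model parameters.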
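
The five steps above can be sketched in a few lines of plain Python. This is a single-process simulation under stated assumptions, not TensorFlow's implementation: the ParameterServer class, the worker_gradient helper, and the toy model y = w * x are all hypothetical names introduced only for this illustration.

```python
class ParameterServer:
    """Holds one scalar parameter and averages the gradients pushed to it."""

    def __init__(self, w):
        self.w = w
        self.pending = []

    def pull(self):
        return self.w

    def push(self, gradient):
        self.pending.append(gradient)

    def update(self, lr):
        # Average all gradients received in this step, then apply them once.
        average = sum(self.pending) / len(self.pending)
        self.w -= lr * average
        self.pending.clear()


def worker_gradient(w, batch):
    """Gradient of the mean squared error of y = w * x on one mini-batch."""
    return sum(2 * (w * x - y) * x for x, y in batch) / len(batch)


ps = ParameterServer(w=0.0)
# One mini-batch per worker; both are consistent with the true value w = 2.
batches = [[(1.0, 2.0)], [(2.0, 4.0)]]

for _ in range(50):
    for batch in batches:                   # feed: each worker has its own batch
        w = ps.pull()                       # pull: same parameters for everyone
        ps.push(worker_gradient(w, batch))  # compute + push
    ps.update(lr=0.1)                       # update: average, then apply

print(round(ps.w, 3))
```

Because update runs only after every worker has pushed, this corresponds to the synchronous mode; the parameter converges toward 2.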

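2.2.2 Code

The logic is as follows:

A Task needs to know which hosts are in the cluster and which ports they listen on. tf.train.ClusterSpec() describes exactly this.
This Cluster has two Jobs (worker and ps), and the worker Job has three Tasks (that is, three Tasks execute TensorFlow ops).
The ClusterSpec is passed as an argument to tf.train.Server(), together with this Task's job_name and task_index.
Because the same code runs on different hosts, job_name and task_index are passed in to tell the tasks apart, while ps_hosts and worker_hosts are the same for every host and describe the cluster.
A tf.train.Server contains a set of local devices (GPUs, CPUs), can connect to the ip:port of other tasks (stored in the cluster), and has a session target for running distributed operations. Most importantly, it creates a server that listens on its port; when data arrives, it executes locally (starting the session target and invoking local devices to run the computation) and returns the result to the caller.
To keep the ps server listening, we call server.join(), which blocks the process at that point. The ps server can join right after it is created because the code that follows assigns the parameters to the ps server for safekeeping, so the ps server only needs to sit and listen.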

import tensorflow as tf  # TensorFlow 1.x API

# To build a cluster with two ps jobs on hosts ps0 and ps1, and 3 worker
# jobs on hosts worker0, worker1 and worker2.
cluster_spec = {
    "ps": ["ps0:2222", "ps1:2222"],
    "worker": ["worker0:2222", "worker1:2222", "worker2:2222"]}

# Create a cluster from the parameter server and worker hosts.
cluster = tf.train.ClusterSpec(cluster_spec)

# Create and start a server for the local task.
# FLAGS holds the command-line flags job_name and task_index.
server = tf.train.Server(cluster,
                         job_name=FLAGS.job_name,
                         task_index=FLAGS.task_index)

if FLAGS.job_name == "ps":
    server.join()

A slightly more complete version of the code:

def main(_):
  ps_hosts = FLAGS.ps_hosts.split(",")
  worker_hosts = FLAGS.worker_hosts.split(",")

  # Create a cluster from the parameter server and worker hosts.
  cluster = tf.train.ClusterSpec({"ps": ps_hosts, "worker": worker_hosts})

  # Create and start a server for the local task.
  server = tf.train.Server(cluster,
                           job_name=FLAGS.job_name,
                           task_index=FLAGS.task_index)

  if FLAGS.job_name == "ps":
    server.join()
  elif FLAGS.job_name == "worker":
    # Find the chief worker, i.e. the task whose task_index is 0.
    is_chief = (FLAGS.task_index == 0)

    # Assigns ops to the local worker by default.
    with tf.device(tf.train.replica_device_setter(
        worker_device="/job:worker/task:%d" % FLAGS.task_index,
        cluster=cluster)):
      # Compute: build the model and the training op here.
      pass

Run it as follows. As you can see, we only need to write one program and run it on different hosts with different arguments:

# On ps0.example.com:
$ python trainer.py \
     --ps_hosts=ps0.example.com:2222,ps1.example.com:2222 \
     --worker_hosts=worker0.example.com:2222,worker1.example.com:2222 \
     --job_name=ps --task_index=0
# On ps1.example.com:
$ python trainer.py \
     --ps_hosts=ps0.example.com:2222,ps1.example.com:2222 \
     --worker_hosts=worker0.example.com:2222,worker1.example.com:2222 \
     --job_name=ps --task_index=1
# On worker0.example.com:
$ python trainer.py \
     --ps_hosts=ps0.example.com:2222,ps1.example.com:2222 \
     --worker_hosts=worker0.example.com:2222,worker1.example.com:2222 \
     --job_name=worker --task_index=0
# On worker1.example.com:
$ python trainer.py \
     --ps_hosts=ps0.example.com:2222,ps1.example.com:2222 \
     --worker_hosts=worker0.example.com:2222,worker1.example.com:2222 \
     --job_name=worker --task_index=1
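
Launching every task by hand like this is exactly the chore an operator automates. As a rough illustration, the following Python sketch derives the per-host command lines from the two host lists. The launch_commands helper is hypothetical and only mirrors the flags of the trainer.py example above.

```python
def launch_commands(ps_hosts, worker_hosts, script="trainer.py"):
    """Return a dict mapping each host to the command that starts its task.

    Every task receives the same cluster description and differs only in
    --job_name and --task_index.
    """
    common = "--ps_hosts={} --worker_hosts={}".format(
        ",".join(ps_hosts), ",".join(worker_hosts))
    commands = {}
    for job_name, hosts in (("ps", ps_hosts), ("worker", worker_hosts)):
        for task_index, host in enumerate(hosts):
            commands[host] = "python {} {} --job_name={} --task_index={}".format(
                script, common, job_name, task_index)
    return commands


commands = launch_commands(
    ps_hosts=["ps0.example.com:2222", "ps1.example.com:2222"],
    worker_hosts=["worker0.example.com:2222", "worker1.example.com:2222"],
)
for host, command in commands.items():
    print("# On {}:\n$ {}".format(host, command))
```

The cluster description is identical in every command, which is why it can later be generated once and injected into each Pod.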

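0x03 TF-Operator

3.1 TF-Operator Design

Now that we have a rough idea of how distributed TF works, let us look at the design of TF-Operator.

The following is translated from "Design Doc TFJob K8s CRD".

The goal is to make it easy to run TensorFlow training (and distributed training in particular) on Kubernetes (K8s). I propose doing this by creating a K8s Custom Resource Definition (CRD) and an associated controller. The CRD takes care of managing the K8s resources needed to run a training job.

Kubernetes makes processes easier to manage by providing a process-centric (as opposed to VM-centric) view of the world. Kubernetes also provides essential building blocks for complex distributed applications. For example, K8s has built-in support for DNS, health checking, log collection, metrics collection, storage, and more.

In K8s, controllers are responsible for ensuring that a set of Pods is running. A Pod is the basic building block in K8s and describes one or more processes that should be colocated (same IP). K8s ships with a number of built-in controllers. For example, one can ensure that N Pods are running with a particular specification, and the Job controller can be used to run a binary.

The built-in controllers are not sufficient for running distributed TensorFlow jobs. TensorFlow is a stateful application; each parameter server and worker needs to be uniquely addressable to support all the different modes of distributed training. K8s has StatefulSets. However, StatefulSets are intended for stateful services that run forever (for example, a sharded in-memory cache service such as Redis), not for jobs that run to completion.

Consequently, running a distributed TF job on K8s today means cobbling together a solution out of the built-in primitives. Typically this means managing multiple resources by hand. For example, a user could create one StatefulSet for the parameter servers, one StatefulSet for the workers, and a Job for the master.

To address the limitations of the built-in resources, K8s supports custom resources (CRDs) and controllers. With a CRD it is easy to create a controller with the desired semantics for a particular workload while hiding the implementation from users. The K8s community has quickly adopted this pattern and contributed a large number of CRDs for various workloads.

The opinion of the K8s team that developed CRDs and various controllers is that most controllers use a non-distributed, multi-threaded design and that scalability is not a problem.

The TFJob CRD defines a TFJob resource for K8s.

A TFJob resource is a collection of TfReplicas. Each TfReplica corresponds to a set of TensorFlow processes performing one role in the job.

I made an explicit decision not to try to hide or replace K8s abstractions. For example, each TfReplica contains a standard K8s PodTemplate that specifies the processes (including TF) to run in each replica. I did this because K8s already provides a widely adopted and understood API, so introducing new concepts in place of K8s concepts would only be confusing. Furthermore, exposing the PodTemplate makes it easy for TFJob users to take advantage of K8s features. For example, TFJob users can use K8s to attach volumes to their TF processes. This makes it easy to use TF together with any storage system supported by K8s (PDs, NFS, and so on).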

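3.2 Architecture Diagram

[Figure: TF-Operator architecture diagram]

3.2.1 What Is a Pod

Starting from the diagram, look first at the Pod concept in the middle.

A Pod is the smallest unit that k8s schedules. A Pod can be thought of as a group of containers, and it also acts as a logical host: entering a Pod feels like entering a Linux host where the usual commands are available (on Linux systems). That "host" in turn holds several containers, and entering one of them again feels like entering a Linux host. By default, each container's filesystem is completely isolated from the others. Every Pod has its own IP address, and the containers within a Pod share the same IP and port space.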

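3.2.2 Why Services Are Needed

First, every Pod is assigned its own IP address and exposes its own Endpoint (Pod IP + ContainerPort) for clients to access. This access is limited to the inside of the cluster, however; the cluster-internal IP addresses cannot be reached from outside.

Second, a Pod's lifetime is limited, and its IP is likely to change when it restarts. When a controller replaces a failed Pod with a new one, the new Pod is assigned a new IP address. This raises a question: if a set of Pods provides a service (HTTP, say) and their IPs are likely to change, how does a client find and reach that service?

The solution Kubernetes offers is the Service.

A Service is an abstraction: a Kubernetes Service logically represents a set of Pods, and which Pods belong to it is determined by labels. A Service gives a logical group of Pods (with the same function) a single entry point; it can be thought of simply as load balancing for a service.

A Service has its own IP, and this IP does not change. Clients only need to access the Service's IP, while Kubernetes establishes and maintains the mapping between the Service and its Pods. However the backend Pods change, clients are unaffected because the Service stays the same. Pods are therefore usually accessed through a Service. The Service is assigned an internal virtual IP (the ClusterIP), and CoreDNS provides a DNS record for the Service name, so services inside the cluster can reach the Pods through either that IP or the serviceName.

Here is a Service example from the source code.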

apiVersion: v1
kind: Service
metadata:
  annotations:
    prometheus.io/path: /metrics
    prometheus.io/scrape: "true"
    prometheus.io/port: "8443"
  labels:
    app: tf-job-operator
  name: tf-job-operator
spec:
  ports:
  - name: monitoring-port
    port: 8443
    targetPort: 8443
  selector:
    name: tf-job-operator
  type: ClusterIP

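Here a Service named tf-job-operator is created and assigned a Cluster IP. The Service also keeps watching the Pods matched by its selector and records their information in an Endpoints object that is likewise named tf-job-operator; this object plays the role of the set of Pods mentioned above.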
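
How a selector picks Pods can be shown with a small Python sketch. This is only a model of the idea, not Kubernetes code: Pods are plain dictionaries, and the matches and endpoints helpers and the IP addresses are hypothetical. A Pod is selected when every key/value pair in the selector appears in its labels.

```python
def matches(selector, labels):
    """A Pod matches when all selector key/value pairs appear in its labels."""
    return all(labels.get(key) == value for key, value in selector.items())


def endpoints(selector, target_port, pods):
    """Return the 'ip:port' endpoints of the Pods chosen by the selector."""
    return ["{}:{}".format(pod["ip"], target_port)
            for pod in pods if matches(selector, pod["labels"])]


# Hypothetical Pods; only the first carries the label used by the Service above.
pods = [
    {"ip": "10.0.0.5", "labels": {"name": "tf-job-operator"}},
    {"ip": "10.0.0.6", "labels": {"name": "something-else"}},
]

selected = endpoints({"name": "tf-job-operator"}, 8443, pods)
print(selected)
```

If the first Pod is replaced and comes back with a new IP, only the endpoint list changes; the Service name and its ClusterIP stay the same for clients.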

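3.2.3 What Is a Controller

Kubernetes' existing resource types cannot meet our needs, so we have to extend them through the Custom Resource Definition mechanism.

Everything in K8S is a resource: Deployment, Service, and so on.

With the CRD (CustomResourceDefinition) feature we can add new resources. For example, I might want to define my own kind of Deployment resource that offers a different deployment strategy.

Resources can be created, read, updated, and deleted (CRUD) through k8s's RESTful API, and the same holds for resources created from a CRD.

A CRD only defines a resource. We still need to implement a controller, similar to the deployment controller, that watches the CRUD events of that resource and reacts to them, for example by deploying Pods.

In fact, TF-Operator is essentially the implementation of one Controller, and the rest of this article mainly explains that controller.

3.3 Spec

We start with a Job Spec so that it can be matched against the code later. The sample below has one master, two workers, and one PS.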

apiVersion: "kubeflow.org/v1alpha1" # API version; the value must appear in the output of kubectl api-versions
kind: "TFJob"  # role/type of the resource to create
metadata:  # metadata/attributes of the resource
  name: "example-job"
spec: # resource specification
  replicaSpecs: # replica declarations
    - replicas: 1
      tfReplicaType: MASTER
      template: # Pod template
        spec:
          containers:
            - image: gcr.io/tf-on-k8s-dogfood/tf_sample:dc944ff  # image used by the container
              name: tensorflow
              args:
                - --log_dir=gs://my-job/log-dir
          restartPolicy: OnFailure
    - replicas: 2
      tfReplicaType: WORKER
      template:
        spec:
          containers:
            - image: gcr.io/tf-on-k8s-dogfood/tf_sample:dc944ff
              name: tensorflow
              args:
                - --log_dir=gs://my-job/log-dir
          restartPolicy: OnFailure
    - replicas: 1
      tfReplicaType: PS

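Now we step into the code.

3.4 TFJob

First, the definition of TFJob. It maps roughly onto the Spec above. Since the goal of this article is only a general understanding, this is all we analyze here.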

// TFJob represents a TFJob resource.
type TFJob struct {
	// Standard Kubernetes type metadata.
	metav1.TypeMeta `json:",inline"`

	// Standard Kubernetes object's metadata.
	// +optional
	metav1.ObjectMeta `json:"metadata,omitempty"`

	// Specification of the desired state of the TFJob.
	// +optional
	Spec TFJobSpec `json:"spec,omitempty"`

	// Most recently observed status of the TFJob.
	// Populated by the system.
	// Read-only.
	// +optional
	Status commonv1.JobStatus `json:"status,omitempty"`
}

// TFJobSpec is a desired state description of the TFJob.
type TFJobSpec struct {
	// RunPolicy encapsulates various runtime policies of the distributed training
	// job, for example how to clean up resources and how long the job can stay
	// active.
	RunPolicy commonv1.RunPolicy `json:"runPolicy,inline"`

	// SuccessPolicy defines the policy to mark the TFJob as succeeded.
	// Default to "", using the default rules.
	// +optional
	SuccessPolicy *SuccessPolicy `json:"successPolicy,omitempty"`

	// A map of TFReplicaType (type) to ReplicaSpec (value). Specifies the TF cluster configuration.
	// For example,
	//   {
	//     "PS": ReplicaSpec,
	//     "Worker": ReplicaSpec,
	//   }
	TFReplicaSpecs map[commonv1.ReplicaType]*commonv1.ReplicaSpec `json:"tfReplicaSpecs"`

	// // A switch to enable dynamic worker
	EnableDynamicWorker bool `json:"enableDynamicWorker,omitempty"`
}

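3.5 Roles

Next, how TF-Operator implements the TF roles.

3.5.1 Definition

First the role definitions. These roles basically correspond to TensorFlow's roles, including several that are kept for compatibility.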

// setTypeNamesToCamelCase sets the name of all replica types from any case to correct case.
func setTypeNamesToCamelCase(tfJob *TFJob) {
	setTypeNameToCamelCase(tfJob, TFReplicaTypePS)
	setTypeNameToCamelCase(tfJob, TFReplicaTypeWorker)
	setTypeNameToCamelCase(tfJob, TFReplicaTypeChief)
	setTypeNameToCamelCase(tfJob, TFReplicaTypeMaster)
	setTypeNameToCamelCase(tfJob, TFReplicaTypeEval)
}


const (
	// TFReplicaTypePS is the type for parameter servers of distributed TensorFlow.
	TFReplicaTypePS commonv1.ReplicaType = "PS"

	// TFReplicaTypeWorker is the type for workers of distributed TensorFlow.
	// This is also used for non-distributed TensorFlow.
	TFReplicaTypeWorker commonv1.ReplicaType = "Worker"

	// TFReplicaTypeChief is the type for chief worker of distributed TensorFlow.
	// If there is "chief" replica type, it's the "chief worker".
	// Else, worker:0 is the chief worker.
	TFReplicaTypeChief commonv1.ReplicaType = "Chief"

	// TFReplicaTypeMaster is the type for master worker of distributed TensorFlow.
	// This is similar to chief, and kept just for backwards compatibility.
	TFReplicaTypeMaster commonv1.ReplicaType = "Master"

	// TFReplicaTypeEval is the type for evaluation replica in TensorFlow.
	TFReplicaTypeEval commonv1.ReplicaType = "Evaluator"
)

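3.5.2 Creating Roles

The NewTFJobV2 function creates different roles depending on the configuration.

As you can see, when a job is generated it is essentially handled according to the corresponding fields of the spec.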

apiVersion: "kubeflow.org/v1alpha1"
kind: "TFJob"
metadata:
  name: "example-job"
spec:
  replicaSpecs:

The function is defined as follows.

func NewTFJobV2(worker, ps, master, cheif, evaluator int) *tfv1.TFJob {
	tfJob := &tfv1.TFJob{
		TypeMeta: metav1.TypeMeta{
			Kind: tfv1.Kind,
		},
		ObjectMeta: metav1.ObjectMeta{
			Name:      TestTFJobName,
			Namespace: metav1.NamespaceDefault,
		},
		Spec: tfv1.TFJobSpec{
			TFReplicaSpecs: make(map[commonv1.ReplicaType]*commonv1.ReplicaSpec),
		},
	}
	tfv1.SetObjectDefaults_TFJob(tfJob)

	if worker > 0 {
		worker := int32(worker)
		workerReplicaSpec := &commonv1.ReplicaSpec{
			Replicas: &worker,
			Template: NewTFReplicaSpecTemplate(),
		}
		tfJob.Spec.TFReplicaSpecs[tfv1.TFReplicaTypeWorker] = workerReplicaSpec
	}

	if ps > 0 {
		ps := int32(ps)
		psReplicaSpec := &commonv1.ReplicaSpec{
			Replicas: &ps,
			Template: NewTFReplicaSpecTemplate(),
		}
		tfJob.Spec.TFReplicaSpecs[tfv1.TFReplicaTypePS] = psReplicaSpec
	}

	if master > 0 {
		master := int32(master)
		masterReplicaSpec := &commonv1.ReplicaSpec{
			Replicas: &master,
			Template: NewTFReplicaSpecTemplate(),
		}
		tfJob.Spec.TFReplicaSpecs[tfv1.TFReplicaTypeMaster] = masterReplicaSpec
	}

	if cheif > 0 {
		cheif := int32(cheif)
		cheifReplicaSpec := &commonv1.ReplicaSpec{
			Replicas: &cheif,
			Template: NewTFReplicaSpecTemplate(),
		}
		tfJob.Spec.TFReplicaSpecs[tfv1.TFReplicaTypeChief] = cheifReplicaSpec
	}

	if evaluator > 0 {
		evaluator := int32(evaluator)
		evaluatorReplicaSpec := &commonv1.ReplicaSpec{
			Replicas: &evaluator,
			Template: NewTFReplicaSpecTemplate(),
		}
		// Register the evaluator under its own replica type so that it does
		// not overwrite the chief spec.
		tfJob.Spec.TFReplicaSpecs[tfv1.TFReplicaTypeEval] = evaluatorReplicaSpec
	}
	return tfJob
}

3.5.3 How to Identify the Master

The master is identified with the following method.

func (tc *TFController) IsMasterRole(replicas map[commonv1.ReplicaType]*commonv1.ReplicaSpec, rtype commonv1.ReplicaType, index int) bool {
	if ContainChieforMasterSpec(replicas) {
		return rtype == tfv1.TFReplicaTypeChief || rtype == tfv1.TFReplicaTypeMaster
	}
	// else check if it is worker with index 0
	return rtype == tfv1.TFReplicaTypeWorker && index == 0
}
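
The rule is easier to see with a few concrete inputs. The Python sketch below mirrors the Go function above. ContainChieforMasterSpec is not shown in this article, so the contains_chief_or_master helper assumes it simply checks whether a Chief or Master replica type is present in the spec map.

```python
def contains_chief_or_master(replicas):
    """Assumed behaviour: true when a Chief or Master replica type is declared."""
    return "Chief" in replicas or "Master" in replicas


def is_master_role(replicas, rtype, index):
    """Mirror of IsMasterRole: an explicit chief/master wins, else worker 0."""
    if contains_chief_or_master(replicas):
        return rtype in ("Chief", "Master")
    # Otherwise the worker with index 0 takes the master role.
    return rtype == "Worker" and index == 0


# Replica maps reduced to {replica type: replica count} for the illustration.
with_chief = {"Chief": 1, "Worker": 2, "PS": 1}
workers_only = {"Worker": 2, "PS": 1}

print(is_master_role(with_chief, "Chief", 0))     # explicit chief
print(is_master_role(with_chief, "Worker", 0))    # worker 0 is NOT master here
print(is_master_role(workers_only, "Worker", 0))  # falls back to worker 0
print(is_master_role(workers_only, "Worker", 1))
```

The second call is the case worth remembering: once a chief or master is declared, worker 0 no longer counts as the master.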

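0x04 Controller

Now to the main topic: how the Controller is implemented.

4.1 Key K8S CRD Concepts

First we need to look at some key concepts around K8S CRDs.

informer: watches the apiserver for changes to a particular resource, stores them in a thread-safe local cache, and finally calls back the event handlers we implement.
local cache: the informer synchronizes data from the apiserver (that is, from etcd) into memory in real time, which effectively reduces query pressure on the apiserver. The drawback is freshness: the local copy lags slightly behind the remote data, although it eventually becomes consistent with etcd. So whether to read from the local cache or to fetch from the apiserver in real time has to be decided case by case.
Lister: provides read access (get/list) to the local cache.
controller: a logical concept; it simply means the implementation that orchestrates some resource, and we have to develop it ourselves. What a Controller does mainly includes:
1. Implementing event handlers that process the CRUD operations on the resource.
2. In the event handlers, using the workqueue library to deduplicate consecutive events for the same resource object and to retry after an event fails to process. This is generally recommended.
Workqueue: a standalone library. It is optional but almost always used, for the reasons above. When implementing an event handler we put the identifier of the changed resource into the workqueue, to be consumed by the processor below.
Clientset: the default clientset can only CRUD the resource types that k8s itself provides, such as deployments and daemonsets. The generated code provides a separate clientset for our custom resources (CRDs), which lets us CRUD custom resources with structured code. In other words, use the clientset that ships with k8s for built-in resources and the clientset in the generated code for CRDs.
Processor: goroutines that we implement to consume events from the workqueue; the workqueue deduplicates by resource identifier.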
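
The interplay between event handler, workqueue, and processor can be sketched in Python. This is a simplified, single-threaded model and not client-go: DedupQueue, on_event, and run_processor are hypothetical names, and retries and rate limiting are left out. It shows the one property that matters here, that repeated events for the same key collapse into a single work item.

```python
from collections import deque


class DedupQueue:
    """FIFO queue of resource keys that ignores keys already waiting."""

    def __init__(self):
        self._queue = deque()
        self._pending = set()

    def add(self, key):
        if key not in self._pending:
            self._pending.add(key)
            self._queue.append(key)

    def get(self):
        key = self._queue.popleft()
        self._pending.discard(key)
        return key

    def __len__(self):
        return len(self._queue)


def on_event(queue, namespace, name):
    """Event handler: enqueue only the 'namespace/name' key, not the object."""
    queue.add("{}/{}".format(namespace, name))


def run_processor(queue, sync_handler):
    """Processor: drain the queue and call the sync handler once per key."""
    processed = []
    while len(queue) > 0:
        key = queue.get()
        sync_handler(key)
        processed.append(key)
    return processed


queue = DedupQueue()
for _ in range(3):                       # three rapid updates of the same job
    on_event(queue, "default", "example-job")
on_event(queue, "default", "other-job")

synced = run_processor(queue, sync_handler=lambda key: None)
print(synced)
```

Because only the key is queued, the sync handler always reads the latest state of the object from the cache rather than replaying each individual event.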

4.2 Definition

TFController is defined as follows. Each member has its own purpose, and together they use several of the components described above.

// TFController is the type for TFJob Controller, which manages
// the lifecycle of TFJobs.
type TFController struct {
	common.JobController

	// tfJobClientSet is a clientset for CRD TFJob.
	tfJobClientSet tfjobclientset.Interface

	// To allow injection of sync functions for testing.
	syncHandler func(string) (bool, error)

	// tfJobInformer is a temporary field for unstructured informer support.
	tfJobInformer cache.SharedIndexInformer

	// Listers for TFJob, Pod and Service
	// tfJobLister can list/get tfjobs from the shared informer's store.
	tfJobLister tfjoblisters.TFJobLister

	// tfJobInformerSynced returns true if the tfjob store has been synced at least once.
	tfJobInformerSynced cache.InformerSynced
}

4.3 Entry Point

The entry point of TF-Operator's logic is runWorker, which simply calls processNextWorkItem in a loop.

func (tc *TFController) runWorker() {
	for tc.processNextWorkItem() {
	}
}

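processNextWorkItem reads a single work item from the WorkQueue and tries to process it by calling syncHandler.

// processNextWorkItem will read a single work item off the workqueue and
// attempt to process it, by calling the syncHandler.
func (tc *TFController) processNextWorkItem() bool {
	obj, quit := tc.WorkQueue.Get()
	if quit {
		return false
	}
	defer tc.WorkQueue.Done(obj)

	key, ok := obj.(string)
	if !ok {
		// The item is not a string key; drop it and keep processing.
		tc.WorkQueue.Forget(obj)
		return true
	}
	// ... (excerpt: looking up the TFJob via tc.getTFJobFromKey(key) and its
	// error handling are elided)

	// Sync TFJob to match the actual state to the desired state.
	forget, err := tc.syncHandler(key)
	if err == nil {
		if forget {
			tc.WorkQueue.Forget(key)
		}
		return true
	}
	// On error, requeue the key with rate limiting so that it is retried.
	tc.WorkQueue.AddRateLimited(key)
	return true
}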

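4.4 syncHandler

syncHandler synchronizes a Job by its key: it takes one job out of the WorkQueue and processes it locally.

tc.syncHandler = tc.syncTFJob was set earlier, so we actually arrive at syncTFJob.

syncTFJob syncs the tfjob with the given key if its expectations have been fulfilled, meaning that it does not expect to see any more of its pods/services created or deleted:
If EnableDynamicWorker is set, the tfjob is synced every time.
ReconcileJobs is then called to process the job itself.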

// syncTFJob syncs the tfjob with the given key if it has had its expectations fulfilled, meaning
// it did not expect to see any more of its pods/services created or deleted.
// This function is not meant to be invoked concurrently with the same key.
func (tc *TFController) syncTFJob(key string) (bool, error) {

	namespace, name, err := cache.SplitMetaNamespaceKey(key)
	sharedTFJob, err := tc.getTFJobFromName(namespace, name)
	tfjob := sharedTFJob.DeepCopy()

	// Sync tfjob every time if EnableDynamicWorker is true
	tfjobNeedsSync := tfjob.Spec.EnableDynamicWorker || tc.satisfiedExpectations(tfjob)

	// Set default for the new tfjob.
	scheme.Scheme.Default(tfjob)

	if tfjobNeedsSync && tfjob.DeletionTimestamp == nil {
		// Call ReconcileJobs to reconcile (start) the TFJob.
		err = tc.ReconcileJobs(tfjob, tfjob.Spec.TFReplicaSpecs, tfjob.Status, &tfjob.Spec.RunPolicy)
	}

	return true, err
}

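4.5 ReconcileJobs

ReconcileJobs checks and updates the replicas for each given TFReplicaSpec and handles them accordingly. This can be regarded as the main control logic.

If the job has finished, clean up accordingly: delete all pods and services.
If the TFJob has exceeded the backoff limit or is past its active deadline, delete all pods and services and then set the status to failed.
Otherwise iterate over the TFReplicaSpecs section of the configuration:
- Start the corresponding Pods for each type of node.
- After the Pods are started, also start a Service for each of them.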

// ReconcileJobs checks and updates replicas for each given ReplicaSpec.
// It will requeue the job in case of an error while creating/deleting pods/services.
func (jc *JobController) ReconcileJobs(
	job interface{},
	replicas map[apiv1.ReplicaType]*apiv1.ReplicaSpec,
	jobStatus apiv1.JobStatus,
	runPolicy *apiv1.RunPolicy) error {

	metaObject, ok := job.(metav1.Object)
	jobName := metaObject.GetName()
	runtimeObject, ok := job.(runtime.Object)
	jobKey, err := KeyFunc(job)
	pods, err := jc.Controller.GetPodsForJob(job)
	services, err := jc.Controller.GetServicesForJob(job)
	oldStatus := jobStatus.DeepCopy()
  
	// The job has terminated (succeeded or failed): delete all pods and services.
	if commonutil.IsSucceeded(jobStatus) || commonutil.IsFailed(jobStatus) {
		jc.DeletePodsAndServices(runPolicy, job, pods)    
		jc.CleanupJob(runPolicy, jobStatus, job)
		return nil
	}

	// Retrieve the previous number of retries.
	previousRetry := jc.WorkQueue.NumRequeues(jobKey)
	activePods := k8sutil.FilterActivePods(pods)
	jc.recordAbnormalPods(activePods, runtimeObject)

	active := int32(len(activePods))
	failed := k8sutil.FilterPodCount(pods, v1.PodFailed)
	totalReplicas := k8sutil.GetTotalReplicas(replicas)
	prevReplicasFailedNum := k8sutil.GetTotalFailedReplicas(jobStatus.ReplicaStatuses)

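	// jobExceedsLimit is derived from the backoff limit and the active deadline;
	// that computation is elided in this excerpt.
	if jobExceedsLimit {
		// If the Job exceeds backoff limit or is past active deadline
		// delete all pods and services, then set the status to failed
		if err := jc.DeletePodsAndServices(runPolicy, job, pods); err != nil {
			return err
		}
		if err := jc.CleanupJob(runPolicy, jobStatus, job); err != nil {
			return err
		}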
		jc.Recorder.Event(runtimeObject, v1.EventTypeNormal, commonutil.JobFailedReason, failureMessage)
		commonutil.UpdateJobConditions(&jobStatus, apiv1.JobFailed, commonutil.JobFailedReason, failureMessage)
		return jc.Controller.UpdateJobStatusInApiServer(job, &jobStatus)
	} else {
		// General cases which need to reconcile
		if jc.Config.EnableGangScheduling {
			minAvailableReplicas := totalReplicas
			_, err := jc.SyncPodGroup(metaObject, minAvailableReplicas)
		}

		// Iterate over the replica specs and, for each replica type, reconcile
		// its Pods and then the Services that front them.
		// Diff current active pods/services with replicas.
		for rtype, spec := range replicas {
			err := jc.Controller.ReconcilePods(metaObject, &jobStatus, pods, rtype, spec, replicas)
			err = jc.Controller.ReconcileServices(metaObject, services, rtype, spec)
		}
	}

	err = jc.Controller.UpdateJobStatus(job, replicas, &jobStatus)

  // No need to update the job status if the status hasn't changed since last time.
	if !reflect.DeepEqual(*oldStatus, jobStatus) {
		return jc.Controller.UpdateJobStatusInApiServer(job, &jobStatus)
	}
	return nil
}

The logic so far is as follows:

             +------------+
             | runWorker  |
             +-----+------+
                   |
                   |
                   v
          +--------+------------+
          | processNextWorkItem |
          +--------+------------+
                   |
                   |
                   v
              +----+------+
              | syncTFJob |
              +----+------+
                   |
                   |
                   v
           +-------+--------+
           | ReconcileJobs  |
           +-------+--------+
                   |
                   |
                   v
          +--------+---------+
          |                  |
          |                  |
          v                  v
+---------+---------+  +-----+--------+
|                   |  |              |
| ReconcileServices |  |ReconcilePods |
|                   |  |              |
+-------------------+  +--------------+

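Next we look at Pod handling and Service handling in turn.

4.6 Handling Pods

4.6.1 ReconcilePods

ReconcilePods checks and updates the pods for each given TFReplicaSpec.

Specifically, it:

Initializes the replica statuses;
Picks the master: if a master pod is defined it is chosen, otherwise the first worker pod is chosen as the master;
Calls createNewPod to create new pods;
Or deletes pods;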

// reconcilePods checks and updates pods for each given TFReplicaSpec.
// It will requeue the tfjob in case of an error while creating/deleting pods.
func (tc *TFController) ReconcilePods(
	job interface{},
	jobStatus *commonv1.JobStatus,
	pods []*v1.Pod,
	rtype commonv1.ReplicaType,
	spec *commonv1.ReplicaSpec,
	replicas map[commonv1.ReplicaType]*commonv1.ReplicaSpec,
) error {

	tfJob, ok := job.(*tfv1.TFJob)

	// Convert ReplicaType to lower string.
	rt := strings.ToLower(string(rtype))
  
	// Get all pods of the replica type rt.
	pods, err := tc.FilterPodsForReplicaType(pods, rt)

	numReplicas := int(*spec.Replicas)
	masterRole := false

	initializeReplicaStatuses(jobStatus, rtype)

	// GetPodSlices will return enough information here to make decision to add/remove/update resources.
	// For example, let's assume we have pods with replica-index 0, 1, 2
	// If replica is 4, return a slice with size 4. [[0],[1],[2],[]], a pod with replica-index 3 will be created.
	// If replica is 1, return a slice with size 3. [[0],[1],[2]], pod with replica-index 1 and 2 are out of range and will be deleted.
	podSlices := tc.GetPodSlices(pods, numReplicas, logger)
	for index, podSlice := range podSlices {
		if len(podSlice) > 1 {
			logger.Warningf("We have too many pods for %s %d", rt, index)
		} else if len(podSlice) == 0 {
			// If a chief/master replica is defined, it takes the master role;
			// otherwise the first worker (index 0) does.
			// check if this replica is the master role
			masterRole = tc.IsMasterRole(replicas, rtype, index)
			// TODO: [should change to CreateNewPod]
			err = tc.createNewPod(tfJob, rt, strconv.Itoa(index), spec, masterRole, replicas)
		} else {
			// Check the status of the current pod.
			pod := podSlice[0]

			// Currently only scaling down workers is allowed.
			// check if the index is in the valid range, if not, we should kill the pod
			if index < 0 || index >= numReplicas {
				err = tc.PodControl.DeletePod(pod.Namespace, pod.Name, tfJob)
			}

			// Check if the pod is retryable.
			if spec.RestartPolicy == commonv1.RestartPolicyExitCode {
				if pod.Status.Phase == v1.PodFailed && train_util.IsRetryableExitCode(exitCode) {
					tc.Recorder.Event(tfJob, corev1.EventTypeWarning, tfJobRestartingReason, msg)
					err := commonutil.UpdateJobConditions(jobStatus, commonv1.JobRestarting, tfJobRestartingReason, msg)
					tfJobsRestartCount.Inc()
				}
			}

			updateJobReplicaStatuses(jobStatus, rtype, pod)
		}
	}
	return nil
}
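
The comment on GetPodSlices in the function above describes how existing pods are bucketed by replica index before the create/delete decision. The Python sketch below reproduces the two cases from that comment. It is an illustration under assumptions rather than a port of the Go code: pods are represented only by their replica index, and get_pod_slices and plan are hypothetical names.

```python
def get_pod_slices(pod_indices, replicas):
    """Bucket pods by replica index.

    The result has one bucket per index from 0 up to whichever is larger:
    the desired replica count or the highest existing index plus one.
    """
    size = max([replicas] + [index + 1 for index in pod_indices])
    slices = [[] for _ in range(size)]
    for index in pod_indices:
        slices[index].append(index)
    return slices


def plan(pod_indices, replicas):
    """Derive create/delete actions the way the loop in ReconcilePods does."""
    actions = []
    for index, pod_slice in enumerate(get_pod_slices(pod_indices, replicas)):
        if not pod_slice:
            actions.append(("create", index))   # missing replica
        elif index >= replicas:
            actions.append(("delete", index))   # index out of range
    return actions


print(get_pod_slices([0, 1, 2], 4), plan([0, 1, 2], 4))  # scale up to 4
print(get_pod_slices([0, 1, 2], 1), plan([0, 1, 2], 1))  # scale down to 1
```

Scaling up yields an empty bucket that triggers a create, while scaling down leaves occupied buckets beyond the desired count that trigger deletes.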

4.6.2 createNewPod

createNewPod creates a new pod for the given index and type:

// createNewPod creates a new pod for the given index and type.
func (tc *TFController) createNewPod(tfjob *tfv1.TFJob, rt, index string, spec *commonv1.ReplicaSpec, masterRole bool,
	replicas map[commonv1.ReplicaType]*commonv1.ReplicaSpec) error {

	tfjobKey, err := KeyFunc(tfjob)
	expectationPodsKey := expectation.GenExpectationPodsKey(tfjobKey, rt)

	// Create OwnerReference.
	controllerRef := tc.GenOwnerReference(tfjob)

	// Set type and index for the worker.
	labels := tc.GenLabels(tfjob.Name)
	labels[tfReplicaTypeLabel] = rt
	labels[tfReplicaIndexLabel] = index

	podTemplate := spec.Template.DeepCopy()
	// Set name for the template.
	podTemplate.Name = common.GenGeneralName(tfjob.Name, rt, index)
	if podTemplate.Labels == nil {
		podTemplate.Labels = make(map[string]string)
	}
	for key, value := range labels {
		podTemplate.Labels[key] = value
	}

	// Generate the cluster configuration (TF_CONFIG). This is the key step.
	if err := tc.SetClusterSpec(tfjob, podTemplate, rt, index); err != nil {
		return err
	}

	// if gang-scheduling is enabled:
	// 1. if user has specified other scheduler, we report a warning without overriding any fields.
	// 2. if no SchedulerName is set for pods, then we set the SchedulerName to "kube-batch".
	if tc.Config.EnableGangScheduling {
		if isNonGangSchedulerSet(replicas) {
			tc.Recorder.Event(tfjob, v1.EventTypeWarning, podTemplateSchedulerNameReason, errMsg)
		} else {
			podTemplate.Spec.SchedulerName = gangSchedulerName
		}

		if podTemplate.Annotations == nil {
			podTemplate.Annotations = map[string]string{}
		}
		podTemplate.Annotations[gangSchedulingPodGroupAnnotation] = tfjob.GetName()
		podTemplate.Annotations[volcanoTaskSpecKey] = rt
	}

	// Use the configuration above to actually create the Pod.
	err = tc.PodControl.CreatePodsWithControllerRef(tfjob.Namespace, podTemplate, tfjob, controllerRef)
	return nil
}

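4.6.3 Generating the Configuration

4.6.3.1 SetClusterSpec

Generating the configuration in the function above is important, so we look at it separately.

SetClusterSpec generates and sets TF_CONFIG for the given podTemplateSpec: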

// SetClusterSpec generates and sets TF_CONFIG for the given podTemplateSpec.
func (tc *TFController) SetClusterSpec(job interface{}, podTemplate *v1.PodTemplateSpec, rtype, index string) error {
	tfjob, ok := job.(*tfv1.TFJob)

	// Generate TF_CONFIG JSON string.
	tfConfigStr, err := genTFConfigJSONStr(tfjob, rtype, index)

	// Add TF_CONFIG environment variable to tensorflow container in the pod.
	for i := range podTemplate.Spec.Containers {
		if podTemplate.Spec.Containers[i].Name == tfv1.DefaultContainerName {
			if len(podTemplate.Spec.Containers[i].Env) == 0 {
				podTemplate.Spec.Containers[i].Env = make([]v1.EnvVar, 0)
			}
			podTemplate.Spec.Containers[i].Env = append(podTemplate.Spec.Containers[i].Env, v1.EnvVar{
				Name:  tfConfig,
				Value: tfConfigStr,
			})
			break
		}
	}
	return nil
}

4.6.3.2 genTFConfigJSONStr

genTFConfigJSONStr generates the JSON data.

// genTFConfig will generate the environment variable TF_CONFIG
// {
//     "cluster": {
//         "ps": ["ps1:2222", "ps2:2222"],
//         "worker": ["worker1:2222", "worker2:2222", "worker3:2222"]
//     },
//     "task": {
//         "type": "ps",
//         "index": 1
//     }
// }
func genTFConfigJSONStr(tfjob *tfv1.TFJob, rtype, index string) (string, error) {
	// Configure the TFCONFIG environment variable.
	i, err := strconv.ParseInt(index, 0, 32)
	if err != nil {
		return "", err
	}

	cluster, err := genClusterSpec(tfjob)
	if err != nil {
		return "", err
	}

	var tfConfigJSONByteSlice []byte
	if tfjob.Spec.EnableDynamicWorker {
		sparseCluster := convertClusterSpecToSparseClusterSpec(cluster, strings.ToLower(rtype), int32(i))
		sparseTFConfig := SparseTFConfig{
			Cluster: sparseCluster,
			Task: TaskSpec{
				Type:  strings.ToLower(rtype),
				Index: int(i),
			},
		}
		tfConfigJSONByteSlice, err = json.Marshal(sparseTFConfig)
	} else {
		tfConfig := TFConfig{
			Cluster: cluster,
			Task: TaskSpec{
				Type:  strings.ToLower(rtype),
				Index: int(i),
			},
			// We need to set environment to cloud  otherwise it will default to local which isn't what we want.
			// Environment is used by tensorflow.contrib.learn.python.learn in versions <= 1.3
			// TODO(jlewi): I don't think it is used in versions TF >- 1.4. So we can eventually get rid of it.
			Environment: "cloud",
		}
		tfConfigJSONByteSlice, err = json.Marshal(tfConfig)
	}
	if err != nil {
		return "", err
	}

	return string(tfConfigJSONByteSlice), nil
}
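
On the other side of this environment variable sits the training process inside the Pod. As a rough sketch of what a consumer does with the value, the Python snippet below parses a TF_CONFIG-style string and looks up the task's own address. The parse_tf_config helper and the example environment are hypothetical; the JSON shape follows the comment on genTFConfig above.

```python
import json


def parse_tf_config(environ):
    """Return (cluster, task_type, task_index) from a TF_CONFIG-style variable."""
    config = json.loads(environ.get("TF_CONFIG", "{}"))
    task = config.get("task", {})
    return config.get("cluster", {}), task.get("type"), task.get("index")


# Hypothetical environment, shaped like the JSON in the genTFConfig comment.
example_env = {
    "TF_CONFIG": json.dumps({
        "cluster": {
            "ps": ["ps1:2222", "ps2:2222"],
            "worker": ["worker1:2222", "worker2:2222", "worker3:2222"],
        },
        "task": {"type": "ps", "index": 1},
    })
}

cluster, task_type, task_index = parse_tf_config(example_env)
# The task's own address is the entry at its index within its job's host list.
my_address = cluster[task_type][task_index]
print(task_type, task_index, my_address)
```

This is the same information that job_name, task_index, ps_hosts, and worker_hosts carried on the command line in section 2.2.2, now delivered through a single environment variable.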

4.6.3.3 genClusterSpec

This builds the cluster information from the replica specs of the TFJob.

// genClusterSpec will generate ClusterSpec.
func genClusterSpec(tfjob *tfv1.TFJob) (ClusterSpec, error) {
	clusterSpec := make(ClusterSpec)

	for rtype, spec := range tfjob.Spec.TFReplicaSpecs {
		rt := strings.ToLower(string(rtype))
		replicaNames := make([]string, 0, *spec.Replicas)

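		port, err := GetPortFromTFJob(tfjob, rtype)
		if err != nil {
			return nil, err
		}
		// This loop builds the cluster section of TF_CONFIG. Each replica is addressed
		// through the DNS name of its headless Service, so the entry stays valid even
		// though pod IPs are not fixed.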
		for i := int32(0); i < *spec.Replicas; i++ {
			// As described here: https://kubernetes.io/docs/concepts/services-networking/dns-pod-service/#a-records.
			// A headless service is assigned a DNS A record for a name of the form "my-svc.my-namespace.svc.cluster.local".
			// The last part, "svc.cluster.local", is called the cluster domain,
			// which may differ between Kubernetes clusters.
			hostName := common.GenGeneralName(tfjob.Name, rt, fmt.Sprintf("%d", i))
			svcName := hostName + "." + tfjob.Namespace + "." + "svc"
			clusterDomain := os.Getenv(EnvCustomClusterDomain)
			if len(clusterDomain) > 0 {
				svcName += "." + clusterDomain
			}

			endpoint := fmt.Sprintf("%s:%d", svcName, port)
			replicaNames = append(replicaNames, endpoint)
		}

		clusterSpec[rt] = replicaNames
	}

	return clusterSpec, nil
}
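
The naming scheme can be tried out without a cluster. The sketch below reproduces only the string handling. genGeneralName is a hypothetical stand-in for common.GenGeneralName, assumed here to join job name, replica type and index with dashes; the job name, namespace and port in main are made-up sample values.

```go
package main

import (
	"fmt"
	"strings"
)

// genGeneralName is a hypothetical stand-in for common.GenGeneralName.
// Assumption: the real helper produces "<job>-<type>-<index>".
func genGeneralName(jobName, rtype, index string) string {
	n := jobName + "-" + strings.ToLower(rtype) + "-" + index
	return strings.ReplaceAll(n, "/", "-")
}

// endpoints lists the "host:port" entries for one replica type.
// clusterDomain may be empty, in which case the name stops at ".svc".
func endpoints(jobName, namespace, rtype string, replicas, port int, clusterDomain string) []string {
	out := make([]string, 0, replicas)
	for i := 0; i < replicas; i++ {
		svc := genGeneralName(jobName, rtype, fmt.Sprintf("%d", i)) + "." + namespace + ".svc"
		if clusterDomain != "" {
			svc += "." + clusterDomain
		}
		out = append(out, fmt.Sprintf("%s:%d", svc, port))
	}
	return out
}

func main() {
	// Sample values only; they do not come from a real cluster.
	for _, e := range endpoints("dist-mnist", "kubeflow", "Worker", 2, 2222, "cluster.local") {
		fmt.Println(e)
	}
}
```

Because each entry is a Service DNS name rather than a pod IP, the list can be computed before any pod exists, which is what allows TF_CONFIG to be written into the pod template up front.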

4.6.4 CreatePodsWithControllerRef

Once the cluster configuration has been written into the pod template, the template is used to actually create the Pod:

func (r RealPodControl) CreatePods(namespace string, template *v1.PodTemplateSpec, object runtime.Object) error {
	return r.createPods("", namespace, template, object, nil)
}

func (r RealPodControl) CreatePodsWithControllerRef(namespace string, template *v1.PodTemplateSpec, controllerObject runtime.Object, controllerRef *metav1.OwnerReference) error {
	if err := ValidateControllerRef(controllerRef); err != nil {
		return err
	}
	return r.createPods("", namespace, template, controllerObject, controllerRef)
}

4.6.5 createPods

This is where the K8S API is finally called to create the pod:

func (r RealPodControl) createPods(nodeName, namespace string, template *v1.PodTemplateSpec, object runtime.Object, controllerRef *metav1.OwnerReference) error {
	pod, err := GetPodFromTemplate(template, object, controllerRef)
	if err != nil {
		return err
	}

	if len(nodeName) != 0 {
		pod.Spec.NodeName = nodeName
	}
	if labels.Set(pod.Labels).AsSelectorPreValidated().Empty() {
		return fmt.Errorf("unable to create pods, no labels")
	}
	// The original excerpt is truncated after the create call; only the error path is kept here.
	if _, err := r.KubeClient.CoreV1().Pods(namespace).Create(pod); err != nil {
		return err
	}
	return nil
}

The logic so far looks like this:

                                        +------------------------------+
          +------------+                | SetClusterSpec               |
          | runWorker  |                |  +-------------------------+ |
          +-----+------+                |  | genTFConfigJSONStr      | |
                |                       |  |                         | |
                |                       |  |      genClusterSpec     | |
                v                       |  |                         | |
       +--------+------------+          |  +-------------------------+ |
       | processNextWorkItem |          +------------------------------+
       +--------+------------+                      |
                |                                   |
                |                                   v
                v                            +------+-------+      +-----------------------------+       +------------+
           +----+------+              +----> | createNewPod +----->+ CreatePodsWithControllerRef +------>+ createPods |
           | syncTFJob |              |      +--------------+      +-----------------------------+       +------------+
           +----+------+              |
                |                     |
                |                     |
                v                     |
        +-------+--------+            |
        | ReconcileJobs  |            |
        +-------+--------+            |
                |                     |
                |                     |
                v                     |
       +--------+---------+           |
       |                  |           |
       |                  |           |
       v                  v           |
+------+----------+  +----+--------+  |
|                 |  |             |  |
|ReconcileServices|  |ReconcilePods+--+
|                 |  |             |
+-----------------+  +-------------+

4.7 Handling Services

4.7.1 ReconcileServices

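ReconcileServices checks and updates the services for each given TFReplicaSpec. Roughly:

If an error occurs while creating or deleting a service, the TFJob is requeued.
All services of replica type rt are fetched.
- Either a new service is created;
- or an old service is deleted. Currently only the services of workers may be scaled down.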

// reconcileServices checks and updates services for each given ReplicaSpec.
// It will requeue the job in case of an error while creating/deleting services.
func (jc *JobController) ReconcileServices(
	job metav1.Object,
	services []*v1.Service,
	rtype apiv1.ReplicaType,
	spec *apiv1.ReplicaSpec) error {

	// Convert ReplicaType to lower string.
	rt := strings.ToLower(string(rtype))

	replicas := int(*spec.Replicas)
	// Get all services for the type rt.
	services, err := jc.FilterServicesForReplicaType(services, rt)
	if err != nil {
		return err
	}

	// GetServiceSlices will return enough information here to make decision to add/remove/update resources.
	//
	// For example, let's assume we have services with replica-index 0, 1, 2
	// If replica is 4, return a slice with size 4. [[0],[1],[2],[]], a svc with replica-index 3 will be created.
	//
	// If replica is 1, return a slice with size 3. [[0],[1],[2]], svc with replica-index 1 and 2 are out of range and will be deleted.
	serviceSlices := jc.GetServiceSlices(services, replicas, commonutil.LoggerForReplica(job, rt))

	for index, serviceSlice := range serviceSlices {
		if len(serviceSlice) > 1 {
			// Duplicate services for one index; this excerpt takes no action.
		} else if len(serviceSlice) == 0 {
			// No service exists for this index yet, so create one.
			err = jc.CreateNewService(job, rtype, spec, strconv.Itoa(index))
			if err != nil {
				return err
			}
		} else {
			// Check the status of the current svc.
			svc := serviceSlice[0]

			// check if the index is in the valid range, if not, we should kill the svc
			if index < 0 || index >= replicas {
				err = jc.ServiceControl.DeleteService(svc.Namespace, svc.Name, job.(runtime.Object))
				if err != nil {
					return err
				}
			}
		}
	}
	return nil
}
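
The two cases in the GetServiceSlices comment can be reproduced with plain integers. The following is a rough sketch and not the controller's implementation: bucketByIndex and plan are hypothetical helpers that work on replica indices instead of Service objects. Unlike the excerpt above, plan only creates services inside the desired range, which is a simplification chosen for this sketch.

```go
package main

import "fmt"

// bucketByIndex groups existing replica indices into slots. The result has
// max(replicas, highest existing index + 1) slots, matching the sizes
// described in the comment on GetServiceSlices.
func bucketByIndex(existing []int, replicas int) [][]int {
	size := replicas
	for _, idx := range existing {
		if idx+1 > size {
			size = idx + 1
		}
	}
	slots := make([][]int, size)
	for _, idx := range existing {
		if idx < 0 {
			continue // malformed indices are ignored in this sketch
		}
		slots[idx] = append(slots[idx], idx)
	}
	return slots
}

// plan returns the indices to create and to delete for a desired replica count.
func plan(existing []int, replicas int) (create, remove []int) {
	for idx, slot := range bucketByIndex(existing, replicas) {
		switch {
		case len(slot) == 0 && idx < replicas:
			create = append(create, idx) // missing service inside the range
		case len(slot) == 1 && idx >= replicas:
			remove = append(remove, idx) // existing service outside the range
		}
	}
	return create, remove
}

func main() {
	create, remove := plan([]int{0, 1, 2}, 4)
	fmt.Println(create, remove) // scale up: prints [3] []
	create, remove = plan([]int{0, 1, 2}, 1)
	fmt.Println(create, remove) // scale down: prints [] [1 2]
}
```

Sizing the slot list by the highest existing index, not just by the replica count, is what lets the loop see out-of-range services at all; otherwise they could never be selected for deletion.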

4.7.2 CreateNewService

CreateNewService creates a new service for the given index and replica type:

// createNewService creates a new service for the given index and type.
func (jc *JobController) CreateNewService(job metav1.Object, rtype apiv1.ReplicaType,
	spec *apiv1.ReplicaSpec, index string) error {
	jobKey, err := KeyFunc(job)
	if err != nil {
		return err
	}

	// Convert ReplicaType to lower string.
	rt := strings.ToLower(string(rtype))
	expectationServicesKey := expectation.GenExpectationServicesKey(jobKey, rt)
	err = jc.Expectations.ExpectCreations(expectationServicesKey, 1)
	if err != nil {
		return err
	}

	// Append ReplicaTypeLabel and ReplicaIndexLabel labels.
	labels := jc.GenLabels(job.GetName())
	labels[apiv1.ReplicaTypeLabel] = rt
	labels[apiv1.ReplicaIndexLabel] = index

	port, err := jc.GetPortFromJob(spec)
	if err != nil {
		return err
	}

	service := &v1.Service{
		Spec: v1.ServiceSpec{
			ClusterIP: "None",
			Selector:  labels,
			Ports:     []v1.ServicePort{},
		},
	}

	// Add service port to headless service only if port is set from controller implementation
	if port != nil {
		svcPort := v1.ServicePort{Name: jc.Controller.GetDefaultContainerPortName(), Port: *port}
		service.Spec.Ports = append(service.Spec.Ports, svcPort)
	}

	service.Name = GenGeneralName(job.GetName(), rt, index)
	service.Labels = labels
	// Create OwnerReference.
	controllerRef := jc.GenOwnerReference(job)

	err = jc.ServiceControl.CreateServicesWithControllerRef(job.GetNamespace(), service, job.(runtime.Object), controllerRef)
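	if err != nil && errors.IsTimeout(err) {
		// A timeout is not treated as a failure here: the service may still have
		// been created, in which case the controller can observe it later.
		succeededServiceCreationCount.Inc()
		return nil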
	} else if err != nil {
		failedServiceCreationCount.Inc()
		return err
	}
	succeededServiceCreationCount.Inc()
	return nil
}

4.7.3 CreateServicesWithControllerRef

After validating the controller reference, this function hands the Service over for creation:

func (r RealServiceControl) CreateServicesWithControllerRef(namespace string, service *v1.Service, controllerObject runtime.Object, controllerRef *metav1.OwnerReference) error {
	if err := ValidateControllerRef(controllerRef); err != nil {
		return err
	}
	return r.createServices(namespace, service, controllerObject, controllerRef)
}

4.7.4 createServices

Only here is the K8S API actually called to create the service:

func (r RealServiceControl) createServices(namespace string, service *v1.Service, object runtime.Object, controllerRef *metav1.OwnerReference) error {
	if labels.Set(service.Labels).AsSelectorPreValidated().Empty() {
		return fmt.Errorf("unable to create Services, no labels")
	}
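	serviceWithOwner, err := GetServiceFromTemplate(service, object, controllerRef)
	if err != nil {
		return err
	}
	// The original excerpt is truncated after the create call; only the error path is kept here.
	if _, err := r.KubeClient.CoreV1().Services(namespace).Create(serviceWithOwner); err != nil {
		return err
	}
	return nil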
}

With services included, the logic extends to:

                                        +------------------------------+
          +------------+                | SetClusterSpec               |
          | runWorker  |                |  +-------------------------+ |
          +-----+------+                |  | genTFConfigJSONStr      | |
                |                       |  |                         | |
                |                       |  |      genClusterSpec     | |
                v                       |  |                         | |
       +--------+------------+          |  +-------------------------+ |
       | processNextWorkItem |          +------------------------------+
       +--------+------------+                      |
                |                                   |
                |                                   v
                v                            +------+-------+      +-----------------------------+       +------------+
           +----+------+              +----> | createNewPod +----->+ CreatePodsWithControllerRef +------>+ createPods |
           | syncTFJob |              |      +--------------+      +-----------------------------+       +------------+
           +----+------+              |
                |                     |
                |                     |
                v                     |           +------------------+     +---------------------------------+    +----------------+
        +-------+--------+            |    +----> | CreateNewService +---->+ CreateServicesWithControllerRef +--->+ createServices |
        | ReconcileJobs  |            |    |      +------------------+     +---------------------------------+    +----------------+
        +-------+--------+            |    |
                |                     |    |
                |                     |    |
                v                     |    |
       +--------+---------+           |    |
       |                  |           |    |
       |                  |           |    |
       v                  v           |    |
+------+----------+  +----+--------+  |    |
|                 |  |             |  |    |
|ReconcileServices|  |ReconcilePods+--+    |
|                 |  |             |       |
+------+----------+  +-------------+       |
       |                                   |
       +---------------------------------->+

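So we can roughly conclude that TF-Operator essentially does two things:

It describes a distributed machine learning training job through a custom resource object, the TFJob;
It implements a Controller for TFJob that manages the whole life cycle of the containers and handles the relationships among the multiple processes on the user's behalf;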

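0x05 Comparison with a Plain Deployment

At this point you may be wondering how TF on K8s actually differs from a plain deployment and where its advantages lie. Let's look at this concretely.

5.1 Running

First, the Dockerfile from the source tree: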

FROM tensorflow/tensorflow:1.5.0

ADD . /var/tf_dist_mnist
ENTRYPOINT ["python", "/var/tf_dist_mnist/dist_mnist.py"]

Next, the corresponding spec, which declares 2 PS replicas and 4 Worker replicas.

apiVersion: "kubeflow.org/v1"
kind: "TFJob"
metadata:
  name: "dist-mnist-for-e2e-test"
spec:
  tfReplicaSpecs:
    PS:
      replicas: 2
      restartPolicy: Never
      template:
        spec:
          containers:
            - name: tensorflow
              image: kubeflow/tf-dist-mnist-test:1.0
    Worker:
      replicas: 4
      restartPolicy: Never
      template:
        spec:
          containers:
            - name: tensorflow
              image: kubeflow/tf-dist-mnist-test:1.0

Then build the example image and submit the job to run a distributed MNIST training task.

cd ./examples/v1/dist-mnist
docker build -f Dockerfile -t kubeflow/tf-dist-mnist-test:1.0 .
kubectl create -f ./tf_job_mnist.yaml

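5.2 Comparison

Let's compare the two briefly from the point of view of the training code.

5.2.1 Plain TF

The hosts are configured through script arguments; the code below reads those arguments and starts the server.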

# Read the host lists from the command-line flags
ps_spec = FLAGS.ps_hosts.split(',')
worker_spec = FLAGS.worker_hosts.split(',')

# Create the cluster
num_worker = len(worker_spec)
cluster = tf.train.ClusterSpec({'ps': ps_spec, 'worker': worker_spec})
server = tf.train.Server(cluster, job_name=FLAGS.job_name, task_index=FLAGS.task_index)

5.2.2 TF-Operator

First, dist_mnist.py obtains the cluster information as follows.

# If not explicitly specified in the constructor and the TF_CONFIG
# environment variable is present, load cluster_spec from TF_CONFIG.
tf_config = json.loads(os.environ.get('TF_CONFIG') or '{}')

Second, TF-Operator defines the following constant, which shows that the cluster information is passed through this environment variable:

tfConfig = "TF_CONFIG"

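Finally, SetClusterSpec generates the configuration from the TFJob and injects it into the pod template as an environment variable, so each container receives it when K8S starts the pod: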

// SetClusterSpec generates and sets TF_CONFIG for the given podTemplateSpec.
func (tc *TFController) SetClusterSpec(job interface{}, podTemplate *v1.PodTemplateSpec, rtype, index string) error {
   tfjob, ok := job.(*tfv1.TFJob)
   if !ok {
      return fmt.Errorf("%v is not a type of TFJob", job)
   }

   // Do not set TF_CONFIG for local training jobs.
   if !isDistributed(tfjob) {
      return nil
   }
   // Generate TF_CONFIG JSON string.
   tfConfigStr, err := genTFConfigJSONStr(tfjob, rtype, index)
   if err != nil {
      return err
   }

   // Add TF_CONFIG environment variable to tensorflow container in the pod.
   for i := range podTemplate.Spec.Containers {
      if podTemplate.Spec.Containers[i].Name == tfv1.DefaultContainerName {
         if len(podTemplate.Spec.Containers[i].Env) == 0 {
            podTemplate.Spec.Containers[i].Env = make([]v1.EnvVar, 0)
         }
         podTemplate.Spec.Containers[i].Env = append(podTemplate.Spec.Containers[i].Env, v1.EnvVar{
            Name:  tfConfig,
            Value: tfConfigStr,
         })
         break
      }
   }
   return nil
}

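So from the user's point of view only a small code change is needed. Deploying services and the rest is taken over by K8S.

The user only needs to state in the spec how many workers and PS replicas are required, and can then concentrate on the model, while devops takes care of everything else.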
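
To see that TF_CONFIG carries the same information as the flags used in the plain deployment, the sketch below converts one into the other. It is an illustration only: toFlags is a hypothetical helper, the flag names are taken from the FLAGS fields in section 5.2.1, and the TF_CONFIG value in main is a made-up sample rather than output captured from a real job.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

// tfConfig mirrors only the parts of TF_CONFIG needed here.
// The field names are assumed from the TF_CONFIG layout shown earlier.
type tfConfig struct {
	Cluster map[string][]string `json:"cluster"`
	Task    struct {
		Type  string `json:"type"`
		Index int    `json:"index"`
	} `json:"task"`
}

// toFlags is a hypothetical helper that turns a TF_CONFIG string into the
// flags a plain deployment would have to pass by hand.
func toFlags(raw string) ([]string, error) {
	var c tfConfig
	if err := json.Unmarshal([]byte(raw), &c); err != nil {
		return nil, err
	}
	return []string{
		"--ps_hosts=" + strings.Join(c.Cluster["ps"], ","),
		"--worker_hosts=" + strings.Join(c.Cluster["worker"], ","),
		"--job_name=" + c.Task.Type,
		fmt.Sprintf("--task_index=%d", c.Task.Index),
	}, nil
}

func main() {
	// Sample value; in a pod this would come from the TF_CONFIG environment variable.
	raw := `{"cluster":{"ps":["ps-0:2222"],"worker":["w-0:2222","w-1:2222"]},"task":{"type":"worker","index":1}}`
	flags, err := toFlags(raw)
	if err != nil {
		panic(err)
	}
	for _, f := range flags {
		fmt.Println(f)
	}
}
```

In the plain deployment someone has to compute these four values for every process and keep them consistent; with TF-Operator the controller derives them from the replica counts in the spec.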

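0x06 Summary

Putting the above together, TF-Operator has the following advantages:

It describes a distributed machine learning training job through a custom resource object, the TFJob;
It implements a Controller for TFJob that manages the whole life cycle of the containers and handles the relationships among the multiple processes on the user's behalf;
The user only has to create a TFJob custom resource object and fill in the relevant information in the Template; that alone describes how a distributed training program should be executed.
The user can concentrate on the model, while devops takes care of everything else;

kubeflow/tf-operator works, but it still has many shortcomings.

Kubeflow can launch jobs on Kubernetes that rely on TensorFlow's native distributed computing capability. But because the latter is not fault tolerant, Kubeflow cannot conjure fault tolerance out of nothing. No fault tolerance also means no elastic scheduling.
When kubeflow/tf-operator runs a distributed TensorFlow job, model iteration cannot start until all requested processes have been launched. If the cluster does not have enough resources to start every process, the job has to wait for other jobs to release resources. To shorten this wait, a job can be given a dedicated resource pool.
Because those resources are not shared, cluster utilization becomes very low. So kubeflow/tf-operator can hardly deliver both development efficiency and cluster utilization.

Most importantly, it has no connection to Horovod and does not install software such as MPI, so in the next article we look at MPI-Operator.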