Scaling TensorFlow Models on Kubernetes

2021-03-22 18:34

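{"type":"doc","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"As AI\/ML becomes increasingly integrated into applications and business processes, production-grade machine learning models need more scalable infrastructure and compute power for training and deployment."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Modern machine learning algorithms are trained on large volumes of data and require billions of iterations to minimize the cost function. Scaling such models vertically runs into OS-level bottlenecks, including the number of CPUs and GPUs and the amount of storage a single machine can provide, and has proven inefficient for this type of model. More efficient parallel processing algorithms, such as asynchronous training and allreduce-style training, require a distributed cluster in which different workers learn simultaneously in a coordinated fashion."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Scalability is also very important for serving deep learning models in production. Handling a single API request to a model prediction endpoint may trigger complex processing logic that takes a significant amount of time. As more users hit the model's endpoint, more serving instances are needed to handle client requests efficiently. The ability to serve machine learning models in a distributed, scalable way becomes key to keeping the applications built on them effective."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Solving these scalability problems in a distributed cloud environment is very hard. MLOps engineers have to configure the interactions among multiple nodes and inference services while ensuring fault tolerance, high availability, and application health."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"In this article, I'll discuss how Kubernetes and Kubeflow can meet these scalability requirements for TensorFlow machine learning models. Using some practical examples, I'll walk you through scaling machine learning models with Kubeflow on Kubernetes."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"First, I'll discuss how to use the TensorFlow training jobs (TFJobs) abstraction to orchestrate distributed training of TensorFlow models on Kubernetes via Kubeflow. Then, I'll describe how to implement TensorFlow distribution strategies for synchronous and asynchronous distributed training. Finally, I'll discuss the various options for scaling TensorFlow models served in Kubernetes, including KFServing, Seldon Core, and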
BentoML."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"By the end of this article, you'll have a better understanding of the basic Kubernetes and Kubeflow abstractions and of the tools available for scaling TensorFlow models in both training and production-grade serving."}]},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"Scaling TensorFlow Models with Kubernetes and Kubeflow"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"link","attrs":{"href":"https:\/\/iamondemand.com\/blog\/automating-machine-learning-pipelines-on-kubernetes-with-kubeflow\/","title":"","type":null},"content":[{"type":"text","text":"Kubeflow"}]},{"type":"text","text":" is a machine learning framework for Kubernetes, originally developed by Google. It builds on Kubernetes resources and orchestration services to implement complex automated machine learning pipelines for "},{"type":"link","attrs":{"href":"https:\/\/iamondemand.com\/blog\/the-never-ending-story-of-model-training-a-new-chapter-for-machine-learning\/","title":"","type":null},"content":[{"type":"text","text":"training and serving machine learning models"}]},{"type":"text","text":"."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Kubernetes and Kubeflow can be used together to scale TensorFlow models efficiently. The main resources and features that make TensorFlow models scalable are:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Using kubectl for "},{"type":"link","attrs":{"href":"https:\/\/kubernetes.io\/docs\/concepts\/workloads\/controllers\/deployment\/#scaling-a-deployment","title":"","type":null},"content":[{"type":"text","text":"manually scaling Kubernetes Deployments and
StatefulSets"}]},{"type":"text","text":"."}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Autoscaling with the "},{"type":"link","attrs":{"href":"https:\/\/kubernetes.io\/docs\/tasks\/run-application\/horizontal-pod-autoscale\/","title":"","type":null},"content":[{"type":"text","text":"Horizontal Pod Autoscaler"}]},{"type":"text","text":", which scales based on a set of compute metrics (CPU, GPU, memory) or user-defined metrics (such as requests per second)."}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Distributed training of TensorFlow models with TFJob and the MPI Operator."}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Scaling deployed TensorFlow models with KFServing, Seldon Core, and BentoML."}]}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Next, I'll give some examples of how to use some of these solutions to scale TensorFlow models efficiently on Kubernetes."}]},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"Scalable TensorFlow Training with TFJob"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"TFJob can scale training in Kubernetes by using "},{"type":"link","attrs":{"href":"https:\/\/www.tensorflow.org\/api_docs\/python\/tf\/distribute\/Strategy","title":"","type":null},"content":[{"type":"text","text":"TensorFlow
distribution strategies"}]},{"type":"text","text":" to implement distributed training. Two distribution strategies are commonly used in machine learning: synchronous and asynchronous."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"In synchronous training, workers train in parallel on different slices of a given batch of training data. Each worker runs its own forward propagation step, and the results are aggregated at every iteration."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"In asynchronous training, by contrast, workers learn over the data in parallel and independently, without waiting for one another. This approach relies on a central entity called a "},{"type":"link","attrs":{"href":"https:\/\/www.cs.cmu.edu\/~muli\/file\/ps.pdf","title":"","type":null},"content":[{"type":"text","text":"Parameter Server"}]},{"type":"text","text":", which is responsible for aggregating gradients and passing the updated parameters to each worker."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Implementing such strategies in a distributed cluster is not trivial. In particular, workers must be able to exchange data and weights across nodes and coordinate their learning efficiently while avoiding errors."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"To save developers time, TensorFlow implements various distributed training strategies in the "},{"type":"link","attrs":{"href":"https:\/\/www.tensorflow.org\/api_docs\/python\/tf\/distribute\/Strategy","title":"","type":null},"content":[{"type":"text","text":"tf.distribute.Strategy"}]},{"type":"text","text":" module. With this module, machine learning developers can distribute training across multiple nodes and GPUs with minimal changes to their code."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"The module implements several synchronous strategies, including MirroredStrategy, TPUStrategy, and MultiWorkerMirroredStrategy. It also implements an asynchronous
ParameterServerStrategy. You can read more about the available TensorFlow distribution strategies, and how to implement them in your TensorFlow code, in the guide "},{"type":"link","attrs":{"href":"https:\/\/www.tensorflow.org\/guide\/distributed_training","title":"","type":null},"content":[{"type":"text","text":"Distributed training with TensorFlow"}]},{"type":"text","text":"."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Kubeflow ships with the "},{"type":"link","attrs":{"href":"https:\/\/github.com\/kubeflow\/tf-operator","title":"","type":null},"content":[{"type":"text","text":"TF Operator"}]},{"type":"text","text":" and a custom "},{"type":"link","attrs":{"href":"https:\/\/www.kubeflow.org\/docs\/components\/training\/tftraining\/","title":"","type":null},"content":[{"type":"text","text":"TFJob"}]},{"type":"text","text":" resource, which make it easy to run the TensorFlow distribution strategies mentioned above. TFJob can recognize the distribution strategy defined in the containerized TensorFlow code and manage it with a set of built-in components and control logic. The components that make distributed TensorFlow training possible in Kubeflow include:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Chief: orchestrates distributed training and performs model checkpointing."}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Parameter
Server: stores the shared model parameters and coordinates their updates during asynchronous distributed training."}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Worker: performs the actual model training."}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Evaluator: computes and records evaluation metrics."}]}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"These components are configured in a TFJob, the Kubeflow CRD for TensorFlow training. Here is a "},{"type":"link","attrs":{"href":"https:\/\/github.com\/kubeflow\/tf-operator\/blob\/master\/examples\/v1\/mnist_with_summaries\/tf_job_mnist.yaml","title":"","type":null},"content":[{"type":"text","text":"basic example"}]},{"type":"text","text":" of a distributed training job that relies on two workers and trains without a Chief or Parameter Server. This approach is suitable for implementing TensorFlow synchronous training strategies such as MultiWorkerMirroredStrategy (MirroredStrategy on its own only spans the GPUs of a single machine)."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"As the example shows, in addition to standard Kubernetes resources and settings (e.g., volumes, containers, restart policies), the spec includes "},{"type":"text","marks":[{"type":"strong"}],"text":"tfReplicaSpecs"},{"type":"text","text":", in which you define a Worker. Setting the worker replica count to 2 and defining the relevant distribution strategy in the containerized TensorFlow code is enough to implement a synchronous strategy in Kubeflow."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Once the TFJob is initialized, a new "},{"type":"text","marks":[{"type":"strong"}],"text":"TF_CONFIG"},{"type":"text","text":" environment variable is created for each worker replica. It describes the cluster, that is, the addresses of all replicas grouped by role, together with the task type and index of the current replica, which TensorFlow uses to carry out distributed training. By interacting with various Kubernetes controllers and APIs, tf-operator
coordinates the training process and maintains the desired state defined in the manifest."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"tf-operator also supports asynchronous training with "},{"type":"text","marks":[{"type":"strong"}],"text":"ParameterServerStrategy"},{"type":"text","text":". "},{"type":"link","attrs":{"href":"https:\/\/iamondemand.com\/blog\/scaling-tensorflow-models-on-kubernetes\/","title":"","type":null},"content":[{"type":"text","text":"Here"}]},{"type":"text","text":" (and below) is an example of a distributed training job with an asynchronous strategy managed by tf-operator."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.infoq.cn\/resource\/image\/87\/f0\/8781419deyy2477b30406b58587eb4f0.jpg","alt":null,"title":"","style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":"","fromPaste":false,"pastePass":false}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"TFJob is not the only way to implement distributed training of TensorFlow models with Kubeflow. The "},{"type":"link","attrs":{"href":"https:\/\/github.com\/kubeflow\/mpi-operator","title":"","type":null},"content":[{"type":"text","text":"MPI Operator"}]},{"type":"text","text":" provides an alternative. Under the hood, the MPI Operator uses the "},{"type":"link","attrs":{"href":"https:\/\/www.open-mpi.org\/","title":"","type":null},"content":[{"type":"text","text":"Message Passing Interface"}]},{"type":"text","text":" (MPI), which enables cross-node communication between workers over different communication layers in heterogeneous network environments. In Kubernetes, the MPI Operator can be used to implement allreduce-style synchronous training of TensorFlow models."}]},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"TensorFlow Models on Kubernetes:
Scalable Serving"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Because handling client requests to an inference service can be very time- and resource-intensive, scalable serving is critical for production deployments of machine learning workloads. A deployed model should be able to scale out to multiple replicas and serve many concurrent requests."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Kubeflow supports several serving options for TensorFlow models. The following are worth noting:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"TFServing"},{"type":"text","text":" is the Kubeflow implementation of the TFX Serving module. With TFServing, you can create REST APIs for machine learning models, and it offers many useful features, including service rollouts, automated lifecycle management, traffic splitting, and versioning. However, this option does not provide autoscaling."}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"Seldon Core"},{"type":"text","text":" is a third-party tool that can be used with Kubeflow abstractions and resources. It supports multiple machine learning frameworks, including TensorFlow, and lets you turn trained TensorFlow models into REST\/gRPC microservices running in Kubernetes. Seldon Core supports model autoscaling by default."}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"BentoML"},{"type":"text","text":" is another third-party tool that can be used with Kubeflow. It provides advanced model serving features, including autoscaling and a high-performance API model server with micro-batching support."}]}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"In the next section, I'll show how to autoscale trained TensorFlow models with KFServing, a module included in the default Kubeflow
installation."}]},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"Autoscaling TensorFlow Models with KFServing"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"KFServing is a serverless platform that makes it easy to turn trained TensorFlow models into inference services accessible from outside the Kubernetes cluster. Building on Istio, KFServing handles networking and ingress, health checks, canary rollouts, point-in-time snapshots, traffic routing, and flexible server configuration for your deployed TensorFlow models."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"KFServing also supports autoscaling of trained TensorFlow models out of the box. Under the hood, it relies on the autoscaling capabilities of Knative Serving. Knative offers two autoscaler implementations: one based on the Knative Pod Autoscaler (KPA), and the other based on the Kubernetes Horizontal Pod Autoscaler (HPA)."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"KPA is enabled by default when you deploy an InferenceService through KFServing. It supports scale-to-zero, meaning a served model can be scaled down to zero replicas when there is no traffic. KPA's main limitation is that it does not support CPU-based autoscaling."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"If the cluster has no GPUs, you can use the HPA autoscaler, which supports CPU-based autoscaling. However, it is not part of the KFServing installation and must be enabled after KFServing is installed."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"As mentioned above, KFServing uses KPA by default, so your InferenceService gets autoscaling as soon as it is deployed. You can use the InferenceService manifest to customize KPA
behavior."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.infoq.cn\/resource\/image\/01\/48\/017b7bf3a8yye0c784ecaca155a75d48.jpg","alt":null,"title":"","style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":"","fromPaste":false,"pastePass":false}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"By default, KPA scales a model based on the average number of in-flight requests per pod. KFServing sets the default concurrency target to 1, which means that if the service receives three concurrent requests, KPA scales it to three pod replicas. You can customize this behavior by changing the "},{"type":"text","marks":[{"type":"strong"}],"text":"autoscaling.knative.dev\/target"},{"type":"text","text":" annotation, which is set to 10 in the example above. With this setting, KPA adds replicas only when the number of concurrent requests per pod reaches 10."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"KFServing also lets you configure other autoscaling targets. For example, you can use the "},{"type":"text","marks":[{"type":"strong"}],"text":"requests-per-second-target-default"},{"type":"text","text":" setting to scale a model based on average requests per second (RPS)."}]},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"Conclusion"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"As I have shown in this article, Kubeflow provides many useful tools for scaling TensorFlow model training and serving on Kubernetes. You can use Kubeflow to run both synchronous and asynchronous training with TensorFlow distribution strategies."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"To run distributed training efficiently in a Kubernetes cluster, tf-operator makes it easy to define the components you need. Kubeflow also supports the MPI
Operator, a good option for implementing allreduce-style multi-node training with MPI."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Kubeflow also has a strong feature set for scaling trained TensorFlow models. Tools such as KFServing let you customize the scaling logic to your needs, including RPS and request concurrency targets."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"You can also use Kubernetes-native tools such as HPA to scale models based on user-defined metrics. Other serving tools worth exploring include Seldon Core and BentoML. Both support autoscaling and provide many useful features for automating model versioning, canary rollouts, updates, and lifecycle management."}]},{"type":"heading","attrs":{"align":null,"level":4},"content":[{"type":"text","text":"About the author:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Kirill Goltsman is a technical blogger and researcher specializing in AI\/ML and containerization technologies. Over the past few years, he has led content strategy for startups focused on data analytics, Kubernetes, and AI in gaming and security. In his technical writing, Kirill draws on his experience with programming languages (JavaScript, Python), statistics, and deploying commercial websites, applications, and plugins."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"Original article:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"https:\/\/medium.com\/ai-in-plain-english\/scaling-tensorflow-models-on-kubernetes-e598cb4bfd8a"}]}]}