Scaling TensorFlow Models on Kubernetes

{"type":"doc","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"As AI\/ML becomes increasingly integrated into applications and business processes, production-grade machine learning models need more scalable infrastructure and compute power for training and deployment."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Modern machine learning algorithms train on large volumes of data and can require billions of iterations to minimize the cost function. Scaling such models vertically runs into OS-level bottlenecks, including the number of CPUs and GPUs and the amount of storage that can be provisioned, and has proven inefficient for this type of model. More efficient parallel-processing approaches, such as asynchronous training and allreduce-style training, require a distributed cluster in which different workers learn simultaneously in a coordinated way."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Scalability also matters when serving deep learning models in production. Handling a single API request to a model prediction endpoint can trigger complex processing logic that takes a significant amount of time. As more users hit the model's endpoint, more serving instances are needed to handle client requests efficiently. The ability to serve a machine learning model in a distributed, scalable way becomes key to keeping the application effective."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Solving these scalability problems in a distributed cloud environment is hard. MLOps engineers have to configure the interactions between multiple nodes and inference services while ensuring fault tolerance, high availability, and application health."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"In this article, I discuss how Kubernetes and Kubeflow can meet these scalability requirements for TensorFlow machine learning models. Through some practical examples, I show you how to scale machine learning models on Kubernetes using Kubeflow."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"First, I discuss how to use the TensorFlow training job (TFJob) abstraction to orchestrate distributed training of TensorFlow models on Kubernetes through Kubeflow. Then, I cover how to implement TensorFlow distribution strategies for synchronous and asynchronous distributed training. Finally, I discuss the options for scaling TensorFlow models served in Kubernetes, including KFServing, Seldon Core, and BentoML."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"By the end of this article, you will have a better understanding of the basic Kubernetes and Kubeflow abstractions and of the tools available for scaling TensorFlow models, for both training and production-grade serving."}]},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"Scaling TensorFlow Models with Kubernetes and Kubeflow"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"link","attrs":{"href":"https:\/\/iamondemand.com\/blog\/automating-machine-learning-pipelines-on-kubernetes-with-kubeflow\/","title":"","type":null},"content":[{"type":"text","text":"Kubeflow"}]},{"type":"text","text":" is a machine learning framework for Kubernetes, originally developed by Google. It builds on Kubernetes resources and orchestration services to implement complex automated machine learning pipelines for "},{"type":"link","attrs":{"href":"https:\/\/iamondemand.com\/blog\/the-never-ending-story-of-model-training-a-new-chapter-for-machine-learning\/","title":"","type":null},"content":[{"type":"text","text":"training and serving machine learning models"}]},{"type":"text","text":"."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Kubernetes and Kubeflow can be used together to scale TensorFlow models effectively. The main resources and features that make TensorFlow models scalable are:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Manual scaling of "},{"type":"link","attrs":{"href":"https:\/\/kubernetes.io\/docs\/concepts\/workloads\/controllers\/deployment\/#scaling-a-deployment","title":"","type":null},"content":[{"type":"text","text":"Kubernetes Deployments and StatefulSets"}]},{"type":"text","text":" with kubectl."}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Autoscaling with the "},{"type":"link","attrs":{"href":"https:\/\/kubernetes.io\/docs\/tasks\/run-application\/horizontal-pod-autoscale\/","title":"","type":null},"content":[{"type":"text","text":"Horizontal Pod Autoscaler"}]},{"type":"text","text":", based on a set of compute metrics (CPU, GPU, memory) or user-defined metrics (such as requests per second)."}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Distributed training of TensorFlow models through TFJob and the MPI Operator."}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Scaling of deployed TensorFlow models with KFServing, Seldon Core, and BentoML."}]}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Next, I walk through examples of how to use some of these solutions to scale TensorFlow models effectively on Kubernetes."}]},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"Scalable TensorFlow Training with TFJob"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"A TFJob can scale in Kubernetes by using "},{"type":"link","attrs":{"href":"https:\/\/www.tensorflow.org\/api_docs\/python\/tf\/distribute\/Strategy","title":"","type":null},"content":[{"type":"text","text":"TensorFlow distribution strategies"}]},{"type":"text","text":" to implement distributed training. Two distribution strategies are common in machine learning: synchronous and asynchronous."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"In synchronous training, workers train in parallel, each on its own slice of a given batch of training data. Each worker runs its own forward pass, and the results of the iteration are aggregated across all workers before the model is updated."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"In asynchronous training, by contrast, workers train over the same input data in parallel and independently of one another. This approach uses a central entity called the "},{"type":"link","attrs":{"href":"https:\/\/www.cs.cmu.edu\/~muli\/file\/ps.pdf","title":"","type":null},"content":[{"type":"text","text":"Parameter Server"}]},{"type":"text","text":", which aggregates gradients and passes the updated parameters to each worker."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Implementing such strategies in a distributed cluster is not trivial. In particular, workers must be able to exchange data and weights across nodes and coordinate their learning efficiently while avoiding errors."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"TensorFlow implements a range of distributed training strategies in the "},{"type":"link","attrs":{"href":"https:\/\/www.tensorflow.org\/api_docs\/python\/tf\/distribute\/Strategy","title":"","type":null},"content":[{"type":"text","text":"tf.distribute.Strategy"}]},{"type":"text","text":" API to save developers time. With it, machine learning developers can distribute training across multiple nodes and GPUs with minimal changes to their code."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"It implements several synchronous strategies, including MirroredStrategy, TPUStrategy, and MultiWorkerMirroredStrategy. It also implements an asynchronous ParameterServerStrategy. You can read more about the available TensorFlow distribution strategies, and how to use them in your TensorFlow code, in the guide "},{"type":"link","attrs":{"href":"https:\/\/www.tensorflow.org\/guide\/distributed_training","title":"","type":null},"content":[{"type":"text","text":"Distributed training with TensorFlow"}]},{"type":"text","text":"."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Kubeflow ships with the "},{"type":"link","attrs":{"href":"https:\/\/github.com\/kubeflow\/tf-operator","title":"","type":null},"content":[{"type":"text","text":"TF Operator"}]},{"type":"text","text":" and a custom "},{"type":"link","attrs":{"href":"https:\/\/www.kubeflow.org\/docs\/components\/training\/tftraining\/","title":"","type":null},"content":[{"type":"text","text":"TFJob"}]},{"type":"text","text":" resource, which make it easy to run the TensorFlow distribution strategies mentioned above. A TFJob works with the distribution strategy defined in your containerized TensorFlow code and manages it with a set of built-in components and control logic. The components that make distributed TensorFlow training possible in Kubeflow include:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Chief: orchestrates distributed training and performs model checkpointing."}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Parameter Server: coordinates asynchronous distributed training by storing the model parameters and applying the workers' gradient updates."}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Worker: performs the training work."}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Evaluator: computes and records evaluation metrics."}]}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"These components are configured in a TFJob, the Kubeflow CRD for TensorFlow training. Here is a "},{"type":"link","attrs":{"href":"https:\/\/github.com\/kubeflow\/tf-operator\/blob\/master\/examples\/v1\/mnist_with_summaries\/tf_job_mnist.yaml","title":"","type":null},"content":[{"type":"text","text":"basic example"}]},{"type":"text","text":" of a distributed training job that relies on two workers and trains without a Chief or Parameter Server. This approach is suited to synchronous TensorFlow training strategies such as MultiWorkerMirroredStrategy."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"As the example shows, in addition to standard Kubernetes resources and settings (for example volumes, containers, and restart policies), the spec includes a "},{"type":"text","marks":[{"type":"strong"}],"text":"tfReplicaSpecs"},{"type":"text","text":" field in which you define a Worker. Setting the worker replica count to 2 and defining the relevant distribution strategy in your containerized TensorFlow code is enough to run synchronous training with Kubeflow."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Once the TFJob is initialized, a "},{"type":"text","marks":[{"type":"strong"}],"text":"TF_CONFIG"},{"type":"text","text":" environment variable is set in each replica. It describes the cluster (the addresses of all replicas) and the role and index of the current task, which TensorFlow uses to run distributed training. By interacting with various Kubernetes controllers and APIs, tf-operator coordinates the training process and maintains the desired state defined in the manifest."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"tf-operator also supports asynchronous training with "},{"type":"text","marks":[{"type":"strong"}],"text":"ParameterServerStrategy"},{"type":"text","text":". "},{"type":"link","attrs":{"href":"https:\/\/iamondemand.com\/blog\/scaling-tensorflow-models-on-kubernetes\/","title":"","type":null},"content":[{"type":"text","text":"Here"}]},{"type":"text","text":" (and below) you can see an example of a distributed training job with an asynchronous strategy managed by tf-operator."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.infoq.cn\/resource\/image\/87\/f0\/8781419deyy2477b30406b58587eb4f0.jpg","alt":null,"title":"","style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":"","fromPaste":false,"pastePass":false}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"TFJob is not the only way to run distributed training of TensorFlow models with Kubeflow. The "},{"type":"link","attrs":{"href":"https:\/\/github.com\/kubeflow\/mpi-operator","title":"","type":null},"content":[{"type":"text","text":"MPI Operator"}]},{"type":"text","text":" offers an alternative. Under the hood, the MPI Operator uses the "},{"type":"link","attrs":{"href":"https:\/\/www.open-mpi.org\/","title":"","type":null},"content":[{"type":"text","text":"Message Passing Interface"}]},{"type":"text","text":" (MPI), which lets workers communicate across nodes over different communication layers in heterogeneous network environments. In Kubernetes, the MPI Operator can be used to implement allreduce-style synchronous training of TensorFlow models."}]},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"Scalable Serving of TensorFlow Models on Kubernetes"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Handling client requests to an inference service can be time- and resource-intensive, so scalable serving is critical for production deployments of machine learning workloads. A deployed model should therefore be able to scale out to multiple replicas and serve many concurrent requests."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Kubeflow supports several serving options for TensorFlow models. The following are worth noting:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"TFServing"},{"type":"text","text":" is the Kubeflow implementation of the TFX Serving module. With TFServing, you can create REST APIs for machine learning models, and it provides many useful features, including service rollouts, automated lifecycle management, traffic splitting, and versioning. However, this option does not provide autoscaling."}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"Seldon Core"},{"type":"text","text":" is a third-party tool that works with Kubeflow abstractions and resources. It supports multiple machine learning frameworks, including TensorFlow, and lets you turn trained TensorFlow models into REST\/gRPC microservices running in Kubernetes. Seldon Core supports model autoscaling by default."}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"BentoML"},{"type":"text","text":" is another third-party tool that works with Kubeflow. It provides advanced model serving features, including autoscaling and a high-performance API model server with micro-batching support."}]}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"In the next section, I show how to autoscale trained TensorFlow models with KFServing, a module included in the default Kubeflow installation."}]},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"Autoscaling TensorFlow Models with KFServing"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"KFServing is a serverless platform that makes it easy to turn trained TensorFlow models into inference services that are accessible from outside the Kubernetes cluster. Through Istio, KFServing provides networking and ingress, health checks, canary rollouts, point-in-time snapshots, traffic routing, and flexible server configuration for your deployed TensorFlow models."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"KFServing also supports autoscaling of trained TensorFlow models out of the box. Under the hood, it relies on the autoscaling capabilities of Knative Serving. Knative provides two autoscaler implementations: one based on the Knative Pod Autoscaler (KPA), and the other based on the Kubernetes Horizontal Pod Autoscaler (HPA)."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"KPA is enabled by default when you deploy an InferenceService through KFServing. It supports scale-to-zero, meaning that a served model can be scaled down to zero replicas when there is no traffic. The main limitation of KPA is that it does not support CPU-based autoscaling."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"If your cluster has no GPUs, you can use the HPA autoscaler, which supports CPU-based autoscaling. However, it is not part of the KFServing installation and has to be enabled after KFServing is installed."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"As mentioned above, KFServing uses KPA by default, so your InferenceService gets autoscaling as soon as it is deployed. You can customize KPA behavior in the InferenceService manifest."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.infoq.cn\/resource\/image\/01\/48\/017b7bf3a8yye0c784ecaca155a75d48.jpg","alt":null,"title":"","style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":"","fromPaste":false,"pastePass":false}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"By default, KPA scales the model based on the average number of in-flight requests per pod. KFServing sets the default concurrency target to 1, which means that if the service receives three concurrent requests, KPA scales it to three pod replicas. You can customize this behavior by changing the "},{"type":"text","marks":[{"type":"strong"}],"text":"autoscaling.knative.dev\/target"},{"type":"text","text":" annotation, as in the example above, where it is set to 10. With this setting, KPA adds replicas only when the number of concurrent requests rises to 10."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"KFServing also lets you configure other autoscaling targets. For example, you can use the "},{"type":"text","marks":[{"type":"strong"}],"text":"requests-per-second-target-default"},{"type":"text","text":" setting to scale the model based on average requests per second (RPS)."}]},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"Conclusion"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"As I have shown in this article, Kubeflow provides many useful tools for scaling TensorFlow model training and serving on Kubernetes. You can use Kubeflow to run both synchronous and asynchronous training with TensorFlow distribution strategies."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"To run distributed training efficiently in a Kubernetes cluster, tf-operator makes it easy to define the components you need. Kubeflow also supports the MPI Operator, a great solution for allreduce-style multi-node training with MPI."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Kubeflow also has a good feature set for scaling trained TensorFlow models. Tools such as KFServing let you customize the scaling logic to your needs, including RPS and request concurrency targets."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"You can also use Kubernetes-native tools such as HPA to scale models based on user-defined metrics, and you can explore other great serving tools such as Seldon Core and BentoML. Both support autoscaling and offer many useful features for automating model versioning, canary rollouts, updates, and lifecycle management."}]},{"type":"heading","attrs":{"align":null,"level":4},"content":[{"type":"text","text":"About the Author"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Kirill Goltsman is a technical blogger and researcher specializing in AI\/ML and containerization technologies. Over the past few years, he has led content strategy for startups focused on data analytics, Kubernetes, and AI in gaming and security. In his technical writing, Kirill draws on his knowledge of programming languages (JavaScript, Python) and statistics, and on his experience deploying commercial websites, applications, and plugins."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"Original link:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"https:\/\/medium.com\/ai-in-plain-english\/scaling-tensorflow-models-on-kubernetes-e598cb4bfd8a"}]}]}
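The concurrency-based scaling described in the KFServing section can be approximated with a little arithmetic. The sketch below is illustrative only: it assumes the desired replica count is roughly the number of in-flight requests divided by the per-pod concurrency target, rounded up. The helper name `desired_replicas` and the `min_replicas`/`max_replicas` bounds are hypothetical, and the real Knative autoscaler also applies averaging windows and utilization factors that are not modeled here.

```python
import math


def desired_replicas(concurrent_requests: int, target_per_pod: int = 1,
                     min_replicas: int = 0, max_replicas: int = 100) -> int:
    """Approximate pod count for a per-pod concurrency target.

    Hypothetical helper for illustration; not part of KFServing or Knative.
    """
    if target_per_pod <= 0:
        raise ValueError("target_per_pod must be positive")
    # Round up so that no pod is asked to exceed its concurrency target.
    wanted = math.ceil(concurrent_requests / target_per_pod)
    # Clamp to the configured bounds (a minimum of 0 allows scale-to-zero).
    return max(min_replicas, min(max_replicas, wanted))


if __name__ == "__main__":
    print(desired_replicas(3))                      # default target of 1
    print(desired_replicas(3, target_per_pod=10))   # target raised to 10
    print(desired_replicas(25, target_per_pod=10))  # sustained higher load
    print(desired_replicas(0))                      # no traffic
```

With the default target of 1, three concurrent requests map to three replicas, which matches the behavior described in the article; with the target raised to 10, the same load fits on a single replica.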