Scaling TensorFlow Models on Kubernetes

2021-03-22 18:34

{"type":"doc","content":[
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"As AI\/ML is increasingly integrated into applications and business processes, production-grade ML models need more scalable infrastructure and compute power for training and deployment."}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Modern ML algorithms are trained on large volumes of data and can require billions of iterations to minimize their cost function. Scaling such models vertically runs into OS-level bottlenecks, including the number of CPUs and GPUs and the amount of storage that can be provisioned, and has proven inefficient for this type of model. More efficient parallel approaches, such as asynchronous training and allreduce-style training, require a distributed cluster in which different workers learn simultaneously in a coordinated fashion."}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Scalability is also important for serving deep learning models in production. Processing a single API request to a model prediction endpoint may trigger complex processing logic that takes significant time. As more users hit the model endpoint, more serving instances are needed to handle client requests efficiently. The ability to serve ML models in a distributed, scalable way becomes key to keeping the applications built on them effective."}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Addressing these scalability issues in a distributed cloud environment is hard. MLOps engineers have to configure the interactions between multiple nodes and inference services while ensuring fault tolerance, high availability, and application health."}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"In this article, I'll discuss how Kubernetes and Kubeflow can meet these scalability requirements for TensorFlow ML models. Using some practical examples, I'll walk you through scaling ML models with Kubeflow on Kubernetes."}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"First, I'll discuss how to use the TensorFlow training job (TFJob) abstraction to orchestrate distributed training of TensorFlow models on Kubernetes via Kubeflow. Then I'll cover how TensorFlow distribution strategies implement synchronous and asynchronous distributed training. Finally, I'll discuss options for scaling TensorFlow models served in Kubernetes, including KFServing, Seldon Core, and BentoML."}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"By the end of this article, you'll have a better understanding of the basic Kubernetes and Kubeflow abstractions and of the tools available for scaling TensorFlow models, both for training and for production-grade serving."}]},
{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"Scaling TensorFlow Models with Kubernetes and Kubeflow"}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"link","attrs":{"href":"https:\/\/iamondemand.com\/blog\/automating-machine-learning-pipelines-on-kubernetes-with-kubeflow\/","title":"","type":null},"content":[{"type":"text","text":"Kubeflow"}]},{"type":"text","text":" is a machine learning framework for Kubernetes, originally developed by Google. It builds on Kubernetes resources and orchestration services to implement complex automated ML pipelines for "},{"type":"link","attrs":{"href":"https:\/\/iamondemand.com\/blog\/the-never-ending-story-of-model-training-a-new-chapter-for-machine-learning\/","title":"","type":null},"content":[{"type":"text","text":"training and serving ML models"}]},{"type":"text","text":"."}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Kubernetes and Kubeflow can be combined to scale TensorFlow models efficiently. The main resources and features that make TensorFlow models scalable are:"}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Using kubectl for "},{"type":"link","attrs":{"href":"https:\/\/kubernetes.io\/docs\/concepts\/workloads\/controllers\/deployment\/#scaling-a-deployment","title":"","type":null},"content":[{"type":"text","text":"manual scaling of Kubernetes Deployments and StatefulSets"}]},{"type":"text","text":"."}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Autoscaling with the "},{"type":"link","attrs":{"href":"https:\/\/kubernetes.io\/docs\/tasks\/run-application\/horizontal-pod-autoscale\/","title":"","type":null},"content":[{"type":"text","text":"Horizontal Pod Autoscaler"}]},{"type":"text","text":", based on a set of compute metrics (CPU, GPU, memory) or user-defined metrics (such as requests per second)."}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Distributed training of TensorFlow models via TFJob and the MPI Operator."}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Scaling deployed TensorFlow models with KFServing, Seldon Core, and BentoML."}]}]}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Next, I'll give some examples of how to use some of these solutions to scale TensorFlow models on Kubernetes efficiently."}]},
{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"Scalable TensorFlow Training with TFJob"}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"TFJobs can be scaled in Kubernetes by implementing distributed training with "},{"type":"link","attrs":{"href":"https:\/\/www.tensorflow.org\/api_docs\/python\/tf\/distribute\/Strategy","title":"","type":null},"content":[{"type":"text","text":"TensorFlow distribution strategies"}]},{"type":"text","text":". Two kinds of distribution strategy are commonly used in machine learning: synchronous and asynchronous."}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"In synchronous training, workers train in parallel, each on its own slice of a batch of training data. Each worker performs its own forward propagation step, and the results of the iteration are aggregated across workers."}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"In asynchronous training, by contrast, workers learn over the same data in parallel. This approach relies on a central entity called the "},{"type":"link","attrs":{"href":"https:\/\/www.cs.cmu.edu\/~muli\/file\/ps.pdf","title":"","type":null},"content":[{"type":"text","text":"Parameter Server"}]},{"type":"text","text":", which aggregates and computes gradients and passes the updated parameters to each worker."}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Implementing such strategies in a distributed cluster is not trivial. In particular, workers must be able to communicate data and weights across nodes and coordinate their learning efficiently while avoiding errors."}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"To save developers time, TensorFlow implements various distributed training strategies in the "},{"type":"link","attrs":{"href":"https:\/\/www.tensorflow.org\/api_docs\/python\/tf\/distribute\/Strategy","title":"","type":null},"content":[{"type":"text","text":"tf.distribute.Strategy"}]},{"type":"text","text":" module. With this module, ML developers can distribute training across multiple nodes and GPUs with minimal changes to their code."}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"The module implements several synchronous strategies, including MirroredStrategy, TPUStrategy, and MultiWorkerMirroredStrategy. It also implements an asynchronous ParameterServerStrategy. You can read more about the available TensorFlow distribution strategies, and how to implement them in your TensorFlow code, in the article "},{"type":"link","attrs":{"href":"https:\/\/www.tensorflow.org\/guide\/distributed_training","title":"","type":null},"content":[{"type":"text","text":"Distributed training with TensorFlow"}]},{"type":"text","text":"."}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Kubeflow ships with the "},{"type":"link","attrs":{"href":"https:\/\/github.com\/kubeflow\/tf-operator","title":"","type":null},"content":[{"type":"text","text":"TF Operator"}]},{"type":"text","text":" and a custom "},{"type":"link","attrs":{"href":"https:\/\/www.kubeflow.org\/docs\/components\/training\/tftraining\/","title":"","type":null},"content":[{"type":"text","text":"TFJob"}]},{"type":"text","text":" resource, which make it easy to run the TensorFlow distribution strategies mentioned above. A TFJob recognizes the distribution strategy defined in the containerized TensorFlow code and manages it with a set of built-in components and control logic. The components that make distributed TensorFlow training possible in Kubeflow include:"}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Chief: orchestrates distributed training and performs model checkpointing."}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Parameter Server: coordinates asynchronous distributed training and computes gradients."}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Worker: performs the learning task."}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Evaluator: computes and records evaluation metrics."}]}]}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"The components above are configured in a TFJob, a Kubeflow CRD for TensorFlow training. Here is a "},{"type":"link","attrs":{"href":"https:\/\/github.com\/kubeflow\/tf-operator\/blob\/master\/examples\/v1\/mnist_with_summaries\/tf_job_mnist.yaml","title":"","type":null},"content":[{"type":"text","text":"basic example"}]},{"type":"text","text":" of a distributed training job that relies on two Workers and trains without a Chief or Parameter Servers. This approach is suitable for implementing TensorFlow synchronous training strategies such as MultiWorkerMirroredStrategy (MirroredStrategy itself targets multiple GPUs on a single machine)."}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"As you can see, in addition to standard Kubernetes resources and services (e.g., volumes, containers, restart policies), the spec includes a "},{"type":"text","marks":[{"type":"strong"}],"text":"tfReplicaSpecs"},{"type":"text","text":" section in which you define the Worker. Setting the Worker replica count to 2 and defining the relevant distribution strategy in the containerized TensorFlow code is enough to implement a synchronous strategy with Kubeflow."}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Once the TFJob is initialized, a "},{"type":"text","marks":[{"type":"strong"}],"text":"TF_CONFIG"},{"type":"text","text":" environment variable is made available in each replica. It describes the cluster (the addresses of the replicas for each role) and the current replica's task type and index, which TensorFlow uses to set up distributed training. tf-operator coordinates the training process by interacting with various Kubernetes controllers and APIs, and it maintains the desired state defined in the manifest."}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Asynchronous training is also possible with tf-operator, using "},{"type":"text","marks":[{"type":"strong"}],"text":"ParameterServerStrategy"},{"type":"text","text":". "},{"type":"link","attrs":{"href":"https:\/\/iamondemand.com\/blog\/scaling-tensorflow-models-on-kubernetes\/","title":"","type":null},"content":[{"type":"text","text":"Here"}]},{"type":"text","text":" (and below) you can see an example of a distributed training job with an asynchronous strategy managed by tf-operator."}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"image","attrs":{"src":"https:\/\/static001.infoq.cn\/resource\/image\/87\/f0\/8781419deyy2477b30406b58587eb4f0.jpg","alt":null,"title":"","style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":"","fromPaste":false,"pastePass":false}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"TFJob is not the only way to implement distributed training of TensorFlow models with Kubeflow. The "},{"type":"link","attrs":{"href":"https:\/\/github.com\/kubeflow\/mpi-operator","title":"","type":null},"content":[{"type":"text","text":"MPI Operator"}]},{"type":"text","text":" offers an alternative. Under the hood, the MPI Operator uses the "},{"type":"link","attrs":{"href":"https:\/\/www.open-mpi.org\/","title":"","type":null},"content":[{"type":"text","text":"Message Passing Interface"}]},{"type":"text","text":" (MPI), which enables cross-node communication between workers over different communication layers in heterogeneous network environments. In Kubernetes, the MPI Operator can be used to implement allreduce-style synchronous training of TensorFlow models."}]},
{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"Scalable Serving of TensorFlow Models on Kubernetes"}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Scalable serving is critical for production deployments of ML workloads, because handling client requests to inference services can be a very time- and resource-intensive task. Deployed models should therefore be able to scale to multiple replicas and serve many concurrent requests."}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Kubeflow supports several serving options for TensorFlow models. The following are worth noting:"}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"TFServing"},{"type":"text","text":" is the Kubeflow implementation of the TFX Serving module. With TFServing you can create REST APIs for ML models and get many useful features, including service rollouts, automated lifecycle management, traffic splitting, and versioning. However, this option does not provide autoscaling."}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"Seldon Core"},{"type":"text","text":" is a third-party tool that can be used with Kubeflow abstractions and resources. It supports multiple ML frameworks, including TensorFlow, and lets you turn trained TensorFlow models into REST\/gRPC microservices running in Kubernetes. Seldon Core supports model autoscaling by default."}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"BentoML"},{"type":"text","text":" is another third-party tool that works with Kubeflow. It provides advanced model serving features, including autoscaling and a high-performance API model server with micro-batching support."}]}]}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"In the next section, I'll show how to autoscale trained TensorFlow models with KFServing, a module included in the default Kubeflow installation."}]},
{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"Autoscaling TensorFlow Models with KFServing"}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"KFServing is a serverless platform that makes it easy to turn trained TensorFlow models into inference services accessible from outside the Kubernetes cluster. Through Istio, KFServing provides networking and ingress, health checks, canary rollouts, point-in-time snapshots, traffic routing, and flexible server configuration for your deployed TensorFlow models."}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"KFServing also supports autoscaling of trained TensorFlow models out of the box. Under the hood, it relies on the autoscaling capabilities of Knative Serving. Knative provides two autoscaler implementations: one based on the Knative Pod Autoscaler (KPA) and the other on the Kubernetes Horizontal Pod Autoscaler (HPA)."}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"When you deploy an InferenceService through KFServing, KPA is enabled by default. It supports scale-to-zero, meaning that a served model can be scaled down to zero replicas when there is no traffic. KPA's main limitation is that it does not support CPU-based autoscaling."}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"If you don't have GPUs in your cluster, you can use the HPA autoscaler, which supports CPU-based autoscaling. However, it is not part of the KFServing installation and has to be enabled after KFServing is installed."}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"As mentioned, KFServing uses KPA by default, so your InferenceService gets autoscaling as soon as it is deployed. You can customize KPA behavior through the InferenceService manifest."}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"image","attrs":{"src":"https:\/\/static001.infoq.cn\/resource\/image\/01\/48\/017b7bf3a8yye0c784ecaca155a75d48.jpg","alt":null,"title":"","style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":"","fromPaste":false,"pastePass":false}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"By default, KPA scales a model based on the average number of incoming requests per pod. KFServing sets the default concurrency target to 1, which means that if the service receives three concurrent requests, KPA scales it to three pod replicas. You can customize this behavior by changing the "},{"type":"text","marks":[{"type":"strong"}],"text":"autoscaling.knative.dev\/target"},{"type":"text","text":" annotation, which is set to 10 in the example above. With this setting, KPA increases the replica count only when concurrency exceeds 10 requests per pod."}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"KFServing lets you configure other autoscaling targets as well. For example, you can use the "},{"type":"text","marks":[{"type":"strong"}],"text":"requests-per-second-target-default"},{"type":"text","text":" setting to scale models based on the average number of requests per second (RPS)."}]},
{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"Conclusion"}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"As I've shown in this article, Kubeflow provides many useful tools for scaling TensorFlow model training and serving on Kubernetes. You can use Kubeflow to implement synchronous and asynchronous training with TensorFlow distribution strategies."}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"To run distributed training efficiently in your Kubernetes cluster, tf-operator makes it easy to define the various components you need. Kubeflow also supports the MPI Operator, a great solution for allreduce-style multi-node training using MPI."}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Kubeflow also has a strong feature set for scaling trained TensorFlow models. Tools like KFServing let you customize the scaling logic to your needs, including RPS and request concurrency targets."}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"You can also use Kubernetes-native tools such as HPA to scale models based on user-defined metrics. Other great serving tools, such as Seldon Core and BentoML, are worth a look too. Both support autoscaling and offer many useful features for automating model versioning, canary rollouts, updates, and lifecycle management."}]},
{"type":"heading","attrs":{"align":null,"level":4},"content":[{"type":"text","text":"About the author:"}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Kirill Goltsman is a technical blogger and researcher specializing in AI\/ML and containerization technologies. Over the past few years he has led content creation strategy for startups focused on data analytics, Kubernetes, and AI in the gaming and security domains. In his technical writing, Kirill draws on his knowledge of programming languages (JavaScript, Python) and statistics, and on his experience deploying commercial websites, applications, and plugins."}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"Original article:"}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"https:\/\/medium.com\/ai-in-plain-english\/scaling-tensorflow-models-on-kubernetes-e598cb4bfd8a"}]}
]}
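
As a supplement to the KPA discussion in the KFServing section, the arithmetic behind the concurrency target can be sketched as follows. This is a simplified model rather than Knative's actual algorithm: it assumes the replica count is just the observed concurrency divided by the per-pod target, rounded up, and it ignores target utilization, averaging windows, and min/max scale bounds. The function name `desired_replicas` is hypothetical and exists only for illustration.

```python
import math


def desired_replicas(concurrent_requests: int, target_per_pod: int) -> int:
    """Simplified model of concurrency-based scaling (assumption, not Knative code).

    target_per_pod plays the role of the autoscaling.knative.dev/target
    annotation: the number of in-flight requests one pod should handle.
    """
    if target_per_pod <= 0:
        raise ValueError("target_per_pod must be positive")
    if concurrent_requests <= 0:
        # No traffic: the service may scale to zero replicas.
        return 0
    # Round up so that no pod is asked to exceed its target.
    return math.ceil(concurrent_requests / target_per_pod)


if __name__ == "__main__":
    for requests, target in [(3, 1), (3, 10), (25, 10), (0, 10)]:
        replicas = desired_replicas(requests, target)
        print(f"{requests} concurrent requests, target {target}: {replicas} replica(s)")
```

Under this model, three concurrent requests with the default target of 1 give three replicas, matching the article's example, while a target of 10 keeps a single replica until concurrency exceeds 10.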