在 Kubernetes 上扩展 TensorFlow 模型

{"type":"doc","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"由于人工智能 \/ 机器学习日益集成到应用和业务流程中,因此生产级机器学习模型需要更多可扩展的基础设施和计算能力,以用于训练和部署。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"现代机器学习算法在大量数据上进行训练,并且需要数十亿次迭代才能使成本函数最小化。这类模型的垂直扩展会遇到操作系统级别的瓶颈,包括可提供的 CPU、GPU 和存储的数量,而且对于这种类型的模型,已经证明效率并不高。更为高效的并行处理算法,例如异步训练和 allreduce 式训练,需要一个分布式集群系统,由不同的 worker (工作器)以协调的方式同时学习。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"可扩展性对于在生产环境中服务深度学习模型也非常重要。将单个 API 请求处理到模型预测端点可能会触发复杂的处理逻辑,这将花费大量时间。由于更多用户访问模型的端点,为了有效地处理客户端请求,需要更多服务实例。在机器学习模型中,以分布式、可扩展的方式提供服务的能力成为保证其应用有效性的关键。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"要解决分布式云环境中的这些扩展性问题非常困难。在确保容错、高可用性和应用健康的同时, MLOps 工程师要配置多个节点和推理服务之间的交互。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"本文中,我将讨论 Kubernetes 和 Kubeflow 如何能够满足 TensorFlow 的机器学习模型的这些扩展性需求。通过一些实际的例子,我将向你介绍如何在 Kubernetes 上使用 Kubeflow 扩展机器学习模型。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"首先,我将讨论如何使用 TensorFlow training jobs(TensorFlow 训练作业,TFJobs)抽象,通过 Kubeflow 在 Kubernetes 上协调 TensorFlow 模型的分布式训练。然后,我将介绍如何实现同步和异步分布式训练的 TensorFlow 分发策略。最后,我将讨论用于扩展在 Kubernetes 中服务的 TensorFlow 模型的各种选项,包括 KFServing、Seldon Core 和 BentoML。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"在本文的最后,你将更好地理解基本的 Kubernetes 和 Kubeflow 抽象,并了解 TensorFlow 模型的可扩展工具,用于训练和生产级服务。"}]},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"用 Kubernetes 和 Kubeflow 扩展 TensorFlow 模型"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"link","attrs":{"href":"https:\/\/iamondemand.com\/blog\/automating-machine-learning-pipelines-on-kubernetes-with-kubeflow\/","title":"","type":null},"content":[{"type":"text","text":"Kubeflow"}]},{"type":"text","text":"是一个 Kubernetes 的机器学习框架,最初由谷歌开发。它建立在 Kubernetes 资源和编排服务之上,实现复杂的自动化机器学习管道,用于"},{"type":"link","attrs":{"href":"https:\/\/iamondemand.com\/blog\/the-never-ending-story-of-model-training-a-new-chapter-for-machine-learning\/","title":"","type":null},"content":[{"type":"text","text":"训练和服务机器学习模型"}]},{"type":"text","text":"。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"可以结合使用 Kubernetes 和 Kubeflow 来有效地扩展 TensorFlow 模型。为使 TensorFlow 模型具有可扩展性,主要的资源和特性如下:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"使用 kubectl"},{"type":"link","attrs":{"href":"https:\/\/kubernetes.io\/docs\/concepts\/workloads\/controllers\/deployment\/#scaling-a-deployment","title":"","type":null},"content":[{"type":"text","text":"手动扩展 Kubernetes 部署和 StatefulSets"}]},{"type":"text","text":"。"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"使用"},{"type":"link","attrs":{"href":"https:\/\/kubernetes.io\/docs\/tasks\/run-application\/horizontal-pod-autoscale\/","title":"","type":null},"content":[{"type":"text","text":"Pod 水平自动伸缩"}]},{"type":"text","text":"(Horizontal Pod Autoscaler)进行自动扩展,它基于一组计算指标(CPU、GPU、内存)或用户定义的指标(如每秒请求)。"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"通过 TFJob 和 MPI Operator 对 TensorFlow 模型进行分布式训练。"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"使用 KFServing、Seldon Core 和 BentoML 扩展已部署的 TensorFlow 模型。"}]}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"接下来,我将提供一些例子,说明如何使用这些解决方案中的一些,有效地在 Kubernetes 上扩展 TensorFlow 模型。"}]},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"使用 TFJob 进行可扩展的 TensorFlow 训练"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"TFJob 可以在 Kubernetes 中扩展,方法是使用"},{"type":"link","attrs":{"href":"https:\/\/www.tensorflow.org\/api_docs\/python\/tf\/distribute\/Strategy","title":"","type":null},"content":[{"type":"text","text":"TensorFlow 分发策略"}]},{"type":"text","text":"实现分布式训练。在机器学习中有两种常用的分布式策略:同步和异步。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"在同步训练中,worker 对特定批次的训练数据进行并行训练。每个 worker 都会进行自己的前向传播步骤,并对迭代的整体结果进行汇总。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"相比之下,在异步训练中,worker 对相同的数据进行并行学习。在这种方法中,有一个称为"},{"type":"link","attrs":{"href":"https:\/\/www.cs.cmu.edu\/~muli\/file\/ps.pdf","title":"","type":null},"content":[{"type":"text","text":"Parameter Server"}]},{"type":"text","text":"(参数服务器)的中央实体,它负责聚合和计算梯度,并将更新的参数传递给每个 worker。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"在分布式集群中实现这样的策略并非易事。特别是,worker 应该能够在不同节点之间进行数据和权重的沟通,并有效协调它们的学习,同时避免错误。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"TensorFlow 在"},{"type":"link","attrs":{"href":"https:\/\/www.tensorflow.org\/api_docs\/python\/tf\/distribute\/Strategy","title":"","type":null},"content":[{"type":"text","text":"tf.distribut.Strategy"}]},{"type":"text","text":"模块中实现了各种分布式训练策略,以节省开发人员的时间。有了这个模块,机器学习开发人员只要对他们的代码做最少的修改,就可以在多个节点和 GPU 之间分发训练。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"这个模块实现了几种同步策略,包括 MirroredStrategy、TPUStrategy 和 MultiworkerMirroredStrategy。它还实现了一个异步的 ParameterServerStrategy。你可以在这篇文章《"},{"type":"link","attrs":{"href":"https:\/\/www.tensorflow.org\/guide\/distributed_training","title":"","type":null},"content":[{"type":"text","text":"使用 TensorFlow 进行分布式训练"}]},{"type":"text","text":"》("},{"type":"text","marks":[{"type":"italic"}],"text":"Distributed training with TensorFlow"},{"type":"text","text":"**)中阅读更多关于可用的 TensorFlow 分布策略以及如何在你的 TensorFlow 代码中实现这些策略。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Kubeflow 随附了"},{"type":"link","attrs":{"href":"https:\/\/github.com\/kubeflow\/tf-operator","title":"","type":null},"content":[{"type":"text","text":"TF Operator"}]},{"type":"text","text":"和一个自定义的"},{"type":"link","attrs":{"href":"https:\/\/www.kubeflow.org\/docs\/components\/training\/tftraining\/","title":"","type":null},"content":[{"type":"text","text":"TFJob"}]},{"type":"text","text":"资源,可以轻松创建上面提到的 TensorFlow 分布式策略。TFJob 可以识别容器化的 TensorFlow 代码中定义的分布式策略,并可以使用一组内置组件和控制逻辑对其进行管理。使得在 Kubeflow 中实现 TensorFlow 的分布式训练成为可能的组件包括:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Chief:组织分布式训练并执行模型检查点。"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Parameter Server:协调异步分布式训练和计算梯度。"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"worker:执行学习任务。"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Evaluator:计算和记录评估指标。"}]}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"上述组件可以在 TFJob 中配置,TFJob 是一个用于 TensorFlow 训练的 Kubeflow CRD。这里是一个分布式训练作业的"},{"type":"link","attrs":{"href":"https:\/\/github.com\/kubeflow\/tf-operator\/blob\/master\/examples\/v1\/mnist_with_summaries\/tf_job_mnist.yaml","title":"","type":null},"content":[{"type":"text","text":"基本例子"}]},{"type":"text","text":",它依赖于两个 worker,在没有 Chief 和 Parameter Server 的情况下进行训练。这种方法适用于实现 TensorFlow 同步训练策略,如 MirroredStrategy。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"你看,除了标准的 Kubernetes 资源和服务(例如卷、容器、重启策略)之外,规范还包括一个"},{"type":"text","marks":[{"type":"strong"}],"text":"tfReplicaSpecs"},{"type":"text","text":",其中你定义了一个 worker。在容器化的 TensorFlow 代码中,将 worker 副本计数设置为 2,并定义相关的分发策略,就足以实现 Kubeflow 的同步策略。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"初始化 TFJob 后,将会在每个 worker 节点上创建一个新的"},{"type":"text","marks":[{"type":"strong"}],"text":"TF_CONFIG"},{"type":"text","text":"环境变量。其中包含了关于训练批次、当前训练迭代以及 TFJob 用于执行分布式训练的其他参数的信息。通过与各种 Kubernetes 控制器、 API 进行交互,Tf-operator 协调训练过程,并维护在清单中定义的预期状态。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"另外,通过 tf-operator,异步训练模式可以使用"},{"type":"text","marks":[{"type":"strong"}],"text":"ParameterServerStrategy"},{"type":"text","text":"。在"},{"type":"link","attrs":{"href":"https:\/\/iamondemand.com\/blog\/scaling-tensorflow-models-on-kubernetes\/","title":"","type":null},"content":[{"type":"text","text":"这里"}]},{"type":"text","text":"(以及下面),你将看到一个由 tf-operator 管理的异步策略的分布式训练作业的例子。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.infoq.cn\/resource\/image\/87\/f0\/8781419deyy2477b30406b58587eb4f0.jpg","alt":null,"title":"","style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":"","fromPaste":false,"pastePass":false}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"TFJob 并不是用 Kubeflow 实现 TensorFlow 模型分布式训练的唯一方法。"},{"type":"link","attrs":{"href":"https:\/\/github.com\/kubeflow\/mpi-operator","title":"","type":null},"content":[{"type":"text","text":"MPI Operator"}]},{"type":"text","text":"提供了另一种解决方案。在后台,MPI Operator 使用"},{"type":"link","attrs":{"href":"https:\/\/www.open-mpi.org\/","title":"","type":null},"content":[{"type":"text","text":"消息传递接口"}]},{"type":"text","text":"(Message Passing Interface,MPI),它可以在异构网络环境中,在 worker 之间通过不同的通信层进行跨节点通信。在 Kubernetes 中, MPI Operator 可用于实现 Allreduce 式的 TensorFlow 模型同步训练。"}]},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"TensorFlow 模型在 Kubernetes 上的可扩展服务"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"由于处理客户端对推理服务的请求是一项非常耗时耗力的任务,因此可扩展服务对于机器学习工作负载的生产部署至关重要。在这种情况下,部署的模型应该能够扩展到多个副本,并为多个并发的请求提供服务。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Kubeflow 支持 TensorFlow 模型的几种服务选项。这里要注意以下几点:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"TFServing"},{"type":"text","text":"是 TFX Serving 模块的 Kubeflow 实现。通过 TFServing,你可以创建机器学习模型 REST API,并提供许多有用的功能,包括服务交付、自动生命周期管理、流量分割和版本管理。然而,这个选项并没有提供自动扩展功能。"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"Seldon Core"},{"type":"text","text":"是一款第三方工具,可用于 Kubeflow 抽象和资源。它支持多种机器学习框架,包括 TensorFlow,并允许将训练好的 TensorFlow 模型转换为 REST\/gRPC 微服务,运行在 Kubernetes 中。Seldon Core 默认支持模型自动扩展。"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"BentoML"},{"type":"text","text":"是 Kubeflow 使用的另一个第三方工具,它提供高级的模型服务功能,包括自动扩展,以及支持微批处理的高性能 API 模型服务器。"}]}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"在下一节中,我将展示如何使用 KFServing 对训练好的 TensorFlow 模型进行自动扩展,KFServing 是默认的 Kubeflow 安装中的一个模块。"}]},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"用 KFServing 自动扩展 TensorFlow 模型"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"KFServing 是一种无服务器平台,它可以轻松地将训练好的 TensorFlow 模型转换为从 Kubernetes 集群外部访问的推理服务。通过 Istio, KFServing 可以实现网络和入口、健康检查、金丝雀发布(canary rollouts)、时间点快照、流量路由以及针对你部署的 TensorFlow 模型灵活地配置服务器。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"同时,KFServing 还支持开箱即用的训练 TensorFlow 模型的自动扩展。在底层,KFServing 依赖于 Knative Serving 的自动扩展能力。Knative 提供了两个自动扩展的实现。一种是基于 Knative Pod Autoscaler(KPA)工具,另一种个是基于 Kubernetes Horizontal Pod Autoscaler(HPA)。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"通过 KFServing 部署 InferenceService 时,KPA 将默认启用。它支持扩展到零的功能,即在没有流量时,可将服务的模型扩展到剩余副本数量为零。KPA 的主要限制在于它不支持基于 CPU 的自动扩展。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"若集群中没有 GPU,则可以使用 HPA autoscaler,它支持基于 CPU 的自动扩展。然而,它不属于 KFServing 安装,应该在 KFServing 安装完成后启用。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"如前所述,KFServing 在缺省情况下使用 KPA,因此你的 InferenceService 在部署后立即获得自动扩展。使用 InferenceService 清单可以自定义 KPA 行为。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.infoq.cn\/resource\/image\/01\/48\/017b7bf3a8yye0c784ecaca155a75d48.jpg","alt":null,"title":"","style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":"","fromPaste":false,"pastePass":false}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"默认情况下,KPA 基于每个 pod 的平均传入请求数对模型进行扩展。KFServing 将默认的并发的目标数量设置为 1,这意味着如果服务收到三个请求,KPA 将把它扩展到三个 pod 副本。你可以通过更改"},{"type":"text","marks":[{"type":"strong"}],"text":"autoscaling.knative.dev\/target"},{"type":"text","text":"注释来定制这个行为,就像上面的例子一样,你把它设置为 10。一旦启用此设置,只有当并发的请求数增加到 10 时,KPA 才会增加副本数。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"通过 KFServing,你可以配置其他自动扩展目标。举例来说,你可以使用"},{"type":"text","marks":[{"type":"strong"}],"text":"requests-per-second-target-default"},{"type":"text","text":"注解来扩展基于每秒平均请求量(Request per second,RPS)的模型。"}]},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"总结"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"正如我在本文中所展示的那样,Kubeflow 为扩展 TensorFlow 模型训练和 Kubernetes 的服务提供了许多有用的工具。你可以使用 Kubeflow 来实现 TensorFlow 分布策略的同步和异步训练。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"为了在 Kubernetes 集群中高效地执行分布式训练,Tf-operator 可以轻松定义你所需要的各种组件。另外,Kubeflow 还支持 MPI Operator,这是一个绝佳解决方案,可以使用 MPI 来实现 Allreduce 式的多节点训练。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"在扩展训练好的 TensorFlow 模型时,Kubeflow 也有很好的功能集。诸如 KFServing 这样的工具可以让你根据需要定制扩展逻辑,包括 RPS 和请求并行目标。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"你也可以使用 Kubernetes-native 工具,比如 HPA,根据用户定义的指标对模型进行扩展。你可以研究一下其他很棒的服务工具,比如 Seldon Core 和 BentoML。它们都支持自动扩展,并为自动化服务模型版本、金丝雀发布、更新和生命周期管理提供了许多有用的功能。"}]},{"type":"heading","attrs":{"align":null,"level":4},"content":[{"type":"text","text":"作者介绍:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Kirill Goltsman,技术博客写手、研究员,专攻研究人工智能 \/ 机器学习及容器化技术。在过去的几年里,他领导了专注于数据分析、Kubernetes 以及游戏和安全领域的人工智能的初创公司的内容创作策略。在他的技术写作中,Kirill 利用了他的编程语言(Javascript、Python)、统计知识以及部署商业网站、应用程序和插件的经验。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"原文链接:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"https:\/\/medium.com\/ai-in-plain-english\/scaling-tensorflow-models-on-kubernetes-e598cb4bfd8a"}]}]}
發表評論
所有評論
還沒有人評論,想成為第一個評論的人麼? 請在上方評論欄輸入並且點擊發布.
相關文章