Translated from NVIDIA Triton Inference Server.

NVIDIA Triton Inference Server

NVIDIA Triton Inference Server provides a cloud inferencing solution
optimized for NVIDIA GPUs. The server provides an inference service
via an HTTP or GRPC endpoint, allowing remote clients to request
inferencing for any model being managed by the server. For edge
deployments, Triton Server is also available as a shared library with
an API that allows the full functionality of the server to be included
directly in an application. Triton Server provides the following
features:

  • Multiple framework support
    The server can manage any number and mix of models (limited by system
    disk and memory resources). Supports TensorRT, TensorFlow GraphDef,
    TensorFlow SavedModel, ONNX, PyTorch, and Caffe2 NetDef model
    formats. Also supports TensorFlow-TensorRT and ONNX-TensorRT
    integrated models. Variable-size input and output tensors are
    allowed if supported by the framework. See Capabilities
    for detailed support information for each framework.

  • Concurrent model execution support. Multiple models (or multiple
    instances of the same model) can run simultaneously on the same GPU.

  • Batching support. For models that support batching, Triton Server
    can accept requests for a batch of inputs and respond with the
    corresponding batch of outputs. Triton Server also supports multiple
    scheduling and batching
    algorithms that combine individual inference requests together to
    improve inference throughput. These scheduling and batching
    decisions are transparent to the client requesting inference.

  • Custom backend support. Triton
    Server allows individual models to be implemented with custom
    backends instead of by a deep-learning framework. With a custom
    backend a model can implement any logic desired, while still
    benefiting from the GPU support, concurrent execution, dynamic
    batching and other features provided by the server.

  • Ensemble support. An
    ensemble represents a pipeline of one or more models and the
    connection of input and output tensors between those models. A
    single inference request to an ensemble will trigger the execution
    of the entire pipeline.

  • Multi-GPU support. Triton Server can distribute inferencing across
    all system GPUs.

  • Triton Server provides multiple modes for model management. These
    model management modes allow for both implicit and explicit loading
    and unloading of models without requiring a server restart.

  • Model repositories may reside on a locally accessible file system (e.g. NFS), in Google
    Cloud Storage or in Amazon S3.

  • Readiness and liveness health endpoints suitable for any
    orchestration or deployment framework, such as Kubernetes; a usage
    sketch follows this list.

  • Metrics indicating GPU utilization, server throughput, and server latency.

  • C library interface allows the full functionality of Triton Server
    to be included directly in an application.
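
As a concrete illustration of the HTTP endpoint, the health endpoints,
and the metrics listed above, the following is a minimal Python sketch
(using the requests package) that polls liveness/readiness and scrapes
the Prometheus metrics. The host, the default ports (8000 for HTTP,
8002 for metrics), the version 1 /api/health routes, and the nv_
metric-name prefix are assumptions based on a typical default
deployment; consult the API Reference and adjust them to match your own
setup::

  # Sketch only: assumes a local Triton Server with its HTTP endpoint on
  # port 8000, Prometheus metrics on port 8002, and the version 1
  # /api/health routes. Adjust host, ports, and paths for your deployment.
  import requests

  SERVER = "http://localhost:8000"
  METRICS = "http://localhost:8002"

  def is_live():
      # HTTP 200 indicates the server process is up.
      return requests.get(SERVER + "/api/health/live").status_code == 200

  def is_ready():
      # HTTP 200 indicates the server can accept inference requests.
      return requests.get(SERVER + "/api/health/ready").status_code == 200

  def triton_metrics():
      # The metrics endpoint serves Prometheus text format; Triton metric
      # names are assumed to carry the nv_ prefix (e.g. nv_gpu_utilization).
      text = requests.get(METRICS + "/metrics").text
      return [line for line in text.splitlines() if line.startswith("nv_")]

  if __name__ == "__main__":
      print("live:", is_live(), "ready:", is_ready())
      for line in triton_metrics():
          print(line)

An orchestration framework such as Kubernetes would typically wire the
same two health routes into its liveness and readiness probes.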

Backwards Compatibility

Continuing in the latest version, the following interfaces maintain
backwards compatibility with the 1.0.0 release. If you have model
configuration files, custom backends, or clients that use the
inference server HTTP or GRPC APIs (either directly or through the
client libraries) from releases prior to 1.0.0 you should edit
and rebuild those as necessary to match the version 1.0.0 APIs.

The following interfaces will maintain backwards compatibility for all
future 1.x.y releases (see below for exceptions):

  • Model configuration as defined in model_config.proto

  • The inference server HTTP and GRPC APIs as defined in api.proto
    and grpc_service.proto, except as noted below.

  • The V1 and V2 custom backend interfaces as defined in custom.h

As new features are introduced they may temporarily have beta status
where they are subject to change in non-backwards-compatible
ways. When they exit beta they will conform to the
backwards-compatibility guarantees described above. Currently the
following features are in beta:

  • The inference server library API as defined in trtserver.h
    is currently in beta and may undergo non-backwards-compatible
    changes.

  • The C++ and Python client libraries are not strictly included in the
    inference server compatibility guarantees and so should be
    considered as beta status.

Roadmap

The inference server’s new name is Triton Inference Server, which can
be shortened to just Triton Server in contexts where inferencing is
already understood. The primary reasons for the name change are to:

  • Avoid confusion with the NVIDIA TensorRT Programmable Inference
    Accelerator


  • Avoid the perception that Triton Server only supports TensorRT
    models when in fact the server supports a wide range of model
    frameworks and formats.

  • Highlight that the server is aligning HTTP/REST and GRPC protocols
    with a set of KFServing community standard inference protocols
    that have been proposed by the KFServing project

Transitioning from the current protocols (version 1) to the new
protocols (version 2) will take place over several releases.

  • Current master

    • Alpha release of server support for KFServing community standard
      GRPC and HTTP/REST inference protocol. A minimal request sketch
      using this protocol appears after this roadmap list.
    • Alpha release of Python client library that uses KFServing
      community standard GRPC and HTTP/REST inference protocol.
    • See client documentation <https://github.com/NVIDIA/triton-inference-server/tree/master/docs/client_experimental.rst>_
      for description and examples showing how to enable and use the new
      GRPC and HTTP/REST inference protocol and Python client library.
    • Existing HTTP/REST and GRPC protocols, and existing client APIs
      continue to be supported and remain the default protocols.
  • 20.05

    • Beta release of KFServing community standard HTTP/REST and GRPC
      inference protocol support in server, Python client, and C++
      client.
    • Beta release of the HTTP/REST and GRPC extensions <https://github.com/NVIDIA/triton-inference-server/tree/master/docs/protocol>_
      to the KFServing inference protocol.
    • Existing HTTP/REST and GRPC protocols are deprecated but remain
      the default.
    • Existing shared library interface defined in trtserver.h continues
      to be supported but is deprecated.
    • Beta release of the new shared library interface defined in
      tritonserver.h.
  • 20.06

    • Triton Server version 2.0.0.
    • KFServing community standard HTTP/REST and GRPC inference
      protocols plus all Triton extensions <https://github.com/NVIDIA/triton-inference-server/tree/master/docs/protocol>_
      become the default and only supported protocols for the server.
    • C++ and Python client libraries based on the KFServing standard
      inference protocols become the default and only supported client
      libraries.
    • The new shared library interface defined in tritonserver.h becomes
      the default and only supported shared library interface.
    • Original C++ and Python client libraries are removed. Release
      20.05 is the last release to support these libraries.
    • Original shared library interface defined in trtserver.h is
      removed. Release 20.05 is the last release to support the
      trtserver.h shared library interface.
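
To make the new protocol concrete, below is a minimal Python sketch of
an inference request using the KFServing community standard HTTP/REST
protocol referenced in the roadmap above. The model name, tensor names,
shape, and datatype are hypothetical placeholders, and the sketch
assumes a release in which the new protocol is enabled; the
/v2/models/<model>/infer route and the JSON layout follow the community
standard proposal::

  # Sketch only: "example_model", INPUT0/OUTPUT0, the shape, and the
  # datatype are placeholders; substitute values from your own model
  # configuration.
  import requests

  SERVER = "http://localhost:8000"
  MODEL = "example_model"

  request_body = {
      "inputs": [
          {
              "name": "INPUT0",
              "shape": [1, 16],
              "datatype": "INT32",
              # Tensor data is sent flattened in row-major order.
              "data": list(range(16)),
          }
      ],
      "outputs": [{"name": "OUTPUT0"}],
  }

  # POST to the community standard inference route for the model.
  response = requests.post(SERVER + "/v2/models/" + MODEL + "/infer",
                           json=request_body)
  response.raise_for_status()

  for output in response.json()["outputs"]:
      print(output["name"], output["shape"], output["data"])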

Throughout the transition the model repository structure and custom
backend APIs will remain unchanged so that any existing model
repository and custom backends will continue to work with Triton
Server.

In the 20.06 release there will be some minor changes to the
tritonserver command-line executable arguments. It will be necessary
to revisit and possibly adjust invocations of the tritonserver executable.

In the 20.06 release there will be some minor changes to the model
configuration schema. It is expected that these changes will not
impact the vast majority of model configurations. For impacted models
the model configuration will need minor edits to become compatible
with Triton Server version 2.0.0.

Documentation

The User Guide, Developer Guide, and API Reference documentation for the current release <https://docs.nvidia.com/deeplearning/sdk/triton-inference-server-guide/docs/index.html>_
provide guidance on installing, building, and running Triton Inference
Server.

You can also view the documentation for the master branch <https://docs.nvidia.com/deeplearning/sdk/triton-inference-server-master-branch-guide/docs/index.html>_
and for earlier releases <https://docs.nvidia.com/deeplearning/sdk/inference-server-archived/index.html>_.

An FAQ <https://docs.nvidia.com/deeplearning/sdk/triton-inference-server-master-branch-guide/docs/faq.html>_
provides answers for frequently asked questions.

READMEs for deployment examples can be found in subdirectories of
deploy/, for example, deploy/single_server/README.rst <https://github.com/NVIDIA/triton-inference-server/tree/master/deploy/single_server/README.rst>_.

The Release Notes <https://docs.nvidia.com/deeplearning/sdk/inference-release-notes/index.html>_
and Support Matrix <https://docs.nvidia.com/deeplearning/dgx/support-matrix/index.html>_
indicate the required versions of the NVIDIA Driver and CUDA, and also
describe which GPUs are supported by Triton Server.

Presentations and Papers

  • High-Performance Inferencing at Scale Using the TensorRT Inference Server <https://developer.nvidia.com/gtc/2020/video/s22418>_.

  • Accelerate and Autoscale Deep Learning Inference on GPUs with KFServing <https://developer.nvidia.com/gtc/2020/video/s22459>_.

  • Deep into Triton Inference Server: BERT Practical Deployment on NVIDIA GPU <https://developer.nvidia.com/gtc/2020/video/s21736>_.

  • Maximizing Utilization for Data Center Inference with TensorRT Inference Server <https://on-demand-gtc.gputechconf.com/gtcnew/sessionview.php?sessionName=s9438-maximizing+utilization+for+data+center+inference+with+tensorrt+inference+server>_.

  • NVIDIA TensorRT Inference Server Boosts Deep Learning Inference <https://devblogs.nvidia.com/nvidia-serves-deep-learning-inference/>_.

  • GPU-Accelerated Inference for Kubernetes with the NVIDIA TensorRT Inference Server and Kubeflow <https://www.kubeflow.org/blog/nvidia_tensorrt/>_.

Contributing

Contributions to Triton Inference Server are more than welcome. To
contribute make a pull request and follow the guidelines outlined in
the Contributing <CONTRIBUTING.md>_ document.

Reporting problems, asking questions

We appreciate any feedback, questions or bug reporting regarding this
project. When help with code is needed, follow the process outlined in
the Stack Overflow (https://stackoverflow.com/help/mcve)
document. Ensure posted examples are:

  • minimal – use as little code as possible that still produces the
    same problem

  • complete – provide all parts needed to reproduce the problem. Check
    if you can strip external dependencies and still show the problem. The
    less time we spend on reproducing problems, the more time we have to
    fix them.

  • verifiable – test the code you’re about to provide to make sure it
    reproduces the problem. Remove all other problems that are not
    related to your request/question.

.. |License| image:: https://img.shields.io/badge/License-BSD3-lightgrey.svg
   :target: https://opensource.org/licenses/BSD-3-Clause
