Translated from the NVIDIA Triton Inference Server documentation.

NVIDIA Triton Inference Server

NVIDIA Triton Inference Server provides a cloud inferencing solution
optimized for NVIDIA GPUs. The server provides an inference service
via an HTTP or GRPC endpoint, allowing remote clients to request
inferencing for any model being managed by the server. For edge
deployments, Triton Server is also available as a shared library with
an API that allows the full functionality of the server to be included
directly in an application. Triton Server provides the following
features:
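
As a concrete illustration of the HTTP inference endpoint, the sketch
below sends a single request with a Python client. This is a minimal
sketch, not authoritative usage: it assumes a tritonclient-style Python
package (the alpha client library mentioned in the Roadmap section; the
package and call names here are assumptions), a server on the
conventional HTTP port 8000, and a hypothetical model "my_model" with
one FP32 input "INPUT0" and one output "OUTPUT0"::

  # Minimal sketch: one HTTP inference request to a running Triton Server.
  # Model name, tensor names, shapes, and URL are placeholders.
  import numpy as np
  import tritonclient.http as httpclient

  client = httpclient.InferenceServerClient(url="localhost:8000")

  # Build the request: one input tensor filled with random data.
  input0 = httpclient.InferInput("INPUT0", [1, 16], "FP32")
  input0.set_data_from_numpy(np.random.rand(1, 16).astype(np.float32))
  output0 = httpclient.InferRequestedOutput("OUTPUT0")

  # Send the request and read the output tensor back as a numpy array.
  result = client.infer(model_name="my_model",
                        inputs=[input0], outputs=[output0])
  print(result.as_numpy("OUTPUT0"))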

  • Multiple framework support. The server can manage any number and mix
    of models (limited by system disk and memory resources). Supports
    TensorRT, TensorFlow GraphDef, TensorFlow SavedModel, ONNX, PyTorch,
    and Caffe2 NetDef model formats. Also supports TensorFlow-TensorRT
    and ONNX-TensorRT integrated models. Variable-size input and output
    tensors are allowed if supported by the framework. See Capabilities
    for detailed support information for each framework. A sketch of a
    minimal model repository and configuration follows this list.

  • Concurrent model execution support. Multiple models (or multiple instances of the same model) can run simultaneously on the same GPU.

  • Batching support. For models that support batching, Triton Server
    can accept requests for a batch of inputs and respond with the
    corresponding batch of outputs. Triton Server also supports multiple
    scheduling and batching
    algorithms that combine individual inference requests together to
    improve inference throughput. These scheduling and batching
    decisions are transparent to the client requesting inference.

  • Custom backend support. Triton
    Server allows individual models to be implemented with custom
    backends instead of by a deep-learning framework. With a custom
    backend a model can implement any logic desired, while still
    benefiting from the GPU support, concurrent execution, dynamic
    batching and other features provided by the server.

  • Ensemble support. An
    ensemble represents a pipeline of one or more models and the
    connection of input and output tensors between those models. A
    single inference request to an ensemble will trigger the execution
    of the entire pipeline.

  • Multi-GPU support. Triton Server can distribute inferencing across
    all system GPUs.

  • Triton Server provides multiple modes for model management. These
    model management modes allow for both implicit and explicit loading
    and unloading of models without requiring a server restart.

  • Model repositories may reside on a locally accessible file system (e.g. NFS), in Google
    Cloud Storage or in Amazon S3.

  • Readiness and liveness health endpoints suitable for any orchestration or deployment framework, such as
    Kubernetes.

  • Metrics indicating GPU utilization, server throughput, and server latency (a health and metrics probing sketch follows this list).

  • C library interface allows the full functionality of Triton Server to be included
    directly in an application.
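
To make the model repository, concurrent execution, and dynamic
batching features above more concrete, the following sketch writes out
a minimal repository for a single hypothetical ONNX model. It is only
an illustration under assumptions: the path, model name, tensor names,
and configuration values are made up, and the config.pbtxt fields shown
(max_batch_size, instance_group, dynamic_batching) should be checked
against the model configuration documentation for your release::

  # Sketch of a minimal model repository that tritonserver could serve
  # with --model-repository=/tmp/model_repository. Everything here is a
  # hypothetical example, not a drop-in configuration.
  from pathlib import Path

  repo = Path("/tmp/model_repository/my_onnx_model")
  (repo / "1").mkdir(parents=True, exist_ok=True)  # version 1 holds model.onnx

  config = """
  name: "my_onnx_model"
  platform: "onnxruntime_onnx"
  max_batch_size: 8
  input [
    { name: "INPUT0" data_type: TYPE_FP32 dims: [ 16 ] }
  ]
  output [
    { name: "OUTPUT0" data_type: TYPE_FP32 dims: [ 4 ] }
  ]
  # Run two instances of the model on one GPU (concurrent model execution).
  instance_group [
    { count: 2 kind: KIND_GPU }
  ]
  # Let the server combine individual requests into batches.
  dynamic_batching {
    preferred_batch_size: [ 4, 8 ]
    max_queue_delay_microseconds: 100
  }
  """
  (repo / "config.pbtxt").write_text(config)
  # The exported ONNX file then goes to .../my_onnx_model/1/model.onnx.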

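The readiness, liveness, and metrics endpoints can be probed as in the
sketch below. It assumes the same hypothetical tritonclient-style
package as the earlier client sketch and the conventional default ports
(8000 for HTTP, 8002 for the Prometheus-format metrics endpoint), both
of which may differ in your deployment; the model name is again a
placeholder::

  # Sketch: liveness/readiness probes and a metrics scrape.
  import requests
  import tritonclient.http as httpclient

  client = httpclient.InferenceServerClient(url="localhost:8000")
  print("live:", client.is_server_live())    # liveness probe
  print("ready:", client.is_server_ready())  # readiness probe
  print("model ready:", client.is_model_ready("my_onnx_model"))

  # Prometheus-format metrics (GPU utilization, throughput, latency).
  print(requests.get("http://localhost:8002/metrics").text[:500])
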
Backwards Compatibility

Continuing in the latest version, the following interfaces maintain
backwards compatibility with the 1.0.0 release. If you have model
configuration files, custom backends, or clients that use the
inference server HTTP or GRPC APIs (either directly or through the
client libraries) from releases prior to 1.0.0, you should edit
and rebuild those as necessary to match the version 1.0.0 APIs.

The following interfaces will maintain backwards compatibility for all
future 1.x.y releases (see below for exceptions):

  • Model configuration as defined in model_config.proto

  • The inference server HTTP and GRPC APIs as defined in api.proto
    and grpc_service.proto, except as noted below.

  • The V1 and V2 custom backend interfaces as defined in custom.h

As new features are introduced they may temporarily have beta status
where they are subject to change in non-backwards-compatible
ways. When they exit beta they will conform to the
backwards-compatibility guarantees described above. Currently the
following features are in beta:

  • The inference server library API as defined in trtserver.h
    is currently in beta and may undergo non-backwards-compatible
    changes.

  • The C++ and Python client libraries are not strictly included in the
    inference server compatibility guarantees and so should be
    considered as beta status.

Roadmap

The inference server’s new name is Triton Inference Server, which can
be shortened to just Triton Server in contexts where inferencing is
already understood. The primary reasons for the name change are to:

  • Avoid confusion with the NVIDIA TensorRT Programmable Inference
    Accelerator


  • Avoid the perception that Triton Server only supports TensorRT
    models when in fact the server supports a wide range of model
    frameworks and formats.

  • Highlight that the server is aligning HTTP/REST and GRPC protocols
    with a set of KFServing community standard inference protocols
    that have been proposed by the KFServing project.

Transitioning from the current protocols (version 1) to the new
protocols (version 2) will take place over several releases. A sketch
of the new protocol's REST endpoints follows the release list below.

  • Current master

    • Alpha release of server support for KFServing community standard
      GRPC and HTTP/REST inference protocol.
    • Alpha release of Python client library that uses KFServing
      community standard GRPC and HTTP/REST inference protocol.
    • See client documentation <https://github.com/NVIDIA/triton-inference-server/tree/master/docs/client_experimental.rst>_
      for description and examples showing how to enable and use the new
      GRPC and HTTP/REST inference protocol and Python client library.
    • Existing HTTP/REST and GRPC protocols, and existing client APIs
      continue to be supported and remain the default protocols.
  • 20.05

    • Beta release of KFServing community standard HTTP/REST and GRPC
      inference protocol support in server, Python client, and C++
      client.
    • Beta release of the HTTP/REST and GRPC extensions <https://github.com/NVIDIA/triton-inference-server/tree/master/docs/protocol>_
      to the KFServing inference protocol.
    • Existing HTTP/REST and GRPC protocols are deprecated but remain
      the default.
    • Existing shared library interface defined in trtserver.h continues
      to be supported but is deprecated.
    • Beta release of the new shared library interface defined in
      tritonserver.h.
  • 20.06

    • Triton Server version 2.0.0.
    • KFServing community standard HTTP/REST and GRPC inference
      protocols plus all Triton extensions <https://github.com/NVIDIA/triton-inference-server/tree/master/docs/protocol>_
      become the default and only supported protocols for the server.
    • C++ and Python client libraries based on the KFServing standard
      inference protocols become the default and only supported client
      libraries.
    • The new shared library interface defined in tritonserver.h becomes
      the default and only supported shared library interface.
    • Original C++ and Python client libraries are removed. Release
      20.05 is the last release to support these libraries.
    • Original shared library interface defined in trtserver.h is
      removed. Release 20.05 is the last release to support the
      trtserver.h shared library interface.
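
As a rough illustration of the KFServing community standard HTTP/REST
protocol that these releases move toward, the sketch below issues
v2-style health and inference requests directly with the requests
library. The routes, request body layout, and port shown are
assumptions based on the proposed standard and the protocol documents
linked above, and may change while support is in alpha/beta::

  # Rough sketch of the KFServing-style v2 REST protocol (alpha/beta).
  # Routes, body layout, and port are assumptions; consult the protocol
  # documents linked above for the authoritative definition.
  import requests

  BASE = "http://localhost:8000/v2"

  print(requests.get(BASE + "/health/live").status_code)   # 200 when live
  print(requests.get(BASE + "/health/ready").status_code)  # 200 when ready

  # JSON inference request for a hypothetical model "my_onnx_model".
  body = {
      "inputs": [
          {"name": "INPUT0", "shape": [1, 16], "datatype": "FP32",
           "data": [0.0] * 16}
      ],
      "outputs": [{"name": "OUTPUT0"}],
  }
  resp = requests.post(BASE + "/models/my_onnx_model/infer", json=body)
  print(resp.json())  # "outputs" entries carry name, shape, datatype, data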

Throughout the transition the model repository structure and custom
backend APIs will remain unchanged so that any existing model
repository and custom backends will continue to work with Triton
Server.

In the 20.06 release there will be some minor changes to the
tritonserver command-line executable arguments. It will be necessary
to revisit and possibly adjust invocations of the tritonserver executable.

In the 20.06 release there will be some minor changes to the model
configuration schema. It is expected that these changes will not
impact the vast majority of model configurations. For impacted models
the model configuration will need minor edits to become compatible
with Triton Server version 2.0.0.

Documentation

The User Guide, Developer Guide, and API Reference documentation for the current release <https://docs.nvidia.com/deeplearning/sdk/triton-inference-server-guide/docs/index.html>_
provide guidance on installing, building, and running Triton Inference
Server.

You can also view the documentation for the master branch <https://docs.nvidia.com/deeplearning/sdk/triton-inference-server-master-branch-guide/docs/index.html>_
and for earlier releases <https://docs.nvidia.com/deeplearning/sdk/inference-server-archived/index.html>_.

An FAQ <https://docs.nvidia.com/deeplearning/sdk/triton-inference-server-master-branch-guide/docs/faq.html>_
provides answers for frequently asked questions.

READMEs for deployment examples can be found in subdirectories of
deploy/, for example, deploy/single_server/README.rst <https://github.com/NVIDIA/triton-inference-server/tree/master/deploy/single_server/README.rst>_.

The Release Notes <https://docs.nvidia.com/deeplearning/sdk/inference-release-notes/index.html>_
and Support Matrix <https://docs.nvidia.com/deeplearning/dgx/support-matrix/index.html>_
indicate the required versions of the NVIDIA Driver and CUDA, and also
describe which GPUs are supported by Triton Server.

Presentations and Papers

  • High-Performance Inferencing at Scale Using the TensorRT Inference Server <https://developer.nvidia.com/gtc/2020/video/s22418>_.

  • Accelerate and Autoscale Deep Learning Inference on GPUs with KFServing <https://developer.nvidia.com/gtc/2020/video/s22459>_.

  • Deep into Triton Inference Server: BERT Practical Deployment on NVIDIA GPU <https://developer.nvidia.com/gtc/2020/video/s21736>_.

  • Maximizing Utilization for Data Center Inference with TensorRT Inference Server <https://on-demand-gtc.gputechconf.com/gtcnew/sessionview.php?sessionName=s9438-maximizing+utilization+for+data+center+inference+with+tensorrt+inference+server>_.

  • NVIDIA TensorRT Inference Server Boosts Deep Learning Inference <https://devblogs.nvidia.com/nvidia-serves-deep-learning-inference/>_.

  • GPU-Accelerated Inference for Kubernetes with the NVIDIA TensorRT Inference Server and Kubeflow <https://www.kubeflow.org/blog/nvidia_tensorrt/>_.

Contributing

Contributions to Triton Inference Server are more than welcome. To
contribute, make a pull request and follow the guidelines outlined in
the Contributing <CONTRIBUTING.md>_ document.

Reporting problems, asking questions

We appreciate any feedback, questions or bug reporting regarding this
project. When help with code is needed, follow the process outlined in
the Stack Overflow (https://stackoverflow.com/help/mcve)
document. Ensure posted examples are:

  • minimal – use as little code as possible that still produces the
    same problem

  • complete – provide all parts needed to reproduce the problem. Check
    if you can strip external dependencies and still show the problem. The
    less time we spend on reproducing problems, the more time we have to
    fix it.

  • verifiable – test the code you’re about to provide to make sure it
    reproduces the problem. Remove all other problems that are not
    related to your request/question.

