A Summary of TensorRT Optimization Principles and TensorRT Plugins

1. TensorRT Optimization Principles


TensorRT's ability to accelerate DL inference comes from its optimizer and runtime. The optimizations fall into four areas:

  • Layer & tensor fusion: convolution, bias, and ReLU layers across the network are fused and executed by a single kernel, which speeds up data movement and reduces kernel-launch time. In addition, layers whose outputs are never used are eliminated, and layers with similar parameters and the same source tensor are aggregated.
  • Mixed precision: use lower-precision data types to shrink data size and reduce the amount of computation (see the sketch after this list).
  • Kernel auto-tuning: based on the target hardware platform and the input parameters, TensorRT selects the algorithms for certain layers (for example, among different convolution algorithms) and automatically chooses the best GPU kernels or Tensor Cores.
  • Dynamic tensor memory: TensorRT allocates a block of memory at runtime and reuses it as much as possible, making the computation more efficient.
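
To make the mixed-precision and workspace settings concrete, here is a minimal sketch of configuring a builder before the engine is built. It assumes the pre-7.x TensorRT Python API used throughout this article (builder.max_workspace_size, builder.fp16_mode); populate_network is a placeholder for whatever fills in the layers (a parser or the Network Definition API described later).

import tensorrt as trt

TRT_LOGGER = trt.Logger(trt.Logger.WARNING)

def build_fp16_engine(populate_network):
    with trt.Builder(TRT_LOGGER) as builder, builder.create_network() as network:
        # Workspace memory the optimizer may use while timing candidate kernels
        # (kernel auto-tuning) and fusing layers.
        builder.max_workspace_size = 1 << 30  # 1 GiB
        # Enable FP16 kernels where the hardware supports them (mixed precision).
        if builder.platform_has_fast_fp16:
            builder.fp16_mode = True
        populate_network(network)  # add layers via a parser or the Network Definition API
        return builder.build_cuda_engine(network)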

2. Basic TensorRT Development Workflow

The code below walks through the basic TensorRT development workflow: build an engine from a UFF model, allocate buffers, and run inference:

from random import randint
from PIL import Image
import numpy as np

import pycuda.driver as cuda
# This import causes pycuda to automatically manage CUDA context creation and cleanup.
import pycuda.autoinit
import tensorrt as trt
import sys, os
sys.path.insert(1, os.path.join(sys.path[0], ".."))
import common

TRT_LOGGER = trt.Logger(trt.Logger.WARNING)

class ModelData(object):
    MODEL_FILE = "lenet5.uff"
    INPUT_NAME ="input_1"
    INPUT_SHAPE = (1, 28, 28)
    OUTPUT_NAME = "dense_1/Softmax"

# Pairs a pagelocked host buffer with its device buffer; used by
# allocate_buffers() and do_inference() below.
class HostDeviceMem(object):
    def __init__(self, host_mem, device_mem):
        self.host = host_mem
        self.device = device_mem

def allocate_buffers(engine):
    inputs = []
    outputs = []
    bindings = []
    stream = cuda.Stream()
    for binding in engine:
        size = trt.volume(engine.get_binding_shape(binding)) * engine.max_batch_size
        dtype = trt.nptype(engine.get_binding_dtype(binding))
        # Allocate host and device buffers
        host_mem = cuda.pagelocked_empty(size, dtype)
        device_mem = cuda.mem_alloc(host_mem.nbytes)
        # Append the device buffer to device bindings.
        bindings.append(int(device_mem))
        # Append to the appropriate list.
        if engine.binding_is_input(binding):
            inputs.append(HostDeviceMem(host_mem, device_mem))
        else:
            outputs.append(HostDeviceMem(host_mem, device_mem))
    return inputs, outputs, bindings, stream

def do_inference(context, bindings, inputs, outputs, stream, batch_size=1):
    # Transfer input data to the GPU.
    [cuda.memcpy_htod_async(inp.device, inp.host, stream) for inp in inputs]
    # Run inference.
    context.execute_async(batch_size=batch_size, bindings=bindings, stream_handle=stream.handle)
    # Transfer predictions back from the GPU.
    [cuda.memcpy_dtoh_async(out.host, out.device, stream) for out in outputs]
    # Synchronize the stream
    stream.synchronize()
    # Return only the host outputs.
    return [out.host for out in outputs]

def build_engine(model_file):
    with trt.Builder(TRT_LOGGER) as builder, builder.create_network() as network, trt.UffParser() as parser:
        builder.max_workspace_size = common.GiB(1)
        # Parse the Uff Network
        parser.register_input(ModelData.INPUT_NAME, ModelData.INPUT_SHAPE)
        parser.register_output(ModelData.OUTPUT_NAME)
        parser.parse(model_file, network)
        # Build and return an engine.
        return builder.build_cuda_engine(network)

# Loads a test case into the provided pagelocked_buffer.
def load_normalized_test_case(data_path, pagelocked_buffer, case_num=randint(0, 9)):
    test_case_path = os.path.join(data_path, str(case_num) + ".pgm")
    # Flatten the image into a 1D array, normalize, and copy to pagelocked memory.
    img = np.array(Image.open(test_case_path)).ravel()
    np.copyto(pagelocked_buffer, 1.0 - img / 255.0)
    return case_num

def main():
    data_path, _ = common.find_sample_data(description="Runs an MNIST network using a UFF model file", subfolder="mnist")
    model_path = os.environ.get("MODEL_PATH") or os.path.join(os.path.dirname(__file__), "models")
    model_file = os.path.join(model_path, ModelData.MODEL_FILE)

    with build_engine(model_file) as engine:
        # Build an engine, allocate buffers and create a stream.
        inputs, outputs, bindings, stream = allocate_buffers(engine)
        with engine.create_execution_context() as context:
            case_num = load_normalized_test_case(data_path, pagelocked_buffer=inputs[0].host)
            [output] = do_inference(context, bindings=bindings, inputs=inputs, outputs=outputs, stream=stream)
            pred = np.argmax(output)
            print("Test Case: " + str(case_num))
            print("Prediction: " + str(pred))

if __name__ == '__main__':
    main()


3. TensorRT Network Definition API

Besides parsing a model file with a parser, you can also rebuild the entire model with the Network Definition API. The process works roughly as follows:

First create a builder and an INetworkDefinition object (network), then build up the network architecture. Start by calling network->addInput to add an input layer, then keep adding further layers on top of it, such as convolution, scale, and softmax layers. Once the whole network has been constructed, set the name of the network output and mark it as the network's output.

This only defines the skeleton of the network, so how do the weights get imported?
Before building the network with the INetworkDefinition API, the weights have to be exported from the checkpoint file; then, when defining each layer, they are passed in. In the convolution definition above, for example, the weights are supplied through the weightMap argument.
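
As a minimal sketch of this workflow in the Python API (not the model used elsewhere in this article), the code below builds a tiny network with one convolution and one ReLU, passing NumPy arrays from a hypothetical weight_map as the layer weights; the keys "conv1/kernel" and "conv1/bias" are placeholders for whatever you export from your checkpoint.

import tensorrt as trt

TRT_LOGGER = trt.Logger(trt.Logger.WARNING)

def build_network_by_api(weight_map):
    # weight_map: dict of NumPy arrays exported from a checkpoint beforehand.
    with trt.Builder(TRT_LOGGER) as builder, builder.create_network() as network:
        builder.max_workspace_size = 1 << 20
        # Add the input layer first (CHW shape).
        data = network.add_input(name="data", dtype=trt.float32, shape=(1, 28, 28))
        # Weights are handed to the layer when it is created.
        conv = network.add_convolution(input=data, num_output_maps=20,
                                       kernel_shape=(5, 5),
                                       kernel=weight_map["conv1/kernel"],
                                       bias=weight_map["conv1/bias"])
        conv.stride = (1, 1)
        relu = network.add_activation(input=conv.get_output(0),
                                      type=trt.ActivationType.RELU)
        # Finally, name and mark the network output.
        relu.get_output(0).name = "prob"
        network.mark_output(relu.get_output(0))
        return builder.build_cuda_engine(network)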

This covers the two ways of importing a model into TensorRT introduced so far:

  • Parser: parse a model file.
  • Network Definition API: redefine the entire model.

A parser typically supports only a dozen or so layer types, so what about the layers it cannot parse (for example RNNs, which the parser currently cannot handle directly)?
The usual approach is to rebuild the model with the Network Definition API and add the RNN layer through network.addRNN(), which solves the problem.
The layers currently supported by the TensorRT 5.1 Network Definition API are listed below:
(figure: layers supported by the TensorRT 5.1 Network Definition API)

What if I have a very non-standard layer, or one I want to customize, that cannot be built with either the parser or the Network Definition API?
TensorRT offers a third option: custom plugins. You implement the custom layer yourself in CUDA, wrap it as a TensorRT plugin, and TensorRT can then recognize the custom layer.


4. TensorRT Plugin

First, what is a plugin for? A plugin is our implementation and wrapper of a layer that needs to be customized or that TensorRT does not support yet.
A TensorRT plugin extends the functionality of the network layers. TensorRT already ships with pre-built plugins that are frequently used in object detection, such as NMS and PriorBox, which can be used directly; any other plugin has to be created by us.
The figure below lists the plugins that ship with TensorRT:
(figure: plugins shipped with TensorRT)

4.1 Implementing a plugin

Implementing a plugin involves three parts (currently plugins can only be written in C++):

  1. Kernel implementation: the concrete CUDA operations the layer performs, usually placed in xxx.cu and xxx.h files.
  2. IPluginV2: the plugin base class; every plugin must inherit from it.
  3. IPluginCreator: used to create the plugin object when building the network, or to recreate it when deserializing the engine at inference time.

4.2 Building the plugin .so shared library

Once the three kinds of files above are implemented, the plugin .so shared library can be built. C++ and CUDA sources are usually compiled together with CMake; for details see:

  1. https://blog.csdn.net/fb_help/article/details/79330815
  2. https://pytorch.org/tutorials/advanced/cpp_extension.html

Since the torch package provides CUDAExtension for exactly this kind of mixed compilation, that is what is used here; the setup script is as follows:

from setuptools import setup
from torch.utils.cpp_extension import BuildExtension, CUDAExtension

setup(
    name='retinanet_plugin',
    ext_modules=[
        CUDAExtension('retinanet_plugin',
            ['plugins/DecodePlugin.cpp', 'plugins/NMSPlugin.cpp', 'cuda/decode.cu', 'cuda/nms.cu'],
            extra_compile_args={
                'cxx': ['-std=c++11', '-O2', '-Wall'],
                'nvcc': [
                    '-std=c++11', '--expt-extended-lambda','-Xcompiler', '-Wall',
                    '-gencode=arch=compute_30,code=sm_30', '-gencode=arch=compute_35,code=sm_35',
                    '-gencode=arch=compute_61,code=sm_61', '-gencode=arch=compute_62,code=sm_62',
                    '-gencode=arch=compute_70,code=sm_70', '-gencode=arch=compute_72,code=sm_72',
                    '-gencode=arch=compute_75,code=sm_75', '-gencode=arch=compute_75,code=compute_75'
                ],
            },
            libraries=['nvinfer', 'nvinfer_plugin', 'nvonnxparser', 'opencv_highgui', 'opencv_imgproc', 'opencv_imgcodecs'])
    ],
    cmdclass={
        'build_ext': BuildExtension
    })

Running the commands below produces the retinanet_plugin .so file:

# python setup.py install
# ll /usr/local/lib/python3.5/dist-packages/retinanet_plugin-0.0.0-py3.5-linux-x86_64.egg
total 4008
drwxr-sr-x 4 root staff     144 Aug  9 07:01 ./
drwxrwsr-x 1 root staff     289 Aug  9 07:00 ../
drwxr-sr-x 2 root staff     161 Aug  9 07:00 EGG-INFO/
drwxr-sr-x 2 root staff      53 Aug  9 07:00 __pycache__/
-rwxr-xr-x 1 root staff 4097408 Aug  9 07:00 retinanet_plugin.cpython-35m-x86_64-linux-gnu.so*
-rw-r--r-- 1 root staff     321 Aug  9 07:00 retinanet_plugin.py

4.3 Loading the plugin in TensorRT

Plugins are loaded as follows:

1) In C++, a single call to initLibNvInferPlugins() registers both the TensorRT pre-built plugins and our custom plugins.

2) In Python, first load the shared library with ctypes.CDLL(XX_PLUGIN_LIBRARY), then call trt.init_libnvinfer_plugins(TRT_LOGGER, '') to register both the TensorRT pre-built plugins and our custom plugins, as in the sketch below. For a UFF model, TensorRT automatically maps layers to plugins while the parser processes the network; for an engine file, TensorRT automatically looks up and uses the registered plugins when deserializing.
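
A minimal Python sketch of that loading sequence (the .so path is illustrative; it could be the retinanet_plugin library built in section 4.2 or any other plugin library):

import ctypes
import tensorrt as trt

TRT_LOGGER = trt.Logger(trt.Logger.WARNING)

# Load our custom plugin library; its plugin creators register themselves on load.
ctypes.CDLL("./retinanet_plugin.so")  # example path
# Register the TensorRT pre-built plugins with the plugin registry.
trt.init_libnvinfer_plugins(TRT_LOGGER, '')

# Optional sanity check: list everything now visible in the plugin registry.
for creator in trt.get_plugin_registry().plugin_creator_list:
    print(creator.name)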


5. Plugin Examples

Example 1: replace layers in an existing .pb model with plugins and produce a UFF model
The main steps are:

1) Set up the mapping between nodes in the original .pb model and the plugin nodes, and perform the graph "surgery".
   See the sample: /usr/src/tensorrt/samples/python/uff_ssd/utils/model.py
def ssd_unsupported_nodes_to_plugin_nodes(ssd_graph):
    """Makes ssd_graph TensorRT comparible using graphsurgeon.

    This function takes ssd_graph, which contains graphsurgeon
    DynamicGraph data structure. This structure describes frozen Tensorflow
    graph, that can be modified using graphsurgeon (by deleting, adding,
    replacing certain nodes). The graph is modified by removing
    Tensorflow operations that are not supported by TensorRT's UffParser
    and replacing them with custom layer plugin nodes.

    Note: This specific implementation works only for
    ssd_inception_v2_coco_2017_11_17 network.

    Args:
        ssd_graph (gs.DynamicGraph): graph to convert
    Returns:
        gs.DynamicGraph: UffParser compatible SSD graph
    """
    # Create TRT plugin nodes to replace unsupported ops in Tensorflow graph
    channels = ModelData.get_input_channels()
    height = ModelData.get_input_height()
    width = ModelData.get_input_width()

    Input = gs.create_plugin_node(name="Input",
        op="Placeholder",
        dtype=tf.float32,
        shape=[1, channels, height, width])
    PriorBox = gs.create_plugin_node(name="GridAnchor", op="GridAnchor_TRT",
        minSize=0.2,
        maxSize=0.95,
        aspectRatios=[1.0, 2.0, 0.5, 3.0, 0.33],
        variance=[0.1,0.1,0.2,0.2],
        featureMapShapes=[19, 10, 5, 3, 2, 1],
        numLayers=6
    )
    NMS = gs.create_plugin_node(
        name="NMS",
        op="NMS_TRT",
        shareLocation=1,
        varianceEncodedInTarget=0,
        backgroundLabelId=0,
        confidenceThreshold=1e-8,
        nmsThreshold=0.6,
        topK=100,
        keepTopK=100,
        numClasses=91,
        inputOrder=[0, 2, 1],
        confSigmoid=1,
        isNormalized=1
    )
    concat_priorbox = gs.create_node(
        "concat_priorbox",
        op="ConcatV2",
        dtype=tf.float32,
        axis=2
    )
    concat_box_loc = gs.create_plugin_node(
        "concat_box_loc",
        op="FlattenConcat_TRT",
        dtype=tf.float32,
    )
    concat_box_conf = gs.create_plugin_node(
        "concat_box_conf",
        op="FlattenConcat_TRT",
        dtype=tf.float32,
    )

    # Create a mapping of namespace names -> plugin nodes.
    namespace_plugin_map = {
        "MultipleGridAnchorGenerator": PriorBox,
        "Postprocessor": NMS,
        "Preprocessor": Input,
        "ToFloat": Input,
        "image_tensor": Input,
        "MultipleGridAnchorGenerator/Concatenate": concat_priorbox,
        "MultipleGridAnchorGenerator/Identity": concat_priorbox,
        "concat": concat_box_loc,
        "concat_1": concat_box_conf
    }
    
    # Create a new graph by collapsing namespaces
    ssd_graph.collapse_namespaces(namespace_plugin_map)
    # Remove the outputs, so we just have a single output node (NMS).
    # If remove_exclusive_dependencies is True, the whole graph will be removed!
    ssd_graph.remove(ssd_graph.graph_outputs, remove_exclusive_dependencies=False)
    return ssd_graph

2) Generate the UFF model file

def model_to_uff(model_path, output_uff_path, silent=False):
    """Takes frozen .pb graph, converts it to .uff and saves it to file.

    Args:
        model_path (str): .pb model path
        output_uff_path (str): .uff path where the UFF file will be saved
        silent (bool): if False, writes progress messages to stdout

    """
    dynamic_graph = gs.DynamicGraph(model_path)
    dynamic_graph = ssd_unsupported_nodes_to_plugin_nodes(dynamic_graph)

    uff.from_tensorflow(
        dynamic_graph.as_graph_def(),
        [ModelData.OUTPUT_NAME],
        output_filename=output_uff_path,
        text=True
    )

3) Load the plugin .so library

ctypes.CDLL(PATHS.get_flatten_concat_plugin_path())

4) Load the UFF model file and create the engine
First register all plugins (pre-built and custom):

trt.init_libnvinfer_plugins(TRT_LOGGER, '')

Then create the engine:

def build_engine(uff_model_path, trt_logger, trt_engine_datatype=trt.DataType.FLOAT, batch_size=1, silent=False):
    with trt.Builder(trt_logger) as builder, builder.create_network() as network, trt.UffParser() as parser:
        builder.max_workspace_size = 1 << 30
        if trt_engine_datatype == trt.DataType.HALF:
            builder.fp16_mode = True
        builder.max_batch_size = batch_size

        parser.register_input(ModelData.INPUT_NAME, ModelData.INPUT_SHAPE)
        parser.register_output("MarkOutput_0")
        parser.parse(uff_model_path, network)

        if not silent:
            print("Building TensorRT engine. This may take few minutes.")

        return builder.build_cuda_engine(network)

5) Save and load the engine

def save_engine(engine, engine_dest_path):
    buf = engine.serialize()
    with open(engine_dest_path, 'wb') as f:
        f.write(buf)

def load_engine(trt_runtime, engine_path):
    with open(engine_path, 'rb') as f:
        engine_data = f.read()
    engine = trt_runtime.deserialize_cuda_engine(engine_data)
    return engine
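
A short usage sketch of these two helpers (the engine path is illustrative): serialize the engine right after building it, then later deserialize it with a trt.Runtime. Note that for an engine containing custom plugins, the plugin .so must be loaded and init_libnvinfer_plugins called before deserialization, as in steps 3) and 4) above; allocate_buffers and do_inference are the helpers from section 2.

import tensorrt as trt

TRT_LOGGER = trt.Logger(trt.Logger.WARNING)

# Right after building:
# save_engine(engine, "ssd.engine")   # example path

# Later, in the deployment process:
trt_runtime = trt.Runtime(TRT_LOGGER)
engine = load_engine(trt_runtime, "ssd.engine")  # example path
with engine.create_execution_context() as context:
    inputs, outputs, bindings, stream = allocate_buffers(engine)
    # Fill inputs[i].host with preprocessed input data, then run inference:
    trt_outputs = do_inference(context, bindings=bindings, inputs=inputs,
                               outputs=outputs, stream=stream)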

Example 2: creating a network model with the Network Definition API and plugins
When creating a network model with the Network Definition API and plugins, simply use the network.add_plugin_v2 method to insert the plugin layer while building the network, as in the following example:

import tensorrt as trt
import numpy as np
TRT_LOGGER = trt.Logger()
trt.init_libnvinfer_plugins(TRT_LOGGER, '')
PLUGIN_CREATORS = trt.get_plugin_registry().plugin_creator_list

def get_trt_plugin(plugin_name):
    plugin = None
    for plugin_creator in PLUGIN_CREATORS:
        if plugin_creator.name == plugin_name:
            lrelu_slope_field = trt.PluginField("neg_slope",    np.array([0.1], dtype=np.float32), trt.PluginFieldType.FLOAT32)
            field_collection = trt.PluginFieldCollection([lrelu_slope_field])
            plugin = plugin_creator.create_plugin(name=plugin_name, field_collection=field_collection)
    return plugin

def main():
    with trt.Builder(TRT_LOGGER) as builder, builder.create_network() as network:
        builder.max_workspace_size = 2**20
        input_layer = network.add_input(name="input_layer", dtype=trt.float32, shape=(1, 1))
        lrelu = network.add_plugin_v2(inputs=[input_layer], plugin=get_trt_plugin("LReLU_TRT"))
        lrelu.get_output(0).name = "outputs"
        network.mark_output(lrelu.get_output(0))
        # Build the engine; buffer allocation and inference then proceed as in section 2.
        return builder.build_cuda_engine(network)