Accelerating maskrcnn-benchmark with TensorRT

Original

2019-07-30 19:10

1. Required tools

The .pth model file from maskrcnn-benchmark
torch.jit
torch.onnx
TensorRT 5.1
The GPU I used is an RTX 2080 Ti.

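2. Acceleration procedure
Mask R-CNN is a two-stage model, so we can rewrite only the first part, the backbone network that extracts features. The hand-off from the first part to the second is fairly complex and takes relatively little compute time, so I grafted the rewritten first part directly onto the original maskrcnn-benchmark.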
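As a rough guide to how much this approach can help, Amdahl's law bounds the end-to-end gain when only the backbone is accelerated. The sketch below is purely illustrative: the function name `end_to_end_speedup` and the 70% backbone share are hypothetical, not measurements from this post.

```python
def end_to_end_speedup(backbone_fraction, backbone_speedup):
    """Amdahl's law: overall speedup when only the backbone gets faster.

    backbone_fraction: share of total inference time spent in the backbone (0..1).
    backbone_speedup:  how many times faster the rewritten backbone runs.
    """
    if not 0.0 <= backbone_fraction <= 1.0:
        raise ValueError("backbone_fraction must be in [0, 1]")
    if backbone_speedup <= 0:
        raise ValueError("backbone_speedup must be positive")
    # Time left after the change, relative to the original total time
    remaining = (1.0 - backbone_fraction) + backbone_fraction / backbone_speedup
    return 1.0 / remaining


# Hypothetical numbers, NOT measurements from this post:
# the backbone takes 70% of the time and becomes 3x faster.
print(round(end_to_end_speedup(0.7, 3.0), 2))
```

This bound ignores any extra cost of moving data between the TensorRT part and the PyTorch part, so the real gain can be lower.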

Trace the backbone with torch.jit

import torch
from maskrcnn_benchmark.config import cfg
from predictor import COCODemo  # demo/predictor.py in maskrcnn-benchmark

config_file = '../configs/e2e_mask_rcnn_R_50_FPN_1x.yaml'
opts = ["MODEL.WEIGHT",
        "./R50_15801_20190620/model_0082500.pth",
        "MODEL.DEVICE", "cpu"]
# Load the config from the file, then apply the command-line style overrides
cfg.merge_from_file(config_file)
cfg.merge_from_list(opts)
cfg.freeze()

model = COCODemo(cfg,
                 min_image_size=800,
                 # confidence_threshold=0.7,
                 show_mask_heatmaps=False)
# Put the backbone in inference mode before tracing
model.model.backbone.eval()

# Trace the backbone with a dummy batch of two 800x800 RGB images
with torch.no_grad():
    trace_backbone = torch.jit.trace(model.model.backbone, torch.rand([2, 3, 800, 800]))

Export the traced model with torch.onnx

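import torch.onnx as onnx

# Run the backbone once to obtain example outputs for the exporter
dummy_input = torch.rand([4, 3, 800, 800])
with torch.no_grad():
    output = model.model.backbone(dummy_input)

# Note: the exported file is in ONNX format despite the .pb extension
onnx._export(trace_backbone, dummy_input, './temp.pb',
             verbose=False, export_params=True, training=False,
             example_outputs=output)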

Load the exported model with TensorRT

import tensorrt as trt
import pycuda.driver as cuda
import pycuda.autoinit  # creates the CUDA context on import

TRT_LOGGER = trt.Logger(trt.Logger.WARNING)
with trt.Builder(TRT_LOGGER) as builder, \
        builder.create_network() as network, \
        trt.OnnxParser(network, TRT_LOGGER) as parser:
    with open('./temp.pb', 'rb') as f:
        # parse() returns False on failure; report the parser errors in that case
        if not parser.parse(f.read()):
            for i in range(parser.num_errors):
                print(parser.get_error(i))
            raise RuntimeError('Failed to parse temp.pb')
    print('Building an engine from file temp.pb; this may take a while...')
    engine = builder.build_cuda_engine(network)
    print("Completed creating Engine")

inputs = []
outputs = []
bindings = []
stream = cuda.Stream()

for binding in engine:
    print(binding)
    size = trt.volume(engine.get_binding_shape(binding)) * engine.max_batch_size
    dtype = trt.nptype(engine.get_binding_dtype(binding))

    # Allocate the host (page-locked) and device buffers
    host_mem = cuda.pagelocked_empty(size, dtype)
    device_mem = cuda.mem_alloc(host_mem.nbytes)

    # Append the device buffer address to the bindings list
    bindings.append(int(device_mem))

    # Append to the appropriate list.
    if engine.binding_is_input(binding):
        inputs.append((host_mem, device_mem))
    else:
        outputs.append((host_mem, device_mem))

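def do_inference(context, bindings, inputs, outputs, stream, batch_size=1):
    # Copy the inputs from host to device
    for host_mem, device_mem in inputs:
        cuda.memcpy_htod_async(device_mem, host_mem, stream)
    # Run inference asynchronously on the stream
    context.execute_async(batch_size=batch_size, bindings=bindings, stream_handle=stream.handle)
    # Copy the results from device back to host
    for host_mem, device_mem in outputs:
        cuda.memcpy_dtoh_async(host_mem, device_mem, stream)
    # Wait for all queued work to finish
    stream.synchronize()
    return [host_mem for host_mem, _ in outputs]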

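# Calling code
from tqdm import tqdm_notebook as tqdm  # in a plain script, use: from tqdm import tqdm
import numpy as np

# Create the execution context once and reuse it for every iteration
with engine.create_execution_context() as context:
    for i in tqdm(range(10000)):
        # Dummy input: one 3x800x800 image, flattened to 1,920,000 values
        np.copyto(inputs[0][0], np.ones([3, 800, 800], dtype=np.float32).ravel())
        output = do_inference(context, bindings=bindings, inputs=inputs, outputs=outputs, stream=stream)
        # The outputs differ in size, so keep them as a list of arrays
        pred = [np.array(out) for out in output]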
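The iterations-per-second readout from tqdm includes its own overhead and any warm-up cost. For a steadier number, a small timing helper can be used. This is a minimal sketch: `benchmark` and its parameters are hypothetical names, and the lambda is a stand-in workload in place of the real inference call.

```python
import time


def benchmark(fn, warmup=10, iters=100):
    """Return the mean wall-clock seconds per call of fn().

    The first `warmup` calls are not timed, so one-off costs
    (lazy initialisation, caches) do not skew the average.
    """
    if iters <= 0:
        raise ValueError("iters must be positive")
    for _ in range(warmup):
        fn()
    start = time.perf_counter()
    for _ in range(iters):
        fn()
    return (time.perf_counter() - start) / iters


# Stand-in workload so the sketch runs on its own; in practice fn would wrap
# the real call, e.g. lambda: do_inference(context, bindings, inputs, outputs, stream)
seconds = benchmark(lambda: sum(range(10000)), warmup=5, iters=50)
print("%.3f ms per call, %.1f calls/s" % (seconds * 1e3, 1.0 / seconds))
```

Because do_inference ends with stream.synchronize(), wall-clock timing around it covers the GPU work as well as the copies.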

3. Results

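Using the backbone on its own, the rewritten version runs about 3 times as fast as the original.
With the full Mask R-CNN model, however, the hand-off from the first part to the second involves a large amount of data copying. In practice the rewritten pipeline ran at only about half the speed of the original, and CUDA utilization stayed low. Of course, it may also just be my lack of skill.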
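To get a feel for how much data crosses that boundary, the size of the backbone outputs can be estimated. The sketch below assumes the usual R-50-FPN layout (five levels with 256 channels at strides 4, 8, 16, 32 and 64, spatial sizes rounded up) and float32 outputs. These are assumptions about the exported model, not values read from the engine, and `fpn_output_shapes` is a hypothetical helper.

```python
import math


def fpn_output_shapes(height, width, channels=256, strides=(4, 8, 16, 32, 64)):
    """Per-image (C, H, W) shape of each FPN level, assuming each
    stride-2 stage rounds the spatial size up."""
    return [(channels, math.ceil(height / s), math.ceil(width / s)) for s in strides]


shapes = fpn_output_shapes(800, 800)
volumes = [c * h * w for c, h, w in shapes]
bytes_per_image = 4 * sum(volumes)  # float32 = 4 bytes per element

for shape, volume in zip(shapes, volumes):
    print(shape, volume)
print("total: %.1f MB per image, per copy direction" % (bytes_per_image / 1e6))
```

Under these assumptions, one 800x800 image yields about 13.6 million output values, roughly 55 MB. do_inference copies all of it to host memory, and if the second stage runs on the GPU it has to be uploaded again.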
If you get this working, feel free to share the speeds you measure.
