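I set aside some time to look into how do_inference is actually implemented in Python.
When I first came across TensorRT, its home page explained that you feed it a model (with its weights) produced by some framework, it optimizes the model into an engine, and running inference with that engine is faster.
Concretely, the flow is model -> engine -> results. Here I mainly want to study the code that builds and uses the engine, that is, the do_inference process used in many demos. I worked through it using the code from TensorRT's onnx_yolo-v3 demo.
In this demo, the inference step looks like this: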

    with get_engine(onnx_file_path, engine_file_path) as engine, engine.create_execution_context() as context:
        inputs, outputs, bindings, stream = common.allocate_buffers(engine)
        print('Running inference on image {}...'.format(input_image_path))
        inputs[0].host = image
        trt_outputs = common.do_inference(context, bindings=bindings, inputs=inputs, outputs=outputs, stream=stream)
    trt_outputs = [output.reshape(shape) for output, shape in zip(trt_outputs, output_shapes)]


get_engine

def get_engine(onnx_file_path, engine_file_path=""):
    def build_engine():
        with trt.Builder(TRT_LOGGER) as builder, builder.create_network() as network, trt.OnnxParser(network, TRT_LOGGER) as parser:
            builder.max_workspace_size = 1 << 30 # 1GB
            builder.max_batch_size = 1
            # Parse model file
            if not os.path.exists(onnx_file_path):
                print('ONNX file {} not found, please run yolov3_to_onnx.py first to generate it.'.format(onnx_file_path))
                exit(0)
            print('Loading ONNX file from path {}...'.format(onnx_file_path))
            with open(onnx_file_path, 'rb') as model:
                print('Beginning ONNX file parsing')
                parser.parse(model.read())
            print('Completed parsing of ONNX file')
            print('Building an engine from file {}; this may take a while...'.format(onnx_file_path))
            engine = builder.build_cuda_engine(network)
            print("Completed creating Engine")
            with open(engine_file_path, "wb") as f:
                f.write(engine.serialize())
            return engine
            
    if os.path.exists(engine_file_path):
        # If a serialized engine exists, use it instead of building an engine.
        print("Reading engine from file {}".format(engine_file_path))
        with open(engine_file_path, "rb") as f, trt.Runtime(TRT_LOGGER) as runtime:
            return runtime.deserialize_cuda_engine(f.read())
    else:
        return build_engine()

get_engine() checks whether a serialized engine already exists. If it does, the engine is loaded directly with the deserialize_cuda_engine method of a trt.Runtime object, as shown here:

with open(engine_file_path, "rb") as f, trt.Runtime(TRT_LOGGER) as runtime:
     return runtime.deserialize_cuda_engine(f.read())

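If it does not exist, build_engine is called. It creates the builder, network and parser in turn, parses the model with parser.parse(), and builds the engine from the network with builder.build_cuda_engine. Finally, engine.serialize() serializes the engine (into a binary stream), which is written to the file at engine_file_path.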
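The "load if cached, otherwise build and save" logic can be sketched without TensorRT at all. In the sketch below, load_or_build and fake_build are hypothetical names of my own; fake_build merely stands in for the builder/network/parser pipeline and returns placeholder bytes.

```python
import os
import tempfile


def load_or_build(cache_path, build_fn):
    """Return serialized bytes from cache_path, building and caching them if absent.

    build_fn is a hypothetical zero-argument callable standing in for the
    builder/network/parser pipeline; it must return bytes.
    """
    if os.path.exists(cache_path):
        # Fast path: reuse the previously serialized artifact.
        with open(cache_path, "rb") as f:
            return f.read()
    # Slow path: build once, then persist the result for the next run.
    blob = build_fn()
    with open(cache_path, "wb") as f:
        f.write(blob)
    return blob


calls = []


def fake_build():
    # Stand-in for the expensive engine build; records each invocation.
    calls.append("build")
    return b"serialized-engine"


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "model.trt")
    first = load_or_build(path, fake_build)   # builds and writes the file
    second = load_or_build(path, fake_build)  # reads the cached file instead
```

The second call never reaches fake_build, which is the same reason a rerun of the demo skips the slow engine build once the engine file exists.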

engine.create_execution_context()

This is a prepackaged, closed-source function. Its declaration in C++ looks like this:

    //!
    //! \brief Create an execution context.
    //!
    //! \see IExecutionContext.
    //!
    virtual IExecutionContext* createExecutionContext() = 0;

It simply creates the context that is required at runtime (and that must also be destroyed at the end).
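The demo acquires both the engine and the context in a single with statement, which is what takes care of the cleanup. A minimal sketch of the ordering this gives, using a hypothetical FakeResource class in place of the real TensorRT objects:

```python
class FakeResource:
    """Hypothetical stand-in for an engine or context that must be released."""

    def __init__(self, name, log):
        self.name = name
        self.log = log

    def __enter__(self):
        self.log.append("acquire " + self.name)
        return self

    def __exit__(self, exc_type, exc, tb):
        # Stand-in for releasing the underlying native object.
        self.log.append("release " + self.name)
        return False  # do not swallow exceptions


log = []
with FakeResource("engine", log) as engine, FakeResource("context", log) as context:
    log.append("run inference")

for entry in log:
    print(entry)
```

Resources listed in one with statement are released in reverse order, so the context goes away before the engine it was created from.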

common.allocate_buffers()

inputs, outputs, bindings, stream = common.allocate_buffers(engine)

allocate_buffers allocates memory for the engine that was just created.

def allocate_buffers(engine):
    inputs = []
    outputs = []
    bindings = []
    stream = cuda.Stream()
    for binding in engine:
        size = trt.volume(engine.get_binding_shape(binding)) * engine.max_batch_size
        dtype = trt.nptype(engine.get_binding_dtype(binding))
        host_mem = cuda.pagelocked_empty(size, dtype)
        device_mem = cuda.mem_alloc(host_mem.nbytes)
        bindings.append(int(device_mem))
        # Append to the appropriate list.
        if engine.binding_is_input(binding):
            inputs.append(HostDeviceMem(host_mem, device_mem))
        else:
            outputs.append(HostDeviceMem(host_mem, device_mem))
    return inputs, outputs, bindings, stream

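Outside the for loop

A CUDA stream is created here; the rest of the function inspects the inputs/outputs and allocates memory for them.

Inside the for loop

Iterating over the engine yields the binding names. In the four iterations here, binding takes the unicode values u'000_net', u'082_convolutional', u'094_convolutional' and u'106_convolutional', which correspond to exactly one input and three outputs.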

1. size = trt.volume(engine.get_binding_shape(binding)) * engine.max_batch_size
trt.volume computes the number of elements in a shape:

def volume(iterable):
    vol = 1
    for elem in iterable:
        vol *= elem
    return vol if iterable else 0

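Debugging shows that engine.get_binding_shape(binding) is a Dims value of (608, 608, 3) at this point. So trt.volume simply flattens the image and counts its elements; multiplying by engine.max_batch_size then gives the total element count of the NCHW tensor. (Precision is not taken into account yet: this is only a count of elements, not of bytes.)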
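As a quick check of that arithmetic, here is a standalone copy of the helper applied to the shape reported in the debugger. The batch size of 1 is the value set in build_engine.

```python
def volume(dims):
    """Product of all dimensions; 0 for an empty shape (mirrors the helper above)."""
    vol = 1
    for d in dims:
        vol *= d
    return vol if dims else 0


input_shape = (608, 608, 3)  # shape reported in the debugger
max_batch_size = 1           # as set in build_engine
num_elements = volume(input_shape) * max_batch_size
print(num_elements)  # 1108992
```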

2. dtype = trt.nptype(engine.get_binding_dtype(binding))
trt.nptype returns the corresponding numpy type:

def nptype(trt_type):
    import numpy as np
    if trt_type == float32:
        return np.float32
    elif trt_type == float16:
        return np.float16
    elif trt_type == int8:
        return np.int8
    elif trt_type == int32:
        return np.int32
    raise TypeError("Could not resolve TensorRT datatype to an equivalent numpy datatype.")

The type is needed because different precisions occupy different amounts of memory.

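3. host_mem = cuda.pagelocked_empty(size, dtype)
Creates a page-locked host buffer from size and dtype (its size in bytes is size × the number of bytes per element).
4. device_mem = cuda.mem_alloc(host_mem.nbytes)
Allocates the device memory. host_mem.nbytes is the computed byte count as an int. For example, when I run the TensorRT yolov3 demo, the batch size during inference is 1 and the input is 1×608×608×3; the float type is 32-bit and takes 4 bytes, so the buffer occupies 1×608×608×3×4 = 4,435,968 bytes.
That is exactly what I saw in the debugger.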
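The byte count can be reproduced with plain arithmetic. In this sketch, ITEMSIZE and buffer_nbytes are names I made up for illustration; the byte widths are those of the four numpy types returned by trt.nptype.

```python
# Bytes per element for the four types handled by trt.nptype.
ITEMSIZE = {"float32": 4, "float16": 2, "int8": 1, "int32": 4}


def buffer_nbytes(shape, dtype_name, batch_size=1):
    """Element count of one binding times the size of each element."""
    count = batch_size
    for d in shape:
        count *= d
    return count * ITEMSIZE[dtype_name]


fp32_bytes = buffer_nbytes((608, 608, 3), "float32")
fp16_bytes = buffer_nbytes((608, 608, 3), "float16")
print(fp32_bytes)  # 4435968
print(fp16_bytes)  # 2217984
```

The same input in float16 would need half the memory, which is why the dtype has to be known before allocating.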

Finally, bindings is a list that stores the device memory address (int(device_mem)) of every input/output buffer.
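In allocate_buffers, bindings.append(int(device_mem)) stores an integer address rather than a size. The idea can be imitated on the CPU with ctypes. This is only an analogy: ordinary host buffers stand in for cuda.mem_alloc, and the buffer sizes are made-up values, not the demo's.

```python
import ctypes

# Hypothetical byte sizes for one input and three outputs (not taken from the demo).
buffer_sizes = [64, 16, 32, 48]

allocations = []  # keeps the buffers alive, like device_mem inside HostDeviceMem
bindings = []     # one integer address per binding, in engine order

for nbytes in buffer_sizes:
    buf = ctypes.create_string_buffer(nbytes)  # host memory standing in for cuda.mem_alloc
    allocations.append(buf)
    bindings.append(ctypes.addressof(buf))     # plays the role of int(device_mem)

print(len(bindings), all(isinstance(b, int) for b in bindings))
```

The list carries no size information at all; whoever consumes the addresses has to know the buffer sizes from somewhere else, which in the real case is the engine itself.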

5. if engine.binding_is_input(binding):
       inputs.append(HostDeviceMem(host_mem, device_mem))
   else:
       outputs.append(HostDeviceMem(host_mem, device_mem))

For an input, a HostDeviceMem object is appended to the inputs list; for an output, one is appended to the outputs list. The HostDeviceMem class looks like this:

class HostDeviceMem(object):
    def __init__(self, host_mem, device_mem):
        self.host = host_mem
        self.device = device_mem

    def __str__(self):
        return "Host:\n" + str(self.host) + "\nDevice:\n" + str(self.device)

    def __repr__(self):
        return self.__str__()

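To sum up, allocate_buffers allocates memory for the inputs and outputs. Each input/output gets a page-locked host_mem (an ndarray) and a matching device_mem allocation, wrapped together in a HostDeviceMem object; these buffers hold the data going into and coming out of inference.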

common.do_inference()

First, the host of inputs[0] (the input's host_mem allocated above) is set, which assigns the input data:

inputs[0].host=image

After that, all that remains is the call to do_inference:

trt_outputs = common.do_inference(context, bindings=bindings, inputs=inputs, outputs=outputs, stream=stream)

do_inference has only five lines of code:

def do_inference(context, bindings, inputs, outputs, stream, batch_size=1):
    # Transfer input data to the GPU.
    [cuda.memcpy_htod_async(inp.device, inp.host, stream) for inp in inputs]
    # Run inference.
    context.execute_async(batch_size=batch_size, bindings=bindings, stream_handle=stream.handle)
    # Transfer predictions back from the GPU.
    [cuda.memcpy_dtoh_async(out.host, out.device, stream) for out in outputs]
    # Synchronize the stream
    stream.synchronize()
    # Return only the host outputs.
    return [out.host for out in outputs]

Input parameters

context

context is produced by engine.create_execution_context(). create_execution_context is a closed-source method listed in the ICudaEngine.py stub; it creates an object of type IExecutionContext.
In Python, this class is listed in IExecutionContext as follows:

class IExecutionContext(__pybind11_builtins.pybind11_object):
    def execute(self, batch_size, bindings, p_int=None): # real signature unknown; restored from __doc__
        pass
    def execute_async(self, batch_size=1, bindings=None, p_int=None, *args, **kwargs): # real signature unknown; NOTE: unreliably restored from __doc__
        pass
    def __del__(self): # real signature unknown; restored from __doc__
        """ __del__(self: tensorrt.tensorrt.IExecutionContext) -> None """
        pass
    def __enter__(self, *args, **kwargs): # real signature unknown
        pass
    def __exit__(self, *args, **kwargs): # real signature unknown
        pass
    def __init__(self, *args, **kwargs): # real signature unknown
        pass
    debug_sync = property(lambda self: object(), lambda self, v: None, lambda self: None)  # default
    device_memory = property(lambda self: object(), lambda self, v: None, lambda self: None)  # default
    engine = property(lambda self: object(), lambda self, v: None, lambda self: None)  # default
    name = property(lambda self: object(), lambda self, v: None, lambda self: None)  # default
    profiler = property(lambda self: object(), lambda self, v: None, lambda self: None)  # default

I also looked at some of the methods of the IExecutionContext class in the C++ header NvInfer.h.

class IExecutionContext
{
public:
	// Synchronously execute inference on a batch
	virtual bool execute(int batchSize, void** bindings) = 0;
	// Asynchronously execute inference on a batch
	virtual bool enqueue(int batchSize, void** bindings, cudaStream_t stream, cudaEvent_t* inputConsumed) = 0;
	// Set the debug sync flag
	virtual void setDebugSync(bool sync) = 0;
	// Get the debug sync flag
	virtual bool getDebugSync() = 0;
	// Set the profiler
	virtual void setProfiler(IProfiler*) = 0;
	// Get the profiler
	virtual IProfiler* getProfiler() const = 0;
	// Get the associated engine
	virtual const ICudaEngine& getEngine() const = 0;
	virtual void destroy() = 0;
protected:
	virtual ~IExecutionContext() {}
public:
	// Set the name of the execution context
	virtual void setName(const char* name) = 0;
	// Return the name of the execution context
	virtual const char* getName() const = 0;
	// Set the device memory to be used
	virtual void setDeviceMemory(void* memory) = 0;
};

So the two are much the same: context is the object that executes inference. In C++, inference is run with the execute or enqueue method, whereas in Python it is run with execute and execute_async.

bindings

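bindings holds, as ints, the device memory addresses of the input/output buffers, appended in allocate_buffers with bindings.append(int(device_mem)).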

inputs,outputs

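inputs and outputs are lists of HostDeviceMem objects. For example, the host of inputs[0] was set to the preprocessed image in the earlier step, while outputs, before inference has run, consists of three HostDeviceMem objects sized for the outputs, with values of 0.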

stream

stream is the stream created by cuda.Stream() in allocate_buffers. It comes from Stream.py, but it is not part of TensorRT: it comes from pycuda and is an indispensable part of working with CUDA.

Operations performed

def do_inference(context, bindings, inputs, outputs, stream, batch_size=1):
    # Transfer input data to the GPU.
    [cuda.memcpy_htod_async(inp.device, inp.host, stream) for inp in inputs]
    # Run inference.
    context.execute_async(batch_size=batch_size, bindings=bindings, stream_handle=stream.handle)
    # Transfer predictions back from the GPU.
    [cuda.memcpy_dtoh_async(out.host, out.device, stream) for out in outputs]
    # Synchronize the stream
    stream.synchronize()
    # Return only the host outputs.
    return [out.host for out in outputs]

cuda.memcpy_htod_async()

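The Python side has no precise comments for this function, but the NVIDIA CUDA Library Documentation describes a similar function: cuMemcpyHtoD.
It copies data from host memory to device memory. dstDevice and srcHost are the base addresses of the destination and the source, and ByteCount specifies the number of bytes to copy.
So cuda.memcpy_htod_async copies the input data from host memory to device memory (handing it to the GPU); each HostDeviceMem element in inputs supplies the device and host buffers that are passed to the function.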
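The (destination, source, byte count) shape of cuMemcpyHtoD can be tried out on the CPU with ctypes.memmove. This is only an analogy under the assumption that two host arrays may stand in for inp.host and inp.device; no GPU memory is involved.

```python
import ctypes

host_src = (ctypes.c_float * 4)(0.5, 1.5, 2.5, 3.5)  # plays the role of inp.host
device_dst = (ctypes.c_float * 4)()                  # plays the role of inp.device

byte_count = ctypes.sizeof(host_src)                 # 4 floats * 4 bytes each
# memmove(dst, src, count) takes its arguments in the same order as cuMemcpyHtoD.
ctypes.memmove(ctypes.addressof(device_dst), ctypes.addressof(host_src), byte_count)

copied = list(device_dst)
print(byte_count, copied)
```

A device-to-host copy is the same call with the two buffers swapped.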

context.execute_async

As mentioned in the description of context above, this is the step that runs inference on the GPU; here it should be asynchronous.

cuda.memcpy_dtoh_async

By analogy with memcpy_htod_async above, there is a similar function, cuMemcpyDtoH, which copies the computed data from the device (GPU) back to host memory.

stream.synchronize()

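This is a method of the stream object and has to do with synchronization: it blocks until all work queued on the stream (the two copies and the inference) has finished, so the host output buffers are safe to read afterwards.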
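To get a feel for what synchronizing a stream means, here is a toy model in plain Python. ToyStream is a hypothetical class of my own, not pycuda's Stream: a background thread runs queued tasks in order, enqueue returns immediately like the *_async calls, and synchronize waits until the queue has drained.

```python
import queue
import threading


class ToyStream:
    """Toy model of a CUDA stream: work runs in order on a background thread."""

    def __init__(self):
        self._tasks = queue.Queue()
        worker = threading.Thread(target=self._run, daemon=True)
        worker.start()

    def _run(self):
        while True:
            task = self._tasks.get()
            task()
            self._tasks.task_done()

    def enqueue(self, task):
        # Returns immediately, like the *_async calls in do_inference.
        self._tasks.put(task)

    def synchronize(self):
        # Block the caller until every enqueued task has finished.
        self._tasks.join()


host_in, device_in, device_out, host_out = [1, 2, 3], [], [], []
stream = ToyStream()
stream.enqueue(lambda: device_in.extend(host_in))                    # host -> device copy
stream.enqueue(lambda: device_out.extend(x * 2 for x in device_in))  # stand-in for inference
stream.enqueue(lambda: host_out.extend(device_out))                  # device -> host copy
stream.synchronize()
print(host_out)  # [2, 4, 6]
```

Without the synchronize() call, host_out might still be empty when it is read, which is the situation stream.synchronize() rules out at the end of do_inference.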

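At the end of the function, the data in the host field of each HostDeviceMem in outputs is collected into a list and returned.

Of course, this is just one way of implementing do_inference in Python. The C++ side is messier, and I still need to keep studying it.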

TensorRT 5.1.5.0 in Practice: Exploring the do_inference Process (Python)
