When a model contains a custom op, that op has to be supported in both conversion stages: pytorch2onnx and onnx2tensorRT.

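Extending a custom op in the PyTorch-to-ONNX conversion

Workflow

As an example, we define a custom op called nonentity (its actual function is a fully connected layer, i.e. a Linear operation).

Define a custom PyTorch op, i.e. extend PyTorch.
Add a symbolic function to the custom op's logic so that torch.onnx can recognize the custom op.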

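Details

Defining a custom PyTorch op

Define a custom PyTorch op, i.e. extend PyTorch. For details, see the post "Pytorch1.1.0 getting started with custom ops (Python)".

Adding symbolic

That is, add a symbolic() function to the custom op's Function class. The complete custom op then looks like this.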

from torch.autograd import Function


class LinearFunction(Function):

    # beta and alpha serve no real purpose here; they only show that a custom op
    # can pass network parameters through the torch -> onnx conversion.
    @staticmethod
    def symbolic(g, self, mat1, mat2, beta, alpha):
        # The arguments mirror forward(): self is the input, mat1 the weight, mat2 the bias.
        return g.op("nonentity", self, mat1, mat2, beta_f=beta, alpha_f=alpha)

    @staticmethod
    def forward(ctx, input, weight, bias=None, beta_f=1.0, alpha_f=1.0):
        ctx.save_for_backward(input, weight, bias)
        ctx.beta = beta_f
        ctx.alpha = alpha_f

        # output = input @ weight^T (+ bias)
        output = input.mm(weight.t())
        if bias is not None:
            output += bias.unsqueeze(0).expand_as(output)
        return output

    @staticmethod
    def backward(ctx, grad_output):
        # saved_tensors replaces the deprecated saved_variables
        input, weight, bias = ctx.saved_tensors
        grad_input = grad_weight = grad_bias = None

        if ctx.needs_input_grad[0]:
            grad_input = grad_output.mm(weight)
        if ctx.needs_input_grad[1]:
            grad_weight = grad_output.t().mm(input)
        if bias is not None and ctx.needs_input_grad[2]:
            # Summing over the batch axis already gives the bias shape.
            grad_bias = grad_output.sum(0)
        # One gradient per forward() argument; beta_f and alpha_f get None.
        return grad_input, grad_weight, grad_bias, None, None
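
The gradient formulas used in backward() can be sanity-checked without PyTorch. The following is a minimal sketch in plain Python, assuming a scalar loss equal to the sum of all outputs (so grad_output is all ones). The helper names matmul, transpose, linear_forward and loss are hypothetical and exist only for this check.

```python
def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(m):
    """Swap rows and columns of a list-of-rows matrix."""
    return [list(col) for col in zip(*m)]


def linear_forward(inp, weight, bias):
    """output = input @ weight^T + bias, the same math as LinearFunction.forward."""
    out = matmul(inp, transpose(weight))
    return [[v + b for v, b in zip(row, bias)] for row in out]


def loss(inp, weight, bias):
    """Assumed scalar loss for this check: the sum of all outputs."""
    return sum(sum(row) for row in linear_forward(inp, weight, bias))


inp = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]      # batch=2, in_features=3
weight = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]]   # out_features=2, in_features=3
bias = [0.5, -0.5]

# d(loss)/d(output) is all ones because loss = sum(output)
grad_output = [[1.0, 1.0], [1.0, 1.0]]

# The same formulas as in backward()
grad_input = matmul(grad_output, weight)             # shape (2, 3)
grad_weight = matmul(transpose(grad_output), inp)    # shape (2, 3)
grad_bias = [sum(col) for col in zip(*grad_output)]  # shape (2,)

# Finite-difference estimate for one weight entry
eps = 1e-6
bumped = [row[:] for row in weight]
bumped[1][2] += eps
numeric = (loss(inp, bumped, bias) - loss(inp, weight, bias)) / eps

print(grad_weight[1][2], numeric)
```

The analytic value grad_weight[1][2] and the finite-difference estimate should agree up to floating-point error, which is a cheap way to catch a transposed operand in a hand-written backward().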

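symbolic can be thought of as defining the output specification for the pytorch->onnx step.
For reference, see the torch.onnx documentation:
URL ( https://segmentfault.com/p/1210000018097701/read )
Put simply, we are creating our own non-standard, non-ATen ONNX operator (op). The symbolic in my code is: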

    def symbolic(g, self, mat1, mat2, beta, alpha):
        return g.op("nonentity", self,mat1, mat2,  beta_f=beta, alpha_f=alpha)

The corresponding part of the exported ONNX graph looks like this:

...
  %19 : Float(64, 64, 3, 3) = onnx::MaxPool[kernel_shape=[2, 2], pads=[0, 0, 0, 0], strides=[2, 2]](%18), scope: Net_LinearFunction/Sequential[conv3]/MaxPool2d[2]
 %20 : Float(64, 576) = onnx::Flatten[axis=1](%19), scope: Net_LinearFunction
 %21 : Float(64, 128) = onnx::nonentity[alpha=1.3, beta=1.2](%20, %dense.0.weight, %dense.0.bias), scope: Net_LinearFunction/Sequential[dense]/Linear[0]
 %22 : Float(64, 128) = onnx::Relu(%21), scope: Net_LinearFunction/Sequential[dense]/ReLU[1]
 %23 : Float(64, 10) = onnx::nonentity[alpha=1.33, beta=1.22](%22, %dense.2.weight, %dense.2.bias), scope: Net_LinearFunction/Sequential[dense]/Linear[2]
 return (%23)

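%21 and %23 are both computed by my custom op "nonentity". The values in "[]" are the network parameters (attributes), and the values in "()" are the inputs: the previous layer's output followed by the weights and bias.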
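
To make the "[]" versus "()" distinction concrete, here is a small sketch that splits one line of the printed graph into op name, attributes and inputs. It is only a text parser for the printout shown in this post, not part of torch.onnx. The names NODE_RE, parse_node and attr_name are hypothetical, and only scalar numeric attributes are handled. It also illustrates the g.op keyword convention, where the type suffix (_f for float here) is dropped from the attribute name in the exported graph.

```python
import re

# Matches lines such as:
#   %21 : Float(64, 128) = onnx::nonentity[alpha=1.3, beta=1.2](%20, %w, %b), scope: ...
NODE_RE = re.compile(
    r"%(?P<out>\w+) : \w+\((?P<shape>[^)]*)\) = "
    r"onnx::(?P<op>\w+)(?:\[(?P<attrs>.*?)\])?\((?P<inputs>[^)]*)\)"
)


def parse_node(line):
    """Split one printed graph line into output, shape, op, attributes and inputs."""
    m = NODE_RE.search(line)
    if m is None:
        raise ValueError("not a graph node line: %r" % line)
    attrs = {}
    if m.group("attrs"):
        # Only scalar attributes (name=number) are handled in this sketch.
        for name, value in re.findall(r"(\w+)=(-?\d+(?:\.\d+)?)", m.group("attrs")):
            attrs[name] = float(value)
    return {
        "output": "%" + m.group("out"),
        "shape": tuple(int(s) for s in m.group("shape").split(",")),
        "op": m.group("op"),
        "attrs": attrs,
        "inputs": [s.strip() for s in m.group("inputs").split(",") if s.strip()],
    }


def attr_name(kwarg):
    """'beta_f' -> ('beta', 'f'): g.op keyword arguments carry a type suffix."""
    name, _, kind = kwarg.rpartition("_")
    return name, kind


line = ("%21 : Float(64, 128) = onnx::nonentity[alpha=1.3, beta=1.2]"
        "(%20, %dense.0.weight, %dense.0.bias), "
        "scope: Net_LinearFunction/Sequential[dense]/Linear[0]")
node = parse_node(line)
print(node["op"], node["attrs"], node["inputs"])

relu = parse_node("%22 : Float(64, 128) = onnx::Relu(%21), scope: x")
```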

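Extending a custom op in the ONNX-to-TensorRT conversion

Workflow

As an example, we again use the custom op called nonentity.

Download the official onnx-tensorrt source code.
Following InstanceNormalization.cpp/.h, write your own nonentity.hpp and nonentity.cpp. (FancyActivation, ResizeNearest and others can also serve as references; they are official examples of custom ops and contain the op logic.)
In builtin_op_importers.cpp, use DEFINE_BUILTIN_OP_IMPORTER to add an importer for the op you registered.
In CMakeLists.txt, add your nonentity.cpp under set(IMPORTER_SOURCES... .
Rebuild your onnx-tensorRT following the tutorial.

Then test with the ONNX file you exported. The nonentity layer can now be read correctly, and the output is as follows: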

boyun@boyun-MS-7B90:~/workspace/onnx-tensorrt-master$ onnx2trt ./onnx/customer_op_FC.onnx -v
----------------------------------------------------------------
Input filename:   ./onnx/customer_op_FC.onnx
ONNX IR version:  0.0.4
Opset version:    9
Producer name:    pytorch
Producer version: 1.1
Domain:           
Model version:    0
Doc string:       
----------------------------------------------------------------
WARNING: ONNX model has a newer ir_version (0.0.4) than this parser was built against (0.0.3).
Parsing model
[2019-08-13 06:15:48    INFO] 11:Conv -> (32, 28, 28)
[2019-08-13 06:15:48    INFO] 12:Relu -> (32, 28, 28)
[2019-08-13 06:15:48    INFO] 13:MaxPool -> (32, 14, 14)
[2019-08-13 06:15:48    INFO] 14:Conv -> (64, 14, 14)
[2019-08-13 06:15:48    INFO] 15:Relu -> (64, 14, 14)
[2019-08-13 06:15:48    INFO] 16:MaxPool -> (64, 7, 7)
[2019-08-13 06:15:48    INFO] 17:Conv -> (64, 7, 7)
[2019-08-13 06:15:48    INFO] 18:Relu -> (64, 7, 7)
[2019-08-13 06:15:48    INFO] 19:MaxPool -> (64, 3, 3)
[2019-08-13 06:15:48    INFO] 20:Flatten -> (576)
[2019-08-13 06:15:48    INFO] 21:nonentity -> (576)
[2019-08-13 06:15:48    INFO] 22:Relu -> (576)
[2019-08-13 06:15:48    INFO] 23:nonentity -> (576)
All done

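Because I implemented the internal logic of nonentity as Normalize, the dimensions no longer change after it, which also shows that the layer logic was read.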
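
The shapes in the log can be reproduced with a little arithmetic. This is a sketch under the assumptions visible in the printed graph (3x3 convolutions with padding 1, 2x2 max pooling with stride 2, 28x28 input). The helper name conv_out is hypothetical.

```python
def conv_out(size, kernel, stride=1, pad=0):
    """Output size of a convolution or pooling window along one spatial axis."""
    return (size + 2 * pad - kernel) // stride + 1


h = 28
shapes = []
for channels in (32, 64, 64):
    h = conv_out(h, kernel=3, stride=1, pad=1)  # Conv 3x3, pad 1: size unchanged
    h = conv_out(h, kernel=2, stride=2, pad=0)  # MaxPool 2x2, stride 2
    shapes.append((channels, h, h))

# Flatten multiplies channels, height and width of the last pooled tensor
flat = shapes[-1][0] * shapes[-1][1] * shapes[-1][2]

# A real Linear changes the feature count; a shape-preserving op does not
as_linear = [flat, 128, 10]
as_normalize = [flat, flat, flat]

print(shapes, flat, as_linear, as_normalize)
```

With a true Linear implementation the two nonentity layers would report 128 and 10 features, as in the PyTorch graph. The 576 reported by onnx2trt is what a shape-preserving plugin produces.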

Details

Custom ops in onnx-TensorRT are written with IPluginV2 (NvInfer.h).

A detailed look at the custom op InstanceNormalization

InstanceNormalizationPlugin.hpp

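It contains three parts: a namespace, InstanceNormalizationPlugin, and InstanceNormalizationPluginCreator.

namespace
Mainly defines the plugin's version and the plugin's name.
InstanceNormalizationPlugin
Inherits from onnx2trt::PluginV2.
onnx2trt::PluginV2 in turn inherits from nvinfer1::IPluginV2 in NvInfer.h.
For an introduction to the structure of IPluginV2, see the blog post "IPluginV2".
InstanceNormalizationPluginCreator
Inherits from nvinfer1::IPluginCreator in NvInfer.h. This base class works together with the PluginV2Ext class to register and use a custom op (layer). For an introduction, see "IPluginCreator".

Note that the second item above says the plugin inherits from nvinfer1::IPluginV2, while the third says the creator works with nvinfer1::IPluginV2Ext. This is because 5.1.x.x added several new methods compared with 5.0.x.x. They are declared in IPluginV2Ext, which inherits from IPluginV2, and the latest IPluginV2Ext is officially supported.
The IPluginCreator functions are implemented in essentially the same way in every custom op; getPluginName and getPluginVersion only need to return the corresponding values.

!!! I suggest making onnx2trt::PluginV2 inherit from IPluginV2Ext instead. In other words, the way the official onnx-tensorrt is written is already somewhat outdated.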

InstanceNormalizationPlugin.cpp

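This is the implementation of the header, and it is much the same as the custom layer logic in Caffe.
It likewise has two constructors, one for the build stage and one for the runtime stage. The remaining functions each do their own job and need no further comment.
The part worth understanding is the core function, enqueue():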

int InstanceNormalizationPlugin::enqueue(int batchSize,
                                         const void *const *inputs, void **outputs,
                                         void *workspace, cudaStream_t stream) {
  assert(_initialized);
  nvinfer1::Dims input_dims = this->getInputDims(0);
  int n = batchSize;
  int c = input_dims.d[0];
  int h = input_dims.d[1];
  int w = input_dims.d[2];
  CHECK_CUDNN(cudnnSetTensor4dDescriptor(_b_desc, CUDNN_TENSOR_NCHW, CUDNN_DATA_FLOAT, 1, n*c, 1, 1));
  cudnnDataType_t cudnn_dtype;
  CHECK_CUDNN(convert_trt2cudnn_dtype(this->getDataType(), &cudnn_dtype));
  CHECK_CUDNN(cudnnSetTensor4dDescriptor(_x_desc, CUDNN_TENSOR_NCHW, cudnn_dtype, 1, n*c, h, w));
  CHECK_CUDNN(cudnnSetTensor4dDescriptor(_y_desc, CUDNN_TENSOR_NCHW, cudnn_dtype, 1, n*c, h, w));
  float alpha = 1;
  float beta  = 0;
  void const* x_ptr = inputs[0];
  void*       y_ptr = outputs[0];
  CHECK_CUDNN(cudnnSetStream(_cudnn_handle, stream));
  // Note: Use of CUDNN_BATCHNORM_SPATIAL_PERSISTENT can cause numerical
  //       overflows (NaNs) for fp32 data in some circumstances. The lower-
  //       performance CUDNN_BATCHNORM_SPATIAL should be used if this is not
  //       acceptable.
  CHECK_CUDNN(
      cudnnBatchNormalizationForwardTraining(
          _cudnn_handle, CUDNN_BATCHNORM_SPATIAL_PERSISTENT, &alpha, &beta,
          _x_desc, x_ptr, _y_desc, y_ptr, _b_desc, _d_scale, _d_bias,
          1., nullptr, nullptr, _epsilon, nullptr, nullptr));
  return 0;
}

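Why CHECK_CUDNN is needed
Every cuDNN function returns an error code of type cudnnStatus_t. On success it returns CUDNN_STATUS_SUCCESS. For any other error, cudnnGetErrorString(status) returns the specific error message.
CHECK_CUDNN is what checks that error code.
cudnnSetTensor4dDescriptor()
Constructs an input or output descriptor that cuDNN can use.
A convolution mainly involves three arguments: input, output and weights. The three steps for constructing an input or output descriptor are as follows: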

cudnnTensorDescriptor_t input_descriptor;
cudnnCreateTensorDescriptor(&input_descriptor);
cudnnSetTensor4dDescriptor(input_descriptor,
                                      /*format=*/CUDNN_TENSOR_NHWC,
                                      /*dataType=*/CUDNN_DATA_FLOAT,
                                      /*batch_size=*/1,
                                      /*channels=*/3,
                                      /*image_height=*/image.rows,
                                      /*image_width=*/image.cols);

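In other words, create a tensor descriptor and then set its attributes.
In this InstanceNormalization custom op:
(1) When the class is constructed, it declares _x_desc, _y_desc and _b_desc of type cudnnTensorDescriptor_t.
(2) initialize() in the cpp calls cudnnCreateTensorDescriptor.
(3) enqueue() in the cpp calls cudnnSetTensor4dDescriptor.
With that, the tensors needed for the computation are ready.

convert_trt2cudnn_dtype()
Selects the precision: half or float.
cudnnSetStream()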

cudnnStatus_t cudnnSetStream(cudnnHandle_t handle, cudaStream_t streamId)

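This function sets the user's CUDA stream in the cuDNN handle. The new stream is used to launch cuDNN GPU kernels, or to synchronize to this stream when cuDNN kernels are launched in internal streams. If the cuDNN library stream is not set, all kernels use the default (NULL) stream. Setting the user stream in the cuDNN handle guarantees issue-order execution of cuDNN calls and other GPU kernels launched in the same stream.
handle: pointer to the cuDNN handle
streamId: the new CUDA stream

cudnnBatchNormalizationForwardTraining()
This function performs the forward BatchNormalization layer computation for the training phase.
Well, this step is a little disappointing... it means that, once again, no source code is available.
To find out exactly what this function does, see the cuDNN Developer Guide.

So that sums up the whole custom logic of InstanceNormalization.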
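
Although the cuDNN kernel itself is closed, the enqueue() code shows the trick being used: the descriptors describe the (n, c, h, w) input as a tensor of shape (1, n*c, h, w), so a training-mode batch normalization over it normalizes each (sample, channel) plane separately, which is instance normalization. The following plain-Python sketch checks that equivalence. It assumes scale 1 and bias 0 and a biased variance, and the function names batch_norm_train and instance_norm are hypothetical reference implementations, not cuDNN calls.

```python
import math


def batch_norm_train(x, eps):
    """Spatial batch norm in training mode on a nested-list tensor x[N][C][H][W].

    Each channel is normalized with the mean and (biased) variance taken over
    N, H and W. Scale is 1 and bias is 0 in this sketch.
    """
    n_batch, n_chan = len(x), len(x[0])
    out = [[None] * n_chan for _ in range(n_batch)]
    for ch in range(n_chan):
        values = [v for b in range(n_batch) for row in x[b][ch] for v in row]
        mean = sum(values) / len(values)
        var = sum((v - mean) ** 2 for v in values) / len(values)
        inv_std = 1.0 / math.sqrt(var + eps)
        for b in range(n_batch):
            out[b][ch] = [[(v - mean) * inv_std for v in row] for row in x[b][ch]]
    return out


def instance_norm(x, eps):
    """Reference instance norm: every (sample, channel) plane is normalized alone."""
    out = []
    for sample in x:
        planes = []
        for plane in sample:
            values = [v for row in plane for v in row]
            mean = sum(values) / len(values)
            var = sum((v - mean) ** 2 for v in values) / len(values)
            inv_std = 1.0 / math.sqrt(var + eps)
            planes.append([[(v - mean) * inv_std for v in row] for row in plane])
        out.append(planes)
    return out


# x has shape (n=2, c=2, h=2, w=2)
x = [
    [[[1.0, 2.0], [3.0, 4.0]], [[0.0, 1.0], [0.0, 1.0]]],
    [[[5.0, 5.0], [7.0, 9.0]], [[2.0, 4.0], [6.0, 8.0]]],
]
eps = 1e-5

# View the same data as (1, n*c, h, w), as the descriptors in enqueue() do
folded = [[plane for sample in x for plane in sample]]
via_batch_norm = batch_norm_train(folded, eps)[0]  # list of n*c planes
reference = [plane for sample in instance_norm(x, eps) for plane in sample]

max_diff = max(
    abs(a - b)
    for p, q in zip(via_batch_norm, reference)
    for r, s in zip(p, q)
    for a, b in zip(r, s)
)
print(max_diff)
```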

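Understanding builtin_op_importers.cpp

This article is worth reading here; its author explains things very clearly: "Onnx-tensorrt explained: the nvonnxparser library".

Workflow

Convert the ONNX input data into the data format TRT requires
Build the TRT layer
Compute the TRT output

Open this cpp and you can see the import logic for every op. Take DEFINE_BUILTIN_OP_IMPORTER(Conv) as an example: when an ONNX Conv operation is detected, it is handled as follows: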

//************ Convert the ONNX input data into the data format TRT requires ************
nvinfer1::ITensor* tensor_ptr = &inputs.at(0).tensor();
// The ONNX node's inputs are inputs=['x','W']; the weights are converted into the format the TRT model needs
auto kernel_weights = inputs.at(1).weights();
int noutput = kernel_weights.shape.d[0];

//************ Build the TRT layer ************
// At this point the ONNX layer has been converted into a TRT layer; ctx (short for context) holds the TRT network.
// For comparison, the example of adding a convolution layer from the official TRT documentation:
//   IConvolutionLayer* conv1 = network->addConvolution(*scale_1->getOutput(0), 20, DimsHW{5, 5}, mWeightMap["conv1filter"], mWeightMap["conv1bias"]);
nvinfer1::IConvolutionLayer* layer = ctx->network()->addConvolution(*tensor_ptr, noutput, kernel_size, kernel_weights, bias_weights);

//************ Compute the TRT output ************
tensor_ptr = layer->getOutput(0); // take the TRT layer's output tensor, which is returned as the output
return {{tensor_ptr}}; // return the output tensor Y

So when creating a custom op, adding the import logic here is necessary.
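
The registration-and-dispatch idea behind DEFINE_BUILTIN_OP_IMPORTER can be sketched in a few lines of Python. This is a toy model under stated assumptions, not the parser's actual code: OP_IMPORTERS, define_op_importer, ToyNetwork and import_node are all hypothetical names, and the "network" only records what was added.

```python
OP_IMPORTERS = {}  # hypothetical registry: ONNX op type -> importer function


def define_op_importer(op_type):
    """Decorator playing the role of DEFINE_BUILTIN_OP_IMPORTER in this sketch."""
    def register(func):
        OP_IMPORTERS[op_type] = func
        return func
    return register


class ToyNetwork:
    """Stand-in for the TRT network: it just records the layers that were added."""

    def __init__(self):
        self.layers = []

    def add_layer(self, kind, inputs, **params):
        name = "%s_%d" % (kind, len(self.layers))
        self.layers.append((name, kind, inputs, params))
        return name  # used as the name of the layer's output tensor


@define_op_importer("Relu")
def import_relu(network, node, inputs):
    return [network.add_layer("activation", inputs[:1], type="relu")]


@define_op_importer("nonentity")
def import_nonentity(network, node, inputs):
    # 1. Convert the ONNX inputs: the first is the tensor, the rest are weights
    tensor, weights = inputs[0], inputs[1:]
    # 2. Build the layer (a plugin layer in a real importer)
    out = network.add_layer("plugin", [tensor], weights=weights, **node["attrs"])
    # 3. Return the output tensor(s)
    return [out]


def import_node(network, node, inputs):
    """Look up the importer for the node's op type and run it."""
    importer = OP_IMPORTERS.get(node["op"])
    if importer is None:
        raise KeyError("no importer registered for op: " + node["op"])
    return importer(network, node, inputs)


net = ToyNetwork()
node = {"op": "nonentity", "attrs": {"alpha": 1.3, "beta": 1.2}}
out = import_node(net, node, ["%20", "W", "b"])
print(out, net.layers)
```

An op without a registered importer fails at lookup time, which is why the custom op must be added to the importer file before the parser can read the model.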

Reading network structure parameters, weights and bias

An ONNX file can be visualized. For example, an MNIST network with a custom layer looks like this:

  %11 : Float(64, 32, 28, 28) = onnx::Conv[dilations=[1, 1], group=1, kernel_shape=[3, 3], pads=[1, 1, 1, 1], strides=[1, 1]](%0, %conv1.0.weight, %conv1.0.bias), scope: Net_LinearFunction/Sequential[conv1]/Conv2d[0]
  %12 : Float(64, 32, 28, 28) = onnx::Relu(%11), scope: Net_LinearFunction/Sequential[conv1]/ReLU[1]
  %13 : Float(64, 32, 14, 14) = onnx::MaxPool[kernel_shape=[2, 2], pads=[0, 0, 0, 0], strides=[2, 2]](%12), scope: Net_LinearFunction/Sequential[conv1]/MaxPool2d[2]
  %14 : Float(64, 64, 14, 14) = onnx::Conv[dilations=[1, 1], group=1, kernel_shape=[3, 3], pads=[1, 1, 1, 1], strides=[1, 1]](%13, %conv2.0.weight, %conv2.0.bias), scope: Net_LinearFunction/Sequential[conv2]/Conv2d[0]
  %15 : Float(64, 64, 14, 14) = onnx::Relu(%14), scope: Net_LinearFunction/Sequential[conv2]/ReLU[1]
  %16 : Float(64, 64, 7, 7) = onnx::MaxPool[kernel_shape=[2, 2], pads=[0, 0, 0, 0], strides=[2, 2]](%15), scope: Net_LinearFunction/Sequential[conv2]/MaxPool2d[2]
  %17 : Float(64, 64, 7, 7) = onnx::Conv[dilations=[1, 1], group=1, kernel_shape=[3, 3], pads=[1, 1, 1, 1], strides=[1, 1]](%16, %conv3.0.weight, %conv3.0.bias), scope: Net_LinearFunction/Sequential[conv3]/Conv2d[0]
  %18 : Float(64, 64, 7, 7) = onnx::Relu(%17), scope: Net_LinearFunction/Sequential[conv3]/ReLU[1]
  %19 : Float(64, 64, 3, 3) = onnx::MaxPool[kernel_shape=[2, 2], pads=[0, 0, 0, 0], strides=[2, 2]](%18), scope: Net_LinearFunction/Sequential[conv3]/MaxPool2d[2]
  %20 : Float(64, 576) = onnx::Flatten[axis=1](%19), scope: Net_LinearFunction
  %21 : Float(64, 128) = onnx::nonentity(%20, %dense.0.weight, %dense.0.bias), scope: Net_LinearFunction/Sequential[dense]/Linear[0]
  %22 : Float(64, 128) = onnx::Relu(%21), scope: Net_LinearFunction/Sequential[dense]/ReLU[1]
  %23 : Float(64, 10) = onnx::nonentity(%22, %dense.2.weight, %dense.2.bias), scope: Net_LinearFunction/Sequential[dense]/Linear[2]
  return (%23)

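As shown above, the "[]" after the op name on each line holds the network structure parameters, and the "()" holds the "%n" of the previous layer, the "weights" and the "bias".
So when TensorRT reads the ONNX file, it reads the data according to the following logic.

Network structure parameters
Comparing the two shows that the network structure parameters of the Conv operation are read with the get_kernel_params method.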

    get_kernel_params(node, get_DimsHW_from_CHW(dims), &kernel_size,
        &strides, &beg_padding, &end_padding, paddingMode, &dilations);

weights and bias
The weights and bias of the Conv operation, on the other hand, are read through inputs.at. (inputs.at(0) is a data structure representing the previous layer.)

    ASSERT(inputs.at(0).is_tensor(),  ErrorCode::kUNSUPPORTED_NODE);
    ASSERT(inputs.at(1).is_weights(), ErrorCode::kUNSUPPORTED_NODE);
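
The two ASSERT lines check the kind of each positional input before it is used. A toy Python version of the same idea follows, assuming an input layout of (data, weight, optional bias) as in the nonentity node. The class below is a simplified stand-in written for this post, and read_linear_like_inputs is a hypothetical helper.

```python
class TensorOrWeights:
    """Toy input wrapper: an input is either a tensor produced by a previous
    layer or constant weights stored in the model."""

    def __init__(self, value, kind):
        if kind not in ("tensor", "weights"):
            raise ValueError("kind must be 'tensor' or 'weights'")
        self.value = value
        self.kind = kind

    def is_tensor(self):
        return self.kind == "tensor"

    def is_weights(self):
        return self.kind == "weights"


def read_linear_like_inputs(inputs):
    """Validate the layout (data, weight, optional bias) and return the parts."""
    if not inputs[0].is_tensor():
        raise TypeError("input 0 must be a tensor")
    if not inputs[1].is_weights():
        raise TypeError("input 1 must be weights")
    bias = inputs[2].value if len(inputs) > 2 else None
    return inputs[0].value, inputs[1].value, bias


inputs = [
    TensorOrWeights("%20", "tensor"),
    TensorOrWeights("dense.0.weight", "weights"),
    TensorOrWeights("dense.0.bias", "weights"),
]
data, weight, bias = read_linear_like_inputs(inputs)
print(data, weight, bias)
```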

Finally, test whether the two pipelines fit together

Testing directly with the onnx2trt command gives the following result:

boyun@boyun-MS-7B90:~/workspace/onnx-tensorrt-master$ onnx2trt ./onnx/customer_op_FC.onnx  -v
----------------------------------------------------------------
Input filename:   ./onnx/customer_op_FC.onnx
ONNX IR version:  0.0.4
Opset version:    9
Producer name:    pytorch
Producer version: 1.1
Domain:           
Model version:    0
Doc string:       
----------------------------------------------------------------
WARNING: ONNX model has a newer ir_version (0.0.4) than this parser was built against (0.0.3).
Parsing model
[2019-08-14 07:30:02    INFO] 11:Conv -> (32, 28, 28)
[2019-08-14 07:30:02    INFO] 12:Relu -> (32, 28, 28)
[2019-08-14 07:30:02    INFO] 13:MaxPool -> (32, 14, 14)
[2019-08-14 07:30:02    INFO] 14:Conv -> (64, 14, 14)
[2019-08-14 07:30:02    INFO] 15:Relu -> (64, 14, 14)
[2019-08-14 07:30:02    INFO] 16:MaxPool -> (64, 7, 7)
[2019-08-14 07:30:02    INFO] 17:Conv -> (64, 7, 7)
[2019-08-14 07:30:02    INFO] 18:Relu -> (64, 7, 7)
[2019-08-14 07:30:02    INFO] 19:MaxPool -> (64, 3, 3)
[2019-08-14 07:30:02    INFO] 20:Flatten -> (576)
[2019-08-14 07:30:02    INFO] 21:nonentity -> (576)
[2019-08-14 07:30:02    INFO] 22:Relu -> (576)
[2019-08-14 07:30:02    INFO] 23:nonentity -> (576)
All done

TensorRT5.1.5.0 in practice: custom ops in onnx-TensorRT
