2020-01-01 First version
2020-01-10 Switched the VGG structure to torchvision.models.vgg; updated the code
1. Loading the Weights and Building the Network
Start from the official python_samples that ship with TRT. Note that those samples target TRT 6.0 while TRT has since moved on to 7.0; the Release Notes show no changes to the relevant APIs between 6.0 and 7.0, so there is no cause for concern. Also, since you have to download the whole TRT package before you can read the Python API documentation bundled with the samples, the link I give here is to my own repository. NVIDIA does publish C++ API documentation on GitHub, see Building a Simple MNIST Network Layer by Layer, but this article builds everything with the Python API, so the C++ API will not come up again.
1.1 Analyzing the Source Code
python_samples/network_api_pytorch_mnist contains README.md, model.py, sample.py, and requirements.txt. Obviously, the two files to dissect are model.py and sample.py: model.py builds the MNIST network with PyTorch, while sample.py builds it with the TRT API. The former covers both training and testing; the latter does inference only, which is why it skips the F.log_softmax operation. Below are the core snippets I extracted; anyone familiar with PyTorch will read them at a glance:
model.py

```python
import torch.nn as nn
import torch.nn.functional as F

# Network
class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        self.conv1 = nn.Conv2d(1, 20, kernel_size=5)
        self.conv2 = nn.Conv2d(20, 50, kernel_size=5)
        self.conv2_drop = nn.Dropout2d()
        self.fc1 = nn.Linear(800, 500)
        self.fc2 = nn.Linear(500, 10)

    def forward(self, x):
        x = F.max_pool2d(self.conv1(x), kernel_size=2, stride=2)
        x = F.max_pool2d(self.conv2(x), kernel_size=2, stride=2)
        x = x.view(-1, 800)
        x = F.relu(self.fc1(x))
        x = self.fc2(x)
        return F.log_softmax(x, dim=1)
```
sample.py

```python
import tensorrt as trt

def populate_network(network, weights):
    # Configure the network layers based on the weights provided.
    # Mark the network input
    input_tensor = network.add_input(name=ModelData.INPUT_NAME, dtype=ModelData.DTYPE, shape=ModelData.INPUT_SHAPE)
    # Corresponds to self.conv1 in PyTorch
    conv1_w = weights['conv1.weight'].numpy()
    conv1_b = weights['conv1.bias'].numpy()
    conv1 = network.add_convolution(input=input_tensor, num_output_maps=20, kernel_shape=(5, 5), kernel=conv1_w, bias=conv1_b)
    conv1.stride = (1, 1)
    # Corresponds to F.max_pool2d in PyTorch
    pool1 = network.add_pooling(input=conv1.get_output(0), type=trt.PoolingType.MAX, window_size=(2, 2))
    pool1.stride = (2, 2)
    # Corresponds to self.conv2 in PyTorch
    conv2_w = weights['conv2.weight'].numpy()
    conv2_b = weights['conv2.bias'].numpy()
    conv2 = network.add_convolution(pool1.get_output(0), 50, (5, 5), conv2_w, conv2_b)
    conv2.stride = (1, 1)
    # Corresponds to F.max_pool2d in PyTorch
    pool2 = network.add_pooling(conv2.get_output(0), trt.PoolingType.MAX, (2, 2))
    pool2.stride = (2, 2)
    # Corresponds to self.fc1 in PyTorch
    fc1_w = weights['fc1.weight'].numpy()
    fc1_b = weights['fc1.bias'].numpy()
    fc1 = network.add_fully_connected(input=pool2.get_output(0), num_outputs=500, kernel=fc1_w, bias=fc1_b)
    # Corresponds to F.relu in PyTorch
    relu1 = network.add_activation(input=fc1.get_output(0), type=trt.ActivationType.RELU)
    # Corresponds to self.fc2 in PyTorch
    fc2_w = weights['fc2.weight'].numpy()
    fc2_b = weights['fc2.bias'].numpy()
    fc2 = network.add_fully_connected(relu1.get_output(0), ModelData.OUTPUT_SIZE, fc2_w, fc2_b)
    # Name this layer's output tensor
    fc2.get_output(0).name = ModelData.OUTPUT_NAME
    # Mark the network output
    network.mark_output(tensor=fc2.get_output(0))
```
In populate_network from sample.py, network is the network definition being built up (the function's product) and weights is the input, corresponding to Net.state_dict() from model.py; note that weights is loaded on the CPU.
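For reference, here is a minimal sketch of how such a weights dict can be produced (the training loop is elided; the checkpoint filename is hypothetical):

```python
import torch

model = Net()
# ... train the model here ...
torch.save(model.state_dict(), 'mnist.pth')            # hypothetical filename
weights = torch.load('mnist.pth', map_location='cpu')  # keep every tensor on the CPU
```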
Lining the two up: on the PyTorch side, the input x goes through the conv1 convolution, F.max_pool2d pooling, the conv2 convolution, F.max_pool2d pooling again, a view(-1, 800) flattening, fc1 followed by a relu activation, the fc2 fully connected layer, and finally F.log_softmax, which turns the scores into a log-probability distribution. On the TRT side, the whole chain has to behave identically, except that TRT never trains, so log_softmax can be dropped. Comparing the two yields the table below, straightforward if a little verbose:
PyTorch Operators | TRT API Operators
---|---
`self.conv1 = nn.Conv2d(1, 20, kernel_size=5)` | `conv1_w = weights['conv1.weight'].numpy()`<br>`conv1_b = weights['conv1.bias'].numpy()`<br>`conv1 = network.add_convolution(input=input_tensor, num_output_maps=20, kernel_shape=(5, 5), kernel=conv1_w, bias=conv1_b)`<br>`conv1.stride = (1, 1)`
`F.max_pool2d(self.conv1(x), kernel_size=2, stride=2)` | `pool1 = network.add_pooling(input=conv1.get_output(0), type=trt.PoolingType.MAX, window_size=(2, 2))`<br>`pool1.stride = (2, 2)`
`self.conv2 = nn.Conv2d(20, 50, kernel_size=5)` | `conv2_w = weights['conv2.weight'].numpy()`<br>`conv2_b = weights['conv2.bias'].numpy()`<br>`conv2 = network.add_convolution(pool1.get_output(0), 50, (5, 5), conv2_w, conv2_b)`<br>`conv2.stride = (1, 1)`
`F.max_pool2d(self.conv2(x), kernel_size=2, stride=2)` | `pool2 = network.add_pooling(conv2.get_output(0), trt.PoolingType.MAX, (2, 2))`<br>`pool2.stride = (2, 2)`
`self.fc1 = nn.Linear(800, 500)` | `fc1_w = weights['fc1.weight'].numpy()`<br>`fc1_b = weights['fc1.bias'].numpy()`<br>`fc1 = network.add_fully_connected(input=pool2.get_output(0), num_outputs=500, kernel=fc1_w, bias=fc1_b)`
`F.relu(self.fc1(x))` | `relu1 = network.add_activation(input=fc1.get_output(0), type=trt.ActivationType.RELU)`
`self.fc2 = nn.Linear(500, 10)` | `fc2_w = weights['fc2.weight'].numpy()`<br>`fc2_b = weights['fc2.bias'].numpy()`<br>`fc2 = network.add_fully_connected(relu1.get_output(0), ModelData.OUTPUT_SIZE, fc2_w, fc2_b)`
`F.log_softmax(x, dim=1)` | THERE IS NO NEED (not required at inference time)
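Dropping log_softmax is safe for classification because it is a monotonic transform, so the predicted class never changes. A quick numpy check (my own illustration, not from the sample):

```python
import numpy as np

logits = np.array([1.0, 3.0, 2.0])
log_probs = logits - np.log(np.exp(logits).sum())  # log_softmax by definition
# The transform preserves ordering, so the argmax prediction is unchanged
assert logits.argmax() == log_probs.argmax()
```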
Note: for the TRT API, network is a tensorrt.INetworkDefinition. Its network.add_xxx member functions append layers to the network; each layer type derives from the base class tensorrt.ILayer and carries its own behavior. These layers are all TRT classes, and a layer's .get_output(0) only hands back a tensorrt.ITensor; you cannot print that tensor's contents while the network is being built. This is TRT's build time versus run time distinction, covered in the dynamic-shapes chapter of the TensorRT Developer Guide. If you need to debug individual layers, the best you can do at build time is print xxx_layer.get_output(0).shape, or inspect results at run time after the build completes; peeking inside the engine is impossible.
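Concretely, a build-time "debugging" session is limited to static metadata like this (a sketch; conv1 is any layer already added to the network):

```python
print(conv1.get_output(0).shape)  # (20, 24, 24) for the MNIST conv1 above
print(conv1.get_output(0).dtype)  # DataType.FLOAT
# The tensor's values do not exist yet; they only appear at run time,
# after the engine has been built and executed.
```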
In short, working with the TRT API comes down to five things:
1. add_input
2. the add_xxx layer methods
3. get_output(0)
4. mark_output
5. a clear mental picture of the network structure

If you have all five under your belt, congratulations: you have cleared the TensorRT entry bar.
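Schematically, the five ingredients line up like this (a fragment, assuming network is an INetworkDefinition and w, b are numpy weight arrays):

```python
# w, b: numpy weight/bias arrays (assumed given)
inp = network.add_input("data", trt.float32, (1, 28, 28))                     # 1. add_input
conv = network.add_convolution(inp, 20, (5, 5), w, b)                         # 2. add_xxx layers
pool = network.add_pooling(conv.get_output(0), trt.PoolingType.MAX, (2, 2))   # 3. get_output(0) chains them
network.mark_output(pool.get_output(0))                                       # 4. mark_output
# 5. knowing the target architecture tells you which layers to add, and in what order
```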
1.2 Building the Network
So how does this work for VGG? First you have to know the network's makeup inside out. Taking VGG16 as an example, it looks like the figure below:
In more detail, the layer parameters are as listed in the table below; the structure here is the VGG16 provided by PyTorch's torchvision.models.vgg:
VGG16 Config |
---|
Conv-3x3-64-strd1-pad1 |
Relu |
Conv-3x3-64-strd1-pad1 |
Relu |
Maxpool-2x2-strd2-pad0 |
Conv-3x3-128-strd1-pad1 |
Relu |
Conv-3x3-128-strd1-pad1 |
Relu |
Maxpool-2x2-strd2-pad0 |
Conv-3x3-256-strd1-pad1 |
Relu |
Conv-3x3-256-strd1-pad1 |
Relu |
Conv-3x3-256-strd1-pad1 |
Relu |
Maxpool-2x2-strd2-pad0 |
Conv-3x3-512-strd1-pad1 |
Relu |
Conv-3x3-512-strd1-pad1 |
Relu |
Conv-3x3-512-strd1-pad1 |
Relu |
Maxpool-2x2-strd2-pad0 |
Conv-3x3-512-strd1-pad1 |
Relu |
Conv-3x3-512-strd1-pad1 |
Relu |
Conv-3x3-512-strd1-pad1 |
Relu |
Maxpool-2x2-strd2-pad0 |
Avgpool-1x1-strd1-pad0 |
FC-4096 |
Relu |
FC-4096 |
Relu |
FC-1000 |
In the torchvision.models.vgg source for VGG16, between the features block and the classifier block sits an avgpool block containing PyTorch's built-in AdaptiveAvgPool2d layer. I replace it with an ordinary AvgPool; the parameter choice follows the earlier article PyTorch2ONNX2TensorRT 踩坑日誌之5. 使用AvgPooling替換AdaptivePooling, and a quick sanity check of the substitution is sketched right below. The source also attaches Dropout after the fully connected layers; that exists only to curb overfitting during training and is unnecessary at inference time, so I drop it entirely. All told we need add_convolution 13 times, add_fully_connected 3 times, add_activation 15 times (13 after convolutions, 2 after fully connected layers), and add_pooling 6 times; the python_samples pattern handles all of it.
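To convince yourself the AvgPool substitution is legitimate, here is a quick PyTorch check (my own sanity test; it only holds because the input size is fixed at 224x224, which makes the features output exactly 7x7):

```python
import torch
import torch.nn as nn

x = torch.randn(1, 512, 7, 7)                  # features output for a 224x224 input
adaptive = nn.AdaptiveAvgPool2d((7, 7))        # what torchvision's VGG16 uses
plain = nn.AvgPool2d(kernel_size=1, stride=1)  # the static replacement
assert torch.allclose(adaptive(x), plain(x))   # identical for this fixed size
```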
"""
讀入VGG16的權重,這裏我使用的VGG16的信息如下:
Lin T Y, Dollár P, Girshick R, et al. Feature pyramid networks for object detection[C]//Proceedings of the IEEE conference on computer vision and pattern recognition. 2017: 2117-2125.
[VGG16 pre-trained weight](https://drive.google.com/open?id=1jOBAqe4fPFMCgRnYt794lYgSAlx4hwCj)
"""
weights = torch.load('./vgg16_20M.pth', map_location='cpu')
for k, v in weights.items():
print("Layer: {}".format(k))
The code above prints the name of every layer in the checkpoint; from there it is just a matter of substituting each name for the xxx in weights['xxx'].numpy().
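For weights that follow torchvision's naming, the loop should print names along these lines (abridged):

```text
Layer: features.0.weight
Layer: features.0.bias
Layer: features.2.weight
Layer: features.2.bias
...
Layer: classifier.6.weight
Layer: classifier.6.bias
```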
1.3 Complete Code
It is long, but once you are used to the pattern, typing it out is pure repetition. DTYPE in ModelData must state explicitly which data type the model runs in: for FP32 inference, set ModelData.DTYPE = trt.float32; for FP16 inference, set ModelData.DTYPE = trt.float16 and additionally force builder.fp16_mode = True on the builder.
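In code, the FP16 switch looks roughly like this (a sketch; it assumes the GPU actually supports fast FP16):

```python
ModelData.DTYPE = trt.float16   # data type used for the network input
builder.fp16_mode = True        # allow FP16 kernels at build time
```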
The full code lives at github -> i_just_want_a_simple_demo/trt_api_pytorch/vgg16_sample/.
```python
import tensorrt as trt

class ModelData(object):
    INPUT_NAME = "in_frame"
    # P, C, H, W
    INPUT_SHAPE = (1, 3, 224, 224)
    OUTPUT_NAME = "out_frame"
    DTYPE = trt.float32

def populate_network(network, weights):
    # Configure the network layers based on the weights provided.
    input_tensor = network.add_input(name=ModelData.INPUT_NAME, dtype=ModelData.DTYPE, shape=ModelData.INPUT_SHAPE)
    # VGG16 features
    # VGG16_block_1
    vgg16_f0_w = weights['features.0.weight'].numpy()
    vgg16_f0_b = weights['features.0.bias'].numpy()
    vgg16_f0 = network.add_convolution(input=input_tensor, num_output_maps=64, kernel_shape=(3, 3), kernel=vgg16_f0_w, bias=vgg16_f0_b)
    vgg16_f0.padding = (1, 1)
    vgg16_f0.name = 'vgg16_conv_1_1'
    vgg16_f1 = network.add_activation(input=vgg16_f0.get_output(0), type=trt.ActivationType.RELU)
    vgg16_f1.name = 'vgg16_relu_1_1'
    vgg16_f2_w = weights['features.2.weight'].numpy()
    vgg16_f2_b = weights['features.2.bias'].numpy()
    vgg16_f2 = network.add_convolution(input=vgg16_f1.get_output(0), num_output_maps=64, kernel_shape=(3, 3), kernel=vgg16_f2_w, bias=vgg16_f2_b)
    vgg16_f2.padding = (1, 1)
    vgg16_f2.name = 'vgg16_conv_1_2'
    vgg16_f3 = network.add_activation(input=vgg16_f2.get_output(0), type=trt.ActivationType.RELU)
    vgg16_f3.name = 'vgg16_relu_1_2'
    vgg16_f4 = network.add_pooling(input=vgg16_f3.get_output(0), type=trt.PoolingType.MAX, window_size=(2, 2))
    vgg16_f4.stride = (2, 2)
    vgg16_f4.name = 'vgg16_max_pool_1'
    # VGG16_block_2
    vgg16_f5_w = weights['features.5.weight'].numpy()
    vgg16_f5_b = weights['features.5.bias'].numpy()
    vgg16_f5 = network.add_convolution(input=vgg16_f4.get_output(0), num_output_maps=128, kernel_shape=(3, 3), kernel=vgg16_f5_w, bias=vgg16_f5_b)
    vgg16_f5.padding = (1, 1)
    vgg16_f5.name = "vgg16_conv_2_1"
    vgg16_f6 = network.add_activation(input=vgg16_f5.get_output(0), type=trt.ActivationType.RELU)
    vgg16_f6.name = 'vgg16_relu_2_1'
    vgg16_f7_w = weights['features.7.weight'].numpy()
    vgg16_f7_b = weights['features.7.bias'].numpy()
    vgg16_f7 = network.add_convolution(input=vgg16_f6.get_output(0), num_output_maps=128, kernel_shape=(3, 3), kernel=vgg16_f7_w, bias=vgg16_f7_b)
    vgg16_f7.padding = (1, 1)
    vgg16_f7.name = "vgg16_conv_2_2"
    vgg16_f8 = network.add_activation(input=vgg16_f7.get_output(0), type=trt.ActivationType.RELU)
    vgg16_f8.name = 'vgg16_relu_2_2'
    vgg16_f9 = network.add_pooling(input=vgg16_f8.get_output(0), type=trt.PoolingType.MAX, window_size=(2, 2))
    vgg16_f9.stride = (2, 2)
    vgg16_f9.name = 'vgg16_max_pool_2'
    # VGG16_block_3
    vgg16_f10_w = weights['features.10.weight'].numpy()
    vgg16_f10_b = weights['features.10.bias'].numpy()
    vgg16_f10 = network.add_convolution(input=vgg16_f9.get_output(0), num_output_maps=256, kernel_shape=(3, 3), kernel=vgg16_f10_w, bias=vgg16_f10_b)
    vgg16_f10.padding = (1, 1)
    vgg16_f10.name = "vgg16_conv_3_1"
    vgg16_f11 = network.add_activation(input=vgg16_f10.get_output(0), type=trt.ActivationType.RELU)
    vgg16_f11.name = 'vgg16_relu_3_1'
    vgg16_f12_w = weights['features.12.weight'].numpy()
    vgg16_f12_b = weights['features.12.bias'].numpy()
    vgg16_f12 = network.add_convolution(input=vgg16_f11.get_output(0), num_output_maps=256, kernel_shape=(3, 3), kernel=vgg16_f12_w, bias=vgg16_f12_b)
    vgg16_f12.padding = (1, 1)
    vgg16_f12.name = "vgg16_conv_3_2"
    vgg16_f13 = network.add_activation(input=vgg16_f12.get_output(0), type=trt.ActivationType.RELU)
    vgg16_f13.name = 'vgg16_relu_3_2'
    vgg16_f14_w = weights['features.14.weight'].numpy()
    vgg16_f14_b = weights['features.14.bias'].numpy()
    vgg16_f14 = network.add_convolution(input=vgg16_f13.get_output(0), num_output_maps=256, kernel_shape=(3, 3), kernel=vgg16_f14_w, bias=vgg16_f14_b)
    vgg16_f14.padding = (1, 1)
    vgg16_f14.name = "vgg16_conv_3_3"
    vgg16_f15 = network.add_activation(input=vgg16_f14.get_output(0), type=trt.ActivationType.RELU)
    vgg16_f15.name = 'vgg16_relu_3_3'
    vgg16_f16 = network.add_pooling(input=vgg16_f15.get_output(0), type=trt.PoolingType.MAX, window_size=(2, 2))
    vgg16_f16.stride = (2, 2)
    vgg16_f16.name = 'vgg16_max_pool_3'
    # VGG16_block_4
    vgg16_f17_w = weights['features.17.weight'].numpy()
    vgg16_f17_b = weights['features.17.bias'].numpy()
    vgg16_f17 = network.add_convolution(input=vgg16_f16.get_output(0), num_output_maps=512, kernel_shape=(3, 3), kernel=vgg16_f17_w, bias=vgg16_f17_b)
    vgg16_f17.padding = (1, 1)
    vgg16_f17.name = "vgg16_conv_4_1"
    vgg16_f18 = network.add_activation(input=vgg16_f17.get_output(0), type=trt.ActivationType.RELU)
    vgg16_f18.name = 'vgg16_relu_4_1'
    vgg16_f19_w = weights['features.19.weight'].numpy()
    vgg16_f19_b = weights['features.19.bias'].numpy()
    vgg16_f19 = network.add_convolution(input=vgg16_f18.get_output(0), num_output_maps=512, kernel_shape=(3, 3), kernel=vgg16_f19_w, bias=vgg16_f19_b)
    vgg16_f19.padding = (1, 1)
    vgg16_f19.name = "vgg16_conv_4_2"
    vgg16_f20 = network.add_activation(input=vgg16_f19.get_output(0), type=trt.ActivationType.RELU)
    vgg16_f20.name = 'vgg16_relu_4_2'
    vgg16_f21_w = weights['features.21.weight'].numpy()
    vgg16_f21_b = weights['features.21.bias'].numpy()
    vgg16_f21 = network.add_convolution(input=vgg16_f20.get_output(0), num_output_maps=512, kernel_shape=(3, 3), kernel=vgg16_f21_w, bias=vgg16_f21_b)
    vgg16_f21.padding = (1, 1)
    vgg16_f21.name = "vgg16_conv_4_3"
    vgg16_f22 = network.add_activation(input=vgg16_f21.get_output(0), type=trt.ActivationType.RELU)
    vgg16_f22.name = 'vgg16_relu_4_3'
    vgg16_f23 = network.add_pooling(input=vgg16_f22.get_output(0), type=trt.PoolingType.MAX, window_size=(2, 2))
    vgg16_f23.stride = (2, 2)
    vgg16_f23.name = 'vgg16_max_pool_4'
    # VGG16_block_5
    vgg16_f24_w = weights['features.24.weight'].numpy()
    vgg16_f24_b = weights['features.24.bias'].numpy()
    vgg16_f24 = network.add_convolution(input=vgg16_f23.get_output(0), num_output_maps=512, kernel_shape=(3, 3), kernel=vgg16_f24_w, bias=vgg16_f24_b)
    vgg16_f24.padding = (1, 1)
    vgg16_f24.name = "vgg16_conv_5_1"
    vgg16_f25 = network.add_activation(input=vgg16_f24.get_output(0), type=trt.ActivationType.RELU)
    vgg16_f25.name = "vgg16_relu_5_1"
    vgg16_f26_w = weights['features.26.weight'].numpy()
    vgg16_f26_b = weights['features.26.bias'].numpy()
    vgg16_f26 = network.add_convolution(input=vgg16_f25.get_output(0), num_output_maps=512, kernel_shape=(3, 3), kernel=vgg16_f26_w, bias=vgg16_f26_b)
    vgg16_f26.padding = (1, 1)
    vgg16_f26.name = "vgg16_conv_5_2"
    vgg16_f27 = network.add_activation(input=vgg16_f26.get_output(0), type=trt.ActivationType.RELU)
    vgg16_f27.name = "vgg16_relu_5_2"
    vgg16_f28_w = weights['features.28.weight'].numpy()
    vgg16_f28_b = weights['features.28.bias'].numpy()
    vgg16_f28 = network.add_convolution(input=vgg16_f27.get_output(0), num_output_maps=512, kernel_shape=(3, 3), kernel=vgg16_f28_w, bias=vgg16_f28_b)
    vgg16_f28.padding = (1, 1)
    vgg16_f28.name = "vgg16_conv_5_3"
    vgg16_f29 = network.add_activation(input=vgg16_f28.get_output(0), type=trt.ActivationType.RELU)
    vgg16_f29.name = "vgg16_relu_5_3"
    vgg16_f30 = network.add_pooling(input=vgg16_f29.get_output(0), type=trt.PoolingType.MAX, window_size=(2, 2))
    vgg16_f30.stride = (2, 2)
    vgg16_f30.name = 'vgg16_max_pool_5'
    # VGG16 nn.AdaptiveAvgPool2d((7, 7))
    vgg16_a0 = network.add_pooling(input=vgg16_f30.get_output(0), type=trt.PoolingType.AVERAGE, window_size=(1, 1))
    vgg16_a0.name = 'vgg16_avg_pool_0'
    # VGG16 torch.flatten(x, 1)
    # There is no need for torch.flatten(x, 1), because tensorrt.IFullyConnectedLayer
    # first reshapes the input tensor from shape {P, C, H, W} into {P, C*H*W}.
    # VGG16 classifier
    # VGG16_fc_1
    vgg16_c0_w = weights['classifier.0.weight'].numpy()
    vgg16_c0_b = weights['classifier.0.bias'].numpy()
    vgg16_c0 = network.add_fully_connected(input=vgg16_a0.get_output(0), num_outputs=4096, kernel=vgg16_c0_w, bias=vgg16_c0_b)
    vgg16_c0.name = "vgg16_fc_1"
    vgg16_c1 = network.add_activation(input=vgg16_c0.get_output(0), type=trt.ActivationType.RELU)
    vgg16_c1.name = "vgg16_relu_fc_1"
    # There is no need for Dropout during inference
    # VGG16_fc_2
    vgg16_c3_w = weights['classifier.3.weight'].numpy()
    vgg16_c3_b = weights['classifier.3.bias'].numpy()
    vgg16_c3 = network.add_fully_connected(input=vgg16_c1.get_output(0), num_outputs=4096, kernel=vgg16_c3_w, bias=vgg16_c3_b)
    vgg16_c3.name = "vgg16_fc_2"
    vgg16_c4 = network.add_activation(input=vgg16_c3.get_output(0), type=trt.ActivationType.RELU)
    vgg16_c4.name = "vgg16_relu_fc_2"
    # There is no need for Dropout during inference
    # VGG16_fc_3
    vgg16_c6_w = weights['classifier.6.weight'].numpy()
    vgg16_c6_b = weights['classifier.6.bias'].numpy()
    vgg16_c6 = network.add_fully_connected(input=vgg16_c4.get_output(0), num_outputs=1000, kernel=vgg16_c6_w, bias=vgg16_c6_b)
    vgg16_c6.name = "vgg16_fc_3"
    # Output
    vgg16_c6.get_output(0).name = ModelData.OUTPUT_NAME
    network.mark_output(tensor=vgg16_c6.get_output(0))
```
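To round things off, here is a hedged sketch of driving populate_network through engine build and inference (TRT 6/7 style; the explicit-batch flag, workspace size, and buffer shapes are my assumptions, and section 6.1 below explains why execute_v2 wants an explicit batch):

```python
import numpy as np
import pycuda.autoinit  # noqa: F401, creates a CUDA context on import
import pycuda.driver as cuda
import torch

TRT_LOGGER = trt.Logger(trt.Logger.WARNING)

def build_engine(weights):
    builder = trt.Builder(TRT_LOGGER)
    flag = 1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH)
    network = builder.create_network(flag)  # explicit batch, see section 6.1
    builder.max_workspace_size = 1 << 30    # 1 GiB of build-time scratch space
    populate_network(network, weights)
    return builder.build_cuda_engine(network)

weights = torch.load('./vgg16_20M.pth', map_location='cpu')
engine = build_engine(weights)
context = engine.create_execution_context()

# Host and device buffers; VGG16 emits 1000 class scores
h_input = np.random.rand(*ModelData.INPUT_SHAPE).astype(np.float32)
h_output = np.empty((1, 1000), dtype=np.float32)
d_input = cuda.mem_alloc(h_input.nbytes)
d_output = cuda.mem_alloc(h_output.nbytes)

cuda.memcpy_htod(d_input, h_input)
context.execute_v2(bindings=[int(d_input), int(d_output)])
cuda.memcpy_dtoh(h_output, d_output)
print("predicted class:", h_output.argmax())
```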
2. Mixed Precision
(to be continued)
3. Quantization
(to be continued)
4. Performance Profiling
(to be continued)
5. Tips
5.1 Dynamic inputs and outputs
Suppose the network begins with a resize op that interpolates the input to a fixed size, then encodes/decodes to produce a result, and finally I want that output resized back to the input's size, that is:

```text
input              -> resized_input              -> inference -> output               -> resized_output
{1, 3, in_w, in_h}    {1, 3, in_w_new, in_h_new}                {1, 3, out_w, out_h}     {1, 3, in_w, in_h}
```
With dynamic inputs, TRT introduces the run-time notion of a shape tensor as distinct from an execution tensor; for a static network, the input and output sizes are already pinned down when the network is built. This is explained at length in 7. Working With Dynamic Shapes. A shape tensor is a one-dimensional tensor recording the size of an input tensor, produced by the corresponding op IShapeLayer (see the docs). So we can fetch the input's shape as below, and an IResizeLayer then makes the output match the input size.
```python
# Input order is BCWH; W and H are set to -1, i.e. dynamic, and only known at run time
input_tensor = network.add_input("input", trt.float32, (1, 3, -1, -1))
input_shape = network.add_shape(input=input_tensor)
# Prints (4,): a 1-D tensor whose contents are the shape of input_tensor
print(input_shape.get_output(0).shape)
# Resize the output of the final layer, last_layer, back to the input's size
output_tensor = network.add_resize(input=last_layer.get_output(0))
output_tensor.resize_mode = trt.ResizeMode.LINEAR
output_tensor.align_corners = True
output_tensor.set_input(1, input_shape.get_output(0))
```
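One more point worth noting: a network with -1 dimensions can only be built if you also register an optimization profile that bounds the dynamic range. A minimal sketch (the min/opt/max shapes here are made-up examples):

```python
profile = builder.create_optimization_profile()
profile.set_shape("input", min=(1, 3, 128, 128), opt=(1, 3, 256, 256), max=(1, 3, 512, 512))
config = builder.create_builder_config()
config.add_optimization_profile(profile)
engine = builder.build_engine(network, config)
```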
5.2 Building a BN layer
The TRT API has no Batch Normalization layer, so you have to assemble it yourself. By its formula, BN is a per-channel affine transform, y = gamma * (x - mean) / sqrt(var + eps) + beta, and the TRT API's IScaleLayer computes exactly this kind of elementwise scale-and-shift, so an IScaleLayer is all you need to build a BN layer. See the article TensorRT實戰(一) 如何搭建Batch Normalization層 for details.
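As a concrete illustration (my own sketch, not the linked article's code), fold the four BN parameters into the shift/scale/power triple that IScaleLayer expects:

```python
import numpy as np

def add_batch_norm(network, input_tensor, weights, prefix, eps=1e-5):
    # prefix is hypothetical, e.g. a BN layer's name in a torchvision-style state_dict
    gamma = weights[prefix + '.weight'].numpy()
    beta = weights[prefix + '.bias'].numpy()
    mean = weights[prefix + '.running_mean'].numpy()
    var = weights[prefix + '.running_var'].numpy()
    scale = gamma / np.sqrt(var + eps)  # per-channel multiplier
    shift = beta - mean * scale         # per-channel offset
    power = np.ones_like(scale)         # IScaleLayer computes (x * scale + shift) ** power
    return network.add_scale(input_tensor, trt.ScaleMode.CHANNEL, shift, scale, power)
```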
5.3 Setting a pooling layer's ceil mode
At first glance there is no ceil_mode anywhere in tensorrt.IPoolingLayer. In PyTorch, ceil_mode=True means "when True, will use ceil instead of floor to compute the output shape": whenever the pooling window does not evenly divide the input tensor, an extra ring of -NaN is conceptually padded onto the bottom-right of the tensor, as the figure below illustrates; that makes the effect of ceil_mode easy to see.
The TRT API counterpart is tensorrt.PaddingMode.EXPLICIT_ROUND_UP; the official docs, torch2trt, and TensorRT#84 all corroborate this.
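In code it is a one-liner on the pooling layer (a sketch; prev stands for whatever layer precedes the pool):

```python
pool = network.add_pooling(prev.get_output(0), trt.PoolingType.MAX, (2, 2))
pool.stride = (2, 2)
pool.padding_mode = trt.PaddingMode.EXPLICIT_ROUND_UP  # equivalent to ceil_mode=True
```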
6. Error Collection
6.1 mEngine.getHasImplicitBatchDim()
6.1.1 Error message
```text
[TensorRT] ERROR: Parameter check failed at: engine.cpp::executeV2::701, condition: !mEngine.getHasImplicitBatchDim()
```
I ran inference through execute_v2, which is why the error mentions executeV2. The cause: the engine was built with an implicit batch dimension, i.e. builder.create_network was never told whether the batch is implicit or explicit.
6.1.2 Fix
The TRT docs give the prototype create_network(self: tensorrt.tensorrt.Builder, flags: int = 0) → tensorrt.tensorrt.INetworkDefinition, so the code below is enough to make the batch explicit.
```python
flag = 1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH)
builder.create_network(flag)
```
6.2 mEngine.bindingIsInput(bindingIndex)
6.2.1 Error message
```text
[TensorRT] ERROR: Parameter check failed at: engine.cpp::setBindingDimensions::893, condition: mEngine.bindingIsInput(bindingIndex)
```
The network uses dynamic inputs, and during inference the binding shape was never set on the context.
6.2.2 Fix
Per the docs, the prototype is set_binding_shape(self: tensorrt.tensorrt.IExecutionContext, binding: int, shape: tensorrt.tensorrt.Dims) → bool, so setting it as below resolves the error.
```python
# binding_index: the index of the dynamic input, i.e. which of the network's inputs this is
# the shape tuple: the input's shape; I use BCWH here
context.set_binding_shape(binding_index, (Batch, Channel, Width, Height))
```