1 Introduction to TensorRT
At its core, TensorRT is a C++ library that enables high-performance inference on NVIDIA graphics processing units (GPUs). It complements training frameworks such as TensorFlow and PyTorch: given an already-trained neural network, it runs that network quickly and efficiently to produce results. It includes parsers for importing existing models from Caffe, ONNX, or TensorFlow, as well as C++ and Python APIs for building models programmatically. TensorRT provides the C++ implementation on all supported platforms and the Python implementation on x86.
2 Benefits of TensorRT
After a neural network is trained, TensorRT lets the network be compressed, optimized, and deployed as a runtime without the overhead of a training framework. TensorRT fuses multiple layers, optimizes kernel selection, and performs normalization and conversion to the specified precision (FP32, FP16, or INT8) to optimize matrix computation, improving latency, throughput, and efficiency. It achieves this by combining an API that abstracts away hardware-specific details with implementations developed and optimized specifically for high throughput, low latency, and a low device-memory footprint.
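As a back-of-the-envelope illustration of how the chosen precision affects device-memory footprint, the snippet below computes the parameter-memory size of the small MNIST network used later in this article at each precision TensorRT supports. These are simple arithmetic estimates over parameter counts, not measured TensorRT numbers:

```python
# Parameter counts of the MNIST network defined later in this article:
# conv1 (1->20, 5x5), conv2 (20->50, 5x5), fc1 (800->500), fc2 (500->10)
params = (
    20 * 1 * 5 * 5 + 20 +    # conv1 weights + bias
    50 * 20 * 5 * 5 + 50 +   # conv2 weights + bias
    500 * 800 + 500 +        # fc1 weights + bias
    10 * 500 + 10            # fc2 weights + bias
)

# Bytes per parameter at each precision
bytes_per = {"FP32": 4, "FP16": 2, "INT8": 1}
footprint = {p: params * b for p, b in bytes_per.items()}

print(params)      # total parameter count
print(footprint)   # approximate weight-memory footprint in bytes
```

Halving the bytes per weight (FP32 to FP16) halves the weight memory, and INT8 quarters it, which is one reason reduced precision also improves throughput on memory-bound layers.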
Note: the GTX 1080 Ti supports FP32 and INT8 computation, while the newer RTX 2080 Ti series also supports FP16.
3 Notes on Using TensorRT
(1) TensorRT is generally not used for any part of the training phase.
(2) A saved neural network must be parsed from its saved format into TensorRT using the ONNX parser, the Caffe parser, or the TensorFlow/UFF parser. (Networks can be imported directly from Caffe, or from other frameworks via the UFF or ONNX format.) The ONNX parser shipped with TensorRT 5.0.0 supports ONNX IR (intermediate representation) version 0.0.3 and opset version 7.
(3) Consider the optimization options: batch size, workspace size, and mixed precision. These options are selected and specified as part of the TensorRT build step, in which an optimized inference engine is actually built from the network.
(4) Validate the results. FP32 or FP16 results should be more accurate; INT8 results may be slightly worse.
(5) Write out the engine in a serialized format, known as a plan file.
(6) To initialize the inference engine, the application first deserializes the model from the plan file into an engine.
(7) TensorRT is typically used asynchronously: when input data arrives, the program calls an execute_async function with the input buffers and the output buffers into which TensorRT should place the results. (The buffers are managed with pycuda.)
Note: a generated plan file is not portable across platforms or TensorRT versions, and it is specific to the exact GPU it was built on.
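Because a plan file is tied to one GPU model and one TensorRT version, a simple safeguard is to record that build context next to the serialized engine and check it before deserializing. Below is a minimal sketch using only the Python standard library; the file name and metadata fields are illustrative assumptions, not part of any TensorRT API:

```python
import json
import os
import tempfile

def write_plan_metadata(plan_path, trt_version, gpu_name):
    """Record the build context that a serialized plan file depends on."""
    meta = {
        "plan": os.path.basename(plan_path),
        "tensorrt": trt_version,
        "gpu": gpu_name,
    }
    # Store the metadata as a sidecar JSON file next to the plan file
    with open(plan_path + ".json", "w") as f:
        json.dump(meta, f)
    return meta

# Hypothetical example values matching the example code in this article
plan = os.path.join(tempfile.gettempdir(), "lianzheng.engine")
meta = write_plan_metadata(plan, "5.0.0", "GeForce GTX 1080 Ti")
print(meta)
```

At load time, the application can compare the recorded TensorRT version and GPU name against the current environment and rebuild the engine if they differ, instead of failing on an incompatible plan file.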
4 Steps for Using TensorRT (assuming a trained model is already available):
(1) Create a TensorRT network definition from the model
(2) Invoke the TensorRT builder to create an optimized runtime engine from the network
(3) Serialize and deserialize the engine so it can be recreated quickly at runtime
(4) Feed the engine data to perform inference
5 TensorRT Versions
TensorRT comes in C++ and Python versions. The C++ implementation is available on all supported platforms; the Python implementation is available on x86. The two are close in functionality. The Python version can be simpler to work with, since many third-party libraries help with data processing, while the C++ version offers stronger safety guarantees.
6 Installing TensorRT
Installation can be completed by following the tutorial on the NVIDIA website, and many blog posts document it in detail, so it is not covered here.
7 Example: Converting PyTorch to TensorRT 5
There is little material online, and the few examples available target TensorRT 4, while TensorRT 5 changed the API considerably and simplified the development process. The code below adapts a TensorRT 4 example to TensorRT 5 for reference.
import tensorrt as trt
import pycuda.driver as cuda
import pycuda.autoinit
import numpy as np
from matplotlib.pyplot import imshow # to show test case
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from torchvision import datasets, transforms
from torch.autograd import Variable
BATCH_SIZE = 64
TEST_BATCH_SIZE = 1000
EPOCHS = 3
LEARNING_RATE = 0.001
SGD_MOMENTUM = 0.5
SEED = 1
LOG_INTERVAL = 10
torch.cuda.manual_seed(SEED)
# Load the MNIST dataset
kwargs = {'num_workers': 1, 'pin_memory': True}
train_loader = torch.utils.data.DataLoader(
    datasets.MNIST('/tmp/mnist/data', train=True, download=True,
                   transform=transforms.Compose([
                       transforms.ToTensor(),
                       transforms.Normalize((0.1307,), (0.3081,))
                   ])),
    batch_size=BATCH_SIZE,
    shuffle=True,
    **kwargs)
test_loader = torch.utils.data.DataLoader(
    datasets.MNIST('/tmp/mnist/data', train=False,
                   transform=transforms.Compose([
                       transforms.ToTensor(),
                       transforms.Normalize((0.1307,), (0.3081,))
                   ])),
    batch_size=TEST_BATCH_SIZE,
    shuffle=True,
    **kwargs)
# Network architecture
class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        self.conv1 = nn.Conv2d(1, 20, kernel_size=5)
        self.conv2 = nn.Conv2d(20, 50, kernel_size=5)
        self.conv2_drop = nn.Dropout2d()
        self.fc1 = nn.Linear(800, 500)
        self.fc2 = nn.Linear(500, 10)

    def forward(self, x):
        x = F.max_pool2d(self.conv1(x), kernel_size=2, stride=2)
        x = F.max_pool2d(self.conv2(x), kernel_size=2, stride=2)
        x = x.view(-1, 800)
        x = F.relu(self.fc1(x))
        x = self.fc2(x)
        return F.log_softmax(x, dim=1)
model = Net()
# Train on the GPU
model.cuda()
optimizer = optim.SGD(model.parameters(), lr=LEARNING_RATE, momentum=SGD_MOMENTUM)
# Train the network
def train(epoch):
    model.train()
    for batch, (data, target) in enumerate(train_loader):
        data, target = data.cuda(), target.cuda()
        data, target = Variable(data), Variable(target)
        optimizer.zero_grad()
        output = model(data)
        loss = F.nll_loss(output, target)
        loss.backward()
        optimizer.step()
        # if batch % LOG_INTERVAL == 0:
        #     print('Train Epoch: {} [{}/{} ({:.0f}%)]\tLoss: {:.6f}'
        #           .format(epoch,
        #                   batch * len(data),
        #                   len(train_loader.dataset),
        #                   100. * batch / len(train_loader),
        #                   loss.item()))
# Evaluate accuracy on the test set
def test(epoch):
    model.eval()
    test_loss = 0
    correct = 0
    with torch.no_grad():  # replaces the deprecated volatile=True
        for data, target in test_loader:
            data, target = data.cuda(), target.cuda()
            output = model(data)
            test_loss += F.nll_loss(output, target).item()
            pred = output.data.max(1)[1]
            correct += pred.eq(target.data).cpu().sum()
    test_loss /= len(test_loader)
    print('\nTest set: Average loss: {:.4f}, Accuracy: {}/{} ({:.0f}%)\n'
          .format(test_loss,
                  correct,
                  len(test_loader.dataset),
                  100. * correct / len(test_loader.dataset)))
# Train for three epochs
for e in range(EPOCHS):
    train(e + 1)
    test(e + 1)
from time import time
# Record the start time
Start = time()
# Extract the trained model's weights
weights = model.state_dict()
# Logger for TensorRT messages
G_LOGGER = trt.Logger(trt.Logger.WARNING)
# Create the builder
builder = trt.Builder(G_LOGGER)
# Build the TensorRT network definition from the model
network = builder.create_network()
# Name for the input layer, data type, tuple for dimension
data = network.add_input("data", trt.float32, (1, 28, 28))
assert (data)
# -------------
conv1_w = weights['conv1.weight'].cpu().numpy().reshape(-1)
conv1_b = weights['conv1.bias'].cpu().numpy().reshape(-1)
conv1 = network.add_convolution(data, 20, (5, 5), conv1_w, conv1_b)
assert (conv1)
conv1.stride = (1, 1)
# -------------
pool1 = network.add_pooling(conv1.get_output(0), trt.PoolingType.MAX, (2, 2))
assert (pool1)
pool1.stride = (2, 2)
# -------------
conv2_w = weights['conv2.weight'].cpu().numpy().reshape(-1)
conv2_b = weights['conv2.bias'].cpu().numpy().reshape(-1)
conv2 = network.add_convolution(pool1.get_output(0), 50, (5, 5), conv2_w, conv2_b)
assert (conv2)
conv2.stride = (1, 1)
# -------------
pool2 = network.add_pooling(conv2.get_output(0), trt.PoolingType.MAX, (2, 2))
assert (pool2)
pool2.stride = (2, 2)
# -------------
fc1_w = weights['fc1.weight'].cpu().numpy().reshape(-1)
fc1_b = weights['fc1.bias'].cpu().numpy().reshape(-1)
fc1 = network.add_fully_connected(pool2.get_output(0), 500, fc1_w, fc1_b)
assert (fc1)
# -------------
relu1 = network.add_activation(fc1.get_output(0), trt.ActivationType.RELU)
assert (relu1)
# -------------
fc2_w = weights['fc2.weight'].cpu().numpy().reshape(-1)
fc2_b = weights['fc2.bias'].cpu().numpy().reshape(-1)
fc2 = network.add_fully_connected(relu1.get_output(0), 10, fc2_w, fc2_b)
assert (fc2)
fc2.get_output(0).name = "prob"
network.mark_output(fc2.get_output(0))
builder.max_batch_size = 100
builder.max_workspace_size = 1 << 20
# Build the engine
engine = builder.build_cuda_engine(network)
del network
del builder
runtime = trt.Runtime(G_LOGGER)
# Fetch a batch from the test set
img, target = next(iter(test_loader))
# The engine is built with max_batch_size = 100, so run the first 100 images
img = img.numpy()[:100]
target = target.numpy()[:100]
# print("Test Case: " + str(target))
img = img.ravel()
context = engine.create_execution_context()
output = np.empty((100, 10), dtype=np.float32)
# Allocate device memory
d_input = cuda.mem_alloc(1 * img.size * img.dtype.itemsize)
d_output = cuda.mem_alloc(1 * output.size * output.dtype.itemsize)
bindings = [int(d_input), int(d_output)]
# pycuda stream for buffer operations
stream = cuda.Stream()
# Copy the input data to the device
cuda.memcpy_htod_async(d_input, img, stream)
# Run inference
context.execute_async(100, bindings, stream.handle, None)
# Copy the prediction results back from the output buffer
cuda.memcpy_dtoh_async(output, d_output, stream)
# Wait for the stream to finish
stream.synchronize()
print(len(np.argmax(output, axis=1)))
print("Test Case: " + str(target))
print("Prediction: " + str(np.argmax(output, axis=1)))
print("tensorrt time:", time() - Start)
# Save the plan file
with open("lianzheng.engine", "wb") as f:
    f.write(engine.serialize())
# Deserialize the engine
with open("lianzheng.engine", "rb") as f, trt.Runtime(G_LOGGER) as runtime:
    engine = runtime.deserialize_cuda_engine(f.read())
del context
del engine
del runtime