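Preface
sampleMNIST, TensorRT's "hello world" program, is a good starting point for TensorRT beginners. This article walks through the sampleMNIST code in detail to explain, from a practical angle, the main TensorRT concepts, how TensorRT relates to CUDA, and how the core APIs are used.
Code analysis
The sampleMNIST source is on GitHub: https://github.com/NVIDIA/TensorRT/blob/release/6.0/samples/opensource/sampleMNIST/sampleMNIST.cpp
The program runs in four stages: main and input-parameter initialization -> network construction -> network inference -> resource release. The sections below analyze each stage in turn.
The main entry point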
void printHelpInfo()
{
std::cout
<< "Usage: ./sample_mnist [-h or --help] [-d or --datadir=<path to data directory>] [--useDLACore=<int>]\n";
std::cout << "--help Display help information\n";
std::cout << "--datadir Specify path to a data directory, overriding the default. This option can be used "
"multiple times to add multiple directories. If no data directories are given, the default is to use "
"(data/samples/mnist/, data/mnist/)"
<< std::endl;
std::cout << "--useDLACore=N Specify a DLA engine for layers that support DLA. Value can range from 0 to n-1, "
"where n is the number of DLA engines on the platform."
<< std::endl;
std::cout << "--int8 Run in Int8 mode.\n";
std::cout << "--fp16 Run in FP16 mode.\n";
}
int main(int argc, char** argv)
{
samplesCommon::Args args;
bool argsOK = samplesCommon::parseArgs(args, argc, argv);
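- main starts by parsing the command-line arguments. They let the user specify the data directory that holds the Caffe model, which DLA core to use, and whether to run in INT8 or FP16 mode; see printHelpInfo().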
samplesCommon::CaffeSampleParams initializeSampleParams(const samplesCommon::Args& args)
{
samplesCommon::CaffeSampleParams params;
if (args.dataDirs.empty()) //!< Use default directories if user hasn't provided directory paths
{
params.dataDirs.push_back("data/mnist/");
params.dataDirs.push_back("data/samples/mnist/");
}
else //!< Use the data directory provided by the user
{
params.dataDirs = args.dataDirs;
}
params.prototxtFileName = locateFile("mnist.prototxt", params.dataDirs);
params.weightsFileName = locateFile("mnist.caffemodel", params.dataDirs);
params.meanFileName = locateFile("mnist_mean.binaryproto", params.dataDirs);
params.inputTensorNames.push_back("data");
params.batchSize = 1;
params.outputTensorNames.push_back("prob");
params.dlaCore = args.useDLACore;
params.int8 = args.runInInt8;
params.fp16 = args.runInFp16;
return params;
}
......
int main(int argc, char** argv)
{
......
samplesCommon::CaffeSampleParams params = initializeSampleParams(args);
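- A CaffeSampleParams instance is generated from the program arguments. It holds the default data directories for the Caffe model, the paths of the MNIST prototxt file, the Caffe weights file, and the mean binaryproto file, the name of the network's input tensor (data), the name of its output tensor (prob), and a batch size of 1. The user's arguments decide whether a DLA core is used and whether the network runs in INT8 or FP16 mode.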
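The directory fallback and file lookup described above can be sketched without TensorRT. This is a minimal illustration only: resolveDataDirs and findInDirs are hypothetical names, the existence check is passed in as a callback so the example needs no real files, and the sample's own locateFile may behave differently.

```cpp
#include <functional>
#include <string>
#include <vector>

// Hypothetical helper: fall back to the default directories when the user gave none.
std::vector<std::string> resolveDataDirs(const std::vector<std::string>& userDirs)
{
    if (userDirs.empty())
    {
        return {"data/mnist/", "data/samples/mnist/"};
    }
    return userDirs;
}

// Hypothetical stand-in for locateFile: return the first "dir + name" for which
// `exists` reports true, or an empty string when no directory contains the file.
std::string findInDirs(const std::string& name, const std::vector<std::string>& dirs,
    const std::function<bool(const std::string&)>& exists)
{
    for (const auto& dir : dirs)
    {
        const std::string candidate = dir + name;
        if (exists(candidate))
        {
            return candidate;
        }
    }
    return "";
}
```

Directories are tried in order, so a file present in both default directories would resolve to the first one in this sketch.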
class SampleMNIST
{
template <typename T>
using SampleUniquePtr = std::unique_ptr<T, samplesCommon::InferDeleter>;
public:
SampleMNIST(const samplesCommon::CaffeSampleParams& params)
: mParams(params)
......
int main(int argc, char** argv)
{
......
SampleMNIST sample(params);
gLogInfo << "Building and running a GPU inference engine for MNIST" << std::endl;
- The SampleMNIST object is constructed from the CaffeSampleParams, which it stores in mParams.
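The class listing above also declares SampleUniquePtr, a std::unique_ptr with a custom deleter. The sketch below shows the general pattern under one assumption: that the deleter releases the object by calling its destroy() method rather than delete. DestroyDeleter, DestroyingPtr, FakeTrtObject, and createFakeObject are hypothetical names used only for illustration.

```cpp
#include <memory>

// Assumed shape of the deleter: call destroy() instead of delete.
struct DestroyDeleter
{
    template <typename T>
    void operator()(T* obj) const
    {
        if (obj)
        {
            obj->destroy();
        }
    }
};

template <typename T>
using DestroyingPtr = std::unique_ptr<T, DestroyDeleter>;

// Hypothetical stand-in for an object that must be released through destroy().
struct FakeTrtObject
{
    int* destroyCount;
    void destroy()
    {
        ++(*destroyCount); // record the call so the effect is observable
        delete this;
    }
};

// Factory that returns a raw owning pointer, which the smart pointer then adopts.
FakeTrtObject* createFakeObject(int* destroyCount)
{
    return new FakeTrtObject{destroyCount};
}
```

Wrapping the factory result in DestroyingPtr means destroy() runs automatically when the pointer goes out of scope, including on the early return false paths seen in build().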
int main(int argc, char** argv)
{
......
if (!sample.build())
{
return gLogger.reportFail(sampleTest);
}
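The SampleMNIST object then builds the MNIST network. The next section analyzes the build method of the network construction stage in detail.
Network construction (build) stage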
bool SampleMNIST::build()
{
auto builder = SampleUniquePtr<nvinfer1::IBuilder>(nvinfer1::createInferBuilder(gLogger.getTRTLogger()));
if (!builder)
{
return false;
}
auto network = SampleUniquePtr<nvinfer1::INetworkDefinition>(builder->createNetwork());
if (!network)
{
return false;
}
auto config = SampleUniquePtr<nvinfer1::IBuilderConfig>(builder->createBuilderConfig());
if (!config)
{
return false;
}
auto parser = SampleUniquePtr<nvcaffeparser1::ICaffeParser>(nvcaffeparser1::createCaffeParser());
if (!parser)
{
return false;
}
constructNetwork(parser, network);
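- This is the standard TensorRT workflow: create an IBuilder from the logger, use the IBuilder to create both the INetworkDefinition and the IBuilderConfig, and create an ICaffeParser for parsing the Caffe model. constructNetwork then uses the ICaffeParser to parse the Caffe model and populates the INetworkDefinition with a network that TensorRT can optimize and run.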
void SampleMNIST::constructNetwork(
SampleUniquePtr<nvcaffeparser1::ICaffeParser>& parser, SampleUniquePtr<nvinfer1::INetworkDefinition>& network)
{
const nvcaffeparser1::IBlobNameToTensor* blobNameToTensor = parser->parse(
mParams.prototxtFileName.c_str(), mParams.weightsFileName.c_str(), *network, nvinfer1::DataType::kFLOAT);
for (auto& s : mParams.outputTensorNames)
{
network->markOutput(*blobNameToTensor->find(s.c_str()));
}
// add mean subtraction to the beginning of the network
nvinfer1::Dims inputDims = network->getInput(0)->getDimensions();
mMeanBlob
= SampleUniquePtr<nvcaffeparser1::IBinaryProtoBlob>(parser->parseBinaryProto(mParams.meanFileName.c_str()));
nvinfer1::Weights meanWeights{nvinfer1::DataType::kFLOAT, mMeanBlob->getData(), inputDims.d[1] * inputDims.d[2]};
// For this sample, a large range based on the mean data is chosen and applied to the head of the network.
// After the mean subtraction occurs, the range is expected to be between -127 and 127, so the rest of the network
// is given a generic range.
// The preferred method is use scales computed based on a representative data set
// and apply each one individually based on the tensor. The range here is large enough for the
// network, but is chosen for example purposes only.
float maxMean
= samplesCommon::getMaxValue(static_cast<const float*>(meanWeights.values), samplesCommon::volume(inputDims));
auto mean = network->addConstant(nvinfer1::Dims3(1, inputDims.d[1], inputDims.d[2]), meanWeights);
mean->getOutput(0)->setDynamicRange(-maxMean, maxMean);
network->getInput(0)->setDynamicRange(-maxMean, maxMean);
auto meanSub = network->addElementWise(*network->getInput(0), *mean->getOutput(0), ElementWiseOperation::kSUB);
meanSub->getOutput(0)->setDynamicRange(-maxMean, maxMean);
network->getLayer(0)->setInput(0, *meanSub->getOutput(0));
samplesCommon::setAllTensorScales(network.get(), 127.0f, 127.0f);
}
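- parser->parse parses the Caffe prototxt and weights files, populates network, and returns blobNameToTensor, an object that looks up an ITensor by blob name.
- blobNameToTensor->find locates each output tensor named in the parameters, and network->markOutput marks it as an output of the network.
- network->getInput(0)->getDimensions() fetches the network's input ITensor and returns its dimensions as a Dims object.
- parser->parseBinaryProto parses the Caffe mean file (mnist_mean.binaryproto) and wraps it in an IBinaryProtoBlob.
- meanWeights wraps the mean data for the input: the values come from mMeanBlob->getData(), and the element count is inputDims.d[1] * inputDims.d[2].
- A mean-subtraction step is then inserted in front of the network's input:
- network->addConstant creates an IConstantLayer whose output is a 3-dimensional (Dims3) tensor holding the mean values.
- network->addElementWise creates an IElementWiseLayer that takes the original network input and the IConstantLayer output as its inputs and subtracts the second from the first (kSUB).
- Finally, network->getLayer(0)->setInput makes the first layer of the original network read the IElementWiseLayer output instead of the raw input, which completes the mean subtraction. The setDynamicRange calls attach a value range to each tensor involved, as the code comments describe.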
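To make the inserted layers concrete, here is a host-side sketch of the arithmetic they perform. It is not TensorRT code: subtractMean and maxAbs are hypothetical helpers, and maxAbs only illustrates how a symmetric range [-m, m] could be derived; the sample's getMaxValue may compute its bound differently.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

// What the constant + element-wise (kSUB) pair computes: out[i] = input[i] - mean[i].
// Assumes both vectors have the same length.
std::vector<float> subtractMean(const std::vector<float>& input, const std::vector<float>& mean)
{
    std::vector<float> out(input.size());
    for (std::size_t i = 0; i < input.size(); ++i)
    {
        out[i] = input[i] - mean[i];
    }
    return out;
}

// Largest absolute value, the kind of bound a symmetric range [-m, m] is built from.
float maxAbs(const std::vector<float>& values)
{
    float m = 0.0f;
    for (float v : values)
    {
        m = std::max(m, std::fabs(v));
    }
    return m;
}
```

Building the subtraction into the network means the caller can feed raw pixel values and leave the preprocessing to the engine.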
bool SampleMNIST::build()
{
......
builder->setMaxBatchSize(mParams.batchSize);
config->setMaxWorkspaceSize(16_MiB);
config->setFlag(BuilderFlag::kGPU_FALLBACK);
config->setFlag(BuilderFlag::kSTRICT_TYPES);
if (mParams.fp16)
{
config->setFlag(BuilderFlag::kFP16);
}
if (mParams.int8)
{
config->setFlag(BuilderFlag::kINT8);
}
samplesCommon::enableDLA(builder.get(), config.get(), mParams.dlaCore);
mEngine = std::shared_ptr<nvinfer1::ICudaEngine>(
builder->buildEngineWithConfig(*network, *config), samplesCommon::InferDeleter());
if (!mEngine)
return false;
assert(network->getNbInputs() == 1);
mInputDims = network->getInput(0)->getDimensions();
assert(mInputDims.nbDims == 3);
return true;
}
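- After constructNetwork returns, the builder is given the batchSize from the parameters as its maximum batch size.
- config sets the maximum workspace size (16 MiB of scratch memory that layers may use) and the relevant builder flags.
- enableDLA configures whether NVIDIA's Deep Learning Accelerator (DLA) is used for hardware acceleration.
- builder->buildEngineWithConfig builds the ICudaEngine from network and config; the engine is used for the inference that follows.
- Finally, the code asserts that the network has exactly one input and that this input has 3 dimensions.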
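The 16_MiB argument to setMaxWorkspaceSize is a C++ user-defined literal. The sketch below shows how such a literal can be written; it is a hypothetical re-creation for illustration, and the definition in the samples' common header may differ in its types.

```cpp
// Hypothetical re-creation of a mebibyte literal: N_MiB yields N * 2^20 bytes.
constexpr unsigned long long operator""_MiB(unsigned long long value)
{
    return value * (1ULL << 20);
}

// The workspace size requested in build(), expressed in bytes.
constexpr unsigned long long kWorkspaceBytes = 16_MiB;
static_assert(kWorkspaceBytes == 16777216ULL, "16 MiB is 16 * 2^20 bytes");
```

Because the literal is constexpr, the byte count is computed at compile time and the call site stays readable.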
Network inference (infer) stage
bool SampleMNIST::infer()
{
// Create RAII buffer manager object
samplesCommon::BufferManager buffers(mEngine, mParams.batchSize);
auto context = SampleUniquePtr<nvinfer1::IExecutionContext>(mEngine->createExecutionContext());
if (!context)
{
return false;
}
// Pick a random digit to try to infer
srand(time(NULL));
const int digit = rand() % 10;
// Read the input data into the managed buffers
// There should be just 1 input tensor
assert(mParams.inputTensorNames.size() == 1);
if (!processInput(buffers, mParams.inputTensorNames[0], digit))
{
return false;
}
.....
int main(int argc, char** argv)
{
......
if (!sample.infer())
{
return gLogger.reportFail(sampleTest);
}
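- After build succeeds, main calls infer to run inference.
- infer first constructs a BufferManager, a helper class that creates and manages host and device memory.
- The template class GenericBuffer takes AllocFunc and FreeFunc template parameters that select host or device storage. As the code below shows, DeviceAllocator/DeviceFree use cudaMalloc/cudaFree to allocate and release GPU device memory, while HostAllocator/HostFree use malloc/free to allocate and release CPU host memory.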
class DeviceAllocator
{
public:
bool operator()(void** ptr, size_t size) const
{
return cudaMalloc(ptr, size) == cudaSuccess;
}
};
class DeviceFree
{
public:
void operator()(void* ptr) const
{
cudaFree(ptr);
}
};
......
class HostAllocator
{
public:
bool operator()(void** ptr, size_t size) const
{
*ptr = malloc(size);
return *ptr != nullptr;
}
};
class HostFree
{
public:
void operator()(void* ptr) const
{
free(ptr);
}
};
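The allocator and free classes above are policy types: a buffer template can be written once and pointed at either memory space. The following is a simplified, host-only sketch of that idea. PolicyBuffer, MallocAllocator, and MallocFree are hypothetical names, and the real GenericBuffer has more features (element types, resizing, move support) that are omitted here.

```cpp
#include <cstddef>
#include <cstdlib>
#include <new>

// Host-side allocation policies with the same call shape as the classes above.
struct MallocAllocator
{
    bool operator()(void** ptr, std::size_t size) const
    {
        *ptr = std::malloc(size);
        return *ptr != nullptr;
    }
};

struct MallocFree
{
    void operator()(void* ptr) const
    {
        std::free(ptr);
    }
};

// Hypothetical, simplified buffer: allocation and release are delegated to the policies.
template <typename AllocFunc, typename FreeFunc>
class PolicyBuffer
{
public:
    explicit PolicyBuffer(std::size_t nbBytes)
        : mSize(nbBytes)
    {
        if (!AllocFunc()(&mData, mSize))
        {
            throw std::bad_alloc();
        }
    }
    ~PolicyBuffer()
    {
        FreeFunc()(mData);
    }
    // Non-copyable: exactly one owner releases the memory.
    PolicyBuffer(const PolicyBuffer&) = delete;
    PolicyBuffer& operator=(const PolicyBuffer&) = delete;

    void* data() const { return mData; }
    std::size_t nbBytes() const { return mSize; }

private:
    std::size_t mSize;
    void* mData{nullptr};
};

using HostOnlyBuffer = PolicyBuffer<MallocAllocator, MallocFree>;
```

Swapping in a pair of policies built on cudaMalloc and cudaFree would give a device buffer with the same interface, which is the design the sample relies on.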
- A ManagedBuffer pairs a deviceBuffer with a hostBuffer to manage the device and host storage of one binding.
BufferManager(std::shared_ptr<nvinfer1::ICudaEngine> engine, const int& batchSize,
const nvinfer1::IExecutionContext* context = nullptr)
: mEngine(engine)
, mBatchSize(batchSize)
{
// Create host and device buffers
for (int i = 0; i < mEngine->getNbBindings(); i++)
{
auto dims = context ? context->getBindingDimensions(i) : mEngine->getBindingDimensions(i);
size_t vol = context ? 1 : static_cast<size_t>(mBatchSize);
nvinfer1::DataType type = mEngine->getBindingDataType(i);
int vecDim = mEngine->getBindingVectorizedDim(i);
if (-1 != vecDim) // i.e., 0 != lgScalarsPerVector
{
int scalarsPerVec = mEngine->getBindingComponentsPerElement(i);
dims.d[vecDim] = divUp(dims.d[vecDim], scalarsPerVec);
vol *= scalarsPerVec;
}
vol *= samplesCommon::volume(dims);
std::unique_ptr<ManagedBuffer> manBuf{new ManagedBuffer()};
manBuf->deviceBuffer = DeviceBuffer(vol, type);
manBuf->hostBuffer = HostBuffer(vol, type);
mDeviceBindings.emplace_back(manBuf->deviceBuffer.data());
mManagedBuffers.emplace_back(std::move(manBuf));
}
}
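- BufferManager manages multiple ManagedBuffer objects and stores the device pointer of each ManagedBuffer's deviceBuffer in mDeviceBindings.
- The BufferManager constructor iterates over all inputs and outputs of the network with mEngine->getNbBindings(). One detail matters here: the loop index i corresponds one-to-one with the tensor names, so the binding index found by looking up a tensor's name equals i. For each binding the constructor reads the dimensions dims and the data type type, computes the element count vol that the tensor needs, and allocates device and host storage by constructing the ManagedBuffer's DeviceBuffer and HostBuffer (used later to move input data from the CPU host to the GPU device). It then saves the device pointer in mDeviceBindings and the ManagedBuffer in mManagedBuffers, so the BufferManager ends up holding device and host storage for every input and output.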
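The size computation in the constructor can be followed with plain integers. The sketch below mirrors the arithmetic for the case where no execution context is passed (so the batch size is the starting factor). bindingVolume is a hypothetical helper and the dimensions in the usage note are illustrative values, not data read from an engine.

```cpp
#include <cstddef>
#include <vector>

// Ceiling division, as used for the vectorized dimension.
int divUp(int a, int b)
{
    return (a + b - 1) / b;
}

// Hypothetical helper mirroring the constructor's arithmetic.
// dims: binding dimensions; vecDim: index of the vectorized dimension, or -1 if none.
std::size_t bindingVolume(std::vector<int> dims, int batchSize, int vecDim, int scalarsPerVec)
{
    std::size_t vol = static_cast<std::size_t>(batchSize);
    if (vecDim != -1)
    {
        // Round the vectorized dimension up to whole vectors, then count every lane.
        dims[vecDim] = divUp(dims[vecDim], scalarsPerVec);
        vol *= static_cast<std::size_t>(scalarsPerVec);
    }
    for (int d : dims)
    {
        vol *= static_cast<std::size_t>(d);
    }
    return vol;
}
```

For a 1x28x28 input with batch size 1 and no vectorized dimension this gives 784 elements. With a vectorized first dimension of 3 channels packed 4 per vector and a 5x5 plane, the padded count is 1 * 4 * 25 = 100 elements.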
bool SampleMNIST::infer()
{
......
// Pick a random digit to try to infer
srand(time(NULL));
const int digit = rand() % 10;
// Read the input data into the managed buffers
// There should be just 1 input tensor
assert(mParams.inputTensorNames.size() == 1);
if (!processInput(buffers, mParams.inputTensorNames[0], digit))
{
return false;
}
......
bool SampleMNIST::processInput(
const samplesCommon::BufferManager& buffers, const std::string& inputTensorName, int inputFileIdx) const
{
const int inputH = mInputDims.d[1];
const int inputW = mInputDims.d[2];
// Read a random digit file
srand(unsigned(time(nullptr)));
std::vector<uint8_t> fileData(inputH * inputW);
readPGMFile(locateFile(std::to_string(inputFileIdx) + ".pgm", mParams.dataDirs), fileData.data(), inputH, inputW);
// Print ASCII representation of digit
gLogInfo << "Input:\n";
for (int i = 0; i < inputH * inputW; i++)
{
gLogInfo << (" .:-=+*#%@"[fileData[i] / 26]) << (((i + 1) % inputW) ? "" : "\n");
}
gLogInfo << std::endl;
float* hostInputBuffer = static_cast<float*>(buffers.getHostBuffer(inputTensorName));
for (int i = 0; i < inputH * inputW; i++)
{
hostInputBuffer[i] = float(fileData[i]);
}
return true;
}
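- With the BufferManager in place, processInput loads the input data: it builds a file name from the randomly chosen digit and reads the image with readPGMFile.
- buffers.getHostBuffer(inputTensorName) uses the input tensor's name to find its binding index, then the matching HostBuffer, and returns the CPU host pointer, as the code below shows.
- inputH * inputW gives the size of the input; the loop copies each pixel read from the file into host memory as a float ( hostInputBuffer[i] = float(fileData[i]); ).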
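processInput also prints the digit as ASCII art by indexing a ten-character ramp with fileData[i] / 26. A standalone sketch of that mapping follows; pixelToAscii and renderAscii are hypothetical names, and the output is returned as a string instead of being written to the sample's logger.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Map a grey level (0..255) to one of ten characters, densest last.
char pixelToAscii(std::uint8_t value)
{
    static const char kRamp[] = " .:-=+*#%@";
    return kRamp[value / 26]; // 255 / 26 == 9, so the index stays within 0..9
}

// Render a row-major image as text, one line per image row.
std::string renderAscii(const std::vector<std::uint8_t>& pixels, int width)
{
    std::string out;
    for (std::size_t i = 0; i < pixels.size(); ++i)
    {
        out += pixelToAscii(pixels[i]);
        if ((i + 1) % static_cast<std::size_t>(width) == 0)
        {
            out += '\n';
        }
    }
    return out;
}
```

Dividing by 26 works because the largest 8-bit value, 255, maps to index 9, the last character of the ramp.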
void* getDeviceBuffer(const std::string& tensorName) const
{
return getBuffer(false, tensorName);
}
void* getHostBuffer(const std::string& tensorName) const
{
return getBuffer(true, tensorName);
}
......
void* getBuffer(const bool isHost, const std::string& tensorName) const
{
int index = mEngine->getBindingIndex(tensorName.c_str());
if (index == -1)
return nullptr;
return (isHost ? mManagedBuffers[index]->hostBuffer.data() : mManagedBuffers[index]->deviceBuffer.data());
}
bool SampleMNIST::infer()
{
......
// Create CUDA stream for the execution of this inference.
cudaStream_t stream;
CHECK(cudaStreamCreate(&stream));
// Asynchronously copy data from host input buffers to device input buffers
buffers.copyInputToDeviceAsync(stream);
......
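- cudaStreamCreate creates the CUDA stream on which the GPU work for this inference is issued.
- buffers.copyInputToDeviceAsync asynchronously transfers the input data that processInput read from the CPU side to the GPU device. As the code below shows, copyInputToDeviceAsync ends up calling cudaMemcpyAsync, with memcpyBuffers choosing the direction (CPU -> GPU or GPU -> CPU).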
void copyInputToDeviceAsync(const cudaStream_t& stream = 0)
{
memcpyBuffers(true, false, true, stream);
}
......
void memcpyBuffers(const bool copyInput, const bool deviceToHost, const bool async, const cudaStream_t& stream = 0)
{
for (int i = 0; i < mEngine->getNbBindings(); i++)
{
void* dstPtr
= deviceToHost ? mManagedBuffers[i]->hostBuffer.data() : mManagedBuffers[i]->deviceBuffer.data();
const void* srcPtr
= deviceToHost ? mManagedBuffers[i]->deviceBuffer.data() : mManagedBuffers[i]->hostBuffer.data();
const size_t byteSize = mManagedBuffers[i]->hostBuffer.nbBytes();
const cudaMemcpyKind memcpyType = deviceToHost ? cudaMemcpyDeviceToHost : cudaMemcpyHostToDevice;
if ((copyInput && mEngine->bindingIsInput(i)) || (!copyInput && !mEngine->bindingIsInput(i)))
{
if (async)
CHECK(cudaMemcpyAsync(dstPtr, srcPtr, byteSize, memcpyType, stream));
else
CHECK(cudaMemcpy(dstPtr, srcPtr, byteSize, memcpyType));
}
}
}
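The selection logic in memcpyBuffers can be exercised without a GPU. In the sketch below both the "host" and "device" sides are ordinary vectors and std::copy stands in for cudaMemcpy, so it only demonstrates which bindings are copied and in which direction. FakeBinding, shouldCopy, and copyBindings are hypothetical names.

```cpp
#include <algorithm>
#include <vector>

// Hypothetical stand-in for one binding: both "sides" live in host memory here.
// The two vectors are assumed to have the same length.
struct FakeBinding
{
    bool isInput;
    std::vector<float> host;
    std::vector<float> device;
};

// A binding is copied when its kind matches the request:
// inputs when copyInput is true, outputs when it is false.
bool shouldCopy(bool copyInput, bool isInput)
{
    return copyInput == isInput;
}

// Mirrors the pointer selection in memcpyBuffers, with std::copy in place of cudaMemcpy.
void copyBindings(std::vector<FakeBinding>& bindings, bool copyInput, bool deviceToHost)
{
    for (auto& b : bindings)
    {
        if (!shouldCopy(copyInput, b.isInput))
        {
            continue;
        }
        std::vector<float>& dst = deviceToHost ? b.host : b.device;
        const std::vector<float>& src = deviceToHost ? b.device : b.host;
        std::copy(src.begin(), src.end(), dst.begin());
    }
}
```

The condition (copyInput && isInput) || (!copyInput && !isInput) in the original is the same test as copyInput == isInput, which is how shouldCopy expresses it.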
bool SampleMNIST::infer()
{
......
// Asynchronously enqueue the inference work
if (!context->enqueue(mParams.batchSize, buffers.getDeviceBindings().data(), stream, nullptr))
{
return false;
}
// Asynchronously copy data from device output buffers to host output buffers
buffers.copyOutputToHostAsync(stream);
// Wait for the work in the stream to complete
cudaStreamSynchronize(stream);
// Release stream
cudaStreamDestroy(stream);
// Check and print the output of the inference
// There should be just one output tensor
assert(mParams.outputTensorNames.size() == 1);
bool outputCorrect = verifyOutput(buffers, mParams.outputTensorNames[0], digit);
return outputCorrect;
}
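- context->enqueue asks TensorRT to run inference. Its arguments are the batchSize, the device pointers of the inputs and outputs (the input data was already queued for transfer to the device by copyInputToDeviceAsync), and the CUDA stream on which the work runs.
- buffers.copyOutputToHostAsync copies the TensorRT results from the device output buffers to the host output buffers.
- cudaStreamSynchronize blocks until all the work queued above has completed; after that, the host output buffers in buffers hold the inference result.
- cudaStreamDestroy(stream) releases the CUDA stream.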
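The ordering guarantee that infer relies on (copy in, compute, copy out, then synchronize) can be modelled with a simple FIFO. This is an analogy only, not the CUDA API: FakeStream is a hypothetical class, and unlike a real stream, which starts executing work in the background as soon as it is submitted, this model defers all work until synchronize() is called.

```cpp
#include <functional>
#include <queue>
#include <utility>

// Hypothetical model of a stream: a FIFO of pending work items.
class FakeStream
{
public:
    // "Async" call: record the work and return immediately.
    void enqueue(std::function<void()> work)
    {
        mPending.push(std::move(work));
    }

    // Run everything queued so far, in submission order.
    void synchronize()
    {
        while (!mPending.empty())
        {
            std::function<void()> work = std::move(mPending.front());
            mPending.pop();
            work();
        }
    }

private:
    std::queue<std::function<void()>> mPending;
};
```

The point the model captures is that results written by queued work may only be read on the host after synchronization, which is why verifyOutput is called after cudaStreamSynchronize in the sample.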
bool SampleMNIST::verifyOutput(
const samplesCommon::BufferManager& buffers, const std::string& outputTensorName, int groundTruthDigit) const
{
const float* prob = static_cast<const float*>(buffers.getHostBuffer(outputTensorName));
// Print histogram of the output distribution
gLogInfo << "Output:\n";
float val{0.0f};
int idx{0};
const int kDIGITS = 10;
for (int i = 0; i < kDIGITS; i++)
{
if (val < prob[i])
{
val = prob[i];
idx = i;
}
gLogInfo << i << ": " << std::string(int(std::floor(prob[i] * 10 + 0.5f)), '*') << "\n";
}
gLogInfo << std::endl;
return (idx == groundTruthDigit && val > 0.9f);
}
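- verifyOutput checks whether the inference result is correct.
- buffers.getHostBuffer(outputTensorName) uses the output tensor's name to find its binding index, then the matching HostBuffer and its data pointer prob.
- The loop walks over the 10 entries of prob, prints a histogram of the distribution, and tracks the entry with the highest probability.
- Finally, the output is judged correct if the most probable digit equals groundTruthDigit and its probability exceeds 0.9.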
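The check described above reduces to an argmax plus a confidence threshold. The sketch below restates it on a plain vector; argMax, histogramBar, and isConfidentMatch are hypothetical names, and the probabilities are assumed to lie in [0, 1].

```cpp
#include <cmath>
#include <cstddef>
#include <string>
#include <vector>

// Index of the largest probability (the first one wins on ties); 0 for an empty vector.
std::size_t argMax(const std::vector<float>& prob)
{
    std::size_t best = 0;
    for (std::size_t i = 1; i < prob.size(); ++i)
    {
        if (prob[i] > prob[best])
        {
            best = i;
        }
    }
    return best;
}

// One histogram bar: p * 10 rounded to the nearest integer, drawn as asterisks.
std::string histogramBar(float p)
{
    return std::string(static_cast<std::size_t>(std::floor(p * 10 + 0.5f)), '*');
}

// Correct only if the top class matches the label and its probability exceeds the threshold.
bool isConfidentMatch(const std::vector<float>& prob, std::size_t groundTruth, float threshold)
{
    if (prob.empty())
    {
        return false;
    }
    const std::size_t idx = argMax(prob);
    return idx == groundTruth && prob[idx] > threshold;
}
```

Requiring a probability above the threshold means a correct but low-confidence prediction is still reported as a failure, which matches the val > 0.9f condition in the sample.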
Releasing resources
bool SampleMNIST::teardown()
{
//! Clean up the libprotobuf files as the parsing is complete
//! \note It is not safe to use any other part of the protocol buffers library after
//! ShutdownProtobufLibrary() has been called.
nvcaffeparser1::shutdownProtobufLibrary();
return true;
}
......
int main(int argc, char** argv)
{
.......
if (!sample.teardown())
{
return gLogger.reportFail(sampleTest);
}
return gLogger.reportPass(sampleTest);
}
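- Finally, teardown shuts down the protobuf library used by the Caffe parser, which completes the whole process of building the network and running inference.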