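Preface

TensorRT's "hello world" program, sampleMNIST, is a good starting point for TensorRT beginners. This article walks through the sampleMNIST code in detail, working from practice to explain TensorRT's core concepts, its relationship to CUDA, and how the core APIs are used.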

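Code analysis

The sampleMNIST source is on GitHub: https://github.com/NVIDIA/TensorRT/blob/release/6.0/samples/opensource/sampleMNIST/sampleMNIST.cpp

The program runs in four stages: main and command-line parameter initialization -> network construction -> network inference -> resource release. Each stage is analyzed in turn below.

Main entry point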

void printHelpInfo()
{
    std::cout
        << "Usage: ./sample_mnist [-h or --help] [-d or --datadir=<path to data directory>] [--useDLACore=<int>]\n";
    std::cout << "--help          Display help information\n";
    std::cout << "--datadir       Specify path to a data directory, overriding the default. This option can be used "
                 "multiple times to add multiple directories. If no data directories are given, the default is to use "
                 "(data/samples/mnist/, data/mnist/)"
              << std::endl;
    std::cout << "--useDLACore=N  Specify a DLA engine for layers that support DLA. Value can range from 0 to n-1, "
                 "where n is the number of DLA engines on the platform."
              << std::endl;
    std::cout << "--int8          Run in Int8 mode.\n";
    std::cout << "--fp16          Run in FP16 mode.\n";
}

int main(int argc, char** argv)
{
    samplesCommon::Args args;
    bool argsOK = samplesCommon::parseArgs(args, argc, argv);

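main starts by parsing the program's command-line arguments. They let the user specify the directory holding the Caffe model files, which DLA core to use, and whether to run in INT8 or FP16 mode; see printHelpInfo() for the full list.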
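To make the flag handling concrete, here is a minimal sketch of how options like these could be parsed. It is not the samplesCommon::parseArgs implementation: SketchArgs and parseSketchArgs are hypothetical names introduced for illustration, the -1 default for the DLA core is an assumption, and only the flag spellings are taken from printHelpInfo().

```cpp
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical stand-in for samplesCommon::Args; field names mirror those the sample reads.
struct SketchArgs
{
    bool help{false};
    bool runInInt8{false};
    bool runInFp16{false};
    int useDLACore{-1}; // assumption: -1 means "do not use DLA"
    std::vector<std::string> dataDirs;
};

// Parses the flags listed by printHelpInfo(). Returns false on an unknown or malformed flag.
inline bool parseSketchArgs(SketchArgs& args, const std::vector<std::string>& argv)
{
    const std::string dataDirPrefix = "--datadir=";
    const std::string dlaPrefix = "--useDLACore=";
    for (const std::string& a : argv)
    {
        if (a == "-h" || a == "--help")
        {
            args.help = true;
        }
        else if (a == "--int8")
        {
            args.runInInt8 = true;
        }
        else if (a == "--fp16")
        {
            args.runInFp16 = true;
        }
        else if (a.compare(0, dataDirPrefix.size(), dataDirPrefix) == 0)
        {
            // --datadir may be repeated; every occurrence adds one directory.
            args.dataDirs.push_back(a.substr(dataDirPrefix.size()));
        }
        else if (a.compare(0, dlaPrefix.size(), dlaPrefix) == 0)
        {
            try
            {
                args.useDLACore = std::stoi(a.substr(dlaPrefix.size()));
            }
            catch (const std::exception&)
            {
                return false; // the value after '=' was not an integer
            }
        }
        else
        {
            return false;
        }
    }
    return true;
}
```

Calling parseSketchArgs with {"--fp16", "--datadir=data/mnist/"} would set runInFp16 and record one data directory, which is the information initializeSampleParams consumes next.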

samplesCommon::CaffeSampleParams initializeSampleParams(const samplesCommon::Args& args)
{
    samplesCommon::CaffeSampleParams params;
    if (args.dataDirs.empty()) //!< Use default directories if user hasn't provided directory paths
    {
        params.dataDirs.push_back("data/mnist/");
        params.dataDirs.push_back("data/samples/mnist/");
    }
    else //!< Use the data directory provided by the user
    {
        params.dataDirs = args.dataDirs;
    }

    params.prototxtFileName = locateFile("mnist.prototxt", params.dataDirs);
    params.weightsFileName = locateFile("mnist.caffemodel", params.dataDirs);
    params.meanFileName = locateFile("mnist_mean.binaryproto", params.dataDirs);
    params.inputTensorNames.push_back("data");
    params.batchSize = 1;
    params.outputTensorNames.push_back("prob");
    params.dlaCore = args.useDLACore;
    params.int8 = args.runInInt8;
    params.fp16 = args.runInFp16;

    return params;
}

......

int main(int argc, char** argv)
{
......
samplesCommon::CaffeSampleParams params = initializeSampleParams(args);

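A CaffeSampleParams instance is built from the command-line arguments. It holds the data directories for the Caffe model (the defaults are used when the user supplies none), the MNIST prototxt file, the caffemodel weights file, and the mean binaryproto file. It names the network's input tensor "data" and its output tensor "prob", sets the batch size to 1, and records from the user's arguments which DLA core (if any) to use and whether to run in INT8 or FP16 mode.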

class SampleMNIST
{
    template <typename T>
    using SampleUniquePtr = std::unique_ptr<T, samplesCommon::InferDeleter>;

public:
    SampleMNIST(const samplesCommon::CaffeSampleParams& params)
        : mParams(params)

......

int main(int argc, char** argv)
{
......

SampleMNIST sample(params);
    gLogInfo << "Building and running a GPU inference engine for MNIST" << std::endl;

The SampleMNIST object is constructed with the CaffeSampleParams as its configuration, and the configuration is stored in mParams.
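The SampleUniquePtr alias in the class pairs std::unique_ptr with a custom deleter. The sketch below shows that pattern in isolation, assuming (as in TensorRT 6) that objects are released through a destroy() member rather than delete. FakeTrtObject, SketchInferDeleter and SketchUniquePtr are hypothetical stand-ins, not the sample's own types.

```cpp
#include <memory>

// Hypothetical stand-in for a TensorRT interface object that must be released via destroy().
struct FakeTrtObject
{
    explicit FakeTrtObject(int& counter)
        : destroyCalls(counter)
    {
    }

    void destroy()
    {
        ++destroyCalls; // record the call so the behaviour can be observed
        delete this;
    }

    int& destroyCalls;
};

// Deleter in the spirit of samplesCommon::InferDeleter: call destroy() instead of delete.
struct SketchInferDeleter
{
    template <typename T>
    void operator()(T* obj) const
    {
        if (obj)
        {
            obj->destroy();
        }
    }
};

template <typename T>
using SketchUniquePtr = std::unique_ptr<T, SketchInferDeleter>;

// Creates one object, lets the smart pointer go out of scope, and
// returns how many times destroy() ran as a result.
inline int destroyCallsAfterScopeExit()
{
    int calls = 0;
    {
        SketchUniquePtr<FakeTrtObject> p(new FakeTrtObject(calls));
    } // p leaves scope here and the deleter runs
    return calls;
}
```

The benefit of this design is that every early "return false" in build() and infer() still releases whatever was created so far, with no explicit cleanup code.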

int main(int argc, char** argv)
{
......

    if (!sample.build())
    {
        return gLogger.reportFail(sampleTest);
    }

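The SampleMNIST object then builds the MNIST deep learning network. The build method of the network construction stage is analyzed in detail next.

Network construction (build) stage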

bool SampleMNIST::build()
{
    auto builder = SampleUniquePtr<nvinfer1::IBuilder>(nvinfer1::createInferBuilder(gLogger.getTRTLogger()));
    if (!builder)
    {
        return false;
    }

    auto network = SampleUniquePtr<nvinfer1::INetworkDefinition>(builder->createNetwork());
    if (!network)
    {
        return false;
    }

    auto config = SampleUniquePtr<nvinfer1::IBuilderConfig>(builder->createBuilderConfig());
    if (!config)
    {
        return false;
    }

    auto parser = SampleUniquePtr<nvcaffeparser1::ICaffeParser>(nvcaffeparser1::createCaffeParser());
    if (!parser)
    {
        return false;
    }

    constructNetwork(parser, network);

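This is the standard TensorRT flow: create an IBuilder from the logger, use the IBuilder to create both the INetworkDefinition and the IBuilderConfig, and create an ICaffeParser for parsing the Caffe model. constructNetwork then uses the ICaffeParser to parse the Caffe model and populate the INetworkDefinition with a network that TensorRT can optimize and run.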

void SampleMNIST::constructNetwork(
    SampleUniquePtr<nvcaffeparser1::ICaffeParser>& parser, SampleUniquePtr<nvinfer1::INetworkDefinition>& network)
{
    const nvcaffeparser1::IBlobNameToTensor* blobNameToTensor = parser->parse(
        mParams.prototxtFileName.c_str(), mParams.weightsFileName.c_str(), *network, nvinfer1::DataType::kFLOAT);

    for (auto& s : mParams.outputTensorNames)
    {
        network->markOutput(*blobNameToTensor->find(s.c_str()));
    }

    // add mean subtraction to the beginning of the network
    nvinfer1::Dims inputDims = network->getInput(0)->getDimensions();
    mMeanBlob
        = SampleUniquePtr<nvcaffeparser1::IBinaryProtoBlob>(parser->parseBinaryProto(mParams.meanFileName.c_str()));
    nvinfer1::Weights meanWeights{nvinfer1::DataType::kFLOAT, mMeanBlob->getData(), inputDims.d[1] * inputDims.d[2]};
    // For this sample, a large range based on the mean data is chosen and applied to the head of the network.
    // After the mean subtraction occurs, the range is expected to be between -127 and 127, so the rest of the network
    // is given a generic range.
    // The preferred method is use scales computed based on a representative data set
    // and apply each one individually based on the tensor. The range here is large enough for the
    // network, but is chosen for example purposes only.
    float maxMean
        = samplesCommon::getMaxValue(static_cast<const float*>(meanWeights.values), samplesCommon::volume(inputDims));

    auto mean = network->addConstant(nvinfer1::Dims3(1, inputDims.d[1], inputDims.d[2]), meanWeights);
    mean->getOutput(0)->setDynamicRange(-maxMean, maxMean);
    network->getInput(0)->setDynamicRange(-maxMean, maxMean);
    auto meanSub = network->addElementWise(*network->getInput(0), *mean->getOutput(0), ElementWiseOperation::kSUB);
    meanSub->getOutput(0)->setDynamicRange(-maxMean, maxMean);
    network->getLayer(0)->setInput(0, *meanSub->getOutput(0));
    samplesCommon::setAllTensorScales(network.get(), 127.0f, 127.0f);
}

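parser->parse reads the Caffe prototxt and weights files, populates network, and returns blobNameToTensor, an object that looks up ITensor objects by name.
blobNameToTensor->find locates each output ITensor named in the parameters, and network->markOutput marks it as a network output.
network->getInput(0)->getDimensions() fetches the network's input ITensor and returns its Dims.
parser->parseBinaryProto parses the Caffe mean file (mnist_mean.binaryproto) and wraps it in an IBinaryProtoBlob.
meanWeights is a Weights structure over that mean data: its values come from mMeanBlob->getData() and its element count is inputDims.d[1] * inputDims.d[2].
A mean-subtraction step is then placed in front of the network's input, with a dynamic range set on each tensor involved (the ranges that INT8 mode relies on). The steps are:

network->addConstant creates an IConstantLayer that holds the mean data; the Dims3(1, inputDims.d[1], inputDims.d[2]) argument is the shape of its output, since a constant layer has no input.
network->addElementWise creates an IElementWiseLayer that takes the original network input and the IConstantLayer output and subtracts the second from the first (kSUB).
Finally, network->getLayer(0)->setInput rewires the first layer of the original network to read the IElementWiseLayer output instead of the raw input, which completes the preprocessing.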
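The arithmetic that these layers perform can be sketched in plain C++, without TensorRT. The helper names maxValue and subtractMean are hypothetical; maxValue plays the role of samplesCommon::getMaxValue, which is assumed here to return the largest element of the buffer.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Largest element of the mean image (assumed behaviour of the sample's getMaxValue helper).
// The sample uses this value as the symmetric bound passed to setDynamicRange.
// Precondition: v is not empty.
inline float maxValue(const std::vector<float>& v)
{
    return *std::max_element(v.begin(), v.end());
}

// Element-wise input - mean: the computation of the kSUB element-wise layer.
// Precondition: both vectors have the same length (H * W in the sample).
inline std::vector<float> subtractMean(const std::vector<float>& input, const std::vector<float>& mean)
{
    std::vector<float> out(input.size());
    for (std::size_t i = 0; i < input.size(); ++i)
    {
        out[i] = input[i] - mean[i];
    }
    return out;
}
```

In the sample this subtraction is part of the network graph, so it runs on the GPU as part of the engine rather than as a separate host-side loop.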

bool SampleMNIST::build()
{
......   
    builder->setMaxBatchSize(mParams.batchSize);
    config->setMaxWorkspaceSize(16_MiB);
    config->setFlag(BuilderFlag::kGPU_FALLBACK);
    config->setFlag(BuilderFlag::kSTRICT_TYPES);
    if (mParams.fp16)
    {
        config->setFlag(BuilderFlag::kFP16);
    }
    if (mParams.int8)
    {
        config->setFlag(BuilderFlag::kINT8);
    }

    samplesCommon::enableDLA(builder.get(), config.get(), mParams.dlaCore);

    mEngine = std::shared_ptr<nvinfer1::ICudaEngine>(
        builder->buildEngineWithConfig(*network, *config), samplesCommon::InferDeleter());

    if (!mEngine)
        return false;

    assert(network->getNbInputs() == 1);
    mInputDims = network->getInput(0)->getDimensions();
    assert(mInputDims.nbDims == 3);

    return true;
}

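After constructNetwork returns, the builder is given the batchSize from the program parameters as its maximum batch size.
config sets the maximum workspace size (16 MiB), which is the scratch memory TensorRT may use for the layers, along with the related builder flags.
enableDLA configures whether NVIDIA's Deep Learning Accelerator (DLA) is used for hardware acceleration.
buildEngineWithConfig takes the network and config objects and creates the ICudaEngine used by the inference stage that follows.
Finally, assertions confirm that the network has exactly one input and that this input has 3 dimensions.

Network inference (infer) stage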
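The 16_MiB token in setMaxWorkspaceSize(16_MiB) is a C++ user-defined literal. The sample's own definition lives in its common headers; the version below is a re-creation written for illustration, so its exact signature is an assumption.

```cpp
// Re-creation of a mebibyte literal (assumed shape; not copied from the sample's headers).
// 1 MiB = 2^20 bytes, so 16_MiB evaluates to 16 * 1048576.
constexpr long long int operator""_MiB(unsigned long long val)
{
    return static_cast<long long int>(val) * (1LL << 20);
}

// Evaluated at compile time, so a mistake in the arithmetic would fail the build.
static_assert(16_MiB == 16777216LL, "16 MiB should be 16 * 2^20 bytes");
```

Writing the size this way keeps the unit visible at the call site, which is easier to read than a bare 16777216.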

bool SampleMNIST::infer()
{
    // Create RAII buffer manager object
    samplesCommon::BufferManager buffers(mEngine, mParams.batchSize);

    auto context = SampleUniquePtr<nvinfer1::IExecutionContext>(mEngine->createExecutionContext());
    if (!context)
    {
        return false;
    }

    // Pick a random digit to try to infer
    srand(time(NULL));
    const int digit = rand() % 10;

    // Read the input data into the managed buffers
    // There should be just 1 input tensor
    assert(mParams.inputTensorNames.size() == 1);
    if (!processInput(buffers, mParams.inputTensorNames[0], digit))
    {
        return false;
    }

.....


int main(int argc, char** argv)
{

......

if (!sample.infer())
    {
        return gLogger.reportFail(sampleTest);
    }

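After build succeeds, main calls infer to run network inference.
infer first constructs a BufferManager, a helper class that creates and manages host and device memory.

The template class GenericBuffer takes the template parameters AllocFunc and FreeFunc to select whether storage comes from the host or the device. As the code below shows, DeviceAllocator/DeviceFree use cudaMalloc/cudaFree to allocate and release GPU device memory, while HostAllocator/HostFree use malloc/free to allocate and release CPU host memory.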

class DeviceAllocator
{
public:
    bool operator()(void** ptr, size_t size) const
    {
        return cudaMalloc(ptr, size) == cudaSuccess;
    }
};

class DeviceFree
{
public:
    void operator()(void* ptr) const
    {
        cudaFree(ptr);
    }
};

......

class HostAllocator
{
public:
    bool operator()(void** ptr, size_t size) const
    {
        *ptr = malloc(size);
        return *ptr != nullptr;
    }
};

class HostFree
{
public:
    void operator()(void* ptr) const
    {
        free(ptr);
    }
};

A ManagedBuffer object pairs a deviceBuffer with a hostBuffer so that device and host storage are managed together.
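How the allocator and free functors plug into a buffer class can be sketched as follows. This is a simplified, host-only illustration, not the sample's GenericBuffer: the SketchBuffer class and its constructor taking an element size in bytes are assumptions made to keep the example free of CUDA and TensorRT types.

```cpp
#include <cstddef>
#include <cstdlib>
#include <new>

// Host-side functors with the same call shape as HostAllocator / HostFree above.
struct SketchHostAllocator
{
    bool operator()(void** ptr, std::size_t size) const
    {
        *ptr = std::malloc(size);
        return *ptr != nullptr;
    }
};

struct SketchHostFree
{
    void operator()(void* ptr) const
    {
        std::free(ptr);
    }
};

// Simplified RAII buffer: allocation and release are delegated to the two functor types,
// so the same class body can serve host memory or (with CUDA functors) device memory.
template <typename AllocFunc, typename FreeFunc>
class SketchBuffer
{
public:
    SketchBuffer(std::size_t count, std::size_t elementSize)
        : mCount(count)
        , mElementSize(elementSize)
    {
        if (!AllocFunc()(&mData, nbBytes()))
        {
            throw std::bad_alloc();
        }
    }

    ~SketchBuffer()
    {
        FreeFunc()(mData);
    }

    // Owning a raw allocation, so copying is disabled to avoid a double free.
    SketchBuffer(const SketchBuffer&) = delete;
    SketchBuffer& operator=(const SketchBuffer&) = delete;

    void* data() const { return mData; }
    std::size_t nbBytes() const { return mCount * mElementSize; }

private:
    std::size_t mCount;
    std::size_t mElementSize;
    void* mData{nullptr};
};

using SketchHostBuffer = SketchBuffer<SketchHostAllocator, SketchHostFree>;
```

A device flavour would swap in functors that call cudaMalloc and cudaFree, exactly as DeviceAllocator and DeviceFree do in the listing above.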

    BufferManager(std::shared_ptr<nvinfer1::ICudaEngine> engine, const int& batchSize,
        const nvinfer1::IExecutionContext* context = nullptr)
        : mEngine(engine)
        , mBatchSize(batchSize)
    {
        // Create host and device buffers
        for (int i = 0; i < mEngine->getNbBindings(); i++)
        {
            auto dims = context ? context->getBindingDimensions(i) : mEngine->getBindingDimensions(i);
            size_t vol = context ? 1 : static_cast<size_t>(mBatchSize);
            nvinfer1::DataType type = mEngine->getBindingDataType(i);
            int vecDim = mEngine->getBindingVectorizedDim(i);
            if (-1 != vecDim) // i.e., 0 != lgScalarsPerVector
            {
                int scalarsPerVec = mEngine->getBindingComponentsPerElement(i);
                dims.d[vecDim] = divUp(dims.d[vecDim], scalarsPerVec);
                vol *= scalarsPerVec;
            }
            vol *= samplesCommon::volume(dims);
            std::unique_ptr<ManagedBuffer> manBuf{new ManagedBuffer()};
            manBuf->deviceBuffer = DeviceBuffer(vol, type);
            manBuf->hostBuffer = HostBuffer(vol, type);
            mDeviceBindings.emplace_back(manBuf->deviceBuffer.data());
            mManagedBuffers.emplace_back(std::move(manBuf));
        }
    }

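The BufferManager in turn manages several ManagedBuffer objects, and stores the device memory pointer of each ManagedBuffer's deviceBuffer in mDeviceBindings.
The constructor iterates over all inputs and outputs of the network through mEngine->getNbBindings(). One detail matters here: the loop index i corresponds one-to-one with a tensor name, so the binding index found by looking up a tensor's name equals that index i. For each input or output, the constructor gets its dimensions (dims) and data type (type), and computes the number of elements (vol) its ITensor data needs. It then constructs the DeviceBuffer and HostBuffer of a ManagedBuffer to allocate device and host storage (used later to move input data from the CPU host to the GPU device), saves the device data pointer in mDeviceBindings, and appends the ManagedBuffer to the BufferManager's list. When the loop ends, the BufferManager owns device and host storage for every input and output.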
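The element-count arithmetic in that constructor can be isolated in a short sketch. It covers the case where no execution context is passed (so the batch size multiplies the volume), and the function name bindingElementCount is hypothetical. Dimensions are passed as a plain vector instead of nvinfer1::Dims to keep the example self-contained.

```cpp
#include <cstddef>
#include <vector>

// Integer division rounding up, as used for vectorized dimensions.
inline int divUp(int a, int b)
{
    return (a + b - 1) / b;
}

// Number of elements to allocate for one binding.
// vecDim == -1 means the binding is not vectorized; otherwise dims[vecDim] is packed
// into groups of scalarsPerVec and the padding is included in the result.
inline std::size_t bindingElementCount(std::vector<int> dims, int batchSize, int vecDim, int scalarsPerVec)
{
    std::size_t vol = static_cast<std::size_t>(batchSize);
    if (vecDim != -1)
    {
        dims[vecDim] = divUp(dims[vecDim], scalarsPerVec);
        vol *= static_cast<std::size_t>(scalarsPerVec);
    }
    for (int d : dims)
    {
        vol *= static_cast<std::size_t>(d);
    }
    return vol;
}
```

For the MNIST input of shape (1, 28, 28) with batch size 1 and no vectorization, this yields 784 elements, which is the size of one 28 x 28 image.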

bool SampleMNIST::infer()
{

......

    // Pick a random digit to try to infer
    srand(time(NULL));
    const int digit = rand() % 10;

    // Read the input data into the managed buffers
    // There should be just 1 input tensor
    assert(mParams.inputTensorNames.size() == 1);
    if (!processInput(buffers, mParams.inputTensorNames[0], digit))
    {
        return false;
    }

......

bool SampleMNIST::processInput(
    const samplesCommon::BufferManager& buffers, const std::string& inputTensorName, int inputFileIdx) const
{
    const int inputH = mInputDims.d[1];
    const int inputW = mInputDims.d[2];

    // Read a random digit file
    srand(unsigned(time(nullptr)));
    std::vector<uint8_t> fileData(inputH * inputW);
    readPGMFile(locateFile(std::to_string(inputFileIdx) + ".pgm", mParams.dataDirs), fileData.data(), inputH, inputW);

    // Print ASCII representation of digit
    gLogInfo << "Input:\n";
    for (int i = 0; i < inputH * inputW; i++)
    {
        gLogInfo << (" .:-=+*#%@"[fileData[i] / 26]) << (((i + 1) % inputW) ? "" : "\n");
    }
    gLogInfo << std::endl;

    float* hostInputBuffer = static_cast<float*>(buffers.getHostBuffer(inputTensorName));

    for (int i = 0; i < inputH * inputW; i++)
    {
        hostInputBuffer[i] = float(fileData[i]);
    }

    return true;
}

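With the BufferManager in place, processInput loads the input data: it builds a file name from the randomly chosen digit and reads that file with readPGMFile.
As the getBuffer code below shows, buffers.getHostBuffer(inputTensorName) uses the input tensor's name to find its binding index, then returns the data pointer of the matching HostBuffer in CPU host memory.
inputH * inputW gives the number of input elements; the loop walks the data read from the file and writes each value into host memory as a float ( hostInputBuffer[i] = float(fileData[i]); ).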

    void* getDeviceBuffer(const std::string& tensorName) const
    {
        return getBuffer(false, tensorName);
    }


    void* getHostBuffer(const std::string& tensorName) const
    {
        return getBuffer(true, tensorName);
    }

......

    void* getBuffer(const bool isHost, const std::string& tensorName) const
    {
        int index = mEngine->getBindingIndex(tensorName.c_str());
        if (index == -1)
            return nullptr;
        return (isHost ? mManagedBuffers[index]->hostBuffer.data() : mManagedBuffers[index]->deviceBuffer.data());
    }
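Two small details of processInput are easy to check in isolation: the grey-level-to-character mapping used for the ASCII preview, and the byte-to-float widening of the pixel data. The sketch below reproduces both; the helper names pixelToAscii, renderAscii and toFloat are hypothetical and do not exist in the sample.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Maps an 8-bit grey level to one of ten glyphs; higher values map to denser glyphs.
// 255 / 26 == 9, so the largest pixel value lands on the last glyph.
inline char pixelToAscii(std::uint8_t pixel)
{
    static const char kRamp[] = " .:-=+*#%@";
    return kRamp[pixel / 26];
}

// Renders a row-major image as text, ending each image row with a newline.
inline std::string renderAscii(const std::vector<std::uint8_t>& pixels, int width)
{
    std::string out;
    for (std::size_t i = 0; i < pixels.size(); ++i)
    {
        out += pixelToAscii(pixels[i]);
        if ((i + 1) % static_cast<std::size_t>(width) == 0)
        {
            out += '\n';
        }
    }
    return out;
}

// Widens bytes to float without rescaling, matching hostInputBuffer[i] = float(fileData[i]).
inline std::vector<float> toFloat(const std::vector<std::uint8_t>& pixels)
{
    return std::vector<float>(pixels.begin(), pixels.end());
}
```

Note that the values stay in the 0 to 255 range after conversion: normalization is left to the mean-subtraction layer that constructNetwork inserted at the head of the network.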

bool SampleMNIST::infer()
{
......

// Create CUDA stream for the execution of this inference.
    cudaStream_t stream;
    CHECK(cudaStreamCreate(&stream));

    // Asynchronously copy data from host input buffers to device input buffers
    buffers.copyInputToDeviceAsync(stream);

......

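cudaStreamCreate creates the CUDA stream on which the GPU work for this inference is queued.
buffers.copyInputToDeviceAsync asynchronously transfers the input data that processInput wrote into host memory over to the GPU device. As the code below shows, copyInputToDeviceAsync ends up calling cudaMemcpyAsync, with the copy direction (CPU -> GPU or GPU -> CPU) chosen by the deviceToHost argument.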

    void copyInputToDeviceAsync(const cudaStream_t& stream = 0)
    {
        memcpyBuffers(true, false, true, stream);
    }

......

    void memcpyBuffers(const bool copyInput, const bool deviceToHost, const bool async, const cudaStream_t& stream = 0)
    {
        for (int i = 0; i < mEngine->getNbBindings(); i++)
        {
            void* dstPtr
                = deviceToHost ? mManagedBuffers[i]->hostBuffer.data() : mManagedBuffers[i]->deviceBuffer.data();
            const void* srcPtr
                = deviceToHost ? mManagedBuffers[i]->deviceBuffer.data() : mManagedBuffers[i]->hostBuffer.data();
            const size_t byteSize = mManagedBuffers[i]->hostBuffer.nbBytes();
            const cudaMemcpyKind memcpyType = deviceToHost ? cudaMemcpyDeviceToHost : cudaMemcpyHostToDevice;
            if ((copyInput && mEngine->bindingIsInput(i)) || (!copyInput && !mEngine->bindingIsInput(i)))
            {
                if (async)
                    CHECK(cudaMemcpyAsync(dstPtr, srcPtr, byteSize, memcpyType, stream));
                else
                    CHECK(cudaMemcpy(dstPtr, srcPtr, byteSize, memcpyType));
            }
        }
    }
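The selection logic of memcpyBuffers can be modelled on the host alone. In the sketch below a second std::vector stands in for GPU memory and std::memcpy stands in for cudaMemcpy, so it only illustrates which bindings get copied and in which direction. SketchBinding and copyBindings are hypothetical names.

```cpp
#include <cstring>
#include <vector>

// Hypothetical host-only model of one binding.
struct SketchBinding
{
    bool isInput;
    std::vector<float> host;
    std::vector<float> device; // stands in for GPU memory; plain RAM in this sketch
};

// Copies either the input bindings or the output bindings in the requested direction.
// The test "copyInput != b.isInput" skips exactly the bindings that the two-branch
// condition in memcpyBuffers rejects.
// Precondition: host and device vectors of a binding have the same length.
inline void copyBindings(std::vector<SketchBinding>& bindings, bool copyInput, bool deviceToHost)
{
    for (SketchBinding& b : bindings)
    {
        if (copyInput != b.isInput)
        {
            continue;
        }
        std::vector<float>& dst = deviceToHost ? b.host : b.device;
        const std::vector<float>& src = deviceToHost ? b.device : b.host;
        std::memcpy(dst.data(), src.data(), src.size() * sizeof(float));
    }
}
```

In these terms, copyInputToDeviceAsync corresponds to copyBindings(bindings, true, false), and copying outputs back to the host corresponds to copyBindings(bindings, false, true).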

bool SampleMNIST::infer()
{
......

    // Asynchronously enqueue the inference work
    if (!context->enqueue(mParams.batchSize, buffers.getDeviceBindings().data(), stream, nullptr))
    {
        return false;
    }
    // Asynchronously copy data from device output buffers to host output buffers
    buffers.copyOutputToHostAsync(stream);

    // Wait for the work in the stream to complete
    cudaStreamSynchronize(stream);

    // Release stream
    cudaStreamDestroy(stream);

    // Check and print the output of the inference
    // There should be just one output tensor
    assert(mParams.outputTensorNames.size() == 1);
    bool outputCorrect = verifyOutput(buffers, mParams.outputTensorNames[0], digit);

    return outputCorrect;
}

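context->enqueue asks TensorRT to run network inference. Its arguments are the batchSize, the device memory pointers of the inputs and outputs (the input data was written to host memory by processInput and then copied to the device by copyInputToDeviceAsync), and the CUDA stream on which the work is queued.
buffers.copyOutputToHostAsync copies the TensorRT results from the device-side output buffers to the CPU-side buffers.
cudaStreamSynchronize blocks until all of the work queued above has completed, so that the CPU-side output buffers in buffers now hold the inference results.
cudaStreamDestroy(stream) releases the CUDA stream.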
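The reason a single synchronize call is enough is that work submitted to one CUDA stream executes in submission order. The toy model below illustrates only that ordering guarantee. It is not CUDA: ToyStream is a hypothetical class that defers work until synchronize(), whereas a real stream starts executing on the GPU while the host continues.

```cpp
#include <functional>
#include <queue>
#include <string>
#include <utility>
#include <vector>

// Toy in-order work queue (an analogy for a stream, not an implementation of one).
class ToyStream
{
public:
    void enqueue(std::function<void()> work)
    {
        mPending.push(std::move(work));
    }

    // Runs everything queued so far, strictly in submission order.
    void synchronize()
    {
        while (!mPending.empty())
        {
            mPending.front()();
            mPending.pop();
        }
    }

private:
    std::queue<std::function<void()>> mPending;
};

// Queues the three steps of infer() and returns the order in which they ran.
inline std::vector<std::string> runInferenceOrder()
{
    std::vector<std::string> log;
    ToyStream stream;
    stream.enqueue([&log] { log.push_back("copy input H2D"); });
    stream.enqueue([&log] { log.push_back("run network"); });
    stream.enqueue([&log] { log.push_back("copy output D2H"); });
    // The host must not read the output buffer before this point.
    stream.synchronize();
    return log;
}
```

Because the copy to the host is queued after the inference on the same stream, it cannot start before the inference has produced its output, and the host only needs to wait once at the end.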

bool SampleMNIST::verifyOutput(
    const samplesCommon::BufferManager& buffers, const std::string& outputTensorName, int groundTruthDigit) const
{
    const float* prob = static_cast<const float*>(buffers.getHostBuffer(outputTensorName));

    // Print histogram of the output distribution
    gLogInfo << "Output:\n";
    float val{0.0f};
    int idx{0};
    const int kDIGITS = 10;

    for (int i = 0; i < kDIGITS; i++)
    {
        if (val < prob[i])
        {
            val = prob[i];
            idx = i;
        }

        gLogInfo << i << ": " << std::string(int(std::floor(prob[i] * 10 + 0.5f)), '*') << "\n";
    }
    gLogInfo << std::endl;

    return (idx == groundTruthDigit && val > 0.9f);
}

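verifyOutput checks that the inference result is correct.
buffers.getHostBuffer(outputTensorName) uses the output tensor's name to find its binding index, then returns the data pointer prob of the matching HostBuffer.
The loop walks all ten entries of prob, tracks the one with the highest probability, and prints a histogram line for each digit.
Finally, the output is judged correct if the most probable digit equals the ground truth and its probability is above 0.9.

Releasing resources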
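The verification logic reduces to an argmax followed by a threshold test, which can be sketched without any TensorRT types. SketchPrediction, argmax, histogramBar and isCorrect are hypothetical names; the 0.9 threshold and the rounding formula are taken from verifyOutput.

```cpp
#include <cmath>
#include <cstddef>
#include <string>
#include <vector>

struct SketchPrediction
{
    int digit;
    float confidence;
};

// Index and value of the largest probability; ties keep the earliest index.
inline SketchPrediction argmax(const std::vector<float>& prob)
{
    SketchPrediction best{0, 0.0f};
    for (std::size_t i = 0; i < prob.size(); ++i)
    {
        if (prob[i] > best.confidence)
        {
            best = {static_cast<int>(i), prob[i]};
        }
    }
    return best;
}

// One histogram bar: the probability scaled to 0..10 and rounded to the nearest integer.
inline std::string histogramBar(float p)
{
    return std::string(static_cast<std::size_t>(std::floor(p * 10 + 0.5f)), '*');
}

// Correct only if the top digit matches and the network is confident about it.
inline bool isCorrect(const std::vector<float>& prob, int groundTruth)
{
    const SketchPrediction best = argmax(prob);
    return best.digit == groundTruth && best.confidence > 0.9f;
}
```

The confidence threshold means a prediction that is right but hesitant (say, 0.6 for the true digit) is still reported as a failure by the sample.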

bool SampleMNIST::teardown()
{
    //! Clean up the libprotobuf files as the parsing is complete
    //! \note It is not safe to use any other part of the protocol buffers library after
    //! ShutdownProtobufLibrary() has been called.
    nvcaffeparser1::shutdownProtobufLibrary();
    return true;
}

......

int main(int argc, char** argv)
{
.......

    if (!sample.teardown())
    {
        return gLogger.reportFail(sampleTest);
    }

    return gLogger.reportPass(sampleTest);
}

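Finally, teardown releases resources (here, shutting down the protobuf library used by the Caffe parser), which completes the whole process of building the network and running inference.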
