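Company requirement: get the model running and report its forward-pass (inference) time.
What this post covers: modifying the official trtexec tool (the demo used for benchmarking forward-pass time), locating and adapting a custom upsample layer, embedding that layer into trtexec, then feeding in the project's network definition and getting the forward-pass time back.
In the network I was given, the only layers TensorRT cannot recognize are the upsample layer and the final detection layer, so the main work is the custom upsample layer. After reading up on the topic and downloading a few projects from GitHub, I got one example that measures forward-pass time running (accuracy has not been tested yet).
The upsample layer is borrowed from the GitHub project TensorRT Wrapper.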


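Structure overview

TensorRT Wrapper is built around a yolov3 project and contains the source for both the yolov3 layer and the upsample layer. The upsample layer is the focus below.
Whether a custom layer is used in trtexec or in any other demo, it goes through the PluginFactory class, which instantiates the Plugin (that is, the custom layer; see TensorRT's Plugin for details).
So adding a custom upsample to trtexec mainly means changing four files: PluginFactory.h, upsample.h, upsample.cpp, and upsample.cu. They are covered in turn below.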

PluginFactory.h

First, the changes to the PluginFactory header. PluginFactory.h contains the PluginFactory class, which inherits from nvinfer1::IPluginFactory and nvcaffeparser1::IPluginFactoryExt and is responsible for instantiating Plugin layers.
Main code:

class PluginFactory : public nvinfer1::IPluginFactory, public nvcaffeparser1::IPluginFactoryExt
{
public:
    inline bool isUpsample(const char* layerName)
    {
        return std::regex_match(layerName , std::regex(R"(layer(\d*)-upsample)"));
    }


    virtual nvinfer1::IPlugin* createPlugin(const char* layerName, const nvinfer1::Weights* weights, int nbWeights) override
    {
        assert(isPlugin(layerName));
        if (isUpsample(layerName))
        {
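            // The Upsample layer takes no weights, so nbWeights is 0 and the pointer is nullptr
            assert(nbWeights == 0 && weights == nullptr);
            // The vector keeps ownership; TensorRT receives a non-owning raw pointer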
            mPluginUpsample.emplace_back(std::unique_ptr<UpsampleLayerPlugin>(new UpsampleLayerPlugin(UPSAMPLE_SCALE,CUDA_THREAD_NUM)));
            return mPluginUpsample.back().get();
        }

        else
        {
            assert(0);
            return nullptr;
        }
    }

    nvinfer1::IPlugin* createPlugin(const char* layerName, const void* serialData, size_t serialLength) override
    {
        assert(isPlugin(layerName));

        if (isUpsample(layerName))
        {
            mPluginUpsample.emplace_back(std::unique_ptr<UpsampleLayerPlugin>(new UpsampleLayerPlugin(serialData, serialLength)));
            return mPluginUpsample.back().get();
        }
        else
        {
            assert(0);
            return nullptr;
        }
    }


    bool isPlugin(const char* name) override
    {
        return isPluginExt(name);
    }

    bool isPluginExt(const char* name) override
    {
        //std::cout << "check plugin " << name  << isYolo(name)<< std::endl;
        return isUpsample(name);
    }

    // The application has to destroy the plugin when it knows it's safe to do so.
    void destroyPlugin()
    {
        for (auto& item : mPluginUpsample)
            item.reset();

    }

    void (*nvPluginDeleter)(INvPlugin*){[](INvPlugin* ptr) { if(ptr) ptr->destroy(); }};

    std::vector<std::unique_ptr<UpsampleLayerPlugin>> mPluginUpsample{};
};

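This PluginFactory class has roughly the same structure as the one defined in the official samplePlugin demo: isPlugin(), isPluginExt(), two createPlugin() overloads, and destroyPlugin().
The differences are:

The official sample only adds one FC layer, so it declares a single smart pointer.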

std::unique_ptr< FCPlugin> mPlugin{nullptr};

Here, however, the upsample code has to serve several layers, so a vector of unique_ptr is declared instead.

std::vector<std::unique_ptr< UpsampleLayerPlugin>> mPluginUpsample{};

A destroy step was also added. I have not fully worked out the details yet, so this is left as a to-do.

std::vector<std::unique_ptr< UpsampleLayerPlugin>> mPluginUpsample{};
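The ownership pattern described above (a vector of unique_ptr that hands out raw pointers) can be sketched without TensorRT. This is only an illustration under stated assumptions: DummyPlugin and DummyFactory are hypothetical stand-ins, not classes from TensorRT or TensorRT Wrapper.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <vector>

// Hypothetical stand-in for UpsampleLayerPlugin; only the scale is kept.
struct DummyPlugin {
    explicit DummyPlugin(float s) : scale(s) {}
    float scale;
};

// Owns every plugin it creates and hands out non-owning raw pointers,
// mirroring the emplace_back + back().get() pattern in createPlugin().
class DummyFactory {
public:
    DummyPlugin* create(float scale) {
        assert(scale > 0.0f);
        mPlugins.emplace_back(std::unique_ptr<DummyPlugin>(new DummyPlugin(scale)));
        return mPlugins.back().get();
    }
    // Releases every plugin; raw pointers handed out earlier become dangling.
    void destroyAll() {
        for (auto& item : mPlugins) item.reset();
    }
    std::size_t size() const { return mPlugins.size(); }
private:
    std::vector<std::unique_ptr<DummyPlugin>> mPlugins;
};
```

Because each plugin lives on the heap, a pointer returned by create() stays valid when the vector later grows and reallocates; it only becomes invalid once destroyAll() (or the factory's destructor) runs.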

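The key change to PluginFactory is the vector of smart pointers combined with the regular-expression match in isUpsample, which together handle multiple upsample layers, as the code below shows. Because the scale is 2 everywhere and is a global constant, no weights need to be passed in, hence nbWeights == 0 and weights == nullptr. The rest is almost identical to the samplePlugin demo.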

// isUpsample matches the layer name against a regular expression
return std::regex_match(layerName , std::regex(R"(layer(\d*)-upsample)"));
// A vector stores the unique_ptr for every upsample layer.
std::vector<std::unique_ptr< UpsampleLayerPlugin >> mPluginUpsample{};
// At build time and at runtime alike (the two createPlugin overloads), the new plugin is simply emplace_back'ed into the vector.
mPluginUpsample.emplace_back(std::unique_ptr< UpsampleLayerPlugin>(new UpsampleLayerPlugin(UPSAMPLE_SCALE,CUDA_THREAD_NUM)));
mPluginUpsample.emplace_back(std::unique_ptr< UpsampleLayerPlugin>(new UpsampleLayerPlugin(serialData, serialLength)));
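To see which layer names the pattern accepts, here is a small standalone sketch. The function names isUpsampleName and checkUpsampleNames are made up for this example, and the sample layer names are assumptions about what a yolov3-style prototxt might contain.

```cpp
#include <cassert>
#include <regex>
#include <string>

// Same pattern as isUpsample(): "layer", optional digits, then "-upsample".
// regex_match requires the whole name to match, not just a substring.
inline bool isUpsampleName(const std::string& layerName) {
    static const std::regex pattern(R"(layer(\d*)-upsample)");
    return std::regex_match(layerName, pattern);
}

// Usage sketch with assumed layer names.
inline void checkUpsampleNames() {
    assert(isUpsampleName("layer85-upsample"));
    assert(!isUpsampleName("layer85-conv"));
}
```

Note that `\d*` also matches zero digits, so a name like "layer-upsample" would be accepted; `\d+` would be the stricter choice if that is not wanted.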

upsample.h

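The upsample header follows the template given by TensorRT and declares the required functions.

Constructors and destructor
getNbOutputs(): returns the number of outputs
getOutputDimensions(): returns the output dimensions (a Dims structure)
supportsFormat() and configureWithFormat(): define the data type and format
getSerializationSize() and serialize(): report the serialized byte length and perform the serialization
getWorkspaceSize(): reports the workspace size required
initialize(), terminate(), and enqueue(): the core methods of the layer
forwardGpu(): declared here; it performs the upsample computation
Declarations in the private section: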

        nvinfer1::Dims mCHW;
        DataType mDataType{DataType::kFLOAT};
        float mScale;
        int mOutputWidth;
        int mOutputHeight;
        int mThreadCount;

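Apart from the extra forwardGpu, this structure matches the official samplePlugin. There is no separate deserialize method; rebuilding the plugin from a serialized byte stream is handled by the constructor that takes (data, length).
The private member mThreadCount is the thread count. The constructor defaults it to 512, and it is later used to compute the grid and block configuration when launching the CUDA kernel (grid and block are explained below).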

upsample.cpp

The individual functions are described below.
1. Constructors and destructor

    UpsampleLayerPlugin::UpsampleLayerPlugin(const float scale, const int cudaThread /*= 512*/)
            : mScale(scale),mThreadCount(cudaThread)
    {
    }

    UpsampleLayerPlugin::~UpsampleLayerPlugin()
    {
    }

    UpsampleLayerPlugin::UpsampleLayerPlugin(const void* data, size_t length)
    {
        using namespace Tn;
        const char *d = reinterpret_cast<const char *>(data), *a = d;
        read(d, mCHW);
        read(d, mDataType);
        read(d, mScale);
        read(d, mOutputWidth);
        read(d, mOutputHeight);
        read(d, mThreadCount);
        //std::cout << "read:" << a << " " << mOutputWidth<< " " <<mOutputHeight<<std::endl;
        assert(d == a + length);
    }

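The build-time constructor only assigns scale and cudaThread.
The runtime constructor (which creates a plugin from a byte stream) uses read to load each parameter from its position in the byte stream.
The destructor is empty.

2. getNbOutputs()
return 1;
3. getOutputDimensions()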

   Dims UpsampleLayerPlugin::getOutputDimensions(int index, const Dims* inputs, int nbInputDims)
    {
        mCHW = inputs[0];
        mOutputHeight = inputs[0].d[1]* mScale;
        mOutputWidth = inputs[0].d[2]* mScale;
        return Dims3(mCHW.d[0], mOutputHeight, mOutputWidth);
    }

inputs[0]: the first Dims input (the upsample layer has a single input, so only inputs[0] is used)
inputs[0].d[0]/d[1]/d[2]: C, H, and W respectively
The return value is a Dims3 describing the single output.
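The shape arithmetic can be checked in isolation. The sketch below is an assumption-laden illustration: Chw is a hypothetical struct standing in for nvinfer1::Dims3, and upsampledDims is a made-up name.

```cpp
#include <cassert>

// Hypothetical CHW triple standing in for nvinfer1::Dims3.
struct Chw { int c, h, w; };

// Channels are unchanged; height and width are multiplied by the scale.
// The float product is truncated to int, as in the original assignment.
inline Chw upsampledDims(const Chw& in, float scale) {
    assert(scale > 0.0f);
    return Chw{in.c, static_cast<int>(in.h * scale), static_cast<int>(in.w * scale)};
}
```

For example, a 256 x 13 x 13 input with scale 2 gives a 256 x 26 x 26 output. With a non-integer scale the truncation matters: 5 x 1.5 becomes 7, not 7.5.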

4. supportsFormat() and configureWithFormat()

  bool supportsFormat(DataType type, PluginFormat format) const override 
   {
           return (type == DataType::kFLOAT || type == DataType::kHALF || type == DataType::kINT8 )&& format == PluginFormat::kNCHW;
   }
  void UpsampleLayerPlugin::configureWithFormat(const Dims* inputDims, int nbInputs, const Dims* outputDims, int nbOutputs, DataType type, PluginFormat format, int maxBatchSize)
   {
       assert((type == DataType::kFLOAT || type == DataType::kHALF || type == DataType::kINT8) && format == PluginFormat::kNCHW);
       mDataType = type;
   }

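This project does not turn on the FP16 or INT8 options, so in practice only DataType::kFLOAT with PluginFormat::kNCHW is used, even though the checks above also accept kHALF and kINT8.

5. getSerializationSize() and serialize()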

        virtual size_t getSerializationSize() override
       {
           return sizeof(nvinfer1::Dims) + sizeof(mDataType) + sizeof(mScale)
                  + sizeof(mOutputWidth) + sizeof(mOutputHeight) + sizeof(mThreadCount);
       }

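This sums the sizes of all parameters to be stored: one Dims (mCHW, the input dimensions recorded in getOutputDimensions()) + mDataType (kFLOAT) + mScale + mOutputWidth + mOutputHeight + mThreadCount. In other words, it is the total byte length of every parameter the layer needs at execution time.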

  void UpsampleLayerPlugin::serialize(void* buffer)
  {
      using namespace Tn;
      char* d = static_cast<char*>(buffer), *a = d;
      write(d, mCHW);
      write(d, mDataType);
      write(d, mScale);
      write(d, mOutputWidth);
      write(d, mOutputHeight);
      write(d, mThreadCount);
      assert(d == a + getSerializationSize());
  }

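The parameters are written into the byte stream one after another, each advancing the pointer by its own size. The final assert confirms that the number of bytes written matches getSerializationSize().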
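The Tn::write and Tn::read helpers are not shown in this post, so the following is only a guess at their behavior: copy one value and advance the cursor. writeValue, readValue, and UpsampleParams are hypothetical names, and the parameter set is reduced (the real plugin also stores Dims and DataType).

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <vector>

// Hypothetical helpers: copy one trivially copyable value and advance the cursor.
template <typename T>
void writeValue(char*& cursor, const T& value) {
    std::memcpy(cursor, &value, sizeof(T));
    cursor += sizeof(T);
}

template <typename T>
void readValue(const char*& cursor, T& value) {
    std::memcpy(&value, cursor, sizeof(T));
    cursor += sizeof(T);
}

// Reduced parameter set for illustration only.
struct UpsampleParams {
    float scale;
    int outputWidth;
    int outputHeight;
    int threadCount;
};

inline std::size_t serializedSize() {
    return sizeof(float) + 3 * sizeof(int);
}

inline std::vector<char> serializeParams(const UpsampleParams& p) {
    std::vector<char> buffer(serializedSize());
    char* d = buffer.data();
    char* start = d;
    writeValue(d, p.scale);
    writeValue(d, p.outputWidth);
    writeValue(d, p.outputHeight);
    writeValue(d, p.threadCount);
    assert(d == start + serializedSize());  // wrote exactly the advertised size
    return buffer;
}

inline UpsampleParams deserializeParams(const std::vector<char>& buffer) {
    UpsampleParams p{};
    const char* d = buffer.data();
    const char* start = d;
    readValue(d, p.scale);
    readValue(d, p.outputWidth);
    readValue(d, p.outputHeight);
    readValue(d, p.threadCount);
    assert(d == start + buffer.size());     // consumed the whole buffer
    return p;
}
```

The fields must be read in the same order they were written, which is why the deserializing constructor and serialize() in the plugin list the members in an identical sequence.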

6. getWorkspaceSize()
To be filled in once I understand it.
7. initialize(), terminate(), and enqueue(): the core methods of the layer

   int UpsampleLayerPlugin::initialize()
   {
       int inputHeight = mCHW.d[1];
       int inputWidth = mCHW.d[2];

       mOutputHeight = inputHeight * mScale;
       mOutputWidth = inputWidth * mScale;

       return 0;
   }

initialize() assigns values to several private members of UpsampleLayerPlugin.
terminate() is empty.
enqueue() is implemented in upsample.cu.

8. forwardGpu() is also implemented in upsample.cu

upsample.cu

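The .cu file is where the CUDA kernel is launched, so enqueue() and forwardGpu() are both implemented here.

enqueue

enqueue is the core method of a TensorRT custom layer. It launches CUDA kernels or uses CUDA handles to perform the layer's computation.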

int UpsampleLayerPlugin::enqueue(int batchSize, const void* const* inputs, void** outputs, void* workspace, cudaStream_t stream)
    {
        const int channels = mCHW.d[0];
        const int64_t in_height = mCHW.d[1];
        const int64_t in_width = mCHW.d[2];
        const int64_t out_height = mOutputHeight;
        const int64_t out_width = mOutputWidth;
        int totalElems = batchSize * in_height * in_width * channels;

        if (out_height == in_height && out_width == in_width) {
            CUDA_CHECK(cudaMemcpyAsync(outputs[0], inputs[0], totalElems * type2size(mDataType), cudaMemcpyDeviceToDevice, stream));
            CUDA_CHECK(cudaStreamSynchronize(stream));
            return 0;
        }
        switch (mDataType)
        {
            case DataType::kFLOAT :
                forwardGpu<float>((const float *)inputs[0],(float *)outputs[0],batchSize,mCHW.d[0],mOutputHeight,mOutputWidth);
                break;
            case DataType::kHALF:
                forwardGpu<__half>((const __half *)inputs[0],(__half *)outputs[0],batchSize,mCHW.d[0],mOutputHeight,mOutputWidth);
                break;
            case DataType::kINT8:
                forwardGpu<u_int8_t>((const u_int8_t *)inputs[0],(u_int8_t *)outputs[0],batchSize,mCHW.d[0],mOutputHeight,mOutputWidth);
                break;
            default:
                std::cerr << "error data type" << std::endl;
        }
        return 0;
    };

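I do not fully understand the two CUDA_CHECK calls yet; to be revisited. From the code, they wrap the device-to-device copy used when the output size equals the input size, and the stream synchronization that follows it.
enqueue mainly sets up local values and calls forwardGpu().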

forwardGpu

forwardGpu mainly launches the upscale kernel.

    template <typename Dtype>
    void UpsampleLayerPlugin::forwardGpu(const Dtype* input,Dtype * output, int N,int C,int H ,int W) {
        int numElem = N*C*H*W;
        upscale<<<(numElem + mThreadCount - 1) / mThreadCount, mThreadCount>>>(input,output, numElem, mScale, C, H, W);
    }

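The second line inside the function is the CUDA kernel launch. For this I referred to a blog post on calling custom CUDA kernels.
A user-defined kernel is launched like this:

kernel<<<1,1>>>(param1,param2,…)

The arguments inside "<<< >>>" tell the runtime how to launch the kernel (for example, how to lay out the threads).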

upscale

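Calling the kernel

1. Declaring and launching a kernel

upscale is the custom kernel. A kernel must be declared with __global__:

__global__ void kernel(param list){ }

As mentioned above, it is launched as follows: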

kernel<<<Dg,Db, Ns, S>>>(param list);

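Parameters:

Dg: int or dim3 (x,y,z). Defines how the blocks in a grid are organized. An int means a one-dimensional layout.
Db: int or dim3 (x,y,z). Defines how the threads in a block are organized. An int means a one-dimensional layout.
Ns: size_t, optional, default 0. The number of bytes of shared memory dynamically allocated per block, in addition to statically allocated shared memory. 0 means no dynamic allocation is needed.
S: cudaStream_t, optional, default 0. The stream the kernel runs in.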

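Given the explanation below and the fact that mThreadCount was set to 512 earlier, each block holds 512 threads and the grid holds

(numElem + mThreadCount - 1) / mThreadCount

blocks. Dg and Db are both ints here, so the grid and the block are both one-dimensional.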
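The block-count expression is an integer ceiling division. A minimal host-side sketch (blocksFor is a name made up for this example):

```cpp
#include <cassert>

// Ceiling division: the smallest block count whose threads cover numElem.
inline int blocksFor(int numElem, int threadsPerBlock) {
    assert(numElem >= 0 && threadsPerBlock > 0);
    return (numElem + threadsPerBlock - 1) / threadsPerBlock;
}
```

With 512 threads per block, 513 elements need 2 blocks, so 511 threads in the last block have nothing to do; that is why the kernel needs a bounds check. An output of 1 x 256 x 26 x 26 = 173056 elements divides evenly into 338 blocks.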
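2. CUDA thread structure

CUDA's thread structure has three important concepts: grid, block, and thread.
(1) The smallest unit of work on the GPU is the thread.
(2) Multiple threads form a block, but the number of threads a block can hold is limited. All threads of a block should reside on the same processor core and share the same memory, so threads within a block can synchronize quickly without worrying about a communication barrier.
(3) Multiple blocks running the same program form a grid. Threads in different blocks cannot access the same shared memory and cannot communicate or synchronize with each other directly, so the degree of cooperation between them is low. On the other hand, this model frees the program from worrying about how many threads the GPU can actually run at once: a GPU with very few execution units may run the blocks one after another rather than simultaneously. Different grids can run different programs.

The relationship between these structures was shown in a diagram (figure omitted here).

Blocks and threads can also be organized in two or three dimensions; in that figure, the blocks are two-dimensional and the threads are three-dimensional.
Every CUDA thread has an identifier, threadIdx, that is unique within its block and whose form depends on how the threads are organized. (Note: the ID is computed the same way as the index of an element in a row-major matrix.)

Take vector addition as an example.
The grid of blocks is one-dimensional and the threads in each block are one-dimensional: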

__global__ void addKernel(int *c, const int *a, const int *b)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    c[i] = a[i] + b[i];
}

Here blockIdx.x is the index of the block, blockDim.x is the number of threads in one row of a block (for a one-dimensional block, simply the number of threads per block), and threadIdx.x is the thread's position within that block.

The grid of blocks is one-dimensional and the threads in each block are two-dimensional:

__global__ void addKernel(int *c, int *a, int *b)
{
    int i = blockIdx.x * blockDim.x * blockDim.y + threadIdx.y * blockDim.x + threadIdx.x;
    c[i] = a[i] + b[i];
}

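Here blockDim.x is the number of threads in one row of a block, blockDim.y is the number of threads in one column, threadIdx.y is the row of the current thread within its block, and threadIdx.x is its column within that row.
The grid of blocks is two-dimensional and the threads in each block are three-dimensional: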

__global__ void addKernel(int *c, int *a, int *b)
{
    int blockId = blockIdx.x + blockIdx.y * gridDim.x;  
    int i = blockId * (blockDim.x * blockDim.y * blockDim.z)  
        + (threadIdx.z * (blockDim.x * blockDim.y))  
        + (threadIdx.y * blockDim.x) + threadIdx.x; 
    c[i] = a[i] + b[i];
}

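blockIdx.y is the row of the block within the grid, and gridDim.x is the number of blocks in one row of the grid (that is, the number of block columns).
The remaining terms follow the same pattern.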
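The flattening formula can be tried on the host without a GPU. In this sketch Dim3 is a hypothetical plain struct standing in for CUDA's built-in dim3/uint3 values, and globalIndex is a made-up helper that repeats the arithmetic of the last kernel.

```cpp
#include <cassert>

// Hypothetical host-side stand-in for CUDA's dim3 / uint3 built-ins.
struct Dim3 { int x, y, z; };

// Flattened global thread index for a 2D grid of 3D blocks,
// following the same row-major formula as the last kernel above.
inline int globalIndex(Dim3 blockIdx, Dim3 gridDim, Dim3 threadIdx, Dim3 blockDim) {
    assert(threadIdx.x < blockDim.x && threadIdx.y < blockDim.y && threadIdx.z < blockDim.z);
    int blockId = blockIdx.x + blockIdx.y * gridDim.x;
    int threadsPerBlock = blockDim.x * blockDim.y * blockDim.z;
    return blockId * threadsPerBlock
         + threadIdx.z * (blockDim.x * blockDim.y)
         + threadIdx.y * blockDim.x
         + threadIdx.x;
}
```

For a 2 x 2 grid of 4 x 2 x 2 blocks there are 64 threads in total, and the last thread of the last block maps to index 63. Setting the unused dimensions to 1 (and their indices to 0) reduces the formula to the one-dimensional case blockIdx.x * blockDim.x + threadIdx.x.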

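3. Memory structure
Each thread has its own registers and local memory. Threads in the same block share one region of shared memory. In addition, all threads, including those in different blocks, share global memory, constant memory, and texture memory. These three spaces belong to the device, so grids launched on the same device see the same global memory. (The original diagram is omitted here.)

This memory structure directly shapes the thread allocation strategy, because resource limits and utilization have to be considered together.

4. Heterogeneous programming
A typical GPU program is a heterogeneous program in which the CPU and GPU cooperate (the original flow diagram is omitted here).

The host runs the serial code and the device runs the parallel code.

That is a brief introduction to the thread structure used in CUDA computation.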

How the upscale kernel works

    template <typename Dtype>
    __global__ void upscale(const Dtype *input, Dtype *output,
                            int no_elements, int scale_factor, int d1, int d2, int d3) {
        int ii = threadIdx.x + blockDim.x * blockIdx.x;
        if (ii >= no_elements) return;  
        int ipidx = translate_idx(ii, d1, d2, d3, scale_factor);
        output[ii]=input[ipidx];
    }

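The custom upscale kernel is shown above. From the launch shown earlier,

upscale<<<(numElem + mThreadCount - 1) / mThreadCount, mThreadCount>>>(input,output, numElem, mScale, C, H, W);

we can see that the grid and the block are both one-dimensional.
Following the examples above, the kernel first sets ii to the global index of the current thread, returns early if ii is past the last element, passes ii to translate_idx to find the matching input index, and finally copies that input element into output[ii].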
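To make the role of the early return concrete, here is a host-side emulation of the one-dimensional launch. It is a sketch, not CUDA code: the nested loops stand in for blocks and threads, and emulateLaunch is a hypothetical name.

```cpp
#include <cassert>
#include <vector>

// Host-side emulation of a 1D launch: every (block, thread) pair computes
// ii = threadIdx + blockDim * blockIdx and skips out-of-range indices.
// Returns how many times each element index was visited.
inline std::vector<int> emulateLaunch(int numElem, int threadsPerBlock) {
    assert(numElem >= 0 && threadsPerBlock > 0);
    int blocks = (numElem + threadsPerBlock - 1) / threadsPerBlock;
    std::vector<int> hits(numElem, 0);
    for (int blockIdx = 0; blockIdx < blocks; ++blockIdx) {
        for (int threadIdx = 0; threadIdx < threadsPerBlock; ++threadIdx) {
            int ii = threadIdx + threadsPerBlock * blockIdx;
            if (ii >= numElem) continue;  // same role as the early return in upscale
            ++hits[ii];
        }
    }
    return hits;
}
```

With 1000 elements and 512 threads per block, 1024 threads are launched; the guard drops the last 24, and every element is handled exactly once. Without the guard those 24 threads would write past the end of the output buffer.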

translate_idx

    __device__ int translate_idx(int ii, int d1, int d2, int d3, int scale_factor) {
        int x, y, z, w;
        w = ii % d3;
        ii = ii/d3;
        z = ii % d2;
        ii = ii/d2;
        y = ii % d1;
        ii = ii/d1;
        x = ii;
        w = w/scale_factor;
        z = z/scale_factor;
        d2 /= scale_factor;
        d3 /= scale_factor;
        return (((x*d1+y)*d2)+z)*d3+w;
    }

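From the upscale call, the arguments have the following meanings. translate_idx unpacks ii into output coordinates (x, y, z, w) = (batch, channel, row, column), divides the row and column by scale_factor to find the source pixel, and repacks them into a flat index over the smaller input tensor, which amounts to nearest-neighbor upsampling.
ii: the global index of the current thread; one thread is assigned to each output element
d1: number of channels
d2: output height
d3: output width
scale_factor: the scale value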
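The index mapping can be verified on the CPU. The sketch below copies the arithmetic of translate_idx into a host function and applies it to every output element; translateIdx and upsampleNearest are names invented for this example, and an integer scale is assumed.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Host-side copy of translate_idx: maps a flat output index to a flat input index.
inline int translateIdx(int ii, int d1, int d2, int d3, int scale) {
    int w = ii % d3; ii /= d3;   // output column
    int z = ii % d2; ii /= d2;   // output row
    int y = ii % d1; ii /= d1;   // channel
    int x = ii;                  // batch index
    w /= scale;                  // source column
    z /= scale;                  // source row
    d2 /= scale;                 // input height
    d3 /= scale;                 // input width
    return ((x * d1 + y) * d2 + z) * d3 + w;
}

// Nearest-neighbor upsample of an NCHW tensor; H and W are the input sizes.
inline std::vector<float> upsampleNearest(const std::vector<float>& in,
                                          int N, int C, int H, int W, int scale) {
    assert(scale > 0 && static_cast<int>(in.size()) == N * C * H * W);
    int outH = H * scale, outW = W * scale;
    std::vector<float> out(static_cast<std::size_t>(N) * C * outH * outW);
    for (int ii = 0; ii < static_cast<int>(out.size()); ++ii)
        out[ii] = in[translateIdx(ii, C, outH, outW, scale)];
    return out;
}
```

For a single 2 x 2 channel holding 1, 2, 3, 4 and a scale of 2, each input pixel is repeated into a 2 x 2 patch of the 4 x 4 output, so the first output row reads 1, 1, 2, 2 and the last reads 3, 3, 4, 4.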

That is the whole process of getting Upsample working. It was pieced together from several sources and still needs improvement.

TensorRT 5.1 in practice: implementing a custom Upsample layer
