Black Magic: Low-Cost, High-Performance Custom Convolution Operator Development with cutlass

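General-purpose computing on graphics processing units (GPGPU) means using a GPU for general computing tasks that were traditionally handled by the CPU. Because modern GPUs offer massive parallelism, they far outperform CPUs on highly parallel, compute-intensive algorithms such as matrix multiplication and convolution. CUDA is NVIDIA's GPGPU high-performance computing platform, and most deep learning inference workloads today can be accelerated with it.

To make full use of the CUDA platform, NVIDIA provides the highly optimized deep learning and linear algebra operator libraries cudnn, cublas, and cutlass, as well as TensorRT, a deep learning inference framework for CUDA.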

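• Basic operator primitive libraries such as cudnn and cublas perform well on common convolution layers and usually meet users' needs. On highly customized algorithms, however, they often fail to exploit the hardware fully. The cause is the long-tail problem of operator optimization: these libraries implement many general convolution optimization strategies, but those strategies cannot cover every case, so a convolution layer in a real model may gain nothing from them and leave hardware performance unused.

Another problem with the basic operator libraries is that users cannot customize the operators. When an algorithm developer wants to add a new activation function to a convolution operator, or add a special convolution operator (for example, LocalConv), there is nothing they can do.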

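• cutlass is a linear algebra template library from NVIDIA. It defines a set of highly optimized operator components that developers can compose into linear algebra operators whose performance is comparable to cudnn and cublas. However, cutlass supports only matrix multiplication and not convolution, which makes it hard to apply directly to inference deployment in computer vision.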

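• TensorRT is a very powerful deep learning inference framework. It performs very well on CUDA, is fairly mature, and is convenient to use. It also has drawbacks: to developers TensorRT is a black box, and users cannot control its internal implementation at a fine granularity.

For example, when deploying a quantized network, developers cannot control TensorRT's low-level quantization details, so deployment accuracy may not match training accuracy. Likewise, users cannot finely control how much GPU memory operators use during inference. TensorRT sometimes consumes a large amount of GPU memory when running a network, and users have no good way to optimize it.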

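The major open-source frameworks have also released their own solutions for deep learning inference deployment on CUDA.

• Most open-source training frameworks deploy on CUDA by using a model conversion tool to convert the network into a format supported by TensorRT and then letting TensorRT run inference. However, the frameworks differ subtly in how they define operators, so model conversion introduces performance and accuracy losses that are hard to avoid.

• TVM, a deep learning inference framework that targets all platforms, supports CUDA fairly well. Based on operator optimization primitives, TVM defines a set of matrix multiplication and convolution templates and tunes them at run time to find the best performance. However, the code TVM generates automatically on CUDA still lags well behind hand-tuned libraries such as cudnn and cublas, and TVM's tuning takes a long time. These two issues have kept TVM from being widely used in real inference deployment.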

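Because the official libraries cannot meet the customization needs of algorithm development, and the open-source community's CUDA optimizations are not deep enough to meet the performance needs of deployment, MegEngine extended cutlass with support for convolution operators. By defining custom tile sizes, users can address the long-tail problem of operator optimization. The framework reuses cutlass's highly optimized operator components and distills a set of optimization strategies for CUDA convolution operators, so users can develop customized convolution operators at low development cost.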

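A cutlass-based convolution operator development framework

The long-tail problem of operator optimization

In real model inference deployment, official libraries such as cudnn often do not perform well enough. For example, cudnn is optimized only for cases with at least 64 output channels. When there are fewer than 64 channels, cudnn has to pad the channel count to 64 and launch more threads to do the computation, which wastes compute resources and yields poor operator performance.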
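To make the waste concrete, here is a minimal sketch of the arithmetic. It assumes the only overhead is rounding the output channel count up to a multiple of the tile width (64 in the text). The helper names `round_up` and `useful_fraction` are hypothetical and are not part of cudnn or cutlass.

```cpp
#include <cassert>

// Round `value` up to the next multiple of `multiple` (hypothetical helper).
constexpr int round_up(int value, int multiple) {
    return (value + multiple - 1) / multiple * multiple;
}

// Fraction of the computed output channels that are real (not padding)
// when the channel count is padded to a multiple of `tile_channels`.
inline double useful_fraction(int out_channels, int tile_channels) {
    assert(out_channels > 0 && tile_channels > 0);
    return static_cast<double>(out_channels) /
           round_up(out_channels, tile_channels);
}
```

Under this assumption, `useful_fraction(32, 64)` is 0.5: with 32 output channels padded to 64, half of the multiply-accumulate work is spent on padding. A tile that is 32 channels wide gives `useful_fraction(32, 32) == 1.0`.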

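With MegEngine's open-source cutlass operator development framework, it is easy to write a custom optimization for small output channel counts.

For example, when the input feature map has dimensions N=16, C=64, H=92, W=160, the convolution kernel is 3x3, and there are 32 output channels, the following code adds a new tile size to handle 32 output channels: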

// Layout of the input feature map tensor
using LayoutSrc = cutlass::layout::TensorNCxHWx<32>;
// Layout of the input weight tensor
using LayoutFilter = cutlass::layout::TensorCxRSKx<32>;
// Threadblock tile size (M, N, K)
using ThreadBlockShape = cutlass::gemm::GemmShape<32, 64, 64>;
// Warp tile size (M, N, K)
using WarpShape = cutlass::gemm::GemmShape<32, 16, 64>;
// Tile size of the matrix multiply-add instruction (M, N, K)
using InstructionShape = cutlass::gemm::GemmShape<8, 8, 16>;
// Epilogue (post-processing) operator of the convolution
using EpilogueOp = cutlass::epilogue::thread::
                      BiasAddLinearCombinationReluClamp<int8_t, 8,
                          int32_t, int32_t, float>;
using Convolution = cutlass::convolution::device::Convolution<
  int8_t,       // data type of the input feature map
  LayoutSrc,    // layout of the input feature map
  int8_t,       // data type of the input weight
  LayoutFilter, // layout of the input weight
  int8_t,       // data type of the output tensor
  LayoutSrc,    // layout of the output tensor
  int32_t,      // data type of the input bias
  LayoutSrc,    // layout of the input bias
  int32_t,      // data type of the internal matrix-multiply accumulator
  cutlass::convolution::ConvType::kConvolution,
  cutlass::arch::OpClassTensorOp,
  cutlass::arch::Sm75,
  ThreadBlockShape, WarpShape, InstructionShape,
  EpilogueOp,
  cutlass::convolution::threadblock::
      ConvolutionNCxHWxThreadblockSwizzle<
          cutlass::convolution::ConvType::kConvolution>,
  2,           // number of stages; 2 enables the shared memory
               // ping-pong prefetch optimization
  16, 16>;     // tensor alignment, which determines the width of the
               // load/store instructions; wider instructions give
               // higher throughput and help performance
Convolution conv_op;
typename Convolution::Arguments args{...};
conv_op.initialize(args, workspace);
// Run the convolution operator
conv_op();

Measured on a T4, our custom cutlass operator is 26% faster than cudnn.
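The tile shapes in the listing above also determine how many warps each threadblock runs. The following is a minimal sketch, assuming the usual cutlass convention that the threadblock tile is divided evenly into warp tiles along M, N, and K. `TileShape` and both helper functions are hypothetical stand-ins for illustration, not cutlass types.

```cpp
#include <cassert>

// Hypothetical stand-in for cutlass::gemm::GemmShape<M, N, K>.
struct TileShape {
    int m, n, k;
};

// Number of warps in one threadblock: the threadblock tile is partitioned
// into warp tiles along each of the M, N, and K dimensions.
inline int warps_per_threadblock(TileShape block, TileShape warp) {
    assert(block.m % warp.m == 0 && block.n % warp.n == 0 &&
           block.k % warp.k == 0);
    return (block.m / warp.m) * (block.n / warp.n) * (block.k / warp.k);
}

// Threads per threadblock, with 32 threads per warp on NVIDIA GPUs.
inline int threads_per_threadblock(TileShape block, TileShape warp) {
    return 32 * warps_per_threadblock(block, warp);
}
```

For the shapes used above, `warps_per_threadblock(TileShape{32, 64, 64}, TileShape{32, 16, 64})` gives 4 warps, that is, 128 threads per threadblock.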

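With common convolution parameters, the cutlass convolution operators are also comparable to cudnn. We measured several common ResNet50 convolution layers on a T4:

Of the 17 selected convolution layers, cutlass outperforms cudnn on 11, and on the remaining 6 it generally reaches at least 80% of cudnn's performance.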

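Operator fusion

NVIDIA's Turing architecture GPUs introduced TensorCore int8 compute units, which greatly increased compute throughput. Memory bandwidth did not grow correspondingly, so memory access often becomes the bottleneck of inference performance. In this situation we need to fuse memory-bound operators with compute-bound operators to reduce the cost of the memory-bound ones. The following example, which uses TensorCore int8 to accelerate inference, shows how MegEngine and cutlass perform operator fusion.

On CUDA, 8-bit quantized convolution layers use the NCHW4 data layout. Unlike the common NCHW layout, NCHW4 packs 4 channels together and stores them contiguously in memory, then stores the W, H, C, and N dimensions of the tensor in order of increasing stride. To use TensorCore for acceleration, the tensor must be converted to the NCHW32 layout, which is like NCHW4 except that 32 channels are packed together in memory.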
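The layout description above can be written as an index calculation. This is a minimal CPU-side sketch, assuming the channel count is a multiple of the pack size x. `Dims` and `nchwx_offset` are hypothetical names for illustration and are not MegEngine or cutlass APIs.

```cpp
#include <cassert>
#include <cstddef>

// Dimensions of a 4-D tensor in logical (N, C, H, W) order.
struct Dims {
    int n, c, h, w;
};

// Linear element offset of logical index (n, c, h, w) in an NCHWx layout,
// where x channels (x = 4 for NCHW4, x = 32 for NCHW32) are packed together.
// Strides from smallest to largest: packed channel, W, H, C / x, N.
inline std::size_t nchwx_offset(Dims dims, int x, int n, int c, int h, int w) {
    assert(x > 0 && dims.c % x == 0);  // assumes C is a multiple of x
    const std::size_t channel_blocks = static_cast<std::size_t>(dims.c / x);
    std::size_t offset = static_cast<std::size_t>(n);
    offset = offset * channel_blocks + static_cast<std::size_t>(c / x);
    offset = offset * static_cast<std::size_t>(dims.h) + static_cast<std::size_t>(h);
    offset = offset * static_cast<std::size_t>(dims.w) + static_cast<std::size_t>(w);
    return offset * static_cast<std::size_t>(x) + static_cast<std::size_t>(c % x);
}
```

For the same logical element, only the pack size changes between the two layouts: moving one step along W advances the offset by 4 elements in NCHW4 and by 32 elements in NCHW32.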

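When deploying with MegEngine, once the user enables the TensorCore optimization option, MegEngine inserts the appropriate Tensor Reformat operators during graph optimization to convert layouts, as shown in the first graph transformation stage in Figure 2. MegEngine then eliminates the redundant Tensor Reformat operators, giving the computation sequence of the second stage in Figure 2.

Combined with cutlass, MegEngine can optimize the computation graph further. First, we observe that the Pooling operator and the Reformat operator that follows it can be swapped. After swapping them, the first three operators in the graph (Elemwise, Convolution, and Reformat) can be fused by cutlass into one super convolution operator (Super Conv), which gives the final graph in Figure 2. In the optimized graph, the memory-bound operators introduced for TensorCore have all been fused into the convolution operator, so the optimized network gets the full TensorCore speedup with no extra Tensor Reformat overhead.

So how do we use cutlass's operator fusion? cutlass already provides high-performance read/write components that convert between the NCHW4 and NCHW32 layouts, so a fused Convolution+Reformat operator can be defined simply by combining the convolution operator with the corresponding epilogue (post-processing) operator. The sample code in Figure 3 shows how to use cutlass to define a convolution operator whose input tensor uses the NCHW4 layout and whose output tensor uses the NCHW32 layout.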

// Layout of the input feature map tensor
using LayoutSrc = cutlass::layout::TensorNCxHWx<4>;
// Layout of the input weight tensor
using LayoutFilter = cutlass::layout::TensorCxRSKx<4>;
// Layout of the output tensor
using LayoutDst = cutlass::layout::TensorNCxHWx<32>;
// Threadblock tile size (M, N, K)
using ThreadBlockShape = cutlass::gemm::GemmShape<64, 128, 32>;
// Warp tile size (M, N, K)
using WarpShape = cutlass::gemm::GemmShape<64, 32, 32>;
// Tile size of the matrix multiply-add instruction (M, N, K)
using InstructionShape = cutlass::gemm::GemmShape<1, 1, 4>;
// Epilogue (post-processing) operator of the convolution
using EpilogueOp = cutlass::epilogue::thread::
                      BiasAddLinearCombinationReluClamp<int8_t, 4,
                          int32_t, int32_t, float>;
using Convolution = cutlass::convolution::device::Convolution<
  int8_t,       // data type of the input feature map
  LayoutSrc,    // layout of the input feature map
  int8_t,       // data type of the input weight
  LayoutFilter, // layout of the input weight
  int8_t,       // data type of the output tensor
  LayoutDst,    // layout of the output tensor
  int32_t,      // data type of the input bias
  LayoutDst,    // layout of the input bias
  int32_t,      // data type of the internal matrix-multiply accumulator
  cutlass::convolution::ConvType::kConvolution,
  cutlass::arch::OpClassSimt,
  cutlass::arch::Sm61,
  ThreadBlockShape, WarpShape, InstructionShape,
  EpilogueOp,
  cutlass::convolution::threadblock::
      ConvolutionNCxHWxThreadblockSwizzle<
          cutlass::convolution::ConvType::kConvolution>,
  2,           // number of stages; 2 enables the shared memory
               // ping-pong prefetch optimization
  4, 16>;      // tensor alignment, which determines the width of the
               // load/store instructions; wider instructions give
               // higher throughput and help performance
Convolution conv_op;
typename Convolution::Arguments args{...};
conv_op.initialize(args, workspace);
// Run the convolution operator
conv_op();

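On a T4 we tested the first convolution layer of ResNet50. The convolution with an NCHW4 output tensor takes 3.03 ms, and the Tensor Reformat operator takes 0.309 ms. The fused Convolution+Reformat operator also takes 3.03 ms, so fusion removes the cost of the Tensor Reformat operator and improves performance by about 10%.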
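The roughly 10% figure follows directly from these timings, assuming the unfused path runs the convolution and the reformat back to back with no overlap:

```latex
\begin{aligned}
t_{\text{unfused}} &= 3.03\,\text{ms} + 0.309\,\text{ms} = 3.339\,\text{ms},\\
t_{\text{fused}}   &= 3.03\,\text{ms},\\
\frac{t_{\text{unfused}}}{t_{\text{fused}}} &= \frac{3.339}{3.03} \approx 1.10 .
\end{aligned}
```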

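Custom convolution operators

In highly customized scenarios, algorithm engineers propose new convolution operators to improve network performance, for example the Local operator used in recognition tasks and the CondConv operator proposed by Google Brain. These operators add parameters to improve the model's inference accuracy.

On CUDA, however, these operators usually lack well-optimized implementations, which keeps them out of real inference workloads. We found that they compute in much the same way as an ordinary convolution and differ only slightly in how they access the convolution kernel.

In cutlass we can define a prologue (pre-processing) operator for the convolution to change how it accesses the kernel, while reusing cutlass's high-performance convolution components, and so obtain well-performing Local and CondConv operators. In Megvii's face recognition business, the high-performance quantized CondConv operator we built on cutlass is already in production and delivers an accuracy gain for free, with no impact on inference performance.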
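The idea that such operators differ only in how the kernel is read can be illustrated on the CPU. The sketch below is a 1-D reference, not cutlass code and not the MegEngine prologue interface: the convolution loop is written once and the filter is read through a caller-supplied accessor. All names here are hypothetical, and the per-position weight layout for the local variant is an assumption made for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Reference 1-D "valid" convolution whose filter is read through a
// caller-supplied accessor: filter_at(out_pos, k) returns the weight applied
// at tap k when producing output position out_pos.
template <typename FilterAt>
std::vector<float> conv1d(const std::vector<float>& input, int taps,
                          FilterAt filter_at) {
    assert(taps > 0 && input.size() >= static_cast<std::size_t>(taps));
    const int out_len = static_cast<int>(input.size()) - taps + 1;
    std::vector<float> output(static_cast<std::size_t>(out_len), 0.0f);
    for (int o = 0; o < out_len; ++o)
        for (int k = 0; k < taps; ++k)
            output[o] += input[o + k] * filter_at(o, k);
    return output;
}

// Ordinary convolution: every output position shares the same weights.
inline std::vector<float> shared_conv1d(const std::vector<float>& input,
                                        const std::vector<float>& weights) {
    return conv1d(input, static_cast<int>(weights.size()),
                  [&](int, int k) { return weights[k]; });
}

// Local (unshared) convolution: each output position has its own weights,
// stored row by row as weights[out_pos * taps + k] (assumed layout).
inline std::vector<float> local_conv1d(const std::vector<float>& input,
                                       const std::vector<float>& weights,
                                       int taps) {
    assert(taps > 0 && input.size() >= static_cast<std::size_t>(taps));
    assert(weights.size() ==
           (input.size() - taps + 1) * static_cast<std::size_t>(taps));
    return conv1d(input, taps,
                  [&](int o, int k) { return weights[o * taps + k]; });
}
```

Both variants share the same multiply-accumulate loop and differ only in the accessor, which mirrors the role the text assigns to the prologue operator.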

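Custom activation functions

The convolution operators in NVIDIA's cudnn library currently support only ReLU as an activation function. If an algorithm engineer wants to get creative and use a novel activation function (for example, HSwish), that activation cannot be fused into the convolution operator, which increases inference time. In latency-sensitive scenarios, new activation functions therefore cannot really be deployed.

With cutlass, custom activation functions become comparatively easy: we only need to add a new epilogue (post-processing) operator. For example, the following code defines the HSwish activation: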

template <typename ElementOutput_,  
          int Count,  
          typename ElementAccumulator_ = ElementOutput_,
          typename ElementBias_ = ElementOutput_,    
          typename ElementCompute_ = ElementOutput_,
          FloatRoundStyle Round = FloatRoundStyle::round_to_nearest,
          typename Policy = NumericArrayConverterPolicy<
                  ElementOutput_, Count,
                  ElementAccumulator_, ElementBias_,
                  ElementCompute_, Round>>
class BiasAddLinearCombinationHSwishClamp {
    /// Params, constructors, etc. are defined here; some code is omitted
    /// ...
public:
    CUTLASS_HOST_DEVICE
    FragmentOutput operator()(FragmentAccumulator const& accumulator,
                              FragmentBias const& bias,
                              FragmentOutput const& source) const {
        SourceConverter source_converter;
        AccumulatorConverter accumulator_converter;
        BiasConverter bias_converter;
 
        ComputeFragment converted_source = source_converter(source);
        ComputeFragment converted_accumulator =
                accumulator_converter(accumulator);
        ComputeFragmentBias converted_bias = bias_converter(bias);
 
        ComputeFragment intermediate;
 
        multiplies<ComputeFragment> mul_add_source;
        multiply_add<ComputeFragment> mul_add_accumulator;
        multiply_add<ComputeFragmentBias> mul_add_bias;
        HSwish<ComputeFragment> hswish;
 
        minimum<ComputeFragment> min_accumulator;
        maximum<ComputeFragment> max_accumulator;
 
        /// Linear combination plus bias
        intermediate =
                mul_add_source(gamma_, converted_source);
        intermediate =
                mul_add_accumulator(alpha_, converted_accumulator,
                                    intermediate);
        intermediate = mul_add_bias(beta_, converted_bias,
                                    intermediate);
        /// Apply the HSwish activation
        intermediate = hswish(scale_, inv_scale_, intermediate);
 
        ElementCompute const kClamp = ElementCompute(
                (1U << (sizeof_bits<ElementOutput>::value - 1)) - 1);
 
        intermediate =
                max_accumulator(intermediate, -kClamp - ElementCompute(1));
        intermediate = min_accumulator(intermediate, kClamp);
 
        /// Convert to the output data type
        OutputConverter destination_converter;
        return destination_converter(intermediate);
    }
};

Passing the newly defined epilogue operator into the Convolution operator template gives a convolution operator fused with the new activation function.
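When writing a fused epilogue like this, a scalar CPU reference is useful for checking results. The sketch below uses the standard definition HSwish(x) = x * ReLU6(x + 3) / 6 and saturates to the int8 range, as the clamp in the listing above does. How `alpha`, `beta`, and `out_scale` map onto the `alpha_`, `beta_`, `scale_`, and `inv_scale_` members of the cutlass functor is an assumption made for illustration, and all function names are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstdint>

// HSwish(x) = x * ReLU6(x + 3) / 6, in floating point.
inline float hswish(float x) {
    return x * std::min(std::max(x + 3.0f, 0.0f), 6.0f) / 6.0f;
}

// Saturate to the int8 range [-128, 127], then round to nearest.
inline std::int8_t saturate_to_int8(float x) {
    const float kClamp = 127.0f;  // (1 << (8 - 1)) - 1
    const float clamped = std::min(std::max(x, -kClamp - 1.0f), kClamp);
    return static_cast<std::int8_t>(std::nearbyint(clamped));
}

// Reference for one output element. Assumption: alpha and beta bring the
// int32 accumulator and bias to real-valued units, and out_scale is the
// quantization step of the int8 output.
inline std::int8_t reference_epilogue(std::int32_t accumulator,
                                      std::int32_t bias, float alpha,
                                      float beta, float out_scale) {
    assert(out_scale > 0.0f);
    const float real = alpha * static_cast<float>(accumulator) +
                       beta * static_cast<float>(bias);
    return saturate_to_int8(hswish(real) / out_scale);
}
```

Comparing the fused operator's int8 output against `reference_epilogue` on a few elements is a cheap sanity check before benchmarking.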

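Inference deployment on the CUDA platform

The latest version of MegEngine already integrates the convolution operators implemented with cutlass.

Dump the quantized model as described in the [documentation], and MegEngine can then be used to deploy inference.

[Documentation]

https://megengine.org.cn/doc/advanced/inference_in_nvidia_gpu.html#inference-in-nvidia-gpu

We can use the load_and_run tool to benchmark the model.

[How to use load_and_run]

https://megengine.org.cn/doc/advanced/how_to_use_load_and_run.html#how-to-use-load-and-run

For example, the ResNet-18 test results are shown below: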

./load_and_run resnet18.mge --input ./cat.npy --enable-nchw32 --fast-run
mgb load-and-run: using MegBrain 8.9999.0(0) and MegDNN 9.3.0
[09 14:14:14 [email protected]:1169][WARN] enable nchw32 optimization
load model: 3018.428ms
=== prepare: 182.441ms; going to warmup
[09 14:11:11 [email protected]:492][ERR] timeout is set, but no fork_exec_impl not given; timeout would be ignored
[09 14:11:11 [email protected]:492][ERR] timeout is set, but no fork_exec_impl not given; timeout would be ignored
[09 14:11:11 [email protected]:492][ERR] timeout is set, but no fork_exec_impl not given; timeout would be ignored
warmup 0: 481.411ms
=== going to run input for 10 times
iter 0/10: 19.432ms (exec=0.754,device=19.307)
iter 1/10: 18.537ms (exec=0.899,device=18.497)
iter 2/10: 18.802ms (exec=0.727,device=18.762)
iter 3/10: 18.791ms (exec=0.653,device=18.759)
iter 4/10: 18.614ms (exec=0.761,device=18.585)
iter 5/10: 18.529ms (exec=0.708,device=18.499)
iter 6/10: 18.660ms (exec=0.706,device=18.634)
iter 7/10: 18.917ms (exec=0.667,device=18.894)
iter 8/10: 19.093ms (exec=0.655,device=19.070)
iter 9/10: 19.211ms (exec=0.630,device=19.187)
=== finished test #0: time=188.586ms avg_time=18.859ms sd=0.304ms minmax=18.529,19.432

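As shown, on a T4 the end-to-end time of ResNet18 is about 18.86 ms, compared with about 16.85 ms when deployed with TensorRT. MegEngine's inference performance on CUDA thus reaches about 90% of TensorRT's, which is broadly comparable. For deployment scenarios where latency requirements are modest but customization and inference accuracy requirements are high, MegEngine's CUDA inference deployment solution can meet the need directly.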
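The summary line of the log can be cross-checked from the per-iteration times. The sketch below recomputes the mean and the sample standard deviation (n - 1 in the denominator) from the ten logged values. The use of the sample formula is an assumption about how load_and_run computes `sd`, chosen because it appears consistent with the logged value. `Stats`, `summarize`, and `logged_iteration_times_ms` are hypothetical names.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

struct Stats {
    double mean;
    double stddev;
};

// Mean and sample standard deviation (n - 1 in the denominator).
inline Stats summarize(const std::vector<double>& samples) {
    assert(samples.size() > 1);
    double sum = 0.0;
    for (double s : samples) sum += s;
    const double mean = sum / static_cast<double>(samples.size());
    double squared_dev = 0.0;
    for (double s : samples) squared_dev += (s - mean) * (s - mean);
    const double variance =
            squared_dev / static_cast<double>(samples.size() - 1);
    return {mean, std::sqrt(variance)};
}

// Per-iteration times in milliseconds, copied from the log above.
inline std::vector<double> logged_iteration_times_ms() {
    return {19.432, 18.537, 18.802, 18.791, 18.614,
            18.529, 18.660, 18.917, 19.093, 19.211};
}
```

Calling `summarize(logged_iteration_times_ms())` gives a mean close to the logged `avg_time` of 18.859 ms, and dividing the quoted TensorRT time of 16.85 ms by that mean gives roughly 0.89, which is the "about 90%" in the text.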

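Summary

This article introduced the cutlass-based convolution operator optimization framework in the latest version of MegEngine. In the next few articles we will cover the principles behind optimizing convolution operators with cutlass, and how to use cutlass to add a high-performance custom convolution operator to MegEngine.

With the cutlass framework, developers can write convolution operators with custom tile sizes to address the long-tail problem in inference optimization, support custom activation functions, fuse convolution operators with memory-bound operators, and build variant convolution operators with decent performance.

We warmly welcome everyone to try MegEngine's inference deployment on CUDA and its cutlass-based convolution customization, and we look forward to feedback from developers, so that MegEngine and the cutlass convolution framework can help deep learning developers in highly customized inference deployment scenarios.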


For more information, see:

• MegEngine Website: https://megengine.org.cn

• MegEngine GitHub (stars welcome): https://github.com/MegEngine

About the author

Zhang Xiao (章曉), Megvii Research
