PyTorch Study Notes (16): Writing Your Own PyTorch Kernel (Based on PyTorch 1.2.0)

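A while ago I watched the talk on PyTorch internals that Edward Z. Yang, one of PyTorch's core developers, gave at the PyTorch NYC Meetup. It covers how strides specify a tensor's logical layout, the tensor wrapper, the autograd mechanism, and concise descriptions of the most important code modules inside PyTorch, and I learned a great deal from it. Writing a kernel in PyTorch is one of the most exciting topics in the talk and, as a contributor, the part I am most interested in. In this post I walk step by step through building a standard ATen kernel and compiling it successfully. Let's get started.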

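0. What do you need to write a PyTorch kernel?

E. Z. Yang points out that PyTorch gives prospective kernel authors many useful tools [1]. In my view, being able to write a kernel helps us understand the framework's internal structure and layout more deeply, and lets us implement Tensor-level functionality at the lowest layer, which usually cannot be achieved with PyTorch's public API alone. But first we need to understand which steps writing a kernel involves.

In PyTorch, a kernel consists of the following parts:

First, we write some metadata describing the kernel. PyTorch uses it to drive code generation and to produce the Python bindings automatically, without us writing a single extra line of code. Pretty cool, right?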

First, there’s some metadata which we write about the kernel, which
powers the code generation and lets you get all the bindings to
Python, without having to write a single line of code.

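Next, by the time execution reaches your kernel, the dispatch on device type and layout (strided by default) has already been done through the Tensor wrapper. The first thing to write is error checking on the inputs. It is very important, so don't skip it.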

Once you’ve gotten to the kernel, you’re past the device type / layout
dispatch. The first thing you need to write is error checking, to make
sure the input tensors are the correct dimensions. (Error checking is
really important! Don’t skimp on it!)
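As a rough illustration of what such checks look like, here is a minimal Python sketch rather than real ATen code. The function name `check_dot_inputs` and the dot-product scenario are hypothetical and chosen only for illustration; in an actual C++ kernel the same conditions would typically be expressed with a macro such as TORCH_CHECK (AT_CHECK in older releases).

```python
def check_dot_inputs(a_shape, b_shape):
    """Validate the shapes of two inputs to a hypothetical 1-D dot product.

    Returns the common length if the shapes are acceptable,
    otherwise raises ValueError with a descriptive message.
    """
    # Both inputs must be one-dimensional.
    if len(a_shape) != 1 or len(b_shape) != 1:
        raise ValueError(
            f"expected 1-D tensors, got {len(a_shape)}-D and {len(b_shape)}-D"
        )
    # Both inputs must hold the same number of elements.
    if a_shape[0] != b_shape[0]:
        raise ValueError(f"size mismatch: {a_shape[0]} vs {b_shape[0]}")
    return a_shape[0]
```

The point is the ordering: every assumption about the inputs is verified, with a readable message, before any data is touched.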

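We then write the kernel's implementation and allocate new memory for its output (not needed for in-place operations).
Most high-performance kernels need some degree of parallelization, so this part can take advantage of multi-CPU systems and some fairly involved low-level machinery.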

Most performant kernels need some sort of parallelization, so that you
can take advantage of multi-CPU systems. (CUDA kernels are
“implicitly” parallelized, since their programming model is built on
top of massive parallelization).
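The chunking idea behind CPU parallelization can be sketched in a few lines of Python. This is only an illustration under assumptions: the names `split_range` and `parallel_sum_of_squares` are made up for this example, and Python threads do not give real speed-ups for pure-Python arithmetic because of the GIL. ATen offers C++ helpers along these lines (for example `at::parallel_for`, which takes a begin, an end, and a grain size).

```python
from concurrent.futures import ThreadPoolExecutor


def split_range(begin, end, grain_size):
    """Split [begin, end) into consecutive chunks of at most grain_size items."""
    return [(lo, min(lo + grain_size, end)) for lo in range(begin, end, grain_size)]


def parallel_sum_of_squares(data, grain_size=4, workers=2):
    """Sum x*x over data: reduce each chunk in a worker, then combine the partials."""
    chunks = split_range(0, len(data), grain_size)

    def work(bounds):
        lo, hi = bounds
        return sum(x * x for x in data[lo:hi])

    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Each chunk is independent, so the partial sums can run concurrently.
        return sum(pool.map(work, chunks))
```

The grain size controls how much work each task gets; chunks that are too small spend more time on scheduling than on computation.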

Finally, we access the data and perform the actual computation.

Finally, you need to access the data and do the computation you wanted
to do!

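1. Preparation

PyTorch provides rich scaffolding that makes it fairly easy to develop your own kernel. Before starting, though, some preparation is needed:

① gcc/g++ 7.0 or later (this post uses gcc/g++ 7.4.0)
② CUDA 9.2 or later (this post uses CUDA 9.2 + cuDNN 7.5.0); not needed if you do not build the CUDA version
③ A reasonably recent Anaconda (anything released after 2018 works)

Once these three prerequisites are in place, let's begin.

First, create a virtual environment so that the system Python is not disturbed.

Create a virtual environment named pytorch-build.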

conda create --name pytorch-build python=3.6.3 numpy=1.16.3 conda

Enter the virtual environment.

activate pytorch-build # or: source activate pytorch-build

Install the required packages.

conda install numpy pyyaml mkl mkl-include setuptools cmake cffi typing

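Installing magma-cuda92 is slow and may well fail on a poor network connection, so be patient. If you cannot wait, you can build only the CPU version and skip it.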

conda install -c pytorch magma-cuda92

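Explicitly install OpenMP, which allows better use of multithreading on the CPU.
sudo apt-get install libomp-dev

Next, tell CMake explicitly where to look for the installed dependencies, namely the prefix of the conda environment.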
export CMAKE_PREFIX_PATH="$HOME/anaconda3/envs/pytorch-build"
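Before starting a long build it can save time to confirm the compiler requirement from the list above. The following is a small sketch, assuming gcc is invoked as `gcc` and that `gcc -dumpversion` prints a dotted version (newer gcc releases may print only the major number, which is enough for a "7 or later" check). The helper names `parse_version` and `gcc_is_new_enough` are hypothetical.

```python
import shutil
import subprocess


def parse_version(text):
    """Turn a dotted version string such as '7.4.0' into a tuple of ints."""
    return tuple(int(part) for part in text.strip().split(".") if part.isdigit())


def gcc_is_new_enough(minimum=(7,)):
    """Return True/False for the local gcc, or None if gcc is not on PATH."""
    if shutil.which("gcc") is None:
        return None
    result = subprocess.run(
        ["gcc", "-dumpversion"],
        stdout=subprocess.PIPE,
        universal_newlines=True,  # decode output as text (works on Python 3.6)
        check=True,
    )
    # Tuples compare element by element, so (7, 4, 0) >= (7,) is True.
    return parse_version(result.stdout) >= minimum
```

Calling `gcc_is_new_enough()` inside the activated environment reports whether the compiler found on PATH meets the minimum.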

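2. Setting up the project & implementing a basic kernel

Because we build PyTorch from scratch and the pytorch repository contains many submodules, the submodules must be cloned together with it. Use git clone --recursive https://github.com/pytorch/pytorch to clone the project into a directory of your choice.

PS: Because the HTTPS clone was extremely slow for me, I used git clone --recursive git@github.com:pytorch/pytorch.git instead. (Git supports several transport protocols, including the HTTPS protocol above and SSH; in my case SSH was noticeably faster, so I switched to it. Cloning over SSH requires an SSH key registered with your GitHub account.)

At this point the project has been cloned and we can implement our own kernel. I break writing a basic kernel into the steps below; following them yields a working kernel on the then-latest PyTorch 1.2.0a0+ede0849.

① Edit the kernel description file
First, add your kernel definition to aten/src/ATen/native/native_functions.yaml. I named the kernel gdh, the abbreviation of my name.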

...
- func: acos(Tensor self) -> Tensor
  variants: function, method

- func: gdh(Tensor self) -> Tensor
  variants: function, method

- func: acos_(Tensor(a!) self) -> Tensor(a!)
  variants: function, method
  dispatch:
    CPU: _acos__cpu
    CUDA: _acos__cuda
...

Here, function generates the function torch.gdh(), and method means that for a Tensor a you can call a.gdh().
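The relationship between the two variants can be pictured with a toy Python example. It does not use PyTorch at all: `TinyTensor` and the free function `gdh` below are hypothetical stand-ins that only show how one implementation can be reached through both calling styles, which is roughly what the generated bindings give you.

```python
def gdh(tensor):
    """Hypothetical stand-in for the generated free function torch.gdh()."""
    return [x for row in tensor.rows for x in row]


class TinyTensor:
    """Toy 2-D tensor used only to illustrate the two calling styles."""

    def __init__(self, rows):
        self.rows = rows

    def gdh(self):
        # The method variant simply forwards to the same implementation.
        return gdh(self)
```

With `t = TinyTensor([[1, 2], [3, 4]])`, both `gdh(t)` and `t.gdh()` go through the same code and return the same flattened list.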

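② Write the kernel implementation
Next, implement the gdh kernel itself. I put it in aten/src/ATen/native/TensorShape.cpp; in my tests, defining it in any ordinary .cpp file under aten/src/ATen/native/ works, and there is no particular restriction.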

...
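Tensor gdh(const Tensor& self) {
  // Number of dimensions of the input tensor.
  int64_t ndim = self.dim();
  printf(" gdh tensor shape %ld\n", ndim);
  // Make the data contiguous, then return it flattened to 1-D.
  return self.contiguous().view(-1);
}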
...
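To make clear what `self.contiguous().view(-1)` produces, here is a pure-Python sketch of reading a strided tensor in logical (row-major) order into a flat list. It models a tensor as a flat storage plus sizes and strides; the function name `flatten_strided` and this representation are assumptions for illustration, not PyTorch's actual implementation.

```python
from itertools import product


def flatten_strided(storage, sizes, strides, offset=0):
    """Read a strided tensor in row-major logical order into a flat list."""
    flat = []
    # Visit every logical index, last dimension varying fastest.
    for index in product(*(range(n) for n in sizes)):
        # A logical index maps to storage via the dot product with the strides.
        position = offset + sum(i * s for i, s in zip(index, strides))
        flat.append(storage[position])
    return flat
```

For a contiguous 2x3 tensor (strides (3, 1)) this returns the storage unchanged, while for its transpose (sizes (3, 2), strides (1, 3)) the elements come out reordered, which is why the kernel calls contiguous() before view(-1).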

③ Define the gradient
Note that if the op needs to participate in autograd, its gradient must be defined in tools/autograd/derivatives.yaml.

...
- name: acos(Tensor self) -> Tensor
  self: grad * -((-self * self + 1).rsqrt())

# gdh flattens self, so the gradient is reshaped back to self's shape
- name: gdh(Tensor self) -> Tensor
  self: grad.reshape(self.sizes())

- name: add(Tensor self, Tensor other, *, Scalar alpha=1) -> Tensor
  self: grad
  other: maybe_multiply(grad, alpha)
 ...
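Since gdh only rearranges elements, each output element depends on exactly one input element with derivative 1, so the gradient with respect to self is the incoming gradient laid out in self's shape. A minimal pure-Python sketch of that backward step, with hypothetical helper names `reshape_flat` and `gdh_backward` and nested lists standing in for tensors:

```python
def reshape_flat(flat, sizes):
    """Rebuild a nested list of the given sizes from a flat row-major list."""
    if not sizes:
        return flat[0]  # zero dimensions left: a single scalar
    if sizes[0] == 0:
        return []
    step = len(flat) // sizes[0]  # number of elements in each sub-block
    return [
        reshape_flat(flat[i * step:(i + 1) * step], sizes[1:])
        for i in range(sizes[0])
    ]


def gdh_backward(grad_output, input_sizes):
    """Gradient of a flatten: the 1-D incoming gradient in the input's shape."""
    return reshape_flat(list(grad_output), tuple(input_sizes))
```

For an input of shape (2, 3), a 6-element incoming gradient is returned as two rows of three values.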

3. Build & run

With step 2 done we can build. For build instructions see [2] and [3]; to keep the build fast, I build the CPU version here.

This is what I did:

1) # Enter the virtual environment.
source activate pytorch-build

2) # Set the environment variables.
export NO_MKLDNN=1 # disable due to GLIBC problem
export NO_SYSTEM_NCCL=1
export CMAKE_PREFIX_PATH="$(dirname $(which conda))/../" # required.
export USE_CUDA=0 USE_CUDNN=0 # do not use CUDA or cuDNN. Alternatively, set the corresponding CUDA and cuDNN options to OFF in the top-level CMakeLists.txt of the clone.
export GEN_TO_SOURCE=/your/path/pytorch/aten/src/ATen/core/ # this path matters too: without it a core_tmp/ directory is generated, headers needed by the build are not found, and compilation fails.

3) # Install / clean.
python setup.py install     # build and install
python setup.py clean --all # clean the build (only when you want to rebuild from scratch)
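If you prefer to drive the build from a script, the environment above can be assembled in Python. This is a sketch under assumptions: the helper name `cpu_build_env` is made up, `repo_root` is wherever you cloned pytorch, and the variable names are simply the ones used in the shell commands above.

```python
import os
import shutil


def cpu_build_env(repo_root, base_env=None):
    """Return a copy of the environment with the CPU-only build settings applied."""
    env = dict(os.environ if base_env is None else base_env)
    env.update({
        "NO_MKLDNN": "1",
        "NO_SYSTEM_NCCL": "1",
        "USE_CUDA": "0",
        "USE_CUDNN": "0",
        # Same directory as in the export above, with a trailing separator.
        "GEN_TO_SOURCE": os.path.join(repo_root, "aten", "src", "ATen", "core") + os.sep,
    })
    conda = shutil.which("conda")
    if conda:
        # Equivalent to "$(dirname $(which conda))/../": the parent of conda's bin directory.
        env["CMAKE_PREFIX_PATH"] = os.path.dirname(os.path.dirname(os.path.abspath(conda)))
    return env

# Possible usage (starts the real build, so it is left commented out):
# import subprocess, sys
# subprocess.run([sys.executable, "setup.py", "install"],
#                cwd=repo_root, env=cpu_build_env(repo_root), check=True)
```

Keeping the settings in one function makes it easy to see exactly which variables the build was run with.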

After the build finishes, the Python bindings of the freshly built PyTorch work as expected.

Now let's try it out. Whether we call torch.gdh() or a.gdh(), the printf(" gdh tensor shape %ld\n", ndim); that we defined in TensorShape.cpp is executed. Success!
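A small script for this check might look as follows. It assumes the custom build from this post is the torch that Python imports; on a stock PyTorch, or without PyTorch at all, the hypothetical helper `try_gdh` simply returns None instead of failing.

```python
import importlib


def try_gdh():
    """Call gdh both ways and return the two output shapes, or None if unavailable."""
    try:
        torch = importlib.import_module("torch")
    except ImportError:
        return None  # PyTorch is not installed in this environment
    if not hasattr(torch, "gdh"):
        return None  # this build does not contain the custom kernel
    a = torch.ones(2, 3)
    # Function variant and method variant should behave identically.
    return tuple(torch.gdh(a).shape), tuple(a.gdh().shape)
```

On the custom build, each call also triggers the printf inside the kernel, and both variants return a flattened tensor of 6 elements for the 2x3 input.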

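4. Problems

① ModuleNotFoundError: No module named 'torch._C'

This problem [5] occurs when the directory from which you start Python contains a torch folder (for example the root of the source tree), which shadows the installed package. Starting Python from a different directory avoids it.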
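A quick way to detect this situation is to look for a torch directory in the working directory before importing. The helper name `shadowing_torch_dir` is hypothetical; the check relies only on the fact that Python searches the current directory first when started interactively.

```python
import os


def shadowing_torch_dir(cwd=None):
    """Return the path of a local 'torch' directory that would shadow the install, or None."""
    cwd = os.getcwd() if cwd is None else cwd
    candidate = os.path.join(cwd, "torch")
    return candidate if os.path.isdir(candidate) else None
```

If it returns a path, change to another directory before running `import torch`.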

② RuntimeError: Source files: ['Type.h', 'Tensor.h', 'TensorMethods.h'] did not match generated files.

This error arises during the build. Since PyTorch 1.0.0, aten/src/ATen/gen.py (around line 42) contains logic [6] that creates a core_tmp folder, but for some reason Tensor.h and TensorMethods.h are not copied there. Setting GEN_TO_SOURCE explicitly avoids the problem:
export GEN_TO_SOURCE=/your/path/pytorch/aten/src/ATen/core/

References

[1] PyTorch internals
[2] Build pytorch from source, Beenfrog's research blog
[3] Speed Up PyTorch by Building from Source on Ubuntu 18.04, Zhanwen Chen
[4] realdoug's native implementation
[5] ModuleNotFoundError: No module named 'torch._C' #574
[6] Source code changes of the file "aten/src/ATen/gen.py" between pytorch-0.4.1.tar.gz and pytorch-1.0.1.tar.gz
