Installing PyTorch 1.0.0 from source on Jetson TX2 (truly, a tale of blood and tears)

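This post walks through the installation for Python 3.5. Before starting, note the following key points.

Key point 1: the platform, and how CUDA and cuDNN were installed

Jetson TX2 platform versions: JetPack 3.3, CUDA 9.0.252, cuDNN 7.1.5, TensorRT 4.0.2, Python 2.7 / Python 3.5

Kernel: tegra-ubuntu 4.4.38-tegra aarch64

Linux distribution: Ubuntu 16.04, cmake 3.15.6 (the stock cmake on a freshly flashed TX2 is 3.5.1; while experimenting I came across advice that cmake 3.9.0 or newer is preferable, so I upgraded straight to a newer version)

Building PyTorch from source uses CUDA and cuDNN, so first check whether the CUDA and cuDNN on your Jetson TX2 were installed through JetPack. If they were not, take care: the Jetson TX2's CPU is ARM-based, so CUDA and cuDNN must both be the ARM (aarch64) builds. For installing CUDA and cuDNN on the Jetson TX2, see this post: "Installing CUDA 9.0 and cuDNN 7 on Jetson TX2, in full detail (personally tested)"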

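Key point 2: downloading the PyTorch source

1. Each PyTorch version requires a matching CUDA version

Cloning directly from PyTorch's GitHub repository gives you the latest PyTorch. This post was written on 2020.1.14, and at that time running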

git clone  http://github.com/pytorch/pytorch

downloaded PyTorch 1.4.0a0. Installing that version requires CUDA 9.2 or newer on the platform, which does not match my platform. The correspondence between PyTorch and CUDA versions can be found on the official PyTorch website:

Previous PyTorch versions: https://pytorch.org/get-started/previous-versions/
Latest PyTorch version: https://pytorch.org/get-started/locally/
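To make the version comparison concrete, here is a minimal Python sketch that pulls the CUDA major and minor version out of a version string and compares it with a minimum requirement. The function names and the MIN_CUDA table are my own illustration; the only requirement taken from this post is "CUDA 9.2 or newer for 1.4.0a0", and the two sample strings are assumed to resemble the contents of /usr/local/cuda/version.txt and the output of nvcc --version.

```python
import re


def parse_cuda_version(text):
    """Extract (major, minor) from a string that mentions a CUDA version."""
    match = re.search(r"(?:CUDA Version|release)\s+(\d+)\.(\d+)", text)
    if match is None:
        raise ValueError("no CUDA version found in: %r" % text)
    return int(match.group(1)), int(match.group(2))


def cuda_is_new_enough(installed, required):
    """Compare two (major, minor) tuples."""
    return installed >= required


# Hypothetical lookup table; only this one entry comes from the post.
MIN_CUDA = {"1.4.0a0": (9, 2)}

installed = parse_cuda_version("CUDA Version 9.0.252")
print(installed, cuda_is_new_enough(installed, MIN_CUDA["1.4.0a0"]))
```

With the CUDA 9.0.252 that JetPack 3.3 installs, the check comes out False for 1.4.0a0, which is why an older PyTorch release is needed here.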

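If your platform matches mine, I recommend PyTorch 1.0.0. How do you download the specific version you want? I suggest reading this link: "How to download the PyTorch version you want".

2. The third_party folder in the PyTorch GitHub source holds links, not files

On GitHub, the folders under third_party look as though they have content, but they are really links to submodule projects, so a plain download of the PyTorch source does not bring the third-party libraries with it. Keep this in mind when downloading. I recommend the following command: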

git clone --recursive --branch v1.0.0 http://github.com/pytorch/pytorch

Be sure to include --recursive, which clones the git submodules recursively.
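A quick way to tell whether the submodules actually came down is to look for empty directories under third_party. The sketch below is only an illustration: the helper name is mine, and the demo runs against a throwaway directory tree rather than a real checkout.

```python
import os
import tempfile


def empty_submodule_dirs(repo_root, subdir="third_party"):
    """Return the names of directories under repo_root/subdir that have no entries."""
    base = os.path.join(repo_root, subdir)
    empty = []
    for name in sorted(os.listdir(base)):
        path = os.path.join(base, name)
        if os.path.isdir(path) and not os.listdir(path):
            empty.append(name)
    return empty


# Demo on a temporary tree: "gemmlowp" is populated, "onnx" is an empty placeholder.
with tempfile.TemporaryDirectory() as root:
    os.makedirs(os.path.join(root, "third_party", "gemmlowp", "public"))
    os.makedirs(os.path.join(root, "third_party", "onnx"))
    demo_result = empty_submodule_dirs(root)
print(demo_result)  # ['onnx']
```

On a real checkout you would call empty_submodule_dirs("pytorch"); a non-empty result suggests the clone was made without --recursive.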

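Key point 3: We should turn-off NCCL support since it is only available on the desktop GPU.

See https://devtalk.nvidia.com/default/topic/1042821/jetson-tx2/pytorch-install-broken/
If the error below appears during the build, it is because NCCL was not turned off. How to turn it off is covered later in this post.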

nvlink error : entry function '_Z28ncclAllReduceLLKernel_sum_i88ncclColl' with max regcount of 80 calls function '_Z25ncclReduceScatter_max_u64P14CollectiveArgs' with regcount of 96
nvlink error : entry function '_Z29ncclAllReduceLLKernel_sum_i328ncclColl' with max regcount of 80 calls function '_Z25ncclReduceScatter_max_u64P14CollectiveArgs' with regcount of 96
nvlink error : entry function '_Z29ncclAllReduceLLKernel_sum_f168ncclColl' with max regcount of 80 calls function '_Z25ncclReduceScatter_max_u64P14CollectiveArgs' with regcount of 96
nvlink error : entry function '_Z29ncclAllReduceLLKernel_sum_u328ncclColl' with max regcount of 80 calls function '_Z25ncclReduceScatter_max_u64P14CollectiveArgs' with regcount of 96
nvlink error : entry function '_Z29ncclAllReduceLLKernel_sum_f328ncclColl' with max regcount of 80 calls function '_Z25ncclReduceScatter_max_u64P14CollectiveArgs' with regcount of 96
nvlink error : entry function '_Z29ncclAllReduceLLKernel_sum_u648ncclColl' with max regcount of 80 calls function '_Z25ncclReduceScatter_max_u64P14CollectiveArgs' with regcount of 96
nvlink error : entry function '_Z28ncclAllReduceLLKernel_sum_u88ncclColl' with max regcount of 80 calls function '_Z25ncclReduceScatter_max_u64P14CollectiveArgs' with regcount of 96
Makefile:83: recipe for target '/home/nvidia/jetson-reinforcement/build/pytorch/third_party/build/nccl/obj/collectives/device/devlink.o' failed
make[5]: *** [/home/nvidia/jetson-reinforcement/build/pytorch/third_party/build/nccl/obj/collectives/device/devlink.o] Error 255
Makefile:45: recipe for target 'devicelib' failed
make[4]: *** [devicelib] Error 2
Makefile:24: recipe for target 'src.build' failed
make[3]: *** [src.build] Error 2
CMakeFiles/nccl.dir/build.make:60: recipe for target 'lib/libnccl.so' failed
make[2]: *** [lib/libnccl.so] Error 2
CMakeFiles/Makefile2:67: recipe for target 'CMakeFiles/nccl.dir/all' failed
make[1]: *** [CMakeFiles/nccl.dir/all] Error 2
Makefile:127: recipe for target 'all' failed
make: *** [all] Error 2

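So I downloaded PyTorch 1.0.0. Now for the installation itself.

I. Check which versions the python and pip commands refer to

A freshly flashed Jetson TX2 ships with both Python 2.7 and Python 3.5, so pay attention to which version the system's default python and pip point to. On my platform, for example: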

$ pip --version
Output:
pip 19.3.1 from /home/nvidia/.local/lib/python2.7/site-packages/pip (python 2.7)

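This shows that the default pip is bound to Python 2.7. To install PyTorch for Python 3.5, use the python3 and pip3 commands. Alternatively, you can change which versions the default python and pip are bound to (look up the method yourself).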
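If you want to check the binding from a script rather than by eye, the version line can be parsed. This is a small sketch under the assumption that the output has the same shape as the line shown above; the function name is hypothetical.

```python
import re


def parse_pip_version(output):
    """Split a 'pip --version' line into pip version, install location and Python version."""
    match = re.match(r"pip (\S+) from (.+) \(python (\d+\.\d+)\)", output.strip())
    if match is None:
        raise ValueError("unexpected pip --version output: %r" % output)
    return {
        "pip": match.group(1),
        "location": match.group(2),
        "python": match.group(3),
    }


sample = "pip 19.3.1 from /home/nvidia/.local/lib/python2.7/site-packages/pip (python 2.7)"
info = parse_pip_version(sample)
print(info["pip"], info["python"])  # 19.3.1 2.7
```

To use it on a live system, feed it the decoded output of subprocess.check_output(["pip3", "--version"]) and confirm that the "python" field reads 3.5.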

II. Install dependencies and required components

sudo apt install libopenblas-dev libatlas-dev liblapack-dev
sudo apt install liblapacke-dev checkinstall # For OpenCV
sudo apt-get install python3-pip
 
pip3 install --upgrade pip==9.0.1
sudo apt-get install python3-dev
 
sudo pip3 install numpy scipy
sudo pip3 install pyyaml
sudo pip3 install scikit-build
sudo apt-get -y install cmake
sudo apt install libffi-dev
sudo pip3 install cffi

Once these are installed, add the cuDNN lib and include paths:

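sudo gedit ~/.bashrc     # open ~/.bashrc and append the two export lines below
export CUDNN_LIB_DIR=/usr/lib/aarch64-linux-gnu
export CUDNN_INCLUDE_DIR=/usr/include
source ~/.bashrc         # reload so the variables take effect in the current shell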
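Before starting a build that takes hours, it can be worth confirming that the two variables point somewhere useful. The sketch below is an illustration only: the helper name is mine, and it assumes the library is named libcudnn.so* and the header cudnn.h, as with cuDNN 7.

```python
import glob
import os


def check_cudnn_paths(env):
    """Return a list of problems found with CUDNN_LIB_DIR / CUDNN_INCLUDE_DIR in env."""
    problems = []
    lib_dir = env.get("CUDNN_LIB_DIR")
    inc_dir = env.get("CUDNN_INCLUDE_DIR")
    if not lib_dir:
        problems.append("CUDNN_LIB_DIR is not set")
    elif not glob.glob(os.path.join(lib_dir, "libcudnn.so*")):
        problems.append("no libcudnn.so* in " + lib_dir)
    if not inc_dir:
        problems.append("CUDNN_INCLUDE_DIR is not set")
    elif not os.path.isfile(os.path.join(inc_dir, "cudnn.h")):
        problems.append("no cudnn.h in " + inc_dir)
    return problems


# Check the current shell environment; an empty list means both paths look usable.
print(check_cudnn_paths(os.environ) or "cuDNN paths look fine")
```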

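III. Download the PyTorch 1.0.0 source and modify it

Following the notes at the start of this post and the link "How to download the PyTorch version you want", download the PyTorch 1.0.0 source, then turn off NCCL.

Note: NCCL must be turned off in the source before building.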

#sudo gedit /pytorch/CMakeLists.txt
#   > CMakeLists.txt : Change NCCL to 'Off' on line 98

#sudo gedit /pytorch/setup.py
#   > setup.py: Add USE_NCCL = False below line 200

#sudo gedit /pytorch/tools/setup_helpers/nccl.py
#   > nccl.py : Change USE_SYSTEM_NCCL to 'False' on line 13
#               Change NCCL to 'False' on line 78

#sudo gedit /pytorch/torch/csrc/cuda/nccl.h
#   > nccl.h : Comment self-include on line 8
#              Comment entire code from line 21 to 28

#sudo gedit torch/csrc/distributed/c10d/ddp.cpp
#   > ddp.cpp : Comment nccl.h include on line 6
#               Comment torch::cuda::nccl::reduce on line 163

Once the modifications are done, the build can begin. Run the following commands:

cd pytorch

git submodule update --init --recursive # if this command reports an error, run  git init  first

sudo pip3 install -U setuptools
sudo pip3 install -r requirements.txt

IV. Build

First, switch the TX2 to its maximum-power mode, which makes the build a little faster:

sudo nvpmodel -m 0         # switch to the maximum-performance power mode
sudo  ~/jetson_clocks.sh   # lock the clocks at maximum and run the fan at full speed

sudo pip3 install scikit-build --user
sudo ldconfig

export USE_NCCL=0
export USE_DISTRIBUTED=1
export USE_OPENCV=ON
export USE_CUDNN=1
export USE_CUDA=1
export ONNX_ML=1

Then start the build:

sudo python3 setup.py bdist_wheel   # this step builds the wheel file, which is written to /pytorch/dist
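The exported flags and the build command above can also be bundled into one small script, so the build always runs with the same settings. This is only a sketch: the names BUILD_FLAGS, build_environment and build_command are mine, and the actual build call is left commented out because it takes hours on a TX2.

```python
import os

# The same switches that are exported in the shell above.
BUILD_FLAGS = {
    "USE_NCCL": "0",
    "USE_DISTRIBUTED": "1",
    "USE_OPENCV": "ON",
    "USE_CUDNN": "1",
    "USE_CUDA": "1",
    "ONNX_ML": "1",
}


def build_environment(base_env):
    """Return a copy of base_env with the build switches applied."""
    env = dict(base_env)
    env.update(BUILD_FLAGS)
    return env


def build_command():
    """Command line for building the wheel, to be run from the pytorch directory."""
    return ["python3", "setup.py", "bdist_wheel"]


env = build_environment(os.environ)
print(env["USE_NCCL"], " ".join(build_command()))

# To start the real build from inside the pytorch directory:
# import subprocess
# subprocess.run(build_command(), env=env, check=True)
```

Passing the environment explicitly to the child process means the switches do not depend on what happens to be exported in the current terminal.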

Once the lengthy build finishes, run the following command:

sudo DEBUG=1 python3 setup.py build develop  
# If this step reports that the TX2 is out of memory, first copy the wheel file out of /pytorch/dist,
# then run  sudo python3 setup.py clean  to remove the build output, cd to the directory holding the copied wheel, and install it with:
# sudo pip3 install torch-1.0.0a0-cp35-cp35m-linux_aarch64.whl
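For the wheel fallback, a few lines of Python can locate the built wheel and assemble the install command. The helper names are hypothetical, and the demo uses a temporary directory with an empty placeholder file named like the wheel above.

```python
import glob
import os
import tempfile


def find_wheels(dist_dir, python_tag="cp35"):
    """Return torch wheels in dist_dir built for the given Python tag, sorted by name."""
    pattern = os.path.join(dist_dir, "torch-*-{0}-*.whl".format(python_tag))
    return sorted(glob.glob(pattern))


def pip_install_command(wheel_path):
    """Command line that installs the given wheel for Python 3."""
    return ["pip3", "install", wheel_path]


# Demo: an empty file stands in for the real wheel.
with tempfile.TemporaryDirectory() as dist_dir:
    wheel_name = "torch-1.0.0a0-cp35-cp35m-linux_aarch64.whl"
    open(os.path.join(dist_dir, wheel_name), "w").close()
    found = [os.path.basename(path) for path in find_wheels(dist_dir)]
print(found)
```

On the TX2 you would point find_wheels at the directory the wheel was copied to and run the resulting command with sudo, as in the comment above.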

After this equally lengthy build finishes, run the remaining installation commands:

sudo apt clean
sudo apt-get install libjpeg-dev zlib1g-dev

cd ~
git clone https://github.com/python-pillow/Pillow.git
cd Pillow/
sudo python3 setup.py install
sudo apt-get install python3-sklearn
sudo pip3 install pandas Cython scikit-image 

sudo pip3 --no-cache-dir install torchvision

Errors encountered during installation, and their fixes

error 1:

./caffe2/operators/quantized/int8_utils.h:4:38: fatal error: gemmlowp/public/gemmlowp.h: No such file or directory compilation terminated.
Details:

In file included from ../caffe2/operators/quantized/int8_concat_op.h:7:0,
                 from ../caffe2/operators/quantized/int8_concat_op.cc:1:
../caffe2/operators/quantized/int8_utils.h:4:38: fatal error: gemmlowp/public/gemmlowp.h: No such file or directory
compilation terminated.
[1669/2643] Building CXX object caffe2/CMakeFile...ators/rnn/recurrent_network_blob_fetcher_op.cc.o
ninja: build stopped: subcommand failed.
Failed to run 'bash ../tools/build_pytorch_libs.sh --use-cuda --use-nnpack --use-qnnpack caffe2'

Solution
The error message shows that caffe2/operators/quantized/int8_utils.h cannot find gemmlowp/public/gemmlowp.h, even though that file really exists. The fix is to change the include in int8_utils.h from

#include <gemmlowp/public/gemmlowp.h>
to
#include "third_party/gemmlowp/public/gemmlowp.h"

Likewise, change the includes in caffe2/operators/quantized/int8_simd.h from

#include "gemmlowp/fixedpoint/fixedpoint.h"
#include "gemmlowp/public/gemmlowp.h"
to
#include "third_party/gemmlowp/fixedpoint/fixedpoint.h"
#include "third_party/gemmlowp/public/gemmlowp.h"
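If you would rather not edit the two headers by hand, the same rewrite can be expressed as a regular-expression substitution. This is a sketch of the edit described above, not something taken from the PyTorch tree; the function name is mine, and it only transforms text, so reading and writing the header files is left to you.

```python
import re

# Matches #include <gemmlowp/...> and #include "gemmlowp/...", but not paths
# that already start with third_party/.
INCLUDE_RE = re.compile(r'#include\s*[<"](gemmlowp/[^>"]+)[>"]')


def patch_gemmlowp_includes(source):
    """Rewrite gemmlowp includes so they go through the third_party/ prefix."""
    return INCLUDE_RE.sub(r'#include "third_party/\1"', source)


before = (
    '#include <gemmlowp/public/gemmlowp.h>\n'
    '#include "gemmlowp/fixedpoint/fixedpoint.h"\n'
    '#include <cstdint>\n'
)
after = patch_gemmlowp_includes(before)
print(after)
```

Running the function a second time on its own output changes nothing, so it is safe to apply to a file that was already patched.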

error 2:

libcudnn.so.7: error adding symbols: File in wrong format
Details:

[ 58%] Linking CXX shared library ../lib/libcaffe2_gpu.so
/usr/local/cuda/lib64/
collect2: error: ld returned 1 exit status
caffe2/CMakeFiles/caffe2_gpu.dir/build.make:185448: recipe for target 'lib/libcaffe2_gpu.so' failed
make[2]: *** [lib/libcaffe2_gpu.so] Error 1
CMakeFiles/Makefile2:4400: recipe for target 'caffe2/CMakeFiles/caffe2_gpu.dir/all' failed
make[1]: *** [caffe2/CMakeFiles/caffe2_gpu.dir/all] Error 2
make[1]: *** Waiting for unfinished jobs....
[ 58%] Built target python_copy_files
Makefile:140: recipe for target 'all' failed
make: *** [all] Error 2
Failed to run 'bash ../tools/build_pytorch_libs.sh --use-cuda --use-nnpack --use-qnnpack caffe2'

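Solution
This happens because the cuDNN installed on the platform is not the ARM build. The Jetson TX2 is ARM-based, whereas the cuDNN for a PC targets x86_64. The fix is to install the ARM build of cuDNN; for the installation method, see this post:

"Installing CUDA 9.0 and cuDNN 7 on Jetson TX2, in full detail (personally tested)"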
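One way to confirm which architecture a library was built for is to read the machine field in its ELF header. The sketch below is my own illustration: it decodes the 16-bit e_machine value at byte offset 18, where 0xB7 means aarch64 and 0x3E means x86_64, and the demo uses two hand-built headers instead of real libraries.

```python
import struct

# e_machine values from the ELF specification.
ELF_MACHINES = {0x3E: "x86_64", 0xB7: "aarch64"}


def elf_machine(header_bytes):
    """Return the architecture name encoded in the first 20 bytes of an ELF file."""
    if len(header_bytes) < 20 or header_bytes[:4] != b"\x7fELF":
        raise ValueError("not an ELF file")
    # Byte 5 (EI_DATA) gives the byte order: 1 = little-endian, 2 = big-endian.
    endian = "<" if header_bytes[5] == 1 else ">"
    (machine,) = struct.unpack_from(endian + "H", header_bytes, 18)
    return ELF_MACHINES.get(machine, "unknown (0x%x)" % machine)


def library_machine(path):
    """Read the ELF header of the file at path and return its architecture name."""
    with open(path, "rb") as handle:
        return elf_machine(handle.read(20))


# Synthetic 64-bit little-endian headers: 16 identification bytes, e_type, e_machine.
ident = b"\x7fELF\x02\x01\x01" + b"\x00" * 9
fake_arm = ident + struct.pack("<HH", 3, 0xB7)
fake_x86 = ident + struct.pack("<HH", 3, 0x3E)
print(elf_machine(fake_arm), elf_machine(fake_x86))  # aarch64 x86_64
```

On the TX2, library_machine("/usr/lib/aarch64-linux-gnu/libcudnn.so.7") should report aarch64 if the path matches your installation; a result of x86_64 would explain the "File in wrong format" message.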

error 3:

/third_party/onnx/onnx/onnx_pb.h:52:26: fatal error: onnx/onnx.pb.h: No such file or directory
compilation terminated.
Details:

caffe2/CMakeFiles/caffe2_pybind11_state.dir/python/pybind_state.cc.o -MF caffe2/CMakeFiles/caffe2_pybind11_state.dir/python/pybind_state.cc.o.d -o caffe2/CMakeFiles/caffe2_pybind11_state.dir/python/pybind_state.cc.o -c ../caffe2/python/pybind_state.cc
In file included from ../caffe2/onnx/helper.h:4:0,
                 from ../caffe2/onnx/backend.h:5,
                 from ../caffe2/python/pybind_state.cc:19:


../third_party/onnx/onnx/onnx_pb.h:52:26: fatal error: onnx/onnx.pb.h: No such file or directory
compilation terminated.
[1735/2643] Building CXX object caffe2/CMakeFiles/caffe2.dir/share/contrib/depthwise/depthwise3x3_conv_op.cc.o
ninja: build stopped: subcommand failed.
Failed to run 'bash ../tools/build_pytorch_libs.sh --use-cuda --use-nnpack --use-qnnpack caffe2'

Solution
Reference: https://github.com/onnx/onnx/issues/1947

The code in third_party/onnx/onnx/onnx_pb.h is as follows:

#ifdef ONNX_ML
#include "onnx/onnx-ml.pb.h"
#else
#include "onnx/onnx.pb.h"
#endif

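However, neither onnx-ml.pb.h nor onnx.pb.h lives in third_party/onnx; both are generated during the build. Searching the pytorch directory turns up only onnx-ml.pb.h, so all we need to do is define ONNX_ML:

In the current terminal, enter
export ONNX_ML=1

Then build again.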
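The search described above can be scripted: walk the tree, note which of the two generated headers exist, and decide whether ONNX_ML has to be defined. The function names are hypothetical, and the demo runs on a temporary directory that imitates a tree where only onnx-ml.pb.h was generated.

```python
import os
import tempfile


def generated_onnx_headers(build_root):
    """Return the set of generated ONNX protobuf headers found under build_root."""
    found = set()
    for _dirpath, _dirnames, filenames in os.walk(build_root):
        for name in filenames:
            if name in ("onnx.pb.h", "onnx-ml.pb.h"):
                found.add(name)
    return found


def needs_onnx_ml(found):
    """True when only the ML variant of the header was generated."""
    return "onnx-ml.pb.h" in found and "onnx.pb.h" not in found


# Demo tree containing only onnx-ml.pb.h.
with tempfile.TemporaryDirectory() as build_root:
    onnx_dir = os.path.join(build_root, "third_party", "onnx", "onnx")
    os.makedirs(onnx_dir)
    open(os.path.join(onnx_dir, "onnx-ml.pb.h"), "w").close()
    demo_found = generated_onnx_headers(build_root)
print(sorted(demo_found), needs_onnx_ml(demo_found))  # ['onnx-ml.pb.h'] True
```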

error 4:

RuntimeError: cuda runtime error (7) : too many resources requested for launch at /home/nvidia/pytorch/aten/src/THCUNN/generic/SpatialUpSamplingBilinear.cu:66
Details:

 self.nyu = h5py.File(self.data_path)
THCudaCheck FAIL file=/home/nvidia/pytorch/aten/src/THCUNN/generic/SpatialUpSamplingBilinear.cu line=66 error=7 : too many resources requested for launch
Traceback (most recent call last):
  File "test.py", line 96, in <module>
    output = model(input_var)
  File "/usr/local/lib/python2.7/dist-packages/torch/nn/modules/module.py", line 489, in __call__
    result = self.forward(*input, **kwargs)
  File "/media/nvidia/xiudan/zd_structure_paraformImgNet/fcrn.py", line 249, in forward
    ad1 = self._upsample_add(p1, c2)
  File "/media/nvidia/xiudan/zd_structure_paraformImgNet/fcrn.py", line 216, in _upsample_add
    return torch.nn.functional.interpolate(x, size=(H,W), mode='bilinear',align_corners=True) + y
  File "/usr/local/lib/python2.7/dist-packages/torch/nn/functional.py", line 2447, in interpolate
    return torch._C._nn.upsample_bilinear2d(input, _output_size(2), align_corners)
RuntimeError: cuda runtime error (7) : too many resources requested for launch at /home/nvidia/pytorch/aten/src/THCUNN/generic/SpatialUpSamplingBilinear.cu:66

Solution
Reference: https://github.com/pytorch/pytorch/issues/8103#issuecomment-424343705
Make the following changes in "aten/src/THCUNN/generic/SpatialUpSamplingBilinear.cu":

Around line 62:
comment out THCState_getCurrentDeviceProperties(state)->maxThreadsPerBlock;
Set
const int num_threads = 512;

Around line 97:
comment out THCState_getCurrentDeviceProperties(state)->maxThreadsPerBlock;
Set
const int num_threads = 512;

I followed this guide for installation: https://gist.github.com/dusty-nv/ef2b372301c00c0a9d3203e42fd83426 using the install mode command “sudo python setup.py install”
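The two edits in SpatialUpSamplingBilinear.cu can also be made with a substitution instead of by hand. The sketch below rests on an assumption about the source: that both places assign num_threads directly from maxThreadsPerBlock in a single statement. It replaces the statement rather than commenting it out, and the function name is mine.

```python
import re

# Assumed form of the statement near lines 62 and 97; adjust if the source differs.
MAX_THREADS_RE = re.compile(
    r"^([ \t]*)const int num_threads\s*=\s*"
    r"THCState_getCurrentDeviceProperties\(state\)->maxThreadsPerBlock;",
    re.MULTILINE,
)


def cap_num_threads(source, limit=512):
    """Replace the device-derived thread count with a fixed limit, keeping indentation."""
    replacement = r"\g<1>const int num_threads = {0};".format(limit)
    return MAX_THREADS_RE.sub(replacement, source)


before = (
    "  const int num_threads = "
    "THCState_getCurrentDeviceProperties(state)->maxThreadsPerBlock;\n"
)
after = cap_num_threads(before)
print(after)
```

Any line that does not match the assumed pattern is left untouched, so check the result against both locations before rebuilding.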
