Upgrading CUDA 9.0 to 10.1 and adjusting related software (TensorFlow, TensorRT, OpenCV)

System:
Ubuntu 16.04 LTS
Hardware:
GeForce GTX 1060 (6078 MiB)
Installed graphics driver:
NVIDIA-SMI 418.56 Driver Version: 418.56

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.56       Driver Version: 418.56       CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 106...  Off  | 00000000:01:00.0 Off |                  N/A |
| 24%   38C    P8     5W / 130W |    244MiB /  6078MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

CUDA 10.1

I downloaded the runfile installer:
Archived Releases: CUDA Toolkit 10.1 (Feb 2019)
https://developer.nvidia.com/cuda-10.1-download-archive-base (log in first, then open this link.)
Installation:
1. Run sudo sh cuda_10.1.105_418.39_linux.run
2. Follow the command-line prompts

│ Do you accept the above EULA? (accept/decline/quit):
│ accept  
│─────────────────────────────────────────────────────
# Installer options. Driver 418.56 is already installed, so I left the Driver entry unchecked.
│ CUDA Installer
│ - [ ] Driver
│      [ ] 418.39
│ + [X] CUDA Toolkit 10.1
│   [X] CUDA Samples 10.1
│   [X] CUDA Demo Suite 10.1
│   [X] CUDA Documentation 10.1
│   Install 
│   Options
# The installer failed; according to the message, the X server has to be stopped first.
[INFO]: ERROR: You appear to be running an X server; please exit X before
[INFO]:        installing.  For further details, please see the section INSTALLING
[INFO]:        THE NVIDIA DRIVER in the README available on the Linux driver
[INFO]:        download page at www.nvidia.com.
# Stop the X server
#   Log out of the Ubuntu desktop session first, then press Ctrl+Alt+F1 to switch to a text console
$ sudo service lightdm stop
$ sudo init 3

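# The message mentions nouveau; it seems nouveau has to be disabled before installing CUDA.
#   The CUDA installer can also install the NVIDIA driver, and the open-source nouveau driver conflicts with it, so nouveau must be disabled before continuing. Steps:
# 1) Blacklist nouveau by appending the following line to /etc/modprobe.d/blacklist.conf (editing this file requires root):
sudo vim /etc/modprobe.d/blacklist.conf
blacklist nouveau
# 2) Back up the initramfs file
sudo mv /boot/initramfs-$(uname -r).img /boot/initramfs-$(uname -r).img.bak
# 3) Rebuild the initramfs
sudo dracut -v /boot/initramfs-$(uname -r).img $(uname -r)
#    Note: steps 2 and 3 are the dracut procedure used on RHEL/CentOS-style systems. Stock Ubuntu 16.04 names the image /boot/initrd.img-$(uname -r) and regenerates it with: sudo update-initramfs -u
# 4) Reboot, then check that nouveau is not loaded (no output means it is disabled)
lsmod | grep nouveau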
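
On a stock Ubuntu 16.04 system, the same idea can be sketched with update-initramfs instead of dracut. This is only an illustration under a few assumptions: the function name blacklist_nouveau and the separate file name blacklist-nouveau.conf are my own choices, and the extra "options nouveau modeset=0" line is a commonly used addition rather than part of the steps above.

```shell
#!/bin/sh
# Hypothetical helper: append the nouveau blacklist entries to a modprobe
# config file, skipping any line that is already present.
blacklist_nouveau() {
    conf="$1"
    for line in "blacklist nouveau" "options nouveau modeset=0"; do
        # -q: quiet, -x: match the whole line, -F: treat the pattern as a fixed string
        grep -qxF "$line" "$conf" 2>/dev/null || printf '%s\n' "$line" >> "$conf"
    done
}

# Example, run as root (left commented out on purpose):
#   blacklist_nouveau /etc/modprobe.d/blacklist-nouveau.conf
#   update-initramfs -u    # rebuild the initramfs, then reboot
```

Calling the function twice is harmless because existing lines are not appended again. After the reboot, lsmod | grep nouveau should print nothing.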

# After installation, check the CUDA version
$ nvcc -V
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2019 NVIDIA Corporation
Built on Fri_Feb__8_19:08:17_PST_2019
Cuda compilation tools, release 10.1, V10.1.105
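
If nvcc is not found at this point, the toolkit directories are probably not on the search paths yet. A minimal sketch, assuming the runfile's default install prefix /usr/local/cuda-10.1 (adjust it if you chose a different location); the lines can go into ~/.bashrc:

```shell
# Assumed default install prefix of the CUDA 10.1 runfile.
CUDA_HOME=/usr/local/cuda-10.1
export CUDA_HOME

# Prepend the toolkit's bin and lib64 directories; the ${VAR:+...} form
# avoids a dangling ':' when the variable was previously empty.
export PATH="$CUDA_HOME/bin${PATH:+:$PATH}"
export LD_LIBRARY_PATH="$CUDA_HOME/lib64${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}"
```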

cuDNN 7.5.0

Download the matching cuDNN from the official site:
https://developer.nvidia.com/cudnn
When my project ran, it printed: [W] [TRT] TensorRT was compiled against cuDNN 7.5.0 but is linked against cuDNN 7.4.2
So I downloaded cuDNN v7.5.0 (Feb 25, 2019), for CUDA 10.1

# Extract directly into the target directory
$ sudo tar -zxvf cudnn-10.1-linux-x64-v7.5.0.56.tgz -C /usr/local
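
To confirm which cuDNN version actually ended up installed, the version macros in cudnn.h can be read back. The helper below is a sketch: the name cudnn_version is hypothetical, and it assumes the cuDNN 7 header layout, where CUDNN_MAJOR, CUDNN_MINOR and CUDNN_PATCHLEVEL are defined directly in cudnn.h.

```shell
#!/bin/sh
# Hypothetical helper: print "major.minor.patch" from a cuDNN 7 cudnn.h header.
cudnn_version() {
    header="$1"
    # Match only the exact "#define NAME value" lines, so the derived
    # CUDNN_VERSION macro is ignored.
    major=$(awk '$1 == "#define" && $2 == "CUDNN_MAJOR" {print $3}' "$header")
    minor=$(awk '$1 == "#define" && $2 == "CUDNN_MINOR" {print $3}' "$header")
    patch=$(awk '$1 == "#define" && $2 == "CUDNN_PATCHLEVEL" {print $3}' "$header")
    echo "$major.$minor.$patch"
}

# Example (the header path is an assumption; it depends on where the archive was extracted):
#   cudnn_version /usr/local/cuda/include/cudnn.h
```

If the printed version is still 7.4.2, an older copy of the header is being picked up and the TensorRT warning above will most likely persist.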

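Other software

1. Upgrading OpenCV 3.4 to 4.0

OpenCV 3.4 has to be uninstalled first, and its leftover headers and libraries removed:

# Go to the OpenCV 3.4 build directory and uninstall
$ sudo make uninstall
# Remove leftover header files
$ sudo rm -rf /usr/local/include/opencv*
# Remove leftover library files
$ sudo rm -rf /usr/local/lib/libopencv_*

Then install OpenCV 4.0:

# Go to the openCV4.0/build directory
$ cmake -D CMAKE_BUILD_TYPE=Release ..
# Build
$ make -j12
# Install
$ sudo make install

Linking my project then failed with undefined references: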

libemotion_sdk.so: undefined reference to `cv::imread(std::string const&, int)'
libemotion_sdk.so: undefined reference to `cv::VideoCapture::VideoCapture(std::string const&, int)'

https://github.com/opencv/opencv/issues/13000
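Hint from that issue: before running cmake, add this line to CMakeLists.txt:
add_definitions(-D_GLIBCXX_USE_CXX11_ABI=0)
Note: "-D_GLIBCXX_USE_CXX11_ABI=0" makes GCC 5 and later use the old (pre-C++11) std::string ABI. It is needed because protobuf, among other libraries, was built against the GCC 4 ABI; the plain std::string in the undefined symbols above is the sign of that mismatch.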

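Problem:

// After updating OpenCV to 4.1, the following warning appeared:
//   [swscaler @ 0x7fe4e457a8c0] deprecated pixel format used, make sure you did set range correctly
//   The cause is that the pixel format in use has been deprecated.
// This function works around it by mapping each deprecated full-range YUVJ
// format to its non-deprecated YUV counterpart. The YUVJ formats imply full
// (JPEG) color range, so the range may have to be set separately.
extern "C" {
#include <libavutil/pixfmt.h>  // AVPixelFormat
}

AVPixelFormat ConvertDeprecatedFormat(enum AVPixelFormat format)
{
	switch (format)
	{
		case AV_PIX_FMT_YUVJ420P:
			return AV_PIX_FMT_YUV420P;
		case AV_PIX_FMT_YUVJ422P:
			return AV_PIX_FMT_YUV422P;
		case AV_PIX_FMT_YUVJ444P:
			return AV_PIX_FMT_YUV444P;
		case AV_PIX_FMT_YUVJ440P:
			return AV_PIX_FMT_YUV440P;
		default:
			// Any other format is passed through unchanged.
			return format;
	}
}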

2. TensorRT

My previous TensorRT was: TensorRT-5.1.5.0.Ubuntu-16.04.5.x86_64-gnu.cuda-9.0.cudnn7.5
A build that supports CUDA 10.1 has to be downloaded instead:
Download (login required): https://developer.nvidia.com/tensorrt
TensorRT 5.1.5.0 GA for Ubuntu 16.04 and CUDA 10.1 tar package
Extract it to a directory of your choice, then reference it in CMakeLists.txt.

3. TensorFlow

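After switching to CUDA 10.1, building my project produces a linker warning, because the existing libtensorflow_cc.so still tries to link against the CUDA 9 libraries: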

/usr/bin/ld: warning: libcublas.so.9.0, needed by /home/toson/tf_include/libtensorflow_cc.so, not found (try using -rpath or -rpath-link)

So TensorFlow has to be rebuilt.

bazel
Trying to build TensorFlow right away reports:

Cannot find bazel. Please install bazel.
Configuration finished

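TensorFlow depends on bazel, so bazel has to be installed first.
I downloaded version 0.15.2: https://github.com/bazelbuild/bazel/releases/download/0.15.2/bazel-0.15.2-installer-linux-x86_64.sh
Building bazel from source also requires Java, which is a hassle, so I used the installer shell script:
bazel-0.15.2-installer-linux-x86_64.sh

# Install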
$ chmod +x bazel-0.15.2-installer-linux-x86_64.sh
$ sudo ./bazel-0.15.2-installer-linux-x86_64.sh #--user

protobuf
protobuf is already installed on my system, but here are the build and install steps anyway:

$ tar zvxf protobuf-all-3.6.1.tar.gz
$ cd protobuf-3.6.1
$ ./configure --prefix=/usr/local/
$ make
$ make check
$ sudo make install

TensorFlow
I downloaded version 1.12: https://github.com/tosonw/tensorflow/archive/v1.12.0.tar.gz
After extracting it, change into the source directory in a terminal:

$ mkdir build
$ cd build
$ ../configure 
Extracting Bazel installation...
WARNING: --batch mode is deprecated. Please instead explicitly shut down your Bazel server using the command "bazel shutdown".
You have bazel 0.15.2 installed.
Please specify the location of python. [Default is /home/toson/anaconda3/bin/python]: 


Found possible Python library paths:
  /home/toson/anaconda3/lib/python3.6/site-packages
  /opt/intel/openvino_2019.1.144/python/python3.6
  /opt/intel/openvino_2019.1.144/deployment_tools/model_optimizer
  /home/toson/compile_libs/caffes/caffe_origin/python
Please input the desired Python library path to use.  Default is [/home/toson/anaconda3/lib/python3.6/site-packages]

Do you wish to build TensorFlow with Apache Ignite support? [Y/n]: n
No Apache Ignite support will be enabled for TensorFlow.

Do you wish to build TensorFlow with XLA JIT support? [Y/n]: n
No XLA JIT support will be enabled for TensorFlow.

Do you wish to build TensorFlow with OpenCL SYCL support? [y/N]: n
No OpenCL SYCL support will be enabled for TensorFlow.

Do you wish to build TensorFlow with ROCm support? [y/N]: n
No ROCm support will be enabled for TensorFlow.

Do you wish to build TensorFlow with CUDA support? [y/N]: y
CUDA support will be enabled for TensorFlow.

Please specify the CUDA SDK version you want to use. [Leave empty to default to CUDA 9.0]: 10.1


Please specify the location where CUDA 10.1 toolkit is installed. Refer to README.md for more details. [Default is /usr/local/cuda]: 


Please specify the cuDNN version you want to use. [Leave empty to default to cuDNN 7]: 


Please specify the location where cuDNN 7 library is installed. Refer to README.md for more details. [Default is /usr/local/cuda]: 


Do you wish to build TensorFlow with TensorRT support? [y/N]: n
No TensorRT support will be enabled for TensorFlow.

Please specify the NCCL version you want to use. If NCCL 2.2 is not installed, then you can use version 1.3 that can be fetched automatically but it may have worse performance with multiple GPUs. [Default is 2.2]: 1.3


Please specify a list of comma-separated Cuda compute capabilities you want to build with.
You can find the compute capability of your device at: https://developer.nvidia.com/cuda-gpus.
Please note that each additional compute capability significantly increases your build time and binary size. [Default is: 6.1]: 


Do you want to use clang as CUDA compiler? [y/N]: n
nvcc will be used as CUDA compiler.

Please specify which gcc should be used by nvcc as the host compiler. [Default is /usr/bin/gcc]: 


Do you wish to build TensorFlow with MPI support? [y/N]: n
No MPI support will be enabled for TensorFlow.

Please specify optimization flags to use during compilation when bazel option "--config=opt" is specified [Default is -march=native]: 


Would you like to interactively configure ./WORKSPACE for Android builds? [y/N]: 
Not configuring the WORKSPACE for Android builds.

Preconfigured Bazel build configs. You can use any of the below by adding "--config=<>" to your build command. See tools/bazel.rc for more details.
	--config=mkl         	# Build with MKL support.
	--config=monolithic  	# Config for mostly static monolithic build.
	--config=gdr         	# Build with GDR support.
	--config=verbs       	# Build with libverbs support.
	--config=ngraph      	# Build with Intel nGraph support.
Configuration finished

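Build:

# Note: "-D_GLIBCXX_USE_CXX11_ABI=0" is needed because protobuf, among other libraries, was built against the GCC 4 ABI.
$ bazel build --config=opt --config=cuda --cxxopt="-D_GLIBCXX_USE_CXX11_ABI=0" //tensorflow:libtensorflow_cc.so
# This takes a while; the build succeeded if it ends with the following: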
INFO: Elapsed time: 1365.375s, Critical Path: 185.64s
INFO: 4956 processes: 4956 local.
INFO: Build completed successfully, 4985 total actions

If something goes wrong:

# A library cannot be found
Cuda Configuration Error: Cannot find cuda library libcublas.so.10.1
# Searching shows that no file with exactly that name exists; only the fully versioned name is present
$ locate libcublas.so.10.1
/usr/lib/x86_64-linux-gnu/libcublas.so.10.1.0.105
/usr/local/cuda-10.1/targets/x86_64-linux/lib/libcublas.so.10.1.0.105
# So I copied it manually to the expected name
sudo cp /usr/local/cuda-10.1/targets/x86_64-linux/lib/libcublas.so.10.1.0.105 /usr/local/cuda-10.1/targets/x86_64-linux/lib/libcublas.so.10.1
# Building again reports further missing libraries. Copying them the same way resolves it.
Cuda Configuration Error: Cannot find cuda library libcusolver.so.10.1
Cuda Configuration Error: Cannot find cuda library libcurand.so.10.1
Cuda Configuration Error: Cannot find cuda library libcufft.so.10.1

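After a successful build, libtensorflow_cc.so appears in the bazel-bin/tensorflow directory under the source root.

C library: bazel build //tensorflow:libtensorflow.so
C++ library: bazel build //tensorflow:libtensorflow_cc.so

The header files you need have to be copied out of the source tree:
bazel-genfiles/..., eigen/..., include/..., tf/...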

4. PyTorch

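The PyTorch C++ library (libtorch) can be downloaded directly from the official site. Just extract it; no installation is needed.
A CUDA 10 build has to be downloaded: https://pytorch.org/get-started/locally/

https://download.pytorch.org/libtorch/cu100/libtorch-shared-with-deps-latest.zip
After downloading, extract it and reference it in CMakeLists.txt: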

set(Torch_DIR /home/toson/download_libs/libtorch/share/cmake/Torch)
find_package(Torch REQUIRED)

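My program then failed to build: the tensorFromBlob() function could not be found.
This may be because I am now using PyTorch 1.1, but I could not find where to download libtorch 1.0.
Looking into the problem,
I saw on GitHub that the function has been replaced:
https://github.com/pytorch/pytorch/pull/18780
https://github.com/pytorch/pytorch/issues/15426
Note: torch::CPU and torch::CUDA were deprecated in PyTorch 1.0; write this instead: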
torch::from_blob(img_float.data, {1, 224, 224, 3}).to(torch::kCUDA)

5. Caffe

GitHub: https://github.com/BVLC/caffe
Caffe's GPU build depends on CUDA, so it has to be rebuilt.
I built the GPU version:

$ cd caffe
# Dependencies
$ sudo apt-get install 
# Edit the options in Makefile.config; for example, CPU_ONLY can be enabled there.
$ cp Makefile.config.example Makefile.config
# I enabled USE_CUDNN and left CPU_ONLY disabled
 USE_CUDNN := 1
 # CPU_ONLY := 1
 USE_OPENCV := 0
 OPENCV_VERSION := 3
 CUDA_DIR := /usr/local/cuda
 BLAS := open
# Build
$ make clean # If the build fails in strange ways, just clean first.
$ make all -j12
# make runtest -j16
# make pycaffe
