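In deep learning, a server's GPUs can greatly speed up algorithm execution. Each TensorFlow release is built against a default CUDA version, and when that default differs from what is installed on the server, the prebuilt package will not work. In that case TensorFlow has to be recompiled from source for the CUDA version on the server.
Feel free to follow me on GitHub: https://github.com/SpikeKing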
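Check the GPU
First inspect the server's GPU setup so that the right versions can be chosen during the build. CUDA is NVIDIA's parallel computing platform and programming model for GPUs, and most GPU runtime environments depend on it.
Export the CUDA environment variables. To see which CUDA version is installed, look in /usr/local.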
export PATH=/usr/local/cuda-8.0/bin:$PATH
export LD_LIBRARY_PATH=/usr/local/cuda-8.0/lib64:$LD_LIBRARY_PATH
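If several CUDA versions live side by side, it helps to list them before picking one. The snippet below is a minimal sketch, assuming the usual layout of versioned `cuda-X.Y` directories under `/usr/local` with an optional `cuda` symlink pointing at the active one; the helper names `find_cuda_installs` and `active_cuda` are made up for this post.

```python
import glob
import os


def find_cuda_installs(root="/usr/local"):
    """Return the sorted versioned CUDA directories (cuda-X.Y) under root."""
    pattern = os.path.join(root, "cuda-*")
    return sorted(p for p in glob.glob(pattern) if os.path.isdir(p))


def active_cuda(root="/usr/local"):
    """Resolve root/cuda (often a symlink) to the real directory, or None."""
    link = os.path.join(root, "cuda")
    return os.path.realpath(link) if os.path.exists(link) else None


# Prints whatever is installed on this machine; both may be empty/None.
print(find_cuda_installs())
print(active_cuda())
```

Whichever directory you pick is the one to put in the two export lines above.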
Check the CUDA version with the nvcc command. Here the CUDA version is 8.0.61:
nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2016 NVIDIA Corporation
Built on Tue_Jan_10_13:22:03_CST_2017
Cuda compilation tools, release 8.0, V8.0.61
Alternatively, read the version file. Again the CUDA version is 8.0.61:
cat /usr/local/cuda/version.txt
CUDA Version 8.0.61
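Both outputs carry the same version string, so a small parser can handle either. This is a sketch based only on the two output formats shown above; `parse_cuda_version` and `major_minor` are hypothetical helper names, and other CUDA releases may print the version differently.

```python
import re


def parse_cuda_version(text):
    """Extract 'major.minor.patch' from nvcc --version or version.txt text."""
    # nvcc prints "..., release 8.0, V8.0.61"; version.txt holds "CUDA Version 8.0.61".
    match = re.search(r"(?:\bV|CUDA Version )(\d+\.\d+\.\d+)", text)
    return match.group(1) if match else None


def major_minor(version):
    """Reduce '8.0.61' to '8.0', the form entered later at the configure prompt."""
    return ".".join(version.split(".")[:2])


nvcc_line = "Cuda compilation tools, release 8.0, V8.0.61"
version = parse_cuda_version(nvcc_line)
print(version, major_minor(version))
```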
Check the cuDNN version. Here the cuDNN version is 6.0.21:
cat /usr/local/cuda/include/cudnn.h | grep CUDNN_MAJOR -A 2
#define CUDNN_MAJOR 6
#define CUDNN_MINOR 0
#define CUDNN_PATCHLEVEL 21
--
#define CUDNN_VERSION (CUDNN_MAJOR * 1000 + CUDNN_MINOR * 100 + CUDNN_PATCHLEVEL)
#include "driver_types.h"
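The three `#define` lines are all that is needed to recover the version. As a rough sketch (the function names are mine, and it assumes the header uses exactly the macro names shown in the grep output above), the same check in Python looks like this:

```python
import re

# The relevant lines of cudnn.h, copied from the grep output above.
HEADER_SAMPLE = """\
#define CUDNN_MAJOR 6
#define CUDNN_MINOR 0
#define CUDNN_PATCHLEVEL 21
#define CUDNN_VERSION (CUDNN_MAJOR * 1000 + CUDNN_MINOR * 100 + CUDNN_PATCHLEVEL)
"""


def parse_cudnn_version(header_text):
    """Return (major, minor, patchlevel) from cudnn.h text, or None if missing."""
    values = []
    for name in ("MAJOR", "MINOR", "PATCHLEVEL"):
        match = re.search(r"#define\s+CUDNN_%s\s+(\d+)" % name, header_text)
        if match is None:
            return None
        values.append(int(match.group(1)))
    return tuple(values)


def cudnn_version_code(major, minor, patchlevel):
    """Mirror the CUDNN_VERSION macro: major * 1000 + minor * 100 + patchlevel."""
    return major * 1000 + minor * 100 + patchlevel


parsed = parse_cudnn_version(HEADER_SAMPLE)
print(parsed, cudnn_version_code(*parsed))
```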
Check the number and model of the GPUs. This server has 4 GPUs:
nvidia-smi
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 375.26                 Driver Version: 375.26                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  TITAN X (Pascal)    Off  | 0000:02:00.0     Off |                  N/A |
| 23%   20C    P0    54W / 250W |      0MiB / 12189MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  TITAN X (Pascal)    Off  | 0000:03:00.0     Off |                  N/A |
| 23%   21C    P0    54W / 250W |      0MiB / 12189MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   2  TITAN X (Pascal)    Off  | 0000:83:00.0     Off |                  N/A |
| 23%   21C    P0    55W / 250W |      0MiB / 12189MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   3  TITAN X (Pascal)    Off  | 0000:84:00.0     Off |                  N/A |
|  0%   21C    P0    51W / 250W |      0MiB / 12189MiB |      2%      Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
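The table is meant for people. For scripts, nvidia-smi also has a CSV query mode (`--query-gpu=... --format=csv,noheader`). Below is a minimal sketch of counting GPUs that way; the sample text is hand-written from the table above rather than captured from the server, and the parser assumes GPU names contain no commas.

```python
import subprocess

# Hypothetical CSV output for the first two cards, written by hand from the table.
SAMPLE = """\
0, TITAN X (Pascal), 12189 MiB
1, TITAN X (Pascal), 12189 MiB
"""


def parse_gpu_list(csv_text):
    """Turn 'index, name, memory.total' CSV lines into a list of dicts."""
    gpus = []
    for line in csv_text.splitlines():
        if not line.strip():
            continue
        index, name, memory = [field.strip() for field in line.split(",")]
        gpus.append({"index": int(index), "name": name, "memory": memory})
    return gpus


def query_gpus():
    """Run nvidia-smi in CSV mode; only works on a machine with the driver."""
    output = subprocess.check_output([
        "nvidia-smi",
        "--query-gpu=index,name,memory.total",
        "--format=csv,noheader",
    ])
    return parse_gpu_list(output.decode("utf-8"))


print(len(parse_gpu_list(SAMPLE)))
```

On the server itself, `len(query_gpus())` should agree with the number of rows in the table.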
Building TensorFlow
See the official TensorFlow documentation on building from source.
Bazel
Bazel is a tool for building and testing software. On an Ubuntu server it can be installed with apt; see the Bazel installation instructions.
Install JDK 8:
sudo apt-get install openjdk-8-jdk
Add the Bazel distribution URI as a package source:
echo "deb [arch=amd64] http://storage.googleapis.com/bazel-apt stable jdk1.8" | sudo tee /etc/apt/sources.list.d/bazel.list
curl https://bazel.build/bazel-release.pub.gpg | sudo apt-key add -
Install Bazel:
sudo apt-get install bazel
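This differs slightly from the official instructions: the apt-get update step is skipped, because storage.googleapis.com may be unreachable.
Check the Bazel version: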
bazel version
Output like the following means Bazel was installed successfully:
Build label: 0.15.0
Build target: bazel-out/k8-opt/bin/src/main/java/com/google/devtools/build/lib/bazel/BazelServer_deploy.jar
Build time: Tue Jun 26 12:10:19 2018 (1530015019)
Build timestamp: 1530015019
Build timestamp as int: 1530015019
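Set up CUDA
Unlike the official documentation, there is no need to install libcupti and cuda-command-line-tools separately. They are already included in the CUDA directory, so exporting the CUDA paths is enough.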
export PATH=/usr/local/cuda-8.0/bin:$PATH
export LD_LIBRARY_PATH=/usr/local/cuda-8.0/lib64:$LD_LIBRARY_PATH
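When the build is driven from a script rather than an interactive shell, the same two variables can be set on a copy of the environment and handed to the child process. This is only a sketch; `with_cuda_env` is a hypothetical helper, not part of TensorFlow or CUDA.

```python
import os


def with_cuda_env(cuda_home, base_env=None):
    """Return a copy of the environment with CUDA's bin and lib64 prepended."""
    env = dict(os.environ if base_env is None else base_env)
    additions = {
        "PATH": os.path.join(cuda_home, "bin"),
        "LD_LIBRARY_PATH": os.path.join(cuda_home, "lib64"),
    }
    for key, new_dir in additions.items():
        current = env.get(key, "")
        # Skip the separator when the variable was previously unset or empty.
        env[key] = new_dir + os.pathsep + current if current else new_dir
    return env


env = with_cuda_env("/usr/local/cuda-8.0", {"PATH": "/usr/bin"})
print(env["PATH"])
print(env["LD_LIBRARY_PATH"])
```

The result can be passed as `env=` to `subprocess.check_call`, for example when invoking `nvcc` or `bazel`.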
Build from source
Download the TensorFlow source:
git clone https://github.com/tensorflow/tensorflow
Configure the build and enable GPU support:
./configure
Please specify the location of python. [Default is /data2/wcl1/tensorflow/venv/bin/python] ## choose the Python interpreter, 2 or 3
## answer N or n to the remaining prompts
Do you wish to build TensorFlow with CUDA support? [y/N]: y # enable GPU support: y
CUDA support will be enabled for TensorFlow.
Please specify the CUDA SDK version you want to use. [Leave empty to default to CUDA 9.0]: 8.0 # CUDA version, matching the server: 8.0
Please specify the cuDNN version you want to use. [Leave empty to default to cuDNN 7.0]: 6.0 # cuDNN version, matching the server: 6.0
## answer N or accept the default for the remaining prompts
Do you want to use clang as CUDA compiler? [y/N]: N ## use nvcc
nvcc will be used as CUDA compiler.
## answer N or accept the default for the remaining prompts
Configuration finished
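The prompts can be tedious to answer by hand on every rebuild. As far as I know, the configure script in the 1.x source tree also reads its answers from environment variables such as `TF_NEED_CUDA`, `TF_CUDA_VERSION` and `TF_CUDNN_VERSION`. Treat the variable names below as an assumption and confirm them against `configure.py` in your checkout. The sketch only builds the mapping for the answers given above; `configure_answers` is a made-up helper, and setting `CUDNN_INSTALL_PATH` to the CUDA directory assumes cuDNN was installed inside it, as the cudnn.h path earlier suggests.

```python
def configure_answers(python_bin, cuda_home, cuda_version, cudnn_version):
    """Map the interactive answers above to (assumed) configure variables."""
    return {
        "PYTHON_BIN_PATH": python_bin,      # location of python
        "TF_NEED_CUDA": "1",                # build with CUDA support: y
        "TF_CUDA_VERSION": cuda_version,    # CUDA SDK version prompt
        "TF_CUDNN_VERSION": cudnn_version,  # cuDNN version prompt
        "CUDA_TOOLKIT_PATH": cuda_home,
        "CUDNN_INSTALL_PATH": cuda_home,    # assumes cuDNN lives in the CUDA dir
        "TF_CUDA_CLANG": "0",               # use nvcc, not clang
    }


answers = configure_answers(
    python_bin="/data2/wcl1/tensorflow/venv/bin/python",
    cuda_home="/usr/local/cuda-8.0",
    cuda_version="8.0",
    cudnn_version="6.0",
)
for key in sorted(answers):
    print("export {0}={1}".format(key, answers[key]))
```

Prompts that are not covered by a variable would still be asked interactively.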
Build the GPU package:
bazel build --config=opt --config=cuda //tensorflow/tools/pip_package:build_pip_package
If the build fails with a libdevice error:
Cannot find libdevice.10.bc under /usr/local/cuda-8.0
then copy libdevice.compute_50.10.bc in /usr/local/cuda-8.0/nvvm/libdevice/ to libdevice.10.bc in the same directory, and also copy it to /usr/local/cuda-8.0/ under that name.
cd /usr/local/cuda-8.0/nvvm/libdevice/
sudo cp libdevice.compute_50.10.bc libdevice.10.bc
sudo cp libdevice.compute_50.10.bc /usr/local/cuda-8.0/libdevice.10.bc
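For a setup script, the same workaround can be made repeatable so that running it twice does no harm. A minimal sketch, assuming the file names from the commands above; `ensure_libdevice` is a hypothetical helper and needs write access to the CUDA directory (root on most servers).

```python
import os
import shutil


def ensure_libdevice(cuda_home, source_name="libdevice.compute_50.10.bc"):
    """Copy the compute_50 bitcode to libdevice.10.bc where the build looks for it.

    Returns the list of files that were actually created.
    """
    lib_dir = os.path.join(cuda_home, "nvvm", "libdevice")
    source = os.path.join(lib_dir, source_name)
    if not os.path.isfile(source):
        raise IOError("libdevice source not found: " + source)
    targets = [
        os.path.join(lib_dir, "libdevice.10.bc"),
        os.path.join(cuda_home, "libdevice.10.bc"),
    ]
    created = []
    for target in targets:
        if not os.path.exists(target):  # leave existing files untouched
            shutil.copyfile(source, target)
            created.append(target)
    return created


# Example (run as root): ensure_libdevice("/usr/local/cuda-8.0")
```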
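The Bazel build takes a long time, so be patient. Here it ran 9945 steps, one after another.
Convert the build output into a pip wheel (.whl) package, written here to the /tmp/tensorflow_pkg directory: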
bazel-bin/tensorflow/tools/pip_package/build_pip_package /tmp/tensorflow_pkg
Install the package with pip. This is TensorFlow 1.9 with GPU support:
pip install /tmp/tensorflow_pkg/tensorflow-1.9.0rc0-cp27-cp27mu-linux_x86_64.whl -i https://pypi.douban.com/simple
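The wheel's file name says which interpreter it was built for: `cp27` is CPython 2.7 and `cp27mu` is the matching ABI tag. If pip rejects the wheel as unsupported, comparing these tags with the Python you are running is a quick first check. The sketch below splits the name following the standard wheel naming scheme; both helpers are made up for this post, and the match only handles CPython tags.

```python
def parse_wheel_filename(filename):
    """Split 'name-version-python-abi-platform.whl' into its five parts."""
    if not filename.endswith(".whl"):
        raise ValueError("not a wheel: " + filename)
    parts = filename[:-len(".whl")].split("-")
    if len(parts) != 5:
        # Wheels with an optional build tag have six parts; not handled here.
        raise ValueError("unexpected wheel name: " + filename)
    keys = ("name", "version", "python", "abi", "platform")
    return dict(zip(keys, parts))


def matches_cpython(info, version_info):
    """True if the wheel's python tag matches a CPython (major, minor) pair."""
    return info["python"] == "cp{0}{1}".format(version_info[0], version_info[1])


info = parse_wheel_filename("tensorflow-1.9.0rc0-cp27-cp27mu-linux_x86_64.whl")
print(info["version"], info["python"], info["abi"])
```

On a live system, `matches_cpython(info, sys.version_info)` tells you whether the current interpreter is the one the wheel targets.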
To verify the installation, **leave** the TensorFlow source directory, start a Python shell, and run:
import tensorflow as tf
hello = tf.constant('Hello, TensorFlow!')
sess = tf.Session()
print(sess.run(hello))
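If you get the following platform error, it is because Python was started inside the tensorflow source directory, so the import picks up the source tree instead of the installed package: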
No module named tensorflow.python.platform
In that case, leave the tensorflow directory, start the Python shell again, and import tensorflow.
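The reason is that an interactive Python session searches the current directory first, so a local `tensorflow/` folder wins over the package installed by pip. A small sketch of checking for that situation before importing; `shadows_installed_package` is a hypothetical name.

```python
import os


def shadows_installed_package(directory, package="tensorflow"):
    """True if directory holds a folder that would be imported instead of
    the installed package of the same name."""
    return os.path.isdir(os.path.join(directory, package))


if shadows_installed_package(os.getcwd()):
    print("A local tensorflow/ directory is present; change directory first.")
else:
    print("No local tensorflow/ directory here.")
```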
Check whether the GPUs are available:
from tensorflow.python.client import device_lib
local_device_protos = device_lib.list_local_devices()
print("all: %s" % [x.name for x in local_device_protos])
## Output
all: [u'/device:CPU:0', u'/device:GPU:0', u'/device:GPU:1', u'/device:GPU:2', u'/device:GPU:3']
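To turn that list into a pass/fail check, count the GPU entries and compare the count with what nvidia-smi reported (4 on this server). A minimal sketch that works on the device names printed above, so it runs without TensorFlow; `gpu_device_names` is a made-up helper.

```python
def gpu_device_names(device_names):
    """Keep only the entries that name a GPU device."""
    return [name for name in device_names if name.startswith("/device:GPU:")]


# Device names copied from the output above.
names = ["/device:CPU:0", "/device:GPU:0", "/device:GPU:1",
         "/device:GPU:2", "/device:GPU:3"]
gpus = gpu_device_names(names)
print("%d GPUs visible to TensorFlow" % len(gpus))
```

With TensorFlow installed, pass `[x.name for x in local_device_protos]` instead of the hard-coded list.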
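Note: after building TensorFlow, nvidia-smi hung (got stuck) and only worked again after the server was rebooted. The cause is unknown; possibly GPU resources had not been fully released before being loaded a second time.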
OK, that’s all! Enjoy it!