老徐
Thursday, 26 July 2018

Tensorflow 1.8 with GPU on macOS High Sierra 10.13.6

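The TensorFlow team announced that it stopped supporting the GPU build of TensorFlow on macOS after version 1.2.
So it cannot be installed directly; the only option is to compile it from source.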

Tensorflow 1.8 with CUDA on macOS High Sierra 10.13.6

Running TensorFlow on the CPU did not feel fast enough, so I wanted to try GPU acceleration. Conveniently, I have a graphics card that supports CUDA.

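Versions

This is important enough to say three times: the drivers and the build tools must be a matching set of versions, otherwise the build will fail!

Versions used:

TensorFlow r1.8 source code; the latest 1.9 still seems to have problems
macOS 10.13.6; this probably matters little
GPU driver 387.10.10.10.40.105, which supports CUDA 9.1
CUDA 9.2; this refers to the CUDA driver (CUDA Driver 9.2), which may be newer than the CUDA version supported by the GPU driver above
cuDNN 7.2, matching the CUDA version above; just install the latest
Xcode 8.2.1; this is critical. Downgrade to this version, otherwise the build fails or the result crashes at run time with a Segmentation Fault
bazel 0.14.0; this is critical. Downgrade to this version
Python 3.6; this is critical. Do not use the latest Python 3.7, which has build problems as of this writing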
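
Since everything hinges on matching versions, it can help to write the pins down once and compare them with what is actually installed. The following is only a minimal sketch and was not part of the original build: the `PINNED` table copies the versions listed above, while `matches_pin`, `mismatches` and the sample `found` values are hypothetical names made up for illustration. Matching is by leading components, so `3.6.5` satisfies a pin of `3.6`.

```python
# Versions pinned in the list above; the keys are illustrative labels only.
PINNED = {
    "tensorflow": "1.8",
    "cuda": "9.2",
    "cudnn": "7.2",
    "xcode": "8.2.1",
    "bazel": "0.14.0",
    "python": "3.6",
}


def matches_pin(found, pin):
    """True if the dot-separated version `found` starts with the components of `pin`."""
    found_parts = found.split(".")
    pin_parts = pin.split(".")
    return found_parts[:len(pin_parts)] == pin_parts


def mismatches(found_versions, pinned=PINNED):
    """Return {tool: (found, pinned)} for every tool that does not match its pin."""
    bad = {}
    for tool, pin in pinned.items():
        found = found_versions.get(tool)
        if found is None or not matches_pin(found, pin):
            bad[tool] = (found, pin)
    return bad


if __name__ == "__main__":
    # Hypothetical values, as they might be noted down by hand from each tool.
    found = {"tensorflow": "1.8.0", "cuda": "9.2.148", "cudnn": "7.2.1",
             "xcode": "8.2.1", "bazel": "0.14.0", "python": "3.7.0"}
    print(mismatches(found))
```

With the sample values only the Python entry is reported, which is the Python 3.7 case warned about above.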

Preparation

Things to download (some of the files are large, so start the downloads before reading on to save time):

Xcode 8.2.1
https://developer.apple.com/d...

Xcode_8.2.1.xip
bazel-0.14.0
https://github.com/bazelbuild...
CUDA Toolkit 9.2
https://developer.nvidia.com/...
cuDNN v7.2.1
https://developer.nvidia.com/...

TensorFlow source code, 333 MB

$ git clone https://github.com/tensorflow/tensorflow -b r1.8

Python 3.6.5_1

Python 3.7 is what I currently have installed, so downgrade it

$ brew unlink python
$ brew install https://raw.githubusercontent.com/Homebrew/homebrew-core/f2a764ef944b1080be64bd88dca9a1d80130c558/Formula/python.rb
$ pip3 install --upgrade pip setuptools wheel
# $ brew switch python 3.6.5_1

Do not use Python 3.7.0, otherwise the build will run into problems

After the build you can switch back

$ brew switch python 3.7.0

Xcode 8.2.1

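Xcode needs to be downgraded to 8.2.1

Download the package from the Apple developer site, https://developer.apple.com/d...

After extracting it, copy it to /Applications/Xcode.app, then point the toolchain at it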

$ sudo xcode-select -s /Applications/Xcode.app

Confirm that the installation is correct

$ cc -v
Apple LLVM version 8.0.0 (clang-800.0.42.1)
Target: x86_64-apple-darwin17.7.0
Thread model: posix
InstalledDir: /Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin

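Command Line Tools: cc is clang
This is important; otherwise the build may succeed, but running anything slightly complex will hit a Segmentation Fault

Environment variables

The CUDA libraries are not in a system directory, so environment variables are needed to point to them.
On the Mac, LD_LIBRARY_PATH has no effect; DYLD_LIBRARY_PATH is used instead.

To configure the environment variables, edit ~/.bash_profile or ~/.zshrc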

export CUDA_HOME=/usr/local/cuda
export DYLD_LIBRARY_PATH=$CUDA_HOME/lib:$CUDA_HOME/extras/CUPTI/lib
export PATH=$CUDA_HOME/bin:$PATH
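
A quick way to see whether the exports above took effect in a given shell is to look for the two CUDA directories in DYLD_LIBRARY_PATH. This is a minimal sketch rather than anything required by the build; `missing_cuda_dirs` and `CUDA_HOME_DEFAULT` are hypothetical names, and the two directories are simply the ones exported above.

```python
import os

# Default taken from the export lines above.
CUDA_HOME_DEFAULT = "/usr/local/cuda"


def missing_cuda_dirs(env, cuda_home=CUDA_HOME_DEFAULT):
    """Return the CUDA library directories that are absent from DYLD_LIBRARY_PATH."""
    wanted = [cuda_home + "/lib", cuda_home + "/extras/CUPTI/lib"]
    present = [p for p in env.get("DYLD_LIBRARY_PATH", "").split(":") if p]
    return [d for d in wanted if d not in present]


if __name__ == "__main__":
    home = os.environ.get("CUDA_HOME", CUDA_HOME_DEFAULT)
    print("missing from DYLD_LIBRARY_PATH:", missing_cuda_dirs(os.environ, home))
```

An empty list means both directories are visible to the current process; a non-empty list names what is missing.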

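Installing CUDA

CUDA is NVIDIA's parallel computing framework for its own GPUs. In other words, CUDA runs only on NVIDIA GPUs, and it pays off only when the computation to be solved can be heavily parallelized.

Step 1: Confirm that the graphics card supports GPU computing

Find your card model here and check whether it is supported
https://developer.nvidia.com/...

My card is an NVIDIA GeForce GTX 750 Ti: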

GPU	Compute Capability
GeForce GTX 750 Ti	5.0

Step 2: Install CUDA

If another version of CUDA is installed and needs to be removed, run

$ sudo /usr/local/bin/uninstall_cuda_drv.pl
$ sudo /usr/local/cuda/bin/uninstall_cuda_9.1.pl
$ sudo rm -rf /Developer/NVIDIA/CUDA-9.1/
$ sudo rm -rf /Library/Frameworks/CUDA.framework
$ sudo rm -rf /usr/local/cuda/

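To be safe, it is best to reboot at this point

First, note that the CUDA Driver must match the GPU Driver (that is, the CUDA Driver release must list your GPU driver version as supported) for CUDA to find the card.

GPU Driver, i.e. the graphics driver
- http://www.macvidcards.com/dr...
- My macOS is 10.13.6, and the corresponding driver is already installed at the latest version, 387.10.10.10.40.105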
  
  https://www.nvidia.com/downlo...
```
Version:    387.10.10.10.40.105
Release Date:    2018.7.10
Operating System:    macOS High Sierra 10.13.6
CUDA Toolkit:    9.1
```

CUDA Driver

http://www.nvidia.com/object/...
Install the CUDA Driver separately first. You can pick the latest version; check which GPU driver it supports

cudadriver_396.148_macos.dmg

New Release 396.148
CUDA driver update to support CUDA Toolkit 9.2, macOS 10.13.6 and NVIDIA display driver 387.10.10.10.40.105
Recommended CUDA version(s): CUDA 9.2
Supported macOS 10.13

CUDA Toolkit
- https://developer.nvidia.com/...
- You can pick the latest version; 9.2 is chosen here
- cuda_9.2.148_mac.dmg, cuda_9.2.148.1_mac.dmg

After installation, check:

$ nvcc -V
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2018 NVIDIA Corporation
Built on Tue_Jun_12_23:08:12_CDT_2018
Cuda compilation tools, release 9.2, V9.2.148
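
If you want to check the toolkit version from a script instead of by eye, the release line of the output above is easy to parse. This is a minimal sketch under the assumption that the output has the same shape as the sample shown here; `nvcc_release` is a hypothetical helper name, and `SAMPLE` is the output copied from above.

```python
import re

# Output of `nvcc -V` as shown above.
SAMPLE = """nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2018 NVIDIA Corporation
Built on Tue_Jun_12_23:08:12_CDT_2018
Cuda compilation tools, release 9.2, V9.2.148"""


def nvcc_release(nvcc_output):
    """Return (release, full_version), e.g. ("9.2", "9.2.148"), or None if absent."""
    m = re.search(r"release (\d+\.\d+), V(\d+\.\d+\.\d+)", nvcc_output)
    return (m.group(1), m.group(2)) if m else None


if __name__ == "__main__":
    print(nvcc_release(SAMPLE))
```

The first element is the value to enter later when ./configure asks for the CUDA SDK version.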

Confirm that the driver is loaded

$ kextstat | grep -i cuda.
  149    0 0xffffff7f838d3000 0x2000     0x2000     com.nvidia.CUDA (1.1.0) E13478CB-B251-3C0A-86E9-A6B56F528FE8 <4 1>

Test whether CUDA runs correctly:

$ cd /usr/local/cuda/samples
$ sudo make -C 1_Utilities/deviceQuery
$ ./bin/x86_64/darwin/release/deviceQuery
./bin/x86_64/darwin/release/deviceQuery Starting...

 CUDA Device Query (Runtime API) version (CUDART static linking)

Detected 1 CUDA Capable device(s)

Device 0: "GeForce GTX 750 Ti"
  CUDA Driver Version / Runtime Version          9.2 / 9.2
  CUDA Capability Major/Minor version number:    5.0
  Total amount of global memory:                 2048 MBytes (2147155968 bytes)
  ( 5) Multiprocessors, (128) CUDA Cores/MP:     640 CUDA Cores
  GPU Max Clock rate:                            1254 MHz (1.25 GHz)
  Memory Clock rate:                             2700 Mhz
  Memory Bus Width:                              128-bit
  L2 Cache Size:                                 2097152 bytes
  Maximum Texture Dimension Size (x,y,z)         1D=(65536), 2D=(65536, 65536), 3D=(4096, 4096, 4096)
  Maximum Layered 1D Texture Size, (num) layers  1D=(16384), 2048 layers
  Maximum Layered 2D Texture Size, (num) layers  2D=(16384, 16384), 2048 layers
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       49152 bytes
  Total number of registers available per block: 65536
  Warp size:                                     32
  Maximum number of threads per multiprocessor:  2048
  Maximum number of threads per block:           1024
  Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
  Max dimension size of a grid size    (x,y,z): (2147483647, 65535, 65535)
  Maximum memory pitch:                          2147483647 bytes
  Texture alignment:                             512 bytes
  Concurrent copy and kernel execution:          Yes with 1 copy engine(s)
  Run time limit on kernels:                     Yes
  Integrated GPU sharing Host Memory:            No
  Support host page-locked memory mapping:       Yes
  Alignment requirement for Surfaces:            Yes
  Device has ECC support:                        Disabled
  Device supports Unified Addressing (UVA):      Yes
  Supports Cooperative Kernel Launch:            No
  Supports MultiDevice Co-op Kernel Launch:      No
  Device PCI Domain ID / Bus ID / location ID:   0 / 1 / 0
  Compute Mode:
     < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >

deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 9.2, CUDA Runtime Version = 9.2, NumDevs = 1
Result = PASS

If the last line shows Result = PASS, CUDA is working correctly
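
The two facts that matter in the deviceQuery output are the final result and whether the driver and runtime versions agree. Here is a minimal sketch of pulling both out of the text, assuming the summary lines look like the ones shown above; `parse_device_query` is a hypothetical helper name, and `SAMPLE` repeats the last two lines of that output.

```python
import re

# Last two lines of the deviceQuery output shown above.
SAMPLE = (
    "deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 9.2, "
    "CUDA Runtime Version = 9.2, NumDevs = 1\n"
    "Result = PASS\n"
)


def parse_device_query(output):
    """Extract driver/runtime versions and the pass status from the summary lines."""
    summary = re.search(
        r"CUDA Driver Version = ([\d.]+), CUDA Runtime Version = ([\d.]+)", output)
    result = re.search(r"^Result = (\w+)", output, re.MULTILINE)
    return {
        "driver": summary.group(1) if summary else None,
        "runtime": summary.group(2) if summary else None,
        "passed": bool(result) and result.group(1) == "PASS",
    }


if __name__ == "__main__":
    print(parse_device_query(SAMPLE))
```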

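If the following error appears

The version ('9.1') of the host compiler ('Apple clang') is not supported

the Xcode version is too new and must be downgraded

Step 3: Install cuDNN

cuDNN (CUDA Deep Neural Network library) is NVIDIA's GPU-accelerated library for deep neural networks. cuDNN is not strictly required for training models on a GPU, but it is generally used.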

cuDNN

https://developer.nvidia.com/...
Download the latest cuDNN v7.2.1 for CUDA 9.2
cudnn-9.2-osx-x64-v7.2.1.38.tgz

Once downloaded, extract it and merge it into the CUDA directory /usr/local/cuda/:

$ tar -xzvf cudnn-9.2-osx-x64-v7.2.1.38.tgz
$ sudo cp cuda/include/cudnn.h /usr/local/cuda/include
$ sudo cp cuda/lib/libcudnn* /usr/local/cuda/lib
$ sudo chmod a+r /usr/local/cuda/include/cudnn.h /usr/local/cuda/lib/libcudnn*
$ rm -rf cuda
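
To double-check which cuDNN ended up in /usr/local/cuda, the version can be read back from the copied header. This is a minimal sketch that assumes cudnn.h defines `CUDNN_MAJOR`, `CUDNN_MINOR` and `CUDNN_PATCHLEVEL` as plain integer macros, which I believe holds for cuDNN 7 but which you should confirm in your own header; `cudnn_version` is a hypothetical helper name.

```python
import os
import re

# Path used by the copy commands above.
HEADER_PATH = "/usr/local/cuda/include/cudnn.h"


def cudnn_version(header_text):
    """Return (major, minor, patch) from the CUDNN_* #defines, or None if any is missing."""
    parts = []
    for name in ("CUDNN_MAJOR", "CUDNN_MINOR", "CUDNN_PATCHLEVEL"):
        m = re.search(r"^\s*#define\s+" + name + r"\s+(\d+)", header_text, re.MULTILINE)
        if m is None:
            return None
        parts.append(int(m.group(1)))
    return tuple(parts)


if __name__ == "__main__":
    if os.path.exists(HEADER_PATH):
        with open(HEADER_PATH) as f:
            print(cudnn_version(f.read()))
    else:
        print("cudnn.h not found at", HEADER_PATH)
```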

Step 4: Install CUDA-Z

Used to view CUDA status

$ brew cask install cuda-z

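You can then launch CUDA-Z from Applications to view CUDA status

Compiling

If you already have a prebuilt version, skip this chapter and go straight to the "Install" section

Below, the TensorFlow GPU build is compiled from source

CUDA preparation

See the earlier sections

Build environment preparation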

Python

$ python3 --version
Python 3.6.5

Do not use Python 3.7.0, otherwise the build will run into problems

Python dependencies

$ pip3 install six numpy wheel

Coreutils, llvm, OpenMP

$ brew install coreutils llvm cliutils/apple/libomp

Bazel

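Note that this must be version 0.14.0; a newer or older version can cause the build to fail. Download version 0.14.0 from the bazel releases page (-L is needed because GitHub release downloads redirect)

$ curl -LO https://github.com/bazelbuild/bazel/releases/download/0.14.0/bazel-0.14.0-installer-darwin-x86_64.sh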
$ chmod +x bazel-0.14.0-installer-darwin-x86_64.sh
$ ./bazel-0.14.0-installer-darwin-x86_64.sh
$ bazel version
Build label: 0.14.0

Too old a version may fail to pass environment variables through, leading to Library not loaded
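
Because both newer and older bazel releases cause trouble, an exact comparison against the `Build label` line is the safest check. A minimal sketch, assuming the output contains a `Build label:` line as in the sample above; `bazel_label` and `is_required_bazel` are hypothetical helper names.

```python
import re

REQUIRED_BAZEL = "0.14.0"  # version pinned in this article

# Relevant line of `bazel version` as shown above.
SAMPLE = "Build label: 0.14.0\n"


def bazel_label(version_output):
    """Return the text after 'Build label:' or None if the line is absent."""
    m = re.search(r"^Build label:\s*(\S+)", version_output, re.MULTILINE)
    return m.group(1) if m else None


def is_required_bazel(version_output, required=REQUIRED_BAZEL):
    """True only for an exact match; newer and older releases are both rejected."""
    return bazel_label(version_output) == required


if __name__ == "__main__":
    print(bazel_label(SAMPLE), is_required_bazel(SAMPLE))
```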

Check the NVIDIA development environment

$ nvcc -V
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2018 NVIDIA Corporation
Built on Tue_Jun_12_23:08:12_CDT_2018
Cuda compilation tools, release 9.2, V9.2.148

Check the clang version

$ cc -v
Apple LLVM version 8.0.0 (clang-800.0.42.1)
Target: x86_64-apple-darwin17.7.0
Thread model: posix
InstalledDir: /Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin

Source preparation

Fetch the TensorFlow release 1.8 branch and patch it to be compatible with macOS

You can download the already-patched source directly

$ curl -O https://raw.githubusercontent.com/SixQuant/tensorflow-macos-gpu/master/tensorflow-macos-gpu-r1.8-src.tar.gz

Or patch it manually

$ git clone https://github.com/tensorflow/tensorflow -b r1.8
$ cd tensorflow
$ curl -O https://raw.githubusercontent.com/SixQuant/tensorflow-macos-gpu/master/patch/tensorflow-macos-gpu-r1.8.patch
$ git apply tensorflow-macos-gpu-r1.8.patch
$ curl -o third_party/nccl/nccl.h https://raw.githubusercontent.com/SixQuant/tensorflow-macos-gpu/master/patch/nccl.h

Build

Configure

$ which python3
/usr/local/bin/python3

$ ./configure

Please specify the location of python. [Default is /usr/local/opt/python@2/bin/python2.7]: /usr/local/bin/python3

Found possible Python library paths:
  /usr/local/Cellar/python/3.6.5_1/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages
Please input the desired Python library path to use.  Default is [/usr/local/Cellar/python/3.6.5_1/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages]

Do you wish to build TensorFlow with Google Cloud Platform support? [Y/n]: n
No Google Cloud Platform support will be enabled for TensorFlow.

Do you wish to build TensorFlow with Hadoop File System support? [Y/n]: n
No Hadoop File System support will be enabled for TensorFlow.

Do you wish to build TensorFlow with Amazon S3 File System support? [Y/n]: n
No Amazon S3 File System support will be enabled for TensorFlow.

Do you wish to build TensorFlow with Apache Kafka Platform support? [Y/n]: n
No Apache Kafka Platform support will be enabled for TensorFlow.

Do you wish to build TensorFlow with XLA JIT support? [y/N]: n
No XLA JIT support will be enabled for TensorFlow.

Do you wish to build TensorFlow with GDR support? [y/N]: n
No GDR support will be enabled for TensorFlow.

Do you wish to build TensorFlow with VERBS support? [y/N]: n
No VERBS support will be enabled for TensorFlow.

Do you wish to build TensorFlow with OpenCL SYCL support? [y/N]: n
No OpenCL SYCL support will be enabled for TensorFlow.

Do you wish to build TensorFlow with CUDA support? [y/N]: y
CUDA support will be enabled for TensorFlow.

Please specify the CUDA SDK version you want to use, e.g. 7.0. [Leave empty to default to CUDA 9.0]: 9.2

Please specify the location where CUDA 9.1 toolkit is installed. Refer to README.md for more details. [Default is /usr/local/cuda]:

Please specify the cuDNN version you want to use. [Leave empty to default to cuDNN 7.0]: 7.2

Please specify the location where cuDNN 7 library is installed. Refer to README.md for more details. [Default is /usr/local/cuda]:

Please specify a list of comma-separated Cuda compute capabilities you want to build with.
You can find the compute capability of your device at: https://developer.nvidia.com/cuda-gpus.
Please note that each additional compute capability significantly increases your build time and binary size. [Default is: 3.5,5.2]3.0,3.5,5.0,5.2,6.0,6.1

Do you want to use clang as CUDA compiler? [y/N]:n
nvcc will be used as CUDA compiler.

Please specify which gcc should be used by nvcc as the host compiler. [Default is /usr/bin/gcc]:

Do you wish to build TensorFlow with MPI support? [y/N]:
No MPI support will be enabled for TensorFlow.

Please specify optimization flags to use during compilation when bazel option "--config=opt" is specified [Default is -march=native]:

Would you like to interactively configure ./WORKSPACE for Android builds? [y/N]:
Not configuring the WORKSPACE for Android builds.

Preconfigured Bazel build configs. You can use any of the below by adding "--config=<>" to your build command. See tools/bazel.rc for more details.
    --config=mkl             # Build with MKL support.
    --config=monolithic      # Config for mostly static monolithic build.
Configuration finished

Be sure to enter the correct versions

/usr/local/bin/python3

CUDA 9.2

cuDNN 7.2

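compute capability 3.0,3.5,5.0,5.2,6.0,6.1: be sure to look up the value your card supports; multiple values can be entered

The steps above actually generate the build configuration file .tf_configure.bazelrc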
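
One thing worth noticing is that the default list offered by ./configure (3.5,5.2) does not contain the 5.0 of the GTX 750 Ti, which is why the list is typed in by hand. A minimal sketch of that check follows; `parse_capabilities` and `is_covered` are hypothetical helper names, and the check is a plain exact match that says nothing about whether a near match would still run.

```python
def parse_capabilities(text):
    """Split a configure-style list such as "3.5,5.2" into (major, minor) tuples."""
    caps = []
    for item in text.split(","):
        item = item.strip()
        if item:
            major, minor = item.split(".")
            caps.append((int(major), int(minor)))
    return caps


def is_covered(device_capability, configured):
    """True if the device's capability appears exactly in the configured list."""
    device = parse_capabilities(device_capability)[0]
    return device in parse_capabilities(configured)


if __name__ == "__main__":
    print(is_covered("5.0", "3.5,5.2"))                  # default list
    print(is_covered("5.0", "3.0,3.5,5.0,5.2,6.0,6.1"))  # list entered above
```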

Start the build

$ bazel clean --expunge
$ bazel build --config=opt --cxxopt="-D_GLIBCXX_USE_CXX11_ABI=0" --action_env PATH --action_env DYLD_LIBRARY_PATH //tensorflow/tools/pip_package:build_pip_package

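Downloads may fail during the build because of network problems; retry a few times
If the bazel version is wrong, DYLD_LIBRARY_PATH may not be passed through, resulting in Library not loaded

Build notes

--config=opt should mean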

build:opt --copt=-march=native
build:opt --host_copt=-march=native
build:opt --define with_default_optimizations=true

-march=native means compiling with the optimized instructions supported by the current CPU

Check the instruction sets supported by the current CPU

$ sysctl machdep.cpu.features
machdep.cpu.features: FPU VME DE PSE TSC MSR PAE MCE CX8 APIC SEP MTRR PGE MCA CMOV PAT PSE36 CLFSH DS ACPI MMX FXSR SSE SSE2 SS HTT TM PBE SSE3 PCLMULQDQ DTES64 MON DSCPL VMX EST TM2 SSSE3 FMA CX16 TPR PDCM SSE4.1 SSE4.2 x2APIC MOVBE POPCNT AES PCID XSAVE OSXSAVE SEGLIM64 TSCTMR AVX1.0 RDRAND F16C

$ gcc -march=native -dM -E -x c++ /dev/null | egrep "AVX|SSE"

#define __AVX2__ 1
#define __AVX__ 1
#define __SSE2_MATH__ 1
#define __SSE2__ 1
#define __SSE3__ 1
#define __SSE4_1__ 1
#define __SSE4_2__ 1
#define __SSE_MATH__ 1
#define __SSE__ 1
#define __SSSE3__ 1
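
The sysctl line is just a space-separated list of flags, so checking for specific ones takes a few lines. This is a minimal sketch; `cpu_flags` and `missing_flags` are hypothetical helper names, and `SAMPLE` is a shortened copy of the line above. Note that AVX2 is not in this line even though the compiler reports it; on macOS the newer extensions are normally listed under a separate key, machdep.cpu.leaf7_features.

```python
# Shortened copy of the `sysctl machdep.cpu.features` line shown above.
SAMPLE = ("machdep.cpu.features: FPU VME SSE SSE2 SSE3 SSSE3 FMA CX16 "
          "SSE4.1 SSE4.2 POPCNT AES AVX1.0 RDRAND F16C")


def cpu_flags(sysctl_line):
    """Return the set of flags that follow the colon in a sysctl output line."""
    _, _, flags = sysctl_line.partition(":")
    return set(flags.split())


def missing_flags(sysctl_line, wanted):
    """Return the wanted flags that the line does not list, sorted by name."""
    return sorted(set(wanted) - cpu_flags(sysctl_line))


if __name__ == "__main__":
    print(missing_flags(SAMPLE, ["AVX1.0", "AVX2", "FMA", "SSE4.2"]))
```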

Build error dyld: Library not loaded: @rpath/libcudart.9.2.dylib

ERROR: /Users/c/Downloads/tensorflow-macos-gpu-r1.8/src/tensorflow/python/BUILD:1590:1: Executing genrule //tensorflow/python:string_ops_pygenrule failed (Aborted): bash failed: error executing command /bin/bash bazel-out/host/genfiles/tensorflow/python/string_ops_pygenrule.genrule_script.sh
dyld: Library not loaded: @rpath/libcudart.9.2.dylib
  Referenced from: /private/var/tmp/_bazel_c/ea0f1e868907c49391ddb6d2fb9d5630/execroot/org_tensorflow/bazel-out/host/bin/tensorflow/python/gen_string_ops_py_wrappers_cc
  Reason: image not found

This is caused by a bazel bug that prevents the DYLD_LIBRARY_PATH environment variable from being passed through

Fix: install the correct version of bazel

Build error PyString_AsStringAndSize

external/protobuf_archive/python/google/protobuf/pyext/descriptor_pool.cc:169:7: error: assigning to 'char *' from incompatible type 'const char *'
  if (PyString_AsStringAndSize(arg, &name, &name_size) < 0) {

This is caused by a bug between Python 3.7 and protobuf_python; switch to Python 3.6 and rebuild
https://github.com/google/pro...

The build takes as long as 1.5 hours, so be patient

Generate the pip package

Recompile and replace _nccl_ops.so

$ gcc -march=native -c -fPIC tensorflow/contrib/nccl/kernels/nccl_ops.cc -o _nccl_ops.o
$ gcc _nccl_ops.o -shared -o _nccl_ops.so
$ mv _nccl_ops.so bazel-out/darwin-py3-opt/bin/tensorflow/contrib/nccl/python/ops
$ rm _nccl_ops.o

Package

$ bazel-bin/tensorflow/tools/pip_package/build_pip_package ~/Downloads/

Clean up

$ bazel clean --expunge

Install

$ pip3 uninstall tensorflow
$ pip3 install ~/Downloads/tensorflow-1.8.0-cp36-cp36m-macosx_10_13_x86_64.whl

You can also install directly over HTTP

$ pip3 install https://github.com/SixQuant/tensorflow-macos-gpu/releases/download/v1.8.0/tensorflow-1.8.0-cp36-cp36m-macosx_10_13_x86_64.whl

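If you install the prebuilt wheel directly, make sure the related versions are the same as, or newer than, the ones it was built with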

cudadriver_396.148_macos.dmg

cuda_9.2.148_mac.dmg

cuda_9.2.148.1_mac.dmg

cudnn-9.2-osx-x64-v7.2.1.38.tgz

Verification

Verify that TensorFlow GPU works correctly

Verify the environment variables

Verify that Python code can read the correct DYLD_LIBRARY_PATH environment variable

$ nano tensorflow-gpu-01-env.py

#!/usr/bin/env python

import os

print(os.environ["DYLD_LIBRARY_PATH"])

$ python3 tensorflow-gpu-01-env.py
/usr/local/cuda/lib:/usr/local/cuda/extras/CUPTI/lib

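Verify that the GPU is enabled

If a TensorFlow operation has both CPU and GPU implementations, the GPU device is given priority when the operation is assigned to a device. For example, if matmul has both CPU and GPU kernels, then on a system with devices cpu:0 and gpu:0, gpu:0 will be selected to run matmul. To find out which devices your operations and tensors are assigned to, create the session with the log_device_placement configuration option set to True.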

$ nano tensorflow-gpu-02-hello.py

#!/usr/bin/env python

import tensorflow as tf

config = tf.ConfigProto()
config.log_device_placement = True

# Creates a graph.
a = tf.constant([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], shape=[2, 3], name='a')
b = tf.constant([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], shape=[3, 2], name='b')
c = tf.matmul(a, b)
# Creates a session with log_device_placement set to True.
with tf.Session(config=config) as sess:
    # Runs the op.
    print(sess.run(c))

$ python3 tensorflow-gpu-02-hello.py
2018-08-26 14:13:45.987276: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1356] Found device 0 with properties:
name: GeForce GTX 750 Ti major: 5 minor: 0 memoryClockRate(GHz): 1.2545
pciBusID: 0000:01:00.0
totalMemory: 2.00GiB freeMemory: 706.66MiB
2018-08-26 14:13:45.987303: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1435] Adding visible gpu devices: 0
2018-08-26 14:13:46.245132: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1053] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 426 MB memory) -> physical GPU (device: 0, name: GeForce GTX 750 Ti, pci bus id: 0000:01:00.0, compute capability: 5.0)
Device mapping:
/job:localhost/replica:0/task:0/device:GPU:0 -> device: 0, name: GeForce GTX 750 Ti, pci bus id: 0000:01:00.0, compute capability: 5.0
2018-08-26 14:13:46.253938: I tensorflow/core/common_runtime/direct_session.cc:284] Device mapping:
/job:localhost/replica:0/task:0/device:GPU:0 -> device: 0, name: GeForce GTX 750 Ti, pci bus id: 0000:01:00.0, compute capability: 5.0

MatMul: (MatMul): /job:localhost/replica:0/task:0/device:GPU:0
2018-08-26 14:13:46.254406: I tensorflow/core/common_runtime/placer.cc:886] MatMul: (MatMul)/job:localhost/replica:0/task:0/device:GPU:0
b: (Const): /job:localhost/replica:0/task:0/device:GPU:0
2018-08-26 14:13:46.254415: I tensorflow/core/common_runtime/placer.cc:886] b: (Const)/job:localhost/replica:0/task:0/device:GPU:0
a: (Const): /job:localhost/replica:0/task:0/device:GPU:0
2018-08-26 14:13:46.254421: I tensorflow/core/common_runtime/placer.cc:886] a: (Const)/job:localhost/replica:0/task:0/device:GPU:0
[[22. 28.]
 [49. 64.]]
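
With larger graphs the placement log gets long, so it can be handy to reduce it to an op-to-device table. The following is a minimal sketch that only understands the un-timestamped lines of the form shown above (`name: (Type): /job:.../device:GPU:0`); `placements` is a hypothetical helper name, and `SAMPLE` is copied from that output.

```python
import re

# A few lines copied from the log_device_placement output above.
SAMPLE = """Device mapping:
/job:localhost/replica:0/task:0/device:GPU:0 -> device: 0, name: GeForce GTX 750 Ti, pci bus id: 0000:01:00.0, compute capability: 5.0
MatMul: (MatMul): /job:localhost/replica:0/task:0/device:GPU:0
2018-08-26 14:13:46.254406: I tensorflow/core/common_runtime/placer.cc:886] MatMul: (MatMul)/job:localhost/replica:0/task:0/device:GPU:0
b: (Const): /job:localhost/replica:0/task:0/device:GPU:0
a: (Const): /job:localhost/replica:0/task:0/device:GPU:0
"""

# op name, op type, then a device path ending in /device:<TYPE>:<INDEX>
PLACEMENT = re.compile(r"^(\S+): \((\w+)\): \S*/device:(\w+):(\d+)$")


def placements(log_text):
    """Map op name to a device string such as "GPU:0"."""
    found = {}
    for line in log_text.splitlines():
        m = PLACEMENT.match(line.strip())
        if m:
            found[m.group(1)] = "{}:{}".format(m.group(3), m.group(4))
    return found


if __name__ == "__main__":
    for op, device in placements(SAMPLE).items():
        print(op, "->", device)
```

Any op that maps to a CPU device when a GPU kernel was expected is the first thing to look at.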

I commented out some harmless but worrying-looking log messages directly in the source, for example:
OS X does not support NUMA - returning NUMA node zero

Not found: TF GPU device with id 0 was not registered

Running something more complex

$ nano tensorflow-gpu-04-cnn-gpu.py

#!/usr/bin/env python

from __future__ import absolute_import, division, print_function
import os
import time
import numpy as np
import tflearn
import tensorflow as tf

os.environ['TF_CPP_MIN_LOG_LEVEL'] = '0'

from tensorflow.python.client import device_lib
def print_gpu_info():
    for device in device_lib.list_local_devices():
        print(device.name, 'memory_limit', str(round(device.memory_limit/1024/1024))+'M', 
            device.physical_device_desc)
    print('=======================')

print_gpu_info()


DATA_PATH = "/Volumes/Cloud/DataSet"

mnist = tflearn.datasets.mnist.read_data_sets(DATA_PATH+"/mnist", one_hot=True)

config = tf.ConfigProto()
config.log_device_placement = True
config.allow_soft_placement = True

config.gpu_options.allocator_type = 'BFC'
config.gpu_options.allow_growth = True
#config.gpu_options.per_process_gpu_memory_fraction = 0.3

# Building convolutional network
net = tflearn.input_data(shape=[None, 28, 28, 1], name='input') 
net = tflearn.conv_2d(net, 32, 5, weights_init='variance_scaling', activation='relu', regularizer="L2") 
net = tflearn.conv_2d(net, 64, 5, weights_init='variance_scaling', activation='relu', regularizer="L2") 
net = tflearn.fully_connected(net, 10, activation='softmax') 
net = tflearn.regression(net,
                         optimizer='adam',                  
                         learning_rate=0.01,
                         loss='categorical_crossentropy', 
                         name='target')

# Training
model = tflearn.DNN(net, tensorboard_verbose=3)

start_time = time.time()
model.fit(mnist.train.images.reshape([-1, 28, 28, 1]),
          mnist.train.labels.astype(np.int32),
          validation_set=(
              mnist.test.images.reshape([-1, 28, 28, 1]),
              mnist.test.labels.astype(np.int32)
          ),
          n_epoch=1,
          batch_size=128,
          shuffle=True,
          show_metric=True,
          run_id='cnn_mnist_tflearn')

duration = time.time() - start_time
print('Training Duration %.3f sec' % (duration))

$ python3 tensorflow-gpu-04-cnn-gpu.py
2018-08-26 14:11:00.463212: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1356] Found device 0 with properties:
name: GeForce GTX 750 Ti major: 5 minor: 0 memoryClockRate(GHz): 1.2545
pciBusID: 0000:01:00.0
totalMemory: 2.00GiB freeMemory: 258.06MiB
2018-08-26 14:11:00.463235: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1435] Adding visible gpu devices: 0
2018-08-26 14:11:00.717963: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1053] Created TensorFlow device (/device:GPU:0 with 203 MB memory) -> physical GPU (device: 0, name: GeForce GTX 750 Ti, pci bus id: 0000:01:00.0, compute capability: 5.0)
/device:CPU:0 memory_limit 256M
/device:GPU:0 memory_limit 204M device: 0, name: GeForce GTX 750 Ti, pci bus id: 0000:01:00.0, compute capability: 5.0
=======================
Extracting /Volumes/Cloud/DataSet/mnist/train-images-idx3-ubyte.gz
Extracting /Volumes/Cloud/DataSet/mnist/train-labels-idx1-ubyte.gz
Extracting /Volumes/Cloud/DataSet/mnist/t10k-images-idx3-ubyte.gz
Extracting /Volumes/Cloud/DataSet/mnist/t10k-labels-idx1-ubyte.gz
2018-08-26 14:11:01.158727: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1435] Adding visible gpu devices: 0
2018-08-26 14:11:01.158843: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1053] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 203 MB memory) -> physical GPU (device: 0, name: GeForce GTX 750 Ti, pci bus id: 0000:01:00.0, compute capability: 5.0)
2018-08-26 14:11:01.487530: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1435] Adding visible gpu devices: 0
2018-08-26 14:11:01.487630: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1053] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 203 MB memory) -> physical GPU (device: 0, name: GeForce GTX 750 Ti, pci bus id: 0000:01:00.0, compute capability: 5.0)
---------------------------------
Run id: cnn_mnist_tflearn
Log directory: /tmp/tflearn_logs/
---------------------------------
Training samples: 55000
Validation samples: 10000
--
Training Step: 430  | total loss: 0.16522 | time: 45.764s
| Adam | epoch: 001 | loss: 0.16522 - acc: 0.9660 | val_loss: 0.06837 - val_acc: 0.9780 -- iter: 55000/55000
--
Training Duration 45.898 sec

The speedup is significant:
CPU build without AVX2 FMA, time: 168.151s

CPU build with AVX2 FMA, time: 147.697s

GPU build with AVX2 FMA, time: 45.898s
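
To put numbers on the comparison, the ratios between the three timings can be worked out directly. This is just arithmetic on the figures above; `TIMINGS` and `speedup` are names made up for this sketch.

```python
# Training times in seconds, copied from the three runs above.
TIMINGS = {
    "cpu_plain": 168.151,
    "cpu_avx2_fma": 147.697,
    "gpu_avx2_fma": 45.898,
}


def speedup(baseline_seconds, candidate_seconds):
    """How many times faster the candidate is than the baseline."""
    return baseline_seconds / candidate_seconds


if __name__ == "__main__":
    gpu = TIMINGS["gpu_avx2_fma"]
    for name in ("cpu_plain", "cpu_avx2_fma"):
        print("GPU vs %s: %.2fx" % (name, speedup(TIMINGS[name], gpu)))
    print("AVX2/FMA vs plain CPU: %.2fx"
          % speedup(TIMINGS["cpu_plain"], TIMINGS["cpu_avx2_fma"]))
```

So the GPU build is roughly 3.7 times faster than the plain CPU build and about 3.2 times faster than the CPU build with AVX2 and FMA, while the CPU optimizations alone give only about 1.14 times. These are single one-epoch runs on one machine, so treat them as rough figures.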

cuda-smi

cuda-smi is used on the Mac as a substitute for nvidia-smi

nvidia-smi is the tool for viewing GPU memory usage.

After downloading it, put it in the /usr/local/bin/ directory

$ sudo cp cuda-smi /usr/local/bin/
$ sudo chmod 755 /usr/local/bin/cuda-smi
$ cuda-smi
Device 0 [PCIe 0:1:0.0]: GeForce GTX 750 Ti (CC 5.0): 5.0234 of 2047.7 MB (i.e. 0.245%) Free
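
If you want to watch memory from a script, for example while chasing the suspected leak, the free and total figures can be pulled out of that line. This is a minimal sketch that assumes the output always has the shape of the single sample line above, which is all I have to go on; `parse_cuda_smi` is a hypothetical helper name.

```python
import re

# Output line of cuda-smi as shown above.
SAMPLE = ("Device 0 [PCIe 0:1:0.0]: GeForce GTX 750 Ti (CC 5.0): "
          "5.0234 of 2047.7 MB (i.e. 0.245%) Free")


def parse_cuda_smi(line):
    """Return free, total and used memory in MB, or None if the line does not match."""
    m = re.search(r"([\d.]+) of ([\d.]+) MB", line)
    if m is None:
        return None
    free_mb, total_mb = float(m.group(1)), float(m.group(2))
    return {"free_mb": free_mb, "total_mb": total_mb, "used_mb": total_mb - free_mb}


if __name__ == "__main__":
    print(parse_cuda_smi(SAMPLE))
```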

Problems

Error _ncclAllReduce

Recompile _nccl_ops.so and copy it over

$ gcc -c -fPIC tensorflow/contrib/nccl/kernels/nccl_ops.cc -o _nccl_ops.o
$ gcc _nccl_ops.o -shared -o _nccl_ops.so
$ mv _nccl_ops.so /usr/local/lib/python3.6/site-packages/tensorflow/contrib/nccl/python/ops/
$ rm _nccl_ops.o

Library not loaded: @rpath/libcublas.9.2.dylib

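This is because the DYLD_LIBRARY_PATH environment variable is lost in Jupyter
In other words, newer versions of macOS block arbitrary changes to unsafe settings such as DYLD_LIBRARY_PATH unless you disable SIP

Reproduce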

import os
os.environ['DYLD_LIBRARY_PATH']

The code above raises an error in Jupyter because, due to SIP, the DYLD_LIBRARY_PATH environment variable cannot be modified

Fix: see the earlier "Environment variables" section

Segmentation Fault

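A segmentation fault means the program accessed memory outside the address space the system gave it

Fix: double-check that the correct versions and build options were used, especially Xcode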

Not found: TF GPU device with id 0 was not registered

Simply ignore this warning

GPU memory leak???

No idea how to fix this :(