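1. Checking GPU usage details:

(1) Check GPU utilization

nvidia-smi.exe   # on Windows
nvidia-smi -l    # on a Linux server; -l keeps refreshing the output
# In the output, "Volatile GPU-Util" is the GPU utilization, which fluctuates with the workload.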

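(2) On Linux, check which processes are using the GPU:

## Practical tip:

## If you run a deep learning Python script from a Linux terminal and find that several GPUs and a lot of memory are occupied, first check who owns the process holding the resources:
$ ps -f -p <PID>

## Once you have confirmed that the process is safe to kill:
$ kill -9 <PID>
# Ctrl+Z only suspends the foreground job; the process stays alive and keeps its GPU memory, so after pressing Ctrl+Z you still need to kill the process.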
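
To find the PIDs in the first place, nvidia-smi can list compute processes in CSV form with `nvidia-smi --query-compute-apps=pid,process_name,used_memory --format=csv,noheader,nounits`. The following is a minimal sketch of parsing that output, assuming each row has exactly those three fields. The function name `parse_compute_apps` and the sample text are hypothetical and only serve as an illustration.

```python
import csv
import io

def parse_compute_apps(csv_text):
    """Parse `pid, process_name, used_memory` rows into a list of dicts."""
    rows = []
    reader = csv.reader(io.StringIO(csv_text), skipinitialspace=True)
    for fields in reader:
        if len(fields) != 3:
            continue  # skip blank or malformed lines
        pid, name, used = (field.strip() for field in fields)
        # Some drivers report a non-numeric placeholder instead of a number.
        used_mib = int(used) if used.isdigit() else None
        rows.append({"pid": int(pid), "name": name, "used_mib": used_mib})
    return rows

# Hypothetical sample output, not captured from a real machine.
sample = "1234, python, 10241\n5678, python, 512\n"
apps = parse_compute_apps(sample)
heaviest = max(apps, key=lambda app: app["used_mib"] or 0)
print(heaviest["pid"])  # 1234
```

The PID printed at the end is the one to inspect with ps before deciding whether to kill it.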

2. Using a specific GPU:

# Values of the CUDA_VISIBLE_DEVICES environment variable:

CUDA_VISIBLE_DEVICES=1           Only device 1 will be seen
CUDA_VISIBLE_DEVICES=0,1         Devices 0 and 1 will be visible
CUDA_VISIBLE_DEVICES="0,1"       Same as above, quotation marks are optional
CUDA_VISIBLE_DEVICES=0,2,3       Devices 0, 2, 3 will be visible; device 1 is masked
CUDA_VISIBLE_DEVICES=""          No GPU will be visible
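
One detail the table leaves implicit: inside the process, the visible devices are renumbered starting from 0. With CUDA_VISIBLE_DEVICES=0,2,3, the device name "/gpu:1" therefore refers to physical device 2. Here is a small sketch of that mapping, assuming the variable holds plain integer indices (not GPU UUIDs). The helper name `visible_to_physical` is hypothetical.

```python
def visible_to_physical(cuda_visible_devices):
    """Map the in-process device index to the physical index."""
    ids = [part.strip() for part in cuda_visible_devices.split(",") if part.strip()]
    return {logical: int(physical) for logical, physical in enumerate(ids)}

mapping = visible_to_physical("0,2,3")
print(mapping)                  # {0: 0, 1: 2, 2: 3}
print(visible_to_physical(""))  # {} (no GPU visible)
```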

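Approach 1. Specify the GPU in Python

# Method 1. Specify the GPU at the very top of the Python file, before TensorFlow is imported
import os
# Optional: number the GPUs in the same order as nvidia-smi (CUDA's own default order is FASTEST_FIRST)
# os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"
os.environ["CUDA_VISIBLE_DEVICES"] = "0"  # select the GPU with ID 0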

# Method 2. Select the GPU by ID
def selectGpuById(gpu_id):
    os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"
    os.environ["CUDA_VISIBLE_DEVICES"] = "{}".format(gpu_id)

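## Method 3. Specify the GPU inside a session (TensorFlow 1.x API)
import tensorflow as tf

with tf.Session() as sess:
    with tf.device("/gpu:1"):
        pass  # model training code goes here
## Device strings:
## "/cpu:0"    The CPU of your machine
## "/gpu:0"    The GPU of your machine, if you have one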

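Approach 2. Specify the GPU when launching the program from a terminal shell

On the command line, enter:

# Run your_script.py (a placeholder for your own file) on GPU 1
CUDA_VISIBLE_DEVICES=1 python your_script.py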
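
The same per-process setting can be applied when one Python program launches another. This is a sketch using only the standard library, assuming Python 3.7 or later. The helper name `run_with_gpus` is hypothetical, and the child command is a stand-in that simply echoes the variable it received rather than running a real training script.

```python
import os
import subprocess
import sys

def run_with_gpus(argv, gpu_ids):
    """Run argv in a child process that sees only the given GPU IDs."""
    env = dict(os.environ)  # copy, so the parent's environment is untouched
    env["CUDA_VISIBLE_DEVICES"] = ",".join(str(i) for i in gpu_ids)
    return subprocess.run(argv, env=env, capture_output=True, text=True, check=True)

# Stand-in child: it only prints the variable it received.
child = [sys.executable, "-c", "import os; print(os.environ['CUDA_VISIBLE_DEVICES'])"]
result = run_with_gpus(child, [1])
print(result.stdout.strip())  # 1
```

Because the variable is set only in the child's environment, other jobs started from the same parent can be given different GPUs.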

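3. Setting how much GPU memory TensorFlow uses

Parameters of tf.ConfigProto():

log_device_placement=True: print the device placement log, which shows the device each op is assigned to

allow_soft_placement=True: if the device you specified does not exist, let TF pick a device automatically

tf.ConfigProto(log_device_placement=True, allow_soft_placement=True)

When constructing tf.Session(), you can pass tf.GPUOptions as part of the optional config argument to explicitly specify the fraction of GPU memory to allocate.

per_process_gpu_memory_fraction sets the upper limit on the memory the process may use on each GPU. It applies uniformly to all GPUs; you cannot set different limits for different GPUs.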

(1) Fixed fraction of memory

# TensorFlow may allocate up to (physical GPU memory * 0.7)
gpu_options = tf.GPUOptions(per_process_gpu_memory_fraction=0.7)
sess = tf.Session(config=tf.ConfigProto(gpu_options=gpu_options))
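
As a quick check of what a given fraction means in absolute terms, the arithmetic can be written out as below. This is only a sketch of the multiplication described in the comment above; the function name `memory_cap_mib` and the card sizes are assumptions, and the exact limit TensorFlow enforces may differ slightly.

```python
def memory_cap_mib(total_mib, fraction):
    """Upper bound in MiB implied by a per-process memory fraction."""
    if not 0.0 < fraction <= 1.0:
        raise ValueError("fraction must be in (0, 1]")
    return int(total_mib * fraction)

print(memory_cap_mib(8192, 0.5))   # 4096
print(memory_cap_mib(11264, 0.7))  # 7884
```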

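(2) On-demand memory allocation

# To allocate memory on demand, use the allow_growth option:
# the allocation starts small and grows as needed, and it is not released afterwards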
gpu_options = tf.GPUOptions(allow_growth=True)
sess = tf.Session(config=tf.ConfigProto(gpu_options=gpu_options))

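Summary: example configuration for a single GPU:

config = tf.ConfigProto(allow_soft_placement=True)
config.gpu_options.per_process_gpu_memory_fraction = 0.9  # cap at 90% of the GPU memory
config.gpu_options.allow_growth = True  # allocate on demand, up to the cap

with tf.Session(config=config) as sess:
    pass  # model training code goes here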

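4. Keras multi-GPU training

Since version 2.X, Keras makes it easy to train on multiple GPUs. Multiple GPUs can improve training, for example by speeding it up and by relieving out-of-memory problems. When several GPU cards are available, use the TensorFlow backend.

There are two ways to use multiple GPUs:

Data parallelism
Device parallelism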

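Data parallelism

Data parallelism replicates the target model once on each device and uses each replica to process a different part of the input data. Keras provides a built-in function, keras.utils.multi_gpu_model, that produces a data-parallel version of any model and supports up to 8 GPUs.

Data parallelism is what you need most of the time; see the multi_gpu_model documentation in utils. Here is an example: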

from keras.utils import multi_gpu_model

# Replicates `model` on 8 GPUs.
# This assumes that your machine has 8 available GPUs.
parallel_model = multi_gpu_model(model, gpus=8)
parallel_model.compile(loss='categorical_crossentropy', optimizer='rmsprop')

# This `fit` call will be distributed on 8 GPUs.
# Since the batch size is 256, each GPU will process 32 samples.
parallel_model.fit(x, y, epochs=20, batch_size=256)
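
The batch arithmetic in the comment above (256 samples over 8 GPUs, 32 per GPU) can be illustrated without any GPU. The following is a simplified sketch of the split, apply, concatenate pattern in plain Python. The names `split_batch` and `run_data_parallel` are hypothetical, and unlike the real function this sketch requires the batch size to be divisible by the number of replicas.

```python
def split_batch(batch, n_replicas):
    """Cut a batch into n_replicas contiguous sub-batches of equal size."""
    if len(batch) % n_replicas != 0:
        raise ValueError("batch size must be divisible by the number of replicas")
    size = len(batch) // n_replicas
    return [batch[i * size:(i + 1) * size] for i in range(n_replicas)]

def run_data_parallel(replica_fn, batch, n_replicas):
    """Apply the same function to every sub-batch, then concatenate the outputs."""
    outputs = []
    for sub_batch in split_batch(batch, n_replicas):
        outputs.extend(replica_fn(sub_batch))  # on real hardware: one GPU per sub-batch
    return outputs

batch = list(range(256))
shards = split_batch(batch, 8)
print(len(shards), len(shards[0]))  # 8 32

doubled = run_data_parallel(lambda xs: [2 * x for x in xs], batch, 8)
print(doubled == [2 * x for x in batch])  # True
```

The result is identical to processing the whole batch at once; only the work is divided.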

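Data parallelism uses multiple GPUs to train on several slices of each batch at the same time. The model running on each GPU is the same neural network, with an identical structure and shared model parameters.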

from keras.optimizers import Adam
from keras.utils import multi_gpu_model  # Keras multi-GPU helper

# get_model, x_train, y_train, x_test, y_test, batch_size, epochs_num and callbacks
# are assumed to be defined elsewhere in your script.
model = get_model()
parallel_model = multi_gpu_model(model, gpus=2)  # use 2 GPUs; call this before compile
parallel_model.compile(loss='categorical_crossentropy', optimizer=Adam(), metrics=['accuracy'])
hist = parallel_model.fit(x_train, y_train, batch_size=batch_size, epochs=epochs_num,
                          validation_data=(x_test, y_test), verbose=1, callbacks=callbacks)

You can also choose which GPUs to run on:

import os
os.environ["CUDA_VISIBLE_DEVICES"] = "3,5"

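The "nvidia-smi" command shows the usage and index of each GPU. The code above runs training on the two GPUs with indices 3 and 5. The indices match those of nvidia-smi when CUDA_DEVICE_ORDER is set to PCI_BUS_ID.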

Error 1: ValueError: Variable batch_normalization_1/moving_mean/biased already exists, disallowed. Did you mean to set reuse=True in VarScope? Originally defined at:
Fix: training on a single GPU works, but this error appears after switching to multiple GPUs. It is easy to resolve: upgrade TensorFlow to 1.4.

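Error 2: TypeError: can't pickle ... objects (the elided part varies with the situation)
Fix: from what I found, the likely cause is using callbacks.ModelCheckpoint() incorrectly together with multi-GPU training, which makes the callback fail. To save the best model during training, the code included this callback:

checkpoint = ModelCheckpoint(filepath='./cifar10_resnet_ckpt.h5', monitor='val_acc', verbose=1, save_best_only=True)

After switching to multi-GPU training, the model saved at each callback became parallel_model, which triggers the error. The fix is to keep saving the original model, so the callback needs a small change: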

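from keras.callbacks import ModelCheckpoint

class ParallelModelCheckpoint(ModelCheckpoint):
    """ModelCheckpoint that always saves the original single-GPU model."""

    def __init__(self, model, filepath, monitor='val_loss', verbose=0,
                 save_best_only=False, save_weights_only=False,
                 mode='auto', period=1):
        self.single_model = model  # the original model that was passed to multi_gpu_model
        super(ParallelModelCheckpoint, self).__init__(
            filepath, monitor, verbose, save_best_only, save_weights_only, mode, period)

    def set_model(self, model):
        # Ignore the parallel model handed in by fit(); checkpoint the original instead.
        super(ParallelModelCheckpoint, self).set_model(self.single_model)

# Avoids the model-saving error when training on multiple GPUs
checkpoint = ParallelModelCheckpoint(model, filepath='./cifar10_resnet_ckpt.h5', monitor='val_acc',
                                     verbose=1, save_best_only=True)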

Everything else stays the same; the only change is that the original model is the one being saved.

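Device parallelism

Device parallelism runs different parts of the same model on different devices. It works well when the model has parallel structures, for example two branches. It can be implemented with TensorFlow device scopes; here is an example: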

import keras
import tensorflow as tf

# Model where a shared LSTM is used to encode two different sequences in parallel
input_a = keras.Input(shape=(140, 256))
input_b = keras.Input(shape=(140, 256))

shared_lstm = keras.layers.LSTM(64)

# Process the first sequence on one GPU
with tf.device('/gpu:0'):
    encoded_a = shared_lstm(input_a)
# Process the next sequence on another GPU
with tf.device('/gpu:1'):
    encoded_b = shared_lstm(input_b)

# Concatenate results on CPU
with tf.device('/cpu:0'):
    merged_vector = keras.layers.concatenate([encoded_a, encoded_b], axis=-1)

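Data parallelism with multi-task outputs

The Keras version of Faster-RCNN has several output branches, and therefore several losses. The output layers are usually named when the network is defined, and at compile time you refer to each branch by its layer name, like this: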

model.compile(optimizer=optimizer, 
              loss={'main_output': jaccard_distance_loss, 'aux_output': 'binary_crossentropy'},
              metrics={'main_output': jaccard_distance_loss, 'aux_output': 'acc'},
              loss_weights={'main_output': 1., 'aux_output': 0.5})

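Here main_output and aux_output are manually defined layer names. After the model is wrapped with multi_gpu_model(), however, these names are replaced automatically with defaults such as concatenate_1, concatenate_2, and so on. So first call model.summary() to print the network structure, work out which output corresponds to which branch, and then recompile the network as follows: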

from keras.optimizers import Adam, RMSprop, SGD
model.compile(optimizer=RMSprop(lr=0.045, rho=0.9, epsilon=1.0), 
              loss={'concatenate_1': jaccard_distance_loss, 'concatenate_2': 'binary_crossentropy'},
              metrics={'concatenate_1': jaccard_distance_loss, 'concatenate_2': 'acc'},
              loss_weights={'concatenate_1': 1., 'concatenate_2': 0.5})
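
Rather than retyping every dictionary by hand, the renaming can be done once and reused for loss, metrics and loss_weights. This is a small sketch with a hypothetical helper, `rename_keys`. The mapping itself is an assumption and has to be confirmed against your own model.summary() output.

```python
def rename_keys(per_output, name_map):
    """Return a copy of a per-output dict with its keys translated through name_map."""
    return {name_map[old_name]: value for old_name, value in per_output.items()}

# Assumed mapping; confirm it against your own model.summary() output.
name_map = {'main_output': 'concatenate_1', 'aux_output': 'concatenate_2'}

loss_weights = {'main_output': 1., 'aux_output': 0.5}
renamed = rename_keys(loss_weights, name_map)
print(renamed)  # {'concatenate_1': 1.0, 'concatenate_2': 0.5}
```

A name missing from the mapping raises KeyError, which surfaces a forgotten output early instead of silently dropping its loss.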

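Also, in the Keras version of Faster-RCNN, the RPN is trained within each batch, its predictions are then used as the input for training the detection network, and finally the trained parameters of the two models are saved together as a single model.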

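Distributed training

Distributed training in Keras is implemented through TensorFlow. To train in a distributed setting, register Keras with a TensorFlow session that is connected to a cluster: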

import tensorflow as tf
from keras import backend as K

server = tf.train.Server.create_local_server()
sess = tf.Session(server.target)

K.set_session(sess)

References: https://blog.csdn.net/c20081052/article/details/82345454?utm_source=blogxgwz8

https://www.cnblogs.com/YSPXIZHEN/p/11416843.html
