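The official TensorFlow tutorial shows how to use multiple GPUs, but that approach is data parallelism: the data is split into parts, and every part is processed by the same model.
While writing my own code, however, I ran into a single model that was too large to run on one GPU. That case calls for model parallelism.
After searching through a lot of material online, I found a well-written blog post that comes with GitHub code: How to Use Distributed TensorFlow to Split Your TensorFlow Graph Between Multiple Machines
The implementation is actually very simple: split the model so that different network layers are computed on different GPUs.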
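# Written against the TensorFlow 1.x API (tf.nn.dropout still takes keep_prob).
import tensorflow as tf
import weightnorm  # assumption: a separate, non-TensorFlow module providing a weight-normalized dense(inputs, units)

# Build a DenseNet-style stack of [9k, 9k, 9k] dense layers. The first two hidden layers run on GPU 0.
# The last hidden layer and the output layer run on GPU 1, because the output layer's weight matrix
# is roughly [28k, 10k] and a single GPU reports that it is out of memory.
def dense_gpu(input, keep_prob):
    units = 9000
    with tf.device("/gpu:0"):
        input_layer = input
        dropout1 = tf.nn.dropout(input_layer, keep_prob=keep_prob)
        # Dense layer 1: its output is concatenated with its own input (dense connectivity)
        hidden1 = weightnorm.dense(inputs=dropout1, units=units)
        dense1 = tf.keras.layers.concatenate([hidden1, input_layer])
        dropout2 = tf.nn.dropout(dense1, keep_prob=keep_prob)
        activation1 = tf.nn.leaky_relu(dropout2)
        # Dense layer 2
        hidden2 = weightnorm.dense(inputs=activation1, units=units)
        dense2 = tf.keras.layers.concatenate([hidden2, dense1])
        dropout3 = tf.nn.dropout(dense2, keep_prob=keep_prob)
        activation2 = tf.nn.leaky_relu(dropout3)
    with tf.device("/gpu:1"):
        # Dense layer 3: activation2 is copied from GPU 0 to GPU 1 here
        hidden3 = weightnorm.dense(inputs=activation2, units=units)
        dense3 = tf.keras.layers.concatenate([hidden3, dense2])
        dropout4 = tf.nn.dropout(dense3, keep_prob=keep_prob)
        activation3 = tf.nn.leaky_relu(dropout4)
        # Output Layer
        # 9520 is the length of the target gene
        output = weightnorm.dense(inputs=activation3, units=9520)
    return output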
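
To get a feel for why the split is needed, here is a rough back-of-the-envelope estimate of the float32 weight memory that the layers above place on each GPU. This is only a sketch under assumptions: the post does not state the input width, so `input_dim=1000` is a guess chosen because it makes the final concatenation 28,000 wide, matching the "[28k, 10k]" comment. The function name `estimate_weight_bytes` is made up for illustration, and biases and weight-norm scale parameters are ignored.

```python
# Rough float32 weight-memory estimate for the dense stack above.
# Assumption: the input has 1000 features (not stated in the post).
# Biases and weight-norm scale parameters are ignored because they are tiny
# compared with the kernel matrices.

BYTES_PER_FLOAT32 = 4


def estimate_weight_bytes(input_dim=1000, units=9000, output_units=9520):
    """Return {device: bytes} for the kernel matrices placed on each GPU."""
    width = input_dim
    per_device = {"/gpu:0": 0, "/gpu:1": 0}
    # Hidden layers 1 and 2 are on GPU 0, hidden layer 3 is on GPU 1.
    for device in ("/gpu:0", "/gpu:0", "/gpu:1"):
        per_device[device] += width * units * BYTES_PER_FLOAT32
        # Each layer's output is concatenated with its input, so the
        # next layer sees a wider input.
        width += units
    # The output layer (on GPU 1) reads the full concatenation.
    per_device["/gpu:1"] += width * output_units * BYTES_PER_FLOAT32
    return per_device


if __name__ == "__main__":
    for device, n_bytes in estimate_weight_bytes().items():
        print(f"{device}: {n_bytes / 1e9:.2f} GB of float32 weights")
```

These numbers cover the weights only. During training, gradients and any optimizer slots add further copies of the same size, and activations grow with the batch size, so the real footprint is a multiple of this estimate.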