Training a Network on Multiple GPUs at Once with Keras

Original

平原2018

2020-06-08 08:15


Reposted from: https://www.jianshu.com/p/db0ba022936f

References.

Official documentation: multi_gpu_model
and Google

0. A Common Misconception

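Keras now supports training a network on several GPUs at once, and it is quite easy to do, but the following line on its own will not get you there.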

os.environ["CUDA_VISIBLE_DEVICES"] = "1,2"

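When you monitor GPU usage (nvidia-smi -l 1), you will find that although none of the GPUs look idle, only one of them is actually doing any work; the others are merely occupied. In other words, if your machine has several cards, Keras by default claims every GPU it can detect, and this line only controls which GPUs it can detect. The line is really meant for the case where you need just one GPU: it hides the other GPUs in the machine from Keras. Suppose you have three cards, each with its own index (0, 1, 2), and to stay out of other people's way you want to use only one of them, say the one with index 1. Then set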

os.environ["CUDA_VISIBLE_DEVICES"] = "1"

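Monitor GPU usage again (nvidia-smi -l 1) and you will see that only one GPU really is occupied while the others are idle. So this is a common misconception about using multiple cards with Keras: setting this variable does not make Keras train on several GPUs at once.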
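
A minimal sketch of how the variable behaves, using only the standard library. The helper name restrict_visible_gpus is hypothetical, not part of Keras or TensorFlow. It assumes the usual CUDA behaviour that the visible devices are renumbered from 0 inside the process, and that the variable is set before the framework initialises CUDA (setting it before the import is the safe choice).

```python
import os


def restrict_visible_gpus(physical_ids):
    """Expose only the given physical GPUs to this process.

    Call this before importing Keras/TensorFlow. Returns a mapping from
    the logical index the framework will see to the physical index shown
    by nvidia-smi (hypothetical helper, for illustration only).
    """
    os.environ["CUDA_VISIBLE_DEVICES"] = ",".join(str(i) for i in physical_ids)
    return {logical: physical for logical, physical in enumerate(physical_ids)}


mapping = restrict_visible_gpus([1, 2])
print(os.environ["CUDA_VISIBLE_DEVICES"])  # 1,2
print(mapping)                             # {0: 1, 1: 2}
```

With "1,2" the process sees two devices, addressed as /gpu:0 and /gpu:1, but nothing in this line spreads the computation across them; that still has to be done explicitly.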

1. Motivation

Why train on several GPUs at once?
A single card has too little memory -> the batch size cannot be set very large, and sometimes even batch_size=1 runs out of memory (OUT OF MEMORY).

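In my experience running deep networks, a somewhat larger batch_size works better: each weight update from backpropagation is based on more samples, so the network is not pulled toward a different handful of samples at every iteration (see "Don't Decay the Learning Rate, Increase the Batch Size"). I have also seen papers saying it should not be set too large either; I don't know the reason, and in any case I have never had the chance to try. My suggested batch_size is somewhere in the range 64~256; anything in that range has worked without major problems.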

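But as networks keep getting deeper, their GPU memory requirements keep growing too. For many newcomers the biggest problem is often not the code itself: code copied from GitHub will not run as published on their weak GPU, so they have to lower batch_size and end up unable to reach the reported results.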

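There are two solutions: buy one extremely powerful GPU with an enormous amount of memory, or buy several ordinary GPUs and use them together.
The first option does not work: at present even the best NVIDIA cards have little more than a dozen or so GB of memory, which a deep enough network will still exhaust, and a top-end card is poor value for money. So learning to use multiple GPUs in Keras is the more dependable choice.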

2. Implementation

2.1 Designing a Class

cite: parallel_model.py

import tensorflow as tf
import keras.backend as K
import keras.layers as KL
import keras.models as KM


class ParallelModel(KM.Model):
    """Subclasses the standard Keras Model and adds multi-GPU support.
    It works by creating a copy of the model on each GPU. Then it slices
    the inputs and sends a slice to each copy of the model, and then
    merges the outputs together and applies the loss on the combined
    outputs.
    """

    def __init__(self, keras_model, gpu_count):
        """Class constructor.
        keras_model: The Keras model to parallelize
        gpu_count: Number of GPUs. Must be > 1
        """
        self.inner_model = keras_model
        self.gpu_count = gpu_count
        merged_outputs = self.make_parallel()
        super(ParallelModel, self).__init__(inputs=self.inner_model.inputs,
                                            outputs=merged_outputs)

    def __getattribute__(self, attrname):
        """Redirect loading and saving methods to the inner model. That's where
        the weights are stored."""
        if 'load' in attrname or 'save' in attrname:
            return getattr(self.inner_model, attrname)
        return super(ParallelModel, self).__getattribute__(attrname)

    def summary(self, *args, **kwargs):
        """Override summary() to display summaries of both, the wrapper
        and inner models."""
        super(ParallelModel, self).summary(*args, **kwargs)
        self.inner_model.summary(*args, **kwargs)

    def make_parallel(self):
        """Creates a new wrapper model that consists of multiple replicas of
        the original model placed on different GPUs.
        """
        # Slice inputs. Slice inputs on the CPU to avoid sending a copy
        # of the full inputs to all GPUs. Saves on bandwidth and memory.
        input_slices = {name: tf.split(x, self.gpu_count)
                        for name, x in zip(self.inner_model.input_names,
                                           self.inner_model.inputs)}

        output_names = self.inner_model.output_names
        outputs_all = []
        for i in range(len(self.inner_model.outputs)):
            outputs_all.append([])

        # Run the model call() on each GPU to place the ops there
        for i in range(self.gpu_count):
            with tf.device('/gpu:%d' % i):
                with tf.name_scope('tower_%d' % i):
                    # Run a slice of inputs through this replica
                    zipped_inputs = zip(self.inner_model.input_names,
                                        self.inner_model.inputs)
                    inputs = [
                        KL.Lambda(lambda s: input_slices[name][i],
                                  output_shape=lambda s: (None,) + s[1:])(tensor)
                        for name, tensor in zipped_inputs]
                    # Create the model replica and get the outputs
                    outputs = self.inner_model(inputs)
                    if not isinstance(outputs, list):
                        outputs = [outputs]
                    # Save the outputs for merging back together later
                    for l, o in enumerate(outputs):
                        outputs_all[l].append(o)

        # Merge outputs on CPU
        with tf.device('/cpu:0'):
            merged = []
            for outputs, name in zip(outputs_all, output_names):
                # If outputs are numbers without dimensions, add a batch dim.
                def add_dim(tensor):
                    """Add a dimension to tensors that don't have any."""
                    if K.int_shape(tensor) == ():
                        return KL.Lambda(lambda t: K.reshape(t, [1, 1]))(tensor)
                    return tensor
                outputs = list(map(add_dim, outputs))

                # Concatenate
                merged.append(KL.Concatenate(axis=0, name=name)(outputs))
        return merged
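
The idea inside make_parallel (slice the batch, run each slice through a replica, concatenate the outputs along the batch axis) can be sketched without TensorFlow. This is only a plain-Python illustration under stated assumptions: split_batch, run_data_parallel and toy_model are hypothetical names, lists stand in for tensors, and the per-slice calls run one after another here rather than on separate GPUs.

```python
def split_batch(batch, replica_count):
    """Split a batch into equal contiguous slices, one per replica."""
    if len(batch) % replica_count != 0:
        raise ValueError("batch size must be divisible by the replica count")
    size = len(batch) // replica_count
    return [batch[i * size:(i + 1) * size] for i in range(replica_count)]


def run_data_parallel(model_fn, batch, replica_count):
    """Apply model_fn to each slice and merge the results in order."""
    merged = []
    for piece in split_batch(batch, replica_count):
        # In ParallelModel this call would be placed on /gpu:i.
        merged.extend(model_fn(piece))
    return merged


def toy_model(samples):
    """Hypothetical per-sample model: doubles every input."""
    return [2 * x for x in samples]


batch = [1, 2, 3, 4, 5, 6]
print(split_batch(batch, 3))                   # [[1, 2], [3, 4], [5, 6]]
print(run_data_parallel(toy_model, batch, 3))  # [2, 4, 6, 8, 10, 12]
```

Because the model treats every sample independently, the merged result matches what a single replica would produce on the whole batch; only the memory per device shrinks.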

2.2 Calling It Is Very Simple

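import keras
from keras.optimizers import Adam

GPU_COUNT = 3  # use 3 GPUs at the same time
# X_train, y_train, X_valid, y_valid, batch_size and nb_epoch are assumed to be defined earlier
model = keras.applications.densenet.DenseNet201()  # for example, DenseNet-201
model = ParallelModel(model, GPU_COUNT)
model.compile(optimizer=Adam(lr=1e-5), loss='binary_crossentropy', metrics=['accuracy'])
model.fit(X_train, y_train,
          batch_size=batch_size * GPU_COUNT,  # each GPU receives batch_size samples
          epochs=nb_epoch, verbose=0, shuffle=True,
          validation_data=(X_valid, y_valid))

model.save_weights('/path/to/save/model.h5')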
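
One thing to watch with batch_size*GPU_COUNT: ParallelModel slices each batch with tf.split(x, self.gpu_count), which, as far as I know, needs the batch dimension to divide evenly by the GPU count. The last batch of an epoch can be smaller than the rest, so one cautious option is to trim the data to a whole number of global batches. A small sketch of that arithmetic follows; both helper names are hypothetical and the numbers are made up for illustration.

```python
def global_batch_size(per_gpu_batch, gpu_count):
    """Batch size to pass to fit() so each GPU gets per_gpu_batch samples."""
    return per_gpu_batch * gpu_count


def usable_sample_count(total_samples, per_gpu_batch, gpu_count):
    """Largest sample count that fills only complete global batches."""
    step = global_batch_size(per_gpu_batch, gpu_count)
    return (total_samples // step) * step


print(global_batch_size(32, 3))          # 96
print(usable_sample_count(1000, 32, 3))  # 960
```

With these numbers one would train on X_train[:960] and y_train[:960], dropping the 40 leftover samples; the same trimming would apply to the validation set.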
