Training Deep Learning Networks on Multiple GPUs with PyTorch

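PyTorch can use GPUs to speed up network training. When a network is very deep and has many parameters, training it on a GPU can easily exhaust the GPU memory, and training cannot continue. A GPU's price rises roughly in proportion to its memory size: the more memory, the more expensive the card. Training a large model may therefore seem to call for a card with a lot of memory, such as a single GTX 1080 Ti with 11 GB. A fallback is to buy two GTX 1070 Ti cards instead, for 16 GB in total, which looks like the better deal. So how do you train a model in PyTorch on a single machine fitted with several GPUs (single-machine, multi-GPU)? PyTorch supports training a model on multiple GPUs; the examples below assume PyTorch 1.0 or later.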

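1. Case one: data-parallel training of a model on a single machine with several GPUs. (My own situation: one GTX 1070 Ti with 8 GB ran out of memory while training a model, so I installed a second GTX 1070 Ti, giving 16 GB in total, which was enough.) Note that DataParallel does not pool the two cards into one 16 GB device: each GPU holds a full copy of the model, and what is divided between them is the batch.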

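import torch
# model is an nn.Module and input a batch tensor, both assumed to be defined already
model = torch.nn.DataParallel(model, device_ids=[0, 1]).cuda()
output = model(input)  # the batch is split along dim 0 between GPU 0 and GPU 1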

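class torch.nn.DataParallel(module, device_ids=None, output_device=None, dim=0): device_ids selects the GPUs that run the computation (by default, all visible GPUs), output_device selects the GPU on which the outputs are gathered (by default device_ids[0]), and dim is the dimension along which the input batch is split.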
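As a rough illustration of how device_ids, output_device and dim=0 interact, the sketch below computes how many samples of a batch each GPU would receive. It is plain Python rather than real PyTorch code, and it assumes DataParallel splits the batch with the same rule as torch.chunk (chunks of ceil(batch_size / number_of_devices) samples, with a smaller last chunk). The function name dataparallel_split is hypothetical.

```python
import math

def dataparallel_split(batch_size, device_ids, output_device=None):
    """Sketch of how a batch might be divided among device_ids.

    Assumes the torch.chunk rule: every chunk holds ceil(batch_size / n)
    samples except possibly the last, so fewer than n chunks can come out.
    """
    n = len(device_ids)
    per_chunk = math.ceil(batch_size / n)
    sizes = []
    remaining = batch_size
    while remaining > 0:
        sizes.append(min(per_chunk, remaining))
        remaining -= sizes[-1]
    # Only the first len(sizes) devices receive a chunk
    assignment = dict(zip(device_ids, sizes))
    # Outputs are gathered on device_ids[0] unless output_device is given
    out = device_ids[0] if output_device is None else output_device
    return assignment, out

print(dataparallel_split(64, [0, 1]))
print(dataparallel_split(10, [0, 1, 2]))
print(dataparallel_split(5, [0, 1, 2, 3]))
```

For example, dataparallel_split(10, [0, 1, 2]) gives ({0: 4, 1: 4, 2: 2}, 0), and a batch of 5 on four GPUs only occupies three of them.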

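DataParallel works by splitting each input batch into roughly equal chunks along dim, sending one chunk to each GPU, and running the forward pass on every GPU in parallel; during the backward pass, the gradients computed on each GPU are summed into the original module. The module itself is replicated (shallow-copied) onto every GPU on each forward pass, so its attributes should be treated as read-only: any attribute a replica updates inside forward is lost when the replicas are discarded.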
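To see why summing the per-GPU gradients gives the same result as training on one GPU, here is a small plain-Python sketch for a hypothetical one-parameter model y = w * x with a mean-squared-error loss. It assumes, as in DataParallel, that the outputs are gathered and the loss is averaged over the whole batch, so each replica's gradient is scaled by 1 / N where N is the full batch size, not the chunk size. The function names are illustrative only.

```python
import math

def full_batch_grad(w, xs, ts):
    """d/dw of mean((w*x - t)^2), computed on one device over the whole batch."""
    n = len(xs)
    return sum(2 * (w * x - t) * x for x, t in zip(xs, ts)) / n

def replica_grad(w, xs, ts, n_total):
    """Gradient one replica contributes for its chunk of the batch.

    The loss is the mean over the gathered batch, so every sample is
    scaled by 1 / n_total rather than 1 / len(xs).
    """
    return sum(2 * (w * x - t) * x for x, t in zip(xs, ts)) / n_total

def data_parallel_grad(w, xs, ts, num_gpus):
    """Split the batch into one chunk per GPU and sum the replica gradients."""
    n = len(xs)
    size = -(-n // num_gpus)  # ceil(n / num_gpus)
    chunks = [(xs[i:i + size], ts[i:i + size]) for i in range(0, n, size)]
    return sum(replica_grad(w, cx, ct, n) for cx, ct in chunks)

xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ts = [1.0, 1.0, 2.0, 3.0, 5.0]
for gpus in (1, 2, 3):
    # Same gradient whether the batch is processed on 1, 2 or 3 "GPUs"
    assert math.isclose(data_parallel_grad(0.5, xs, ts, gpus), full_batch_grad(0.5, xs, ts))
```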

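import torch
import torch.nn as nn

# model is assumed to be defined already
if torch.cuda.device_count() > 1:
    model = nn.DataParallel(model)  # use all visible GPUs

if torch.cuda.is_available():
    model.cuda()  # DataParallel expects the parameters on device_ids[0], i.e. cuda:0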
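A back-of-the-envelope estimate shows why the second 8 GB card helped, and also why two 8 GB cards do not behave like one 16 GB card: under DataParallel every GPU keeps a full copy of the model state (weights, gradients, optimizer state), and only the activations shrink as the batch is divided. The numbers below (2 GB of model state, 0.2 GB of activations per sample, a batch of 40) are made up for illustration, and the formula ignores other overhead; in practice GPU 0 usually needs somewhat more, because the outputs are gathered there.

```python
import math

def per_gpu_memory_gb(model_state_gb, act_per_sample_gb, batch_size, num_gpus):
    """Rough DataParallel estimate: full model state on every GPU,
    plus activations for that GPU's chunk of the batch only."""
    chunk = math.ceil(batch_size / num_gpus)
    return model_state_gb + act_per_sample_gb * chunk

CARD_GB = 8.0  # e.g. one GTX 1070 Ti

# Hypothetical workload, not measured numbers
for gpus in (1, 2):
    need = per_gpu_memory_gb(2.0, 0.2, 40, gpus)
    print(f"{gpus} GPU(s): ~{need:.1f} GB per GPU, fits in {CARD_GB} GB: {need <= CARD_GB}")
```

Under these assumptions one card would need about 10 GB and fail, while each of two cards needs about 6 GB and fits. The total in use (about 12 GB) is larger than on a single card, because the model state is duplicated.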

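2. Case two: distributed training on multiple machines with multiple GPUs. torch.nn.parallel.DistributedDataParallel supports distributed training both on a single machine with several GPUs and across several machines. On a single machine, DistributedDataParallel typically runs one process per GPU, and each process works on a full batch of its own; if one card's memory is too small for that batch, training still fails with an out-of-memory error. It is set up as follows: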
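Unlike DataParallel, each DistributedDataParallel process computes the loss on its own batch, and DDP averages the gradients across processes with an all-reduce during the backward pass. The plain-Python sketch below (a hypothetical one-parameter model with a mean-squared-error loss; the function names are illustrative) checks that, when every process has a batch of the same size, the averaged gradient equals the gradient over all processes' samples combined. It also makes the memory point concrete: each process handles its whole local batch, so adding GPUs only lowers per-card memory if the per-process batch itself is made smaller.

```python
import math

def grad_of_mean_loss(w, xs, ts):
    """d/dw of mean((w*x - t)^2) over the samples one process sees."""
    return sum(2 * (w * x - t) * x for x, t in zip(xs, ts)) / len(xs)

def ddp_averaged_grad(w, per_process_batches):
    """Each process computes a local gradient on its own batch;
    the all-reduce then averages the local gradients over all processes."""
    local = [grad_of_mean_loss(w, xs, ts) for xs, ts in per_process_batches]
    return sum(local) / len(local)

# Two processes (world_size = 2), each with its own batch of 2 samples
batches = [([1.0, 2.0], [1.0, 1.0]), ([3.0, 4.0], [2.0, 3.0])]
all_xs = [x for xs, _ in batches for x in xs]
all_ts = [t for _, ts in batches for t in ts]
assert math.isclose(ddp_averaged_grad(0.5, batches), grad_of_mean_loss(0.5, all_xs, all_ts))
```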

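import torch
# rank (this process's index, 0..3) and i (its local GPU index) are placeholders, usually supplied by the launcher
torch.distributed.init_process_group(backend='nccl', world_size=4, rank=rank, init_method='...')
model = torch.nn.parallel.DistributedDataParallel(model.cuda(i), device_ids=[i], output_device=i)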
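The world_size and rank arguments are easier to read with a concrete layout. Assuming one process per GPU and the common convention of numbering processes machine by machine, the sketch below (hypothetical helper names) shows how world_size=4 could arise from 2 machines with 2 GPUs each, and which global rank and local GPU index i each process would pass to init_process_group and device_ids=[i].

```python
def world_size(num_nodes, ngpus_per_node):
    # One process per GPU on every machine
    return num_nodes * ngpus_per_node

def global_rank(node_rank, local_gpu, ngpus_per_node):
    # Machine 0 gets ranks 0..ngpus_per_node-1, machine 1 the next block, and so on
    return node_rank * ngpus_per_node + local_gpu

def process_layout(num_nodes, ngpus_per_node):
    """Return (node_rank, local GPU index i, global rank) for every process."""
    return [(node, i, global_rank(node, i, ngpus_per_node))
            for node in range(num_nodes)
            for i in range(ngpus_per_node)]

print("world_size =", world_size(2, 2))
for node, i, rank in process_layout(2, 2):
    print(f"machine {node}, GPU {i}: rank={rank}, device_ids=[{i}]")
```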

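The excerpt below compares how torch.nn.DataParallel and torch.nn.parallel.DistributedDataParallel are set up. It appears to come from the official PyTorch ImageNet example; args, ngpus_per_node and model are defined elsewhere in that script.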

    if args.distributed:
        # For multiprocessing distributed, DistributedDataParallel constructor
        # should always set the single device scope, otherwise,
        # DistributedDataParallel will use all available devices.
        if args.gpu is not None:
            torch.cuda.set_device(args.gpu)
            model.cuda(args.gpu)
            # When using a single GPU per process and per
            # DistributedDataParallel, we need to divide the batch size
            # ourselves based on the total number of GPUs we have
            args.batch_size = int(args.batch_size / ngpus_per_node)
            model = torch.nn.parallel.DistributedDataParallel(model, device_ids=[args.gpu])
        else:
            model.cuda()
            # DistributedDataParallel will divide and allocate batch_size to all
            # available GPUs if device_ids are not set
            model = torch.nn.parallel.DistributedDataParallel(model)
    elif args.gpu is not None:
        torch.cuda.set_device(args.gpu)
        model = model.cuda(args.gpu)
    else:
        # DataParallel will divide and allocate batch_size to all available GPUs
        if args.arch.startswith('alexnet') or args.arch.startswith('vgg'):
            model.features = torch.nn.DataParallel(model.features)
            model.cuda()
        else:
            model = torch.nn.DataParallel(model).cuda()
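In the distributed branch above, args.batch_size is divided by ngpus_per_node, so the value given on the command line acts as a per-machine batch, while the DataParallel branch leaves it alone because DataParallel splits the batch itself. The plain-Python sketch below (hypothetical helper names) reproduces that arithmetic: int() truncates, so a batch that does not divide evenly loses a few samples, and with several machines the effective global batch grows with the number of machines.

```python
def per_process_batch(batch_size, ngpus_per_node):
    # Same arithmetic as the excerpt: int(args.batch_size / ngpus_per_node)
    return int(batch_size / ngpus_per_node)

def effective_global_batch(batch_size, ngpus_per_node, num_nodes=1):
    # Every process (num_nodes * ngpus_per_node of them) runs its own per-process batch
    return per_process_batch(batch_size, ngpus_per_node) * ngpus_per_node * num_nodes

for case in [(256, 4, 1), (256, 3, 1), (256, 4, 2)]:
    print(case, per_process_batch(*case[:2]), effective_global_batch(*case))
```

With these inputs the per-process batches are 64, 85 and 64, and the effective global batches are 256, 255 and 512.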

