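1. CPU and GPU

2. Data transfer

Data is moved between the GPU and the CPU.

Method used for the transfer: the to() function

Objects that can be transferred: Tensor and Module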

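2.1 The to() function

The to() function converts the data type and/or the device.

tensor.to(*args, **kwargs)
module.to(*args, **kwargs)

Difference: a tensor's to() is not in-place, while a module's to() is in-place.
Calling to() on a tensor builds and returns a new tensor whenever a conversion is actually needed (if the tensor already has the requested dtype and device, the tensor itself is returned), so the result has to be assigned back. Calling to() on a module converts the module's parameters and buffers in place and returns the module itself.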
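The difference can be checked on the CPU with a dtype conversion. The following is a minimal sketch; the variable names are chosen here for illustration only.

```python
import torch
import torch.nn as nn

# Tensor: to() returns a new tensor when a conversion is needed.
x = torch.ones((3, 3))              # float32 on the CPU
y = x.to(torch.float64)             # new tensor; x itself keeps its dtype
same = x.to(torch.float32)          # nothing to convert, so no copy is made

print(x.dtype, y.dtype)             # torch.float32 torch.float64
print(y is x)                       # False
print(same.data_ptr() == x.data_ptr())  # True: same underlying memory

# Module: to() converts the parameters in place and returns the module itself.
linear = nn.Linear(2, 2)
returned = linear.to(torch.float64)

print(returned is linear)           # True
print(linear.weight.dtype)          # torch.float64
```

This is why tensor code is written as x = x.to(...), while for a module a bare net.to(...) is enough.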

2.1.1 tensor.to()

Using to() on a tensor:

# convert the data type with to()
x = torch.ones((3, 3))
x = x.to(torch.float64)

# move the data to a device with to()
x = torch.ones((3, 3))
x = x.to("cuda")

Code example:

# -*- coding: utf-8 -*-

import torch
import torch.nn as nn
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

# ========================== tensor to cuda
# flag = 0
flag = 1
if flag:
    x_cpu = torch.ones((3, 3))
    print("x_cpu:\ndevice: {} is_cuda: {} id: {}".format(x_cpu.device, x_cpu.is_cuda, id(x_cpu)))

    x_gpu = x_cpu.to(device)
    print("x_gpu:\ndevice: {} is_cuda: {} id: {}".format(x_gpu.device, x_gpu.is_cuda, id(x_gpu)))


# x_gpu = x_cpu.cuda()  # also works (it is not deprecated), but .to(device) keeps the code device-agnostic

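2.1.2 module.to()

Using to() on a module:

# convert the data type with to()
linear = nn.Linear(2, 2)
linear.to(torch.double)

# move the module to a device with to()
gpu1 = torch.device("cuda")
linear.to(gpu1)

Notes:
Converting a module's data type converts all floating-point parameters and buffers of its layers to the specified type.
Moving a module works the same way as moving a tensor: everything is moved to the specified device.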

# -*- coding: utf-8 -*-

import torch
import torch.nn as nn
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
# ========================== module to cuda
# flag = 0
flag = 1
if flag:
    net = nn.Sequential(nn.Linear(3, 3))

    print("\nid:{} is_cuda: {}".format(id(net), next(net.parameters()).is_cuda))

    net.to(device)
    print("\nid:{} is_cuda: {}".format(id(net), next(net.parameters()).is_cuda))

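As the output shows, the module's id (its memory address) is the same before and after the move, which confirms that a module's to() is an in-place operation.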
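The same behaviour can be observed without a GPU by converting the data type instead of the device. The sketch below assumes a small hypothetical network with a BatchNorm1d layer, chosen because it holds both floating-point and integer buffers.

```python
import torch
import torch.nn as nn

net = nn.Sequential(nn.Linear(3, 3), nn.BatchNorm1d(3))
net_id_before = id(net)

net.to(torch.float64)   # no assignment needed: the module is modified in place

# dtype of every parameter and buffer after the conversion
dtypes = {name: t.dtype for name, t in net.state_dict().items()}
for name, dtype in dtypes.items():
    print(name, dtype)

print(id(net) == net_id_before)   # True: still the same module object
```

The weights, biases and running statistics become float64, while the integer counter num_batches_tracked keeps its int64 type, since only floating-point (or complex) tensors are cast.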

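2.2 Common torch.cuda functions

Commonly used functions in torch.cuda:

torch.cuda.device_count(): returns the number of GPUs that are currently visible and available
torch.cuda.get_device_name(): returns the name of a GPU
torch.cuda.manual_seed(): sets the random seed for the current GPU
torch.cuda.manual_seed_all(): sets the random seed for all visible and available GPUs
torch.cuda.set_device(): selects which physical GPU acts as the main GPU (not recommended)
Recommended instead: os.environ.setdefault("CUDA_VISIBLE_DEVICES", "2,3")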
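A short sketch of how these functions are typically called. It is written so that it also runs on a CPU-only machine, where the device count is 0 and the seeding calls are silently ignored.

```python
import torch

num_gpus = torch.cuda.device_count()     # 0 on a machine without a usable GPU
print("visible GPUs:", num_gpus)

for idx in range(num_gpus):
    # idx is an index into the visible GPUs
    print(idx, torch.cuda.get_device_name(idx))

# Seeding is safe to call even when CUDA is unavailable
torch.cuda.manual_seed(0)        # current GPU only
torch.cuda.manual_seed_all(0)    # every visible GPU
```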

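2.2.1 Logical GPUs and physical GPUs

Physical GPU: a GPU card that is physically installed in the machine
Logical GPU: a GPU that the Python script has been configured to see; the number of logical GPUs is less than or equal to the number of physical GPUs

os.environ.setdefault("CUDA_VISIBLE_DEVICES", "2,3")
After this call, the number of logical GPUs equals the number of GPUs made visible, which is 2 here, and "2,3" specifies the physical GPUs that the logical ones map to: logical GPU 0 maps to physical GPU 2, and logical GPU 1 maps to physical GPU 3. Note that setdefault() only takes effect if CUDA_VISIBLE_DEVICES is not already set in the environment, and that the variable has to be set before CUDA is initialized in the process.

Among the logical GPUs one is designated the main GPU, by default logical GPU 0. The reason a main GPU exists is tied to the scatter/gather mechanism of multi-GPU parallelism.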
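The logical-to-physical mapping can be written down explicitly. The helper below is hypothetical (it is not part of PyTorch or CUDA) and assumes the variable holds plain integer indices rather than GPU UUIDs.

```python
def logical_to_physical(visible_devices):
    """Map logical GPU index -> physical GPU index for a CUDA_VISIBLE_DEVICES string."""
    ids = [int(part) for part in visible_devices.split(",") if part.strip()]
    return {logical: physical for logical, physical in enumerate(ids)}

mapping = logical_to_physical("2,3")
print(mapping)   # {0: 2, 1: 3}
```

Inside the script, "cuda:0" therefore refers to physical GPU 2 and "cuda:1" to physical GPU 3.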

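3. Multi-GPU parallel computation

3.1 The scatter/gather mechanism of multi-GPU computation

An analogy for multi-GPU parallel computation:
Suppose Xiao Ming has 4 homework assignments, each taking 60 minutes to finish.
Single GPU: Xiao Ming does everything alone, which takes 240 minutes.
Multiple GPUs: Xiao Ming first finds friends and hands out the assignments evenly (3 minutes), everyone works in parallel (60 minutes), and finally Xiao Ming collects the finished assignments (3 minutes), for a total of 66 minutes.

The multi-GPU scatter/gather mechanism in PyTorch:
The training data is first split evenly and scattered to every GPU, the GPUs then compute in parallel, and the results are finally gathered back onto the main GPU, which by default is the first of the visible GPUs.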
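The arithmetic of the homework analogy, as a small sketch. The function names are made up for illustration, and the model assumes every worker handles exactly one assignment.

```python
def single_worker_minutes(num_jobs, minutes_per_job):
    # one worker processes the jobs one after another
    return num_jobs * minutes_per_job

def parallel_minutes(minutes_per_job, scatter_minutes, gather_minutes):
    # each worker handles one job, so the parallel phase lasts one job's time
    return scatter_minutes + minutes_per_job + gather_minutes

serial = single_worker_minutes(4, 60)
parallel = parallel_minutes(60, 3, 3)
print(serial, parallel)               # 240 66
print(round(serial / parallel, 2))    # 3.64
```

The speed-up stays below 4 because of the time spent on scattering and gathering, and that overhead falls on the main GPU.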

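3.2 torch.nn.DataParallel

torch.nn.DataParallel(module,
                      device_ids=None,
                      output_device=None,
                      dim=0)

Purpose: wraps a model to implement the scatter/gather parallel mechanism

Main parameters:
• module: the model to wrap and distribute
• device_ids: the GPUs to distribute to; defaults to all visible and available GPUs
• output_device: the device on which the results are gathered; defaults to device_ids[0]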

# -*- coding: utf-8 -*-

import os
import numpy as np
import torch
import torch.nn as nn

# ============================ select the GPUs manually
# flag = 0
flag = 1
if flag:

    gpu_list = [0]
    gpu_list_str = ','.join(map(str, gpu_list))
    os.environ.setdefault("CUDA_VISIBLE_DEVICES", gpu_list_str)
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")


# ============================ choose the main GPU automatically by free memory
# flag = 0
flag = 1
if flag:
    def get_gpu_memory():
        import platform
        if 'Windows' != platform.system():
            import os
            os.system('nvidia-smi -q -d Memory | grep -A4 GPU | grep Free > tmp.txt')
            memory_gpu = [int(x.split()[2]) for x in open('tmp.txt', 'r').readlines()]
            os.system('rm tmp.txt')
        else:
            memory_gpu = False
            print("The free-memory query is not supported on Windows yet")
        return memory_gpu


    gpu_memory = get_gpu_memory()
    if gpu_memory:  # only reorder the GPUs when the query returned values
        print("\ngpu free memory: {}".format(gpu_memory))
        gpu_list = np.argsort(gpu_memory)[::-1]  # GPU indices sorted by free memory, largest first

        gpu_list_str = ','.join(map(str, gpu_list))
        # setdefault() has no effect if CUDA_VISIBLE_DEVICES is already set,
        # e.g. by the manual-selection block above
        os.environ.setdefault("CUDA_VISIBLE_DEVICES", gpu_list_str)
        device = torch.device("cuda" if torch.cuda.is_available() else "cpu")


class FooNet(nn.Module):
    def __init__(self, neural_num, layers=3):
        super(FooNet, self).__init__()
        self.linears = nn.ModuleList([nn.Linear(neural_num, neural_num, bias=False) for i in range(layers)])

    def forward(self, x):

        print("\nbatch size in forward: {}".format(x.size()[0]))

        for (i, linear) in enumerate(self.linears):
            x = linear(x)
            x = torch.relu(x)
        return x


if __name__ == "__main__":

    batch_size = 16

    # data
    inputs = torch.randn(batch_size, 3)
    labels = torch.randn(batch_size, 3)

    inputs, labels = inputs.to(device), labels.to(device)

    # model
    net = FooNet(neural_num=3, layers=3)
    net = nn.DataParallel(net)           # wrap the custom model with nn.DataParallel
    net.to(device)

    # training
    for epoch in range(1):

        outputs = net(inputs)

        print("model outputs.size: {}".format(outputs.size()))

    print("CUDA_VISIBLE_DEVICES :{}".format(os.environ["CUDA_VISIBLE_DEVICES"]))
    print("device_count :{}".format(torch.cuda.device_count()))

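With two GPUs:
The total batch_size is 16, so each GPU receives a batch of 8.

With four GPUs:
The script sorts the GPUs by free memory and puts the one with the most free memory first, which makes it the main GPU.
The total batch_size is 16, so each GPU receives a batch of 4.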
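The per-GPU batch sizes can be estimated without any GPU. This is a sketch that assumes the batch is split along dimension 0 in the same way torch.chunk splits a tensor; scatter_sizes is a hypothetical helper, not a PyTorch function.

```python
import math

def scatter_sizes(batch_size, num_gpus):
    """Per-GPU batch sizes, assuming a torch.chunk-style split along dim 0."""
    chunk = math.ceil(batch_size / num_gpus)
    sizes = []
    remaining = batch_size
    while remaining > 0:
        take = min(chunk, remaining)
        sizes.append(take)
        remaining -= take
    return sizes

print(scatter_sizes(16, 2))   # [8, 8]
print(scatter_sizes(16, 4))   # [4, 4, 4, 4]
print(scatter_sizes(10, 4))   # [3, 3, 3, 1]
```

When the batch size is not divisible by the number of GPUs, the last GPU gets a smaller share under this assumption.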

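3.3 Querying the free GPU memory

def get_gpu_memory():
    import os
    os.system('nvidia-smi -q -d Memory | grep -A4 GPU | grep Free > tmp.txt')
    memory_gpu = [int(x.split()[2]) for x in open('tmp.txt', 'r').readlines()]
    os.system('rm tmp.txt')
    return memory_gpu

Explanation:
nvidia-smi: NVIDIA's query tool
-q: query mode
-d: which kind of information to display (here, Memory)
grep: search for matching lines
-A4: print the matching line plus the 4 lines after it
grep Free: keep only the lines that report free memory
> tmp.txt: redirect the result to a temporary txt file
rm tmp.txt: delete the temporary txt file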
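A variant that avoids the temporary file and the grep pipeline, sketched under the assumption that nvidia-smi is on the PATH. It uses the --query-gpu and --format options of nvidia-smi; the two function names are chosen here for illustration.

```python
import shutil
import subprocess

def parse_free_memory(csv_text):
    """Parse one integer (MiB) per line, skipping empty or non-numeric lines."""
    return [int(line.strip()) for line in csv_text.splitlines() if line.strip().isdigit()]

def query_free_memory():
    """Return the free memory of each GPU in MiB, or [] if nvidia-smi is unavailable."""
    if shutil.which("nvidia-smi") is None:
        return []
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=memory.free", "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=False,
    )
    if result.returncode != 0:
        return []
    return parse_free_memory(result.stdout)

print(parse_free_memory("10362\n10058\n9990\n9990\n"))   # [10362, 10058, 9990, 9990]
print(query_free_memory())
```

Because no shell commands such as grep or rm are involved, this version does not depend on a Unix shell.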

Example:
gpu_memory = get_gpu_memory()                               # get the free memory of each GPU
gpu_list = np.argsort(gpu_memory)[::-1]                     # GPU indices sorted by free memory, largest first
gpu_list_str = ','.join(map(str, gpu_list))                 # turn the sorted indices into a string
os.environ.setdefault("CUDA_VISIBLE_DEVICES", gpu_list_str) # make the GPUs visible in that order
print("\ngpu free memory: {}".format(gpu_memory))
print("CUDA_VISIBLE_DEVICES :{}".format(os.environ["CUDA_VISIBLE_DEVICES"]))
>>> gpu free memory: [10362, 10058, 9990, 9990]
>>> CUDA_VISIBLE_DEVICES :0,1,3,2

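3.4 Loading models saved on the GPU

3.4.1 Common error 1

If a model is trained and saved on a GPU and is then loaded on a machine where no GPU is available, loading raises an error.

Code example: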

# -*- coding: utf-8 -*-

import os
import numpy as np
import torch
import torch.nn as nn


class FooNet(nn.Module):
    def __init__(self, neural_num, layers=3):
        super(FooNet, self).__init__()
        self.linears = nn.ModuleList([nn.Linear(neural_num, neural_num, bias=False) for i in range(layers)])

    def forward(self, x):

        print("\nbatch size in forward: {}".format(x.size()[0]))

        for (i, linear) in enumerate(self.linears):
            x = linear(x)
            x = torch.relu(x)
        return x


# =================================== load onto the CPU
# flag = 0
flag = 1
if flag:
    gpu_list = [0]
    gpu_list_str = ','.join(map(str, gpu_list))
    os.environ.setdefault("CUDA_VISIBLE_DEVICES", gpu_list_str)
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

    net = FooNet(neural_num=3, layers=3)
    net.to(device)

    # save
    net_state_dict = net.state_dict()
    path_state_dict = "./model_in_gpu_0.pkl"
    torch.save(net_state_dict, path_state_dict)

    # load
    state_dict_load = torch.load(path_state_dict)
    # state_dict_load = torch.load(path_state_dict, map_location="cpu")
    print("state_dict_load:\n{}".format(state_dict_load))

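When a GPU is available, the printed state_dict shows that the loaded tensors have been placed on the GPU device.

When the file is loaded with state_dict_load = torch.load(path_state_dict, map_location="cpu"), the tensors are placed on the CPU and no device information is printed, because PyTorch only prints the device of a tensor when it is not on the CPU.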
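The effect of map_location can be verified by inspecting the device of each loaded tensor. The sketch below uses a hypothetical file name in a temporary directory and runs on machines with or without a GPU.

```python
import os
import tempfile

import torch
import torch.nn as nn

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
net = nn.Linear(3, 3).to(device)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "linear_state_dict.pkl")   # hypothetical file name
    torch.save(net.state_dict(), path)
    # map_location="cpu" remaps every stored tensor to the CPU while loading
    state_dict = torch.load(path, map_location="cpu")

devices = {name: t.device.type for name, t in state_dict.items()}
print(devices)   # {'weight': 'cpu', 'bias': 'cpu'}

# The remapped state_dict loads into a model that lives on the CPU
cpu_net = nn.Linear(3, 3)
cpu_net.load_state_dict(state_dict)
```

Passing map_location="cpu" makes the loading step independent of the hardware the checkpoint was saved on.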

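3.4.2 Common error 2

When training used multi-GPU parallelism, that is, when the model was wrapped with torch.nn.DataParallel, every layer name in the state_dict gains a "module." prefix. Loading that state_dict into an unwrapped model then fails because the names do not match.

Code example: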

# -*- coding: utf-8 -*-

import os
import numpy as np
import torch
import torch.nn as nn


class FooNet(nn.Module):
    def __init__(self, neural_num, layers=3):
        super(FooNet, self).__init__()
        self.linears = nn.ModuleList([nn.Linear(neural_num, neural_num, bias=False) for i in range(layers)])

    def forward(self, x):

        print("\nbatch size in forward: {}".format(x.size()[0]))

        for (i, linear) in enumerate(self.linears):
            x = linear(x)
            x = torch.relu(x)
        return x


# =================================== save with multiple GPUs
flag = 0
# flag = 1
if flag:

    if torch.cuda.device_count() < 2:
        print("Not enough GPUs, please run this in a multi-GPU environment")
        import sys
        sys.exit(0)

    gpu_list = [0, 1, 2, 3]
    gpu_list_str = ','.join(map(str, gpu_list))
    os.environ.setdefault("CUDA_VISIBLE_DEVICES", gpu_list_str)
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

    net = FooNet(neural_num=3, layers=3)
    net = nn.DataParallel(net)
    net.to(device)

    # save
    net_state_dict = net.state_dict()
    path_state_dict = "./model_in_multi_gpu.pkl"
    torch.save(net_state_dict, path_state_dict)

# =================================== load a multi-GPU checkpoint
flag = 0
# flag = 1
if flag:

    net = FooNet(neural_num=3, layers=3)

    path_state_dict = "./model_in_multi_gpu.pkl"
    state_dict_load = torch.load(path_state_dict, map_location="cpu")
    print("state_dict_load:\n{}".format(state_dict_load))

    # net.load_state_dict(state_dict_load)

    # remove module.
    from collections import OrderedDict
    new_state_dict = OrderedDict()
    for k, v in state_dict_load.items():
        namekey = k[7:] if k.startswith('module.') else k
        new_state_dict[namekey] = v
    print("new_state_dict:\n{}".format(new_state_dict))

    net.load_state_dict(new_state_dict)

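As the output shows, wrapping a model with torch.nn.DataParallel prepends "module." to every layer name, which causes the name mismatch at load time.

Processing the loaded state_dict as in the code above, by stripping the "module." prefix from each key, resolves the mismatch.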
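An alternative to stripping the prefix at load time is to save the state_dict of the wrapped model itself, which DataParallel exposes as its module attribute. This is a minimal sketch with a hypothetical one-layer network; it also runs on a machine without GPUs.

```python
import torch.nn as nn

net = nn.DataParallel(nn.Sequential(nn.Linear(3, 3)))

wrapped_keys = list(net.state_dict().keys())         # keys of the wrapper
plain_keys = list(net.module.state_dict().keys())    # keys of the wrapped model

print(wrapped_keys)   # ['module.0.weight', 'module.0.bias']
print(plain_keys)     # ['0.weight', '0.bias']

# A fresh, unwrapped model can load the plain state_dict directly
fresh = nn.Sequential(nn.Linear(3, 3))
fresh.load_state_dict(net.module.state_dict())
```

Saving net.module.state_dict() keeps the checkpoint independent of whether DataParallel was used during training.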

