1. CPU and GPU

2. Data migration

Data moves between the GPU and the CPU:

Method used for migration: the to() function

What can be migrated: Tensor and Module

2.1 The to function

The to function converts the data type and/or the device.

tensor.to(*args, **kwargs)
module.to(*args, **kwargs)

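Difference: a tensor's to() is not in-place, a module's to() is in-place.
When a conversion is needed, tensor.to() builds and returns a new tensor, so the result must be assigned (x = x.to(...)). module.to() converts the module's parameters and buffers in place and returns the module itself.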
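The difference can be checked on a CPU-only machine by converting the dtype instead of the device. This is a minimal sketch; the variable names are illustrative only, and it assumes the default dtype is still float32.

```python
import torch
import torch.nn as nn

x = torch.ones((3, 3))            # float32 by default
y = x.to(torch.float64)           # conversion needed, so a new tensor is returned
print(y is x, x.dtype, y.dtype)   # x itself is left untouched

z = x.to(torch.float32)           # nothing to convert, so the same tensor comes back
print(z is x)

linear = nn.Linear(2, 2)
same = linear.to(torch.double)    # parameters are converted inside the module
print(same is linear, linear.weight.dtype)
```

Note the middle case: when the tensor already has the requested dtype and device, to() returns the tensor itself rather than a copy.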

2.1.1 tensor.to()

Using to() on a tensor:

# convert the data type with to()
x = torch.ones((3, 3))
x = x.to(torch.float64)

# move the data to a device with to()
x = torch.ones((3, 3))
x = x.to("cuda")

Code example:

# -*- coding: utf-8 -*-

import torch
import torch.nn as nn
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

# ========================== tensor to cuda
# flag = 0
flag = 1
if flag:
    x_cpu = torch.ones((3, 3))
    print("x_cpu:\ndevice: {} is_cuda: {} id: {}".format(x_cpu.device, x_cpu.is_cuda, id(x_cpu)))

    x_gpu = x_cpu.to(device)
    print("x_gpu:\ndevice: {} is_cuda: {} id: {}".format(x_gpu.device, x_gpu.is_cuda, id(x_gpu)))


# x_gpu = x_cpu.cuda()  # also moves the tensor, but raises an error without a GPU; prefer .to(device)

2.1.2 module.to()

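Using to() on a module:

# convert the data type with to()
linear = nn.Linear(2, 2)
linear.to(torch.double)

# move the module to a device with to()
gpu1 = torch.device("cuda")
linear.to(gpu1)

Notes:
Converting a module's data type converts all floating-point parameters and buffers of its layers to the given type.
Moving a module works like moving a tensor: its parameters and buffers are moved to the given device.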

# -*- coding: utf-8 -*-

import torch
import torch.nn as nn
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
# ========================== module to cuda
# flag = 0
flag = 1
if flag:
    net = nn.Sequential(nn.Linear(3, 3))

    print("\nid:{} is_cuda: {}".format(id(net), next(net.parameters()).is_cuda))

    net.to(device)
    print("\nid:{} is_cuda: {}".format(id(net), next(net.parameters()).is_cuda))

The output shows that id(net) is the same before and after the move, which confirms that a module's to() operates in place.

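2.2 Common torch.cuda methods

torch.cuda.device_count(): returns the number of visible, usable GPUs
torch.cuda.get_device_name(): returns the name of a GPU
torch.cuda.manual_seed(): sets the random seed for the current GPU
torch.cuda.manual_seed_all(): sets the random seed for all visible, usable GPUs
torch.cuda.set_device(): selects which GPU is the current (primary) one (not recommended)
Recommended instead: os.environ.setdefault("CUDA_VISIBLE_DEVICES", "2,3"), executed before CUDA is first used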
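As a rough illustration of the methods listed above, the sketch below seeds all visible GPUs and collects their names. The helper name describe_gpus is hypothetical, and the code is written so that it also runs on a machine without any GPU.

```python
import torch


def describe_gpus(seed=0):
    """Seed every visible GPU and return the names of the visible GPUs."""
    torch.cuda.manual_seed_all(seed)       # silently ignored when CUDA is not available
    count = torch.cuda.device_count()      # number of visible GPUs, 0 on a CPU-only machine
    return [torch.cuda.get_device_name(i) for i in range(count)]


names = describe_gpus()
print("visible GPUs: {}".format(names))
```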

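2.2.1 Logical GPUs and physical GPUs

Physical GPU: a graphics card actually installed in the host
Logical GPU: a GPU made visible to the Python script; the number of logical GPUs is less than or equal to the number of physical GPUs

os.environ.setdefault("CUDA_VISIBLE_DEVICES", "2,3")
After this call the number of logical GPUs equals the number of GPUs made visible, here 2. The string "2,3" states which physical GPU each logical GPU maps to: logical gpu0 is physical gpu2 and logical gpu1 is physical gpu3. Note that setdefault only takes effect if CUDA_VISIBLE_DEVICES is not already set in the environment.

Among the logical GPUs one acts as the primary GPU, by default logical gpu0. Why a primary GPU exists follows from the scatter/gather mechanism of multi-GPU parallelism.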
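The logical-to-physical mapping described above can be written out in a few lines. This is only a sketch: the function name logical_to_physical is made up for illustration, and it assumes the variable holds integer indices (it can also hold GPU UUIDs, which this sketch does not handle).

```python
def logical_to_physical(visible_devices):
    """Map logical GPU indices to the physical indices listed in CUDA_VISIBLE_DEVICES."""
    physical_ids = [int(p) for p in visible_devices.split(",") if p.strip()]
    return dict(enumerate(physical_ids))


mapping = logical_to_physical("2,3")
print(mapping)  # {0: 2, 1: 3}
```

Inside the script, torch.device("cuda:0") would then refer to physical GPU 2.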

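3. Multi-GPU parallel computation

3.1 The scatter/gather mechanism of multi-GPU computation

An analogy for multi-GPU parallel computation:
Suppose Xiao Ming has 4 assignments to do, each taking 60 minutes.
Single GPU: Xiao Ming does them alone, which takes 240 minutes.
Multiple GPUs: Xiao Ming spends 3 minutes finding helpers and handing out the assignments evenly, everyone works in parallel for 60 minutes, and Xiao Ming spends 3 minutes collecting the finished work, 66 minutes in total.

The multi-GPU mechanism in PyTorch:
The training batch is first split evenly and distributed to the GPUs, each GPU computes on its share in parallel, and the results are then gathered back onto the primary GPU, which by default is the first of the visible GPUs.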
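The scatter, compute, gather sequence can be imitated on the CPU to see that it gives the same result as one big forward pass. This is a simulation under stated assumptions, not what DataParallel literally executes: num_replicas stands in for the number of GPUs, and the chunks are processed one after another rather than in parallel.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
net = nn.Linear(3, 3, bias=False)
inputs = torch.randn(16, 3)
num_replicas = 4                               # hypothetical number of GPUs

chunks = inputs.chunk(num_replicas, dim=0)     # scatter: split the batch along dim 0
partial = [net(chunk) for chunk in chunks]     # each "device" processes its own chunk
outputs = torch.cat(partial, dim=0)            # gather: concatenate on the primary device

print([c.size(0) for c in chunks])             # [4, 4, 4, 4]
print(torch.allclose(outputs, net(inputs), atol=1e-6))
```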

3.2 torch.nn.DataParallel

torch.nn.DataParallel(module, 
					  device_ids=None,
					  output_device=None, 
					  dim=0)

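Purpose: wraps a model to implement the scatter/gather mechanism

Main parameters:
• module: the model to wrap and distribute
• device_ids: the GPUs to distribute to; defaults to all visible, usable GPUs
• output_device: the device the results are gathered on; defaults to device_ids[0]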

# -*- coding: utf-8 -*-

import os
import numpy as np
import torch
import torch.nn as nn

# ============================ manually select GPUs
# flag = 0
flag = 1
if flag:

    gpu_list = [0]
    gpu_list_str = ','.join(map(str, gpu_list))
    os.environ.setdefault("CUDA_VISIBLE_DEVICES", gpu_list_str)
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")


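# ============================ choose the primary GPU automatically by free memory
# (setdefault below has no effect if the block above already set CUDA_VISIBLE_DEVICES)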
# flag = 0
flag = 1
if flag:
    def get_gpu_memory():
        import platform
        if 'Windows' != platform.system():
            import os
            os.system('nvidia-smi -q -d Memory | grep -A4 GPU | grep Free > tmp.txt')
            memory_gpu = [int(x.split()[2]) for x in open('tmp.txt', 'r').readlines()]
            os.system('rm tmp.txt')
        else:
            memory_gpu = False
            print("GPU memory query is not yet supported on Windows")
        return memory_gpu


    gpu_memory = get_gpu_memory()
    if gpu_memory:  # only reorder the GPUs when the memory query succeeded
        print("\ngpu free memory: {}".format(gpu_memory))
        gpu_list = np.argsort(gpu_memory)[::-1]

        gpu_list_str = ','.join(map(str, gpu_list))
        os.environ.setdefault("CUDA_VISIBLE_DEVICES", gpu_list_str)
        device = torch.device("cuda" if torch.cuda.is_available() else "cpu")


class FooNet(nn.Module):
    def __init__(self, neural_num, layers=3):
        super(FooNet, self).__init__()
        self.linears = nn.ModuleList([nn.Linear(neural_num, neural_num, bias=False) for i in range(layers)])

    def forward(self, x):

        print("\nbatch size in forward: {}".format(x.size()[0]))

        for (i, linear) in enumerate(self.linears):
            x = linear(x)
            x = torch.relu(x)
        return x


if __name__ == "__main__":

    batch_size = 16

    # data
    inputs = torch.randn(batch_size, 3)
    labels = torch.randn(batch_size, 3)

    inputs, labels = inputs.to(device), labels.to(device)

    # model
    net = FooNet(neural_num=3, layers=3)
    net = nn.DataParallel(net)           # wrap the custom model with nn.DataParallel
    net.to(device)

    # training
    for epoch in range(1):

        outputs = net(inputs)

        print("model outputs.size: {}".format(outputs.size()))

    print("CUDA_VISIBLE_DEVICES :{}".format(os.environ["CUDA_VISIBLE_DEVICES"]))
    print("device_count :{}".format(torch.cuda.device_count()))

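With two GPUs:
The total batch_size is 16, so each GPU's forward sees a batch size of 8.

With four GPUs:
The script sorts the GPUs by free memory and lists the one with the most free memory first, which makes it the primary GPU.
The total batch_size is 16, so each GPU's forward sees a batch size of 4.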
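The per-GPU batch sizes above divide evenly. For batches that do not, the sketch below computes the sizes under the assumption that the batch is split the way torch.chunk splits a dimension (equal chunks of ceil(batch / n), with a smaller last chunk). The helper name per_gpu_batch_sizes is hypothetical.

```python
import math


def per_gpu_batch_sizes(batch_size, num_gpus):
    """Chunk sizes when a batch is split the way torch.chunk splits dim 0."""
    chunk = math.ceil(batch_size / num_gpus)
    sizes = []
    remaining = batch_size
    while remaining > 0:
        sizes.append(min(chunk, remaining))
        remaining -= sizes[-1]
    return sizes


print(per_gpu_batch_sizes(16, 2))  # [8, 8]
print(per_gpu_batch_sizes(16, 4))  # [4, 4, 4, 4]
print(per_gpu_batch_sizes(16, 3))  # [6, 6, 4]
```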

3.3 Querying the free GPU memory

def get_gpu_memory():
	import os
	os.system('nvidia-smi -q -d Memory | grep -A4 GPU | grep Free > tmp.txt')
	memory_gpu = [int(x.split()[2]) for x in open('tmp.txt', 'r').readlines()]
	os.system('rm tmp.txt')
	return memory_gpu

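Notes:
nvidia-smi: NVIDIA's command-line query tool
-q: query mode
-d: which section to display (here Memory)
grep: search the output
-A4: print the matching line and the 4 lines after it
grep Free: keep the lines that report free memory
> tmp.txt: redirect the result into a temporary txt file
rm tmp.txt: delete the temporary txt file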
example:
gpu_memory = get_gpu_memory()                               # free memory of each GPU
gpu_list = np.argsort(gpu_memory)[::-1]                     # GPU indices sorted by free memory, largest first
gpu_list_str = ','.join(map(str, gpu_list))                 # join the sorted indices into a string
os.environ.setdefault("CUDA_VISIBLE_DEVICES", gpu_list_str) # make the GPUs visible in that order
print("\ngpu free memory: {}".format(gpu_memory))
print("CUDA_VISIBLE_DEVICES :{}".format(os.environ["CUDA_VISIBLE_DEVICES"]))
>>> gpu free memory: [10362, 10058, 9990, 9990]
>>> CUDA_VISIBLE_DEVICES :0,1,3,2
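The same information can be obtained without grep or a temporary file by asking nvidia-smi for a machine-readable field. This is an alternative sketch, not the method used above: the function names are hypothetical, and it assumes every GPU reports a plain number (some devices may print a non-numeric value, which this parser would reject).

```python
import subprocess


def parse_free_memory(csv_text):
    """Turn one-number-per-line nvidia-smi output into a list of ints (MiB)."""
    return [int(line) for line in csv_text.splitlines() if line.strip()]


def query_free_memory():
    """Ask nvidia-smi for the free memory of every GPU; returns [] if the tool is unavailable."""
    cmd = ["nvidia-smi", "--query-gpu=memory.free", "--format=csv,noheader,nounits"]
    try:
        out = subprocess.run(cmd, capture_output=True, text=True, check=True).stdout
    except (OSError, subprocess.CalledProcessError):
        return []
    return parse_free_memory(out)


print(parse_free_memory("10362\n10058\n9990\n9990\n"))  # [10362, 10058, 9990, 9990]
```

Because nothing is written to disk, two scripts started at the same time cannot overwrite each other's tmp.txt.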

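3.4 Loading models saved on a GPU

3.4.1 Common error 1

A model trained and saved on a GPU fails to load on a machine where no GPU is available: torch.load raises a RuntimeError saying it is attempting to deserialize an object on a CUDA device while torch.cuda.is_available() is False. Passing map_location="cpu" to torch.load fixes this.

Code example: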

# -*- coding: utf-8 -*-

import os
import numpy as np
import torch
import torch.nn as nn


class FooNet(nn.Module):
    def __init__(self, neural_num, layers=3):
        super(FooNet, self).__init__()
        self.linears = nn.ModuleList([nn.Linear(neural_num, neural_num, bias=False) for i in range(layers)])

    def forward(self, x):

        print("\nbatch size in forward: {}".format(x.size()[0]))

        for (i, linear) in enumerate(self.linears):
            x = linear(x)
            x = torch.relu(x)
        return x


# =================================== load to cpu
# flag = 0
flag = 1
if flag:
    gpu_list = [0]
    gpu_list_str = ','.join(map(str, gpu_list))
    os.environ.setdefault("CUDA_VISIBLE_DEVICES", gpu_list_str)
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

    net = FooNet(neural_num=3, layers=3)
    net.to(device)

    # save
    net_state_dict = net.state_dict()
    path_state_dict = "./model_in_gpu_0.pkl"
    torch.save(net_state_dict, path_state_dict)

    # load
    state_dict_load = torch.load(path_state_dict)
    # state_dict_load = torch.load(path_state_dict, map_location="cpu")
    print("state_dict_load:\n{}".format(state_dict_load))

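When a GPU is available, the loaded tensors are printed with their GPU device (device='cuda:0').

With state_dict_load = torch.load(path_state_dict, map_location="cpu") the tensors are loaded onto the CPU and no device is printed, because PyTorch only prints the device of a tensor that is not on the default (CPU) device.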
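A compact round trip makes the effect of map_location visible without relying on the printed representation. This is a sketch under assumptions: the file name is a placeholder, a temporary directory is used so nothing is left behind, and a single nn.Linear stands in for the real model.

```python
import os
import tempfile

import torch
import torch.nn as nn

net = nn.Linear(3, 3, bias=False)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
net.to(device)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "model_state.pkl")   # hypothetical file name
    torch.save(net.state_dict(), path)
    # map_location="cpu" remaps every stored tensor to the CPU while loading
    state_dict_load = torch.load(path, map_location="cpu")

cpu_net = nn.Linear(3, 3, bias=False)
cpu_net.load_state_dict(state_dict_load)
print({k: v.device for k, v in state_dict_load.items()})
```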

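3.4.2 Common error 2

If training used multi-GPU parallelism, that is, the model was wrapped in torch.nn.DataParallel, every key in the saved state_dict carries an extra "module." prefix, so loading it into an unwrapped model fails with mismatched key names.

Code example: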

# -*- coding: utf-8 -*-

import os
import numpy as np
import torch
import torch.nn as nn


class FooNet(nn.Module):
    def __init__(self, neural_num, layers=3):
        super(FooNet, self).__init__()
        self.linears = nn.ModuleList([nn.Linear(neural_num, neural_num, bias=False) for i in range(layers)])

    def forward(self, x):

        print("\nbatch size in forward: {}".format(x.size()[0]))

        for (i, linear) in enumerate(self.linears):
            x = linear(x)
            x = torch.relu(x)
        return x


# =================================== multi-GPU save
flag = 0
# flag = 1
if flag:

    if torch.cuda.device_count() < 2:
        print("Not enough GPUs, please run this in a multi-GPU environment")
        import sys
        sys.exit(0)

    gpu_list = [0, 1, 2, 3]
    gpu_list_str = ','.join(map(str, gpu_list))
    os.environ.setdefault("CUDA_VISIBLE_DEVICES", gpu_list_str)
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

    net = FooNet(neural_num=3, layers=3)
    net = nn.DataParallel(net)
    net.to(device)

    # save
    net_state_dict = net.state_dict()
    path_state_dict = "./model_in_multi_gpu.pkl"
    torch.save(net_state_dict, path_state_dict)

# =================================== multi-GPU load
flag = 0
# flag = 1
if flag:

    net = FooNet(neural_num=3, layers=3)

    path_state_dict = "./model_in_multi_gpu.pkl"
    state_dict_load = torch.load(path_state_dict, map_location="cpu")
    print("state_dict_load:\n{}".format(state_dict_load))

    # net.load_state_dict(state_dict_load)

    # remove module.
    from collections import OrderedDict
    new_state_dict = OrderedDict()
    for k, v in state_dict_load.items():
        namekey = k[7:] if k.startswith('module.') else k
        new_state_dict[namekey] = v
    print("new_state_dict:\n{}".format(new_state_dict))

    net.load_state_dict(new_state_dict)

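The printed state_dict shows that wrapping a model in torch.nn.DataParallel prefixes every layer name with "module.", which is why the names do not match at load time.

Stripping the prefix from the loaded state_dict as shown above resolves the mismatch.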
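An alternative to renaming keys at load time is to avoid the prefix at save time by taking the state_dict of the wrapped model through its module attribute. This is a sketch of that option, not the approach used above; nn.Sequential stands in for FooNet, and on a machine without GPUs DataParallel simply runs the wrapped module directly.

```python
import torch.nn as nn

net = nn.Sequential(nn.Linear(3, 3, bias=False))
wrapped = nn.DataParallel(net)

print(list(wrapped.state_dict().keys()))          # keys carry the "module." prefix
print(list(wrapped.module.state_dict().keys()))   # keys of the underlying model, no prefix

# Saving wrapped.module.state_dict() yields a file that a plain model can load directly.
plain = nn.Sequential(nn.Linear(3, 3, bias=False))
plain.load_state_dict(wrapped.module.state_dict())
```

The key-renaming loop remains useful for checkpoints that were already saved with the prefix.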
