This course comes from 深度之眼 (deepshare.net); some of the screenshots are taken from the course videos.

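CPU and GPU

CPU (Central Processing Unit): consists mainly of a control unit and an arithmetic unit.
GPU (Graphics Processing Unit): built for large-scale computation over uniform data with no dependencies between elements.
In the structure diagrams of the two, the GPU clearly has many more compute units (the green blocks) than the CPU.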

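Note that all operands of a computation must live on the same device. When adding two tensors, for example, both must be on the CPU or both on the GPU; it does not work with one on the CPU and the other on the GPU. In that case the data has to be moved from one device to the other, and PyTorch provides a function for this:
The to function: converts the data type or the device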

data.to("cpu")   # move from GPU to CPU
data.to("cuda")  # move from CPU to GPU

Two kinds of objects can be moved:
1. Tensor
2. Module
The to function of each is covered below:

tensor.to(*args,**kwargs)

x = torch.ones((3, 3))    # define a tensor
x = x.to(torch.float64)   # convert from the default float32 to float64
x = torch.ones(3, 3)      # define a tensor
x = x.to("cuda")          # move it to the GPU

module.to(*args,**kwargs)

linear = nn.Linear(2, 2)      # define a module
linear.to(torch.double)       # convert all parameters of the module from the default float32 to float64 (double is float64)
gpu1 = torch.device("cuda")   # define the device
linear.to(gpu1)               # move the module to the GPU

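Summary

Difference: to is not in-place for tensors, but it is in-place for modules.
In the two examples above, the tensor result has to be assigned back with =, whereas for a module calling to is enough.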
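The difference can be checked without a GPU. The sketch below uses a dtype conversion instead of a device move (an assumption made only so it runs on CPU-only machines); the variable names are illustrative.

```python
import torch
import torch.nn as nn

# Tensor.to returns a tensor; when a conversion is needed it is a new object.
x = torch.ones(3, 3)
y = x.to(torch.float64)
print(x.dtype, y.dtype)   # x keeps its original dtype, y is float64
print(y is x)             # a new tensor was created

# Module.to converts the parameters in place and returns the module itself.
linear = nn.Linear(2, 2)
returned = linear.to(torch.float64)
print(returned is linear)
print(linear.weight.dtype)
```

Because `Module.to` returns the module itself, writing `linear = linear.to(...)` is also harmless; for tensors the assignment is required.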

Moving data to the GPU

Here is the instructor's code:

# -*- coding: utf-8 -*-
"""
# @file name  : cuda_methods.py
# @author     : TingsongYu https://github.com/TingsongYu
# @date       : 2019-11-11
# @brief      : methods for moving data to cuda
"""
import torch
import torch.nn as nn
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")  # cuda:0 if a GPU is available, otherwise cpu

# ========================== tensor to cuda
# flag = 0
flag = 1
if flag:
    x_cpu = torch.ones((3, 3))
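    print("x_cpu:\ndevice: {} is_cuda: {} id: {}".format(x_cpu.device, x_cpu.is_cuda, id(x_cpu)))  # print the device, the is_cuda attribute, and id(x_cpu), the object's identity (its memory address in CPython)

    x_gpu = x_cpu.to(device)  # move to `device` (GPU if available, see its definition above), then print the same information
    print("x_gpu:\ndevice: {} is_cuda: {} id: {}".format(x_gpu.device, x_gpu.is_cuda, id(x_gpu)))

# Older style, not recommended
# x_gpu = x_cpu.cuda()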

The printed ids show that to is not in-place for a tensor: when the target device differs from the current one, a new tensor is built.

# ========================== module to cuda
# flag = 0
flag = 1
if flag:
    net = nn.Sequential(nn.Linear(3, 3))

    print("\nid:{} is_cuda: {}".format(id(net), next(net.parameters()).is_cuda))

    net.to(device)
    print("\nid:{} is_cuda: {}".format(id(net), next(net.parameters()).is_cuda))

For a module, to is in-place: the id does not change.

# ========================== forward in cuda
# flag = 0
flag = 1
if flag:
    output = net(x_gpu)
    print("output is_cuda: {}".format(output.is_cuda))

    # output = net(x_cpu)  # raises an error when net is on the GPU: the model's parameters and the input must be on the same device
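One way to avoid the mismatch is to look up the device of the model's parameters and move the input there before the forward pass. This is only a sketch: `forward_on_model_device` is a hypothetical helper, and it assumes all parameters of the model sit on a single device.

```python
import torch
import torch.nn as nn

def forward_on_model_device(model, x):
    """Move x to the device of the model's first parameter, then run the model."""
    model_device = next(model.parameters()).device
    return model(x.to(model_device))

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
net = nn.Sequential(nn.Linear(3, 3)).to(device)

x_cpu = torch.ones(3, 3)                      # input created on the CPU
output = forward_on_model_device(net, x_cpu)  # works whether net is on CPU or GPU
print(output.device, output.shape)
```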

GPU in PyTorch

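Common torch.cuda methods

1. torch.cuda.device_count(): returns the number of GPUs that are currently visible and available
2. torch.cuda.get_device_name(): returns the name of a GPU
3. torch.cuda.manual_seed(): sets the random seed for the current GPU
4. torch.cuda.manual_seed_all(): sets the random seed for all visible, available GPUs
5. torch.cuda.set_device(): selects which physical GPU acts as the main GPU (not recommended in multi-GPU environments, because it gets confusing)
Recommended instead: os.environ.setdefault("CUDA_VISIBLE_DEVICES", "2,3") makes two logical GPUs visible, backed by physical GPUs 2 and 3: logical gpu0 is physical gpu2 and logical gpu1 is physical gpu3. The variable has to be set before CUDA is first initialized in the process.

os.environ.setdefault("CUDA_VISIBLE_DEVICES", "0,3,2") makes three logical GPUs visible: logical gpu0 is physical gpu0, logical gpu1 is physical gpu3, and logical gpu2 is physical gpu2.

Why set it up this way? When several people use the GPUs of one server at the same time, this is a way to balance the load. Likewise, a single user running several experiments at once can give each experiment its own GPUs.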
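The logical-to-physical mapping is just the position of each ID in the comma-separated string. The sketch below models only that index remapping in plain Python and does not touch CUDA; `logical_to_physical` is a hypothetical helper and assumes integer IDs rather than GPU UUIDs.

```python
def logical_to_physical(visible_devices):
    """Map logical GPU index -> physical GPU index for a CUDA_VISIBLE_DEVICES string."""
    physical_ids = [int(p) for p in visible_devices.split(",") if p.strip()]
    return dict(enumerate(physical_ids))

print(logical_to_physical("2,3"))     # logical 0 and 1 backed by physical 2 and 3
print(logical_to_physical("0,3,2"))   # three logical GPUs
```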
Logical gpu0 is usually called the main GPU; the reason is explained in the next topic:

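The distribute-and-parallelize mechanism of multi-GPU computation

Xiao Ming does his homework alone: each assignment takes 60 minutes, so four assignments take 240 minutes.

He decides to cheat by handing the assignments out to other children. Handing out and collecting the assignments take 3 minutes each, and every child works at the same speed, so the total time is 3 + 60 + 3 = 66 minutes. Xiao Ming, who distributes and collects the assignments, plays the role of the main GPU.
The mechanism of multi-GPU computation is therefore:
distribute -> compute in parallel -> collect the results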
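The arithmetic of the homework analogy can be written down as a tiny model. The function names and the fixed 3-minute overheads are taken from the story or invented for illustration; real GPU timings do not follow such a clean formula.

```python
import math

def serial_minutes(n_tasks, minutes_per_task):
    """One worker does every task, one after another."""
    return n_tasks * minutes_per_task

def parallel_minutes(n_tasks, minutes_per_task, n_workers, distribute=3, collect=3):
    """Distribute, let equally fast workers run in parallel, then collect."""
    rounds = math.ceil(n_tasks / n_workers)   # tasks each worker handles in sequence
    return distribute + rounds * minutes_per_task + collect

print(serial_minutes(4, 60))        # 240
print(parallel_minutes(4, 60, 4))   # 66
```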

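Multi-GPU parallel computation

torch.nn.DataParallel
Purpose: wraps a model to implement the distribute-and-parallelize mechanism
Main parameters:
· module: the model to wrap and distribute
· device_ids: the GPUs to distribute to; defaults to all visible, available GPUs
· output_device: the device on which the results are collected (usually the main GPU)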
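A minimal sketch of passing these parameters explicitly. It assumes that wrapping is only wanted when at least one GPU is visible, so on a CPU-only machine the plain model is used; the layer sizes are arbitrary.

```python
import torch
import torch.nn as nn

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = nn.Linear(3, 3)

device_ids = list(range(torch.cuda.device_count()))   # all visible GPUs
if device_ids:
    # Collect the results on the first listed GPU (the main GPU).
    model = nn.DataParallel(model, device_ids=device_ids, output_device=device_ids[0])
model.to(device)

inputs = torch.randn(16, 3).to(device)
outputs = model(inputs)
print(outputs.shape)
```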

# query the free memory of each GPU
def get_gpu_memory():
    import os
    os.system('nvidia-smi -q -d Memory | grep -A4 GPU | grep Free > tmp.txt')
    memory_gpu = [int(x.split()[2]) for x in open('tmp.txt', 'r').readlines()]
    os.system('rm tmp.txt')
    return memory_gpu
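Given a list of free-memory values like the one returned above, the GPU with the most free memory can be listed first so that it becomes logical gpu0. The sketch below uses made-up numbers and a hypothetical helper name, and only builds the string; it does not set the environment variable.

```python
def order_gpus_by_free_memory(free_mib):
    """Return GPU indices sorted by free memory, largest first."""
    return sorted(range(len(free_mib)), key=lambda i: free_mib[i], reverse=True)

free_mib = [1000, 8000, 4000]   # made-up free memory (MiB) for three GPUs
gpu_order = order_gpus_by_free_memory(free_mib)
visible = ",".join(map(str, gpu_order))   # candidate value for CUDA_VISIBLE_DEVICES
print(gpu_order, visible)
```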

Example code for multi-GPU parallel computation:

# -*- coding: utf-8 -*-

import os
import numpy as np
import torch
import torch.nn as nn

# ============================ select GPUs manually
# flag = 0
flag = 1
if flag:
    # gpu_list = [2, 3]  # on a machine with a single GPU this does not help: GPUs 2 and 3 do not exist, so torch.cuda.device_count() is 0 at run time
    gpu_list = [0]  # so use GPU 0 here, which makes device_count equal to 1
    gpu_list_str = ','.join(map(str, gpu_list))
    os.environ.setdefault("CUDA_VISIBLE_DEVICES", gpu_list_str)
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")


# ============================ pick the main GPU automatically by free memory
# flag = 0
flag = 1
if flag:
    def get_gpu_memory():
        import platform
        if 'Windows' != platform.system():
            import os
            os.system('nvidia-smi -q -d Memory | grep -A4 GPU | grep Free > tmp.txt')
            memory_gpu = [int(x.split()[2]) for x in open('tmp.txt', 'r').readlines()]
            os.system('rm tmp.txt')
        else:
            memory_gpu = False
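            print("GPU memory query is not supported on Windows yet")
        return memory_gpu


    gpu_memory = get_gpu_memory()  # free memory of every GPU
    if gpu_memory:  # only continue when the query returned a list (it is False on Windows)
        print("\ngpu free memory: {}".format(gpu_memory))  # print it
        gpu_list = np.argsort(gpu_memory)[::-1]  # GPU indices sorted by free memory, largest first

        gpu_list_str = ','.join(map(str, gpu_list))
        os.environ.setdefault("CUDA_VISIBLE_DEVICES", gpu_list_str)  # the GPU with the most free memory becomes the main GPU (setdefault does nothing if the variable is already set)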
        device = torch.device("cuda" if torch.cuda.is_available() else "cpu")


class FooNet(nn.Module):
    def __init__(self, neural_num, layers=3):
        super(FooNet, self).__init__()
        self.linears = nn.ModuleList([nn.Linear(neural_num, neural_num, bias=False) for i in range(layers)])

    def forward(self, x):

        print("\nbatch size in forward: {}".format(x.size()[0]))  # observe the batch size seen by each forward call
        # Note: this is the batch size after distribution, i.e. the original batch size divided by the number of GPUs.
        # With batch_size = 16, it is 16 when device_count is 1 and 8 when device_count is 2.
        for (i, linear) in enumerate(self.linears):
            x = linear(x)
            x = torch.relu(x)
        return x


if __name__ == "__main__":

    batch_size = 16

    # data
    inputs = torch.randn(batch_size, 3)
    labels = torch.randn(batch_size, 3)

    inputs, labels = inputs.to(device), labels.to(device)  # move inputs and labels to device, which prefers the GPU (see above)

    # model
    net = FooNet(neural_num=3, layers=3)
    net = nn.DataParallel(net)  # wrap the model so that each batch is split and distributed across the GPUs for parallel computation
    net.to(device)

    # training
    for epoch in range(1):

        outputs = net(inputs)

        print("model outputs.size: {}".format(outputs.size()))

    print("CUDA_VISIBLE_DEVICES :{}".format(os.environ["CUDA_VISIBLE_DEVICES"]))
    print("device_count :{}".format(torch.cuda.device_count()))
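The per-GPU batch sizes printed in forward can be approximated on the CPU. The sketch below assumes that the scatter step of DataParallel splits the batch along dimension 0 the same way `torch.chunk` does; treat it as an approximation of the behaviour, not as the library's implementation. `split_sizes` is a hypothetical helper.

```python
import torch

def split_sizes(batch_size, n_gpus):
    """Sizes of the chunks when a batch is split along dim 0 with torch.chunk."""
    batch = torch.randn(batch_size, 3)
    return [chunk.size(0) for chunk in torch.chunk(batch, n_gpus, dim=0)]

for n_gpus in (1, 2, 3):
    print(n_gpus, split_sizes(16, n_gpus))
```

When the batch size is not divisible by the number of GPUs, the last chunk is smaller, so the GPUs do not all see the same batch size.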

Loading GPU models

Error 1:

RuntimeError: Attempting to deserialize object on a CUDA device but torch.cuda.is_available() is False. If you are running on a CPU-only machine, please use torch.load with map_location=torch.device('cpu') to map your storages to the CPU.
Fix: torch.load(path_state_dict, map_location="cpu")
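A round trip through a temporary file shows the fix in context. The file name and the small `nn.Linear` model are placeholders; on a CPU-only machine the tensors are on the CPU already, so the sketch only demonstrates the call pattern.

```python
import os
import tempfile

import torch
import torch.nn as nn

net = nn.Linear(3, 3)
path_state_dict = os.path.join(tempfile.mkdtemp(), "model_state_dict.pkl")  # placeholder path
torch.save(net.state_dict(), path_state_dict)

# map_location="cpu" remaps every storage to the CPU, wherever it was saved from.
state_dict_load = torch.load(path_state_dict, map_location="cpu")

net_cpu = nn.Linear(3, 3)
net_cpu.load_state_dict(state_dict_load)
print({k: v.device for k, v in state_dict_load.items()})
```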

Error 2:

RuntimeError: Error(s) in loading state_dict for FooNet: Missing key(s) in state_dict: "linears.0.weight", "linears.1.weight", "linears.2.weight". Unexpected key(s) in state_dict: "module.linears.0.weight", "module.linears.1.weight", "module.linears.2.weight".
The cause is that training used multiple GPUs and the model was wrapped in DataParallel. The wrapper prefixes every layer name with module., so the keys no longer match at load time (compare the missing and the unexpected keys above). Fix:

from collections import OrderedDict
new_state_dict = OrderedDict()  # keys cannot be renamed in place, so build a new OrderedDict
for k, v in state_dict_load.items():
    namekey = k[7:] if k.startswith('module.') else k  # if the key starts with 'module.', drop those 7 characters
    new_state_dict[namekey] = v

Example:

# =================================== loading a model saved from multiple GPUs
# flag = 0
flag = 1
if flag:

    net = FooNet(neural_num=3, layers=3)

    path_state_dict = "./model_in_multi_gpu.pkl"  # parameters saved beforehand on a multi-GPU server
    state_dict_load = torch.load(path_state_dict, map_location="cpu")
    print("state_dict_load:\n{}".format(state_dict_load))

    # net.load_state_dict(state_dict_load)

    # remove module.
    from collections import OrderedDict  # used to strip 'module.' from the keys
    new_state_dict = OrderedDict()
    for k, v in state_dict_load.items():
        namekey = k[7:] if k.startswith('module.') else k
        new_state_dict[namekey] = v
    print("new_state_dict:\n{}".format(new_state_dict))

    net.load_state_dict(new_state_dict)

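Result: the keys printed for new_state_dict no longer carry the module. prefix, so load_state_dict succeeds.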
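An alternative that is not part of the course code is to avoid the prefix when saving: the wrapped model is available as `net.module`, and its state dict has plain keys. A small sketch with a placeholder `nn.Linear` model:

```python
import torch.nn as nn

net = nn.DataParallel(nn.Linear(3, 3))

wrapped_keys = list(net.state_dict().keys())        # keys carry the "module." prefix
plain_keys = list(net.module.state_dict().keys())   # keys of the underlying model
print(wrapped_keys)
print(plain_keys)

# A state dict taken from net.module loads into an unwrapped model directly.
plain_net = nn.Linear(3, 3)
plain_net.load_state_dict(net.module.state_dict())
```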

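Common PyTorch errors

A shared document collecting common PyTorch errors and pitfalls, open to contributions:
《PyTorch常見報錯/坑彙總》 (Common PyTorch errors and pitfalls)
This is the link provided by the instructor; so far it lists only 10 errors.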

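No. 1

Error: ValueError: num_samples should be a positive integer value, but got num_samples=0
Likely cause: len(self.data_info) == 0 in the Dataset that was passed in, i.e. the dataset given to the DataLoader contains no data
Fix:

Check the paths used by the dataset; with a wrong path no data is read.
Check why the Dataset's __len__() returns zero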

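No. 2

Error: TypeError: pic should be PIL Image or ndarray. Got <class 'torch.Tensor'>
Likely cause: the current operation expects a PIL Image or an ndarray, but a Tensor was passed in
Fix:

Check whether ToTensor() appears twice in the transform pipeline
Check how the data type changes after each operation in the transform pipeline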

No. 3

Error: RuntimeError: invalid argument 0: Sizes of tensors must match except in dimension 0. Got 93 and 89 in dimension 1 at /Users/soumith/code/builder/wheel/pytorch-src/aten/src/TH/generic/THTensorMath.cpp:3616
Likely cause: the __getitem__ function of the Dataset behind the DataLoader returns images of different shapes, so they cannot be stacked into a batch
Fix: check the operations inside __getitem__
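The failure can be reproduced with two tensors whose shapes differ in one dimension, mirroring the 93 versus 89 in the message. `distinct_shapes` is a hypothetical debugging helper for spotting the mismatch among samples; the exact error text varies between PyTorch versions.

```python
import torch

def distinct_shapes(samples):
    """Return the set of distinct shapes among a list of tensors."""
    return {tuple(s.shape) for s in samples}

samples = [torch.zeros(3, 93), torch.zeros(3, 89)]   # shapes differ in dimension 1
shapes = distinct_shapes(samples)
print(shapes)   # more than one shape means the samples cannot be stacked

try:
    torch.stack(samples)
    stacked_ok = True
except RuntimeError:
    stacked_ok = False
print(stacked_ok)
```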

No. 4

Error:
conv: RuntimeError: Given groups=1, weight of size 6 1 5 5, expected input[16, 3, 32, 32] to have 1 channels, but got 3 channels instead
linear: RuntimeError: size mismatch, m1: [16 x 576], m2: [400 x 120] at …/aten/src/TH/generic/THTensorMath.cpp:752
Likely cause: the input to a layer does not match the layer's parameters
Fix:

Check whether the layers before and after the failing one are defined correctly
Check the shape of the input data
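The conv case can be reproduced with the shapes from the message: a layer built for 1 input channel receives a 3-channel batch. The sketch then builds the layer from the input's channel dimension, which is one possible fix under the assumption that the data is correct and the layer definition is not.

```python
import torch
import torch.nn as nn

x = torch.randn(16, 3, 32, 32)   # batch of 3-channel 32x32 images

conv_wrong = nn.Conv2d(in_channels=1, out_channels=6, kernel_size=5)
try:
    conv_wrong(x)
    mismatch_raised = False
except RuntimeError:
    mismatch_raised = True

# Derive in_channels from the data so the layer and the input agree.
conv_ok = nn.Conv2d(in_channels=x.shape[1], out_channels=6, kernel_size=5)
out = conv_ok(x)
print(mismatch_raised, out.shape)
```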

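No. 5

Error: AttributeError: 'DataParallel' object has no attribute 'linear'
Likely cause: in parallel computation the model is wrapped in DataParallel, and the original model is stored under the wrapper's module attribute, so the layer has to be accessed as net.module.linear
Fix:

Prefix the layer with module.
This error has the same root cause as error 2 in the section on loading GPU models.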
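A small helper can hide the difference between a wrapped and an unwrapped model. `TinyNet` and `unwrap` are hypothetical names used only for this sketch.

```python
import torch.nn as nn

class TinyNet(nn.Module):
    """Hypothetical model with a single layer named `linear`."""
    def __init__(self):
        super().__init__()
        self.linear = nn.Linear(3, 3)

    def forward(self, x):
        return self.linear(x)

def unwrap(model):
    """Return the underlying model if it is wrapped in DataParallel."""
    return model.module if isinstance(model, nn.DataParallel) else model

net = nn.DataParallel(TinyNet())
print(hasattr(net, "linear"))   # the wrapper itself has no `linear` attribute
print(unwrap(net).linear)       # reachable through net.module
```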

No. 6

Error:
RuntimeError: Attempting to deserialize object on a CUDA device but torch.cuda.is_available() is False. If you are running on a CPU-only machine, please use torch.load with map_location=torch.device('cpu') to map your storages to the CPU.
Likely cause: a model trained and saved on a GPU cannot be loaded directly on a machine without a GPU
Fix:

Set map_location="cpu"

No. 7

Error:
AttributeError: Can't get attribute 'FooNet2' on <module '__main__' from '
Likely cause: the class of the saved network is not defined in the current Python script
Fix:

Define the class before loading

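No. 8

Error:
RuntimeError: Assertion `cur_target >= 0 && cur_target < n_classes' failed. at …/aten/src/THNN/generic/ClassNLLCriterion.c:94
Likely cause:

A label value is greater than or equal to the number of classes, which violates cur_target < n_classes; this usually happens because the labels start at 1 instead of 0
Fix:
Change the labels so that they start at 0; for a 10-class problem, for example, the labels should range from 0 to 9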
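A sketch of the label shift with made-up labels and random logits standing in for model outputs. It assumes the labels really are 1-based; if they are already 0-based, subtracting 1 would introduce -1 and fail the range check.

```python
import torch
import torch.nn as nn

n_classes = 10
labels_from_one = torch.tensor([1, 5, 10])   # made-up labels that start at 1
labels = labels_from_one - 1                 # shift into the range 0..9

# Range check that mirrors the failed assertion in the error message.
assert labels.min() >= 0 and labels.max() < n_classes

logits = torch.randn(len(labels), n_classes)  # stand-in for model outputs
loss = nn.CrossEntropyLoss()(logits, labels)
print(labels.tolist(), loss.item())
```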

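No. 9

Error:
RuntimeError: expected device cuda:0 and dtype Long but got device cpu and dtype Long
Expected object of backend CPU but got backend CUDA for argument #2 'weight'
Likely cause: the two operands of the computation are not on the same device
Fix: use the to function to move the data onto the same device; for a tensor, remember to assign the result back with =.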

No. 10

Error:
RuntimeError: DataLoader worker (pid 27) is killed by signal: Killed. Details are lost due to multiprocessing. Rerunning with num_workers=0 may give better error trace.
Likely cause: not enough memory (host RAM, not GPU memory)
Fix: request more memory

Summary of the PyTorch framework training camp course

https://github.com/TingsongYu/PyTorch_Tutorial

17. Using the GPU; common PyTorch errors; summary