Notes on weight initialization for GPU training in PyTorch

Preface

How the weights are initialized determines whether a model can converge quickly, which in turn determines how much training time it needs. Below, the weight initialization of two convolutional layers and one fully-connected layer is used as an example; both versions are run for only one epoch, as a controlled comparison. Note that when training on the GPU, the manually created weight tensors must be set to track gradients (requires_grad), otherwise no gradients are returned for them.
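The point above can be sketched in a minimal example (my own illustration, not code from this post): a leaf tensor only accumulates gradients during backward if requires_grad is enabled.

```python
import torch

# A leaf tensor created on the GPU (or CPU) only accumulates gradients
# if requires_grad=True is set; otherwise w.grad stays None after backward.
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

w = torch.randn(4, 4, device=device, requires_grad=True)
loss = (w ** 2).sum()
loss.backward()

print(w.grad is not None)  # True: the gradient was populated
```

Since the loss is the sum of squares, the populated gradient is simply 2 * w.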

Results without weight normalization

Code

import torch

USE_GPU = True
dtype = torch.float32 # we will be using float throughout this tutorial
if USE_GPU and torch.cuda.is_available():
    device = torch.device('cuda')
else:
    device = torch.device('cpu')
# ------------------------- weights
conv_w1 = torch.randn((32, 3, 5, 5), device=device, dtype=dtype)  # [out_channel, in_channel, kernel_H, kernel_W]
conv_w1.requires_grad = True
conv_b1 = torch.zeros((32,), device=device, dtype=dtype, requires_grad=True)  # out_channel

conv_w2 = torch.randn((16, 32, 3, 3), device=device, dtype=dtype)  # [out_channel, in_channel, kernel_H, kernel_W]
conv_w2.requires_grad = True
conv_b2 = torch.zeros((16,), device=device, dtype=dtype, requires_grad=True)  # out_channel

# you must calculate the shape of the tensor after two conv layers, before the fully-connected layer
fc_w = torch.randn((16 * 32 * 32, 10), device=device, dtype=dtype)
fc_w.requires_grad = True
fc_b = torch.zeros(10, device=device, dtype=dtype, requires_grad=True)
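The comment about computing the shape before the fully-connected layer can be checked with a quick forward pass. The following is a sketch of the presumed pipeline (the input size, padding values, and use of F.conv2d are my assumptions; with 32x32 inputs and padding that preserves the spatial size, the flattened feature is indeed 16 * 32 * 32):

```python
import torch
import torch.nn.functional as F

x = torch.randn(8, 3, 32, 32)          # e.g. a CIFAR-10-sized batch (assumed)
conv_w1 = torch.randn(32, 3, 5, 5)
conv_b1 = torch.zeros(32)
conv_w2 = torch.randn(16, 32, 3, 3)
conv_b2 = torch.zeros(16)
fc_w = torch.randn(16 * 32 * 32, 10)
fc_b = torch.zeros(10)

h1 = F.relu(F.conv2d(x, conv_w1, conv_b1, padding=2))  # 5x5 kernel -> pad 2 keeps 32x32
h2 = F.relu(F.conv2d(h1, conv_w2, conv_b2, padding=1)) # 3x3 kernel -> pad 1 keeps 32x32
scores = h2.flatten(1) @ fc_w + fc_b                   # (8, 16*32*32) @ (16*32*32, 10)
print(scores.shape)  # torch.Size([8, 10])
```

If the padding or input size differs, the first dimension of fc_w must be recomputed accordingly.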

Results

(figure: training result without weight normalization)

After normalizing the weights

Code

import torch
import numpy as np

USE_GPU = True
dtype = torch.float32 # we will be using float throughout this tutorial
if USE_GPU and torch.cuda.is_available():
    device = torch.device('cuda')
else:
    device = torch.device('cpu')

# ------------------------- weights, scaled by sqrt(2 / fan_in)
conv_w1 = torch.randn((32, 3, 5, 5), device=device, dtype=dtype) * np.sqrt(2. / (3 * 5 * 5))  # [out_channel, in_channel, kernel_H, kernel_W]
conv_w1.requires_grad = True
conv_b1 = torch.zeros((32,), device=device, dtype=dtype, requires_grad=True)  # out_channel

conv_w2 = torch.randn((16, 32, 3, 3), device=device, dtype=dtype) * np.sqrt(2. / (32 * 3 * 3))  # fan_in = in_channel * kernel_H * kernel_W
conv_w2.requires_grad = True
conv_b2 = torch.zeros((16,), device=device, dtype=dtype, requires_grad=True)  # out_channel

# you must calculate the shape of the tensor after two conv layers, before the fully-connected layer
fc_w = torch.randn((16 * 32 * 32, 10), device=device, dtype=dtype) * np.sqrt(2. / (16 * 32 * 32))
fc_w.requires_grad = True
fc_b = torch.zeros(10, device=device, dtype=dtype, requires_grad=True)
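Why does the sqrt(2 / fan_in) factor help? A small sketch (my own illustration, with shapes mirroring the fc layer above) compares the standard deviation of a ReLU layer's output with raw vs. scaled random weights: without scaling, activations blow up in proportion to sqrt(fan_in), while the scaled weights keep them on the order of 1, which is what lets gradient descent make progress immediately.

```python
import torch
import numpy as np

torch.manual_seed(0)
fan_in = 16 * 32 * 32
x = torch.randn(64, fan_in)

w_raw = torch.randn(fan_in, 10)                        # unscaled weights
w_he = torch.randn(fan_in, 10) * np.sqrt(2. / fan_in)  # scaled weights

out_raw = torch.relu(x @ w_raw)
out_he = torch.relu(x @ w_he)
print(out_raw.std().item())  # grows with sqrt(fan_in): far larger than 1
print(out_he.std().item())   # stays on the order of 1
```

Stacking more layers compounds the effect, so deeper networks are even more sensitive to this scaling.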

Results

(figure: training result after weight normalization)

Conclusion

The comparison shows that after normalizing the weights by sqrt(2 / fan_in), the model converges much faster within the same single epoch.
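The sqrt(2 / fan_in) scaling used above is He (Kaiming) initialization, which PyTorch also provides as a built-in. A sketch of the equivalent built-in call (my suggestion, not part of the original experiment):

```python
import torch
import torch.nn as nn

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

# kaiming_normal_ with mode='fan_in' (the default) and nonlinearity='relu'
# draws from N(0, 2 / fan_in), matching the manual sqrt(2 / (3*5*5)) scaling.
conv_w1 = torch.empty(32, 3, 5, 5, device=device)
nn.init.kaiming_normal_(conv_w1, nonlinearity='relu')
conv_w1.requires_grad_()  # enable gradient tracking after the in-place init
print(conv_w1.std().item())  # ≈ sqrt(2 / 75) ≈ 0.163
```

Initializing in-place first and calling requires_grad_() afterwards avoids recording the initialization itself in the autograd graph.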
