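Normalization is generally a technique for removing differences in scale between features. In traditional ML, data is normalized before it is fed to a model, usually over the whole dataset; batch normalization in DL works a little differently. The residual network (ResNet), the winning model of the 2015 ImageNet competition, uses skip connections to ease gradient flow and the degradation problem in very deep networks.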
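Batch normalization
- Goal
Use the mean and standard deviation of each mini-batch to keep rescaling the intermediate outputs of the network, so that the values produced at every layer stay numerically stable.
- How to normalize
In both fully connected and convolutional layers, batch normalization is applied after the linear combination (the affine transform or the convolution) and before the activation function.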
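Fully connected layer
Batch normalization standardizes each feature over the mini-batch with the z-score method, adding a small positive constant ε (eps in the code) to the variance so that the denominator is never zero. Learnable parameters γ (gamma) and β (beta) then scale and shift the standardized value.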
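The formula itself did not survive in these notes. A reconstruction consistent with the batch_norm code in this section is sketched here. The mini-batch symbol B, its size m and the sample index i are notational assumptions; ε, γ and β correspond to eps, gamma and beta in the code.

```latex
\begin{aligned}
\mu_{\mathcal{B}} &= \frac{1}{m}\sum_{i=1}^{m} x^{(i)}, &
\sigma_{\mathcal{B}}^{2} &= \frac{1}{m}\sum_{i=1}^{m}\left(x^{(i)} - \mu_{\mathcal{B}}\right)^{2},\\
\hat{x}^{(i)} &= \frac{x^{(i)} - \mu_{\mathcal{B}}}{\sqrt{\sigma_{\mathcal{B}}^{2} + \epsilon}}, &
y^{(i)} &= \gamma \odot \hat{x}^{(i)} + \beta .
\end{aligned}
```

All operations are element-wise per feature, so γ and β have one entry per feature.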
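Convolutional layer
If the convolution outputs multiple channels, batch normalization is applied to each channel separately (its statistics are taken over the batch and spatial dimensions), and each channel has its own scale and shift parameters.
- How normalization works at prediction time
Training: work batch by batch, computing the mean and variance of each batch.
Prediction: use moving averages to estimate the mean and variance of the whole training set.
- Code implementation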
import torch

def batch_norm(is_training, X, gamma, beta, moving_mean, moving_var, eps, momentum):
    # Check whether we are in training mode or prediction mode
    if not is_training:
        # In prediction mode, use the moving-average mean and variance passed in
        X_hat = (X - moving_mean) / torch.sqrt(moving_var + eps)
    else:
        assert len(X.shape) in (2, 4)
        if len(X.shape) == 2:
            # Fully connected layer: mean and variance of each feature over the batch
            mean = X.mean(dim=0)
            var = ((X - mean) ** 2).mean(dim=0)
        else:
            # 2-D convolutional layer: mean and variance per channel (axis=1),
            # taken over the batch and spatial axes. keepdim=True keeps the
            # number of dimensions of X so that broadcasting works below
            mean = X.mean(dim=(0, 2, 3), keepdim=True)
            var = ((X - mean) ** 2).mean(dim=(0, 2, 3), keepdim=True)
        # In training mode, standardize with the current batch mean and variance
        X_hat = (X - mean) / torch.sqrt(var + eps)
        # Update the moving averages of the mean and variance
        moving_mean = momentum * moving_mean + (1.0 - momentum) * mean
        moving_var = momentum * moving_var + (1.0 - momentum) * var
    Y = gamma * X_hat + beta  # scale and shift
    return Y, moving_mean, moving_var
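As a cross-check of the training and prediction behaviour described above, the following sketch compares hand-computed per-channel statistics with PyTorch's built-in nn.BatchNorm2d. The input shape and the shift and scale of the random data are arbitrary choices for illustration. Note one difference in convention: nn.BatchNorm2d weights the new batch by momentum (default 0.1), whereas the function above weights the old moving average by momentum.

```python
import torch
from torch import nn

torch.manual_seed(0)
# (batch, channels, height, width); shifted and scaled so the batch
# statistics are clearly different from the initial running statistics
X = torch.randn(8, 3, 4, 4) * 3 + 5

bn = nn.BatchNorm2d(3)  # gamma starts at 1 and beta at 0
bn.train()
Y_train = bn(X)

# Per-channel statistics over the batch and spatial axes
mean = X.mean(dim=(0, 2, 3), keepdim=True)
var = ((X - mean) ** 2).mean(dim=(0, 2, 3), keepdim=True)
Y_manual = (X - mean) / torch.sqrt(var + bn.eps)
matches_manual = torch.allclose(Y_train, Y_manual, atol=1e-4)

# running_mean starts at 0; after one training step it is
# (1 - momentum) * 0 + momentum * batch_mean
expected_running_mean = bn.momentum * mean.flatten()
running_mean_ok = torch.allclose(bn.running_mean, expected_running_mean, atol=1e-5)

# In eval mode the layer normalizes with the running statistics instead
bn.eval()
Y_eval = bn(X)
uses_running_stats = not torch.allclose(Y_eval, Y_train, atol=1e-3)

print(matches_manual, running_mean_ok, uses_running_stats)
```

After a single training step the running statistics are still far from the batch statistics, which is why the eval-mode output differs from the training-mode output here.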
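Residual networks (ResNet)
- Problems with deep CNNs
Making a CNN deeper is one of the most effective ways to give it more nonlinear fitting capacity. A deeper network, however, first runs into numerical instability (vanishing or exploding gradients). It also takes longer to train, and its error rate can even go up instead of down (the degradation problem). The former can be handled with normalization; the latter is what the residual network is designed to address.
- Residual block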
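The residual block targets the identity-mapping problem: experiments suggest that a stack of nonlinear layers does not fit an identity mapping well. Identity mappings matter when comparing shallow and deep networks. In principle, a deep network can be split into a shallow network followed by layers that perform an identity mapping, so it should do at least as well as the shallow one. In practice, degradation is observed instead.
Identity mapping:
Left: fit f(x) = x directly
Right: fit the residual f(x) - x = 0 (which makes small fluctuations around the identity easier to capture)
"Easier to capture small fluctuations" can be understood as follows:
Left: the first fit gives f(1) = 1.1 and the second gives f(1) = 1.2, so the change is 0.1 against an output of about 1, i.e. 0.1/1.
Right: the first fit gives f(1) - 1 = 0.1 and the second gives f(1) - 1 = 0.2, so the change is 0.1 against a residual of 0.1, i.e. 0.1/0.1.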
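The identity-mapping argument can be made concrete with a toy sketch. TinyResidual and zero_linear are hypothetical helpers introduced only for this illustration (they are not the Residual class of these notes). When the learned part outputs zero, a stack of residual blocks is exactly the identity, while a plain stack with the same weights discards its input.

```python
import torch
from torch import nn

def zero_linear(dim):
    """A linear layer whose weights and bias are all zero, so it maps every x to 0."""
    layer = nn.Linear(dim, dim)
    nn.init.zeros_(layer.weight)
    nn.init.zeros_(layer.bias)
    return layer

class TinyResidual(nn.Module):
    """Hypothetical toy block computing x + g(x)."""
    def __init__(self, dim):
        super().__init__()
        self.g = zero_linear(dim)

    def forward(self, x):
        return x + self.g(x)  # shortcut plus learned residual

x = torch.randn(4, 8)
residual_stack = nn.Sequential(*[TinyResidual(8) for _ in range(10)])
plain_stack = nn.Sequential(*[zero_linear(8) for _ in range(10)])

with torch.no_grad():
    residual_is_identity = torch.equal(residual_stack(x), x)
    plain_is_zero = torch.equal(plain_stack(x), torch.zeros_like(x))

print(residual_is_identity, plain_is_zero)
```

In other words, "do nothing" corresponds to the easy target g(x) = 0 for a residual block, whereas plain layers would have to learn weights that reproduce x.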
- Code implementation
import torch
from torch import nn
import torch.nn.functional as F

class Residual(nn.Module):  # this class is also saved in the d2lzh_pytorch package for later use
    # Configurable: the number of output channels, whether to use an extra 1x1
    # convolution to change the channel count, and the stride of the convolutions.
    def __init__(self, in_channels, out_channels, use_1x1conv=False, stride=1):
        super().__init__()
        self.conv1 = nn.Conv2d(in_channels, out_channels, kernel_size=3, padding=1, stride=stride)
        self.conv2 = nn.Conv2d(out_channels, out_channels, kernel_size=3, padding=1)
        if use_1x1conv:
            self.conv3 = nn.Conv2d(in_channels, out_channels, kernel_size=1, stride=stride)
        else:
            self.conv3 = None
        self.bn1 = nn.BatchNorm2d(out_channels)
        self.bn2 = nn.BatchNorm2d(out_channels)

    def forward(self, X):
        Y = F.relu(self.bn1(self.conv1(X)))
        Y = self.bn2(self.conv2(Y))
        if self.conv3 is not None:
            X = self.conv3(X)  # match the shape of Y before adding
        return F.relu(Y + X)

def resnet_block(in_channels, out_channels, num_residuals, first_block=False):
    if first_block:
        assert in_channels == out_channels  # the first module keeps the channel count of its input
    blk = []
    for i in range(num_residuals):
        if i == 0 and not first_block:
            blk.append(Residual(in_channels, out_channels, use_1x1conv=True, stride=2))
        else:
            blk.append(Residual(out_channels, out_channels))
    return nn.Sequential(*blk)

# `net` is not defined in the original snippet. This stem is an assumption,
# copied from the DenseNet stem later in these notes.
net = nn.Sequential(
    nn.Conv2d(1, 64, kernel_size=7, stride=2, padding=3),
    nn.BatchNorm2d(64),
    nn.ReLU(),
    nn.MaxPool2d(kernel_size=3, stride=2, padding=1))
net.add_module("resnet_block1", resnet_block(64, 64, 2, first_block=True))
net.add_module("resnet_block2", resnet_block(64, 128, 2))
net.add_module("resnet_block3", resnet_block(128, 256, 2))
net.add_module("resnet_block4", resnet_block(256, 512, 2))
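The spatial sizes produced by these four stages can be checked by hand with the standard output-size formula floor((n + 2p - k) / s) + 1. The sketch below assumes a 224x224 input and a stem made of a 7x7 stride-2 convolution followed by a 3x3 stride-2 max pooling; neither is stated in this part of the notes, but both are consistent with the shapes recorded in the output listing of this section.

```python
def conv_out(size, kernel, stride=1, padding=0):
    """Output height/width of a convolution or pooling layer."""
    return (size + 2 * padding - kernel) // stride + 1

size = 224  # assumed input height and width
size = conv_out(size, kernel=7, stride=2, padding=3)  # stem convolution
size = conv_out(size, kernel=3, stride=2, padding=1)  # stem max pooling

# (output channels, stride of the first residual block in the stage)
stages = [(64, 1), (128, 2), (256, 2), (512, 2)]
shapes = []
for channels, stride in stages:
    # Only the first 3x3 convolution of a stage may change the size; every
    # other convolution uses kernel 3, padding 1, stride 1 and preserves it.
    size = conv_out(size, kernel=3, stride=stride, padding=1)
    shapes.append((channels, size, size))

print(shapes)  # [(64, 56, 56), (128, 28, 28), (256, 14, 14), (512, 7, 7)]
```

Each stage after the first halves the height and width while doubling the channels, which is what the stride-2 first block with its 1x1 shortcut convolution is for.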
"""
resnet_block1 output shape: torch.Size([1, 64, 56, 56])
resnet_block2 output shape: torch.Size([1, 128, 28, 28])
resnet_block3 output shape: torch.Size([1, 256, 14, 14])
resnet_block4 output shape: torch.Size([1, 512, 7, 7])
"""
DenseNet
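DenseNet relies on concatenation (concat). As I understand it, the idea is to keep stacking features (the original features plus the extracted ones) and feed them all to the classifier. It is crude, but it generally works well; in fact, feature concatenation is a core trick in many competitions.
Main building blocks
Dense block: defines how inputs and outputs are concatenated.
Transition layer: controls the number of channels so that it does not grow too large.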
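The difference between the two ways of combining a block's input with its output can be seen on tensor shapes alone. In this minimal sketch Y is just a random stand-in for the output of a block's convolutions, and the shapes are arbitrary.

```python
import torch

X = torch.rand(4, 3, 8, 8)  # block input: (batch, channels, height, width)
Y = torch.rand(4, 3, 8, 8)  # stand-in for the output of the block's conv layers

residual_out = X + Y                  # ResNet style: element-wise addition
dense_out = torch.cat((X, Y), dim=1)  # DenseNet style: concatenation on the channel axis

print(residual_out.shape)  # torch.Size([4, 3, 8, 8])
print(dense_out.shape)     # torch.Size([4, 6, 8, 8])

# Concatenation keeps the input features untouched in the first channels
input_preserved = torch.equal(dense_out[:, :3], X)
```

Addition requires matching channel counts and keeps them fixed, whereas concatenation lets the channel count grow with every layer, which is why a transition layer is needed to bring it back down.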
Code implementation
dense block
def conv_block(in_channels, out_channels):
    blk = nn.Sequential(nn.BatchNorm2d(in_channels),
                        nn.ReLU(),
                        nn.Conv2d(in_channels, out_channels, kernel_size=3, padding=1))
    return blk

class DenseBlock(nn.Module):
    def __init__(self, num_convs, in_channels, out_channels):
        super().__init__()
        net = []
        for i in range(num_convs):
            # Each conv block sees the original input plus all earlier outputs
            in_c = in_channels + i * out_channels
            net.append(conv_block(in_c, out_channels))
        self.net = nn.ModuleList(net)
        self.out_channels = in_channels + num_convs * out_channels  # number of output channels

    def forward(self, X):
        for blk in self.net:
            Y = blk(X)
            X = torch.cat((X, Y), dim=1)  # concatenate input and output along the channel dimension
        return X

blk = DenseBlock(2, 3, 10)
X = torch.rand(4, 3, 8, 8)
Y = blk(X)
Y.shape  # torch.Size([4, 23, 8, 8])
transition block
1×1 convolutional layer: reduces the number of channels
Average pooling layer with stride 2: halves the height and width
def transition_block(in_channels, out_channels):
    blk = nn.Sequential(
        nn.BatchNorm2d(in_channels),
        nn.ReLU(),
        nn.Conv2d(in_channels, out_channels, kernel_size=1),
        nn.AvgPool2d(kernel_size=2, stride=2))
    return blk

blk = transition_block(23, 10)
blk(Y).shape  # torch.Size([4, 10, 4, 4])
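Why a 1×1 convolution can reduce the channel count without touching the spatial layout: it applies the same linear map across channels at every pixel. The following sketch checks this against an explicit einsum; the sizes 23 and 10 mirror the example above, everything else is illustrative.

```python
import torch
from torch import nn

torch.manual_seed(0)
conv = nn.Conv2d(23, 10, kernel_size=1)
X = torch.rand(4, 23, 8, 8)

# The 1x1 kernel has shape (10, 23, 1, 1); viewed as a (10, 23) matrix it is
# an ordinary linear map over the channel axis, applied at every (h, w)
W = conv.weight.view(10, 23)
manual = torch.einsum('oc,nchw->nohw', W, X) + conv.bias.view(1, -1, 1, 1)
same_as_linear_map = torch.allclose(conv(X), manual, atol=1e-5)

# The stride-2 average pooling then halves the height and width
pooled = nn.AvgPool2d(kernel_size=2, stride=2)(conv(X))
print(same_as_linear_map, pooled.shape)
```

The convolution alone leaves the 8x8 grid unchanged; only the pooling step changes the resolution.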
# DenseNet
import d2lzh_pytorch as d2l  # helper package that accompanies these notes

net = nn.Sequential(
    nn.Conv2d(1, 64, kernel_size=7, stride=2, padding=3),
    nn.BatchNorm2d(64),
    nn.ReLU(),
    nn.MaxPool2d(kernel_size=3, stride=2, padding=1))

num_channels, growth_rate = 64, 32  # num_channels is the current number of channels
num_convs_in_dense_blocks = [4, 4, 4, 4]
for i, num_convs in enumerate(num_convs_in_dense_blocks):
    DB = DenseBlock(num_convs, num_channels, growth_rate)
    net.add_module("DenseBlosk_%d" % i, DB)  # "DenseBlosk" (sic), as in the recorded output
    # Number of output channels of the dense block just added
    num_channels = DB.out_channels
    # Between dense blocks, add a transition layer that halves the channel count
    if i != len(num_convs_in_dense_blocks) - 1:
        net.add_module("transition_block_%d" % i, transition_block(num_channels, num_channels // 2))
        num_channels = num_channels // 2

net.add_module("BN", nn.BatchNorm2d(num_channels))
net.add_module("relu", nn.ReLU())
net.add_module("global_avg_pool", d2l.GlobalAvgPool2d())  # output of GlobalAvgPool2d: (Batch, num_channels, 1, 1)
net.add_module("fc", nn.Sequential(d2l.FlattenLayer(), nn.Linear(num_channels, 10)))

# Print the output shape of every top-level layer for a 96x96 single-channel input
X = torch.rand((1, 1, 96, 96))
for name, layer in net.named_children():
    X = layer(X)
    print(name, ' output shape:\t', X.shape)
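# Test run: train on Fashion-MNIST
# batch_size = 256
batch_size = 16
# If an "out of memory" error appears, reduce batch_size or resize
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')  # not defined in the original snippet
train_iter, test_iter = d2l.load_data_fashion_mnist(batch_size, resize=96)
lr, num_epochs = 0.001, 5
optimizer = torch.optim.Adam(net.parameters(), lr=lr)
d2l.train_ch5(net, train_iter, test_iter, batch_size, optimizer, device, num_epochs)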
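The channel bookkeeping done while the network is assembled can be traced with plain arithmetic, without building any layers. This sketch reuses the values num_channels = 64, growth_rate = 32 and four convolutions per dense block from the code above; the list name channel_trace is introduced here for illustration.

```python
num_channels, growth_rate = 64, 32
num_convs_in_dense_blocks = [4, 4, 4, 4]

channel_trace = []
for i, num_convs in enumerate(num_convs_in_dense_blocks):
    # A dense block adds growth_rate channels per convolution
    num_channels += num_convs * growth_rate
    channel_trace.append(num_channels)
    # A transition layer halves the channels between dense blocks
    if i != len(num_convs_in_dense_blocks) - 1:
        num_channels //= 2
        channel_trace.append(num_channels)

print(channel_trace)  # [192, 96, 224, 112, 240, 120, 248]
```

These values agree with the channel counts in the recorded shape listing of this section, ending at the 248 channels that feed the final linear layer.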
"""
0 output shape: torch.Size([1, 64, 48, 48])
1 output shape: torch.Size([1, 64, 48, 48])
2 output shape: torch.Size([1, 64, 48, 48])
3 output shape: torch.Size([1, 64, 24, 24])
DenseBlosk_0 output shape: torch.Size([1, 192, 24, 24])
transition_block_0 output shape: torch.Size([1, 96, 12, 12])
DenseBlosk_1 output shape: torch.Size([1, 224, 12, 12])
transition_block_1 output shape: torch.Size([1, 112, 6, 6])
DenseBlosk_2 output shape: torch.Size([1, 240, 6, 6])
transition_block_2 output shape: torch.Size([1, 120, 3, 3])
DenseBlosk_3 output shape: torch.Size([1, 248, 3, 3])
BN output shape: torch.Size([1, 248, 3, 3])
relu output shape: torch.Size([1, 248, 3, 3])
global_avg_pool output shape: torch.Size([1, 248, 1, 1])
fc output shape: torch.Size([1, 10])
"""
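Some remarks
A few questions:
- Why does batch normalization introduce learnable parameters, namely the scale parameter and the shift parameter?
- There are many normalization methods, and the z-score method is used here. Could other methods be used instead?
- What problem does the residual network solve? How do ResNet and DenseNet differ, and how is each implemented?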
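On the first question, one commonly cited reason is that the scale and shift parameters let the network undo the standardization when that is useful: with gamma equal to the batch standard deviation and beta equal to the batch mean, the layer reproduces its input. A minimal sketch of that special case, with an arbitrary input and eps value:

```python
import torch

torch.manual_seed(0)
X = torch.randn(16, 5) * 3 + 2  # (batch, features), arbitrary scale and offset
eps = 1e-5

# Standardize each feature over the batch, as in the fully connected case
mean = X.mean(dim=0)
var = ((X - mean) ** 2).mean(dim=0)
X_hat = (X - mean) / torch.sqrt(var + eps)

# Choosing gamma and beta this way cancels the standardization
gamma = torch.sqrt(var + eps)
beta = mean
Y = gamma * X_hat + beta

recovers_input = torch.allclose(Y, X, atol=1e-5)
print(recovers_input)
```

So the normalized layer can represent everything the unnormalized layer could, and training decides how much rescaling to keep.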