Dive into Deep Learning: Batch Normalization and Residual Networks

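Normalization is generally used to remove differences in scale between features. In traditional ML it is applied before data is fed into a model, usually over the entire dataset. Batch normalization in DL is a little different: it works on mini-batches and on the intermediate outputs inside the network. The residual network (ResNet), the winning model of the 2015 ImageNet competition, uses skip connections to make very deep networks trainable and to ease the vanishing-gradient problem.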

Batch Normalization

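  1. Goal
    Use the mean and standard deviation of each mini-batch to continually adjust the intermediate outputs of the network, so that the values at every layer stay numerically stable.
  2. How to normalize
    For both fully connected and convolutional layers, batch normalization is applied after the linear combination and before the activation function.
    Fully connected layer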
    $$\begin{aligned} \boldsymbol{x} &= \boldsymbol{u} \boldsymbol{W}+\boldsymbol{b} \\ \text{output} &= \phi(\boldsymbol{x}) \end{aligned}$$
    Batch normalization
    $$\begin{aligned} \text{output} &= \phi(\mathrm{BN}(\boldsymbol{x})) \\ \boldsymbol{y}^{(i)} &= \mathrm{BN}\left(\boldsymbol{x}^{(i)}\right) \\ \boldsymbol{\mu}_{\mathcal{B}} &\leftarrow \frac{1}{m} \sum_{i=1}^{m} \boldsymbol{x}^{(i)} \\ \boldsymbol{\sigma}_{\mathcal{B}}^{2} &\leftarrow \frac{1}{m} \sum_{i=1}^{m}\left(\boldsymbol{x}^{(i)}-\boldsymbol{\mu}_{\mathcal{B}}\right)^{2} \\ \hat{\boldsymbol{x}}^{(i)} &\leftarrow \frac{\boldsymbol{x}^{(i)}-\boldsymbol{\mu}_{\mathcal{B}}}{\sqrt{\boldsymbol{\sigma}_{\mathcal{B}}^{2}+\epsilon}} \\ \boldsymbol{y}^{(i)} &\leftarrow \boldsymbol{\gamma} \odot \hat{\boldsymbol{x}}^{(i)}+\boldsymbol{\beta} \end{aligned}$$
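    This is Z-score standardization, with a small positive constant $\epsilon$ added so that the denominator is never 0. The learnable parameters $\boldsymbol{\gamma}$ and $\boldsymbol{\beta}$ then scale and shift the standardized values.
    Convolutional layer
    If the convolution outputs multiple channels, each channel is batch-normalized separately, and each channel has its own scale and shift parameters. The statistics for a channel are computed over all samples in the batch and all spatial positions of that channel.
  3. Normalization at prediction time
    Training: work per batch, computing the mean and variance of each batch.
    Prediction: use moving averages to estimate the mean and variance of the whole training set, and normalize with those estimates.
  4. Implementation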
import torch

def batch_norm(is_training, X, gamma, beta, moving_mean, moving_var, eps, momentum):
    # Check whether we are in training mode or prediction mode
    if not is_training:
        # Prediction mode: use the moving-average mean and variance passed in
        X_hat = (X - moving_mean) / torch.sqrt(moving_var + eps)
    else:
        assert len(X.shape) in (2, 4)
        if len(X.shape) == 2:
            # Fully connected layer: statistics over the batch dimension, one pair per feature
            mean = X.mean(dim=0)
            var = ((X - mean) ** 2).mean(dim=0)
        else:
            # 2D convolutional layer: statistics per channel (dim=1), taken over batch,
            # height, and width. keepdim=True keeps the shape (1, C, 1, 1) so that
            # broadcasting works in the lines below
            mean = X.mean(dim=(0, 2, 3), keepdim=True)
            var = ((X - mean) ** 2).mean(dim=(0, 2, 3), keepdim=True)
        # Training mode: standardize with the current batch mean and variance
        X_hat = (X - mean) / torch.sqrt(var + eps)
        # Update the moving averages of the mean and variance
        moving_mean = momentum * moving_mean + (1.0 - momentum) * mean
        moving_var = momentum * moving_var + (1.0 - momentum) * var
    Y = gamma * X_hat + beta  # scale and shift
    return Y, moving_mean, moving_var
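
As a sanity check on the formulas above, the following sketch compares a hand-written standardization with PyTorch's built-in `nn.BatchNorm1d` and `nn.BatchNorm2d` layers in training mode. The tensor sizes and the seed are arbitrary choices for illustration. The built-in layers start with scale 1 and shift 0, so only the standardization step is being compared here.

```python
import torch
from torch import nn

torch.manual_seed(0)
eps = 1e-5

# Fully connected case: statistics over the batch dimension, one pair per feature.
X = torch.randn(8, 4)
mean = X.mean(dim=0)
var = ((X - mean) ** 2).mean(dim=0)  # biased variance, as in the formula
manual_fc = (X - mean) / torch.sqrt(var + eps)

bn_fc = nn.BatchNorm1d(4, eps=eps)  # gamma = 1 and beta = 0 at initialization
bn_fc.train()
builtin_fc = bn_fc(X)

# Convolutional case: statistics over batch, height, and width, one pair per channel.
X4 = torch.randn(8, 3, 5, 5)
mean4 = X4.mean(dim=(0, 2, 3), keepdim=True)
var4 = ((X4 - mean4) ** 2).mean(dim=(0, 2, 3), keepdim=True)
manual_conv = (X4 - mean4) / torch.sqrt(var4 + eps)

bn_conv = nn.BatchNorm2d(3, eps=eps)
bn_conv.train()
builtin_conv = bn_conv(X4)

fc_matches = torch.allclose(manual_fc, builtin_fc, atol=1e-4)
conv_matches = torch.allclose(manual_conv, builtin_conv, atol=1e-4)
print(fc_matches, conv_matches)
```

Note that the built-in layers update their running statistics with a different momentum convention from the `batch_norm` function above, so the moving averages are not compared in this sketch.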

Residual Networks (ResNet)

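  1. Problems with deep CNNs
    Making a CNN deeper is one of the most effective ways to give it more nonlinear fitting capacity. But depth brings problems. The first is numerical instability (vanishing or exploding gradients). The second is that training takes a long time and the error rate may even rise instead of fall (the degradation problem). The former can be mitigated with normalization, while the latter is what residual networks are designed to address.
  2. Residual block
    The residual block targets the identity-mapping problem, since experiments suggest that neural networks do not fit an identity mapping well. Identity mapping matters when comparing shallow and deep networks. In principle, a deep network can be split into a shallow network followed by layers that perform an identity mapping, so the deep network should do at least as well as the shallow one. In practice, however, degradation appears.
    Identity mapping:
    Left, fitting the mapping directly: f(x) = x
    Right, fitting the residual: f(x) - x = 0 (small deviations from the identity are easier to capture)
    "Easier to capture small deviations" can be understood as follows:
    Left: the first fit gives f(x=1) = 1.1 and the second gives f(x=1) = 1.2, so the relative change is roughly 0.1/1.
    Right: the first fit gives f(x=1) - 1 = 0.1 and the second gives f(x=1) - 1 = 0.2, so the relative change is 0.1/0.1.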
  3. Implementation
import torch
from torch import nn
import torch.nn.functional as F

class Residual(nn.Module):  # This class is saved in the d2lzh_pytorch package for later use
    # Configurable: number of output channels, whether to use an extra 1x1 convolution
    # to change the number of channels, and the stride of the convolution.
    def __init__(self, in_channels, out_channels, use_1x1conv=False, stride=1):
        super(Residual, self).__init__()
        self.conv1 = nn.Conv2d(in_channels, out_channels, kernel_size=3, padding=1, stride=stride)
        self.conv2 = nn.Conv2d(out_channels, out_channels, kernel_size=3, padding=1)
        if use_1x1conv:
            self.conv3 = nn.Conv2d(in_channels, out_channels, kernel_size=1, stride=stride)
        else:
            self.conv3 = None
        self.bn1 = nn.BatchNorm2d(out_channels)
        self.bn2 = nn.BatchNorm2d(out_channels)

    def forward(self, X):
        Y = F.relu(self.bn1(self.conv1(X)))
        Y = self.bn2(self.conv2(Y))
        if self.conv3 is not None:
            # Match the shape of X to Y before the addition
            X = self.conv3(X)
        return F.relu(Y + X)

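def resnet_block(in_channels, out_channels, num_residuals, first_block=False):
    if first_block:
        assert in_channels == out_channels  # the first stage keeps the number of input channels
    blk = []
    for i in range(num_residuals):
        if i == 0 and not first_block:
            # First block of a later stage: change channels and halve height and width
            blk.append(Residual(in_channels, out_channels, use_1x1conv=True, stride=2))
        else:
            blk.append(Residual(out_channels, out_channels))
    return nn.Sequential(*blk)

# Stem before the residual stages. This definition is missing from the original listing;
# it is assumed here to be the same stem as in the DenseNet section of this article.
net = nn.Sequential(
        nn.Conv2d(1, 64, kernel_size=7, stride=2, padding=3),
        nn.BatchNorm2d(64),
        nn.ReLU(),
        nn.MaxPool2d(kernel_size=3, stride=2, padding=1))

net.add_module("resnet_block1", resnet_block(64, 64, 2, first_block=True))
net.add_module("resnet_block2", resnet_block(64, 128, 2))
net.add_module("resnet_block3", resnet_block(128, 256, 2))
net.add_module("resnet_block4", resnet_block(256, 512, 2))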

"""
resnet_block1  output shape:	 torch.Size([1, 64, 56, 56])
resnet_block2  output shape:	 torch.Size([1, 128, 28, 28])
resnet_block3  output shape:	 torch.Size([1, 256, 14, 14])
resnet_block4  output shape:	 torch.Size([1, 512, 7, 7])
"""
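
To make the identity-mapping argument concrete, here is a minimal sketch with a simplified block. `TinyResidual` is a hypothetical name introduced only for this illustration, and it omits batch normalization. When the residual branch outputs zero, the whole block reduces to the identity on non-negative inputs, so the layers only need to learn a residual of 0 rather than reproduce the input.

```python
import torch
from torch import nn
import torch.nn.functional as F


class TinyResidual(nn.Module):
    """Hypothetical, simplified residual block: output = relu(g(x) + x)."""

    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)

    def forward(self, x):
        residual = self.conv2(F.relu(self.conv1(x)))  # g(x), the learned residual
        return F.relu(residual + x)                   # the skip connection adds x back


torch.manual_seed(0)
block = TinyResidual(channels=3)

# Zero the last convolution so that g(x) = 0 for every input.
nn.init.zeros_(block.conv2.weight)
nn.init.zeros_(block.conv2.bias)

x = torch.rand(2, 3, 6, 6)  # non-negative input, so the final ReLU leaves it unchanged
with torch.no_grad():
    y = block(x)

is_identity = torch.equal(y, x)
print(y.shape, is_identity)
```

A plain stack of the same two convolutions without the skip connection would have to learn weights that copy the input through a nonlinearity, which is the harder task described above.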

DenseNet

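DenseNet relies on concatenation. As I understand it, it aggressively stacks features (the original features plus the extracted ones) and then feeds them all to the classifier. The approach is simple and blunt, and it generally works well. In fact, it is a core trick in many competitions.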
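Main building blocks
Dense block: defines how inputs and outputs are concatenated.
Transition layer: controls the number of channels so that it does not grow too large.
Implementation
dense block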

def conv_block(in_channels, out_channels):
    blk = nn.Sequential(nn.BatchNorm2d(in_channels), 
                        nn.ReLU(),
                        nn.Conv2d(in_channels, out_channels, kernel_size=3, padding=1))
    return blk

class DenseBlock(nn.Module):
    def __init__(self, num_convs, in_channels, out_channels):
        super(DenseBlock, self).__init__()
        net = []
        for i in range(num_convs):
            in_c = in_channels + i * out_channels
            net.append(conv_block(in_c, out_channels))
        self.net = nn.ModuleList(net)
        self.out_channels = in_channels + num_convs * out_channels  # number of output channels

    def forward(self, X):
        for blk in self.net:
            Y = blk(X)
            X = torch.cat((X, Y), dim=1)  # concatenate input and output along the channel dimension
        return X

blk = DenseBlock(2, 3, 10)
X = torch.rand(4, 3, 8, 8)
Y = blk(X)
Y.shape # torch.Size([4, 23, 8, 8])
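
The difference between the ResNet-style connection and the DenseNet-style connection comes down to one operation. The sketch below uses random tensors as stand-ins for a block's input and for the output of its convolutions; the sizes are arbitrary.

```python
import torch

X = torch.rand(4, 3, 8, 8)  # block input
Y = torch.rand(4, 3, 8, 8)  # stand-in for the output of the convolutions

added = X + Y                            # ResNet style: element-wise sum
concatenated = torch.cat((X, Y), dim=1)  # DenseNet style: stack along channels

print(added.shape)         # torch.Size([4, 3, 8, 8])
print(concatenated.shape)  # torch.Size([4, 6, 8, 8])
```

Addition requires matching shapes and keeps the channel count fixed, while concatenation keeps every earlier feature map and makes the channel count grow with each convolution block.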

transition block
1×1 convolutional layer: reduces the number of channels
Average pooling layer with stride 2: halves the height and width

def transition_block(in_channels, out_channels):
    blk = nn.Sequential(
            nn.BatchNorm2d(in_channels), 
            nn.ReLU(),
            nn.Conv2d(in_channels, out_channels, kernel_size=1),
            nn.AvgPool2d(kernel_size=2, stride=2))
    return blk

blk = transition_block(23, 10)
blk(Y).shape # torch.Size([4, 10, 4, 4])
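
The way dense blocks and transition layers trade off channel growth can be traced with plain arithmetic. The sketch below assumes the configuration used for the DenseNet model in this article (64 starting channels, growth rate 32, four dense blocks of four convolutions each, with the channel count halved between blocks). The function name `densenet_channels` is a hypothetical helper introduced for this illustration.

```python
def densenet_channels(num_channels, growth_rate, num_convs_in_dense_blocks):
    """Return the channel count after each dense block and transition layer."""
    trace = []
    last = len(num_convs_in_dense_blocks) - 1
    for i, num_convs in enumerate(num_convs_in_dense_blocks):
        # Each conv block in a dense block appends growth_rate channels.
        num_channels += num_convs * growth_rate
        trace.append(("dense_block_%d" % i, num_channels))
        if i != last:
            # A transition layer halves the channel count between dense blocks.
            num_channels //= 2
            trace.append(("transition_block_%d" % i, num_channels))
    return trace


trace = densenet_channels(64, 32, [4, 4, 4, 4])
for name, channels in trace:
    print(name, channels)
```

Without the transition layers, the same four dense blocks would end with 64 + 16 × 32 = 576 channels instead of 248.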

# DenseNet model
import torch
from torch import nn
import d2lzh_pytorch as d2l  # helper package mentioned earlier in this article

net = nn.Sequential(
        nn.Conv2d(1, 64, kernel_size=7, stride=2, padding=3),
        nn.BatchNorm2d(64),
        nn.ReLU(),
        nn.MaxPool2d(kernel_size=3, stride=2, padding=1))

num_channels, growth_rate = 64, 32  # num_channels is the current number of channels
num_convs_in_dense_blocks = [4, 4, 4, 4]

for i, num_convs in enumerate(num_convs_in_dense_blocks):
    DB = DenseBlock(num_convs, num_channels, growth_rate)
    net.add_module("DenseBlock_%d" % i, DB)
    # Number of output channels of the dense block just added
    num_channels = DB.out_channels
    # Between dense blocks, add a transition layer that halves the number of channels
    if i != len(num_convs_in_dense_blocks) - 1:
        net.add_module("transition_block_%d" % i, transition_block(num_channels, num_channels // 2))
        num_channels = num_channels // 2

net.add_module("BN", nn.BatchNorm2d(num_channels))
net.add_module("relu", nn.ReLU())
net.add_module("global_avg_pool", d2l.GlobalAvgPool2d())  # output of GlobalAvgPool2d: (Batch, num_channels, 1, 1)
net.add_module("fc", nn.Sequential(d2l.FlattenLayer(), nn.Linear(num_channels, 10)))

# Shape check: pass a dummy input through the network layer by layer
X = torch.rand((1, 1, 96, 96))
for name, layer in net.named_children():
    X = layer(X)
    print(name, ' output shape:\t', X.shape)

"""
0  output shape:	 torch.Size([1, 64, 48, 48])
1  output shape:	 torch.Size([1, 64, 48, 48])
2  output shape:	 torch.Size([1, 64, 48, 48])
3  output shape:	 torch.Size([1, 64, 24, 24])
DenseBlock_0  output shape:	 torch.Size([1, 192, 24, 24])
transition_block_0  output shape:	 torch.Size([1, 96, 12, 12])
DenseBlock_1  output shape:	 torch.Size([1, 224, 12, 12])
transition_block_1  output shape:	 torch.Size([1, 112, 6, 6])
DenseBlock_2  output shape:	 torch.Size([1, 240, 6, 6])
transition_block_2  output shape:	 torch.Size([1, 120, 3, 3])
DenseBlock_3  output shape:	 torch.Size([1, 248, 3, 3])
BN  output shape:	 torch.Size([1, 248, 3, 3])
relu  output shape:	 torch.Size([1, 248, 3, 3])
global_avg_pool  output shape:	 torch.Size([1, 248, 1, 1])
fc  output shape:	 torch.Size([1, 10])
"""

# Training
# batch_size = 256
batch_size = 16
# If an "out of memory" error appears, reduce batch_size or resize
# device is not defined in the original listing; this choice is an assumption
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
train_iter, test_iter = d2l.load_data_fashion_mnist(batch_size, resize=96)
lr, num_epochs = 0.001, 5
optimizer = torch.optim.Adam(net.parameters(), lr=lr)
d2l.train_ch5(net, train_iter, test_iter, batch_size, optimizer, device, num_epochs)

Some Remarks

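Some questions:

  1. Why does batch normalization introduce learnable parameters, namely the scale parameter $\gamma$ and the shift parameter $\beta$?
  2. There are many normalization methods. Z-score is used here; could other methods be used instead?
  3. What problem do residual networks solve? What is the difference between ResNet and DenseNet, and how is each implemented?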
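
One way to approach the first question, offered as a sketch rather than a full answer: with suitable values of the scale and shift parameters, the layer can undo the standardization entirely, so adding batch normalization does not reduce what the network can represent. The example below sets the scale to the batch standard deviation and the shift to the batch mean and checks that the original input comes back. The tensor size and seed are arbitrary.

```python
import torch

torch.manual_seed(0)
eps = 1e-5
X = torch.randn(16, 4) * 3.0 + 5.0  # features with non-zero mean and non-unit variance

# Standardize with batch statistics, as in the batch normalization formulas.
mean = X.mean(dim=0)
var = ((X - mean) ** 2).mean(dim=0)
X_hat = (X - mean) / torch.sqrt(var + eps)

# Choose gamma and beta so that the scale-and-shift step inverts the standardization.
gamma = torch.sqrt(var + eps)
beta = mean
Y = gamma * X_hat + beta

recovered = torch.allclose(Y, X, atol=1e-4)
print(recovered)
```

In training these values are learned rather than set by hand, so the network can settle anywhere between fully standardized outputs and the original distribution.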