Classic Networks -- Image Classification (03 ResNet v1-v2)

Recently, the members of our lab group decided to study classic network models on a regular basis, so I am writing this blog series and will keep updating it with our study notes and my personal understanding of the various classic networks. If anything is lacking or misunderstood, I hope readers will raise questions and criticisms, and I will humbly improve. Let's discuss, learn, and make progress together.

Series contents:

Classic Networks -- Image Classification (01 AlexNet / VGG)

Classic Networks -- Image Classification (02 Inception v1-v4) (in progress)

Classic Networks -- Image Classification (03 ResNet v1-v2)


Classic Networks -- Image Classification (03 ResNet v1-v2)

This part covers ResNet, ResNet v2, and ResNeXt.


ResNet


[paper] Deep Residual Learning for Image Recognition

[github] https://github.com/KaimingHe/deep-residual-networks

[pytorch] https://pytorch.org/hub/pytorch_vision_resnet/

Introduction 

We explicitly reformulate the layers as learning residual functions with reference to the layer inputs, instead of learning unreferenced functions. We provide comprehensive empirical evidence showing that these residual networks are easier to optimize, and can gain accuracy from considerably increased depth.

Recent evidence [41, 44] reveals that network depth is of crucial importance, and the leading results [41, Going deeper with convolutions, 13, 16] on the challenging ImageNet dataset [36] all exploit “very deep” [41] models, with a depth of sixteen [VGG] to thirty [Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift]. Many other nontrivial visual recognition tasks [8, 12, 7, 32, 27] have also greatly benefited from very deep models.

References [41, 44] show that network depth is of crucial importance.

Figure 1. Training error (left) and test error (right) on CIFAR-10 with 20-layer and 56-layer “plain” networks. The deeper network has higher training error, and thus test error.

Driven by the significance of depth, a question arises: Is learning better networks as easy as stacking more layers? An obstacle to answering this question was the notorious problem of vanishing/exploding gradients [1, 9], which hamper convergence from the beginning. This problem, however, has been largely addressed by normalized initialization [23, 9, 37, 13] and intermediate normalization layers [16], which enable networks with tens of layers to start converging for stochastic gradient descent (SGD) with backpropagation [22]. When deeper networks are able to start converging, a degradation problem has been exposed: with the network depth increasing, accuracy gets saturated (which might be unsurprising) and then degrades rapidly. Unexpectedly, such degradation is not caused by overfitting, and adding more layers to a suitably deep model leads to higher training error, as reported in [11, 42] and thoroughly verified by our experiments. Fig. 1 shows a typical example.

The degradation (of training accuracy) indicates that not all systems are similarly easy to optimize. Let us consider a shallower architecture and its deeper counterpart that adds more layers onto it. There exists a solution by construction to the deeper model: the added layers are identity mapping, and the other layers are copied from the learned shallower model. The existence of this constructed solution indicates that a deeper model should produce no higher training error than its shallower counterpart. But experiments show that our current solvers on hand are unable to find solutions that are comparably good or better than the constructed solution (or unable to do so in feasible time).

Key points:

Is learning better networks as easy as stacking more layers?

Clearly not, for two reasons:
1. Vanishing/exploding gradients: network depth leads to vanishing/exploding gradients, which keep the system from converging. These can be handled by normalized initialization and intermediate normalization layers.
2. Degradation: as depth increases, accuracy tends to saturate and then degrade rapidly, and this is not caused by overfitting.

The degradation phenomenon points to one thing: not all systems are similarly easy to optimize.

Imagine a shallow network that has already been trained to a good solution. Now stack a few more layers on top of it to make it deeper. If every network could be optimized equally well, the newly added layers could simply be trained to be identity mappings; in other words, the deeper network should be no worse than the shallow one. In reality, however, the deeper network's accuracy turns out to be worse than the shallow one's.

Deep Residual Learning

We address the degradation problem by introducing a deep residual learning framework. Instead of hoping each few stacked layers directly fit a desired underlying mapping, we explicitly let these layers fit a residual mapping. Formally, denoting the desired underlying mapping as \small H(x), we let the stacked nonlinear layers fit another mapping of \small F(x) := H(x)-x. The original mapping is recast into \small F(x)+x. We hypothesize that it is easier to optimize the residual mapping than to optimize the original, unreferenced mapping. To the extreme, if an identity mapping were optimal, it would be easier to push the residual to zero than to fit an identity mapping by a stack of nonlinear layers.

Key points:

The authors argue that learning \small F(x)=0 is easier than learning \small H(x)=x: pushing the residual to zero is easier than fitting an identity mapping with a stack of nonlinear layers.

      Figure 2. Residual learning: a building block.

The formulation of \small F(x) +x can be realized by feedforward neural networks with “shortcut connections” (Fig. 2). Shortcut connections [2, 34, 49] are those skipping one or more layers. In our case, the shortcut connections simply perform identity mapping, and their outputs are added to the outputs of the stacked layers.


This reformulation is motivated by the counterintuitive phenomena about the degradation problem (Fig. 1). The degradation problem suggests that the solvers might have difficulties in approximating identity mappings by multiple nonlinear layers. With the residual learning reformulation, if identity mappings are optimal, the solvers may simply drive the weights of the multiple nonlinear layers toward zero to approach identity mappings. In real cases, it is unlikely that identity mappings are optimal, but our reformulation may help to precondition the problem. If the optimal function is closer to an identity mapping than to a zero mapping, it should be easier for the solver to find the perturbations with reference to an identity mapping, than to learn the function as a new one. We show by experiments (Fig. 7) that the learned residual functions in general have small responses, suggesting that identity mappings provide reasonable preconditioning.

Key points:

Motivation: when many layers are stacked, the solvers may have difficulty approximating identity mappings with multiple nonlinear layers. So a skip connection is added directly, and this skip connection is itself the identity mapping (you can think of it this way: if the network cannot learn the identity mapping, we just force one into it).

The core idea of ResNet: learning the perturbation around an identity mapping (the residual) is much easier than learning an entirely new function.
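To make the \small F(x)+x formulation concrete, here is a minimal PyTorch sketch of a residual building block with an identity shortcut (same channel count, no downsampling). The class and variable names are my own illustration, not the official implementation.

import torch
import torch.nn as nn
import torch.nn.functional as F

class BasicBlock(nn.Module):
    '''Minimal residual building block: out = ReLU(F(x) + x), as in Fig. 2.'''
    def __init__(self, planes):
        super(BasicBlock, self).__init__()
        # F(x): two stacked 3x3 conv layers (the "weight layers" in Fig. 2)
        self.conv1 = nn.Conv2d(planes, planes, kernel_size=3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(planes)
        self.conv2 = nn.Conv2d(planes, planes, kernel_size=3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(planes)

    def forward(self, x):
        residual = self.bn2(self.conv2(F.relu(self.bn1(self.conv1(x)))))  # F(x)
        return F.relu(residual + x)   # F(x) + x: the identity shortcut adds the input back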

Architectures

The architecture table makes the network structures clear at a glance.

Table 1. Architectures for ImageNet. Building blocks are shown in brackets (see also Fig. 5), with the numbers of blocks stacked. Downsampling is performed by conv3_1, conv4_1, and conv5_1 with a stride of 2.

Figure 5. A deeper residual function F for ImageNet. Left: a building block (on 56×56 feature maps) as in Fig. 3 for ResNet-34. Right: a “bottleneck” building block for ResNet-50/101/152.

A few details in this table are worth noting:

1. Bottleneck: as shown in Fig. 5, used for ResNet-50/101/152. The bottleneck form also has the following advantages [source: Zhihu, "Keep Learning CV"]:

1) The 1×1 convolutions expand and reduce the channel dimension (integrating information across channels), realizing linear combinations of multiple feature maps while keeping the spatial size of the feature maps unchanged;

2) Compared with kernels of other sizes, they greatly reduce the computational complexity;

3) Stacking two 3×3 convolutions gives only one ReLU in between, whereas using the 1×1 convolutions adds two ReLUs, introducing more nonlinear mappings.

2. 3×3 or 1×1: inspired by VGG, the convolutional layers are mostly 3×3. Layers within the same stage have the same number of filters. At the start of each stage, a stride-2 convolution halves the feature-map resolution while the number of filters is doubled, so that the time complexity per layer is preserved (this is the reason behind that design choice).

3. Shortcuts: at the start of each stage, both the resolution and the number of channels of the feature maps change, so how should the shortcut be connected? The paper gives three options:

A) Zero-padding: the extra dimensions are padded with zeros; this option is parameter-free;

B) Projection: essentially a convolution, used only on the connections where the dimensions change, while all other shortcuts are identity;

C) All shortcuts are projections (no identity connections).

Accuracy: A < B < C. Yes, C is the best, but what most of us actually use is B, and many of the experimental results in this paper use A, because A is only marginally worse than B and C (see the Table 3 experiments). A small sketch of a bottleneck block with an option-B projection shortcut is given below.
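Below is a minimal sketch (in the common CIFAR-style PyTorch formulation, with my own names; it is not the authors' official code) of a post-activation bottleneck block using the option-B shortcut: identity when the shape is unchanged, a 1×1 projection convolution when the resolution or channel count changes.

import torch
import torch.nn as nn
import torch.nn.functional as F

class Bottleneck(nn.Module):
    '''Original (post-activation) bottleneck: 1x1 reduce -> 3x3 -> 1x1 expand.'''
    expansion = 4

    def __init__(self, in_planes, planes, stride=1):
        super(Bottleneck, self).__init__()
        self.conv1 = nn.Conv2d(in_planes, planes, kernel_size=1, bias=False)
        self.bn1 = nn.BatchNorm2d(planes)
        self.conv2 = nn.Conv2d(planes, planes, kernel_size=3,
                               stride=stride, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(planes)
        self.conv3 = nn.Conv2d(planes, self.expansion*planes, kernel_size=1, bias=False)
        self.bn3 = nn.BatchNorm2d(self.expansion*planes)

        self.shortcut = nn.Sequential()              # identity shortcut by default
        if stride != 1 or in_planes != self.expansion*planes:
            self.shortcut = nn.Sequential(           # option B: 1x1 projection shortcut
                nn.Conv2d(in_planes, self.expansion*planes,
                          kernel_size=1, stride=stride, bias=False),
                nn.BatchNorm2d(self.expansion*planes)
            )

    def forward(self, x):
        out = F.relu(self.bn1(self.conv1(x)))
        out = F.relu(self.bn2(self.conv2(out)))
        out = self.bn3(self.conv3(out))
        return F.relu(out + self.shortcut(x))        # post-activation: ReLU after the addition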

Conclusion

There are many, many ways to interpret ResNet. Here is my own understanding of this paper.

First, the problem ResNet sets out to solve is the degradation problem.

The deeper the network, the higher the accuracy. But simply stacking layers makes the network deeper while the results get worse.

There are two causes: vanishing gradients and network degradation. The former can already be solved by normalized initialization and BN; the latter is the problem ResNet addresses.

Next, what exactly is the degradation problem?

The reason a deeper network is expected to be more effective than a shallow one is that the deeper network can be viewed as the shallow network with a few extra layers stacked on top.

Hence the shallow network's solution should be contained in the deeper network's solution space; that is, as the paper says, the deeper network should not perform worse than the shallow one.

But experiments show that simple stacking makes the network deeper while the error rate also rises: the deeper network degrades instead.

The reason is that a network has difficulty learning the identity mapping. If it could, the deeper network could learn the newly stacked layers as identity mappings during optimization while the original shallow layers keep their parameters, and the error rate should not get worse.

Finally, how does ResNet solve the degradation problem?

Since the network struggles to optimize layers toward an identity mapping, we simply add an identity mapping explicitly: the skip connection.

If, during training, a block should be an identity mapping, the output of the feedforward (residual) branch is driven to 0.

If a block should not be an identity mapping, the output of the residual branch is simply nonzero.

In this way, ResNet has a more complete representational capacity than a plainly stacked network.


ResNet v2


[paper] Identity Mappings in Deep Residual Networks

[pytorch] 

Introduction

Deep residual networks (ResNets) [1] consist of many stacked “Residual Units”. Each unit (Fig. 1 (a)) can be expressed in a general form:

\small \begin{aligned} \mathbf{y}_{l} &= h\left(\mathbf{x}_{l}\right)+\mathcal{F}\left(\mathbf{x}_{l}, \mathcal{W}_{l}\right) \\ \mathbf{x}_{l+1} &= f\left(\mathbf{y}_{l}\right) \end{aligned}              (1), (2)

where \small x_l and \small x_{l+1} are input and output of the \small l-th unit, and \small F is a residual function. In [1], \small h(x_l) = x_l is an identity mapping and \small f is a ReLU [2] function.

This gives the notation the paper uses for ResNet.

In this paper, we analyze deep residual networks by focusing on creating a “direct” path for propagating information — not only within a residual unit, but through the entire network. Our derivations reveal that if both \small h(x_l) and \small f(y_l) are identity mappings, the signal could be directly propagated from one unit to any other units, in both forward and backward passes. 

To understand the role of skip connections, we analyze and compare various types of \small h(x_l). We find that the identity mapping \small h(x_l) = x_l chosen in ResNet achieves the fastest error reduction and lowest training loss among all variants we investigated, whereas skip connections of scaling, gating [5,6,7], and 1×1 convolutions all lead to higher training loss and error

Key points:

Original ResNet: the identity mapping appears only in \small h(x_l) = x_l, while \small f(y_l)=ReLU(y_l).

ResNet v2 in this paper: the design aims for both \small h(x_l) = x_l and \small f(y_l) = y_l.

Motivation No. 1: why design it this way?

Extensive comparative experiments show that replacing the identity connection \small h(x_l) with other forms (scaling, gating [5,6,7], or 1×1 convolutions) all lowers accuracy. This suggests the next question: if \small f(y_l) = y_l, i.e., \small f is also an identity mapping, would accuracy improve further?


To construct an identity mapping \small f(y_l) = y_l, we view the activation functions (ReLU and BN [8]) as “pre-activation” of the weight layers, in contrast to conventional wisdom of “post-activation”. This point of view leads to a new residual unit design, shown in (Fig. 1(b)). Based on this unit, we present competitive results on CIFAR-10/100 with a 1001-layer ResNet, which is much easier to train and generalizes better than the original ResNet in [1]. 

Motivation No. 2: how can this design be realized?

To realize \small f(y_l) = y_l, the paper's second contribution is a study of the different orderings of ReLU, BN, and convolution. It finds that placing the BN+ReLU activation before the convolution ("pre-activation") works better, as shown in the structure and experiments of Fig. 1 (and in the PreActBlock code at the end of this post).

Figure 1. Left: (a) original Residual Unit in [1]; (b) proposed Residual Unit. The grey arrows indicate the easiest paths for the information to propagate, corresponding to the additive term “\small x_l” in Eqn.(4) (forward propagation) and the additive term “1” in Eqn.(5) (backward propagation). Right: training curves on CIFAR-10 of 1001-layer ResNets. Solid lines denote test error (y-axis on the right), and dashed lines denote training loss (y-axis on the left). The proposed unit makes ResNet-1001 easier to train.

Analysis of Deep Residual Networks

This section proves the conclusion behind Motivation No. 1 mathematically.

If \small f is also an identity mapping: \small x_{l+1} \equiv y_l , we can obtain:

\small \mathbf{x}_{l+1}=\mathbf{x}_{l}+\mathcal{F}\left(\mathbf{x}_{l}, \mathcal{W}_{l}\right)                  (3)

Recursively (\small \mathbf{x}_{l+2}=\mathbf{x}_{l+1}+\mathcal{F}\left(\mathbf{x}_{l+1}, \mathcal{W}_{l+1}\right)=\mathbf{x}_{l}+\mathcal{F}\left(\mathbf{x}_{l}, \mathcal{W}_{l}\right)+\mathcal{F}\left(\mathbf{x}_{l+1}, \mathcal{W}_{l+1}\right), etc.) we will have:

\small \mathbf{x}_{L}=\mathbf{x}_{l}+\sum_{i=l}^{L-1} \mathcal{F}\left(\mathbf{x}_{i}, \mathcal{W}_{i}\right)              (4)

for any deeper unit \small L and any shallower unit \small l.

Eqn.(4) also leads to nice backward propagation properties. Denoting the loss function as \small \mathcal{E}, from the chain rule of backpropagation [9] we have:

\small \frac{\partial \mathcal{E}}{\partial \mathbf{x}_{l}}=\frac{\partial \mathcal{E}}{\partial \mathbf{x}_{L}} \frac{\partial \mathbf{x}_{L}}{\partial \mathbf{x}_{l}}=\frac{\partial \mathcal{E}}{\partial \mathbf{x}_{L}}\left(1+\frac{\partial}{\partial \mathbf{x}_{l}} \sum_{i=l}^{L-1} \mathcal{F}\left(\mathbf{x}_{i}, \mathcal{W}_{i}\right)\right)             (5)

Eqn.(5) indicates that the gradient \small \partial \mathcal{E} / \partial \mathbf{x}_{l} can be decomposed into two additive terms: a term of \small \partial \mathcal{E} / \partial \mathbf{x}_{L} that propagates information directly without concerning any weight layers, and another term of \small \frac{\partial \mathcal{E}}{\partial \mathbf{x}_{L}}\left(\frac{\partial}{\partial \mathbf{x}_{l}} \sum_{i=l}^{L-1} \mathcal{F}\right) that propagates through the weight layers. The additive term of \small \partial \mathcal{E} / \partial \mathbf{x}_{L} ensures that information is directly propagated back to any shallower unit \small l. Eqn.(5) also suggests that it is unlikely for the gradient \small \partial \mathcal{E} / \partial \mathbf{x}_{l} to be canceled out for a mini-batch, because in general the term \small \frac{\partial}{\partial \mathbf{x}_{l}} \sum_{i=l}^{L-1} \mathcal{F} cannot be always -1 for all samples in a mini-batch. This implies that the gradient of a layer does not vanish even when the weights are arbitrarily small.

Key points:

Eqn. (4) shows two things:

1. The feature \small \mathbf{x}_{L} equals the feature \small \mathbf{x}_{l} plus a residual \small \sum_{i=l}^{L-1}\mathcal{F}; that is, any deeper unit can be expressed as any shallower unit plus the residual between them;

2. Setting \small l=0, \small \mathbf{x}_{L}=\mathbf{x}_{0}+\sum_{i=0}^{L-1} \mathcal{F}\left(\mathbf{x}_{i}, \mathcal{W}_{i}\right), i.e., \small \mathbf{x}_{L} is \small \mathbf{x}_{0} plus the summation of the outputs of all preceding residual units, whereas in a plain network \small \mathbf{x}_{L} is a product (a composition) of the preceding layers.

The formulas above are not hard to derive. Because of the additive "1" term in Eqn. (5), this gradient is very unlikely to vanish or explode.
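As a toy numerical check of Eqn. (5) (my own sketch, not an experiment from the paper): stack 50 linear "residual units" with arbitrarily small weights and compare the gradient that reaches the input with and without identity shortcuts.

import torch
import torch.nn as nn

depth, dim = 50, 16
layers = nn.ModuleList([nn.Linear(dim, dim, bias=False) for _ in range(depth)])
for layer in layers:
    nn.init.normal_(layer.weight, std=1e-3)          # arbitrarily small residual weights

def input_grad_norm(residual):
    x = torch.randn(1, dim, requires_grad=True)
    out = x
    for layer in layers:
        out = out + layer(out) if residual else layer(out)
    out.sum().backward()
    return x.grad.norm().item()

print("with identity shortcuts:", input_grad_norm(True))    # ~sqrt(dim): the additive "1" term dominates
print("plain stacked layers:   ", input_grad_norm(False))   # ~0: a pure product of tiny Jacobians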


On the Importance of Identity Skip Connections

The following shows that, compared with scaling, gating, and 1×1 convolutions, the identity connection is indeed the best choice.

Let’s consider a simple modification, \small h(\mathbf{x}_{l}) = \lambda_{l}\mathbf{x}_{l}, to break the identity shortcut:

\small \mathbf{x}_{L}=\left(\prod_{i=l}^{L-1}\lambda _i\right) \mathbf{x}_{l}+\sum_{i=l}^{L-1} \hat{\mathcal{F}}\left(\mathbf{x}_{i}, \mathcal{W}_{i}\right)             (7)

\small \hat{\mathcal{F}} absorbs the scalars into the residual functions. We have backpropagation of the following form:
\small \frac{\partial \mathcal{E}}{\partial \mathbf{x}_{l}}=\frac{\partial \mathcal{E}}{\partial \mathbf{x}_{L}}\left(\left(\prod_{i=l}^{L-1}\lambda _i\right)+\frac{\partial}{\partial \mathbf{x}_{l}} \sum_{i=l}^{L-1} \hat{\mathcal{F}}\left(\mathbf{x}_{i}, \mathcal{W}_{i}\right)\right)      (8)

Key points:

From Eqn. (8) we can see that backpropagation now carries an extra factor \small \prod_{i=l}^{L-1}\lambda _i: when \small \lambda _i>1 the gradients explode, and when \small \lambda _i<1 the gradients vanish.

The above is the case of \small h(\mathbf{x}_{l}) = \lambda_{l}\mathbf{x}_{l}.

For other transformations of \small h, Eqns. (7) and (8) instead contain the factor \small \prod_{i=l}^{L-1}{h}' (replacing \small \prod_{i=l}^{L-1}\lambda _i). This is again a product, so gradient explosion/vanishing appears as well.
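To get a feel for the \small \prod_{i=l}^{L-1}\lambda _i factor in Eqn. (8), here is a tiny numeric illustration (my own example, assuming a constant \small \lambda over 50 units):

for lam in (0.9, 1.0, 1.1):
    print(lam, "->", lam ** 50)   # ~0.005 (vanishing), 1.0 (identity), ~117 (exploding)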


ResNet v2 in PyTorch (pre-activation version and its bottleneck module)

import torch
import torch.nn as nn
import torch.nn.functional as F


class PreActBlock(nn.Module):
    '''Pre-activation version of the BasicBlock.'''
    expansion = 1

    def __init__(self, in_planes, planes, stride=1):
        super(PreActBlock, self).__init__()
        self.bn1 = nn.BatchNorm2d(in_planes)
        self.conv1 = nn.Conv2d(in_planes, planes, kernel_size=3,
                               stride=stride, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(planes)
        self.conv2 = nn.Conv2d(planes, planes, kernel_size=3, 
                               stride=1, padding=1, bias=False)

        if stride != 1 or in_planes != self.expansion*planes:
            self.shortcut = nn.Sequential(
                nn.Conv2d(in_planes, self.expansion*planes,
                          kernel_size=1, stride=stride, bias=False)
            )

    def forward(self, x):
        out = F.relu(self.bn1(x))
        shortcut = self.shortcut(out) if hasattr(self, 'shortcut') else x
        out = self.conv1(out)
        out = self.conv2(F.relu(self.bn2(out)))
        out += shortcut
        return out


class PreActBottleneck(nn.Module):
    '''Pre-activation version of the original Bottleneck module.'''
    expansion = 4

    def __init__(self, in_planes, planes, stride=1):
        super(PreActBottleneck, self).__init__()
        self.bn1 = nn.BatchNorm2d(in_planes)
        self.conv1 = nn.Conv2d(in_planes, planes, kernel_size=1, bias=False)
        self.bn2 = nn.BatchNorm2d(planes)
        self.conv2 = nn.Conv2d(planes, planes, kernel_size=3,
                               stride=stride, padding=1, bias=False)
        self.bn3 = nn.BatchNorm2d(planes)
        self.conv3 = nn.Conv2d(planes, self.expansion*planes,
                               kernel_size=1, bias=False)

        if stride != 1 or in_planes != self.expansion*planes:
            self.shortcut = nn.Sequential(
                nn.Conv2d(in_planes, self.expansion*planes,
                          kernel_size=1, stride=stride, bias=False)
            )

    def forward(self, x):
        out = F.relu(self.bn1(x))
        shortcut = self.shortcut(out) if hasattr(self, 'shortcut') else x
        out = self.conv1(out)
        out = self.conv2(F.relu(self.bn2(out)))
        out = self.conv3(F.relu(self.bn3(out)))
        out += shortcut
        return out
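As a quick sanity check (my own usage example, not part of the reference code): a stride-2 PreActBlock halves the spatial resolution while its 1×1 projection shortcut matches the new channel count.

block = PreActBlock(in_planes=64, planes=128, stride=2)
x = torch.randn(1, 64, 32, 32)
print(block(x).shape)             # expected: torch.Size([1, 128, 16, 16])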

