Classic Networks -- Image Classification (03 ResNet v1-v2)

Recently, our lab group decided to study classic network models on a regular schedule. This blog series is written for that purpose and will be updated continuously with our notes and my own understanding of these classic networks. If anything is inaccurate or poorly explained, questions and criticism from readers are welcome, and I will revise accordingly. I hope we can discuss, learn, and improve together.

Series contents:

Classic Networks -- Image Classification (01 AlexNet / VGG)

Classic Networks -- Image Classification (02 Inception v1-v4) (in progress)

Classic Networks -- Image Classification (03 ResNet v1-v2)

 

 

Classic Networks -- Image Classification (03 ResNet v1-v2)

This part covers ResNet, ResNet v2, and ResNeXt.


ResNet


[paper] Deep Residual Learning for Image Recognition

[github] https://github.com/KaimingHe/deep-residual-networks

[pytorch] https://pytorch.org/hub/pytorch_vision_resnet/

Introduction 

We explicitly reformulate the layers as learning residual functions with reference to the layer inputs, instead of learning unreferenced functions. We provide comprehensive empirical evidence showing that these residual networks are easier to optimize, and can gain accuracy from considerably increased depth.

Recent evidence [41, 44] reveals that network depth is of crucial importance, and the leading results [41, 44, 13, 16] on the challenging ImageNet dataset [36] all exploit "very deep" [41] models, with a depth of sixteen (VGG [41]) to thirty (Batch Normalization [16]). Many other nontrivial visual recognition tasks [8, 12, 7, 32, 27] have also greatly benefited from very deep models.

[41, 44] show that network depth is of crucial importance.

Figure 1. Training error (left) and test error (right) on CIFAR-10 with 20-layer and 56-layer “plain” networks. The deeper network has higher training error, and thus test error.

Driven by the significance of depth, a question arises: Is learning better networks as easy as stacking more layers? An obstacle to answering this question was the notorious problem of vanishing/exploding gradients [1, 9], which hamper convergence from the beginning. This problem, however, has been largely addressed by normalized initialization [23, 9, 37, 13] and intermediate normalization layers [16], which enable networks with tens of layers to start converging for stochastic gradient descent (SGD) with backpropagation [22]. When deeper networks are able to start converging, a degradation problem has been exposed: with the network depth increasing, accuracy gets saturated (which might be unsurprising) and then degrades rapidly. Unexpectedly, such degradation is not caused by overfitting, and adding more layers to a suitably deep model leads to higher training error, as reported in [11, 42] and thoroughly verified by our experiments. Fig. 1 shows a typical example.

The degradation (of training accuracy) indicates that not all systems are similarly easy to optimize. Let us consider a shallower architecture and its deeper counterpart that adds more layers onto it. There exists a solution by construction to the deeper model: the added layers are identity mapping, and the other layers are copied from the learned shallower model. The existence of this constructed solution indicates that a deeper model should produce no higher training error than its shallower counterpart. But experiments show that our current solvers on hand are unable to find solutions that are comparably good or better than the constructed solution (or unable to do so in feasible time).

Key points:

Is learning better networks as easy as stacking more layers?

Clearly not, for two reasons:
1. Vanishing/exploding gradients: greater depth can make gradients vanish or explode, so that training never converges. This has largely been solved by normalized initialization and intermediate normalization layers (BN).
2. Degradation: as depth increases, accuracy saturates and then degrades rapidly, and this is not caused by overfitting.

The degradation phenomenon shows that not all systems are similarly easy to optimize.

Imagine a shallow network that has already been trained to a good solution, and add a few layers on top of it to make it deeper. If every network could be optimized equally well, the added layers would only need to be learned as identity mappings; in other words, the deeper network should be no worse than the shallow one. In reality, however, the deepened network ends up less accurate than the shallow one.

Deep Residual Learning

We address the degradation problem by introducing a deep residual learning framework. Instead of hoping each few stacked layers directly fit a desired underlying mapping, we explicitly let these layers fit a residual mapping. Formally, denoting the desired underlying mapping as H(x), we let the stacked nonlinear layers fit another mapping of F(x) := H(x) - x. The original mapping is recast into F(x) + x. We hypothesize that it is easier to optimize the residual mapping than to optimize the original, unreferenced mapping. To the extreme, if an identity mapping were optimal, it would be easier to push the residual to zero than to fit an identity mapping by a stack of nonlinear layers.

Key points:

The authors argue that learning F(x) = 0 is easier than learning H(x) = x: pushing the residual to zero is easier than fitting an identity mapping with a stack of nonlinear layers.

      Figure 2. Residual learning: a building block.

The formulation of F(x) + x can be realized by feedforward neural networks with "shortcut connections" (Fig. 2). Shortcut connections [2, 34, 49] are those skipping one or more layers. In our case, the shortcut connections simply perform identity mapping, and their outputs are added to the outputs of the stacked layers.

 

This reformulation is motivated by the counterintuitive phenomena about the degradation problem (Fig. 1). The degradation problem suggests that the solvers might have difficulties in approximating identity mappings by multiple nonlinear layers. With the residual learning reformulation, if identity mappings are optimal, the solvers may simply drive the weights of the multiple nonlinear layers toward zero to approach identity mappings. In real cases, it is unlikely that identity mappings are optimal, but our reformulation may help to precondition the problem. If the optimal function is closer to an identity mapping than to a zero mapping, it should be easier for the solver to find the perturbations with reference to an identity mapping, than to learn the function as a new one. We show by experiments (Fig. 7) that the learned residual functions in general have small responses, suggesting that identity mappings provide reasonable preconditioning.

Key points:

Motivation: when networks are simply stacked deeper, the solvers may have difficulty approximating identity mappings with multiple nonlinear layers. So a skip connection is added directly, and this shortcut is exactly an identity mapping (think of it this way: if the network cannot learn the identity mapping by itself, we simply force one onto it).

The core idea of ResNet: learning the perturbation of an identity mapping (the residual) is much easier than learning an entirely new function. A minimal sketch of such a residual block follows.
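
As a concrete illustration, here is a minimal sketch of the building block in Fig. 2 in its original post-activation form, assuming the input and output have the same shape so the shortcut can be a plain identity; the class name is mine, not from the paper's code.

import torch.nn as nn
import torch.nn.functional as F

class BasicBlockV1(nn.Module):
    '''Minimal residual block: H(x) = F(x) + x, with F built from two 3x3 conv layers.'''
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)

    def forward(self, x):
        out = F.relu(self.bn1(self.conv1(x)))   # first conv of the residual branch F(x)
        out = self.bn2(self.conv2(out))         # second conv of F(x)
        return F.relu(out + x)                  # add the identity shortcut, then ReLU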

Architectures

The architecture table is self-explanatory.

Table 1. Architectures for ImageNet. Building blocks are shown in brackets (see also Fig. 5), with the numbers of blocks stacked. Downsampling is performed by conv3_1, conv4_1, and conv5_1 with a stride of 2.

Figure 5. A deeper residual function F for ImageNet. Left: a building block (on 56×56 feature maps) as in Fig. 3 for ResNet-34. Right: a "bottleneck" building block for ResNet-50/101/152.

A few details in this table:

1. Bottleneck: shown in Fig. 5 and used in ResNet-50/101/152. The bottleneck design also brings the following advantages [source: Zhihu, Keep Learning CV]:

1) The 1x1 convolutions reduce and then restore the channel dimension (integrating information across channels), realizing linear combinations of feature maps while keeping the spatial size unchanged;

2) Compared with larger kernels, it greatly reduces the computational cost (a quick weight count appears after this list);

3) If two 3x3 convolutions are stacked there is only one ReLU between them, while with the 1x1 convolutions there are two ReLUs, introducing more nonlinear mappings.

2. 3x3 or 1x1 kernels: inspired by VGG, the convolutional layers are mostly 3x3; layers within the same stage use the same number of filters; at the start of each stage, the feature map resolution is halved by a stride-2 convolution and the number of filters is doubled, so that the time complexity per layer stays roughly the same (this is the reason for pairing halved resolution with doubled width).

3. Shortcuts: at the start of each stage, both the resolution and the channel dimension of the feature maps change, so how should the skip connection be made? The paper gives three options:

A) zero-padding: the extra dimensions are padded with zeros; this option is parameter-free;

B) projection: essentially a 1x1 convolution, used only at the layers where the dimensions change; all other shortcuts remain identity connections;

C) all shortcuts are projections (no identity connections at all).

Accuracy: A < B < C. Yes, C is the best, yet most implementations in practice use B, and many of the experiments in this paper actually use A, because A is only marginally worse than B and C (see Table 3 of the paper for the comparison). A sketch of options A and B is given below.
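
As a rough illustration of bottleneck advantage 2) above, a quick weight count (ignoring BN and biases; the channel numbers are my own example) comparing two stacked 3x3 convolutions at 256 channels with a 256-64-64-256 bottleneck:

two_3x3 = 2 * (3 * 3 * 256 * 256)                                    # two 3x3 convs at 256 channels
bottleneck = 1 * 1 * 256 * 64 + 3 * 3 * 64 * 64 + 1 * 1 * 64 * 256   # 1x1, 3x3, 1x1
print(two_3x3)      # 1179648
print(bottleneck)   # 69632

And a hedged sketch of shortcut options A and B for the case where the channel count doubles and the resolution halves (the function and class names are mine, not from the paper's code):

import torch
import torch.nn as nn
import torch.nn.functional as F

def shortcut_A(x, out_channels):
    # Option A: stride-2 subsampling plus zero-padding of the extra channels (parameter-free).
    x = x[:, :, ::2, ::2]                   # halve the spatial resolution
    pad = out_channels - x.size(1)
    return F.pad(x, (0, 0, 0, 0, 0, pad))   # zero-pad only the channel dimension

class ShortcutB(nn.Module):
    # Option B: a 1x1 projection convolution, used only where the dimensions change.
    def __init__(self, in_channels, out_channels):
        super().__init__()
        self.proj = nn.Conv2d(in_channels, out_channels, kernel_size=1, stride=2, bias=False)

    def forward(self, x):
        return self.proj(x)

x = torch.randn(2, 64, 56, 56)
print(shortcut_A(x, 128).shape)      # torch.Size([2, 128, 28, 28])
print(ShortcutB(64, 128)(x).shape)   # torch.Size([2, 128, 28, 28])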

Conclusion

There are many different explanations of why ResNet works. Here is my own understanding of the explanation given in this paper.

First, the problem ResNet sets out to solve is the degradation problem.

Deeper networks generally reach higher accuracy, but simply stacking more layers makes the network deeper while its performance actually gets worse.

There are two causes: vanishing gradients and network degradation. The former can already be handled by normalized initialization and BN; the latter is what ResNet addresses.

Next, what is the degradation problem?

The reason we expect a deeper network to be more effective than a shallower one is that the deeper network can be viewed as the shallower network with a few extra layers stacked on top.

The solutions of the shallow network should then be a subset of the solutions of the deep network; as the paper puts it, the deeper model should produce no higher training error than its shallower counterpart.

Experiments show, however, that simply stacking layers makes the network deeper but also increases the error rate. The deeper network degrades.

The reason is that networks find it hard to learn the identity mapping. If they could, the deeper network could simply learn the newly stacked layers as identity mappings while keeping the original shallow layers unchanged, and the error rate should not get worse.

Finally, how does ResNet solve the degradation problem?

Since the network is hard to optimize toward an identity mapping, ResNet simply hard-wires one in: the skip connection.

If, during training, a block should behave as an identity mapping, the residual branch only needs to drive its output to 0.

If the block should not be an identity mapping, the residual branch simply outputs something nonzero.

In this way, ResNet has a more complete representational ability than a plainly stacked network.


ResNet v2


[paper] Identity Mappings in Deep Residual Networks

[pytorch] 

Introduction

Deep residual networks (ResNets) [1] consist of many stacked “Residual Units”. Each unit (Fig. 1 (a)) can be expressed in a general form:

\mathbf{y}_{l}=h\left(\mathbf{x}_{l}\right)+\mathcal{F}\left(\mathbf{x}_{l}, \mathcal{W}_{l}\right), \qquad \mathbf{x}_{l+1}=f\left(\mathbf{y}_{l}\right)             (1), (2)

where x_l and x_{l+1} are input and output of the l-th unit, and \mathcal{F} is a residual function. In [1], h(x_l) = x_l is an identity mapping and f is a ReLU [2] function.

This gives the notation the paper uses for a ResNet residual unit.

In this paper, we analyze deep residual networks by focusing on creating a "direct" path for propagating information — not only within a residual unit, but through the entire network. Our derivations reveal that if both h(x_l) and f(y_l) are identity mappings, the signal could be directly propagated from one unit to any other units, in both forward and backward passes.

To understand the role of skip connections, we analyze and compare various types of h(x_l). We find that the identity mapping h(x_l) = x_l chosen in ResNet achieves the fastest error reduction and lowest training loss among all variants we investigated, whereas skip connections of scaling, gating [5,6,7], and 1×1 convolutions all lead to higher training loss and error.

Key points:

Original ResNet: the identity mapping appears only in the shortcut h(x_l) = x_l, while f is a ReLU, i.e. x_{l+1} = ReLU(y_l).

ResNet v2 in this paper: design the unit so that h(x_l) = x_l and, in addition, f(y_l) = y_l.

Motivation No. 1: why design it this way?

Extensive comparison experiments show that replacing the identity connection h(x_l) with any other form (scaling, gating [5,6,7], or 1×1 convolutions) lowers accuracy. It is then natural to infer: if f(y_l) = y_l, i.e. f is also an identity mapping, might accuracy improve further?

 

To construct an identity mapping f(y_l) = y_l, we view the activation functions (ReLU and BN [8]) as "pre-activation" of the weight layers, in contrast to conventional wisdom of "post-activation". This point of view leads to a new residual unit design, shown in (Fig. 1(b)). Based on this unit, we present competitive results on CIFAR-10/100 with a 1001-layer ResNet, which is much easier to train and generalizes better than the original ResNet in [1].

Motivation No. 2: how is such a design realized?

To make f(y_l) = y_l, the second part of the paper studies the different orderings of ReLU, BN, and convolution. It finds that placing the BN + ReLU activation before the convolution (pre-activation) gives better results, as shown by the structure and experiments in Fig. 1; a compact sketch of the two orderings follows the figure caption.

Figure 1. Left: (a) original Residual Unit in [1]; (b) proposed Residual Unit. The grey arrows indicate the easiest paths for the information to propagate, corresponding to the additive term "x_l" in Eqn.(4) (forward propagation) and the additive term "1" in Eqn.(5) (backward propagation). Right: training curves on CIFAR-10 of 1001-layer ResNets. Solid lines denote test error (y-axis on the right), and dashed lines denote training loss (y-axis on the left). The proposed unit makes ResNet-1001 easier to train.
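
The difference between the two units boils down to the ordering of BN, ReLU, and convolution around the addition. A compact sketch (the conv/BN layers are assumed to be passed in with matching shapes; the function names are mine):

import torch.nn.functional as F

def post_activation_unit(x, conv1, bn1, conv2, bn2):
    # (a) original unit: the addition is followed by a ReLU, so f = ReLU
    out = conv2(F.relu(bn1(conv1(x))))
    return F.relu(bn2(out) + x)

def pre_activation_unit(x, conv1, bn1, conv2, bn2):
    # (b) proposed full pre-activation unit: BN + ReLU come before each conv, f = identity
    out = conv1(F.relu(bn1(x)))
    out = conv2(F.relu(bn2(out)))
    return out + x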

Analysis of Deep Residual Networks

This section proves the observation behind Motivation No. 1 with a few equations.

If f is also an identity mapping: x_{l+1} \equiv y_l, we can obtain:

\mathbf{x}_{l+1}=\mathbf{x}_{l}+\mathcal{F}\left(\mathbf{x}_{l}, \mathcal{W}_{l}\right)                  (3)

Recursively (\mathbf{x}_{l+2}=\mathbf{x}_{l+1}+\mathcal{F}\left(\mathbf{x}_{l+1}, \mathcal{W}_{l+1}\right)=\mathbf{x}_{l}+\mathcal{F}\left(\mathbf{x}_{l}, \mathcal{W}_{l}\right)+\mathcal{F}\left(\mathbf{x}_{l+1}, \mathcal{W}_{l+1}\right), etc.) we will have:

\mathbf{x}_{L}=\mathbf{x}_{l}+\sum_{i=l}^{L-1} \mathcal{F}\left(\mathbf{x}_{i}, \mathcal{W}_{i}\right)              (4)

for any deeper unit L and any shallower unit l.

Eqn.(4) also leads to nice backward propagation properties. Denoting the loss function as \mathcal{E}, from the chain rule of backpropagation [9] we have:

\frac{\partial \mathcal{E}}{\partial \mathbf{x}_{l}}=\frac{\partial \mathcal{E}}{\partial \mathbf{x}_{L}} \frac{\partial \mathbf{x}_{L}}{\partial \mathbf{x}_{l}}=\frac{\partial \mathcal{E}}{\partial \mathbf{x}_{L}}\left(1+\frac{\partial}{\partial \mathbf{x}_{l}} \sum_{i=l}^{L-1} \mathcal{F}\left(\mathbf{x}_{i}, \mathcal{W}_{i}\right)\right)             (5)

Eqn.(5) indicates that the gradient \partial \mathcal{E} / \partial \mathbf{x}_{l} can be decomposed into two additive terms: a term of \partial \mathcal{E} / \partial \mathbf{x}_{L} that propagates information directly without concerning any weight layers, and another term of \frac{\partial \mathcal{E}}{\partial \mathbf{x}_{L}}\left(\frac{\partial}{\partial \mathbf{x}_{l}} \sum_{i=l}^{L-1} \mathcal{F}\right) that propagates through the weight layers. The additive term of \partial \mathcal{E} / \partial \mathbf{x}_{L} ensures that information is directly propagated back to any shallower unit l. Eqn.(5) also suggests that it is unlikely for the gradient \partial \mathcal{E} / \partial \mathbf{x}_{l} to be canceled out for a mini-batch, because in general the term \frac{\partial}{\partial \mathbf{x}_{l}} \sum_{i=l}^{L-1} \mathcal{F} cannot be always -1 for all samples in a mini-batch. This implies that the gradient of a layer does not vanish even when the weights are arbitrarily small.

Key points:

Eqn.(4) shows two things:

1. The feature x_L equals the feature x_l plus a residual \sum_{i=l}^{L-1}\mathcal{F}; that is, any deeper unit can be written as any shallower unit plus a residual;

2. Setting l = 0 gives \mathbf{x}_{L}=\mathbf{x}_{0}+\sum_{i=0}^{L-1} \mathcal{F}\left(\mathbf{x}_{i}, \mathcal{W}_{i}\right), i.e. x_L is the sum of x_0 and the outputs of all preceding residual functions, whereas in a plain network x_L is a product (composition) of the preceding layers.

These equations are easy to derive. Because of the constant 1 in Eqn.(5), this gradient is very unlikely to vanish or explode; the small numerical sketch below illustrates the effect.
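
A tiny, illustrative sketch (scalar "units"; the setup is my own, not from the paper) of why the additive 1 in Eqn.(5) keeps gradients alive: with identity skips, dx_L/dx_0 stays near 1 even when every residual branch is tiny, whereas a plain stack of the same tiny layers vanishes.

import torch

torch.manual_seed(0)
weights = [torch.randn(1) * 1e-3 for _ in range(50)]  # 50 very small residual branches

x0 = torch.tensor([1.0], requires_grad=True)
x = x0
for w in weights:
    x = x + w * torch.tanh(x)          # residual unit: x_{l+1} = x_l + F(x_l)
x.backward()
print(x0.grad)                         # stays close to 1

y0 = torch.tensor([1.0], requires_grad=True)
y = y0
for w in weights:
    y = w * torch.tanh(y)              # plain stacking: y_{l+1} = F(y_l)
y.backward()
print(y0.grad)                         # collapses toward 0: the gradient vanishes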

 

On the Importance of Identity Skip Connections

The following shows that, compared with scaling, gating, and 1×1 convolutions, the identity connection is indeed the best choice.

Let's consider a simple modification, h(\mathbf{x}_{l}) = \lambda_{l} \mathbf{x}_{l}, to break the identity shortcut:

\mathbf{x}_{L}=\left(\prod_{i=l}^{L-1}\lambda_{i}\right) \mathbf{x}_{l}+\sum_{i=l}^{L-1} \hat{\mathcal{F}}\left(\mathbf{x}_{i}, \mathcal{W}_{i}\right)             (7)

where \hat{\mathcal{F}} absorbs the scalars into the residual functions. We have backpropagation of the following form:

\frac{\partial \mathcal{E}}{\partial \mathbf{x}_{l}}=\frac{\partial \mathcal{E}}{\partial \mathbf{x}_{L}}\left(\prod_{i=l}^{L-1}\lambda_{i}+\frac{\partial}{\partial \mathbf{x}_{l}} \sum_{i=l}^{L-1} \hat{\mathcal{F}}\left(\mathbf{x}_{i}, \mathcal{W}_{i}\right)\right)      (8)

Key points:

Eqn.(8) shows that the backward pass now contains an extra factor \prod_{i=l}^{L-1}\lambda_{i}: if \lambda_{i} > 1 the gradient explodes, and if \lambda_{i} < 1 it vanishes.

The above is the case h(x_l) = \lambda_{l} x_l.

For other transforms, the factor \prod_{i=l}^{L-1}\lambda_{i} in Eqns.(7) and (8) is replaced by \prod_{i=l}^{L-1} h_{i}'. This is again a product of derivatives, so gradient explosion or vanishing can likewise occur; a two-line numerical illustration follows.
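
Quick numerical illustration (assumptions: 50 units and a constant lambda, chosen only for this example) of how the product in Eqn.(8) behaves once the identity shortcut is replaced by a scaled one:

lam_small, lam_big, depth = 0.9, 1.1, 50
print(lam_small ** depth)   # about 0.005 -> the backward signal vanishes
print(lam_big ** depth)     # about 117   -> the backward signal explodes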

 

ResNet v2 in PyTorch (pre-activation version and its bottleneck module)

import torch.nn as nn
import torch.nn.functional as F


class PreActBlock(nn.Module):
    '''Pre-activation version of the BasicBlock.'''
    expansion = 1

    def __init__(self, in_planes, planes, stride=1):
        super(PreActBlock, self).__init__()
        self.bn1 = nn.BatchNorm2d(in_planes)
        self.conv1 = nn.Conv2d(in_planes, planes, kernel_size=3,
                               stride=stride, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(planes)
        self.conv2 = nn.Conv2d(planes, planes, kernel_size=3, 
                               stride=1, padding=1, bias=False)

        if stride != 1 or in_planes != self.expansion*planes:
            self.shortcut = nn.Sequential(
                nn.Conv2d(in_planes, self.expansion*planes,
                          kernel_size=1, stride=stride, bias=False)
            )

    def forward(self, x):
        out = F.relu(self.bn1(x))
        shortcut = self.shortcut(out) if hasattr(self, 'shortcut') else x
        out = self.conv1(out)
        out = self.conv2(F.relu(self.bn2(out)))
        out += shortcut
        return out


class PreActBottleneck(nn.Module):
    '''Pre-activation version of the original Bottleneck module.'''
    expansion = 4

    def __init__(self, in_planes, planes, stride=1):
        super(PreActBottleneck, self).__init__()
        self.bn1 = nn.BatchNorm2d(in_planes)
        self.conv1 = nn.Conv2d(in_planes, planes, kernel_size=1, bias=False)
        self.bn2 = nn.BatchNorm2d(planes)
        self.conv2 = nn.Conv2d(planes, planes, kernel_size=3,
                               stride=stride, padding=1, bias=False)
        self.bn3 = nn.BatchNorm2d(planes)
        self.conv3 = nn.Conv2d(planes, self.expansion*planes,
                               kernel_size=1, bias=False)

        if stride != 1 or in_planes != self.expansion*planes:
            self.shortcut = nn.Sequential(
                nn.Conv2d(in_planes, self.expansion*planes,
                          kernel_size=1, stride=stride, bias=False)
            )

    def forward(self, x):
        out = F.relu(self.bn1(x))
        shortcut = self.shortcut(out) if hasattr(self, 'shortcut') else x
        out = self.conv1(out)
        out = self.conv2(F.relu(self.bn2(out)))
        out = self.conv3(F.relu(self.bn3(out)))
        out += shortcut
        return out
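
A minimal usage sketch (the input shapes here are my own example, not from the paper): instantiate the two pre-activation blocks and check the output shapes.

import torch

block = PreActBlock(in_planes=64, planes=64)                       # identity shortcut
bottleneck = PreActBottleneck(in_planes=64, planes=64, stride=2)   # projection shortcut, 4x channel expansion
x = torch.randn(2, 64, 32, 32)
print(block(x).shape)        # torch.Size([2, 64, 32, 32])
print(bottleneck(x).shape)   # torch.Size([2, 256, 16, 16])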

 

 

 
