Paper Translation: Weighted Residuals for Very Deep Networks

               Weighted Residuals for Very Deep Networks

 

Paper: https://ieeexplore.ieee.org/document/7811085

 

Deep residual networks have recently shown appealing performance on many challenging computer vision tasks. However, the original residual structure still has some defects that make it difficult to converge on very deep networks. In this paper, we introduce a weighted residual network to address the incompatibility between ReLU and element-wise addition and the initialization problem of very deep networks. The weighted residual network learns to combine residuals from different layers effectively and efficiently. The proposed models enjoy a consistent improvement in accuracy and convergence with increasing depths from 100+ layers to 1000+ layers. Besides, the weighted residual networks add little computation and GPU memory burden compared with the original residual networks. The networks are optimized by projected stochastic gradient descent. Experiments on CIFAR-10 show that our algorithm converges faster than the original residual networks and reaches a high accuracy of 95.3% with a 1192-layer model.

1  Introduction

The state-of-the-art models for image classification are built on inception and residual structures [1,2,3], and many works on residual networks have emerged recently [4,5,6,7]. Very deep convolutional networks [8,9], especially those with residual units, have shown compelling accuracy and nice convergence behavior on many challenging computer vision tasks [3,10,11]. Since the vanishing-gradient problem is well handled by batch normalization [12] and highway signal propagation [13], networks with 100+ layers are being developed and trained; even 1000+-layer structures still yield meaningful results when combined with adequate dropout, as shown in [6]. He et al. [4] also introduced the pre-activation structure to allow the highway signal to be propagated directly through very deep networks. However, they resorted to features with a larger dimension (4×) and adopted multiple 1×1 convolutional layers in place of 3×3 convolutional layers to achieve convergence with 1000+ layers.

A typical convolutional unit is composed of one convolutional layer, one batch normalization layer and one ReLU layer, all performed sequentially [12]. For a residual unit, a central question is how to combine the residual signal and the highway signal; element-wise addition was proposed in [3]. A natural idea is to perform the addition after the ReLU activation. However, this leads to a non-negative output from the residual branch, which limits the representative ability of the residual unit: it can only enhance the highway signal. He et al. [3] first proposed to perform the addition between batch normalization and ReLU. In [4], they further proposed to reverse the order of the three layers, performing batch normalization and ReLU before the convolutional layers. The underlying problem is that the ReLU activation can only generate non-negative values, which is incompatible with the element-wise addition in the residual unit.

As solving deep networks is a non-convex optimization problem, an appropriate initialization is important for both faster convergence and a good local minimum. The "xavier" [14] and "msra" [15] initializers are popularly used for deep networks. However, for networks with depths beyond 100 layers, neither "xavier" nor "msra" works well. The paper [3] proposed to "warm up" the network with a small learning rate and then restore the learning rate to its normal value. However, this hand-crafted strategy is not that useful for very deep networks, where even a very low learning rate (0.00001) is still not enough to guarantee convergence, and restoring the learning rate may destroy the initial convergence [2].

Generally speaking, there are two defects embedded in the training of the original residual networks:

– Incompatibility of ReLU and element-wise addition.

– Difficulty for networks to converge at depths beyond 1000 layers using the "msra" initializer.

A third point is that a better way to combine the residuals from different layers is necessary to train very deep networks. For very deep networks, not all layers are that important, as 1000-layer networks often perform not much better than 100-layer networks. In fact, many layers carry redundant information, and very deep networks tend to over-fit on some tasks.

In this paper, we introduce the weighted residual networks, which learn to combine residuals from different layers effectively and efficiently. All the residual weights are initialized at zero and optimized with a very small learning rate (0.001), which allows all the residual signals to be gradually added to the highway signal. With a group of gradually growing residual weights, the 1192-layer residual networks converge even much faster than the 100-layer networks. Finally, the distribution of the learned residual weights is in a symmetric mode ranging within [−0.5, 0.5], which implies that the incompatibility of ReLU and element-wise addition can be appropriately handled. The networks are optimized by projected stochastic gradient descent with exactly the same number of training epochs as the original residual networks.

We conduct experiments on CIFAR-10 [16] to verify the practicability of the weighted residual networks. Training with the weighted residual networks converges much faster and reaches a higher performance, with negligible extra computation and GPU memory cost compared with the original residual networks. The weighted residual networks with depths beyond 1000 layers still converge faster than shallower networks and enjoy a consistent improvement in accuracy with increasing depths from 100+ layers to 1000+ layers, without resorting to any hand-crafted strategy such as "warm up" [3]. After applying dropout on the residuals, our weighted residual networks reach a very high accuracy (95.3%) on CIFAR-10 using a 1192-layer model with the same number of training epochs as the original residual networks (about 164 epochs, 64k iterations).

The contributions of our work presented in this paper are four-fold:

– We propose the weighted residual networks, which learn to combine the residuals from each residual unit. The weighted residual networks converge much faster in the training stage and reach a higher accuracy than the original residual networks, with little extra computation and GPU memory cost.

– The incompatibility of ReLU and element-wise addition can be addressed appropriately by weighted residuals, and we clear all the obstacles on the information highway to allow the highway signal an unhindered propagation.

– The residuals are gradually added to the highway signal to make the training process more reliable; even networks with depths beyond 1000 layers can converge very fast without the "warm up" strategy.

– We modify the down-sampling step to make the spatial size and feature dimension consistent between the highway signal and the branched residual signal, without resorting to zero-padding or an extra converting matrix. The weighted residual networks are simple and easy to implement while having surprising practical effectiveness, which makes them particularly useful for complicated residual networks in the research community and real applications.

2  Related Work

The residual networks have attracted lots of researchers, and many works on them have appeared [4,5,6,7,17]. In the following paragraphs we review some related works.

The residual networks simplify the highway networks [13] by using identity skip connections, which allow information to flow directly and bypass complex layers. The residual networks consist of many residual units, and there are two information flows in a residual unit: the highway signal goes through the identity skip connection, and the branched residual signal is realized by Conv-BN-ReLU-Conv-BN. The two flows are combined at the end of a residual unit by element-wise addition, and the result then goes through a ReLU layer for activation. This simple structure is quite powerful and achieved a surprising performance on the ImageNet challenge [1] with 150-layer networks [3].

In the original residual networks, the two flows are added up before the ReLU activation for a numerical reason: ReLU can only produce non-negative output, so adding after it would mean the branched residual signal could only enhance the highway signal. However, intuitively this is not a natural solution, as the branched residual signal needs to be "activated".

He et al. [4] proposed to handle the incompatibility between ReLU and element-wise addition by re-arranging these layers to BN-ReLU-Conv-BN-ReLU-Conv, which they named the "pre-activation" structure. When applying the "pre-activation" structure, special attention should be paid to the first and the last residual units of the networks.

To train "residual" networks, it is natural to fit the "residual" only, which means that when the branched residual signal is not present, the highway signal should still produce meaningful results. Under this condition, the branched residual signal can focus on fitting the "residual" in a residual unit. Huang et al. [6] proposed a dropout residual network, which randomly drops the branched residual signal in each residual unit. Therefore, when the branched residual signal is present in a residual unit, it can focus on fitting the "residual". As this model can be treated as an ensemble of models with different depths, they named it "stochastic depth networks".

In convolutional networks, depth and width are both important for high performance in image classification [7,3]. The conv1-conv3-conv1 bottleneck structure, which uses a feature dimension 4× larger than that of conv3-conv3, reached a higher performance [4]. Zagoruyko et al. [7] used conv3-conv3 with a 10× larger feature dimension and reached the highest performance on CIFAR-10 (a 4.10% error rate). However, a larger feature dimension costs much more GPU memory and leads to a shallower structure; there is a balance between depth and width.

In this paper, we mainly focus on models with depths beyond 100 layers. Our aim is to explore how to train a very deep model effectively rather than to tune a more accurate model.

3  Weighted Residual Networks

Firstly we give a brief introduction to the residual networks. The residual networks build the information highway by allowing earlier feature representations to flow unimpeded and directly to the following layers without any modification. A residual unit performs the following computation:
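A minimal sketch of this computation, assuming the paper's notation (x_i is the input to the i-th residual unit, ΔL_i its residual function with filter parameters θ_i):

x_{i+1} = ReLU( x_i + ΔL_i(x_i; θ_i) )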

Here x_i is the input highway signal to the i-th residual unit, θ_i denotes the filter parameters of the residual unit, initialized by "msra", and ΔL_i is the residual function, realized by a stack of two 3×3 convolutional layers. Typically, one convolutional layer is followed by one batch normalization layer, to keep the signal with non-zero variance, and one ReLU layer for non-linear activation. The highway should be clean and unhindered: as shown in [4], obstacles on the highway, such as constant scaling and dropout, make the optimization difficult. A typical residual unit is depicted in Figure 1. The original residual networks stated above have two defects.

Figure 1: Schematic of a residual unit. The residual function is composed of two 3×3 convolutional layers. Each convolutional layer (Conv) is followed by a batch normalization layer (BN) and a ReLU layer (ReLU). The weights of the convolutional layers are initialized by "msra". The highway signal and the residual signal are combined by element-wise addition.

Incompatibility of ReLU and element-wise addition. The highway signal and the residual signal produced by the residual function are combined by element-wise addition. However, the element-wise addition is operated between the BN layer and the ReLU layer after the second Conv layer. This is mainly due to the ReLU activation function, which produces non-negative output. The output of the ReLU operation is not compatible with element-wise addition, as it can only enhance the highway signal; this limits the representability of the residual function, which is meant to take values in (−∞, +∞). One can of course resort to designing other activation functions which take values in a larger range or in a symmetric mode around zero.

Initialization of very deep networks. Very deep networks with depths beyond 1000 layers, even equipped with residual structure, batch normalization and ReLU, still do not converge in the training stage, as shown in Figure 5. The paper [3] proposed to "warm up" the network training with a small learning rate for several epochs and then restore the normal learning rate in order to facilitate the initial convergence. However, for deeper networks, even a very small learning rate may not work well [2].

In very deep networks, the residuals from each block are added together, making the training hard to converge. One may want to zero all the residuals to start the training. However, the weights of the convolutional layers in the residual functions should be initialized by "msra", which has little probability of producing all-zero weights.

3.1  Weighted Residuals

To address the incompatibility of ReLU and element-wise addition and to obtain a better initialization for very deep networks, we introduce the weighted residual networks. Formally, in a weighted residual unit the signal is computed as shown below, where θ_i denotes the filter parameters, initialized by "msra", and λ_i is the weight scalar for the residual, initialized at zero and optimized with a very small learning rate. The ReLU activation is removed from the highway, and ΔL_i is realized by two Conv-BN-ReLUs.
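A sketch of the weighted residual computation under the same assumed notation, with λ_i the learned residual weight:

x_{i+1} = x_i + λ_i · ΔL_i(x_i; θ_i)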

Figure 2: Schematic of a weighted residual unit. We move ReLU from the highway to the residual branch, which allows the highway signal to propagate unhindered through very deep networks. The residual signal is weighted by a scalar initialized at zero in the training stage. In our experiments, overall convergence is guaranteed as all the residuals are gradually added to the highway signal. The weights take values in (−1, 1) to overcome the limitation of the ReLU activation function.

For any deep block, the feature representation x_{i+k} at the (i+k)-th layer can be expressed as the summation of the input representation x_i and a series of weighted residual functions:
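Written out under the same assumed notation, this summation reads approximately:

x_{i+k} = x_i + Σ_{j=0}^{k-1} λ_{i+j} · ΔL_{i+j}(x_{i+j}; θ_{i+j})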

In the back-propagation stage, the gradient at any layer does not vanish even when the filter parameters θ_{i+j} are arbitrarily small. Note that the pre-activation structure proposed in [4] also has a similar property, obtained by converting the order of Conv-BN-ReLU to BN-ReLU-Conv.

In Figure 3 we visualize the distribution of the learned residual weights in a 1192-layer model. The residual weight values range within about (−0.5, 0.5) in a symmetric mode, which means the branched residual signal has an equal probability of enhancing or weakening the highway signal; this implies that the incompatibility between ReLU and element-wise addition is appropriately addressed by the learned residual weights.

3.2  Structural Modification

At the beginning of a new block in the original residual networks, the highway signal is down-sampled by a stride-2 convolutional layer, while the branched residual signal also needs to be halved by a stride-2 convolutional layer. When performing the element-wise addition, zero-padding or a converting matrix is necessary to match the feature dimensions of the two signals. In our networks, as shown in Figure 4, we directly halve the feature size at the beginning of the block, and the following layers are performed as stated in the previous sections.
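As a rough illustration of this layout, the following PyTorch-style sketch (our assumption of the structure, not the authors' released code) halves the feature map once with a stride-2 convolution at the start of a block and then stacks dimension-preserving weighted residual units, so the highway and the residual branch always match without zero-padding or a converting matrix:

import torch
import torch.nn as nn

class WeightedResidualUnit(nn.Module):
    # Residual branch: Conv-BN-ReLU-Conv-BN-ReLU; the highway is an identity path,
    # and a scalar residual weight (lambda_i), initialized at zero, scales the branch.
    def __init__(self, channels):
        super().__init__()
        self.branch = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
        )
        self.res_weight = nn.Parameter(torch.zeros(1))  # lambda_i, starts at zero

    def forward(self, x):
        # clean highway: identity plus the weighted residual
        return x + self.res_weight * self.branch(x)

def make_block(in_channels, out_channels, num_units):
    # One stride-2 convolution halves the spatial size and sets the new width,
    # then a stack of dimension-preserving weighted residual units follows.
    layers = [
        nn.Conv2d(in_channels, out_channels, kernel_size=3, stride=2, padding=1, bias=False),
        nn.BatchNorm2d(out_channels),
        nn.ReLU(inplace=True),
    ]
    layers += [WeightedResidualUnit(out_channels) for _ in range(num_units)]
    return nn.Sequential(*layers)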

3.3  Optimization

Given the training images and their corresponding ground-truth labels {I_i, y_i}, the loss function is the summation of the negative log-likelihood and a regularization term, where θ denotes the network parameters, initialized by "msra", and λ is the weight vector for the residuals, initialized to all zeros. We apply projected SGD to this typical constrained optimization problem. In the (t+1)-th iteration, the updated λ_i^{t+1} is projected onto the convex set S.
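A hedged reconstruction of the objective and the projected update; the exact form of the regularizer R(θ) and the learning rate α_λ are our assumptions rather than the paper's stated values:

L(θ, λ) = − Σ_i log p(y_i | I_i; θ, λ) + R(θ)

λ_i^{t+1} = Π_S( λ_i^t − α_λ · Δλ_i^t )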

where the convex set S = (−1, 1) and Δλ_i^t is the gradient of the loss function in Equation 4 with respect to λ_i^t, which is efficiently computed by back-propagation [18] in deep networks.
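A minimal sketch of one projected SGD step under these assumptions; the parameter name res_weight follows the earlier sketch, and the learning rates are illustrative, not the paper's exact schedule:

import torch

def projected_sgd_step(model, loss, base_lr=0.1, lambda_lr=0.001, eps=1e-6):
    # Plain SGD on all parameters, with a smaller learning rate for the residual
    # weights, followed by projection of each residual weight onto S = (-1, 1).
    model.zero_grad()
    loss.backward()  # gradients of theta and lambda via back-propagation
    with torch.no_grad():
        for name, param in model.named_parameters():
            if param.grad is None:
                continue
            lr = lambda_lr if "res_weight" in name else base_lr
            param -= lr * param.grad
            if "res_weight" in name:
                param.clamp_(-1.0 + eps, 1.0 - eps)  # projection onto the open interval (-1, 1)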


4  Experiments

In this section we present and analyze the experimental results on CIFAR-10 to demonstrate the effectiveness of the weighted residual networks.

4.1  Results

Convergence. Firstly we experiment on shallow networks (layer number < 100). As shown in Figure 5(a) and Figure 5(b), the weighted residual networks and the original residual networks have very similar convergence behavior and final accuracy on shallow networks. Then we conduct experiments on very deep networks (layer number > 100). In Figure 5(c), the weighted residual network shows much better convergence in the training stage. In fact, networks with depths beyond 1000 layers still converge faster than the 112-layer networks in Figure 5(d). On the contrary, the original residual network does not converge well, and the 1192-layer network does not converge at all, as we did not apply the "warm up" strategy. Moreover, even equipped with "warm up", the original 1192-layer residual network ends up over-fitting and reaches a worse performance than the 112-layer network, as reported in [3].

Accuracy. The overall test accuracy of deep networks on CIFAR-10 is reported in Figure 6. The blue histograms denote the performance of the original networks: the accuracy decreases once the layer number exceeds 100. For the weighted residual networks, denoted by the yellow histograms, the performance instead enjoys a consistent improvement with increasing depths from 10+ layers to 1000+ layers. Throughout our experiments, the weighted residual networks always converge faster and reach a higher performance when there are more layers.

4.2  Comparison with the State of the Art

In this subsection we compare the weighted residual networks (WResNet) with other recently proposed models. There are mainly two kinds of models: the first kind focuses on enlarging the feature dimension, which we call wide models; the second focuses on depth, which we call deep models. Note that the 1001-layer Pre-activation [4] is both a deep (1000+ layers) and a wide (4× feature dimension) model. The results are presented in Table 2. All these models, except for Highway [13], share similar structures with ResNet [3], including three feature blocks.

Pre-activation [4] adopted a conv1-conv3-conv1 bottleneck structure and enlarged the feature dimension by 4×. Apparently a 4× wider model enjoys higher performance but costs more GPU memory. As GPU memory (12GB for one GTX TITAN X) is limited, it is important to tune the model width and depth economically for a very accurate model. WideDim [7] and RiR [5] are two other methods that enlarge the feature dimension for higher accuracy. A clear tendency is that wider features lead to higher performance. WideDim adopted a 10× feature dimension and reached a very high performance (95.8%) on CIFAR-10. Dropout [6] realized stochastic depth networks by applying the dropout operation on the residual signal at exactly the same GPU memory cost; the only defect is that it needs many more epochs (about 2×) to converge to a good performance.

The weighted residual networks make very deep network training converge faster and reach good performance while bringing little extra computation and GPU memory burden. As time and GPU resources are limited, we have not tuned the model width (feature dimension) or trained for more epochs; our intention is to explore the effectiveness of the weighted residuals in training very deep models. Yet even with a smaller feature dimension, the weighted residual networks still perform much better than the original residual networks and reach a quite meaningful accuracy, as shown in Table 2.

We further apply dropout on the residuals with dropout ratios of {0.2, 0.4, 0.6} for the three blocks, as proposed by [6]. The resulting model is named WResNet-d. With only about half the training epochs of [6], the weighted residual networks with dropout reach a relatively high performance (95.3%).

4.3  Analysis

We provide more insights into the weighted residual networks by presenting more detailed results in this subsection.

The initial learning rate for the residual weights is set to 0.001 for all models, and the residual weights are initialized with zeros. Figure 7 shows the learned residual weight values at each element-wise addition layer in a 1192-layer model. The plot comprises two parts divided by a visible sharp boundary around the 800th layer, and the later residuals have larger weights. This may imply that the residuals from the later layers are more important than those from the earlier layers for the final decision. We will explore this phenomenon in future work.

We also plot the evolution of the distribution of the residual weight values, as shown in Figure 8. At the 8k-th iteration, the distribution is relatively uniform. With more training iterations, the distribution begins to concentrate around two peaks. At the 64k-th iteration, most of the residual weight values lie around 0.2 and −0.2 in a symmetric mode, indicating that the branched residual signals have an equal probability of enhancing or weakening the highway signals, which verifies our hypothesis. Therefore the learned residual weights can appropriately resolve the incompatibility between ReLU activation and element-wise addition.

5  Conclusion

The original residual networks have two defects: 1) incompatibility between ReLU and element-wise addition; 2) difficulty for networks to converge at depths beyond 1000 layers using the "msra" initializer. In this paper we introduce the weighted residual networks to make very deep residual networks converge faster and reach a higher performance, with little extra computation and GPU memory burden compared with the original residual networks. All the residuals are gradually added to the highway signal through the learned, slowly growing weights to guarantee convergence. Experiments on CIFAR-10 have demonstrated the effectiveness of the weighted residual networks for very deep models. They enjoy a consistent improvement in accuracy and convergence with increasing depths from 100+ layers to 1000+ layers. The weighted residual networks are simple and easy to implement while having surprising practical effectiveness, which makes them particularly useful for complicated residual networks in the research community and real applications.

 

 
