Paper translation: Weighted Residuals for Very Deep Networks

               Weighted Residuals for Very Deep Networks

 

Paper: https://ieeexplore.ieee.org/document/7811085

 

Deep residual networks have recently shown appealing performance on many challenging computer vision tasks. However, the original residual structure still has some defects that make it difficult to converge on very deep networks. In this paper, we introduce a weighted residual network to address the incompatibility between ReLU and element-wise addition and the initialization problem of very deep networks. The weighted residual network learns to combine residuals from different layers effectively and efficiently. The proposed models enjoy a consistent improvement in accuracy and convergence as the depth increases from 100+ layers to 1000+ layers, while adding almost no extra computation or GPU memory burden compared with the original residual networks. The networks are optimized by projected stochastic gradient descent. Experiments on CIFAR-10 show that our algorithm converges faster than the original residual networks and reaches a high accuracy of 95.3% with a 1192-layer model.

1 Introduction

The state-of-the-art models for image classification are built on inception and residual structures [1,2,3], and many works on residual networks have emerged recently [4,5,6,7]. Very deep convolutional networks [8,9], especially those with residual units, have shown compelling accuracy and nice convergence behavior on many challenging computer vision tasks [3,10,11]. Since the vanishing-gradient problem is well handled by batch normalization [12] and highway signal propagation [13], networks with 100+ layers are being developed and trained; even 1000+ layer structures still yield meaningful results when combined with adequate dropout, as shown in [6]. He et al. [4] also introduced the pre-activation structure to allow the highway signal to be propagated directly through very deep networks. However, they seemed to rely on features with a larger dimension (4×) and adopted multiple 1×1 convolutional layers in place of 3×3 convolutional layers to achieve convergence with 1000+ layers.

A typical convolutional unit is composed of one convolutional layer, one batch normalization layer and one ReLU layer, performed sequentially [12]. For a residual unit, a central question is how to combine the residual signal and the highway signal; element-wise addition was proposed in [3]. A natural idea is to perform the addition after the ReLU activation. However, this leads to a non-negative output from the residual branch, which limits the representational ability of the residual unit: it can only enhance the highway signal. He et al. [3] first proposed to perform the addition between batch normalization and ReLU. In [4], they further proposed to invert the order of the three layers, performing batch normalization and ReLU before the convolutional layers. The underlying issue is that the ReLU activation can only generate non-negative values, which is incompatible with the element-wise addition in the residual unit.
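As a minimal, self-contained illustration of this point (not code from the paper), the PyTorch snippet below shows that a residual branch ending in a ReLU can only push the highway signal upward:

import torch

highway = torch.randn(8)               # highway signal: values in (-inf, +inf)
residual = torch.relu(torch.randn(8))  # residual branch ending in ReLU: always >= 0
out = highway + residual               # element-wise addition placed after the ReLU
assert torch.all(out >= highway)       # the residual can only enhance the highway signal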

Since solving deep networks is a non-convex optimization problem, an appropriate initialization is important both for faster convergence and for a good local minimum. The “xavier” [14] and “msra” [15] initializers are popular for deep networks. However, for networks with depths beyond 100 layers, neither “xavier” nor “msra” works well. The paper [3] proposed to “warm up” the network with a small learning rate and then restore the learning rate to its normal value. However, this hand-crafted strategy is not that useful for very deep networks, where even a very low learning rate (0.00001) is still not enough to guarantee convergence, and restoring the learning rate risks destroying the initial convergence [2].

Generally speaking, there are two defects embedded in the training of the original residual networks:

– Incompatibility of ReLU and element-wise addition.

– Difficulty for networks to converge at depths beyond 1000 layers using the “msra” initializer.

A third point is that a better way of combining the residuals from different layers is necessary to train very deep networks. For very deep networks, not all layers are equally important, as 1000-layer networks often perform not much better than 100-layer networks. In fact, many layers carry redundant information, and very deep networks tend to over-fit on some tasks.

In this paper, we introduce the weighted residual networks, which learn to combine residuals from different layers effectively and efficiently. All the residual weights are initialized to zero and optimized with a very small learning rate (0.001), which allows all the residual signals to be added to the highway signal gradually. With a group of gradually growing residual weights, the 1192-layer residual networks converge even faster than the 100-layer networks. Finally, the distribution of the learned residual weights is symmetric and ranges within [−0.5, 0.5], which implies that the incompatibility of ReLU and element-wise addition can be handled appropriately. The networks are optimized by projected stochastic gradient descent with exactly the same number of training epochs as the original residual networks.

We conduct experiments on CIFAR-10 [16] to verify the practicability of the weighted residual networks. Training with the weighted residual networks converges much faster and reaches a higher performance than the original residual networks, with negligible extra computation and GPU memory cost. The weighted residual networks with depths beyond 1000 layers still converge faster than shallower networks and enjoy a consistent improvement in accuracy as the depth increases from 100+ layers to 1000+ layers, without resorting to any hand-crafted strategy such as “warm up” [3]. After applying dropout on the residuals, our weighted residual networks reach a very high accuracy (95.3%) on CIFAR-10 using a 1192-layer model, with the same number of training epochs as the original residual networks (about 164 epochs, 64k iterations).

The contributions of the work presented in this paper are four-fold:

– We propose the weighted residual networks, which learn to combine the residuals from each residual unit. The weighted residual networks converge much faster in the training stage and reach a higher accuracy than the original residual networks, at little extra computation and GPU memory cost.

– The incompatibility of ReLU and element-wise addition can be addressed appropriately by weighted residuals, and we clear all obstacles on the information highway so that the highway signal enjoys an unhindered propagation.

– The residuals are added to the highway signal gradually to make the training process more reliable; even networks with depths beyond 1000 layers can converge very fast without the “warm up” strategy.

– We modify the down-sampling step to make the spatial size and feature dimension consistent between the highway signal and the branched residual signal, without resorting to zero-padding or an extra conversion matrix. The weighted residual networks are simple and easy to implement while having surprising practical effectiveness, which makes them particularly useful for complicated residual networks in the research community and in real applications.

2 Related work

Residual networks have attracted many researchers, and many works on them have appeared [4,5,6,7,17]. In the following paragraphs we review some related work.

Residual networks simplify highway networks [13] by using identity skip connections, which allow information to flow directly and bypass complex layers. A residual network consists of many residual units, each containing two information flows. The highway signal goes through the identity skip connection, while the branched residual signal is realized by Conv-BN-ReLU-Conv-BN. The two flows are combined at the end of the residual unit by element-wise addition and then pass through a ReLU layer for activation. This simple structure is quite powerful and achieved a surprising performance on the ImageNet challenge [1] with 150-layer networks [3].

In the original residual networks, the two flows are added up before the ReLU activation for a numerical reason: ReLU can only produce non-negative output, which would mean the branched residual signal could only enhance the highway signal. Intuitively, however, this is not a natural solution, as the branched residual signal still needs to be “activated”.

He et al. [4] proposed to handle the incompatibility between ReLU and element-wise addition by re-arranging these layers into BN-ReLU-Conv-BN-ReLU-Conv, and named it the “pre-activation” structure. When applying the “pre-activation” structure, special attention should be paid to the first and the last residual unit of the network.
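A rough PyTorch sketch of this pre-activation ordering (module and parameter names are my own; the exact configuration in [4] may differ):

import torch
import torch.nn as nn

class PreActResidualUnit(nn.Module):
    """Pre-activation unit [4]: BN-ReLU-Conv-BN-ReLU-Conv on the branch, identity highway."""
    def __init__(self, channels: int):
        super().__init__()
        self.branch = nn.Sequential(
            nn.BatchNorm2d(channels), nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False),
            nn.BatchNorm2d(channels), nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # No activation after the addition; the highway stays a pure identity path.
        return x + self.branch(x)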

To train “residual” networks, it is natural to fit the “residual” only, which means that when the branched residual signal is not present, the highway signal should still produce meaningful results. Under this condition, the branched residual signal can focus on fitting the “residual” in a residual unit. Huang et al. [6] proposed a dropout residual network, which randomly drops the branched residual signal in each residual unit. Therefore, when the branched residual signal is present in a residual unit, it can focus on fitting the “residual”. As this model can be treated as an ensemble of models with different depths, they named it “stochastic depth networks”.

In convolutional networks, depth and width are both important for high performance in image classification [7,3]. The conv1-conv3-conv1 bottleneck structure, which uses a feature dimension 4× larger than conv3-conv3, reached a higher performance [4]. Zagoruyko et al. [7] used conv3-conv3 with a feature dimension 10× larger and reached the highest performance on CIFAR-10 (4.10% error). However, a larger feature dimension costs much more GPU memory and leads to a shallower structure. There is a balance between depth and width.

In this paper, we mainly focus on models with depths beyond 100 layers. Our aim is to explore how to train a very deep model effectively rather than to tune a more accurate model.

3 Weighted residual networks

First we give a brief introduction to residual networks. Residual networks build an information highway by allowing earlier feature representations to flow unimpeded and directly to the following layers without any modification. A residual unit performs the following computation:
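The equation itself did not survive on this page (it was an image in the original post). Based on the surrounding description — a residual function added element-wise to the highway and then passed through a ReLU, as in Figure 1 — it plausibly reads:

\[ x_{i+1} = \operatorname{ReLU}\big( x_i + \Delta L(x_i;\ \theta_i) \big) \]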

Here x_i is the input highway signal to the i-th residual unit, θ_i denotes the filter parameters of the residual unit, initialized by “msra”, and ΔL_i is the residual function, realized by a stack of two 3×3 convolutional layers. Typically, each convolutional layer is followed by a batch normalization layer, to keep the signal with non-zero variance, and a ReLU layer for non-linear activation. The highway should be clean and unhindered: as shown in [4], obstacles on the highway, such as constant scaling and dropout, make the optimization difficult. A typical residual unit is depicted in Figure 1. The original residual networks stated above have two defects.

Figure 1: Illustration of a residual unit. The residual function is composed of two 3×3 convolutional layers. Each convolutional layer (Conv) is followed by a batch normalization layer (BN) and a ReLU layer (ReLU). The weights of the convolutional layers are initialized by “msra”. The highway signal and the residual signal are combined by element-wise addition.
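The unit in Figure 1 can be sketched in PyTorch as follows (an illustration, not the authors' code; kaiming_normal_ is PyTorch's implementation of the “msra” initializer):

import torch
import torch.nn as nn

class ResidualUnit(nn.Module):
    """Original residual unit: Conv-BN-ReLU-Conv-BN on the branch, element-wise add, then ReLU."""
    def __init__(self, channels: int):
        super().__init__()
        self.branch = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False),
            nn.BatchNorm2d(channels), nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False),
            nn.BatchNorm2d(channels),
        )
        for m in self.modules():
            if isinstance(m, nn.Conv2d):
                nn.init.kaiming_normal_(m.weight)  # "msra" initialization

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return torch.relu(x + self.branch(x))      # addition sits between BN and the final ReLU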

Incompatibility of ReLU and element-wise addition. The highway signal and the residual signal produced by the residual function are combined by element-wise addition. However, the element-wise addition is performed between the BN layer and the ReLU layer that follow the second Conv layer. This is mainly due to the ReLU activation function, which produces non-negative output. The output of a ReLU operation is not compatible with element-wise addition, as it can only enhance the highway signal; this limits the representational ability of the residual function, which is meant to take values in (−∞, +∞). One could of course resort to designing another activation function that takes values in a larger range or in a symmetric mode around zero.

Initialization of very deep networks. Very deep networks with depths beyond 1000 layers, even when equipped with the residual structure, batch normalization and ReLU, still do not converge in the training stage, as shown in Figure 5. The paper [3] proposed to “warm up” the network training with a small learning rate for several epochs and then restore it to the normal learning rate in order to facilitate the initial convergence. However, for deeper networks, even a very small learning rate may not work well [2].

In very deep networks, the residuals from each block are added together, which makes the training hard to converge. One may want to zero all the residuals to start the training. However, the weights of the convolutional layers in the residual functions should be initialized by “msra”, which has little probability of producing all-zero weights.

3.1 Weighted residuals

To address the incompatibility of ReLU and element-wise addition, and to obtain a better initialization for very deep networks, we introduce the weighted residual networks. Formally, in a weighted residual unit the signal is computed as
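Again the equation image is missing; from the description that follows (a scalar weight λ_i on the residual and no ReLU on the highway), it should have the form:

\[ x_{i+1} = x_i + \lambda_i \, \Delta L(x_i;\ \theta_i) \]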

where θ_i denotes the filter parameters, initialized by “msra”, and λ_i is a scalar weight for the residual, initialized to zero and optimized with a very small learning rate. The ReLU activation is removed from the highway, and ΔL_i is realized by two Conv-BN-ReLU stacks.

Figure 2: Illustration of a weighted residual unit. We move the ReLU from the highway to the residual branch, which allows the highway signal to propagate unhindered through very deep networks. The residual signal is weighted by a scalar that is initialized to zero in the training stage. In our experiments, overall convergence is guaranteed as all the residuals are gradually added to the highway signal. The weights take values in (−1, 1) to overcome the limitation of the ReLU activation function.
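A minimal PyTorch sketch of a weighted residual unit as described above (my own naming; the residual weight is a learnable scalar initialized to zero):

import torch
import torch.nn as nn

class WeightedResidualUnit(nn.Module):
    """Weighted residual unit: x + lambda * branch(x), with no ReLU on the highway."""
    def __init__(self, channels: int):
        super().__init__()
        self.branch = nn.Sequential(                # two Conv-BN-ReLU stacks
            nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False),
            nn.BatchNorm2d(channels), nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False),
            nn.BatchNorm2d(channels), nn.ReLU(inplace=True),
        )
        self.residual_weight = nn.Parameter(torch.zeros(1))  # lambda_i, initialized to zero
        for m in self.modules():
            if isinstance(m, nn.Conv2d):
                nn.init.kaiming_normal_(m.weight)             # "msra" initialization

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # The branch output is already activated, then scaled; the highway carries no ReLU.
        return x + self.residual_weight * self.branch(x)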

For any stack of residual units, the feature representation x_{i+k} in the (i+k)-th layer can be expressed as the summation of the input representation x_i and a series of weighted residual functions:
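The summation is not reproduced on this page; unrolling the weighted residual unit above over k units gives:

\[ x_{i+k} = x_i + \sum_{j=0}^{k-1} \lambda_{i+j}\, \Delta L\big(x_{i+j};\ \theta_{i+j}\big) \]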

In the back-propagation stage, the gradient at any layer does not vanish even when the filter parameters θ_{i+j} are arbitrarily small. Note that the pre-activation structure proposed in [4] has a similar property, obtained by converting the order of Conv-BN-ReLU into BN-ReLU-Conv.

In Figure 3 we visualize the distribution of the learned residual weights in a 1192-layer model. The residual weight values range roughly within (−0.5, 0.5) in a symmetric mode, which means the branched residual signal has equal probability of enhancing or weakening the highway signal; this implies that the incompatibility between ReLU and element-wise addition is appropriately addressed by the learned residual weights.

3.2 Structural modifications

At the beginning of a new block in the original residual networks, the highway signal is down-sampled by a stride-2 convolution layer, and the branched residual signal also needs to be halved by a stride-2 convolution layer. When performing the element-wise addition, zero-padding or a conversion matrix is then necessary to match the feature dimensions of the two signals. In our networks, as shown in Figure 4, we directly halve the feature size once at the beginning of the block, and the following layers proceed as stated in the previous sections.
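A hedged sketch of this modification: the down-sampling is done once, on the shared input of the block, so the highway and the residual branch of every following unit already agree in spatial size and channel count (the layer choices here are assumptions, not the authors' exact design):

import torch
import torch.nn as nn

class BlockEntry(nn.Module):
    """Halve the feature map once at the start of a block; later units keep the shape."""
    def __init__(self, in_channels: int, out_channels: int):
        super().__init__()
        self.down = nn.Sequential(
            nn.Conv2d(in_channels, out_channels, kernel_size=3, stride=2, padding=1, bias=False),
            nn.BatchNorm2d(out_channels), nn.ReLU(inplace=True),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Both the highway and the residual branches of the following units see this tensor,
        # so no zero-padding or projection matrix is needed at the additions.
        return self.down(x)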

3.3 Optimization

Given training images and their corresponding ground-truth labels {I_i, y_i}, the loss function is the summation of the negative log-likelihood and a regularization term,
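The loss equation is likewise missing from the page; a generic form consistent with the description (negative log-likelihood of the network outputs plus weight decay; the coefficient η is my notation, not the paper's) is:

\[ L(\theta, \lambda) = -\sum_{i} \log p\big(y_i \mid I_i;\ \theta, \lambda\big) + \frac{\eta}{2} \lVert \theta \rVert_2^2 \]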

where θ denotes the network parameters, initialized by “msra”, and λ is the weight vector for the residuals, initialized to all zeros. We apply projected SGD to this typical constrained optimization problem. In the (t+1)-th iteration, the updated λ_i^{t+1} is projected onto the convex set S:
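The update rule is also not shown; a projected-gradient step consistent with the text, with α denoting the learning rate for the residual weights and Π_S the projection onto S, would be:

\[ \lambda_i^{t+1} = \Pi_S\big( \lambda_i^{t} - \alpha\, \Delta\lambda_i^{t} \big) \]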

where the convex set S = (−1, 1) and Δλ_i^t is the gradient of the loss function in Equation 4 with respect to λ_i^t, which is computed efficiently by back-propagation [18] in deep networks.
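A hedged PyTorch sketch of the resulting training step, reusing the residual_weight parameter name from the unit sketch above; the base learning rate, momentum, and weight decay are placeholder values, while the 0.001 learning rate for the residual weights and the clamp into S = (−1, 1) follow the text:

import torch

def make_optimizer(model, base_lr=0.1, lambda_lr=0.001):
    """SGD with a much smaller learning rate for the residual weights (the lambdas)."""
    lambdas = [p for n, p in model.named_parameters() if "residual_weight" in n]
    others = [p for n, p in model.named_parameters() if "residual_weight" not in n]
    optimizer = torch.optim.SGD(
        [{"params": others},
         {"params": lambdas, "lr": lambda_lr, "weight_decay": 0.0}],
        lr=base_lr, momentum=0.9, weight_decay=1e-4)
    return optimizer, lambdas

def projected_step(optimizer, lambdas, loss):
    """Ordinary SGD update, then project the lambdas back into S = (-1, 1)."""
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    with torch.no_grad():
        for lam in lambdas:
            lam.clamp_(-1.0, 1.0)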

3.4 Implementation details


4 Experiments

In this section we present and analyze the experimental results on CIFAR-10 to demonstrate the effectiveness of the weighted residual networks.

4.1 Experimental results

Convergence. First we experiment on shallow networks (layer number < 100). As shown in Figure 5(a) and Figure 5(b), the weighted residual networks and the original residual networks have very similar convergence behavior and final accuracy on shallow networks. We then conduct experiments on very deep networks (layer number > 100). In Figure 5(c), the weighted residual network shows much better convergence in the training stage. In fact, networks with depths beyond 1000 layers still converge faster than the 112-layer networks in Figure 5(d). On the contrary, the original residual network does not converge well, and the 1192-layer network does not converge at all, as we did not apply the “warm up” strategy. Moreover, even when equipped with “warm up”, the original 1192-layer residual network ends up over-fitting and reaches a worse performance than the 112-layer network, as reported in [3].

Accuracy. The overall test accuracy of deep networks on CIFAR-10 is reported in Figure 6. The blue histograms denote the performance of the original networks; the accuracy decreases once the layer number grows beyond 100. For the weighted residual networks, denoted by the yellow histograms, the performance instead enjoys a consistent improvement as the depth increases from 10+ layers to 1000+ layers. Throughout our experiments, the weighted residual networks always converge faster and reach a higher performance when there are more layers.

4.2 Comparison with the state of the art

In this subsection we compare the weighted residual networks (WResNet) with other recently proposed models. There are mainly two kinds of models: wide models, which focus on enlarging the feature dimension, and deep models, which focus on depth. Note that the 1001-layer Pre-activation model [4] is both deep (1000+ layers) and wide (4× feature dimension). The results are presented in Table 2. All these models, except for Highway [13], share a similar structure with ResNet [3], including three feature blocks.

Pre-activation [4] adopted a conv1-conv3-conv1 bottleneck structure and enlarged the feature dimension by 4×. Apparently a 4× wider model enjoys a higher performance, but it costs more GPU memory. As GPU memory (12 GB for one GTX TITAN X) is a limited resource, it is important to tune the model width and depth economically for a very accurate model. WideDim [7] and RiR [5] are two other methods that enlarge the feature dimension for higher accuracy. A clear tendency is that wider features are better for higher performance: WideDim adopted a 10× feature dimension and reached a very high performance (95.8%) on CIFAR-10. Dropout [6] realized stochastic depth networks by applying the dropout operation on the residual signal at exactly the same GPU memory cost. Its only defect is that it needs many more epochs (about 2×) to converge to a good performance.

The weighted residual networks make the training of very deep networks converge faster and reach a good performance while bringing little extra computation and GPU memory burden. As time and GPU resources are limited, we have not tuned the model width (feature dimension) or used more training epochs; our aim is to explore the effectiveness of the weighted residuals in training very deep models. Yet even with a smaller feature dimension, the weighted residual networks still perform much better than the original residual networks and reach a quite meaningful accuracy, as shown in Table 2.

We further apply dropout on the residuals with dropout ratios {0.2, 0.4, 0.6} for the three blocks, as proposed by [6]. The performance of this model is reported as WResNet-d. With only about half the training epochs of [6], the weighted residual networks with dropout reach a very high accuracy (95.3%).

4.3 Analysis

We provide more insight into the weighted residual networks by presenting more detailed results in this subsection.

The initial learning rate for the residual weights is set to 0.001 for all models, and the residual weights are initialized to zero. Figure 7 shows the learned residual weight values at each element-wise addition layer in a 1192-layer model. The plot comprises two parts divided by a visible sharp boundary around the 800th layer, and the later residuals have larger weights. This may imply that residuals from later layers are more important to the final decision than those from earlier layers. We will explore this phenomenon in future work.

We also plot the evolution of the distribution of the residual weight values, as shown in Figure 8. At 8k iterations the distribution is relatively uniform. As training proceeds, the distribution begins to concentrate around two peaks. At 64k iterations, most of the residual weight values are around 0.2 and −0.2 in a symmetric mode, indicating that the branched residual signals have equal probability of enhancing or weakening the highway signals, which verifies our hypothesis. Therefore the learned residual weights can appropriately resolve the incompatibility between the ReLU activation and element-wise addition.

5 Conclusion

The original residual networks have two defects: 1) incompatibility between ReLU and element-wise addition; 2) difficulty for networks to converge at depths beyond 1000 layers using the “msra” initializer. In this paper we introduce the weighted residual networks, which make very deep residual networks converge faster and reach a higher performance with little extra computation and GPU memory burden compared with the original residual networks. All the residuals are added to the highway signal gradually, via learned, slowly growing weights, to guarantee convergence. Experiments on CIFAR-10 demonstrate the effectiveness of the weighted residual networks for very deep models: they enjoy a consistent improvement in accuracy and convergence as the depth increases from 100+ layers to 1000+ layers. The weighted residual networks are simple and easy to implement while having surprising practical effectiveness, which makes them particularly useful for complicated residual networks in the research community and in real applications.

 

 
