ReZero is All You Need: Fast Convergence at Large Depth



Abstract

Deep networks have enabled significant performance gains across domains, but they often suffer from vanishing/exploding gradients. This is especially true for Transformer architectures where depth beyond 12 layers is difficult to train without large datasets and computational budgets. In general, we find that inefficient signal propagation impedes learning in deep networks. In Transformers, multi-head self-attention is the main cause of this poor signal propagation. To facilitate deep signal propagation, we propose ReZero, a simple change to the architecture that initializes an arbitrary layer as the identity map, using a single additional learned parameter per layer. We apply this technique to language modeling and find that we can easily train ReZero-Transformer networks over a hundred layers. When applied to 12 layer Transformers, ReZero converges 56% faster on enwiki8. ReZero applies beyond Transformers to other residual networks, enabling 1,500% faster convergence for deep fully connected networks and 32% faster convergence for a ResNet-56 trained on CIFAR 10.


1 Introduction

Deep learning has enabled significant improvements in state-of-the-art performance across domains [1, 2, 3, 4]. The expressivity of neural networks typically grows exponentially with depth [5], enabling strong generalization performance, but often induces vanishing/exploding gradients and poor signal propagation through the model [6]. Researchers have relied on careful initialization [7, 8] and normalization techniques such as BatchNorm [9] and LayerNorm [10] to mitigate this issue, but these techniques can be costly and limited.
In this work, we propose ReZero, a small architectural addition that dynamically facilitates well-behaved gradients and arbitrarily deep signal propagation. The idea is simple: ReZero initializes each layer to perform the identity operation. For each layer, we introduce a residual connection for the input signal x and one trainable parameter α that modulates the non-trivial transformation of the layer, F(x),
x_{i+1} = x_i + α_i F(x_i),   (1)
where α = 0 at the beginning of training. Initially the gradients for all parameters defining F vanish, but dynamically evolve to suitable values during initial stages of training. We illustrate the architecture in Figure 1.

[Figure 1: Illustration of the ReZero architecture]
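As a concrete illustration, here is a minimal PyTorch sketch of the connection in Equation (1); the module and parameter names are our own, and the reference implementation linked below differs in its details.

```python
import torch
import torch.nn as nn

class ReZeroBlock(nn.Module):
    """Wrap an arbitrary transformation F with a ReZero connection:
    x_{i+1} = x_i + alpha_i * F(x_i), with alpha_i initialized to zero."""

    def __init__(self, transformation: nn.Module):
        super().__init__()
        self.F = transformation
        # One learned residual weight per layer, initialized to zero,
        # so the block is exactly the identity map at initialization.
        self.alpha = nn.Parameter(torch.zeros(1))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.alpha * self.F(x)

# Example with a hypothetical fully connected sublayer of width 256.
block = ReZeroBlock(nn.Sequential(nn.Linear(256, 256), nn.ReLU()))
x = torch.randn(8, 256)
assert torch.allclose(block(x), x)  # identity at initialization (alpha = 0)
```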


Code for ReZero applied to various neural architectures: https://github.com/majumderb/rezero

Table 1: Various forms of normalization and residual connections. F represents the transformation of an arbitrary layer and “Norm” is a normalization (e.g., LayerNorm or BatchNorm).

1. Deep Network: x_{i+1} = F(x_i)
2. Residual Network: x_{i+1} = x_i + F(x_i)
3. Deep Network + Norm: x_{i+1} = Norm(F(x_i))
4. Residual Network + Pre-Norm: x_{i+1} = x_i + F(Norm(x_i))
5. Residual Network + Post-Norm: x_{i+1} = Norm(x_i + F(x_i))
6. ReZero: x_{i+1} = x_i + α_i F(x_i)

ReZero provides two main benefits:

Deeper learning — Signals effectively propagate through deep networks, which allows for learning in otherwise untrainable networks. ReZero successfully trains 10,000 layers of fully-connected networks, and we are the first to train Transformers over 100 layers without learning rate warm-up or LayerNorm. In contrast to [11] we find that to get good results at this depth, it is not necessary to add auxiliary losses.
Faster convergence — We observe significantly accelerated convergence in ReZero networks compared to regular residual networks with normalization. When ReZero is applied to Transformers, we converge 56% faster than the vanilla Transformer to reach 1.2 BPB on the enwiki8 language modeling benchmark. When applied to ResNets, we obtain a 32% speedup to reach 85% accuracy on CIFAR 10.


2 Background and related work

Networks with a depth of L layers and width w often have an expressive power that scales exponentially in depth, but not in width [12, 5]. Large depth often comes with difficulty in training via gradient-based methods. During training of a deep model, a signal in the training data has to propagate forward from the input to the output layer, and subsequently, the cost function gradients have to propagate backwards in order to provide a meaningful weight update. If the magnitude of a perturbation is changed by a factor r in each layer, both signals and gradients vanish or explode at a rate of r^L, rendering many deep networks untrainable in practice.


There have been many attempts to improve signal propagation through deep networks, and they often fall into one of three categories — initialization schemes, normalization layers, and residual connections. We show some of the popular ways to combine residual networks with normalization in Table 1.


2.1 Careful initialization

In recent years the dynamics of signal propagation in randomly initialized deep and wide neural networks have been formalized via mean field theory [13, 8, 14]. For some deep neural networks, including fully connected and convolutional architectures, the cosine distance of two distinct signals,


c^l = (x^{l,1} · x^{l,2}) / (‖x^{l,1}‖ ‖x^{l,2}‖),   (2)

approaches a fixed point that either vanishes or approaches unity at large depths. If this fixed point is 1 the behavior of the network is stable and every input is mapped to the same output, leading to vanishing weight updates. If this fixed point is 0 the behavior of the network is chaotic and even similar inputs are mapped to very different outputs, leading to exploding weight updates. To understand whether a network is in a stable or chaotic phase we consider the input-output Jacobian


J_io = ∂x_L / ∂x_0.   (3)

The mean squared singular value χ of this matrix determines the growth/decay of an average input signal perturbation as it propagates through the network. The network exhibits a boundary between the ordered and the chaotic phase, the edge of chaos at χ = 1. Training proceeds efficiently at the edge of chaos.
This behavior was recognized in [15, 6], which motivated a re-scaling of the weights such that χ ≈ 1 and average signal strengths are neither enhanced nor attenuated.
Pennington et al. [13, 14] recognized that a unit mean squared singular value of the input-output Jacobian is insufficient to guarantee trainability. For example, if the singular vectors of J_io corresponding to very large/small singular values align well with the perturbations in the data, training will still be inefficient. They proposed the stronger condition of dynamical isometry [16], which requires that all singular values of J_io are close to one. This means that all perturbations of the input signal propagate through the network equally well. The ReLU activation function maps some perturbations of the input signal to zero, and it is therefore intuitive that deep networks with ReLU activations cannot satisfy dynamical isometry, as was rigorously established in [13]. For some activation functions and network architectures, elaborate initialization schemes allow the network to satisfy dynamical isometry at initialization, which significantly improves training dynamics [17, 5, 18, 19].


2.2 Normalization

An alternative approach to improve the trainability of deep neural networks is to incorporate layers that explicitly provide normalization. Many normalization modules have been proposed, with the two most popular ones being BatchNorm [9] and LayerNorm [10]. In general, normalization aims to ensure that initially, signals have zero mean and unit variance as they propagate through a network, reducing “covariate shift” [9]. For simplicity we will focus primarily on comparisons against LayerNorm because BatchNorm has additional regularizing effects that are orthogonal to our investigation.
Normalization methods have shown success in accelerating the training of deep networks, but they do incur a computational cost to the network and pose additional hyperparameters to tune (e.g., where to place the normalization). In contrast to normalization methods, our proposed method is simple and cheap to implement. ReZero alone is sufficient to train deeper networks, even in the absence of various norms. Although ReZero makes normalization superfluous for convergence, we have found the regularizing effect of BatchNorm to be complementary to our approach.


2.3 Residual connections

The identity mappings introduced in [2] enabled a deep residual learning framework in the context of convolutional networks for image recognition that significantly increased the trainable depth. The complementary use of BatchNorm and ResNets [2] has enabled the training of convolutional neural networks with over 100 layers. The same has not been the case for LayerNorm and Transformer architectures. Yang et al. [18] studied residual fully connected networks and demonstrated that due to the skip connection, signals decay more slowly (polynomially) as they propagate, allowing for effective training of deeper networks.
Concurrently with our work, SkipInit [20], an alternative to BatchNorm that is similar to ReZero, was proposed for ResNet architectures. The authors find that in deep ResNets without BatchNorm, a scalar multiplier is needed to ensure convergence. We arrive at a similar conclusion for the specific case considered in [20], and study signal propagation more generally in deeper networks, across multiple architectures and beyond BatchNorm.


3 ReZero

We propose ReZero (residual with zero initialization), a simple change to the architecture of deep residual networks that facilitates dynamical isometry and enables the efficient training of extremely deep networks. Rather than propagating the signal through each of the non-trivial functions F[Wi] at initialization, we add a skip connection and rescale the function by L learnable parameters αi (which we call residual weights) that are initialized to zero. The signal now propagates according to


x_{i+1} = x_i + α_i F[W_i](x_i).   (4)

At initialization the network represents the identity function and trivially satisfies dynamical isometry. We demonstrate below for a toy model that this architecture can exponentially accelerate training. The architecture modification allows for the training of deep networks even when the individual layers' Jacobian has vanishing singular values, as is the case for ReLU activation functions or self-attention [21]. The technique also allows us to add arbitrary new layers to existing and trained networks.
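Because a freshly added ReZero layer is exactly the identity map, new layers can in principle be appended to an existing, already trained network without changing its function until training resumes. The sketch below illustrates this; the backbone, layer sizes and module names are hypothetical.

```python
import torch
import torch.nn as nn

class ReZero(nn.Module):
    """x -> x + alpha * F(x), with alpha initialized to zero (identity map)."""
    def __init__(self, F: nn.Module):
        super().__init__()
        self.F = F
        self.alpha = nn.Parameter(torch.zeros(1))
    def forward(self, x):
        return x + self.alpha * self.F(x)

torch.manual_seed(0)
# Hypothetical "trained" backbone; in practice this would be a pretrained model.
backbone = nn.Sequential(nn.Linear(256, 256), nn.ReLU(), nn.Linear(256, 256))
x = torch.randn(4, 256)
y_before = backbone(x)

# Appending a new ReZero-wrapped layer does not change the network function,
# because alpha = 0 makes the added block an identity map until training resumes.
extended = nn.Sequential(backbone, ReZero(nn.Sequential(nn.Linear(256, 256), nn.ReLU())))
assert torch.allclose(extended(x), y_before)
```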


Figure 2: Contour log plots of a quadratic cost function (left) and gradient norm (right) over the network weight w and the residual weight α during the training of the linear function x_L = 5 x_0 via gradient descent, using a training set of x_0 = {1, 2, 3}. Gradient descent trajectories initialized at α = 0 are shown in red for five different initial w's. The trajectory dynamics avoid the poorly conditioned regions around α ≈ 1.

3.1 A toy example

To illustrate how the ReZero connection accelerates training let us consider the toy model of a deep neural network described by L single-neuron hidden layers that have no bias and all share the same weight w, with α_i = α ∀ i. The network then simply maps an input x_0 to the output

x_L = (1 + αw)^L x_0.   (5)

Fixing the parameter α = 1 would represent a toy model for a fully connected residual network, while initializing α = 0 and treating α as a learned parameter corresponds to a ReZero network. The input-output Jacobian is given by J_io = (1 + αw)^L, indicating that for initialization with w ≈ 1 and α = 1 the output signal of a deep (i.e., L ≫ 1) network is extremely sensitive to any small perturbations of the input, while with α = 0 the input signal magnitude is preserved. While this example is too simple to exhibit an order/chaos phase transition, it does accurately model the vanishing and exploding gradient problem familiar in deep networks. Assuming a learning rate λ and a cost function C, gradient descent updates the weight w according to
w ← w − λ ∂C/∂w = w − λ αL(1 + αw)^{L−1} x_0 ∂C/∂x_L.   (6)

For α = 1, convergence of gradient descent with an initial weight w ≈ 1 requires steps no larger than 1, and hence a learning rate that is exponentially small in depth L,


λ ∼ |∂C/∂w|^{−1} ∝ L^{−1} (1 + w)^{1−L},   (7)

where we only retained the parametric dependence on w and L. For w ≫ 1 the gradients in Equation 6 explode, while for w ≈ −1 the gradients vanish. Initializing α = 0 solves both of these problems: assuming a sufficiently well-conditioned cost function, the first step of gradient descent will update the residual weights α to a value that avoids large outputs and keeps the parameter trajectory within a well-conditioned region while retaining the expressive power of the network. The first non-trivial steps of the residual weight are given by


α ← α − λ ∂C/∂α = −λ L w x_0 (∂C/∂x_L)|_{α=0},   (8)

and gradient descent will converge with a learning rate that is polynomial in the depth L of the network. In this simple example, the ReZero connection, therefore, allows for convergence with dramatically fewer optimization steps than a vanilla residual network. We illustrate the training dynamics, cost function and gradients in Figure 2.
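To make the scaling concrete, the following short numeric check (our own, not from the paper) evaluates the toy model's output and gradients for L = 32: at α = 1 and w = 1 both grow like 2^L, while at α = 0 the only non-vanishing gradient is the one with respect to α, and it is merely of order L.

```python
# Numeric check of the toy model x_L = (1 + alpha * w)^L * x_0 from Section 3.1.
L, x0, w = 32, 1.0, 1.0

# Vanilla residual initialization: alpha = 1, w ~ 1.
alpha = 1.0
x_L = (1 + alpha * w) ** L * x0                        # 2^32 ~ 4.3e9: exploding output
dxL_dw = L * alpha * (1 + alpha * w) ** (L - 1) * x0   # ~ 6.9e10: exploding gradient

# ReZero initialization: alpha = 0.
alpha = 0.0
x_L_rz = (1 + alpha * w) ** L * x0                        # exactly x0: identity map
dxL_dw_rz = L * alpha * (1 + alpha * w) ** (L - 1) * x0   # 0: w is untouched by the first step
dxL_dalpha = L * w * (1 + alpha * w) ** (L - 1) * x0      # L = 32: a well-conditioned first step

print(x_L, dxL_dw, x_L_rz, dxL_dw_rz, dxL_dalpha)
```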


4 Training deep fully connected networks faster


Figure 3: Cross entropy loss during training of four variants of 32 layer fully connected networks with width 256 and ReLU activations. The bracketed numbers refer to the architectures in the corresponding rows of Table 1. We average over five runs each and show 1σ error bands. For all models we use the Adagrad [22] optimizer with learning rate 0.01.

As a sample toy task, we train four different network architectures on the CIFAR-10 data set for supervised image classification. We are only interested in the training dynamics and investigate how many iterations it takes for the model to fit the data.

We show the evolution of the training loss in Figure 3. In our simple experiment with a 32 layer network, the ReZero architecture converges to fit the training data between 7 and 15 times faster than the other techniques. Note that without an additional normalization layer, the residual connection decreases convergence speed compared to a plain fully connected network. We speculate that this is because at initialization the variance of the signal is not independent of depth; see [18].
With increasing depth, the advantages of the ReZero architecture become more apparent. To verify that this architecture ensures trainability at large depths, we successfully trained fully connected ReZero networks with up to 10,000 layers on a laptop with a single GPU to overfit the training set.
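For reference, a deep fully connected ReZero network of this kind could be assembled as in the sketch below; the width of 256, ReLU activations and Adagrad with learning rate 0.01 follow the Figure 3 caption, while the class name, input dimension and the 32 layer depth shown are our own choices.

```python
import torch
import torch.nn as nn

class ReZeroFC(nn.Module):
    """Deep fully connected network with ReZero residual connections."""
    def __init__(self, depth=32, width=256, num_classes=10, in_dim=3 * 32 * 32):
        super().__init__()
        self.embed = nn.Linear(in_dim, width)
        self.layers = nn.ModuleList(
            nn.Sequential(nn.Linear(width, width), nn.ReLU()) for _ in range(depth)
        )
        # One zero-initialized residual weight per layer.
        self.alphas = nn.Parameter(torch.zeros(depth))
        self.head = nn.Linear(width, num_classes)

    def forward(self, x):
        x = self.embed(x.flatten(1))
        for alpha, layer in zip(self.alphas, self.layers):
            x = x + alpha * layer(x)  # x_{i+1} = x_i + alpha_i F(x_i)
        return self.head(x)

model = ReZeroFC(depth=32)
optimizer = torch.optim.Adagrad(model.parameters(), lr=0.01)  # as in the Figure 3 caption
```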


5 Training deeper Transformers faster

In this section, we study signal propagation and the application of ReZero to the Transformer architecture [21]. Transformers have gained significant popularity and success in both supervised and unsupervised NLP tasks [23, 11]. Transformers are built by stacking modules that first perform self-attention, then a point-wise feed-forward transformation.
The original Transformer [21] implementation can be seen as a residual network with post-normalization (row 5 in Table 1). Inside a Transformer module the output of each sublayer is added to its input via a residual connection and then normalized by LayerNorm,


x_{i+1} = LayerNorm(x_i + sublayer(x_i)),   (9)

where sublayer ∈ {self-attention, feed-forward}, as illustrated in the left panel of Figure 4.

[Figure 4: A Transformer sublayer with Post-Norm (left) and with a ReZero connection (right)]

5.1 Signal propagation in Transformers


Figure 5: Histograms of the log singular values λ_io of the input-output Jacobian matrix for: (a) a Transformer encoder network at initialization, for depths of 4, 12 and 64 layers; (b) a 64 layer ReZero Transformer encoder network before and during training. Deep Transformers are far from dynamical isometry, λ_io ≪ 1, while ReZero Transformers remain closer to dynamical isometry, with mean singular value λ_io ≈ 1.

Two crucial components relevant to signal propagation in the original Transformer layers are LayerNorm [10] and (multi-head) self-attention [21]. We will argue that neither component, by itself or in conjunction with a vanilla residual connection, can satisfy dynamical isometry for all input signals. This finding motivates the use of a ReZero connection to replace both LayerNorm and the vanilla residual connection.
Layer normalization removes the mean and scales the variance over all neurons of a given layer, and introduces learnable parameters γ and β to re-scale the variance and shift the mean according to

LayerNorm(x) = γ (x − μ) / σ + β,
where μ and σ denote the mean and standard deviation of x taken over the neurons of the layer.

It is clear from this definition that perturbing an input x by a transformation that purely shifts either its mean or variance will leave the output unchanged. These perturbations, therefore, give rise to two vanishing singular values of the input-output Jacobian. In the Transformer architecture [21] the norm is applied to each of the n elements of the input sentence, leading to a total of 2 × n vanishing singular values of the Jacobian for each Transformer layer.
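This invariance is easy to verify numerically; the short check below (our own) confirms that shifting the mean or rescaling the variance of an input leaves the LayerNorm output unchanged, i.e. these perturbation directions lie in the null space of the layer's Jacobian.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
norm = nn.LayerNorm(8)
x = torch.randn(1, 8)
mu = x.mean(dim=-1, keepdim=True)

# A pure mean shift and a pure rescaling of the variance around the mean both
# leave the LayerNorm output unchanged (up to the small epsilon inside LayerNorm),
# so each corresponds to a vanishing singular value of the input-output Jacobian.
assert torch.allclose(norm(x + 3.0), norm(x), atol=1e-5)              # shift the mean
assert torch.allclose(norm(mu + 1.7 * (x - mu)), norm(x), atol=1e-5)  # rescale the variance
```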


The (single-head) self-attention operation is given by
Attention(Q, K, V) = softmax(Q K^T / √d_k) V,
where the queries, keys and values are obtained from the input sequence x via the learned weights W^{Q,K,V}, i.e. Q = x W^Q, K = x W^K, V = x W^V.

In general, the singular value spectrum of the Jacobian of this attention process is complicated. Rather than studying it in full generality, we now merely argue that for some inputs x and weights W^{Q,K,V} the Jacobian has a large number of vanishing singular values (a claim we evaluate empirically below). Consider weights or inputs such that each of the arguments of the softmax function is small compared to 1. The softmax function then simply returns an n × n dimensional matrix filled with entries that all approximate 1/n. This means that the attention function projects all embedding vectors of the input sequence onto a single diagonal direction. This implies that out of the n × d Jacobian singular values only d are non-vanishing, and hence much of the input signal is lost. A residual connection can restore some of the lost signal, but even then some perturbations are amplified while others are attenuated. This example demonstrates that self-attention is incompatible with dynamical isometry and unimpeded signal propagation in deep Transformer networks. It is easy to verify that the same conclusion holds for multi-head attention. A careful initialization of the weights might alleviate some of these issues, but we are not aware of any initialization scheme that would render a Transformer layer consistent with dynamical isometry.



We gave a theoretical argument that the vanilla Transformer contains elements that inhibit deep signal propagation. Here, we verify these claims in practice by obtaining the input-output Jacobian of the attention process, evaluating its change under an infinitesimal variation of each of the n × d entries of the input sequence x. We show the input-output Jacobian for Transformer encoder layers of various depths with Xavier uniform initialized weights in Figure 5a. While shallow Transformers exhibit a singular value distribution peaked around unity, we clearly observe that the Jacobian of deep architectures has a large number of singular values that vanish to machine precision. While the distribution varies depending on the details of the initialization scheme, the qualitative statement holds more broadly. These results are consistent with the common observation that deep Transformer networks are extremely challenging to train.
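A measurement of this kind can be reproduced with automatic differentiation; the sketch below (our own, using a single stock PyTorch encoder layer and smaller dimensions than in Figure 5) computes the full input-output Jacobian and inspects its singular value spectrum.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
n, d = 16, 64  # sequence length and embedding size; smaller than in the paper
layer = nn.TransformerEncoderLayer(d_model=d, nhead=2, dim_feedforward=128)
layer.eval()   # disable dropout so the Jacobian is deterministic

x = torch.randn(n * d)  # flattened input sequence

def f(flat_x):
    out = layer(flat_x.reshape(n, 1, d))  # (sequence, batch, features)
    return out.reshape(-1)

# Full (n*d) x (n*d) input-output Jacobian of a single encoder layer.
J = torch.autograd.functional.jacobian(f, x)
sv = torch.linalg.svdvals(J)
print(sv.min().item(), sv.mean().item(), sv.max().item())
```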

We apply ReZero to solve the problem of poor signal propagation in Transformer layers by replacing LayerNorm and re-scaling the self-attention block. Specifically, this modifies equation (9) to


x_{i+1} = x_i + α_i sublayer(x_i),

where α_i is the learned residual weight parameter, as in the right panel of Figure 4. We share the same α_i parameter between the multi-head self-attention and the feed-forward network within a Transformer layer. At initialization, α_i = 0, which allows for unimpeded signal propagation: all singular values of the input-output Jacobian are 1 and the model trivially satisfies dynamical isometry. To verify that the model remains close to dynamical isometry throughout training and for larger α_i, we show a histogram of the Jacobian singular values during the training of a 64 layer model on a toy language modeling task on WikiText-2 [24] in Figure 5b. During training the weight of the residual connection gradually increases, allowing the Transformer to model extremely complex functions while maintaining signal propagation properties close to dynamical isometry.
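A ReZero Transformer encoder layer along these lines could be implemented as sketched below; it drops LayerNorm entirely and shares a single zero-initialized residual weight between the two sublayers of the layer. The defaults are our own choices (d_model and dropout follow Appendix A, dim_feedforward is an assumption).

```python
import torch
import torch.nn as nn

class ReZeroEncoderLayer(nn.Module):
    """Transformer encoder layer with ReZero connections instead of LayerNorm."""
    def __init__(self, d_model=512, nhead=2, dim_feedforward=2048, dropout=0.2):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, nhead, dropout=dropout)
        self.ff = nn.Sequential(
            nn.Linear(d_model, dim_feedforward),
            nn.GELU(),
            nn.Dropout(dropout),
            nn.Linear(dim_feedforward, d_model),
        )
        # One residual weight shared by both sublayers of this layer, initialized to zero.
        self.alpha = nn.Parameter(torch.zeros(1))

    def forward(self, x, attn_mask=None):
        # x_{i+1} = x_i + alpha_i * sublayer(x_i); no LayerNorm anywhere.
        attn_out, _ = self.self_attn(x, x, x, attn_mask=attn_mask)
        x = x + self.alpha * attn_out
        x = x + self.alpha * self.ff(x)
        return x
```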


5.2 Convergence speed

We pick language modeling on enwiki8 [25] as a benchmark because strong language models are a good indicator of downstream NLP task performance [4]. Our aim in these experiments is to measure the convergence speed of each method by measuring the number of iterations it takes for a 12 layer Transformer to reach 1.2 bits per byte (BPB) on enwiki8.
Since the introduction of Transformers [21], there have been several competing placements of the LayerNorm within the Transformer to achieve better convergence [4, 26]. We experiment with three Transformer normalization methods and compare against the ReZero Transformer. The Post-Norm method (Row 5 in Table 1) is equivalent to the vanilla Transformer in [21], the Pre-Norm method (Row 4 in Table 1) was recently introduced in [26], and the GPT2-Norm (x_{i+1} = x_i + Norm(F(x_i))) was used in the training of GPT2 [4], which successfully trained Transformers with up to 48 layers. Finally, we experiment with our proposed ReZero method with α initialized to either zero or one. The hyperparameters are given in Appendix A.
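Schematically, the placements compared in this section differ only in where (or whether) the normalization is applied around a sublayer F; the summary sketch below is ours and the argument names are hypothetical.

```python
def sublayer_update(x, F, norm, variant, alpha=None):
    """Residual/normalization placements compared in Section 5.2 (schematic)."""
    if variant == "post-norm":   # vanilla Transformer [21], Row 5 in Table 1
        return norm(x + F(x))
    if variant == "pre-norm":    # [26], Row 4 in Table 1
        return x + F(norm(x))
    if variant == "gpt2-norm":   # used in GPT2 [4]
        return x + norm(F(x))
    if variant == "rezero":      # ours; alpha is a zero-initialized learned scalar
        return x + alpha * F(x)
    raise ValueError(f"unknown variant: {variant}")
```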
Our results (Table 2) show that Post-Norm diverges during training while all other models are able to converge. This is not surprising, as the original Transformer implementation required a learning rate warm-up, which is also confirmed in [26]. To verify this, we re-ran the Post-Norm setup with 100 steps of learning rate warm-up and find that the model is able to converge to 1.2 BPB in 13,690 iterations. Under this setting, we compared the other LayerNorm placement schemes against Post-Norm. We find that the other placements led to initially faster convergence, but ultimately Post-Norm catches up in performance, resulting in relatively slower convergence for Pre-Norm and GPT2-Norm. However, the other LayerNorm placements have an advantage over Post-Norm in that they do not require learning rate warm-up, and thus have fewer hyperparameters to tune. ReZero with α = 1 does not show an improvement over the vanilla Transformer, indicating the importance of initializing α = 0. With our proposed initialization of α = 0, ReZero converges 56% faster than the vanilla Transformer.


5.3 Deeper Transformers

Transformer models that achieve state-of-the-art performance in many NLP tasks [23] usually have fewer than 24 layers. The deepest model as of our work uses up to 78 layers [27] and requires 256 GPUs for training. In this section, we scale to hundreds of Transformer layers while remaining trainable on a desktop machine. To examine whether our approach scales to deeper Transformers, we extend our 12 layer ReZero Transformer from Section 5.2 to 64 and 128 layers and compare against the vanilla Transformer (Post-Norm). The hyperparameters are given in Appendix B.
Our results (Table 3) indicate that a 12 layer ReZero Transformer attains the same BPB as a regular Transformer after convergence, which shows that we do not lose any representational expressivity in our model by replacing LayerNorm with ReZero. We find that trying to train deep vanilla Transformers leads to either convergence difficulties or slow training times. When scaled to 64 layers, the vanilla Transformer fails to converge even with a warm-up schedule. A ReZero Transformer with α initialized to 1 diverges, supporting our theoretically motivated initialization at α = 0. The deeper ReZero Transformers are able to attain better performance than the shallower Transformers.
For comparison, we also display results from the Character Transformer [11], which used a similar setup. However, the Character Transformer uses more parameters and relies on many additional auxiliary losses to achieve its performance, which is orthogonal to our work. Our 128 layer Transformer achieves similar performance without any intermediate losses, uses half the number of parameters and has larger depth. We did not tune our hyperparameters, and our models can potentially achieve better results with stronger regularization and a learning rate schedule.


To probe deeper into our model, we examine the behavior of the residual weights α_i during training for our 12 layer and 64 layer ReZero Transformers (Figure 6). It is useful to view |α_i| as the amount of contribution each layer provides to the overall signal of the network. We see that an interesting pattern emerges for both the shallow and the deeper ReZero Transformer. During the early iterations of training, the residual weights quickly increase to a peak value, then slowly decay to a small value throughout the rest of training. Early in training, the higher layers tend to be dominant (they peak earlier), and towards the end of training each layer is utilized to a similar degree. The average |α_i| at the end of training is 0.0898 and 0.0185 for the 12 and 64 layer models respectively, which is approximately 1/L, where L is the number of residual layers.


[Figure 6: Residual weights |α_i| over the course of training for the 12 layer and 64 layer ReZero Transformers]
Interestingly, this pattern also occurs in the 12 layer ReZero Transformer when we initialized α = 1, except the model spends the first ≈ 50 iterations forcing the α’s to small values, before reaching a similar pattern to that shown in Figure 6. This empirical finding supports our proposal that we should initialize α = 0 even for shallow models.


6 Training ResNets faster

In the previous sections, we saw how ReZero connections enable training of deep networks that contain layers with vanishing Jacobian singular values, such as ReLU activations or self-attention. Some of these architectures are not trainable without ReZero connections or other architectural changes. In this section, we apply ReZero connections to deep residual networks for image recognition [2]. While these networks are trainable without ReZero connections, we observe that the validation error for a ResNet56 model (we use the implementation by Yerlan Idelbayev, available at github.com/akamaster/pytorch_resnet_cifar10, which very closely resembles the original architecture [2]) trained for up to 200 epochs on the CIFAR-10 dataset improves significantly, from (7.37 ± 0.06)% to (6.46 ± 0.05)%, after trading all vanilla residual connections in the model for ReZero connections (unlike the SkipInit proposal in [20], our setup retains the BatchNorm layer). The number of epochs needed to decrease the validation error below 15% also dropped by (32 ± 14)% after implementing ReZero. While these results provide only limited insight by themselves, they point towards broader applicability of ReZero connections and motivate further study.
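A ResNet basic block modified in this way could look like the sketch below; it keeps BatchNorm, adds only the zero-initialized residual weight, and for brevity omits strides and channel changes, so the details are ours rather than the exact blocks of the ResNet-56 used above.

```python
import torch
import torch.nn as nn

class ReZeroBasicBlock(nn.Module):
    """Simplified ResNet basic block with a ReZero residual weight; BatchNorm is kept."""
    def __init__(self, channels):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1, bias=False),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1, bias=False),
            nn.BatchNorm2d(channels),
        )
        self.alpha = nn.Parameter(torch.zeros(1))  # zero-initialized residual weight

    def forward(self, x):
        # relu(x_i + alpha_i * F(x_i)) instead of the vanilla relu(x_i + F(x_i))
        return torch.relu(x + self.alpha * self.body(x))
```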


7 Conclusion

We introduced ReZero, a simple architecture modification that facilitates signal propagation in deep networks and helps the network maintain dynamical isometry. Applying ReZero to various residual architectures – fully connected networks, Transformers and ResNets – we observed significantly improved convergence speeds. Furthermore, we were able to efficiently train Transformers with hundreds of layers, which has been difficult with the original architecture. We believe deeper Transformers will open doors for future exploration.
While training models with ReZero, we discovered interesting patterns in the values of the residual weights |α_i| of each layer over the course of training. These patterns may hint towards some form of curriculum learning and allow for progressive stacking of layers to further accelerate training [28]. Patterns of residual weights can be crucial for understanding the training dynamics of such deeper networks and might be important for model performance, which we will explore in future work.


Acknowledgements

The work of TB was supported in part by DOE under grant no. DE-SC0009919 and by the Simons Foundation SFARI 560536. The work of BPM and HHM was supported by Amazon via the grant of Alexa Prize Grand Challenge 3.


References

[1] Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. Deep learning. nature, 521(7553):436–444, 2015.
[2] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.
[3] Günter Klambauer, Thomas Unterthiner, Andreas Mayr, and Sepp Hochreiter. Self-normalizing neural networks. In Advances in neural information processing systems, pages 971–980, 2017.
[4] Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. Language models are unsupervised multitask learners. OpenAI Blog, 1(8):9, 2019.
[5] Ben Poole, Subhaneil Lahiri, Maithra Raghu, Jascha Sohl-Dickstein, and Surya Ganguli. Exponential expressivity in deep neural networks through transient chaos. In Advances in neural information processing systems, pages 3360–3368, 2016.
[6] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Delving deep into rectifiers: Surpassing human-level performance on imagenet classification. In Proceedings of the IEEE international conference on computer vision, pages 1026–1034, 2015.
[7] Dmytro Mishkin and Jiri Matas. All you need is a good init. arXiv preprint arXiv:1511.06422, 2015.
[8] Lechao Xiao, Yasaman Bahri, Jascha Sohl-Dickstein, Samuel S Schoenholz, and Jeffrey Pennington. Dynamical isometry and a mean field theory of CNNs: How to train 10,000-layer vanilla convolutional neural networks. arXiv preprint arXiv:1806.05393, 2018.

[9] Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167, 2015.
[10] Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E Hinton. Layer normalization. arXiv preprint arXiv:1607.06450, 2016.
[11] Rami Al-Rfou, Dokook Choe, Noah Constant, Mandy Guo, and Llion Jones. Character-level language modeling with deeper self-attention. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pages 3159–3166, 2019.
[12] Guido F Montufar, Razvan Pascanu, Kyunghyun Cho, and Yoshua Bengio. On the number of linear regions of deep neural networks. In Advances in neural information processing systems, pages 2924–2932, 2014.
[13] Jeffrey Pennington, Samuel Schoenholz, and Surya Ganguli. Resurrecting the sigmoid in deep learning through dynamical isometry: theory and practice. In Advances in neural information processing systems, pages 4785–4795, 2017.
[14] Jeffrey Pennington, Samuel S Schoenholz, and Surya Ganguli. The emergence of spectral universality in deep networks. arXiv preprint arXiv:1802.09979, 2018.
[15] Xavier Glorot and Yoshua Bengio. Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the thirteenth international conference on artificial intelligence and statistics, pages 249–256, 2010.
[16] Andrew M Saxe, James L McClelland, and Surya Ganguli. Exact solutions to the nonlinear dynamics of learning in deep linear neural networks. arXiv preprint arXiv:1312.6120, 2013.
[17] Samuel S Schoenholz, Justin Gilmer, Surya Ganguli, and Jascha Sohl-Dickstein. Deep information propagation. arXiv preprint arXiv:1611.01232, 2016.
[18] Ge Yang and Samuel Schoenholz. Mean field residual networks: On the edge of chaos. In Advances in neural information processing systems, pages 7103–7114, 2017.
[19] Dar Gilboa, Bo Chang, Minmin Chen, Greg Yang, Samuel S Schoenholz, Ed H Chi, and Jeffrey Pennington. Dynamical isometry and a mean field theory of lstms and grus. arXiv preprint arXiv:1901.08987, 2019.
[20] Soham De and Samuel L Smith. Batch normalization biases deep residual networks towards shallow paths. arXiv preprint arXiv:2002.10444, 2020.
[21] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in neural information processing systems, pages 5998–6008, 2017.
[22] John Duchi, Elad Hazan, and Yoram Singer. Adaptive subgradient methods for online learning and stochastic optimization. Journal of machine learning research, 12(Jul):2121–2159, 2011.
[23] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: pre-training of deep bidirectional transformers for language understanding. In Jill Burstein, Christy Doran, and Thamar Solorio, editors, NAACL, pages 4171–4186, 2019.
[24] Stephen Merity, Caiming Xiong, James Bradbury, and Richard Socher. Pointer sentinel mixture models. In ICLR, 2017.
[25] Matt Mahoney. Large text compression benchmark, 2009.
[26] Ruibin Xiong, Yunchang Yang, Di He, Kai Zheng, Shuxin Zheng, Chen Xing, Huishuai Zhang, Yanyan Lan, Liwei Wang, and Tie-Yan Liu. On layer normalization in the transformer architecture. arXiv preprint arXiv:2002.04745, 2020.
[27] Microsoft. Turing-NLG: A 17-billion-parameter language model, 2020.

[28] Linyuan Gong, Di He, Zhuohan Li, Tao Qin, Liwei Wang, and Tie-Yan Liu. Efficient training of BERT by progressively stacking. In Kamalika Chaudhuri and Ruslan Salakhutdinov, editors, Proceedings of the 36th International Conference on Machine Learning, ICML 2019, 9-15 June 2019, Long Beach, California, USA, volume 97 of Proceedings of Machine Learning Research, pages 2337–2346. PMLR, 2019.
[29] Dan Hendrycks and Kevin Gimpel. Bridging nonlinearities and stochastic regularizers with gaussian error linear units. CoRR, abs/1606.08415, 2016.
[30] Yang You, Jing Li, Jonathan Hseu, Xiaodan Song, James Demmel, and Cho-Jui Hsieh. Reducing BERT pre-training time from 3 days to 76 minutes. CoRR, abs/1904.00962, 2019.

A Convergence speed experimental hyperparameters

For all model variants in Section 5.2, we set the batch size to 1080, the number of layers to 12, feed-forward and attention dropout to 20%, the hidden and embedding size to 512 units, the context length to 512, and the number of attention heads to 2, and use a GELU [29] activation in the point-wise feed-forward layer. To accommodate large batch training we use the LAMB optimizer [30] with a fixed learning rate of 0.016. Although learning rate schedules tend to improve performance [23], we omit them to simplify our training process.
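For convenience, these settings can be collected in a single configuration; the dictionary below is a hypothetical summary of the values listed above, not a file from our codebase.

```python
# Hypothetical configuration collecting the Section 5.2 hyperparameters.
config = {
    "batch_size": 1080,
    "num_layers": 12,
    "dropout": 0.2,          # feed-forward and attention dropout
    "hidden_size": 512,      # hidden and embedding size
    "context_length": 512,
    "attention_heads": 2,
    "activation": "gelu",    # GELU [29] in the point-wise feed-forward layer
    "optimizer": "lamb",     # LAMB [30]
    "learning_rate": 0.016,  # fixed, no schedule
}
```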


B Deep Transformers experimental hyperparameters

In Section 5.3, in order to examine whether our approach scales to deeper Transformers, we extend our 12 layer ReZero Transformer from Section 5.2 to 64 layers and 128 layers and compare it against the vanilla Transformer (Post-Norm). Due to memory constraints, we decreased the hidden size from 512 to 256 and reduced the batch size to 304 and 144 for the 64 layer and 128 layer models respectively. Following the guidelines from [30], we also scaled the learning rate with the square root of the batch size.


For all models in our experiments we limit training to a maximum of 100 training epochs.
