LCA: Loss Change Allocation for Neural Network Training

Links: paper, code, Uber blog post, and video.

Motivation

  Experience tells us that the loss decreases as a neural network trains, provided the architecture and training algorithm are properly designed. In other words, the change in loss indicates whether the algorithm is converging and how efficiently the network is learning. But what happens if we allocate the overall loss change to every parameter of the network and measure how much each one contributes? We would then see which parameters decrease the total loss and help convergence, and which increase the loss and hinder SGD from converging to a local minimum. The paper follows this idea and measures the loss change for every layer, every channel, and even every neuron. What interesting conclusions does the method yield?

Some Useful Conclusions of LCA

  • We find that barely over 50% of parameters help during any given iteration.
  • Some entire layers hurt overall, moving on average against the training gradient, a phenomenon we hypothesize may be due to phase lag in an oscillatory training process.
  • Finally, increments in learning proceed in a synchronized manner across layers, often peaking on identical iterations.

What is the LCA?

  The authors propose a new window into training called Loss Change Allocation (LCA), in which credit for changes to the network loss is conservatively partitioned to the parameters. This measurement is accomplished by decomposing the components of an approximate path integral along the training trajectory using a Runge-Kutta (RK4) integrator, one of an important family of implicit and explicit iterative methods for solving nonlinear ordinary differential equations.
  In short, it is a simple approach to inspecting training in progress by decomposing changes in the overall network loss into a per-parameter Loss Change Allocation, or LCA.

Methods

  This rich view shows which parameters are responsible for decreasing or increasing the loss during training, that is, which parameters “help” or “hurt” the network’s learning, respectively, and it quantifies the loss change for each parameter.
[Figure 1: a toy example of LCA with two parameters.]
  A parameter with negative LCA is “helping” or “learning”. A parameter with positive LCA is “hurting” the learning process, which may result from several causes: a noisy mini-batch with the gradient of that step going the wrong way, momentum, or a step size that is too large for a curvy or rugged loss landscape. If a parameter has a non-zero gradient but does not move, it does not affect the loss. Figure 1 depicts a toy example using two parameters.

  Consider a parameterized training scenario where a model starts at parameter value $\theta_0$ and ends at parameter value $\theta_T$ after training. The training process entails traversing some path $P$ along the surface of a loss landscape from $\theta_0$ to $\theta_T$. The loss change derives from a straightforward application of the fundamental theorem of calculus to a path integral along the loss landscape:
$$L(\theta_T) - L(\theta_0) = \int_C \langle \nabla_\theta L(\theta), d\theta \rangle \quad \quad (1)$$
where $C$ is any path from $\theta_0$ to $\theta_T$ and $\langle \cdot, \cdot \rangle$ is the dot product. This equation states that the change in loss from $\theta_0$ to $\theta_T$ may be calculated by integrating the dot product of the loss gradient and parameter motion along a path from $\theta_0$ to $\theta_T$. Because $\nabla_\theta L(\theta)$ is the gradient of a function and thus is a conservative field, any path from $\theta_0$ to $\theta_T$ may be used. In a conservative field, a line integral depends only on its start and end points and not on the path taken, just like the work done by gravity. A conservative field also has zero curl everywhere (it is irrotational), but only the first property is needed here.

  We may approximate this path integral from $\theta_0$ to $\theta_T$ by using a series of first-order Taylor approximations along the training path. If we index training steps by $t \in [0, 1, \dots, T]$, the first-order approximation for the change in loss during one step of training is the following, rewritten as a sum of its individual components:
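$$L(\theta_{t+1}) - L(\theta_t) \approx \langle \nabla_\theta L(\theta_t), \theta_{t+1} - \theta_t \rangle \quad \quad (2)$$
$$= \sum_{i=0}^{K-1} (\nabla_\theta L(\theta_t))^{(i)} (\theta_{t+1}^{(i)} - \theta_t^{(i)}) := \sum_{i=0}^{K-1} A_{t,i} \quad \quad (3)$$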
where $\nabla_\theta L(\theta_t)$ represents the gradient of the loss of the whole training set w.r.t. $\theta$ evaluated at $\theta_t$, $v^{(i)}$ represents the $i$-th element of a vector $v$, and the parameter vector $\theta$ contains $K$ elements. Here $t$ indexes training steps (iterations), not epochs. Note that we evaluate model learning by tracking progress along the training-set loss landscape $L(\theta)$.

  As shown in Equation 3, the difference in loss produced by one training iteration $t$ may be decomposed into $K$ individual Loss Change Allocation, or LCA, components, denoted $A_{t,i}$. These $K$ components represent the LCA for a single iteration of training, and over the course of $T$ iterations of training (with SGD or Adam, for example) we will collect a large $T \times K$ matrix of $A_{t,i}$ values.
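  The first-order allocation in Equations 2 and 3 can be sketched in a few lines of plain Python. This is only an illustration under assumptions that are not from the paper: the two-parameter quadratic `loss`, its analytic `grad`, full-batch gradient descent, and all function names are hypothetical.

```python
def loss(theta):
    # Hypothetical toy loss with K = 2 parameters (not from the paper).
    x, y = theta
    return 0.5 * x * x + 2.0 * y * y

def grad(theta):
    # Analytic gradient of the toy loss above.
    x, y = theta
    return [x, 4.0 * y]

def gradient_descent_path(theta0, lr, steps):
    # Record theta_0 ... theta_T for plain full-batch gradient descent.
    path = [list(theta0)]
    for _ in range(steps):
        theta = path[-1]
        path.append([p - lr * g for p, g in zip(theta, grad(theta))])
    return path

def first_order_lca(path):
    # A[t][i] = grad_i(theta_t) * (theta_{t+1, i} - theta_{t, i}), as in Equation 3.
    A = []
    for theta_t, theta_next in zip(path, path[1:]):
        g = grad(theta_t)
        A.append([gi * (b - a) for gi, a, b in zip(g, theta_t, theta_next)])
    return A

path = gradient_descent_path([1.0, 1.0], lr=0.1, steps=50)
A = first_order_lca(path)                      # T x K matrix (50 x 2)
total_lca = sum(sum(row) for row in A)         # approximate total loss change
true_change = loss(path[-1]) - loss(path[0])   # exact total loss change
per_param = [sum(row[i] for row in A) for i in range(2)]
print(len(A), len(A[0]), total_lca, true_change, per_param)
```

  In this noise-free toy run every entry of `A` is non-positive, because each parameter moves exactly against its own gradient. The summed first-order LCA also overshoots the true loss change, which is the approximation error that the higher-order gradient averaging is meant to reduce.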

Property of LCA

  LCA stands in contrast to approaches that measure quantities like parameter motion or approximate elements of the Fisher information (FI), which also produce per-parameter measurements but depend heavily on the parameterization chosen. For example, the FI metric is sensitive to scale: multiply the weights of one ReLU layer by 2 and those of the next layer by 0.5, and the loss stays the same, but the FI of each layer changes and so does the total FI.
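  The rescaling example can be checked numerically. The sketch below uses a hypothetical two-layer ReLU network with made-up weights. As a crude stand-in for a parameterization-dependent quantity it uses the squared gradient of the output with respect to the second-layer weights. This is an assumption for illustration only and is not the Fisher information itself.

```python
def relu(v):
    return [max(0.0, x) for x in v]

def matvec(W, v):
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def forward(W1, W2, x):
    # Two-layer network: output = W2 * relu(W1 * x).
    hidden = relu(matvec(W1, x))
    return matvec(W2, hidden), hidden

# Hypothetical weights and input.
W1 = [[0.5, -1.0], [1.5, 2.0]]
W2 = [[1.0, -0.5]]
x = [0.3, 0.7]

# Multiply the first layer by 2 and the second layer by 0.5.
W1_scaled = [[2.0 * w for w in row] for row in W1]
W2_scaled = [[0.5 * w for w in row] for row in W2]

out, hidden = forward(W1, W2, x)
out_scaled, hidden_scaled = forward(W1_scaled, W2_scaled, x)

# d(output) / d(W2[0][j]) equals hidden[j], so the squared-gradient proxy is:
sq_grad = sum(h * h for h in hidden)
sq_grad_scaled = sum(h * h for h in hidden_scaled)
print(out, out_scaled, sq_grad, sq_grad_scaled)
```

  The two networks compute the same output, so any loss built on that output is unchanged, while the squared-gradient proxy for the second layer grows by a factor of 4.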

  We can improve on our LCA approximation from Equation 2 by replacing $\nabla_\theta L(\theta_t)$ with $\frac{1}{6}(\nabla_\theta L(\theta_t) + 4\nabla_\theta L(\frac{1}{2}\theta_t + \frac{1}{2}\theta_{t+1}) + \nabla_\theta L(\theta_{t+1}))$, with the (1, 4, 1) coefficients coming from the fourth-order Runge–Kutta method (RK4) or equivalently from Simpson’s rule. Evaluating the gradient at the midpoint in this way gives a better approximation.

  Using a midpoint gradient doubles computation but shrinks accumulated error drastically, from first order to fourth order. If the error is still too large, we can halve the step size with composite Simpson’s rule by calculating gradients at $\frac{3}{4}\theta_t + \frac{1}{4}\theta_{t+1}$ and $\frac{1}{4}\theta_t + \frac{3}{4}\theta_{t+1}$ as well. We halve the step size until the absolute error of change in loss per iteration is less than 0.001, and we ensure that the cumulative error at the end of training is less than 1%.
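  A minimal sketch of the (1, 4, 1) gradient averaging and the step-halving loop follows. It rests on assumptions not taken from the paper: the toy `loss` and `grad`, full-batch gradient descent, the function names, and the choice to measure the error as the gap between the summed LCA and the true loss change of that step.

```python
import math

def loss(theta):
    # Hypothetical non-polynomial toy loss with K = 2 parameters.
    x, y = theta
    return math.log(math.cosh(x)) + 0.5 * y * y + 0.5 * x * x * y * y

def grad(theta):
    x, y = theta
    return [math.tanh(x) + x * y * y, y + x * x * y]

def lerp(a, b, s):
    # Point at fraction s on the straight segment from a to b.
    return [(1.0 - s) * ai + s * bi for ai, bi in zip(a, b)]

def simpson_lca_step(theta_t, theta_next, segments=1):
    # Composite Simpson average of the gradient along the step.
    # segments=1 gives the (1, 4, 1) weights; segments=2 adds the quarter points.
    # Shared endpoints are re-evaluated for simplicity.
    K = len(theta_t)
    avg_grad = [0.0] * K
    for j in range(segments):
        lo, hi = j / segments, (j + 1) / segments
        for s, w in ((lo, 1.0), (0.5 * (lo + hi), 4.0), (hi, 1.0)):
            g = grad(lerp(theta_t, theta_next, s))
            for i in range(K):
                avg_grad[i] += w * g[i] / (6.0 * segments)
    return [avg_grad[i] * (theta_next[i] - theta_t[i]) for i in range(K)]

def adaptive_lca_step(theta_t, theta_next, tol=1e-3, max_segments=64):
    # Halve the sub-step size until the summed LCA matches the true loss change.
    true_change = loss(theta_next) - loss(theta_t)
    segments = 1
    while True:
        lca = simpson_lca_step(theta_t, theta_next, segments)
        if abs(sum(lca) - true_change) < tol or segments >= max_segments:
            return lca, segments
        segments *= 2

# Training path from plain gradient descent (hypothetical settings).
path = [[1.5, 1.0]]
for _ in range(30):
    theta = path[-1]
    path.append([p - 0.1 * g for p, g in zip(theta, grad(theta))])

first_order_total = 0.0
simpson_total = 0.0
for theta_t, theta_next in zip(path, path[1:]):
    g = grad(theta_t)
    first_order_total += sum(gi * (b - a) for gi, a, b in zip(g, theta_t, theta_next))
    lca, _ = adaptive_lca_step(theta_t, theta_next)
    simpson_total += sum(lca)

true_total = loss(path[-1]) - loss(path[0])
print(first_order_total, simpson_total, true_total)
```

  On this toy problem the Simpson-weighted total tracks the true loss change far more closely than the first-order total does.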

Experiments

Learning is very noisy

Although it is commonly held that noise is inherent in SGD-based neural network training, and is even considered beneficial, we find it surprising that on average almost half of the parameters are hurting in every training iteration. Moreover, each parameter, including ones that help in total, hurts almost half of the time.

Parameters that help (decrease the loss) at a given time are shown as shades of green. Parameters that hurt (increase the loss) are shown as shades of red.

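In the figure above, for both MNIST-FC and LeNet, almost everything is green at iteration 1, red and green are roughly balanced at iteration 20, and red dominates at iteration 220, where the changes are also smaller. (The FC layer is 100x784 and the LeNet layer is 40x20; the figure shows only the upper-left portion.)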
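Given a T × K matrix of LCA values, the "almost half are hurting" statistics can be computed directly. The sketch below uses a small hand-made matrix with hypothetical values, and it assumes that a parameter with exactly zero LCA counts as not helping.

```python
def fraction_helping_per_iteration(A):
    # For each iteration t, the share of parameters with negative LCA.
    return [sum(1 for a in row if a < 0) / len(row) for row in A]

def fraction_of_time_helping(A):
    # For each parameter i, the share of iterations in which its LCA is negative.
    T, K = len(A), len(A[0])
    return [sum(1 for t in range(T) if A[t][i] < 0) / T for i in range(K)]

def net_helping_parameters(A):
    # Indices of parameters whose LCA summed over all iterations is negative.
    T, K = len(A), len(A[0])
    return [i for i in range(K) if sum(A[t][i] for t in range(T)) < 0]

# Hypothetical LCA matrix: 3 iterations (rows) x 4 parameters (columns).
# The last parameter never moves, so its LCA is always zero.
A = [
    [-0.30,  0.10, -0.05, 0.0],
    [ 0.20, -0.05, -0.02, 0.0],
    [-0.25,  0.05,  0.01, 0.0],
]

print(fraction_helping_per_iteration(A))
print(fraction_of_time_helping(A))
print(net_helping_parameters(A))
```

In this toy matrix, parameters 0 and 2 help in total yet still hurt in one of the three iterations, which mirrors the observation above on a very small scale.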

Barely over 50% of parameters help during training

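Panel (a) of the figure shows that the FC network contains large blocks of weights with zero motion. This is because the MNIST dataset contains many zero-valued pixels, so the first layer learns very little for those pixels. Panel (b) shows the distribution of helping and hurting weights (normalized). Panel (c) shows the percentage of helping and hurting weights, which stays around 50%. Panel (d) is a histogram of the number of helping parameters over the course of training: the count is highest around 50% and falls off gradually from there.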

Parameters alternate helping

  The averages over the entire network are 741.9 for weight turns and 525.8 for gradients crossing zero. Note that the first and last layers oscillate more than their neighboring layers, which is interesting given that those layers hurt, but this is only a correlation as oscillations do not explain why something would bias towards helping or hurting.

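The figure shows how parameters and gradients oscillate (change direction) on CIFAR-ResNet, so the counts for weights and gradients can be compared. The table reports the oscillation frequency: how often a weight switches direction and how often a gradient crosses zero.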
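Both oscillation counts reduce to counting sign changes in a sequence. The following sketch shows one way to do that for a single parameter. The function names and the example trajectories are hypothetical, and steps with exactly zero motion or zero gradient are skipped by assumption.

```python
def count_sign_changes(values):
    # Count how often consecutive non-zero values differ in sign.
    signs = [v > 0 for v in values if v != 0]
    return sum(1 for a, b in zip(signs, signs[1:]) if a != b)

def weight_turns(weights):
    # A weight "turns" when its step direction flips between iterations.
    deltas = [b - a for a, b in zip(weights, weights[1:])]
    return count_sign_changes(deltas)

def gradient_zero_crossings(grads):
    # The gradient of one parameter crosses zero when its sign flips.
    return count_sign_changes(grads)

# Hypothetical trajectory of one weight and of its gradient.
weights = [0.0, 0.1, 0.3, 0.2, 0.25, 0.25, 0.1]
grads = [-1.0, -0.5, 0.2, 0.1, -0.3]

print(weight_turns(weights), gradient_zero_crossings(grads))
```

Averaging these counts over all parameters of a layer or of the whole network would give per-layer or network-wide figures of the kind quoted above.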

Noise persists across various hyperparameters.

Changing the learning rate, momentum, or batch size has only a slight effect on the percent of parameters helping (Table 1).

Learning is heavy-tailed

A reasonable mental model of the distribution of LCA might be a narrow Gaussian around the mean, but, as the section title says, the measured distribution is heavy-tailed (Figure 3(b)).

Some layers hurt overall

The first and last layers of MNIST-FC and LeNet always hurt.

Freezing the first layer stops it from hurting but causes others to help less.

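Figure caption. Left: LCA summed over all of training for each layer of CIFAR–ResNet trained with SGD. Bias and batch-norm layers are merged into their corresponding kernel layers. Blue is the regular run configuration; the other colors show various experiments on the first layer. When the first layer uses a learning rate 10x smaller than the other layers (orange), the per-layer LCA changes little. In the “freeze first layer” run (green), the first layer no longer hurts (its parameters are frozen from the beginning), but the other layers, especially the next two, do not help as much. A similar effect appears when we freeze the first layer at its LCA argmin (red): when we force the first layer's LCA to be negative, the LCA of the other layers becomes slightly more positive, cancelling out any improvement. Middle: training loss for each run configuration, with standard deviation. Right: a typical cumulative trajectory of first-layer learning, which helps during the first few hundred iterations and then hurts more and more. “Freeze first layer at min” lets the layer help before it is frozen, but this still does not improve performance.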

Freezing the last layer results in significant improvement.

Decreasing the learning rate of the last layer by 10x (0.01 as opposed to 0.1 for other layers) results in similar behavior as freezing it. These experiments are consistent with findings in [12] and [8], which demonstrate that you can freeze the last layer in some networks without degrading performance. With LCA, we are now able to provide an explanation for when and why this phenomenon happens. The instability of the last layer at the start of training can also be measured by LCA, as the LCA of the last layer is typically high in the first few iterations.

As the last layer helps more, the other layers hurt more because they are relatively more delayed. The LCA of the last layer is fairly linear with respect to the delay.

Phase shift hypothesis

Is the last layer phase-lagged?

A mini-batch gradient is an unbiased estimate of the gradient over the whole dataset, so the explanation has to be sought beyond the learning rate and noise.

We hypothesize that the last layer may be phase-lagged with respect to other layers during learning. Intuitively, it may be that while all layers are oscillating during learning, the last layer is always a bit behind. As each parameter swings back and forth across its valley, the shape of its valley is affected by the motion of all other parameters.

If one parameter is frozen and all other parameters are trained infinitesimally slowly, that parameter's valley will tend to flatten out. This means that if it had climbed a valley wall (hurting the loss), it will not be able to fully recover the LCA in the negative direction, as the steep region has been flattened. If the last layer reacts slower than the others, its own valley walls may tend to be flattened before it can react.

As we give the last layer an information freshness advantage, it begins to “steal progress” from other layers, eventually forcing the neighboring layers into positive (hurting) LCA.

These results suggest that it may be profitable to view training as a fundamentally oscillatory process upon which much research in phase-space representations and control system design may come to bear.

CIFAR–AllCNN trained with Adam does not have any hurting layers. We note that layers hurting is not a universal phenomenon that will be observed in all networks, but when it does occur, LCA can identify it. By using LCA we may identify layers as potential candidates to freeze. Further, viewing training through the lens of information delay seems valid, which suggests that per-layer optimization adjustments may be beneficial.

Learning is synchronized across layers

From the per-layer LCA summed across all iterations, we learned that layers tend to have their own distinct, consistent behaviors regarding hurting or helping.

We further examine the per-layer LCA during training, equivalent to studying individual “loss curves” for each layer, and discover that the exact moments where learning peaks are curiously synchronized across layers. Such synchronization is not driven by gradients alone or by parameter motion alone, but by both.

Peak learning iterations by layer and by class on MNIST–FC. We define “moments of learning” as temporal spikes in an instantaneous LCA curve, that is, local minima where the loss decreased more on that iteration than on the iteration before or after, and show the top 20 such moments (highest magnitude of LCA) for each layer in the figure above. We further decompose this metric by class (10 for both MNIST and CIFAR), where the same moments of learning are identified on per-class, per-layer LCAs. Whenever learning is synchronized across layers (dots that are vertically aligned), the dots are marked in red. The large proportion of red aligned stacks suggests that learning is very locally synchronized across layers.
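The definition of “moments of learning” translates into a short peak-finding routine. The sketch below rests on assumptions: the per-layer LCA curves are made-up numbers, and a moment is treated as synchronized only when it appears in the top-k list of every layer, which is stricter than the vertical alignment described above.

```python
def moments_of_learning(lca_curve, top_k):
    # Local minima: iterations where LCA is lower than on both neighbors.
    minima = [
        t for t in range(1, len(lca_curve) - 1)
        if lca_curve[t] < lca_curve[t - 1] and lca_curve[t] < lca_curve[t + 1]
    ]
    # Keep the top_k most negative ones (largest loss decrease).
    minima.sort(key=lambda t: lca_curve[t])
    return sorted(minima[:top_k])

def synchronized_moments(per_layer_curves, top_k):
    # Iterations that are moments of learning in every layer (assumed criterion).
    moment_sets = [set(moments_of_learning(c, top_k)) for c in per_layer_curves]
    return sorted(set.intersection(*moment_sets))

# Hypothetical per-iteration LCA summed over each of two layers.
layer_a = [-0.1, -0.5, -0.1, -0.2, -0.9, -0.3, -0.1]
layer_b = [-0.2, -0.1, -0.3, -0.1, -0.6, -0.2, -0.2]

print(moments_of_learning(layer_a, top_k=2))
print(moments_of_learning(layer_b, top_k=2))
print(synchronized_moments([layer_a, layer_b], top_k=2))
```

Running the same routine on per-class, per-layer curves would give the class-wise breakdown described in the caption.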

We might find different behavior in other architectures such as transformer models or recurrent neural nets, which could be of interest for future work.

Appendix for this blog

Simpson’s Rule

  In Simpson’s Rule, we use parabolas to approximate each part of the curve. This proves to be very efficient, since it is generally more accurate than numerical methods that approximate the curve with straight lines, such as the trapezoidal rule.
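We divide the area into $n$ equal segments of width $\Delta x$. The approximate area is given by the following:
$$\int_a^b f(x)\,dx \approx \frac{\Delta x}{3}\left(y_0 + 4y_1 + 2y_2 + 4y_3 + 2y_4 + \dots + 4y_{n-1} + y_n\right)$$
Note: In Simpson’s Rule, $n$ must be EVEN.
We can re-write Simpson’s Rule by grouping it as follows:
$$\int_a^b f(x)\,dx \approx \frac{\Delta x}{3}\left[y_0 + 4(y_1 + y_3 + y_5 + \dots) + 2(y_2 + y_4 + y_6 + \dots) + y_n\right]$$
This gives us an easy way to remember Simpson’s Rule:
$$\int_a^b f(x)\,dx \approx \frac{\Delta x}{3}\left[\text{FIRST} + 4(\text{sum of ODDs}) + 2(\text{sum of EVENs}) + \text{LAST}\right]$$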
reference in here

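Another advantage of Simpson's Rule is that it leads naturally to an algorithm that iterates until the integral reaches the required accuracy. When the limits of integration are symmetric about the center of the expansion, every term in the Taylor expansion of the integral that contains an odd-order derivative of f(x) vanishes. Using this property, we can expand the area over two adjacent subintervals as a Taylor series.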
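The “FIRST + 4(sum of ODDs) + 2(sum of EVENs) + LAST” grouping maps directly onto code. The function below is a minimal sketch of the composite rule. The name `simpson` and the test integrands are illustrative choices.

```python
import math

def simpson(f, a, b, n):
    # Composite Simpson's Rule on [a, b] with n equal segments (n must be even).
    if n <= 0 or n % 2 != 0:
        raise ValueError("n must be a positive even integer")
    dx = (b - a) / n
    ys = [f(a + k * dx) for k in range(n + 1)]
    odds = sum(ys[1:n:2])    # y_1, y_3, ..., y_{n-1}
    evens = sum(ys[2:n:2])   # y_2, y_4, ..., y_{n-2}
    return dx / 3.0 * (ys[0] + 4.0 * odds + 2.0 * evens + ys[n])

# The integral of sin(x) over [0, pi] is exactly 2.
print(simpson(math.sin, 0.0, math.pi, 10))

# Simpson's Rule is exact for cubics: the integral of x^3 over [0, 2] is 4.
print(simpson(lambda x: x ** 3, 0.0, 2.0, 2))
```

Doubling `n` and comparing successive results is one simple way to iterate until a required accuracy is reached.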

Runge-Kutta (RK4) Method

Note that the RK4 method itself is not explained in the original paper, so you can skip this section without affecting your understanding of the paper.

  Runge-Kutta methods such as RK4 are used to numerically solve nonlinear ordinary differential equations. Here we briefly summarize the method.
  Consider the initial value problem (IVP):
$$y' = f(t, y), \quad y(t_0) = y_0 \quad \quad (1)$$
  The RK4 update is given by:
$$y_{n+1} = y_n + \frac{h}{6}(k_1 + 2k_2 + 2k_3 + k_4) \quad \quad (2)$$
where $h$ is the step size and the $k_i$ are slopes evaluated at the start, the midpoint (twice), and the end of the time step. RK4 is one member of the more general family of Runge-Kutta methods.
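For readers who want to see the update in action, here is a minimal sketch of the classical RK4 step using the standard slope definitions, which the summary above does not spell out. The function names and test problems are illustrative.

```python
import math

def rk4_step(f, t, y, h):
    # One classical RK4 step for y' = f(t, y).
    k1 = f(t, y)                              # slope at the start
    k2 = f(t + h / 2.0, y + h / 2.0 * k1)     # slope at the midpoint, using k1
    k3 = f(t + h / 2.0, y + h / 2.0 * k2)     # slope at the midpoint, using k2
    k4 = f(t + h, y + h * k3)                 # slope at the end
    return y + h / 6.0 * (k1 + 2.0 * k2 + 2.0 * k3 + k4)

def rk4_solve(f, t0, y0, h, steps):
    # Apply rk4_step repeatedly and return the final value of y.
    t, y = t0, y0
    for _ in range(steps):
        y = rk4_step(f, t, y, h)
        t += h
    return y

# y' = y with y(0) = 1 has the exact solution y(1) = e.
approx_e = rk4_solve(lambda t, y: y, 0.0, 1.0, 0.1, 10)
print(approx_e, math.e)

# When f depends only on t, k2 equals k3 and the step reduces to
# Simpson's Rule with (1, 4, 1) weights.
print(rk4_step(lambda t, y: t ** 2, 0.0, 0.0, 1.0))
```

The second call illustrates why the (1, 4, 1) coefficients can be attributed either to RK4 or to Simpson's Rule: for an integrand that does not depend on $y$, the two coincide.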
