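When we optimize a model, we run the network to get predictions, compute the loss, and backpropagate to obtain the gradient $g_k$ (the negative gradient $-g_k$ is the descent direction). Given a step size (also called the learning rate) $\alpha_k$, the optimizer then updates the parameters; for plain gradient descent (GD) the update is $W_{k+1}=W_k-\alpha_k g_k$.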
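The key points in the sentence above are: choosing the loss, choosing the optimization algorithm, computing the gradient, and setting the learning rate. In code, the gradient comes directly from backpropagation through the framework's low-level routines, but it is worth remembering that GD uses only first-order gradients, whereas Newton's method uses second-order derivatives and quasi-Newton methods approximate them. How the learning rate is set depends on the optimizer. This article focuses on the choice of optimization algorithm.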

Deep learning optimization algorithms have developed roughly along the path SGD -> SGDM -> NAG -> AdaGrad -> AdaDelta -> RMSprop -> Adam -> Nadam -> AMSGrad.

Gradient descent variants

There are three variants of gradient descent, which differ in how much data we use to compute the gradient of the objective function.
BGD (the entire training dataset)

SGD (one training example)

Mini-batch (part of the training examples)
Note: the term SGD is usually also used when mini-batches are employed.
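
As a minimal sketch of how the three variants differ only in the amount of data per update, the example below fits a line to toy data. The helper names `grad_mse` and `train`, the toy dataset, and the hyperparameters are all assumptions made for illustration; they do not come from any particular library.

```python
import random

def grad_mse(w, b, batch):
    """Gradient of the mean squared error of y ~ w*x + b over a batch of (x, y) pairs."""
    n = len(batch)
    gw = sum(2 * (w * x + b - y) * x for x, y in batch) / n
    gb = sum(2 * (w * x + b - y) for x, y in batch) / n
    return gw, gb

def train(data, batch_size, lr=0.05, epochs=200, seed=0):
    """batch_size == len(data) gives BGD, 1 gives SGD, anything in between is mini-batch."""
    rng = random.Random(seed)
    data = list(data)  # work on a copy so the caller's list is not shuffled
    w, b = 0.0, 0.0
    for _ in range(epochs):
        rng.shuffle(data)
        for start in range(0, len(data), batch_size):
            gw, gb = grad_mse(w, b, data[start:start + batch_size])
            w -= lr * gw
            b -= lr * gb
    return w, b

# Hypothetical noise-free toy data drawn from y = 2x + 1.
data = [(k / 10, 2 * (k / 10) + 1) for k in range(-10, 11)]
w_bgd, b_bgd = train(data, batch_size=len(data))  # one update per epoch
w_sgd, b_sgd = train(data, batch_size=1)          # one update per example
w_mb, b_mb = train(data, batch_size=4)            # one update per mini-batch
```

With the same number of epochs, the smaller the batch, the more parameter updates are performed, at the cost of noisier individual gradients.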

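Shared drawbacks (challenges):
1. They are sensitive to the learning rate: too small and training is very slow, too large and the objective function can diverge.
2. The same learning rate is applied to every parameter. This is especially inconvenient for sparse data, where we would rather use a smaller step size for frequently occurring features and a larger step size for rare ones.
3. Gradient descent essentially searches for stationary points (points where the derivative of the objective with respect to the parameters is 0), and these come in three kinds: maxima, minima, and saddle points. High-dimensional non-convex objectives contain a large number of saddle points, so gradient descent easily gets stuck near one and can stay there for a long time.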

Momentum

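Gradient descent with momentum (SGDM)
Momentum is a method that helps accelerate SGD in the relevant direction and dampens oscillations. It adds a fraction $\gamma$ of the previous update vector to the current one:
$v_t = \gamma v_{t-1} + \eta \nabla_\theta J(\theta)$
$\theta = \theta - v_t$
where $\gamma$ is usually set to 0.9 or a similar value.

Learning is slower in the vertical (oscillating) direction, which reduces oscillation, and faster in the horizontal direction, which speeds up convergence. (The momentum term increases for dimensions whose gradients point in the same direction and reduces updates for dimensions whose gradients change direction. As a result, we gain faster convergence and reduced oscillation.)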
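
The effect can be illustrated with a small sketch, assuming the usual momentum form $v_t = \gamma v_{t-1} + \eta g_t$, $\theta = \theta - v_t$. The elongated quadratic objective and the function names `grad`, `sgd_plain`, and `sgd_momentum` are hypothetical choices for the demonstration.

```python
def grad(theta):
    # Gradient of a toy objective J = 0.5 * (theta[0]**2 + 25 * theta[1]**2):
    # shallow along theta[0], steep along theta[1].
    return [theta[0], 25.0 * theta[1]]

def sgd_plain(theta, eta=0.02, steps=100):
    for _ in range(steps):
        g = grad(theta)
        theta = [t - eta * gi for t, gi in zip(theta, g)]
    return theta

def sgd_momentum(theta, eta=0.02, gamma=0.9, steps=100):
    v = [0.0] * len(theta)  # accumulated update vector
    for _ in range(steps):
        g = grad(theta)
        v = [gamma * vi + eta * gi for vi, gi in zip(v, g)]
        theta = [t - vi for t, vi in zip(theta, v)]
    return theta

plain = sgd_plain([1.0, 1.0])
with_momentum = sgd_momentum([1.0, 1.0])
```

After the same number of steps, the momentum run ends up much closer to the minimum along the shallow coordinate, where plain gradient descent makes slow progress.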

Nesterov accelerated gradient(NAG)

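We can now effectively look ahead by calculating the gradient not w.r.t. our current parameters $\theta$ but w.r.t. the approximate future position of our parameters, $\theta - \gamma v_{t-1}$:
$v_t = \gamma v_{t-1} + \eta \nabla_\theta J(\theta - \gamma v_{t-1})$
$\theta = \theta - v_t$

In the figure, the blue vectors at the top show Momentum: it first computes the current gradient (small blue vector) and then takes a big jump in the direction of the updated accumulated gradient (big blue vector).
The vectors at the bottom show NAG: it first takes a big jump in the direction of the previously accumulated gradient (brown vector), then measures the gradient there (red vector) and makes a correction (green vector). This anticipatory update keeps us from moving too fast and makes the method more responsive.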
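
A minimal sketch of the look-ahead step, assuming the common formulation in which the gradient is evaluated at $\theta - \gamma v_{t-1}$. The toy objective and the names `grad` and `nag` are illustrative assumptions.

```python
def grad(theta):
    # Gradient of a toy objective J = 0.5 * (theta[0]**2 + 25 * theta[1]**2).
    return [theta[0], 25.0 * theta[1]]

def nag(theta, eta=0.02, gamma=0.9, steps=100):
    v = [0.0] * len(theta)
    for _ in range(steps):
        # Big jump along the previously accumulated direction (the look-ahead point).
        lookahead = [t - gamma * vi for t, vi in zip(theta, v)]
        # Gradient measured at the look-ahead point, not at the current parameters.
        g = grad(lookahead)
        # Accumulate and apply the corrected update.
        v = [gamma * vi + eta * gi for vi, gi in zip(v, g)]
        theta = [t - vi for t, vi in zip(theta, v)]
    return theta

result = nag([1.0, 1.0])
```

The only difference from plain momentum is where the gradient is evaluated; everything else in the loop is unchanged.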

Adagrad

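Previously we used the same learning rate to update every parameter (possibly with learning rate decay later in training). Adagrad instead adapts the learning rate to each parameter, performing larger updates for infrequent parameters and smaller updates for frequent ones, which makes it well suited to sparse data.
At step $t$, the gradient of the objective function with respect to parameter $\theta_i$ is:
$g_{t,i} = \nabla_{\theta} J(\theta_{t,i})$
SGD would then update each parameter $\theta_i$ as:
$\theta_{t+1,i} = \theta_{t,i} - \eta \cdot g_{t,i}$
Adagrad instead modifies the general learning rate $\eta$ for each parameter $\theta_i$ based on its past gradients:
$\theta_{t+1,i} = \theta_{t,i} - \frac{\eta}{\sqrt{G_{t,ii}+\epsilon}} \cdot g_{t,i}$
or, in vectorized form, $\theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{G_t+\epsilon}} \odot g_t$.
Here $G_t$ is a diagonal matrix whose diagonal element $G_{t,ii}$ is the sum of the squares of the gradients with respect to $\theta_i$ up to step $t$. As a result, the learning rate largely no longer needs to be tuned by hand.
Drawback: every squared-gradient term added to $G$ is non-negative, so $G$ keeps growing during training. The effective learning rate therefore shrinks until it becomes infinitesimally small, at which point the algorithm can no longer learn anything new.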
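
The per-parameter scaling can be sketched as follows. Only the diagonal of $G_t$ is stored, as a list; the toy objective, the function names, and the value `eta=0.5` are assumptions chosen for the demonstration.

```python
import math

def grad(theta):
    # Gradient of a toy objective J = 0.5 * (theta[0]**2 + 25 * theta[1]**2).
    return [theta[0], 25.0 * theta[1]]

def adagrad(grad_fn, theta, eta=0.5, eps=1e-8, steps=200):
    G = [0.0] * len(theta)  # diagonal of G_t: running sum of squared gradients
    for _ in range(steps):
        g = grad_fn(theta)
        G = [Gi + gi * gi for Gi, gi in zip(G, g)]
        theta = [t - eta / math.sqrt(Gi + eps) * gi
                 for t, Gi, gi in zip(theta, G, g)]
    return theta, G

theta, G = adagrad(grad, [1.0, 1.0])
# Effective per-parameter learning rates at the end of the run.
effective_lr = [0.5 / math.sqrt(Gi + 1e-8) for Gi in G]
```

The coordinate with the larger gradients accumulates a larger entry in `G` and therefore receives the smaller effective learning rate. Because `G` never decreases, those rates can only shrink over time, which is the drawback noted above.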

Adadelta

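Adadelta replaces the diagonal matrix $G_t$ with the decaying average over past squared gradients $E[g^2]_t$. That is, the Adagrad update
$\Delta\theta_t = -\frac{\eta}{\sqrt{G_t+\epsilon}} \odot g_t$
is replaced by
$\Delta\theta_t = -\frac{\eta}{\sqrt{E[g^2]_t+\epsilon}} \odot g_t$
with
$E[g^2]_t = \gamma E[g^2]_{t-1} + (1-\gamma) g_t^2$
where $\gamma$ is set to a similar value as the momentum term, around 0.9.
The units in this update (as in SGD, Momentum, and Adagrad) do not match: the update should have the same hypothetical units as the parameter. To fix this, a second decaying average is kept, this time of squared parameter updates rather than squared gradients:
$E[\Delta\theta^2]_t = \gamma E[\Delta\theta^2]_{t-1} + (1-\gamma) \Delta\theta_t^2$
This raises another problem: $E[\Delta\theta^2]_t$ is not known at step $t$, because it depends on the update $\Delta\theta_t$ that is being computed. It is therefore approximated by its value from the previous step, which gives:
$\Delta\theta_t = -\frac{\sqrt{E[\Delta\theta^2]_{t-1}+\epsilon}}{\sqrt{E[g^2]_t+\epsilon}} \odot g_t$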
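
The final update rule can be sketched as below, assuming both running averages are exponential moving averages with the same decay $\gamma$. The toy objective, the function names, and `eps=1e-6` are illustrative assumptions.

```python
import math

def objective(theta):
    # Toy objective J = 0.5 * (theta[0]**2 + 25 * theta[1]**2).
    return 0.5 * (theta[0] ** 2 + 25.0 * theta[1] ** 2)

def grad(theta):
    return [theta[0], 25.0 * theta[1]]

def adadelta(grad_fn, theta, gamma=0.9, eps=1e-6, steps=500):
    Eg2 = [0.0] * len(theta)   # decaying average of squared gradients
    Edx2 = [0.0] * len(theta)  # decaying average of squared parameter updates
    for _ in range(steps):
        g = grad_fn(theta)
        Eg2 = [gamma * e + (1 - gamma) * gi * gi for e, gi in zip(Eg2, g)]
        # The numerator uses the average from the previous step, as in the formula above.
        delta = [-math.sqrt(d + eps) / math.sqrt(e + eps) * gi
                 for d, e, gi in zip(Edx2, Eg2, g)]
        Edx2 = [gamma * d + (1 - gamma) * dx * dx for d, dx in zip(Edx2, delta)]
        theta = [t + dx for t, dx in zip(theta, delta)]
    return theta

result = adadelta(grad, [1.0, 1.0])
```

Note that no learning rate $\eta$ appears in the function signature: the step size is determined entirely by the ratio of the two running averages, so early steps are small (on the order of $\sqrt{\epsilon}$) and grow as updates accumulate.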

RMSprop

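RMSprop and Adadelta were both developed to address Adagrad's rapidly diminishing learning rate. In fact, RMSprop is identical to the first update vector of Adadelta:
$E[g^2]_t = \gamma E[g^2]_{t-1} + (1-\gamma) g_t^2$
$\theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{E[g^2]_t+\epsilon}} \odot g_t$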
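
A minimal sketch of RMSprop, assuming the update divides the learning rate by the root of a decaying average of squared gradients. The toy objective, the function names, and the defaults `eta=0.01`, `gamma=0.9` are assumptions for illustration.

```python
import math

def grad(theta):
    # Gradient of a toy objective J = 0.5 * (theta[0]**2 + 25 * theta[1]**2).
    return [theta[0], 25.0 * theta[1]]

def rmsprop(grad_fn, theta, eta=0.01, gamma=0.9, eps=1e-8, steps=500):
    Eg2 = [0.0] * len(theta)  # decaying average of squared gradients
    for _ in range(steps):
        g = grad_fn(theta)
        Eg2 = [gamma * e + (1 - gamma) * gi * gi for e, gi in zip(Eg2, g)]
        theta = [t - eta / math.sqrt(e + eps) * gi
                 for t, e, gi in zip(theta, Eg2, g)]
    return theta

result = rmsprop(grad, [1.0, 1.0])
```

Unlike Adagrad's running sum, the decaying average forgets old gradients, so the effective learning rate does not shrink toward zero.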

Adam（Adaptive Moment Estimation）

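Adam = RMSprop + Momentum + bias correction.
In addition to storing an exponentially decaying average of past squared gradients $v_t$ like Adadelta and RMSprop, Adam also keeps an exponentially decaying average of past gradients $m_t$, similar to momentum. In other words, it combines the first moment (the mean of the gradients) with the second moment (the uncentered variance of the gradients):
$m_t = \beta_1 m_{t-1} + (1-\beta_1) g_t$
$v_t = \beta_2 v_{t-1} + (1-\beta_2) g_t^2$

Because $m_t$ and $v_t$ are initialized to 0, they are biased towards 0. The following bias-corrected estimates counteract this:
$\hat{m}_t = \frac{m_t}{1-\beta_1^t}$
$\hat{v}_t = \frac{v_t}{1-\beta_2^t}$

This finally gives the update rule:
$\theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{\hat{v}_t}+\epsilon} \hat{m}_t$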
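
The pieces described above (decaying averages $m_t$ and $v_t$, bias correction, then the parameter update) can be put together in a short sketch. The toy objective, the function names, and the hyperparameter defaults are assumptions for illustration; the demonstration uses a larger `eta` than the default so the toy problem converges in few steps.

```python
import math

def objective(theta):
    # Toy objective J = 0.5 * (theta[0]**2 + 25 * theta[1]**2).
    return 0.5 * (theta[0] ** 2 + 25.0 * theta[1] ** 2)

def grad(theta):
    return [theta[0], 25.0 * theta[1]]

def adam(grad_fn, theta, eta=0.001, beta1=0.9, beta2=0.999, eps=1e-8, steps=500):
    m = [0.0] * len(theta)  # decaying average of gradients (first moment)
    v = [0.0] * len(theta)  # decaying average of squared gradients (second moment)
    for t in range(1, steps + 1):
        g = grad_fn(theta)
        m = [beta1 * mi + (1 - beta1) * gi for mi, gi in zip(m, g)]
        v = [beta2 * vi + (1 - beta2) * gi * gi for vi, gi in zip(v, g)]
        # Bias correction: m and v start at 0 and are biased towards 0 early on.
        m_hat = [mi / (1 - beta1 ** t) for mi in m]
        v_hat = [vi / (1 - beta2 ** t) for vi in v]
        theta = [p - eta * mh / (math.sqrt(vh) + eps)
                 for p, mh, vh in zip(theta, m_hat, v_hat)]
    return theta

result = adam(grad, [1.0, 1.0], eta=0.1)
```

One consequence of the bias correction is visible on the very first step: $\hat{m}_1 = g_1$ and $\hat{v}_1 = g_1^2$, so each parameter moves by roughly $\eta$ regardless of how large its gradient is.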

AdaMax

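We use $u_t$ to denote the infinity norm-constrained $v_t$:
$u_t = \max(\beta_2 \cdot u_{t-1}, |g_t|)$

Replacing $\sqrt{\hat{v}_t}+\epsilon$ in the Adam update with $u_t$ then gives:
$\theta_{t+1} = \theta_t - \frac{\eta}{u_t} \hat{m}_t$

Suggested default values are $\eta = 0.002$, $\beta_1 = 0.9$, $\beta_2 = 0.999$.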
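
A minimal sketch of AdaMax, assuming $u_t$ is the running maximum $\max(\beta_2 u_{t-1}, |g_t|)$ and that it takes the place of Adam's $\sqrt{\hat{v}_t}+\epsilon$ denominator. The toy objective and function names are illustrative assumptions, and the guard against a zero denominator is an addition for the sketch rather than part of the formula.

```python
def grad(theta):
    # Gradient of a toy objective J = 0.5 * (theta[0]**2 + 25 * theta[1]**2).
    return [theta[0], 25.0 * theta[1]]

def adamax(grad_fn, theta, eta=0.002, beta1=0.9, beta2=0.999, steps=1000):
    m = [0.0] * len(theta)  # decaying average of gradients
    u = [0.0] * len(theta)  # exponentially weighted infinity norm of gradients
    for t in range(1, steps + 1):
        g = grad_fn(theta)
        m = [beta1 * mi + (1 - beta1) * gi for mi, gi in zip(m, g)]
        u = [max(beta2 * ui, abs(gi)) for ui, gi in zip(u, g)]
        m_hat = [mi / (1 - beta1 ** t) for mi in m]
        # Skip a coordinate whose u is still 0 (no gradient seen yet) to avoid dividing by zero.
        theta = [p - eta * mh / ui if ui > 0 else p
                 for p, mh, ui in zip(theta, m_hat, u)]
    return theta

result = adamax(grad, [1.0, 1.0])
```

Because $u_t$ is built with a max operation, it is not biased towards zero the way $v_t$ is, so only $m_t$ is bias-corrected here.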

Nadam

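Nadam = Adam + NAG.
The NAG update
$g_t = \nabla_{\theta_t} J(\theta_t - \gamma m_{t-1})$, $m_t = \gamma m_{t-1} + \eta g_t$, $\theta_{t+1} = \theta_t - m_t$

is modified to use the current momentum vector $m_t$ to look ahead:
$g_t = \nabla_{\theta_t} J(\theta_t)$, $m_t = \gamma m_{t-1} + \eta g_t$, $\theta_{t+1} = \theta_t - (\gamma m_t + \eta g_t)$
Finally, the expanded Adam update rule
$\theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{\hat{v}_t}+\epsilon} \left( \beta_1 \hat{m}_{t-1} + \frac{(1-\beta_1) g_t}{1-\beta_1^t} \right)$

becomes:
$\theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{\hat{v}_t}+\epsilon} \left( \beta_1 \hat{m}_t + \frac{(1-\beta_1) g_t}{1-\beta_1^t} \right)$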

AMSGrad

Visualization of algorithms

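We can see how the different algorithms move across the contours of a loss surface. They all start from the same point but take different paths to the minimum. Adagrad, Adadelta, and RMSprop head in the right direction from the very beginning and converge quickly; SGD finds the right direction but converges very slowly; SGD-M and NAG are both led off course at first but eventually correct themselves, with SGD-M carried further off course by its inertia than NAG.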

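This figure shows how the different algorithms behave at a saddle point. Here SGD, SGD-M, and NAG are all badly affected by the saddle point, although the latter two eventually escape it, while Adagrad, RMSprop, and Adadelta quickly find the right direction.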

As we can see, the adaptive learning-rate methods, i.e. Adagrad, Adadelta, RMSprop, and Adam, are most suitable and provide the best convergence for these scenarios.

Optimization (An overview of gradient descent optimization algorithms)
