Gradient Descent, Stochastic Gradient Descent, and Their Improvements

Question (155): When the training set is very large, what problems does classic gradient descent have, and how should it be improved?
Question (158): Why does stochastic gradient descent fail (perform poorly)?
Question (160): What modifications have researchers made to improve stochastic gradient descent? Which variant methods have been proposed, and what are their respective characteristics? (i.e., the variants of SGD)

Topic 1: gradient descent, stochastic gradient descent, mini-batch gradient descent

(Batch) gradient descent

Advantages

  • unbiased estimate of gradients
  • non-increasing trajectory if the loss function is convex

Disadvantages

  • may converge to a local minimum for non-convex problems
  • every update requires a full pass over the training data, which becomes prohibitively expensive when the training set is very large (the issue raised in Question (155))

Stochastic gradient descent

(Pseudo-code of SGD [2]; a runnable sketch follows the bullets below)
· Choose an initial parameter vector θ and learning rate α.
· Repeat until an approximate minimum is obtained:
	· Randomly shuffle the examples in the training set.
	· For i = 1, 2, ..., n, do:
		· θ := θ − α ∇L(θ, z_i).
  • SGD: randomly select one of the training samples at each iteration to update the parameters.
  • online gradient descent: use the most recent sample at each iteration. The samples may not be IID.
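To make the SGD pseudo-code above concrete, here is a minimal NumPy sketch of SGD on a squared loss. The synthetic linear-regression data, the loss, and the hyperparameter values are assumptions for illustration, not part of the original material.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic linear-regression data (an assumption for illustration).
X = rng.normal(size=(1000, 5))
true_theta = rng.normal(size=5)
y = X @ true_theta + 0.1 * rng.normal(size=1000)

def sgd(X, y, alpha=0.01, n_epochs=10):
    """Plain SGD on the per-sample loss L(theta, z_i) = 0.5 * (x_i . theta - y_i)^2."""
    n, d = X.shape
    theta = np.zeros(d)
    for _ in range(n_epochs):
        for i in rng.permutation(n):             # randomly shuffle the training set each epoch
            grad = (X[i] @ theta - y[i]) * X[i]  # gradient of the per-sample loss at theta
            theta -= alpha * grad                # theta := theta - alpha * dL(theta, z_i)/dtheta
    return theta

theta_hat = sgd(X, y)
```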

Advantages

  • The noisy update process can allow the model to avoid local minima [5]

Challenges

  • ravine (narrow and tall level curves / contours)
    Reason: In a ravine, the true gradient direction points down along the valley floor, and even a small deviation runs the iterate into a valley wall. The noisy gradient estimate therefore bounces back and forth between the two walls instead of descending quickly along the valley, making convergence unstable and slow [1].
  • saddle point
    Test: a saddle point occurs when at least one eigenvalue of the Hessian matrix is negative and the remaining eigenvalues are positive [4].
    Reason: In regions where the gradient is nearly zero, SGD cannot detect the small changes in the gradient, so the iterate stalls.

Disadvantages

  • requires a learning-rate decay schedule: with a fixed learning rate, the noisy updates keep the iterate oscillating around the minimum instead of converging

Mini-batch gradient descent

Practical issues

  • tuning of batch size: it is usually chosen to be a power of 2, such as 32, 64, 128, 256, 512, etc., because with such batch sizes, hardware such as GPUs achieves better run times and memory fit [4,5].

    • Tip: A good default for batch size might be 32 [5].

    … [batch size] is typically chosen between 1 and a few hundreds, e.g. [batch size] = 32 is a good default value, with values above 10 taking advantage of the speedup of matrix-matrix products over matrix-vector products [6].

    • key considerations: training stability and generalization
  • tuning of learning rate: a decaying learning-rate schedule is usually adopted: start with a relatively large learning rate, and once the error curve reaches a plateau, reduce the learning rate to make finer adjustments [1] (see the sketch after this list).

  • Tip: Tune batch size and learning rate after tuning all other hyperparameters [5].

  • Practitioners’ experience: batch size = 32; no more than 2-10 epochs
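As a sketch of the two tuning knobs above (batch size and a decaying learning rate), the following mini-batch loop uses a batch size of 32 and a simple step-decay schedule; the schedule, the squared loss, and all hyperparameter values are assumptions for illustration.

```python
import numpy as np

def minibatch_gd(X, y, batch_size=32, alpha0=0.1, decay=0.5, decay_every=10, n_epochs=50):
    """Mini-batch GD on a squared loss with a step-decay learning-rate schedule."""
    rng = np.random.default_rng(0)
    n, d = X.shape
    theta = np.zeros(d)
    for epoch in range(n_epochs):
        alpha = alpha0 * decay ** (epoch // decay_every)  # large steps first, finer steps later
        idx = rng.permutation(n)                          # reshuffle every epoch
        for start in range(0, n, batch_size):
            batch = idx[start:start + batch_size]
            residual = X[batch] @ theta - y[batch]
            grad = X[batch].T @ residual / len(batch)     # average gradient over the mini-batch
            theta -= alpha * grad
    return theta
```

With `batch_size=1` this reduces to SGD, and with `batch_size=n` to batch gradient descent.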

Advantages
compared with batch GD

  • with a batch size smaller than the full training set, it adds noise to the learning process, which helps improve generalisation ability [4]

compared with SGD

  • reduces the variance of the gradient estimate; however, the reduction is less than linear in the additional computation incurred [4]
  • takes advantage of efficient vectorised matrix operations [3]

Disadvantages

  • wanders around the minimum without ever converging exactly, due to the randomness in sampling
  • as a result, requires learning-rate decay to shrink the step size as the iterate approaches the minimum
  • additional hyperparameter – batch size

Additional practical issues and tips

  • in low dimensions, local minima are common; in high dimensions, saddle points are more common

Tip: scale the data if the features are on very different scales, as the level curves (contours) may otherwise be narrow and tall, making convergence slower.
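A minimal sketch of the scaling tip, assuming column-wise standardization with NumPy:

```python
import numpy as np

def standardize(X):
    """Scale each feature to zero mean and unit variance so the contours are less elongated."""
    mu = X.mean(axis=0)
    sigma = X.std(axis=0)
    sigma[sigma == 0.0] = 1.0          # guard against constant features
    return (X - mu) / sigma
```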

Topic 2: momentum, adaptive learning rate

Momentum

The essence of momentum is to incorporate historical information into the parameter update [9].

Intuitive understanding of why momentum methods could address the two problems of SGD

  • Ravine problem: the downhill force along the valley floor is steady, so the accumulated momentum keeps building and the iterate speeds up; the sideways forces keep switching direction, so their contributions to the momentum largely cancel, damping the back-and-forth oscillation.
  • Saddle-point problem: inertia keeps the iterate moving forward through the flat region.

$$
\begin{aligned}
\theta_{t+1} &= \theta_t + v_{t+1} \\
v_{t+1} &= \gamma v_t - \alpha \nabla f(\theta_t),
\end{aligned}
$$

where $\alpha$ denotes the learning rate; the decay rate / momentum factor $\gamma$ plays the role of friction on the accumulated velocity: $\gamma = 1$ retains all of the previous velocity (no friction), while $\gamma = 0$ discards it entirely and recovers plain SGD [11,7].
To put it simply, momentum acts both as a smoother and an accelerator [10].
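A minimal sketch of the momentum (heavy-ball) update above; the gradient function `grad_f`, the quadratic test objective, and the hyperparameter values are assumptions for illustration.

```python
import numpy as np

def momentum_step(theta, v, grad_f, alpha=0.01, gamma=0.9):
    """One heavy-ball update: v_{t+1} = gamma*v_t - alpha*grad f(theta_t); theta_{t+1} = theta_t + v_{t+1}."""
    v_new = gamma * v - alpha * grad_f(theta)
    return theta + v_new, v_new

# Example: minimise f(theta) = 0.5 * ||theta||^2, whose gradient is simply theta.
theta, v = np.array([5.0, -3.0]), np.zeros(2)
for _ in range(100):
    theta, v = momentum_step(theta, v, grad_f=lambda th: th)
```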

AdaGrad

Core: different learning rates on different parameters
Motivation: sparse data makes the gradients of the corresponding parameters sparse, so those parameters are updated only infrequently. In practice, we would like infrequently updated parameters to take larger update steps, and frequently updated parameters to take smaller steps [1].
$$\theta_{t+1,i} = \theta_{t,i} - \frac{\alpha}{\sqrt{\sum_{k=0}^{t} g_{k,i}^2+\epsilon}}\, g_{t,i}$$
Remark: the effective learning rate shrinks monotonically over time, because the sum of squared gradients in the denominator never decreases.
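A per-coordinate sketch of the AdaGrad update above; `grad_f` and the hyperparameter values are assumptions for illustration.

```python
import numpy as np

def adagrad_step(theta, G, grad_f, alpha=0.1, eps=1e-8):
    """AdaGrad: accumulate squared gradients per coordinate and scale each step by their root."""
    g = grad_f(theta)
    G = G + g ** 2                                # running sum of squared gradients (never shrinks)
    theta = theta - alpha / np.sqrt(G + eps) * g  # rarely-updated coordinates get larger steps
    return theta, G
```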

AdaDelta and RMSProp

Both are improvements on AdaGrad. In AdaGrad's denominator, all historical squared gradients are summed, which makes the learning rate decay too quickly. AdaDelta and RMSProp instead use an exponentially decaying average, replacing the sum of past squared gradients with a weighted mean.
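A sketch of the RMSProp form of this idea (the decay rate `rho`, `grad_f`, and the step size are assumptions; AdaDelta additionally replaces the global step size with a second running average of past updates, which is omitted here):

```python
import numpy as np

def rmsprop_step(theta, avg_sq_grad, grad_f, alpha=0.001, rho=0.9, eps=1e-8):
    """RMSProp: exponentially decaying average of squared gradients instead of AdaGrad's full sum."""
    g = grad_f(theta)
    avg_sq_grad = rho * avg_sq_grad + (1 - rho) * g ** 2   # exponential moving average of g^2
    theta = theta - alpha / np.sqrt(avg_sq_grad + eps) * g
    return theta, avg_sq_grad
```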

Adam

Core: combine the merits of momentum and adaptive learning rate
$$\theta_{t+1} = \theta_t - \frac{\alpha}{\sqrt{\hat{v}_t+\epsilon}}\,\hat{m}_t$$

first moment of the gradient: $m_t = \beta_1 m_{t-1} + (1-\beta_1)\, g_t$

second moment of the gradient: $v_t = \beta_2 v_{t-1} + (1-\beta_2)\, g_t^2$

bias correction: $\hat{m}_t = \frac{m_t}{1-\beta_1^t}, \quad \hat{v}_t = \frac{v_t}{1-\beta_2^t}$
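Putting the three formulas together, a single Adam step might look like the sketch below; `grad_f` and the hyperparameter defaults (beta1 = 0.9, beta2 = 0.999) follow common practice and are assumptions here.

```python
import numpy as np

def adam_step(theta, m, v, t, grad_f, alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update: momentum-style first moment, RMSProp-style second moment, bias correction."""
    g = grad_f(theta)
    m = beta1 * m + (1 - beta1) * g            # first moment of the gradient
    v = beta2 * v + (1 - beta2) * g ** 2       # second moment of the gradient
    m_hat = m / (1 - beta1 ** t)               # bias correction; t starts at 1
    v_hat = v / (1 - beta2 ** t)
    theta = theta - alpha / np.sqrt(v_hat + eps) * m_hat
    return theta, m, v
```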

  • large $m_t$ and large $v_t$: the gradient is large and stable, indicating a clear, steep slope with an unambiguous direction of descent
  • small $m_t$ and large $v_t$: the gradient is unstable, suggesting a ravine where the iterate is prone to bouncing back and forth
  • large $m_t$ and small $v_t$: impossible (roughly, the squared mean of the gradients cannot exceed their mean square)
  • small $m_t$ and small $v_t$: the iterate may have reached a local minimum, or may be crossing a nearly flat plateau; in the latter case it must avoid getting stuck on the plateau

Nesterov accelerated gradient [8]

An extension of the momentum method: following the direction of inertia, the gradient is computed at the anticipated future position rather than at the current position. This "look-ahead" gives the algorithm the ability to anticipate the terrain ahead [1].

· choose any initial x^{0} = x^{-1} ∈ R^n
· repeat for t = 1, 2, 3, ...
	y = x^{t-1} + ((t-2)/(t+1)) (x^{t-1} - x^{t-2})
	x^{t} = prox_{α_t}(y - α_t ∇g(y))
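The pseudo-code above is the proximal form from the CMU slides [8]. In the momentum-style formulation commonly used in deep learning, the only change from the momentum sketch earlier is that the gradient is evaluated at the look-ahead point; `grad_f` and the hyperparameter values are again assumptions.

```python
import numpy as np

def nesterov_step(theta, v, grad_f, alpha=0.01, gamma=0.9):
    """Nesterov momentum: evaluate the gradient at the anticipated position theta + gamma*v."""
    lookahead = theta + gamma * v                  # where inertia alone would carry the iterate
    v_new = gamma * v - alpha * grad_f(lookahead)  # gradient at the look-ahead point, not at theta
    return theta + v_new, v_new
```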

Disadvantages

  • The objective value is not guaranteed to decrease at every iteration, even for convex problems.

References

  1. 《百面机器学习》
  2. Wikipedia, Stochastic gradient descent
    https://en.wikipedia.org/wiki/Stochastic_gradient_descent
  3. Implementing Mini Batch Gradient Descent
    http://www.ashukumar27.io/MIni-Batch-Gradient-Descent/
  4. Gradient Descent Algorithm and Its Variants (in-depth discussion of the relationship between the Hessian and the loss function)
    https://towardsdatascience.com/gradient-descent-algorithm-and-its-variants-10f652806a3
  5. A Gentle Introduction to Mini-Batch Gradient Descent and How to Configure Batch Size
    https://machinelearningmastery.com/gentle-introduction-mini-batch-gradient-descent-configure-batch-size/
  6. Practical recommendations for gradient-based training of deep architectures, 2012
  7. https://ai.stackexchange.com/questions/6377/why-mlp-momentum-term-must-be-in-the-range-0-1
  8. CMU Optimization 10-725 / 36-725, Accelerated first-order methods
    https://www.cs.cmu.edu/~ggordon/10725-F12/slides/09-acceleration.pdf
  9. How to Configure the Learning Rate When Training Deep Learning Neural Networks
    https://machinelearningmastery.com/learning-rate-for-deep-learning-neural-networks/
  10. Why Momentum Really Works
    https://distill.pub/2017/momentum/
  11. Sutskever, I., Martens, J., Dahl, G., & Hinton, G. (2013, February). On the importance of initialization and momentum in deep learning. In International conference on machine learning (pp. 1139-1147).