Gradient Descent, Stochastic Gradient Descent, and Their Improvements

Question (155): When the training data set is very large, what problems does classical gradient descent have, and how should it be improved?
Question (158): Why does stochastic gradient descent sometimes fail?
Question (160): To improve stochastic gradient descent, what modifications have researchers made? What variants have been proposed, and what are their characteristics? In other words, the variants of SGD.

Key point 1: gradient descent, stochastic gradient descent, mini-batch gradient descent

(Batch) gradient descent

Advantages

  • unbiased estimate of gradients
  • the loss is non-increasing (with a suitably small learning rate) and converges to the global minimum when the loss function is convex

Disadvantages

  • may get stuck in a local minimum for non-convex problems
  • each update requires a full pass over all training samples, which becomes prohibitively expensive when the training set is very large (the issue raised in Question 155)

Stochastic gradient descent

(Pseudo-code of SGD [2])
· Choose an initial parameter vector θ and learning rate α.
· Repeat until an approximate minimum is obtained:
	· Randomly shuffle the examples in the training set.
	· For i = 1, 2, ..., n, do:
		· θ := θ − α ∇_θ L(θ, z_i).
  • SGD: randomly select one of the training samples at each iteration to update your coefficients.
  • online gradient descent: use the most recent sample at each iteration. The samples may not be IID.
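
A minimal NumPy sketch of the SGD pseudo-code above, applied to least-squares linear regression; the synthetic data, the squared-error loss, and the hyperparameter values are illustrative assumptions rather than anything prescribed by the cited sources:

import numpy as np

def sgd(X, y, alpha=0.01, epochs=50, seed=0):
    """Plain SGD: one randomly chosen training sample per parameter update."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    theta = np.zeros(d)
    for _ in range(epochs):
        for i in rng.permutation(n):            # randomly shuffle the training set
            # Gradient of the single-sample loss 0.5 * (x_i . theta - y_i)^2.
            grad = (X[i] @ theta - y[i]) * X[i]
            theta -= alpha * grad               # theta := theta - alpha * grad L(theta, z_i)
    return theta

# Toy usage: recover known coefficients from noisy synthetic data.
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.01 * rng.normal(size=200)
print(sgd(X, y))   # close to [1.0, -2.0, 0.5]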

Advantages

  • The noisy update process can allow the model to avoid local minima [5]

Challenges

  • Ravines (山谷): narrow and tall level curves (contours)
    Reason: In a ravine, the true gradient direction points down along the valley floor, and even a slight deviation sends the update into the valley wall. SGD's noisy gradient estimate therefore makes the iterate bounce back and forth between the two walls instead of descending quickly along the valley, so convergence is unstable and slow [1].
  • Saddle points
    Test: A saddle point occurs where the gradient is zero, at least one eigenvalue of the Hessian matrix is negative, and the remaining eigenvalues are positive [4].
    Reason: In regions where the gradient is nearly zero, SGD cannot reliably detect the small changes in the gradient, so the iterate stalls.

Disadvantages

  • requires a learning-rate decay schedule: with a fixed learning rate, the noisy updates keep the iterate oscillating around the minimum instead of converging

Mini-batch gradient descent

Practical issues

  • tuning of batch size: It is usually chosen as a power of 2, such as 32, 64, 128, 256, 512, etc. The reason is that, with common batch sizes that are powers of 2, hardware such as GPUs achieves better run times and fits its memory requirements better [4,5].

    • Tip: A good default for batch size might be 32 [5].

    … [batch size] is typically chosen between 1 and a few hundreds, e.g. [batch size] = 32 is a good default value, with values above 10 taking advantage of the speedup of matrix-matrix products over matrix-vector products [6].

    • key considerations: training stability and generalization
  • tuning of learning rate: A decaying learning-rate schedule is usually adopted: the algorithm starts with a relatively large learning rate, and once the error curve reaches a plateau, the learning rate is reduced for finer adjustments [1].

  • Tip: Tune batch size and learning rate after tuning all other hyperparameters [5].

  • Practitioners’ experience: batch size = 32; no more than 2-10 epochs

Advantages
compared with batch GD

  • with a batch size smaller than the full training set, it adds noise to the learning process, which helps improve generalization [4]

compared with SGD

  • reduces the variance of the gradient estimate; however, the return is less than linear in the extra computation (averaging over a batch of size m shrinks the standard error only by a factor of √m) [4]
  • takes advantage of efficient vectorised matrix operations [3]

Disadvantages

  • wanders around the minimum but never converges, due to the randomness of sampling
  • as a result of the above, requires a learning-rate decay schedule to shrink the step size as the iterate approaches the minimum
  • introduces an additional hyperparameter: the batch size

Additional practical issues and tips

  • in low dimensions, local minima are common; in high dimensions, saddle points are more common

Tip: scale the features if they are on very different scales; otherwise the level curves (contours) may be narrow and tall, and convergence takes longer.
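
As a companion to the tips above, here is a short mini-batch sketch under the same assumed least-squares setup as the earlier SGD example; the exponential decay schedule and all hyperparameter values are illustrative choices, not recommendations from the cited sources:

import numpy as np

def minibatch_gd(X, y, alpha=0.1, batch_size=32, epochs=50, decay=0.99, seed=0):
    """Mini-batch gradient descent with a simple exponential learning-rate decay."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    theta = np.zeros(d)
    for epoch in range(epochs):
        lr = alpha * decay ** epoch                  # shrink the step size over time
        idx = rng.permutation(n)                     # reshuffle every epoch
        for start in range(0, n, batch_size):
            batch = idx[start:start + batch_size]
            # Averaged gradient over the mini-batch, computed as one matrix product.
            grad = X[batch].T @ (X[batch] @ theta - y[batch]) / len(batch)
            theta -= lr * grad
    return theta

# Toy usage with the same kind of synthetic data as the SGD example.
rng = np.random.default_rng(1)
X = rng.normal(size=(512, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.01 * rng.normal(size=512)
print(minibatch_gd(X, y))   # close to [1.0, -2.0, 0.5]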

Key point 2: momentum, adaptive learning rate

Momentum

The essence of momentum is to incorporate historical information into the parameter update [9].

Intuitive understanding of why momentum methods could address the two problems of SGD

  • Ravine problem: the downhill force is steady and unchanging, so the accumulated momentum keeps building and the descent speeds up; the sideways bouncing forces keep switching direction, so their accumulated momentum largely cancels out, which damps the back-and-forth oscillation.
  • Saddle-point problem: inertia keeps the iterate moving forward.

\begin{aligned} v_{t+1} &= \gamma v_t - \alpha \nabla f(\theta_t), \\ \theta_{t+1} &= \theta_t + v_{t+1}, \end{aligned}

where α denotes the learning rate; the decay rate/momentum factor γ is analogous to a friction coefficient, with 0 corresponding to maximum friction and 1 to no friction [11,7].
To put it simply, momentum acts both as a smoother and an accelerator [10].
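
A minimal sketch of the two-line momentum update above on an assumed ill-conditioned quadratic f(θ) = ½ θᵀAθ (a stand-in for the "ravine"); A, γ, and α are illustrative choices:

import numpy as np

A = np.diag([10.0, 1.0])            # assumed ravine-like quadratic: f(theta) = 0.5 * theta^T A theta
grad = lambda theta: A @ theta      # its gradient

theta = np.array([1.0, 1.0])
v = np.zeros(2)
gamma, alpha = 0.9, 0.05            # momentum factor and learning rate (illustrative)
for _ in range(100):
    v = gamma * v - alpha * grad(theta)   # v_{t+1} = gamma * v_t - alpha * grad f(theta_t)
    theta = theta + v                     # theta_{t+1} = theta_t + v_{t+1}
print(theta)                              # approaches the minimiser at the origin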

AdaGrad

Core: different learning rates for different parameters
Motivation: Sparse data leads to sparse gradients for the corresponding parameters, so those parameters are updated only rarely. In practice, we would like rarely updated parameters to take larger update steps, while frequently updated parameters take smaller steps [1].
\theta_{t+1,i} = \theta_{t,i} - \frac{\alpha}{\sqrt{\sum_{k=0}^{t} g_{k,i}^2+\epsilon}}\, g_{t,i}
Remark: The learning rate gets smaller and smaller as time passes.
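
A per-parameter sketch of the AdaGrad rule above; the toy objective f(θ) = ½‖θ‖² and the values of α and ϵ are illustrative assumptions:

import numpy as np

def adagrad(grad, theta0, alpha=0.5, eps=1e-8, steps=200):
    """AdaGrad: each parameter's step size shrinks with its accumulated squared gradients."""
    theta = theta0.astype(float)
    s = np.zeros_like(theta)                    # running sum of g_{k,i}^2
    for _ in range(steps):
        g = grad(theta)
        s += g ** 2
        theta -= alpha / np.sqrt(s + eps) * g   # per-coordinate adaptive step
    return theta

# Toy objective f(theta) = 0.5 * ||theta||^2, whose gradient is theta itself.
print(adagrad(lambda th: th, np.array([1.0, 1.0])))   # essentially the origin; the effective step keeps shrinking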

AdaDelta and RMSProp

Both are improvements on AdaGrad. In the denominator of AdaGrad, all historical squared gradients are summed up, causing the learning rate to decay too quickly. AdaDelta and RMSProp instead use an exponentially decaying average, replacing the sum of past squared gradients with a weighted running mean.
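
A sketch of the exponentially decaying average in its RMSProp form, with the same assumed toy objective as the AdaGrad example (ρ, α, and ϵ are illustrative; AdaDelta additionally rescales the step by a running average of past squared updates, which is omitted here):

import numpy as np

def rmsprop(grad, theta0, alpha=0.01, rho=0.9, eps=1e-8, steps=500):
    """RMSProp: replace AdaGrad's running sum with an exponentially decaying average."""
    theta = theta0.astype(float)
    s = np.zeros_like(theta)                          # moving average of squared gradients
    for _ in range(steps):
        g = grad(theta)
        s = rho * s + (1 - rho) * g ** 2              # exponential decay instead of a plain sum
        theta -= alpha / np.sqrt(s + eps) * g
    return theta

# Same toy objective f(theta) = 0.5 * ||theta||^2.
print(rmsprop(lambda th: th, np.array([1.0, 1.0])))   # oscillates within roughly alpha of the origin (no exact convergence with a constant learning rate)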

Adam

Core: combine the merits of momentum and adaptive learning rate
\theta_{t+1} = \theta_t - \frac{\alpha}{\sqrt{\hat{v}_t+\epsilon}}\,\hat{m}_t

\text{first moment of the gradient: } \quad m_t = \beta_1 m_{t-1} + (1-\beta_1) g_t

\text{second moment of the gradient: } \quad v_t = \beta_2 v_{t-1} + (1-\beta_2) g_t^2

\hat{m}_t = \frac{m_t}{1-\beta_1^t}, \quad \hat{v}_t = \frac{v_t}{1-\beta_2^t}

  • large m_t and large v_t: the gradient is large and stable, indicating a clear, steep slope with an unambiguous direction of descent
  • small m_t and large v_t: the gradient is unstable, indicating that the iterate may be in a ravine and prone to bouncing back and forth
  • large m_t and small v_t: impossible (v_t estimates E[g^2], which is at least (E[g])^2 ≈ m_t^2)
  • small m_t and small v_t: the iterate may have reached a local minimum, or it may be crossing a nearly flat region; in that case it should avoid getting stuck on the plateau
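
A direct transcription of the Adam formulas above into NumPy, again on the assumed toy objective f(θ) = ½‖θ‖²; β₁, β₂, and ϵ follow common defaults, while α, the step count, and the objective are illustrative:

import numpy as np

def adam(grad, theta0, alpha=0.01, beta1=0.9, beta2=0.999, eps=1e-8, steps=500):
    """Adam: bias-corrected first moment (momentum) plus second-moment scaling (adaptive rate)."""
    theta = theta0.astype(float)
    m = np.zeros_like(theta)                   # first moment of the gradient
    v = np.zeros_like(theta)                   # second moment of the gradient
    for t in range(1, steps + 1):
        g = grad(theta)
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g ** 2
        m_hat = m / (1 - beta1 ** t)           # bias corrections
        v_hat = v / (1 - beta2 ** t)
        theta -= alpha / np.sqrt(v_hat + eps) * m_hat
    return theta

# Same toy objective f(theta) = 0.5 * ||theta||^2.
print(adam(lambda th: th, np.array([1.0, 1.0])))   # close to the origin (oscillating within a few multiples of alpha)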

Nesterov accelerated gradient [8]

It extends the momentum method: following the direction of inertia, the gradient is evaluated at the estimated future position rather than at the current position. This "look-ahead" gives the algorithm the ability to anticipate the terrain ahead [1].

(Accelerated proximal gradient form from [8]: the objective is g(x) + h(x) with g smooth, and prox_{α_t} denotes the proximal operator of the non-smooth part h.)
· choose any initial x^{(0)} = x^{(-1)} ∈ R^n
· repeat for t = 1, 2, 3, ...
	y = x^{(t-1)} + (t-2)/(t+1) · (x^{(t-1)} − x^{(t-2)})
	x^{(t)} = prox_{α_t}(y − α_t ∇g(y))
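
A minimal sketch of the look-ahead idea in the momentum notation used earlier (the proximal-gradient pseudo-code above from [8] is a variant of the same acceleration idea for objectives with a non-smooth term); the quadratic objective and the values of γ and α are illustrative assumptions:

import numpy as np

A = np.diag([10.0, 1.0])                 # assumed quadratic: f(theta) = 0.5 * theta^T A theta
grad = lambda theta: A @ theta

theta = np.array([1.0, 1.0])
v = np.zeros(2)
gamma, alpha = 0.9, 0.05                 # illustrative momentum factor and learning rate
for _ in range(100):
    lookahead = theta + gamma * v            # estimated future position
    v = gamma * v - alpha * grad(lookahead)  # gradient taken at the look-ahead point, not at theta
    theta = theta + v
print(theta)                                 # approaches the minimiser at the origin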

Disadvantages

  • The objective value is not guaranteed to decrease monotonically, even for convex problems.

References

  1. 《百面機器學習》
  2. Wikipedia, Stochastic gradient descent
    https://en.wikipedia.org/wiki/Stochastic_gradient_descent
  3. Implementing Mini Batch Gradient Descent
    http://www.ashukumar27.io/MIni-Batch-Gradient-Descent/
  4. Gradient Descent Algorithm and Its Variants (in-depth discussion of the relationship between the Hessian and the loss function)
    https://towardsdatascience.com/gradient-descent-algorithm-and-its-variants-10f652806a3
  5. A Gentle Introduction to Mini-Batch Gradient Descent and How to Configure Batch Size
    https://machinelearningmastery.com/gentle-introduction-mini-batch-gradient-descent-configure-batch-size/
  6. Practical recommendations for gradient-based training of deep architectures, 2012
  7. https://ai.stackexchange.com/questions/6377/why-mlp-momentum-term-must-be-in-the-range-0-1
  8. CMU Optimization 10-725 / 36-725, Accelerated first-order methods
    https://www.cs.cmu.edu/~ggordon/10725-F12/slides/09-acceleration.pdf
  9. How to Configure the Learning Rate When Training Deep Learning Neural Networks
    https://machinelearningmastery.com/learning-rate-for-deep-learning-neural-networks/
  10. Why Momentum Really Works
    https://distill.pub/2017/momentum/
  11. Sutskever, I., Martens, J., Dahl, G., & Hinton, G. (2013, February). On the importance of initialization and momentum in deep learning. In International conference on machine learning (pp. 1139-1147).