Summary of Gradient Descent from Prof. Hung-yi Lee's Machine Learning Course

Review

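Solving a machine learning problem generally takes three steps:
Step 1: choose a function set.
Step 2: define a loss function.
Step 3: minimize the loss function to find the best function in the function set.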

The method most commonly used for Step 3 is gradient descent.

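$\theta^*=\arg\min_{\theta}L(\theta)$
$L$: loss function
$\theta$: parameters
The goal is to find the $\theta$ that minimizes the loss function, where $\theta$ is the vector made up of the parameters of the function set.
Suppose $\theta$ has two parameters $\{\theta_1,\theta_2\}$. Gradient descent then proceeds as follows: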

Start from a randomly chosen point on the loss function, that is, randomly initialize $\theta=\theta^0$, where $\theta^0=\begin{bmatrix} \theta_1^0 \\ \theta_2^0 \\ \end{bmatrix}$
Then update $\theta$ step by step:

$\theta^1=\begin{bmatrix} \theta_1^1 \\ \theta_2^1 \\ \end{bmatrix}=\begin{bmatrix} \theta_1^0 \\ \theta_2^0 \\ \end{bmatrix}-\eta \begin{bmatrix} \frac {\partial L(\theta^0)}{\partial \theta_1} \\ \frac {\partial L(\theta^0)}{\partial \theta_2} \\ \end{bmatrix}$
$\theta^2=\begin{bmatrix} \theta_1^2 \\ \theta_2^2 \\ \end{bmatrix}=\begin{bmatrix} \theta_1^1 \\ \theta_2^1 \\ \end{bmatrix}-\eta \begin{bmatrix} \frac {\partial L(\theta^1)}{\partial \theta_1} \\ \frac {\partial L(\theta^1)}{\partial \theta_2} \\ \end{bmatrix}$
Keep repeating this update.
It can be written compactly as: $\theta^{k+1}=\theta^k-\eta\nabla L(\theta^k)$
where the gradient is $\nabla L(\theta)=\begin{bmatrix} \frac {\partial L(\theta)}{\partial \theta_1} \\ \frac {\partial L(\theta)}{\partial \theta_2} \\ \end{bmatrix}$

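Intuitively, we take the partial derivative of the loss with respect to each parameter at the current position. Together these partial derivatives form the gradient at the current point. Moving a sufficiently small step in the direction opposite to the gradient leads to a lower point, that is, to a smaller value of the loss function. In the figure, the horizontal axis is $\theta_1$, the vertical axis is $\theta_2$, and the initial point $\theta^0$ is chosen at random.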
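The update rule above can be sketched in a few lines of Python. This is a minimal illustration, not code from the course: the quadratic loss, its minimum at $(3,-1)$, the starting point and the value of $\eta$ are all assumptions chosen so the result is easy to check.

```python
def loss(theta):
    """Hypothetical two-parameter loss with its minimum at (3, -1)."""
    t1, t2 = theta
    return (t1 - 3.0) ** 2 + (t2 + 1.0) ** 2


def grad(theta):
    """Gradient of the hypothetical loss: one partial derivative per parameter."""
    t1, t2 = theta
    return [2.0 * (t1 - 3.0), 2.0 * (t2 + 1.0)]


def gradient_descent(theta0, eta, steps):
    """Apply theta^{k+1} = theta^k - eta * grad L(theta^k) for a fixed number of steps."""
    theta = list(theta0)
    for _ in range(steps):
        g = grad(theta)
        theta = [t - eta * gi for t, gi in zip(theta, g)]
    return theta


# Assumed starting point and learning rate, for illustration only.
theta_final = gradient_descent([0.0, 0.0], eta=0.1, steps=100)
print(theta_final, loss(theta_final))
```

With these assumed settings the iterate ends up very close to $(3,-1)$, the minimum of the toy loss.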

Learning Rate: $\eta$

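The learning rate can be understood as the step size: how far to move in the direction opposite to the gradient. It is an important hyperparameter in supervised learning. It determines whether we can reach a good enough point and how long that takes. In other words, it determines whether the objective function converges to a local minimum and how fast it does so, which is why choosing the learning rate carefully matters so much.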

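In the figure, the blue line corresponds to a learning rate that is too small, so the objective function converges very slowly. The green and yellow lines correspond to learning rates that are too large, so the objective function fails to converge or even diverges. The red line corresponds to a suitable learning rate.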
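The same behaviour can be reproduced numerically. The sketch below is an assumption-laden toy, not the experiment behind the figure: it uses the hypothetical loss $L(\theta)=\theta^2$ (gradient $2\theta$) and four hand-picked learning rates.

```python
def run(eta, steps=20, theta=1.0):
    """Run gradient descent on the hypothetical loss L(theta) = theta**2."""
    for _ in range(steps):
        theta = theta - eta * 2.0 * theta  # gradient of theta**2 is 2 * theta
    return theta


# Assumed learning rates: too small, suitable, oscillating, diverging.
results = {eta: run(eta) for eta in (0.01, 0.4, 1.0, 1.1)}
for eta, theta in results.items():
    print(f"eta={eta}: theta after 20 steps = {theta}")
```

On this toy loss each step multiplies $\theta$ by $1-2\eta$, so $\eta=0.01$ barely moves, $\eta=0.4$ converges quickly, $\eta=1.0$ bounces between $1$ and $-1$ forever, and $\eta=1.1$ grows without bound.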

Adaptive Learning Rates

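Suppose we run gradient descent on an irregular surface where the gradient keeps changing in every direction. A single fixed learning rate then converges too slowly where the gradient is small and is relatively too large where the gradient is large, so the iteration may fail to converge or even diverge.
A simple and commonly used remedy is to lower the learning rate every few steps.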

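At the beginning we are far from the destination, so we use a larger learning rate.
After several steps we are close to the destination, so we reduce the learning rate.
For example, $\frac 1t$ decay: $\eta^t=\frac \eta{\sqrt{t+1}}$
Even so, a learning rate cannot be one-size-fits-all, so different parameters should be given different learning rates.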
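As a rough sketch of the decay schedule $\eta^t=\frac{\eta}{\sqrt{t+1}}$, the code below plugs it into plain gradient descent. The loss $L(\theta)=\theta^2$, the initial rate `eta0=0.9` and the helper names are assumptions made for this example.

```python
import math


def decayed_eta(eta0, t):
    """Learning rate at step t under the schedule eta0 / sqrt(t + 1)."""
    return eta0 / math.sqrt(t + 1)


def gd_with_decay(theta, eta0, steps):
    """Gradient descent on the hypothetical loss L(theta) = theta**2 with a decaying rate."""
    for t in range(steps):
        gradient = 2.0 * theta
        theta = theta - decayed_eta(eta0, t) * gradient
    return theta


# Large steps early on, smaller steps later (assumed settings).
final_theta = gd_with_decay(theta=1.0, eta0=0.9, steps=50)
print([round(decayed_eta(0.9, t), 3) for t in range(5)], final_theta)
```

The first few steps overshoot the minimum at $0$ because the rate is still large, and the later, smaller steps settle in close to it.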

Adagrad

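Adagrad is also a gradient descent method. On top of plain gradient descent, it divides the learning rate $\eta$ of each parameter by the root mean square (RMS) of all the previous first derivatives of that parameter.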
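A minimal sketch of this idea follows. When the decayed rate $\frac{\eta}{\sqrt{t+1}}$ is divided by the RMS of the past gradients, the $\sqrt{t+1}$ factors cancel and the per-parameter update becomes $w^{t+1}=w^t-\frac{\eta}{\sqrt{\sum_{i=0}^{t}(g^i)^2}}\,g^t$, which is what the code implements. The loss $L=\theta_1^2+100\,\theta_2^2$, the small constant `eps` that guards against division by zero, and all settings are assumptions for illustration.

```python
import math


def grad(theta):
    """Gradient of the hypothetical loss L = theta1**2 + 100 * theta2**2."""
    return [2.0 * theta[0], 200.0 * theta[1]]


def adagrad(theta0, eta, steps, eps=1e-8):
    """Adagrad: scale each parameter's step by the root of its summed squared gradients."""
    theta = list(theta0)
    sum_sq = [0.0] * len(theta)  # running sum of squared gradients, one per parameter
    for _ in range(steps):
        g = grad(theta)
        for i in range(len(theta)):
            sum_sq[i] += g[i] ** 2
            # eps is an assumed safeguard against dividing by zero
            theta[i] -= eta * g[i] / (math.sqrt(sum_sq[i]) + eps)
    return theta


theta_adagrad = adagrad([1.0, 1.0], eta=0.5, steps=200)
print(theta_adagrad)
```

Although the two directions of this toy loss differ in curvature by a factor of 100, both parameters shrink at nearly the same pace, because each one's step is normalized by its own gradient history.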

Stochastic Gradient Descent

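Unlike batch gradient descent, stochastic gradient descent (SGD) updates the parameters using a single sample per iteration, which speeds up training.
Gradient Descent: the loss sums over all samples
$L=\sum_n(\hat y^n-(b+\sum_i w_ix_i^n))^2$
$\theta^i=\theta^{i-1}-\eta\nabla L(\theta^{i-1})$
Stochastic Gradient Descent: pick one sample $x^n$ at random and use the loss of that sample only
$L^n=(\hat y^n-(b+\sum_i w_ix_i^n))^2$
$\theta^i=\theta^{i-1}-\eta\nabla L^n(\theta^{i-1})$
SGD therefore updates the parameters once per sample instead of once per pass over the whole data set, which is why it makes progress faster.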
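The difference between the two update schemes can be sketched on a one-feature linear model $y=b+wx$. Everything concrete here is assumed for illustration: the noise-free data generated from $y=2x+1$, the learning rates, the epoch count and the random seed.

```python
import random

# Hypothetical noise-free data from y = 2x + 1.
data = [(x, 2.0 * x + 1.0) for x in (-2.0, -1.0, 0.0, 1.0, 2.0)]


def grad_single(w, b, x, y):
    """Gradient of the single-sample loss (y - (b + w*x))**2 with respect to (w, b)."""
    err = y - (b + w * x)
    return -2.0 * err * x, -2.0 * err


def batch_gd(eta, epochs):
    """One update per pass: sum the gradients of all samples, then step."""
    w, b = 0.0, 0.0
    for _ in range(epochs):
        gw = gb = 0.0
        for x, y in data:
            dw, db = grad_single(w, b, x, y)
            gw += dw
            gb += db
        w -= eta * gw
        b -= eta * gb
    return w, b


def sgd(eta, epochs, seed=0):
    """One update per sample: visit the samples in random order each pass."""
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    for _ in range(epochs):
        samples = data[:]
        rng.shuffle(samples)
        for x, y in samples:
            dw, db = grad_single(w, b, x, y)
            w -= eta * dw
            b -= eta * db
    return w, b


w_gd, b_gd = batch_gd(eta=0.02, epochs=200)
w_sgd, b_sgd = sgd(eta=0.05, epochs=200)
print((w_gd, b_gd), (w_sgd, b_sgd))
```

In one pass over these five samples `batch_gd` makes a single update while `sgd` makes five. On this clean toy data both recover $w\approx 2$ and $b\approx 1$. With noisy data SGD would fluctuate around the minimum instead of settling exactly.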

Feature Scaling

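Different features often have different scales. For example, take a model $y=b+w_1x_1+w_2x_2$ with two features $x_1$ and $x_2$, where $x_1$ ranges over $[-1,1]$ and $x_2$ over $[-100,100]$. As a result, $x_1$ and $x_2$ influence the objective function to very different degrees, and gradient descent may slow down.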
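One common way to put features on a comparable scale is standardization: subtract each feature's mean and divide by its standard deviation. The sketch below shows this under stated assumptions. The sample values are made up to mimic the ranges $[-1,1]$ and $[-100,100]$, the function name is hypothetical, and the code assumes no feature is constant, since a constant feature would have zero standard deviation.

```python
import math


def standardize(rows):
    """Rescale every feature column to zero mean and unit standard deviation."""
    n = len(rows)
    dims = len(rows[0])
    means = [sum(r[i] for r in rows) / n for i in range(dims)]
    # Population standard deviation per feature; assumed to be non-zero.
    stds = [
        math.sqrt(sum((r[i] - means[i]) ** 2 for r in rows) / n)
        for i in range(dims)
    ]
    return [[(r[i] - means[i]) / stds[i] for i in range(dims)] for r in rows]


# Made-up samples: x1 lies in [-1, 1], x2 lies in [-100, 100].
samples = [[-1.0, -100.0], [0.0, 50.0], [1.0, 100.0], [0.5, -20.0]]
scaled = standardize(samples)
for row in scaled:
    print(row)
```

After scaling, both features have mean 0 and variance 1, so neither dominates the objective purely because of its units.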
