Mathematics: Optimization Theory (Directional Derivative, Negative Gradient, SGD, Adagrad, RMSprop, Adam)

Directional Derivative and Gradient

The directional derivative describes the rate of change of a function along a specified direction. If the function $f(x, y)$ is differentiable at the point $P_0(x_0,y_0)$, then its directional derivative at that point along any direction $l$ exists, and
$$\frac{\partial f}{\partial l}\Big|_{(x_0,y_0)}=f_x(x_0,y_0)\cos\alpha + f_y(x_0,y_0)\cos\beta$$
where $\cos\alpha$ and $\cos\beta$ are the direction cosines of $l$.


Differentiability and Partial Derivatives

Suppose the function $z=f(x,y)$ is defined in some neighborhood of the point $(x,y)$. If the total increment of the function at $(x,y)$, $\Delta z = f(x+\Delta x, y+\Delta y)-f(x,y)$, can be written as
$$\Delta z = A\Delta x + B\Delta y + o(\rho), \quad \rho=\sqrt{(\Delta x)^2 + (\Delta y)^2}$$

where $A$ and $B$ do not depend on $\Delta x$ and $\Delta y$ but only on $x$ and $y$, then $z=f(x,y)$ is said to be differentiable at $(x,y)$, with total differential $\mathrm dz=A\Delta x + B\Delta y$.

Necessary Condition

If $z=f(x,y)$ is differentiable at the point $(x,y)$, then the partial derivatives $\dfrac{\partial z}{\partial x}$ and $\dfrac{\partial z}{\partial y}$ necessarily exist at $(x,y)$, and the total differential is
$$\mathrm dz=\dfrac{\partial z}{\partial x} \Delta x + \dfrac{\partial z}{\partial y} \Delta y \quad \Longleftrightarrow \quad \mathrm dz = \dfrac{\partial z}{\partial x} \mathrm dx + \dfrac{\partial z}{\partial y} \mathrm dy$$

Sufficient Condition

If the partial derivatives $\dfrac{\partial z}{\partial x}$ and $\dfrac{\partial z}{\partial y}$ of $z=f(x,y)$ are continuous at the point $(x,y)$, then $z=f(x,y)$ is differentiable at $(x,y)$.

In summary: differentiable $\implies$ partial derivatives exist; partial derivatives exist and are continuous $\implies$ differentiable.


Directional Derivative

f(x,y)f(x, y)在點P0(x0,y0)P_0(x_0,y_0)處可微,則
f(x0+Δx,y0+Δy)f(x0,y0)=fx(x0,y0)Δx+fy(x0,y0)Δy+o((Δx)2+(Δy)2) f(x_0+\Delta x,y_0+\Delta y)-f(x_0,y_0)=f_x(x_0,y_0)\Delta x + f_y(x_0,y_0)\Delta y +o(\sqrt{(\Delta x)^2 + (\Delta y)^2})

Δx=tcosα, Δy=tcosβ, Δx)2+(Δy)2=t \Delta x=t\cos\alpha,\ \Delta y=t\cos\beta,\ \sqrt{\Delta x)^2 + (\Delta y)^2}=t
因此方向導數
limt0+fx(x0,y0)tcosα+fy(x0,y0)tcosβt=fx(x0,y0)cosα+fy(x0,y0)cosβ \lim_{t \to 0^+}\frac{f_x(x_0,y_0)t\cos\alpha+f_y(x_0,y_0)t\cos\beta}{t}=f_x(x_0,y_0)\cos\alpha + f_y(x_0,y_0)\cos\beta


Example

f(x,y,z)=xy+yz+zxf(x,y,z)=xy+yz+zx在點(1,1,2)(1,1,2)沿方向ll的方向導數,其中ll的方向角分別爲6060^\circ4545^\circ6060^\circ
解:與ll同向的單位向量el=(12,22,12)\bm e_l=(\dfrac{1}{2},\dfrac{\sqrt 2}{2}, \dfrac{1}{2}),因爲函數可微,故
fx(1,1,2)=3,fy(1,1,2)=3,fz(1,1,2)=2 f_x(1,1,2)=3, \quad f_y(1,1,2)=3, \quad f_z(1,1,2)=2
因此
fl(1,1,2)=312+322+212=12(5+32) \frac{\partial f}{\partial l}\Big|_{(1,1,2)}=3\cdot\frac{1}{2} + 3\cdot\frac{\sqrt 2}{2} + 2\cdot\frac{1}{2} = \frac{1}{2}(5+3\sqrt 2)
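As a quick sanity check, the following sketch (assuming NumPy; the helper `f` and the variable names are illustrative) compares the analytic value $\frac{1}{2}(5+3\sqrt 2)$ with a finite-difference estimate along $\bm e_l$:

```python
import numpy as np

# Numerical check of the worked example: f(x, y, z) = xy + yz + zx at (1, 1, 2)
# along the unit direction e_l = (1/2, sqrt(2)/2, 1/2).

def f(p):
    x, y, z = p
    return x * y + y * z + z * x

p0 = np.array([1.0, 1.0, 2.0])
e_l = np.array([0.5, np.sqrt(2) / 2, 0.5])   # direction angles 60°, 45°, 60°
grad = np.array([p0[1] + p0[2],              # f_x = y + z
                 p0[0] + p0[2],              # f_y = x + z
                 p0[0] + p0[1]])             # f_z = x + y

analytic = grad @ e_l                        # gradient · direction
t = 1e-6
finite_diff = (f(p0 + t * e_l) - f(p0)) / t  # one-sided difference along e_l

print(analytic, (5 + 3 * np.sqrt(2)) / 2, finite_diff)   # all approximately 4.621
```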


Gradient

For a function of two variables, suppose $f(x,y)$ has continuous first-order partial derivatives in a planar region $D$. For every point $P_0(x_0,y_0)\in D$, the vector
$$f_x(x_0, y_0)\bm i + f_y(x_0,y_0)\bm j$$
is called the gradient of $f(x,y)$ at $P_0(x_0,y_0)$, written
$${\bf{grad}}\,f(x_0,y_0)\quad \text{or} \quad\nabla f(x_0,y_0)$$


Relationship between the Directional Derivative and the Gradient

If $f(x,y)$ is differentiable at $P_0(x_0,y_0)$ and $\bm e_l=(\cos \alpha, \cos\beta)$ is the unit vector in the direction of $l$, then
$$\begin{aligned} \frac{\partial f}{\partial l}\Big|_{(x_0,y_0)} & =f_x(x_0,y_0)\cos\alpha +f_y(x_0,y_0)\cos\beta \\[1ex] &={\bf{grad}}\,f(x_0,y_0)\cdot \bm e_l \\[1ex] &= |{\bf{grad}}\,f(x_0,y_0)|\cos \theta \end{aligned}$$
where $\theta$ is the angle between the gradient and the direction $l$. The behavior of the function for different values of $\theta$ is as follows (a small numerical illustration follows the list):

  • $\theta=0$: the direction $l$ coincides with the gradient direction, and $f(x,y)$ increases fastest;
  • $\theta=\pi$: the direction $l$ is opposite to the gradient direction, and $f(x,y)$ decreases fastest;
  • $\theta=\pi/2$: the direction $l$ is orthogonal to the gradient direction, and the rate of change of $f(x,y)$ is 0.
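A minimal numerical illustration of these three cases, assuming NumPy; the sample function $f(x,y)=x^2+y^2$ and the point $(1,2)$ are illustrative choices:

```python
import numpy as np

# At (1, 2), grad f = (2x, 2y) = (2, 4). Rotating the unit gradient direction by
# theta gives a unit direction e_l at angle theta from the gradient, and the
# directional derivative grad · e_l equals |grad| cos(theta).

grad = np.array([2.0, 4.0])
g_norm = np.linalg.norm(grad)

for theta in (0.0, np.pi / 2, np.pi):
    c, s = np.cos(theta), np.sin(theta)
    rot = np.array([[c, -s], [s, c]])
    e_l = rot @ (grad / g_norm)
    d_dir = grad @ e_l
    print(f"theta={theta:.2f}  dir. derivative={d_dir:+.4f}  |grad|cos(theta)={g_norm * c:+.4f}")
# theta=0    -> fastest increase (+|grad|)
# theta=pi/2 -> zero rate of change
# theta=pi   -> fastest decrease (-|grad|)
```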

Gradient Descent

Gradient descent updates the parameters along the negative gradient of the loss function so that the objective function (here the mean-squared-error loss) reaches a minimum:
$$L = \frac{1}{2n}\sum_{i=1}^{n}\left(\hat y_i-\left(\pmb w \pmb x_i + b\right)\right)^2$$
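A small sketch of this loss, assuming NumPy; the helper name `mse_loss` and the toy data are illustrative, and `y` below plays the role of the targets $\hat y$ in the formula:

```python
import numpy as np

# Mean-squared-error loss for a linear model w·x + b, as in the formula above.

def mse_loss(w, b, X, y):
    """L = 1/(2n) * sum_i (y_i - (w·x_i + b))^2"""
    residual = y - (X @ w + b)
    return 0.5 * np.mean(residual ** 2)

# tiny example
X = np.array([[1.0], [2.0], [3.0]])
y = np.array([2.0, 4.0, 6.0])
print(mse_loss(np.array([2.0]), 0.0, X, y))   # 0.0 at the true parameters
print(mse_loss(np.array([1.0]), 0.0, X, y))   # > 0 away from them
```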


Gradient Descent and Taylor Series

Taylor expansion:
$$\begin{aligned} h(x) &=\sum_{k=0}^{\infty}\frac{h^{(k)}(x_0)}{k!}(x-x_0)^k \\ &=h(x_0)+h'(x_0)(x-x_0)+\frac{h''(x_0)}{2!}(x-x_0)^2+\cdots\\ & \approx h(x_0)+h'(x_0)(x-x_0) \end{aligned}$$
Suppose $L(\pmb \theta)$ contains two parameters; expanding $L(\pmb \theta)$ to first order at $\pmb \theta_t=(a, b)$ gives
$$L(\pmb \theta)\approx L(a, b)+ \frac{\partial L(a, b)}{\partial \theta_1}(\theta_1 - a)+ \frac{\partial L(a, b)}{\partial \theta_2}(\theta_2 - b)$$
Let $\Delta\theta_1=\theta_1-a$ and $\Delta\theta_2=\theta_2-b$; then, within a small neighborhood of $(a,b)$,
$$\min_{\pmb\theta} L(\pmb \theta) \iff \min\left[ (\Delta\theta_1, \Delta\theta_2)\cdot \left(\frac{\partial L(a, b)}{\partial \theta_1},\frac{\partial L(a, b)}{\partial \theta_2}\right)\right]$$
The inner product is smallest when the angle between the two vectors is 180°, so the optimal update satisfies
$$(\Delta\theta_1, \Delta\theta_2)=-\eta\left(\dfrac{\partial L(a, b)}{\partial \theta_1},\dfrac{\partial L(a, b)}{\partial \theta_2}\right)\implies \begin{bmatrix} \theta_1 \\ \theta_2 \end{bmatrix}= \begin{bmatrix} a \\ b \end{bmatrix} - \eta \begin{bmatrix} \dfrac{\partial L(a, b)}{\partial \theta_1} \\[2ex] \dfrac{\partial L(a, b)}{\partial \theta_2} \end{bmatrix}$$
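A minimal sketch of the resulting update rule, assuming NumPy and an illustrative quadratic loss $L(\theta_1,\theta_2)=\theta_1^2+2\theta_2^2$:

```python
import numpy as np

# Iterate (theta1, theta2) <- (a, b) - eta * grad L(a, b), the update derived above.

def grad_L(theta):
    t1, t2 = theta
    return np.array([2 * t1, 4 * t2])        # (dL/dtheta1, dL/dtheta2)

theta = np.array([3.0, -2.0])                # current point (a, b)
eta = 0.1                                    # learning rate

for step in range(5):
    theta = theta - eta * grad_L(theta)      # one gradient-descent step
    print(step, theta)
# theta moves toward the minimizer (0, 0)
```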


Negative Gradient

The rate of change of $f(\pmb x)$ at a point $\pmb x$ along a direction $\pmb d$ is given by the directional derivative, which equals the inner product of the gradient and the direction:
$$Df(\pmb x;\pmb d)=\nabla f(\pmb x)^T\pmb d$$
The steepest descent direction is the solution of the nonlinear program
$$\min\nabla f(\pmb x)^T\pmb d\quad \text{s.t.}\ \|\pmb d\|\leq 1$$
By the Cauchy-Schwarz inequality,
$$|\nabla f(\pmb x)^T\pmb d| \leq \|\nabla f(\pmb x)\|\,\|\pmb d\| \leq \|\nabla f(\pmb x)\|$$
Hence
$$\nabla f(\pmb x)^T\pmb d \geq -\|\nabla f(\pmb x)\|,\quad \text{with equality if and only if}\ \pmb d=-\frac{\nabla f(\pmb x)}{\|\nabla f(\pmb x)\|}$$
That is, the negative gradient direction is the steepest descent direction.
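A small numerical check of this claim, assuming NumPy; the gradient vector used is an arbitrary illustrative value:

```python
import numpy as np

# Among unit directions d, the inner product grad·d is smallest for
# d = -grad / ||grad||, where it equals -||grad||.

rng = np.random.default_rng(0)
grad = np.array([1.0, -2.0, 0.5])

d_star = -grad / np.linalg.norm(grad)       # candidate steepest descent direction
best = grad @ d_star                        # equals -||grad||

for _ in range(5):                          # compare against random unit directions
    d = rng.normal(size=3)
    d /= np.linalg.norm(d)
    assert grad @ d >= best - 1e-12         # no unit direction does better

print(best, -np.linalg.norm(grad))          # both approximately -2.2913
```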


Batch Gradient Descent

At every iteration, all samples are used to update the parameters:
$$\pmb w_{t+1}=\pmb w_t - \eta\nabla L_t, \quad\nabla L_t=\dfrac{1}{n}\sum_{i=1}^{n}\bigl((\pmb w_t \pmb x_i + b) - \hat y_i\bigr)\pmb x_i$$
Advantages: the gradient computation can be parallelized, and for a convex objective it converges to the global optimum;
Disadvantages: training is slow, since every update touches the entire dataset;
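A minimal batch-gradient-descent sketch for the linear model above, assuming NumPy; the synthetic data and hyperparameters are illustrative:

```python
import numpy as np

# Batch gradient descent: every iteration uses all n samples.

rng = np.random.default_rng(0)
n = 100
X = rng.normal(size=(n, 2))
y = X @ np.array([1.5, -2.0]) + 0.5 + 0.01 * rng.normal(size=n)   # targets (\hat y in the text)

w = np.zeros(2)
b = 0.0
eta = 0.1

for t in range(200):
    pred = X @ w + b
    grad_w = X.T @ (pred - y) / n      # (1/n) * sum_i ((w·x_i + b) - y_i) x_i
    grad_b = np.mean(pred - y)
    w -= eta * grad_w
    b -= eta * grad_b

print(w, b)    # approaches [1.5, -2.0] and 0.5
```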


Stochastic Gradient Descent, SGD

Stochastic gradient descent updates the parameters with a single randomly chosen sample $i$ per iteration:
$$\pmb w_{t+1}=\pmb w_t - \eta\nabla L_t,\quad \text{where}\quad \nabla L_t=\bigl((\pmb w_t \pmb x_i + b) - \hat y_i\bigr)\pmb x_i$$
Advantages: fast updates;
Disadvantages: lower accuracy (the search through the solution space is noisy), it may settle in a local optimum, and the per-sample updates cannot be parallelized;
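A corresponding SGD sketch, assuming NumPy; data and hyperparameters are illustrative:

```python
import numpy as np

# Stochastic gradient descent: one randomly chosen sample per update.

rng = np.random.default_rng(0)
n = 100
X = rng.normal(size=(n, 2))
y = X @ np.array([1.5, -2.0]) + 0.5

w = np.zeros(2)
b = 0.0
eta = 0.05

for t in range(2000):
    i = rng.integers(n)                 # pick a single sample at random
    err = (X[i] @ w + b) - y[i]
    w -= eta * err * X[i]               # gradient of the single-sample loss
    b -= eta * err

print(w, b)    # close to [1.5, -2.0] and 0.5
```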


Adaptive Gradient Descent, Adagrad

Adaptive gradient descent: the ideal learning rate for a parameter is proportional to its first derivative and inversely proportional to its second derivative. For any parameter,
$$\pmb w_{t+1} = \pmb w_t - \frac{\eta_t}{\pmb\sigma_t + \epsilon}\pmb g_t, \quad \eta_t = \frac{\eta }{\sqrt{t + 1}},\quad \pmb\sigma_t=\sqrt{\frac{1}{t+1}\sum_{i=0}^t\pmb g_i^2},\quad \pmb g_t=\frac{\partial L}{\partial \pmb w}$$
where $\pmb\sigma_t$ is the root mean square of the past first derivatives and $\epsilon$ keeps the denominator away from zero; this estimates the second derivative without extra computation.

Advantages: the learning rate is adjusted dynamically and each parameter gets its own learning rate, which suits sparse data (e.g. natural language processing and computer vision).
Disadvantages: as the number of iterations grows, the denominator grows, the update step approaches 0, and training effectively stops early.
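A minimal Adagrad sketch in the form above, assuming NumPy and an illustrative quadratic loss; note that $\eta_t/\pmb\sigma_t$ simplifies to $\eta/\sqrt{\sum_i \pmb g_i^2}$:

```python
import numpy as np

# Adagrad: the effective step per parameter is eta / sqrt(sum of squared past gradients).

def grad_L(w):
    return np.array([2.0 * w[0], 20.0 * w[1]])   # gradient of L = w1^2 + 10*w2^2

w = np.array([3.0, 2.0])
eta, eps = 1.0, 1e-8
sum_sq = np.zeros_like(w)                        # accumulated g_i^2

for t in range(100):
    g = grad_L(w)
    sum_sq += g ** 2
    # eta_t / sigma_t simplifies to eta / sqrt(sum of squared gradients)
    w -= eta / (np.sqrt(sum_sq) + eps) * g

print(w)    # both coordinates shrink toward 0, each with its own effective step
```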


Momentum

In SGD the update direction depends only on the gradient computed from the current batch, so the method can get stuck in local optima.
Momentum borrows the idea of momentum from physics and simulates the inertia of a moving object: the parameter update depends on both the previous update direction and the current gradient.
$$\pmb w_{t+1} = \pmb w_t - \pmb v_t, \quad \pmb v_t = \gamma\pmb v_{t-1} + \alpha \pmb g_t, \quad \pmb v_0 = \pmb0$$
Physical interpretation: $\gamma$ plays the role of friction (air resistance). Imagine pushing a ball down a hill: the ball accumulates momentum as it rolls, so if the momentum and the gradient point the same way it keeps speeding up; if the gradient changes direction, the momentum decays and the ball slows down or changes direction.
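A minimal Momentum sketch, assuming NumPy and an illustrative quadratic loss:

```python
import numpy as np

# Momentum update as written above: v_t = gamma * v_{t-1} + alpha * g_t, w <- w - v_t.

def grad_L(w):
    return np.array([2.0 * w[0], 20.0 * w[1]])   # gradient of L = w1^2 + 10*w2^2

w = np.array([3.0, 2.0])
v = np.zeros_like(w)                             # v_0 = 0
gamma, alpha = 0.9, 0.01

for t in range(200):
    g = grad_L(w)
    v = gamma * v + alpha * g                    # accumulate "inertia"
    w = w - v

print(w)    # converges toward (0, 0); the velocity smooths out oscillations
```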


Root Mean Square Propagation, RMSprop

Root mean square propagation addresses both the rapidly decaying learning rate of Adagrad and the large oscillations of Momentum:
$$\pmb w_{t+1}=\pmb w_t - \frac{\eta}{\sqrt{\pmb\sigma_t + \epsilon}}\pmb g_t, \quad \pmb\sigma_t = \alpha\pmb\sigma_{t-1} + (1-\alpha)\pmb g_t^2$$
Here $\alpha$ controls how much of the historical average is retained; $(1-\alpha)$ is the weight given to the current squared gradient.
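A minimal RMSprop sketch, assuming NumPy and the same illustrative quadratic loss:

```python
import numpy as np

# RMSprop: an exponential moving average of squared gradients replaces
# Adagrad's ever-growing sum.

def grad_L(w):
    return np.array([2.0 * w[0], 20.0 * w[1]])   # gradient of L = w1^2 + 10*w2^2

w = np.array([3.0, 2.0])
sigma = np.zeros_like(w)
eta, alpha, eps = 0.01, 0.9, 1e-8

for t in range(500):
    g = grad_L(w)
    sigma = alpha * sigma + (1 - alpha) * g ** 2   # decaying average of g^2
    w -= eta / np.sqrt(sigma + eps) * g

print(w)    # near (0, 0); the step size no longer decays to zero as in Adagrad
```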


Adaptive Moment Estimation, Adam

Adaptive moment estimation keeps exponential moving averages of the gradient and the squared gradient:
$$\pmb m_{t+1} = \alpha\pmb m_t + (1-\alpha)\pmb g_t,\quad\pmb v_{t+1}=\beta\pmb v_t + (1-\beta)\pmb g_t^2$$
Bias correction:
$$\pmb{\hat m}_t=\frac{\pmb m_t}{1 - \alpha^t}, \quad \pmb{\hat v}_t=\frac{\pmb v_t}{1 - \beta^t}$$
Parameter update:
$$\pmb w_{t+1}=\pmb w_t - \frac{\eta}{\sqrt{\pmb{\hat v}_t} + \epsilon}\pmb{\hat m}_t$$
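A minimal Adam sketch with the bias-corrected moments, assuming NumPy and an illustrative quadratic loss:

```python
import numpy as np

# Adam: bias-corrected first and second moment estimates drive the update.

def grad_L(w):
    return np.array([2.0 * w[0], 20.0 * w[1]])   # gradient of L = w1^2 + 10*w2^2

w = np.array([3.0, 2.0])
m = np.zeros_like(w)                             # first moment (moving average of g)
v = np.zeros_like(w)                             # second moment (moving average of g^2)
eta, alpha, beta, eps = 0.05, 0.9, 0.999, 1e-8

for t in range(1, 501):
    g = grad_L(w)
    m = alpha * m + (1 - alpha) * g
    v = beta * v + (1 - beta) * g ** 2
    m_hat = m / (1 - alpha ** t)                 # bias correction
    v_hat = v / (1 - beta ** t)
    w -= eta / (np.sqrt(v_hat) + eps) * m_hat

print(w)    # approaches (0, 0)
```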
