Mathematics: Optimization Theory (Directional Derivative, Negative Gradient, SGD, Adagrad, RMSprop, Adam)

Directional Derivative and Gradient

The directional derivative describes the rate of change of a function along a specified direction. If $f(x, y)$ is differentiable at the point $P_0(x_0,y_0)$, then its directional derivative along any direction $l$ exists at that point, and
\frac{\partial f}{\partial l} \Big |_{(x_0,y_0)}=f_x(x_0,y_0)\cos\alpha + f_y(x_0,y_0)\cos\beta
where $\cos \alpha$ and $\cos \beta$ are the direction cosines of $l$.


Differentiable and Partial Derivatives

Suppose $z=f(x,y)$ is defined in a neighborhood of the point $(x,y)$, and let the total increment of the function at $(x, y)$ be $\Delta z = f(x+\Delta x, y+\Delta y)-f(x,y)$. If the increment can be written as
\Delta z = A\Delta x + B\Delta y + o(\rho), \quad \rho=\sqrt{(\Delta x)^2 + (\Delta y)^2}

where $A$ and $B$ do not depend on $\Delta x$ and $\Delta y$ but only on $x$ and $y$, then $z=f(x,y)$ is said to be differentiable at $(x,y)$, with total differential $\mathrm dz=A\Delta x + B\Delta y$.

Necessary condition

If $z=f(x,y)$ is differentiable at the point $(x,y)$, then the partial derivatives $\dfrac{\partial z}{\partial x}$ and $\dfrac{\partial z}{\partial y}$ necessarily exist at $(x,y)$, and the total differential is
\mathrm dz=\dfrac{\partial z}{\partial x} \Delta x + \dfrac{\partial z}{\partial y} \Delta y \quad \Longleftrightarrow \quad \mathrm dz = \dfrac{\partial z}{\partial x} \mathrm dx + \dfrac{\partial z}{\partial y} \mathrm dy

Sufficient condition

If the partial derivatives $\dfrac{\partial z}{\partial x}$ and $\dfrac{\partial z}{\partial y}$ of $z=f(x,y)$ are continuous at the point $(x,y)$, then $z=f(x,y)$ is differentiable at $(x,y)$.

In summary: differentiable $\implies$ partial derivatives exist; partial derivatives exist and are continuous $\implies$ differentiable.
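
As a concrete check of the definition (an added example, not from the original notes), take $z = xy$:
\Delta z = (x+\Delta x)(y+\Delta y) - xy = y\,\Delta x + x\,\Delta y + \Delta x\,\Delta y
Since $|\Delta x\,\Delta y| \le \tfrac{1}{2}\rho^2 = o(\rho)$, we can read off $A=y$, $B=x$, so $z=xy$ is differentiable everywhere and $\mathrm dz = y\,\mathrm dx + x\,\mathrm dy$.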


Directional Derivative

f(x,y)f(x, y)在点P0(x0,y0)P_0(x_0,y_0)处可微,则
f(x0+Δx,y0+Δy)f(x0,y0)=fx(x0,y0)Δx+fy(x0,y0)Δy+o((Δx)2+(Δy)2) f(x_0+\Delta x,y_0+\Delta y)-f(x_0,y_0)=f_x(x_0,y_0)\Delta x + f_y(x_0,y_0)\Delta y +o(\sqrt{(\Delta x)^2 + (\Delta y)^2})

Δx=tcosα, Δy=tcosβ, Δx)2+(Δy)2=t \Delta x=t\cos\alpha,\ \Delta y=t\cos\beta,\ \sqrt{\Delta x)^2 + (\Delta y)^2}=t
因此方向导数
limt0+fx(x0,y0)tcosα+fy(x0,y0)tcosβt=fx(x0,y0)cosα+fy(x0,y0)cosβ \lim_{t \to 0^+}\frac{f_x(x_0,y_0)t\cos\alpha+f_y(x_0,y_0)t\cos\beta}{t}=f_x(x_0,y_0)\cos\alpha + f_y(x_0,y_0)\cos\beta


Example

f(x,y,z)=xy+yz+zxf(x,y,z)=xy+yz+zx在点(1,1,2)(1,1,2)沿方向ll的方向导数,其中ll的方向角分别为6060^\circ4545^\circ6060^\circ
解:与ll同向的单位向量el=(12,22,12)\bm e_l=(\dfrac{1}{2},\dfrac{\sqrt 2}{2}, \dfrac{1}{2}),因为函数可微,故
fx(1,1,2)=3,fy(1,1,2)=3,fz(1,1,2)=2 f_x(1,1,2)=3, \quad f_y(1,1,2)=3, \quad f_z(1,1,2)=2
因此
fl(1,1,2)=312+322+212=12(5+32) \frac{\partial f}{\partial l}\Big|_{(1,1,2)}=3\cdot\frac{1}{2} + 3\cdot\frac{\sqrt 2}{2} + 2\cdot\frac{1}{2} = \frac{1}{2}(5+3\sqrt 2)
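
A quick numerical cross-check of this example (an added NumPy sketch; the step size h is an arbitrary choice):

```python
import numpy as np

def f(p):
    x, y, z = p
    return x*y + y*z + z*x

p0 = np.array([1.0, 1.0, 2.0])
# Unit vector with direction angles 60°, 45°, 60°.
e_l = np.array([np.cos(np.pi/3), np.cos(np.pi/4), np.cos(np.pi/3)])

h = 1e-6
numeric = (f(p0 + h * e_l) - f(p0)) / h   # forward-difference estimate
analytic = (5 + 3*np.sqrt(2)) / 2         # value derived above
print(numeric, analytic)                  # both ≈ 4.62
```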


Gradient

For a function of two variables: suppose $f(x,y)$ has continuous first-order partial derivatives in a planar region $D$. Then for every point $P_0(x_0,y_0)\in D$, the vector
f_x(x_0, y_0)\bm i + f_y(x_0,y_0)\bm j
is called the gradient of $f(x,y)$ at $P_0(x_0,y_0)$, written
{\bf{grad}} \,f(x_0,y_0)\quad \text{or} \quad \nabla f(x_0,y_0)


Relationship of Directional Derivative and Gradient

If $f(x,y)$ is differentiable at the point $P_0(x_0,y_0)$ and $\bm e_l=(\cos \alpha, \cos\beta)$ is the unit vector in the direction of $l$, then
\begin{aligned} \frac{\partial f}{\partial l}\Big|_{(x_0,y_0)} & =f_x(x_0,y_0)\cos\alpha +f_y(x_0,y_0)\cos\beta \\[1ex] &={\bf{grad}}\,f(x_0,y_0)\cdot \bm e_l \\[1ex] &= |{\bf{grad}}\,f(x_0,y_0)|\cdot \cos \theta \end{aligned}
where $\theta$ is the angle between the gradient and the direction $l$. For different values of $\theta$:

  • $\theta=0$: $l$ has the same direction as the gradient, and $f(x,y)$ increases fastest;
  • $\theta=\pi$: $l$ is opposite to the gradient, and $f(x,y)$ decreases fastest;
  • $\theta=\pi/2$: $l$ is orthogonal to the gradient, and the rate of change of $f(x,y)$ is 0.

Gradient Descent

Gradient descent updates the parameters along the negative gradient of the loss so that the objective (here the mean squared error) is driven toward a minimum:
L = \frac{1}{2n}\sum_{i=1}^{n}\left(\hat y_i-\left(\pmb w \pmb x_i + b\right)\right)^2


Gradient Descent and Taylor Series

Taylor expansion:
\begin{aligned} h(x) &=\sum_{k=0}^{\infty}\frac{h^{(k)}(x_0)}{k!}(x-x_0)^k \\ &=h(x_0)+h'(x_0)(x-x_0)+\frac{h''(x_0)}{2!}(x-x_0)^2+\cdots\\ & \approx h(x_0)+h'(x_0)(x-x_0) \end{aligned}
Suppose $L(\pmb \theta)$ has two parameters, and expand $L(\pmb \theta)$ to first order around $\pmb \theta_t=(a, b)$:
L(\pmb \theta)\approx L(a, b)+ \frac{\partial L(a, b)}{\partial \theta_1}(\theta_1 - a)+ \frac{\partial L(a, b)}{\partial \theta_2}(\theta_2 - b)
Let $\Delta\theta_1=\theta_1-a$ and $\Delta\theta_2=\theta_2-b$. Since $L(a,b)$ is a constant, within the small neighborhood where the approximation holds,
\min_{\pmb\theta} L(\pmb \theta) \iff \min\left[ (\Delta\theta_1, \Delta\theta_2)\cdot \left(\frac{\partial L(a, b)}{\partial \theta_1},\frac{\partial L(a, b)}{\partial \theta_2}\right)\right]
The inner product is smallest when the two vectors point in opposite directions (angle $180^\circ$), so the best update satisfies
(\Delta\theta_1, \Delta\theta_2)=-\eta\left(\dfrac{\partial L(a, b)}{\partial \theta_1},\dfrac{\partial L(a, b)}{\partial \theta_2}\right)\implies \quad \begin{bmatrix} \theta_1 \\ \theta_2 \end{bmatrix}= \begin{bmatrix} a \\ b \end{bmatrix} - \eta \begin{bmatrix} \dfrac{\partial L(a, b)}{\partial \theta_1} \\ \dfrac{\partial L(a, b)}{\partial \theta_2} \end{bmatrix}
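
A minimal sketch of this update rule in NumPy (the quadratic loss, learning rate, and iteration count below are illustrative assumptions, not from the original notes):

```python
import numpy as np

def grad(theta):
    # Gradient of the illustrative loss L(theta1, theta2) = (theta1 - 3)^2 + (theta2 + 1)^2
    return np.array([2*(theta[0] - 3), 2*(theta[1] + 1)])

theta = np.array([0.0, 0.0])   # initial point (a, b)
eta = 0.1                      # learning rate
for _ in range(100):
    theta = theta - eta * grad(theta)   # theta <- theta - eta * dL/dtheta

print(theta)   # approaches (3, -1)
```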


Negative Gradient

The rate of change of $f(\pmb x)$ at the point $\pmb x$ along a direction $\pmb d$ is the directional derivative, which equals the inner product of the gradient and the direction:
Df(\pmb x;\pmb d)=\nabla f(\pmb x)^T\pmb d
The steepest-descent direction is the solution of the nonlinear program
\min_{\pmb d}\ \nabla f(\pmb x)^T\pmb d\quad \quad \text{s.t.}\ ||\pmb d||\leq 1
By the Cauchy–Schwarz inequality,
|\nabla f(\pmb x)^T\pmb d| \leq ||\nabla f(\pmb x)||\,||\pmb d|| \leq ||\nabla f(\pmb x)||
so
\nabla f(\pmb x)^T\pmb d \geq -||\nabla f(\pmb x)||,\quad \text{with equality if and only if}\ \pmb d=-\frac{\nabla f(\pmb x)}{||\nabla f(\pmb x)||}
That is, the negative gradient direction is the direction of steepest descent.
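
A small numerical illustration of this fact (an added sketch; the gradient vector below is arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
g = np.array([2.0, -1.0, 0.5])            # an arbitrary "gradient" vector

# Sample random unit directions and evaluate the directional derivative g.T @ d.
d = rng.normal(size=(10000, 3))
d /= np.linalg.norm(d, axis=1, keepdims=True)
values = d @ g

best = -g / np.linalg.norm(g)             # negative normalized gradient
print(values.min())                       # close to -||g||
print(g @ best)                           # exactly -||g|| ≈ -2.29
```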


Batch Gradient Descent

Each iteration uses all samples to update the parameters:
\pmb w_{t+1}=\pmb w_t - \eta\nabla L_t, \quad\nabla L_t=-\dfrac{1}{n}\sum_{i=1}^{n}\left(\hat y_i - (\pmb w_t \pmb x_i + b)\right)\pmb x_i
Pros: the full-batch gradient can be computed in parallel, and for convex problems it reaches the global optimum;
Cons: long training time per update on large datasets;
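
A minimal batch gradient descent sketch for linear regression (the synthetic data, learning rate, and iteration count are assumptions for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
X = rng.normal(size=(n, 2))
y = X @ np.array([1.5, -2.0]) + 0.5       # targets (the \hat y_i above)

w, b, eta = np.zeros(2), 0.0, 0.1
for _ in range(500):
    err = y - (X @ w + b)                 # residuals  \hat y_i - (w x_i + b)
    w -= eta * (-(X.T @ err) / n)         # dL/dw for L = 1/(2n) * sum(err^2)
    b -= eta * (-err.mean())              # dL/db
print(w, b)                               # approximately [1.5, -2.0] and 0.5
```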


Stochastic Gradient Descent, SGD

Stochastic gradient descent picks one sample at random per iteration and updates the parameters with its gradient:
\pmb w_{t+1}=\pmb w_t - \eta\nabla L_t,\quad \text{where}\ \nabla L_t=-\left(\hat y_i - (\pmb w_t \pmb x_i + b)\right)\pmb x_i\ \text{for a randomly chosen sample}\ i
Pros: fast updates;
Cons: lower accuracy (the search through the solution space is noisy), it may end up in a local optimum, and the updates cannot be parallelized across samples;
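
A matching SGD sketch (same illustrative data-generating assumptions as the batch example above):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
X = rng.normal(size=(n, 2))
y = X @ np.array([1.5, -2.0]) + 0.5

w, b, eta = np.zeros(2), 0.0, 0.05
for t in range(5000):
    i = rng.integers(n)                   # pick one sample at random
    err = y[i] - (X[i] @ w + b)
    w -= eta * (-err * X[i])              # gradient of 1/2 * err^2 w.r.t. w
    b -= eta * (-err)                     # gradient w.r.t. b
print(w, b)                               # noisy estimates near [1.5, -2.0] and 0.5
```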


Adaptive Gradient Descent, Adagrad

Adagrad adapts the step size per parameter: a parameter's ideal learning rate is proportional to the magnitude of its first derivative and inversely proportional to its second derivative. For any single parameter the update is
\pmb w_{t+1} = \pmb w_t - \frac{\eta_t}{\pmb\sigma_t + \epsilon}\pmb g_t, \quad \eta_t = \frac{\eta }{\sqrt{t + 1}},\quad \pmb\sigma_t=\sqrt{\frac{1}{t+1}\sum_{i=0}^t\pmb g_i^2},\quad \pmb g_t=\frac{\partial L}{\partial \pmb w}
where $\pmb\sigma_t$ is the root mean square of the past first derivatives ($\epsilon$ keeps the denominator away from 0); it estimates the effect of the second derivative without extra computation. Note that the $\frac{1}{\sqrt{t+1}}$ factors cancel, so the effective step is $\frac{\eta}{\sqrt{\sum_{i=0}^{t}\pmb g_i^2}}\,\pmb g_t$.

Pros: the learning rate is adjusted dynamically and differs per parameter, which suits sparse data (e.g. natural language processing and computer vision).
Cons: as the number of iterations grows, the denominator keeps increasing, the update shrinks toward 0, and training can stall early.
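
A per-parameter Adagrad sketch on the same illustrative quadratic loss (hyperparameters are assumptions):

```python
import numpy as np

def grad(theta):
    # Gradient of the illustrative loss (theta1 - 3)^2 + (theta2 + 1)^2
    return np.array([2*(theta[0] - 3), 2*(theta[1] + 1)])

theta = np.zeros(2)
eta, eps = 1.0, 1e-8
sum_sq = np.zeros(2)                      # running sum of squared gradients
for t in range(200):
    g = grad(theta)
    sum_sq += g**2
    theta -= eta / (np.sqrt(sum_sq) + eps) * g   # eta_t / sigma_t collapses to eta / sqrt(sum g_i^2)

print(theta)   # approaches (3, -1)
```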


Momentum

With SGD the update direction depends only on the gradient computed from the current batch, so it has no way to escape a local optimum.
Momentum borrows the idea of momentum from physics and simulates the inertia of a moving object: the update depends on both the previous update direction and the current gradient.
\pmb w_{t+1} = \pmb w_t - \pmb v_t, \quad \pmb v_t = \gamma\pmb v_{t-1} + \alpha \pmb g_t, \quad \pmb v_0 = \pmb0
Physical interpretation: $\gamma$ acts like air resistance. Pushing a ball downhill, it accumulates momentum on the way down; if the momentum agrees with the gradient direction it keeps speeding up, and if the direction changes the momentum decays, so the ball slows down or turns.
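
A momentum sketch on the same illustrative quadratic loss (the gamma and alpha values are assumptions):

```python
import numpy as np

def grad(theta):
    return np.array([2*(theta[0] - 3), 2*(theta[1] + 1)])

theta = np.zeros(2)
v = np.zeros(2)                           # v_0 = 0
gamma, alpha = 0.9, 0.05                  # "friction" and step size
for t in range(300):
    v = gamma * v + alpha * grad(theta)   # accumulate velocity
    theta -= v                            # w_{t+1} = w_t - v_t

print(theta)   # approaches (3, -1)
```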


Root Mean Square Propagation, RMSprop

Root mean square propagation addresses both Adagrad's rapidly decaying step size and Momentum's large oscillations:
\pmb w_{t+1}=\pmb w_t - \frac{\eta}{\sqrt{\pmb\sigma_t + \epsilon}}\pmb g_t, \quad \pmb\sigma_t = {\alpha\pmb\sigma_{t-1} + (1-\alpha)\pmb g_t^2}
Here $1-\alpha$ is the weight given to the current squared gradient, while $\alpha$ controls how much of the past history is kept.
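
An RMSprop sketch under the same assumptions (alpha, eta, and eps are illustrative choices):

```python
import numpy as np

def grad(theta):
    return np.array([2*(theta[0] - 3), 2*(theta[1] + 1)])

theta = np.zeros(2)
sigma = np.zeros(2)                       # moving average of squared gradients
alpha, eta, eps = 0.9, 0.1, 1e-8
for t in range(300):
    g = grad(theta)
    sigma = alpha * sigma + (1 - alpha) * g**2
    theta -= eta / np.sqrt(sigma + eps) * g

print(theta)   # close to (3, -1); plain RMSprop keeps chattering near the optimum unless eta is decayed
```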


Adaptive Moment Estimation, Adam

Adaptive moment estimation keeps exponential moving averages of the gradient and of the squared gradient:
\pmb m_{t+1} = \alpha\pmb m_t + (1-\alpha)\pmb g_t,\quad\pmb v_{t+1}=\beta\pmb v_t + (1-\beta)\pmb g_t^2
Bias correction (at step $t$):
\pmb{\hat m}_t=\frac{\pmb m_t}{1 - \alpha^t}, \quad \pmb{\hat v}_t=\frac{\pmb v_t}{1 - \beta^t}
Parameter update:
\pmb w_{t+1}=\pmb w_t - \frac{\eta}{\sqrt{\pmb{\hat v}_t} + \epsilon}\pmb{\hat m}_t
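
An Adam sketch under the same illustrative assumptions (the usual defaults alpha = 0.9, beta = 0.999 plus an assumed eta):

```python
import numpy as np

def grad(theta):
    return np.array([2*(theta[0] - 3), 2*(theta[1] + 1)])

theta = np.zeros(2)
m, v = np.zeros(2), np.zeros(2)
alpha, beta, eta, eps = 0.9, 0.999, 0.1, 1e-8
for t in range(1, 1001):
    g = grad(theta)
    m = alpha * m + (1 - alpha) * g       # first-moment estimate
    v = beta * v + (1 - beta) * g**2      # second-moment estimate
    m_hat = m / (1 - alpha**t)            # bias correction
    v_hat = v / (1 - beta**t)
    theta -= eta / (np.sqrt(v_hat) + eps) * m_hat

print(theta)   # close to (3, -1)
```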
