Solving Unconstrained Optimization Problems

Question (148): What optimization methods are available for unconstrained optimization problems?

Review point: the relationship between first-order algorithms, second-order algorithms, and the Taylor expansion

  • Direct solution
  • Iterative solution
    • First-order algorithms
    • Second-order algorithms

Direct solution

  • convex objective function
  • closed-form solution

Example: Ridge regression
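
For example (the notation here is assumed for illustration: design matrix $X$, targets $\bm y$, regularisation strength $\lambda$), the ridge regression objective is strictly convex, and setting its gradient to zero yields a closed-form solution:

$\hat{\bm\theta} = \underset{\bm\theta}{\arg\min}\ \|\bm y - X\bm\theta\|_2^2 + \lambda\|\bm\theta\|_2^2 = (X^T X + \lambda I)^{-1} X^T \bm y$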

Iterative solution

$\bm\theta^{t+1} = \bm\theta^t + \bm\delta$

First-order algorithms

Taylor expansion: $L(\bm\theta + \bm\delta) = L(\bm\theta) + \bm\delta^T \frac{\partial L(\bm\theta)}{\partial \bm\theta}$
To avoid an overly large step $\bm\delta$, we impose an $L_2$-norm regulariser:
$\bm\delta = \underset{\bm\delta}{\arg\min}\ L(\bm\theta) + \bm\delta^T \frac{\partial L(\bm\theta)}{\partial \bm\theta} + \frac{1}{2\alpha}\|\bm\delta\|_2^2 = -\alpha \frac{\partial L(\bm\theta)}{\partial \bm\theta},$
which leads to the familiar first-order gradient descent algorithm: $\bm\theta^{t+1} = \bm\theta^t - \alpha \frac{\partial L(\bm\theta^t)}{\partial \bm\theta}$.

Two ways of interpreting $\alpha$: (i) as the learning rate; (ii) as a penalty on how far $\bm\delta$ is allowed to move away from $\bm\theta$.
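
A minimal sketch of this update rule, assuming a hypothetical quadratic objective with a hand-coded gradient (none of these names come from the original notes):

```python
import numpy as np

# Assumed quadratic objective L(theta) = 0.5 * theta^T A theta - b^T theta,
# whose gradient is A @ theta - b and whose exact minimiser is solve(A, b).
A = np.array([[3.0, 0.5], [0.5, 1.0]])
b = np.array([1.0, -2.0])

def grad_L(theta):
    return A @ theta - b

alpha = 0.1            # learning rate / step-size penalty
theta = np.zeros(2)    # initial point theta^0
for t in range(200):
    theta = theta - alpha * grad_L(theta)   # theta^{t+1} = theta^t - alpha * grad

print(theta)                   # approximate minimiser
print(np.linalg.solve(A, b))   # exact minimiser, for comparison
```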

Accelerated gradient descent algorithm (Nesterov)

to be studied

  • faster convergence

Second-order algorithms

Taylor expansion: $L(\bm\theta + \bm\delta) = L(\bm\theta) + \bm\delta^T \frac{\partial L(\bm\theta)}{\partial \bm\theta} + \frac{1}{2}\bm\delta^T \frac{\partial^2 L(\bm\theta)}{\partial \bm\theta^2}\bm\delta$

$\bm\delta = \underset{\bm\delta}{\arg\min}\ L(\bm\theta) + \bm\delta^T \nabla L(\bm\theta) + \frac{1}{2}\bm\delta^T \nabla^2 L(\bm\theta)\bm\delta = -\left[\nabla^2 L(\bm\theta)\right]^{-1}\nabla L(\bm\theta),$

which leads to Newton's method (a.k.a. the Newton-Raphson method): $\bm\theta^{t+1} = \bm\theta^t - \left[\nabla^2 L(\bm\theta^t)\right]^{-1}\nabla L(\bm\theta^t)$
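
A minimal sketch of the Newton update, assuming a hypothetical smooth convex objective with hand-coded gradient and Hessian (chosen only for illustration):

```python
import numpy as np

# Assumed objective L(theta) = sum(exp(theta)) + 0.5 * ||theta||^2,
# with gradient exp(theta) + theta and Hessian diag(exp(theta)) + I.
def grad_L(theta):
    return np.exp(theta) + theta

def hess_L(theta):
    return np.diag(np.exp(theta)) + np.eye(len(theta))

theta = np.array([2.0, -1.0])
for t in range(20):
    # Solve the linear system H @ step = g rather than forming the inverse Hessian.
    step = np.linalg.solve(hess_L(theta), grad_L(theta))
    theta = theta - step          # theta^{t+1} = theta^t - H^{-1} g

print(theta)            # stationary point
print(grad_L(theta))    # gradient should be ~0 at convergence
```

In practice the linear system is solved (or the Hessian factorised) rather than computing an explicit inverse.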

Geometric interpretation

In the univariate case, we find a quadratic function $H(x)$ that approximates the original function $h(x)$ at the current point $x_0$, and then move towards the minimum of $H(x)$.

In the multivariate case, we fit a paraboloid to the surface of $f(x)$ at $x_0$, which has the same slope and curvature as the surface at $x_0$, and then proceed to the maximum or minimum of that paraboloid (in higher dimensions, this may also be a saddle point) [2].
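
Making the univariate picture explicit (a worked step added for illustration): the quadratic model at $x_0$ is $H(x) = h(x_0) + h'(x_0)(x - x_0) + \frac{1}{2}h''(x_0)(x - x_0)^2$; setting $H'(x) = 0$ gives the next iterate $x_1 = x_0 - \frac{h'(x_0)}{h''(x_0)}$, which is exactly the Newton update in one dimension.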

  • Pro: faster convergence than first-order methods
  • Con: requires computing the inverse of the Hessian matrix, which is expensive, particularly for high-dimensional problems
  • Con: may converge to a saddle point when the objective function is non-convex

BFGS, L-BFGS

to be studied

  • address the high computational cost of Hessian inversion by approximating the (inverse) Hessian from gradient information, as in the usage sketch below
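
A usage sketch of L-BFGS via SciPy (the quadratic objective below is an assumption used only for illustration; scipy.optimize.minimize with method="L-BFGS-B" is a standard quasi-Newton entry point):

```python
import numpy as np
from scipy.optimize import minimize

# Assumed quadratic objective L(theta) = 0.5 * theta^T A theta - b^T theta.
A = np.array([[3.0, 0.5], [0.5, 1.0]])
b = np.array([1.0, -2.0])

def L(theta):
    return 0.5 * theta @ A @ theta - b @ theta

def grad_L(theta):
    return A @ theta - b

# L-BFGS keeps a limited-memory approximation of the inverse Hessian built from
# recent gradient differences, so no explicit Hessian or matrix inversion is needed.
result = minimize(L, x0=np.zeros(2), jac=grad_L, method="L-BFGS-B")
print(result.x)                  # approximate minimiser
print(np.linalg.solve(A, b))     # exact minimiser, for comparison
```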

Image source:

  1. Newton’s method: UCL STAT0023, 2018-19, Lec 4. Optimisation, maximum likelihood and nonlinear least squares in R

References:

  1. 《百問機器學習》
  2. Wikipedia, Newton’s method in optimization
    https://en.wikipedia.org/wiki/Newton%27s_method_in_optimization