Taylor expansion: $L(\theta+\delta) = L(\theta) + \delta^\top \frac{\partial L(\theta)}{\partial \theta}$
To avoid a large value of $\delta$, we impose an L2-norm regulariser: $\delta = \arg\min_{\delta}\; L(\theta) + \delta^\top \frac{\partial L(\theta)}{\partial \theta} + \frac{1}{2\alpha}\|\delta\|_2^2 = -\alpha \frac{\partial L(\theta)}{\partial \theta}$,
which leads to the familiar first-order gradient descent algorithm: $\theta_{t+1} = \theta_t - \alpha \frac{\partial L(\theta)}{\partial \theta}$.
Two ways of interpreting $\alpha$: i) the learning rate; ii) the strength of the penalty on how far the update $\delta$ is allowed to move away from $\theta$.
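A minimal sketch of this update rule in Python; the function names, the default step size, and the quadratic example objective are illustrative assumptions rather than anything specified above.

import numpy as np

def gradient_descent(grad, theta0, alpha=0.1, num_steps=100):
    # Repeatedly apply theta_{t+1} = theta_t - alpha * dL/dtheta.
    theta = np.asarray(theta0, dtype=float)
    for _ in range(num_steps):
        theta = theta - alpha * grad(theta)
    return theta

# Example: L(theta) = 0.5 * ||theta||^2, so grad L(theta) = theta; minimum at the origin.
theta_star = gradient_descent(lambda th: th, theta0=[3.0, -2.0], alpha=0.1)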
Accelerated gradient descent algorithm (Nesterov)
to be studied
faster convergence rate than plain gradient descent ($O(1/t^2)$ vs. $O(1/t)$ on smooth convex problems)
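For reference while this is still to be studied, one common way of writing Nesterov's update keeps a momentum buffer $v_t$ and evaluates the gradient at a look-ahead point; the momentum coefficient $\mu$ is an assumed hyperparameter, not something defined above:
$v_{t+1} = \mu v_t - \alpha \nabla L(\theta_t + \mu v_t), \qquad \theta_{t+1} = \theta_t + v_{t+1}$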
Second-order algorithms
Taylor expansion: $L(\theta+\delta) = L(\theta) + \delta^\top \frac{\partial L(\theta)}{\partial \theta} + \frac{1}{2}\delta^\top \frac{\partial^2 L(\theta)}{\partial \theta^2}\delta$
$\delta = \arg\min_{\delta}\; L(\theta) + \delta^\top \nabla L(\theta) + \frac{1}{2}\delta^\top \nabla^2 L(\theta)\,\delta = -\left[\nabla^2 L(\theta)\right]^{-1}\nabla L(\theta)$
which leads to Newton's method (a.k.a. the Newton–Raphson method): $\theta_{t+1} = \theta_t - \left[\nabla^2 L(\theta)\right]^{-1}\nabla L(\theta)$
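A minimal Python sketch of this update, assuming callables grad and hess for the gradient and Hessian are available; the names and the quadratic test objective are illustrative, and the linear system is solved instead of forming an explicit inverse.

import numpy as np

def newton_method(grad, hess, theta0, num_steps=20):
    # theta_{t+1} = theta_t - [H(theta_t)]^{-1} grad(theta_t)
    theta = np.asarray(theta0, dtype=float)
    for _ in range(num_steps):
        # Solve H * delta = grad rather than explicitly inverting H.
        delta = np.linalg.solve(hess(theta), grad(theta))
        theta = theta - delta
    return theta

# Example: L(theta) = 0.5 * theta^T A theta with A positive definite;
# the method reaches the minimum (the origin) in a single step.
A = np.array([[3.0, 1.0], [1.0, 2.0]])
theta_star = newton_method(lambda th: A @ th, lambda th: A, [1.0, -1.0])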
Geometric interpretation
For the univariate case, we find a quadratic function $H(x)$ that approximates the original function $h(x)$ at the current point $x_0$, and then optimise towards the minimum of $H(x)$.
For the multivariate case, we fit a paraboloid to the surface of $f(x)$ at $x_0$ with the same slope and curvature as the surface at that point, and then proceed to the maximum or minimum of that paraboloid (in higher dimensions, this may also be a saddle point) [2].
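Writing the univariate case out explicitly: $H(x) = h(x_0) + h'(x_0)(x - x_0) + \frac{1}{2}h''(x_0)(x - x_0)^2$, and setting $H'(x) = 0$ gives $x = x_0 - \frac{h'(x_0)}{h''(x_0)}$, which is exactly the Newton update in one dimension.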
Pro: faster convergence than first-order methods
Con: computing the inverse of the Hessian matrix is expensive, particularly for high-dimensional problems
Con: may converge to a saddle point when the objective function is non-convex
BFGS, L-BFGS
to be studied
these quasi-Newton methods address the high computational cost of matrix inversion by maintaining an approximation to the (inverse) Hessian built from gradient information
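As an illustration, and assuming SciPy is an acceptable dependency here, L-BFGS can be run through scipy.optimize.minimize; the Rosenbrock test function below is only an example objective.

import numpy as np
from scipy.optimize import minimize, rosen, rosen_der

# L-BFGS-B keeps a limited-memory approximation to the inverse Hessian,
# so no explicit Hessian is ever formed or inverted.
x0 = np.array([-1.2, 1.0])
result = minimize(rosen, x0, jac=rosen_der, method="L-BFGS-B")
print(result.x)  # close to the true minimiser [1.0, 1.0]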
Image source:
Newton’s method: UCL STAT0023, 2018-19, Lec 4. Optimisation, maximum likelihood and nonlinear least squares in R
References:
[1] 《百問機器學習》
[2] Wikipedia, "Newton's method in optimization", https://en.wikipedia.org/wiki/Newton%27s_method_in_optimization