Background: In gradient descent algorithms, the step size may be too large or too small, as shown in the figures below.
Backtracking line search
- Initialization: alpha (= 1), tau (decay rate)
- while f(x^t + alpha * p^t) > f(x^t)
      alpha = tau * alpha
  end
- Update: x^{t+1} = x^t + alpha * p^t
The update α = τα (starting from α = 1 and decaying gradually) keeps the step length from being too small: the loop stops at the first α that passes the test, so the accepted α stays as large as possible.
The test f(x^t + αp^t) > f(x^t) prevents steps that are too long relative to the decrease in f; in practice it is implemented via the Armijo condition.
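The plain backtracking loop above can be sketched in Python as follows (function and variable names are illustrative, not from the source):

```python
def backtracking_simple(f, x, p, alpha=1.0, tau=0.5):
    # Shrink alpha (alpha = tau * alpha) while the trial step still
    # increases f, then take the step x^{t+1} = x^t + alpha * p^t.
    while f(x + alpha * p) > f(x):
        alpha = tau * alpha
    return x + alpha * p

# Toy usage on f(x) = 0.5 x^2, whose gradient at x is x.
f = lambda x: 0.5 * x ** 2
x = 3.0
p = -x            # p = -gradient(x): the steepest-descent direction
x_new = backtracking_simple(f, x, p)
# Here alpha = 1 already lands at the minimizer 0, so no shrinking occurs.
```

Note this bare version only checks f(x + αp) > f(x); the Armijo condition below strengthens the test.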
Wolfe conditions
Armijo condition
f(x^t + α^t p^t) ≤ f(x^t) + α^t c_1 [g^t]^T p^t,
where g^t = ∇f(x^t) denotes the gradient (first derivative) of f; p^t denotes the search direction, which satisfies [g^t]^T p^t < 0 (remark: for gradient descent, p^t = −g^t).
In practice, c_1 is chosen quite small, say c_1 = 10^{-4}.
In the case that p = −g, the while-condition in the pseudocode above simplifies as follows: keep shrinking α while f(x − α∇f(x)) > f(x) − c_1 α ‖∇f(x)‖²₂.
*In the Boyd & Vandenberghe (B&V) book, typical choices are c_1 ∈ [0.01, 0.3] and τ ∈ [0.1, 0.8].
Intuition
Require the reduction in f to be at least a fixed fraction c_1 of the reduction promised by the first-order Taylor approximation of f at x^t.
Also known as the sufficient decrease condition: it requires α to decrease the objective function by a sufficient amount.
Curvature condition
The curvature condition rules out steps that are too small: ∇f(x^t + α^t p^t)^T p^t ≥ c_2 ∇f(x^t)^T p^t,
where c_2 ∈ (c_1, 1). Since ∇f(x^t)^T p^t < 0, the condition requires the new slope along p^t to be at least c_2 times the original slope, i.e. the slope must have flattened sufficiently.
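The two Wolfe conditions together can be checked as below; this sketch assumes c_1 = 10⁻⁴ and c_2 = 0.9 (common defaults for quasi-Newton directions, not specified in the source):

```python
import numpy as np

def satisfies_wolfe(f, grad_f, x, p, alpha, c1=1e-4, c2=0.9):
    """Check both Wolfe conditions for step length alpha.

    Armijo:    f(x + alpha*p) <= f(x) + c1 * alpha * grad(x)^T p
    Curvature: grad(x + alpha*p)^T p >= c2 * grad(x)^T p
    """
    gTp = np.dot(grad_f(x), p)   # slope at alpha = 0 (negative for descent)
    armijo = f(x + alpha * p) <= f(x) + c1 * alpha * gTp
    curvature = np.dot(grad_f(x + alpha * p), p) >= c2 * gTp
    return armijo and curvature

# On f(x) = 0.5 x^T x with p = -grad(x) = -x, the slope at step alpha
# is (1 - alpha) * grad^T p, so a tiny alpha fails the curvature
# condition while a moderate alpha passes both.
f = lambda x: 0.5 * np.dot(x, x)
grad_f = lambda x: x
x = np.array([2.0, -1.0])
p = -grad_f(x)
```

With c_2 = 0.9, any α < 0.1 is rejected here by the curvature condition alone, illustrating how it rules out small steps.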