Reading notes of the paper "Distributed Optimization and Statistical Learning via ADMM" by Boyd, Parikh, Chu, Peleato and Eckstein.
Introduction
- ADMM: developed in the 1970s, with roots in the 1950s. It has been shown to be closely related to other methods such as Douglas-Rachford splitting, Spingarn's method of partial inverses, proximal methods, etc.
- Why ADMM today: with the arrival of the big-data era and the demands of ML algorithms, ADMM has proved well suited to solving large-scale optimization problems in a distributed fashion.
- What big data brings us: with big data, simple methods can turn out to be very effective at solving complex problems.
- ADMM can be seen as a blend of dual decomposition and the augmented Lagrangian method. The latter is more robust and has better convergence properties, but does not decompose directly the way dual decomposition does.
- ADMM can decompose by example or by features. [To be explored in later chapters]
- Note that even when run serially, ADMM is still comparable to other methods and often converges in tens of iterations.
Precursors
- What is conjugate function exactly?
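A quick answer to the question above, from standard convex analysis (not spelled out in the notes): the conjugate of f is

```latex
f^{*}(y) = \sup_{x} \left( y^{\top} x - f(x) \right)
```

It matters here because, for the problem min f(x) s.t. Ax = b, the dual function can be written as g(y) = -f^*(-A^T y) - b^T y, which is what dual ascent maximizes.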
- Dual ascent and dual subgradient methods: they converge if the stepsize is chosen appropriately and certain other assumptions hold.
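A minimal sketch of dual ascent on a hypothetical toy problem (the data, stepsize, and iteration count below are all made up; f = (1/2)||x||^2 is chosen strongly convex so the x-minimization is closed-form):

```python
import numpy as np

# Dual ascent sketch for: minimize (1/2)||x||^2  subject to  Ax = b.
# Since f is strongly convex, argmin_x L(x, y) has the closed form x = -A^T y.
rng = np.random.default_rng(0)
A = rng.standard_normal((3, 5))
b = rng.standard_normal(3)

y = np.zeros(3)      # dual variable
alpha = 0.05         # fixed stepsize; must be below 2/lambda_max(A A^T) here
for k in range(10000):
    x = -A.T @ y                    # x-update: argmin_x L(x, y)
    y = y + alpha * (A @ x - b)     # dual update along the primal residual

print(np.linalg.norm(A @ x - b))    # primal residual shrinks toward 0
```

The dual update moves y along the gradient of the dual function, which is exactly the constraint residual Ax - b; at convergence x recovers the minimum-norm solution.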
- Why the augmented Lagrangian:
- More robust, fewer assumptions (no strict convexity or finiteness of f required): in practice some convergence assumptions for dual ascent are not met; e.g. if the objective is affine (e.g. min x s.t. x >= 10), the minimization over x is unbounded below for most dual values and the method fails.
- For equality constraints, the augmented version converges faster. This can be viewed from the penalty method's point of view.
- Dual decomposition: relax the coupling constraints so that the problem can be decomposed. This naturally enables parallel computation.
- The rho in the augmented Lagrangian actually serves as the stepsize, and with the rho/2 penalty factor, dual feasibility is preserved along the iterations (x^{k+1}, y^k) -> (x^{k+1}, y^{k+1}). The proof is simple: since x^{k+1} minimizes L(x, y^k), the gradient of L(x^{k+1}, y^k) with respect to x is 0. As a result, convergence only needs to be proved for primal feasibility.
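The dual-feasibility claim above can be written out for the single-block method of multipliers, using the update y^{k+1} = y^k + rho (Ax^{k+1} - b):

```latex
0 = \nabla_x L_{\rho}(x^{k+1}, y^{k})
  = \nabla f(x^{k+1}) + A^{\top} y^{k} + \rho A^{\top} (A x^{k+1} - b)
  = \nabla f(x^{k+1}) + A^{\top} y^{k+1}
```

which is exactly the dual feasibility condition evaluated at (x^{k+1}, y^{k+1}).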
Alternating Direction Method of Multipliers
- Now split the variable in the augmented Lagrangian into x and z, and minimize over x and z alternately.
- Gauss-Seidel pass?
- The scaled form is often more convenient: let u = (1/rho) y and r^k = Ax^k + Bz^k - c
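A minimal sketch of scaled-form ADMM, using the standard lasso splitting (the problem data A, b, lam, rho and the iteration count are hypothetical; here A = I, B = -I, c = 0, so r^k = x^k - z^k):

```python
import numpy as np

def soft_threshold(v, kappa):
    """Elementwise shrinkage: the prox operator of kappa*||.||_1."""
    return np.sign(v) * np.maximum(np.abs(v) - kappa, 0.0)

# Scaled-form ADMM sketch for the lasso,
#   minimize (1/2)||Ax - b||^2 + lam*||z||_1  subject to  x - z = 0,
# on a made-up toy instance.
rng = np.random.default_rng(1)
A = rng.standard_normal((20, 10))
b = rng.standard_normal(20)
lam, rho = 0.5, 1.0

x, z, u = np.zeros(10), np.zeros(10), np.zeros(10)   # u = y/rho (scaled dual)
AtA_rhoI = A.T @ A + rho * np.eye(10)                # constant across iterations
Atb = A.T @ b
for k in range(1000):
    x = np.linalg.solve(AtA_rhoI, Atb + rho * (z - u))   # x-update (ridge-like)
    z = soft_threshold(x + u, lam / rho)                 # z-update (l1 prox)
    u = u + x - z                                        # scaled dual update

print(np.linalg.norm(x - z))   # primal residual r^k = x^k - z^k
```

The point of the splitting is that each subproblem is easy on its own: the x-update is a linear solve with a fixed matrix, and the z-update is a cheap elementwise prox.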
- How to perform the parallel computation?
- A basic version of convergence result
- Assumption 1: f and g are proper, closed and convex
- Assumption 2: the unaugmented Lagrangian has a saddle point. (so the dual problem has an optimal solution, i.e. it should not be unbounded?)
- Residual convergence: r^k -> 0, i.e. the iterates approach primal feasibility
- Objective convergence: the objective approaches the optimal value
- Dual variable convergence: a dual optimum (y*) is also approached along the iterations
- Convergence in practice
- Often converges to an acceptable accuracy within a few tens of iterations.
- Slow convergence to high accuracy compared with Newton's method or interior-point methods.
- Suitable for situations where moderate accuracy is sufficient (this is why it's a good choice for ML and statistical learning).
- Can be combined with other methods in later iterations to reach high accuracy.
- Optimality conditions
- Optimality is reached when primal and dual feasibility are satisfied
- Primal: Ax* + Bz* - c = 0
- Dual: 0 ∈ ∂f(x*) + A^T y* and 0 ∈ ∂g(z*) + B^T y*
- It can be shown that the third relation is always satisfied, due to the way we update y after the minimization of z. So the residual of dual feasibility can be defined as s^{k+1} = rho A^T B (z^{k+1} - z^k). The primal one is r^{k+1} = Ax^{k+1} + Bz^{k+1} - c
- Optimality is reached when both s and r converge to 0
- Stopping criteria
- It can be shown that the optimality gap satisfies
  f(x^k) + g(z^k) - p* <= -(y^k)^T r^k + (x^k - x*)^T s^k
- The rhs still involves the unknown x*, but the last term can be bounded:
- let d >= ||x^k - x*||_2; then the rhs <= -(y^k)^T r^k + d ||s^k||_2
  (this is because the vectors satisfy (x^k - x*)^T s^k <= ||x^k - x*||_2 ||s^k||_2 by Cauchy-Schwarz)
- So the stopping criteria can be based on s and r: ||s^k||_2 <= eps^dual and ||r^k||_2 <= eps^pri. They can be chosen using absolute and relative tolerances:
  eps^pri = sqrt(p) eps^abs + eps^rel max{||Ax^k||_2, ||Bz^k||_2, ||c||_2}
  eps^dual = sqrt(n) eps^abs + eps^rel ||A^T y^k||_2
- where sqrt(p) and sqrt(n) account for the fact that the l2 norms are over R^p and R^n. eps^rel is typically set to 10^-3 or 10^-4; eps^abs depends on the scale of the problem variables.
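The stopping test described above can be sketched as a small helper (a minimal sketch for the generic constraint Ax + Bz = c; the function name and default tolerances are illustrative assumptions):

```python
import numpy as np

def admm_stop(A, B, c, x, z, z_old, y, rho, eps_abs=1e-4, eps_rel=1e-3):
    """Return True when both ADMM residuals are within tolerance.

    For constraints Ax + Bz = c:
      r = A x + B z - c                 (primal residual, in R^p)
      s = rho * A^T B (z - z_old)       (dual residual, in R^n)
    """
    r = A @ x + B @ z - c
    s = rho * A.T @ (B @ (z - z_old))
    p, n = A.shape
    eps_pri = np.sqrt(p) * eps_abs + eps_rel * max(
        np.linalg.norm(A @ x), np.linalg.norm(B @ z), np.linalg.norm(c))
    eps_dual = np.sqrt(n) * eps_abs + eps_rel * np.linalg.norm(A.T @ y)
    return np.linalg.norm(r) <= eps_pri and np.linalg.norm(s) <= eps_dual
```

For example, with A = I, B = -I, c = 0 (consensus form), the test passes once x and z agree and z has stopped moving.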
- Extensions and Variations
- Varying penalty parameter: to guarantee convergence, fix the parameter from some iteration onward. Otherwise consider:
  rho^{k+1} = tau^incr * rho^k   if ||r^k||_2 > mu ||s^k||_2
  rho^{k+1} = rho^k / tau^decr   if ||s^k||_2 > mu ||r^k||_2
  rho^{k+1} = rho^k              otherwise
- Typical choices are mu = 10 and tau^incr = tau^decr = 2. The mu here pushes primal and dual feasibility to be approached simultaneously.
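A minimal sketch of this adaptive-rho rule (the function name is made up; the mu and tau defaults are the typical values mentioned):

```python
def update_rho(rho, r_norm, s_norm, mu=10.0, tau_incr=2.0, tau_decr=2.0):
    """Adaptive penalty update for ADMM.

    Increase rho when the primal residual dominates, decrease it when the
    dual residual dominates, so both are driven to zero together.
    """
    if r_norm > mu * s_norm:
        return tau_incr * rho       # primal residual too large -> penalize more
    if s_norm > mu * r_norm:
        return rho / tau_decr       # dual residual too large -> penalize less
    return rho                      # residuals balanced -> keep rho
```

Note that when using the scaled dual variable u = y/rho, u must be rescaled after rho changes (e.g. halved when rho is doubled), since y itself should stay continuous.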
- Inexact minimization: each minimization step may be carried out only approximately; under some assumptions (e.g. summable errors) the algorithm still converges.
- Updating ordering
- carrying out multiple x-z minimization passes before each dual update
- or carrying out an extra dual update between the x-minimization and the z-minimization
- Related algorithms
- Proximal methods of multipliers
- Saddle point splitting
- Distributed ADMM, etc. (to be discussed later)
- Other references
- When applied to statistical learning: constrained sparse regression
- SVM
- Maximal monotone operator
- NOTE
- "There is currently no proof of convergence known for ADMM with nonquadratic penalty terms"