[NOTE in progress] Distributed Optimization and Statistical Learning via ADMM - Boyd

Reading notes on the paper "Distributed Optimization and Statistical Learning via the Alternating Direction Method of Multipliers" by Boyd, Parikh, Chu, Peleato, and Eckstein.

Introduction

  • ADMM: developed in the 1970s, with roots in the 1950s. It is closely related to other methods such as Douglas-Rachford splitting, Spingarn's method of partial inverses, and proximal methods.
  • Why ADMM today: with the arrival of the big-data era and the demand for large-scale machine learning, ADMM turns out to be well suited to solving large-scale optimization problems in a distributed fashion.
  • What big data brings us: with big data, simple methods can be very effective at solving complex problems.
  • ADMM can be seen as a blend of dual decomposition and the augmented Lagrangian method of multipliers. The latter is more robust and has better convergence properties, but cannot be decomposed directly the way dual decomposition can.
  • ADMM can decompose a problem by examples or by features. [To be explored in later chapters]
  • Note that even when used in serial mode, ADMM is comparable to other methods and often converges within a few tens of iterations.

Precursors

  • What is the conjugate function exactly? (Recall f^*(y)=\sup_x(y^Tx-f(x)); for minimize f(x) s.t. Ax=b, the dual function can be written as g(y)=-f^*(-A^Ty)-b^Ty.)
  • Dual ascent and dual subgradient methods: if the step size is chosen appropriately and some additional assumptions hold, they converge.
  • Why the augmented Lagrangian:
    • More robust, fewer assumptions (no strict convexity or finiteness of f required): in practice the convergence assumptions of dual ascent are often not met; e.g. when f is linear (say min x s.t. x \geq 10), the Lagrangian is unbounded below in x for most dual points, so the x-update fails and the dual function takes the value -\infty there.
    • For equality constraints, the augmented version converges faster. This can be understood from the penalty method's point of view.
  • Dual decomposition: relax the coupling constraints so that the problem can be decomposed. This naturally leads to parallel computation.
  • The \rho in the augmented Lagrangian (the coefficient of the (\rho/2)\|Ax-b\|_2^2 penalty) is also used as the step size of the dual update, and with this choice dual feasibility holds at every iterate (x^{k+1},y^{k+1}). The proof is simple: since x^{k+1} minimizes L_\rho(x,y^k), we have 0=\nabla f(x^{k+1})+A^T(y^k+\rho(Ax^{k+1}-b))=\nabla f(x^{k+1})+A^Ty^{k+1}. As a result, convergence only needs to be argued for primal feasibility. (A small numerical sketch follows this list.)
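
A minimal numerical sketch of the point above (my own illustration, not code from the paper): the method of multipliers on a small equality-constrained QP, where the dual-feasibility residual \nabla f(x^{k+1})+A^Ty^{k+1} is numerically zero at every iteration, while the primal residual Ax^{k+1}-b shrinks over the iterations.

```python
# Method of multipliers for: minimize (1/2) x^T P x + q^T x  s.t.  A x = b.
# All problem data below is randomly generated for illustration only.
import numpy as np

rng = np.random.default_rng(0)
n, m = 5, 2
M = rng.standard_normal((n, n))
P = M @ M.T + np.eye(n)            # positive definite quadratic term
q = rng.standard_normal(n)
A = rng.standard_normal((m, n))
b = rng.standard_normal(m)

rho = 1.0
y = np.zeros(m)
for k in range(50):
    # x-update: minimize L_rho(x, y) = f(x) + y^T(Ax - b) + (rho/2)||Ax - b||^2
    x = np.linalg.solve(P + rho * A.T @ A, -(q + A.T @ y) + rho * A.T @ b)
    # y-update with step size rho (the penalty parameter itself)
    y = y + rho * (A @ x - b)
    dual_res = np.linalg.norm(P @ x + q + A.T @ y)   # grad f(x) + A^T y
    prim_res = np.linalg.norm(A @ x - b)

print("dual residual:", dual_res)    # numerically zero from the first iteration
print("primal residual:", prim_res)  # -> 0 as k grows
```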

Alternating Direction Method of Multipliers

  • Now split the single variable of the augmented Lagrangian into two blocks x and z (objective f(x)+g(z), constraint Ax+Bz=c), and minimize over x and z alternately.
  • Gauss-Seidel pass? (The alternating x- and z-updates amount to a single Gauss-Seidel pass over the two blocks, rather than the joint minimization used in the method of multipliers.)
  • The scaled form is often more convenient: let u=(1/\rho)y and r^k=Ax^k+Bz^k-c. Then (a worked lasso sketch follows this block):
    • x^{k+1}=\operatorname{argmin}_x\left(f(x)+(\rho/2)\|Ax+Bz^k-c+u^k\|_2^2\right)
    • z^{k+1}=\operatorname{argmin}_z\left(g(z)+(\rho/2)\|Ax^{k+1}+Bz-c+u^k\|_2^2\right)
    • u^{k+1}=u^k+r^{k+1}
    • How to perform the parallel computation?
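
As a concrete instance of the scaled iteration, here is a minimal sketch (my own; the closed-form x-, z-, and u-updates are the standard lasso updates, which the paper derives in a later chapter) for the lasso: minimize (1/2)\|Ax-b\|_2^2+\lambda\|z\|_1 subject to x-z=0.

```python
# Scaled-form ADMM for the lasso: f(x) = (1/2)||Ax-b||^2, g(z) = lam*||z||_1,
# constraint x - z = 0 (so in the generic notation A := I, B := -I, c := 0).
# Problem data is randomly generated for illustration only.
import numpy as np

def soft_threshold(v, kappa):
    """Prox of kappa*||.||_1, i.e. elementwise soft thresholding."""
    return np.sign(v) * np.maximum(np.abs(v) - kappa, 0.0)

rng = np.random.default_rng(0)
m, n = 30, 10
A = rng.standard_normal((m, n))
b = rng.standard_normal(m)
lam, rho = 0.1, 1.0

x = np.zeros(n)
z = np.zeros(n)
u = np.zeros(n)                        # u = y / rho (scaled dual variable)
lhs = A.T @ A + rho * np.eye(n)        # formed once, reused every iteration
Atb = A.T @ b

for k in range(200):
    # x-update: argmin_x f(x) + (rho/2)||x - z^k + u^k||^2  (ridge-type solve)
    x = np.linalg.solve(lhs, Atb + rho * (z - u))
    # z-update: argmin_z g(z) + (rho/2)||x^{k+1} - z + u^k||^2  (soft thresholding)
    z_old = z
    z = soft_threshold(x + u, lam / rho)
    # u-update: u^{k+1} = u^k + r^{k+1}, with primal residual r = x - z
    r = x - z
    u = u + r
    s = -rho * (z - z_old)             # dual residual rho*A^T B (z^{k+1}-z^k), A=I, B=-I
    if np.linalg.norm(r) < 1e-6 and np.linalg.norm(s) < 1e-6:
        break

print("iterations:", k + 1, " ||r||:", np.linalg.norm(r), " ||s||:", np.linalg.norm(s))
```

In a real implementation one would cache a Cholesky factorization of A^TA+\rho I rather than re-solving from scratch, since the same coefficient matrix appears in every x-update.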
  • A basic version of convergence result
    • Assumption 1: f and g are proper, closed and convex
    • Assumption 2: the unaugmented Lagrangian L_0 has a saddle point. (So the dual problem has an optimal solution, i.e. it is not unbounded?)
    • Residual convergence: r^k\rightarrow 0 as k\rightarrow\infty
    • Objective convergence: the objective f(x^k)+g(z^k) approaches the optimal value p^*
    • Dual variable convergence: y^k approaches a dual optimal point y^*
  • Convergence in practice
    • Often converges within a few tens of iterations to an acceptable accuracy.
    • Slow convergence to high accuracy compared with Newton's method or interior-point methods
    • Suitable for situations where moderate accuracy is sufficient (this is why it is a good choice for ML and statistical learning problems)
    • Can be combined with other methods in later iterations to reach high accuracy
  • Optimality conditions
    • Optimality is reached when primal and dual feasibility are satisfied
    • Primal: Ax^*+Bz^*=c
    • Dual: 0\in\partial f(x^*)+A^Ty^* and 0\in\partial g(z^*)+B^Ty^*
    • It can be shown that the third relation (the one involving g) is always satisfied by (x^{k+1},z^{k+1},y^{k+1}), because of the z-minimization together with the way y is updated right after it. So only the f-condition can be violated, and the dual-feasibility residual is defined as s^{k+1}=\rho A^TB(z^{k+1}-z^k); the primal residual is r^{k+1}=Ax^{k+1}+Bz^{k+1}-c. (Derivation sketched just below.)
    • Optimality is reached when both s^k and r^k converge to 0
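    • Derivation (filling in the step referred to above; this is the standard argument, written with the scaled dual u=(1/\rho)y and y^{k+1}=\rho(u^k+r^{k+1})):
      • z^{k+1} minimizes g(z)+(\rho/2)\|Ax^{k+1}+Bz-c+u^k\|_2^2, so 0\in\partial g(z^{k+1})+\rho B^T(Ax^{k+1}+Bz^{k+1}-c+u^k)=\partial g(z^{k+1})+B^Ty^{k+1}, i.e. the g-condition always holds.
      • x^{k+1} minimizes f(x)+(\rho/2)\|Ax+Bz^k-c+u^k\|_2^2, so 0\in\partial f(x^{k+1})+\rho A^T(Ax^{k+1}+Bz^k-c+u^k)=\partial f(x^{k+1})+A^Ty^{k+1}+\rho A^TB(z^k-z^{k+1}); equivalently s^{k+1}=\rho A^TB(z^{k+1}-z^k)\in\partial f(x^{k+1})+A^Ty^{k+1}, which is why s^{k+1} measures dual infeasibility.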
  • Stopping criteria
    • It can be shown that the optimality gap satisfies f(x^k)+g(z^k)-p^*\leq -(y^k)^Tr^k+(x^k-x^*)^Ts^k
    • The second term on the rhs involves the unknown x^*, but it can be bounded.
    • If \|x^k-x^*\|_2\leq d, then the rhs \leq\|y^k\|_2\|r^k\|_2+d\|s^k\|_2 (by Cauchy-Schwarz: a^Tb\leq\|a\|_2\|b\|_2, since a^Tb=\|a\|\|b\|\cos\theta)
    • So the stopping criterion can be based on s and r: \|r^k\|_2\leq\epsilon^{pri} and \|s^k\|_2\leq\epsilon^{dual}, with the tolerances chosen from an absolute and a relative part:
    • \epsilon^{pri}=\sqrt{p}\,\epsilon^{abs}+\epsilon^{rel}\max\{\|Ax^k\|_2,\|Bz^k\|_2,\|c\|_2\}
    • \epsilon^{dual}=\sqrt{n}\,\epsilon^{abs}+\epsilon^{rel}\|A^Ty^k\|_2
    • where the \sqrt{p} and \sqrt{n} factors account for the fact that the l2 norms are of vectors in R^p and R^n
    • \epsilon^{rel} is typically set to 10^{-3} or 10^{-4}; \epsilon^{abs} depends on the scale of the problem (the typical variable values). (A small helper sketch follows this block.)
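
A small helper sketch of this criterion (my own, directly transcribing the formulas above for general A, B, c; not code from the paper):

```python
import numpy as np

def admm_stopping(A, B, c, x, z, z_old, y, rho, eps_abs=1e-4, eps_rel=1e-3):
    """Return (converged, ||r||, ||s||) for the absolute/relative criterion.
    A is p-by-n, B is p-by-m, c is in R^p; y is the (unscaled) dual variable."""
    r = A @ x + B @ z - c                      # primal residual, in R^p
    s = rho * A.T @ (B @ (z - z_old))          # dual residual, in R^n
    p, n = A.shape
    eps_pri = np.sqrt(p) * eps_abs + eps_rel * max(
        np.linalg.norm(A @ x), np.linalg.norm(B @ z), np.linalg.norm(c))
    eps_dual = np.sqrt(n) * eps_abs + eps_rel * np.linalg.norm(A.T @ y)
    return (np.linalg.norm(r) <= eps_pri and np.linalg.norm(s) <= eps_dual,
            np.linalg.norm(r), np.linalg.norm(s))
```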
  • Extensions and Variations
    • Varying penalty parameter \rho^k: the convergence results apply if \rho is held fixed after a finite number of iterations; in practice a simple adaptive scheme (sketched after this sub-list) often works well:
      • \rho^{k+1}=\tau^{incr}\rho^{k}\ \ if\ \|r^{k}\|_2>\mu\|s^k\|_2
      • \rho^{k+1}=\rho^{k}/\tau^{decr}\ \ if\ \|s^{k}\|_2>\mu\|r^k\|_2
      • \rho^{k+1}=\rho^{k}\ \ otherwise
      • Typical choices are \mu=10 and \tau^{incr}=\tau^{decr}=2. The factor \mu keeps the primal and dual residuals within a factor of \mu of each other, so they are driven to zero at roughly the same rate.
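
A minimal sketch of this rule (my own helper function; note that in the scaled form the scaled dual variable u=y/\rho must be rescaled whenever \rho changes, which the returned factor handles):

```python
def update_rho(rho, r_norm, s_norm, mu=10.0, tau_incr=2.0, tau_decr=2.0):
    """Return (new_rho, u_scale); multiply the scaled dual u by u_scale after updating rho."""
    if r_norm > mu * s_norm:           # primal residual too large -> increase rho
        return rho * tau_incr, 1.0 / tau_incr
    if s_norm > mu * r_norm:           # dual residual too large -> decrease rho
        return rho / tau_decr, tau_decr
    return rho, 1.0                    # otherwise leave rho (and u) unchanged
```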
    • Inexact minimization: each minimization step may be carried out only approximately; under suitable assumptions (roughly, the errors must be summable) the algorithm still converges.
    • Update ordering
      • carry out multiple x-z minimization passes before each dual update,
      • or carry out an extra dual update between the x-minimization and the z-minimization
    • Related algorithms
      • Proximal methods of multipliers
      • Saddle point splitting
      • Distributed ADMM, etc. (to be discussed later)
  • Other references
    • When applied to statistical learning: constrained sparse regression
    • SVM
    • Maximal monotone operator
  • NOTE
    • "There is currently no proof of convergence know for ADMM with nonquadratic penalty terms"