Reading notes on the paper "Distributed Optimization and Statistical Learning via the Alternating Direction Method of Multipliers" by Boyd, Parikh, Chu, Peleato and Eckstein.

# Introduction

- ADMM: developed in the 70s with roots in the 50s. It turns out to be closely related to other methods such as Douglas-Rachford splitting, Spingarn's method of partial inverses, proximal methods, etc.
- Why ADMM today: with the arrival of the big data era and the demand for ML algorithms, ADMM has proved well suited to solving large-scale optimization problems in a distributed fashion.
- What big data brings us: with big data, simple methods can turn out to be very effective at solving complex problems.
- ADMM can be seen as a blend of **Dual Decomposition** and **Augmented Lagrangian** methods. The latter is more robust and has better convergence properties, but cannot be decomposed directly the way DD can. **ADMM can decompose by example or by features**. [To be explored in later chapters]
- Note that even when used in serial mode, ADMM is still comparable to other methods and often converges in tens of iterations.

# Precursors

- What is the conjugate function exactly? (f^*(y) = sup_x (y^T x - f(x)))
- Dual ascent and dual subgradient methods: if the step size is chosen appropriately and **some other assumptions** hold, they converge.
- Why the augmented Lagrangian: **more robust, fewer assumptions (no strict convexity or finiteness of f required)**. In practice some of dual ascent's convergence assumptions are not met; e.g. if the objective is affine (min x s.t. x >= 10), the x-minimization of the Lagrangian is unbounded below and the x-update fails.
- For equality constraints, the augmented version converges faster. This can be viewed from the penalty method's point of view.

- Dual Decomposition: relax the coupling constraints so that the problem can be decomposed. This naturally involves parallel computation.
- The rho in the Augmented Lagrangian actually serves as the step size, and with the penalty factor rho/2, **dual feasibility** is preserved along the iterations (x^{k+1}, y^k) -> (x^{k+1}, y^{k+1}). The proof is simple: since x^{k+1} minimizes L_rho(x, y^k), we have 0 = grad f(x^{k+1}) + A^T y^k + rho A^T (A x^{k+1} - b) = grad f(x^{k+1}) + A^T y^{k+1}, i.e. (x^{k+1}, y^{k+1}) is dual feasible. As a result, convergence only needs to be proved for primal feasibility.
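The dual ascent mechanics above can be sketched on a toy equality-constrained QP (my own illustrative instance; the matrices and step size are assumptions, not from the notes):

```python
import numpy as np

# Toy problem: minimize (1/2) x^T P x + q^T x  subject to  A x = b
P = np.diag([1.0, 2.0])
q = np.array([1.0, 1.0])
A = np.array([[1.0, 1.0]])
b = np.array([1.0])

y = np.zeros(1)        # dual variable
alpha = 0.5            # step size, chosen below 2 / lambda_max(A P^{-1} A^T)
for _ in range(100):
    # x-update: x^{k+1} = argmin_x L(x, y^k), a linear solve for a QP
    x = np.linalg.solve(P, -(q + A.T @ y))
    # dual ascent: step in the direction of the primal residual A x - b
    y = y + alpha * (A @ x - b)

print(x)               # -> approximately [2/3, 1/3], the KKT solution
```

With the augmented Lagrangian (method of multipliers), the x-update would instead solve (P + rho A^T A) x = -(q + A^T y - rho A^T b) and the dual step size would be fixed to rho.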

# Alternating Direction Method of Multipliers

- Now split the x in the AL into x and z, and minimize over x and z alternately.
- Gauss-Seidel pass? (the alternating x- and z-minimizations can be viewed as a single Gauss-Seidel pass over the two blocks of the joint minimization)
- The scaled form is often more convenient: let u = (1/rho) y and r^k = A x^k + B z^k - c. The updates become x^{k+1} = argmin_x (f(x) + (rho/2) ||A x + B z^k - c + u^k||_2^2), z^{k+1} = argmin_z (g(z) + (rho/2) ||A x^{k+1} + B z - c + u^k||_2^2), u^{k+1} = u^k + r^{k+1}.
- How to perform the parallel computation?
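As a concrete instance of the scaled form, here is a sketch of ADMM for the lasso (a problem the paper treats in a later chapter; the data and parameters below are my own illustrative choices):

```python
import numpy as np

# Lasso: minimize (1/2)||D x - d||_2^2 + lam ||z||_1  subject to  x - z = 0
rng = np.random.default_rng(1)
m, n = 20, 5
D = rng.standard_normal((m, n))
d = rng.standard_normal(m)
lam, rho = 0.5, 1.0

x = np.zeros(n)
z = np.zeros(n)
u = np.zeros(n)                               # u = y / rho (scaled dual)
DtD, Dtd = D.T @ D, D.T @ d
for _ in range(200):
    # x-update: ridge-like linear solve from the augmented quadratic term
    x = np.linalg.solve(DtD + rho * np.eye(n), Dtd + rho * (z - u))
    # z-update: soft thresholding, i.e. the prox of the l1 norm
    v = x + u
    z = np.sign(v) * np.maximum(np.abs(v) - lam / rho, 0.0)
    # scaled dual update: u accumulates the primal residual r = x - z
    u = u + (x - z)

print(np.linalg.norm(x - z))                  # primal residual, near 0
```

Each of the three updates only touches one block of variables, which is what opens the door to the parallel/distributed variants discussed later.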

- A basic version of convergence result
- Assumption 1: f and g are proper, closed and convex
- Assumption 2: the unaugmented Lagrangian has a saddle point (so the dual problem has an optimal solution, i.e. it is not unbounded, and strong duality holds).
- Residual convergence: the primal residual r^k = A x^k + B z^k - c approaches 0, i.e. the iterates approach feasibility
- Objective convergence: the objective approaches the optimum p*
- Dual variable convergence: the dual optimum y* is also approached along the iterations

- Convergence in practice
- Often converges within a few tens of iterations to an acceptable accuracy.
- Slow convergence to high accuracy compared with Newton's method or interior-point methods
- Suitable for situations where moderate accuracy is sufficient (this is why it's a good choice for ML and statistical learning)
- Can be combined with other methods at later iterations to reach high accuracy

- Optimality conditions
- Optimality is reached when primal and dual feasibility are satisfied
- Primal: A x* + B z* - c = 0
- Dual: 0 ∈ ∂f(x*) + A^T y* and 0 ∈ ∂g(z*) + B^T y*
- It can be shown that the third relation is always satisfied by (x^{k+1}, z^{k+1}, y^{k+1}) due to the way we update y after the minimization of z. So the residual of dual feasibility can be defined as s^{k+1} = rho A^T B (z^{k+1} - z^k); the primal one is r^{k+1} = A x^{k+1} + B z^{k+1} - c.
- Optimality is reached when both s and r converge to 0

- Stopping criteria
- It can be shown that the opt gap satisfies f(x^k) + g(z^k) - p* <= -(y^k)^T r^k + (x^k - x*)^T s^k.
- The rhs can be bounded using an estimate of the distance from x^k to the optimum.
- Let d >= ||x^k - x*||_2; then the rhs <= ||y^k||_2 ||r^k||_2 + d ||s^k||_2 (this is because of Cauchy-Schwarz on the vectors).
- So the stopping criteria can be based on s and r: ||r^k||_2 <= eps_pri and ||s^k||_2 <= eps_dual. These tolerances can be chosen using absolute and relative terms:
- eps_pri = sqrt(p) eps_abs + eps_rel max{||A x^k||_2, ||B z^k||_2, ||c||_2} and eps_dual = sqrt(n) eps_abs + eps_rel ||A^T y^k||_2, where sqrt(p) and sqrt(n) account for the fact that the l2 norms are taken in R^p and R^n.
- eps_rel is typically set as 10^{-3} or 10^{-4}; eps_abs depends on the problem's scale.
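The stopping test can be sketched as follows (the function and argument names are my own; x, z, y are the current iterates for a generic constraint A x + B z = c):

```python
import numpy as np

def stopped(A, B, c, x, z, z_old, y, rho, eps_abs=1e-4, eps_rel=1e-3):
    """Return True when both ADMM residuals are below their tolerances."""
    r = A @ x + B @ z - c                    # primal residual
    s = rho * A.T @ B @ (z - z_old)          # dual residual
    p, n = c.shape[0], x.shape[0]
    # combined absolute/relative tolerances, scaled by sqrt of the dimension
    eps_pri = np.sqrt(p) * eps_abs + eps_rel * max(
        np.linalg.norm(A @ x), np.linalg.norm(B @ z), np.linalg.norm(c))
    eps_dual = np.sqrt(n) * eps_abs + eps_rel * np.linalg.norm(A.T @ y)
    return bool(np.linalg.norm(r) <= eps_pri and np.linalg.norm(s) <= eps_dual)
```

A feasible point with an unchanged z (both residuals zero) passes the test; a badly infeasible point does not.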

- Extensions and Variations
- Varying penalty parameter: to guarantee convergence, fix the parameter after some number of iterations. Otherwise consider: rho^{k+1} = tau_incr * rho^k if ||r^k|| > mu ||s^k||; rho^k / tau_decr if ||s^k|| > mu ||r^k||; rho^k otherwise.
- Typical choices are mu = 10 and tau_incr = tau_decr = 2. The mu here pushes primal and dual feasibility to be approached simultaneously.
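A sketch of this residual-balancing scheme (the function name is mine; the defaults are the typical choices):

```python
def update_rho(rho, r_norm, s_norm, mu=10.0, tau_incr=2.0, tau_decr=2.0):
    """Adjust the penalty so primal and dual residuals shrink in balance."""
    if r_norm > mu * s_norm:       # primal residual lagging: penalize harder
        return rho * tau_incr
    if s_norm > mu * r_norm:       # dual residual lagging: relax the penalty
        return rho / tau_decr
    return rho                     # residuals balanced: keep rho unchanged
```

When using the scaled dual variable u = y/rho, remember to rescale u whenever rho changes (e.g. halve u when rho doubles) so that y = rho u is preserved.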

- Inexact minimization: each minimization step is carried out only approximately; under some assumptions the algorithm still converges.
- Update ordering
- carrying out multiple x-z minimization passes before the dual update
- or carrying out an extra dual update between the x-minimization and the z-minimization

- Related algorithms
- Proximal methods of multipliers
- Saddle point splitting
- Distributed admm, etc (To be discussed later)

- Other references
- When applied to statistical learning: constrained sparse regression
- SVM
- Maximal monotone operator

- NOTE
- "There is currently no proof of convergence known for ADMM with nonquadratic penalty terms"