# [NOTE in progress] Distributed Optimization and Statistical Learning via ADMM - Boyd

Reading notes of the paper "Distributed Optimization and Statistical Learning via ADMM" by Boyd, Parikh, Chu, Peleato and Eckstein.

# Introduction

• ADMM: developed in the 1970s with roots in the 1950s. Closely related to other methods such as Douglas-Rachford splitting, Spingarn's method of partial inverses, proximal methods, etc.
• Why ADMM today: with the arrival of the big-data era and the demand for large-scale ML algorithms, ADMM has proved well suited to solving large-scale optimization problems in a distributed fashion.
• What big data brings us: with big data, simple methods can turn out to be very effective at solving complex problems.
• ADMM can be seen as a blend of dual decomposition (DD) and augmented Lagrangian methods. The latter is more robust and converges under weaker assumptions, but cannot be decomposed directly the way DD can.
• ADMM can decompose a problem by examples or by features. [To be explored in later chapters]
• Note that even when used in serial mode, ADMM is still comparable to other methods and often converges in tens of iterations.

# Precursors

• What is conjugate function exactly?
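As a partial answer to the question above, the standard convex-analysis definition (not spelled out in these notes) is:

$$ f^*(y) = \sup_x \left( y^T x - f(x) \right) $$

For the problem minimize f(x) s.t. Ax = b, the dual function can then be written as g(y) = -f^*(-A^T y) - b^T y, which is how the conjugate enters dual ascent.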
• Dual ascent and dual subgradient methods: they converge if the step size is chosen appropriately and some other assumptions hold.
• Why the augmented Lagrangian:
• More robust, fewer assumptions (no strict convexity or finiteness of f needed): in practice the convergence assumptions of dual ascent are often not met, e.g. the objective may be affine (min x s.t. x >= 10), in which case the x-minimization of the Lagrangian is unbounded below for all but one value of the dual variable and dual ascent breaks down.
• For equality constraints, the augmented version converges faster. This can be understood from the penalty method's point of view.
• Dual decomposition: relax the coupling constraints so that the problem decomposes. This naturally invites parallel computation.
• The rho in the augmented Lagrangian (the coefficient of the penalty term (rho/2)||Ax - b||^2) is also used as the dual step size. With this choice, dual feasibility is preserved along the iterations (x^{k+1}, y^k) -> (x^{k+1}, y^{k+1}). The proof is simple: since x^{k+1} minimizes L_rho(x, y^k), the gradient of L_rho(x^{k+1}, y^k) over x is 0, which is exactly the statement grad f(x^{k+1}) + A^T y^{k+1} = 0. As a result, convergence only needs to be proved for primal feasibility.
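The two steps above can be sketched on a tiny equality-constrained QP (the problem and its data are illustrative, not from the paper):

```python
import numpy as np

# Minimal sketch of the method of multipliers for
#   minimize ||x||^2   subject to  x1 + x2 = 1
# Here f(x) = (1/2) x^T P x with P = 2I, so the x-step has a closed form.
P = 2.0 * np.eye(2)
A = np.array([[1.0, 1.0]])
b = np.array([1.0])
rho = 1.0                      # penalty parameter, reused as dual step size

y = np.zeros(1)
for _ in range(50):
    # x-step: minimize L_rho(x, y) = f(x) + y^T (Ax - b) + (rho/2)||Ax - b||^2
    x = np.linalg.solve(P + rho * A.T @ A, -A.T @ y + rho * A.T @ b)
    # dual ascent step with step size rho; this makes
    # grad f(x^{k+1}) + A^T y^{k+1} = 0 hold exactly (dual feasibility)
    y = y + rho * (A @ x - b)

print(x, y)   # x -> [0.5, 0.5], y -> [-1.0]
```

After each dual update, grad f(x) + A^T y = 0 holds exactly by construction, so only the primal residual Ax - b has to be driven to zero.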

# Alternating Direction Method of Multipliers

• Now split the x in the augmented Lagrangian into x and z, and minimize over x and z alternately.
• The alternating minimization amounts to a single Gauss-Seidel pass over x and z, rather than a joint minimization as in the method of multipliers.
• The scaled form is often more convenient: let u = (1/rho)y and r^k = Ax^k + Bz^k - c; the dual update then reads u^{k+1} = u^k + r^{k+1}.
• How to perform the parallel computation?
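A minimal sketch of the scaled-form iteration on a toy splitting x - z = 0 (so A = I, B = -I, c = 0, and r = x - z); the objective (1/2)||x - v||^2 + lam*||z||_1 and its data are illustrative, not from the paper:

```python
import numpy as np

def soft_threshold(a, kappa):
    # proximal operator of kappa * ||.||_1
    return np.sign(a) * np.maximum(np.abs(a) - kappa, 0.0)

v = np.array([3.0, -0.5, 0.2, -4.0])
lam, rho = 1.0, 1.0
z = np.zeros_like(v)
u = np.zeros_like(v)                      # scaled dual variable u = (1/rho) y
for _ in range(50):
    x = (v + rho * (z - u)) / (1 + rho)   # x-minimization (closed form)
    z = soft_threshold(x + u, lam / rho)  # z-minimization (closed form)
    u = u + (x - z)                       # scaled dual update: u += r

print(z)  # converges to soft_threshold(v, lam) = [2, 0, 0, -3]
```

Both subproblems have closed forms here, which is exactly the situation where the splitting pays off.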
• A basic version of convergence result
• Assumption 1: f and g are proper, closed and convex
• Assumption 2: the unaugmented Lagrangian has a saddle point (so the dual problem has an optimal solution and is not unbounded)
• Residual convergence: the primal residual r^k = Ax^k + Bz^k - c goes to 0, i.e. the iterates approach feasibility
• Objective convergence: the objective f(x^k) + g(z^k) approaches the optimal value p*
• Dual variable convergence: the dual iterates y^k approach a dual optimal point y*
• Convergence in practice
• Often converges within a few tens of iterations to an acceptable accuracy.
• Slow convergence to high accuracy compared with Newton's method or interior-point methods.
• Suitable for situations where moderate accuracy is sufficient (this is why it is a good choice for ML and statistical learning).
• Can be combined with other methods in later iterations to reach high accuracy.
• Optimality conditions
• Optimality is reached when primal and dual feasibility are satisfied
• Primal: Ax* + Bz* - c = 0
• Dual: 0 ∈ ∂f(x*) + A^T y* and 0 ∈ ∂g(z*) + B^T y*
• It can be shown that the third relation (the one involving ∂g) is always satisfied by the iterates, due to the way we update y after the minimization over z. So the residual of dual feasibility can be defined from the second relation as s^{k+1} = rho A^T B (z^{k+1} - z^k). The primal one is r^{k+1} = Ax^{k+1} + Bz^{k+1} - c.
• Optimality is reached when both s and r converge to 0
• Stopping criteria
• It can be shown that the opt gap satisfies f(x^k) + g(z^k) - p* <= -(y^k)^T r^k + (x^k - x*)^T s^k, so small residuals imply a small objective suboptimality.
• Let ||x^k - x*||_2 <= d; then the rhs <= ||y^k||_2 ||r^k||_2 + d ||s^k||_2 (this is because of the Cauchy-Schwarz inequality on the vectors involved).
• So the stopping criteria can be based on s and r: ||r^k||_2 <= eps_pri and ||s^k||_2 <= eps_dual. The tolerances can be chosen using an absolute and a relative part:
• eps_pri = sqrt(p) eps_abs + eps_rel max{||Ax^k||_2, ||Bz^k||_2, ||c||_2} and eps_dual = sqrt(n) eps_abs + eps_rel ||A^T y^k||_2
• where sqrt(p) and sqrt(n) account for the fact that the l2 norms are taken in R^p and R^n
• eps_rel is typically set to 10^-3 or 10^-4, depending on the problem
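The criterion above can be sketched as a small helper (the function name and signature are my own, not from the paper):

```python
import numpy as np

def converged(A, B, c, x, z, z_old, y, rho, eps_abs=1e-4, eps_rel=1e-3):
    """Check ||r||_2 <= eps_pri and ||s||_2 <= eps_dual for Ax + Bz = c."""
    p, n = A.shape                            # c lives in R^p, x in R^n
    r = A @ x + B @ z - c                     # primal residual
    s = rho * A.T @ (B @ (z - z_old))         # dual residual
    eps_pri = np.sqrt(p) * eps_abs + eps_rel * max(
        np.linalg.norm(A @ x), np.linalg.norm(B @ z), np.linalg.norm(c))
    eps_dual = np.sqrt(n) * eps_abs + eps_rel * np.linalg.norm(A.T @ y)
    return np.linalg.norm(r) <= eps_pri and np.linalg.norm(s) <= eps_dual
```

Called once per iteration with the current iterates (and the previous z for the dual residual), it implements the absolute-plus-relative tolerance rule above.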
• Extensions and Variations
• Varying penalty parameter: to guarantee convergence, fix the parameter after some iteration. Otherwise consider: rho^{k+1} = tau_incr * rho^k if ||r^k|| > mu ||s^k||; rho^k / tau_decr if ||s^k|| > mu ||r^k||; rho^k otherwise.
• Typical choices are mu = 10 and tau_incr = tau_decr = 2. The mu here pushes primal and dual feasibility to be approached simultaneously.
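A sketch of that update rule, with the typical parameter values as defaults:

```python
def update_rho(rho, r_norm, s_norm, mu=10.0, tau_incr=2.0, tau_decr=2.0):
    # Increase rho when the primal residual dominates, decrease it when the
    # dual residual dominates, so the two are driven to zero together.
    if r_norm > mu * s_norm:
        return rho * tau_incr
    if s_norm > mu * r_norm:
        return rho / tau_decr
    return rho
```

In the scaled form, remember that u = (1/rho)y must be rescaled (u <- u * rho_old / rho_new) whenever rho changes.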
• Inexact minimization: each minimization step is carried out only approximately; under some assumptions (e.g. the suboptimalities are summable over iterations) the algorithm still converges.
• Update ordering
• Carry out multiple x-z minimization passes before each dual update,
• or carry out an extra dual update between the x-minimization and the z-minimization.
• Related algorithms
• Proximal methods of multipliers