[NOTE in progress] Distributed Optimization and Statistical Learning via ADMM - Boyd

Reading notes on the paper "Distributed Optimization and Statistical Learning via the Alternating Direction Method of Multipliers" by Boyd, Parikh, Chu, Peleato, and Eckstein.

Introduction

  • ADMM: developed in the 1970s, with roots in the 1950s. It is closely related to other methods such as Douglas-Rachford splitting, Spingarn's method of partial inverses, and proximal methods.
  • Why ADMM today: with the arrival of the big-data era and the demand for large-scale machine learning, ADMM turns out to be well suited to solving large-scale optimization problems in a distributed fashion.
  • What big data brings us: with big data, simple methods can be very effective at solving complex problems.
  • ADMM can be seen as a blend of dual decomposition and the augmented Lagrangian method of multipliers. The latter is more robust and has better convergence properties, but cannot be decomposed directly the way dual decomposition can.
  • ADMM can decompose a problem by examples or by features. [To be explored in later chapters]
  • Note that even when used in serial mode, ADMM is comparable to other methods and often converges within a few tens of iterations.

Precursors

  • What is the conjugate function exactly? (Recall f^*(y)=\sup_x(y^Tx-f(x)); for minimize f(x) s.t. Ax=b, the dual function can be written as g(y)=-f^*(-A^Ty)-b^Ty.)
  • Dual ascent and dual subgradient methods: if the step size is chosen appropriately and some additional assumptions hold, they converge.
  • Why the augmented Lagrangian:
    • More robust, fewer assumptions (no strict convexity or finiteness of f required): in practice the convergence assumptions of dual ascent are often not met; e.g. when f is linear (say min x s.t. x \geq 10), the Lagrangian is unbounded below in x for most dual points, so the x-update fails and the dual function takes the value -\infty there.
    • For equality constraints, the augmented version converges faster. This can be understood from the penalty method's point of view.
  • Dual decomposition: relax the coupling constraints so that the problem can be decomposed. This naturally leads to parallel computation.
  • The \rho in the augmented Lagrangian (the coefficient of the (\rho/2)\|Ax-b\|_2^2 penalty) is also used as the step size of the dual update, and with this choice dual feasibility holds at every iterate (x^{k+1},y^{k+1}). The proof is simple: since x^{k+1} minimizes L_\rho(x,y^k), we have 0=\nabla f(x^{k+1})+A^T(y^k+\rho(Ax^{k+1}-b))=\nabla f(x^{k+1})+A^Ty^{k+1}. As a result, convergence only needs to be argued for primal feasibility. (A small numerical sketch follows this list.)
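
A minimal numerical sketch of the point above (my own illustration, not code from the paper): the method of multipliers on a small equality-constrained QP, where the dual-feasibility residual \nabla f(x^{k+1})+A^Ty^{k+1} is numerically zero at every iteration, while the primal residual Ax^{k+1}-b shrinks over the iterations.

```python
# Method of multipliers for: minimize (1/2) x^T P x + q^T x  s.t.  A x = b.
# All problem data below is randomly generated for illustration only.
import numpy as np

rng = np.random.default_rng(0)
n, m = 5, 2
M = rng.standard_normal((n, n))
P = M @ M.T + np.eye(n)            # positive definite quadratic term
q = rng.standard_normal(n)
A = rng.standard_normal((m, n))
b = rng.standard_normal(m)

rho = 1.0
y = np.zeros(m)
for k in range(50):
    # x-update: minimize L_rho(x, y) = f(x) + y^T(Ax - b) + (rho/2)||Ax - b||^2
    x = np.linalg.solve(P + rho * A.T @ A, -(q + A.T @ y) + rho * A.T @ b)
    # y-update with step size rho (the penalty parameter itself)
    y = y + rho * (A @ x - b)
    dual_res = np.linalg.norm(P @ x + q + A.T @ y)   # grad f(x) + A^T y
    prim_res = np.linalg.norm(A @ x - b)

print("dual residual:", dual_res)    # numerically zero from the first iteration
print("primal residual:", prim_res)  # -> 0 as k grows
```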

Alternating Direction Method of Multipliers

  • Now split the single variable of the augmented Lagrangian into two blocks x and z (objective f(x)+g(z), constraint Ax+Bz=c), and minimize over x and z alternately.
  • Gauss-Seidel pass? (The alternating x- and z-updates amount to a single Gauss-Seidel pass over the two blocks, rather than the joint minimization used in the method of multipliers.)
  • The scaled form is often more convenient: let u=(1/\rho)y and r^k=Ax^k+Bz^k-c. Then (a worked lasso sketch follows this block):
    • x^{k+1}=\operatorname{argmin}_x\left(f(x)+(\rho/2)\|Ax+Bz^k-c+u^k\|_2^2\right)
    • z^{k+1}=\operatorname{argmin}_z\left(g(z)+(\rho/2)\|Ax^{k+1}+Bz-c+u^k\|_2^2\right)
    • u^{k+1}=u^k+r^{k+1}
    • How to perform the parallel computation?
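
As a concrete instance of the scaled iteration, here is a minimal sketch (my own; the closed-form x-, z-, and u-updates are the standard lasso updates, which the paper derives in a later chapter) for the lasso: minimize (1/2)\|Ax-b\|_2^2+\lambda\|z\|_1 subject to x-z=0.

```python
# Scaled-form ADMM for the lasso: f(x) = (1/2)||Ax-b||^2, g(z) = lam*||z||_1,
# constraint x - z = 0 (so in the generic notation A := I, B := -I, c := 0).
# Problem data is randomly generated for illustration only.
import numpy as np

def soft_threshold(v, kappa):
    """Prox of kappa*||.||_1, i.e. elementwise soft thresholding."""
    return np.sign(v) * np.maximum(np.abs(v) - kappa, 0.0)

rng = np.random.default_rng(0)
m, n = 30, 10
A = rng.standard_normal((m, n))
b = rng.standard_normal(m)
lam, rho = 0.1, 1.0

x = np.zeros(n)
z = np.zeros(n)
u = np.zeros(n)                        # u = y / rho (scaled dual variable)
lhs = A.T @ A + rho * np.eye(n)        # formed once, reused every iteration
Atb = A.T @ b

for k in range(200):
    # x-update: argmin_x f(x) + (rho/2)||x - z^k + u^k||^2  (ridge-type solve)
    x = np.linalg.solve(lhs, Atb + rho * (z - u))
    # z-update: argmin_z g(z) + (rho/2)||x^{k+1} - z + u^k||^2  (soft thresholding)
    z_old = z
    z = soft_threshold(x + u, lam / rho)
    # u-update: u^{k+1} = u^k + r^{k+1}, with primal residual r = x - z
    r = x - z
    u = u + r
    s = -rho * (z - z_old)             # dual residual rho*A^T B (z^{k+1}-z^k), A=I, B=-I
    if np.linalg.norm(r) < 1e-6 and np.linalg.norm(s) < 1e-6:
        break

print("iterations:", k + 1, " ||r||:", np.linalg.norm(r), " ||s||:", np.linalg.norm(s))
```

In a real implementation one would cache a Cholesky factorization of A^TA+\rho I rather than re-solving from scratch, since the same coefficient matrix appears in every x-update.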
  • A basic version of convergence result
    • Assumption 1: f and g are proper, closed and convex
    • Assumption 2: the unaugmented Lagrangian L_0 has a saddle point. (So the dual problem has an optimal solution, i.e. it is not unbounded?)
    • Residual convergence: r^k\rightarrow 0 as k\rightarrow\infty
    • Objective convergence: the objective f(x^k)+g(z^k) approaches the optimal value p^*
    • Dual variable convergence: y^k approaches a dual optimal point y^*
  • Convergence in practice
    • Often converges within a few tens of iterations to an acceptable accuracy.
    • Slow convergence to high accuracy compared with Newton's method or interior-point methods
    • Suitable for situations where moderate accuracy is sufficient (this is why it is a good choice for ML and statistical learning problems)
    • Can be combined with other methods in later iterations to reach high accuracy
  • Optimality conditions
    • Optimality is reached when primal and dual feasibility are satisfied
    • Primal: Ax^*+Bz^*=c
    • Dual: 0\in\partial f(x^*)+A^Ty^* and 0\in\partial g(z^*)+B^Ty^*
    • It can be shown that the third relation (the one involving g) is always satisfied by (x^{k+1},z^{k+1},y^{k+1}), because of the z-minimization together with the way y is updated right after it. So only the f-condition can be violated, and the dual-feasibility residual is defined as s^{k+1}=\rho A^TB(z^{k+1}-z^k); the primal residual is r^{k+1}=Ax^{k+1}+Bz^{k+1}-c. (Derivation sketched just below.)
    • Optimality is reached when both s^k and r^k converge to 0
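    • Derivation (filling in the step referred to above; this is the standard argument, written with the scaled dual u=(1/\rho)y and y^{k+1}=\rho(u^k+r^{k+1})):
      • z^{k+1} minimizes g(z)+(\rho/2)\|Ax^{k+1}+Bz-c+u^k\|_2^2, so 0\in\partial g(z^{k+1})+\rho B^T(Ax^{k+1}+Bz^{k+1}-c+u^k)=\partial g(z^{k+1})+B^Ty^{k+1}, i.e. the g-condition always holds.
      • x^{k+1} minimizes f(x)+(\rho/2)\|Ax+Bz^k-c+u^k\|_2^2, so 0\in\partial f(x^{k+1})+\rho A^T(Ax^{k+1}+Bz^k-c+u^k)=\partial f(x^{k+1})+A^Ty^{k+1}+\rho A^TB(z^k-z^{k+1}); equivalently s^{k+1}=\rho A^TB(z^{k+1}-z^k)\in\partial f(x^{k+1})+A^Ty^{k+1}, which is why s^{k+1} measures dual infeasibility.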
  • Stopping criteria
    • It can be shown that the optimality gap satisfies f(x^k)+g(z^k)-p^*\leq -(y^k)^Tr^k+(x^k-x^*)^Ts^k
    • The second term on the rhs involves the unknown x^*, but it can be bounded.
    • If \|x^k-x^*\|_2\leq d, then the rhs \leq\|y^k\|_2\|r^k\|_2+d\|s^k\|_2 (by Cauchy-Schwarz: a^Tb\leq\|a\|_2\|b\|_2, since a^Tb=\|a\|\|b\|\cos\theta)
    • So the stopping criterion can be based on s and r: \|r^k\|_2\leq\epsilon^{pri} and \|s^k\|_2\leq\epsilon^{dual}, with the tolerances chosen from an absolute and a relative part:
    • \epsilon^{pri}=\sqrt{p}\,\epsilon^{abs}+\epsilon^{rel}\max\{\|Ax^k\|_2,\|Bz^k\|_2,\|c\|_2\}
    • \epsilon^{dual}=\sqrt{n}\,\epsilon^{abs}+\epsilon^{rel}\|A^Ty^k\|_2
    • where the \sqrt{p} and \sqrt{n} factors account for the fact that the l2 norms are of vectors in R^p and R^n
    • \epsilon^{rel} is typically set to 10^{-3} or 10^{-4}; \epsilon^{abs} depends on the scale of the problem (the typical variable values). (A small helper sketch follows this block.)
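
A small helper sketch of this criterion (my own, directly transcribing the formulas above for general A, B, c; not code from the paper):

```python
import numpy as np

def admm_stopping(A, B, c, x, z, z_old, y, rho, eps_abs=1e-4, eps_rel=1e-3):
    """Return (converged, ||r||, ||s||) for the absolute/relative criterion.
    A is p-by-n, B is p-by-m, c is in R^p; y is the (unscaled) dual variable."""
    r = A @ x + B @ z - c                      # primal residual, in R^p
    s = rho * A.T @ (B @ (z - z_old))          # dual residual, in R^n
    p, n = A.shape
    eps_pri = np.sqrt(p) * eps_abs + eps_rel * max(
        np.linalg.norm(A @ x), np.linalg.norm(B @ z), np.linalg.norm(c))
    eps_dual = np.sqrt(n) * eps_abs + eps_rel * np.linalg.norm(A.T @ y)
    return (np.linalg.norm(r) <= eps_pri and np.linalg.norm(s) <= eps_dual,
            np.linalg.norm(r), np.linalg.norm(s))
```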
  • Extensions and Variations
    • Varying penalty parameter \rho^k: the convergence results apply if \rho is held fixed after a finite number of iterations; in practice a simple adaptive scheme (sketched after this sub-list) often works well:
      • \rho^{k+1}=\tau^{incr}\rho^{k}\ \ if\ \|r^{k}\|_2>\mu\|s^k\|_2
      • \rho^{k+1}=\rho^{k}/\tau^{decr}\ \ if\ \|s^{k}\|_2>\mu\|r^k\|_2
      • \rho^{k+1}=\rho^{k}\ \ otherwise
      • Typical choices are \mu=10 and \tau^{incr}=\tau^{decr}=2. The factor \mu keeps the primal and dual residuals within a factor of \mu of each other, so they are driven to zero at roughly the same rate.
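
A minimal sketch of this rule (my own helper function; note that in the scaled form the scaled dual variable u=y/\rho must be rescaled whenever \rho changes, which the returned factor handles):

```python
def update_rho(rho, r_norm, s_norm, mu=10.0, tau_incr=2.0, tau_decr=2.0):
    """Return (new_rho, u_scale); multiply the scaled dual u by u_scale after updating rho."""
    if r_norm > mu * s_norm:           # primal residual too large -> increase rho
        return rho * tau_incr, 1.0 / tau_incr
    if s_norm > mu * r_norm:           # dual residual too large -> decrease rho
        return rho / tau_decr, tau_decr
    return rho, 1.0                    # otherwise leave rho (and u) unchanged
```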
    • Inexact minimization: each minimization step may be carried out only approximately; under suitable assumptions (roughly, the errors must be summable) the algorithm still converges.
    • Update ordering
      • carry out multiple x-z minimization passes before each dual update,
      • or carry out an extra dual update between the x-minimization and the z-minimization
    • Related algorithms
      • Proximal methods of multipliers
      • Saddle point splitting
      • Distributed ADMM, etc. (to be discussed later)
  • Other references
    • When applied to statistical learning: constrained sparse regression
    • SVM
    • Maximal monotone operator
  • NOTE
    • "There is currently no proof of convergence know for ADMM with nonquadratic penalty terms"