L1 Regularization and Sparsity

Question (164): Why does L1 regularization make model parameters sparse?

Angles to answer from:

  1. Geometry: the shape of the solution space
  2. Calculus: differentiate the objective with the L1 penalty added
  3. Bayesian prior

The shape of the solution space

Step 1. Equivalence between the regularization term and an explicit constraint (a brief sketch is given below)
Step 2. The geometric shapes of the L1-norm and L2-norm balls
Step 3. If the unconstrained optimum of the original objective does not lie inside the feasible region, then the constrained optimum must lie on the boundary of the feasible region.
\textcolor{red}{\text{[Review: KKT conditions, complementary slackness]}}
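As a brief sketch of Step 1 (a standard Lagrangian-style argument; this elaboration is mine, not the book's): the penalized and constrained problems
$$\min_{\bm\theta}\; L(\bm\theta) + c\|\bm\theta\|_1 \qquad \text{and} \qquad \min_{\bm\theta}\; L(\bm\theta) \;\; \text{s.t.} \;\; \|\bm\theta\|_1 \leq m$$
are linked as follows: if $\bm\theta^*$ minimizes the penalized problem with weight $c > 0$, then it also solves the constrained problem with budget $m = \|\bm\theta^*\|_1$, since any feasible $\bm\theta$ satisfies $L(\bm\theta) \geq L(\bm\theta^*) + c(\|\bm\theta^*\|_1 - \|\bm\theta\|_1) \geq L(\bm\theta^*)$. The feasible region $\|\bm\theta\|_1 \leq m$ is a polytope whose vertices sit on the coordinate axes, so an optimum that lands on a vertex or a low-dimensional face (Step 3) has some coordinates exactly equal to zero, whereas the $L_2$ ball $\|\bm\theta\|_2 \leq m$ has no such corners.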

Calculus: superposing the two functions

After adding the L1 penalty to the loss, the objective becomes $J(\bm\theta) = L(\bm\theta) + c\|\bm\theta\|_1$. Consider a single parameter $\theta$. When $\theta > 0$, the derivative of $c|\theta|$ equals $c$; when $\theta < 0$, it equals $-c$. Therefore, if the derivative of $L(\theta)$ lies within $(-c, c)$, the derivative of $J(\theta)$ is always negative for $\theta < 0$, so $J(\theta)$ is monotonically decreasing to the left of the origin, and always positive for $\theta > 0$, so $J(\theta)$ is monotonically increasing to the right of the origin. The minimum is therefore attained at $\theta = 0$.
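A concrete one-dimensional example (my own illustration, not from the book): take $L(\theta) = \tfrac{1}{2}(\theta - a)^2$, so $L'(0) = -a$. The subgradient condition $0 \in L'(0) + c\,\partial|0| = -a + c\,[-1, 1]$ shows that
$$\theta^* = \begin{cases} 0 & \text{if } |a| \leq c \\ a - c\,\mathrm{sign}(a) & \text{if } |a| > c, \end{cases}$$
i.e. the $L_1$ penalty zeroes the parameter outright when the unpenalized optimum $a$ is small relative to $c$, and merely shrinks it by $c$ otherwise.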

By contrast, the $L_2$ penalty has zero derivative at the origin, so the gradient of $J(\bm\theta)$ at the origin is zero if and only if the gradient of $L(\bm\theta)$ at the origin is zero. Sparse solutions are therefore far less likely under $L_2$ regularization than under $L_1$ regularization.
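For comparison (again my own illustration), with the same quadratic loss the $L_2$-regularized objective $J(\theta) = \tfrac{1}{2}(\theta - a)^2 + \tfrac{c}{2}\theta^2$ has minimizer
$$\theta^* = \frac{a}{1 + c},$$
which is shrunk towards zero for every $c > 0$ but equals zero only when $a = 0$.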

Review: soft-thresholding and the simplified LASSO problem [2]
$$\min_\beta \frac{1}{2} \|y-\beta\|^2_2 + \lambda \|\beta\|_1$$

Let $v \in \partial\|\beta\|_1$. By the subgradient optimality condition, $y - \beta = \lambda v$; since $\partial|\beta_i| = \{\mathrm{sign}(\beta_i)\}$ when $\beta_i \neq 0$ and $\partial|\beta_i| = [-1, 1]$ when $\beta_i = 0$, this reads componentwise as
$$\begin{cases} y_i-\beta_i=\lambda\,\mathrm{sign}(\beta_i) & \text{if } \beta_i \neq 0\\ |y_i - \beta_i| \leq \lambda & \text{if } \beta_i=0.\end{cases}$$

When $\beta_i>0$, $y_i - \beta_i = \lambda$, which requires $y_i - \lambda > 0$; when $\beta_i<0$, $y_i - \beta_i = -\lambda$, which requires $y_i + \lambda < 0$; when $\beta_i=0$, $|y_i| \leq \lambda$. Combining the three cases yields the soft-thresholding operator:
$$S_\lambda(y)_i = \begin{cases} y_i - \lambda & \text{if } y_i > \lambda \\ 0 & \text{if } -\lambda \leq y_i \leq \lambda \\ y_i + \lambda & \text{if } y_i < -\lambda. \end{cases}$$
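A minimal numerical sketch of the operator (NumPy; the function name is my own), together with a brute-force per-coordinate check, which works because the simplified LASSO problem is separable across coordinates:

```python
import numpy as np

def soft_threshold(y, lam):
    """Closed-form minimizer of 0.5 * ||y - beta||_2^2 + lam * ||beta||_1."""
    return np.sign(y) * np.maximum(np.abs(y) - lam, 0.0)

# Brute-force check: the problem decouples per coordinate, so each beta_i can
# be found by a 1-D grid search over 0.5 * (y_i - b)^2 + lam * |b|.
rng = np.random.default_rng(0)
y, lam = rng.normal(size=5), 0.5
grid = np.linspace(-5.0, 5.0, 200001)
brute = np.array([grid[np.argmin(0.5 * (yi - grid) ** 2 + lam * np.abs(grid))]
                  for yi in y])
print(np.allclose(soft_threshold(y, lam), brute, atol=1e-4))  # expect True
```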

Bayesian prior

The Gaussian density is smooth around its mode: a Gaussian prior regards nearby values of $\theta$ around the mode as roughly equally probable. This is why $L_2$ regularization pulls $\theta$ closer to zero but does not make it exactly zero.

The Laplace density has a sharp peak at its mode, so under a Laplace prior the parameter $\theta$ is much more likely to take the value zero exactly.

\textcolor{red}{\text{[Proof that L2/L1 regularization correspond to Gaussian/Laplace priors]}}
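A short sketch of that correspondence (the standard MAP argument, assuming i.i.d. zero-mean priors on the components of $\bm\theta$): the MAP estimate maximizes $\log p(\mathcal{D} \mid \bm\theta) + \log p(\bm\theta)$. With a Laplace prior $p(\theta_j) = \tfrac{1}{2b} \exp(-|\theta_j|/b)$,
$$-\log p(\bm\theta) = \frac{1}{b}\|\bm\theta\|_1 + \text{const},$$
which is exactly an $L_1$ penalty with $c = 1/b$; with a Gaussian prior $\theta_j \sim \mathcal{N}(0, \sigma^2)$, $-\log p(\bm\theta) = \tfrac{1}{2\sigma^2}\|\bm\theta\|_2^2 + \text{const}$, an $L_2$ penalty.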


References

  1. 《百面机器学习》
  2. CMU Convex Optimization 10-725 / 36-725, Subgradients
    https://www.stat.cmu.edu/~ryantibs/convexopt-S15/scribes/06-subgradients-scribed.pdf