L1 Regularization and Sparsity

Question (164): Why does L1 regularization make model parameters sparse?

Angles to answer from:

  1. Geometry: the shape of the solution space
  2. Calculus: differentiate the objective with the L1 penalty added
  3. Bayesian prior

The shape of the solution space

Step 1. Equivalence between the regularization term and an explicit constraint (a brief sketch is given below)
Step 2. The geometric shapes of the L1-norm and L2-norm balls
Step 3. If the unconstrained optimum of the original objective does not lie inside the feasible region, then the constrained optimum must lie on the boundary of the feasible region.
\textcolor{red}{\text{[Review: KKT conditions, complementary slackness]}}
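As a brief sketch of Step 1 (a standard Lagrangian-style argument; this elaboration is mine, not the book's): the penalized and constrained problems
$$\min_{\bm\theta}\; L(\bm\theta) + c\|\bm\theta\|_1 \qquad \text{and} \qquad \min_{\bm\theta}\; L(\bm\theta) \;\; \text{s.t.} \;\; \|\bm\theta\|_1 \leq m$$
are linked as follows: if $\bm\theta^*$ minimizes the penalized problem with weight $c > 0$, then it also solves the constrained problem with budget $m = \|\bm\theta^*\|_1$, since any feasible $\bm\theta$ satisfies $L(\bm\theta) \geq L(\bm\theta^*) + c(\|\bm\theta^*\|_1 - \|\bm\theta\|_1) \geq L(\bm\theta^*)$. The feasible region $\|\bm\theta\|_1 \leq m$ is a polytope whose vertices sit on the coordinate axes, so an optimum that lands on a vertex or a low-dimensional face (Step 3) has some coordinates exactly equal to zero, whereas the $L_2$ ball $\|\bm\theta\|_2 \leq m$ has no such corners.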

Calculus: superposing the two functions

After adding the L1 penalty to the loss, the objective becomes $J(\bm\theta) = L(\bm\theta) + c\|\bm\theta\|_1$. Consider a single parameter $\theta$. When $\theta > 0$, the derivative of $c|\theta|$ equals $c$; when $\theta < 0$, it equals $-c$. Therefore, if the derivative of $L(\theta)$ lies within $(-c, c)$, the derivative of $J(\theta)$ is always negative for $\theta < 0$, so $J(\theta)$ is monotonically decreasing to the left of the origin, and always positive for $\theta > 0$, so $J(\theta)$ is monotonically increasing to the right of the origin. The minimum is therefore attained at $\theta = 0$.
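A concrete one-dimensional example (my own illustration, not from the book): take $L(\theta) = \tfrac{1}{2}(\theta - a)^2$, so $L'(0) = -a$. The subgradient condition $0 \in L'(0) + c\,\partial|0| = -a + c\,[-1, 1]$ shows that
$$\theta^* = \begin{cases} 0 & \text{if } |a| \leq c \\ a - c\,\mathrm{sign}(a) & \text{if } |a| > c, \end{cases}$$
i.e. the $L_1$ penalty zeroes the parameter outright when the unpenalized optimum $a$ is small relative to $c$, and merely shrinks it by $c$ otherwise.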

By contrast, the $L_2$ penalty has zero derivative at the origin, so the gradient of $J(\bm\theta)$ at the origin is zero if and only if the gradient of $L(\bm\theta)$ at the origin is zero. Sparse solutions are therefore far less likely under $L_2$ regularization than under $L_1$ regularization.
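For comparison (again my own illustration), with the same quadratic loss the $L_2$-regularized objective $J(\theta) = \tfrac{1}{2}(\theta - a)^2 + \tfrac{c}{2}\theta^2$ has minimizer
$$\theta^* = \frac{a}{1 + c},$$
which is shrunk towards zero for every $c > 0$ but equals zero only when $a = 0$.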

Review: soft-thresholding and the simplified LASSO problem [2]
$$\min_\beta \frac{1}{2} \|y-\beta\|^2_2 + \lambda \|\beta\|_1$$

Let $v \in \partial\|\beta\|_1$. By the subgradient optimality condition, $y - \beta = \lambda v$; since $\partial|\beta_i| = \{\mathrm{sign}(\beta_i)\}$ when $\beta_i \neq 0$ and $\partial|\beta_i| = [-1, 1]$ when $\beta_i = 0$, this reads componentwise as
$$\begin{cases} y_i-\beta_i=\lambda\,\mathrm{sign}(\beta_i) & \text{if } \beta_i \neq 0\\ |y_i - \beta_i| \leq \lambda & \text{if } \beta_i=0.\end{cases}$$

When $\beta_i>0$, $y_i - \beta_i = \lambda$, which requires $y_i - \lambda > 0$; when $\beta_i<0$, $y_i - \beta_i = -\lambda$, which requires $y_i + \lambda < 0$; when $\beta_i=0$, $|y_i| \leq \lambda$. Combining the three cases yields the soft-thresholding operator:
$$S_\lambda(y)_i = \begin{cases} y_i - \lambda & \text{if } y_i > \lambda \\ 0 & \text{if } -\lambda \leq y_i \leq \lambda \\ y_i + \lambda & \text{if } y_i < -\lambda. \end{cases}$$
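A minimal numerical sketch of the operator (NumPy; the function name is my own), together with a brute-force per-coordinate check, which works because the simplified LASSO problem is separable across coordinates:

```python
import numpy as np

def soft_threshold(y, lam):
    """Closed-form minimizer of 0.5 * ||y - beta||_2^2 + lam * ||beta||_1."""
    return np.sign(y) * np.maximum(np.abs(y) - lam, 0.0)

# Brute-force check: the problem decouples per coordinate, so each beta_i can
# be found by a 1-D grid search over 0.5 * (y_i - b)^2 + lam * |b|.
rng = np.random.default_rng(0)
y, lam = rng.normal(size=5), 0.5
grid = np.linspace(-5.0, 5.0, 200001)
brute = np.array([grid[np.argmin(0.5 * (yi - grid) ** 2 + lam * np.abs(grid))]
                  for yi in y])
print(np.allclose(soft_threshold(y, lam), brute, atol=1e-4))  # expect True
```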

Bayesian prior

The Gaussian density is smooth around its mode: a Gaussian prior regards nearby values of $\theta$ around the mode as roughly equally probable. This is why $L_2$ regularization pulls $\theta$ closer to zero but does not make it exactly zero.

The Laplace density has a sharp peak at its mode, so under a Laplace prior the parameter $\theta$ is much more likely to take the value zero exactly.

\textcolor{red}{\text{[Proof that L2/L1 regularization correspond to Gaussian/Laplace priors]}}
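A short sketch of that correspondence (the standard MAP argument, assuming i.i.d. zero-mean priors on the components of $\bm\theta$): the MAP estimate maximizes $\log p(\mathcal{D} \mid \bm\theta) + \log p(\bm\theta)$. With a Laplace prior $p(\theta_j) = \tfrac{1}{2b} \exp(-|\theta_j|/b)$,
$$-\log p(\bm\theta) = \frac{1}{b}\|\bm\theta\|_1 + \text{const},$$
which is exactly an $L_1$ penalty with $c = 1/b$; with a Gaussian prior $\theta_j \sim \mathcal{N}(0, \sigma^2)$, $-\log p(\bm\theta) = \tfrac{1}{2\sigma^2}\|\bm\theta\|_2^2 + \text{const}$, an $L_2$ penalty.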


References

  1. 《百面机器学习》
  2. CMU Convex Optimization 10-725 / 36-725, Subgradients
    https://www.stat.cmu.edu/~ryantibs/convexopt-S15/scribes/06-subgradients-scribed.pdf