L1 Regularization and Sparsity

Question (164): Why does L1 regularization make model parameters sparse?

Angles from which to answer:

  1. Geometric: the shape of the solution space
  2. Calculus: differentiate the objective function with the L1 penalty
  3. Bayesian prior

Shape of the Solution Space

Step 1. Equivalence between the regularization term and a constrained formulation
Step 2. The geometric shapes of the L1 and L2 norm balls
Step 3. If the unconstrained optimum of the original objective does not lie inside the constraint region (the solution space), then the constrained optimum must lie on the boundary of that region.
\textcolor{red}{\text{[Review: KKT conditions, complementary slackness; sketched below]}}
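As a sketch for Step 1 and the KKT note above (a standard formulation, assumed rather than taken from the original): the constrained problem $\min_{\theta} L(\theta)$ subject to $\|\theta\|_1 \leq m$ has Lagrangian

\Lambda(\theta, c) = L(\theta) + c\left(\|\theta\|_1 - m\right), \qquad c \geq 0.

At a constrained optimum $(\theta^*, c^*)$ the KKT conditions include stationarity, $0 \in \partial_\theta \Lambda(\theta^*, c^*)$, and complementary slackness, $c^*(\|\theta^*\|_1 - m) = 0$. When the constraint is active ($c^* > 0$), $\theta^*$ also minimizes the penalized objective $L(\theta) + c^*\|\theta\|_1$, which is the sense in which the penalized and constrained forms are equivalent.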

Calculus / Function Superposition

After adding the L1 penalty to the loss, the objective becomes $J(\theta) = L(\theta) + c\|\theta\|_1$. Consider a single parameter $\theta$. When $\theta > 0$, the derivative of $c\|\theta\|_1$ equals $c$; when $\theta < 0$, it equals $-c$. Therefore, if the derivative of $L(\theta)$ lies within $(-c, c)$, the derivative of $J(\theta)$ is always negative for $\theta < 0$, so $J(\theta)$ is monotonically decreasing to the left of the origin, and always positive for $\theta > 0$, so $J(\theta)$ is monotonically increasing to the right of the origin. Hence the minimum is attained at $\theta = 0$.
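As a quick numerical illustration of this argument (my own toy example, not from the original text), take $L(\theta) = (\theta - a)^2$ with $|L'(0)| = 2|a| < c$ and minimize $J$ over a fine grid:

```python
import numpy as np

# Toy loss L(theta) = (theta - a)^2; its derivative at 0 is -2a, which lies in (-c, c).
a, c = 0.3, 1.0
theta = np.arange(-200_000, 200_001) / 100_000.0   # grid over [-2, 2] containing 0.0 exactly
J = (theta - a) ** 2 + c * np.abs(theta)           # L1-regularized objective
print(theta[np.argmin(J)])                         # 0.0: the minimum sits exactly at the origin
```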

By contrast, the derivative of the $L_2$ penalty at the origin is zero, so the gradient of $J(\theta)$ at the origin equals zero iff the gradient of $L(\theta)$ at the origin equals zero. Therefore, a sparse solution is much less likely with $L_2$-norm regularization than with $L_1$-norm regularization.
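The same toy loss with an $L_2$ penalty (again an assumed example) shows shrinkage without sparsity: the minimizer of $(\theta - a)^2 + c\theta^2$ is $a/(1+c)$, small but nonzero whenever $a \neq 0$.

```python
import numpy as np

a, c = 0.3, 1.0
theta = np.arange(-200_000, 200_001) / 100_000.0   # same grid, contains 0.0 exactly
J2 = (theta - a) ** 2 + c * theta ** 2              # L2-regularized objective
print(theta[np.argmin(J2)])                         # 0.15 = a / (1 + c): shrunk, not zero
```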

Review: soft-thresholding and the simplified LASSO problem [2]
\min_\beta \frac{1}{2} \|y-\beta\|^2_2 + \lambda \|\beta\|_1

Let v \in \partial \|\beta\|_1. The subgradient optimality condition 0 \in \beta - y + \lambda v gives, componentwise,
\begin{cases}(y_i-\beta_i)=\lambda\,\mathrm{sign}(\beta_i) & \text{if } \beta_i \neq 0\\ |y_i - \beta_i| \leq \lambda & \text{if } \beta_i=0.\end{cases}

When $\beta_i>0$, $y_i - \beta_i = \lambda$, which requires $y_i - \lambda > 0$; when $\beta_i<0$, $y_i - \beta_i = -\lambda$, which requires $y_i + \lambda < 0$; when $\beta_i=0$, $|y_i| \leq \lambda$. Combining the three cases gives the soft-thresholding operator:
S_\lambda(y)_i = \begin{cases} y_i - \lambda & \text{if } y_i > \lambda \\ 0 & \text{if } -\lambda \leq y_i \leq \lambda \\ y_i + \lambda & \text{if } y_i < -\lambda. \end{cases}
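A minimal numerical check of the operator (an assumed sketch; the helper soft_threshold and the test values are mine), comparing the closed form against a brute-force grid search on the separable one-dimensional problems $\frac{1}{2}(y_i - b)^2 + \lambda |b|$ into which the simplified LASSO decomposes:

```python
import numpy as np

def soft_threshold(y, lam):
    """Elementwise soft-thresholding operator S_lambda(y)."""
    return np.sign(y) * np.maximum(np.abs(y) - lam, 0.0)

y, lam = np.array([2.0, 0.3, -1.5]), 1.0
grid = np.arange(-400_000, 400_001) / 100_000.0    # fine grid over [-4, 4]
brute = np.array([grid[np.argmin(0.5 * (yi - grid) ** 2 + lam * np.abs(grid))] for yi in y])

print(soft_threshold(y, lam))   # [ 1.   0.  -0.5]
print(brute)                    # same values, recovered by grid search
```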

Bayesian Prior

The Gaussian distribution is smooth around its mode: a Gaussian prior assigns nearly equal probability to values of $\theta$ in a neighbourhood of the mode. This is why $L_2$ regularization only pushes $\theta$ closer to 0 but does not set it exactly to 0.

The Laplace distribution has a sharp peak at its mode, so under a Laplace prior the parameter $\theta$ is much more likely to be exactly 0.

\textcolor{red}{\text{[Proof that L2/L1 regularization correspond to Gaussian/Laplace priors; sketched below]}}
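A sketch of that correspondence (the standard MAP argument, in my own wording): the MAP estimate maximizes $\log p(D \mid \theta) + \log p(\theta)$, so the negative log-prior acts as the regularizer.

For an i.i.d. Laplace prior $p(\theta_j) = \frac{1}{2b} \exp\left(-\frac{|\theta_j|}{b}\right)$:

-\log p(\theta) = \frac{1}{b} \|\theta\|_1 + \text{const},

i.e. an $L_1$ penalty with $c = 1/b$. For an i.i.d. Gaussian prior $p(\theta_j) = \mathcal{N}(0, \sigma^2)$:

-\log p(\theta) = \frac{1}{2\sigma^2} \|\theta\|_2^2 + \text{const},

i.e. an $L_2$ penalty with coefficient $1/(2\sigma^2)$.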


References

  1. 《百面機器學習》
  2. CMU Convex Optimization 10-725 / 36-725, Subgradients
    https://www.stat.cmu.edu/~ryantibs/convexopt-S15/scribes/06-subgradients-scribed.pdf