Reprints are welcome; please credit the source: https://blog.csdn.net/qq_41709378/article/details/106639967
——————————————————————————————————————————————————
Our discussion so far assumed the training samples are linearly separable. When they are not, we can introduce slack variables and adjust the model so that a separating hyperplane can still be found as well as possible.
1. Introduction
The previous post, Support Vector Machines (1): Hard-Margin SVM, covered the learning method for linearly separable data. That method does not apply to linearly inseparable training data, because the inequality constraints $y_i(w^T x_i + b) \ge 1,\ i = 1,\dots,N$ cannot all hold. How can we extend it to the inseparable case? We must replace hard-margin maximization with soft-margin maximization.
This post generalizes the SVM: the hyperplane is allowed to misclassify a few points, so that a separating hyperplane can still be found.
2. The Outlier Problem in SVM
In the linearly separable case, the separating hyperplane splits the training set into two parts, as in the figure below.
Now suppose an outlier (possibly a noise point) appears.
The outlier shifts the hyperplane and shrinks the margin. The earlier model is clearly very sensitive to noise; forcing a linearly separable analysis here would overfit the trained model.
If the outlier is even more extreme and lands inside the opposite class's cluster,
then the hard-margin learning method cannot be applied at all: no separating hyperplane exists.
Sometimes the data really are separable, so the hard-margin method would in principle apply, but mixed-in outliers (possibly noise) destroy the separability. The data could have been split by the solid line in the figure above, yet one orange and one blue outlier prevent us from classifying with the method of the previous post on linear SVMs.
3. Soft-Margin Maximization for Linear SVMs
We should therefore allow some points to drift and violate the constraint (functional margin at least 1). The adjusted model, also called the soft margin, is:
$$\min_{w,b,\xi}\ \frac{1}{2}\|w\|^2 + C\sum_{i=1}^{N}\xi_i$$
$$\text{s.t.}\quad y_i(w \cdot x_i + b) \ge 1 - \xi_i,\quad i = 1,2,\dots,N$$
$$\xi_i \ge 0,\quad i = 1,2,\dots,N$$
Here $C > 0$ is a penalty parameter, analogous to the regularization parameter in ordinary regression and classification problems. The larger $C$ is, the heavier the penalty for misclassification; the smaller $C$, the lighter the penalty.
Introducing the nonnegative slack variables $\xi_i$ allows some sample points to have functional margin less than 1, i.e., to fall inside the maximum-margin band, and even tolerates negative functional margins, i.e., points on the wrong side. Having relaxed the constraints, we must adjust the objective to penalize outliers: the added term $C\sum_{i=1}^{N}\xi_i$ grows as outliers become more numerous and more severe, while we want the objective as small as possible. The penalty parameter $C$ weights the outliers: the larger $C$, the more an outlier affects the objective, i.e., the less we want to see outliers.
We see that the objective controls both the number and the severity of outliers, keeping most sample points within the constraints.
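As a concrete illustration, the slacks and the soft-margin objective for a given $(w, b)$ can be computed directly. Below is a minimal numpy sketch on a made-up toy set; the data and the values of `w`, `b`, and `C` are assumptions for illustration, not from the original post:

```python
import numpy as np

# Hypothetical toy data: two classes in 2-D, each with one point
# drifting toward the other class (an "outlier").
X = np.array([[2.0, 2.0], [3.0, 1.0], [0.2, 0.2],
              [-2.0, -2.0], [-3.0, -1.0], [-0.2, -0.2]])
y = np.array([1, 1, 1, -1, -1, -1])

def soft_margin_objective(w, b, X, y, C):
    """Return (objective, xi) for the soft-margin primal.

    xi_i = max(0, 1 - y_i (w . x_i + b)) is the smallest slack that
    satisfies the relaxed constraint y_i (w . x_i + b) >= 1 - xi_i.
    """
    margins = y * (X @ w + b)            # functional margins y_i (w . x_i + b)
    xi = np.maximum(0.0, 1.0 - margins)  # zero for points outside the margin band
    return 0.5 * w @ w + C * xi.sum(), xi

w, b, C = np.array([1.0, 1.0]), 0.0, 1.0
obj, xi = soft_margin_objective(w, b, X, y, C)
# obj = 0.5*||w||^2 = 1.0, plus C*(0.6 + 0.6) from the two drifting points
```

Only the two drifting points incur positive slack; the four well-separated points contribute nothing to the penalty term.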
4. Lagrangian Duality
With the model modified, the Lagrangian changes accordingly:
$$L(w,b,\xi,\alpha,\mu) = \frac{1}{2}w^T w + C\sum_{i=1}^{N}\xi_i - \sum_{i=1}^{N}\alpha_i\left[y_i(w \cdot x_i + b) - 1 + \xi_i\right] - \sum_{i=1}^{N}\mu_i\xi_i$$
Here $\alpha_i$ and $\mu_i$ are Lagrange multipliers. Recall the procedure from Lagrangian duality: write the Lagrangian (above), view it as a function of $w$, $b$, and $\xi$, take partial derivatives to find its minimum and obtain expressions for $w$ and $b$, substitute them back, and then maximize the resulting expression.
(1) First, minimize $L(w,b,\xi,\alpha,\mu)$ over $w$ and $b$
Taking partial derivatives with respect to $w$, $b$, and $\xi$:
$$\nabla_w L(w,b,\xi,\alpha,\mu) = w - \sum_{i=1}^{N}\alpha_i y_i x_i = 0$$
$$\nabla_b L(w,b,\xi,\alpha,\mu) = -\sum_{i=1}^{N}\alpha_i y_i = 0$$
$$\nabla_{\xi_i} L(w,b,\xi,\alpha,\mu) = C - \alpha_i - \mu_i = 0$$
Substituting $w = \sum_{i=1}^{N}\alpha_i y_i x_i$ back into the Lagrangian $L(w,b,\xi,\alpha,\mu)$ yields its minimum (the objective is convex). The simplification proceeds as follows:
$$\begin{aligned}
L(w,b,\xi,\alpha,\mu) &= \frac{1}{2}\|w\|^2 + C\sum_{i=1}^{N}\xi_i - \sum_{i=1}^{N}\alpha_i\left[y_i(w \cdot x_i + b) - 1 + \xi_i\right] - \sum_{i=1}^{N}\mu_i\xi_i \\
&= \frac{1}{2}w^T w - \sum_{i=1}^{N}\alpha_i y_i w^T x_i - \sum_{i=1}^{N}\alpha_i y_i b + \sum_{i=1}^{N}\alpha_i + C\sum_{i=1}^{N}\xi_i - \sum_{i=1}^{N}\alpha_i\xi_i - \sum_{i=1}^{N}\mu_i\xi_i \\
&= \frac{1}{2}w^T\sum_{i=1}^{N}\alpha_i y_i x_i - w^T\sum_{i=1}^{N}\alpha_i y_i x_i - b\sum_{i=1}^{N}\alpha_i y_i + \sum_{i=1}^{N}\alpha_i + C\sum_{i=1}^{N}\xi_i - \sum_{i=1}^{N}\alpha_i\xi_i - \sum_{i=1}^{N}\mu_i\xi_i \\
&= -\frac{1}{2}w^T\sum_{i=1}^{N}\alpha_i y_i x_i - b\sum_{i=1}^{N}\alpha_i y_i + \sum_{i=1}^{N}\alpha_i + \sum_{i=1}^{N}(C - \alpha_i - \mu_i)\xi_i \\
&= -\frac{1}{2}\left(\sum_{i=1}^{N}\alpha_i y_i x_i\right)^T\sum_{i=1}^{N}\alpha_i y_i x_i - b\sum_{i=1}^{N}\alpha_i y_i + \sum_{i=1}^{N}\alpha_i + \sum_{i=1}^{N}(C - \alpha_i - \mu_i)\xi_i \\
&= -\frac{1}{2}\sum_{i=1}^{N}\sum_{j=1}^{N}\alpha_i\alpha_j y_i y_j x_i^T x_j - b\sum_{i=1}^{N}\alpha_i y_i + \sum_{i=1}^{N}\alpha_i + \sum_{i=1}^{N}(C - \alpha_i - \mu_i)\xi_i
\end{aligned}$$
Since $\sum_{i=1}^{N}\alpha_i y_i = 0$ and $C - \alpha_i - \mu_i = 0$, and writing the inner product $x_i^T x_j$ as $(x_i \cdot x_j)$,
this simplifies to
$$\min_{w,b,\xi} L(w,b,\xi,\alpha,\mu) = -\frac{1}{2}\sum_{i=1}^{N}\sum_{j=1}^{N}\alpha_i\alpha_j y_i y_j (x_i \cdot x_j) + \sum_{i=1}^{N}\alpha_i$$
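This simplification can be spot-checked numerically: pick arbitrary data, choose $\alpha$ so that $\sum_i \alpha_i y_i = 0$, and set $\mu_i = C - \alpha_i$ and $w = \sum_i \alpha_i y_i x_i$ per the stationarity conditions; the full Lagrangian then collapses to the simplified expression for any $b$ and any $\xi \ge 0$. A throwaway sketch with made-up values:

```python
import numpy as np

rng = np.random.default_rng(42)
n, C, b = 4, 1.0, 0.7
X = rng.normal(size=(n, 2))
y = np.array([1.0, 1.0, -1.0, -1.0])
alpha = np.array([0.3, 0.5, 0.2, 0.6])   # chosen so that sum(alpha * y) == 0
mu = C - alpha                            # stationarity in xi: C - alpha_i - mu_i = 0
xi = rng.uniform(size=n)                  # arbitrary nonnegative slacks
w = (alpha * y) @ X                       # stationarity in w

# Full Lagrangian L(w, b, xi, alpha, mu)
L = (0.5 * w @ w + C * xi.sum()
     - alpha @ (y * (X @ w + b) - 1 + xi)
     - mu @ xi)

# Simplified expression after substitution
K = X @ X.T
dual = -0.5 * (alpha * y) @ K @ (alpha * y) + alpha.sum()
```

The two values agree to floating-point precision: the $b$ term vanishes because $\sum_i \alpha_i y_i = 0$, and every $\xi_i$ term vanishes because its coefficient is $C - \alpha_i - \mu_i = 0$.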
(2) Next, maximize $\min_{w,b,\xi} L(w,b,\xi,\alpha,\mu)$ over $\alpha$; this is the dual problem
$$\max_{\alpha}\ -\frac{1}{2}\sum_{i=1}^{N}\sum_{j=1}^{N}\alpha_i\alpha_j y_i y_j (x_i \cdot x_j) + \sum_{i=1}^{N}\alpha_i$$
$$\text{s.t.}\quad \sum_{i=1}^{N}\alpha_i y_i = 0$$
$$0 \le \alpha_i \le C$$
Notice that the parameters $\xi_i$ have disappeared; the only difference from the earlier model is the additional constraint $\alpha_i \le C$. Note also that the formula for $b$ changes. First look at how the KKT conditions change:
$$\alpha_i = 0 \;\Rightarrow\; y_i(w \cdot x_i + b) \ge 1$$
$$\alpha_i = C \;\Rightarrow\; y_i(w \cdot x_i + b) \le 1$$
$$0 < \alpha_i < C \;\Rightarrow\; y_i(w \cdot x_i + b) = 1$$
The first condition says that sample points outside the two margin lines have coefficient 0; the second, that outlying sample points have coefficient $C$; the third, that support vectors (lying exactly on the maximum-margin lines on either side of the hyperplane) have coefficients in $(0, C)$. The KKT conditions also show that a point on a maximum-margin line is not necessarily a free support vector; it may instead be a bound point with $\alpha_i = C$, like the outliers. We can use the SMO algorithm to solve the optimization above for the $\alpha$ vector, and from it recover $w$ and $b$.
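As a sketch of how SMO proceeds, here is a simplified version in the spirit of Platt's simplified SMO: it picks the second multiplier at random rather than with the full heuristics, optimizes the pair analytically under the box and equality constraints, and updates $b$. The toy dataset is made up for illustration:

```python
import numpy as np

def simplified_smo(X, y, C=1.0, tol=1e-3, max_passes=20, seed=0):
    """Simplified SMO for the linear soft-margin dual; returns (w, b, alpha)."""
    n = X.shape[0]
    alpha, b = np.zeros(n), 0.0
    K = X @ X.T                                    # linear kernel matrix
    rng = np.random.default_rng(seed)
    passes = 0
    while passes < max_passes:
        changed = 0
        for i in range(n):
            Ei = (alpha * y) @ K[:, i] + b - y[i]  # prediction error on x_i
            if (y[i] * Ei < -tol and alpha[i] < C) or (y[i] * Ei > tol and alpha[i] > 0):
                j = rng.integers(n - 1)            # pick a second index j != i
                if j >= i:
                    j += 1
                Ej = (alpha * y) @ K[:, j] + b - y[j]
                ai_old, aj_old = alpha[i], alpha[j]
                if y[i] != y[j]:                   # box segment ends for alpha_j
                    L, H = max(0.0, aj_old - ai_old), min(C, C + aj_old - ai_old)
                else:
                    L, H = max(0.0, ai_old + aj_old - C), min(C, ai_old + aj_old)
                eta = 2 * K[i, j] - K[i, i] - K[j, j]  # curvature along the constraint line
                if L == H or eta >= 0:
                    continue
                alpha[j] = np.clip(aj_old - y[j] * (Ei - Ej) / eta, L, H)
                if abs(alpha[j] - aj_old) < 1e-5:
                    continue
                alpha[i] = ai_old + y[i] * y[j] * (aj_old - alpha[j])  # keep sum alpha_i y_i = 0
                b1 = b - Ei - y[i] * (alpha[i] - ai_old) * K[i, i] - y[j] * (alpha[j] - aj_old) * K[i, j]
                b2 = b - Ej - y[i] * (alpha[i] - ai_old) * K[i, j] - y[j] * (alpha[j] - aj_old) * K[j, j]
                if 0 < alpha[i] < C:
                    b = b1
                elif 0 < alpha[j] < C:
                    b = b2
                else:
                    b = (b1 + b2) / 2
                changed += 1
        passes = passes + 1 if changed == 0 else 0
    w = (alpha * y) @ X                            # recover w from the dual solution
    return w, b, alpha

# Made-up separable toy set
X = np.array([[2.0, 2.0], [3.0, 3.0], [3.0, 1.0],
              [-2.0, -2.0], [-3.0, -3.0], [-3.0, -1.0]])
y = np.array([1.0, 1.0, 1.0, -1.0, -1.0, -1.0])
w, b, alpha = simplified_smo(X, y, C=1.0)
preds = np.sign(X @ w + b)
```

On this easy set the solver separates all six points, every $\alpha_i$ stays inside the box $[0, C]$, and the equality constraint $\sum_i \alpha_i y_i = 0$ is preserved by construction of the paired update.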
5. The Hinge Loss Function
We can interpret the soft margin from another angle, through the following loss function:
$$\sum_{i=1}^{N}\left[1 - y_i(w \cdot x_i + b)\right]_+ + \lambda\|w\|^2$$
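The two formulations differ only by an overall scale: taking $\lambda = 1/(2C)$ and plugging in the optimal slacks $\xi_i = [1 - y_i(w \cdot x_i + b)]_+$, the hinge form equals the soft-margin objective divided by $C$. A quick numeric spot check with made-up random values:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(8, 3))
y = rng.choice([-1.0, 1.0], size=8)
w, b, C = rng.normal(size=3), 0.3, 2.0

slack = np.maximum(0.0, 1.0 - y * (X @ w + b))  # optimal xi_i = [1 - y_i f(x_i)]_+
primal = 0.5 * w @ w + C * slack.sum()          # soft-margin objective

lam = 1.0 / (2.0 * C)                           # take lambda = 1 / (2C)
hinge_form = slack.sum() + lam * (w @ w)        # hinge-loss-plus-penalty form
```

Since multiplying an objective by a positive constant does not change its minimizers, minimizing either form yields the same $(w, b)$.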
The first term of the objective is the empirical loss, or empirical risk; the function
$$L\big(y(w \cdot x + b)\big) = \left[1 - y(w \cdot x + b)\right]_+$$
is called the hinge loss function. The subscript "+" denotes the positive-part function:
$$[z]_+ = \begin{cases} z, & z > 0 \\ 0, & z \le 0 \end{cases}$$
In other words, if a point is correctly classified with functional margin greater than 1, its loss is 0; otherwise the loss is $1 - y(w \cdot x + b)$.
The figure below also shows how the losses of various other models relate to the functional margin. The 0-1 loss is 0 for a correct classification and 1 for a misclassification (black line); note that the 0-1 loss is not differentiable. The perceptron loss is $[-y(w \cdot x + b)]_+$: when a sample is correctly classified the loss is 0, and when misclassified the loss is $-y(w \cdot x + b)$ (purple line). The log loss used by logistic regression and maximum entropy models is $\log[1 + \exp(-y(w \cdot x + b))]$ (red line).
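The losses in the figure can be written as functions of the functional margin $m = y(w \cdot x + b)$ and compared directly. A small sketch (the margin grid is arbitrary, and counting $m = 0$ as an error for the 0-1 loss is a convention choice):

```python
import numpy as np

def zero_one(m):   return np.where(m <= 0, 1.0, 0.0)  # 0-1 loss (m = 0 counted as an error)
def perceptron(m): return np.maximum(0.0, -m)         # perceptron loss [-m]_+
def hinge(m):      return np.maximum(0.0, 1.0 - m)    # hinge loss [1 - m]_+
def log_loss(m):   return np.log(1.0 + np.exp(-m))    # logistic / max-entropy loss

m = np.array([-1.0, 0.0, 0.5, 1.0, 2.0])  # sample functional margins
h = hinge(m)                               # -> 2.0, 1.0, 0.5, 0.0, 0.0
```

Note that the hinge loss upper-bounds the 0-1 loss everywhere, which is why minimizing it is a sensible convex surrogate for minimizing classification errors.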
6. Summary
Through soft-margin maximization, the linear SVM can handle classification of linearly separable datasets contaminated with outliers.
With the soft-margin treatment, under the KKT conditions the slack variables $\xi_i$ drop out of the final objective, and the resulting optimization problem is resolved by Lagrangian duality and the SMO algorithm, rounding the SVM out nicely.
References
1. https://www.cnblogs.com/jerrylead/archive/2011/03/18/1988415.html
2. https://www.cnblogs.com/huangyc/p/9938306.html#_labelTop
3. https://zhuanlan.zhihu.com/p/76946313
4. Li Hang, Statistical Learning Methods (《統計學習方法》)