Chapter 7: Support Vector Machines (SVM)
Background
Support vector machines, from simplest to most general, fall into three types: the linear support vector machine in the linearly separable case, the linear support vector machine, and the non-linear support vector machine. When the training data are linearly separable, a linear classifier is learned by hard-margin maximization; this is the linearly separable SVM, also called the hard-margin SVM. When the training data are approximately linearly separable, a linear classifier is learned by soft-margin maximization; this is the linear SVM, also called the soft-margin SVM. When the training data are not linearly separable, a non-linear SVM is learned via the kernel trick together with soft-margin maximization.
1. Linearly Separable Support Vector Machine
1.1 Basic Definition
Linearly separable support vector machine
Given a linearly separable dataset, maximizing the margin (or, equivalently, solving the corresponding convex quadratic program) yields the separating hyperplane
$$ w^* \cdot x + b^* = 0 $$
and the corresponding classification decision function
$$ f(x) = \mathrm{sign}(w^* \cdot x + b^*) $$
Functional margin
For a given dataset $T$ and hyperplane $(w, b)$, the functional margin of a sample $(x_i, y_i)$ is defined as
$$ \hat\gamma_i = y_i(w \cdot x_i + b) $$
The functional margin with respect to the dataset is the minimum over all samples:
$$ \hat\gamma = \min_{i = 1, \ldots, N} \hat\gamma_i $$
幾何間隔
對於給定的數據集T T T 和超平面( w , b ) (w, b) ( w , b ) , 定義幾何間隔 爲:
γ = y i ( w ∣ ∣ w ∣ ∣ . x i + b ∣ ∣ w ∣ ∣ )
\gamma = y_i(\dfrac{w}{||w||} . x_i + \dfrac{b}{||w||})
γ = y i ( ∣ ∣ w ∣ ∣ w . x i + ∣ ∣ w ∣ ∣ b )
最小几何間隔:γ ^ = m i n i = 1 , . . N γ ^ i \hat\gamma = min_{i = 1, .. N}{\hat \gamma_i} γ ^ = m i n i = 1 , . . N γ ^ i
Note: the functional margin and the geometric margin are related by
$$ \gamma = \dfrac{\hat\gamma}{\|w\|} $$
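The relation above is easy to check numerically; a minimal sketch, in which the hyperplane $(w, b)$ and the sample $(x_i, y_i)$ are assumptions chosen for illustration:

```python
import math

# An assumed hyperplane (w, b) and sample (x_i, y_i) for illustration only.
w, b = (3.0, 4.0), -5.0
x, y = (2.0, 1.0), 1.0

# Functional margin: gamma_hat = y_i (w . x_i + b)
functional_margin = y * (w[0] * x[0] + w[1] * x[1] + b)

# Geometric margin: gamma = gamma_hat / ||w||
norm_w = math.hypot(*w)          # ||w|| = 5
geometric_margin = functional_margin / norm_w

print(functional_margin, geometric_margin)
```

Scaling $(w, b)$ by a positive constant changes the functional margin but leaves the geometric margin unchanged, which is why the latter measures the true distance to the hyperplane.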
1.2 Maximum-Margin Separating Hyperplane
$$ \max_{w, b} \ \gamma \qquad \text{s.t.} \ \ y_i\left(\dfrac{w}{\|w\|} \cdot x_i + \dfrac{b}{\|w\|}\right) \geq \gamma, \quad i = 1, 2, \ldots, N $$
The constraint states that the geometric margin of the hyperplane $(w, b)$ with respect to every training sample is at least $\gamma$.
By the relation between the functional and geometric margins, the maximum-margin separating hyperplane can equivalently be written as:
$$ \max_{w, b} \ \dfrac{\hat\gamma}{\|w\|} \qquad \text{s.t.} \ \ y_i(w \cdot x_i + b) \geq \hat\gamma, \quad i = 1, 2, \ldots, N $$
Since changing $\hat\gamma$ merely rescales $(w, b)$ without affecting the optimization, we may set $\hat\gamma = 1$. The problem then becomes maximizing $\dfrac{1}{\|w\|}$ or, equivalently, minimizing $\dfrac{1}{2}\|w\|^2$:
$$ \min_{w, b} \ \dfrac{1}{2}\|w\|^2 \qquad \text{s.t.} \ \ y_i(w \cdot x_i + b) - 1 \geq 0, \quad i = 1, 2, \ldots, N \tag{*} $$
Problem (*) is a convex quadratic program, and the corresponding separating hyperplane $(w, b)$ exists and is unique. The general form of a convex quadratic program is:
$$ \min_{w} f(w) \qquad \text{s.t.} \ \ g_i(w) \leq 0, \ \ i = 1, 2, \ldots, k; \qquad h_i(w) = 0, \ \ i = 1, 2, \ldots, l $$
1.3 Dual Algorithm and KKT Conditions
To solve (*), construct the Lagrangian by introducing a Lagrange multiplier $\alpha_i \geq 0$, $i = 1, 2, \ldots, N$ for each inequality constraint:
$$ L(w, b, \alpha) = \dfrac{1}{2}\|w\|^2 - \sum_{i=1}^N \alpha_i y_i (w \cdot x_i + b) + \sum_{i=1}^N \alpha_i \tag{1} $$
where $\alpha = (\alpha_1, \alpha_2, \ldots, \alpha_N)^T$. By Lagrangian duality, the dual of the primal problem is the max-min problem:
$$ \max_\alpha \min_{w,b} L(w, b, \alpha) $$
Thus, to obtain the solution of the dual problem, we first minimize $L(w, b, \alpha)$ with respect to $w, b$, and then maximize with respect to $\alpha$.
(1) Solve $\min_{w,b} L(w, b, \alpha)$ by setting the gradients to zero:
$$ \nabla_w L(w, b, \alpha) = w - \sum_{i=1}^N \alpha_i y_i x_i = 0, \qquad \nabla_b L(w, b, \alpha) = -\sum_{i=1}^N \alpha_i y_i = 0 \tag{2} $$
Substituting (2) into (1) gives:
$$ L(w, b, \alpha) = -\dfrac{1}{2}\sum_{i=1}^N \sum_{j=1}^N \alpha_i \alpha_j y_i y_j (x_i \cdot x_j) + \sum_{i=1}^N \alpha_i $$
(2) Maximize $\min_{w,b} L(w, b, \alpha)$ with respect to $\alpha$; this is the dual problem:
$$ \max_{\alpha} \ -\dfrac{1}{2}\sum_{i=1}^N \sum_{j=1}^N \alpha_i \alpha_j y_i y_j (x_i \cdot x_j) + \sum_{i=1}^N \alpha_i \qquad \text{s.t.} \ \ \sum_{i=1}^N \alpha_i y_i = 0; \quad \alpha_i \geq 0, \ \ i = 1, 2, \ldots, N \tag{3} $$
Converting the objective of (3) from maximization to minimization gives the equivalent dual optimization problem:
$$ \min_{\alpha} \ \dfrac{1}{2}\sum_{i=1}^N \sum_{j=1}^N \alpha_i \alpha_j y_i y_j (x_i \cdot x_j) - \sum_{i=1}^N \alpha_i \qquad \text{s.t.} \ \ \sum_{i=1}^N \alpha_i y_i = 0; \quad \alpha_i \geq 0, \ \ i = 1, 2, \ldots, N \tag{4} $$
The original optimization over $(w, b)$ is therefore reduced to first solving for the optimal $\alpha$. Let the solution of (4) be $\alpha^* = (\alpha_1^*, \alpha_2^*, \ldots, \alpha_N^*)^T$; then the optimal $(w^*, b^*)$ is given by:
$$ w^* = \sum_{i=1}^N \alpha_i^* y_i x_i, \qquad b^* = y_j - w^* \cdot x_j = y_j - \sum_{i=1}^N \alpha_i^* y_i (x_i \cdot x_j) $$
where $j$ is any index with $\alpha_j^* > 0$.
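As a sanity check on the recovery formulas for $w^*$ and $b^*$, the following sketch uses a classic three-point dataset; the data and the stated dual solution $\alpha^*$ are assumptions taken for illustration:

```python
import numpy as np

# A small hand-checkable example (toy data assumed for illustration):
# positive points x1 = (3,3), x2 = (4,3); negative point x3 = (1,1).
X = np.array([[3., 3.], [4., 3.], [1., 1.]])
y = np.array([1., 1., -1.])

# Assumed dual solution of (4) for this dataset: alpha* = (1/4, 0, 1/4).
alpha = np.array([0.25, 0., 0.25])

# Recover w* = sum_i alpha_i* y_i x_i
w = (alpha * y) @ X

# Recover b* from any support vector j with alpha_j* > 0, here j = 0:
j = 0
b = y[j] - w @ X[j]

# Every sample should satisfy y_i (w . x_i + b) >= 1, with equality
# exactly at the support vectors (alpha_i* > 0).
margins = y * (X @ w + b)
print(w, b, margins)
```

Here the two support vectors sit exactly on the margin boundaries (margin 1), while the non-support point $x_2$ has margin strictly greater than 1.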
1.4 KKT Conditions
$$ \begin{aligned} &\nabla_w L(w^*, b^*, \alpha^*) = w^* - \sum_{i=1}^N \alpha_i^* y_i x_i = 0 \\ &\nabla_b L(w^*, b^*, \alpha^*) = -\sum_{i=1}^N \alpha_i^* y_i = 0 \\ &\alpha_i^* \left( y_i(w^* \cdot x_i + b^*) - 1 \right) = 0, \quad i = 1, 2, \ldots, N \\ &y_i(w^* \cdot x_i + b^*) - 1 \geq 0, \quad i = 1, 2, \ldots, N \\ &\alpha_i^* \geq 0, \quad i = 1, 2, \ldots, N \end{aligned} \tag{5} $$
2. Linear Support Vector Machine
The learning method for the linearly separable case does not apply to linearly inseparable training data, because the inequality constraints then no longer all hold; hard-margin maximization must be relaxed to soft-margin maximization.
The corresponding convex quadratic program (the primal problem) is:
$$ \min_{w, b, \xi} \ \dfrac{1}{2}\|w\|^2 + C\sum_{i=1}^N \xi_i \qquad \text{s.t.} \ \ y_i(w \cdot x_i + b) \geq 1 - \xi_i, \quad \xi_i \geq 0, \quad i = 1, 2, \ldots, N \tag{6} $$
2.1 Definition
For a given linearly inseparable training dataset, solving the convex quadratic program (6), i.e., soft-margin maximization, yields the separating hyperplane $w^* \cdot x + b^* = 0$ and the corresponding classification decision function $f(x) = \mathrm{sign}(w^* \cdot x + b^*)$; this is called the linear support vector machine.
The dual of the primal problem:
$$ \min_\alpha \ \dfrac{1}{2}\sum_{i=1}^N \sum_{j=1}^N \alpha_i \alpha_j y_i y_j (x_i \cdot x_j) - \sum_{i=1}^N \alpha_i \qquad \text{s.t.} \ \ \sum_{i=1}^N \alpha_i y_i = 0; \quad 0 \leq \alpha_i \leq C, \ \ i = 1, 2, \ldots, N \tag{7} $$
The Lagrangian of the primal problem:
$$ L(w, b, \xi, \alpha, \mu) = \dfrac{1}{2}\|w\|^2 + C\sum_{i=1}^N \xi_i - \sum_{i=1}^N \alpha_i \left( y_i(w \cdot x_i + b) + \xi_i - 1 \right) - \sum_{i=1}^N \mu_i \xi_i $$
where $\alpha_i \geq 0$ and $\mu_i \geq 0$.
The dual problem is the max-min problem of the Lagrangian:
$$ \max_{\alpha}\min_{w, b, \xi} L(w, b, \xi, \alpha, \mu) = \max_{\alpha}\min_{w, b, \xi} \left( \dfrac{1}{2}\|w\|^2 + C\sum_{i=1}^N \xi_i - \sum_{i=1}^N \alpha_i \left( y_i(w \cdot x_i + b) + \xi_i - 1 \right) - \sum_{i=1}^N \mu_i \xi_i \right) \tag{8} $$
For the inner minimization $\min_{w, b, \xi} L(w, b, \xi, \alpha, \mu)$, set the partial derivatives to zero:
$$ \nabla_w L(w, b, \xi, \alpha, \mu) = w - \sum_{i=1}^N \alpha_i y_i x_i = 0, \qquad \nabla_b L(w, b, \xi, \alpha, \mu) = -\sum_{i=1}^N \alpha_i y_i = 0, \qquad \nabla_{\xi_i} L(w, b, \xi, \alpha, \mu) = C - \alpha_i - \mu_i = 0 \tag{9} $$
Substituting (9) back into (8) and then maximizing $\min_{w, b, \xi} L(w, b, \xi, \alpha, \mu)$ over $\alpha$ gives the dual problem:
$$ \max_\alpha \ -\dfrac{1}{2}\sum_{i=1}^N \sum_{j=1}^N \alpha_i \alpha_j y_i y_j (x_i \cdot x_j) + \sum_{i=1}^N \alpha_i \qquad \text{s.t.} \ \ \sum_{i=1}^N \alpha_i y_i = 0; \quad C - \alpha_i - \mu_i = 0; \quad \alpha_i \geq 0; \quad \mu_i \geq 0 \tag{10} $$
Using $C - \alpha_i - \mu_i = 0$ from (9) to eliminate $\mu_i$ in (10) yields an optimization problem in $\alpha_i$ alone, subject to the box constraint $0 \leq \alpha_i \leq C$.
Let $\alpha^* = (\alpha_1^*, \alpha_2^*, \ldots, \alpha_N^*)^T$ be a solution of the dual problem (7); then a solution $(w^*, b^*)$ of the primal problem is:
$$ w^* = \sum_{i=1}^N \alpha_i^* y_i x_i, \qquad b^* = y_j - w^* \cdot x_j = y_j - \sum_{i=1}^N \alpha_i^* y_i (x_i \cdot x_j) $$
where $j$ is any index with $0 < \alpha_j^* < C$.
2.2 KKT Conditions
$$ \begin{aligned} &\nabla_w L(w^*, b^*, \xi^*, \alpha^*, \mu^*) = w^* - \sum_{i=1}^N \alpha_i^* y_i x_i = 0 \\ &\nabla_b L(w^*, b^*, \xi^*, \alpha^*, \mu^*) = -\sum_{i=1}^N \alpha_i^* y_i = 0 \\ &\nabla_\xi L(w^*, b^*, \xi^*, \alpha^*, \mu^*) = C - \alpha^* - \mu^* = 0 \\ &\alpha_i^* \left( y_i(w^* \cdot x_i + b^*) + \xi_i^* - 1 \right) = 0 \\ &\mu_i^* \xi_i^* = 0 \\ &y_i(w^* \cdot x_i + b^*) + \xi_i^* - 1 \geq 0 \\ &\xi_i^* \geq 0, \quad \alpha_i^* \geq 0, \quad \mu_i^* \geq 0, \quad i = 1, 2, \ldots, N \end{aligned} \tag{11} $$
The resulting separating hyperplane: $w^* \cdot x + b^* = \sum_{i=1}^N \alpha_i^* y_i (x \cdot x_i) + b^* = 0$
Classification decision function: $f(x) = \mathrm{sign}(w^* \cdot x + b^*) = \mathrm{sign}\left(\sum_{i=1}^N \alpha_i^* y_i (x \cdot x_i) + b^*\right)$
Note: under the linear SVM learning strategy, $w^*$ is unique but $b^*$ may not be; in practice, the average of the candidate $b^*$ values is taken.
2.3 Support Vectors
In the figure, positive samples are marked "$o$" and negative samples "$\times$"; the distance from a margin-violating instance $x_i$ to its margin boundary is $\dfrac{\xi_i}{\|w\|}$.
A soft-margin support vector $x_i$ lies either on the margin boundary, between the margin boundary and the separating hyperplane, or on the misclassified side of the separating hyperplane. The cases are:
If $0 < \alpha_i < C$, then $\mu_i \neq 0$ and $\xi_i = 0$, and the support vector lies exactly on the margin boundary;
If $\alpha_i = C$ and $\mu_i = 0$, with $0 < \xi_i < 1$, then the sample is correctly classified and the support vector lies between the margin boundary and the separating hyperplane;
If $\alpha_i = C$ and $\mu_i = 0$, with $\xi_i = 1$, then the support vector lies exactly on the separating hyperplane;
If $\alpha_i = C$ and $\mu_i = 0$, with $\xi_i > 1$, then the support vector lies on the misclassified side of the separating hyperplane.
2.4 An Alternative Interpretation
Learning a linear support vector machine is equivalent to minimizing the following objective:
$$ \sum_{i=1}^N \left[ 1 - y_i(w \cdot x_i + b) \right]_+ + \lambda \|w\|^2 \tag{12} $$
The first term of the objective is the empirical loss (empirical risk): $L(y(w \cdot x + b)) = [1 - y(w \cdot x + b)]_+$ is called the hinge loss, where $[z]_+ = \max(0, z)$.
Hence, when a sample is correctly classified and its functional margin $y(w \cdot x + b)$ exceeds 1, the loss is 0; note that a point can lie on the correct side of the separating hyperplane (and thus be correctly classified) yet still incur a nonzero loss if its functional margin is less than 1. The second term, the $L_2$ norm of $w$ with coefficient $\lambda$, is a regularizer.
Letting $\xi_i = [1 - y_i(w \cdot x_i + b)]_+$, the primal optimization problem (6) of the linear SVM is equivalent to problem (12) with $\lambda = \dfrac{1}{2C}$.
The hinge loss and the 0-1 loss are illustrated in the figure below; the dashed line is the perceptron loss $[-y_i(w \cdot x_i + b)]_+$.
Because the 0-1 loss is neither continuous nor differentiable, directly optimizing an objective built from it is difficult. The hinge loss of the linear SVM can be viewed as an upper bound of the 0-1 loss; such an upper-bounding loss is called a surrogate loss function.
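The equivalent formulation (12) can be optimized directly; the sketch below minimizes the regularized hinge objective by subgradient descent (the toy data, step size, and $\lambda$ are assumptions for illustration, not a tuned recipe):

```python
import numpy as np

def hinge_objective(w, b, X, y, lam):
    """Objective (12): sum of [1 - y_i (w.x_i + b)]_+ plus lam * ||w||^2."""
    margins = 1 - y * (X @ w + b)
    return np.maximum(0, margins).sum() + lam * (w @ w)

def subgradient_step(w, b, X, y, lam, lr):
    """One subgradient step: each violated sample (functional margin < 1)
    contributes -y_i x_i to the w-gradient and -y_i to the b-gradient."""
    viol = y * (X @ w + b) < 1
    grad_w = -(y[viol, None] * X[viol]).sum(axis=0) + 2 * lam * w
    grad_b = -y[viol].sum()
    return w - lr * grad_w, b - lr * grad_b

# Toy linearly separable data (an assumption for illustration).
X = np.array([[3., 3.], [4., 3.], [1., 1.]])
y = np.array([1., 1., -1.])

w, b = np.zeros(2), 0.0
for _ in range(3000):
    w, b = subgradient_step(w, b, X, y, lam=0.05, lr=0.01)

print(hinge_objective(w, b, X, y, lam=0.05))
print(np.sign(X @ w + b))
```

Since the hinge loss is convex but not differentiable at the margin boundary, a subgradient (rather than gradient) method is the natural fit here.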
3. Non-linear Support Vector Machine and Kernel Functions
3.1 The Kernel Trick
For a given training dataset $T = \{(x_1, y_1), (x_2, y_2), \ldots, (x_N, y_N)\}$, where $x_i$ belongs to the input space and $y_i \in \{+1, -1\}$, if the positive and negative samples can be correctly separated by a hypersurface in $R^N$, the problem is called a non-linearly separable problem. The approach to solving it is to apply a non-linear transformation to the input, converting the non-linear problem into a linear one.
Solving a non-linear classification problem with a linear method takes two steps: first, map the original space into a new space via a transformation; then apply a linear classification method in the new space to learn a model from the training data. The kernel trick belongs to this family of methods.
When the kernel trick is applied to support vector machines, the basic idea is to map the input space (a Euclidean space $\mathcal{R}$) to a feature space (a Hilbert space $\mathcal{H}$) via a non-linear transformation.
3.2 Kernel Functions
Let $\mathcal{X}$ be the input space (a Euclidean space $\mathcal{R}$) and $\mathcal{H}$ the feature space (a Hilbert space). If there exists a mapping $\phi(x): \mathcal{X} \rightarrow \mathcal{H}$ such that for all $x, z \in \mathcal{X}$ the function $K(x, z)$ satisfies $K(x, z) = \phi(x) \cdot \phi(z)$, then $K(x, z)$ is called a kernel function and $\phi(x)$ the feature mapping.
Note: $\phi$ maps the input space $\mathcal{X}$ to the feature space $\mathcal{H}$, which is typically high-dimensional and may even be infinite-dimensional; for a given kernel $K(x, z)$, the feature mapping $\phi$ is not unique.
Example: let the input space be $\mathcal{R}^2$ and the kernel $K(x, z) = (x \cdot z)^2$; find a feature space $\mathcal{H}$ and a feature mapping $\phi(x): \mathcal{R}^2 \rightarrow \mathcal{H}$.
Take $\mathcal{H} = \mathcal{R}^3$ and write $x = (x^{(1)}, x^{(2)})^T$, $z = (z^{(1)}, z^{(2)})^T$; then:
$$ K(x, z) = (x \cdot z)^2 = (x^{(1)}z^{(1)} + x^{(2)}z^{(2)})^2 = (x^{(1)}z^{(1)})^2 + 2x^{(1)}z^{(1)}x^{(2)}z^{(2)} + (x^{(2)}z^{(2)})^2 $$
Taking the mapping $\phi(x) = \left((x^{(1)})^2, \ \sqrt{2}\,x^{(1)}x^{(2)}, \ (x^{(2)})^2\right)^T$ satisfies $K(x, z) = \phi(x) \cdot \phi(z)$.
Likewise, the mapping $\phi(x) = \dfrac{1}{\sqrt{2}}\left((x^{(1)})^2 - (x^{(2)})^2, \ 2x^{(1)}x^{(2)}, \ (x^{(1)})^2 + (x^{(2)})^2\right)^T$ satisfies $K(x, z) = \phi(x) \cdot \phi(z)$.
Taking $\mathcal{H} = \mathcal{R}^4$ with $\phi(x) = \left((x^{(1)})^2, \ x^{(1)}x^{(2)}, \ x^{(1)}x^{(2)}, \ (x^{(2)})^2\right)^T$ also satisfies $K(x, z) = \phi(x) \cdot \phi(z)$.
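The non-uniqueness of $\phi$ is easy to confirm numerically: all three mappings above reproduce the same kernel value. A minimal sketch (the test points are arbitrary):

```python
import math

def K(x, z):
    """Polynomial kernel K(x, z) = (x . z)^2 on R^2."""
    return (x[0] * z[0] + x[1] * z[1]) ** 2

def phi3(x):
    """First 3-dimensional feature mapping."""
    return (x[0] ** 2, math.sqrt(2) * x[0] * x[1], x[1] ** 2)

def phi3b(x):
    """Alternative 3-dimensional feature mapping."""
    s = 1 / math.sqrt(2)
    return (s * (x[0] ** 2 - x[1] ** 2), s * 2 * x[0] * x[1], s * (x[0] ** 2 + x[1] ** 2))

def phi4(x):
    """4-dimensional feature mapping."""
    return (x[0] ** 2, x[0] * x[1], x[0] * x[1], x[1] ** 2)

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

x, z = (1.0, 2.0), (3.0, -1.0)
for phi in (phi3, phi3b, phi4):
    # Each mapping satisfies K(x, z) = phi(x) . phi(z)
    assert abs(K(x, z) - dot(phi(x), phi(z))) < 1e-9
print(K(x, z))
```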
3.3 Kernel Methods in Support Vector Machines
Replacing the inner product $x_i \cdot x_j$ in the dual problem (7) with a kernel $K(x_i, x_j) = \phi(x_i) \cdot \phi(x_j)$ gives the objective:
$$ W(\alpha) = \dfrac{1}{2}\sum_{i=1}^N \sum_{j=1}^N \alpha_i \alpha_j y_i y_j K(x_i, x_j) - \sum_{i=1}^N \alpha_i \tag{13} $$
That is, the original inner product is implicitly computed in a high-dimensional feature space; when the feature mapping is non-linear, the learned kernelized support vector machine is a non-linear support vector machine.
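The resulting decision function touches the data only through kernel evaluations, so only the support vectors need to be stored. A minimal sketch; the dual coefficients and support vectors here are assumed given (e.g. from an external solver) and the toy numbers are for illustration only:

```python
import numpy as np

def linear_kernel(x, z):
    return x @ z

def rbf_kernel(x, z, sigma=1.0):
    """Gaussian kernel K(x, z) = exp(-||x - z||^2 / (2 sigma^2))."""
    d = x - z
    return np.exp(-(d @ d) / (2 * sigma ** 2))

def decision_function(x, sv, kernel):
    """f(x) = sign(sum_i alpha_i* y_i K(x, x_i) + b*), summed over the
    stored support vectors (samples with alpha_i* > 0)."""
    alpha, ys, xs, b = sv
    s = sum(a * yi * kernel(x, xi) for a, yi, xi in zip(alpha, ys, xs))
    return np.sign(s + b)

# Assumed dual solution (alpha*, y, x, b*) of a toy linear problem.
sv = ([0.25, 0.25], [1., -1.], [np.array([3., 3.]), np.array([1., 1.])], -2.0)

print(decision_function(np.array([4., 4.]), sv, linear_kernel))
print(decision_function(np.array([0., 0.]), sv, linear_kernel))
```

Swapping `linear_kernel` for `rbf_kernel` changes only the kernel argument; the prediction code itself never needs the feature mapping $\phi$ explicitly.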
3.4 Common Kernel Functions
The polynomial kernel corresponds to a polynomial classifier of degree $p$:
$$ K(x, z) = (x \cdot z + 1)^p $$
The Gaussian kernel corresponds to a Gaussian radial basis function classifier:
$$ K(x, z) = \exp\left( -\dfrac{\|x - z\|^2}{2\sigma^2} \right) $$
The string kernel $k_n(s, t)$ measures the similarity of strings $s$ and $t$ as an inner product of the feature vectors formed by all of their length-$n$ subsequences:
$$ k_n(s, t) = \sum_{u \in \Sigma^n} [\phi_n(s)]_u \, [\phi_n(t)]_u = \sum_{u \in \Sigma^n} \ \sum_{(i, j):\, s(i) = t(j) = u} \lambda^{l(i)} \lambda^{l(j)} $$
where $\Sigma^n$ is the set of all strings of length $n$, $i$ and $j$ range over the index tuples picking out subsequences of $s$ and $t$ equal to $u$, $l(\cdot)$ is the span length of the index tuple, and $0 < \lambda \leq 1$ is a decay factor.
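A direct brute-force sketch of this kernel, enumerating index subsequences explicitly (practical implementations use dynamic programming instead); the decay factor $\lambda$ and the test strings are assumptions for illustration:

```python
from itertools import combinations
from collections import defaultdict

def phi_n(s, n, lam):
    """Feature vector: [phi_n(s)]_u = sum over index tuples i with s(i) = u
    of lam^l(i), where l(i) = i_n - i_1 + 1 is the span of the subsequence.
    Direct enumeration; only feasible for short strings."""
    feats = defaultdict(float)
    for idx in combinations(range(len(s)), n):
        u = ''.join(s[k] for k in idx)
        feats[u] += lam ** (idx[-1] - idx[0] + 1)
    return feats

def string_kernel(s, t, n, lam=0.5):
    """k_n(s, t) = sum_u [phi_n(s)]_u [phi_n(t)]_u."""
    fs, ft = phi_n(s, n, lam), phi_n(t, n, lam)
    return sum(v * ft[u] for u, v in fs.items() if u in ft)

# Shared subsequences of length 2 between "cat" and "cart":
# "ca", "ct", "at" -- gapped occurrences are discounted by lam per span.
print(string_kernel("cat", "cart", 2))
```

The decay factor penalizes subsequences with gaps: an occurrence spread over a long span contributes less than a contiguous one.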
4. Sequential Minimal Optimization
The sequential minimal optimization (SMO) algorithm solves the dual of the convex quadratic program:
$$ \min_\alpha \ \dfrac{1}{2}\sum_{i=1}^N \sum_{j=1}^N \alpha_i \alpha_j y_i y_j K(x_i, x_j) - \sum_{i=1}^N \alpha_i \qquad \text{s.t.} \ \ \sum_{i=1}^N \alpha_i y_i = 0; \quad 0 \leq \alpha_i \leq C, \ \ i = 1, 2, \ldots, N \tag{14} $$
The full SMO algorithm has two parts: an analytic method for solving the two-variable quadratic subproblem, and a heuristic for choosing which two variables to update.
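A compact sketch of the "simplified SMO" variant, which keeps the analytic two-variable update but picks the second variable at random instead of using the full choice heuristic; the toy data and the value of C are assumptions for illustration:

```python
import numpy as np

def simplified_smo(X, y, C=1.0, tol=1e-4, max_passes=20, kernel=np.dot, seed=0):
    """Simplified SMO for dual problem (14). Returns (alpha, b)."""
    rng = np.random.default_rng(seed)
    N = len(y)
    K = np.array([[kernel(X[i], X[j]) for j in range(N)] for i in range(N)])
    alpha, b = np.zeros(N), 0.0

    def f(i):  # current decision value at x_i
        return (alpha * y) @ K[:, i] + b

    passes = 0
    while passes < max_passes:
        changed = 0
        for i in range(N):
            E_i = f(i) - y[i]
            # Pick alpha_i that violates the KKT conditions (within tol)
            if (y[i] * E_i < -tol and alpha[i] < C) or (y[i] * E_i > tol and alpha[i] > 0):
                j = int(rng.choice([k for k in range(N) if k != i]))
                E_j = f(j) - y[j]
                a_i_old, a_j_old = alpha[i], alpha[j]
                # Bounds L, H keep (alpha_i, alpha_j) on the equality-constraint line
                if y[i] != y[j]:
                    L, H = max(0, a_j_old - a_i_old), min(C, C + a_j_old - a_i_old)
                else:
                    L, H = max(0, a_i_old + a_j_old - C), min(C, a_i_old + a_j_old)
                eta = 2 * K[i, j] - K[i, i] - K[j, j]
                if L == H or eta >= 0:
                    continue
                # Analytic solution of the two-variable subproblem, clipped to [L, H]
                alpha[j] = np.clip(a_j_old - y[j] * (E_i - E_j) / eta, L, H)
                if abs(alpha[j] - a_j_old) < 1e-5:
                    continue
                alpha[i] = a_i_old + y[i] * y[j] * (a_j_old - alpha[j])
                # Update the threshold b from whichever multiplier is strictly inside (0, C)
                b1 = b - E_i - y[i]*(alpha[i]-a_i_old)*K[i, i] - y[j]*(alpha[j]-a_j_old)*K[i, j]
                b2 = b - E_j - y[i]*(alpha[i]-a_i_old)*K[i, j] - y[j]*(alpha[j]-a_j_old)*K[j, j]
                if 0 < alpha[i] < C:
                    b = b1
                elif 0 < alpha[j] < C:
                    b = b2
                else:
                    b = (b1 + b2) / 2
                changed += 1
        passes = passes + 1 if changed == 0 else 0
    return alpha, b

# Toy data (assumed for illustration).
X = np.array([[3., 3.], [4., 3.], [1., 1.]])
y = np.array([1., 1., -1.])
alpha, b = simplified_smo(X, y, C=10.0)
w = (alpha * y) @ X          # only meaningful for the linear kernel
print(np.sign(X @ w + b))
```

Each update moves exactly two multipliers, because the equality constraint $\sum_i \alpha_i y_i = 0$ means a single $\alpha_i$ cannot change alone; this is the core idea that makes each subproblem solvable in closed form.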