Kernel Logistic Regression
The Connection between SVM and Regularization
The primal optimization problem of the soft-margin SVM is:
\begin{aligned} \min _ { b , \mathbf { w } , \xi } & \frac { 1 } { 2 } \mathbf { w } ^ { T } \mathbf { w } + C \cdot \sum _ { n = 1 } ^ { N } \xi _ { n } \\ \text { s.t. } & y _ { n } \left( \mathbf { w } ^ { T } \mathbf { z } _ { n } + b \right) \geq 1 - \xi _ { n } \text { and } \xi _ { n } \geq 0 \text { for all } n \end{aligned}
Converting it to an unconstrained problem gives:
\min _ { b , \mathbf { w } } \quad \frac { 1 } { 2 } \mathbf { w } ^ { T } \mathbf { w } + C \underbrace{\sum _ { n = 1 } ^ { N } \max \left( 1 - y _ { n } \left( \mathbf { w } ^ { T } \mathbf { z } _ { n } + b \right) , 0 \right)}_{\widehat { \mathrm { err } }}
This can be abbreviated as:
\min \quad \frac { 1 } { 2 } \mathbf { w } ^ { \top } \mathbf { w } + C \sum \widehat { \mathrm { err } }
Compare this with L2 regularization:
\min \quad \frac { \lambda } { N } \mathbf { w } ^ { T } \mathbf { w } + \frac { 1 } { N } \sum \mathrm { err }
The two are clearly very similar. So why not solve the unconstrained problem directly? Because:
it is no longer a QP, so the kernel trick cannot be applied;
the max(·, 0) term is not differentiable, which makes it hard to optimize.
A quick comparison of SVM with related models:
\begin{array} { c | c | c } & \text { minimize } & \text { constraint } \\ \hline \text { regularization by constraint } & E _ { \text {in } } & \mathbf { w } ^ { T } \mathbf { w } \leq C \\ \hline \text { hard-margin SVM } & \mathbf { w } ^ { T } \mathbf { w } & E _ { \text {in } } = 0 \text { [and more] } \\ \hline \hline \text { L2 regularization } & \frac { \lambda } { N } \mathbf { w } ^ { T } \mathbf { w } + E _ { \text {in } } & \\ \hline \text { soft-margin SVM } & \frac { 1 } { 2 } \mathbf { w } ^ { T } \mathbf { w } + C N \widehat {E _ { \text {in } }} & \\ \end{array}
The following correspondences can be observed:
\begin{array} { c } \text { large margin } \Longleftrightarrow \text { fewer hyperplanes } \Longleftrightarrow L2 \text { regularization of short } \mathbf{w} \\ \text { soft margin } \Longleftrightarrow \text { special } \widehat { \text { err } } \\ \text { larger } C \Longleftrightarrow \text { smaller } \lambda \Longleftrightarrow \text { less regularization } \end{array}
In other words, a larger margin means fewer admissible hyperplanes, which plays the same role as the weight shrinkage in L2 regularization; and a larger $C$ corresponds to a smaller $\lambda$, i.e. weaker regularization.
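One way to see the $C \leftrightarrow \lambda$ correspondence explicitly (a rough identification of my own, ignoring the bias term and the slightly different error measures) is to multiply the regularization objective by the positive constant $N / (2\lambda)$, which does not change the minimizer:

\min_{\mathbf{w}} \ \frac{\lambda}{N} \mathbf{w}^T \mathbf{w} + \frac{1}{N} \sum \mathrm{err} \quad \Longleftrightarrow \quad \min_{\mathbf{w}} \ \frac{1}{2} \mathbf{w}^T \mathbf{w} + \frac{1}{2\lambda} \sum \mathrm{err} \quad \Longrightarrow \quad C \approx \frac{1}{2\lambda}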
Viewing SVM as a form of regularization makes it easier to understand how to extend it and connect it to other learning models.
The Connection between SVM and Logistic Regression
Now let the linear score be $s = \mathbf{w}^T \mathbf{z}_n + b$. The different error measures can then be written as:
\begin{array} { l } \operatorname { err } _ { 0 / 1 } ( s , y ) = [ y s \leq 0 ] \\ \operatorname { err } _ { \text {svm } } ( s , y ) = \max ( 1 - y s , 0 ) \\ \operatorname { err } _ { \text {sce } }( s , y ) = \log _ { 2 } ( 1 + \exp ( - y s ) ) \end{array}
Here $\operatorname{err}_{\text{svm}}$ and $\operatorname{err}_{\text{sce}}$ are both convex upper bounds of $\operatorname{err}_{0/1}$, and $\operatorname{err}_{\text{sce}}$ is the error measure used in logistic regression.
Their behavior at the two extremes of $ys$ is as follows:
\begin{array} { ccc } - \infty & \longleftarrow \quad y s \quad \longrightarrow & + \infty \\ \approx - y s & \widehat { \mathrm { err } } _ { \mathrm { svm } } ( s , y ) & = 0 \\ \approx - y s & ( \ln 2 ) \cdot \operatorname { err }_{\text{sce}} ( s , y ) & \approx 0 \end{array}
So regularized logistic regression and SVM are indeed very similar.
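To make the comparison concrete, here is a minimal numpy sketch (my own illustration, not part of the lecture) that evaluates the three error measures on a grid of $ys$ values and checks the upper-bound property:

```python
import numpy as np

ys = np.linspace(-3, 3, 13)             # the label-times-score axis

err_01  = (ys <= 0).astype(float)       # err_0/1(s, y) = [ys <= 0]
err_svm = np.maximum(1 - ys, 0.0)       # hinge error of soft-margin SVM
err_sce = np.log2(1 + np.exp(-ys))      # scaled cross-entropy error of LogReg

# Both convex surrogates upper-bound the 0/1 error everywhere.
assert np.all(err_svm >= err_01) and np.all(err_sce >= err_01)

# As ys -> -inf both behave like -ys (after multiplying err_sce by ln 2),
# and as ys -> +inf both tend to 0, matching the table above.
print(np.c_[ys, err_01, err_svm, np.log(2) * err_sce])
```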
Two-Level Learning
How can this connection be exploited? Roughly speaking, logistic regression is learned on top of the score produced by the SVM mapping, hence the name two-level learning. The procedure is:
Use SVM to find the separating hyperplane.
Around that hyperplane, use logistic regression on the SVM score (the signed distance) to learn a calibrated output: a scaling $A$ and a shift $B$ connect the distance to the score fed into the logistic function $\theta$.
In mathematical form:
g ( \mathbf { x } ) = \theta \left( A \cdot \left( \mathbf { w} _ { \mathrm { svm } } ^ { T } \mathbf { \Phi } ( \mathbf { x } ) + b _ { \mathrm { SVM } } \right) + B \right)
Usually $A > 0$ and $B \approx 0$ are the reasonable outcomes: $A > 0$ means the SVM classification is roughly in the right direction, and $B \approx 0$ means the SVM hyperplane itself is already close to the right threshold.
The explicit optimization problem is:
\min _ { A , B } \frac { 1 } { N } \sum _ { n = 1 } ^ { N } \log \left( 1 + \exp \left( - y _ { n } ( A \cdot ( \underbrace { \mathbf { w } _ { \mathrm { SVM } } ^ { T } \mathbf { \Phi } \left( \mathbf { x } _ { n } \right) + b _ { \mathrm { SVM } } } _ { \Phi _ { \mathrm { SVM } } \left( \mathbf { x } _ { n } \right) } ) + B ) \right) \right)
The concrete steps of the two-level learning model are listed below (a code sketch follows the steps):
\begin{array} { l } 1. \text { run SVM on } \mathcal { D } \text { to get } \left( b _ { \mathrm { svm} } , \mathbf { w } _ { \mathrm { svm} } \right) [ \text { or the equivalent } \alpha ] , \text { and transform } \mathcal { D } \text { to } \mathbf { z } _ { n } ^ { \prime } = \mathbf { w } _ { \mathrm { SVM } } ^ { T } \mathbf { \Phi } \left( \mathbf { x } _ { n } \right) + b _ { \mathrm { SVM } } \\ \quad \text { - the actual model performs this step in a more complicated manner } \\ 2. \text { run LogReg on } \left\{ \left( \mathbf { z } _ { n } ^ { \prime } , y _ { n } \right) \right\} _ { n = 1 } ^ { N } \text { to get } ( A , B ) \\ \quad \text { - the actual model adds some special regularization here } \\ 3. \text { return } g ( \mathbf { x } ) = \theta \left( A \cdot \left( \mathbf { w } _ { \mathrm { svm} } ^ { T } \mathbf { \Phi } ( \mathbf { x } ) + b _ { \mathrm { svm} } \right) + B \right) \end{array}
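A minimal scikit-learn sketch of the procedure (my own illustration; it omits the more complicated step-1 transformation and the extra regularization of step 2 mentioned above, and the hyperparameters are placeholders):

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression

def two_level_svm_logreg(X, y, X_test, C=1.0, gamma=1.0):
    # Step 1: run (kernel) SVM on D to get the decision values w^T Phi(x) + b.
    svm = SVC(C=C, kernel="rbf", gamma=gamma).fit(X, y)
    z_train = svm.decision_function(X).reshape(-1, 1)   # z'_n, one score per example

    # Step 2: run LogReg on {(z'_n, y_n)} to learn the scaling A and shift B.
    logreg = LogisticRegression().fit(z_train, y)
    A, B = logreg.coef_[0, 0], logreg.intercept_[0]

    # Step 3: g(x) = theta(A * (w^T Phi(x) + b) + B), a calibrated probability.
    z_test = svm.decision_function(X_test).reshape(-1, 1)
    return logreg.predict_proba(z_test)[:, 1], (A, B)
```

This is essentially Platt scaling: the SVM fixes the direction of the classifier, and only the two scalars $A$ and $B$ are fit by logistic regression.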
Kernel Logistic Regression
The two-level learning model above is only an approximate solution to kernel logistic regression. How can the exact kernel logistic regression be solved?
The key is that the optimal solution $\mathbf{w}_*$ satisfies:
\mathbf { w } _ { * } = \sum _ { n = 1 } ^ { N } \beta _ { n } \mathbf { z } _ { n }
because then $\mathbf{w}_*^T \mathbf{z} = \sum_{n=1}^{N} \beta_n \mathbf{z}_n^T \mathbf{z} = \sum_{n=1}^{N} \beta_n K(\mathbf{x}_n, \mathbf{x})$, so the kernel trick can be applied.
Consider any L2-regularized linear model:
\min _ { \mathbf { w } } \frac { \lambda } { N } \mathbf { w } ^ { T } \mathbf { w } + \frac { 1 } { N } \sum _ { n = 1 } ^ { N } \operatorname { err } \left( y _ { n } , \mathbf { w } ^ { T } \mathbf { z } _ { n } \right)
Suppose its optimal solution consists of two parts, $\mathbf{w}_\| \in \operatorname{span}(\mathbf{z}_n)$ and $\mathbf{w}_\perp \perp \operatorname{span}(\mathbf{z}_n)$:
\mathbf { w } _ { * } = \mathbf { w } _ { \| } + \mathbf { w } _ { \perp }
Since $\mathbf{w}_\perp^T \mathbf{z}_n = 0$ for every $n$, dropping $\mathbf{w}_\perp$ leaves the error term unchanged, while for the regularizer (whenever $\mathbf{w}_\perp \neq \mathbf{0}$):
\mathbf { w } _ { * } ^ { T } \mathbf { w } _ { * } = \mathbf { w } _ { \| } ^ { T } \mathbf { w } _ { \| } + 2 \mathbf { w } _ { \| } ^ { T } \mathbf { w } _ { \perp } + \mathbf { w } _ { \perp } ^ { T } \mathbf { w } _ { \perp } \quad > \mathbf { w } _ { \| } ^ { T } \mathbf { w } _ { \| }
That is, $\mathbf{w}_\|$ would achieve a strictly smaller objective than $\mathbf{w}_*$, contradicting the assumption that $\mathbf{w}_*$ is optimal. Hence $\mathbf{w}_* = \mathbf{w}_\|$ and there is no $\mathbf{w}_\perp$ component: the optimal $\mathbf{w}_*$ is a linear combination of the $\mathbf{z}_n$, i.e. it lies in $\operatorname{span}(\mathbf{z}_n)$.
So any L2-regularized linear model can be kernelized. Kernel logistic regression can therefore be rewritten as:
\min _ { \beta } \frac { \lambda } { N } \sum _ { n = 1 } ^ { N } \sum _ { m = 1 } ^ { N } \beta _ { n } \beta _ { m } K \left( \mathbf { x } _ { n } , \mathbf { x } _ { m } \right) + \frac { 1 } { N } \sum _ { n = 1 } ^ { N } \log \left( 1 + \exp \left( - y _ { n } \sum _ { m = 1 } ^ { N } \beta _ { m } K \left( \mathbf { x } _ { m } , \mathbf { x } _ { n } \right) \right) \right)
Now GD/SGD or any other optimization algorithm can be used to solve for $\boldsymbol{\beta}$. Note that, although the form resembles SVM, unlike SVM the coefficients $\beta_n$ are usually all non-zero.
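A minimal gradient-descent sketch for this kernelized objective (my own illustration, assuming labels $y_n \in \{-1, +1\}$, an RBF kernel, and a fixed step size):

```python
import numpy as np

def rbf_kernel(X1, X2, gamma=1.0):
    d2 = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-gamma * d2)

def kernel_logreg(X, y, lam=0.1, gamma=1.0, eta=0.1, iters=2000):
    """Minimize (lam/N) b'Kb + (1/N) sum_n log(1 + exp(-y_n (K b)_n)) over beta."""
    N = len(y)
    K = rbf_kernel(X, X, gamma)
    beta = np.zeros(N)
    for _ in range(iters):
        s = K @ beta                          # s_n = sum_m beta_m K(x_m, x_n)
        p = 1.0 / (1.0 + np.exp(y * s))       # sigma(-y_n s_n)
        grad = (2.0 * lam / N) * (K @ beta) + (1.0 / N) * (K @ (-y * p))
        beta -= eta * grad
    return beta                               # predict with sign(sum_n beta_n K(x_n, x))
```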
Kernel Ridge Regression
The primal problem of ridge regression is:
\min _ { \mathbf { w } } \frac { \lambda } { N } \mathbf { w } ^ { T } \mathbf { w } + \frac { 1 } { N } \sum _ { n = 1 } ^ { N } \left( y _ { n } - \mathbf { w } ^ { T } \mathbf { z } _ { n } \right) ^ { 2 }
Since any L2-regularized linear model can be kernelized, as shown above, kernel ridge regression can be written as:
\begin{aligned} \min _ { \boldsymbol { \beta } } & \underbrace{\frac { \lambda } { N } \sum _ { n = 1 } ^ { N } \sum _ { m = 1 } ^ { N } \beta _ { n } \beta _ { m } K \left( \mathbf { x } _ { n } , \mathbf { x } _ { m } \right)}_{ \text {regularization of } \boldsymbol{\beta} } + \frac { 1 } { N } \underbrace { \sum _ { n = 1 } ^ { N } \left( y _ { n } - \sum _ { m = 1 } ^ { N } \beta _ { m } K \left( \mathbf { x } _ { n } , \mathbf { x } _ { m } \right) \right) ^ { 2 } } _ { \text {linear regression of } \boldsymbol{\beta} \text{ on K-based features} }\\ & = \frac { \lambda } { N } \boldsymbol { \beta } ^ { T } \mathbf { K } \boldsymbol { \beta } + \frac { 1 } { N } \left( \boldsymbol { \beta } ^ { T } \mathbf { K } ^ { T } \mathbf { K } \boldsymbol { \beta } - 2 \boldsymbol { \beta } ^ { T } \mathbf { K } ^ { T } \mathbf { y } + \mathbf { y } ^ { T } \mathbf { y } \right) \end{aligned}
All the kernel tricks developed earlier can therefore be applied here as well.
Setting the gradient of the objective to zero yields an analytic solution for $\boldsymbol{\beta}$. The gradient is:
\nabla E _ { \mathrm { aug } } ( \beta ) = \frac { 2 } { N } \left( \lambda \mathrm { K } ^ { T } \mathrm { I } \beta + \mathrm { K } ^ { T } \mathrm { K } \beta - \mathrm { K } ^ { T } \mathbf { y } \right) = \frac { 2 } { N } \mathrm { K } ^ { T } ( ( \lambda \mathrm { I } + \mathrm { K } ) \beta - \mathrm { y } )
Setting it to zero gives:
\beta = ( \lambda I + K ) ^ { - 1 } y
Since $K$ is guaranteed to be positive semi-definite (by Mercer's condition), $(\lambda I + K)^{-1}$ always exists for $\lambda > 0$. Inverting this dense matrix takes $O(N^3)$ time. Note again that, unlike SVM, the coefficients $\beta_n$ are usually all non-zero.
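A minimal numpy sketch of this closed-form solution (my own illustration; `np.linalg.solve` is used rather than an explicit inverse, but the cost is still $O(N^3)$, and $K$ can be built with the `rbf_kernel` helper from the previous sketch):

```python
import numpy as np

def kernel_ridge_fit(K, y, lam=0.1):
    # beta = (lambda I + K)^{-1} y, which exists whenever lambda > 0
    return np.linalg.solve(lam * np.eye(len(y)) + K, y)

def kernel_ridge_predict(K_test_train, beta):
    # g(x) = sum_n beta_n K(x_n, x); K_test_train[i, n] = K(x_test_i, x_train_n)
    return K_test_train @ beta
```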
Support Vector Regression
Tube Regression
Tube regression does not count errors within a certain band around the regression line: as long as the difference between the true value and the prediction stays within the band, the point incurs no error:
\begin{array} { l } | s - y | \leq \epsilon : 0 \\ | s - y | > \epsilon : | s - y | - \epsilon \end{array}
that is,
\operatorname { err } ( y , s ) = \max ( 0 , | s - y | - \epsilon )
This error measure is called the $\epsilon$-insensitive error.
Compared with the squared error $\operatorname{err}(y, s) = (s - y)^2$, the two are similar when $|s - y|$ is small; but when $|s - y|$ is large the $\epsilon$-insensitive error grows only linearly, so the influence of noise (outliers) is smaller.
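A tiny numpy sketch (illustration only) comparing the two error measures as a function of the residual $s - y$:

```python
import numpy as np

def err_tube(y, s, eps=0.5):
    return np.maximum(0.0, np.abs(s - y) - eps)   # epsilon-insensitive error

def err_sqr(y, s):
    return (s - y) ** 2                           # squared error

resid = np.linspace(-4, 4, 9)
# For small |s - y| the two are comparable; for large |s - y| the tube error
# grows only linearly, so noisy points pull on the fit far less.
print(np.c_[resid, err_tube(0.0, resid), err_sqr(0.0, resid)])
```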
The optimization objective of L2-regularized tube regression is then:
\min _ { \mathbf { w } } \frac { \lambda } { N } \mathbf { w } ^ { T } \mathbf { w } + \frac { 1 } { N } \sum _ { n = 1 } ^ { N } \max \left( 0 , \left| \mathbf { w } ^ { T } \mathbf { z } _ { n } - y _ { n } \right| - \epsilon \right)
Standard Support Vector Regression
To gain the advantages of the SVM formulation (sparse coefficients), we now reparameterize L2 tube regression in the SVM style:
\min _ { b , \mathbf { w } } \quad \frac { 1 } { 2 } \mathbf { w } ^ { T } \mathbf { w } + C \sum _ { n = 1 } ^ { N } \max \left( 0 , \left| \mathbf { w } ^ { T } \mathbf { z } _ { n } + b - y _ { n } \right| - \epsilon \right)
Since the max operation here is not differentiable, the problem is converted back into a constrained form:
\begin{aligned} \min _ { b , \mathbf { w } , \xi ^ { \vee } , \xi ^ { \wedge } } & \frac { 1 } { 2 } \mathbf { w } ^ { T } \mathbf { w } + C \sum _ { n = 1 } ^ { N } \left( \xi _ { n } ^ { \vee } + \xi _ { n } ^ { \wedge } \right) \\ \text { s.t. } & - \epsilon - \xi _ { n } ^ { \vee } \leq y _ { n } - \mathbf { w } ^ { T } \mathbf { z } _ { n } - b \leq \epsilon + \xi _ { n } ^ { \wedge } \\ & \xi _ { n } ^ { \vee } \geq 0 , \xi _ { n } ^ { \wedge } \geq 0 \end{aligned}
Because the tube has both an upper and a lower boundary, each point needs one more slack variable than in soft-margin SVM, namely the pair $\xi_n^\vee, \xi_n^\wedge$. The problem is then a standard QP and can be solved with quadratic programming.
Dual Support Vector Regression
To use the kernel trick, we again derive the dual problem, introducing two Lagrange multipliers $\alpha_n^\vee, \alpha_n^\wedge$ per constraint pair:
\begin{array} { l l l } \text { objective function } & &\frac { 1 } { 2 } \mathbf { w } ^ { T } \mathbf { w } + C \sum _ { n = 1 } ^ { N } \left( \xi _ { n } ^ { \vee } + \xi _ { n } ^ { \wedge } \right) \\ \text { Lagrange multiplier } \alpha _ { n } ^ { \wedge } & \text { for } & y _ { n } - \mathbf { w } ^ { T } \mathbf { z } _ { n } - b \leq \epsilon + \xi _ { n } ^ { \wedge } \\ \text { Lagrange multiplier } \alpha _ { n } ^ { \vee } & \text { for } & - \epsilon - \xi _ { n } ^ { \vee } \leq y _ { n } - \mathbf { w } ^ { T } \mathbf { z } _ { n } - b \end{array}
Some of the KKT conditions are:
\begin{array} { l } \frac { \partial \mathcal { L } } { \partial w _ { i } } = 0 : \mathbf { w } = \sum _ { n = 1 } ^ { N } \underbrace { \left( \alpha _ { n } ^ { \wedge } - \alpha _ { n } ^ { \vee } \right) } _ { \beta_n } \mathbf { z } _ { n } \\ \frac { \partial \mathcal { L } } { \partial b } = 0 : \sum _ { n = 1 } ^ { N } \left( \alpha _ { n } ^ { \wedge } - \alpha _ { n } ^ { \vee } \right) = 0 \\ \alpha _ { n } ^ { \wedge } \left( \epsilon + \xi _ { n } ^ { \wedge } - y _ { n } + \mathbf { w } ^ { T } \mathbf { z } _ { n } + b \right) = 0 \\ \alpha _ { n } ^ { \vee } \left( \epsilon + \xi _ { n } ^ { \vee } + y _ { n } - \mathbf { w } ^ { T } \mathbf { z } _ { n } - b \right) = 0 \end{array}
Following the same derivation as for SVM, the dual problem is:
\begin{aligned} \min & \frac { 1 } { 2 } \sum _ { n = 1 } ^ { N } \sum _ { m = 1 } ^ { N } \left( \alpha _ { n } ^ { \wedge } - \alpha _ { n } ^ { \vee } \right) \left( \alpha _ { m } ^ { \wedge } - \alpha _ { m } ^ { \vee } \right) k _ { n , m } \\ & + \sum _ { n = 1 } ^ { N } \left( \left( \epsilon - y _ { n } \right) \cdot \alpha _ { n } ^ { \wedge } + \left( \epsilon + y _ { n } \right) \cdot \alpha _ { n } ^ { \vee } \right) \\ \text { s.t. } & \sum _ { n = 1 } ^ { N } \left( \alpha _ { n } ^ { \wedge } - \alpha _ { n } ^ { \vee } \right) = 0 \\ & 0 \leq \alpha _ { n } ^ { \wedge } \leq C , 0 \leq \alpha _ { n } ^ { \vee } \leq C \end{aligned}
Sparsity of the coefficients:
When $\left| \mathbf{w}^T \mathbf{z}_n + b - y_n \right| < \epsilon$, i.e. the example lies strictly inside the tube, we have:
\begin{array} { l } \Longrightarrow \xi _ { n } ^ { \wedge } = 0 \text { and } \xi _ { n } ^ { \vee } = 0 \\ \Longrightarrow \left( \epsilon + \xi _ { n } ^ { \wedge } - y _ { n } + \mathbf { w } ^ { T } \mathbf { z } _ { n } + b \right) \neq 0 \text { and } \left( \epsilon + \xi _ { n } ^ { \vee } + y _ { n } - \mathbf { w } ^ { T } \mathbf { z } _ { n } - b \right) \neq 0 \\ \Longrightarrow \alpha _ { n } ^ { \wedge } = 0 \text { and } \alpha _ { n } ^ { \vee } = 0 \\ \Longrightarrow \beta _ { n } = 0 \end{array}
So $\boldsymbol{\beta}$ is sparse, and in SVR the examples on or outside the tube (those with $\beta_n \neq 0$) are called the support vectors.
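A minimal scikit-learn sketch (my own illustration) showing this sparsity in practice: `sklearn.svm.SVR` solves this kind of dual via LIBSVM, and only the examples on or outside the $\epsilon$-tube appear as support vectors:

```python
import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(0)
X = np.sort(rng.uniform(-3, 3, size=(200, 1)), axis=0)
y = np.sin(X).ravel() + 0.1 * rng.normal(size=200)

svr = SVR(kernel="rbf", C=1.0, epsilon=0.1, gamma=1.0).fit(X, y)

# support_ lists the indices with beta_n != 0 -- typically far fewer than N.
print(len(svr.support_), "support vectors out of", len(X), "examples")
# dual_coef_ holds the nonzero beta_n (i.e. alpha_n-up minus alpha_n-down).
print(svr.dual_coef_.shape)
```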
Summary of Linear and Kernel Models
The models in the first row are rarely used because they do not perform well.
A commonly used toolbox for the second row is LIBLINEAR.
The third row is also rarely used because its $\boldsymbol{\beta}$ is dense.
A commonly used toolbox for the fourth row is LIBSVM.