Machine Learning Review, Part 1
Linear Algebra Review
Notation:
(,,,) - row vector
(;;;) - column vector
Multiplication
Matrix transpose
$$
AI = A = IA\\
(A^T)^T = A\\
(AB)^T = B^TA^T\\
(A+B)^T = A^T+B^T
$$
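These identities are easy to sanity-check numerically; a quick NumPy sketch (the matrix sizes and random values here are arbitrary, for illustration only):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((3, 4))
B = rng.standard_normal((4, 2))
C = rng.standard_normal((3, 4))

# AI = A = IA (identity of the matching size on each side)
assert np.allclose(np.eye(3) @ A, A) and np.allclose(A @ np.eye(4), A)
# (A^T)^T = A
assert np.allclose(A.T.T, A)
# (AB)^T = B^T A^T
assert np.allclose((A @ B).T, B.T @ A.T)
# (A+B)^T = A^T + B^T
assert np.allclose((A + C).T, A.T + C.T)
```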
Matrix derivatives
For $f: R^{m \times n} \rightarrow R$:

$$
(\bigtriangledown_Af(A))_{ij} = \frac{\partial f(A)}{\partial A_{ij}}, \quad \bigtriangledown_Af(A) \in R^{m \times n}
$$
E.g.: for $f: R^m \rightarrow R$ with $f(z) = z^Tz$, we have $\bigtriangledown_zf(z) = 2z$, treating $z$ as the variable.
For $\bigtriangledown f(Ax)$, however, there are two cases. If $Ax$ is treated as a whole, the rule above gives $2Ax$; but differentiating with respect to $x$ itself, writing $g(x) = f(Ax)$, gives $\bigtriangledown_xg(x) \in R^n$, which does not follow the correspondence above. The difference comes from which variable we differentiate with respect to.
Linear and quadratic forms
For $x \in R^n$ and $f(x) = b^Tx$ with $b \in R^n$, i.e. $f(x) = \sum_{i=1}^nb_ix_i$:
then $\bigtriangledown_xf(x) = b$.
For $f(x) = x^TAx$ with $A \in S^n$:
then $\bigtriangledown_xf(x) = 2Ax$.
Common identities
$$
\bigtriangledown_x b^Tx = b\\
\bigtriangledown_x x^TAx = 2Ax\\
\bigtriangledown_x^2 x^TAx = 2A \quad (\text{if } A \text{ is symmetric, i.e. } A^T = A)
$$
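A small numerical check of the first two identities, comparing the closed forms against central-difference gradients (the `num_grad` helper and the random test values are illustrative, not part of the notes):

```python
import numpy as np

def num_grad(f, x, eps=1e-6):
    """Central-difference numerical gradient of a scalar function f at x."""
    g = np.zeros_like(x)
    for i in range(len(x)):
        e = np.zeros_like(x)
        e[i] = eps
        g[i] = (f(x + e) - f(x - e)) / (2 * eps)
    return g

rng = np.random.default_rng(1)
n = 5
b = rng.standard_normal(n)
M = rng.standard_normal((n, n))
A = (M + M.T) / 2                 # a symmetric A, as the identity requires
x = rng.standard_normal(n)

# grad of b^T x is b; grad of x^T A x is 2Ax for symmetric A
assert np.allclose(num_grad(lambda v: b @ v, x), b, atol=1e-5)
assert np.allclose(num_grad(lambda v: v @ A @ v, x), 2 * A @ x, atol=1e-5)
```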
Least squares
$$
||Ax-b||_2^2 = (Ax-b)^T(Ax-b) = x^TA^TAx - 2b^TAx + b^Tb\\
\bigtriangledown_x ||Ax-b||_2^2 = 2A^TAx - 2A^Tb
$$
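The least-squares gradient above can be verified the same way; a minimal NumPy sketch with arbitrary test data:

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.standard_normal((6, 3))
b = rng.standard_normal(6)
x = rng.standard_normal(3)

f = lambda v: np.sum((A @ v - b) ** 2)        # ||Av - b||_2^2
analytic = 2 * A.T @ A @ x - 2 * A.T @ b      # the closed-form gradient

# central-difference gradient, one coordinate at a time
eps = 1e-6
numeric = np.array([(f(x + eps * e) - f(x - eps * e)) / (2 * eps)
                    for e in np.eye(3)])
assert np.allclose(analytic, numeric, atol=1e-4)
```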
Matrix trace:
$$
\mathrm{tr}A = \sum_{i=1}^nA_{ii}
$$

For $A \in R^{n \times n}$, the trace satisfies several linearity properties.
Matrix rank
For $A \in R^{m \times n}$, $rank(A) \leq \min(m, n)$; if $rank(A) = \min(m, n)$, then $A$ is full rank.
Inverse
$$
A^{-1}A = I = AA^{-1}
$$
A matrix has no inverse when it is:
not square
not full rank
Solving for eigenvalues:
Additional derivative rules:
Linear Models Review
Basic model
$$
f(\mathbf{x}) = \mathbf{w}^T \mathbf{x} + b
$$
where $\mathbf{x}$ denotes $(x_1; x_2; x_3; \ldots; x_d)$, $\mathbf{w}$ denotes $(w_1, w_2, w_3, \ldots, w_d)$, and $b$ is a scalar.
Linear regression
Assume the model above, with $\mathbf{w}, b$ as the parameters to determine.
Objective function: mean squared error
Solution: least squares
$$
(w^*, b^*) = \arg\min_{w,b}\sum_{i=1}^n(f(x_i) - y_i)^2
$$
Take derivatives (the objective is convex, so the optimum is attained where the derivatives equal zero):
$$
\frac{\partial E(w,b)}{\partial w} = 2\left(w\sum x_i^2 - \sum(y_i - b)x_i\right)\\
\frac{\partial E(w,b)}{\partial b} = 2\left(mb - \sum(y_i - wx_i)\right)\\
w = \frac{\sum y_i(x_i - \bar{x})}{\sum x_i^2 - \frac{1}{m}\left(\sum x_i\right)^2}\\
b = \frac{1}{m}\sum(y_i - wx_i)
$$
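The closed-form solution above translates directly into code; a minimal sketch (the `fit_1d` name and the toy data are illustrative assumptions):

```python
import numpy as np

def fit_1d(x, y):
    """Closed-form simple linear regression: w and b from the derivatives above."""
    m = len(x)
    w = np.sum(y * (x - x.mean())) / (np.sum(x**2) - np.sum(x)**2 / m)
    b = np.mean(y - w * x)
    return w, b

# Toy data generated from y = 2x + 1; the fit should recover w = 2, b = 1.
x = np.array([0.0, 1.0, 2.0, 3.0])
y = 2.0 * x + 1.0
w, b = fit_1d(x, y)
assert np.isclose(w, 2.0) and np.isclose(b, 1.0)
```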
Data representation:
$\hat{\mathbf{w}} = (\mathbf{w}; b)$ folds $b$ into the weight vector for easier computation; it is a $(d+1)$-dimensional column vector.
Dataset: $\mathbf{X}$
$$
\mathbf{X} =
\begin{pmatrix}
x_{11} & x_{12} & \cdots & x_{1d} & 1\\
x_{21} & x_{22} & \cdots & x_{2d} & 1\\
\vdots & \vdots & \ddots & \vdots & \vdots\\
x_{m1} & x_{m2} & \cdots & x_{md} & 1
\end{pmatrix}
=
\begin{pmatrix}
\mathbf{x}_1^T & 1\\
\mathbf{x}_2^T & 1\\
\vdots & \vdots\\
\mathbf{x}_m^T & 1
\end{pmatrix}
$$
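Constructing this augmented design matrix in NumPy might look like the following (toy numbers, illustrative only):

```python
import numpy as np

# Append a column of ones so that b becomes the last entry of w_hat = (w; b).
X_raw = np.array([[1.0, 2.0],
                  [3.0, 4.0],
                  [5.0, 6.0]])                       # m = 3 samples, d = 2 features
X = np.hstack([X_raw, np.ones((X_raw.shape[0], 1))])

w_hat = np.array([0.5, -1.0, 2.0])                   # (w_1, w_2, b)
y = X @ w_hat                                        # predictions y = X w_hat
assert X.shape == (3, 3) and np.all(X[:, -1] == 1)
```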
Then

$$
\mathbf{y} = \mathbf{X}\hat{\mathbf{w}}
$$
For $\mathbf{y} = (y_1; y_2; \ldots; y_m)$, each individual component is

$$
y_i = f(\mathbf{x}_i) = \mathbf{w}^T\mathbf{x}_i + b
$$
$$
E(\hat{\mathbf{w}}) = (\mathbf{y} - \mathbf{X}\hat{\mathbf{w}})^T(\mathbf{y} - \mathbf{X}\hat{\mathbf{w}})
$$
Differentiate:
$$
\frac{\partial E}{\partial \hat{\mathbf{w}}} = 2\mathbf{X}^T(\mathbf{X}\hat{\mathbf{w}} - \mathbf{y})\\
\hat{\mathbf{w}}^* = (X^TX)^{-1}X^Ty
$$
Tip: if $X^TX$ is not of full column rank, introduce a regularization term:
$$
E = (y - X\hat{w})^T(y - X\hat{w}) + \lambda||w||^2 \quad (\lambda > 0)\\
\hat{\mathbf{w}}^* = (X^TX + \lambda I)^{-1}X^Ty
$$
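The regularized normal equation can be sketched as follows; with $\lambda = 0$ it reduces to the plain least-squares solution above (`ridge_fit` is a hypothetical name). Using `np.linalg.solve` rather than forming the inverse explicitly is the numerically preferable choice:

```python
import numpy as np

def ridge_fit(X, y, lam=0.0):
    """w_hat = (X^T X + lam I)^{-1} X^T y; lam = 0 gives ordinary least squares."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

# Synthetic noise-free data: the fit should recover the true weights exactly.
rng = np.random.default_rng(3)
X = np.hstack([rng.standard_normal((20, 2)), np.ones((20, 1))])
true_w = np.array([1.5, -2.0, 0.5])
y = X @ true_w
assert np.allclose(ridge_fit(X, y), true_w)
```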
Using the fitted model:
The difference between regression and classification can be understood as: classification is discrete, regression is continuous; but a regression output can be converted into a classification via a probability.
Accordingly, for a binary problem with labels (0, 1), the value 0.5 can serve as the decision boundary separating the two classes. If the label values are not direct class indicators, convert the task into a linear classification problem.
Generalized linear models
$$
y = g(w^Tx + b)\\
g^{-1}(y) = w^Tx + b
$$
Linear classification
Logistic (log-odds) regression
Converts the linear fit into 0-1 classification ($y$ is interpreted as the probability of the positive class):
$$
y = \frac{1}{1 + e^{-z}}
$$
Equivalently, as a linear model:
$$
\ln\left(\frac{y}{1-y}\right) = w^Tx + b
$$
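A quick check that the logit transform above really is the inverse of the sigmoid (names and test values are illustrative):

```python
import numpy as np

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
logit = lambda y: np.log(y / (1.0 - y))     # ln(y / (1 - y))

z = np.linspace(-3, 3, 7)
# g^{-1}(g(z)) = z: the log-odds recover the linear score exactly
assert np.allclose(logit(sigmoid(z)), z)
```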
Construct the log-likelihood.
The probabilities:
$$
p(y=1|x) = \frac{e^{w^Tx+b}}{1 + e^{w^Tx+b}}, \quad
p(y=0|x) = \frac{1}{1 + e^{w^Tx+b}}
$$
The log-likelihood:
$$
l(w,b) = \sum_{i=1}^n \ln p(y_i|x_i, w, b)\\
p(y_i|x_i, w, b) = y_i\,p(y_i=1|x_i, w, b) + (1-y_i)\,p(y_i=0|x_i, w, b)\\
l(w,b) = \sum_{i=1}^n\left[y_i(w^Tx_i + b) - \ln(1 + e^{w^Tx_i+b})\right]
$$
This is a convex optimization problem; solve by gradient descent:
$$
w^{t+1} = w^t - \lambda\,\triangle w = w^t - \lambda \left.\frac{\partial l}{\partial w}\right|_{w=w^t, b=b^t}\\
b^{t+1} = b^t - \lambda\,\triangle b = b^t - \lambda \left.\frac{\partial l}{\partial b}\right|_{w=w^t, b=b^t}\\
\frac{\partial l}{\partial w} = -\sum\left[x_iy_i - x_i\,p(y_i=1|x_i, w, b)\right]\\
\frac{\partial l}{\partial b} = -\sum\left[y_i - p(y_i=1|x_i, w, b)\right]
$$
```python
# Gradient-descent update loop for the equations above. This is a fragment of a
# fit method: it assumes numpy (as np), w, b, step, max_step, sample_dim,
# sample_num, train_sample, train_label, and learning_rate are already defined.
while step < max_step:
    dw = np.zeros(sample_dim, dtype=float)
    db = 0.0
    step += 1
    for i in range(sample_num):
        xi, yi = train_sample[i], train_label[i]
        # pi = p(y_i = 1 | x_i) under the current parameters (the sigmoid)
        pi = 1 - 1 / (1 + np.exp(np.dot(w, xi) + b))
        dw += xi * yi - xi * pi
        db += yi - pi
    # The derivatives above carry a leading minus sign, so negate the sums.
    dw = -dw
    db = -db
    w -= learning_rate * dw
    b -= learning_rate * db
self.w = w
self.b = b
```
Classification: predict whichever class has the larger probability.
Linear Discriminant Analysis (LDA)
Core idea: keep samples of the same class as close as possible and different classes as far apart as possible (a supervised dimensionality-reduction algorithm: project onto a line).
Dataset $\{(\mathbf{x}_i, y_i)\}_{i=1}^n$, binary classification.
Mean and covariance matrix of each class before projection:
$$
\mathbf{u}_0 = \frac{1}{n_0}\sum_{y_i=0} x_i\\
\Sigma_0 = \frac{1}{n_0-1}\sum_{y_i=0} (x_i - u_0)(x_i - u_0)^T
$$
After projection (projecting onto a line gives scalars):
$$
\hat{u}_0 = w^Tu_0, \quad \hat{\Sigma}_0 = w^T\Sigma_0 w
$$
Maximize the objective:
$$
J = \frac{w^TS_bw}{w^TS_ww}\\
S_w = \Sigma_0 + \Sigma_1\\
S_b = (u_0 - u_1)(u_0 - u_1)^T
$$
Equivalent formulation:
$$
\min_w\ -w^TS_bw \quad \text{s.t.}\ w^TS_ww = 1\\
L = -w^TS_bw + \lambda(w^TS_ww - 1)\\
\frac{\partial L}{\partial w} = 0 \implies w^* = S_w^{-1}(u_0 - u_1)
$$
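The closed-form direction $w^* = S_w^{-1}(u_0 - u_1)$ can be sketched in NumPy (the `lda_direction` name and the synthetic two-class data are assumptions for illustration):

```python
import numpy as np

def lda_direction(X0, X1):
    """w* = S_w^{-1} (u0 - u1): the closed-form LDA projection direction."""
    u0, u1 = X0.mean(axis=0), X1.mean(axis=0)
    S0 = (X0 - u0).T @ (X0 - u0)      # within-class scatter of class 0
    S1 = (X1 - u1).T @ (X1 - u1)      # within-class scatter of class 1
    return np.linalg.solve(S0 + S1, u0 - u1)

# Two synthetic Gaussian clusters separated along the first axis.
rng = np.random.default_rng(4)
X0 = rng.standard_normal((50, 2)) + np.array([2.0, 0.0])
X1 = rng.standard_normal((50, 2)) + np.array([-2.0, 0.0])
w = lda_direction(X0, X1)
# After projection, the two class means should be well separated.
assert (X0 @ w).mean() > (X1 @ w).mean()
```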
Support Vector Machines
The linearly separable case
Separating hyperplane:
$$
w^Tx + b = 0
$$
Maximizing the margin:
$$
\max_{w,b}\ \frac{2}{||w||} \quad \text{s.t.}\ y_i(w^Tx_i + b) \geq 1
$$
is equivalent to
$$
\min_{w,b}\ \frac{1}{2}||w||^2 \quad \text{s.t.}\ y_i(w^Tx_i + b) \geq 1
$$
This is a convex optimization problem.
The dual problem (used for solving):
$$
L = \frac{1}{2}||w||^2 - \sum_{i=1}^n \alpha_i\left(y_i(w^Tx_i + b) - 1\right)
$$
$$
w = \sum\alpha_iy_ix_i, \quad \sum\alpha_iy_i = 0
$$
$$
\min_\alpha\ \frac{1}{2}\sum_{i=1}^n\sum_{j=1}^n \alpha_i\alpha_jy_iy_jx_i^Tx_j - \sum\alpha_i \quad
\text{s.t.}\ \sum\alpha_iy_i = 0,\ \alpha_i \geq 0
$$
Solving the dual (SMO):
Select a pair of multipliers $\alpha_i, \alpha_j$ to update.
Fix all parameters other than this pair and solve; the constraint gives
$$
\alpha_iy_i + \alpha_jy_j = -\sum_{k\neq i,j}\alpha_ky_k
$$
With this equality, the subproblem reduces to a single-variable quadratic program with a closed-form solution (discard negative values).
Solve for $b$ using the support-vector equation $y_if(x_i) = 1$.
Final decision: $y = sign[f(x)]$.
The non-linearly-separable case
Introduce slack variables:
$$
\min_{w,b}\ \frac{1}{2}||w||^2 + C\sum\xi_i \quad
\text{s.t.}\ y_i(w^Tx_i + b) \geq 1 - \xi_i,\ \xi_i \geq 0
$$
Solved similarly to the above.
Feature mapping
$$
\min_{w,b}\ \frac{1}{2}||w||^2 \quad \text{s.t.}\ y_i(w^T\Phi(x_i) + b) \geq 1
$$
$$
\min_\alpha\ \frac{1}{2}\sum_{i=1}^n\sum_{j=1}^n \alpha_i\alpha_jy_iy_j\Phi(x_i)^T\Phi(x_j) - \sum\alpha_i \quad
\text{s.t.}\ \sum\alpha_iy_i = 0,\ \alpha_i \geq 0
$$
Kernel functions
Since $w = \sum\alpha_iy_i\Phi(x_i)$, substituting back gives
$$
f(x) = \sum\alpha_iy_i\Phi(x_i)^T\Phi(x) + b
$$
Define the kernel function $k: R^d \times R^d \rightarrow R$, $k(x,y) = \Phi(x)^T\Phi(y)$.
Kernel matrix: the matrix of kernel evaluations on the samples.
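As an illustration, here is the kernel matrix for the widely used Gaussian (RBF) kernel $k(x,y) = \exp(-\gamma||x-y||^2)$ — a specific kernel choice that is an assumption here, since the notes define $k$ only abstractly:

```python
import numpy as np

def rbf_kernel_matrix(X, gamma=1.0):
    """K[i, j] = exp(-gamma * ||x_i - x_j||^2): the kernel matrix on the samples."""
    sq = np.sum(X**2, axis=1)
    # squared pairwise distances via ||a-b||^2 = ||a||^2 + ||b||^2 - 2 a.b
    d2 = sq[:, None] + sq[None, :] - 2 * X @ X.T
    return np.exp(-gamma * d2)

X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 2.0]])
K = rbf_kernel_matrix(X, gamma=0.5)
assert np.allclose(np.diag(K), 1.0)     # k(x, x) = 1 for the RBF kernel
assert np.allclose(K, K.T)              # symmetric, as a valid kernel matrix must be
```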
Bayesian Methods
Basic formula (Bayes' theorem)
$$
P(A|B) = \frac{P(A,B)}{P(B)} = \frac{P(B|A)P(A)}{P(B)} =
\frac{P(B|A)P(A)}{P(B|A)P(A) + P(B|\bar{A})P(\bar{A})}
$$
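A small worked example of the formula with made-up numbers (a rare event $A$ and a noisy observation $B$; all probabilities here are hypothetical):

```python
# Hypothetical numbers: P(A) = 0.01, P(B|A) = 0.9, P(B|not A) = 0.05.
p_a = 0.01
p_b_given_a = 0.9
p_b_given_not_a = 0.05

# Denominator by total probability: P(B) = P(B|A)P(A) + P(B|~A)P(~A)
p_b = p_b_given_a * p_a + p_b_given_not_a * (1 - p_a)
p_a_given_b = p_b_given_a * p_a / p_b
print(round(p_a_given_b, 4))   # 0.1538
```

Even with a fairly reliable observation, the posterior stays low because the prior $P(A)$ is small.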
Bayesian decision theory
Optimal decision: minimize the risk

$$
R = P(c_1|B)\lambda_{21} + P(c_2|B)\lambda_{12}
$$

where $P(c_1|B)$ is the probability that the true class is $c_1$ (the following assumes 0-1 loss classification).
Naive Bayes classifier (assumes the features are mutually independent):
Continuous features: typically assume a Gaussian distribution and estimate its parameters by maximum likelihood.
Discrete features: count occurrences and compute the frequencies directly.
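A minimal Gaussian naive Bayes sketch along these lines (a uniform class prior is assumed for simplicity; the function names and toy data are illustrative):

```python
import numpy as np

def fit(X, y):
    """Per class, fit an independent Gaussian to each feature by maximum likelihood."""
    params = {}
    for c in np.unique(y):
        Xc = X[y == c]
        params[c] = (Xc.mean(axis=0), Xc.var(axis=0))
    return params

def log_likelihood(x, mean, var):
    # Sum of per-feature Gaussian log-densities (features assumed independent).
    return np.sum(-0.5 * np.log(2 * np.pi * var) - (x - mean) ** 2 / (2 * var))

def predict(params, x):
    # Uniform prior, so the posterior argmax reduces to the likelihood argmax.
    return max(params, key=lambda c: log_likelihood(x, *params[c]))

X = np.array([[1.0, 1.0], [1.2, 0.9], [5.0, 5.0], [5.1, 4.8]])
y = np.array([0, 0, 1, 1])
params = fit(X, y)
assert predict(params, np.array([1.1, 1.0])) == 0
assert predict(params, np.array([5.0, 5.0])) == 1
```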