Machine Learning Review, Part 1
Linear Algebra Review
Notation:
(,,,) – row vector (entries separated by commas)
(;;;) – column vector (entries separated by semicolons)
Multiplication
Matrix transpose
$$
AI = A = IA \\
(A^T)^T = A \\
(AB)^T = B^T A^T \\
(A+B)^T = A^T + B^T
$$
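A quick numerical sanity check of these identities (a minimal sketch; the random matrices are chosen only for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((3, 4))
B = rng.standard_normal((4, 2))
C = rng.standard_normal((3, 4))   # same shape as A, for the sum rule

assert np.allclose(np.eye(3) @ A, A) and np.allclose(A @ np.eye(4), A)  # AI = A = IA
assert np.allclose(A.T.T, A)                                            # (A^T)^T = A
assert np.allclose((A @ B).T, B.T @ A.T)                                # (AB)^T = B^T A^T
assert np.allclose((A + C).T, A.T + C.T)                                # (A+B)^T = A^T + B^T
```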
Matrix derivatives
For $f: R^{m \times n} \rightarrow R$, the gradient with respect to the matrix $A$ is defined entrywise:
$$
(\nabla_A f(A))_{ij} = \frac{\partial f(A)}{\partial A_{ij}}, \qquad \nabla_A f(A) \in R^{m \times n}
$$
E.g., for $f: R^n \rightarrow R$ with $f(z) = z^T z$, we have $\nabla_z f(z) = 2z$, treating $z$ as the variable.
But consider $\nabla f(Ax)$: treating $Ax$ as a whole, the result is $2Ax$ as above; differentiating with respect to $x$ instead, with $g(x) = f(Ax)$, gives $\nabla_x g(x) \in R^n$, which no longer follows the entrywise correspondence above. The difference comes from which quantity is taken as the variable.
Linear and quadratic forms
For $x \in R^n$, $b \in R^n$, and $f(x) = b^T x = \sum_{i=1}^n b_i x_i$:
$$\nabla_x f(x) = b$$
For $f(x) = x^T A x$ with $A \in S^n$ (symmetric):
$$\nabla_x f(x) = 2Ax$$
Commonly used:
$$
\nabla_x\, b^T x = b \\
\nabla_x\, x^T A x = 2Ax \\
\nabla_x^2\, x^T A x = 2A \quad (\text{if } A \text{ is symmetric, } A^T = A)
$$
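These gradient identities are easy to verify against a finite-difference approximation; a minimal sketch (the helper `num_grad` is ad hoc to this example):

```python
import numpy as np

def num_grad(f, x, eps=1e-6):
    # Central finite-difference approximation of the gradient of a scalar f at x.
    g = np.zeros_like(x)
    for i in range(x.size):
        e = np.zeros_like(x)
        e[i] = eps
        g[i] = (f(x + e) - f(x - e)) / (2 * eps)
    return g

rng = np.random.default_rng(0)
n = 5
x, b = rng.standard_normal(n), rng.standard_normal(n)
A = rng.standard_normal((n, n))
A = (A + A.T) / 2                                       # make A symmetric

assert np.allclose(num_grad(lambda v: b @ v, x), b)     # grad of b^T x is b
assert np.allclose(num_grad(lambda v: v @ A @ v, x),    # grad of x^T A x is 2Ax
                   2 * A @ x, atol=1e-5)
```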
Least squares
$$
\|Ax - b\|_2^2 = (Ax - b)^T(Ax - b) = x^T A^T A x - 2 b^T A x + b^T b \\
\nabla_x \|Ax - b\|_2^2 = 2 A^T A x - 2 A^T b
$$
Trace of a matrix:
$$\mathrm{tr}A = \sum_{i=1}^n A_{ii}$$
For $A \in R^{n \times n}$, the trace is linear: $\mathrm{tr}(A+B) = \mathrm{tr}A + \mathrm{tr}B$ and $\mathrm{tr}(cA) = c\,\mathrm{tr}A$.
Rank of a matrix
For $A \in R^{m \times n}$, $\mathrm{rank}(A) \leq \min(m, n)$; if $\mathrm{rank}(A) = \min(m, n)$, $A$ is said to be full rank.
Matrix inverse
$$A^{-1}A = I = AA^{-1}$$
A matrix has no inverse when:
it is not square
it is not full rank
Eigenvalue computation:
Supplementary derivative rules:
Linear Models Review
Basic model
$$f(\mathbf{x}) = \mathbf{w}^T \mathbf{x} + b$$
where $\mathbf{x}$ denotes $(x_1; x_2; x_3; \ldots; x_d)$, $\mathbf{w}$ denotes $(w_1; w_2; w_3; \ldots; w_d)$ (both column vectors), and $b$ is a scalar constant.
Linear regression
Assume the model above, with $\mathbf{w}, b$ the parameters to determine.
Objective function: mean squared error.
Solution: least squares.
$$(w^*, b^*) = \arg\min_{w,b} \sum_{i=1}^m \left(f(x_i) - y_i\right)^2$$
Take derivatives (the objective is convex, so setting the derivatives to zero gives the optimum):
$$
\frac{\partial E(w,b)}{\partial w} = 2\left(w\sum_{i=1}^m x_i^2 - \sum_{i=1}^m (y_i - b)\,x_i\right) \\
\frac{\partial E(w,b)}{\partial b} = 2\left(mb - \sum_{i=1}^m (y_i - w x_i)\right) \\
w = \frac{\sum_{i=1}^m y_i (x_i - \bar{x})}{\sum_{i=1}^m x_i^2 - \frac{1}{m}\left(\sum_{i=1}^m x_i\right)^2} \\
b = \frac{1}{m}\sum_{i=1}^m (y_i - w x_i)
$$
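The closed-form expressions for $w$ and $b$ can be checked against a standard least-squares fit; a minimal sketch (`np.polyfit` with `deg=1` fits the same line, returning slope then intercept):

```python
import numpy as np

rng = np.random.default_rng(0)
m = 50
x = rng.standard_normal(m)
y = 3.0 * x + 1.5 + 0.1 * rng.standard_normal(m)   # noisy line, true w=3, b=1.5

w = np.sum(y * (x - x.mean())) / (np.sum(x**2) - np.sum(x)**2 / m)
b = np.mean(y - w * x)

w_ref, b_ref = np.polyfit(x, y, deg=1)             # reference least-squares fit
assert np.allclose([w, b], [w_ref, b_ref])
```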
Data in matrix form:
$\hat{\mathbf{w}} = (\mathbf{w}; b)$: absorbing $b$ into the weight vector simplifies the computation; $\hat{\mathbf{w}}$ is a $(d+1)$-dimensional column vector.
Data matrix $\mathbf{X}$:
$$
\mathbf{X} =
\begin{pmatrix}
x_{11} & x_{12} & \cdots & x_{1d} & 1 \\
x_{21} & x_{22} & \cdots & x_{2d} & 1 \\
\vdots & \vdots & \ddots & \vdots & \vdots \\
x_{m1} & x_{m2} & \cdots & x_{md} & 1
\end{pmatrix}
=
\begin{pmatrix}
\mathbf{x}_1^T & 1 \\
\mathbf{x}_2^T & 1 \\
\vdots & \vdots \\
\mathbf{x}_m^T & 1
\end{pmatrix}
$$
Then
$$\mathbf{y} = \mathbf{X}\hat{\mathbf{w}}$$
while each component of $\mathbf{y} = (y_1; y_2; \ldots; y_m)$ is
$$y_i = f(\mathbf{x}_i) = \mathbf{w}^T \mathbf{x}_i + b$$
$$E(\hat{\mathbf{w}}) = (\mathbf{y} - \mathbf{X}\hat{\mathbf{w}})^T(\mathbf{y} - \mathbf{X}\hat{\mathbf{w}})$$
Differentiate:
$$
\frac{\partial E}{\partial \hat{\mathbf{w}}} = 2\mathbf{X}^T(\mathbf{X}\hat{\mathbf{w}} - \mathbf{y}) \\
\hat{\mathbf{w}}^* = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y}
$$
Tip: if $X^TX$ is not full rank (i.e. $X$ does not have full column rank), add a regularization term (ridge regression):
$$
E = (\mathbf{y} - \mathbf{X}\hat{\mathbf{w}})^T(\mathbf{y} - \mathbf{X}\hat{\mathbf{w}}) + \lambda \|\mathbf{w}\|^2 \quad (\lambda > 0) \\
\hat{\mathbf{w}}^* = (\mathbf{X}^T\mathbf{X} + \lambda \mathbf{I})^{-1}\mathbf{X}^T\mathbf{y}
$$
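A minimal numpy sketch of both closed forms on synthetic data (using `np.linalg.solve` rather than an explicit inverse, the numerically safer choice; at the OLS optimum the gradient $2X^T(X\hat{w}-y)$ derived above should vanish):

```python
import numpy as np

rng = np.random.default_rng(0)
m, d = 100, 3
X = np.hstack([rng.standard_normal((m, d)), np.ones((m, 1))])  # append a 1s column for b
w_true = np.array([2.0, -1.0, 0.5, 0.3])
y = X @ w_true + 0.01 * rng.standard_normal(m)

# Ordinary least squares: w* = (X^T X)^{-1} X^T y
w_ols = np.linalg.solve(X.T @ X, X.T @ y)
assert np.allclose(2 * X.T @ (X @ w_ols - y), 0, atol=1e-6)    # gradient vanishes

# Ridge: w* = (X^T X + lambda I)^{-1} X^T y
lam = 0.1
w_ridge = np.linalg.solve(X.T @ X + lam * np.eye(d + 1), X.T @ y)
```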
Applications of the fitted model:
The difference between regression and classification: classification outputs are discrete while regression outputs are continuous, but a regression output can be converted into a classification through a probability.
Accordingly, for a binary problem with labels in {0, 1}, 0.5 can serve as the decision boundary between the two classes. When label values are not direct class indicators, the task is converted into a linear classification problem.
Generalized linear model
$$
y = g(w^T x + b) \\
g^{-1}(y) = w^T x + b
$$
Linear classification
Logistic regression (log-odds regression)
Converts the linear fit into a 0–1 classification ($y$ is read as the probability of the positive class):
$$y = \frac{1}{1 + e^{-z}}$$
Equivalently, as a linear model in the log-odds:
$$\ln\frac{y}{1-y} = w^T x + b$$
Construct the log-likelihood.
Probability form:
$$
p(y=1|x) = \frac{e^{w^Tx+b}}{1 + e^{w^Tx+b}}, \qquad p(y=0|x) = \frac{1}{1 + e^{w^Tx+b}}
$$
Log-likelihood:
$$
l(w,b) = \sum_{i=1}^n \ln p(y_i \mid x_i; w, b) \\
p(y_i \mid x_i; w, b) = y_i\, p(y_i = 1 \mid x_i; w, b) + (1 - y_i)\, p(y_i = 0 \mid x_i; w, b) \\
l(w,b) = \sum_{i=1}^n \left[ y_i (w^T x_i + b) - \ln\left(1 + e^{w^T x_i + b}\right) \right]
$$
Convex optimization, solved by gradient descent (the gradient expressions below carry a minus sign because the updates descend the negative log-likelihood $-l$):
$$
w^{t+1} = w^t - \lambda \Delta w = w^t - \lambda \left.\frac{\partial l}{\partial w}\right|_{w=w^t,\, b=b^t} \\
b^{t+1} = b^t - \lambda \Delta b = b^t - \lambda \left.\frac{\partial l}{\partial b}\right|_{w=w^t,\, b=b^t} \\
\frac{\partial l}{\partial w} = -\sum_i \left[ x_i y_i - x_i\, p(y_i=1 \mid x_i, w, b) \right] \\
\frac{\partial l}{\partial b} = -\sum_i \left[ y_i - p(y_i=1 \mid x_i, w, b) \right]
$$
```python
import numpy as np

def fit_logistic(train_sample, train_label, learning_rate=0.1, max_step=1000):
    # Batch gradient descent for logistic regression, using the gradients derived above.
    sample_num, sample_dim = train_sample.shape
    w, b = np.zeros(sample_dim), 0.0
    for step in range(max_step):
        dw = np.zeros(sample_dim)
        db = 0.0
        for i in range(sample_num):
            xi, yi = train_sample[i], train_label[i]
            z = np.clip(np.dot(w, xi) + b, -500, 500)   # clip to avoid overflow in exp
            pi = 1.0 / (1.0 + np.exp(-z))               # p(y_i = 1 | x_i)
            dw += xi * yi - xi * pi
            db += yi - pi
        dw, db = -dw, -db                               # gradient of -l
        w -= learning_rate * dw
        b -= learning_rate * db
    return w, b
```
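Usage on toy data (a sketch; `fit_logistic` is the function above, and the two Gaussian clusters are an assumption for illustration):

```python
rng = np.random.default_rng(0)
X = np.vstack([rng.standard_normal((50, 2)) + 2.0,    # class 1 around (2, 2)
               rng.standard_normal((50, 2)) - 2.0])   # class 0 around (-2, -2)
y = np.hstack([np.ones(50), np.zeros(50)])

w, b = fit_logistic(X, y, learning_rate=0.01, max_step=500)
p = 1.0 / (1.0 + np.exp(-(X @ w + b)))                # p(y = 1 | x)
pred = (p >= 0.5).astype(float)                       # 0.5 as the decision threshold
print("training accuracy:", np.mean(pred == y))
```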
Classification: predict the class with the larger probability.
Linear Discriminant Analysis (LDA)
Core idea: after projection, samples of the same class should be as close as possible and samples of different classes as far apart as possible (a supervised dimensionality-reduction method: project onto a line).
Dataset $\{(\mathbf{x}_i, y_i)\}_{i=1}^n$, binary classification.
Per-class mean and covariance matrix before projection:
$$
\mathbf{u}_0 = \frac{1}{n_0}\sum_{y_i=0} x_i \\
\Sigma_0 = \frac{1}{n_0 - 1}\sum_{y_i=0} (x_i - \mathbf{u}_0)(x_i - \mathbf{u}_0)^T
$$
(and analogously $\mathbf{u}_1, \Sigma_1$ for class 1).
After projection (projections onto a line are scalars):
$$
\hat{u}_0 = w^T u_0, \qquad \hat{\Sigma}_0 = w^T \Sigma_0 w
$$
Maximize the objective (a generalized Rayleigh quotient):
$$
J = \frac{w^T S_b w}{w^T S_w w} \\
S_w = \Sigma_0 + \Sigma_1 \\
S_b = (u_0 - u_1)(u_0 - u_1)^T
$$
Equivalent formulation:
$$
\min_w\ -w^T S_b w \quad \text{s.t.}\ w^T S_w w = 1 \\
L = -w^T S_b w + \lambda \left( w^T S_w w - 1 \right) \\
\frac{\partial L}{\partial w} = 0 \;\Rightarrow\; w^* = S_w^{-1}(u_0 - u_1)
$$
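A minimal numpy sketch of the closed-form LDA direction on toy two-class data (all names are local to this example):

```python
import numpy as np

rng = np.random.default_rng(0)
X0 = rng.standard_normal((40, 2))                 # class 0 around (0, 0)
X1 = rng.standard_normal((40, 2)) + [4.0, 2.0]    # class 1 around (4, 2)

u0, u1 = X0.mean(axis=0), X1.mean(axis=0)
S0 = (X0 - u0).T @ (X0 - u0) / (len(X0) - 1)      # within-class covariances
S1 = (X1 - u1).T @ (X1 - u1) / (len(X1) - 1)
Sw = S0 + S1

w = np.linalg.solve(Sw, u0 - u1)                  # w* = Sw^{-1}(u0 - u1)
proj0, proj1 = X0 @ w, X1 @ w                     # 1-D projections; the classes separate
print(proj0.mean(), proj1.mean())
```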
Support Vector Machines
Linearly separable case
Separating hyperplane:
$$w^T x + b = 0$$
Maximizing the margin:
$$
\max_{w,b} \frac{2}{\|w\|} \quad \text{s.t.}\ y_i(w^T x_i + b) \geq 1
$$
which is equivalent to
$$
\min_{w,b} \frac{1}{2}\|w\|^2 \quad \text{s.t.}\ y_i(w^T x_i + b) \geq 1
$$
This is a convex optimization problem.
Dual problem (used for the solution):
$$
L = \frac{1}{2}\|w\|^2 - \sum_{i=1}^n \alpha_i \left( y_i (w^T x_i + b) - 1 \right)
$$
$$
w = \sum_i \alpha_i y_i x_i, \qquad \sum_i \alpha_i y_i = 0
$$
$$
\min_{\alpha} \frac{1}{2}\sum_{i=1}^n \sum_{j=1}^n \alpha_i \alpha_j y_i y_j x_i^T x_j - \sum_i \alpha_i \quad \text{s.t.}\ \sum_i \alpha_i y_i = 0,\ \alpha_i \geq 0
$$
Solving the dual (SMO):
Select a pair of multipliers $\alpha_i, \alpha_j$ to update.
Fix all parameters other than this pair and solve, using
$$
\alpha_i y_i + \alpha_j y_j = -\sum_{k \neq i,j} \alpha_k y_k
$$
This equality eliminates one of the two variables, leaving a single-variable quadratic program with a closed-form solution (negative values are discarded).
Solve for $b$ from the support-vector equation $y_i f(x_i) = 1$.
Final decision: $y = \mathrm{sign}[f(x)]$.
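A sketch of these relations using scikit-learn's `SVC` as the dual solver (using scikit-learn here is an assumption beyond these notes; `dual_coef_` stores $\alpha_i y_i$ for the support vectors):

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = np.vstack([rng.standard_normal((30, 2)) + 3.0,
               rng.standard_normal((30, 2)) - 3.0])
y = np.hstack([np.ones(30), -np.ones(30)])        # labels in {+1, -1}

clf = SVC(kernel="linear", C=1e6).fit(X, y)       # very large C approximates hard margin

w = clf.dual_coef_ @ clf.support_vectors_         # w = sum_i alpha_i y_i x_i
assert np.allclose(w, clf.coef_)
pred = np.sign(X @ w.ravel() + clf.intercept_)    # y = sign[f(x)]
assert np.array_equal(pred, y)
```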
Non-separable case
Introduce slack variables:
$$
\min_{w,b} \frac{1}{2}\|w\|^2 + C \sum_i \xi_i \quad \text{s.t.}\ y_i(w^T x_i + b) \geq 1 - \xi_i,\ \xi_i \geq 0
$$
Solved analogously to the above.
Feature mapping
$$
\min_{w,b} \frac{1}{2}\|w\|^2 \quad \text{s.t.}\ y_i \left( w^T \Phi(x_i) + b \right) \geq 1
$$
$$
\min_{\alpha} \frac{1}{2}\sum_{i=1}^n \sum_{j=1}^n \alpha_i \alpha_j y_i y_j \Phi(x_i)^T \Phi(x_j) - \sum_i \alpha_i \quad \text{s.t.}\ \sum_i \alpha_i y_i = 0,\ \alpha_i \geq 0
$$
Kernel functions
Since $w = \sum_i \alpha_i y_i \Phi(x_i)$ in the feature space, substituting back gives
$$
f(x) = \sum_i \alpha_i y_i \Phi(x_i)^T \Phi(x) + b
$$
Define the kernel function $k: R^d \times R^d \rightarrow R$, $k(x,y) = \Phi(x)^T \Phi(y)$, so the feature-space inner products never need to be computed explicitly.
Kernel (Gram) matrix: the matrix of kernel evaluations over the sample, $K_{ij} = k(x_i, x_j)$.
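A small sketch of a Gaussian (RBF) kernel and its Gram matrix (the choice of kernel here is only an illustration):

```python
import numpy as np

def rbf_kernel(x, z, gamma=1.0):
    # k(x, z) = exp(-gamma * ||x - z||^2); its Phi is implicit and infinite-dimensional.
    return np.exp(-gamma * np.sum((x - z) ** 2))

rng = np.random.default_rng(0)
X = rng.standard_normal((5, 3))
K = np.array([[rbf_kernel(xi, xj) for xj in X] for xi in X])  # Gram matrix K_ij = k(x_i, x_j)

assert np.allclose(K, K.T)                       # symmetric
assert np.all(np.linalg.eigvalsh(K) > -1e-10)    # positive semidefinite
```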
Bayesian Methods
Basic formula (Bayes' rule)
$$
P(A|B) = \frac{P(A,B)}{P(B)} = \frac{P(B|A)P(A)}{P(B)} = \frac{P(B|A)P(A)}{P(B|A)P(A) + P(B|\bar{A})P(\bar{A})}
$$
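A worked numeric instance (the numbers are invented for illustration): a test with 99% sensitivity and a 5% false-positive rate, for a condition with 1% prevalence:

```python
p_A = 0.01                                       # P(A): prior (prevalence)
p_B_given_A = 0.99                               # P(B|A): sensitivity
p_B_given_notA = 0.05                            # P(B|not A): false-positive rate

p_B = p_B_given_A * p_A + p_B_given_notA * (1 - p_A)   # law of total probability
p_A_given_B = p_B_given_A * p_A / p_B                  # Bayes' rule
print(p_A_given_B)                               # ~0.167: a positive test is far from certain
```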
Bayesian decision theory
Optimal decision: minimize the risk
$$R = P(c_1|B)\lambda_{21} + P(c_2|B)\lambda_{12}$$
where $P(c_1|B)$ is the probability that the true class is $c_1$ (the following assumes 0-1 loss classification).
Naive Bayes classifier (assumes features are conditionally independent given the class):
Continuous features: typically modeled with a Gaussian, with parameters estimated by maximum likelihood.
Discrete features: probabilities estimated directly by counting (see the sketch below).
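A minimal counting-based sketch for the discrete case (toy data; the Laplace smoothing is a common refinement and an assumption beyond these notes):

```python
import numpy as np

# Toy discrete dataset: 2 binary features, 2 classes.
X = np.array([[1, 0], [1, 1], [0, 1], [0, 0], [1, 1], [0, 1]])
y = np.array([1, 1, 0, 0, 1, 0])

def nb_predict(x, X, y, n_values=2, alpha=1):
    # P(c|x) is proportional to P(c) * prod_j P(x_j|c); both factors come from counts.
    scores = {}
    for c in np.unique(y):
        Xc = X[y == c]
        log_p = np.log(len(Xc) / len(X))                 # log prior P(c)
        for j, xj in enumerate(x):
            count = np.sum(Xc[:, j] == xj)               # Laplace-smoothed P(x_j|c)
            log_p += np.log((count + alpha) / (len(Xc) + alpha * n_values))
        scores[c] = log_p
    return max(scores, key=scores.get)                   # class with the larger posterior

print(nb_predict(np.array([1, 0]), X, y))                # predicts class 1
```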