1. Linear Regression
1.1 The Multiple Linear Regression Model
Given a training dataset

$$D = \left\{ \left( \mathbf{x}_{1}, y_{1} \right), \left( \mathbf{x}_{2}, y_{2} \right), \cdots, \left(\mathbf{x}_i,y_i\right),\dots, \left( \mathbf{x}_{N}, y_{N} \right) \right\}$$

where $\mathbf{x}_{i} \in \mathcal{X}\subseteq\mathbb{R}^{n}$ and $y_{i} \in \mathcal{Y}\subseteq\mathbb{R}$.
The multiple linear regression model is

$$f\left(\mathbf{x}\right)=\mathbf{w}\cdot\mathbf{x}+b=\sum_{i=1}^n w^{\left(i\right)} x^{\left(i\right)}+b$$

where $\mathbf{x} \in \mathcal{X}\subseteq \mathbb{R}^{n}$ is an input instance, and $\mathbf{w}=\left(w^{\left(1\right)},w^{\left(2\right)},\dots,w^{\left(n\right)}\right)^\top \in \mathbb{R}^{n}$ and $b \in \mathbb{R}$ are the model parameters: $\mathbf{w}$ is called the weight vector, $b$ the bias, and $\mathbf{w} \cdot \mathbf{x}$ denotes the inner product of $\mathbf{w}$ and $\mathbf{x}$.
When $n=1$, this reduces to the simple linear regression model

$$f\left(x\right)=wx+b$$

where $w\in\mathbb{R}$ and $b\in\mathbb{R}$ are the model parameters.
Let

$$\hat{\mathbf{w}}=\left(\mathbf{w},b\right)^\top,\qquad \hat{\mathbf{x}}=\left(\mathbf{x},1\right)^\top$$

Then the multiple linear regression model simplifies to

$$f\left(\hat{\mathbf{x}}\right)=\hat{\mathbf{w}}\cdot\hat{\mathbf{x}}$$

where $\hat{\mathbf{x}}$ is the augmented feature vector and $\hat{\mathbf{w}}$ is the augmented weight vector.
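To make the augmentation concrete, here is a minimal NumPy sketch (the variable names and values are illustrative, not from the text): appending a constant 1 to each input lets a single dot product reproduce $\mathbf{w}\cdot\mathbf{x}+b$.

```python
import numpy as np

X = np.array([[1.0, 2.0], [3.0, 4.0]])             # two samples, n = 2 features
X_aug = np.hstack([X, np.ones((X.shape[0], 1))])   # rows are x_hat = (x, 1)

w = np.array([0.5, -0.3])
b = 2.0
w_aug = np.append(w, b)                            # w_hat = (w, b)

# f(x_hat) = w_hat . x_hat equals w . x + b for every sample.
assert np.allclose(X_aug @ w_aug, X @ w + b)
```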
1.2 Parameter Learning for Multiple Linear Regression: Empirical Risk Minimization and Structural Risk Minimization
Loss function: the squared loss

$$L\left(y,f\left(\mathbf{x}\right)\right)=\left(y-f\left(\mathbf{x}\right)\right)^2$$

Empirical risk:

$$R_{emp} \left( f \right) = \dfrac{1}{N} \sum_{i=1}^{N} L \left(y_{i}, f \left( \mathbf{x}_{i} \right) \right)$$

The optimal model parameters:
$$\begin{aligned}
\hat{\mathbf{w}}^*&=\mathop{\arg\min}_{\hat{\mathbf{w}}}\sum_{i=1}^N \left(y_i-f\left(\hat{\mathbf{x}}_i\right)\right)^2 \\
&=\mathop{\arg\min}_{\hat{\mathbf{w}}}\sum_{i=1}^N \left(y_i-\hat{\mathbf{w}}\cdot\hat{\mathbf{x}}_i\right)^2
\end{aligned}$$
Solving for the model by minimizing the mean squared error in this way is called the least squares method.
Equivalently, in matrix form, the optimal parameters are

$$\hat{\mathbf{w}}^*=\mathop{\arg\min}_{\hat{\mathbf{w}}} \left(\mathbf{y}-\mathbf{X}\hat{\mathbf{w}}\right)^\top\left(\mathbf{y}-\mathbf{X}\hat{\mathbf{w}}\right)$$

where
$$\mathbf{X}=\begin{pmatrix} \mathbf{x}_1^\top & 1 \\ \mathbf{x}_2^\top & 1 \\ \vdots & \vdots \\ \mathbf{x}_N^\top & 1\end{pmatrix} =\begin{pmatrix} \hat{\mathbf{x}}_1^\top \\ \hat{\mathbf{x}}_2^\top \\ \vdots \\ \hat{\mathbf{x}}_N^\top \end{pmatrix},\qquad \mathbf{y}=\left(y_1,y_2,\dots,y_N\right)^\top$$
Let $E_{\hat{\mathbf{w}}}=\left(\mathbf{y}-\mathbf{X}\hat{\mathbf{w}}\right)^\top\left(\mathbf{y}-\mathbf{X}\hat{\mathbf{w}}\right)$. Taking the partial derivative with respect to $\hat{\mathbf{w}}$ gives

$$\frac{\partial E_{\hat{\mathbf{w}}}}{\partial \hat{\mathbf{w}}}=2\mathbf{X}^\top\left(\mathbf{X}\hat{\mathbf{w}}-\mathbf{y}\right)$$

When $\mathbf{X}^\top\mathbf{X}$ is full-rank (equivalently, positive definite), setting this derivative to zero yields the closed-form solution

$$\hat{\mathbf{w}}^*=\left(\mathbf{X}^\top\mathbf{X}\right)^{-1}\mathbf{X}^\top\mathbf{y}$$

When this condition does not hold, one can first remove linear correlations among the features, e.g. with principal component analysis (PCA), and then apply least squares.
Alternatively, use gradient descent: initialize $\hat{\mathbf{w}}_0=\mathbf{0}$ and iterate

$$\hat{\mathbf{w}}\gets\hat{\mathbf{w}}-\alpha\mathbf{X}^\top\left(\mathbf{X}\hat{\mathbf{w}}-\mathbf{y}\right)$$

where $\alpha$ is the learning rate.
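Both routes are easy to verify numerically. Below is a minimal sketch on synthetic data (the names, sizes, and learning rate are illustrative assumptions): the closed-form solution and the gradient-descent iterate should agree. In practice `np.linalg.lstsq` is preferable to forming the inverse explicitly.

```python
import numpy as np

rng = np.random.default_rng(0)
N, n = 100, 3
X = np.hstack([rng.normal(size=(N, n)), np.ones((N, 1))])  # augmented design matrix
w_true = np.array([1.0, -2.0, 0.5, 3.0])                   # last entry acts as the bias b
y = X @ w_true + 0.01 * rng.normal(size=N)

# Closed-form least squares: w* = (X^T X)^{-1} X^T y.
w_closed = np.linalg.inv(X.T @ X) @ X.T @ y

# Gradient descent: w <- w - alpha * X^T (X w - y).
w_gd = np.zeros(n + 1)
alpha = 0.001
for _ in range(20000):
    w_gd -= alpha * X.T @ (X @ w_gd - y)

print(np.max(np.abs(w_closed - w_gd)))  # both estimates should be close to w_true
```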
Structural risk:

$$R_{str}= \dfrac{1}{N} \sum_{i=1}^{N} L \left(y_{i}, f \left( \mathbf{x}_{i} \right) \right) + \lambda J \left(f\right)$$

Ridge regression:

$$R_{str}= \dfrac{1}{N} \sum_{i=1}^{N} L \left(y_{i}, f \left( \mathbf{x}_{i} \right) \right) + \alpha\|\mathbf{w}\|^2,\quad \alpha\geq0$$

Lasso regression:

$$R_{str}= \dfrac{1}{N} \sum_{i=1}^{N} L \left(y_{i}, f \left( \mathbf{x}_{i} \right) \right) + \alpha\|\mathbf{w}\|_1,\quad \alpha\geq0$$

Elastic Net regression:

$$R_{str}= \dfrac{1}{N} \sum_{i=1}^{N} L \left(y_{i}, f \left( \mathbf{x}_{i} \right) \right) + \alpha\rho\|\mathbf{w}\|_1+\frac{\alpha\left(1-\rho\right)}{2}\|\mathbf{w}\|^2,\quad \alpha\geq0,\ 0\leq\rho\leq1$$
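For what it's worth, scikit-learn's `ElasticNet` uses essentially this parameterization: its `alpha` plays the role of $\alpha$ and `l1_ratio` the role of $\rho$ (up to a factor of $\frac{1}{2}$ on the averaged squared-loss term). A minimal sketch on the diabetes data used in the next subsection:

```python
from sklearn import datasets, linear_model, model_selection

X_train, X_test, y_train, y_test = model_selection.train_test_split(
    *datasets.load_diabetes(return_X_y=True), test_size=0.25, random_state=0)

# alpha is the overall strength; l1_ratio = rho trades off the L1 vs. L2 penalty.
regr = linear_model.ElasticNet(alpha=0.1, l1_ratio=0.5)
regr.fit(X_train, y_train)
print('Score: %.2f' % regr.score(X_test, y_test))
```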
1.3 Applying the Multiple Linear Regression Model
```python
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
from sklearn import datasets, linear_model, model_selection


def load_data():
    # Load the diabetes dataset and hold out 25% of it as a test set.
    diabetes = datasets.load_diabetes()
    return model_selection.train_test_split(diabetes.data, diabetes.target,
                                            test_size=0.25, random_state=0)


def test_LinearRegression(*data):
    X_train, X_test, y_train, y_test = data
    regr = linear_model.LinearRegression()
    regr.fit(X_train, y_train)
    print('Coefficients:%s, intercept %.2f' % (regr.coef_, regr.intercept_))
    # Mean squared error on the test set.
    print("Residual sum of squares: %.2f"
          % np.mean((regr.predict(X_test) - y_test) ** 2))
    # R^2 (coefficient of determination) on the test set.
    print('Score: %.2f' % regr.score(X_test, y_test))


if __name__ == '__main__':
    X_train, X_test, y_train, y_test = load_data()
    test_LinearRegression(X_train, X_test, y_train, y_test)
```
```python
import matplotlib.pyplot as plt
import numpy as np
from sklearn import datasets, linear_model, model_selection


def load_data():
    diabetes = datasets.load_diabetes()
    return model_selection.train_test_split(diabetes.data, diabetes.target,
                                            test_size=0.25, random_state=0)


def test_Lasso(*data):
    # Lasso with the default regularization strength (alpha=1.0).
    X_train, X_test, y_train, y_test = data
    regr = linear_model.Lasso()
    regr.fit(X_train, y_train)
    print('Coefficients:%s, intercept %.2f' % (regr.coef_, regr.intercept_))
    print("Residual sum of squares: %.2f"
          % np.mean((regr.predict(X_test) - y_test) ** 2))
    print('Score: %.2f' % regr.score(X_test, y_test))


def test_Lasso_alpha(*data):
    # Sweep the regularization strength alpha and plot the test-set R^2.
    X_train, X_test, y_train, y_test = data
    alphas = [0.01, 0.02, 0.05, 0.1, 0.2, 0.5, 1, 2, 5, 10, 20, 50, 100, 200, 500, 1000]
    scores = []
    for alpha in alphas:
        regr = linear_model.Lasso(alpha=alpha)
        regr.fit(X_train, y_train)
        scores.append(regr.score(X_test, y_test))
    fig = plt.figure()
    ax = fig.add_subplot(1, 1, 1)
    ax.plot(alphas, scores)
    ax.set_xlabel(r"$\alpha$")
    ax.set_ylabel(r"score")
    ax.set_xscale('log')
    ax.set_title("Lasso")
    plt.show()


if __name__ == '__main__':
    X_train, X_test, y_train, y_test = load_data()
    test_Lasso(X_train, X_test, y_train, y_test)
    test_Lasso_alpha(X_train, X_test, y_train, y_test)
```
```python
import matplotlib.pyplot as plt
import numpy as np
from sklearn import datasets, linear_model, model_selection


def load_data():
    diabetes = datasets.load_diabetes()
    return model_selection.train_test_split(diabetes.data, diabetes.target,
                                            test_size=0.25, random_state=0)


def test_Ridge(*data):
    # Ridge with the default regularization strength (alpha=1.0).
    X_train, X_test, y_train, y_test = data
    regr = linear_model.Ridge()
    regr.fit(X_train, y_train)
    print('Coefficients:%s, intercept %.2f' % (regr.coef_, regr.intercept_))
    print("Residual sum of squares: %.2f"
          % np.mean((regr.predict(X_test) - y_test) ** 2))
    print('Score: %.2f' % regr.score(X_test, y_test))


def test_Ridge_alpha(*data):
    # Sweep the regularization strength alpha and plot the test-set R^2.
    X_train, X_test, y_train, y_test = data
    alphas = [0.01, 0.02, 0.05, 0.1, 0.2, 0.5, 1, 2, 5, 10, 20, 50, 100, 200, 500, 1000]
    scores = []
    for alpha in alphas:
        regr = linear_model.Ridge(alpha=alpha)
        regr.fit(X_train, y_train)
        scores.append(regr.score(X_test, y_test))
    fig = plt.figure()
    ax = fig.add_subplot(1, 1, 1)
    ax.plot(alphas, scores)
    ax.set_xlabel(r"$\alpha$")
    ax.set_ylabel(r"score")
    ax.set_xscale('log')
    ax.set_title("Ridge")
    plt.show()


if __name__ == '__main__':
    X_train, X_test, y_train, y_test = load_data()
    test_Ridge(X_train, X_test, y_train, y_test)
    test_Ridge_alpha(X_train, X_test, y_train, y_test)
```
2. Logistic Regression
2.1 The Sigmoid Function and the Binary Logistic Regression Model
The sigmoid function:
$$\operatorname{sigmoid}\left(z\right)=\sigma\left(z\right)=\frac{1}{1+e^{-z}}$$

where $z\in\mathbb{R}$ and $\sigma\left(z\right)\in\left(0,1\right)$.
The derivative of the sigmoid function:
$$\sigma'\left(z\right)=\sigma\left(z\right)\left(1-\sigma\left(z\right)\right)$$
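This identity is easy to sanity-check against a central finite difference; a minimal sketch:

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

z, h = 0.7, 1e-6
numeric = (sigmoid(z + h) - sigmoid(z - h)) / (2 * h)  # central difference
analytic = sigmoid(z) * (1 - sigmoid(z))
print(numeric, analytic)  # the two values agree to ~1e-10
```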
The binary logistic regression model is the following conditional probability distribution:
$$\begin{aligned}
P \left( y = 1 \mid \mathbf{x} \right) &=\sigma\left(\mathbf{w}\cdot\mathbf{x}+b\right) \\
&= \dfrac{1}{1+\exp\left(-\left(\mathbf{w} \cdot \mathbf{x} + b \right)\right)} \\
&= \dfrac{\exp\left(\mathbf{w} \cdot \mathbf{x} + b \right)}{1+\exp\left( \mathbf{w} \cdot \mathbf{x} + b \right)}\\
P \left( y = 0 \mid \mathbf{x} \right) &= 1- \sigma\left(\mathbf{w}\cdot\mathbf{x}+b\right) \\
&=\dfrac{1}{1+\exp\left( \mathbf{w} \cdot \mathbf{x} + b \right)}
\end{aligned}$$
where $\mathbf{x} \in \mathbb{R}^{n}$, $y \in \left\{ 0, 1 \right\}$, $\mathbf{w} \in \mathbb{R}^{n}$ is the weight vector, $b \in \mathbb{R}$ is the bias, and $\mathbf{w} \cdot \mathbf{x}$ is their inner product.
As before, the weight and feature vectors can be augmented: with the augmented weight vector $\hat{\mathbf{w}} = \left( w^{\left(1\right)},w^{\left(2\right)},\cdots,w^{\left(n\right)},b \right)^\top$ and the augmented feature vector $\hat{\mathbf{x}} = \left( x^{\left(1\right)},x^{\left(2\right)},\cdots,x^{\left(n\right)},1 \right)^\top$, the logistic regression model becomes:
$$\begin{aligned}
P \left( y = 1 \mid \hat{\mathbf{x}} \right) &= \dfrac{\exp\left(\hat{\mathbf{w}} \cdot \hat{\mathbf{x}} \right)}{1+\exp\left( \hat{\mathbf{w}} \cdot \hat{\mathbf{x}} \right)}\\
P \left( y = 0 \mid \hat{\mathbf{x}} \right) &=\dfrac{1}{1+\exp\left( \hat{\mathbf{w}} \cdot \hat{\mathbf{x}} \right)}
\end{aligned}$$
2.2 Parameter Learning for Binary Logistic Regression: Maximum Likelihood Estimation
Given a training dataset
$$D = \left\{ \left( \hat{\mathbf{x}}_{1}, y_{1} \right), \left( \hat{\mathbf{x}}_{2}, y_{2} \right), \cdots, \left( \hat{\mathbf{x}}_{N}, y_{N} \right) \right\}$$
where $\hat{\mathbf{x}}_{i} \in \mathbb{R}^{n+1}$, $y_{i} \in \left\{ 0, 1 \right\}$, $i = 1, 2, \cdots, N$.
Let
$$P \left( y =1 \mid \hat{\mathbf{x}} \right) =\sigma \left( \hat{\mathbf{w}}\cdot\hat{\mathbf{x}} \right), \qquad P \left( y =0 \mid \hat{\mathbf{x}} \right) = 1 - \sigma \left( \hat{\mathbf{w}}\cdot\hat{\mathbf{x}} \right)$$
Likelihood function:
$$\begin{aligned} L \left( \hat{\mathbf{w}} \right) &= \prod_{i=1}^N P\left(y_i \mid \hat{\mathbf{x}}_i\right) \\
&= \prod_{i=1}^{N} \left[ \sigma \left( \hat{\mathbf{w}}\cdot\hat{\mathbf{x}}_{i} \right) \right]^{y_{i}}\left[ 1 - \sigma \left( \hat{\mathbf{w}}\cdot\hat{\mathbf{x}}_{i} \right) \right]^{1 - y_{i}}
\end{aligned}$$
Because the product in the likelihood can underflow numerically, it is converted to the log-likelihood, which replaces the product with a sum:
$$\begin{aligned} l \left( \hat{\mathbf{w}} \right) &= \log L \left( \hat{\mathbf{w}} \right) \\
& = \sum_{i=1}^{N} \left[ y_{i} \log \sigma \left( \hat{\mathbf{w}}\cdot\hat{\mathbf{x}}_{i} \right) + \left( 1 - y_{i} \right) \log \left( 1 - \sigma \left( \hat{\mathbf{w}}\cdot\hat{\mathbf{x}}_{i} \right) \right) \right]
\end{aligned}$$
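The underflow concern is concrete: with a few hundred samples, the raw product of probabilities already drops below the smallest representable positive double, while the sum of logs stays well-scaled. A minimal illustration (the probability value is arbitrary):

```python
import numpy as np

p = np.full(500, 0.2)     # pretend per-sample probabilities P(y_i | x_i)
print(np.prod(p))         # 0.0 -- 0.2**500 ~ 1e-350 underflows float64
print(np.sum(np.log(p)))  # about -804.7 -- the log-likelihood is fine
```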
Maximum likelihood estimation:

$$\hat{\mathbf{w}}^*=\mathop{\arg\max}_{\hat{\mathbf{w}}} l\left(\hat{\mathbf{w}}\right)$$

Equivalently, minimization of the negative log-likelihood loss:

$$\hat{\mathbf{w}}^*=\mathop{\arg\min}_{\hat{\mathbf{w}}} -l\left(\hat{\mathbf{w}}\right)$$

Let $\hat{y}_i=\sigma\left(\hat{\mathbf{w}}\cdot\hat{\mathbf{x}}_i\right)$. Using $\sigma'\left(z\right)=\sigma\left(z\right)\left(1-\sigma\left(z\right)\right)$, the partial derivative of the negative log-likelihood $-l\left(\hat{\mathbf{w}}\right)$ with respect to $\hat{\mathbf{w}}$ is
$$\begin{aligned}\frac{\partial \left(-l\left(\hat{\mathbf{w}}\right)\right)}{\partial \hat{\mathbf{w}}} &=-\sum_{i=1}^N\left(y_i\frac{\hat{y}_i\left(1-\hat{y}_i\right)}{\hat{y}_i}\hat{\mathbf{x}}_i-\left(1-y_i\right)\frac{\hat{y}_i\left(1-\hat{y}_i\right)}{1-\hat{y}_i}\hat{\mathbf{x}}_i\right)\\
&=-\sum_{i=1}^N\left(y_i\left(1-\hat{y}_i\right)\hat{\mathbf{x}}_i-\left(1-y_i\right)\hat{y}_i\hat{\mathbf{x}}_i\right) \\
&=-\sum_{i=1}^N\hat{\mathbf{x}}_i\left(y_i-\hat{y}_i\right)\end{aligned}$$
Applying gradient descent to this loss, initialize $\hat{\mathbf{w}}_0=\mathbf{0}$ and iterate

$$\hat{\mathbf{w}}_{t+1}\gets\hat{\mathbf{w}}_t+\alpha\sum_{i=1}^N\hat{\mathbf{x}}_i\left(y_i-\hat{y}_i^{\hat{\mathbf{w}}_t}\right)$$

where $\alpha$ is the learning rate and $\hat{y}_i^{\hat{\mathbf{w}}_t}$ is the model's predicted output under parameters $\hat{\mathbf{w}}_t$.
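The sum in this update is a matrix product in disguise, so a batch implementation takes only a few lines of NumPy. A minimal sketch on synthetic labels (the names, sizes, and learning rate are illustrative assumptions):

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

rng = np.random.default_rng(0)
N = 200
X_hat = np.hstack([rng.normal(size=(N, 2)), np.ones((N, 1))])  # augmented features
w_true = np.array([2.0, -1.0, 0.5])
y = (rng.uniform(size=N) < sigmoid(X_hat @ w_true)).astype(float)  # sampled labels

w_hat = np.zeros(3)   # initialize w_hat_0 = 0
alpha = 0.01
for _ in range(5000):
    y_pred = sigmoid(X_hat @ w_hat)           # y_hat_i under the current w_hat
    w_hat += alpha * X_hat.T @ (y - y_pred)   # the update above, vectorized
print(w_hat)  # should move toward w_true
```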
2.3 Applying the Logistic Regression Model
```python
%matplotlib inline
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split


def create_data():
    # Take the first 100 iris samples (classes 0 and 1) and keep only
    # sepal length, sepal width, and the label.
    iris = load_iris()
    df = pd.DataFrame(iris.data, columns=iris.feature_names)
    df['label'] = iris.target
    df.columns = ['sepal length', 'sepal width', 'petal length', 'petal width', 'label']
    data = np.array(df.iloc[:100, [0, 1, -1]])
    return data[:, :2], data[:, -1]


X, y = create_data()
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)


class LogisticRegressionClassifier:
    def __init__(self, max_iter=200, learning_rate=0.01):
        self.max_iter = max_iter
        self.learning_rate = learning_rate

    def sigmoid(self, x):
        # np.exp handles the array-valued dot products below.
        return 1 / (1 + np.exp(-x))

    def data_matrix(self, X):
        # Augment each sample with a leading 1 so the bias folds into the weights.
        return [[1.0, *d] for d in X]

    def fit(self, X, y):
        data_mat = self.data_matrix(X)
        self.weights = np.zeros((len(data_mat[0]), 1), dtype=np.float32)
        # Stochastic gradient ascent on the log-likelihood:
        # w <- w + alpha * x_i * (y_i - sigmoid(w . x_i))
        for iter_ in range(self.max_iter):
            for i in range(len(X)):
                result = self.sigmoid(np.dot(data_mat[i], self.weights))
                error = y[i] - result
                self.weights += self.learning_rate * error * np.transpose([data_mat[i]])
        print('LogisticRegression Model(learning_rate={}, max_iter={})'.format(
            self.learning_rate, self.max_iter))

    def score(self, X_test, y_test):
        right = 0
        X_test = self.data_matrix(X_test)
        for x, y in zip(X_test, y_test):
            result = np.dot(x, self.weights)
            # Predict class 1 iff w . x > 0, i.e. sigmoid(w . x) > 0.5.
            if (result > 0 and y == 1) or (result < 0 and y == 0):
                right += 1
        return right / len(X_test)


lr_clf = LogisticRegressionClassifier()
lr_clf.fit(X_train, y_train)
lr_clf.score(X_test, y_test)

# Plot the learned decision boundary w0 + w1*x1 + w2*x2 = 0 over the data.
x_points = np.arange(4, 8)
y_ = -(lr_clf.weights[1] * x_points + lr_clf.weights[0]) / lr_clf.weights[2]
plt.plot(x_points, y_)
plt.scatter(X[:50, 0], X[:50, 1], label='0')
plt.scatter(X[50:, 0], X[50:, 1], label='1')
plt.legend()
```
```python
import matplotlib.pyplot as plt
import numpy as np
from sklearn import datasets, linear_model, model_selection


def load_data():
    # Stratified split keeps the class proportions equal in train and test sets.
    iris = datasets.load_iris()
    X_train = iris.data
    y_train = iris.target
    return model_selection.train_test_split(X_train, y_train, test_size=0.25,
                                            random_state=0, stratify=y_train)


def test_LogisticRegression(*data):
    X_train, X_test, y_train, y_test = data
    regr = linear_model.LogisticRegression()
    regr.fit(X_train, y_train)
    print('Coefficients:%s, intercept %s' % (regr.coef_, regr.intercept_))
    print('Score: %.2f' % regr.score(X_test, y_test))


def test_LogisticRegression_multinomial(*data):
    # Use a true multinomial (softmax) model instead of one-vs-rest.
    X_train, X_test, y_train, y_test = data
    regr = linear_model.LogisticRegression(multi_class='multinomial', solver='lbfgs')
    regr.fit(X_train, y_train)
    print('Coefficients:%s, intercept %s' % (regr.coef_, regr.intercept_))
    print('Score: %.2f' % regr.score(X_test, y_test))


def test_LogisticRegression_C(*data):
    # Sweep the inverse regularization strength C and plot the test-set accuracy.
    X_train, X_test, y_train, y_test = data
    Cs = np.logspace(-2, 4, num=100)
    scores = []
    for C in Cs:
        regr = linear_model.LogisticRegression(C=C)
        regr.fit(X_train, y_train)
        scores.append(regr.score(X_test, y_test))
    fig = plt.figure()
    ax = fig.add_subplot(1, 1, 1)
    ax.plot(Cs, scores)
    ax.set_xlabel(r"C")
    ax.set_ylabel(r"score")
    ax.set_xscale('log')
    ax.set_title("LogisticRegression")
    plt.show()


if __name__ == '__main__':
    X_train, X_test, y_train, y_test = load_data()
    test_LogisticRegression(X_train, X_test, y_train, y_test)
    test_LogisticRegression_multinomial(X_train, X_test, y_train, y_test)
    test_LogisticRegression_C(X_train, X_test, y_train, y_test)
```