Machine Learning in Practice, Part 2: Ridge Regression

Following the ordinary linear regression from the first tutorial, let's build a plain linear regression model on the load_boston dataset and see what problems come up.
First, let's take a look at the load_boston dataset:

from sklearn.datasets import load_boston

# Load the Boston housing dataset and inspect its shape
boston = load_boston()
print(boston.data.shape)

You will see something similar to:

(506, 13)

This means the dataset has 506 samples with 13 features each.
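To see what those 13 features are, the Bunch object returned by load_boston also exposes their names:

print(boston.feature_names)

which prints the 13 column names (CRIM, ZN, INDUS, CHAS, NOX, RM, AGE, DIS, RAD, TAX, PTRATIO, B, LSTAT).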

from sklearn import linear_model
from sklearn.datasets import load_boston
from sklearn.metrics import mean_squared_error, r2_score

boston = load_boston()
X = boston.data
y = boston.target
num_training = int(0.7 * len(X))

# Split the dataset sequentially: first 70% for training, the rest for testing
X_train = X[:num_training]
y_train = y[:num_training]
X_test = X[num_training:]
y_test = y[num_training:]
reg = linear_model.LinearRegression()

# Train the model
reg.fit(X_train, y_train)

# Predict on the test set
y_pred = reg.predict(X_test)

# Print the model parameters
print("Coefficients", reg.coef_)
print(reg.coef_.shape)
print("Intercept", reg.intercept_)

# Compute the mean squared error
print("MSE on test set", mean_squared_error(y_test, y_pred))

# Compute the R^2 score
print("R^2 score", r2_score(y_test, y_pred))

You will see something similar to:

Coefficients [ 1.29693856  0.01469497  0.04050457  0.79060732 -9.12933243  9.24839787
 -0.0451214  -0.91395374  0.14079658 -0.01477291 -0.63369567  0.01577172
 -0.09514128]
(13,)
Intercept -13.6721465522
MSE on test set 545.445002115
R^2 score -7.2211853282

This linear regression model has 13 coefficients plus one intercept.
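In other words, a prediction is just the dot product of a sample's features with the coefficients, plus the intercept. You can verify this against reg.predict:

import numpy as np

# Reproduce the model's predictions by hand: X_test @ coef_ + intercept_
manual_pred = X_test @ reg.coef_ + reg.intercept_
print(np.allclose(manual_pred, y_pred))  # True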
Ridge regression addresses some of the problems of ordinary least squares (such as overfitting) by imposing a penalty on the size of the coefficients, i.e. regularization.
When a dataset has many features but relatively few samples, ordinary linear regression easily overfits; ridge regression reduces that risk by adding an L2 penalty to the loss.
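Concretely, scikit-learn's Ridge minimizes ||Xw - y||^2 + alpha * ||w||^2, which has the closed-form solution w = (X^T X + alpha * I)^(-1) X^T y. A minimal NumPy sketch of that formula (ignoring the intercept, which scikit-learn handles separately by centering the data):

import numpy as np

def ridge_closed_form(X, y, alpha):
    # Solve (X^T X + alpha * I) w = X^T y rather than forming the inverse explicitly
    n_features = X.shape[1]
    A = X.T @ X + alpha * np.eye(n_features)
    return np.linalg.solve(A, X.T @ y)

The alpha * I term keeps the matrix well conditioned and shrinks the coefficients toward zero; larger alpha means stronger shrinkage.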

from sklearn import linear_model
from sklearn.datasets import load_boston
from sklearn.metrics import mean_squared_error, r2_score

boston = load_boston()
X = boston.data
y = boston.target
num_training = int(0.7 * len(X))

# Split the dataset the same way as before
X_train = X[:num_training]
y_train = y[:num_training]
X_test = X[num_training:]
y_test = y[num_training:]
reg = linear_model.Ridge(alpha=0.5)

# Train the model
reg.fit(X_train, y_train)

# Predict on the test set
y_pred = reg.predict(X_test)

# Print the model parameters
print("Coefficients", reg.coef_)
print(reg.coef_.shape)
print("Intercept", reg.intercept_)

# Compute the mean squared error
print("MSE on test set", mean_squared_error(y_test, y_pred))

# Compute the R^2 score
print("R^2 score", r2_score(y_test, y_pred))

You will see something similar to:

Coefficients [ 1.06913232  0.01534766  0.03083921  0.81470562 -5.44619698  9.22075685
 -0.04681829 -0.86607139  0.13700694 -0.01498462 -0.60960326  0.01610884
 -0.10287555]
(13,)
Intercept -15.7171489971
MSE on test set 398.766928585
R^2 score -5.01038933338

As you can see, ridge regression does reduce the risk of overfitting: on the same data split, its test error (398.77) is lower than ordinary linear regression's (545.45).
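How much the penalty helps depends on alpha. Before reaching for anything fancier, you can sweep a few values by hand; a quick sketch reusing the X_train/X_test split from above:

for alpha in [0.01, 0.1, 0.5, 1.0, 10.0]:
    reg = linear_model.Ridge(alpha=alpha)
    reg.fit(X_train, y_train)
    # Report the test-set error for each penalty strength
    print(alpha, mean_squared_error(y_test, reg.predict(X_test)))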
Using cross-validation to select alpha

from sklearn import linear_model
from sklearn.datasets import load_boston
from sklearn.metrics import mean_squared_error

boston = load_boston()
X = boston.data
y = boston.target
num_training = int(0.7 * len(X))

# Same sequential split as before
X_train = X[:num_training]
y_train = y[:num_training]
X_test = X[num_training:]
y_test = y[num_training:]

# RidgeCV tries each candidate alpha and keeps the one with the best CV score
reg = linear_model.RidgeCV(alphas=[0.1, 1.0, 10.0])
reg.fit(X_train, y_train)
print("Best alpha:", reg.alpha_)
y_pred = reg.predict(X_test)
print("MSE on test set", mean_squared_error(y_test, y_pred))

You will see something similar to:

Best alpha: 0.1
MSE on test set 499.704726051
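By default RidgeCV scores each candidate with an efficient leave-one-out cross-validation. A sketch of a wider search, assuming the same X_train/y_train as above: np.logspace generates candidates between 10^-3 and 10^3, and cv=5 switches to 5-fold cross-validation:

import numpy as np
from sklearn import linear_model

# Search 13 alphas spaced logarithmically from 1e-3 to 1e3, scored by 5-fold CV
alphas = np.logspace(-3, 3, 13)
reg = linear_model.RidgeCV(alphas=alphas, cv=5)
reg.fit(X_train, y_train)
print("Best alpha:", reg.alpha_)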

References
http://scikit-learn.org/stable/supervised_learning.html#supervised-learning
http://scikit-learn.org/stable/modules/classes.html#module-sklearn.datasets
