Following the ordinary linear regression from the first tutorial, let's build an ordinary linear regression model on the load_boston dataset and see what problems come up.
First, let's take a look at the load_boston dataset:
from sklearn.datasets import load_boston
boston = load_boston()
print(boston.data.shape)
You will see something similar to:
(506, 13)
This means the dataset has 506 samples, each with 13 features.
from sklearn.datasets import load_boston
boston = load_boston()
import numpy as np
from sklearn import linear_model
from sklearn.metrics import mean_squared_error,r2_score
X = boston.data
y = boston.target
num_training = int(0.7*len(X))
# Split the dataset
X_train = X[:num_training]
y_train = y[:num_training]
X_test = X[num_training:]
y_test = y[num_training:]
reg = linear_model.LinearRegression()
# Train the model
reg.fit(X_train,y_train)
# Predict on the test set
y_pred = reg.predict(X_test)
# Print the model parameters
print("Coefficients",reg.coef_)
print(reg.coef_.shape)
print("Intercept",reg.intercept_)
# Compute the mean squared error
print("MSE on the test set",mean_squared_error(y_test,y_pred))
# Compute the R^2 score
print("R^2 score",r2_score(y_test,y_pred))
You will see something similar to:
Coefficients [ 1.29693856 0.01469497 0.04050457 0.79060732 -9.12933243 9.24839787
-0.0451214 -0.91395374 0.14079658 -0.01477291 -0.63369567 0.01577172
-0.09514128]
(13,)
Intercept -13.6721465522
MSE on the test set 545.445002115
R^2 score -7.2211853282
This linear regression model has 13 coefficients plus one intercept term.
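To make concrete what those 13 coefficients and the intercept do, here is a minimal numpy-only sketch (using synthetic data of the same shape as the Boston dataset, not the real data): a fitted linear model predicts simply by a dot product with the coefficients plus the intercept, exactly what `reg.predict` computes.

```python
import numpy as np

# Synthetic data with the same shape as load_boston: 506 samples, 13 features.
rng = np.random.default_rng(0)
X = rng.normal(size=(506, 13))
true_w = rng.normal(size=13)
y = X @ true_w + 3.0                    # known intercept of 3.0, no noise

# Fit by least squares on a design matrix augmented with a column of ones.
A = np.hstack([X, np.ones((len(X), 1))])
params, *_ = np.linalg.lstsq(A, y, rcond=None)
coef_, intercept_ = params[:-1], params[-1]

# Predicting is just X @ coef_ + intercept_.
y_pred = X @ coef_ + intercept_
print(coef_.shape)                      # (13,)
print(np.allclose(y_pred, y))           # True: noiseless data, exact fit
```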
Ridge regression addresses some of the problems of ordinary least squares (i.e. overfitting) by imposing a penalty on the size of the coefficients (regularization).
When the data has many features but relatively few samples, ordinary linear regression is prone to overfitting. To reduce this risk, ridge regression adds an L2 regularization term to the objective.
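The effect of the L2 penalty can be sketched in a few lines of numpy (synthetic data, intercept omitted for brevity): ridge regression minimizes ||Xw - y||² + α||w||², whose closed-form solution is w = (XᵀX + αI)⁻¹Xᵀy, and for α > 0 it shrinks the coefficients toward zero.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(30, 13))           # few samples relative to features
y = rng.normal(size=30)

def ridge_coef(X, y, alpha):
    # Closed-form ridge solution: (X^T X + alpha * I)^(-1) X^T y
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(n_features), X.T @ y)

w_ols = ridge_coef(X, y, 0.0)           # alpha = 0 reduces to least squares
w_ridge = ridge_coef(X, y, 0.5)

# The L2 penalty shrinks the coefficient vector.
print(np.linalg.norm(w_ridge) < np.linalg.norm(w_ols))   # True
```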
from sklearn.datasets import load_boston
boston = load_boston()
import numpy as np
from sklearn import linear_model
from sklearn.metrics import mean_squared_error,r2_score
X = boston.data
y = boston.target
num_training = int(0.7*len(X))
# Split the dataset
X_train = X[:num_training]
y_train = y[:num_training]
X_test = X[num_training:]
y_test = y[num_training:]
reg = linear_model.Ridge(alpha = .5)
# Train the model
reg.fit(X_train,y_train)
# Predict on the test set
y_pred = reg.predict(X_test)
# Print the model parameters
print("Coefficients",reg.coef_)
print(reg.coef_.shape)
print("Intercept",reg.intercept_)
# Compute the mean squared error
print("MSE on the test set",mean_squared_error(y_test,y_pred))
# Compute the R^2 score
print("R^2 score",r2_score(y_test,y_pred))
You will see something similar to:
Coefficients [ 1.06913232 0.01534766 0.03083921 0.81470562 -5.44619698 9.22075685
-0.04681829 -0.86607139 0.13700694 -0.01498462 -0.60960326 0.01610884
-0.10287555]
(13,)
Intercept -15.7171489971
MSE on the test set 398.766928585
R^2 score -5.01038933338
As we can see, ridge regression does reduce the risk of overfitting: on the same dataset, its test error is lower than that of ordinary linear regression.
Using cross-validation to select the regularization parameter:
from sklearn import linear_model
from sklearn.datasets import load_boston
from sklearn.metrics import mean_squared_error
boston = load_boston()
X = boston.data
y = boston.target
num_training = int(0.7*len(X))
X_train = X[:num_training]
y_train = y[:num_training]
X_test = X[num_training:]
y_test = y[num_training:]
reg = linear_model.RidgeCV(alphas=[0.1, 1.0, 10.0])
reg.fit(X_train,y_train)
print("Best alpha:",reg.alpha_)
y_pred = reg.predict(X_test)
print("MSE on the test set",mean_squared_error(y_test,y_pred))
You will see something similar to:
Best alpha: 0.1
MSE on the test set 499.704726051
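What RidgeCV does internally can be sketched with plain numpy (synthetic data, closed-form ridge without an intercept, names like `cv_mse` are our own, not scikit-learn's): for each candidate alpha, score it by average validation error across folds, and keep the alpha with the lowest cross-validated MSE.

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(100, 13))
y = X @ rng.normal(size=13) + rng.normal(scale=5.0, size=100)

def ridge_fit(X, y, alpha):
    # Closed-form ridge solution (intercept omitted for brevity).
    return np.linalg.solve(X.T @ X + alpha * np.eye(X.shape[1]), X.T @ y)

def cv_mse(X, y, alpha, k=5):
    # Average validation MSE over k folds.
    folds = np.array_split(np.arange(len(X)), k)
    errs = []
    for val in folds:
        train = np.setdiff1d(np.arange(len(X)), val)
        w = ridge_fit(X[train], y[train], alpha)
        errs.append(np.mean((X[val] @ w - y[val]) ** 2))
    return np.mean(errs)

# Same candidate grid as the RidgeCV example above.
alphas = [0.1, 1.0, 10.0]
best_alpha = min(alphas, key=lambda a: cv_mse(X, y, a))
print("best alpha:", best_alpha)
```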
Reference
http://scikit-learn.org/stable/supervised_learning.html#supervised-learning
http://scikit-learn.org/stable/modules/classes.html#module-sklearn.datasets