Linear Regression in Practice with Sklearn
1. Preface
For the theory behind simple linear regression, see my earlier post at this link.
2. A Warm-up Example
Fitting a straight line.
Import LinearRegression from the sklearn.linear_model module:
from sklearn.linear_model import LinearRegression
Fit (i.e., train on) the data:
reg = LinearRegression().fit(X, y)
Check the goodness of fit (the R² score):
print(reg.score(X, y))
The learned coefficients, i.e., the weights on the columns of X; this prints [1. 2.]:
print(reg.coef_)
The intercept b of the line (the bias term); this prints 3:
print(reg.intercept_)
The complete code:
import numpy as np
from sklearn.linear_model import LinearRegression
X = np.array([[1, 1], [1, 2], [2, 2], [2, 3]])
# y = 1 * x_0 + 2 * x_1 + 3
y = np.dot(X, np.array([1, 2])) + 3
print(X)
print(y)
reg = LinearRegression().fit(X, y)
print(reg.score(X, y))
print(reg.coef_)
print(reg.intercept_)
print(reg.predict(np.array([[3, 5]])))
Output:
[[1 1]
 [1 2]
 [2 2]
 [2 3]]
[ 6  8  9 11]
1.0
[1. 2.]
3.0000000000000018
[16.]
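The prediction [16.] can be verified by hand: predict() is just the dot product of the new sample with coef_ plus intercept_. A minimal sketch (refitting the same data so the snippet is self-contained):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

X = np.array([[1, 1], [1, 2], [2, 2], [2, 3]])
y = np.dot(X, np.array([1, 2])) + 3  # y = x_0 + 2*x_1 + 3

reg = LinearRegression().fit(X, y)

# predict() is equivalent to X_new @ coef_ + intercept_
X_new = np.array([[3, 5]])
manual = X_new @ reg.coef_ + reg.intercept_
print(manual)                # 1*3 + 2*5 + 3 = 16
print(reg.predict(X_new))    # same value
```

For [3, 5] this gives 1*3 + 2*5 + 3 = 16, matching the output above.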
3. A Simple Trading-Company Example
As the advertising spend increases, the company's sales increase as well, but the relationship is not strictly linear; it only holds on average.
Let's fit this data with a linear model.
import numpy as np
from sklearn.linear_model import LinearRegression
import matplotlib.pyplot as plt
from sklearn.metrics import mean_squared_error, r2_score
data = np.array([[10, 19],[13,60],[22,71],[37,74],[45,69],[48,89],[59,146]
,[65,130],[66,153],[68,144],[68,128],[71,123],[84,127]
,[88,125],[89,154],[89,150]])
X = data[:,np.newaxis, 0]
y = data[:,1]
print(X)
print(y)
reg = LinearRegression().fit(X, y)
print(reg.score(X, y))
print(reg.coef_)
print(reg.intercept_)
y_pred = reg.predict(X)
print('Mean squared error: %.2f' % mean_squared_error(y, y_pred))
# The coefficient of determination: 1 is perfect prediction
print('Coefficient of determination: %.2f' % r2_score(y, y_pred))
plt.scatter(data[:,0], data[:,1], color='black')
print('y='+str(reg.coef_[0]) +'*x + ' + str(reg.intercept_) )
plt.plot(data[:,0], reg.coef_*data[:,0] + reg.intercept_, color='blue', linewidth=3)
plt.show()
Output:
[[10]
 [13]
 [22]
 [37]
 [45]
 [48]
 [59]
 [65]
 [66]
 [68]
 [68]
 [71]
 [84]
 [88]
 [89]
 [89]]
[ 19  60  71  74  69  89 146 130 153 144 128 123 127 125 154 150]
0.7861129941287246
[1.37939644]
30.637280329657003
Mean squared error: 333.71
Coefficient of determination: 0.79
y=1.37939643679554*x + 30.637280329657003
The fitted line is y = 1.37939643679554*x + 30.637280329657003, where x is the advertising spend and y is the sales. With this regression formula we can now predict the company's sales. The result is decent: R² = 0.79.
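With the fitted line in hand, predicting the sales for a new advertising budget is a single predict() call. A minimal sketch using the same data (the budget value 100 is just an illustration, not from the original post):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

data = np.array([[10, 19], [13, 60], [22, 71], [37, 74], [45, 69], [48, 89],
                 [59, 146], [65, 130], [66, 153], [68, 144], [68, 128],
                 [71, 123], [84, 127], [88, 125], [89, 154], [89, 150]])
X = data[:, np.newaxis, 0]  # advertising spend
y = data[:, 1]              # sales

reg = LinearRegression().fit(X, y)

# Predicted sales for a hypothetical advertising spend of 100
pred = reg.predict([[100]])
print(pred)  # about 1.3794 * 100 + 30.6373 ≈ 168.6
```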
4. An Example from the Sklearn Documentation
The diabetes dataset has 442 samples, each with 10 features.
We select one of those features and fit it with a linear model.
import matplotlib.pyplot as plt
import numpy as np
from sklearn import datasets, linear_model
from sklearn.metrics import mean_squared_error, r2_score
# Load the diabetes dataset
diabetes_X, diabetes_y = datasets.load_diabetes(return_X_y=True)
print('diabetes_X', diabetes_X.shape)
print('diabetes_y', diabetes_y.shape)
# Use only one feature
diabetes_X = diabetes_X[:, np.newaxis, 2]
print('diabetes_X', diabetes_X.shape)
# Split the data into training/testing sets
diabetes_X_train = diabetes_X[:-20]
diabetes_X_test = diabetes_X[-20:]
# Split the targets into training/testing sets
diabetes_y_train = diabetes_y[:-20]
diabetes_y_test = diabetes_y[-20:]
# Create linear regression object
regr = linear_model.LinearRegression()
# Train the model using the training sets
regr.fit(diabetes_X_train, diabetes_y_train)
# Make predictions using the testing set
diabetes_y_pred = regr.predict(diabetes_X_test)
# The coefficients
print('Coefficients: \n', regr.coef_)
# The mean squared error
print('Mean squared error: %.2f' % mean_squared_error(diabetes_y_test, diabetes_y_pred))
# The coefficient of determination: 1 is perfect prediction
print('Coefficient of determination: %.2f'% r2_score(diabetes_y_test, diabetes_y_pred))
# Plot outputs
plt.scatter(diabetes_X_test, diabetes_y_test, color='black')
plt.plot(diabetes_X_test, diabetes_y_pred, color='blue', linewidth=3)
plt.xticks(())
plt.yticks(())
plt.show()
Output:
diabetes_X (442, 10)
diabetes_y (442,)
diabetes_X (442, 1)
Coefficients:
 [938.23786125]
Mean squared error: 2548.07
Coefficient of determination: 0.47
The R² on the held-out test set is 0.47, which is not great, but that's fine for learning purposes; after all, we used only one of the 10 features.
References
[1] https://scikit-learn.org/stable/auto_examples/linear_model/plot_ols.html#sphx-glr-auto-examples-linear-model-plot-ols-py
[2] https://www.cnblogs.com/wuliytTaotao/p/10837533.html