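Python Machine Learning / Data Mining Project: Boston Housing Price Prediction with Regression Analysis
- The data comes from a study of Boston house prices published in a US economics journal (the Boston House Price dataset).
- In this project you train and test a model on housing data for the suburbs of Boston, Massachusetts, and evaluate its performance and predictive power. A well-trained model can then make specific predictions about a home, in particular its value. Predictions of this kind are valuable in the day-to-day work of people such as real estate agents.
Dataset description
- Each row describes a town or suburb in the Boston area. The variables are as follows.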
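CRIM: per capita crime rate by town
ZN: proportion of residential land zoned for lots over 25,000 sq. ft.
INDUS: proportion of non-retail business acres per town
CHAS: Charles River dummy variable (1 if the tract bounds the river, 0 otherwise)
NOX: nitric oxides concentration (parts per 10 million)
RM: average number of rooms per dwelling
AGE: proportion of owner-occupied units built before 1940
DIS: weighted distance to five Boston employment centres
RAD: index of accessibility to radial highways
TAX: full-value property tax rate per $10,000
PTRATIO: pupil-teacher ratio by town
B: 1000(Bk - 0.63)^2, where Bk is the proportion of Black residents by town
LSTAT: percentage of the population of lower status
MEDV: median value of owner-occupied homes, in $1000s (a median, not a mean; this is the prediction target)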
- Original description:
print(boston_data['DESCR'])
Boston House Prices dataset
===========================
Notes
------
Data Set Characteristics:
    :Number of Instances: 506
    :Number of Attributes: 13 numeric/categorical predictive
    :Median Value (attribute 14) is usually the target
    :Attribute Information (in order):
        - CRIM     per capita crime rate by town
        - ZN       proportion of residential land zoned for lots over 25,000 sq.ft.
        - INDUS    proportion of non-retail business acres per town
        - CHAS     Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
        - NOX      nitric oxides concentration (parts per 10 million)
        - RM       average number of rooms per dwelling
        - AGE      proportion of owner-occupied units built prior to 1940
        - DIS      weighted distances to five Boston employment centres
        - RAD      index of accessibility to radial highways
        - TAX      full-value property-tax rate per $10,000
        - PTRATIO  pupil-teacher ratio by town
        - B        1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town
        - LSTAT    % lower status of the population
        - MEDV     Median value of owner-occupied homes in $1000's
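Importing libraries
# Note: load_boston was removed in scikit-learn 1.2, so this import needs an older version.
from sklearn.datasets import load_boston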
import pandas as pd
from pandas import Series, DataFrame
import numpy as np
from matplotlib import pyplot as plt
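Loading the dataset
boston_data = load_boston()
x_data = boston_data.data             # feature matrix, one row per sample
y_data = boston_data.target           # MEDV, the target
names = boston_data.feature_names
FeaturesNums = 13                     # number of feature columns
DataNums = len(x_data)                # number of samples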
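Visual analysis
- Visualize each feature of the dataset
- Analyze correlations, then process the data
- Visualize again after processing
- Feed what the plots show back into the data processing
- Once the data looks satisfactory, try building models
The plots below were produced from the feature data after filtering.
Feature-target relationships
- Examine how each feature relates to the target
- Assess how much each feature contributes to the target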
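# Scatter plot of each feature against the target.
# Run this after the feature filtering and the train/test split further down:
# the 2x3 grid holds the 5 remaining features, and x_train/y_train must exist.
plt.figure(figsize=(20, 12))
for i in range(FeaturesNums):
    plt.subplot(2, 3, i + 1)
    plt.scatter(x_train[:, i], y_train, s=20, color='blueviolet')
    plt.title(names[i])
plt.show()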
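One way to put a number on how strongly each feature relates to the target, alongside the scatter plots, is the Pearson correlation coefficient. The sketch below is an illustration rather than part of the original workflow: the helper `feature_target_correlations` and the synthetic arrays are hypothetical stand-ins, not the Boston data.

```python
import numpy as np

def feature_target_correlations(x, y, names):
    """Return {feature name: Pearson r with y} for each column of x."""
    y = np.ravel(y)
    return {name: float(np.corrcoef(x[:, i], y)[0, 1])
            for i, name in enumerate(names)}

# Synthetic stand-in data (not the Boston data): the target rises with the
# first column, falls with the second, and ignores the third.
rng = np.random.default_rng(0)
x_demo = rng.normal(size=(200, 3))
y_demo = 3.0 * x_demo[:, 0] - 2.0 * x_demo[:, 1] + rng.normal(scale=0.1, size=200)

corr = feature_target_correlations(x_demo, y_demo, ["up", "down", "noise"])

# List the features from strongest to weakest absolute correlation.
for name, r in sorted(corr.items(), key=lambda kv: -abs(kv[1])):
    print(f"{name:>6}: r = {r:+.3f}")
```

Pearson correlation only captures linear association, so a feature with a curved relationship to the target can still score low here while looking informative in its scatter plot.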
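Feature distributions
- A feature's distribution helps judge how useful the data is
- It can also reveal anomalous values
# Histogram of each feature (same 2x3 grid, so run after feature filtering)
plt.figure(figsize=(20, 10))
for i in range(FeaturesNums):
    plt.subplot(2, 3, i + 1)
    plt.hist(x_data[:, i], color='lightseagreen', width=2)
    plt.xlabel(names[i])
    plt.title(names[i])
plt.show()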
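Data processing
- Import the preprocessing module from sklearn
- Several processing methods are available
from sklearn import preprocessing
Removing outliers
# Collect the indices of samples whose target is >= 49 or <= 1, then drop those rows
DelList0 = []
for i in range(DataNums):
    if y_data[i] >= 49 or y_data[i] <= 1:
        DelList0.append(i)
DataNums -= len(DelList0)
x_data = np.delete(x_data, DelList0, axis=0)
y_data = np.delete(y_data, DelList0, axis=0)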
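The same row filter can also be written with a NumPy boolean mask instead of an index list. This is a minimal sketch on toy arrays (`x_demo` and `y_demo` are made-up stand-ins, not the Boston data), using the same thresholds as the loop above.

```python
import numpy as np

# Toy stand-in arrays (not the Boston data): 6 samples, 2 features.
x_demo = np.arange(12, dtype=float).reshape(6, 2)
y_demo = np.array([24.0, 50.0, 0.5, 33.1, 49.0, 18.9])

# Keep only rows whose target lies strictly between 1 and 49, which is the
# complement of the `>= 49 or <= 1` removal rule.
keep = (y_demo > 1) & (y_demo < 49)
x_clean, y_clean = x_demo[keep], y_demo[keep]

print(int(keep.sum()), "of", len(y_demo), "rows kept")
```

Applying one mask to both arrays keeps features and targets aligned, and the sample count can be read from `len(y_clean)` instead of being tracked by hand.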
Removing uninformative features
# Collect the column indices of the features to drop, then delete those columns
DelList1 = []
for i in range(FeaturesNums):
    if (names[i] == 'ZN' or
            names[i] == 'INDUS' or
            names[i] == 'RAD' or
            names[i] == 'TAX' or
            names[i] == 'CHAS' or
            names[i] == 'NOX' or
            names[i] == 'B' or
            names[i] == 'PTRATIO'):
        DelList1.append(i)
x_data = np.delete(x_data, DelList1, axis=1)
names = np.delete(names, DelList1)
FeaturesNums -= len(DelList1)
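As an alternative to the chained comparisons, the column selection can be expressed as a boolean mask built with `np.isin`. The sketch below uses a toy matrix with the 13 original column names; `x_demo`, `names_demo` and `drop` are illustrative names, not part of the original code.

```python
import numpy as np

# Toy stand-in: 4 samples carrying the 13 original column names.
names_demo = np.array(["CRIM", "ZN", "INDUS", "CHAS", "NOX", "RM", "AGE",
                       "DIS", "RAD", "TAX", "PTRATIO", "B", "LSTAT"])
x_demo = np.arange(4 * 13, dtype=float).reshape(4, 13)

# Names of the columns to remove, matching the list used above.
drop = ["ZN", "INDUS", "RAD", "TAX", "CHAS", "NOX", "B", "PTRATIO"]

# True for every column that should be kept.
keep = ~np.isin(names_demo, drop)
x_kept, names_kept = x_demo[:, keep], names_demo[keep]

print(names_kept.tolist())
```

Five columns remain (CRIM, RM, AGE, DIS, LSTAT), which is why a 2x3 subplot grid is enough for the plots of the filtered data.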
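Splitting the data
# Split first, so that the scalers can be fitted on the training set only
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x_data, y_data, test_size=0.3)
Normalization
from sklearn.preprocessing import MinMaxScaler
# Use separate scalers for features and target, fitted on the training data;
# the test data is transformed with the training-set minimum and maximum.
x_scaler = MinMaxScaler()
y_scaler = MinMaxScaler()
x_train = x_scaler.fit_transform(x_train)
x_test = x_scaler.transform(x_test)
# MinMaxScaler needs a 2-D input; ravel() returns the target to 1-D afterwards
y_train = y_scaler.fit_transform(y_train.reshape(-1, 1)).ravel()
y_test = y_scaler.transform(y_test.reshape(-1, 1)).ravel()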
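A common convention when scaling is to fit the scaler on the training split only and reuse it on the test split, so that no information from the test set leaks into preprocessing. The sketch below illustrates that convention on synthetic stand-in data. The variable names and the fixed `random_state` are assumptions made for reproducibility, not part of the original code.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in data (not the Boston data).
rng = np.random.default_rng(0)
x_demo = rng.uniform(0, 100, size=(50, 2))
y_demo = rng.uniform(5, 50, size=50)

x_tr, x_te, y_tr, y_te = train_test_split(
    x_demo, y_demo, test_size=0.3, random_state=0)

# Fit on the training split only.
x_scaler = MinMaxScaler().fit(x_tr)
y_scaler = MinMaxScaler().fit(y_tr.reshape(-1, 1))

x_tr_s = x_scaler.transform(x_tr)   # exactly within [0, 1]
x_te_s = x_scaler.transform(x_te)   # may fall slightly outside [0, 1]
y_tr_s = y_scaler.transform(y_tr.reshape(-1, 1)).ravel()

# inverse_transform maps scaled values back to the original units,
# which is useful for reporting predictions on the original price scale.
y_back = y_scaler.inverse_transform(y_tr_s.reshape(-1, 1)).ravel()
print(np.allclose(y_back, y_tr))
```

Keeping a dedicated scaler for the target makes it possible to convert model predictions back to the original units with `y_scaler.inverse_transform`.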
Training models
- Try several models
- Pick the best one
Linear regression (LinearRegression)
- Train a linear regression model
- Check the MSE and R2 scores
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
model = LinearRegression()
model.fit(x_train, y_train)
y_pred = model.predict(x_test)
print("MSE =", mean_squared_error(y_test, y_pred), end='\n\n')
print("R2 =", r2_score(y_test, y_pred), end='\n\n')
Output from the author's run (the split is random, so the values vary between runs):
MSE = 0.013304697805737791
R2 = 0.44625845284900767
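Because the targets were min-max scaled before training, the MSE above is expressed in squared scaled units rather than squared thousands of dollars. To make the two metrics concrete, here is a small sketch that computes them from their definitions on made-up numbers. The helpers `mse` and `r2` are illustrative and are not the sklearn functions.

```python
import numpy as np

def mse(y_true, y_pred):
    """Mean squared error: the average of the squared residuals."""
    y_true, y_pred = np.ravel(y_true), np.ravel(y_pred)
    return float(np.mean((y_true - y_pred) ** 2))

def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true, y_pred = np.ravel(y_true), np.ravel(y_pred)
    ss_res = np.sum((y_true - y_pred) ** 2)          # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares
    return float(1.0 - ss_res / ss_tot)

# Made-up values on a [0, 1] scale, only to exercise the formulas.
y_true_demo = np.array([0.2, 0.4, 0.6, 0.8])
y_pred_demo = np.array([0.3, 0.4, 0.5, 0.8])

print("MSE =", mse(y_true_demo, y_pred_demo))
print("R2 =", r2(y_true_demo, y_pred_demo))
```

An R2 of 1 means perfect predictions, 0 means no better than always predicting the mean of the true values, and negative values mean worse than that baseline.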
- Visualize the results
# Plot predictions against actual values; points on the diagonal are perfect predictions
fig, ax = plt.subplots()
ax.scatter(y_test, y_pred, c="blue", edgecolors="aqua", s=13)
ax.plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], lw=2, color='navy')
ax.set_xlabel('Reality')
ax.set_ylabel('Prediction')
plt.show()
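SVR with a linear kernel
- Train an SVR model with the linear kernel
- Check the score
from sklearn.svm import SVR
from sklearn.model_selection import cross_val_predict, cross_val_score
linear_svr = SVR(kernel='linear')
# linear_svr.fit(x_train, y_train)
# linear_pred = linear_svr.predict(x_test)
# SVR expects a 1-D target, so flatten y_train in case it is a column vector
y_train_1d = y_train.ravel()
# 5-fold cross-validation on the training set; the default score for SVR is R2
linear_svr_pred = cross_val_predict(linear_svr, x_train, y_train_1d, cv=5)
linear_svr_score = cross_val_score(linear_svr, x_train, y_train_1d, cv=5)
linear_svr_meanscore = linear_svr_score.mean()
print("Linear_SVR_Score =", linear_svr_meanscore, end='\n')
Output from the author's run:
Linear_SVR_Score = 0.6497361775614359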
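To show what the cross-validated score measures, the sketch below reproduces `cross_val_score` with an explicit `KFold` loop: fit on four folds, score on the fifth, and repeat. It runs on synthetic stand-in data, and the data, coefficients and variable names are assumptions for illustration only.

```python
import numpy as np
from sklearn.model_selection import KFold, cross_val_score
from sklearn.svm import SVR

# Synthetic stand-in data (not the Boston data): a noisy linear target.
rng = np.random.default_rng(0)
x_demo = rng.uniform(0, 1, size=(100, 3))
y_demo = x_demo @ np.array([0.5, -0.3, 0.2]) + rng.normal(scale=0.02, size=100)

cv = KFold(n_splits=5)  # 5 consecutive folds, no shuffling

# Manual loop: train on 4 folds, compute R2 on the held-out fold.
manual_scores = []
for train_idx, val_idx in cv.split(x_demo):
    model = SVR(kernel="linear")
    model.fit(x_demo[train_idx], y_demo[train_idx])
    manual_scores.append(model.score(x_demo[val_idx], y_demo[val_idx]))

# The library helper performs the same procedure in one call.
library_scores = cross_val_score(SVR(kernel="linear"), x_demo, y_demo, cv=cv)

print("manual mean :", np.mean(manual_scores))
print("library mean:", library_scores.mean())
```

Each score comes from data the model did not see during that fold's fit, but all of it is drawn from the training set, so a cross-validated score is not the same quantity as a score on a separate test set.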
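SVR with a poly kernel
- Train an SVR model with the poly kernel
- Check the score
from sklearn.svm import SVR
from sklearn.model_selection import cross_val_predict, cross_val_score
poly_svr = SVR(kernel='poly')
# SVR expects a 1-D target, so flatten y_train in case it is a column vector
y_train_1d = y_train.ravel()
poly_svr.fit(x_train, y_train_1d)
poly_pred = poly_svr.predict(x_test)
# 5-fold cross-validation on the training set; the default score for SVR is R2
poly_svr_pred = cross_val_predict(poly_svr, x_train, y_train_1d, cv=5)
poly_svr_score = cross_val_score(poly_svr, x_train, y_train_1d, cv=5)
poly_svr_meanscore = poly_svr_score.mean()
print('\n', "Poly_SVR_Score =", poly_svr_meanscore, end='\n')
Output from the author's run:
Poly_SVR_Score = 0.5383303049258509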
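Summary
- Data processing and visual analysis feed back into each other and help select the useful features
- Try several models and use the best-scoring one for the final training
- In the end, SVR with a linear kernel scored best: a mean 5-fold cross-validation R2 of about 0.65, against about 0.54 for the poly kernel. The linear regression R2 of about 0.45 was measured on the held-out test set rather than by cross-validation, so it is not directly comparable.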
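One way to compare candidate models on equal footing is to score all of them with the same cross-validation procedure. The sketch below does this for the three models used here, on synthetic stand-in data. The data and the `candidates` dictionary are illustrative assumptions, so the resulting ranking says nothing about the Boston results.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVR

# Synthetic stand-in data (not the Boston data): a noisy linear target.
rng = np.random.default_rng(0)
x_demo = rng.uniform(0, 1, size=(120, 3))
y_demo = x_demo @ np.array([0.6, -0.4, 0.3]) + rng.normal(scale=0.05, size=120)

candidates = {
    "LinearRegression": LinearRegression(),
    "SVR(linear)": SVR(kernel="linear"),
    "SVR(poly)": SVR(kernel="poly"),
}

# Mean 5-fold R2 for every candidate, computed with the same folds.
mean_r2 = {name: float(cross_val_score(est, x_demo, y_demo, cv=5).mean())
           for name, est in candidates.items()}

best = max(mean_r2, key=mean_r2.get)
for name, score in mean_r2.items():
    print(f"{name:>16}: mean R2 = {score:.3f}")
print("best:", best)
```

After choosing a model this way, it is usual to refit it on the full training set and report its score on the held-out test set once.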