Python Machine Learning / Data Mining Project in Practice: Boston House Price Prediction with Regression Analysis

Original

頔潇

2020-06-13 12:43

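This dataset comes from a study of Boston house prices (Boston HousePrice) published in an American economics journal.
In this project, you will use housing data from the suburbs of Boston, Massachusetts to train and test a model, and then evaluate its performance and predictive power. A model trained well on this data can be used to make specific predictions about a house, in particular its value. For people such as real estate agents, a predictive model of this kind can prove very valuable in day-to-day work.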

Dataset description

Each row of the dataset describes housing conditions in a suburb or town around Boston. The variables are as follows.

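CRIM: per capita crime rate by town
ZN: proportion of residential land zoned for lots over 25,000 sq. ft.
INDUS: proportion of non-retail business acres per town
CHAS: Charles River dummy variable (1 if the tract bounds the river, 0 otherwise)
NOX: nitric oxides concentration (parts per 10 million)
RM: average number of rooms per dwelling
AGE: proportion of owner-occupied units built before 1940
DIS: weighted distance to five Boston employment centres
RAD: index of accessibility to radial highways
TAX: full-value property tax rate per $10,000
PTRATIO: pupil-teacher ratio by town
B: 1000(Bk - 0.63)^2, where Bk is the proportion of Black residents by town
LSTAT: percentage of the population of lower status
MEDV: median value of owner-occupied homes, in $1000s (the prediction target)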

Original description: print(boston_data['DESCR'])

Boston House Prices dataset
===========================

Notes
------
Data Set Characteristics:

:Number of Instances: 506 

:Number of Attributes: 13 numeric/categorical predictive

:Median Value (attribute 14) is usually the target

:Attribute Information (in order):
    - CRIM     per capita crime rate by town
    - ZN       proportion of residential land zoned for lots over 25,000 sq.ft.
    - INDUS    proportion of non-retail business acres per town
    - CHAS     Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
    - NOX      nitric oxides concentration (parts per 10 million)
    - RM       average number of rooms per dwelling
    - AGE      proportion of owner-occupied units built prior to 1940
    - DIS      weighted distances to five Boston employment centres
    - RAD      index of accessibility to radial highways
    - TAX      full-value property-tax rate per $10,000
    - PTRATIO  pupil-teacher ratio by town
    - B        1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town
    - LSTAT    % lower status of the population
    - MEDV     Median value of owner-occupied homes in $1000's

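Importing libraries

# Note: load_boston was removed in scikit-learn 1.2, so this code needs an older version
from sklearn.datasets import load_boston
import pandas as pd
from pandas import Series, DataFrame
import numpy as np
from matplotlib import pyplot as plt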

Loading the dataset

boston_data=load_boston()
x_data = boston_data.data
y_data = boston_data.target

names=boston_data.feature_names
FeaturesNums = 13
DataNums = len(x_data)

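Visual analysis

Visualize each feature of the dataset
Analyze the correlations, then process the data
Visualize again after processing
Feed what the plots show back into the data processing
Once the data look satisfactory, try building a model

The plots below were produced from the feature data after filtering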

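Relationship between features and the label

Observe how each feature relates to the label
Analyze how much each feature contributes to the label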

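# Scatter plot of each feature against the target.
# Run this after the filtering, splitting and normalization steps below,
# which define x_train and y_train and leave at most six features.
plt.figure(figsize=(20,12))
for i in range(FeaturesNums):
    plt.subplot(2, 3, i+1)
    plt.scatter(x_train[:,i],y_train,s=20,color='blueviolet')
    plt.title(names[i])
plt.show()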
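
The scatter plots give a visual impression of how much each feature contributes to the label. As a rough numeric complement, the sketch below computes the Pearson correlation between each feature column and the target. It uses small synthetic arrays rather than the Boston data, and the helper name `feature_target_corr` is hypothetical. Correlation only captures linear relationships, so treat it as a screening aid rather than a replacement for the plots.

```python
import numpy as np

def feature_target_corr(x, y):
    """Return the Pearson correlation of each column of x with y."""
    return np.array([np.corrcoef(x[:, j], y)[0, 1] for j in range(x.shape[1])])

# Synthetic stand-in data (assumption): column 0 drives the target, column 1 is noise
rng = np.random.default_rng(0)
x_demo = rng.normal(size=(200, 2))
y_demo = 3.0 * x_demo[:, 0] + rng.normal(scale=0.1, size=200)

corr = feature_target_corr(x_demo, y_demo)
for j, c in enumerate(corr):
    print(f"feature {j}: r = {c:+.3f}")
```

A feature whose correlation is close to zero and whose scatter plot shows no structure is a candidate for removal.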

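Feature distributions

The distribution of a feature gives an idea of how useful it is
It can also reveal anomalous data

# Histogram of each feature (run after the feature filtering below, so at most six remain)
plt.figure(figsize=(20,10))
for i in range(FeaturesNums):
    plt.subplot(2, 3, i+1)
    plt.hist(x_data[:,i],color='lightseagreen')
    plt.xlabel(names[i])
    plt.title(names[i])
plt.show()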

Data processing

Import the preprocessing module from sklearn
It offers several processing methods

from sklearn import preprocessing

Removing outliers

# Collect the indices of samples whose target is >= 49 or <= 1, then drop those rows
DelList0=[]
for i in range(DataNums):
    if (y_data[i] >= 49 or y_data[i] <= 1):
        DelList0.append(i)
DataNums -= len(DelList0)
x_data = np.delete(x_data,DelList0,axis=0)
y_data = np.delete(y_data,DelList0,axis=0)
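
The same filter can also be written with a boolean mask, which avoids building an index list. This is a minimal sketch on made-up arrays; the values and the names `x_demo` and `y_demo` are illustrative only.

```python
import numpy as np

# Made-up data: 6 samples, 2 features; targets include values at both extremes
x_demo = np.arange(12, dtype=float).reshape(6, 2)
y_demo = np.array([0.5, 21.0, 50.0, 33.4, 49.0, 1.0])

# Keep only rows whose target lies strictly between 1 and 49
keep = (y_demo > 1) & (y_demo < 49)
x_clean = x_demo[keep]
y_clean = y_demo[keep]

print(y_clean)        # remaining targets
print(x_clean.shape)  # the matching rows of x are removed as well
```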

Removing unhelpful features

# Features dropped after the visual analysis
drop_names = ['ZN', 'INDUS', 'RAD', 'TAX', 'CHAS', 'NOX', 'B', 'PTRATIO']
DelList1 = [i for i in range(FeaturesNums) if names[i] in drop_names]
x_data = np.delete(x_data, DelList1, axis=1)
names = np.delete(names, DelList1)
FeaturesNums -= len(DelList1)

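Splitting the data

from sklearn.model_selection import train_test_split
# Hold out 30% of the samples for testing
x_train, x_test, y_train, y_test = train_test_split(x_data, y_data, test_size=0.3)

Normalization

from sklearn.preprocessing import MinMaxScaler
# Fit the scalers on the training set only and reuse them on the test set,
# so that no information from the test set leaks into preprocessing
x_scaler = MinMaxScaler()
y_scaler = MinMaxScaler()
x_train = x_scaler.fit_transform(x_train)
x_test  = x_scaler.transform(x_test)
# Scalers expect 2-D input; ravel() restores the 1-D target the models expect
y_train = y_scaler.fit_transform(y_train.reshape(-1,1)).ravel()
y_test  = y_scaler.transform(y_test.reshape(-1,1)).ravel()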
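
Min-max scaling maps each feature to x' = (x - min) / (max - min). The sketch below spells this out in plain NumPy, taking the minimum and maximum from the training rows only and reusing them for the test rows, so that both sets are scaled the same way. The arrays are made up for illustration, and `minmax_fit` and `minmax_apply` are hypothetical helper names, not scikit-learn functions.

```python
import numpy as np

def minmax_fit(x):
    """Return the per-column minimum and range, computed from x."""
    lo = x.min(axis=0)
    span = x.max(axis=0) - lo
    span[span == 0] = 1.0  # avoid division by zero for constant columns
    return lo, span

def minmax_apply(x, lo, span):
    """Scale x with previously computed statistics."""
    return (x - lo) / span

train = np.array([[1.0, 10.0], [3.0, 30.0], [5.0, 20.0]])
test = np.array([[2.0, 40.0]])

lo, span = minmax_fit(train)
train_scaled = minmax_apply(train, lo, span)
test_scaled = minmax_apply(test, lo, span)

print(train_scaled)
print(test_scaled)  # can fall outside [0, 1] when test values exceed the training range
```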

Training models

Try several models
Pick the best one

Linear regression (LinearRegression)

Train a linear regression model
Check the MSE and R2 scores

from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

model = LinearRegression()
model.fit(x_train, y_train)
y_pred = model.predict(x_test)
print ("MSE =", mean_squared_error(y_test, y_pred),end='\n\n')
print ("R2  =", r2_score(y_test, y_pred),end='\n\n')

MSE = 0.013304697805737791
R2 = 0.44625845284900767
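
For reference, both metrics can be computed directly from their definitions: MSE is the mean of the squared residuals, and R2 is one minus the ratio of the residual sum of squares to the total sum of squares. The sketch below uses made-up numbers, and the function names `mse` and `r2` are hypothetical. It also illustrates that when the target has been min-max scaled, an MSE in scaled units can be converted back to the original units by multiplying by the squared target range, while R2 is unaffected by that scaling.

```python
import numpy as np

def mse(y_true, y_pred):
    """Mean squared error."""
    return float(np.mean((y_true - y_pred) ** 2))

def r2(y_true, y_pred):
    """Coefficient of determination."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return float(1.0 - ss_res / ss_tot)

# Made-up targets in original units and imperfect predictions
y_true = np.array([10.0, 20.0, 30.0, 40.0])
y_pred = np.array([12.0, 18.0, 33.0, 37.0])

# Min-max scale both arrays with the range of y_true
lo, span = y_true.min(), y_true.max() - y_true.min()
y_true_s = (y_true - lo) / span
y_pred_s = (y_pred - lo) / span

print("MSE (original units):", mse(y_true, y_pred))
print("MSE (scaled units)  :", mse(y_true_s, y_pred_s))
print("MSE rescaled back   :", mse(y_true_s, y_pred_s) * span ** 2)
print("R2 original / scaled:", r2(y_true, y_pred), r2(y_true_s, y_pred_s))
```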

Visualizing the results

# Plot predictions against true values; points on the diagonal are perfect predictions
fig, ax = plt.subplots()
ax.scatter(y_test, y_pred, c="blue", edgecolors="aqua",s=13)
ax.plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], lw=2, color='navy')
ax.set_xlabel('Reality')
ax.set_ylabel('Prediction')
plt.show()

SVR model with linear kernel

Train an SVR model with the linear kernel
Check the cross-validated score

from sklearn.svm import SVR
from sklearn.model_selection import cross_val_predict, cross_val_score
linear_svr = SVR(kernel='linear')
# linear_svr.fit(x_train, y_train)
# linear_pred = linear_svr.predict(x_test)
linear_svr_pred = cross_val_predict(linear_svr, x_train, y_train, cv=5)
linear_svr_score = cross_val_score(linear_svr, x_train, y_train, cv=5)
linear_svr_meanscore = linear_svr_score.mean()
print ("Linear_SVR_Score =",linear_svr_meanscore,end='\n')

Linear_SVR_Score = 0.6497361775614359
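
`cross_val_score` with `cv=5` splits the training data into five folds, fits on four of them, scores on the remaining one, and returns one score per fold; for regressors the default score is R2. The sketch below reproduces that loop by hand in NumPy, with an ordinary least-squares fit standing in for the SVR so that it needs nothing beyond NumPy. The data are synthetic and `kfold_r2` is a hypothetical helper. Unlike this sketch, which shuffles the rows, scikit-learn's default splitting for `cv=5` does not shuffle.

```python
import numpy as np

def kfold_r2(x, y, k=5, seed=0):
    """Return the R2 score of a least-squares fit on each of k held-out folds."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(x)), k)
    scores = []
    for i in range(k):
        test_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        # Append a column of ones so the fit includes an intercept
        a_train = np.column_stack([x[train_idx], np.ones(len(train_idx))])
        a_test = np.column_stack([x[test_idx], np.ones(len(test_idx))])
        coef, *_ = np.linalg.lstsq(a_train, y[train_idx], rcond=None)
        pred = a_test @ coef
        ss_res = np.sum((y[test_idx] - pred) ** 2)
        ss_tot = np.sum((y[test_idx] - y[test_idx].mean()) ** 2)
        scores.append(1.0 - ss_res / ss_tot)
    return np.array(scores)

# Synthetic regression problem (assumption): linear signal plus noise
rng = np.random.default_rng(1)
x_demo = rng.normal(size=(100, 3))
y_demo = x_demo @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.5, size=100)

scores = kfold_r2(x_demo, y_demo, k=5)
print("per-fold R2:", np.round(scores, 3))
print("mean R2    :", scores.mean())
```

The mean of the per-fold scores corresponds to what the code above stores in `linear_svr_meanscore`.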

SVR model with poly kernel

Train an SVR model with the poly kernel
Check the cross-validated score

from sklearn.svm import SVR
from sklearn.model_selection import cross_val_predict, cross_val_score
poly_svr = SVR(kernel='poly')
poly_svr.fit(x_train, y_train)
poly_pred = poly_svr.predict(x_test)
poly_svr_pred = cross_val_predict(poly_svr, x_train, y_train, cv=5)
poly_svr_score = cross_val_score(poly_svr, x_train, y_train, cv=5)
poly_svr_meanscore = poly_svr_score.mean()
print ('\n',"Poly_SVR_Score =",poly_svr_meanscore,end='\n')

Poly_SVR_Score = 0.5383303049258509
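
For context, scikit-learn documents the polynomial kernel as K(x, x') = (gamma * <x, x'> + coef0)^degree, and `SVR(kernel='poly')` uses degree 3 and coef0 0 unless told otherwise. The NumPy sketch below evaluates that formula on two made-up points. `poly_kernel` is a hypothetical helper, and the `gamma` value is chosen arbitrarily for illustration rather than taken from the model above.

```python
import numpy as np

def poly_kernel(a, b, gamma=1.0, coef0=0.0, degree=3):
    """Polynomial kernel matrix between the rows of a and the rows of b."""
    return (gamma * (a @ b.T) + coef0) ** degree

# Two made-up 2-D points
points = np.array([[1.0, 2.0], [0.5, -1.0]])
k_matrix = poly_kernel(points, points, gamma=0.5)
print(k_matrix)
```

Passing `degree` or `coef0` to `SVR` changes the shape of this kernel, which is one of the knobs available if the poly score is to be improved.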

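Summary

Data processing and visual analysis feed back into each other and help select useful features
Try several models and choose the best-scoring one for the final training
In the end, the SVR model with the linear kernel scored best. Note that the SVR scores are 5-fold cross-validation means on the training set, while the linear regression R2 was measured on the test set, so the figures are not directly comparable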
