Support Vector Regression & Boston House Price Prediction

# Import the dataset
from sklearn.datasets import load_boston
boston = load_boston()
# Inspect the dataset's components
print(boston.keys())  # respectively: the data, the target, the feature names, and the description
dict_keys(['data', 'target', 'feature_names', 'DESCR', 'filename'])
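Note that load_boston was deprecated in scikit-learn 1.0 and removed in 1.2. If you are on a newer version, the library's own deprecation notice suggested loading the data directly from the original StatLib source instead; a sketch of that workaround (requires network access; the resulting data/target arrays stand in for boston.data/boston.target below):

import numpy as np
import pandas as pd
# Load the raw StatLib file; each record spans two physical lines
data_url = "http://lib.stat.cmu.edu/datasets/boston"
raw_df = pd.read_csv(data_url, sep=r"\s+", skiprows=22, header=None)
data = np.hstack([raw_df.values[::2, :], raw_df.values[1::2, :2]])  # the 13 features
target = raw_df.values[1::2, 2]  # MEDV, the regression target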
# Print out the description text
print(boston['DESCR'])  # shows the sample count and feature count; Median Value (attribute 14) is usually the target
.. _boston_dataset:

Boston house prices dataset
---------------------------

**Data Set Characteristics:**  

    :Number of Instances: 506 

    :Number of Attributes: 13 numeric/categorical predictive. Median Value (attribute 14) is usually the target.

    :Attribute Information (in order):
        - CRIM     per capita crime rate by town
        - ZN       proportion of residential land zoned for lots over 25,000 sq.ft.
        - INDUS    proportion of non-retail business acres per town
        - CHAS     Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
        - NOX      nitric oxides concentration (parts per 10 million)
        - RM       average number of rooms per dwelling
        - AGE      proportion of owner-occupied units built prior to 1940
        - DIS      weighted distances to five Boston employment centres
        - RAD      index of accessibility to radial highways
        - TAX      full-value property-tax rate per $10,000
        - PTRATIO  pupil-teacher ratio by town
        - B        1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town
        - LSTAT    % lower status of the population
        - MEDV     Median value of owner-occupied homes in $1000's

    :Missing Attribute Values: None

    :Creator: Harrison, D. and Rubinfeld, D.L.

This is a copy of UCI ML housing dataset.
https://archive.ics.uci.edu/ml/machine-learning-databases/housing/


This dataset was taken from the StatLib library which is maintained at Carnegie Mellon University.

The Boston house-price data of Harrison, D. and Rubinfeld, D.L. 'Hedonic
prices and the demand for clean air', J. Environ. Economics & Management,
vol.5, 81-102, 1978.   Used in Belsley, Kuh & Welsch, 'Regression diagnostics
...', Wiley, 1980.   N.B. Various transformations are used in the table on
pages 244-261 of the latter.

The Boston house-price data has been used in many machine learning papers that address regression
problems.   
     
.. topic:: References

   - Belsley, Kuh & Welsch, 'Regression diagnostics: Identifying Influential Data and Sources of Collinearity', Wiley, 1980. 244-261.
   - Quinlan,R. (1993). Combining Instance-Based and Model-Based Learning. In Proceedings on the Tenth International Conference of Machine Learning, 236-243, University of Massachusetts, Amherst. Morgan Kaufmann.
# Modeling with SVR
from sklearn.model_selection import train_test_split
X, y = boston.data, boston.target
# Split off a test set (default 25%); random_state is fixed for reproducibility
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=8)
# Check the shapes of the two splits
print(X_train.shape)
print(X_test.shape)
(379, 13)
(127, 13)
# Import the model class
from sklearn.svm import SVR
# Try the linear kernel and the RBF kernel
for kernel in ['linear', 'rbf']:
    svr = SVR(kernel=kernel)
    svr.fit(X_train, y_train)
    print(kernel, "training set score: {}".format(svr.score(X_train, y_train)))
    print(kernel, "test set score: {}".format(svr.score(X_test, y_test)))
linear training set score: 0.7088454040060503
linear test set score: 0.6964154693072283
rbf training set score: 0.19202953621697594
rbf test set score: 0.22231887131613304
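A plausible reason the RBF kernel fares so badly here: it computes exp(-gamma * ||x - x'||^2), so sample-to-sample distances drive everything, and on unscaled data those distances are dominated by whichever features have the largest magnitudes. A quick check on the first two training samples (using the arrays defined above):

# Per-feature share of the squared Euclidean distance between two samples
d2 = (X_train[0] - X_train[1]) ** 2
print((d2 / d2.sum()).round(3))  # a handful of large-scale features dominate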

SVM-family algorithms are quite sensitive to how the data is prepared: when the feature magnitudes differ greatly, the data should be rescaled before training.
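Before plotting, a quick numeric spot check (using the arrays already loaded) makes that claim concrete:

# Per-feature minimum and maximum of the raw data
for name, lo, hi in zip(boston.feature_names, X.min(axis=0), X.max(axis=0)):
    print("{:8s} min={:10.3f} max={:10.3f}".format(name, lo, hi))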

# Visualize the magnitude of each feature: scatter the per-feature minima and maxima
import matplotlib.pyplot as plt
%matplotlib inline
plt.plot(X.min(axis=0), 'v', label='min')
plt.plot(X.max(axis=0), '^', label='max')
# Use a logarithmic scale on the y-axis
plt.yscale('log')
# Place the legend at the best position
plt.legend(loc='best')
# Label the axes
plt.xlabel('features')
plt.ylabel('feature value range')
# Show the figure
plt.show()

[Figure: per-feature minima and maxima of the raw data, log-scale y-axis]

As the plot shows, the feature magnitudes differ by several orders of magnitude, so we preprocess the data next.

# Import the preprocessing tool
from sklearn.preprocessing import StandardScaler
# Fit the scaler on the training set, then transform both splits with it
scaler = StandardScaler()
scaler.fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)
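In practice, the scaler and the model are often chained into a Pipeline, which keeps the two steps together and guarantees the scaling statistics are learned from the training data only; a minimal sketch of the same workflow:

from sklearn.pipeline import make_pipeline
# Scale, then fit an RBF-kernel SVR, as one estimator
pipe = make_pipeline(StandardScaler(), SVR(kernel='rbf'))
pipe.fit(X_train, y_train)
print("pipeline test set score: {}".format(pipe.score(X_test, y_test)))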

# Visualize the data after scaling
plt.plot(X_train_scaled.min(axis=0), 'v', label='train set min')
plt.plot(X_train_scaled.max(axis=0), '^', label='train set max')
plt.plot(X_test_scaled.min(axis=0), 'v', label='test set min')
plt.plot(X_test_scaled.max(axis=0), '^', label='test set max')
# Standardized values are centered on 0 and include negatives, so a linear
# y-axis is used here (a log scale would silently drop the negative minima)
# Place the legend at the best position
plt.legend(loc='best')

# Label the axes
plt.xlabel('scaled features')
plt.ylabel('scaled feature value range')
plt.show()

[Figure: per-feature minima and maxima of the scaled training and test sets]

Now the differences in feature magnitude are much smaller.
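One can also verify what StandardScaler actually did: on the training set, every feature now has (approximately) zero mean and unit variance:

print(X_train_scaled.mean(axis=0).round(2))  # ~0 for every feature
print(X_train_scaled.std(axis=0).round(2))   # ~1 for every feature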

# Try the linear kernel and the RBF kernel again, now on the scaled data
for kernel in ['linear', 'rbf']:
    svr = SVR(kernel=kernel)
    svr.fit(X_train_scaled, y_train)
    print(kernel, "training set score: {}".format(svr.score(X_train_scaled, y_train)))
    print(kernel, "test set score: {}".format(svr.score(X_test_scaled, y_test)))
linear training set score: 0.7056333150364087
linear test set score: 0.6983657869087585
rbf training set score: 0.6649619040718826
rbf test set score: 0.6945967225393969

The change is dramatic: the RBF kernel now matches the linear kernel. Still, neither score is satisfying yet.
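One knob to turn is the regularization parameter C, whose default is 1.0; larger values let the model fit the training data more closely. A quick sweep (the values here are illustrative) shows its effect before we fix specific parameters:

for C in [1, 10, 100, 1000]:
    svr = SVR(kernel='rbf', C=C)
    svr.fit(X_train_scaled, y_train)
    print("C={}: train {:.3f}, test {:.3f}".format(
        C, svr.score(X_train_scaled, y_train), svr.score(X_test_scaled, y_test)))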

# Set the model's parameters: a larger C (weaker regularization) and an explicit kernel width gamma
svr = SVR(C=100, gamma=0.1)
svr.fit(X_train_scaled, y_train)
print("training set score: {}".format(svr.score(X_train_scaled, y_train)))
print("test set score: {}".format(svr.score(X_test_scaled, y_test)))
training set score: 0.9662897941105739
test set score: 0.8940385882400999
# kernel must be one of 'linear', 'poly', 'rbf', 'sigmoid', 'precomputed' or a callable; the default is 'rbf'
# Increase gamma slightly
svr = SVR(C=100, gamma=0.2)
svr.fit(X_train_scaled, y_train)
print("training set score: {}".format(svr.score(X_train_scaled, y_train)))
print("test set score: {}".format(svr.score(X_test_scaled, y_test)))
training set score: 0.9834549534287683
test set score: 0.9013368508736123
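Rather than tuning C and gamma by hand, a cross-validated grid search can explore the parameter space systematically; a sketch, with an illustrative grid:

from sklearn.model_selection import GridSearchCV
param_grid = {'C': [1, 10, 100, 1000], 'gamma': [0.01, 0.1, 0.2, 1]}
grid = GridSearchCV(SVR(kernel='rbf'), param_grid, cv=5)
grid.fit(X_train_scaled, y_train)
print("best parameters: {}".format(grid.best_params_))
print("test set score: {}".format(grid.score(X_test_scaled, y_test)))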