2.1.2.3 K-Nearest Neighbors (Regression)

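1. Model introduction

In regression tasks, the K-nearest neighbors (regression) model likewise relies only on the target values of the K nearest training samples to decide the regression value of a test sample. This naturally gives rise to different ways of computing that value: either take the plain arithmetic mean of the K neighbors' target values, or take a weighted mean that also accounts for how far away each neighbor is. We therefore initialize K-nearest neighbors (regression) models with different configurations and compare their regression performance.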
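
To make the two averaging schemes concrete, here is a minimal pure-Python sketch of a single K-nearest neighbors prediction. The function name knn_predict, the toy data, and the way zero distances are handled are assumptions made for illustration. This is not scikit-learn's implementation, although inverse-distance weighting is what scikit-learn documents for weights='distance'.

```python
import math

def knn_predict(train_X, train_y, query, k=3, weights="uniform"):
    """Predict a regression value for `query` from its k nearest training samples."""
    # Euclidean distance from the query to every training sample.
    dists = [math.dist(x, query) for x in train_X]
    # Indices of the k closest training samples.
    nearest = sorted(range(len(train_X)), key=lambda i: dists[i])[:k]
    if weights == "uniform":
        # Plain arithmetic mean of the neighbors' targets.
        return sum(train_y[i] for i in nearest) / k
    # Assumption of this sketch: neighbors at distance 0 decide the prediction alone.
    exact = [train_y[i] for i in nearest if dists[i] == 0.0]
    if exact:
        return sum(exact) / len(exact)
    # Inverse-distance weights: closer neighbors count more.
    w = [1.0 / dists[i] for i in nearest]
    return sum(wi * train_y[i] for wi, i in zip(w, nearest)) / sum(w)

# Toy one-feature data set (hypothetical values).
train_X = [(0.0,), (1.0,), (2.0,), (10.0,)]
train_y = [10.0, 20.0, 30.0, 100.0]
query = (1.5,)

uniform_pred = knn_predict(train_X, train_y, query, k=3, weights="uniform")
distance_pred = knn_predict(train_X, train_y, query, k=3, weights="distance")
print(uniform_pred, distance_pred)
```

With this toy data, the two closest neighbors (targets 20 and 30) pull the distance-weighted prediction above the plain mean of 20.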

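2. Data description

(1) Description of the Boston house price data

# Listing 34: Description of the Boston house price data
# Import the Boston house price data loader from sklearn.datasets.
# Note: load_boston was removed in scikit-learn 1.2, so this listing needs an older release.
from sklearn.datasets import load_boston

# Load the house price data and store it in the variable boston.
boston = load_boston()
# Print the data description.
print(boston.DESCR)

Local output: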

.. _boston_dataset:

Boston house prices dataset
---------------------------

**Data Set Characteristics:**  

    :Number of Instances: 506 

    :Number of Attributes: 13 numeric/categorical predictive. Median Value (attribute 14) is usually the target.

    :Attribute Information (in order):
        - CRIM     per capita crime rate by town
        - ZN       proportion of residential land zoned for lots over 25,000 sq.ft.
        - INDUS    proportion of non-retail business acres per town
        - CHAS     Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
        - NOX      nitric oxides concentration (parts per 10 million)
        - RM       average number of rooms per dwelling
        - AGE      proportion of owner-occupied units built prior to 1940
        - DIS      weighted distances to five Boston employment centres
        - RAD      index of accessibility to radial highways
        - TAX      full-value property-tax rate per $10,000
        - PTRATIO  pupil-teacher ratio by town
        - B        1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town
        - LSTAT    % lower status of the population
        - MEDV     Median value of owner-occupied homes in $1000's

    :Missing Attribute Values: None

    :Creator: Harrison, D. and Rubinfeld, D.L.

This is a copy of UCI ML housing dataset.
https://archive.ics.uci.edu/ml/machine-learning-databases/housing/


This dataset was taken from the StatLib library which is maintained at Carnegie Mellon University.

The Boston house-price data of Harrison, D. and Rubinfeld, D.L. 'Hedonic
prices and the demand for clean air', J. Environ. Economics & Management,
vol.5, 81-102, 1978.   Used in Belsley, Kuh & Welsch, 'Regression diagnostics
...', Wiley, 1980.   N.B. Various transformations are used in the table on
pages 244-261 of the latter.

The Boston house-price data has been used in many machine learning papers that address regression
problems.   
     
.. topic:: References

   - Belsley, Kuh & Welsch, 'Regression diagnostics: Identifying Influential Data and Sources of Collinearity', Wiley, 1980. 244-261.
   - Quinlan,R. (1993). Combining Instance-Based and Model-Based Learning. In Proceedings on the Tenth International Conference of Machine Learning, 236-243, University of Massachusetts, Amherst. Morgan Kaufmann.

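Conclusion: The data set contains 506 records of house prices in the Boston area of the United States. Each record has 13 predictive features (numeric/categorical) and a target price, MEDV, the median home value in units of $1000. No attribute/feature values are missing, which simplifies the subsequent analysis.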

(2) Splitting the Boston house price data

# Listing 35: Splitting the Boston house price data
# Import the data splitter from sklearn.model_selection.
from sklearn.model_selection import train_test_split
# Import numpy under the alias np.
import numpy as np

X = boston.data
y = boston.target

# Randomly sample 25% of the data as the test set; the rest is the training set.
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=33, test_size=0.25)

# Examine the spread of the regression target values.
print("The max target value is", np.max(boston.target))
print("The min target value is", np.min(boston.target))
print("The average target value is", np.mean(boston.target))
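
As a rough picture of what a 75/25 random split does, the sketch below shuffles sample indices with a fixed seed and cuts off a quarter of them for testing. The helper split_indices is hypothetical, and it will not reproduce the exact partition that train_test_split yields for random_state=33. Rounding the test share up is an assumption meant to mirror how scikit-learn treats a fractional test_size.

```python
import math
import random

def split_indices(n_samples, test_size=0.25, seed=33):
    """Return (train_idx, test_idx) for a shuffled split of range(n_samples)."""
    rng = random.Random(seed)  # local generator, so the split is repeatable
    indices = list(range(n_samples))
    rng.shuffle(indices)
    n_test = math.ceil(test_size * n_samples)  # assumption: round the test share up
    return indices[n_test:], indices[:n_test]

# 506 is the number of records in the Boston data set.
train_idx, test_idx = split_indices(506)
print(len(train_idx), len(test_idx))  # prints: 379 127
```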

Note: the original import, from sklearn.cross_validation import train_test_split, fails with this error:

from sklearn.cross_validation import train_test_split
ModuleNotFoundError: No module named 'sklearn.cross_validation'

Replace cross_validation with model_selection:

from sklearn.model_selection import train_test_split

Local output:

The max target value is 50.0
The min target value is 5.0
The average target value is 22.532806324110677

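Conclusion: The target house prices vary widely (from 5.0 to 50.0, with a mean of about 22.5), so both the features and the target values need to be standardized.

Note: Readers need not be concerned about standardizing real house prices. Although standardization changes the values considerably, the inverse_transform function of the standardizer restores the real values, and predicted regression values can be restored in the same way.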
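
The round trip described in this note can be sketched without scikit-learn. The helpers standardize and restore are hypothetical stand-ins for what StandardScaler's fit_transform and inverse_transform do to a single column. They use the population standard deviation, which is also what StandardScaler uses.

```python
import statistics

def standardize(values):
    """Return (z_scores, mean, std) using the population standard deviation."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return [(v - mean) / std for v in values], mean, std

def restore(z_scores, mean, std):
    """Undo standardize(): map z-scores back to the original scale."""
    return [z * std + mean for z in z_scores]

prices = [5.0, 20.0, 35.0, 50.0]  # toy values in $1000s, not the real data
z, mean, std = standardize(prices)
recovered = restore(z, mean, std)
print(z)
print(recovered)
```

The standardized values have zero mean and unit variance, and restore recovers the original prices up to floating-point rounding.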

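(3) Standardizing the Boston house price data

# Listing 36: Standardizing the training and test data
# Import the standardization module from sklearn.preprocessing.
from sklearn.preprocessing import StandardScaler

# Initialize separate standardizers for the features and the target values.
ss_X = StandardScaler()
ss_y = StandardScaler()

# Standardize the features and target values of the training and test data.
# Each standardizer is fitted on the training data only and then applied to the test data.
X_train = ss_X.fit_transform(X_train)
X_test = ss_X.transform(X_test)
y_train = ss_y.fit_transform(y_train.reshape(-1, 1))
y_test = ss_y.transform(y_test.reshape(-1, 1))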

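Note: the original code raises an error because of a toolkit version update, so the approach below is used instead.

Following the error message, the two offending lines in the original code are:

y_train = ss_y.fit_transform(y_train)
y_test = ss_y.transform(y_test)

The problem is that StandardScaler expects a 2-D array: data shaped like [1, 2, 3, 4] fails, whereas [[1], [2], [3], [4]] works. The two lines are therefore changed to:

y_train = ss_y.fit_transform(y_train.reshape(-1, 1))
y_test = ss_y.transform(y_test.reshape(-1, 1))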
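
In plain Python terms, reshape(-1, 1) turns a flat sequence into a column of one-element rows, which is the 2-D layout the standardizer needs. The helper to_column is hypothetical and only mimics the shape change on lists; the real code operates on NumPy arrays.

```python
def to_column(values):
    """Wrap each value in its own row: n values become n rows of 1 column."""
    return [[v] for v in values]

flat = [1, 2, 3, 4]
column = to_column(flat)
print(column)  # prints: [[1], [2], [3], [4]]
```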

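3. Programming practice

Use two differently configured K-nearest neighbors regression models to predict Boston house prices.

# Listing 41: Regression on the Boston house price data with two differently configured K-nearest neighbors models
# Import KNeighborsRegressor (the K-nearest neighbors regressor) from sklearn.neighbors.
from sklearn.neighbors import KNeighborsRegressor

# Initialize a K-nearest neighbors regressor that predicts by plain averaging: weights='uniform'.
uni_knr = KNeighborsRegressor(weights='uniform')
uni_knr.fit(X_train, y_train)
uni_knr_y_predict = uni_knr.predict(X_test)

# Initialize a K-nearest neighbors regressor that predicts by distance-weighted averaging: weights='distance'.
dis_knr = KNeighborsRegressor(weights='distance')
dis_knr.fit(X_train, y_train)
dis_knr_y_predict = dis_knr.predict(X_test)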

4. Performance evaluation

Next, evaluate the K-nearest neighbors models under the two prediction configurations.

# Listing 42: Evaluating the predictive performance of the two K-nearest neighbors regression configurations on the Boston house price data
# Evaluate the uniform-weighted K-nearest neighbors model on the test set with three metrics: R-squared, MSE, and MAE.
from sklearn.metrics import r2_score, mean_absolute_error, mean_squared_error

print('R-squared value of uniform-weighted KNeighborRegression:', uni_knr.score(X_test, y_test))
print('The mean squared error of uniform-weighted KNeighborRegression:', mean_squared_error(ss_y.inverse_transform(y_test), ss_y.inverse_transform(uni_knr_y_predict)))
print('The mean absolute error of uniform-weighted KNeighborRegression:', mean_absolute_error(ss_y.inverse_transform(y_test), ss_y.inverse_transform(uni_knr_y_predict)))

# Evaluate the distance-weighted K-nearest neighbors model on the test set with the same three metrics.
print('R-squared value of distance-weighted KNeighborRegression:', dis_knr.score(X_test, y_test))
print('The mean squared error of distance-weighted KNeighborRegression:', mean_squared_error(ss_y.inverse_transform(y_test), ss_y.inverse_transform(dis_knr_y_predict)))
print('The mean absolute error of distance-weighted KNeighborRegression:', mean_absolute_error(ss_y.inverse_transform(y_test), ss_y.inverse_transform(dis_knr_y_predict)))

Note: the original code did not import the metric functions, which caused an error, so the import is added:

from sklearn.metrics import r2_score, mean_absolute_error, mean_squared_error

Local output:

R-squared value of uniform-weighted KNeighborRegression: 0.6907212176346006
The mean squared error of uniform-weighted KNeighborRegression: 23.981877165354337
The mean absolute error of uniform-weighted KNeighborRegression: 2.9650393700787396
R-squared value of distance-weighted KNeighborRegression: 0.7201094821421603
The mean squared error of distance-weighted KNeighborRegression: 21.703073090490353
The mean absolute error of distance-weighted KNeighborRegression: 2.801125502210876

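Conclusion: On this test split, distance-weighted averaging predicts house prices better than uniform averaging: R-squared is 0.720 versus 0.691, MSE is 21.70 versus 23.98, and MAE is 2.80 versus 2.97.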
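
For reference, the three metrics used in this evaluation follow from their standard definitions. The sketch below computes them in plain Python on made-up numbers; the function names and data are illustrative and are not scikit-learn's code. It also checks that R-squared is unchanged when the true and predicted values go through the same linear rescaling, which is consistent with reading R-squared off the standardized targets while reporting MSE and MAE after inverse_transform.

```python
def mse(y_true, y_pred):
    """Mean squared error."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def r2(y_true, y_pred):
    """R-squared: 1 - (residual sum of squares / total sum of squares)."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

def rescale(v):
    """A linear map applied to both series, like undoing a standardization."""
    return 2.0 * v + 1.0

y_true = [3.0, 5.0, 7.0, 9.0]  # made-up targets
y_pred = [2.0, 5.0, 8.0, 9.0]  # made-up predictions
y_true_scaled = [rescale(v) for v in y_true]
y_pred_scaled = [rescale(v) for v in y_pred]

print(mse(y_true, y_pred), mae(y_true, y_pred), r2(y_true, y_pred))
print(mse(y_true_scaled, y_pred_scaled), r2(y_true_scaled, y_pred_scaled))
```

MSE and MAE change with the scale of the targets, so they are only interpretable in house-price units once the values have been mapped back; R-squared does not.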

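5. Characteristics

Like K-nearest neighbors (classification), K-nearest neighbors (regression) is a non-parametric model and likewise has no parameter-training process. Because its computation is very intuitive, it is popular with beginners. This section discussed two ways of predicting a regression value from the similarity between data samples and showed that, on this data set, the distance-weighted K-nearest neighbors strategy achieves the better model performance. The result is offered for readers' reference.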

《Python機器學習及實踐：從零開始通往Kaggle競賽之路》, Chapter 2 (Basics), study notes (9): 2.1.2.3 K-nearest neighbors (regression), summary
