Supervised machine learning problems fall into two main categories: classification and regression.
Supervised versus unsupervised learning, and the distinction between classification and regression within supervised learning, are not covered again here; see *Machine Learning* for details.
1. K-Nearest Neighbors Classification
Using sklearn's KNeighborsClassifier, we evaluate accuracy on the test set for different numbers of neighbors.
The dataset here is sklearn's built-in breast cancer dataset.
from sklearn import datasets
from sklearn.model_selection import train_test_split
import numpy as np
import matplotlib.pyplot as plt
from sklearn.neighbors import KNeighborsClassifier
cancer = datasets.load_breast_cancer()
# split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(cancer['data'], cancer['target'], random_state=1)
test_accuracy = []
# n_neighbors from 1 to 10
neighbors_settings = range(1, 11)
# evaluate the test accuracy for each number of neighbors
for n_neighbors in neighbors_settings:
    knn = KNeighborsClassifier(n_neighbors=n_neighbors)
    knn.fit(X_train, y_train)
    # record the test-set accuracy
    test_accuracy.append(knn.score(X_test, y_test))
plt.plot(neighbors_settings, test_accuracy)
plt.xlabel("n_neighbors")
plt.ylabel("test_accuracy")
plt.show()
The resulting test-accuracy curve shows that accuracy roughly peaks at 5 and 6 neighbors, while even the lowest value remains acceptable.
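Reading the best k directly off the test-accuracy curve effectively uses the test set for model selection. A common alternative is to choose k by cross-validation on the training set only; a minimal sketch, assuming the same data and split as above:

```python
from sklearn import datasets
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

cancer = datasets.load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    cancer['data'], cancer['target'], random_state=1)

# mean 5-fold cross-validation accuracy on the training set for each candidate k
mean_scores = {}
for k in range(1, 11):
    scores = cross_val_score(KNeighborsClassifier(n_neighbors=k),
                             X_train, y_train, cv=5)
    mean_scores[k] = scores.mean()

best_k = max(mean_scores, key=mean_scores.get)
print("best n_neighbors:", best_k)
```

The model would then be refit with best_k on the full training set and scored once on the held-out test set.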
2. K-Nearest Neighbors Regression
With KNeighborsRegressor, when more than one neighbor is used the prediction is the mean of those neighbors' target values.
Here we use a hand-constructed one-dimensional dataset relating distance from the coastline to house price:
house_price_dataSet = {
'data': np.array([0.5, 0.8, 1.0, 1.4, 1.6, 1.8, 2.0, 2.1, 2.3, 2.5, 2.9, 3.2, 3.5, 3.9, 4.6, 5.0]).reshape(-1, 1),
'target': np.array([3., 2.5, 2.8, 2.6, 2.7, 2.6, 2., 2., 1.6, 1.7, 1.6, 1.4, 1.2, 1.4, 1.3, 1.2]),
'target_names': np.array(['price']),
'feature_names': np.array(['distance'])
}
The goal is to predict house prices within 0–5 km of the coastline.
from sklearn.model_selection import train_test_split
import numpy as np
import matplotlib.pyplot as plt
from sklearn.neighbors import KNeighborsRegressor
# ['data', 'target', 'target_names', 'feature_names']
house_price_dataSet = {
'data': np.array([0.5, 0.8, 1.0, 1.4, 1.6, 1.8, 2.0, 2.1, 2.3, 2.5, 2.9, 3.2, 3.5, 3.9, 4.6, 5.0]).reshape(-1, 1),
'target': np.array([3., 2.5, 2.8, 2.6, 2.7, 2.6, 2., 2., 1.6, 1.7, 1.6, 1.4, 1.2, 1.4, 1.3, 1.2]),
'target_names': np.array(['price']),
'feature_names': np.array(['distance'])
}
# plt.scatter(house_price_dataSet['data'], house_price_dataSet['target'])
# plt.xlabel("Distance from the sea")
# plt.ylabel("housing price")
# plt.show()
# split into training and test sets with train_test_split
X_train, X_test, y_train, y_test = train_test_split(house_price_dataSet['data'], house_price_dataSet['target'],
random_state=0)
# K-nearest neighbors regression
knn = KNeighborsRegressor(n_neighbors=2)
knn.fit(X_train, y_train)
x_new = np.array([[1.5]])
prediction = knn.predict(x_new)
print("prediction price:", prediction)
Predicting the price at 1.5 km prints the following:
prediction price: [2.65]
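Because KNeighborsRegressor with the default uniform weights simply averages the targets of the k nearest training points, the value above can be checked by hand using the estimator's kneighbors method. A minimal sketch with the same data and split:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor

X = np.array([0.5, 0.8, 1.0, 1.4, 1.6, 1.8, 2.0, 2.1, 2.3, 2.5,
              2.9, 3.2, 3.5, 3.9, 4.6, 5.0]).reshape(-1, 1)
y = np.array([3., 2.5, 2.8, 2.6, 2.7, 2.6, 2., 2., 1.6, 1.7,
              1.6, 1.4, 1.2, 1.4, 1.3, 1.2])
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

knn = KNeighborsRegressor(n_neighbors=2).fit(X_train, y_train)
x_new = np.array([[1.5]])

# kneighbors returns the distances and training-set indices of the k nearest points
dist, idx = knn.kneighbors(x_new)
# a plain average of the 2 nearest training targets should match predict()
manual = y_train[idx[0]].mean()
print(manual, knn.predict(x_new)[0])
```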
The fit can also be visualized over the full 0–5 km range.
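As a sketch of how such a figure can be produced, the model's predictions over a dense grid on 0–5 km can be plotted alongside the training and test points (the grid size of 200 points is an arbitrary choice):

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor

X = np.array([0.5, 0.8, 1.0, 1.4, 1.6, 1.8, 2.0, 2.1, 2.3, 2.5,
              2.9, 3.2, 3.5, 3.9, 4.6, 5.0]).reshape(-1, 1)
y = np.array([3., 2.5, 2.8, 2.6, 2.7, 2.6, 2., 2., 1.6, 1.7,
              1.6, 1.4, 1.2, 1.4, 1.3, 1.2])
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

knn = KNeighborsRegressor(n_neighbors=2).fit(X_train, y_train)

# predict on a dense grid spanning the whole 0-5 km range
line = np.linspace(0, 5, 200).reshape(-1, 1)
y_line = knn.predict(line)

plt.plot(line, y_line, label="prediction (k=2)")
plt.scatter(X_train, y_train, marker='o', label="training data")
plt.scatter(X_test, y_test, marker='^', label="test data")
plt.xlabel("Distance from the sea (km)")
plt.ylabel("housing price")
plt.legend()
plt.show()
```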