[scikit-learn Machine Learning] 3. K-Nearest Neighbors for Classification and Regression


This post is a set of study notes for scikit-learn Machine Learning (2nd Edition).

The k-nearest neighbors algorithm (K-Nearest Neighbor, K-NN) is commonly used in search and recommender systems.

1. The KNN model

  • Choose a distance metric, e.g. the Euclidean distance (see the short sketch after this list)
  • Find the K nearest neighbors of the query sample and combine them into a prediction (majority vote for classification, averaging for regression)
  • Model assumption: samples that are close in feature space have similar response values
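
In scikit-learn these choices map directly onto the estimator's constructor arguments. A minimal illustration (the parameter values here are arbitrary, not from the book):

from sklearn.neighbors import KNeighborsClassifier

# metric='minkowski' with p=2 is the Euclidean distance (the default);
# p=1 would give the Manhattan distance instead
clf = KNeighborsClassifier(n_neighbors=3, metric='minkowski', p=2)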

2. KNN classification

Classify sex from height and weight.

import numpy as np
import matplotlib.pyplot as plt

X_train = np.array([
    [158, 64],
    [170, 86],
    [183, 84],
    [191, 80],
    [155, 49],
    [163, 59],
    [180, 67],
    [158, 54],
    [170, 67]
])
y_train = ['male', 'male', 'male', 'male', 'female', 'female', 'female', 'female', 'female']

plt.figure()
plt.title('Human Heights and Weights by Sex')
plt.xlabel('Height in cm')
plt.ylabel('Weight in kg')

for i, x in enumerate(X_train):
    if y_train[i] == 'male':
        c1 = plt.scatter(x[0], x[1], c='k', marker='x')
    else:
        c2 = plt.scatter(x[0], x[1], c='r', marker='o')
plt.grid(True)
plt.legend((c1,c2),('male','female'),loc='lower right')
# plt.show()

(Figure: scatter plot of the training samples, height in cm vs. weight in kg; males plotted as black crosses, females as red circles)

  • Predict the sex of a person who is 155 cm tall and weighs 70 kg
  • Use a KNN model with K = 3
Compute the distances:
x = np.array([[155, 70]])
dis = np.sqrt(np.sum((X_train - x)**2, axis=1))  # Euclidean distance to every training sample
dis
Select the k nearest neighbors:
nearest_k_neighbors = dis.argsort()[0:3]  # indices of the 3 smallest distances
k_genders = [y_train[i] for i in nearest_k_neighbors]
k_genders  # ['male', 'female', 'female']
Count the labels of the k nearest neighbors:
from collections import Counter
# b = Counter(np.take(y_train, dis.argsort()[0:3]))
b = Counter(k_genders)
b # Counter({'female': 2, 'male': 1})
Female is the majority, so the prediction is 'female'.
# help(Counter.most_common)
# most_common(self, n=None)
#     List the n most common elements and their counts from the most
#     common to the least.  If n is None, then list all element counts.
b.most_common(2) # [('female', 2), ('male', 1)]
b.most_common(1)[0][0] # 'female'
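
The three manual steps above can be wrapped into a single helper. A minimal sketch (the function name knn_classify is ours, not from the book):

def knn_classify(X_train, y_train, x, k=3):
    distances = np.sqrt(np.sum((X_train - x)**2, axis=1))  # step 1: distances to every training sample
    nearest_k = distances.argsort()[:k]                    # step 2: indices of the k nearest
    votes = Counter(y_train[i] for i in nearest_k)         # step 3: majority vote
    return votes.most_common(1)[0][0]

knn_classify(X_train, y_train, x, k=3)  # 'female'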

3. KNN classification with sklearn

Encode the string labels (male, female) as numbers with LabelBinarizer; the classes are sorted alphabetically, so female → 0 and male → 1:

from sklearn.preprocessing import LabelBinarizer
from sklearn.neighbors import KNeighborsClassifier

lb = LabelBinarizer()
y_train_lb = lb.fit_transform(y_train)
y_train_lb
######
array([[1],
       [1],
       [1],
       [1],
       [0],
       [0],
       [0],
       [0],
       [0]])

Predict the sex for the earlier example (155 cm, 70 kg):

K=3
clf = KNeighborsClassifier(n_neighbors=K)
clf.fit(X_train,y_train_lb.ravel())
pred_gender = clf.predict(x)
pred_gender # array([0])
pred_label_gender = lb.inverse_transform(pred_gender)
pred_label_gender # array(['female'], dtype='<U6')
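
To see which training samples drove this prediction, the classifier also exposes kneighbors, which returns the distances to and indices of the K nearest training points. A quick check against the manual computation above (outputs from our own run of this data):

distances, indices = clf.kneighbors(x)
distances  # array([[ 6.70820393, 13.60147051, 15.29705854]])
indices    # array([[0, 5, 8]])  -> labels 'male', 'female', 'female'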

Validate on a test set:

X_test = np.array([
    [168, 65],
    [180, 96],
    [160, 52],
    [169, 67]
])
y_test = ['male', 'male', 'female', 'female']
y_test_lb = lb.transform(y_test)

pred_lb = clf.predict(X_test)
print('Predicted labels: %s' % lb.inverse_transform(pred_lb))
# Predicted labels: ['female' 'male' 'female' 'female']

Compute evaluation metrics

Accuracy: the proportion of correct predictions, 3/4
from sklearn.metrics import accuracy_score
accuracy_score(y_test_lb, pred_lb) # 0.75
Precision: with male as the positive class, (males predicted as male) / (males predicted as male + females predicted as male)
from sklearn.metrics import precision_score
precision_score(y_test_lb, pred_lb) # 1.0
Recall: (males predicted as male) / (males predicted as male + males predicted as female)
from sklearn.metrics import recall_score
recall_score(y_test_lb, pred_lb) # 0.5
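These counts can be read off a confusion matrix (a small addition, not in the book's example):
from sklearn.metrics import confusion_matrix
# rows are the true class (0 = female, 1 = male), columns the predicted class
confusion_matrix(y_test_lb, pred_lb)
# array([[2, 0],
#        [1, 1]])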

F1 score

The F1 score is the harmonic mean of precision and recall, balancing the two:
from sklearn.metrics import f1_score
f1_score(y_test_lb, pred_lb) # 0.6667
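A quick hand check of the formula F1 = 2 * P * R / (P + R) with the values above:
precision, recall = 1.0, 0.5
2 * precision * recall / (precision + recall)  # 0.6666666666666666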
Classification report
from sklearn.metrics import classification_report
# help(classification_report)
# classification_report(y_true, y_pred, labels=None, target_names=None,
#     sample_weight=None, digits=2, output_dict=False, zero_division='warn')
print(classification_report(y_test_lb, pred_lb, target_names=['male','female'], labels=[1,0]))

(Figure: classification_report output with per-class precision, recall, F1-score, and support)

4. KNN regression

Predict weight from height and sex.

from sklearn.neighbors import KNeighborsRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error,r2_score

X_train = np.array([
    [158,  1],
    [170,  1],
    [183,  1],
    [191,  1],
    [155,  0],
    [163,  0],
    [180,  0],
    [158,  0],
    [170,  0]
])
y_train = [64,86,84,80,49,59,67,54,67]

X_test = np.array([
    [168,  1],
    [180,  1],
    [160,  0],
    [169,  0]
])
y_test = [65,96,52,67]

K = 3
clf = KNeighborsRegressor(n_neighbors=K)
clf.fit(X_train, y_train)
predictions = clf.predict(np.array(X_test))
predictions # array([70.66666667, 79.        , 59.        , 70.66666667])
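
A sanity check on the first test point [168, 1]: its prediction is simply the mean weight of its three nearest training samples. A small check using the regressor's kneighbors method (not part of the book's listing):

_, idx = clf.kneighbors(X_test[:1])
np.mean(np.array(y_train)[idx[0]])  # 70.666..., matches the first prediction above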

# help(r2_score)
# R^2 (coefficient of determination)
r2_score(y_test, predictions) # 0.6290565226735438

Mean absolute error
mean_absolute_error(y_test, predictions) # 8.333333333333336

Mean squared error
mean_squared_error(y_test, predictions)  # 95.8888888888889
  • The effect of leaving the data unstandardized (illustrated with a small two-sample example)
from scipy.spatial.distance import euclidean
# help(euclidean)  # Euclidean distance between two 1-D vectors

# heights in millimetres plus a binary sex feature
# (named X_mm / x_mm here so the regression data above is not overwritten)
X_mm = np.array([
    [1700, 1],
    [1600, 0]
])
x_mm = np.array([1640, 1])
print(euclidean(X_mm[0, :], x_mm))
print(euclidean(X_mm[1, :], x_mm))
# 60.0
# 40.01249804748511

# the same samples with heights in metres
X_m = np.array([
    [1.7, 1],
    [1.6, 0]
])
x_m = np.array([1.64, 1])
print(euclidean(X_m[0, :], x_m))
print(euclidean(X_m[1, :], x_m))
# 0.06000000000000005
# 1.0007996802557444

With heights in millimetres the height term dominates the distance and the second sample looks nearest to the query, while with heights in metres the sex feature dominates and the first sample does: the units of the features completely change which neighbour is "nearest".

  • Standardize the features (zero mean, unit variance per feature)
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # fit on the regression training data from section 4
X_test_scaled = scaler.transform(X_test)        # apply the same scaling to the test set

print(X_train)
print(X_train_scaled)
[[158   1]
 [170   1]
 [183   1]
 [191   1]
 [155   0]
 [163   0]
 [180   0]
 [158   0]
 [170   0]]
[[-0.9908706   1.11803399]
 [ 0.01869567  1.11803399]
 [ 1.11239246  1.11803399]
 [ 1.78543664  1.11803399]
 [-1.24326216 -0.89442719]
 [-0.57021798 -0.89442719]
 [ 0.86000089 -0.89442719]
 [-0.9908706  -0.89442719]
 [ 0.01869567 -0.89442719]]
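A quick check (not in the book) that each scaled column now has zero mean and unit standard deviation:
print(X_train_scaled.mean(axis=0))  # approximately [0. 0.], up to floating-point error
print(X_train_scaled.std(axis=0))   # [1. 1.]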
  • With standardized features the model's error is lower
clf.fit(X_train_scaled, y_train)  # refit the regressor on the scaled features
pred = clf.predict(X_test_scaled)
pred # array([78.        , 83.33333333, 54.        , 64.33333333])

# R^2 (coefficient of determination)
r2_score(y_test, pred) # 0.6706425961745109

# mean absolute error
mean_absolute_error(y_test, pred) # 7.583333333333336

# mean squared error
mean_squared_error(y_test, pred)  # 85.13888888888893
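
In practice the scaler and the regressor can be chained with a Pipeline, so the test data is always transformed with statistics learned from the training data only. A short sketch (not part of the book's example):

from sklearn.pipeline import make_pipeline

model = make_pipeline(StandardScaler(), KNeighborsRegressor(n_neighbors=K))
model.fit(X_train, y_train)   # the scaler is fitted inside fit, on the training data only
model.predict(X_test)         # test features are scaled automatically before prediction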