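Mathematical background:
Li Hang's 《統計學習方法》 (Statistical Learning Methods) describes the k-nearest neighbor (k-NN) algorithm, the k-NN model and its three elements (distance metric, choice of k, and classification decision rule), and then explains the data structure used to implement the algorithm, the kd-tree, together with the search algorithm that runs on it.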
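To make the three elements concrete, here is a minimal brute-force sketch (no kd-tree) in plain Python. The function names `minkowski_distance` and `knn_predict` and the toy data are made up for illustration and do not come from the book.

```python
from collections import Counter


def minkowski_distance(a, b, p=2):
    """Minkowski (L_p) distance; p=2 is Euclidean, p=1 is Manhattan."""
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1.0 / p)


def knn_predict(train_points, train_labels, query, k=3, p=2):
    """Classify `query` by majority vote among its k nearest training points."""
    # Element 1, distance metric: rank every training point by distance to the query.
    ranked = sorted(
        zip(train_points, train_labels),
        key=lambda pair: minkowski_distance(pair[0], query, p),
    )
    # Element 2, choice of k: keep only the k closest neighbors.
    nearest_labels = [label for _, label in ranked[:k]]
    # Element 3, decision rule: majority vote over the neighbors' labels.
    return Counter(nearest_labels).most_common(1)[0][0]


# Hypothetical toy data: two well-separated clusters.
toy_points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (5.0, 5.0), (5.2, 4.9), (4.8, 5.1)]
toy_labels = ["a", "a", "a", "b", "b", "b"]
print(knn_predict(toy_points, toy_labels, (1.1, 1.0), k=3))
```

Sorting all training points costs O(n log n) per query, which is the cost the kd-tree is meant to reduce.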
Some supplementary mathematical details:
https://www.cnblogs.com/eyeszjwang/articles/2429382.html
It explains how the kd-tree works, with an example and pseudocode.
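As a rough illustration of the idea, the sketch below builds a kd-tree by splitting on the median along axes that cycle with depth, then searches it for the single nearest neighbor. It is a simplified version written for these notes, not the pseudocode from the linked article; `Node`, `build_kdtree`, `nearest`, and the six sample points are illustrative choices.

```python
from collections import namedtuple

Node = namedtuple("Node", ["point", "axis", "left", "right"])


def build_kdtree(points, depth=0):
    """Recursively split on the median along an axis chosen by depth."""
    if not points:
        return None
    axis = depth % len(points[0])
    points = sorted(points, key=lambda pt: pt[axis])
    mid = len(points) // 2
    return Node(
        point=points[mid],
        axis=axis,
        left=build_kdtree(points[:mid], depth + 1),
        right=build_kdtree(points[mid + 1:], depth + 1),
    )


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest(node, query, best=None):
    """Return the stored point closest to `query` (Euclidean distance)."""
    if node is None:
        return best
    if best is None or squared_distance(node.point, query) < squared_distance(best, query):
        best = node.point
    diff = query[node.axis] - node.point[node.axis]
    near, far = (node.left, node.right) if diff < 0 else (node.right, node.left)
    # Descend first into the side of the splitting plane that contains the query.
    best = nearest(near, query, best)
    # Visit the other side only if the plane is closer than the best point so far.
    if diff ** 2 < squared_distance(best, query):
        best = nearest(far, query, best)
    return best


sample_points = [(2, 3), (5, 4), (9, 6), (4, 7), (8, 1), (7, 2)]
tree = build_kdtree(sample_points)
print(tree.point)               # root of the tree
print(nearest(tree, (3, 4.5)))  # nearest stored point to the query
```

The pruning test in `nearest` is what saves work: a subtree is skipped whenever its splitting plane lies farther from the query than the best candidate already found.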
Implementation in Python:
https://zhuanlan.zhihu.com/p/23191325
It introduces the KNeighborsClassifier class provided by scikit-learn (sklearn), including its parameters and main methods.
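For orientation, here is a small sketch of the class on made-up data, showing how the constructor parameters map onto the three k-NN elements. The toy points and the names `toy_X`, `toy_y`, and `knn` are assumptions for illustration only.

```python
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical toy data: two small clusters with string labels.
toy_X = [[0.0, 0.0], [0.1, 0.2], [0.2, 0.1], [1.0, 1.0], [0.9, 1.1], [1.1, 0.9]]
toy_y = ["low", "low", "low", "high", "high", "high"]

knn = KNeighborsClassifier(
    n_neighbors=3,        # k, the number of neighbors that vote
    weights="uniform",    # equal votes; "distance" weights votes by inverse distance
    algorithm="kd_tree",  # search structure; "auto", "ball_tree", "brute" are also accepted
    p=2,                  # Minkowski power: 2 is Euclidean, 1 is Manhattan
    metric="minkowski",
)
knn.fit(toy_X, toy_y)

toy_pred = knn.predict([[0.05, 0.1], [1.05, 1.0]])
# kneighbors returns the distances to, and training-set indices of, the nearest neighbors.
distances, indices = knn.kneighbors([[0.05, 0.1]], n_neighbors=2)
print(toy_pred)
print(indices)
```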
A Python example:
It was run in Jupyter Notebook, using the dataset from the KNN chapter of 《機器學習實戰》 (Machine Learning in Action).
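KNN algorithm
Convert a (black-and-white) image into a one-dimensional array
import numpy as np

def re_shape(filename):
    # Each file holds a 32x32 grid of '0'/'1' characters, one image row per line.
    return_matrix = np.zeros((1,1024))
    with open(filename) as inf:
        for i in range(32):
            row = inf.readline()
            for n in range(32):
                # Pixel (i, n) is stored at flat index 32*i + n.
                return_matrix[0,32*i+n] = int(row[n])
    return return_matrix[0]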
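The same flattening can be checked on a synthetic file. The sketch below is an alternative written for illustration, not the notebook's code: `image_text_to_vector` is a hypothetical helper that reads the grid with NumPy's `reshape`, and the demo file is generated on the fly rather than taken from the dataset.

```python
import os
import tempfile

import numpy as np


def image_text_to_vector(path, size=32):
    """Read a size x size grid of '0'/'1' characters into a flat float vector."""
    with open(path) as handle:
        rows = [handle.readline()[:size] for _ in range(size)]
    grid = np.array([[int(ch) for ch in row] for row in rows], dtype=float)
    # Row-major flattening, so pixel (i, n) lands at index size*i + n.
    return grid.reshape(-1)


# Synthetic 32x32 "image": all zeros except a single 1 at row 2, column 5.
lines = ["0" * 32 for _ in range(32)]
lines[2] = "0" * 5 + "1" + "0" * 26

with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "demo_digit.txt")
    with open(demo_path, "w") as handle:
        handle.write("\n".join(lines) + "\n")
    vector = image_text_to_vector(demo_path)

print(vector.shape, vector[32 * 2 + 5])
```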
Get the class labels (from the file names)
from sklearn.neighbors import KNeighborsClassifier
import os
labels = []
file_forder = "E:\\DataMining\\Project\\MLBook\\機器學習實戰源代碼\\machinelearninginaction\\Ch02\\digits\\trainingDigits"
trainingFileList = os.listdir(file_forder)
#print(trainingFileList)
for name in trainingFileList:
    # The label is the part of the file name before the first underscore.
    labels.append(name.split("_")[0])
Get the training data
X_train = []
for name in trainingFileList:
    fileneme = os.path.join(file_forder,name)
    row = re_shape(fileneme)
    # Store each image as a plain list of 1024 pixel values.
    X_train.append([n for n in row])
#print(X_train[:5])
Get the test data labels
testLabels = []
# file_forder now points at the test set.
file_forder = "E:\\DataMining\\Project\\MLBook\\機器學習實戰源代碼\\machinelearninginaction\\Ch02\\digits\\testDigits"
testFileList = os.listdir(file_forder)
#print(testFileList)
for name in testFileList:
    testLabels.append(name.split("_")[0])
Get the test data
X_test = []
for name in testFileList:
    fileneme = os.path.join(file_forder,name)
    row = re_shape(fileneme)
    X_test.append([n for n in row])
# k-NN classifier with k=1, equal-weight votes, and Euclidean distance (Minkowski with p=2).
clf = KNeighborsClassifier(n_neighbors=1, weights='uniform', algorithm='auto', p=2, metric='minkowski', metric_params=None)
clf.fit(X_train,labels)
Y_prected = clf.predict(X_test)
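Evaluate
from sklearn.metrics import accuracy_score
# accuracy_score takes (y_true, y_pred).
score = accuracy_score(testLabels, Y_prected)
print("When k is 5,the accuracy score is {}".format(score))
When k is 5,the accuracy score is 0.9809725158562368
score = accuracy_score(testLabels, Y_prected)
print("When k is 1,the accuracy score is {}".format(score))
When k is 1,the accuracy score is 0.9862579281183932
The k in each message is hard-coded. The classifier above uses n_neighbors=1, which matches the second result; the first result presumably comes from an earlier run with n_neighbors=5.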
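One way to keep the printed k in step with the classifier is to loop over the candidate values. The sketch below assumes scikit-learn's bundled 8x8 digits (`load_digits`) as a stand-in, because the 32x32 text files used above are not available here, so its scores are not comparable with the numbers reported above. The split parameters and variable names are illustrative.

```python
from sklearn.datasets import load_digits
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Stand-in dataset (assumption): scikit-learn's built-in 8x8 handwritten digits.
digits = load_digits()
X_tr, X_te, y_tr, y_te = train_test_split(
    digits.data, digits.target, test_size=0.25, random_state=0
)

scores_by_k = {}
for k in (1, 3, 5):
    model = KNeighborsClassifier(n_neighbors=k)
    model.fit(X_tr, y_tr)
    # Record the score under the k that actually produced it.
    scores_by_k[k] = accuracy_score(y_te, model.predict(X_te))
    print("When k is {}, the accuracy score is {}".format(k, scores_by_k[k]))
```

Because k is the loop variable, the message and the model cannot drift apart the way a hand-edited print statement can.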