I. The classification model (classification)
kNN (k-nearest neighbors) is about the simplest algorithm there is, yet it works surprisingly well in many settings.
The core idea of kNN: find the k nearest neighbors of the query sample in feature space, then let those k samples "vote" on the class of the sample to be classified.
II. Drawbacks:
1. Sensitive to class imbalance in the original data: when one class has far more samples than the others, it tends to dominate the vote among the k neighbors.
2. Poor interpretability, compared with decision trees.
III. Algorithm and mathematical derivation (pasted as screenshots; quoted from Li Hang's Statistical Learning Methods)
Evidently, there is no explicit "learning" step to be seen in kNN.
In Machine Learning, Zhou Zhihua calls this "lazy learning".
See also Z.-H. Zhou's paper ML-KNN: A lazy learning approach to multi-label learning.
In that book, Zhou compares kNN against the Bayes classifier and shows that the generalization error rate of the nearest-neighbor rule is at most twice that of the Bayes optimal classifier.
The original proof is in the paper Nearest Neighbor Pattern Classification (Cover & Hart).
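For reference, the asymptotic bound from that paper (stated here from memory, so treat it as a sketch): for a $c$-class problem, if $R^*$ is the Bayes error rate and $R$ the asymptotic error rate of the 1-NN rule, then
$$R^* \le R \le R^*\left(2 - \frac{c}{c-1}R^*\right) \le 2R^*.$$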
IV. Important parameters
1. Choosing k
k too large: a larger neighborhood gets a say in the prediction, so bias increases (approximation error) while variance decreases (estimation error).
k too small: the prediction relies on a very small neighborhood and easily overfits.
Choosing a suitable k: select it by cross-validation (cross validation).
2. A quick digression: cross-validation (cross validation)
10-fold cross-validation is the most common form: split the dataset into n folds, each round train on n-1 of them with the remaining fold held out as the test set, and average the test errors to get the estimated generalization error.
For details see the paper A Study of Cross-Validation and Bootstrap for Accuracy Estimation and Model Selection.
**Recently I've also been seeing LOOCV (Leave-One-Out CV):
each round holds out exactly one sample for testing. Are we sure this won't blow up into overfitting?.. well, to be fair, kNN doesn't seem to worry much about overfitting anyway.
Its biggest drawback is the sheer amount of computation it takes.
Here's also a slide deck on cross validation: cmu_cross_validation.
This is somewhat off-topic, but since it came up I'll finish the thought here.
Good habits: here's some CV code (CV as in cross-validation, not computer vision).
sklearn has some very handy packages; the official tutorial is genuinely a joy to read: scikit-learn: cross_validation.
For example: ①train_test_split: used a lot; randomly splits the dataset by a given percentage

from sklearn import datasets, svm
from sklearn.model_selection import train_test_split

# data/target can be any feature matrix and label vector; iris is assumed here
data, target = datasets.load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = \
    train_test_split(data, target, test_size=0.4, random_state=0)

②cross_val_score / cross_val_predict: CV scoring or CV predictions in a single call

from sklearn.model_selection import cross_val_score
clf = svm.SVC(kernel='linear', C=1)
scores = cross_val_score(clf, data, target, cv=5)

from sklearn.model_selection import cross_val_predict
predicted = cross_val_predict(clf, data, target, cv=10)
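As a quick follow-up to the snippet above (the summary line is my own, in the style of the sklearn docs), the five scores are usually reported as mean plus spread:

# summarize the 5 CV scores computed above
print("accuracy: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))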
③The main event: KFold
import numpy as np
from sklearn.model_selection import KFold
X = ["a", "b", "c", "d"]
kf = KFold(n_splits=2)
for train, test in kf.split(X):
    print("%s %s" % (train, test))
#[2 3] [0 1]
#[0 1] [2 3]
④LOO: much the same as KFold; it's really just the special case of KFold where n_splits equals the number of samples
from sklearn.model_selection import LeaveOneOut
X = [1, 2, 3, 4]
loo = LeaveOneOut()
for train, test in loo.split(X):
    print("%s %s" % (train, test))
#[1 2 3] [0]
#[0 2 3] [1]
#[0 1 3] [2]
#[0 1 2] [3]
⑤And of course there's a splitter for time series: TimeSeriesSplit
from sklearn.model_selection import TimeSeriesSplit
X = np.array([[1,2],[3,4],[1,2],[3,4],[1,2],[3,4]])
tscv = TimeSeriesSplit(n_splits=3)
for train, test in tscv.split(X):
    print("%s %s" % (train, test))
#[0 1 2] [3]
#[0 1 2 3] [4]
#[0 1 2 3 4] [5]
The official source code really is excellent; of course you can also write these helpers yourself~
ShuffleSplit: shuffles and re-splits the data anew on every iteration, similar to KFold, except the test sets of different iterations may overlap.
All in all, the job of CV here is to tell us which value of k gives the best model~!
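To make that concrete, here is a minimal sketch of my own (not from any book) that picks k for kNN with cross_val_score on iris:

import numpy as np
from sklearn import datasets
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

iris = datasets.load_iris()
# try odd k values and keep the one with the best mean 10-fold CV accuracy
k_range = range(1, 30, 2)
cv_scores = [cross_val_score(KNeighborsClassifier(n_neighbors=k),
                             iris.data, iris.target, cv=10).mean()
             for k in k_range]
best_k = list(k_range)[int(np.argmax(cv_scores))]
print(best_k, max(cv_scores))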
3. The distance metric (applicable in many other settings too)
The distances commonly used in kNN are all special cases of the Minkowski (Lp) distance: the Manhattan distance (p = 1), the Euclidean distance (p = 2), and the Chebyshev distance (p = ∞).
Let the feature space $\mathcal{X}$ be the $n$-dimensional real vector space $\mathbb{R}^n$, with $x_i, x_j \in \mathcal{X}$, $x_i = (x_i^{(1)}, x_i^{(2)}, \ldots, x_i^{(n)})^T$, $x_j = (x_j^{(1)}, x_j^{(2)}, \ldots, x_j^{(n)})^T$. The $L_p$ distance between $x_i$ and $x_j$ is defined as
$$L_p(x_i, x_j) = \left(\sum_{l=1}^{n} \left|x_i^{(l)} - x_j^{(l)}\right|^p\right)^{1/p},$$
where $p \ge 1$.
When $p = 1$ it is the Manhattan distance (Manhattan distance):
$$L_1(x_i, x_j) = \sum_{l=1}^{n} \left|x_i^{(l)} - x_j^{(l)}\right|.$$
When $p = 2$ it is the Euclidean distance (Euclidean distance):
$$L_2(x_i, x_j) = \left(\sum_{l=1}^{n} \left|x_i^{(l)} - x_j^{(l)}\right|^2\right)^{1/2}.$$
When $p = \infty$ it is the maximum of the coordinate-wise distances:
$$L_\infty(x_i, x_j) = \max_{l} \left|x_i^{(l)} - x_j^{(l)}\right|.$$
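A quick numpy sketch of the three special cases (the sample vectors are made up for illustration):

import numpy as np

xi = np.array([1.0, 2.0, 3.0])
xj = np.array([4.0, 0.0, 3.0])

manhattan = np.sum(np.abs(xi - xj))          # p = 1        -> 5.0
euclidean = np.sqrt(np.sum((xi - xj) ** 2))  # p = 2        -> ~3.606
chebyshev = np.max(np.abs(xi - xj))          # p = infinity -> 3.0
print(manhattan, euclidean, chebyshev)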
V. kNN code
Start with a from-scratch version, knn.py:
# dataSet: the training set; inX: the feature vector to classify; k: number of neighbors
from numpy import tile
import operator

def classify0(inX, dataSet, labels, k):
    dataSetSize = dataSet.shape[0]
    # Euclidean distance
    diffMat = tile(inX, (dataSetSize, 1)) - dataSet
    sqDiffMat = diffMat ** 2
    sqDistances = sqDiffMat.sum(axis=1)
    distances = sqDistances ** 0.5
    # take the k points with the smallest distances
    sortedDistIndicies = distances.argsort()
    classCount = {}
    for i in range(k):
        voteIlabel = labels[sortedDistIndicies[i]]
        classCount[voteIlabel] = classCount.get(voteIlabel, 0) + 1
    # sort by vote count, descending
    sortedClassCount = sorted(classCount.items(), key=operator.itemgetter(1), reverse=True)
    return sortedClassCount[0][0]
So easy. Code walkthrough:
First, subtract each training sample's features from inX (broadcast via tile), compute the Euclidean distance to every training point, and sort those distances in ascending order.
Next, declare a dict classCount whose keys are the labels of the k nearest points; every time a key repeats, its value is incremented by 1.
Sort the dict entries by value in descending order and return the key of the first entry; that key is the result of the "vote".
ps: argsort(x) returns the list of indices that sorts x in ascending order; those indices are then used to look up the corresponding y values in labels.
Note that py3 replaces iteritems() with items(), and the cmp function is gone too: sorted no longer accepts a raw comparison function, though you can write your own wrapper.
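For instance, if you do want comparator-style sorting in py3, functools.cmp_to_key can wrap an old-style cmp function (a minimal sketch of my own):

import functools

def cmp_by_count(a, b):
    # a, b are (label, count) pairs; a negative result puts a first -> sort by count, descending
    return b[1] - a[1]

classCount = {'A': 3, 'B': 5}
sortedClassCount = sorted(classCount.items(), key=functools.cmp_to_key(cmp_by_count))
print(sortedClassCount[0][0])   # 'B'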
We can test it with the split from earlier; first, load the data:
from sklearn.model_selection import train_test_split
from sklearn import datasets

def createDataSet():
    iris = datasets.load_iris()
    iris_X = iris.data
    iris_y = iris.target
    X_train, X_test, y_train, y_test = \
        train_test_split(iris_X, iris_y, test_size=0.4, random_state=0)
    return X_train, X_test, y_train, y_test
Then we feed both the training set and the test set into the model:
import numpy as np
import knn as ks   # the knn.py module written above

def predict():
    X_train, X_test, y_train, y_test = createDataSet()
    k = 0
    p = np.zeros(y_test.shape)
    for i in range(X_test.shape[0]):
        p[i] = ks.classify0(X_test[i:i+1], X_train, y_train, 3)
        if p[i] == y_test[i]: k += 1
    accuracy = k / p.shape[0]
    print(accuracy)
That completes a kNN test on the iris dataset; the accuracy comes out to 93.333%.
Now suppose we're not entirely happy with that result and want to use CV to find the k with the highest accuracy:
from sklearn.model_selection import KFold

def kfold_predict():
    iris = datasets.load_iris()
    iris_X = iris.data
    iris_y = iris.target
    kf = KFold(n_splits=10)
    accuracy = 0
    for train_index, test_index in kf.split(iris_X):
        #print("TRAIN:", train_index, "TEST:", test_index)
        X_train, X_test = iris_X[train_index], iris_X[test_index]
        y_train, y_test = iris_y[train_index], iris_y[test_index]
        #print(y_train, y_test)
        k = 0
        p = np.zeros(y_test.shape)
        for i in range(X_test.shape[0]):
            p[i] = ks.classify0(X_test[i:i+1], X_train, y_train, 3)
            if p[i] == y_test[i]: k += 1
        accuracy += k / p.shape[0]
    print("%s" % (accuracy / 10))
OK, then just wrap a for loop around it to try different k values (a sketch follows below); I won't keep belaboring this here.
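A rough version of that loop (my own code; it reuses the imports and the ks module from the snippets above):

def choose_k(max_k=15):
    iris = datasets.load_iris()
    kf = KFold(n_splits=10)
    for k in range(1, max_k + 1, 2):   # odd k values help avoid tied votes
        accuracy = 0
        for train_index, test_index in kf.split(iris.data):
            X_train, X_test = iris.data[train_index], iris.data[test_index]
            y_train, y_test = iris.target[train_index], iris.target[test_index]
            correct = sum(ks.classify0(X_test[i:i+1], X_train, y_train, k) == y_test[i]
                          for i in range(X_test.shape[0]))
            accuracy += correct / y_test.shape[0]
        print(k, accuracy / 10)   # mean 10-fold accuracy for this k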
VI. kNN worked examples
Here I borrow a few code samples from the book Machine Learning in Action for illustration.
1. Using kNN on a dating site to improve match suggestions
Preparation:
①A helper that parses the data file, file.py:

from numpy import zeros

def file2matrix(filename):
    fr = open(filename)
    numberOfLines = len(fr.readlines())      # get the number of lines in the file
    returnMat = zeros((numberOfLines, 3))    # prepare matrix to return
    classLabelVector = []                    # prepare labels to return
    fr = open(filename)
    index = 0
    # parse the file into lists
    for line in fr.readlines():
        line = line.strip()                  # strip the trailing newline
        listFromLine = line.split('\t')      # split the row into a list of fields
        returnMat[index, :] = listFromLine[0:3]
        classLabelVector.append(int(listFromLine[-1]))
        index += 1
    return returnMat, classLabelVector
②Scatter plots with matplotlib, a wonderful package; I'll write that part up later, it's getting late.
③Normalize the features:

from numpy import zeros, shape, tile

def autoNorm(dataSet):
    minVals = dataSet.min(0)    # the argument 0 takes the minimum of each column rather than each row
    maxVals = dataSet.max(0)
    ranges = maxVals - minVals  # to normalize, subtract the minimum from the current value, then divide by the range
    normDataSet = zeros(shape(dataSet))
    m = dataSet.shape[0]
    # numpy's tile copies minVals/ranges into matrices the same size as the input
    normDataSet = dataSet - tile(minVals, (m, 1))
    normDataSet = normDataSet / tile(ranges, (m, 1))  # element-wise divide
    return normDataSet, ranges, minVals
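The formula behind autoNorm is plain min-max scaling, $x' = \frac{x - \min}{\max - \min}$, which maps every feature column into [0, 1] so that no single feature dominates the distance.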
④Prediction:

def datingClassTest():
    hoRatio = 0.10   # hold out the first 10% as the test set
    datingDataMat, datingLabels = file2matrix('datingTestSet.txt')
    normMat, ranges, minVals = autoNorm(datingDataMat)
    m = normMat.shape[0]
    numTestVecs = int(m * hoRatio)
    errorCount = 0.0
    for i in range(numTestVecs):
        classifierResult = classify0(normMat[i, :], normMat[numTestVecs:m, :], datingLabels[numTestVecs:m], 3)
        print("the classifier came back with: %d, the real answer is: %d" % (classifierResult, datingLabels[i]))
        if classifierResult != datingLabels[i]: errorCount += 1.0
    print("the total error rate is: %f" % (errorCount / float(numTestVecs)))
    print(errorCount)
from numpy import array

def classifyPerson():
    resultList = ['not at all', 'in small doses', 'in large doses']
    # the book's py2 raw_input becomes input in py3
    percentTats = float(input("percentage of time spent playing video games?"))
    ffMiles = float(input("frequent flier miles earned per year?"))
    iceCream = float(input("liters of ice cream consumed per year?"))
    datingDataMat, datingLabels = file2matrix('datingTestSet2.txt')
    normMat, ranges, minVals = autoNorm(datingDataMat)
    inArr = array([ffMiles, percentTats, iceCream])
    classifierResult = int(classify0((inArr - minVals) / ranges, normMat, datingLabels, 3))
    print("You will probably like this person:", resultList[classifierResult - 1])
2. The classic handwritten-digit recognition problem, which can be solved in ten thousand different ways
①Process the image data, img.py:
from numpy import zeros

def img2vector(filename):
    returnVect = zeros((1, 1024))
    fr = open(filename)
    for i in range(32):
        lineStr = fr.readline()
        for j in range(32):
            returnVect[0, 32*i+j] = int(lineStr[j])
    return returnVect
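A quick sanity check of img2vector (the filename is just an example; use whichever file exists in your testDigits folder):

testVector = img2vector('testDigits/0_13.txt')
print(testVector[0, 0:31])   # the first 31 pixels of the flattened 32x32 image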
②The main program for handwriting, main.py; it imports the knn module:

from os import listdir

def handwritingClassTest():
    hwLabels = []
    trainingFileList = listdir('trainingDigits')   # load the training set
    m = len(trainingFileList)
    trainingMat = zeros((m, 1024))
    for i in range(m):
        fileNameStr = trainingFileList[i]
        fileStr = fileNameStr.split('.')[0]        # take off .txt
        classNumStr = int(fileStr.split('_')[0])
        hwLabels.append(classNumStr)
        trainingMat[i, :] = img2vector('trainingDigits/%s' % fileNameStr)
    testFileList = listdir('testDigits')           # iterate through the test set
    errorCount = 0.0
    mTest = len(testFileList)
    for i in range(mTest):
        fileNameStr = testFileList[i]
        fileStr = fileNameStr.split('.')[0]        # take off .txt
        classNumStr = int(fileStr.split('_')[0])
        vectorUnderTest = img2vector('testDigits/%s' % fileNameStr)
        classifierResult = classify0(vectorUnderTest, trainingMat, hwLabels, 3)
        print("the classifier came back with: %d, the real answer is: %d" % (classifierResult, classNumStr))
        if classifierResult != classNumStr: errorCount += 1.0
    print("\nthe total number of errors is: %d" % errorCount)
    print("\nthe total error rate is: %f" % (errorCount / float(mTest)))
VII. scikit-learn
It ships a kNN package out of the box; sry for the long wait, though the source code is also excellent.
#Calling the KNN classifier
import numpy as np
from sklearn import datasets
from sklearn.model_selection import train_test_split

iris = datasets.load_iris()
iris_X = iris.data
iris_y = iris.target
np.unique(iris_y)
# Split iris data in train and test data
np.random.seed(0)
# np.random.permutation would give a random index ordering for splitting by hand;
# here train_test_split performs the random split for us
iris_X_train, iris_X_test, iris_y_train, iris_y_test = \
    train_test_split(iris_X, iris_y, test_size=0.4, random_state=0)
# Create and fit a nearest-neighbor classifier
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier()
knn.fit(iris_X_train, iris_y_train)
# the fitted estimator echoes its parameters:
# KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
#                      metric_params=None, n_jobs=1, n_neighbors=5, p=2,
#                      weights='uniform')
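To actually use the fitted model (a quick check of my own; predict and score are standard KNeighborsClassifier methods):

print(knn.predict(iris_X_test))             # predicted labels for the test set
print(knn.score(iris_X_test, iris_y_test))  # mean accuracy on the test set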
One more pointer to the official tutorial: sklearn.neighbors.KNeighborsClassifier.
Written on April 23; running a bit low on steam. Lately I've also been doing some time series analysis with friends, which looks like it'll be a lot of fun~
I'll come back and fill in the gaps when I have time.
One more aside: I just found that OpenCV is acting up, which probably bites a lot of people:
open_cv wheel: after downloading, cd into the folder; installing with either pip or conda works.