I. The classification model (classification)
kNN (k-nearest neighbors) is about the simplest algorithm there is, yet it works surprisingly well in many settings.
The core idea of kNN: find the k nearest neighbors of the query sample in feature space, then let those k samples "vote" on the class of the sample to be classified.
II. Drawbacks:
1. Sensitive to class imbalance in the original data: when one class has far more samples than the others, it tends to dominate the vote among the k neighbors.
2. Poor interpretability, compared with decision trees.
III. Algorithm and mathematical derivation (pasted as screenshots; quoted from Li Hang's Statistical Learning Methods)
Evidently, there is no explicit "learning" step to be seen in kNN.
In Machine Learning, Zhou Zhihua calls this "lazy learning".
See also Z.-H. Zhou's paper ML-KNN: A lazy learning approach to multi-label learning.
In that book, Zhou compares kNN against the Bayes classifier and shows that the generalization error rate of the nearest-neighbor rule is at most twice that of the Bayes optimal classifier.
The original proof is in the paper Nearest Neighbor Pattern Classification (Cover & Hart).
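For reference, the asymptotic bound from that paper (stated here from memory, so treat it as a sketch): for a $c$-class problem, if $R^*$ is the Bayes error rate and $R$ the asymptotic error rate of the 1-NN rule, then
$$R^* \le R \le R^*\left(2 - \frac{c}{c-1}R^*\right) \le 2R^*.$$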
IV. Important parameters
1. Choosing k
k too large: a larger neighborhood gets a say in the prediction, so bias increases (approximation error) while variance decreases (estimation error).
k too small: the prediction relies on a very small neighborhood and easily overfits.
Choosing a suitable k: select it by cross-validation (cross validation).
2. A quick digression: cross-validation (cross validation)
10-fold cross-validation is the most common form: split the dataset into n folds, each round train on n-1 of them with the remaining fold held out as the test set, and average the test errors to get the estimated generalization error.
For details see the paper A Study of Cross-Validation and Bootstrap for Accuracy Estimation and Model Selection.
**Recently I've also been seeing LOOCV (Leave-One-Out CV):
each round holds out exactly one sample for testing. Are we sure this won't blow up into overfitting?.. well, to be fair, kNN doesn't seem to worry much about overfitting anyway.
Its biggest drawback is the sheer amount of computation it takes.
Here's also a slide deck on cross validation: cmu_cross_validation.
This is somewhat off-topic, but since it came up I'll finish the thought here.
Good habits: here's some CV code (CV as in cross-validation, not computer vision).
sklearn has some very handy packages; the official tutorial is genuinely a joy to read: scikit-learn: cross_validation.
For example: ①train_test_split: used a lot; randomly splits the dataset by a given percentage

from sklearn import datasets, svm
from sklearn.model_selection import train_test_split

# data/target can be any feature matrix and label vector; iris is assumed here
data, target = datasets.load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = \
    train_test_split(data, target, test_size=0.4, random_state=0)

②cross_val_score / cross_val_predict: CV scoring or CV predictions in a single call

from sklearn.model_selection import cross_val_score
clf = svm.SVC(kernel='linear', C=1)
scores = cross_val_score(clf, data, target, cv=5)

from sklearn.model_selection import cross_val_predict
predicted = cross_val_predict(clf, data, target, cv=10)
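As a quick follow-up to the snippet above (the summary line is my own, in the style of the sklearn docs), the five scores are usually reported as mean plus spread:

# summarize the 5 CV scores computed above
print("accuracy: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))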
③The main event: KFold
import numpy as np
from sklearn.model_selection import KFold
X = ["a", "b", "c", "d"]
kf = KFold(n_splits=2)
for train, test in kf.split(X):
    print("%s %s" % (train, test))
#[2 3] [0 1]
#[0 1] [2 3]
④LOO: much the same as KFold; it's really just the special case of KFold where n_splits equals the number of samples
from sklearn.model_selection import LeaveOneOut
X = [1, 2, 3, 4]
loo = LeaveOneOut()
for train, test in loo.split(X):
    print("%s %s" % (train, test))
#[1 2 3] [0]
#[0 2 3] [1]
#[0 1 3] [2]
#[0 1 2] [3]
⑤And of course there's a splitter for time series: TimeSeriesSplit
from sklearn.model_selection import TimeSeriesSplit
X = np.array([[1,2],[3,4],[1,2],[3,4],[1,2],[3,4]])
tscv = TimeSeriesSplit(n_splits=3)
for train, test in tscv.split(X):
    print("%s %s" % (train, test))
#[0 1 2] [3]
#[0 1 2 3] [4]
#[0 1 2 3 4] [5]
The official source code really is excellent; of course you can also write these helpers yourself~
ShuffleSplit: shuffles and re-splits the data anew on every iteration, similar to KFold, except the test sets of different iterations may overlap.
All in all, the job of CV here is to tell us which value of k gives the best model~!
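To make that concrete, here is a minimal sketch of my own (not from any book) that picks k for kNN with cross_val_score on iris:

import numpy as np
from sklearn import datasets
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

iris = datasets.load_iris()
# try odd k values and keep the one with the best mean 10-fold CV accuracy
k_range = range(1, 30, 2)
cv_scores = [cross_val_score(KNeighborsClassifier(n_neighbors=k),
                             iris.data, iris.target, cv=10).mean()
             for k in k_range]
best_k = list(k_range)[int(np.argmax(cv_scores))]
print(best_k, max(cv_scores))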
3. The distance metric (applicable in many other settings too)
The distances commonly used in kNN are all special cases of the Minkowski (Lp) distance: the Manhattan distance (p = 1), the Euclidean distance (p = 2), and the Chebyshev distance (p = ∞).
Let the feature space $\mathcal{X}$ be the $n$-dimensional real vector space $\mathbb{R}^n$, with $x_i, x_j \in \mathcal{X}$, $x_i = (x_i^{(1)}, x_i^{(2)}, \ldots, x_i^{(n)})^T$, $x_j = (x_j^{(1)}, x_j^{(2)}, \ldots, x_j^{(n)})^T$. The $L_p$ distance between $x_i$ and $x_j$ is defined as
$$L_p(x_i, x_j) = \left(\sum_{l=1}^{n} \left|x_i^{(l)} - x_j^{(l)}\right|^p\right)^{1/p},$$
where $p \ge 1$.
When $p = 1$ it is the Manhattan distance (Manhattan distance):
$$L_1(x_i, x_j) = \sum_{l=1}^{n} \left|x_i^{(l)} - x_j^{(l)}\right|.$$
When $p = 2$ it is the Euclidean distance (Euclidean distance):
$$L_2(x_i, x_j) = \left(\sum_{l=1}^{n} \left|x_i^{(l)} - x_j^{(l)}\right|^2\right)^{1/2}.$$
When $p = \infty$ it is the maximum of the coordinate-wise distances:
$$L_\infty(x_i, x_j) = \max_{l} \left|x_i^{(l)} - x_j^{(l)}\right|.$$
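A quick numpy sketch of the three special cases (the sample vectors are made up for illustration):

import numpy as np

xi = np.array([1.0, 2.0, 3.0])
xj = np.array([4.0, 0.0, 3.0])

manhattan = np.sum(np.abs(xi - xj))          # p = 1        -> 5.0
euclidean = np.sqrt(np.sum((xi - xj) ** 2))  # p = 2        -> ~3.606
chebyshev = np.max(np.abs(xi - xj))          # p = infinity -> 3.0
print(manhattan, euclidean, chebyshev)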
V. kNN code
Start with a from-scratch version, knn.py:
# dataSet: the training set; inX: the feature vector to classify; k: number of neighbors
from numpy import tile
import operator

def classify0(inX, dataSet, labels, k):
    dataSetSize = dataSet.shape[0]
    # Euclidean distance
    diffMat = tile(inX, (dataSetSize, 1)) - dataSet
    sqDiffMat = diffMat ** 2
    sqDistances = sqDiffMat.sum(axis=1)
    distances = sqDistances ** 0.5
    # take the k points with the smallest distances
    sortedDistIndicies = distances.argsort()
    classCount = {}
    for i in range(k):
        voteIlabel = labels[sortedDistIndicies[i]]
        classCount[voteIlabel] = classCount.get(voteIlabel, 0) + 1
    # sort by vote count, descending
    sortedClassCount = sorted(classCount.items(), key=operator.itemgetter(1), reverse=True)
    return sortedClassCount[0][0]
So easy. Code walkthrough:
First, subtract each training sample's features from inX (broadcast via tile), compute the Euclidean distance to every training point, and sort those distances in ascending order.
Next, declare a dict classCount whose keys are the labels of the k nearest points; every time a key repeats, its value is incremented by 1.
Sort the dict entries by value in descending order and return the key of the first entry; that key is the result of the "vote".
ps: argsort(x) returns the list of indices that sorts x in ascending order; those indices are then used to look up the corresponding y values in labels.
Note that py3 replaces iteritems() with items(), and the cmp function is gone too: sorted no longer accepts a raw comparison function, though you can write your own wrapper.
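For instance, if you do want comparator-style sorting in py3, functools.cmp_to_key can wrap an old-style cmp function (a minimal sketch of my own):

import functools

def cmp_by_count(a, b):
    # a, b are (label, count) pairs; a negative result puts a first -> sort by count, descending
    return b[1] - a[1]

classCount = {'A': 3, 'B': 5}
sortedClassCount = sorted(classCount.items(), key=functools.cmp_to_key(cmp_by_count))
print(sortedClassCount[0][0])   # 'B'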
We can test it with the split from earlier; first, load the data:
from sklearn.model_selection import train_test_split
from sklearn import datasets

def createDataSet():
    iris = datasets.load_iris()
    iris_X = iris.data
    iris_y = iris.target
    X_train, X_test, y_train, y_test = \
        train_test_split(iris_X, iris_y, test_size=0.4, random_state=0)
    return X_train, X_test, y_train, y_test
Then we feed both the training set and the test set into the model:
import numpy as np
import knn as ks   # the knn.py module written above

def predict():
    X_train, X_test, y_train, y_test = createDataSet()
    k = 0
    p = np.zeros(y_test.shape)
    for i in range(X_test.shape[0]):
        p[i] = ks.classify0(X_test[i:i+1], X_train, y_train, 3)
        if p[i] == y_test[i]: k += 1
    accuracy = k / p.shape[0]
    print(accuracy)
That completes a kNN test on the iris dataset; the accuracy comes out to 93.333%.
Now suppose we're not entirely happy with that result and want to use CV to find the k with the highest accuracy:
from sklearn.model_selection import KFold

def kfold_predict():
    iris = datasets.load_iris()
    iris_X = iris.data
    iris_y = iris.target
    kf = KFold(n_splits=10)
    accuracy = 0
    for train_index, test_index in kf.split(iris_X):
        #print("TRAIN:", train_index, "TEST:", test_index)
        X_train, X_test = iris_X[train_index], iris_X[test_index]
        y_train, y_test = iris_y[train_index], iris_y[test_index]
        #print(y_train, y_test)
        k = 0
        p = np.zeros(y_test.shape)
        for i in range(X_test.shape[0]):
            p[i] = ks.classify0(X_test[i:i+1], X_train, y_train, 3)
            if p[i] == y_test[i]: k += 1
        accuracy += k / p.shape[0]
    print("%s" % (accuracy / 10))
OK, then just wrap a for loop around it to try different k values (a sketch follows below); I won't keep belaboring this here.
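A rough version of that loop (my own code; it reuses the imports and the ks module from the snippets above):

def choose_k(max_k=15):
    iris = datasets.load_iris()
    kf = KFold(n_splits=10)
    for k in range(1, max_k + 1, 2):   # odd k values help avoid tied votes
        accuracy = 0
        for train_index, test_index in kf.split(iris.data):
            X_train, X_test = iris.data[train_index], iris.data[test_index]
            y_train, y_test = iris.target[train_index], iris.target[test_index]
            correct = sum(ks.classify0(X_test[i:i+1], X_train, y_train, k) == y_test[i]
                          for i in range(X_test.shape[0]))
            accuracy += correct / y_test.shape[0]
        print(k, accuracy / 10)   # mean 10-fold accuracy for this k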
VI. kNN worked examples
Here I borrow a few code samples from the book Machine Learning in Action for illustration.
1. Using kNN on a dating site to improve match suggestions
Preparation:
①A helper that parses the data file, file.py:

from numpy import zeros

def file2matrix(filename):
    fr = open(filename)
    numberOfLines = len(fr.readlines())      # get the number of lines in the file
    returnMat = zeros((numberOfLines, 3))    # prepare matrix to return
    classLabelVector = []                    # prepare labels to return
    fr = open(filename)
    index = 0
    # parse the file into lists
    for line in fr.readlines():
        line = line.strip()                  # strip the trailing newline
        listFromLine = line.split('\t')      # split the row into a list of fields
        returnMat[index, :] = listFromLine[0:3]
        classLabelVector.append(int(listFromLine[-1]))
        index += 1
    return returnMat, classLabelVector
②Scatter plots with matplotlib, a wonderful package; I'll write that part up later, it's getting late.
③Normalize the features:

from numpy import zeros, shape, tile

def autoNorm(dataSet):
    minVals = dataSet.min(0)    # the argument 0 takes the minimum of each column rather than each row
    maxVals = dataSet.max(0)
    ranges = maxVals - minVals  # to normalize, subtract the minimum from the current value, then divide by the range
    normDataSet = zeros(shape(dataSet))
    m = dataSet.shape[0]
    # numpy's tile copies minVals/ranges into matrices the same size as the input
    normDataSet = dataSet - tile(minVals, (m, 1))
    normDataSet = normDataSet / tile(ranges, (m, 1))  # element-wise divide
    return normDataSet, ranges, minVals
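The formula behind autoNorm is plain min-max scaling, $x' = \frac{x - \min}{\max - \min}$, which maps every feature column into [0, 1] so that no single feature dominates the distance.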
④Prediction:

def datingClassTest():
    hoRatio = 0.10   # hold out the first 10% as the test set
    datingDataMat, datingLabels = file2matrix('datingTestSet.txt')
    normMat, ranges, minVals = autoNorm(datingDataMat)
    m = normMat.shape[0]
    numTestVecs = int(m * hoRatio)
    errorCount = 0.0
    for i in range(numTestVecs):
        classifierResult = classify0(normMat[i, :], normMat[numTestVecs:m, :], datingLabels[numTestVecs:m], 3)
        print("the classifier came back with: %d, the real answer is: %d" % (classifierResult, datingLabels[i]))
        if classifierResult != datingLabels[i]: errorCount += 1.0
    print("the total error rate is: %f" % (errorCount / float(numTestVecs)))
    print(errorCount)
from numpy import array

def classifyPerson():
    resultList = ['not at all', 'in small doses', 'in large doses']
    # the book's py2 raw_input becomes input in py3
    percentTats = float(input("percentage of time spent playing video games?"))
    ffMiles = float(input("frequent flier miles earned per year?"))
    iceCream = float(input("liters of ice cream consumed per year?"))
    datingDataMat, datingLabels = file2matrix('datingTestSet2.txt')
    normMat, ranges, minVals = autoNorm(datingDataMat)
    inArr = array([ffMiles, percentTats, iceCream])
    classifierResult = int(classify0((inArr - minVals) / ranges, normMat, datingLabels, 3))
    print("You will probably like this person:", resultList[classifierResult - 1])
2. The classic handwritten-digit recognition problem, which can be solved in ten thousand different ways
①Process the image data, img.py:
from numpy import zeros

def img2vector(filename):
    returnVect = zeros((1, 1024))
    fr = open(filename)
    for i in range(32):
        lineStr = fr.readline()
        for j in range(32):
            returnVect[0, 32*i+j] = int(lineStr[j])
    return returnVect
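A quick sanity check of img2vector (the filename is just an example; use whichever file exists in your testDigits folder):

testVector = img2vector('testDigits/0_13.txt')
print(testVector[0, 0:31])   # the first 31 pixels of the flattened 32x32 image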
②The main program for handwriting, main.py; it imports the knn module:

from os import listdir

def handwritingClassTest():
    hwLabels = []
    trainingFileList = listdir('trainingDigits')   # load the training set
    m = len(trainingFileList)
    trainingMat = zeros((m, 1024))
    for i in range(m):
        fileNameStr = trainingFileList[i]
        fileStr = fileNameStr.split('.')[0]        # take off .txt
        classNumStr = int(fileStr.split('_')[0])
        hwLabels.append(classNumStr)
        trainingMat[i, :] = img2vector('trainingDigits/%s' % fileNameStr)
    testFileList = listdir('testDigits')           # iterate through the test set
    errorCount = 0.0
    mTest = len(testFileList)
    for i in range(mTest):
        fileNameStr = testFileList[i]
        fileStr = fileNameStr.split('.')[0]        # take off .txt
        classNumStr = int(fileStr.split('_')[0])
        vectorUnderTest = img2vector('testDigits/%s' % fileNameStr)
        classifierResult = classify0(vectorUnderTest, trainingMat, hwLabels, 3)
        print("the classifier came back with: %d, the real answer is: %d" % (classifierResult, classNumStr))
        if classifierResult != classNumStr: errorCount += 1.0
    print("\nthe total number of errors is: %d" % errorCount)
    print("\nthe total error rate is: %f" % (errorCount / float(mTest)))
VII. scikit-learn
It ships a kNN package out of the box; sry for the long wait, though the source code is also excellent.
#Calling the KNN classifier
import numpy as np
from sklearn import datasets
from sklearn.model_selection import train_test_split

iris = datasets.load_iris()
iris_X = iris.data
iris_y = iris.target
np.unique(iris_y)
# Split iris data in train and test data
np.random.seed(0)
# np.random.permutation would give a random index ordering for splitting by hand;
# here train_test_split performs the random split for us
iris_X_train, iris_X_test, iris_y_train, iris_y_test = \
    train_test_split(iris_X, iris_y, test_size=0.4, random_state=0)
# Create and fit a nearest-neighbor classifier
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier()
knn.fit(iris_X_train, iris_y_train)
# the fitted estimator echoes its parameters:
# KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
#                      metric_params=None, n_jobs=1, n_neighbors=5, p=2,
#                      weights='uniform')
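To actually use the fitted model (a quick check of my own; predict and score are standard KNeighborsClassifier methods):

print(knn.predict(iris_X_test))             # predicted labels for the test set
print(knn.score(iris_X_test, iris_y_test))  # mean accuracy on the test set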
One more pointer to the official tutorial: sklearn.neighbors.KNeighborsClassifier.
Written on April 23; running a bit low on steam. Lately I've also been doing some time series analysis with friends, which looks like it'll be a lot of fun~
I'll come back and fill in the gaps when I have time.
One more aside: I just found that OpenCV is acting up, which probably bites a lot of people:
open_cv wheel: after downloading, cd into the folder; installing with either pip or conda works.