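1. KNN Classification Algorithm
The KNN classification algorithm (K-Nearest-Neighbors Classification), also called the k-nearest-neighbor algorithm, is conceptually very simple yet often classifies well.
Its core idea: to decide which class a test sample belongs to, find the K training samples closest to it by some "distance", see which class most of those K samples belong to, and assign the test sample to that class. Put simply, the K most similar samples vote on the answer.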
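The voting idea above can be sketched in a few lines. This is only an illustration: the helper name `knn_predict`, the use of Euclidean distance, and the toy 2-D points are assumptions made for the example, not part of the program shown later.

```python
import math
from collections import Counter


def knn_predict(train, query, k):
    """Return the majority label among the k training points nearest to query.

    train is a list of (features, label) pairs; query is a list of features.
    """
    def distance(p, q):
        # Euclidean distance between two equal-length feature lists.
        return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

    # Order the training points from nearest to farthest.
    by_distance = sorted(train, key=lambda item: distance(item[0], query))
    # The k nearest points each cast one vote for their own label.
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]


# Toy data, made up for illustration only.
train = [
    ([1.0, 1.0], "A"),
    ([1.2, 0.8], "A"),
    ([0.9, 1.1], "A"),
    ([5.0, 5.0], "B"),
    ([5.2, 4.9], "B"),
]

print(knn_predict(train, [1.1, 1.0], k=3))  # A
print(knn_predict(train, [4.8, 5.1], k=3))  # B
```

In the second query the three nearest points are two "B" points and one "A" point, so "B" wins the vote 2 to 1.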
Dataset: machine-learning-databases/iris
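Dataset information:
This is perhaps the best-known dataset in the pattern recognition literature. Fisher's paper is a classic in the field and is still cited frequently (see Duda & Hart, for example). The dataset contains 3 classes of 50 instances each, where
each class refers to a type of iris plant. One class is linearly separable from the other two; the latter two are not linearly separable from each other.
Predicted attribute: class of iris plant.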
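This is the Iris dataset from UCI. It contains 150 sample records taken from the flowers of three different iris species: setosa, versicolor, and virginica, with 50 records per
class. Each record has 4 attributes: sepal length, sepal width, petal length, and petal width.
It is an exceedingly simple domain.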
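To make the record layout concrete, here is a small sketch that parses three rows in the comma-separated form used by the data file (four numeric attributes followed by the class label). The field names in `FIELDS` are labels chosen for this example; the file itself has no header row.

```python
import csv
import io

# Three rows in the layout of the iris data file, one per class.
sample = io.StringIO(
    "5.1,3.5,1.4,0.2,Iris-setosa\n"
    "7.0,3.2,4.7,1.4,Iris-versicolor\n"
    "6.3,3.3,6.0,2.5,Iris-virginica\n"
)

# Illustrative names for the four attributes (not present in the file).
FIELDS = ["sepal_length", "sepal_width", "petal_length", "petal_width"]

records = []
for row in csv.reader(sample):
    # The first four columns are measurements; the fifth is the class label.
    features = dict(zip(FIELDS, map(float, row[:4])))
    records.append((features, row[4]))

for features, label in records:
    print(label, features)
```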
# -*- coding: UTF-8 -*-
'''
Created on 2016/7/17
@author: chen
'''
import csv       # read the CSV data file
import random    # random train/test split
import math
import operator


# Load the dataset and split it randomly into a training set and a test set.
def loadDataset(filename, split, trainingSet, testSet):
    with open(filename, "r") as csvfile:
        # Skip empty rows, such as a blank line at the end of the file.
        dataset = [row for row in csv.reader(csvfile) if row]
    for x in range(len(dataset)):
        # Convert the four feature columns from strings to floats.
        for y in range(4):
            dataset[x][y] = float(dataset[x][y])
        # Each row goes to the training set with probability `split`.
        if random.random() < split:
            trainingSet.append(dataset[x])
        else:
            testSet.append(dataset[x])


# Euclidean distance over the first `length` features.
def euclideanDistance(instance1, instance2, length):
    distance = 0
    for x in range(length):
        distance += pow((instance1[x] - instance2[x]), 2)
    return math.sqrt(distance)


# Return the k nearest neighbors of testInstance.
def getNeighbors(trainingSet, testInstance, k):
    distances = []
    length = len(testInstance) - 1   # the last column is the class label
    # Distance from the test instance to every training instance.
    for x in range(len(trainingSet)):
        dist = euclideanDistance(testInstance, trainingSet[x], length)
        distances.append((trainingSet[x], dist))
    # Sort by distance, nearest first.
    distances.sort(key=operator.itemgetter(1))
    neighbors = []
    for x in range(k):
        neighbors.append(distances[x][0])
    return neighbors


# Count the votes of the k neighbors and return the label with the most votes.
def getResponse(neighbors):
    classVotes = {}
    for x in range(len(neighbors)):
        response = neighbors[x][-1]
        if response in classVotes:
            classVotes[response] += 1
        else:
            classVotes[response] = 1
    # Sort labels by vote count, highest first.
    sortedVotes = sorted(classVotes.items(), key=operator.itemgetter(1), reverse=True)
    return sortedVotes[0][0]


# Accuracy as a percentage of correctly predicted test rows.
def getAccuracy(testSet, predictions):
    correct = 0
    for x in range(len(testSet)):
        if testSet[x][-1] == predictions[x]:
            correct += 1
    return (correct / float(len(testSet))) * 100.0


def main():
    trainingSet = []   # training set
    testSet = []       # test set
    split = 0.67       # fraction of rows used for training
    loadDataset(r"../data/iris.txt", split, trainingSet, testSet)
    print("Train set :" + repr(len(trainingSet)))
    print("Test set :" + repr(len(testSet)))
    predictions = []
    k = 3
    for x in range(len(testSet)):
        neighbors = getNeighbors(trainingSet, testSet[x], k)
        result = getResponse(neighbors)
        predictions.append(result)
        print(">predicted = " + repr(result) + ",actual = " + repr(testSet[x][-1]))
    accuracy = getAccuracy(testSet, predictions)
    print("Accuracy:" + repr(accuracy) + "%")


if __name__ == "__main__":
    main()
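Because the split in loadDataset relies on random.random(), every run produces different training and test sets, which makes results hard to compare. A minimal sketch of a reproducible split follows; the helper name `split_dataset` and its `seed` parameter are hypothetical and not part of the program above, and the integers stand in for data rows.

```python
import random


def split_dataset(rows, split, seed=None):
    """Randomly assign each row to the training or the test set.

    A fixed seed gives the same split on every call (hypothetical helper).
    """
    rng = random.Random(seed)   # private generator, independent of global state
    training_set, test_set = [], []
    for row in rows:
        if rng.random() < split:
            training_set.append(row)
        else:
            test_set.append(row)
    return training_set, test_set


rows = list(range(150))   # placeholder rows for illustration
train_a, test_a = split_dataset(rows, 0.67, seed=42)
train_b, test_b = split_dataset(rows, 0.67, seed=42)

print(len(train_a) + len(test_a))                 # 150
print(train_a == train_b and test_a == test_b)    # True
```

Using a private random.Random instance rather than calling random.seed() keeps the split repeatable without affecting other code that uses the global generator.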
To check the program above, the following code runs scikit-learn's built-in KNN classifier on the Iris dataset.
# coding: utf-8
'''
Created on 2016/7/17
@author: chen
'''
from sklearn.datasets import load_iris
from sklearn import neighbors

# Inspect the iris dataset
iris = load_iris()
print(iris)
# KNN classifier with default settings (n_neighbors=5)
knn = neighbors.KNeighborsClassifier()
# Fit on the full dataset
knn.fit(iris.data, iris.target)
# Predict the class of a new sample
predict = knn.predict([[0.1, 0.2, 0.3, 0.4]])
print(predict)
print(iris.target_names[predict])
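Output of one run of the hand-written KNN program (the split is random, so the set sizes vary between runs). Every test row in this run is Iris-setosa, so the 100% accuracy says nothing about the other two classes:
Train set :92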
Test set :39
>predicted = 'Iris-setosa',actual = 'Iris-setosa'
>predicted = 'Iris-setosa',actual = 'Iris-setosa'
>predicted = 'Iris-setosa',actual = 'Iris-setosa'
>predicted = 'Iris-setosa',actual = 'Iris-setosa'
>predicted = 'Iris-setosa',actual = 'Iris-setosa'
>predicted = 'Iris-setosa',actual = 'Iris-setosa'
>predicted = 'Iris-setosa',actual = 'Iris-setosa'
>predicted = 'Iris-setosa',actual = 'Iris-setosa'
>predicted = 'Iris-setosa',actual = 'Iris-setosa'
>predicted = 'Iris-setosa',actual = 'Iris-setosa'
>predicted = 'Iris-setosa',actual = 'Iris-setosa'
>predicted = 'Iris-setosa',actual = 'Iris-setosa'
>predicted = 'Iris-setosa',actual = 'Iris-setosa'
>predicted = 'Iris-setosa',actual = 'Iris-setosa'
>predicted = 'Iris-setosa',actual = 'Iris-setosa'
>predicted = 'Iris-setosa',actual = 'Iris-setosa'
>predicted = 'Iris-setosa',actual = 'Iris-setosa'
>predicted = 'Iris-setosa',actual = 'Iris-setosa'
>predicted = 'Iris-setosa',actual = 'Iris-setosa'
>predicted = 'Iris-setosa',actual = 'Iris-setosa'
>predicted = 'Iris-setosa',actual = 'Iris-setosa'
>predicted = 'Iris-setosa',actual = 'Iris-setosa'
>predicted = 'Iris-setosa',actual = 'Iris-setosa'
>predicted = 'Iris-setosa',actual = 'Iris-setosa'
>predicted = 'Iris-setosa',actual = 'Iris-setosa'
>predicted = 'Iris-setosa',actual = 'Iris-setosa'
>predicted = 'Iris-setosa',actual = 'Iris-setosa'
>predicted = 'Iris-setosa',actual = 'Iris-setosa'
>predicted = 'Iris-setosa',actual = 'Iris-setosa'
>predicted = 'Iris-setosa',actual = 'Iris-setosa'
>predicted = 'Iris-setosa',actual = 'Iris-setosa'
>predicted = 'Iris-setosa',actual = 'Iris-setosa'
>predicted = 'Iris-setosa',actual = 'Iris-setosa'
>predicted = 'Iris-setosa',actual = 'Iris-setosa'
>predicted = 'Iris-setosa',actual = 'Iris-setosa'
>predicted = 'Iris-setosa',actual = 'Iris-setosa'
>predicted = 'Iris-setosa',actual = 'Iris-setosa'
>predicted = 'Iris-setosa',actual = 'Iris-setosa'
>predicted = 'Iris-setosa',actual = 'Iris-setosa'
Accuracy:100.0%
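A label tally is a quick way to see how the classes are spread across a test set before trusting an accuracy figure. The sketch below assumes rows in the same layout as the program's testSet (features first, label last); the helper name `label_counts` and the three sample rows are for illustration only.

```python
from collections import Counter


def label_counts(rows):
    """Count how many rows carry each class label (the last column)."""
    return Counter(row[-1] for row in rows)


# A small stand-in for testSet, for illustration only.
test_set = [
    [5.1, 3.5, 1.4, 0.2, "Iris-setosa"],
    [4.9, 3.0, 1.4, 0.2, "Iris-setosa"],
    [7.0, 3.2, 4.7, 1.4, "Iris-versicolor"],
]

counts = label_counts(test_set)
for label, count in sorted(counts.items()):
    print(label, count)
```

If one label accounts for all of the rows, the accuracy only measures how well that single class is recognized.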