Don't Just Call the Library: Writing KNN by Hand

Original

2020-04-27 20:52

What Is the KNN Algorithm

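KNN is one of the simplest classification algorithms. Given sample data with labels, it decides which label a newly arrived data point belongs to. How does it decide? The key is K: take the K samples closest to the new point, count the labels of those K samples, and assign the new point the label that holds the largest share.
The basic steps: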

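Let the new data point be x and compute the distance between x and every sample in the dataset S. The distance can be the Euclidean distance or the Manhattan distance, depending on the application.
Find the K samples with the smallest distance and look at their labels.
See which label holds the largest share among those K samples and assign that label to x.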
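The two distance measures mentioned in the first step can be sketched as follows. The function names `euclidean_distance` and `manhattan_distance` are illustrative and do not appear in the original post; both assume the two points have the same number of coordinates.

```python
import math

def euclidean_distance(a, b):
    """Straight-line distance between two points of equal dimension."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan_distance(a, b):
    """Sum of absolute coordinate differences (city-block distance)."""
    return sum(abs(x - y) for x, y in zip(a, b))

p, q = [0, 0], [3, 4]
print(euclidean_distance(p, q))  # 5.0
print(manhattan_distance(p, q))  # 7
```

The same pair of points is 5 apart in a straight line but 7 apart along the axes, so the choice of metric can change which samples count as nearest.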

Writing the KNN Code by Hand

Initializing the Samples

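We build a set of sample data. Each sample has the form [point, label, distance], holding the point coordinates, the label, and a distance that starts at 0. samples0 and samples1 are the datasets for labels 0 and 1, and samples = samples0 + samples1 is the full dataset.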

import random
import matplotlib.pyplot as plt
import math
samplesNum=50   # number of samples per label
samples=[]
samples0=[]
samples1=[]
# Label 0: random points with both coordinates in [0, 100]
for i in range(samplesNum):
    point = [random.randint(0, 100), random.randint(0, 100)]
    samples0.append([point, 0, 0])
# Label 1: random points with both coordinates in [60, 160], overlapping label 0
for i in range(samplesNum):
    point = [random.randint(60, 160), random.randint(60, 160)]
    samples1.append([point, 1, 0])
samples=samples0+samples1

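The sample data looks roughly like this: two partly overlapping clouds of points, label 0 spread over 0 to 100 and label 1 over 60 to 160 on both axes.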
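A minimal sketch of how such a picture could be drawn, assuming samples shaped [point, label, distance] with labels 0 and 1. The helper names `group_by_label` and `plot_samples`, and the colours and markers, are assumptions made for illustration, not taken from the original post.

```python
def group_by_label(samples):
    """Map each label to ([x...], [y...]) for entries shaped [point, label, distance]."""
    groups = {}
    for point, label, _ in samples:
        xs, ys = groups.setdefault(label, ([], []))
        xs.append(point[0])
        ys.append(point[1])
    return groups

def plot_samples(samples, colors=("r", "b"), markers=("x", "^")):
    """Scatter-plot the samples, one colour and marker per label (labels 0 and 1)."""
    # Imported here so that group_by_label can be used without matplotlib installed.
    import matplotlib.pyplot as plt
    for label, (xs, ys) in sorted(group_by_label(samples).items()):
        plt.scatter(xs, ys, c=colors[label], marker=markers[label],
                    label=f"label {label}")
    plt.legend()
    plt.show()
```

Calling `plot_samples(samples)` on the dataset built above should open a window with the two clouds in different colours.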

The KNN Algorithm

We use KNN to return the label for the point inputTest.

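def distEclud(A,B):
    # Euclidean distance between two 2-D points
    return math.sqrt(math.pow(A[0]-B[0],2)+math.pow(A[1]-B[1],2))

def classfiyKNN(inputTest,dataSet,K):
    # Store the distance from inputTest to every sample in slot 2
    for i in range(len(dataSet)):
        dataSet[i][2]=distEclud(inputTest,dataSet[i][0])
    # Sort in place by distance, nearest first (this reorders dataSet)
    dataSet.sort(key=lambda samp: samp[2],reverse=False)
    labelsList=[0,1]
    labelsNum=[0,0]   # vote count for each entry of labelsList
    # Count the labels among the K nearest samples
    for i in range(min(K,len(dataSet))):
        for j in range(len(labelsList)):
            if dataSet[i][1]==labelsList[j]:
                labelsNum[j]=labelsNum[j]+1
                break
    # Return the label with the most votes (a tie goes to label 0)
    labelIndex=labelsNum.index(max(labelsNum))
    return labelsList[labelIndex]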
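The function above is tied to the labels 0 and 1 and reorders the dataset it is given. As a hedged alternative sketch, the same vote can be written with `collections.Counter` so that any hashable label works and the input list is left untouched. `knn_predict` is a hypothetical name, and `math.dist` requires Python 3.8 or later.

```python
import math
from collections import Counter

def knn_predict(query, dataset, k):
    """Return the majority label among the k samples closest to query.

    dataset is a list of [point, label, distance] entries; the distance
    slot is ignored and the list itself is not modified.
    """
    if not 1 <= k <= len(dataset):
        raise ValueError("k must be between 1 and len(dataset)")
    # sorted() builds a new list, so the caller's ordering is preserved
    nearest = sorted(dataset, key=lambda s: math.dist(query, s[0]))[:k]
    votes = Counter(label for _, label, _ in nearest)
    # On a tie, most_common returns the label seen first, i.e. the nearer one
    return votes.most_common(1)[0][0]

data = [[[0, 0], "a", 0], [[1, 1], "a", 0],
        [[10, 10], "b", 0], [[11, 11], "b", 0], [[12, 12], "b", 0]]
print(knn_predict([0, 1], data, 3))    # a
print(knn_predict([11, 10], data, 3))  # b
```

Leaving the input unmodified matters once the same list is reused across many queries, as in the K-selection experiment later in the post.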

Trying a Few Data Points

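We generate 10 test points and look at the result. The circles are the test points, and the colour of each circle is the label it was assigned.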

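K=10
# markers and color are not defined anywhere in the post;
# the values below are assumptions chosen to match the description
markers=['x','^','o']   # markers[2] is the circle used for test points
color=['r','b']         # color[label]: red for label 0, blue for label 1
# Assumption: draw the labelled samples first so the test points have context
for s in samples:
    plt.scatter(s[0][0], s[0][1], marker=markers[s[1]], c=color[s[1]])
testSample=[]
for i in range(10):
    point = [random.randint(60, 100), random.randint(60, 100)]
    testSample.append(point)
for t in testSample:
    resultlabel = classfiyKNN(t, samples, K)
    plt.scatter(t[0], t[1], marker=markers[2], c=color[resultlabel], alpha=0.5)
plt.show()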

Choosing K

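The most important part of KNN is settling on a value for K. We look for a suitable K by holding out part of the original samples as a test set and measuring the error rate for each K. In our test, the error rate was lower when K was around 1, 2, or 3; since the data is random, the exact curve changes from run to run.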

TestNum=int(samplesNum*0.4)   # hold out 40% of each label as test data

TestSamples0=random.sample(samples0,TestNum)
TestSamples1=random.sample(samples1,TestNum)
for ts in TestSamples0:
    samples0.remove(ts)

for ts in TestSamples1:
    samples1.remove(ts)

TestSample=TestSamples0+TestSamples1
TrainSample=samples0+samples1   # the remaining samples of both labels
print("TestSample:",len(TestSample))
print("TrainSample:",len(TrainSample))
error_rate=[]
Kx=[]
for K in range(1,20):   # start at 1, since K=0 has no neighbours to vote
    errorNum=0
    Kx.append(K)
    for TS in TestSample:
        # classify against the training set only, so test points never vote
        resultlabel = classfiyKNN(TS[0], TrainSample, K)
        if resultlabel!=TS[1]:
            errorNum+=1
    error_rate.append(round(errorNum/len(TestSample),2))

plt.plot(Kx,error_rate,'g')

print(error_rate)
plt.show()
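A single random split gives a noisy error curve, so one possible refinement is to average the error over several random splits before comparing values of K. The sketch below is one way to do that, not the method used in this post: the names `knn_label`, `holdout_error`, and `mean_error` are hypothetical, the split is a plain shuffle rather than the per-label split above, and `math.dist` requires Python 3.8 or later.

```python
import math
import random
from collections import Counter

def knn_label(query, train, k):
    """Majority label among the k training samples nearest to query."""
    nearest = sorted(train, key=lambda s: math.dist(query, s[0]))[:k]
    return Counter(s[1] for s in nearest).most_common(1)[0][0]

def holdout_error(samples, k, rng, test_fraction=0.4):
    """Error rate of KNN on one random train/test split of samples."""
    shuffled = samples[:]          # copy so the caller's list keeps its order
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    test, train = shuffled[:n_test], shuffled[n_test:]
    wrong = sum(knn_label(point, train, k) != label for point, label, _ in test)
    return wrong / len(test)

def mean_error(samples, k, repeats=20, seed=0):
    """Average holdout error over several random splits (seeded for repeatability)."""
    rng = random.Random(seed)
    return sum(holdout_error(samples, k, rng) for _ in range(repeats)) / repeats

# Two well-separated clusters, so every split classifies perfectly
demo = [[[i, i], 0, 0] for i in range(10)] + [[[100 + i, 100 + i], 1, 0] for i in range(10)]
print(mean_error(demo, 3))  # 0.0
```

On the overlapping data from this post, calling `mean_error(samples, K)` for each K gives a smoother curve than a single split, at the cost of `repeats` times as many classifications.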
