[Li Hang: Statistical Machine Learning] [Theory and Code] [Chapter 3] k-Nearest Neighbors, Python C++

Original

Hi_AI

2019-04-09 07:08

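I. Principle

What is k-nearest neighbors? It is KNN; when k = 1 it becomes the nearest-neighbor method.

The k-nearest-neighbor algorithm is simple and intuitive: given a training data set and a new input instance, find the k training instances closest to that input; whichever class the majority of those k instances belong to is the class assigned to the input.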

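The decision rule is a majority vote: $$y = \arg\max_{c_j} \sum_{x_i \in N_k(x)} I(y_i = c_j), \quad i = 1, 2, \dots, N;\ j = 1, 2, \dots, K$$ where $N_k(x)$ is the neighborhood containing the k training points nearest to $x$, $I$ is the indicator function, $N$ is the number of training instances and $K$ is the number of classes. In other words, this formula simply picks the class with the most votes!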
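The voting procedure described above can be sketched in a few lines of plain Python. This is only a minimal illustration: the function name `knn_predict` and the toy points are made up for this example, and Euclidean distance is assumed.

```python
import math
from collections import Counter

def knn_predict(train_points, train_labels, query, k):
    """Classify `query` by majority vote among its k nearest training points.

    Hypothetical helper for illustration; Euclidean distance is assumed.
    """
    # Pair each training point's distance to the query with its label.
    dist_label = [
        (math.dist(p, query), label)
        for p, label in zip(train_points, train_labels)
    ]
    # Keep the k closest points (sorted() is stable, so ties keep input order).
    nearest = sorted(dist_label, key=lambda t: t[0])[:k]
    # Count the votes per class and return the class with the most votes.
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Toy data, made up for this sketch: two small clusters.
points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (4.0, 4.2), (4.1, 3.9), (3.8, 4.0)]
labels = ["a", "a", "a", "b", "b", "b"]

print(knn_predict(points, labels, (1.1, 1.0), k=3))
print(knn_predict(points, labels, (3.9, 4.1), k=3))
```

With k = 3 the first query falls in the "a" cluster and the second in the "b" cluster, so each vote is unanimous.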

II. Key points:

1. The k-nearest-neighbor method has no explicit learning process.

2. Three basic elements:

Distance metric:
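A common choice of distance for k-NN is the L_p (Minkowski) distance, where p = 1 gives Manhattan distance, p = 2 gives Euclidean distance, and p = infinity gives the largest coordinate difference. The sketch below assumes that family of metrics; the helper name `lp_distance` and the three sample points are illustrative only. It shows that which point counts as "nearest" can change with p.

```python
import math

def lp_distance(x1, x2, p=2.0):
    """L_p (Minkowski) distance between two equal-length vectors.

    Hypothetical helper for illustration. p=1 is Manhattan distance,
    p=2 is Euclidean distance, p=math.inf is the largest coordinate gap.
    """
    if len(x1) != len(x2):
        raise ValueError("vectors must have the same length")
    diffs = [abs(a - b) for a, b in zip(x1, x2)]
    if math.isinf(p):
        return max(diffs)
    return sum(d ** p for d in diffs) ** (1.0 / p)

# Illustrative points: the nearest neighbour of x1 depends on p.
x1, x2, x3 = (1, 1), (5, 1), (4, 4)
for p in (1, 2, 3, math.inf):
    d12 = lp_distance(x1, x2, p)
    d13 = lp_distance(x1, x3, p)
    nearest = "x2" if d12 < d13 else "x3"
    print(f"p={p}: d(x1,x2)={d12:.2f}, d(x1,x3)={d13:.2f}, nearest={nearest}")
```

For p = 1 and p = 2 the point x2 is closer to x1, while for p = 3 and p = infinity the point x3 is closer.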

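Choice of k:

Choosing a small k means predicting from the training instances in a small neighborhood. The approximation error of "learning" decreases, because only training instances close (similar) to the input instance affect the prediction. The drawback is that the estimation error of "learning" increases: the prediction becomes very sensitive to the nearby instances [2], and if a neighboring instance happens to be noise, the prediction will be wrong. In other words, decreasing k makes the overall model more complex and prone to overfitting.
Choosing a large k means predicting from the training instances in a large neighborhood. The advantage is a smaller estimation error; the drawback is a larger approximation error, because training instances far from (dissimilar to) the input instance also affect the prediction and can make it wrong. Increasing k makes the overall model simpler.
If k = N (the number of training instances), then whatever the input instance is, the prediction is simply the most frequent class in the training set. Such a model is too simple: it ignores a great deal of useful information in the training instances and is not advisable. In practice k is usually set to a fairly small value, and cross-validation is commonly used to select the best k.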
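The trade-off above can be made concrete with a small leave-one-out cross-validation sketch in plain Python. Everything here is an assumption made for illustration: the 1-D toy data (with one deliberately noisy label), the helper names `knn_predict` and `loo_accuracy`, and the candidate values of k. Because one example is held out each time, the largest usable k is N - 1, which plays the role of k = N in the text.

```python
from collections import Counter

def knn_predict(train, query, k):
    """Majority vote among the k training pairs (x, label) closest to query."""
    nearest = sorted(train, key=lambda pair: abs(pair[0] - query))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

def loo_accuracy(data, k):
    """Leave-one-out cross-validation accuracy of k-NN on 1-D data."""
    hits = 0
    for i, (x, label) in enumerate(data):
        rest = data[:i] + data[i + 1:]   # hold out the i-th example
        hits += knn_predict(rest, x, k) == label
    return hits / len(data)

# Toy 1-D data, made up for this sketch: class 0 on the left, class 1 on
# the right, plus one noisy class-1 point (1.5) inside the class-0 region.
data = [(0, 0), (1, 0), (2, 0), (3, 0), (4, 0),
        (6, 1), (7, 1), (8, 1), (9, 1), (10, 1), (11, 1),
        (1.5, 1)]

candidates = [1, 3, 5, len(data) - 1]    # the last one uses every other point
scores = {k: loo_accuracy(data, k) for k in candidates}
best_k = max(candidates, key=scores.get)  # first candidate with the top score

for k in candidates:
    print(f"k={k:2d}  leave-one-out accuracy={scores[k]:.3f}")
print("best k:", best_k)
```

On this toy data k = 1 is misled by the noisy point, the largest k always predicts the majority class, and a small intermediate k scores best, which mirrors the argument in the text.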

Classification decision rule:

The book's explanation goes round and round, and I did not really follow it.
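One way to read the argument, offered here as a reading aid under the usual 0-1 loss assumption rather than as a quotation from the book: for the k nearest neighbors $N_k(x)$ of an input $x$, assigning class $c_j$ to the whole neighborhood gives the misclassification rate on the left below, so minimizing that rate is the same as maximizing the number of votes for $c_j$.

```latex
\frac{1}{k}\sum_{x_i \in N_k(x)} I(y_i \neq c_j)
  \;=\; 1 - \frac{1}{k}\sum_{x_i \in N_k(x)} I(y_i = c_j)
\quad\Longrightarrow\quad
\arg\min_{c_j} \frac{1}{k}\sum_{x_i \in N_k(x)} I(y_i \neq c_j)
  \;=\; \arg\max_{c_j} \sum_{x_i \in N_k(x)} I(y_i = c_j)
```

Put differently, majority voting is equivalent to empirical risk minimization over the neighborhood.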

III. Code:

import numpy as np
import pandas as pd
import math
import matplotlib.pyplot as plt
# %matplotlib inline  # Jupyter magic: uncomment when running in a notebook
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from collections import Counter

# Define the distance metric (L_p distance; p=2 is Euclidean)
def distance_(x1,x2,p=2):
    x1=np.array(x1)
    x2=np.array(x2)
    assert x1.shape==x2.shape
    sum_=0
    for i in range(x1.shape[0]):
        # accumulate |x1[i] - x2[i]|^p over every coordinate
        sum_+=math.pow(abs(x1[i]-x2[i]),p)
    return math.pow(sum_,1./p)

# Build the training data
iris = load_iris()  # Anderson's Iris flower data set
df = pd.DataFrame(iris.data, columns=iris.feature_names)
df['label'] = iris.target
df.columns = ['sepal length', 'sepal width', 'petal length', 'petal width', 'label']
df
# columns: sepal length, sepal width, petal length, petal width, class label

# Visualize the data
plt.scatter(df[:50]['sepal length'], df[:50]['sepal width'], c='r',label='0')
plt.scatter(df[50:100]['sepal length'], df[50:100]['sepal width'],c='y' ,label='1')
#plt.scatter(df[100:150]['sepal length'], df[100:150]['sepal width'],c='g' ,label='2')
plt.xlabel('sepal length')
plt.ylabel('sepal width')
plt.legend()

# Take the first 100 rows (classes 0 and 1) with two features, then split into training and test sets
data = np.array(df.iloc[:100, [0, 1, -1]])
X, y = data[:,:-1], data[:,-1]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

# Define the model

class KNN:
    def __init__(self,num_k,p,X_train,Y_train):
        self.p=p              # order of the L_p distance
        self.k=num_k          # number of neighbours that vote
        self.X_train=X_train
        self.Y_train=Y_train
    def get_preds(self,input_data):
        # column 0: distance to input_data, column 1: label of the training point
        dist_=np.zeros((self.X_train.shape[0],2))
        for i in range(self.X_train.shape[0]):
            dist_[i]=distance_(input_data,self.X_train[i],p=self.p),self.Y_train[i]
        # sort rows by distance and keep the k nearest
        dist_ = dist_[dist_[:,0].argsort()]
        res=dist_[:self.k,:]
        # majority vote over the integer labels of the k nearest points
        sortbin=res[:,1].astype(np.int32)
        return np.argmax(np.bincount(sortbin))

# Run the model and evaluate it
model=KNN(num_k=10,p=2,X_train=X_train,Y_train=y_train)
count=0
for i in range(X_test.shape[0]):
    pred=model.get_preds(X_test[i])
    if pred == y_test[i] :
        count+=1
print("acc:",count/y_test.shape[0])
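The per-sample loop above can also be written with NumPy broadcasting, which computes all query-to-training distances at once. The following is a sketch under assumptions, not a replacement for the class above: the function name `knn_predict_batch` and the toy arrays are made up for illustration, and labels are assumed to be non-negative integers so that `np.bincount` can count the votes.

```python
import numpy as np

def knn_predict_batch(X_train, y_train, X_query, k=3, p=2):
    """Predict labels for all rows of X_query at once (hypothetical helper)."""
    X_train = np.asarray(X_train, dtype=float)
    X_query = np.asarray(X_query, dtype=float)
    y_train = np.asarray(y_train).astype(int)
    # Pairwise L_p distances via broadcasting: shape (n_query, n_train).
    diff = np.abs(X_query[:, None, :] - X_train[None, :, :])
    dists = (diff ** p).sum(axis=2) ** (1.0 / p)
    # Indices of the k nearest training points for every query row.
    nearest = np.argsort(dists, axis=1)[:, :k]
    # Majority vote per row; labels are assumed to be non-negative integers.
    return np.array([np.bincount(row).argmax() for row in y_train[nearest]])

# Toy data, made up for this sketch: two small clusters.
X_train_demo = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],
                         [4.0, 4.2], [4.1, 3.9], [3.8, 4.0]])
y_train_demo = np.array([0, 0, 0, 1, 1, 1])
X_query_demo = np.array([[1.1, 1.0], [3.9, 4.1]])

print(knn_predict_batch(X_train_demo, y_train_demo, X_query_demo, k=3))
```

The broadcasted array has shape (n_query, n_train, n_features), so this trades memory for speed and suits small data sets such as the Iris subset used here.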
