Nonparametric Methods: K-Nearest Neighbors (KNN)

Original post

2020-06-29 17:16

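As a nonparametric method, KNN is simple and convenient to use. More importantly, it often gives good results. It applies to both classification and regression, which makes it an important method to know.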

Problem:

Given training data $(X_1, y_1), (X_2, y_2), \dots, (X_N, y_N)$ and a sample $X_t$ to predict, find the corresponding $y_t$.

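Problem analysis:

If we define a function that measures distance, we can find the training inputs $X_i$ ($i = 1, 2, \dots, N$) that lie closest to $X_t$, that is, the "neighbors" of $X_t$. The $y$ values of these neighbors should then be close to $y_t$.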

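Reference solution:

(1) Define a distance function and compute the distance from every training input $X_i$ ($i = 1, 2, \dots, N$) to the sample $X_t$.
(2) Find the K smallest of these distances; after renumbering, call the corresponding points $X_1, X_2, \dots, X_K$, with outputs $y_1, y_2, \dots, y_K$.
(3) If the problem is a regression problem (the training $y$ values are continuous), then

$$y_t = \frac{1}{K} \sum_{i=1}^{K} y_i$$

If the problem is a classification problem (the $y$ values are discrete, with M classes), then

$y_t$ is the class that occurs most often among the K neighbors.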
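The three steps can be traced by hand on a tiny example. The sketch below is only an illustration: the toy data, the query point `x_t`, and the variable names are assumptions made here, and squared Euclidean distance is assumed as the distance function.

```matlab
% Toy training set (hypothetical): 6 points in 2-D, one per row of X.
X = [0 0; 1 0; 0 1; 5 5; 6 5; 5 6];
y_reg = [1.0; 1.2; 0.8; 5.0; 5.5; 4.5];   % continuous outputs (regression)
y_cls = [1; 1; 2; 3; 3; 3];               % labels of an M = 3 class problem
x_t = [0.2 0.1];                          % query point, as a row vector
K = 3;

% Step 1: squared Euclidean distance from every training point to x_t.
% The square root is skipped because it does not change the ordering.
d = sum((X - repmat(x_t, size(X, 1), 1)).^2, 2);

% Step 2: indices of the K smallest distances.
[~, order] = sort(d);
nearest = order(1:K);

% Step 3a (regression): average the neighbors' outputs.
y_t_reg = mean(y_reg(nearest));

% Step 3b (classification): most frequent label among the neighbors.
y_t_cls = mode(y_cls(nearest));

fprintf('regression: %.4f, classification: %d\n', y_t_reg, y_t_cls);
```

Here the three nearest neighbors are the first three rows of `X`, so the regression estimate is the mean of 1.0, 1.2 and 0.8, and the vote among the labels 1, 1, 2 picks class 1.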
MATLAB code:

%************************************************************
%               KNN for regression or classification
%*************************************************************
% the specified parameters are as follows:
%          X: the training inputs, an n*m matrix (n is the number of
%             samples, m is the number of dimensions)
%          y: the training outputs, an n*1 matrix (n is the number of
%             samples)
%          k: number of nearest neighbors to use
%  predict_x: test sample, an m*1 matrix (m is the number of
%             dimensions)
% regression: 1 denotes regression, 0 denotes classification
%             (labels are expected to be 0 or 1)
%
% Author: Bai Junyang
%  Email: [email protected]
%************************************************************
function result = KNN(X, y, k, predict_x, regression)
[n, m] = size(X);

%compute the squared Euclidean distance to every training sample
predict_X = repmat(predict_x', n, 1);
%size(X)
%size(predict_X)
distance = sum((X - predict_X).^2, 2);

%find the top-K index:topIndex
%(taking the order from sort avoids an error when distances are tied)
[~, order] = sort(distance);
topIndex = order(1:k);

%compute the result
if regression == 1
    %regression: average of the neighbors' outputs
    result = mean(y(topIndex));
else
    %classification: most frequent label among the neighbors
    result = mode(y(topIndex));
end;


%plot the point
index = 1:n;
index(topIndex) = [];

if regression == 1
    %plot the predict point
    plot(predict_x, result, 'ro', 'MarkerSize', 10);
    hold on;

    %plot the training data except the top-k data
    for i = index
        plot(X(i,:), y(i), 'ko', 'MarkerSize', 5);
    end;

    %plot the top-k data
    for i = 1:k
        plot(X(topIndex(i),:), y(topIndex(i),:), 'bo', 'MarkerSize', 10);
    end;
else
    %plot the predict point
    if result == 0
        plot(predict_x(1), predict_x(2), 'ro', 'MarkerSize', 10);
        hold on;
    else
        plot(predict_x(1), predict_x(2), 'r+', 'MarkerSize', 10);
        hold on;

    end;

    %plot the training data except the top-k data
    for i = index
        if y(i) == 0
            plot(X(i, 1), X(i, 2), 'yo', 'MarkerSize', 5);
        else
            plot(X(i, 1), X(i, 2), 'k+', 'MarkerSize', 5);
        end;
    end;

    %plot the top-k data
    for i = 1:k
        if y(topIndex(i)) == 0
            plot(X(topIndex(i),1), X(topIndex(i),2), 'bo', 'MarkerSize', 10);
        else
            plot(X(topIndex(i),1), X(topIndex(i),2), 'b+', 'MarkerSize', 10);
        end;

    end;
    hold off;
end

Resulting plot:

Red marks the predicted point and blue marks the K neighbors; here K = 5.

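Comparison with linear regression

The dataset contains 97 samples; the first 72 are used for training and the remaining 25 for testing.
The code is as follows: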

function [knnError, lrError] = test(k)
data = load('D://ex1data1.txt');
X_train = data(1:72, 1);
y_train = data(1:72, 2);
X_test = data(73:97, 1);
y_test = data(73:97, 2);

%compute the Linear Regression Error
lrX_train = [ones(72, 1), X_train];
w = pinv(lrX_train)*y_train;
lrX_test = [ones(25, 1), X_test];
lrError = sum((lrX_test * w - y_test).^2);

%compute the KNN Error
knnError = 0;
for i = 1:25
    knnError = knnError + (y_test(i) - KNN(X_train, y_train, k, X_test(i), 1))^2;
end;

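If the two methods are compared by the square root of the squared error, we get the following table (note that `test` as listed returns the summed squared errors, without the square root):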

K in KNN	KNN error	Linear Regression error
1	30.626	14.219
5	16.134	14.219
10	14.079	14.219

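As the table shows, if computational cost is set aside, KNN can do about as well as Linear Regression: at K = 10 its error (14.079) is slightly below that of Linear Regression (14.219), although at K = 1 it is more than twice as large (30.626).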
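The table varies K by hand. A sweep over several values of K could be sketched as follows. This is not the ex1data1.txt experiment: the synthetic data, the odd/even hold-out split, and the candidate list `ks` are all assumptions chosen for illustration, and plotting is left out.

```matlab
% Synthetic 1-D data (hypothetical, used only to make the sketch runnable).
x = (1:40)';
y = 2 * x + 3 * sin(x);

% Hold-out split: odd-indexed samples for training, even-indexed for testing.
x_train = x(1:2:end);  y_train = y(1:2:end);
x_test  = x(2:2:end);  y_test  = y(2:2:end);

ks = [1 2 5 10 20];          % candidate values of K
err = zeros(size(ks));
for j = 1:numel(ks)
    sse = 0;
    for i = 1:numel(x_test)
        % Squared distances from the i-th test point to all training points.
        d = (x_train - x_test(i)).^2;
        [~, order] = sort(d);
        pred = mean(y_train(order(1:ks(j))));   % KNN regression estimate
        sse = sse + (y_test(i) - pred)^2;
    end
    err(j) = sqrt(sse);      % root of the summed squared error
end

[~, best] = min(err);
fprintf('best k = %d\n', ks(best));
```

With K equal to the number of training samples (20 here), every prediction collapses to the mean of all training outputs, which is why the largest candidate does poorly on this data.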

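Evaluation of KNN

Advantages:
1. The idea behind KNN is very simple and easy to understand.
2. KNN has no training phase, so no parameters need to be solved for.
3. In typical cases KNN gives good predictions.
Disadvantages:
1. Although no parameters need to be solved for, every prediction requires a fairly large amount of computation, because the distances to all training samples must be computed. For real problems with an extremely large number of samples and strict limits on prediction time, KNN is often not applicable.
2. Choosing K is also a problem. A large K increases the computation, yet contributes little to reducing the error. In the test above, for example, k = 20 still gives an error of 13.6699, only about 2.5 lower than at k = 5, at a higher computational cost.
3. KNN predictions depend heavily on the sample data. If the sample data lie far from the point to be predicted, accuracy drops sharply. For example, if most of the sample X values lie near 1 but the point to predict is near 100, the prediction will be poor.
4. In classification, KNN as described here makes a "hard" decision: for a two-class problem, the prediction is either 0 or 1. Logistic Regression, by contrast, gives the probability that the result is 1 or 0, and the probability threshold can even be adjusted to obtain the desired result.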
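The third disadvantage is easy to reproduce. In the minimal sketch below, the training inputs cluster around 1 and the query sits at 100. The data, the rule y = 2x, and the variable names are assumptions made for illustration; a least-squares line is fitted to the same data only as a point of comparison.

```matlab
% Training inputs cluster around 1; the outputs follow y = 2x exactly.
x_train = [0.8; 0.9; 1.0; 1.1; 1.2];
y_train = 2 * x_train;
x_t = 100;   % query far outside the training range
k = 3;

% KNN regression: average the outputs of the k nearest training points.
d = (x_train - x_t).^2;
[~, order] = sort(d);
knn_pred = mean(y_train(order(1:k)));

% Least-squares line fitted to the same data, for comparison.
w = [ones(5, 1), x_train] \ y_train;
lr_pred = [1, x_t] * w;

fprintf('KNN: %.2f, linear fit: %.2f, true value: %.2f\n', ...
        knn_pred, lr_pred, 2 * x_t);
```

However far away the query is, KNN can only average outputs it has already seen, so its prediction stays near 2.2 while the underlying rule gives 200.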
