Statistical Learning Notes (3): k-Nearest Neighbors

Algorithm Description

The input to the k-nearest neighbour (k-NN) algorithm is an instance's feature vector, corresponding to a point in feature space; the output is the instance's class. k-NN assumes that in the given training data set, the class of every instance is already determined. For a new instance, the prediction is made by a vote over the classes of its k nearest instances.

3.1 The k-Nearest Neighbor Algorithm

Algorithm 3.1

Input: a training data set $T$ and an instance feature vector $x$,
where $T=\{(x_1,y_1),(x_2,y_2),\dots,(x_N,y_N)\}$,
in which $x_i\in X\subseteq R^n$ is the feature vector of an instance, $y_i\in Y=\{c_1,c_2,\dots,c_K\}$ is the class of the instance, $i=1,2,\dots,N$; $x=(x^{(1)},x^{(2)},\dots,x^{(M)})$, where $x^{(i)}$ is the i-th component of the feature vector and $M$ is the number of components;
Output: the class $y$ to which instance $x$ belongs.
(1) Using the given distance metric, find the k points in the training set $T$ nearest to $x$; denote the neighbourhood of $x$ covering these k points by $N_k(x)$;
(2) In $N_k(x)$, determine the class $y$ of $x$ according to the classification decision rule:
$$y=\arg\max_{c_j}\sum_{x_i\in N_k(x)} I(y_i=c_j),\quad i=1,2,\dots,N;\ j=1,2,\dots,K$$
where $I$ is the indicator function: $I(y_i=c_j)=1$ when $y_i=c_j$ and 0 otherwise.
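Algorithm 3.1 is short enough to sketch directly in a few lines of NumPy. This is my own illustration rather than code from the book; `knn_predict` and the toy data are made-up names, and the Euclidean distance is assumed as the metric:

```python
import numpy as np
from collections import Counter

def knn_predict(T_x, T_y, x_new, k=3):
    """Classify x_new by a majority vote over its k nearest training points.

    T_x: (N, n) array of feature vectors; T_y: N class labels.
    Uses the Euclidean (L2) distance as the metric.
    """
    dists = np.linalg.norm(T_x - x_new, axis=1)   # distance to every training point
    nearest = np.argsort(dists)[:k]               # indices of the k closest points, N_k(x)
    votes = Counter(T_y[i] for i in nearest)      # count class labels in N_k(x)
    return votes.most_common(1)[0][0]             # majority class

# toy data: two classes in two well-separated clusters
T_x = np.array([[1.0, 1.0], [1.2, 0.8], [5.0, 5.0], [5.2, 4.8]])
T_y = ['a', 'a', 'b', 'b']
print(knn_predict(T_x, T_y, np.array([0.9, 1.1]), k=3))  # → a
```

Note that with an even k (or more than two classes) the vote can tie; `Counter.most_common` then breaks the tie arbitrarily, which is why odd k is commonly preferred for binary problems.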

3.2 The k-Nearest Neighbor Model

3.2.1 The Model

3.2.2 Distance Metrics

Distance in feature space reflects how similar two instances are. The feature space of the k-NN model is generally the n-dimensional real vector space $R^n$. The Euclidean distance is generally used, but the more general $L_p$ (Minkowski) distance is also an option.
Let the feature space $X$ be the n-dimensional real vector space $R^n$, and let $x_i,x_j\in X$ with $x_i=(x_i^{(1)},x_i^{(2)},\dots,x_i^{(n)})$ and $x_j=(x_j^{(1)},x_j^{(2)},\dots,x_j^{(n)})$. The $L_p$ distance between $x_i$ and $x_j$ is defined as
$$L_p(x_i,x_j)=\left(\sum_{l=1}^n \left|x_i^{(l)}-x_j^{(l)}\right|^p\right)^{\frac 1p}$$
When p = 2 it is the Euclidean distance:
$$L_2(x_i,x_j)=\left(\sum_{l=1}^n \left|x_i^{(l)}-x_j^{(l)}\right|^2\right)^{\frac 12}$$
When p = 1 it is the Manhattan distance:
$$L_1(x_i,x_j)=\sum_{l=1}^n \left|x_i^{(l)}-x_j^{(l)}\right|$$
As $p\rightarrow\infty$ it is the maximum of the coordinate differences:
$$L_\infty(x_i,x_j)=\max_l \left|x_i^{(l)}-x_j^{(l)}\right|,\quad l=1,2,\dots,n$$
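The three special cases above can be computed with a single helper. A small sketch of my own (the name `lp_distance` is made up, not from any library):

```python
import numpy as np

def lp_distance(xi, xj, p):
    """Minkowski (L_p) distance between two feature vectors."""
    diff = np.abs(np.asarray(xi, dtype=float) - np.asarray(xj, dtype=float))
    if np.isinf(p):
        return diff.max()                 # L_inf: maximum coordinate difference
    return (diff ** p).sum() ** (1.0 / p)

xi, xj = [1.0, 1.0], [4.0, 5.0]
print(lp_distance(xi, xj, 1))             # Manhattan: 3 + 4 = 7.0
print(lp_distance(xi, xj, 2))             # Euclidean: sqrt(9 + 16) = 5.0
print(lp_distance(xi, xj, np.inf))        # max(3, 4) = 4.0
```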

3.2.3 Choosing the Value of k

A small k makes the model more complex and sensitive to noise among the nearest points; a large k makes the model simpler but lets distant, less relevant points influence the prediction. In practice, cross-validation is usually used to select an optimal value of k.
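As a rough sketch of what selecting k by cross-validation can look like (all the names here, `predict_one`, `cv_accuracy`, `best_k`, are mine, and the data are synthetic two-cluster samples, not anything from the book):

```python
import numpy as np

def predict_one(train_x, train_y, x, k):
    """Majority vote among the k nearest training points (L2 distance)."""
    idx = np.argsort(np.linalg.norm(train_x - x, axis=1))[:k]
    labels, counts = np.unique(train_y[idx], return_counts=True)
    return labels[np.argmax(counts)]

def cv_accuracy(X, Y, k, folds=5):
    """Average validation accuracy of k-NN over `folds` random splits."""
    order = np.random.permutation(len(X))
    accs = []
    for f in range(folds):
        val = order[f::folds]                 # every folds-th index is validation
        tr = np.setdiff1d(order, val)         # the rest is training
        hits = sum(predict_one(X[tr], Y[tr], X[i], k) == Y[i] for i in val)
        accs.append(hits / len(val))
    return np.mean(accs)

# synthetic data: two well-separated Gaussian clusters
rng = np.random.RandomState(0)
X = np.vstack([rng.randn(30, 2), rng.randn(30, 2) + 5])
Y = np.array([0] * 30 + [1] * 30)
best_k = max([1, 3, 5, 7, 9], key=lambda k: cv_accuracy(X, Y, k))
print(best_k)
```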

3.2.4 Classification Decision Rule

The book states this rule rather mathematically; in plain terms it is majority voting: the input instance is assigned the class held by the majority of its k nearest neighbours. Under 0-1 loss, majority voting is equivalent to minimizing the empirical misclassification rate.

3.3 Implementing k-NN: kd-Trees

3.3.1 Constructing a kd-Tree

Algorithm 3.2: Constructing a Balanced kd-Tree

Input: a data set $T=\{x_1,x_2,\dots,x_N\}$ in k-dimensional space, where $x_i=(x_i^{(1)},x_i^{(2)},\dots,x_i^{(k)})$;
Output: a kd-tree.
(1) Begin by constructing the root node, which corresponds to the hyper-rectangular region of k-dimensional space containing all of $T$.
Choose a dimension $x^{(l)}$, take the median of the $x^{(l)}$ coordinates of all instances in $T$ as the split point, and cut the hyper-rectangle into two subregions.
The root node then gets a left and a right child at depth 1: the left child corresponds to the subregion whose points have $x^{(l)}$ coordinates smaller than the split point's, the right child to the subregion whose points have $x^{(l)}$ coordinates greater than or equal to it.
Instances falling on the splitting hyperplane are stored at the root node.
(2) Repeat (1) on each child region, stopping when no instances remain in either subregion.

3.3.2 Searching a kd-Tree

Algorithm 3.3: Nearest-Neighbor Search with a kd-Tree

Input: a constructed kd-tree and a target point x;
Output: the nearest neighbour of x.
(1) Find the leaf node (region) containing the target point: starting from the root, recursively visit child nodes. If the target's coordinate on the current split dimension is smaller than the split point's, move to the left child; otherwise move to the right child. Continue until a leaf node is reached.
(2) Take this node as the "current nearest point".
(3) Recursively back up the tree; at each node:
(a) If the instance stored at this node is closer to the target than the current nearest point, make it the current nearest point.
(b) The current nearest point must lie in the region of one of this node's children. Check whether the region of the other child contains a closer point; concretely, check whether that region intersects the hypersphere centred at the target point with radius equal to the distance between the target and the current nearest point.
If they intersect, a closer point may exist in the other child's region: move to the other child and continue the nearest-neighbour search recursively;
if they do not intersect, continue backing up.
(4) The search terminates once it has backed up to the root; the final "current nearest point" is the nearest neighbour of x.
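The search procedure above can be sketched compactly on a tuple-based tree. This is my own illustration (`build_kdt` and `nearest` are made-up names), with one simplification: it cycles the split axis by depth rather than choosing the axis of maximum variance as the code further below does.

```python
import numpy as np

def build_kdt(points, depth=0):
    """Build a kd-tree as nested dicts, cycling the split axis with depth."""
    if not points:
        return None
    axis = depth % len(points[0])
    points = sorted(points, key=lambda p: p[axis])
    mid = len(points) // 2                          # median as the split point
    return {'point': points[mid], 'axis': axis,
            'left':  build_kdt(points[:mid], depth + 1),
            'right': build_kdt(points[mid + 1:], depth + 1)}

def nearest(node, target, best=None):
    """Algorithm 3.3: descend toward the target first, then backtrack,
    pruning subtrees that cannot intersect the current best hypersphere."""
    if node is None:
        return best
    d = np.linalg.norm(np.subtract(node['point'], target))
    if best is None or d < best[1]:
        best = (node['point'], d)                   # update current nearest point
    axis = node['axis']
    diff = target[axis] - node['point'][axis]
    near, far = ((node['left'], node['right']) if diff < 0
                 else (node['right'], node['left']))
    best = nearest(near, target, best)              # side containing the target
    if abs(diff) < best[1]:                         # sphere crosses the split plane
        best = nearest(far, target, best)           # so the other side may be closer
    return best

tree = build_kdt([(2, 3), (5, 4), (9, 6), (4, 7), (8, 1), (7, 2)])
point, dist = nearest(tree, (3, 4.5))
print(point)  # → (2, 3)
```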

Code

The following code was debugged under Python 3.

(1) Building the kd-tree

A picture first, since the result is rather fun:
(Figure: a kd-tree generated with Python)
The input data is a set of 2-dimensional vectors; higher dimensions are also supported, and the code adapts to them.

import numpy as np
import matplotlib.pyplot as plt
import copy
import math
"""
X,  feature vectors
Y,  class of X
D,  dimension of each vector.
"""
# Construct the initial data to be classified
D   = 2
NUM = 50
C = [ 'g', 'r', 'b' ]
#X = np.array([ (3,5), (2,4), (1,1), (5,2), (1,5), (4,1) ])
X = np.random.rand(NUM,D)
Y = [ C[i] for i in np.random.randint(0,len(C),NUM) ]

class KD_Node:
    cur_trav = None             # cursor for traversal.
    x_min = 0
    x_max = 1
    y_min = 0
    y_max = 1

    def __init__( self,
                  point=None, split=None, color=None,
                  L=None, R=None, father=None,
                  scope=None ):
        """
        Initialize a kd-tree node.
        point: datum of this node
        split: split plane for this node
        L:     left son
        R:     right son
        father: father of this node, if root it's None
        scope: area in hyperspace for each node.
        """
        self.point  = point
        self.split  = split
        self.color  = color
        self.left   = L
        self.right  = R
        self.father = father
        self.flag_trav = 0      # traversal flag. 
                                #   bit 0 is notation for itself
                                #   bit 1 is for its left son
                                #   bit 2 is for its right son
        self.scope = scope if scope is not None else {}
                                # paint scope:
                                #   x0: min of x
                                #   x1: max of x
                                #   y0: min of y
                                #   y1: max of y

    def clear_trav(self):
        KD_Node.cur_trav = None
        self.flag_trav = 0
        if self.left:
            self.left.clear_trav()
        if self.right:
            self.right.clear_trav()

    def __iter__(self):
        return self

    def __next__(self):
        # traverse the tree iteratively, without recursion
        cursor = None
        if KD_Node.cur_trav == None:        # First time to use cur_trav, initiate.
            KD_Node.cur_trav = self

        cursor = KD_Node.cur_trav
        while 1:
            if cursor.flag_trav & 0X07 == 0X7:      # flag value 0x07 means this
                                                    # node and both of its sons
                                                    # are completely traversed.
                if cursor.father == None:
                    raise StopIteration
                else:
                    cursor = cursor.father
            
            elif cursor.flag_trav & 0X01 == 0:      # if bit0 == 0,
                cursor.flag_trav |= 0X01            # set bit0 = 1
                #cursor = cursor            # not need. set cursor => self
                break                               # BREAK! return current.
            
            elif cursor.flag_trav & 0X02 == 0:      # if bit1 == 0,
                cursor.flag_trav |= 0X02            # set bit1 of self
                if cursor.left != None:
                    cursor = cursor.left            # set cursor => left son
                else:                               # self.left is None, skip
                    continue
            
            elif cursor.flag_trav & 0X04 == 0:      # if bit2 == 0,
                cursor.flag_trav |= 0X04            # set bit2 = 1
                if cursor.right != None:
                    cursor = cursor.right           # set cursor => right son
                else:
                    continue
        KD_Node.cur_trav = cursor

        return KD_Node.cur_trav


def CreateKDT(node=None, data=None, color=None, father=None ):
    """
    Recursively build a kd tree.
    INPUT: node,   the subtree root to build (None on the first call)
           data,   ndarray of points, e.g. np.array([ (3,5), (2,4), (1,1) ])
           color,  color inherited from the father node
           father, the father node (None for the root)
    OUTPUT: the root KD_Node of the constructed subtree
    """
    global C
    if len(data) > 0:
        global D
        dim = D
        var = np.var(data, axis=0)          # variance for each dimension
        split = np.argmax(var)              # split for this node
        pos = int(len(data)/2)
        pos_list = np.argpartition(data[:,split], pos)
        point = data[pos_list[pos]]         # point for this node
        color = C[np.random.randint(0, len(C))]
        cur_scope = {}                      # scope

        if not father:                      # root: scope is the whole graph,
            cur_scope = { 'x0': KD_Node.x_min, 'x1': KD_Node.x_max,
                          'y0': KD_Node.y_min, 'y1': KD_Node.y_max }
        else:                               # update cur_scope
            cur_scope = copy.deepcopy(father.scope)
            if father.split == 0:
                if point[0] < father.point[0]:
                    cur_scope['x1'] = father.point[0]
                else:
                    cur_scope['x0'] = father.point[0]
            elif father.split == 1:
                if point[1] < father.point[1]:
                    cur_scope['y1'] = father.point[1]
                else:
                    cur_scope['y0'] = father.point[1]                

        node = KD_Node( point=point, split=split, color=color, father=father,
                        scope=cur_scope )

        if len(data[pos_list[:pos]]) != 0:
            node.left  = CreateKDT( node    = node.left,
                                    data    = data[pos_list[:pos]],
                                    color   = color,
                                    father  = node )

        if len(data[pos_list[(pos+1):]]) != 0:
            node.right = CreateKDT( node    = node.right,
                                    data    = data[pos_list[(pos+1):]],
                                    color   = color,
                                    father  = node )

    return node

def get_split_pos(data, split):
    """Return the position of the median element along dimension `split`."""
    return len(data) // 2

def preorder(node, depth=0):
    """
    Preorder-print a KD subtree.
    """
    if node:
        print(node.point)
        if node.left:
            preorder(node.left, depth + 1)
        if node.right:
            preorder(node.right, depth + 1)

def draw_KDT(kd):
    """
    Draw each datum as a point and draw every node's splitting line.
    """
    x_min = kd.x_min
    x_max = kd.x_max
    y_min = kd.y_min
    y_max = kd.y_max
    plt.figure(figsize=(6,6))
    plt.xlabel("$x^{(1)}$")
    plt.ylabel("$x^{(2)}$")
    plt.title("Machine Learning: KD Tree")
    plt.xlim(int(x_min),math.ceil(x_max))
    plt.ylim(int(y_min),math.ceil(y_max))
    ax = plt.gca()
    ax.set_aspect(1)

    plt.plot( [x_min, x_max, x_max, x_min, x_min],
              [y_min, y_min, y_max, y_max, y_min] )

    line_from = []              # split line from and to
    line_to   = []
    
    for node in kd:
        if node.split == 0:
            line_from = [ node.point[0], node.scope['y0'] ]
            line_to   = [ node.point[0], node.scope['y1'] ]
        if node.split == 1:
            line_from = [ node.scope['x0'], node.point[1] ]
            line_to   = [ node.scope['x1'], node.point[1] ]

        plt.plot( [ line_from[0], line_to[0] ],
                  [ line_from[1], line_to[1] ],
                  'k-', linewidth=1 )
        plt.scatter( node.point[0], node.point[1], color=node.color )


    plt.show()


def find_knn(root, x):
    # TODO: k-nearest-neighbour search over the tree (Algorithm 3.3).
    pass


def main():
    kd = None
    kd = CreateKDT(kd, X)

    #kd.clear_trav()
    draw_KDT(kd)

if __name__ == "__main__":
    main()


