Statistical Learning Notes (3): k-Nearest Neighbors

Algorithm Description

The input to the k-nearest neighbour (k-NN) algorithm is an instance's feature vector, corresponding to a point in feature space; the output is the instance's class. k-NN assumes that in the given training data set, the class of every instance is already determined. For a new instance, the prediction is made by a vote over the classes of its k nearest instances.

3.1 The k-Nearest Neighbor Algorithm

Algorithm 3.1

Input: a training data set $T$ and an instance feature vector $x$,
where $T=\{(x_1,y_1),(x_2,y_2),\dots,(x_N,y_N)\}$,
in which $x_i\in X\subseteq R^n$ is the feature vector of an instance, $y_i\in Y=\{c_1,c_2,\dots,c_K\}$ is the class of the instance, $i=1,2,\dots,N$; $x=(x^{(1)},x^{(2)},\dots,x^{(M)})$, where $x^{(i)}$ is the i-th component of the feature vector and $M$ is the number of components;
Output: the class $y$ to which instance $x$ belongs.
(1) Using the given distance metric, find the k points in the training set $T$ nearest to $x$; denote the neighbourhood of $x$ covering these k points by $N_k(x)$;
(2) In $N_k(x)$, determine the class $y$ of $x$ according to the classification decision rule:
$$y=\arg\max_{c_j}\sum_{x_i\in N_k(x)} I(y_i=c_j),\quad i=1,2,\dots,N;\ j=1,2,\dots,K$$
where $I$ is the indicator function: $I(y_i=c_j)=1$ when $y_i=c_j$ and 0 otherwise.
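Algorithm 3.1 is short enough to sketch directly in a few lines of NumPy. This is my own illustration rather than code from the book; `knn_predict` and the toy data are made-up names, and the Euclidean distance is assumed as the metric:

```python
import numpy as np
from collections import Counter

def knn_predict(T_x, T_y, x_new, k=3):
    """Classify x_new by a majority vote over its k nearest training points.

    T_x: (N, n) array of feature vectors; T_y: N class labels.
    Uses the Euclidean (L2) distance as the metric.
    """
    dists = np.linalg.norm(T_x - x_new, axis=1)   # distance to every training point
    nearest = np.argsort(dists)[:k]               # indices of the k closest points, N_k(x)
    votes = Counter(T_y[i] for i in nearest)      # count class labels in N_k(x)
    return votes.most_common(1)[0][0]             # majority class

# toy data: two classes in two well-separated clusters
T_x = np.array([[1.0, 1.0], [1.2, 0.8], [5.0, 5.0], [5.2, 4.8]])
T_y = ['a', 'a', 'b', 'b']
print(knn_predict(T_x, T_y, np.array([0.9, 1.1]), k=3))  # → a
```

Note that with an even k (or more than two classes) the vote can tie; `Counter.most_common` then breaks the tie arbitrarily, which is why odd k is commonly preferred for binary problems.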

3.2 The k-Nearest Neighbor Model

3.2.1 The Model

3.2.2 Distance Metrics

Distance in feature space reflects how similar two instances are. The feature space of the k-NN model is generally the n-dimensional real vector space $R^n$. The Euclidean distance is generally used, but the more general $L_p$ (Minkowski) distance is also an option.
Let the feature space $X$ be the n-dimensional real vector space $R^n$, and let $x_i,x_j\in X$ with $x_i=(x_i^{(1)},x_i^{(2)},\dots,x_i^{(n)})$ and $x_j=(x_j^{(1)},x_j^{(2)},\dots,x_j^{(n)})$. The $L_p$ distance between $x_i$ and $x_j$ is defined as
$$L_p(x_i,x_j)=\left(\sum_{l=1}^n \left|x_i^{(l)}-x_j^{(l)}\right|^p\right)^{\frac 1p}$$
When p = 2 it is the Euclidean distance:
$$L_2(x_i,x_j)=\left(\sum_{l=1}^n \left|x_i^{(l)}-x_j^{(l)}\right|^2\right)^{\frac 12}$$
When p = 1 it is the Manhattan distance:
$$L_1(x_i,x_j)=\sum_{l=1}^n \left|x_i^{(l)}-x_j^{(l)}\right|$$
As $p\rightarrow\infty$ it is the maximum of the coordinate differences:
$$L_\infty(x_i,x_j)=\max_l \left|x_i^{(l)}-x_j^{(l)}\right|,\quad l=1,2,\dots,n$$
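The three special cases above can be computed with a single helper. A small sketch of my own (the name `lp_distance` is made up, not from any library):

```python
import numpy as np

def lp_distance(xi, xj, p):
    """Minkowski (L_p) distance between two feature vectors."""
    diff = np.abs(np.asarray(xi, dtype=float) - np.asarray(xj, dtype=float))
    if np.isinf(p):
        return diff.max()                 # L_inf: maximum coordinate difference
    return (diff ** p).sum() ** (1.0 / p)

xi, xj = [1.0, 1.0], [4.0, 5.0]
print(lp_distance(xi, xj, 1))             # Manhattan: 3 + 4 = 7.0
print(lp_distance(xi, xj, 2))             # Euclidean: sqrt(9 + 16) = 5.0
print(lp_distance(xi, xj, np.inf))        # max(3, 4) = 4.0
```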

3.2.3 Choosing the Value of k

A small k makes the model more complex and sensitive to noise among the nearest points; a large k makes the model simpler but lets distant, less relevant points influence the prediction. In practice, cross-validation is usually used to select an optimal value of k.
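As a rough sketch of what selecting k by cross-validation can look like (all the names here, `predict_one`, `cv_accuracy`, `best_k`, are mine, and the data are synthetic two-cluster samples, not anything from the book):

```python
import numpy as np

def predict_one(train_x, train_y, x, k):
    """Majority vote among the k nearest training points (L2 distance)."""
    idx = np.argsort(np.linalg.norm(train_x - x, axis=1))[:k]
    labels, counts = np.unique(train_y[idx], return_counts=True)
    return labels[np.argmax(counts)]

def cv_accuracy(X, Y, k, folds=5):
    """Average validation accuracy of k-NN over `folds` random splits."""
    order = np.random.permutation(len(X))
    accs = []
    for f in range(folds):
        val = order[f::folds]                 # every folds-th index is validation
        tr = np.setdiff1d(order, val)         # the rest is training
        hits = sum(predict_one(X[tr], Y[tr], X[i], k) == Y[i] for i in val)
        accs.append(hits / len(val))
    return np.mean(accs)

# synthetic data: two well-separated Gaussian clusters
rng = np.random.RandomState(0)
X = np.vstack([rng.randn(30, 2), rng.randn(30, 2) + 5])
Y = np.array([0] * 30 + [1] * 30)
best_k = max([1, 3, 5, 7, 9], key=lambda k: cv_accuracy(X, Y, k))
print(best_k)
```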

3.2.4 Classification Decision Rule

The book states this rule rather mathematically; in plain terms it is majority voting: the input instance is assigned the class held by the majority of its k nearest neighbours. Under 0-1 loss, majority voting is equivalent to minimizing the empirical misclassification rate.

3.3 Implementing k-NN: kd-Trees

3.3.1 Constructing a kd-Tree

Algorithm 3.2: Constructing a Balanced kd-Tree

Input: a data set $T=\{x_1,x_2,\dots,x_N\}$ in k-dimensional space, where $x_i=(x_i^{(1)},x_i^{(2)},\dots,x_i^{(k)})$;
Output: a kd-tree.
(1) Begin by constructing the root node, which corresponds to the hyper-rectangular region of k-dimensional space containing all of $T$.
Choose a dimension $x^{(l)}$, take the median of the $x^{(l)}$ coordinates of all instances in $T$ as the split point, and cut the hyper-rectangle into two subregions.
The root node then gets a left and a right child at depth 1: the left child corresponds to the subregion whose points have $x^{(l)}$ coordinates smaller than the split point's, the right child to the subregion whose points have $x^{(l)}$ coordinates greater than or equal to it.
Instances falling on the splitting hyperplane are stored at the root node.
(2) Repeat (1) on each child region, stopping when no instances remain in either subregion.

3.3.2 Searching a kd-Tree

Algorithm 3.3: Nearest-Neighbor Search with a kd-Tree

Input: a constructed kd-tree and a target point x;
Output: the nearest neighbour of x.
(1) Find the leaf node (region) containing the target point: starting from the root, recursively visit child nodes. If the target's coordinate on the current split dimension is smaller than the split point's, move to the left child; otherwise move to the right child. Continue until a leaf node is reached.
(2) Take this node as the "current nearest point".
(3) Recursively back up the tree; at each node:
(a) If the instance stored at this node is closer to the target than the current nearest point, make it the current nearest point.
(b) The current nearest point must lie in the region of one of this node's children. Check whether the region of the other child contains a closer point; concretely, check whether that region intersects the hypersphere centred at the target point with radius equal to the distance between the target and the current nearest point.
If they intersect, a closer point may exist in the other child's region: move to the other child and continue the nearest-neighbour search recursively;
if they do not intersect, continue backing up.
(4) The search terminates once it has backed up to the root; the final "current nearest point" is the nearest neighbour of x.
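The search procedure above can be sketched compactly on a tuple-based tree. This is my own illustration (`build_kdt` and `nearest` are made-up names), with one simplification: it cycles the split axis by depth rather than choosing the axis of maximum variance as the code further below does.

```python
import numpy as np

def build_kdt(points, depth=0):
    """Build a kd-tree as nested dicts, cycling the split axis with depth."""
    if not points:
        return None
    axis = depth % len(points[0])
    points = sorted(points, key=lambda p: p[axis])
    mid = len(points) // 2                          # median as the split point
    return {'point': points[mid], 'axis': axis,
            'left':  build_kdt(points[:mid], depth + 1),
            'right': build_kdt(points[mid + 1:], depth + 1)}

def nearest(node, target, best=None):
    """Algorithm 3.3: descend toward the target first, then backtrack,
    pruning subtrees that cannot intersect the current best hypersphere."""
    if node is None:
        return best
    d = np.linalg.norm(np.subtract(node['point'], target))
    if best is None or d < best[1]:
        best = (node['point'], d)                   # update current nearest point
    axis = node['axis']
    diff = target[axis] - node['point'][axis]
    near, far = ((node['left'], node['right']) if diff < 0
                 else (node['right'], node['left']))
    best = nearest(near, target, best)              # side containing the target
    if abs(diff) < best[1]:                         # sphere crosses the split plane
        best = nearest(far, target, best)           # so the other side may be closer
    return best

tree = build_kdt([(2, 3), (5, 4), (9, 6), (4, 7), (8, 1), (7, 2)])
point, dist = nearest(tree, (3, 4.5))
print(point)  # → (2, 3)
```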

Code

The following code was debugged under Python 3.

(1) Building the kd-tree

A picture first, since the result is rather fun:
(Figure: a kd-tree generated with Python)
The input data is a set of 2-dimensional vectors; higher dimensions are also supported, and the code adapts to them.

import numpy as np
import matplotlib.pyplot as plt
import copy
import math
"""
X,  feature vectors
Y,  class of X
D,  dimension of each vector.
"""
# Construct the initial data to be classified
D   = 2
NUM = 50
C = [ 'g', 'r', 'b' ]
#X = np.array([ (3,5), (2,4), (1,1), (5,2), (1,5), (4,1) ])
X = np.random.rand(NUM,D)
Y = [ C[i] for i in np.random.randint(0,len(C),NUM) ]

class KD_Node:
    cur_trav = None             # cursor for traversal.
    x_min = 0
    x_max = 1
    y_min = 0
    y_max = 1

    def __init__( self,
                  point=None, split=None, color=None,
                  L=None, R=None, father=None,
                  scope=None ):
        """
        Initialize a kd-tree node.
        point: datum of this node
        split: split plane for this node
        L:     left son
        R:     right son
        father: father of this node, if root it's None
        scope: area in hyperspace for each node.
        """
        self.point  = point
        self.split  = split
        self.color  = color
        self.left   = L
        self.right  = R
        self.father = father
        self.flag_trav = 0      # traversal flag. 
                                #   bit 0 is notation for itself
                                #   bit 1 is for its left son
                                #   bit 2 is for its right son
        self.scope = scope if scope is not None else {}
                                # paint scope:
                                #   x0: min of x
                                #   x1: max of x
                                #   y0: min of y
                                #   y1: max of y

    def clear_trav(self):
        KD_Node.cur_trav = None
        self.flag_trav = 0
        if self.left:
            self.left.clear_trav()
        if self.right:
            self.right.clear_trav()

    def __iter__(self):
        return self

    def __next__(self):
        # traverse the tree iteratively, without recursion
        cursor = None
        if KD_Node.cur_trav == None:        # First time to use cur_trav, initiate.
            KD_Node.cur_trav = self

        cursor = KD_Node.cur_trav
        while 1:
            if cursor.flag_trav & 0X07 == 0X7:      # flag value 0x07 means this
                                                    # node and both of its sons
                                                    # are completely traversed.
                if cursor.father == None:
                    raise StopIteration
                else:
                    cursor = cursor.father
            
            elif cursor.flag_trav & 0X01 == 0:      # if bit0 == 0,
                cursor.flag_trav |= 0X01            # set bit0 = 1
                #cursor = cursor            # not need. set cursor => self
                break                               # BREAK! return current.
            
            elif cursor.flag_trav & 0X02 == 0:      # if bit1 == 0,
                cursor.flag_trav |= 0X02            # set bit1 of self
                if cursor.left != None:
                    cursor = cursor.left            # set cursor => left son
                else:                               # self.left is None, skip
                    continue
            
            elif cursor.flag_trav & 0X04 == 0:      # if bit2 == 0,
                cursor.flag_trav |= 0X04            # set bit2 = 1
                if cursor.right != None:
                    cursor = cursor.right           # set cursor => right son
                else:
                    continue
        KD_Node.cur_trav = cursor

        return KD_Node.cur_trav


def CreateKDT(node=None, data=None, color=None, father=None ):
    """
    Recursively build a kd tree.
    INPUT: node,   the subtree root to build (None on the first call)
           data,   ndarray of points, e.g. np.array([ (3,5), (2,4), (1,1) ])
           color,  color inherited from the father node
           father, the father node (None for the root)
    OUTPUT: the root KD_Node of the constructed subtree
    """
    global C
    if len(data) > 0:
        global D
        dim = D
        var = np.var(data, axis=0)          # variance for each dimension
        split = np.argmax(var)              # split for this node
        pos = int(len(data)/2)
        pos_list = np.argpartition(data[:,split], pos)
        point = data[pos_list[pos]]         # point for this node
        color = C[np.random.randint(0, len(C))]
        cur_scope = {}                      # scope

        if not father:                      # root: scope is the whole graph,
            cur_scope = { 'x0': KD_Node.x_min, 'x1': KD_Node.x_max,
                          'y0': KD_Node.y_min, 'y1': KD_Node.y_max }
        else:                               # update cur_scope
            cur_scope = copy.deepcopy(father.scope)
            if father.split == 0:
                if point[0] < father.point[0]:
                    cur_scope['x1'] = father.point[0]
                else:
                    cur_scope['x0'] = father.point[0]
            elif father.split == 1:
                if point[1] < father.point[1]:
                    cur_scope['y1'] = father.point[1]
                else:
                    cur_scope['y0'] = father.point[1]                

        node = KD_Node( point=point, split=split, color=color, father=father,
                        scope=cur_scope )

        if len(data[pos_list[:pos]]) != 0:
            node.left  = CreateKDT( node    = node.left,
                                    data    = data[pos_list[:pos]],
                                    color   = color,
                                    father  = node )

        if len(data[pos_list[(pos+1):]]) != 0:
            node.right = CreateKDT( node    = node.right,
                                    data    = data[pos_list[(pos+1):]],
                                    color   = color,
                                    father  = node )

    return node

def get_split_pos(data, split):
    """Return the position of the median element along dimension `split`."""
    return len(data) // 2

def preorder(node, depth=0):
    """
    Preorder-print a KD subtree.
    """
    if node:
        print(node.point)
        if node.left:
            preorder(node.left, depth + 1)
        if node.right:
            preorder(node.right, depth + 1)

def draw_KDT(kd):
    """
    Draw each datum as a point and draw every node's splitting line.
    """
    x_min = kd.x_min
    x_max = kd.x_max
    y_min = kd.y_min
    y_max = kd.y_max
    plt.figure(figsize=(6,6))
    plt.xlabel("$x^{(1)}$")
    plt.ylabel("$x^{(2)}$")
    plt.title("Machine Learning: KD Tree")
    plt.xlim(int(x_min),math.ceil(x_max))
    plt.ylim(int(y_min),math.ceil(y_max))
    ax = plt.gca()
    ax.set_aspect(1)

    plt.plot( [x_min, x_max, x_max, x_min, x_min],
              [y_min, y_min, y_max, y_max, y_min] )

    line_from = []              # split line from and to
    line_to   = []
    
    for node in kd:
        if node.split == 0:
            line_from = [ node.point[0], node.scope['y0'] ]
            line_to   = [ node.point[0], node.scope['y1'] ]
        if node.split == 1:
            line_from = [ node.scope['x0'], node.point[1] ]
            line_to   = [ node.scope['x1'], node.point[1] ]

        plt.plot( [ line_from[0], line_to[0] ],
                  [ line_from[1], line_to[1] ],
                  'k-', linewidth=1 )
        plt.scatter( node.point[0], node.point[1], color=node.color )


    plt.show()


def find_knn(root, x):
    # TODO: k-nearest-neighbour search over the tree (Algorithm 3.3).
    pass


def main():
    kd = None
    kd = CreateKDT(kd, X)

    #kd.clear_trav()
    draw_KDT(kd)

if __name__ == "__main__":
    main()


