Proposition 1:
Given an array of 1,000 known integers and a query integer, determine whether the query appears in the array.
Proposition 2:
Given an array of 1,000 known integers and a query integer, find the number in the array closest to it.
Proposition 3:
Given an array of 1,000 Point structures (each holding X and Y coordinates) and a query Point, find the point in the array closest to it (e.g., by shortest Euclidean distance).
Proposition 4:
Given 1,000,000 vectors of 128 dimensions each, and a query vector, find the K vectors in the collection closest to it.
- For Proposition 1, setting aside bucketing and hashing, the usual approach is to sort the array and then use binary search.
- For Proposition 2, proceed as in Proposition 1, then compare the element located by the binary search with its neighbor on each side. The process effectively turns the 1,000-integer array into a binary tree, and at the end we only need to compare a leaf node with its parent.
- Propositions 3 and 4 are instances of the so-called Nearest Neighbor problem. One approximate solution is the KD-tree.
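The binary-search approach for Propositions 1 and 2 can be sketched as follows. This is a minimal illustration of my own (the class and method names are not from the KD-tree code later in this article); it relies on `Arrays.binarySearch` returning `-(insertionPoint) - 1` on a miss.

```java
import java.util.Arrays;

public class NearestInt {
    // Proposition 1: membership test via binary search on a sorted array
    static boolean contains(int[] sorted, int q) {
        return Arrays.binarySearch(sorted, q) >= 0;
    }

    // Proposition 2: closest value — binary search, then compare the
    // two elements adjacent to the insertion point
    static int closest(int[] sorted, int q) {
        int i = Arrays.binarySearch(sorted, q);
        if (i >= 0) return sorted[i];                  // exact hit
        int ins = -i - 1;                              // insertion point
        if (ins == 0) return sorted[0];                // below the smallest
        if (ins == sorted.length) return sorted[sorted.length - 1]; // above the largest
        int lo = sorted[ins - 1], hi = sorted[ins];
        return (q - lo <= hi - q) ? lo : hi;           // pick the nearer neighbor
    }

    public static void main(String[] args) {
        int[] a = {1, 4, 7, 10, 15};
        System.out.println(contains(a, 7));   // true
        System.out.println(closest(a, 8));    // 7
    }
}
```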
KNN retrieval over high-dimensional vectors is a key problem in image search and other multimedia content search. Since online material on high-dimensional vectors is relatively scarce, I would like to share some of my notes here.
Like a binary search tree, a KD-tree partitions the data by comparisons, except that each node holds a high-dimensional vector: at every node one dimension is selected, either at random or by a fixed rule, and the points are split on that dimension. The overall idea is to cut the data space in two, again and again, with a series of hyperplanes, somewhat like a decision tree in data mining.
A demo of a KD-tree partitioning a 2D plane is shown below:
The KD-tree build code is as follows:
```java
private ClusterKDTree(Clusterable[] points, int height, boolean randomSplit) {
    if (points.length == 1) {
        cluster = points[0];
    } else {
        // choose the dimension to split on, then the split value along it
        splitIndex = chooseSplitDimension(points[0].getLocation().length, height, randomSplit);
        splitValue = chooseSplit(points, splitIndex);
        Vector<Clusterable> left = new Vector<Clusterable>();
        Vector<Clusterable> right = new Vector<Clusterable>();
        for (int i = 0; i < points.length; i++) {
            double val = points[i].getLocation()[splitIndex];
            if (val == splitValue && cluster == null) {
                // keep one point on the splitting hyperplane as this node itself
                cluster = points[i];
            } else if (val >= splitValue) {
                right.add(points[i]);
            } else {
                left.add(points[i]);
            }
        }
        // when splitting randomly, pass the current dimension down so the child
        // avoids picking it again; otherwise pass the incremented depth
        if (right.size() > 0) {
            this.right = new ClusterKDTree(right.toArray(new Clusterable[right.size()]),
                    randomSplit ? splitIndex : height + 1, randomSplit);
        }
        if (left.size() > 0) {
            this.left = new ClusterKDTree(left.toArray(new Clusterable[left.size()]),
                    randomSplit ? splitIndex : height + 1, randomSplit);
        }
    }
}

private int chooseSplitDimension(int dimensionality, int height, boolean random) {
    if (!random) return height % dimensionality;  // cycle through the dimensions by depth
    int rand = r.nextInt(dimensionality);
    while (rand == height) {  // avoid re-splitting on the parent's dimension
        rand = r.nextInt(dimensionality);
    }
    return rand;
}

private double chooseSplit(Clusterable points[], int splitIdx) {
    double[] values = new double[points.length];
    for (int i = 0; i < points.length; i++) {
        values[i] = points[i].getLocation()[splitIdx];
    }
    Arrays.sort(values);
    return values[values.length / 2];  // take the median to keep the tree balanced
}
```
After the KD-tree is built, how do we use it for KNN retrieval? The figure below (a 20-second GIF animation) illustrates the process:
With a KD-tree, a single binary-search-style descent yields a greedy approximation of the query's nearest neighbor; the code is as follows:
```java
private Clusterable restrictedNearestNeighbor(Clusterable point,
        SizedPriorityQueue<ClusterKDTree> values) {
    if (splitIndex == -1) {
        return cluster;  // reached a leaf node
    }
    double val = point.getLocation()[splitIndex];
    Clusterable closest = null;
    if (val >= splitValue && right != null || left == null) {
        // descend the right subtree; queue the left subtree for later backtracking,
        // prioritized by its distance to the splitting hyperplane
        if (left != null) {
            values.add(left, val - splitValue);
        }
        closest = right.restrictedNearestNeighbor(point, values);
    } else if (val < splitValue && left != null || right == null) {
        // descend the left subtree; queue the right subtree for later backtracking
        if (right != null) {
            values.add(right, splitValue - val);
        }
        closest = left.restrictedNearestNeighbor(point, values);
    }
    // on the way back up, check whether this node is closer than the candidate
    // found below it (guard against a null candidate before measuring its distance)
    double currClusterDistance = ClusterUtils.getEuclideanDistance(cluster, point);
    if (closest == null
            || ClusterUtils.getEuclideanDistance(closest, point) > currClusterDistance) {
        closest = cluster;
    }
    return closest;
}
```
In practice a single descent can leave a considerable error, so a priority queue records, at each branching decision, the subtree on the side not taken. For my implementation of SizedPriorityQueue, see my other article:
http://grunt1223.iteye.com/blog/909739
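The linked article has the full SizedPriorityQueue; for reference, here is a minimal sketch of my own showing the contract the KD-tree code relies on: `add(item, dist)` keeps at most a fixed number of entries, and `pop()` returns the entry with the smallest distance first. The class name `BoundedPQ` and the `TreeMap`-based internals are assumptions for illustration, not the original code (note that a `TreeMap` overwrites entries with exactly equal distances, which the real implementation would need to handle).

```java
import java.util.TreeMap;

// Minimal size-bounded priority queue: closest-first pop, farthest evicted on overflow.
public class BoundedPQ<T> {
    private final TreeMap<Double, T> map = new TreeMap<Double, T>();
    private final int maxSize;

    public BoundedPQ(int maxSize) { this.maxSize = maxSize; }

    public void add(T item, double dist) {
        map.put(dist, item);               // caution: equal keys overwrite
        if (map.size() > maxSize) {
            map.remove(map.lastKey());     // drop the farthest entry
        }
    }

    public T pop() {
        return map.remove(map.firstKey()); // closest-first, as BBF requires
    }

    public int size() { return map.size(); }

    public static void main(String[] args) {
        BoundedPQ<String> q = new BoundedPQ<String>(2);
        q.add("far", 9.0);
        q.add("near", 1.0);
        q.add("mid", 5.0);                 // evicts "far", the farthest of the three
        System.out.println(q.pop());       // near
    }
}
```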
One way to reduce the error, known as Best Bin First (BBF), is to backtrack through a bounded number of these queued nodes:
```java
public Clusterable restrictedNearestNeighbor(Clusterable point, int numMaxBinsChecked) {
    SizedPriorityQueue<ClusterKDTree> bins = new SizedPriorityQueue<ClusterKDTree>(50, true);
    Clusterable closest = restrictedNearestNeighbor(point, bins);
    double closestDist = ClusterUtils.getEuclideanDistance(point, closest);
    int count = 0;
    while (count < numMaxBinsChecked && bins.size() > 0) {
        // revisit the most promising unexplored subtree (the "best bin") first
        ClusterKDTree nextBin = bins.pop();
        Clusterable possibleClosest = nextBin.restrictedNearestNeighbor(point, bins);
        double dist = ClusterUtils.getEuclideanDistance(point, possibleClosest);
        if (dist < closestDist) {
            closest = possibleClosest;
            closestDist = dist;
        }
        count++;
    }
    return closest;
}
```
The whole thing can be tested with the following code:
```java
public static void main(String args[]) {
    Clusterable clusters[] = new Clusterable[10];
    clusters[0] = new Point(0, 0);
    clusters[1] = new Point(1, 2);
    clusters[2] = new Point(2, 3);
    clusters[3] = new Point(1, 5);
    clusters[4] = new Point(2, 5);
    clusters[5] = new Point(1, 1);
    clusters[6] = new Point(3, 3);
    clusters[7] = new Point(0, 2);
    clusters[8] = new Point(4, 4);
    clusters[9] = new Point(5, 5);
    ClusterKDTree tree = new ClusterKDTree(clusters, true);
    //tree.print();
    Clusterable c = tree.restrictedNearestNeighbor(new Point(4, 4), 1000);
    System.out.println(c);
}
```