Proposition 1:
Given an array of 1,000 known integers and a query integer, determine whether the query appears in the array.
Proposition 2:
Given an array of 1,000 known integers and a query integer, find the number in the array closest to it.
Proposition 3:
Given an array of 1,000 Point structures (each holding X and Y coordinates) and a query Point, find the point in the array closest to it (e.g., by shortest Euclidean distance).
Proposition 4:
Given 1,000,000 vectors of 128 dimensions each, and a query vector, find the K vectors in the collection closest to it.
- For Proposition 1, setting aside bucketing and hashing, the usual approach is to sort the array and then use binary search.
- For Proposition 2, proceed as in Proposition 1, then compare the element located by the binary search with its neighbor on each side. The process effectively turns the 1,000-integer array into a binary tree, and at the end we only need to compare a leaf node with its parent.
- Propositions 3 and 4 are instances of the so-called Nearest Neighbor problem. One approximate solution is the KD-tree.
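The binary-search approach for Propositions 1 and 2 can be sketched as follows. This is a minimal illustration of my own (the class and method names are not from the KD-tree code later in this article); it relies on `Arrays.binarySearch` returning `-(insertionPoint) - 1` on a miss.

```java
import java.util.Arrays;

public class NearestInt {
    // Proposition 1: membership test via binary search on a sorted array
    static boolean contains(int[] sorted, int q) {
        return Arrays.binarySearch(sorted, q) >= 0;
    }

    // Proposition 2: closest value — binary search, then compare the
    // two elements adjacent to the insertion point
    static int closest(int[] sorted, int q) {
        int i = Arrays.binarySearch(sorted, q);
        if (i >= 0) return sorted[i];                  // exact hit
        int ins = -i - 1;                              // insertion point
        if (ins == 0) return sorted[0];                // below the smallest
        if (ins == sorted.length) return sorted[sorted.length - 1]; // above the largest
        int lo = sorted[ins - 1], hi = sorted[ins];
        return (q - lo <= hi - q) ? lo : hi;           // pick the nearer neighbor
    }

    public static void main(String[] args) {
        int[] a = {1, 4, 7, 10, 15};
        System.out.println(contains(a, 7));   // true
        System.out.println(closest(a, 8));    // 7
    }
}
```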
KNN retrieval over high-dimensional vectors is a key problem in image search and other multimedia content search. Since online material on high-dimensional vectors is relatively scarce, I would like to share some of my notes here.
Like a binary search tree, a KD-tree partitions the data by comparisons, except that each node holds a high-dimensional vector: at every node one dimension is selected, either at random or by a fixed rule, and the points are split on that dimension. The overall idea is to cut the data space in two, again and again, with a series of hyperplanes, somewhat like a decision tree in data mining.
A demo of a KD-tree partitioning a 2D plane is shown below:
The KD-tree build code is as follows:
```java
private ClusterKDTree(Clusterable[] points, int height, boolean randomSplit) {
    if (points.length == 1) {
        cluster = points[0];
    } else {
        // choose the dimension to split on, then the split value along it
        splitIndex = chooseSplitDimension(points[0].getLocation().length, height, randomSplit);
        splitValue = chooseSplit(points, splitIndex);
        Vector<Clusterable> left = new Vector<Clusterable>();
        Vector<Clusterable> right = new Vector<Clusterable>();
        for (int i = 0; i < points.length; i++) {
            double val = points[i].getLocation()[splitIndex];
            if (val == splitValue && cluster == null) {
                // keep one point on the splitting hyperplane as this node itself
                cluster = points[i];
            } else if (val >= splitValue) {
                right.add(points[i]);
            } else {
                left.add(points[i]);
            }
        }
        // when splitting randomly, pass the current dimension down so the child
        // avoids picking it again; otherwise pass the incremented depth
        if (right.size() > 0) {
            this.right = new ClusterKDTree(right.toArray(new Clusterable[right.size()]),
                    randomSplit ? splitIndex : height + 1, randomSplit);
        }
        if (left.size() > 0) {
            this.left = new ClusterKDTree(left.toArray(new Clusterable[left.size()]),
                    randomSplit ? splitIndex : height + 1, randomSplit);
        }
    }
}

private int chooseSplitDimension(int dimensionality, int height, boolean random) {
    if (!random) return height % dimensionality;  // cycle through the dimensions by depth
    int rand = r.nextInt(dimensionality);
    while (rand == height) {  // avoid re-splitting on the parent's dimension
        rand = r.nextInt(dimensionality);
    }
    return rand;
}

private double chooseSplit(Clusterable points[], int splitIdx) {
    double[] values = new double[points.length];
    for (int i = 0; i < points.length; i++) {
        values[i] = points[i].getLocation()[splitIdx];
    }
    Arrays.sort(values);
    return values[values.length / 2];  // take the median to keep the tree balanced
}
```
After the KD-tree is built, how do we use it for KNN retrieval? The figure below (a 20-second GIF animation) illustrates the process:
With a KD-tree, a single binary-search-style descent yields a greedy approximation of the query's nearest neighbor; the code is as follows:
```java
private Clusterable restrictedNearestNeighbor(Clusterable point,
        SizedPriorityQueue<ClusterKDTree> values) {
    if (splitIndex == -1) {
        return cluster;  // reached a leaf node
    }
    double val = point.getLocation()[splitIndex];
    Clusterable closest = null;
    if (val >= splitValue && right != null || left == null) {
        // descend the right subtree; queue the left subtree for later backtracking,
        // prioritized by its distance to the splitting hyperplane
        if (left != null) {
            values.add(left, val - splitValue);
        }
        closest = right.restrictedNearestNeighbor(point, values);
    } else if (val < splitValue && left != null || right == null) {
        // descend the left subtree; queue the right subtree for later backtracking
        if (right != null) {
            values.add(right, splitValue - val);
        }
        closest = left.restrictedNearestNeighbor(point, values);
    }
    // on the way back up, check whether this node is closer than the candidate
    // found below it (guard against a null candidate before measuring its distance)
    double currClusterDistance = ClusterUtils.getEuclideanDistance(cluster, point);
    if (closest == null
            || ClusterUtils.getEuclideanDistance(closest, point) > currClusterDistance) {
        closest = cluster;
    }
    return closest;
}
```
In practice a single descent can leave a considerable error, so a priority queue records, at each branching decision, the subtree on the side not taken. For my implementation of SizedPriorityQueue, see my other article:
http://grunt1223.iteye.com/blog/909739
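The linked article has the full SizedPriorityQueue; for reference, here is a minimal sketch of my own showing the contract the KD-tree code relies on: `add(item, dist)` keeps at most a fixed number of entries, and `pop()` returns the entry with the smallest distance first. The class name `BoundedPQ` and the `TreeMap`-based internals are assumptions for illustration, not the original code (note that a `TreeMap` overwrites entries with exactly equal distances, which the real implementation would need to handle).

```java
import java.util.TreeMap;

// Minimal size-bounded priority queue: closest-first pop, farthest evicted on overflow.
public class BoundedPQ<T> {
    private final TreeMap<Double, T> map = new TreeMap<Double, T>();
    private final int maxSize;

    public BoundedPQ(int maxSize) { this.maxSize = maxSize; }

    public void add(T item, double dist) {
        map.put(dist, item);               // caution: equal keys overwrite
        if (map.size() > maxSize) {
            map.remove(map.lastKey());     // drop the farthest entry
        }
    }

    public T pop() {
        return map.remove(map.firstKey()); // closest-first, as BBF requires
    }

    public int size() { return map.size(); }

    public static void main(String[] args) {
        BoundedPQ<String> q = new BoundedPQ<String>(2);
        q.add("far", 9.0);
        q.add("near", 1.0);
        q.add("mid", 5.0);                 // evicts "far", the farthest of the three
        System.out.println(q.pop());       // near
    }
}
```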
One way to reduce the error, known as Best Bin First (BBF), is to backtrack through a bounded number of these queued nodes:
```java
public Clusterable restrictedNearestNeighbor(Clusterable point, int numMaxBinsChecked) {
    SizedPriorityQueue<ClusterKDTree> bins = new SizedPriorityQueue<ClusterKDTree>(50, true);
    Clusterable closest = restrictedNearestNeighbor(point, bins);
    double closestDist = ClusterUtils.getEuclideanDistance(point, closest);
    int count = 0;
    while (count < numMaxBinsChecked && bins.size() > 0) {
        // revisit the most promising unexplored subtree (the "best bin") first
        ClusterKDTree nextBin = bins.pop();
        Clusterable possibleClosest = nextBin.restrictedNearestNeighbor(point, bins);
        double dist = ClusterUtils.getEuclideanDistance(point, possibleClosest);
        if (dist < closestDist) {
            closest = possibleClosest;
            closestDist = dist;
        }
        count++;
    }
    return closest;
}
```
The whole thing can be tested with the following code:
```java
public static void main(String args[]) {
    Clusterable clusters[] = new Clusterable[10];
    clusters[0] = new Point(0, 0);
    clusters[1] = new Point(1, 2);
    clusters[2] = new Point(2, 3);
    clusters[3] = new Point(1, 5);
    clusters[4] = new Point(2, 5);
    clusters[5] = new Point(1, 1);
    clusters[6] = new Point(3, 3);
    clusters[7] = new Point(0, 2);
    clusters[8] = new Point(4, 4);
    clusters[9] = new Point(5, 5);
    ClusterKDTree tree = new ClusterKDTree(clusters, true);
    //tree.print();
    Clusterable c = tree.restrictedNearestNeighbor(new Point(4, 4), 1000);
    System.out.println(c);
}
```