Large-scale high-dimensional vector search with nmslib (e.g. large-scale face recognition)

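I have recently been working on a face recognition project. The reference face library holds roughly 500,000 face images, and the recognition pipeline is MTCNN + ArcFace. The ArcFace model turns each face image into a 512-dimensional feature vector, so the 500,000 images yield a face feature matrix of shape [500000, 512].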

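During recognition, a person's identity is determined by finding the minimum distance between the query face feature and these 500,000 features. When only one face feature is recognized at a time, numpy broadcasting is enough to compute the minimum distance. If a picture contains 10 faces, however, their feature matrix has shape [10, 512], which cannot be broadcast directly against the [500000, 512] matrix: the shapes are incompatible, and inserting an extra axis to force the broadcast would materialize a [10, 500000, 512] intermediate array. The straightforward fallback is a for loop that recognizes the 10 faces one by one, in which case the time to process a picture grows linearly with the number of faces in it. That is clearly unacceptable.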
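For reference, the brute-force baseline described above could look roughly like the following sketch. It uses a small random gallery in place of the real [500000, 512] feature matrix, and the function names `nearest_one` and `nearest_loop` are hypothetical, chosen only for this illustration.

```python
import numpy as np


def nearest_one(gallery, query):
    """Return (index, distance) of the gallery row closest to one query vector."""
    # Broadcasting: (N, D) - (D,) -> (N, D), then reduce over the feature axis.
    dists = np.linalg.norm(gallery - query, axis=1)
    best = int(np.argmin(dists))
    return best, float(dists[best])


def nearest_loop(gallery, queries):
    """Identify several query vectors one by one; cost grows linearly with len(queries)."""
    return [nearest_one(gallery, q) for q in queries]


# Toy stand-in for the feature matrix (assumed sizes, far smaller than the real gallery).
rng = np.random.default_rng(0)
gallery = rng.normal(size=(1000, 512)).astype(np.float32)
queries = gallery[[3, 42, 999]] + 0.001  # slightly perturbed copies of known rows

results = nearest_loop(gallery, queries)
print([idx for idx, _ in results])  # [3, 42, 999]
```

Every query scans the whole gallery, which is exactly the cost that an approximate index is meant to avoid.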

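The rest of this post briefly shows how to use nmslib for large-scale high-dimensional vector search. For the full set of configuration parameters, refer to the nmslib open-source repository; the settings given here are sufficient for searching a few million high-dimensional vectors.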

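1. Installing nmslib

Installing nmslib in a Python environment is straightforward: run pip install nmslib, and once the installation finishes, load the library with import nmslib.

2. nmslib usage example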

# coding:utf-8

import numpy as np
import nmslib
import datetime
from functools import wraps


def func_execute_time(func):
    @wraps(func)
    def wrapper(*args, **kwargs):
        start_time = datetime.datetime.now()
        res = func(*args, **kwargs)
        end_time = datetime.datetime.now()
        # total_seconds() covers the whole interval; .microseconds alone drops full seconds
        duration_time = (end_time - start_time).total_seconds() * 1000
        print("execute function %s, elapse time %.4f ms" % (func.__name__, duration_time))
        return res
    return wrapper


def create_indexer(vec, index_thread, m, ef):
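    """
    Build an index from the data vectors and save it to disk
    :param vec: raw data vectors
    :param index_thread: number of threads used to build the index
    :param m: HNSW parameter M (see the nmslib documentation)
    :param ef: HNSW parameter efConstruction (see the nmslib documentation)
    :return: None
    """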
    index = nmslib.init(method="hnsw", space="l2")
    index.addDataPointBatch(vec)
    INDEX_TIME_PARAMS = {
        "indexThreadQty": index_thread,
        "M": m,
        "efConstruction": ef,
        "post": 2
    }
    index.createIndex(INDEX_TIME_PARAMS, print_progress=True)
    index.saveIndex("data_%d_%d_%d.hnsw" % (index_thread, m, ef))


def load_indexer(index_path, ef_search=300):
    """
    Load a previously built vector index file
    :param index_path: path of the index file
    :param ef_search: query-time parameter efSearch
    :return: the loaded index
    """
    indexer = nmslib.init(method="hnsw", space="l2")
    indexer.loadIndex(index_path)
    indexer.setQueryTimeParams({"efSearch": ef_search})
    return indexer


@func_execute_time
def search_vec_top_n(indexer, vecs, top_n=7, threads=4):
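    """
    Query vectors against a built index
    :param indexer: the index
    :param vecs: vectors to look up
    :param top_n: number of nearest results to return per query
    :param threads: number of threads used for the query
    :return: list of (ids, distances) pairs, one per query vector
    """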
    neighbours = indexer.knnQueryBatch(vecs, k=top_n, num_threads=threads)
    # print(neighbours)
    return neighbours


if __name__ == '__main__':
    data = np.arange(300000000).reshape(1000000, 300)
    print(data.shape)
    create_indexer(data, 10, 5, 300)
    indexer = load_indexer("data_10_5_300.hnsw")

    print(search_vec_top_n(indexer, data[:10]))
    print(search_vec_top_n(indexer, data[-10:]))

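1. The code above builds a matrix of shape [1000000, 300] and calls create_indexer to build an index over it. The index is saved to a *.hnsw file.

Parameters involved in building the index:

method="hnsw": short for "Hierarchical Navigable Small World Graph", one of the index construction algorithms available in nmslib.

space="l2": the distance metric used when searching for vectors.

M, efConstruction and post trade off the cost and complexity of the index against the final search accuracy, while indexThreadQty sets the number of threads used to build it. For searches over a few million vectors, the settings above can be used as they are. See the nmslib manual for the precise meaning of each parameter.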

2. The saved index file is loaded with load_indexer.

Parameter involved in loading the index: efSearch, which is kept equal to efConstruction (300 in this example).

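3. The search_vec_top_n function uses the built index to look up vectors.

Parameters involved in the search stage:

top_n: the number of nearest vectors to retrieve from the vector space.

threads: the number of threads used for the query, which affects search speed.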
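Because HNSW is an approximate method, it can be worth checking how a given choice of M, efConstruction and efSearch affects accuracy before relying on it. One common approach is to compare the ids returned by the index with an exact brute-force search on a small sample of queries. The sketch below is a minimal version of that check; `exact_top_k` and `recall_at_k` are hypothetical helpers (not part of nmslib), and the `approx` list only imitates the shape of what `knnQueryBatch` returns.

```python
import numpy as np


def exact_top_k(data, queries, k):
    """Exact k nearest neighbours (L2) by brute force; only feasible for small samples."""
    # Pairwise squared distances via ||a - b||^2 = ||a||^2 - 2 a.b + ||b||^2.
    d2 = (
        (queries ** 2).sum(axis=1)[:, None]
        - 2.0 * queries @ data.T
        + (data ** 2).sum(axis=1)[None, :]
    )
    return np.argsort(d2, axis=1)[:, :k]


def recall_at_k(approx, exact_ids):
    """Average fraction of the true top-k ids found in the approximate results.

    approx is a list of (ids, distances) pairs, one per query.
    """
    hits = [
        len(set(map(int, ids)) & set(map(int, truth))) / len(truth)
        for (ids, _), truth in zip(approx, exact_ids)
    ]
    return float(np.mean(hits))


rng = np.random.default_rng(1)
data = rng.normal(size=(200, 16))
queries = data[:5]
truth = exact_top_k(data, queries, k=3)

# Simulated index output: correct for every query except the last, which misses one id.
approx = [(row.copy(), np.zeros(3)) for row in truth]
approx[-1][0][2] = -1  # replace one true neighbour with an id that does not exist
recall = recall_at_k(approx, truth)
print(recall)
```

In a real check, `approx` would be the output of `search_vec_top_n` for the same sample of queries, and a low recall would suggest raising efSearch or M.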

Running the example script produces the following output.

Search results for the data[:10] vectors:

execute function search_vec_top_n, elapse time 15.0000 ms
[
    (array([0, 1, 2, 3, 4, 5, 6]), array([    0.    ,  5196.1523, 10392.305 , 15588.457 , 20784.61  , 25980.762 , 31176.914 ], dtype=float32)), 
    (array([1, 0, 2, 3, 4, 5, 6]), array([    0.    ,  5196.1523,  5196.1523, 10392.305 , 15588.457 , 20784.61  , 25980.762 ], dtype=float32)), 
    (array([2, 1, 3, 0, 4, 5, 6]), array([    0.    ,  5196.1523,  5196.1523, 10392.305 , 10392.305 , 15588.457 , 20784.61  ], dtype=float32)), 
    (array([3, 2, 4, 1, 5, 0, 6]), array([    0.    ,  5196.1523,  5196.1523, 10392.305 , 10392.305 , 15588.457 , 15588.457 ], dtype=float32)), 
    (array([4, 3, 5, 2, 6, 1, 7]), array([    0.    ,  5196.1523,  5196.1523, 10392.305 , 10392.305 , 15588.457 , 15588.457 ], dtype=float32)), 
    (array([5, 4, 6, 3, 7, 2, 8]), array([    0.    ,  5196.1523,  5196.1523, 10392.305 , 10392.305 , 15588.457 , 15588.457 ], dtype=float32)), 
    (array([6, 5, 7, 4, 8, 3, 9]), array([    0.    ,  5196.1523,  5196.1523, 10392.305 , 10392.305 , 15588.457 , 15588.457 ], dtype=float32)), 
    (array([ 7,  6,  8,  5,  9,  4, 10]), array([    0.    ,  5196.1523,  5196.1523, 10392.305 , 10392.305 , 15588.457 , 15588.457 ], dtype=float32)), 
    (array([ 8,  7,  9,  6, 10,  5, 11]), array([    0.    ,  5196.1523,  5196.1523, 10392.305 , 10392.305 , 15588.457 , 15588.457 ], dtype=float32)), 
    (array([ 9,  8, 10,  7, 11,  6, 12]), array([    0.    ,  5196.1523,  5196.1523, 10392.305 , 10392.305 , 15588.457 , 15588.457 ], dtype=float32))
]

Search results for the data[-10:] vectors:

execute function search_vec_top_n, elapse time 15.0000 ms
[
    (array([999990, 999989, 999991, 999988, 999992, 999987, 999993]), array([    0.    ,  5200.271 ,  5209.6157, 10385.96  , 10395.076 , 15594.214 , 15594.214 ], dtype=float32)), 
    (array([999991, 999992, 999990, 999993, 999989, 999988, 999994]), array([    0.    ,  5196.528 ,  5209.6157, 10393.253 , 10398.72  , 15586.727 , 15586.727 ], dtype=float32)), 
    (array([999992, 999991, 999993, 999990, 999994, 999989, 999995]), array([    0.   ,  5196.528,  5207.748, 10395.076, 10398.72 , 15586.727, 15586.727], dtype=float32)), 
    (array([999993, 999994, 999992, 999995, 999991, 999990, 999996]), array([    0.   ,  5202.141,  5207.748, 10387.783, 10393.253, 15594.214, 15594.214], dtype=float32)), 
    (array([999994, 999995, 999993, 999992, 999996, 999991, 999997]), array([    0.   ,  5196.528,  5202.141, 10398.72 , 10400.542, 15586.727, 15586.727], dtype=float32)), 
    (array([999995, 999994, 999996, 999993, 999997, 999992, 999998]), array([    0.   ,  5196.528,  5215.215, 10387.783, 10398.72 , 15586.727, 15592.343], dtype=float32)), 
    (array([999996, 999997, 999995, 999998, 999994, 999999, 999993]), array([    0.   ,  5194.656,  5215.215, 10385.96 , 10400.542, 15588.599, 15594.214], dtype=float32)), 
    (array([999997, 999996, 999998, 999995, 999999, 999994, 999993]), array([    0.   ,  5194.656,  5202.141, 10398.72 , 10402.363, 15586.727, 20782.762], dtype=float32)), 
    (array([999998, 999997, 999999, 999996, 999995, 999994, 999993]), array([    0.   ,  5202.141,  5211.483, 10385.96 , 15592.343, 20782.762, 25976.826], dtype=float32)), 
    (array([999999, 999998, 999997, 999996, 999995, 999994, 999993]), array([    0.   ,  5211.483, 10402.363, 15588.599, 20797.54 , 25985.99 , 31181.549], dtype=float32))
]

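data[:10] and data[-10:] are the first 10 and the last 10 vectors of data. Away from the two ends of the matrix, the 7 nearest vectors in L2 distance are the query vector itself plus the 3 vectors on either side of it; at the ends, where one side has fewer neighbours, the remaining slots are filled from the other side. Take the first vector of data[-10:] (that is, data[-10], index 999990) as an example: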

(array([999990, 999989, 999991, 999988, 999992, 999987, 999993]), array([    0.    ,  5200.271 ,  5209.6157, 10385.96  , 10395.076 , 15594.214 , 15594.214 ], dtype=float32))

The vector closest to data[-10] is itself, at distance 0.0. Next comes the vector just before it, index 999989, followed by the vector just after it, index 999991.
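The distances in this output can be checked by hand. Consecutive rows of data differ by exactly 300 in each of their 300 components, so the L2 distance between neighbouring rows is sqrt(300 * 300^2), about 5196.15, and rows two apart are twice as far. The small deviations in the tail results (5200.271 instead of 5196.1523) are consistent with values near 3e8 being rounded when held as 32-bit floats. That explanation is an inference from the float32 dtype shown in the output, not something the nmslib documentation is quoted as saying. A short sketch of both points:

```python
import math

import numpy as np


def dist_as_float32(a, b):
    """L2 distance after rounding both vectors to float32, accumulated in float64."""
    a32 = a.astype(np.float32).astype(np.float64)
    b32 = b.astype(np.float32).astype(np.float64)
    return float(np.linalg.norm(b32 - a32))


# Exact distance between neighbouring rows of np.arange(...).reshape(-1, 300).
exact = math.sqrt(300 * 300 ** 2)
print(round(exact, 4))  # 5196.1524

# Rows 0 and 1 hold small integers, which float32 represents exactly.
row0 = np.arange(0, 300, dtype=np.float64)
row1 = np.arange(300, 600, dtype=np.float64)
head = dist_as_float32(row0, row1)

# Rows 999989 and 999990 hold values near 3e8, where float32 can only
# represent multiples of 32, so every component is rounded first.
a = np.arange(999989 * 300, 999990 * 300, dtype=np.float64)
b = np.arange(999990 * 300, 999991 * 300, dtype=np.float64)
tail = dist_as_float32(a, b)

print(head, tail)
```

The head distance matches the exact value, while the tail distance drifts by a few units, which is the same pattern seen in the two result listings.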

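As the output shows, looking up 10 vectors of 300 dimensions in a [1000000, 300] matrix takes about 15 ms. This indicates that an nmslib index keeps the search results accurate while greatly improving search speed.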

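Finally, a word on how nmslib is used in the face recognition project. First, nmslib builds an index file over the face feature vectors produced by the recognition model. While the index is being built, a mapping file records the index id of each face feature vector together with the identity it belongs to, for example (1, Xiao Ming), (2, Lao Wang), and so on. Matching the ids in the mapping against the ids in the nmslib query results then determines whose face it is.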
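The id-to-identity lookup described above might be sketched as follows. The names `id_to_name` and `identify`, and the `max_distance` rejection threshold, are assumptions added for illustration; the post itself only says that index ids are mapped to identities.

```python
def identify(neighbours, id_to_name, max_distance=None):
    """Map each query's nearest index id to a person name.

    neighbours: list of (ids, distances) pairs, one per query face,
        nearest first (the layout shown in the search results above).
    id_to_name: dict from index id to identity, built alongside the index.
    max_distance: optional cut-off (an assumption, not from the original
        post); matches farther away than this are reported as None.
    """
    names = []
    for ids, dists in neighbours:
        if len(ids) == 0:
            names.append(None)  # the index returned nothing for this query
            continue
        best_id, best_dist = int(ids[0]), float(dists[0])
        if max_distance is not None and best_dist > max_distance:
            names.append(None)  # too far from every known face
        else:
            names.append(id_to_name.get(best_id))
    return names


# Hypothetical mapping and query results (plain lists stand in for numpy arrays).
id_to_name = {1: "Xiao Ming", 2: "Lao Wang"}
neighbours = [([1, 2], [0.4, 1.3]), ([2, 1], [0.9, 1.1]), ([1, 2], [1.8, 1.9])]
names = identify(neighbours, id_to_name, max_distance=1.2)
print(names)  # ['Xiao Ming', 'Lao Wang', None]
```

A threshold of this kind is one way to avoid assigning an identity to a face that is not in the library at all; a suitable value would have to be tuned on real data.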
