Faiss Tutorial: Getting Started

Original author: @houkai
Reposted from: https://www.cnblogs.com/houkai/p/9316129.html

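Faiss works on collections of vectors of a fixed dimension d. The data is laid out as a matrix in which each row is one vector and each column is one component of the vectors. Faiss stores the values as 32-bit floats.

Let xb be the database, a matrix of shape nb × d, and xq the query vectors, a matrix of shape nq × d.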

import numpy as np
d = 64                           # dimension
nb = 100000                      # database size
nq = 10000                       # nb of queries
np.random.seed(1234)             # make reproducible
xb = np.random.random((nb, d)).astype('float32')
xb[:, 0] += np.arange(nb) / 1000.
xq = np.random.random((nq, d)).astype('float32')
xq[:, 0] += np.arange(nq) / 1000.
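
Data that comes from elsewhere is often float64 or a non-contiguous slice. The helper below is a minimal sketch of how one might normalize such input before handing it to an index; the name as_float32_matrix is made up for this example and is not part of Faiss. It assumes the index expects a C-contiguous float32 array of shape (n, d).

```python
import numpy as np

def as_float32_matrix(x, d):
    """Return x as a C-contiguous float32 array of shape (n, d)."""
    x = np.ascontiguousarray(x, dtype=np.float32)
    if x.ndim != 2 or x.shape[1] != d:
        raise ValueError(f"expected shape (n, {d}), got {x.shape}")
    return x

raw = np.random.random((5, 64))        # NumPy produces float64 by default
xs = as_float32_matrix(raw, 64)
print(xs.dtype, xs.shape, xs.flags['C_CONTIGUOUS'])
```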

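Next, build an index over the data. Faiss provides many index types; here we use the simplest one, IndexFlatL2, which performs a brute-force search using L2 distance.

Every index needs to know the dimension (d) of the vectors it operates on when it is constructed. Most indexes also need a training step that analyzes the distribution of the vectors in a training set. IndexFlatL2 needs no training, so that step can be skipped.

Once the index is created, add and search can be run on it. add inserts vectors into the index (here, xb).

We can also inspect two attributes of the index: is_trained, which tells whether training has been completed (always true for indexes that need no training), and ntotal, the number of indexed vectors.

Some indexes can also store an integer ID for each vector. If no IDs are provided, add uses the vector's position as its ID: 0 for the first vector, 1 for the second, and so on.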

import faiss                   # make faiss available
index = faiss.IndexFlatL2(d)   # build the index
print(index.is_trained)
index.add(xb)                  # add vectors to the index
print(index.ntotal)
# output
True
100000

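With the index in place we can run k-nearest-neighbor queries. The result is an integer matrix of shape nq × k: row i corresponds to query vector i and holds the IDs of its k nearest neighbors, sorted by increasing distance. search also returns a matrix of the same shape containing the corresponding distances.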

k = 4                          # we want to see 4 nearest neighbors
D, I = index.search(xb[:5], k) # sanity check
print(I)
print(D)
D, I = index.search(xq, k)     # actual search
print(I[:5])                   # neighbors of the 5 first queries
print(I[-5:])                  # neighbors of the 5 last queries
# output
[[  0 393 363  78]
 [  1 555 277 364]
 [  2 304 101  13]
 [  3 173  18 182]
 [  4 288 370 531]]
[[ 0.          7.17517328  7.2076292   7.25116253]
 [ 0.          6.32356453  6.6845808   6.79994535]
 [ 0.          5.79640865  6.39173603  7.28151226]
 [ 0.          7.27790546  7.52798653  7.66284657]
 [ 0.          6.76380348  7.29512024  7.36881447]]
[[ 381  207  210  477]
 [ 526  911  142   72]
 [ 838  527 1290  425]
 [ 196  184  164  359]
 [ 526  377  120  425]]
[[ 9900 10500  9309  9831]
 [11055 10895 10812 11321]
 [11353 11103 10164  9787]
 [10571 10664 10632  9638]
 [ 9628  9554 10036  9582]]

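Because of the offset added to the first component of every vector, the nearest neighbors of the first queries lie near the start of the database.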
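
To make clear what the brute-force search computes, here is a rough NumPy equivalent on a smaller database. It is only a sketch: the function name brute_force_l2 and the 1000-vector dataset are made up for illustration, and it returns squared L2 distances, which is also what the D values printed above are.

```python
import numpy as np

def brute_force_l2(xb, xq, k):
    """Exhaustive k-NN; returns (D, I) with squared L2 distances in ascending order."""
    # ||q - b||^2 = ||q||^2 - 2 q.b + ||b||^2, evaluated for all pairs at once
    d2 = (xq ** 2).sum(1)[:, None] - 2.0 * xq @ xb.T + (xb ** 2).sum(1)[None, :]
    I = np.argsort(d2, axis=1)[:, :k]          # indices of the k smallest distances
    D = np.take_along_axis(d2, I, axis=1)      # the matching distances
    return D, I

rng = np.random.default_rng(1234)
xb_small = rng.random((1000, 64), dtype=np.float32)
xb_small[:, 0] += np.arange(1000) / 1000.0     # same offset trick as in the tutorial data
D, I = brute_force_l2(xb_small, xb_small[:5], 4)
print(I)
print(D)
```

As in the sanity check above, the first column of I should be the query's own position, with a distance of (numerically almost) zero.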

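To speed up the search, the database can be partitioned into pieces. We define Voronoi cells in the d-dimensional space, and each database vector falls into exactly one cell. At query time, only the database vectors in the cell that the query falls into are compared against the query, which reduces the number of distance computations.

This idea is implemented by the IndexIVFFlat index, which requires a training stage. IndexIVFFlat also needs another index, called the quantizer, that decides which cell a vector belongs to.

Two parameters come into play: nlist, the number of cells, which is set when the index is constructed, and nprobe, the number of cells visited during a search.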

nlist = 100
k = 4
quantizer = faiss.IndexFlatL2(d)  # the other index
index = faiss.IndexIVFFlat(quantizer, d, nlist, faiss.METRIC_L2)
       # METRIC_L2 is passed explicitly rather than relying on the default metric, which may depend on the Faiss version
assert not index.is_trained
index.train(xb)
assert index.is_trained

index.add(xb)                  # add may be a bit slower as well
D, I = index.search(xq, k)     # actual search
print(I[-5:])                  # neighbors of the 5 last queries
index.nprobe = 10              # default nprobe is 1, try a few more
D, I = index.search(xq, k)
print(I[-5:])                  # neighbors of the 5 last queries
# output
[[ 9900 10500  9831 10808]
 [11055 10812 11321 10260]
 [11353 10164 10719 11013]
 [10571 10203 10793 10952]
 [ 9582 10304  9622  9229]]
[[ 9900 10500  9309  9831]
 [11055 10895 10812 11321]
 [11353 11103 10164  9787]
 [10571 10664 10632  9638]
 [ 9628  9554 10036  9582]]

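The results are not exactly the same as with brute-force search, because vectors outside the searched Voronoi cell can be closer to the query than those inside it. Raising nprobe to 10 reproduces the brute-force result in this example; nprobe controls the trade-off between speed and accuracy.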
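
The cell-probing idea can be sketched in a few lines of NumPy. This is a toy illustration, not how Faiss implements it: the names sq_dists and ivf_search are made up, the centroids are simply sampled from the data instead of being trained, and the cell assignment is recomputed on every call for brevity.

```python
import numpy as np

def sq_dists(a, b):
    """Pairwise squared L2 distances between the rows of a and the rows of b."""
    return (a ** 2).sum(1)[:, None] - 2.0 * a @ b.T + (b ** 2).sum(1)[None, :]

def ivf_search(xb, centroids, q, k, nprobe):
    """Search only the nprobe cells whose centroids are closest to q."""
    cell_of = sq_dists(xb, centroids).argmin(1)              # cell id of every database vector
    probe = np.argsort(sq_dists(q[None, :], centroids)[0])[:nprobe]
    cand = np.flatnonzero(np.isin(cell_of, probe))           # vectors stored in the probed cells
    d2 = sq_dists(q[None, :], xb[cand])[0]
    return cand[np.argsort(d2)[:k]]

rng = np.random.default_rng(0)
xb_toy = rng.random((2000, 16)).astype('float32')
centroids = xb_toy[rng.choice(2000, size=20, replace=False)]  # stand-in for trained centroids
q = rng.random(16).astype('float32')

exact = ivf_search(xb_toy, centroids, q, k=4, nprobe=20)     # probing every cell is brute force
approx = ivf_search(xb_toy, centroids, q, k=4, nprobe=1)     # only the query's own cell
print(exact, approx)
```

With nprobe equal to the number of cells the search degenerates into brute force; with nprobe=1 every returned neighbor comes from a single cell, so true neighbors just across a cell boundary are missed.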

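IndexFlatL2 and IndexIVFFlat both store the full vectors, which does not scale to very large datasets. Faiss provides a variant, IndexIVFPQ, that compresses the stored vectors with a lossy scheme based on product quantization (PQ), at the cost of some accuracy.

The vectors are still stored in Voronoi cells, but their size is reduced to a configurable number of bytes m (d must be a multiple of m).

Because the vectors are no longer stored exactly, the distances returned by search are approximations as well.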
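
The following toy sketch shows the basic mechanics of product quantization: split each vector into m sub-vectors and store, for each one, the index of the nearest codebook entry. The helper names (pq_train, pq_encode, pq_decode) are made up for this example, and the codebooks are just random training samples, whereas a real implementation would learn them with k-means.

```python
import numpy as np

def pq_train(xt, m, ksub, rng):
    """Build one codebook per sub-vector from ksub randomly picked training vectors."""
    n, d = xt.shape
    dsub = d // m
    picks = rng.choice(n, size=ksub, replace=False)
    return np.stack([xt[picks, j * dsub:(j + 1) * dsub] for j in range(m)])  # (m, ksub, dsub)

def pq_encode(x, codebooks):
    """Replace each sub-vector by the index of its nearest codebook entry."""
    m, ksub, dsub = codebooks.shape
    codes = np.empty((x.shape[0], m), dtype=np.uint8)
    for j in range(m):
        sub = x[:, j * dsub:(j + 1) * dsub]
        d2 = ((sub[:, None, :] - codebooks[j][None, :, :]) ** 2).sum(-1)
        codes[:, j] = d2.argmin(1)
    return codes

def pq_decode(codes, codebooks):
    """Rebuild approximate vectors by concatenating the selected codebook entries."""
    return np.hstack([codebooks[j][codes[:, j]] for j in range(codebooks.shape[0])])

rng = np.random.default_rng(0)
d, m = 64, 8
xt = rng.random((5000, d)).astype('float32')       # training vectors
x = rng.random((10, d)).astype('float32')          # vectors to compress
codebooks = pq_train(xt, m, 256, rng)              # 256 entries, so each code fits in 8 bits
codes = pq_encode(x, codebooks)
x_hat = pq_decode(codes, codebooks)
self_d2 = ((x - x_hat) ** 2).sum(1)                # squared distance of each vector to its own code
print(codes.shape, codes.dtype, self_d2[:3])
```

Each vector is stored as m bytes, and the distance from a vector to its own reconstruction is no longer zero.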

nlist = 100
m = 8                             # number of bytes per vector
k = 4
quantizer = faiss.IndexFlatL2(d)  # this remains the same
index = faiss.IndexIVFPQ(quantizer, d, nlist, m, 8)
                                    # 8 specifies that each sub-vector is encoded as 8 bits
index.train(xb)
index.add(xb)
D, I = index.search(xb[:5], k) # sanity check
print(I)
print(D)
index.nprobe = 10              # make comparable with experiment above
D, I = index.search(xq, k)     # search
print(I[-5:])
# output
[[   0  424  363  278]
 [   1  555 1063   24]
 [   2  304   46  346]
 [   3  773  182 1529]
 [   4  288  754  531]]
[[ 1.45568264  6.03136778  6.18729019  6.38852692]
 [ 1.4934082   5.74254704  6.19941282  6.21501732]
 [ 1.60279942  6.20174742  6.32792568  6.78541422]
 [ 1.69804895  6.2623148   6.26956797  6.56042767]
 [ 1.30235791  6.13624859  6.33899879  6.51442146]]
[[10664 10914  9922  9380]
 [10260  9014  9458 10310]
 [11291  9380 11103 10392]
 [10856 10284  9638 11276]
 [10304  9327 10152  9229]]

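The smallest distance (from each vector to itself) is no longer 0, because the vectors are compressed. Here, each 64-dimensional vector of 32-bit floats is split into 8 sub-vectors, each encoded in 8 bits, so the compression factor is 32.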
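
The factor of 32 follows directly from the parameters used above. A quick check, using only the values from this example:

```python
d = 64        # vector dimension
m = 8         # number of sub-vectors
nbits = 8     # bits used to encode each sub-vector

raw_bytes = d * 4                  # float32 takes 4 bytes per component
code_bytes = m * nbits // 8        # one nbits-wide code per sub-vector
factor = raw_bytes // code_bytes
print(raw_bytes, code_bytes, factor)   # 256 8 32
```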

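Compared with the IVFFlat results, most of the neighbors returned for the query set are wrong, but they are all in the right region of the database, around index 10000. The situation is better on real data, for two reasons:

Uniformly distributed data is very hard to index, because it has no regularity that clustering or dimensionality reduction can exploit.
In natural data, similar vectors are significantly closer to each other than unrelated ones.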
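
To put a number on how the approximate results compare with the exact ones, one can count how many of the true neighbors each index returned. The sketch below does this for the last five queries, using the ID lists printed earlier in this post; recall_at_k is a made-up helper name, not a Faiss function.

```python
def recall_at_k(approx, exact):
    """Fraction of the exact neighbors that also appear in the approximate result."""
    hits = sum(len(set(a) & set(e)) for a, e in zip(approx, exact))
    return hits / sum(len(e) for e in exact)

# Last five queries, copied from the outputs shown above
exact = [[9900, 10500, 9309, 9831], [11055, 10895, 10812, 11321],
         [11353, 11103, 10164, 9787], [10571, 10664, 10632, 9638],
         [9628, 9554, 10036, 9582]]                      # IndexFlatL2
ivfflat_nprobe1 = [[9900, 10500, 9831, 10808], [11055, 10812, 11321, 10260],
                   [11353, 10164, 10719, 11013], [10571, 10203, 10793, 10952],
                   [9582, 10304, 9622, 9229]]            # IndexIVFFlat, nprobe = 1
ivfpq_nprobe10 = [[10664, 10914, 9922, 9380], [10260, 9014, 9458, 10310],
                  [11291, 9380, 11103, 10392], [10856, 10284, 9638, 11276],
                  [10304, 9327, 10152, 9229]]            # IndexIVFPQ, nprobe = 10

r_flat = recall_at_k(ivfflat_nprobe1, exact)
r_pq = recall_at_k(ivfpq_nprobe10, exact)
print(r_flat, r_pq)   # 0.5 0.1
```

On these five queries, IVFFlat with a single probed cell recovers half of the true neighbors, while IVFPQ recovers 2 out of 20.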

Simplifying index construction with the factory method

index = faiss.index_factory(d, "IVF100,PQ8")

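Replacing PQ8 with Flat yields an IndexIVFFlat index. The factory method is particularly useful when preprocessing is applied to the data: for example, the string "PCA32,IVF100,Flat" reduces the vectors to 32 dimensions with PCA before indexing them.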

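Simplifying the index description

As IndexIVFFlat and IndexIVFPQ above show, constructing these indexes requires passing in another index first. Faiss also provides methods such as PCA and LSH, which are sometimes combined with an index. Building such combinations by hand is tedious, so Faiss lets you describe an index with a string.
For example, the expression below creates the same kind of IndexIVFPQ as above.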

index = faiss.index_factory(d, "IVF100,PQ8")

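One point the documentation does not mention: the C++ source shows that index_factory takes a third parameter, the metric discussed above. The two metrics mentioned so far, L2 and inner product (faiss.METRIC_L2 and faiss.METRIC_INNER_PRODUCT), can be passed in.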

Index *index_factory (int d, const char *description_in, MetricType metric)

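More examples of combinations can be found in the demo.

The abbreviation for each index type is listed under Basic indexes.

Faiss can run on the GPU almost seamlessly. First, acquire the GPU resources, which include a sufficient amount of GPU memory.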

res = faiss.StandardGpuResources()  # use a single GPU

Create the index on the GPU

# build a flat (CPU) index
index_flat = faiss.IndexFlatL2(d)
# make it into a gpu index
gpu_index_flat = faiss.index_cpu_to_gpu(res, 0, index_flat)

The index is then used the same way as on the CPU

gpu_index_flat.add(xb)         # add vectors to the index
print(gpu_index_flat.ntotal)

k = 4                          # we want to see 4 nearest neighbors
D, I = gpu_index_flat.search(xq, k)  # actual search
print(I[:5])                   # neighbors of the 5 first queries
print(I[-5:])                  # neighbors of the 5 last queries

Using multiple GPUs

ngpus = faiss.get_num_gpus()

print("number of GPUs:", ngpus)

cpu_index = faiss.IndexFlatL2(d)

gpu_index = faiss.index_cpu_to_all_gpus(  # build the index
    cpu_index
)

gpu_index.add(xb)              # add vectors to the index
print(gpu_index.ntotal)

k = 4                          # we want to see 4 nearest neighbors
D, I = gpu_index.search(xq, k) # actual search
print(I[:5])                   # neighbors of the 5 first queries
print(I[-5:])                  # neighbors of the 5 last queries
