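Faiss Tutorial: Basics

Original author: @houkai
Reposted from: https://www.cnblogs.com/houkai/p/9316136.html

Contents
Clustering
PCA dimensionality reduction
ProductQuantizer(PQ)
Scalar quantizer (per-dimension quantization)
Strategies for choosing an index

Faiss provides very efficient implementations of several basic algorithms: k-means, PCA, and PQ encoding/decoding.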

Clustering

Assume x is a 2-D tensor:

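import faiss

# x: float32 array of shape (n, d)
ncentroids = 1024
niter = 20
verbose = True
d = x.shape[1]
# niter and verbose must be passed as keyword arguments
kmeans = faiss.Kmeans(d, ncentroids, niter=niter, verbose=verbose)
kmeans.train(x)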

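The centroids are stored in kmeans.centroids, and the value of the objective function (the total squared error) in kmeans.obj. To find the nearest centroid for each query vector: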

D, I = kmeans.index.search(x, 1)
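
To make the shape of that result concrete, here is a pure-Python stand-in for a 1-nearest-neighbor search against a set of centroids. It is an illustrative sketch, not Faiss code: nearest_centroid is a hypothetical helper, the sample points are made up, and it assumes squared Euclidean distance, which is what a flat L2 index reports.

```python
def nearest_centroid(points, centroids):
    """Return (D, I): squared L2 distance to, and index of, the closest centroid."""
    D, I = [], []
    for p in points:
        # Squared Euclidean distance from p to every centroid.
        dists = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centroids]
        best = min(range(len(centroids)), key=dists.__getitem__)
        D.append(dists[best])
        I.append(best)
    return D, I


# Made-up toy data: two centroids, two query points.
centroids = [(0.0, 0.0), (10.0, 10.0)]
points = [(1.0, 1.0), (9.0, 8.0)]
D, I = nearest_centroid(points, centroids)
```

Here I holds the centroid index assigned to each point and D the matching squared distance; Faiss returns the same information as NumPy arrays with one column per requested neighbor.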

To find, for each centroid, the 15 nearest points in a dataset:

index = faiss.IndexFlatL2(d)
index.add(x)
D, I = index.search(kmeans.centroids, 15)

With a suitably adjusted index, this can also run on the GPU.

PCA dimensionality reduction

Reduce vectors from 40 dimensions to 10:

import numpy as np
import faiss

# random training data
mt = np.random.rand(1000, 40).astype('float32')
mat = faiss.PCAMatrix(40, 10)
mat.train(mt)
assert mat.is_trained
tr = mat.apply_py(mt)
# print this to show that the magnitude of tr's columns is decreasing
print((tr ** 2).sum(0))

ProductQuantizer(PQ)

import numpy as np
import faiss

d = 32  # data dimension
cs = 4  # code size (bytes)

# train set 
nt = 10000
xt = np.random.rand(nt, d).astype('float32')

# dataset to encode (could be same as train)
n = 20000
x = np.random.rand(n, d).astype('float32')

pq = faiss.ProductQuantizer(d, cs, 8)
pq.train(xt)

# encode 
codes = pq.compute_codes(x)

# decode
x2 = pq.decode(codes)

# compute reconstruction error
avg_relative_error = ((x - x2)**2).sum() / (x ** 2).sum()
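
The arithmetic behind the code size can be sketched without Faiss. The helper below, pq_layout, is hypothetical; it assumes float32 input and that the number of sub-quantizers divides d evenly, as in the d = 32, cs = 4, 8-bit setup above.

```python
def pq_layout(d, n_sub, nbits=8):
    """Describe how a product quantizer splits and stores a d-dimensional vector."""
    if d % n_sub != 0:
        raise ValueError("d must be divisible by the number of sub-quantizers")
    raw_bytes = d * 4                      # assumption: float32 components
    code_bytes = (n_sub * nbits + 7) // 8  # one nbits-wide index per sub-vector
    return {
        "sub_dim": d // n_sub,             # dimensions covered by each sub-quantizer
        "centroids_per_sub": 2 ** nbits,   # codebook size of each sub-quantizer
        "raw_bytes": raw_bytes,
        "code_bytes": code_bytes,
        "compression": raw_bytes / code_bytes,
    }


layout = pq_layout(d=32, n_sub=4, nbits=8)
```

With these numbers each 128-byte vector is split into four 8-dimensional sub-vectors and stored in 4 bytes, which is why some reconstruction error is expected.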

Scalar quantizer (per-dimension quantization)

import numpy as np
import faiss

d = 32  # data dimension

# train set 
nt = 10000
xt = np.random.rand(nt, d).astype('float32')

# dataset to encode (could be same as train)
n = 20000
x = np.random.rand(n, d).astype('float32')

# QT_8bit allocates 8 bits per dimension (QT_4bit also works)
sq = faiss.ScalarQuantizer(d, faiss.ScalarQuantizer.QT_8bit)
sq.train(xt)

# encode 
codes = sq.compute_codes(x)

# decode
x2 = sq.decode(codes)

# compute reconstruction error
avg_relative_error = ((x - x2)**2).sum() / (x ** 2).sum()
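
The idea of quantizing each dimension to 8 bits can be illustrated in pure Python. This is a simplified sketch under stated assumptions (uniform levels between the per-dimension minimum and maximum of the training set); the function names are hypothetical, and Faiss's actual QT_8bit implementation may differ in how it estimates ranges and reconstructs values.

```python
def train_ranges(vectors):
    """Learn a (min, max) range for every dimension from the training vectors."""
    return [(min(col), max(col)) for col in zip(*vectors)]


def encode(vector, ranges, levels=255):
    """Map each component to an integer code in [0, levels], one byte per dimension."""
    codes = []
    for value, (lo, hi) in zip(vector, ranges):
        span = (hi - lo) or 1.0  # avoid division by zero on constant dimensions
        t = min(max((value - lo) / span, 0.0), 1.0)  # clamp out-of-range values
        codes.append(round(t * levels))
    return bytes(codes)


def decode(codes, ranges, levels=255):
    """Reconstruct approximate component values from the byte codes."""
    return [lo + (c / levels) * (hi - lo) for c, (lo, hi) in zip(codes, ranges)]


# Made-up toy data: two training vectors in two dimensions.
ranges = train_ranges([[0.0, -1.0], [1.0, 1.0]])
v = [0.25, 0.5]
codes = encode(v, ranges)
v2 = decode(codes, ranges)
```

For values inside the trained range, the reconstruction error per dimension is at most half a quantization step, that is (max - min) / (2 * 255).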

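Strategies for choosing an index

The recommended way to create an index is index_factory, which builds the index from a parameter string.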

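Flat
Gives the baseline results for a dataset. It does not compress the vectors and does not support adding vectors with IDs; if you need add_with_ids, use the "IDMap,Flat" parameter string.
No training required; GPU supported.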

Faiss indexes are all held in RAM, so memory usage also has to be taken into account.

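HNSWx
For cases with plenty of memory and a small dataset. x is the number of links per vector, in the range [4, 64]; the efSearch parameter trades speed against accuracy. Memory use per vector is (d * 4 + x * 2 * 4) bytes.
Does not support add_with_ids (add an IDMap if you need it), requires no training, does not support removing vectors from the index, and does not support GPU.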
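xxx,Flat
xxx means the data has been clustered beforehand, as in IVFFlat. The nprobe parameter trades speed against accuracy. GPU is supported (provided the clustering method supports it too).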
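PCARx,...,SQ8
When storing whole vectors takes too much space, PCA can reduce them to x dimensions, and SQ8 stores each component in one byte. Each vector then occupies only x bytes. GPU is not supported.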
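OPQx_y,...,PQx
In PQx, x is the code size in bytes, usually <= 64; for larger sizes SQ is more efficient. OPQ applies a linear transform to the data that makes it easier to compress. y should be a multiple of x, with y <= d and, as a recommendation, y < 4*x. In OPQx_y, x is the split parameter used by OPQ, and y is the dimension that is finally split. GPU supported.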
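
The per-vector storage figures quoted in the list above can be compared side by side with a small calculator. This is a rough sketch: bytes_per_vector is a hypothetical helper, the Flat figure assumes uncompressed float32 components, and index-level overhead such as codebooks or ID maps is ignored.

```python
def bytes_per_vector(kind, d, x=None):
    """Approximate storage per vector for the index types discussed above."""
    if kind == "Flat":
        return d * 4              # assumption: raw float32 components
    if kind == "HNSW":
        return d * 4 + x * 2 * 4  # raw vector plus link storage, per the formula above
    if kind in ("PCAR,SQ8", "PQ"):
        return x                  # x bytes per vector after compression
    raise ValueError(f"unknown index kind: {kind}")


d = 128
sizes = {
    "Flat": bytes_per_vector("Flat", d),
    "HNSW32": bytes_per_vector("HNSW", d, x=32),
    "PCAR64,SQ8": bytes_per_vector("PCAR,SQ8", d, x=64),
    "PQ16": bytes_per_vector("PQ", d, x=16),
}
```

For 128-dimensional vectors this gives 512, 768, 64 and 16 bytes respectively, which shows why HNSW is suggested only when memory is plentiful.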

By dataset size (number of vectors and amount of training data):

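Fewer than 1 million vectors: use "...,IVFx,..."
With N vectors in the dataset, choose x between 4*sqrt(N) and 16*sqrt(N). Clustering is done with k-means, which needs between 30*x and 256*x training vectors (the more the better). GPU supported.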
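1 million < N < 10 million: use "...,IMI2x10,..."
IMI uses k-means on the training set to obtain 2^10 centroids, but it clusters the first and second halves of the vectors separately, so the result is (2^10)^2 = 2^20 cells. Training needs 64 * 2^10 samples. GPU is not supported.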
10 million < N < 100 million: use "...,IMI2x12,..."
Same as above, with more clusters.
100 million < N < 1 billion: use "...,IMI2x14,..."
Same as above.
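
The size-based guidance above can be turned into a small lookup. This is a sketch under stated assumptions rather than an official recipe: suggest_partition is a hypothetical helper, it picks the lower bound 4*sqrt(N) for x, it treats the unspecified boundary values (exactly 1 million, and so on) as belonging to the larger bucket, and it extrapolates the 64 * 2^10 training-size rule to the 2x12 and 2x14 cases.

```python
import math


def suggest_partition(n):
    """Suggest the partitioning part of a factory string for n vectors."""
    if n < 1_000_000:
        root = math.sqrt(n)
        lo, hi = int(4 * root), int(16 * root)
        return {
            "factory": f"IVF{lo}",            # assumption: use the low end of the range
            "x_range": (lo, hi),
            "train_range": (30 * lo, 256 * lo),
        }
    for limit, nbits in ((10_000_000, 10), (100_000_000, 12), (1_000_000_000, 14)):
        if n < limit:
            return {
                "factory": f"IMI2x{nbits}",
                "cells": 2 ** (2 * nbits),    # two halves, 2^nbits centroids each
                "train": 64 * 2 ** nbits,     # assumption for nbits other than 10
            }
    raise ValueError("the guidance above does not cover 1 billion vectors or more")


small = suggest_partition(250_000)
medium = suggest_partition(5_000_000)
```

The returned "factory" value is only the middle component; it still has to be combined with an optional transform prefix and an encoding suffix such as Flat, SQ8 or PQx.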
