[Faiss] Basic Usage: Clustering, Dimensionality Reduction, Quantization

Original

2020-06-12 20:57

Clustering

import faiss
import numpy as np
import time


x = np.random.random((100000, 2048)).astype('float32')


ncentroids = 1000
niter = 500
verbose = True
d = x.shape[1]

start_time = time.time()

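'''
d: vector dimension
ncentroids: number of cluster centroids
niter: number of iterations
verbose: whether to print progress for each iteration
gpu: whether to use the GPU (requires a GPU build of Faiss)
'''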
kmeans = faiss.Kmeans(d, ncentroids, niter=niter, verbose=verbose, gpu=True)
kmeans.train(x)

train_time = time.time()
print(train_time - start_time)

cluster_cents = kmeans.centroids  # centroid matrix, shape (ncentroids, d)
cluster_wucha = kmeans.obj  # k-means objective (total squared error) per iteration

D, I = kmeans.index.search(x, 1)
print(np.unique(np.array(I)))  # IDs of the centroids that received at least one vector (at most ncentroids = 1000 values)

search_time = time.time()
print(search_time - train_time)


# # Alternatively, build an index over x and search for the 15 vectors nearest to each centroid
# index = faiss.IndexFlatL2(d)
# index.add(x)
# D, I = index.search(kmeans.centroids, 15)
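
Once `kmeans.index.search(x, 1)` has returned the assignment array `I` (one row per vector, holding the ID of its nearest centroid), the cluster sizes can be counted with plain NumPy. This is a minimal sketch: the small hand-written `I` and `ncentroids = 4` are stand-ins for illustration, not output from the run above.

```python
import numpy as np

# Hypothetical stand-in for the I returned by kmeans.index.search(x, 1):
# shape (n, 1), each entry is the ID of the nearest centroid.
I = np.array([[0], [2], [2], [1], [2]])
ncentroids = 4  # illustrative value; the post uses 1000

# Number of vectors assigned to each centroid (0 for empty clusters).
cluster_sizes = np.bincount(I.ravel(), minlength=ncentroids)
print(cluster_sizes)

# IDs of the clusters that received no vectors at all.
empty_clusters = np.flatnonzero(cluster_sizes == 0)
print(empty_clusters)
```

With the real `I`, a very uneven `cluster_sizes` or many empty clusters is usually a hint that `ncentroids` is too large for the amount of data.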

Dimensionality Reduction (PCA)

Reduce 40-dimensional vectors to 10 dimensions.

import faiss
import numpy as np

# random training data 
mt = np.random.rand(1000, 40).astype('float32')
mat = faiss.PCAMatrix(40, 10)
mat.train(mt)
assert mat.is_trained
tr = mat.apply_py(mt)
# print this to show that the magnitude of tr's columns is decreasing
print((tr ** 2).sum(0))
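
To see why the column energies come out in decreasing order, here is a plain NumPy sketch of PCA by eigendecomposition. It is meant as a conceptual illustration only and is not a copy of how `faiss.PCAMatrix` is implemented; the names `xc`, `order`, and `energy` are made up for this example.

```python
import numpy as np

rng = np.random.default_rng(0)
mt = rng.random((1000, 40))  # same shape as the training data above

# Center the data, then diagonalize the scatter matrix xc^T xc.
mean = mt.mean(axis=0)
xc = mt - mean
eigvals, eigvecs = np.linalg.eigh(xc.T @ xc)  # eigenvalues in ascending order

# Keep the 10 directions with the largest eigenvalues.
order = np.argsort(eigvals)[::-1][:10]
A = eigvecs[:, order].T  # projection matrix, shape (10, 40)

# Project the centered data onto those directions.
tr = xc @ A.T  # shape (1000, 10)

# The energy of each output column equals the matching eigenvalue,
# so the values are sorted from largest to smallest.
energy = (tr ** 2).sum(0)
print(energy)
```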

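How do you get the PCA matrix as a NumPy array from the PCA object?

See the notebook on getting the matrix from PCA (.ipynb). This works for any LinearTransform object.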

import faiss
import numpy as np
# training data
xt = np.random.rand(1000, 20).astype('float32')
# test data
x = np.random.rand(10, 20).astype('float32')
# make the PCA matrix
pca = faiss.PCAMatrix(20, 10)
pca.train(xt)
# apply it to test data
yref = pca.apply_py(x)
# extract matrix + bias from the PCA object
# works for any linear transform (OPQ, random rotation, etc.)
b = faiss.vector_to_array(pca.b)
A = faiss.vector_to_array(pca.A).reshape(pca.d_out, pca.d_in)
# apply transformation
ynew = x @ A.T + b
# are the vectors the same?
print(np.allclose(yref, ynew))

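Quantization

Quantization encodes each vector as a compact code and uses that code in place of the original data, which reduces the memory and storage the data requires.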

PQ encoding / decoding

'''
The ProductQuantizer object can be used to encode vectors into codes and to decode codes back into vectors.
'''
import numpy as np
import faiss

d = 32  # data dimension
cs = 4  # code size (bytes)

# train set 
nt = 10000
xt = np.random.rand(nt, d).astype('float32')

# dataset to encode (could be same as train)
n = 20000
x = np.random.rand(n, d).astype('float32')

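# Arguments are (d, M, nbits): M = cs sub-quantizers, and nbits = 8 bits per
# sub-quantizer index (256 centroids each), so each code takes cs = 4 bytes.
pq = faiss.ProductQuantizer(d, cs, 8)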
pq.train(xt)

# encode 
codes = pq.compute_codes(x)

# decode
x2 = pq.decode(codes)

# compute reconstruction error
avg_relative_error = ((x - x2)**2).sum() / (x ** 2).sum()
print(avg_relative_error)
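
The relationship between the three constructor arguments and the code size is simple arithmetic. The sketch below works it out for the values used above; `M` is just another name for `cs` here, and the variable names `dsub`, `ksub`, and `code_size` mirror the attribute names on the `ProductQuantizer` object.

```python
d = 32      # vector dimension, as above
M = 4       # number of sub-quantizers (called cs above)
nbits = 8   # bits per sub-quantizer index

dsub = d // M                     # dimensions covered by each sub-quantizer
ksub = 2 ** nbits                 # centroids per sub-quantizer
code_size = (M * nbits + 7) // 8  # bytes per encoded vector
raw_size = d * 4                  # bytes per float32 vector

print(dsub, ksub, code_size, raw_size // code_size)
```

So each 128-byte float32 vector is split into 4 sub-vectors of 8 dimensions, each replaced by a 1-byte centroid ID, for a 32x reduction in size.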

How do you get or change the centroids of a ProductQuantizer object?

See Access_PQ_Centroids.ipynb.

scalar quantizer

import numpy as np
import faiss

d = 32  # data dimension

# train set 
nt = 10000
xt = np.random.rand(nt, d).astype('float32')

# dataset to encode (could be same as train)
n = 20000
x = np.random.rand(n, d).astype('float32')

# QT_8bit allocates 8 bits per dimension (QT_4bit also works)
sq = faiss.ScalarQuantizer(d, faiss.ScalarQuantizer.QT_8bit)
sq.train(xt)

# encode 
codes = sq.compute_codes(x)

# decode
x2 = sq.decode(codes)

# compute reconstruction error
avg_relative_error = ((x - x2)**2).sum() / (x ** 2).sum()
print(avg_relative_error)
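
For intuition about what 8 bits per dimension means, here is a NumPy sketch of uniform scalar quantization. It assumes a simple per-dimension min/max range learned from the train set and rounding to the nearest of 256 levels; Faiss's own range estimation and rounding may differ, so treat this as an illustration of the idea rather than a reimplementation of `faiss.ScalarQuantizer`.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 32
xt = rng.random((10000, d)).astype('float32')  # train set
x = rng.random((20000, d)).astype('float32')   # data to encode

# "Training": record the per-dimension range seen in the train set.
vmin = xt.min(axis=0)
vdiff = xt.max(axis=0) - vmin

# Encode: map each value to one of 256 levels, clipping out-of-range values.
scaled = np.clip((x - vmin) / vdiff, 0.0, 1.0)
codes = np.round(scaled * 255).astype('uint8')

# Decode: map each level back to a float inside the trained range.
x2 = vmin + codes.astype('float32') / 255 * vdiff

# Same reconstruction error measure as above.
avg_relative_error = ((x - x2) ** 2).sum() / (x ** 2).sum()
print(codes.nbytes, x.nbytes, avg_relative_error)
```

Each float32 value becomes one byte, so the codes take a quarter of the original memory, and the reconstruction error stays small because every value is off by at most about half a quantization step.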
