Understanding Vector Computation by Writing Code

Original

2024-01-24 14:26

The embedding model is based on m3e.

1. Plain vector code: computing the distances ourselves

import numpy as np
from numpy import dot
from numpy.linalg import norm
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('/home/helu/milvus/m3e-base')

### functions && classes ###

def cos_sim(a, b):
    '''Cosine similarity -- larger means more similar'''
    return dot(a, b) / (norm(a) * norm(b))

def l2(a, b):
    '''Euclidean distance -- smaller means more similar'''
    x = np.asarray(a) - np.asarray(b)
    return norm(x)

### Swap in a local interface here ###
def get_embeddings(texts):
    # data = embedding.create(input=texts).data
    embeddings = model.encode(texts)
    # return [x.embedding for x in data]
    return embeddings

test_query = ["測試文本"]
vec = get_embeddings(test_query)[0]
print(vec[:10])
print(len(vec))

# query = "體育"
# Cross-lingual queries also work
query = "sports"

documents = [
    "聯合國就蘇丹達爾富爾地區大規模暴力事件發出警告",
    "土耳其、芬蘭、瑞典與北約代表將繼續就瑞典“入約”問題進行談判",
    "日本岐阜市陸上自衛隊射擊場內發生槍擊事件 3人受傷",
    "國家游泳中心（水立方）：恢復游泳、嬉水樂園等水上項目運營",
    "我國首次在空間站開展艙外輻射生物學暴露實驗",
]

query_vec = get_embeddings([query])[0]
doc_vecs = get_embeddings(documents)

print("Cosine similarity:")
print(cos_sim(query_vec, query_vec))
for vec in doc_vecs:
    print(cos_sim(query_vec, vec))

print("\nEuclidean distance:")
print(l2(query_vec, query_vec))
for vec in doc_vecs:
    print(l2(query_vec, vec))

# Based on the results above, build a mixed score as cos / l2
print("mix distance:")
for vec in doc_vecs:
    print(cos_sim(query_vec, vec) / l2(query_vec, vec))
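The two measures above are closely related. For unit-length vectors, the Euclidean distance equals sqrt(2 - 2 * cos), so both produce the same ranking. The sketch below checks this numerically. It is only an illustration: the random vectors and the dimension of 8 are stand-ins for real `model.encode` output, and the `normalize` helper is hypothetical. Whether m3e embeddings are already unit-length is not assumed here.

```python
import numpy as np

def cos_sim(a, b):
    """Cosine similarity: larger means more similar."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def l2(a, b):
    """Euclidean distance: smaller means more similar."""
    return float(np.linalg.norm(np.asarray(a) - np.asarray(b)))

def normalize(v):
    """Hypothetical helper: scale a vector to unit length."""
    v = np.asarray(v, dtype=np.float64)
    return v / np.linalg.norm(v)

# Random stand-ins for embeddings (assumption: not real model output).
rng = np.random.default_rng(0)
query_vec = normalize(rng.normal(size=8))
doc_vecs = [normalize(rng.normal(size=8)) for _ in range(5)]

cos_scores = [cos_sim(query_vec, v) for v in doc_vecs]
l2_scores = [l2(query_vec, v) for v in doc_vecs]

# For unit vectors: ||a - b||^2 = 2 - 2 * cos(a, b)
for c, d in zip(cos_scores, l2_scores):
    print(f"cos={c:.4f}  l2={d:.4f}  sqrt(2-2cos)={np.sqrt(2 - 2 * c):.4f}")

# Highest cosine first vs. smallest distance first.
rank_by_cos = np.argsort([-c for c in cos_scores]).tolist()
rank_by_l2 = np.argsort(l2_scores).tolist()
print(rank_by_cos, rank_by_l2)
```

If the vectors are not normalized, the two rankings can differ, which is one reason to print both when comparing results.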

2. Bringing in the vector search tool Faiss to help compute distances

import numpy as np
import faiss
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('/home/helu/milvus/m3e-base')

### functions && classes ###

def get_datas_embedding(datas):
    return model.encode(datas)

# Build the index, using FlatL2 as an example
def create_index(datas_embedding):
    # The vector dimension must be passed in; this creates an empty index
    index = faiss.IndexFlatL2(datas_embedding.shape[1])
    # Add the vectors to the index
    index.add(datas_embedding)
    return index

# Query the index
def data_recall(faiss_index, query, top_k):
    query_embedding = model.encode([query])
    distances, indices = faiss_index.search(query_embedding, top_k)
    return indices

###############################

# query = "體育"
# Cross-lingual queries also work
query = "sports"

documents = [
    "聯合國就蘇丹達爾富爾地區大規模暴力事件發出警告",
    "土耳其、芬蘭、瑞典與北約代表將繼續就瑞典“入約”問題進行談判",
    "日本岐阜市陸上自衛隊射擊場內發生槍擊事件 3人受傷",
    "國家游泳中心（水立方）：恢復游泳、嬉水樂園等水上項目運營",
    "我國首次在空間站開展艙外輻射生物學暴露實驗",
]

datas_embedding = get_datas_embedding(documents)
faiss_index = create_index(datas_embedding)
sim_data_index = data_recall(faiss_index, query, 3)

print("Top 3 most similar documents:")
for index in sim_data_index[0]:
    print(documents[int(index)] + "\n")
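To see what a flat L2 index does conceptually, the search can be written as a brute-force NumPy routine. This is a rough sketch, not Faiss's implementation: `flat_l2_search` is a hypothetical name and the 2-D vectors are toy data. It returns squared L2 distances, which to my understanding is also what `IndexFlatL2.search` reports, so take the square root before comparing with the `l2` function from section 1.

```python
import numpy as np

def flat_l2_search(doc_vecs, query_vecs, top_k):
    """Exhaustive nearest-neighbour search by squared L2 distance.

    Returns (distances, indices), each of shape (n_queries, top_k),
    sorted from nearest to farthest.
    """
    doc_vecs = np.asarray(doc_vecs, dtype=np.float32)
    query_vecs = np.asarray(query_vecs, dtype=np.float32)
    # Pairwise differences: (n_queries, n_docs, dim)
    diff = query_vecs[:, None, :] - doc_vecs[None, :, :]
    sq_dists = np.sum(diff * diff, axis=2)  # (n_queries, n_docs)
    # Sort each row and keep the top_k closest documents.
    indices = np.argsort(sq_dists, axis=1)[:, :top_k]
    distances = np.take_along_axis(sq_dists, indices, axis=1)
    return distances, indices

# Toy data (assumption: stand-ins for real embeddings).
doc_vecs = np.array([[0, 0], [1, 0], [0, 3], [5, 5]], dtype=np.float32)
query_vecs = np.array([[0.9, 0.1]], dtype=np.float32)

distances, indices = flat_l2_search(doc_vecs, query_vecs, top_k=3)
print(indices[0])    # document positions, nearest first
print(distances[0])  # squared L2 distances in the same order
```

A flat index compares the query against every stored vector, so cost grows linearly with the number of documents. That is fine for the five documents here.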
