NLP Tools: Gensim's models.keyedvectors Module

Original

冰__蓝

2019-08-15 05:54

1. Introduction

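The models.keyedvectors module implements word vectors and similarity look-ups over them. Trained word vectors are independent of the way they were trained, so they can be represented by a standalone structure.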

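This structure is called KeyedVectors and is essentially a mapping between entities and vectors. Each entity is identified by its string id, so it is a mapping between strings and 1-D arrays.
An entity usually corresponds to a word, so the mapping goes from words to 1-D vectors, but for some models a key can instead correspond to a document, an image, or something else.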
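
To make the mapping idea concrete, here is a minimal pure-Python sketch. It is not Gensim's implementation: the class name ToyKeyedVectors and the three-dimensional toy vectors are invented for illustration. It only shows the shape of the structure, with string keys going in and 1-D vectors coming out, plus a cosine similarity computed on top.

```python
import math


class ToyKeyedVectors:
    """Hypothetical stand-in for a key -> vector mapping (not Gensim's class)."""

    def __init__(self, vectors):
        # vectors: dict mapping a string key to a list of floats
        self._vectors = dict(vectors)

    def __getitem__(self, key):
        # Look up the 1-D vector stored for a string key
        return self._vectors[key]

    def __contains__(self, key):
        return key in self._vectors

    def similarity(self, key1, key2):
        # Cosine similarity between the vectors of two keys
        v1, v2 = self[key1], self[key2]
        dot = sum(a * b for a, b in zip(v1, v2))
        norm1 = math.sqrt(sum(a * a for a in v1))
        norm2 = math.sqrt(sum(b * b for b in v2))
        return dot / (norm1 * norm2)


# Toy vectors invented for this example
toy = ToyKeyedVectors({
    "cat": [1.0, 0.0, 0.0],
    "dog": [0.8, 0.6, 0.0],
    "car": [0.0, 0.0, 1.0],
})

print(toy["cat"])                    # the vector stored under "cat"
print("cat" in toy, "bird" in toy)   # membership test on keys
print(round(toy.similarity("cat", "dog"), 4))
```

A real KeyedVectors object stores the vectors in one NumPy matrix rather than a dict of lists, but the key-to-vector contract is the same.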

KeyedVectors differs from a full model in that it cannot be trained any further; in exchange it uses less RAM and offers a simpler interface.

2. How to obtain word vectors

Train a full model, then read its model.wv attribute, which holds the standalone keyed vectors. For example, train the vectors with Word2Vec:

from gensim.test.utils import common_texts
from gensim.models import Word2Vec

# vector_size is the Gensim 4.x name; Gensim < 4.0 called this parameter size
model = Word2Vec(common_texts, vector_size=100, window=5, min_count=1, workers=4)
word_vectors = model.wv

Save the word vectors to disk and load them back:

from gensim.models import KeyedVectors

word_vectors.save("vectors_wv")
word_vectors = KeyedVectors.load("vectors_wv", mmap='r')

Load a word-vector file in the original Google word2vec C format from disk as a KeyedVectors instance:

from gensim.test.utils import datapath

wv_from_text = KeyedVectors.load_word2vec_format(datapath('word2vec_pre_kv_c'), binary=False)  # C text format
wv_from_bin = KeyedVectors.load_word2vec_format(datapath('word2vec_vector.bin'), binary=True)  # C binary format
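
For readers curious about what binary=False expects: the word2vec C text format starts with a header line holding the number of vectors and their dimensionality, followed by one line per word with the word and its space-separated components. The sketch below parses that layout in pure Python. The function name read_word2vec_text and the sample contents are made up for illustration; in practice load_word2vec_format does this work.

```python
import io


def read_word2vec_text(stream):
    """Parse word2vec C text format into a dict of key -> list of floats (sketch)."""
    # Header line: "<number of vectors> <vector dimensionality>"
    header = stream.readline().split()
    vocab_size, vector_size = int(header[0]), int(header[1])

    vectors = {}
    for _ in range(vocab_size):
        # Each following line: the word, then its components, space-separated
        parts = stream.readline().rstrip().split(" ")
        word, values = parts[0], [float(x) for x in parts[1:]]
        if len(values) != vector_size:
            raise ValueError(f"expected {vector_size} values for {word!r}")
        vectors[word] = values
    return vectors


# Made-up file contents: 2 words, 3 dimensions each
sample = io.StringIO("2 3\nking 0.1 0.2 0.3\nqueen 0.4 0.5 0.6\n")
vectors = read_word2vec_text(sample)
print(vectors["queen"])
```

The binary variant keeps the same header but stores each vector's components as raw 32-bit floats, which is why the binary flag has to match the file.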

3. What can you do with these word vectors?

They support a range of syntactic and semantic NLP word tasks.

>>> import gensim.downloader as api
>>>
>>> word_vectors = api.load("glove-wiki-gigaword-100")  # load pre-trained word-vectors from gensim-data
>>>
>>> result = word_vectors.most_similar(positive=['woman', 'king'], negative=['man'])
>>> print("{}: {:.4f}".format(*result[0]))
queen: 0.7699

>>> result = word_vectors.most_similar_cosmul(positive=['woman', 'king'], negative=['man'])
>>> print("{}: {:.4f}".format(*result[0]))
queen: 0.8965
>>>
>>> print(word_vectors.doesnt_match("breakfast cereal dinner lunch".split()))
cereal
>>> # Similarity between two words
>>> similarity = word_vectors.similarity('woman', 'man')
>>> similarity > 0.8
True
>>> # Words most similar to a given word
>>> result = word_vectors.similar_by_word("cat")
>>> print("{}: {:.4f}".format(*result[0]))
dog: 0.8798
>>>
>>> sentence_obama = 'Obama speaks to the media in Illinois'.lower().split()
>>> sentence_president = 'The president greets the press in Chicago'.lower().split()
>>> # Word Mover's Distance (WMD) between two sentences
>>> similarity = word_vectors.wmdistance(sentence_obama, sentence_president)
>>> print("{:.4f}".format(similarity))
3.4893
>>> # Distance between two words
>>> distance = word_vectors.distance("media", "media")
>>> print("{:.1f}".format(distance))
0.0
>>> # Similarity between two sets of words (e.g. two short phrases)
>>> sim = word_vectors.n_similarity(['sushi', 'shop'], ['japanese', 'restaurant'])
>>> print("{:.4f}".format(sim))
0.7067
>>> # The word vector itself
>>> vector = word_vectors['computer']  # numpy vector of a word
>>> vector.shape
(100,)
>>>
>>> # Unit-normalised vector; Gensim < 4.0 spelled this word_vec('office', use_norm=True)
>>> vector = word_vectors.get_vector('office', norm=True)
>>> vector.shape
(100,)
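
To give a feel for what the woman + king - man query above computes, here is a rough pure-Python sketch of the additive analogy method: unit-normalise the input vectors, add the positive ones, subtract the negative ones, then rank the remaining keys by cosine similarity to the result. The four-dimensional toy vectors and the helper names are assumptions made for this example, and the code is a simplification rather than Gensim's actual most_similar implementation.

```python
import math


def unit(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def most_similar(vectors, positive, negative, topn=1):
    """Rank keys by cosine similarity to sum(positive) - sum(negative) (sketch)."""
    dim = len(next(iter(vectors.values())))
    query = [0.0] * dim
    # Add unit-normalised positive vectors and subtract negative ones
    for sign, keys in ((1.0, positive), (-1.0, negative)):
        for key in keys:
            for i, x in enumerate(unit(vectors[key])):
                query[i] += sign * x
    query = unit(query)

    scores = []
    for key, vec in vectors.items():
        if key in positive or key in negative:
            continue  # never return the query words themselves
        cosine = sum(q * x for q, x in zip(query, unit(vec)))
        scores.append((key, cosine))
    scores.sort(key=lambda pair: pair[1], reverse=True)
    return scores[:topn]


# Toy vectors invented for this example: [royalty, male, female, fruit]
toy_vectors = {
    "king":  [1.0, 1.0, 0.0, 0.0],
    "man":   [0.0, 1.0, 0.0, 0.0],
    "woman": [0.0, 0.0, 1.0, 0.0],
    "queen": [1.0, 0.0, 1.0, 0.0],
    "apple": [0.0, 0.0, 0.0, 1.0],
}

result = most_similar(toy_vectors, positive=["woman", "king"], negative=["man"])
print(result[0][0])
```

With these hand-picked vectors the top result is "queen", mirroring the pretrained GloVe example; real embeddings are far noisier, which is why the scores in the session above are well below 1.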
