Understanding gensim's word2vec Python source (Part 1): initializing and building the vocabulary
Understanding gensim's word2vec Python source (Part 2): training the Skip-gram model
Part 2 sat unfinished for too long; coming back to it, I found that the word2vec module in gensim 3.8.0 carries updates from 2018. The changelog notes that gensim bundles many methods for training word vectors beyond word2vec itself, for example gensim.models.doc2vec.Doc2Vec, gensim.models.fasttext.FastText, and wrappers for gensim.models.wrappers.VarEmbed and gensim.models.wrappers.WordRank.
This post builds on the previous one, "Building the Vocabulary with Hierarchical Softmax", and continues my walkthrough of the word2vec source code.
    if sentences is not None:
        if isinstance(sentences, GeneratorType):
            raise TypeError("You can't pass a generator as the sentences argument. Try an iterator.")
        # 1. build the vocabulary
        self.build_vocab(sentences, trim_rule=trim_rule)
        # 2. train the model
        self.train(
            sentences, total_examples=self.corpus_count, epochs=self.iter,
            start_alpha=self.alpha, end_alpha=self.min_alpha
        )
In practice, gensim dispatches to the optimized C implementations of skip-gram and CBOW when they are available; here we will look at how skip-gram training is implemented in the pure-Python fallback.
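One detail worth knowing before reading the code: each Vocab entry carries a precomputed sample_int, the word's keep probability for frequent-word subsampling scaled to the 32-bit integer range, so the survival test is a single comparison against a random draw. A minimal standalone sketch of that test (the function name downsample_keep is mine, not gensim's):

```python
import random

def downsample_keep(sample_int, rng=random):
    # sample_int is the keep probability scaled by 2**32, so a word
    # survives subsampling when it exceeds a uniform draw on [0, 2**32)
    return sample_int > rng.random() * 2 ** 32
```

A sample_int of 2**32 means the word is always kept; 0 means it is always dropped.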
def train_batch_sg(model, sentences, alpha, work=None, compute_loss=False):
    """
    Update skip-gram model by training on a sequence of sentences.

    Each sentence is a list of string tokens, which are looked up in the model's
    vocab dictionary. Called internally from `Word2Vec.train()`.

    This is the non-optimized, Python version. If you have cython installed, gensim
    will use the optimized version from word2vec_inner instead.
    """
    result = 0
    for sentence in sentences:  # process each sentence of the batch in turn
        # keep the in-vocabulary words that survive frequent-word subsampling
        word_vocabs = [model.wv.vocab[w] for w in sentence if w in model.wv.vocab and
                       model.wv.vocab[w].sample_int > model.random.rand() * 2**32]
        for pos, word in enumerate(word_vocabs):  # iterate over every centre word
            reduced_window = model.random.randint(model.window)  # `b` in the original word2vec code
            # now go over all words from the (reduced) window, predicting each one in turn
            start = max(0, pos - model.window + reduced_window)
            for pos2, word2 in enumerate(word_vocabs[start:(pos + model.window + 1 - reduced_window)], start):
                # don't train on the `word` itself; pair every other word in the window with it
                if pos2 != pos:
                    # updates the context word's vector and the inner (Huffman-tree) node vectors
                    train_sg_pair(
                        model, model.wv.index2word[word.index], word2.index, alpha, compute_loss=compute_loss
                    )
        result += len(word_vocabs)  # count the words processed and return the total
    return result
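Two pieces of the loop above are worth unpacking outside gensim: the randomly shrunken window, and the hierarchical-softmax update that train_sg_pair performs for each (centre, context) pair. Below is a minimal numpy sketch under assumed shapes: l1 is the context word's input vector, inner_vecs holds the vectors of the inner Huffman nodes on the centre word's path, and code is its Huffman code bits. The names context_positions and hs_pair_update are mine, not gensim's.

```python
import numpy as np

def context_positions(sentence_len, pos, window, rng):
    # `b` in the original word2vec C code: shrink the window by a random
    # amount so that nearer context words are sampled more often
    b = rng.randrange(window)
    start = max(0, pos - window + b)
    end = min(sentence_len, pos + window + 1 - b)
    return [p for p in range(start, end) if p != pos]

def hs_pair_update(l1, inner_vecs, code, alpha):
    # sigmoid activation at each inner node on the Huffman path
    fa = 1.0 / (1.0 + np.exp(-(inner_vecs @ l1)))
    # (label - prediction) * learning rate, where the label is 1 - code bit
    ga = (1.0 - np.asarray(code) - fa) * alpha
    neu1e = ga @ inner_vecs                    # error propagated back to l1
    new_inner = inner_vecs + np.outer(ga, l1)  # update the inner-node vectors
    return l1 + neu1e, new_inner
```

With all-zero inner-node vectors every sigmoid is 0.5, so the first update pushes each inner node toward or away from l1 depending on its code bit.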
The function below computes the likelihood score of each sentence:
def score_sentence_sg(model, sentence, work=None):
    """
    Obtain likelihood score for a single sentence in a fitted skip-gram representation.

    The sentence is a list of Vocab objects (or None, when the corresponding
    word is not in the vocabulary). Called internally from `Word2Vec.score()`.

    This is the non-optimized, Python version. If you have cython installed, gensim
    will use the optimized version from word2vec_inner instead.
    """
    log_prob_sentence = 0.0  # start the log-likelihood at zero
    if model.negative:
        raise RuntimeError("scoring is only available for HS=True")  # scoring requires hierarchical softmax
    # keep the words that appear both in the sentence and in the vocabulary
    word_vocabs = [model.wv.vocab[w] for w in sentence if w in model.wv.vocab]
    for pos, word in enumerate(word_vocabs):  # iterate over every word of the sentence
        if word is None:
            continue  # OOV word in the input sentence => skip
        # now go over all words from the window, predicting each one in turn
        start = max(0, pos - model.window)  # window start: current position minus window size, clipped at 0
        for pos2, word2 in enumerate(word_vocabs[start: pos + model.window + 1], start):
            # don't score OOV words or the `word` itself
            if word2 is not None and pos2 != pos:
                log_prob_sentence += score_sg_pair(model, word, word2)  # accumulate each pair's score
    return log_prob_sentence  # the sentence's log-likelihood
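score_sg_pair (not shown above) sums log-sigmoid terms over the centre word's Huffman path, with the sign of each dot product flipped by the corresponding code bit. A minimal numpy sketch under assumed shapes (context_vec is the context word's input vector, inner_vecs and code the centre word's Huffman-path data; the name hs_pair_score is mine):

```python
import numpy as np

def hs_pair_score(context_vec, inner_vecs, code):
    # code bit 0 -> sign +1, code bit 1 -> sign -1
    sgn = (-1.0) ** np.asarray(code)
    dots = inner_vecs @ context_vec
    # log sigmoid(sgn * dot) at each inner node, computed stably
    # via the identity log sigmoid(x) = -logaddexp(0, -x)
    return float(-np.logaddexp(0, -sgn * dots).sum())
```

With zero vectors every node contributes log(0.5), so a path of length n scores -n*log(2), the maximum-entropy baseline.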