Understanding gensim's word2vec Python source (Part 1): initializing and building the vocabulary
Understanding gensim's word2vec Python source (Part 2): training the Skip-gram model
Part 2 sat unfinished for too long; coming back to it, I found that the word2vec module in gensim 3.8.0 carries updates from 2018. The changelog notes that gensim bundles many methods for training word vectors beyond word2vec itself, for example gensim.models.doc2vec.Doc2Vec, gensim.models.fasttext.FastText, and wrappers for gensim.models.wrappers.VarEmbed and gensim.models.wrappers.WordRank.
This post builds on the previous one, "Building the Vocabulary with Hierarchical Softmax", and continues my walkthrough of the word2vec source code.
    if sentences is not None:
        if isinstance(sentences, GeneratorType):
            raise TypeError("You can't pass a generator as the sentences argument. Try an iterator.")
        # 1. build the vocabulary
        self.build_vocab(sentences, trim_rule=trim_rule)
        # 2. train the model
        self.train(
            sentences, total_examples=self.corpus_count, epochs=self.iter,
            start_alpha=self.alpha, end_alpha=self.min_alpha
        )
In practice, gensim dispatches to the optimized C implementations of skip-gram and CBOW when they are available; here we will look at how skip-gram training is implemented in the pure-Python fallback.
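One detail worth knowing before reading the code: each Vocab entry carries a precomputed sample_int, the word's keep probability for frequent-word subsampling scaled to the 32-bit integer range, so the survival test is a single comparison against a random draw. A minimal standalone sketch of that test (the function name downsample_keep is mine, not gensim's):

```python
import random

def downsample_keep(sample_int, rng=random):
    # sample_int is the keep probability scaled by 2**32, so a word
    # survives subsampling when it exceeds a uniform draw on [0, 2**32)
    return sample_int > rng.random() * 2 ** 32
```

A sample_int of 2**32 means the word is always kept; 0 means it is always dropped.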
def train_batch_sg(model, sentences, alpha, work=None, compute_loss=False):
    """
    Update skip-gram model by training on a sequence of sentences.

    Each sentence is a list of string tokens, which are looked up in the model's
    vocab dictionary. Called internally from `Word2Vec.train()`.

    This is the non-optimized, Python version. If you have cython installed, gensim
    will use the optimized version from word2vec_inner instead.
    """
    result = 0
    for sentence in sentences:  # process each sentence of the batch in turn
        # keep the in-vocabulary words that survive frequent-word subsampling
        word_vocabs = [model.wv.vocab[w] for w in sentence if w in model.wv.vocab and
                       model.wv.vocab[w].sample_int > model.random.rand() * 2**32]
        for pos, word in enumerate(word_vocabs):  # iterate over every centre word
            reduced_window = model.random.randint(model.window)  # `b` in the original word2vec code
            # now go over all words from the (reduced) window, predicting each one in turn
            start = max(0, pos - model.window + reduced_window)
            for pos2, word2 in enumerate(word_vocabs[start:(pos + model.window + 1 - reduced_window)], start):
                # don't train on the `word` itself; pair every other word in the window with it
                if pos2 != pos:
                    # updates the context word's vector and the inner (Huffman-tree) node vectors
                    train_sg_pair(
                        model, model.wv.index2word[word.index], word2.index, alpha, compute_loss=compute_loss
                    )
        result += len(word_vocabs)  # count the words processed and return the total
    return result
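Two pieces of the loop above are worth unpacking outside gensim: the randomly shrunken window, and the hierarchical-softmax update that train_sg_pair performs for each (centre, context) pair. Below is a minimal numpy sketch under assumed shapes: l1 is the context word's input vector, inner_vecs holds the vectors of the inner Huffman nodes on the centre word's path, and code is its Huffman code bits. The names context_positions and hs_pair_update are mine, not gensim's.

```python
import numpy as np

def context_positions(sentence_len, pos, window, rng):
    # `b` in the original word2vec C code: shrink the window by a random
    # amount so that nearer context words are sampled more often
    b = rng.randrange(window)
    start = max(0, pos - window + b)
    end = min(sentence_len, pos + window + 1 - b)
    return [p for p in range(start, end) if p != pos]

def hs_pair_update(l1, inner_vecs, code, alpha):
    # sigmoid activation at each inner node on the Huffman path
    fa = 1.0 / (1.0 + np.exp(-(inner_vecs @ l1)))
    # (label - prediction) * learning rate, where the label is 1 - code bit
    ga = (1.0 - np.asarray(code) - fa) * alpha
    neu1e = ga @ inner_vecs                    # error propagated back to l1
    new_inner = inner_vecs + np.outer(ga, l1)  # update the inner-node vectors
    return l1 + neu1e, new_inner
```

With all-zero inner-node vectors every sigmoid is 0.5, so the first update pushes each inner node toward or away from l1 depending on its code bit.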
The function below computes the likelihood score of each sentence:
def score_sentence_sg(model, sentence, work=None):
    """
    Obtain likelihood score for a single sentence in a fitted skip-gram representation.

    The sentence is a list of Vocab objects (or None, when the corresponding
    word is not in the vocabulary). Called internally from `Word2Vec.score()`.

    This is the non-optimized, Python version. If you have cython installed, gensim
    will use the optimized version from word2vec_inner instead.
    """
    log_prob_sentence = 0.0  # start the log-likelihood at zero
    if model.negative:
        raise RuntimeError("scoring is only available for HS=True")  # scoring requires hierarchical softmax
    # keep the words that appear both in the sentence and in the vocabulary
    word_vocabs = [model.wv.vocab[w] for w in sentence if w in model.wv.vocab]
    for pos, word in enumerate(word_vocabs):  # iterate over every word of the sentence
        if word is None:
            continue  # OOV word in the input sentence => skip
        # now go over all words from the window, predicting each one in turn
        start = max(0, pos - model.window)  # window start: current position minus window size, clipped at 0
        for pos2, word2 in enumerate(word_vocabs[start: pos + model.window + 1], start):
            # don't score OOV words or the `word` itself
            if word2 is not None and pos2 != pos:
                log_prob_sentence += score_sg_pair(model, word, word2)  # accumulate each pair's score
    return log_prob_sentence  # the sentence's log-likelihood
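score_sg_pair (not shown above) sums log-sigmoid terms over the centre word's Huffman path, with the sign of each dot product flipped by the corresponding code bit. A minimal numpy sketch under assumed shapes (context_vec is the context word's input vector, inner_vecs and code the centre word's Huffman-path data; the name hs_pair_score is mine):

```python
import numpy as np

def hs_pair_score(context_vec, inner_vecs, code):
    # code bit 0 -> sign +1, code bit 1 -> sign -1
    sgn = (-1.0) ** np.asarray(code)
    dots = inner_vecs @ context_vec
    # log sigmoid(sgn * dot) at each inner node, computed stably
    # via the identity log sigmoid(x) = -logaddexp(0, -x)
    return float(-np.logaddexp(0, -sgn * dots).sum())
```

With zero vectors every node contributes log(0.5), so a path of length n scores -n*log(2), the maximum-entropy baseline.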