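I have needed word vectors for text recently, so I took the opportunity to revisit word2vec. This post explains how word2vec works and how to implement it in code.

GitHub code for this post: https://github.com/yip522364642/word2vec-gensim

In NLP, before a computer can make sense of text, the text has to be encoded. Common encodings include one-hot encoding, the bag-of-words model (BOW), and word embedding models. word2vec is one kind of word embedding model; it is a tool released by Google in 2013.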
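
To make the first two encodings concrete, here is a minimal sketch in plain Python. The toy sentence and the helper names one_hot and bag_of_words are my own, chosen only for illustration.

```python
from collections import Counter

# Toy sentence; the data is invented for illustration.
sentence = "the cat sat on the mat".split()
vocab = sorted(set(sentence))            # ['cat', 'mat', 'on', 'sat', 'the']
index = {word: i for i, word in enumerate(vocab)}

def one_hot(word):
    """Return a V-dimensional vector with a single 1 at the word's index."""
    vec = [0] * len(vocab)
    vec[index[word]] = 1
    return vec

def bag_of_words(tokens):
    """Return a V-dimensional vector of word counts; word order is lost."""
    counts = Counter(tokens)
    return [counts[word] for word in vocab]

print(one_hot("cat"))          # [1, 0, 0, 0, 0]
print(bag_of_words(sentence))  # [1, 1, 1, 1, 2]
```

Both vectors are as long as the vocabulary and say nothing about how words relate to each other, which is the gap word embeddings are meant to fill.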

I. How word2vec works

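The word2vec tool mainly contains two models: the continuous bag-of-words model (CBOW) and the skip-gram model. In the figure below, the blue part on the left is the CBOW model and the green part on the right is the Skip-gram model. The difference is that CBOW learns word vectors by predicting the target word from its context (in the figure, predicting W(t) from the four words W(t-2), W(t-1), W(t+1), W(t+2)), whereas Skip-gram learns word vectors by predicting the surrounding words from the target word (in the figure, predicting W(t-2), W(t-1), W(t+1), W(t+2) from W(t)). A commonly quoted rule of thumb from the word2vec authors is that Skip-gram works well even with a small amount of training data and represents rare words well, while CBOW trains several times faster and is slightly more accurate for frequent words.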
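
The difference between the two models shows up in the training examples each one extracts from the same text. The sketch below uses a window of 2 words on each side, as in the figure; the function names and the toy sentence are hypothetical, and real implementations add details such as subsampling that are left out here.

```python
def cbow_examples(tokens, window=2):
    """Yield (context_words, target_word): CBOW predicts the target from its context."""
    for t, target in enumerate(tokens):
        context = tokens[max(0, t - window):t] + tokens[t + 1:t + 1 + window]
        yield context, target

def skipgram_examples(tokens, window=2):
    """Yield (target_word, context_word): Skip-gram predicts each context word from the target."""
    for context, target in cbow_examples(tokens, window):
        for word in context:
            yield target, word

tokens = "the quick brown fox jumps".split()
print(list(cbow_examples(tokens))[2])
# (['the', 'quick', 'fox', 'jumps'], 'brown')
print([pair for pair in skipgram_examples(tokens) if pair[0] == "brown"])
# [('brown', 'the'), ('brown', 'quick'), ('brown', 'fox'), ('brown', 'jumps')]
```

For the word at position t, CBOW produces one example with four inputs, while Skip-gram produces four examples with one input each.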

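So how is this actually done? Below, taking the CBOW model as the example, I walk through each step (the exact formulas are omitted for now; I will add them when I have time).

Taking the figure above as the example:

① Input layer: the context words of the target word (three are shown here). Each word is represented by a one-hot vector of size [1 * V], where V is the vocabulary size.

② Every one-hot vector is multiplied by the input weight matrix W, a shared matrix of size [V * N], where N is the dimensionality of the output word vectors.

③ The resulting vectors (each a [1 * V] one-hot vector times the [V * N] shared matrix W) are summed and then averaged to give the hidden-layer vector h, of size [1 * N].

④ The hidden-layer vector h is multiplied by the output weight matrix W', a shared matrix of size [N * V].

⑤ The product is a vector y of size [1 * V]. Applying the softmax function to y yields a V-dimensional probability distribution.

⑥ Because the input is one-hot encoded, each dimension stands for one word, so the word at the index with the highest probability in the V-dimensional distribution is the predicted center word.

⑦ The result is compared with the one-hot vector of the true label; the smaller the error, the better. The loss function is usually cross-entropy.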

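That is the whole process by which CBOW produces word vectors. The prediction and the error computation exist only to train the input weight matrix W and the output weight matrix W'. If all we want is the vector of each word, the vector y is not needed once training is done: multiplying a word's one-hot vector by W simply selects one row of W, and that row is the word's vector.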
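
Steps ① to ⑦ can be sketched as a single forward pass in plain Python. This is only an illustration under my own assumptions: a five-word toy vocabulary, N = 3, randomly initialized weights and no training step (no gradient update); it is not how gensim implements CBOW internally.

```python
import math
import random

vocab = ["the", "cat", "sat", "on", "mat"]   # toy vocabulary, V = 5
V, N = len(vocab), 3                          # N = word-vector dimensionality
rng = random.Random(0)                        # fixed seed so runs are repeatable

# Randomly initialized weights; training would adjust these.
W = [[rng.uniform(-0.5, 0.5) for _ in range(N)] for _ in range(V)]      # [V * N]
W_out = [[rng.uniform(-0.5, 0.5) for _ in range(V)] for _ in range(N)]  # [N * V], W' in the text

def one_hot(word):
    return [1.0 if w == word else 0.0 for w in vocab]

def vec_times_matrix(x, matrix):
    """Multiply a row vector by a matrix given as a list of rows."""
    return [sum(x[i] * matrix[i][j] for i in range(len(x)))
            for j in range(len(matrix[0]))]

def softmax(scores):
    peak = max(scores)                        # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cbow_forward(context, target):
    # Steps 1-3: project each one-hot context vector through W, then average.
    projected = [vec_times_matrix(one_hot(w), W) for w in context]
    h = [sum(column) / len(context) for column in zip(*projected)]
    # Steps 4-5: scores y = h * W', then softmax gives a distribution over the vocabulary.
    probs = softmax(vec_times_matrix(h, W_out))
    # Step 6: the most probable index is the predicted center word.
    predicted = vocab[probs.index(max(probs))]
    # Step 7: cross-entropy against a one-hot label reduces to -log p(target).
    loss = -math.log(probs[vocab.index(target)])
    return h, probs, predicted, loss

h, probs, predicted, loss = cbow_forward(["the", "cat", "on", "mat"], "sat")
print(len(h), round(sum(probs), 6), predicted, round(loss, 4))
```

Note that vec_times_matrix(one_hot("cat"), W) returns exactly W[1], the row of W belonging to "cat", so the one-hot multiplication is really just a table lookup.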

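II. Implementing word2vec in code

Below I show how to implement word2vec with Python's gensim package and explain what the relevant functions do.

(1) Getting the text corpus

This post uses a text corpus available online, close to 100 MB in size. Download it from http://mattmahoney.net/dc/text8.zip

After downloading, you can inspect the corpus (the point is to see exactly what format the data is in, which makes it easier for the model to read later).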

'''
1 Get the text corpus and inspect it
'''
with open('text8', 'r', encoding='utf-8') as file:
    # text8 is a single line of about 100 MB, so print only the first 1000 characters
    print(file.read(1000))

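We find that the corpus is already tokenized by spaces, with all punctuation removed and no line breaks. The output is shown in the screenshot below.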

(2) Loading the data, training and saving the model

'''
2 Load the data, train and save the model
'''
from gensim.models import word2vec
import logging
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)  # output log messages
sentences = word2vec.Text8Corpus('text8')  # store the corpus in sentences
# Note: gensim >= 4.0 renames size to vector_size (and iter to epochs)
model = word2vec.Word2Vec(sentences, sg=1, size=100,  window=5,  min_count=5,  negative=3, sample=0.001, hs=1, workers=4)  # build the word-vector space model
model.save('text8_word2vec.model')  # save the model

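Next, each line of the code is explained in turn (very important!).

① # Output log messages

logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

This line makes the program output log messages in the format (format) time (asctime) : level name (levelname) : log message (message), at the normal information level (logging.INFO). logging.basicConfig accepts a number of other parameters as well; only format and level are set here. You can add more as needed, but I suggest adding only what you actually want to see. For reference: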

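Parameters of logging.basicConfig:
filename: name of the log file to write to
filemode: mode used to open the log file, with the same meaning as the mode argument of open(): 'w' or 'a' (default 'a')
format: the format and content of the output. format can output a lot of useful information, as in the example above:
 %(levelno)s: numeric value of the log level
 %(levelname)s: name of the log level
 %(pathname)s: full path of the source file that issued the logging call
 %(filename)s: file name portion of that path
 %(funcName)s: name of the function that issued the logging call
 %(lineno)d: line number of the logging call
 %(asctime)s: time of the log record
 %(thread)d: thread ID
 %(threadName)s: thread name
 %(process)d: process ID
 %(message)s: the log message
datefmt: time format, same as time.strftime()
level: log level of the root logger; the default is logging.WARNING
stream: the stream to write log output to, such as sys.stderr, sys.stdout or an open file; the default is sys.stderr. stream and filename cannot be combined: Python 3.3 and later raise ValueError if both are given (older versions ignored stream).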
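
To see how the level threshold and the format fields interact, here is a small sketch. It attaches a handler to a named logger by hand instead of calling logging.basicConfig, purely so that the output can be captured in memory and inspected; the logger name and the function train_step are hypothetical.

```python
import io
import logging

buffer = io.StringIO()                       # in-memory stream instead of sys.stderr
handler = logging.StreamHandler(buffer)
# Formatter accepts the same %-style fields as basicConfig(format=...).
handler.setFormatter(logging.Formatter("%(levelname)s : %(funcName)s : %(message)s"))

logger = logging.getLogger("word2vec_demo")  # hypothetical logger name
logger.setLevel(logging.INFO)                # same role as basicConfig(level=...)
logger.addHandler(handler)
logger.propagate = False                     # keep the demo output away from the root logger

def train_step():
    logger.debug("not shown: DEBUG is below the INFO threshold")
    logger.info("training started")
    logger.warning("corpus is small")

train_step()
print(buffer.getvalue(), end="")
# INFO : train_step : training started
# WARNING : train_step : corpus is small
```

The DEBUG message is dropped because the level is INFO, and %(funcName)s is filled in with the function that issued the call.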

logging functions for emitting messages:

logging.debug('This is debug message')
logging.info('This is info message')
logging.warning('This is warning message')

Screenshot of the output:

② # Store the corpus in sentences

sentences = word2vec.Text8Corpus('text8')

The 'text8' corpus used here is already tokenized by spaces, with all punctuation removed and no line breaks, so it needs no preprocessing.

For large datasets, sentences can be read with word2vec.BrownCorpus(), word2vec.Text8Corpus() or word2vec.LineSentence(); for small datasets, sentences can simply be a list, e.g. sentences=[["I", "love", "China", "very", "much"], ["China", "is", "a", "strong", "country"]].
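
What the corpus readers have in common is that they are restartable iterables yielding one list of tokens per sentence, read from disk rather than held in memory. The class below is a hypothetical plain-Python stand-in for that idea (it is not gensim code), applied to a tiny temporary file with one sentence per line.

```python
import os
import tempfile

class SentencesFromFile:
    """Restartable iterable that yields one list of tokens per line of a text file."""

    def __init__(self, path):
        self.path = path

    def __iter__(self):
        # Reopening the file on every pass allows several passes over the corpus
        # without ever loading all of it into memory.
        with open(self.path, "r", encoding="utf-8") as handle:
            for line in handle:
                tokens = line.split()
                if tokens:
                    yield tokens

# Write a tiny two-sentence corpus to a temporary file (illustrative data only).
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False, encoding="utf-8") as tmp:
    tmp.write("I love China very much\nChina is a strong country\n")

sentences = SentencesFromFile(tmp.name)
first_pass = list(sentences)
second_pass = list(sentences)   # a plain generator would already be exhausted here
os.remove(tmp.name)

print(first_pass)
# [['I', 'love', 'China', 'very', 'much'], ['China', 'is', 'a', 'strong', 'country']]
```

Being able to iterate more than once matters because training makes one pass to build the vocabulary and further passes to train.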

③ # Build the word-vector space model

model = word2vec.Word2Vec(sentences, sg=1, size=100,  window=5,  min_count=5,  negative=3, sample=0.001, hs=1, workers=4)

This line configures the word2vec model through its parameters, which are described below:

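'''
1. sentences: can be a list; for large corpora, building it with BrownCorpus, Text8Corpus or LineSentence is recommended.
2. sg: selects the training algorithm. The default 0 means CBOW; sg=1 means skip-gram.
3. size: dimensionality of the output word vectors, default 100. A larger size needs more training data but can give better results. Recommended values range from tens to a few hundred. (gensim 4.0 and later call this parameter vector_size.)
4. window: size of the training window, default 5. For example, window=8 means each word considers up to 8 words before and 8 words after it. In the actual code the window is also shrunk at random, so the effective size is <= window.
5. alpha: the initial learning rate.
6. seed: seed for the random number generator; it affects the initialization of the word vectors.
7. min_count: truncates the vocabulary. Words that occur fewer than min_count times are discarded. Default 5.
8. max_vocab_size: limits RAM while the vocabulary is being built. If there are more unique words than this, the least frequent ones are pruned. Every 10 million unique words need about 1 GB of RAM. None means no limit.
9. sample: the subsampling threshold. Words whose frequency is above the threshold are randomly downsampled, and the more frequent a word is, the more of its occurrences are dropped. Default 1e-3; the useful range is (0, 1e-5).
10. workers: number of worker threads used for training.
11. hs: whether to use hierarchical softmax (HS). 0 means no, 1 means yes. Default 0.
12. negative: if > 0, negative sampling is used, and the value sets how many noise words are drawn.
13. cbow_mean: if 0, use the sum of the context word vectors; if 1 (default), use the mean. Only applies when CBOW is used.
14. hashfxn: hash function used to initialize the weights. Defaults to Python's built-in hash function.
15. iter: number of iterations over the corpus, default 5. (gensim 4.0 and later call this parameter epochs.)
16. trim_rule: vocabulary trimming rule, specifying which words to keep and which to drop. Can be None (min_count is used) or a callable that accepts (word, count, min_count) and returns utils.RULE_DISCARD, utils.RULE_KEEP or utils.RULE_DEFAULT.
17. sorted_vocab: if 1 (default), the vocabulary is sorted by descending frequency before word indexes are assigned.
18. batch_words: number of words in each batch passed to the worker threads, default 10000.
'''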
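
The sample parameter is the least intuitive of these, so here is a sketch of the subsampling rule as I understand it from the original word2vec C code: each occurrence of a word is kept with a probability that falls as the word becomes more frequent. The function name, the token total and the word counts below are assumptions made for illustration, and gensim's exact behavior may differ in detail.

```python
import math

def keep_probability(word_count, total_words, sample=1e-3):
    """Probability of keeping one occurrence of a word during training."""
    if sample <= 0:
        return 1.0                       # sample=0 disables subsampling
    threshold = sample * total_words     # counts above this start to be downsampled
    prob = (math.sqrt(word_count / threshold) + 1) * (threshold / word_count)
    return min(prob, 1.0)

total = 17_000_000                       # assumed rough token count for a corpus like text8
# The per-word counts are invented, chosen to span very frequent to rare.
for word, count in [("the", 1_000_000), ("girl", 1_000), ("zebra", 20)]:
    print(word, round(keep_probability(count, total), 3))
```

With these numbers, only a small fraction of the occurrences of the very frequent word survive, while the two rarer words are always kept. A smaller sample value lowers the threshold and downsamples more words.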

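④ # Save the model

model.save('text8_word2vec.model')

Saving the model means it does not have to be retrained later: you can load the trained model and use it directly.

The next part shows how to use word2vec for various tasks after loading the model.

(3) Loading the model and using it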

'''
3 Load the model and try out its functions
'''
# Load the model
model = word2vec.Word2Vec.load('text8_word2vec.model')

# Compute the similarity/relatedness of two words
# (queries go through model.wv, which works in gensim 3.x and 4.x)
print("Similarity/relatedness of two words")
word1 = 'man'
word2 = 'woman'
result1 = model.wv.similarity(word1, word2)
print("Similarity of " + word1 + " and " + word2 + ":", result1)
print("\n================================")

# List the words most related to a given word
print("Words most related to a given word")
word = 'bad'
result2 = model.wv.most_similar(word, topn=10)  # the 10 most related words
print("Words most related to " + word + ":")
for item in result2:
    print(item[0], item[1])
print("\n================================")

# Find analogies
print("Find analogies")
print(' "boy" is to "father" as "girl" is to ...? ')
result3 = model.wv.most_similar(['girl', 'father'], ['boy'], topn=3)
for item in result3:
    print(item[0], item[1])
print("\n")

more_examples = ["she her he", "small smaller bad", "going went being"]
for example in more_examples:
    a, b, x = example.split()
    predicted = model.wv.most_similar([x, b], [a])[0][0]
    print("'%s' is to '%s' as '%s' is to '%s'" % (a, b, x, predicted))
print("\n================================")

# Find the word that does not belong
print("Find the word that does not belong")
result4 = model.wv.doesnt_match("flower grass pig tree".split())
print("Odd one out:", result4)
print("\n================================")

# Look up a word vector (only for words kept in the model)
print("Look up a word vector (only for words kept in the model)")
word = 'girl'
print(word, model.wv[word])
# for word in model.wv.vocab.keys():  # list every word (gensim < 4.0; use model.wv.key_to_index in 4.x)
#     print(word, model.wv[word])
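
The queries above all come down to cosine similarity between vectors. The sketch below reproduces the idea on five hand-made 3-dimensional vectors, invented purely for illustration. The helper functions are my own simplified versions, not gensim's implementation: the analogy query averages unit-length vectors of the positive words minus those of the negative words, and the odd-one-out query picks the word least similar to the mean of the group.

```python
import math

# Hand-made toy vectors; real word2vec vectors are learned and have ~100 dimensions.
vectors = {
    "man":   [1.0, 0.0, 0.2],
    "woman": [1.0, 1.0, 0.2],
    "king":  [1.0, 0.0, 1.0],
    "queen": [1.0, 1.0, 1.0],
    "apple": [-1.0, 0.3, -0.5],
}

def norm(v):
    return math.sqrt(sum(x * x for x in v))

def unit(v):
    return [x / norm(v) for x in v]

def cosine(a, b):
    return sum(x * y for x, y in zip(a, b)) / (norm(a) * norm(b))

def most_similar(positive, negative=(), topn=3):
    """Rank the remaining words by cosine similarity to the combined query vector."""
    signed = [unit(vectors[w]) for w in positive]
    signed += [[-x for x in unit(vectors[w])] for w in negative]
    query = [sum(column) / len(signed) for column in zip(*signed)]
    exclude = set(positive) | set(negative)
    scored = [(w, cosine(query, v)) for w, v in vectors.items() if w not in exclude]
    return sorted(scored, key=lambda item: item[1], reverse=True)[:topn]

def doesnt_match(words):
    """Return the word least similar to the mean of the group's unit vectors."""
    units = [unit(vectors[w]) for w in words]
    mean = [sum(column) / len(units) for column in zip(*units)]
    return min(words, key=lambda w: cosine(vectors[w], mean))

print(round(cosine(vectors["man"], vectors["woman"]), 3))      # 0.714
print(most_similar(["woman", "king"], ["man"], topn=1)[0][0])  # queen
print(doesnt_match(["man", "woman", "king", "apple"]))         # apple
```

In these toy vectors the second coordinate separates "man" from "woman" and the third separates commoner from royalty, which is why "woman" + "king" - "man" lands nearest to "queen".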

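(4) Incremental training

When using the word vectors, if you encounter words that did not appear during training (out-of-vocabulary words), you can use incremental training on sentences containing them to obtain their vectors. Note that min_count still applies: a new word is only added to the vocabulary if it occurs at least min_count times in the new sentences.

'''
4 Incremental training
'''
model = word2vec.Word2Vec.load('text8_word2vec.model')
more_sentences = [['Advanced', 'users', 'can', 'load', 'a', 'model', 'and', 'continue', 'training', 'it', 'with', 'more', 'sentences']]
model.build_vocab(more_sentences, update=True)  # add the new sentences to the existing vocabulary
model.train(more_sentences, total_examples=model.corpus_count, epochs=model.epochs)  # model.iter is the deprecated name for model.epochs
model.save('text8_word2vec.model')

The complete code is below; it is also on GitHub: https://github.com/yip522364642/word2vec-gensim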

# -*- coding: utf-8 -*-
# @Time : 2019/11/13 14:55
# @FileName: word2vec-gensim.py
# @Software: PyCharm
# @Author : yip
# @Email : [email protected]
# @Blog : https://blog.csdn.net/qq_30189255
# @Github : https://github.com/yip522364642


import warnings

warnings.filterwarnings("ignore")

'''
1 Get the text corpus and inspect it
'''
# with open('text8', 'r', encoding='utf-8') as file:
#     for line in file.readlines():
#         print(line)

'''
2 Load the data, train and save the model
'''
from gensim.models import word2vec
import logging
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)  # output log messages
sentences = word2vec.Text8Corpus('text8')  # store the corpus in sentences
# Note: gensim >= 4.0 renames size to vector_size (and iter to epochs)
model = word2vec.Word2Vec(sentences, sg=1, size=100,  window=5,  min_count=5,  negative=3, sample=0.001, hs=1, workers=4)  # build the word-vector space model
model.save('text8_word2vec.model')  # save the model


'''
3 Load the model and try out its functions
'''
# Load the model
model = word2vec.Word2Vec.load('text8_word2vec.model')

# Compute the similarity/relatedness of two words
# (queries go through model.wv, which works in gensim 3.x and 4.x)
print("Similarity/relatedness of two words")
word1 = 'man'
word2 = 'woman'
result1 = model.wv.similarity(word1, word2)
print("Similarity of " + word1 + " and " + word2 + ":", result1)
print("\n================================")

# List the words most related to a given word
print("Words most related to a given word")
word = 'bad'
result2 = model.wv.most_similar(word, topn=10)  # the 10 most related words
print("Words most related to " + word + ":")
for item in result2:
    print(item[0], item[1])
print("\n================================")

# Find analogies
print("Find analogies")
print(' "boy" is to "father" as "girl" is to ...? ')
result3 = model.wv.most_similar(['girl', 'father'], ['boy'], topn=3)
for item in result3:
    print(item[0], item[1])
print("\n")

more_examples = ["she her he", "small smaller bad", "going went being"]
for example in more_examples:
    a, b, x = example.split()
    predicted = model.wv.most_similar([x, b], [a])[0][0]
    print("'%s' is to '%s' as '%s' is to '%s'" % (a, b, x, predicted))
print("\n================================")

# Find the word that does not belong
print("Find the word that does not belong")
result4 = model.wv.doesnt_match("flower grass pig tree".split())
print("Odd one out:", result4)
print("\n================================")

# Look up a word vector (only for words kept in the model)
print("Look up a word vector (only for words kept in the model)")
word = 'girl'
print(word, model.wv[word])
# for word in model.wv.vocab.keys():  # list every word (gensim < 4.0; use model.wv.key_to_index in 4.x)
#     print(word, model.wv[word])


'''
4 Incremental training
'''
model = word2vec.Word2Vec.load('text8_word2vec.model')
more_sentences = [['Advanced', 'users', 'can', 'load', 'a', 'model', 'and', 'continue', 'training', 'it', 'with', 'more', 'sentences']]
model.build_vocab(more_sentences, update=True)  # add the new sentences to the existing vocabulary
model.train(more_sentences, total_examples=model.corpus_count, epochs=model.epochs)  # model.iter is the deprecated name for model.epochs
model.save('text8_word2vec.model')

That covers how word2vec works and how to implement it in code.
