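This post walks through training Chinese word vectors (Word2Vec) on the Sogou news dataset. I fell into a lot of pits doing it myself, so I'm sharing them here to save you the detours.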

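Once you have worked through this post, you can move on to the follow-up on training word vectors from Wikipedia to improve your model further!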

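References: 搜狗語料庫word2vec獲取詞向量 (Getting word vectors from the Sogou corpus with word2vec)
 自然語言處理入門(一)------搜狗新聞語料處理和word2vec詞向量的訓練 (Introduction to NLP (1): processing the Sogou news corpus and training word2vec vectors)
 word2vec使用方法小結 (A short summary of word2vec usage)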

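Downloading the dataset

This walkthrough uses the news dataset from Sogou Labs (see its download page).

A mini version and a full version are available. I downloaded the full version in tar.gz format.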

Processing the dataset

(1) Extracting the archive

Open the Windows Command Prompt (cmd), change to the directory that contains the file, and run:

tar -zvxf news_sohusite_xml.full.tar.gz

This extracts the downloaded file news_sohusite_xml.full.tar.gz to news_sohusite_xml.dat.
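On a machine where the tar command is not available from cmd, the same step can be done with Python's standard library. This is a minimal sketch rather than part of the original workflow; the function name extract_archive and the destination folder are my own choices.

```python
import tarfile
from pathlib import Path


def extract_archive(archive_path, dest_dir="."):
    """Extract a .tar.gz archive into dest_dir and return its member names."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    # "r:gz" opens a gzip-compressed tar file for reading
    with tarfile.open(archive_path, "r:gz") as tar:
        names = tar.getnames()
        tar.extractall(dest)
    return names


if __name__ == "__main__":
    archive = Path("news_sohusite_xml.full.tar.gz")
    if archive.exists():
        print(extract_archive(archive))
```

Only extract archives you trust this way, since extractall writes whatever paths the archive contains.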

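(2) Extracting the text

Let's look at the extracted data (the file is very large, so I opened it in PyCharm).

Two key points stand out: (1) the file is not UTF-8 encoded, so it has to be converted; (2) the file is stored as XML, where url is the page link, contenttitle is the page title, and content is the page body. Pull out whichever fields you need.

From cmd, convert the encoding and extract the "content" lines in a single pass: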

type news_sohusite_xml.dat | iconv -f gbk -t utf-8 -c | findstr "<content>"  > corpus.txt

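This may fail because iconv.exe is missing. If so, download win_iconv (an encoding conversion tool), unzip it, and copy iconv.exe to C:\Windows\System32. After that the command works.

The output is saved to corpus.txt, as shown in the figure.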
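If you would rather not install iconv, roughly the same conversion and filtering can be done in Python. The sketch below is an alternative I am suggesting, not what the original command does internally: extract_content_lines is a hypothetical helper, and errors="ignore" is my stand-in for the -c flag, which drops characters that cannot be converted.

```python
def extract_content_lines(src_path, dst_path, src_encoding="gbk"):
    """Copy lines containing "<content>" from src_path to dst_path as UTF-8.

    Returns the number of lines kept.
    """
    kept = 0
    # errors="ignore" silently drops bytes that are not valid in src_encoding
    with open(src_path, encoding=src_encoding, errors="ignore") as src, \
            open(dst_path, "w", encoding="utf-8") as dst:
        for line in src:  # read line by line so the whole file never sits in memory
            if "<content>" in line:
                dst.write(line)
                kept += 1
    return kept


if __name__ == "__main__":
    import os

    if os.path.exists("news_sohusite_xml.dat"):
        print(extract_content_lines("news_sohusite_xml.dat", "corpus.txt"))
```

Streaming the file line by line matters here because the full dataset is too large to load comfortably at once.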

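(3) Word segmentation

Create corpusSegDone.txt to hold the segmented output. Run the code below to segment the text; it prints a counter every 100 lines so you can follow the progress.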

import jieba
import re

filePath = 'corpus.txt'
fileSegWordDonePath = 'corpusSegDone.txt'

# Read every line of the corpus into a list
fileTrainRead = []
with open(filePath, encoding='utf-8') as fileTrainRaw:
    for line in fileTrainRaw:
        fileTrainRead.append(line)

# Remove punctuation
fileTrainClean = []
remove_chars = '[·’!"#$%&\'()*+,-./:;<=>?@，。?★、…【】《》？“”‘’！[\\]^_`{|}~]+'
for i in range(len(fileTrainRead)):
    string = re.sub(remove_chars, "", fileTrainRead[i])
    fileTrainClean.append(string)

# Segment with jieba
fileTrainSeg = []
file_userDict = 'dict.txt'  # custom dictionary
jieba.load_userdict(file_userDict)
for i in range(len(fileTrainClean)):
    # Punctuation removal has already reduced <content> and </content> to the bare
    # word "content" (7 characters), so strip the trailing newline first and then
    # slice 7 characters off each end. Adjust the offsets if your lines differ.
    text = fileTrainClean[i].strip()[7:-7]
    fileTrainSeg.append([' '.join(jieba.cut(text, cut_all=False))])
    if i % 100 == 0:  # print progress every 100 lines
        print(i)

# Write one segmented sentence per line, tokens separated by spaces
with open(fileSegWordDonePath, 'wb') as fW:
    for i in range(len(fileTrainSeg)):
        fW.write(fileTrainSeg[i][0].encode('utf-8'))
        fW.write('\n'.encode("utf-8"))

The figure below shows the segmentation in progress. My corpus had about 1.4 million lines in total and took 45 minutes.
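To see where the offsets 7 and -7 in the segmentation script come from, it helps to run the cleaning step on a single line. The sketch below uses a shortened punctuation class of my own (an illustrative subset, not the full list) and a made-up sample sentence.

```python
import re

# Illustrative subset of the punctuation class used in the segmentation script
PUNCT = re.compile(r"[<>/，。！？、,.!?]+")


def clean_content_line(line):
    """Remove punctuation, then slice off what is left of the <content> tags."""
    no_punct = PUNCT.sub("", line.strip())
    # "<content>" and "</content>" both shrink to "content" (7 characters)
    # once "<", ">" and "/" are deleted, hence the 7 and the -7.
    return no_punct[7:-7]


print(clean_content_line("<content>今天，天氣很好。</content>\n"))  # prints: 今天天氣很好
```

Note that the end offset only lines up if the trailing newline is stripped before slicing; with the newline still attached, the last seven characters would be "ontent" plus the newline and a stray "c" would be left behind.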

The segmented output is shown in the figure:

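Training word vectors with gensim

First, an explanation of the parameters in the training call model = word2vec.Word2Vec(sentences, size=100, sg=1, window=6, min_count=0, workers=4, iter=5), to help you avoid the pitfalls:

sentences: the collection of sentences to train on
size: the dimensionality of the trained word vectors
sg: 0 trains with the CBOW algorithm, 1 with the skip-gram algorithm
window: the window size, i.e. the maximum distance between the current word and the predicted word
min_count: words whose frequency is below this value are discarded. Be careful, this one bit me badly: think hard about whether you really want to filter out low-frequency words! If you later want to compute sentence similarity and some words were dropped, a sentence may contain words that are not in the vocabulary!
workers: the number of threads used to train the model
iter: the number of iterations (passes over the corpus)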
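The min_count pitfall is easy to reproduce without training anything. The sketch below only imitates the frequency filter with a Counter on a made-up three-sentence corpus; build_vocab and out_of_vocab are hypothetical helpers, not gensim functions.

```python
from collections import Counter


def build_vocab(sentences, min_count):
    """Return the set of words that occur at least min_count times."""
    counts = Counter(word for sentence in sentences for word in sentence)
    return {word for word, n in counts.items() if n >= min_count}


def out_of_vocab(sentence, vocab):
    """List the words of a sentence that would have no vector."""
    return [word for word in sentence if word not in vocab]


corpus = [
    ["我", "喜歡", "蘋果"],
    ["我", "喜歡", "香蕉"],
    ["我", "討厭", "香蕉"],
]
vocab = build_vocab(corpus, min_count=2)
print(sorted(vocab))
print(out_of_vocab(["我", "討厭", "蘋果"], vocab))  # prints: ['討厭', '蘋果']
```

With min_count=2, two of the three words in the last sentence have no vector, which is exactly the situation that breaks a sentence-similarity computation later on.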

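If you want to know more about the logging output, look up the logging module and the Logger class; you can also add any messages of your own!
Most of the rest is commented, so here is the code: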

import logging
import sys
import gensim.models as word2vec
from gensim.models.word2vec import LineSentence, logger


def train_word2vec(dataset_path, out_vector):
    # Configure log output
    logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)
    logger.info("running %s" % ' '.join(sys.argv))
    # Turn the corpus file into an iterable of sentences
    sentences = LineSentence(dataset_path)
    # Train the word2vec model
    # (size: vector dimensionality; window: maximum context distance;
    #  min_count: minimum frequency a word needs to get a vector;
    #  iter: number of passes over the corpus; sg=1 selects skip-gram)
    model = word2vec.Word2Vec(sentences, size=100, sg=1, window=5, min_count=5, workers=4, iter=5)
    # Save the full model (it can be reloaded later to continue training)
    model.save("word2vec.model")
    # Also export only the vectors, in word2vec text format
    model.wv.save_word2vec_format(out_vector, binary=False)


# Load vectors exported by save_word2vec_format
def load_word2vec_model(w2v_path):
    # binary=False matches the text format written in train_word2vec
    model = word2vec.KeyedVectors.load_word2vec_format(w2v_path, binary=False)
    return model


# Print the words most similar to a given word
def calculate_most_similar(model, word):
    similar_words = model.most_similar(word)
    print(word)
    for term in similar_words:
        print(term[0], term[1])


if __name__ == '__main__':
    dataset_path = "corpusSegDone.txt"
    out_vector = 'corpusSegDone.vector'
    train_word2vec(dataset_path, out_vector)
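It is worth knowing what kind of object the training call expects as its sentences. The corpus is scanned more than once (to build the vocabulary and then once per pass), so a one-shot generator is not enough; it has to be an iterable that can be restarted. Below is a simplified stand-in that shows the idea. SegmentedCorpus is a hypothetical name, and the real LineSentence handles more details, such as very long lines.

```python
class SegmentedCorpus:
    """Iterate over a file of space-separated tokens, one sentence per line."""

    def __init__(self, path, encoding="utf-8"):
        self.path = path
        self.encoding = encoding

    def __iter__(self):
        # Reopening the file on every call is what makes the corpus restartable
        with open(self.path, encoding=self.encoding) as handle:
            for line in handle:
                tokens = line.split()
                if tokens:  # skip blank lines
                    yield tokens


if __name__ == "__main__":
    import os

    if os.path.exists("corpusSegDone.txt"):
        for i, sentence in enumerate(SegmentedCorpus("corpusSegDone.txt")):
            print(sentence[:10])
            if i == 2:
                break
```

Because only one line is held in memory at a time, this also works for a corpus that is far larger than RAM.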

But I knew it would never be that simple. Here came my first warning:

UserWarning: C extension not loaded, training will be slow.

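I missed it at first, and then found that each epoch was processing only about 160 words/s, when the usual rate is several hundred thousand. I trained for a full 10 hours and got nothing out of it! It really was slow ::>_<::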
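One way to catch this before committing to a long run is to look at the FAST_VERSION flag in gensim.models.word2vec. To my knowledge it is -1 when the compiled routines failed to load and 0 or higher otherwise, but treat that as an assumption and check it against your installed version. describe_fast_version is a hypothetical helper for illustration.

```python
def describe_fast_version(fast_version):
    """Turn gensim's FAST_VERSION flag into a human-readable status."""
    if fast_version < 0:
        return "C extension missing: training will use the slow pure-Python path"
    return "C extension available: fast training routines are in use"


if __name__ == "__main__":
    try:
        # Assumed location of the flag; see the note above
        from gensim.models.word2vec import FAST_VERSION
    except ImportError:
        print("gensim is not installed in this environment")
    else:
        print(FAST_VERSION, describe_fast_version(FAST_VERSION))
```

Running this takes a second, which is a lot cheaper than discovering the problem ten hours into training.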

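After some digging, I learned that the C extension was missing. There are three ways to fix it:

Build it yourself, which requires Visual Studio (see this blogger's post).
Install with conda, which takes care of the C compilation automatically (see this blogger's post).
Uninstall the newer gensim package and install version 3.7.1. This is what I did, and it is the quickest.

# First open cmd and uninstall gensim

pip uninstall gensim

# Then install version 3.7.1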

pip install gensim==3.7.1
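A side effect of staying on 3.7.1 is that the parameter names used in this post remain valid. As far as I know, gensim 4.0 renamed size to vector_size and iter to epochs, so the same call raises an error on newer releases. Here is a hedged sketch of a helper (word2vec_kwargs is a hypothetical name) that picks the spelling from a version string:

```python
def word2vec_kwargs(gensim_version, dim=100, epochs=5):
    """Return Word2Vec keyword arguments named to suit a gensim version string."""
    major = int(gensim_version.split(".")[0])
    common = {"sg": 1, "window": 5, "min_count": 5, "workers": 4}
    if major >= 4:
        # Assumed rename in gensim 4.x: size -> vector_size, iter -> epochs
        return {**common, "vector_size": dim, "epochs": epochs}
    return {**common, "size": dim, "iter": epochs}


print(word2vec_kwargs("3.7.1"))
print(word2vec_kwargs("4.3.2"))
```

In a training script this would be used as Word2Vec(sentences, **word2vec_kwargs(gensim.__version__)), so the same code runs whichever version happens to be installed.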

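Run it again and that warning is gone, only to be replaced by a different one! Impressive O__O"

UserWarning: This function is deprecated, use smart_open.open instead. See the migration notes for details: (link)

Following the link shows that it is about smart_open, a tool for streaming large files in Python. Setting that aside for a moment, look at the processing speed now: it has reached about 300,000 words/s. In the words of one foreign programmer, " Now the program is running within seconds which made a drastic change in the execution time. "

Coming back to smart_open: it can do everything open() does, while cutting down on the code you write and the mistakes you make. Run pip install smart_open and use smart_open.open() when opening large files.

Training took 1 h 30 min and produced the model "word2vec.model".

Comment out the training code and run the code that loads the model: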

# Load the saved model
def load_word2vec_model(w2v_path):
    model = word2vec.Word2Vec.load(w2v_path)
    return model

model = load_word2vec_model("word2vec.model")  # load the model

Use the following code to find the words closest to "中國" (China):

def calculate_most_similar(model, word):
    similar_words = model.wv.most_similar(word)
    print(word)
    for term in similar_words:
        print(term[0], term[1])

model = load_word2vec_model("word2vec.model")
calculate_most_similar(model, "中國")

The results are below (they look decent):

And the words closest to 男人 (man) O_o
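The scores printed next to each word are cosine similarities between word vectors. To make that concrete, here is a minimal pure-Python sketch of the ranking. The three-dimensional vectors are made up for illustration, and the functions are my own, not gensim's implementation.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def most_similar(word, vectors, topn=3):
    """Rank every other word by cosine similarity to `word`, best first."""
    target = vectors[word]
    scored = [(other, cosine(target, vec))
              for other, vec in vectors.items() if other != word]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:topn]


# Made-up 3-dimensional vectors, for illustration only
toy = {
    "男人": [0.9, 0.1, 0.0],
    "女人": [0.8, 0.3, 0.0],
    "中國": [0.0, 0.2, 0.9],
}
for other, score in most_similar("男人", toy):
    print(other, round(score, 3))
```

A score close to 1 means the two vectors point in nearly the same direction, and a score near 0 means they are almost unrelated.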

This prints a word's vector:

print(model.wv['男人'])

To find the word that fits in least with the others:

# Find the odd word out
def find_word_dismatch(model, words):
    print(model.wv.doesnt_match(words))

words = ["早飯", "喫飯", "恰飯", "嘻哈"]
find_word_dismatch(model, words)
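As I understand it, the odd-one-out check averages the unit-length vectors of all the words and returns the word whose vector agrees least with that average direction. The sketch below follows that description with made-up two-dimensional vectors; it is an illustration under that assumption, not gensim's actual code.

```python
import math


def unit(v):
    """Scale a vector to length 1."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def doesnt_match(words, vectors):
    """Return the word whose vector is least aligned with the group mean."""
    unit_vecs = {w: unit(vectors[w]) for w in words}
    dim = len(unit_vecs[words[0]])
    mean = unit([sum(vec[i] for vec in unit_vecs.values()) / len(words)
                 for i in range(dim)])
    # Dot product with the mean direction; the smallest value is the outlier
    return min(words, key=lambda w: sum(a * b for a, b in zip(unit_vecs[w], mean)))


# Made-up 2-dimensional vectors, for illustration only
toy = {
    "早飯": [0.9, 0.1],
    "喫飯": [0.8, 0.2],
    "恰飯": [0.85, 0.15],
    "嘻哈": [0.1, 0.9],
}
print(doesnt_match(["早飯", "喫飯", "恰飯", "嘻哈"], toy))  # prints: 嘻哈
```

Three of the toy vectors point the same way, so the mean follows them and the fourth word stands out.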

Complete code:

import logging
import sys
import gensim.models as word2vec
from gensim.models.word2vec import LineSentence, logger
# import smart_open


def train_word2vec(dataset_path, out_vector):
    # Configure log output
    logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)
    logger.info("running %s" % ' '.join(sys.argv))
    # Turn the corpus file into an iterable of sentences
    sentences = LineSentence(dataset_path)
    # sentences = LineSentence(smart_open.open(dataset_path, encoding='utf-8'))  # or open it with smart_open
    # Train the word2vec model
    # (size: vector dimensionality; window: maximum context distance;
    #  min_count: minimum frequency a word needs to get a vector;
    #  iter: number of passes over the corpus; sg=1 selects skip-gram)
    model = word2vec.Word2Vec(sentences, size=100, sg=1, window=5, min_count=5, workers=4, iter=5)
    # Save the full model
    model.save("word2vec.model")
    # Also export only the vectors, in word2vec text format
    model.wv.save_word2vec_format(out_vector, binary=False)


# Load the saved model
def load_word2vec_model(w2v_path):
    model = word2vec.Word2Vec.load(w2v_path)
    return model


# Print the words most similar to a given word
def calculate_most_similar(model, word):
    similar_words = model.wv.most_similar(word)
    print(word)
    for term in similar_words:
        print(term[0], term[1])


# Print the similarity between two words
def calculate_words_similar(model, word1, word2):
    print(model.wv.similarity(word1, word2))


# Find the odd word out
def find_word_dismatch(model, words):
    print(model.wv.doesnt_match(words))


if __name__ == '__main__':
    dataset_path = "corpusSegDone.txt"
    out_vector = 'corpusSegDone.vector'
    train_word2vec(dataset_path, out_vector)  # train the model
    model = load_word2vec_model("word2vec.model")  # load the model

    # calculate_most_similar(model, "喫飯")  # nearest words

    # calculate_words_similar(model, "男人", "女人")  # similarity of two words

    # print(model.wv['男人'])  # word vector

    # words = ["早飯", "喫飯", "恰飯", "嘻哈"]

    # find_word_dismatch(model, words)
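The .vector file written with binary=False is plain text: a header line with the vocabulary size and the vector dimensionality, then one line per word with the word followed by its numbers. That makes it easy to read without gensim. A minimal sketch follows; read_word2vec_text is a hypothetical helper and assumes space-separated UTF-8 text.

```python
def read_word2vec_text(path, encoding="utf-8"):
    """Load a text-format word2vec file into a {word: [floats]} dict."""
    vectors = {}
    with open(path, encoding=encoding) as handle:
        # Header line: "<vocab_size> <dimensions>"
        vocab_size, dim = map(int, handle.readline().split())
        for line in handle:
            parts = line.rstrip().split(" ")
            if len(parts) != dim + 1:
                continue  # skip blank or malformed lines
            vectors[parts[0]] = [float(x) for x in parts[1:]]
    if len(vectors) != vocab_size:
        raise ValueError(f"expected {vocab_size} words, found {len(vectors)}")
    return vectors


if __name__ == "__main__":
    import os

    if os.path.exists("corpusSegDone.vector"):
        vecs = read_word2vec_text("corpusSegDone.vector")
        print(len(vecs))
```

This loads everything into memory as Python lists, so it suits a quick inspection more than heavy computation.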

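\^o^/ All done at last! If you have any questions, leave a comment and I will do my best to answer! └(^o^)┘ If I got anything wrong, corrections are very welcome!

Feel free to continue with the next post: 用維基百科語料庫訓練中文詞向量 (Training Chinese word vectors on the Wikipedia corpus)

Final notes

Quite a few readers have asked me for the trained model, so I have put it on Baidu Netdisk (download link, extraction code: 7w97).

Writing is hard and training is harder, so I'd appreciate a like!

References

Thanks again to the authors of these three articles, which rescued a beginner like me!

References: 搜狗語料庫word2vec獲取詞向量 (Getting word vectors from the Sogou corpus with word2vec)
 自然語言處理入門(一)------搜狗新聞語料處理和word2vec詞向量的訓練 (Introduction to NLP (1): processing the Sogou news corpus and training word2vec vectors)
 word2vec使用方法小結 (A short summary of word2vec usage)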

