Text Vectorization
Create a working directory with a data folder inside it.
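For example (the top-level directory name here is only an illustration):
mkdir -p zhwiki_word2vec/data
cd zhwiki_word2vec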
Install the dependencies. Besides gensim, the preprocessing script below imports jieba for Chinese word segmentation; it also imports langconv for Traditional-to-Simplified conversion, which is a standalone langconv.py (usually together with its zh_wiki.py data file, commonly taken from the open-source zhtools project) placed next to the script.
pip install gensim jieba
Download the dataset. The Chinese Wikipedia dump is roughly 1.2 GB; once downloaded, place it in the data folder.
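The dump can be fetched from the Wikimedia dumps site; the URL below follows the standard latest-dump naming and is worth verifying before downloading:
wget -P ./data/ https://dumps.wikimedia.org/zhwiki/latest/zhwiki-latest-pages-articles.xml.bz2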
Data Preprocessing
Create the preprocessing file data_pre_process.py:
# -*- coding: utf-8 -*-
from gensim.corpora import WikiCorpus
import jieba
from langconv import *  # Traditional-to-Simplified converter; langconv.py must sit next to this script


def my_function():
    space = ' '
    i = 0
    l = []
    zhwiki_name = './data/zhwiki-latest-pages-articles.xml.bz2'
    f = open('./data/reduce_zhiwiki.txt', 'w')
    # note: the lemmatize parameter was removed in gensim 4.x; this script targets gensim 3.x
    wiki = WikiCorpus(zhwiki_name, lemmatize=False, dictionary={})
    for text in wiki.get_texts():
        for temp_sentence in text:
            # convert Traditional Chinese to Simplified, then segment with jieba
            temp_sentence = Converter('zh-hans').convert(temp_sentence)
            seg_list = list(jieba.cut(temp_sentence))
            for temp_term in seg_list:
                l.append(temp_term)
        # write one space-separated article per line
        f.write(space.join(l) + '\n')
        l = []
        i = i + 1
        if i % 200 == 0:
            print('Saved ' + str(i) + ' articles')
    f.close()


if __name__ == '__main__':
    my_function()
Run the preprocessing script:
python3 data_pre_process.py
Because the corpus is fairly large, this step can take a long time; be patient. When it finishes, a reduce_zhiwiki.txt file will be produced under the data folder.
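To sanity-check the output, you can peek at the first preprocessed article (a minimal sketch; the file path matches the script above):
# print the first 50 tokens of the first segmented article
with open('./data/reduce_zhiwiki.txt', 'r') as f:
    first_line = f.readline()
print(first_line.split(' ')[:50])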
Training
Create the training file training.py:
# -*- coding: utf-8 -*-
from gensim.models import Word2Vec
from gensim.models.word2vec import LineSentence
import logging

logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)


def my_function():
    wiki_news = open('./data/reduce_zhiwiki.txt', 'r')
    # sg=0 selects CBOW; size is the vector dimensionality
    # (in gensim 4.x this parameter was renamed to vector_size)
    model = Word2Vec(LineSentence(wiki_news), sg=0, size=192, window=5, min_count=5, workers=9)
    model.save('zhiwiki_news.word2vec')


if __name__ == '__main__':
    my_function()
Start training:
python3 training.py
Training may also take a while; be patient. When it finishes, move the generated files whose names start with zhiwiki_ into the data folder.
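gensim typically writes the main zhiwiki_news.word2vec file plus companion .npy array files for larger models, so a wildcard move covers them all (a sketch, assuming the default filenames from training.py):
mv zhiwiki_news.word2vec* ./data/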
Testing
Create the test file test.py:
# coding=utf-8
import gensim


def my_function():
    model = gensim.models.Word2Vec.load('./data/zhiwiki_news.word2vec')
    print(model.similarity('西紅柿', '番茄'))  # similarity is about 0.63
    print(model.similarity('西紅柿', '香蕉'))  # similarity is about 0.44
    word = '中國'
    if word in model.wv.index2word:
        print(model.most_similar(word))


if __name__ == '__main__':
    my_function()
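A note on versions: model.similarity, model.most_similar, and model.wv.index2word are the gensim 3.x API. On gensim 4.x the equivalents live on model.wv; a sketch of the changed lines, under that assumption:
print(model.wv.similarity('西紅柿', '番茄'))
print(model.wv.similarity('西紅柿', '香蕉'))
word = '中國'
if word in model.wv.key_to_index:  # index2word was replaced by key_to_index / index_to_key in gensim 4.x
    print(model.wv.most_similar(word))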
Run the test:
python3 test.py
The output should show the two similarity scores (roughly the values noted in the code comments) and, if 中國 is in the vocabulary, a list of its most similar words with their scores.