Introduction to the Word2Vec class
Definition
def __init__(self, sentences=None, size=100, alpha=0.025, window=5, min_count=5,
             max_vocab_size=None, sample=1e-3, seed=1, workers=3, min_alpha=0.0001,
             sg=0, hs=0, negative=5, cbow_mean=1, hashfxn=hash, iter=5, null_word=0,
             trim_rule=None, sorted_vocab=1, batch_words=MAX_WORDS_IN_BATCH,
             compute_loss=False, callbacks=()):
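Common parameters
sentences: the training corpus, an iterable of tokenized sentences (each sentence is a list of words); it can be a plain list, or be built with BrownCorpus, Text8Corpus, or LineSentence
size: dimensionality of the word vectors, default 100
window: maximum distance between the current word and the predicted word within a sentence, default 5
min_count: used when building the vocabulary; words that occur fewer than min_count times are discarded, default 5
workers: number of worker threads used for training, default 3
sg: training algorithm; the default 0 selects CBOW, and 1 selects skip-gram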
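To make window and min_count concrete, here is a minimal pure-Python sketch of what the two parameters control. It is not gensim's implementation: the helper names build_vocab and context_pairs and the toy corpus are hypothetical, and gensim additionally samples a smaller effective window at each position, which this sketch leaves out.

```python
from collections import Counter


def build_vocab(sentences, min_count):
    """Keep only words that occur at least min_count times in the corpus."""
    counts = Counter(word for sentence in sentences for word in sentence)
    return {word for word, freq in counts.items() if freq >= min_count}


def context_pairs(sentence, vocab, window):
    """Yield (center, context) pairs at most `window` positions apart."""
    # Out-of-vocabulary words are dropped before the window is applied.
    kept = [word for word in sentence if word in vocab]
    for i, center in enumerate(kept):
        lo = max(0, i - window)
        hi = min(len(kept), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                yield center, kept[j]


# Hypothetical toy corpus: two tokenized sentences.
toy_corpus = [
    ["the", "cat", "sat", "on", "the", "mat"],
    ["the", "dog", "sat", "on", "the", "rug"],
]
vocab = build_vocab(toy_corpus, min_count=2)
pairs = list(context_pairs(toy_corpus[0], vocab, window=1))
print(sorted(vocab))
print(pairs)
```

With min_count=2, the words that appear only once (cat, mat, dog, rug) never enter the vocabulary, so they produce no training pairs at all. A larger window would pair each word with more distant neighbours.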
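Training method 1:
import multiprocessing
import gensim
from gensim.models.word2vec import LineSentence
# model_dir is assumed to be defined earlier and to end with a path separator
dim = 300
embedding_size = dim
# LineSentence streams the corpus: one sentence per line, tokens separated by whitespace
model = gensim.models.Word2Vec(LineSentence(model_dir + 'train_word.txt'),
                               size=embedding_size,  # renamed to vector_size in gensim 4.x
                               window=5,
                               min_count=10,
                               workers=multiprocessing.cpu_count())
# Save the full model so it can be reloaded later
model.save(model_dir + "word2vec_gensim" + str(embedding_size) + ".w2v")
# Save only the word vectors, in word2vec text format
model.wv.save_word2vec_format(model_dir + "word2vec_gensim_300d.txt", binary=False)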
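The file written by save_word2vec_format(..., binary=False) is plain text: a header line with the vocabulary size and the vector dimension, then one line per word with the word followed by its vector components. In practice gensim's KeyedVectors.load_word2vec_format reads such a file back. The sketch below only illustrates the layout; parse_word2vec_text and the sample content are hypothetical.

```python
import io


def parse_word2vec_text(stream):
    """Parse word2vec text format into a {word: [float, ...]} dict."""
    header = stream.readline().split()
    vocab_size, dim = int(header[0]), int(header[1])
    vectors = {}
    for line in stream:
        if not line.strip():
            continue  # tolerate blank lines
        parts = line.rstrip().split(" ")
        if len(parts) != dim + 1:
            raise ValueError(f"expected {dim} components, got {len(parts) - 1}")
        vectors[parts[0]] = [float(x) for x in parts[1:]]
    if len(vectors) != vocab_size:
        raise ValueError("header vocabulary size does not match the body")
    return vectors


# Hypothetical file content: 2 words, 3 dimensions each.
sample = io.StringIO("2 3\nking 0.1 0.2 0.3\nqueen 0.4 0.5 0.6\n")
vectors = parse_word2vec_text(sample)
print(vectors["queen"])
```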
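Training method 2:
import gensim
from gensim.models.word2vec import LineSentence
# Read the whole corpus into memory as a list of token lists
documents = list(LineSentence(model_dir + 'train_word.txt'))
print(len(documents))
print(documents[:10])
# Passing documents to the constructor builds the vocabulary and already trains for the default iter=5 epochs
model = gensim.models.Word2Vec(documents, size=300)
# Continue training on the same corpus for 10 more epochs
model.train(documents, total_examples=len(documents), epochs=10)
model.save("./input/word2vec.w2v")
model.wv.save_word2vec_format("./input/word_gensim_300d.txt", binary=False)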
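Comparison of the two methods
In practice, training method 2 is recommended, because loading the resulting w2v vectors requires less memory with the second approach.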
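One further difference between the two methods is how the training corpus is held: method 1 passes a LineSentence object, which reads the file lazily, while method 2 first materializes every sentence with list(...). The sketch below mimics that distinction in plain Python. StreamingSentences is a hypothetical stand-in, not gensim's LineSentence, and the temporary corpus file is made up.

```python
import os
import tempfile


class StreamingSentences:
    """Restartable iterable yielding one whitespace-tokenized line at a time."""

    def __init__(self, path):
        self.path = path

    def __iter__(self):
        # Reopen the file on every pass so the corpus can be iterated repeatedly.
        with open(self.path, encoding="utf-8") as handle:
            for line in handle:
                tokens = line.split()
                if tokens:
                    yield tokens


# Write a tiny hypothetical corpus, one sentence per line.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False,
                                 encoding="utf-8") as tmp:
    tmp.write("word vectors are dense\nthey capture context\n")
    corpus_path = tmp.name

streamed = StreamingSentences(corpus_path)  # method 1 style: read lazily
materialized = list(streamed)               # method 2 style: everything in memory
first_pass = [len(s) for s in streamed]
second_pass = [len(s) for s in streamed]
os.remove(corpus_path)
print(materialized)
```

The iterable has to be restartable because training makes several passes over the corpus; a plain generator would be exhausted after the first pass.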