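Data
- Data used by the original source code: 20_newsgroups.txt, a few tens of MB in size.
- Start of the file: a first line containing only texts, followed by a line break; it serves as the key.
- My own attempt with Japan.txt: succeeded.
- My own attempt with China.txt: failed. (The skipgrams.txt generated by load_20newsgroups.py was empty, so run_20newsgroups.py raised an error.)
- Summary of my attempts: text pasted directly from a table into the .txt file runs. Text pasted from another .txt file or from Word into the .txt file does not run.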
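The observations above suggest that the scripts are sensitive to how the .txt file is produced. Below is a minimal sketch for writing and sanity-checking a corpus file in the layout described (a first line `texts`, then one document per line). The helper names `write_corpus` and `check_corpus` are hypothetical and not part of Lda2vec-Tensorflow. The idea that a byte-order mark, a non-UTF-8 encoding, or embedded line breaks could break a pasted file is an assumption, not a confirmed cause of the China.txt failure.

```python
import os
import tempfile

HEADER = "texts"  # first line of the file, used as the key


def write_corpus(path, docs):
    """Write one document per line under a `texts` header, as UTF-8 without BOM."""
    with open(path, "w", encoding="utf-8", newline="\n") as f:
        f.write(HEADER + "\n")
        for doc in docs:
            # Collapse tabs and line breaks so each document stays on one line.
            f.write(" ".join(doc.split()) + "\n")


def check_corpus(path):
    """Return (ok, message, n_docs) for a corpus file."""
    with open(path, "rb") as f:
        raw = f.read()
    if raw.startswith(b"\xef\xbb\xbf"):
        return False, "file starts with a UTF-8 byte-order mark", 0
    try:
        text = raw.decode("utf-8")
    except UnicodeDecodeError as exc:
        return False, f"file is not valid UTF-8: {exc}", 0
    lines = text.splitlines()
    if not lines or lines[0].strip() != HEADER:
        return False, f"first line must be {HEADER!r}", 0
    n_docs = sum(1 for line in lines[1:] if line.strip())
    if n_docs == 0:
        return False, "no documents after the header", 0
    return True, "ok", n_docs


if __name__ == "__main__":
    demo = os.path.join(tempfile.mkdtemp(), "demo.txt")
    write_corpus(demo, ["first   document\nwith a line break", "second document"])
    print(check_corpus(demo))
```

Running `check_corpus` on a file that fails (such as China.txt in the notes above) before running load_20newsgroups.py may narrow down whether the problem is the header, the encoding, or empty content.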
load_20newsgroups.py
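- Run load_20newsgroups.py first to generate the required files.
- Uses a pretrained model: GloVe (here, the 300-dimensional pretrained word vectors).
- File-reading error: OSError: Initializing from file failed. Cause: the file name or folder path contains Chinese characters.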
- Encoding error, raised at the line embedding_matrix = P.load_glove("D:\glove.6B\glove.6B.300d.txt"). Fix: edit the corresponding place in Lda2vec-Tensorflow-master\lda2vec\nlppipe.py:
# inside def load_glove(self, EMBEDDING_FILE) and def load_para(self, EMBEDDING_FILE): open the file with an explicit encoding
embeddings_index = dict(get_coefs(*o.split(" ")) for o in open(EMBEDDING_FILE, encoding='UTF-8'))
# Call site in load_20newsgroups.py:
# Load embeddings from file if we choose to do so
if load_embeds:
    # Load embedding matrix from file path - change path to where you saved them
    # _*_ coding:utf-8 _*_
    # Raw string, so the backslashes in the Windows path are taken literally
    embedding_matrix = P.load_glove(r"D:\glove.6B\glove.6B.300d.txt")
else:
    embedding_matrix = None
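The two fixes above (an ASCII-only path and an explicit file encoding) can be sketched in a standalone loader. This is a minimal sketch, not the code in nlppipe.py. The names `has_non_ascii` and `load_glove_utf8` are hypothetical, vectors are kept as plain Python lists rather than NumPy arrays to avoid dependencies, and the file is assumed to use the usual GloVe text layout (a token followed by space-separated floats on each line).

```python
import os
import tempfile


def has_non_ascii(path):
    """True if the path contains characters outside ASCII (e.g. Chinese)."""
    return any(ord(ch) > 127 for ch in path)


def load_glove_utf8(path):
    """Read a GloVe-style text file into {word: [float, ...]}."""
    embeddings = {}
    # An explicit encoding avoids depending on the platform default
    # (for example a legacy code page on Windows).
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip().split(" ")
            if len(parts) < 2:
                continue  # skip blank or malformed lines
            embeddings[parts[0]] = [float(x) for x in parts[1:]]
    return embeddings


if __name__ == "__main__":
    # Tiny stand-in for glove.6B.300d.txt, with 3 dimensions instead of 300.
    demo = os.path.join(tempfile.mkdtemp(), "glove.demo.txt")
    with open(demo, "w", encoding="utf-8") as f:
        f.write("the 0.1 0.2 0.3\ncafé 0.4 0.5 0.6\n")
    print(has_non_ascii(demo), load_glove_utf8(demo)["café"])
```

Checking `has_non_ascii` on the data and GloVe paths before running the script is a cheap way to rule out the OSError described above.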
run_20newsgroups.py
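- Uses the output files of load_20newsgroups.py.
- Hyperparameters can be tuned as needed.
- pyLDAvis fails with "Vocabulary size did not match size of word vectors". The failing line (the last line of the script, the visualization step) is: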
# Visualize topics with pyldavis
utils.generate_ldavis_data(data_path, m, idx_to_word, freqs, vocab_size)
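The error message indicates that the vocabulary and the word-vector matrix disagree in size. Below is a minimal sketch of a size report that could be printed just before the visualization call. It is only a diagnostic aid under assumptions: `size_report` is a hypothetical helper, `n_vectors` stands for the number of rows in the trained word-embedding matrix (how to obtain it from `m` is not shown here), and the exact comparison performed inside `utils.generate_ldavis_data` is not reproduced. A nonzero difference is not necessarily a bug, since some indices may be reserved.

```python
def size_report(vocab_size, n_vectors, idx_to_word, freqs):
    """Map each quantity to (size, size - vocab_size) so mismatches stand out."""
    sizes = {
        "vocab_size": vocab_size,
        "word_vectors": n_vectors,
        "idx_to_word": len(idx_to_word),
        "freqs": len(freqs),
    }
    return {name: (size, size - vocab_size) for name, size in sizes.items()}


if __name__ == "__main__":
    # Toy values only; in practice pass the objects used in run_20newsgroups.py.
    report = size_report(4, 5, {0: "a", 1: "b", 2: "c", 3: "d"}, [10, 7, 3, 1])
    for name, (size, diff) in report.items():
        print(f"{name:12s} size={size} diff={diff:+d}")
```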