Data
- Data used by the source code: 20_newsgroups.txt, a few tens of MB in size.
- File header: the file begins with a `texts` line followed by a newline, which serves as the key (column header).
- Personal test with Japan.txt: succeeded.
- Personal test with China.txt: failed (the skipgrams.txt generated by load_20newsgroups.py was empty, so run_20newsgroups.py raised an error).
- Test summary: text pasted directly from a spreadsheet into a .txt file runs fine, but text pasted from another .txt file or a Word document into a .txt file does not.
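Based on the observations above, a minimal sketch for producing the input file with explicit UTF-8 encoding and checking it before training. The `texts` header line and one-document-per-line layout are assumptions inferred from these notes, and `write_corpus` / `check_corpus` are hypothetical helper names, not part of the Lda2vec-Tensorflow code:

```python
# Sketch: write a 20_newsgroups.txt-style corpus file with explicit
# UTF-8 encoding, then verify it is non-empty before training.
def write_corpus(path, documents):
    with open(path, "w", encoding="utf-8", newline="\n") as f:
        f.write("texts\n")                       # header line used as the key
        for doc in documents:
            # keep one document per line
            f.write(doc.replace("\n", " ").strip() + "\n")

def check_corpus(path):
    with open(path, encoding="utf-8") as f:
        lines = [ln for ln in f if ln.strip()]
    assert lines and lines[0].strip() == "texts", "missing 'texts' header"
    assert len(lines) > 1, "no documents - skipgrams.txt would come out empty"
    return len(lines) - 1                        # number of documents
```

An empty or malformed corpus is the likely reason the China.txt attempt produced an empty skipgrams.txt, so a check like this fails fast instead of erroring later in run_20newsgroups.py.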
load_20newsgroups.py
- Run load_20newsgroups.py first; it generates the files the training script needs.
- It uses a pretrained model: GloVe (here the 300-dimensional pretrained word vectors, glove.6B.300d.txt).
- File-reading error: OSError: Initializing from file failed. Cause: the file name or directory path contains Chinese characters.
- Encoding error on the line embedding_matrix = P.load_glove("D:\glove.6B\glove.6B.300d.txt"). Fix: in Lda2vec-Tensorflow-master\lda2vec\nlppipe.py, change the corresponding line:
# In nlppipe.py, under both def load_glove(self, EMBEDDING_FILE) and
# def load_para(self, EMBEDDING_FILE), add the encoding argument:
embeddings_index = dict(get_coefs(*o.split(" ")) for o in open(EMBEDDING_FILE, encoding='utf-8'))
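To make the patched pattern concrete, here is a self-contained sketch of that loading logic. Each line of glove.6B.300d.txt is `<word> <v1> ... <v300>`; `load_glove_embeddings` is a hypothetical stand-in for `P.load_glove`, not the library's actual function:

```python
import numpy as np

def get_coefs(word, *arr):
    # split a GloVe line into (word, float32 vector)
    return word, np.asarray(arr, dtype="float32")

def load_glove_embeddings(embedding_file):
    # encoding='utf-8' is the fix: without it, Python falls back to the
    # system locale codec (e.g. GBK on a Chinese-language Windows
    # install), which cannot decode the GloVe file.
    with open(embedding_file, encoding="utf-8") as f:
        return dict(get_coefs(*line.rstrip().split(" ")) for line in f)
```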
# Load embeddings from file if we choose to do so
if load_embeds:
    # Load embedding matrix from file path - change path to where you saved them.
    # A raw string keeps the Windows backslashes literal; the
    # "# _*_ coding:utf-8 _*_" declaration, if used, belongs at the
    # very top of the file, not inside this block.
    embedding_matrix = P.load_glove(r"D:\glove.6B\glove.6B.300d.txt")
else:
    embedding_matrix = None
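A quick demonstration of why the raw string matters for Windows paths: in an ordinary string, `\t` is interpreted as a TAB escape, so a path like `"D:\temp\..."` is silently corrupted; `"D:\glove.6B\..."` only survives because `\g` happens not to be a recognized escape sequence.

```python
# "\t" is a TAB escape in an ordinary string literal...
broken = "D:\temp"        # actually D:<TAB>emp - 6 characters
# ...but a raw string keeps every backslash literally.
safe = r"D:\temp"         # D:\temp - 7 characters

assert "\t" in broken and len(broken) == 6
assert "\\" in safe and len(safe) == 7
```

Forward slashes (`"D:/glove.6B/glove.6B.300d.txt"`) also work on Windows and avoid the problem entirely.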
run_20newsgroups.py
- Uses the files produced by load_20newsgroups.py.
- Hyperparameters can be tuned freely.
- pyLDAvis fails with "Vocabulary size did not match size of word vectors". The failing line is the last one in the script (the visualization step):
# Visualize topics with pyldavis
utils.generate_ldavis_data(data_path, m, idx_to_word, freqs, vocab_size)
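That error indicates the vocabulary and the word-vector matrix passed to the visualization disagree in length. A hedged pre-flight sketch, where `idx_to_word` and `word_vectors` stand in for the objects the script hands to `utils.generate_ldavis_data` (the exact internals of that function are not shown in these notes):

```python
import numpy as np

def check_vocab_alignment(idx_to_word, word_vectors):
    # pyLDAvis-style requirement: one vector row per vocabulary entry
    vocab_size = len(idx_to_word)
    if word_vectors.shape[0] != vocab_size:
        raise ValueError(
            "Vocabulary size did not match size of word vectors: "
            f"{vocab_size} words vs {word_vectors.shape[0]} vectors")
    return vocab_size
```

Running a check like this before the visualization step points directly at the mismatch (e.g. special tokens counted in one structure but not the other) instead of failing inside pyLDAvis.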