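LDA in Python - How to grid search best topic models?

Translated from the original English article of the same title.
Python's scikit-learn provides a convenient interface for topic modeling using algorithms such as Latent Dirichlet Allocation (LDA), LSI, and Non-Negative Matrix Factorization. In this tutorial, you will learn how to build the best possible LDA topic model and explore how to present the outputs as meaningful results.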
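1、Introduction
In the previous tutorial, you learned how to build topic models with LDA using gensim. In this tutorial, however, I am going to use Python's most popular machine learning library: scikit-learn.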

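With scikit-learn, you have an entirely different interface. With grid search and vectorizers, you have a lot of options to explore in order to find the optimal model and to present the results.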

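In this tutorial, you will learn:

How to clean and process text data?
How to prepare the text documents to build topic models with scikit-learn?
How to build a basic topic model using LDA and understand the params?
How to extract the topic's keywords?
How to grid search and tune for the optimal model?
How to get the dominant topics in each document?
Review and visualize the topic keywords distribution
How to predict the topics for a new piece of text?
Cluster the documents based on topic distribution
How to get the most similar documents based on the topics discussed?
There is a lot of exciting stuff ahead. Let's roll!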
2、Load the packages
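The core package used in this tutorial is scikit-learn (sklearn).

Regular expressions (re), gensim, and spacy are used to process the text; pyLDAvis and matplotlib are used for visualization; numpy and pandas are used to manipulate and view data in tabular format.

Let's import them.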

# Run in terminal or command prompt
# python3 -m spacy download en_core_web_sm

import numpy as np
import pandas as pd
import re, spacy, gensim

# Sklearn
from sklearn.decomposition import LatentDirichletAllocation, TruncatedSVD
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from pprint import pprint

# Plotting tools
import pyLDAvis
import pyLDAvis.sklearn  # this module is named pyLDAvis.lda_model in pyLDAvis 3.4 and later
import matplotlib.pyplot as plt
%matplotlib inline

3、Import Newsgroups Text Data
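I will be using the 20-Newsgroups dataset. This version of the dataset contains about 11k newsgroup posts from 20 different topics. It is available as newsgroups.json.

Since it is in JSON format with a consistent structure, I use pandas.read_json(). The resulting dataset has 3 columns, as shown.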

# Import Dataset
df = pd.read_json('https://raw.githubusercontent.com/selva86/datasets/master/newsgroups.json')
print(df.target_names.unique())

['rec.autos' 'comp.sys.mac.hardware' 'rec.motorcycles' 'misc.forsale'
 'comp.os.ms-windows.misc' 'alt.atheism' 'comp.graphics'
 'rec.sport.baseball' 'rec.sport.hockey' 'sci.electronics' 'sci.space'
 'talk.politics.misc' 'sci.med' 'talk.politics.mideast'
 'soc.religion.christian' 'comp.windows.x' 'comp.sys.ibm.pc.hardware'
 'talk.politics.guns' 'talk.religion.misc' 'sci.crypt']

df.head(15)

4、Remove emails and newline characters

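As you can see, the text contains many emails, newline characters, and extra spaces, which is quite distracting. Let's get rid of them using regular expressions.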

# Convert to list
data = df.content.values.tolist()

# Remove Emails (raw strings avoid invalid-escape warnings for \S and \s)
data = [re.sub(r'\S*@\S*\s?', '', sent) for sent in data]

# Remove new line characters
data = [re.sub(r'\s+', ' ', sent) for sent in data]

# Remove distracting single quotes
data = [re.sub(r"'", "", sent) for sent in data]

pprint(data[:1])

['From: (wheres my thing) Subject: WHAT car is this!? Nntp-Posting-Host: '
 'rac3.wam.umd.edu Organization: University of Maryland, College Park Lines: '
 '15 I was wondering if anyone out there could enlighten me on this car I saw '
 'the other day. It was a 2-door sports car, looked to be from the late 60s/ '
 'early 70s. It was called a Bricklin. The doors were really small. In '
 'addition, the front bumper was separate from the rest of the body. This is '
 'all I know. If anyone can tellme a model name, engine specs, years of '
 'production, where this car is made, history, or whatever info you have on '
 'this funky looking car, please e-mail. Thanks, - IL ---- brought to you by '
 'your neighborhood Lerxst ---- ']
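
The three substitutions above can be tried on a small made-up message to see what each one does. This is a minimal sketch: the sample text and the clean_text helper name are illustrative assumptions and not part of the original pipeline.

```python
import re

def clean_text(text):
    """Apply the same three regex substitutions used above, in the same order."""
    text = re.sub(r'\S*@\S*\s?', '', text)  # drop any token containing '@' (email addresses)
    text = re.sub(r'\s+', ' ', text)        # collapse newlines and runs of whitespace
    text = re.sub(r"'", '', text)           # drop single quotes
    return text

# Hypothetical sample, not taken from the dataset
sample = "From: john@example.com\nSubject: What's  this car?\n\nIt's a 2-door."
cleaned = clean_text(sample)
print(cleaned)
```

Note that the order matters: the email pattern also consumes one trailing whitespace character, and the whitespace collapse then tidies up whatever is left.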

5、Tokenize and Clean-up using gensim’s simple_preprocess()
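The sentences look better now, but you still want to tokenize each sentence into a list of words, removing punctuation and unnecessary characters altogether.

Gensim's simple_preprocess() is great for this. Its tokenizer already discards punctuation; additionally, I have set deacc=True to strip accent marks from the tokens.

def sent_to_words(sentences):
    for sentence in sentences:
        yield(gensim.utils.simple_preprocess(str(sentence), deacc=True))  # deacc=True strips accent marks; the tokenizer drops punctuation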

data_words = list(sent_to_words(data))

print(data_words[:1])

[['from', 'wheres', 'my', 'thing', 'subject', 'what', 'car', 'is', 'this', 'nntp', 'posting', 'host', 'rac', 'wam', 'umd', 'edu', 'organization', 'university', 'of', 'maryland', 'college', 'park', 'lines', 'was', 'wondering', 'if', 'anyone', 'out', 'there', 'could', 'enlighten', 'me', 'on', 'this', 'car', 'saw', 'the', 'other', 'day', (..truncated..)]]

6、Lemmatization
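Lemmatization is the process of converting a word to its root form.

For example: 'Studying' becomes 'Study', 'Meeting' becomes 'Meet', 'Better' and 'Best' become 'Good'.

The advantage of this is that we reduce the total number of unique words in the dictionary. As a result, the document-word matrix (created by CountVectorizer in the next step) will have fewer columns and be denser.

You can expect better topics to be generated in the end.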

def lemmatization(texts, allowed_postags=['NOUN', 'ADJ', 'VERB', 'ADV']):
    """https://spacy.io/api/annotation"""
    texts_out = []
    for sent in texts:
        doc = nlp(" ".join(sent)) 
        texts_out.append(" ".join([token.lemma_ if token.lemma_ not in ['-PRON-'] else '' for token in doc if token.pos_ in allowed_postags]))
    return texts_out

# Initialize spacy English model, keeping only tagger component (for efficiency)
# Run in terminal: python3 -m spacy download en_core_web_sm
nlp = spacy.load('en_core_web_sm', disable=['parser', 'ner'])

# Do lemmatization keeping only Noun, Adj, Verb, Adverb
data_lemmatized = lemmatization(data_words, allowed_postags=['NOUN', 'ADJ', 'VERB', 'ADV'])

print(data_lemmatized[:2])

['where s  thing subject what car be nntp post host rac wam umd edu organization university maryland college park line be wonder anyone out there could enlighten car see other days be door sport car look be late early be call bricklin door be really small addition front bumper be separate rest body be know anyone can tellme model name engine spec year production where car be make history whatev info have funky look car mail thank bring  neighborhood lerxst' (..truncated..)]

7、Create the Document-Word matrix
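The LDA topic model algorithm requires a document-word matrix as its main input.

You can create one using CountVectorizer. In the code below, I have configured the CountVectorizer to keep only words that appear in at least 10 documents (min_df), remove the built-in English stop words, convert all words to lowercase, and require a token to consist of at least 3 letters or digits to qualify as a word.

So, to create the doc-word matrix, you first initialize the CountVectorizer class with the required configuration and then apply fit_transform to actually create the matrix.

Since most cells contain zeros, the result will be in the form of a sparse matrix to save memory.

If you want to materialize it as a 2D array, call the todense() method of the sparse matrix, as is done in the next step.

vectorizer = CountVectorizer(analyzer='word',       
                             min_df=10,                        # keep words appearing in at least 10 documents
                             stop_words='english',             # remove stop words
                             lowercase=True,                   # convert all words to lowercase
                             token_pattern='[a-zA-Z0-9]{3,}',  # tokens of 3 or more letters/digits
                             # max_features=50000,             # max number of uniq words
                            )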

data_vectorized = vectorizer.fit_transform(data_lemmatized)
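
To see how min_df, stop_words, and token_pattern interact, here is a small sketch on three made-up sentences. The toy documents and the lower min_df=2 are assumptions chosen so the effect is visible on such a tiny corpus; the other settings mirror the configuration above.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus, not taken from the newsgroups data
docs = [
    "the car engine is loud",
    "a car with a big engine",
    "go to the car race",
]

toy_vectorizer = CountVectorizer(analyzer='word',
                                 min_df=2,                        # keep words found in at least 2 documents
                                 stop_words='english',            # remove stop words such as 'the', 'with'
                                 lowercase=True,
                                 token_pattern='[a-zA-Z0-9]{3,}')  # drops short tokens such as 'is', 'go', 'to'
toy_matrix = toy_vectorizer.fit_transform(docs)

# Column order of the matrix follows the indices stored in vocabulary_
vocab = sorted(toy_vectorizer.vocabulary_, key=toy_vectorizer.vocabulary_.get)
print(vocab)
print(toy_matrix.toarray())
```

Only 'car' and 'engine' survive: the remaining words are stop words, too short, or appear in a single document.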

8、Check the Sparsicity
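Sparsicity is nothing but the percentage of non-zero datapoints in the document-word matrix, that is, data_vectorized.

Since most cells in this matrix will be zero, I am interested in knowing what percentage of cells contain non-zero values.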

# Materialize the sparse data
data_dense = data_vectorized.todense()

# Compute Sparsicity = Percentage of Non-Zero cells
print("Sparsicity: ", ((data_dense > 0).sum()/data_dense.size)*100, "%")

Sparsicity:  0.775887569365 %
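
For a large vocabulary, materializing the dense matrix can use a lot of memory. As a hedged alternative, the same percentage can be read from the sparse matrix's nnz attribute (the count of stored entries). The sparsicity_percent helper and the toy matrix below are illustrative assumptions; the sketch also assumes the matrix stores no explicit zeros, which is normally true for CountVectorizer output.

```python
import numpy as np
from scipy.sparse import csr_matrix

def sparsicity_percent(matrix):
    """Percentage of stored (non-zero) cells in a scipy sparse matrix."""
    n_rows, n_cols = matrix.shape
    return 100.0 * matrix.nnz / (n_rows * n_cols)

# Hypothetical 3 x 4 count matrix with 3 non-zero cells out of 12
toy = csr_matrix(np.array([[0, 2, 0, 0],
                           [1, 0, 0, 0],
                           [0, 0, 0, 3]]))
pct = sparsicity_percent(toy)
print("Sparsicity: ", pct, "%")
```

On the tutorial's data this would be called as sparsicity_percent(data_vectorized).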

9、Build LDA model with sklearn
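Everything is ready to build a Latent Dirichlet Allocation (LDA) model. Let's initialize one and call fit_transform() to build the LDA model.

For this example, I have set the number of topics to 20 based on prior knowledge about the dataset. The parameter is called n_components in current scikit-learn; older releases called it n_topics, which is why n_topics=20 appears in the printed output below. Later we will find the optimal number using grid search.

# Build LDA Model
lda_model = LatentDirichletAllocation(n_components=20,           # Number of topics (n_topics in older scikit-learn)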
                                      max_iter=10,               # Max learning iterations
                                      learning_method='online',   
                                      random_state=100,          # Random state
                                      batch_size=128,            # n docs in each learning iter
                                      evaluate_every = -1,       # compute perplexity every n iters, default: Don't
                                      n_jobs = -1,               # Use all available CPUs
                                     )
lda_output = lda_model.fit_transform(data_vectorized)

print(lda_model)  # Model attributes

LatentDirichletAllocation(batch_size=128, doc_topic_prior=None,
             evaluate_every=-1, learning_decay=0.7,
             learning_method='online', learning_offset=10.0,
             max_doc_update_iter=100, max_iter=10, mean_change_tol=0.001,
             n_components=10, n_jobs=-1, n_topics=20, perp_tol=0.1,
             random_state=100, topic_word_prior=None,
             total_samples=1000000.0, verbose=0)

10、Diagnose model performance with perplexity and log-likelihood
A model with higher log-likelihood and lower perplexity (exp(-1. * log-likelihood per word)) is considered to be good. Let's check for our model.

# Log Likelihood: Higher the better
print("Log Likelihood: ", lda_model.score(data_vectorized))

# Perplexity: Lower the better. Perplexity = exp(-1. * log-likelihood per word)
print("Perplexity: ", lda_model.perplexity(data_vectorized))

# See model parameters
pprint(lda_model.get_params())

Log Likelihood:  -9965645.21463
Perplexity:  2061.88393838
{'batch_size': 128,
 'doc_topic_prior': None,
 'evaluate_every': -1,
 'learning_decay': 0.7,
 'learning_method': 'online',
 'learning_offset': 10.0,
 'max_doc_update_iter': 100,
 'max_iter': 10,
 'mean_change_tol': 0.001,
 'n_components': 10,
 'n_jobs': -1,
 'n_topics': 20,
 'perp_tol': 0.1,
 'random_state': 100,
 'topic_word_prior': None,
 'total_samples': 1000000.0,
 'verbose': 0}

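On a different note, perplexity might not be the best measure to evaluate topic models because it does not consider the context and semantic associations between words. This can be captured using a topic coherence measure, an example of which I described in the gensim tutorial mentioned earlier.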
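
The relationship quoted above, perplexity = exp(-1. * log-likelihood per word), can be checked with a few lines of arithmetic. This is a sketch of the formula only: the helper name and the toy numbers are assumptions, and scikit-learn's score() returns an approximate log-likelihood, so do not expect to reproduce its perplexity() output exactly this way.

```python
import math

def perplexity_from_log_likelihood(log_likelihood, n_words):
    """Perplexity = exp(-1 * log-likelihood per word)."""
    return math.exp(-1.0 * log_likelihood / n_words)

# Hypothetical corpus of 10 words where the model gives each word probability 1/4
toy_log_likelihood = 10 * math.log(1 / 4)
toy_perplexity = perplexity_from_log_likelihood(toy_log_likelihood, n_words=10)
print(toy_perplexity)
```

A perplexity of about 4 here reads as "the model is as uncertain as a uniform choice among 4 words", which is why lower values are better.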
11、How to GridSearch the best LDA model?
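The most important tuning parameter for LDA models is n_components (the number of topics). In addition, I am going to search learning_decay (which controls the learning rate) as well.

Besides these, other possible search params could be learning_offset (which downweights early iterations and should be > 1) and max_iter. These could be worth experimenting with if you have enough computing resources.

Be warned: the grid search constructs multiple LDA models, one for every possible combination of param values in the param_grid dict. So this process can consume a lot of time and resources.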

# Define Search Param
search_params = {'n_components': [10, 15, 20, 25, 30], 'learning_decay': [.5, .7, .9]}

# Init the Model
lda = LatentDirichletAllocation()

# Init Grid Search Class
model = GridSearchCV(lda, param_grid=search_params)

# Do the Grid Search
model.fit(data_vectorized)

GridSearchCV(cv=None, error_score='raise',
       estimator=LatentDirichletAllocation(batch_size=128, doc_topic_prior=None,
             evaluate_every=-1, learning_decay=0.7, learning_method=None,
             learning_offset=10.0, max_doc_update_iter=100, max_iter=10,
             mean_change_tol=0.001, n_components=10, n_jobs=1,
             n_topics=None, perp_tol=0.1, random_state=None,
             topic_word_prior=None, total_samples=1000000.0, verbose=0),
       fit_params=None, iid=True, n_jobs=1,
       param_grid={'n_topics': [10, 15, 20, 25, 30], 'learning_decay': [0.5, 0.7, 0.9]},
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring=None, verbose=0)
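
To make the cost warning concrete, the number of models the grid search trains can be counted before running it. The sketch below enumerates the combinations with itertools. The n_folds value is an assumption (the default number of cross-validation folds depends on the scikit-learn version), and the extra single fit corresponds to refit=True.

```python
from itertools import product

search_params = {'n_components': [10, 15, 20, 25, 30], 'learning_decay': [.5, .7, .9]}

# Every combination of parameter values, as a list of dicts
param_names = sorted(search_params)
combos = [dict(zip(param_names, values))
          for values in product(*(search_params[name] for name in param_names))]

n_folds = 5  # assumption: number of cross-validation folds used by GridSearchCV
total_fits = len(combos) * n_folds + 1  # +1 for the final refit on the full data

print(len(combos), "parameter combinations")
print(total_fits, "LDA models trained in total")
```

Adding a third parameter with, say, three values would triple both numbers, so it pays to keep the grid small at first.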

12、How to see the best topic model and its parameters?

# Best Model
best_lda_model = model.best_estimator_

# Model Parameters
print("Best Model's Params: ", model.best_params_)

# Log Likelihood Score
print("Best Log Likelihood Score: ", model.best_score_)

# Perplexity
print("Model Perplexity: ", best_lda_model.perplexity(data_vectorized))

Best Model's Params:  {'learning_decay': 0.9, 'n_topics': 10}
Best Log Likelihood Score:  -3417650.82946
Model Perplexity:  2028.79038336

13、Compare LDA Model Performance Scores
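Plotting the log-likelihood scores against num_topics clearly shows that number of topics = 10 has better scores. And a learning_decay of 0.7 outperforms both 0.5 and 0.9.

This makes me think that, even though we know the dataset has 20 distinct topics to start with, some topics could share common keywords. For example, 'alt.atheism' and 'soc.religion.christian' can have a lot of common words. The same goes for 'rec.motorcycles' and 'rec.autos', and for 'comp.sys.ibm.pc.hardware' and 'comp.sys.mac.hardware'. You get the idea.

To tune this even further, you can do a finer grid search for a number of topics between 10 and 15. But I am going to skip that for now.

So the bottom line is that a lower optimal number of distinct topics (even 10 topics) may be reasonable for this dataset. I don't know that yet. But LDA says so. Let's see.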

# Get Log Likelihoods from Grid Search Output
# (cv_results_ replaces grid_scores_, which was removed from newer scikit-learn releases)
n_topics = [10, 15, 20, 25, 30]
results = pd.DataFrame(model.cv_results_)

def mean_scores_for(decay):
    # Mean validation score per n_components value for one learning_decay setting
    rows = results[results['param_learning_decay'] == decay].sort_values('param_n_components')
    return rows['mean_test_score'].round().tolist()

log_likelyhoods_5 = mean_scores_for(0.5)
log_likelyhoods_7 = mean_scores_for(0.7)
log_likelyhoods_9 = mean_scores_for(0.9)

# Show graph
plt.figure(figsize=(12, 8))
plt.plot(n_topics, log_likelyhoods_5, label='0.5')
plt.plot(n_topics, log_likelyhoods_7, label='0.7')
plt.plot(n_topics, log_likelyhoods_9, label='0.9')
plt.title("Choosing Optimal LDA Model")
plt.xlabel("Num Topics")
plt.ylabel("Log Likelihood Scores")
plt.legend(title='Learning decay', loc='best')
plt.show()

14、How to see the dominant topic in each document?
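To classify a document as belonging to a particular topic, a logical approach is to see which topic has the highest contribution to that document and assign it.

In the table below, I have highlighted in green all major topics in a document and assigned the most dominant topic to its own column.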

# Create Document - Topic Matrix
lda_output = best_lda_model.transform(data_vectorized)

# column names
topicnames = ["Topic" + str(i) for i in range(best_lda_model.n_components)]

# index names
docnames = ["Doc" + str(i) for i in range(len(data))]

# Make the pandas dataframe
df_document_topic = pd.DataFrame(np.round(lda_output, 2), columns=topicnames, index=docnames)

# Get dominant topic for each document
dominant_topic = np.argmax(df_document_topic.values, axis=1)
df_document_topic['dominant_topic'] = dominant_topic

# Styling
def color_green(val):
    color = 'green' if val > .1 else 'black'
    return 'color: {col}'.format(col=color)

def make_bold(val):
    weight = 700 if val > .1 else 400
    return 'font-weight: {weight}'.format(weight=weight)

# Apply Style
df_document_topics = df_document_topic.head(15).style.applymap(color_green).applymap(make_bold)
df_document_topics
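
The dominant-topic step boils down to a row-wise argmax over the document-topic matrix. Here is a minimal sketch on a made-up 3 x 3 matrix; the numbers are hypothetical, and each row plays the role of one row of lda_output.

```python
import numpy as np

# Hypothetical document-topic probabilities: 3 documents x 3 topics
toy_doc_topic = np.array([
    [0.1, 0.7, 0.2],   # Doc0 is mostly Topic1
    [0.6, 0.3, 0.1],   # Doc1 is mostly Topic0
    [0.2, 0.2, 0.6],   # Doc2 is mostly Topic2
])

# Index of the largest probability in each row = dominant topic of that document
dominant = np.argmax(toy_doc_topic, axis=1).tolist()

# Number of documents assigned to each topic
docs_per_topic = np.bincount(dominant, minlength=toy_doc_topic.shape[1]).tolist()

print(dominant)
print(docs_per_topic)
```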

15、Review topics distribution across documents

df_topic_distribution = df_document_topic['dominant_topic'].value_counts().reset_index(name="Num Documents")
df_topic_distribution.columns = ['Topic Num', 'Num Documents']
df_topic_distribution

16、How to visualize the LDA model with pyLDAvis?
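pyLDAvis offers the best visualization for viewing the topics-keywords distribution.

A good topic model will have non-overlapping, fairly large blobs for each topic. This seems to be the case here. So, we are good.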

pyLDAvis.enable_notebook()
panel = pyLDAvis.sklearn.prepare(best_lda_model,data_vectorized,vectorizer,mds='tsne')
panel

17、How to see the Topic’s keywords?
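The weights of each keyword in each topic are contained in lda_model.components_ as a 2d array. The names of the keywords themselves can be obtained from the vectorizer object using get_feature_names_out() (get_feature_names() in older scikit-learn releases).

Let's use this info to construct a weight matrix for all keywords in each topic.

# Topic-Keyword Matrix
df_topic_keywords = pd.DataFrame(best_lda_model.components_)

# Assign Column and Index
df_topic_keywords.columns = vectorizer.get_feature_names_out()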
df_topic_keywords.index = topicnames

# View
df_topic_keywords.head()

18、Get the top 15 keywords each topic
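From the above output, I want to see the top 15 keywords that are representative of each topic.

The show_topics() function defined below does that.

# Show top n keywords for each topic
def show_topics(vectorizer=vectorizer, lda_model=lda_model, n_words=20):
    keywords = np.array(vectorizer.get_feature_names_out())  # get_feature_names() in older scikit-learn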
    topic_keywords = []
    for topic_weights in lda_model.components_:
        top_keyword_locs = (-topic_weights).argsort()[:n_words]
        topic_keywords.append(keywords.take(top_keyword_locs))
    return topic_keywords

topic_keywords = show_topics(vectorizer=vectorizer, lda_model=best_lda_model, n_words=15)        

# Topic - Keywords Dataframe
df_topic_keywords = pd.DataFrame(topic_keywords)
df_topic_keywords.columns = ['Word '+str(i) for i in range(df_topic_keywords.shape[1])]
df_topic_keywords.index = ['Topic '+str(i) for i in range(df_topic_keywords.shape[0])]
df_topic_keywords
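
The core of show_topics() is the argsort trick: negate the weights so that an ascending sort puts the heaviest keywords first. A small sketch with a made-up vocabulary and weight matrix shows it in isolation; the names and numbers are hypothetical stand-ins for the vectorizer's vocabulary and components_.

```python
import numpy as np

# Hypothetical vocabulary and topic-keyword weights (2 topics x 4 keywords)
toy_vocab = np.array(['bike', 'car', 'engine', 'god'])
toy_components = np.array([
    [0.1, 5.0, 2.0, 0.3],   # topic 0 leans towards 'car' and 'engine'
    [4.0, 0.2, 0.1, 3.0],   # topic 1 leans towards 'bike' and 'god'
])

def top_keywords(components, vocab, n_words):
    """Return the n_words highest-weighted keywords for each topic."""
    result = []
    for topic_weights in components:
        top_locs = (-topic_weights).argsort()[:n_words]  # indices of the largest weights first
        result.append(vocab.take(top_locs).tolist())
    return result

toy_topics = top_keywords(toy_components, toy_vocab, n_words=2)
print(toy_topics)
```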

19、How to predict the topics for a new piece of text?
Assuming that you have already built the topic model, you need to take the text through the same routine of transformations before predicting the topic.

For our case, the order of transformations is:

sent_to_words() -> lemmatization() -> vectorizer.transform() -> best_lda_model.transform()

You need to apply these transformations in the same order. So, to simplify it, let's combine these steps into a predict_topic() function.

# Define function to predict topic for a given text document.
nlp = spacy.load('en_core_web_sm', disable=['parser', 'ner'])

def predict_topic(text, nlp=nlp):

    # Step 1: Clean with simple_preprocess
    mytext_2 = list(sent_to_words(text))

    # Step 2: Lemmatize
    mytext_3 = lemmatization(mytext_2, allowed_postags=['NOUN', 'ADJ', 'VERB', 'ADV'])

    # Step 3: Vectorize transform
    mytext_4 = vectorizer.transform(mytext_3)

    # Step 4: LDA Transform
    topic_probability_scores = best_lda_model.transform(mytext_4)
    topic = df_topic_keywords.iloc[np.argmax(topic_probability_scores), :].values.tolist()
    return topic, topic_probability_scores

# Predict the topic
mytext = ["Some text about christianity and bible"]
topic, prob_scores = predict_topic(text = mytext)
print(topic)

['say', 'god', 'people', 'write', 'think', 'know', 'believe', 'christian', 'make', 'subject', 'line', 'good', 'just', 'organization', 'thing']

mytext has been allocated to the topic that has religion and Christianity related keywords, which makes good sense.
20、How to cluster documents that share similar topics and plot?
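You can use k-means clustering on the document-topic probability matrix, which is nothing but the lda_output object. I have set n_clusters=15 in KMeans(). Note that the grid search above selected 10 topics, so n_clusters=10 would mirror the best model more directly.

Alternatively, you could avoid k-means and instead assign each document's cluster as the topic column number with the highest probability score.

We now have the cluster number. But we also need the X and Y columns to draw the plot.

For X and Y, you can use SVD on the lda_output object with n_components set to 2. SVD ensures that these two columns capture the maximum possible amount of information from lda_output in the first 2 components.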

# Construct the k-means clusters
from sklearn.cluster import KMeans
clusters = KMeans(n_clusters=15, random_state=100).fit_predict(lda_output)

# Build the Singular Value Decomposition(SVD) model
svd_model = TruncatedSVD(n_components=2)  # 2 components
lda_output_svd = svd_model.fit_transform(lda_output)

# X and Y axes of the plot using SVD decomposition
x = lda_output_svd[:, 0]
y = lda_output_svd[:, 1]

# Weights for the columns of lda_output (one per topic), for each component
print("Component's weights: \n", np.round(svd_model.components_, 2))

# Percentage of total information in 'lda_output' explained by the two components
print("Perc of Variance Explained: \n", np.round(svd_model.explained_variance_ratio_, 2))

Component's weights: 
 [[ 0.08  0.23  0.24  0.14  0.2   0.85  0.09  0.19  0.07  0.2 ]
 [ 0.02 -0.1   0.9   0.16  0.16 -0.32 -0.01 -0.01  0.13  0.09]]
Perc of Variance Explained: 
 [ 0.09  0.21]
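
The shapes involved in the SVD step are easy to lose track of, so here is a sketch on a made-up 6 x 4 document-topic matrix. The toy numbers are hypothetical. One detail worth knowing: TruncatedSVD does not center the data, so the entries of explained_variance_ratio_ are not guaranteed to be in descending order, which may explain why the first value printed above is smaller than the second.

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD

# Hypothetical document-topic matrix: 6 documents x 4 topics, rows sum to 1
toy_doc_topic = np.array([
    [0.7, 0.1, 0.1, 0.1],
    [0.6, 0.2, 0.1, 0.1],
    [0.1, 0.7, 0.1, 0.1],
    [0.1, 0.6, 0.2, 0.1],
    [0.1, 0.1, 0.1, 0.7],
    [0.2, 0.1, 0.1, 0.6],
])

toy_svd = TruncatedSVD(n_components=2, random_state=100)
toy_xy = toy_svd.fit_transform(toy_doc_topic)   # one (x, y) pair per document

print(toy_xy.shape)                 # documents x 2
print(toy_svd.components_.shape)    # 2 x topics: weight of each topic in each component
print(np.round(toy_svd.explained_variance_ratio_, 2))
```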

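We now have the X, Y, and cluster number for each document.

Let's plot the documents along the two SVD components. The color of the points represents the cluster number (in this case) or the topic number.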

# Plot
plt.figure(figsize=(12, 12))
plt.scatter(x, y, c=clusters)
plt.xlabel('Component 1')
plt.ylabel('Component 2')
plt.title("Segregation of Topic Clusters", )

21、How to get similar documents for any given piece of text?
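Once you know the topic probabilities for a given document (using predict_topic()), compute the euclidean distance between them and the probability scores of all other documents.

The most similar documents are the ones with the smallest distance.

from sklearn.metrics.pairwise import euclidean_distances

nlp = spacy.load('en_core_web_sm', disable=['parser', 'ner'])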

def similar_documents(text, doc_topic_probs, documents = data, nlp=nlp, top_n=5, verbose=False):
    topic, x  = predict_topic(text)
    dists = euclidean_distances(x.reshape(1, -1), doc_topic_probs)[0]
    doc_ids = np.argsort(dists)[:top_n]
    if verbose:        
        print("Topic KeyWords: ", topic)
        print("Topic Prob Scores of text: ", np.round(x, 1))
        print("Most Similar Doc's Probs:  ", np.round(doc_topic_probs[doc_ids], 1))
    return doc_ids, np.take(documents, doc_ids)

# Get similar documents
mytext = ["Some text about christianity and bible"]
doc_ids, docs = similar_documents(text=mytext, doc_topic_probs=lda_output, documents = data, top_n=1, verbose=True)
print('\n', docs[0][:500])

Topic KeyWords:  ['say', 'god', 'people', 'write', 'think', 'know', 'believe', 'christian', 'make', 'subject', 'line', 'good', 'just', 'organization', 'thing']
Topic Prob Scores of text:  [[ 0.   0.   0.8  0.   0.   0.   0.   0.   0.   0. ]]
Most Similar Doc's Probs:   [[ 0.   0.   0.8  0.   0.   0.   0.1  0.   0.   0. ]]

 From: Subject: about Eliz C Prophet Lines: 21 Rob Butera asks about a book called THE LOST YEARS OF JESUS, by Elizabeth Clare Prophet. I do not know the book. However, Miss Prophet is the leader of a group (The Church Universal and Triumphant) derived from the I AM group founded by a Mr. Ballard who began his mission in the 1930s (I am writing this from memory and may not have all the details straight -- for an old account, check your library for a book by Marcus Bach) after an eighteenth-centu
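
Stripped of the preprocessing, the similarity lookup is a nearest-neighbour search in topic-probability space. The sketch below reproduces that idea with plain numpy on made-up vectors; the query and document probabilities are hypothetical.

```python
import numpy as np

# Hypothetical topic probabilities for one query text and four documents (3 topics)
query = np.array([0.0, 0.8, 0.2])
toy_doc_topic = np.array([
    [0.90, 0.05, 0.05],
    [0.10, 0.70, 0.20],
    [0.00, 0.80, 0.20],
    [0.30, 0.30, 0.40],
])

# Euclidean distance from the query to every document
dists = np.linalg.norm(toy_doc_topic - query, axis=1)

# Smallest distance first; keep the two closest documents
nearest = np.argsort(dists)[:2].tolist()

print(nearest)
print(np.round(dists, 3))
```

Document 2 has exactly the same topic mix as the query, so its distance is zero and it ranks first.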

22、Conclusion
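We covered some cutting-edge topic modeling approaches in this post. If you managed to work through all of it, well done. For those concerned about the time, memory consumption, and variety of topics when building topic models, check out the gensim tutorial on LDA.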
