Python Text Analysis: Perplexity Calculation and Coherence Testing

One of the harder problems when building an LDA model is choosing the number of topics. Below are implementations of two ways to guide that choice: perplexity and coherence.

  • Some of the LDA parameters below need to be adjusted for your own data.
  • The value returned by log_perplexity is negative: it is the per-word likelihood bound, and the actual perplexity is obtained by negating it and raising 2 to that power.
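The conversion from the bound to perplexity can be sketched on a toy value (the number below is hypothetical, standing in for what log_perplexity would return):

```python
import math

# gensim's log_perplexity returns a negative per-word likelihood bound;
# perplexity is 2 raised to the negation of that bound.
log_perplexity = -7.5  # hypothetical bound returned by a model
perplexity = math.pow(2, -log_perplexity)
print(round(perplexity, 3))  # 181.019
```

The lower the perplexity, the better the model predicts held-out text, although in practice perplexity often keeps decreasing as topics are added, which is why coherence is also computed below.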
import pandas as pd
import matplotlib.pyplot as plt
import gensim
from gensim import corpora
from gensim.models import CoherenceModel
import warnings
warnings.filterwarnings("ignore")

# Load the pre-segmented comments; the 'segment' column holds space-separated tokens
comment = pd.read_csv(r"good_1", header=0, index_col=False, engine='python', encoding='utf-8')
# Keep only comments longer than 100 characters
csv_data = comment[[len(str(x)) > 100 for x in comment['segment']]]
print(csv_data.shape)
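The boolean-list filter used above can be seen on a small hypothetical frame (toy data, with a shorter length threshold for illustration):

```python
import pandas as pd

# Toy frame standing in for the loaded comments (hypothetical data);
# keep only rows whose 'segment' text is longer than 5 characters,
# mirroring the length-100 filter above.
df = pd.DataFrame({"segment": ["short", "a much longer segment", "tiny"]})
kept = df[[len(str(x)) > 5 for x in df["segment"]]]
print(kept.shape)  # (1, 1)
```

Indexing a DataFrame with a boolean list of the same length keeps exactly the rows where the list is True.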

# Build the corpus: each document is a list of tokens
train = [str(doc).split() for doc in csv_data['segment']]

id2word = corpora.Dictionary(train)
corpus = [id2word.doc2bow(sentence) for sentence in train]

# Coherence and perplexity calculation
coherence_values = []
perplexity_values = []
model_list = []

for topic in range(15):
    lda_model = gensim.models.LdaMulticore(corpus=corpus, num_topics=topic + 1,
                                           id2word=id2word, random_state=100,
                                           chunksize=100, passes=10,
                                           per_word_topics=True)
    # log_perplexity returns the negative per-word likelihood bound;
    # convert it to perplexity via 2 ** (-bound)
    perplexity = pow(2, -lda_model.log_perplexity(corpus))
    print(perplexity, end='   ')
    perplexity_values.append(round(perplexity, 3))

    model_list.append(lda_model)
    coherencemodel = CoherenceModel(model=lda_model, texts=train,
                                    dictionary=id2word, coherence='c_v')
    coherence_values.append(round(coherencemodel.get_coherence(), 3))
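Once both lists are filled, a common rule of thumb is to pick the topic count with the highest coherence. A minimal sketch on hypothetical scores (standing in for the coherence_values built above):

```python
# Hypothetical c_v coherence scores for 1..5 topics.
coherence_values = [0.31, 0.42, 0.55, 0.51, 0.48]

# Index of the largest score; topic counts start at 1, not 0.
best_index = max(range(len(coherence_values)), key=coherence_values.__getitem__)
best_num_topics = best_index + 1
print(best_num_topics)  # 3
```

Perplexity can serve as a tiebreaker, but since it tends to keep falling as topics are added, coherence is usually the primary criterion.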

Below is one way to visualize the coherence scores.

x = range(1, 16)  # must match the 15 models trained above
plt.plot(x, coherence_values, label="coherence_values")
plt.xlabel("Num Topics")
plt.ylabel("Coherence score")
plt.legend(loc='best')
plt.show()
</plt>
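Since perplexity and coherence live on very different scales, one option is to plot both on twin y-axes. A sketch on hypothetical scores (standing in for the two lists built earlier):

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the script also runs headless
import matplotlib.pyplot as plt

# Hypothetical scores for 1..5 topics.
coherence_values = [0.31, 0.42, 0.55, 0.51, 0.48]
perplexity_values = [310.2, 250.7, 190.3, 201.8, 215.4]
x = range(1, len(coherence_values) + 1)

fig, ax1 = plt.subplots()
ax1.plot(x, coherence_values, color="tab:blue", label="coherence")
ax1.set_xlabel("Num Topics")
ax1.set_ylabel("Coherence score")
ax2 = ax1.twinx()  # second y-axis for the much larger perplexity values
ax2.plot(x, perplexity_values, color="tab:red", label="perplexity")
ax2.set_ylabel("Perplexity")
fig.legend(loc="upper right")
fig.savefig("topic_selection.png")
```

With both curves on one figure it is easier to spot a topic count where coherence peaks while perplexity is still reasonably low.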

[Figure: coherence score versus number of topics]
