import gensim
import jieba
import pandas as pd
from gensim import corpora, models
# Note: gensim.models.wrappers was removed in gensim 4.0, so this import needs gensim < 4.0
from gensim.models.wrappers import DtmModel
from gensim.corpora import Dictionary
from collections import defaultdict

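The dynamic topic model in the gensim module is not part of the code gensim itself ships: DtmModel is only a wrapper. To use it, you first have to download the binary kept on GitHub. Builds for linux, win and darwin are available for direct download, which is convenient.

Link: https://github.com/magsilva/dtm/tree/master/bin

Text requirements

As I understand it, a dynamic topic model is, put simply, an LDA topic model whose parameters are adjusted over time, with the timeline usually divided into several time slices of equal length. To fit this gensim model the way I did, you first merge all the text within each time slice into one document. In short, the length of the text list you finally feed to the model equals the number of segments you divided the timeline into.

In my case the data covered eight months, which I split into eight slices, so the final corpus list also has a len of 8.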
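The merging step described above can be sketched as follows. This is a minimal illustration, assuming each record is a (month, text) pair; the `records` data and the `merge_by_slice` helper are hypothetical and not part of the original code.

```python
from collections import defaultdict


def merge_by_slice(records):
    """Merge all texts that share a time-slice key into one string.

    `records` is an iterable of (slice_key, text) pairs. Returns the sorted
    slice keys and one merged text per key, in the same order.
    """
    merged = defaultdict(list)
    for slice_key, text in records:
        merged[slice_key].append(text)
    keys = sorted(merged)
    return keys, [" ".join(merged[key]) for key in keys]


# Hypothetical toy data: (month, comment text)
records = [
    ("2019-01", "first comment"),
    ("2019-02", "second comment"),
    ("2019-01", "third comment"),
]
slice_keys, comment = merge_by_slice(records)
print(slice_keys)
print(len(comment))  # one merged document per time slice
```

Sorting the keys keeps the merged documents in chronological order, which matters because the model reads the corpus slice by slice.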

Word segmentation

Next, segment the text you want to analyse into words. The code I used is as follows:

train = []  # tokenized documents, one list of tokens per document

# comment: iterable of raw text strings; stopword: collection of stopwords
for line in comment:
    line = "".join(line.split())  # remove all whitespace
    if not line:  # skip empty lines
        continue
    # keep tokens that are neither stopwords nor stray symbols
    tokens = [word for word in jieba.cut(line)
              if word not in stopword and word not in ('\t', '\u200b', '～')]
    train.append(tokens)
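The loop above relies on a `stopword` collection that is never defined in the post. One possible way to build it is sketched here, assuming a UTF-8 file with one stopword per line; the `load_stopwords` helper and the file contents are hypothetical.

```python
import os
import tempfile


def load_stopwords(path):
    """Read one stopword per line from a UTF-8 text file into a set."""
    with open(path, encoding="utf-8") as handle:
        return {line.strip() for line in handle if line.strip()}


# Write a tiny stopword file so the sketch runs on its own
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False,
                                 encoding="utf-8") as tmp:
    tmp.write("的\n了\n\n是\n")

stopword = load_stopwords(tmp.name)
os.remove(tmp.name)
print(sorted(stopword))
```

A set is used so that each `word not in stopword` test in the loop is a constant-time lookup.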

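Once you have train, you can also filter out low-frequency words:

# threshold: minimum corpus-wide count a token must exceed to be kept
frequency = defaultdict(int)
for patch in train:
    for token in patch:
        frequency[token] += 1
train = [[token for token in patch if frequency[token] > threshold]
         for patch in train]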
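The snippet above leaves `threshold` undefined. A small worked example of the same filter on toy data, assuming `threshold = 1` (the documents and the value are made up for illustration):

```python
from collections import Counter

# Hypothetical tokenized documents
train = [["topic", "model", "time"], ["topic", "model"], ["topic", "rare"]]
threshold = 1  # keep tokens that occur more than once in the whole corpus

# Count every token across all documents
frequency = Counter(token for patch in train for token in patch)
filtered = [[token for token in patch if frequency[token] > threshold]
            for patch in train]
print(filtered)
```

Here "time" and "rare" each occur only once, so they are dropped, while "topic" and "model" survive.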

Then vectorize the segmentation result:

dic = Dictionary(train)
corpus = [dic.doc2bow(text) for text in train]

Here the doc2bow function is needed to convert the segmented documents into a list of vectors in BoW (bag-of-words) format.
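To show what a BoW vector looks like, here is a simplified pure-Python stand-in for the conversion. It is not gensim's implementation, and the ids gensim assigns may differ; `build_vocab` and `to_bow` are hypothetical helpers.

```python
from collections import Counter


def build_vocab(docs):
    """Assign an integer id to each distinct token, in order of first appearance."""
    vocab = {}
    for doc in docs:
        for token in doc:
            vocab.setdefault(token, len(vocab))
    return vocab


def to_bow(doc, vocab):
    """Return sorted (token_id, count) pairs for the tokens of one document."""
    counts = Counter(vocab[token] for token in doc if token in vocab)
    return sorted(counts.items())


docs = [["topic", "model", "topic"], ["model", "time"]]
vocab = build_vocab(docs)
bow_corpus = [to_bow(doc, vocab) for doc in docs]
print(bow_corpus)
```

Each document becomes a list of (token id, count) pairs, so word order is discarded and only the counts remain.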

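Building the DTM model

Have a look at the official documentation first. If you can't read English, Chrome's built-in translation is good enough.

class gensim.models.wrappers.dtmmodel.DtmModel(dtm_path, corpus=None, time_slices=None, mode='fit', model='dtm', num_topics=100, id2word=None, prefix=None, lda_sequence_min_iter=6, lda_sequence_max_iter=20, lda_max_em_iter=10, alpha=0.01, top_chain_var=0.005, rng_seed=0, initialize_lda=True)

dtm_path (str) - Path to the dtm binary, e.g. /home/username/dtm/dtm/main.
corpus (iterable of iterable of (int, int)) - Collection of texts in BoW format.
time_slices (list of int) - Sequence of timestamps.
mode ({'fit', 'time'}, optional) - Controls the mode: 'fit' is for training, 'time' is for analysing documents through time according to a DTM, basically a held-out set.
model ({'fixed', 'dtm'}, optional) - Controls which model is run: 'fixed' is for DIM, 'dtm' is for DTM.
num_topics (int, optional) - Number of topics.
id2word (Dictionary, optional) - Mapping between token ids and words from the corpus; if not specified, it is inferred from the corpus.
prefix (str, optional) - Prefix for the generated temporary files.
lda_sequence_min_iter (int, optional) - Minimum number of LDA iterations.
lda_sequence_max_iter (int, optional) - Maximum number of LDA iterations.
lda_max_em_iter (int, optional) - Maximum number of EM optimization iterations in LDA.
alpha (int, optional) - Hyperparameter that affects the sparsity of the document-topic distribution of the LDA model in each time slice.
top_chain_var (int, optional) - Hyperparameter that affects (the description in the quoted documentation stops here).
rng_seed (int, optional) - Random seed.
initialize_lda (bool, optional) - If True, initialize the DTM with LDA.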

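The first parameter, dtm_path, is simply the location where you put the binary you downloaded.

The most mysterious parameter of the whole function is time_slices. The source code requires sum(time_slices) to equal len(corpus), which in my setup is also the number of time slices. There are many combinations that satisfy this, yet the official documentation does not spell out how the parameter affects the model, and I did not fully work it out myself, so I simply used the form from the official example, time_slices = [1] * len(corpus). Judging from that check, each entry appears to be the number of documents in the corresponding time slice, which is why one merged document per slice gives a list of ones.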
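The sum(time_slices) condition can be illustrated under the reading that each entry counts the documents of one time slice. This is a sketch under that assumption; the `build_time_slices` helper and the labels are hypothetical.

```python
from collections import Counter


def build_time_slices(slice_labels):
    """Count documents per time slice, in sorted (chronological) label order.

    `slice_labels[i]` is the slice label of document i; the corpus is assumed
    to be ordered so that documents of the same slice are contiguous.
    """
    counts = Counter(slice_labels)
    return [counts[label] for label in sorted(counts)]


# One merged document per month, as in this post
labels_merged = ["2019-01", "2019-02", "2019-03"]
# Hypothetical alternative: several documents per month
labels_split = ["2019-01", "2019-01", "2019-02", "2019-03", "2019-03"]

for labels in (labels_merged, labels_split):
    time_slices = build_time_slices(labels)
    # Consistency condition: the counts must add up to the corpus length
    assert sum(time_slices) == len(labels)
    print(time_slices)
```

With one merged document per slice the result is the same as [1] * len(corpus); with unmerged documents the list holds the per-slice counts instead.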

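The other parameters can be left at their defaults or tuned gradually.

According to the official documentation the model has two modes, fit and time. fit runs without any trouble, but time is the mode that analyses documents according to their timestamps, so it is the one we want. Calling it in practice, however, runs into problems:

Problem 1

You are told that some function returned a non-zero value, and an error is raised. Following the error message leads to line 1916 of utils.py in gensim; change that spot to: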

try:
    error = subprocess.CalledProcessError(retcode,cmd)
except Exception:
    error = None

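Problem 2

After that another error appears, saying that none of the files the model needs exist. It took me a long time to work out what was going on: this mode does not generate those initialization files itself. You have to run fit mode once first and then run time mode on the files that fit produced. The source code does not do this, which is why the run fails, so modify line 164 of dtmmodel.py in the source: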

if corpus is not None:
    if self.mode == 'time':
        print("time mode")
        # run 'fit' first so that 'time' mode finds the files it needs
        self.train(corpus, time_slices, 'fit', model)
        self.train(corpus, time_slices, mode, model)
    elif self.mode == 'fit':
        print("train mode")
        self.train(corpus, time_slices, mode, model)

With this change, time mode runs normally. The code is as follows:

model = DtmModel(path_to_dtm_binary, corpus=corpus, time_slices=time_slice,
                 id2word=dic, num_topics=num_topics, alpha=alpha, mode='time')

View the resulting topics:

model.show_topics(num_topics=10, times=1)

For the other methods, just refer to the official documentation.
