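Implementing an LDA Topic Model with gensim: A Hands-On Case Study (Analyzing the Topics of Hillary Clinton's Emails)

Step 1: Load the required libraries. We use the LDA model from gensim, so the gensim library must be installed.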

import pandas as pd
import re
from gensim.models import doc2vec, ldamodel
from gensim import corpora

Step 2: Take a look at the dataset. It has 20 features, of which we keep only two: the Id and the email body.

if __name__ == '__main__':
    # Load the data
    df = pd.read_csv('./data/HillaryEmails.csv')
    df = df[['Id', 'ExtractedBodyText']].dropna()  # drop any row where either of these two columns is missing
    print(df.head())
    print(df.shape)   # (6742, 2)

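About the code above: we load the dataset with pandas, keep those two columns, and drop any row that contains a missing value. Then we print the first five rows and the shape of the data.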
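To see what the column selection and dropna() do, here is a minimal sketch on a tiny hand-made DataFrame. The three rows and the extra column named Other are invented for illustration and are not taken from the real CSV.

```python
import pandas as pd

# Toy stand-in for the real CSV (all values are invented for illustration).
toy = pd.DataFrame({
    "Id": [1, 2, 3],
    "ExtractedBodyText": ["Thanks for the update", None, "See you Monday"],
    "Other": ["a", "b", "c"],  # hypothetical extra column that we discard
})

# Keep two columns, then drop every row with a missing value in either one.
subset = toy[["Id", "ExtractedBodyText"]].dropna()

print(subset.shape)           # (2, 2)
print(subset["Id"].tolist())  # [1, 3]
```

The row with Id 2 disappears because its body is missing, which is the same thing that happens to emails without extracted text in the real dataset.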

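Step 3: Data cleaning

The data clearly contains a lot of noise, such as email addresses, dates, and times, which does not help topic generation, so we need to remove it.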

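def clean_email_text(text):
    # Data cleaning
    text = text.replace('\n', " ")  # newlines are not needed
    text = re.sub(r"-", " ", text)  # split words joined by "-" (e.g. july-edu ==> july edu)
    text = re.sub(r"\d+/\d+/\d+", "", text)  # dates carry no meaning for the topic model
    text = re.sub(r"[0-2]?[0-9]:[0-6][0-9]", "", text)  # times carry no meaning
    text = re.sub(r"[\w]+@[\.\w]+", "", text)  # email addresses carry no meaning
    # URLs carry no meaning. The original /.../i pattern was a JavaScript-style literal that does
    # not work as intended in Python; this simplified pattern (an assumption) drops http(s):// and www. links.
    text = re.sub(r"(?:https?://|www\.)\S+", "", text, flags=re.IGNORECASE)
    pure_text = ''
    # In case other special characters (digits and so on) remain, loop over every character and filter them out
    for letter in text:
        # Keep only letters and spaces
        if letter.isalpha() or letter == ' ':
            pure_text += letter
    # Removing special characters can leave stray single letters behind; drop them
    # so that only meaningful words remain.
    text = ' '.join(word for word in pure_text.split() if len(word) > 1)  # keep words of at least 2 characters
    return text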

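The main job here is to strip the special characters and, finally, to drop single-character words, which do not help topic generation either. Remember: LDA is a bag-of-words model. It has no information about context or word order.

Call the function above from our main block to clean the data, which means adding five lines of code. At the end, docs.values pulls the data directly out of the table as an array, and we print it to inspect the result.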
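The bag-of-words point can be illustrated with a small sketch: two sentences that use the same words in a different order produce the same bag, so a model that only sees the bag cannot tell them apart. The two example sentences are made up.

```python
from collections import Counter

# Two invented sentences with the same words in a different order.
a = "the senator thanked the ambassador"
b = "the ambassador thanked the senator"

# A bag of words only records how often each word occurs.
bag_a = Counter(a.split())
bag_b = Counter(b.split())

print(bag_a == bag_b)  # True
print(bag_a["the"])    # 2
```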

if __name__ == '__main__':
    # Load the data
    df = pd.read_csv('./data/HillaryEmails.csv')
    df = df[['Id', 'ExtractedBodyText']].dropna()  # drop any row where either of these two columns is missing
    print(df.head())
    print(df.shape)   # (6742, 2)
    
    # Newly added code
    docs = df['ExtractedBodyText']   # get the email bodies
    docs = docs.apply(lambda s: clean_email_text(s))   # clean each email

    print(docs.head(1).values)
    doclist = docs.values   # pull the contents out as an array
    print(docs)

Note: each row is one email.

Step 4: Further cleaning: removing stop words

We have a local stop-word list. We load it and then remove the stop words from the text.

def remove_stopword():
    stopword = []
    with open('./data/stop_words.utf8', 'r', encoding='utf8') as f:
        lines = f.readlines()
        for line in lines:
            line = line.replace('\n', '')
            stopword.append(line)
    return stopword

Call it from main by adding the following code to main:

    stop_word = remove_stopword()

    texts = [[word for word in doc.lower().split() if word not in stop_word] for doc in doclist]
    print(texts[0])  # what the first document looks like now

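This step not only removes the stop words but also tokenizes each sentence, because LDA in gensim expects tokenized input. We print only the first document here to check the result.

Step 5: Prepare and train the model

Add the following code directly to main: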
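Here is a minimal sketch of the same filtering on two short strings. The mini stop-word list is hypothetical (the article loads a much longer one from a file), and storing it in a set is a choice made for this sketch: membership tests on a set stay fast even when the list is long.

```python
# Hypothetical mini stop-word list, stored as a set for fast lookups.
stop_words = {"the", "to", "of", "you", "all", "who", "did", "this"}

docs = ["Thank you to all of you who did this", "Happy Thanksgiving"]

# Lowercase, split on whitespace, and drop stop words in one pass.
tokens = [[w for w in doc.lower().split() if w not in stop_words] for doc in docs]

print(tokens)  # [['thank'], ['happy', 'thanksgiving']]
```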

    dictionary = corpora.Dictionary(texts)
    corpus = [dictionary.doc2bow(text) for text in texts]
    print(corpus[0])  # [(36, 1), (505, 1), (506, 1), (507, 1), (508, 1)]

    lda = ldamodel.LdaModel(corpus=corpus, id2word=dictionary, num_topics=20)
    print(lda.print_topic(10, topn=5))  # the five most important words of topic 10 (topic ids start at 0)
    print(lda.print_topics(num_topics=20, num_words=5))  # print all the topics to take a look

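Explaining the code above: the first line builds the vocabulary. It collects the words from all documents, removes duplicates, and assigns each remaining word an integer id.

The second line converts the words of each document into ids. As the comment after the third line shows, (36, 1) means that this document contains word number 36 and that the word occurs once in this document.

Next comes building and training the LDA model. Note the arguments: the first two need no explanation. The last one, num_topics, is the number of topics we want the model to generate, analogous to specifying the number of clusters in k-means. We then print the five words that best represent topic 10, followed by the top words of all 20 topics.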
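The idea behind the vocabulary and the (id, count) pairs can be sketched in plain Python. This is only a conceptual illustration, not gensim's implementation: the toy documents are invented, the helper to_bow is hypothetical, and numbering words in order of first appearance is an assumption of this sketch, so gensim may assign different ids.

```python
from collections import Counter

# Two invented, already tokenized documents.
texts = [["happy", "thanksgiving", "happy"], ["thanksgiving", "dinner"]]

# Vocabulary: one integer id per distinct word, in order of first appearance
# (an assumption of this sketch; gensim may number words differently).
token2id = {}
for doc in texts:
    for word in doc:
        token2id.setdefault(word, len(token2id))

def to_bow(doc):
    """Return sorted (word_id, count) pairs for the known words in doc."""
    counts = Counter(token2id[w] for w in doc if w in token2id)
    return sorted(counts.items())

print(token2id)                      # {'happy': 0, 'thanksgiving': 1, 'dinner': 2}
print(to_bow(texts[0]))              # [(0, 2), (1, 1)]
print(to_bow(["happy", "unknown"]))  # [(0, 1)]
```

Words that are not in the vocabulary are simply skipped, which matters later when a new email contains words the training data never saw.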

The second print outputs a list with one entry per topic.

Step 6: Save the model so it is convenient to use later

    # Save the model
    lda.save('zhutimoxing.model')

Step 7: Load the model and give it a new email to determine which topic it belongs to

New email content: 'I was greeted by this heartwarming display on the corner of my street today. Thank you to all of you who did this. Happy Thanksgiving. -H'

    # Load the model
    lda = ldamodel.LdaModel.load('zhutimoxing.model')

    # New data: determine its topic
    text = 'I was greeted by this heartwarming display on the corner of my street today. ' \
           'Thank you to all of you who did this. Happy Thanksgiving. -H'
    text = clean_email_text(text)
    texts = [word for word in text.lower().split() if word not in stop_word]
    bow = dictionary.doc2bow(texts)
    print(lda.get_document_topics(bow))  # probabilities for these three topics: [(4, 0.6081926), (11, 0.1473181), (12, 0.13814318)]

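Final output: in the author's run, the email most likely belongs to topic 17, with a probability of 0.5645. LDA training is randomly initialized, so topic ids and probabilities change between runs, which is why the comment in the code above shows different numbers.

Complete code: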
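Picking the most likely topic programmatically takes one line. This sketch assumes the result is a list of (topic_id, probability) pairs, as in the comment in the code above, and reuses those numbers. The variable names are made up for illustration.

```python
# (topic_id, probability) pairs, copied from the comment in the code above.
doc_topics = [(4, 0.6081926), (11, 0.1473181), (12, 0.13814318)]

# Choose the pair with the highest probability.
best_topic, best_prob = max(doc_topics, key=lambda pair: pair[1])

print(best_topic, round(best_prob, 4))  # 4 0.6082
```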

"""
@file   : 010-希拉里郵件進行主題建立之主題模型.py
@author : xiaolu
@time1  : 2019-05-11
"""
import numpy as np
import pandas as pd
import re
from gensim.models import doc2vec, ldamodel
from gensim import corpora



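def clean_email_text(text):
    # Data cleaning
    text = text.replace('\n', " ")  # newlines are not needed
    text = re.sub(r"-", " ", text)  # split words joined by "-" (e.g. july-edu ==> july edu)
    text = re.sub(r"\d+/\d+/\d+", "", text)  # dates carry no meaning for the topic model
    text = re.sub(r"[0-2]?[0-9]:[0-6][0-9]", "", text)  # times carry no meaning
    text = re.sub(r"[\w]+@[\.\w]+", "", text)  # email addresses carry no meaning
    # URLs carry no meaning. The original /.../i pattern was a JavaScript-style literal that does
    # not work as intended in Python; this simplified pattern (an assumption) drops http(s):// and www. links.
    text = re.sub(r"(?:https?://|www\.)\S+", "", text, flags=re.IGNORECASE)
    pure_text = ''
    # In case other special characters (digits and so on) remain, loop over every character and filter them out
    for letter in text:
        # Keep only letters and spaces
        if letter.isalpha() or letter == ' ':
            pure_text += letter
    # Removing special characters can leave stray single letters behind; drop them
    # so that only meaningful words remain.
    text = ' '.join(word for word in pure_text.split() if len(word) > 1)  # keep words of at least 2 characters
    return text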


def remove_stopword():
    stopword = []
    with open('./data/stop_words.utf8', 'r', encoding='utf8') as f:
        lines = f.readlines()
        for line in lines:
            line = line.replace('\n', '')
            stopword.append(line)
    return stopword


if __name__ == '__main__':
    # Load the data
    df = pd.read_csv('./data/HillaryEmails.csv')
    df = df[['Id', 'ExtractedBodyText']].dropna()  # drop any row where either of these two columns is missing
    print(df.head())
    print(df.shape)   # (6742, 2)

    docs = df['ExtractedBodyText']   # get the email bodies
    docs = docs.apply(lambda s: clean_email_text(s))   # clean each email

    print(docs.head(1).values)
    doclist = docs.values   # pull the contents out as an array
    print(docs)

    stop_word = remove_stopword()

    texts = [[word for word in doc.lower().split() if word not in stop_word] for doc in doclist]
    print(texts[0])  # what the first document looks like now

    dictionary = corpora.Dictionary(texts)
    corpus = [dictionary.doc2bow(text) for text in texts]
    print(corpus[0])  # [(36, 1), (505, 1), (506, 1), (507, 1), (508, 1)]

    lda = ldamodel.LdaModel(corpus=corpus, id2word=dictionary, num_topics=20)
    print(lda.print_topic(10, topn=5))  # the five most important words of topic 10 (topic ids start at 0)
    print(lda.print_topics(num_topics=20, num_words=5))  # print all the topics to take a look

    # Save the model
    lda.save('zhutimoxing.model')

    # Load the model
    lda = ldamodel.LdaModel.load('zhutimoxing.model')

    # New data: determine its topic
    text = 'I was greeted by this heartwarming display on the corner of my street today. ' \
           'Thank you to all of you who did this. Happy Thanksgiving. -H'
    text = clean_email_text(text)
    texts = [word for word in text.lower().split() if word not in stop_word]
    bow = dictionary.doc2bow(texts)
    print(lda.get_document_topics(bow))  # probabilities for these three topics: [(4, 0.6081926), (11, 0.1473181), (12, 0.13814318)]
