[NLP] Chinese Text Clustering in Python

1. Prepare the texts to be clustered. Ten Weibo posts are used here.

import os
path = 'E:/work/@@@@/開發事宜/大數據平臺/5. 標籤設計/文本測試數據/微博/'
titles = []
files = []
for filename in os.listdir(path):
    titles.append(filename)
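    # A UTF-8 txt file saved with a BOM decodes with a stray leading '\ufeff' under plain 'utf-8'.
    # Decoding with the 'utf-8-sig' codec (also accepted as 'utf_8_sig') strips it.
    with open(os.path.join(path, filename), encoding='utf-8-sig') as f:
        filestr = f.read()
    files.append(filestr)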
for i in range(len(titles)):
    print(i, titles[i], files[i])

0 #我有特別的推倒技巧#之佟掌櫃收李逍遙.txt 【#我有特別的推倒技巧#之佟掌櫃收李逍遙】@胡歌 此前對南都稱，“我要海陸空三棲”，可是還沒等他征服海陸空呢，@素描閆妮 就把他先收了……他倆主演的電視劇《生活啓示錄》收視率穩居衛視黃金檔排行第1名，在豆瓣甚至被刷到了8 .6的高分。佟掌櫃是怎麼推倒李逍遙而又不違和的？http://t.cn/Rvbj6oy
1 從前的稱是16兩爲一斤，爲什麼要16兩爲一斤呢？.txt 從前的稱是16兩爲一斤，爲什麼要16兩爲一斤呢？古人把北斗七星、南斗六星以及福、祿、壽三星，共16顆星比作16兩，商人賣東西，要講究商德，不能缺斤短兩；如果耍手腕，剋扣一兩就減福，剋扣二兩就損祿，剋扣三兩就折壽。所以這個數字和古代人對誠信的美好願望分不開。
2 作死男曬酒駕.txt #作死男曬酒駕# “他還曬出自己闖紅燈20多次，最終此人自首。”請問“親愛的交警同志”你們在幹神馬事情去啦！！！是不是撞死人呢，你們纔會管啊，開寶馬的命太金貴了，而我們的命太.......！！！[怒][淚] |作死男曬酒駕
3 冀中星被移送檢察院審查起訴.txt #新聞追蹤#：【冀中星被移送檢察院審查起訴】首都機場公安分局對冀中星爆炸案偵查終結，目前已移送朝陽檢察院審查起訴。7月20日18時24分，冀中星在首都機場T3航站樓B口外引爆自制炸藥。案發當天除冀中星左手腕因被炸截肢外，無其他人傷亡。7月29日，冀中星因涉嫌爆炸罪被批捕。http://t.cn/zQHjr0S
4 寶馬男微博炫富曬酒駕挑逗交警 最終因被人肉求饒[汗].txt 【寶馬男微博炫富曬酒駕挑逗交警 最終因被人肉求饒[汗]】"開車喝酒是不是違反交規？@深圳交警 今晚獵虎嗎？"，在向交警挑釁後，他又曬出車牌號，稱自己闖紅燈20多次，隨即遭人肉，並陸續被曝光私人信息。很快，他發微博求饒。26日，交警傳喚該男子，該男子炫富違法車輛已被查扣。http://t.cn/Rvb9gF6
5 廣場舞出口世界.txt #廣場舞出口世界# 澳大利亞引進廣場舞順帶引進大媽的疑惑：1）是否可以申請技術移民2）是否屬於物種入侵3）亞洲女子天團進入澳洲是否會影響當地娛樂圈的圈態平衡4）是否能夠接受“中國大媽一旦引進，一概不退不換”的要求
6 方舟子：錘子改口號，換湯不換藥！.txt 【方舟子：錘子改口號，換湯不換藥！】遭到方舟子舉報虛假宣傳後，錘子手機官網修改了宣傳口號，將"東半球最好用的手機"改成"全球第二好用的智能手機"。對此，方舟子稱，被舉報後，羅永浩一邊說着"呵呵"，一邊偷偷改了廣告用語，改成了”全球第二好用的智能手機"等，但這仍然是換湯不換藥的虛假廣告..
7 杜汶澤宣佈暫別香港 (2).txt #杜汶澤宣佈暫別香港#杜先生長的醜，嘴巴臭，爪子賤不是你的錯，你出來嚇人亂說話薰到人就是你的不對了，大陸人怎麼了，大陸人敢作敢當說不安逸你就不安逸你，大陸人讓你無地自容的本事還是綽綽有餘的，滾吧，杜狗！ 
8 杜汶澤宣佈暫別香港.txt #杜汶澤宣佈暫別香港# 雖說言論自由無可厚非，但攻擊民族種族國家，涉及歧視他人的行爲仍然不是營銷策略中可以突破的下線。把無恥當有趣，是多無聊的人才能幹出的事兒啊！該。。只有這個字能概括。
9 首都機場爆炸案嫌犯冀中星 移送檢方審查起訴.txt #豫廣微新聞#【首都機場爆炸案嫌犯冀中星 移送檢方審查起訴】 據報道，首都機場公安分局對冀中星爆炸案偵查終結，目前已移送朝陽檢察院審查起訴。7月20日，山東籍男子冀中星在首都機場T3航站樓引爆自制炸藥，案發當天除冀中星左手腕因被炸截肢外，無其他人傷亡
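
To see what the utf-8-sig codec changes, here is a minimal sketch that uses an in-memory byte string instead of the Weibo files above. The sample payload is made up for illustration.

```python
import codecs

# A made-up payload: the UTF-8 BOM followed by two Chinese characters.
raw = codecs.BOM_UTF8 + "微博".encode("utf-8")

with_bom = raw.decode("utf-8")         # the BOM survives as '\ufeff'
without_bom = raw.decode("utf-8-sig")  # the BOM is stripped

print(repr(with_bom))
print(repr(without_bom))
```

The 'utf-8-sig' codec also reads files that have no BOM, so it is a reasonable default when the origin of the files is unknown.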

2. Wrap jieba segmentation in a function. Note that a user-defined dictionary and a stop word list are also required

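With a user-defined dictionary, jieba keeps each listed word as a single token instead of splitting it. For example, before "佟掌櫃" is added to the dictionary it is segmented into "佟" and "掌櫃"; after it is added, "佟掌櫃" is treated as one token. The resulting list is also filtered against the stop word list, so punctuation and other meaningless characters do not appear in the final token set.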

# Load the stop word list (one word per line)
def stopwordslist(stopwords_filepath):
    with open(stopwords_filepath, 'r', encoding='utf-8') as f:
        stopwords = [line.strip() for line in f]
    return stopwords

# Segment a text into words
def segment(text, userdict_filepath="userdict2.txt", stopwords_filepath='stopwords.txt'):
    import jieba
    jieba.load_userdict(userdict_filepath)  # reloaded on every call; acceptable for 10 short texts
    stopwords = set(stopwordslist(stopwords_filepath))  # path to the stop word file; a set gives fast lookups
    seg_list = jieba.cut(text, cut_all=False)  # precise mode
    seg_list_without_stopwords = []
    for word in seg_list:
        if word not in stopwords and word != '\t':
            seg_list_without_stopwords.append(word)
    return seg_list_without_stopwords

The user-defined dictionary is named userdict2.txt and saved in the project folder:

杜汶澤
佟掌櫃
南都
生活啓示錄
第1名
違和
兩
南斗六星
福祿壽三星
顆
酒駕
曬出
親愛的
命
金貴
冀中星
7月20日
T3航站樓
B口
案發當天
7月29日
微博
炫富
廣場舞
物種入侵
虛假
宣傳口號
好用
不對
大陸人
才能
幹出
能
微新聞
被炸

Stop word list: https://blog.csdn.net/shijiebei2009/article/details/39696571 (saved as stopwords.txt in the project folder)
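
The stop word filtering can be tried in isolation, without jieba. The following is a minimal sketch: the helper name load_stopwords, the demo file, and the hand-written token list are all assumptions for illustration, not part of the original code.

```python
import os
import tempfile

def load_stopwords(filepath):
    """Read one stop word per line into a set, skipping blank lines."""
    with open(filepath, encoding="utf-8") as f:
        return {line.strip() for line in f if line.strip()}

# Hypothetical stop word file, written to a temporary directory for the demo.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "stopwords_demo.txt")
    with open(demo_path, "w", encoding="utf-8") as f:
        f.write("的\n，\n了\n")
    stopwords = load_stopwords(demo_path)

# Tokens as a segmenter might return them (hand-written, not real jieba output).
tokens = ["佟掌櫃", "的", "推倒", "技巧", "，"]
filtered = [t for t in tokens if t not in stopwords]
print(filtered)
```

Keeping the stop words in a set rather than a list makes each membership test constant time, which matters once the list grows to thousands of entries.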

3. Segment the list of files with the tokenizer

totalvocab_tokenized = []
for i in files:
    allwords_tokenized = segment(i, "userdict2.txt", 'stopwords.txt')
    totalvocab_tokenized.extend(allwords_tokenized)
print(len(totalvocab_tokenized))  # 371 tokens before deduplication, 256 after

371
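
The comment above quotes a length after deduplication, but the code only prints the total. A minimal sketch of how the deduplicated count could be obtained, using a made-up stand-in for totalvocab_tokenized:

```python
from collections import Counter

# Stand-in for totalvocab_tokenized; these tokens are made up for the demo.
tokens = ["冀中星", "首都機場", "冀中星", "酒駕", "交警", "酒駕", "冀中星"]

unique_tokens = set(tokens)
print(len(tokens), len(unique_tokens))

# Counter additionally shows which tokens repeat most often.
counts = Counter(tokens)
print(counts.most_common(2))
```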

4. Build the tf-idf matrix

from sklearn.feature_extraction.text import TfidfVectorizer
#max_df: When building the vocabulary ignore terms that have a document frequency strictly higher than the given threshold (corpus-specific stop words). If float, the parameter represents a proportion of documents, integer absolute counts. This parameter is ignored if vocabulary is not None.
#min_df: When building the vocabulary ignore terms that have a document frequency strictly lower than the given threshold. This value is also called cut-off in the literature. If float, the parameter represents a proportion of documents, integer absolute counts. This parameter is ignored if vocabulary is not None.
tfidf_vectorizer = TfidfVectorizer(max_df=0.9, max_features=200000,
                                 min_df=0.1, stop_words='english',
                                 use_idf=True, tokenizer=segment)
# terms is just the list of features used in the tf-idf matrix, i.e. the vocabulary
#terms = tfidf_vectorizer.get_feature_names_out()  # length 258 (get_feature_names() in older scikit-learn versions)
tfidf_matrix = tfidf_vectorizer.fit_transform(files)  # fit the vectorizer to the Weibo texts
print(tfidf_matrix.shape)  # (10, 258): 10 documents, 258 features

(10, 258)
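
As a rough illustration of what each matrix entry holds, the sketch below recomputes tf-idf by hand on toy data. It assumes scikit-learn's documented defaults (raw term counts, smoothed idf = ln((1 + n) / (1 + df)) + 1, then L2 normalization of each row); the documents are hand-tokenized and made up, so the numbers are not those of the Weibo corpus.

```python
import math
from collections import Counter

# Toy, already-tokenized documents (made up for illustration).
docs = [
    ["酒駕", "交警", "寶馬"],
    ["酒駕", "交警", "自首"],
    ["機場", "爆炸", "起訴"],
]
vocab = sorted({t for d in docs for t in d})
n_docs = len(docs)

# Document frequency: the number of documents that contain each term.
df = Counter(t for d in docs for t in set(d))
# Smoothed idf, following the formula stated in the lead-in.
idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in vocab}

def tfidf_row(doc):
    """Return the L2-normalized tf-idf vector of one tokenized document."""
    counts = Counter(doc)
    row = [counts[t] * idf[t] for t in vocab]
    norm = math.sqrt(sum(x * x for x in row))
    return [x / norm for x in row]

matrix = [tfidf_row(d) for d in docs]
print(len(matrix), len(vocab))
```

Terms that occur in fewer documents receive a larger idf, so in this toy corpus "寶馬" weighs more than "酒駕" within the first document.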

5. Compute document similarity

from sklearn.metrics.pairwise import cosine_similarity
# With dist we can measure the similarity between any two or more documents.
# cosine_similarity returns an array with shape (n_samples_X, n_samples_Y)
dist = 1 - cosine_similarity(tfidf_matrix)
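
For intuition, the cosine distance between two vectors can be written out directly. This is a minimal pure-Python sketch with made-up vectors, not a replacement for cosine_similarity:

```python
import math

def cosine_distance(u, v):
    """Return 1 - cos(angle between u and v)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1 - dot / (norm_u * norm_v)

a = [1.0, 2.0, 0.0]
b = [2.0, 4.0, 0.0]  # same direction as a
c = [0.0, 0.0, 3.0]  # shares no non-zero component with a

print(cosine_distance(a, b))
print(cosine_distance(a, c))
```

Since tf-idf values are non-negative, the distance between two documents always falls between 0 (same direction) and 1 (no shared terms).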

6. Obtain the clusters

from scipy.cluster.hierarchy import ward, dendrogram, linkage
import matplotlib.pyplot as plt
plt.rcParams['font.sans-serif']=['Microsoft YaHei']  # needed so that Chinese labels render correctly
# Perform Ward's linkage.
#linkage_matrix = ward(dist)  # define the linkage_matrix using ward clustering
# Method 'ward' requires the distance metric to be Euclidean.
# Note: dist is a square 2-D array, so linkage() treats each row as an observation vector
# and computes Euclidean distances between rows; it does not read dist as precomputed distances.
# (Precomputed distances must be passed in condensed form, e.g. via scipy.spatial.distance.squareform.)
linkage_matrix = linkage(dist, method='ward', metric='euclidean', optimal_ordering=False)
# Z[i] tells us which clusters were merged at step i
# Each row of the resulting array has the format [idx1, idx2, dist, sample_count]
print(linkage_matrix)
for index, title in enumerate(titles):
    print(index, title)

[[ 3.          9.          0.35350366  2.        ]
 [ 2.          4.          1.08521531  2.        ]
 [ 7.          8.          1.2902641   2.        ]
 [ 0.          5.          1.39239608  2.        ]
 [ 1.          6.          1.40430097  2.        ]
 [13.         14.          1.42131068  4.        ]
 [12.         15.          1.4744491   6.        ]
 [11.         16.          1.62772682  8.        ]
 [10.         17.          2.2853395  10.        ]]

0 #我有特別的推倒技巧#之佟掌櫃收李逍遙.txt
1 從前的稱是16兩爲一斤，爲什麼要16兩爲一斤呢？.txt
2 作死男曬酒駕.txt
3 冀中星被移送檢察院審查起訴.txt
4 寶馬男微博炫富曬酒駕挑逗交警 最終因被人肉求饒[汗].txt
5 廣場舞出口世界.txt
6 方舟子：錘子改口號，換湯不換藥！.txt
7 杜汶澤宣佈暫別香港 (2).txt
8 杜汶澤宣佈暫別香港.txt
9 首都機場爆炸案嫌犯冀中星 移送檢方審查起訴.txt
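
Indices of 10 and above in the linkage matrix do not refer to documents: with n = 10 documents, index n + i denotes the cluster formed in row i. The sketch below walks through the matrix printed above (values copied by hand) and expands every merge into its member documents.

```python
# The linkage matrix printed above, as (idx1, idx2, distance, sample_count).
Z = [
    (3, 9, 0.35350366, 2),
    (2, 4, 1.08521531, 2),
    (7, 8, 1.2902641, 2),
    (0, 5, 1.39239608, 2),
    (1, 6, 1.40430097, 2),
    (13, 14, 1.42131068, 4),
    (12, 15, 1.4744491, 6),
    (11, 16, 1.62772682, 8),
    (10, 17, 2.2853395, 10),
]
n = len(Z) + 1  # number of original documents

# Start with one singleton cluster per document, then replay the merges.
members = {i: {i} for i in range(n)}
for step, (a, b, d, size) in enumerate(Z):
    members[n + step] = members[a] | members[b]
    assert len(members[n + step]) == size
    print(n + step, sorted(members[n + step]), d)
```

Read this way, the first merge pairs the two posts about 冀中星 (3 and 9), followed by the two drunk-driving posts (2 and 4) and the two 杜汶澤 posts (7 and 8).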

7. Visualization

plt.figure(figsize=(25, 10))
plt.title('Hierarchical clustering dendrogram of Chinese texts')
plt.xlabel('Weibo post title')
plt.ylabel('Distance (lower means more similar texts)')
dendrogram(
    linkage_matrix,
    labels=titles, 
    leaf_rotation=-70,  # rotates the x axis labels
    leaf_font_size=12  # font size for the x axis labels
)
plt.show()
plt.close()
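
The dendrogram shows the whole hierarchy. To get an explicit cluster label per document, one option is to cut the tree at a distance threshold with scipy's fcluster. This is a minimal sketch: the linkage values are copied from the output in step 6, and the threshold of 1.2 is an arbitrary choice for the demo, not a recommended value.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster

# Linkage matrix copied from the printed output in step 6.
Z = np.array([
    [3, 9, 0.35350366, 2],
    [2, 4, 1.08521531, 2],
    [7, 8, 1.2902641, 2],
    [0, 5, 1.39239608, 2],
    [1, 6, 1.40430097, 2],
    [13, 14, 1.42131068, 4],
    [12, 15, 1.4744491, 6],
    [11, 16, 1.62772682, 8],
    [10, 17, 2.2853395, 10],
], dtype=float)

threshold = 1.2  # arbitrary cutoff for this demo
# Documents whose merge distance is at most the threshold share a label.
labels = fcluster(Z, t=threshold, criterion='distance')
print(labels)
```

At this cutoff only the first two merges fall below the threshold, so posts 3 and 9 share a label, posts 2 and 4 share another, and every other post stays in its own cluster.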

栗子ma
