Python 3.6: Counting Word Frequencies Across Multiple Texts with NLTK

Original

2020-07-03 17:45

#!/usr/bin/env python
# encoding: utf-8

"""
@author: wg
@software: PyCharm
@file: word_frequency_statistics.py
@time: 2017/3/16 0016 10:46
"""

import os
import nltk

'''
Count word frequencies across multiple texts with NLTK
'''

dirs = os.listdir('../../data/大秦帝國/') # list the entries of the root directory
dictionary = {} # empty dict that accumulates the final word frequencies
stopwords = ['、','（','）','，','。','：','“','”','\n\u3000','\u3000','的','‘','’'] # stop words

'''
# Earlier version for a nested directory layout, kept disabled inside a string literal.
def process():
    for d in dirs: # iterate over the folders under the root directory
        subdir = os.listdir('../../data/大秦帝國/')
        for f in subdir: # iterate over the files in the folder
            text = open('', 'r', encoding='utf-8').read() # read the text content
            print('D:/sogouOutput/'+d+'/'+f)
            fredist = nltk.FreqDist(text.split(' ')) # word frequencies for this single file

            for localkey in fredist.keys(): # merge all counts: add to an existing entry, otherwise create one
                if localkey in stopwords: # skip stop words
                    print('--> stop word:', localkey)
                    continue
                if localkey in dictionary.keys(): # is the word already in the dict?
                    dictionary[localkey] = dictionary[localkey] + fredist[localkey] # yes: add this file's count to the running total
                    print('--> existing key:', localkey, dictionary[localkey])
                else: # no: the word is not in the dict yet
                    dictionary[localkey] = fredist[localkey] # store this file's count as the initial value
                    print('--> new key:', localkey, dictionary[localkey])
        print('===================================================')
    print(sorted(dictionary.items(), key = lambda  x:x[1])) # sort the dict items by frequency and print them
'''

def process():
    data_dir = '../../data/wordcloud/'
    for f in os.listdir(data_dir): # iterate over the files in the directory
        path = os.path.join(data_dir, f)
        with open(path, 'r', encoding='utf-8') as fh: # the with block closes the file after reading
            text = fh.read() # read the text content
        print(path)
        fredist = nltk.FreqDist(text.split(' ')) # word frequencies for this single file (tokens are separated by single spaces)

        for localkey in fredist.keys(): # merge all counts: add to an existing entry, otherwise create one
            if localkey in stopwords: # skip stop words
                print('--> stop word:', localkey)
                continue
            if localkey in dictionary: # is the word already in the dict?
                dictionary[localkey] = dictionary[localkey] + fredist[localkey] # yes: add this file's count to the running total
                print('--> existing key:', localkey, dictionary[localkey])
            else: # no: the word is not in the dict yet
                dictionary[localkey] = fredist[localkey] # store this file's count as the initial value
                print('--> new key:', localkey, dictionary[localkey])
    print('===================================================')
    print(sorted(dictionary.items(), key=lambda x: x[1])) # sort by frequency (ascending) and print

if __name__ == '__main__':
    process()
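
The merge loop in the script can also be written more compactly. `nltk.FreqDist` is a subclass of `collections.Counter`, so the following is a minimal sketch of the same idea using only `Counter` from the standard library. The function name `merge_frequencies`, the shortened stop-word set, and the two sample strings are assumptions made up for illustration; they are not part of the original script.

```python
from collections import Counter

# Hypothetical stop-word set for illustration (a subset of the list in the script).
STOPWORDS = {'，', '。', '的'}

def merge_frequencies(texts, stopwords=STOPWORDS):
    """Count whitespace-separated tokens across several texts, skipping stop words."""
    total = Counter()
    for text in texts:
        # Counter.update adds counts to existing keys and creates missing ones,
        # which replaces the explicit if/else merge.
        total.update(tok for tok in text.split() if tok not in stopwords)
    return total

# Two tiny pre-segmented sample texts (made up for this sketch).
samples = ['秦 的 變法 ， 商鞅 變法', '商鞅 入 秦 。']
freq = merge_frequencies(samples)

# most_common(n) returns the n most frequent (token, count) pairs, highest first.
print(freq.most_common(3))
```

Note that `text.split()` with no argument splits on any whitespace, including `\u3000` and newlines, and drops empty tokens, whereas the script's `text.split(' ')` splits on single spaces only. Swapping `Counter` for `nltk.FreqDist` should work the same way here, since `update` and `most_common` are inherited from `Counter`.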
