Drawing word clouds in Python from a word-frequency dictionary or a string

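For work, I needed to compute word frequencies over an existing set of news data and draw a word cloud. I am most comfortable with Python, so I did not use any of the web tools that can generate word clouds. Because my dataset is fairly large, letting the library count words automatically from a single string does not suit me. Instead, I read the data from a file myself, count the words, and pass the resulting word-frequency dictionary to a function that draws the cloud.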
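
As a rough illustration of that idea, the sketch below accumulates counts one document at a time, so the whole corpus never has to be joined into a single string. The function name `count_words` and the two sample strings are made up for this example; the resulting mapping is the kind of word-frequency dictionary that can be handed to wordcloud's `generate_from_frequencies`.

```python
from collections import Counter


def count_words(texts):
    """Accumulate word counts over an iterable of strings.

    `texts` can be a generator, so documents are processed one at a time.
    """
    counts = Counter()
    for text in texts:
        # split() with no argument splits on any run of whitespace
        counts.update(text.split())
    return counts


# Two made-up documents, only for demonstration
docs = ["access restored ban remains", "government order access"]
freqs = count_words(docs)
print(freqs["access"], len(freqs))
```

A `Counter` is a `dict` subclass, so it can be used wherever a plain frequency dictionary is expected.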
References:

  1. Implementing a word-frequency (word cloud) chart in Python
  2. Generating word clouds in Python with the wordcloud package (English corpus)

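The code in this post is adapted from the two blog posts above. The corpus is English, so jieba word segmentation is not needed and there are no font issues. To fit a mask, the resulting image is kept fairly simple. If you need to tune the parameters, see the following article, which explains what each wordcloud parameter means: The meaning of each WordCloud parameter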
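
The listing later in this post does not pass a `mask`, so here is a minimal sketch of what fitting one could look like. It assumes numpy is installed alongside wordcloud; the helper names `circle_mask` and `draw_with_mask` and the circular shape are my own choices for illustration. In wordcloud, white (255) entries of the mask are treated as masked out, and `width`/`height` are ignored in favour of the mask's shape.

```python
def circle_mask(size):
    """Build a size x size grid: 0 inside a circle (drawable), 255 outside (masked out)."""
    r = size / 2
    return [
        [0 if (x + 0.5 - r) ** 2 + (y + 0.5 - r) ** 2 <= r * r else 255
         for x in range(size)]
        for y in range(size)
    ]


def draw_with_mask(freqs, out_path, size=600):
    """Draw a word cloud from a frequency dict inside a circular mask."""
    # Imported here so the mask helper above can be used without these packages
    import numpy as np
    from wordcloud import WordCloud

    mask = np.array(circle_mask(size), dtype=np.uint8)
    wc = WordCloud(background_color="white", max_words=500, mask=mask)
    wc.generate_from_frequencies(freqs)
    wc.to_file(out_path)
```

In practice the mask is more often loaded from an image file than computed, but a generated shape keeps the example self-contained.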

Sample data

{"date":"20130131","url":"http://gulftoday.ae/portal/5308f5d3-e752-41e0-b011-4537ffe658b2.aspx","locinfo":[["Uzbekistan","UZ","UZ","41","64"]],"content":"delivering advanced defence system agency deputy defence assaying trip increase influence soviet union political trade security initiative aim tighten cooperation attempt capability soviet security bloc collective security treaty organisation combine division surplus defence ministry quoted division rocket system sending division faced criticism lack activity inception signed treaty suspending membership bloc signed contract unit war torn military","label":["military diplomacy"]}
{"date":"20130128","url":"http://enews.fergananews.com/news.php?id=2795","locinfo":[["Fergana, Farg ona, Uzbekistan","UZ","UZ03","40.3933","71.7794"]],"content":"advocate pay rare political inmate initiative independent human advocate visited inmate convicted political motif penalty enforcement colony chairman permission obtained human advocate penalty enforcement directorate ministry internal affair hold academic degree technical science born lived chairman executive council member supreme council soviet republic appointed mayor arrested criminal conspiracy","label":["jail sentence"]}
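
Each line of the file is a standalone JSON object. The sketch below parses one shortened, made-up record in the same shape as the samples above (the URL and content are placeholders) and pulls out the fields. Reading the last two `locinfo` entries as latitude and longitude is my assumption; the data itself does not label them.

```python
import json
from datetime import datetime

# A shortened, made-up record with the same keys as the sample lines
line = (
    '{"date":"20130131","url":"http://example.com/a",'
    '"locinfo":[["Uzbekistan","UZ","UZ","41","64"]],'
    '"content":"defence system agency defence",'
    '"label":["military diplomacy"]}'
)

record = json.loads(line)

# The date is stored as a YYYYMMDD string
published = datetime.strptime(record["date"], "%Y%m%d").date()

# The content is already cleaned and space-separated
words = record["content"].split()

# Assumption: the last two locinfo fields are latitude and longitude, stored as strings
name, _, _, lat, lon = record["locinfo"][0]
coords = (float(lat), float(lon))

print(published, record["label"], words, name, coords)
```

Only the `content` field is needed for the word cloud; the other fields could be used to filter records, for example by label or date.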

Result

Word cloud image

Code

# -*- coding: utf-8 -*-
import sys
import os
import codecs
import json
from collections import Counter, defaultdict
from wordcloud import WordCloud
import matplotlib.pyplot as plt

# Directory containing this script, with a trailing path separator
path = sys.path[0] + os.sep

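def wc_from_text(text, fn):
    '''Count the words in a string and draw a word cloud from it.'''
    wc = WordCloud(
        background_color="white",  # white background (the default is black)
        width=1500,  # image width
        height=960,  # image height
        margin=10  # image margin
    ).generate(text)
    plt.imshow(wc)  # draw the image
    plt.axis("off")  # hide the axes
    plt.show()  # display the image
    wc.to_file(path + fn)  # save the image next to the script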

def wc_from_word_count(word_count, fp):
    '''Draw a word cloud from a word-frequency dictionary.'''
    wc = WordCloud(
        max_words=500,  # maximum number of words to show
        # max_font_size=100,  # maximum font size
        background_color="white",  # white background (the default is black)
        width=1500,  # image width
        height=960,  # image height
        margin=10  # image margin
    )
    wc.generate_from_frequencies(word_count)  # build the cloud from the dictionary
    plt.imshow(wc)  # draw the word cloud
    plt.axis('off')  # hide the axes
    plt.show()  # display the image
    wc.to_file(fp)  # save the image

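def generate_dict_from_file(fp):
    '''Yield one dict per non-empty line of a file with one JSON object per line.'''
    with codecs.open(fp, 'r', 'utf-8') as source_file:
        for line in source_file:
            if not line.strip():  # skip blank lines, e.g. a trailing newline
                continue
            yield json.loads(line)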

def main(data_fp, pic_fp):
    word_count = defaultdict(int)
    for dic in generate_dict_from_file(data_fp):
        # split() with no argument also drops empty strings caused by repeated spaces
        for word in dic['content'].split():
            word_count[word] += 1
    # Save the counts so the cloud can be redrawn later without recounting
    with codecs.open(path + 'word_count.json', 'w', 'utf-8') as f:
        json.dump(word_count, f, ensure_ascii=False)
    wc_from_word_count(word_count, pic_fp)

if __name__ == '__main__':
    s = 'access restored ban remains blocked government order accessible aid proxy provider telecom restored access celebrating government revoked censorship order newsroom waiting appeal court lawsuit government allowed constitution reporting stringer spread dedication journalism critical reporting brought outlet respect recognition landed blacklist authoritarian regime dominate permanently blocked severe intolerance critical journalism authority deny domestic access occasional basis regional outlet sensitive issue incident hard technical glitch deliberately blocked access depending covered government corruption human abuse social discontent policy freedom protested blocked violent conflict ethnic resident authority imposed permanent ban parliament resolution lawmaker addressed conflict recommended action government resolution reason obtaining court order law shutting outlet introduce measure domain space resolution authority'
    # wc_from_text(s, 'wc1.jpg')
    # word_count = Counter(s.split(' '))
    # wc_from_word_count(word_count, 'wc2.jpg')
    data_fp = path + 'result.json'
    pic_fp = path + 'word_cloud_uz.jpg'
    main(data_fp, pic_fp)
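
Since `main` saves the counts to `word_count.json`, the cloud can be redrawn later without recounting. The sketch below reloads such a file and trims the dictionary before drawing. The helper names `load_word_count` and `top_words` and the `min_count` threshold are hypothetical additions for illustration, not part of the code above.

```python
import json
from collections import Counter


def load_word_count(fp):
    """Load a word-frequency dictionary saved with json.dump."""
    with open(fp, encoding="utf-8") as f:
        return json.load(f)


def top_words(word_count, n=500, min_count=1):
    """Keep at most the n most frequent words that occur at least min_count times."""
    ranked = Counter(word_count).most_common(n)
    return {word: count for word, count in ranked if count >= min_count}


# Small made-up dictionary, only for demonstration
sample = {"government": 5, "access": 4, "glitch": 1}
print(top_words(sample, n=2))
```

The trimmed dictionary can then be passed to `wc_from_word_count` in place of the full one, for example `wc_from_word_count(top_words(load_word_count(path + 'word_count.json')), pic_fp)`.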

That's all. Comments and discussion are welcome.
