Jieba Word Segmentation and Word Clouds in Python

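        One month into my new job, my manager asked me to turn crawled data into a competitor analysis report. The report uses data bars for comparison, line charts for trends, Nightingale rose charts for category composition, bubble charts for GMV, word clouds for quality issues, and box plots for price bands. On top of that, I remembered that jieba, which I had played with last year, would come in handy for word frequency analysis. The code is below: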

First, segment the text and count how often each word appears:

# Import the library
import jieba
# Load the text data
with open(r"C:\Users\Administrator\Desktop\分詞.txt", "r", encoding="utf-8") as f:
    txt = f.read()
# Segment the text
words = jieba.lcut(txt)
counts = {}
for word in words:
    # Skip single-character tokens
    if len(word) == 1:
        continue
    counts[word] = counts.get(word, 0) + 1
items = list(counts.items())
items.sort(key=lambda x: x[1], reverse=True)
# Print the top N words; slicing keeps the index in range when there are fewer than N
# Format string adapted from: https://www.zhihu.com/question/424396838/answer/1509880609
# In print("{0:<10}{1:>5}".format(para1, para2)), {0} and {1} refer to para1 and para2.
# < means left-aligned and > means right-aligned, so {0:<10} left-aligns para1 in
# 10 positions and {1:>5} right-aligns para2 in 5 positions.
for word, count in items[:10000]:
    print("{0:<10}{1:>5}".format(word, count))
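
The counting loop above can also be expressed with `collections.Counter` from the standard library. The sketch below is one possible variant rather than the code used for the report: `top_words` and `format_row` are hypothetical helper names, and the input is assumed to be a list of tokens like the one returned by `jieba.lcut`. Using `most_common(n)` avoids the out-of-range problem when the text has fewer than `n` distinct words.

```python
from collections import Counter


def top_words(tokens, n=10, min_len=2):
    """Return the n most frequent tokens that have at least min_len characters."""
    counts = Counter(tok for tok in tokens if len(tok) >= min_len)
    # most_common simply returns fewer items if fewer than n distinct tokens exist
    return counts.most_common(n)


def format_row(word, count):
    """Left-align the word in 10 positions and right-align the count in 5."""
    return "{0:<10}{1:>5}".format(word, count)


if __name__ == "__main__":
    # Stand-in for jieba.lcut(txt); these tokens are made up for illustration
    tokens = ["質量", "很", "好", "質量", "物流", "的", "物流", "質量"]
    for word, count in top_words(tokens, n=5):
        print(format_row(word, count))
```

Note that the padding in the format string counts characters, not display width, so columns containing Chinese words may not line up exactly in a terminal where each CJK character takes two cells.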

 

Visualization (generating the word cloud image):

# Import libraries
import jieba
import wordcloud
import matplotlib.pyplot as plt
from imageio import imread

# Read the mask image
mask = imread(r"C:\Users\Administrator\Desktop\五角星.jpg")

# Read the data
with open(r"C:\Users\Administrator\Desktop\分詞.txt", "r", encoding="utf-8") as f:
    t = f.read()
ls = jieba.lcut(t)
txt = " ".join(ls)

# Configure the word cloud (when a mask is given, its shape overrides width and height)
w = wordcloud.WordCloud(font_path="msyh.ttc", mask=mask, width=800, height=600, background_color="yellow")

# Generate the word cloud and export it as an image
w.generate(txt)
w.to_file(r"C:\Users\Administrator\Desktop\生成的詞雲圖.png")

# Render the result
plt.subplots(figsize=(18, 12))
plt.imshow(w)
plt.axis("off")
plt.show()
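
Because the first step already produces word counts, a cloud can also be drawn from frequencies instead of a space-joined string. What follows is a rough sketch under a few assumptions: `build_frequencies` and `render_cloud` are hypothetical helpers, the font and output paths are placeholders you would supply yourself, and `WordCloud.generate_from_frequencies` is swapped in for the `generate` call used above.

```python
from collections import Counter


def build_frequencies(tokens, stop_words=frozenset(), min_len=2):
    """Count tokens, dropping stop words and tokens shorter than min_len."""
    return dict(
        Counter(
            tok for tok in tokens
            if len(tok) >= min_len and tok not in stop_words
        )
    )


def render_cloud(frequencies, font_path, out_path):
    """Draw a word cloud from a {word: count} mapping and save it as an image."""
    # Imported lazily so the counting helper works without wordcloud installed
    from wordcloud import WordCloud

    cloud = WordCloud(font_path=font_path, background_color="white")
    cloud.generate_from_frequencies(frequencies)
    cloud.to_file(out_path)
    return cloud
```

With the variables from the listing above, a call might look like `render_cloud(build_frequencies(ls), "msyh.ttc", "cloud.png")`. Working from a frequency mapping keeps the filtering rules in plain Python, where they are easy to inspect before anything is drawn.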

 

# Complete code

# Import the required libraries
import jieba
from wordcloud import WordCloud
import matplotlib.pyplot as plt
from PIL import Image
import numpy as np

# Load the text and segment it
with open(r"C:\Users\Administrator\Desktop\分詞.txt", encoding="utf-8") as f:
    text = f.read()
text = text.replace("\n", "").replace("\u3000", "")
text_cut = jieba.lcut(text)
text_cut = " ".join(text_cut)

# Load the stop words, one per line
# NOTE: this path is the same file as the text above; point it at your own stop word list
with open(r"C:\Users\Administrator\Desktop\分詞.txt", encoding="utf-8") as f:
    stop_words = set(f.read().split("\n"))

# Read the background image (Chinese characters in the path are fine)
background = Image.open(r"C:\Users\Administrator\Desktop\五角星.jpg")
graph = np.array(background)

word_cloud = WordCloud(font_path="msyh.ttc",  # original value: simsun.ttc / msyh.ttc
                       background_color="white",
                       mask=graph,  # sets the shape of the word cloud
                       stopwords=stop_words)

# Render the result
word_cloud.generate(text_cut)
plt.subplots(figsize=(12, 8))
plt.imshow(word_cloud)
plt.axis("off")
plt.show()
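
The `stopwords` argument takes a collection of words to leave out of the cloud. Here is a minimal sketch of loading such a list, assuming a UTF-8 text file with one stop word per line. `load_stop_words` is a hypothetical helper, and the post does not say which stop word file was actually used.

```python
def load_stop_words(path, encoding="utf-8"):
    """Read one stop word per line, skipping blank lines and trimming whitespace."""
    with open(path, encoding=encoding) as f:
        return {line.strip() for line in f if line.strip()}
```

The returned set can be passed directly as `stopwords=load_stop_words(...)`. A set also removes duplicate entries and makes membership checks cheap if you filter the tokens yourself.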

 

 

Reference: https://www.bbsmax.com/A/amd0w0bD5g/

 
