Word Clouds in Python

Original post

2019-02-01 15:18

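Word Clouds

While getting started with NLP, I experimented with text-generation models, building example models at both the character level and the word level. Going back to the words themselves, visualizing word frequencies may help with the early stages of analysis. Word clouds like these also work well as decoration in a slide deck and can make a presentation easier to interpret. This post briefly records how I drew them, for my own future reference.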

import os
import numpy as np
np.random.seed(123)
os.environ['CUDA_VISIBLE_DEVICES'] = ""  # hide all GPUs so everything runs on the CPU

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, RNN, Dense, Activation
from tensorflow.keras.optimizers import RMSprop,Adam

import jieba
import nltk

import matplotlib as mpl
# mpl.rcParams["font.sans-serif"] = [u"SimHei"]
# mpl.rcParams['axes.unicode_minus'] = False
import matplotlib.pyplot as plt
%matplotlib inline

myfont = mpl.font_manager.FontProperties(fname='/usr/share/fonts/opentype/noto/NotoSansCJK-Bold.ttc') 
mpl.rcParams['axes.unicode_minus'] = False

Data Preparation

# Read the whole file; pass a size to read() (e.g. read(10000)) to limit the number of characters
with open('./data/excise_caixin.txt', encoding='utf-8') as alltext:
    alltext_use = alltext.read()

alltext_use = alltext_use.replace('\n','').replace('\u3000', '').replace('\ufeff', '')

Word Segmentation

test1 = jieba.cut(alltext_use[:200])
'/'.join(test1)

'在/近期/陸續/召開/2019/年度/工作/會上/，/電力/、/石油/、/天然氣/、/鐵路/、/民航/、/電信/、/軍工/等/重點/領域/央企/紛紛表示/將/加快/建設/世界/一流/企業/，/在/加大/投資/力度/“/補短/板/、/穩/增長/”/的/同時/，/還/明確/了/混改/計劃/，/積極/引入/社會/資本/。/業內/專家/表示/，/這/意味着/重點/領域/混改/真正/向/縱深/推進/。/打造/“/世界/一流/”/企業/十九/大/報告/提出/，/深化/國有企業/改革/，/發展/混合/所有制/經濟/，/培育/具有/全球/競爭力/的/世界/一流/企業/。/《/經濟/參考報/》/記者/瞭解/到/，/國資委/選定/航天/科'

# lcut returns the tokens as a list directly, so tokens containing '/' are not split by a join/split round trip
alltext_use2 = jieba.lcut(alltext_use)

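# Build the character sequences: each window of maxlen characters is paired with the character that follows it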
maxlen = 100
step = 3
sentences = []
next_chars = []
for i in range(0, len(alltext_use) - maxlen, step):
    sentences.append(alltext_use[i:i+maxlen])
    next_chars.append(alltext_use[i+maxlen])
print('nb_sequences:', len(sentences))

nb_sequences: 924
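
To make the windowing above concrete, here is a minimal sketch on a toy string. The string and the small `maxlen` and `step` values are illustrative assumptions, not the article's data, and `build_sequences` is a hypothetical helper that wraps the same loop.

```python
def build_sequences(text, maxlen, step):
    """Slide a window of `maxlen` characters over `text`, moving `step` at a time."""
    sentences, next_chars = [], []
    for i in range(0, len(text) - maxlen, step):
        sentences.append(text[i:i + maxlen])   # input window
        next_chars.append(text[i + maxlen])    # character that follows the window
    return sentences, next_chars

toy_sentences, toy_next = build_sequences("abcdefghij", maxlen=4, step=3)
print(toy_sentences)  # ['abcd', 'defg']
print(toy_next)       # ['e', 'h']
```

When the text is longer than `maxlen`, the loop produces ceil((len(text) - maxlen) / step) windows, so 924 windows with maxlen = 100 and step = 3 suggests the cleaned text is roughly 2,870 characters long.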

Visualization and Analysis

References:
Quick Recipe: Building Word Clouds
Visual Text Analytics With Python

from wordcloud import WordCloud

wc = WordCloud(font_path='/usr/share/fonts/truetype/arphic/uming.ttc',max_font_size=80).generate(' '.join(alltext_use2))
# wc = WordCloud(max_font_size=60).generate(' '.join(alltext_use2))

fig = plt.figure(figsize=(10,10))
plt.imshow(wc, interpolation='bilinear')
plt.axis("off")
# plt.title(u'你好的')
plt.xlabel(u"橫座標xlabel",fontproperties=myfont)
plt.show()

from imageio import imread

Jpg = imread('/home/iot/myapp/LQK_Files/data/linshi.jpg')
wc = WordCloud(font_path='/usr/share/fonts/truetype/arphic/uming.ttc',mask=Jpg, background_color='white',max_font_size=80).generate(' '.join(alltext_use2))
# wc = WordCloud(max_font_size=60).generate(' '.join(alltext_use2))

fig = plt.figure(figsize=(10,10))
plt.imshow(wc, interpolation='bilinear')
plt.axis("off")
# plt.title(u'你好的')
plt.xlabel(u"橫座標xlabel",fontproperties=myfont)
plt.show()
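
The frequency table later in this post shows punctuation such as `，` and `。` among the most frequent tokens. `WordCloud.generate` tokenizes the joined string itself, so one alternative is to count the jieba tokens directly and pass the counts to `WordCloud.generate_from_frequencies`. The sketch below is an assumption about how that could look, not the workflow used above; `is_punct_or_space`, `token_frequencies` and `render_cloud` are hypothetical helper names, and the sample token list is made up.

```python
from collections import Counter
import unicodedata

def is_punct_or_space(token):
    """True if the token is empty or made only of punctuation/whitespace."""
    return all(ch.isspace() or unicodedata.category(ch).startswith("P") for ch in token)

def token_frequencies(tokens):
    """Count tokens, skipping pure punctuation and whitespace."""
    return Counter(t for t in tokens if not is_punct_or_space(t))

def render_cloud(frequencies, font_path):
    """Build a word cloud from a {token: count} mapping (needs the wordcloud package)."""
    from wordcloud import WordCloud  # imported lazily so the counting code runs without it
    return WordCloud(font_path=font_path, max_font_size=80).generate_from_frequencies(frequencies)

sample_tokens = ["企業", "，", "混改", "企業", "。", " ", "混改", "企業"]
freqs = token_frequencies(sample_tokens)
print(freqs.most_common())  # [('企業', 3), ('混改', 2)]
```

With the real data this would be called as `render_cloud(token_frequencies(alltext_use2), '/usr/share/fonts/truetype/arphic/uming.ttc')`, and the result can be passed to `plt.imshow` as before.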

Building the Network Model

help(nltk.pos_tag_sents)

Help on function pos_tag_sents in module nltk.tag:

pos_tag_sents(sentences, tagset=None, lang='eng')
    Use NLTK's currently recommended part of speech tagger to tag the
    given list of sentences, each consisting of a list of tokens.
    
    :param tokens: List of sentences to be tagged
    :type tokens: list(list(str))
    :param tagset: the tagset to be used, e.g. universal, wsj, brown
    :type tagset: str
    :param lang: the ISO 639 code of the language, e.g. 'eng' for English, 'rus' for Russian
    :type lang: str
    :return: The list of tagged sentences
    :rtype: list(list(tuple(str, str)))

# import seaborn as sns
# sns.set(font='YaHei Consolas Hybrid')

plt.rcParams['font.sans-serif'] = ['YaHei Consolas Hybrid']

# get all words for government corpus
corpus_genre = 'government'
freqdist = nltk.FreqDist(alltext_use2)
plt.figure(figsize=(16,5))
freqdist.plot(50)

import pandas as pd
fd = pd.DataFrame(list(freqdist.items()), columns=['name', 'times'])
fd = fd.sort_values(by='name', ascending=False)
fd[fd['times']>5]

	name	times
161	；	6
8	，	126
19	領域	13
87	集團	8
18	重點	11
132	要	21
51	表示	7
89	能源	6
69	經濟	6
17	等	10
11	石油	8
37	的	33
42	混改	13
67	混合	6
63	深化	7
135	更加	6
65	改革	13
62	提出	7
57	推進	11
29	投資	12
68	所有制	6
24	建設	10
110	年	22
6	工作	14
22	將	7
204	實施	6
0	在	14
80	國資委	6
114	和	15
140	合作	6
66	發展	10
283	勘探	6
23	加快	8
28	加大	8
30	力度	8
122	公司	12
27	企業	18
84	中國	11
25	世界	12
516	與	6
26	一流	11
48	。	49
10	、	59
36	”	11
31	“	11
4	2019	19
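
The table above is ordered by `name`, which makes the most frequent tokens hard to spot. As a rough sketch, a few rows copied from it can be ranked by count in plain Python; in pandas the corresponding change would be `sort_values(by='times', ascending=False)`. The `rows` list is a hand-picked subset used only for illustration.

```python
# A few (name, times) rows copied from the table above
rows = [("領域", 13), ("企業", 18), ("混改", 13), ("年", 22), ("改革", 13), ("2019", 19)]

# Sort by count, highest first; ties keep their original order because sorted() is stable
ranked = sorted(rows, key=lambda row: row[1], reverse=True)
for name, times in ranked:
    print(name, times)
```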

Training and Checking

alltext_use.count('公司')

from nltk import FreqDist
plt.figure(figsize=(16,5))

fdist = FreqDist(alltext_use)
fdist.plot()
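
Note that `alltext_use` is a string, so `FreqDist(alltext_use)` counts individual characters, while `alltext_use.count('公司')` counts substring occurrences; neither is the same as counting segmented tokens. A small sketch with `collections.Counter` (which `FreqDist` subclasses) shows the difference. The sample sentence and its hand-written segmentation are assumptions for illustration, not jieba output.

```python
from collections import Counter

sample_text = "分公司和總公司"
sample_tokens = ["分公司", "和", "總公司"]  # hypothetical segmentation

char_counts = Counter(sample_text)      # iterating a string yields characters
token_counts = Counter(sample_tokens)   # iterating a list yields whole tokens

print(char_counts["公"])          # 2
print(sample_text.count("公司"))  # 2, substring matches inside longer words
print(token_counts["公司"])       # 0, "公司" never appears as a token on its own
```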

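One problem remains: how can Chinese titles be displayed correctly on an Ubuntu server? (As shown above, the usual settings for Chinese text were tried, and none of them worked.)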
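
One possible approach to this question, offered as a sketch rather than a verified fix: register the font file with matplotlib's font manager and put its family name at the front of `font.sans-serif`. It is also worth checking that the text is drawn at all, since `plt.axis("off")` hides axis labels, so the `xlabel` in the word cloud cells would not appear regardless of the font, whereas a title is unaffected. The helper names `first_existing` and `use_cjk_font` are hypothetical, and the candidate paths are the ones used earlier in this post.

```python
import os

# Font files referenced earlier in this post; adjust to what is installed on the server
CANDIDATE_FONTS = [
    "/usr/share/fonts/opentype/noto/NotoSansCJK-Bold.ttc",
    "/usr/share/fonts/truetype/arphic/uming.ttc",
]

def first_existing(paths):
    """Return the first path that exists on disk, or None."""
    for path in paths:
        if os.path.exists(path):
            return path
    return None

def use_cjk_font(font_path):
    """Register a font file and make it matplotlib's default sans-serif family."""
    import matplotlib as mpl
    from matplotlib import font_manager
    font_manager.fontManager.addfont(font_path)  # assumes matplotlib 3.2 or newer
    family = font_manager.FontProperties(fname=font_path).get_name()
    mpl.rcParams["font.family"] = "sans-serif"
    mpl.rcParams["font.sans-serif"] = [family] + list(mpl.rcParams["font.sans-serif"])
    mpl.rcParams["axes.unicode_minus"] = False
    return family

font_path = first_existing(CANDIDATE_FONTS)
print("Font file found:", font_path)
```

If a font file is found, calling `use_cjk_font(font_path)` before plotting and then using `plt.title(u'你好的')` instead of `plt.xlabel(...)` is a quick way to test whether Chinese text renders.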
