The Python word-cloud generator module, wordcloud, has gone from v1.0 in 2015 to v1.7 today.
To download it, see: https://pypi.org/project/wordcloud/
A simple wordcloud example:
import jieba
import wordcloud
w = wordcloud.WordCloud(
    width=600,
    height=600,
    background_color='white',
    font_path='msyh.ttc'  # a font that contains Chinese glyphs (Microsoft YaHei here)
)
text = '看到此標題,我也是感慨萬千 首先弄清楚搞IT和被IT搞,誰是搞IT的?馬雲就是,馬化騰也是,劉強東也是,他們都是叫搞IT的, 但程序員只是被IT搞的人,可以比作蓋樓砌磚的泥瓦匠,你想想,四十歲的泥瓦匠能跟二十左右歲的年輕人較勁嗎?如果你是老闆你會怎麼做?程序員只是技術含量高的泥瓦匠,社會是現實的,社會的現實是什麼?利益驅動。當你跑的速度不比以前快了時,你就會被挨鞭子趕,這種窘境如果在做程序員當初就預料到的話,你就會知道,到達一定高度時,你需要改變行程。 程序員其實真的不是什麼好職業,技術每天都在更新,要不停的學,你以前學的每天都在被淘汰,加班可能是標配了吧。 熱點,你知道什麼是熱點嗎?社會上啥熱就是熱點,我舉幾個例子:在早淘寶之初,很多人都覺得做淘寶能讓自己發展,當初的規則是產品按時間輪候展示,也就是你的商品上架時間一到就會被展示,不論你星級多高。這種一律平等的條件固然好,但淘寶隨後調整了顯示規則,對產品和店鋪,銷量進行了加權,一下導致小賣家被弄到了很深的衚衕裏,沒人看到自己的產品,如何賣?做廣告費用也非常高,入不敷出,想必做過淘寶的都知道,再後來淘寶弄天貓,顯然,天貓是上檔次的商城,不同於淘寶的擺地攤,因爲攤位費漲價還鬧過事,鬧也白鬧,你有能力就弄,沒能力就淘汰掉。前幾天淘寶又推出C2M,客戶反向定製,客戶直接掛鉤大廠家,沒你小賣傢什麼事。 後來又出現了微商,在微商出現當天我就知道這東西不行,它比淘寶假貨還下三濫.我對TX一直有點偏見,因爲騙子都使用QQ 我說這麼多隻想說一個事,世界是變化的,你只能適應變化,否則就會被淘汰。 還是回到熱點這個話題,育兒嫂這個職位有很多人瞭解嗎?前幾年放開二胎後,這個職位迅速串紅,我的一個親戚初中畢業,現在已經月入一萬五,職務就是照看剛出生的嬰兒28天,節假日要雙薪。 你說這難到讓我一個男的去當育兒嫂嗎?扯,我只是說熱點問題。你沒踩在熱點上,你賺錢就會很費勁 這兩年的熱點是什麼?短視頻,你可以看到抖音的一些作品根本就不是普通人能實現的,說明專業級人才都開始努力往這上使勁了。 我只會編程,別的不會怎麼辦?那你就去編程。沒人用了怎麼辦?你看看你自己能不能僱傭你自己 學會適應社會,學會改變自己去適應社會 最後說一句:科大訊飛的劉鵬說的是對的。那我爲什麼還做程序員?他可以完成一些原始積累,只此而已。'
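# Segment the Chinese text with jieba, then join the tokens with spaces so that wordcloud can split them
new_str = ' '.join(jieba.lcut(text))
w.generate(new_str)
# Render the word cloud and save it as an image
w.to_file('x.png')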
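Why join the jieba tokens with spaces? The default tokenizer pattern quoted later in process_text is r"\w[\w']+", and in Python 3 \w also matches Chinese characters, so an unsegmented run of Chinese text comes out as one long "word". The sketch below uses only the standard library and a hand-segmented sentence (an assumption standing in for jieba's output) to show the difference. The helper name default_tokens is my own.

```python
import re

# The default pattern that appears in wordcloud's process_text (v1.7 source quoted in this post)
DEFAULT_PATTERN = r"\w[\w']+"


def default_tokens(text):
    """Tokenize text the way the default pattern would (hypothetical helper)."""
    return re.findall(DEFAULT_PATTERN, text)


# Unsegmented Chinese: each run between punctuation marks becomes a single token
unsegmented = default_tokens("我愛編程,編程愛我")

# Hand-segmented stand-in for ' '.join(jieba.lcut(...)): tokens separated by spaces
segmented = default_tokens("我 愛 編程 , 編程 愛 我")

print(unsegmented)
print(segmented)
```

Note that the pattern needs at least two characters per token, so single-character words are dropped even after segmentation.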
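Now let's analyze the source.
The main steps wordcloud takes to produce a word-cloud image are:
1. Split the text into words
2. Generate the word cloud
3. Save the image
Starting from generate(self, text), we find that it merely calls another method on the same object, self.generate_from_text(text):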
def generate_from_text(self, text):
    """Generate wordcloud from text.
    """
    words = self.process_text(text)  # split the text into words and count them
    self.generate_from_frequencies(words)  # the main word-cloud method (analyzed in detail below)
    return self
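The source of process_text() is shown below. Its logic is fairly simple: tokenize the text, strip trailing 's, drop numbers, drop short words, remove stopwords, and count what remains.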
def process_text(self, text):
    """Splits a long text into words, eliminates the stopwords.
    Parameters
    ----------
    text : string
        The text to be processed.
    Returns
    -------
    words : dict (string, int)
        Word tokens with associated frequency.
    ..versionchanged:: 1.2.2
        Changed return type from list of tuples to dict.
    Notes
    -----
    There are better ways to do word tokenization, but I don't want to
    include all those things.
    """
    flags = (re.UNICODE if sys.version < '3' and type(text) is unicode else 0)
    regexp = self.regexp if self.regexp is not None else r"\w[\w']+"
    # Tokenize
    words = re.findall(regexp, text, flags)
    # Strip a trailing 's
    words = [word[:-2] if word.lower().endswith("'s") else word for word in words]
    # Drop numbers
    if not self.include_numbers:
        words = [word for word in words if not word.isdigit()]
    # Drop short words: anything shorter than min_word_length is filtered out
    if self.min_word_length:
        words = [word for word in words if len(word) >= self.min_word_length]
    # Remove stopwords
    stopwords = set([i.lower() for i in self.stopwords])
    if self.collocations:
        word_counts = unigrams_and_bigrams(words, stopwords, self.normalize_plurals, self.collocation_threshold)
    else:
        # remove stopwords
        words = [word for word in words if word.lower() not in stopwords]
        word_counts, _ = process_tokens(words, self.normalize_plurals)
    return word_counts
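To see the filtering pipeline on its own, here is a minimal standalone sketch that follows the same order of operations. It is a simplification, not the library's code: the function name simple_process_text is my own, and it replaces unigrams_and_bigrams / process_tokens with a plain Counter, so it does not merge case variants or plurals and does not detect two-word collocations.

```python
import re
from collections import Counter


def simple_process_text(text, stopwords=(), min_word_length=0,
                        include_numbers=False):
    """Simplified stand-in for process_text (hypothetical helper)."""
    # Tokenize with the same default pattern as the quoted source
    words = re.findall(r"\w[\w']+", text)
    # Strip a trailing 's
    words = [w[:-2] if w.lower().endswith("'s") else w for w in words]
    # Drop pure-digit tokens unless numbers are wanted
    if not include_numbers:
        words = [w for w in words if not w.isdigit()]
    # Drop tokens shorter than min_word_length
    if min_word_length:
        words = [w for w in words if len(w) >= min_word_length]
    # Remove stopwords, compared case-insensitively
    stop = {s.lower() for s in stopwords}
    words = [w for w in words if w.lower() not in stop]
    # Count occurrences; the real method also normalizes case and plurals
    return dict(Counter(words))


counts = simple_process_text("The cat's hat and 42 cats",
                             stopwords={"the", "and"})
print(counts)
```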
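Now for the main event.
The body of generate_from_frequencies(self, frequencies, max_font_size=None) is fairly long. Overall it breaks down into the following steps:
1. Sort
2. Normalize the word frequencies
3. Create the drawing objects
4. Determine the initial font size
5. Extend the word set
6. Determine each word's font size, position, rotation, and color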
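Steps 1, 2 and 5 in this list are pure data manipulation and can be tried in isolation. The sketch below mirrors what the library source does for them, using only the standard library; the function name normalize_and_pad and its keyword arguments are my own, chosen to echo the max_words and repeat settings.

```python
import math


def normalize_and_pad(frequencies, max_words=200, repeat=False):
    """Sort, normalize and optionally pad a {word: count} dict (hypothetical helper)."""
    # Step 1: sort by frequency, descending, and keep at most max_words entries
    items = sorted(frequencies.items(), key=lambda kv: kv[1], reverse=True)
    if not items:
        raise ValueError("need at least 1 word")
    items = items[:max_words]
    # Step 2: divide by the largest frequency so that the top word becomes 1.0
    top = float(items[0][1])
    items = [(word, freq / top) for word, freq in items]
    # Step 5: pad with repeated words, each round down-weighted by the
    # smallest normalized frequency raised to the round number
    if repeat and len(items) < max_words:
        times_extend = math.ceil(max_words / len(items)) - 1
        original = list(items)
        downweight = items[-1][1]
        for i in range(times_extend):
            items.extend((word, freq * downweight ** (i + 1))
                         for word, freq in original)
    return items


plain = normalize_and_pad({"a": 4, "b": 2, "c": 1})
padded = normalize_and_pad({"a": 4, "b": 2, "c": 1}, max_words=6, repeat=True)
print(plain)
print(padded)
```

Because every repeated round is multiplied by a power of the smallest frequency, the padded list stays in descending order, which is what the later font-size logic relies on.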
The library source is shown below, with comments added according to my own understanding:
def generate_from_frequencies(self, frequencies, max_font_size=None):
    """Create a word_cloud from words and frequencies.
    Parameters
    ----------
    frequencies : dict from string to float
        A contains words and associated frequency.
    max_font_size : int
        Use this font-size instead of self.max_font_size
    Returns
    -------
    self
    """
    # make sure frequencies are sorted and normalized
    # Step 1: sort
    # Sort the (word, frequency) pairs by frequency, descending
    frequencies = sorted(frequencies.items(), key=itemgetter(1), reverse=True)
    if len(frequencies) <= 0:
        raise ValueError("We need at least 1 word to plot a word cloud, "
                         "got %d." % len(frequencies))
    # Keep at most max_words entries; anything beyond that is discarded
    frequencies = frequencies[:self.max_words]
    # largest entry will be 1
    # The first word's frequency is the maximum frequency
    max_frequency = float(frequencies[0][1])
    # Step 2: normalize the frequencies
    # Since the words are already sorted, the result looks like:
    # [('xxx', 1), ('xxx', 0.96), ('xxx', 0.87), ...]
    frequencies = [(word, freq / max_frequency)
                   for word, freq in frequencies]
    # Random source, used below to decide whether a word is rotated by 90 degrees
    if self.random_state is not None:
        random_state = self.random_state
    else:
        random_state = Random()
    if self.mask is not None:
        boolean_mask = self._get_bolean_mask(self.mask)
        width = self.mask.shape[1]
        height = self.mask.shape[0]
    else:
        boolean_mask = None
        height, width = self.height, self.width
    # Used to find where a word can be placed, i.e. blank (text-free) areas within the valid region of the image
    occupancy = IntegralOccupancyMap(height, width, boolean_mask)
    # Step 3: create the drawing objects
    # create image
    img_grey = Image.new("L", (width, height))
    draw = ImageDraw.Draw(img_grey)
    img_array = np.asarray(img_grey)
    font_sizes, positions, orientations, colors = [], [], [], []
    last_freq = 1.
    # Step 4: determine the initial font size
    # Work out the maximum font size
    if max_font_size is None:
        # if not provided use default font_size
        max_font_size = self.max_font_size
    # If there is still no maximum font size, one has to be derived to serve as the initial size
    if max_font_size is None:
        # figure out a good font size by trying to draw with
        # just the first two words
        if len(frequencies) == 1:
            # we only have one word. We make it big!
            font_size = self.height
        else:
            # Recurse into this function to obtain a self.layout_ that holds only the first two words,
            # then compute the initial font size from the sizes those two words were drawn at
            self.generate_from_frequencies(dict(frequencies[:2]),
                                           max_font_size=self.height)
            # find font sizes
            sizes = [x[1] for x in self.layout_]
            try:
                font_size = int(2 * sizes[0] * sizes[1]
                                / (sizes[0] + sizes[1]))
            # quick fix for if self.layout_ contains less than 2 values
            # on very small images it can be empty
            except IndexError:
                try:
                    font_size = sizes[0]
                except IndexError:
                    raise ValueError(
                        "Couldn't find space to draw. Either the Canvas size"
                        " is too small or too much of the image is masked "
                        "out.")
    else:
        font_size = max_font_size
    # we set self.words_ here because we called generate_from_frequencies
    # above... hurray for good design?
    self.words_ = dict(frequencies)
    # Step 5: extend the word set
    # If repeat is enabled and there are fewer words than max_words, pad the word set by repeating words
    if self.repeat and len(frequencies) < self.max_words:
        # pad frequencies with repeating words.
        times_extend = int(np.ceil(self.max_words / len(frequencies))) - 1
        # get smallest frequency
        frequencies_org = list(frequencies)
        downweight = frequencies[-1][1]
        # Repeated words are down-weighted so that the list keeps its descending order
        for i in range(times_extend):
            frequencies.extend([(word, freq * downweight ** (i + 1))
                                for word, freq in frequencies_org])
    # Step 6: determine each word's font size, position, rotation, and color
    # start drawing grey image
    for word, freq in frequencies:
        if freq == 0:
            continue
        # select the font size
        rs = self.relative_scaling
        if rs != 0:
            font_size = int(round((rs * (freq / float(last_freq))
                                   + (1 - rs)) * font_size))
        if random_state.random() < self.prefer_horizontal:
            orientation = None
        else:
            orientation = Image.ROTATE_90
        tried_other_orientation = False
        # Look for a place to put the word. If one attempt fails, try changing the orientation
        # or shrinking the font, and look again, until a position is found or the font size
        # drops below the minimum
        while True:
            # try to find a position
            font = ImageFont.truetype(self.font_path, font_size)
            # transpose font optionally
            transposed_font = ImageFont.TransposedFont(
                font, orientation=orientation)
            # get size of resulting text
            box_size = draw.textsize(word, font=transposed_font)
            # find possible places using integral image:
            result = occupancy.sample_position(box_size[1] + self.margin,
                                               box_size[0] + self.margin,
                                               random_state)
            if result is not None or font_size < self.min_font_size:
                # either we found a place or font-size went too small
                break
            # if we didn't find a place, make font smaller
            # but first try to rotate!
            if not tried_other_orientation and self.prefer_horizontal < 1:
                orientation = (Image.ROTATE_90 if orientation is None else
                               Image.ROTATE_90)
                tried_other_orientation = True
            else:
                font_size -= self.font_step
                orientation = None
        if font_size < self.min_font_size:
            # we were unable to draw any more
            break
        # Collect this word's information: font size, position, rotation, color
        x, y = np.array(result) + self.margin // 2
        # actually draw the text
        # This drawing is only used to find positions for later words; it is not the final image.
        # The final word-cloud image is produced in another method, to_image
        draw.text((y, x), word, fill="white", font=transposed_font)
        positions.append((x, y))
        orientations.append(orientation)
        font_sizes.append(font_size)
        colors.append(self.color_func(word, font_size=font_size,
                                      position=(x, y),
                                      orientation=orientation,
                                      random_state=random_state,
                                      font_path=self.font_path))
        # recompute integral image
        if self.mask is None:
            img_array = np.asarray(img_grey)
        else:
            img_array = np.asarray(img_grey) + boolean_mask
        # recompute bottom right
        # the order of the cumsum's is important for speed ?!
        occupancy.update(img_array, x, y)
        last_freq = freq
    # layout_ is the list of per-word records: (word, frequency), font size, position, rotation, color.
    # It prepares everything the later drawing step needs.
    self.layout_ = list(zip(frequencies, font_sizes, positions,
                            orientations, colors))
    return self
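The font-size arithmetic in steps 4 and 6 is easy to check by hand once it is pulled out of the method. The sketch below restates three pieces of it as small pure functions: the harmonic-mean-style starting size computed from the first two words, the relative_scaling update applied before each word, and the "shrink by font_step until it fits" loop. All three function names are my own, and fits is a hypothetical callback standing in for the occupancy query.

```python
def start_size_from_two(size_a, size_b):
    """Initial font size from the sizes the first two words were drawn at."""
    return int(2 * size_a * size_b / (size_a + size_b))


def scaled_font_size(font_size, freq, last_freq, relative_scaling):
    """Font size for the next word, given the previous word's size and frequency."""
    if relative_scaling == 0:
        # Only the fit on the canvas decides the size
        return font_size
    ratio = freq / float(last_freq)
    return int(round((relative_scaling * ratio
                      + (1 - relative_scaling)) * font_size))


def shrink_until_fits(font_size, fits, font_step=1, min_font_size=4):
    """Decrease font_size by font_step until fits(size) is true, or give up."""
    while font_size >= min_font_size:
        if fits(font_size):
            return font_size
        font_size -= font_step
    return None  # nothing at or above min_font_size could be placed


print(start_size_from_two(60, 30))
print(scaled_font_size(100, 0.5, 1.0, 0.5))
print(shrink_until_fits(20, lambda size: size <= 12, font_step=3))
```

With relative_scaling set to 1 a word that is half as frequent gets half the font size; with 0.5 it only shrinks to three quarters, so rank matters more than the exact frequency ratio.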
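Note
In step 6, when determining a position, the program uses a loop together with random sampling to look for a suitable placement. The relevant source is: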
# Look for a place to put the word. If one attempt fails, try changing the orientation
# or shrinking the font, and look again, until a position is found or the font size
# drops below the minimum
while True:
    # try to find a position
    font = ImageFont.truetype(self.font_path, font_size)
    # transpose font optionally
    transposed_font = ImageFont.TransposedFont(
        font, orientation=orientation)
    # get size of resulting text
    box_size = draw.textsize(word, font=transposed_font)
    # find possible places using integral image:
    result = occupancy.sample_position(box_size[1] + self.margin,
                                       box_size[0] + self.margin,
                                       random_state)
    if result is not None or font_size < self.min_font_size:
        # either we found a place or font-size went too small
        break
    # if we didn't find a place, make font smaller
    # but first try to rotate!
    if not tried_other_orientation and self.prefer_horizontal < 1:
        orientation = (Image.ROTATE_90 if orientation is None else
                       Image.ROTATE_90)
        tried_other_orientation = True
    else:
        font_size -= self.font_step
        orientation = None
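Here occupancy.sample_position() is the method that actually searches for a suitable position. But when you try to dig deeper into how it works, you find that Ctrl+click can no longer jump into the underlying code. The sad thing has happened after all... o(╥﹏╥)o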
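At the top of wordcloud.py there is this line: from .query_integral_image import query_integral_image. In an installed package, query_integral_image is a compiled extension module (a .pyd file on Windows), so it cannot be opened and read directly. It is built from a Cython source file, query_integral_image.pyx, which can be read in the project's source repository. For more on the pyd format, please look it up yourself.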
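Even without opening the compiled module, the general idea of an integral-image query can be sketched in pure Python. The code below is not the library's implementation; it is a minimal illustration under my own assumptions, with hypothetical names (integral_image, free_positions, pick_position). An integral image (summed-area table) lets us get the sum of any rectangle in constant time, so a rectangle is free exactly when its sum is zero; one free rectangle is then chosen at random.

```python
import random


def integral_image(occupied):
    """Summed-area table with an extra leading row and column of zeros."""
    h, w = len(occupied), len(occupied[0])
    sat = [[0] * (w + 1) for _ in range(h + 1)]
    for r in range(h):
        row_sum = 0
        for c in range(w):
            row_sum += occupied[r][c]
            sat[r + 1][c + 1] = sat[r][c + 1] + row_sum
    return sat


def free_positions(sat, size_r, size_c):
    """All top-left (row, col) corners where a size_r x size_c box is empty."""
    h, w = len(sat) - 1, len(sat[0]) - 1
    hits = []
    for r in range(h - size_r + 1):
        for c in range(w - size_c + 1):
            # Rectangle sum from four table lookups
            area = (sat[r + size_r][c + size_c] - sat[r][c + size_c]
                    - sat[r + size_r][c] + sat[r][c])
            if area == 0:
                hits.append((r, c))
    return hits


def pick_position(sat, size_r, size_c, rng):
    """Randomly choose one free position, or None if the box fits nowhere."""
    hits = free_positions(sat, size_r, size_c)
    return rng.choice(hits) if hits else None


# 4x4 canvas whose top-left 2x2 block is already occupied
grid = [[1, 1, 0, 0],
        [1, 1, 0, 0],
        [0, 0, 0, 0],
        [0, 0, 0, 0]]
sat = integral_image(grid)
candidates = free_positions(sat, 2, 2)
print(candidates)
print(pick_position(sat, 2, 2, random.Random(0)))
```

Returning None when nothing fits matches the way the loop above treats result: a None sends it on to rotate the word or shrink the font.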
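Back to generate_from_frequencies: at the end, the method collects everything into self.layout_, which holds all the information needed to draw each word. After that, to_file() can be called to save the image.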
def to_file(self, filename):
    img = self.to_image()
    img.save(filename, optimize=True)
    return self
The core method to_image() then takes the entries out of self.layout_ one by one and draws each word.
def to_image(self):
    self._check_generated()
    if self.mask is not None:
        width = self.mask.shape[1]
        height = self.mask.shape[0]
    else:
        height, width = self.height, self.width
    img = Image.new(self.mode, (int(width * self.scale),
                                int(height * self.scale)),
                    self.background_color)
    draw = ImageDraw.Draw(img)
    for (word, count), font_size, position, orientation, color in self.layout_:
        font = ImageFont.truetype(self.font_path,
                                  int(font_size * self.scale))
        transposed_font = ImageFont.TransposedFont(
            font, orientation=orientation)
        pos = (int(position[1] * self.scale),
               int(position[0] * self.scale))
        draw.text(pos, word, fill=color, font=transposed_font)
    return self._draw_contour(img=img)
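One detail in to_image() is easy to miss: each layout_ entry stores its position as (row, column), while the drawing call expects (x, y), so the two coordinates are swapped and then multiplied by scale. The sketch below reproduces just that bookkeeping without any imaging library. The function name scaled_draw_calls and the dictionary keys are my own, and the sample layout entry is made up for illustration.

```python
def scaled_draw_calls(layout, scale=1):
    """Turn layout_-style tuples into plain draw instructions (hypothetical helper)."""
    calls = []
    for (word, freq), font_size, position, orientation, color in layout:
        row, col = position
        calls.append({
            "word": word,
            # Swap (row, col) into (x, y) and apply the output scale
            "xy": (int(col * scale), int(row * scale)),
            "font_size": int(font_size * scale),
            "rotated": orientation is not None,
            "fill": color,
        })
    return calls


# A made-up entry in the same shape as self.layout_
sample_layout = [(("python", 1.0), 40, (10, 25), None, "rgb(0, 0, 0)")]
calls = scaled_draw_calls(sample_layout, scale=2)
print(calls[0])
```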
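Further thought:
How would you implement the search for a suitable place to put each word? (Note: the gaps inside a word's strokes can also hold text in a smaller font size.)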
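One hint already appears in the source above: the occupancy map is updated from the rendered grey image (img_grey), not from each word's bounding box, so free space is tracked per pixel. The toy comparison below shows why that matters. A hollow square stands in for a glyph with an enclosed gap; a pixel-level test lets a small box sit inside the gap, while a bounding-box test rejects it. All names here (ring, fits_pixelwise, fits_bbox) are my own, and the brute-force pixel scan is only for illustration.

```python
def ring(size, thickness):
    """Square 'glyph': 1 on the border of the given thickness, 0 inside."""
    def on_border(r, c):
        return (r < thickness or r >= size - thickness
                or c < thickness or c >= size - thickness)
    return [[1 if on_border(r, c) else 0 for c in range(size)]
            for r in range(size)]


def fits_pixelwise(canvas, top, left, h, w):
    """True if no drawn pixel lies inside the candidate box."""
    return all(canvas[r][c] == 0
               for r in range(top, top + h)
               for c in range(left, left + w))


def fits_bbox(boxes, top, left, h, w):
    """True if the candidate box overlaps none of the placed bounding boxes."""
    for b_top, b_left, b_h, b_w in boxes:
        if (top < b_top + b_h and b_top < top + h
                and left < b_left + b_w and b_left < left + w):
            return False
    return True


canvas = ring(7, 1)          # 7x7 glyph with a 5x5 hollow centre
placed = [(0, 0, 7, 7)]      # the same glyph as a bounding box

inside_pixelwise = fits_pixelwise(canvas, 2, 2, 3, 3)
inside_bbox = fits_bbox(placed, 2, 2, 3, 3)
print(inside_pixelwise, inside_bbox)
```

The per-pixel approach costs more per query, which is presumably why the library pairs it with an integral image and a compiled helper.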