Python wordcloud: Source Code Analysis and Basic Usage

The Python word cloud module has gone from v1.0 in 2015 to v1.7 at the time of writing.

To download it, go to: https://pypi.org/project/wordcloud/

Basic usage of wordcloud:

import jieba
import wordcloud

w = wordcloud.WordCloud(
    width=600,
    height=600,
    background_color='white',
    # a font with CJK glyphs is needed for Chinese text; msyh.ttc is Microsoft YaHei
    font_path='msyh.ttc'
)
text = '看到此標題，我也是感慨萬千 首先弄清楚搞IT和被IT搞，誰是搞IT的？馬雲就是，馬化騰也是，劉強東也是，他們都是叫搞IT的， 但程序員只是被IT搞的人，可以比作蓋樓砌磚的泥瓦匠，你想想，四十歲的泥瓦匠能跟二十左右歲的年輕人較勁嗎？如果你是老闆你會怎麼做？程序員只是技術含量高的泥瓦匠，社會是現實的，社會的現實是什麼？利益驅動。當你跑的速度不比以前快了時，你就會被挨鞭子趕，這種窘境如果在做程序員當初就預料到的話，你就會知道，到達一定高度時，你需要改變行程。 程序員其實真的不是什麼好職業，技術每天都在更新，要不停的學，你以前學的每天都在被淘汰，加班可能是標配了吧。 熱點，你知道什麼是熱點嗎？社會上啥熱就是熱點，我舉幾個例子：在早淘寶之初，很多人都覺得做淘寶能讓自己發展，當初的規則是產品按時間輪候展示，也就是你的商品上架時間一到就會被展示，不論你星級多高。這種一律平等的條件固然好，但淘寶隨後調整了顯示規則，對產品和店鋪，銷量進行了加權，一下導致小賣家被弄到了很深的衚衕裏，沒人看到自己的產品，如何賣？做廣告費用也非常高，入不敷出，想必做過淘寶的都知道，再後來淘寶弄天貓，顯然，天貓是上檔次的商城，不同於淘寶的擺地攤，因爲攤位費漲價還鬧過事，鬧也白鬧，你有能力就弄，沒能力就淘汰掉。前幾天淘寶又推出C2M,客戶反向定製，客戶直接掛鉤大廠家，沒你小賣傢什麼事。 後來又出現了微商，在微商出現當天我就知道這東西不行，它比淘寶假貨還下三濫.我對TX一直有點偏見，因爲騙子都使用QQ 我說這麼多隻想說一個事，世界是變化的，你只能適應變化，否則就會被淘汰。 還是回到熱點這個話題，育兒嫂這個職位有很多人瞭解嗎？前幾年放開二胎後，這個職位迅速串紅，我的一個親戚初中畢業，現在已經月入一萬五，職務就是照看剛出生的嬰兒28天，節假日要雙薪。 你說這難到讓我一個男的去當育兒嫂嗎？扯，我只是說熱點問題。你沒踩在熱點上，你賺錢就會很費勁 這兩年的熱點是什麼？短視頻，你可以看到抖音的一些作品根本就不是普通人能實現的，說明專業級人才都開始努力往這上使勁了。 我只會編程，別的不會怎麼辦？那你就去編程。沒人用了怎麼辦？你看看你自己能不能僱傭你自己 學會適應社會，學會改變自己去適應社會 最後說一句：科大訊飛的劉鵬說的是對的。那我爲什麼還做程序員？他可以完成一些原始積累，只此而已。'
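# the default tokenizer does not segment Chinese, so cut the text with jieba
# and join the tokens with spaces before handing it to WordCloud
new_str = ' '.join(jieba.lcut(text))
w.generate(new_str)  # compute the layout
w.to_file('x.png')   # render the image and save it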

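Now let's analyze the source code:

The main steps wordcloud takes to produce a word cloud image are:

1. Split the text into words

2. Generate the word cloud

3. Save the image

We start from generate(self, text) and find that it simply calls another method on the same object, self.generate_from_text(text)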

    def generate_from_text(self, text):
        """Generate wordcloud from text.
        """
        words = self.process_text(text) # split the text into words
        self.generate_from_frequencies(words) # the main method that builds the word cloud (analyzed in detail later)
        return self

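The source of process_text() is shown below. Its logic is fairly simple: split the text into words, strip trailing 's, remove numbers, remove short words, remove stopwords, and then count what is left.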

    def process_text(self, text):
        """Splits a long text into words, eliminates the stopwords.

        Parameters
        ----------
        text : string
            The text to be processed.

        Returns
        -------
        words : dict (string, int)
            Word tokens with associated frequency.

        ..versionchanged:: 1.2.2
            Changed return type from list of tuples to dict.

        Notes
        -----
        There are better ways to do word tokenization, but I don't want to
        include all those things.
        """

        flags = (re.UNICODE if sys.version < '3' and type(text) is unicode else 0) 
                
        regexp = self.regexp if self.regexp is not None else r"\w[\w']+"

        # extract the tokens
        words = re.findall(regexp, text, flags)
        # strip a trailing 's
        words = [word[:-2] if word.lower().endswith("'s") else word for word in words]
        # remove numbers
        if not self.include_numbers:
            words = [word for word in words if not word.isdigit()]
        # remove short words: anything shorter than min_word_length is filtered out
        if self.min_word_length:
            words = [word for word in words if len(word) >= self.min_word_length]
        # build the lowercased stopword set used to remove stopwords
        stopwords = set([i.lower() for i in self.stopwords])
        if self.collocations:
            word_counts = unigrams_and_bigrams(words, stopwords, self.normalize_plurals, self.collocation_threshold)
        else:
            # remove stopwords
            words = [word for word in words if word.lower() not in stopwords]
            word_counts, _ = process_tokens(words, self.normalize_plurals)

        return word_counts
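
To get a feel for the pipeline above, here is a minimal standalone sketch of the same steps using only the standard library. The function name simple_process_text and its parameters are made up for illustration; it assumes collocations are turned off and replaces process_tokens() with plain counting, so it is not the library's implementation.

```python
import re
from collections import Counter


def simple_process_text(text, stopwords=(), min_word_length=0,
                        include_numbers=False, regexp=r"\w[\w']+"):
    """Rough stand-in for process_text() with collocations turned off."""
    # Tokenize: the default pattern needs at least two characters per token.
    words = re.findall(regexp, text)
    # Strip a trailing 's (possessive).
    words = [w[:-2] if w.lower().endswith("'s") else w for w in words]
    # Drop pure numbers unless they were asked for.
    if not include_numbers:
        words = [w for w in words if not w.isdigit()]
    # Drop words shorter than min_word_length.
    if min_word_length:
        words = [w for w in words if len(w) >= min_word_length]
    # Drop stopwords, compared case-insensitively.
    stop = {s.lower() for s in stopwords}
    words = [w for w in words if w.lower() not in stop]
    # Plain counting only; this sketch does no case or plural merging.
    return Counter(words)


counts = simple_process_text("Python's cloud 2015 cloud 看到 此 標題",
                             stopwords={"python"})
print(counts)
```

Note that the single-character token 此 disappears: the default pattern only matches tokens of two or more characters. The pattern also treats an unbroken run of Chinese characters as one token, which is why the usage example at the top segments the text with jieba and joins the pieces with spaces first.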

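Now for the main event

The body of generate_from_frequencies(self, frequencies, max_font_size=None) is fairly long. Overall it breaks down into the following steps:

1. Sort

2. Normalize the word frequencies

3. Create the drawing objects

4. Determine the initial font size

5. Extend the word set

6. Determine the font size, position, rotation, and color of each word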
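
Steps 1, 2 and 5 are pure data manipulation, so they can be sketched without any drawing code. The helper names normalize_frequencies and pad_with_repeats are hypothetical, and max_words=200 is just a value picked for the example; the arithmetic follows my reading of the listing in this article.

```python
import math
from operator import itemgetter


def normalize_frequencies(frequencies, max_words=200):
    """Steps 1-2: sort by count (descending), truncate, divide by the largest count."""
    items = sorted(frequencies.items(), key=itemgetter(1), reverse=True)
    if not items:
        raise ValueError("need at least one word")
    items = items[:max_words]
    max_frequency = float(items[0][1])
    return [(word, freq / max_frequency) for word, freq in items]


def pad_with_repeats(frequencies, max_words):
    """Step 5: append repeated copies of the list with geometrically smaller weights."""
    padded = list(frequencies)
    if len(padded) >= max_words:
        return padded
    times_extend = math.ceil(max_words / len(frequencies)) - 1
    downweight = frequencies[-1][1]  # smallest normalized frequency
    for i in range(times_extend):
        padded.extend((word, freq * downweight ** (i + 1))
                      for word, freq in frequencies)
    return padded


normalized = normalize_frequencies({"cloud": 8, "word": 4, "python": 2})
padded = pad_with_repeats(normalized, max_words=6)
print(normalized)
print(padded)
```

With the three words above, the most frequent word gets weight 1.0 and the repeated copies are scaled by the smallest weight (0.25), so the padded list keeps decreasing.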

The source is as follows (with comments added according to my own understanding):

    def generate_from_frequencies(self, frequencies, max_font_size=None):
        """Create a word_cloud from words and frequencies.

        Parameters
        ----------
        frequencies : dict from string to float
            A contains words and associated frequency.

        max_font_size : int
            Use this font-size instead of self.max_font_size

        Returns
        -------
        self

        """
        # make sure frequencies are sorted and normalized
        # 1. Sort
        # sort the (word, frequency) list by frequency, descending
        frequencies = sorted(frequencies.items(), key=itemgetter(1), reverse=True)
        if len(frequencies) <= 0:
            raise ValueError("We need at least 1 word to plot a word cloud, "
                             "got %d." % len(frequencies))
        # keep at most max_words words; the rest are discarded
        frequencies = frequencies[:self.max_words]

        # largest entry will be 1
        # take the first word's frequency as the maximum frequency
        max_frequency = float(frequencies[0][1])

        # 2. Normalize the word frequencies
        # divide every frequency by the maximum; since the words are already sorted, the result looks like: [('xxx', 1),('xxx', 0.96),('xxx', 0.87),...]
        frequencies = [(word, freq / max_frequency)
                       for word, freq in frequencies]

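        # random number generator: decides whether a word is rotated by 90 degrees, and is also passed to position sampling and color_func below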
        if self.random_state is not None:
            random_state = self.random_state
        else:
            random_state = Random()

        if self.mask is not None:
            boolean_mask = self._get_bolean_mask(self.mask)
            width = self.mask.shape[1]
            height = self.mask.shape[0]
        else:
            boolean_mask = None
            height, width = self.height, self.width
        # used to find where a word can be placed, i.e. blank (text-free) areas inside the valid region of the image
        occupancy = IntegralOccupancyMap(height, width, boolean_mask)

        # 3. Create the drawing objects
        # create image
        img_grey = Image.new("L", (width, height))
        draw = ImageDraw.Draw(img_grey)
        img_array = np.asarray(img_grey)
        font_sizes, positions, orientations, colors = [], [], [], []

        last_freq = 1.

        # 4. Determine the initial font size
        # resolve the maximum font size
        if max_font_size is None:
            # if not provided use default font_size
            max_font_size = self.max_font_size

        # if no maximum font size is set, one has to be worked out to serve as the initial font size
        if max_font_size is None:
            # figure out a good font size by trying to draw with
            # just the first two words
            if len(frequencies) == 1:
                # we only have one word. We make it big!
                font_size = self.height
            else:
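                # recurse into this same function to obtain a self.layout_ that holds only the two most frequent words
                # the font sizes at which these two words were placed are used to compute the initial font size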
                self.generate_from_frequencies(dict(frequencies[:2]),
                                               max_font_size=self.height)
                # find font sizes
                sizes = [x[1] for x in self.layout_]
                try:
                    font_size = int(2 * sizes[0] * sizes[1]
                                    / (sizes[0] + sizes[1]))
                # quick fix for if self.layout_ contains less than 2 values
                # on very small images it can be empty
                except IndexError:
                    try:
                        font_size = sizes[0]
                    except IndexError:
                        raise ValueError(
                            "Couldn't find space to draw. Either the Canvas size"
                            " is too small or too much of the image is masked "
                            "out.")
        else:
            font_size = max_font_size

        # we set self.words_ here because we called generate_from_frequencies
        # above... hurray for good design?
        self.words_ = dict(frequencies)

        # 5. Extend the word set
        # if repeat is enabled and there are fewer words than max_words, pad the word set by repeating it
        if self.repeat and len(frequencies) < self.max_words:
            # pad frequencies with repeating words.
            times_extend = int(np.ceil(self.max_words / len(frequencies))) - 1
            # get smallest frequency
            frequencies_org = list(frequencies)
            downweight = frequencies[-1][1]
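            # append repeated copies of the words; each round scales the original frequencies down further, so the list keeps decreasing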
            for i in range(times_extend):
                frequencies.extend([(word, freq * downweight ** (i + 1))
                                    for word, freq in frequencies_org])

        # 6. Determine the font size, position, rotation, and color of each word
        # start drawing grey image
        for word, freq in frequencies:
            if freq == 0:
                continue
            # select the font size
            rs = self.relative_scaling
            if rs != 0:
                font_size = int(round((rs * (freq / float(last_freq))
                                       + (1 - rs)) * font_size))
            if random_state.random() < self.prefer_horizontal:
                orientation = None
            else:
                orientation = Image.ROTATE_90
            tried_other_orientation = False
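            # look for a place to put the word; if an attempt fails, try the other orientation or shrink the font, then search again,
            # until a position is found or the font size drops below the minimum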
            while True:
                # try to find a position
                font = ImageFont.truetype(self.font_path, font_size)
                # transpose font optionally
                transposed_font = ImageFont.TransposedFont(
                    font, orientation=orientation)
                # get size of resulting text
                box_size = draw.textsize(word, font=transposed_font)
                # find possible places using integral image:
                result = occupancy.sample_position(box_size[1] + self.margin,
                                                   box_size[0] + self.margin,
                                                   random_state)
                if result is not None or font_size < self.min_font_size:
                    # either we found a place or font-size went too small
                    break
                # if we didn't find a place, make font smaller
                # but first try to rotate!
                if not tried_other_orientation and self.prefer_horizontal < 1:
                    orientation = (Image.ROTATE_90 if orientation is None else
                                   Image.ROTATE_90)
                    tried_other_orientation = True
                else:
                    font_size -= self.font_step
                    orientation = None

            if font_size < self.min_font_size:
                # we were unable to draw any more
                break

            # collect this word's information: font size, position, rotation, color
            x, y = np.array(result) + self.margin // 2
            # actually draw the text
            # the drawing here only tracks occupied space for placing words; it is not the final word cloud image, which is produced in another method: to_image
            draw.text((y, x), word, fill="white", font=transposed_font)
            positions.append((x, y))
            orientations.append(orientation)
            font_sizes.append(font_size)
            colors.append(self.color_func(word, font_size=font_size,
                                          position=(x, y),
                                          orientation=orientation,
                                          random_state=random_state,
                                          font_path=self.font_path))
            # recompute integral image
            if self.mask is None:
                img_array = np.asarray(img_grey)
            else:
                img_array = np.asarray(img_grey) + boolean_mask
            # recompute bottom right
            # the order of the cumsum's is important for speed ?!
            occupancy.update(img_array, x, y)
            last_freq = freq

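        # layout_ is the list of per-word information; each entry holds the word, frequency, font size, position, rotation, and color, ready for the drawing done in later steps.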
        self.layout_ = list(zip(frequencies, font_sizes, positions,
                                orientations, colors))
        return self
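
The font-size arithmetic in steps 4 and 6 is easier to follow with numbers plugged in. The sketch below lifts the two formulas out of the listing; the function names, the sample sizes 240 and 120, and the relative_scaling value 0.5 are assumptions chosen for the example.

```python
def initial_font_size(size_first, size_second):
    """Step 4: harmonic mean of the sizes at which the two top words were placed."""
    return int(2 * size_first * size_second / (size_first + size_second))


def next_font_size(font_size, freq, last_freq, relative_scaling=0.5):
    """Step 6: scale the running font size by the ratio to the previous frequency."""
    if relative_scaling == 0:
        # Frequencies are ignored; the size only shrinks when space runs out.
        return font_size
    ratio = freq / float(last_freq)
    return int(round((relative_scaling * ratio + (1 - relative_scaling)) * font_size))


start = initial_font_size(240, 120)
second = next_font_size(start, freq=0.5, last_freq=1.0)
full = next_font_size(start, freq=0.5, last_freq=1.0, relative_scaling=1)
print(start, second, full)
```

With relative_scaling=1 a word that is half as frequent gets half the font size; with 0.5 the drop is softened to three quarters. Because the result feeds into the next iteration, font sizes never grow as the loop walks down the sorted list.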

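Note

When determining positions in step 6, the program uses a loop and random numbers to look for a suitable place. The source is as follows.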

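            # look for a place to put the word; if an attempt fails, try the other orientation or shrink the font, then search again,
            # until a position is found or the font size drops below the minimum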
            while True:
                # try to find a position
                font = ImageFont.truetype(self.font_path, font_size)
                # transpose font optionally
                transposed_font = ImageFont.TransposedFont(
                    font, orientation=orientation)
                # get size of resulting text
                box_size = draw.textsize(word, font=transposed_font)
                # find possible places using integral image:
                result = occupancy.sample_position(box_size[1] + self.margin,
                                                   box_size[0] + self.margin,
                                                   random_state)
                if result is not None or font_size < self.min_font_size:
                    # either we found a place or font-size went too small
                    break
                # if we didn't find a place, make font smaller
                # but first try to rotate!
                if not tried_other_orientation and self.prefer_horizontal < 1:
                    orientation = (Image.ROTATE_90 if orientation is None else
                                   Image.ROTATE_90)
                    tried_other_orientation = True
                else:
                    font_size -= self.font_step
                    orientation = None

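Here occupancy.sample_position() is the method that actually searches for a suitable position. But when you try to dig further into how it works, you find that Ctrl+click can no longer jump into the deeper code. The sad thing has happened after all... o(╥﹏╥)o

At the top of wordcloud.py there is this line: from .query_integral_image import query_integral_image. Here query_integral_image is a .pyd file, that is, a compiled extension module, so it cannot be read directly. For more on the pyd format, please look it up yourself.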
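
Even without reading the compiled module, its name hints at the technique: an integral image (summed-area table) lets you test whether any rectangle is empty with four lookups. The following is a pure-Python sketch of that idea under my own assumptions, not the library's code; all function names here are hypothetical.

```python
import random


def integral_image(occupied):
    """Summed-area table with one extra row and column of zeros."""
    height, width = len(occupied), len(occupied[0])
    table = [[0] * (width + 1) for _ in range(height + 1)]
    for r in range(height):
        for c in range(width):
            table[r + 1][c + 1] = (occupied[r][c] + table[r][c + 1]
                                   + table[r + 1][c] - table[r][c])
    return table


def box_sum(table, r, c, size_r, size_c):
    """Number of occupied pixels in the box whose top-left corner is (r, c)."""
    return (table[r + size_r][c + size_c] - table[r][c + size_c]
            - table[r + size_r][c] + table[r][c])


def sample_free_position(occupied, size_r, size_c, rng):
    """Pick a random top-left corner whose box contains no occupied pixel."""
    table = integral_image(occupied)
    height, width = len(occupied), len(occupied[0])
    free = [(r, c)
            for r in range(height - size_r + 1)
            for c in range(width - size_c + 1)
            if box_sum(table, r, c, size_r, size_c) == 0]
    return rng.choice(free) if free else None


# 1 marks a pixel that is already covered by text (or masked out).
grid = [
    [1, 1, 0, 0],
    [1, 1, 0, 0],
    [0, 0, 0, 1],
]
print(sample_free_position(grid, 2, 2, random.Random(0)))
print(sample_free_position(grid, 3, 3, random.Random(0)))
```

In this grid only one 2x2 box is free, so the first call has a single candidate, and the second call returns None because no 3x3 box fits. A None result is what drives the rotate-then-shrink retries in the loop shown earlier. Since occupancy is tracked per pixel, empty pixels between the strokes of a large word count as free space too.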

Back to generate_from_frequencies: at the end, the method gathers the data into self.layout_, which holds everything needed to draw each word. After that, to_file() can be called to save the image.

    def to_file(self, filename):

        img = self.to_image()
        img.save(filename, optimize=True)
        return self

The core method to_image() then takes the entries of self.layout_ one by one and draws each word.

    def to_image(self):
        self._check_generated()
        if self.mask is not None:
            width = self.mask.shape[1]
            height = self.mask.shape[0]
        else:
            height, width = self.height, self.width

        img = Image.new(self.mode, (int(width * self.scale),
                                    int(height * self.scale)),
                        self.background_color)
        draw = ImageDraw.Draw(img)
        for (word, count), font_size, position, orientation, color in self.layout_:
            font = ImageFont.truetype(self.font_path,
                                      int(font_size * self.scale))
            transposed_font = ImageFont.TransposedFont(
                font, orientation=orientation)
            pos = (int(position[1] * self.scale),
                   int(position[0] * self.scale))
            draw.text(pos, word, fill=color, font=transposed_font)

        return self._draw_contour(img=img)

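Further thoughts:

How would you implement the search for a suitable place to put a word? (Note: the gaps between the strokes of a character can also hold text in a smaller font size.)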

~ End ~
