Do You Really Know How to Make Word Clouds with wordcloud?

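Preface

In text analysis, word clouds are hard to avoid, and in Python, making one almost always means reaching for wordcloud. But do you really know how to use it? You may have followed an online tutorial and produced a good-looking word cloud already; this article aims to show you how wordcloud actually works under the hood.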

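A Quick Test

First, install the third-party library with pip (pip install wordcloud). Then let's take a quick look at how building a word cloud differs between English and Chinese text.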

from matplotlib import pyplot as plt
from wordcloud import WordCloud

text = 'my is luopan. he is zhangshan'

wc = WordCloud()
wc.generate(text)

plt.imshow(wc)

from matplotlib import pyplot as plt
from wordcloud import WordCloud

text = '我叫羅攀，他叫張三，我叫羅攀'

wc = WordCloud(font_path = r'/System/Library/Fonts/Supplemental/Songti.ttc') # font with Chinese glyphs (macOS path; adjust for your system)
wc.generate(text)

plt.imshow(wc)

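You will notice that the Chinese word cloud is not what we wanted. That is because wordcloud cannot segment Chinese text into words on its own. The walk through the wordcloud source code below should make the reason clear.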

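WordCloud Source Code Analysis

We mainly need to look at the WordCloud class. I won't paste the whole source here; instead, I'll focus on the overall flow of building a word cloud.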

class WordCloud(object):
    
    def __init__(self,):
        '''Mainly initializes a set of parameters.
        '''
        pass

    def fit_words(self, frequencies):
        return self.generate_from_frequencies(frequencies)

    def generate_from_frequencies(self, frequencies, max_font_size=None):
        '''Normalize the word frequencies and create the drawing object.
        '''
        pass

    def process_text(self, text):
        """Tokenize the text and preprocess it.
        """

        flags = (re.UNICODE if sys.version < '3' and type(text) is unicode  # noqa: F821
                 else 0)
        pattern = r"\w[\w']*" if self.min_word_length <= 1 else r"\w[\w']+"
        regexp = self.regexp if self.regexp is not None else pattern

        words = re.findall(regexp, text, flags)
        # remove 's
        words = [word[:-2] if word.lower().endswith("'s") else word
                 for word in words]
        # remove numbers
        if not self.include_numbers:
            words = [word for word in words if not word.isdigit()]
        # remove short words
        if self.min_word_length:
            words = [word for word in words if len(word) >= self.min_word_length]

        stopwords = set([i.lower() for i in self.stopwords])
        if self.collocations:
            word_counts = unigrams_and_bigrams(words, stopwords, self.normalize_plurals, self.collocation_threshold)
        else:
            # remove stopwords
            words = [word for word in words if word.lower() not in stopwords]
            word_counts, _ = process_tokens(words, self.normalize_plurals)

        return word_counts

    def generate_from_text(self, text):
        words = self.process_text(text)
        self.generate_from_frequencies(words)
        return self

    def generate(self, text):
        return self.generate_from_text(text)

When we call the generate method, the call order is:

generate_from_text
process_text  # preprocess the text
generate_from_frequencies  # normalize the word frequencies and create the drawing object

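Note: whether you call generate or generate_from_text, the work is ultimately done by generate_from_text, because generate simply delegates to it.

So the two functions that matter most here are process_text and generate_from_frequencies. Let's go through them one at a time.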

The process_text Function

process_text tokenizes the text, cleans the tokens, and finally returns a dictionary of word counts. We can try it out:

text = 'my is luopan. he is zhangshan'

wc = WordCloud()
cut_word = wc.process_text(text)
print(cut_word)
#  {'luopan': 1, 'zhangshan': 1}

text = '我叫羅攀，他叫張三，我叫羅攀'

wc = WordCloud()
cut_word = wc.process_text(text)
print(cut_word)
# {'我叫羅攀': 2, '他叫張三': 1}

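This shows that process_text cannot segment Chinese text properly. (In the English example, common words such as is and he are missing from the result because they are filtered out as stopwords.) Let's set aside how process_text cleans the tokens for now and focus on how it splits the text in the first place.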

def process_text(self, text):
    """Tokenize the text and preprocess it.
    """

    flags = (re.UNICODE if sys.version < '3' and type(text) is unicode  # noqa: F821
             else 0)
    pattern = r"\w[\w']*" if self.min_word_length <= 1 else r"\w[\w']+"
    regexp = self.regexp if self.regexp is not None else pattern

    words = re.findall(regexp, text, flags)

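The key point is that tokenization is done with a regular expression. With the default min_word_length of 0, the pattern is \w[\w']*, which matches a run of one or more word characters (apostrophes are allowed after the first character). When min_word_length is greater than 1, the pattern becomes \w[\w']+, which requires at least two characters. In Python 3 regular expressions, \w matches letters, digits, and underscores, and it also matches Chinese characters.

As a result, Chinese text is never split inside a sentence: it is only cut at punctuation marks, which is not how Chinese word segmentation should work. English text, by contrast, is already separated by spaces, so English words fall out easily.

In short, wordcloud was built with English text in mind. To make a word cloud from Chinese text, you need to segment the Chinese text into words first.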
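The behaviour described above can be checked with the standard library alone. The following is a minimal sketch that applies the pattern from the excerpt directly, without going through wordcloud; the tokenize helper is a hypothetical name introduced here for illustration.

```python
import re

# Pattern copied from the process_text excerpt (the min_word_length <= 1 branch).
DEFAULT_PATTERN = r"\w[\w']*"


def tokenize(text, pattern=DEFAULT_PATTERN):
    """Split text with re.findall, as the excerpt does (illustrative helper)."""
    return re.findall(pattern, text)


english = tokenize("my is luopan. he is zhangshan")
chinese = tokenize("我叫羅攀，他叫張三，我叫羅攀")

print(english)  # ['my', 'is', 'luopan', 'he', 'is', 'zhangshan']
print(chinese)  # ['我叫羅攀', '他叫張三', '我叫羅攀']
```

The English sentence breaks at spaces and the full stop, while each Chinese clause between the full-width commas comes back as a single token.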

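The generate_from_frequencies Function

Finally, a few words about this function. Its job is to normalize the word frequencies and create the drawing object.

There is a lot of drawing code, and it is not today's focus. We only need to understand what data is required to draw the word cloud. Below is the frequency normalization code, which should be easy to follow.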

from operator import itemgetter

def generate_from_frequencies(frequencies):
    frequencies = sorted(frequencies.items(), key=itemgetter(1), reverse=True)
    if len(frequencies) <= 0:
        raise ValueError("We need at least 1 word to plot a word cloud, "
                         "got %d." % len(frequencies))

    max_frequency = float(frequencies[0][1])

    frequencies = [(word, freq / max_frequency)
                   for word, freq in frequencies]
    return frequencies

test = generate_from_frequencies({'我叫羅攀': 2, '他叫張三': 1})
print(test)
# [('我叫羅攀', 1.0), ('他叫張三', 0.5)]

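The Right Way to Make a Word Cloud from Chinese Text

We first segment the text with jieba and join the tokens with spaces. That way, process_text can return a correct dictionary of word counts.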

from matplotlib import pyplot as plt
from wordcloud import WordCloud
import jieba

text = '我叫羅攀，他叫張三，我叫羅攀'
cut_word = " ".join(jieba.cut(text))

wc = WordCloud(font_path = r'/System/Library/Fonts/Supplemental/Songti.ttc')
wc.generate(cut_word)

plt.imshow(wc)
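To see why joining the tokens with spaces helps, here is a small sketch that runs the same default pattern over a space-joined token list. The token list is an assumption: it is hard-coded to resemble what a segmenter such as jieba might return, so the sketch runs without jieba installed. It only mimics the splitting step and is not a substitute for process_text, which also applies stopwords and other cleaning.

```python
import re
from collections import Counter

# Assumed segmentation, hard-coded for illustration (not actual jieba output).
tokens = ["我", "叫", "羅攀", "，", "他", "叫", "張三", "，", "我", "叫", "羅攀"]

# Joining with spaces gives the regex the word boundaries it needs.
joined = " ".join(tokens)

# Same default pattern as in the process_text excerpt.
words = re.findall(r"\w[\w']*", joined)
counts = dict(Counter(words))

print(counts)  # {'我': 2, '叫': 3, '羅攀': 2, '他': 1, '張三': 1}
```

Each segmented word is now counted separately, and the punctuation tokens drop out because the full-width comma is not a word character.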

Of course, if you already have a dictionary of word counts, you don't need to call generate at all. Call generate_from_frequencies directly instead.

text = {
    '羅攀':2,
    '張三':1
}

wc = WordCloud(font_path = r'/System/Library/Fonts/Supplemental/Songti.ttc')
wc.generate_from_frequencies(text)

plt.imshow(wc)
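If you start from a list of segmented tokens rather than a ready-made dictionary, a frequency dictionary can be built with collections.Counter. The sketch below is one possible approach, not part of wordcloud: build_frequencies is a hypothetical helper, and the token list and stopword set are assumptions chosen for illustration.

```python
from collections import Counter


def build_frequencies(tokens, stopwords=frozenset()):
    """Count tokens, skipping stopwords and tokens with no letters or digits."""
    kept = [
        token
        for token in tokens
        if token not in stopwords and any(ch.isalnum() for ch in token)
    ]
    return dict(Counter(kept))


# Assumed segmentation, hard-coded for illustration (not actual jieba output).
tokens = ["我", "叫", "羅攀", "，", "他", "叫", "張三", "，", "我", "叫", "羅攀"]

freqs = build_frequencies(tokens, stopwords={"我", "他", "叫"})
print(freqs)  # {'羅攀': 2, '張三': 1}
```

The resulting dictionary can be passed to wc.generate_from_frequencies(freqs) exactly like the hand-written one above. Filtering happens before counting here because, according to the wordcloud documentation, the stopwords parameter of WordCloud is ignored when generate_from_frequencies is used.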

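Summary

(1) The analysis of process_text shows that wordcloud is a third-party library designed to build word clouds from English text.

(2) To make a Chinese word cloud, you first need to split the Chinese text into words with a Chinese segmentation library such as jieba.

Finally, the Chinese word cloud above is still not the ideal result. Words such as 我 (I) and 他 (he) should not be displayed, and the image could be made more attractive. Those topics are left for the next post.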
