[NLP] Clustering English-Language Documents in Python

[Original post] http://brandonrose.org/clustering

In this guide, I will explain how to cluster a set of documents using Python. My example goal is to identify the latent structures within the synopses of the top 100 films of all time (per an IMDB list). See the original post for a more detailed discussion of the example. This guide covers:

  • tokenizing and stemming each synopsis
  • transforming the corpus into vector space using tf-idf
  • calculating cosine distance between each document as a measure of similarity
  • clustering the documents using the k-means algorithm
  • using multidimensional scaling (MDS) to reduce dimensionality within the corpus
  • plotting the clustering output using matplotlib and mpld3
  • conducting a hierarchical clustering on the corpus using Ward clustering
  • plotting a Ward dendrogram
  • topic modeling using Latent Dirichlet Allocation (LDA)

Note that my github repo for the whole project is available. The 'cluster_analysis' workbook is fully functional; the 'cluster_analysis_web' workbook has been trimmed down for the purpose of creating this walkthrough. Feel free to download the repo and use 'cluster_analysis' to step through the guide yourself.

If you have any questions for me, feel free to reach out on Twitter to @brandonmrose


But first, I import everything I am going to need up front. (If mpld3 is not installed, search for mpld3 in Anaconda Navigator, select it under Not Installed, and click Apply.)

import numpy as np
import pandas as pd
import nltk
import re
import os
import codecs
from sklearn import feature_extraction
from bs4 import BeautifulSoup
import mpld3

For the purposes of this walkthrough, imagine that I have 2 primary lists:

  • 'titles': the titles of the films in their ranked order
  • 'synopses': the synopses of the films matched to the 'titles' order

In the full workbook that I posted to github you can walk through the import of these lists, but for brevity just keep in mind that for the rest of this walk-through I will focus on using these two lists. Of primary importance is the 'synopses' list; 'titles' is mostly used for labeling purposes.

Copy and paste https://github.com/brandomr/document_cluster/blob/master/title_list.txt into your own project directory.

#import three lists: titles, links and wikipedia synopses
titles = open('title_list.txt').read().split('\n')
#ensures that only the first 100 are read in
titles = titles[:100]
print(titles[:10]) #first 10 titles

['The Godfather', 'The Shawshank Redemption', "Schindler's List", 'Raging Bull', 'Casablanca', "One Flew Over the Cuckoo's Nest", 'Gone with the Wind', 'Citizen Kane', 'The Wizard of Oz', 'Titanic']

Copy and paste https://github.com/brandomr/document_cluster/blob/master/synopses_list_wiki.txt into your own project directory.

Copy and paste https://github.com/brandomr/document_cluster/blob/master/synopses_list_imdb.txt into your own project directory.

synopses_wiki = open('synopses_list_wiki.txt', encoding='UTF-8').read().split('\n BREAKS HERE')
synopses_wiki = synopses_wiki[:100]
synopses_clean_wiki = []
for text in synopses_wiki:
    text = BeautifulSoup(text, 'html.parser').getText()
    #strips html formatting and converts to unicode
    synopses_clean_wiki.append(text)
synopses_wiki = synopses_clean_wiki
synopses_imdb = open('synopses_list_imdb.txt', encoding='UTF-8').read().split('\n BREAKS HERE')
synopses_imdb = synopses_imdb[:100]
synopses_clean_imdb = []
for text in synopses_imdb:
    text = BeautifulSoup(text, 'html.parser').getText()
    #strips html formatting and converts to unicode
    synopses_clean_imdb.append(text)
synopses_imdb = synopses_clean_imdb
synopses = []
for i in range(len(synopses_wiki)):
    item = synopses_wiki[i] + synopses_imdb[i]
    synopses.append(item)
print(synopses[0][:200]) #first 200 characters in first synopses (for 'The Godfather')

Plot [edit] [ [ edit edit ] ] 
 On the day of his only daughter's wedding, Vito Corleone hears requests in his role as the Godfather, the Don of a New York crime family. Vito's youngest son, Michael, 

Stopwords, stemming, and tokenizing

This section is focused on defining some functions to manipulate the synopses. First, I load NLTK's (Natural Language Toolkit) list of English stop words. Stop words are words like "a", "the", or "in" which don't convey significant meaning. I'm sure there are much better explanations of this out there.

# nltk.download('stopwords')  # uncomment this line if loading the stopwords below raises a LookupError
# load nltk's English stopwords as variable called 'stopwords'
stopwords = nltk.corpus.stopwords.words('english')
print(stopwords[:10])

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', 'your']
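
As a quick illustration (a small sketch with made-up tokens, not part of the original workflow -- the TfidfVectorizer further below drops stopwords itself via stop_words='english'), filtering a token list against this stopword set looks like this:

#hypothetical token list, filtered against NLTK's stopword set
sample_tokens = ['on', 'the', 'day', 'of', 'his', 'daughter', 'wedding']
print([t for t in sample_tokens if t not in stopwords])
#['day', 'daughter', 'wedding']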

Next I import the Snowball Stemmer which is actually part of NLTK. Stemming is just the process of breaking a word down into its root.

# load nltk's SnowballStemmer as variable 'stemmer'
from nltk.stem.snowball import SnowballStemmer
stemmer = SnowballStemmer("english")
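
To see what the stemmer actually does, here is a quick sanity check (a small sketch, not part of the original notebook):

#quick sanity check on the Snowball stemmer
for word in ['running', 'movies', 'only']:
    print(word, '->', stemmer.stem(word))
#running -> run
#movies -> movi
#only -> onli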

Below I define two functions:

  • tokenize_and_stem: tokenizes (splits the synopsis into a list of its respective words, or tokens) and also stems each token
  • tokenize_only: tokenizes the synopsis only

I use both these functions to create a dictionary which becomes important in case I want to use stems for an algorithm, but later convert stems back to their full words for presentation purposes. Guess what, I do want to do that!

# here I define a tokenizer and stemmer which returns the set of stems in the text that it is passed

def tokenize_and_stem(text):
    # first tokenize by sentence, then by word to ensure that punctuation is caught as its own token
    tokens = [word for sent in nltk.sent_tokenize(text) for word in nltk.word_tokenize(sent)]
    filtered_tokens = []
    # filter out any tokens not containing letters (e.g., numeric tokens, raw punctuation)
    for token in tokens:
        if re.search('[a-zA-Z]', token):
            filtered_tokens.append(token)
    stems = [stemmer.stem(t) for t in filtered_tokens]
    return stems


def tokenize_only(text):
    # first tokenize by sentence, then by word to ensure that punctuation is caught as its own token
    tokens = [word.lower() for sent in nltk.sent_tokenize(text) for word in nltk.word_tokenize(sent)]
    filtered_tokens = []
    # filter out any tokens not containing letters (e.g., numeric tokens, raw punctuation)
    for token in tokens:
        if re.search('[a-zA-Z]', token):
            filtered_tokens.append(token)
    return filtered_tokens
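
Before building the vocabularies, it can help to eyeball what the two functions return on a single sentence (a small sketch using a made-up sentence):

#compare stems vs. lower-cased tokens on a sample sentence
sample = "Vito's youngest son, Michael, returns from the war."
print(tokenize_and_stem(sample))
print(tokenize_only(sample))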

Below I use my stemming/tokenizing and tokenizing functions to iterate over the list of synopses to create two vocabularies: one stemmed and one only tokenized.

#if this raises a LookupError, run nltk.download('punkt') or download the Punkt Tokenizer Models from http://www.nltk.org/nltk_data/ and unzip them into your nltk_data\tokenizers directory (e.g. C:\Users\Administrator\AppData\Roaming\nltk_data\tokenizers)
#not super pythonic, no, not at all.
#use extend so it's a big flat list of vocab
totalvocab_stemmed = []
totalvocab_tokenized = []
for i in synopses:
    allwords_stemmed = tokenize_and_stem(i) #for each item in 'synopses', tokenize/stem
    totalvocab_stemmed.extend(allwords_stemmed) #extend the 'totalvocab_stemmed' list   
    allwords_tokenized = tokenize_only(i)
    totalvocab_tokenized.extend(allwords_tokenized)

Using these two lists, I create a pandas DataFrame with the stemmed vocabulary as the index and the tokenized words as the column. The benefit of this is that it provides an efficient way to look up a stem and return a full token. The downside here is that stems to tokens are one to many: the stem 'run' could be associated with 'ran', 'runs', 'running', etc. For my purposes this is fine--I'm perfectly happy returning the first token associated with the stem.

vocab_frame = pd.DataFrame({'words': totalvocab_tokenized}, index = totalvocab_stemmed)
print ('there are ' + str(vocab_frame.shape[0]) + ' items in vocab_frame')

there are 312209 items in vocab_frame

You'll notice there is clearly some repetition here. I could clean it up, but there are only 312209 items in the DataFrame, which isn't huge overhead when looking up a stemmed word based on the stem-index.

print (vocab_frame.head(10))

     words
plot  plot
edit  edit
edit  edit
edit  edit
on      on
the    the
day    day
of      of
his    his
onli  only
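
Looking up the full word for a stem is then just a .loc against this index, which is essentially what the top-terms-per-cluster loop further below relies on. A minimal sketch (the stem 'onli' is taken from the output above):

#look up the first full token recorded for the stem 'onli' (stems map to many tokens, so take the first row)
print(vocab_frame.loc[['onli']].values.tolist()[0][0])
#only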

Tf-idf and document similarity

Here, I define term frequency-inverse document frequency (tf-idf) vectorizer parameters and then convert the synopses list into a tf-idf matrix.

To get a tf-idf matrix, first count how many times each word occurs in each document. This is transformed into a document-term matrix (dtm), which is also just called a term frequency matrix.

Then apply the term frequency-inverse document frequency weighting: words that occur frequently within a document but not frequently within the corpus receive a higher weighting, as these words are assumed to contain more meaning in relation to that document.
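
To make the weighting concrete, here is a tiny hand computation of the classic tf-idf formula on a made-up three-document corpus (a sketch of the idea only; scikit-learn's TfidfVectorizer additionally smooths the idf and l2-normalizes each document vector):

import math

toy_corpus = ["the don hears requests", "the don makes an offer", "a wedding in the family"]
n_docs = len(toy_corpus)

def tfidf(term, doc):
    tf = doc.split().count(term)                     #term frequency within this document
    df = sum(term in d.split() for d in toy_corpus)  #number of documents containing the term
    return tf * math.log(n_docs / df)                #rare-in-corpus terms get a larger idf

print(tfidf('the', toy_corpus[0]))      #0.0 -- 'the' appears in every document, so idf = log(1) = 0
print(tfidf('wedding', toy_corpus[2]))  #about 1.10 -- 'wedding' appears in only one document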

A couple things to note about the parameters I define below:

  • max_df: this is the maximum frequency within the documents a given feature can have to be used in the tf-idf matrix. If the term is in greater than 80% of the documents it probably carries little meaning (in the context of film synopses).
  • min_df: this could be an integer (e.g. 5), in which case the term would have to appear in at least 5 of the documents to be considered. Here I pass 0.2; the term must be in at least 20% of the documents. I found that if I allowed a lower min_df I ended up basing clustering on names--for example 'Michael' or 'Tom' are names found in several of the movies and the synopses use these names frequently, but the names carry no real meaning.
  • ngram_range: this just means I'll look at unigrams, bigrams and trigrams. See n-grams.

from sklearn.feature_extraction.text import TfidfVectorizer
#define vectorizer parameters
tfidf_vectorizer = TfidfVectorizer(max_df=0.8, max_features=200000,
                                 min_df=0.2, stop_words='english',
                                 use_idf=True, tokenizer=tokenize_and_stem, ngram_range=(1,3))
#%time:Time execution of a Python statement or expression. https://ipython.readthedocs.io/en/stable/interactive/magics.html
#%time tfidf_matrix = tfidf_vectorizer.fit_transform(synopses) #fit the vectorizer to synopses
tfidf_matrix = tfidf_vectorizer.fit_transform(synopses) #fit the vectorizer to synopses
print(tfidf_matrix.shape)

(100, 563)

terms is just a list of the features used in the tf-idf matrix. This is the vocabulary.

terms = tfidf_vectorizer.get_feature_names()

dist is defined as 1 minus the cosine similarity of each document. Cosine similarity is measured against the tf-idf matrix and can be used to generate a measure of similarity between each document and the other documents in the corpus (each synopsis among the synopses). Subtracting it from 1 provides cosine distance, which I will use for plotting on a euclidean (2-dimensional) plane.

Note that with dist it is possible to evaluate the similarity of any two or more synopses.

from sklearn.metrics.pairwise import cosine_similarity
dist = 1 - cosine_similarity(tfidf_matrix)
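
With dist in hand, for example, you can pull out the most similar pair of synopses (a small sketch; I'm not asserting which pair it finds):

import numpy as np

#mask the diagonal (each synopsis is at distance 0 from itself), then find the closest pair
masked = dist.copy()
np.fill_diagonal(masked, np.inf)
i, j = np.unravel_index(masked.argmin(), masked.shape)
print(titles[i], '<->', titles[j], '(cosine distance %.3f)' % masked[i, j])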

K-means clustering

Now onto the fun part. Using the tf-idf matrix, you can run a slew of clustering algorithms to better understand the hidden structure within the synopses. I first chose k-means. K-means initializes with a pre-determined number of clusters (I chose 5). Each observation is assigned to a cluster (cluster assignment) so as to minimize the within-cluster sum of squares. Next, the mean of the clustered observations is calculated and used as the new cluster centroid. Then, observations are reassigned to clusters and centroids recalculated in an iterative process until the algorithm reaches convergence.

I found it took several runs for the algorithm to converge to a global optimum, as k-means is susceptible to reaching local optima.

from sklearn.cluster import KMeans
num_clusters = 5
km = KMeans(n_clusters=num_clusters)
#%time km.fit(tfidf_matrix)
km.fit(tfidf_matrix)
clusters = km.labels_.tolist()
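
Because of the local-optima issue mentioned above, it can help to make the restarts explicit. scikit-learn's n_init parameter already does this internally (it runs several random initializations and keeps the best solution by inertia), but here is a sketch that spells it out with 10 hypothetical restarts:

#run several random restarts and keep the run with the lowest within-cluster sum of squares (inertia)
best_km = None
for seed in range(10):
    candidate = KMeans(n_clusters=num_clusters, n_init=1, random_state=seed).fit(tfidf_matrix)
    if best_km is None or candidate.inertia_ < best_km.inertia_:
        best_km = candidate
print(best_km.inertia_)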

I use joblib.dump to pickle the model once it has converged (model persistence: fitting can take a while, so the trained model is saved for reuse), and to reload the model and reassign the labels as the clusters.

from sklearn.externals import joblib
#uncomment the below to save your model 
#since I've already run my model I am loading from the pickle
#joblib.dump(km, 'doc_cluster.pkl') # uncomment on the first run so that doc_cluster.pkl is written; comment it out again afterwards to reuse the persisted model
km = joblib.load('doc_cluster.pkl')
clusters = km.labels_.tolist()
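
Note that recent versions of scikit-learn removed sklearn.externals.joblib; if the import above fails, the standalone joblib package does the same job (a drop-in sketch, assuming joblib is installed):

#equivalent model persistence with the standalone joblib package
import joblib
#joblib.dump(km, 'doc_cluster.pkl')   #first run: save the fitted model
km = joblib.load('doc_cluster.pkl')   #later runs: reload the persisted model
clusters = km.labels_.tolist()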

Here, I create a dictionary of titles, ranks, synopses, the cluster assignment, and the genre [rank and genre were scraped from IMDB].

I convert this dictionary to a Pandas DataFrame for easy access. I'm a huge fan of Pandas and recommend taking a look at some of its awesome functionality which I'll use below, but not describe in a ton of detail.

# generates index for each item in the corpora (in this case it's just rank) and I'll use this for scoring later
ranks = []
for i in range(0,len(titles)):
    ranks.append(i)
#create a genres_list.txt file and copy the contents of https://github.com/brandomr/document_cluster/blob/master/genres_list.txt into it
genres = open('genres_list.txt').read().split('\n')
genres = genres[:100]
films = { 'title': titles, 'rank': ranks, 'synopsis': synopses, 'cluster': clusters, 'genre': genres }
frame = pd.DataFrame(films, index = [clusters] , columns = ['rank', 'title', 'cluster', 'genre'])
print(frame['cluster'].value_counts()) #number of films per cluster (clusters from 0 to 4)

4    26
0    25
2    21
1    16
3    12
dtype: int64
grouped = frame['rank'].groupby(frame['cluster']) #groupby cluster for aggregation purposes
print(grouped.mean()) #average rank (1 to 100) per cluster

cluster
0          47.200000
1          58.875000
2          49.380952
3          54.500000
4          43.730769
dtype: float64

Note that clusters 4 and 0 have the lowest average rank, which indicates that they, on average, contain films that were ranked as "better" on the top 100 list.

Here is some fancy indexing and sorting on each cluster to identify which are the top n (I chose n=6) words that are nearest to the cluster centroid. This gives a good sense of the main topic of the cluster.

#from __future__ import print_function
#SyntaxError: from __future__ imports must occur at the beginning of the file
print("Top terms per cluster:")
print()
#sort cluster centers by proximity to centroid
order_centroids = km.cluster_centers_.argsort()[:, ::-1] 
for i in range(num_clusters):
    print("Cluster %d words: " %i, end='') #%d功能是轉成有符號十進制數 #end=''讓打印不要換行
    for ind in order_centroids[i, :6]: #replace 6 with n words per cluster
        #b'...' is an encoded byte string. the unicode.encode() method outputs a byte string that needs to be converted back to a string with .decode()
        print('%s' %vocab_frame.loc[terms[ind].split(' ')].values.tolist()[0][0].encode('utf-8', 'ignore'), end=', ')
    print() #add whitespace
    print() #add whitespace
    print("Cluster %d titles: " %i, end='')
    for title in frame.loc[i]['title'].values.tolist():
        print(' %s,' %title, end='')
    print() #add whitespace
    print() #add whitespace


Top terms per cluster:

Cluster 0 words: family, home, mother, war, house, dies,

Cluster 0 titles: Schindler's List, One Flew Over the Cuckoo's Nest, Gone with the Wind, The Wizard of Oz, Titanic, Forrest Gump, E.T. the Extra-Terrestrial, The Silence of the Lambs, Gandhi, A Streetcar Named Desire, The Best Years of Our Lives, My Fair Lady, Ben-Hur, Doctor Zhivago, The Pianist, The Exorcist, Out of Africa, Good Will Hunting, Terms of Endearment, Giant, The Grapes of Wrath, Close Encounters of the Third Kind, The Graduate, Stagecoach, Wuthering Heights,

Cluster 1 words: police, car, killed, murders, driving, house,

Cluster 1 titles: Casablanca, Psycho, Sunset Blvd., Vertigo, Chinatown, Amadeus, High Noon, The French Connection, Fargo, Pulp Fiction, The Maltese Falcon, A Clockwork Orange, Double Indemnity, Rebel Without a Cause, The Third Man, North by Northwest,

Cluster 2 words: father, new, york, new, brothers, apartments,

Cluster 2 titles: The Godfather, Raging Bull, Citizen Kane, The Godfather: Part II, On the Waterfront, 12 Angry Men, Rocky, To Kill a Mockingbird, Braveheart, The Good, the Bad and the Ugly, The Apartment, Goodfellas, City Lights, It Happened One Night, Midnight Cowboy, Mr. Smith Goes to Washington, Rain Man, Annie Hall, Network, Taxi Driver, Rear Window,

Cluster 3 words: george, dance, singing, john, love, perform,

Cluster 3 titles: West Side Story, Singin' in the Rain, It's a Wonderful Life, Some Like It Hot, The Philadelphia Story, An American in Paris, The King's Speech, A Place in the Sun, Tootsie, Nashville, American Graffiti, Yankee Doodle Dandy,

Cluster 4 words: killed, soldiers, captain, men, army, command,

Cluster 4 titles: The Shawshank Redemption, Lawrence of Arabia, The Sound of Music, Star Wars, 2001: A Space Odyssey, The Bridge on the River Kwai, Dr. Strangelove or: How I Learned to Stop Worrying and Love the Bomb, Apocalypse Now, The Lord of the Rings: The Return of the King, Gladiator, From Here to Eternity, Saving Private Ryan, Unforgiven, Raiders of the Lost Ark, Patton, Jaws, Butch Cassidy and the Sundance Kid, The Treasure of the Sierra Madre, Platoon, Dances with Wolves, The Deer Hunter, All Quiet on the Western Front, Shane, The Green Mile, The African Queen, Mutiny on the Bounty,

Multidimensional scaling

Here is some code to convert the dist matrix (the pairwise cosine distances between all of the synopses) into a 2-dimensional array using multidimensional scaling (MDS). I won't pretend I know a ton about MDS, but it was useful for this purpose. Another option would be to use principal component analysis.

import os  # for os.path.basename
import matplotlib.pyplot as plt
import matplotlib as mpl
from sklearn.manifold import MDS
MDS()
# two components as we're plotting points in a two-dimensional plane
# "precomputed" because we provide a distance matrix
# we will also specify `random_state` so the plot is reproducible.
mds = MDS(n_components=2, dissimilarity="precomputed", random_state=1)
pos = mds.fit_transform(dist)  # shape (n_samples, n_components)
xs, ys = pos[:, 0], pos[:, 1]
print(xs, ys)
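
As an aside on the PCA alternative mentioned above, projecting the tf-idf matrix directly also works; since tfidf_matrix is sparse, TruncatedSVD (i.e. LSA) is the usual stand-in for PCA. A brief sketch:

#alternative 2-D projection: truncated SVD / LSA on the sparse tf-idf matrix
from sklearn.decomposition import TruncatedSVD

svd = TruncatedSVD(n_components=2, random_state=1)
pos_svd = svd.fit_transform(tfidf_matrix)  #shape (n_samples, 2)
xs_svd, ys_svd = pos_svd[:, 0], pos_svd[:, 1]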

Visualizing document clusters

In this section, I demonstrate how you can visualize the document clustering output using matplotlib and mpld3 (a matplotlib wrapper for D3.js).

First I define some dictionaries for going from cluster number to color and to cluster name. I based the cluster names off the words that were closest to each cluster centroid.

#set up colors per clusters using a dict
cluster_colors = {0: '#1b9e77', 1: '#d95f02', 2: '#7570b3', 3: '#e7298a', 4: '#66a61e'}
#set up cluster names using a dict
cluster_names = {0: 'Family, home, war', 
                 1: 'Police, killed, murders', 
                 2: 'Father, New York, brothers', 
                 3: 'Dance, singing, love', 
                 4: 'Killed, soldiers, captain'}

Next, I plot the labeled observations (films, film titles) colored by cluster using matplotlib. I won't get into too much detail about the matplotlib plot, but I tried to provide some helpful commenting.

#some ipython magic to show the matplotlib plots inline
#%matplotlib inline 
from IPython import get_ipython
get_ipython().run_line_magic('matplotlib', 'inline')
#create data frame that has the result of the MDS plus the cluster numbers and titles
df = pd.DataFrame(dict(x=xs, y=ys, label=clusters, title=titles)) 
#group by cluster
groups = df.groupby('label')
# set up plot
fig, ax = plt.subplots(figsize=(17, 9)) # set size
ax.margins(0.05) # Optional, just adds 5% padding to the autoscaling
#iterate through groups to layer the plot
#note that I use the cluster_name and cluster_color dicts with the 'name' lookup to return the appropriate color/label
for name, group in groups:
    ax.plot(group.x, group.y, marker='o', linestyle='', ms=12, 
            label=cluster_names[name], color=cluster_colors[name], 
            mec='none')
    ax.set_aspect('auto')
    # hide tick marks and labels (newer matplotlib expects booleans here rather than 'on'/'off')
    ax.tick_params(
        axis='x',          # changes apply to the x-axis
        which='both',      # both major and minor ticks are affected
        bottom=False,      # ticks along the bottom edge are off
        top=False,         # ticks along the top edge are off
        labelbottom=False) # no x tick labels
    ax.tick_params(
        axis='y',          # changes apply to the y-axis
        which='both',      # both major and minor ticks are affected
        left=False,        # ticks along the left edge are off
        right=False,       # ticks along the right edge are off
        labelleft=False)   # no y tick labels
ax.legend(numpoints=1)  #show legend with only 1 point
#add label in x,y position with the label as the film title
for i in range(len(df)):
    #unlike .loc, .iloc indexes by integer row/column position
    ax.text(df.loc[i]['x'], df.loc[i]['y'], df.loc[i]['title'], size=8)  
plt.show() #show the plot
#uncomment the below to save the plot if need be
#plt.savefig('clusters_small_noaxes.png', dpi=200)
#plt.close()

The clustering plot looks great, but it pains my eyes to see overlapping labels. Having some experience with D3.js I knew one solution would be to use a browser based/javascript interactive. Fortunately, I recently stumbled upon mpld3, a matplotlib wrapper for D3. Mpld3 basically lets you use matplotlib syntax to create web interactives. It has a really easy, high-level API for adding tooltips on mouse hover, which is what I am interested in.

It also has some nice functionality for zooming and panning. The javascript snippet below basically defines a custom location for where the zoom/pan toggle resides. Don't worry about it too much--you don't actually need to use it--but it helped for formatting purposes when exporting to the web later. The only thing you might want to change is the x and y attr for the position of the toolbar.

#define custom toolbar location
class TopToolbar(mpld3.plugins.PluginBase):
    """Plugin for moving toolbar to top of figure"""
    JAVASCRIPT = """
    mpld3.register_plugin("toptoolbar", TopToolbar);
    TopToolbar.prototype = Object.create(mpld3.Plugin.prototype);
    TopToolbar.prototype.constructor = TopToolbar;
    function TopToolbar(fig, props){
        mpld3.Plugin.call(this, fig, props);
    };
    TopToolbar.prototype.draw = function(){
      // the toolbar svg doesn't exist
      // yet, so first draw it
      this.fig.toolbar.draw();
      // then change the y position to be
      // at the top of the figure
      this.fig.toolbar.toolbar.attr("x", 150);
      this.fig.toolbar.toolbar.attr("y", 400);
      // then remove the draw function,
      // so that it is not called again
      this.fig.toolbar.draw = function() {}
    }
    """
    def __init__(self):
        self.dict_ = {"type": "toptoolbar"}

Here is the actual creation of the interactive scatterplot. I won't go into much more detail about it since it's pretty much a straightforward copy of one of the mpld3 examples, though I use a pandas groupby to group by cluster, then iterate through the groups as I layer the scatterplot. Note that relative to doing this with raw D3, mpld3 is much simpler to integrate into your Python workflow. If you click around the rest of my website you'll see that I do love D3, but for basic interactives I will probably use mpld3 a lot going forward.

Note that mpld3 lets you define some custom CSS, which I use to style the font, the axes, and the left margin on the figure.

#create data frame that has the result of the MDS plus the cluster numbers and titles
df = pd.DataFrame(dict(x=xs, y=ys, label=clusters, title=titles)) 
#group by cluster
groups = df.groupby('label')
#define custom css to format the font and to remove the axis labeling
css = """
text.mpld3-text, div.mpld3-tooltip {
  font-family:Arial, Helvetica, sans-serif;
}
g.mpld3-xaxis, g.mpld3-yaxis {
display: none; }
svg.mpld3-figure {
margin-left: -200px;}
"""
# Plot 
fig, ax = plt.subplots(figsize=(14,6)) #set plot size
ax.margins(0.03) # Optional, just adds 3% padding to the autoscaling
#iterate through groups to layer the plot
#note that I use the cluster_name and cluster_color dicts with the 'name' lookup to return the appropriate color/label
for name, group in groups:
    points = ax.plot(group.x, group.y, marker='o', linestyle='', ms=18, 
                     label=cluster_names[name], mec='none', 
                     color=cluster_colors[name])
    ax.set_aspect('auto')
    labels = [i for i in group.title]
    #set tooltip using points, labels and the already defined 'css'
    tooltip = mpld3.plugins.PointHTMLTooltip(points[0], labels,
                                       voffset=10, hoffset=10, css=css)
    #connect tooltip to fig
    mpld3.plugins.connect(fig, tooltip, TopToolbar())    
    #set tick marks as blank
    ax.axes.get_xaxis().set_ticks([])
    ax.axes.get_yaxis().set_ticks([])
    #set axis as blank
    ax.axes.get_xaxis().set_visible(False)
    ax.axes.get_yaxis().set_visible(False)
ax.legend(numpoints=1) #show legend with only one dot
print(end) # translator's workaround: the figure would not display unless an error was raised here ('end' is intentionally undefined)
mpld3.display() #show the plot
#uncomment the below to export to html
#html = mpld3.fig_to_html(fig)
#print(html)

Hierarchical document clustering

Now that I was successfully able to cluster and plot the documents using k-means, I wanted to try another clustering algorithm. I chose the Ward clustering algorithm because it offers hierarchical clustering. Ward clustering is an agglomerative ("bottom up") clustering method, meaning that at each stage the pair of clusters with minimum between-cluster distance is merged. I used the precomputed cosine distance matrix (dist) to calculate a linkage_matrix, which I then plot as a dendrogram.

Note that this method returned 3 primary clusters, with the largest cluster being split into about 4 major subclusters. Note that the cluster in red contains many of the "Killed, soldiers, captain" films. Braveheart and Gladiator are within the same low-level cluster which is interesting as these are probably my two favorite movies.

from scipy.cluster.hierarchy import ward, dendrogram
linkage_matrix = ward(dist) #define the linkage_matrix using ward clustering pre-computed distances
fig, ax = plt.subplots(figsize=(15, 20)) # set size
ax = dendrogram(linkage_matrix, orientation="right", labels=titles);
# hide x tick marks and labels (newer matplotlib expects booleans rather than 'on'/'off')
plt.tick_params(
    axis='x',          # changes apply to the x-axis
    which='both',      # both major and minor ticks are affected
    bottom=False,      # ticks along the bottom edge are off
    top=False,         # ticks along the top edge are off
    labelbottom=False) # no x tick labels
plt.tight_layout() #show plot with tight layout
#uncomment below to save figure
# print(end) # translator's workaround (see above); not needed when saving the figure to a file
plt.savefig('ward_clusters.png', dpi=200) #save figure as ward_clusters
plt.close()
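
If you also want flat cluster assignments out of the hierarchy (for instance the 3 primary clusters noted above) rather than just the dendrogram, scipy's fcluster can cut the tree. A brief sketch:

from scipy.cluster.hierarchy import fcluster

#cut the Ward linkage into 3 flat clusters and pair each film title with its assignment
ward_clusters = fcluster(linkage_matrix, t=3, criterion='maxclust')
for title, c in list(zip(titles, ward_clusters))[:10]:  #first 10 for brevity
    print(c, title)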

Latent Dirichlet Allocation (omitted)
