Gensim, the NLP Power Tool: Visualizing word2vec Word Embeddings

Original

2020-06-16 09:14

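This post is part of a series. The earlier posts are:
Gensim, the NLP Power Tool: Core Features and Installation
Gensim, the NLP Power Tool: A Word2Vec Model Walkthrough
Gensim, the NLP Power Tool: Training Your Own word2vec Word Vector Model
Gensim, the NLP Power Tool: Parameter Settings for Training word2vec Word Vectors
Gensim, the NLP Power Tool: Memory Requirements and Evaluation of word2vec Models
Gensim, the NLP Power Tool: Resuming word2vec Training by Loading a Saved Model
Gensim, the NLP Power Tool: Computing word2vec Training Loss and Choosing a Baseline
Gensim, the NLP Power Tool: Adding a Model-to-Dict Method to Speed Up word2vec Lookups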

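Using t-SNE to reduce the word embeddings to 2 dimensions makes them possible to visualize.

The visualization shows how semantic and syntactic patterns in the data tend to appear.

For example:

Semantic: words such as cat, dog, and cow tend to sit close together.
Syntactic: run and running, or cut and cutting, tend to sit close together.
Vector relationships: relations such as vKing - vMan ≈ vQueen - vWoman can also show up.

Note: the demo model was trained on a small corpus (the lee_background corpus), so some of these relationships will not look very pronounced!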
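The vector relationship above can be checked numerically as well as visually. The following is a minimal sketch, not the method used in this series: the 3-dimensional vectors are hand-picked, hypothetical values chosen so that the analogy holds exactly, and `solve_analogy` is an illustrative helper name. Real word2vec vectors are learned from a corpus and only satisfy the relation approximately.

```python
import numpy as np

# Hypothetical toy vectors, hand-picked for illustration only.
# Real word2vec vectors are learned and have far more dimensions.
toy_vectors = {
    "king":  np.array([0.9, 0.8, 0.1]),
    "queen": np.array([0.9, 0.1, 0.8]),
    "man":   np.array([0.1, 0.9, 0.1]),
    "woman": np.array([0.1, 0.2, 0.8]),
    "apple": np.array([0.5, 0.5, 0.5]),
}


def cosine(a, b):
    """Cosine similarity between two 1-D vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


def solve_analogy(vectors, a, b, c):
    """Return the word closest to vec(a) - vec(b) + vec(c), excluding a, b, c."""
    target = vectors[a] - vectors[b] + vectors[c]
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(vectors[w], target))


# king - man + woman should land nearest to queen for these toy values.
best = solve_analogy(toy_vectors, "king", "man", "woman")
print(best)
```

With a trained Gensim model, a comparable query is `model.wv.most_similar(positive=['king', 'woman'], negative=['man'])`, although with a corpus as small as lee_background the top result may not be the expected word.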

plotly needs to be installed first:

pip install plotly

Code:

%matplotlib inline
import matplotlib.pyplot as plt
from sklearn.decomposition import IncrementalPCA    # initial reduction (imported but not used below)
from sklearn.manifold import TSNE                   # final reduction
import numpy as np                                  # array handling


def reduce_dimensions(model):
    num_dimensions = 2  # final num dimensions (2D, 3D, etc)

    vectors = [] # positions in vector space
    labels = [] # keep track of words to label our data again later
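    # Gensim 4.x renamed `wv.vocab` to `wv.key_to_index`; fall back for 3.x.
    vocab = model.wv.key_to_index if hasattr(model.wv, "key_to_index") else model.wv.vocab
    for word in vocab: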
        vectors.append(model.wv[word])
        labels.append(word)

    # convert both lists into numpy vectors for reduction
    vectors = np.asarray(vectors)
    labels = np.asarray(labels)

    # reduce using t-SNE
    tsne = TSNE(n_components=num_dimensions, random_state=0)
    vectors = tsne.fit_transform(vectors)

    x_vals = [v[0] for v in vectors]
    y_vals = [v[1] for v in vectors]
    return x_vals, y_vals, labels


x_vals, y_vals, labels = reduce_dimensions(model)

def plot_with_plotly(x_vals, y_vals, labels, plot_in_notebook=True):
    from plotly.offline import init_notebook_mode, iplot, plot
    import plotly.graph_objs as go

    trace = go.Scatter(x=x_vals, y=y_vals, mode='text', text=labels)
    data = [trace]

    if plot_in_notebook:
        init_notebook_mode(connected=True)
        iplot(data, filename='word-embedding-plot')
    else:
        plot(data, filename='word-embedding-plot.html')


def plot_with_matplotlib(x_vals, y_vals, labels):
    import matplotlib.pyplot as plt
    import random

    random.seed(0)

    plt.figure(figsize=(12, 12))
    plt.scatter(x_vals, y_vals)

    #
    # Label randomly subsampled 25 data points
    #
    indices = list(range(len(labels)))
    selected_indices = random.sample(indices, 25)
    for i in selected_indices:
        plt.annotate(labels[i], (x_vals[i], y_vals[i]))

try:
    get_ipython()
except Exception:
    plot_function = plot_with_matplotlib
else:
    plot_function = plot_with_plotly

plot_function(x_vals, y_vals, labels)
plt.show()

Output:

[Figure: 2-D t-SNE plot of the word embeddings, with each word drawn at its reduced coordinates]
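The program imports `IncrementalPCA` with the comment "initial reduction" but never calls it. One plausible reading is that a PCA pass was meant to shrink the vectors before t-SNE, which is a common way to cut t-SNE's running time on large vocabularies. The sketch below shows that two-stage idea under stated assumptions: the function name `reduce_with_pca_then_tsne`, the choice of 10 intermediate dimensions, the perplexity of 10, and the random stand-in data are all illustrative and do not come from the original program.

```python
import numpy as np
from sklearn.decomposition import IncrementalPCA
from sklearn.manifold import TSNE


def reduce_with_pca_then_tsne(vectors, pca_dims=10, final_dims=2, perplexity=10.0):
    """Coarse reduction with PCA, then final reduction with t-SNE."""
    vectors = np.asarray(vectors)

    # Stage 1: linear reduction to a moderate number of dimensions.
    pca = IncrementalPCA(n_components=pca_dims)
    coarse = pca.fit_transform(vectors)

    # Stage 2: non-linear reduction to the plotting dimensions.
    # perplexity must be smaller than the number of samples.
    tsne = TSNE(n_components=final_dims, perplexity=perplexity, random_state=0)
    return tsne.fit_transform(coarse)


# Random stand-in data: 60 "words" with 50-dimensional vectors (not real embeddings).
rng = np.random.default_rng(0)
fake_vectors = rng.normal(size=(60, 50))

points = reduce_with_pca_then_tsne(fake_vectors)
print(points.shape)  # (60, 2)
```

For a vocabulary as small as the lee_background model's, running t-SNE directly, as the program above does, is usually fast enough, so the extra PCA stage mainly matters for larger models.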

That wraps up the demo of the word2vec model in Gensim!
