[NLP CS224N Notes] Assignment 1 - Exploring Word Vectors

Assignment source: https://github.com/xixiaoyao/CS224n-winter-together

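1. Preface

This post covers the first assignment of the CS224N course. It explores word vectors and gives an intuitive feel for what word embeddings can do. The assignment is not hard, so give it a try if you are interested. Here I simply record my own exploration. The post builds on the theory in my notes for the first lecture: [NLP CS224N Notes] Lecture 1 - Introduction and Word Vectors.

The assignment has two parts. The first part is count-based word vectors. The idea is that words with similar meanings (synonyms) tend to be used in similar contexts, so words that are close in meaning end up appearing alongside the same context words. By examining these contexts we can try to represent each word as a vector. A simple approach relies on how often words appear together, which gives the co-occurrence matrix: a matrix of word vectors built from word counts. The first part is mainly about how to compute this matrix. The second part is prediction-based word vectors. It takes an already-trained word vector matrix and shows what can be done with it: visualizing the vectors, finding synonyms and antonyms, solving word analogies, and so on. Let's go through them one by one.

Outline:

Preparation before the experiment (importing packages and the corpus)
Part1: Count-Based Word Vectors
Part2: Prediction-Based Word Vectors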

Ok, let’s go!

2. Preparation before the experiment

Before the experiment, import the packages we need:

import sys
assert sys.version_info[0]==3
assert sys.version_info[1] >= 5

from gensim.models import KeyedVectors  # KeyedVectors: maps entities (words, documents, images, ...) to vectors; each entity is identified by its string id
from gensim.test.utils import datapath
import pprint     # prettier, more readable output
import matplotlib.pyplot as plt  
plt.rcParams['figure.figsize'] = [10, 5]  # plt.rcParams holds plot settings; this sets the default figure size
import nltk
nltk.download('reuters')    # can also be downloaded from GitHub: https://github.com/nltk/nltk_data/tree/gh-pages/packages/corpora
from nltk.corpus import reuters
import numpy as np
import random
import scipy as sp
from sklearn.decomposition import TruncatedSVD
from sklearn.decomposition import PCA

START_TOKEN = '<START>'
END_TOKEN = '<END>'

np.random.seed(0)
random.seed(0)

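Reuters here is the Reuters (business and financial news) corpus. It contains 10,788 news documents totalling 1.3 million words. The documents span 90 categories and are split into train and test. This time we use the text from one category (crude).

Here is a problem I ran into while importing this corpus. If I run these two lines directly: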

import nltk
nltk.download('reuters')    # can also be downloaded from GitHub: https://github.com/nltk/nltk_data/tree/gh-pages/packages/corpora

On my machine this raises an error:

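So I downloaded the corpus from GitHub first and then imported it. If you hit the same problem, try downloading the corpus separately from nltk_data: open the page, find reuters.zip, and click download. If clicking download then produces another error:

This error means the page cannot be reached even after expanding the details. The fix is to enter chrome://net-internals/#hsts in the Chrome address bar, find the "Delete domain security policies" item, enter the domain github.com (use the domain that cannot be reached; github.com is only for demonstration, and this time it was actually raw.githubusercontent.com), and click Delete. After that the page should load normally. On my machine, however, I still could not connect to raw.githubusercontent.com, so I went through

This way the corpus can be downloaded. Once it is downloaded, a quick word on where to save it. Pick one of these locations:

Also remember to create a "corpora" folder first and put the file inside it.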
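The manual placement step can be sketched in a few lines of standard-library Python. The helper name install_corpus_zip is hypothetical, and the demo below uses a throwaway directory with a dummy archive standing in for the real reuters.zip. In practice the target would be one of the directories NLTK searches, for example ~/nltk_data; whether the archive also needs to be unzipped may depend on the corpus and the NLTK version.

```python
import os
import shutil
import tempfile
import zipfile


def install_corpus_zip(zip_path, nltk_data_dir):
    """Copy a manually downloaded corpus archive into <nltk_data_dir>/corpora/.

    Hypothetical helper for illustration; it is not part of NLTK.
    """
    corpora_dir = os.path.join(nltk_data_dir, "corpora")
    os.makedirs(corpora_dir, exist_ok=True)  # create "corpora" if it is missing
    target = os.path.join(corpora_dir, os.path.basename(zip_path))
    shutil.copyfile(zip_path, target)
    return target


# Demo: a temporary directory and a dummy archive stand in for the real files.
demo_root = tempfile.mkdtemp()
dummy_zip = os.path.join(demo_root, "reuters.zip")
with zipfile.ZipFile(dummy_zip, "w") as zf:
    zf.writestr("reuters/README", "placeholder")

installed = install_corpus_zip(dummy_zip, os.path.join(demo_root, "nltk_data"))
print(installed)
```

For a real install, the second argument would be something like os.path.expanduser("~/nltk_data").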

That's it. Then use the function below to load the corpus:

def read_corpus(category="crude"):
    """ Read files from the specified Reuter's category.
        Params:
            category (string): category name
        Return:
            list of lists, with words from each of the processed files
    """
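    files = reuters.fileids(category)    # documents in the crude category
    # Lowercase every document and add the markers at its start and end
    return [[START_TOKEN] + [w.lower() for w in list(reuters.words(f))] + [END_TOKEN] for f in files]

This function loads the corpus with light preprocessing: it lowercases every token and adds a marker at the start and end of each document to show where it begins and ends, with each document kept as a list of separate words. Let's load it and look at the result: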

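# pprint pretty-prints objects
# pprint.pprint(object, stream=None, indent=1, width=80, depth=None, *, compact=False)
# width: the desired maximum line width, 80 characters by default. Note: a single object longer than width is not split across lines; it simply exceeds the width.
# compact: False by default. If False, each item of a sequence that does not fit within width goes on its own line. If True, as many items as fit within width are put on each line.
reuters_corpus = read_corpus()
pprint.pprint(reuters_corpus[:1], compact=True, width=100)  # with compact=False there would be one word per line

Each document looks like this after processing: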
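As a rough illustration, here is the same preprocessing applied to a made-up token list (toy_doc is invented for this sketch and is not actual Reuters text), printed with both settings of compact so the difference is visible:

```python
import pprint

START_TOKEN = '<START>'
END_TOKEN = '<END>'

# A made-up tokenized document (not taken from the Reuters corpus).
toy_doc = ['Oil', 'prices', 'rose', 'as', 'OPEC', 'cut', 'output', '.']

# Same preprocessing as read_corpus: lowercase every token, add boundary markers.
processed = [START_TOKEN] + [w.lower() for w in toy_doc] + [END_TOKEN]

# compact=True packs as many items as fit into each line of the given width;
# compact=False puts one item per line once the list no longer fits on one line.
compact_text = pprint.pformat([processed], compact=True, width=40)
loose_text = pprint.pformat([processed], compact=False, width=40)

print(compact_text)
print(loose_text)
```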

With this preparation done, we can move on to the two parts.

3. PART 1: Count-Based Word Vectors

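The idea behind this part was described above. A co-occurrence matrix is one way to build this kind of word vector. So what is a co-occurrence matrix? It counts how often words appear together in some context. The original assignment describes it like this:

What it says is this: to build a co-occurrence matrix, we first build a vocabulary of the words; the rows and columns of the matrix are both indexed by the words in this vocabulary. Look at the following example:

The co-occurrence matrix above is built from these two documents. How? First, a vocabulary is built from the words of both documents. Each entry is the number of times two words co-occur in context. Take the first row: START and all co-occur twice, because all falls inside START's window in both documents. The same goes for the all row: fixing a window around all, in the first document the word to its left is START and the word to its right is that; in the second document the left is START and the right is is. So <all, START>=2, <all, that>=1, <all, is>=1. The remaining entries follow the same way.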
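The counts quoted above can be reproduced with a short sketch. It assumes the two documents are "all that glitters is not gold" and "all is well that ends well" and that the window size is 1 (only the immediate left and right neighbours count), which is what the description of the all row implies. With a window of 1, counting co-occurrences reduces to counting adjacent pairs in both directions.

```python
from collections import Counter

# Assumed example documents, with the boundary tokens used elsewhere in this post.
docs = [
    "<START> all that glitters is not gold <END>".split(),
    "<START> all is well that ends well <END>".split(),
]

pair_counts = Counter()
for doc in docs:
    # Window size 1: every adjacent pair co-occurs once in each direction.
    for left, right in zip(doc, doc[1:]):
        pair_counts[(left, right)] += 1
        pair_counts[(right, left)] += 1

print(pair_counts[("<START>", "all")])                          # 2
print(pair_counts[("all", "that")], pair_counts[("all", "is")])  # 1 1
```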

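We want to build such a matrix and use its rows as the word vectors. This is not the final form, though: with a large vocabulary the dimensionality becomes very high, so dimensionality reduction comes to mind. After the reduction, each row is the word vector of one word. The reduction used here is SVD; I won't cover the theory. Specifically it is Truncated SVD, called through the implementation in sklearn.

That gives us the following plan:

For the words of the documents in the corpus, first build a vocabulary (unique words, sorted)
Then, based on the vocabulary and the corpus, build a word vector for every word, i.e. the co-occurrence matrix
Reduce the dimensionality of the co-occurrence matrix to get the final word vectors
Visualize

With this plan in place, let's implement it: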

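3.1 Building a vocabulary for the words in the corpus

A vocabulary records all the words, each exactly once and in sorted order. To build it, I iterate over every document to collect all words, remove duplicates, and sort; the total number of words in the vocabulary also has to be recorded. This leads to the following implementation:

# Compute the distinct words in the corpus and sort them.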
def distinct_words(corpus):
    """ Determine a list of distinct words for the corpus.
        Params:
            corpus (list of list of strings): corpus of documents
        Return:
            corpus_words (list of strings): list of distinct words across the corpus, sorted (using python 'sorted' function)
            num_corpus_words (integer): number of distinct words across the corpus
    """
    corpus_words = []
    num_corpus_words = -1
    
    # ------------------
    # Write your implementation here.
    # Put all words into one list, deduplicate with set, then sort
    for everylist in corpus:
        corpus_words.extend(everylist)
    corpus_words = sorted(set(corpus_words))
    num_corpus_words = len(corpus_words)
    # ------------------

    return corpus_words, num_corpus_words

This is just one way to get the word list; a list comprehension also works:

flattened_list = [word for every_list in corpus for word in every_list]  # flatten to one dimension
corpus_words = sorted(set(flattened_list))  # deduplicate with set, sort with sorted
num_corpus_words = len(corpus_words)  # vocabulary size
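To see what the resulting vocabulary looks like, here is a small sketch on a made-up two-document corpus (toy_corpus is an assumption for illustration). It uses a set comprehension, a third equivalent way to flatten and deduplicate. Note that sorted orders strings by code point, so the <END> and <START> tokens come before all lowercase words.

```python
# A made-up corpus in the same format read_corpus returns.
toy_corpus = [
    ['<START>', 'all', 'that', 'glitters', 'is', 'not', 'gold', '<END>'],
    ['<START>', 'all', 'is', 'well', 'that', 'ends', 'well', '<END>'],
]

# The set comprehension flattens and deduplicates in one pass; sorted() orders the result.
vocab = sorted({word for doc in toy_corpus for word in doc})

print(vocab)
# ['<END>', '<START>', 'all', 'ends', 'glitters', 'gold', 'is', 'not', 'that', 'well']
print(len(vocab))  # 10
```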

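With the vocabulary built, next comes the co-occurrence matrix.

3.2 Building the co-occurrence matrix

Again a quick outline of the idea. The principle was covered above: record how often words appear together. So how is it actually implemented?

First we define a matrix M, the co-occurrence matrix, whose number of rows and columns both equal the vocabulary size (the picture above makes this clear). We also need a mapping from vocabulary words to indices: counting happens while walking through the real documents, but filling the matrix is based on the vocabulary, and the two are linked through the word itself. So we need the vocabulary index of each document word in order to fill the matrix. Hence the following lines: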

def compute_co_occurrence_matrix(corpus, window_size=4):
    """ Compute co-occurrence matrix for the given corpus and window_size (default of 4).
    
        Note: Each word in a document should be at the center of a window. Words near edges will have a smaller
              number of co-occurring words.
              
              For example, if we take the document "START All that glitters is not gold END" with window size of 4,
              "All" will co-occur with "START", "that", "glitters", "is", and "not".
    
        Params:
            corpus (list of list of strings): corpus of documents
            window_size (int): size of context window
        Return:
            M (numpy matrix of shape (number of corpus words, number of corpus words)): 
                Co-occurrence matrix of word counts. 
                The ordering of the words in the rows/columns should be the same as the ordering of the words given by the distinct_words function.
            word2Ind (dict): dictionary that maps word to index (i.e. row/column number) for matrix M.
    """
    words, num_words = distinct_words(corpus)   # words are already deduplicated and sorted
    M = None
    word2Ind = {}
    
    # ------------------
    # Write your implementation here.
    word2Ind = {k: v for (k, v) in zip(words, range(num_words))}
    M = np.zeros((num_words, num_words))

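Next we fill in the co-occurrence matrix. The idea: we iterate over every document, and within each document over every word. For each word we first get its index in the vocabulary, then find the window centered on it, which gives us its context. For every context word we then add a count in the co-occurrence matrix. So every word has two indices here, one in the vocabulary and one in the document: the former is used to write co-occurrence counts into the matrix, the latter to locate the context. The code below continues from the code above (the comments should make it clear):

    # Next, walk through the corpus. For every document we iterate over its words.
    # For every word we find the window range and iterate over the words inside the window.
    # For each of those words we add a count in the matrix M. Note that every word has two indices:
    # one in the vocabulary and one in the document. Co-occurrence counts are stored by vocabulary index,
    # so an index conversion is involved here.
    
    # First iterate over the corpus
    for every_doc in corpus:
        for cword_doc_ind, cword in enumerate(every_doc):  # each word of the current document with its index in the document
            # Look up the current word's position in the vocabulary
            cword_dic_ind = word2Ind[cword]
            
            # Window start and end: the start is the current index minus window_size,
            # the (exclusive) end is the current index plus window_size + 1
            window_start = cword_doc_ind - window_size
            window_end = cword_doc_ind + window_size + 1
            
            # Iterate over every position in the window and record the word in M.
            # Mind the boundaries: words near the start lack a full window on the left,
            # words near the end lack one on the right, so check that position j exists
            for j in range(window_start, window_end):
                # The first two conditions keep j in bounds; the last one skips the center word itself
                if j >=0 and j < len(every_doc) and j != cword_doc_ind:
                    # To add to M we need the context word's position in the vocabulary
                    oword = every_doc[j]   # the context word
                    oword_dic_ind = word2Ind[oword]
                    # Add to M
                    M[cword_dic_ind, oword_dic_ind] += 1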
    # ------------------

    return M, word2Ind
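As a sanity check of the window logic, the following pure-Python sketch reimplements the same counting with the window clamped by max and min instead of testing every index, and stores M as a nested list rather than a NumPy array. The function name cooccurrence_counts is made up for this sketch. It verifies the claim in the docstring that, with window size 4, "all" co-occurs with START, that, glitters, is and not.

```python
def cooccurrence_counts(corpus, window_size=4):
    """Return (M, word2ind), with M a nested list of co-occurrence counts."""
    vocab = sorted({w for doc in corpus for w in doc})
    word2ind = {w: i for i, w in enumerate(vocab)}
    M = [[0] * len(vocab) for _ in vocab]
    for doc in corpus:
        for i, center in enumerate(doc):
            # Clamp the window to the document bounds up front.
            lo = max(0, i - window_size)
            hi = min(len(doc), i + window_size + 1)
            for j in range(lo, hi):
                if j != i:  # skip the center word itself
                    M[word2ind[center]][word2ind[doc[j]]] += 1
    return M, word2ind


doc = "<START> all that glitters is not gold <END>".split()
M, word2ind = cooccurrence_counts([doc], window_size=4)

# Recover the words that have a non-zero count in the row of "all".
ind2word = {i: w for w, i in word2ind.items()}
row = M[word2ind["all"]]
context_of_all = sorted(ind2word[j] for j, count in enumerate(row) if count > 0)
print(context_of_all)  # ['<START>', 'glitters', 'is', 'not', 'that']
```

Because the window is symmetric around the center word, the resulting matrix is symmetric as well.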

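With the code above, the co-occurrence matrix is built. The next step is simple: dimensionality reduction.

3.3 Reducing to k dimensions

The reduction directly calls sklearn.decomposition.TruncatedSVD.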

def reduce_to_k_dim(M, k=2):
    """ Reduce a co-occurrence count matrix of dimensionality (num_corpus_words, num_corpus_words)
        to a matrix of dimensionality (num_corpus_words, k) using the following SVD function from Scikit-Learn:
            - http://scikit-learn.org/stable/modules/generated/sklearn.decomposition.TruncatedSVD.html
    
        Params:
            M (numpy matrix of shape (number of corpus words, number of corpus words)): co-occurrence matrix of word counts
            k (int): embedding size of each word after dimension reduction
        Return:
            M_reduced (numpy matrix of shape (number of corpus words, k)): matrix of k-dimensional word embeddings.
                    In terms of the SVD from math class, this actually returns U * S
    """    
    n_iters = 10     # Use this parameter in your call to `TruncatedSVD`
    M_reduced = None
    print("Running Truncated SVD over %i words..." % (M.shape[0]))
    
    # ------------------
    # Write your implementation here.
    svd = TruncatedSVD(n_components=k, n_iter=n_iters, random_state=2020)
    M_reduced = svd.fit_transform(M)
    # ------------------

    print("Done.")
    return M_reduced
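The docstring notes that the function returns U * S in terms of the SVD. A minimal NumPy sketch of what that means, using the exact decomposition from np.linalg.svd on a made-up 4x4 count matrix instead of sklearn's randomized TruncatedSVD (so values from the two routines may differ slightly, and column signs may be flipped):

```python
import numpy as np

# A small made-up symmetric count matrix standing in for a co-occurrence matrix.
M = np.array([
    [0., 2., 1., 0.],
    [2., 0., 1., 1.],
    [1., 1., 0., 3.],
    [0., 1., 3., 0.],
])
k = 2

# Full SVD: M = U @ diag(S) @ Vt, with singular values in descending order.
U, S, Vt = np.linalg.svd(M)

# Keep the top-k components; each row is a k-dimensional embedding of one word.
M_reduced = U[:, :k] * S[:k]

# Equivalent view: project the rows of M onto the top-k right singular vectors.
M_projected = M @ Vt[:k].T

print(M_reduced.shape)                      # (4, 2)
print(np.allclose(M_reduced, M_projected))  # True
```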

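This needs no explanation. After the reduction we have an embedding vector for every word, which we can visualize with the code below. For reference, the matplotlib gallery is at https://matplotlib.org/gallery/index.html: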

def plot_embeddings(M_reduced, word2Ind, words):
    """ Plot in a scatterplot the embeddings of the words specified in the list "words".
        NOTE: do not plot all the words listed in M_reduced / word2Ind.
        Include a label next to each point.
        
        Params:
            M_reduced (numpy matrix of shape (number of unique words in the corpus , k)): matrix of k-dimensional word embeddings
            word2Ind (dict): dictionary that maps word to indices for matrix M
            words (list of strings): words whose embeddings we want to visualize
    """

    # ------------------
    # Write your implementation here.
    # Iterate over the words and get the x, y coordinates of each one
    for word in words:
        word_dic_index = word2Ind[word]
        x = M_reduced[word_dic_index][0]
        y = M_reduced[word_dic_index][1]
        plt.scatter(x, y, marker='x', color='red')
        # plt.text() adds a text annotation to the figure
        plt.text(x+0.0002, y+0.0002, word, fontsize=9)  # label placed 0.0002 to the upper right of (x, y); word is the label text, fontsize its size
    plt.show()
    # ------------------

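3.4 Putting the steps together:

A quick recap of the process above: first read the data, then compute the co-occurrence matrix, then reduce the dimensionality, and finally visualize: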

reuters_corpus = read_corpus()
M_co_occurrence, word2Ind_co_occurrence = compute_co_occurrence_matrix(reuters_corpus)
M_reduced_co_occurrence = reduce_to_k_dim(M_co_occurrence, k=2)

# Rescale (normalize) the rows to make them each of unit-length
M_lengths = np.linalg.norm(M_reduced_co_occurrence, axis=1)
M_normalized = M_reduced_co_occurrence / M_lengths[:, np.newaxis] # broadcasting

words = ['barrels', 'bpd', 'ecuador', 'energy', 'industry', 'kuwait', 'oil', 'output', 'petroleum', 'venezuela']
plot_embeddings(M_normalized, word2Ind_co_occurrence, words)

The result:

Some closeness is visible, for example oil and energy, or petroleum and industry. That's it for Part 1.
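The rescaling step in the code above divides every row by its own length using NumPy broadcasting. A minimal sketch on three made-up 2-d points shows the shapes involved and that every row ends up with unit length. A row of all zeros would cause a division by zero, which this sketch does not handle.

```python
import numpy as np

# Three made-up 2-d embeddings.
points = np.array([[3.0, 4.0], [0.5, 0.0], [-1.0, 1.0]])

lengths = np.linalg.norm(points, axis=1)   # one length per row, shape (3,)
unit = points / lengths[:, np.newaxis]     # (3, 2) divided by (3, 1): row-wise division

print(lengths.shape, lengths[:, np.newaxis].shape)
print(unit[0])                             # the first row (3, 4) becomes (0.6, 0.8)
print(np.linalg.norm(unit, axis=1))        # every row now has length 1
```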

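4. PART 2: Prediction-Based Word Vectors

4.1 Visualizing embeddings trained with Word2Vec

This part takes a word vector matrix already trained with Word2Vec and tries out some fun effects to see what word vectors are actually good for. So we download a word vector matrix with the gensim package: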

def load_word2vec():
    """ Load Word2Vec Vectors
        Return:
            wv_from_bin: All 3 million embeddings, each length 300
    """
    import gensim.downloader as api
    wv_from_bin = api.load("word2vec-google-news-300")
    vocab = list(wv_from_bin.vocab.keys())  # gensim < 4.0 API; in gensim >= 4.0 use wv_from_bin.key_to_index instead of .vocab
    print("Loaded vocab size %i" % len(vocab))
    return wv_from_bin

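Of course, this code takes a long time to run. It gives us a word vector matrix trained with Word2Vec (similar to our M matrix above, just obtained in a different way). Next we reduce its dimensionality and visualize the embeddings: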

def get_matrix_of_vectors(wv_from_bin, required_words=['barrels', 'bpd', 'ecuador', 'energy', 'industry', 'kuwait', 'oil', 'output', 'petroleum', 'venezuela']):
    """ Put the word2vec vectors into a matrix M.
        Param:
            wv_from_bin: KeyedVectors object; the 3 million word2vec vectors loaded from file
        Return:
            M: numpy matrix shape (num words, 300) containing the vectors
            word2Ind: dictionary mapping each word to its row number in M
    """
    import random
    words = list(wv_from_bin.vocab.keys())
    print("Shuffling words ...")
    random.shuffle(words)
    words = words[:10000]       # keep 10000 of them
    print("Putting %i words into word2Ind and matrix M..." % len(words))
    word2Ind = {}
    M = []
    curInd = 0
    for w in words:
        try:
            M.append(wv_from_bin.word_vec(w))
            word2Ind[w] = curInd
            curInd += 1
        except KeyError:
            continue
    for w in required_words:
        try:
            M.append(wv_from_bin.word_vec(w))
            word2Ind[w] = curInd
            curInd += 1
        except KeyError:
            continue
    M = np.stack(M)
    print("Done.")
    return M, word2Ind

# -----------------------------------------------------------------
# Run Cell to Reduce 300-Dimensional Word Embeddings to k Dimensions
# Note: This may take several minutes
# -----------------------------------------------------------------
wv_from_bin = load_word2vec()               # load the pretrained vectors defined above (slow)
M, word2Ind = get_matrix_of_vectors(wv_from_bin)
M_reduced = reduce_to_k_dim(M, k=2)         # reduced to 2 dimensions

words = ['barrels', 'bpd', 'ecuador', 'energy', 'industry', 'kuwait', 'oil', 'output', 'petroleum', 'venezuela']
plot_embeddings(M_reduced, word2Ind, words)

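The result:

4.2 Cosine similarity

We now have a vector for every word. How do we measure how similar two words are? Cosine similarity is one way. The formula is:
$s = \frac{p \cdot q}{\|p\| \, \|q\|}, \textrm{ where } s \in [-1, 1]$
For details, see: Cosine Similarity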
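The formula translates directly into a few lines of plain Python. The vectors below are made up to show the three typical cases (same direction, orthogonal, opposite direction), and cosine_distance is defined here as 1 minus the similarity.

```python
import math


def cosine_similarity(p, q):
    """s = (p . q) / (||p|| ||q||), a value in [-1, 1]."""
    dot = sum(a * b for a, b in zip(p, q))
    norm_p = math.sqrt(sum(a * a for a in p))
    norm_q = math.sqrt(sum(b * b for b in q))
    return dot / (norm_p * norm_q)


def cosine_distance(p, q):
    """1 - cosine similarity, so smaller means more similar."""
    return 1.0 - cosine_similarity(p, q)


p = [1.0, 2.0, 0.0]
same_direction = cosine_similarity(p, [2.0, 4.0, 0.0])   # close to 1
orthogonal = cosine_similarity(p, [-2.0, 1.0, 0.0])      # 0
opposite = cosine_similarity(p, [-1.0, -2.0, 0.0])       # close to -1

print(same_direction, orthogonal, opposite)
print(cosine_distance(p, [2.0, 4.0, 0.0]))               # close to 0
```

Only the direction of the vectors matters: scaling p or q by a positive constant leaves the similarity unchanged.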

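With this measure we can find polysemous words, synonyms and antonyms, and even do analogical reasoning on words, among other fun things. The rest of this section shows how to do these, all by calling gensim functions directly.

For example, to find the 10 words most similar to a given word,
use gensim's most_similar function (see the GenSim documentation):

# Find the 10 words most similar to energy
wv_from_bin.most_similar("energy")

## Result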
[('renewable_energy', 0.6721636056900024),
 ('enery', 0.6289607286453247),
 ('electricity', 0.6030439138412476),
 ('enegy', 0.6001754403114319),
 ('Energy', 0.595537006855011),
 ('fossil_fuel', 0.5802257061004639),
 ('natural_gas', 0.5767925381660461),
 ('renewables', 0.5708995461463928),
 ('fossil_fuels', 0.5689164996147156),
 ('renewable', 0.5663810968399048)]

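As another example, we can compare synonyms and antonyms. Note that distance returns the cosine distance, i.e. 1 minus the cosine similarity, so a smaller value means the two words are closer: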

w1 = "man"
w2 = "king"
w3 = "woman"
w1_w2_dist = wv_from_bin.distance(w1, w2)
w1_w3_dist = wv_from_bin.distance(w1, w3)

print("Synonyms {}, {} have cosine distance: {}".format(w1, w2, w1_w2_dist))
print("Antonyms {}, {} have cosine distance: {}".format(w1, w3, w1_w3_dist))

## Result:
Synonyms man, king have cosine distance: 0.7705732733011246
Antonyms man, woman have cosine distance: 0.2335987687110901

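So in this output the "antonyms" man and woman are actually much closer (0.23) than the "synonyms" man and king (0.77).

Analogies are also possible:
For example: China : Beijing = Japan : ?. The code below solves this kind of analogy. Pay attention to which words go in positive and which in negative: the ? we are looking for should be similar to Japan and Beijing, and far from China.

# Run this cell to answer the analogy -- China : Beijing :: Japan : x
# NOTE: the query token is spelled 'Bejing' in the original code; 'Beijing' is presumably intended,
# and the scores below may differ with the corrected spelling.
pprint.pprint(wv_from_bin.most_similar(positive=['Bejing', 'Japan'], negative=['China']))

## Result: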
[('Tokyo', 0.6124968528747559),
 ('Osaka', 0.5791803598403931),
 ('Maebashi', 0.5635818243026733),
 ('Fukuoka_Japan', 0.5362966060638428),
 ('Nagoya', 0.5359445214271545),
 ('Fukuoka', 0.5319067239761353),
 ('Osaka_Japan', 0.5298740267753601),
 ('Nagano', 0.5293833017349243),
 ('Taisuke', 0.5258569717407227),
 ('Chukyo', 0.5195443034172058)]
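The idea behind the positive and negative lists can be sketched with plain vector arithmetic: build the query vector Beijing - China + Japan and rank the remaining words by cosine similarity to it. The 2-d vectors and the analogy helper below are made up for illustration. This is a simplified sketch of the idea and not gensim's implementation, which differs in details (for instance, to my knowledge it works with length-normalized vectors).

```python
import math


def cosine(p, q):
    dot = sum(a * b for a, b in zip(p, q))
    return dot / (math.sqrt(sum(a * a for a in p)) * math.sqrt(sum(b * b for b in q)))


# Made-up 2-d vectors: axis 0 roughly encodes the country, axis 1 "is a capital".
toy_vectors = {
    "China":   [1.0, 0.0],
    "Beijing": [1.0, 1.0],
    "Japan":   [-1.0, 0.0],
    "Tokyo":   [-1.0, 1.0],
    "Osaka":   [-1.0, 0.2],
    "oil":     [0.1, -1.0],
}


def analogy(a, b, c, vectors):
    """Rank candidates x for a : b :: c : x by cosine similarity to b - a + c."""
    query = [vb - va + vc for va, vb, vc in zip(vectors[a], vectors[b], vectors[c])]
    # The three query words themselves are excluded from the candidates.
    candidates = [w for w in vectors if w not in (a, b, c)]
    return sorted(candidates, key=lambda w: cosine(vectors[w], query), reverse=True)


ranking = analogy("China", "Beijing", "Japan", toy_vectors)
print(ranking)  # ['Tokyo', 'Osaka', 'oil']
```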

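5. Summary

A short wrap-up. The first assignment is more of a warm-up, so it is not very difficult, but it is still fun, and I learned how to obtain word vectors from a co-occurrence matrix. This idea comes up again in the second lecture. Part 1 presented one way to compute word vectors, a count-based (statistical) method; Part 2 used word vectors already trained with Word2Vec and demonstrated what they can do.

That's all for today. If you run into trouble downloading the corpus for this experiment, try downloading it separately first and then continue. Off to the second lecture. Keep rushing!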
