Word Embedding Basics

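In the section "Implementing Recurrent Neural Networks from Scratch" we represented words with one-hot vectors. They are easy to construct, but they are usually not a good choice. A major reason is that one-hot word vectors cannot express the similarity between different words, for example under the cosine similarity we commonly use: any two distinct one-hot vectors are orthogonal, so their cosine similarity is always 0.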
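To make the point about cosine similarity concrete, here is a minimal sketch in plain Python. The three-word vocabulary and the helper names `cosine` and `one_hot` are illustrative assumptions, not part of the original code.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def one_hot(index, size):
    """One-hot vector of length `size` with a 1 at position `index`."""
    return [1.0 if i == index else 0.0 for i in range(size)]

vocab = ['chip', 'microprocessor', 'banana']  # hypothetical toy vocabulary
vectors = [one_hot(i, len(vocab)) for i in range(len(vocab))]

sim_related = cosine(vectors[0], vectors[1])    # chip vs microprocessor
sim_unrelated = cosine(vectors[0], vectors[2])  # chip vs banana
print(sim_related, sim_unrelated)
```

Both similarities come out as zero, so a related pair and an unrelated pair look exactly the same under a one-hot representation.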

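The Word2Vec word embedding tool was proposed to solve exactly this problem. It represents each word as a fixed-length vector and, through pretraining on a corpus, makes these vectors express similarity and analogy relationships between words reasonably well, thereby introducing some semantic information. Based on two different probabilistic assumptions, we can define two Word2Vec models:

Skip-Gram model: assumes that the context words are generated by the center word, that is, it models $P(w_o\mid w_c)$, where $w_c$ is the center word and $w_o$ is any context word;

CBOW (continuous bag-of-words) model: assumes that the center word is generated by the context words, that is, it models $P(w_c\mid \mathcal{W}_o)$, where $\mathcal{W}_o$ is the set of context words.

Here we mainly cover the implementation of the Skip-Gram model. The CBOW implementation is similar, and readers can try it on their own afterwards. The rest of this post is organized into four parts:

The PTB dataset
The Skip-Gram model
Negative sampling approximation
Training the model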

!pip install torchtext
import collections
import math
import random
import sys
import time
import os
import numpy as np
import torch
from torch import nn
import torch.utils.data as Data
# Load the data


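The PTB Dataset

Simply put, Word2Vec learns from a corpus how to map discrete words to vectors in a continuous space while preserving their semantic similarity. To train a Word2Vec model we therefore need a natural language corpus from which the model learns the relationships between words. Here we use the classic PTB corpus. PTB (Penn Tree Bank) is a commonly used small corpus sampled from Wall Street Journal articles, and it includes a training set, a validation set, and a test set. We will train the word embedding model on the PTB training set.

Loading the dataset

A sample of the training file ptb.train.txt: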

aer banknote berlitz calloway centrust cluett fromstein gitano guterman ...
pierre <unk> N years old will join the board as a nonexecutive director nov. N
mr. <unk> is chairman of <unk> n.v. the dutch publishing group
...

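with open('/home/kesci/input/ptb_train1020/ptb.train.txt', 'r') as f:
    lines = f.readlines()  # sentences in this dataset are separated by newlines
    raw_dataset = [st.split() for st in lines]  # st is short for sentence; words are separated by spaces
print('# sentences: %d' % len(raw_dataset))

# For the first 3 sentences, print the number of tokens and the first 5 tokens.
# The end-of-sentence marker is '<eos>', rare words are all replaced by '<unk>', and numbers by 'N'.
for st in raw_dataset[:3]:
    print('# tokens:', len(st), st[:5])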

# sentences: 42068
# tokens: 24 ['aer', 'banknote', 'berlitz', 'calloway', 'centrust']
# tokens: 15 ['pierre', '<unk>', 'N', 'years', 'old']
# tokens: 11 ['mr.', '<unk>', 'is', 'chairman', 'of']

Building the word index

counter = collections.Counter([tk for st in raw_dataset for tk in st])  # tk is short for token; count every word of every sentence
# print(counter['aer'])
counter = dict(filter(lambda x: x[1] >= 5, counter.items()))  # keep only words that appear at least 5 times

idx_to_token = [tk for tk, _ in counter.items()]  # index -> token: the position in the list is the word id
token_to_idx = {tk: idx for idx, tk in enumerate(idx_to_token)}  # reverse mapping, token -> index
dataset = [[token_to_idx[tk] for tk in st if tk in token_to_idx]
           for st in raw_dataset]  # words in raw_dataset are converted to their indices here
# for st in dataset[:3]:
#     print('# tokens:', len(st), st[:5])
num_tokens = sum([len(st) for st in dataset])
'# tokens: %d' % num_tokens  # 887100 token occurrences in total (repeated words are counted each time)

'# tokens: 887100'

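Subsampling

Text data usually contains some high-frequency words, such as "the", "a", and "in" in English. Generally speaking, within a context window, a word (such as "chip") co-occurring with a lower-frequency word (such as "microprocessor") is more useful for training a word embedding model than co-occurring with a higher-frequency word (such as "the"). Therefore, words can be subsampled when training a word embedding model. Specifically, each indexed word $w_i$ in the dataset is discarded with some probability, given by

$P(w_i)=\max(1-\sqrt{\frac{t}{f(w_i)}},0)$

where $f(w_i)$ is the ratio of the number of occurrences of word $w_i$ to the total number of words in the dataset, and the constant $t$ is a hyperparameter (set to $10^{-4}$ in our experiments). Word $w_i$ can only be discarded during subsampling when $f(w_i)>t$, and the more frequent a word is, the higher its discard probability. The code is as follows: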
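As a quick worked example of the formula, the sketch below evaluates the discard probability for two words using the counts reported elsewhere in this post ('the' appears 50770 times and 'join' 45 times out of 887100 tokens). The helper name `discard_prob` is an illustrative assumption.

```python
import math

def discard_prob(count, total, t=1e-4):
    """Discard probability max(1 - sqrt(t / f), 0), with f = count / total."""
    f = count / total
    return max(1 - math.sqrt(t / f), 0)

total = 887100                       # total number of indexed tokens reported in this post
p_the = discard_prob(50770, total)   # 'the' appears 50770 times
p_join = discard_prob(45, total)     # 'join' appears 45 times
print('the: %.3f, join: %.3f' % (p_the, p_join))
```

Under these counts 'the' is dropped roughly 96% of the time, which leaves about 2,100 occurrences, while 'join' has $f(w_i)<t$ and is never dropped.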

def discard(idx):
    '''
    @params:
        idx: index of the word
    @return: True/False, whether to discard the word
    random.uniform(x, y) returns a random float in the range [x, y]
    '''
    return random.uniform(0, 1) < 1 - math.sqrt(
        1e-4 / counter[idx_to_token[idx]] * num_tokens)  # when this value is positive, the word is discarded with that probability

subsampled_dataset = [[tk for tk in st if not discard(tk)] for st in dataset]  # subsampling
print('# tokens: %d' % sum([len(st) for st in subsampled_dataset]))

def compare_counts(token):
    return '# %s: before=%d, after=%d' % (token, sum(
        [st.count(token_to_idx[token]) for st in dataset]), sum(
        [st.count(token_to_idx[token]) for st in subsampled_dataset]))
# The counts below illustrate that the more frequent a word is, the more likely it is to be dropped
print(compare_counts('the'))
print(compare_counts('join'))

# tokens: 376204
# the: before=50770, after=2144
# join: before=45, after=45

Extracting center words and context words

def get_centers_and_contexts(dataset, max_window_size):
    '''
    @params:
        dataset: a collection of sentences; each sentence is a list of words already converted to integer indices
        max_window_size: maximum size of the context window
    @return:
        centers: the list of center words
        contexts: the list of context windows, aligned with centers; each window is a list of context words
    '''
    centers, contexts = [], []
    for st in dataset:
        if len(st) < 2:  # a sentence needs at least 2 words to form a "center word - context word" pair
            continue
        centers += st
        for center_i in range(len(st)):
            window_size = random.randint(1, max_window_size)  # randomly pick the context window size
            indices = list(range(max(0, center_i - window_size),
                                 min(len(st), center_i + 1 + window_size)))
            indices.remove(center_i)  # exclude the center word from the context words
            contexts.append([st[idx] for idx in indices])
    # print(centers)
    # print(contexts)
    return centers, contexts

all_centers, all_contexts = get_centers_and_contexts(subsampled_dataset, 5)  # center words and context words

tiny_dataset = [list(range(7)), list(range(7, 10))]
print('dataset', tiny_dataset)
for center, context in zip(*get_centers_and_contexts(tiny_dataset, 2)):
    print('center', center, 'has contexts', context)

dataset [[0, 1, 2, 3, 4, 5, 6], [7, 8, 9]]
center 0 has contexts [1, 2]
center 1 has contexts [0, 2]
center 2 has contexts [1, 3]
center 3 has contexts [2, 4]
center 4 has contexts [2, 3, 5, 6]
center 5 has contexts [3, 4, 6]
center 6 has contexts [5]
center 7 has contexts [8]
center 8 has contexts [7, 9]
center 9 has contexts [7, 8]

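Note: batched data loading depends on the negative sampling implementation, so it is covered in the negative sampling section.

The Skip-Gram Model

In the skip-gram model, each word is represented by two $d$-dimensional vectors that are used to compute conditional probabilities. Suppose the word has index $i$ in the dictionary; its vector is $\boldsymbol{v}_i\in\mathbb{R}^d$ when it is the center word and $\boldsymbol{u}_i\in\mathbb{R}^d$ when it is a context word. Let the center word $w_c$ have index $c$ and the context word $w_o$ have index $o$ in the dictionary. We assume that the conditional probability of generating the context word given the center word satisfies: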

$P(w_o\mid w_c)=\frac{\exp(\boldsymbol{u}_o^\top \boldsymbol{v}_c)}{\sum_{i\in\mathcal{V}}\exp(\boldsymbol{u}_i^\top \boldsymbol{v}_c)}$
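The skip-gram conditional probability is a softmax over the inner products $\boldsymbol{u}_i^\top \boldsymbol{v}_c$ for every word $i$ in the dictionary. The sketch below computes it directly for a toy vocabulary; the sizes and the random matrices `V` and `U` are illustrative assumptions rather than trained vectors.

```python
import torch

torch.manual_seed(0)
vocab_size, d = 6, 4             # hypothetical toy sizes
V = torch.randn(vocab_size, d)   # center word vectors v_i
U = torch.randn(vocab_size, d)   # context word vectors u_i

c = 2                                  # index of the center word
scores = U @ V[c]                      # u_i^T v_c for every word i in the vocabulary
probs = torch.softmax(scores, dim=0)   # P(w_o | w_c) for every candidate context word o

print(probs, probs.sum())
```

Every probability requires the sum over the whole vocabulary in the denominator, which is the cost that negative sampling is meant to avoid.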

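PyTorch's built-in Embedding layer

The basics of Embedding are as introduced above. What I want to stress, however, is that its value is not limited to word embedding or entity embedding: the underlying idea of representing categorical data with learnable low-dimensional vectors is valuable in its own right. With it, neural networks and deep learning can be applied to a much wider range of fields, and an Embedding can represent many more kinds of things. The key is to think clearly about the problem we need to solve and about what the Embedding representation actually gives us.

embed = nn.Embedding(num_embeddings=10, embedding_dim=4)  # weight has shape (10, 4): dim 0 is the (maximum) number of words, dim 1 is the embedding dimension
print(embed.weight)

x = torch.tensor([[1, 2, 3], [4, 5, 6]], dtype=torch.long)
print(embed(x))
# each index is mapped to a 1 x 4 vector, i.e. the corresponding row of embed.weight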

Parameter containing:
tensor([[-0.2204, -0.1124,  1.2359,  1.1613],
        [ 0.7078,  0.0692, -1.6871,  1.9099],
        [-0.5771, -0.9069, -1.0943, -1.4463],
        [ 0.2425, -1.6640, -0.3192,  0.2481],
        [-2.5930,  0.7935, -0.4504,  0.2491],
        [ 1.5414, -0.0749, -1.9898, -0.7608],
        [ 0.8809,  1.4855,  2.0447, -0.4760],
        [ 1.3550,  0.3636,  0.9588,  0.7353],
        [-1.4855, -2.3381,  0.7396,  0.2760],
        [-0.4090,  1.7631,  0.1599,  1.1009]], requires_grad=True)
tensor([[[ 0.7078,  0.0692, -1.6871,  1.9099],
         [-0.5771, -0.9069, -1.0943, -1.4463],
         [ 0.2425, -1.6640, -0.3192,  0.2481]],

        [[-2.5930,  0.7935, -0.4504,  0.2491],
         [ 1.5414, -0.0749, -1.9898, -0.7608],
         [ 0.8809,  1.4855,  2.0447, -0.4760]]], grad_fn=<EmbeddingBackward>)
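One way to read the output above is that an Embedding lookup selects rows of the weight matrix, which is the same as multiplying a one-hot encoding of the indices by that matrix. The following sketch checks this equivalence; it is only an illustration, not how PyTorch implements the lookup internally.

```python
import torch
from torch import nn
import torch.nn.functional as F

embed = nn.Embedding(num_embeddings=10, embedding_dim=4)
x = torch.tensor([[1, 2, 3], [4, 5, 6]], dtype=torch.long)

looked_up = embed(x)                             # shape (2, 3, 4)
one_hot = F.one_hot(x, num_classes=10).float()   # shape (2, 3, 10)
via_matmul = one_hot @ embed.weight              # shape (2, 3, 4)

print(looked_up.shape, torch.allclose(looked_up, via_matmul))
```

The lookup avoids materializing the one-hot tensor, which matters once the vocabulary has thousands of entries.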

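PyTorch's built-in batch matrix multiplication

X = torch.ones((2, 1, 4))
Y = torch.ones((2, 4, 6))
print(torch.bmm(X, Y).shape)
# batch multiplication: for each of the 2 batch elements, (1, 4) @ (4, 6) -> (1, 6), so the result has shape (2, 1, 6)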

torch.Size([2, 1, 6])
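To see what `torch.bmm` does element by element, the sketch below compares it with an explicit loop of ordinary matrix products over the batch dimension. The random inputs are an assumption made so that the comparison is not trivial.

```python
import torch

torch.manual_seed(0)
X = torch.randn(2, 1, 4)
Y = torch.randn(2, 4, 6)

batched = torch.bmm(X, Y)  # shape (2, 1, 6)
# Same computation written as one matrix product per batch element
looped = torch.stack([X[i] @ Y[i] for i in range(X.shape[0])])

print(batched.shape, torch.allclose(batched, looped, atol=1e-6))
```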

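Forward computation of the Skip-Gram model

def skip_gram(center, contexts_and_negatives, embed_v, embed_u):
    '''
    @params:
        center: center word indices, integer tensor of shape (n, 1)
        contexts_and_negatives: context and noise word indices, integer tensor of shape (n, m)
        embed_v: embedding layer for center words
        embed_u: embedding layer for context words
    @return:
        pred: inner products of the center word with each context (or noise) word, later used to compute the probability p(w_o|w_c)
    '''
    v = embed_v(center) # shape of (n, 1, d)
    u = embed_u(contexts_and_negatives) # shape of (n, m, d)
    pred = torch.bmm(v, u.permute(0, 2, 1)) # bmm((n, 1, d), (n, d, m)) => shape of (n, 1, m)
    # pred holds the inner product of the center word with each context/noise word
    # permute swaps the last two dimensions so the shapes line up for bmm
    return pred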

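Negative Sampling Approximation

Training a neural network means feeding in training samples and repeatedly adjusting the neuron weights so that predictions of the target keep improving. Each time the network is trained on one sample, its weights are adjusted once.

As discussed above, the vocabulary size means that our Skip-Gram network has a very large weight matrix, and all of these weights have to be adjusted over hundreds of millions of training samples. This consumes a great deal of compute, and training is very slow in practice.

Negative sampling addresses this problem. It is a method that speeds up training and improves the quality of the resulting word vectors. Instead of updating all weights for every training sample, negative sampling lets each sample update only a small fraction of the weights, which reduces the computation in gradient descent.

When we train the network with the sample (input word: "fox", output word: "quick"), both "fox" and "quick" are one-hot encoded. If the vocabulary size is 10000, then at the output layer we want the neuron for "quick" to output 1 and the other 9999 to output 0. The words corresponding to these 9999 neurons that should output 0 are called "negative" words.

With negative sampling, we randomly select a small number of negative words (say 5) and update only their weights. We also update the weights for the "positive" word ("quick" in the example above).
In the paper, the authors point out that 5-20 negative words work well for small datasets, while 2-5 are enough for large datasets.
Recall that our hidden-to-output layer has a 300 x 10000 weight matrix. With negative sampling we only update the weights of the output neurons for the positive word "quick" and the 5 selected negative words, 6 output neurons in total, so each step updates only $300\times 6=1800$ weights. Out of 3 million weights, that is just 0.06%, so computation becomes far more efficient.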
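The arithmetic in the example above can be checked in a few lines. The variable names are illustrative; the numbers are the ones used in the text (300-dimensional vectors, a 10000-word vocabulary, 5 negative words).

```python
hidden_size = 300     # embedding dimension in the example above
vocab_size = 10000    # vocabulary size in the example above
num_negatives = 5     # negative words sampled per training sample

total_weights = hidden_size * vocab_size             # size of the hidden-to-output matrix
updated_weights = hidden_size * (1 + num_negatives)  # positive word plus the negatives
fraction = updated_weights / total_weights

print(total_weights, updated_weights, '%.2f%%' % (fraction * 100))
```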

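Because the softmax operation considers that the context word could be any word in the dictionary $\mathcal{V}$, the computational cost can become too high for larger dictionaries with hundreds of thousands or millions of words. Taking the skip-gram model as an example, we introduce an implementation of negative sampling to address this problem.

Negative sampling approximates the conditional probability $P(w_o\mid w_c)=\frac{\exp(\boldsymbol{u}_o^\top \boldsymbol{v}_c)}{\sum_{i\in\mathcal{V}}\exp(\boldsymbol{u}_i^\top \boldsymbol{v}_c)}$ with the following formula:

$P(w_o\mid w_c)=P(D=1\mid w_c,w_o)\prod_{k=1,w_k\sim P(w)}^K P(D=0\mid w_c,w_k)$

where $P(D=1\mid w_c,w_o)=\sigma(\boldsymbol{u}_o^\top\boldsymbol{v}_c)$, $P(D=0\mid w_c,w_k)=1-\sigma(\boldsymbol{u}_k^\top\boldsymbol{v}_c)$, and $\sigma(\cdot)$ is the sigmoid function. For each pair of center word and context word, we randomly sample $K$ noise words from the dictionary ($K=5$ in our experiments). Following the suggestion of the Word2Vec paper, the noise word sampling probability $P(w)$ is set proportional to the ratio of the frequency of $w$ to the total frequency, raised to the power $0.75$.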
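To illustrate the effect of the $0.75$ power on the noise distribution, the sketch below compares plain relative frequencies with the smoothed sampling probabilities. The counts for 'the' and 'join' are the ones reported in this post; the count for 'chip' is a made-up value for illustration, and `normalize` is a hypothetical helper.

```python
counts = {'the': 50770, 'chip': 100, 'join': 45}  # the 'chip' count is a made-up illustration value

def normalize(weights):
    """Scale the values of a dict so that they sum to 1."""
    total = sum(weights.values())
    return {w: v / total for w, v in weights.items()}

unigram = normalize(counts)                                   # plain relative frequency
noise = normalize({w: c ** 0.75 for w, c in counts.items()})  # frequency raised to the power 0.75

for w in counts:
    print('%-5s unigram=%.4f noise=%.4f' % (w, unigram[w], noise[w]))
```

Raising counts to the power $0.75$ flattens the distribution: rare words are drawn as noise words somewhat more often than their raw frequency suggests, and very frequent words somewhat less often.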

def get_negatives(all_contexts, sampling_weights, K):
    '''
    @params:
        all_contexts: [[w_o1, w_o2, ...], [...], ... ]
        sampling_weights: noise word sampling weight of each word
        K: number of noise words sampled per context word (2-5 for large datasets)
    @return:
        all_negatives: [[w_n1, w_n2, ...], [...], ...]
    '''
    all_negatives, neg_candidates, i = [], [], 0
    population = list(range(len(sampling_weights)))  # list of word indices
    for contexts in all_contexts:  # context words of each center word
        context_set = set(contexts)  # build the set once per center word
        negatives = []
        while len(negatives) < len(contexts) * K:
            if i == len(neg_candidates):
                # Randomly draw k word indices as noise words according to sampling_weights.
                # For efficiency, k can be set fairly large.
                i, neg_candidates = 0, random.choices(
                    population, sampling_weights, k=int(1e5))
                # print(neg_candidates,0)
            neg, i = neg_candidates[i], i + 1
            # a noise word must not be a context word
            if neg not in context_set:
                negatives.append(neg)
        all_negatives.append(negatives)
    return all_negatives

sampling_weights = [counter[w]**0.75 for w in idx_to_token]
# print(sampling_weights)
all_negatives = get_negatives(all_contexts, sampling_weights, 5)
# all_contexts: the context words

Note: besides negative sampling, hierarchical softmax can also be used to address the excessive computation; see Section 10.2.2 of the original book.

Reading data in batches

class MyDataset(torch.utils.data.Dataset):
    def __init__(self, centers, contexts, negatives):
        assert len(centers) == len(contexts) == len(negatives)
        self.centers = centers
        self.contexts = contexts
        self.negatives = negatives

    def __getitem__(self, index):
        return (self.centers[index], self.contexts[index], self.negatives[index])

    def __len__(self):
        return len(self.centers)

def batchify(data):
    '''
    Used as the collate_fn argument of DataLoader
    @params:
        data: a list of length batch_size; each element is a (center, contexts, negatives) result of __getitem__
    @outputs:
        batch: the tuple (centers, contexts_negatives, masks, labels) after batching
            centers: center word indices, integer tensor of shape (n, 1)
            contexts_negatives: context and noise word indices, integer tensor of shape (n, m)
            masks: mask matching the padding, 0/1 integer tensor of shape (n, m)
            labels: labels marking the context words, 0/1 integer tensor of shape (n, m)
    '''
    max_len = max(len(c) + len(n) for _, c, n in data)  # longest contexts + negatives in the batch
    centers, contexts_negatives, masks, labels = [], [], [], []
    for center, context, negative in data:
        cur_len = len(context) + len(negative)
        centers += [center]  # center word
        contexts_negatives += [context + negative + [0] * (max_len - cur_len)]  # pad with 0
        masks += [[1] * cur_len + [0] * (max_len - cur_len)]  # the mask keeps padding out of the loss
        labels += [[1] * len(context) + [0] * (max_len - len(context))]  # 1 for context words, 0 otherwise
    # Build the tensors once, after the loop, and return them (collate_fn must return the batch)
    batch = (torch.tensor(centers).view(-1, 1), torch.tensor(contexts_negatives),
             torch.tensor(masks), torch.tensor(labels))
    return batch

batch_size = 512
num_workers = 0 if sys.platform.startswith('win32') else 4

dataset = MyDataset(all_centers, all_contexts, all_negatives)  # wrap all center, context, and noise words
data_iter = Data.DataLoader(dataset, batch_size, shuffle=True,
                            collate_fn=batchify, 
                            num_workers=num_workers)
for batch in data_iter:
    for name, data in zip(['centers', 'contexts_negatives', 'masks',
                           'labels'], batch):
        print(name, 'shape:', data.shape)
    break

centers shape: torch.Size([512, 1])
contexts_negatives shape: torch.Size([512, 60])
masks shape: torch.Size([512, 60])
labels shape: torch.Size([512, 60])
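The padding, mask, and label logic inside `batchify` is easier to follow on a tiny example. The sketch below repeats that logic for two made-up samples in plain Python lists, without tensors or a DataLoader; `pad_example` and the toy indices are illustrative assumptions.

```python
def pad_example(context, negative, max_len):
    """Return (contexts_negatives, mask, label) padded to max_len for one sample."""
    cur_len = len(context) + len(negative)
    contexts_negatives = context + negative + [0] * (max_len - cur_len)
    mask = [1] * cur_len + [0] * (max_len - cur_len)              # 0 marks padding
    label = [1] * len(context) + [0] * (max_len - len(context))   # 1 marks context words
    return contexts_negatives, mask, label

# Hypothetical toy samples: (context word indices, noise word indices)
samples = [([2, 2], [3, 3, 3, 3]), ([2, 2, 2], [3, 3])]
max_len = max(len(c) + len(n) for c, n in samples)

padded = [pad_example(c, n, max_len) for c, n in samples]
for row in padded:
    print(row)
```

In the second sample the last position is padding: its mask entry is 0, so it is excluded from the loss even though its label is also 0.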

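Training the Model

Loss function

After applying negative sampling, we can use the logarithmic equivalent of maximum likelihood estimation to define the loss function as follows

$\sum_{t=1}^T\sum_{-m\le j\le m,j\ne 0} [-\log P(D=1\mid w^{(t)},w^{(t+j)})-\sum_{k=1,w_k\sim P(w)}^K\log P(D=0\mid w^{(t)},w_k)]$

Based on this definition of the loss function, we can compute it directly with the binary cross-entropy loss: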

class SigmoidBinaryCrossEntropyLoss(nn.Module):
    def __init__(self):
        super(SigmoidBinaryCrossEntropyLoss, self).__init__()
    def forward(self, inputs, targets, mask=None):
        '''
        @params:
            inputs: logits; after a sigmoid they give the predicted probability of D=1
            targets: 0/1 vector, 1 for context words and 0 for noise words
            mask: 0/1 vector, 0 marks padding positions (defaults to no padding)
        @return:
            res: loss averaged over the valid labels of each sample
        '''
        if mask is None:
            mask = torch.ones_like(targets)
        inputs, targets, mask = inputs.float(), targets.float(), mask.float()
        res = nn.functional.binary_cross_entropy_with_logits(inputs, targets, reduction="none", weight=mask)
        res = res.sum(dim=1) / mask.float().sum(dim=1)
        return res

loss = SigmoidBinaryCrossEntropyLoss()

pred = torch.tensor([[1.5, 0.3, -1, 2], [1.1, -0.6, 2.2, 0.4]])
label = torch.tensor([[1, 0, 0, 0], [1, 1, 0, 0]]) # in label, 1 marks context words and 0 marks noise words
mask = torch.tensor([[1, 1, 1, 1], [1, 1, 1, 0]])  # mask variable
print(loss(pred, label, mask))

def sigmd(x):
    return - math.log(1 / (1 + math.exp(-x)))
print('%.4f' % ((sigmd(1.5) + sigmd(-0.3) + sigmd(1) + sigmd(-2)) / 4)) # note that 1-sigmoid(x) = sigmoid(-x)
print('%.4f' % ((sigmd(1.1) + sigmd(-0.6) + sigmd(-2.2)) / 3))

tensor([0.8740, 1.2100])
0.8740
1.2100

Model initialization

embed_size = 100
net = nn.Sequential(nn.Embedding(num_embeddings=len(idx_to_token), embedding_dim=embed_size),
                    nn.Embedding(num_embeddings=len(idx_to_token), embedding_dim=embed_size))

Training the model

def train(net, lr, num_epochs):
    device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
    print("train on", device)
    net = net.to(device)
    optimizer = torch.optim.Adam(net.parameters(), lr=lr)
    for epoch in range(num_epochs):
        start, l_sum, n = time.time(), 0.0, 0
        for batch in data_iter:
            center, context_negative, mask, label = [d.to(device) for d in batch]
            
            pred = skip_gram(center, context_negative, net[0], net[1])
            
            l = loss(pred.view(label.shape), label, mask).mean() # average loss over one batch
            optimizer.zero_grad()
            l.backward()
            optimizer.step()
            l_sum += l.cpu().item()
            n += 1
        print('epoch %d, loss %.2f, time %.2fs'
              % (epoch + 1, l_sum / n, time.time() - start))

train(net, 0.01, 5)

train on cuda
epoch 1, loss 1.96, time 350.53s
epoch 2, loss 0.62, time 349.80s

train on cpu
epoch 1, loss 0.61, time 221.30s
epoch 2, loss 0.42, time 227.70s
epoch 3, loss 0.38, time 240.50s
epoch 4, loss 0.36, time 253.79s
epoch 5, loss 0.34, time 238.51s

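Note: because training on a local CPU takes too long, only the captured output is shown here, and the same applies below. You can run the training yourself on the website.

Testing the model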

def get_similar_tokens(query_token, k, embed):
    '''
    @params:
        query_token: the query word
        k: number of similar words to return
        embed: the pretrained word embedding layer
    '''
    W = embed.weight.data
    x = W[token_to_idx[query_token]]
    # 1e-9 is added for numerical stability
    cos = torch.matmul(W, x) / (torch.sum(W * W, dim=1) * torch.sum(x * x) + 1e-9).sqrt()
    _, topk = torch.topk(cos, k=k+1)
    topk = topk.cpu().numpy()
    for i in topk[1:]:  # skip the query word itself
        print('cosine sim=%.3f: %s' % (cos[i], (idx_to_token[i])))
        
get_similar_tokens('chip', 3, net[0])

cosine sim=0.446: intel
cosine sim=0.427: computer
cosine sim=0.427: computers

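Notes

Word embedding models first need to be trained on a large corpus to obtain meaningful word vectors, and their parameters may need further optimization while training downstream models, so they are more complex than one-hot vectors both to implement and to use.
A large corpus means a large dictionary. Without the negative sampling approximation, the cost of softmax in the forward computation and gradient backpropagation of a word embedding model would be prohibitive.
In the assumptions of the skip-gram model (or the CBOW model), the center word and the context words are in an asymmetric relationship, yet the dot product in the model's mathematical expression is symmetric, so the asymmetry of the assumption can only be preserved by introducing two word embedding layers.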
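The last point can be illustrated with a small sketch. With a single shared embedding table, the score of (center a, context b) is necessarily equal to the score of (center b, context a); with two tables the two scores are separate quantities. The toy sizes, the randomly initialized layers, and the names `shared`, `embed_v`, and `embed_u` are assumptions for illustration only.

```python
import torch
from torch import nn

torch.manual_seed(0)
vocab_size, d = 5, 3                    # hypothetical toy sizes
shared = nn.Embedding(vocab_size, d)    # one table used for both roles
embed_v = nn.Embedding(vocab_size, d)   # center word vectors
embed_u = nn.Embedding(vocab_size, d)   # context word vectors

a, b = torch.tensor(1), torch.tensor(3)

# One shared table: swapping the roles of a and b gives the same dot product
s_ab = torch.dot(shared(b), shared(a))
s_ba = torch.dot(shared(a), shared(b))

# Two tables: u_b^T v_a and u_a^T v_b are computed from different vectors
t_ab = torch.dot(embed_u(b), embed_v(a))
t_ba = torch.dot(embed_u(a), embed_v(b))

print(torch.isclose(s_ab, s_ba).item(), t_ab.item(), t_ba.item())
```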
