Attention Mechanisms and the Seq2seq Model

Attention Mechanism

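The attention mechanism borrows from the way human attention works: it identifies the target regions that deserve the most focus.

In the encoder-decoder (seq2seq) model, the decoder relies on the same context vector at every time step to obtain information about the input sequence. When the encoder is a recurrent neural network, this context vector comes from the hidden state of its final time step: the source sequence is encoded into the state of the recurrent units, which is then passed to the decoder to generate the target sequence. With attention, by contrast, the context vector fed to the decoder differs from step to step, because each position computes its own attention output.

However, this structure has problems. In practice, RNNs suffer from vanishing gradients over long ranges, and for longer sentences we can hardly expect a fixed-length vector to retain all the useful information in the input sequence, so translation quality drops markedly as the sentence length grows. At the same time, a target word being decoded may relate to only some of the input words rather than all of them. For example, when translating “Hello world” into “Bonjour le monde”, “Hello” maps to “Bonjour” and “world” maps to “monde”.

In a seq2seq model, the decoder can only implicitly select the relevant information from the encoder's final state. The attention mechanism models this selection process explicitly.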

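The Attention Framework

Attention is a general weighted pooling method whose input has two parts: a query and a set of key-value pairs, with $k_i∈R^{d_k}$ and $v_i∈R^{d_v}$. Given a query $q∈R^{d_q}$, the attention layer returns an output $o∈R^{d_v}$ with the same dimension as the values. For each query, the attention layer computes an attention score against every key and normalizes the scores into weights. The output vector $o$ is the weighted sum of the values, where the weight computed for each key is applied to its corresponding value.

To compute the output, we first assume a function $\alpha$ that measures the similarity between the query and a key. We can then compute all the attention scores $a_1, \ldots, a_n$ by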

$a_i = \alpha(\mathbf q, \mathbf k_i).$

We use the softmax function to obtain the attention weights:

$b_1, \ldots, b_n = \textrm{softmax}(a_1, \ldots, a_n).$

The final output is the weighted sum of the values:

$\mathbf o = \sum_{i=1}^n b_i \mathbf v_i.$
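The three steps above (score, softmax, weighted sum) can be sketched in plain Python for a single query. This is only an illustration: `attention_pool`, `softmax`, and the toy `dot_score` function are names assumed here, not part of the original code.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_pool(query, keys, values, score_fn):
    """Return (output, weights) for one query over n key-value pairs."""
    scores = [score_fn(query, k) for k in keys]   # a_i = alpha(q, k_i)
    weights = softmax(scores)                      # b_1, ..., b_n
    dim_v = len(values[0])
    # o = sum_i b_i * v_i, computed one output coordinate at a time
    output = [sum(b * v[j] for b, v in zip(weights, values)) for j in range(dim_v)]
    return output, weights

def dot_score(q, k):
    """A toy score function, assumed only for this illustration."""
    return sum(qi * ki for qi, ki in zip(q, k))

keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
values = [[10.0], [20.0], [30.0]]
output, weights = attention_pool([1.0, 1.0], keys, values, dot_score)
```

The third key has the highest score, so its value dominates the output, and swapping `dot_score` for another function changes the layer without touching the pooling logic.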

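Different attention layers differ in their choice of score function.

The code below builds on the article "Introduction to Machine Translation and Related Techniques" (https://blog.csdn.net/RokoBasilisk/article/details/104367653).

We introduce two commonly used attention layers: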

Dot-product Attention

Multilayer Perceptron Attention

import math
import torch 
import torch.nn as nn
# List the dataset files (replace the placeholder with the actual dataset directory)
import os
os.listdir('path to the stored dataset files')

Tool 1: Masked Softmax

# Exclude the influence of padding positions
def SequenceMask(X, X_len, value=-1e6):
    maxlen = X.size(1)
    # mask has the same shape as X; True marks positions at or beyond the valid length
    positions = torch.arange(maxlen, dtype=torch.float, device=X.device)[None, :]
    mask = positions >= X_len[:, None].to(X.device)
    X[mask] = value
    return X

def masked_softmax(X, valid_length):
    # X: 3-D tensor, valid_length: 1-D or 2-D tensor
    softmax = nn.Softmax(dim=-1)
    if valid_length is None:
        return softmax(X)
    else:
        shape = X.shape
        if valid_length.dim() == 1:
            # repeat each length once per query, e.g. [2, 3] -> [2, 2, 3, 3]
            valid_length = torch.repeat_interleave(valid_length, shape[1], dim=0)
        else:
            valid_length = valid_length.reshape((-1,))
        # fill masked elements with a large negative, whose exp is 0
        X = SequenceMask(X.reshape((-1, shape[-1])), valid_length)
        return softmax(X).reshape(shape)
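To see what the masking does to a single row of scores, here is a small plain-Python sketch. `masked_softmax_row` is a name assumed for this illustration; it mirrors the idea of the tensor version rather than reproducing it.

```python
import math

def masked_softmax_row(scores, valid_len, fill=-1e6):
    """Softmax over one row, ignoring positions at index >= valid_len."""
    # Overwrite padded positions with a large negative number ...
    masked = [s if i < valid_len else fill for i, s in enumerate(scores)]
    m = max(masked)
    # ... so that their exponentials underflow to 0.
    exps = [math.exp(s - m) for s in masked]
    total = sum(exps)
    return [e / total for e in exps]

row = masked_softmax_row([0.3, 0.3, 0.3, 0.3], valid_len=2)
print(row)
```

Only the first two positions share the probability mass, and the padded positions receive weight 0.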

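Tool 2: Multiplying Tensors with More Than Two Dimensions

$X$ and $Y$ are tensors of shape $(b,n,m)$ and $(b, m, k)$ respectively. Performing $b$ two-dimensional matrix multiplications yields $Z$ of shape $(b, n, k)$.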

$Z[i,:,:] = dot(X[i,:,:], Y[i,:,:])\qquad for\ i= 1,…,b\ .$
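The definition can be written out with nested lists to make the per-batch loop visible. This is a reference sketch only: `matmul` and `batched_matmul` are names assumed here, and the code later in this post uses `torch.bmm` for the same operation on tensors.

```python
def matmul(A, B):
    """Multiply an (n x m) matrix by an (m x k) matrix, both as nested lists."""
    m = len(B)
    k = len(B[0])
    return [[sum(row[t] * B[t][j] for t in range(m)) for j in range(k)] for row in A]

def batched_matmul(X, Y):
    """Apply matmul independently to each of the b matrix pairs."""
    assert len(X) == len(Y), "batch sizes must match"
    return [matmul(x, y) for x, y in zip(X, Y)]

# b = 2, n = 1, m = 3, k = 2
X = [[[1.0, 1.0, 1.0]], [[1.0, 2.0, 3.0]]]
Y = [[[1.0, 2.0], [1.0, 2.0], [1.0, 2.0]],
     [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]]
Z = batched_matmul(X, Y)
print(Z)
```

Each batch element is multiplied on its own, so the result has shape (2, 1, 2).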

Dot Product Attention

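Dot-product attention assumes that the query and the keys have the same dimension, i.e. $\forall i, q,k_i ∈ R^d$. The attention score is the product of the query and the transposed key, usually divided by $\sqrt{d}$ to reduce the dependence of the score on the dimension $d$: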

$α (q,k)=⟨q,k⟩/ \sqrt{d}$

Suppose $Q∈R^{m×d}$ holds $m$ queries and $K∈R^{n×d}$ holds $n$ keys. We can compute all $mn$ scores with a single matrix operation:

$α (Q,K)=QK^T/\sqrt{d}$
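As a small worked instance of this formula, the following plain-Python sketch computes the $m×n$ score matrix for one query and three keys with $d=4$. The function name `scaled_dot_scores` and the numbers are assumptions chosen for illustration.

```python
import math

def scaled_dot_scores(Q, K):
    """Return the m x n matrix Q K^T / sqrt(d) for nested-list inputs."""
    d = len(Q[0])
    return [[sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in K]
            for q in Q]

Q = [[1.0, 1.0, 1.0, 1.0]]                       # m = 1 query, d = 4
K = [[1.0, 1.0, 1.0, 1.0],                       # n = 3 keys
     [1.0, -1.0, 1.0, -1.0],
     [0.5, 0.5, 0.5, 0.5]]
scores = scaled_dot_scores(Q, K)
print(scores)
```

The raw dot products are 4, 0, and 2; dividing by $\sqrt{4}=2$ gives scores of 2, 0, and 1, which keeps the magnitude from growing linearly with $d$.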

This layer supports a batch of queries and key-value pairs. It also supports dropout on the attention weights as regularization.

# Save to the d2l package.
class DotProductAttention(nn.Module): 
    def __init__(self, dropout, **kwargs):
        super(DotProductAttention, self).__init__(**kwargs)
        self.dropout = nn.Dropout(dropout)

    # query: (batch_size, #queries, d)
    # key: (batch_size, #kv_pairs, d)
    # value: (batch_size, #kv_pairs, dim_v)
    # valid_length: either (batch_size, ) or (batch_size, xx)
    def forward(self, query, key, value, valid_length=None):
        d = query.shape[-1]
        # swap the last two dimensions of key so that bmm computes query @ key^T
        scores = torch.bmm(query, key.transpose(1,2)) / math.sqrt(d)
        attention_weights = self.dropout(masked_softmax(scores, valid_length))
        print("attention_weight\n",attention_weights)
        return torch.bmm(attention_weights, value)

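Test:

We create a batch of two examples, each with one query and 10 key-value pairs.

Through valid_length, we specify that the first example attends only to its first 2 key-value pairs, while the second attends to its first 6.

As a result, although both examples have the same query and the same key-value pairs, the outputs differ.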

atten = DotProductAttention(dropout=0)

keys = torch.ones((2,10,2),dtype=torch.float)
values = torch.arange((40), dtype=torch.float).view(1,10,4).repeat(2,1,1)
atten(torch.ones((2,1,2),dtype=torch.float), keys, values, torch.FloatTensor([2, 6]))

# Result
attention_weight
 tensor([[[0.5000, 0.5000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000,
          0.0000, 0.0000]],

        [[0.1667, 0.1667, 0.1667, 0.1667, 0.1667, 0.1667, 0.0000, 0.0000,
          0.0000, 0.0000]]])
tensor([[[ 2.0000,  3.0000,  4.0000,  5.0000]],

        [[10.0000, 11.0000, 12.0000, 13.0000]]])

Multilayer Perceptron Attention

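In multilayer perceptron attention, we first project both the query and the keys into $R^h$. Concretely, the learnable parameters are $W_k∈R^{h×d_k}$, $W_q∈R^{h×d_q}$, and $v∈R^h$. The score function is defined as

$α(k,q)=v^T\tanh(W_kk+W_qq).$

Intuitively, this is equivalent to concatenating the key and the query along the feature dimension and feeding the result into a single-hidden-layer perceptron with hidden size $h$ and output size 1. The hidden layer uses tanh as its activation function and has no bias.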
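The score $v^T\tanh(W_kk+W_qq)$ can be evaluated by hand for a tiny case. The sketch below uses plain Python with fixed weights that are assumptions made for illustration (in a real model they are learned); note that the key and the query have different dimensions here.

```python
import math

def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def mlp_score(k, q, W_k, W_q, v):
    """alpha(k, q) = v^T tanh(W_k k + W_q q)."""
    hidden = [math.tanh(a + b) for a, b in zip(matvec(W_k, k), matvec(W_q, q))]
    return sum(vi * hi for vi, hi in zip(v, hidden))

# Illustrative fixed weights with h = 2, d_k = 3, d_q = 2.
W_k = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
W_q = [[1.0, 0.0], [0.0, -1.0]]
v = [1.0, 1.0]

score = mlp_score([1.0, 0.0, 0.0], [0.0, 0.0], W_k, W_q, v)
print(score)
```

With these inputs the hidden vector is (tanh 1, tanh 0), so the score equals tanh(1).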

# Save to the d2l package.
class MLPAttention(nn.Module):  
    def __init__(self, units,ipt_dim,dropout, **kwargs):
        super(MLPAttention, self).__init__(**kwargs)
        # Bias-free linear layers project keys and queries to a shared size `units`.
        self.W_k = nn.Linear(ipt_dim, units, bias=False)
        self.W_q = nn.Linear(ipt_dim, units, bias=False)
        self.v = nn.Linear(units, 1, bias=False)
        self.dropout = nn.Dropout(dropout)

    def forward(self, query, key, value, valid_length):
        # Project the query with W_q and the key with W_k
        query, key = self.W_q(query), self.W_k(key)
        # expand query to (batch_size, #queries, 1, units), and key to
        # (batch_size, 1, #kv_pairs, units). Then add them with broadcasting.
        features = query.unsqueeze(2) + key.unsqueeze(1)
        # apply tanh before v, matching the score v^T tanh(W_k k + W_q q)
        scores = self.v(torch.tanh(features)).squeeze(-1)
        attention_weights = self.dropout(masked_softmax(scores, valid_length))
        return torch.bmm(attention_weights, value)

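Test:

Although MLPAttention contains an additional MLP model, given the same inputs with identical keys, we obtain the same output as with DotProductAttention: when all keys are identical, every valid position receives the same score regardless of the score function, so the attention weights are uniform.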

atten = MLPAttention(ipt_dim=2,units = 8, dropout=0)
atten(torch.ones((2,1,2), dtype = torch.float), keys, values, torch.FloatTensor([2, 6]))      
#Result
tensor([[[ 2.0000,  3.0000,  4.0000,  5.0000]],

        [[10.0000, 11.0000, 12.0000, 13.0000]]], grad_fn=<BmmBackward>)
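The numbers above can be checked by hand. When every valid key receives the same score, the softmax weights are uniform, so the output is simply the mean of the first valid_length value rows, whatever the score function is. The plain-Python sketch below reproduces this; `pooled_output` and the two sample score values are assumptions for illustration.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

# values[i] = [4i, 4i+1, 4i+2, 4i+3], the same rows as arange(40) viewed as (10, 4)
values = [[4.0 * i + j for j in range(4)] for i in range(10)]

def pooled_output(score, valid_len):
    """All valid keys share one score, so the weights come out uniform."""
    weights = softmax([score] * valid_len)
    return [sum(w * values[i][j] for i, w in enumerate(weights)) for j in range(4)]

out_a = pooled_output(score=1.41, valid_len=2)    # stands in for a dot-product score
out_b = pooled_output(score=-0.37, valid_len=2)   # stands in for some MLP score
out_c = pooled_output(score=0.0, valid_len=6)
```

Both `out_a` and `out_b` are the mean of rows 0 and 1, and `out_c` is the mean of rows 0 through 5, which matches the two output rows printed by both attention layers.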

Comparison

In dot-product attention, the key and the query must have the same dimension; MLP attention has no such requirement.

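The Seq2seq Model

See the article "Introduction to Machine Translation and Related Techniques".

At prediction time, a seq2seq model needs a manually defined stopping condition: either a maximum sequence length or the emission of the [EOS] end-of-sequence token. Without such a limit, it may generate a sequence of unbounded length.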
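The two stopping conditions can be sketched as a greedy decoding loop. Everything here is a stand-in: `greedy_decode` is a hypothetical helper, and the lookup table replaces a real decoder step purely for illustration.

```python
def greedy_decode(step_fn, bos, eos, max_len):
    """Generate tokens until `eos` is produced or `max_len` steps have run."""
    tokens = []
    prev = bos
    for _ in range(max_len):          # stopping condition 1: length cap
        prev = step_fn(prev)
        if prev == eos:               # stopping condition 2: end-of-sequence token
            break
        tokens.append(prev)
    return tokens

# Hypothetical stand-in for a decoder step: a fixed next-token table.
table = {"<bos>": "bonjour", "bonjour": "le", "le": "monde", "monde": "<eos>"}
stops = greedy_decode(table.get, "<bos>", "<eos>", max_len=10)

# A "decoder" that never emits <eos> is cut off by max_len.
loops = greedy_decode(lambda tok: "la", "<bos>", "<eos>", max_len=5)
```

The first call ends when the end token appears; the second would never end on its own and is stopped by the length cap.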

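This leads to:

Seq2seq Model with Attention

The attention mechanism itself is highly parallelizable, but adding attention does not change the step-by-step recurrence of the RNNs inside seq2seq, so it cannot speed the model up.

We add the attention mechanism to the sequence to sequence model to explicitly aggregate the states using weights.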

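The figure below shows the model structure for encoding and decoding at time step $t$. At this point, the attention layer holds all the information the encoder has seen, i.e. the encoder output at every step. During decoding, the decoder's hidden state at time $t$ serves as the query, and the encoder's hidden states at each time step serve as both keys and values for attention pooling.

The output of the attention model is treated as the context vector and is concatenated with the decoder input $D_t$ before being fed into the decoder: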

Fig. 1: The second decoding step of a seq-to-seq model with attention
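A single decoding step of this kind can be sketched in plain Python. For brevity the sketch assumes dot-product scores (the decoder in this post uses MLP attention), and all vectors are made-up toy values.

```python
import math

def attend(query, enc_states):
    """Dot-product attention where encoder states serve as both keys and values."""
    scores = [sum(q * h for q, h in zip(query, state)) for state in enc_states]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(enc_states[0])
    context = [sum(w * s[j] for w, s in zip(weights, enc_states)) for j in range(dim)]
    return context, weights

enc_states = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]  # one hidden state per source step
dec_hidden = [4.0, 0.0]                             # decoder state at step t (the query)
d_t = [0.1, 0.2, 0.3]                               # embedding of the decoder input D_t

context, weights = attend(dec_hidden, enc_states)
rnn_input = context + d_t                           # concatenate context and D_t
```

The query aligns with the first encoder state, so the context vector is dominated by it, and the RNN input has size hidden_size + embed_size.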

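The figure below shows how all the layers of the seq2seq mechanism relate to one another, including the layer structure of the encoder and the decoder.

Fig. 2: Layer structure of a seq-to-seq model with attention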

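Decoder

Because the encoder of seq2seq with attention is the same as the Seq2SeqEncoder in the earlier section, we focus only on the decoder here.

We add an MLP attention layer (MLPAttention) whose hidden size equals that of the LSTM layer in the decoder. We then initialize the decoder state by passing three items from the encoder:

the encoder outputs of all timesteps: used as the memory of the attention layer, serving as both keys and values;
the hidden state of the encoder’s final timestep: used to initialize the decoder's hidden state;
the encoder valid length: lets the attention layer ignore the padding tokens in the encoder outputs.

At each decoding time step, we use the output of the decoder's last RNN layer as the query for the attention layer.

The output of the attention model is then concatenated with the input embedding vector and fed into the RNN layer. Although the hidden state of the RNN layer also carries history from the decoder, the attention output explicitly selects among the encoder outputs within enc_valid_len, so that attention excludes irrelevant information as far as possible.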

class Seq2SeqAttentionDecoder(d2l.Decoder):
    def __init__(self, vocab_size, embed_size, num_hiddens, num_layers,
                 dropout=0, **kwargs):
        super(Seq2SeqAttentionDecoder, self).__init__(**kwargs)
        self.attention_cell = MLPAttention(num_hiddens,num_hiddens, dropout)
        self.embedding = nn.Embedding(vocab_size, embed_size)
        self.rnn = nn.LSTM(embed_size+ num_hiddens,num_hiddens, num_layers, dropout=dropout)
        self.dense = nn.Linear(num_hiddens,vocab_size)

    def init_state(self, enc_outputs, enc_valid_len, *args):
        outputs, hidden_state = enc_outputs
        # Transpose outputs to (batch_size, seq_len, hidden_size)
        return (outputs.permute(1,0,-1), hidden_state, enc_valid_len)

    def forward(self, X, state):
        enc_outputs, hidden_state, enc_valid_len = state
        # Embed, then transpose to (seq_len, batch_size, embed_size)
        X = self.embedding(X).transpose(0,1)
        outputs = []
        for l, x in enumerate(X):
            # query shape: (batch_size, 1, hidden_size)
            # select hidden state of the last rnn layer as query
            query = hidden_state[0][-1].unsqueeze(1)
            # context has same shape as query
            context = self.attention_cell(query, enc_outputs, enc_outputs, enc_valid_len)
            # Concatenate on the feature dimension
            x = torch.cat((context, x.unsqueeze(1)), dim=-1)
            # Reshape x to (1, batch_size, embed_size+hidden_size)
            out, hidden_state = self.rnn(x.transpose(0,1), hidden_state)
            outputs.append(out)
        outputs = self.dense(torch.cat(outputs, dim=0))
        return outputs.transpose(0, 1), [enc_outputs, hidden_state,
                                        enc_valid_len]

encoder = d2l.Seq2SeqEncoder(vocab_size=10, embed_size=8,
                            num_hiddens=16, num_layers=2)
# encoder.initialize()
decoder = Seq2SeqAttentionDecoder(vocab_size=10, embed_size=8,
                                  num_hiddens=16, num_layers=2)
X = torch.zeros((4, 7),dtype=torch.long)
print("batch size=4\nseq_length=7\nhidden dim=16\nnum_layers=2\n")
print('encoder output size:', encoder(X)[0].size())
print('encoder hidden size:', encoder(X)[1][0].size())
print('encoder memory size:', encoder(X)[1][1].size())
state = decoder.init_state(encoder(X), None)
out, state = decoder(X, state)
out.shape, len(state), state[0].shape, len(state[1]), state[1][0].shape

Training

import zipfile
import torch
import requests
from io import BytesIO
from torch.utils import data
import sys
import collections

class Vocab(object): # This class is saved in d2l.
  def __init__(self, tokens, min_freq=0, use_special_tokens=False):
    # sort by frequency and token
    counter = collections.Counter(tokens)
    token_freqs = sorted(counter.items(), key=lambda x: x[0])
    token_freqs.sort(key=lambda x: x[1], reverse=True)
    if use_special_tokens:
      # padding, begin of sentence, end of sentence, unknown
      self.pad, self.bos, self.eos, self.unk = (0, 1, 2, 3)
      tokens = ['<pad>', '<bos>', '<eos>', '<unk>']
    else:
      self.unk = 0
      tokens = ['<unk>']
    tokens += [token for token, freq in token_freqs if freq >= min_freq]
    self.idx_to_token = []
    self.token_to_idx = dict()
    for token in tokens:
      self.idx_to_token.append(token)
      self.token_to_idx[token] = len(self.idx_to_token) - 1
      
  def __len__(self):
    return len(self.idx_to_token)
  
  def __getitem__(self, tokens):
    if not isinstance(tokens, (list, tuple)):
      return self.token_to_idx.get(tokens, self.unk)
    else:
      return [self.__getitem__(token) for token in tokens]
    
  def to_tokens(self, indices):
    if not isinstance(indices, (list, tuple)):
      return self.idx_to_token[indices]
    else:
      return [self.idx_to_token[index] for index in indices]
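The index assignment in this class follows from two sorts: tokens are first ordered alphabetically, then stably re-sorted by descending frequency, so ties stay alphabetical. The standalone sketch below replays that logic on a tiny corpus; `build_index` is a name assumed here, and the special-token strings are placeholders chosen for illustration.

```python
import collections

def build_index(tokens, min_freq, specials):
    """Map tokens to ids: specials first, then by descending frequency."""
    counter = collections.Counter(tokens)
    token_freqs = sorted(counter.items(), key=lambda x: x[0])  # alphabetical order
    token_freqs.sort(key=lambda x: x[1], reverse=True)         # stable: ties stay alphabetical
    idx_to_token = list(specials)
    idx_to_token += [tok for tok, freq in token_freqs if freq >= min_freq]
    return {tok: i for i, tok in enumerate(idx_to_token)}

specials = ["<pad>", "<bos>", "<eos>", "<unk>"]
corpus = "go . go ! run . hi".split(" ")
token_to_idx = build_index(corpus, min_freq=2, specials=specials)

# Tokens below min_freq fall back to the unknown-token id.
ids = [token_to_idx.get(tok, token_to_idx["<unk>"]) for tok in ["go", "hi", "."]]
print(ids)
```

Only "." and "go" reach the frequency threshold, so they receive ids 4 and 5, while "hi" maps to the unknown-token id 3.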

def load_data_nmt(batch_size, max_len, num_examples=1000):
    """Download an NMT dataset, return its vocabulary and data iterator."""
    # Download and preprocess
    def preprocess_raw(text):
        text = text.replace('\u202f', ' ').replace('\xa0', ' ')
        out = ''
        for i, char in enumerate(text.lower()):
            if char in (',', '!', '.') and text[i-1] != ' ':
                out += ' '
            out += char
        return out 


    with open('/home/kesci/input/fraeng6506/fra.txt', 'r') as f:
      raw_text = f.read()


    text = preprocess_raw(raw_text)

    # Tokenize
    source, target = [], []
    for i, line in enumerate(text.split('\n')):
        if i >= num_examples:
            break
        parts = line.split('\t')
        if len(parts) >= 2:
            source.append(parts[0].split(' '))
            target.append(parts[1].split(' '))

    # Build vocab
    def build_vocab(tokens):
        tokens = [token for line in tokens for token in line]
        return Vocab(tokens, min_freq=3, use_special_tokens=True)
    src_vocab, tgt_vocab = build_vocab(source), build_vocab(target)

    # Convert to index arrays
    def pad(line, max_len, padding_token):
        if len(line) > max_len:
            return line[:max_len]
        return line + [padding_token] * (max_len - len(line))

    def build_array(lines, vocab, max_len, is_source):
        lines = [vocab[line] for line in lines]
        if not is_source:
            lines = [[vocab.bos] + line + [vocab.eos] for line in lines]
        array = torch.tensor([pad(line, max_len, vocab.pad) for line in lines])
        valid_len = (array != vocab.pad).sum(1)
        return array, valid_len

    src_array, src_valid_len = build_array(source, src_vocab, max_len, True)
    tgt_array, tgt_valid_len = build_array(target, tgt_vocab, max_len, False)
    train_data = data.TensorDataset(src_array, src_valid_len, tgt_array, tgt_valid_len)
    train_iter = data.DataLoader(train_data, batch_size, shuffle=True)
    return src_vocab, tgt_vocab, train_iter
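The padding and valid-length logic inside build_array can be tried on its own. The sketch below copies the pad helper and uses made-up token ids; the constants PAD, BOS, and EOS follow the ids 0, 1, and 2 that the Vocab class assigns to its special tokens.

```python
def pad(line, max_len, padding_token):
    """Truncate to max_len or right-pad with padding_token."""
    if len(line) > max_len:
        return line[:max_len]
    return line + [padding_token] * (max_len - len(line))

PAD, BOS, EOS = 0, 1, 2
max_len = 5

short = pad([BOS, 7, 8, EOS], max_len, PAD)
long_ = pad([BOS, 7, 8, 9, 10, 11, EOS], max_len, PAD)

# Valid length = number of non-padding tokens in each row.
valid_lens = [sum(tok != PAD for tok in row) for row in (short, long_)]
print(short, long_, valid_lens)
```

One thing the sketch makes visible: a target sentence longer than max_len is truncated after the end token has been appended, so its end token is dropped.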

embed_size, num_hiddens, num_layers, dropout = 32, 32, 2, 0.0
batch_size, num_steps = 64, 10
lr, num_epochs, ctx = 0.005, 500, d2l.try_gpu()

src_vocab, tgt_vocab, train_iter = load_data_nmt(batch_size, num_steps)
encoder = d2l.Seq2SeqEncoder(
    len(src_vocab), embed_size, num_hiddens, num_layers, dropout)
decoder = Seq2SeqAttentionDecoder(
    len(tgt_vocab), embed_size, num_hiddens, num_layers, dropout)
model = d2l.EncoderDecoder(encoder, decoder)

Prediction

d2l.train_s2s_ch9(model, train_iter, lr, num_epochs, ctx)
for sentence in ['Go .', 'Good Night !', "I'm OK .", 'I won !']:
    print(sentence + ' => ' + d2l.predict_s2s_ch9(
        model, sentence, src_vocab, tgt_vocab, num_steps, ctx))

RokoのBasilisk
