Dive into Deep Learning - 15: Attention Mechanisms and the Seq2seq Model

Attention Mechanism

In the "encoder-decoder (seq2seq)" section, the decoder relies on the same context vector at every time step to obtain information about the input sequence. When the encoder is a recurrent neural network, the context vector comes from the hidden state of its final time step: the source sequence is encoded into the state of the recurrent unit and then passed to the decoder to generate the target sequence. This structure has a problem. In practice RNNs suffer from vanishing gradients over long ranges, and for long sentences it is unrealistic to expect a single fixed-length vector to preserve all of the useful information in the input. As the sentences to be translated grow longer, the performance of this structure therefore degrades noticeably.

At the same time, a target word being decoded may be related to only part of the input rather than to all of it. For example, when translating "Hello world" into "Bonjour le monde", "Hello" maps to "Bonjour" and "world" maps to "monde". In the seq2seq model the decoder can only select the relevant information implicitly, from the encoder's final state; the attention mechanism models this selection process explicitly.

Attention Mechanism Framework

Different attention layers differ in their choice of score function. In the rest of this section we discuss two commonly used attention layers, dot-product attention and multilayer perceptron attention; afterwards we implement a seq2seq model with attention and train and test it on an English-French translation corpus.
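For reference, both layers instantiate the same generic attention computation. Given a query $\mathbf{q}$ and $n$ key-value pairs $(\mathbf{k}_1, \mathbf{v}_1), \dots, (\mathbf{k}_n, \mathbf{v}_n)$, the attention layer scores the query against every key with a score function $\alpha$, normalizes the scores with a softmax, and returns the weighted sum of the values:

$$a_i = \alpha(\mathbf{q}, \mathbf{k}_i), \qquad b_1, \dots, b_n = \mathrm{softmax}(a_1, \dots, a_n), \qquad \mathbf{o} = \sum_{i=1}^{n} b_i \mathbf{v}_i .$$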

import math
import torch 
import torch.nn as nn

import os
def file_name_walk(file_dir):
    for root, dirs, files in os.walk(file_dir):
#         print("root", root)  # 當前目錄路徑
         print("dirs", dirs)  # 當前路徑下所有子目錄
         print("files", files)  # 當前路徑下所有非目錄子文件

file_name_walk("/home/kesci/input/fraeng6506")

dirs []
files ['_about.txt', 'fra.txt']

Softmax Masking

Before diving into the implementation, we first introduce a masking operation for the softmax operator. It restricts the softmax of each row to that row's first valid_length entries, so that padded positions in a batch of variable-length sequences receive (effectively) zero attention weight.
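Concretely, for a score vector $\mathbf{x} \in \mathbb{R}^n$ with valid length $\ell$, the masked softmax computes

$$\mathrm{masked\_softmax}(\mathbf{x}, \ell)_j = \begin{cases} \exp(x_j) \,\big/\, \sum_{k=1}^{\ell} \exp(x_k), & j \le \ell, \\ 0, & j > \ell, \end{cases}$$

which the implementation below approximates by filling the masked scores with $-10^{6}$ before applying an ordinary softmax.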

def SequenceMask(X, X_len, value=-1e6):
    # Fill every position at or beyond a row's valid length with `value`
    maxlen = X.size(1)
    mask = torch.arange(maxlen, dtype=torch.float)[None, :] >= X_len[:, None]
    X[mask] = value
    return X



def masked_softmax(X, valid_length):
    # X: 3-D tensor, valid_length: 1-D or 2-D tensor
    softmax = nn.Softmax(dim=-1)
    if valid_length is None:
        return softmax(X)
    else:
        shape = X.shape
        if valid_length.dim() == 1:
            # repeat each valid length once per query, e.g. [2, 3] -> [2, 2, 3, 3]
            try:
                valid_length = torch.FloatTensor(valid_length.numpy().repeat(shape[1], axis=0))
            except:
                valid_length = torch.FloatTensor(valid_length.cpu().numpy().repeat(shape[1], axis=0))
        else:
            valid_length = valid_length.reshape((-1,))
        # fill masked elements with a large negative value, whose exp is ~0
        X = SequenceMask(X.reshape((-1, shape[-1])), valid_length)
        return softmax(X).reshape(shape)

masked_softmax(torch.rand((2, 2, 4), dtype=torch.float), torch.FloatTensor([2, 3]))

tensor([[[0.5423, 0.4577, 0.0000, 0.0000],
         [0.5290, 0.4710, 0.0000, 0.0000]],

        [[0.2969, 0.2966, 0.4065, 0.0000],
         [0.3607, 0.2203, 0.4190, 0.0000]]])

# torch.bmm performs batched matrix multiplication; the attention layers below
# use it to multiply the attention weights with the values
torch.bmm(torch.ones((2, 1, 3), dtype=torch.float), torch.ones((2, 3, 2), dtype=torch.float))

tensor([[[3., 3.]],

        [[3., 3.]]])

Dot-Product Attention
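Dot-product attention assumes that the query and each key have the same dimension $d$ and uses the scaled dot product as the score function; stacking $m$ queries into $\mathbf{Q} \in \mathbb{R}^{m \times d}$ and $n$ keys into $\mathbf{K} \in \mathbb{R}^{n \times d}$, all scores can be computed at once:

$$\alpha(\mathbf{q}, \mathbf{k}) = \frac{\langle \mathbf{q}, \mathbf{k} \rangle}{\sqrt{d}}, \qquad \alpha(\mathbf{Q}, \mathbf{K}) = \frac{\mathbf{Q}\mathbf{K}^\top}{\sqrt{d}}.$$

The implementation below computes this batched matrix product with a single torch.bmm call.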

 

# Save to the d2l package.
class DotProductAttention(nn.Module): 
    def __init__(self, dropout, **kwargs):
        super(DotProductAttention, self).__init__(**kwargs)
        self.dropout = nn.Dropout(dropout)

    # query: (batch_size, #queries, d)
    # key: (batch_size, #kv_pairs, d)
    # value: (batch_size, #kv_pairs, dim_v)
    # valid_length: either (batch_size, ) or (batch_size, xx)
    def forward(self, query, key, value, valid_length=None):
        d = query.shape[-1]
        # swap the last two dimensions of key with transpose(1, 2), then
        # batch-multiply with query and scale by sqrt(d)
        scores = torch.bmm(query, key.transpose(1, 2)) / math.sqrt(d)
        attention_weights = self.dropout(masked_softmax(scores, valid_length))
        print("attention_weight\n",attention_weights)
        return torch.bmm(attention_weights, value)

 

Testing

Now we create two batches, each with one query and 10 key-value pairs. Via valid_length we specify that we attend to only the first 2 key-value pairs for the first batch and to the first 6 for the second. Therefore, although the two batches have identical queries and key-value pairs, the outputs we obtain are different.

atten = DotProductAttention(dropout=0)

keys = torch.ones((2,10,2),dtype=torch.float)
values = torch.arange((40), dtype=torch.float).view(1,10,4).repeat(2,1,1)
atten(torch.ones((2,1,2),dtype=torch.float), keys, values, torch.FloatTensor([2, 6]))

attention_weight
 tensor([[[0.5000, 0.5000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000,
          0.0000, 0.0000]],

        [[0.1667, 0.1667, 0.1667, 0.1667, 0.1667, 0.1667, 0.0000, 0.0000,
          0.0000, 0.0000]]])
tensor([[[ 2.0000,  3.0000,  4.0000,  5.0000]],

        [[10.0000, 11.0000, 12.0000, 13.0000]]])

Multilayer Perceptron Attention
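Multilayer perceptron (additive) attention scores a query-key pair with a small MLP, using learnable parameters $\mathbf{W}_k \in \mathbb{R}^{h \times d_k}$, $\mathbf{W}_q \in \mathbb{R}^{h \times d_q}$ and $\mathbf{v} \in \mathbb{R}^{h}$ ($h$ corresponds to units in the code below), so keys and queries may have different dimensions:

$$\alpha(\mathbf{k}, \mathbf{q}) = \mathbf{v}^\top \tanh\left(\mathbf{W}_k \mathbf{k} + \mathbf{W}_q \mathbf{q}\right).$$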

 

# Save to the d2l package.
class MLPAttention(nn.Module):
    def __init__(self, units, ipt_dim, dropout, **kwargs):
        super(MLPAttention, self).__init__(**kwargs)
        # nn.Linear operates on the last dimension, so the 3-D shapes of
        # query and key are preserved
        self.W_k = nn.Linear(ipt_dim, units, bias=False)
        self.W_q = nn.Linear(ipt_dim, units, bias=False)
        self.v = nn.Linear(units, 1, bias=False)
        self.dropout = nn.Dropout(dropout)

    def forward(self, query, key, value, valid_length):
        query, key = self.W_q(query), self.W_k(key)
        # expand query to (batch_size, #queries, 1, units) and key to
        # (batch_size, 1, #kv_pairs, units), then add them with broadcasting
        features = torch.tanh(query.unsqueeze(2) + key.unsqueeze(1))
        # score v^T tanh(W_q q + W_k k), shape (batch_size, #queries, #kv_pairs)
        scores = self.v(features).squeeze(-1)
        attention_weights = self.dropout(masked_softmax(scores, valid_length))
        return torch.bmm(attention_weights, value)

Testing

Although MLPAttention contains an additional MLP model, given the same inputs and the same keys we obtain the same output as with DotProductAttention.

atten = MLPAttention(ipt_dim=2,units = 8, dropout=0)
atten(torch.ones((2,1,2), dtype = torch.float), keys, values, torch.FloatTensor([2, 6]))

tensor([[[ 2.0000,  3.0000,  4.0000,  5.0000]],

        [[10.0000, 11.0000, 12.0000, 13.0000]]], grad_fn=<BmmBackward>)

Summary

  • The attention layer explicitly selects related information.
  • The attention layer's memory consists of key-value pairs, so its output is close to the values whose keys are similar to the query.

Seq2seq Model with an Attention Mechanism

In this section we add an attention mechanism to the sequence-to-sequence model so that the encoder states are aggregated explicitly with attention weights. The figures below show the model structure for encoding and decoding at time step t. The attention layer stores all the information the encoder has seen, i.e. the encoder output at every time step. During decoding, the hidden state of the decoder at time step t is used as the query, and the encoder's hidden states at all time steps serve as both keys and values for the attention aggregation. The output of the attention model is treated as the context vector and is concatenated with the decoder input D_t before being fed into the decoder:

Fig. 1: The second step of decoding in a seq-to-seq model with an attention mechanism

The figure below shows the relationships among all the layers of the seq2seq model with attention; the layer structures of the encoder and decoder are shown underneath.

Fig. 2: Layer structure of a seq-to-seq model with an attention mechanism

 

import sys
sys.path.append('/home/kesci/input/d2len9900')
import d2l

 

Decoder

Since the encoder of the seq2seq model with attention is the same as the Seq2SeqEncoder of the previous section, we focus only on the decoder here. We add an MLP attention layer (MLPAttention) whose hidden size equals that of the LSTM layer in the decoder. We then initialize the decoder's state by passing three items from the encoder:

  • the encoder outputs of all time steps: they are used as the attention layer's memory, with identical keys and values
  • the hidden state of the encoder's final time step: it is used to initialize the decoder's hidden state
  • the encoder valid length: with it, the attention layer does not consider the padding tokens in the encoder outputs

At each decoding time step, we use the output of the decoder's last RNN layer as the query for the attention layer. The output of the attention model is then concatenated with the input embedding vector and fed into the RNN layer. Although the RNN layer's hidden state also carries historical information from the decoder, the attention output explicitly selects among the encoder outputs within enc_valid_len, so the attention mechanism excludes irrelevant information as far as possible.

class Seq2SeqAttentionDecoder(d2l.Decoder):
    def __init__(self, vocab_size, embed_size, num_hiddens, num_layers,
                 dropout=0, **kwargs):
        super(Seq2SeqAttentionDecoder, self).__init__(**kwargs)
        self.attention_cell = MLPAttention(num_hiddens,num_hiddens, dropout)
        self.embedding = nn.Embedding(vocab_size, embed_size)
        self.rnn = nn.LSTM(embed_size+ num_hiddens,num_hiddens, num_layers, dropout=dropout)
        self.dense = nn.Linear(num_hiddens,vocab_size)

    def init_state(self, enc_outputs, enc_valid_len, *args):
        outputs, hidden_state = enc_outputs
        # Transpose outputs to (batch_size, seq_len, hidden_size)
        return (outputs.permute(1, 0, 2), hidden_state, enc_valid_len)
        
    def forward(self, X, state):
        enc_outputs, hidden_state, enc_valid_len = state
        # X shape after embedding and transpose: (seq_len, batch_size, embed_size)
        X = self.embedding(X).transpose(0, 1)
        outputs = []
        for x in X:
            # query shape: (batch_size, 1, hidden_size)
            # select hidden state of the last rnn layer as query
            query = hidden_state[0][-1].unsqueeze(1)
            # context has the same shape as the query
            context = self.attention_cell(query, enc_outputs, enc_outputs, enc_valid_len)
            # Concatenate on the feature dimension
            x = torch.cat((context, x.unsqueeze(1)), dim=-1)
            # Reshape x to (1, batch_size, embed_size + hidden_size) and run one RNN step
            out, hidden_state = self.rnn(x.transpose(0, 1), hidden_state)
            outputs.append(out)
        outputs = self.dense(torch.cat(outputs, dim=0))
        return outputs.transpose(0, 1), [enc_outputs, hidden_state,
                                        enc_valid_len]

Now we can test the seq2seq model with attention. To stay consistent with the model in Section 9.7, we use the same hyperparameters for vocab_size, embed_size, num_hiddens, and num_layers. As a result we get the same decoder output shape, but the state structure is changed.

encoder = d2l.Seq2SeqEncoder(vocab_size=10, embed_size=8,
                            num_hiddens=16, num_layers=2)
decoder = Seq2SeqAttentionDecoder(vocab_size=10, embed_size=8,
                                  num_hiddens=16, num_layers=2)
X = torch.zeros((4, 7),dtype=torch.long)
print("batch size=4\nseq_length=7\nhidden dim=16\nnum_layers=2\n")
print('encoder output size:', encoder(X)[0].size())
print('encoder hidden size:', encoder(X)[1][0].size())
print('encoder memory size:', encoder(X)[1][1].size())
state = decoder.init_state(encoder(X), None)
out, state = decoder(X, state)
out.shape, len(state), state[0].shape, len(state[1]), state[1][0].shape

batch size=4
seq_length=7
hidden dim=16
num_layers=2

encoder output size: torch.Size([7, 4, 16])
encoder hidden size: torch.Size([2, 4, 16])
encoder memory size: torch.Size([2, 4, 16])
(torch.Size([4, 7, 10]), 3, torch.Size([4, 7, 16]), 2, torch.Size([2, 4, 16]))

 

Training

Similarly to Section 9.7.4, we try out a simple toy model by applying the same training hyperparameters and the same training loss. From the results we can see that, since the sequences in the training dataset are relatively short, the additional attention layer does not bring a significant improvement. Because of the computational overhead of the attention layer, which attends over all encoder outputs at every decoding step, this model is much slower than the seq2seq model without attention.

import zipfile
import torch
import requests
from io import BytesIO
from torch.utils import data
import sys
import collections

class Vocab(object): # This class is saved in d2l.
  def __init__(self, tokens, min_freq=0, use_special_tokens=False):
    # sort by frequency and token
    counter = collections.Counter(tokens)
    token_freqs = sorted(counter.items(), key=lambda x: x[0])
    token_freqs.sort(key=lambda x: x[1], reverse=True)
    if use_special_tokens:
      # padding, begin of sentence, end of sentence, unknown
      self.pad, self.bos, self.eos, self.unk = (0, 1, 2, 3)
      tokens = ['<pad>', '<bos>', '<eos>', '<unk>']
    else:
      self.unk = 0
      tokens = ['<unk>']
    tokens += [token for token, freq in token_freqs if freq >= min_freq]
    self.idx_to_token = []
    self.token_to_idx = dict()
    for token in tokens:
      self.idx_to_token.append(token)
      self.token_to_idx[token] = len(self.idx_to_token) - 1
      
  def __len__(self):
    return len(self.idx_to_token)
  
  def __getitem__(self, tokens):
    if not isinstance(tokens, (list, tuple)):
      return self.token_to_idx.get(tokens, self.unk)
    else:
      return [self.__getitem__(token) for token in tokens]
    
  def to_tokens(self, indices):
    if not isinstance(indices, (list, tuple)):
      return self.idx_to_token[indices]
    else:
      return [self.idx_to_token[index] for index in indices]

def load_data_nmt(batch_size, max_len, num_examples=1000):
    """Download an NMT dataset, return its vocabulary and data iterator."""
    # Download and preprocess
    def preprocess_raw(text):
        text = text.replace('\u202f', ' ').replace('\xa0', ' ')
        out = ''
        for i, char in enumerate(text.lower()):
            if char in (',', '!', '.') and text[i-1] != ' ':
                out += ' '
            out += char
        return out 


    with open('/home/kesci/input/fraeng6506/fra.txt', 'r') as f:
      raw_text = f.read()


    text = preprocess_raw(raw_text)

    # Tokenize
    source, target = [], []
    for i, line in enumerate(text.split('\n')):
        if i >= num_examples:
            break
        parts = line.split('\t')
        if len(parts) >= 2:
            source.append(parts[0].split(' '))
            target.append(parts[1].split(' '))

    # Build vocab
    def build_vocab(tokens):
        tokens = [token for line in tokens for token in line]
        return Vocab(tokens, min_freq=3, use_special_tokens=True)
    src_vocab, tgt_vocab = build_vocab(source), build_vocab(target)

    # Convert to index arrays
    def pad(line, max_len, padding_token):
        if len(line) > max_len:
            return line[:max_len]
        return line + [padding_token] * (max_len - len(line))

    def build_array(lines, vocab, max_len, is_source):
        lines = [vocab[line] for line in lines]
        if not is_source:
            lines = [[vocab.bos] + line + [vocab.eos] for line in lines]
        array = torch.tensor([pad(line, max_len, vocab.pad) for line in lines])
        valid_len = (array != vocab.pad).sum(1)
        return array, valid_len

    src_array, src_valid_len = build_array(source, src_vocab, max_len, True)
    tgt_array, tgt_valid_len = build_array(target, tgt_vocab, max_len, False)
    train_data = data.TensorDataset(src_array, src_valid_len, tgt_array, tgt_valid_len)
    train_iter = data.DataLoader(train_data, batch_size, shuffle=True)
    return src_vocab, tgt_vocab, train_iter


embed_size, num_hiddens, num_layers, dropout = 32, 32, 2, 0.0
batch_size, num_steps = 64, 10
lr, num_epochs, ctx = 0.005, 500, d2l.try_gpu()

src_vocab, tgt_vocab, train_iter = load_data_nmt(batch_size, num_steps)
encoder = d2l.Seq2SeqEncoder(
    len(src_vocab), embed_size, num_hiddens, num_layers, dropout)
decoder = Seq2SeqAttentionDecoder(
    len(tgt_vocab), embed_size, num_hiddens, num_layers, dropout)
model = d2l.EncoderDecoder(encoder, decoder)

Training and Prediction

d2l.train_s2s_ch9(model, train_iter, lr, num_epochs, ctx)

epoch   50,loss 0.104, time 54.7 sec
epoch  100,loss 0.046, time 54.8 sec
epoch  150,loss 0.031, time 54.7 sec
epoch  200,loss 0.027, time 54.3 sec
epoch  250,loss 0.025, time 54.3 sec
epoch  300,loss 0.024, time 54.4 sec
epoch  350,loss 0.024, time 54.4 sec
epoch  400,loss 0.024, time 54.5 sec
epoch  450,loss 0.023, time 54.4 sec
epoch  500,loss 0.023, time 54.7 sec

for sentence in ['Go .', 'Good Night !', "I'm OK .", 'I won !']:
    print(sentence + ' => ' + d2l.predict_s2s_ch9(
        model, sentence, src_vocab, tgt_vocab, num_steps, ctx))

Go . => va !
Good Night ! =>   !
I'm OK . => ça va .
I won ! => j'ai gagné !
