A Super-Detailed PyTorch Implementation of Seq2Seq (with Attention)

This article explains how to reproduce Seq2Seq (with Attention) in PyTorch for a simple machine translation task. Please first read the paper Neural Machine Translation by Jointly Learning to Align and Translate, then spend about fifteen minutes on my two articles, Seq2Seq and the Attention Mechanism and Attention Illustrated, and only then come back to this article; that way everything will click and you will get far more out of it.

Data Preprocessing

The preprocessing code is really just a series of API calls, and I don't want readers to be distracted by this less important part, so I won't paste the code here and will only describe it briefly.

As shown in the figure below, this article uses a German→English dataset: the input is German, and every input sentence starts and ends with a special token; the output is English, and every output sentence also starts and ends with a special token.

Whether English or German, sentence lengths are not fixed, so within each batch I pad the sentences with <PAD> to the same length. In other words, sentences within one batch all have the same length, while sentences in different batches may not. The dimensions shown in the figure below are [seq_len, batch_size].

Let's print an arbitrary sample to see how the data is packed:

[Figure: a sample of the packed data (https://i.loli.net/2020/07/02/5vCLnH9SWieUg4l.png)]

During preprocessing, separate dictionaries have to be built for the source and target sentences, i.e. one vocabulary just for German and another just for English.
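
The author intentionally keeps the preprocessing code out of this post. For readers who want a concrete picture anyway, here is a minimal sketch of what it might look like with the legacy torchtext API; Field, Multi30k, BucketIterator, the whitespace tokenizer and min_freq=2 are my assumptions, not necessarily what the original notebook uses.

# hypothetical preprocessing sketch (legacy torchtext API); details are assumptions
import torch
from torchtext.data import Field, BucketIterator
from torchtext.datasets import Multi30k

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

# <sos>/<eos> are the special tokens added to the start and end of every sentence
SRC = Field(tokenize=lambda s: s.split(), init_token='<sos>', eos_token='<eos>', lower=True)  # German
TRG = Field(tokenize=lambda s: s.split(), init_token='<sos>', eos_token='<eos>', lower=True)  # English

train_data, valid_data, test_data = Multi30k.splits(exts=('.de', '.en'), fields=(SRC, TRG))

# one vocabulary for German, another for English
SRC.build_vocab(train_data, min_freq=2)
TRG.build_vocab(train_data, min_freq=2)

# BucketIterator pads every sentence in a batch to that batch's longest sentence,
# yielding tensors of shape [seq_len, batch_size]
train_iter, valid_iter, test_iter = BucketIterator.splits(
    (train_data, valid_data, test_data), batch_size=128, device=device)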

Encoder

For the Encoder I use a single-layer bidirectional GRU.

The hidden-state output of a bidirectional GRU is the concatenation of two vectors, e.g. $h_1=[\overrightarrow{h_1};\overleftarrow{h_T}]$, $h_2=[\overrightarrow{h_2};\overleftarrow{h_{T-1}}]$. The last-layer hidden states at all time steps make up the GRU's output:

$$output=\{h_1,h_2,\ldots,h_T\}$$

Suppose this is an m-layer GRU; then the hidden states of all layers at the final time step make up the GRU's final hidden states:
$$hidden=\{h^1_T,h^2_T,\ldots,h^m_T\}$$
where
$$h^i_T=[\overrightarrow{h^i_T};\overleftarrow{h^i_0}]$$
so
$$hidden=\{[\overrightarrow{h^1_T};\overleftarrow{h^1_0}],[\overrightarrow{h^2_T};\overleftarrow{h^2_0}],\ldots,[\overrightarrow{h^m_T};\overleftarrow{h^m_0}]\}$$
According to the paper (or my Attention Illustrated article), what we need is the last layer of hidden, both forward and backward. We can take the last layer's hidden states with hidden[-2,:,:] and hidden[-1,:,:], concatenate them, and denote the result $s_0$.

One last detail: the dimension of $s_0$ is [batch_size, enc_hid_dim*2]. Even without the Attention mechanism, using $s_0$ directly as the Decoder's initial hidden state would be wrong, because the dimensions don't match; $s_0$ eventually needs to become [batch_size, src_len, dec_hid_dim] (set aside the src_len in the middle for now). The first thing to do is to convert it to [batch_size, dec_hid_dim], so $s_0$ is passed through a fully connected network to change its dimension.

That is all there is to the Encoder. The code is below; my style is comments above, code beneath.

import random

import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim

class Encoder(nn.Module):
    def __init__(self, input_dim, emb_dim, enc_hid_dim, dec_hid_dim, dropout):
        super().__init__()
        self.embedding = nn.Embedding(input_dim, emb_dim)
        self.rnn = nn.GRU(emb_dim, enc_hid_dim, bidirectional = True)
        self.fc = nn.Linear(enc_hid_dim * 2, dec_hid_dim)
        self.dropout = nn.Dropout(dropout)
        
    def forward(self, src): 
        '''
        src = [src_len, batch_size]
        '''
        src = src.transpose(0, 1) # src = [batch_size, src_len]
        embedded = self.dropout(self.embedding(src)).transpose(0, 1) # embedded = [src_len, batch_size, emb_dim]
        
        # enc_output = [src_len, batch_size, hid_dim * num_directions]
        # enc_hidden = [n_layers * num_directions, batch_size, hid_dim]
        enc_output, enc_hidden = self.rnn(embedded) # if h_0 is not given, it defaults to all zeros

        # enc_hidden is stacked [forward_1, backward_1, forward_2, backward_2, ...]
        # enc_output are always from the last layer
        
        # enc_hidden [-2, :, : ] is the last of the forwards RNN 
        # enc_hidden [-1, :, : ] is the last of the backwards RNN
        
        # initial decoder hidden is final hidden state of the forwards and backwards 
        # encoder RNNs fed through a linear layer
        # s = [batch_size, dec_hid_dim]
        s = torch.tanh(self.fc(torch.cat((enc_hidden[-2,:,:], enc_hidden[-1,:,:]), dim = 1)))
        
        return enc_output, s
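
To sanity-check the shapes, here is a tiny smoke test; the vocabulary size and hidden sizes below are arbitrary numbers chosen just for illustration, not the hyperparameters used later.

# hypothetical shape check with made-up hyperparameters
enc = Encoder(input_dim=1000, emb_dim=32, enc_hid_dim=64, dec_hid_dim=64, dropout=0.5)
src = torch.randint(0, 1000, (7, 4))  # [src_len=7, batch_size=4]
enc_output, s = enc(src)
print(enc_output.shape)  # torch.Size([7, 4, 128]) -> [src_len, batch_size, enc_hid_dim*2]
print(s.shape)           # torch.Size([4, 64])     -> [batch_size, dec_hid_dim]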

Attention

Attention boils down to just three formulas:
$$E_t=\tanh(attn(s_{t-1},H))\\ \tilde{a_t}=vE_t\\ a_t=\mathrm{softmax}(\tilde{a_t})$$
Here $s_{t-1}$ is the variable s from the Encoder, $H$ is the variable enc_output from the Encoder, and $attn()$ is just a simple fully connected network.

We can work backwards from the last formula to figure out what dimension each variable has, or what constraints the dimensions must satisfy.

First, $a_t$ must have dimension [batch_size, src_len]; there is no doubt about that. So $\tilde{a_t}$ should also be [batch_size, src_len], or $\tilde{a_t}$ is three-dimensional with one dimension equal to 1, which squeeze() can reduce to two dimensions. Here let's assume $\tilde{a_t}$ has dimension [batch_size, src_len, 1]; I will explain why in a moment.

Continuing upwards, the variable $v$ must then have dimension [?, 1], where ? means I don't know its value yet, and $E_t$ must have dimension [batch_size, src_len, ?].

We already know that $H$ has dimension [batch_size, src_len, enc_hid_dim*2] and that $s_{t-1}$ currently has dimension [batch_size, dec_hid_dim]. These two need to be concatenated and fed into the fully connected network, so we first have to expand $s_{t-1}$ to [batch_size, src_len, dec_hid_dim]. The concatenation happens along the last dimension, so only the first two dimensions need to match; $s_{t-1}$ can then be concatenated with $H$, and the result has dimension [batch_size, src_len, enc_hid_dim*2+dec_hid_dim]. With that, the input and output sizes of $attn()$ are settled:

attn = nn.Linear(enc_hid_dim*2+dec_hid_dim, ?)

At this point every dimension has been derived except the value of ?. Now think about what ? should be: there really isn't any constraint, so we can set ? to whatever we like (in the code I set ? to dec_hid_dim).

That is all there is to Attention; the code is below.

class Attention(nn.Module):
    def __init__(self, enc_hid_dim, dec_hid_dim):
        super().__init__()
        self.attn = nn.Linear((enc_hid_dim * 2) + dec_hid_dim, dec_hid_dim, bias=False)
        self.v = nn.Linear(dec_hid_dim, 1, bias = False)
        
    def forward(self, s, enc_output):
        
        # s = [batch_size, dec_hid_dim]
        # enc_output = [src_len, batch_size, enc_hid_dim * 2]
        
        batch_size = enc_output.shape[1]
        src_len = enc_output.shape[0]
        
        # repeat decoder hidden state src_len times
        # s = [batch_size, src_len, dec_hid_dim]
        # enc_output = [batch_size, src_len, enc_hid_dim * 2]
        s = s.unsqueeze(1).repeat(1, src_len, 1)
        enc_output = enc_output.transpose(0, 1)
        
        # energy = [batch_size, src_len, dec_hid_dim]
        energy = torch.tanh(self.attn(torch.cat((s, enc_output), dim = 2)))
        
        # attention = [batch_size, src_len]
        attention = self.v(energy).squeeze(2)
        
        return F.softmax(attention, dim=1)
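
Continuing the toy example from the Encoder section (reusing its enc_output and s purely for illustration), a quick check confirms that the attention weights have shape [batch_size, src_len] and sum to 1 over src_len:

# hypothetical check, reusing enc_output and s from the toy Encoder above
attn_layer = Attention(enc_hid_dim=64, dec_hid_dim=64)
a = attn_layer(s, enc_output)
print(a.shape)       # torch.Size([4, 7]) -> [batch_size, src_len]
print(a.sum(dim=1))  # tensor([1., 1., 1., 1.]) -- softmax over src_len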

Seq2Seq(with Attention)

I'll switch the order here and cover Seq2Seq first, then the Decoder.

A traditional Seq2Seq simply feeds every word of the sentence into the Decoder one after another during training. Once Attention is introduced, I need explicit control over feeding one word at a time (because each word fed into the Decoder requires some extra computation), so in the code you will see a for loop that runs trg_len-1 times (the leading <SOS> is fed in manually, hence one fewer iteration).

During training I also use a mechanism called Teacher Forcing, which keeps training fast while adding robustness; if you are not familiar with Teacher Forcing, see my article on it.

Think about what needs to happen inside the for loop. First, pass the variables to the Decoder: since the Attention computation happens inside the Decoder, I need to pass the three variables dec_input, s, and enc_output into the Decoder, which returns dec_output and a new s. Then simply apply Teacher Forcing to dec_output according to the probability.

That is all there is to Seq2Seq; the code is below.

class Seq2Seq(nn.Module):
    def __init__(self, encoder, decoder, device):
        super().__init__()
        self.encoder = encoder
        self.decoder = decoder
        self.device = device
        
    def forward(self, src, trg, teacher_forcing_ratio = 0.5):
        
        # src = [src_len, batch_size]
        # trg = [trg_len, batch_size]
        # teacher_forcing_ratio is probability to use teacher forcing
        
        batch_size = src.shape[1]
        trg_len = trg.shape[0]
        trg_vocab_size = self.decoder.output_dim
        
        # tensor to store decoder outputs
        outputs = torch.zeros(trg_len, batch_size, trg_vocab_size).to(self.device)
        
        # enc_output is all hidden states of the input sequence, back and forwards
        # s is the final forward and backward hidden states, passed through a linear layer
        enc_output, s = self.encoder(src)
                
        # first input to the decoder is the <sos> tokens
        dec_input = trg[0,:]
        
        for t in range(1, trg_len):
            
            # insert dec_input token embedding, previous hidden state and all encoder hidden states
            # receive output tensor (predictions) and new hidden state
            dec_output, s = self.decoder(dec_input, s, enc_output)
            
            # place predictions in a tensor holding predictions for each token
            outputs[t] = dec_output
            
            # decide if we are going to use teacher forcing or not
            teacher_force = random.random() < teacher_forcing_ratio
            
            # get the highest predicted token from our predictions
            top1 = dec_output.argmax(1) 
            
            # if teacher forcing, use actual next token as next input
            # if not, use predicted token
            dec_input = trg[t] if teacher_force else top1

        return outputs

Decoder

For the Decoder I use a single-layer unidirectional GRU.

The Decoder also comes down to three formulas:
$$w_t=a_tH\\ s_t=GRU(emb(y_t), w_t, s_{t-1})\\ \hat{y_t}=f(emb(y_t), w_t, s_t)$$
$H$ is the variable enc_output from the Encoder, $emb(y_t)$ is the result of passing dec_input through WordEmbedding, and $f()$ exists purely to convert dimensions, since the required output size is TRG_VOCAB_SIZE. One detail: a GRU takes only two arguments, one input and one hidden-state input, but the formula above involves three variables, so we have to pick one as the hidden-state input and "merge" the other two into the input.

Let's work forward from the first formula to derive the dimensions of each variable.

At the very beginning of the Decoder, Attention is called once to obtain the weights $a_t$, whose dimension is [batch_size, src_len], while $H$ has dimension [src_len, batch_size, enc_hid_dim*2]. To multiply them while keeping the batch_size dimension, we first add a dimension to $a_t$, then reorder the dimensions of $H$, and finally multiply them batch-wise (i.e. matrices within the same batch are multiplied):

a = a.unsqueeze(1) # [batch_size, 1, src_len]
H = H.transpose(0, 1) # [batch_size, src_len, enc_hid_dim*2]
w = torch.bmm(a, H) # [batch_size, 1, enc_hid_dim*2]

As mentioned above, the GRU cannot take three variables, so $emb(y_t)$ and $w_t$ have to be merged. $y_t$ is actually the dec_input variable from the Seq2Seq class, with dimension [batch_size], so we first add a dimension to $y_t$ and then pass it through WordEmbedding, which turns it into [batch_size, 1, emb_dim]. Finally we concatenate $w_t$ and $emb(y_t)$:

y = y.unsqueeze(1) # [batch_size, 1]
emb_y = self.emb(y) # [batch_size, 1, emb_dim]
rnn_input = torch.cat((emb_y, w), dim=2) # [batch_size, 1, emb_dim+enc_hid_dim*2]

The dimension of $s_{t-1}$ is [batch_size, dec_hid_dim], so it also needs an extra dimension (and rnn_input has to be put into [seq_len, batch_size, *] order for the GRU):

rnn_input = rnn_input.transpose(0, 1) # [1, batch_size, emb_dim+enc_hid_dim*2]
s = s.unsqueeze(0) # [1, batch_size, dec_hid_dim]

# dec_output = [1, batch_size, dec_hid_dim]
# dec_hidden = [1, batch_size, dec_hid_dim] = new s (not the previous s)
dec_output, dec_hidden = self.rnn(rnn_input, s)

For the last formula, all three variables are concatenated and passed through a fully connected layer to obtain the final prediction. Let's look at their dimensions first: $emb(y_t)$ is [batch_size, 1, emb_dim], $w_t$ is [batch_size, 1, enc_hid_dim*2], and $s_t$ (i.e. dec_output) is [1, batch_size, dec_hid_dim], so we can squeeze out the extra dimensions and concatenate them as follows:

emb_y = emb_y.squeeze(1) # [batch_size, emb_dim]
w = w.squeeze(1) # [batch_size, enc_hid_dim*2]
s = dec_output.squeeze(0) # [batch_size, dec_hid_dim]

fc_input = torch.cat((emb_y, w, s), dim=1) # [batch_size, emb_dim+enc_hid_dim*2+dec_hid_dim]

Those are all the details of the Decoder; the code is below (the snippets above are only illustrative, so the variable names may differ from the code that follows).

class Decoder(nn.Module):
    def __init__(self, output_dim, emb_dim, enc_hid_dim, dec_hid_dim, dropout, attention):
        super().__init__()
        self.output_dim = output_dim
        self.attention = attention
        self.embedding = nn.Embedding(output_dim, emb_dim)
        self.rnn = nn.GRU((enc_hid_dim * 2) + emb_dim, dec_hid_dim)
        self.fc_out = nn.Linear((enc_hid_dim * 2) + dec_hid_dim + emb_dim, output_dim)
        self.dropout = nn.Dropout(dropout)
        
    def forward(self, dec_input, s, enc_output):
             
        # dec_input = [batch_size]
        # s = [batch_size, dec_hid_dim]
        # enc_output = [src_len, batch_size, enc_hid_dim * 2]
        
        dec_input = dec_input.unsqueeze(1) # dec_input = [batch_size, 1]
        
        embedded = self.dropout(self.embedding(dec_input)).transpose(0, 1) # embedded = [1, batch_size, emb_dim]
        
        # a = [batch_size, 1, src_len]  
        a = self.attention(s, enc_output).unsqueeze(1)
        
        # enc_output = [batch_size, src_len, enc_hid_dim * 2]
        enc_output = enc_output.transpose(0, 1)

        # c = [1, batch_size, enc_hid_dim * 2]
        c = torch.bmm(a, enc_output).transpose(0, 1)

        # rnn_input = [1, batch_size, (enc_hid_dim * 2) + emb_dim]
        rnn_input = torch.cat((embedded, c), dim = 2)
            
        # dec_output = [src_len(=1), batch_size, dec_hid_dim]
        # dec_hidden = [n_layers * num_directions, batch_size, dec_hid_dim]
        dec_output, dec_hidden = self.rnn(rnn_input, s.unsqueeze(0))
        
        # embedded = [batch_size, emb_dim]
        # dec_output = [batch_size, dec_hid_dim]
        # c = [batch_size, enc_hid_dim * 2]
        embedded = embedded.squeeze(0)
        dec_output = dec_output.squeeze(0)
        c = c.squeeze(0)
        
        # pred = [batch_size, output_dim]
        pred = self.fc_out(torch.cat((dec_output, c, embedded), dim = 1))
        
        return pred, dec_hidden.squeeze(0)
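
Once more with the toy dimensions from the Encoder section, a single decoding step can be checked like this (the target vocabulary size 1500 is made up):

# hypothetical single-step check, reusing enc_output and s from the toy Encoder above
dec = Decoder(output_dim=1500, emb_dim=32, enc_hid_dim=64, dec_hid_dim=64,
              dropout=0.5, attention=Attention(64, 64))
dec_input = torch.randint(0, 1500, (4,))  # [batch_size] -- one target token per sentence
pred, s_new = dec(dec_input, s, enc_output)
print(pred.shape)   # torch.Size([4, 1500]) -> [batch_size, output_dim]
print(s_new.shape)  # torch.Size([4, 64])   -> [batch_size, dec_hid_dim]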

Defining the Model

INPUT_DIM = len(SRC.vocab)
OUTPUT_DIM = len(TRG.vocab)
ENC_EMB_DIM = 256
DEC_EMB_DIM = 256
ENC_HID_DIM = 512
DEC_HID_DIM = 512
ENC_DROPOUT = 0.5
DEC_DROPOUT = 0.5

attn = Attention(ENC_HID_DIM, DEC_HID_DIM)
enc = Encoder(INPUT_DIM, ENC_EMB_DIM, ENC_HID_DIM, DEC_HID_DIM, ENC_DROPOUT)
dec = Decoder(OUTPUT_DIM, DEC_EMB_DIM, ENC_HID_DIM, DEC_HID_DIM, DEC_DROPOUT, attn)

model = Seq2Seq(enc, dec, device).to(device)
TRG_PAD_IDX = TRG.vocab.stoi[TRG.pad_token]
criterion = nn.CrossEntropyLoss(ignore_index = TRG_PAD_IDX).to(device)
optimizer = optim.Adam(model.parameters(), lr=1e-3)

The ignore_index=TRG_PAD_IDX argument passed to CrossEntropyLoss() on the second-to-last line is rarely seen. It tells the loss function to skip one particular class when computing the loss. Note that what gets ignored is the class of the target (ground truth). For example, in the code below every target is class 1, while every prediction favors class 2 (indices start at 0); since the loss function is set to ignore class 1, it prints 0.

label = torch.tensor([1, 1, 1])
pred = torch.tensor([[0.1, 0.2, 0.6], [0.2, 0.1, 0.8], [0.1, 0.1, 0.9]])
loss_fn = nn.CrossEntropyLoss(ignore_index=1)
print(loss_fn(pred, label).item()) # 0

If instead the loss function is set to ignore class 2, the loss is no longer 0:

label = torch.tensor([1, 1, 1])
pred = torch.tensor([[0.1, 0.2, 0.6], [0.2, 0.1, 0.8], [0.1, 0.1, 0.9]])
loss_fn = nn.CrossEntropyLoss(ignore_index=2)
print(loss_fn(pred, label).item()) # 1.359844
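
Putting everything together, a rough sketch of one training epoch might look like the following. This is not part of the original post: it assumes the train_iter and batch.src/batch.trg field names from the preprocessing sketch above, and it drops the first time step of the model's output, which the Seq2Seq loop never fills in.

# hypothetical training loop; train_iter and the field names are assumptions
model.train()
for batch in train_iter:
    src = batch.src                            # [src_len, batch_size]
    trg = batch.trg                            # [trg_len, batch_size]
    optimizer.zero_grad()
    output = model(src, trg)                   # [trg_len, batch_size, output_dim]
    output_dim = output.shape[-1]
    output = output[1:].view(-1, output_dim)   # drop the all-zero <sos> step and flatten
    trg = trg[1:].view(-1)
    loss = criterion(output, trg)              # <pad> targets contribute nothing thanks to ignore_index
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), 1)  # gradient clipping (optional)
    optimizer.step()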

Finally, here is the link to the complete code (accessing it may require a VPN):
GitHub repository: nlp-tutorial
