A PyTorch Implementation of Seq2Seq

This post walks through how to implement Seq2Seq in PyTorch for a simple machine-translation-style application. Before reading on, it is worth skimming the paper Learning Phrase Representations using RNN Encoder–Decoder for Statistical Machine Translation (2014) to get a clear picture of the Seq2Seq architecture; this article will then be much easier to follow.

I have looked at many diagrams of the Seq2Seq architecture, and I find the one in the official PyTorch tutorial the easiest to understand.

First, as the figure above makes clear, Seq2Seq operates on three variables, unlike every network architecture I had encountered before. We call the Encoder's input enc_input, the Decoder's input dec_input, and the Decoder's output dec_output. The following concrete example walks through the entire Seq2Seq workflow.
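
To make this concrete, here is what the three sequences look like for the pair 'man' → 'women' used later in this post. This is purely illustrative; 'S'/'E' are the start/end symbols and '?' is the padding character defined further below.

enc_input  = list('man??') + ['E']   # encoder input: padded source word + end symbol  -> ['m','a','n','?','?','E']
dec_input  = ['S'] + list('women')   # decoder input: start symbol + target word       -> ['S','w','o','m','e','n']
dec_output = list('women') + ['E']   # decoder target: target word + end symbol        -> ['w','o','m','e','n','E']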

The figure below shows an Encoder built from LSTM cells. Its input is each character of "go away" (including the space); we only need the hidden state at the last time step, i.e. $h_t$ and $c_t$.

The $h_t$ and $c_t$ produced by the Encoder are then used as the Decoder's initial hidden state $h_0$ and $c_0$, as shown in the figure below. At the same time, the Decoder's input at the first time step is the start-of-sentence symbol (user-defined; "<SOS>", "\t", "S", etc. all work; here we use "\t"). The Decoder then outputs "m", together with a new hidden state $h_1$ and $c_1$.

Then $h_1$, $c_1$ and "m" are fed in as the input, yielding the output "a" and a new hidden state $h_2$ and $c_2$.

These steps are repeated until the Decoder finally outputs the end-of-sentence symbol (user-defined; "<EOS>", "\n", "E", etc. all work; here we use "\n").

[Figure: the Decoder unrolled step by step. Source: https://i.loli.net/2020/06/30/2P5grBbxf4SE3uq.png]

A few questions often come up about the Decoder, so let me answer them here:

  • During training, what if the Decoder never stops, i.e. it never outputs the end-of-sentence symbol?

    • During training we already know how long the Decoder's output should be. Suppose the current time step is the last character of that target length and the prediction is still not the end symbol; that is fine, we simply stop there and compute the loss.
  • During testing, what if the Decoder never stops? For example, it keeps predicting "wasd s w \n sdsw \n…(and so on forever)"

    • It cannot run forever, because during testing the Decoder still receives an input; this input is just a sequence of meaningless placeholders, e.g. many "<pad>" tokens. Since the Decoder's input has finite length, its output must have finite length as well. We then simply take all the characters before the first end symbol; for the example above, the final prediction is "wasd s w".
  • What is the relationship between the Decoder's input and output, i.e. dec_input and dec_output?

    • During training, no matter what character the Decoder outputs at the current time step, the Decoder is still fed the originally "scheduled" input at the next time step. For example, suppose dec_input="\twasted". After "\t" is fed in, the Decoder outputs the letter "m"; we just record it, and it does not stop the Decoder from being fed the letter "w" at the next time step.
    • During validation or testing, the Decoder's output at each time step does affect its next input, because at validation/test time the network cannot see the ground truth, so it can only keep looping. For example, say we want to translate the English word "wasted" into the German word "verschwenden". The Decoder first receives "\t" and produces some output, say "m"; at the next time step the Decoder is fed "m" and produces another output, say "a"; then "a" is fed in, and so on, until the final time step.

As an aside, I personally find Seq2Seq quite similar to an AutoEncoder.

Now let's go through the code.

First the imports. Here I use 'S' as the start symbol and 'E' as the end symbol; if the input or output is too short, it is padded with '?'.

# code by Tae Hwan Jung(Jeff Jung) @graykode, modify by wmathor
import torch
import numpy as np
import torch.nn as nn
import torch.utils.data as Data

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
# S: Symbol that shows starting of decoding input
# E: Symbol that shows ending of decoding output
# ?: Symbol that fills in the blank sequence if the current word is shorter than n_step

Next, define the dataset and the parameters. The dataset here is deliberately trivial; you can think of it as a translation task, except that it "translates" English into English.

n_step stores the length of the longest word; every word shorter than this is padded at the end with '?'.

letter = [c for c in 'SE?abcdefghijklmnopqrstuvwxyz']
letter2idx = {n: i for i, n in enumerate(letter)}

seq_data = [['man', 'women'], ['black', 'white'], ['king', 'queen'], ['girl', 'boy'], ['up', 'down'], ['high', 'low']]

# Seq2Seq Parameter
n_step = max([max(len(i), len(j)) for i, j in seq_data]) # max_len(=5)
n_hidden = 128
n_class = len(letter2idx) # classification problem
batch_size = 3
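
As a quick, purely illustrative sanity check (not part of the original code), the values for this toy dataset work out as follows:

print(n_class)  # 29: 'S', 'E', '?' plus the 26 lowercase letters
print(n_step)   # 5, the length of the longest words ('black', 'white', 'women', 'queen')
print('up' + '?' * (n_step - len('up')))  # 'up???', i.e. how shorter words will be padded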

Next we preprocess the data. The main steps are: pad every word that is too short with '?'; then append the end symbol 'E' to the Encoder's input, prepend the start symbol 'S' to the Decoder's input, and append the end symbol 'E' to the Decoder's target output, exactly as the inline comments in the code below show.

def make_data(seq_data):
    enc_input_all, dec_input_all, dec_output_all = [], [], []

    for seq in seq_data:
        for i in range(2):
            seq[i] = seq[i] + '?' * (n_step - len(seq[i])) # 'man??', 'women'

        enc_input = [letter2idx[n] for n in (seq[0] + 'E')] # ['m', 'a', 'n', '?', '?', 'E']
        dec_input = [letter2idx[n] for n in ('S' + seq[1])] # ['S', 'w', 'o', 'm', 'e', 'n']
        dec_output = [letter2idx[n] for n in (seq[1] + 'E')] # ['w', 'o', 'm', 'e', 'n', 'E']

        enc_input_all.append(np.eye(n_class)[enc_input])
        dec_input_all.append(np.eye(n_class)[dec_input])
        dec_output_all.append(dec_output) # not one-hot

    # make tensor
    return torch.Tensor(enc_input_all), torch.Tensor(dec_input_all), torch.LongTensor(dec_output_all)

'''
enc_input_all: [6, n_step+1 (because of 'E'), n_class]
dec_input_all: [6, n_step+1 (because of 'S'), n_class]
dec_output_all: [6, n_step+1 (because of 'E')]
'''
enc_input_all, dec_input_all, dec_output_all = make_data(seq_data)
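
If you want to check the shapes yourself, a quick illustrative verification (not part of the original code) would be:

print(enc_input_all.shape)   # torch.Size([6, 6, 29]) -> [num_pairs, n_step+1, n_class]
print(dec_input_all.shape)   # torch.Size([6, 6, 29])
print(dec_output_all.shape)  # torch.Size([6, 6])     -> class indices, not one-hot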

Because three tensors have to be returned here, we need a custom Dataset: concretely, we subclass torch.utils.data.Dataset and implement its __len__ and __getitem__ methods.

class TranslateDataSet(Data.Dataset):
    def __init__(self, enc_input_all, dec_input_all, dec_output_all):
        self.enc_input_all = enc_input_all
        self.dec_input_all = dec_input_all
        self.dec_output_all = dec_output_all
    
    def __len__(self): # return dataset size
        return len(self.enc_input_all)
    
    def __getitem__(self, idx):
        return self.enc_input_all[idx], self.dec_input_all[idx], self.dec_output_all[idx]

loader = Data.DataLoader(TranslateDataSet(enc_input_all, dec_input_all, dec_output_all), batch_size, True)
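
To see what one shuffled mini-batch looks like, you could peek at the loader like this (illustrative only):

enc_b, dec_in_b, dec_out_b = next(iter(loader))
print(enc_b.shape)      # torch.Size([3, 6, 29]) -> [batch_size, n_step+1, n_class]
print(dec_out_b.shape)  # torch.Size([3, 6])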

Next we define the Seq2Seq model. I use a plain RNN as both the encoder and the decoder. If you are familiar with RNNs, there is not much to explain about the network definition; the comments spell everything out, including how the tensor shapes change.

# Model
class Seq2Seq(nn.Module):
    def __init__(self):
        super(Seq2Seq, self).__init__()
        self.encoder = nn.RNN(input_size=n_class, hidden_size=n_hidden, dropout=0.5) # encoder
        self.decoder = nn.RNN(input_size=n_class, hidden_size=n_hidden, dropout=0.5) # decoder
        self.fc = nn.Linear(n_hidden, n_class)

    def forward(self, enc_input, enc_hidden, dec_input):
        # enc_input(=input_batch): [batch_size, n_step+1, n_class]
        # dec_input(=output_batch): [batch_size, n_step+1, n_class]
        enc_input = enc_input.transpose(0, 1) # enc_input: [n_step+1, batch_size, n_class]
        dec_input = dec_input.transpose(0, 1) # dec_input: [n_step+1, batch_size, n_class]

        # h_t : [num_layers(=1) * num_directions(=1), batch_size, n_hidden]
        _, h_t = self.encoder(enc_input, enc_hidden)
        # outputs : [n_step+1, batch_size, num_directions(=1) * n_hidden(=128)]
        outputs, _ = self.decoder(dec_input, h_t)

        model = self.fc(outputs) # model : [n_step+1, batch_size, n_class]
        return model

model = Seq2Seq().to(device)
criterion = nn.CrossEntropyLoss().to(device)
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
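
Before training, it can be reassuring to push one batch through the untrained model and confirm that the output shape matches the comments above (an illustrative check, not part of the original code). Note that nn.RNN will also warn that dropout=0.5 has no effect with a single layer; that warning is harmless here.

enc_b, dec_in_b, _ = next(iter(loader))
h_0 = torch.zeros(1, batch_size, n_hidden).to(device)   # initial encoder hidden state
with torch.no_grad():
    out = model(enc_b.to(device), h_0, dec_in_b.to(device))
print(out.shape)  # torch.Size([6, 3, 29]) -> [n_step+1, batch_size, n_class]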

Now the training loop. Because the prediction pred is a three-dimensional tensor, the loss here is computed separately for each sample in the batch, which is what the inner for loop below does.

for epoch in range(5000):
  for enc_input_batch, dec_input_batch, dec_output_batch in loader:
      # make hidden shape [num_layers * num_directions, batch_size, n_hidden]
      h_0 = torch.zeros(1, batch_size, n_hidden).to(device)

      (enc_input_batch, dec_input_batch, dec_output_batch) = (enc_input_batch.to(device), dec_input_batch.to(device), dec_output_batch.to(device))
      # enc_input_batch : [batch_size, n_step+1, n_class]
      # dec_input_batch : [batch_size, n_step+1, n_class]
      # dec_output_batch : [batch_size, n_step+1], not one-hot
      pred = model(enc_input_batch, h_0, dec_input_batch)
      # pred : [n_step+1, batch_size, n_class]
      pred = pred.transpose(0, 1) # [batch_size, n_step+1(=6), n_class]
      loss = 0
      for i in range(len(dec_output_batch)):
          # pred[i] : [n_step+1, n_class]
          # dec_output_batch[i] : [n_step+1]
          loss += criterion(pred[i], dec_output_batch[i])
      if (epoch + 1) % 1000 == 0:
          print('Epoch:', '%04d' % (epoch + 1), 'cost =', '{:.6f}'.format(loss))
          
      optimizer.zero_grad()
      loss.backward()
      optimizer.step()
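
Strictly speaking, the per-sample loop is not required: nn.CrossEntropyLoss accepts flattened inputs, so the loss can also be computed in a single call by merging the batch and time dimensions. This is an illustrative alternative, not the original code; it differs from the loop above only by a constant scale factor (the batch size), because of the default 'mean' reduction.

# pred : [batch_size, n_step+1, n_class], dec_output_batch : [batch_size, n_step+1]
loss = criterion(pred.reshape(-1, n_class), dec_output_batch.reshape(-1))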

As the test code below shows, at test time the Decoder's input is just a row of meaningless placeholders whose length equals the maximum length n_step. In the output we then locate the first end symbol and keep everything before it.

# Test
def translate(word):
    enc_input, dec_input, _ = make_data([[word, '?' * n_step]])
    enc_input, dec_input = enc_input.to(device), dec_input.to(device)
    # make hidden shape [num_layers * num_directions, batch_size, n_hidden]
    hidden = torch.zeros(1, 1, n_hidden).to(device)
    output = model(enc_input, hidden, dec_input)
    # output : [n_step+1, batch_size, n_class]

    predict = output.data.max(2, keepdim=True)[1] # select n_class dimension
    decoded = [letter[i] for i in predict]
    translated = ''.join(decoded[:decoded.index('E')])

    return translated.replace('?', '')

print('test')
print('man ->', translate('man'))
print('mans ->', translate('mans'))
print('king ->', translate('king'))
print('black ->', translate('black'))
print('up ->', translate('up'))
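
The translate function above feeds the Decoder a full row of '?' placeholders in one shot, which is fine for this toy setup. In practice, Seq2Seq inference is usually done step by step, feeding each prediction back in as the next input, as described in the Q&A earlier. Below is a minimal sketch of such a greedy, character-by-character decoder that reuses the trained model's encoder, decoder and fc layers; the function translate_greedy and its structure are my own addition, not part of the original code.

def translate_greedy(word, max_len=n_step + 1):
    word = word + '?' * (n_step - len(word))  # pad the source word
    enc_idx = [letter2idx[c] for c in (word + 'E')]
    enc_input = torch.Tensor(np.eye(n_class)[enc_idx]).unsqueeze(1).to(device)  # [n_step+1, 1, n_class]
    h = torch.zeros(1, 1, n_hidden).to(device)
    _, h = model.encoder(enc_input, h)  # final encoder hidden state

    dec_in = torch.Tensor(np.eye(n_class)[[letter2idx['S']]]).unsqueeze(1).to(device)  # start symbol 'S'
    chars = []
    for _ in range(max_len):
        out, h = model.decoder(dec_in, h)       # one decoding step
        idx = model.fc(out).argmax(-1).item()   # greedy choice of the next character
        if letter[idx] == 'E':                  # stop at the end symbol
            break
        chars.append(letter[idx])
        dec_in = torch.Tensor(np.eye(n_class)[[idx]]).unsqueeze(1).to(device)  # feed the prediction back in
    return ''.join(chars).replace('?', '')

print('man ->', translate_greedy('man'))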

The complete code is as follows:

# code by Tae Hwan Jung(Jeff Jung) @graykode, modify by wmathor
import torch
import numpy as np
import torch.nn as nn
import torch.utils.data as Data

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
# S: Symbol that shows starting of decoding input
# E: Symbol that shows ending of decoding output
# ?: Symbol that fills in the blank sequence if the current word is shorter than n_step

letter = [c for c in 'SE?abcdefghijklmnopqrstuvwxyz']
letter2idx = {n: i for i, n in enumerate(letter)}

seq_data = [['man', 'women'], ['black', 'white'], ['king', 'queen'], ['girl', 'boy'], ['up', 'down'], ['high', 'low']]

# Seq2Seq Parameter
n_step = max([max(len(i), len(j)) for i, j in seq_data]) # max_len(=5)
n_hidden = 128
n_class = len(letter2idx) # classification problem
batch_size = 3

def make_data(seq_data):
    enc_input_all, dec_input_all, dec_output_all = [], [], []

    for seq in seq_data:
        for i in range(2):
            seq[i] = seq[i] + '?' * (n_step - len(seq[i])) # 'man??', 'women'

        enc_input = [letter2idx[n] for n in (seq[0] + 'E')] # ['m', 'a', 'n', '?', '?', 'E']
        dec_input = [letter2idx[n] for n in ('S' + seq[1])] # ['S', 'w', 'o', 'm', 'e', 'n']
        dec_output = [letter2idx[n] for n in (seq[1] + 'E')] # ['w', 'o', 'm', 'e', 'n', 'E']

        enc_input_all.append(np.eye(n_class)[enc_input])
        dec_input_all.append(np.eye(n_class)[dec_input])
        dec_output_all.append(dec_output) # not one-hot

    # make tensor
    return torch.Tensor(enc_input_all), torch.Tensor(dec_input_all), torch.LongTensor(dec_output_all)

'''
enc_input_all: [6, n_step+1 (because of 'E'), n_class]
dec_input_all: [6, n_step+1 (because of 'S'), n_class]
dec_output_all: [6, n_step+1 (because of 'E')]
'''
enc_input_all, dec_input_all, dec_output_all = make_data(seq_data)

class TranslateDataSet(Data.Dataset):
    def __init__(self, enc_input_all, dec_input_all, dec_output_all):
        self.enc_input_all = enc_input_all
        self.dec_input_all = dec_input_all
        self.dec_output_all = dec_output_all
    
    def __len__(self): # return dataset size
        return len(self.enc_input_all)
    
    def __getitem__(self, idx):
        return self.enc_input_all[idx], self.dec_input_all[idx], self.dec_output_all[idx]

loader = Data.DataLoader(TranslateDataSet(enc_input_all, dec_input_all, dec_output_all), batch_size, True)

# Model
class Seq2Seq(nn.Module):
    def __init__(self):
        super(Seq2Seq, self).__init__()
        self.encoder = nn.RNN(input_size=n_class, hidden_size=n_hidden, dropout=0.5) # encoder
        self.decoder = nn.RNN(input_size=n_class, hidden_size=n_hidden, dropout=0.5) # decoder
        self.fc = nn.Linear(n_hidden, n_class)

    def forward(self, enc_input, enc_hidden, dec_input):
        # enc_input(=input_batch): [batch_size, n_step+1, n_class]
        # dec_input(=output_batch): [batch_size, n_step+1, n_class]
        enc_input = enc_input.transpose(0, 1) # enc_input: [n_step+1, batch_size, n_class]
        dec_input = dec_input.transpose(0, 1) # dec_input: [n_step+1, batch_size, n_class]

        # h_t : [num_layers(=1) * num_directions(=1), batch_size, n_hidden]
        _, h_t = self.encoder(enc_input, enc_hidden)
        # outputs : [n_step+1, batch_size, num_directions(=1) * n_hidden(=128)]
        outputs, _ = self.decoder(dec_input, h_t)

        model = self.fc(outputs) # model : [n_step+1, batch_size, n_class]
        return model

model = Seq2Seq().to(device)
criterion = nn.CrossEntropyLoss().to(device)
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)

for epoch in range(5000):
  for enc_input_batch, dec_input_batch, dec_output_batch in loader:
      # make hidden shape [num_layers * num_directions, batch_size, n_hidden]
      h_0 = torch.zeros(1, batch_size, n_hidden).to(device)

      (enc_input_batch, dec_input_batch, dec_output_batch) = (enc_input_batch.to(device), dec_input_batch.to(device), dec_output_batch.to(device))
      # enc_input_batch : [batch_size, n_step+1, n_class]
      # dec_input_batch : [batch_size, n_step+1, n_class]
      # dec_output_batch : [batch_size, n_step+1], not one-hot
      pred = model(enc_input_batch, h_0, dec_input_batch)
      # pred : [n_step+1, batch_size, n_class]
      pred = pred.transpose(0, 1) # [batch_size, n_step+1(=6), n_class]
      loss = 0
      for i in range(len(dec_output_batch)):
          # pred[i] : [n_step+1, n_class]
          # dec_output_batch[i] : [n_step+1]
          loss += criterion(pred[i], dec_output_batch[i])
      if (epoch + 1) % 1000 == 0:
          print('Epoch:', '%04d' % (epoch + 1), 'cost =', '{:.6f}'.format(loss))
          
      optimizer.zero_grad()
      loss.backward()
      optimizer.step()
    
# Test
def translate(word):
    enc_input, dec_input, _ = make_data([[word, '?' * n_step]])
    enc_input, dec_input = enc_input.to(device), dec_input.to(device)
    # make hidden shape [num_layers * num_directions, batch_size, n_hidden]
    hidden = torch.zeros(1, 1, n_hidden).to(device)
    output = model(enc_input, hidden, dec_input)
    # output : [n_step+1, batch_size, n_class]

    predict = output.data.max(2, keepdim=True)[1] # select n_class dimension
    decoded = [letter[i] for i in predict]
    translated = ''.join(decoded[:decoded.index('E')])

    return translated.replace('?', '')

print('test')
print('man ->', translate('man'))
print('mans ->', translate('mans'))
print('king ->', translate('king'))
print('black ->', translate('black'))
print('up ->', translate('up'))