Understanding Attention and Seq2Seq Through Hands-On NMT

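I've recently been reading up on NMT research. There are a lot of papers, and every few months a new one proposes a new model or an improvement. As a beginner, I think it's better to first get the basic ideas straight and try implementing the simplest model for practice.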

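This post uses PyTorch's tutorial Translation with sequence to sequence network and attention as an example to introduce Seq2Seq and the Attention mechanism, and to get to know the simplest kind of NMT model along the way. Without further ado, let's get started.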

Task Overview

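The task is simple: French -> English, i.e. translating French into English. In the examples below, > marks the source-language input sentence, = marks the target-language reference sentence, and < marks the translation produced by the model.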

[KEY: > input, = target, < output]

> il est en train de peindre un tableau .
= he is painting a picture .
< he is painting a picture .

> pourquoi ne pas essayer ce vin delicieux ?
= why not try that delicious wine ?
< why not try that delicious wine ?

> elle n est pas poete mais romanciere .
= she is not a poet but a novelist .
< she not not a poet but a novelist .

> vous etes trop maigre .
= you re too skinny .
< you re all alone .

The seq2seq network in this post is based on Google's NIPS 2014 paper Sequence to Sequence Learning with Neural Networks. The overall architecture is shown in the figure below:

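The input is a sequence of French words with an <EOS> (End of Sentence) token appended. Each word is converted to its index in the vocabulary and fed to the Encoder one at a time. Inside the Encoder is a GRU (a variant of the RNN) that consumes the inputs step by step until it reaches <EOS>. The Encoder's final hidden state is then handed to the Decoder as its starting state. On the Decoder side, the target-language sequence (with an <SOS>, start of sentence, token prepended) is fed in one token at a time, and the Decoder outputs "the cat is black" word by word before stopping.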

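This traditional Encoder-Decoder framework relies solely on the fixed-length vector produced by the Encoder when decoding, so performance degrades badly on long input sequences. The Attention mechanism was proposed to address this. The Attention version in this post is based on Bahdanau's ICLR 2015 paper Neural Machine Translation by Jointly Learning to Align and Translate.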

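Zhang Junlin's blog has a very detailed explanation of the essence of Attention. The gist is that when the Decoder is decoding, the words of the Encoder's input sequence do not all influence it equally. For example, given the English sentence Tom chase Jerry, the Encoder-Decoder framework generates the Chinese words “湯姆” (Tom), “追逐” (chase), “傑瑞” (Jerry) one by one. When producing “傑瑞”, the input word "Jerry" obviously matters most, but the traditional model has no way to express this. That is why the attention mechanism is introduced. This post focuses on understanding Attention at the code level.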

Code Walkthrough

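The first step is to import the dependencies, mostly torch-related libraries. Because French contains some accented characters, the unicodedata library is imported as well: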

from __future__ import unicode_literals, print_function, division

import random
import re
import unicodedata
from io import open

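import torch
import torch.nn as nn
import torch.nn.functional as F
from torch import optim

# The model code below refers to `device`; use the GPU when one is available
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")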

Data Preprocessing

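The language class is a fairly standard pattern whose job is to build the vocabulary of one language. word2index maps a word to its index, which is what the model takes as input; index2word maps an index output by the model back to the corresponding word; SOS and EOS are two special tokens marking the start and end of a sentence; n_words is the vocabulary size.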

class Lang:
    def __init__(self, name):
        self.name = name
        self.word2index = {}
        self.word2count = {}
        self.index2word = {0: "SOS", 1: "EOS"}
        self.n_words = 2  # Count SOS and EOS

    def addSentence(self, sentence):
        for word in sentence.split(' '):
            self.addWord(word)

    def addWord(self, word):
        if word not in self.word2index:
            self.word2index[word] = self.n_words
            self.word2count[word] = 1
            self.index2word[self.n_words] = word
            self.n_words += 1
        else:
            self.word2count[word] += 1
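
To see what the vocabulary looks like after a couple of sentences, here is a small sketch. The class is repeated as-is only so the snippet runs on its own, and the two sentences are made-up samples.

```python
class Lang:
    # Same vocabulary class as above, repeated so this snippet is self-contained
    def __init__(self, name):
        self.name = name
        self.word2index = {}
        self.word2count = {}
        self.index2word = {0: "SOS", 1: "EOS"}
        self.n_words = 2  # SOS and EOS already occupy ids 0 and 1

    def addSentence(self, sentence):
        for word in sentence.split(' '):
            self.addWord(word)

    def addWord(self, word):
        if word not in self.word2index:
            self.word2index[word] = self.n_words
            self.word2count[word] = 1
            self.index2word[self.n_words] = word
            self.n_words += 1
        else:
            self.word2count[word] += 1


eng = Lang("eng")
eng.addSentence("he is painting a picture .")
eng.addSentence("he is tall .")

print(eng.word2index)        # new words get ids starting from 2
print(eng.n_words)           # 9: SOS, EOS and 7 distinct tokens
print(eng.word2count["he"])  # 2: "he" appeared in both sentences
```

Note that punctuation such as "." is treated as an ordinary token, which is why normalization has to put a space in front of it first.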

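Special-character conversion. I didn't dig into the details; I just ran a few French sentences through it as a unit test and looked at the output. Roughly, it turns accented French characters such as à and è, which look like English letters, into the plain English letters.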

# Turn a Unicode string to plain ASCII, thanks to
# https://stackoverflow.com/a/518232/2809427
# Turns characters such as àè into the plain letters ae
def unicodeToAscii(s):
    return ''.join(
        c for c in unicodedata.normalize('NFD', s)
        if unicodedata.category(c) != 'Mn'
    )

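Strip out the junk characters and leave the remaining words and punctuation separated by spaces. These two regular expressions confused me at first, but once I pulled each one out and unit-tested it on its own, it was clear what each does. This is my go-to trick for understanding complicated systems or code, popularly known as taking the wheel apart.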

def normalizeString(s):
    s = unicodeToAscii(s.lower().strip())
    # Put a space in front of . ! ?
    s = re.sub(r"([.!?])", r" \1", s)
    # Replace every run of one or more characters other than a-zA-Z.!? with a single space
    s = re.sub(r"[^a-zA-Z.!?]+", r" ", s)
    return s
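
The kind of unit test described above might look like the sketch below. The two functions are copied from above so the snippet runs on its own; the sample strings are my own, chosen to resemble the example sentences at the top of the post.

```python
import re
import unicodedata


def unicodeToAscii(s):
    # NFD splits "è" into "e" plus a combining accent (category Mn), which is dropped
    return ''.join(
        c for c in unicodedata.normalize('NFD', s)
        if unicodedata.category(c) != 'Mn'
    )


def normalizeString(s):
    s = unicodeToAscii(s.lower().strip())
    s = re.sub(r"([.!?])", r" \1", s)
    s = re.sub(r"[^a-zA-Z.!?]+", r" ", s)
    return s


# Each piece on its own
print(unicodeToAscii("àè"))                                   # ae
print(re.sub(r"([.!?])", r" \1", "why not?"))                 # why not ?
print(re.sub(r"[^a-zA-Z.!?]+", r" ", "you're  too, skinny"))  # you re too skinny

# The whole pipeline
print(normalizeString("Elle n'est pas poète, mais romancière."))
# elle n est pas poete mais romanciere .
```

This also explains why the sample sentences contain tokens like "n est" and "you re": the apostrophe is not in the allowed character set, so it becomes a space.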

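Read the data, which has to be downloaded first. In the raw file each line has English on the left and French on the right, separated by a tab. Since this post translates French to English, the pairs need to be swapped, i.e. reverse=True. The sentence pairs that were read are returned in pairs.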

def readLangs(lang1, lang2, reverse=False):
    print("Reading lines...")

    # Read the file and split into lines
    lines = open('data/%s-%s.txt' % (lang1, lang2), encoding='utf-8').read().strip().split('\n')

    # Split every line into pairs and normalize
    pairs = [[normalizeString(s) for s in l.split('\t')] for l in lines]

    # Reverse pairs, make Lang instances
    if reverse:
        pairs = [list(reversed(p)) for p in pairs]
        input_lang = Lang(lang2)
        output_lang = Lang(lang1)
    else:
        input_lang = Lang(lang1)
        output_lang = Lang(lang2)

    return input_lang, output_lang, pairs

Some constants used in the code:

# <SOS> and <EOS> have ids 0 and 1 in the vocabulary
SOS_token = 0
EOS_token = 1

# From the corpus file, keep only sentences with fewer than 10 tokens
# whose English side starts with one of eng_prefixes;
# these form the dataset for this task
MAX_LENGTH = 10

eng_prefixes = (
    "i am ", "i m ",
    "he is", "he s ",
    "she is", "she s ",
    "you are", "you re ",
    "we are", "we re ",
    "they are", "they re "
)

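# Teacher forcing probability: with this probability the target sequence is fed
# to the decoder as input; otherwise the decoder's previous output is used as the current input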
teacher_forcing_ratio = 0.5

# Size of the hidden layer, which is also the embedding dimension
hidden_size = 256

The filter condition and the filtering step:

# Keep only pairs where both sides have fewer than 10 tokens and the English side
# (p[1] after reversing) starts with one of eng_prefixes
def filterPair(p):
    return len(p[0].split(' ')) < MAX_LENGTH and len(p[1].split(' ')) < MAX_LENGTH and p[1].startswith(eng_prefixes)


def filterPairs(pairs):
    return [pair for pair in pairs if filterPair(pair)]
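
A quick sketch of what the filter keeps and drops. The constants and filterPair are copied from above so the snippet runs on its own; the three sentence pairs are made up for illustration and are already in (French, English) order, as they would be after reverse=True.

```python
MAX_LENGTH = 10

eng_prefixes = (
    "i am ", "i m ",
    "he is", "he s ",
    "she is", "she s ",
    "you are", "you re ",
    "we are", "we re ",
    "they are", "they re "
)


def filterPair(p):
    return (len(p[0].split(' ')) < MAX_LENGTH
            and len(p[1].split(' ')) < MAX_LENGTH
            and p[1].startswith(eng_prefixes))  # str.startswith accepts a tuple


pairs = [
    # kept: short, and the English side starts with "i am "
    ["je suis fatigue .", "i am tired ."],
    # dropped: the English side does not start with any prefix
    ["il court .", "he runs ."],
    # dropped: both sides have 10 or more tokens
    ["je suis sur que tu vas aimer ce petit restaurant italien .",
     "i am sure you will like this little italian restaurant ."],
]

kept = [p for p in pairs if filterPair(p)]
print(kept)  # [['je suis fatigue .', 'i am tired .']]
```

The prefix filter is what shrinks the corpus so drastically: only sentences of the form "I am / he is / they are ..." survive.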

Preparing the data:

# The raw corpus has 135,842 records
# After filtering, 10,599 parallel sentence pairs remain
# Vocabulary sizes:
# fra 4345
# eng 2803
def prepareData(lang1, lang2, reverse=False):
    input_lang, output_lang, pairs = readLangs(lang1, lang2, reverse)
    print("Read %s sentence pairs" % len(pairs))
    pairs = filterPairs(pairs)
    print("Trimmed to %s sentence pairs" % len(pairs))
    print("Counting words...")
    for pair in pairs:
        input_lang.addSentence(pair[0])
        output_lang.addSentence(pair[1])
    print("Counted words:")
    print(input_lang.name, input_lang.n_words)
    print(output_lang.name, output_lang.n_words)
    return input_lang, output_lang, pairs

The Model

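Finally, the main part. Let's look at the Encoder first. Its structure is simple: one Embedding and one GRU. How data flows through the model is easy to see from the architecture diagram, but at the code level you have to carefully work out the shape of every variable and how it gets transformed. forward functions are full of reshaping calls such as view(1, 1, -1), squeeze and unsqueeze, matrix operations such as torch.bmm, and softmax normalization such as F.softmax(matrix, dim=1). Each of these is worth pulling out and testing with a small example to learn what its arguments mean and what it does. In short, whichever wheel you don't understand, take it off and study it.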

class EncoderRNN(nn.Module):
    def __init__(self, input_size, hidden_size):
        super(EncoderRNN, self).__init__()
        self.hidden_size = hidden_size

        self.embedding = nn.Embedding(input_size, hidden_size)
        self.gru = nn.GRU(hidden_size, hidden_size)

    def forward(self, input, hidden):
        embedded = self.embedding(input).view(1, 1, -1)
        output = embedded
        output, hidden = self.gru(output, hidden)
        return output, hidden

    def initHidden(self):
        return torch.zeros(1, 1, self.hidden_size, device=device)
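
In the spirit of taking the wheel apart, here is a minimal sketch of one Encoder step using the bare layers instead of the class, just to check shapes. The vocabulary size of 20 and the word id 5 are arbitrary toy values, and everything runs on the CPU.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
vocab_size, hidden_size = 20, 256  # vocab_size is a toy value

embedding = nn.Embedding(vocab_size, hidden_size)
gru = nn.GRU(hidden_size, hidden_size)

token = torch.tensor([5])                # one word id, shape [1]
hidden = torch.zeros(1, 1, hidden_size)  # what initHidden() returns

embedded = embedding(token)              # [1, 256]
embedded = embedded.view(1, 1, -1)       # [1, 1, 256] = (seq_len, batch, features)
output, hidden = gru(embedded, hidden)

print(embedded.shape)  # torch.Size([1, 1, 256])
print(output.shape)    # torch.Size([1, 1, 256])
print(hidden.shape)    # torch.Size([1, 1, 256])

# squeeze removes a size-1 dimension, unsqueeze adds one
print(output.squeeze(0).shape)          # torch.Size([1, 256])
print(output[0, 0].unsqueeze(0).shape)  # torch.Size([1, 256])
```

With a single-layer GRU fed one time step at a time, output and hidden hold the same values, which is why the training loop can store encoder_output[0, 0] for attention and still pass encoder_hidden along as the state.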

The Encoder's structure is shown in the figure below:

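The Decoder without Attention is shown below. One thing I don't understand is the role of relu in forward: what would happen without this activation? Vanishing gradients? Exploding gradients?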

class DecoderRNN(nn.Module):
    def __init__(self, hidden_size, output_size):
        super(DecoderRNN, self).__init__()
        self.hidden_size = hidden_size

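        # Embedding: the first argument is the vocabulary size, the second is the embedding dimension
        # The Decoder's output picks one word out of the vocabulary, so num_embeddings = output_size
        # Here the embedding dimension equals the hidden size, so embedding_dim = hidden_size
        self.embedding = nn.Embedding(output_size, hidden_size)
        # GRU takes (input_size, hidden_size); here both are hidden_size = 256
        # input_size  = number of features in each input vector (the embedding dimension)
        # hidden_size = number of features in the hidden state
        self.gru = nn.GRU(hidden_size, hidden_size)
        # Linear takes two arguments: in_features, out_features
        self.out = nn.Linear(hidden_size, output_size)
        self.softmax = nn.LogSoftmax(dim=1)

    # Like the RNN modules, forward takes two inputs: the token and the hidden state
    # input: shape [1, 1]
    def forward(self, input, hidden):
        # embedded shape : [1, 1, 256]
        embedded = self.embedding(input)
        # Redundant when input is [1, 1]; needed when input has shape [1] or is a scalar
        output = embedded.view(1, 1, -1)
        # I really don't understand why relu is applied here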
        output = F.relu(output)
        output, hidden = self.gru(output, hidden)
        output = self.out(output[0])
        output = self.softmax(output)
        return output, hidden

The DecoderRNN structure is shown in the figure below:

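As mentioned earlier, to improve the translation of long sentences, this post adds the Attention mechanism on top of the Encoder-Decoder framework; this kind is generally called Soft-Attention. That said, the sentence pairs used for training here are all under 10 words, so the difference between having and not having attention does not really show.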

The decoder actually used in this post is AttnDecoderRNN:

class AttnDecoderRNN(nn.Module):
    def __init__(self, hidden_size, output_size, dropout_p=0.1, max_length=MAX_LENGTH):
        super(AttnDecoderRNN, self).__init__()
        self.hidden_size = hidden_size
        self.output_size = output_size
        self.dropout_p = dropout_p
        self.max_length = max_length

        self.embedding = nn.Embedding(self.output_size, self.hidden_size)
        self.attn = nn.Linear(self.hidden_size * 2, self.max_length)
        self.attn_combine = nn.Linear(self.hidden_size * 2, self.hidden_size)
        self.dropout = nn.Dropout(self.dropout_p)
        self.gru = nn.GRU(self.hidden_size, self.hidden_size)
        self.out = nn.Linear(self.hidden_size, self.output_size)

    def forward(self, input, hidden, encoder_outputs):
        # embedded shape : [1, 1, 256]
        # hidden shape : [1, 1, 256]
        embedded = self.embedding(input)
        # Redundant when input is [1, 1]; needed when input has shape [1] or is a scalar
        embedded = embedded.view(1, 1, -1)
        # Why apply dropout to the input as well?
        embedded = self.dropout(embedded)

        ####### Part 1: compute the weights from the similarity of Q and K #####

        # Both have shape [1, 256]
        # Concatenated they become [1, 512]
        cat_res = torch.cat((embedded[0], hidden[0]), 1)

        # A fully connected layer projects it down to max_length=10, giving [1, 10]
        attn_res = self.attn(cat_res)

        # Normalize along each row so the weights sum to 1
        attn_weights = F.softmax(attn_res, dim=1)

        ####### Part 2: context vector = weights * value #####

        # bmm : batched matrix multiplication
        # unsqueeze(0) turns [1, 10] into [1, 1, 10]
        # [1, 1, 10] times [1, 10, 256]
        # gives [1, 1, 256]
        attn_applied = torch.bmm(attn_weights.unsqueeze(0),
                                 encoder_outputs.unsqueeze(0))

        ### I don't fully understand the purpose of the steps below #####

        # Concatenate two [1, 256] tensors again, giving [1, 512]
        output = torch.cat((embedded[0], attn_applied[0]), 1)
        # Another fully connected layer projects it back to [1, 256]
        # then unsqueeze makes it [1, 1, 256]
        output = self.attn_combine(output).unsqueeze(0)

        output = F.relu(output)
        output, hidden = self.gru(output, hidden)

        output = F.log_softmax(self.out(output[0]), dim=1)
        return output, hidden, attn_weights

    def initHidden(self):
        return torch.zeros(1, 1, self.hidden_size, device=device)
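
To make the torch.bmm step in forward less mysterious, the sketch below checks on random stand-in tensors (not real model outputs) that the batched matrix product is the same thing as scaling each encoder output by its weight and summing.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
max_length, hidden_size = 10, 256

# Random stand-ins for self.attn(cat_res) and for the stored encoder outputs
attn_res = torch.randn(1, max_length)
encoder_outputs = torch.randn(max_length, hidden_size)

attn_weights = F.softmax(attn_res, dim=1)  # [1, 10], each row sums to 1

# What the decoder does: [1, 1, 10] x [1, 10, 256] -> [1, 1, 256]
attn_applied = torch.bmm(attn_weights.unsqueeze(0),
                         encoder_outputs.unsqueeze(0))

# The same thing by hand: scale row i of encoder_outputs by weight i, then sum the rows
weighted_sum = (attn_weights[0].unsqueeze(1) * encoder_outputs).sum(dim=0)  # [256]

print(attn_applied.shape)                                          # torch.Size([1, 1, 256])
print(torch.allclose(attn_applied[0, 0], weighted_sum, atol=1e-5)) # True
```

So the context vector is just a weighted average of the encoder outputs, and the weights decide which input positions dominate it.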

A rough structure diagram of the Attention Decoder:

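It's fairly easy to follow. The Decoder has 3 inputs: Input (the word fed to the Decoder at this step), Hidden (the hidden state output by the previous step), and Encoder outputs (the list of outputs from every Encoder timestep). The attention weights are computed from Input and Hidden and then combined with the Encoder outputs. In the code this is a torch.bmm call, i.e. a batched matrix multiplication of [1, 1, 10] by [1, 10, 256]; its effect is the same as scaling each Encoder output by its weight, position by position (an element-wise product), and then summing the scaled outputs. The element-wise scaling is easy to picture, see the figure below: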

In short, that is how Attention works. For the data flow through the whole Attention Decoder, see the figure below:

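By stepping through with a debugger, I worked out the shape of every Tensor (noted in the code comments above), and the arithmetic is clear too. But even knowing how it is computed, I still don't quite see why it is designed this way. Is there a theoretical basis? Or a particular paper this implementation follows? For example, I don't really understand the following points: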

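Why are the attention weights computed simply by concatenating the input's embedding vector with the previous step's hidden vector, passing the result through a fully connected layer (a linear operation), and then normalizing with softmax? Is it really that simple? Why would the result represent how much each Encoder timestep matters?
After multiplying attention_weights by encoder_outputs, why is the result concatenated with the input's embedding vector again, passed through another fully connected layer, and then put through relu to add a non-linearity? What is this whole sequence of operations for?
In what ways is GRU better than LSTM?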
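
On the first question, one thing worth noticing is that the weights in AttnDecoderRNN are computed from the embedding and the hidden state only; the encoder outputs are never looked at when scoring. For comparison, below is a minimal sketch of additive scoring in the style of the Bahdanau paper cited earlier, where every encoder output is scored against the decoder state. This is my own illustration, not code from the tutorial: the class name AdditiveAttention and the layer names W_q, W_k and v are made up, and the sizes are toy values.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class AdditiveAttention(nn.Module):
    # Hypothetical helper: score_i = v^T tanh(W_q q + W_k k_i)
    def __init__(self, hidden_size):
        super().__init__()
        self.W_q = nn.Linear(hidden_size, hidden_size, bias=False)
        self.W_k = nn.Linear(hidden_size, hidden_size, bias=False)
        self.v = nn.Linear(hidden_size, 1, bias=False)

    def forward(self, query, keys):
        # query: [1, hidden] (decoder state), keys: [seq_len, hidden] (encoder outputs)
        scores = self.v(torch.tanh(self.W_q(query) + self.W_k(keys)))  # [seq_len, 1]
        weights = F.softmax(scores.squeeze(1), dim=0)                  # [seq_len]
        context = weights @ keys                                       # [hidden]
        return context, weights


torch.manual_seed(0)
attn = AdditiveAttention(hidden_size=16)
decoder_state = torch.randn(1, 16)
encoder_outputs = torch.randn(7, 16)  # a 7-token source sentence

context, weights = attn(decoder_state, encoder_outputs)
print(weights.shape, context.shape)  # torch.Size([7]) torch.Size([16])
```

Because the scores here depend on the encoder outputs themselves, this version also works for any source length, whereas a Linear layer with max_length outputs ties the model to a fixed maximum.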

The rest is converting sentences to Tensors, which is all very simple, routine Python:


# Convert a sentence into a list of word ids
def indexesFromSentence(lang, sentence):
    return [lang.word2index[word] for word in sentence.split(' ')]


# Append the EOS token at the end and turn the id list into a Tensor
def tensorFromSentence(lang, sentence):
    indexes = indexesFromSentence(lang, sentence)
    indexes.append(EOS_token)
    return torch.tensor(indexes, dtype=torch.long, device=device).view(-1, 1)


# Turn each sentence pair into a tuple of two Tensors
def tensorsFromPair(pair):
    input_tensor = tensorFromSentence(input_lang, pair[0])
    target_tensor = tensorFromSentence(output_lang, pair[1])
    return (input_tensor, target_tensor)
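
A small sketch of what these helpers produce, using a hand-written toy vocabulary in place of a real Lang object. The word2index dict and the function name tensor_from_sentence are made up for this example, and the tensor stays on the CPU.

```python
import torch

EOS_token = 1
# Hypothetical toy vocabulary; a real one comes from Lang.word2index
word2index = {"he": 2, "is": 3, "tall": 4, ".": 5}


def tensor_from_sentence(word2index, sentence):
    indexes = [word2index[word] for word in sentence.split(' ')]
    indexes.append(EOS_token)
    # view(-1, 1) makes one row per token, so tensor[i] is a shape-[1] tensor
    return torch.tensor(indexes, dtype=torch.long).view(-1, 1)


t = tensor_from_sentence(word2index, "he is tall .")
print(t.shape)             # torch.Size([5, 1]): 4 tokens plus EOS
print(t.view(-1).tolist()) # [2, 3, 4, 5, 1]
print(t[0].shape)          # torch.Size([1])
```

A word that is not in the vocabulary raises KeyError, so sentences passed in must go through the same normalization and vocabulary as the training data.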

Training

# One training iteration: a forward and backward pass on a single sentence pair (not a pass over the whole dataset)
def train(input_tensor, target_tensor, encoder, decoder, encoder_optimizer, decoder_optimizer, criterion,
          max_length=MAX_LENGTH):
    encoder_hidden = encoder.initHidden()

    # Zero the optimizers' gradients
    encoder_optimizer.zero_grad()
    decoder_optimizer.zero_grad()

    # input_length and target_length are the number of tokens in the
    # encoder and decoder input sequences; they are used in the loops below
    input_length = input_tensor.size(0)
    target_length = target_tensor.size(0)

    # encoder_outputs starts as all zeros, with max_length rows of the encoder's hidden_size each
    encoder_outputs = torch.zeros(max_length, encoder.hidden_size, device=device)

    loss = 0

    # Encoder loop: one step per token of the input sentence
    for ei in range(input_length):
        encoder_output, encoder_hidden = encoder(
            input_tensor[ei], encoder_hidden)
        # encoder_outputs stores the output of every time step;
        # later it is multiplied by the attention weights to get a context vector
        encoder_outputs[ei] = encoder_output[0, 0]

    # The Decoder's first input is the <SOS> token, marking the start
    decoder_input = torch.tensor([[SOS_token]], device=device)

    # Use the encoder's final hidden state as the initial decoder_hidden
    decoder_hidden = encoder_hidden

    # With probability teacher_forcing_ratio (1/2 here):
    # True: feed the target-language words, taken from the sentence pair, as the decoder inputs
    # False: feed the word the decoder predicted at the previous time step as its next input
    use_teacher_forcing = True if random.random() < teacher_forcing_ratio else False

    if use_teacher_forcing:
        # Teacher forcing: Feed the target as the next input
        for di in range(target_length):
            decoder_output, decoder_hidden, decoder_attention = decoder(
                decoder_input, decoder_hidden, encoder_outputs)
            loss += criterion(decoder_output, target_tensor[di])
            # Use the word from the reference target sentence as the next input
            decoder_input = target_tensor[di]  # Teacher forcing

    else:
        # Without teacher forcing: use its own predictions as the next input
        for di in range(target_length):
            decoder_output, decoder_hidden, decoder_attention = decoder(
                decoder_input, decoder_hidden, encoder_outputs)
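            # topv: the largest log-probability (the output has gone through log_softmax)
            # topi: the position of that largest value
            # decoder_output has shape [1, target_vocab_size],
            # so topi is the index of the predicted word in the vocabulary
            topv, topi = decoder_output.topk(1)
            # detach: cut this input off from the graph so no gradient flows back through it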
            decoder_input = topi.squeeze().detach()  # detach from history as input

            loss += criterion(decoder_output, target_tensor[di])
            if decoder_input.item() == EOS_token:
                break

    loss.backward()

    encoder_optimizer.step()
    decoder_optimizer.step()

    return loss.item() / target_length
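
The post does not show how criterion is created. Since the decoders end in log_softmax, a natural assumption is nn.NLLLoss, and the sketch below is written under that assumption. It uses random logits with a toy vocabulary of 6 words to show what one loss += criterion(...) step computes and what topk(1) returns.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
vocab_size = 6                                 # toy value

logits = torch.randn(1, vocab_size)            # stand-in for self.out(output[0])
decoder_output = F.log_softmax(logits, dim=1)  # [1, vocab_size], log-probabilities
target = torch.tensor([3])                     # same shape as target_tensor[di]

criterion = nn.NLLLoss()                       # assumption: not shown in the post
loss = criterion(decoder_output, target)

# NLLLoss simply picks out minus the log-probability of the correct word
manual = -decoder_output[0, target.item()]
print(torch.allclose(loss, manual))            # True

# Greedy choice of the next decoder input when teacher forcing is off
topv, topi = decoder_output.topk(1)            # both have shape [1, 1]
next_input = topi.squeeze().detach()           # 0-dim tensor holding a word id
print(topi.shape, next_input.dim())            # torch.Size([1, 1]) 0
```

log_softmax followed by NLLLoss gives the same value as applying cross-entropy to the raw logits, which is why the model outputs log-probabilities rather than probabilities.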

That's about it. The fully commented code has been uploaded to my github.
