Project overview

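We will train a Transformer model to translate Portuguese into English. Before starting, it helps to be familiar with text generation and attention mechanisms.

The core idea of the Transformer is self-attention: the ability to attend to different positions of the input sequence in order to compute a representation of that sequence. A Transformer stacks several self-attention layers; this is explained in the scaled dot-product attention and multi-head attention sections below.

A Transformer handles variable-length input with self-attention layers instead of RNNs or CNNs. This general architecture has several advantages: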

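It makes no assumptions about temporal or spatial relationships in the data. This makes it well suited to processing a set of objects (for example, StarCraft units).
Layer outputs can be computed in parallel, rather than sequentially as in an RNN.
Distant items can affect each other's output without passing through many RNN steps or convolution layers (see, for example, the Scene Memory Transformer).
It can learn long-range dependencies, which is a challenge in many sequence tasks.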

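The architecture also has drawbacks:

For a time series, the output at a time step is computed from the entire history rather than from only the current input and hidden state. This may be less efficient.
If the input does have a temporal or spatial structure, as text does, some positional encoding must be added; otherwise the model effectively sees a bag of words.

Once the model is trained, you can enter a Portuguese sentence and get its English translation.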

Implementation

1. Import the required libraries

import tensorflow_datasets as tfds
import tensorflow as tf

import time
import numpy as np
import matplotlib.pyplot as plt

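2. Load the dataset

Here we use tensorflow_datasets to load a Portuguese-English translation dataset from the TED Talks Open Translation Project.

The dataset contains about 50,000 training examples, 1,100 validation examples, and 2,000 test examples.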

examples, metadata = tfds.load('ted_hrlr_translate/pt_to_en', with_info=True,
                               as_supervised=True)
train_examples, val_examples = examples['train'], examples['validation']
train_examples

<_OptionsDataset shapes: ((), ()), types: (tf.string, tf.string)>

Both train_examples and val_examples are tf.data.Dataset objects, so we can use the take method to print one example:

for pt, en in train_examples.take(1):
    print(pt)
    print(en)

tf.Tensor(b'os astr\xc3\xb3nomos acreditam que cada estrela da gal\xc3\xa1xia tem um planeta , e especulam que at\xc3\xa9 um quinto deles tem um planeta do tipo da terra que poder\xc3\xa1 ter vida , mas ainda n\xc3\xa3o vimos nenhum deles .', shape=(), dtype=string)
tf.Tensor(b"astronomers now believe that every star in the galaxy has a planet , and they speculate that up to one fifth of them have an earth-like planet that might be able to harbor life , but we have n't seen any of them .", shape=(), dtype=string)

3. Encode the text as numbers

3.1 Build the vocabulary and count its words

tokenizer_pt = tfds.features.text.Tokenizer()
tokenizer_en = tfds.features.text.Tokenizer()

vocabulary_set_pt = set()
vocabulary_set_en = set()
for pt, en in train_examples:
    some_tokens_en = tokenizer_en.tokenize(en.numpy())
    vocabulary_set_en.update(some_tokens_en)
    some_tokens_pt = tokenizer_pt.tokenize(pt.numpy())
    vocabulary_set_pt.update(some_tokens_pt)

vocab_size_en = len(vocabulary_set_en)
vocab_size_pt = len(vocabulary_set_pt)
vocab_size_en

Here tfds.features.text.Tokenizer() instantiates a tokenizer, and its tokenize method splits a sentence into individual words.

3.2 Build the encoders

encoder_en = tfds.features.text.TokenTextEncoder(vocabulary_set_en)
encoder_pt = tfds.features.text.TokenTextEncoder(vocabulary_set_pt)

We can try the encoder on one example:

sample_string = next(iter(train_examples))[1].numpy()
print('The sample string is {}'.format(sample_string))

tokenized_string = encoder_en.encode(sample_string)
print('Tokenized string is {}'.format(tokenized_string))

original_string = encoder_en.decode(tokenized_string)
print('The original string is {}'.format(original_string))

The sample string is b"except , i 've never lived one day of my life there ."
Tokenized string is [21883, 16754, 12892, 4950, 23143, 10288, 15354, 2589, 5321, 25368, 14355]
The original string is except i ve never lived one day of my life there

Next, we wrap the encoders in a function for later use:

def encode(lang1, lang2):
    lang1 = [encoder_pt.vocab_size] + encoder_pt.encode(
          lang1.numpy()) + [encoder_pt.vocab_size+1]

    lang2 = [encoder_en.vocab_size] + encoder_en.encode(
          lang2.numpy()) + [encoder_en.vocab_size+1]
  
    return lang1, lang2

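Start and end tokens must be added to both the input and the target, so two extra ids are placed before and after the result of encoder_pt.encode(lang1.numpy()). If encoder_pt.vocab_size is n, the start token is n and the end token is n+1.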
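The start/end convention can be checked on a toy vocabulary. This is only a sketch: toy_vocab and add_start_end are hypothetical names used for illustration, and the real ids come from the tfds encoders above.

```python
# Hypothetical toy vocabulary; id 0 is reserved for padding,
# mirroring how the padded batches later use 0 as the fill value.
toy_vocab = {"i": 1, "like": 2, "tea": 3}
toy_vocab_size = len(toy_vocab) + 1  # +1 for the padding id


def add_start_end(token_ids, vocab_size):
    """Wrap token ids with a start id (vocab_size) and an end id (vocab_size + 1)."""
    return [vocab_size] + list(token_ids) + [vocab_size + 1]


ids = [toy_vocab[word] for word in "i like tea".split()]
wrapped = add_start_end(ids, toy_vocab_size)
print(wrapped)  # [4, 1, 2, 3, 5]
```

Because the two extra ids sit just past the ordinary vocabulary, the model's vocabulary size has to be the encoder's vocab_size plus 2.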

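3.3 Encode all examples

See part 7 of the article "Tensorflow2.0加載和預處理數據的方法彙總" (a summary of ways to load and preprocess data in TensorFlow 2.0), which explains the functions used in the code below in detail.

3.3.1 Drop overly long examples

To speed up training, we drop examples longer than 40 tokens.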

MAX_LENGTH = 40

def filter_max_length(x, y, max_length=MAX_LENGTH):
    return tf.logical_and(tf.size(x) <= max_length,
                           tf.size(y) <= max_length)

3.3.2 Encoding function

def tf_encode(pt, en):
    result_pt, result_en = tf.py_function(encode, [pt, en], [tf.int64, tf.int64])
    result_pt.set_shape([None])
    result_en.set_shape([None])

    return result_pt, result_en

3.3.3 Shuffle and batch the examples

BUFFER_SIZE = 20000
BATCH_SIZE = 64

train_dataset = train_examples.map(tf_encode)
train_dataset = train_dataset.filter(filter_max_length)
# Cache the dataset in memory to speed up reads.
train_dataset = train_dataset.cache()
train_dataset = train_dataset.shuffle(BUFFER_SIZE).padded_batch(BATCH_SIZE, ((None, ), (None, )))
train_dataset = train_dataset.prefetch(tf.data.experimental.AUTOTUNE)


val_dataset = val_examples.map(tf_encode)
val_dataset = val_dataset.filter(filter_max_length).padded_batch(BATCH_SIZE, ((None, ), (None, )))

The examples in the final train_dataset and val_dataset have now been converted from text to integer vectors:

pt_batch, en_batch = next(iter(train_dataset))
pt_batch, en_batch

(<tf.Tensor: id=884035, shape=(64, 38), dtype=int64, numpy=
 array([[37503, 22040, 25913, ...,     0,     0,     0],
        [37503, 14863, 33404, ...,     0,     0,     0],
        [37503, 27883,  1899, ...,     0,     0,     0],
        ...,
        [37503, 11538, 37504, ...,     0,     0,     0],
        [37503, 25837, 27826, ...,     0,     0,     0],
        [37503, 37130,  6792, ...,     0,     0,     0]], dtype=int64)>,
 <tf.Tensor: id=884036, shape=(64, 36), dtype=int64, numpy=
 array([[26597, 24117, 14025, ...,     0,     0,     0],
        [26597,  6900, 22616, ...,     0,     0,     0],
        [26597,   436, 15562, ...,     0,     0,     0],
        ...,
        [26597,  1627, 26598, ...,     0,     0,     0],
        [26597,  5490, 16754, ...,     0,     0,     0],
        [26597,  7492, 15118, ...,     0,     0,     0]], dtype=int64)>)

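In this batch, the longest Portuguese (input) example has 38 tokens and the longest English (target) example has 36 tokens, counting the start and end tokens.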

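4. Positional encoding

Because the model contains no recurrence, it has no built-in information about word order, and that information matters.

For example, changing the position or order of a word can change the meaning of the whole sentence.

I do not like apple, but I do like banana.
I do like apple, but I do not like banana.

These two sentences use exactly the same words but mean the opposite. Word-order information is needed to tell them apart.

The model therefore adds a positional encoding, which gives it information about the relative position of words in a sentence.

A Transformer cannot pick up word order the way a recurrent network does, so we must feed that information in explicitly. The original input is a sequence of word embeddings with no order information; the positional encoding is combined with the embeddings to form a new representation that is fed to the model (in both the encoder and the decoder), so that the model can learn from word order.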

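The positional encoding is computed as follows:

$PE_{(pos, 2i)} = \sin(\frac{pos}{10000^{\frac{2i}{d_{model}}}})$

$PE_{(pos, 2i+1)} = \cos(\frac{pos}{10000^{\frac{2i}{d_{model}}}})$

Here $pos$ is the position index of the word: for a sentence of length $L$, $pos=0,1,...,L-1$. $i$ indexes pairs of dimensions of the vector: if the embedding dimension is $d_{model}=512$, then $i=0,1,...,255$.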

For example, with $d_{model}=5$, within one example:
The positional encoding of the first word is:
$\begin{bmatrix} \sin(\frac{0}{10000^{\frac{2\times 0}{5}}}) & \cos(\frac{0}{10000^{\frac{2\times 0}{5}}}) & \sin(\frac{0}{10000^{\frac{2\times 1}{5}}}) & \cos(\frac{0}{10000^{\frac{2\times 1}{5}}}) & \sin(\frac{0}{10000^{\frac{2\times 2}{5}}}) \end{bmatrix}$
The positional encoding of the second word is:
$\begin{bmatrix} \sin(\frac{1}{10000^{\frac{2\times 0}{5}}}) & \cos(\frac{1}{10000^{\frac{2\times 0}{5}}}) & \sin(\frac{1}{10000^{\frac{2\times 1}{5}}}) & \cos(\frac{1}{10000^{\frac{2\times 1}{5}}}) & \sin(\frac{1}{10000^{\frac{2\times 2}{5}}}) \end{bmatrix}$

def get_angles(pos, i, d_model):
    angle_rates = 1 / np.power(10000, (2 * (i//2)) / np.float32(d_model))
    return pos * angle_rates

def positional_encoding(position, d_model):
    angle_rads = get_angles(np.arange(position)[:, np.newaxis],
                            np.arange(d_model)[np.newaxis, :],
                            d_model)

    # Apply sin to the even indices of the array (2i)
    angle_rads[:, 0::2] = np.sin(angle_rads[:, 0::2])

    # Apply cos to the odd indices of the array (2i+1)
    angle_rads[:, 1::2] = np.cos(angle_rads[:, 1::2])

    pos_encoding = angle_rads[np.newaxis, ...]

    return tf.cast(pos_encoding, dtype=tf.float32)
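To check the $d_{model}=5$ example numerically, the same computation can be written with NumPy alone. This is a minimal sketch: positional_encoding_np is a hypothetical helper that mirrors the function above without the TensorFlow cast and without the leading batch axis.

```python
import numpy as np


def positional_encoding_np(num_positions, d_model):
    """NumPy-only sinusoidal positional encoding, shape (num_positions, d_model)."""
    pos = np.arange(num_positions)[:, np.newaxis]   # (num_positions, 1)
    i = np.arange(d_model)[np.newaxis, :]           # (1, d_model)
    # Columns 2i and 2i+1 share the same angle rate.
    angles = pos / np.power(10000.0, (2 * (i // 2)) / d_model)
    pe = np.zeros((num_positions, d_model))
    pe[:, 0::2] = np.sin(angles[:, 0::2])  # even columns use sin
    pe[:, 1::2] = np.cos(angles[:, 1::2])  # odd columns use cos
    return pe


pe = positional_encoding_np(num_positions=2, d_model=5)
print(np.round(pe, 4))
```

The first row (position 0) is 0 in every sin column and 1 in every cos column, and the second row starts with sin(1) and cos(1), which matches the two vectors written out above.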

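5. Masking

5.1 Padding mask

Mask all the padding tokens in a batch of sequences (the positions that hold 0 after the text is converted to integer vectors). This ensures the model does not treat padding as input. The mask is 1 where the padding value 0 appears and 0 elsewhere.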

def create_padding_mask(seq):
    seq = tf.cast(tf.math.equal(seq, 0), tf.float32)

    # Add extra dimensions so the mask can be broadcast
    # onto the attention logits.
    return seq[:, tf.newaxis, tf.newaxis, :]  # (batch_size, 1, 1, seq_len)
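A small NumPy stand-in shows what the padding mask looks like. padding_mask_np is a hypothetical helper and the batch values are made up; only the rule (1 at padding, 0 elsewhere) and the output shape come from the function above.

```python
import numpy as np


def padding_mask_np(seq):
    """Return 1.0 where seq holds the padding id 0, and 0.0 elsewhere."""
    mask = (np.asarray(seq) == 0).astype(np.float32)
    return mask[:, np.newaxis, np.newaxis, :]  # (batch_size, 1, 1, seq_len)


batch = [[7, 6, 0, 0, 1],
         [1, 2, 3, 0, 0],
         [0, 0, 0, 4, 5]]
mask = padding_mask_np(batch)
print(mask.shape)     # (3, 1, 1, 5)
print(mask[0, 0, 0])  # [0. 0. 1. 1. 0.]
```

The two singleton axes let the same mask broadcast over every head and every query position of the attention logits.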

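5.2 Look-ahead mask

The look-ahead mask hides future tokens. To predict the third word, only the first and second words are used; to predict the fourth word, only the first, second, and third words are used; and so on. For example:

Suppose the input is the sentence "I have a dream", four words in total. This produces a 4x4 attention map (attention is introduced below).

As the first word, "I" can attend only to itself. As the second word, "have" attends to "I" and "have". As the third word, "a" attends to "I", "have", and "a". Only the last word, "dream", attends to all four words of the sentence.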

def create_look_ahead_mask(size):
  mask = 1 - tf.linalg.band_part(tf.ones((size, size)), -1, 0)
  return mask  # (seq_len, seq_len)

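The function tf.linalg.band_part(input, num_lower, num_upper) returns a banded copy of a matrix: input is the input matrix, num_lower is the number of subdiagonals to keep, and num_upper is the number of superdiagonals to keep. A value of -1 for num_lower or num_upper keeps the entire lower or upper triangle. With num_lower=-1 and num_upper=0, the result is the lower triangle including the main diagonal.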
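The "I have a dream" example can be reproduced with NumPy, where np.tril plays the role of band_part with num_lower=-1 and num_upper=0. This is a sketch for illustration; look_ahead_mask_np is a hypothetical name.

```python
import numpy as np


def look_ahead_mask_np(size):
    """1 marks a future position that must be hidden; 0 marks a visible one."""
    return 1.0 - np.tril(np.ones((size, size)))


words = ["I", "have", "a", "dream"]
mask = look_ahead_mask_np(len(words))
print(mask)

# List which words each position is allowed to attend to.
for row, word in zip(mask, words):
    visible = [w for w, hidden in zip(words, row) if hidden == 0]
    print(word, "->", visible)
```

Each row of the mask belongs to one query position, so the row for "I" hides the three later words and the row for "dream" hides nothing.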

5.3 Create all masks

def create_masks(inp, tar):
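    # Encoder padding mask
    enc_padding_mask = create_padding_mask(inp)

    # Used in the second attention block of the decoder.
    # This padding mask hides the padding in the encoder output.
    dec_padding_mask = create_padding_mask(inp)

    # Used in the first attention block of the decoder.
    # It pads and masks the future tokens in the input received by the decoder.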
    look_ahead_mask = create_look_ahead_mask(tf.shape(tar)[1])
    dec_target_padding_mask = create_padding_mask(tar)
    combined_mask = tf.maximum(dec_target_padding_mask, look_ahead_mask)

    return enc_padding_mask, combined_mask, dec_padding_mask

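In the encoder, the padding mask is applied to its only attention block. In the decoder, the first attention block uses both the padding mask and the look-ahead mask, and the second attention block uses the padding mask.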
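How the padding mask and the look-ahead mask combine in the decoder's first attention block can be seen on one short target sequence. This NumPy sketch uses made-up token ids and relies on broadcasting, in the same way tf.maximum does in create_masks.

```python
import numpy as np

target = np.array([[5, 8, 3, 0, 0]])  # one sentence; the last two ids are padding
seq_len = target.shape[1]

# Padding mask, shape (1, 1, 1, 5): 1 at padded positions.
pad = (target == 0).astype(np.float32)[:, np.newaxis, np.newaxis, :]
# Look-ahead mask, shape (5, 5): 1 above the main diagonal.
look_ahead = 1.0 - np.tril(np.ones((seq_len, seq_len), dtype=np.float32))

# A position is hidden if either mask hides it.
combined = np.maximum(pad, look_ahead)  # broadcasts to (1, 1, 5, 5)
print(combined.shape)
print(combined[0, 0])
```

In the printed matrix, the first row hides everything except the first token, while the last row has no future tokens left to hide and only the two padded columns remain masked.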

6. Scaled dot-product attention

The attention function used by the Transformer takes three inputs: Q (query), K (key), and V (value). The attention output is computed with:

$Attention(Q, K, V) = softmax(\frac{QK^T}{\sqrt{d_k}})V$

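Here $d_k$ is the depth of the key vectors, that is, the size of the last dimension of K. Without head splitting it equals the embedding dimension $d_{model}$; in the multi-head attention described later it is $d_{model}$ divided by the number of heads.

Suppose the components of Q and K are independent with mean 0 and variance 1. Their dot product then has mean 0 and variance $d_k$. Dividing by the square root of $d_k$ brings the product back to mean 0 and variance 1, which gives a gentler softmax.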
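The variance argument can be checked empirically. The sketch below draws random query and key vectors with NumPy (the sample count and $d_k=64$ are arbitrary choices for illustration) and compares the variance of the raw and the scaled dot products.

```python
import numpy as np

rng = np.random.default_rng(0)
d_k = 64
num_samples = 10000

# Independent components with mean 0 and variance 1.
q = rng.standard_normal((num_samples, d_k))
k = rng.standard_normal((num_samples, d_k))

dots = np.sum(q * k, axis=1)   # one dot product per (q, k) pair
scaled = dots / np.sqrt(d_k)   # the scaling used by the attention function

print("variance of raw dot products:   ", dots.var())
print("variance of scaled dot products:", scaled.var())
```

The first number should come out close to 64 and the second close to 1. Without the scaling, logits with a standard deviation of about 8 would push the softmax toward a near one-hot output with very small gradients.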

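def scaled_dot_product_attention(q, k, v, mask):
    """Compute the attention weights.
    q, k, v must have matching leading dimensions.
    k, v must have matching penultimate dimensions, i.e. seq_len_k = seq_len_v.
    The mask has a different shape depending on its type (padding or look-ahead),
    but it must be broadcastable for the addition.

    Args:
    q: query shape == (..., seq_len_q, depth)
    k: key shape == (..., seq_len_k, depth)
    v: value shape == (..., seq_len_v, depth_v)
    mask: float tensor with a shape broadcastable
          to (..., seq_len_q, seq_len_k), or None.

    Returns:
    output, attention_weights
    """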

    matmul_qk = tf.matmul(q, k, transpose_b=True)  # (..., seq_len_q, seq_len_k)

    # Scale matmul_qk
    dk = tf.cast(tf.shape(k)[-1], tf.float32)
    scaled_attention_logits = matmul_qk / tf.math.sqrt(dk)

    # Add the mask to the scaled tensor.
    if mask is not None:
        scaled_attention_logits += (mask * -1e9)

    # softmax normalizes over the last axis (seq_len_k) so that the scores
    # add up to 1.
    attention_weights = tf.nn.softmax(scaled_attention_logits, axis=-1)  # (..., seq_len_q, seq_len_k)

    output = tf.matmul(attention_weights, v)  # (..., seq_len_q, depth_v)

    return output, attention_weights

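Here the mask is multiplied by -1e9 (close to negative infinity) and added to the logits. The goal is to zero out the masked cells after the softmax, because large negative inputs to softmax produce outputs close to zero.

After the softmax normalizes over K, its values determine how much weight the V paired with each K receives for a given Q.

The output is the product of the attention weights and V (value). This keeps the words to focus on nearly intact and flushes out the irrelevant ones.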
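The behaviour described above can be tried on a tiny example. The following is a NumPy sketch of the same computation for a single 2-D case; attention_np and softmax are hypothetical helpers and the key, value, and query numbers are made up for illustration.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def attention_np(q, k, v, mask=None):
    """Scaled dot-product attention for 2-D q, k, v (no batch or head axes)."""
    logits = q @ k.T / np.sqrt(k.shape[-1])
    if mask is not None:
        logits = logits + mask * -1e9  # masked cells get a huge negative logit
    weights = softmax(logits, axis=-1)
    return weights @ v, weights


k = np.array([[10.0, 0, 0], [0, 10, 0], [0, 0, 10], [0, 0, 10]])  # 4 keys
v = np.array([[1.0, 0], [10, 0], [100, 5], [1000, 6]])            # 4 values
q = np.array([[0.0, 10, 0]])                                      # aligns with key 2

out, weights = attention_np(q, k, v)
print(np.round(weights, 3), np.round(out, 3))

# Hide the second key: the weight moves to the remaining three keys.
mask = np.array([[0.0, 1.0, 0.0, 0.0]])
out_masked, weights_masked = attention_np(q, k, v, mask)
print(np.round(weights_masked, 3), np.round(out_masked, 3))
```

Without the mask, nearly all the weight lands on the second key and the output is close to the second value, [10, 0]. With the mask, that key receives zero weight and the output becomes the average of the other three values.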

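7. Multi-head attention

The previous section introduced the three inputs of the attention function: Q (query), K (key), and V (value). Where do these values come from?

They are obtained by passing the processed text through Dense layers, where the processing includes word embedding, positional encoding, and so on. That is (bias terms omitted):

$Q = XW^Q, \quad K = XW^K, \quad V = XW^V$

The X used for Q, K, and V need not be the same. When it is the same, the operation is called self-attention.

Multi-head attention runs the scaled dot-product attention above H times and concatenates the outputs.

For example, if H=8, V, K, and Q are each split into 8 parts along the last dimension, scaled dot-product attention is applied to each part, and the results are concatenated.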

class MultiHeadAttention(tf.keras.layers.Layer):
    def __init__(self, d_model, num_heads):
        super(MultiHeadAttention, self).__init__()
        self.num_heads = num_heads
        self.d_model = d_model

        assert d_model % self.num_heads == 0

        self.depth = d_model // self.num_heads

        self.wq = tf.keras.layers.Dense(d_model)
        self.wk = tf.keras.layers.Dense(d_model)
        self.wv = tf.keras.layers.Dense(d_model)

        self.dense = tf.keras.layers.Dense(d_model)
        
    def split_heads(self, x, batch_size):
        """Split the last dimension into (num_heads, depth).
        Transpose the result so the shape is (batch_size, num_heads, seq_len, depth).
        """
        x = tf.reshape(x, (batch_size, -1, self.num_heads, self.depth))
        return tf.transpose(x, perm=[0, 2, 1, 3])
    
    def call(self, v, k, q, mask):
        batch_size = tf.shape(q)[0]
        q = self.wq(q)  # (batch_size, seq_len, d_model)
        k = self.wk(k)  # (batch_size, seq_len, d_model)
        v = self.wv(v)  # (batch_size, seq_len, d_model)
        q = self.split_heads(q, batch_size)  # (batch_size, num_heads, seq_len_q, depth)
        k = self.split_heads(k, batch_size)  # (batch_size, num_heads, seq_len_k, depth)
        v = self.split_heads(v, batch_size)  # (batch_size, num_heads, seq_len_v, depth)
        
        # scaled_attention.shape == (batch_size, num_heads, seq_len_q, depth)
        # attention_weights.shape == (batch_size, num_heads, seq_len_q, seq_len_k)
        scaled_attention, attention_weights = scaled_dot_product_attention(
            q, k, v, mask)

        scaled_attention = tf.transpose(scaled_attention, perm=[0, 2, 1, 3])  # (batch_size, seq_len_q, num_heads, depth)

        concat_attention = tf.reshape(scaled_attention, 
                                      (batch_size, -1, self.d_model))  # (batch_size, seq_len_q, d_model)

        output = self.dense(concat_attention)  # (batch_size, seq_len_q, d_model)

        return output, attention_weights
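The reshape and transpose steps in split_heads, and their reversal before the final Dense layer, can be followed on a small array. This NumPy sketch uses arbitrary small sizes (batch 2, sequence length 3, d_model 8, 4 heads) chosen only for illustration.

```python
import numpy as np

batch_size, seq_len, d_model, num_heads = 2, 3, 8, 4
depth = d_model // num_heads  # 2 features per head

x = np.arange(batch_size * seq_len * d_model).reshape(batch_size, seq_len, d_model)

# Split: (batch, seq_len, d_model) -> (batch, num_heads, seq_len, depth)
heads = x.reshape(batch_size, seq_len, num_heads, depth).transpose(0, 2, 1, 3)
print(heads.shape)  # (2, 4, 3, 2)

# Head 1 of the first token holds features 2 and 3 of that token.
print(heads[0, 1, 0], x[0, 0, 2:4])

# Merge: undo the transpose, then fold the head axis back into d_model.
merged = heads.transpose(0, 2, 1, 3).reshape(batch_size, seq_len, d_model)
print(np.array_equal(merged, x))  # True
```

Each head therefore sees a contiguous slice of depth features from every token, and merging the heads restores the original layout exactly.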

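8. Position-wise feed-forward network

This block provides a non-linear transformation. It consists of two fully connected layers with a ReLU activation between them, and it is used at the end of every encoder layer and decoder layer.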

def point_wise_feed_forward_network(d_model, dff):
    return tf.keras.Sequential([
      tf.keras.layers.Dense(dff, activation='relu'),  # (batch_size, seq_len, dff)
      tf.keras.layers.Dense(d_model)  # (batch_size, seq_len, d_model)
    ])

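The shape changes as follows: the first Dense layer expands the input from (batch_size, seq_len, d_model) to (batch_size, seq_len, dff), and the second projects it back to (batch_size, seq_len, d_model).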

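9. Encoder and decoder

The input sentence passes through N encoder layers, which produce an output for every word in the sequence.
The decoder uses the encoder output together with its own input (self-attention over the target sentence) to predict the next word.

9.1 Encoder layer

Each encoder layer consists of these sublayers:

Multi-head attention (with padding mask);
Position-wise feed-forward network.

The multi-head attention here is self-attention over the input sentence. The output of the encoder layers is passed on to the decoder layers.

Each sublayer has a residual connection around it, followed by layer normalization. Residual connections help avoid the vanishing-gradient problem in deep networks.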

class EncoderLayer(tf.keras.layers.Layer):
  def __init__(self, d_model, num_heads, dff, rate=0.1):
    super(EncoderLayer, self).__init__()

    self.mha = MultiHeadAttention(d_model, num_heads)
    self.ffn = point_wise_feed_forward_network(d_model, dff)

    self.layernorm1 = tf.keras.layers.LayerNormalization(epsilon=1e-6)
    self.layernorm2 = tf.keras.layers.LayerNormalization(epsilon=1e-6)
    
    self.dropout1 = tf.keras.layers.Dropout(rate)
    self.dropout2 = tf.keras.layers.Dropout(rate)
    
  def call(self, x, training, mask):

    attn_output, _ = self.mha(x, x, x, mask)  # (batch_size, input_seq_len, d_model)
    attn_output = self.dropout1(attn_output, training=training)
    out1 = self.layernorm1(x + attn_output)  # (batch_size, input_seq_len, d_model)
    
    ffn_output = self.ffn(out1)  # (batch_size, input_seq_len, d_model)
    ffn_output = self.dropout2(ffn_output, training=training)
    out2 = self.layernorm2(out1 + ffn_output)  # (batch_size, input_seq_len, d_model)
    
    return out2
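The "residual connection followed by layer normalization" pattern in call can be illustrated with NumPy. This is a simplified sketch: layer_norm_np is a hypothetical helper that normalizes over the last axis and leaves out the learnable scale and offset that the Keras layer also has, and the input numbers are made up.

```python
import numpy as np


def layer_norm_np(x, epsilon=1e-6):
    """Normalize each vector along the last axis to mean 0 and variance 1."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + epsilon)


x = np.array([[1.0, 2.0, 3.0, 4.0]])              # sublayer input
sublayer_out = np.array([[0.5, -0.5, 0.5, -0.5]])  # e.g. attention output

# Residual connection first, then layer normalization.
out = layer_norm_np(x + sublayer_out)
print(np.round(out, 4))
```

Here x + sublayer_out is [1.5, 1.5, 3.5, 3.5], which has mean 2.5 and variance 1, so the normalized result is approximately [-1, -1, 1, 1].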

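9.2 Decoder layer

Each decoder layer consists of these sublayers:

First multi-head attention (with look-ahead mask and padding mask);
Second multi-head attention (with padding mask);
Position-wise feed-forward network.

The first multi-head attention is self-attention over the target sentence: its V, K, and Q all come from the target sentence. In the second multi-head attention, V and K take the encoder output as input, and Q takes the output of the first multi-head attention.

Each sublayer has a residual connection around it, followed by layer normalization.

Because Q receives the output of the decoder's first attention block and K receives the encoder output, the attention weights express how much importance each decoder position assigns to each position of the encoder output. In other words, the decoder predicts the next word by looking at the encoder output and self-attending to its own output.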

class DecoderLayer(tf.keras.layers.Layer):
    def __init__(self, d_model, num_heads, dff, rate=0.1):
        super(DecoderLayer, self).__init__()

        self.mha1 = MultiHeadAttention(d_model, num_heads)
        self.mha2 = MultiHeadAttention(d_model, num_heads)

        self.ffn = point_wise_feed_forward_network(d_model, dff)

        self.layernorm1 = tf.keras.layers.LayerNormalization(epsilon=1e-6)
        self.layernorm2 = tf.keras.layers.LayerNormalization(epsilon=1e-6)
        self.layernorm3 = tf.keras.layers.LayerNormalization(epsilon=1e-6)

        self.dropout1 = tf.keras.layers.Dropout(rate)
        self.dropout2 = tf.keras.layers.Dropout(rate)
        self.dropout3 = tf.keras.layers.Dropout(rate)

    
    def call(self, x, enc_output, training, 
             look_ahead_mask, padding_mask):
        # enc_output.shape == (batch_size, input_seq_len, d_model)

        attn1, attn_weights_block1 = self.mha1(x, x, x, look_ahead_mask)  # (batch_size, target_seq_len, d_model)
        attn1 = self.dropout1(attn1, training=training)
        out1 = self.layernorm1(attn1 + x)

        attn2, attn_weights_block2 = self.mha2(
            enc_output, enc_output, out1, padding_mask)  # (batch_size, target_seq_len, d_model)
        attn2 = self.dropout2(attn2, training=training)
        out2 = self.layernorm2(attn2 + out1)  # (batch_size, target_seq_len, d_model)

        ffn_output = self.ffn(out2)  # (batch_size, target_seq_len, d_model)
        ffn_output = self.dropout3(ffn_output, training=training)
        out3 = self.layernorm3(ffn_output + out2)  # (batch_size, target_seq_len, d_model)

        return out3, attn_weights_block1, attn_weights_block2

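Regarding shapes: an encoder layer maps (batch_size, input_seq_len, d_model) to the same shape, and a decoder layer maps (batch_size, target_seq_len, d_model) to the same shape, as the comments in the code above show.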

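9.3 Encoder

The encoder consists of:

Input embedding;
Positional encoding;
N encoder layers.

The input is passed through an embedding layer, and the embedding is summed with the positional encoding. The result of this sum is the input to the encoder layers. The output of the encoder is an input to the decoder.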

class Encoder(tf.keras.layers.Layer):
    def __init__(self, num_layers, d_model, num_heads, dff, input_vocab_size,
                 maximum_position_encoding, rate=0.1):
        super(Encoder, self).__init__()

        self.d_model = d_model
        self.num_layers = num_layers

        self.embedding = tf.keras.layers.Embedding(input_vocab_size, d_model)
        self.pos_encoding = positional_encoding(maximum_position_encoding, 
                                                self.d_model)


        self.enc_layers = [EncoderLayer(d_model, num_heads, dff, rate) 
                           for _ in range(num_layers)]

        self.dropout = tf.keras.layers.Dropout(rate)
        
    def call(self, x, training, mask):

        seq_len = tf.shape(x)[1]

        # Add the embedding and the positional encoding.
        x = self.embedding(x)  # (batch_size, input_seq_len, d_model)
        x *= tf.math.sqrt(tf.cast(self.d_model, tf.float32))
        x += self.pos_encoding[:, :seq_len, :]

        x = self.dropout(x, training=training)

        for i in range(self.num_layers):
            x = self.enc_layers[i](x, training, mask)

        return x  # (batch_size, input_seq_len, d_model)

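9.4 Decoder

The decoder consists of:

Output embedding;
Positional encoding;
N decoder layers.

The target sentence is passed through an embedding layer, and the embedding is summed with the positional encoding. The result of this sum is the input to the decoder layers. The output of the decoder is the input to the final linear layer.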

class Decoder(tf.keras.layers.Layer):
    def __init__(self, num_layers, d_model, num_heads, dff, target_vocab_size,
               maximum_position_encoding, rate=0.1):
        super(Decoder, self).__init__()

        self.d_model = d_model
        self.num_layers = num_layers

        self.embedding = tf.keras.layers.Embedding(target_vocab_size, d_model)
        self.pos_encoding = positional_encoding(maximum_position_encoding, d_model)

        self.dec_layers = [DecoderLayer(d_model, num_heads, dff, rate) 
                           for _ in range(num_layers)]
        self.dropout = tf.keras.layers.Dropout(rate)
    
    def call(self, x, enc_output, training, 
           look_ahead_mask, padding_mask):

        seq_len = tf.shape(x)[1]
        attention_weights = {}

        x = self.embedding(x)  # (batch_size, target_seq_len, d_model)
        x *= tf.math.sqrt(tf.cast(self.d_model, tf.float32))
        x += self.pos_encoding[:, :seq_len, :]

        x = self.dropout(x, training=training)

        for i in range(self.num_layers):
            x, block1, block2 = self.dec_layers[i](x, enc_output, training,
                                                 look_ahead_mask, padding_mask)

            attention_weights['decoder_layer{}_block1'.format(i+1)] = block1
            attention_weights['decoder_layer{}_block2'.format(i+1)] = block2
    
        # x.shape == (batch_size, target_seq_len, d_model)
        return x, attention_weights

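Regarding shapes: the encoder turns token ids of shape (batch_size, input_seq_len) into a tensor of shape (batch_size, input_seq_len, d_model), and the decoder turns token ids of shape (batch_size, target_seq_len) into a tensor of shape (batch_size, target_seq_len, d_model).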

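10. Create the Transformer

The Transformer consists of the encoder, the decoder, and a final linear layer. The encoder output is an input to the decoder, the decoder output is the input to the linear layer, and the output of the linear layer is returned.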

class Transformer(tf.keras.Model):
    def __init__(self, num_layers, d_model, num_heads, dff, input_vocab_size, 
               target_vocab_size, pe_input, pe_target, rate=0.1):
        super(Transformer, self).__init__()

        self.encoder = Encoder(num_layers, d_model, num_heads, dff, 
                               input_vocab_size, pe_input, rate)

        self.decoder = Decoder(num_layers, d_model, num_heads, dff, 
                               target_vocab_size, pe_target, rate)

        self.final_layer = tf.keras.layers.Dense(target_vocab_size)
    
    def call(self, inp, tar, training, enc_padding_mask, 
           look_ahead_mask, dec_padding_mask):

        enc_output = self.encoder(inp, training, enc_padding_mask)  # (batch_size, inp_seq_len, d_model)

        # dec_output.shape == (batch_size, tar_seq_len, d_model)
        dec_output, attention_weights = self.decoder(
            tar, enc_output, training, look_ahead_mask, dec_padding_mask)

        final_output = self.final_layer(dec_output)  # (batch_size, tar_seq_len, target_vocab_size)

        return final_output, attention_weights

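Regarding shapes: the final linear layer maps the decoder output of shape (batch_size, tar_seq_len, d_model) to logits of shape (batch_size, tar_seq_len, target_vocab_size).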

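11. Set the hyperparameters

The hyperparameters are:

num_layers: the number of encoder layers and of decoder layers;
d_model: the embedding dimension;
dff: the number of units in the first fully connected layer of the position-wise feed-forward network;
num_heads: the number of heads in multi-head attention;
input_vocab_size: the vocabulary size of the input language (including the start and end tokens);
target_vocab_size: the vocabulary size of the target language (including the start and end tokens);
dropout_rate: the dropout rate.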

num_layers = 4
d_model = 128
dff = 512
num_heads = 8

input_vocab_size = encoder_pt.vocab_size + 2
target_vocab_size = encoder_en.vocab_size + 2
dropout_rate = 0.1

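12. Optimizer

We use the Adam optimizer with a custom learning rate schedule:

$lrate = d_{model}^{-0.5} \cdot \min(step\_num^{-0.5}, step\_num \cdot warmup\_steps^{-1.5})$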

class CustomSchedule(tf.keras.optimizers.schedules.LearningRateSchedule):
    def __init__(self, d_model, warmup_steps=4000):
        super(CustomSchedule, self).__init__()

        self.d_model = d_model
        self.d_model = tf.cast(self.d_model, tf.float32)

        self.warmup_steps = warmup_steps

    def __call__(self, step):
        arg1 = tf.math.rsqrt(step)
        arg2 = step * (self.warmup_steps ** -1.5)

        return tf.math.rsqrt(self.d_model) * tf.math.minimum(arg1, arg2)

learning_rate = CustomSchedule(d_model)

optimizer = tf.keras.optimizers.Adam(learning_rate, beta_1=0.9, beta_2=0.98, 
                                     epsilon=1e-9)

Plotting the learning rate against the training step shows how it changes:

temp_learning_rate_schedule = CustomSchedule(d_model)

plt.plot(temp_learning_rate_schedule(tf.range(40000, dtype=tf.float32)))
plt.ylabel("Learning Rate")
plt.xlabel("Train Step")
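The shape of the schedule can also be checked without TensorFlow or a plot. The function below is a plain-Python sketch of the same rule; transformer_lr is a hypothetical name, and the clamp to step 1 is an added assumption to avoid dividing by zero at step 0.

```python
def transformer_lr(step, d_model=128, warmup_steps=4000):
    """Linear warm-up for warmup_steps, then inverse square-root decay."""
    step = max(step, 1)  # assumption: treat step 0 like step 1
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)


for step in (1, 1000, 4000, 16000, 40000):
    print(step, transformer_lr(step))
```

The rate rises linearly up to warmup_steps, peaks there at about 0.0014 for d_model=128, and then decays with the inverse square root of the step, so it is halved by step 16000.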

13. Loss function and metrics

loss_object = tf.keras.losses.SparseCategoricalCrossentropy(
    from_logits=True, reduction='none')

def loss_function(real, pred):
  mask = tf.math.logical_not(tf.math.equal(real, 0))
  loss_ = loss_object(real, pred)

  mask = tf.cast(mask, dtype=loss_.dtype)
  loss_ *= mask
  
  return tf.reduce_mean(loss_)

train_loss = tf.keras.metrics.Mean(name='train_loss')
train_accuracy = tf.keras.metrics.SparseCategoricalAccuracy(name='train_accuracy')

A mask is used here so that the loss at padded positions is set to 0.
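The effect of the mask, and of the way the masked loss is averaged, can be seen on made-up numbers. In this NumPy sketch the per-token loss values and token ids are invented; it compares a mean over all positions (what tf.reduce_mean computes) with a mean over real tokens only, which is a common alternative and not what the code above does.

```python
import numpy as np

per_token_loss = np.array([[2.0, 1.0, 3.0, 0.7, 0.9]])  # made-up loss values
real = np.array([[12, 7, 31, 0, 0]])                    # the last two ids are padding

mask = (real != 0).astype(per_token_loss.dtype)
masked = per_token_loss * mask  # loss at padded positions becomes 0

mean_over_all = masked.mean()               # divides by all 5 positions
mean_over_real = masked.sum() / mask.sum()  # divides by the 3 real tokens

print(mean_over_all)   # 1.2
print(mean_over_real)  # 2.0
```

With the mean over all positions, batches that contain more padding report a smaller loss for the same per-token error, which is worth keeping in mind when comparing loss values across batches.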

14. Training

14.1 Checkpoints

transformer = Transformer(num_layers, d_model, num_heads, dff,
                          input_vocab_size, target_vocab_size, 
                          pe_input=input_vocab_size, 
                          pe_target=target_vocab_size,
                          rate=dropout_rate)

checkpoint_path = "./checkpoints/train"

ckpt = tf.train.Checkpoint(transformer=transformer,
                           optimizer=optimizer)

ckpt_manager = tf.train.CheckpointManager(ckpt, checkpoint_path, max_to_keep=5)

# Restore the latest checkpoint if one exists.
if ckpt_manager.latest_checkpoint:
  ckpt.restore(ckpt_manager.latest_checkpoint)
  print ('Latest checkpoint restored!!')

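14.2 Gradient descent

The target sentence is split into tar_inp and tar_real. tar_inp is passed to the decoder as input. tar_real is the same sequence shifted by 1: at each position of tar_inp, tar_real holds the next token that should be predicted.

For example, if the target sentence is "SOS A lion in the jungle is sleeping EOS", then:
tar_inp = "SOS A lion in the jungle is sleeping"
tar_real = "A lion in the jungle is sleeping EOS"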
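The shift is just two slices of the same sequence. A plain-Python sketch with word strings standing in for token ids:

```python
target = ["SOS", "A", "lion", "in", "the", "jungle", "is", "sleeping", "EOS"]

tar_inp = target[:-1]   # everything except the last token
tar_real = target[1:]   # everything except the first token

# At each position, the decoder sees tar_inp up to that point
# and is trained to output the matching entry of tar_real.
for seen, expected in zip(tar_inp, tar_real):
    print(f"{seen:>8} -> {expected}")
```

Both slices have the same length, one less than the full target, which is why the masks in the training step are built from tar_inp rather than from the full target.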

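The Transformer is an autoregressive model: it predicts one part at a time and uses its own output so far to decide what to do next.

Training uses teacher forcing. Regardless of what the model predicts at the current time step, teacher forcing passes the true output to the next time step.

As the Transformer predicts each word, self-attention lets it look at the preceding words in the sequence to better predict the next one.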

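# @tf.function trace-compiles train_step into a TF graph for faster
# execution. The function specializes to the exact shapes of its argument
# tensors. To avoid retracing caused by variable sequence lengths or variable
# batch sizes (the last batch is smaller), use input_signature to specify
# more generic shapes.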

train_step_signature = [
    tf.TensorSpec(shape=(None, None), dtype=tf.int64),
    tf.TensorSpec(shape=(None, None), dtype=tf.int64),
]

@tf.function(input_signature=train_step_signature)
def train_step(inp, tar):
  tar_inp = tar[:, :-1]
  tar_real = tar[:, 1:]
  
  enc_padding_mask, combined_mask, dec_padding_mask = create_masks(inp, tar_inp)
  
  with tf.GradientTape() as tape:
    predictions, _ = transformer(inp, tar_inp, 
                                 True, 
                                 enc_padding_mask, 
                                 combined_mask, 
                                 dec_padding_mask)
    loss = loss_function(tar_real, predictions)

  gradients = tape.gradient(loss, transformer.trainable_variables)    
  optimizer.apply_gradients(zip(gradients, transformer.trainable_variables))
  
  train_loss(loss)
  train_accuracy(tar_real, predictions)

14.3 Training loop

EPOCHS = 20
for epoch in range(EPOCHS):
  start = time.time()
  
  train_loss.reset_states()
  train_accuracy.reset_states()
  
  # inp -> portuguese, tar -> english
  for (batch, (inp, tar)) in enumerate(train_dataset):
    train_step(inp, tar)
    
    if batch % 50 == 0:
      print ('Epoch {} Batch {} Loss {:.4f} Accuracy {:.4f}'.format(
          epoch + 1, batch, train_loss.result(), train_accuracy.result()))
      
  if (epoch + 1) % 5 == 0:
    ckpt_save_path = ckpt_manager.save()
    print ('Saving checkpoint for epoch {} at {}'.format(epoch+1,
                                                         ckpt_save_path))
    
  print ('Epoch {} Loss {:.4f} Accuracy {:.4f}'.format(epoch + 1, 
                                                train_loss.result(), 
                                                train_accuracy.result()))

  print ('Time taken for 1 epoch: {} secs\n'.format(time.time() - start))

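15. Evaluation

Evaluation follows these steps:

Encode the input sentence with the Portuguese encoder (encoder_pt), and add the start and end tokens so that the input matches what the model was trained on.
The first decoder input is the English start token, encoder_en.vocab_size.
Compute the padding masks and the look-ahead mask.
The decoder produces predictions by looking at the encoder output and its own output (self-attention).
Select the last word and take its argmax.
Append the predicted word to the decoder input and pass it to the decoder again.
In this approach, the decoder predicts each next word from the words it has predicted so far.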
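The loop described by these steps is a greedy decoder. The sketch below shows its control flow in plain Python with a toy stand-in for the trained model: greedy_decode, toy_scores, and TOY_NEXT are hypothetical names, and the toy simply emits ids 1, 2, 3 and then the end token.

```python
def greedy_decode(next_token_scores, start_id, end_id, max_length):
    """Repeatedly append the highest-scoring next token until end_id or max_length."""
    output = [start_id]
    for _ in range(max_length):
        scores = next_token_scores(output)  # one score per vocabulary id
        predicted_id = max(range(len(scores)), key=scores.__getitem__)  # argmax
        if predicted_id == end_id:
            break
        output.append(predicted_id)
    return output


# Toy "model": ids 1..3 are words, 4 is the start token, 5 is the end token.
TOY_NEXT = {4: 1, 1: 2, 2: 3, 3: 5}


def toy_scores(prefix):
    """Put all the score on the single id that follows the last token."""
    scores = [0.0] * 6
    scores[TOY_NEXT[prefix[-1]]] = 1.0
    return scores


print(greedy_decode(toy_scores, start_id=4, end_id=5, max_length=40))  # [4, 1, 2, 3]
print(greedy_decode(toy_scores, start_id=4, end_id=5, max_length=2))   # [4, 1, 2]
```

As in the steps above, the returned sequence still begins with the start token and does not include the end token; the max_length cap stops decoding when the end token is never predicted.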

15.1 Evaluation function

def evaluate(inp_sentence):
    start_token = [encoder_pt.vocab_size]
    end_token = [encoder_pt.vocab_size + 1]

    # The input sentence is Portuguese; add the start and end tokens
    inp_sentence = start_token + encoder_pt.encode(inp_sentence) + end_token
    encoder_input = tf.expand_dims(inp_sentence, 0)

    # Because the target is English, the first token fed to the transformer
    # should be the English start token.
    decoder_input = [encoder_en.vocab_size]
    output = tf.expand_dims(decoder_input, 0)

    for i in range(MAX_LENGTH):
        enc_padding_mask, combined_mask, dec_padding_mask = create_masks(
            encoder_input, output)
  
        # predictions.shape == (batch_size, seq_len, vocab_size)
        predictions, attention_weights = transformer(encoder_input, 
                                                     output,
                                                     False,
                                                     enc_padding_mask,
                                                     combined_mask,
                                                     dec_padding_mask)
    
        # Select the last word along the seq_len dimension
        predictions = predictions[: ,-1:, :]  # (batch_size, 1, vocab_size)

        predicted_id = tf.cast(tf.argmax(predictions, axis=-1), tf.int32)

        # Return the result if predicted_id equals the end token
        if predicted_id == encoder_en.vocab_size+1:
            return tf.squeeze(output, axis=0), attention_weights
    
        # Append predicted_id to the output, which is fed back to the decoder as its input.
        output = tf.concat([output, predicted_id], axis=-1)

    return tf.squeeze(output, axis=0), attention_weights

15.2 Attention weight plot

def plot_attention_weights(attention, sentence, result, layer):
    fig = plt.figure(figsize=(16, 8))

    sentence = encoder_pt.encode(sentence)

    attention = tf.squeeze(attention[layer], axis=0)

    for head in range(attention.shape[0]):
        ax = fig.add_subplot(2, 4, head+1)

        # Plot the attention weights
        ax.matshow(attention[head][:-1, :], cmap='viridis')

        fontdict = {'fontsize': 10}

        ax.set_xticks(range(len(sentence)+2))
        ax.set_yticks(range(len(result)))

        ax.set_ylim(len(result)-1.5, -0.5)

        ax.set_xticklabels(
            ['<start>']+[encoder_pt.decode([i]) for i in sentence]+['<end>'], 
            fontdict=fontdict, rotation=90)

        ax.set_yticklabels([encoder_en.decode([i]) for i in result 
                            if i < encoder_en.vocab_size], 
                           fontdict=fontdict)

        ax.set_xlabel('Head {}'.format(head+1))
  
    plt.tight_layout()
    plt.show()

15.3 Translation function

def translate(sentence, plot=''):
    result, attention_weights = evaluate(sentence)

    predicted_sentence = encoder_en.decode([i for i in result 
                                            if i < encoder_en.vocab_size])  

    print('Input: {}'.format(sentence))
    print('Predicted translation: {}'.format(predicted_sentence))

    if plot:
        plot_attention_weights(attention_weights, sentence, result, plot)

We can try translating a Portuguese sentence:

translate("este é um problema que temos que resolver.")
print ("Real translation: this is a problem we have to solve .")

Input: este é um problema que temos que resolver.
Predicted translation: this is a problem that we have to solve
Real translation: this is a problem we have to solve .

The translation is fairly accurate.
