Attention Is All You Need (Transformer and Self-Attention)


The complete Transformer code is organized on GitHub; it is recommended to read the official source code directly.


Transformer Architecture

The Transformer uses stacked self-attention and point-wise, fully connected layers for both the encoder and decoder, shown in the left and right halves of the architecture figure, respectively.

Both the encoder and the decoder are composed of a stack of 6 layers:

The Google AI Blog provides an animated illustration of the process.


Encoder

Each encoder layer consists of two sublayers: multi-head self-attention and a feed-forward network. Both sublayers use layer normalization and a residual connection.

The encoder's input (bottom) layer adds positional encoding to incorporate sequence-position information. All layers apply a padding mask so that padded tokens are ignored.

The FFN has two layers: the first applies a ReLU activation and the second is linear. The FFN can be written as
\text{FFN}(Z)=\max(0,ZW_1+b_1)\cdot W_2+b_2
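The encoder and decoder code below calls a helper point_wise_feed_forward_network that is not defined in this post. A minimal sketch consistent with the formula above (the naming follows the official TensorFlow tutorial; the imports are shared by all code blocks here):

import numpy as np
import tensorflow as tf


def point_wise_feed_forward_network(d_model, dff):
    # two dense layers applied position-wise: ReLU, then linear
    return tf.keras.Sequential([
        tf.keras.layers.Dense(dff, activation='relu'),  # (batch_size, seq_len, dff)
        tf.keras.layers.Dense(d_model)                  # (batch_size, seq_len, d_model)
    ])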


TensorFlow implementation

class EncoderLayer(tf.keras.layers.Layer):
    """
    Each encoder layer consists of sublayers.
        1. Multi-head attention (with padding mask)
        2. Point wise feed forward networks.

    Each of these sublayers has a residual connection around it followed by a layer
    normalization. Residual connections help in avoiding the vanishing gradient problem
    in deep networks.
    """

    def __init__(self, d_model, num_heads, dff, rate=0.1):
        super(EncoderLayer, self).__init__()

        self.mha = MultiHeadAttention(d_model, num_heads)
        self.ffn = point_wise_feed_forward_network(d_model, dff)

        self.layernorm1 = tf.keras.layers.LayerNormalization(epsilon=1e-6)
        self.layernorm2 = tf.keras.layers.LayerNormalization(epsilon=1e-6)

        self.dropout1 = tf.keras.layers.Dropout(rate)
        self.dropout2 = tf.keras.layers.Dropout(rate)

    def call(self, x, training, mask):
        # attention_output.shape == (batch_size, input_seq_len, d_model)
        attention_output, _ = self.mha(x, x, x, mask)
        attention_output = self.dropout1(attention_output, training=training)
        output1 = self.layernorm1(x + attention_output)

        ffn_output = self.ffn(output1)
        ffn_output = self.dropout2(ffn_output, training=training)
        output2 = self.layernorm2(output1 + ffn_output)
        return output2


class Encoder(tf.keras.layers.Layer):

    def __init__(self, num_layers, d_model, num_heads, dff, vocab_size, maximum_position_encoding,
                 rate=0.1):
        super(Encoder, self).__init__()

        self.d_model = d_model
        self.num_layers = num_layers
        self.embedding = tf.keras.layers.Embedding(vocab_size, d_model)
        self.pos_encoding = positional_encoding(maximum_position_encoding, d_model)
        self.enc_layers = [
            EncoderLayer(d_model, num_heads, dff, rate) for _ in range(num_layers)]
        self.dropout = tf.keras.layers.Dropout(rate)

    def call(self, x, training, mask):
        x = self.embedding(x)
        x *= tf.math.sqrt(tf.cast(self.d_model, tf.float32))
        x += self.pos_encoding[:, :tf.shape(x)[1], :]
        x = self.dropout(x, training=training)
        for enc_layer in self.enc_layers:
            x = enc_layer(x, training, mask)
        return x
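A quick shape check with illustrative hyperparameters (a sketch, not the paper's full configuration; it assumes the MultiHeadAttention, point_wise_feed_forward_network, and positional_encoding definitions given elsewhere in this post):

sample_encoder = Encoder(num_layers=2, d_model=512, num_heads=8, dff=2048,
                         vocab_size=8500, maximum_position_encoding=10000)
# input token ids, shape=(batch_size, input_seq_len)
inp = tf.random.uniform((64, 62), dtype=tf.int64, minval=0, maxval=200)
enc_output = sample_encoder(inp, training=False, mask=None)
print(enc_output.shape)  # (64, 62, 512)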

Why is the embedding vector multiplied by a constant in the Transformer model? Before the positional encoding is added, the word embeddings are scaled by a constant (the square root of the embedding dimension), presumably to preserve the semantic information and keep it from being drowned out by the positional encoding. In addition, when self-attention computes the softmax over the attention scores, the logits are divided by a similar constant, $\sqrt{d_k}$.

 x *= tf.math.sqrt(tf.cast(self.d_model, tf.float32))

Decoder

Each decoder layer consists of three sublayers:

  • Masked multi-head attention (with look ahead mask and padding mask);
  • Multi-head attention (with padding mask). V and K receive the encoder outputs as inputs; Q receives the output from the first (masked) multi-head sublayer;
  • Point wise feed forward networks;

In addition to the padding mask and positional encoding, the first multi-head attention sublayer of the decoder also uses a look-ahead mask, so that decoding at the current position cannot use future information.
The second multi-head attention sublayer uses the output of the first sublayer as Q; K and V come from the encoder, so every decoding step can attend to the entire input.
The third sublayer is the same FFN as in the encoder. As in the encoder, residual connections and layer normalization are applied around each sublayer.


TensorFlow implementation

class DecoderLayer(tf.keras.layers.Layer):
    def __init__(self, d_model, num_heads, dff, rate=0.1):
        super(DecoderLayer, self).__init__()

        self.mha1 = MultiHeadAttention(d_model, num_heads)
        self.mha2 = MultiHeadAttention(d_model, num_heads)
        self.ffn = point_wise_feed_forward_network(d_model, dff)

        self.layernorm1 = tf.keras.layers.LayerNormalization(epsilon=1e-6)
        self.layernorm2 = tf.keras.layers.LayerNormalization(epsilon=1e-6)
        self.layernorm3 = tf.keras.layers.LayerNormalization(epsilon=1e-6)

        self.dropout1 = tf.keras.layers.Dropout(rate)
        self.dropout2 = tf.keras.layers.Dropout(rate)
        self.dropout3 = tf.keras.layers.Dropout(rate)

    def call(self, x, enc_output, training, combined_mask, enc_mask):
        attn1, block1 = self.mha1(x, x, x, combined_mask)
        attn1 = self.dropout1(attn1, training=training)
        out1 = self.layernorm1(attn1 + x)

        attn2, block2 = self.mha2(enc_output, enc_output, out1, enc_mask)
        attn2 = self.dropout2(attn2, training=training)
        out2 = self.layernorm2(attn2 + out1)

        ffn_output = self.ffn(out2)
        ffn_output = self.dropout3(ffn_output, training=training)
        out3 = self.layernorm3(ffn_output + out2)
        return out3, block1, block2


class Decoder(tf.keras.layers.Layer):
    def __init__(self, num_layers, d_model, num_heads, dff, vocab_size, maximum_position_encoding,
                 rate=0.1):
        super(Decoder, self).__init__()

        self.d_model = d_model
        self.num_layers = num_layers
        self.embedding = tf.keras.layers.Embedding(vocab_size, d_model)
        self.pos_encoding = positional_encoding(maximum_position_encoding, d_model)
        self.dec_layers = [
            DecoderLayer(d_model, num_heads, dff, rate) for _ in range(num_layers)]
        self.dropout = tf.keras.layers.Dropout(rate)

    def call(self, x, enc_output, training, combined_mask, enc_mask):
        x = self.embedding(x)
        x *= tf.math.sqrt(tf.cast(self.d_model, tf.float32))
        x += self.pos_encoding[:, :tf.shape(x)[1], :]
        x = self.dropout(x, training=training)

        attention_weights = {}
        for i, dec_layer in enumerate(self.dec_layers):
            x, block1, block2 = dec_layer(x, enc_output, training, combined_mask, enc_mask)
            attention_weights['decoder_layer{}_block1'.format(i + 1)] = block1
            attention_weights['decoder_layer{}_block2'.format(i + 1)] = block2

        return x, attention_weights
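Continuing the encoder shape check above, a quick sketch of running the decoder (illustrative values; `sample_encoder`, `inp`, and `enc_output` come from that sketch):

sample_decoder = Decoder(num_layers=2, d_model=512, num_heads=8, dff=2048,
                         vocab_size=8000, maximum_position_encoding=5000)
# target token ids, shape=(batch_size, target_seq_len)
tar = tf.random.uniform((64, 26), dtype=tf.int64, minval=0, maxval=200)
dec_output, attn = sample_decoder(tar, enc_output, training=False,
                                  combined_mask=None, enc_mask=None)
print(dec_output.shape)                     # (64, 26, 512)
print(attn['decoder_layer2_block2'].shape)  # (64, 8, 26, 62)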

Self Attention

RNN models cannot be parallelized across sequence positions; the self-attention layer is designed to address this problem.

Self-attention in the Transformer takes two forms: scaled dot-product attention and multi-head attention.


Scaled Dot-Product Attention

Each input vector $\boldsymbol a_i$ is transformed by the matrices $W^q$, $W^k$, and $W^v$ into a query vector $\boldsymbol q_i$, a key vector $\boldsymbol k_i$, and a value vector $\boldsymbol v_i$. Each query $\boldsymbol q_i$ is dotted with the key vectors $\boldsymbol k_j$ of the other inputs in the same sequence, giving the attention score $\alpha_{ij}$ that $\boldsymbol a_i$ places on each $\boldsymbol v_j$.

Applying softmax to the score vector $\boldsymbol\alpha_i$ turns the scores into a probability distribution, which weights the $\boldsymbol v_j$ to give the output vector $\boldsymbol o_i$ for $\boldsymbol a_i$:
\boldsymbol o_i=V\cdot\text{softmax}\left(\frac{\boldsymbol \alpha_i}{\sqrt{d_k}}\right)^\top,\quad \alpha_{ij}=\boldsymbol q_i\cdot\boldsymbol k_j,\ \boldsymbol q_j=W^q\boldsymbol a_j,\ \boldsymbol k_j=W^k\boldsymbol a_j,\ \boldsymbol v_j=W^v\boldsymbol a_j
When the dimensionality is high, the dot products become large and the softmax has small gradients at large inputs; dividing by $\sqrt{d_k}$ prevents the gradients from becoming too small.

Each input embedding $\boldsymbol a$ is transformed into a query $\boldsymbol q$, a key $\boldsymbol k$, and a value $\boldsymbol v$ by multiplying it on the left by the corresponding matrix.

Take the scaled dot product of each $\boldsymbol q^i$ with each $\boldsymbol k^j$ to get the value $a_{i,j}$, then normalize it into a probability distribution $\hat a_{i,j}$ by applying softmax; this is the attention that $\boldsymbol b^i$ (the output of $\boldsymbol x^i$) pays to $\boldsymbol v^j$.

Matrix form of self-attention for parallel computing:
\text{Attention}(Q,K,V)=V\cdot\text{softmax}\left(\frac{K^\top Q}{\sqrt{d_k}}\right)
We suspect that for large values of $d_k$, the dot products grow large in magnitude, pushing the softmax function into regions where it has extremely small gradients. To counteract this effect, we scale the dot products by $1/\sqrt{d_k}$. For small values of $d_k$, the scaling may not be necessary.

To illustrate why the dot products get large, assume that the components of $\boldsymbol q$ and $\boldsymbol k$ are independent random variables with mean 0 and variance 1. Then their dot product, $\boldsymbol q\cdot \boldsymbol k = \sum_{i=1}^{d_k}q_ik_i$, has mean 0 and variance $d_k$.
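A quick numerical check of this variance argument (a NumPy sketch, not from the paper):

import numpy as np

rng = np.random.default_rng(0)
d_k = 512
# 10,000 independent query/key pairs whose components have mean 0 and variance 1
q = rng.standard_normal((10000, d_k))
k = rng.standard_normal((10000, d_k))
dots = (q * k).sum(axis=1)
print(dots.var())                   # ~d_k (around 512)
print((dots / np.sqrt(d_k)).var())  # ~1 after scaling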


TensorFlow implementation

def scaled_dot_product_attention(q, k, v, mask):
    """
    The transformer takes three inputs: Q (query), K (key), V (value).

    The equation used to calculate the attention weights is:
        Attention(Q, K, V) = softmax(Q·K^T/sqrt(d_k))V

    sqrt(d_k): This is done because for large values of depth, the dot product grows large
    in magnitude, pushing the softmax into regions where it has small gradients, resulting in
    a very hard softmax.

    The mask positions are set to -1e9 (close to negative infinity); such large negative
    inputs to softmax produce near-zero attention weights at those positions.

    :param q: shape=(..., seq_len, depth)
    :param k: shape=(..., seq_len, depth)
    :param v: shape=(..., seq_len, depth_v)
    :param mask: shape=(..., seq_len, seq_len)
        multi_head => (batch_size, 1, 1, seq_len), 利用廣播性質mask
    """
    # (..., seq_len, seq_len)
    matmul_qk = tf.matmul(q, k, transpose_b=True)
    dk = tf.cast(tf.shape(k)[-1], tf.float32)
    scaled_attention_logits = matmul_qk / tf.math.sqrt(dk)

    if mask is not None:
        # padding_mask = (batch_size, 1, 1, seq_len)
        # combined_mask = (batch_size, 1, seq_len, seq_len)
        scaled_attention_logits += (mask * -1e9)

    # softmax is normalized on the last axis (seq_len) so that the scores add up to 1.
    # (..., seq_len, seq_len), 對最後二維矩陣的行做歸一化
    attention_weights = tf.nn.softmax(scaled_attention_logits, axis=-1)
    # (..., seq_len, depth_v)
    output = tf.matmul(attention_weights, v)

    return output, attention_weights
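A quick shape check with illustrative tensors (assuming the function above):

q = tf.random.uniform((2, 5, 64))   # (batch_size, seq_len_q, depth)
k = tf.random.uniform((2, 7, 64))   # (batch_size, seq_len_k, depth)
v = tf.random.uniform((2, 7, 32))   # (batch_size, seq_len_k, depth_v)
output, attention_weights = scaled_dot_product_attention(q, k, v, mask=None)
print(output.shape)             # (2, 5, 32)
print(attention_weights.shape)  # (2, 5, 7)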

Multi-Head Attention

Multi-head attention consists of four parts:

  • Linear layers and split into heads.

  • Scaled dot-product attention.

  • Concatenation of heads.

  • Final linear layer.

Instead of one single attention head, Q, K, and V are split into multiple heads because it allows the model to jointly attend to information at different positions from different representational spaces.
\text{MultiHead}(Q,K,V)=\text{Concat}(\text{head}_1\cdots\text{head}_h)W^O\\[1ex] \text{where head}_i=\text{Attention}(QW_i^Q, KW_i^K, VW_i^V)
Where the projections are parameter matrices:
W_i^Q,W_i^K\in\mathbb R^{d_\text{model}\times d_k},\ W_i^V\in\mathbb R^{d_\text{model}\times d_v}\ \text{and}\ W^O\in\mathbb R^{hd_v\times d_\text{model}}
The paper employs $h = 8$ parallel attention layers, or heads, and uses $d_k = d_v = d_\text{model}/h = 64$. Due to the reduced dimension of each head, the total computational cost is similar to that of single-head attention with full dimensionality.

The multiple heads can divide the work of attending to local and global relationships within the sequence, so a multi-head model is likely to outperform a single-head model of the same computational cost.


TensorFlow implementation

class MultiHeadAttention(tf.keras.layers.Layer):
    def __init__(self, d_model, num_heads):
        super(MultiHeadAttention, self).__init__()
        # number of scaled dot-product attentions
        self.num_heads = num_heads
        # dimension of output
        self.d_model = d_model
        assert d_model % num_heads == 0
        self.depth = d_model // num_heads
        self.wq = tf.keras.layers.Dense(d_model)
        self.wk = tf.keras.layers.Dense(d_model)
        self.wv = tf.keras.layers.Dense(d_model)
        self.dense = tf.keras.layers.Dense(d_model)

    def split_heads(self, x, batch_size):
        x = tf.reshape(x, (batch_size, -1, self.num_heads, self.depth))
        # transpose so that later computations can take full advantage of broadcasting
        return tf.transpose(x, perm=[0, 2, 1, 3])

    def call(self, v, k=None, q=None, mask=None):
        batch_size = tf.shape(q)[0]
        # shape=(batch_size, num_heads, seq_len, depth)
        q = self.split_heads(self.wq(q), batch_size)
        k = self.split_heads(self.wk(k), batch_size)
        v = self.split_heads(self.wv(v), batch_size)

        # scaled_attention.shape=(batch_size, num_heads, seq_len_q, depth)
        # attention_weights.shape=(batch_size, num_heads, seq_len_q, seq_len_k)
        scaled_attention, attention_weights = scaled_dot_product_attention(q, k, v, mask)
        # (batch_size, seq_len_q, num_heads, depth)
        scaled_attention = tf.transpose(scaled_attention, perm=[0, 2, 1, 3])
        # (batch_size, seq_len_q, d_model), d_model=num_heads*depth
        concat_attention = tf.reshape(scaled_attention, (batch_size, -1, self.d_model))

        # W_O matrix, (batch_size, seq_len_q, d_model)
        output = self.dense(concat_attention)
        return output, attention_weights
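A quick shape check (illustrative sizes, self-attention over a single batch):

mha = MultiHeadAttention(d_model=512, num_heads=8)
x = tf.random.uniform((1, 60, 512))  # (batch_size, seq_len, d_model)
output, attention_weights = mha(x, k=x, q=x, mask=None)
print(output.shape)             # (1, 60, 512)
print(attention_weights.shape)  # (1, 8, 60, 60)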

Positional Encoding

Since this model doesn't contain any recurrence or convolution, positional encoding is added to give the model some information about the relative position of the words in the sentence.

(By contrast, the outputs of CNNs and RNNs already carry position information.)

The intuition here is that adding these values to the embedding provides meaningful distances between the embedding vectors once they’re projected into Q/K/V vectors and during dot-product attention.

Goal of positional encoding: we add a d-dimensional positional encoding vector to the embedding; after the addition, words end up closer to each other in the d-dimensional space when they are similar both in meaning and in their position in the sentence.

After the positional encoding is added to the word embeddings, words that are closer to each other in the sentence are also closer in the embedding space.

We use sine and cosine functions of different frequencies:
\begin{aligned} PE(pos,2i) &=\sin(pos/10000^{2i/d_\text{model}})\\ PE(pos,2i+1) & =\cos(pos/10000^{2i/d_\text{model}}) \end{aligned}

where $pos$ is the position and $i$ is the dimension.

The following figure shows the positional encoding matrix for a sequence length of 50 and 500 dimensions.

If the embedding has a dimensionality of 4, the actual positional encodings would look like this:

This is not the only possible method for positional encoding. It, however, gives the advantage of being able to scale to unseen lengths of sequences (e.g. if our trained model is asked to translate a sentence longer than any of those in our training set).

In other words, adding a constant vector $\boldsymbol e^i$ to the embedding vector $\boldsymbol a^i$ is equivalent to extending the input with a one-hot position vector:
\boldsymbol a+\boldsymbol e= \begin{bmatrix} W^I & W^P \end{bmatrix} \begin{bmatrix} \boldsymbol x\\ \boldsymbol p \end{bmatrix}=W^I\boldsymbol x+W^P\boldsymbol p


TensorFlow implementation

def positional_encoding(position, d_model):
    """
    Positional encoding is added to give the model some information about the relative
    position of the words in the sentence.

    The formula for calculating the positional encoding is as follows:
        PE(pos,2i)   = sin(pos/10000^(2i/d_model))
        PE(pos,2i+1) = cos(pos/10000^(2i/d_model))
    """
    positions = np.arange(position)[:, np.newaxis]
    evens = np.arange(d_model)[np.newaxis, :] // 2
    angle_rates = 1 / np.power(10000, 2 * evens / np.float32(d_model))
    # shape=(seq_len, n_dimension)
    angle_rads = positions * angle_rates
    # even positions
    angle_rads[:, 0::2] = np.sin(angle_rads[:, 0::2])
    # odd position
    angle_rads[:, 1::2] = np.cos(angle_rads[:, 1::2])
    # shape=(1, position, n_dimension)
    encoding = angle_rads[np.newaxis, ...]
    return tf.cast(encoding, dtype=tf.float32)
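A quick shape check (the leading axis of size 1 is the batch dimension added for broadcasting):

pos_encoding = positional_encoding(position=50, d_model=512)
print(pos_encoding.shape)  # (1, 50, 512)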

Mask

A mask hides certain values so that they have no effect when parameters are updated.

The Transformer uses two kinds of masks: the padding mask and the sequence mask (look-ahead mask). The padding mask is needed in every scaled dot-product attention, while the sequence mask is used only in the decoder's self-attention. The decoder input layer must apply both the padding mask and the sequence mask at once, and their combination is given a new name: the combined mask.

Padding mask: since input sequences have different lengths, they are aligned to a fixed length by padding shorter sequences with 0 and truncating longer ones; these padding positions should be ignored during encoding and decoding.

Sequence mask: when the decoder decodes at time step t, it must not use information from after t, so that information is hidden. Concretely, a strictly upper-triangular matrix (zeros on the diagonal) whose size equals the sequence length is generated, multiplied by a large negative number, and added to the attention logits; after the softmax, the attention at these positions is close to 0.

if mask is not None:
    # padding_mask = (batch_size, 1, 1, seq_len)
    # combined_mask = (batch_size, 1, seq_len, seq_len)
    scaled_attention_logits += (mask * -1e9)

TensorFlow implementation

def create_padding_mask(sequence):
    """
    Mask all the pad tokens in the batch of sequence.
    It ensures that the model does not treat padding as the input.
    """
    # shape=(batch_size, seq_len)
    mask = tf.cast(tf.math.equal(sequence, 0), tf.float32)
    # shape=(batch_size, 1, 1, seq_len)
    mask = mask[:, tf.newaxis, tf.newaxis, :]
    return mask


def create_look_ahead_mask(size):
    """
    Mask the future tokens in a sequence. Used during decoding so that the current output
    of the decoder does not depend on future outputs.
    """
    # upper triangular matrix, shape=(seq_len, seq_len)
    mask = 1 - tf.linalg.band_part(tf.ones((size, size)), -1, 0)
    return mask


def create_combined_mask(sequence):
    """
    Create the combined mask.
    Mainly used for the decoder input, which must account for both padding_mask and look_ahead_mask.
    """
    look_ahead_mask = create_look_ahead_mask(tf.shape(sequence)[1])
    padding_mask = create_padding_mask(sequence)
    combined_mask = tf.maximum(padding_mask, look_ahead_mask)
    return combined_mask
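A quick check of the three helpers on a toy batch (0 is assumed to be the padding id):

seq = tf.constant([[7, 6, 0, 0, 1], [1, 2, 3, 0, 0]])
print(create_padding_mask(seq))         # 1.0 at the padded (zero) positions
print(create_look_ahead_mask(3))        # [[0,1,1],[0,0,1],[0,0,0]]
print(create_combined_mask(seq).shape)  # (2, 1, 5, 5)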

Layer Normalization

For a detailed explanation of layer normalization, see the article 深度學習:層標準化詳解(Layer Normalization).
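For reference, layer normalization normalizes each position's feature vector over its feature dimension (unlike batch normalization, which normalizes across the batch); for a d-dimensional vector $\boldsymbol x$ with learnable gain $\gamma$ and bias $\beta$:
\text{LayerNorm}(\boldsymbol x)=\gamma\odot\frac{\boldsymbol x-\mu}{\sqrt{\sigma^2+\epsilon}}+\beta,\quad \mu=\frac{1}{d}\sum_{i=1}^{d}x_i,\ \sigma^2=\frac{1}{d}\sum_{i=1}^{d}(x_i-\mu)^2
This is what tf.keras.layers.LayerNormalization(epsilon=1e-6) computes over the last axis in the code above.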


Residual Connection/Skip Connection

There is a residual connection around each sub-layer (self-attention, FFN), followed by a layer-normalization step. The output of each sub-layer is:
\text{LayerNorm}(x+\text{Sublayer}(x))
The vectors and the layer-norm operation associated with self-attention look like this:

Residual connections are very important for retaining the position-related information that we add to the input representation/embedding, and for carrying it across the network.

Experimentally, if the positional encoding is concatenated to the embedding instead of added, the residual connections can be seen to be applied mainly to the concatenated positional-encoding section, propagating it through the network.

By attaching the positional embedding to the word embedding in concatenated rather than additive form, it can be shown that the main role of residual connections in the Transformer is to propagate position information.


The role of residual connections in gradient updates

In a layer with a residual connection, the partial derivative with respect to $x$ gains an extra identity term, so $x$ receives a larger gradient update and vanishing gradients are avoided.
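As a one-line justification: if a sublayer computes $y=x+F(x)$, then
\frac{\partial y}{\partial x}=I+\frac{\partial F(x)}{\partial x}
so even when $\partial F(x)/\partial x$ is small, the identity term keeps a direct gradient path back to $x$.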


Warm-Up Strategy and Post-LN Transformer

Learning-rate warm-up for the Adam optimizer:
lrate = d_\text{model}^{-0.5}\cdot\min(step\_num^{-0.5},\ step\_num\cdot warmup\_steps^{-1.5})
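A minimal sketch of this schedule as a Keras learning-rate schedule (class and variable names here are illustrative; the official TensorFlow tutorial uses an equivalent CustomSchedule):

class CustomSchedule(tf.keras.optimizers.schedules.LearningRateSchedule):
    def __init__(self, d_model, warmup_steps=4000):
        super(CustomSchedule, self).__init__()
        self.d_model = tf.cast(d_model, tf.float32)
        self.warmup_steps = warmup_steps

    def __call__(self, step):
        step = tf.cast(step, tf.float32)
        arg1 = tf.math.rsqrt(step)
        arg2 = step * (self.warmup_steps ** -1.5)
        # lrate = d_model^-0.5 * min(step^-0.5, step * warmup_steps^-1.5)
        return tf.math.rsqrt(self.d_model) * tf.math.minimum(arg1, arg2)


learning_rate = CustomSchedule(d_model=512)
optimizer = tf.keras.optimizers.Adam(learning_rate, beta_1=0.9, beta_2=0.98, epsilon=1e-9)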

With total_steps=40000 and warmup_steps=4000, the learning-rate curve looks as follows:

Pros and cons of the warm-up strategy:
  • It reduces early overfitting to the first mini-batches and keeps the distribution stable at the start of training
  • It keeps deep models stable during training
  • It increases training time

In the original Transformer, layer norm is placed after the residual connection; this arrangement is called the Post-LN Transformer. The Post-LN Transformer is sensitive to its hyperparameters and hard to tune, and the warm-up strategy is essential to limit the initial learning rate.

Why does the initial learning rate need to be limited? In the early stage of Post-LN Transformer training, the gradients near the output layer are very large; if the learning rate is not limited at this stage, training can blow up. What if layer norm is moved inside the residual branch? That arrangement is called the Pre-LN Transformer; its gradients in the initial stage are much smaller, and it can even be trained without warm-up.

For a detailed discussion, see Transformer中warm-up和LayerNorm的重要性探究 (reference 4).


Reference

1. Attention Is All You Need
2. The Illustrated Transformer–Jay Alammar
3. Transformer Architecture: Attention Is All You Need
4. Transformer中warm-up和LayerNorm的重要性探究
5. Transformer model for language understanding
