deep learning 13. Detailed Transformer code walkthrough: the decoder

Opening words:
Start from the basics, keep learning, and never give up. Keep going!
A programmer from Mars who loves life and loves technology

BERT series:

  1. BERT corpus generation
  2. BERT loss analysis
  3. BERT Transformer detailed walkthrough: encoder
  4. BERT Transformer detailed walkthrough: decoder

Without further ado, let's get straight into today's main content.

    def decode(self, targets, encoder_outputs, attention_bias):
        """
        :param targets:  [batch_size, target_length]
        :param encoder_outputs: [batch_size, input_length, hidden_size]
        :param attention_bias:  [batch_size, 1, 1, input_length]
        :return: [batch_size, target_length, vocab_size]
        """
        with tf.name_scope('decode'):
            #   [batch_size, target_length, hidden_size]
            decoder_inputs = self.embedding_layer(targets)
            with tf.name_scope('shift_targets'):
                #   pad embedding value 0 at the head of sequence and remove eos_id
                decoder_inputs = tf.pad(decoder_inputs, [[0, 0], [1, 0], [0, 0]])[:, :-1, :]
            with tf.name_scope('add_pos_embedding'):
                length = tf.shape(decoder_inputs)[1]
                position_decode = model_utils.get_position_encoding(length, self.params.get('hidden_size'))
                decoder_inputs = tf.add(decoder_inputs, position_decode)

            if self.train:
                decoder_inputs = tf.nn.dropout(decoder_inputs, 1. - self.params.get('encoder_decoder_dropout'))

            decoder_self_attention_bias = model_utils.get_decoder_self_attention_bias(length)

            outputs = self.decoder_stack(
                decoder_inputs,
                encoder_outputs,
                decoder_self_attention_bias,
                attention_bias
            )

            #   [batch_size, target_length, vocab_size]
            logits = self.embedding_layer.linear(outputs)

            return logits

The shapes of the input arguments are documented in detail in the code comments.
OK, let's walk through the code step by step.

1. embedding_layer

This is the same as in the encoder, so I won't explain it again here; if anything is unclear, please see the previous post. The returned shape is [batch_size, sequence_length, hidden_size].

2. pad

decoder_inputs = tf.pad(decoder_inputs, [[0, 0], [1, 0], [0, 0]])[:, :-1, :]

This one is fairly easy to follow. The input has shape [batch_size, sequence_length, hidden_size], i.e. rank 3. The first dimension is not padded and the last dimension is not padded; only the middle dimension is padded, with one step at the front and nothing at the back. After padding, the shape is [batch_size, sequence_length + 1, hidden_size]. The slice then drops the last position along the second dimension (which corresponds to the [EOS] marker), so the shape goes back to [batch_size, sequence_length, hidden_size].
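To make the shift concrete, here is a minimal sketch with a toy tensor (hypothetical sizes: batch_size=1, sequence_length=3, hidden_size=2) standing in for the real embeddings:

import tensorflow as tf

# Toy "embeddings": batch_size=1, sequence_length=3, hidden_size=2
x = tf.constant([[[1., 1.], [2., 2.], [3., 3.]]])

# Pad one all-zero vector at the front of the time axis, then drop the last step.
shifted = tf.pad(x, [[0, 0], [1, 0], [0, 0]])[:, :-1, :]
# shifted -> [[[0., 0.], [1., 1.], [2., 2.]]], shape stays (1, 3, 2)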

3. get_position_encoding

This step is also the same as in the previous post, so I won't go into detail. It returns a tensor of shape [sequence_length, hidden_size], which is then simply added to the embedding output (broadcast over the batch), giving a result of shape [batch_size, sequence_length, hidden_size]. A dropout layer is applied on top during training.
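get_position_encoding itself is not shown in this post; as a reference, a minimal NumPy sketch of the usual sinusoidal encoding (following the reference Transformer implementation from memory, so treat the exact constants as an assumption) looks roughly like this:

import numpy as np

def get_position_encoding(length, hidden_size, min_timescale=1.0, max_timescale=1.0e4):
    # Sinusoidal position encoding: sin over the first half of the channels,
    # cos over the second half, with geometrically spaced timescales.
    position = np.arange(length, dtype=np.float32)
    num_timescales = hidden_size // 2
    log_timescale_increment = np.log(max_timescale / min_timescale) / max(num_timescales - 1, 1)
    inv_timescales = min_timescale * np.exp(np.arange(num_timescales, dtype=np.float32) * -log_timescale_increment)
    scaled_time = position[:, np.newaxis] * inv_timescales[np.newaxis, :]
    return np.concatenate([np.sin(scaled_time), np.cos(scaled_time)], axis=1)  # [length, hidden_size]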

4. get_decoder_self_attention_bias

def get_decoder_self_attention_bias(length):
    with tf.name_scope("decoder_self_attention_bias"):
        # Keep only the lower triangular part: position i may attend to positions <= i.
        valid_locs = tf.matrix_band_part(tf.ones([length, length]), -1, 0)
        valid_locs = tf.reshape(valid_locs, [1, 1, length, length])
        # _NEG_INF is a large negative constant defined at module level
        # (-1e9 in the reference code), so invalid (future) positions get a huge negative bias.
        decoder_bias = _NEG_INF * (1.0 - valid_locs)
    return decoder_bias

This gives the lower triangular part, just like the matrix below.

[[1. 0. 0. 0. 0.]
 [1. 1. 0. 0. 0.]
 [1. 1. 1. 0. 0.]
 [1. 1. 1. 1. 0.]
 [1. 1. 1. 1. 1.]]

The final output then looks like the following: the bias is -1e9 in the upper triangular part (the future positions) and 0 everywhere else.

tf.Tensor(
[[[[-0.e+00 -1.e+09 -1.e+09 -1.e+09 -1.e+09]
   [-0.e+00 -0.e+00 -1.e+09 -1.e+09 -1.e+09]
   [-0.e+00 -0.e+00 -0.e+00 -1.e+09 -1.e+09]
   [-0.e+00 -0.e+00 -0.e+00 -0.e+00 -1.e+09]
   [-0.e+00 -0.e+00 -0.e+00 -0.e+00 -0.e+00]]]], shape=(1, 1, 5, 5), dtype=float32)
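Why these -1e9 values work: the bias is later added to the self-attention logits before the softmax, so every position ends up giving (almost exactly) zero weight to the positions after it. A tiny sketch, assuming the get_decoder_self_attention_bias shown above and TF 1.x:

logits = tf.zeros([5, 5])                          # pretend attention logits for a length-5 sequence
bias = get_decoder_self_attention_bias(5)[0, 0]    # the [5, 5] slice of the bias above
weights = tf.nn.softmax(logits + bias)
# Row i spreads its weight only over positions 0..i: row 0 is [1, 0, 0, 0, 0],
# row 4 is [0.2, 0.2, 0.2, 0.2, 0.2]; future positions get essentially zero weight.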

5. decoder_stack

OK, let's take a look at the decoder_stack.

class DecoderStack(tf.layers.Layer):
    def __init__(self, params, train):
        super(DecoderStack, self).__init__()
        self.params = params
        self.train = train
        self.layers = list()
        for _ in range(self.params.get('num_blocks')):
            self_attention_layer = SelfAttention(
                hidden_size=self.params.get('hidden_size'),
                num_heads=self.params.get('num_heads'),
                attention_dropout=self.params.get('attention_dropout'),
                train=self.train
            )

            vanilla_attention_layer = AttentionLayer(
                hidden_size=self.params.get('hidden_size'),
                num_heads=self.params.get('num_heads'),
                attention_dropout=self.params.get('attention_dropout'),
                train=self.train
            )

            ffn_layer = FFNLayer(
                hidden_size=self.params.get('hidden_size'),
                filter_size=self.params.get('filter_size'),
                relu_dropout=self.params.get('relu_dropout'),
                train=self.train,
                allow_pad=self.params.get('allow_ffn_pad')
            )

            self.layers.append(
                [
                    PrePostProcessingWrapper(self_attention_layer, self.params, self.train),
                    PrePostProcessingWrapper(vanilla_attention_layer, self.params, self.train),
                    PrePostProcessingWrapper(ffn_layer, self.params, self.train)
                ]
            )

        self.output_norm = LayerNormalization(self.params.get('hidden_size'))
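The call method of DecoderStack is not shown above. As a rough sketch of how the three wrapped sublayers are usually chained (this follows the common pattern of reference Transformer implementations, so take the exact scoping and call signatures as an assumption), each block runs masked self-attention, then encoder-decoder attention, then the feed-forward network, and the stack output goes through one final norm:

    def call(self, decoder_inputs, encoder_outputs, decoder_self_attention_bias, attention_bias):
        for n, layer in enumerate(self.layers):
            self_attention_layer, enc_dec_attention_layer, feed_forward_layer = layer
            with tf.variable_scope('layer_%d' % n):
                with tf.variable_scope('self_attention'):
                    # masked self-attention over the (shifted) decoder inputs
                    decoder_inputs = self_attention_layer(decoder_inputs, decoder_self_attention_bias)
                with tf.variable_scope('encdec_attention'):
                    # attention from decoder positions to the encoder outputs
                    decoder_inputs = enc_dec_attention_layer(decoder_inputs, encoder_outputs, attention_bias)
                with tf.variable_scope('ffn'):
                    decoder_inputs = feed_forward_layer(decoder_inputs)
        # final layer normalization over the stack output
        return self.output_norm(decoder_inputs)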

5.1 self_attention

This part is exactly the same as the self_attention in the encoder: Q, K, V = decoder_inputs, and the computation itself is identical. The only difference is that the bias here is the one produced by get_decoder_self_attention_bias.

5.2 vanilla_attention

What makes this attention different is that it attends from the decoder to the encoder. This is arguably the most important attention layer, since it is what aligns the decoder with the encoder. The shape flow is listed step by step below, with a small sketch in code after the list.

Q = decoder_inputs
K, V = encoder_outputs

  1. Q has shape [B, T_d, D], while K and V have shape [B, T_e, D].
  2. Q, K, V each go through split_heads. Q then has shape [B, H, T_d, D//H], and K, V have shape [B, H, T_e, D//H], where H is num_heads.
  3. Q = scale(Q)
  4. logits = tf.matmul(Q, K, transpose_b=True), which returns shape [B, H, T_d, T_e].
  5. logits = tf.add(logits, bias); this bias is the attention_bias computed in the very first step of the first section, with shape [B, 1, 1, T_e]. The result still has shape [B, H, T_d, T_e].
  6. weights = tf.nn.softmax(logits)
  7. dropout(weights)
  8. attention_output = tf.matmul(weights, V); weights has shape [B, H, T_d, T_e] and V has shape [B, H, T_e, D//H], so the result is [B, H, T_d, D//H].
  9. out = combine_heads(attention_output), returning shape [B, T_d, D].
  10. dense(out, D), returning shape [B, T_d, D].
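A minimal NumPy sketch of this shape flow (toy sizes B=2, H=4, T_d=5, T_e=7, D=16; the scaling and dropout steps are omitted and the names are only for illustration):

import numpy as np

B, H, Td, Te, D = 2, 4, 5, 7, 16
q = np.random.randn(B, H, Td, D // H)            # queries from the decoder, after split_heads
k = np.random.randn(B, H, Te, D // H)            # keys from the encoder outputs
v = np.random.randn(B, H, Te, D // H)            # values from the encoder outputs
bias = np.zeros((B, 1, 1, Te))                   # padding bias: 0 for real tokens, -1e9 for pads

logits = q @ k.transpose(0, 1, 3, 2) + bias      # [B, H, Td, Te]
weights = np.exp(logits) / np.exp(logits).sum(axis=-1, keepdims=True)   # softmax over Te
out = weights @ v                                # [B, H, Td, D//H]
out = out.transpose(0, 2, 1, 3).reshape(B, Td, D)   # combine_heads -> [B, Td, D]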

5.3 feed_forward

Same as in the encoder.

5.4 norm

Same as in the encoder.

5.5 linear

    def linear(self, inputs):
        """
        :param inputs:  a tensor with shape [batch_size, length, hidden_size]
        :return: float32 tensor with shape [batch_size, length, vocab_size]
        """

        with tf.name_scope('pre_softmax_linear'):
            batch_size = tf.shape(inputs)[0]
            length = tf.shape(inputs)[1]

            inputs = tf.reshape(inputs, [-1, self.hidden_size])
            """
                inputs              [batch_size, length, hidden_size]
                shared_weights      [vocab_size, hidden_size]
                transpose           [hidden_size, vocab_size]
                logits              [batch_size, length, vocab_size]
            """
            logits = tf.matmul(inputs, self.shared_weights, transpose_b=True)

            return tf.reshape(logits, [batch_size, length, self.vocab_size])

Not much more to say here. The thing worth noting is shared_weights: it is the same matrix that was initialized for the embedding lookup (weight tying). The final output gives, for each position, the logits over the vocabulary, i.e. a probability distribution over the vocab once a softmax is applied.
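To make the weight sharing concrete, here is a toy NumPy sketch (hypothetical sizes: vocab_size=6, hidden_size=4):

import numpy as np

vocab_size, hidden_size, batch_size, length = 6, 4, 2, 3
shared_weights = np.random.randn(vocab_size, hidden_size)     # same matrix used for the embedding lookup

x = np.random.randn(batch_size, length, hidden_size)          # decoder_stack output
logits = x.reshape(-1, hidden_size) @ shared_weights.T        # [batch_size * length, vocab_size]
logits = logits.reshape(batch_size, length, vocab_size)       # per-position scores over the vocab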

Finally:
The one piece I haven't mentioned so far is PrePostProcessingWrapper.

class PrePostProcessingWrapper(object):
    """Wrapper class that applies layer pre-processing and post-processing."""

    def __init__(self, layer, params, train):
        self.layer = layer
        self.postprocess_dropout = params["layer_postprocess_dropout"]
        self.train = train

        # Create normalization layer
        self.layer_norm = LayerNormalization(params["hidden_size"])

    def __call__(self, x, *args, **kwargs):
        # Preprocessing: apply layer normalization
        y = self.layer_norm(x)

        # Get layer output
        y = self.layer(y, *args, **kwargs)

        # Postprocessing: apply dropout and residual connection
        if self.train:
            y = tf.nn.dropout(y, 1 - self.postprocess_dropout)
        return x + y

What this wrapper does:

  1. It first applies a norm to the input, the same norm described earlier.
  2. It then runs the wrapped layer to get its output.
  3. A dropout is applied to that output (only during training).
  4. Finally, the original input is added back, i.e. a residual connection.

This wrapping is applied to the input and output of every single sublayer.

Thanks.

For more code, please head over to my personal GitHub, which I update from time to time.
Feel free to follow.
