deep learning 13. Detailed Transformer code walkthrough: the decoder

Opening words:
Start from the basics, keep learning, and never give up. Keep going!
A programmer from Mars who loves life and loves technology

BERT series:

  1. BERT corpus generation
  2. BERT loss analysis
  3. BERT Transformer detailed walkthrough: encoder
  4. BERT Transformer detailed walkthrough: decoder

Without further ado, let's get straight into today's main content.

    def decode(self, targets, encoder_outputs, attention_bias):
        """
        :param targets:  [batch_size, target_length]
        :param encoder_outputs: [batch_size, input_length, hidden_size]
        :param attention_bias:  [batch_size, 1, 1, input_length]
        :return: [batch_size, target_length, vocab_size]
        """
        with tf.name_scope('decode'):
            #   [batch_size, target_length, hidden_size]
            decoder_inputs = self.embedding_layer(targets)
            with tf.name_scope('shift_targets'):
                #   pad embedding value 0 at the head of sequence and remove eos_id
                decoder_inputs = tf.pad(decoder_inputs, [[0, 0], [1, 0], [0, 0]])[:, :-1, :]
            with tf.name_scope('add_pos_embedding'):
                length = tf.shape(decoder_inputs)[1]
                position_decode = model_utils.get_position_encoding(length, self.params.get('hidden_size'))
                decoder_inputs = tf.add(decoder_inputs, position_decode)

            if self.train:
                decoder_inputs = tf.nn.dropout(decoder_inputs, 1. - self.params.get('encoder_decoder_dropout'))

            decoder_self_attention_bias = model_utils.get_decoder_self_attention_bias(length)

            outputs = self.decoder_stack(
                decoder_inputs,
                encoder_outputs,
                decoder_self_attention_bias,
                attention_bias
            )

            #   [batch_size, target_length, vocab_size]
            logits = self.embedding_layer.linear(outputs)

            return logits

The shapes of the input arguments are documented in detail in the code comments.
OK, let's walk through the code step by step.

1. embedding_layer

This is the same as in the encoder, so I won't explain it again here; if anything is unclear, please see the previous post. The returned shape is [batch_size, sequence_length, hidden_size].

2. pad

decoder_inputs = tf.pad(decoder_inputs, [[0, 0], [1, 0], [0, 0]])[:, :-1, :]

This one is fairly easy to follow. The input has shape [batch_size, sequence_length, hidden_size], i.e. rank 3. The first dimension is not padded and the last dimension is not padded; only the middle dimension is padded, with one step at the front and nothing at the back. After padding, the shape is [batch_size, sequence_length + 1, hidden_size]. The slice then drops the last position along the second dimension (which corresponds to the [EOS] marker), so the shape goes back to [batch_size, sequence_length, hidden_size].
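To make the shift concrete, here is a minimal sketch with a toy tensor (hypothetical sizes: batch_size=1, sequence_length=3, hidden_size=2) standing in for the real embeddings:

import tensorflow as tf

# Toy "embeddings": batch_size=1, sequence_length=3, hidden_size=2
x = tf.constant([[[1., 1.], [2., 2.], [3., 3.]]])

# Pad one all-zero vector at the front of the time axis, then drop the last step.
shifted = tf.pad(x, [[0, 0], [1, 0], [0, 0]])[:, :-1, :]
# shifted -> [[[0., 0.], [1., 1.], [2., 2.]]], shape stays (1, 3, 2)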

3. get_position_encoding

This step is also the same as in the previous post, so I won't go into detail. It returns a tensor of shape [sequence_length, hidden_size], which is then simply added to the embedding output (broadcast over the batch), giving a result of shape [batch_size, sequence_length, hidden_size]. A dropout layer is applied on top during training.
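get_position_encoding itself is not shown in this post; as a reference, a minimal NumPy sketch of the usual sinusoidal encoding (following the reference Transformer implementation from memory, so treat the exact constants as an assumption) looks roughly like this:

import numpy as np

def get_position_encoding(length, hidden_size, min_timescale=1.0, max_timescale=1.0e4):
    # Sinusoidal position encoding: sin over the first half of the channels,
    # cos over the second half, with geometrically spaced timescales.
    position = np.arange(length, dtype=np.float32)
    num_timescales = hidden_size // 2
    log_timescale_increment = np.log(max_timescale / min_timescale) / max(num_timescales - 1, 1)
    inv_timescales = min_timescale * np.exp(np.arange(num_timescales, dtype=np.float32) * -log_timescale_increment)
    scaled_time = position[:, np.newaxis] * inv_timescales[np.newaxis, :]
    return np.concatenate([np.sin(scaled_time), np.cos(scaled_time)], axis=1)  # [length, hidden_size]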

4. get_decoder_self_attention_bias

def get_decoder_self_attention_bias(length):
    with tf.name_scope("decoder_self_attention_bias"):
        # Keep only the lower triangular part: position i may attend to positions <= i.
        valid_locs = tf.matrix_band_part(tf.ones([length, length]), -1, 0)
        valid_locs = tf.reshape(valid_locs, [1, 1, length, length])
        # _NEG_INF is a large negative constant defined at module level
        # (-1e9 in the reference code), so invalid (future) positions get a huge negative bias.
        decoder_bias = _NEG_INF * (1.0 - valid_locs)
    return decoder_bias

This gives the lower triangular part, just like the matrix below.

[[1. 0. 0. 0. 0.]
 [1. 1. 0. 0. 0.]
 [1. 1. 1. 0. 0.]
 [1. 1. 1. 1. 0.]
 [1. 1. 1. 1. 1.]]

The final output then looks like the following: the bias is -1e9 in the upper triangular part (the future positions) and 0 everywhere else.

tf.Tensor(
[[[[-0.e+00 -1.e+09 -1.e+09 -1.e+09 -1.e+09]
   [-0.e+00 -0.e+00 -1.e+09 -1.e+09 -1.e+09]
   [-0.e+00 -0.e+00 -0.e+00 -1.e+09 -1.e+09]
   [-0.e+00 -0.e+00 -0.e+00 -0.e+00 -1.e+09]
   [-0.e+00 -0.e+00 -0.e+00 -0.e+00 -0.e+00]]]], shape=(1, 1, 5, 5), dtype=float32)
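Why these -1e9 values work: the bias is later added to the self-attention logits before the softmax, so every position ends up giving (almost exactly) zero weight to the positions after it. A tiny sketch, assuming the get_decoder_self_attention_bias shown above and TF 1.x:

logits = tf.zeros([5, 5])                          # pretend attention logits for a length-5 sequence
bias = get_decoder_self_attention_bias(5)[0, 0]    # the [5, 5] slice of the bias above
weights = tf.nn.softmax(logits + bias)
# Row i spreads its weight only over positions 0..i: row 0 is [1, 0, 0, 0, 0],
# row 4 is [0.2, 0.2, 0.2, 0.2, 0.2]; future positions get essentially zero weight.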

5. decoder_stack

OK, let's take a look at the decoder_stack.

class DecoderStack(tf.layers.Layer):
    def __init__(self, params, train):
        super(DecoderStack, self).__init__()
        self.params = params
        self.train = train
        self.layers = list()
        for _ in range(self.params.get('num_blocks')):
            self_attention_layer = SelfAttention(
                hidden_size=self.params.get('hidden_size'),
                num_heads=self.params.get('num_heads'),
                attention_dropout=self.params.get('attention_dropout'),
                train=self.train
            )

            vanilla_attention_layer = AttentionLayer(
                hidden_size=self.params.get('hidden_size'),
                num_heads=self.params.get('num_heads'),
                attention_dropout=self.params.get('attention_dropout'),
                train=self.train
            )

            ffn_layer = FFNLayer(
                hidden_size=self.params.get('hidden_size'),
                filter_size=self.params.get('filter_size'),
                relu_dropout=self.params.get('relu_dropout'),
                train=self.train,
                allow_pad=self.params.get('allow_ffn_pad')
            )

            self.layers.append(
                [
                    PrePostProcessingWrapper(self_attention_layer, self.params, self.train),
                    PrePostProcessingWrapper(vanilla_attention_layer, self.params, self.train),
                    PrePostProcessingWrapper(ffn_layer, self.params, self.train)
                ]
            )

        self.output_norm = LayerNormalization(self.params.get('hidden_size'))
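The call method of DecoderStack is not shown above. As a rough sketch of how the three wrapped sublayers are usually chained (this follows the common pattern of reference Transformer implementations, so take the exact scoping and call signatures as an assumption), each block runs masked self-attention, then encoder-decoder attention, then the feed-forward network, and the stack output goes through one final norm:

    def call(self, decoder_inputs, encoder_outputs, decoder_self_attention_bias, attention_bias):
        for n, layer in enumerate(self.layers):
            self_attention_layer, enc_dec_attention_layer, feed_forward_layer = layer
            with tf.variable_scope('layer_%d' % n):
                with tf.variable_scope('self_attention'):
                    # masked self-attention over the (shifted) decoder inputs
                    decoder_inputs = self_attention_layer(decoder_inputs, decoder_self_attention_bias)
                with tf.variable_scope('encdec_attention'):
                    # attention from decoder positions to the encoder outputs
                    decoder_inputs = enc_dec_attention_layer(decoder_inputs, encoder_outputs, attention_bias)
                with tf.variable_scope('ffn'):
                    decoder_inputs = feed_forward_layer(decoder_inputs)
        # final layer normalization over the stack output
        return self.output_norm(decoder_inputs)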

5.1 self_attention

This part is exactly the same as the self_attention in the encoder: Q, K, V = decoder_inputs, and the computation itself is identical. The only difference is that the bias here is the one produced by get_decoder_self_attention_bias.

5.2 vanilla_attention

What makes this attention different is that it attends from the decoder to the encoder. This is arguably the most important attention layer, since it is what aligns the decoder with the encoder. The shape flow is listed step by step below, with a small sketch in code after the list.

Q = decoder_inputs
K, V = encoder_outputs

  1. Q has shape [B, T_d, D], while K and V have shape [B, T_e, D].
  2. Q, K, V each go through split_heads. Q then has shape [B, H, T_d, D//H], and K, V have shape [B, H, T_e, D//H], where H is num_heads.
  3. Q = scale(Q)
  4. logits = tf.matmul(Q, K, transpose_b=True), which returns shape [B, H, T_d, T_e].
  5. logits = tf.add(logits, bias); this bias is the attention_bias computed in the very first step of the first section, with shape [B, 1, 1, T_e]. The result still has shape [B, H, T_d, T_e].
  6. weights = tf.nn.softmax(logits)
  7. dropout(weights)
  8. attention_output = tf.matmul(weights, V); weights has shape [B, H, T_d, T_e] and V has shape [B, H, T_e, D//H], so the result is [B, H, T_d, D//H].
  9. out = combine_heads(attention_output), returning shape [B, T_d, D].
  10. dense(out, D), returning shape [B, T_d, D].
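A minimal NumPy sketch of this shape flow (toy sizes B=2, H=4, T_d=5, T_e=7, D=16; the scaling and dropout steps are omitted and the names are only for illustration):

import numpy as np

B, H, Td, Te, D = 2, 4, 5, 7, 16
q = np.random.randn(B, H, Td, D // H)            # queries from the decoder, after split_heads
k = np.random.randn(B, H, Te, D // H)            # keys from the encoder outputs
v = np.random.randn(B, H, Te, D // H)            # values from the encoder outputs
bias = np.zeros((B, 1, 1, Te))                   # padding bias: 0 for real tokens, -1e9 for pads

logits = q @ k.transpose(0, 1, 3, 2) + bias      # [B, H, Td, Te]
weights = np.exp(logits) / np.exp(logits).sum(axis=-1, keepdims=True)   # softmax over Te
out = weights @ v                                # [B, H, Td, D//H]
out = out.transpose(0, 2, 1, 3).reshape(B, Td, D)   # combine_heads -> [B, Td, D]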

5.3 feed_forward

Same as in the encoder.

5.4 norm

Same as in the encoder.

5.5 linear

    def linear(self, inputs):
        """
        :param inputs:  a tensor with shape [batch_size, length, hidden_size]
        :return: float32 tensor with shape [batch_size, length, vocab_size]
        """

        with tf.name_scope('pre_softmax_linear'):
            batch_size = tf.shape(inputs)[0]
            length = tf.shape(inputs)[1]

            inputs = tf.reshape(inputs, [-1, self.hidden_size])
            """
                inputs              [batch_size, length, hidden_size]
                shared_weights      [vocab_size, hidden_size]
                transpose           [hidden_size, vocab_size]
                logits              [batch_size, length, vocab_size]
            """
            logits = tf.matmul(inputs, self.shared_weights, transpose_b=True)

            return tf.reshape(logits, [batch_size, length, self.vocab_size])

Not much more to say here. The thing worth noting is shared_weights: it is the same matrix that was initialized for the embedding lookup (weight tying). The final output gives, for each position, the logits over the vocabulary, i.e. a probability distribution over the vocab once a softmax is applied.
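To make the weight sharing concrete, here is a toy NumPy sketch (hypothetical sizes: vocab_size=6, hidden_size=4):

import numpy as np

vocab_size, hidden_size, batch_size, length = 6, 4, 2, 3
shared_weights = np.random.randn(vocab_size, hidden_size)     # same matrix used for the embedding lookup

x = np.random.randn(batch_size, length, hidden_size)          # decoder_stack output
logits = x.reshape(-1, hidden_size) @ shared_weights.T        # [batch_size * length, vocab_size]
logits = logits.reshape(batch_size, length, vocab_size)       # per-position scores over the vocab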

Finally:
The one piece I haven't mentioned so far is PrePostProcessingWrapper.

class PrePostProcessingWrapper(object):
    """Wrapper class that applies layer pre-processing and post-processing."""

    def __init__(self, layer, params, train):
        self.layer = layer
        self.postprocess_dropout = params["layer_postprocess_dropout"]
        self.train = train

        # Create normalization layer
        self.layer_norm = LayerNormalization(params["hidden_size"])

    def __call__(self, x, *args, **kwargs):
        # Preprocessing: apply layer normalization
        y = self.layer_norm(x)

        # Get layer output
        y = self.layer(y, *args, **kwargs)

        # Postprocessing: apply dropout and residual connection
        if self.train:
            y = tf.nn.dropout(y, 1 - self.postprocess_dropout)
        return x + y

What this wrapper does:

  1. It first applies a norm to the input, the same norm described earlier.
  2. It then runs the wrapped layer to get its output.
  3. A dropout is applied to that output (only during training).
  4. Finally, the original input is added back, i.e. a residual connection.

This wrapping is applied to the input and output of every single sublayer.

Thanks.

For more code, please head over to my personal GitHub, which I update from time to time.
Feel free to follow.
