deep learning 12. Detailed Transformer code walkthrough: encoder

Opening words:
Start from the basics, keep learning, keep persevering, and keep going.
-- a programmer from Mars who loves life and technology

BERT series:

  1. BERT corpus generation
  2. BERT loss analysis
  3. BERT Transformer detailed walkthrough: encoder
  4. BERT Transformer detailed walkthrough: decoder

Working from the Transformer code on my GitHub (link at the end of this post), let's analyze the code and the logic in detail.

    def __call__(self, feature, targets=None):
        initializer = tf.variance_scaling_initializer(
            scale=self.params.get('initializer_gain'),
            mode='fan_avg',
            distribution='uniform'
        )

        with tf.variable_scope('transformer', initializer=initializer):
            #   [batch_size, 1, 1, length]
            attention_bias = model_utils.get_padding_bias(feature)

            encoder_outputs = self.encode(feature, attention_bias)

            if targets is None:
                return self.predict(encoder_outputs, attention_bias)

            logits = self.decode(targets, encoder_outputs, attention_bias)
            return logits

This is the main __call__() entry point. (The in-code comments were removed to keep the listing short.)

def get_padding_bias(x):
    with tf.name_scope("attention_bias"):
        padding = get_padding(x)
        attention_bias = padding * _NEG_INF
        attention_bias = tf.expand_dims(
            tf.expand_dims(attention_bias, axis=1), axis=1)
    return attention_bias

The get_padding_bias() method above is what produces the attention_bias; the get_padding() helper it calls is shown below.

def get_padding(x, padding_value=0):
    with tf.name_scope("padding"):
        return tf.to_float(tf.equal(x, padding_value))

x has shape [batch_size, sequence_length] and is already padded data. After this method we know which positions are padding and which are non-padding; the returned tensor has shape [batch_size, sequence_length], where 0 -> non-padding and 1 -> padding.
_NEG_INF = -1e9. The "padding bias" is simply a bias added at the padding positions, and this is the value assigned to them.
The final returned shape is [batch_size, 1, 1, sequence_length].
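
To make the effect of _NEG_INF concrete, here is a tiny self-contained example (toy values, not from the repo) showing how adding -1e9 at the padded positions drives their softmax weights to roughly zero:

import tensorflow as tf

_NEG_INF = -1e9

# toy batch: one sequence of length 4 where the last two tokens are padding (id 0)
x = tf.constant([[7, 3, 0, 0]])                      # [batch_size, length]
padding = tf.to_float(tf.equal(x, 0))                # [[0., 0., 1., 1.]]
attention_bias = padding * _NEG_INF                  # [[0., 0., -1e9, -1e9]]

logits = tf.constant([[2.0, 1.0, 3.0, 4.0]])         # fake attention logits
weights = tf.nn.softmax(logits + attention_bias)     # padded positions get ~0 weight

with tf.Session() as sess:
    print(sess.run(weights))   # approx [[0.7311, 0.2689, 0.0, 0.0]]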

Step 1: encoder

    def encode(self, inputs, attention_bias):
        with tf.name_scope('encode'):
            #   [batch_size, length, hidden_size]
            embedded_inputs = self.embedding_layer(inputs)
            #   [batch_size, length]
            inputs_padding = model_utils.get_padding(inputs)
            with tf.name_scope('add_pos_embedding'):
                length = tf.shape(embedded_inputs)[1]
                #   use sin cos calculate position embeddings
                pos_encoding = model_utils.get_position_encoding(length, self.params.get('hidden_size'))
                encoder_inputs = tf.add(embedded_inputs, pos_encoding)
            if self.train:
                encoder_inputs = tf.nn.dropout(encoder_inputs, 1 - self.params.get('encoder_decoder_dropout'))
            return self.encoder_stack(encoder_inputs, attention_bias, inputs_padding)

OK, let's go through it step by step!

1.1 embedding_layer

The core implementation is as follows:

    def call(self, inputs, **kwargs):
        with tf.name_scope('embedding'):
            mask = tf.to_float(tf.not_equal(inputs, 0))
            embeddings = tf.gather(self.shared_weights, inputs)
            embeddings *= tf.expand_dims(mask, -1)
            embeddings *= self.hidden_size ** 0.5
            return embeddings

This code is fairly easy to follow. The mask zeroes out the padding positions, and the embeddings are then scaled. The returned shape is [batch_size, sequence_length, hidden_size].
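
If you want to poke at this outside the model, here is a tiny standalone example of the same gather + mask + scale steps (the vocabulary size and hidden_size are made up for illustration):

import tensorflow as tf

hidden_size = 4
shared_weights = tf.random_normal([5, hidden_size])          # [vocab_size, hidden_size]
inputs = tf.constant([[2, 4, 0, 0]])                         # [batch_size, length], 0 = padding

mask = tf.to_float(tf.not_equal(inputs, 0))                  # [[1., 1., 0., 0.]]
embeddings = tf.gather(shared_weights, inputs)               # [1, 4, hidden_size]
embeddings *= tf.expand_dims(mask, -1)                       # zero out the padded rows
embeddings *= hidden_size ** 0.5                             # scale by sqrt(hidden_size)

with tf.Session() as sess:
    print(sess.run(tf.shape(embeddings)))                    # [1 4 4]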

1.2 get_padding

def get_padding(x, padding_value=0):
    with tf.name_scope("padding"):
        return tf.to_float(tf.equal(x, padding_value))

Same logic as the attention_bias above. It returns a tensor of shape [batch_size, sequence_length], where 0 -> non-padding and 1 -> padding.

1.3 get_position_encoding

def get_position_encoding( length, hidden_size, min_timescale=1.0, max_timescale=1.0e4):
    position = tf.to_float(tf.range(length))
    num_timescales = hidden_size // 2
    log_timescale_increment = (
            math.log(float(max_timescale) / float(min_timescale)) / (tf.to_float(num_timescales) - 1))
    inv_timescales = min_timescale * tf.exp(tf.to_float(tf.range(num_timescales)) * -log_timescale_increment)
    scaled_time = tf.expand_dims(position, 1) * tf.expand_dims(inv_timescales, 0)

    signal = tf.concat([tf.sin(scaled_time), tf.cos(scaled_time)], axis=1)
    return signal

This computes sin and cos values as the positional encoding over sequence_length, returning a tensor of shape [sequence_length, hidden_size]. It is then simply added to the embedding output, giving a result of shape [batch_size, sequence_length, hidden_size]. A dropout layer is applied on top, and the result is fed into the encoder_stack.
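
For anyone who wants to sanity-check the shapes and values outside the graph, here is a NumPy re-derivation of the same sinusoidal encoding (mirroring the defaults above):

import math
import numpy as np

def position_encoding_np(length, hidden_size, min_timescale=1.0, max_timescale=1.0e4):
    position = np.arange(length, dtype=np.float32)                       # [length]
    num_timescales = hidden_size // 2
    log_increment = math.log(max_timescale / min_timescale) / (num_timescales - 1)
    inv_timescales = min_timescale * np.exp(np.arange(num_timescales) * -log_increment)
    scaled_time = position[:, None] * inv_timescales[None, :]            # [length, hidden/2]
    return np.concatenate([np.sin(scaled_time), np.cos(scaled_time)], axis=1)

print(position_encoding_np(6, 8).shape)   # (6, 8): [sequence_length, hidden_size]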

1.4 encoder_stack

class EncoderStack(tf.layers.Layer):
    def __init__(self, params, train):
        super(EncoderStack, self).__init__()
        self.params = params
        self.train = train
        self.layers = list()
        for _ in range(self.params.get('num_blocks')):
            self_attention_layer = SelfAttention(
                hidden_size=self.params.get('hidden_size'),
                num_heads=self.params.get('num_heads'),
                attention_dropout=self.params.get('attention_dropout'),
                train=self.train
            )
            ffn_layer = FFNLayer(
                hidden_size=self.params.get('hidden_size'),
                filter_size=self.params.get('filter_size'),
                relu_dropout=self.params.get('relu_dropout'),
                train=self.train,
                allow_pad=self.params.get('allow_ffn_pad')
            )
            self.layers.append(
                [
                    PrePostProcessingWrapper(self_attention_layer, self.params, self.train),
                    PrePostProcessingWrapper(ffn_layer, self.params, self.train)
                ]
            )
        self.output_norm = LayerNormalization(self.params.get('hidden_size'))

The structure is very simple: self_attention layers + feed_forward layers + a final norm layer, with each sub-layer wrapped in a PrePostProcessingWrapper.
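
The PrePostProcessingWrapper itself isn't listed in this post. In the official TensorFlow Transformer it applies layer normalization before the wrapped layer, then dropout and a residual connection after it; a minimal sketch in that spirit is below (the 'layer_postprocess_dropout' key is my assumption here, check the repo for the exact name):

class PrePostProcessingWrapper(object):
    """Sketch: pre-norm, then wrapped layer, then dropout + residual connection."""

    def __init__(self, layer, params, train):
        self.layer = layer
        self.postprocess_dropout = params.get('layer_postprocess_dropout')  # assumed key
        self.train = train
        self.layer_norm = LayerNormalization(params.get('hidden_size'))

    def __call__(self, x, *args, **kwargs):
        y = self.layer_norm(x)               # pre-processing: layer norm
        y = self.layer(y, *args, **kwargs)   # wrapped self-attention / ffn layer
        if self.train:                       # post-processing: dropout + residual
            y = tf.nn.dropout(y, 1 - self.postprocess_dropout)
        return x + y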

    def call(self, encoder_inputs, attention_bias, inputs_padding):
        """
        :param encoder_inputs: [batch_size, input_length, hidden_size]
        :param attention_bias: [batch_size, 1, 1, inputs_length]
        :param inputs_padding: [batch_size, length]
        :return: [batch_size, input_length, hidden_size]
        """
        for n, layer in enumerate(self.layers):
            self_attention_layer = layer[0]
            ffn_layer = layer[1]
            with tf.variable_scope('encoder_stack_lay_{}'.format(n)):
                with tf.variable_scope('self_attention'):
                    encoder_inputs = self_attention_layer(encoder_inputs, attention_bias)
                with tf.variable_scope('ffn'):
                    encoder_inputs = ffn_layer(encoder_inputs, inputs_padding)
        return self.output_norm(encoder_inputs)

The input arguments to call() are documented in the docstring above.

1.4.1 self_attention

A detailed, step-by-step commented explanation of the self_attention layer is in the GitHub repo, so I won't repeat every line here. The main flow is:

  1. Q, K, V = encoder_inputs, each of shape [B, T, D] (using this shorthand to keep the notation simple).
  2. Q, K, V each go through a split_heads operation, giving shape [B, H, T, D//H], where H is num_heads.
  3. Q = scale(Q).
  4. logits = tf.matmul(Q, K, transpose_b=True), returning shape [B, H, T, T].
  5. logits = tf.add(logits, bias), where bias is the attention_bias computed in the first step.
  6. weights = tf.nn.softmax(logits).
  7. dropout(weights).
  8. attention_output = tf.matmul(weights, V), returning shape [B, H, T, D//H].
  9. out = combine(heads), returning shape [B, T, D].
  10. dense(out, D), returning shape [B, T, D].

That completes one full pass through self-attention; a condensed code sketch of these steps follows.
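
The sketch below is self-contained and TF1-style; the dense-layer names and the exact dropout handling are my own choices for illustration, not necessarily the repo's exact code:

import tensorflow as tf

def self_attention_sketch(encoder_inputs, attention_bias, hidden_size=512,
                          num_heads=8, attention_dropout=0.1, train=True):
    depth = hidden_size // num_heads

    def split_heads(x):
        # [B, T, D] -> [B, H, T, D//H]
        b, t = tf.shape(x)[0], tf.shape(x)[1]
        x = tf.reshape(x, [b, t, num_heads, depth])
        return tf.transpose(x, [0, 2, 1, 3])

    def combine_heads(x):
        # [B, H, T, D//H] -> [B, T, D]
        b, t = tf.shape(x)[0], tf.shape(x)[2]
        x = tf.transpose(x, [0, 2, 1, 3])
        return tf.reshape(x, [b, t, hidden_size])

    # steps 1-2: project the inputs and split into heads
    q = split_heads(tf.layers.dense(encoder_inputs, hidden_size, use_bias=False, name='q'))
    k = split_heads(tf.layers.dense(encoder_inputs, hidden_size, use_bias=False, name='k'))
    v = split_heads(tf.layers.dense(encoder_inputs, hidden_size, use_bias=False, name='v'))

    # steps 3-5: scaled dot-product logits plus the padding bias
    q *= depth ** -0.5
    logits = tf.matmul(q, k, transpose_b=True) + attention_bias   # [B, H, T, T]

    # steps 6-8: softmax weights (with dropout during training), then weight the values
    weights = tf.nn.softmax(logits)
    if train:
        weights = tf.nn.dropout(weights, 1.0 - attention_dropout)
    attention_output = tf.matmul(weights, v)                      # [B, H, T, D//H]

    # steps 9-10: merge the heads and apply the output projection
    return tf.layers.dense(combine_heads(attention_output), hidden_size,
                           use_bias=False, name='output_transform')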

1.4.2 feed_forward

    def call(self, inputs, padding=None):
        padding = padding if self.allow_pad else None
        batch_size = tf.shape(inputs)[0]
        length = tf.shape(inputs)[1]
        if padding is not None:
            with tf.name_scope('remove_padding'):
                pad_mask = tf.reshape(padding, [-1])         
                non_pad_ids = tf.to_int32(tf.where(pad_mask < 1e-9))
                inputs = tf.reshape(inputs, [-1, self.hidden_size])             
                inputs = tf.gather_nd(params=inputs, indices=non_pad_ids)
                inputs.set_shape([None, self.hidden_size])
                inputs = tf.expand_dims(inputs, axis=0)
        outputs = self.filter_layer(inputs)
        if self.train:
            outputs = tf.nn.dropout(outputs, 1.0 - self.relu_dropout)
        outputs = self.output_layer(outputs)
        if padding is not None:
            with tf.name_scope('re_add_padding'):
                outputs = tf.squeeze(outputs, axis=0)
                outputs = tf.scatter_nd(
                    indices=non_pad_ids,
                    updates=outputs,
                    shape=[batch_size * length, self.hidden_size]
                )
                outputs = tf.reshape(outputs, [batch_size, length, self.hidden_size])
        return outputs

The padding here is exactly the result computed by get_padding() in section 1.2 above. Padded rows are removed before the two dense layers and scattered back in afterwards, so no computation is wasted on padding.
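
To see what the remove-padding / re-add-padding trick actually does, here is a toy round trip through gather_nd and scatter_nd (the shapes and values are made up):

import tensorflow as tf

hidden_size = 2
inputs = tf.constant([[[1., 1.], [2., 2.], [0., 0.]]])     # [batch=1, length=3, hidden=2]
padding = tf.constant([[0., 0., 1.]])                      # last position is padding

batch_size, length = tf.shape(inputs)[0], tf.shape(inputs)[1]

pad_mask = tf.reshape(padding, [-1])
non_pad_ids = tf.to_int32(tf.where(pad_mask < 1e-9))       # indices of the real tokens
flat = tf.reshape(inputs, [-1, hidden_size])
gathered = tf.gather_nd(flat, non_pad_ids)                 # only the two real rows survive

# ... the filter/output dense layers would run on `gathered` here ...

restored = tf.scatter_nd(non_pad_ids, gathered, shape=[batch_size * length, hidden_size])
restored = tf.reshape(restored, [batch_size, length, hidden_size])

with tf.Session() as sess:
    print(sess.run(restored))   # the padded row comes back as zeros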

1.4.3 norm

    def call(self, x, epsilon=1e-6):
        mean = tf.reduce_mean(x, axis=[-1], keepdims=True)
        variance = tf.reduce_mean(tf.square(x - mean), axis=[-1], keepdims=True)
        norm_x = (x - mean) * tf.rsqrt(variance + epsilon)
        return norm_x * self.scale + self.bias

This one is straightforward: compute the mean and variance, then normalize.
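
A tiny numeric check of the formula (with scale = 1 and bias = 0 for simplicity):

import tensorflow as tf

x = tf.constant([[1.0, 2.0, 3.0, 4.0]])
mean = tf.reduce_mean(x, axis=[-1], keepdims=True)                        # 2.5
variance = tf.reduce_mean(tf.square(x - mean), axis=[-1], keepdims=True)  # 1.25
norm_x = (x - mean) * tf.rsqrt(variance + 1e-6)

with tf.Session() as sess:
    print(sess.run(norm_x))   # approx [[-1.342, -0.447, 0.447, 1.342]]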

And that's the whole encoder pipeline. It's fairly simple overall; you have to admire how strong the folks at Google are.

The next post will cover the decoder.

Thanks.

For more code, head over to my personal GitHub, which is updated from time to time.
Feel free to follow.
