BERT code walkthrough, part 2: the Transformer model

github：https://github.com/google-research/bert

Walkthrough and translation: https://www.jiqizhixin.com/articles/2018-11-01-9

https://baijiahao.baidu.com/s?id=1616001262563114372&wfr=spider&for=pc

https://zhuanlan.zhihu.com/p/34781297 (a walkthrough of "Attention Is All You Need")

https://blog.csdn.net/weixin_39470744

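Key points from the previous post:

Task 1: Masked LM

80% of the time: replace the selected word with the [MASK] token. For example: my dog is hairy → my dog is [MASK].
10% of the time: replace the selected word with a random word. For example: my dog is hairy → my dog is apple.
10% of the time: keep the word unchanged. For example: my dog is hairy → my dog is hairy. (The purpose is to bias the representation toward the actually observed word.)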
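The 80/10/10 rule above can be sketched in a few lines of plain Python. This is only an illustration, not BERT's own preprocessing code: the function name `mask_token`, the toy vocabulary, and the fixed seed are all assumptions made for the example.

```python
import random

MASK = "[MASK]"

def mask_token(token, vocab, rng):
    """Apply the 80/10/10 rule to one token already selected for prediction."""
    p = rng.random()
    if p < 0.8:
        return MASK               # 80%: replace with [MASK]
    if p < 0.9:
        return rng.choice(vocab)  # 10%: replace with a random word
    return token                  # 10%: keep the original word

rng = random.Random(0)                         # fixed seed for repeatability
vocab = ["my", "dog", "is", "hairy", "apple"]  # toy vocabulary (assumption)
results = [mask_token("hairy", vocab, rng) for _ in range(10000)]
mask_fraction = results.count(MASK) / float(len(results))
print("fraction replaced by [MASK]:", round(mask_fraction, 3))
```

Over many draws the fraction of `[MASK]` replacements should settle near 0.8, with the remainder split between random words and the unchanged token.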

Task 2: Next Sentence Prediction

Input = [CLS] the man went to [MASK] store [SEP] he bought a gallon [MASK] milk [SEP]
Label = IsNext
Input = [CLS] the man [MASK] to the store [SEP] penguin [MASK] are flight ##less birds [SEP]
Label = NotNext

The Transformer model: a walkthrough of Kyubyong's implementation

# -*- coding: utf-8 -*-
#!/usr/bin/python2
'''
June 2017 by kyubyong park. 
[email protected].
https://www.github.com/kyubyong/transformer
The Transformer consists of an encoder and a decoder. The encoder takes embedding + positional_encoding as input,
applies a dropout layer, and feeds the result into a stack of 6 multihead_attention blocks, each followed by a feedforward block.
The decoder takes decoder embedding + positional_encoding as input, applies a dropout layer,
and then runs six blocks of self multihead_attention + multihead_attention, each followed by a feedforward block. A linear projection comes last.
https://blog.csdn.net/weixin_38569817/article/details/81357650?utm_source=blogxgwz3
https://www.jianshu.com/p/ef41302edeef
Srivastava R K, Greff K, Schmidhuber J. Highway networks[J]. arXiv preprint arXiv:1505.00387, 2015
'''
from __future__ import print_function
import tensorflow as tf
import numpy as np
def normalize(inputs, 
              epsilon = 1e-8,
              scope="ln",
              reuse=None):
    '''Applies layer normalization.
    Args:
      inputs: A tensor with 2 or more dimensions, where the first dimension has
        `batch_size`.
      epsilon: A floating number. A very small number for preventing ZeroDivision Error.
      scope: Optional scope for `variable_scope`.
      reuse: Boolean, whether to reuse the weights of a previous layer
        by the same name.
      
    Returns:
      A tensor with the same shape and data dtype as `inputs`.
    '''
    with tf.variable_scope(scope, reuse=reuse):
        inputs_shape = inputs.get_shape()
        params_shape = inputs_shape[-1:]
    
        mean, variance = tf.nn.moments(inputs, [-1], keep_dims=True)
        beta= tf.Variable(tf.zeros(params_shape))
        gamma = tf.Variable(tf.ones(params_shape))
        normalized = (inputs - mean) / ( (variance + epsilon) ** (.5) )
        outputs = gamma * normalized + beta
        
    return outputs

def embedding(inputs, 
              vocab_size, 
              num_units, 
              zero_pad=True, 
              scale=True,
              scope="embedding", 
              reuse=None):
    '''Embeds a given tensor.

    Args:
      inputs: A `Tensor` with type `int32` or `int64` containing the ids
         to be looked up in `lookup table`.
      vocab_size: An int. Vocabulary size.
      num_units: An int. Number of embedding hidden units.
      zero_pad: A boolean. If True, all the values of the first row (id 0)
        should be constant zeros.
      scale: A boolean. If True, the outputs are multiplied by sqrt(num_units).
      scope: Optional scope for `variable_scope`.
      reuse: Boolean, whether to reuse the weights of a previous layer
        by the same name.

    Returns:
      A `Tensor` with one more rank than inputs's. The last dimensionality
        should be `num_units`.
        
    For example,
    
    ```
    import tensorflow as tf
    
    inputs = tf.to_int32(tf.reshape(tf.range(2*3), (2, 3)))
    outputs = embedding(inputs, 6, 2, zero_pad=True)
    with tf.Session() as sess:
        sess.run(tf.global_variables_initializer())
        print sess.run(outputs)
    >>
    [[[ 0.          0.        ]
      [ 0.09754146  0.67385566]
      [ 0.37864095 -0.35689294]]

     [[-1.01329422 -1.09939694]
      [ 0.7521342   0.38203377]
      [-0.04973143 -0.06210355]]]
    ```
    
    ```
    import tensorflow as tf
    
    inputs = tf.to_int32(tf.reshape(tf.range(2*3), (2, 3)))
    outputs = embedding(inputs, 6, 2, zero_pad=False)
    with tf.Session() as sess:
        sess.run(tf.global_variables_initializer())
        print sess.run(outputs)
    >>
    [[[-0.19172323 -0.39159766]
      [-0.43212751 -0.66207761]
      [ 1.03452027 -0.26704335]]

     [[-0.11634696 -0.35983452]
      [ 0.50208133  0.53509563]
      [ 1.22204471 -0.96587461]]]    
    ```    
    '''
    with tf.variable_scope(scope, reuse=reuse):
        lookup_table = tf.get_variable('lookup_table',  #name
                                       dtype=tf.float32,
                                       shape=[vocab_size, num_units],  # shape
                                       initializer=tf.contrib.layers.xavier_initializer())
        #print(lookup_table) #<tf.Variable 'encoder/enc_embed/lookup_table:0' shape=(9796, 512) dtype=float32_ref>
        #<tf.Variable 'decoder/dec_embed/lookup_table:0' shape=(8767, 512) dtype=float32_ref>
        if zero_pad:
            lookup_table = tf.concat((tf.zeros(shape=[1, num_units]),
                                      lookup_table[1:, :]), 0)
            #print(lookup_table) #Tensor("decoder/dec_embed/concat:0", shape=(8767, 512), dtype=float32)
            #Tensor("encoder/enc_embed/concat:0", shape=(9796, 512), dtype=float32)
        outputs = tf.nn.embedding_lookup(lookup_table, inputs)
        
        if scale:
            outputs = outputs * (num_units ** 0.5)
            #print(outputs) #shape=(32, 10, 512)
    return outputs
    

def positional_encoding(inputs,
                        num_units,
                        zero_pad=True,
                        scale=True,
                        scope="positional_encoding",
                        reuse=None):
    '''Sinusoidal Positional_Encoding.

    Args:
      inputs: A 2d Tensor with shape of (N, T).
      num_units: Output dimensionality
      zero_pad: Boolean. If True, all the values of the first row (id = 0) should be constant zero
      scale: Boolean. If True, the output will be multiplied by sqrt num_units(check details from paper)
      scope: Optional scope for `variable_scope`.
      reuse: Boolean, whether to reuse the weights of a previous layer
        by the same name.
    Returns:
        A 'Tensor' with one more rank than inputs's, with the dimensionality should be 'num_units'
    '''

    N, T = inputs.get_shape().as_list()
    with tf.variable_scope(scope, reuse=reuse):
        position_ind = tf.tile(tf.expand_dims(tf.range(T), 0), [N, 1])

        # First part of the PE function: sin and cos argument
        position_enc = np.array([
            [pos / np.power(10000, 2.*i/num_units) for i in range(num_units)]
            for pos in range(T)])
        #print(position_enc)

        # Second part, apply the sine to even columns and the cosine to odd columns.
        position_enc[:, 0::2] = np.sin(position_enc[:, 0::2])  # dim 2i
        position_enc[:, 1::2] = np.cos(position_enc[:, 1::2])  # dim 2i+1

        # Convert to a float32 tensor (the NumPy array is float64, which would not concat with tf.zeros below)
        lookup_table = tf.convert_to_tensor(position_enc, dtype=tf.float32)

        if zero_pad:
            lookup_table = tf.concat((tf.zeros(shape=[1, num_units]),
                                      lookup_table[1:, :]), 0)
        outputs = tf.nn.embedding_lookup(lookup_table, position_ind)

        if scale:
            outputs = outputs * num_units**0.5

        return outputs

def multihead_attention(queries,
                        keys,
                        num_units=None, 
                        num_heads=8, 
                        dropout_rate=0,
                        is_training=True,
                        causality=False,
                        scope="multihead_attention", 
                        reuse=None):
    '''Applies multihead attention.
    Args:
      queries: A 3d tensor with shape of [N, T_q, C_q].
      keys: A 3d tensor with shape of [N, T_k, C_k].
      num_units: A scalar. Attention size.
      dropout_rate: A floating point number.
      is_training: Boolean. Controller of mechanism for dropout.
      causality: Boolean. If true, units that reference the future are masked. 
      num_heads: An int. Number of heads.
      scope: Optional scope for `variable_scope`.
      reuse: Boolean, whether to reuse the weights of a previous layer
        by the same name.
        
    Returns
      A 3d tensor with shape of (N, T_q, C)  
    '''
    with tf.variable_scope(scope, reuse=reuse):
        # Set the fall back option for num_units
        if num_units is None:
            num_units = queries.get_shape().as_list()[-1]
        # print(num_units) #512
        
        # Linear projections (fully connected layers)
        #print(queries) #shape=(32, 10, 512)
        Q = tf.layers.dense(queries, num_units, activation=tf.nn.relu) # (N, T_q, C) #self.enc,
        K = tf.layers.dense(keys, num_units, activation=tf.nn.relu) # (N, T_k, C)  #self.enc,
        V = tf.layers.dense(keys, num_units, activation=tf.nn.relu) # (N, T_k, C)  #self.enc,
        #print(Q) #shape=(32, 10, 512)
        # Split and concat
        Q_ = tf.concat(tf.split(Q, num_heads, axis=2), axis=0) # (h*N, T_q, C/h)
        #print(Q_)  #shape=(256, 10, 64), split into 8 heads
        K_ = tf.concat(tf.split(K, num_heads, axis=2), axis=0) # (h*N, T_k, C/h) 
        V_ = tf.concat(tf.split(V, num_heads, axis=2), axis=0) # (h*N, T_k, C/h)
        # Multiplication
        outputs = tf.matmul(Q_, tf.transpose(K_, [0, 2, 1])) # (h*N, T_q, T_k)
        #print(outputs) #shape=(256, 10, 10)
        # Scale
        #print(K_.get_shape().as_list()[-1] ) #64
        #print(K_.get_shape().as_list()) #[256, 10, 64]
        outputs = outputs / (K_.get_shape().as_list()[-1] ** 0.5)  # scaled attention scores

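        '''
        key_masks makes the attention score of every key whose units are all zero extremely small, so that the key has effectively no influence on the weighted sum over the values.
        First, reduce_sum(keys, axis=-1) adds up the values along the last dimension, so the shape goes from [N, T_k, C_k] to [N, T_k].
        Then abs takes the absolute value, so every entry is either 0 or positive.
        Then tf.sign(x, name=None) returns y = sign(x) = -1 if x < 0; 0 if x == 0; 1 if x > 0, mapping every entry of the tensor
        to -1, 0, or 1. After this, key_masks holds only 0s and 1s: 0 where the key vector sums to zero, which is the case for all-zero (padded) keys, and 1 otherwise. The keys marked 0 are the ones to mask.
        tf.tile turns key_masks into shape (h*N, T_k).
        Every query attends over these keys, and a masked key is masked for every query. The key_masks so far is only a single copy of the mask, so it is expanded
        with a middle dimension whose size is the query sequence length, and tile replicates the same mask values along it.
        paddings is defined with the same shape as outputs, with every entry set to a very large negative number.
        With where, wherever key_masks is 0 (the position must be masked), the attention score in outputs is replaced by that very large negative number (taken from paddings);
        otherwise the original value of outputs is kept.
        '''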
        # Key Masking
        key_masks = tf.sign(tf.abs(tf.reduce_sum(keys, axis=-1))) # (N, T_k)
        #print(keys) #(32, 10, 512)
        #print(key_masks ) #shape=(32, 10)
        key_masks = tf.tile(key_masks, [num_heads, 1]) # (h*N, T_k)
        #print(key_masks) #shape=(256, 10) #8
        key_masks = tf.tile(tf.expand_dims(key_masks, 1), [1, tf.shape(queries)[1], 1]) # (h*N, T_q, T_k)
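        #print(key_masks) #shape=(256, 10, 10); a dimension was added at axis=1
        paddings = tf.ones_like(outputs)*(-2**32+1) # same shape as outputs, every entry a very large negative number
        #print(paddings) ##shape=(256, 10, 10)
        #tf.equal compares two tensors elementwise and returns True where the elements are equal, False otherwise.
        #where(condition, x=None, y=None, name=None)
        #condition, x and y have the same shape and condition is boolean. With x and y given, where takes the value from x where condition is True and from y elsewhere;
        #called as where(condition) alone, it returns the indices of the True elements.
        outputs = tf.where(tf.equal(key_masks, 0), paddings, outputs) # (h*N, T_q, T_k)
        #Where key_masks equals 0 the condition is True and the value comes from paddings; elsewhere it is False and the value of outputs is kept.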
        #print(outputs) #shape=(256, 10, 10) #8 * 64 ,10,10
        # Causality = Future blinding
        '''
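        The causality argument says whether to hide information from future positions (in decoder self-attention a position must not see what comes after it). This is the masking applied when causality is True.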
        
        '''

        if causality:
            diag_vals = tf.ones_like(outputs[0, :, :]) # (T_q, T_k)
            #print(diag_vals) #shape=(10, 10)
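            #https://devdocs.io/tensorflow~python/tf/linalg/linearoperatorlowertriangular (linear algebra library)
            '''
            Convert the matrix to the lower-triangular matrix tril. In it, for each T_q every T_k with a larger index is 0, so as a mask it lets each query attend only to keys at or before its own position
            (in self-attention the queries are the keys). Since this rule holds for every query, tile is then used to replicate it along the first dimension, giving masks
            of shape (h*N, T_q, T_k). With this lower-triangular mask, the scores of keys at future positions are set to a very large negative number whenever future keys must not be used.
            '''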
            tril = tf.linalg.LinearOperatorLowerTriangular(diag_vals).to_dense() # (T_q, T_k)
            #print(tril) #shape=(10, 10)
            masks = tf.tile(tf.expand_dims(tril, 0), [tf.shape(outputs)[0], 1, 1]) # (h*N, T_q, T_k)
            #print(masks) #shape=(256, 10, 10)
            # set the scores of keys at future positions to a very large negative number
            paddings = tf.ones_like(masks)*(-2**32+1)
            outputs = tf.where(tf.equal(masks, 0), paddings, outputs) # (h*N, T_q, T_k)
            #print("-----")
        # Activation: softmax turns the attention scores into weights that sum to 1

        outputs = tf.nn.softmax(outputs) # (h*N, T_q, T_k)

        # Query Masking
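        '''
        What gets masked is content that carries no information, or whose information must not be used for now. The query mask likewise masks the queries that are all zero
        (for example, the PAD positions of a sentence acting as queries). The first three lines work much like the key mask, except that the extra dimension is added as the last axis.
        After these operations query_masks has shape [h*N, T_q, T_k].
        The fourth line multiplies outputs by query_masks. Here outputs already holds the post-softmax weights, so after this step the weights to be masked are multiplied by 0,
        and the others are multiplied by 1 (the sign of a positive number) and stay unchanged. That is the purpose of query_masks.
        The comment in Kyubyong's source gives the shape of outputs at this point as (N, T_q, C); it should be (h*N, T_q, T_k), the same as query_masks.
        '''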
        query_masks = tf.sign(tf.abs(tf.reduce_sum(queries, axis=-1))) # (N, T_q)
        query_masks = tf.tile(query_masks, [num_heads, 1]) # (h*N, T_q)
        '''
        query_masks so far has shape (h*N, T_q), one value per query position. Every query row of outputs spans T_k keys, so a trailing dimension is added
        and tile replicates the mask T_k times, giving the same shape as outputs.
        '''
        query_masks = tf.tile(tf.expand_dims(query_masks, -1), [1, 1, tf.shape(keys)[1]]) # (h*N, T_q, T_k)
        #print(query_masks ) #shape=(256, 10, 10)
        outputs *= query_masks # elementwise product. (h*N, T_q, T_k)
        #print(outputs) #shape=(256, 10, 10)
        # Dropouts
        outputs = tf.layers.dropout(outputs, rate=dropout_rate, training=tf.convert_to_tensor(is_training))
        # Weighted sum
        outputs = tf.matmul(outputs, V_) # ( h*N, T_q, C/h)
        #print(V_) #shape=(256, 10, 64)
        #print(outputs) #shape=(256, 10, 64)
        # Restore shape
        outputs = tf.concat(tf.split(outputs, num_heads, axis=0), axis=2 ) # (N, T_q, C)
        #print(outputs) #shape=(32, 10, 512)
        # Residual connection
        outputs += queries  # residual connection, the idea behind residual networks
        # Normalize
        outputs = normalize(outputs) # (N, T_q, C)
        #print(outputs) #shape=(32, 10, 512)
    return outputs

def feedforward(inputs, 
                num_units=[2048, 512],
                scope="multihead_attention", 
                reuse=None):
    '''Point-wise feed forward net.
    
    Args:
      inputs: A 3d tensor with shape of [N, T, C].
      num_units: A list of two integers.
      scope: Optional scope for `variable_scope`.
      reuse: Boolean, whether to reuse the weights of a previous layer
        by the same name.
        
    Returns:
      A 3d tensor with the same shape and dtype as inputs
    '''
    with tf.variable_scope(scope, reuse=reuse):
        # Inner layer
        params = {"inputs": inputs, "filters": num_units[0], "kernel_size": 1,
                  "activation": tf.nn.relu, "use_bias": True}
        outputs = tf.layers.conv1d(**params)
        
        # Readout layer
        params = {"inputs": outputs, "filters": num_units[1], "kernel_size": 1,
                  "activation": None, "use_bias": True}
        outputs = tf.layers.conv1d(**params)
        
        # Residual connection
        outputs += inputs
        
        # Normalize
        outputs = normalize(outputs)
    
    return outputs
# Replaces the 0s of the one-hot labels with a small number and the 1s with a number close to 1
def label_smoothing(inputs, epsilon=0.1):
    '''Applies label smoothing. See https://arxiv.org/abs/1512.00567.
    
    Args:
      inputs: A 3d tensor with shape of [N, T, V], where V is the number of vocabulary.
      epsilon: Smoothing rate.
    
    For example,
    
    ```
    import tensorflow as tf
    inputs = tf.convert_to_tensor([[[0, 0, 1], 
       [0, 1, 0],
       [1, 0, 0]],

      [[1, 0, 0],
       [1, 0, 0],
       [0, 1, 0]]], tf.float32)
       
    outputs = label_smoothing(inputs)
    
    with tf.Session() as sess:
        print(sess.run([outputs]))
    
    >>
    [array([[[ 0.03333334,  0.03333334,  0.93333334],
        [ 0.03333334,  0.93333334,  0.03333334],
        [ 0.93333334,  0.03333334,  0.03333334]],

       [[ 0.93333334,  0.03333334,  0.03333334],
        [ 0.93333334,  0.03333334,  0.03333334],
        [ 0.03333334,  0.93333334,  0.03333334]]], dtype=float32)]   
    ```    
    '''
    K = inputs.get_shape().as_list()[-1] # number of channels
    return ((1-epsilon) * inputs) + (epsilon / K)

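Model configuration parameters:

"attention_probs_dropout_prob": 0.1, # dropout probability applied after the softmax in multiplicative attention
"hidden_act": "gelu", # activation function
"hidden_dropout_prob": 0.1, # dropout probability of the hidden layers
"hidden_size": 768, # number of hidden units
"initializer_range": 0.02, # initialization range
"intermediate_size": 3072, # size of the expanded (intermediate) layer
"max_position_embeddings": 512, # a value larger than seq_length, used to build position_embedding
"num_attention_heads": 12, # number of attention heads in each hidden layer
"num_hidden_layers": 12, # number of hidden layers
"type_vocab_size": 2, # number of segment_ids classes [0,1]
"vocab_size": 30522 # number of words in the vocabulary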

The input arguments here, input_ids, input_mask and token_type_ids, correspond to the input_ids, input_mask and segment_ids produced in the previous post.
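As a quick sanity check on the configuration values listed above, the per-head size and the feed-forward expansion factor can be derived with a little arithmetic. The variable names `attention_head_size` and `expansion` are chosen for this sketch; only the three numbers come from the configuration.

```python
config = {
    "hidden_size": 768,
    "intermediate_size": 3072,
    "num_attention_heads": 12,
}

# Each attention head works on an equal slice of the hidden vector,
# so hidden_size has to be divisible by the number of heads.
assert config["hidden_size"] % config["num_attention_heads"] == 0
attention_head_size = config["hidden_size"] // config["num_attention_heads"]

# The intermediate layer widens the hidden vector by this factor.
expansion = config["intermediate_size"] // config["hidden_size"]

print(attention_head_size, expansion)  # prints: 64 4
```

The 64-wide heads match the (256, 10, 64) shapes printed in Kyubyong's code above, where 512 units are split over 8 heads.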

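Transformer:

Because self-attention computes attention between every word and all other words, the maximum path length between any two words is 1 no matter how far apart they are, so long-range dependencies can be captured.

Expressing attention in terms of K, Q and V:

Traditional attention (sequence-to-sequence problems):

The context is a weighted average of the encoder states h: c_i = sum_j alpha_ij * h_j

The weights alpha (the attention weights) can then be expressed through the product of Q and K, with the small h as V (Q is the big H; K and V are both the small h).

A variant is possible in which K and V are not equal but still correspond one to one, for example:

V = h + x_embedding
Q = H
K = h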
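To make the Q/K/V reading concrete, here is a minimal pure-Python sketch for a single decoder state. The helper names `softmax` and `attend` and the toy vectors are assumptions for illustration; the weights are a softmax over the dot products of Q with each K, and the context is the weighted average of V.

```python
import math

def softmax(xs):
    m = max(xs)                          # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query, keys, values):
    """Return (weights, context) for one query vector."""
    scores = [sum(q * k for q, k in zip(query, key)) for key in keys]
    weights = softmax(scores)            # the attention weights alpha
    dim = len(values[0])
    context = [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]
    return weights, context

h = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # encoder states: K and V (toy values)
H = [2.0, 0.0]                            # one decoder state: Q (toy value)
weights, context = attend(H, keys=h, values=h)
print(weights, context)
```

For the variant with V = h + x_embedding, only the `values` argument changes; the weights still come from Q and K.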

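Multiplicative vs. additive attention

Additive attention:

Taking the traditional RNN seq2seq problem as the example again, additive attention is the classic attention mechanism. It uses a feed-forward (fully connected) network with one hidden layer to compute the attention scores.

Multiplicative attention:

This is the common approach of computing the attention score with a product.

Multiplicative attention does not need a fully connected layer, so it has the advantage in space complexity; and because the product can use optimized matrix multiplication, it is generally faster to compute as well.

The multiplicative attention in the paper additionally divides by a scale factor, sqrt(d_k): Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V

The paper points out that when d_k is small, multiplicative and additive attention perform about the same. When d_k is large and no scale factor is used, additive attention does better, because the products grow large and push the softmax into its saturated region, where gradients are small.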
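The effect of the scale factor can be checked numerically. The sketch below assumes that the components of q and k are independent standard normal values (an assumption made for the experiment, not something the model guarantees) and estimates the spread of the dot product with and without the division by sqrt(d_k). The function name `dot_std` is hypothetical.

```python
import math
import random

def dot_std(d_k, scale, n=2000, seed=0):
    """Estimate the standard deviation of q.k over n random pairs."""
    rng = random.Random(seed)
    dots = []
    for _ in range(n):
        q = [rng.gauss(0, 1) for _ in range(d_k)]
        k = [rng.gauss(0, 1) for _ in range(d_k)]
        s = sum(a * b for a, b in zip(q, k))
        dots.append(s / math.sqrt(d_k) if scale else s)
    mean = sum(dots) / n
    return math.sqrt(sum((x - mean) ** 2 for x in dots) / n)

unscaled = dot_std(64, scale=False)  # expected to be close to sqrt(64) = 8
scaled = dot_std(64, scale=True)     # expected to be close to 1
print(round(unscaled, 2), round(scaled, 2))
```

Scores with a spread near 8 give a softmax that is almost one-hot, which is the saturated region mentioned above; after scaling the spread stays near 1 regardless of d_k.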

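Self-attention

Taking an ordinary RNN seq2seq model as the example, in ordinary attention Q comes from the decoder (the big H) while K and V come from the encoder (the small h). In self-attention, K, Q and V all come from the encoder, or all from the decoder, so that the representation of every position carries global semantic information, which helps model long-range dependencies.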

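Layer normalization (LN)

Batch normalization normalizes each node across a batch, i.e. "vertical" normalization.

Layer normalization (LN) normalizes all the neurons of one layer for a single sample. It does not involve the batch at all, i.e. "horizontal" normalization.

BN suits cases where the data distribution does not differ much between mini-batches. It has to allocate variables to store the mean and variance of every node, so it uses somewhat more memory, and it only applies where mini-batches are available.

LN needs only one sample to normalize, which avoids BN's sensitivity to the mini-batch distribution, and it needs no storage for per-node means and variances.

However, the BN transform is trainable per neuron: after re-shifting and re-scaling, the inputs of different neurons are distributed over different ranges. LN learns one transform for a whole layer of neurons, so all inputs fall in the same range. If the input features are not of similar kinds (for example colour and size, which have different scales), LN may reduce the expressive power of the model.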
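The "vertical" versus "horizontal" distinction can be seen on a tiny batch. This is a simplified sketch: the learned scale and shift (gamma and beta) are left out, and the helper names `standardize`, `layer_norm` and `batch_norm` are assumptions for the example.

```python
import math

def standardize(values, eps=1e-8):
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return [(v - mean) / math.sqrt(var + eps) for v in values]

def layer_norm(batch):
    # One sample at a time: statistics over the features of each row.
    return [standardize(row) for row in batch]

def batch_norm(batch):
    # One feature at a time: statistics over the batch (each column).
    cols = [standardize(list(col)) for col in zip(*batch)]
    return [list(row) for row in zip(*cols)]

batch = [[1.0, 2.0, 3.0],     # sample 1
         [10.0, 20.0, 60.0]]  # sample 2, on a very different scale
ln = layer_norm(batch)
bn = batch_norm(batch)
print(ln)
print(bn)
```

Note that `ln[0]` depends only on sample 1, while every entry of `bn` depends on both samples, which is why BN is sensitive to the mini-batch composition.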

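Encoder:

Input: as in ConvS2S, a positional embedding is added to the word vectors, i.e. the positions 1, 2, 3, 4 ... n are encoded (also represented as an embedding). The encoding can use sine and cosine functions, which makes the position code periodic and good at expressing relative positions (for any offset k, PE[pos+k] can be expressed in terms of PE[pos]).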
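The relative-position property can be verified directly. The sketch below follows the formula in the paper, PE[pos][2j] = sin(pos * w_j) and PE[pos][2j+1] = cos(pos * w_j) with w_j = 10000^(-2j/d_model); the `positional_encoding` function in Kyubyong's code above indexes the exponent slightly differently. The helper `shift` is a hypothetical name, and it shows that PE[pos+k] is obtained from PE[pos] by a fixed rotation of each sin/cos pair.

```python
import math

def positional_encoding(max_len, d_model):
    """Sinusoidal table of shape (max_len, d_model); d_model is assumed even."""
    table = []
    for pos in range(max_len):
        row = []
        for two_j in range(0, d_model, 2):
            angle = pos / (10000 ** (two_j / float(d_model)))
            row += [math.sin(angle), math.cos(angle)]
        table.append(row)
    return table

def shift(row, k, d_model):
    """Compute PE[pos+k] from PE[pos] using the angle-addition formulas."""
    out = []
    for two_j in range(0, d_model, 2):
        w = 1.0 / (10000 ** (two_j / float(d_model)))
        s, c = row[two_j], row[two_j + 1]
        out += [s * math.cos(k * w) + c * math.sin(k * w),   # sin(a + b)
                c * math.cos(k * w) - s * math.sin(k * w)]   # cos(a + b)
    return out

d_model = 8
pe = positional_encoding(10, d_model)
shifted = shift(pe[3], 4, d_model)   # should reproduce pe[7]
max_error = max(abs(a - b) for a, b in zip(shifted, pe[7]))
print(max_error)
```

The rotation depends only on the offset k, not on pos, which is what makes relative positions easy for the model to express.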

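The input sequence length is n and the embedding dimension is d, so the input is an n*d matrix.
N=6: six identical layers are stacked, each made of two sub-layers:
- Sub-layer 1:
  - Multi-head self-attention
  - Residual connection and LN:
    - Output = LN (x+sublayer(x))

- Sub-layer 2:
  - Position-wise FC layer (much like a convolution)
  - It operates on every row of the n*d matrix (equivalent to flattening each row and attaching an FC layer). Within one layer every row uses the same FC parameters; different layers use different parameters. (The number of FC units first grows from 512 to 2048, then shrinks back to 512.)

The output of the whole encoder is also an n*d matrix.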
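The position-wise FC layer described above can be sketched with plain lists. The weights are toy values picked for the example (2 units widened to 3 and back, instead of 512 to 2048 to 512), and the residual connection and LN are left out to keep the sketch short.

```python
def relu(v):
    return [max(0.0, x) for x in v]

def linear(x, W, b):
    # x: vector of length d_in, W: d_in x d_out matrix, b: vector of length d_out
    return [sum(xi * W[i][j] for i, xi in enumerate(x)) + b[j]
            for j in range(len(b))]

def position_wise_ffn(X, W1, b1, W2, b2):
    # The same two layers are applied to every row (position) independently.
    return [linear(relu(linear(row, W1, b1)), W2, b2) for row in X]

W1 = [[1.0, -1.0, 0.0], [0.0, 1.0, 1.0]]    # 2 -> 3 (toy values)
b1 = [0.0, 0.0, 0.0]
W2 = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # 3 -> 2 (toy values)
b2 = [0.0, 0.0]

X = [[1.0, 2.0], [1.0, 2.0], [3.0, -1.0]]   # n=3 positions, d=2
out = position_wise_ffn(X, W1, b1, W2, b2)
print(out)  # [[3.0, 3.0], [3.0, 3.0], [3.0, 0.0]]
```

Identical input rows produce identical output rows because the parameters are shared across positions, which is also why Kyubyong's `feedforward` can implement it as a `conv1d` with `kernel_size` 1.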

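Decoder:

• Input: assume k words have been translated so far; the vector dimension is still d.

• It also uses N=6 repeated layers, again with residual connections and LN.

• Each layer has 3 sub-layers, one attention layer more than the encoder: the layer in which the decoder attends to the encoder's output.

Sub-L1:
Self-attention, as in the encoder, but with future information masked out; produces a k*d matrix.

Sub-L2: the layer that attends to the encoder; outputs a k*d matrix.
Sub-L3: fully connected layer; outputs a k*d matrix, whose k-th row is used to predict the output y.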
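The masking of future information in Sub-L1 can be sketched as follows. It mirrors the approach in Kyubyong's `multihead_attention` (replace the scores of future positions with the large negative constant -2**32+1 before the softmax), but the function name `causal_softmax` and the all-zero toy scores are assumptions for the example.

```python
import math

NEG = -2 ** 32 + 1  # the large negative constant used in the code above

def causal_softmax(scores):
    """Row i may only attend to columns j <= i."""
    weights = []
    for i, row in enumerate(scores):
        masked = [s if j <= i else NEG for j, s in enumerate(row)]
        m = max(masked)
        exps = [math.exp(s - m) for s in masked]  # exp of NEG underflows to 0.0
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights

scores = [[0.0] * 4 for _ in range(4)]  # toy scores: every pair equally similar
weights = causal_softmax(scores)
for row in weights:
    print(row)
```

With equal scores, row i spreads its weight evenly over the first i+1 positions and gives exactly zero weight to later ones, which is the lower-triangular pattern.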

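Multi-head attention:

Multi-head attention can be seen as a kind of ensemble that captures semantics in different subspaces.

Obtaining Q, K and V for each head:

Either map Q, K and V into several sets with fully connected linear transforms, where the mapped dimension can stay the same or be reduced (similar to dimensionality reduction),
or split Q, K and V into segments with split.

If the dimension per head is reduced, whether by the linear mapping or by split, the combined cost of attention over all heads is not much different from the cost of a single head at the original dimension, and the heads can be computed in parallel.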
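The split variant is what Kyubyong's code does with `tf.concat(tf.split(Q, num_heads, axis=2), axis=0)`. Below is a pure-Python sketch of the same reshaping on nested lists; `split_heads` and `merge_heads` are hypothetical helper names, and the toy tensor encodes its own indices so the layout is easy to read.

```python
def split_heads(x, num_heads):
    """(N, T, C) nested lists -> (num_heads*N, T, C/num_heads)."""
    size = len(x[0][0]) // num_heads
    return [[token[h * size:(h + 1) * size] for token in seq]
            for h in range(num_heads)   # heads are stacked along the batch axis
            for seq in x]

def merge_heads(x, num_heads):
    """Inverse of split_heads: (num_heads*N, T, C/num_heads) -> (N, T, C)."""
    n_batch = len(x) // num_heads
    return [[sum((x[h * n_batch + n][t] for h in range(num_heads)), [])
             for t in range(len(x[n]))]
            for n in range(n_batch)]

# Toy tensor with N=2, T=3, C=4; the entry value is n*100 + t*10 + c.
x = [[[n * 100 + t * 10 + c for c in range(4)] for t in range(3)]
     for n in range(2)]
heads = split_heads(x, 2)
restored = merge_heads(heads, 2)
print(len(heads), len(heads[0]), len(heads[0][0]))  # 4 3 2
```

Entry `h*N + n` of the result holds head h of sentence n, so attention can run on all heads as one batched matrix product before `merge_heads` restores the original layout.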

Decoder input:

Output of self.y

[[1008 3936 1924    4  401 5087 5651    3    0    0]
 [ 141   25    4  101 3180    6  362  552    3    0]
 [  15    4  420   12  845    3    0    0    0    0]
 [  79  243    6    1   15   42 1614  634    3    0]
 [ 527  119   30   85  976    3    0    0    0    0]
 [   1   12  669  193   20  135    1 7288    3    0]
 [   1    1    1    1    3    0    0    0    0    0]
 [  78   58    5 1934  764   87    3    0    0    0]
 [  65   20   44  484  304 1190    3    0    0    0]
 [  92  132    8 3174    5 1286  217    3    0    0]
 [ 534    1   81  979 2304    3    0    0    0    0]
 [  15  334   54   44  505  145    8  222 3052    3]
 [2124 2626 1435   14    3    0    0    0    0    0]
 [ 291  429   39    3    0    0    0    0    0    0]
 [   1    6  438  162    8  178 5506  702    3    0]
 [ 166  106   11   21    5  102 1403   97    3    0]
 [  47  578  116   18  387    3    0    0    0    0]
 [  11  100    5  173   28    1  459    3    0    0]
 [ 193   14   12    3    0    0    0    0    0    0]
 [  80  910   12 1581 1713    3    0    0    0    0]
 [  65   90  807    5   66   56    3    0    0    0]
 [  24   95   44 3973   85  340  327  110    3    0]
 [   1    1    1 1615    1    3    0    0    0    0]
 [  15   14   74  174   31    4 2982  512 1105    3]
 [ 350    3    0    0    0    0    0    0    0    0]
 [  96  246  425    5   54    3    0    0    0    0]
 [3585 1357  117 1105    1   61 5155    3    0    0]
 [  79  159    8    1  898    3    0    0    0    0]
 [  79   33   40  352    3    0    0    0    0    0]
 [  79 3060  569    6  515   25    1   39    3    0]
 [ 120   94  955   10   56 1606    3    0    0    0]
 [  24   95   28   16  139    7  244  199  729    3]]

self.dec:

[[[-2.279155    1.0991251  -1.5886356  ...  1.4294004  -1.5982031
   -1.1408417 ]
  [-1.5231744  -0.05949117 -1.1321675  ... -1.4490161  -0.30499786
   -0.77806205]
  [-1.3187622  -0.15014127 -1.13905    ... -3.1452506  -2.686214
   -0.5845207 ]
  ...
  [-0.41217917  1.4415227  -1.2828536  ... -0.3612737  -0.4510218
   -0.7630521 ]
  [-0.7186153   0.86343354 -1.883712   ... -0.91078717 -0.01014436
    0.07463718]
  [-0.86691934  0.98398304 -2.3404958  ... -0.8095867   0.442111
   -0.04082058]]

 [[-2.631531    0.8046703  -1.8410189  ...  0.98089796 -1.0553441
   -1.0271642 ]
  [-2.527477    0.47748724 -1.5274042  ... -0.9979782  -1.4225875

self.x (German) [[  14   29    1  111  744    3    0    0    0    0]
 [ 344  303    3    0    0    0    0    0    0    0]
 [ 352   68   13   22 2974    1    1    3    0    0]
 [  37   68  459  591    8    9   94    3    0    0]
 [ 100  210   70  737    3    0    0    0    0    0]
 [  34    1    4 3720  492  119    1 2668    3    0]
 [  21    6  660 1234  116   25 4757  539    1    3]
 [ 832   12   46  160  154   32    3    0    0    0]
 [  34    7   20 1067    1    3    0    0    0    0]
 [  25  207   19   18 2361    1   28    3    0    0]
 [  34    1 2921 1123    3    0    0    0    0    0]
 [  34    1 5514   10   56    1    3    0    0    0]
 [  61   14  110   78    8   19  334    3    0    0]
 [  21   15  420  356    3    0    0    0    0    0]
 [  14  171 3183  310 2075 2426 1024    3    0    0]
 [  65  362   13   35    1   10  217 4820    3    0]
 [  25   75    8  557  124 2810 2124    9   29    3]
 [ 774    1    1    1    3    0    0    0    0    0]
 [   1    7 1149 4973    3    0    0    0    0    0]
 [  21   11   99   19 2640  107  215    3    0    0]
 [ 163 1116   29 1335  402    9  197    3    0    0]
 [ 120 2161  918  157   15   23    6  400 1335    3]
 [  21   11  174   89 1260    9   86   71    3    0]
 [  14   48    4 3168  111   54   45  375    3    0]
 [  34  617  225   23  465    3    0    0    0    0]
 [  95  886   12 3115    1    3    0    0    0    0]
 [ 340  269   12   16  426  874 1797  156  117    3]
 [ 204   28    4 3979   30    8 2287   30    1    3]
 [ 477   52   10  405 3787  168  902   51    3    0]
 [ 100   28    4 1241   98 2820    1    3    0    0]
 [  14  589   27   64    1    3    0    0    0    0]
self.y (English) [[   1   27   13  134    4 1534 3332    3    0    0]
 [5502    7  349    8  701 8414  114    3    0    0]
 [ 125  101  444 2134 2960 1767   10  480 2865    3]
 [ 120   21  143 4641 1866    7 1149    3    0    0]
 [ 291   12   33 2686    3    0    0    0    0    0]
 [  92   55    1  955    3    0    0    0    0    0]
 [1540    9    1    7    1    3    0    0    0    0]
 [3050   37    4  462  502   12 8718    3    0    0]
 [  11  100   13    5   29  559    3    0    0    0]
 [ 125  175   48   73   12  169 4592 7722    3    0]
 [5070    7 2667   47    1    3    0    0    0    0]
 [ 231   12   41    1   57 1708    3    0    0    0]
 [   1    6   13   20    1    3    0    0    0    0]
 [  15    4  420   12  111    3    0    0    0    0]
 [ 235    4  314  457  261   21    8  559 3722    3]
 [  15  115 2008    3    0    0    0    0    0    0]
 [  65   94 4087   17  160    3    0    0    0    0]
 [  11 3803  130    1 1661    1    4 4373    3    0]
 [  49    1    9  332  303  110    3    0    0    0]
 [   1  162    6    4  614  181  100   17    3    0]
 [  24   11  182 4885    3    0    0    0    0    0]
 [  96  488    1    1   79  421    3    0    0    0]
 [ 965    8 1126   28   20   13    1    3    0    0]
 [ 120   63  784    5 2063 3565    3    0    0    0]
 [  24   16  132    5  174    9 4755  392    3    0]
 [ 166   44  246  366    3    0    0    0    0    0]
 [  24   28   25  129   22 1411    3    0    0    0]
 [ 737  298 3720  162    3    0    0    0    0    0]
 [  11  100    5   75  477    4 1054    3    0    0]
 [  65  164   25  156    1   46 8431    3    0    0]
 [  80   18  198 2343  624    3    0    0    0    0]
 [ 640  281   37  162    3    0    0    0    0    0]]

Decoder part:
[[   2   24  334   54  164   25 2337  114    3    0]
 [   2    1   10  423   48   20  115   91   25   17]
 [   2   47  114   37   18  234   10  417 4175   10]
 [   2  166    1  121 3475  409    5    8    1   97]
 [   2   80   12    8 1924  901  159    8 7055  901]
 [   2   96   86    1   79    8    1 2252    3    0]
 [   2 2171   87   11   18    1   23   38  136 4969]
 [   2  120    1 1326 3555    3    0    0    0    0]
 [   2   79  345 7208   19    4 6423    3    0    0]
 [   2  589   16   21  263 5134   53    6  381  267]
 [   2  601   44  265   17   37   22   19  643    3]
 [   2   80   83  104 1242    3    0    0    0    0]
 [   2  141   12    9 2443   77  339  122    5 1571]
 [   2 4358  216  137   61    1   10    3    0    0]
 [   2 3130    1    1    3    0    0    0    0    0]
 [   2   96  181  242    9   89    3    0    0    0]
 [   2   65   20   19    1 4223  127    3    0    0]
 [   2   15  138   51 1303   23   50 1386    3    0]
 [   2   15   13   27  173   56   22  926  460    3]
 [   2   65   21    5 1942    9    3    0    0    0]
 [   2   79    8   51  143  676 4787    3    0    0]
 [   2   15   42  227   19 1225    3    0    0    0]
 [   2   15   16   63    5  286  131    6    8 4174]
 [   2  166    1   23    4    1    3    0    0    0]
 [   2   15   17 4367   12 3481 3562    7  298    1]
 [   2   15   11  100   38  377    5   29   23   13]
 [   2   15  981  112   18    1    8 1169    3    0]
 [   2   15   16 2135  523    1    3    0    0    0]
 [   2 4264   17   12   33   68  208    3    0    0]
 [   2   80   12    4 4979    1    3    0    0    0]
 [   2   11  823   13   51  109    3    0    0    0]
 [   2   79   33 1431  390    3    0    0    0    0]]
 [  25    1  201  526    3    0    0    0    0    0]]

How self.decoder_inputs is built: tf.concat((tf.ones_like(self.y[:, :1])*2, self.y[:, :-1]), -1)

tf.ones_like(self.y[:, :1])
[[1]
 [1]
 [1]
 [1]
 [1]
 [1]
 [1]
 [1]
 [1]
 [1]
 [1]
 [1]
 [1]
 [1]
 [1]
 [1]
 [1]
 [1]
 [1]
 [1]
 [1]
 [1]
 [1]
 [1]
 [1]
 [1]
 [1]
 [1]
 [1]
 [1]
 [1]
 [1]]
tf.ones_like(self.y[:, :1])*2
[[2]
 [2]
 [2]
 [2]
 [2]
 [2]
 [2]
 [2]
 [2]
 [2]
 [2]
 [2]
 [2]
 [2]
 [2]
 [2]
 [2]
 [2]
 [2]
 [2]
 [2]
 [2]
 [2]
 [2]
 [2]
 [2]
 [2]
 [2]
 [2]
 [2]
 [2]
 [2]]
self.y[:, :-1]
[[  11   74  123  350   95   41 3029    3    0]
 [3510  144 3781    3    0    0    0    0    0]
 [   1   92  738  173    8    1    3    0    0]
 [  96   94  547   29    8 4941 1096    3    0]
 [  49   64  220    3    0    0    0    0    0]
 [ 203   13    3    0    0    0    0    0    0]
 [2917 1523   59    4  204   35  330    3    0]
 [  24   97   44   17   12   41    1 2429   87]
 [  79    4  985 4074 1143  557 3187    3    0]
 [  92   55    4 7987 3611   12    8 1578   37]
 [  24    4  212  806   12  159    1    3    0]
 [1410  336   68 2515    3    0    0    0    0]
 [ 150   28  194   29 2887   17    1    3    0]
 [  15    4  101   37  500   12    1    3    0]
 [  96   18    1 2972   43   11  139   14    3]
 [2629 6333 2802    3    0    0    0    0    0]
 [ 610    4    1   11   90  445    9    3    0]
 [7991  138   10 1481    3    0    0    0    0]
 [  24 5119  122   10    4  360  984    3    0]
 [6026  689    3    0    0    0    0    0    0]
 [1033   34    9 1202    3    0    0    0    0]
 [ 203   13  231   18   14   57    4   89    3]
 [  43   17   12   33   69   11  182    3    0]
 [  11   18  275   93  311    3    0    0    0]
 [2016   22   84    4 1428  899    3    0    0]
 [  24   16  132    5   64 1493   25 1744  289]
 [  98 1298 3170  703  390    3    0    0    0]
 [  24   11  353  232  214   19    8    1    3]
 [  15   28   11  238   72 1850   54    3    0]
 [ 203   13    3    0    0    0    0    0    0]
 [ 231  278    9  489   12   33   53   48    3]
 [ 600   12  878    5 1582    4  572    3    0]] # the original 32*10 has become 32*9
tf.concat((tf.ones_like(self.y[:, :1])*2, self.y[:, :-1]), -1) 
 [[   2  394   33 1133   30   32    3    0    0    0]
 [   2 1675   28   13   64    3    0    0    0    0]
 [   2   15 4777  944    5  102    5    4 1871    3]
 [   2  291   12   33 2686    3    0    0    0    0]
 [   2   11  105 5707    3    0    0    0    0    0]
 [   2   11  119 4231   20    6    4 1577 1135    3]
 [   2  965    4  326   18 4762    3    0    0    0]
 [   2  284  967  111 1484  968 1622    3    0    0]
 [   2   15    6  191 3498    3    0    0    0    0]
 [   2   24   16   21    4    1    1    3    0    0]
 [   2 1160  101   91   20  159    1 3232    3    0]
 [   2   15   42   51  584    3    0    0    0    0]
 [   2  394   33 4681    3    0    0    0    0    0]
 [   2   80   12   41  448 5956   10    1 6524    3]
 [   2   98  416 2000   17   23    4  204    3    0]
 [   2   80   12  301    7 8061    3    0    0    0]
 [   2  284 1579    6    4    1    3    0    0    0]
 [   2  125    4 1385   88 6913 2257   20   25  162]
 [   2 2219 1346    3    0    0    0    0    0    0]
 [   2   67   12  490   28  124  260    5   34    3]
 [   2   47  315  177   12    1    3    0    0    0]
 [   2 6239   12    4  290    6    1    3    0    0]
 [   2   78    3    0    0    0    0    0    0    0]
 [   2   92  103  133  504   14    3    0    0    0]
 [   2   15   28   25    4  668    6 7226 2454    3]
 [   2   24  473  288  543  430  706   48   81    3]
 [   2  689   11  424   21 8420 6647   87    3    0]
 [   2  203   13    3    0    0    0    0    0    0]
 [   2   15   26   71    8  137    6  742    3    0]
 [   2   79   72  803    3    0    0    0    0    0]
 [   2  589   12   41  240    6    4 4257    3    0]
 [   2   15   39 6379    3    0    0    0    0    0]]
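The construction of self.decoder_inputs shown in the dumps above amounts to a shift to the right by one position. A plain-Python equivalent of the `tf.concat` expression, with the hypothetical helper name `shift_right` and two rows copied from the self.y dump as sample data:

```python
def shift_right(y, start_id=2):
    """Prepend start_id to every row and drop the last token of the row."""
    return [[start_id] + row[:-1] for row in y]

y = [[15, 4, 420, 12, 845, 3, 0, 0, 0, 0],
     [15, 334, 54, 44, 505, 145, 8, 222, 3052, 3]]
decoder_inputs = shift_right(y)
for row in decoder_inputs:
    print(row)
```

The row length stays at 10. In the second row, which fills all 10 positions, the final token 3 is dropped from the input but remains in the label y.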

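The encoder input is the German side: k and q are the German sentence itself, and v is the same sentence too. Self-attention is applied, producing self.encoder.

The decoder input is the English side, shifted right by one position: the id 2 (which Kyubyong's vocabulary uses for the start symbol <S>) is placed in front and the last token is dropped, so the length stays 10. k and q are both the decoder input itself, again self-attention, producing self.decoder.

Where the decoder connects to the encoder, ordinary attention is used: q=self.decoder, k=self.encoder, v=self.encoder. softmax(q*k^T/sqrt(d_k))*v

The label y is the English side.

The last layer, which computes the probabilities, has shape=(32, 10, 8767), where 8767 is the vocabulary size. The position with the highest probability is taken as the prediction and compared with the label, i.e. with y. The decoder input gets special treatment: in the decoder's self-attention, future information is masked so that a position cannot see what comes after it. The result of the masking is a lower-triangular (10,10) matrix: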

1  0  0  0  0  0  0  0  0  0
1  6  0  0  0  0  0  0  0  0
1  6 10  0  0  0  0  0  0  0
1  6 10 13  0  0  0  0  0  0
1  6 10 13 15  0  0  0  0  0

... This means that the semantic vectors built from the 10 words of the source sentence on the encoder side, together with the first element of the first row, are used to predict the second element of the second row.

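At prediction time there is actually no decoder input. The first token is predicted from the encoder output alone, the second token is predicted from the encoder output plus the predicted first token, and so on until the 10th token has been predicted.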
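The step-by-step prediction described above can be sketched as a greedy decoding loop. The function toy_next_token_scores is a hypothetical stand-in for the trained model (it simply copies the source ids and then emits the end symbol); the ids 2 and 3 for start and end and the maximum length of 10 are assumptions for illustration.

```python
import numpy as np

START_ID, END_ID, MAX_LEN = 2, 3, 10

def toy_next_token_scores(source_ids, prefix, vocab_size=20):
    """Hypothetical model: returns one score per vocabulary id."""
    scores = np.zeros(vocab_size)
    step = len(prefix) - 1                 # tokens generated so far
    if step < len(source_ids):
        scores[source_ids[step]] = 1.0     # toy rule: copy the source
    else:
        scores[END_ID] = 1.0               # then stop
    return scores

def greedy_decode(source_ids):
    prefix = [START_ID]                    # nothing has been predicted yet
    while len(prefix) - 1 < MAX_LEN:
        scores = toy_next_token_scores(source_ids, prefix)
        next_id = int(np.argmax(scores))   # most probable next token
        prefix.append(next_id)             # feed it back in on the next step
        if next_id == END_ID:
            break
    return prefix[1:]

decoded = greedy_decode([7, 11, 5])
print(decoded)
```

Each iteration sees only the encoder side and the tokens already predicted, which is the situation the lower-triangular mask imitates during training.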

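import tensorflow as tf  # TensorFlow 1.x API (tf.Session)

# A (1, 4, 4) tensor; only its shape is used below.
a = tf.constant([[[2, 2, 12,22], [3, 3, 31,23],[3, 4, 5,6], [3, 9, 3,8]]], dtype=tf.float32)
diag_vals = tf.ones_like(a[0, :, :])  # (4, 4) matrix of ones
# Keep the lower triangle (including the diagonal) and zero out the rest.
tril = tf.linalg.LinearOperatorLowerTriangular(diag_vals).to_dense()

sess = tf.Session()
print(sess.run(tril))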


[[1. 0. 0. 0.]
 [1. 1. 0. 0.]
 [1. 1. 1. 0.]
 [1. 1. 1. 1.]]

For the decoder's self-attention, the data is transformed as in the program above: a (10,10) matrix is turned into a lower-triangular matrix.

masks = tf.tile(tf.expand_dims(tril, 0), [tf.shape(outputs)[0], 1, 1]) builds a tensor of shape (256,10,10), given embedding=512, 10 words per sentence, multihead=8, and 32 sentences per batch (8×32=256).

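The shape transformation: (32,10,512) → split the last dimension into 8 pieces of (32,10,512/8)=(32,10,64) → concatenate them along the first dimension to get (256,10,64) → q×k^T, i.e. (256,10,64)×(256,64,10) → (256,10,10).

In words: there are 32 sentences of 10 words each. The 512-dimensional space is split into eight 64-dimensional subspaces, and self-attention is computed in each subspace to give the weight of every word in a sentence with respect to the other words of the same sentence. In the original figure, 1 to 10 denote the 10 words of a sentence, and the values on the connecting lines are the weights computed by softmax.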
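The shape bookkeeping can be checked with a short NumPy sketch. It mirrors the split-and-concatenate pattern (split the last axis into heads, stack the heads along the batch axis) with random stand-in tensors; the variable names are illustrative.

```python
import numpy as np

N, T, C, h = 32, 10, 512, 8            # batch, words, embedding size, heads
rng = np.random.default_rng(0)
Q = rng.normal(size=(N, T, C))
K = rng.normal(size=(N, T, C))

# Split the last axis into h pieces and stack them along the batch axis.
Q_ = np.concatenate(np.split(Q, h, axis=2), axis=0)   # (h*N, T, C/h)
K_ = np.concatenate(np.split(K, h, axis=2), axis=0)   # (h*N, T, C/h)

# q x k^T, scaled by the square root of the per-head dimension.
scores = Q_ @ np.transpose(K_, (0, 2, 1)) / np.sqrt(C // h)
print(Q_.shape, scores.shape)
```

With this layout, rows 0 to 31 of the first axis hold head 0 of all 32 sentences, rows 32 to 63 hold head 1, and so on.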

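Key processing: there are 32 sentences of 10 words each, (32,10,512). Summing over the last dimension gives (32,10). For the 8 attention heads this is tiled into 8 identical copies of the (32,10) matrix, giving (256,10). Each row is then repeated to obtain (256,10,10), so the same sentence ends up in a tensor of shape (8×32,10,10). For example: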

a = tf.constant([[2, 2, 12,22],[3, 3, 31,23],[3, 4, 5,6]], dtype=tf.float32)  #3*4
tril = tf.tile(a,[8,1])
[[ 2.  2. 12. 22.]
 [ 3.  3. 31. 23.]
 [ 3.  4.  5.  6.]
 [ 2.  2. 12. 22.]
 [ 3.  3. 31. 23.]
 [ 3.  4.  5.  6.]
 [ 2.  2. 12. 22.]
 [ 3.  3. 31. 23.]
 [ 3.  4.  5.  6.]
 [ 2.  2. 12. 22.]
 [ 3.  3. 31. 23.]
 [ 3.  4.  5.  6.]
 [ 2.  2. 12. 22.]
 [ 3.  3. 31. 23.]
 [ 3.  4.  5.  6.]
 [ 2.  2. 12. 22.]
 [ 3.  3. 31. 23.]
 [ 3.  4.  5.  6.]
 [ 2.  2. 12. 22.]
 [ 3.  3. 31. 23.]
 [ 3.  4.  5.  6.]
 [ 2.  2. 12. 22.]
 [ 3.  3. 31. 23.]
 [ 3.  4.  5.  6.]]
tril = tf.tile(tf.expand_dims(tril, 1), [1, 2, 1]) #(24, 2, 4)
[[[ 2.  2. 12. 22.]
  [ 2.  2. 12. 22.]]

 [[ 3.  3. 31. 23.]
  [ 3.  3. 31. 23.]]

 [[ 3.  4.  5.  6.]
  [ 3.  4.  5.  6.]]

 [[ 2.  2. 12. 22.]
  [ 2.  2. 12. 22.]]

 [[ 3.  3. 31. 23.]
  [ 3.  3. 31. 23.]]

 [[ 3.  4.  5.  6.]
  [ 3.  4.  5.  6.]]

 [[ 2.  2. 12. 22.]
  [ 2.  2. 12. 22.]]

 [[ 3.  3. 31. 23.]
  [ 3.  3. 31. 23.]]

 [[ 3.  4.  5.  6.]
  [ 3.  4.  5.  6.]]

 [[ 2.  2. 12. 22.]
  [ 2.  2. 12. 22.]]

 [[ 3.  3. 31. 23.]
  [ 3.  3. 31. 23.]]

 [[ 3.  4.  5.  6.]
  [ 3.  4.  5.  6.]]

 [[ 2.  2. 12. 22.]
  [ 2.  2. 12. 22.]]

 [[ 3.  3. 31. 23.]
  [ 3.  3. 31. 23.]]

 [[ 3.  4.  5.  6.]
  [ 3.  4.  5.  6.]]

 [[ 2.  2. 12. 22.]
  [ 2.  2. 12. 22.]]

 [[ 3.  3. 31. 23.]
  [ 3.  3. 31. 23.]]

 [[ 3.  4.  5.  6.]
  [ 3.  4.  5.  6.]]

 [[ 2.  2. 12. 22.]
  [ 2.  2. 12. 22.]]

 [[ 3.  3. 31. 23.]
  [ 3.  3. 31. 23.]]

 [[ 3.  4.  5.  6.]
  [ 3.  4.  5.  6.]]

 [[ 2.  2. 12. 22.]
  [ 2.  2. 12. 22.]]

 [[ 3.  3. 31. 23.]
  [ 3.  3. 31. 23.]]

 [[ 3.  4.  5.  6.]
  [ 3.  4.  5.  6.]]]

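After this step the mask has shape (8×32,10,10). The 10 in the second dimension means that each sentence's mask row is repeated 10 times, once for each word. It is then combined with the weight matrix, which is also 10×10 for each sentence: the first row holds the weights between the first word and all the words, the second row holds the weights between the second word and all the words, and so on.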

[[25 55 95 36 34 88 47 19 35 84]
[23 82 51 11 79 34 73 90 37 23]
[47 4 4 21 3 77 72 9 29 26]
[47 96 34 49 27 71 4 86 73 24]
[92 99 85 37 44 67 28 67 90 30]
[73 23 34 47 88 48 33 54 79 77]
[87 79 45 56 58 25 16 60 77 22]
[19 8 67 96 84 31 13 21 76 61]
[49 5 56 7 75 0 12 9 56 93]
[72 99 56 3 2 79 52 70 17 79]]
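The key-mask steps described above can be sketched in NumPy on a small example. This assumes that padded positions have all-zero embeddings, so that the sum over the last axis is exactly 0 there; the sizes and the zeroed positions are made up for illustration.

```python
import numpy as np

N, T, C, h = 3, 4, 6, 2                 # small sizes for readability
keys = np.ones((N, T, C))
keys[1, 3:] = 0.0                       # sentence 1: last position is padding
keys[2, 2:] = 0.0                       # sentence 2: last two positions are padding

key_masks = np.sign(np.abs(keys.sum(axis=-1)))           # (N, T_k): 1 = real word
key_masks = np.tile(key_masks, (h, 1))                   # (h*N, T_k)
key_masks = np.tile(key_masks[:, None, :], (1, T, 1))    # (h*N, T_q, T_k)

scores = np.zeros((h * N, T, T))                         # stand-in for q x k^T
padding = np.full_like(scores, -2.0**32 + 1)
masked = np.where(key_masks == 0, padding, scores)       # hide padded keys
print(key_masks.shape)
print(masked[1])
```

Every query row of a sentence gets the same mask, so a padded key column is suppressed for all 10 (here 4) query positions at once.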

padding

[[[-4.2949673e+09 -4.2949673e+09 -4.2949673e+09 -4.2949673e+09]
  [-4.2949673e+09 -4.2949673e+09 -4.2949673e+09 -4.2949673e+09]
  [-4.2949673e+09 -4.2949673e+09 -4.2949673e+09 -4.2949673e+09]
  [-4.2949673e+09 -4.2949673e+09 -4.2949673e+09 -4.2949673e+09]]
.....................................

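masks = tf.tile(tf.expand_dims(tril, 0), [tf.shape(outputs)[0], 1, 1]) builds a tensor with the same shape, (256,10,10), as outputs. paddings = tf.ones_like(masks)*(-2**32+1) sets every element to -4.2949673e+09, a very large negative number. outputs = tf.where(tf.equal(masks, 0), paddings, outputs) takes (q×k^T)/(d^(1/2)) and replaces the entries where the mask is 0 with -4.2949673e+09, and outputs = tf.nn.softmax(outputs) then gives those entries a weight of practically zero.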

Then the query is masked:

a = tf.constant([[2, 2, 12,-2],[3, 3, 31,0],[3, 4, 0,0]], dtype=tf.float32)  #3*4  batchsize = 3
query_masks = tf.sign(tf.abs(a))
[[1. 1. 1. 1.]
 [1. 1. 1. 0.]
 [1. 1. 0. 0.]]

key_masks is expanded along axis 1, whereas query_masks is expanded along axis -1, i.e. the last dimension.
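A small NumPy sketch of the query mask, under the same assumption as for the keys (padded query positions have all-zero embeddings). The uniform weights are a placeholder for the softmax output, and all sizes are illustrative.

```python
import numpy as np

N, T_q, T_k, C, h = 3, 4, 4, 6, 2
queries = np.ones((N, T_q, C))
queries[1, 3:] = 0.0                    # sentence 1: last query is padding
queries[2, 2:] = 0.0                    # sentence 2: last two queries are padding

query_masks = np.sign(np.abs(queries.sum(axis=-1)))          # (N, T_q)
query_masks = np.tile(query_masks, (h, 1))                   # (h*N, T_q)
# Expand on the LAST axis, so each query row is kept or zeroed as a whole.
query_masks = np.tile(query_masks[:, :, None], (1, 1, T_k))  # (h*N, T_q, T_k)

weights = np.full((h * N, T_q, T_k), 0.25)   # placeholder softmax output
weights = weights * query_masks              # rows of padded queries become 0
print(weights[2])
```

The key mask removes columns before the softmax, while the query mask zeroes whole rows after it, which is why the two masks are expanded along different axes.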

In the mask processing, a tensor like the following is formed:

[[[1. 0. 0. 0.]
  [1. 1. 0. 0.]
  [1. 1. 1. 0.]
  [1. 1. 1. 1.]]

 [[1. 0. 0. 0.]
  [1. 1. 0. 0.]
  [1. 1. 1. 0.]
  [1. 1. 1. 1.]]

 [[1. 0. 0. 0.]
  [1. 1. 0. 0.]
  [1. 1. 1. 0.]
  [1. 1. 1. 1.]]

 [[1. 0. 0. 0.]
  [1. 1. 0. 0.]
  [1. 1. 1. 0.]
  [1. 1. 1. 1.]]]

After masking, the values of q×k^T/(d^(1/2)) look like the following output:

[[[ 2.0000000e-01 -4.2949673e+09 -4.2949673e+09 -4.2949673e+09]
  [ 5.1999998e-01  5.1999998e-01 -4.2949673e+09 -4.2949673e+09]
  [ 1.1200000e-01  1.1200000e-01  1.1200000e-01 -4.2949673e+09]
  [ 0.0000000e+00  0.0000000e+00  0.0000000e+00  0.0000000e+00]]

 [[ 4.3000001e-01 -4.2949673e+09 -4.2949673e+09 -4.2949673e+09]
  [ 3.0000001e-01  3.0000001e-01 -4.2949673e+09 -4.2949673e+09]
  [ 0.0000000e+00  0.0000000e+00  0.0000000e+00 -4.2949673e+09]
  [ 0.0000000e+00  0.0000000e+00  0.0000000e+00  0.0000000e+00]]

 [[ 3.0000001e-01 -4.2949673e+09 -4.2949673e+09 -4.2949673e+09]
  [ 4.0000001e-01  4.0000001e-01 -4.2949673e+09 -4.2949673e+09]
  [ 5.0000000e-01  5.0000000e-01  5.0000000e-01 -4.2949673e+09]
  [ 6.0000002e-01  6.0000002e-01  6.0000002e-01  6.0000002e-01]]

 [[ 8.9999998e-01 -4.2949673e+09 -4.2949673e+09 -4.2949673e+09]
  [ 4.3000001e-01  4.3000001e-01 -4.2949673e+09 -4.2949673e+09]
  [ 5.0999999e-01  5.0999999e-01  5.0999999e-01 -4.2949673e+09]
  [ 6.2000000e-01  6.2000000e-01  6.2000000e-01  6.2000000e-01]]]

The result is lower-triangular, which is what masks out the future information.
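To see why the large negative value works, here is a minimal NumPy sketch that applies a lower-triangular mask to a (4,4) score matrix and then takes the softmax. The scores are arbitrary illustrative numbers, and np.tril stands in for the LinearOperatorLowerTriangular call.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

T = 4
scores = np.array([[0.2, 0.52, 0.112, 0.0],
                   [0.43, 0.3, 0.0, 0.0],
                   [0.3, 0.4, 0.5, 0.6],
                   [0.9, 0.43, 0.51, 0.62]])
tril = np.tril(np.ones((T, T)))                   # 1 on and below the diagonal
padding = np.full_like(scores, -2.0**32 + 1)      # about -4.29e9
masked = np.where(tril == 0, padding, scores)     # hide future positions
weights = softmax(masked)                         # exp(-4.29e9) underflows to 0
print(np.round(weights, 3))
```

Row i of weights has nonzero entries only in columns 0 to i, so word i attends to itself and earlier words but never to later ones.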

Example run of the functions:

#-*-coding:utf-8-*-
import tensorflow as tf
# x = [[1,2,3],[4,5,6]]
# y = [[7,8,9],[10,11,12]]
# condition3 = [[True,False,False],
#              [False,True,True]]
# condition4 = [[True,False,False],
#              [True,False,False]]
# with tf.Session() as sess:
#     print(sess.run(tf.where(condition3,x,y)))
#     print(sess.run(tf.where(condition4,x,y)))

#shape=(2, 2, 4)
# a = tf.constant([[[2, 2, 12,22], [3, 3, 31,23]],[[3, 4, 5,6], [3, 9, 3,8]]], dtype=tf.int32)
# b = tf.constant([[1, 7], [2, 9]], dtype=tf.int32)
# c = tf.constant([[1, 11], [2, 12]], dtype=tf.float32)
# oneHot = tf.one_hot(a,depth=35)
# # print(a)
# # print(b)
# # print(c)
# # d = tf.matmul(b,c)
# # Q_ = tf.concat(tf.split(a, 2, axis=2),axis=0) # (h*N, T_q, C/h)
# # print(Q_)
# with tf.Session() as sess:
#     # print(sess.run(tf.split(a, 2, axis=2)))
#     # print(Q_ )
#     # print(sess.run(Q_))
#     print(sess.run(oneHot))#, [3, 3, 31,23],[3, 4, 5,6], [3, 9, 3,8]
a = tf.constant([[2, 2, 12,-2],[3, 3, 31,0],[3, 4, 0,0],[3, 9, 3,8]], dtype=tf.float32)  # 4*4; only the shape is used
q = tf.constant([[2, 2, 12,-2],[3, 3, 31,0],[3, 4, 0,0]], dtype=tf.float32)  # 3*4, batchsize = 3 (only used by the commented-out query-mask code)

diag_vals = tf.ones_like(a)  # (4, 4) matrix of ones
outputs =  tf.constant([[0.2, 0.52, 0.112,0],[0.43, 0.3, 0,0],[0.3, 0.4, 0.5,0.6],[0.9, 0.43, 0.51,0.62]], dtype=tf.float32)  # stand-in for the attention scores
tril = tf.linalg.LinearOperatorLowerTriangular(diag_vals).to_dense()  # (4, 4) lower-triangular mask
masks = tf.tile(tf.expand_dims(tril, 0), [tf.shape(tril)[0], 1, 1])  # (4, 4, 4)
paddings = tf.ones_like(masks)*(-2**32+1)  # -4294967295 everywhere

#tril = tf.expand_dims(a, 0)
outputs = tf.tile(outputs,[1,1])  # no-op: multiples of 1 leave the shape at (4, 4)
outputs =  tf.tile(tf.expand_dims(outputs, -1), [1, 1, 4])  # (4, 4, 4): each score repeated along the last axis
# print(tf.Session().run(outputs))
# query_masks = tf.sign(tf.abs(q))
# query_masks = tf.tile(query_masks, [8, 1]) # (h*N, T_q)
# query_masks = tf.tile(tf.expand_dims(query_masks, -1), [1, 1, 4])

# tril = tf.tile(a,[8,1]) #(24,4)===(3*8,4)
# masks = tf.tile(tf.expand_dims(tril, 1), [1, 4, 1]) #(24, 2, 4)
#
# paddings = tf.ones_like(masks)*(-2**32+1)  #-4 294 967 295
#
# outputs = tf.where(tf.equal(masks, 0), paddings, outputs)
sess = tf.Session()
# # outputs = tf.where(tf.equal(masks, 0), paddings, outputs)
# outputs *=query_masks
print(sess.run(tf.where(tf.equal(masks, 0), paddings, outputs)))
# print(sess.run(tf.equal(masks, 0)))
# print("---",sess.run(outputs))

# import numpy as np
#
# num = []
# arr = np.random.random((10,10))
#
# print(arr)

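The final output is outputs×v, which completes the whole attention computation. (In traditional attention, v and k come from the same sentence, while q comes from the target sentence.) When predicting the target sentence, the future information in the target sentence must be masked so that it cannot influence the prediction.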
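The last step, multiplying the weights by v and putting the heads back together, can be sketched in NumPy as follows. Uniform weights are used as a placeholder for the masked softmax output so that the result is easy to check; the sizes follow the (32,10,512) example and the names are illustrative.

```python
import numpy as np

N, T, C, h = 32, 10, 512, 8
rng = np.random.default_rng(0)
V = rng.normal(size=(N, T, C))

# Same head layout as for q and k: (h*N, T, C/h).
V_ = np.concatenate(np.split(V, h, axis=2), axis=0)

weights = np.full((h * N, T, T), 1.0 / T)   # placeholder attention weights
outputs = weights @ V_                      # (h*N, T, C/h) = (256, 10, 64)

# Undo the head split: cut the batch axis into h pieces, join on the last axis.
outputs = np.concatenate(np.split(outputs, h, axis=0), axis=2)   # (N, T, C)
print(outputs.shape)
```

With uniform weights every output position is simply the mean of the value vectors of its sentence, which gives a quick sanity check that the heads were merged in the right order.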

self.target: nonzero positions become 1, and positions that are 0 stay 0.

[[1. 1. 1. 1.]
 [1. 1. 1. 0.]
 [1. 1. 0. 0.]
 [1. 1. 1. 1.]]
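One plausible use of such a 0/1 target mask is to ignore padded positions when comparing predictions with the labels. The sketch below assumes that id 0 is padding; the names istarget, preds, and acc and the sample ids are illustrative, not quoted from the original code.

```python
import numpy as np

y = np.array([[15, 6, 191, 3],          # labels (0 = padding)
              [78, 3, 0, 0]])
preds = np.array([[15, 6, 7, 3],        # made-up argmax predictions
                  [78, 3, 5, 9]])

istarget = (y != 0).astype(np.float32)              # 1 for real words, 0 for padding
correct = (preds == y).astype(np.float32) * istarget
acc = correct.sum() / istarget.sum()                # accuracy over real words only
print(istarget)
print(acc)
```

Whatever the model predicts at the padded positions (5 and 9 here) has no effect on the score.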


Representing attention in terms of k, q, and v:

Multiplicative vs. additive attention

self-attention

Layer normalization (LN)

encoder:

decoder:

multi-head attention:
