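1 Self Attention

Self Attention is also often called intra-attention. It has come into fairly wide use over the past year; for example, Google's latest machine translation model uses Self Attention extensively.

In the Encoder-Decoder framework used for typical tasks, the input Source and the output Target differ. In English-to-Chinese machine translation, for example, Source is the English sentence and Target is its Chinese translation, and the Attention mechanism operates between a Query element of Target and all elements of Source.

Self Attention, as the name suggests, is not Attention between Target and Source. It is Attention among the elements inside Source, or among the elements inside Target, and can be viewed as the special case Target = Source. The computation itself is unchanged; only the objects it is applied to differ, so the details are not repeated here.

In the usual case where Target differs from Source, the physical meaning of attention is as described earlier: in machine translation it is essentially a word-alignment mechanism between target-language words and source-language words. For Self Attention a natural question follows: what regularities does it learn, or what features does it extract? Put differently, what is gained by introducing it? We again use Self Attention in machine translation as the example. Figures 1 and 2 visualize the connections that Self Attention forms between words of the same English sentence.

Figure 1: Visualization of Self Attention (example)

Figure 2: Visualization of Self Attention (example)

The two figures show that Self Attention can capture syntactic features between words in the same sentence (such as the phrase structure spanning some distance in Figure 1) and semantic features (such as the referent of "its", namely "Law", in Figure 2).

Clearly, Self Attention makes it easier to capture long-range dependencies within a sentence. An RNN or LSTM has to process the sequence step by step, so two mutually dependent features that are far apart are linked only after information has been accumulated over many time steps, and the farther apart they are, the less likely the dependency is captured effectively.

Self Attention, in contrast, links any two words in the sentence directly in a single computation step, so the distance between long-range dependent features is greatly shortened, which makes them easier to exploit. Self Attention also directly increases the parallelism of the computation. These are the main reasons Self Attention is being adopted so widely.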

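2 The Core Idea of the Attention Mechanism

Detaching the Attention mechanism from the Encoder-Decoder framework and abstracting it one step further makes its core idea easier to see.

Figure: The core idea of the Attention mechanism

We can view the Attention mechanism as follows. Imagine that the elements of Source consist of a series of <Key,Value> pairs. Given an element Query of Target, we compute the similarity or relevance between Query and each Key to obtain the weight coefficient of the Value belonging to that Key, and then take the weighted sum of the Values, which gives the final Attention value. In essence, the Attention mechanism is a weighted sum over the Values of the elements in Source, with Query and Key used to compute the weight coefficient of each Value. The core idea can be written as the following formula:

$$\mathrm{Attention}(Query, Source) = \sum_{i=1}^{L_x} \mathrm{Similarity}(Query, Key_i) \cdot Value_i$$

where $L_x = \|Source\|$ is the length of Source, and the formula means exactly what is described above.

In the machine translation example given earlier, Key and Value in Source coincide during the Attention computation: both refer to the same thing, the semantic encoding of each word in the input sentence. That is why this underlying structure may not be easy to spot there.

Conceptually, it is still valid to think of Attention as selectively picking out a small amount of important information from a large amount of information, focusing on it, and ignoring most of the unimportant rest. The focusing shows up in the computation of the weight coefficients: the larger the weight, the more the mechanism focuses on the corresponding Value. The weight represents the importance of a piece of information, and the Value is the information itself.

The figure suggests another reading: the Attention mechanism can be seen as a kind of soft addressing. Source is the content stored in a memory, whose elements consist of an address Key and a value Value. A query with Key = Query arrives, and the goal is to retrieve the corresponding Value from the memory, that is, the Attention value. Addressing is done by comparing Query against the Key address of each element in the memory. It is called soft addressing because, unlike ordinary addressing, it does not pick a single entry out of the stored content; it may retrieve content from every Key address, with the importance of each piece determined by the similarity between Query and Key. The Values are then summed with these weights to give the final Value, that is, the Attention value. Many researchers therefore regard the Attention mechanism as a special case of soft addressing, which is quite reasonable.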
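To make the contrast between ordinary and soft addressing concrete, here is a minimal pure-Python sketch. The names `hard_lookup` and `soft_lookup` and the toy keys and values are assumptions chosen for illustration; the dot product is used as the similarity and softmax as the normalization, which is only one of several possible choices.

```python
import math

def hard_lookup(memory, query):
    # Ordinary addressing: exactly one slot matches and only its value is returned.
    return memory[query]

def soft_lookup(keys, values, query):
    # Soft addressing: every slot contributes, weighted by how similar its key is to the query.
    scores = [sum(q * k for q, k in zip(query, key)) for key in keys]
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(values[0])
    blended = [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]
    return blended, weights

# Hard lookup: the key must match exactly.
exact = hard_lookup({"cat": [10.0], "dog": [20.0]}, "cat")

# Soft lookup: keys and the query are vectors, and all three values are blended.
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
values = [[10.0], [20.0], [30.0]]
blended, weights = soft_lookup(keys, values, query=[1.0, 0.0])
print(exact, blended, weights)
```

The first and third keys are equally similar to the query, so they receive equal weights, while the second key still contributes a smaller share instead of being ignored.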

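As for the concrete computation, abstracting over most current methods yields two processes:

The first process computes the weight coefficients from Query and Key, and splits into two stages:

Stage 1 computes the similarity or relevance between Query and Key;
Stage 2 normalizes the raw scores from stage 1.

The second process computes the weighted sum of the Values using the weight coefficients.

The Attention computation can therefore be abstracted into the three stages shown in Figure 10.

Figure 10: Computing Attention in three stages

In the first stage, different functions and mechanisms can be used to compute the similarity or relevance between Query and a given $Key_i$. The most common choices are the dot product of the two vectors, their cosine similarity, or an additional neural network that produces the score:

Dot product: $$\mathrm{Similarity}(Query, Key_i) = Query \cdot Key_i$$

Cosine similarity: $$\mathrm{Similarity}(Query, Key_i) = \frac{Query \cdot Key_i}{\|Query\| \, \|Key_i\|}$$

MLP network: $$\mathrm{Similarity}(Query, Key_i) = \mathrm{MLP}(Query, Key_i)$$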
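The first two scoring functions can be sketched in a few lines of pure Python. The vectors below are example values chosen for illustration, and the MLP variant is left out because it needs trained parameters.

```python
import math

def dot_similarity(query, key):
    # Raw dot product: grows with the vector lengths as well as with their alignment.
    return sum(q * k for q, k in zip(query, key))

def cosine_similarity(query, key):
    # Dot product divided by both norms: depends only on the angle, range [-1, 1].
    norm_q = math.sqrt(sum(q * q for q in query))
    norm_k = math.sqrt(sum(k * k for k in key))
    return dot_similarity(query, key) / (norm_q * norm_k)

query = [1.0, 0.0, 2.0]
keys = [[0.0, 1.0, 1.0], [4.0, 4.0, 0.0], [2.0, 3.0, 1.0]]

dot_scores = [dot_similarity(query, key) for key in keys]
cos_scores = [cosine_similarity(query, key) for key in keys]
print(dot_scores)
print([round(c, 3) for c in cos_scores])
```

The two functions need not rank the keys the same way: here the dot product favours the second and third keys, which have large norms, while the cosine similarity favours the first.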

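The range of the scores produced in the first stage depends on the method used. The second stage applies a SoftMax-like computation to convert them. On one hand this normalizes the raw scores into a probability distribution in which the weights of all elements sum to 1; on the other hand, the inner workings of SoftMax give even more prominence to the weights of the important elements. Writing $Sim_i = \mathrm{Similarity}(Query, Key_i)$ and summing over all $L_x$ elements of Source, the usual formula is:

$$a_i = \mathrm{Softmax}(Sim_i) = \frac{e^{Sim_i}}{\sum_{j=1}^{L_x} e^{Sim_j}}$$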
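A small sketch of the two effects just described, comparing SoftMax with plain division by the sum. `linear_normalize` is a hypothetical baseline added only for comparison, and the scores are made up.

```python
import math

def linear_normalize(scores):
    # Baseline for comparison only: divide by the sum (valid for positive scores).
    total = sum(scores)
    return [s / total for s in scores]

def softmax(scores):
    # Subtracting the maximum avoids overflow and does not change the result.
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

scores = [1.0, 2.0, 4.0]
linear = linear_normalize(scores)
soft = softmax(scores)
print([round(w, 3) for w in linear])
print([round(w, 3) for w in soft])
```

Both results sum to 1, but SoftMax gives the largest score a bigger share and the smallest score a smaller share than the linear baseline does.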

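The result of the second stage, $a_i$, is the weight coefficient of $Value_i$. A weighted sum then gives the Attention value:

$$\mathrm{Attention}(Query, Source) = \sum_{i=1}^{L_x} a_i \cdot Value_i$$

These three stages yield the Attention value for a given Query. The vast majority of concrete attention methods in use today fit this three-stage abstraction.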
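The three stages can be put together in one small function. This is a minimal sketch under stated assumptions: the function name `attention`, the pluggable `similarity` argument and the toy data are all illustrative and not taken from any library.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def softmax(scores):
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention(query, keys, values, similarity=dot):
    # Stage 1: raw similarity between the query and every key.
    scores = [similarity(query, key) for key in keys]
    # Stage 2: normalize the raw scores into weights that sum to 1.
    weights = softmax(scores)
    # Stage 3: weighted sum of the values.
    dim = len(values[0])
    output = [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]
    return output, weights

keys = [[1.0, 0.0], [0.0, 1.0]]
values = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
output, weights = attention([5.0, 0.0], keys, values)
print(output, weights)
```

Because the query matches the first key far better than the second, almost all of the weight lands on the first value. Keys and values may have different dimensions; the output always has the dimension of the values.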

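3 Self Attention Model and Implementation

With the core idea of Attention laid out above, the Self Attention model introduced in this section is easier to understand.

With the concepts of query, key and value in place, self-attention is straightforward:
1. Each input word (the words a sentence is split into for translation) is embedded into a fixed-length vector x before it enters the model. The vectors for all words of a sentence form the input matrix X.
2. Three weight matrices WQ, WK and WV are randomly initialized, corresponding to query, key and value.
3. For an input x, multiplying x by WQ gives its query, multiplying by WK gives its key, and multiplying by WV gives its value.
Every x in the sentence thus gets its own query, key and value.

Each x uses its own query to compute a score against every key (its own included), then uses those scores with the corresponding values to compute its attention vector C. Through this computation, x becomes C.

The z in the original figure corresponds to C. In self-attention, the three matrices Q (Query), K (Key) and V (Value) all come from the same input. To keep the dot products from growing too large, they are divided by a scaling factor $\sqrt{d_k}$, where $d_k$ is the dimension of the key vectors.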
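The effect of the scaling factor is easy to see numerically. In the sketch below the raw scores are hypothetical dot products for keys of dimension 64; they are made-up numbers, not the output of a real model.

```python
import math

def softmax(scores):
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

d_k = 64
# Hypothetical raw dot products between one query and three keys of dimension d_k.
raw_scores = [8.0, 16.0, 24.0]
scaled_scores = [s / math.sqrt(d_k) for s in raw_scores]

unscaled = softmax(raw_scores)
scaled = softmax(scaled_scores)
print([round(w, 4) for w in unscaled])
print([round(w, 4) for w in scaled])
```

Without scaling, softmax puts nearly all of the weight on a single key, which leaves very small gradients for the others. After dividing by the square root of d_k the distribution is much smoother.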

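If all input vectors are stacked into a matrix X, the query, key and value vectors can likewise be written in matrix form:

$$Q = X W^Q, \quad K = X W^K, \quad V = X W^V$$

where WQ, WK and WV are parameters learned during training. The whole operation then reduces to the matrix form

$$Z = \mathrm{softmax}\!\left(\frac{Q K^T}{\sqrt{d_k}}\right) V$$

with $d_k$ the dimension of the key vectors.

Likewise, several self-attention layers can be stacked: the same operation with different QKV matrices turns C into CC, then into CCC. That is self-attention.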
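The matrix form and the stacking of layers can be sketched without any framework. The helpers (`matmul`, `transpose`, `softmax_rows`, `random_matrix`), the sizes and the random weights are all assumptions made for illustration; a real model would learn the weights in training.

```python
import math
import random

def matmul(a, b):
    # Plain-Python matrix product of a (n x m) and b (m x p).
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def transpose(m):
    return [list(col) for col in zip(*m)]

def softmax_rows(m):
    rows = []
    for row in m:
        top = max(row)
        exps = [math.exp(v - top) for v in row]
        total = sum(exps)
        rows.append([e / total for e in exps])
    return rows

def self_attention(x, w_q, w_k, w_v):
    # Q, K and V all come from the same input x.
    q, k, v = matmul(x, w_q), matmul(x, w_k), matmul(x, w_v)
    d_k = len(k[0])
    scores = [[s / math.sqrt(d_k) for s in row] for row in matmul(q, transpose(k))]
    return matmul(softmax_rows(scores), v)

def random_matrix(rows, cols, rng):
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

rng = random.Random(0)
seq_len, d_model = 3, 4
x = random_matrix(seq_len, d_model, rng)
layer1 = [random_matrix(d_model, d_model, rng) for _ in range(3)]  # WQ, WK, WV of layer 1
layer2 = [random_matrix(d_model, d_model, rng) for _ in range(3)]  # separate weights for layer 2

c = self_attention(x, *layer1)    # x  -> C
cc = self_attention(c, *layer2)   # C  -> CC
print(len(c), len(c[0]), len(cc), len(cc[0]))
```

Because the weight matrices here are square, each layer keeps the shape of its input, which is what allows the layers to be stacked.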

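Self-attention acts like a vector transformation: x becomes c, with the same dimension but different values, and the transformation encodes the relationship between x and the other x in its context. An RNN can also turn x into another vector while taking context into account, but it has the drawback of recurrent networks: it cannot be parallelized. A transformer built from self-attention can run in parallel, because it does not have to wait for the preceding hidden state h before computing C; C is computed directly from the QKV matrices and the current x.
How are the QKV matrices obtained? They are randomly initialized and then learned in training.

Implementing a custom layer in tf.keras requires the following three methods (note that input_shape includes the batch_size dimension):

build(input_shape): this is where you define the weights. The method must set self.built = True, which can be done by calling super([Layer], self).build().
call(x): this is where the layer's logic lives. You only need to care about the first argument passed to call, the input tensor, unless you want your layer to support masking.
compute_output_shape(input_shape): if your layer changes the shape of its input, define the shape transformation here so that Keras can infer the shapes of the layers automatically.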

#! -*- coding: utf-8 -*-

import tensorflow.keras.backend as K
import tensorflow as tf


class Position_Embedding(tf.keras.layers.Layer):

    def __init__(self, size=None, mode='sum', **kwargs):
        self.size = size  # must be even
        self.mode = mode
        super(Position_Embedding, self).__init__(**kwargs)

    def call(self, x):
        if (self.size is None) or (self.mode == 'sum'):
            self.size = int(x.shape[-1])
        batch_size, seq_len = K.shape(x)[0], K.shape(x)[1]
        position_j = 1. / K.pow(10000., 2 * K.arange(self.size // 2, dtype='float32') / self.size)
        position_j = K.expand_dims(position_j, 0)
        position_i = K.cumsum(K.ones_like(x[:, :, 0]), 1) - 1  # K.arange does not support variable length, so positions are generated this way
        position_i = K.expand_dims(position_i, 2)
        position_ij = K.dot(position_i, position_j)
        position_ij = K.concatenate([K.cos(position_ij), K.sin(position_ij)], 2)
        if self.mode == 'sum':
            return position_ij + x
        elif self.mode == 'concat':
            return K.concatenate([position_ij, x], 2)

    def compute_output_shape(self, input_shape):
        if self.mode == 'sum':
            return input_shape
        elif self.mode == 'concat':
            return (input_shape[0], input_shape[1], input_shape[2] + self.size)


class Attention(tf.keras.layers.Layer):

    def __init__(self, output_dim, **kwargs):
        self.output_dim = output_dim
        super(Attention, self).__init__(**kwargs)


    def build(self, input_shape):
        self.WQ = self.add_weight(name='WQ',
                                  shape=(input_shape[-1], self.output_dim),
                                  initializer='glorot_uniform',
                                  trainable=True)
        self.WK = self.add_weight(name='WK',
                                  shape=(input_shape[-1], self.output_dim),
                                  initializer='glorot_uniform',
                                  trainable=True)
        self.WV = self.add_weight(name='WV',
                                  shape=(input_shape[-1], self.output_dim),
                                  initializer='glorot_uniform',
                                  trainable=True)
        super(Attention, self).build(input_shape)

    def call(self, x):
        # Linear projections that produce Q, K and V
        Q_seq = K.dot(x, self.WQ)
        K_seq = K.dot(x, self.WK)
        V_seq = K.dot(x, self.WV)
        print("\n")
        print("--"*25)
        print("Q_seq.shape: ", Q_seq.shape)
        print("K.permute_dimensions(K_seq, [0, 2, 1]).shape: ",K.permute_dimensions(K_seq, [0, 2, 1]).shape)

        QK = K.batch_dot(Q_seq, K.permute_dimensions(K_seq, [0, 2, 1]))
        # Scale by the square root of the key dimension (output_dim), not the input dimension
        QK = QK / self.output_dim ** 0.5
        QK = K.softmax(QK)
        print("QK.shape: ",QK.shape)
        Z_seq = K.batch_dot(QK, V_seq)
        print("Z_seq.shape: ",Z_seq.shape)
        print("=="*25)

        return Z_seq

    def compute_output_shape(self, input_shape):
        return (input_shape[0], input_shape[1], self.output_dim)

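# Test script on the IMDB dataset (a separate file that imports the layers above from attention.py)
from __future__ import print_function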

import tensorflow  as tf

import tensorflow.keras.datasets.imdb as imdb
import tensorflow.keras.preprocessing.sequence as sequence

from attention import Position_Embedding, Attention

max_features = 10000
maxlen = 80
batch_size = 32

print('Loading data...')
(x_train, y_train), (x_test, y_test) = imdb.load_data(num_words=max_features)
print(len(x_train), 'train sequences')
print(len(x_test), 'test sequences')

print('Pad sequences (samples x time)')
x_train = sequence.pad_sequences(x_train, maxlen=maxlen)
x_test = sequence.pad_sequences(x_test, maxlen=maxlen)
print('x_train shape:', x_train.shape)
print('x_test shape:', x_test.shape)


S_inputs = tf.keras.layers.Input(shape=(None,), dtype='int32')
embeddings = tf.keras.layers.Embedding(max_features, 128)(S_inputs)
# Adding Position_Embedding improves accuracy slightly
embeddings = Position_Embedding()(embeddings)

O_seq = Attention(8)(embeddings)
O_seq = tf.keras.layers.GlobalAveragePooling1D()(O_seq)
O_seq = tf.keras.layers.Dropout(0.5)(O_seq)
outputs = tf.keras.layers.Dense(1, activation='sigmoid')(O_seq)

model = tf.keras.Model(inputs=S_inputs, outputs=outputs)
print(model.summary())

# try using different optimizers and different optimizer configs
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

print('Train...')
model.fit(x_train, y_train,
          batch_size=batch_size,
          epochs=1,
          validation_data=(x_test, y_test))

score, acc = model.evaluate(x_test, y_test, batch_size=batch_size)
print('Test score:', score)
print('Test accuracy:', acc)

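The multi-head step, put plainly, repeats the Scaled Dot-Product Attention step above several times and concatenates the results of all runs. Note that the parameters are not shared between the repeated runs of Scaled Dot-Product Attention.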
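Before the Keras version, the idea can be sketched in plain Python: each head has its own projection matrices and the per-head outputs are joined along the feature axis. The helper names, the sizes (2 heads of size 8 on a 4-dimensional input) and the random weights are assumptions for illustration. The sketch mirrors the description above and omits the final output projection that the Transformer paper applies after concatenation.

```python
import math
import random

def matmul(a, b):
    # Plain-Python matrix product of a (n x m) and b (m x p).
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def softmax_rows(m):
    rows = []
    for row in m:
        top = max(row)
        exps = [math.exp(v - top) for v in row]
        total = sum(exps)
        rows.append([e / total for e in exps])
    return rows

def scaled_dot_product_attention(q, k, v):
    d_k = len(k[0])
    k_t = [list(col) for col in zip(*k)]
    scores = [[s / math.sqrt(d_k) for s in row] for row in matmul(q, k_t)]
    return matmul(softmax_rows(scores), v)

def multi_head_attention(x, heads):
    # heads: one (w_q, w_k, w_v) tuple per head; the parameters are not shared.
    per_head = [scaled_dot_product_attention(matmul(x, w_q), matmul(x, w_k), matmul(x, w_v))
                for w_q, w_k, w_v in heads]
    # Concatenate the head outputs position by position along the feature axis.
    return [[value for head in per_head for value in head[i]] for i in range(len(x))]

def random_matrix(rows, cols, rng):
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

rng = random.Random(0)
seq_len, d_model, head_num, head_size = 3, 4, 2, 8
x = random_matrix(seq_len, d_model, rng)
heads = [tuple(random_matrix(d_model, head_size, rng) for _ in range(3))
         for _ in range(head_num)]
z = multi_head_attention(x, heads)
print(len(z), len(z[0]))  # 3 16
```

The output width is head_num * head_size, which matches the output_dim computed in the Keras layer that follows.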

#! -*- coding: utf-8 -*-

from __future__ import absolute_import, division, print_function
import tensorflow as tf
import tensorflow.keras.layers as layers
import tensorflow.keras.backend as K


class Position_Embedding(layers.Layer):

    def __init__(self, size=None, mode='sum', **kwargs):
        self.size = size  # must be even
        self.mode = mode
        super(Position_Embedding, self).__init__(**kwargs)

    def call(self, x):
        if (self.size is None) or (self.mode == 'sum'):
            self.size = int(x.shape[-1])
        batch_size, seq_len = K.shape(x)[0], K.shape(x)[1]
        position_j = 1. / K.pow(10000., 2 * K.arange(self.size // 2, dtype='float32') / self.size)
        position_j = K.expand_dims(position_j, 0)
        position_i = K.cumsum(K.ones_like(x[:, :, 0]), 1) - 1  # K.arange does not support variable length, so positions are generated this way
        position_i = K.expand_dims(position_i, 2)
        position_ij = K.dot(position_i, position_j)
        position_ij = K.concatenate([K.cos(position_ij), K.sin(position_ij)], 2)
        if self.mode == 'sum':
            return position_ij + x
        elif self.mode == 'concat':
            return K.concatenate([position_ij, x], 2)

    def compute_output_shape(self, input_shape):
        if self.mode == 'sum':
            return input_shape
        elif self.mode == 'concat':
            return (input_shape[0], input_shape[1], input_shape[2] + self.size)


class Attention(layers.Layer):

    def __init__(self, head_num, head_size, **kwargs):
        self.head_num = head_num
        self.head_size = head_size
        self.output_dim = head_num * head_size
        super(Attention, self).__init__(**kwargs)

    def build(self, input_shape):
        self.WQ = self.add_weight(name='WQ',
                                  shape=(input_shape[-1], self.output_dim),
                                  initializer='glorot_uniform',
                                  trainable=True)
        self.WK = self.add_weight(name='WK',
                                  shape=(input_shape[-1], self.output_dim),
                                  initializer='glorot_uniform',
                                  trainable=True)
        self.WV = self.add_weight(name='WV',
                                  shape=(input_shape[-1], self.output_dim),
                                  initializer='glorot_uniform',
                                  trainable=True)
        super(Attention, self).build(input_shape)

    def call(self, x):
        # Linear projections that produce Q, K and V, then split into heads
        print("\n")
        print("--"*25)
        Q_seq = K.dot(x, self.WQ)
        Q_seq = K.reshape(Q_seq, (-1, K.shape(Q_seq)[1], self.head_num, self.head_size))
        Q_seq = K.permute_dimensions(Q_seq, (0, 2, 1, 3))
        print("Q_seq.shape: ", Q_seq.shape)

        K_seq = K.dot(x, self.WK)
        K_seq = K.reshape(K_seq, (-1, K.shape(K_seq)[1], self.head_num, self.head_size))
        K_seq = K.permute_dimensions(K_seq, (0, 2, 1, 3))
        print("K_seq.shape: ", K_seq.shape)

        V_seq = K.dot(x, self.WV)
        V_seq = K.reshape(V_seq, (-1, K.shape(V_seq)[1], self.head_num, self.head_size))
        V_seq = K.permute_dimensions(V_seq, (0, 2, 1, 3))
        print("V_seq.shape: ", V_seq.shape)

        # Compute the scaled dot products, then apply softmax
        QK_seq = tf.matmul(Q_seq, K.permute_dimensions(K_seq, (0, 1, 3, 2))) / self.head_size ** 0.5
        QK_seq = K.softmax(QK_seq)
        print("QK_seq.shape: ", QK_seq.shape)

        Z_seq = tf.matmul(QK_seq, V_seq)
        Z_seq = K.permute_dimensions(Z_seq, (0, 2, 1, 3))
        Z_seq = K.reshape(Z_seq, (-1, K.shape(Z_seq)[1], self.output_dim))
        print("Z_seq.shape: ", Z_seq.shape)

        print("-="*25)
        return Z_seq

    def compute_output_shape(self, input_shape):
        return (input_shape[0], input_shape[1], self.output_dim)

In the IMDB test script, change the attention line to

O_seq = Attention(2, 8)(embeddings)

4 Self Attention: Animated Walkthrough and Code Walkthrough

Animated walkthrough

Step 1: Prepare inputs

For this tutorial, we start with 3 inputs, each with dimension 4.

Input 1: [1, 0, 1, 0] 
Input 2: [0, 2, 0, 2]
Input 3: [1, 1, 1, 1]

Step 2: Initialise weights

Every input must have three representations (see diagram below). These representations are called key (orange), query (red), and value (purple). For this example, let’s take that we want these representations to have a dimension of 3. Because every input has a dimension of 4, this means each set of the weights must have a shape of 4×3.

(the dimension of value is also the dimension of the output.)

In order to obtain these representations, every input (green) is multiplied with a set of weights for keys, a set of weights for querys (I know that’s not the right spelling), and a set of weights for values. In our example, we ‘initialise’ the three sets of weights as follows.

Weights for key:

[[0, 0, 1],
 [1, 1, 0],
 [0, 1, 0],
 [1, 1, 0]]

Weights for query:

[[1, 0, 1],
 [1, 0, 0],
 [0, 0, 1],
 [0, 1, 1]]

Weights for value:

[[0, 2, 0],
 [0, 3, 0],
 [1, 0, 3],
 [1, 1, 0]]

PS: In a neural network setting, these weights are usually small numbers, initialised randomly using an appropriate scheme such as Gaussian, Xavier or Kaiming initialisation.

Step 3: Derive key, query and value

Now that we have the three sets of weights, let’s actually obtain the key, query and value representations for every input.

Key representation for Input 1:

               [0, 0, 1]
[1, 0, 1, 0] x [1, 1, 0] = [0, 1, 1]
               [0, 1, 0]
               [1, 1, 0]

Use the same set of weights to get the key representation for Input 2:

               [0, 0, 1]
[0, 2, 0, 2] x [1, 1, 0] = [4, 4, 0]
               [0, 1, 0]
               [1, 1, 0]

Use the same set of weights to get the key representation for Input 3:

               [0, 0, 1]
[1, 1, 1, 1] x [1, 1, 0] = [2, 3, 1]
               [0, 1, 0]
               [1, 1, 0]

1. A faster way is to vectorise the above key operations:

               [0, 0, 1]
[1, 0, 1, 0]   [1, 1, 0]   [0, 1, 1]
[0, 2, 0, 2] x [0, 1, 0] = [4, 4, 0]
[1, 1, 1, 1]   [1, 1, 0]   [2, 3, 1]

2. Let’s do the same to obtain the value representations for every input:

               [0, 2, 0]
[1, 0, 1, 0]   [0, 3, 0]   [1, 2, 3] 
[0, 2, 0, 2] x [1, 0, 3] = [2, 8, 0]
[1, 1, 1, 1]   [1, 1, 0]   [2, 6, 3]

3. finally the query representations:

               [1, 0, 1]
[1, 0, 1, 0]   [1, 0, 0]   [1, 0, 2]
[0, 2, 0, 2] x [0, 0, 1] = [2, 2, 2]
[1, 1, 1, 1]   [0, 1, 1]   [2, 1, 3]

PS: In practice, a bias vector may be added to the product of matrix multiplication.

Step 4: Calculate attention scores for Input 1

To obtain attention scores, we start off with taking a dot product between Input 1’s query (red) with all keys (orange), including itself. Since there are 3 key representations (because we have 3 inputs), we obtain 3 attention scores (blue).

            [0, 4, 2]
[1, 0, 2] x [1, 4, 3] = [2, 4, 4]
            [1, 0, 1]

Note that we only use the query from Input 1 here. Later we’ll repeat this same step for the other querys.

PS: The above operation is known as dot product attention, one of the several score functions. Other score functions include scaled dot product and additive/concat.

Step 5: Calculate softmax

Take the softmax across these attention scores (blue).

softmax([2, 4, 4]) ≈ [0.0, 0.5, 0.5]  (rounded for readability; the exact values are about [0.063, 0.468, 0.468])

Step 6: Multiply scores with values

The softmaxed attention score for each input (blue) is multiplied by its corresponding value (purple). This results in 3 alignment vectors (yellow). In this tutorial, we’ll refer to them as weighted values.

1: 0.0 * [1, 2, 3] = [0.0, 0.0, 0.0]
2: 0.5 * [2, 8, 0] = [1.0, 4.0, 0.0]
3: 0.5 * [2, 6, 3] = [1.0, 3.0, 1.5]

Step 7: Sum weighted values to get Output 1

Take all the weighted values (yellow) and sum them element-wise:

  [0.0, 0.0, 0.0]
+ [1.0, 4.0, 0.0]
+ [1.0, 3.0, 1.5]
-----------------
= [2.0, 7.0, 1.5]

The resulting vector [2.0, 7.0, 1.5] (dark green) is Output 1, which is based on the query representation from Input 1 interacting with all other keys, including itself.

Step 8: Repeat for Input 2 & Input 3

Query and Key must have the same dimension because the two are combined by a dot product. Value, however, may have a different dimension from Q and K.

The resulting output will consequently follow the dimension of value.
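Step 8 can be checked without rounding the softmax weights. The sketch below recomputes all three outputs from the inputs and weights of this tutorial in plain Python; the helper functions are assumptions added for illustration.

```python
import math

def matmul(a, b):
    # Plain-Python matrix product of a (n x m) and b (m x p).
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def softmax_rows(m):
    rows = []
    for row in m:
        top = max(row)
        exps = [math.exp(v - top) for v in row]
        total = sum(exps)
        rows.append([e / total for e in exps])
    return rows

x = [[1.0, 0.0, 1.0, 0.0],   # Input 1
     [0.0, 2.0, 0.0, 2.0],   # Input 2
     [1.0, 1.0, 1.0, 1.0]]   # Input 3
w_key = [[0.0, 0.0, 1.0], [1.0, 1.0, 0.0], [0.0, 1.0, 0.0], [1.0, 1.0, 0.0]]
w_query = [[1.0, 0.0, 1.0], [1.0, 0.0, 0.0], [0.0, 0.0, 1.0], [0.0, 1.0, 1.0]]
w_value = [[0.0, 2.0, 0.0], [0.0, 3.0, 0.0], [1.0, 0.0, 3.0], [1.0, 1.0, 0.0]]

keys = matmul(x, w_key)
queries = matmul(x, w_query)
values = matmul(x, w_value)

# One row of attention scores per query, then softmax per row, then the weighted sum.
scores = matmul(queries, [list(col) for col in zip(*keys)])
weights = softmax_rows(scores)
outputs = matmul(weights, values)

for row in outputs:
    print([round(v, 4) for v in row])
```

With exact weights, Output 1 comes out near [1.94, 6.68, 1.60] rather than the rounded [2.0, 7.0, 1.5] used in the walkthrough, and each output has the dimension of the value vectors.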

Corresponding code

Step 1: Prepare inputs

import tensorflow as tf

x = [
  [1, 0, 1, 0], # Input 1
  [0, 2, 0, 2], # Input 2
  [1, 1, 1, 1]  # Input 3
 ]
x = tf.Variable(x, dtype=tf.float32)

Step 2: Initialise weights

w_key = [
  [0, 0, 1],
  [1, 1, 0],
  [0, 1, 0],
  [1, 1, 0]
]
w_query = [
  [1, 0, 1],
  [1, 0, 0],
  [0, 0, 1],
  [0, 1, 1]
]
w_value = [
  [0, 2, 0],
  [0, 3, 0],
  [1, 0, 3],
  [1, 1, 0]
]
w_key = tf.Variable(w_key, dtype=tf.float32)
w_query = tf.Variable(w_query, dtype=tf.float32)
w_value = tf.Variable(w_value, dtype=tf.float32)

Step 3: Derive key, query and value

keys = x @ w_key
querys = x @ w_query
values = x @ w_value

print(keys)
# tensor([[0., 1., 1.],
#         [4., 4., 0.],
#         [2., 3., 1.]])

print(querys)
# tensor([[1., 0., 2.],
#         [2., 2., 2.],
#         [2., 1., 3.]])

print(values)
# tensor([[1., 2., 3.],
#         [2., 8., 0.],
#         [2., 6., 3.]])

Step 4: Calculate attention scores

attn_scores = querys @ tf.transpose(keys, perm=[1, 0])  # shape (3, 3): one row of scores per query
print(attn_scores)
# tensor([[ 2.,  4.,  4.],  # attention scores from Query 1
#         [ 4., 16., 12.],  # attention scores from Query 2
#         [ 4., 12., 10.]]) # attention scores from Query 3

Step 5: Calculate softmax

attn_scores_softmax = tf.nn.softmax(attn_scores)
print(attn_scores_softmax)
# tensor([[6.3379e-02, 4.6831e-01, 4.6831e-01],
#         [6.0337e-06, 9.8201e-01, 1.7986e-02],
#         [2.9539e-04, 8.8054e-01, 1.1917e-01]])

# For readability, approximate the above as follows
attn_scores_softmax = [
  [0.0, 0.5, 0.5],
  [0.0, 1.0, 0.0],
  [0.0, 0.9, 0.1]
]
attn_scores_softmax = tf.Variable(attn_scores_softmax)
print(attn_scores_softmax)

Step 6 + Step 7 combined (a single matrix multiplication):

print(attn_scores_softmax)
print(values)
outputs = tf.matmul(attn_scores_softmax, values)
print(outputs)

<tf.Variable 'Variable:0' shape=(3, 3) dtype=float32, numpy=
array([[0. , 0.5, 0.5],
       [0. , 1. , 0. ],
       [0. , 0.9, 0.1]], dtype=float32)>
tf.Tensor(
[[1. 2. 3.]
 [2. 8. 0.]
 [2. 6. 3.]], shape=(3, 3), dtype=float32)
tf.Tensor(
[[2.        7.        1.5      ]
 [2.        8.        0.       ]
 [2.        7.7999997 0.3      ]], shape=(3, 3), dtype=float32)

Step 6: Multiply scores with values

weighted_values = values[:,None] * tf.transpose(attn_scores_softmax, perm=[1, 0])[:,:,None]
print(weighted_values)
# tensor([[[0.0000, 0.0000, 0.0000],
#          [0.0000, 0.0000, 0.0000],
#          [0.0000, 0.0000, 0.0000]],
# 
#         [[1.0000, 4.0000, 0.0000],
#          [2.0000, 8.0000, 0.0000],
#          [1.8000, 7.2000, 0.0000]],
# 
#         [[1.0000, 3.0000, 1.5000],
#          [0.0000, 0.0000, 0.0000],
#          [0.2000, 0.6000, 0.3000]]])

Step 7: Sum weighted values

outputs = tf.reduce_sum(weighted_values, axis=0)
print(outputs)
# tensor([[2.0000, 7.0000, 1.5000],  # Output 1
#         [2.0000, 8.0000, 0.0000],  # Output 2
#         [2.0000, 7.8000, 0.3000]]) # Output 3

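5 Attention Tasks

The Attention mechanism is just an idea and can be applied to many tasks. It suits tasks with the following characteristics:

1) Long-text, document-level tasks. Long texts carry a lot of information and can cause information overload, while many tasks (text classification, for example) need only some of the key information, so Attention is well suited to capturing it.

2) Tasks involving two related texts, which may need to be aligned to find the relationships between them. In machine translation from English to Chinese, the two languages clearly align, and Attention can find which English word to align to when a given Chinese character is being generated. In reading comprehension, given a question and a passage, parts of the question can be aligned to the relevant descriptions in the passage; for example, "when" can be aligned to the time expressions in the passage.

3) Tasks that depend heavily on a few features. For example, in AI for law, when predicting the violated legal provisions from a preliminary judgment document, the document may contain statements determining the offense. Such features matter a great deal for the task, so using Attention to capture them is useful. (A CNN can also do this.)

Below are some tasks I am aware of. Machine translation, summarization and image-text retrieval are seq2seq tasks that need to align two pieces of content. Textual entailment uses two texts, the premise and the hypothesis, and reading comprehension also uses two texts, the passage and the question. Text classification, sequence labeling and relation extraction apply Attention to a single text.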

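1) Machine translation: the encoder models the source text, the decoder generates the translation, and attention connects the two, focusing on different parts of the source at each translation step.

2) Summarization: the encoder models the source text and the decoder generates new text. Formally this is a seq2seq task like machine translation, but the tasks differ in character: machine translation can align to a few specific words, whereas here a short text is generated from a long one, so the decoder may need to capture more of the encoder's content in order to summarize it.

3) Image-text retrieval: the encoder models the image and the decoder generates the related text. As the decoder generates each word, the attention mechanism focuses on different parts of the image.

4) Textual entailment: decide whether the premise and the hypothesis are related. The attention mechanism aligns the premise with the hypothesis.

5) Reading comprehension: self attention can be applied to the text, and the passage and the question can also be aligned.

6) Text classification: attention is usually applied over a sentence to obtain a sentence vector for classification.

7) Sequence labeling: the paper "Deep Semantic Role Labeling with Self-Attention" applies self attention before the softmax to learn sentence structure, and compares it against a CRF, which exploits label dependencies.

8) Relation extraction: self attention can also be used here.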

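6 Summary

Overall, attention is a weighted-sum mechanism. Whenever a weighted sum is used, however elaborate the weighting and the summing, as long as it is a weighted sum of hidden states computed from existing information, attention is being used. Self attention is simply the weighted sum taken within a sentence (as opposed to the weighted sum that the decoder in seq2seq takes over the encoder's hidden states).
I personally think self attention has the wider scope, while key-value is really just a more general definition of attention. All the earlier forms of attention fit the key-value formulation: in many cases k and v are treated as the same thing, and in self attention it may even be that query = key = value.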

https://zhuanlan.zhihu.com/p/47282410

https://www.jianshu.com/p/27514668a1a3

https://www.jianshu.com/p/c3e1a9c04204

https://towardsdatascience.com/illustrated-self-attention-2d627e33b20a

[Deep Learning] Natural Language Processing --- Attention (Part 2) [Self-Attention]
