Attention: A Summary

1. Self-attention

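Self-attention has many applications in NLP. As I see it, its role is to use attention scores to distinguish how much each part of a text matters for the final task. In text classification, for example, different characters/words carry different importance for the task. The form of self-attention discussed here comes from "Attention Is All You Need".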

For the detailed procedure, see "The Illustrated Transformer".
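The computation walked through in that reference can be sketched in a few lines of NumPy. This is a minimal single-head sketch, not the reference implementation: the sizes (5 tokens, embedding size 8, projection size 4) and the random weight matrices are assumptions chosen only for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the max for numerical stability before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, W_q, W_k, W_v):
    """X: (seq_len, d_model); W_q, W_k, W_v: (d_model, d_k)."""
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    # Scaled dot-product scores between every pair of positions.
    scores = Q @ K.T / np.sqrt(K.shape[-1])   # (seq_len, seq_len)
    weights = softmax(scores, axis=-1)        # each row sums to 1
    Z = weights @ V                           # (seq_len, d_k)
    return Z, weights

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 8))                   # hypothetical token embeddings
W_q, W_k, W_v = (rng.normal(size=(8, 4)) for _ in range(3))
Z, weights = self_attention(X, W_q, W_k, W_v)
```

Row i of `weights` holds the attention scores of token i over all tokens, which is the quantity that tells us how much each part of the text matters.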

Before the computed Z is handed to the downstream task, there are two ways to 'flatten' the tensor:

1. K.sum

K.sum(weighted_input, axis=1)

2. Pooling layer

input_seq = Self_Attention(128)(embeddings)
input_seq = GlobalAveragePooling1D()(input_seq)

3. Note: to use attention, the LSTM must return its output at every time step, because attention is computed over every step (as in 1 and 2 above): model.add(Bidirectional(LSTM(64, return_sequences=True)))
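The two flattening options above can be compared in plain NumPy. This is only a sketch under assumptions: the tensor shape (batch 2, 5 time steps, 4 features) and the per-step weights `a` are hypothetical, and the mean mirrors what GlobalAveragePooling1D computes when no mask is involved.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical attention output with shape (batch, time_steps, features).
Z = rng.normal(size=(2, 5, 4))
# Hypothetical per-step attention weights, normalised over the time axis.
a = rng.random(size=(2, 5, 1))
a = a / a.sum(axis=1, keepdims=True)

# Option 1: weight every step, then sum over the time axis (axis=1).
weighted_input = Z * a
summed = weighted_input.sum(axis=1)   # (batch, features)

# Option 2: unweighted mean over the time axis.
pooled = Z.mean(axis=1)               # (batch, features)
```

Both options turn a (batch, time_steps, features) tensor into (batch, features); they coincide only when every step gets the same weight.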

For other material related to self-attention, see:

2. Encoder-decoder attention

Source: Neural Machine Translation by Jointly Learning to Align and Translate

Source: "A Summary of the Various Forms of Attention", section on encoder-decoder attention

2.1 Bahdanau Attention (additive; a form of soft attention)

2.2 Luong Attention - Global (soft attention)

Global attention is in effect soft attention: every encoder position receives a weight.

3. Multi-head Attention

The following is taken from "Why Do We Need Multi-head Attention" and "A 10-Minute Deep Dive into the Principles and Implementation of the Transformer".

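It can be compared to using several filters at once in a CNN. Intuitively, multiple attention heads help the network capture richer features/information.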

The paper puts it this way:

Multi-head attention allows the model to jointly attend to information from different representation subspaces at different positions.

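As for "different representation subspaces", here is an example that may not be perfectly apt: when you browse a web page, in terms of colour you probably pay more attention to dark text, while in terms of font you notice large, bold text. Colour and font are two different representation subspaces here. Attending to both at once lets you locate the emphasised content on the page effectively. Using multi-head attention likewise means combining information/features from several aspects.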
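To make the "subspaces" idea concrete, here is a minimal NumPy sketch of multi-head attention: the projected queries, keys and values are split into `num_heads` slices, each slice attends independently, and the results are concatenated and projected. The sizes and random weights are assumptions for illustration, and masking, biases and dropout are omitted.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(X, W_q, W_k, W_v, W_o, num_heads):
    """X: (seq_len, d_model); W_q, W_k, W_v, W_o: (d_model, d_model)."""
    seq_len, d_model = X.shape
    d_head = d_model // num_heads

    def split_heads(M):
        # (seq_len, d_model) -> (num_heads, seq_len, d_head)
        return M.reshape(seq_len, num_heads, d_head).transpose(1, 0, 2)

    Q, K, V = split_heads(X @ W_q), split_heads(X @ W_k), split_heads(X @ W_v)
    # Each head computes its own attention pattern in its own subspace.
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_head)  # (heads, seq, seq)
    weights = softmax(scores, axis=-1)
    heads = weights @ V                                   # (heads, seq, d_head)
    # Concatenate the heads back to (seq_len, d_model) and mix them.
    concat = heads.transpose(1, 0, 2).reshape(seq_len, d_model)
    return concat @ W_o, weights

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 8))
W_q, W_k, W_v, W_o = (rng.normal(size=(8, 8)) for _ in range(4))
out, head_weights = multi_head_attention(X, W_q, W_k, W_v, W_o, num_heads=2)
```

`head_weights[0]` and `head_weights[1]` are two separate attention maps over the same 5 tokens, one per head, which corresponds to the "colour" and "font" views in the analogy.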

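I think multi-head attention can also be viewed as a kind of ensemble inside the model. As other answerers have pointed out, however, the mechanism behind multi-head attention is still not well understood. In fact, how attention itself works has not been fully explained either; current explanations are still only intuition. Apart from its alignment role in seq2seq, how attention contributes once it is added to many other models remains somewhat debated.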

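Multi-head is not strictly necessary: removing some heads still gives decent results (and any drop may simply be due to the reduced parameter count). The reason is that, once there are enough heads, they can already attend to positional information, syntactic information, and rare words; adding more heads merely contributes some enhancement or noise.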

4. Details of attention

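Attention is first divided into two broad classes, hard attention and soft attention. The difference is that hard attention focuses on a very small region (in the extreme case, a single position), whereas soft attention spreads its focus more diffusely over the input.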

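Within the soft attention camp, two sub-camps soon emerged: global attention and local attention. They differ in how wide a range they attend to: global attention attends to the entire text sequence, while local attention attends to all the text within a fixed window.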

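Comparing the two: global attention costs more to compute than local attention and becomes inefficient on long sentences in particular. Local attention depends only on the words inside the window, so the window size is critical. Local attention also adds a step that predicts the centre position of the window, which may cause some important words to be missed. As for the window itself, the paper weights the positions inside it with a Gaussian distribution.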
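The window-plus-Gaussian idea can be sketched as follows. This assumes the paper in question is Luong et al. and follows its predictive local attention only loosely: the parameter names `W_p` and `v_p`, their shapes, the dot-product score and the choice sigma = D / 2 are assumptions for illustration, not a faithful reproduction.

```python
import numpy as np

def local_attention(h_t, enc_states, W_p, v_p, D):
    """h_t: (d,) decoder state; enc_states: (S, d); D: half window width."""
    S = enc_states.shape[0]
    # Predict the window centre p_t as S * sigmoid(v_p . tanh(W_p h_t)).
    p_t = S / (1.0 + np.exp(-(v_p @ np.tanh(W_p @ h_t))))
    # Clip the window [p_t - D, p_t + D] to the valid position range.
    lo = max(0, int(p_t) - D)
    hi = min(S, int(p_t) + D + 1)
    positions = np.arange(lo, hi)
    window = enc_states[lo:hi]
    # Softmax over dot-product scores, restricted to the window.
    scores = window @ h_t
    scores -= scores.max()
    align = np.exp(scores) / np.exp(scores).sum()
    # Down-weight positions far from the predicted centre with a Gaussian.
    sigma = D / 2.0
    weights = align * np.exp(-((positions - p_t) ** 2) / (2 * sigma ** 2))
    context = weights @ window
    return context, weights, positions, p_t

rng = np.random.default_rng(0)
enc_states = rng.normal(size=(10, 4))   # 10 hypothetical source positions
h_t = rng.normal(size=4)
W_p = rng.normal(size=(4, 4))
v_p = rng.normal(size=4)
context, weights, positions, p_t = local_attention(h_t, enc_states, W_p, v_p, D=2)
```

Only `len(positions)` source states are touched per step, at most 2 * D + 1, which is where the saving over global attention comes from; any word outside that window gets no weight at all.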

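Because global attention takes more information into account, it should in principle work better: local attention may miss words that are important for the current output, and its performance depends heavily on the window size, so a window that is too small can hurt results badly. Given the complexity of NLP problems (sentences vary in length, and sentences may be strongly related to each other), many later papers rarely use local attention. In my own reading comprehension work I basically never consider local attention either, since choosing the window size is simply too tricky.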

1. Multiplicative vs. additive attention

Multiplicative (dot-product) attention is the form used in self-attention; additive attention is as follows:
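As a rough side-by-side, the two score functions can be written in NumPy like this. The dot-product score needs no extra parameters, while the additive score passes query and key through a small tanh layer; the names `W_q`, `W_k`, `v` and the hidden size 3 are assumptions made for this sketch.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def dot_score(query, keys):
    # Multiplicative score: one dot product per key.
    return keys @ query                              # (n_keys,)

def additive_score(query, keys, W_q, W_k, v):
    # Additive score: v . tanh(W_q^T q + W_k^T k) for every key.
    return np.tanh(query @ W_q + keys @ W_k) @ v     # (n_keys,)

rng = np.random.default_rng(0)
query = rng.normal(size=4)             # hypothetical decoder state
keys = rng.normal(size=(6, 4))         # hypothetical encoder states
W_q = rng.normal(size=(4, 3))
W_k = rng.normal(size=(4, 3))
v = rng.normal(size=3)

mul_weights = softmax(dot_score(query, keys))
add_scores = additive_score(query, keys, W_q, W_k, v)
add_weights = softmax(add_scores)
context = add_weights @ keys           # weighted sum of the encoder states
```

Either set of weights is used the same way afterwards; only the way the raw scores are produced differs.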

2. Hard attention / soft attention

"Hard & Soft Attention"

"Soft/hard attention --> multi-head attention"

"Attention Mechanisms in NLP"

Bahdanau et al., Neural Machine Translation by Jointly Learning to Align and Translate, ICLR 2015
Luong et al., Effective Approaches to Attention-based Neural Machine Translation, EMNLP 2015

Both algorithms can be found in the TensorFlow attention wrapper:

https://github.com/tensorflow/nmt#background-on-the-attention-mechanism

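Luong's global attention and Bahdanau's attention are both forms of "soft attention"; they differ mainly in the score function (multiplicative vs. additive). The hard/soft distinction itself is about whether the attention computation can access all of the data. In the image captioning task (Xu et al., 2015, Show, Attend and Tell), the image is first divided into patches, which is typical CNN processing. When the text description is generated, hard attention is based on the representation of one specific patch, whereas soft attention sees the representations of all patches at once.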
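The difference between the two can be sketched in NumPy as follows. The patch features and scores are random placeholders, and the hard variant here simply samples one patch index from the attention distribution; how such a sampled choice is trained is left out of the sketch.

```python
import numpy as np

rng = np.random.default_rng(0)
patches = rng.normal(size=(6, 4))     # 6 hypothetical patch features, size 4
scores = rng.normal(size=6)           # hypothetical relevance scores

# Turn the scores into an attention distribution over the patches.
alpha = np.exp(scores - scores.max())
alpha /= alpha.sum()

# Soft attention: deterministic weighted average over all patches.
soft_context = alpha @ patches

# Hard attention: sample one patch index from alpha and use only that patch.
idx = rng.choice(len(patches), p=alpha)
hard_context = patches[idx]
```

`soft_context` blends every patch and is differentiable with respect to the scores, while `hard_context` is exactly one row of `patches` and depends on a random draw.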