The attention mechanism in seq2seq

Preface

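This post covers two attention mechanisms applied to seq2seq models: Bahdanau Attention and Luong Attention. It lays out the structure of each with formulas and figures, then compares the two. For background on seq2seq itself, see the previous post.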

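For brevity, the explanation uses a basic RNN. In practice an LSTM is usually used instead, but that changes nothing here: attention is applied in the same way. Also for brevity, the bias terms are omitted from the formulas.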

The first attention structure: Bahdanau Attention

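Both mechanisms build on the first seq2seq structure from the previous post. In that structure, the context (semantic) vector ${\color{Red} {c}}$ produced by the Encoder is passed to every Decoder time step, and every step receives the same ${\color{Red} {c}}$, which is unreasonable. Take translating "I like watching movie." into "我喜歡看電影。": the word 喜歡 comes almost entirely from "like", and each word of the source sentence contributes differently to producing 喜歡. The Decoder should therefore use a different context vector ${\color{Red} {c_t}}$ at each time step $t$.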

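This model comes from Bahdanau et al. (2014) [1]. Its architecture is shown in the figure below:

The formulas below make the structure easier to follow.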

Encoder:

$\begin{aligned} h_i &=tanh(W[h_{i-1},x_i])\\ o_i &=softmax(Vh_i) \\ \end{aligned}$
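The Encoder recurrence above can be sketched in NumPy as follows. This is only a minimal illustration under stated assumptions, not the author's code: the names `softmax` and `encode`, the zero initial state, and the toy dimensions are all made up for the example, and biases are omitted as in the formulas.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(z - np.max(z))
    return e / e.sum()

def encode(xs, W, V, h0=None):
    """Run the basic RNN encoder over one input sequence.

    xs: inputs x_1..x_T, shape (T, d_x)
    W:  recurrent weights, shape (d_h, d_h + d_x)
    V:  output weights, shape (n_out, d_h)
    Returns hidden states (T, d_h) and outputs (T, n_out).
    """
    d_h = W.shape[0]
    h = np.zeros(d_h) if h0 is None else h0  # assumed zero initial state
    hs, outs = [], []
    for x in xs:
        h = np.tanh(W @ np.concatenate([h, x]))  # h_i = tanh(W [h_{i-1}, x_i])
        hs.append(h)
        outs.append(softmax(V @ h))              # o_i = softmax(V h_i)
    return np.stack(hs), np.stack(outs)

# Toy usage with random parameters (sizes are arbitrary).
rng = np.random.default_rng(0)
T, d_x, d_h, n_out = 4, 3, 5, 6
xs = rng.normal(size=(T, d_x))
W = rng.normal(size=(d_h, d_h + d_x)) * 0.1
V = rng.normal(size=(n_out, d_h)) * 0.1
H, O = encode(xs, W, V)
print(H.shape, O.shape)
```

The stacked hidden states `H` are what the attention step in the Decoder later weights and sums.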

Decoder:

The Decoder works in two steps.
Step 1, compute the context vector for the current time step:

$\begin{aligned} {\color{Red} {c_t}} &=\sum ^T_{i=1} \alpha_{ti}h_i\\ \alpha_{ti} &=\frac{exp(e_{ti})}{\sum^T_{k=1}exp(e_{tk})}\\ e_{ti} &=V_a^{\top}tanh(W_a[s_{t-1},h_i]) \end{aligned}$

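Here ${\color{Red} {c_t}}$ is the context vector at time step $t$; $e_{ti}$ measures how strongly the Encoder hidden state $h_i$ at Encoder step $i$ influences the Decoder hidden state $s_t$ at step $t$; and the softmax function (the second equation) normalizes the scores $e_{ti}$ into probabilities $\alpha_{ti}$.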
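As a rough sketch of this step, the snippet below scores every Encoder state against the previous Decoder state $s_{t-1}$, normalizes the scores, and takes the weighted sum. The function name `bahdanau_context`, the attention size `d_a`, and the random toy parameters are assumptions made for illustration only.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(z - np.max(z))
    return e / e.sum()

def bahdanau_context(s_prev, H, W_a, V_a):
    """Additive attention for one decoder step.

    s_prev: previous decoder state s_{t-1}, shape (d_s,)
    H:      encoder states h_1..h_T stacked, shape (T, d_h)
    W_a:    shape (d_a, d_s + d_h);  V_a: shape (d_a,)
    Returns (c_t, alpha_t) with shapes (d_h,) and (T,).
    """
    T = H.shape[0]
    # Row i of `pairs` is the concatenation [s_{t-1}, h_i].
    pairs = np.concatenate([np.tile(s_prev, (T, 1)), H], axis=1)
    e = np.tanh(pairs @ W_a.T) @ V_a  # e_{ti} = V_a^T tanh(W_a [s_{t-1}, h_i])
    alpha = softmax(e)                # normalize scores into weights
    c_t = alpha @ H                   # c_t = sum_i alpha_{ti} h_i
    return c_t, alpha

# Toy usage with random parameters (sizes are arbitrary).
rng = np.random.default_rng(0)
T, d_h, d_s, d_a = 4, 5, 5, 7
H = rng.normal(size=(T, d_h))
s_prev = rng.normal(size=d_s)
W_a = rng.normal(size=(d_a, d_s + d_h))
V_a = rng.normal(size=d_a)
c_t, alpha = bahdanau_context(s_prev, H, W_a, V_a)
print(alpha.round(3), alpha.sum())
```

Because the weights are positive and sum to 1, each component of `c_t` is a convex combination of the corresponding components of the Encoder states.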

Step 2, pass on the hidden state and predict:

$\begin{aligned} s_t &=tanh(W[s_{t-1},y_{t-1},{\color{Red} {c_t}}])\\ o_t &=softmax(Vs_t) \\ \end{aligned}$
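A minimal sketch of this second step is shown below. It assumes that $y_{t-1}$ is an embedding vector of the previously predicted token and that the context vector has already been computed; the name `bahdanau_decoder_step`, the greedy `argmax` choice, and all dimensions are illustrative assumptions rather than part of the original post.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(z - np.max(z))
    return e / e.sum()

def bahdanau_decoder_step(s_prev, y_prev, c_t, W, V):
    """One decoder update in which the context vector feeds the recurrence.

    s_prev: s_{t-1}, shape (d_s,)
    y_prev: embedding of the previous output token, shape (d_y,)
    c_t:    context vector for this step, shape (d_h,)
    W:      shape (d_s, d_s + d_y + d_h);  V: shape (n_vocab, d_s)
    """
    s_t = np.tanh(W @ np.concatenate([s_prev, y_prev, c_t]))  # s_t = tanh(W [s_{t-1}, y_{t-1}, c_t])
    o_t = softmax(V @ s_t)                                    # o_t = softmax(V s_t)
    return s_t, o_t

# Toy usage with random parameters (sizes are arbitrary).
rng = np.random.default_rng(0)
d_s, d_y, d_h, n_vocab = 5, 3, 5, 8
s_prev = rng.normal(size=d_s)
y_prev = rng.normal(size=d_y)
c_t = rng.normal(size=d_h)
W = rng.normal(size=(d_s, d_s + d_y + d_h)) * 0.1
V = rng.normal(size=(n_vocab, d_s)) * 0.1
s_t, o_t = bahdanau_decoder_step(s_prev, y_prev, c_t, W, V)
next_token = int(np.argmax(o_t))  # greedy pick, one possible decoding choice
print(next_token, o_t.sum())
```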

The second attention structure: Luong Attention

This model comes from Luong et al. (2015) [2]. Its architecture is shown in the figure below:

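It differs from the first attention structure only in the Decoder; the Encoder is exactly the same. The Decoder still works in two steps, and the parts that differ from Bahdanau Attention are marked in green in the formulas: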

Step 1, compute the context vector for the current time step:
$\begin{aligned} {\color{Red} {c_t}} &=\sum ^T_{i=1} \alpha_{ti}h_i\\ \alpha_{ti} &=\frac{exp(e_{ti})}{\sum^T_{k=1}exp(e_{tk})}\\ e_{ti} &={\color{Green} {s_t^{\top}W_ah_i}} \end{aligned}$

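The difference lies in the formula for the influence score $e_{ti}$. I give only the best-performing form here, which the paper calls the general score; the paper also describes dot and concat variants, so it is worth reading if you are interested.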
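The bilinear score can be sketched in a few lines of NumPy. Note that it uses the current Decoder state $s_t$ rather than $s_{t-1}$. The names `luong_scores` and `luong_context` and the toy sizes are assumptions for illustration, not an API from the paper or from any library.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(z - np.max(z))
    return e / e.sum()

def luong_scores(s_t, H, W_a):
    """Scores e_{ti} = s_t^T W_a h_i for all i at once.

    s_t: current decoder state, shape (d_s,)
    H:   encoder states stacked, shape (T, d_h)
    W_a: shape (d_s, d_h)
    """
    return H @ (W_a.T @ s_t)  # (T, d_h) @ (d_h,) -> (T,)

def luong_context(s_t, H, W_a):
    """Context vector and attention weights for one decoder step."""
    alpha = softmax(luong_scores(s_t, H, W_a))
    return alpha @ H, alpha   # c_t = sum_i alpha_{ti} h_i

# Toy usage with random parameters (sizes are arbitrary).
rng = np.random.default_rng(0)
T, d_h, d_s = 4, 5, 6
H = rng.normal(size=(T, d_h))
s_t = rng.normal(size=d_s)
W_a = rng.normal(size=(d_s, d_h))
scores = luong_scores(s_t, H, W_a)
c_t, alpha = luong_context(s_t, H, W_a)
print(scores.round(3), alpha.round(3))
```

When $s_t$ and $h_i$ have the same size, setting `W_a` to the identity matrix reduces the score to the plain dot product $s_t^{\top}h_i$.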

Step 2, pass on the hidden state and predict:
$\begin{aligned} s_t &=tanh(W[s_{t-1},y_{t-1}])\\ {\color{Green} {\widetilde{s}_t}} &=tanh(W_c[{\color{Green} {s_t}},{\color{Red} {c_t}}]) \\ o_t &=softmax(V{\color{Green} {\widetilde{s}_t}}) \\ \end{aligned}$

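First compute the initial hidden state $s_t$, then the attentional hidden state ${\color{Green} {\widetilde{s}_t}}$, and finally feed the latter to the softmax layer to output the predictive distribution.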
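Putting the Luong formulas together, one full Decoder step might look like the sketch below. The order of operations is the point: the recurrent update comes first, attention is computed from the fresh $s_t$, and the context vector only enters through $\widetilde{s}_t$. The function name `luong_decoder_step`, the size `d_c` of the attentional state, and the random parameters are assumptions for illustration; $y_{t-1}$ is assumed to be a token embedding.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(z - np.max(z))
    return e / e.sum()

def luong_decoder_step(s_prev, y_prev, H, W, W_a, W_c, V):
    """One decoder step following the Luong-style formulas.

    s_prev: s_{t-1}, shape (d_s,);  y_prev: shape (d_y,)
    H:      encoder states, shape (T, d_h)
    W:      (d_s, d_s + d_y);  W_a: (d_s, d_h)
    W_c:    (d_c, d_s + d_h);  V:   (n_vocab, d_c)
    """
    s_t = np.tanh(W @ np.concatenate([s_prev, y_prev]))  # initial hidden state, no context yet
    alpha = softmax(H @ (W_a.T @ s_t))                   # weights from e_{ti} = s_t^T W_a h_i
    c_t = alpha @ H                                      # context vector
    s_tilde = np.tanh(W_c @ np.concatenate([s_t, c_t]))  # attentional hidden state
    o_t = softmax(V @ s_tilde)                           # predictive distribution
    return s_t, s_tilde, o_t, alpha

# Toy usage with random parameters (sizes are arbitrary).
rng = np.random.default_rng(0)
T, d_h, d_s, d_y, d_c, n_vocab = 4, 5, 6, 3, 7, 8
H = rng.normal(size=(T, d_h))
s_prev = rng.normal(size=d_s)
y_prev = rng.normal(size=d_y)
W = rng.normal(size=(d_s, d_s + d_y)) * 0.1
W_a = rng.normal(size=(d_s, d_h))
W_c = rng.normal(size=(d_c, d_s + d_h)) * 0.1
V = rng.normal(size=(n_vocab, d_c)) * 0.1
s_t, s_tilde, o_t, alpha = luong_decoder_step(s_prev, y_prev, H, W, W_a, W_c, V)
print(o_t.round(3))
```

Following the formulas as written, it is `s_t` (not `s_tilde`) that would be carried forward as $s_{t-1}$ for the next step.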

Summary

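Bahdanau Attention and Luong Attention share the same overall structure; they differ in the alignment function that computes the influence scores. When computing the scores for time step $t$, the former uses $h_i$ and $s_{t-1}$, while the latter uses $h_i$ and $s_{t}$. Because $s_t$ must exist before the score can be computed, Luong Attention applies the context vector after the recurrent update, through $\widetilde{s}_t$. The latter seems the more logical choice, but both mechanisms are in use today and TensorFlow provides functions for both, so their performance probably does not differ much.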

References:

[1] Bahdanau et al. (2014). Neural Machine Translation by Jointly Learning to Align and Translate.
[2] Luong et al. (2015). Effective Approaches to Attention-based Neural Machine Translation.
