[NLP] A Look at the Development of the Attention Mechanism

Preface

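To be honest, there are already plenty of articles about the attention mechanism, and many of them are excellent (it has been around for quite a few years, after all). The main purpose of this post is to give readers a brief introduction to the attention mechanism.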

Quick Overview

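The story starts with the sequence-to-sequence model (Seq2seq Model). The earliest sequence-to-sequence model was a CNN+LSTM.
Put simply, a CNN maps the source side to a fixed-length vector, and an LSTM then decodes it step by step.
A natural next idea was to use an LSTM for the encoder as well [1], since LSTMs are better at handling sequences. A simple way to do this is to use the encoder's last hidden state as the decoder's initial state.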

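Clearly, as sentences get longer, the last hidden state (a fixed-length vector) can no longer carry enough information.
To address this, the attention mechanism was introduced: at every decoding step a $\color{red}{\text{context vector}}$ is built, and the next word is generated from the context vector together with the decoder input.
The Transformer's self-attention then lifts the attention mechanism out of the encoder-to-decoder setting so that it can capture relationships between the words within a sentence. Experiments showed that this works considerably better than LSTMs.
If you just want a rough idea of attention, this overview is enough. If you want to go deeper, read on. (Heavy material ahead.)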

In Depth

Bahdanau Attention

The first attention mechanism is known as Bahdanau Attention [2].
Let $h$ denote the hidden states of the encoder and $s$ the hidden states of the decoder.

$p(y_i|y_1,...,y_{i-1},x)=g(y_{i-1},s_i,c_i)$
First, the decoder's output for the current word $y_i$ is computed from the previous word $y_{i-1}$, the current decoder hidden state $s_i$, and the context vector $c_i$.
$s_i=f(s_{i-1},y_{i-1},c_i)$
The current decoder hidden state $s_i$, in turn, depends on the previous hidden state $s_{i-1}$, the previous output $y_{i-1}$, and the current context vector $c_i$.
$c_i=\sum\limits_{j=1}^{T_x} {a_{ij}h_j}$
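The current context vector $c_i$ is a weighted sum. Note that $T_x$ is the number of encoder time steps, and $a_{ij}$ is the weight with which the encoder hidden state $h_j$ contributes to the context vector.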
$a_{ij}=\frac{exp(e_{ij})}{\sum\limits_{k=1}^{T_x}exp(e_{ik})}$
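The weight $a_{ij}$ is obtained by applying a softmax to $e_{ij}$,
where $e_{ij}=a(s_{i-1},h_j)$.
Here $e_{ij}$ can be read as a similarity between the decoder's previous hidden state $s_{i-1}$ and one encoder hidden state $h_j$; it is computed with a feed-forward network.
In the TensorFlow code, memory is encoder_outputs, i.e. all the hidden states of the encoder, and query is the decoder's hidden state.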
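Before diving into the library code, here is a minimal pure-Python sketch of one Bahdanau attention step, following the formulas above. It assumes the alignment model $a$ is the additive form $v^T \tanh(W_s s_{i-1} + W_h h_j)$ from the paper [2]; the names `W_s`, `W_h`, `v` and all the toy values are my own placeholders, not anything taken from TensorFlow.

```python
import math

def matvec(W, x):
    """Multiply a matrix W (list of rows) by a vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def additive_score(s_prev, h_j, W_s, W_h, v):
    """e_ij = v^T tanh(W_s s_{i-1} + W_h h_j): a one-layer feed-forward scorer."""
    proj_s = matvec(W_s, s_prev)
    proj_h = matvec(W_h, h_j)
    return sum(vk * math.tanh(a + b) for vk, a, b in zip(v, proj_s, proj_h))

def bahdanau_context(s_prev, encoder_states, W_s, W_h, v):
    """Return (a_i, c_i): the weights over encoder steps and the context vector."""
    e = [additive_score(s_prev, h_j, W_s, W_h, v) for h_j in encoder_states]
    a = softmax(e)  # a_ij, one weight per encoder step
    dim = len(encoder_states[0])
    # c_i = sum_j a_ij * h_j, computed one dimension at a time
    c = [sum(a_j * h_j[d] for a_j, h_j in zip(a, encoder_states)) for d in range(dim)]
    return a, c

# Toy example (hypothetical numbers): T_x = 3 encoder steps, hidden size 2.
encoder_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
s_prev = [0.5, -0.5]
W_s = [[1.0, 0.0], [0.0, 1.0]]
W_h = [[1.0, 0.0], [0.0, 1.0]]
v = [1.0, 1.0]

weights, context = bahdanau_context(s_prev, encoder_states, W_s, W_h, v)
print(weights)  # three weights that sum to 1
print(context)  # a vector with the same size as one encoder state
```

The weights always have length $T_x$, while the context vector has the size of a single encoder hidden state, whatever the sentence length.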
Let's take a look at the TensorFlow code for BahdanauAttention.

    score = _bahdanau_score(
          processed_query,
          self._keys,
          attention_v,
          attention_g=attention_g,
          attention_b=attention_b)
    alignments = self._probability_fn(score, state)
    next_state = alignments

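Here score corresponds to the unnormalized $e_{ij}$, and alignments is the softmax result $a_{ij}$. For each example in the batch, both are vectors whose length is the number of encoder steps $T_x$.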

def _bahdanau_score(processed_query,
                    keys,
                    attention_v,
                    attention_g=None,
                    attention_b=None):
  """Implements Bahdanau-style (additive) scoring function.
  This attention has two forms.  The first is Bhandanau attention,
  as described in:
  Dzmitry Bahdanau, Kyunghyun Cho, Yoshua Bengio.
  "Neural Machine Translation by Jointly Learning to Align and Translate."
  ICLR 2015. https://arxiv.org/abs/1409.0473
  The second is the normalized form.  This form is inspired by the
  weight normalization article:
  Tim Salimans, Diederik P. Kingma.
  "Weight Normalization: A Simple Reparameterization to Accelerate
   Training of Deep Neural Networks."
  https://arxiv.org/abs/1602.07868
  To enable the second form, set please pass in attention_g and attention_b.
  Args:
    processed_query: Tensor, shape `[batch_size, num_units]` to compare to keys.
    keys: Processed memory, shape `[batch_size, max_time, num_units]`.
    attention_v: Tensor, shape `[num_units]`.
    attention_g: Optional scalar tensor for normalization.
    attention_b: Optional tensor with shape `[num_units]` for normalization.
  Returns:
    A `[batch_size, max_time]` tensor of unnormalized score values.
  """
  # Reshape from [batch_size, ...] to [batch_size, 1, ...] for broadcasting.
  processed_query = array_ops.expand_dims(processed_query, 1)
  if attention_g is not None and attention_b is not None:
    normed_v = attention_g * attention_v * math_ops.rsqrt(
        math_ops.reduce_sum(math_ops.square(attention_v)))
    return math_ops.reduce_sum(
        normed_v * math_ops.tanh(keys + processed_query + attention_b), [2])
  else:
    return math_ops.reduce_sum(
        attention_v * math_ops.tanh(keys + processed_query), [2])

The score is computed from processed_query (the projected decoder hidden state, one vector) and keys (the projected encoder hidden states, $T_x$ of them).

Luong Attention

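The second attention mechanism, which followed Bahdanau Attention, is called Luong Attention [3].
Let $\bar h_s$ denote the hidden states of the encoder (source side) and $h_t$ the hidden state of the decoder (target side). (Note that the notation differs from Bahdanau Attention.)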

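Let's start with how the two differ.
First, Luong Attention offers two kinds of attention: global attention, which considers all encoder hidden states, and local attention, which considers only the hidden states inside a local window and thereby reduces the amount of computation.
Second, global attention resembles Bahdanau Attention, but there are still some differences. One is that in Bahdanau Attention the encoder hidden state is the concatenation of the two directions of a bidirectional RNN, $h_i=[\overrightarrow h_i; \overleftarrow h_i]$, whereas Luong Attention uses only the hidden states of the top layer. The other is that the two compute the decoder hidden state differently. Bahdanau Attention follows $h_{t-1} \to a_t \to c_t \to h_t$, while Luong Attention follows $h_{t} \to a_t \to c_t \to \tilde h_t$. In short, Luong Attention first computes the decoder hidden state $h_t$ just as a plain Seq2seq model would, then combines it with the context vector $c_t$ to produce the final decoder hidden state $\tilde h_t$ (which I call the top-level decoder hidden state below).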

$p(y_t|y_{<t},x)=softmax(W_s\tilde h_t)$
We can see that generating the next word relies mainly on the current top-level decoder hidden state $\tilde h_t$.
$\tilde h_t = tanh(W_c[c_t;h_t])$
This current top-level decoder hidden state is in turn computed from the current context vector $c_t$ and the decoder hidden state $h_t$.
$\begin{array}{rl} a_t(s) &= align(h_t,\bar h_s)\\ &= \frac{\exp(score(h_t,\bar h_s))}{\sum\nolimits_{s'} \exp(score(h_t,\bar h_{s'}))} \end{array}$
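The context vector $c_t$ is built much as in Bahdanau Attention; the main difference lies in how the attention weights are computed. Luong Attention also provides three ways of computing the score.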
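The three score functions in the paper [3] are usually called dot, general and concat. Below is a rough pure-Python sketch of each, written from the paper's formulas; the parameter names `W_a` and `v_a` follow the paper's notation, while the toy vectors and matrices are made up purely for illustration.

```python
import math

def dot(u, v):
    """Inner product of two vectors."""
    return sum(a * b for a, b in zip(u, v))

def matvec(W, x):
    """Multiply a matrix W (list of rows) by a vector x."""
    return [dot(row, x) for row in W]

def score_dot(h_t, h_s):
    """dot: h_t^T h_s"""
    return dot(h_t, h_s)

def score_general(h_t, h_s, W_a):
    """general: h_t^T W_a h_s"""
    return dot(h_t, matvec(W_a, h_s))

def score_concat(h_t, h_s, W_a, v_a):
    """concat: v_a^T tanh(W_a [h_t; h_s])"""
    concat = h_t + h_s  # list concatenation, i.e. [h_t; h_s]
    return dot(v_a, [math.tanh(z) for z in matvec(W_a, concat)])

# Toy values (hypothetical): decoder state h_t and one encoder state h_s.
h_t = [1.0, 2.0]
h_s = [0.5, -1.0]
identity = [[1.0, 0.0], [0.0, 1.0]]
W_concat = [[1.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 1.0]]  # shape 2 x 4
v_a = [1.0, 0.0]

s_dot = score_dot(h_t, h_s)
s_general = score_general(h_t, h_s, identity)
s_concat = score_concat(h_t, h_s, W_concat, v_a)
print(s_dot, s_general, s_concat)
```

With an identity `W_a`, general reduces to dot, which is an easy sanity check. The concat form is essentially the additive scoring used in Bahdanau Attention.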

    with variable_scope.variable_scope(None, "luong_attention", [query]):
      attention_g = None
      if self._scale:
        attention_g = variable_scope.get_variable(
            "attention_g",
            dtype=query.dtype,
            initializer=init_ops.ones_initializer,
            shape=())
      score = _luong_score(query, self._keys, attention_g)
    alignments = self._probability_fn(score, state)
    next_state = alignments

def _luong_score(query, keys, scale):
  """Implements Luong-style (multiplicative) scoring function.
  This attention has two forms.  The first is standard Luong attention,
  as described in:
  Minh-Thang Luong, Hieu Pham, Christopher D. Manning.
  "Effective Approaches to Attention-based Neural Machine Translation."
  EMNLP 2015.  https://arxiv.org/abs/1508.04025
  The second is the scaled form inspired partly by the normalized form of
  Bahdanau attention.
  To enable the second form, call this function with `scale=True`.
  Args:
    query: Tensor, shape `[batch_size, num_units]` to compare to keys.
    keys: Processed memory, shape `[batch_size, max_time, num_units]`.
    scale: the optional tensor to scale the attention score.
  Returns:
    A `[batch_size, max_time]` tensor of unnormalized score values.
  Raises:
    ValueError: If `key` and `query` depths do not match.
  """
  depth = query.get_shape()[-1]
  key_units = keys.get_shape()[-1]
  if depth != key_units:
    raise ValueError(
        "Incompatible or unknown inner dimensions between query and keys.  "
        "Query (%s) has units: %s.  Keys (%s) have units: %s.  "
        "Perhaps you need to set num_units to the keys' dimension (%s)?" %
        (query, depth, keys, key_units, key_units))

  # Reshape from [batch_size, depth] to [batch_size, 1, depth]
  # for matmul.
  query = array_ops.expand_dims(query, 1)

  # Inner product along the query units dimension.
  # matmul shapes: query is [batch_size, 1, depth] and
  #                keys is [batch_size, max_time, depth].
  # the inner product is asked to **transpose keys' inner shape** to get a
  # batched matmul on:
  #   [batch_size, 1, depth] . [batch_size, depth, max_time]
  # resulting in an output shape of:
  #   [batch_size, 1, max_time].
  # we then squeeze out the center singleton dimension.
  score = math_ops.matmul(query, keys, transpose_b=True)
  score = array_ops.squeeze(score, [1])

  if scale is not None:
    score = scale * score
  return score

Here we can see that the two kinds of attention are computed in very similar ways.
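To make the Luong flow $h_t \to a_t \to c_t \to \tilde h_t$ concrete outside of TensorFlow, here is a minimal pure-Python sketch of a single global-attention step. I use the dot-product score for simplicity; the function name `luong_step`, the matrix `W_c` and the toy numbers are assumptions for illustration only.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def luong_step(h_t, source_states, W_c):
    """One global-attention step: returns (a_t, c_t, h_tilde)."""
    # score(h_t, h_s) with the dot-product form
    scores = [sum(a * b for a, b in zip(h_t, h_s)) for h_s in source_states]
    a_t = softmax(scores)
    dim = len(source_states[0])
    # c_t is the weighted sum of all source hidden states
    c_t = [sum(w * h_s[d] for w, h_s in zip(a_t, source_states)) for d in range(dim)]
    # h_tilde = tanh(W_c [c_t; h_t])
    concat = c_t + h_t
    h_tilde = [math.tanh(sum(w * x for w, x in zip(row, concat))) for row in W_c]
    return a_t, c_t, h_tilde

# Toy example (hypothetical numbers): two source steps, hidden size 2.
source_states = [[1.0, 0.0], [0.0, 1.0]]
h_t = [2.0, 0.0]
W_c = [[0.5, 0.0, 0.5, 0.0],
       [0.0, 0.5, 0.0, 0.5]]  # shape 2 x 4

a_t, c_t, h_tilde = luong_step(h_t, source_states, W_c)
print(a_t)      # the first source step gets the larger weight
print(h_tilde)  # every entry lies in (-1, 1) because of tanh
```

Note that the decoder state $h_t$ is already known before attention is applied, which is exactly the ordering difference from Bahdanau Attention described earlier.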

Self-Attention

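Self-attention was introduced by the Transformer [4]. The query, key and value that appear in the code above are concepts from that paper, which pulled the attention mechanism out and made it more general.
Generally speaking, in encoder-decoder attention the key and value stand for the encoder hidden states, while the query stands for the decoder hidden state. Attention then compares the query with the keys to get a weight matrix, and uses it to take a weighted sum of the values, giving a context vector.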

$Attention(Q,K,V)=softmax(\frac{QK^T}{\sqrt{d_k}})V$
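Looking at how Attention is computed, you will notice that it is very similar to the attention computations above: $QK^T$ acts like a score. What is new is the scaling factor $\sqrt{d_k}$. The paper says it scales the scores down so that they do not grow too large, which would push the softmax probabilities toward 1 and 0.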
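Here is a small pure-Python sketch of the formula above, plus a quick look at what the scaling does. It is only an illustration under simplifying assumptions: a single head, no masking, no learned projections, and toy inputs that I made up.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def scaled_dot_product_attention(Q, K, V):
    """softmax(Q K^T / sqrt(d_k)) V, with matrices given as lists of rows."""
    d_k = len(K[0])
    scale = math.sqrt(d_k)
    outputs, weights = [], []
    for q in Q:
        # One row of Q K^T / sqrt(d_k): compare this query with every key.
        scores = [sum(a * b for a, b in zip(q, k)) / scale for k in K]
        w = softmax(scores)
        weights.append(w)
        # Weighted sum of the value rows.
        outputs.append([sum(w_j * v[d] for w_j, v in zip(w, V))
                        for d in range(len(V[0]))])
    return outputs, weights

# Self-attention on a toy sequence: Q, K and V all come from the same X.
X = [[1.0, 0.0], [0.0, 1.0]]
outputs, weights = scaled_dot_product_attention(X, X, X)
print(weights)  # each position attends mostly to itself

# Effect of the scaling factor, with hypothetical raw scores for d_k = 64.
d_k = 64
raw_scores = [8.0, 0.0]
unscaled = softmax(raw_scores)
scaled = softmax([s / math.sqrt(d_k) for s in raw_scores])
print(unscaled)  # nearly one-hot
print(scaled)    # noticeably softer
```

Without the scaling the first probability is already above 0.999, while with it the distribution stays much softer, which matches the paper's motivation for dividing by $\sqrt{d_k}$.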

Summary

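These days it seems that many papers want to add some "attention mechanism", which indirectly demonstrates how effective it is. Models such as ELMo have moved word vectors from static to dynamic, and the idea of obtaining word vectors through a weighted combination is very similar to the attention mechanism. So understanding the essence of attention is a great help when reading later papers.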

References

[1] Sutskever I, Vinyals O, Le Q V. Sequence to sequence learning with neural networks[C]//Advances in neural information processing systems. 2014: 3104-3112.
[2] Bahdanau D, Cho K, Bengio Y. Neural machine translation by jointly learning to align and translate[J]. arXiv preprint arXiv:1409.0473, 2014.
[3] Luong M T, Pham H, Manning C D. Effective approaches to attention-based neural machine translation[J]. arXiv preprint arXiv:1508.04025, 2015.
[4] Vaswani A, Shazeer N, Parmar N, et al. Attention is all you need[C]//Advances in neural information processing systems. 2017: 5998-6008.
