Natural Language Processing: Machine Translation Models (MT, NMT, Seq2Seq with Attention)

Reference:
1. Effective Approaches to Attention-based Neural Machine Translation


Machine Translation, MT

MT is the task of translating a sentence \boldsymbol x from one language to a sentence \boldsymbol y in another language.

1950s: Early Machine Translation

Machine translation research began in the early 1950s. Systems were mostly rule-based, using a bilingual dictionary to map Russian words to their English counterparts.

1990s-2010s: Statistical Machine Translation

The core idea is to learn a probabilistic model from data, i.e. we want to find the best English sentence \boldsymbol y given a French sentence \boldsymbol x:
\arg\max_{\boldsymbol y}P(\boldsymbol y|\boldsymbol x)=\arg\max_{\boldsymbol y}P(\boldsymbol x|\boldsymbol y)P(\boldsymbol y)

Here P(\boldsymbol x|\boldsymbol y) is the translation model and P(\boldsymbol y) is the language model.
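
As a toy illustration (not from the original notes), the noisy-channel decision rule amounts to re-ranking candidate translations by the sum of their translation-model and language-model log-probabilities; candidates, log_p_x_given_y and log_p_y below are hypothetical stand-ins for an enumerated hypothesis set and two pre-trained models:

def noisy_channel_best(candidates, log_p_x_given_y, log_p_y):
    """Pick argmax_y P(x|y)·P(y), working in log space for numerical stability.
    candidates:      list of candidate English sentences y
    log_p_x_given_y: dict mapping y -> log P(x|y)  (translation model)
    log_p_y:         dict mapping y -> log P(y)    (language model)
    """
    return max(candidates, key=lambda y: log_p_x_given_y[y] + log_p_y[y])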


Learning Alignment for SMT

How do we learn the translation model P(\boldsymbol x|\boldsymbol y) from the parallel corpus?
Break it down further: we actually want to consider
P(\boldsymbol x,\boldsymbol a|\boldsymbol y)

where \boldsymbol a is the alignment, i.e. the word-level correspondence between the French sentence \boldsymbol x and the English sentence \boldsymbol y.


Alignment is complex

Alignment is the correspondence between particular words in the translated sentence pair.


Learn P(\boldsymbol x,\boldsymbol a|\boldsymbol y) as a combination of many factors (a sketch of the simplest such alignment model follows this list), including:

  • Probability of particular words aligning (which also depends on their position in the sentence).
  • Probability of particular words having particular fertility (number of corresponding words).
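
The simplest such model is IBM Model 1 (named here for concreteness; the notes above do not name it), which learns word-translation probabilities t(x|y) from a parallel corpus with EM, summing over all possible alignments of each source word. A minimal sketch:

from collections import defaultdict

def ibm_model1(parallel_corpus, iterations=10):
    """parallel_corpus: list of (x_words, y_words) sentence pairs, where x is the
    foreign sentence and y the English one (conventionally with a NULL token prepended).
    Returns t[(x_word, y_word)], an estimate of P(x_word | y_word)."""
    t = defaultdict(lambda: 1.0)                      # uniform(-ish) initialization
    for _ in range(iterations):
        count = defaultdict(float)                    # expected alignment counts (E-step)
        total = defaultdict(float)
        for x_words, y_words in parallel_corpus:
            for x in x_words:
                z = sum(t[(x, y)] for y in y_words)   # normalizer: sum over alignments of x
                for y in y_words:
                    p = t[(x, y)] / z                 # posterior that x aligns to this y
                    count[(x, y)] += p
                    total[y] += p
        for (x, y) in count:                          # M-step: re-estimate t(x|y)
            t[(x, y)] = count[(x, y)] / total[y]
    return t

Given enough sentence pairs, the expected counts concentrate on the correct word pairs, e.g. ("maison", "house") and ("la", "the").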

Decoding for SMT

Given the translation and language models, how do we compute the argmax over translations \boldsymbol y?

We could enumerate every possible \boldsymbol y and calculate its probability, but that is far too expensive. A simpler idea is to use a heuristic search algorithm to search for the best translation, discarding hypotheses that are too low-probability.


The best SMT systems were extremely complex, involving, for example:

  • Lots of feature engineering.
  • Large tables of equivalent phrases, etc.

Neural Machine Translation, NMT

A sequence-to-sequence (seq2seq) model uses an encoder RNN to produce an encoding of the source sentence, and a decoder RNN to generate the target sentence conditioned on that encoding:

(figure: seq2seq translation model)
(figure: training an NMT model)

NMT Training

The seq2seq model is an example of a Conditional Language Model: the decoder predicts the next word of the target sentence \boldsymbol y conditioned on the source sentence \boldsymbol x (via the encoder hidden state).

NMT directly calculates the conditional probability P(\boldsymbol y|\boldsymbol x), unlike SMT, which decomposes it into separately learned translation and language models:
P(\boldsymbol y|\boldsymbol x)=\prod_{i=1}^TP(y_i|y_1,\cdots,y_{i-1},\boldsymbol x)

Seq2Seq is optimized as a single system. Backpropagation operates “end-to-end”.
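
Concretely, training maximizes this conditional likelihood with teacher forcing: at each step the decoder is fed the gold previous word and is penalized by the negative log-probability of the gold next word. A minimal NumPy sketch (the array names here are illustrative, not from the notes):

import numpy as np

def nmt_nll(step_probs, target_ids):
    """Teacher-forced negative log-likelihood of one target sentence.
    step_probs: [T, V] array; row i is the decoder's softmax over the vocabulary
                at step i, conditioned on the gold prefix y_1..y_{i-1} and on x.
    target_ids: length-T list of gold token ids y_1..y_T (the last one is <END>).
    """
    return -sum(np.log(step_probs[i, y]) for i, y in enumerate(target_ids))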


NMT Greedy Decoding

Greedy decoding takes the most probable word at each step of the decoder, i.e. the argmax of the predictive distribution.

(figure: greedy decoding)
(figure: problems with greedy decoding)
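
A sketch of the greedy loop, assuming a hypothetical decoder_step(prev_id, state, enc_states) interface that returns the next-word probabilities and the updated decoder state:

import numpy as np

def greedy_decode(decoder_step, enc_states, start_id, end_id, max_len=50):
    """Pick the argmax word at every step; there is no way to undo an early mistake."""
    ys, state, prev = [], None, start_id          # state=None stands for the initial decoder state
    for _ in range(max_len):
        probs, state = decoder_step(prev, state, enc_states)
        prev = int(np.argmax(probs))              # most probable word at this step
        if prev == end_id:
            break
        ys.append(prev)
    return ys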

Beam Search Decoding

Finding the optimal target sentence \boldsymbol y by exhaustively searching over all possible sequences has O(V^T) complexity (V is the vocabulary size, T is the target length), which is far too expensive.

The core idea of beam search decoding: on each step of the decoder, keep track of the k most probable partial translations (which we call hypotheses), where k is the beam size, around 5 to 10 in practice. Beam search is not guaranteed to find the optimal solution, but it is much more efficient than exhaustive search.


Stopping Criterion

In greedy decoding, usually we decode until the model produces an <END> token.

In beam search decoding, different hypotheses may produce the <END> token on different timesteps. When a hypothesis produces <END>, that hypothesis is complete: place it aside and continue exploring other hypotheses via beam search.

Usually we continue beam search until we reach timestep T, or we have at least n completed hypotheses (n and T are pre-defined cutoffs).
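
A minimal beam-search sketch, reusing the hypothetical decoder_step interface from the greedy example (here it is assumed to return log-probabilities), with the cutoff n set equal to k for simplicity:

import numpy as np

def beam_search(decoder_step, enc_states, start_id, end_id, k=5, max_len=50):
    """Keep the k most probable partial hypotheses at each step; a hypothesis that
    produces <END> is placed aside as completed. Returns (tokens, summed log-prob) pairs."""
    beams = [([start_id], 0.0, None)]                 # (tokens, score, decoder state)
    finished = []
    for _ in range(max_len):
        candidates = []
        for tokens, score, state in beams:
            log_probs, new_state = decoder_step(tokens[-1], state, enc_states)
            for w in np.argsort(log_probs)[-k:]:      # top-k extensions of this hypothesis
                candidates.append((tokens + [int(w)], score + float(log_probs[w]), new_state))
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for tokens, score, state in candidates[:k]:   # keep the k best overall
            if tokens[-1] == end_id:
                finished.append((tokens, score))      # complete: place it aside
            else:
                beams.append((tokens, score, state))
        if len(finished) >= k or not beams:           # stop: n (= k) completed, or beam empty
            break
    return finished or [(tokens, score) for tokens, score, _ in beams]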


How do we select the top hypothesis, i.e. the one with the highest score?
\text{score}(\boldsymbol y)=\log P_{\text{LM}}(\boldsymbol y|\boldsymbol x) = \sum_{i=1}^t\log P_{\text{LM}}(y_i|y_1,\cdots,y_{i-1},\boldsymbol x)

Problem with this criterion: longer hypotheses have lower scores, because each additional term is a negative log-probability.

Fix: normalize by length, and use this score to select the top hypothesis instead:
\frac{1}{t}\sum_{i=1}^t\log P_{\text{LM}}(y_i|y_1,\cdots,y_{i-1},\boldsymbol x)
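
Continuing the beam-search sketch above, the completed hypotheses can be re-ranked with this length-normalized score (the toy hypotheses below are made up for illustration):

def normalized_score(tokens, sum_log_prob):
    """Average per-token log-probability, so longer hypotheses are not penalized
    merely for containing more negative log terms."""
    return sum_log_prob / len(tokens)

finished = [(["<START>", "the", "house", "<END>"], -2.1),
            (["<START>", "the", "big", "white", "house", "<END>"], -2.8)]
best, _ = max(finished, key=lambda h: normalized_score(*h))
# unnormalized, the shorter hypothesis wins (-2.1 > -2.8);
# normalized, the longer one wins (-2.8/6 ≈ -0.47 > -2.1/4 ≈ -0.53)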


Advantages of NMT

NMT has many advantages compared to SMT: better performance (more fluent output, better use of context and of phrase similarities) and much less human engineering effort.


Challenges of NMT

Many difficulties remain: out-of-vocabulary words, domain mismatch between train and test data, maintaining context over longer text.


Attention

Using one encoding vector of the source sentence to decode/translate the target sentence requires that single vector to capture all information about the source sentence. This is an information bottleneck.
Attention core idea: on each step of the decoder, use a direct connection to the encoder to focus on a particular part of the source sequence.


Sequence-to-Sequence with Attention

Use the attention distribution to take a weighted sum of the encoder hidden states; the decoder thereby learns which source states to use when predicting the next word.
Attention also provides some interpretability: we can see what the decoder was focusing on!

On each step t (a NumPy sketch of these steps follows the list):

  • Use the decoder hidden state \boldsymbol h_t\in\R^h (the query vector) together with each encoder hidden state \boldsymbol{\overline h}_s\in\R^h to compute the attention scores \boldsymbol e_t\in\R^N (N source timesteps).
    e_t^s = \text{score}(\boldsymbol h_t, \boldsymbol{\overline h}_s)= \begin{cases} \boldsymbol h_t^\top\boldsymbol{\overline h}_s &\text{dot}\\[.5ex] \boldsymbol h_t^\top\boldsymbol W_a\boldsymbol{\overline h}_s &\text{general}\\[.5ex] \boldsymbol v_a^\top\tanh(\boldsymbol W_a[\boldsymbol h_t;\boldsymbol{\overline h}_s]) &\text{concat} \end{cases}

  • Take a softmax to get the attention distribution \boldsymbol\alpha_t.
    \boldsymbol\alpha_t=\text{softmax}(\boldsymbol e_t)\in\R^N

  • Use \boldsymbol\alpha_t to take a weighted sum over all the encoder hidden states (i.e. over all the source states) to compute the attention output \boldsymbol c_t, the global context vector.
    \boldsymbol c_t=\sum_{i=1}^N\alpha_t^i\boldsymbol{\overline h}_i \in \R^h

  • Employ a simple concatenation layer to combine the information from both vectors to produce an attentional hidden state.
    \tilde{\boldsymbol h}_t=\tanh(\boldsymbol W_c[\boldsymbol c_t;\boldsymbol h_t])

  • The attention vector \tilde{\boldsymbol h}_t is then fed through the softmax layer to produce the predictive distribution:
    p(y_t|\boldsymbol y_{<t},\boldsymbol x) = \text{softmax}(\boldsymbol W_s\tilde{\boldsymbol h}_t)
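
The steps above, for the dot-product score, fit in a short NumPy sketch; W_c, h_t and the encoder states are as defined in the list, and the final prediction with W_s is left as a comment:

import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def luong_attention_step(h_t, enc_states, W_c):
    """h_t: decoder hidden state, shape [h]; enc_states: encoder hidden states stacked as [N, h];
    W_c: combination weights, shape [h, 2h]. Returns the attentional hidden state and the weights."""
    scores = enc_states @ h_t                             # dot score e_t^s for every source position, [N]
    alpha = softmax(scores)                               # attention distribution over the source, [N]
    c_t = alpha @ enc_states                              # context vector: weighted sum of encoder states, [h]
    h_tilde = np.tanh(W_c @ np.concatenate([c_t, h_t]))   # tanh(W_c [c_t; h_t]), [h]
    # p(y_t | y_<t, x) = softmax(W_s @ h_tilde) would follow here
    return h_tilde, alpha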


Implementation with TensorFlow

The following implementation is for a multi-class classification task, not for NMT.

import tensorflow as tf  # TF 1.x graph-mode style, matching the original code


def attention(inputs, inputs_size, atten_size):
    """Additive attention pooling over RNN outputs.

    inputs:      [batch_size, max_time, hidden_size] RNN outputs
    inputs_size: [batch_size] actual (un-padded) sequence lengths
    atten_size:  dimensionality of the attention projection

    expression:
    	u = tanh(h·w + b)
    	alpha = exp(u·v) / sum(exp(u·v))      (padding positions masked out)
    	s = sum(alpha·h)
    """
    hidden_size = int(inputs.shape[2])
    w = tf.Variable(tf.random_normal([hidden_size, atten_size], stddev=0.1))
    b = tf.Variable(tf.random_normal([atten_size], stddev=0.1))
    v = tf.Variable(tf.random_normal([atten_size], stddev=0.1))

    # [batch_size, max_time, atten_size]
    u = tf.tanh(tf.tensordot(inputs, w, axes=1) + b)
    # [batch_size, max_time]
    uv = tf.linalg.matvec(u, v)
    # zero out the attention weights of padding symbols by adding a very
    # negative score before the softmax
    mask = tf.sequence_mask(inputs_size, maxlen=tf.shape(inputs)[1], dtype=tf.float32)
    uv_mask = uv + (1.0 - mask) * (-1e9)
    # [batch_size, max_time]
    alphas = tf.nn.softmax(uv_mask, axis=1)

    # equivalent manual masking / normalization:
    # alphas = tf.exp(uv) * mask
    # alphas = alphas / tf.expand_dims(tf.reduce_sum(alphas, axis=1), -1)

    # [batch_size, hidden_size]: weighted sum of the inputs over time
    output = tf.reduce_sum(inputs * tf.expand_dims(alphas, -1), axis=1)

    return alphas, output
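
A hypothetical usage in a TF 1.x classification graph, where rnn_outputs (the outputs of an RNN encoder), seq_lens and NUM_CLASSES are assumed to come from the surrounding model:

alphas, sentence_vec = attention(rnn_outputs, seq_lens, atten_size=64)
logits = tf.layers.dense(sentence_vec, NUM_CLASSES)   # sentence_vec: [batch_size, hidden_size]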
