[Repost] How RNNs and Attention Are Implemented in TensorFlow

Original link: http://cairohy.github.io/2017/06/05/ml-coding-summarize/Tensorflow%E7%9A%84RNN%E5%92%8CAttention%E7%9B%B8%E5%85%B3/
Published 2017-06-05 | Category: Programming Notes
Today let's look at how the different kinds of RNNs and Attention are actually implemented in TensorFlow.

1. From RNNCell to LSTM
Every Recurrent Neural Network is built from one or more cells, and the common parent class of these cells is RNNCell, an abstract class. It has a __call__() method which, on each call, takes an input (of shape BatchSize × input_size) and a state, and returns a tuple of an output and a new state.
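
To make that contract concrete, here is a minimal sketch (not from the original post; it assumes the TF 1.x contrib API, and all names are illustrative) of a custom cell that follows the RNNCell interface:

import tensorflow as tf

class MinimalRNNCell(tf.contrib.rnn.RNNCell):
  """output = new_state = tanh(W·input + U·state + b), like BasicRNNCell."""

  def __init__(self, num_units):
    self._num_units = num_units

  @property
  def state_size(self):
    return self._num_units

  @property
  def output_size(self):
    return self._num_units

  def __call__(self, inputs, state, scope=None):
    with tf.variable_scope(scope or "minimal_rnn_cell"):
      w = tf.get_variable("w", [inputs.get_shape()[1].value, self._num_units])
      u = tf.get_variable("u", [self._num_units, self._num_units])
      b = tf.get_variable("b", [self._num_units],
                          initializer=tf.zeros_initializer())
      new_state = tf.tanh(tf.matmul(inputs, w) + tf.matmul(state, u) + b)
    # For a vanilla RNN the output and the new state are the same tensor.
    return new_state, new_state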

1. BasicRNNCell, i.e. the classic vanilla RNN. When called, its output and state are computed as output = new_state = act(W·input + U·state + B); internally it calls the _linear() function.
2. The _linear() function takes the inputs, multiplies them by a weight matrix W, adds a bias b, and returns the result (a rough sketch of the idea appears at the end of this section).
3. BasicLSTMCell, i.e. the LSTM; its call method:
def call(self, inputs, state, scope=None):
  """Long short-term memory cell (LSTM)."""
  with _checked_scope(self, scope or "basic_lstm_cell", reuse=self._reuse):
    # Parameters of gates are concatenated into one multiply for efficiency.
    if self._state_is_tuple:
      # Usually this branch is taken; unpack c_t and h_t.
      c, h = state
    else:
      c, h = array_ops.split(value=state, num_or_size_splits=2, axis=1)

    # Following "Recurrent Neural Network Regularization", all four gates
    # are computed with a single matmul.
    concat = _linear([inputs, h], 4 * self._num_units, True)

    # i = input_gate, j = new_input, f = forget_gate, o = output_gate
    i, j, f, o = array_ops.split(value=concat, num_or_size_splits=4, axis=1)

    new_c = (c * sigmoid(f + self._forget_bias) + sigmoid(i) *
             self._activation(j))
    new_h = self._activation(new_c) * sigmoid(o)

    if self._state_is_tuple:
      new_state = LSTMStateTuple(new_c, new_h)
    else:
      new_state = array_ops.concat([new_c, new_h], 1)
    # Note: the returned output is h_t, while the state is (c, h).
    return new_h, new_state

This corresponds exactly to the LSTM equations in the paper (the figure from the original post is not reproduced here):

Taken as a whole, the update boils down to: h_t = G(h_{t−1}, x_t, c_t)

4. GRUCell follows the implementation from the 2014 EMNLP paper "Learning Phrase Representations using RNN Encoder–Decoder for Statistical Machine Translation". The equations from the paper are as follows (parameters written in simplified form):
Taken as a whole, the equations below amount to h_t = G(h_{t−1}, x_t):

r = σ(W x_t + U h_{t−1})
z = σ(W x_t + U h_{t−1})
h̃_t = ϕ(W x_t + U(r ⊙ h_{t−1}))
h_t = z h_{t−1} + (1 − z) h̃_t

(each W and U above denotes a distinct weight matrix)
The corresponding code:

def call(self, inputs, state, scope=None):
  """Gated recurrent unit (GRU) with nunits cells."""
  with _checked_scope(self, scope or "gru_cell", reuse=self._reuse):
    with vs.variable_scope("gates"):  # Reset gate and update gate.
      # We start with bias of 1.0 to not reset and not update.
      # Both gate values are computed with a single matmul.
      value = sigmoid(_linear(
          [inputs, state], 2 * self._num_units, True, 1.0))
      # u here is the z (update gate) in the formulas above.
      r, u = array_ops.split(
          value=value,
          num_or_size_splits=2,
          axis=1)
    with vs.variable_scope("candidate"):
      c = self._activation(_linear([inputs, r * state],
                                   self._num_units, True))
    new_h = u * state + (1 - u) * c
  # In a GRU the output and the state are both the same h.
  return new_h, new_h
In addition, there is LSTMCell, which supports peephole connections and output projection; again, only its __call__() method differs.
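
For reference, the _linear() helper used throughout the snippets above is essentially "concatenate the inputs, multiply by one shared weight matrix, optionally add a bias". A rough sketch of the idea (not the actual library code; the function name and the bias_start argument are illustrative):

import tensorflow as tf

def linear_sketch(args, output_size, bias, bias_start=0.0, scope=None):
  # args: a list of 2-D tensors [batch, ?], concatenated along axis 1.
  with tf.variable_scope(scope or "linear"):
    total_input_size = sum(a.get_shape()[1].value for a in args)
    w = tf.get_variable("weights", [total_input_size, output_size])
    res = tf.matmul(tf.concat(args, axis=1), w)
    if not bias:
      return res
    b = tf.get_variable("bias", [output_size],
                        initializer=tf.constant_initializer(bias_start))
    return res + b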

2. Cell Wrappers
Classes such as InputProjectionWrapper and OutputProjectionWrapper, which project the cell's inputs and outputs, are often slower than simply applying the TF ops yourself outside the cell.

DropoutWrapper holds a cell as an attribute and implements the call method; it applies dropout before and after invoking the cell, supporting dropout on the inputs, the state, and the outputs.

ResidualWrapper adds the input to the cell's output (a residual connection) and returns the sum.

DeviceWrapper ensures the cell runs on the specified device.

MultiRNNCell also counts as a wrapper, since it holds a list of cells as an attribute; it is used to build multi-layer RNNs (see the composition sketch below).
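
As a quick illustration (a sketch only, assuming the TF 1.x contrib API; the sizes and keep probability are made up), these pieces compose naturally: a peephole LSTMCell as the base cell, DropoutWrapper around each layer, and MultiRNNCell to stack the layers:

import tensorflow as tf

def make_cell(num_units, keep_prob):
  # A peephole LSTM wrapped with dropout on its inputs and outputs.
  cell = tf.contrib.rnn.LSTMCell(num_units, use_peepholes=True)
  return tf.contrib.rnn.DropoutWrapper(
      cell, input_keep_prob=keep_prob, output_keep_prob=keep_prob)

# Stack two such layers into one multi-layer cell.
stacked = tf.contrib.rnn.MultiRNNCell(
    [make_cell(256, keep_prob=0.8) for _ in range(2)], state_is_tuple=True)

inputs = tf.placeholder(tf.float32, [None, 30, 64])  # [batch, time, features]
outputs, final_state = tf.nn.dynamic_rnn(stacked, inputs, dtype=tf.float32)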

AttentionCellWrapper follows "Neural Machine Translation by Jointly Learning to Align and Translate", i.e. Bahdanau-style attention. The equations are below, where y is the input at time t (which in that paper is also the output at time t−1), s is the hidden state, and c is obtained by computing similarities against the encoder hidden states, normalizing them, and taking a weighted sum:

s_i = (1 − z_i) ∘ s_{i−1} + z_i ∘ s̃_i
s̃_i = tanh(W e(y_{i−1}) + U[r_i ∘ s_{i−1}] + C c_i)
z_i = σ(f(y_{i−1}, s_{i−1}, c_i))
r_i = σ(f(y_{i−1}, s_{i−1}, c_i))
In that paper, the alignment (similarity) function a is implemented by a feed-forward network:

e_{ij} = a(s_{i−1}, h_j) = v^T tanh(g(s_{i−1}, h_j))
Next, the TensorFlow code:

def call(self, inputs, state, scope=None):
  """Long short-term memory cell with attention (LSTMA)."""
  # state \in R^{B \times T}
  with _checked_scope(self, scope or "attention_cell_wrapper",
                      reuse=self._reuse):
    if self._state_is_tuple:
      # Split the state into three parts: the wrapped LSTM's state,
      # attns (the attention vector), and the attention states.
      state, attns, attn_states = state
    else:
      # If the state is not a tuple, slice it up by length instead.
      states = state
      state = array_ops.slice(states, [0, 0], [-1, self._cell.state_size])
      attns = array_ops.slice(
          states, [0, self._cell.state_size], [-1, self._attn_size])
      attn_states = array_ops.slice(
          states, [0, self._cell.state_size + self._attn_size],
          [-1, self._attn_size * self._attn_length])
    # The attention states have shape
    # [None x attention window length x attention vector size].
    attn_states = array_ops.reshape(attn_states,
                                    [-1, self._attn_length, self._attn_size])
    input_size = self._input_size
    if input_size is None:
      input_size = inputs.get_shape().as_list()[1]
    # Mix the current input with the previous attention vector via a
    # linear projection before feeding the wrapped cell.
    inputs = _linear([inputs, attns], input_size, True)
    lstm_output, new_state = self._cell(inputs, state)
    if self._state_is_tuple:
      new_state_cat = array_ops.concat(nest.flatten(new_state), 1)
    else:
      new_state_cat = new_state
    # Use the attention mechanism to compute the context vector c_t and the
    # attention (hidden) states h_j needed at the next step.
    new_attns, new_attn_states = self._attention(new_state_cat, attn_states)
    with vs.variable_scope("attn_output_projection"):
      # Compute the output s_t at time t from c_t and x_t (i.e. y_{t-1}).
      output = _linear([lstm_output, new_attns], self._attn_size, True)
    # Append the current output s_t to the attention states for the next step.
    new_attn_states = array_ops.concat(
        [new_attn_states, array_ops.expand_dims(output, 1)], 1)
    new_attn_states = array_ops.reshape(
        new_attn_states, [-1, self._attn_length * self._attn_size])
    new_state = (new_state, new_attns, new_attn_states)
    if not self._state_is_tuple:
      new_state = array_ops.concat(list(new_state), 1)
    # Finally return s_t and the state. Note that the h returned here is s_t
    # itself, so this AttentionCellWrapper has limited applicability; in some
    # cases it cannot be used as-is and you need to modify or customize it.
    return output, new_state

def _attention(self, query, attn_states):
  conv2d = nn_ops.conv2d
  reduce_sum = math_ops.reduce_sum
  softmax = nn_ops.softmax
  tanh = math_ops.tanh

  with vs.variable_scope("attention"):
    k = vs.get_variable(
        "attn_w", [1, 1, self._attn_size, self._attn_vec_size])
    v = vs.get_variable("attn_v", [self._attn_vec_size])
    # hidden corresponds to all of the h_j.
    hidden = array_ops.reshape(attn_states,
                               [-1, self._attn_length, 1, self._attn_size])
    # Compute U h_j for every j; shape: [None, attn_len, 1, attn_vec_size].
    hidden_features = conv2d(hidden, k, [1, 1, 1, 1], "SAME")
    y = _linear(query, self._attn_vec_size, True)
    # y is W s_i, reshaped so it broadcasts against hidden_features.
    y = array_ops.reshape(y, [-1, 1, 1, self._attn_vec_size])
    # Attention score: s \in R^{batch x attn_len}, i.e. all the e_{ij}.
    s = reduce_sum(v * tanh(hidden_features + y), [2, 3])
    # a \in R^{batch x attn_len}, the \alpha in the paper.
    a = softmax(s)
    # Context vector c_i = \sum_j \alpha_{ij} h_j.
    d = reduce_sum(
        array_ops.reshape(a, [-1, self._attn_length, 1, 1]) * hidden, [1, 2])
    new_attns = array_ops.reshape(d, [-1, self._attn_size])
    # Drop the oldest attention state.
    new_attn_states = array_ops.slice(attn_states, [0, 1, 0], [-1, -1, -1])
    return new_attns, new_attn_states
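
The conv2d with a 1×1 kernel above is just a trick to apply U to every h_j at once. For clarity, here is a minimal standalone sketch of the same additive (Bahdanau-style) scoring, with illustrative names and no claim to match the library code exactly:

import tensorflow as tf

def additive_attention(query, keys, attn_vec_size):
  # query: [batch, query_size] (s_i); keys: [batch, time, key_size] (all h_j).
  key_size = keys.get_shape()[-1].value
  query_size = query.get_shape()[-1].value
  u = tf.get_variable("attn_u", [key_size, attn_vec_size])
  w = tf.get_variable("attn_w", [query_size, attn_vec_size])
  v = tf.get_variable("attn_v", [attn_vec_size])
  features = tf.tensordot(keys, u, axes=[[2], [0]])      # U h_j for every j
  features += tf.expand_dims(tf.matmul(query, w), 1)     # + W s_i, broadcast over time
  scores = tf.reduce_sum(v * tf.tanh(features), axis=2)  # e_ij, shape [batch, time]
  alpha = tf.nn.softmax(scores)                          # attention weights
  context = tf.reduce_sum(tf.expand_dims(alpha, -1) * keys, axis=1)
  return context, alpha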

Finally, static_rnn, dynamic_rnn, and bidirectional_dynamic_rnn all invoke the __call__() method of these cells internally.
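
For example (a usage sketch assuming the TF 1.x API; the sizes are made up), bidirectional_dynamic_rnn simply drives a forward and a backward cell over the time axis and returns both output sequences:

import tensorflow as tf

fw_cell = tf.contrib.rnn.GRUCell(128)
bw_cell = tf.contrib.rnn.GRUCell(128)
inputs = tf.placeholder(tf.float32, [None, 50, 32])  # [batch, time, features]
(out_fw, out_bw), (state_fw, state_bw) = tf.nn.bidirectional_dynamic_rnn(
    fw_cell, bw_cell, inputs, dtype=tf.float32)
# Concatenate the two directions to get encoder outputs of size 256.
encoder_outputs = tf.concat([out_fw, out_bw], axis=-1)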

3. The Various Attention Mechanisms
_BaseAttentionMechanism
BahdanauAttention
LuongAttention
DynamicAttentionWrapper
Reference usage from the official documentation:

cell = tf.contrib.rnn.DeviceWrapper(LSTMCell(512), "/gpu:0")
attention_mechanism = tf.contrib.seq2seq.LuongAttention(512, encoder_outputs)
attn_cell = tf.contrib.seq2seq.DynamicAttentionWrapper(
    cell, attention_mechanism, attention_size=256)
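
Note that DynamicAttentionWrapper only existed briefly in contrib; in later TensorFlow 1.x releases it was renamed. A rough equivalent under the newer contrib API (TF ≥ 1.2 assumed) would be:

cell = tf.contrib.rnn.LSTMCell(512)
attention_mechanism = tf.contrib.seq2seq.LuongAttention(512, encoder_outputs)
attn_cell = tf.contrib.seq2seq.AttentionWrapper(
    cell, attention_mechanism, attention_layer_size=256)
# attn_cell is itself an RNNCell, so it can be fed to tf.nn.dynamic_rnn or a
# seq2seq decoder like any other cell.
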
References
https://www.tensorflow.org/api_guides/python/contrib.seq2seq#Attention
https://www.tensorflow.org/api_guides/python/contrib.rnn#Core_RNN_Cell_wrappers_RNNCells_that_wrap_other_RNNCells_

Machine Learning # Deep Learning # Natural Language Processing # tensorflow
