[Repost] How RNNs and Attention Are Implemented in TensorFlow

Original link: http://cairohy.github.io/2017/06/05/ml-coding-summarize/Tensorflow%E7%9A%84RNN%E5%92%8CAttention%E7%9B%B8%E5%85%B3/
Published 2017-06-05 | Category: Programming Notes
Today let's look at how the different kinds of RNNs and Attention are actually implemented in TensorFlow.

1. From RNNCell to LSTM
Every Recurrent Neural Network is built from one or more cells, and the common parent class of these cells is RNNCell, an abstract class. It has a __call__() method which, on each call, takes an input (of shape BatchSize × input_size) and a state, and returns a tuple of an output and a new state.
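
To make that contract concrete, here is a minimal sketch (not from the original post; it assumes the TF 1.x contrib API, and all names are illustrative) of a custom cell that follows the RNNCell interface:

import tensorflow as tf

class MinimalRNNCell(tf.contrib.rnn.RNNCell):
  """output = new_state = tanh(W·input + U·state + b), like BasicRNNCell."""

  def __init__(self, num_units):
    self._num_units = num_units

  @property
  def state_size(self):
    return self._num_units

  @property
  def output_size(self):
    return self._num_units

  def __call__(self, inputs, state, scope=None):
    with tf.variable_scope(scope or "minimal_rnn_cell"):
      w = tf.get_variable("w", [inputs.get_shape()[1].value, self._num_units])
      u = tf.get_variable("u", [self._num_units, self._num_units])
      b = tf.get_variable("b", [self._num_units],
                          initializer=tf.zeros_initializer())
      new_state = tf.tanh(tf.matmul(inputs, w) + tf.matmul(state, u) + b)
    # For a vanilla RNN the output and the new state are the same tensor.
    return new_state, new_state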

1. BasicRNNCell, i.e. the classic vanilla RNN. When called, its output and state are computed as output = new_state = act(W·input + U·state + B); internally it calls the _linear() function.
2. The _linear() function takes the inputs, multiplies them by a weight matrix W, adds a bias b, and returns the result (a rough sketch of the idea appears at the end of this section).
3. BasicLSTMCell, i.e. the LSTM; its call method:
def call(self, inputs, state, scope=None):
  """Long short-term memory cell (LSTM)."""
  with _checked_scope(self, scope or "basic_lstm_cell", reuse=self._reuse):
    # Parameters of gates are concatenated into one multiply for efficiency.
    if self._state_is_tuple:
      # Usually this branch is taken; unpack c_t and h_t.
      c, h = state
    else:
      c, h = array_ops.split(value=state, num_or_size_splits=2, axis=1)

    # Following "Recurrent Neural Network Regularization", all four gates
    # are computed with a single matmul.
    concat = _linear([inputs, h], 4 * self._num_units, True)

    # i = input_gate, j = new_input, f = forget_gate, o = output_gate
    i, j, f, o = array_ops.split(value=concat, num_or_size_splits=4, axis=1)

    new_c = (c * sigmoid(f + self._forget_bias) + sigmoid(i) *
             self._activation(j))
    new_h = self._activation(new_c) * sigmoid(o)

    if self._state_is_tuple:
      new_state = LSTMStateTuple(new_c, new_h)
    else:
      new_state = array_ops.concat([new_c, new_h], 1)
    # Note: the returned output is h_t, while the state is (c, h).
    return new_h, new_state

This corresponds exactly to the LSTM equations in the paper (the figure from the original post is not reproduced here):

Taken as a whole, the update boils down to: h_t = G(h_{t−1}, x_t, c_t)

4. GRUCell follows the implementation from the 2014 EMNLP paper "Learning Phrase Representations using RNN Encoder–Decoder for Statistical Machine Translation". The equations from the paper are as follows (parameters written in simplified form):
Taken as a whole, the equations below amount to h_t = G(h_{t−1}, x_t):

r = σ(W x_t + U h_{t−1})
z = σ(W x_t + U h_{t−1})
h̃_t = ϕ(W x_t + U(r ⊙ h_{t−1}))
h_t = z h_{t−1} + (1 − z) h̃_t

(each W and U above denotes a distinct weight matrix)
The corresponding code:

def call(self, inputs, state, scope=None):
  """Gated recurrent unit (GRU) with nunits cells."""
  with _checked_scope(self, scope or "gru_cell", reuse=self._reuse):
    with vs.variable_scope("gates"):  # Reset gate and update gate.
      # We start with bias of 1.0 to not reset and not update.
      # Both gate values are computed with a single matmul.
      value = sigmoid(_linear(
          [inputs, state], 2 * self._num_units, True, 1.0))
      # u here is the z (update gate) in the formulas above.
      r, u = array_ops.split(
          value=value,
          num_or_size_splits=2,
          axis=1)
    with vs.variable_scope("candidate"):
      c = self._activation(_linear([inputs, r * state],
                                   self._num_units, True))
    new_h = u * state + (1 - u) * c
  # In a GRU the output and the state are both the same h.
  return new_h, new_h
In addition, there is LSTMCell, which supports peephole connections and output projection; again, only its __call__() method differs.
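
For reference, the _linear() helper used throughout the snippets above is essentially "concatenate the inputs, multiply by one shared weight matrix, optionally add a bias". A rough sketch of the idea (not the actual library code; the function name and the bias_start argument are illustrative):

import tensorflow as tf

def linear_sketch(args, output_size, bias, bias_start=0.0, scope=None):
  # args: a list of 2-D tensors [batch, ?], concatenated along axis 1.
  with tf.variable_scope(scope or "linear"):
    total_input_size = sum(a.get_shape()[1].value for a in args)
    w = tf.get_variable("weights", [total_input_size, output_size])
    res = tf.matmul(tf.concat(args, axis=1), w)
    if not bias:
      return res
    b = tf.get_variable("bias", [output_size],
                        initializer=tf.constant_initializer(bias_start))
    return res + b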

2. Cell Wrappers
Classes such as InputProjectionWrapper and OutputProjectionWrapper, which project the cell's inputs and outputs, are often slower than simply applying the TF ops yourself outside the cell.

DropoutWrapper holds a cell as an attribute and implements the call method; it applies dropout before and after invoking the cell, supporting dropout on the inputs, the state, and the outputs.

ResidualWrapper adds the input to the cell's output (a residual connection) and returns the sum.

DeviceWrapper ensures the cell runs on the specified device.

MultiRNNCell also counts as a wrapper, since it holds a list of cells as an attribute; it is used to build multi-layer RNNs (see the composition sketch below).
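
As a quick illustration (a sketch only, assuming the TF 1.x contrib API; the sizes and keep probability are made up), these pieces compose naturally: a peephole LSTMCell as the base cell, DropoutWrapper around each layer, and MultiRNNCell to stack the layers:

import tensorflow as tf

def make_cell(num_units, keep_prob):
  # A peephole LSTM wrapped with dropout on its inputs and outputs.
  cell = tf.contrib.rnn.LSTMCell(num_units, use_peepholes=True)
  return tf.contrib.rnn.DropoutWrapper(
      cell, input_keep_prob=keep_prob, output_keep_prob=keep_prob)

# Stack two such layers into one multi-layer cell.
stacked = tf.contrib.rnn.MultiRNNCell(
    [make_cell(256, keep_prob=0.8) for _ in range(2)], state_is_tuple=True)

inputs = tf.placeholder(tf.float32, [None, 30, 64])  # [batch, time, features]
outputs, final_state = tf.nn.dynamic_rnn(stacked, inputs, dtype=tf.float32)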

AttentionCellWrapper follows "Neural Machine Translation by Jointly Learning to Align and Translate", i.e. Bahdanau-style attention. The equations are below, where y is the input at time t (which in that paper is also the output at time t−1), s is the hidden state, and c is obtained by computing similarities against the encoder hidden states, normalizing them, and taking a weighted sum:

s_i = (1 − z_i) ∘ s_{i−1} + z_i ∘ s̃_i
s̃_i = tanh(W e(y_{i−1}) + U[r_i ∘ s_{i−1}] + C c_i)
z_i = σ(f(y_{i−1}, s_{i−1}, c_i))
r_i = σ(f(y_{i−1}, s_{i−1}, c_i))
In that paper, the alignment (similarity) function a is implemented by a feed-forward network:

e_{ij} = a(s_{i−1}, h_j) = v^T tanh(g(s_{i−1}, h_j))
Next, the TensorFlow code:

def call(self, inputs, state, scope=None):
  """Long short-term memory cell with attention (LSTMA)."""
  # state \in R^{B \times T}
  with _checked_scope(self, scope or "attention_cell_wrapper",
                      reuse=self._reuse):
    if self._state_is_tuple:
      # Split the state into three parts: the wrapped LSTM's state,
      # attns (the attention vector), and the attention states.
      state, attns, attn_states = state
    else:
      # If the state is not a tuple, slice it up by length instead.
      states = state
      state = array_ops.slice(states, [0, 0], [-1, self._cell.state_size])
      attns = array_ops.slice(
          states, [0, self._cell.state_size], [-1, self._attn_size])
      attn_states = array_ops.slice(
          states, [0, self._cell.state_size + self._attn_size],
          [-1, self._attn_size * self._attn_length])
    # The attention states have shape
    # [None x attention window length x attention vector size].
    attn_states = array_ops.reshape(attn_states,
                                    [-1, self._attn_length, self._attn_size])
    input_size = self._input_size
    if input_size is None:
      input_size = inputs.get_shape().as_list()[1]
    # Mix the current input with the previous attention vector via a
    # linear projection before feeding the wrapped cell.
    inputs = _linear([inputs, attns], input_size, True)
    lstm_output, new_state = self._cell(inputs, state)
    if self._state_is_tuple:
      new_state_cat = array_ops.concat(nest.flatten(new_state), 1)
    else:
      new_state_cat = new_state
    # Use the attention mechanism to compute the context vector c_t and the
    # attention (hidden) states h_j needed at the next step.
    new_attns, new_attn_states = self._attention(new_state_cat, attn_states)
    with vs.variable_scope("attn_output_projection"):
      # Compute the output s_t at time t from c_t and x_t (i.e. y_{t-1}).
      output = _linear([lstm_output, new_attns], self._attn_size, True)
    # Append the current output s_t to the attention states for the next step.
    new_attn_states = array_ops.concat(
        [new_attn_states, array_ops.expand_dims(output, 1)], 1)
    new_attn_states = array_ops.reshape(
        new_attn_states, [-1, self._attn_length * self._attn_size])
    new_state = (new_state, new_attns, new_attn_states)
    if not self._state_is_tuple:
      new_state = array_ops.concat(list(new_state), 1)
    # Finally return s_t and the state. Note that the h returned here is s_t
    # itself, so this AttentionCellWrapper has limited applicability; in some
    # cases it cannot be used as-is and you need to modify or customize it.
    return output, new_state

def _attention(self, query, attn_states):
  conv2d = nn_ops.conv2d
  reduce_sum = math_ops.reduce_sum
  softmax = nn_ops.softmax
  tanh = math_ops.tanh

  with vs.variable_scope("attention"):
    k = vs.get_variable(
        "attn_w", [1, 1, self._attn_size, self._attn_vec_size])
    v = vs.get_variable("attn_v", [self._attn_vec_size])
    # hidden corresponds to all of the h_j.
    hidden = array_ops.reshape(attn_states,
                               [-1, self._attn_length, 1, self._attn_size])
    # Compute U h_j for every j; shape: [None, attn_len, 1, attn_vec_size].
    hidden_features = conv2d(hidden, k, [1, 1, 1, 1], "SAME")
    y = _linear(query, self._attn_vec_size, True)
    # y is W s_i, reshaped so it broadcasts against hidden_features.
    y = array_ops.reshape(y, [-1, 1, 1, self._attn_vec_size])
    # Attention score: s \in R^{batch x attn_len}, i.e. all the e_{ij}.
    s = reduce_sum(v * tanh(hidden_features + y), [2, 3])
    # a \in R^{batch x attn_len}, the \alpha in the paper.
    a = softmax(s)
    # Context vector c_i = \sum_j \alpha_{ij} h_j.
    d = reduce_sum(
        array_ops.reshape(a, [-1, self._attn_length, 1, 1]) * hidden, [1, 2])
    new_attns = array_ops.reshape(d, [-1, self._attn_size])
    # Drop the oldest attention state.
    new_attn_states = array_ops.slice(attn_states, [0, 1, 0], [-1, -1, -1])
    return new_attns, new_attn_states
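
The conv2d with a 1×1 kernel above is just a trick to apply U to every h_j at once. For clarity, here is a minimal standalone sketch of the same additive (Bahdanau-style) scoring, with illustrative names and no claim to match the library code exactly:

import tensorflow as tf

def additive_attention(query, keys, attn_vec_size):
  # query: [batch, query_size] (s_i); keys: [batch, time, key_size] (all h_j).
  key_size = keys.get_shape()[-1].value
  query_size = query.get_shape()[-1].value
  u = tf.get_variable("attn_u", [key_size, attn_vec_size])
  w = tf.get_variable("attn_w", [query_size, attn_vec_size])
  v = tf.get_variable("attn_v", [attn_vec_size])
  features = tf.tensordot(keys, u, axes=[[2], [0]])      # U h_j for every j
  features += tf.expand_dims(tf.matmul(query, w), 1)     # + W s_i, broadcast over time
  scores = tf.reduce_sum(v * tf.tanh(features), axis=2)  # e_ij, shape [batch, time]
  alpha = tf.nn.softmax(scores)                          # attention weights
  context = tf.reduce_sum(tf.expand_dims(alpha, -1) * keys, axis=1)
  return context, alpha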

Finally, static_rnn, dynamic_rnn, and bidirectional_dynamic_rnn all invoke the __call__() method of these cells internally.
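
For example (a usage sketch assuming the TF 1.x API; the sizes are made up), bidirectional_dynamic_rnn simply drives a forward and a backward cell over the time axis and returns both output sequences:

import tensorflow as tf

fw_cell = tf.contrib.rnn.GRUCell(128)
bw_cell = tf.contrib.rnn.GRUCell(128)
inputs = tf.placeholder(tf.float32, [None, 50, 32])  # [batch, time, features]
(out_fw, out_bw), (state_fw, state_bw) = tf.nn.bidirectional_dynamic_rnn(
    fw_cell, bw_cell, inputs, dtype=tf.float32)
# Concatenate the two directions to get encoder outputs of size 256.
encoder_outputs = tf.concat([out_fw, out_bw], axis=-1)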

3. The Various Attention Mechanisms
_BaseAttentionMechanism
BahdanauAttention
LuongAttention
DynamicAttentionWrapper
Reference usage from the official documentation:

cell = tf.contrib.rnn.DeviceWrapper(LSTMCell(512), "/gpu:0")
attention_mechanism = tf.contrib.seq2seq.LuongAttention(512, encoder_outputs)
attn_cell = tf.contrib.seq2seq.DynamicAttentionWrapper(
    cell, attention_mechanism, attention_size=256)
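
Note that DynamicAttentionWrapper only existed briefly in contrib; in later TensorFlow 1.x releases it was renamed. A rough equivalent under the newer contrib API (TF ≥ 1.2 assumed) would be:

cell = tf.contrib.rnn.LSTMCell(512)
attention_mechanism = tf.contrib.seq2seq.LuongAttention(512, encoder_outputs)
attn_cell = tf.contrib.seq2seq.AttentionWrapper(
    cell, attention_mechanism, attention_layer_size=256)
# attn_cell is itself an RNNCell, so it can be fed to tf.nn.dynamic_rnn or a
# seq2seq decoder like any other cell.
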
References
https://www.tensorflow.org/api_guides/python/contrib.seq2seq#Attention
https://www.tensorflow.org/api_guides/python/contrib.rnn#Core_RNN_Cell_wrappers_RNNCells_that_wrap_other_RNNCells_

Machine Learning # Deep Learning # Natural Language Processing # tensorflow
