RNN: From BasicRNN to GRU/LSTM

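Recurrent neural networks (RNNs) are an important family of neural network models, particularly well suited to sequence labeling problems. When first learning about RNNs, it is easy to get lost among diagrams that all look vaguely alike, and to wonder: how do the different cells work, how are they put together, what does the input data actually look like, how is it processed by a single cell, how does it flow between cells, how is all of this implemented in code, and how complex is it? This article tries to address these questions from several angles, moving from simple to advanced and focusing on the essentials. The plan is to first explain the structure of a single cell, then explain how multiple cells form a chain, and finally explain how different cells and chain structures can be combined into a more complex seq2seq model. The article draws on parts of Stanford's CS224D course. Each step comes with an explanation of the principle, a look at how sibling structures differ and evolved from one another, and a view from the code, in the hope of giving beginners both the details and the big picture.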

Single RNN cell:

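The RNN cell is the most basic unit of a recurrent neural network. Besides the usual input X(t), the cell takes one extra input, H(t-1), which carries the memory of the previous step. It can be called the memory, or the hidden state of the previous step.
As shown in the figure below: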

For the most basic RNN cell (BasicRNNCell in rnn_cell_impl.py), H(t-1) and Y(t-1) are the same. Yes, exactly the same, as the source code below shows:

class BasicRNNCell(RNNCell):
  def __init__(self, num_units, activation=None, reuse=None):
    super(BasicRNNCell, self).__init__(_reuse=reuse)
    self._num_units = num_units
    self._activation = activation or math_ops.tanh
    self._linear = None

  @property
  def state_size(self):
    return self._num_units

  @property
  def output_size(self):
    return self._num_units

  def call(self, inputs, state):
    """Most basic RNN: output = new_state = act(W * input + U * state + B)."""
    if self._linear is None:
      self._linear = _Linear([inputs, state], self._num_units, True)

    output = self._activation(self._linear([inputs, state]))
    return output, output

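The two values returned by call are the cell's output and its hidden state. The code returns the same tensor for both, so output_size equals state_size, and both equal num_units. The code also shows that the BasicRNNCell constructor has only one required argument, num_units, which is the output dimension of the fully connected layer inside the cell. The input dimension is arbitrary: it is determined by the size of the input the user passes to call, which is why the constructor needs only the output dimension.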

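Next, a simple example shows how the output Y(t), which is also the hidden state, is computed from X(t) and H(t-1). Suppose we are building a translation model from Chinese to English. The training set has one million examples, each consisting of a Chinese sentence and its English translation. We set batch_size=100, so 100 training examples are processed at a time. Each sentence is first segmented into words, and each word is converted to a word embedding. To keep the figure simple, assume the embedding dimension is 4, so every word is represented by a vector of length 4 (in practice the dimension is much larger, for example 256).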

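This is the computation performed by a minimal BasicRNNCell, and understanding it thoroughly is essential for understanding the whole RNN. A few points deserve emphasis. First, this is a fully connected layer: X(t) is mapped by the matrix W to x_output, and H(t-1) is mapped by the matrix U to state_output. Second, x_output and state_output have the same dimension, otherwise they could not be added together. Third, after x_output and state_output are summed, the bias is added, and the activation (tanh by default) is applied, one copy of the result becomes Y(t) and another becomes H(t). So for BasicRNNCell, Y(t)=H(t).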
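
The computation described above can be sketched in a few lines of NumPy. This is only an illustrative sketch, not the TensorFlow implementation: the function name basic_rnn_step, the choice num_units=8, and the random initialization are assumptions made for the example, while batch_size=100 and the embedding dimension 4 follow the text.

```python
import numpy as np

def basic_rnn_step(x_t, h_prev, W, U, b):
    """One BasicRNNCell-style step: output = new_state = tanh(x_t W + h_prev U + b)."""
    h_t = np.tanh(x_t @ W + h_prev @ U + b)
    return h_t, h_t  # the output Y(t) and the new state H(t) are the same array

# Sizes from the text; num_units=8 is an arbitrary choice for this sketch.
batch_size, embed_dim, num_units = 100, 4, 8
rng = np.random.default_rng(0)

W = rng.normal(scale=0.1, size=(embed_dim, num_units))   # maps X(t) to x_output
U = rng.normal(scale=0.1, size=(num_units, num_units))   # maps H(t-1) to state_output
b = np.zeros(num_units)                                  # bias

x_t = rng.normal(size=(batch_size, embed_dim))           # one word per sentence in the batch
h_prev = np.zeros((batch_size, num_units))               # H(t-1), zeros at the first step

y_t, h_t = basic_rnn_step(x_t, h_prev, W, U, b)
print(y_t.shape)  # (100, 8)
```

Note that W has shape (4, 8) only because the input happens to have 4 features; the cell itself is defined by num_units alone, which matches the single required constructor argument.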

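So training this network means training the matrices W and U and the bias. The initial state H(0) is usually just set to zeros, although it can also be made a learnable parameter. With these concepts in place, we can understand the following diagram of an RNN unrolled over time: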
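
Unrolling over time can be sketched as a plain loop that reuses the same parameters at every step. As before, this is a hedged NumPy sketch: unroll_basic_rnn, the sequence length T=6, and num_units=8 are assumptions chosen for illustration.

```python
import numpy as np

def unroll_basic_rnn(xs, h0, W, U, b):
    """Run a basic tanh RNN over a sequence.

    xs has shape (T, batch, embed_dim). Returns all outputs, shape
    (T, batch, num_units), and the final hidden state.
    """
    h = h0
    outputs = []
    for x_t in xs:                         # one iteration per time step
        h = np.tanh(x_t @ W + h @ U + b)   # the same W, U, b are shared by all steps
        outputs.append(h)
    return np.stack(outputs), h

T, batch_size, embed_dim, num_units = 6, 100, 4, 8
rng = np.random.default_rng(0)
W = rng.normal(scale=0.1, size=(embed_dim, num_units))
U = rng.normal(scale=0.1, size=(num_units, num_units))
b = np.zeros(num_units)

xs = rng.normal(size=(T, batch_size, embed_dim))
h0 = np.zeros((batch_size, num_units))     # H(0), here simply zeros

outputs, final_state = unroll_basic_rnn(xs, h0, W, U, b)
print(outputs.shape)  # (6, 100, 8)
```

The "chain" in the unrolled diagram is therefore one cell applied T times, not T different cells.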

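The BasicRNNCell described above is the most basic form of RNN cell. Many more elaborate cells have been built on top of it, such as BasicLSTMCell, LSTMCell and GRUCell, as well as the MultiRNNCell wrapper, which stacks several cells. Why improve on BasicRNNCell at all? Because the basic RNN cell is prone to vanishing and/or exploding gradients when gradients have to propagate across many time steps. The proof is beyond the scope of this article; see the Stanford course notes listed in the references, which give a detailed derivation. GRU and LSTM cells were introduced to mitigate this problem.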
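
The vanishing case can be observed numerically without the full proof. The sketch below tracks the Jacobian of H(t) with respect to H(0) for a tanh RNN on a single example. It rests on stated assumptions: the helper name state_jacobian_norms, the sizes, and the choice to rescale U to spectral norm 0.9 are all made up for the demonstration. Because the derivative of tanh is at most 1, each step can multiply the Jacobian norm by at most 0.9 here, so the norm shrinks geometrically.

```python
import numpy as np

def state_jacobian_norms(xs, h0, W, U):
    """Spectral norm of d H(t) / d H(0) after each step of a tanh RNN."""
    h = h0
    J = np.eye(h0.shape[0])                # d H(0) / d H(0)
    norms = []
    for x_t in xs:
        h = np.tanh(x_t @ W + h @ U)
        # Chain rule: d H(t)/d H(0) = d H(t-1)/d H(0) @ U @ diag(1 - H(t)^2)
        J = J @ (U * (1.0 - h ** 2))       # broadcasting scales the columns of U
        norms.append(np.linalg.norm(J, 2))
    return norms

rng = np.random.default_rng(0)
embed_dim, num_units, T = 4, 8, 50
W = rng.normal(scale=0.1, size=(embed_dim, num_units))
U = rng.normal(size=(num_units, num_units))
U *= 0.9 / np.linalg.norm(U, 2)            # force the largest singular value of U to 0.9

xs = rng.normal(size=(T, embed_dim))
norms = state_jacobian_norms(xs, np.zeros(num_units), W, U)
print(norms[0], norms[-1])                 # the last value is bounded by 0.9 ** 50
```

With a recurrent matrix whose singular values are well above 1, the same product can instead grow very quickly, which is the exploding side of the problem.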

The GRUCell formulas and schematic are as follows:

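The most distinctive feature of GRUCell is its two gates, the reset gate and the update gate. BasicRNNCell maps X(t) and H(t-1) through W and U and simply adds the results, whereas the GRU uses two gate functions to decide what to keep. The reset gate decides how much of the previous memory H(t-1) is combined with the current input (the dashed box in the figure); this produces a new candidate hidden state, that is, the new memory. The update gate decides in what proportion the new memory and the old memory are passed on to the next step. The subtle point is that the update gate does not act only on the new memory computed in the dashed box. It acts on both the new memory and the old memory H(t-1): it decides how much to take from each, and their weighted sum becomes the H(t) that is passed to the next step. One way to read this: if the model considers the input at time t very important for the final result, the new memory enters H(t) with a large weight; otherwise it enters with a small weight. The same holds for the old memory.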

As with BasicRNNCell, the output and the hidden state produced by the GRU at each time step t are the same, namely H(t), so its state_size and output_size are equal. See the code:

class GRUCell(RNNCell):
  def __init__(self,
               num_units,
               activation=None,
               reuse=None,
               kernel_initializer=None,
               bias_initializer=None):
    super(GRUCell, self).__init__(_reuse=reuse)
    self._num_units = num_units
    self._activation = activation or math_ops.tanh
    self._kernel_initializer = kernel_initializer
    self._bias_initializer = bias_initializer
    self._gate_linear = None
    self._candidate_linear = None

  @property
  def state_size(self):
    return self._num_units

  @property
  def output_size(self):
    return self._num_units

  def call(self, inputs, state):
    """Gated recurrent unit (GRU) with nunits cells."""
    if self._gate_linear is None:
      bias_ones = self._bias_initializer
      if self._bias_initializer is None:
        bias_ones = init_ops.constant_initializer(1.0, dtype=inputs.dtype)
      with vs.variable_scope("gates"):  # Reset gate and update gate.
        self._gate_linear = _Linear(
            [inputs, state],
            2 * self._num_units,
            True,
            bias_initializer=bias_ones,
            kernel_initializer=self._kernel_initializer)

    value = math_ops.sigmoid(self._gate_linear([inputs, state]))
    r, u = array_ops.split(value=value, num_or_size_splits=2, axis=1)

    r_state = r * state
    if self._candidate_linear is None:
      with vs.variable_scope("candidate"):
        self._candidate_linear = _Linear(
            [inputs, r_state],
            self._num_units,
            True,
            bias_initializer=self._bias_initializer,
            kernel_initializer=self._kernel_initializer)
    c = self._activation(self._candidate_linear([inputs, r_state]))
    new_h = u * state + (1 - u) * c
    return new_h, new_h
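
Stripped of variable scopes and lazy initialization, the arithmetic in call above can be sketched in NumPy as follows. The names gru_step, W_gates and W_cand, and the sizes, are assumptions for this sketch; the formulas (including the gate bias initialized to 1.0 and new_h = u * state + (1 - u) * c) follow the source shown above.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(x_t, h_prev, W_gates, b_gates, W_cand, b_cand):
    """One GRU step mirroring the arithmetic of GRUCell.call."""
    num_units = h_prev.shape[1]
    # Reset gate r and update gate u come from one linear map, split in two.
    gates = sigmoid(np.concatenate([x_t, h_prev], axis=1) @ W_gates + b_gates)
    r, u = gates[:, :num_units], gates[:, num_units:]
    # Candidate (new memory): the reset gate scales the old state first.
    c = np.tanh(np.concatenate([x_t, r * h_prev], axis=1) @ W_cand + b_cand)
    # The update gate mixes old memory and new memory.
    new_h = u * h_prev + (1.0 - u) * c
    return new_h, new_h

batch_size, embed_dim, num_units = 100, 4, 8
rng = np.random.default_rng(0)
W_gates = rng.normal(scale=0.1, size=(embed_dim + num_units, 2 * num_units))
W_cand = rng.normal(scale=0.1, size=(embed_dim + num_units, num_units))
b_gates = np.ones(2 * num_units)           # gate bias of 1.0, as in the source above
b_cand = np.zeros(num_units)

x_t = rng.normal(size=(batch_size, embed_dim))
h_prev = rng.normal(size=(batch_size, num_units))
out, new_h = gru_step(x_t, h_prev, W_gates, b_gates, W_cand, b_cand)

# A very large gate bias drives u to 1, so the old memory is copied through.
_, copied = gru_step(x_t, h_prev, W_gates, np.full(2 * num_units, 50.0), W_cand, b_cand)
print(out.shape, np.allclose(copied, h_prev))
```

In this formulation a value of u close to 1 keeps the old memory and a value close to 0 replaces it with the candidate.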

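With GRUCell understood, here is a brief introduction to LSTMCell.
LSTMCell is somewhat more complex than GRUCell and adds more gates. It has three of them: the input gate, the forget gate and the output gate. Conceptually it is similar to GRUCell, only with finer-grained control. Finer control does not necessarily give better results, so in practice it is worth trying both and keeping whichever performs better. The LSTMCell formulas and schematic are as follows: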

The LSTMCell source code is as follows:

class LSTMCell(RNNCell):
  def __init__(self, num_units,
               use_peepholes=False, cell_clip=None,
               initializer=None, num_proj=None, proj_clip=None,
               num_unit_shards=None, num_proj_shards=None,
               forget_bias=1.0, state_is_tuple=True,
               activation=None, reuse=None):
    
    super(LSTMCell, self).__init__(_reuse=reuse)
    if not state_is_tuple:
      logging.warn("%s: Using a concatenated state is slower and will soon be "
                   "deprecated.  Use state_is_tuple=True.", self)
    if num_unit_shards is not None or num_proj_shards is not None:
      logging.warn(
          "%s: The num_unit_shards and proj_unit_shards parameters are "
          "deprecated and will be removed in Jan 2017.  "
          "Use a variable scope with a partitioner instead.", self)

    self._num_units = num_units
    self._use_peepholes = use_peepholes
    self._cell_clip = cell_clip
    self._initializer = initializer
    self._num_proj = num_proj
    self._proj_clip = proj_clip
    self._num_unit_shards = num_unit_shards
    self._num_proj_shards = num_proj_shards
    self._forget_bias = forget_bias
    self._state_is_tuple = state_is_tuple
    self._activation = activation or math_ops.tanh

    if num_proj:
      self._state_size = (
          LSTMStateTuple(num_units, num_proj)
          if state_is_tuple else num_units + num_proj)
      self._output_size = num_proj
    else:
      self._state_size = (
          LSTMStateTuple(num_units, num_units)
          if state_is_tuple else 2 * num_units)
      self._output_size = num_units
    self._linear1 = None
    self._linear2 = None
    if self._use_peepholes:
      self._w_f_diag = None
      self._w_i_diag = None
      self._w_o_diag = None

  @property
  def state_size(self):
    return self._state_size

  @property
  def output_size(self):
    return self._output_size

  def call(self, inputs, state):
    num_proj = self._num_units if self._num_proj is None else self._num_proj
    sigmoid = math_ops.sigmoid

    if self._state_is_tuple:
      (c_prev, m_prev) = state
    else:
      c_prev = array_ops.slice(state, [0, 0], [-1, self._num_units])
      m_prev = array_ops.slice(state, [0, self._num_units], [-1, num_proj])

    dtype = inputs.dtype
    input_size = inputs.get_shape().with_rank(2)[1]
    if input_size.value is None:
      raise ValueError("Could not infer input size from inputs.get_shape()[-1]")
    if self._linear1 is None:
      scope = vs.get_variable_scope()
      with vs.variable_scope(
          scope, initializer=self._initializer) as unit_scope:
        if self._num_unit_shards is not None:
          unit_scope.set_partitioner(
              partitioned_variables.fixed_size_partitioner(
                  self._num_unit_shards))
        self._linear1 = _Linear([inputs, m_prev], 4 * self._num_units, True)

    # i = input_gate, j = new_input, f = forget_gate, o = output_gate
    lstm_matrix = self._linear1([inputs, m_prev])
    i, j, f, o = array_ops.split(
        value=lstm_matrix, num_or_size_splits=4, axis=1)
    # Diagonal connections
    if self._use_peepholes and not self._w_f_diag:
      scope = vs.get_variable_scope()
      with vs.variable_scope(
          scope, initializer=self._initializer) as unit_scope:
        with vs.variable_scope(unit_scope):
          self._w_f_diag = vs.get_variable(
              "w_f_diag", shape=[self._num_units], dtype=dtype)
          self._w_i_diag = vs.get_variable(
              "w_i_diag", shape=[self._num_units], dtype=dtype)
          self._w_o_diag = vs.get_variable(
              "w_o_diag", shape=[self._num_units], dtype=dtype)

    if self._use_peepholes:
      c = (sigmoid(f + self._forget_bias + self._w_f_diag * c_prev) * c_prev +
           sigmoid(i + self._w_i_diag * c_prev) * self._activation(j))
    else:
      c = (sigmoid(f + self._forget_bias) * c_prev + sigmoid(i) *
           self._activation(j))

    if self._cell_clip is not None:
      # pylint: disable=invalid-unary-operand-type
      c = clip_ops.clip_by_value(c, -self._cell_clip, self._cell_clip)
      # pylint: enable=invalid-unary-operand-type
    if self._use_peepholes:
      m = sigmoid(o + self._w_o_diag * c) * self._activation(c)
    else:
      m = sigmoid(o) * self._activation(c)

    if self._num_proj is not None:
      if self._linear2 is None:
        scope = vs.get_variable_scope()
        with vs.variable_scope(scope, initializer=self._initializer):
          with vs.variable_scope("projection") as proj_scope:
            if self._num_proj_shards is not None:
              proj_scope.set_partitioner(
                  partitioned_variables.fixed_size_partitioner(
                      self._num_proj_shards))
            self._linear2 = _Linear(m, self._num_proj, False)
      m = self._linear2(m)

      if self._proj_clip is not None:
        # pylint: disable=invalid-unary-operand-type
        m = clip_ops.clip_by_value(m, -self._proj_clip, self._proj_clip)
        # pylint: enable=invalid-unary-operand-type

    new_state = (LSTMStateTuple(c, m) if self._state_is_tuple else
                 array_ops.concat([c, m], 1))
    return m, new_state

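LSTMCell's call returns two values: m, the output (hidden state), and new_state. Here c is the final memory of the cell, and new_state packs c and m together: as LSTMStateTuple(c, m) when state_is_tuple=True (the default), or as a plain concatenation of c and m otherwise.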
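
The core path of call (no peepholes, no projection, no clipping, tuple state) can be sketched in NumPy. The name lstm_step, the single weight matrix W, and the sizes are assumptions for this sketch; the gate order i, j, f, o and the forget_bias handling follow the source above.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, state, W, b, forget_bias=1.0):
    """One LSTM step for the simplest configuration of LSTMCell.call."""
    c_prev, m_prev = state
    z = np.concatenate([x_t, m_prev], axis=1) @ W + b
    # i = input gate, j = new input, f = forget gate, o = output gate
    i, j, f, o = np.split(z, 4, axis=1)
    c = sigmoid(f + forget_bias) * c_prev + sigmoid(i) * np.tanh(j)
    m = sigmoid(o) * np.tanh(c)
    return m, (c, m)   # the output, and the new state as a (c, m) pair

batch_size, embed_dim, num_units = 100, 4, 8
rng = np.random.default_rng(0)
W = rng.normal(scale=0.1, size=(embed_dim + num_units, 4 * num_units))
b = np.zeros(4 * num_units)

x_t = rng.normal(size=(batch_size, embed_dim))
zero_state = (np.zeros((batch_size, num_units)), np.zeros((batch_size, num_units)))
m, (c, m_in_state) = lstm_step(x_t, zero_state, W, b)
print(m.shape, c.shape)  # (100, 8) (100, 8)
```

Unlike BasicRNNCell and GRUCell, the state here is larger than the output: the output is m alone, while the state carries both c and m to the next step.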

RNN chain structure (unrolled over time):

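So far we have covered the basic RNN cell and two more complex variants. An RNN chain structure is built by unrolling a single cell over time t, by replacing each cell with a multi-layer cell and unrolling that over time, or by combining a backward pass with the forward pass. These structures include: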

raw_rnn
static_rnn
static_bidirectional_rnn
static_state_saving_rnn
dynamic_rnn
bidirectional_dynamic_rnn

For details, see the implementation in tensorflow/python/ops/rnn.py; for reasons of space they are not covered further here.
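
As one example of these structures, the bidirectional idea can be sketched in NumPy: run one RNN forward in time, run a second RNN with its own parameters over the reversed sequence, flip the backward outputs back into time order, and concatenate. This is a sketch under assumptions, not the library code: run_rnn, bidirectional_rnn, and the decision to concatenate the two outputs are illustrative choices.

```python
import numpy as np

def run_rnn(xs, h0, W, U, b):
    """Basic tanh RNN over xs of shape (T, batch, embed_dim)."""
    h, outs = h0, []
    for x_t in xs:
        h = np.tanh(x_t @ W + h @ U + b)
        outs.append(h)
    return np.stack(outs), h

def bidirectional_rnn(xs, h0, fw_params, bw_params):
    """Concatenate a forward pass and a time-reversed backward pass."""
    out_fw, _ = run_rnn(xs, h0, *fw_params)
    out_bw_reversed, _ = run_rnn(xs[::-1], h0, *bw_params)
    out_bw = out_bw_reversed[::-1]         # restore the original time order
    return np.concatenate([out_fw, out_bw], axis=-1)

T, batch_size, embed_dim, num_units = 6, 100, 4, 8
rng = np.random.default_rng(0)

def init_params():
    return (rng.normal(scale=0.1, size=(embed_dim, num_units)),
            rng.normal(scale=0.1, size=(num_units, num_units)),
            np.zeros(num_units))

fw_params, bw_params = init_params(), init_params()
xs = rng.normal(size=(T, batch_size, embed_dim))
h0 = np.zeros((batch_size, num_units))

outputs = bidirectional_rnn(xs, h0, fw_params, bw_params)
print(outputs.shape)  # (6, 100, 16)
```

At every time step the first half of the output has seen the past and the second half has seen the future.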

seq2seq (encoder-decoder):

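The RNN chain structures above are enough for some sequence labeling problems. seq2seq is a more elaborate construction that also handles the broader class of asymmetric sequence problems. In translation, for example, the number of words in the input and in the output have no direct correspondence. In that case an RNN chain can be used separately for the encoder and for the decoder, which removes the restriction that inputs and outputs must match one to one.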
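
A minimal NumPy sketch of the encoder-decoder idea: the encoder consumes T_in inputs and keeps only its final state, and the decoder starts from that state and produces T_out outputs. The names, sizes, separate encoder and decoder parameters, and the use of given decoder inputs (as in teacher forcing) are assumptions for this sketch.

```python
import numpy as np

def run_rnn(xs, h0, W, U, b):
    """Basic tanh RNN over xs of shape (T, batch, embed_dim)."""
    h, outs = h0, []
    for x_t in xs:
        h = np.tanh(x_t @ W + h @ U + b)
        outs.append(h)
    return np.stack(outs), h

def init_params(rng, in_dim, num_units):
    return (rng.normal(scale=0.1, size=(in_dim, num_units)),
            rng.normal(scale=0.1, size=(num_units, num_units)),
            np.zeros(num_units))

rng = np.random.default_rng(0)
batch_size, embed_dim, num_units = 100, 4, 8
T_in, T_out = 7, 5                         # input and output lengths differ

enc_params = init_params(rng, embed_dim, num_units)
dec_params = init_params(rng, embed_dim, num_units)
enc_inputs = rng.normal(size=(T_in, batch_size, embed_dim))
dec_inputs = rng.normal(size=(T_out, batch_size, embed_dim))

# Encoder: only the final state is handed over to the decoder.
_, enc_state = run_rnn(enc_inputs, np.zeros((batch_size, num_units)), *enc_params)
# Decoder: starts from the encoder state instead of zeros.
dec_outputs, _ = run_rnn(dec_inputs, enc_state, *dec_params)
print(dec_outputs.shape)  # (5, 100, 8)
```

The only link between the two chains is enc_state, which is why the two sequence lengths are free to differ.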

tensorflow/contrib/legacy_seq2seq/python/ops/seq2seq.py provides a basic seq2seq implementation and several variants, listed below. They are not covered in detail here either.

basic_rnn_seq2seq: The most basic RNN-RNN model.
tied_rnn_seq2seq: The basic model with tied encoder and decoder weights.
embedding_rnn_seq2seq: The basic model with input embedding.
embedding_tied_rnn_seq2seq: The tied model with input embedding.
embedding_attention_seq2seq: Advanced model with input embedding and the neural attention mechanism; recommended for complex tasks.
one2many_rnn_seq2seq: The embedding model with multiple decoders.

Custom encoder and decoder

seq2seq.py also provides two standalone decoders:

rnn_decoder: The basic decoder based on a pure RNN.
attention_decoder: A decoder that uses the attention mechanism.

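As an alternative to calling one of the ready-made seq2seq models directly, you can build your own encoder, combine it freely with one of these decoders, and find out by experiment which combination works better. If the ready-made encoder-decoder models in seq2seq.py do not meet your needs, you can also write a custom decoder.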
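
To give a feel for what an attention-based decoder adds, here is a hedged NumPy sketch of one additive (Bahdanau-style) attention step for a single sentence. It is not a copy of attention_decoder: the function name additive_attention, the parameters W1, W2 and v, and the sizes are assumptions for illustration.

```python
import numpy as np

def additive_attention(enc_outputs, dec_state, W1, W2, v):
    """Return a context vector and attention weights for one decoder step.

    enc_outputs: (T_in, num_units), one encoder output per input position.
    dec_state:   (num_units,), the current decoder hidden state.
    """
    # One score per input position: v . tanh(W1 h_enc + W2 s_dec)
    scores = np.tanh(enc_outputs @ W1 + dec_state @ W2) @ v
    scores = scores - scores.max()          # for a numerically stable softmax
    weights = np.exp(scores) / np.exp(scores).sum()
    context = weights @ enc_outputs         # weighted average of encoder outputs
    return context, weights

rng = np.random.default_rng(0)
T_in, num_units, attn_size = 7, 8, 6
enc_outputs = rng.normal(size=(T_in, num_units))
dec_state = rng.normal(size=num_units)
W1 = rng.normal(scale=0.1, size=(num_units, attn_size))
W2 = rng.normal(scale=0.1, size=(num_units, attn_size))
v = rng.normal(scale=0.1, size=attn_size)

context, weights = additive_attention(enc_outputs, dec_state, W1, W2, v)
print(context.shape, weights.shape)  # (8,) (7,)
```

A decoder that uses attention would typically feed such a context vector into each step together with the regular input, so it is no longer limited to the encoder's final state.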

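Summary: this article is meant to help beginners cut through the many confusing presentations of the topic and see, from a higher level, how cells, chain structures and seq2seq models relate to one another. Hopefully it offers some useful insight.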

References:
https://github.com/tensorflow/nmt
https://tensorflowkorea.files.wordpress.com/2017/03/cs224n-2017winter-notes-all.pdf
