(ICLR 2017) Hypernetworks
Paper: https://openreview.net/pdf?id=rkpACe1lx
Code: https://github.com/hardmaru/supercell
Blog: http://blog.otoro.net/2016/09/28/hyper-networks/
The paper learns a dynamically updated recurrent network: a small network learns the weights of another, larger network, and the generated weights are specific to a particular layer of the larger network.
Hypernetworks provide a new form of weight sharing, somewhere between that of CNNs and RNNs, which lets them strike a good balance between parameter count, model performance, and flexibility.
Using one network, also known as a hypernetwork, to generate the weights for another network.
We apply hypernetworks to generate adaptive weights for recurrent networks.
Hypernetworks can generate non-shared weights for LSTMs.
Introduction
using a small network (called a “hypernetwork”) to generate the weights for a larger network (called a main network)
the hypernetwork takes a set of inputs that contain information about the structure of the weights and generates the weights for that layer.
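A minimal NumPy sketch of this idea in the static case (all names, sizes, and the single-linear-map hypernetwork are hypothetical simplifications, not the paper's exact setup): a small network maps a layer embedding z to the weight matrix of a main-network layer.

```python
import numpy as np

rng = np.random.default_rng(0)
n_z, d_in, d_out = 4, 8, 8          # embedding size and main-layer shape (assumed)

# Hypernetwork parameters: a single linear map from z to a flattened weight matrix.
W_hyper = rng.normal(scale=0.1, size=(d_in * d_out, n_z))
b_hyper = np.zeros(d_in * d_out)

z = rng.normal(size=n_z)            # embedding describing this layer's weights

# Generate the main layer's weights, then apply the layer to an input.
W_main = (W_hyper @ z + b_hyper).reshape(d_out, d_in)
x = rng.normal(size=d_in)
h = np.tanh(W_main @ x)
print(h.shape)                      # (8,)
```

Because the hypernetwork's own parameters are the only learned quantities, many main-network layers can share one hypernetwork and differ only in their embeddings z.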
The focus of this work is to use hypernetworks to generate weights for recurrent networks (RNN).
We perform experiments to investigate the behaviors of hypernetworks in a range of contexts and find that hypernetworks mix well with other techniques such as batch normalization and layer normalization.
Our main result is that hypernetworks can generate non-shared weights for LSTM that work better than the standard version of LSTM.
Related Work
Evolutionary methods are difficult to apply directly to large search spaces consisting of millions of weight parameters.
HyperNEAT framework: Compositional Pattern-Producing Networks (CPPNs) are evolved to define the weight structure of the much larger main network.
Differentiable Pattern Producing Networks (DPPNs): the structure is evolved but the weights are learned
ACDC-Networks: linear layers are compressed with DCT and the parameters are learned
Methods
when they are applied to recurrent networks, hypernetworks can be seen as a form of relaxed weight-sharing in the time dimension.
HyperRNN
When a hypernetwork is used to generate the weights for an RNN, we refer to it as the HyperRNN.
The standard formulation of a basic RNN is given by:

h_t = \phi(W_h h_{t - 1} + W_x x_t + b)

In the HyperRNN, we allow the weights and bias to depend on embedding vectors z_h, z_x, z_b produced by the hypernetwork:

h_t = \phi(W_h(z_h) h_{t - 1} + W_x(z_x) x_t + b(z_b))

W_h(z_h) = \langle W_{hz}, z_h \rangle, \quad W_x(z_x) = \langle W_{xz}, z_x \rangle, \quad b(z_b) = W_{bz} z_b + b_0
Figure 1: An overview of HyperRNNs. Black connections and parameters are associated with basic RNNs. Orange connections and parameters are introduced in this work and associated with HyperRNNs. Dotted arrows denote parameter generation.
We use a recurrent hypernetwork to compute z_h, z_x, and z_b as a function of x_t and h_{t-1}:

\hat{x}_t = \begin{pmatrix} h_{t - 1} \\ x_t \\ \end{pmatrix}

\hat{h}_t = \phi(W_{\hat{h}} \hat{h}_{t - 1} + W_{\hat{x}} \hat{x}_t + \hat{b})

z_h = W_{\hat{h}h} \hat{h}_{t - 1} + b_{\hat{h}h}, \quad z_x = W_{\hat{h}x} \hat{h}_{t - 1} + b_{\hat{h}x}, \quad z_b = W_{\hat{h}b} \hat{h}_{t - 1}
However, Equation 2 is not practical, because generating the full matrices W_h(z_h) and W_x(z_x) requires memory on the order of N_z times the size of the main RNN's weight matrices. Instead, we use an intermediate hidden vector d(z) to scale the rows of shared weight matrices element-wise:

h_t = \phi(d_h(z_h) \odot (W_h h_{t - 1}) + d_x(z_x) \odot (W_x x_t) + b(z_b))

so the model only uses memory on the order of an ordinary RNN's.
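A NumPy sketch of one step of this memory-efficient HyperRNN (all sizes, initializations, and variable names are hypothetical; the hyper-RNN emits embeddings z that are expanded into row-wise scaling vectors d rather than full weight matrices):

```python
import numpy as np

rng = np.random.default_rng(1)
n_h, n_x, n_z, n_hyper = 16, 8, 4, 8     # main/hyper sizes (assumed)

# Main-RNN base weights, shared across time steps.
W_h = rng.normal(scale=0.1, size=(n_h, n_h))
W_x = rng.normal(scale=0.1, size=(n_h, n_x))
b0 = np.zeros(n_h)

# Hypernetwork RNN weights; it sees the concatenated main state and input.
W_hh = rng.normal(scale=0.1, size=(n_hyper, n_hyper))
W_hx = rng.normal(scale=0.1, size=(n_hyper, n_h + n_x))
b_hat = np.zeros(n_hyper)

# Maps from hyper state to embeddings z, then from z to scaling vectors d.
W_zh, W_zx, W_zb = (rng.normal(scale=0.1, size=(n_z, n_hyper)) for _ in range(3))
W_dh, W_dx, W_db = (rng.normal(scale=0.1, size=(n_h, n_z)) for _ in range(3))

def hyper_rnn_step(h_prev, h_hat_prev, x_t):
    # Hypernetwork update on the concatenated (h_{t-1}, x_t).
    x_hat = np.concatenate([h_prev, x_t])
    h_hat = np.tanh(W_hh @ h_hat_prev + W_hx @ x_hat + b_hat)
    # Embeddings z, expanded into row-wise weight-scaling vectors d.
    d_h = W_dh @ (W_zh @ h_hat)
    d_x = W_dx @ (W_zx @ h_hat)
    d_b = W_db @ (W_zb @ h_hat)
    # Main-RNN update with dynamically scaled (rather than regenerated) weights.
    h_t = np.tanh(d_h * (W_h @ h_prev) + d_x * (W_x @ x_t) + d_b + b0)
    return h_t, h_hat

h, h_hat = np.zeros(n_h), np.zeros(n_hyper)
for t in range(5):
    h, h_hat = hyper_rnn_step(h, h_hat, rng.normal(size=n_x))
print(h.shape)   # (16,)
```

The extra storage is just the small projection matrices, so the parameter count stays close to that of a basic RNN of the same hidden size.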
Related Approaches
The formulation of the HyperRNN in Equation 5 has similarities to Recurrent Batch Normalization (Cooijmans et al., 2016) and Layer Normalization (Ba et al., 2016).
The central idea for the normalization techniques is to calculate the first two statistical moments of the inputs to the activation function, and to linearly scale the inputs to have zero mean and unit variance.
After the normalization, an additional set of learned scale and shift parameters rescales the inputs if required.
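To make the comparison concrete, a minimal layer-normalization sketch (variable names are illustrative): the pre-activations are normalized to zero mean and unit variance, then rescaled and shifted by learned parameters g and b.

```python
import numpy as np

def layer_norm(a, g, b, eps=1e-5):
    # Normalize the pre-activation vector, then apply the learned gain and bias.
    mu = a.mean()
    sigma = a.std()
    return g * (a - mu) / (sigma + eps) + b

a = np.array([1.0, 2.0, 3.0, 4.0])
g = np.ones(4)    # learned gain (here: identity)
b = np.zeros(4)   # learned bias (here: zero)
out = layer_norm(a, g, b)
print(round(float(out.mean()), 6))   # 0.0
```

Like the HyperRNN's scaling vectors, the gain g multiplies each pre-activation element-wise, but for layer normalization g is fixed after training rather than generated per time step.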
The element-wise operation also has similarities to the Multiplicative RNN and its extensions (mRNN, mLSTM) (Sutskever et al., 2011; Krause et al., 2016) and Multiplicative Integration RNN (MI-RNN) (Wu et al., 2016).
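For comparison, a simplified sketch of the multiplicative-integration idea (hypothetical sizes; a reduced form of MI-RNN, not the full formulation of Wu et al., 2016): the two pre-activation terms interact through an element-wise Hadamard product instead of being summed.

```python
import numpy as np

rng = np.random.default_rng(2)
n_h, n_x = 16, 8
W_h = rng.normal(scale=0.1, size=(n_h, n_h))
W_x = rng.normal(scale=0.1, size=(n_h, n_x))
b = np.zeros(n_h)

def mi_rnn_step(h_prev, x_t):
    # Element-wise product of the two projections replaces their sum.
    return np.tanh((W_x @ x_t) * (W_h @ h_prev) + b)

h = np.ones(n_h)   # non-zero init, since a zero state would zero out the product
h = mi_rnn_step(h, rng.normal(size=n_x))
print(h.shape)     # (16,)
```

The HyperRNN differs in that one factor of the product (the scaling vector d) is itself produced by a separate recurrent network rather than by a fixed projection of the input.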
Experiments
Character-level Penn Treebank Language Modelling
Hutter Prize Wikipedia Language Modelling
Handwriting Sequence Generation
Neural Machine Translation
Conclusion
In this paper, we presented a method that uses one network to generate the weights for another neural network. Our hypernetworks are trained end-to-end with backpropagation and are therefore efficient and scalable. We focused on applying hypernetworks to generate weights for recurrent networks. On language modelling and handwriting generation, hypernetworks are competitive with or sometimes better than state-of-the-art models. On machine translation, hypernetworks achieve a significant gain on top of a state-of-the-art production-level model.