[paper] Hypernetworks

(ICLR 2017) Hypernetworks
Paper: https://openreview.net/pdf?id=rkpACe1lx
Code: https://github.com/hardmaru/supercell
Blog: http://blog.otoro.net/2016/09/28/hyper-networks/

The paper learns a recurrent neural network whose weights are updated dynamically: a small network learns to generate the weights of another, larger network, and the generated weights are specific to a particular layer of the large network.

Hypernetworks (HN) provide a new form of weight sharing, somewhere between that of CNNs and RNNs, which lets HN strike a good balance between the number of parameters on one hand and model performance and flexibility on the other.

Using one network, also known as a hypernetwork, to generate the weights for another network.

We apply hypernetworks to generate adaptive weights for recurrent networks.

Hypernetworks can generate non-shared weights for an LSTM.

Introduction

Using a small network (called a "hypernetwork") to generate the weights for a larger network (called the "main network").

the hypernetwork takes a set of inputs that contain information about the structure of the weights and generates the weights for that layer.
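As a concrete illustration of this idea, the sketch below maps a per-layer embedding z to that layer's weight matrix with a single linear projection. The sizes, names, and the linear form of the hypernetwork are assumptions for illustration, not the paper's architecture:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: each main-network layer has an embedding z of size
# N_z, and the hypernetwork maps z to that layer's (fan_out, fan_in) weights.
N_z, fan_in, fan_out = 4, 8, 16

# A minimal linear hypernetwork: one learned projection shared by all layers.
W_proj = rng.normal(scale=0.1, size=(fan_out * fan_in, N_z))
b_proj = np.zeros(fan_out * fan_in)

def generate_weights(z):
    """Map a layer embedding z to a (fan_out, fan_in) weight matrix."""
    return (W_proj @ z + b_proj).reshape(fan_out, fan_in)

# Two layers share the hypernetwork but receive different weights,
# because their embeddings differ.
z_layer1, z_layer2 = rng.normal(size=N_z), rng.normal(size=N_z)
W1, W2 = generate_weights(z_layer1), generate_weights(z_layer2)
```

Under this scheme the learned parameters are the projection plus one small embedding per layer, rather than a full weight matrix per layer.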

The focus of this work is to use hypernetworks to generate weights for recurrent networks (RNN).

We perform experiments to investigate the behaviors of hypernetworks in a range of contexts and find that hypernetworks mix well with other techniques such as batch normalization and layer normalization.

Our main result is that hypernetworks can generate non-shared weights for LSTM that work better than the standard version of LSTM.

Evolutionary methods are difficult to apply directly in large search spaces consisting of millions of weight parameters, which motivated indirect encodings:

HyperNEAT framework: Compositional Pattern-Producing Networks (CPPNs) are evolved to define the weight structure of the much larger main network.

Differentiable Pattern Producing Networks (DPPNs): the structure is evolved but the weights are learned

ACDC-Networks: linear layers are compressed with DCT and the parameters are learned

Methods

when they are applied to recurrent networks, hypernetworks can be seen as a form of relaxed weight-sharing in the time dimension.

HyperRNN

When a hypernetwork is used to generate the weights for an RNN, we refer to it as the HyperRNN.

The standard formulation of a Basic RNN is given by:

h_t = \phi(W_h h_{t-1} + W_x x_t + b)

In the HyperRNN, we allow W_h and W_x to float over time by using a smaller hypernetwork to generate these parameters of the main RNN at each step. More concretely, the parameters W_h, W_x, b of the main RNN are different at different time steps, so that h_t can now be computed as:

h_t = \phi(W_h(z_h) h_{t-1} + W_x(z_x) x_t + b(z_b)), where

W_h(z_h) = \langle W_{hz}, z_h \rangle

W_x(z_x) = \langle W_{xz}, z_x \rangle

b(z_b) = W_{bz} z_b + b_0
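A minimal sketch of this formulation (the memory-heavy tensor version, with illustrative sizes of my own choosing): each dynamic weight matrix is recovered by contracting a learned third-order tensor with the corresponding embedding z.

```python
import numpy as np

rng = np.random.default_rng(0)
N_h, N_x, N_z = 8, 5, 3  # illustrative sizes, not the paper's

# Learned third-order tensors; contracting with an embedding z yields
# a concrete weight matrix for this time step.
W_hz = rng.normal(scale=0.1, size=(N_h, N_h, N_z))
W_xz = rng.normal(scale=0.1, size=(N_h, N_x, N_z))
W_bz = rng.normal(scale=0.1, size=(N_h, N_z))
b_0 = np.zeros(N_h)

def rnn_step(h_prev, x_t, z_h, z_x, z_b):
    W_h = np.einsum('ijk,k->ij', W_hz, z_h)  # W_h(z_h) = <W_hz, z_h>
    W_x = np.einsum('ijk,k->ij', W_xz, z_x)  # W_x(z_x) = <W_xz, z_x>
    b = W_bz @ z_b + b_0                     # b(z_b)
    return np.tanh(W_h @ h_prev + W_x @ x_t + b)

h = rnn_step(np.zeros(N_h), rng.normal(size=N_x),
             rng.normal(size=N_z), rng.normal(size=N_z),
             rng.normal(size=N_z))
```

Note the cost: W_hz alone has N_h × N_h × N_z entries, which is what the memory-efficient variant later in the paper avoids.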

Figure 1

Figure 1: An overview of HyperRNNs. Black connections and parameters are associated with basic RNNs. Orange connections and parameters are introduced in this work and associated with HyperRNNs. Dotted arrows denote parameter generation.

We use a recurrent hypernetwork to compute z_h, z_x and z_b as a function of x_t and h_{t-1}:

\hat{x}_t = \begin{pmatrix} h_{t-1} \\ x_t \end{pmatrix}

\hat{h}_t = \phi(W_{\hat{h}} \hat{h}_{t-1} + W_{\hat{x}} \hat{x}_t + \hat{b})

z_h = W_{\hat{h}h} \hat{h}_{t-1} + b_{\hat{h}h}

z_x = W_{\hat{h}x} \hat{h}_{t-1} + b_{\hat{h}x}

z_b = W_{\hat{h}b} \hat{h}_{t-1}
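One step of this recurrence can be sketched as follows (sizes and initialization are illustrative assumptions, not the paper's settings):

```python
import numpy as np

rng = np.random.default_rng(0)
N_h, N_x, N_hyper, N_z = 8, 5, 4, 3  # illustrative sizes

# Parameters of the small recurrent hypernetwork.
W_hhat = rng.normal(scale=0.1, size=(N_hyper, N_hyper))
W_xhat = rng.normal(scale=0.1, size=(N_hyper, N_h + N_x))
b_hat = np.zeros(N_hyper)

# Linear maps from the hypernetwork state to the embeddings z_h, z_x, z_b.
W_zh, b_zh = rng.normal(scale=0.1, size=(N_z, N_hyper)), np.zeros(N_z)
W_zx, b_zx = rng.normal(scale=0.1, size=(N_z, N_hyper)), np.zeros(N_z)
W_zb = rng.normal(scale=0.1, size=(N_z, N_hyper))

def hyper_step(h_prev, x_t, hhat_prev):
    x_hat = np.concatenate([h_prev, x_t])                        # \hat{x}_t
    h_hat = np.tanh(W_hhat @ hhat_prev + W_xhat @ x_hat + b_hat)  # \hat{h}_t
    # The embeddings are computed from the previous hyper state \hat{h}_{t-1}.
    z_h = W_zh @ hhat_prev + b_zh
    z_x = W_zx @ hhat_prev + b_zx
    z_b = W_zb @ hhat_prev
    return h_hat, z_h, z_x, z_b

h_hat, z_h, z_x, z_b = hyper_step(np.zeros(N_h), rng.normal(size=N_x),
                                  np.zeros(N_hyper))
```

The embeddings z then feed the weight-generation functions of the main RNN at that time step.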

However, Equation 2 is not practical, because the embedding-to-weight tensors have on the order of N_h × N_h × N_z entries, so the memory usage becomes too large for real problems.

We will use an intermediate hidden vector d(z) \in \mathbb{R}^{N_h} to parametrize each weight matrix, where d(z) is a linear function of z. We refer to d as a weight scaling vector. Below is the modification to W(z):

W(z) = W(d(z)) = \begin{pmatrix} d_0(z) W_0 \\ d_1(z) W_1 \\ \vdots \\ d_{N_h}(z) W_{N_h} \end{pmatrix}

where W_i denotes row i of a shared static matrix W, so each row is rescaled elementwise by d(z).

This way, we only use extra memory on the order of N_z times the number of hidden units, which is an acceptable amount of extra memory usage.
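A sketch of this memory-efficient variant, under assumed sizes: the dynamic part is just the vector d(z) = W_d z, which scales the rows of a shared static matrix W.

```python
import numpy as np

rng = np.random.default_rng(0)
N_h, N_z = 8, 3  # illustrative sizes

W_static = rng.normal(scale=0.1, size=(N_h, N_h))  # shared matrix W
W_d = rng.normal(scale=0.1, size=(N_h, N_z))       # maps z to d(z)

def scaled_weights(z):
    d = W_d @ z                   # d(z), a linear function of z
    return d[:, None] * W_static  # row i of W scaled by d_i(z)

W_t = scaled_weights(rng.normal(size=N_z))
# The dynamic parameters are O(N_z * N_h), not O(N_z * N_h^2) as in
# the full tensor formulation.
```

Row-scaling is equivalent to left-multiplying W by diag(d(z)), which is why the per-step weights remain expressive while staying cheap to generate.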

The formulation of the HyperRNN in Equation 5 has similarities to Recurrent Batch Normalization (Cooijmans et al., 2016) and Layer Normalization (Ba et al., 2016).

The central idea for the normalization techniques is to calculate the first two statistical moments of the inputs to the activation function, and to linearly scale the inputs to have zero mean and unit variance.

After the normalization, an additional set of learned parameters (a gain and a bias, fixed across time steps) rescales the inputs if required.
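For reference, the normalization idea can be sketched as follows (a generic layer-norm-style illustration, not the cited papers' code):

```python
import numpy as np

def layer_norm(a, gain, bias, eps=1e-5):
    """Standardize pre-activations a, then rescale with learned gain/bias."""
    mu = a.mean()       # first moment
    sigma = a.std()     # second moment (as standard deviation)
    return gain * (a - mu) / (sigma + eps) + bias

a = np.array([1.0, 2.0, 3.0, 4.0])
out = layer_norm(a, gain=np.ones(4), bias=np.zeros(4))
# With unit gain and zero bias, the output has ~zero mean and ~unit variance.
```

The HyperRNN's scaling vectors play a loosely analogous role to the gain parameters, except that they are generated dynamically at each time step.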

The element-wise operation also has similarities to the Multiplicative RNN and its extensions (mRNN, mLSTM) (Sutskever et al., 2011; Krause et al., 2016) and Multiplicative Integration RNN (MI-RNN) (Wu et al., 2016).

Experiments

Character-level Penn Treebank Language Modelling

Hutter Prize Wikipedia Language Modelling

Handwriting Sequence Generation

Neural Machine Translation

Conclusion

In this paper, we presented a method to use one network to generate weights for another neural network. Our hypernetworks are trained end-to-end with backpropagation and therefore are efficient and scalable. We focused on applying hypernetworks to generate weights for recurrent networks. On language modelling and handwriting generation, hypernetworks are competitive to or sometimes better than state-of-the-art models. On machine translation, hypernetworks achieve a significant gain on top of a state-of-the-art production-level model.
