Neural Architecture Search with Reinforcement Learning (Translation)

Original paper: https://arxiv.org/abs/1611.01578

ABSTRACT

Neural networks are powerful and flexible models that work well for many difficult learning tasks in image, speech and natural language understanding. Despite their success, neural networks are still hard to design. In this paper, we use a recurrent network to generate the model descriptions of neural networks and train this RNN with reinforcement learning to maximize the expected accuracy of the generated architectures on a validation set. On the CIFAR-10 dataset, our method, starting from scratch, can design a novel network architecture that rivals the best human-invented architecture in terms of test set accuracy. Our CIFAR-10 model achieves a test error rate of 3.65, which is 0.09 percent better and 1.05x faster than the previous state-of-the-art model that used a similar architectural scheme. On the Penn Treebank dataset, our model can compose a novel recurrent cell that outperforms the widely-used LSTM cell, and other state-of-the-art baselines. Our cell achieves a test set perplexity of 62.4 on the Penn Treebank, which is 3.6 perplexity better than the previous state-of-the-art model. The cell can also be transferred to the character language modeling task on PTB and achieves a state-of-the-art perplexity of 1.214.

1 INTRODUCTION

The last few years have seen much success of deep neural networks in many challenging applications, such as speech recognition (Hinton et al., 2012), image recognition (LeCun et al., 1998; Krizhevsky et al., 2012) and machine translation (Sutskever et al., 2014; Bahdanau et al., 2015; Wu et al., 2016). Along with this success is a paradigm shift from feature designing to architecture designing, i.e., from SIFT (Lowe, 1999) and HOG (Dalal & Triggs, 2005), to AlexNet (Krizhevsky et al., 2012), VGGNet (Simonyan & Zisserman, 2014), GoogleNet (Szegedy et al., 2015), and ResNet (He et al., 2016a). Although it has become easier, designing architectures still requires a lot of expert knowledge and takes ample time.


This paper presents Neural Architecture Search, a gradient-based method for finding good architectures (see Figure 1). Our work is based on the observation that the structure and connectivity of a neural network can typically be specified by a variable-length string. It is therefore possible to use a recurrent network – the controller – to generate such a string. Training the network specified by the string – the "child network" – on the real data will result in an accuracy on a validation set. Using this accuracy as the reward signal, we can compute the policy gradient to update the controller. As a result, in the next iteration, the controller will give higher probabilities to architectures that receive high accuracies. In other words, the controller will learn to improve its search over time.

Our experiments show that Neural Architecture Search can design good models from scratch, an achievement considered not possible with other methods. On image recognition with CIFAR-10, Neural Architecture Search can find a novel ConvNet model that is better than most human-invented architectures. Our CIFAR-10 model achieves a 3.65 test set error, while being 1.05x faster than the current best model. On language modeling with Penn Treebank, Neural Architecture Search can design a novel recurrent cell that is also better than previous RNN and LSTM architectures. The cell that our model found achieves a test set perplexity of 62.4 on the Penn Treebank dataset, which is 3.6 perplexity better than the previous state-of-the-art.

2 RELATED WORK

Hyperparameter optimization is an important research topic in machine learning, and is widely used in practice (Bergstra et al., 2011; Bergstra & Bengio, 2012; Snoek et al., 2012; 2015; Saxena & Verbeek, 2016). Despite their success, these methods are still limited in that they only search models from a fixed-length space. In other words, it is difficult to ask them to generate a variable-length configuration that specifies the structure and connectivity of a network. In practice, these methods often work better if they are supplied with a good initial model (Bergstra & Bengio, 2012; Snoek et al., 2012; 2015). There are Bayesian optimization methods that allow searching over non-fixed-length architectures (Bergstra et al., 2013; Mendoza et al., 2016), but they are less general and less flexible than the method proposed in this paper.

Modern neuro-evolution algorithms, e.g., Wierstra et al. (2005); Floreano et al. (2008); Stanley et al. (2009), on the other hand, are much more flexible for composing novel models, yet they are usually less practical at a large scale. Their limitations lie in the fact that they are search-based methods, thus they are slow or require many heuristics to work well.

Neural Architecture Search has some parallels to program synthesis and inductive programming, the idea of searching for a program from examples (Summers, 1977; Biermann, 1978). In machine learning, probabilistic program induction has been used successfully in many settings, such as learning to solve simple Q&A (Liang et al., 2010; Neelakantan et al., 2015; Andreas et al., 2016), sorting a list of numbers (Reed & de Freitas, 2015), and learning with very few examples (Lake et al., 2015).

The controller in Neural Architecture Search is auto-regressive, which means it predicts hyperparameters one at a time, conditioned on previous predictions. This idea is borrowed from the decoder in end-to-end sequence to sequence learning (Sutskever et al., 2014). Unlike sequence to sequence learning, our method optimizes a non-differentiable metric, which is the accuracy of the child network. It is therefore similar to the work on BLEU optimization in Neural Machine Translation (Ranzato et al., 2015; Shen et al., 2016). Unlike these approaches, our method learns directly from the reward signal without any supervised bootstrapping.

Also related to our work is the idea of learning to learn or meta-learning (Thrun & Pratt, 2012), a general framework of using information learned in one task to improve a future task. More closely related is the idea of using a neural network to learn the gradient descent updates for another network (Andrychowicz et al., 2016) and the idea of using reinforcement learning to find update policies for another network (Li & Malik, 2016).

3 METHODS

In the following section, we will first describe a simple method of using a recurrent network to generate convolutional architectures. We will show how the recurrent network can be trained with a policy gradient method to maximize the expected accuracy of the sampled architectures. We will present several improvements of our core approach, such as forming skip connections to increase model complexity and using a parameter server approach to speed up training. In the last part of the section, we will focus on generating recurrent architectures, which is another key contribution of our paper.

3.1 GENERATE MODEL DESCRIPTIONS WITH A CONTROLLER RECURRENT NEURAL NETWORK

In Neural Architecture Search, we use a controller to generate architectural hyperparameters of neural networks. To be flexible, the controller is implemented as a recurrent neural network. Let's suppose we would like to predict feedforward neural networks with only convolutional layers; we can use the controller to generate their hyperparameters as a sequence of tokens.

In our experiments, the process of generating an architecture stops if the number of layers exceeds a certain value. This value follows a schedule where we increase it as training progresses. Once the controller RNN finishes generating an architecture, a neural network with this architecture is built and trained. At convergence, the accuracy of the network on a held-out validation set is recorded. The parameters of the controller RNN, θc, are then optimized in order to maximize the expected validation accuracy of the proposed architectures. In the next section, we will describe a policy gradient method which we use to update the parameters θc so that the controller RNN generates better architectures over time.
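To make the token-by-token generation concrete, below is a minimal sketch (my own illustration, not the authors' implementation) of an autoregressive controller that emits one hyperparameter per step, with each prediction conditioned on the previously sampled token. The single-layer RNN, the hidden size, and the small per-layer vocabulary are all assumptions.

```python
import numpy as np

# Hypothetical per-layer search space: filter height, filter width, number of filters.
CHOICES = [[1, 3, 5, 7], [1, 3, 5, 7], [24, 36, 48, 64]]

rng = np.random.default_rng(0)
H = 32                                   # controller hidden size (assumed)
Wx = {i: rng.normal(0, 0.08, (H, len(c))) for i, c in enumerate(CHOICES)}  # token embeddings
Wh = rng.normal(0, 0.08, (H, H))         # recurrent weights
Wo = {i: rng.normal(0, 0.08, (len(c), H)) for i, c in enumerate(CHOICES)}  # softmax heads

def sample_architecture(num_layers=3):
    """Sample hyperparameters for `num_layers` conv layers, one token at a time."""
    h = np.zeros(H)
    x = np.zeros(H)                      # the first step has no previous token
    tokens = []
    for layer in range(num_layers):
        for slot, choices in enumerate(CHOICES):
            h = np.tanh(Wh @ h + x)                  # recurrent update
            logits = Wo[slot] @ h
            p = np.exp(logits - logits.max()); p /= p.sum()
            idx = rng.choice(len(choices), p=p)      # sample this hyperparameter
            tokens.append(choices[idx])
            x = Wx[slot][:, idx]                     # feed the sampled token back in
    return tokens

print(sample_architecture())   # e.g. one sampled list of filter heights/widths/counts
```

Each sampled list of tokens corresponds to one child architecture that would then be built and trained to obtain a validation accuracy.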

3.2 TRAINING WITH REINFORCE

The list of tokens that the controller predicts can be viewed as a list of actions a_{1:T} to design an architecture for a child network. At convergence, this child network will achieve an accuracy R on a held-out dataset. We can use this accuracy R as the reward signal and use reinforcement learning to train the controller. More concretely, to find the optimal architecture, we ask our controller to maximize its expected reward, represented by J(θc):

J(θc) = E_{P(a_{1:T}; θc)}[R]

Since the reward signal R is non-differentiable, we need to use a policy gradient method to iteratively update θc. In this work, we use the REINFORCE rule from Williams (1992):

∇_{θc} J(θc) = Σ_{t=1}^{T} E_{P(a_{1:T}; θc)}[ ∇_{θc} log P(a_t | a_{(t-1):1}; θc) · R ]

An empirical approximation of the above quantity is:

(1/m) Σ_{k=1}^{m} Σ_{t=1}^{T} ∇_{θc} log P(a_t | a_{(t-1):1}; θc) · R_k

where m is the number of different architectures that the controller samples in one batch and T is the number of hyperparameters our controller has to predict to design a neural network architecture. The validation accuracy that the k-th neural network architecture achieves after being trained on a training dataset is R_k.

The above update is an unbiased estimate for our gradient, but has a very high variance. In order to reduce the variance of this estimate we employ a baseline function:

(1/m) Σ_{k=1}^{m} Σ_{t=1}^{T} ∇_{θc} log P(a_t | a_{(t-1):1}; θc) · (R_k − b)

As long as the baseline function b does not depend on the current action, this is still an unbiased gradient estimate. In this work, our baseline b is an exponential moving average of the previous architecture accuracies.
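The following sketch (my own, not the paper's code) illustrates this update for a toy log-linear controller: it samples m architectures, scores each with a stand-in reward, and applies the REINFORCE gradient with an exponential-moving-average baseline subtracted from each reward, as described above. The toy reward, shapes and hyperparameters are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
T, V = 4, 3                 # T hyperparameter slots, V choices per slot (assumed)
theta = np.zeros((T, V))    # controller parameters: one softmax per slot
baseline, decay, lr = 0.0, 0.95, 0.1

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def sample_and_grad():
    """Sample one architecture; return its actions and the gradient of their log-probability."""
    actions, glogp = [], np.zeros_like(theta)
    for t in range(T):
        p = softmax(theta[t])
        a = rng.choice(V, p=p)
        actions.append(a)
        glogp[t] -= p            # d/dlogits of log-softmax = onehot - p
        glogp[t, a] += 1.0
    return actions, glogp

def toy_reward(actions):
    # Stand-in for "train the child network and read its validation accuracy".
    return float(np.mean(actions)) / (V - 1)

for step in range(200):
    m = 8                        # architectures sampled per controller batch
    grad = np.zeros_like(theta)
    rewards = []
    for _ in range(m):
        actions, glogp = sample_and_grad()
        R = toy_reward(actions)
        rewards.append(R)
        grad += glogp * (R - baseline)     # REINFORCE with baseline
    theta += lr * grad / m                 # gradient ascent on expected reward
    baseline = decay * baseline + (1 - decay) * np.mean(rewards)
```

With a real child network, `toy_reward` would be replaced by building, training, and evaluating the sampled architecture on the held-out validation set.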

Accelerate Training with Parallelism and Asynchronous Updates: In Neural Architecture Search, each gradient update to the controller parameters θc corresponds to training one child network to convergence. As training a child network can take hours, we use distributed training and asynchronous parameter updates in order to speed up the learning process of the controller (Dean et al., 2012). We use a parameter-server scheme where we have a parameter server of S shards that store the shared parameters for K controller replicas. Each controller replica samples m different child architectures that are trained in parallel. The controller then collects gradients according to the results of that minibatch of m architectures at convergence and sends them to the parameter server in order to update the weights across all controller replicas. In our implementation, convergence of each child network is reached when its training exceeds a certain number of epochs. This scheme of parallelism is summarized in Figure 3.
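A highly simplified sketch of this scheme, under assumptions of my own, is shown below: a few "controller replica" threads each evaluate a minibatch of m child results and push their gradients to a shared parameter vector without waiting for one another. A real parameter server shards the parameters across S machines; here a single lock-protected array stands in for it, and the child training is replaced by noise.

```python
import threading
import numpy as np

theta = np.zeros(16)            # shared controller parameters (a single "shard" for simplicity)
lock = threading.Lock()
K, m, lr = 4, 8, 0.05           # controller replicas, children per replica, learning rate

def controller_replica(steps=10):
    global theta
    rng = np.random.default_rng()            # per-replica randomness
    for _ in range(steps):
        with lock:
            local = theta.copy()             # pull the current parameters
        grad = np.zeros_like(local)
        for _ in range(m):                   # m child architectures trained in parallel
            # Stand-in for "train one child to convergence and read its reward":
            g, reward = rng.normal(size=local.shape), rng.random()
            grad += g * reward
        with lock:                           # asynchronous push: replicas never wait for each other
            theta += lr * grad / m

threads = [threading.Thread(target=controller_replica) for _ in range(K)]
for t in threads: t.start()
for t in threads: t.join()
print(theta[:4])
```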

3.3 INCREASE ARCHITECTURE COMPLEXITY WITH SKIP CONNECTIONS AND OTHER LAYER TYPES

In Section 3.1, the search space does not have skip connections, or branching layers used in modern architectures such as GoogleNet (Szegedy et al., 2015) and Residual Net (He et al., 2016a). In this section we introduce a method that allows our controller to propose skip connections or branching layers, thereby widening the search space.

To enable the controller to predict such connections, we use a set-selection type attention (Neelakantan et al., 2015) which was built upon the attention mechanism (Bahdanau et al., 2015; Vinyals et al., 2015). At layer N, we add an anchor point which has N−1 content-based sigmoids to indicate the previous layers that need to be connected. Each sigmoid is a function of the current hidden state of the controller and the previous hidden states of the previous N−1 anchor points:

P(Layer j is an input to layer i) = sigmoid(v^T tanh(W_prev * h_j + W_curr * h_i))

where h_j represents the hidden state of the controller at the anchor point for the j-th layer, where j ranges from 0 to N−1. We then sample from these sigmoids to decide which previous layers should be used as inputs to the current layer. The matrices W_prev, W_curr and v are trainable parameters. As these connections are also defined by probability distributions, the REINFORCE method still applies without any significant modifications. Figure 4 shows how the controller uses skip connections to decide what layers it wants as inputs to the current layer.

In our framework, if one layer has many input layers then all input layers are concatenated in the depth dimension. Skip connections can cause "compilation failures" where one layer is not compatible with another layer, or one layer may not have any input or output. To circumvent these issues, we employ three simple techniques. First, if a layer is not connected to any input layer then the image is used as the input layer. Second, at the final layer we take all layer outputs that have not been connected and concatenate them before sending this final hidden state to the classifier. Lastly, if input layers to be concatenated have different sizes, we pad the small layers with zeros so that the concatenated layers have the same sizes.

Finally, in Section 3.1, we do not predict the learning rate and we also assume that the architectures consist of only convolutional layers, which is also quite restrictive. It is possible to add the learning rate as one of the predictions. Additionally, it is also possible to predict pooling, local contrast normalization (Jarrett et al., 2009; Krizhevsky et al., 2012), and batchnorm (Ioffe & Szegedy, 2015) in the architectures. To be able to add more types of layers, we need to add an additional step in the controller RNN to predict the layer type, then the other hyperparameters associated with it.
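To illustrate the anchor-point mechanism described at the start of this subsection, here is a small sketch (an illustration under assumed shapes and random weights, not the released implementation) that computes the sigmoid above for every previous layer j and samples each skip connection independently, using trainable W_prev, W_curr and v.

```python
import numpy as np

rng = np.random.default_rng(0)
H = 32                                   # controller hidden size (assumed)
W_prev = rng.normal(0, 0.08, (H, H))
W_curr = rng.normal(0, 0.08, (H, H))
v = rng.normal(0, 0.08, H)

def sample_skip_connections(anchor_states, h_curr):
    """anchor_states: hidden states h_0 .. h_{N-2} at the previous anchor points.
    h_curr: controller hidden state at the current layer's anchor point.
    Returns the indices of previous layers sampled as inputs to the current layer."""
    inputs = []
    for j, h_j in enumerate(anchor_states):
        logit = v @ np.tanh(W_prev @ h_j + W_curr @ h_curr)
        p = 1.0 / (1.0 + np.exp(-logit))      # P(layer j is an input to layer i)
        if rng.random() < p:                   # each connection is sampled independently
            inputs.append(j)
    return inputs

# Example: deciding the inputs of layer 3, given anchors for layers 0-2.
anchors = [rng.normal(size=H) for _ in range(3)]
print(sample_skip_connections(anchors, rng.normal(size=H)))
```

Layers whose sampled input list comes back empty would fall back to the image input, per the first of the three techniques above.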

3.4 GENERATE RECURRENT CELL ARCHITECTURES

In this section, we will modify the above method to generate recurrent cells. At every time step t, the controller needs to find a functional form for h_t that takes x_t and h_{t-1} as inputs. The simplest way is to have h_t = tanh(W1 * x_t + W2 * h_{t-1}), which is the formulation of a basic recurrent cell. A more complicated formulation is the widely-used LSTM recurrent cell (Hochreiter & Schmidhuber, 1997).

The computations for basic RNN and LSTM cells can be generalized as a tree of steps that take x_t and h_{t-1} as inputs and produce h_t as final output. The controller RNN needs to label each node in the tree with a combination method (addition, elementwise multiplication, etc.) and an activation function (tanh, sigmoid, etc.) to merge two inputs and produce one output. Two outputs are then fed as inputs to the next node in the tree. To allow the controller RNN to select these methods and functions, we index the nodes in the tree in an order so that the controller RNN can visit each node one by one and label the needed hyperparameters.

Inspired by the construction of the LSTM cell (Hochreiter & Schmidhuber, 1997), we also need cell variables c_{t-1} and c_t to represent the memory states. To incorporate these variables, we need the controller RNN to predict what nodes in the tree to connect these two variables to. These predictions can be done in the last two blocks of the controller RNN.

To make this process more clear, we show an example in Figure 5, for a tree structure that has two leaf nodes and one internal node. The leaf nodes are indexed by 0 and 1, and the internal node is indexed by 2. The controller RNN needs to first predict 3 blocks, each block specifying a combination method and an activation function for each tree index. After that it needs to predict the last 2 blocks that specify how to connect c_t and c_{t-1} to temporary variables inside the tree. Specifically, according to the predictions of the controller RNN in this example, the following computation steps will occur:

  • The controller predicts Add and Tanh for tree index 0, this means we need to compute a0 = tanh(W1 * x_t + W2 * h_{t-1}).
  • The controller predicts ElemMult and ReLU for tree index 1, this means we need to compute a1 = ReLU((W3 * x_t) ⊙ (W4 * h_{t-1})).
  • The controller predicts 0 for the second element of the "Cell Index", and Add and ReLU for the elements in "Cell Inject", which means we need to compute a0_new = ReLU(a0 + c_{t-1}). Notice that we don't have any learnable parameters for the internal nodes of the tree.
  • The controller predicts ElemMult and Sigmoid for tree index 2, this means we need to compute a2 = sigmoid(a0_new ⊙ a1). Since the maximum index in the tree is 2, h_t is set to a2.
  • The controller RNN predicts 1 for the first element of the "Cell Index", this means that we should set c_t to the output of the tree at index 1 before the activation, i.e., c_t = (W3 * x_t) ⊙ (W4 * h_{t-1}).

In the above example, the tree has two leaf nodes, thus it is called a "base 2" architecture. In our experiments, we use a base number of 8 to make sure that the cell is expressive.
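To make these five steps concrete, here is a small sketch (my own illustration, with randomly initialized weights and an assumed hidden size) that evaluates exactly the base-2 cell described above: it computes a0, a1, the cell injection, the output a2 = h_t, and the new memory state c_t.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 8                                        # hidden size (assumed)
W1, W2, W3, W4 = (rng.normal(0, 0.1, (D, D)) for _ in range(4))

def relu(z):
    return np.maximum(z, 0.0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def example_cell(x_t, h_prev, c_prev):
    """The "base 2" cell predicted by the controller in the example above."""
    a0 = np.tanh(W1 @ x_t + W2 @ h_prev)          # tree index 0: Add, Tanh
    pre_a1 = (W3 @ x_t) * (W4 @ h_prev)           # tree index 1 before its activation
    a1 = relu(pre_a1)                             # tree index 1: ElemMult, ReLU
    a0_new = relu(a0 + c_prev)                    # Cell Inject: Add, ReLU into index 0
    h_t = sigmoid(a0_new * a1)                    # tree index 2: ElemMult, Sigmoid -> h_t
    c_t = pre_a1                                  # Cell Index 1: c_t = pre-activation output of index 1
    return h_t, c_t

h, c = np.zeros(D), np.zeros(D)
for x in rng.normal(size=(5, D)):                 # unroll the cell over 5 time steps
    h, c = example_cell(x, h, c)
print(h.shape, c.shape)
```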

4 EXPERIMENTS AND RESULTS

We apply our method to an image classification task with CIFAR-10 and a language modeling task with Penn Treebank, two of the most benchmarked datasets in deep learning. On CIFAR-10, our goal is to find a good convolutional architecture, whereas on Penn Treebank our goal is to find a good recurrent cell. On each dataset, we have a separate held-out validation dataset to compute the reward signal. The reported performance on the test set is computed only once for the network that achieves the best result on the held-out validation dataset. More details about our experimental procedures and results are as follows.

4.1 LEARNING CONVOLUTIONAL ARCHITECTURES FOR CIFAR-10

Dataset: In these experiments we use the CIFAR-10 dataset with data preprocessing and augmentation procedures that are in line with other previous results. We first preprocess the data by whitening all the images. Additionally, we upsample each image then choose a random 32x32 crop of this upsampled image. Finally, we use random horizontal flips on this 32x32 cropped image.

Search space: Our search space consists of convolutional architectures, with rectified linear units as non-linearities (Nair & Hinton, 2010), batch normalization (Ioffe & Szegedy, 2015) and skip connections between layers (Section 3.3). For every convolutional layer, the controller RNN has to select a filter height in [1, 3, 5, 7], a filter width in [1, 3, 5, 7], and a number of filters in [24, 36, 48, 64]. For strides, we perform two sets of experiments, one where we fix the strides to be 1, and one where we allow the controller to predict the strides in [1, 2, 3].

Training details: The controller RNN is a two-layer LSTM with 35 hidden units on each layer. It is trained with the ADAM optimizer (Kingma & Ba, 2015) with a learning rate of 0.0006. The weights of the controller are initialized uniformly between -0.08 and 0.08. For the distributed training, we set the number of parameter server shards S to 20, the number of controller replicas K to 100 and the number of child replicas m to 8, which means there are 800 networks being trained on 800 GPUs concurrently at any time.

Once the controller RNN samples an architecture, a child model is constructed and trained for 50 epochs. The reward used for updating the controller is the maximum validation accuracy of the last 5 epochs, cubed. The validation set has 5,000 examples randomly sampled from the training set; the remaining 45,000 examples are used for training. The settings for training the CIFAR-10 child models are the same as those used in Huang et al. (2016a). We use the Momentum Optimizer with a learning rate of 0.1, weight decay of 1e-4, momentum of 0.9 and use Nesterov Momentum (Sutskever et al., 2013).
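As a small worked illustration of the reward shaping and replica arithmetic just described (the cubing, K and m come from the text; the helper name and sample accuracies are mine):

```python
def controller_reward(val_accuracies, last_k=5):
    """Reward for one child model: max validation accuracy of the last `last_k` epochs, cubed."""
    return max(val_accuracies[-last_k:]) ** 3

# e.g. a child whose last five epochs peak at 0.92 validation accuracy:
print(controller_reward([0.85, 0.90, 0.91, 0.92, 0.91]))   # 0.778688

# Distributed setup from the text: K controller replicas, each sampling m children.
K, m = 100, 8
print(K * m)   # 800 child networks training concurrently
```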

During the training of the controller, we use a schedule of increasing number of layers in the child networks as training progresses. On CIFAR-10, we ask the controller to increase the depth of the child models by 2 every 1,600 samples, starting at 6 layers.

Results: After the controller trains 12,800 architectures, we find the architecture that achieves the best validation accuracy. We then run a small grid search over learning rate, weight decay, batchnorm epsilon and what epoch to decay the learning rate. The best model from this grid search is then run until convergence and we then compute the test accuracy of such model and summarize the results in Table 1. As can be seen from the table, Neural Architecture Search can design several promising architectures that perform as well as some of the best models on this dataset.
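For concreteness, the post-search grid over training hyperparameters mentioned above could look like the following sketch; the candidate values and the placeholder evaluation function are invented for illustration, not the ones used in the paper.

```python
import itertools

# Hypothetical candidate values for the small grid search described above.
grid = {
    "learning_rate": [0.05, 0.1, 0.2],
    "weight_decay": [1e-4, 5e-4],
    "batchnorm_epsilon": [1e-5, 1e-3],
    "lr_decay_epoch": [40, 60],
}

def validation_accuracy(config):
    # Placeholder: train the best-found architecture with `config` and
    # return its accuracy on the held-out validation set.
    return 0.0

best = max(
    (dict(zip(grid, values)) for values in itertools.product(*grid.values())),
    key=validation_accuracy,
)
print(best)
```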

First, if we ask the controller to not predict stride or pooling, it can design a 15-layer architecture that achieves a 5.50% error rate on the test set. This architecture has a good balance between accuracy and depth. In fact, it is the shallowest and perhaps the most inexpensive architecture among the top performing networks in this table. This architecture is shown in Appendix A, Figure 7. A notable feature of this architecture is that it has many rectangular filters and it prefers larger filters at the top layers. Like residual networks (He et al., 2016a), the architecture also has many one-step skip connections. This architecture is a local optimum in the sense that if we perturb it, its performance becomes worse. For example, if we densely connect all layers with skip connections, its performance becomes slightly worse: 5.56%. If we remove all skip connections, its performance drops to 7.97%.

In the second set of experiments, we ask the controller to predict strides in addition to the other hyperparameters. As stated earlier, this is more challenging because the search space is larger. In this case, it finds a 20-layer architecture that achieves a 6.01% error rate on the test set, which is not much worse than the first set of experiments.

Finally, if we allow the controller to include 2 pooling layers at layer 13 and layer 24 of the architectures, the controller can design a 39-layer network that achieves 4.47%, which is very close to the best human-invented architecture that achieves 3.74%. To limit the search space complexity we have our model predict 13 layers where each layer prediction is a fully connected block of 3 layers. Additionally, we change the number of filters our model can predict from [24, 36, 48, 64] to [6, 12, 24, 36]. Our result can be improved to 3.65% by adding 40 more filters to each layer of our architecture. Additionally this model with 40 filters added is 1.05x as fast as the DenseNet model that achieves 3.74%, while having better performance. The DenseNet model that achieves a 3.46% error rate (Huang et al., 2016b) uses 1x1 convolutions to reduce its total number of parameters, which we did not do, so it is not an exact comparison.

4.2 LEARNING RECURRENT CELLS FOR PENN TREEBANK

Dataset: We apply Neural Architecture Search to the Penn Treebank dataset, a well-known benchmark for language modeling. On this task, LSTM architectures tend to excel (Zaremba et al., 2014; Gal, 2015), and improving them is difficult (Jozefowicz et al., 2015). As PTB is a small dataset, regularization methods are needed to avoid overfitting. First, we make use of the embedding dropout and recurrent dropout techniques proposed in Zaremba et al. (2014) and (Gal, 2015). We also try to combine them with the method of sharing Input and Output embeddings, e.g., Bengio et al. (2003); Mnih & Hinton (2007), especially Inan et al. (2016) and Press & Wolf (2016). Results with this method are marked with "shared embeddings."

Search space: Following Section 3.4, our controller sequentially predicts a combination method then an activation function for each node in the tree. For each node in the tree, the controller RNN needs to select a combination method in [add, elem_mult] and an activation method in [identity, tanh, sigmoid, relu]. The number of input pairs to the RNN cell is called the "base number" and is set to 8 in our experiments. When the base number is 8, the search space has approximately 6 × 10^16 architectures, which is much larger than 15,000, the number of architectures that we allow our controller to evaluate.

Training details: The controller and its training are almost identical to the CIFAR-10 experiments except for a few modifications: 1) the learning rate for the controller RNN is 0.0005, slightly smaller than that of the controller RNN in CIFAR-10, 2) in the distributed training, we set S to 20, K to 400 and m to 1, which means there are 400 networks being trained on 400 CPUs concurrently at any time, 3) during asynchronous training we only do parameter updates to the parameter server once 10 gradients from replicas have been accumulated.

In our experiments, every child model is constructed and trained for 35 epochs. Every child model has two layers, with the number of hidden units adjusted so that the total number of learnable parameters approximately matches the "medium" baselines (Zaremba et al., 2014; Gal, 2015). In these experiments we only have the controller predict the RNN cell structure and fix all other hyperparameters. The reward used to update the controller is a function of the validation perplexity (lower perplexity gives higher reward), scaled by a constant c, usually set at 80.

After the controller RNN is done training, we take the best RNN cell according to the lowest validation perplexity and then run a grid search over learning rate, weight initialization, dropout rates and decay epoch. The best cell found was then run with three different configurations and sizes to increase its capacity.

Results: In Table 2, we provide a comprehensive list of architectures and their performance on the PTB dataset. As can be seen from the table, the models found by Neural Architecture Search outperform other state-of-the-art models on this dataset, and one of our best models achieves a gain of almost 3.6 perplexity. Not only is our cell better, the model that achieves 64 perplexity is also more than two times faster because the previous best network requires running a cell 10 times per time step (Zilly et al., 2016).

The newly discovered cell is visualized in Figure 8 in Appendix A. The visualization reveals that the new cell has many similarities to the LSTM cell in the first few steps, such as it likes to compute W1 * h_{t-1} + W2 * x_t several times and send them to different components in the cell.

Transfer Learning Results: To understand whether the cell can generalize to a different task, we apply it to the character language modeling task on the same dataset. We use an experimental setup that is similar to Ha et al. (2016), but use variational dropout by Gal (2015). We also train our own LSTM with our setup to get a fair LSTM baseline. Models are trained for 80K steps and the best test set perplexity is taken according to the step where validation set perplexity is the best. The results on the test set of our method and state-of-the-art methods are reported in Table 3. The results on small settings with 5-6M parameters confirm that the new cell does indeed generalize, and is better than the LSTM cell.

Additionally, we carry out a larger experiment where the model has 16.28M parameters. This model has a weight decay rate of 1e-4, was trained for 600K steps (longer than the above models), and the test perplexity is taken at the step where the validation set perplexity is best. We use dropout rates of 0.2 and 0.5 as described in Gal (2015), but do not use embedding dropout. We use the ADAM optimizer with a learning rate of 0.001 and an input embedding size of 128. Our model has two layers with 800 hidden units. We used a minibatch size of 32 and a BPTT length of 100. With this setting, our model achieves 1.214 perplexity, which is the new state-of-the-art result on this task.

Finally, we also drop our cell into the GNMT framework (Wu et al., 2016), which was previously tuned for LSTM cells, and train a WMT14 English→German translation model. The GNMT network has 8 layers in the encoder and 8 layers in the decoder. The first layer of the encoder has bidirectional connections. The attention module is a neural network with 1 hidden layer. When an LSTM cell is used, the number of hidden units in each layer is 1024. The model is trained in a distributed setting with a parameter server and 12 workers. Additionally, each worker uses 8 GPUs and a minibatch of 128. We use Adam with a learning rate of 0.0002 in the first 60K training steps, and SGD with a learning rate of 0.5 until 400K steps. After that the learning rate is annealed by dividing it by 2 after every 100K steps until it reaches 0.1. Training is stopped at 800K steps. More details can be found in Wu et al. (2016).

In our experiment with the new cell, we make no change to the above settings except for dropping in the new cell and adjusting the hyperparameters so that the new model has the same computational complexity as the base model. The result shows that our cell, with the same computational complexity, achieves an improvement of 0.5 test set BLEU over the default LSTM cell. Though this improvement is not huge, the fact that the new cell can be used without any tuning on the existing GNMT framework is encouraging. We expect further tuning can help our cell perform better.

Control Experiment 1 – Adding more functions in the search space: To test the robustness of Neural Architecture Search, we add max to the list of combination functions and sin to the list of activation functions and rerun our experiments. The results show that even with a bigger search space, the model can achieve somewhat comparable performance. The best architecture with max and sin is shown in Figure 8 in Appendix A.

Control Experiment 2 – Comparison against Random Search: Instead of policy gradient, one can use random search to find the best network. Although this baseline seems simple, it is often very hard to surpass (Bergstra & Bengio, 2012). We report the perplexity improvements of policy gradient over random search as training progresses in Figure 6. The results show that not only is the best model found with policy gradient better than the best model found with random search, but the average of the top models is also much better.

5 CONCLUSION

In this paper we introduce Neural Architecture Search, an idea of using a recurrent neural network to compose neural network architectures. By using a recurrent network as the controller, our method is flexible so that it can search a variable-length architecture space. Our method has strong empirical performance on very challenging benchmarks and presents a new research direction for automatically finding good neural network architectures. The code for running the models found by the controller on CIFAR-10 and PTB will be released at https://github.com/tensorflow/models. Additionally, we have added the RNN cell found using our method, under the name NASCell, into TensorFlow, so others can easily use it.

ACKNOWLEDGMENTS

We thank Greg Corrado, Jeff Dean, David Ha, Lukasz Kaiser and the Google Brain team for their help with the project.

Reference: https://blog.csdn.net/xjz18298268521/article/details/79078835
