DARTS: Differentiable Architecture Search
Abstract
This paper addresses the scalability challenge of architecture search by formulating the task in a differentiable manner. Unlike conventional approaches of applying evolution or reinforcement learning over a discrete and non-differentiable search space, our method is based on the continuous relaxation of the architecture representation, allowing efficient search of the architecture using gradient descent. Extensive experiments on CIFAR-10, ImageNet, Penn Treebank and WikiText-2 show that our algorithm excels in discovering high-performance convolutional architectures for image classification and recurrent architectures for language modeling, while being orders of magnitude faster than state-of-the-art non-differentiable techniques.
Hanxiao Liu CMU hanxiaol@cs.cmu.edu Karen Simonyan DeepMind simonyan@google.com Yiming Yang CMU yiming@cs.cmu.edu
1 Introduction
Discovering state-of-the-art neural network architectures requires substantial effort of human experts. Recently, there has been a growing interest in developing algorithmic solutions to automate the manual process of architecture design. The automatically searched architectures have achieved highly competitive performance in tasks such as image classification (Zoph and Le, 2016; Zoph et al., 2017; Liu et al., 2017b, a; Real et al., 2018) and object detection (Zoph et al., 2017).
The best existing architecture search algorithms are computationally demanding despite their remarkable performance. For example, obtaining a state-of-the-art architecture for CIFAR-10 and ImageNet required 1800 GPU days of reinforcement learning (RL) (Zoph et al., 2017) or 3150 GPU days of evolution (Real et al., 2018). Several approaches for speeding up the search have been proposed, such as imposing a particular structure of the search space (Liu et al., 2017b, a), weights or performance prediction for each individual architecture (Brock et al., 2017; Baker et al., 2018) and weight sharing across multiple architectures (Pham et al., 2018b; Cai et al., 2018), but the fundamental challenge of scalability remains. An inherent cause of inefficiency for the dominant approaches, e.g. based on RL, evolution, MCTS (Negrinho and Gordon, 2017), SMBO (Liu et al., 2017a) or Bayesian optimization (Kandasamy et al., 2018), is the fact that architecture search is treated as a black-box optimization problem over a discrete domain, which requires a large number of architecture evaluations.
In this work, we approach the problem from a different angle, and propose a method for efficient architecture search called DARTS (Differentiable ARchiTecture Search). Instead of searching over a discrete set of candidate architectures, we relax the search space to be continuous, so that the architecture can be optimized with respect to its validation set performance by gradient descent. The data efficiency of gradient-based optimization, as opposed to inefficient black-box search, allows DARTS to achieve competitive performance with the state of the art using orders of magnitude fewer computation resources. It also outperforms another recent efficient architecture search method, ENAS (Pham et al., 2018b). Notably, DARTS is simpler than many existing approaches as it does not involve any controllers (Zoph and Le, 2016; Baker et al., 2016; Zoph et al., 2017; Pham et al., 2018b), hypernetworks (Brock et al., 2017) or performance predictors (Liu et al., 2017a), yet it is generic enough to search for both convolutional and recurrent architectures.
The idea of searching architectures within a continuous domain is not new (Saxena and Verbeek, 2016; Ahmed and Torresani, 2017; Shin et al., 2018), but there are several major distinctions. While prior works seek to fine-tune a specific aspect of an architecture, such as filter shapes or branching patterns in a convolutional network, DARTS is able to discover high-performance architectures with complex graph topologies within a rich search space. Moreover, DARTS is not restricted to any specific architecture family, and is able to discover both convolutional and recurrent networks.
In our experiments (Sect. 3) we show that DARTS is able to design a convolutional cell that achieves 2.83 ± 0.06% test error on CIFAR-10 for image classification, which is competitive with the state-of-the-art result by regularized evolution (Real et al., 2018) obtained using three orders of magnitude more computation resources. The same convolutional cell also achieves 26.9% top-1 error when transferred to ImageNet (mobile setting), which is comparable to the best RL method (Zoph et al., 2017). On the language modeling task, DARTS discovers a recurrent cell that achieves 56.1 perplexity on Penn Treebank (PTB) in a single GPU day, outperforming both extensively tuned LSTM (Melis et al., 2017) and all the existing automatically searched cells based on NAS (Zoph and Le, 2016) and ENAS (Pham et al., 2018b).
Our contributions can be summarized as follows:

We introduce a novel algorithm for differentiable network architecture search that is applicable to both convolutional and recurrent architectures.

Through extensive experiments on image classification and language modeling tasks we show that gradient-based architecture search achieves highly competitive results on CIFAR-10 and outperforms the state of the art on PTB. This is a very interesting result, considering that so far the best architecture search methods used non-differentiable search techniques, e.g. based on RL (Zoph et al., 2017) or evolution (Real et al., 2018; Liu et al., 2017b).

We achieve remarkable architecture search efficiency (with 4 GPUs: 2.83% error on CIFAR-10 in 1 day; 56.1 perplexity on PTB in 6 hours) which we attribute to the use of gradient-based optimization as opposed to non-differentiable search techniques.

We show that the architectures learned by DARTS on CIFAR-10 and PTB are transferable to ImageNet and WikiText-2, respectively.
The implementation of DARTS is available at https://github.com/quark0/darts
2 Differentiable Architecture Search
We describe our search space in general form in Sect. 2.1, where the computation procedure for an architecture (or a cell in it) is represented as a directed acyclic graph. We then introduce a simple continuous relaxation scheme for our search space which leads to a differentiable learning objective for the joint optimization of the architecture and its weights (Sect. 2.2). Finally, we propose an approximation technique to make the algorithm computationally feasible and efficient (Sect. 2.3).
2.1 Search Space
Following Zoph et al. (2017); Real et al. (2018); Liu et al. (2017a, b), we search for a computation cell as the building block of the final architecture. The learned cell could either be stacked to form a convolutional network or recursively connected to form a recurrent network.
A cell is a directed acyclic graph consisting of an ordered sequence of $N$ nodes. Each node $x^{(i)}$ is a latent representation (e.g. a feature map in convolutional networks) and each directed edge $(i,j)$ is associated with some operation $o^{(i,j)}$ that transforms $x^{(i)}$. We assume the cell to have two input nodes and a single output node. For convolutional cells, the input nodes are defined as the cell outputs in the previous two layers (Zoph et al., 2017). For recurrent cells, these are defined as the input at the current step and the state carried from the previous step. The output of the cell is obtained by applying a reduction operation (e.g. concatenation) to all the intermediate nodes.
Each intermediate node is computed based on all of its predecessors:
$$x^{(j)} = \sum_{i<j} o^{(i,j)}\big(x^{(i)}\big) \tag{1}$$
A special zero operation is also included to indicate a lack of connection between two nodes. The task of learning the cell therefore reduces to learning the operations on its edges.
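The node-by-node computation described above can be sketched in a few lines. This is only an illustration under simplifying assumptions: node values are scalars rather than feature maps, and `forward_cell`, `edge_ops` and the toy operations are hypothetical names that do not come from the paper or its code release.

```python
from typing import Callable, Dict, List, Tuple

Op = Callable[[float], float]


def zero(x: float) -> float:
    """Special 'zero' operation: the edge contributes nothing (no connection)."""
    return 0.0


def identity(x: float) -> float:
    return x


def forward_cell(inputs: List[float],
                 num_intermediate: int,
                 edge_ops: Dict[Tuple[int, int], Op]) -> List[float]:
    """Compute all nodes of a cell in topological order.

    Nodes 0..len(inputs)-1 are the input nodes. Every later node j is the
    sum, over all predecessors i < j, of the edge operation applied to
    node i. Edges missing from `edge_ops` default to the zero operation.
    """
    nodes = list(inputs)
    for j in range(len(inputs), len(inputs) + num_intermediate):
        value = 0.0
        for i in range(j):
            op = edge_ops.get((i, j), zero)
            value += op(nodes[i])
        nodes.append(value)
    return nodes


# Toy cell: two input nodes (0, 1) and two intermediate nodes (2, 3).
edge_ops = {
    (0, 2): identity,
    (1, 2): lambda x: 2.0 * x,
    (2, 3): lambda x: x + 1.0,
    (0, 3): zero,  # explicitly "no connection"
}
nodes = forward_cell([1.0, 3.0], num_intermediate=2, edge_ops=edge_ops)
# The cell output reduces over the intermediate nodes; with scalars we
# simply collect them as a stand-in for concatenation.
cell_output = nodes[2:]
```

Because every edge carries exactly one operation, choosing a cell amounts to filling in the `edge_ops` table, which is the sense in which learning the cell reduces to learning the operations on its edges.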
2.2 Continuous Relaxation and Optimization
Let $\mathcal{O}$ be a set of candidate operations (e.g., convolution, max pooling, zero) where each operation represents some function $o(\cdot)$ to be applied to $x^{(i)}$. To make the search space continuous, we relax the categorical choice of a particular operation as a softmax over all possible operations:
$$\bar{o}^{(i,j)}(x) = \sum_{o \in \mathcal{O}} \frac{\exp\big(\alpha_o^{(i,j)}\big)}{\sum_{o' \in \mathcal{O}} \exp\big(\alpha_{o'}^{(i,j)}\big)}\, o(x) \tag{2}$$
where the operation mixing weights for a pair of nodes $(i,j)$ are parameterized by a vector $\alpha^{(i,j)}$ of dimension $|\mathcal{O}|$. After the relaxation, the task of architecture search reduces to learning a set of continuous variables $\alpha = \{\alpha^{(i,j)}\}$, as illustrated in Figure 1. At the end of search, a discrete architecture is obtained by replacing each mixed operation $\bar{o}^{(i,j)}$ with the most likely operation, i.e., $o^{(i,j)} = \mathrm{argmax}_{o \in \mathcal{O}}\, \alpha_o^{(i,j)}$. In the following, we refer to $\alpha$ as the (encoding of the) architecture.
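To make the relaxation concrete, here is a small sketch of a softmax-weighted mixed operation on one edge, followed by the argmax discretization. It assumes scalar inputs and three toy candidate operations; `mixed_op`, `most_likely_op` and the operation names are illustrative and not taken from the authors' implementation.

```python
import math
from typing import Callable, Dict


def softmax(alpha: Dict[str, float]) -> Dict[str, float]:
    """Numerically stable softmax over the per-operation parameters."""
    shift = max(alpha.values())
    exps = {name: math.exp(a - shift) for name, a in alpha.items()}
    total = sum(exps.values())
    return {name: e / total for name, e in exps.items()}


def mixed_op(x: float,
             alpha: Dict[str, float],
             ops: Dict[str, Callable[[float], float]]) -> float:
    """Softmax-weighted sum of all candidate operations applied to x."""
    weights = softmax(alpha)
    return sum(weights[name] * op(x) for name, op in ops.items())


def most_likely_op(alpha: Dict[str, float]) -> str:
    """Discretization: keep the operation with the largest parameter."""
    return max(alpha, key=alpha.get)


# Toy candidate set; real candidates would be convolutions, pooling, etc.
OPS = {
    "identity": lambda x: x,
    "double": lambda x: 2.0 * x,
    "zero": lambda x: 0.0,
}

uniform_alpha = {"identity": 0.0, "double": 0.0, "zero": 0.0}
trained_alpha = {"identity": 1.0, "double": 2.0, "zero": -1.0}

mixed_uniform = mixed_op(3.0, uniform_alpha, OPS)  # equal weights of 1/3
chosen = most_likely_op(trained_alpha)
```

With all parameters equal, the mixed operation is a plain average of the candidates; as one parameter grows, the edge behaves more and more like that single operation, which is what makes the final argmax a reasonable discretization.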
After relaxation, our goal is to jointly learn the architecture $\alpha$ and the weights $w$ within all the mixed operations (e.g. weights of the convolution filters). Analogous to architecture search using RL (Zoph and Le, 2016; Zoph et al., 2017; Pham et al., 2018b) or evolution (Liu et al., 2017b; Real et al., 2018) where the validation set performance is treated as the reward or fitness, DARTS aims to optimize the validation loss, but using gradient descent.
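Denote by $\mathcal{L}_{train}$ and $\mathcal{L}_{val}$ the training and the validation loss, respectively. Both losses are determined not only by the architecture $\alpha$, but also the weights $w$ in the network. The goal for architecture search is to find $\alpha^*$ that minimizes the validation loss $\mathcal{L}_{val}(w^*, \alpha^*)$, where the weights $w^*$ associated with the architecture are obtained by minimizing the training loss, $w^* = \mathrm{argmin}_w\, \mathcal{L}_{train}(w, \alpha^*)$.
This implies a bilevel optimization problem (Anandalingam and Friesz, 1992; Colson et al., 2007) with $\alpha$ as the upper-level variable and $w$ as the lower-level variable:
$$\min_{\alpha}\ \mathcal{L}_{val}\big(w^*(\alpha), \alpha\big) \tag{3}$$
$$\text{s.t.}\quad w^*(\alpha) = \mathrm{argmin}_{w}\ \mathcal{L}_{train}(w, \alpha) \tag{4}$$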
The nested formulation also arises in gradient-based hyperparameter optimization (Maclaurin et al., 2015; Pedregosa, 2016), which is related in a sense that the continuous architecture $\alpha$ could be viewed as a special type of hyperparameter, although its dimension is substantially higher than scalar-valued hyperparameters (such as the learning rate), and it is harder to optimize.
2.3 Approximation
Solving the bilevel optimization exactly is prohibitive, as it would require recomputing $w^*(\alpha)$ by solving the inner problem (4) whenever there is any change in $\alpha$. We thus propose an approximate iterative optimization procedure where $w$ and $\alpha$ are optimized by alternating between gradient descent steps in the weight and architecture spaces respectively (Alg. 1). At step $k$, given the current architecture $\alpha_{k-1}$, we obtain $w_k$ by moving $w_{k-1}$ in the direction of minimizing the training loss $\mathcal{L}_{train}(w_{k-1}, \alpha_{k-1})$. Then, keeping the weights $w_k$ fixed, we update the architecture so as to minimize the validation loss after a single step of gradient descent w.r.t. the weights:
$$\mathcal{L}_{val}\big(w_k - \xi \nabla_w \mathcal{L}_{train}(w_k, \alpha_{k-1}),\ \alpha_{k-1}\big) \tag{5}$$
where $\xi$ is the learning rate for this virtual gradient step. The motivation behind (5) is that we would like to find an architecture which has a low validation loss when its weights are optimized by (a single step of) gradient descent, where the one-step unrolled weights serve as the surrogate for $w^*(\alpha)$. A related approach has been used in meta-learning for model transfer (Finn et al., 2017). Notably, the dynamics of our iterative algorithm define a Stackelberg game (Von Stackelberg, 1934) between $\alpha$'s optimizer (leader) and $w$'s optimizer (follower), which typically requires the leader to anticipate the follower's next-step move in order to achieve an equilibrium. While we are not currently aware of the convergence guarantees for our optimization algorithm, in practice it is able to converge with a suitable choice of $\xi$. (A simple working strategy is to set $\xi$ equal to the learning rate for $w$'s optimizer.) We also note that when momentum is enabled for weight optimization, the one-step forward learning objective (5) is modified accordingly and all of our analysis still applies.
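The alternating scheme can be illustrated on a deliberately tiny problem. The sketch below assumes scalar weights and architecture, and two hand-picked quadratic losses, $\mathcal{L}_{train}(w, \alpha) = \tfrac{1}{2}(w - \alpha)^2$ and $\mathcal{L}_{val}(w) = \tfrac{1}{2}(w - 1)^2$; these losses, the function names and the step sizes are assumptions made for illustration and are not part of the paper. For this toy problem the bilevel solution is $w = \alpha = 1$, since the inner problem gives $w^*(\alpha) = \alpha$.

```python
def grad_w_train(w: float, alpha: float) -> float:
    """d/dw of the toy training loss 0.5 * (w - alpha)**2."""
    return w - alpha


def grad_w_val(w: float) -> float:
    """d/dw of the toy validation loss 0.5 * (w - 1)**2."""
    return w - 1.0


def arch_grad(w: float, alpha: float, xi: float) -> float:
    """Gradient w.r.t. alpha of the validation loss evaluated at the
    one-step unrolled weights w' = w - xi * grad_w_train(w, alpha).

    For this toy training loss, dw'/dalpha = xi, and the validation loss
    has no direct dependence on alpha, so the chain rule gives
    grad_w_val(w') * xi.
    """
    w_unrolled = w - xi * grad_w_train(w, alpha)
    return grad_w_val(w_unrolled) * xi


def search(steps: int = 2000, lr_w: float = 0.1,
           lr_alpha: float = 0.1, xi: float = 0.1):
    """Alternate one weight step and one architecture step per iteration."""
    w, alpha = 0.0, 0.0
    for _ in range(steps):
        # Weight step: descend the training loss with alpha held fixed.
        w = w - lr_w * grad_w_train(w, alpha)
        # Architecture step: descend the unrolled validation loss
        # with the weights held fixed.
        alpha = alpha - lr_alpha * arch_grad(w, alpha, xi)
    return w, alpha


w_final, alpha_final = search()
```

Here `xi` is set equal to `lr_w`, following the working strategy mentioned in the text. Note that with `xi = 0` the function `arch_grad` returns zero for this particular toy problem, because the validation loss depends on the architecture only through the weights.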
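The architecture gradient is given by differentiating (5) w.r.t. $\alpha$ (we omit the step index $k$ for brevity):
$$\nabla_{\alpha} \mathcal{L}_{val}(w', \alpha) - \xi\, \nabla^2_{\alpha, w} \mathcal{L}_{train}(w, \alpha)\, \nabla_{w'} \mathcal{L}_{val}(w', \alpha) \tag{6}$$
where $w' = w - \xi \nabla_w \mathcal{L}_{train}(w, \alpha)$ denotes the weights for a one-step forward model. The gradient (6) contains a matrix-vector product in its second term, which is expensive to compute. Fortunately, the complexity can be substantially reduced using the finite difference approximation. Let $\epsilon$ be a small scalar (footnote 2: We found to be sufficiently accurate in all of our experiments.), and $w^{\pm} = w \pm \epsilon \nabla_{w'} \mathcal{L}_{val}(w', \alpha)$. Then:
$$\nabla^2_{\alpha, w} \mathcal{L}_{train}(w, \alpha)\, \nabla_{w'} \mathcal{L}_{val}(w', \alpha) \approx \frac{\nabla_{\alpha} \mathcal{L}_{train}(w^{+}, \alpha) - \nabla_{\alpha} \mathcal{L}_{train}(w^{-}, \alpha)}{2\epsilon} \tag{7}$$
Evaluating the finite difference requires only two forward passes for the weights and two backward passes for $\alpha$, and the complexity is reduced from $O(|\alpha||w|)$ to $O(|\alpha| + |w|)$.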
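The finite difference trick can be checked numerically on a small example. The sketch below uses a made-up training loss $\mathcal{L}_{train}(w, \alpha) = \alpha_0 \sin(w_0) + \alpha_1 w_0 w_1^2$ whose mixed second derivative is known in closed form; the loss, the vector `v` (standing in for the validation gradient w.r.t. the unrolled weights) and the value of `eps` are all assumptions chosen for illustration.

```python
import math
from typing import List


def grad_alpha_train(w: List[float], alpha: List[float]) -> List[float]:
    """Gradient w.r.t. alpha of the toy loss
    alpha[0] * sin(w[0]) + alpha[1] * w[0] * w[1]**2.
    The loss is linear in alpha, so the result does not depend on alpha."""
    return [math.sin(w[0]), w[0] * w[1] ** 2]


def exact_product(w: List[float], v: List[float]) -> List[float]:
    """Closed-form product of the mixed second derivative (rows indexed by
    alpha, columns by w) with the vector v, for the toy loss above."""
    return [
        math.cos(w[0]) * v[0],
        w[1] ** 2 * v[0] + 2.0 * w[0] * w[1] * v[1],
    ]


def finite_difference_product(w: List[float], alpha: List[float],
                              v: List[float], eps: float) -> List[float]:
    """Central difference of the alpha-gradient along direction v:
    two gradient evaluations, no explicit second derivative."""
    w_plus = [wi + eps * vi for wi, vi in zip(w, v)]
    w_minus = [wi - eps * vi for wi, vi in zip(w, v)]
    g_plus = grad_alpha_train(w_plus, alpha)
    g_minus = grad_alpha_train(w_minus, alpha)
    return [(gp - gm) / (2.0 * eps) for gp, gm in zip(g_plus, g_minus)]


w = [0.3, -1.2]
alpha = [0.5, 2.0]
v = [0.7, 0.4]  # stands in for the validation gradient at w'

approx = finite_difference_product(w, alpha, v, eps=1e-4)
exact = exact_product(w, v)
max_abs_error = max(abs(a - e) for a, e in zip(approx, exact))
```

The approximation only ever evaluates gradients with respect to the architecture at two perturbed weight vectors, which is why its cost grows with the sum rather than the product of the two dimensions.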
First-order Approximation: When $\xi = 0$, the second-order derivative in (6) disappears. In this case, the architecture gradient is given by $\nabla_{\alpha} \mathcal{L}_{val}(w, \alpha)$, corresponding to the simple heuristic of optimizing the validation loss by assuming $\alpha$ and $w$ are independent of each other. This leads to some speedup but empirically worse performance, according to our experimental results in Table 1 and Table 2. In the following, we refer to the case of $\xi = 0$ as the first-order approximation, and refer to the gradient formulation with $\xi > 0$ as the second-order approximation.
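For readers who want the intermediate step, the architecture gradient follows from the chain rule applied to the one-step unrolled objective, and the first-order case drops out by setting the virtual learning rate to zero. The derivation below is a sketch in this section's notation; the Jacobian $\partial w' / \partial \alpha$ is a symbol introduced here for exposition and does not appear in the original text.

```latex
\begin{aligned}
w'(\alpha) &= w - \xi \nabla_w \mathcal{L}_{train}(w, \alpha), \\
\frac{d}{d\alpha}\, \mathcal{L}_{val}\big(w'(\alpha), \alpha\big)
  &= \nabla_\alpha \mathcal{L}_{val}(w', \alpha)
   + \Big(\frac{\partial w'}{\partial \alpha}\Big)^{\!\top}
     \nabla_{w'} \mathcal{L}_{val}(w', \alpha) \\
  &= \nabla_\alpha \mathcal{L}_{val}(w', \alpha)
   - \xi\, \nabla^2_{\alpha, w} \mathcal{L}_{train}(w, \alpha)\,
     \nabla_{w'} \mathcal{L}_{val}(w', \alpha), \\
\xi = 0 \;&\Rightarrow\; w' = w, \qquad
\frac{d}{d\alpha}\, \mathcal{L}_{val} = \nabla_\alpha \mathcal{L}_{val}(w, \alpha).
\end{aligned}
```

The second line uses $\partial w' / \partial \alpha = -\xi\, \nabla^2_{w, \alpha} \mathcal{L}_{train}(w, \alpha)$, whose transpose is the $|\alpha| \times |w|$ matrix that multiplies the validation gradient.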
2.4 Deriving Discrete Architectures
After obtaining the continuous architecture encoding $\alpha$, the discrete architecture is derived by

Retaining $k$ strongest predecessors for each intermediate node, where the strength of an edge $(i,j)$ is defined as $\max_{o \in \mathcal{O},\, o \neq zero} \frac{\exp(\alpha_o^{(i,j)})}{\sum_{o' \in \mathcal{O}} \exp(\alpha_{o'}^{(i,j)})}$. To make our derived architecture comparable with those in the existing works, we use for convolutional cells (Zoph et al., 2017; Real et al., 2018) and for recurrent cells (Pham et al., 2018b).

Replacing every mixed operation with the most likely operation by taking the argmax.
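The two derivation steps can be sketched as follows for a single intermediate node. Several things here are assumptions made for illustration: the edge strength is taken to be the largest softmax weight among the non-zero operations, the zero operation is also excluded from the argmax on retained edges (a retained edge with the zero operation would carry no connection), `k = 2` is an arbitrary example value, and all names are hypothetical.

```python
import math
from typing import Dict, List, Tuple

EdgeAlpha = Dict[str, float]  # operation name -> architecture parameter


def edge_strength(alpha_edge: EdgeAlpha) -> float:
    """Largest softmax weight among the non-zero operations on one edge."""
    denom = sum(math.exp(a) for a in alpha_edge.values())
    return max(math.exp(a) / denom
               for name, a in alpha_edge.items() if name != "zero")


def derive_node(incoming: Dict[int, EdgeAlpha], k: int) -> List[Tuple[int, str]]:
    """Keep the k strongest incoming edges and pick one operation on each.

    `incoming` maps a predecessor index to that edge's parameters.
    Returns (predecessor, operation) pairs sorted by predecessor index.
    """
    ranked = sorted(incoming,
                    key=lambda i: edge_strength(incoming[i]),
                    reverse=True)
    kept = sorted(ranked[:k])
    derived = []
    for i in kept:
        # Assumption: 'zero' is not eligible on an edge we decided to keep.
        candidates = {n: a for n, a in incoming[i].items() if n != "zero"}
        derived.append((i, max(candidates, key=candidates.get)))
    return derived


incoming = {
    0: {"conv": 1.0, "pool": 0.0, "zero": 0.0},
    1: {"conv": 0.0, "pool": 0.2, "zero": 3.0},  # dominated by 'zero'
    2: {"conv": 0.1, "pool": 2.0, "zero": 0.0},
}
derived = derive_node(incoming, k=2)
```

In this example the edge from predecessor 1 is dropped because most of its softmax mass sits on the zero operation, which is how a large zero weight expresses the absence of a connection.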
3 Experiments and Results
Our experiments on CIFAR-10 and PTB consist of two stages, architecture search (Sect. 3.1) and architecture evaluation (Sect. 3.2). In the first stage, we search for the cell architectures using DARTS, and determine the best cells based on their validation performance. In the second stage, we use these cells to construct larger architectures, which we train from scratch and evaluate on the test set. Finally, we investigate the transferability of the best cells learned on CIFAR-10 and PTB by evaluating them on ImageNet and WikiText-2 (WT2) respectively (Sect. 3.4).
3.1 Architecture Search
3.1.1 Searching for Convolutional Cells on CIFAR-10
We include the following operations in $\mathcal{O}$: and separable convolutions, and dilated separable convolutions, max pooling, average pooling, identity, and zero. All operations are of stride one (if applicable) and the convolved feature maps are padded to preserve their spatial resolution. We use the ReLU-Conv-BN order for convolutional operations, and each separable convolution is always applied twice (Zoph et al., 2017; Real et al., 2018; Liu et al., 2017a).
Our convolutional cell consists of nodes, among which the output node is defined as the depthwise concatenation of all the intermediate nodes (input nodes excluded). The rest of the setup follows Zoph et al. (2017); Liu et al. (2017a); Real et al. (2018), where a network is then formed by stacking multiple cells together. The first and second nodes of cell $k$ are set equal to the outputs of cell $k-2$ and cell $k-1$, respectively, and convolutions are inserted as necessary. Cells located at the and of the total depth of the network are reduction cells, in which all the operations adjacent to the input nodes are of stride two. The architecture encoding therefore is $(\alpha_{normal}, \alpha_{reduce})$, where $\alpha_{normal}$ is shared by all the normal cells and $\alpha_{reduce}$ is shared by all the reduction cells.
Since the architecture will be varying throughout the search process, we always use batch-specific statistics for batch normalization rather than the global moving average. Learnable affine parameters in all batch normalizations are disabled during the search process to avoid rescaling the outputs of the candidate operations.
To carry out architecture search, we hold out half of the CIFAR-10 training data as the validation set. A small network consisting of 8 cells is trained using DARTS for 50 epochs, with batch size (for both the training and validation sets) and the initial number of channels . The numbers were chosen to ensure the network can fit into a single GPU. We use momentum SGD to optimize the weights $w$, with initial learning rate (annealed down to zero following a cosine schedule), momentum , and weight decay . We use Adam as the optimizer for the architecture variables (the $\alpha$'s in both the normal and reduction cells), with initial learning rate , momentum and weight decay . The search takes one day on a single GPU. (All of our experiments were performed using NVIDIA GTX 1080Ti GPUs.)
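The phrase "annealed down to zero following a cosine schedule" does not by itself pin down a formula; one common form of cosine annealing is sketched below. The function name and the initial rate of 0.1 are placeholders for illustration and are not values taken from this paper; only the 50-epoch horizon comes from the text.

```python
import math


def cosine_annealed_lr(initial_lr: float, epoch: int, total_epochs: int) -> float:
    """Learning rate at `epoch` (0-based) for a half-cosine decay that
    starts at `initial_lr` and reaches zero at `total_epochs`."""
    return 0.5 * initial_lr * (1.0 + math.cos(math.pi * epoch / total_epochs))


PLACEHOLDER_INITIAL_LR = 0.1  # hypothetical value, for illustration only
TOTAL_EPOCHS = 50             # search length stated in the text

schedule = [cosine_annealed_lr(PLACEHOLDER_INITIAL_LR, e, TOTAL_EPOCHS)
            for e in range(TOTAL_EPOCHS + 1)]
```

The schedule decays slowly at first, fastest around the midpoint (where the rate is half its initial value), and flattens out again as it approaches zero.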
3.1.2 Searching for Recurrent Cells on Penn Treebank
Our set of available operations includes the special zero operation, as well as linear transformations followed by , , , and identity mapping, respectively. The choice of these candidate operations follows Zoph and Le (2016); Pham et al. (2018b).
Our recurrent cell consists of nodes. The very first intermediate node is obtained by linearly transforming the two input nodes, adding up the results and then passing through a activation function, as done in the ENAS cell (Pham et al., 2018b). The rest of the cell is learned. Other settings are similar to ENAS, where each operation is enhanced with a highway bypass (Zilly et al., 2016) and the cell output is defined as the average of all the intermediate nodes. As in ENAS, we enable batch normalization in each node to prevent gradient explosion during architecture search, and disable it during architecture evaluation. Learnable affine parameters in batch normalization are disabled, as we did for convolutional cells. Our recurrent network consists of only a single cell. Namely, we do not assume any repetitive patterns within the architecture by vertically stacking the cells.
For architecture search, both the embedding and the hidden sizes are set to 300. The linear transformation parameters across all candidate operations on the same edge are shared (their shapes are all 300 × 300). This allows us to fit the continuous architecture within a single GPU. The network is then trained for 50 epochs using SGD without momentum, with learning rate , batch size 256, BPTT length 35, and weight decay . We apply variational dropout (Gal and Ghahramani, 2016) of to word embeddings, to the cell input, and to all the hidden nodes. A dropout of is also applied to the output layer. Other training settings are identical to those in Merity et al. (2017); Yang et al. (2017). Similarly to the convolutional architectures, we use Adam for the optimization of $\alpha$, with initial learning rate , momentum and weight decay . The search takes 6 hours on a single GPU.
3.2 Architecture Evaluation
To select the architecture for evaluation, we run DARTS four times with different random seeds and pick the best cell based on the validation performance. This is particularly important for recurrent cells, as the optimization outcomes can be initialization-sensitive (Fig. 3).
To evaluate the selected architecture, we randomly initialize its weights (weights learned during the search process are discarded), train it from scratch, and report its performance on the test set. We note the test set is never used for architecture search or architecture selection.
3.2.1 CIFAR-10
A large network of 20 cells is trained for 600 epochs with batch size 96. Other hyperparameters remain the same as the ones used for architecture search. Following existing works (Pham et al., 2018b; Zoph et al., 2017; Liu et al., 2017a; Real et al., 2018), additional enhancements include cutout (DeVries and Taylor, 2017), path dropout of probability and auxiliary towers with weight . The training takes 1.5 days on a single GPU with our implementation in PyTorch (Paszke et al., 2017). Since the CIFAR results are subject to high variance even with exactly the same setup (Liu et al., 2017b), we report the mean and standard deviation of 4 independent runs for our full model.
3.2.2 Penn Treebank
A single-layer recurrent network with the discovered cell is trained for 1600 epochs with batch size 64 using averaged SGD (Polyak and Juditsky, 1992) (ASGD), with learning rate and weight decay . To speed up training, we start with SGD and trigger ASGD using the same protocol as in Yang et al. (2017); Merity et al. (2017). Both the embedding and the hidden sizes are set to 850 to ensure our model size is comparable with other baselines. Other hyperparameters, including dropouts, remain exactly the same as those for architecture search. For fair comparison, we do not fine-tune our model at the end of the optimization, nor do we use any additional enhancements such as dynamic evaluation (Krause et al., 2017) or continuous cache (Grave et al., 2016). The training takes 1.5 days on a single 1080Ti GPU with our PyTorch implementation. To account for implementation discrepancies, we also incorporated the ENAS cell (Pham et al., 2018b) into our codebase and trained their network under the same setup as our discovered cells.
Table 1: CIFAR-10 results for convolutional architectures.
Architecture  Test Error (%)  Params (M)  Search Cost (GPU days)  Search Method
DenseNet-BC (Huang et al., 2017)  3.46  25.6  –  manual
NASNet-A + cutout (Zoph et al., 2017)  2.65  3.3  1800  RL
NASNet-A + cutout (Zoph et al., 2017)  2.83  3.1  3150  RL
AmoebaNet-A + cutout (Real et al., 2018)  3.34 ± 0.06  3.2  3150  evolution
AmoebaNet-A + cutout (Real et al., 2018)  3.12  3.1  3150  evolution
AmoebaNet-B + cutout (Real et al., 2018)  2.55 ± 0.05  2.8  3150  evolution
Hierarchical Evo (Liu et al., 2017b)  3.75 ± 0.12  15.7  300  evolution
PNAS (Liu et al., 2017a)  3.41 ± 0.09  3.2  225  SMBO
ENAS + cutout (Pham et al., 2018b)  2.89  4.6  0.5  RL
Random + cutout  3.49  3.1  –  –
DARTS (first order) + cutout  2.94  2.9  1.5  gradient-based
DARTS (second order) + cutout  2.83 ± 0.06  3.4  4  gradient-based
Table 2: PTB results for recurrent architectures.
Architecture  Valid Perplexity  Test Perplexity  Params (M)  Search Cost (GPU days)  Search Method
Variational RHN (Zilly et al., 2016)  67.9  65.4  23  –  manual
LSTM (Merity et al., 2017)  60.7  58.8  24  –  manual
LSTM + skip connections (Melis et al., 2017)  60.9  58.3  24  –  manual
LSTM + 5 softmax experts (Yang et al., 2017)  –  57.4  –  –  manual
LSTM + 15 softmax experts (Yang et al., 2017)  58.1  56.0  22  –  manual
NAS (Zoph and Le, 2016)  –  64.0  25  1e4 CPU days  RL
ENAS (Pham et al., 2018b)*  68.3  63.1  24  0.5  RL
ENAS (Pham et al., 2018b)  60.8  58.6  24  0.5  RL
Random  64.1  61.5  23  –  –
DARTS (first order)  62.7  60.5  23  0.5  gradient-based
DARTS (second order)  58.8  56.6  23  1  gradient-based
DARTS (second order) + 1e3 more training epochs  58.3  56.1  23  1  gradient-based

* The results were obtained using the code (Pham et al., 2018a) publicly released by the authors.
3.3 Results Analysis
The CIFAR-10 results for convolutional architectures are presented in Table 1. Notably, DARTS achieved comparable results with the state of the art (Zoph et al., 2017; Real et al., 2018) while using three orders of magnitude fewer computation resources (i.e. 1.5 or 4 GPU days vs 1800 GPU days for NASNet and 3150 GPU days for AmoebaNet). Moreover, with slightly longer search time, DARTS outperformed ENAS (Pham et al., 2018b) by discovering cells with comparable error rates but fewer parameters. The longer search time is due to the fact that we repeated the search process four times for cell selection. This practice is less important for convolutional cells, however, because the performance of discovered architectures does not strongly depend on initialization (Fig. 3).
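The "three orders of magnitude" comparison can be checked directly from the search costs quoted in the paragraph above. The short sketch below only performs that arithmetic; the dictionary keys and function names are illustrative.

```python
import math

# Search costs in GPU days, as quoted in the text.
SEARCH_COST_GPU_DAYS = {
    "NASNet": 1800.0,
    "AmoebaNet": 3150.0,
    "DARTS (first order)": 1.5,
    "DARTS (second order)": 4.0,
}


def cost_ratio(baseline: str, method: str) -> float:
    """How many times more GPU days the baseline search used."""
    return SEARCH_COST_GPU_DAYS[baseline] / SEARCH_COST_GPU_DAYS[method]


def orders_of_magnitude(ratio: float) -> float:
    """Base-10 logarithm of a cost ratio."""
    return math.log10(ratio)


for baseline in ("NASNet", "AmoebaNet"):
    for method in ("DARTS (first order)", "DARTS (second order)"):
        ratio = cost_ratio(baseline, method)
        print(f"{baseline} vs {method}: {ratio:.0f}x "
              f"(about {orders_of_magnitude(ratio):.1f} orders of magnitude)")
```

The ratios range from 450x (NASNet against second-order DARTS) to 2100x (AmoebaNet against first-order DARTS), so "three orders of magnitude" is a rounded figure covering roughly 2.7 to 3.3 orders depending on the pair compared.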
Table 2 presents the results for recurrent architectures on PTB, where a cell discovered by DARTS achieved the test perplexity of 56.1. This is competitive with the state-of-the-art model enhanced by a mixture of softmaxes (Yang et al., 2017), and better than all the rest of the existing architectures that are either manually or automatically discovered. To the best of our knowledge, this is the first time an automatically searched architecture outperforms the extensively tuned LSTM (Melis et al., 2017), demonstrating the importance of architecture search in addition to hyperparameter search. In terms of efficiency, the overall cost (4 runs in total) is within 1 GPU day, which is comparable to ENAS and significantly faster than NAS (Zoph and Le, 2016).
Finally, it is interesting to note that random architectures are competitive when used in both convolutional and recurrent models, which reflects the importance of the search space design. Nevertheless, DARTS is able to significantly improve upon random architectures in both cases (2.83 ± 0.06 vs 3.49 on CIFAR-10; 56.6 vs 61.5 on PTB).
3.4 Transferability of Learned Architectures
3.4.1 ImageNet
We consider the mobile setting where the input image size is 224×224 and the number of multiply-add operations in the model is restricted to be less than 600M. A network of 14 cells is trained for 250 epochs with batch size 128, weight decay and initial SGD learning rate 0.1 (decayed by a factor of 0.97 after each epoch). Other hyperparameters follow Zoph et al. (2017); Real et al. (2018); Liu et al. (2017a). (We did not conduct extensive hyperparameter tuning due to limited computation resources.) The training takes 12 days on a single GPU.
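The per-epoch decay described above is a plain geometric schedule. The sketch below computes it for the stated initial rate of 0.1, decay factor of 0.97 and 250 epochs; the function name and the 0-based epoch convention are assumptions made for illustration.

```python
def decayed_lr(epoch: int, initial_lr: float = 0.1, decay: float = 0.97) -> float:
    """Learning rate used during `epoch` (0-based) when the rate is
    multiplied by `decay` after every completed epoch."""
    return initial_lr * decay ** epoch


NUM_EPOCHS = 250
schedule = [decayed_lr(e) for e in range(NUM_EPOCHS)]
final_lr = schedule[-1]
```

Under these assumptions the rate in the last epoch is on the order of 5e-5, so the schedule spans more than three orders of magnitude over the course of training.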
Results in Table 3 show that the cell learned on CIFAR-10 is indeed transferable to ImageNet. It is worth noticing that DARTS achieves competitive performance with the state-of-the-art RL method (Zoph et al., 2017) while using three orders of magnitude fewer computation resources.
Table 3: ImageNet results in the mobile setting.
Architecture  Top-1 Test Error (%)  Top-5 Test Error (%)  Params (M)  Multiply-Adds (M)  Search Cost (GPU days)  Search Method
Inception-v1 (Szegedy et al., 2015)  30.2  10.1  6.6  1448  –  manual
MobileNet (Howard et al., 2017)  29.4  10.5  4.2  569  –  manual
ShuffleNet 2 (v1) (Zhang et al., 2017)  29.1  10.2  5  524  –  manual
ShuffleNet 2 (v2) (Zhang et al., 2017)  26.3  –  5  524  –  manual
NASNet-A (Zoph et al., 2017)  26.0  8.4  5.3  564  1800  RL
NASNet-B (Zoph et al., 2017)  27.2  8.7  5.3  488  1800  RL
NASNet-C (Zoph et al., 2017)  27.5  9.0  4.9  558  1800  RL
AmoebaNet-A (Real et al., 2018)  25.5  8.0  5.1  555  3150  evolution
AmoebaNet-B (Real et al., 2018)  26.0  8.5  5.3  555  3150  evolution
AmoebaNet-C (Real et al., 2018)  24.3  7.6  6.4  570  3150  evolution
PNAS (Liu et al., 2017a)  25.8  8.1  5.1  588  225  SMBO
DARTS (searched on CIFAR-10)  26.9  9.0  4.9  595  4  gradient-based
3.4.2 WikiText-2
We use embedding and hidden sizes 700, weight decay , and hidden-node variational dropout 0.15. Other hyperparameters remain the same as in our PTB experiments.
Table 4 shows that the cell identified by DARTS transfers better than ENAS on WT2, though the overall results are less strong than those presented in Table 2 for PTB. The weaker transferability between PTB and WT2 (as compared to that between CIFAR-10 and ImageNet) could be explained by the relatively small size of the source dataset (PTB) for architecture search. The issue of transferability could potentially be circumvented by directly optimizing the architecture on the task of interest.
Table 4: WT2 results for recurrent architectures.
Architecture  Valid Perplexity  Test Perplexity  Params (M)  Search Cost (GPU days)  Search Method
LSTM + augmented loss (Inan et al., 2017)  91.5  87.0  28  –  manual
LSTM + continuous cache pointer (Grave et al., 2016)  –  68.9  –  –  manual
LSTM (Merity et al., 2017)  69.1  66.0  33  –  manual
LSTM + skip connections (Melis et al., 2017)  69.1  65.9  24  –  manual
LSTM + 15 softmax experts (Yang et al., 2017)  66.0  63.3  33  –  manual
ENAS (Pham et al., 2018b) (searched on PTB)  72.4  70.4  33  0.5  RL
DARTS (searched on PTB)  69.5  66.9  33  1  gradient-based
4 Conclusion
We presented DARTS, the first differentiable architecture search algorithm for both convolutional and recurrent networks. By searching in a continuous space, DARTS is able to match or outperform the state-of-the-art non-differentiable architecture search methods on image classification and language modeling tasks with remarkable efficiency improvement by several orders of magnitude. In the future, we would like to investigate direct architecture search on larger tasks (e.g. ImageNet) using DARTS.
Acknowledgement
The authors thank Zihang Dai, Hieu Pham and Zico Kolter for useful discussions.
References
 Ahmed and Torresani (2017) Karim Ahmed and Lorenzo Torresani. Connectivity learning in multi-branch networks. arXiv preprint arXiv:1709.09582, 2017.
 Anandalingam and Friesz (1992) G Anandalingam and TL Friesz. Hierarchical optimization: An introduction. Annals of Operations Research, 34(1):1–11, 1992.
 Baker et al. (2016) Bowen Baker, Otkrist Gupta, Nikhil Naik, and Ramesh Raskar. Designing neural network architectures using reinforcement learning. arXiv preprint arXiv:1611.02167, 2016.
 Baker et al. (2018) Bowen Baker, Otkrist Gupta, Ramesh Raskar, and Nikhil Naik. Accelerating neural architecture search using performance prediction. ICLR Workshop, 2018.
 Brock et al. (2017) Andrew Brock, Theodore Lim, James M Ritchie, and Nick Weston. SMASH: one-shot model architecture search through hypernetworks. arXiv preprint arXiv:1708.05344, 2017.
 Cai et al. (2018) Han Cai, Tianyao Chen, Weinan Zhang, Yong Yu, and Jun Wang. Efficient architecture search by network transformation. AAAI, 2018.
 Colson et al. (2007) Benoît Colson, Patrice Marcotte, and Gilles Savard. An overview of bilevel optimization. Annals of operations research, 153(1):235–256, 2007.
 DeVries and Taylor (2017) Terrance DeVries and Graham W Taylor. Improved regularization of convolutional neural networks with cutout. arXiv preprint arXiv:1708.04552, 2017.
 Finn et al. (2017) Chelsea Finn, Pieter Abbeel, and Sergey Levine. Model-agnostic meta-learning for fast adaptation of deep networks. arXiv preprint arXiv:1703.03400, 2017.
 Gal and Ghahramani (2016) Yarin Gal and Zoubin Ghahramani. A theoretically grounded application of dropout in recurrent neural networks. In Advances in neural information processing systems, pages 1019–1027, 2016.
 Grave et al. (2016) Edouard Grave, Armand Joulin, and Nicolas Usunier. Improving neural language models with a continuous cache. arXiv preprint arXiv:1612.04426, 2016.
 Howard et al. (2017) Andrew G Howard, Menglong Zhu, Bo Chen, Dmitry Kalenichenko, Weijun Wang, Tobias Weyand, Marco Andreetto, and Hartwig Adam. Mobilenets: Efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861, 2017.
 Huang et al. (2017) Gao Huang, Zhuang Liu, Kilian Q Weinberger, and Laurens van der Maaten. Densely connected convolutional networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, volume 1, page 3, 2017.
 Inan et al. (2017) Hakan Inan, Khashayar Khosravi, and Richard Socher. Tying word vectors and word classifiers: A loss framework for language modeling. International Conference on Learning Representations, 2017.
 Kandasamy et al. (2018) Kirthevasan Kandasamy, Willie Neiswanger, Jeff Schneider, Barnabas Poczos, and Eric Xing. Neural architecture search with bayesian optimisation and optimal transport. arXiv preprint arXiv:1802.07191, 2018.
 Krause et al. (2017) Ben Krause, Emmanuel Kahembwe, Iain Murray, and Steve Renals. Dynamic evaluation of neural sequence models. arXiv preprint arXiv:1709.07432, 2017.
 Liu et al. (2017a) Chenxi Liu, Barret Zoph, Jonathon Shlens, Wei Hua, Li-Jia Li, Li Fei-Fei, Alan Yuille, Jonathan Huang, and Kevin Murphy. Progressive neural architecture search. arXiv preprint arXiv:1712.00559, 2017a.
 Liu et al. (2017b) Hanxiao Liu, Karen Simonyan, Oriol Vinyals, Chrisantha Fernando, and Koray Kavukcuoglu. Hierarchical representations for efficient architecture search. arXiv preprint arXiv:1711.00436, 2017b.
 Maclaurin et al. (2015) Dougal Maclaurin, David Duvenaud, and Ryan Adams. Gradient-based hyperparameter optimization through reversible learning. In International Conference on Machine Learning, pages 2113–2122, 2015.
 Melis et al. (2017) Gábor Melis, Chris Dyer, and Phil Blunsom. On the state of the art of evaluation in neural language models. arXiv preprint arXiv:1707.05589, 2017.
 Merity et al. (2017) Stephen Merity, Nitish Shirish Keskar, and Richard Socher. Regularizing and optimizing lstm language models. arXiv preprint arXiv:1708.02182, 2017.
 Negrinho and Gordon (2017) Renato Negrinho and Geoff Gordon. Deeparchitect: Automatically designing and training deep architectures. arXiv preprint arXiv:1704.08792, 2017.
 Paszke et al. (2017) Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. Automatic differentiation in pytorch. In NIPSW, 2017.
 Pedregosa (2016) Fabian Pedregosa. Hyperparameter optimization with approximate gradient. In International Conference on Machine Learning, pages 737–746, 2016.
 Pham et al. (2018a) Hieu Pham, Melody Y Guan, Barret Zoph, Quoc V Le, and Jeff Dean. Authors’ implementation of “Efficient Neural Architecture Search via Parameter Sharing”. https://github.com/melodyguan/enas/tree/2734eb2657847f090e1bc5c51c2b9cbf0be51887, 2018a. Accessed: 2018-04-05.
 Pham et al. (2018b) Hieu Pham, Melody Y Guan, Barret Zoph, Quoc V Le, and Jeff Dean. Efficient neural architecture search via parameter sharing. arXiv preprint arXiv:1802.03268, 2018b.
 Polyak and Juditsky (1992) Boris T Polyak and Anatoli B Juditsky. Acceleration of stochastic approximation by averaging. SIAM Journal on Control and Optimization, 30(4):838–855, 1992.
 Real et al. (2018) Esteban Real, Alok Aggarwal, Yanping Huang, and Quoc V Le. Regularized evolution for image classifier architecture search. arXiv preprint arXiv:1802.01548, 2018.
 Saxena and Verbeek (2016) Shreyas Saxena and Jakob Verbeek. Convolutional neural fabrics. In Advances in Neural Information Processing Systems, pages 4053–4061, 2016.
 Shin et al. (2018) Richard Shin, Charles Packer, and Dawn Song. Differentiable neural network architecture search. In Workshop at International Conference on Learning Representations, 2018.
 Szegedy et al. (2015) Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, Andrew Rabinovich, JenHao Rick Chang, et al. Going deeper with convolutions. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015.
 Von Stackelberg (1934) Heinrich Von Stackelberg. Marktform und gleichgewicht. J. springer, 1934.
 Yang et al. (2017) Zhilin Yang, Zihang Dai, Ruslan Salakhutdinov, and William W Cohen. Breaking the softmax bottleneck: a high-rank RNN language model. arXiv preprint arXiv:1711.03953, 2017.
 Zhang et al. (2017) Xiangyu Zhang, Xinyu Zhou, Mengxiao Lin, and Jian Sun. Shufflenet: An extremely efficient convolutional neural network for mobile devices. arXiv preprint arXiv:1707.01083, 2017.
 Zilly et al. (2016) Julian Georg Zilly, Rupesh Kumar Srivastava, Jan Koutník, and Jürgen Schmidhuber. Recurrent highway networks. arXiv preprint arXiv:1607.03474, 2016.
 Zoph and Le (2016) Barret Zoph and Quoc V Le. Neural architecture search with reinforcement learning. arXiv preprint arXiv:1611.01578, 2016.
 Zoph et al. (2017) Barret Zoph, Vijay Vasudevan, Jonathon Shlens, and Quoc V Le. Learning transferable architectures for scalable image recognition. arXiv preprint arXiv:1707.07012, 2017.