Efficient Forward Architecture Search
Abstract
We propose a neural architecture search (NAS) algorithm, Petridish, to iteratively add shortcut connections to existing network layers. The added shortcut connections effectively perform gradient boosting on the augmented layers. The proposed algorithm is motivated by the feature selection algorithm forward stagewise linear regression, since we consider NAS as a generalization of feature selection for regression, where NAS selects shortcuts among layers instead of selecting features. To reduce the number of trials of possible connection combinations, we jointly train all possible connections at each stage of growth while leveraging feature selection techniques to choose a subset of them. We experimentally show this process to be an efficient forward architecture search algorithm that can find competitive models using few GPU-days in both the search space of repeatable network modules (cell-search) and the space of general networks (macro-search). Petridish is particularly well-suited for warm-starting from existing models, which is crucial for lifelong-learning scenarios.
Hanzhang Hu,^{1} John Langford,^{2} Rich Caruana,^{2} Saurajit Mukherjee,^{2} Eric Horvitz,^{2} Debadeepta Dey^{2} ^{1}Carnegie Mellon University, ^{2}Microsoft Research hanzhang@cs.cmu.edu, {jcl,rcaruana,saurajim,horvitz,dedey}@microsoft.com
Preprint. Under review.
1 Introduction
Neural networks have achieved state-of-the-art performance on large-scale supervised learning tasks across domains like computer vision, natural language processing, and audio and speech-related tasks, using architectures manually designed by skilled practitioners, often via trial and error. Neural architecture search (NAS) (Zoph & Le, 2017; Zoph et al., 2018; Real et al., 2018; Pham et al., 2018; Liu et al., 2019; Han Cai, 2019) algorithms attempt to automatically find good architectures given datasets. In this work, we view NAS as a bi-level combinatorial optimization problem (Liu et al., 2019), where we seek both the optimal architecture and its associated optimal parameters. Interestingly, this formulation generalizes the well-studied problem of feature selection for linear regression (Tibshirani, 1994; Efron et al., 2004; Das & Kempe, 2011). This observation permits us to draw and leverage parallels between NAS algorithms and feature selection algorithms.
A plethora of NAS works have leveraged sampling methods, including reinforcement learning (Zoph & Le, 2017; Zoph et al., 2018; Liu et al., 2018), evolutionary algorithms (Real et al., 2017, 2018; Elsken et al., 2018a), and Bayesian optimization (Kandasamy et al., 2018), to enumerate architectures that are then independently trained. Interestingly, these approaches are uncommon for feature selection. Indeed, sample-based NAS often takes hundreds of GPU-days to find good architectures, and can be barely better than random search (Elsken et al., 2018b).
Another common NAS approach is analogous to sparse optimization (Tibshirani, 1994) or backward elimination for feature selection, e.g., (Liu et al., 2019; Pham et al., 2018; Han Cai, 2019; Xie et al., 2019). The approach starts with a supergraph that is the union of all possible architectures, and learns to downweight the unnecessary edges gradually via gradient descent or reinforcement learning. Such approaches drastically cut down the search time of NAS. However, these methods require domain knowledge to create the initial supergraphs, and typically need to reboot the search if the domain knowledge is updated.
In this work, we instead take an approach that is analogous to forward feature selection and iteratively grow existing networks. Although forward methods such as Orthogonal Matching Pursuit (Pati et al., 1993) and Least-angle Regression (Efron et al., 2004) are common in feature selection and often come with performance guarantees, there are only a few similar NAS approaches (Liu et al., 2017). Such forward algorithms are attractive when one wants to expand existing models as extra computation becomes available. Forward methods can utilize such extra computational resources without rebooting the training, as backward methods and sparse optimization must. Furthermore, forward methods naturally result in a spectrum of models of various complexities to choose from. Crucially, unlike backward approaches, forward methods do not need to specify a finite search space up front, making them more general and easier to use when warm-starting from previously available models and for lifelong learning.
Specifically, inspired by early neural network growth work (Fahlman & Lebiere, 1990), we propose a method (Petridish) of growing networks from small to large, where we opportunistically add shortcut connections in a fashion that is analogous to applying gradient boosting (Friedman, 2002) to the intermediate feature layers. To select from the possible shortcut connections, we also exploit sparsity-inducing regularization (Tibshirani, 1994) during the training of the eligible shortcuts.
We experiment with Petridish on both cell-search (Zoph et al., 2018), where we seek a shortcut connection pattern and repeat it in a manually designed skeleton network to form an architecture, and the less common but more general macro-search, where shortcut connections can be formed freely. Experimental results show Petridish macro-search to be better than previous macro-search NAS approaches on vision tasks, bringing macro-search performance up to par with cell-search, counter to the belief in other NAS works (Zoph & Le, 2017; Pham et al., 2018) that macro-search is inferior to cell-search. Petridish cell-search also finds models that are more cost-efficient than those from (Liu et al., 2019), while using similar training computation. This indicates that forward selection methods for NAS are effective and useful.
We summarize our contribution as follows.

We propose a forward neural architecture search algorithm that is analogous to gradient boosting on intermediate layers, allowing models to grow in complexity during training and to warm-start from existing architectures and weights.

On CIFAR-10 and PTB, the proposed method finds competitive models in a few GPU-days with both cell-search and macro-search.

The ablation studies of the hyperparameters highlight the importance of starting conditions to algorithm performance.
2 Background and Related Work
Sample-based. Zoph & Le (2017) leveraged policy gradients (Williams, 1992) to learn to sample networks, and established the now-common framework of sampling networks and evaluating them after a few epochs of training. The policy-gradient sampler has been replaced with evolutionary algorithms (Schaffer et al., 1990; Real et al., 2018; Elsken et al., 2018a), Bayesian optimization (Kandasamy et al., 2018), and Monte Carlo tree search (Negrinho & Gordon, 2017). Multiple search spaces (Elsken et al., 2018b) have also been studied under this framework. Zoph et al. (2018) introduce the idea of cell-search, where we learn a connection pattern, called a cell, and stack cells to form networks. Liu et al. (2018) further learn how to stack cells via hierarchical cells. Cai et al. (2018) evolve networks starting from competitive existing models via network morphism (Wei et al., 2016).
Weight-sharing. The sample-based framework of (Zoph & Le, 2017) spends most of its training computation on evaluating the sampled networks independently, and can cost hundreds of GPU-days during search. This framework was revolutionized by Pham et al. (2018), who share the weights of the possible networks and train all possible networks jointly. Liu et al. (2019) formalize NAS with weight-sharing as a bi-level optimization (Colson et al., 2007), where the architecture and the model parameters are jointly learned. Xie et al. (2019) leverage policy gradients to update the architecture, so that the whole bi-level optimization can be updated with gradient descent.
Forward NAS. Forward NAS originates from one of the earliest NAS works, "Cascade-Correlation" by Fahlman & Lebiere (1990), in which neurons are added to networks iteratively. Each new neuron takes input from existing neurons and maximizes the correlation between its activation and the residual of the network prediction. The new neuron is then frozen and used to improve the final prediction. This idea of iterative growth has recently been studied in (Cortes et al., 2017; Huang et al., 2018) via gradient boosting (Friedman, 2002). While Petridish is similar to gradient boosting, it is applicable at any layer, instead of only the final layer. Furthermore, Petridish initializes weak learners without freezing or affecting the current model, unlike gradient boosting, which freezes previous models. Liu et al. (2017) have proposed forward search within the sampling framework of (Zoph & Le, 2017). Petridish instead utilizes weight-sharing, reducing the search time from hundreds of GPU-days to just a few.
3 Preliminaries
Gradient Boosting: Let $\mathcal{H}$ be a space of weak learners. Gradient boosting matches weak learners $h \in \mathcal{H}$ to the functional gradient of the loss $\ell$ with respect to the prediction $\hat{y}$. The weak learner that best matches the negative gradient is added to the ensemble of learners, i.e.,

$$h^* \in \operatorname*{arg\,max}_{h \in \mathcal{H}} \left\langle h, \, -\frac{\partial \ell}{\partial \hat{y}} \right\rangle. \quad (1)$$

Then the predictor is updated to become $\hat{y} + \eta h^*$, where $\eta$ is the learning rate.
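As a concrete toy instance of this select-then-step loop, the sketch below (illustrative Python, not the paper's code; depth-1 threshold stumps are our choice of weak learner class) fits each round's stump to the negative gradient of the squared loss, i.e., the residual, then takes a damped step:

```python
def fit_stump(xs, resid):
    # exhaustively pick the threshold that best fits the residual (squared error)
    best = None
    for t in sorted(set(xs)):
        left = [r for x, r in zip(xs, resid) if x <= t]
        right = [r for x, r in zip(xs, resid) if x > t]
        lv = sum(left) / len(left) if left else 0.0
        rv = sum(right) / len(right) if right else 0.0
        err = sum((r - (lv if x <= t else rv)) ** 2 for x, r in zip(xs, resid))
        if best is None or err < best[0]:
            best = (err, t, lv, rv)
    _, t, lv, rv = best
    return lambda x, t=t, lv=lv, rv=rv: lv if x <= t else rv

def boost(xs, ys, rounds=100, eta=0.3):
    pred = [0.0] * len(xs)
    for _ in range(rounds):
        # negative functional gradient of (1/2) squared loss = residual
        resid = [y - p for y, p in zip(ys, pred)]
        h = fit_stump(xs, resid)
        pred = [p + eta * h(x) for x, p in zip(xs, pred)]
    return pred
```

With a damped step size (eta < 1), the residual shrinks geometrically whenever the weak learner class can fit it, mirroring the shrinkage used in stochastic gradient boosting.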
NAS Optimization: Given a data sample $x$ with label $y$ from a distribution $\mathcal{D}$, a neural network architecture $\alpha$ with parameters $w$ produces a prediction $\hat{y}(x; \alpha, w)$ and suffers a prediction loss $\ell(\hat{y}(x; \alpha, w), y)$. The expected loss is then

$$\mathcal{L}(\alpha, w) = \mathbb{E}_{(x, y) \sim \mathcal{D}}\left[\ell(\hat{y}(x; \alpha, w), y)\right]. \quad (2)$$

In practice, the loss is estimated on the empirical training data. Following (Liu et al., 2019), the problem of neural architecture search can be formulated as a bi-level optimization (Colson et al., 2007) of the network architecture $\alpha$ and the model parameters $w$ under the loss $\mathcal{L}$ as follows:

$$\min_{\alpha \,:\, c(\alpha) \le C} \; \mathcal{L}(\alpha, w^*(\alpha)) \quad \text{s.t.} \quad w^*(\alpha) = \operatorname*{arg\,min}_{w} \mathcal{L}(\alpha, w), \quad (3)$$

where $c(\alpha)$ is the test-time computational cost of the architecture and $C$ is some constant. Formally, let $x_1, \ldots, x_L$ be the intermediate layers in a feed-forward network. A shortcut from layer $x_i$ to layer $x_j$ ($i < j$) using operation $\mathrm{op}$ is represented by the triple $(x_i, x_j, \mathrm{op})$, where the operation is a unary operation such as 3x3 conv. We merge multiple shortcuts to the same $x_j$ with summation, a choice we examine in ablation studies. Hence, the architecture $\alpha$ is a collection of shortcut connections.
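The architecture representation described above can be sketched as follows (illustrative Python; the names and the toy `apply_op` are ours, and real operations act on tensors rather than scalars):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Shortcut:
    src: int   # index of the source layer x_i
    dst: int   # index of the target layer x_j, with src < dst
    op: str    # unary operation name, e.g. "sep_conv_3x3"

def merged_input(shortcuts, dst, layer_outputs, apply_op):
    # merge-by-summation of all op-transformed sources feeding layer `dst`
    return sum(apply_op(s.op, layer_outputs[s.src])
               for s in shortcuts if s.dst == dst)
```

An architecture is then just a list of such triples, which is what makes it natural to grow by appending shortcuts.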
Feature Selection Analogy: We note that Eq. 3 generalizes feature selection for linear prediction (Tibshirani, 1994; Pati et al., 1993; Das & Kempe, 2011), where $\alpha$ selects feature subsets, $w$ is the vector of prediction coefficients, and the loss is the expected squared error. Hence, we can understand a NAS algorithm by considering its application to feature selection, as discussed in the introduction and related works. This work draws a parallel to the feature selection algorithm Forward Stagewise Linear Regression (FSLR) with small step sizes, which is an approximation to Least-angle Regression (Efron et al., 2004). In FSLR, we iteratively update, with small step sizes, the weight of the feature that correlates the most with the prediction residual. Viewing candidate features as weak learners, the residuals become the gradient of the square loss with respect to the linear prediction. Hence, FSLR can also be understood as gradient boosting (Friedman, 2002).
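A minimal FSLR sketch (illustrative Python; the step size and iteration count are arbitrary choices, not from the paper) makes the boosting connection concrete: each iteration nudges the coefficient of the feature most correlated with the residual by a small fixed step:

```python
def fslr(X, y, eps=0.01, steps=2000):
    # Forward stagewise linear regression: pick the feature whose column is most
    # correlated with the current residual and move its weight by +/- eps.
    n, d = len(X), len(X[0])
    beta = [0.0] * d
    for _ in range(steps):
        resid = [y[i] - sum(X[i][j] * beta[j] for j in range(d)) for i in range(n)]
        corr = [sum(X[i][j] * resid[i] for i in range(n)) for j in range(d)]
        j = max(range(d), key=lambda k: abs(corr[k]))
        beta[j] += eps if corr[j] > 0 else -eps
    return beta
```

With small eps, the coefficients trace a path close to the Least-angle Regression path, converging to within roughly eps of the least-squares solution on well-conditioned problems.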
Cell-search vs. Macro-search: In this work, we consider both cell-search, a popular NAS search space where a network is a predefined sequence of some learned connection patterns (Zoph et al., 2018; Real et al., 2018; Pham et al., 2018; Liu et al., 2019), called cells, and macro-search, a more general NAS setting where no repeatable patterns are required. For a fair comparison between the two, we set both macro- and cell-search to start with the same seed model, which consists of a sequence of simple cells. Both searches also choose from the same set of shortcuts. The only difference is that in cell-search all cells change uniformly, while in macro-search each cell changes independently.
4 Methodology: Efficient Forward Architecture Search (Petridish)
Following gradient boosting strictly would limit the model growth to be only at the prediction layer of the network, $x_L$. Instead, this work seeks to jointly expand the expressiveness of the network at the intermediate layers $x_1, \ldots, x_L$. Specifically, we consider adding a weak learner $h$ at each layer $x_\ell$, where $\mathcal{H}_\ell$ (specified next) is the space of weak learners for layer $\ell$. The weak learner $h$ helps reduce the gradient of the loss with respect to $x_\ell$, i.e., $\frac{\partial \mathcal{L}}{\partial x_\ell}$, so we choose $h^*$ with

$$h^* \in \operatorname*{arg\,max}_{h \in \mathcal{H}_\ell} \left\langle h, \, -\frac{\partial \mathcal{L}}{\partial x_\ell} \right\rangle. \quad (4)$$

Then we expand the model by adding $h^*$ to $x_\ell$. In other words, we replace each $x_\ell$ with $x_\ell + s \cdot h^*$ in the original network, where $s$ is a scalar variable initialized to 0. The modified model can then be trained with back-propagation. We next specify the weak learner space and how weak learners are learned.
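The expansion step can be sketched as a wrapper layer (illustrative Python; in Petridish the weak learner may read from several earlier layers, which we simplify to a single input here):

```python
class AugmentedLayer:
    """Wraps an existing layer so it computes base(x) + s * weak(x)."""

    def __init__(self, base, weak_learner):
        self.base = base
        self.weak = weak_learner
        self.s = 0.0   # initialized to 0: the network's output is unchanged at first

    def __call__(self, x):
        return self.base(x) + self.s * self.weak(x)
```

Because s starts at 0, the augmented network initially computes exactly what the original did, so training can continue from the existing weights without disruption.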
Weak Learner Space: The weak learner space $\mathcal{H}_\ell$ for a layer $x_\ell$ is formally

$$\mathcal{H}_\ell = \left\{ \mathrm{merge}\left(\mathrm{op}_1(x_{i_1}), \ldots, \mathrm{op}_k(x_{i_k})\right) \,:\, \mathrm{op}_j \in \mathrm{Op}, \; x_{i_j} \in \mathcal{I}_\ell \right\}, \quad (5)$$

where Op is the set of eligible unary operations, $\mathcal{I}_\ell$ is the set of allowed input layers for $x_\ell$, $k$ is the number of shortcuts to merge together in a weak learner, and merge is an operation that combines the $k$ shortcuts into a tensor of the same shape as $x_\ell$. On vision tasks, following (Liu et al., 2019), we set Op to contain separable conv 3x3 and 5x5, dilated conv 3x3 and 5x5, max and average pooling 3x3, and identity. The separable conv is applied twice, as per (Liu et al., 2019). Following (Zoph et al., 2018; Liu et al., 2019), we set $\mathcal{I}_\ell$ to be the layers that are topologically earlier than $x_\ell$ and are either in the same cell as $x_\ell$ or outputs of the previous two cells. We choose $k$ from amongst 2, 3, and 4 through an ablation study in Sec. B.5, and we set merge to be a concatenation followed by a projection with conv 1x1, based on an ablation study against weighted sum in Sec. B.3.
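To get a feel for the size of this space, the sketch below enumerates candidate weak learners as unordered sets of k (operation, input) shortcuts; the operation list follows the paper, while the input set and function names are stand-ins:

```python
from itertools import combinations

# The seven eligible operations on vision tasks.
OPS = ["sep_conv_3x3", "sep_conv_5x5", "dil_conv_3x3", "dil_conv_5x5",
       "max_pool_3x3", "avg_pool_3x3", "identity"]

def candidate_weak_learners(inputs, k):
    # every unordered set of k (operation, input) shortcuts is one candidate
    shortcuts = [(op, x) for op in OPS for x in inputs]
    return list(combinations(shortcuts, k))
```

For instance, with 5 eligible input layers there are 35 shortcuts, hence C(35, 2) = 595 candidates at k=2, which motivates the joint training described next rather than enumerating and training each candidate.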
Weak Learning with Weight Sharing: In gradient boosting, one typically optimizes Eq. 4 by fitting multiple candidate weak learners $h$ independently and selecting the best afterwards. However, as there are on the order of $(|\mathrm{Op}| \cdot |\mathcal{I}_\ell|)^k$ possible weak learners in the space of Eq. 5, where $|\mathrm{Op}| = 7$ on vision tasks and $|\mathcal{I}_\ell|$ grows with depth, it may be costly to enumerate all possibilities. Inspired by parameter-sharing works in NAS (Pham et al., 2018; Liu et al., 2019) and model compression in neural networks (Huang et al., 2017a), we propose to jointly train the union of all weak learners, while learning to select the shortcut connections. This process also only costs a constant factor more than training one weak learner. Specifically, we fit the following joint weak learner for a layer $x_\ell$ in order to minimize $\left\langle \frac{\partial \mathcal{L}}{\partial x_\ell}, h \right\rangle$:

$$h(x) = \sum_{\mathrm{op} \in \mathrm{Op}} \sum_{x_i \in \mathcal{I}_\ell} \beta_{\mathrm{op}, i} \, \mathrm{op}(x_i), \quad (6)$$

where $\mathrm{op}$ and $x_i$ enumerate all possible operations and inputs, and $\beta_{\mathrm{op}, i}$ is the weight of the shortcut $\mathrm{op}(x_i)$. Each $\mathrm{op}(x_i)$ is normalized with batch-normalization to have approximately zero mean and unit variance in expectation, so $|\beta_{\mathrm{op}, i}|$ reflects the importance of the shortcut. To select the most important operations, we minimize this objective with an $\ell_1$ regularization on the weight vector $\beta$, i.e.,

$$\min_{\beta} \; \left\langle \frac{\partial \mathcal{L}}{\partial x_\ell}, h \right\rangle + \lambda \|\beta\|_1, \quad (7)$$

where $\lambda$ is a hyperparameter which we choose in appendix B.6. $\ell_1$ regularization, known as Lasso (Tibshirani, 1994), induces sparsity in the parameter $\beta$ and is widely used for feature selection.
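The sparsifying effect of the $\ell_1$ penalty can be illustrated with a standard Lasso solver (coordinate descent with soft thresholding, a least-squares stand-in for the paper's objective; our choice of solver) applied to toy shortcut outputs, where the penalty zeroes out a redundant shortcut:

```python
def soft_threshold(z, t):
    # proximal operator of the L1 norm: shrinks z toward 0 by t
    return (z - t) if z > t else (z + t) if z < -t else 0.0

def lasso_select(H, g, lam, iters=50):
    """Minimize (1/2) * ||g - H beta||^2 + lam * ||beta||_1 by coordinate descent.

    H[i][j] is the (batch-normalized, in the paper) output of shortcut j on
    example i; g is a toy stand-in for the target signal."""
    n, d = len(H), len(H[0])
    beta = [0.0] * d
    for _ in range(iters):
        for j in range(d):
            # residual with coordinate j held out
            r = [g[i] - sum(H[i][m] * beta[m] for m in range(d) if m != j)
                 for i in range(n)]
            rho = sum(H[i][j] * r[i] for i in range(n))
            beta[j] = soft_threshold(rho, lam) / sum(H[i][j] ** 2 for i in range(n))
    return beta
```

When two shortcuts carry collinear signals, the Lasso drives one weight to exactly zero rather than splitting credit between them, which is precisely the behavior wanted for shortcut selection.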
Weak Learning Implementation: A naïve implementation of joint weak learning needs to compute $\frac{\partial \mathcal{L}}{\partial x_\ell}$ and to freeze the existing model during weak learner training. Here we provide a modification that avoids these two costly requirements. Algorithm 1 describes the proposed implementation and Fig. 1(a) illustrates the weak-learning computation graph. We leverage a custom operation called stop-gradient, sg, which has the properties that $\mathrm{sg}(x) = x$ for any $x$ and $\nabla \mathrm{sg}(x) \equiv 0$. Similarly, we define the complementary operation stop-forward, $\mathrm{sf}(x) = x - \mathrm{sg}(x)$, i.e., $\mathrm{sf}(x) \equiv 0$ and $\nabla \mathrm{sf}(x) \equiv I$, the identity. Specifically, on line 7, we apply sg to the inputs of weak learners, so that weak learning does not affect the gradient of the existing model. Next, on line 11, we replace the layer $x_\ell$ with $x_\ell + \mathrm{sf}(h)$, so that the prediction of the model is unaffected by weak learning. Finally, the gradient of the loss with respect to any weak learner parameter $w_h$ is:

$$\frac{\partial \mathcal{L}}{\partial w_h} = \frac{\partial \mathcal{L}}{\partial x_\ell} \cdot \frac{\partial h}{\partial w_h}. \quad (8)$$

This means that sf and sg not only prevent weak learning from affecting the training of the existing model, but also enable us to minimize the weak-learning objective of Eq. 7 via back-propagation on the whole network. Thus, we no longer need to explicitly compute $\frac{\partial \mathcal{L}}{\partial x_\ell}$ nor freeze the existing model weights during weak learning. Furthermore, since weak learners of different layers do not interact during weak learning, we grow the network at all layers $x_\ell$ that are ends of cells at the same time.
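The sg/sf mechanics can be demonstrated on a toy scalar autodiff class (illustrative only; real implementations use a framework's stop-gradient primitive): the weak learner reads activations through sg and writes its output through sf, so the model's forward value and gradients are untouched while the weak learner still receives its gradient signal:

```python
class Var:
    """Toy reverse-mode autodiff scalar (a sketch, not the paper's code)."""

    def __init__(self, value, parents=()):
        self.value, self.parents, self.grad = value, list(parents), 0.0

    def __add__(self, other):
        return Var(self.value + other.value, [(self, 1.0), (other, 1.0)])

    def __mul__(self, other):
        return Var(self.value * other.value,
                   [(self, other.value), (other, self.value)])

    def backward(self, g=1.0):
        # naive recursion; fine for the tree-shaped graph in this example
        self.grad += g
        for parent, local in self.parents:
            parent.backward(g * local)

def sg(x):
    # stop-gradient: same forward value, gradients blocked
    return Var(x.value)

def sf(x):
    # stop-forward (= identity - sg): forward value 0, gradients pass through
    return Var(0.0, [(x, 1.0)])
```

In the usage below, `w` is a weak learner parameter: the output is unchanged by the weak path, `w` still receives the product of the outer gradient and the frozen activation, and `x` receives no extra gradient through the weak path.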
Finalize Weak Learners: In Algorithm 2 and Fig. 1(b), we finalize the weak learners. We select from each $\mathcal{H}_\ell$ the top $k$ shortcuts according to the absolute values of their weights $\beta$, and merge them with a concatenation followed by a projection to the shape of $x_\ell$. We note that the weighted sum used during weak learning is a special case of concatenation-projection, and we use an ablation study in appendix B.3 to validate this replacement. We also note that most NAS works (Zoph et al., 2018; Real et al., 2018; Pham et al., 2018; Liu et al., 2019; Xie et al., 2019; Han Cai, 2019) have similar setups of concatenating intermediate layers in cells and projecting the results. We train the finalized models for a few epochs, warm-starting from the parameters learned during weak learning.
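The selection half of finalization reduces to ranking shortcuts by the magnitude of their learned weights (a sketch; names are illustrative):

```python
def top_k_shortcuts(shortcuts, beta, k):
    # rank the jointly trained shortcuts by |beta| and keep the k strongest
    ranked = sorted(zip(shortcuts, beta), key=lambda pair: abs(pair[1]), reverse=True)
    return [s for s, _ in ranked[:k]]
```

Because batch-normalization gives every shortcut output comparable scale, comparing the raw magnitudes of beta across shortcuts is meaningful.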
Remarks: A key design concept of Petridish is amortization: we require the computational costs of weak learning and model training to be within a constant factor of each other, and we further design Petridish to do both at the same time. Following these principles, it costs only a constant factor of additional computation to augment a model with Petridish while concurrently training it.
We also note that since Petridish only grows models, noise in weak learning and model training can result in suboptimal shortcut selections. To mitigate this potential problem and to reduce the search variance, we utilize multiple parallel Petridish workers, each of which can warm-start from intermediate models of the others. We defer this implementation detail to the appendix.
5 Experiments
We report search results on CIFAR-10 (Krizhevsky, 2009) and transfer results on ImageNet (Russakovsky et al., 2015). Ablation studies for choosing the hyperparameters are deferred to appendix B, which also demonstrates the importance of blocking, via sf and sg, the influence of weak learners on the existing models during weak learning. We also search on Penn Treebank (Marcus et al., 1993), and find that it is not an informative dataset for evaluating NAS algorithms.
5.1 Search Results on CIFAR-10
Setup: Following (Zoph et al., 2018; Liu et al., 2019), we search on a shallow and slim network, which has a few normal cells at each of the three feature-map resolutions, one transition cell between each pair of adjacent resolutions, and a small initial filter count. We then scale up the found model, with more cells and filters, for a final training from scratch. During search, we use the last 5000 training images as a validation set. The starting seed model is a modified ResNet (He et al., 2016), where the output of a cell is the sum of the cell's input and the result of applying two 3x3 separable convs to that input. This is one of the simplest seeds in the search space popularized by (Zoph et al., 2018; Pham et al., 2018; Liu et al., 2019). The seed model is trained for 200 epochs, with a batch size of 32 and a learning rate that decays from 0.025 to 0 with cosine decay (Loshchilov & Hutter, 2017). We apply drop-path (Larsson et al., 2017) with probability 0.6 and the standard CIFAR-10 cutout (DeVries & Taylor, 2017). Weak learner selection and finalization are trained for 80 epochs each, using the same parameters. The final model is trained from scratch for 600 epochs on all training images with the same parameters.
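The cosine decay schedule mentioned above has a standard closed form (Loshchilov & Hutter, 2017), sketched here with the 0.025-to-0 range used in the setup:

```python
import math

def cosine_lr(epoch, total_epochs, lr_max=0.025, lr_min=0.0):
    # cosine annealing from lr_max down to lr_min over the training run
    return lr_min + 0.5 * (lr_max - lr_min) * (1 + math.cos(math.pi * epoch / total_epochs))
```

The schedule starts at lr_max, passes through the midpoint of the range halfway through training, and ends at lr_min.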
Table 1: CIFAR-10 test errors, model sizes, and search costs. Petridish error rates are reported as best (mean ± std) over five final training runs.

Method | # params (mil.) | Search (GPU-days) | Test Error (%)
Zoph & Le (2017) | 7.1 | 1680+ | 4.47
Zoph & Le (2017) + more filters | 37.4 | 1680+ | 3.65
Real et al. (2017) | 5.4 | 2500 | 5.4
ENAS macro (Pham et al., 2018) | 21.3 | 0.32 | 4.23
ENAS macro + more filters | 38 | 0.32 | 3.87
Lemonade I (Elsken et al., 2018a) | 8.9 | 56 | 3.37
Petridish initial model | 0.4 | – | 4.6
Petridish macro | 2.2 | 5 | 2.83 (2.85 ± 0.12)
NASNet-A (Zoph et al., 2018) | 3.3 | 1800 | 2.65
AmoebaNet-A (Real et al., 2018) | 3.2 | 3150 | 3.3
AmoebaNet-B (Real et al., 2018) | 2.8 | 3150 | 2.55
PNAS (Liu et al., 2017) | 3.2 | 225 | 3.41
ENAS cell (Pham et al., 2018) | 4.6 | 0.45 | 2.89
Lemonade II (Elsken et al., 2018a) | 3.98 | 56 | 3.50
DARTS (Liu et al., 2019) | 3.4 | 4 | 2.83
DARTS random (Liu et al., 2019) | 3.1 | – | 3.49
Luo et al. (2018) | 3.3 | 0.4 | 3.53
PARSEC (Casale et al., 2019) | 3.7 | 1 | 2.81
Petridish cell | 2.5 | 5 | 2.61 (2.87 ± 0.13)
Petridish cell + more filters (F=37) | 3.2 | 5 | 2.51 (2.75 ± 0.21)
Table 2: ImageNet transfer results in the mobile setting.

Method | # params (mil.) | # mult-adds (mil.) | Search (GPU-days) | Top-1 Test Error (%)
Inception-v1 (Szegedy et al., 2015) | 6.6 | 1448 | – | 30.2
MobileNetV2 (Sandler et al., 2018) | 6.9 | 585 | – | 28.0
NASNet-A (Zoph et al., 2018) | 5.3 | 564 | 1800 | 26.0
NASNet-B (Zoph et al., 2018) | 5.3 | 488 | 1800 | 27.2
AmoebaNet-A (Real et al., 2018) | 5.1 | 555 | 3150 | 25.5
Path-level (Cai et al., 2018) | – | 588 | 8.3 | 25.5
PNAS (Liu et al., 2017) | 5.1 | 588 | 225 | 25.8
DARTS (Liu et al., 2019) | 4.9 | 595 | 4 | 26.9
SNAS (Xie et al., 2019) | 4.3 | 522 | 1.6 | 27.3
ProxylessNAS (Han Cai, 2019) | 7.1 | 465 | 8.3 | 24.9
PARSEC (Casale et al., 2019) | 5.6 | – | 1 | 26.0
Petridish macro (N=6, F=44) | 4.3 | 511 | 5 | 28.5 (28.7 ± 0.15)
Petridish cell (N=6, F=44) | 4.8 | 598 | 5 | 26.0 (26.3 ± 0.20)
Search Results: Table 1 depicts the test errors, model parameters, and search computation of the proposed methods along with many state-of-the-art methods. We mainly compare against models of fewer than 3.5M parameters, since these models can be easily transferred to the ILSVRC (Russakovsky et al., 2015) mobile setting via a standard procedure (Zoph et al., 2018). The final training of Petridish models is repeated five times. Petridish cell-search finds a model with a 2.87 ± 0.13% error rate using 2.5M parameters, in 5 GPU-days on GTX 1080 GPUs. With the filter count increased to F=37, the model has a 2.75 ± 0.21% error rate with 3.2M parameters. This is one of the better models among those with fewer than 3.5M parameters, and in particular it is better than DARTS (Liu et al., 2019).
Petridish macro-search finds a model that achieves a 2.85 ± 0.12% error rate using 2.2M parameters with the same search computation. This is significantly better than previous macro-search results, and showcases that macro-search can find cost-effective architectures that were previously only found through cell-search. This is important, because the NAS literature has been moving away from macro-search, as early works (Zoph et al., 2018; Pham et al., 2018; Real et al., 2018) showed that cell-search results tend to be superior to those from macro-search. However, that earlier gap may be explained by the superior initial models of cell-search: the initial model of Petridish is one of the simplest models that any of the listed cell-search methods proposes and evaluates, and it already achieves a 4.6% error rate using only 0.4M parameters, a result already on par with or better than any other macro-search result.
We also run multiple instances of Petridish cell-search to study the variance in search results, and Table 5.1 reports the performance of the best model from each search run. We observe that the models from separate runs have similar performance. Averaging over the runs, the search takes 10.5 GPU-days and the best model takes 2.8M parameters to achieve a 2.88% mean error rate. The differences among runs may be caused by randomness in stochastic batches, variable initialization, image preprocessing, and drop-path.
Transfer to ImageNet: We focus on the mobile setting for the model transfer results on ILSVRC (Russakovsky et al., 2015), which limits the number of mult-adds per image to at most 600M. We transfer the final models on CIFAR-10 to ILSVRC by adding an initial 3x3 conv of stride 2, followed by two transition cells, to downsample the 224x224 input images to 28x28. In macro-search, where no transition cells are specifically learned, we again use the modified ResNet cells from the initial seed model as the replacement. After this initial downsampling, the architecture is the same as in the CIFAR-10 final models. Following (Liu et al., 2019), we train these models for 250 epochs with batch size 128, weight decay, and an initial SGD learning rate of 0.1 (decayed by a factor of 0.97 per epoch).
Table 2 depicts the performance of the transferred models. The Petridish cell-search model achieves a 26.3 ± 0.2% error rate using 4.8M parameters and 598M multiply-adds, which is on par with the state-of-the-art results listed in the second block of Table 2. By utilizing feature selection techniques to evaluate multiple model expansions at the same time, Petridish finds models faster by one to two orders of magnitude than early methods that train models independently, such as NASNet (Zoph et al., 2018), AmoebaNet (Real et al., 2018), and PNAS (Liu et al., 2017). In comparison to super-graph methods such as DARTS (Liu et al., 2019), Petridish cell-search takes similar search time to find a more accurate model.
The Petridish macro-search model achieves a 28.7 ± 0.15% error rate using 4.3M parameters and 511M multiply-adds, a result comparable to the human-designed models in the first block of Table 2. Though this is one of the first successful transfers of a macro-search result from CIFAR to ImageNet, the relative performance gap between cell-search and macro-search widens after the transfer. This may be because the default transition cell is not adequate for transfer to more complex datasets. As Petridish gradually expands existing models, we naturally obtain a gallery of models of various computational costs and accuracies. Figure 5.1 showcases the found models.
5.2 Search Results on Penn Treebank
Petridish, when used to grow the cell of a recurrent neural network on PTB, achieves best and mean test perplexities across search runs with different random seeds that are competitive with the best search result of (Li & Talwalkar, 2019), obtained via random search with weight sharing. In spite of the good performance, we do not put much significance on this particular language-modeling task with this dataset, because no NAS algorithm appears to perform better than random search on it (Li & Talwalkar, 2019), as detailed in appendix C.
6 Conclusion
We formulate NAS as a bi-level optimization problem, which generalizes feature selection for linear regression. We propose an efficient forward selection algorithm that applies gradient boosting to intermediate layers, generalizing the feature selection algorithm LARS (Efron et al., 2004). We also speed up weak learning via weight sharing, training the union of weak learners and selecting a subset of them via $\ell_1$ regularization. We demonstrate experimentally that forward model growth can find accurate models in a few GPU-days via both cell- and macro-search. Source code for Petridish is available at https://github.com/microsoft/petridishnn.
References
 Cai et al. (2018) Cai, Han, Yang, Jiacheng, Zhang, Weinan, Han, Song, and Yu, Yong. Pathlevel network transformation for efficient architecture search. In ICML, 2018.
 Casale et al. (2019) Casale, Francesco Paolo, Gordon, Jonathan, and Fusi, Nicolo. Probabilistic neural architecture search. In arxiv.org/abs/1902.05116, 2019.
 Colson et al. (2007) Colson, Benoît, Marcotte, Patrice, and Savard, Gilles. An overview of bilevel optimization. In Annals of operations research, 2007.
 Cortes et al. (2017) Cortes, Corinna, Gonzalvo, Xavier, Kuznetsov, Vitaly, Mohri, Mehryar, and Yang, Scott. Adanet: Adaptive structural learning of artificial neural networks. In ICML, 2017.
 Das & Kempe (2011) Das, A. and Kempe, D. Submodular meets spectral: Greedy algorithms for subset selection, sparse approximation and dictionary selection. In ICML, 2011.
 DeVries & Taylor (2017) DeVries, Terrance and Taylor, Graham. Improved regularization of convolutional neural networks with cutout. CoRR, abs/1708.04552, 2017.
 Efron et al. (2004) Efron, Bradley, Hastie, Trevor, Johnstone, Iain, and Tibshirani, Robert. Least angle regression. Annals of Statistics, 32:407–499, 2004.
 Elsken et al. (2018a) Elsken, Thomas, Metzen, Jan Hendrik, and Hutter, Frank. Efficient multiobjective neural architecture search via lamarckian evolution. 2018a.
 Elsken et al. (2018b) Elsken, Thomas, Metzen, Jan Hendrik, and Hutter, Frank. Neural architecture search: A survey. CoRR, abs/1808.05377, 2018b.
 Fahlman & Lebiere (1990) Fahlman, Scott E. and Lebiere, Christian. The cascadecorrelation learning architecture. In NIPS, 1990.
 Friedman (2002) Friedman, J.H. Stochastic gradient boosting. Computational Statistics and Data Analysis, 2002.
 Han Cai (2019) Han Cai, Ligeng Zhu, Song Han. Proxylessnas: Direct neural architecture search on target task and hardware. In ICLR, 2019.
 He et al. (2016) He, K., Zhang, X., Ren, S., and Sun, J. Deep residual learning for image recognition. In CVPR, 2016.
 Huang et al. (2018) Huang, Furong, Ash, Jordan, Langford, John, and Schapire, Robert. Learning deep resnet blocks sequentially using boosting theory. In ICML, 2018.
 Huang et al. (2017a) Huang, G., Liu, S., van der Maaten, L., and Weinberger, K. Condensenet: An efficient densenet using learned group convolutions. arXiv preprint arXiv:1711.09224, 2017a.
 Huang et al. (2017b) Huang, Gao, Liu, Zhuang, van der Maaten, Laurens, and Weinberger, Kilian Q. Densely connected convolutional networks. In CVPR, 2017b.
 Kandasamy et al. (2018) Kandasamy, Kirthevasan, Neiswanger, Willie, Schneider, Jeff, Poczos, Barnabas, and Xing, Eric. Neural architecture search with bayesian optimisation and optimal transport. In NIPS, 2018.
 Krizhevsky (2009) Krizhevsky, Alex. Learning multiple layers of features from tiny images. Technical report, 2009.
 Larsson et al. (2017) Larsson, Gustav, Maire, Michael, and Shakhnarovich, Gregory. Fractalnet: Ultradeep neural networks without residuals. In ICLR, 2017.
 Li & Talwalkar (2019) Li, Liam and Talwalkar, Ameet. Random search and reproducibility for neural architecture search. CoRR, abs/1902.07638, 2019. URL http://arxiv.org/abs/1902.07638.
 Liu et al. (2017) Liu, Chenxi, Zoph, Barret, Shlens, Jonathon, Hua, Wei, Li, Li-Jia, Fei-Fei, Li, Yuille, Alan L., Huang, Jonathan, and Murphy, Kevin. Progressive neural architecture search. CoRR, abs/1712.00559, 2017.
 Liu et al. (2018) Liu, Hanxiao, Simonyan, Karen, Vinyals, Oriol, Fernando, Chrisantha, and Kavukcuoglu, Koray. Hierarchical representations for efficient architecture search. In ICLR, 2018.
 Liu et al. (2019) Liu, Hanxiao, Simonyan, Karen, and Yang, Yiming. DARTS: Differentiable architecture search. In ICLR, 2019.
 Loshchilov & Hutter (2017) Loshchilov, Ilya and Hutter, Frank. SGDR: Stochastic gradient descent with warm restarts. In ICLR, 2017.
 Luo et al. (2018) Luo, Renqian, Tian, Fei, Qin, Tao, Chen, Enhong, and Liu, Tie-Yan. Neural architecture optimization. In NIPS, 2018.
 Marcus et al. (1993) Marcus, Mitchell, Santorini, Beatrice, and Marcinkiewicz, Mary Ann. Building a large annotated corpus of English: the Penn Treebank. Computational Linguistics, 1993.
 Negrinho & Gordon (2017) Negrinho, Renato and Gordon, Geoffrey J. DeepArchitect: Automatically designing and training deep architectures. CoRR, abs/1704.08792, 2017.
 Pati et al. (1993) Pati, Y., Rezaiifar, R., and Krishnaprasad, P. Orthogonal matching pursuit: recursive function approximation with applications to wavelet decomposition. In Asilomar Conference on Signals, Systems and Computers, 1993.
 Pham et al. (2018) Pham, Hieu, Guan, Melody Y., Zoph, Barret, Le, Quoc V., and Dean, Jeff. Efficient neural architecture search via parameter sharing. In ICML, 2018.
 Real et al. (2017) Real, Esteban, Moore, Sherry, Selle, Andrew, Saxena, Saurabh, Suematsu, Yutaka Leon, Tan, Jie, Le, Quoc, and Kurakin, Alex. Large-scale evolution of image classifiers. CoRR, abs/1703.01041, 2017.
 Real et al. (2018) Real, Esteban, Aggarwal, Alok, Huang, Yanping, and Le, Quoc V. Regularized evolution for image classifier architecture search. CoRR, abs/1802.01548, 2018.
 Russakovsky et al. (2015) Russakovsky, Olga, Deng, Jia, Su, Hao, Krause, Jonathan, Satheesh, Sanjeev, Ma, Sean, Huang, Zhiheng, Karpathy, Andrej, Khosla, Aditya, Bernstein, Michael, Berg, Alexander C., and Fei-Fei, Li. ImageNet Large Scale Visual Recognition Challenge. IJCV, 2015.
 Schaffer et al. (1990) Schaffer, J David, Caruana, Richard A, and Eshelman, Larry J. Using genetic search to exploit the emergent behavior of neural networks. Physica D: Nonlinear Phenomena, 42(1-3):244–248, 1990.
 Tibshirani (1994) Tibshirani, Robert. Regression shrinkage and selection via the Lasso. Journal of the Royal Statistical Society, Series B, 58:267–288, 1994.
 Wei et al. (2016) Wei, Tao, Wang, Changhu, Rui, Yong, and Chen, Chang Wen. Network morphism. In ICML, 2016.
 Williams (1992) Williams, Ronald J. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 1992.
 Xie et al. (2019) Xie, Sirui, Zheng, Hehui, Liu, Chunxiao, and Lin, Liang. SNAS: Stochastic neural architecture search. In ICLR, 2019.
 Yang et al. (2018) Yang, Zhilin, Dai, Zihang, Salakhutdinov, Ruslan, and Cohen, William W. Breaking the softmax bottleneck: A high-rank RNN language model. In ICLR, 2018.
 Ying et al. (2019) Ying, Chris, Klein, Aaron, Real, Esteban, Christiansen, Eric, Murphy, Kevin, and Hutter, Frank. NAS-Bench-101: Towards reproducible neural architecture search. arXiv:1902.09635, 2019.
 Zoph & Le (2017) Zoph, Barret and Le, Quoc V. Neural architecture search with reinforcement learning. In ICLR, 2017.
 Zoph et al. (2018) Zoph, Barret, Vasudevan, Vijay, Shlens, Jonathon, and Le, Quoc V. Learning transferable architectures for scalable image recognition. In CVPR, 2018.
Appendix A Additional Implementation Details
A.1 Parallel Workers
Since there are many sources of randomness in model training and weak learning, including SGD batches, drop-path, cutout, and variable initialization, Petridish can benefit from multiple runs. Furthermore, if one worker finds a cost-efficient model of medium size, other workers may want the option to warm-start from this checkpoint. Petridish workers warm-start from models on the lower convex hull of the scatter plot of model validation error versus model complexity, because any mixture of other models is either more complex or less accurate.
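The lower convex hull of the (complexity, validation error) scatter can be computed with a standard monotone-chain sweep. The following is a minimal sketch of that computation (the function and variable names are ours, not from the Petridish code):

```python
def lower_convex_hull(points):
    """Return the points on the lower convex hull of (complexity, error) pairs.

    A model survives only if it lies on the lower-left boundary of the
    scatter, i.e. no mixture of other models is simultaneously simpler
    and more accurate.
    """
    pts = sorted(points)  # sort by complexity, then by error

    def cross(o, a, b):
        # z-component of (a - o) x (b - o); <= 0 means the middle point a
        # lies on or above the segment o-b, so it is not on the lower hull
        return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

    hull = []
    for p in pts:
        while len(hull) >= 2 and cross(hull[-2], hull[-1], p) <= 0:
            hull.pop()
        hull.append(p)
    return hull
```

For example, with models at (1, 0.5), (2, 0.4), (3, 0.45), and (4, 0.2), only the first and last points remain: (2, 0.4) is collinear with them and (3, 0.45) lies above the segment, so neither is a useful warm-start frontier point.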
As there are multiple models on the convex hull, the workers must also choose one at each iteration. To do so, we loop over the models on the hull from the most accurate to the least, and choose model $i$ with probability $\frac{1}{n_i + 1}$, where $n_i$ is the number of times that model $i$ has already been chosen. This probability is chosen because if a model has been sampled $n$ times, then the next child is the best among the $n+1$ children with probability $\frac{1}{n+1}$. We favor the accurate models because it is typically more difficult to improve them. In practice, Petridish samples fewer than 100 models, so the performances of different sampling algorithms are often indistinguishable, and we settle on this simple scheme.
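One plausible implementation of this selection loop is sketched below. We assume (as the extracted text suggests but does not fully specify) that a model already chosen $n$ times is re-chosen with probability $1/(n+1)$; all names are illustrative:

```python
import random

def choose_hull_model(hull_models, sample_counts, rng=random):
    """Pick a warm-start model from the convex hull.

    hull_models: models on the error/complexity hull, ordered from most
                 accurate to least accurate.
    sample_counts: dict mapping model -> number of times already chosen.

    Sketch assumption: a model chosen n times before is re-chosen with
    probability 1/(n + 1), looping from the most accurate model down.
    """
    for m in hull_models:
        n = sample_counts.get(m, 0)
        if rng.random() < 1.0 / (n + 1):
            sample_counts[m] = n + 1
            return m
    # Fallback: every model was skipped; take the least accurate one.
    m = hull_models[-1]
    sample_counts[m] = sample_counts.get(m, 0) + 1
    return m
```

Note that an unsampled model ($n = 0$) is chosen with probability 1 when reached, so the first pass always warm-starts from the most accurate model, and repeated sampling gradually spreads attention down the hull.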
A.2 Select Models for Final Training
The search can be interrupted at any time, and the best models are those on the performance convex hull at the time of interruption. For evaluating Petridish on CIFAR-10 (Krizhevsky, 2009), we perform final training on models that are on the search-time convex hull and have near 60 million multi-adds on CIFAR-10 during search. We focus on these models because they can be translated to the ILSVRC mobile setting easily with a fixed procedure of scaling up the number of cell repetitions $N$ and the filter count $F$.
A.3 Computation Resources
Appendix B Ablation Studies
B.1 Evaluation Criteria
On CIFAR-10 (Krizhevsky, 2009), we often find the standard deviation of final-training and search results to be high in comparison to the differences among search algorithms. In contrast, the test error on ILSVRC is more stable, so one can more clearly differentiate the performance of models from different search algorithms. Hence, we use ILSVRC transfer results to compare search algorithms whenever they are available, and fall back to CIFAR-10 final-training results otherwise.
B.2 Search Space: Direct versus Proxy
This section provides an ablation study on a common theme of recent neural architecture search works, where the search is conducted on a proxy space of small and shallow models, with results transferred to larger models later. In particular, since Petridish grows models iteratively, it need not consider the complexity of a super graph containing all possible models. Thus, Petridish can be applied directly in the final-model setting on CIFAR-10. However, this implies that each model takes about eight times the computation, and it may introduce extra difficulty in convergence. Table 3 shows the transfer results of the two approaches to ILSVRC. We see that the popular proxy-search heuristic indeed leads to more accurate models.
Table 3: Direct versus proxy search, transferred to ILSVRC.

Method                          # params (mil.)   # multi-add (mil.)   Search (GPU-days)   top-1 test error (%)
Petridish cell direct (F=40)    4.4               583                  15.3                26.9
Petridish cell proxy (F=44)     4.8               598                  5                   26.3
B.3 Merge Operation: Weighted Sum versus Concatenation-Projection
Table 4: Weighted-sum (WS) and concatenation-projection (CP-end) variants versus default Petridish, transferred to ILSVRC.

Method                    # params (mil.)   # multi-add (mil.)   Search (GPU-days)   top-1 test error (%)
WS macro (F=48)           5.9               756                  29.5                32.5
CP-end macro (F=36)       5.4               680                  29.5                29.1
Petridish macro (F=32)    4.9               593                  27.2                29.4
WS cell (F=48)            3.3               477                  22.8                32.7
CP-end cell (F=44)        4.7               630                  22.8                27.2
Petridish cell (F=40)     4.4               583                  15.3                26.9
After selecting the shortcuts in Sec. 4, we concatenate them and project the result with a 1x1 convolution, so that the result can be added to the output layer. Here we empirically justify this design choice by considering two alternatives. We first consider applying the switch only to the final reported model: instead of using concatenation-projection as the merge operation during search, we switch all weak-learner weighted sums to concatenation-projections in the final model, which is trained from scratch to report results. We call this variant CP-end. The variant in which we never switch to concatenation-projection is called WS. Since concatenation-projection incurs additional computation, we increase the channel size of the WS variants so that the variants have similar test-time multiply-adds, for fair comparison. The default Petridish option is to switch the weak-learner weighted sums to concatenation-projections each time weak learners are finalized with Alg. 2. We compare WS, CP-end, and Petridish on ILSVRC transfer results in Table 4, and observe that Petridish achieves similar or better prediction error using less test-time computation and training-time search.
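The relationship between the two merge operations can be seen in a small numpy sketch: a weighted sum is the special case of concatenation-projection in which the 1x1 projection is a scaled identity on each input, so switching to concatenation-projection only adds capacity. (This is an illustrative sketch with made-up shapes, not the paper's code.)

```python
import numpy as np

rng = np.random.default_rng(0)
C, H, W = 4, 2, 2                       # channels, height, width
x1 = rng.normal(size=(C, H, W))         # two feature maps to merge
x2 = rng.normal(size=(C, H, W))

# Weighted sum: one scalar weight per input.
w1, w2 = 0.7, -0.3
ws = w1 * x1 + w2 * x2

# Concatenation-projection: concatenate along channels, then a 1x1 conv,
# i.e. a (C x 2C) linear map applied independently at each spatial position.
concat = np.concatenate([x1, x2], axis=0)                        # (2C, H, W)
proj = np.concatenate([w1 * np.eye(C), w2 * np.eye(C)], axis=1)  # (C, 2C)
cp = np.einsum('ck,khw->chw', proj, concat)                      # (C, H, W)

# With this particular block-identity projection, CP reproduces the weighted
# sum exactly; a general learned projection is strictly more expressive.
assert np.allclose(ws, cp)
```

This also makes the fairness adjustment in the ablation concrete: the extra cost of CP over WS is exactly the `C x 2C` projection applied at every position, which is why the WS variants are given more channels to match test-time multiply-adds.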
B.4 Is Weak Learning Necessary?
Table 5: Joint training of weak learners versus the default (Isolated) Petridish, transferred to ILSVRC.

Method                         # params (mil.)   # multi-add (mil.)   Search (GPU-days)   top-1 test error (%)
Petridish Joint cell (F=32)    4.0               546                  20.6                32.8
Petridish cell (F=40)          4.4               583                  15.3                26.9
An interesting consideration is whether to stop the influence of the weak learners on the models during weak learning. On the one hand, we eventually want to add the weak learners into the model and backpropagate through them together to improve model accuracy. On the other hand, introducing untrained weak learners into trained models may negatively affect training. Furthermore, the models may develop dependencies on weak-learner shortcuts that are not selected, which can also negatively affect future models. To study these effects through an ablation, we remove the stop-gradient layer and replace the stop-forward layer with a variable scalar multiplication that is initialized to zero in Algorithm 1. This is equivalent to adding the joint weak learner of Eq. 6 directly to the boosted layer after random initialization, and then training the existing model and the joint weak learner together with backpropagation. We call this variant Joint, and compare it against the default Petridish, which we call Isolated here. Table 5 shows the transfer results of Isolated and Joint on ILSVRC. We compare Petridish cell (F=40) with Joint cell (F=32), two models that have similar computational cost but very different accuracy, and observe that Isolated leads to much better models than Joint for cell-search. This suggests that randomly initialized joint weak learners should not be added directly to the existing model for backpropagation, and that the weak-learning step benefits the overall search.
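The roles of stop-gradient (sg) and stop-forward (sf) in this ablation can be summarized by their forward/backward behavior. The sketch below spells this out with hand-written passes (illustrative names and numbers, no autograd framework); schematically, a candidate is wired as output = layer + sf(weak_learner(sg(layer))):

```python
# sg(x): identity in the forward pass, zero gradient in the backward pass.
def sg_forward(x):
    return x

def sg_backward(upstream_grad):
    return 0.0

# sf(x): zero in the forward pass, identity gradient in the backward pass.
def sf_forward(x):
    return 0.0

def sf_backward(upstream_grad):
    return upstream_grad

# During weak learning, the model's forward output is unchanged (sf
# contributes 0), the weak learner still receives the loss gradient (sf
# passes it through), and the trained model below sg is not perturbed
# (sg blocks gradients flowing into it from the weak learner).
layer_out = 1.5
weak_out = 0.9                               # pretend weak-learner activation
combined = layer_out + sf_forward(weak_out)
assert combined == layer_out                 # forward pass unchanged

loss_grad = 2.0
assert sf_backward(loss_grad) == loss_grad   # weak learner sees the gradient
assert sg_backward(loss_grad) == 0.0         # existing model is untouched
```

The Joint ablation removes exactly these two barriers: dropping sg lets gradients flow into the existing model immediately, and replacing sf with a zero-initialized scalar lets the untrained weak learner's output enter the forward pass as training proceeds.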
B.5 Number of Merged Operations
Table 6: Effect of the number of merged operations on CIFAR-10 test error.

Number of Merged Operations   Average Lowest Error Rate (%)
2                             3.08
3                             2.88
4                             2.93
As we initialize all possible shortcuts during weak learning, we need to decide how many of them to select to form the weak learner. On the one hand, adding complex weak learners can boost performance rapidly. On the other hand, this may add suboptimal weak learners that hinder future growth. We test this choice during search: we run each choice five times, and average the error rates of the most accurate models that take under 60 million multi-adds on CIFAR-10 during search. Models in this range are chosen because, when transferred to ILSVRC, they can have around 600 million multi-adds, and hence are natural candidates for the ILSVRC mobile setting. Table 6 reports the test error rates on CIFAR-10; we see that selecting three operations yields the best results.
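Once the joint weak learner has been trained with L1 regularization, the selection step amounts to keeping the candidates with the largest learned weight magnitudes. A minimal sketch (the candidate names and weight values below are made up for illustration):

```python
def select_shortcuts(candidate_weights, k=3):
    """Keep the k candidate shortcuts with the largest |weight|.

    candidate_weights: dict mapping candidate name -> scalar weight learned
    by the L1-regularized joint weak learner.
    """
    ranked = sorted(candidate_weights,
                    key=lambda c: abs(candidate_weights[c]),
                    reverse=True)
    return ranked[:k]

# Hypothetical candidates: operation and source layer encoded in the name.
weights = {'sep3x3_from_L2': 0.41, 'identity_from_L1': -0.35,
           'maxpool_from_L4': 0.02, 'sep5x5_from_L3': 0.28,
           'dil3x3_from_L2': -0.01}
print(select_shortcuts(weights, k=3))
# -> ['sep3x3_from_L2', 'identity_from_L1', 'sep5x5_from_L3']
```

With k=3, the two near-zero candidates are discarded, matching the intuition that L1 training drives irrelevant shortcuts toward zero and the selection merely reads off the survivors.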
B.6 L1 Regularization Constant
Table 7: Effect of the L1 regularization constant on CIFAR-10 test error.

L1 Regularization Constant   Average Lowest Error Rate (%)
                             3.02
0.001                        2.88
                             3.13
We choose the L1 regularization constant of Eq. 7 to be 0.001; Table 7 reports the performance of the models found with each candidate value. A high constant means the regularization is weighted heavily, so the shortcut selection is more sparse. However, strong regularization also prevents weak learners from fitting their target loss gradient well. Since we mainly aim to select the most relevant shortcuts, rather than to enforce strict sparsity, we favor a small regularization constant.
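The trade-off between sparsity and fit can be checked with a small ISTA (proximal-gradient) Lasso solver. This is a generic illustration of the L1 effect on synthetic data, not the training loop used in the paper:

```python
import numpy as np

def ista_lasso(X, y, lam, steps=500):
    """Minimize 0.5*||Xw - y||^2 + lam*||w||_1 by proximal gradient (ISTA)."""
    lr = 1.0 / np.linalg.eigvalsh(X.T @ X).max()   # step from Lipschitz constant
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        w = w - lr * (X.T @ (X @ w - y))           # gradient step, squared loss
        # soft-thresholding: proximal operator of the L1 penalty
        w = np.sign(w) * np.maximum(np.abs(w) - lr * lam, 0.0)
    return w

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 8))
true_w = np.array([2.0, -1.5, 0.0, 0.0, 0.0, 0.0, 0.5, 0.0])
y = X @ true_w + 0.01 * rng.normal(size=50)

w_weak = ista_lasso(X, y, lam=0.1)     # weak regularization: more nonzeros
w_strong = ista_lasso(X, y, lam=50.0)  # strong regularization: sparser
assert (np.count_nonzero(np.abs(w_strong) > 1e-8)
        <= np.count_nonzero(np.abs(w_weak) > 1e-8))
```

As the section argues, very large constants zero out nearly all weights and so underfit the target gradient, while small constants still rank the relevant candidates highly without enforcing strict sparsity.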
We also note that Huang et al. (2017a) have previously applied group Lasso to select filters in a DenseNet (Huang et al., 2017b). They apply a regularization constant that gradually increases throughout training. Selecting weak learners with such a dynamically changing regularization during weak learning would be an interesting future improvement.
Appendix C Search Results on Penn Treebank (PTB)
Table 8: Results on PTB. For Petridish, we report the best test perplexity and the mean ± standard deviation across search trials.

Method                                          # params (M)   Search (GPU-days)   Test perplexity
DARTS (first order) (Liu et al., 2019)          23             1.5                 57.6
DARTS (second order) (Liu et al., 2019)         23             2                   55.7
DARTS (second order) (Liu et al., 2019)         23             2                   55.9
ENAS (Pham et al., 2018)                        24             0.5                 56.3
ENAS (Pham et al., 2018)                        24             0.5                 58.6
Random search baseline (Li & Talwalkar, 2019)   23             2                   59.4
Random search WS (Li & Talwalkar, 2019)         23             1.25                55.5
Petridish                                       23             1                   55.85 (56.39 ± 0.38)
PTB (Marcus et al., 1993) has become a standard dataset in the NAS community for benchmarking NAS algorithms for RNNs. We apply Petridish to search for the cell architecture of a recurrent neural network (RNN); note that for architecture search of RNNs, cell-search and macro-search are equivalent. To keep the results as comparable as possible to the most recent and well-performing work, we keep the search space the same as that used by DARTS (Liu et al., 2019), which is in turn also used by very recent work (Li & Talwalkar, 2019). There is a set of five primitives {sigmoid, relu, tanh, identity, none} among which one can choose to decide the connections between nodes in the cell. We modify the source code provided by Liu et al. (2019) to implement Petridish: we iteratively grow, starting from a cell that contains only a single relu node connected to the incoming hidden activation and the current input, until the cell contains the same total number of nodes as used in DARTS. At each stage of growth we train directly, with a fixed embedding size, number of epochs, batch size, and L1 weight, and select the candidate with the highest L1 weight value. We then add this candidate to the cell by removing the stop-gradient and stop-forward layers and replacing them with regular connections. The remaining hyperparameters are kept the same as those used by Liu et al. (2019). Table 8 shows a summary of the results.
The final genotype obtained from the search procedure is then trained from scratch to obtain the final test perplexity reported in Table 8. We repeat the search procedure with different random seeds and report the best and average test perplexity, along with the standard deviation across search trials. Petridish obtains results comparable to DARTS, ENAS, and Random Search WS.
Note that random search is essentially a state-of-the-art search algorithm on PTB. (As noted by Li & Talwalkar (2019), the current human-designed architecture by Yang et al. (2018) still beats the best NAS results, albeit using a mixture-of-experts layer that is not in the search space used by DARTS, ENAS, and Petridish, to keep results comparable.) We therefore caution the community against using PTB as a benchmark for comparing search algorithms for RNNs: the merits of any particular algorithm are difficult to compare, at least on this dataset and task pairing. More research along the lines of Ying et al. (2019) is needed on (1) whether the search space for RNNs specific to language modeling is particularly amenable to random search, and/or (2) whether it is the specific nature of RNNs themselves such that random search is competitive on any task that uses RNNs as the hypothesis space. We present the results on PTB for the sake of completeness, since it has become one of the default benchmarks, but we do not draw any particular signal from them either way, despite the competitive performance.