BANANAS: Bayesian Optimization with Neural Architectures for Neural Architecture Search


Abstract

Neural Architecture Search (NAS) has seen an explosion of research in the past few years, with techniques spanning reinforcement learning, evolutionary search, Gaussian process (GP) Bayesian optimization (BO), and gradient descent. While BO with GPs has seen great success in hyperparameter optimization, there are many challenges applying BO to NAS, such as the requirement of a distance function between neural networks. In this work, we develop a suite of techniques for high-performance BO applied to NAS that allows us to achieve state-of-the-art NAS results. We develop a BO procedure that leverages a novel architecture representation (which we term the path encoding) and a neural network-based predictive uncertainty model on this representation.

On popular search spaces, we can predict the validation accuracy of a new architecture to within one percent of its true value using only 200 training points. This may be of independent interest beyond NAS. We also show experimentally and theoretically that our method scales far better than existing techniques. We test our algorithm on the NASBench (Ying et al. 2019) and DARTS (Liu et al. 2018) search spaces and show that our algorithm outperforms a variety of NAS methods including regularized evolution, reinforcement learning, BOHB, and DARTS. Our method achieves state-of-the-art performance on the NASBench dataset and is over 100x more efficient than random search. We adhere to the recent NAS research checklist (Lindauer and Hutter 2019) to facilitate NAS research. In particular, our implementation is publicly available and includes all details needed to fully reproduce our results.


1 Introduction

Since the deep learning revolution in 2012, neural networks have grown increasingly specialized and complex Krizhevsky et al. (2012); Huang et al. (2017); Szegedy et al. (2017). Developing new state-of-the-art architectures often takes a vast amount of engineering and domain knowledge. A new area of research, neural architecture search (NAS), seeks to automate this process. Since the popular work of Zoph and Le (2017), there has been a flurry of research on NAS Liu et al. (2018a); Pham et al. (2018); Liu et al. (2018b); Kandasamy et al. (2018b); Elsken et al. (2018); Jin et al. (2018).

Many methods have been proposed for NAS, including random search, evolutionary search, reinforcement learning, Bayesian optimization (BO), and gradient descent. In certain settings, zeroth-order (non-differentiable) algorithms such as BO are of particular interest over first-order techniques, due to advantages such as simple parallelism, joint optimization with other hyperparameters, easy implementation, portability to diverse architecture spaces, and optimization of other/multiple non-differentiable objectives.

BO with Gaussian processes (GPs) has had success in deep learning hyperparameter optimization Golovin et al. (2017); Falkner et al. (2018), and is a leading method for efficient zeroth order optimization of expensive-to-evaluate functions in Euclidean spaces. However, applying BO to NAS comes with challenges that have so far limited its ability to achieve state-of-the-art results. For example, current approaches require specifying a distance function between architectures in order to define a surrogate GP model. This is often a cumbersome task involving tuning hyperparameters of the distance function Kandasamy et al. (2018b); Jin et al. (2018). Furthermore, it can be quite challenging to achieve highly accurate prediction performance with GPs, given the potentially high dimensional input architectures.

In this work, we develop a suite of techniques for high-performance BO applied to NAS that allows us to achieve state of the art NAS results. We develop a BO procedure that leverages a novel architecture representation (which we term a path encoding) and a neural network-based predictive uncertainty model (which we term a meta neural network) defined on this representation. In every iteration of BO, we use our model to estimate accuracies and uncertainty estimates for unseen neural architectures in the search space. This procedure avoids the aforementioned problems with BO in NAS: the model is powerful enough to provide high accuracy predictions of neural architecture accuracies, and there is no need to construct a distance function between architectures. Furthermore, our meta neural network scales far better than a GP model, as it avoids computationally intensive matrix inversions. We call our algorithm BANANAS: Bayesian optimization with neural architectures for NAS.

Training a meta neural network to predict with high accuracy is a challenging task. The majority of popular NAS algorithms are deployed over a directed acyclic graph (DAG) search space – the set of possible architectures is comprised of the set of all DAGs of a certain size, together with all possible combinations of operations on each node Zoph and Le (2017); Pham et al. (2018); Liu et al. (2018b). This poses a roadblock when predicting architecture accuracies, since graph structures can be difficult for neural networks when the adjacency matrix is given directly as input Zhou et al. (2018). By contrast, our path encoding scheme provides a representation for architectures that drastically improves the predictive accuracy of our meta neural network. Each binary feature represents a unique path through the DAG from the input layer to the output layer. With just 200 training points, our meta neural network is able to predict the accuracy of unseen neural networks to within one percent on the popular NASBench dataset Ying et al. (2019). This is an improvement over all previously reported results, and may be of interest beyond NAS. We also show experimentally and theoretically that our method scales to larger search spaces far better than the standard adjacency matrix encoding.

Figure 1.1: Illustration of the meta neural network in the BANANAS algorithm.

In BANANAS, we train an ensemble of neural networks to predict the mean and variance of validation error for candidate neural architectures, from which we compute an acquisition function. We define a new variant of the Thompson sampling acquisition function Thompson (1933), called independent Thompson sampling, and empirically show that it is well-suited for parallel Bayesian optimization. Finally, we use a mutation algorithm to optimize the acquisition function. We compare our NAS algorithm against a host of NAS algorithms including regularized evolution Real et al. (2019), REINFORCE Williams (1992), Bayesian optimization with a GP model Snoek et al. (2012), AlphaX Wang et al. (2018), ASHA Li and Talwalkar (2019), DARTS Liu et al. (2018b), TPE Bergstra et al. (2011), DNGO Snoek et al. (2015), and NASBOT Kandasamy et al. (2018b). On the NASBench dataset, our method achieves state-of-the-art performance and beats random search by a factor of over 100. On the search space from DARTS, when given a budget of 100 neural architecture queries for 50 epochs each, our algorithm achieves a best of 2.57% and average of 2.64% test error, which beats all NAS algorithms with which we could fairly compare (same search space used, and same hyperparameters for the final training). We also show that our algorithm outperforms other methods at optimizing functions of the model accuracy and the number of model parameters.

BANANAS has several moving parts, including the meta neural network, the path-based feature encoding of neural architectures, the acquisition function, and the acquisition optimization strategy. We run a thorough ablation study by removing each piece of the algorithm separately. We show all components are necessary to achieve the best performance. Finally, we check all items on the NAS research checklist Lindauer and Hutter (2019), due to recent claims that NAS research is in need of more fair and reproducible empirical evaluations Ying et al. (2019); Li and Talwalkar (2019); Lindauer and Hutter (2019). In particular, we experiment on well-known search spaces and NAS pipelines, we run enough trials to reach statistical significance, and our implementation, including all details needed to reproduce our results, is available at https://www.github.com/naszilla/bananas.

Our contributions. We summarize our main contributions.

  • We propose a novel path-based encoding for architectures. Using this featurization to predict the validation accuracy of architectures reduces the error by a factor of four, compared to the adjacency matrix encoding. We give theoretical and empirical results that the path encoding scales better than the adjacency matrix encoding.

  • We develop BANANAS, a BO-based NAS algorithm which uses a meta neural network predictive uncertainty model defined on this path encoding. Our algorithm outperforms other state-of-the-art NAS methods on two search spaces.

2 Related Work

NAS has been studied since at least the 1990s and has gained significant attention in the past few years Kitano (1990); Stanley and Miikkulainen (2002); Zoph and Le (2017). Some of the most popular recent techniques for NAS include evolutionary algorithms Shah et al. (2018); Maziarz et al. (2018), reinforcement learning Zoph and Le (2017); Pham et al. (2018); Liu et al. (2018a); Tan and Le (2019), Bayesian optimization (BO) Kandasamy et al. (2018b); Jin et al. (2018), and gradient descent Liu et al. (2018b). Recent papers have highlighted the need for fair and reproducible NAS comparisons Li and Talwalkar (2019); Ying et al. (2019); Lindauer and Hutter (2019). There are several works which predict the validation accuracy of neural networks Deng et al. (2017); Istrate et al. (2019); Zhang et al. (2018), or the curve of validation accuracy with respect to training time Klein et al. (2017); Domhan et al. (2015); Baker et al. (2017). A recent algorithm, AlphaX, uses a meta neural network to perform NAS Wang et al. (2018). The search is progressive, and each iteration makes a small change to the current neural network, rather than choosing a completely new neural network. A few recent papers use graph neural networks (GNNs) to encode neural architectures in NAS Shi et al. (2019); Zhang et al. (2019). Unlike the path encoding, these algorithms require re-training a GNN for each new dataset. For a survey of neural architecture search, see Elsken et al. (2018). There is also prior work on using neural network models in BO for hyperparameter optimization Snoek et al. (2015); Springenberg et al. (2016). The explicit goal of these papers is to improve the efficiency of Gaussian process-based BO from cubic to linear time, not to develop a different type of prediction model in order to improve the performance of BO with respect to the number of iterations.

Ensembling neural networks is a popular approach for obtaining uncertainty estimates; it has been shown in many settings to be more effective than alternatives such as Bayesian neural networks, even with an ensemble of only five networks Lakshminarayanan et al. (2017); Beluch et al. (2018); Choi et al. (2016); Snoek et al. (2019). We provide additional details and related work on NAS, BO, and architecture prediction in the appendix.

3 Preliminaries

In this section, we give a background on BO. In applications of BO for deep learning, the typical goal is to find a neural architecture and/or set of hyperparameters that lead to an optimal validation error. Formally, BO seeks to compute $a^* = \operatorname{argmin}_{a \in A} f(a)$, where $A$ is the search space and $f(a)$ denotes the validation error of architecture $a$ after training on a fixed dataset for a fixed number of epochs. In the standard BO setting, over a sequence of iterations, the results from all previous iterations are used to model the topology of $f$ using the posterior distribution of the model (often a GP). The next architecture is then chosen by optimizing an acquisition function such as expected improvement (EI) Močkus (1975), upper confidence bound (UCB) Srinivas et al. (2009), or Thompson sampling (TS) Thompson (1933). These functions balance exploration with exploitation during the iterative search. The chosen architecture is then trained and used to update the model of $f$. Evaluating $f$ in each iteration is the bottleneck of BO (since a neural network must be trained). To mitigate this, parallel BO methods typically output several architectures to train in each iteration instead of just one, so that the architectures can be trained in parallel.
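For illustration, the following is a minimal sketch of this parallel BO loop. The helper callables (train_and_eval, fit_surrogate, acquisition) are placeholders standing in for the components described above, not functions from any particular library.

```python
import random

def bayes_opt_loop(search_space, train_and_eval, fit_surrogate, acquisition,
                   num_init=10, num_iters=20, batch_size=10, num_candidates=100):
    """Generic parallel Bayesian optimization loop (illustrative sketch)."""
    # Initialize with randomly drawn architectures.
    history = [(a, train_and_eval(a)) for a in random.sample(search_space, num_init)]
    for _ in range(num_iters):
        model = fit_surrogate(history)                       # posterior model of f
        candidates = random.sample(search_space, num_candidates)
        # Rank candidates by acquisition value (lower is better for validation error).
        candidates.sort(key=lambda a: acquisition(model, a))
        batch = candidates[:batch_size]                      # architectures to train in parallel
        history += [(a, train_and_eval(a)) for a in batch]
    return min(history, key=lambda pair: pair[1])            # best (architecture, val error)
```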

4 Methodology

In this section, we discuss all of the components of our NAS algorithm. First, we describe our featurization of neural architectures. Next, we describe how to predict mean and uncertainty estimates using an ensemble of meta neural networks. Then, we describe our acquisition function and acquisition optimization strategy, and show how these pieces come together to form our final NAS algorithm. We finish by presenting theoretical results showing that our architecture encoding scales well.

Architecture featurization via the path encoding.

Prior work on encoding or featurizing neural networks has proposed adjacency matrix-based approaches. The adjacency matrix encoding assigns an arbitrary ordering to the nodes, and then uses a binary feature for the edge between node $i$ and node $j$, for all $i < j$. A list of the operations at each node must also be included in the encoding. This is a challenging data structure for a NAS algorithm to interpret, because it relies on an arbitrary indexing of the nodes, and the features are highly dependent on one another. For example, an edge from the input to node 2 is useless if there is no path from node 2 to the output. Also, even if there is an edge from node 2 to the output, it matters a great deal whether node 2 is a convolution or a pooling operation. Ying et al. (2019) tested another encoding that is similar to the adjacency matrix encoding, but with continuous features for each edge.

We introduce a novel encoding, which we term a path encoding, and we show that it substantially increases the performance of our predictive uncertainty model. Each feature of the path encoding corresponds to a directed path from the input to the output of an architecture cell (for example: input → conv_1x1 → conv_3x3 → pool_3x3 → output). To encode an architecture, we simply check which of the possible paths it contains (i.e., which paths are present in the cell), and write this as a binary vector. Therefore, we do not need to arrange the nodes in any particular order. Furthermore, the features are not nearly as dependent on one another as they are in the adjacency matrix encoding. The total number of paths is $\sum_{i=0}^{n-2} q^i$, where $n$ denotes the number of nodes in the cell and $q$ denotes the number of operations available for each node. See Figure 4.1. For example, the search space from NASBench Ying et al. (2019) has $\sum_{i=0}^{5} 3^i = 364$ possible paths, and the search space from DARTS Liu et al. (2018b) has many more. A downside of this encoding is that it scales exponentially in $n$, while the adjacency matrix encoding scales quadratically. At the end of this section, we give theoretical results showing that simply truncating the path encoding allows it to scale linearly with a negligible decrease in performance. We back this up with experimental results in the next section.
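As a concrete illustration, the sketch below computes the path encoding of a NASBench-style cell given as an upper-triangular adjacency matrix (node 0 is the input, the last node is the output) and a list of node operations. The operation names and helper names here are illustrative choices; any fixed enumeration order of the paths works, as long as it is shared across architectures.

```python
from itertools import product

OPS = ['conv1x1', 'conv3x3', 'maxpool3x3']   # assumed NASBench-style operation set

def enumerate_paths(num_nodes, ops=OPS):
    """All possible operation sequences along input->output paths, in a fixed order."""
    paths = []
    for length in range(num_nodes - 1):       # number of intermediate nodes on the path
        paths += list(product(ops, repeat=length))
    return paths

def path_encode(adj, node_ops, num_nodes, ops=OPS):
    """Binary vector: 1 if the cell contains the corresponding path, 0 otherwise."""
    present = set()

    def dfs(node, path):
        if node == num_nodes - 1:             # reached the output node
            present.add(tuple(path))
            return
        for nxt in range(node + 1, num_nodes):
            if adj[node][nxt]:
                label = [] if nxt == num_nodes - 1 else [node_ops[nxt]]
                dfs(nxt, path + label)

    dfs(0, [])
    return [1 if p in present else 0 for p in enumerate_paths(num_nodes, ops)]
```

For a 7-node cell with 3 operations, enumerate_paths produces the 364 features discussed above.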

Figure 4.1: Example of our path encoding.

Acquisition function and optimization.

Many acquisition functions used in BO can be approximately computed using a mean and an uncertainty estimate for each input datapoint. We train an ensemble of five meta neural networks in order to predict mean and uncertainty estimates for new neural architectures. Concretely, we predict the validation error, as well as confidence intervals around the predicted validation error, of neural architectures that we have not yet observed. Suppose we have an ensemble of $M$ neural networks, $\{f_m\}_{m=1}^{M}$, where $f_m : A \rightarrow \mathbb{R}$ for all $m$. We optimize the following acquisition function, which we call independent Thompson sampling (ITS), which at each time step $t$ is defined to be

$$\phi_{\mathrm{ITS}}(a) = \tilde{f}(a), \quad \text{where } \tilde{f}(a) \sim P\left(f(a) \mid \mathcal{D}_t\right). \tag{4.1}$$

Here, $\mathcal{D}_t$ is the dataset of architecture evaluations at time $t$, and $P(f \mid \mathcal{D}_t)$ is our posterior belief about the meta neural network at time $t$ given $\mathcal{D}_t$. Equation 4.1 can be viewed as defining the acquisition function to be equal to the output of a sample from the posterior distribution of our model. This is similar to classic Thompson sampling (TS) Thompson (1933), which has advantages when running parallel experiments in batch BO Kandasamy et al. (2018a). However, in contrast with TS, the ITS acquisition function returns a unique posterior function sample for each input architecture $a$. We choose not to use classic TS here for a few reasons: based on our ensemble model for predictive uncertainty, it is unclear how to draw exact posterior samples of functions that can be evaluated on multiple architectures, while we can develop procedures to sample from the posterior conditioned on a single input architecture $a$. One potential strategy for carrying out classic TS is to use elements of our ensemble as approximate posterior samples, but this has not been shown to perform well in practice. In the next section, we show empirically that ITS performs better than this strategy for TS as well as other acquisition functions.

In practice, in order to compute $P(f(a) \mid \mathcal{D}_t)$ given our ensemble of meta neural networks $\{f_m\}_{m=1}^{M}$, we assume that the posterior conditioned on a given input architecture $a$ follows a Gaussian distribution. In particular, we assume $f(a) \mid \mathcal{D}_t \sim \mathcal{N}\left(\hat{\mu}(a), \hat{\sigma}^2(a)\right)$, with parameters $\hat{\mu}(a) = \frac{1}{M}\sum_{m=1}^{M} f_m(a)$ and $\hat{\sigma}(a) = \sqrt{\frac{1}{M-1}\sum_{m=1}^{M}\left(f_m(a) - \hat{\mu}(a)\right)^2}$.
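A minimal sketch of these estimates in code, assuming predictions holds the $M$ ensemble outputs $f_1(a), \dots, f_M(a)$ for a single architecture:

```python
import numpy as np

def posterior_params(predictions):
    """Gaussian approximation from the M ensemble predictions at one architecture."""
    mu = np.mean(predictions)
    sigma = np.std(predictions, ddof=1)    # sample standard deviation over the ensemble
    return mu, sigma

def its_acquisition(predictions, rng=np.random.default_rng()):
    """Independent Thompson sampling: a separate posterior draw for each input."""
    mu, sigma = posterior_params(predictions)
    return rng.normal(mu, sigma)
```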

In each iteration of BO, our goal is to find the neural network from the search space which minimizes the acquisition function. Evaluating the acquisition function for every neural network in the search space is computationally infeasible. Instead, we optimize the acquisition function via a mutation procedure, in which we randomly mutate the best architectures that we have trained so far and then select the architecture from this set which minimizes the acquisition function. A neural architecture is mutated by either adding an edge, removing an edge, or changing one of the operations with some probability.
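A sketch of such a mutation on a NASBench-style cell (adjacency matrix plus node operations) is shown below; the 50/50 split between edge and operation mutations is an illustrative choice, and a practical implementation would also reject invalid cells (for example, cells with no input-to-output path or with too many edges).

```python
import copy
import random

def mutate_cell(adj, node_ops, ops, num_nodes):
    """Return a copy of the cell with one small random change (illustrative sketch)."""
    adj, node_ops = copy.deepcopy(adj), list(node_ops)
    if random.random() < 0.5:
        # Flip a random edge: add it if absent, remove it if present.
        i = random.randrange(num_nodes - 1)
        j = random.randrange(i + 1, num_nodes)
        adj[i][j] = 1 - adj[i][j]
    else:
        # Change the operation on a random intermediate node.
        node = random.randrange(1, num_nodes - 1)
        node_ops[node] = random.choice([op for op in ops if op != node_ops[node]])
    return adj, node_ops
```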

BANANAS: Bayesian optimization with neural architectures for NAS.

Now we present our full NAS algorithm. At the start, we draw $t_0$ architectures uniformly at random from the search space and train them. Then we begin an iterative process, where in iteration $t$, we train an ensemble of $M$ meta neural networks on the architectures evaluated so far, $\{(a_i, f(a_i))\}_{i=1}^{t}$. Each meta neural network is a feedforward network with fully-connected layers, and each is given a different random initialization of the weights and a different random ordering of the training set. We use a slight variant of mean absolute percentage error (MAPE),

$$\mathcal{L}\left(\{\hat{y}_i\}, \{y_i\}\right) = \frac{1}{n}\sum_{i=1}^{n} \frac{\left|\hat{y}_i - y_i\right|}{y_i - y_{\mathrm{LB}}},$$

where $\hat{y}_i$ and $y_i$ are the predicted and true values of the validation error for architecture $a_i$, and $y_{\mathrm{LB}}$ is a global lower bound on the minimum true validation error. This loss function gives a higher weight to losses on architectures with smaller values of $y_i$.
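In code, this loss can be sketched as follows (the value of $y_{\mathrm{LB}}$ is a dataset-dependent constant supplied by the user):

```python
import numpy as np

def mape_variant(y_pred, y_true, y_lb):
    """MAPE relative to a global lower bound y_lb on the validation error.

    Architectures with small true validation error y_true receive higher weight,
    since the denominator (y_true - y_lb) is small for them.
    """
    y_pred, y_true = np.asarray(y_pred, dtype=float), np.asarray(y_true, dtype=float)
    return float(np.mean(np.abs(y_pred - y_true) / (y_true - y_lb)))
```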

We create a candidate set of architectures by mutating the best architectures we have seen so far, and we choose to train the candidate architecture which minimizes the ITS acquisition function. See Algorithm 1. To parallelize Algorithm 1, in step 4 of the loop we simply choose the candidate architectures with the smallest values of the acquisition function and evaluate them in parallel.

  Input: Search space $A$; dataset $D$; parameters $t_0$, $T$, $M$, $c$; acquisition function $\phi$; function $f$ returning the validation error of an architecture after training.
  1. Draw $t_0$ architectures $a_1, \dots, a_{t_0}$ uniformly at random from $A$ and train them on $D$.
  2. For $t$ from $t_0$ to $T$,
    1. Train an ensemble of $M$ meta neural networks on $\{(a_i, f(a_i))\}_{i=1}^{t}$.
    2. Generate a set of $c$ candidate architectures from $A$ by randomly mutating the architectures from $\{a_i\}_{i=1}^{t}$ that have the lowest values of $f$.
    3. For each candidate architecture $a$, evaluate the acquisition function $\phi(a)$.
    4. Denote by $a_{t+1}$ the candidate architecture with minimum $\phi(a)$, and evaluate $f(a_{t+1})$.
  Output: $a^* = \operatorname{argmin}_{i=1, \dots, T} f(a_i)$.
Algorithm 1 BANANAS
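For reference, the following compact sketch mirrors Algorithm 1. The callables f (train an architecture and return its validation error), train_ensemble, mutate, and acquisition are supplied by the caller, and the default parameter values are placeholders rather than the settings used in our experiments.

```python
import random

def bananas(search_space, f, train_ensemble, mutate, acquisition,
            t0=10, T=150, num_candidates=100, num_best=10):
    """Sketch of Algorithm 1 (BANANAS)."""
    data = [(a, f(a)) for a in random.sample(search_space, t0)]           # step 1
    for _ in range(t0, T):
        ensemble = train_ensemble(data)                                   # step 2.1
        best = [a for a, _ in sorted(data, key=lambda x: x[1])[:num_best]]
        candidates = [mutate(random.choice(best))                         # step 2.2
                      for _ in range(num_candidates)]
        a_next = min(candidates, key=lambda a: acquisition(ensemble, a))  # steps 2.3-2.4
        data.append((a_next, f(a_next)))
    return min(data, key=lambda x: x[1])[0]                               # output a*
```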

Truncating the path encoding.

In this section, we give theoretical results for a truncated version of the path encoding. One of the downsides of the path encoding is that it scales exponentially in . However, the vast majority of paths rarely appear in any neural architecture during a full run of a NAS algorithm. This is because many NAS algorithms sample architectures from a random procedure (and/or mutate these samples). We show that the vast majority of paths have a very low probability of occurring in a cell returned from random_spec(), a popular random procedure (as in Ying et al. (2019) and used by BANANAS). Our results show that by simply truncating the least-likely paths, our encoding scales linearly in the size of the cell, with an arbitrarily small amount of information loss. We back this up with experimental evidence in the next section. For the full proofs, see Appendix C. We start by defining a random graph model corresponding to random_spec().

Definition 4 (random graph model). Given nonzero integers $n$, $k$, and $q$, a random graph $G$ is generated as follows:
(1) Denote the nodes by 1 to $n$.
(2) Label each node randomly with one of $q$ operations.
(3) For all $i < j$, add edge $(i, j)$ with probability $\frac{2k}{n(n-1)}$.
(4) If there is no path from node 1 to node $n$, go to (1).

The probability value in step (3) is chosen so that the expected number of edges after this step is exactly $k$. In this section, we use 'path' to mean a path from node 1 to node $n$. We prove the following theorem.
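A sketch of this random graph model in code (this illustrates the definition above, not the NASBench random_spec() implementation itself):

```python
import random

def random_graph(n, k, q):
    """Sample a cell from the random graph model: n nodes, q operations, ~k edges."""
    p = 2 * k / (n * (n - 1))                                  # edge probability k / C(n, 2)
    while True:
        ops = [random.randrange(q) for _ in range(n)]          # step (2): label the nodes
        adj = [[1 if (i < j and random.random() < p) else 0    # step (3): add edges
                for j in range(n)] for i in range(n)]
        if has_path(adj, 0, n - 1):                            # step (4): reject if no path
            return adj, ops

def has_path(adj, src, dst):
    """Reachability in the DAG, scanning nodes in (topological) index order."""
    reach = [False] * len(adj)
    reach[src] = True
    for i in range(len(adj)):
        if reach[i]:
            for j in range(i + 1, len(adj)):
                if adj[i][j]:
                    reach[j] = True
    return reach[dst]
```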

Theorem 4.1 (informal).

Given integers $k, q \geq 1$ and a constant $\epsilon > 0$, there exists an $N$ such that for all $n > N$, there exists a set $S$ of $O(n)$ paths such that the probability that $G$ contains a path not in $S$ is less than $\epsilon$.

This theorem says that when $k$ and $q$ are fixed, and $n$ is large enough compared to $k$ and $q$, we can truncate the path encoding to a set of size $O(n)$, because the probability that random_spec() returns a graph with a path outside of $S$ is very small.

We choose $S$ to be the set of paths with length below a fixed threshold. Our argument relies on a simple concept: the probability of a long path from node 1 to node $n$ appearing in $G$ is much lower than the probability of a short path. For example, the probability that $G$ contains a path of length $n-1$ is on the order of $\left(\frac{2k}{n^2}\right)^{n-1}$, because $n-1$ edges must be chosen, each with probability roughly $\frac{2k}{n^2}$.

Proof sketch of Theorem 4.1.

We set $S$ to be the set of paths of length less than a fixed threshold $\ell$.

In Definition 4, the probability that step (3) returns a graph with a path from node 1 to node $n$ is at least $\frac{k}{n^2}$ for large enough $n$, because the probability of a path of length 1 is the probability of having edge $(1, n)$, which is $\frac{2k}{n(n-1)}$.

In general, denote by $P_i$ the expected number of paths of length $i$ after step (3). Then

$$P_i = \binom{n-2}{i-1}\left(\frac{2k}{n(n-1)}\right)^{i}.$$

Using the well-known bounds on binomial coefficients (e.g. Stanica (2001)), we can show that $\sum_{i \geq \ell} P_i$ is small compared to the probability that step (3) produces a graph with a path from node 1 to node $n$. Then the probability that $G$ contains a path outside of $S$ is at most $\sum_{i \geq \ell} P_i$ divided by that probability, which is less than $\epsilon$ for large enough $n$.

5 Experiments

In this section, we discuss our experimental setup and results. We give experimental results on the performance of the meta neural network itself, as well as the full NAS algorithm compared to several NAS algorithms.

We check every box in the NAS research checklist Lindauer and Hutter (2019). In particular, our code is publicly available, we used a tabular NAS dataset, and we ran many trials of each algorithm. In Appendix E, we give the full details of our answers to the NAS research checklist.

Search space from NASBench.

The NASBench dataset is a tabular dataset designed to facilitate NAS research and fair comparisons between NAS algorithms Ying et al. (2019). It consists of over 423,000 unique neural architectures from a cell-based search space, and each architecture comes with precomputed training, validation, and test accuracies for 108 epochs on CIFAR-10.

The search space consists of a cell with 7 nodes. The first node is the input, and the last node is the output. The remaining five nodes can be either a $1\times 1$ convolution, a $3\times 3$ convolution, or $3\times 3$ max pooling. The cell can take on any DAG structure from the input to the output with at most 9 edges. The NASBench search space was chosen to contain ResNet-like and Inception-like cells He et al. (2016); Szegedy et al. (2016). The hyper-architecture consists of nine cells stacked sequentially, with each set of three cells separated by downsampling layers. The first layer before the first cell is a convolutional layer, and the hyper-architecture ends with a global average pooling layer and a dense layer.

Search space from DARTS.

One of the most popular convolutional cell-based search spaces is the one from DARTS Liu et al. (2018b), used for CIFAR-10. The search space consists of two cells with 6 nodes each: a convolutional cell and a reduction cell, and the hyper-architecture stacks the convolutional and reduction cells. For each cell, the first two nodes are the inputs from the previous two cells in the hyper-architecture. The next four nodes each contain exactly two edges as input, such that the cell forms a connected DAG. Each edge can take one of seven operations: $3\times 3$ and $5\times 5$ separable convolutions, $3\times 3$ and $5\times 5$ dilated separable convolutions, $3\times 3$ max pooling, $3\times 3$ average pooling, and identity; an edge may also take the zero operation (this is in contrast to NASBench, where the operations are placed on the nodes).

Figure 5.1: Predictive uncertainty estimates for architecture validation error under our ensemble model. We train this model on 200 architectures drawn from random_spec(), and test the adjacency matrix encoding and path encoding on a held-out set of architectures as well as on a subset of the training set.

5.1 NASBench Experiments

First we evaluate the performance of the meta neural network on the NASBench dataset. The meta neural network consists of a sequential fully-connected neural network. The number of layers is set to 10, and each layer has width 20. We use the Adam optimizer. For this set of experiments, we use mean absolute error (MAE) as the loss function, as we are evaluating the performance of the meta neural network without running the full BANANAS algorithm. We test the standard adjacency matrix encoding as well as the path encoding discussed in Section 4. We draw architectures using the random_spec() method described in Section 4. The path encoding outperforms the adjacency matrix encoding by up to a factor of 4 with respect to mean absolute error. See Figure 5.1. In order to compare our meta neural network to similar work, we trained on 1100 NASBench architectures and computed the correlation between the predicted and true validation accuracies on a test set of size 1000. See Table 1. In Appendix D, we give a more detailed experimental study of the performance of our meta neural network.
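The sketch below shows a meta neural network of this shape in PyTorch. The activation function and the learning rate shown here are placeholders rather than the exact settings from our experiments; the input dimension corresponds to the encoding used (e.g., 364 for the full path encoding of the NASBench cell space).

```python
import torch
import torch.nn as nn

def make_meta_nn(input_dim, num_layers=10, width=20):
    """Fully-connected meta neural network: num_layers hidden layers of the given width."""
    layers, d = [], input_dim
    for _ in range(num_layers):
        layers += [nn.Linear(d, width), nn.ReLU()]
        d = width
    layers.append(nn.Linear(d, 1))                # predicts the validation error
    return nn.Sequential(*layers)

model = make_meta_nn(input_dim=364)               # full path encoding of a NASBench cell
optimizer = torch.optim.Adam(model.parameters(), lr=1e-2)   # lr is a placeholder value
loss_fn = nn.L1Loss()                             # mean absolute error
```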

Architecture                           Source                    Correlation   No. of parameters
Multilayer Perceptron                  Wang et al. (2018)                      6,326,000
Long Short-Term Memory (LSTM)          Wang et al. (2018)                      92,000
Graph Convolutional Network            Shi et al. (2019)                       14,000
Meta NN with Adjacency Encoding        Ours                                    4,400
Meta NN with Truncated Path Encoding   Ours                                    4,700
Meta NN with Full Path Encoding        Ours                      0.699         11,000
Table 1: Performance of neural architecture predictors trained on NASBench. The truncated path encoding is truncated to a length of .

Next, we evaluate the performance of BANANAS. We use the same meta neural network parameters as in the previous paragraph, but for the loss function we use MAPE as defined in Section 4. We baseline against random search (which multiple papers have concluded is a competitive baseline for NAS  Li and Talwalkar (2019); Sciuto et al. (2019)). We compare to several state-of-the-art methods, representing the most common zeroth-order paradigms. Regularized Evolution is a popular evolutionary algorithm for NAS Real et al. (2019). We compared to two reinforcement learning (RL) algorithms: REINFORCE Williams (1992), which prior work showed to be more effective than other RL algorithms Ying et al. (2019), and AlphaX, which uses a neural network to select the best action in each round, such as making a small change to, or growing, the current architecture Wang et al. (2018). We compared to several Bayesian approaches: vanilla Bayesian optimization with a Gaussian process prior; Bayesian optimization Hyperband (BOHB) Falkner et al. (2018), which combines multi-fidelity Bayesian optimization with principled early-stopping; Deep Networks for Global Optimization (DNGO) Snoek et al. (2015), which is an implementation of Bayesian optimization using adaptive basis regression using neural networks instead of Gaussian processes to avoid the cubic scaling; and neural architecture search with Bayesian optimization and optimal transport (NASBOT) Kandasamy et al. (2018b), which uses an optimal transport-based distance function in BO. We also compared to Tree-structured Parzen estimator (TPE) Bergstra et al. (2011), which is a hyperparameter optimization algorithm based on adaptive Parzen windows. See Appendix D for more details on the implementation of these algorithms.

Figure 5.2: Performance of BANANAS on NASBench compared to other algorithms (left). Ablation study (middle). NAS dual-objective experiments minimizing a function of validation loss and number of model parameters (right).
NAS Algorithm      Source                     Test error (avg)   Test error (best)   Queries   Runtime   Method
SNAS               Xie et al. (2018)                                                                     Gradient-based
ENAS               Pham et al. (2018)                                                                    RL
Random search      Liu et al. (2018b)                                                                    Random
DARTS              Liu et al. (2018b)                                                                    Gradient-based
ASHA               Li and Talwalkar (2019)                                           700                 Successive halving
Random search WS   Li and Talwalkar (2019)                                           1000                Random
DARTS              Ours                                          2.57                                    Gradient-based
ASHA               Ours                                                              700                 Successive halving
BANANAS            Ours                       2.64               2.57                100                 Neural BayesOpt
Table 2: Comparison of the mean test error of the best architectures returned by three NAS algorithms. The runtime is in total GPU-days on a Tesla V100. Note that ASHA queries use varying numbers of epochs.

Experimental setup and results.

Each NAS algorithm is given a budget of 47 TPU hours, or about 150 queries. That is, each algorithm can train and output the validation error of at most 150 architectures. We chose this number of queries as it is a realistic setting in practice (on other search spaces such as DARTS, 150 queries takes 15 GPU days), though we also ran experiments to 156.7 TPU hours or 500 queries in Appendix D, which resulted in similar trends. Every 10 iterations, each algorithm returns the architecture with the best validation error so far. After all NAS algorithms have completed, we output the test error for each returned architecture. We ran 200 trials for each algorithm. See Figure 5.2. BANANAS significantly outperforms all other baselines, and to the best of our knowledge, BANANAS achieves state-of-the-art error on NASBench in the 100-150 queries setting. The standard deviations among 200 trials for all NAS algorithms were between 0.16 and 0.22. BANANAS had the lowest standard deviation, and regularized evolution had the highest.

Ablation study.

Our NAS algorithm has several moving parts, including the meta neural network model, the path-based feature encoding of neural architectures, the acquisition function, and the acquisition optimization strategy. We run a thorough ablation study by removing each piece of the algorithm separately. In particular, we compare against the following algorithms: (1) BANANAS in which the acquisition function is optimized by drawing 1000 random architectures instead of a mutation algorithm; BANANAS in which the featurization of each architecture is not the path-based encoding, but instead (2) the adjacency matrix encoding, or (3) the continuous adjacency matrix encoding from Ying et al. (2019); BANANAS with a GP model instead of a neural network model, where the distance function in the GP is computed as the Hamming distance between (4) the path encoding, or (5) adjacency matrix encoding. We found that BANANAS distinctly outperformed all variants, with the meta neural network having the greatest effect on performance. See Figure 5.2 (middle). In Appendix D, we show a separate study testing five different acquisition functions.

Dual-objective experiments.

One of the benefits of a zeroth order NAS algorithm is that it allows for optimization with respect to additional non-differentiable objectives. For example, we define and optimize a dual-objective function of both validation loss and number of model parameters. Specifically, we use

(5.1)

which is similar to prior work Cai et al. (2019); Tan et al. (2019), where 4.8 is a lower bound on the minimum validation loss. We run 200 trials of random search, Regularized Evolution, and BANANAS on this objective. We plot the average test loss and the number of parameters of the returned models after 10 to 150 queries. See Figure 5.2 (right). The gray lines are contour lines of Equation 5.1.

In Appendix D, we present experiments for BANANAS in other settings: 500 queries instead of 150 queries, and using random validation error instead of mean validation error. We show the trends are largely the same and BANANAS performs the best. We also study the effect of the length of the path encoding on the performance of both the meta neural network, and BANANAS as a whole. That is, we truncate the path encoding vector as in Section 4. Surprisingly, the length of the path encoding can be reduced by an order of magnitude with a negligible decrease in performance.

5.2 DARTS search space experiments.

We test the BANANAS algorithm on the search space from DARTS. We give BANANAS a budget of 100 queries. In each query, a neural network is trained for 50 epochs and the average validation error of the last 5 epochs is recorded. As in the NASBench experiments, we parallelize the algorithm by choosing 10 neural architectures to train in each iteration of Bayesian optimization using ITS. We use the same meta neural net architecture as in the NASBench experiment, but we change the learning rate and the number of training epochs. In Appendix D, we show the best architecture found by BANANAS. The algorithm requires 11.8 GPU days of computation to run.

To ensure a fair comparison by controlling all hyperparameter settings and hardware, we re-trained the architectures from papers which used the DARTS search space and reported the final architecture. We report the mean test error over five random seeds of the best architectures found by BANANAS, DARTS, and ASHA Li and Talwalkar (2019). We trained each architecture using the default hyperparameters from Li and Talwalkar (2019).

The BANANAS architecture achieved an average of 2.64% error, which is state-of-the-art for this search space and these final training parameter settings. See Table 2. We also compare to other NAS algorithms using the DARTS search space; however, the comparison is not perfect due to differences in the pytorch version of the final training code (which appears to change the final percent error by about 0.1%). We cannot fairly compare our method to recent NAS algorithms which use a larger search space than DARTS, or which train the final architecture for significantly more than 600 epochs Laube and Zell (2019); Liang et al. (2019); Zhou et al. (2019). Our algorithm significantly beats ASHA, a multi-fidelity zeroth order NAS algorithm, and is on par with DARTS, a first-order method. We emphasize that since BANANAS is a zeroth order method, it allows for easy parallelism, joint optimization with other hyperparameters, easy implementation, and optimization with respect to other non-differentiable objectives.

6 Conclusion and Future Work

In this work, we propose a new method for NAS. Using novel techniques such as an architecture path encoding and meta neural network for predictive uncertainty estimates, our Bayesian optimization-based NAS algorithm achieves state-of-the-art results. We present results on the NASBench and DARTS search spaces. Our path encoding may be of independent interest, as it substantially improves the predictive power of our meta neural network, and we give theoretical and empirical results showing that it scales better than existing encoding techniques. An interesting follow-up idea is to develop a multi-fidelity version of BANANAS. For example, incorporating a successive-halving approach to BANANAS could result in a significant decrease in the runtime without substantially sacrificing accuracy.

Acknowledgments

We thank Naveen Sundar Govindarajulu, Liam Li, Jeff Schneider, Sam Nolen, and Mark Rogers for their help with this project.

Appendix A Related Work Continued

Neural architecture search.

Neural architecture search has been studied since at least the 1990s Floreano et al. (2008); Kitano (1990); Stanley and Miikkulainen (2002), but the field was revitalized in 2017 when the work of Zoph and Le (2017) gained significant attention. Some of the most popular techniques for NAS include evolutionary algorithms Shah et al. (2018); Maziarz et al. (2018), reinforcement learning Zoph and Le (2017); Pham et al. (2018); Liu et al. (2018a); Tan and Le (2019); Wang et al. (2019), Bayesian optimization Kandasamy et al. (2018b); Jin et al. (2018); Zhou et al. (2019), and gradient descent Liu et al. (2018b); Liang et al. (2019); Laube and Zell (2019). See Elsken et al. (2018) for a survey on NAS.

Recent papers have called for fair and reproducible experiments in the future Li and Talwalkar (2019); Ying et al. (2019). In this vein, the NASBench dataset was created, which contains over 400k neural architectures with precomputed training, validation, and test accuracy Ying et al. (2019). A recent algorithm, AlphaX, uses a meta neural network to perform NAS Wang et al. (2018), where the meta neural network is trained to make small changes to the current architecture, given adjacency matrix featurizations of neural architectures. The search is progressive, and each iteration makes a small change to the current neural network, rather than choosing a completely new neural network.

A few recent papers use graph neural networks (GNNs) to encode neural architectures in NAS Shi et al. (2019); Zhang et al. (2019). However, GNNs require re-training for each new dataset/search space, unlike the path encoding which is a fixed encoding for every search space. Shi et al. (2019) has experiments on NASBench, however, they only report the number of queries needed to find the optimal architecture (which is roughly 1500), and no code is provided, so it is not clear how well their algorithm performs with a budget of fewer queries.

Bayesian optimization.

Bayesian optimization is a leading technique for zeroth order optimization when function queries are expensive Rasmussen (2003); Frazier (2018), and it has seen great success in hyperparameter optimization for deep learning Rasmussen (2003); Golovin et al. (2017); Li et al. (2016). The majority of Bayesian optimization literature has focused on Euclidean or categorical input domains, and has used a GP model Rasmussen (2003); Golovin et al. (2017); Frazier (2018); Snoek et al. (2012). There are techniques for parallelizing Bayesian optimization González et al. (2016); Kandasamy et al. (2018a); Očenášek and Schwarz (2000). There is also prior work on using neural network models in Bayesian optimization for hyperparameter optimization Snoek et al. (2015); Springenberg et al. (2016). The goal of these papers is to improve the efficiency of Gaussian process-based Bayesian optimization from cubic to linear time, not to develop a different type of prediction model in order to improve the performance of BO with respect to the number of iterations. In our work, we present new techniques which deviate from Gaussian process-based Bayesian optimization and see a large performance boost with respect to the number of iterations.

Predicting neural network accuracy.

There are several approaches for predicting the validation accuracy of neural networks, such as a layer-wise encoding of neural networks with an LSTM algorithm Deng et al. (2017), and a layer-wise encoding and dataset features to predict the accuracy for neural network and dataset pairs Istrate et al. (2019). There is also work in predicting the learning curve of neural networks for hyperparameter optimization Klein et al. (2017); Domhan et al. (2015) or NAS Baker et al. (2017) using Bayesian techniques. None of these methods have predicted the accuracy of neural networks drawn from a cell-based DAG search space such as NASBench or the DARTS search space. Another recent work uses a hypernetwork for neural network prediction in NAS Zhang et al. (2018).

Ensembling neural networks is a popular approach for obtaining uncertainty estimates; it has been shown in many settings to be more effective than alternatives such as Bayesian neural networks, even with an ensemble of only five networks Lakshminarayanan et al. (2017); Beluch et al. (2018); Choi et al. (2016); Snoek et al. (2019).

Appendix B Preliminaries Continued

We give background information on three key ingredients of NAS algorithms.

Search space.

Before deploying a NAS algorithm, we must define the space of neural networks that the algorithm can search through. Perhaps the most common type of search space for NAS is a cell-based search space Zoph and Le (2017); Pham et al. (2018); Liu et al. (2018b); Li and Talwalkar (2019); Sciuto et al. (2019); Ying et al. (2019). A cell consists of a relatively small section of a neural network, usually 6-12 nodes forming a directed acyclic graph (DAG). A neural architecture is then built by repeatedly stacking one or two different cells on top of each other sequentially, possibly separated by specialized layers. The layout of cells and specialized layers is called a hyper-architecture, and this is fixed, while the NAS algorithm searches for the best cells. The search space over cells consists of all possible DAGs of a certain size, where each node can be one of several operations such as $1\times 1$ convolution, $3\times 3$ convolution, or max pooling. It is also common to set a restriction on the total number of edges or the in-degree of each node Ying et al. (2019); Liu et al. (2018b). In this work, we focus on NAS over convolutional cell-based search spaces, though our method can be applied more broadly.

Search strategy.

The search strategy is the optimization method that the algorithm uses to find the optimal or near-optimal neural architecture from the search space. There are many varied search strategies, such as Bayesian optimization, evolutionary search, reinforcement learning, and gradient descent. In Section 4, we introduced a novel search strategy based on Bayesian optimization with a neural network model using a path-based encoding.

Evaluation method.

Many types of NAS algorithms consist of an iterative framework in which the algorithm chooses a neural network to train, computes its validation error, and uses this result to guide the choice of neural network in the next iteration. The simplest instantiation of this approach is to train each neural network in a fixed way, i.e., the algorithm has black-box access to a function that trains a neural network for a fixed number of epochs and then returns the validation error. Algorithms with black-box evaluation methods can be compared by returning the architecture with the lowest validation error after a certain number of queries to the black-box function. There are also multi-fidelity methods, for example, when a NAS algorithm chooses the number of training epochs in addition to the architecture.

Appendix C Details from Section 4

In this section, we give the full details from Section 4.

Recall that the size of the path encoding is equal to the number of unique paths, which is $\sum_{i=0}^{n-2} q^i$, where $n$ is the number of nodes in the cell and $q$ is the number of operations to choose from at each node. This is at least $q^{n-2}$. By contrast, the adjacency matrix encoding scales quadratically in $n$.

However, the vast majority of the paths rarely show up in any neural architecture throughout a full run of a NAS algorithm. This is because many NAS algorithms can only sample architectures from a random procedure or mutate architectures drawn from the random procedure. Now we will give the full details of Theorem 4.1, showing that the vast majority of paths have a very low probability of occurring in a cell outputted from random_spec(), a popular random procedure (as in Ying et al. (2019) and used by BANANAS). We backed this up with experimental evidence in Section 5. Our results show that by simply truncating the least-likely paths, our encoding scales linearly in the size of the cell, with an arbitrarily small amount of information loss.

For convenience, we restate the random graph model defined in Section 4.

\randomgraph

*

The probability value in step (3) is chosen so that the expected number of edges after this step is exactly $k$. This definition is slightly different from random_spec(). In random_spec(), edges are added with a fixed probability, and a cell is rejected if it has more than 9 edges (in addition to being rejected if there is no path from node 1 to node $n$).

Recall that we use 'path' to mean a path from node 1 to node $n$. We restate the theorem formally. Denote by $\mathcal{P}$ the set of all possible paths from node 1 to node $n$ that could occur in $G$.

Theorem 4.1 (formal). Given integers $k, q \geq 1$ and a constant $\epsilon > 0$, there exists $N$ such that for all $n > N$, there exists a set $S \subseteq \mathcal{P}$ of $O(n)$ paths such that the probability that $G$ contains a path not in $S$ is less than $\epsilon$.

This theorem says that when $k$ and $q$ are fixed, and $n$ is large enough compared to $k$ and $q$, we can truncate the path encoding to a set of size $O(n)$, because the probability that random_spec() outputs a graph with a path outside of $S$ is very small.

Note that there are two caveats to this theorem. First, BANANAS may mutate architectures drawn from Definition 4, and Theorem 4.1 does not show that the probability of new paths appearing in mutated architectures is small. However, our experiments in the next section give evidence that the mutated architectures do not change the distribution of paths too much. Second, the most common paths in Definition 4 are not necessarily the paths whose presence or absence carries the most information for predicting the validation accuracy of a neural architecture. Again, while this is technically true, our experiments back up Theorem 4.1 as a reasonable argument that truncating the path encoding does not sacrifice performance.

Denote by $G'$ the random graph generated by Definition 4 without step (4). In other words, $G'$ is a random graph that may have no path from node 1 to node $n$. Since there are $\binom{n}{2}$ pairs $(i, j)$ such that $i < j$, the expected number of edges of $G'$ is $k$. For reference, in the NASBench dataset, there are $n = 7$ nodes and $q = 3$ operations, and the maximum number of edges is 9.

We choose $S$ to consist of the shortest paths from node 1 to node $n$. The argument for Theorem 4.1 relies on a simple concept: the probability that $G'$ contains a long path is much lower than the probability that it contains a short path. For example, the probability that $G'$ contains a path of length $n-1$ is very low, because there are $\binom{n}{2}$ potential edges but the expected number of edges is only $k$. We start by upper bounding the number of short paths.

Lemma C.1.

Given a graph with nodes and node labels, there are fewer than paths of length less than or equal to .

Proof.

The number of paths of length is , since there are choices of labels for each node. Then

To continue our argument, we will need the following well-known bounds on binomial coefficients, e.g. Stanica (2001).

Theorem C.2.

Given , we have

Now we define $P_i$ as the expected number of paths from node 1 to node $n$ of length $i$ in $G'$. Formally, $P_i = \mathbb{E}\big[\,\big|\{\text{paths of length } i \text{ from node 1 to node } n \text{ in } G'\}\big|\,\big]$.

The following lemma, which is the driving force behind Theorem 4.1, shows that the value of $P_i$ for small $i$ is much larger than the value of $P_i$ for large $i$.

Lemma C.3.

Given integers , then there exists such that for , we have

Proof.

We have that

This is because on a path from node 1 to of length , there are choices of intermediate nodes from 1 to . Once the nodes are chosen, we need all edges between the nodes to exist, and each edge exists independently with probability

When , we have . Therefore,

for sufficiently large . Now we will derive an upper bound for using Theorem C.2.

The last inequality is true because for sufficiently large . Now we have

(C.1)
(C.2)

In inequality C.1, we use the fact that for large enough , , therefore,

In inequality C.2, we use the fact that

Now we can prove Theorem 4.1.

Proof of Theorem 4.1.

Recall that denotes the set of all possible paths from node 1 to node that could be present in , and let . Then by Lemma C.1, . In Definition 4, the probability that we return a graph in step (4) is at least the probability that there exists an edge from node 1 to node . This probability is from Lemma C.3. Now we will compute the probability that there exists a path in in by conditioning on returning a graph in step (4). The penultimate inequality is due to Lemma C.3.

Appendix D Additional Experiments and Details

In this section, we present details from Section 5 and more experiments for BANANAS. In Section D.2, we evaluate the meta neural network used by BANANAS with different training set sizes. In Section D.3, we evaluate BANANAS on NASBench with five different acquisition functions. In Sections D.4 and D.5, we evaluate BANANAS on NASBench in settings different from Section 5: 500 queries instead of 150 queries, and using random validation error instead of mean validation error. Finally, in Section D.6, we study the effect of the length of the path encoding on the performance of both the meta neural network, and BANANAS as a whole.

D.1 Details from Section 5

Here, we give more details of the NAS algorithms we compared in Section 5.

Random search. The simplest baseline, random search, draws architectures at random and outputs the architecture with the lowest validation error. Despite its simplicity, multiple papers have concluded that random search is a competitive baseline for NAS algorithms Li and Talwalkar (2019); Sciuto et al. (2019).

Regularized evolution. This algorithm consists of iteratively mutating the best architectures out of a sample of all architectures evaluated so far Real et al. (2019). We used the same hyperparameters as in the NASBench implementation, but changed the population size from 50 to 30 to account for fewer total queries.

Reinforcement Learning. We use the NASBench implementation of reinforcement learning for NAS based on the REINFORCE algorithm Williams (1992). We used this algorithm because prior work has shown that a 1-layer LSTM controller trained with PPO is not effective on the NASBench dataset Ying et al. (2019).

Bayesian optimization with a GP model. We set up Bayesian optimization with a Gaussian process model and UCB acquisition. In the Gaussian process, we set the distance function between two neural networks as the sum of the Hamming distances between the adjacency matrices and the list of operations. Note that we also used a path-based distance function in our ablation study. We use the ProBO implementation Neiswanger et al. (2019).
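A sketch of this distance function, assuming each architecture is represented as an (adjacency matrix, operation list) pair:

```python
def hamming_distance(arch1, arch2):
    """Sum of Hamming distances between adjacency matrices and operation lists."""
    (adj1, ops1), (adj2, ops2) = arch1, arch2
    d_adj = sum(a != b for row1, row2 in zip(adj1, adj2)
                for a, b in zip(row1, row2))
    d_ops = sum(o1 != o2 for o1, o2 in zip(ops1, ops2))
    return d_adj + d_ops
```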

AlphaX. AlphaX casts NAS as a reinforcement learning problem, using a neural network to guide the search Wang et al. (2018). Each iteration, a neural network is trained to select the best action, such as making a small change to, or growing, the current architecture.

TPE. Tree-structured Parzen estimator (TPE) is a hyperparameter optimization algorithm based on adaptive Parzen windows. We use the NASBench implementation.

BOHB. Bayesian Optimization HyperBand (BOHB) combines multi-fidelity Bayesian optimization with principled early-stopping from Hyperband Falkner et al. (2018). We use the NASBench implementation.

DNGO. Deep Networks for Global Optimization (DNGO) is an implementation of Bayesian optimization using adaptive basis regression using neural networks instead of Gaussian processes to avoid the cubic scaling Snoek et al. (2015).

NASBOT. Neural architecture search with Bayesian optimization and optimal transport (NASBOT) Kandasamy et al. (2018b) works by defining a distance function between neural networks by computing the similarities between layers and then running an optimal transport algorithm to find the minimum earth-mover’s distance between the two architectures. Then Bayesian optimization is run using this distance function. The NASBOT algorithm is specific to macro NAS, and we put in a good-faith effort to implement it in the cell-based setting. Specifically, we compute the distance between two cells by taking the earth-mover’s distance between the set of row-sums, column-sums, and node operations. This is a version of the OTMANN distance defined in Section 3 and Table 1 in Kandasamy et al. (2018b), defined for the cell-based setting.
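The sketch below illustrates a cell-level distance of this flavor. The first two terms use the one-dimensional earth mover's distance between the row sums and between the column sums of the adjacency matrices; for the operations we substitute a simple count-mismatch term, which is a simplification of the earth mover's distance over operation types.

```python
from collections import Counter
from scipy.stats import wasserstein_distance

def cell_distance(arch1, arch2):
    """EMD over row sums and column sums, plus an operation-count mismatch term."""
    (adj1, ops1), (adj2, ops2) = arch1, arch2
    row = wasserstein_distance([sum(r) for r in adj1], [sum(r) for r in adj2])
    col = wasserstein_distance([sum(c) for c in zip(*adj1)],
                               [sum(c) for c in zip(*adj2)])
    c1, c2 = Counter(ops1), Counter(ops2)
    op_mismatch = sum(abs(c1[o] - c2[o]) for o in set(c1) | set(c2))
    return row + col + op_mismatch
```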

Additional notes from Section 5. In the main NASBench experiments and the ablation study, Figure 5.2, we added an isomorphism-removing subroutine to any algorithm that uses the adjacency matrix encoding. This is because multiple adjacency matrices can map to the same architecture. With the path encoding, this is not necessary. Note that without the isomorphism-removing subroutine, some algorithms perform significantly worse, e.g., BANANAS with the adjacency matrix encoding performs almost as badly as random search. This is another strength of the path encoding.

In the main plot, DNGO was run with the adjacency matrix encoding. We also ran DNGO with the path encoding, and it performed similarly to DNGO with the adjacency matrix encoding. We also ran DNGO with the objective in Equation 5.1. It performed similarly to regularized evolution (worse than BANANAS). We also ran NASBOT Kandasamy et al. (2018b), which performed better than BO with the adjacency encoding but worse than BO with the path encoding.

In Section 5, we described the details of running BANANAS on the DARTS search space, which resulted in an architecture. We show this architecture in Figure D.1.

Figure D.1: The best neural architecture found by BANANAS on CIFAR-10. Normal cell (left) and reduction cell (right).

D.2 Meta Neural Network Experiments

We plot the meta neural network performance with both featurizations for training sets of size 20, 100, and 500. We use the same experimental setup as described in Section 5. See Figure D.2. We note that a training set of 20 neural networks is realistic at the start of a NAS algorithm, and 200 is realistic near the middle or end of a NAS algorithm.

Figure D.2: Performance of the meta neural network with adjacency matrix encoding (row 1) and path encoding (row 2).

D.3 Acquisition functions.

We tested BANANAS with four other common acquisition functions in addition to ITS: expected improvement (EI) Močkus (1975), probability of improvement (PI) Kushner (1964), Thompson sampling (TS) Thompson (1933), and upper confidence bound (UCB) Srinivas et al. (2009). First we give the formal definitions of each acquisition function.

Suppose we have trained a collection of $M$ predictive models, $\{f_m\}_{m=1}^{M}$, where $f_m : A \rightarrow \mathbb{R}$. Following previous work Neiswanger et al. (2019), we use the following acquisition function estimates for an input architecture $a$:

(D.1)
(D.2)
(D.3)
(D.4)
(D.5)

In these acquisition function definitions, $\mathbb{1}[\cdot]$ denotes the indicator function, which equals 1 if its argument is true and 0 otherwise, and we make a normal approximation to our model's posterior predictive density, with parameters $\hat{\mu}(a)$ and $\hat{\sigma}(a)$ estimated as in Section 4.

In the UCB acquisition function, we use UCB to denote an estimate of a lower confidence bound for the posterior predictive density (note that we use a lower confidence bound rather than an upper confidence bound since we are performing minimization), and $\beta$ is a tradeoff parameter, which we fix to a constant in our experiments.
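Under the normal approximation above, the five acquisition functions can be sketched as follows (minimization convention: EI and PI are maximized over candidates, while ITS, TS, and the lower-confidence-bound form of UCB are minimized; y_min denotes the best validation error observed so far, and the value of beta below is a placeholder).

```python
import numpy as np
from scipy.stats import norm

def posterior(preds):
    """Normal approximation from the M ensemble predictions at one input."""
    return np.mean(preds), np.std(preds, ddof=1)

def acq_ei(preds, y_min):
    mu, sigma = posterior(preds)
    z = (y_min - mu) / sigma
    return sigma * (z * norm.cdf(z) + norm.pdf(z))      # expected improvement

def acq_pi(preds, y_min):
    mu, sigma = posterior(preds)
    return norm.cdf((y_min - mu) / sigma)               # probability of improvement

def acq_its(preds, rng=np.random.default_rng()):
    mu, sigma = posterior(preds)
    return rng.normal(mu, sigma)                        # independent Thompson sample

def acq_ts(preds, member_index):
    return preds[member_index]                          # one ensemble member per BO round

def acq_lcb(preds, beta=0.5):
    mu, sigma = posterior(preds)
    return mu - beta * sigma                            # lower confidence bound
```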

Figure D.3: Comparison of different acquisition functions (left). NAS algorithms with a budget of 500 queries (right).

See Figure D.3 (left). We see that the acquisition function has a small effect on the performance of BANANAS on NASBench, and UCB and ITS perform the best. We note that since the DARTS search space is much larger than the NASBench search space, in the parallel setting, the 10 neural architectures chosen by UCB in each round will have less diversity on the DARTS search space. Since ITS inherently gives a more randomized batch of 10 neural architectures, we expect more diversity and therefore expect ITS to perform better than UCB on DARTS. Due to the extreme computational cost of running a single BANANAS trial on the DARTS search space, we were unable to run an ablation study there.

D.4 NAS algorithms for 500 queries.

In our main set of experiments, we gave each NAS algorithm a budget of 150 queries. That is, each NAS algorithm can only choose 150 architectures to train and evaluate. Now we give each NAS algorithm a budget of 500 queries. See Figure D.3 (right). We largely see the same trends, although the gap between BANANAS and regularized evolution becomes smaller after query 400.

D.5 Mean vs. random validation error.

In the NASBench dataset, each architecture was trained to 108 epochs three separate times with different random seeds. The NASBench paper conducted experiments by choosing a random validation error when querying each architecture, and then reporting the mean test error at the conclusion of the NAS algorithm. We found that the mismatch (choosing random validation error, but mean test error) added extra noise that would not be present during a real-life NAS experiment. Another way to conduct experiments on NASBench is to use mean validation error and mean test error. This is the method we used in Section 5 and the previous experiments from this section. Perhaps the most realistic experiment would be to use a random validation error and the test error corresponding to that validation error, however, the NASBench dataset does not explicitly give this functionality. In Figure D.4, we give results in the setting of the NASBench paper, using random validation error and mean test error. We ran each algorithm for 500 trials. Note that the test error across the board is higher because the correlation between validation and test error is lower (i.e., validation error is noisier than normal). BANANAS still performs the best out of all algorithms that we tried. We see the trends are largely unchanged, although Bayesian optimization with a GP model performs better than regularized evolution. A possible explanation is that Bayesian optimization with a GP model is better-suited to handle noisy data than regularized evolution.

Figure D.4: NAS experiments with random validation error.
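The sketch below illustrates the two querying modes on NASBench. It assumes the public nasbench API (api.NASBench, query, and get_metrics_from_spec); the exact key names should be checked against the installed version of the benchmark.

```python
from nasbench import api  # https://github.com/google-research/nasbench

# Assumed interface of the NAS-Bench-101 API; verify key names for your version.
nasbench = api.NASBench('nasbench_only108.tfrecord')

def query_mean(spec):
    """Mean validation/test error over the three 108-epoch training repeats."""
    _, computed = nasbench.get_metrics_from_spec(spec)
    runs = computed[108]  # three runs with different random seeds
    val = 1.0 - sum(r['final_validation_accuracy'] for r in runs) / len(runs)
    test = 1.0 - sum(r['final_test_accuracy'] for r in runs) / len(runs)
    return val, test

def query_random(spec):
    """Validation/test error from one randomly chosen repeat (NASBench-paper setting)."""
    data = nasbench.query(spec)  # samples one of the three repeats
    return 1.0 - data['validation_accuracy'], 1.0 - data['test_accuracy']
```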

D.6 Path encoding length

In this section, we give experimental results that support the theoretical claims from Section 4. In particular, we show that truncating the path encoding, even by an order of magnitude, has minimal effect on the performance of both the meta neural network and BANANAS.

We sort the paths from highest probability to lowest probability of occurring in a random spec, and then we truncate the path encoding to its first $x$ features. Note that there is a direct correlation between the probability of a path occurring in a random spec and the length of the path, so this is equivalent to removing all but the shortest paths.
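Concretely, the truncated path encoding can be computed by enumerating the operation sequences along every input-to-output path of a cell and one-hot encoding them against the universe of possible sequences, ordered from shortest (most probable) to longest. The helpers below are an illustrative simplification with names of our own choosing, not our released implementation.

```python
import itertools
import numpy as np

# The three operation choices in a NASBench cell.
OPS = ['conv1x1-bn-relu', 'conv3x3-bn-relu', 'maxpool3x3']

def all_paths(matrix, ops):
    """Operation sequences along every input->output path of a cell DAG.

    matrix: upper-triangular adjacency matrix; node 0 is the input and the
        last node is the output.
    ops: per-node operation labels (the input/output entries are ignored).
    """
    n = len(matrix)
    paths = []

    def dfs(node, path):
        if node == n - 1:  # reached the output node
            paths.append(tuple(path))
            return
        for nxt in range(node + 1, n):
            if matrix[node][nxt]:
                dfs(nxt, path + ([ops[nxt]] if nxt != n - 1 else []))

    dfs(0, [])
    return paths

def path_universe(max_ops=5):
    """All possible operation sequences of length 0..max_ops, shortest first."""
    universe = []
    for length in range(max_ops + 1):
        universe.extend(itertools.product(OPS, repeat=length))
    return universe  # 3^0 + ... + 3^5 = 364 sequences for NASBench

def path_encoding(matrix, ops, truncate=None):
    """One-hot path encoding, optionally truncated to its first `truncate` features."""
    universe = path_universe()
    index = {p: i for i, p in enumerate(universe)}
    encoding = np.zeros(len(universe))
    for p in all_paths(matrix, ops):
        encoding[index[p]] = 1.0
    return encoding if truncate is None else encoding[:truncate]
```

For example, path_encoding(matrix, ops, truncate=40) keeps only the features for paths with at most three operations, since 1 + 3 + 9 + 27 = 40.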

First, we give a table of the probabilities of paths in NASBench by length, generated from random_spec(). These probabilities were computed empirically from 100,000 calls to random_spec(). See Table 3. This table gives experimental evidence supporting Theorem 4.1.

Table 3: Probabilities of path lengths in NASBench using random_spec(). For each path length from 1 to 6, the table reports the probability of that path length, the total number of paths of that length, and the expected number of such paths per random spec.
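A plausible way to reproduce the quantities in Table 3 is a Monte Carlo estimate over random specs. The sketch below assumes random_spec() returns an (adjacency matrix, ops) pair for a random valid cell, and reads the Probability column as the chance that a given path of each length appears in a random spec; both are assumptions to adapt to your setup.

```python
from collections import Counter

def estimate_path_stats(random_spec, max_ops=5, n_samples=100_000):
    """Monte Carlo estimate of per-length path statistics over random specs.

    random_spec: callable returning (matrix, ops) for a random valid cell
        (assumed interface). Uses the all_paths helper sketched above.
    """
    occurrences = Counter()
    for _ in range(n_samples):
        matrix, ops = random_spec()
        for p in all_paths(matrix, ops):
            occurrences[len(p)] += 1
    stats = {}
    for length in range(max_ops + 1):
        total = 3 ** length                      # 3 operation choices per node
        expected = occurrences[length] / n_samples
        stats[length] = {
            'probability': expected / total,     # chance a given path of this length occurs
            'total_paths': total,
            'expected_paths': expected,
        }
    return stats
```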

In our first experiment on the NASBench dataset, we draw 200 random training architectures and 500 random test architectures, train a meta neural network on the truncated path encodings, and plot the train and test errors. We perform this experiment for several truncation lengths $x$; these are natural cutoffs, as they correspond to keeping all paths of length at most one, at most two, and so on. We ran 100 trials for each value of $x$. See Figure D.5 (left).
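In outline, this experiment fits a predictor on truncated encodings of 200 random architectures and evaluates it on 500 held-out architectures. In the sketch below, scikit-learn's MLPRegressor stands in for the meta neural network, and the hyperparameters are illustrative rather than the ones used in the paper.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor
from sklearn.metrics import mean_absolute_error

def truncation_experiment(archs, errors, truncate, n_train=200, n_test=500, seed=0):
    """Train/test error of a predictor on path encodings truncated to `truncate` features.

    archs: list of (matrix, ops) pairs; errors: their validation errors.
    Uses the path_encoding helper sketched above.
    """
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(archs))
    train, test = idx[:n_train], idx[n_train:n_train + n_test]
    X = np.array([path_encoding(m, o, truncate=truncate) for m, o in archs])
    y = np.asarray(errors)
    model = MLPRegressor(hidden_layer_sizes=(64, 64), max_iter=2000, random_state=seed)
    model.fit(X[train], y[train])
    return (mean_absolute_error(y[train], model.predict(X[train])),
            mean_absolute_error(y[test], model.predict(X[test])))
```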

Then we compare the path encoding length to the error of the architecture returned by a full run of BANANAS. We average 500 trials for each value of $x$. See Figure D.5 (right). Our experiments show that the path encoding can be truncated by an order of magnitude with minimal effect on the performance of the meta neural network and of BANANAS.

Figure D.5: Experiments on the path encoding length versus the error of the meta neural network on NASBench (left), and on the path encoding length versus the performance of BANANAS (right).

Appendix E Best practices checklist for NAS research

The area of NAS has seen problems with reproducibility, as well as fair empirical comparisons, even more so than other areas of machine learning. Following calls for fair and reproducible NAS research Li and Talwalkar (2019); Ying et al. (2019), a best practices checklist was recently created Lindauer and Hutter (2019). In order to promote fair and reproducible NAS research, we address all points on the checklist, and we encourage future papers to do the same.

  • Code for the training pipeline used to evaluate the final architectures. We used two of the most popular search spaces in NAS research: the NASBench search space and the DARTS search space. For NASBench, the accuracies of all architectures are precomputed. For the DARTS search space, we released our fork of the DARTS repository, which builds on the fork designed specifically for reproducible experiments Li and Talwalkar (2019), with only trivial changes to account for our PyTorch version.

  • Code for the search space. We used the popular and publicly available NASBench and DARTS search spaces with no changes.

  • Hyperparameters used for the final evaluation pipeline, as well as random seeds. We left all hyperparameters unchanged. We trained the architectures found by BANANAS, ASHA, and DARTS five times each, using random seeds 0, 1, 2, 3, 4.

  • For all NAS methods you compare, did you use exactly the same NAS benchmark, including the same dataset, search space, and code for training the architectures and hyperparameters for that code? Yes; on NASBench this holds by construction of the dataset. For the DARTS experiments, we used the architectures reported by each method (found using the same search space and dataset as our method) and then trained the final architectures with the same code, including hyperparameters.

  • Did you control for confounding factors? Yes, we used the same setup for all of our NASBench experiments. For the DARTS search space, we compared our algorithm to two other algorithms using the same setup (PyTorch version, CUDA version, etc.). Across the five training seeds for each algorithm, we used different GPUs, which we found to have no greater effect than using a different random seed.

  • Did you run ablation studies? Yes, we ran a thorough ablation study.

  • Did you use the same evaluation protocol for the methods being compared? Yes, we used the same evaluation protocol for all methods and we tried multiple evaluation protocols.

  • Did you compare performance over time? In our main NASBench plots, we compared performance against the number of queries, since all of our comparisons were to black-box optimization algorithms; the number of queries is almost perfectly correlated with runtime. We computed the runtime and found that all algorithms used between 46.5 and 47.0 TPU hours.

  • Did you compare to random search? Yes.

  • Did you perform multiple runs of your experiments and report seeds? We ran 200 trials of our NASBench experiments; since we ran so many trials, we did not report random seeds. We ran four total trials of BANANAS on the DARTS search space. We do not currently have a fully deterministic version of BANANAS for the DARTS search space (which would be harder to implement because the algorithm runs on 10 GPUs); however, the average final error across trials was within 0.1%.

  • Did you use tabular or surrogate benchmarks for in-depth evaluations? Yes, we used NASBench.

  • Did you report how you tuned hyperparameters, and what time and resources this required? We performed light hyperparameter tuning for the meta neural network to choose the number of layers, layer size, learning rate, and number of epochs. In general, we found our algorithm to work well without hyperparameter tuning.

  • Did you report the time for the entire end-to-end NAS method? We reported time for the entire end-to-end NAS method.

  • Did you report all details of your experimental setup? We reported all details of our experimental setup.

Footnotes

  1. https://www.github.com/naszilla/bananas.
  2. We also ran four trials of BANANAS on an older version (UCB instead of ITS and MAE instead of MAPE), and the average final error was within 0.1% of DARTS. The difference in final error between different runs of BANANAS was also within 0.1%.

References

  1. Accelerating neural architecture search using performance prediction. arXiv preprint arXiv:1705.10823.
  2. The power of ensembles for active learning in image classification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.
  3. Algorithms for hyper-parameter optimization. In Proceedings of the Annual Conference on Neural Information Processing Systems (NIPS).
  4. ProxylessNAS: direct neural architecture search on target task and hardware. In Proceedings of the International Conference on Learning Representations (ICLR).
  5. Ensemble of deep convolutional neural networks for prognosis of ischemic stroke. In International Workshop on Brainlesion: Glioma, Multiple Sclerosis, Stroke and Traumatic Brain Injuries.
  6. Peephole: predicting network performance before training. arXiv preprint arXiv:1712.03351.
  7. Speeding up automatic hyperparameter optimization of deep neural networks by extrapolation of learning curves. In Twenty-Fourth International Joint Conference on Artificial Intelligence.
  8. Neural architecture search: a survey. arXiv preprint arXiv:1808.05377.
  9. BOHB: robust and efficient hyperparameter optimization at scale. In Proceedings of the International Conference on Machine Learning (ICML).
  10. Neuroevolution: from architectures to learning. Evolutionary Intelligence 1 (1), pp. 47–62.
  11. A tutorial on Bayesian optimization. arXiv preprint arXiv:1807.02811.
  12. Google Vizier: a service for black-box optimization. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 1487–1495.
  13. Batch Bayesian optimization via local penalization. In Proceedings of the International Conference on Artificial Intelligence and Statistics (AISTATS).
  14. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778.
  15. Densely connected convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.
  16. TAPAS: train-less accuracy predictor for architecture search. In Proceedings of the AAAI Conference on Artificial Intelligence.
  17. Auto-Keras: efficient neural architecture search with network morphism. arXiv preprint arXiv:1806.10282.
  18. Parallelised Bayesian optimisation via Thompson sampling. In Proceedings of the International Conference on Artificial Intelligence and Statistics (AISTATS).
  19. Neural architecture search with Bayesian optimisation and optimal transport. In Advances in Neural Information Processing Systems, pp. 2016–2025.
  20. Designing neural networks using genetic algorithms with graph generation system. Complex Systems 4 (4), pp. 461–476.
  21. Learning curve prediction with Bayesian neural networks. ICLR 2017.
  22. ImageNet classification with deep convolutional neural networks. In Proceedings of the Annual Conference on Neural Information Processing Systems (NIPS).
  23. A new method of locating the maximum point of an arbitrary multipeak curve in the presence of noise. Journal of Basic Engineering 86 (1), pp. 97–106.
  24. Simple and scalable predictive uncertainty estimation using deep ensembles. In Advances in Neural Information Processing Systems, pp. 6402–6413.
  25. Prune and replace NAS. arXiv preprint arXiv:1906.07528.
  26. Random search and reproducibility for neural architecture search. arXiv preprint arXiv:1902.07638.
  27. Hyperband: a novel bandit-based approach to hyperparameter optimization. arXiv preprint arXiv:1603.06560.
  28. DARTS+: improved differentiable architecture search with early stopping. arXiv preprint arXiv:1909.06035.
  29. Best practices for scientific research on neural architecture search. arXiv preprint arXiv:1909.02453.
  30. Progressive neural architecture search. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 19–34.
  31. DARTS: differentiable architecture search. arXiv preprint arXiv:1806.09055.
  32. Evolutionary-neural hybrid agents for architecture search. arXiv preprint arXiv:1811.09828.
  33. On Bayesian methods for seeking the extremum. In Optimization Techniques IFIP Technical Conference, pp. 400–404.
  34. ProBO: a framework for using probabilistic programming in Bayesian optimization. arXiv preprint arXiv:1901.11515.
  35. The parallel Bayesian optimization algorithm. In The State of the Art in Computational Intelligence.
  36. Efficient neural architecture search via parameter sharing. arXiv preprint arXiv:1802.03268.
  37. Gaussian processes in machine learning. In Summer School on Machine Learning, pp. 63–71.
  38. Regularized evolution for image classifier architecture search. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 33, pp. 4780–4789.
  39. Evaluating the search phase of neural architecture search. arXiv preprint arXiv:1902.08142.
  40. AmoebaNet: an SDN-enabled network service for big data science. Journal of Network and Computer Applications 119, pp. 70–82.
  41. Multi-objective neural architecture search via predictive network performance optimization. arXiv preprint arXiv:1911.09336.
  42. Practical Bayesian optimization of machine learning algorithms. In Proceedings of the Annual Conference on Neural Information Processing Systems (NIPS).
  43. Can you trust your model's uncertainty? Evaluating predictive uncertainty under dataset shift. In Proceedings of the Annual Conference on Neural Information Processing Systems (NIPS).
  44. Scalable Bayesian optimization using deep neural networks. In Proceedings of the International Conference on Machine Learning (ICML), pp. 2171–2180.
  45. Bayesian optimization with robust Bayesian neural networks. In Advances in Neural Information Processing Systems, pp. 4134–4142.
  46. Gaussian process optimization in the bandit setting: no regret and experimental design. arXiv preprint arXiv:0912.3995.
  47. Good lower and upper bounds on binomial coefficients. Journal of Inequalities in Pure and Applied Mathematics.
  48. Evolving neural networks through augmenting topologies. Evolutionary Computation 10 (2), pp. 99–127.
  49. Inception-v4, Inception-ResNet and the impact of residual connections on learning. In Thirty-First AAAI Conference on Artificial Intelligence.
  50. Rethinking the Inception architecture for computer vision. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2818–2826.
  51. MnasNet: platform-aware neural architecture search for mobile. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.
  52. EfficientNet: rethinking model scaling for convolutional neural networks. arXiv preprint arXiv:1905.11946.
  53. On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika 25 (3/4), pp. 285–294.
  54. Sample-efficient neural architecture search by learning action space. arXiv preprint arXiv:1906.06832.
  55. AlphaX: exploring neural architectures with deep neural networks and Monte Carlo tree search. arXiv preprint arXiv:1805.07440.
  56. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, pp. 229–256.
  57. SNAS: stochastic neural architecture search. arXiv preprint arXiv:1812.09926.
  58. NAS-Bench-101: towards reproducible neural architecture search. arXiv preprint arXiv:1902.09635.
  59. Graph hypernetworks for neural architecture search. In Proceedings of the International Conference on Learning Representations (ICLR).
  60. D-VAE: a variational autoencoder for directed acyclic graphs. In Proceedings of the Annual Conference on Neural Information Processing Systems (NIPS).
  61. BayesNAS: a Bayesian approach for neural architecture search. arXiv preprint arXiv:1905.04919.
  62. Graph neural networks: a review of methods and applications. arXiv preprint arXiv:1812.08434.
  63. Neural architecture search with reinforcement learning. In Proceedings of the International Conference on Learning Representations (ICLR).