BayesNAS: A Bayesian Approach for Neural Architecture Search
Abstract
One-shot Neural Architecture Search (NAS) is a promising method to significantly reduce search time without any separate training. It can be treated as a network compression problem on the architecture parameters of an over-parameterized network. However, two issues are associated with most one-shot NAS methods. First, dependencies between a node and its predecessors and successors are often disregarded, which results in improper treatment of zero operations. Second, pruning architecture parameters based on their magnitude is questionable. In this paper, we employ the classic Bayesian learning approach to alleviate these two issues by modeling architecture parameters using hierarchical automatic relevance determination (HARD) priors. Unlike other NAS methods, we train the over-parameterized network for only one epoch and then update the architecture. Impressively, this enabled us to find the architecture on CIFAR-10 within only 0.2 GPU days using a single GPU. Competitive performance can also be achieved by transferring to ImageNet. As a byproduct, our approach can be applied directly to compress convolutional neural networks by enforcing structural sparsity, which achieves extremely sparse networks without accuracy deterioration.
Hongpeng Zhou (Delft, equal contribution), Minghao Yang (Delft, equal contribution), Jun Wang (UCL), Wei Pan (Delft)
Delft: Department of Cognitive Robotics, Delft University of Technology, Netherlands. UCL: Department of Computer Science, University College London, UK. Corresponding author: Wei Pan (wei.pan@tudelft.nl).
Machine Learning, ICML
1 Introduction
Neural Architecture Search (NAS), the process of automating architecture engineering, is a logical next step in automating machine learning (Zoph & Le, 2017). There are basically three existing frameworks for neural architecture search. Reinforcement-learning-based NAS methods (Baker et al., 2017; Zoph & Le, 2017; Zhong et al., 2018; Zoph et al., 2018; Cai et al., 2018) take the generation of a neural architecture as an agent's action, with the action space identical to the search space. More recent neuroevolutionary approaches (Real et al., 2017; Liu et al., 2018b; Real et al., 2019; Miikkulainen et al., 2019; Xie & Yuille, 2017; Elsken et al., 2019a) use gradient-based methods for optimizing weights and solely use evolutionary algorithms for optimizing the neural architecture itself. However, these two frameworks require enormous computational power compared to a search using a single GPU. One-shot NAS is a promising approach to significantly reduce search time without any separate training: it treats all architectures as different subgraphs of a supergraph (the one-shot model) and shares weights between architectures that have edges of this supergraph in common (Saxena & Verbeek, 2016; Brock et al., 2018; Pham et al., 2018; Bender et al., 2018; Liu et al., 2019b; Cai et al., 2019; Xie et al., 2019; Zhang et al., 2019a, b). A comprehensive survey on neural architecture search can be found in (Elsken et al., 2019b).
Our approach is a one-shot NAS solution which treats NAS as a network compression/pruning problem on the architecture parameters of an over-parameterized network. However, despite its remarkably shorter search time compared to reinforcement learning and neuroevolutionary approaches, we can identify a number of significant and practical disadvantages of current one-shot NAS. First, dependencies between a node and its predecessors and successors are disregarded in the process of identifying the redundant connections. This is mainly reflected in the improper treatment of zero operations. On one hand, the logit of the zero operation may dominate some of the edges while the child network still has other non-zero edges to keep it connected (Liu et al., 2019b; Xie et al., 2019; Cai et al., 2019; Zhang et al., 2019b); see, for example, node 2 in Figure 1a. Similarly, as shown in Figure 1 of (Xie et al., 2019), an invalid/disconnected graph can be sampled with non-zero probability when there are three non-zero operations plus one zero operation. Though post-processing to safely remove isolated nodes is possible, e.g., for chain-like structures, it demands extensive extra computation to reconstruct the graph for complex search spaces with additional layer types and multiple branches and skip connections. This may prevent the use of modern network structures as the backbone, such as DenseNet (Huang et al., 2017) and newly designed motifs (Liu et al., 2018b), and of complex computer vision tasks such as semantic segmentation (Liu et al., 2019a). On the other hand, the zero operation should have higher priority to rule out other possible operations, since selecting the zero operation is equivalent to none of the non-zero operations being selected. Second, most one-shot NAS methods (Liu et al., 2019b; Cai et al., 2019; Xie et al., 2019; Zhang et al., 2019b; Gordon et al., 2018) rely on the magnitude of architecture parameters to prune redundant parts, and this is not necessarily sound.
From the perspective of network compression (Lee et al., 2019), a magnitude-based metric depends on the scale of the weights, thus requiring pre-training, and is very sensitive to architectural choices. Moreover, magnitude does not necessarily indicate the optimal edge. Unfortunately, these drawbacks exist not only in network compression but also in one-shot NAS.
In this work, we propose a novel, efficient and highly automated framework based on the classic Bayesian learning approach to alleviate these two issues simultaneously. We model the architecture parameters by a hierarchical automatic relevance determination (HARD) prior. The dependency can be translated into multiplication and addition of some independent Gaussian distributions. The classic Bayesian learning framework (MacKay, 1992a; Neal, 1995; Tipping, 2001) prevents overfitting and promotes sparsity by specifying sparse priors. The uncertainty of the parameter distribution can be used as a new metric to prune the redundant parts when the associated entropy is non-positive. The majority of parameters are automatically zeroed out during the learning process.
Our Contributions

Bayesian approach: BayesNAS is the first Bayesian approach for NAS. Our approach therefore shares the advantages of Bayesian learning: it prevents overfitting and does not require tuning many hyperparameters. Hierarchical sparse priors are used to model the architecture parameters. The priors not only promote sparsity but also model the dependency between a node and its predecessors and successors, ensuring a connected derived graph after pruning. Furthermore, they provide a principled way to prioritize zero operations over other non-zero operations. In our experiment on CIFAR-10, we found that the variance of the prior, as well as that of the posterior, is several orders of magnitude smaller than the posterior mean, which renders a good metric for pruning architecture parameters.

Simple and fast search: Our algorithm is formulated simply as an iteratively re-weighted type algorithm (Candes et al., 2008), where the re-weighting coefficients used for the next iteration are computed not only from the value of the current solution but also from its posterior variance. The update of the posterior variance is based on the Laplace approximation in Bayesian learning, which requires computing the inverse Hessian of the log-likelihood. To make this computation feasible for large networks, a fast Hessian calculation method is proposed. In our experiment, we train the model for only one epoch before calculating the Hessian to update the posterior variance. Therefore, the search time for very deep neural networks can be kept within 0.2 GPU days.

Network compression: As a byproduct, our approach can be extended directly to network compression by enforcing various forms of structural sparsity over the network parameters. Extremely sparse models can be obtained at the cost of minimal or no loss in accuracy across all tested architectures. This can be effortlessly integrated into BayesNAS to find a sparse architecture along with sparse kernels for resource-limited hardware.
2 Related Work
Network Compression. The de facto standard criterion to prune redundant weights depends on their magnitude and is designed to be incorporated into the learning process. These methods are prohibitively slow as they require many iterations of pruning and learning steps. One category is based on the magnitude of the weights. The conventional approach to achieve sparsity is to enforce penalty terms (Chauvin, 1989; Weigend et al., 1991; Ishikawa, 1996); weights below a certain threshold can then be removed. In recent years, impressive results have been achieved using the magnitude of a weight as the criterion (Han et al., 2016) as well as other variations (Guo et al., 2016). The other category is based on the magnitude of the Hessian of the loss with respect to the weights, i.e., the higher the Hessian value, the greater the importance of the parameter (LeCun et al., 1990; Hassibi et al., 1993). Despite being popular, both of these categories require pre-training and are very sensitive to architectural choices. For instance, different normalization layers affect the magnitude of weights in different ways. This issue has been elaborated in (Lee et al., 2019), where the gradient information at the beginning of training is utilized to rank the relative importance of the weights' contribution to the training loss.
One-shot Neural Architecture Search. In one-shot NAS, redundant architecture parameters are pruned based on the magnitude of weights, similar to network compression. In DARTS, Liu et al. (2019b) applied a softmax function to the magnitude of the architecture parameters to rank the relative importance of each operation. Similar to DARTS, there are two related works: ProxylessNAS (Cai et al., 2019) and SNAS (Xie et al., 2019). ProxylessNAS binarizes the architecture parameters using BinaryConnect (Courbariaux et al., 2015), where the binary gate plays the role of a threshold and the edge with the highest weight is selected in the end, while SNAS applies a softened one-hot random variable to rank the architecture parameters. Gordon et al. (2018) treat the scaling factor of Batch Normalization as an edge and the normalization as its associated operation. Zhang et al. (2019b) proposed DSO-NAS, which relaxes the ℓ0 norm by replacing it with the ℓ1 norm and prunes the edges by a threshold; e.g., the learning rate is multiplied by a predefined regularization parameter to prune edges gradually over the course of training.
Bayesian Learning and Compression. Our approach is based on Bayesian learning. In principle, the Bayesian approach to learning neural networks does not suffer from the problems of tuning a large number of hyperparameters or overfitting the training data (MacKay, 1992b, a; Neal, 1995; Hernández-Lobato & Adams, 2015). By employing sparsity-inducing priors, the obtained model depends only on a subset of kernel functions for linear models (Tipping, 2001) and, for deep neural networks, neurons can be pruned along with all their ingoing and outgoing weights (Louizos et al., 2017). Other Bayesian methods have also been applied to network pruning (Ullrich et al., 2017; Molchanov et al., 2017a): the former extends soft weight-sharing to obtain a sparse and compressed network, and the latter uses variational inference to learn a dropout rate that can then be used for pruning.
3 Search Space Design
The search space defines which neural architectures a NAS approach might discover in principle. Designing a good search space is a challenging problem for NAS. Several works (Zoph & Le, 2017; Zoph et al., 2018; Pham et al., 2018; Cai et al., 2018; Zhang et al., 2019b; Liu et al., 2019b; Cai et al., 2019) have proposed that the search space can be represented by a Directed Acyclic Graph (DAG). We denote the edge from node i to node j as e_{ij}, and o_{ij} stands for the operation associated with edge e_{ij}.
Similar to other one-shot NAS approaches (Bender et al., 2018; Zhang et al., 2019b; Liu et al., 2019b; Cai et al., 2019; Gordon et al., 2018), we also include (different or same) scaling scalars over all operations of all edges to control the information flow; these scalars constitute the architecture parameters. The output of a mixed operation is defined based on the outputs of its edges:
(1) 
The output of a node is then obtained by summing the outputs of all its incoming mixed operations.
To this end, the objective is to learn a simple/sparse subgraph while maintaining/improving the accuracy of the over-parameterized DAG (Bender et al., 2018). Let us formulate the search as an optimization problem. Given a dataset and the desired sparsity level (i.e., the number of non-zero edges), the one-shot NAS problem can be written as an optimization problem with the following constraints:
(2)  
where the parameters are split into two parts, network parameters and architecture parameters, and the sparsity constraint uses the standard ℓ0 norm. The formulation in equation 2 can be substantiated by incorporating zero operations to allow the removal of edges (Liu et al., 2019b; Cai et al., 2019), aiming to further reduce the size of cells and improve design flexibility.
To alleviate the negative effects induced by the dependency and the magnitude-based metric, whose issues have been discussed in the Introduction, we introduce for each operation a switch that is analogous to the one used in an electric circuit. There are four features associated with these switches. First, the on/off status is not solely determined by magnitude. Second, dependency is taken into account, i.e., the predecessor has superior control over its successors, as illustrated in Figure 1c. Third, the switch is an auxiliary variable that is not updated by gradient descent but computed directly to switch the edge on or off. Lastly, the switch works for both proxy and proxyless scenarios and can be readily embedded into existing algorithmic frameworks (Liu et al., 2019b; Cai et al., 2019; Gordon et al., 2018). The calculation method will be introduced later in Section 4.
Inspired by the hierarchical representation in a DAG (Liu et al., 2019b, 2018b), we abstract a single motif as the building block of the DAG, as shown in Figure 1e. Any derived motif, path, or network can be constructed from such a multi-input, multi-output motif. It shows that a successor can have multiple predecessors and each predecessor can have multiple operations over each of its successors. Since the representation is general, each directed edge can be associated with some primitive operations (e.g., convolution, pooling, etc.) and a node can represent the output of a motif, cell, or network.
4 Dependency Based OneShot Performance Estimation Strategy
4.1 Encoding the Dependency Logic
In the following, we formally state the criterion to identify redundant connections in Proposition 4.1. The idea can be illustrated by Figure 1b, in which both the blue and red edges, from node 2 to 3 and from node 2 to 4, may be non-zero but should be removed as a consequence. To enable this, we have the following proposition.
Proposition
There is information flow from node i to node j under operation o, as shown in Figure 1e, if and only if at least one operation of at least one predecessor of node i is non-zero and o itself is non-zero.
Remark
An equivalent statement of Proposition 4.1 is: there is no information flow from node i to node j under operation o if and only if all the operations of all the predecessors of node i are zero, or o itself is zero. This explains the incompleteness of problem 2 as well as the possible phenomenon that non-zero edges become dysfunctional in Figure 1b.
Remark
Following Remark 4.1, we will construct a joint probability distribution over the weights and their switches in the sequel, denoted as
(3) 
where the conditioning expression encodes Proposition 4.1, as in Remark 4.1.
In the following, we show how the switches can be used to implement Proposition 4.1. If we assume each switch has two states, ON and OFF, an operation is redundant when its own switch is OFF or all the switches of its predecessors' operations are OFF. How can the switches be used to encode this redundancy? One possible solution is
(4) 
If the switch is a continuous variable, with non-zero values for ON and zero for OFF, set union and set intersection can be arithmetically represented by addition and multiplication respectively. The switch does not directly determine the magnitude of the operation but plays the role of uncertainty, or confidence, for zero magnitude.
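This arithmetic encoding can be sketched in a few lines of Python; the function and variable names below are ours, purely illustrative, and switches are assumed non-negative:

```python
def information_flow(pred_switches, op_switch):
    """Proposition 4.1 in arithmetic form: an edge carries information iff at
    least one predecessor switch is non-zero (set union -> addition) AND the
    operation's own switch is non-zero (set intersection -> multiplication).
    Switches are assumed non-negative, so the sum is zero only if all are."""
    return sum(pred_switches) * op_switch

# A continuous switch: 0 means OFF; any non-zero value means ON.
assert information_flow([0.0, 0.0], 0.9) == 0.0  # no active predecessor
assert information_flow([0.7, 0.2], 0.0) == 0.0  # operation switched off
assert information_flow([0.7, 0.0], 0.9) != 0.0  # information flows
```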
A straightforward way to encode this logic is to assign a probability distribution, for example a Gaussian distribution, over the switches.
Since the switches are independent of each other, we construct the following distribution to express equation 3:
(5)  
where
(6) 
Since the identity in equation 5 always holds regardless of the value of the switch, we can use the following simpler alternative to equation 5 to encode Proposition 4.1:
(7) 
Interestingly, equations 7 and 4 are equivalent. This means that we may find an algorithm that is able to find the sparse solution in a probabilistic manner. However, a Gaussian distribution in general does not promote sparsity. Fortunately, some classic yet powerful techniques in Bayesian learning are applicable, i.e., sparse Bayesian learning (SBL) (Tipping, 2001) and the automatic relevance determination (ARD) prior (MacKay, 1996; Neal, 1995) in Bayesian neural networks.
4.2 Zero Operation Ruling All
In our paper, we do not include the zero operation as a primitive operation. Instead, between node i and node j we compulsively add one more node and allow only a single identity operation on the added edge (see Figure 1f). The associated weight is trainable, and so is its switch. The idea is that if this switch is OFF, all the operations from node i to node j are disabled as a consequence. The expression in equation 6 can then be substituted by
(8) 
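As a sketch of this construction (names are ours, purely illustrative), the inserted identity node's switch gates every operation on the edge:

```python
def gated_switches(s_identity, op_switches):
    """'Zero operation ruling all': the switch of the inserted identity node
    multiplies every operation switch on the edge, so turning it off
    (setting it to zero) disables all operations on that edge at once."""
    return [s_identity * s for s in op_switches]

# If the identity switch is off, every operation on the edge is disabled.
assert gated_switches(0.0, [0.3, 0.9, 0.5]) == [0.0, 0.0, 0.0]
assert gated_switches(1.0, [0.3, 0.9]) == [0.3, 0.9]
```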
5 Bayesian Learning Search Strategy
5.1 Bayesian Neural Network
The likelihood of the network weights and the noise precision, given the data, is
(9) 
To complete our probabilistic model, we specify a Gaussian prior distribution for each entry in each of the weight matrices of the network. In particular,
(10)  
(11) 
where the switch variance is defined in equation 8, and the remaining quantities are hyperparameters. Importantly, there is an individual hyperparameter associated independently with every edge weight and a single one shared by all network weights. Following MacKay's evidence framework (MacKay, 1992a), "hierarchical priors" are employed on the latent variables, using Gamma priors on the inverse variances. The hyperpriors are chosen to be Gamma distributions (Berger, 2013) with shape and rate parameters a and b. Essentially, the choice of Gamma priors has the effect of making the marginal prior distribution of the latent variables a non-Gaussian Student's t, therefore promoting sparsity (Tipping, 2001, Sections 2 and 5.1). To make these priors non-informative (i.e., flat), we simply fix a and b to zero, assuming uniform scale priors for analysis and implementation. This formulation of prior distributions is a type of hierarchically constructed automatic relevance determination (HARD) prior, built upon the classic ARD prior (Neal, 1995; Tipping, 2001).
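As a sketch of the hierarchy (with symbols of our own choosing, since the original notation did not survive extraction), for a single architecture weight $w$ with switch variance $s$:

```latex
p(w \mid s) = \mathcal{N}(w \mid 0,\, s), \qquad
p(s^{-1}) = \operatorname{Gamma}(s^{-1} \mid a,\, b), \qquad a, b \to 0 .
```

Integrating out $s$ yields a Student's t marginal over $w$, which in the limit $a, b \to 0$ behaves like $p(w) \propto 1/|w|$: sharply peaked at zero and heavy-tailed, hence sparsity-promoting (Tipping, 2001).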
The posterior distribution for the parameters , and can then be obtained by applying Bayes’ rule:
(12) 
where is a normalization constant. Given a new input vector , we can make predictions for its output using the predictive distribution given by
(13)  
However, the exact computation of the posterior and the predictive distribution is not tractable in most cases. Therefore, in practice, we have to resort to approximate inference methods.
It should be noted that the prior precision is shared across all network parameters. However, it can be made different, or structured, to represent structural sparsity over convolutional kernels for network compression, which is related to Bayesian compression (Louizos et al., 2017) and structured sparsity compression (Wen et al., 2016). We give some examples in Figure 2, and more can be found in Appendix S2.2, where extremely sparse networks on MNIST and CIFAR-10 are obtained without accuracy deterioration. Since our main focus is on architecture parameters, without breaking the flow, we will fix the network-weight precision, which is equivalent to fixing the weight decay coefficient in SGD, i.e., the ℓ2 regularization coefficient for the network parameters.
(14) 
We assume that the data likelihood belongs to the exponential family
(15) 
where the exponent is the energy function over the data.
5.2 Laplace Approximation and Efficient Hessian Computation
In related Bayesian models, the quantity in equation 14 is known as the marginal likelihood, and its maximization is known as the type-II maximum likelihood method (Berger, 2013). Neural networks can also be treated in a Bayesian manner, known as Bayesian learning for neural networks (MacKay, 1992b; Neal, 1995). Several approaches have been proposed based on, e.g., the Laplace approximation (MacKay, 1992b), Hamiltonian Monte Carlo (Neal, 1995), expectation propagation (Jylänki et al., 2014; Hernández-Lobato & Adams, 2015), and variational inference (Hinton & Van Camp, 1993; Graves, 2011). Among these methods, we adopt the Laplace approximation. It requires computing the inverse Hessian of the log-likelihood, which can be infeasible for large networks. Nevertheless, we are motivated by: 1) its easy implementation, especially using recent popular deep learning open-source software; 2) its versatility for modern NN structures such as CNNs and RNNs as well as their modern variants; 3) the close relationship between Hessian computation and network compression using Hessian-based metrics (LeCun et al., 1990; Hassibi et al., 1993); 4) the acceleration of training convergence by the related second-order optimization algorithms (Botev et al., 2017). In this paper, we propose an efficient calculation/approximation of the Hessian for convolutional layers and architecture parameters. The detailed calculation procedures are explained in Appendix S3.2 and S3.3 respectively.
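To make the mechanics concrete, here is a minimal one-parameter sketch of the Laplace approximation (illustrative only; the paper's contribution is an efficient Hessian approximation for full convolutional layers and architecture parameters, not this toy version):

```python
def laplace_posterior(neg_log_lik, w_map, prior_precision=1.0, eps=1e-4):
    """Fit a Gaussian around the MAP estimate: the posterior mean is w_map,
    and the posterior variance is the inverse of (Hessian of the negative
    log-likelihood at w_map + prior precision). The Hessian is obtained here
    by a central second-order finite difference."""
    hessian = (neg_log_lik(w_map + eps) - 2.0 * neg_log_lik(w_map)
               + neg_log_lik(w_map - eps)) / eps ** 2
    return w_map, 1.0 / (hessian + prior_precision)

# A quadratic negative log-likelihood 0.5 * 4 * (w - 2)^2 has Hessian 4,
# so with unit prior precision the posterior variance is 1 / (4 + 1) = 0.2.
mean, var = laplace_posterior(lambda w: 0.5 * 4.0 * (w - 2.0) ** 2, w_map=2.0)
```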
5.3 Optimization Algorithm
As analyzed before, the optimization objective of the architecture search becomes removing redundant edges. The training algorithm proceeds in iterations, each of which may contain several epochs. The pseudo-code is summarized in Algorithm 1. The cost function is simply the maximum likelihood over the data with a regularization term whose intensity is controlled by the re-weighted coefficient
(16) 
The derivation can be found in Appendix S1.1 and S1.2. The algorithm mainly consists of five parts. The first part jointly trains the network and architecture parameters. The second part freezes the architecture parameters and prepares to compute their Hessian. The third part updates the variables associated with the architecture parameters. The fourth part prunes the architecture parameters, and the pruned network is trained in a standard way in the fifth part. As discussed previously on the drawback of the magnitude-based pruning metric,
(17)  
(18)  
(19)  
(20) 
we propose a new metric based on the maximum entropy of the distribution. Since the prior in equation 5 is a zero-mean Gaussian with variance equal to the switch value s, its differential entropy is (1/2) log(2πes). We prune the related edges when this entropy is non-positive, i.e., when s ≤ 1/(2πe).
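A minimal sketch of this pruning rule (the function name is ours, purely illustrative):

```python
import math

def keep_edge(s):
    """Differential entropy of a zero-mean Gaussian with variance s is
    0.5 * log(2 * pi * e * s); the edge is kept only while this entropy is
    positive, i.e. while s exceeds 1 / (2 * pi * e) ~= 0.0585."""
    return 0.5 * math.log(2.0 * math.pi * math.e * s) > 0.0

threshold = 1.0 / (2.0 * math.pi * math.e)
assert not keep_edge(0.5 * threshold)  # low-variance switch: edge pruned
assert keep_edge(2.0 * threshold)      # uncertain switch: edge retained
```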
The algorithm can be easily transferred to other scenarios. One scenario involves proxy tasks to find the cell. Similar to equation 16, we group the same edge/operation across the repeated stacked cells. The cost function for proxy tasks is then given as follows, in the form of a re-weighted group Lasso:
(21) 
The details are summarized in Algorithm 2 of Appendix S1.3. Another scenario is on Network Compression with structural sparsity, which is summarized in Algorithm 3 of Appendix S2.
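The grouped penalty can be sketched as follows (a re-weighted group Lasso; the grouping of shared operations across cells and the names below are illustrative, not the paper's exact formulation):

```python
import math

def reweighted_group_lasso(groups, coeffs):
    """Each group collects the copies of one edge/operation shared across
    the stacked cells; the penalty is the sum over groups of a re-weighting
    coefficient times the group's l2 norm. In BayesNAS, the coefficients for
    the next iteration are computed from the current solution and its
    posterior variance."""
    return sum(c * math.sqrt(sum(w * w for w in g))
               for g, c in zip(groups, coeffs))

# Two groups with l2 norms 5.0 and 0.0: penalty = 0.5 * 5.0 + 2.0 * 0.0
penalty = reweighted_group_lasso([[3.0, 4.0], [0.0, 0.0]], [0.5, 2.0])
```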
Table 1. Comparison with state-of-the-art image classifiers on CIFAR-10.

Architecture  Test Error (%)  Params (M)  Search Cost (GPU days)  Search Method
DenseNet-BC (Huang et al., 2017)  3.46  25.6  –  manual
NASNet-A + cutout (Zoph et al., 2018)  2.65  3.3  1800  RL
AmoebaNet-B + cutout (Real et al., 2019)  2.55±0.05  2.8  3150  evolution
Hierarchical Evo (Liu et al., 2018b)  3.75±0.12  15.7  300  evolution
PNAS (Liu et al., 2018a)  3.41±0.09  3.2  225  SMBO
ENAS + cutout (Pham et al., 2018)  2.89  4.6  0.5  RL
Random search baseline + cutout (Liu et al., 2019b)  3.29±0.15  3.2  1  random
DARTS (2nd order bi-level) + cutout (Liu et al., 2019b)  2.76±0.09  3.4  1  gradient
SNAS (single-level) + moderate constraint + cutout (Xie et al., 2019)  2.85±0.02  2.8  1.5  gradient
DSO-NAS-share + cutout (Zhang et al., 2019b)  2.84±0.07  3.0  1  gradient
Proxyless-G + cutout (Cai et al., 2019)  2.08  5.7  –  gradient
BayesNAS + cutout +  3.02±0.04  2.59±0.23  0.2  gradient
BayesNAS + cutout +  2.90±0.05  3.10±0.15  0.2  gradient
BayesNAS + cutout +  2.81±0.04  3.40±0.62  0.2  gradient
BayesNAS + TreeCell-A + PyramidNet backbone + cutout  2.41  3.4  0.1  gradient
Table 2. Comparison with state-of-the-art image classifiers on ImageNet (mobile setting).

Architecture  Test Error (%) top-1  top-5  Params (M)  Search Cost (GPU days)  Search Method
Inception-v1 (Szegedy et al., 2015)  30.2  10.1  6.6  –  manual
MobileNet (Howard et al., 2017)  29.4  10.5  4.2  –  manual
ShuffleNet 2× (v1) (Zhang et al., 2018)  29.1  10.2  5  –  manual
ShuffleNet 2× (v2) (Zhang et al., 2018)  26.3  –  5  –  manual
NASNet-A (Zoph et al., 2018)  26.0  8.4  5.3  1800  RL
NASNet-B (Zoph et al., 2018)  27.2  8.7  5.3  1800  RL
NASNet-C (Zoph et al., 2018)  27.5  9.0  4.9  1800  RL
AmoebaNet-A (Real et al., 2019)  25.5  8.0  5.1  3150  evolution
AmoebaNet-B (Real et al., 2019)  26.0  8.5  5.3  3150  evolution
AmoebaNet-C (Real et al., 2019)  24.3  7.6  6.4  3150  evolution
PNAS (Liu et al., 2018a)  25.8  8.1  5.1  225  SMBO
DARTS (Liu et al., 2019b)  26.9  9.0  4.9  4  gradient
BayesNAS ()  28.1  9.4  4.0  0.2  gradient
BayesNAS ()  27.3  8.4  3.3  0.2  gradient
BayesNAS ()  26.5  8.9  3.9  0.2  gradient
6 Experiments
The experiments focus on two scenarios in NAS: proxy NAS and proxyless NAS. For proxy NAS, we follow the pipeline of DARTS (Liu et al., 2019b) and SNAS (Xie et al., 2019). First, BayesNAS is applied to search for the best convolutional cells in a complete network on CIFAR-10. Then a network constructed by stacking the learned cells is retrained for performance comparison. For proxyless NAS, we follow the pipeline of ProxylessNAS (Cai et al., 2019). First, the tree-like cell from Cai et al. (2018) with multiple paths is integrated into PyramidNet (Han et al., 2017). Then we search for the optimal path(s) within each cell by BayesNAS. Finally, the network is reconstructed by retaining only the optimal path(s) and retrained on CIFAR-10 for performance comparison. Detailed experiment settings are in Appendix S4.1.
6.1 Proxy Search
Motivation
Unlike DARTS and SNAS, which rely on validation accuracy during or after the search, we use the switch values in BayesNAS as the performance evaluation criterion, which enables us to evaluate architectures in a one-shot manner.
Search Space
Our setup follows DARTS and SNAS, where convolutional cells of 7 nodes are stacked multiple times to form a network. The input nodes, i.e., the first and second nodes of cell k, are set equal to the outputs of cell k-2 and cell k-1 respectively, with 1×1 convolutions inserted as necessary, and the output node is the depthwise concatenation of all the intermediate nodes. Reduction cells are located at 1/3 and 2/3 of the total depth of the network to reduce the spatial resolution of the feature maps. Details of all included operations are shown in Appendix S4.1. Unlike DARTS and SNAS, we exclude zero operations.
Training Settings
In the searching stage, we train a small network stacked from 8 cells using BayesNAS with different regularization coefficients. This network size is determined to fit into a single GPU. Since we cache the feature maps in memory, we can only set the batch size to 18. The optimizer we use is SGD with momentum 0.9 and a fixed learning rate of 0.1. Other training setups follow DARTS and SNAS (Appendix S4.1). The search takes a few hours on a single GPU. (All the experiments were performed using NVIDIA TITAN V GPUs.)
Search Results
The normal and reduction cells learned on CIFAR-10 using BayesNAS are shown in Figures 3a and 3b. A large network of 20 cells, with reduction cells at 1/3 and 2/3 of the depth, is trained from scratch with a batch size of 128. The validation accuracy is presented in Table 1. The test error rate of BayesNAS is competitive with state-of-the-art techniques, and BayesNAS is able to find convolutional cells with fewer parameters than DARTS and SNAS.
6.2 Proxyless Search
Motivation
Using an existing tree-like cell, we apply BayesNAS to search for the optimal path(s) within each cell. In contrast to proxy search, cells do not share their architecture in proxyless search.
Search Space
The backbone used is PyramidNet, with three stages each consisting of bottleneck blocks. All convolutions in the bottleneck blocks are replaced by the tree-cell, which contains multiple possible paths, and grouped convolution is used within the cell. For the detailed structure of the tree-cell, we refer to Cai et al. (2018).
Training Settings
In the searching stage, we set the batch size to 32 and the learning rate to 0.1. We use the same optimizer as for proxy search, with a fixed regularization coefficient for each possible path.
Search Results
Because each cell can have a different structure in the proxyless setting, we show only two typical types of cell structure in Figures 4a and 4b. The first type is a chain-like structure where only one path exists in the cell, connecting the cell's input to its output. The second type is an inception-like structure where both divergence and convergence exist in the cell. Our further observation reveals that some cells are dispensable with respect to the entire network. After the architecture is determined, the network is trained from scratch with a batch size of 64, a learning rate of 0.1 and a cosine annealing learning rate decay schedule (Loshchilov & Hutter, 2017). The validation accuracy is also presented in Table 1. Although the test error increases slightly compared to Cai et al. (2019), there is a significant drop in the number of model parameters to be learned, which is beneficial for both training and inference.
7 Transferability to ImageNet
For the ImageNet mobile setting, the input images are of size 224×224. A network of 14 cells is trained for 250 epochs with batch size 128, weight decay, and an initial SGD learning rate of 0.1 (decayed by a factor of 0.97 after each epoch). The results in Table 2 show that the cell learned on CIFAR-10 can be transferred to ImageNet and is capable of achieving competitive performance.
8 Conclusion and Future Work
We introduced BayesNAS, which can directly learn a sparse neural network architecture. We significantly reduce the search time by using only one epoch to obtain the candidate architecture. Our current implementation is inefficient in that it caches all the feature maps in memory to compute the Hessian. However, Hessian computation can be done along with backpropagation, which will potentially further reduce the search time and scale our approach to larger search spaces.
Acknowledgements
The work of Hongpeng Zhou is sponsored by the program of China Scholarships Council (No.201706120017).
References
 Amari (1998) Amari, S.I. Natural gradient works efficiently in learning. Neural computation, 10(2):251–276, 1998.
 Baker et al. (2017) Baker, B., Gupta, O., Naik, N., and Raskar, R. Designing neural network architectures using reinforcement learning. International Conference on Learning Representations, 2017.
 Bender et al. (2018) Bender, G., Kindermans, P.J., Zoph, B., Vasudevan, V., and Le, Q. Understanding and simplifying oneshot architecture search. In International Conference on Machine Learning, pp. 549–558, 2018.
 Berger (2013) Berger, J. O. Statistical decision theory and Bayesian analysis. Springer Science & Business Media, 2013.
 Botev et al. (2017) Botev, A., Ritter, H., and Barber, D. Practical gaussnewton optimisation for deep learning. ICML, 2017.
 Boyd & Vandenberghe (2004) Boyd, S. and Vandenberghe, L. Convex optimisation. Cambridge university press, 2004.
 Brock et al. (2018) Brock, A., Lim, T., Ritchie, J., and Weston, N. SMASH: Oneshot model architecture search through hypernetworks. In International Conference on Learning Representations, 2018.
 Cai et al. (2018) Cai, H., Yang, J., Zhang, W., Han, S., and Yu, Y. Pathlevel network transformation for efficient architecture search. In ICML, volume 80 of Proceedings of Machine Learning Research, pp. 677–686. PMLR, 2018.
 Cai et al. (2019) Cai, H., Zhu, L., and Han, S. ProxylessNAS: Direct neural architecture search on target task and hardware. In International Conference on Learning Representations, 2019. URL https://arxiv.org/pdf/1812.00332.pdf.
 Candes et al. (2008) Candes, E. J., Wakin, M. B., and Boyd, S. P. Enhancing sparsity by reweighted ℓ1 minimization. Journal of Fourier analysis and applications, 14(5-6):877–905, 2008.
 Chauvin (1989) Chauvin, Y. A backpropagation algorithm with optimal use of hidden units. In Advances in neural information processing systems, pp. 519–526, 1989.
 Courbariaux et al. (2015) Courbariaux, M., Bengio, Y., and David, J.-P. BinaryConnect: Training deep neural networks with binary weights during propagations. In Advances in neural information processing systems, pp. 3123–3131, 2015.
 DeVries & Taylor (2017) DeVries, T. and Taylor, G. W. Improved regularization of convolutional neural networks with cutout, 2017.
 Elsken et al. (2019a) Elsken, T., Metzen, J. H., and Hutter, F. Efficient multiobjective neural architecture search via lamarckian evolution. In International Conference on Learning Representations, 2019a.
 Elsken et al. (2019b) Elsken, T., Metzen, J. H., and Hutter, F. Neural architecture search: A survey. Journal of Machine Learning Research, 20(55):1–21, 2019b.
 Gordon et al. (2018) Gordon, A., Eban, E., Nachum, O., Chen, B., Wu, H., Yang, T.-J., and Choi, E. MorphNet: Fast & simple resource-constrained structure learning of deep networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1586–1595, 2018.
 Graves (2011) Graves, A. Practical variational inference for neural networks. In Advances in neural information processing systems, pp. 2348–2356, 2011.
 Guo et al. (2016) Guo, Y., Yao, A., and Chen, Y. Dynamic network surgery for efficient dnns. In Advances In Neural Information Processing Systems, pp. 1379–1387, 2016.
 Han et al. (2017) Han, D., Kim, J., and Kim, J. Deep pyramidal residual networks. In Computer Vision and Pattern Recognition (CVPR), 2017 IEEE Conference on, pp. 6307–6315. IEEE, 2017.
 Han et al. (2016) Han, S., Mao, H., and Dally, W. J. Deep compression: Compressing deep neural network with pruning, trained quantization and huffman coding. In International Conference on Learning Representations, 2016.
 Hassibi et al. (1993) Hassibi, B., Stork, D. G., and Wolff, G. J. Optimal brain surgeon and general network pruning. In IEEE International Conference on Neural Networks, pp. 293–299 vol.1, March 1993. doi: 10.1109/ICNN.1993.298572.
 He et al. (2016) He, K., Zhang, X., Ren, S., and Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770–778, 2016.
 HernándezLobato & Adams (2015) HernándezLobato, J. M. and Adams, R. Probabilistic backpropagation for scalable learning of bayesian neural networks. In International Conference on Machine Learning, pp. 1861–1869, 2015.
 Hinton & Van Camp (1993) Hinton, G. E. and Van Camp, D. Keeping the neural networks simple by minimizing the description length of the weights. In Proceedings of the sixth annual conference on Computational learning theory, pp. 5–13. ACM, 1993.
 Howard et al. (2017) Howard, A. G., Zhu, M., Chen, B., Kalenichenko, D., Wang, W., Weyand, T., Andreetto, M., and Adam, H. Mobilenets: Efficient convolutional neural networks for mobile vision applications, 2017.
 Huang et al. (2017) Huang, G., Liu, Z., Van Der Maaten, L., and Weinberger, K. Q. Densely connected convolutional networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 4700–4708, 2017.
 Ishikawa (1996) Ishikawa, M. Structural learning with forgetting. Neural networks, 9(3):509–521, 1996.
 Jylänki et al. (2014) Jylänki, P., Nummenmaa, A., and Vehtari, A. Expectation propagation for neural networks with sparsitypromoting priors. The Journal of Machine Learning Research, 15(1):1849–1901, 2014.
 LeCun (1998) LeCun, Y. The MNIST database of handwritten digits. http://yann.lecun.com/exdb/mnist/, 1998.
 LeCun et al. (1990) LeCun, Y., Denker, J. S., and Solla, S. A. Optimal brain damage. In Advances in neural information processing systems, pp. 598–605, 1990.
 Lee et al. (2019) Lee, N., Ajanthan, T., and Torr, P. H. S. SNIP: Single-shot network pruning based on connection sensitivity. In International Conference on Learning Representations, 2019.
 Liu et al. (2018a) Liu, C., Zoph, B., Neumann, M., Shlens, J., Hua, W., Li, L.-J., Fei-Fei, L., Yuille, A., Huang, J., and Murphy, K. Progressive neural architecture search. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 19–34, 2018a.
 Liu et al. (2019a) Liu, C., Chen, L.-C., Schroff, F., Adam, H., Hua, W., Yuille, A., and Fei-Fei, L. Auto-DeepLab: Hierarchical neural architecture search for semantic image segmentation, 2019a.
 Liu et al. (2018b) Liu, H., Simonyan, K., Vinyals, O., Fernando, C., and Kavukcuoglu, K. Hierarchical representations for efficient architecture search. In International Conference on Learning Representations, 2018b. URL https://openreview.net/forum?id=BJQRKzbA.
 Liu et al. (2019b) Liu, H., Simonyan, K., and Yang, Y. DARTS: Differentiable architecture search. In International Conference on Learning Representations, 2019b. URL https://openreview.net/forum?id=S1eYHoC5FX.
 Loshchilov & Hutter (2017) Loshchilov, I. and Hutter, F. SGDR: Stochastic gradient descent with warm restarts. In International Conference on Learning Representations, 2017.
 Louizos et al. (2017) Louizos, C., Ullrich, K., and Welling, M. Bayesian compression for deep learning. In Advances in Neural Information Processing Systems, pp. 3288–3298, 2017.
 Louizos et al. (2018) Louizos, C., Welling, M., and Kingma, D. P. Learning sparse neural networks through L0 regularization. In International Conference on Learning Representations, 2018. URL https://openreview.net/forum?id=H1Y8hhg0b.
 Ma & Lu (2017) Ma, W. and Lu, J. An equivalence of fully connected layer and convolutional layer, 2017.
 MacKay (1992a) MacKay, D. J. Bayesian interpolation. Neural computation, 4(3):415–447, 1992a.
 MacKay (1992b) MacKay, D. J. A practical bayesian framework for backpropagation networks. Neural computation, 4(3):448–472, 1992b.
 MacKay (1996) MacKay, D. J. Bayesian methods for backpropagation networks. In Models of neural networks III, pp. 211–254. Springer, 1996.
 Martens & Grosse (2015) Martens, J. and Grosse, R. Optimizing neural networks with kroneckerfactored approximate curvature. In International conference on machine learning, pp. 2408–2417, 2015.
 Miikkulainen et al. (2019) Miikkulainen, R., Liang, J., Meyerson, E., Rawal, A., Fink, D., Francon, O., Raju, B., Shahrzad, H., Navruzyan, A., Duffy, N., et al. Evolving deep neural networks. In Artificial Intelligence in the Age of Neural Networks and Brain Computing, pp. 293–312. Elsevier, 2019.
 Molchanov et al. (2017a) Molchanov, D., Ashukha, A., and Vetrov, D. Variational dropout sparsifies deep neural networks. In Proceedings of the 34th International Conference on Machine Learning  Volume 70, ICML’17, pp. 2498–2507. JMLR.org, 2017a. URL http://dl.acm.org/citation.cfm?id=3305890.3305939.
 Molchanov et al. (2017b) Molchanov, P., Tyree, S., Karras, T., Aila, T., and Kautz, J. Pruning convolutional neural networks for resource efficient inference. In International Conference on Learning Representations, 2017b.
 Neal (1995) Neal, R. M. Bayesian learning for neural networks. 1995.
 Neklyudov et al. (2017) Neklyudov, K., Molchanov, D., Ashukha, A., and Vetrov, D. P. Structured bayesian pruning via lognormal multiplicative noise. In Advances in Neural Information Processing Systems, pp. 6775–6784, 2017.
 Nocedal & Wright (2006) Nocedal, J. and Wright, S. J. Numerical Optimization. Springer, 2006.
 Pham et al. (2018) Pham, H. Q., Guan, M. Y., Zoph, B., Le, Q. V., and Dean, J. Efficient neural architecture search via parameter sharing. In ICML, 2018.
 Real et al. (2017) Real, E., Moore, S., Selle, A., Saxena, S., Suematsu, Y. L., Tan, J., Le, Q. V., and Kurakin, A. Large-scale evolution of image classifiers. In Proceedings of the 34th International Conference on Machine Learning - Volume 70, pp. 2902–2911. JMLR.org, 2017.
 Real et al. (2019) Real, E., Aggarwal, A., Huang, Y., and Le, Q. V. Regularized evolution for image classifier architecture search. In AAAI, 2019.
 Saxena & Verbeek (2016) Saxena, S. and Verbeek, J. Convolutional neural fabrics. In Advances in Neural Information Processing Systems, pp. 4053–4061, 2016.
 Szegedy et al. (2015) Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S., Anguelov, D., Erhan, D., Vanhoucke, V., and Rabinovich, A. Going deeper with convolutions. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 1–9, 2015.
 Tipping (2001) Tipping, M. E. Sparse bayesian learning and the relevance vector machine. Journal of machine learning research, 1(Jun):211–244, 2001.
 Ullrich et al. (2017) Ullrich, K., Meeds, E., and Welling, M. Soft weightsharing for neural network compression. In International Conference on Learning Representations, 2017.
 Weigend et al. (1991) Weigend, A. S., Rumelhart, D. E., and Huberman, B. A. Generalization by weightelimination with application to forecasting. In Advances in neural information processing systems, pp. 875–882, 1991.
 Wen et al. (2016) Wen, W., Wu, C., Wang, Y., Chen, Y., and Li, H. Learning structured sparsity in deep neural networks. In Advances in Neural Information Processing Systems, pp. 2074–2082, 2016.
 Xie & Yuille (2017) Xie, L. and Yuille, A. Genetic CNN. In 2017 IEEE International Conference on Computer Vision (ICCV), pp. 1388–1397. IEEE, 2017.
 Xie et al. (2019) Xie, S., Zheng, H., Liu, C., and Lin, L. SNAS: stochastic neural architecture search. In International Conference on Learning Representations, 2019. URL https://openreview.net/forum?id=rylqooRqK7.
 Zhang et al. (2019a) Zhang, C., Ren, M., and Urtasun, R. Graph hypernetworks for neural architecture search. In International Conference on Learning Representations, 2019a. URL https://openreview.net/forum?id=rkgW0oA9FX.
 Zhang et al. (2018) Zhang, X., Zhou, X., Lin, M., and Sun, J. ShuffleNet: An extremely efficient convolutional neural network for mobile devices. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6848–6856, 2018.
 Zhang et al. (2019b) Zhang, X., Huang, Z., and Wang, N. Single shot neural architecture search via direct sparse optimization, 2019b. URL https://openreview.net/forum?id=ryxjH3R5KQ.
 Zhong et al. (2018) Zhong, Z., Yan, J., Wu, W., Shao, J., and Liu, C.-L. Practical block-wise neural network architecture generation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2423–2432, 2018.
 Zoph & Le (2017) Zoph, B. and Le, Q. V. Neural architecture search with reinforcement learning. In International Conference on Learning Representations, 2017.
 Zoph et al. (2018) Zoph, B., Vasudevan, V., Shlens, J., and Le, Q. V. Learning transferable architectures for scalable image recognition. 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 8697–8710, 2018.
Appendix
Appendix S1 BayesNAS Algorithm Derivation
s1.1 Algorithm Derivation
In this subsection, we explain the detailed algorithm for updating the hyperparameters of the abstracted single motif shown in Figure 1e. We first state a proposition about the optimization objective.
Proposition
Suppose the likelihood of the architecture parameters $w$ of a neural network can be formulated as an exponential family distribution $p(\mathcal{D} \mid w) \propto \exp(-E_{\mathcal{D}}(w))$, where $\mathcal{D}$ is the given dataset, a hyperparameter stands for the uncertainty, and $E_{\mathcal{D}}$ represents the energy function over the data. The sparse prior with a super-Gaussian distribution for each architecture parameter has been defined in equation 11. The unknown architecture parameters $w$ of the network and the hyperparameters $\gamma$ can be approximately obtained by solving the following optimization problem
(S1.1.1) 
In particular, for the architecture parameter associated with one operation of an edge, the optimization problem can be reformulated as:
(S1.1.2)  
where the expansion point is arbitrary. It should also be noted that the last term represents the uncertainty of the architecture parameter without considering the dependency between the incoming and outgoing edges, where each index stands for one possible operation on the corresponding edge.
Proof
Given the likelihood with exponential family distribution $p(\mathcal{D} \mid w) \propto \exp(-E_{\mathcal{D}}(w))$, as explained in equation 5, we define the prior of $w$ as the Gaussian distribution $p(w \mid \gamma) = \mathcal{N}(w; 0, \Gamma)$ with $\Gamma = \operatorname{diag}(\gamma)$.
The marginal likelihood can be calculated as
$$p(\mathcal{D} \mid \gamma) = \int p(\mathcal{D} \mid w)\, p(w \mid \gamma)\, dw. \quad \text{(S1.1.3)}$$
Typically, this integral is intractable or has no analytical solution.
The mean and covariance are fixed once the approximating family is chosen to be Gaussian. Performing a second-order Taylor series expansion around some point $\hat{w}$, the energy $E_{\mathcal{D}}(w)$ can be approximated as
$$E_{\mathcal{D}}(w) \approx E_{\mathcal{D}}(\hat{w}) + g(\hat{w})^\top (w - \hat{w}) + \tfrac{1}{2} (w - \hat{w})^\top H(\hat{w}) (w - \hat{w}), \quad \text{(S1.1.4)}$$
where $g(\hat{w})$ is the gradient and $H(\hat{w})$ is the Hessian of the energy function:
$$g(\hat{w}) = \nabla E_{\mathcal{D}}(w) \big|_{w = \hat{w}}, \quad \text{(S1.1.5a)}$$
$$H(\hat{w}) = \nabla^2 E_{\mathcal{D}}(w) \big|_{w = \hat{w}}. \quad \text{(S1.1.5b)}$$
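As a quick sanity check of the quadratic expansion above, the following sketch compares a toy scalar energy against its second-order Taylor approximation around a point; the energy function here is an arbitrary illustrative choice, not one from the paper:

```python
# Toy energy E(w) = w^4 / 4 with analytic gradient and Hessian.
def energy(w):
    return 0.25 * w ** 4

def gradient(w):
    return w ** 3      # dE/dw

def hessian(w):
    return 3 * w ** 2  # d^2E/dw^2

w_hat, dw = 1.0, 0.01  # expansion point and a small perturbation
taylor = energy(w_hat) + gradient(w_hat) * dw + 0.5 * hessian(w_hat) * dw ** 2
exact = energy(w_hat + dw)
# The residual is third order in dw, so it is tiny for small steps.
assert abs(exact - taylor) < 1e-5
```

For a quartic energy the residual of the quadratic approximation scales as dw³, which is why the assertion passes with a comfortable margin at dw = 0.01.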
To derive the cost function in equation S1.1.2, we introduce the posterior mean and covariance given by the Laplace approximation:
$$\mu = \Sigma \left( H(\hat{w})\, \hat{w} - g(\hat{w}) \right), \quad \text{(S1.1.6a)}$$
$$\Sigma = \left( H(\hat{w}) + \Gamma^{-1} \right)^{-1}. \quad \text{(S1.1.6b)}$$
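For a diagonal Hessian and an ARD prior with per-parameter variances γ, the posterior covariance reduces to an elementwise expression. The sketch below is a minimal illustration of this standard Laplace-approximation step under those assumptions, not the paper's implementation:

```python
def laplace_posterior(h_diag, g, w_hat, gamma):
    """Diagonal Laplace posterior: Sigma = (H + Gamma^-1)^-1, mu = Sigma (H w_hat - g)."""
    # Elementwise inverse because H and Gamma are both assumed diagonal.
    sigma = [1.0 / (h + 1.0 / gm) for h, gm in zip(h_diag, gamma)]
    mu = [s * (h * w - gi) for s, h, w, gi in zip(sigma, h_diag, w_hat, g)]
    return mu, sigma

# Two-parameter example: a small prior variance gamma shrinks the
# posterior mean and variance toward zero (the ARD pruning effect).
mu, sigma = laplace_posterior(h_diag=[2.0, 4.0], g=[0.0, 0.0],
                              w_hat=[1.0, -1.0], gamma=[1.0, 0.5])
```

As γ for a parameter approaches zero, its posterior variance and mean both collapse to zero, which is the mechanism by which the hierarchical ARD prior prunes architecture parameters.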
Then define the following quantities
(S1.1.7a)  
(S1.1.7b)  
(S1.1.7c)  
(S1.1.7d) 
Now the approximated likelihood is the exponential of a quadratic form and is therefore Gaussian,
$$p(\mathcal{D} \mid w) \approx C\, \mathcal{N}\!\left(w;\; \hat{w} - H(\hat{w})^{-1} g(\hat{w}),\; H(\hat{w})^{-1}\right), \quad \text{(S1.1.8)}$$
where $C$ is a constant independent of $w$.
We can write the approximate marginal likelihood as
$$p(\mathcal{D} \mid \gamma) \approx \int C\, \mathcal{N}\!\left(w;\; \hat{w} - H(\hat{w})^{-1} g(\hat{w}),\; H(\hat{w})^{-1}\right) p(w \mid \gamma)\, dw, \quad \text{(S1.1.9)}$$
where