One-Shot Neural Architecture Search Through A Posteriori Distribution Guided Sampling
Abstract
The emergence of one-shot approaches has greatly advanced research on neural architecture search (NAS). Recent approaches train an overparameterized supernetwork (one-shot model) and then sample and evaluate a number of subnetworks, which inherit weights from the one-shot model. The overall search cost is significantly reduced because the subnetworks are not trained. However, the network sampling process is treated casually, and the weights inherited from an independently trained supernetwork perform suboptimally for subnetworks. In this paper, we propose a novel one-shot NAS scheme to address these issues. The key innovation is to explicitly estimate the joint a posteriori distribution over network architecture and weights, and to sample networks for evaluation according to it. This brings two benefits. First, network sampling under the guidance of the a posteriori probability is more efficient than conventional random or uniform sampling. Second, the network architecture and its weights are sampled as a pair, which alleviates the suboptimal weights problem. Note that estimating the joint a posteriori distribution is not a trivial problem. By adopting variational methods and introducing a hybrid network representation, we convert the distribution approximation problem into an end-to-end neural network training problem, which is neatly approached by variational dropout. As a result, the proposed method reduces the number of sampled subnetworks by orders of magnitude. We validate our method on the fundamental image classification task. Results on Cifar10, Cifar100 and ImageNet show that our method strikes the best trade-off between precision and speed among NAS methods. On Cifar10, we speed up the search process by 20x and achieve a higher precision than the best network found by existing NAS methods.
Yizhou Zhou Xiaoyan Sun Chong Luo ZhengJun Zha Wenjun Zeng University of Science and Technology of China zyz0205@mail.ustc.edu.cn, zhazj@ustc.edu.cn Microsoft Research Asia {xysun,chong.luo,wezeng}@microsoft.com
Preprint. Under review.
1 Introduction
Neural architecture search (NAS), which automates the design of artificial neural networks (ANNs), has received increasing attention in recent years. It is capable of finding ANNs that achieve similar or even better performance than manually designed ones. NAS is essentially a bilevel optimization task, as shown in Fig. 1(a). Let $\mathcal{A}$ denote the set of possible network architectures under a predefined search space. Let $a$ and $w_a$ denote an architecture in $\mathcal{A}$ and its corresponding weights, respectively. The lower-level objective optimizes the weights as
$$w_a = \arg\min_{w} \mathcal{L}_{train}(\mathcal{N}(a, w)), \tag{1}$$
where $\mathcal{L}_{train}(\cdot)$ is the loss criterion evaluated on the training dataset $\mathcal{D}_{train}$ and $\mathcal{N}(a, w)$ denotes the network with architecture $a$ and weight $w$. The upper-level objective optimizes the network architecture on the validation dataset, using the weight that has been optimized by the lower-level task, as
$$a^* = \arg\min_{a \in \mathcal{A}} \mathcal{L}_{val}(\mathcal{N}(a, w_a)), \tag{2}$$
where $\mathcal{L}_{val}(\cdot)$ is the loss criterion on the validation dataset $\mathcal{D}_{val}$. To solve this bilevel problem, approaches based on evolution [1, 2], reinforcement learning [3, 4, 5, 6, 7, 8, 2, 9, 10, 11, 12, 13] or gradient-based methods [14, 15, 16, 17] have been proposed. However, most of these methods suffer from high computational complexity (often on the order of thousands of GPU days) [1, 2, 3, 4, 5] or lack convergence guarantees [15, 14, 17].
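The bilevel structure of Eqs. (1) and (2) can be illustrated with a toy sketch. This is not the paper's search space: here an "architecture" is assumed to be a subset of input features of a linear model, the lower level is an ordinary least-squares fit on the training split, and the upper level picks the subset with the lowest validation error. All names (`fit_weights`, `val_loss`, `bilevel_search`) are hypothetical.

```python
import itertools
import numpy as np

def fit_weights(X, y, arch):
    """Lower level: least-squares weights for the features selected by `arch`."""
    w, *_ = np.linalg.lstsq(X[:, arch], y, rcond=None)
    return w

def val_loss(X, y, arch, w):
    """Upper-level criterion: mean squared error on held-out data."""
    return float(np.mean((X[:, arch] @ w - y) ** 2))

def bilevel_search(X_tr, y_tr, X_val, y_val, num_features):
    """Exhaustively solve the lower level for every architecture, then rank."""
    best = None
    for r in range(1, num_features + 1):
        for arch in itertools.combinations(range(num_features), r):
            arch = list(arch)
            w = fit_weights(X_tr, y_tr, arch)        # train weights for this architecture
            loss = val_loss(X_val, y_val, arch, w)   # evaluate on validation data
            if best is None or loss < best[0]:
                best = (loss, arch, w)
    return best

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
# Only features 0 and 2 carry signal in this synthetic task.
y = 2.0 * X[:, 0] - 1.0 * X[:, 2] + 0.1 * rng.normal(size=200)
best_loss, best_arch, best_w = bilevel_search(X[:100], y[:100], X[100:], y[100:], 4)
```

Every candidate architecture requires its own training run in this scheme, which is the source of the high cost noted above.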
Rather than directly tackling the bilevel problem, some attempts [18, 19, 20, 21, 22, 17] relax the discrete search space to a continuous one denoted by , which can be written into where denotes the continuous relaxation and stands for the topology of the relaxed architecture. The weight and architecture are jointly optimized with a single objective function
(3) 
Then the optimal architecture is derived by discretizing the continuous architecture . These methods greatly simplify the optimization problem and enable end-to-end training. However, since the validation set is not involved in Eq. (3), the search results are inevitably biased towards the training dataset.
More recent NAS methods tend to reduce the computational complexity by decoupling the bilevel optimization problem into a sequential one [23, 16, 24]. Specifically, a supernetwork (one-shot model) $\mathcal{A}_o$ is defined and the search space is constrained to contain only subnetworks of $\mathcal{A}_o$. As shown in Fig. 1(b), recent one-shot NAS methods first optimize the weights of the supernetwork by solving
$$W_{\mathcal{A}_o} = \arg\min_{W} \mathcal{L}_{train}(\mathcal{N}(\mathcal{A}_o, W)). \tag{4}$$
Then a number of subnetworks are sampled from $\mathcal{A}_o$ and the best-performing subnetwork is picked out with
$$a^* = \arg\min_{a \in \mathcal{A}_o} \mathcal{L}_{val}(\mathcal{N}(a, W_{\mathcal{A}_o}(a))), \tag{5}$$
where $W_{\mathcal{A}_o}(a)$ denotes the weights of architecture $a$ inherited from $W_{\mathcal{A}_o}$. The core assumption of this one-shot NAS method is that the best-performing subnetwork shares weights with the optimal supernetwork, so that sampled subnetworks do not need to be retrained during the search. This greatly boosts the efficiency of NAS. However, this assumption does not always hold. A clue can be found in the common practice that previous one-shot methods rely on fine-tuning to further improve the performance of the best model found. Previous research has also pointed out that the mismatch between the weights and architectures of sampled subnetworks could jeopardize the subsequent ranking results [17]. Besides, the search process is treated casually, by random or uniform sampling. We believe there is large room for improvement in efficiency.
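The two-step one-shot procedure of Eqs. (4) and (5), and the weight mismatch it can cause, can be sketched on a toy problem. The setup is an assumption made for illustration only: the "supernetwork" is a linear model on all features, a "subnetwork" keeps a subset of features and inherits the corresponding entries of the supernetwork weights, and features are deliberately correlated so that inherited weights are not optimal for the subnetwork.

```python
import itertools
import numpy as np

def mse(X, y, w):
    """Mean squared error of the linear model X @ w."""
    return float(np.mean((X @ w - y) ** 2))

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
X[:, 1] = 0.8 * X[:, 0] + 0.6 * X[:, 1]   # correlated features: weights interact
y = 2.0 * X[:, 0] - 1.0 * X[:, 2] + 0.1 * rng.normal(size=200)
X_tr, y_tr, X_val, y_val = X[:100], y[:100], X[100:], y[100:]

# Step 1 (analogue of Eq. (4)): train the supernetwork once on all features.
w_super, *_ = np.linalg.lstsq(X_tr, y_tr, rcond=None)

# Step 2 (analogue of Eq. (5)): rank subnetworks with inherited weights, no retraining.
candidates = [list(a) for r in (1, 2, 3) for a in itertools.combinations(range(4), r)]
ranked = sorted(candidates, key=lambda a: mse(X_val[:, a], y_val, w_super[a]))
best_arch = ranked[0]

# Compare inherited weights with weights retrained for the selected subnetwork.
w_retrain, *_ = np.linalg.lstsq(X_tr[:, best_arch], y_tr, rcond=None)
inherit_train = mse(X_tr[:, best_arch], y_tr, w_super[best_arch])
retrain_train = mse(X_tr[:, best_arch], y_tr, w_retrain)
```

On the training split the retrained weights can never do worse than the inherited ones, which is one way to see why fine-tuning is commonly applied after one-shot search.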
In this paper, we propose a novel NAS strategy, namely NAS through A posteriori distribution guided Sampling (NASAS). In NASAS, we propose to estimate the a posteriori distribution over the architecture and weight pair $(a, w)$ with a variational distribution $q_\theta(a, w)$, where $\theta$ denotes the variational parameters. The optimal $\theta$, denoted by $\theta^*$, can be found by
$$\theta^* = \arg\min_{\theta} \mathrm{D}\big(q_\theta(a, w) \,\|\, p(a, w \mid \mathcal{D}_{train})\big), \tag{6}$$
where $\mathrm{D}(\cdot \,\|\, \cdot)$ measures the distance between two distributions. Note that finding $\theta^*$ is not a trivial problem and the details will be presented in Section 2. After $\theta^*$ is found, we can look for the optimal architecture $a^*$ by
$$a^* = \arg\min_{(a, w) \sim q_{\theta^*}(a, w)} \mathcal{L}_{val}(\mathcal{N}(a, w)). \tag{7}$$
In a nutshell, NASAS leverages the training dataset to estimate the a posteriori distribution, based on which sampling is performed, and then uses the validation set for performance evaluation.
The flow chart of NASAS is illustrated in Fig. 1(c). Our work has two main innovations compared with recently proposed one-shot approaches. First, we greatly improve the efficiency of the network search process through guided sampling. As a result, the search time needed to reach the best performance can be reduced by orders of magnitude. Second, we approximate the joint distribution over architecture and weights to alleviate the mismatch problem mentioned earlier. This not only improves the reliability of the ranking results, but also allows us to directly output the best-performing network found without fine-tuning. We evaluate NASAS on the image classification task. It achieves 1.98% test error at 11.1 GPU days on Cifar10, while the best network found by existing NAS methods achieves only 2.07% test error at 200 GPU days. NASAS also achieves state-of-the-art performance with 14.8% test error at 8.7 GPU days on Cifar100, and 24.80% test error at around 40 GPU days on ImageNet under the relaxed mobile setting.
2 NASAS
In this section, we first formulate the target problem of our NASAS, and then propose an end-to-end trainable solution to estimate the joint a posteriori distribution over architectures and weights, followed by an efficient sampling and ranking scheme to facilitate the search process.
2.1 Notation and Problem Formulation
Given a one-shot model , let denote the convolution weight matrix for layer with spatial kernel size , and and denote the number of input and output channels, respectively. We use to denote the sliced kernel operated on the input channel dimension and use to denote the weights of the whole one-shot model. As deriving a subnetwork in is equivalent to deactivating a set of convolution kernels, a subnetwork architecture can be specified by a set of random variables , where indicates deactivating (zero) or activating (one) convolution kernel . Later on we will use boldface for random variables.
Although we need a joint a posteriori distribution over and , we do not have to explicitly derive the joint distribution, since deactivating or activating a convolution kernel is also equivalent to multiplying the kernel by a binary mask. Instead, we combine them as a new random variable , where . Thus, the key problem in our NASAS is to estimate the a posteriori distribution over the hybrid network representation . Mathematically,
(8) 
where X and Y denote the training samples and labels, respectively. is likelihood that can be inferred by where denotes a subnetwork defined by hybrid representation . is the a priori distribution of hybrid representation . Because the marginalized likelihood in Eq. (8) is intractable, we use a variational distribution to approximate the true a posteriori distribution and reformulate our target problem as
(9) 
Here we choose KL divergence and accuracy to instantiate and , respectively.
2.2 A Posteriori Distribution Approximation
We employ Variational Inference (VI) to approximate the true a posteriori distribution with by minimizing the negative Evidence Lower Bound (ELBO)
(10) 
where is the number of training samples. Inspired by [25, 26], we propose solving Eq. (10) by the network-friendly Variational Dropout.
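Why minimizing the negative ELBO approximates the true a posteriori distribution follows from a standard identity, sketched below. The symbols $\phi$ (hybrid representation), $q_\theta$ (variational distribution) and $\mathcal{L}_{VI}$ are notation assumed for this sketch and may differ from the paper's own.

```latex
\begin{aligned}
\log p(Y \mid X)
  &= \underbrace{\mathbb{E}_{q_\theta(\phi)}\big[\log p(Y \mid X, \phi)\big]
     - \mathrm{KL}\big(q_\theta(\phi) \,\|\, p(\phi)\big)}_{\mathrm{ELBO}(\theta)}
     + \mathrm{KL}\big(q_\theta(\phi) \,\|\, p(\phi \mid X, Y)\big), \\
\mathcal{L}_{VI}(\theta)
  &= -\mathrm{ELBO}(\theta)
   = -\int q_\theta(\phi) \log p(Y \mid X, \phi)\, d\phi
     + \mathrm{KL}\big(q_\theta(\phi) \,\|\, p(\phi)\big).
\end{aligned}
```

Because $\log p(Y \mid X)$ does not depend on $\theta$, any decrease in $\mathcal{L}_{VI}(\theta)$ is matched by an equal decrease in the KL divergence between $q_\theta(\phi)$ and the true a posteriori distribution, without ever evaluating the intractable marginal likelihood.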
2.2.1 Approximation by Network Training
We employ the reparameterization trick [27] and choose a deterministic and differentiable transformation function that reparameterizes the as , where is a parameter-free distribution. Taking a univariate Gaussian distribution as an example, its reparameterization can be with , where and are the variational parameters . Gal et al. [25, 26] have shown that when the network weight is reparameterized with
(11) 
the function draw w.r.t. the variational distribution over network weights can be efficiently implemented via network inference. Concretely, the function draw is equivalent to randomly drawing a masked deterministic weight matrix in a neural network, which is known as the Dropout operation [28]. Similarly, we replace in our hybrid representation with , and reformulate as
(12) 
In Eq. (12), we have an additional random variable that controls the activation of kernels whose distribution is unknown. Here we propose using the marginal probability to characterize its behavior, because the marginal can reflect the expected probability of selecting kernel given the training dataset. It exactly matches the real behavior if the selections of kernels in a one-shot model are independent. Since the joint distribution of network architecture is a multivariate Bernoulli distribution, its marginal distribution obeys [29], where now is also the variational parameter that should be optimized. Therefore, we have
(13) 
Here we omit the subscript in the original because the importance of branches which come from the same kernel size group and layer should be identical. By replacing with a new variable , Eq. (13) has the same form as Eq. (11). Now Eq. (10) can be rewritten as
(14) 
where variational parameters are composed of both the deterministic kernel weights and the distribution of network architecture. The expected log likelihood (the integral term) in the equation above is usually estimated by Monte Carlo (MC) estimation
(15) 
Eq. (15) indicates that the (negative) ELBO can be computed very efficiently. It is equivalent to the KL term minus the log likelihood that is inferred by the one-shot network (now reparameterized as ). During each network inference, convolution kernels are randomly deactivated w.r.t. probability , which is exactly equivalent to a dropout neural network.
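The Monte Carlo estimate of the expected log likelihood can be sketched with plain dropout forward passes. The sketch below is a stand-in rather than the paper's network: it assumes a single linear layer with a softmax output, treats each row of the deterministic weight matrix `M` as one "kernel", and uses one shared keep probability. The names `mc_log_likelihood` and `keep_prob` are hypothetical.

```python
import numpy as np

def log_softmax(logits):
    """Numerically stable log-softmax over the class dimension."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))

def mc_log_likelihood(X, y, M, keep_prob, num_draws, rng):
    """Average, over `num_draws` dropout masks, of the summed log likelihood."""
    total = 0.0
    for _ in range(num_draws):
        # One Bernoulli variable per row of M (a stand-in for one convolution kernel).
        mask = rng.binomial(1, keep_prob, size=(M.shape[0], 1))
        log_probs = log_softmax(X @ (M * mask))   # forward pass with masked weights
        total += log_probs[np.arange(len(y)), y].sum()
    return total / num_draws

rng = np.random.default_rng(0)
X = rng.normal(size=(32, 6))
y = rng.integers(0, 3, size=32)
M = rng.normal(size=(6, 3))
estimate = mc_log_likelihood(X, y, M, keep_prob=0.7, num_draws=20, rng=rng)
no_dropout = mc_log_likelihood(X, y, M, keep_prob=1.0, num_draws=1, rng=rng)
```

With `keep_prob=1.0` every mask is all ones, so the estimate reduces to the ordinary log likelihood of the unmasked model.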
Now, approximating the a posteriori distribution over the hybrid network representation is converted to optimizing the one-shot model with dropout and a KL regularization term. If the derivatives of both terms are tractable, we can efficiently train it in an end-to-end fashion.
2.2.2 Network Optimization
In addition to the variational parameters , the variable in Eq. (13) should also be optimized (either via grid search [25] or a gradient-based method [30]). So we need to compute . If each convolution kernel is deactivated with a prior probability along with a Gaussian weight prior , then the a priori distribution for the hybrid representation is exactly a spike-and-slab prior . Following [26, 30], the derivatives of Eq. (15) can be computed as
(16) 
where and denotes the number of input channels for convolution kernel of spatial size at layer . Please note that the above derivation is obtained by setting the prior to be zero, which indicates the network architecture prior is set to be the whole one-shot model. The motivation of employing is that a proper architecture prior is usually difficult to acquire or even estimate, but can be a reasonable one when we choose an overparameterized network that proves effective on many tasks as our one-shot model. Besides, provides us with a more stable way to optimize the [25]. So, we will use one-shot models that are built upon manually designed networks in our experiments.
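For intuition about the KL regularization term and its derivatives, the sketch below uses the approximate form popularized by Concrete Dropout [30]: a weight-decay term scaled by the keep probability minus an entropy term scaled by the input dimension. This is an assumed stand-in and is not claimed to match Eq. (16) term by term; `length_scale` and both function names are hypothetical, and `drop_prob` is the probability of deactivating a kernel.

```python
import numpy as np

def kl_regularizer(M, drop_prob, length_scale=1.0):
    """Approximate KL term: scaled weight decay minus K times the Bernoulli entropy."""
    p = float(np.clip(drop_prob, 1e-7, 1.0 - 1e-7))
    weight_term = 0.5 * length_scale ** 2 * (1.0 - p) * np.sum(M ** 2)
    entropy = -p * np.log(p) - (1.0 - p) * np.log(1.0 - p)
    return weight_term - M.shape[0] * entropy   # M.shape[0] plays the role of input channels

def kl_regularizer_grads(M, drop_prob, length_scale=1.0):
    """Analytic derivatives of `kl_regularizer` w.r.t. the weights and drop probability."""
    p = float(np.clip(drop_prob, 1e-7, 1.0 - 1e-7))
    grad_M = length_scale ** 2 * (1.0 - p) * M
    grad_p = (-0.5 * length_scale ** 2 * np.sum(M ** 2)
              - M.shape[0] * np.log((1.0 - p) / p))
    return grad_M, grad_p

rng = np.random.default_rng(0)
M = rng.normal(size=(5, 3))
grad_M, grad_p = kl_regularizer_grads(M, drop_prob=0.3)
```

The gradient with respect to the weights is an ordinary weight-decay gradient whose strength depends on the drop probability, which is consistent with the later remark that a good weight prior places this term in a commonly used weight decay range.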
Since the first term in Eq. (16) involves computing the derivative of a non-differentiable Bernoulli distribution (remember in Eq. (13)), we employ the Gumbel-softmax [31] to relax the discrete distribution to a continuous space, so that the in Eq. (16) and Eq. (13) can be deterministically drawn with
(17) 
where is the temperature that decides how steep the sigmoid function is; if goes to infinity, the above parameterization is exactly equivalent to drawing the sample from a Bernoulli distribution. (A similar relaxation is used in [30] without using Gumbel-softmax.)
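A binary relaxation of this kind can be sketched as follows. The code uses the common convention of [30, 31], in which the logistic noise is divided by a temperature and the sample becomes binary as that temperature approaches zero; the steepness parameter described above may correspond to the reciprocal of `temperature` here, which is an assumption of this sketch. The function name and arguments are hypothetical.

```python
import numpy as np

def relaxed_bernoulli(prob_one, temperature, size, rng):
    """Differentiable surrogate for Bernoulli(prob_one) samples, with values in [0, 1]."""
    u = rng.uniform(1e-7, 1.0 - 1e-7, size=size)          # parameter-free noise
    logit = (np.log(prob_one) - np.log1p(-prob_one)
             + np.log(u) - np.log1p(-u))
    # Stable sigmoid(logit / temperature) written with tanh to avoid overflow.
    return 0.5 * (1.0 + np.tanh(0.5 * logit / temperature))

rng = np.random.default_rng(0)
soft = relaxed_bernoulli(0.3, temperature=1.0, size=20000, rng=rng)
sharp = relaxed_bernoulli(0.3, temperature=0.01, size=20000, rng=rng)
```

At `temperature=1.0` the samples spread over the unit interval, while at `temperature=0.01` almost all of them sit next to 0 or 1 and the fraction close to 1 approaches `prob_one`. Because the randomness enters only through `u`, gradients can flow to `prob_one` through the chain rule.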
By adopting Eq. (17), the derivatives in Eq. (16) can be propagated via the chain rule. Combining Eq. (8), Eq. (10) and Eq. (15), one can see that the a posteriori distribution over the hybrid representation can be approximated by simply training the one-shot model in an end-to-end fashion with two additional regularization terms and dropout ratio .
2.3 Sampling and Ranking
Once the variational distribution is obtained, we sample a group of network candidates w.r.t. , where is the number of samples. According to Eq. (13), our sampling process is performed by activating convolution kernels stochastically with the learned probability , which is equivalent to a regular dropout operation. Specifically, each candidate is sampled by randomly dropping convolution kernel w.r.t. the probability for every , and in the one-shot model. Then the sampled candidates are evaluated and ranked on a held-out validation dataset. Due to the hybrid network representation, we actually sample architecture-weight pairs, which relieves the mismatch problem. Finally, the best-performing one is selected by Eq. (7).
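The sampling and ranking step can be sketched as below, under the same toy assumptions as a single linear layer whose rows stand in for convolution kernels. The per-row keep probabilities in `keep_probs` are made-up values playing the role of the learned probabilities; `sample_and_rank` and `accuracy` are hypothetical helper names.

```python
import numpy as np

def accuracy(logits, labels):
    """Fraction of samples whose arg-max class matches the label."""
    return float(np.mean(np.argmax(logits, axis=1) == labels))

def sample_and_rank(M, keep_probs, X_val, y_val, num_candidates, rng):
    """Sample architecture-weight pairs by masking rows of M and keep the best one."""
    best_acc, best_mask = -1.0, None
    for _ in range(num_candidates):
        mask = rng.binomial(1, keep_probs)                 # one bit per row ("kernel")
        acc = accuracy(X_val @ (M * mask[:, None]), y_val)  # evaluate on validation data
        if acc > best_acc:
            best_acc, best_mask = acc, mask
    return best_acc, best_mask

rng = np.random.default_rng(0)
M = rng.normal(size=(8, 3))
X_val = rng.normal(size=(64, 8))
y_val = np.argmax(X_val @ M, axis=1)   # synthetic labels produced by the full model
keep_probs = np.array([0.9, 0.9, 0.5, 0.5, 0.9, 0.1, 0.9, 0.5])
best_acc, best_mask = sample_and_rank(M, keep_probs, X_val, y_val, 50, rng)
```

Each candidate is a mask together with the weights it leaves active, so no separate weight-inheritance or retraining step is needed before ranking.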
Please note that our a posteriori distribution guided sampling scheme, though not intentionally, leads to an adaptive dropout that reflects the importance of different parts in the one-shot model. It thus relieves the dependency on the hyperparameter-sensitive, carefully designed dropout probability in previous one-shot methods [23].
3 Experiments
To fully investigate the behavior of NASAS, we test it on six one-shot supernetworks. Because we use to facilitate Eq. (16), we construct the supernetworks based on architecture priors perceived from manually designed networks. We evaluate the performance of our NASAS on three datasets: Cifar10, Cifar100 and ImageNet. For every one-shot supernetwork, we insert a dropout layer after each convolution layer according to Eq. (17) to facilitate the computation of Eq. (16). This modification introduces negligible overhead in parameters and FLOPs. Our NASAS is trained in an end-to-end way with Stochastic Gradient Descent (SGD) using a single P40 GPU card for Cifar10/Cifar100 and 4 M40 GPU cards for ImageNet. Once a model converges, we sample different convolution kernels w.r.t. the learned dropout ratio to get 1500/5000/1500 candidate architectures for Cifar10, Cifar100 and ImageNet, respectively. These candidates are ranked on a held-out validation dataset and the one with the best performance is selected as the final search result.
3.1 Cifar10 and Cifar100
One-shot Model and Hyperparameters. We test our NASAS with four supernetworks, namely SupNet-M/MI and SupNet-E/EI, on Cifar10 and Cifar100. They are based on the manually designed multi-branch ResNet [32] and the architecture obtained by ENAS [33], respectively. Please refer to the supplementary material for more details of the one-shot models and all hyperparameter settings used in this paper.
Table 1: Comparison with state-of-the-art methods on Cifar10.

| Method | Error (%) | GPU Days | Params (M) | Search Method |
|---|---|---|---|---|
| Shake-Shake [32] | 2.86 | - | 26.2 | - |
| Shake-Shake + cutout [34] | 2.56 | - | 26.2 | - |
| NAS [4] | 4.47 | 22400 | 7.1 | RL |
| NAS + more filters [4] | 3.65 | 22400 | 37.4 | RL |
| NASNet-A + cutout [5] | 2.65 | 1800 | 3.3 | RL |
| Micro NAS + Q-Learning [7] | 3.60 | 96 | - | RL |
| Path-Level EAS + cutout [35] | 2.30 | 8.3 | 13.0 | RL |
| ENAS + cutout [33] | 2.89 | 0.5 | 4.6 | RL |
| EAS (DenseNet) [36] | 3.44 | 10 | 10.7 | RL |
| AmoebaNet-A + cutout [2] | 3.34 | 3150 | 3.2 | evolution |
| Hierarchical Evo [1] | 3.63 | 300 | 61.3 | evolution |
| PNAS [13] | 3.63 | 225 | 3.2 | SMBO |
| SMASH [16] | 4.03 | 1.5 | 16.0 | gradient-based |
| DARTS + cutout [14] | 2.83 | 4 | 3.4 | gradient-based |
| SNAS + cutout [17] | 2.85 | 1.5 | 2.8 | gradient-based |
| NAONet + cutout [37] | 2.07 | 200 | 128 | gradient-based |
| One-Shot Top [23] | 3.70 | - | 45.3 | gradient-based |
| NASAS-E | 2.73 | 2.5 | 3.1 | guided sampling |
| NASAS-EI | 2.56 | 5.5 | 10.8 | guided sampling |
| NASAS-M | 2.20 | 4.8 | 21.6 | guided sampling |
| NASAS-MI | 2.06 | 6.5 | 33.4 | guided sampling |
| NASAS-MI | 1.98 | 11.1 | 32.8 | guided sampling |
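
Table 2: Comparison with state-of-the-art methods on Cifar100.

| Method | Error (%) | GPU Days | Params (M) | Search Method |
|---|---|---|---|---|
| NASNet-A [5] | 19.70 | 1800 | 3.3 | RL |
| ENAS [33] | 19.43 | 0.5 | 4.6 | RL |
| AmoebaNet-B [2] | 17.66 | 3150 | 2.8 | evolution |
| PNAS [13] | 19.53 | 150 | 3.2 | SMBO |
| NAONet + cutout [37] | 14.36 | 200 | 128 | gradient-based |
| NASAS-MI (ours) | 14.28 | 11 | 46.4 | guided sampling |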




Comparison with the State of the Art. Table 1 shows the comparison results on Cifar10. Here NASAS-X denotes the performance of our NASAS on the supernetwork SupNet-X. From top to bottom, the first group consists of state-of-the-art manually designed architectures on Cifar10; the following three groups list the related NAS methods in which different search algorithms, e.g. RL, evolution, and gradient descent, are adopted; the last group exhibits the performance of our NASAS. It shows that our NASAS is capable of finding advanced architectures in a much more efficient and effective way, e.g. it finds the architecture with the lowest error of 1.98% in only 11.1 GPU days.
We also list in Table 1 the two networks, multi-branch ResNet [32] and ENAS [33], that inspired our design of supernetworks. Our NASAS-E and NASAS-M outperform "ENAS + cutout" and "Shake-Shake + cutout" by 0.16% and 0.36% at smaller model sizes. In the inflated cases, our NASAS-MI/EI find architectures with even higher performance. Compared with the sampling-based one-shot method "One-Shot Top", which achieves a competitive 3.7% classification error by randomly sampling 20000 network architectures, our NASAS attains a much higher performance by sampling only 1500 network architectures, owing to the a posteriori distribution guided sampling.
Table 2 further demonstrates the performance of our NASAS on a much more challenging dataset, Cifar100. Our NASAS achieves a good trade-off between efficiency and accuracy. It achieves 14.8% error rate with only 8.7 GPU days, which is very competitive in terms of both performance and search time.
Please note that the results of our NASAS are achieved during the search process without any additional fine-tuning of the weights of the searched architectures, while those of other methods are obtained by fine-tuning the searched models. In the following ablation study, we will discuss this point further.
Ablation Study and Parameter Analysis. We first evaluate the effect of our a posteriori distribution guided sampling method in Table 3(a). Compared with the baseline "Random" sampling, which is implemented by employing the predefined dropout strategy discussed in [23], "NASAS" successfully finds better subnetworks, which bring relative gains of 14% to 23%. Evidently, the a posteriori distribution guided sampling is much more effective, which validates that our approach can learn a meaningful distribution for efficient architecture search. Besides, as can be seen in the table, there is usually a huge performance gap between the architectures searched with a predefined distribution with and without fine-tuning, which reveals the mismatch problem.
Table 3(b) discusses the weight prior in Eq. (17). We find that a good usually makes the term in Eq. (16) fall into a commonly used weight decay range. So we choose by grid search. As shown in this table, the weight prior affects both error rate and model size. The higher the is, the smaller the number of parameters. Since the objective of NAS is to maximize performance rather than to minimize the number of parameters, we choose the one with the minimal error rate.
Table 3(c) shows the impact of the temperature value in Eq. (17). It shows that a smaller leads to a lower error, which is consistent with the analysis regarding Eq. (17). The corresponding fine-tuned result of our NASAS provides only a marginal improvement, which in turn demonstrates the reliability of our NASAS in sampling both architecture and weights.
We further evaluate the impact of the number of samples in Table 3(d). The performance improves as the number of samples, and with it the GPU days, increases. Here we choose to sample 1500 architectures as a trade-off between complexity and accuracy. Please also note that, compared with other sampling-based NAS methods, our scheme achieves a 2.17% error rate by sampling only 50 architectures with the assistance of the estimated a posteriori distribution. This further indicates that the estimated distribution provides essential information about the distribution of architectures and thus significantly facilitates the sampling process in terms of both efficiency and accuracy.
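Table 4: Test results on ImageNet with a search space obtained by inflating ResNet50.

| Model | ResNet50 | Inflated ResNet50 | NASAS-R50 |
|---|---|---|---|
| Error | 23.96% | 22.93% | 22.73% |
| Params | 25.6M | 44.0M | 26.0M |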
3.2 ImageNet
We further evaluate our NASAS on ImageNet with two supernetworks based on ResNet50 [38] and DenseNet121 [39], respectively. Please find detailed experimental settings in the supplementary material. Rather than transferring architectures searched on a smaller dataset, the efficiency and flexibility of our method enable us to directly search architectures on ImageNet within a few days.
We first provide test results of our NASAS on ImageNet in Table 4 using a relatively small search space obtained by inflating ResNet50 without limiting the number of model parameters. Hyperparameters and the training process are identical for the three models for a fair comparison. It can be observed that NASAS-R50 outperforms ResNet50 by 1.23% with a similar number of parameters. Table 5 shows the comparison with the state-of-the-art results on ImageNet. In this test, we control the size of the searched architecture to be comparable to those of other NAS methods in the mobile setting. Still, our NASAS outperforms them. Please note that the size control limits our choice on . As shown in Table 3(b), it may prevent us from finding better architectures with advanced performance.
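Table 5: Comparison with state-of-the-art methods on ImageNet in the mobile setting.

| Method | Error (%) (Top-1/Top-5) | GPU Days | Params (M) | Search Method |
|---|---|---|---|---|
| NASNet-A [5] | 26.0/8.4 | 1800 | 5.3 | RL |
| NASNet-B [5] | 27.2/8.7 | 1800 | 5.3 | RL |
| NASNet-C [5] | 27.5/9.0 | 1800 | 4.9 | RL |
| AmoebaNet-A [2] | 25.5/8.0 | 3150 | 5.1 | evolution |
| AmoebaNet-B [2] | 26.0/8.5 | 3150 | 5.3 | evolution |
| PNAS [13] | 25.8/8.1 | 225 | 5.1 | SMBO |
| FBNet-C [19] | 25.1/- | 9 | 5.5 | gradient-based |
| SinglePath [24] | 25.3/- | 12 | - | sampling-based |
| DARTS [14] | 26.9/9.0 | 4 | 4.9 | gradient-based |
| SNAS [17] | 27.3/9.2 | 1.5 | 4.3 | gradient-based |
| NASAS-D121 (ours) | 24.8/7.5 | 26 | 6.6 | guided sampling |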
3.3 Discussions
Weight Sharing. Weight sharing is a popular method adopted by one-shot models to greatly boost the efficiency of NAS. But it is not well understood why weight sharing is effective [40, 23]. In NASAS, as discussed in subsection 2.2, we find that weight sharing can be viewed as a reparameterization that enables us to estimate the a posteriori distribution via end-to-end network training.
Limitations and Future Work. One limitation of our NASAS is that it cannot explicitly choose non-parametric operations such as pooling. Another is that our NASAS requires prior knowledge of architectures, which is hard to obtain. Here we approach the prior only through manually designed networks. So our future work may include 1) enabling selection of non-parametric operations (e.g. assigning a 1x1 convolution after each pooling operation as a surrogate to decide whether this pooling branch is needed), and 2) investigating the robustness of our NASAS to different prior architectures.
4 Conclusion
In this paper, we propose a new one-shot based NAS approach, NASAS, which explicitly approximates the a posteriori distribution of network architecture and weights via network training to facilitate a more efficient search process. It enables candidate architectures to be sampled w.r.t. the a posteriori distribution approximated on the training dataset rather than a uniform or predefined distribution. It also alleviates the mismatch problem between architecture and shared weights by sampling architecture-weight pairs, which makes the ranking results more reliable. The proposed NASAS is efficiently implemented and optimized in an end-to-end way, and thus can be easily extended to other large-scale tasks.
References
 [1] Hanxiao Liu, Karen Simonyan, Oriol Vinyals, Chrisantha Fernando, and Koray Kavukcuoglu. Hierarchical representations for efficient architecture search. arXiv preprint arXiv:1711.00436, 2017.
 [2] Esteban Real, Alok Aggarwal, Yanping Huang, and Quoc V Le. Regularized evolution for image classifier architecture search. arXiv preprint arXiv:1802.01548, 2018.
 [3] Bowen Baker, Otkrist Gupta, Nikhil Naik, and Ramesh Raskar. Designing neural network architectures using reinforcement learning. arXiv preprint arXiv:1611.02167, 2016.
 [4] Barret Zoph and Quoc V Le. Neural architecture search with reinforcement learning. arXiv preprint arXiv:1611.01578, 2016.
 [5] Barret Zoph, Vijay Vasudevan, Jonathon Shlens, and Quoc V Le. Learning transferable architectures for scalable image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 8697–8710, 2018.
 [6] Mingxing Tan, Bo Chen, Ruoming Pang, Vijay Vasudevan, and Quoc V Le. Mnasnet: Platform-aware neural architecture search for mobile. arXiv preprint arXiv:1807.11626, 2018.
 [7] Zhao Zhong, Junjie Yan, Wei Wu, Jing Shao, and Cheng-Lin Liu. Practical block-wise neural network architecture generation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2423–2432, 2018.
 [8] Arber Zela, Aaron Klein, Stefan Falkner, and Frank Hutter. Towards automated deep learning: Efficient joint neural architecture and hyperparameter search. arXiv preprint arXiv:1807.06906, 2018.
 [9] Bowen Baker, Otkrist Gupta, Ramesh Raskar, and Nikhil Naik. Accelerating neural architecture search using performance prediction. arXiv preprint arXiv:1705.10823, 2017.
 [10] Kevin Swersky, Jasper Snoek, and Ryan Prescott Adams. Freeze-thaw bayesian optimization. arXiv preprint arXiv:1406.3896, 2014.
 [11] Tobias Domhan, Jost Tobias Springenberg, and Frank Hutter. Speeding up automatic hyperparameter optimization of deep neural networks by extrapolation of learning curves. In IJCAI, volume 15, pages 3460–8, 2015.
 [12] Aaron Klein, Stefan Falkner, Jost Tobias Springenberg, and Frank Hutter. Learning curve prediction with bayesian neural networks. 2016.
 [13] Chenxi Liu, Barret Zoph, Maxim Neumann, Jonathon Shlens, Wei Hua, Li-Jia Li, Li Fei-Fei, Alan Yuille, Jonathan Huang, and Kevin Murphy. Progressive neural architecture search. In Proceedings of the European Conference on Computer Vision (ECCV), pages 19–34, 2018.
 [14] Hanxiao Liu, Karen Simonyan, and Yiming Yang. Darts: Differentiable architecture search. arXiv preprint arXiv:1806.09055, 2018.
 [15] Han Cai, Ligeng Zhu, and Song Han. Proxylessnas: Direct neural architecture search on target task and hardware. arXiv preprint arXiv:1812.00332, 2018.
 [16] Andrew Brock, Theodore Lim, James M Ritchie, and Nick Weston. Smash: one-shot model architecture search through hypernetworks. arXiv preprint arXiv:1708.05344, 2017.
 [17] Sirui Xie, Hehui Zheng, Chunxiao Liu, and Liang Lin. Snas: stochastic neural architecture search. arXiv preprint arXiv:1812.09926, 2018.
 [18] Tom Véniat and Ludovic Denoyer. Learning time/memory-efficient deep architectures with budgeted super networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3492–3500, 2018.
 [19] Bichen Wu, Xiaoliang Dai, Peizhao Zhang, Yanghan Wang, Fei Sun, Yiming Wu, Yuandong Tian, Peter Vajda, Yangqing Jia, and Kurt Keutzer. Fbnet: Hardware-aware efficient convnet design via differentiable neural architecture search. CoRR, abs/1812.03443, 2018.
 [20] Shreyas Saxena and Jakob Verbeek. Convolutional neural fabrics. In Advances in Neural Information Processing Systems, pages 4053–4061, 2016.
 [21] Richard Shin, Charles Packer, and Dawn Song. Differentiable neural network architecture search. 2018.
 [22] Karim Ahmed and Lorenzo Torresani. Connectivity learning in multi-branch networks. arXiv preprint arXiv:1709.09582, 2017.
 [23] Gabriel Bender, Pieter-Jan Kindermans, Barret Zoph, Vijay Vasudevan, and Quoc Le. Understanding and simplifying one-shot architecture search. In International Conference on Machine Learning, pages 549–558, 2018.
 [24] Zichao Guo, Xiangyu Zhang, Haoyuan Mu, Wen Heng, Zechun Liu, Yichen Wei, and Jian Sun. Single path one-shot neural architecture search with uniform sampling. arXiv preprint arXiv:1904.00420, 2019.
 [25] Yarin Gal. Uncertainty in deep learning. PhD thesis, University of Cambridge, 2016.
 [26] Yarin Gal and Zoubin Ghahramani. Dropout as a bayesian approximation: Representing model uncertainty in deep learning. In international conference on machine learning, pages 1050–1059, 2016.
 [27] Diederik P Kingma and Max Welling. Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114, 2013.
 [28] Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: a simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research, 15(1):1929–1958, 2014.
 [29] Bin Dai, Shilin Ding, Grace Wahba, et al. Multivariate bernoulli distribution. Bernoulli, 19(4):1465–1483, 2013.
 [30] Yarin Gal, Jiri Hron, and Alex Kendall. Concrete dropout. In Advances in Neural Information Processing Systems, pages 3581–3590, 2017.
 [31] Eric Jang, Shixiang Gu, and Ben Poole. Categorical reparameterization with gumbel-softmax. arXiv preprint arXiv:1611.01144, 2016.
 [32] Xavier Gastaldi. Shake-shake regularization. arXiv preprint arXiv:1705.07485, 2017.
 [33] Hieu Pham, Melody Y Guan, Barret Zoph, Quoc V Le, and Jeff Dean. Efficient neural architecture search via parameter sharing. arXiv preprint arXiv:1802.03268, 2018.
 [34] Terrance DeVries and Graham W Taylor. Improved regularization of convolutional neural networks with cutout. arXiv preprint arXiv:1708.04552, 2017.
 [35] Han Cai, Jiacheng Yang, Weinan Zhang, Song Han, and Yong Yu. Path-level network transformation for efficient architecture search. arXiv preprint arXiv:1806.02639, 2018.
 [36] Han Cai, Tianyao Chen, Weinan Zhang, Yong Yu, and Jun Wang. Efficient architecture search by network transformation. AAAI, 2018.
 [37] Renqian Luo, Fei Tian, Tao Qin, Enhong Chen, and Tie-Yan Liu. Neural architecture optimization. In Advances in Neural Information Processing Systems, pages 7826–7837, 2018.
 [38] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.
 [39] Gao Huang, Zhuang Liu, Laurens Van Der Maaten, and Kilian Q Weinberger. Densely connected convolutional networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 4700–4708, 2017.
 [40] Thomas Elsken, Jan Hendrik Metzen, and Frank Hutter. Neural architecture search: A survey. arXiv preprint arXiv:1808.05377, 2018.