One-Shot Neural Architecture Search Through A Posteriori Distribution Guided Sampling


Yizhou Zhou, Xiaoyan Sun, Chong Luo, Zheng-Jun Zha, Wenjun Zeng
University of Science and Technology of China: zyz0205@mail.ustc.edu.cn, zhazj@ustc.edu.cn
Microsoft Research Asia: {xysun,chong.luo,wezeng}@microsoft.com
Abstract

The emergence of one-shot approaches has greatly advanced the research on neural architecture search (NAS). Recent approaches train an over-parameterized super-network (one-shot model) and then sample and evaluate a number of sub-networks, which inherit weights from the one-shot model. The overall searching cost is significantly reduced as training is avoided for sub-networks. However, the network sampling process is casually treated and the inherited weights from an independently trained super-network perform sub-optimally for sub-networks. In this paper, we propose a novel one-shot NAS scheme to address the above issues. The key innovation is to explicitly estimate the joint a posteriori distribution over network architecture and weights, and sample networks for evaluation according to it. This brings two benefits. First, network sampling under the guidance of a posteriori probability is more efficient than conventional random or uniform sampling. Second, the network architecture and its weights are sampled as a pair to alleviate the sub-optimal weights problem. Note that estimating the joint a posteriori distribution is not a trivial problem. By adopting variational methods and introducing a hybrid network representation, we convert the distribution approximation problem into an end-to-end neural network training problem which is neatly approached by variational dropout. As a result, the proposed method reduces the number of sampled sub-networks by orders of magnitude. We validate our method on the fundamental image classification task. Results on Cifar-10, Cifar-100 and ImageNet show that our method strikes the best trade-off between precision and speed among NAS methods. On Cifar-10, we speed up the searching process by 20x and achieve a higher precision than the best network found by existing NAS methods.

 

Preprint. Under review.

1 Introduction

Neural architecture search (NAS), which automates the design of artificial neural networks (ANNs), has received increasing attention in recent years. It is capable of finding ANNs that achieve similar or even better performance than manually designed ones. NAS is essentially a bi-level optimization task, as shown in Fig. 1(a). Let $\mathcal{A}$ denote the set of possible network architectures under a predefined search space, and let $\alpha \in \mathcal{A}$ and $w_\alpha$ denote an architecture and its corresponding weights, respectively. The lower-level objective optimizes the weights as

$$ w^*_\alpha = \arg\min_{w_\alpha} \mathcal{L}_{train}\big(\mathcal{N}(\alpha, w_\alpha)\big), \qquad (1) $$

where $\mathcal{L}_{train}$ is the loss criterion evaluated on the training dataset $\mathcal{D}_{train}$ and $\mathcal{N}(\alpha, w_\alpha)$ denotes the network with architecture $\alpha$ and weights $w_\alpha$. The upper-level objective optimizes the network architecture on the validation dataset, using the weights optimized by the lower-level task, as

$$ \alpha^* = \arg\min_{\alpha \in \mathcal{A}} \mathcal{L}_{val}\big(\mathcal{N}(\alpha, w^*_\alpha)\big), \qquad (2) $$

where $\mathcal{L}_{val}$ is the loss criterion on the validation dataset $\mathcal{D}_{val}$. To solve this bi-level problem, approaches based on evolution [1, 2], reinforcement learning [3, 4, 5, 6, 7, 8, 2, 9, 10, 11, 12, 13] or gradient-based methods [14, 15, 16, 17] have been proposed. However, most of these methods suffer from high computational complexity (often on the order of thousands of GPU days) [1, 2, 3, 4, 5] or lack convergence guarantees [15, 14, 17].
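
To make the cost of this formulation concrete, the minimal Python sketch below spells out Eqs. (1)-(2) as a literal search loop. All helper names (sample_architecture, build_network, train, evaluate) are hypothetical placeholders rather than part of any existing library; the point is only that every candidate must be trained from scratch before it can be ranked, which is what drives the cost to thousands of GPU days.

```python
# A hedged sketch of naive bi-level NAS (Eqs. (1)-(2)); all helper names are hypothetical.
def naive_bilevel_nas(search_space, train_set, val_set, num_candidates=100):
    best_alpha, best_score = None, float("-inf")
    for _ in range(num_candidates):
        alpha = sample_architecture(search_space)   # pick an architecture from A
        net = build_network(alpha)                  # instantiate N(alpha, w)
        w_star = train(net, train_set)              # lower level, Eq. (1): full training
        score = evaluate(net, w_star, val_set)      # upper level, Eq. (2): validation metric
        if score > best_score:
            best_alpha, best_score = alpha, score
    return best_alpha
```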

Figure 1: Illustration of NAS mechanisms. (a) Solving NAS by bi-level optimization, which is computationally and resource demanding. (b) Sampling-based one-shot NAS. The sampling of architectures is independent of the training dataset and there is often a mismatch between the shared weights and the sampled architectures. (c) Our NASAS samples architecture-weight pairs w.r.t. the a posteriori distribution estimated on the training dataset, and directly outputs the searched network without fine-tuning.

Rather than directly tackling the bi-level problem, some attempts [18, 19, 20, 21, 22, 17] relax the discrete search space $\mathcal{A}$ to a continuous one denoted by $\bar{\mathcal{A}} = \{\bar\alpha(\theta)\}$, where $\bar\alpha(\cdot)$ denotes the continuous relaxation and $\theta$ stands for the topology of the relaxed architecture. The weights and architecture are jointly optimized with a single objective function

$$ (\theta^*, w^*) = \arg\min_{\theta, w} \mathcal{L}_{train}\big(\mathcal{N}(\bar\alpha(\theta), w)\big). \qquad (3) $$

Then the optimal architecture $\alpha^*$ is derived by discretizing the continuous architecture $\bar\alpha(\theta^*)$. These methods greatly simplify the optimization problem and enable end-to-end training. However, since the validation set is not involved in Eq. (3), the search results are inevitably biased towards the training dataset.

More recent NAS methods tend to reduce the computational complexity by decoupling the bi-level optimization problem into a sequential one [23, 16, 24]. Specifically, a super-network (one-shot model) $\mathcal{S}$ is defined and the search space $\mathcal{A}$ is constrained to contain only sub-networks of $\mathcal{S}$. As shown in Fig. 1(b), recent one-shot NAS methods first optimize the weights $W$ of the super-network by solving

$$ W^* = \arg\min_{W} \mathcal{L}_{train}\big(\mathcal{N}(\mathcal{S}, W)\big). \qquad (4) $$

Then a number of sub-networks are sampled from $\mathcal{S}$ and the best-performing sub-network is picked out with

$$ \alpha^* = \arg\min_{\alpha \in \mathcal{A}} \mathcal{L}_{val}\big(\mathcal{N}(\alpha, W^*(\alpha))\big), \qquad (5) $$

where $W^*(\alpha)$ denotes the weights of architecture $\alpha$ inherited from $W^*$. The core assumption of this one-shot NAS method is that the best-performing sub-network shares weights with the optimal super-network, so that sampled sub-networks do not need to be re-trained during the search. This greatly boosts the efficiency of NAS. However, this assumption does not always hold. Clues can be found in the common practice that previous one-shot methods rely on fine-tuning to further improve the performance of the found best model. Previous research has also pointed out that the mismatch between the weights and architectures of sampled sub-networks could jeopardize the subsequent ranking results [17]. Besides, the searching process is casually treated by random or uniform sampling. We believe there is large room for improvement in efficiency.
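
For contrast with the bi-level loop above, the sketch below outlines the conventional one-shot pipeline of Eqs. (4)-(5). Again, the helpers (train_supernet, sample_subnet_uniform, evaluate_with_inherited_weights) are hypothetical: the super-network is trained once, and every candidate simply inherits the shared weights, which is precisely where the weight-architecture mismatch comes from.

```python
# A hedged sketch of sampling-based one-shot NAS (Eqs. (4)-(5)); helpers are hypothetical.
def one_shot_nas(supernet, train_set, val_set, num_samples=20000):
    shared_w = train_supernet(supernet, train_set)       # Eq. (4): train the one-shot model once
    best_alpha, best_acc = None, 0.0
    for _ in range(num_samples):
        alpha = sample_subnet_uniform(supernet)          # random / uniform sampling
        acc = evaluate_with_inherited_weights(alpha, shared_w, val_set)   # Eq. (5)
        if acc > best_acc:
            best_alpha, best_acc = alpha, acc
    return best_alpha    # typically still fine-tuned afterwards
```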

In this paper, we propose a novel NAS strategy, namely NAS through A posteriori distribution guided Sampling (NASAS). In NASAS, we propose to estimate the a posteriori distribution over the architecture and weight pair $(\alpha, w_\alpha)$ with a variational distribution $q_\phi(\alpha, w_\alpha)$, where $\phi$ denotes the variational parameters. The optimal $\phi$, denoted by $\phi^*$, can be found by

$$ \phi^* = \arg\min_{\phi} D\big(q_\phi(\alpha, w_\alpha) \,\|\, p(\alpha, w_\alpha \mid \mathcal{D}_{train})\big), \qquad (6) $$

where $D(\cdot \,\|\, \cdot)$ measures the distance between two distributions. Note that finding $\phi^*$ is not a trivial problem and the details will be presented in Section 2. After $\phi^*$ is found, we can look for the optimal architecture by

$$ \alpha^* = \arg\max_{(\alpha, w_\alpha) \sim q_{\phi^*}} \mathrm{Acc}_{val}\big(\mathcal{N}(\alpha, w_\alpha)\big). \qquad (7) $$

In a nutshell, NASAS leverages the training dataset to estimate the a posteriori distribution, based on which sampling is performed, and then uses the validation set for performance evaluation.

The flow chart of NASAS is illustrated in Fig. 1(c). Our work has two main innovations compared with the recently proposed one-shot approaches. First, we greatly improve the efficiency of the network search process by guided sampling. As a result, the searching time can be reduced by orders of magnitude while achieving the best performance. Second, we approximate the joint distribution over architectures and weights to alleviate the mismatch problem mentioned earlier. This not only improves the reliability of the ranking results, but also allows us to directly output the found best-performing network without fine-tuning. We evaluate NASAS on the image classification task. It achieves 1.98% test error with 11.1 GPU days on Cifar-10, while the best network found by existing NAS methods achieves 2.07% test error at 200 GPU days. NASAS also achieves state-of-the-art performance with 14.8% test error at 8.7 GPU days on Cifar-100, and 24.80% test error at around 40 GPU days on ImageNet under a relaxed mobile setting.
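
The sketch below summarizes this pipeline at the same level of abstraction as the two loops above; fit_variational_distribution, sample_pair and evaluate are hypothetical placeholders for the procedures detailed in Section 2. The only structural differences from conventional one-shot NAS are that the sampling distribution is learned from the training data and that each sample is an architecture-weight pair.

```python
# A hedged sketch of the NASAS pipeline (Fig. 1(c)); helper names are hypothetical.
def nasas(supernet, train_set, val_set, num_samples=1500):
    phi = fit_variational_distribution(supernet, train_set)   # Section 2.2: approximate q_phi(omega)
    best_pair, best_acc = None, 0.0
    for _ in range(num_samples):
        arch, weights = sample_pair(phi)                       # guided sampling, not uniform
        acc = evaluate(arch, weights, val_set)                 # ranking on the validation set, Eq. (7)
        if acc > best_acc:
            best_pair, best_acc = (arch, weights), acc
    return best_pair                                           # used directly, no fine-tuning
```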

2 NASAS

In this section, we first formulate the target problem of our NASAS, and then propose an end-to-end trainable solution to estimate the joint a posteriori distribution over architectures and weights, followed by an efficient sampling and ranking scheme to facilitate the search process.

2.1 Notation and Problem Formulation

Given a one-shot model $\mathcal{S}$, let $W^k_l$ denote the convolution weight matrix for layer $l$ with spatial kernel size $k \times k$, and let $c_{in}$ and $c_{out}$ denote the number of input and output channels, respectively. We use $W^k_{l,i}$ to denote the kernel slice operated on the $i$-th input channel and use $W = \{W^k_{l,i}\}$ to denote the weights of the whole one-shot model. As deriving a sub-network of $\mathcal{S}$ is equivalent to deactivating a set of convolution kernels, a sub-network architecture can be specified by a set of random variables $\mathbf{z} = \{\mathbf{z}^k_{l,i}\}$, where $\mathbf{z}^k_{l,i} \in \{0, 1\}$ indicates deactivating (zero) or activating (one) convolution kernel $W^k_{l,i}$. Later on we will use boldface for random variables.

Although we need a joint a posteriori distribution over $\mathbf{z}$ and $W$, we do not have to explicitly derive the joint distribution, since deactivating or activating a convolution kernel is also equivalent to multiplying a binary mask with the kernel. Instead, we combine them into a new random variable $\boldsymbol{\omega} = \{\boldsymbol{\omega}^k_{l,i}\}$, where $\boldsymbol{\omega}^k_{l,i} = \mathbf{z}^k_{l,i} \cdot W^k_{l,i}$. Thus, the key problem in NASAS is to estimate the a posteriori distribution over the hybrid network representation $\boldsymbol{\omega}$. Mathematically,

$$ p(\boldsymbol{\omega} \mid \mathbf{X}, \mathbf{Y}) = \frac{p(\mathbf{Y} \mid \mathbf{X}, \boldsymbol{\omega})\, p(\boldsymbol{\omega})}{\int p(\mathbf{Y} \mid \mathbf{X}, \boldsymbol{\omega})\, p(\boldsymbol{\omega})\, d\boldsymbol{\omega}}, \qquad (8) $$

where $\mathbf{X}$ and $\mathbf{Y}$ denote the training samples and labels, respectively. $p(\mathbf{Y} \mid \mathbf{X}, \boldsymbol{\omega})$ is the likelihood, which can be inferred by $\mathcal{N}(\boldsymbol{\omega})$, the sub-network defined by the hybrid representation $\boldsymbol{\omega}$, and $p(\boldsymbol{\omega})$ is the a priori distribution of the hybrid representation. Because the marginalized likelihood in Eq. (8) is intractable, we use a variational distribution $q_\phi(\boldsymbol{\omega})$ to approximate the true a posteriori distribution and reformulate our target problem as

$$ \phi^* = \arg\min_{\phi} D\big(q_\phi(\boldsymbol{\omega}) \,\|\, p(\boldsymbol{\omega} \mid \mathbf{X}, \mathbf{Y})\big), \qquad \boldsymbol{\omega}^* = \arg\max_{\boldsymbol{\omega} \sim q_{\phi^*}} E_{val}\big(\mathcal{N}(\boldsymbol{\omega})\big). \qquad (9) $$

Here we choose the KL divergence and accuracy to instantiate $D$ and $E_{val}$, respectively.
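
As a concrete illustration of the hybrid representation, the PyTorch sketch below applies a binary mask z along the input-channel dimension of one convolution kernel, so that a single draw of omega = z * W fixes both which kernel slices are active (the architecture) and their values (the weights). The tensor layout and the helper itself are our assumptions for illustration, not the paper's implementation.

```python
# A hedged sketch of the hybrid representation omega = z * W for one conv layer.
import torch
import torch.nn.functional as F

def masked_conv(x, weight, z, bias=None):
    """x: (B, C_in, H, W); weight: (C_out, C_in, k, k); z: (C_in,) binary mask."""
    omega = weight * z.view(1, -1, 1, 1)            # deactivate kernel slices where z == 0
    return F.conv2d(x, omega, bias, padding=weight.shape[-1] // 2)

# Example: a sub-network that ignores input channels 1 and 3 of a 3x3 convolution.
x = torch.randn(2, 4, 8, 8)
w = torch.randn(16, 4, 3, 3)
z = torch.tensor([1., 0., 1., 0.])
y = masked_conv(x, w, z)
```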

2.2 A Posteriori Distribution Approximation

We employ Variational Inference (VI) to approximate the true a posteriori distribution $p(\boldsymbol{\omega} \mid \mathbf{X}, \mathbf{Y})$ with $q_\phi(\boldsymbol{\omega})$ by minimizing the negative Evidence Lower Bound (ELBO)

$$ \mathcal{L}_{VI}(\phi) = \mathrm{KL}\big(q_\phi(\boldsymbol{\omega}) \,\|\, p(\boldsymbol{\omega})\big) - \sum_{i=1}^{N} \int q_\phi(\boldsymbol{\omega}) \log p(y_i \mid x_i, \boldsymbol{\omega})\, d\boldsymbol{\omega}, \qquad (10) $$

where $N$ is the number of training samples. Inspired by [25, 26], we propose solving Eq. (10) with the network-friendly variational dropout.

2.2.1 Approximation by Network Training

We employ the re-parametrization trick [27] and choose a deterministic and differentiable transformation function $g(\phi, \epsilon)$ that re-parameterizes $q_\phi(\boldsymbol{\omega})$ as $\boldsymbol{\omega} = g(\phi, \epsilon)$, where $\epsilon$ is drawn from a parameter-free distribution $p(\epsilon)$. Take a uni-variate Gaussian distribution as an example: its re-parametrization can be $\omega = \mu + \sigma \epsilon$ with $\epsilon \sim \mathcal{N}(0, 1)$, where $\mu$ and $\sigma$ are the variational parameters $\phi = \{\mu, \sigma\}$. Gal et al. [25, 26] have shown that when the network weight is re-parameterized with

$$ W_l = M_l \cdot \mathrm{diag}\big([\mathbf{b}_{l,j}]_{j=1}^{K_l}\big), \qquad \mathbf{b}_{l,j} \sim \mathrm{Bernoulli}(p_l), \qquad (11) $$

where $M_l$ is a deterministic weight matrix, the function draw w.r.t. the variational distribution over network weights can be efficiently implemented via network inference. Concretely, the function draw is equivalent to randomly drawing a masked deterministic weight matrix in the neural network, which is known as the dropout operation [28]. Similarly, we replace $W^k_{l,i}$ in our hybrid representation with its re-parameterized counterpart $\mathbf{b}^k_{l,i} \cdot m^k_{l,i}$, and reformulate $\boldsymbol{\omega}$ as

$$ \boldsymbol{\omega}^k_{l,i} = \mathbf{z}^k_{l,i} \cdot \mathbf{b}^k_{l,i} \cdot m^k_{l,i}, \qquad \mathbf{b}^k_{l,i} \sim \mathrm{Bernoulli}(p_b), \qquad (12) $$

where $m^k_{l,i}$ denotes the deterministic kernel weights.

In Eq. (12), we have an additional random variable $\mathbf{z}^k_{l,i}$ that controls the activation of kernels and whose distribution is unknown. Here we propose using the marginal probability to characterize its behavior, because the marginal can reflect the expected probability of selecting kernel $W^k_{l,i}$ given the training dataset. It exactly matches the real behavior if the selections of kernels in a one-shot model are independent. Since the joint distribution of the network architecture is a multivariate Bernoulli distribution, its marginal distribution obeys $\mathbf{z}^k_{l} \sim \mathrm{Bernoulli}(p^k_l)$ [29], where $p^k_l$ now is also a variational parameter that should be optimized. Therefore, we have

$$ \boldsymbol{\omega}^k_{l,i} = \mathbf{z}^k_{l} \cdot \mathbf{b}^k_{l,i} \cdot m^k_{l,i}, \qquad \mathbf{z}^k_{l} \sim \mathrm{Bernoulli}(p^k_l). \qquad (13) $$

Here we omit the subscript $i$ in the original $\mathbf{z}^k_{l,i}$ because the importance of branches which come from the same kernel-size group and layer should be identical. By replacing $\mathbf{z}^k_{l} \cdot \mathbf{b}^k_{l,i}$ with a new variable $\hat{\mathbf{z}}^k_{l,i}$, Eq. (13) has the same form as Eq. (11). Now Eq. (10) can be rewritten as

$$ \mathcal{L}_{VI}(\phi) = \mathrm{KL}\big(q_\phi(\boldsymbol{\omega}) \,\|\, p(\boldsymbol{\omega})\big) - \sum_{i=1}^{N} \int q_\phi(\boldsymbol{\omega}) \log p\big(y_i \mid x_i, g(\phi, \hat{\mathbf{z}})\big)\, d\hat{\mathbf{z}}, \qquad (14) $$

where the variational parameters $\phi$ are composed of both the deterministic kernel weights $\{m^k_{l,i}\}$ and the distribution parameters $\{p^k_l\}$ of the network architecture. The expected log likelihood (the integral term) in the equation above is usually estimated by Monte Carlo (MC) estimation,

$$ \mathcal{L}_{VI}(\phi) \approx \mathrm{KL}\big(q_\phi(\boldsymbol{\omega}) \,\|\, p(\boldsymbol{\omega})\big) - \sum_{i=1}^{N} \log p\big(y_i \mid x_i, \hat{\boldsymbol{\omega}}_i\big), \qquad \hat{\boldsymbol{\omega}}_i \sim q_\phi(\boldsymbol{\omega}). \qquad (15) $$

Eq. (15) indicates that the (negative) ELBO can be computed very efficiently: it is simply the KL term minus the log likelihood inferred by the one-shot network (now re-parameterized as in Eq. (13)). During each network inference, convolution kernels are randomly deactivated according to the learned probabilities, which is exactly equivalent to a dropout neural network.

Now, approximating the a posteriori distribution over the hybrid network representation is converted into optimizing the one-shot model with dropout and a KL regularization term. If the derivative of both terms is tractable, we can efficiently train it in an end-to-end fashion.
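
One possible training step implementing the MC estimate of Eq. (15) is sketched below. The model is a hypothetical super-network module whose forward pass internally samples the kernel masks (one MC draw), and kl_regularizer stands in for the KL term; a concrete-dropout-style version of it is sketched after Eq. (16) below.

```python
# A hedged sketch of one negative-ELBO training step (Eq. (15)); `model` and
# `kl_regularizer` are hypothetical stand-ins for the components described in the text.
import torch.nn.functional as F

def elbo_step(model, x, y, optimizer, num_train_samples):
    optimizer.zero_grad()
    logits = model(x)                                      # one MC draw: a dropout forward pass
    nll = F.cross_entropy(logits, y) * num_train_samples   # batch mean scaled to the full sum
    loss = nll + kl_regularizer(model)                     # negative ELBO
    loss.backward()                                        # gradients flow through the relaxation in Eq. (17)
    optimizer.step()
    return loss.item()
```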

2.2.2 Network Optimization

In addition to the deterministic kernel weights $m^k_{l,i}$, the activation probabilities $p^k_l$ in Eq. (13) should also be optimized (either via grid search [25] or a gradient-based method [30]), so we need to compute $\partial \mathcal{L}_{VI} / \partial p^k_l$. If each convolution kernel is deactivated with a prior probability $p_0$ along with a Gaussian weight prior $\mathcal{N}(0, \lambda^{-2}I)$, then the a priori distribution for the hybrid representation is exactly a spike-and-slab prior. Following [26, 30], the derivatives of Eq. (15) can be computed as

$$ \frac{\partial \mathcal{L}_{VI}}{\partial \phi} \approx - \sum_{i=1}^{N} \frac{\partial}{\partial \phi} \log p\big(y_i \mid x_i, \hat{\boldsymbol{\omega}}_i\big) + \frac{\partial}{\partial \phi} \sum_{l,k} \Big( \frac{\lambda^2\, p^k_l}{2} \big\| m^k_l \big\|^2 - K^k_l\, \mathcal{H}(p^k_l) \Big), \qquad (16) $$

where $\mathcal{H}(p) = -p \log p - (1-p) \log(1-p)$ is the Bernoulli entropy, $\lambda$ is the prior length-scale of the Gaussian weight prior, and $K^k_l$ denotes the number of input channels for convolution kernels of spatial size $k$ at layer $l$. Please note that the above derivation is obtained by setting the prior dropout ratio $p_0$ to zero, which indicates that the network architecture prior is set to be the whole one-shot model. The motivation of employing this prior is that a proper architecture prior is usually difficult to acquire or even estimate, but the whole one-shot model can be a reasonable one when we choose an over-parameterized network that has proved effective on many tasks as our one-shot model. Besides, it provides a more stable way to optimize the dropout probabilities [25]. We therefore build the one-shot models upon manually designed networks in our experiments.
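
A possible form of the KL term used above is sketched below, following the concrete-dropout-style approximation: a weight-decay term scaled by the keep probability plus a Bernoulli entropy term per kernel group. The attribute names (dropout_conv_layers, keep_prob, conv.weight) and the default length-scale are our assumptions, not the paper's implementation.

```python
# A hedged sketch of the KL regularizer appearing in Eq. (16).
import torch

def kl_regularizer(model, length_scale=1e-2, eps=1e-6):
    reg = 0.0
    for layer in model.dropout_conv_layers():      # hypothetical iterator over masked conv layers
        p = layer.keep_prob()                      # learned activation probability (a tensor)
        w = layer.conv.weight                      # deterministic kernel weights m
        k_in = w.shape[1]                          # number of input channels K
        weight_term = (length_scale ** 2) * p * w.pow(2).sum() / 2.0
        entropy = -(p * torch.log(p + eps) + (1 - p) * torch.log(1 - p + eps))
        reg = reg + weight_term - k_in * entropy
    return reg
```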

Since the first term in Eq. (16) involves differentiating through a non-differentiable Bernoulli draw (recall that $\mathbf{z}^k_l \sim \mathrm{Bernoulli}(p^k_l)$ in Eq. (13)), we employ the Gumbel-softmax [31] to relax the discrete distribution to a continuous space, and the samples $\tilde{\mathbf{z}}$ in Eq. (16) and Eq. (13) can be deterministically drawn with

$$ \tilde{\mathbf{z}} = \mathrm{sigmoid}\Big( t \cdot \big( \log \tfrac{p}{1-p} + \log \tfrac{u}{1-u} \big) \Big), \qquad u \sim \mathrm{Uniform}(0, 1), \qquad (17) $$

where $t$ is the temperature that decides how steep the sigmoid function is; if $t$ goes to infinity, the above parametrization is exactly equivalent to drawing the sample from the Bernoulli distribution. (A similar relaxation is used in [30] without using Gumbel-softmax.)
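
The relaxed Bernoulli draw of Eq. (17) can be implemented in a few lines; the sketch below uses the standard binary Gumbel-softmax (concrete) relaxation, with logit_p the learnable log-odds of the keep probability and t the temperature. The specific temperature and probability values are illustrative only.

```python
# A hedged sketch of the differentiable relaxed Bernoulli sample in Eq. (17).
import torch

def relaxed_bernoulli(logit_p, t=10.0, eps=1e-6):
    u = torch.rand_like(logit_p)                             # u ~ Uniform(0, 1)
    noise = torch.log(u + eps) - torch.log(1.0 - u + eps)    # logistic (Gumbel-difference) noise
    return torch.sigmoid(t * (logit_p + noise))              # approaches a hard {0, 1} draw as t grows

# Example: soft masks for 4 kernel groups with keep probability about 0.7.
logit_p = torch.full((4,), 0.8473, requires_grad=True)       # log(0.7 / 0.3)
z = relaxed_bernoulli(logit_p)
```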

By adopting Eq. (17), the derivatives in Eq. (16) can be propagated via the chain rule. Combining Eq. (8), Eq. (10) and Eq. (15), one can see that the a posteriori distribution over the hybrid representation can be approximated by simply training the one-shot model in an end-to-end fashion with two additional regularization terms and learnable dropout ratios.

2.3 Sampling and Ranking

Once the variational distribution $q_{\phi^*}(\boldsymbol{\omega})$ is obtained, we sample a group of $T$ network candidates w.r.t. $q_{\phi^*}(\boldsymbol{\omega})$, where $T$ is the number of samples. According to Eq. (13), our sampling process is performed by activating convolution kernels stochastically with the learned probabilities $p^k_l$, which is equivalent to a regular dropout operation. Specifically, each candidate is sampled by randomly dropping convolution kernel $W^k_{l,i}$ w.r.t. the probability $1 - p^k_l$ for every $l$, $i$ and $k$ in the one-shot model. The sampled candidates are then evaluated and ranked on a held-out validation dataset. Due to the hybrid network representation, we actually sample architecture-weight pairs, which relieves the mismatch problem. At last, the best-performing one is selected by Eq. (7).
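
The sketch below spells out this sampling-and-ranking step; supernet.keep_probs() and evaluate_masked are hypothetical helpers that expose the learned keep probabilities and evaluate the super-network under a fixed mask. Because the mask selects deterministic kernel weights, each candidate is already an architecture-weight pair and needs no re-training before ranking.

```python
# A hedged sketch of a posteriori distribution guided sampling and ranking.
import torch

def sample_and_rank(supernet, val_loader, num_candidates=1500):
    best_mask, best_acc = None, 0.0
    for _ in range(num_candidates):
        # hard Bernoulli draw per kernel group with the learned keep probabilities
        mask = {name: torch.bernoulli(p) for name, p in supernet.keep_probs().items()}
        acc = evaluate_masked(supernet, mask, val_loader)     # hypothetical evaluator
        if acc > best_acc:
            best_mask, best_acc = mask, acc
    return best_mask, best_acc
```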

Please note that our a posteriori distribution guided sampling scheme, though not intentionally designed for this purpose, leads to an adaptive dropout that reflects the importance of different parts of the one-shot model. It thus relieves the dependency on the hyper-parameter-sensitive, carefully designed dropout probability used in previous one-shot methods [23].

3 Experiments

To fully investigate the behavior of NASAS, we test it on six one-shot super-networks. Because we use the whole one-shot model as the architecture prior to facilitate Eq. (16), we construct the super-networks based on architecture priors perceived from manually designed networks. We evaluate the performance of NASAS on three datasets: Cifar-10, Cifar-100 and ImageNet. For every one-shot super-network, we insert a dropout layer after each convolution layer according to Eq. (17) to facilitate the computation of Eq. (16). This modification introduces negligible overheads in parameters and FLOPs. NASAS is trained in an end-to-end way with Stochastic Gradient Descent (SGD), using a single P40 GPU for Cifar-10/Cifar-100 and 4 M40 GPUs for ImageNet. Once a model converges, we sample different convolution kernels w.r.t. the learned dropout ratios to obtain 1500/5000/1500 candidate architectures for Cifar-10, Cifar-100 and ImageNet, respectively. These candidates are ranked on a held-out validation dataset and the one with the best performance is selected as the final search result.

3.1 Cifar-10 and Cifar-100

One-shot Model and Hyper-parameters. We test our NASAS with four super-networks, namely SupNet-M/MI and SupNet-E/EI, on Cifar-10 and Cifar-100. They are based on the manually designed multi-branch ResNet [32] and the architecture obtained by ENAS [33], respectively. Please refer to the supplementary material for more details of the one-shot models and all hyper-parameter settings used in this paper.

Method Error(%) GPU Days Params(M) Search Method
shake-shake [32] 2.86 - 26.2 -
shake-shake + cutout [34] 2.56 - 26.2 -
NAS [4] 4.47 22400 7.1 RL
NAS + more filters [4] 3.65 22400 37.4 RL
NASNET-A + cutout [5] 2.65 1800 3.3 RL
Micro NAS + Q-Learning [7] 3.60 96 - RL
PathLevel EAS + cutout [35] 2.30 8.3 13.0 RL
ENAS + cutout [33] 2.89 0.5 4.6 RL
EAS (DenseNet) [36] 3.44 10 10.7 RL
AmoebaNet-A + cutout [2] 3.34 3150 3.2 evolution
Hierarchical Evo [1] 3.63 300 61.3 evolution
PNAS [13] 3.63 225 3.2 SMBO
SMASH [16] 4.03 1.5 16.0 gradient-based
DARTS + cutout [14] 2.83 4 3.4 gradient-based
SNAS + cutout [17] 2.85 1.5 2.8 gradient-based
NAONet + cutout [37] 2.07 200 128 gradient-based
One-Shot Top [23] 3.70 - 45.3 gradient-based
NASAS-E 2.73 2.5 3.1 guided sampling
NASAS-EI 2.56 5.5 10.8 guided sampling
NASAS-M 2.20 4.8 21.6 guided sampling
NASAS-MI 2.06 6.5 33.4 guided sampling
NASAS-MI* 1.98 11.1 32.8 guided sampling
Table 1: Performance comparison with other state-of-the-art results on Cifar-10. Please note that we do not fine-tune the networks searched by our method. * indicates the architecture searched by sampling 10000 candidates. The full table can be viewed in the supplementary material.
Method Error(%) GPU Days Params(M) Search Method
NASNET-A [5] 19.70 1800 3.3 RL
ENAS [33] 19.43 0.5 4.6 RL
AmoebaNet-B [2] 17.66 3150 2.8 evolution
PNAS [13] 19.53 150 3.2 SMBO
NAONet + cutout [37] 14.36 200 128 gradient-based
NASAS-MI(ours) 14.28 11 46.4 guided sampling
Table 2: Performance comparison with other state-of-the-art results on Cifar-100. Please note that we do not fine-tune the network searched by our method.
SupNet-EI SupNet-E SupNet-MI SupNet-M
Err. Param. Err. Param. Err. Param. Err. Param.
Full model 2.78% 15.3M 2.98% 4.6M -% 72.7M 2.58% 26.2M
Random w/o FT 13.45% 10.7M 15.87% 3.0M 9.75% 35.4M 2.63% 22.4M
Random w/ FT 3.16% 10.7M 3.47% 3.0M 2.69% 35.4M 2.56% 22.4M
NASAS 2.56% 10.8M 2.73% 3.1M 2.06% 33.4M 2.20% 21.6M
(a) Impact of our a posteriori distribution guided sampling. w/o FT and w/ FT indicate whether the best searched architecture is fine-tuned on the dataset. Our NASAS does not need fine-tuning.
Weight prior 50 150 250 500
Error(%) 2.13 2.06 2.27 2.39
Params(M) 49.9 33.4 23.8 18.2
(b) Impact of the weight prior on SupNet-MI.
EI M EI†
2.74% 2.49% 2.68%
2.56% 2.20% -
(c) Impact of the temperature t. † denotes fine-tuned results.
Samples 0.05k 0.5k 1.5k 5.0k 10k 20k 50k
Error(%) 2.17 2.06 2.06 2.04 1.98 - -
GPU Days 0.02 0.23 0.69 2.31 4.63 9.26 23.15
(d) Impact of the number of sampled candidate architectures on SupNet-MI.
Table 3: Ablation study and parameter analysis.

Comparison with State-of-the-arts. Table 1 shows the comparison results on Cifar-10. Here NASAS-X denotes the performance of our NASAS on the super-network SupNet-X. From top to bottom, the first group consists of state-of-the-art manually designed architectures on Cifar-10; the following three groups list related NAS methods adopting different search algorithms, e.g. RL, evolution, and gradient descent; the last group exhibits the performance of our NASAS. It shows that NASAS is capable of finding advanced architectures in a much more efficient and effective way, e.g. it finds an architecture with the lowest error of 1.98% using only 11.1 GPU days.

We also list the two networks, multi-branch ResNet [32] and ENAS [33], that inspired our design of super-networks in Table 1. Our NASAS-E and NASAS-M outperform "ENAS + cutout" and "shake-shake + cutout" by 0.16% and 0.36%, respectively, with smaller model sizes. In the inflated cases, our NASAS-MI/EI find architectures with even higher performance. Compared with the sampling-based one-shot method "One-Shot Top", which achieves a competitive 3.7% classification error by randomly sampling 20000 network architectures, our NASAS attains much higher performance by sampling only 1500 network architectures, thanks to the a posteriori distribution guided sampling.

Table 2 further demonstrates the performance of NASAS on the more challenging Cifar-100 dataset. NASAS achieves a good trade-off between efficiency and accuracy: it reaches a 14.8% error rate with only 8.7 GPU days, which is very competitive in terms of both performance and search time.

Please note that the results of our NASAS are achieved during the search process without any additional fine-tuning of the weights of the searched architectures, while those of other methods are obtained by fine-tuning the searched models. We discuss this point further in the following ablation study.

Ablation Study and Parameter Analysis. We first evaluate the effect of our a posteriori distribution guided sampling in Table 3(a). Compared with the baseline "Random" sampling, which is implemented by employing the predefined dropout strategy discussed in [23], "NASAS" successfully finds better sub-networks that bring a relative gain of 14% - 23%. Evidently, the a posteriori distribution guided sampling is much more effective, which validates that our approach learns a meaningful distribution for efficient architecture search. Besides, as can be seen in the table, there is usually a huge performance gap between architectures searched with the predefined distribution with and without fine-tuning, which reveals the mismatch problem.

Table 3(b) discusses the weight prior in Eq. (16). We find that a good weight prior usually makes the weight-norm term in Eq. (16) fall into a commonly used weight-decay range, so we choose it by grid search. As shown in the table, the weight prior affects both the error rate and the model size: the higher the prior is, the smaller the number of parameters. Since the objective of NAS is to maximize performance rather than minimize the number of parameters, we choose the setting with the minimal error rate.

Table 3(c) shows the impact of the temperature in Eq. (17). It shows that a smaller temperature leads to a lower error, which is consistent with the analysis of Eq. (17). The corresponding fine-tuned result of our NASAS provides only marginal improvement, which in turn demonstrates the reliability of NASAS in sampling both architectures and weights.

We further evaluate the impact of the number of samples in Table 3(d). The performance improves as the number of samples, and hence the number of GPU days, increases. Here we sample 1500 architectures as a trade-off between complexity and accuracy. Please also note that, compared with other sampling-based NAS methods, our scheme achieves a 2.17% error rate by sampling only 50 architectures, with the assistance of the estimated a posteriori distribution. This further reveals that the estimated distribution provides essential information about the distribution of architectures and thus significantly facilitates the sampling process in terms of both efficiency and accuracy.

Model ResNet50 Inflated ResNet50 NASAS-R-50
Error 23.96% 22.93% 22.73%
Params 25.6M 44.0M 26.0M
Table 4: Test results on ImageNet with a relatively small super-network based on ResNet-50.

3.2 ImageNet

We further evaluate NASAS on ImageNet with two super-networks based on ResNet50 [38] and DenseNet121 [39], respectively. Please find the detailed experimental settings in the supplementary material. Rather than transferring architectures searched on a smaller dataset, the efficiency and flexibility of our method enable us to directly search architectures on ImageNet within a few days.

We first provide test results of NASAS on ImageNet in Table 4, using a relatively small search space obtained by inflating ResNet50 without limiting the number of model parameters. Hyper-parameters and training procedures for the three models are identical for a fair comparison. It can be observed that NASAS-R-50 outperforms ResNet50 by 1.23% with a similar number of parameters. Table 5 shows the comparison with state-of-the-art results on ImageNet. In this test, we control the size of the searched architecture to be comparable to those of other NAS methods under the mobile setting. Still, our NASAS outperforms them. Please note that the size control limits our choice of the weight prior; as shown in Table 3(b), it may prevent us from finding better architectures with higher performance.

Method Error(%) (Top-1/Top-5) GPU Days Params(M) Search Method
NASNET-A [5] 26.0/8.4 1800 5.3 RL
NASNET-B [5] 27.2/8.7 1800 5.3 RL
NASNET-C [5] 27.5/9.0 1800 4.9 RL
AmoebaNet-A [2] 25.5/8.0 3150 5.1 evolution
AmoebaNet-B [2] 26.0/8.5 3150 5.3 evolution
PNAS [13] 25.8/8.1 225 5.1 SMBO
FBNet-C [19] 25.1/- 9 5.5 gradient-based
SinglePath [24] 25.3/- 12 - sampling-based
DARTS [14] 26.9/9.0 4 4.9 gradient-based
SNAS [17] 27.3/9.2 1.5 4.3 gradient-based
NASAS-D-121(ours) 24.8/7.5 26 6.6 guided sampling
Table 5: Performance comparison with other state-of-the-art results on ImageNet. Please note our model is directly searched on ImageNet with 26 GPU days.

3.3 Discussions

Weight Sharing. Weight sharing is a popular technique adopted by one-shot models to greatly boost the efficiency of NAS, but it is not well understood why sharing weights is effective [40, 23]. In NASAS, as discussed in Section 2.2, we find that weight sharing can be viewed as a re-parametrization that enables us to estimate the a posteriori distribution via end-to-end network training.

Limitations and Future Works. One limitation of NASAS is that it cannot explicitly choose non-parametric operations such as pooling. Another is that NASAS requires prior knowledge on architectures, which is hard to acquire; here we approximate the prior only with manually designed networks. Our future work may therefore include 1) enabling selection of non-parametric operations (e.g., assigning a 1x1 convolution after each pooling operation as a surrogate to decide whether the pooling branch is needed), and 2) investigating the robustness of NASAS to different prior architectures.

4 Conclusion

In this paper, we propose a new one-shot NAS approach, NASAS, which explicitly approximates the a posteriori distribution over network architectures and weights via network training to facilitate a more efficient search process. It enables candidate architectures to be sampled w.r.t. the a posteriori distribution approximated on the training dataset rather than a uniform or predefined distribution. It also alleviates the mismatch problem between architectures and shared weights by sampling architecture-weight pairs, which makes the ranking results more reliable. The proposed NASAS is efficiently implemented and optimized in an end-to-end way, and thus can be easily extended to other large-scale tasks.

References

  • [1] Hanxiao Liu, Karen Simonyan, Oriol Vinyals, Chrisantha Fernando, and Koray Kavukcuoglu. Hierarchical representations for efficient architecture search. arXiv preprint arXiv:1711.00436, 2017.
  • [2] Esteban Real, Alok Aggarwal, Yanping Huang, and Quoc V Le. Regularized evolution for image classifier architecture search. arXiv preprint arXiv:1802.01548, 2018.
  • [3] Bowen Baker, Otkrist Gupta, Nikhil Naik, and Ramesh Raskar. Designing neural network architectures using reinforcement learning. arXiv preprint arXiv:1611.02167, 2016.
  • [4] Barret Zoph and Quoc V Le. Neural architecture search with reinforcement learning. arXiv preprint arXiv:1611.01578, 2016.
  • [5] Barret Zoph, Vijay Vasudevan, Jonathon Shlens, and Quoc V Le. Learning transferable architectures for scalable image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 8697–8710, 2018.
  • [6] Mingxing Tan, Bo Chen, Ruoming Pang, Vijay Vasudevan, and Quoc V Le. Mnasnet: Platform-aware neural architecture search for mobile. arXiv preprint arXiv:1807.11626, 2018.
  • [7] Zhao Zhong, Junjie Yan, Wei Wu, Jing Shao, and Cheng-Lin Liu. Practical block-wise neural network architecture generation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2423–2432, 2018.
  • [8] Arber Zela, Aaron Klein, Stefan Falkner, and Frank Hutter. Towards automated deep learning: Efficient joint neural architecture and hyperparameter search. arXiv preprint arXiv:1807.06906, 2018.
  • [9] Bowen Baker, Otkrist Gupta, Ramesh Raskar, and Nikhil Naik. Accelerating neural architecture search using performance prediction. arXiv preprint arXiv:1705.10823, 2017.
  • [10] Kevin Swersky, Jasper Snoek, and Ryan Prescott Adams. Freeze-thaw bayesian optimization. arXiv preprint arXiv:1406.3896, 2014.
  • [11] Tobias Domhan, Jost Tobias Springenberg, and Frank Hutter. Speeding up automatic hyperparameter optimization of deep neural networks by extrapolation of learning curves. In IJCAI, volume 15, pages 3460–8, 2015.
  • [12] Aaron Klein, Stefan Falkner, Jost Tobias Springenberg, and Frank Hutter. Learning curve prediction with bayesian neural networks. 2016.
  • [13] Chenxi Liu, Barret Zoph, Maxim Neumann, Jonathon Shlens, Wei Hua, Li-Jia Li, Li Fei-Fei, Alan Yuille, Jonathan Huang, and Kevin Murphy. Progressive neural architecture search. In Proceedings of the European Conference on Computer Vision (ECCV), pages 19–34, 2018.
  • [14] Hanxiao Liu, Karen Simonyan, and Yiming Yang. Darts: Differentiable architecture search. arXiv preprint arXiv:1806.09055, 2018.
  • [15] Han Cai, Ligeng Zhu, and Song Han. Proxylessnas: Direct neural architecture search on target task and hardware. arXiv preprint arXiv:1812.00332, 2018.
  • [16] Andrew Brock, Theodore Lim, James M Ritchie, and Nick Weston. Smash: one-shot model architecture search through hypernetworks. arXiv preprint arXiv:1708.05344, 2017.
  • [17] Sirui Xie, Hehui Zheng, Chunxiao Liu, and Liang Lin. Snas: stochastic neural architecture search. arXiv preprint arXiv:1812.09926, 2018.
  • [18] Tom Véniat and Ludovic Denoyer. Learning time/memory-efficient deep architectures with budgeted super networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3492–3500, 2018.
  • [19] Bichen Wu, Xiaoliang Dai, Peizhao Zhang, Yanghan Wang, Fei Sun, Yiming Wu, Yuandong Tian, Peter Vajda, Yangqing Jia, and Kurt Keutzer. Fbnet: Hardware-aware efficient convnet design via differentiable neural architecture search. CoRR, abs/1812.03443, 2018.
  • [20] Shreyas Saxena and Jakob Verbeek. Convolutional neural fabrics. In Advances in Neural Information Processing Systems, pages 4053–4061, 2016.
  • [21] Richard Shin, Charles Packer, and Dawn Song. Differentiable neural network architecture search. 2018.
  • [22] Karim Ahmed and Lorenzo Torresani. Connectivity learning in multi-branch networks. arXiv preprint arXiv:1709.09582, 2017.
  • [23] Gabriel Bender, Pieter-Jan Kindermans, Barret Zoph, Vijay Vasudevan, and Quoc Le. Understanding and simplifying one-shot architecture search. In International Conference on Machine Learning, pages 549–558, 2018.
  • [24] Zichao Guo, Xiangyu Zhang, Haoyuan Mu, Wen Heng, Zechun Liu, Yichen Wei, and Jian Sun. Single path one-shot neural architecture search with uniform sampling. arXiv preprint arXiv:1904.00420, 2019.
  • [25] Yarin Gal. Uncertainty in deep learning. PhD thesis, PhD thesis, University of Cambridge, 2016.
  • [26] Yarin Gal and Zoubin Ghahramani. Dropout as a bayesian approximation: Representing model uncertainty in deep learning. In international conference on machine learning, pages 1050–1059, 2016.
  • [27] Diederik P Kingma and Max Welling. Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114, 2013.
  • [28] Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: a simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research, 15(1):1929–1958, 2014.
  • [29] Bin Dai, Shilin Ding, Grace Wahba, et al. Multivariate bernoulli distribution. Bernoulli, 19(4):1465–1483, 2013.
  • [30] Yarin Gal, Jiri Hron, and Alex Kendall. Concrete dropout. In Advances in Neural Information Processing Systems, pages 3581–3590, 2017.
  • [31] Eric Jang, Shixiang Gu, and Ben Poole. Categorical reparameterization with gumbel-softmax. arXiv preprint arXiv:1611.01144, 2016.
  • [32] Xavier Gastaldi. Shake-shake regularization. arXiv preprint arXiv:1705.07485, 2017.
  • [33] Hieu Pham, Melody Y Guan, Barret Zoph, Quoc V Le, and Jeff Dean. Efficient neural architecture search via parameter sharing. arXiv preprint arXiv:1802.03268, 2018.
  • [34] Terrance DeVries and Graham W Taylor. Improved regularization of convolutional neural networks with cutout. arXiv preprint arXiv:1708.04552, 2017.
  • [35] Han Cai, Jiacheng Yang, Weinan Zhang, Song Han, and Yong Yu. Path-level network transformation for efficient architecture search. arXiv preprint arXiv:1806.02639, 2018.
  • [36] Han Cai, Tianyao Chen, Weinan Zhang, Yong Yu, and Jun Wang. Efficient architecture search by network transformation. AAAI, 2018.
  • [37] Renqian Luo, Fei Tian, Tao Qin, Enhong Chen, and Tie-Yan Liu. Neural architecture optimization. In Advances in Neural Information Processing Systems, pages 7826–7837, 2018.
  • [38] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.
  • [39] Gao Huang, Zhuang Liu, Laurens Van Der Maaten, and Kilian Q Weinberger. Densely connected convolutional networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 4700–4708, 2017.
  • [40] Thomas Elsken, Jan Hendrik Metzen, and Frank Hutter. Neural architecture search: A survey. arXiv preprint arXiv:1808.05377, 2018.