Improving One-shot NAS by Suppressing the Posterior Fading
There is growing interest in automated neural architecture search (NAS). To improve the efficiency of NAS, previous approaches adopt weight sharing, which forces all models to share the same set of weights. However, it has been observed that a model performing better with shared weights does not necessarily perform better when trained alone. In this paper, we analyse existing weight sharing one-shot NAS approaches from a Bayesian point of view and identify the posterior fading problem, which compromises the effectiveness of shared weights. To alleviate this problem, we present a practical approach to guide the parameter posterior towards its true distribution. Moreover, a hard latency constraint is introduced during the search so that the desired latency can be achieved. The resulting method, namely Posterior Convergent NAS (PC-NAS), achieves state-of-the-art performance under a standard GPU latency constraint on ImageNet. In our small search space, our model PC-NAS-S attains 76.8% top-1 accuracy, 2.1% higher than MobileNetV2 (1.4x) with the same latency. When applied to the large search space, PC-NAS-L achieves top-1 accuracy within 11ms. The discovered architecture also transfers well to other computer vision applications such as object detection and person re-identification.
Neural network design requires extensive experiments by human experts. In recent years, neural architecture search (Zoph and Le, 2016; Liu et al., 2018a; Zhong et al., 2018; Li et al., 2019; Lin et al., 2019) has emerged as a promising tool to reduce the human effort spent on manually balancing accuracy and resource constraints.
Early NAS works (Real et al., 2018; Elsken et al., 2017) achieve promising results but, owing to their large computational expense, have to resort to searching on proxy or subsampled datasets. Recently, attention has turned to improving search efficiency by sharing weights across models (Bender et al., 2018; Pham et al., 2018). Generally, weight sharing approaches utilize an over-parameterized network (supergraph) containing every single model, and they can be mainly divided into two categories.
The first category is the continuous relaxation method (Liu et al., 2018c; Cai et al., 2018), which keeps a set of so-called architecture parameters to represent the model, and updates these parameters alternately with the supergraph weights. The resulting model is obtained using the architecture parameters at convergence. The continuous relaxation method entails the rich-get-richer problem (Adam and Lorraine, 2019), which means that a model that performs better at the early stage would be trained more frequently (or have larger learning rates). This introduces bias and instability to the search process.
The other category is referred to as the one-shot method (Brock et al., 2017b; Guo et al., 2019; Bender et al., 2018; Chu et al., 2019), which divides the NAS procedure into a training stage and a searching stage. In the training stage, the supergraph is optimized while either dropping out each operator with a certain probability or sampling uniformly among candidate architectures. In the search stage, a search algorithm is applied to find the architecture with the highest validation accuracy under shared weights. The one-shot approach ensures fairness among all models by sampling architectures or dropping out operators uniformly. However, as identified in (Adam and Lorraine, 2019; Chu et al., 2019; Bender et al., 2018), the validation accuracy of a model with shared weights is not predictive of its true performance.
In this paper, we formulate NAS as a Bayesian model selection problem (Chipman et al., 2001). With this formulation, we can obtain a comprehensive understanding of one-shot approaches. We show that shared weights are actually a maximum likelihood estimation of a proxy distribution to the true parameter distribution. Further, we identify the common issue of weight sharing, which we call Posterior Fading, i.e., the KL-divergence between the true parameter posterior and the proxy posterior increases with the number of models contained in the supergraph.
To alleviate the aforementioned problem, we propose a practical approach to guide the convergence of the proxy distribution towards the true parameter posterior. Specifically, our approach divides the training of the supergraph into several intervals. We maintain a pool of high-potential partial models and progressively update this pool after each interval. At each training step, a partial model is sampled from the pool and complemented to a full model. To update the partial model pool, we generate candidates by extending each partial model and evaluate their potentials; the top ones among them form the new pool. Since the search space is shrunk in the upcoming training interval, the parameter posterior gets closer to the desired true posterior during this procedure. The main contributions of our work are summarized as follows:
We analyse the one-shot approaches from a Bayesian point of view and identify the associated disadvantage which we call Posterior Fading.
Inspired by the theoretical discovery, we introduce a novel NAS algorithm which guides the proxy distribution to converge towards the true parameter posterior.
We apply our proposed approach to ImageNet classification (Russakovsky et al., 2015) and achieve strong empirical results. In one typical search space (Cai et al., 2018), our PC-NAS-S attains 76.8% top-1 accuracy, higher and faster than EfficientNet-B0 (Tan and Le, 2019a), which is the previous state-of-the-art model in the mobile setting. To show the strength of our method, we also apply our algorithm to a larger search space, where our PC-NAS-L boosts the accuracy further.
2 Related work
Increasing interest is being drawn to automating the design of neural networks with machine learning techniques such as reinforcement learning or neuro-evolution, which is usually referred to as neural architecture search (NAS) (Miller et al., 1989; Liu et al., 2018b; Real et al., 2017; Zoph and Le, 2016; Baker et al., 2017; Wang et al., 2019; Liu et al., 2018c; Cai et al., 2018). This type of NAS is typically considered an agent-based explore-and-exploit process, where an agent (e.g. an evolution mechanism or a recurrent neural network (RNN)) is introduced to explore a given architecture space, with a network trained in the inner loop to obtain an evaluation that guides exploration. Such methods are computationally expensive and hard to use on large-scale datasets, e.g. ImageNet.
Recent works (Pham et al., 2018; Brock et al., 2017a; Liu et al., 2018c; Cai et al., 2018) try to alleviate this computation cost by modeling NAS as a single training process of an over-parameterized network that comprises all candidate models, in which the weights of the same operators in different models are shared. ENAS (Pham et al., 2018) reduces the computation cost by orders of magnitude, but requires an RNN agent and focuses on small-scale datasets (e.g. CIFAR10). One-shot NAS (Brock et al., 2017b) trains the over-parameterized network while dropping out each operator with increasing probability. Then it uses the pre-trained over-parameterized network to evaluate randomly sampled architectures. DARTS (Liu et al., 2018c) additionally introduces a real-valued architecture parameter for each operator and alternately trains operator weights and architecture parameters by back-propagation. ProxylessNAS (Cai et al., 2018) binarizes the real-valued parameters in DARTS to save GPU computation and memory when training the over-parameterized network.
The paradigm of ProxylessNAS (Cai et al., 2018) and DARTS (Liu et al., 2018c) introduces unavoidable bias, since operators of models that perform well in the beginning will easily get trained more and normally remain better than the others. However, they are not necessarily superior to the others when trained from scratch.
Other relevant works are ASAP (Noy et al., 2019) and XNAS (Nayman et al., 2019), which introduce pruning during the training of over-parameterized networks to improve the efficiency of NAS. Similarly, we start with an over-parameterized network and then reduce the search space to derive the optimized architecture. The distinction is that they focus on speeding up training and only prune by evaluating the architecture parameters, while we improve the rankings of models and evaluate each operator directly on the validation set by the performance of the models containing it.
In this section, we first formulate neural architecture search in a Bayesian manner. Using this setup, we introduce our PC-NAS approach and analyse its advantages over previous approaches. Finally, we discuss the search algorithm combined with the latency constraint.
3.1 A Probabilistic Setup for Model Uncertainty
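Suppose $K$ models $\mathcal{M} = \{m_1, \dots, m_K\}$ are under consideration for data $D$, and $p(D \mid \theta_k, m_k)$ describes the probability density of data $D$ given model $m_k$ and its associated parameters $\theta_k$. The Bayesian approach proceeds by assigning a prior probability distribution $p(\theta_k \mid m_k)$ to the parameters of each model, and a prior probability $p(m_k)$ to each model.
In order to ensure fairness among all models, we set the model prior $p(m_k)$ to a uniform distribution. Under this setting, we can derive
$$p(m_k \mid D) = \frac{p(D \mid m_k)\, p(m_k)}{\sum_{j} p(D \mid m_j)\, p(m_j)}, \tag{1}$$
where
$$p(D \mid m_k) = \int p(D \mid \theta_k, m_k)\, p(\theta_k \mid m_k)\, d\theta_k. \tag{2}$$
Since $p(m_k)$ is uniform, the Maximum Likelihood Estimation (MLE) of $m_k$ is just the maximum of (2). It can be inferred that $p(\theta_k \mid m_k)$ is crucial to the solution of the model selection. We are interested in attaining the model with the highest test accuracy when trained alone, thus the parameter prior $p(\theta_k \mid m_k)$ is just the posterior $p_{alone}(\theta_k \mid m_k, D)$, which means the distribution of $\theta_k$ when $m_k$ is trained alone on dataset $D$. Thus we use the term true parameter posterior to refer to $p_{alone}(\theta_k \mid m_k, D)$.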
3.2 Network Architecture Selection from a Bayesian Point of View
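We restrict our discussion to a setting that is frequently used in the NAS literature. Throughout, $D$ is the dataset and $p_{alone}$ denotes the true parameter posterior of Section 3.1. As a building block of our search space, a mixed operator (mixop), denoted by $O = \{O_1, \dots, O_N\}$, contains $N$ different choices of candidate operators $O_i$ for $i = 1, \dots, N$ in parallel. The search space is defined by $L$ mixed operators (layers) connected sequentially and interleaved by downsampling, as in Fig. 1(a). The network architecture (model) $m$ is defined by a vector $[o_1, o_2, \dots, o_L]$, with $o_l \in O$ representing the choice of operator for layer $l$. The parameter for the operator $o$ at the $l$-th layer is denoted as $\theta_{l,o}$. The parameters of the supergraph are denoted by $\theta$, which includes $\{\theta_{l,o} \mid l \in \{1, \dots, L\},\ o \in O\}$. In this setting, the parameters of each candidate operator are shared among multiple architectures. The parameters related to a specific model $m_k$ are denoted as $\theta_{m_k} = \{\theta_{1,o_1}, \theta_{2,o_2}, \dots, \theta_{L,o_L}\}$, which is a subset of the parameters of the supergraph $\theta$; the rest of the parameters are denoted as $\bar{\theta}_{m_k}$, i.e. $\theta_{m_k} \cap \bar{\theta}_{m_k} = \emptyset$ and $\theta_{m_k} \cup \bar{\theta}_{m_k} = \theta$. The posterior of all parameters $\theta$ given $m_k$ has the property $p_{alone}(\theta \mid m_k, D) = p_{alone}(\theta_{m_k} \mid m_k, D)\, p_{alone}(\bar{\theta}_{m_k} \mid m_k, D)$. Because $\bar{\theta}_{m_k}$ does not affect the prediction of $m_k$ and is not updated during training, $p_{alone}(\bar{\theta}_{m_k} \mid m_k, D)$ is uniformly distributed. Obtaining $p_{alone}(\theta \mid m_k, D)$ or an MLE of it for each single model is computationally intractable. Therefore, the one-shot method trains the supergraph by dropping out each operator (Brock et al., 2017b) or sampling different architectures (Bender et al., 2018; Chu et al., 2019) and utilizes the shared weights to evaluate single models. In this work, we adopt the latter training paradigm, while the former could be easily generalized. Suppose we sample a model $m_k$ and optimize the supergraph with a mini-batch of data based on the objective function $L_{alone}$:
$$L_{alone}(\theta, m_k, D) = -\log p_{alone}(D \mid \theta, m_k) - \log p(\theta \mid m_k),$$
which equals $-\log p_{alone}(\theta \mid m_k, D)$ up to an additive constant, and where $-\log p(\theta \mid m_k)$ is a regularization term. Thus minimizing this objective amounts to making an MLE of $p_{alone}(\theta \mid m_k, D)$. When training the supergraph, we sample many models $m_k$, and then train the parameters for these models, which corresponds to a stochastic approximation of the following objective function:
$$L_{share}(\theta, D) = \sum_{k} L_{alone}(\theta, m_k, D).$$
This is equivalent to adopting a proxy parameter posterior as follows:
$$p_{share}(\theta \mid D) = \frac{1}{Z} \prod_{k} p_{alone}(\theta \mid m_k, D),$$
where $Z$ is a normalization constant.
Maximizing $p_{share}(\theta \mid D)$ is equivalent to minimizing $L_{share}$.
We take one step further to assume that the parameters at each layer are independent, i.e.
$$p_{alone}(\theta \mid m_k, D) = \prod_{l,o} p_{alone}(\theta_{l,o} \mid m_k, D).$$
Due to the independence, we have
$$p_{share}(\theta \mid D) = \prod_{l,o} p_{share}(\theta_{l,o} \mid D), \qquad p_{share}(\theta_{l,o} \mid D) \propto \prod_{k} p_{alone}(\theta_{l,o} \mid m_k, D).$$
The KL-divergence between $p_{alone}(\theta \mid m_k, D)$ and $p_{share}(\theta \mid D)$ is as follows (up to the additive normalization term $\log Z$):
$$D_{KL}\big(p_{alone}(\theta \mid m_k, D) \,\|\, p_{share}(\theta \mid D)\big) = \int p_{alone}(\theta \mid m_k, D) \log \frac{p_{alone}(\theta \mid m_k, D)}{p_{share}(\theta \mid D)}\, d\theta = \sum_{l,o} \sum_{i \neq k} \int -p_{alone}(\theta_{l,o} \mid m_k, D) \log p_{alone}(\theta_{l,o} \mid m_i, D)\, d\theta_{l,o}. \tag{10}$$
The KL-divergence is thus just the summation of the cross-entropy of $p_{alone}(\theta_{l,o} \mid m_k, D)$ and $p_{alone}(\theta_{l,o} \mid m_i, D)$ where $i \neq k$. The cross-entropy term is always positive. Increasing the number of architectures would push $p_{share}$ away from $p_{alone}$, which is the Posterior Fading. We conclude that the non-predictive problem originates naturally from one-shot supergraph training. Based on this analysis, if we effectively reduce the number of architectures in Eq.(10), the divergence would decrease, which motivates our design in the next section.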
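As a numerical illustration of the trend described above, the following toy sketch assumes that each stand-alone posterior over a single shared scalar weight is a unit-variance Gaussian and that the proxy posterior is the normalized product of these Gaussians. Both assumptions, and the function names `proxy_posterior`, `kl_gaussian` and `kl_alone_vs_proxy`, are introduced here purely for illustration and are not part of the analysis in this paper.

```python
import math


def proxy_posterior(means, sigma=1.0):
    """Normalized product of K Gaussians N(mu_i, sigma^2).

    The product of Gaussians is Gaussian: precisions add, and with equal
    variances the mean is the plain average of the individual means.
    """
    k = len(means)
    return sum(means) / k, sigma / math.sqrt(k)


def kl_gaussian(mu_p, sigma_p, mu_q, sigma_q):
    """Closed-form KL(N(mu_p, sigma_p^2) || N(mu_q, sigma_q^2))."""
    return (math.log(sigma_q / sigma_p)
            + (sigma_p ** 2 + (mu_p - mu_q) ** 2) / (2.0 * sigma_q ** 2)
            - 0.5)


def kl_alone_vs_proxy(means, k_index=0, sigma=1.0):
    """KL between the stand-alone posterior of model k_index and the proxy."""
    mu_s, sigma_s = proxy_posterior(means, sigma)
    return kl_gaussian(means[k_index], sigma, mu_s, sigma_s)


if __name__ == "__main__":
    # Toy assumption: every model agrees on the mean, so only the number of
    # models sharing the weight changes between rows.
    for num_models in (1, 2, 4, 8, 16):
        print(num_models, round(kl_alone_vs_proxy([0.0] * num_models), 4))
```

In this toy setting the divergence is zero when a single model owns the weight and grows as more models share it, even though all models agree on the mean.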
3.3 Posterior Convergent NAS
One trivial way to mitigate the posterior fading problem is to limit the number of candidate models inside the supergraph. However, a large number of candidate models is needed for NAS to discover promising models. To resolve this conflict, we present PC-NAS, which adopts progressive search space shrinking. The resulting algorithm divides the training of shared weights into $L$ intervals, where $L$ is the number of mixed operators in the search space. The number of training epochs of a single interval is denoted as $T_i$.
The partial model pool is a collection of partial models. At the $l$-th interval, a single partial model contains $l-1$ selected operators $[o_1, o_2, \dots, o_{l-1}]$. The size of the partial model pool is denoted as $P$. After the $l$-th interval, each partial model in the pool is extended by each of the $N$ candidate operators in the $l$-th mixop. Thus there are $P \times N$ candidate extended partial models with length $l$. These candidate partial models are evaluated, and the top-$P$ among them are used as the partial model pool for interval $l+1$. An illustrative example of the partial model pool update is given in Fig. 1(b)(c)(d).
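The pool update described above can be sketched in a few lines of Python. This is a minimal illustration under our own assumptions: a partial model is encoded as a tuple of operator indices, and `potential` is a hypothetical scoring callback supplied by the caller; neither the encoding nor the names `update_pool`, `num_ops` and `pool_size` come from the paper.

```python
from typing import Callable, List, Sequence, Tuple

PartialModel = Tuple[int, ...]  # operator index chosen for each decided layer


def update_pool(
    pool: Sequence[PartialModel],
    num_ops: int,
    pool_size: int,
    potential: Callable[[PartialModel], float],
) -> List[PartialModel]:
    """Extend every partial model by each operator of the next mixop and
    keep the `pool_size` candidates with the highest potential."""
    # An empty pool stands for the single empty partial model.
    parents = list(pool) if pool else [()]
    candidates = [p + (op,) for p in parents for op in range(num_ops)]
    # sorted() is stable, so ties keep their generation order.
    ranked = sorted(candidates, key=potential, reverse=True)
    return ranked[:pool_size]
```

For example, with three operators per mixop, a pool size of two and the toy potential `sum`, two successive updates starting from an empty pool give `[(2,), (1,)]` and then `[(2, 2), (2, 1)]`.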
Candidate evaluation with latency constraint: We simply define the potential of a partial model to be the expected validation accuracy of the models which contain the partial model,
$$\text{Potential}(o_1, o_2, \dots, o_l) = \mathbb{E}_{m \in \{m \,\mid\, m_i = o_i,\ \forall i \le l\}} \big[\text{Acc}(m)\big],$$
where the validation accuracy of model $m$ is denoted by $\text{Acc}(m)$. We estimate this value by uniformly sampling valid models and computing the average of their validation accuracy, each measured on one mini-batch. We use $S$ to denote the evaluation number, which is the total number of sampled models. We observe that when $S$ is large enough, the potential of a partial model is fairly stable and discriminative among candidates. See Algorithm 1 for pseudo code. The latency constraint is imposed by discarding invalid full models when calculating the potentials of the extended candidates of partial models in the pool.
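The estimate described above can be sketched as a simple Monte Carlo average. The sketch below is not the paper's Algorithm 1: `accuracy` and `latency` are hypothetical callbacks (standing in for a one-mini-batch validation pass with shared weights and a latency lookup), models are encoded as tuples of operator indices, and `max_tries` is a safeguard we add so that the loop ends even when few models satisfy the constraint.

```python
import random
from typing import Callable, Optional, Sequence, Tuple

Model = Tuple[int, ...]


def estimate_potential(
    partial: Sequence[int],
    num_layers: int,
    num_ops: int,
    accuracy: Callable[[Model], float],
    latency: Callable[[Model], float],
    max_latency: float,
    num_samples: int,
    rng: Optional[random.Random] = None,
    max_tries: int = 10_000,
) -> float:
    """Average accuracy of uniformly sampled valid completions of `partial`."""
    rng = rng or random.Random()
    scores = []
    tries = 0
    while len(scores) < num_samples and tries < max_tries:
        tries += 1
        # Fill the undecided layers with uniformly sampled operators.
        tail = [rng.randrange(num_ops) for _ in range(num_layers - len(partial))]
        model = tuple(partial) + tuple(tail)
        if latency(model) > max_latency:
            continue  # discard models that violate the latency constraint
        scores.append(accuracy(model))
    # No valid completion found: rank this candidate last.
    return sum(scores) / len(scores) if scores else float("-inf")
```

Returning negative infinity when no valid completion is found is a design choice of this sketch; it simply keeps such candidates out of the top of the ranking.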
Training based on partial model pool: The training iteration of the supergraph along with the partial model pool has two steps. First, for a partial model from the pool, we randomly sample the missing operators to complement the partial model to a full model. Then we optimize the supergraph weights using the sampled full model and a mini-batch of data.
Initially, the partial model pool is empty. Thus the supergraph is trained with uniformly sampled models, which is identical to the previous one-shot training stage. After the initial training, all operators in the first mixop are evaluated. The top $P$ operators form the partial model pool in the second training stage. Then, the supergraph resumes training, and the training procedure is identical to the one discussed in the last paragraph. Inspired by warm-up, the first stage is given many more epochs than the following stages, denoted as $T_w$. The whole PC-NAS process is elaborated in Algorithm 2.
The number of models in the shrunk search space at interval $l+1$ is strictly less than at interval $l$. At the final interval, the number of cross-entropy terms in Eq.(10) is $P-1$ for each architecture in the final pool. Thus the parameter posterior of PC-NAS moves towards the true posterior during these intervals.
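The interplay between training and pool updates can be sketched as follows. This is a simplified illustration and not the paper's Algorithm 2: `train_step` and `potential` are hypothetical callbacks, interval lengths are counted in training steps rather than epochs, and models are encoded as tuples of operator indices.

```python
import random
from typing import Callable, List, Tuple

Partial = Tuple[int, ...]


def sample_full_model(pool, num_layers, num_ops, rng):
    """Pick a partial model from the pool and fill the undecided layers
    with uniformly sampled operators."""
    partial = rng.choice(pool) if pool else ()
    tail = tuple(rng.randrange(num_ops) for _ in range(num_layers - len(partial)))
    return partial + tail


def pc_nas_search(
    num_layers: int,
    num_ops: int,
    pool_size: int,
    warmup_steps: int,
    interval_steps: int,
    train_step: Callable[[Partial], None],
    potential: Callable[[Partial], float],
    seed: int = 0,
) -> List[Partial]:
    """Alternate supergraph training with shrinking of the search space."""
    rng = random.Random(seed)
    pool: List[Partial] = []  # empty pool: plain uniform one-shot sampling
    for layer in range(num_layers):
        # The first interval plays the role of the longer warm-up stage.
        steps = warmup_steps if layer == 0 else interval_steps
        for _ in range(steps):
            train_step(sample_full_model(pool, num_layers, num_ops, rng))
        # Extend each partial model by every operator of the current mixop
        # and keep the candidates with the highest potential.
        parents = pool or [()]
        candidates = [p + (op,) for p in parents for op in range(num_ops)]
        pool = sorted(candidates, key=potential, reverse=True)[:pool_size]
    return pool
```

After the last interval, every entry of the returned pool fixes an operator for each layer, so the pool contains full models.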
4 Experimental Results
We demonstrate the effectiveness of our methods on ImageNet, a large scale benchmark dataset, which contains over 1,000,000 training samples with 1000 classes. For this task, we focus on models that have high accuracy under a given GPU latency constraint. We search models using PC-NAS, which progressively updates a partial model pool and trains shared weights. Then, we select the model with the highest potential in the pool and report its performance on the test set after training from scratch. Finally, we investigate the transferability of the model learned on ImageNet by evaluating it on two tasks, object detection and person re-identification.
4.1 Training Details
Dataset and latency measurement: As is common practice, we randomly sample 50,000 images from the train set to form a validation set for the model search. We conduct our PC-NAS on the remaining images in the train set. The original validation set is used as the test set to report the performance of the model generated by our method. The latency is evaluated on an Nvidia GTX 1080Ti, and the batch size is set to 16 to fully utilize GPU resources.
Search spaces: We use two search spaces. Our small space is similar to those of ProxylessNAS (Cai et al., 2018) and FBNet (Wu et al., 2018) for fair comparison. To test our PC-NAS method in a more complicated search space, we add 3 more kinds of operators to the small space's mixoperators to construct our large space. Details of the two spaces are in A.1.
PC-NAS hyperparameters: We use PC-NAS to search in both the small and the large space. To balance training time and performance, we use the same evaluation number $S$ and the same partial model pool size $P = 5$ in both experiments. An ablation study of the two values is given in Section 4.4. When updating the weights of the supergraph, we adopt a mini-batch Nesterov SGD optimizer with momentum 0.9, cosine learning rate decay from 0.1 to 5e-4, batch size 512, and L2 regularization with weight 1e-4. The warm-up epochs and shrinking interval are set to 100 and 5, so with 21 mixoperators the total training of the supergraph lasts $100 + 20 \times 5 = 200$ epochs. After searching, we select the best one from the top 5 final partial models and train it from scratch. The hyperparameters used to train this best model are the same as those of the supergraph, and the training takes 300 epochs. We add a squeeze-and-excitation layer at the end of each operator of this model and use mixup when training the resulting model.
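For reference, a cosine learning rate decay between the two values quoted above can be written as below. This is a sketch of the standard cosine annealing formula applied per epoch; whether the schedule in our experiments is stepped per epoch or per iteration is not stated in the text, and the function name `cosine_lr` is introduced only for this illustration.

```python
import math


def cosine_lr(epoch: float, total_epochs: float,
              lr_max: float = 0.1, lr_min: float = 5e-4) -> float:
    """Cosine decay from lr_max (at epoch 0) to lr_min (at total_epochs)."""
    # Clamp progress so that calls outside [0, total_epochs] stay bounded.
    progress = min(max(epoch / total_epochs, 0.0), 1.0)
    return lr_min + 0.5 * (lr_max - lr_min) * (1.0 + math.cos(math.pi * progress))
```

With a 300-epoch run, `cosine_lr(0, 300)` gives the initial rate 0.1, `cosine_lr(150, 300)` gives the midpoint between the two rates, and `cosine_lr(300, 300)` gives the final rate 5e-4.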
4.2 ImageNet Results
|model||space||params||latency||top-1 acc|
|MobileNetV2 1.4x (Sandler et al., 2018)||-||6.9M||10ms||74.7%|
|AmoebaNet-A (Real et al., 2018)||-||5.1M||23ms||74.5%|
|PNASNet (Liu et al., 2018a)||5.6x||5.1M||25ms||74.2%|
|MnasNet (Tan et al., 2018)||-||4.4M||11ms||74.8%|
|FBNet-C (Wu et al., 2018)|| ||5.5M||-||74.9%|
|ProxylessNAS-gpu (Cai et al., 2018)|| ||7.1M||8ms||75.1%|
|EfficientNet-B0 (Tan and Le, 2019a)||-||5.3M||13ms||76.3%|
|MixNet-S (Tan and Le, 2019b)||-||4.1M||13ms||75.8%|
Table 1 shows the performance of our model on ImageNet. We set our target latency according to our measurement of mobile-setting models on GPU. Our search result in the small space, namely PC-NAS-S, achieves 76.8% top-1 accuracy under our latency constraint, which is 0.5% higher than EfficientNet-B0 (in terms of absolute accuracy improvement) and 1.0% higher than MixNet-S. If we slightly relax the time constraint, our search result from the large space, namely PC-NAS-L, further improves top-1 accuracy over both EfficientNet-B0 and MixNet-S. Both PC-NAS-S and PC-NAS-L are faster than EfficientNet-B0 and MixNet-S.
|backbone||params||latency||COCO mAP||Market-1501 mAP|
4.3 Transferability of PC-NAS
We validate the transferability of PC-NAS on two tasks, object detection and person re-identification. We use the COCO (Lin et al., 2014) dataset as the benchmark for object detection and Market-1501 (Zheng et al., 2015) for person re-identification. For both datasets, PC-NAS-L pretrained on ImageNet is used as the feature extractor and is compared with other models under the same training script. For object detection, the experiment is conducted with the two-stage framework FPN (Lin et al., 2017). Table 2 shows the performance of our PC-NAS model on COCO and Market-1501. For COCO, our approach significantly surpasses the mAP of MobileNetV2 as well as ResNet50. Compared to the standard ResNet101 backbone, our model achieves comparable mAP with fewer parameters and faster speed. Similar phenomena are found on Market-1501.
4.4 Ablation Study
Impact of hyperparameters: In this section, we further study the impact of hyperparameters on our method within our small space on ImageNet. The hyperparameters include the warm-up training epochs $T_w$, the partial model pool size $P$, and the evaluation number $S$. We tried setting $T_w$ to 100 and 150 with fixed $P$ and $S$. The resulting models of these two settings show no significant difference in top-1 accuracy (less than 0.1%), as shown in Fig. 1(a). Thus we choose 100 warm-up training epochs in our experiments to save computation resources. For the influence of $P$ and $S$, we show the results in Fig. 1(a). It can be seen that the top-1 accuracy of the models found by PC-NAS increases with both $P$ and $S$. Thus we fix $P$ and $S$ accordingly in the experiments for better performance. We did not observe significant improvement when further increasing these two hyperparameters.
Effectiveness of shrinking search space: To assess the role of space shrinking, we train the supergraph of our large space using the One-Shot (Brock et al., 2017b) method without any shrinking of the search space. Then we conduct model search on this supergraph by progressively updating a partial model pool as in our method. The resulting model attains lower top-1 accuracy on ImageNet than our PC-NAS-L, as shown in Table 3.
We add another comparison as follows. First, we select a batch of models from the candidates of our final pool under the small space and evaluate their stand-alone top-1 accuracy. Then we use One-Shot to train the supergraph, also under the small space, without shrinking. Finally, we show the model rankings of PC-NAS and One-Shot using the accuracy obtained by running inference on the models in the supergraphs trained with the two methods. The difference is shown in Fig. 1(b): the Pearson correlation coefficients between stand-alone accuracy and accuracy in the supergraph are 0.11 for One-Shot and 0.92 for PC-NAS. Thus models under PC-NAS's space shrinking can be ranked by their accuracy evaluated on shared weights much more precisely than with One-Shot.
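The Pearson correlation coefficient used in this comparison can be computed as in the short sketch below. The function name `pearson` and its argument names are ours; the inputs would be the stand-alone accuracies and the shared-weight accuracies of the same batch of models, in the same order.

```python
import math
from typing import Sequence


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0.0 or var_y == 0.0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)
```

A value close to 1 means that ranking models by shared-weight accuracy closely follows their stand-alone ranking, while a value close to 0 means the shared-weight accuracy carries little ranking information.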
Effectiveness of our search method: To investigate the importance of our search method, we utilize an Evolution Algorithm (EA) to search for models with the above supergraph of our large space trained with One-Shot. The top-1 accuracy of the discovered model drops further and is lower than that of PC-NAS-L. We implement EA with population size 5, aligned with the pool size in our method, and define the mutation operation as randomly replacing the operator in one mixop with another. We constrain the total number of validation images in EA to be the same as ours. The results are shown in Table 3.
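The mutation operation of the EA baseline can be illustrated as follows. This is a sketch under our own encoding (a model is a tuple of operator indices, one per mixop); the function name `mutate` and its parameters are hypothetical, and the rest of the EA loop (selection, population update) is not shown.

```python
import random
from typing import Tuple


def mutate(model: Tuple[int, ...], num_ops: int, rng: random.Random) -> Tuple[int, ...]:
    """Replace the operator of one randomly chosen mixop with a different one."""
    if num_ops < 2:
        raise ValueError("need at least two candidate operators to mutate")
    layer = rng.randrange(len(model))
    # Exclude the current operator so the child always differs from the parent.
    choices = [op for op in range(num_ops) if op != model[layer]]
    child = list(model)
    child[layer] = rng.choice(choices)
    return tuple(child)
```

Each call changes exactly one position of the parent, which matches the single-mixop replacement described above.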
|training method||search method||top-1 acc|
In this paper, a new architecture search approach called PC-NAS is proposed. We study the conventional weight sharing approach from a Bayesian point of view and identify a key issue that compromises the effectiveness of shared weights. With this theoretical insight, a practical method is devised to mitigate the issue. Experimental results demonstrate the effectiveness of our method, which achieves state-of-the-art performance on ImageNet and transfers well to COCO detection and person re-identification.
- Understanding neural architecture search techniques. arXiv preprint arXiv:1904.00438.
- Designing neural network architectures using reinforcement learning. International Conference on Learning Representations.
- Understanding and simplifying one-shot architecture search. ICML.
- SMASH: one-shot model architecture search through hypernetworks. NIPS Workshop on Meta-Learning.
- SMASH: one-shot model architecture search through hypernetworks. arXiv preprint arXiv:1708.05344.
- ProxylessNAS: direct neural architecture search on target task and hardware. arXiv preprint arXiv:1812.00332.
- The practical implementation of Bayesian model selection. In Institute of Mathematical Statistics Lecture Notes - Monograph Series, 38, pp. 65–116.
- FairNAS: rethinking evaluation of weight sharing neural architecture search. arXiv preprint arXiv:1907.01845v2.
- Simple and efficient architecture search for convolutional neural networks. ICLR workshop.
- Single path one-shot neural architecture search with uniform sampling. arXiv preprint arXiv:1904.00420.
- AM-LFS: AutoML for loss function search. arXiv preprint arXiv:1905.07375.
- Online hyper-parameter learning for auto-augmentation strategy. arXiv preprint arXiv:1905.07373.
- Feature pyramid networks for object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2117–2125.
- Microsoft COCO: common objects in context. In European Conference on Computer Vision, pp. 740–755.
- Progressive neural architecture search. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 19–34.
- Hierarchical representations for efficient architecture search. ICLR.
- DARTS: differentiable architecture search. arXiv preprint arXiv:1806.09055.
- Designing neural networks using genetic algorithms. ICGA, volume 89, pp. 379–384.
- XNAS: neural architecture search with expert advice. arXiv preprint arXiv:1906.08031.
- ASAP: architecture search, anneal and prune. arXiv preprint arXiv:1904.04123.
- Efficient neural architecture search via parameter sharing. arXiv preprint arXiv:1802.03268.
- Regularized evolution for image classifier architecture search. arXiv preprint arXiv:1802.01548.
- Large-scale evolution of image classifiers. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pp. 2902–2911.
- ImageNet large scale visual recognition challenge. International Journal of Computer Vision, 115(3), pp. 211–252.
- MobileNetV2: inverted residuals and linear bottlenecks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4510–4520.
- MnasNet: platform-aware neural architecture search for mobile. arXiv preprint arXiv:1807.11626.
- EfficientNet: rethinking model scaling for convolutional neural networks. arXiv preprint arXiv:1905.11946.
- MixNet: mixed depthwise convolutional kernels. BMVC.
- AlphaX: exploring neural architectures with deep neural networks and Monte Carlo tree search. arXiv preprint arXiv:1903.11059.
- FBNet: hardware-aware efficient convnet design via differentiable neural architecture search. arXiv preprint arXiv:1812.03443.
- Scalable person re-identification: a benchmark. In Proceedings of the IEEE International Conference on Computer Vision, pp. 1116–1124.
- Practical block-wise neural network architecture generation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2423–2432.
- Neural architecture search with reinforcement learning. arXiv preprint arXiv:1611.01578.
Appendix A
A.1 Construction of the Search Space
The operators in our spaces have structures described by either Conv1x1-ConvNxM-Conv1x1 or Conv1x1-ConvNxM-ConvMxN-Conv1x1. We define expand ratio as the ratio between the channel numbers of the ConvNxM in the middle and the input of the first Conv1x1.
Small search space
Our small search space contains a set of MBConv operators (mobile inverted bottleneck convolution (Sandler et al., 2018)) with different kernel sizes and expand ratios, plus Identity, adding up to 10 operators to form a mixoperator. The 10 operators in our small search space are listed in the left column of Table 4, where notation OP_X_Y represents the specific operator OP with expand ratio X and kernel size Y.
Large search space
We add 3 more kinds of operators to the mixoperators of our large search space, namely NConv, DConv, and RConv. We use these 3 operators with different kernel sizes and expand ratios to form 10 operators exclusively for the large space, so the large space contains 20 operators. For the large search space, the structures of NConv and DConv are Conv1x1-ConvKxK-Conv1x1 and Conv1x1-ConvKxK-ConvKxK-Conv1x1 respectively, and that of RConv is Conv1x1-Conv1xK-ConvKx1-Conv1x1. The kernel sizes and expand ratios of the operators exclusive to the large space are listed in the right column of Table 4, where the notation OP_X_Y represents the specific operator OP with expand ratio X and K=Y.
There are altogether 21 mixoperators in both the small and the large search space. Thus our small search space contains $10^{21}$ models, while the large one contains $20^{21}$.
|Operators in both large and small space||Operators exclusively in large space|
A.2 Specifications of the Resulting Models
The specifications of PC-NAS-S and PC-NAS-L are shown in Fig. 3. We observe that PC-NAS-S adopts either a high expansion rate or a large kernel size at the tail end, which enables full use of high-level features. However, it tends to select small kernels and low expansion rates to ensure the model remains lightweight. PC-NAS-L chooses many of the powerful bottlenecks contained exclusively in the large space to achieve the accuracy boost. High expansion rates appear less frequently, which compensates for the computation consumed by large kernel sizes. Both PC-NAS-S and PC-NAS-L tend to use heavy operators where the resolution is reduced, avoiding excessive information loss at these positions.