Improving One-shot NAS by Suppressing the Posterior Fading
Abstract
There is a growing interest in automated neural architecture search (NAS). To improve the efficiency of NAS, previous approaches adopt weight sharing, which forces all models to share the same set of weights. However, it has been observed that a model performing better with shared weights does not necessarily perform better when trained alone. In this paper, we analyse existing weight sharing one-shot NAS approaches from a Bayesian point of view and identify the posterior fading problem, which compromises the effectiveness of shared weights. To alleviate this problem, we present a practical approach to guide the parameter posterior towards its true distribution. Moreover, a hard latency constraint is introduced during the search so that the desired latency can be achieved. The resulting method, namely Posterior Convergent NAS (PC-NAS), achieves state-of-the-art performance under a standard GPU latency constraint on ImageNet. In our small search space, our model PC-NAS-S attains 76.8% top-1 accuracy, 2.1% higher than MobileNetV2 (1.4x) with the same latency. When adopted to the large search space, PC-NAS-L achieves 78.1% top-1 accuracy within 11ms. The discovered architecture also transfers well to other computer vision applications such as object detection and person re-identification.
1 Introduction
Neural network design requires extensive experiments by human experts. In recent years, neural architecture search (Zoph and Le, 2016; Liu et al., 2018a; Zhong et al., 2018; Li et al., 2019; Lin et al., 2019) has emerged as a promising tool to alleviate the human effort of manually balancing accuracy and resource constraints.
Early works on NAS (Real et al., 2018; Elsken et al., 2017) achieve promising results but have to resort to searching on proxy tasks or subsampled datasets due to the large computation expense. Recently, attention has been drawn to improving search efficiency by sharing weights across models (Bender et al., 2018; Pham et al., 2018). Generally, weight sharing approaches utilize an over-parameterized network (supergraph) containing every single model. These approaches can be mainly divided into two categories.
The first category is the continuous relaxation method (Liu et al., 2018c; Cai et al., 2018), which keeps a set of so-called architecture parameters to represent the model and updates these parameters alternately with the supergraph weights. The resulting model is obtained from the architecture parameters at convergence. The continuous relaxation method entails the rich-get-richer problem (Adam and Lorraine, 2019): a model that performs better at an early stage is trained more frequently (or with larger learning rates). This introduces bias and instability to the search process.
The other category is referred to as the one-shot method (Brock et al., 2017b; Guo et al., 2019; Bender et al., 2018; Chu et al., 2019), which divides the NAS procedure into a training stage and a searching stage. In the training stage, the supergraph is optimized while either dropping out each operator with certain probability or sampling uniformly among candidate architectures. In the search stage, a search algorithm is applied to find the architecture with the highest validation accuracy under the shared weights. The one-shot approach ensures fairness among all models by sampling architectures or dropping out operators uniformly. However, as identified in (Adam and Lorraine, 2019; Chu et al., 2019; Bender et al., 2018), the validation accuracy of a model with shared weights is not predictive of its true performance.
In this paper, we formulate NAS as a Bayesian model selection problem (Chipman et al., 2001). With this formulation, we can obtain a comprehensive understanding of one-shot approaches. We show that the shared weights are actually a maximum likelihood estimate of a proxy distribution of the true parameter distribution. Further, we identify a common issue of weight sharing, which we call Posterior Fading: the KL-divergence between the true parameter posterior and the proxy posterior increases with the number of models contained in the supergraph.
To alleviate the aforementioned problem, we propose a practical approach to guide the convergence of the proxy distribution towards the true parameter posterior. Specifically, our approach divides the training of the supergraph into several intervals. We maintain a pool of high-potential partial models and progressively update this pool after each interval. At each training step, a partial model is sampled from the pool and complemented to a full model. To update the partial model pool, we generate candidates by extending each partial model and evaluate their potentials; the top ones among them form the new pool. Since the search space shrinks in each upcoming training interval, the parameter posterior gets closer to the desired true posterior during this procedure. The main contributions of our work are summarized as follows:

We analyse one-shot approaches from a Bayesian point of view and identify the associated disadvantage, which we call Posterior Fading.

Inspired by this theoretical discovery, we introduce a novel NAS algorithm which guides the proxy distribution to converge towards the true parameter posterior.
We apply our proposed approach to ImageNet classification (Russakovsky et al., 2015) and achieve strong empirical results. In a typical search space (Cai et al., 2018), our PC-NAS-S attains 76.8% top-1 accuracy, 0.5% higher and faster than EfficientNet-B0 (Tan and Le, 2019a), the previous state-of-the-art model in the mobile setting. When applied to a larger search space, our PC-NAS-L boosts the accuracy to 78.1%.
2 Related work
Increasing interest has been drawn to automating the design of neural networks with machine learning techniques such as reinforcement learning or neuro-evolution, which is usually referred to as neural architecture search (NAS) (Miller et al., 1989; Liu et al., 2018b; Real et al., 2017; Zoph and Le, 2016; Baker et al., 2017; Wang et al., 2019; Liu et al., 2018c; Cai et al., 2018). This type of NAS is typically considered an agent-based explore-and-exploit process: an agent (e.g. an evolution mechanism or a recurrent neural network (RNN)) is introduced to explore a given architecture space, with a network trained in the inner loop to obtain an evaluation for guiding exploration. Such methods are computationally expensive and hard to use on large-scale datasets, e.g. ImageNet.
Recent works (Pham et al., 2018; Brock et al., 2017a; Liu et al., 2018c; Cai et al., 2018) try to alleviate this computation cost by modeling NAS as a single training process of an over-parameterized network that comprises all candidate models, in which the weights of the same operators in different models are shared. ENAS (Pham et al., 2018) reduces the computation cost by orders of magnitude, but requires an RNN agent and focuses on small-scale datasets (e.g. CIFAR-10). One-shot NAS (Brock et al., 2017b) trains the over-parameterized network while dropping out each operator with increasing probability, and then uses the pretrained over-parameterized network to evaluate randomly sampled architectures. DARTS (Liu et al., 2018c) additionally introduces a real-valued architecture parameter for each operator and alternately trains operator weights and architecture parameters by back-propagation. ProxylessNAS (Cai et al., 2018) binarizes the real-valued parameters in DARTS to save GPU computation and memory when training the over-parameterized network.
The paradigm of ProxylessNAS (Cai et al., 2018) and DARTS (Liu et al., 2018c) introduces unavoidable bias, since operators of models performing well at the beginning easily get trained more and normally keep being better than others, yet they are not necessarily superior when trained from scratch.
Other relevant works are ASAP (Noy et al., 2019) and XNAS (Nayman et al., 2019), which introduce pruning during the training of over-parameterized networks to improve the efficiency of NAS. Similarly, we start with an over-parameterized network and then reduce the search space to derive the optimized architecture. The distinction is that they focus on speeding up training and prune only by evaluating the architecture parameters, while we improve the ranking of models and evaluate operators directly on the validation set by the performance of the models containing them.
3 Methods
In this section, we first formulate neural architecture search in a Bayesian manner. Utilizing this setup, we introduce our PC-NAS approach and analyse its advantage over previous approaches. Finally, we discuss the search algorithm combined with the latency constraint.
3.1 A Probabilistic Setup for Model Uncertainty
Suppose K models {m_1, …, m_K} are under consideration for data D, and p(D | θ_k, m_k) describes the probability density of data D given model m_k and its associated parameters θ_k. The Bayesian approach proceeds by assigning a prior probability distribution p(θ_k | m_k) to the parameters of each model, and a prior probability p(m_k) to each model.
In order to ensure fairness among all models, we set the model prior p(m_k) to a uniform distribution. Under this setting, we can derive
(1) p(m_k | D) = p(D | m_k) p(m_k) / Σ_l p(D | m_l) p(m_l),
where
(2) p(D | m_k) = ∫ p(D | θ_k, m_k) p(θ_k | m_k) dθ_k.
Since p(m_k) is uniform, maximizing the model posterior in (1) reduces to maximizing the evidence in (2). It can be inferred that the parameter prior p(θ_k | m_k) is crucial to the solution of the model selection. We are interested in attaining the model with the highest test accuracy when trained alone; thus the appropriate parameter prior is exactly the posterior p(θ_k | D), i.e. the distribution of θ_k when m_k is trained alone on dataset D. We therefore use the term true parameter posterior to refer to p(θ_k | D).
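To make the evidence computation in Eq. (2) concrete, here is a minimal numeric sketch (our illustration, not part of the paper's method): two hypothetical one-parameter Gaussian models differ only in their parameter prior, and under the uniform model prior of Eq. (1) the posterior odds reduce to the evidence ratio. All names and values below are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(1.0, 1.0, size=20)   # observations actually drawn with mean 1

def log_lik(theta, x):
    # Gaussian likelihood with unit variance and mean theta, for a grid of thetas
    return (-0.5 * np.sum((x[None, :] - theta[:, None]) ** 2, axis=1)
            - 0.5 * len(x) * np.log(2 * np.pi))

theta = np.linspace(-5.0, 5.0, 2001)
d_theta = theta[1] - theta[0]

def evidence(prior_mean):
    # Eq. (2): p(D | m) = integral of p(D | theta, m) p(theta | m) dtheta, on a grid
    prior = np.exp(-0.5 * (theta - prior_mean) ** 2) / np.sqrt(2 * np.pi)  # N(prior_mean, 1)
    return np.sum(np.exp(log_lik(theta, data)) * prior) * d_theta

# two models differing only in their parameter prior
ev_a = evidence(prior_mean=1.0)    # prior centered near the truth
ev_b = evidence(prior_mean=-3.0)   # prior centered far from the truth

# Eq. (1) with a uniform model prior: posterior odds reduce to the evidence ratio
post_a = ev_a / (ev_a + ev_b)
```

The model whose prior agrees with the distribution of well-trained parameters dominates the selection, which is why the choice of parameter prior matters.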
3.2 Network Architecture Selection from a Bayesian Point of View
We constrain our discussion to the setting frequently used in the NAS literature. As a building block of our search space, a mixed operator (mixop), denoted by O = {O_1, …, O_N}, contains N different choices of candidate operators O_i in parallel. The search space is defined by L mixed operators (layers) connected sequentially, interleaved by downsampling, as in Fig. 1(a). A network architecture (model) m is defined by a vector [o_1, o_2, …, o_L], where o_l ∈ O represents the choice of operator at layer l. The parameter of the i-th operator at the l-th layer is denoted as θ_{l,i}, and the parameters of the supergraph are denoted by θ, which includes every θ_{l,i}. In this setting, the parameters of each candidate operator are shared among multiple architectures. The parameters related to a specific model m are denoted as θ_m, a subset of the supergraph parameters θ; the rest of the parameters are denoted as θ_{−m}, i.e. θ = θ_m ∪ θ_{−m}. The posterior of all parameters given m factorizes as p(θ | D, m) = p(θ_m | D, m) p(θ_{−m} | D, m). Since θ_{−m} does not affect the prediction of m and is not updated while training m, θ_{−m} is uniformly distributed, so p(θ | D, m) ∝ p(θ_m | D, m). Obtaining p(θ_m | D, m), or an MLE of it, for each single model is computationally intractable. Therefore, the one-shot method trains the supergraph by dropping out each operator (Brock et al., 2017b) or sampling different architectures (Bender et al., 2018; Chu et al., 2019) and utilizes the shared weights to evaluate single models. In this work, we adopt the latter training paradigm, while the former one could be easily generalized. Suppose we sample a model m and optimize the supergraph with a mini-batch of data D_t based on the objective function L_m:
(3) L_m(θ_m) = −log p(D_t | θ_m, m) + R(θ_m),
where R(θ_m) is a regularization term. Minimizing this objective thus amounts to a maximum likelihood estimation of θ_m. When training the supergraph, we sample many models {m_1, …, m_K} and train the parameters of these models, which corresponds to a stochastic approximation of the following objective function:
(4) L(θ) = Σ_{k=1}^{K} L_{m_k}(θ_{m_k}).
This is equivalent to adopting a proxy parameter posterior as follows:
(5) p_prox(θ | D) ∝ Π_{k=1}^{K} p(θ_{m_k} | D),
(6) log p_prox(θ | D) = Σ_{k=1}^{K} log p(θ_{m_k} | D) + const.
Maximizing log p_prox(θ | D) is equivalent to minimizing L(θ) in (4).
We take one step further and assume that the parameters at each layer are independent, i.e.
(7) p(θ_m | D) = Π_{l=1}^{L} p(θ_{l, o_l} | D).
Due to the independence, we have
(8) p_prox(θ_{l,i} | D) ∝ Π_{k ∈ S_{l,i}} p_k(θ_{l,i} | D),
where
(9) S_{l,i} = {k : model m_k selects operator O_i at layer l}, with p_k(θ_{l,i} | D) denoting the posterior of θ_{l,i} when m_k is trained alone.
The KL-divergence between p_k(θ_{l,i} | D) and p_prox(θ_{l,i} | D) is as follows:
(10) D_KL( p_k(θ_{l,i} | D) ‖ p_prox(θ_{l,i} | D) ) = ∫ p_k(θ_{l,i} | D) log [ p_k(θ_{l,i} | D) / p_prox(θ_{l,i} | D) ] dθ_{l,i} = Σ_{j ∈ S_{l,i}, j ≠ k} H( p_k(θ_{l,i} | D), p_j(θ_{l,i} | D) ) + const.
The KL-divergence is thus just the summation of the cross-entropies of p_k(θ_{l,i} | D) and p_j(θ_{l,i} | D) for j ≠ k. Each cross-entropy term is positive, so increasing the number of architectures that share the operator pushes p_prox(θ_{l,i} | D) further away from p_k(θ_{l,i} | D); we name this effect Posterior Fading. We conclude that the non-predictive problem originates naturally from one-shot supergraph training. Based on this analysis, if we effectively reduce the number of cross-entropy terms in Eq. (10), the divergence decreases, which motivates our design in the next section.
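The effect can be reproduced numerically. The sketch below (our illustration, not the paper's experiment) treats the single-model posteriors p_j(θ_{l,i} | D) of one shared parameter as hypothetical 1-D Gaussians on a grid, and measures the KL-divergence of Eq. (10) between one model's posterior and the normalized product that weight sharing implicitly fits; the divergence grows as more models share the operator.

```python
import numpy as np

theta = np.linspace(-10.0, 10.0, 4001)
d = theta[1] - theta[0]

def normal(mu, var):
    p = np.exp(-0.5 * (theta - mu) ** 2 / var)
    return p / (p.sum() * d)          # normalize on the grid

def kl(p, q):
    q = np.maximum(q, 1e-300)         # avoid log(0) in the proxy's thin tails
    return np.sum(p * np.log(p / q)) * d

true_post = normal(0.0, 1.0)          # posterior of the shared parameter under model m_1
rng = np.random.default_rng(0)
mus = rng.normal(0.0, 1.0, size=64)   # posteriors induced by other models (invented means)
mus[0] = 0.0

kls = []
for K in (1, 4, 16, 64):
    # proxy posterior (Eq. 5): normalized product of K single-model posteriors
    log_prox = sum(-0.5 * (theta - mu) ** 2 for mu in mus[:K])
    prox = np.exp(log_prox - log_prox.max())
    kls.append(kl(true_post, prox / (prox.sum() * d)))
```

With K = 1 the proxy equals the true posterior (zero divergence); each additional model sharing the operator narrows and shifts the product, and the divergence increases monotonically.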
3.3 Posterior Convergent NAS
One trivial way to mitigate the posterior fading problem is to limit the number of candidate models inside the supergraph. However, a large number of candidate models is needed for NAS to discover promising models. Due to this conflict, we present PC-NAS, which adopts progressive search space shrinking. The resulting algorithm divides the training of the shared weights into L intervals, where L is the number of mixed operators in the search space, each interval lasting a fixed number of training epochs.
The partial model pool is a collection of partial models. At the l-th interval, a partial model contains l selected operators [o_1, …, o_l]. The size of the partial model pool is denoted as P. After the l-th interval, each partial model in the pool is extended by the N operators of the (l+1)-th mixop, giving P × N candidate partial models of length l + 1. These candidates are evaluated, and the top P among them form the partial model pool for the (l+1)-th interval. An illustrative example of the partial model pool update is shown in Fig. 1(b)(c)(d).
Candidate evaluation with latency constraint: We define the potential of a partial model to be the expected validation accuracy of the models that contain it:
(11) Potential(o_1, …, o_l) = E_{m : (o_1, …, o_l) ⊂ m} [ Acc(m) ],
where the validation accuracy of model m is denoted by Acc(m). We estimate this value by uniformly sampling valid models containing the partial model and averaging their validation accuracy, each computed on one mini-batch. We use S to denote the evaluation number, i.e. the total number of sampled models. We observe that when S is large enough, the potential of a partial model is fairly stable and discriminative among candidates. See Algorithm 1 for pseudo code. The latency constraint is imposed by discarding invalid full models when calculating the potentials of the extended candidates of the partial models in the pool.
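The candidate evaluation and pool update described above can be sketched as follows. This is a toy illustration under invented stand-ins: a closed-form accuracy oracle replaces supergraph evaluation, a per-operator latency table replaces on-device measurement, and `N_OPS`, `P`, `S`, and the latency budget are hypothetical values.

```python
import random

N_OPS, N_LAYERS = 4, 5
P, S = 3, 8                       # pool size and evaluation number (toy values)
LATENCY = [1.0, 2.0, 3.0, 4.0]    # hypothetical per-operator latency, same at every layer
MAX_LATENCY = 12.0                # hypothetical budget

def valid(model):
    return sum(LATENCY[op] for op in model) <= MAX_LATENCY

def sample_completion(partial):
    # uniformly sample full models containing the partial model, dumping invalid ones
    while True:
        rest = tuple(random.randrange(N_OPS) for _ in range(N_LAYERS - len(partial)))
        full = tuple(partial) + rest
        if valid(full):
            return full

def potential(partial, acc_fn):
    # Eq. (11): expected accuracy over S sampled valid completions
    return sum(acc_fn(sample_completion(partial)) for _ in range(S)) / S

def update_pool(pool, acc_fn):
    # extend each partial model by every operator of the next mixop, keep the top P
    candidates = [p + (op,) for p in pool for op in range(N_OPS)]
    feasible = [c for c in candidates
                if sum(LATENCY[op] for op in c)
                + (N_LAYERS - len(c)) * min(LATENCY) <= MAX_LATENCY]
    ranked = sorted(feasible, key=lambda c: potential(c, acc_fn), reverse=True)
    return ranked[:P]

random.seed(0)
acc = lambda m: m.count(0) / len(m)   # toy oracle: the cheap operator 0 is also the best
pool = [()]                           # start from the empty partial model
for _ in range(N_LAYERS):
    pool = update_pool(pool, acc)
```

The feasibility filter additionally prunes partial models that cannot be completed within the budget, so the rejection sampling inside `sample_completion` always terminates.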
Training based on the partial model pool: A training iteration of the supergraph with the partial model pool has two steps. First, for a partial model sampled from the pool, we randomly sample the missing operators to complement it to a full model. Then we optimize θ using the sampled full model and a mini-batch of data. Initially, the partial model pool is empty, so the supergraph is trained with uniformly sampled models, which is identical to the previous one-shot training stage. After this initial training, all operators in the first mixop are evaluated, and the top P operators form the partial model pool for the second stage. The supergraph then resumes training, with the procedure identical to the one discussed in the last paragraph. Inspired by warm-up, the first stage is given many more epochs, denoted as T_w, than the following stages. The whole PC-NAS process is elaborated in Algorithm 2. The number of models in the shrunk search space at interval l + 1 is strictly less than at interval l. At the final interval, the number of cross-entropy terms in Eq. (10) is P − 1 for each architecture in the final pool. Thus the parameter posterior of PC-NAS moves towards the true posterior during these intervals.
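A minimal sketch of the corresponding training loop, with invented scalar "weights" and a quadratic loss standing in for the supergraph and its task loss: an empty pool reduces to uniform one-shot sampling (the warm-up stage), and each step updates only the parameters on the sampled path.

```python
import random

N_OPS, N_LAYERS, LR = 2, 2, 0.1
random.seed(0)

# one scalar weight per (layer, operator); every model selecting an operator shares it
weights = {(l, o): 0.0 for l in range(N_LAYERS) for o in range(N_OPS)}
target = {(l, o): float(l + o) for l in range(N_LAYERS) for o in range(N_OPS)}

def sample_full_model(pool):
    # complement a sampled partial model to a full model; an empty pool means
    # plain uniform one-shot sampling, i.e. the warm-up stage
    partial = random.choice(pool) if pool else ()
    rest = tuple(random.randrange(N_OPS) for _ in range(N_LAYERS - len(partial)))
    return tuple(partial) + rest

def train_step(model):
    # gradient step on the per-operator toy loss 0.5 * (w - target)^2,
    # touching only the operators on the sampled path
    for layer, op in enumerate(model):
        key = (layer, op)
        weights[key] -= LR * (weights[key] - target[key])

pool = []
for _ in range(2000):
    train_step(sample_full_model(pool))
```

Once a non-empty pool is installed after an interval, `sample_full_model` biases sampling towards the surviving partial models, which is exactly how the shrinking step reduces the number of architectures sharing each operator.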
4 Experimental Results
We demonstrate the effectiveness of our method on ImageNet, a large-scale benchmark dataset which contains over 1,000,000 training samples of 1,000 classes. For this task, we focus on models that have high accuracy under a certain GPU latency constraint. We search models using PC-NAS, which progressively updates a partial model pool and trains the shared weights. Then, we select the model with the highest potential in the pool and report its performance on the test set after training from scratch. Finally, we investigate the transferability of the model learned on ImageNet by evaluating it on two tasks: object detection and person re-identification.
4.1 Training Details
Dataset and latency measurement: As a common practice, we randomly sample 50,000 images from the train set to form a validation set during the model search. We conduct our PC-NAS on the remaining images of the train set. The original validation set is used as the test set to report the performance of the model generated by our method. The latency is evaluated on an Nvidia GTX 1080Ti with batch size 16 to fully utilize GPU resources.
Search spaces: We use two search spaces. Our small space is similar to those of ProxylessNAS (Cai et al., 2018) and FBNet (Wu et al., 2018) for fair comparison. To test our PC-NAS method in a more complicated search space, we add 3 more kinds of operators to the small space's mix-operators to construct our large space. Details of the two spaces are in A.1.
PC-NAS hyperparameters: We use PC-NAS to search in both the small and the large space. To balance training time and performance, we set the evaluation number S and the partial model pool size P identically in both experiments; an ablation study of the two values is in Section 4.4. When updating the weights of the supergraph, we adopt a mini-batch Nesterov SGD optimizer with momentum 0.9, cosine learning rate decay from 0.1 to 5e-4, batch size 512, and L2 regularization with weight 1e-4. The warm-up epochs and the shrinking interval are set to 100 and 5 epochs respectively, so the total training of the supergraph lasts 100 + 20 × 5 = 200 epochs. After searching, we select the best one among the top 5 final partial models and train it from scratch. The hyperparameters used to train this best model are the same as those of the supergraph, and the training takes 300 epochs. We add a squeeze-and-excitation layer at the end of each operator of this model and use mixup during its training.
4.2 ImageNet Results
model | space | params | latency (GPU) | top-1 acc
MobileNetV2 1.4x (Sandler et al., 2018) | - | 6.9M | 10ms | 74.7%
AmoebaNet-A (Real et al., 2018) | - | 5.1M | 23ms | 74.5%
PNASNet (Liu et al., 2018a) | - | 5.1M | 25ms | 74.2%
MnasNet (Tan et al., 2018) | - | 4.4M | 11ms | 74.8%
FBNet-C (Wu et al., 2018) | - | 5.5M | - | 74.9%
ProxylessNAS-gpu (Cai et al., 2018) | - | 7.1M | 8ms | 75.1%
EfficientNet-B0 (Tan and Le, 2019a) | - | 5.3M | 13ms | 76.3%
MixNet-S (Tan and Le, 2019b) | - | 4.1M | 13ms | 75.8%
PC-NAS-S | small | 5.1M | 10ms | 76.8%
PC-NAS-L | large | 15.3M | 11ms | 78.1%
Table 1 shows the performance of our models on ImageNet. We set our target latency to 10ms according to our measurement of mobile-setting models on GPU. Our search result in the small space, namely PC-NAS-S, achieves 76.8% top-1 accuracy under our latency constraint, which is 0.5% higher than EfficientNet-B0 (in terms of absolute accuracy improvement) and 1.0% higher than MixNet-S. If we slightly relax the latency constraint, our search result from the large space, namely PC-NAS-L, achieves 78.1% top-1 accuracy, improving top-1 accuracy by 1.8% compared to EfficientNet-B0 and by 2.3% compared to MixNet-S. Both PC-NAS-S and PC-NAS-L are faster than EfficientNet-B0 and MixNet-S.
backbone | params | latency | COCO mAP | Market-1501 mAP
MobileNetV2 | 3.5M | 7ms | 31.7 | 76.8
ResNet50 | 25.5M | 15ms | 36.8 | 80.9
ResNet101 | 44.4M | 26ms | 39.4 | 82.1
PC-NAS-L | 15.3M | 11ms | 38.5 | 81.0
4.3 Transferability of PCNAS
We validate PC-NAS's transferability on two tasks: object detection and person re-identification. We use the COCO (Lin et al., 2014) dataset as the benchmark for object detection and Market-1501 (Zheng et al., 2015) for person re-identification. For both datasets, PC-NAS-L pretrained on ImageNet is utilized as the feature extractor and compared with other models under the same training script. For object detection, the experiment is conducted with the two-stage framework FPN (Lin et al., 2017). Table 2 shows the performance of our PC-NAS model on COCO and Market-1501. On COCO, our approach significantly surpasses the mAP of MobileNetV2 as well as ResNet50. Compared to the standard ResNet101 backbone, our model achieves comparable mAP with roughly one third of the parameters (15.3M vs. 44.4M) and more than twice the speed (11ms vs. 26ms). Similar phenomena are found on Market-1501.
4.4 Ablation Study
Impact of hyperparameters: In this section, we further study the impact of hyperparameters on our method within our small space on ImageNet. The hyperparameters include the warm-up training epochs T_w, the partial model pool size P, and the evaluation number S. We tried setting T_w to 100 and 150 with P and S fixed. The resulting models of these two settings show no significant difference in top-1 accuracy (less than 0.1%), as shown in Fig. 1(a). Thus we choose 100 warm-up training epochs in our experiments to save computation resources. For the influence of P and S, we show the results in Fig. 1(a). The top-1 accuracy of the models found by PC-NAS increases with both P and S, so we choose the larger values of the two in our experiments for better performance. We did not observe significant improvement when further increasing these two hyperparameters.
Effectiveness of shrinking the search space: To assess the role of space shrinking, we train the supergraph of our large space using the One-Shot (Brock et al., 2017b) method without any shrinking of the search space. Then we conduct model search on this supergraph by progressively updating a partial model pool as in our method. The resulting model under this setting attains 77.1% top-1 accuracy on ImageNet, which is 1.0% lower than our PC-NAS-L, as shown in Table 3.
We add another comparison as follows. First, we select a batch of models from the candidates of our final pool under the small space and evaluate their stand-alone top-1 accuracy. Then we use One-Shot to train the supergraph, also under the small space, without shrinking. Finally, we compare the model rankings of PC-NAS and One-Shot using the accuracy obtained by inferring the models in the supergraphs trained with the two methods. The difference is shown in Fig. 1(b): the Pearson correlation coefficients between stand-alone accuracy and accuracy in the supergraph are 0.11 for One-Shot and 0.92 for PC-NAS, so models under PC-NAS's space shrinking can be ranked by their accuracy evaluated with shared weights much more precisely than under One-Shot.
Effectiveness of our search method: To investigate the importance of our search method, we utilize an evolution algorithm (EA) to search for models on the above supergraph of our large space trained with One-Shot. The top-1 accuracy of the discovered model drops further to 75.9%, which is 2.2% lower than PC-NAS-L. We implement EA with population size 5, aligned with the pool size P in our method, and set the mutation operation to randomly replacing the operator in one mixop with another. We constrain the total number of validation images in EA to be the same as ours. The results are shown in Table 3.
training method | search method | top-1 acc
Ours | Ours | 78.1%
One-Shot | Ours | 77.1%
One-Shot | EA | 75.9%
5 Conclusion
In this paper, a new architecture search approach called PC-NAS is proposed. We study the conventional weight sharing approach from a Bayesian point of view and identify a key issue that compromises the effectiveness of shared weights. With this theoretical insight, a practical method is devised to mitigate the issue. Experimental results demonstrate the effectiveness of our method, which achieves state-of-the-art performance on ImageNet and transfers well to COCO detection and person re-identification.
References
Adam and Lorraine (2019). Understanding neural architecture search techniques. arXiv preprint arXiv:1904.00438.
Baker et al. (2017). Designing neural network architectures using reinforcement learning. International Conference on Learning Representations.
Bender et al. (2018). Understanding and simplifying one-shot architecture search. ICML.
Brock et al. (2017a). SMASH: one-shot model architecture search through hypernetworks. NIPS Workshop on Meta-Learning.
Brock et al. (2017b). SMASH: one-shot model architecture search through hypernetworks. arXiv preprint arXiv:1708.05344.
Cai et al. (2018). ProxylessNAS: direct neural architecture search on target task and hardware. arXiv preprint arXiv:1812.00332.
Chipman et al. (2001). The practical implementation of Bayesian model selection. In Institute of Mathematical Statistics Lecture Notes - Monograph Series, 38, pp. 65–116.
Chu et al. (2019). FairNAS: rethinking evaluation of weight sharing neural architecture search. arXiv preprint arXiv:1907.01845v2.
Elsken et al. (2017). Simple and efficient architecture search for convolutional neural networks. ICLR workshop.
Guo et al. (2019). Single path one-shot neural architecture search with uniform sampling. arXiv preprint arXiv:1904.00420.
Li et al. (2019). AM-LFS: AutoML for loss function search. arXiv preprint arXiv:1905.07375.
Lin et al. (2019). Online hyper-parameter learning for auto-augmentation strategy. arXiv preprint arXiv:1905.07373.
Lin et al. (2017). Feature pyramid networks for object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2117–2125.
Lin et al. (2014). Microsoft COCO: common objects in context. In European Conference on Computer Vision, pp. 740–755.
Liu et al. (2018a). Progressive neural architecture search. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 19–34.
Liu et al. (2018b). Hierarchical representations for efficient architecture search. ICLR.
Liu et al. (2018c). DARTS: differentiable architecture search. arXiv preprint arXiv:1806.09055.
Miller et al. (1989). Designing neural networks using genetic algorithms. ICGA, volume 89, pp. 379–384.
Nayman et al. (2019). XNAS: neural architecture search with expert advice. arXiv preprint arXiv:1906.08031.
Noy et al. (2019). ASAP: architecture search, anneal and prune. arXiv preprint arXiv:1904.04123.
Pham et al. (2018). Efficient neural architecture search via parameter sharing. arXiv preprint arXiv:1802.03268.
Real et al. (2018). Regularized evolution for image classifier architecture search. arXiv preprint arXiv:1802.01548.
Real et al. (2017). Large-scale evolution of image classifiers. In Proceedings of the 34th International Conference on Machine Learning, Volume 70, pp. 2902–2911.
Russakovsky et al. (2015). ImageNet large scale visual recognition challenge. International Journal of Computer Vision, 115(3):211–252.
Sandler et al. (2018). MobileNetV2: inverted residuals and linear bottlenecks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4510–4520.
Tan et al. (2018). MnasNet: platform-aware neural architecture search for mobile. arXiv preprint arXiv:1807.11626.
Tan and Le (2019a). EfficientNet: rethinking model scaling for convolutional neural networks. arXiv preprint arXiv:1905.11946.
Tan and Le (2019b). MixNet: mixed depthwise convolutional kernels. BMVC.
Wang et al. (2019). AlphaX: exploring neural architectures with deep neural networks and Monte Carlo tree search. arXiv preprint arXiv:1903.11059.
Wu et al. (2018). FBNet: hardware-aware efficient convnet design via differentiable neural architecture search. arXiv preprint arXiv:1812.03443.
Zheng et al. (2015). Scalable person re-identification: a benchmark. In Proceedings of the IEEE International Conference on Computer Vision, pp. 1116–1124.
Zhong et al. (2018). Practical block-wise neural network architecture generation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2423–2432.
Zoph and Le (2016). Neural architecture search with reinforcement learning. arXiv preprint arXiv:1611.01578.
Appendix A
A.1 Construction of the Search Spaces
The operators in our spaces have structures described by either Conv1x1-ConvNxM-Conv1x1 or Conv1x1-ConvNxM-ConvMxN-Conv1x1. We define the expand ratio as the ratio between the channel number of the ConvNxM in the middle and that of the input of the first Conv1x1.
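For illustration, a small parameter-accounting sketch of the two templates (our sketch, not from the paper; biases and batch normalization are omitted, and the middle convolutions are assumed depthwise as in MBConv):

```python
def params_three_stage(c_in, c_out, expand, n, m):
    # Conv1x1 -> ConvNxM (depthwise) -> Conv1x1, weight counts only
    c_mid = c_in * expand     # expand ratio = middle channels / input channels
    return c_in * c_mid + c_mid * n * m + c_mid * c_out

def params_four_stage(c_in, c_out, expand, n, m):
    # Conv1x1 -> ConvNxM -> ConvMxN (both depthwise) -> Conv1x1, weight counts only
    c_mid = c_in * expand
    return c_in * c_mid + 2 * c_mid * n * m + c_mid * c_out
```

For example, under these assumptions an MBConv_6_3 block on 16 input and output channels has 16·96 + 96·9 + 96·16 = 3936 weights.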
Small search space
Our small search space contains a set of MBConv operators (mobile inverted bottleneck convolutions (Sandler et al., 2018)) with different kernel sizes and expand ratios, plus Identity, adding up to 10 operators that form a mix-operator. The 10 operators in our small search space are listed in the left column of Table 4, where the notation OP_X_Y represents the operator OP with expand ratio X and kernel size Y.
Large search space
We add 3 more kinds of operators to the mix-operators of our large search space, namely NConv, DConv, and RConv. We use these 3 operators with different kernel sizes and expand ratios to form 10 operators exclusive to the large space; thus the large space contains 20 operators. The structures of NConv and DConv are Conv1x1-ConvKxK-Conv1x1 and Conv1x1-ConvKxK-ConvKxK-Conv1x1, and that of RConv is Conv1x1-Conv1xK-ConvKx1-Conv1x1. The kernel sizes and expand ratios of the operators exclusive to the large space are listed in the right column of Table 4, where the notation OP_X_Y represents the operator OP with expand ratio X and K = Y.
There are altogether 21 mix-operators in both the small and the large search space. Thus our small search space contains 10^21 models, while the large one contains 20^21.
Operators in both large and small space | Operators exclusively in large space
MBConv_1_3  MBConv_3_3  NConv_1_3  NConv_2_3 
MBConv_6_3  MBConv_1_5  DConv_1_3  DConv_2_3 
MBConv_3_5  MBConv_6_5  RConv_1_5  RConv_2_5 
MBConv_1_7  MBConv_3_7  RConv_4_5  RConv_1_7 
MBConv_6_7  Identity  RConv_2_7  RConv_4_7 
A.2 Specifications of the Resulting Models
The specifications of PC-NAS-S and PC-NAS-L are shown in Fig. 3. We observe that PC-NAS-S adopts either a high expand ratio or a large kernel size at the tail end, which enables full use of high-level features, but elsewhere it tends to select small kernels and low expand ratios to keep the model lightweight. PC-NAS-L chooses many of the powerful bottlenecks exclusive to the large space to achieve its accuracy boost. High expand ratios appear less frequently there, compensating for the computation consumed by the large kernel sizes. Both PC-NAS-S and PC-NAS-L tend to use heavy operators where the resolution is reduced, circumventing excessive information loss at these positions.