Cream of the Crop: Distilling Prioritized Paths For One-Shot Neural Architecture Search


Abstract

One-shot weight sharing methods have recently drawn great attention in neural architecture search due to their high efficiency and competitive performance. However, weight sharing across models has an inherent deficiency, i.e., insufficient training of subnetworks in the hypernetwork. To alleviate this problem, we present a simple yet effective architecture distillation method. The central idea is that subnetworks can learn collaboratively and teach each other throughout the training process, aiming to boost the convergence of individual models. We introduce the concept of prioritized path, which refers to the architecture candidates exhibiting superior performance during training. Distilling knowledge from the prioritized paths is able to boost the training of subnetworks. Since the prioritized paths are changed on the fly depending on their performance and complexity, the final obtained paths are the cream of the crop. We directly select the most promising one from the prioritized paths as the final architecture, without using other complex search methods, such as reinforcement learning or evolutionary algorithms. The experiments on ImageNet verify that such path distillation can improve the convergence rate and performance of the hypernetwork, as well as boost the training of subnetworks. The discovered architectures achieve superior performance compared to the recent MobileNetV3 and EfficientNet families under aligned settings. Moreover, experiments on object detection and a more challenging search space show the generality and robustness of the proposed method. Code and models are available at https://github.com/microsoft/cream.git.¹

1 Introduction

Neural Architecture Search (NAS) is an exciting field which facilitates the automatic design of deep networks. It has achieved state-of-the-art performance on a variety of tasks, surpassing manually designed counterparts [e.g., Zoph et al., 2018; Du et al., 2019; Liu et al., 2019a]. Recently, one-shot NAS methods have become popular due to their low computation overhead and competitive performance. Rather than training thousands of separate models from scratch, one-shot methods only train a single large hypernetwork capable of emulating any architecture in the search space. The weights are shared across architecture candidates, i.e., subnetworks. Such a strategy reduces the search cost from thousands of GPU days to a few.

However, sharing a single set of weights across all architectures cannot guarantee that each individual subnetwork obtains sufficient training. Although one-shot models are typically only used to rank architectures in the search space, the capacity of weight sharing is still limited. As revealed by the recent work of Sciuto et al. (2020), weight sharing degrades the ranking of architectures to the point of not reflecting their true performance, thus reducing the effectiveness of the search process. A few recent works address this issue from the perspective of knowledge distillation Cai et al. (2020); Li et al. (2020); Yu et al. (2020). They commonly introduce a high-performing teacher network to boost the training of subnetworks. Nevertheless, these methods require the teacher model to be trained beforehand, such as a large pretrained model Cai et al. (2020) or a third-party model Li et al. (2020). This limits the flexibility of search algorithms, especially when the search tasks or data are entirely new and there may be no available teacher models.

In this paper, we present prioritized paths to enable knowledge transfer between architectures without requiring an external teacher model. The core idea is that subnetworks can learn collaboratively and teach each other throughout the training process, thus boosting the convergence of individual architectures. More specifically, we create a prioritized path board which recruits the subnetworks with superior performance as internal teachers to facilitate the training of other models. The recruitment follows the selective competition principle, i.e., selecting the superior and eliminating the inferior. Besides competition, there is also collaboration. To enable information transfer between architectures, we distill knowledge from prioritized paths to subnetworks. Instead of learning from a fixed model, our method allows each subnetwork to select its best-matching prioritized path as the teacher based on representation complementarity. In particular, a meta network is introduced to mimic this path selection procedure. Throughout the course of subnetwork training, the meta network observes the subnetwork's performance on a held-out validation set, and learns to choose a prioritized path from the board such that, if the subnetwork benefits from the prioritized path, the subnetwork will achieve better validation performance.

Such a prioritized path distillation mechanism has three advantages. First, it does not require introducing third-party models, such as human-designed architectures, to serve as teacher models, and is therefore more flexible. Second, the matching between prioritized paths and subnetworks is meta-learned, which allows a subnetwork to select various prioritized paths to facilitate its learning. Last but not least, after hypernetwork training, we can directly pick the best performing architecture from the prioritized paths, instead of using either reinforcement learning or evolutionary algorithms to further search for a final architecture in the large-scale hypernetwork.

The experiments demonstrate that our method achieves clear improvements over the strong baseline and establishes state-of-the-art performance on ImageNet. For instance, with the proposed prioritized path distillation, our search algorithm finds a 470M-Flops model that achieves 79.2% top-1 accuracy on ImageNet. This model improves the SPOS baseline Guo et al. (2020) by 4.5% while surpassing EfficientNet-B0 Tan and Le (2019a) by 2.9%. Under efficient computing settings, i.e., small Flops constraints, our models consistently outperform MobileNetV3 Howard et al. (2019), sometimes by nontrivial margins, e.g., 3.0% under 43M Flops. The architecture discovered by our approach transfers well to the downstream object detection task, obtaining an AP of 33.2 on the COCO validation set, which is superior to the state-of-the-art MobileNetV3. In addition, distilling prioritized paths allows one-shot models to search architectures over a more challenging search space, such as combinations of MBConv Sandler et al. (2018), residual blocks He et al. (2016) and normal 2D convolutions, thus easing the restriction of designing a carefully constrained space.

2 Preliminary: One-Shot NAS

One-shot NAS approaches commonly adopt a weight sharing strategy to eschew training each subnetwork from scratch [Brock et al., 2018; Pham et al., 2018; Bender et al., 2018; Guo et al., 2020; Li and Talwalkar, 2019, among many others]. The architecture search space $\mathcal{A}$ is encoded in a hypernetwork, denoted as $\mathcal{N}(\mathcal{A}, W)$, where $W$ is the weight of the hypernetwork. The weight $W$ is shared across all the architecture candidates, i.e., subnetworks $\alpha \in \mathcal{A}$ in $\mathcal{N}$. The search of the optimal architecture in one-shot methods is formulated as a two-stage optimization problem. The first stage is to optimize the weight $W$ by

$$W_{\mathcal{A}} = \arg\min_{W} \mathcal{L}_{\mathrm{train}}\big(\mathcal{N}(\mathcal{A}, W)\big), \qquad (1)$$

where $\mathcal{L}_{\mathrm{train}}$ represents the loss function on the training dataset. To reduce memory usage, one-shot methods usually sample subnetworks from $\mathcal{N}$ for optimization. We adopt the single-path uniform sampling strategy as the baseline, i.e., each batch only samples one random path from the hypernetwork for training Li and Talwalkar (2019); Guo et al. (2020). The second stage is to search architectures by ranking the performance of subnetworks based on the learned weight $W_{\mathcal{A}}$, which is formulated as

$$\alpha^{*} = \arg\max_{\alpha \in \mathcal{A}} \mathrm{Acc}_{\mathrm{val}}\big(\mathcal{N}(\alpha, W_{\mathcal{A}}(\alpha))\big), \qquad (2)$$

where the sampled subnetwork $\alpha$ inherits its weight from $W_{\mathcal{A}}$ as $W_{\mathcal{A}}(\alpha)$, and $\mathrm{Acc}_{\mathrm{val}}$ indicates the top-1 accuracy of the architecture $\alpha$ on the validation dataset. Since it is impossible to enumerate all the architectures for evaluation, prior works resort to random search Li and Talwalkar (2019); Bender et al. (2018), evolution algorithms Real et al. (2019); Guo et al. (2020) or reinforcement learning Pham et al. (2018); Tan et al. (2019) to find the most promising one.
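To make the two-stage formulation concrete, the following is a minimal PyTorch-style sketch of single-path uniform sampling (Eq. 1) and accuracy-based ranking (Eq. 2). The hypernetwork object, its `sample_path()` method, and the data loaders are hypothetical placeholders rather than the paper's actual implementation.

```python
import torch

def train_hypernetwork(hypernet, train_loader, epochs=120, lr=0.5):
    """Stage 1 (Eq. 1): optimize the shared weight W_A by training one
    uniformly sampled path per batch (single-path one-shot baseline)."""
    criterion = torch.nn.CrossEntropyLoss()
    optimizer = torch.optim.SGD(hypernet.parameters(), lr=lr, momentum=0.9)
    for _ in range(epochs):
        for images, labels in train_loader:
            path = hypernet.sample_path()        # uniform random subnetwork (hypothetical API)
            logits = hypernet(images, path)      # forward only the sampled path
            loss = criterion(logits, labels)
            optimizer.zero_grad()
            loss.backward()                      # gradients only touch the shared weights of this path
            optimizer.step()

def rank_subnetworks(hypernet, candidates, val_loader):
    """Stage 2 (Eq. 2): rank candidate architectures by top-1 accuracy with
    weights inherited from the trained hypernetwork."""
    scores = {}
    with torch.no_grad():
        for path in candidates:                  # candidates: hashable path encodings
            correct = total = 0
            for images, labels in val_loader:
                preds = hypernet(images, path).argmax(dim=1)
                correct += (preds == labels).sum().item()
                total += labels.size(0)
            scores[path] = correct / total
    return max(scores, key=scores.get)           # most promising architecture
```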

Figure 1: (a) Previous one-shot NAS methods use pretrained models for knowledge distillation. (b) Our prioritized path distillation enables knowledge transfer between architecture candidates. It contains three parts: the hypernetwork, the prioritized path board and the meta network. The meta network selects the best matching prioritized path to guide the training of the sampled subnetwork.

3 Distilling Prioritized Paths for One-Shot NAS

The weight sharing strategy reduces the search cost by orders of magnitude. However, it brings a potential issue, i.e., the insufficient training of subnetworks within the hypernetwork. Because of this issue, the ranking of architectures by the one-shot weights is only weakly correlated with their true performance, so a search based on these weights may not find a promising architecture. To boost the training of subnetworks, we present prioritized path distillation. The intuitive idea is to leverage well-performing subnetworks to teach the under-trained ones, such that all architectures converge to better solutions. In the following, we first present the mechanism of the prioritized path board, which plays a fundamental role in our approach. Then, we describe the search algorithm using the prioritized paths and knowledge transfer between architectures. The overall framework is visualized in Fig. 1.

3.1 Prioritized Path Board

Prioritized paths refer to the architecture candidates which exhibit promising performance during hypernetwork training. The prioritized path board is an architecture set which contains $K$ prioritized paths, i.e., $\mathcal{B} = \{\hat{\alpha}_k\}_{k=1}^{K}$. The board is first initialized with random paths and then changed on the fly depending on the paths' performance. More specifically, for each batch, we randomly sample a single path $\alpha$ from the hypernetwork and train the path to update the corresponding shared weight $W_{\mathcal{A}}(\alpha)$. After that, we evaluate the path on the validation dataset (a subset is used to save computation cost) and obtain its performance $\mathrm{Acc}_{\mathrm{val}}(\alpha)$. If the current path performs better than the least competitive prioritized path in $\mathcal{B}$, then it replaces that prioritized path as

$$\hat{\alpha}_{k^{*}} \leftarrow \alpha \quad \text{if } \mathrm{Acc}_{\mathrm{val}}(\alpha) > \mathrm{Acc}_{\mathrm{val}}(\hat{\alpha}_{k^{*}}) \ \text{and}\ \mathrm{Flops}(\alpha) \le \mathrm{Flops}(\hat{\alpha}_{k^{*}}), \quad \hat{\alpha}_{k^{*}} = \arg\min_{\hat{\alpha}_k \in \mathcal{B}} \mathrm{Acc}_{\mathrm{val}}(\hat{\alpha}_k), \qquad (3)$$

where $\mathrm{Flops}(\cdot)$ counts the multiply-add operations of a model. Eq. (3) indicates that the update of the prioritized path board follows selective competition, i.e., selecting models with higher performance and lower complexity. Thus, the prioritized paths are changed on the fly. The paths finally left on the board are the Pareto optima Mock (2011) among all the paths sampled during the training process.
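A minimal sketch of this selective-competition update, assuming a simple record type for paths; the board size, field names, and the replacement rule (higher accuracy and no higher Flops than the weakest board member) follow the reconstruction of Eq. (3) above and are illustrative rather than the authors' code.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class PathRecord:
    arch: tuple    # encoding of the sampled path (illustrative)
    acc: float     # top-1 accuracy on the validation subset
    flops: float   # multiply-add count of the path

def update_board(board: List[PathRecord], candidate: PathRecord, k: int = 10) -> None:
    """Selective competition: keep at most k paths; a newly trained path replaces
    the least competitive board member if it is more accurate and no more complex."""
    if len(board) < k:                      # board is first filled with the initial paths
        board.append(candidate)
        return
    weakest = min(board, key=lambda p: p.acc)
    if candidate.acc > weakest.acc and candidate.flops <= weakest.flops:
        board[board.index(weakest)] = candidate
```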

Input: Training and validation data, hypernetwork $\mathcal{N}$ with weight $W_{\mathcal{A}}$, meta network $\mathcal{M}$ with weight $\theta$ and its update interval $\tau$, path board $\mathcal{B}$, max iteration $T$, user-specified min and max Flops.
Output: The most promising architecture.
1: Randomly initialize $W_{\mathcal{A}}$, $\theta$, and $\mathcal{B}$ with random paths
2: while search step $t < T$ and not converged do
3:     Randomly sample a path $\alpha$ from the search space $\mathcal{A}$
4:     Select the best fit path $\hat{\alpha}$ in $\mathcal{B}$ according to Eq. (4)
5:     Calculate the losses $\mathcal{L}_{\mathrm{CE}}$ and $\mathcal{L}_{\mathrm{KD}}$ over one training batch
6:     Update the weight $W_{\mathcal{A}}(\alpha)$ of path $\alpha$ according to Eq. (5)
7:     Calculate Flops and top-1 accuracy on the validation subset
8:     Update $\mathcal{B}$ according to Eq. (3)
9:     if $t$ mod $\tau$ = 0 then
10:        Calculate the loss $\mathcal{L}_{\mathrm{val}}$ on the validation dataset with the updated weight according to Eq. (6)
11:        Update the weight $\theta$ of the meta network by calculating $\partial \mathcal{L}_{\mathrm{val}} / \partial \theta$
12:    end if
13: end while
14: Select the best performing architecture from $\mathcal{B}$ on the validation dataset.
Algorithm 1 Architecture Search with Prioritized Paths

3.2 Architecture Search with Prioritized Paths

Our solution to the insufficient training of subnetworks is to distill knowledge from the prioritized paths to the weight-sharing subnetworks. Due to the large scale of the search space, the structures of subnetworks are extremely diverse. Some subnetworks may be beneficial to other peer architectures, while others may not be, or may even be harmful. Hence, we allow each subnetwork to find its best matching collaborator from the prioritized path board, such that the matched path can make up for its deficiency. We propose to learn the matching between prioritized paths and subnetworks with a meta network $\mathcal{M}$. Since there is no available groundtruth to measure the matching degree of two architectures, we use the learning state (i.e., validation loss) of subnetworks as the signal to supervise the learning of the meta network. The underlying reason is that if the gradient updates of the meta network encourage the subnetworks to learn from the selected prioritized paths and achieve a small validation loss, then this matching is profitable.

The hypernetwork training with prioritized path distillation includes three iterative phases.

Phase 1: choosing the prioritized path. For each batch, we first randomly sample a subnetwork $\alpha$. Then, we use the meta network $\mathcal{M}$ to select the best fit model from the prioritized path board $\mathcal{B}$, aiming to facilitate the training of the sampled path. The selection is formulated as

$$\hat{\alpha} = \arg\max_{\hat{\alpha}_k \in \mathcal{B}} \rho_k, \quad \text{where } \rho_k = \mathcal{M}\big(\mathcal{N}(\hat{\alpha}_k, x) - \mathcal{N}(\alpha, x);\, \theta\big), \qquad (4)$$

where $\rho_k$ is the output of the meta network and represents the matching degree (the higher the better) between the prioritized path $\hat{\alpha}_k$ and the subnetwork $\alpha$, $x$ indicates the training data, and $\theta$ denotes the weight of $\mathcal{M}$. The input to the meta network is the difference of the feature logits between the subnetworks $\hat{\alpha}_k$ and $\alpha$. Such a difference reflects the complementarity of the two paths. The meta network learns to select the prioritized path that is complementary to the current subnetwork $\alpha$.
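A minimal PyTorch sketch of how the meta network could score prioritized paths from logit differences, following the description above. The single linear layer mirrors the meta network described in Section 3.2, but the exact layer shape, the sigmoid squashing of the matching degree, and the helper function are assumptions.

```python
import torch
import torch.nn as nn

class MetaMatcher(nn.Module):
    """Scores how well a prioritized path complements the sampled subnetwork.
    Input: difference of the two paths' feature logits; output: matching degree rho."""
    def __init__(self, num_classes: int = 1000):
        super().__init__()
        self.fc = nn.Linear(num_classes, 1)   # a single fully-connected layer

    def forward(self, logit_diff: torch.Tensor) -> torch.Tensor:
        # Squash to (0, 1) so rho can also weight the distillation loss in Eq. (5).
        return torch.sigmoid(self.fc(logit_diff).mean())

def select_prioritized_path(meta, subnet_logits, board_logits):
    """Eq. (4): pick the board path whose logit difference receives the highest score.
    subnet_logits and each entry of board_logits come from forwarding the same batch."""
    scores = [meta(p_logits - subnet_logits) for p_logits in board_logits]
    best = max(range(len(scores)), key=lambda i: scores[i].item())
    return best, scores[best]                 # index of the teacher and its matching degree
```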

Phase 2: distilling knowledge from the prioritized path. With the picked prioritized path $\hat{\alpha}$, we perform knowledge distillation to boost the training of the subnetwork $\alpha$. The distillation is supervised by a weighted average of two different objective functions. The first objective function $\mathcal{L}_{\mathrm{CE}}$ is the cross entropy with the correct labels $y$, computed using the softmax of the subnetwork's logits $\mathcal{N}(\alpha, x)$. The second objective function $\mathcal{L}_{\mathrm{KD}}$ is the cross entropy with the soft target labels, and this cross entropy distills knowledge from the prioritized path $\hat{\alpha}$ to the subnetwork $\alpha$. The soft targets are generated by a softmax function that converts the prioritized path's feature logits to a probability distribution. We use SGD with a learning rate $\eta$ to optimize the objective functions and update the subnetwork weight as

$$W_{\mathcal{A}}^{(t+1)}(\alpha) = W_{\mathcal{A}}^{(t)}(\alpha) - \eta\, \nabla_{W_{\mathcal{A}}(\alpha)}\big(\mathcal{L}_{\mathrm{CE}} + \rho\, \mathcal{L}_{\mathrm{KD}}\big), \qquad (5)$$

where $t$ is the iteration index. It is worth noting that we use the matching degree $\rho$ as the weight for the distillation objective function. The underlying reason is that if the selected prioritized path is well-matched to the current path, then it can play a more important role in facilitating the learning, and vice versa. After the weight update, we evaluate the performance of the subnetwork $\alpha$ on the validation subset and calculate its model complexity. If both performance and complexity satisfy Eq. (3), then the path is added to the prioritized path board $\mathcal{B}$.
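The phase-2 objective can be sketched as follows, assuming the soft targets are the plain softmax of the prioritized path's logits (no temperature is mentioned in the text) and that rho is the matching degree from Eq. (4); this is an illustrative loss, not the released implementation.

```python
import torch
import torch.nn.functional as F

def distillation_loss(subnet_logits, teacher_logits, labels, rho):
    """Phase-2 objective: cross entropy with the correct labels plus a rho-weighted
    cross entropy with the prioritized path's soft targets."""
    ce = F.cross_entropy(subnet_logits, labels)
    soft_targets = F.softmax(teacher_logits, dim=1).detach()   # teacher path is not updated here
    kd = -(soft_targets * F.log_softmax(subnet_logits, dim=1)).sum(dim=1).mean()
    return ce + rho * kd
```

Taking one SGD step on this loss with learning rate $\eta$ then corresponds to the update in Eq. (5).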

Phase 3: updating the meta network. Since there is no available groundtruth label measuring the matching degree and complementarity of two architectures, we resort to the loss of the subnetwork $\alpha$ to guide the training of the matching network $\mathcal{M}$. The underlying reason is that if one prioritized path is complementary to the current subnetwork $\alpha$, then the updated subnetwork with the weight $W_{\mathcal{A}}^{(t+1)}(\alpha)$ can achieve a lower loss on the validation data. We evaluate the new weight on the validation data $(x_{\mathrm{val}}, y_{\mathrm{val}})$ using the cross entropy loss $\mathcal{L}_{\mathrm{val}}$. Since $W_{\mathcal{A}}^{(t+1)}(\alpha)$ depends on $\rho$ via Eq. (5) while $\rho$ depends on $\theta$ via Eq. (4), this validation cross entropy loss is a function of $\theta$. Specifically, dropping $(x_{\mathrm{val}}, y_{\mathrm{val}})$ from the equations for readability, we can write:

$$\mathcal{L}_{\mathrm{val}}(\theta) = \mathcal{L}_{\mathrm{CE}}\Big(\mathcal{N}\big(\alpha,\ W_{\mathcal{A}}^{(t)}(\alpha) - \eta\, \nabla_{W_{\mathcal{A}}(\alpha)}(\mathcal{L}_{\mathrm{CE}} + \rho(\theta)\, \mathcal{L}_{\mathrm{KD}})\big)\Big). \qquad (6)$$

This dependency allows us to compute $\partial \mathcal{L}_{\mathrm{val}} / \partial \theta$ to update $\theta$ and minimize $\mathcal{L}_{\mathrm{val}}$. The differentiation requires computing the gradient of a gradient, which is time-consuming; we therefore update $\theta$ every $\tau$ iterations. In essence, the meta network observing the subnetwork's validation loss to improve itself is similar to an agent in reinforcement learning performing on-policy sampling and learning from its own rewards Pham et al. (2020). In our implementation, we adopt a single fully-connected layer with 1,000 hidden nodes as the architecture of the meta network, which is simple and efficient.
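A simplified sketch of the second-order update behind Eq. (6), written functionally for a single small subnetwork. The differentiable one-step inner update, the use of torch.func.functional_call (PyTorch ≥ 2.0), and the meta/meta_opt objects are assumptions layered on top of the text; the real method updates the shared hypernetwork weights and runs this step only every $\tau$ iterations.

```python
import torch
import torch.nn.functional as F

def meta_update(subnet, meta, meta_opt, train_batch, val_batch, teacher_logits, lr_inner):
    """Phase 3: differentiate the validation loss through a one-step subnetwork
    update so that gradients reach the meta network's weight theta (Eq. 6)."""
    x, y = train_batch
    x_val, y_val = val_batch

    logits = subnet(x)
    rho = meta(teacher_logits - logits.detach())              # matching degree, depends on theta only
    soft = F.softmax(teacher_logits, dim=1).detach()
    kd = -(soft * F.log_softmax(logits, dim=1)).sum(dim=1).mean()
    inner_loss = F.cross_entropy(logits, y) + rho * kd        # Eq. (5) objective

    params = list(subnet.parameters())
    grads = torch.autograd.grad(inner_loss, params, create_graph=True)
    updated = [p - lr_inner * g for p, g in zip(params, grads)]   # differentiable one-step update

    # Evaluate the updated weights on held-out data without mutating the module.
    new_state = {name: w for (name, _), w in zip(subnet.named_parameters(), updated)}
    val_logits = torch.func.functional_call(subnet, new_state, (x_val,))
    val_loss = F.cross_entropy(val_logits, y_val)             # Eq. (6): a function of theta via rho

    meta_opt.zero_grad()
    val_loss.backward()                                       # gradient-of-gradient w.r.t. theta
    meta_opt.step()
    return val_loss.item()
```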

The above three phases are performed iteratively to train the hypernetwork. The iterative procedure is outlined in Alg. 1. Thanks to the prioritized path distillation mechanism, after hypernetwork training, we can directly select the best performing subnetwork from the prioritized path board as the final architecture, instead of further performing search on the hypernetwork.

4 Experiments

In this section, we first present ablation studies dissecting our method on the image classification task, and then compare our method with state-of-the-art NAS algorithms. Experiments on object detection and a more challenging search space are performed to evaluate the generality and robustness of the approach.

4.1 Implementation Details

Table 1 columns: #, Single-path Training Alg., Evolution Alg., Priority Path, Fixed Match, Random Match, Meta Match, Kendall Rank on subImageNet, Hypernet Top-1 on ImageNet, Top-1 Acc on ImageNet, Model FLOPS. Model FLOPS of configurations #1–#6: 450M, 433M, 432M, 451M, 470M, 487M.
Table 1: Component-wise analysis. Fixed, Random and Meta matching represent performing distillation with the largest subnetwork, a randomly sampled prioritized path and the meta-learned prioritized path, respectively.

Search space. Following recent works Tan and Le (2019a); Howard et al. (2019); Cai et al. (2020); Li et al. (2020); Yu et al. (2020), we perform architecture search over a search space consisting of mobile inverted bottleneck MBConv Sandler et al. (2018) and squeeze-and-excitation modules Hu et al. (2018) for fair comparisons. There are seven basic operators, including MBConv with kernel sizes of {3,5,7} and expansion rates of {4,6}, and an additional skip connection to enable elastic depth of architectures. The space contains a vast number of architecture candidates in total.
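The seven per-layer operator choices described above can be enumerated explicitly; the encoding and the layer count used below are illustrative, not the authors' exact representation.

```python
import random
from itertools import product

# Seven candidate operators per searchable layer: MBConv with kernel size in {3, 5, 7}
# and expansion rate in {4, 6} (six combinations), plus a skip connection that lets the
# search shrink network depth.
CHOICES = [("mbconv", k, e) for k, e in product((3, 5, 7), (4, 6))] + [("skip",)]

def random_path(num_layers: int = 20):
    """Uniformly sample one subnetwork encoding (the layer count is illustrative)."""
    return tuple(random.choice(CHOICES) for _ in range(num_layers))

print(len(CHOICES))       # 7 operators per layer
print(random_path()[:3])  # e.g. (('mbconv', 5, 4), ('skip',), ('mbconv', 3, 6))
```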
Hypernetwork. Our hypernetwork is similar to the baseline SPOS Guo et al. (2020). The architecture details are presented in Appendix A of the supplementary materials. We train the hypernetwork for 120 epochs using the following settings: SGD optimizer Robbins and Monro (1951) with momentum 0.9 and weight decay 4e-5, initial learning rate 0.5 with linear annealing. The meta network is updated every $\tau$ iterations to save computation cost. The number of prioritized paths is empirically set to 10, while the number of images sampled from the validation set for prioritized path selection in Eq. (3) is set to 2,048.
Retrain. We retrain the discovered architectures for 500 epochs on ImageNet using similar settings to EfficientNet Tan and Le (2019a): RMSProp optimizer with momentum 0.9 and decay 0.9, weight decay 1e-5, dropout ratio 0.2, initial learning rate 0.064 with a warmup Goyal et al. (2017) in the first 3 epochs and cosine annealing; the AutoAugment Cubuk et al. (2019) policy and exponential moving average are adopted for training. We use 16 Nvidia Tesla V100 GPUs with a batch size of 2,048 for retraining.

4.2 Ablation Study

We dissect our method and evaluate the effect of each component. Our baseline is the single-path one-shot method, which trains the hypernetwork with uniform sampling and searches architectures with an evolution algorithm Guo et al. (2020). We re-implement this algorithm in our codebase, and it achieves 76.3% top-1 accuracy on ImageNet, which is superior to the original 74.7% reported in Guo et al. (2020) due to different search spaces (ShuffleUnits Guo et al. (2020) vs. MBConv Sandler et al. (2018)). If we replace the evolution search with the proposed prioritized path mechanism, the performance is still comparable to the baseline, as presented in Tab. 1 (#1 vs. #2). This suggests the effectiveness of the prioritized paths. By comparing #2 with #4/#5, we observe that the knowledge distillation between prioritized paths and subnetworks is indeed helpful for both hypernetwork training and the final performance, even when the matching between prioritized paths and subnetworks is random, i.e., #4. The meta-learned matching function is superior to random matching by 1.3% in terms of top-1 accuracy on ImageNet. The ablation between #5 and #6 shows that the evolution search over the hypernetwork performs comparably to the prioritized path distillation, suggesting that the final paths left on the prioritized path board are the "cream of the crop".

Table 2: Ablation for the number of prioritized paths.
Board Size: 1 / 5 / 10 / 20 / 50
Hypernetwork (Top-1): 65.4 / 65.9 / 67.0 / 67.3 / 67.5
Search Cost (GPU days): 9 / 10 / 12 / 16 / 27

Table 3: Ablation for the number of val images.
Image Numbers: 0.5k / 1k / 2k / 5k / 10k / 50k
Kendall Rank (Top-1): 0.72 / 0.74 / 0.75 / 0.85 / 0.94 / 1
Kendall Rank (Top-5): 0.47 / 0.50 / 0.66 / 0.76 / 0.89 / 1
Figure 2: Comparison with state-of-the-art methods on ImageNet under mobile settings (Flops ≤ 600M).

We further perform a correlation analysis to evaluate whether the enhanced training of the hypernetwork can improve the ranking of subnetworks. To this end, we randomly sample 30 subnetworks and calculate the rank correlation between the weight-sharing performance and the true performance of training from scratch. Unfortunately, training so many subnetworks on ImageNet is computationally very expensive, so we construct a subImageNet dataset consisting of only 100 classes randomly sampled from ImageNet. Each class has 250 training images and 50 validation images (image lists are released with the code). Its size is about 50× smaller than the original ImageNet. The Kendall rank correlation coefficient Kendall (1938) on subImageNet is reported in Tab. 1. It is clear that after performing prioritized path distillation, the ranking correlation is improved significantly, e.g., from the baseline 0.19 to 0.37 (#1 vs. #5 in Tab. 1).
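The ranking analysis can be reproduced with SciPy's Kendall tau; the two accuracy lists below are placeholder values, not numbers from the paper.

```python
from scipy.stats import kendalltau

# Accuracies of the same subnetworks under the two evaluation protocols: one from
# weight-sharing evaluation on subImageNet, one from training each architecture from scratch.
weight_sharing_acc = [0.61, 0.58, 0.65, 0.70, 0.55]   # illustrative values only
stand_alone_acc    = [0.63, 0.60, 0.62, 0.72, 0.59]   # illustrative values only

tau, p_value = kendalltau(weight_sharing_acc, stand_alone_acc)
print(f"Kendall tau = {tau:.2f} (p = {p_value:.3f})")
```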

There are two hyperparameters in our method: the size of the prioritized path board and the number of validation images for prioritized path selection in Eq. (3). The impact of these two hyperparameters is reported in Tab. 2 and Tab. 3, respectively. We observe that when the number of prioritized paths is increased, the performance of the hypernetwork becomes better, yet this brings more search overhead. Considering the tradeoff, we empirically set the number of prioritized paths to 10 in the experiments. A similar situation occurs for the number of validation images. We randomly sample 2,048 images from the validation set (50k images in total) for prioritized path selection because this allows fast evaluation while keeping a relatively high Kendall rank.

4.3 Comparison with State-of-the-Art NAS Methods

Fig. 2 presents the comparison of our method with the state of the art under mobile settings on ImageNet. It shows that when the model Flops are smaller than 600M, our method consistently outperforms the recent MobileNetV3 Howard et al. (2019) and EfficientNet-B0/B1 Tan and Le (2019a). In particular, our method achieves 77.6% top-1 accuracy on ImageNet with 285M Flops and 3.9M Params, which is 1.0% higher than MobileNetV3 while using 1.2× fewer Flops and 2.1× fewer Params. Moreover, our method is flexible enough to search for low-complexity models, requiring only that users input the desired minimum and maximum Flops constraints. From Fig. 2 (right), we can see that when the Flops are smaller than 100M, our models establish new state-of-the-art results. For example, at 43M Flops, MobileNetV3 is inferior to our model by 3.0%. Besides model complexity, we are also interested in inference latency.

Model | Acc. @ Latency | Model | Acc. @ Latency
EfficientNet-B0 Tan and Le (2019a) | 76.3% @ 96ms | MobileNetV3 Howard et al. (2019) | 51.7% @ 15ms
Ours (285M Flops) | 77.6% @ 89ms | Ours (12M Flops) | 53.8% @ 9ms
Speedup | 1.1× | Speedup | 1.7×
Table 4: Inference latency comparison. Latency is measured with batch size 1 on a single core of an Intel Xeon CPU E5-2690.

As shown in Tab. 4, where we report the average latency over 1,000 runs, our method runs 1.1× faster than EfficientNet-B0 Tan and Le (2019a) and 1.7× faster than MobileNetV3 on a single core of an Intel Xeon CPU E5-2690. Moreover, the accuracy of our method is 1.3% higher than EfficientNet-B0 and 2.4% higher than MobileNetV3. This suggests our models are competitive when deployed on real hardware.
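A sketch of how batch-1 CPU latency averaged over 1,000 runs could be measured; restricting PyTorch to one thread only approximates a single core, and the warmup count is an arbitrary choice rather than the authors' protocol.

```python
import time
import torch

def measure_latency(model: torch.nn.Module, input_size=(1, 3, 224, 224),
                    runs=1000, warmup=50):
    """Average forward latency in milliseconds at batch size 1 on CPU."""
    torch.set_num_threads(1)          # approximate a single CPU core
    model.eval()
    x = torch.randn(*input_size)
    with torch.no_grad():
        for _ in range(warmup):       # warm up allocator and caches
            model(x)
        start = time.perf_counter()
        for _ in range(runs):
            model(x)
        elapsed = time.perf_counter() - start
    return elapsed / runs * 1000.0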

Methods | Top-1 (%) | Top-5 (%) | Flops (M) | Params (M) | Memory cost | Hypernet train (GPU days) | Search cost (GPU days)
200–350M Flops:
MobileNetV3 Howard et al. (2019) | 75.2 | - | 219 | 5.3 | single path | -
OFA Cai et al. (2020) | 76.9 | - | 230 | - | two paths | 53 | 2
AKD Liu et al. (2020) | 73.0 | 92.2 | 300 | - | single path | -
MobileNetV2 Sandler et al. (2018) | 72.0 | 91.0 | 300 | 3.4 | - | - | -
MnasNet-A1 Tan et al. (2019) | 75.2 | 92.5 | 312 | 3.9 | single path | -
FairNAS-C Chu et al. (2019) | 74.7 | 92.1 | 321 | 4.4 | single path | 10 | 2
SPOS Guo et al. (2020) | 74.7 | - | 328 | - | single path | 12
Cream-S (Ours) | 77.6 | 93.3 | 287 | 6.0 | two paths | 12 | 0.02
350–500M Flops:
SCARLET-A Chu et al. (2019) | 76.9 | 93.4 | 365 | 6.7 | single path | 10 | 12
GreedyNAS-A You et al. (2020) | 77.1 | 93.3 | 366 | 6.5 | single path | 7
EfficientNet-B0 Tan and Le (2019a) | 76.3 | 93.2 | 390 | 5.3 | - | - | -
ProxylessNAS Cai et al. (2019) | 75.1 | - | 465 | 7.1 | two paths | -
Cream-M (Ours) | 79.2 | 94.2 | 481 | 7.7 | two paths | 12 | 0.02
500–600M Flops:
DARTS Liu et al. (2019b) | 73.3 | 91.3 | 574 | 4.7 | whole hypernet | -
BigNASModel-L Yu et al. (2020) | 79.5 | - | 586 | 6.4 | two paths | -
OFA Cai et al. (2020) | 80.0 | - | 595 | - | two paths | 53 | 2
DNA-d Li et al. (2020) | 78.4 | 94.0 | 611 | 6.4 | single path | 24 | 0.6
EfficientNet-B1 Tan and Le (2019a) | 79.2 | 94.5 | 734 | 7.8 | - | - | -
Cream-L (Ours) | 80.0 | 94.7 | 604 | 9.7 | two paths | 12 | 0.02
Table 5: Comparison of state-of-the-art NAS methods on ImageNet. Some hypernetwork training costs are measured in TPU days; some results are reported by Guo et al. (2020); the DARTS architecture is searched on CIFAR-10 and transferred to ImageNet.

Tab. 5 presents more comparisons. It is worth noting that a few recent works leverage knowledge distillation techniques to boost training Cai et al. (2020); Li et al. (2020); Yu et al. (2020). Compared to these methods, our prioritized path distillation is also superior. Specifically, DNA Li et al. (2020) recruits EfficientNet-B7 Tan and Le (2019a), a very high-performance third-party model, as the teacher and achieves 78.4% top-1 accuracy (without using AutoAugment), while our method (Cream-L) obtains a superior accuracy of 80.0% without using any other pretrained models. Our method performs comparably to the recent OFA Cai et al. (2020) yet takes much less time for hypernetwork training, i.e., 12 vs. 53 GPU days. Thanks to the prioritized path mechanism, our method only needs to evaluate the 10 prioritized paths on the validation set and then select the best performing one. This procedure takes only 0.02 GPU days, which is 30× faster than approaches using evolutionary search algorithms, such as SPOS Guo et al. (2020) and OFA Cai et al. (2020). The learned architectures are plotted in Appendix B.

4.4 Generality and Robustness

To further evaluate the generalizability of the architecture found by our method, we transfer it to the downstream object detection task. We use the discovered architecture as a drop-in replacement for the backbone feature extractor in RetinaNet Lin et al. (2017) and compare it with other backbone networks on the COCO dataset Lin et al. (2014). We perform training on the train2017 set (118k images) and evaluation on the val2017 set (5k images) with a batch size of 32 on 8 V100 GPUs. Following Chu et al. (2019), we train the detection model for 12 epochs. The initial learning rate is 0.04 and is multiplied by 0.1 at epochs 8 and 11. The optimizer is SGD with 0.9 momentum and 1e-4 weight decay.

Search Space | Method | Kendall Rank | Top-1 (%)
MBConv Howard et al. (2019) | SPOS Guo et al. (2020) | 0.19 | 75.8
MBConv Howard et al. (2019) | Ours | 0.37 | 77.7
ResBlock He et al. (2016) | SPOS Guo et al. (2020) | 0.09 | 75.0
ResBlock He et al. (2016) | Ours | 0.28 | 77.2
2D Conv | SPOS Guo et al. (2020) | 0.04 | 74.0
2D Conv | Ours | 0.25 | 77.1
Table 6: Search on different spaces. The Kendall rank is calculated on subImageNet using the 30 sampled subnetworks; the top-1 accuracy is obtained by retraining on ImageNet for 120 epochs.

As shown in Tab. 7, our method surpasses MobileNetV2 by 4.9% AP while using fewer Flops. Compared to MnasNet Tan et al. (2019), our method uses 19% fewer Flops while achieving 2.7% higher AP, suggesting the architecture generalizes well when transferred to other vision tasks. If we further increase the model complexity, our method can achieve an AP of 36.8%, which is comparable to the recent Hit-Detector Guo et al. while using much fewer Flops.

A robust search algorithm should be capable of searching architectures over diverse search spaces. To evaluate this, we test our method on more challenging spaces, i.e., combinations of operators from differently designed spaces, including MBConv Howard et al. (2019), residual blocks He et al. (2016) and normal 2D convolutions. Due to limited space, we present the detailed settings of the new search spaces in Appendix C. From the results reported in Tab. 6, we observe that as the search space becomes more challenging, the performance of the baseline SPOS algorithm Guo et al. (2020) degrades. In contrast, our method shows relatively stable performance, demonstrating its potential to search architectures over more flexible spaces. The main reason is attributed to the prioritized path distillation, which improves the ranking correlation of architectures.

Backbones | FLOPs (M) | AP (%) | AP50 | AP75 | APS | APM | APL | Top-1 (%)
MobileNetV2 Sandler et al. (2018) | 300 | 28.3 | 46.7 | 29.3 | 14.8 | 30.7 | 38.1 | 72.0
SPOS Guo et al. (2020) | 365 | 30.7 | 49.8 | 32.2 | 15.4 | 33.9 | 41.6 | 75.0
MnasNet Tan et al. (2019) | 340 | 30.5 | 50.2 | 32.0 | 16.6 | 34.1 | 41.1 | 75.6
MobileNetV3 Howard et al. (2019) | 219 | 29.9 | 49.3 | 30.8 | 14.9 | 33.3 | 41.1 | 75.2
MixNet Tan and Le (2019b) | 360 | 31.3 | 51.7 | 32.4 | 17.0 | 35.0 | 41.9 | 77.0
FairNAS-C Chu et al. (2019) | 325 | 31.2 | 50.8 | 32.7 | 16.3 | 34.4 | 42.3 | 76.7
MixPath-A Chu et al. (2020) | 349 | 31.5 | 51.3 | 33.2 | 17.4 | 35.3 | 41.8 | 76.9
Cream-S (Ours) | 287 | 33.2 | 53.6 | 34.9 | 18.2 | 36.6 | 44.4 | 77.6
Table 7: Object detection results of various drop-in backbones on COCO val2017. Top-1 denotes the top-1 accuracy on ImageNet. Some baseline results are reported by Chu et al. (2019).

5 Related Work

Neural Architecture Search. Early NAS approaches search for networks using either reinforcement learning Zoph et al. (2018); Zoph and Le (2017) or evolution algorithms Xie and Yuille (2017); Real et al. (2019). These approaches require training thousands of architecture candidates from scratch, leading to unaffordable computation overhead. Most recent works resort to the one-shot weight sharing strategy to amortize the search cost Li and Talwalkar (2019); Pham et al. (2018); Brock et al. (2018); Guo et al. (2020). The key idea is to train a single over-parameterized hypernetwork and then share the weights across subnetworks. Training the hypernetwork commonly involves sampling subnetwork paths for optimization. There are several path sampling methods, such as drop path Bender et al. (2018), single path Guo et al. (2020); Li and Talwalkar (2019) and multiple paths You et al. (2020); Chu et al. (2019). Among them, the single-path one-shot model is simple and representative. In each iteration, it samples only one random path and trains the path using one batch of data. Once the training process is finished, the subnetworks can be ranked by the shared weights. On the other hand, instead of searching over a discrete set of architecture candidates, differentiable methods Liu et al. (2019b); Chen et al. (2019); Cai et al. (2019) relax the search space to be continuous, such that the search can be optimized by efficient gradient descent. Recent surveys on architecture search can be found in Elsken et al. (2019); Wistuba et al. (2019).

Distillation between Architectures. Knowledge distillation Hinton et al. (2015) is a widely used technique for information transfer. It compresses the "dark knowledge" of a well-trained larger model into a smaller one. Recently, in one-shot NAS, a few works have leveraged this technique to boost the training of the hypernetwork [e.g., Chen et al., 2020], and they commonly introduce additional large models as teachers. More specifically, OFA Cai et al. (2020) pretrains the largest model in the search space and uses it to guide the training of other subnetworks, while DNA Li et al. (2020) employs the third-party EfficientNet-B7 Tan and Le (2019a) as the teacher model. These search algorithms become infeasible if there is no available pretrained model, especially when the search task and data are entirely new. The most recent work, BigNAS Yu et al. (2020), proposes in-place distillation with a sandwich rule to supervise the training of subnetworks by the largest child model. Although this method does not rely on other pretrained models, it cannot guarantee that the fixed largest model is the best teacher for all other subnetworks; sometimes the largest model may even act as noise in the search space. In contrast, our method dynamically recruits prioritized paths from the search space as teachers, and it allows subnetworks to select their best matching prioritized models for knowledge distillation. Moreover, after training, the prioritized paths in our method can serve directly as the final architectures, without requiring a further search over the hypernetwork.

6 Conclusions

In this work, motivated by the insufficient training of subnetworks in the weight sharing methods, we propose prioritized path distillation to enable knowledge transfer between architectures. Extensive experiments demonstrate the proposed search algorithm can improve the training of the weight sharing hypernetwork and find promising architectures. In future work, we will consider adding more constraints on prioritized path selection, such as both model size and latency, thus improving the flexibility and user-friendliness of the search method. The theoretical analysis of the prioritized path distillation for weight sharing training is another potential research direction.

7 Broader Impact

Similar to previous NAS works, this work does not have an immediate societal impact, since the algorithm is only designed for image classification, but it can indirectly impact society. As an example, our work may inspire the creation of new algorithms and applications with direct societal implications. Moreover, compared with other NAS methods that require an additional teacher model to guide the training process, our method does not need any external teacher models. Therefore, our method can be used in a closed data system, ensuring the privacy of user data.

8 Acknowledgements

We acknowledge the anonymous reviewers for their insightful suggestions. In particular, we would like to thank the Microsoft OpenPAI v-team for providing the AI computing platform and large-scale job scheduling support, and the Microsoft NNI v-team for AutoML toolkit support as well as helpful discussions and collaborations. Jing Liao and Hao Du were supported in part by the Hong Kong Research Grants Council (RGC) Early Career Scheme under Grant 9048148 (CityU 21209119), and in part by the CityU of Hong Kong under APRC Grant 9610488. This work was led by Houwen Peng, who is the primary contact (houwen.peng@microsoft.com).

Appendix A

Input Shape Operators Channels Repeat Stride
Conv 16 1 2
Depthwise Separable Conv 24 1 2
MBConv / SkipConnect 40 4 2
MBConv / SkipConnect 80 4 2
MBConv / SkipConnect 96 4 1
MBConv / SkipConnect 192 4 2
MBConv / SkipConnect 320 4 1
Global Avg. Pooling 320 1 1
Conv 1,280 1 1
Fully Connect 1,000 1 -
Table 8: The structure of the hypernetwork. "MBConv" denotes the inverted bottleneck residual block MBConv Sandler et al. [2018] with kernel sizes of {3,5,7} and expansion rates of {4,6}, equipped with the squeeze-and-excitation module. "Repeat" represents the maximum number of repeated blocks in a group. "Stride" indicates the convolutional stride of the first block in each repeated group.

Appendix B

Figure 3: Discovered architectures (100-600M Flops). "MB e k" represents the inverted bottleneck MBConv Sandler et al. [2018] with an expansion rate of e and a kernel size of k. "DS e k" denotes the depthwise separable convolution with an expansion rate of e and a kernel size of k.
Figure 4: Discovered architectures (0-100M Flops). "MB e k" represents the inverted bottleneck MBConv Sandler et al. [2018] with an expansion rate of e and a kernel size of k. "DS e k" denotes the depthwise separable convolution with an expansion rate of e and a kernel size of k.

Appendix C

Input Shape Operators Channels Repeat Stride
Conv 16 1 2
Depthwise Separable Conv 24 1 2
MBConv / SkipConnect / ResBlock 40 4 2
MBConv / SkipConnect / ResBlock 80 4 2
MBConv / SkipConnect / ResBlock 96 4 1
MBConv / SkipConnect / ResBlock 192 4 2
MBConv / SkipConnect / ResBlock 320 4 1
Global Avg. Pooling 320 1 1
Conv 1,280 1 1
Fully Connect 1,000 1 -
Table 9: The structure of the hypernetwork with additional “ResBlock” operator. The “ResBlock” He et al. [2016] indicates a residual bottleneck block with kernel size of 3.
Input Shape Operators Channels Repeat Stride
Conv 16 1 2
Depthwise Separable Conv 24 1 2
MBConv / Skip / ResBlock / Conv 40 4 2
MBConv / Skip / ResBlock / Conv 80 4 2
MBConv / Skip / ResBlock / Conv 96 4 1
MBConv / Skip / ResBlock / Conv 192 4 2
MBConv / Skip / ResBlock / Conv 320 4 1
Global Avg. Pooling 320 1 1
Conv 1,280 1 1
Fully Connect 1,000 1 -
Table 10: The structure of the hypernetwork with additional “ResBlock” and “Normal 2D Conv”. The “ResBlock” He et al. [2016] indicates a residual bottleneck block with kernel size of 3. The “Conv” indicates the standard 2D convolutions with kernel sizes of {1,3,5}.

Footnotes

  1. We also provide another implementation based upon the Microsoft NNI AutoML open source toolkit.

References

  1. Understanding and simplifying one-shot architecture search. In ICML.
  2. SMASH: one-shot model architecture search through hypernetworks. In ICLR.
  3. Once for all: train one network and specialize it for efficient deployment. In ICLR.
  4. ProxylessNAS: direct neural architecture search on target task and hardware. In ICLR.
  5. FasterSeg: searching for faster real-time semantic segmentation. In ICLR.
  6. Progressive differentiable architecture search: bridging the depth gap between search and evaluation. In ICCV.
  7. MixPath: a unified approach for one-shot neural architecture search. arXiv preprint arXiv:2001.05887.
  8. FairNAS: rethinking evaluation fairness of weight sharing neural architecture search. arXiv preprint arXiv:1907.01845.
  9. AutoAugment: learning augmentation strategies from data. In CVPR.
  10. SpineNet: learning scale-permuted backbone for recognition and localization. arXiv preprint arXiv:1912.05027.
  11. Neural architecture search: a survey. Journal of Machine Learning Research 20 (55), pp. 1–21.
  12. Accurate, large minibatch SGD: training ImageNet in 1 hour. arXiv preprint arXiv:1706.02677.
  13. Hit-Detector: hierarchical trinity architecture search for object detection. In CVPR.
  14. Single path one-shot neural architecture search with uniform sampling. In ECCV.
  15. Deep residual learning for image recognition. In CVPR.
  16. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531.
  17. Searching for MobileNetV3. In ICCV.
  18. Squeeze-and-excitation networks. In CVPR.
  19. A new measure of rank correlation. Biometrika 30 (1/2), pp. 81–93.
  20. Blockwisely supervised neural architecture search with knowledge distillation. In CVPR.
  21. Random search and reproducibility for neural architecture search. In UAI.
  22. Focal loss for dense object detection. In ICCV.
  23. Microsoft COCO: common objects in context. In ECCV.
  24. Auto-DeepLab: hierarchical neural architecture search for semantic image segmentation. In CVPR.
  25. DARTS: differentiable architecture search. In ICLR.
  26. Search to distill: pearls are everywhere but not the eyes. In CVPR.
  27. Pareto optimality. In Encyclopedia of Global Justice, D. K. Chatterjee (Ed.), pp. 808–809. ISBN 978-1-4020-9160-5.
  28. Efficient neural architecture search via parameter sharing. In ICML.
  29. Meta pseudo labels. arXiv preprint arXiv:2003.10580.
  30. Regularized evolution for image classifier architecture search. In AAAI.
  31. A stochastic approximation method. The Annals of Mathematical Statistics, pp. 400–407.
  32. MobileNetV2: inverted residuals and linear bottlenecks. In CVPR.
  33. Evaluating the search phase of neural architecture search. In ICLR.
  34. MnasNet: platform-aware neural architecture search for mobile. In CVPR.
  35. EfficientNet: rethinking model scaling for convolutional neural networks. In ICML.
  36. MixConv: mixed depthwise convolutional kernels. In BMVC.
  37. A survey on neural architecture search. arXiv preprint arXiv:1905.01392.
  38. Genetic CNN. In ICCV.
  39. GreedyNAS: towards fast one-shot NAS with greedy supernet. In CVPR.
  40. BigNAS: scaling up neural architecture search with big single-stage models. arXiv preprint arXiv:2003.11142.
  41. Neural architecture search with reinforcement learning. In ICLR.
  42. Learning transferable architectures for scalable image recognition. In CVPR.