Disturbance-immune Weight Sharing for Neural Architecture Search
Abstract
Neural architecture search (NAS) has gained increasing attention in the community of architecture design. One of the key factors behind its success lies in the training efficiency created by the weight sharing (WS) technique. However, WS-based NAS methods often suffer from a performance disturbance (PD) issue: the training of subsequent architectures inevitably disturbs the performance of previously trained architectures due to the partially shared weights. This leads to inaccurate performance estimation for the previous architectures, which makes it hard to learn a good search strategy. To alleviate this issue, we propose a new disturbance-immune update strategy for model updating. Specifically, to preserve the knowledge learned by previous architectures, we constrain the training of subsequent architectures to an orthogonal space via orthogonal gradient descent. Equipped with this strategy, we propose a novel disturbance-immune training scheme for NAS. We theoretically analyze the effectiveness of our strategy in alleviating the PD risk. Extensive experiments on CIFAR-10 and ImageNet verify the superiority of our method.
1 Introduction
Deep neural networks (DNNs) have produced state-of-the-art results in many challenging tasks, such as image classification [17, 14], face recognition [29, 37], medical image analysis [48, 47], portfolio selection [51], and image generation [6]. One of the key factors behind the success of DNNs lies in the design of effective neural architectures, such as ResNet [17] and MobileNet [20]. In practice, different architectures often show different performance, and the optimal architecture may vary among tasks. Hence, the design of effective neural architectures relies heavily on substantial human expertise. However, such a manual design process cannot fully explore the whole architecture space, often resulting in suboptimal architectures [55].
Besides manual design, one may resort to the neural architecture search (NAS) [55] technique to automatically design network architectures. Specifically, NAS seeks to find an optimal architecture in a predefined search space [55, 56]. The searched architectures often show promising performance, demonstrating tremendous potential to surpass handcrafted architectures [4, 5, 26, 27, 41]. During the search process, NAS evaluates a large number of candidate architectures and gradually learns a strategy to find good architectures. However, exact performance evaluation requires training all the architectures from scratch, which is highly time-consuming and computationally impractical in real-world applications.
To improve the training efficiency, a weight sharing (WS) [3, 35] technique has been developed for NAS. Specifically, WS-based NAS constructs a supernet, i.e., a large computational graph, in which each candidate architecture is a subgraph and all candidate architectures share their parameters through the supernet. Based on the supernet, one can directly estimate the performance of architectures instead of training them from scratch. In this way, WS is able to effectively accelerate the search process of NAS and reduce the search cost from 1,800 GPU days to less than 1 day [5]. Despite the efficiency, recent studies [1, 24, 38] have empirically shown that the performance estimation provided by WS is inaccurate, which makes it hard to identify good architectures. As a result, the search performance of NAS cannot be guaranteed, and WS-based methods often fail to find promising architectures [8, 32].
In this paper, we study the risk of the WS scheme and find that WS-based methods suffer from a performance disturbance (PD) issue. That is, the training of a subsequent architecture inevitably disturbs the performance of previously trained architectures due to the update of shared parameters. As a result, the performance estimation of the previous architectures is unstable and inaccurate, which makes it difficult to search for good architectures. To address this issue, we propose a new disturbance-immune update strategy to train the WS supernet. By exploiting orthogonal gradient descent, the proposed strategy trains architectures for better performance while constraining the predictions of previously trained architectures to remain unchanged. In this way, we are able to alleviate the PD issue in WS and provide more stable and accurate performance estimation. Based on this update strategy, we further propose a novel disturbance-immune training scheme for NAS, which helps to search for better architectures.
The main contributions of this paper are summarized as follows.

We propose a novel disturbance-immune WS training scheme for NAS. By updating models in an orthogonal space with orthogonal gradient descent, our method exhibits more stable and accurate performance estimation for NAS. Equipped with the disturbance-immune WS, NAS is able to learn a better search strategy and find better architectures.

We theoretically verify the effectiveness of the proposed disturbance-immune training scheme in alleviating the performance disturbance risk of WS. We also provide an asymptotic convergence analysis of the proposed method.

Extensive experiments demonstrate that the proposed method is able to alleviate the performance disturbance issue and provide more accurate performance evaluation for strategy learning. As a result, the architecture searched by our method performs better than the architectures obtained by other state-of-the-art NAS methods on the CIFAR-10 and ImageNet datasets.
2 Related Work
Neural architecture search. In the past few years, neural architecture search (NAS) has attracted increasing attention as a way to automatically design effective architectures. The classical NAS method [55] exploits the paradigm of reinforcement learning (RL) to generate the model descriptions of DNNs. After that, MetaQNN [2] automatically selects the architectures of DNNs through an RL-based meta-modeling procedure. NASNet [56] designs a new search space to improve the search performance. Moreover, some studies [36, 27] use evolutionary algorithms to find new architectures with excellent performance. To guide the search process, NAS methods need to estimate the performance of candidate architectures. The simplest way is to train candidate architectures from scratch to obtain their performance, which, however, is time-consuming and computationally expensive (e.g., several thousand GPU days). To reduce the computational cost, recent NAS methods adopt a weight sharing [35, 28, 4, 42] strategy to estimate the performance of candidate architectures.
Weight sharing approaches. Efficient neural architecture search (ENAS) [35] first proposes a NAS training scheme with weight sharing (WS), which measures the performance of an architecture with the weights inherited from the trained supernet. Since WS reduces the computational cost from thousands of GPU days to one GPU day, it is widely adopted to exploit NAS in various applications, such as object detection [13, 7], segmentation [25] and compact architecture design [4, 42, 41]. Besides, DARTS [28] exploits the WS scheme with a continuous relaxation of the search space to search for promising architectures. However, recent studies [24, 38] find that the architecture performance measured by WS training is often very inaccurate, leading to the inferior performance of WS-based NAS methods. To address this, NAO-V2 [32] improves WS-based NAS by training candidate architectures adequately and training complex architectures more. FairNAS [8] proposes a fair WS training strategy, which can only be applied to a single-path search space instead of a cell-based search space. Unlike existing methods, we identify a performance disturbance (PD) issue that occurs in WS training, and propose to achieve more accurate performance estimation by alleviating this issue.
Continual learning. Continual learning (CL) aims to continuously learn a series of tasks [22, 44, 12, 11]. One of the major issues in CL is catastrophic forgetting, i.e., deep models often forget previous tasks when model weights are updated for a new task. To address this, EWC [22] imposes constraints on the updating of model weights based on measuring the importance of previous tasks. Moreover, projection-based methods [18, 44] adjust the gradients of new tasks so that they interfere less with previous tasks. In this paper, motivated by CL, we propose a novel disturbance-immune update strategy to effectively improve the performance of WS-based NAS methods.
3 Problem Definition and Motivation
Notations. Throughout the paper, we use the following notations. Let $\Omega$ be the search space of NAS. Given any architecture $\alpha \in \Omega$, let $w_{\alpha}$ be its trainable parameters and $w_{\alpha}^{*}$ be its optimal model parameters trained on some dataset (e.g., CIFAR-10 or ImageNet). Moreover, let $\|\cdot\|_{F}$ denote the Frobenius norm, and let $I$ denote the identity matrix.
3.1 Neural Architecture Search and Weight Sharing
Neural architecture search (NAS) aims to search for an optimal architecture from a predefined search space. In this paper, we focus on reinforcement learning (RL)-based methods [55, 56], which seek to learn a controller with parameters $\theta$ to generate architectures (i.e., $\alpha \sim \pi(\cdot;\theta)$). To find promising architectures, these methods learn the controller by maximizing the expected architecture performance under some metric $R(\cdot)$ (e.g., the accuracy on the validation set):
$$\max_{\theta}\; \mathbb{E}_{\alpha \sim \pi(\cdot;\theta)}\big[R(\alpha, w_{\alpha}^{*})\big], \quad \text{s.t.}\;\; w_{\alpha}^{*} = \arg\min_{w_{\alpha}} \mathcal{L}(\alpha, w_{\alpha}), \quad (1)$$
where $\mathcal{L}(\alpha, w_{\alpha})$ is the training loss on the training data. However, we have to train candidate architectures from scratch to obtain $w_{\alpha}^{*}$, resulting in an unbearable computational burden. To address this, a weight sharing scheme has been proposed.
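As a toy illustration of the objective in Eqn. (1), the sketch below performs gradient ascent on the expected reward of a categorical controller over four hypothetical architectures. The rewards, dimensions and learning rate are made up for illustration; REINFORCE estimates this gradient by sampling, whereas here the expectation is computed in closed form to keep the example deterministic.

```python
import numpy as np

# A categorical controller pi(alpha; theta) over four hypothetical
# architectures, trained by gradient ascent on E[R]. The rewards are
# made-up stand-ins for validation accuracy R(alpha, w*).
def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

rewards = np.array([0.5, 0.6, 0.9, 0.7])   # hypothetical R per architecture
theta = np.zeros(4)                        # controller parameters (logits)
lr = 0.5

for _ in range(500):
    probs = softmax(theta)
    expected_r = probs @ rewards
    # Exact policy gradient: d E[R] / d theta_i = probs_i * (rewards_i - E[R])
    theta += lr * probs * (rewards - expected_r)

print(int(np.argmax(softmax(theta))))  # 2: the controller favors the best arch
```

In practice the expectation is intractable over a huge search space, so the gradient is estimated from sampled architectures and their (possibly disturbed) rewards, which is exactly why accurate reward estimation matters.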
Weight sharing (WS) for NAS [35] constructs a supernet, i.e., a large computational graph, in which each architecture is a subgraph that shares parameters with the others. Let $W$ be the parameters of the whole supernet and $w_{\alpha}$ be the parameters of $\alpha$ inherited from the supernet. To train the supernet, one can sample sufficiently many architectures $\{\alpha_t\}_{t=1}^{T}$ and train them in a sequential manner [35]:
$$W_{t} = W_{t-1} - \eta\, \nabla_{w_{\alpha_t}} \mathcal{L}\big(\alpha_t, W_{t-1}\big), \quad t = 1, \ldots, T, \quad (2)$$
where $\eta$ is the learning rate and only the shared entries $w_{\alpha_t}$ of the supernet parameters are updated at step $t$.
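The sequential updates of Eqn. (2) and the disturbance they cause can be reproduced in a few lines. In the synthetic sketch below (all data random; the two "architectures" are linear maps sharing one weight matrix), training the second architecture visibly changes the first one's outputs:

```python
import numpy as np

# Two linear "architectures" share the weight matrix W; training the second
# one with plain SGD perturbs the first one's outputs (the PD issue).
rng = np.random.default_rng(0)
d = 8
W = rng.normal(size=(d, d))          # shared supernet parameters

x_a = rng.normal(size=d)             # input used by architecture alpha_1
out_before = W.T @ x_a               # alpha_1's output before further training

# Sequentially train alpha_2 (which shares W) on its own objective.
x_b, y_b = rng.normal(size=d), rng.normal(size=d)
eta = 0.1
for _ in range(50):
    grad = np.outer(x_b, W.T @ x_b - y_b)  # grad of 0.5*||W^T x_b - y_b||^2
    W -= eta * grad                        # plain SGD, no projection

out_after = W.T @ x_a
print(np.linalg.norm(out_after - out_before))  # nonzero: alpha_1 is disturbed
```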
Despite the training efficiency, the performance estimated by WS is often very inaccurate [1, 24, 38] due to the sequential training method of WS. Specifically, the training of a subsequent architecture inevitably disturbs the performance of previously trained architectures. As a result, the estimated performance of the previous architectures becomes inaccurate, making it hard to learn a good controller and search for good architectures.
3.2 Performance Disturbance in Weight Sharing
During the training of WS, for any two architectures $\alpha_i$ and $\alpha_j$ with $i \neq j$, there are some shared parameters that appear in both $w_{\alpha_i}$ and $w_{\alpha_j}$ (see examples in Fig. 1(a)). Due to the sequential training strategy in Eqn. (2), once we update the parameters of $\alpha_j$, the shared parameters would also be changed and may yield different prediction results for $\alpha_i$. As a result, the architecture $\alpha_i$ may incur severe performance disturbance (PD) and the reward $R(\alpha_i, w_{\alpha_i})$ becomes inaccurate. To justify this, we show an illustrative example in Fig. 1(b).
In Fig. 1(b), we consider training 10 architectures sequentially with shared parameters on CIFAR-10. We record their performance in terms of validation accuracy after the training of each architecture. The element $(i, j)$ in Fig. 1(b) denotes the validation accuracy of $\alpha_i$ after the training of $\alpha_j$. The $i$-th row illustrates the performance disturbance of $\alpha_i$ during the whole training process. Moreover, the more subsequent architectures are trained, the more severe the performance disturbance becomes. Such unstable and inaccurate performance estimation would mislead the learning of the policy.
The performance disturbance can be measured by the predictions of an architecture, since the predictions directly determine architecture performance. Let $x$ be the input data and $f(x; w_{\alpha})$ be the prediction of $\alpha$ based on its parameters $w_{\alpha}$. For convenience, we use $w_{\alpha}$ and $\hat{w}_{\alpha}$ to represent the original parameters and the latest parameters after the training of other architectures that share parameters with $\alpha$, respectively. During the training of WS, the performance disturbance of $\alpha$ can be measured by the difference of predictions:
$$\mathrm{PD}(\alpha) = \big\| f(x; \hat{w}_{\alpha}) - f(x; w_{\alpha}) \big\|_{F}. \quad (3)$$
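The PD measure in Eqn. (3) can be computed directly once the two parameter snapshots are available. The sketch below uses a linear model as a stand-in for the network function $f(x; w)$:

```python
import numpy as np

# PD as in Eqn. (3): the Frobenius norm of the difference between predictions
# under the original parameters and the latest (post-disturbance) parameters.
# `predict` is a stand-in for the network function f(x; w).
def performance_disturbance(predict, w_orig, w_latest, x):
    return np.linalg.norm(predict(x, w_latest) - predict(x, w_orig))

rng = np.random.default_rng(0)
W = rng.normal(size=(5, 3))
X = rng.normal(size=(5, 10))            # a batch of 10 inputs of dimension 5
predict = lambda x, w: w.T @ x          # toy linear "architecture"

same = performance_disturbance(predict, W, W, X)        # identical weights
diff = performance_disturbance(predict, W, W + 0.1, X)  # perturbed weights
print(same, diff)  # 0.0 and a strictly positive value
```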
3.3 Disturbance-immune Weight Sharing Problem
To reduce the prediction differences, we seek to define and address a disturbance-immune problem. Similar to model compression methods [31, 54], one possible way is to restrict the changes of feature maps in each layer. In this sense, if we can update the shared parameters while maintaining the same feature maps for all previously trained architectures, the performance of these architectures will not be changed. Specifically, for the $t$-th update, we seek to improve the performance of the current architecture $\alpha_t$ and reduce the performance disturbance of all the previous architectures $\{\alpha_j\}_{j<t}$. In this way, the weight sharing scheme is able to provide a more stable and accurate reward (or performance estimation) for all candidate architectures.
To this end, we propose a disturbance-immune training strategy that makes the update term of the model parameters orthogonal to the input features. Note that all the layers adopt the same parameter update method. For simplicity, we only investigate the training process w.r.t. a single layer. Specifically, during the training of the WS, for any layer of the supernet, we record the input feature maps of all the previously trained architectures in a set $\mathcal{X}_t$. To ensure that we can still improve the performance w.r.t. the current architecture, we do not include its input features in $\mathcal{X}_t$. In order to reduce the PD of the previously trained architectures, we make the gradient w.r.t. the shared parameters orthogonal to all the input feature maps in $\mathcal{X}_t$. In this way, there would be almost no change in the prediction results of the previous architectures. Formally, let $W$ be the parameters of a specific layer and $W_0$ be the original parameters before the model update. The parameters have to satisfy a constraint that restricts the changes of output feature maps and forms a feasible set
$$\mathcal{S} = \big\{ W : \| W^{\top} x - W_0^{\top} x \|_{F} \leq \epsilon, \;\; \forall\, x \in \mathcal{X}_t \big\}, \quad (4)$$
where $\epsilon$ is some small positive value. When we apply such a constraint to train the WS, the optimization problem becomes:
$$\min_{W \in \mathcal{S}}\; \mathcal{L}(\alpha_t, W). \quad (5)$$
Unlike the original problem in Eqn. (2), we seek to train the WS while keeping the output features of all the previously trained architectures unchanged. In this sense, the performance of these architectures becomes stable and their PD can be greatly reduced (see results in Fig. 3). Note that recording all the input features is infeasible in practice. To address this issue, we propose an equivalent solution that adopts an iterative training scheme and avoids data recording (see details in Section 4.1). More critically, we theoretically prove that the proposed method is able to reduce the risk of performance disturbance (see theoretical analysis in Section 4.3). We call this method Disturbance-immune Weight Sharing and present the details in Section 4.
4 Disturbance-immune Weight Sharing
In this section, we first propose a disturbance-immune update strategy for architecture training with weight sharing (WS). Such a strategy aims to handle performance disturbance (PD) by constraining the change of the feature maps of previously trained architectures. Based on this strategy, we propose a new disturbance-immune training scheme for neural architecture search (DINAS). Lastly, we theoretically analyze the effectiveness of the proposed method.
4.1 Disturbance-immune Update Strategy
To address PD, we propose to project the update gradient towards a useful direction while avoiding large changes to the feature maps produced by shared parameters. To this end, we resort to the orthogonal gradient descent method [44]. Such a method projects the gradient to be orthogonal to the input features, which ensures that the output features change very slightly w.r.t. these inputs. To be specific, we maintain an orthogonal projection matrix set $\mathcal{P}$ for the supernet. Each parameter matrix $W$ in the supernet has a corresponding projection matrix $P$, initialized as an identity matrix $I$. These projection matrices are used to project gradients during model training. Overall, there are two key issues: how to update model parameters via the orthogonal projection matrix, and how to update the orthogonal projection matrix itself.
The update of model parameters
We train architectures based on orthogonal gradient descent [44], in which we project the gradient to be orthogonal to the input. In this way, we are able to update the model towards a useful direction while avoiding large changes to the previous outputs of the shared parameters. Formally, given any architecture $\alpha$, for any layer with parameters $W$ there is a corresponding projection matrix $P$ inherited from the projection matrix set $\mathcal{P}$. We update $W$ by the following orthogonal gradient descent:
$$W \leftarrow W - \eta\, P\, \nabla_{W} \mathcal{L}(\alpha, W), \quad (6)$$
where $P \nabla_{W} \mathcal{L}(\alpha, W)$ is often called the projected gradient. Moreover, when the layers in $\alpha$ are updated, the corresponding layers in the supernet are updated as well.
The update of projection matrices
In order to project the gradient of each shared layer so as to avoid large changes to its output, we ensure that the projected gradient is orthogonal to the previous inputs. Specifically, for any layer of $\alpha$, let $X \in \mathbb{R}^{d \times n}$ denote the matrix collecting the input feature maps of all previously trained architectures, with a total of $n$ feature maps (each of dimension $d$). We compute the orthogonal projection matrix $P$ by:
$$P = I - X \big( \lambda I + X^{\top} X \big)^{-1} X^{\top}, \quad (7)$$
where $\lambda$ is some regularization constant. The detailed derivations will be provided in an extended version. Based on Eqn. (7), our method is able to project the gradient to be orthogonal to the input, which helps to avoid a large change to the output of the shared layers (see Theorem 4.1 for more discussion).
A potential issue of Eqn. (7) is that the computation of $P$ requires all previous input feature maps (as described in Section 3.3). However, such a manner may be highly storage-consuming as the number of input feature maps increases. To handle this, we can update the projection matrix in an iterative manner [49, 50, 52]. For each coming sample $x_k$, based on the Woodbury identity [19], we update $P$ by:
$$P \leftarrow P - \frac{P x_k x_k^{\top} P}{\lambda + x_k^{\top} P x_k}. \quad (8)$$
Based on Eqn. (8), we update $P$ iteratively, and each iteration only requires the input feature map of a single sample. By doing so, we need not store all previous input feature maps, thus avoiding the high storage cost. Note that the update of $P$ also means the update of the corresponding projection matrix in $\mathcal{P}$.
4.2 Disturbance-immune Neural Architecture Search
The proposed disturbance-immune update strategy helps to alleviate the issue of performance disturbance in the WS, and provides more stable performance estimations for candidate architectures. As a result, we can obtain more accurate reward signals for learning a good controller. Based on the disturbance-immune update strategy, we propose a new disturbance-immune training scheme for neural architecture search (named DINAS). The overall training scheme is provided in Algorithm 1.
The main difference between the proposed DINAS and other standard WS-based NAS methods is the training of the supernet. Specifically, we train the supernet via the newly proposed disturbance-immune update strategy, which alleviates the performance disturbance issue in the training process. To be specific, we maintain an orthogonal projection matrix set $\mathcal{P}$ for the supernet (see line 2 in Algorithm 1). Each parameter matrix $W$ in the supernet has a corresponding projection matrix $P$, initialized as an identity matrix $I$. Based on the orthogonal projection matrix, we update the network architecture by orthogonal gradient descent (see lines 9-10 in Algorithm 1).
4.3 Theoretical Analysis
In this section, we theoretically analyze the proposed method regarding its effectiveness and convergence. To begin with, we analyze the effectiveness of our proposed method in alleviating the issue of performance disturbance as follows.
Theorem 4.1
Given a model with parameters $W$ and any input matrix $X$, let $P$ be the projection matrix computed by Eqn. (7) and $G$ be the gradient w.r.t. $W$. For the update of $W$ in the direction $PG$ with step size $\eta$, let $\Delta(x)$ denote the output change between the original model and the updated model on an input $x$. When $x$ lies in the column space of $X$, the following inequality holds:
$$\Delta(x) = \big\| (W - \eta P G)^{\top} x - W^{\top} x \big\|_{2} \leq \frac{\eta\, \lambda}{\sigma_{\min}^{2} + \lambda}\, \| G \|_{F}\, \| x \|_{2}, \quad (9)$$
where $\lambda$ is the regularization constant in Eqn. (7) and $\sigma_{\min}$ denotes the smallest nonzero singular value of $X$.
Theorem 4.1 indicates that, for any recorded input feature map $x$, the distance between the original output and the output after the update is controlled by the regularization factor $\lambda$: the smaller $\lambda$, the smaller the upper bound on this distance. Therefore, our proposed orthogonal gradient descent method is able to approximately satisfy the constraint in Eqn. (5). In other words, our method can avoid large changes to the previous outputs of shared layers when training the supernet, thus alleviating the issue of performance disturbance in WS.
We further provide an asymptotic analysis regarding the convergence of the update of $W$ as follows.
Theorem 4.2
Let the loss function $\mathcal{L}(w)$ be smooth and convex w.r.t. $w$. Let $w^{*}$ and $w_{0}$ be the optimal and initial solutions of $\mathcal{L}$, respectively. By setting $\eta = 1/\sqrt{T}$, at the $T$-th update step, the proposed update strategy satisfies:
$$\mathcal{L}(\bar{w}_{T}) - \mathcal{L}(w^{*}) \leq \frac{\| w_{0} - w^{*} \|_{F}^{2} + G^{2}}{2 \sqrt{T}}, \quad (10)$$
where $\bar{w}_{T} = \frac{1}{T} \sum_{t=1}^{T} w_{t}$ is the averaged iterate and $G$ is an upper bound on the norm of the projected gradients.
This theorem illustrates that our proposed update strategy has a sublinear convergence rate of $\mathcal{O}(1/\sqrt{T})$, which guarantees the effectiveness of the proposed method.
5 Experimental Results
We evaluate the proposed DINAS in two main aspects: (1) the superiority of the architecture searched by DINAS on CIFAR-10 and ImageNet, respectively; (2) the effectiveness of the proposed disturbance-immune update strategy. The source code will be made publicly available.
5.1 Evaluation on CIFAR-10
In this section, we evaluate the proposed DINAS method on CIFAR-10 [23]. To be specific, we first use the proposed method to train a controller on CIFAR-10, and then use it to search for a convolutional neural architecture. By comparing the searched architecture with other state-of-the-art architectures, we can evaluate the effectiveness of our method. To this end, we first describe the search space, training details and evaluation details.
Search space. Following the settings in DARTS [28], we aim to search for two types of convolutional cells, namely the normal cell and the reduction cell. Each cell contains 7 nodes, including 2 input nodes, 4 intermediate nodes and 1 output node. Between any two nodes, there are 8 available operations: 3×3 depthwise separable convolution, 3×3 dilated convolution, 3×3 max pooling, 3×3 average pooling, 5×5 depthwise separable convolution, 5×5 dilated convolution, identity, and none. After obtaining the convolutional cells, we stack them to build the final convolutional network.
Training details. In the search phase, we divide the standard training set of CIFAR-10 into two parts. Specifically, we randomly select 40% of the training set to train the sampled architectures (as training data) and use the remaining 60% to learn the controller (as validation data). Moreover, we train DINAS for epochs in total. We first train the supernet without learning the controller, and start training the controller from epoch 90. In addition, we set $\lambda$ to its default value, where the sensitivity analysis can be found in Section 5.4. For training the supernet, we use an SGD optimizer with a weight decay of and a momentum of . The learning rate is set to . For training the controller, we use Adam with a learning rate of and a weight decay of . We add the controller's sample entropy to the reward, weighted by .
Evaluation details. In the evaluation phase, we first use the learned controller to search for a normal cell and a reduction cell. Then, we construct the final convolutional network with 17 normal cells and 2 reduction cells. Following [34], we put the two reduction cells at 1/3 and 2/3 of the network depth, respectively. The initial number of channels is set to 43. Following DARTS [28], we train the convolutional network for epochs with a batch size of . We apply an SGD optimizer with a weight decay of and a momentum of . Moreover, we set the initial learning rate as and use the cosine annealing strategy [30] to adjust it. We also use the cutout scheme [10] with length 16 for data augmentation.
Architecture  Test Accuracy (%)  # Params (M)  Search Cost (GPU days)
DenseNet-BC [21]  96.54  25.6  –
PyramidNet-BC [16]  96.69  26.0  –
Random search baseline  96.71 ± 0.15  3.2  –
NASNet-A + cutout [56]  97.35  3.3  1,800
NASNet-B [56]  96.27  2.6  1,800
NASNet-C [56]  96.41  3.1  1,800
AmoebaNet-A + cutout [36]  96.66 ± 0.06  3.2  3,150
AmoebaNet-B + cutout [36]  96.63 ± 0.04  2.8  3,150
Hierarchical Evo [27]  96.25 ± 0.12  15.7  300
SNAS [43]  97.02  2.9  1.5
GHN [45]  97.16 ± 0.07  5.7  0.8
ENAS + cutout [35]  97.11  4.6  0.5
DARTS + cutout [28]  97.24 ± 0.09  3.4  4
NAT-DARTS [15]  97.28  2.7  –
NAONet [33]  97.02  28.6  200
NAONet-WS [33]  96.47  2.5  0.3
DINAS + cutout  97.38 ± 0.04  3.7  1.5
Comparison with state-of-the-art methods. We show the searched normal and reduction cells in Fig. 2 and report the detailed results of all methods in Table 1. For our searched architecture, we run 5 experiments with different random initializations, and report the average performance with standard deviation. Experimental results show that the architecture searched by our DINAS achieves 97.38% accuracy, outperforming all the other state-of-the-art architectures, including human-designed ones and NAS-searched ones. Note that ENAS, NAONet-WS, DARTS, and NAT also use the weight sharing (WS) scheme. This result demonstrates that it is necessary to alleviate the performance disturbance (PD) issue in WS-based NAS: doing so provides more accurate rewards for controller learning and helps to search for better neural architectures.
5.2 Evaluation on ImageNet
To verify the generalization ability of the convolutional cells searched on CIFAR-10 (as shown in Fig. 2), we further evaluate them on a large-scale image classification dataset, namely ImageNet [9]. To begin with, we describe the evaluation details.
Evaluation details. For the ImageNet dataset, we construct the convolutional network with 12 normal cells and 2 reduction cells. We put the two reduction cells at 1/3 and 2/3 of the network depth, respectively. We set the number of initial channels to 48. Following [28], we train the network with an SGD optimizer for epochs, using a weight decay of and a momentum of . We initialize the learning rate as and decrease it by cosine annealing [30]. We follow the ImageNet mobile setting [28], where the size of the input images is set to 224×224 and the number of multiply-adds (MAdds) is less than 600M.
Architecture  Top-1 / Top-5 Test Accuracy (%)  # Params (M)  # MAdds (M)  Search Cost (GPU days)
ResNet-18 [17]  69.8 / 89.1  11.7  1,814  –
Inception-v1 [40]  69.8 / 89.9  6.6  1,448  –
MobileNet-V1 [20]  70.6 / 89.5  4.2  569  –
ShuffleNet-V1 [46]  70.9 / 89.2  5.0  524  –
NASNet-A [56]  74.0 / 91.6  5.3  564  1,800
NASNet-B [56]  72.8 / 91.3  5.3  488  1,800
NASNet-C [56]  72.5 / 91.0  4.9  558  1,800
AmoebaNet-A [36]  74.5 / 92.0  5.1  555  3,150
AmoebaNet-B [36]  74.0 / 91.5  5.3  555  3,150
GHN [45]  73.0 / 91.3  6.1  569  0.8
PNAS [26]  74.2 / 91.9  5.1  588  255
BayesNAS [53]  73.5 / 91.1  3.9  –  0.2
DARTS [28]  73.1 / 91.0  4.9  595  4
NAT-DARTS [15]  73.7 / 91.4  4.0  441  –
SNAS [43]  72.7 / 90.8  4.3  522  1.5
DINAS  74.7 / 92.1  5.2  587  1.5
Comparison with state-of-the-art methods. We compare our searched architecture with several state-of-the-art models on ImageNet. As shown in Table 2, our architecture achieves 74.7% top-1 accuracy and 92.1% top-5 accuracy. To be specific, our architecture outperforms human-designed architectures (e.g., ResNet-18 [17]) by about 5%, and outperforms most of the NAS models in terms of top-1 accuracy. Moreover, our architecture achieves this excellent performance using only 1.5 GPU days for the search, while AmoebaNet [36] and NASNet [56] spend 3,150 and 1,800 GPU days, respectively. These results demonstrate the generalization ability of the searched convolutional cells and the effectiveness and efficiency of the proposed method.
5.3 Effectiveness of the Disturbance-immune Update Strategy
In the previous experiments, we have demonstrated the superiority of the proposed method. One important reason for this superiority is the ability to alleviate performance disturbance (PD). In this section, we further verify the effectiveness of our proposed update strategy in dealing with PD. To this end, we use the following two metrics to measure the degree of PD and use them to evaluate the proposed method.
Metrics for PD. (1) Performance change: for any architecture inherited from the supernet, the performance change is defined as $|A_{j} - A_{i}|$, where $A_{i}$ and $A_{j}$ denote the validation accuracy of the architecture at the $i$-th and $j$-th training epoch of the supernet, respectively. Note that $j > i$ and $j - i$ indicates the epoch interval. Overall, a larger performance change indicates more severe performance disturbance. (2) Kendall's Tau (KTau): we use KTau [39] to measure the correlation of performance ranks between two architecture sets. The range of KTau is $[-1, 1]$, where a larger KTau means that the performance ranks of the two architecture sets are more consistent.
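For reference, KTau can be computed directly from concordant and discordant pairs. The self-contained sketch below (with made-up accuracies) returns 1 for a fully preserved ranking and lower values for disturbed rankings:

```python
import numpy as np

# Kendall's Tau between two scorings of the same architecture set, in [-1, 1],
# computed from concordant/discordant pairs (no ties assumed).
def kendall_tau(a, b):
    a, b = np.asarray(a), np.asarray(b)
    n = len(a)
    concordant = discordant = 0
    for i in range(n):
        for j in range(i + 1, n):
            s = np.sign(a[i] - a[j]) * np.sign(b[i] - b[j])
            concordant += s > 0
            discordant += s < 0
    return (concordant - discordant) / (n * (n - 1) / 2)

acc_epoch_t = [0.90, 0.85, 0.80, 0.75]        # accuracies at some epoch
acc_later_same = [0.92, 0.88, 0.81, 0.78]      # same ranking later on
acc_later_shuffled = [0.80, 0.92, 0.75, 0.88]  # disturbed ranking

print(kendall_tau(acc_epoch_t, acc_later_same))      # 1.0 (fully consistent)
print(kendall_tau(acc_epoch_t, acc_later_shuffled))  # lower: ranks disturbed
```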
Evaluation in terms of performance change. To evaluate our method, we randomly sample 64 architectures based on the supernet. We record their performance at different training epochs of the supernet, and report the average performance change of these 64 architectures regarding an epoch interval of 13. Fig. 3 (a) shows that our disturbance-immune WS scheme is able to reduce the performance change of architectures and thus alleviate the PD issue.
Evaluation in terms of KTau. Based on the above architectures, we compute the KTau between their current performance rank and the one after 13 epochs (i.e., the epoch interval is 13). Fig. 3 (b) verifies the effectiveness of our method in improving KTau. Note that a higher KTau indicates a more consistent performance estimation during the training process, which means more stable and accurate rewards for controller learning. Moreover, we compute the ground-truth KTau (GT-Tau) [38], which measures the performance correlation between a set of architectures inherited from the supernet and the same architectures exactly trained from scratch. Specifically, our method achieves a higher GT-Tau (0.48) than the standard WS (0.16).
5.4 More Discussions
We further discuss PD for different numbers of shared layers and the parameter sensitivity through additional experiments. We provide our main observations and analyses as follows.
Number of shared layers. We find that as the number of shared layers between two architectures increases, the PD issue becomes more severe. Since there often exist many shared layers in WS-based NAS, this result demonstrates that it is necessary to alleviate the PD issue for NAS.
Regularization parameter $\lambda$. Generally, the optimal value of $\lambda$ varies for different data. A large $\lambda$ may make the method fail to satisfy the problem constraint (see Eqn. (5)), while a small $\lambda$ may make the matrix inversion in the computation of the projection matrices ill-conditioned (see Eqn. (7)). Nevertheless, the default setting helps to achieve the best or relatively good performance in most cases.
6 Conclusions
In this paper, we have proposed a novel disturbance-immune training scheme for NAS to conquer performance disturbance (PD) in weight sharing (WS). Specifically, by developing a new update strategy to train sampled architectures, our method provides more stable and accurate performance estimation for architectures. As a result, the proposed method is able to learn a good controller for searching for good architectures. We theoretically and empirically verify the effectiveness of the proposed method in alleviating PD, and also provide an asymptotic analysis of its convergence. Extensive experiments demonstrate that the architecture found by our method outperforms the architectures obtained by the considered state-of-the-art NAS methods.
References
[1] G. Adam and J. Lorraine. Understanding neural architecture search techniques. arXiv, 2019.
[2] B. Baker, O. Gupta, N. Naik, and R. Raskar. Designing neural network architectures using reinforcement learning. In International Conference on Learning Representations, 2017.
[3] G. Bender, P.-J. Kindermans, B. Zoph, V. Vasudevan, and Q. Le. Understanding and simplifying one-shot architecture search. In International Conference on Machine Learning, pages 549–558, 2018.
[4] H. Cai, T. Chen, W. Zhang, Y. Yu, and J. Wang. Efficient architecture search by network transformation. In AAAI Conference on Artificial Intelligence, pages 2787–2794, 2018.
[5] H. Cai, L. Zhu, and S. Han. ProxylessNAS: Direct neural architecture search on target task and hardware. In International Conference on Learning Representations, 2019.
[6] J. Cao, L. Mo, Y. Zhang, et al. Multi-marginal Wasserstein GAN. In Advances in Neural Information Processing Systems, pages 1774–1784, 2019.
[7] Y. Chen, T. Yang, X. Zhang, G. Meng, C. Pan, and J. Sun. DetNAS: Neural architecture search on object detection. arXiv, 2019.
[8] X. Chu, B. Zhang, R. Xu, and J. Li. FairNAS: Rethinking evaluation fairness of weight sharing neural architecture search. arXiv, 2019.
[9] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. ImageNet: A large-scale hierarchical image database. In IEEE Conference on Computer Vision and Pattern Recognition, pages 248–255, 2009.
[10] T. DeVries and G. W. Taylor. Improved regularization of convolutional neural networks with cutout. arXiv, 2017.
[11] M. Farajtabar, N. Azizan, A. Mott, and A. Li. Orthogonal gradient descent for continual learning. arXiv, 2019.
[12] C. Finn, P. Abbeel, and S. Levine. Model-agnostic meta-learning for fast adaptation of deep networks. In International Conference on Machine Learning, pages 1126–1135, 2017.
[13] G. Ghiasi, T.-Y. Lin, and Q. V. Le. NAS-FPN: Learning scalable feature pyramid architecture for object detection. In IEEE Conference on Computer Vision and Pattern Recognition, pages 7036–7045, 2019.
[14] Y. Guo, Q. Wu, C. Deng, J. Chen, and M. Tan. Double forward propagation for memorized batch normalization. In Thirty-Second AAAI Conference on Artificial Intelligence, 2018.
[15] Y. Guo, Y. Zheng, M. Tan, Q. Chen, J. Chen, P. Zhao, and J. Huang. NAT: Neural architecture transformer for accurate and compact architectures. In Advances in Neural Information Processing Systems, pages 735–747, 2019.
[16] D. Han, J. Kim, and J. Kim. Deep pyramidal residual networks. In IEEE Conference on Computer Vision and Pattern Recognition, pages 6307–6315, 2017.
[17] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.
 X. He and H. Jaeger. Overcoming catastrophic interference by conceptors. arXiv, 2017.
 R. A. Horn and C. R. Johnson. Matrix analysis. Cambridge university press, 2012.
 A. G. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang, T. Weyand, M. Andreetto, and H. Adam. Mobilenets: Efficient convolutional neural networks for mobile vision applications. arXiv, 2017.
 G. Huang, Z. Liu, L. Van Der Maaten, and K. Q. Weinberger. Densely connected convolutional networks. In IEEE Conference on Computer Vision and Pattern Recognition, pages 2261–2269, 2017.
 J. Kirkpatrick, R. Pascanu, N. Rabinowitz, J. Veness, G. Desjardins, A. A. Rusu, K. Milan, J. Quan, T. Ramalho, A. GrabskaBarwinska, et al. Overcoming catastrophic forgetting in neural networks. National Academy of Sciences, pages 3521–3526, 2017.
 A. Krizhevsky, G. Hinton, et al. Learning multiple layers of features from tiny images. 2009.
 L. Li and A. Talwalkar. Random search and reproducibility for neural architecture search. In Conference on Uncertainty in Artificial Intelligence, 2019.
 C. Liu, L.C. Chen, F. Schroff, H. Adam, W. Hua, A. L. Yuille, and L. FeiFei. Autodeeplab: Hierarchical neural architecture search for semantic image segmentation. In IEEE Conference on Computer Vision and Pattern Recognition, pages 82–92, 2019.
 C. Liu, B. Zoph, M. Neumann, J. Shlens, W. Hua, L.J. Li, L. FeiFei, A. Yuille, J. Huang, and K. Murphy. Progressive neural architecture search. In European Conference on Computer Vision, pages 19–34, 2018.
 H. Liu, K. Simonyan, O. Vinyals, C. Fernando, and K. Kavukcuoglu. Hierarchical representations for efficient architecture search. In International Conference on Learning Representations, 2017.
 H. Liu, K. Simonyan, and Y. Yang. Darts: Differentiable architecture search. In International Conference on Learning Representations, 2019.
 W. Liu, Y. Wen, Z. Yu, M. Li, B. Raj, and L. Song. Sphereface: Deep hypersphere embedding for face recognition. In IEEE Conference on Computer Vision and Pattern Recognition, pages 6738–6746, 2017.
 I. Loshchilov and F. Hutter. SGDR: stochastic gradient descent with warm restarts. In International Conference on Learning Representations, 2017.
 J.H. Luo, J. Wu, and W. Lin. Thinet: A filter level pruning method for deep neural network compression. In IEEE International Conference on Computer Vision, pages 5058–5066, 2017.
 R. Luo, T. Qin, and E. Chen. Understanding and improving oneshot neural architecture optimization. arXiv, 2019.
 R. Luo, F. Tian, T. Qin, E. Chen, and T.Y. Liu. Neural architecture optimization. In Advances in Neural Information Processing Systems, pages 7816–7827, 2018.
 N. Nayman, A. Noy, T. Ridnik, I. Friedman, R. Jin, and L. Zelnik. Xnas: Neural architecture search with expert advice. In Advances in Neural Information Processing Systems, pages 1975–1985, 2019.
 H. Pham, M. Y. Guan, B. Zoph, Q. V. Le, and J. Dean. Efficient neural architecture search via parameter sharing. In International Conference on Machine Learning, pages 4092–4101, 2018.
 E. Real, A. Aggarwal, Y. Huang, and Q. V. Le. Regularized evolution for image classifier architecture search. In AAAI Conference on Artificial Intelligence, pages 4780–4789, 2019.
 F. Schroff, D. Kalenichenko, and J. Philbin. Facenet: A Unified Embedding for Face Recognition and Clustering. In IEEE Conference on Computer Vision and Pattern Recognition, pages 815–823, 2015.
 C. Sciuto, K. Yu, M. Jaggi, C. Musat, and M. Salzmann. Evaluating the search phase of neural architecture search. In International Conference on Learning Representations, 2019.
 P. K. Sen. Estimates of the regression coefficient based on kendall’s tau. Journal of American Statistical Association, 63(324):1379–1389, 1968.
 C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich. Going deeper with convolutions. In IEEE Conference on Computer Vision and Pattern Recognition, pages 1–9, 2015.
 M. Tan, B. Chen, R. Pang, V. Vasudevan, M. Sandler, A. Howard, and Q. V. Le. Mnasnet: Platformaware neural architecture search for mobile. In IEEE Conference on Computer Vision and Pattern Recognition, pages 2820–2828, 2019.
 B. Wu, X. Dai, P. Zhang, Y. Wang, F. Sun, Y. Wu, Y. Tian, P. Vajda, Y. Jia, and K. Keutzer. Fbnet: Hardwareaware efficient convnet design via differentiable neural architecture search. In IEEE Conference on Computer Vision and Pattern Recognition, pages 10734–10742, 2019.
 S. Xie, H. Zheng, C. Liu, and L. Lin. Snas: stochastic neural architecture search. In International Conference on Learning Representations, 2019.
 G. Zeng, Y. Chen, B. Cui, and S. Yu. Continual learning of contextdependent processing in neural networks. Nature Machine Intelligence, 1(8):364–372, 2019.
 C. Zhang, M. Ren, and R. Urtasun. Graph hypernetworks for neural architecture search. In International Conference on Learning Representations, 2018.
 X. Zhang, X. Zhou, M. Lin, and J. Sun. Shufflenet: An extremely efficient convolutional neural network for mobile devices. In IEEE Conference on Computer Vision and Pattern Recognition, pages 6848–6856, 2018.
 Y. Zhang, H. Chen, Y. Wei, P. Zhao, J. Cao, et al. From whole slide imaging to microscopy: Deep microscopy adaptation network for histopathology cancer image classification. In International Conference on Medical Image Computing and ComputerAssisted Intervention, pages 360–368, 2019.
 Y. Zhang, Y. Wei, P. Zhao, S. Niu, et al. Collaborative unsupervised domain adaptation for medical image diagnosis. In Medical Imaging meets NeurIPS, 2019.
 Y. Zhang, P. Zhao, J. Cao, W. Ma, et al. Online adaptive asymmetric active learning for budgeted imbalanced data. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pages 2768–2777. ACM, 2018.
 Y. Zhang, P. Zhao, S. Niu, Q. Wu, et al. Online adaptive asymmetric active learning with limited budgets. IEEE Transactions on Knowledge and Data Engineering, 2019.
 Y. Zhang, P. Zhao, Q. Wu, B. Li, et al. Costsensitive portfolio selection via deep reinforcement learning. IEEE Transactions on Knowledge and Data Engineering, 2020.
 P. Zhao, Y. Zhang, M. Wu, S. C. Hoi, M. Tan, and J. Huang. Adaptive costsensitive online classification. IEEE Transactions on Knowledge and Data Engineering, 31(2):214–228, 2018.
 H. Zhou, M. Yang, J. Wang, and W. Pan. Bayesnas: A bayesian approach for neural architecture search. In International Conference on Machine Learning, pages 7603–7613, 2019.
 Z. Zhuang, M. Tan, B. Zhuang, J. Liu, Y. Guo, Q. Wu, J. Huang, and J. Zhu. Discriminationaware channel pruning for deep neural networks. In Advances in Neural Information Processing Systems, pages 875–886, 2018.
 B. Zoph and Q. V. Le. Neural architecture search with reinforcement learning. In International Conference on Learning Representations, 2017.
 B. Zoph, V. Vasudevan, J. Shlens, and Q. V. Le. Learning transferable architectures for scalable image recognition. In IEEE Conference on Computer Vision and Pattern Recognition, pages 8697–8710, 2018.