Disturbance-immune Weight Sharing for Neural Architecture Search

Abstract

Neural architecture search (NAS) has gained increasing attention in the community of architecture design. One of the key factors behind the success lies in the training efficiency created by the weight sharing (WS) technique. However, WS-based NAS methods often suffer from a performance disturbance (PD) issue. That is, the training of subsequent architectures inevitably disturbs the performance of previously trained architectures due to the partially shared weights. This leads to inaccurate performance estimation for the previous architectures, which makes it hard to learn a good search strategy. To alleviate the performance disturbance issue, we propose a new disturbance-immune update strategy for model updating. Specifically, to preserve the knowledge learned by previous architectures, we constrain the training of subsequent architectures in an orthogonal space via orthogonal gradient descent. Equipped with this strategy, we propose a novel disturbance-immune training scheme for NAS. We theoretically analyze the effectiveness of our strategy in alleviating the PD risk. Extensive experiments on CIFAR-10 and ImageNet verify the superiority of our method.

1 Introduction

Deep neural networks (DNNs) have produced state-of-the-art results in many challenging tasks, such as image classification [17, 14], face recognition [29, 37], medical image analysis [48, 47], portfolio selection [51], and image generation [6]. One of the key factors behind the success of DNNs lies in the design of effective neural architectures, such as ResNet [17] and MobileNet [20]. In practice, different architectures often show different performance, and the optimal architecture may vary across tasks. Hence, the design of effective neural architectures highly relies on substantial human expertise. However, a manual design process cannot fully explore the whole architecture space, resulting in suboptimal architectures [55].

Besides manual design, one may resort to the neural architecture search (NAS) [55] technique to automatically design network architectures. Specifically, NAS seeks to find an optimal architecture in a predefined search space [55, 56]. The searched architectures often show promising performance, demonstrating tremendous potential to surpass hand-crafted architectures [4, 5, 26, 27, 41]. During the search process, NAS seeks to evaluate a large number of candidate architectures and gradually learns a strategy to find good architectures. However, the exact performance evaluation requires training all the architectures from scratch, which is highly time-consuming and computationally impractical in real-world applications.

To improve the training efficiency, a weight sharing (WS) [3, 35] technique has been developed for NAS. Specifically, WS-based NAS constructs a supernet, i.e., a large computational graph, where each candidate architecture is a subgraph and shares parameters with the others. In this sense, every candidate architecture inherits its parameters from the supernet. Based on the supernet, one can directly estimate the performance of architectures instead of training them from scratch. In this way, WS effectively accelerates the search process of NAS and reduces the search cost from 1,800 GPU days to less than 1 day [5]. Despite the efficiency, recent studies [1, 24, 38] have empirically shown that the performance estimation provided by WS is inaccurate, which makes it hard to identify good architectures. As a result, the search performance of NAS cannot be guaranteed, and the search often fails to find promising architectures [8, 32].

In this paper, we study the risk of the WS scheme and find that WS-based methods suffer from a performance disturbance (PD) issue. That is, the training of a subsequent architecture inevitably disturbs the performance of previously trained architectures due to the update of shared parameters. As a result, the performance estimation of the previous architectures is unstable and inaccurate, which makes it difficult to search for good architectures. To address this issue, we propose a new disturbance-immune update strategy to train the WS. By exploring orthogonal gradient descent, the proposed strategy trains architectures for better performance while constraining the prediction of previously trained architectures to be unchanged. In this way, we are able to alleviate the PD issue in WS and provide more stable and accurate performance estimation. Based on this update strategy, we further propose a novel disturbance-immune training scheme for NAS, which helps to search for better architectures.

The main contributions of this paper are summarized as follows.

  • We propose a novel disturbance-immune WS training scheme for NAS. By updating models in an orthogonal space with orthogonal gradient descent, our method exhibits more stable/accurate performance estimation for NAS. Equipped with the disturbance-immune WS, NAS is able to learn a better search strategy and find better architectures.

  • We theoretically verify the effectiveness of the proposed disturbance-immune training scheme in alleviating the performance disturbance risk of WS. We also provide an asymptotic convergence analysis of the proposed method.

  • Extensive experiments demonstrate that the proposed method is able to alleviate the performance disturbance issue and provide more accurate performance evaluation for strategy learning. As a result, the architecture searched by our method performs better than the architectures obtained by other state-of-the-art NAS methods on the CIFAR-10 and ImageNet datasets.

2 Related Work

Neural architecture search. In the past few years, neural architecture search (NAS) has attracted increasing attention as a way to automatically design effective architectures. The seminal NAS method [55] exploits reinforcement learning (RL) to generate model descriptions of DNNs. After that, MetaQNN [2] automatically selects the architectures of DNNs through an RL-based meta-modeling procedure. NASNet [56] designs a new search space to improve the search performance. Moreover, some studies [36, 27] use evolutionary algorithms to find new architectures with excellent performance. To guide the search process, NAS methods need to estimate the performance of candidate architectures. The simplest way is to train candidate architectures from scratch, which, however, is time-consuming and computationally expensive (e.g., several thousands of GPU days). To reduce the computational cost, recent NAS methods adopt a weight sharing [35, 28, 4, 42] strategy to estimate the performance of candidate architectures.

Weight sharing approaches. Efficient neural architecture search (ENAS) [35] first proposes a NAS training scheme with weight sharing (WS), which measures the performance of an architecture with the weights inherited from the trained supernet. Since WS reduces the computational cost from thousands of GPU days to about one GPU day, it is widely adopted to apply NAS in various applications, such as object detection [13, 7], segmentation [25] and compact architecture design [4, 42, 41]. Besides, DARTS [28] exploits the WS scheme with a continuous relaxation of the search space to search for promising architectures. However, recent studies [24, 38] find that the architecture performance measured under WS training is often very inaccurate, thus leading to the inferior performance of WS-based NAS methods. To address this, NAO-V2 [32] improves WS-based NAS by training candidate architectures adequately and training complex architectures more. FairNAS [8] proposes a fair WS training strategy, which can only be applied to a single-path search space instead of a cell-based search space. Unlike existing methods, we identify a performance disturbance (PD) issue that occurs in WS training, and propose to achieve more accurate performance estimations by alleviating this issue.

Continual learning. Continual learning (CL) aims to continuously learn a series of tasks [22, 44, 12, 11]. One of the major issues in CL is catastrophic forgetting, i.e., deep models often forget previous tasks when the model weights are updated for a new task. To address this, EWC [22] imposes constraints on the update of model weights based on measuring the importance of the weights to previous tasks. Moreover, projection-based methods [18, 44] constrain the gradients of new tasks to directions that interfere little with previously learned tasks. In this paper, motivated by CL, we propose a novel disturbance-immune update strategy to effectively improve the performance of WS-based NAS methods.

3 Problem Definition and Motivation

Notations. Throughout the paper, we use the following notations. Let $\Omega$ be the search space of NAS. Given any architecture $\alpha \in \Omega$, let $w_{\alpha}$ be its trainable parameters and $w_{\alpha}^{*}$ be its optimal model parameters trained on some dataset (e.g., CIFAR-10 or ImageNet). Moreover, let $\|\cdot\|_F$ denote the Frobenius norm, and let $I$ denote the identity matrix.

3.1 Neural Architecture Search and Weight Sharing

Neural architecture search (NAS) aims to search for an optimal architecture from a predefined search space. In this paper, we focus on reinforcement learning (RL)-based methods [55, 56], which seek to learn a controller with parameters $\theta$ to generate architectures (i.e., $\alpha \sim \pi(\alpha; \theta)$). To find promising architectures, these methods learn the controller by maximizing the expectation of architecture performance under some metric $\mathcal{R}(\cdot)$ (e.g., the accuracy on the validation set):

$$\max_{\theta} \; \mathbb{E}_{\alpha \sim \pi(\alpha; \theta)}\big[\mathcal{R}(\alpha, w_{\alpha}^{*})\big], \quad \text{s.t.} \;\; w_{\alpha}^{*} = \arg\min_{w_{\alpha}} \mathcal{L}_{\text{train}}(w_{\alpha}, \alpha), \tag{1}$$

where $\mathcal{L}_{\text{train}}(w_{\alpha}, \alpha)$ is the training loss on the training data. However, we have to train candidate architectures from scratch to obtain $w_{\alpha}^{*}$, resulting in an unbearable computational burden. To address this, a weight sharing scheme has been proposed.

Weight sharing (WS) for NAS [35] constructs a supernet, i.e., a large computational graph, where each candidate architecture (a subgraph) shares parameters with the others. Let $W$ be the parameters of the whole supernet and $w_{\alpha}$ be the parameters of $\alpha$ inherited from the supernet. To train the supernet, one can sample sufficiently many architectures $\{\alpha_t\}_{t=1}^{T}$ and train them in a sequential manner [35]:

$$w_{\alpha_t} = \arg\min_{w_{\alpha_t}} \mathcal{L}_{\text{train}}(w_{\alpha_t}, \alpha_t), \quad t = 1, \ldots, T. \tag{2}$$

Despite the training efficiency, the performance estimated by WS is often very inaccurate [1, 24, 38] due to the sequential training method of WS. Specifically, the training of a subsequent architecture inevitably disturbs the performance of previously trained architectures. As a result, the estimated performance of the previous architectures becomes inaccurate, making it hard to learn a good controller and search for good architectures.

(a) Example of weight sharing (WS)
(b) Performance disturbance in WS
Figure 1: Exemplar illustration of WS and performance disturbance (PD). (a) The weights of architectures $\alpha_1$ and $\alpha_2$ are inherited from the supernet; some weights are shared by both $\alpha_1$ and $\alpha_2$. (b) We record the performance of 10 architectures $\{\alpha_i\}_{i=1}^{10}$. Element $(i, j)$ denotes the accuracy of $\alpha_i$ after the training of $\alpha_j$. Note that the training of subsequent architectures results in apparent performance disturbance of the previously trained architectures.

3.2 Performance Disturbance in Weight Sharing

During the training of WS, for any two architectures $\alpha_i$ and $\alpha_j$ with $i \neq j$, there are some shared parameters that appear in both $w_{\alpha_i}$ and $w_{\alpha_j}$ (see examples in Fig. 1(a)). Due to the sequential training strategy in Eqn. (2), once we update the parameters of $\alpha_j$, the shared parameters would also be changed and may yield different prediction results. As a result, the architecture $\alpha_i$ may incur severe performance disturbance (PD) and the reward $\mathcal{R}(\alpha_i, w_{\alpha_i})$ becomes inaccurate. To justify this, we show an illustrative example in Fig. 1(b).

In Fig. 1(b), we consider training 10 architectures $\{\alpha_i\}_{i=1}^{10}$ sequentially with shared parameters on CIFAR-10. We record their performance in terms of validation accuracy after the training of each architecture. The element $(i, j)$ in Fig. 1(b) denotes the validation accuracy of $\alpha_i$ after the training of $\alpha_j$. The $i$-th row illustrates the performance disturbance of $\alpha_i$ during the whole training process. Moreover, the more subsequent architectures are trained, the more severe the performance disturbance becomes. Such unstable and inaccurate performance estimation would mislead the learning of the search policy.

The performance disturbance can be measured by the predictions of an architecture, since the predictions directly determine architecture performance. Let $x$ be the input data and $f(x; w_{\alpha})$ be the prediction of $\alpha$ based on its parameters $w_{\alpha}$. For convenience, we use $w_{\alpha}$ and $\hat{w}_{\alpha}$ to represent the original parameters and the latest parameters after the training of other architectures that share parameters with $\alpha$, respectively. During the training of WS, the performance disturbance of $\alpha$ can be measured by the difference of predictions:

$$\mathrm{PD}(\alpha) = \big\| f(x; \hat{w}_{\alpha}) - f(x; w_{\alpha}) \big\|_F. \tag{3}$$
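To make the measure in Eqn. (3) concrete, the toy sketch below (our own construction with hypothetical module names, not the actual supernet code) builds two small subnets that share a linear layer, trains the second one, and then evaluates the prediction change of the first one via the Frobenius norm in Eqn. (3).

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy "supernet": a shared linear layer plus two architecture-specific heads.
shared = nn.Linear(16, 16)
head_a = nn.Linear(16, 10)   # architecture alpha_1 = shared -> head_a
head_b = nn.Linear(16, 10)   # architecture alpha_2 = shared -> head_b

x = torch.randn(32, 16)                  # a batch of input data
y = torch.randint(0, 10, (32,))          # dummy labels

def predict_a(inp):
    return head_a(shared(inp))

# Record alpha_1's predictions before alpha_2 is trained.
with torch.no_grad():
    pred_before = predict_a(x)

# Train alpha_2 for a few steps; this also updates the shared layer.
opt = torch.optim.SGD(list(shared.parameters()) + list(head_b.parameters()), lr=0.1)
for _ in range(20):
    loss = nn.functional.cross_entropy(head_b(shared(x)), y)
    opt.zero_grad()
    loss.backward()
    opt.step()

# Performance disturbance of alpha_1, measured as in Eqn. (3).
with torch.no_grad():
    pred_after = predict_a(x)
pd = torch.norm(pred_after - pred_before, p='fro')
print(f"PD of alpha_1 after training alpha_2: {pd.item():.4f}")  # nonzero => disturbed
```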

3.3 Disturbance-immune Weight Sharing Problem

To reduce the prediction differences, we seek to define and address a disturbance-immune problem. Similar to model compression methods [31, 54], one possible way is to restrict the changes of the feature maps in each layer. In this sense, if we can update the shared parameters while maintaining the same feature maps for all previously trained architectures, the performance of these architectures will not be changed. Specifically, for the $t$-th update, we seek to improve the performance of $\alpha_t$ and reduce the performance disturbance of all the previous architectures $\{\alpha_i\}_{i<t}$. In this way, the weight sharing scheme is able to provide a more stable and accurate reward (or performance estimation) for all candidate architectures.

To this end, we propose a disturbance-immune training strategy that makes the update term of the model parameters orthogonal to the input features. Note that all the layers adopt the same parameter update method. For simplicity, we only investigate the training process w.r.t. a single layer. Specifically, during the training of the WS, for any layer $l$ of the supernet, we record the input feature maps of all the previously trained architectures in a matrix $X^{l}$. To ensure that we can still improve the performance of the current architecture, we do not include its input features in $X^{l}$. In order to reduce the PD of the previously trained architectures, we make the gradient w.r.t. the shared parameters orthogonal to all the input feature maps in $X^{l}$. In this way, there would be (almost) no change in the prediction results of the previous architectures. Formally, let $W^{l}$ be the parameters of a specific layer $l$ and $W_{0}^{l}$ be its original parameters before the model update. The parameters have to satisfy a constraint that restricts the changes of the output feature maps, which forms a feasible set

$$\mathcal{C}^{l} = \big\{ W^{l} : \| X^{l} W^{l} - X^{l} W_{0}^{l} \|_F \le \epsilon \big\}, \tag{4}$$

where $\epsilon$ is some small positive value. When we apply such constraints to train the WS, the optimization problem becomes:

$$w_{\alpha_t} = \arg\min_{w_{\alpha_t}} \mathcal{L}_{\text{train}}(w_{\alpha_t}, \alpha_t), \quad \text{s.t.} \;\; W^{l} \in \mathcal{C}^{l} \text{ for each layer } l \text{ shared with previous architectures}, \quad t = 1, \ldots, T. \tag{5}$$

Unlike the original problem in Eqn. (2), we seek to train the WS while keeping the output features of all the previously trained architectures unchanged. In this sense, the performance of these architectures becomes stable and their PD can be greatly reduced (see results in Fig. 3). Note that recording all the input features is infeasible in practice. To address this issue, we propose an equivalent solution that adopts an iterative training scheme and avoids data recording (see details in Section 4.1). More critically, we theoretically prove that the proposed method is able to reduce the risk of performance disturbance (see the theoretical analysis in Section 4.3). We call this method Disturbance-immune Weight Sharing and show the details in Section 4.

4 Disturbance-immune Weight Sharing

In this section, we first propose a disturbance-immune update strategy for architecture training with weight sharing (WS). Such a strategy aims to handle performance disturbance (PD) by constraining the change of feature maps of previously trained architectures. Based on this strategy, we propose a new disturbance-immune training scheme for neural architecture search (DI-NAS). Lastly, we theoretically analyze the effectiveness of the proposed method.

4.1 Disturbance-immune Update Strategy

To address PD, we propose to project the update gradient towards a useful direction while avoiding large changes in the feature maps computed with the shared parameters. To this end, we resort to the orthogonal gradient descent method [44]. Such a method projects the gradient to be orthogonal to the input features, which ensures that the output features change only slightly on those inputs. To be specific, we maintain an orthogonal projection matrix set $\mathcal{P}$ for the supernet. Each parameter matrix in the supernet has a corresponding projection matrix $P \in \mathcal{P}$, initialized as the identity matrix $I$. These projection matrices are used to project gradients during model training. Overall, there are two key issues: how to update the model parameters via the orthogonal projection matrices, and how to update the orthogonal projection matrices themselves.

The update of model parameters

We train architectures based on orthogonal gradient descent [44], in which we project the gradient to be orthogonal to the input. In this way, we are able to update the model towards a useful direction while avoiding large changes to the previous outputs of the shared parameters. Formally, given any architecture $\alpha$ and any layer $l$ of $\alpha$ with parameters $w_{\alpha}^{l}$, there is a corresponding projection matrix $P^{l}$ inherited from the projection matrix set $\mathcal{P}$. We update $w_{\alpha}^{l}$ by the following orthogonal gradient descent:

$$w_{\alpha}^{l} \leftarrow w_{\alpha}^{l} - \eta \, P^{l} \, \nabla_{w_{\alpha}^{l}} \mathcal{L}_{\text{train}}(w_{\alpha}, \alpha), \tag{6}$$

where $\eta$ is the learning rate and $P^{l} \nabla_{w_{\alpha}^{l}} \mathcal{L}_{\text{train}}(w_{\alpha}, \alpha)$ is often called the projected gradient. Moreover, when the layers of $\alpha$ are updated, the corresponding layers of the supernet are updated as well.
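A minimal sketch of the update in Eqn. (6) for a single linear layer is given below, assuming the projection matrix P of that layer is already available (an identity placeholder is used in the usage line); the function name and the regression-style stand-in loss are our own simplifications, not the released implementation.

```python
import torch

def di_update(W, X, Y, P, lr=0.1):
    """One disturbance-immune step (Eqn. (6)) for a single linear layer.

    W : (d, m) layer weights, X : (n, d) inputs of the current batch,
    Y : (n, m) regression targets standing in for the training loss,
    P : (d, d) projection matrix for this layer (see Eqn. (7)/(8)).
    """
    W = W.clone().requires_grad_(True)
    loss = ((X @ W - Y) ** 2).mean()        # stand-in for L_train
    (G,) = torch.autograd.grad(loss, W)     # ordinary gradient
    with torch.no_grad():
        return (W - lr * (P @ G)).detach()  # Eqn. (6): W <- W - eta * P G

# Usage with a placeholder projection (identity reduces this to a plain SGD step).
d, m, n = 8, 4, 16
W_new = di_update(torch.randn(d, m), torch.randn(n, d), torch.randn(n, m), torch.eye(d))
```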

The update of projection matrices

In order to project the gradient of each shared layer so as to avoid large changes in its output, we ensure that the projected gradient is orthogonal to the previously seen inputs. Specifically, for any layer $l$ of $\alpha$, let $X \in \mathbb{R}^{n \times d}$ denote the input feature maps of all previously trained architectures, with a total of $n$ feature maps (whose dimension is $d$). We compute the orthogonal projection matrix $P^{l}$ by:

$$P^{l} = I - X^{\top} \big( X X^{\top} + \lambda I \big)^{-1} X, \tag{7}$$

where $\lambda$ is some regularization constant. The detailed derivations will be provided in an extended version. Based on Eqn. (7), our method is able to project the gradient to be orthogonal to the input, which helps to avoid a large change in the output of the shared layers (see Theorem 4.1 for more discussion).
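For reference, the batch form of Eqn. (7) can be sketched as follows, where X stacks the n recorded d-dimensional input feature maps row-wise (variable names are ours).

```python
import torch

def projection_matrix(X, lam=1e-3):
    """Compute P = I - X^T (X X^T + lam I)^{-1} X as in Eqn. (7).
    X : (n, d) matrix stacking the recorded input feature maps row-wise."""
    n, d = X.shape
    inv = torch.linalg.inv(X @ X.T + lam * torch.eye(n))
    return torch.eye(d) - X.T @ inv @ X

X = torch.randn(5, 8)
P = projection_matrix(X, lam=1e-6)
print(torch.norm(X @ P).item())  # nearly zero: directions kept by P barely change X W
```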

A potential issue of Eqn. (7) is that the computation of $P^{l}$ requires all previous input feature maps (as described in Section 3.3). However, such a manner may be highly storage-consuming as the number of input feature maps increases. To handle this, we can update the projection matrix in an iterative manner [49, 50, 52]. For each coming sample with input feature $x$, based on the Woodbury identity [19], we update $P^{l}$ by:

$$P^{l} \leftarrow P^{l} - \frac{P^{l} x x^{\top} P^{l}}{\lambda + x^{\top} P^{l} x}. \tag{8}$$

Based on Eqn. (8), we update $P^{l}$ iteratively, and each iteration only requires the input feature map of a single sample. By doing so, we do not need to store all previous input feature maps, thus avoiding the high storage cost. Note that the update of $P^{l}$ also means the update of the corresponding projection matrix in $\mathcal{P}$.
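The recursive form of Eqn. (8) can be sketched as below. Starting from P = I and feeding the recorded feature vectors one at a time reproduces the batch formula of Eqn. (7) up to numerical error, so no feature maps need to be stored (again a sketch with our own naming).

```python
import torch

def update_projection(P, x, lam=1e-3):
    """One recursive update of P for a single feature vector x (Eqn. (8)):
       P <- P - (P x x^T P) / (lam + x^T P x)."""
    x = x.reshape(-1, 1)                  # treat the feature as a column vector
    Px = P @ x
    return P - (Px @ Px.T) / (lam + x.T @ Px)

# Sanity check: feeding the rows of X one by one matches the batch form of Eqn. (7).
torch.manual_seed(0)
n, d, lam = 5, 8, 1e-3
X = torch.randn(n, d)
P_iter = torch.eye(d)
for i in range(n):
    P_iter = update_projection(P_iter, X[i], lam)
P_batch = torch.eye(d) - X.T @ torch.linalg.inv(X @ X.T + lam * torch.eye(n)) @ X
print(torch.allclose(P_iter, P_batch, atol=1e-4))   # True (up to numerical error)
```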

4.2 Disturbance-immune Neural Architecture Search

The proposed disturbance-immune update strategy helps to alleviate the performance disturbance issue in WS and provides more stable performance estimations for candidate architectures. As a result, we can obtain more accurate reward signals for learning a good controller. Based on the disturbance-immune update strategy, we further propose a new disturbance-immune training scheme for neural architecture search (named DI-NAS). The overall training scheme is provided in Algorithm 1.

The main difference between the proposed DI-NAS and standard WS-based NAS methods lies in the training of the supernet. Specifically, we train the supernet via the proposed disturbance-immune update strategy, which alleviates the performance disturbance issue during the training process. To be specific, we maintain an orthogonal projection matrix set $\mathcal{P}$ for the supernet (see line 2 in Algorithm 1). Each parameter matrix in the supernet has a corresponding projection matrix $P \in \mathcal{P}$, initialized as the identity matrix $I$. Based on the orthogonal projection matrices, we update the parameters of each sampled architecture by orthogonal gradient descent (see lines 9-10 in Algorithm 1).

0:  Training data $\mathcal{D}_{\text{train}}$, validation data $\mathcal{D}_{\text{val}}$, learning rate $\eta$, parameters $M$, $K$.
1:  Initialize supernet parameters $W$ and controller parameters $\theta$.
2:  Construct and initialize a projection matrix set $\mathcal{P}$ for the supernet.
3:  while not convergent do
4:     // Update $W$ by minimizing the training loss
5:     for $i = 1, \ldots, M$ do
6:         Sample $\alpha \sim \pi(\alpha; \theta)$.   // $\pi(\alpha; \theta)$ is the policy of the controller
7:         Sample a batch of data from $\mathcal{D}_{\text{train}}$.
8:         // Disturbance-immune update for $w_{\alpha}$ and $\mathcal{P}_{\alpha}$
9:         Update $w_{\alpha}$ using Eqn. (6).   // $w_{\alpha}$ denotes the parameters of $\alpha$
10:        Update $\mathcal{P}_{\alpha}$ using Eqn. (8).   // $\mathcal{P}_{\alpha}$ denotes the projection matrices of $\alpha$
11:     end for
12:     // Update $\theta$ by maximizing the reward
13:     for $j = 1, \ldots, K$ do
14:         Sample $\alpha \sim \pi(\alpha; \theta)$.
15:         Sample a batch of data from $\mathcal{D}_{\text{val}}$.
16:         Update $\theta$ via policy gradient to maximize $\mathbb{E}_{\alpha \sim \pi(\alpha;\theta)}[\mathcal{R}(\alpha, w_{\alpha})]$.
17:     end for
18:  end while
Algorithm 1 Overall training scheme of DI-NAS
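To make the overall procedure concrete, a highly simplified Python sketch of Algorithm 1 is given below. The interfaces (controller.sample, supernet.loss, supernet.reward, supernet.layer_params, supernet.layer_inputs) and all variable names are placeholders of our own, not the released implementation; layer weights are treated as plain matrices for readability.

```python
import torch

def train_di_nas(supernet, controller, train_loader, val_loader,
                 projections, eta_w, eta_theta, M, K, lam):
    """Simplified sketch of Algorithm 1 (hypothetical interfaces, not official code).

    controller.sample()            -> (architecture alpha, log-probability of alpha)
    supernet.loss(alpha, batch)    -> training loss of alpha on a batch
    supernet.reward(alpha, batch)  -> validation reward of alpha (e.g., accuracy)
    supernet.layer_params(alpha)   -> list of (layer_id, weight matrix W) pairs of alpha
    supernet.layer_inputs(alpha, layer_id) -> input feature vectors seen by that layer
    projections                    -> dict: layer_id -> projection matrix P (init: identity)
    """
    theta_opt = torch.optim.Adam(controller.parameters(), lr=eta_theta)
    while True:  # until convergence (termination criterion omitted)
        # ---- Update supernet parameters W by minimizing the training loss ----
        for _ in range(M):
            alpha, _ = controller.sample()
            batch = next(iter(train_loader))
            loss = supernet.loss(alpha, batch)
            layers = supernet.layer_params(alpha)
            grads = torch.autograd.grad(loss, [W for _, W in layers])
            with torch.no_grad():
                for (layer_id, W), G in zip(layers, grads):
                    P = projections[layer_id]
                    W -= eta_w * (P @ G)               # Eqn. (6): projected gradient step
                    for x in supernet.layer_inputs(alpha, layer_id):
                        Px = P @ x.reshape(-1, 1)      # Eqn. (8): recursive update of P
                        P = P - (Px @ Px.T) / (lam + x.reshape(1, -1) @ Px)
                    projections[layer_id] = P
        # ---- Update controller parameters theta by maximizing the reward ----
        for _ in range(K):
            alpha, log_prob = controller.sample()
            batch = next(iter(val_loader))
            reward = supernet.reward(alpha, batch)
            theta_opt.zero_grad()
            (-reward * log_prob).backward()            # REINFORCE-style update
            theta_opt.step()
```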

4.3 Theoretical Analysis

In this section, we theoretically analyze the proposed method regarding its effectiveness and convergence. To begin with, we analyze the effectiveness of our proposed method in alleviating the issue of performance disturbance as follows.

Theorem 4.1

Given a model with parameters $W$ and any input matrix $X$ whose rows are the recorded input feature maps, let $P$ be the projection matrix computed by Eqn. (7) and $G$ be the gradient w.r.t. $W$. For the update of $W$ in the direction $\Delta W = -\eta P G$ with $\eta > 0$, let $D(X) = \| X (W + \Delta W) - X W \|_F$ denote the output change between the original model and the updated model. When $\lambda \le \sigma_{\min}^{2}(X)$, the following inequality holds:

$$D(X) \;\le\; \frac{\eta \lambda \, \sigma_{\min}(X)}{\sigma_{\min}^{2}(X) + \lambda} \, \| G \|_F, \tag{9}$$

where $\sigma_{\min}(X)$ denotes the smallest nonzero singular value of $X$ and $\lambda$ is the regularization constant in Eqn. (7).

Theorem 4.1 indicates that, for any recorded input feature map, the distance between the original output and the output after the update is controlled by the regularization factor $\lambda$. Specifically, the smaller $\lambda$ is, the smaller the upper bound on this distance. Therefore, our proposed orthogonal gradient descent method is able to (approximately) satisfy the constraint in Eqn. (5). In other words, our method avoids large changes to the previous outputs of the shared layers when training the supernet, thus alleviating the issue of performance disturbance in WS.
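The behavior described by Theorem 4.1 can be checked numerically. The toy sketch below (our own setup, not one of the paper's experiments) applies a projected update for several values of lambda and measures how much the outputs on the recorded inputs X change; an unprojected step would change them by an O(1) amount.

```python
import torch

torch.manual_seed(0)
n, d, m, eta = 16, 32, 8, 0.5
X = torch.randn(n, d)          # recorded input feature maps of previous architectures
W = torch.randn(d, m)          # shared layer parameters
G = torch.randn(d, m)          # a gradient direction coming from a new architecture

for lam in (1e-1, 1e-3, 1e-5):
    P = torch.eye(d) - X.T @ torch.linalg.inv(X @ X.T + lam * torch.eye(n)) @ X
    W_new = W - eta * (P @ G)                        # projected update (Eqn. (6))
    change = torch.norm(X @ W_new - X @ W, p='fro')  # output change on the recorded inputs
    print(f"lambda = {lam:.0e}  ->  ||X W_new - X W||_F = {change.item():.2e}")
# The output change shrinks as lambda decreases, in line with Theorem 4.1.
```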

We further provide an asymptotic analysis regarding the convergence of the proposed update of the supernet parameters as follows.

Theorem 4.2

Let the loss function $\mathcal{L}$ be $\beta$-smooth and convex w.r.t. the model parameters $w$. Let $w^{*}$ and $w_{0}$ be the optimal and the initial solution of $\mathcal{L}$, respectively. By setting the learning rate $\eta = 1/\beta$, at the $T$-th update step, the proposed update strategy satisfies:

(10)

This theorem illustrates that our proposed update strategy enjoys a sublinear convergence rate of $\mathcal{O}(1/T)$, which guarantees the effectiveness of the proposed method.

5 Experimental Results

We evaluate the proposed DI-NAS in two main aspects: (1) the superiority of the searched architecture by DI-NAS on CIFAR-10 and ImageNet, respectively; (2) the effectiveness of our proposed disturbance-immune update strategy. The source code will be publicly available.

5.1 Evaluation on CIFAR-10

In this section, we evaluate the proposed DI-NAS method on CIFAR-10 [23]. To be specific, we first use the proposed method to train a controller on CIFAR-10, and use it to search for a convolutional neural architecture. By comparing the searched architecture with other state-of-the-art architectures, we can evaluate the effectiveness of our method. To this end, we first describe the search space, training details and evaluation details.

Search space. Following the settings in DARTS [28], we aim to search for two types of convolutional cells, namely the normal cell and the reduction cell. Each cell contains 7 nodes, including 2 input nodes, 4 intermediate nodes and 1 output node. Between any two nodes, there are 8 candidate operations, including 3×3 and 5×5 depthwise separable convolutions, 3×3 and 5×5 dilated separable convolutions, 3×3 max pooling, 3×3 average pooling, identity and none. After obtaining the convolutional cells, we stack them to build the final convolutional network.

Training details. In the search phase, we divide the standard training set of CIFAR-10 into two parts. Specifically, we randomly select 40% of the training set to train sampled architectures (as training data) and use the rest 60% to learn controllers (as validation data). Moreover, we train DI-NAS for epochs in total. We first train the supernet without learning the controller, and start training the controller from epoch 90. In addition, we set by default, where the sensitivity analysis can be found in Section 5.4. For training the supernet, we use an SGD optimizer with a weight decay of and a momentum of . The learning rate is set to . For training the controller, we use ADAM with a learning rate of and a weight decay of . We add the controller’s sample entropy to the reward, which is weighted by .

Evaluation details. In the evaluation phase, we first use the learned controller to search for a normal cell and a reduction cell. Then, we construct the final convolutional network with 17 normal cells and 2 reduction cells. Following [34], we put the two reduction cells at the and depth of the network, respectively. The initial number of the channels is set to 43. Following DARTS [28], we train the convolutional network for epochs with a batch size of . We apply an SGD optimizer with a weight decay of and a momentum of . Moreover, we set the initial learning rate as and use the cosine annealing strategy [30] to adjust it. We also use the cutout scheme [10] with length 16 for data augmentation.

Architecture                  Test Accuracy (%)   # Params (M)   Search Cost (GPU days)
DenseNet-BC [21]              96.54               25.6           -
PyramidNet-BC [16]            96.69               26.0           -
Random search baseline        96.71 ± 0.15        3.2            -
NASNet-A + cutout [56]        97.35               3.3            1,800
NASNet-B [56]                 96.27               2.6            1,800
NASNet-C [56]                 96.41               3.1            1,800
AmoebaNet-A + cutout [36]     96.66 ± 0.06        3.2            3,150
AmoebaNet-B + cutout [36]     96.63 ± 0.04        2.8            3,150
Hierarchical Evo [27]         96.25 ± 0.12        15.7           300
SNAS [43]                     97.02               2.9            1.5
GHN [45]                      97.16 ± 0.07        5.7            0.8
ENAS + cutout [35]            97.11               4.6            0.5
DARTS + cutout [28]           97.24 ± 0.09        3.4            4
NAT-DARTS [15]                97.28               2.7            -
NAONet [33]                   97.02               28.6           200
NAONet-WS [33]                96.47               2.5            0.3
DI-NAS + cutout               97.38 ± 0.04        3.7            1.5
Table 1: Comparisons with state-of-the-art NAS methods on CIFAR-10. "-" means unavailable results.
(a) Normal cell.
(b) Reduction cell.
Figure 2: The searched convolutional cells by DI-NAS on the CIFAR-10 dataset.

Comparison with state-of-the-art methods. We show the searched normal and reduction cells in Fig. 2 and report the detailed results of all methods in Table 1. For our searched architecture, we run 5 experiments with different random initializations and report the average performance with the standard deviation. Experimental results show that the architecture searched by our DI-NAS achieves 97.38% accuracy, outperforming all the other state-of-the-art architectures, including human-designed ones and NAS-searched ones. Note that ENAS, NAONet-WS, DARTS, and NAT-DARTS also use the weight sharing (WS) scheme. This result demonstrates that it is necessary to alleviate the issue of performance disturbance (PD) in WS-based NAS, which helps to provide more accurate rewards for controller learning and to search for better neural architectures.

5.2 Evaluation on ImageNet

To verify the generalization of the convolutional cells searched on CIFAR-10 (as shown in Fig. 2), we further evaluate them on a large-scale image classification dataset, namely ImageNet [9]. To begin with, we describe the evaluation details.

Evaluation details. For the ImageNet dataset, we construct the convolutional network with 12 normal cells and 2 reduction cells. We put the two reduction cells at the and depth of the network, respectively. We set the number of the initial channels to 48. Following [28], we train the network by an SGD optimizer with epochs and use a weight decay of and a momentum of . We initialize the learning rate as and decrease it by the cosine annealing [30]. We follow the ImageNet mobile setting [28], where the size of input images is set to and the number of multiply-adds (Madds) is less than 600M.

Architecture           Top-1 Acc. (%)   Top-5 Acc. (%)   # Params (M)   # MAdds (M)   Search Cost (GPU days)
ResNet-18 [17]         69.8             89.1             11.7           1,814         -
Inception-v1 [40]      69.8             89.9             6.6            1,448         -
MobileNet v1 [20]      70.6             89.5             4.2            569           -
ShuffleNet v1 [46]     70.9             89.2             5.0            524           -
NASNet-A [56]          74.0             91.6             5.3            564           1,800
NASNet-B [56]          72.8             91.3             5.3            488           1,800
NASNet-C [56]          72.5             91.0             4.9            558           1,800
AmoebaNet-A [36]       74.5             92.0             5.1            555           3,150
AmoebaNet-B [36]       74.0             91.5             5.3            555           3,150
GHN [45]               73.0             91.3             6.1            569           0.8
PNAS [26]              74.2             91.9             5.1            588           255
BayesNAS [53]          73.5             91.1             3.9            -             0.2
DARTS [28]             73.1             91.0             4.9            595           4
NAT-DARTS [15]         73.7             91.4             4.0            441           -
SNAS [43]              72.7             90.8             4.3            522           1.5
DI-NAS                 74.7             92.1             5.2            587           1.5
Table 2: Comparison results on ImageNet, where we use the evaluation code of DARTS [28]. "-" means unavailable results.

Comparison with state-of-the-art methods. We compare our searched architecture with several state-of-the-art models on ImageNet. As shown in Table 2, our architecture achieves 74.7% top-1 accuracy and 92.1% top-5 accuracy. To be specific, our architecture outperforms human-designed architectures (e.g., ResNet-18 [17]) by about 5%, and outperforms most of the NAS-searched models in terms of top-1 accuracy. Moreover, our architecture achieves excellent performance using only 1.5 GPU days for the search, while AmoebaNet [36] and NASNet [56] take 3,150 and 1,800 GPU days, respectively. These results demonstrate the generalization ability of the searched convolutional cells and the effectiveness/efficiency of the proposed method.

5.3 Effectiveness of Disturbance-immune Update Strategy

In the previous experiments, we have demonstrated the superiority of the proposed method. One important reason for this superiority is its ability to alleviate performance disturbance (PD). In this section, we further verify the effectiveness of the proposed update strategy in dealing with PD. To this end, we use the following two metrics to measure the degree of PD and use them to evaluate the proposed method.

Metrics for PD. (1) Performance change: for any architecture $\alpha$ inherited from the supernet, the performance change is defined as $|\mathrm{Acc}_{t+\Delta}(\alpha) - \mathrm{Acc}_{t}(\alpha)|$, where $\mathrm{Acc}_{t}(\alpha)$ and $\mathrm{Acc}_{t+\Delta}(\alpha)$ denote the validation accuracy of $\alpha$ at the $t$-th and $(t{+}\Delta)$-th training epochs of the supernet, respectively, and $\Delta$ indicates the epoch interval. Overall, a larger performance change indicates more severe performance disturbance. (2) Kendall's Tau (KTau): we use KTau [39] to measure the correlation of performance ranks between two architecture sets. The range of KTau is $[-1, 1]$, where a larger KTau means that the performance ranks of the two architecture sets are more consistent.
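As an illustration, both metrics can be computed as in the sketch below, using scipy.stats.kendalltau; the accuracy arrays here are synthetic stand-ins for the recorded validation accuracies.

```python
import numpy as np
from scipy.stats import kendalltau

# Synthetic validation accuracies of 64 sampled architectures, recorded at
# supernet epoch t and at epoch t + delta (delta = 13 in our experiments).
rng = np.random.default_rng(0)
acc_t = rng.uniform(0.6, 0.9, size=64)
acc_t_delta = acc_t + rng.normal(0.0, 0.02, size=64)   # disturbed accuracies

# (1) Average absolute performance change over the 64 architectures.
perf_change = np.mean(np.abs(acc_t_delta - acc_t))

# (2) Kendall's Tau between the two performance rankings (in [-1, 1]).
ktau, _ = kendalltau(acc_t, acc_t_delta)

print(f"avg. performance change: {perf_change:.4f}, KTau: {ktau:.3f}")
```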

(a) absolute performance change
(b) KTau at different training epochs
Figure 3: Comparisons between our proposed disturbance-immune weight sharing (WS) scheme and the standard WS scheme. The larger the performance change, the more severe the PD; the higher the KTau, the more accurate the performance estimation.

Evaluation in terms of performance change. To evaluate our method, we randomly sample 64 architectures from the supernet. We record their performance at different training epochs of the supernet, and report the average performance change of these 64 architectures with an epoch interval of 13. Fig. 3(a) shows that our disturbance-immune WS scheme is able to reduce the performance change of architectures and thus alleviate the PD issue.

Evaluation in terms of KTau. Based on the above architectures, we compute the KTau between their current performance rank and the rank after 13 epochs (i.e., the epoch interval is 13). Fig. 3(b) verifies the effectiveness of our method in improving KTau. Note that a higher KTau indicates a more consistent performance estimation during the training process, which means more stable/accurate rewards for controller learning. Moreover, we compute the ground-truth KTau (GT-Tau) [38], which measures the performance correlation between a set of architectures evaluated with weights inherited from the supernet and the same architectures trained exactly from scratch. Specifically, our method achieves a higher GT-Tau (0.48) than the standard WS (0.16).

5.4 More Discussions

We further discuss PD under different numbers of shared layers as well as the parameter sensitivity through additional experiments. We provide our main observations and analyses as follows.

Number of shared layers. We find that as the number of shared layers between two architectures increases, the issue of PD becomes more severe. Since there often exist many shared layers in WS-based NAS, this result demonstrates that it is necessary to alleviate the issue of PD for NAS.

Regularization parameter $\lambda$. Generally, the optimal value of $\lambda$ varies across different data. A large $\lambda$ may make the method fail to satisfy the constraint of the problem (see Eqn. (5)), while a small $\lambda$ may make the matrix $X X^{\top} + \lambda I$ in Eqn. (7) nearly singular and thus hard to invert when computing the projection matrices. Nevertheless, the default setting of $\lambda$ achieves the best or relatively good performance in most cases.

6 Conclusions

In this paper, we have proposed a novel disturbance-immune training scheme for NAS to overcome the performance disturbance (PD) issue in weight sharing (WS). Specifically, by developing a new update strategy to train sampled architectures, our method provides more stable/accurate performance estimation for architectures. As a result, the proposed method is able to learn a good controller for searching good architectures. We theoretically and empirically verify the effectiveness of the proposed method in alleviating PD, and also provide an asymptotic analysis of its convergence. Extensive experiments demonstrate that the architecture found by our method outperforms the architectures obtained by the considered state-of-the-art NAS methods.

References

  1. G. Adam and J. Lorraine. Understanding neural architecture search techniques. arXiv, 2019.
  2. B. Baker, O. Gupta, N. Naik, and R. Raskar. Designing neural network architectures using reinforcement learning. In International Conference on Learning Representations, 2017.
  3. G. Bender, P.-J. Kindermans, B. Zoph, V. Vasudevan, and Q. Le. Understanding and simplifying one-shot architecture search. In International Conference on Machine Learning, pages 549–558, 2018.
  4. H. Cai, T. Chen, W. Zhang, Y. Yu, and J. Wang. Efficient architecture search by network transformation. In AAAI Conference on Artificial Intelligence, pages 2787–2794, 2018.
  5. H. Cai, L. Zhu, and S. Han. Proxylessnas: Direct neural architecture search on target task and hardware. In International Conference on Learning Representations, 2019.
  6. J. Cao, L. Mo, Y. Zhang, et al. Multi-marginal wasserstein gan. In Advances in Neural Information Processing Systems, pages 1774–1784, 2019.
  7. Y. Chen, T. Yang, X. Zhang, G. Meng, C. Pan, and J. Sun. Detnas: Neural architecture search on object detection. arXiv, 2019.
  8. X. Chu, B. Zhang, R. Xu, and J. Li. Fairnas: Rethinking evaluation fairness of weight sharing neural architecture search. arXiv, 2019.
  9. J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. Imagenet: A large-scale hierarchical image database. In IEEE Conference on Computer Vision and Pattern Recognition, pages 248–255, 2009.
  10. T. DeVries and G. W. Taylor. Improved regularization of convolutional neural networks with cutout. arXiv, 2017.
  11. M. Farajtabar, N. Azizan, A. Mott, and A. Li. Orthogonal gradient descent for continual learning. arXiv, 2019.
  12. C. Finn, P. Abbeel, and S. Levine. Model-agnostic meta-learning for fast adaptation of deep networks. In International Conference on Machine Learning, pages 1126–1135, 2017.
  13. G. Ghiasi, T.-Y. Lin, and Q. V. Le. Nas-fpn: Learning scalable feature pyramid architecture for object detection. In IEEE Conference on Computer Vision and Pattern Recognition, pages 7036–7045, 2019.
  14. Y. Guo, Q. Wu, C. Deng, J. Chen, and M. Tan. Double forward propagation for memorized batch normalization. In Thirty-Second AAAI Conference on Artificial Intelligence, 2018.
  15. Y. Guo, Y. Zheng, M. Tan, Q. Chen, J. Chen, P. Zhao, and J. Huang. NAT: neural architecture transformer for accurate and compact architectures. In Advances in Neural Information Processing Systems, pages 735–747, 2019.
  16. D. Han, J. Kim, and J. Kim. Deep pyramidal residual networks. In IEEE Conference on Computer Vision and Pattern Recognition, pages 6307–6315, 2017.
  17. K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.
  18. X. He and H. Jaeger. Overcoming catastrophic interference by conceptors. arXiv, 2017.
  19. R. A. Horn and C. R. Johnson. Matrix analysis. Cambridge university press, 2012.
  20. A. G. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang, T. Weyand, M. Andreetto, and H. Adam. Mobilenets: Efficient convolutional neural networks for mobile vision applications. arXiv, 2017.
  21. G. Huang, Z. Liu, L. Van Der Maaten, and K. Q. Weinberger. Densely connected convolutional networks. In IEEE Conference on Computer Vision and Pattern Recognition, pages 2261–2269, 2017.
  22. J. Kirkpatrick, R. Pascanu, N. Rabinowitz, J. Veness, G. Desjardins, A. A. Rusu, K. Milan, J. Quan, T. Ramalho, A. Grabska-Barwinska, et al. Overcoming catastrophic forgetting in neural networks. National Academy of Sciences, pages 3521–3526, 2017.
  23. A. Krizhevsky, G. Hinton, et al. Learning multiple layers of features from tiny images. 2009.
  24. L. Li and A. Talwalkar. Random search and reproducibility for neural architecture search. In Conference on Uncertainty in Artificial Intelligence, 2019.
  25. C. Liu, L.-C. Chen, F. Schroff, H. Adam, W. Hua, A. L. Yuille, and L. Fei-Fei. Auto-deeplab: Hierarchical neural architecture search for semantic image segmentation. In IEEE Conference on Computer Vision and Pattern Recognition, pages 82–92, 2019.
  26. C. Liu, B. Zoph, M. Neumann, J. Shlens, W. Hua, L.-J. Li, L. Fei-Fei, A. Yuille, J. Huang, and K. Murphy. Progressive neural architecture search. In European Conference on Computer Vision, pages 19–34, 2018.
  27. H. Liu, K. Simonyan, O. Vinyals, C. Fernando, and K. Kavukcuoglu. Hierarchical representations for efficient architecture search. In International Conference on Learning Representations, 2017.
  28. H. Liu, K. Simonyan, and Y. Yang. Darts: Differentiable architecture search. In International Conference on Learning Representations, 2019.
  29. W. Liu, Y. Wen, Z. Yu, M. Li, B. Raj, and L. Song. Sphereface: Deep hypersphere embedding for face recognition. In IEEE Conference on Computer Vision and Pattern Recognition, pages 6738–6746, 2017.
  30. I. Loshchilov and F. Hutter. SGDR: stochastic gradient descent with warm restarts. In International Conference on Learning Representations, 2017.
  31. J.-H. Luo, J. Wu, and W. Lin. Thinet: A filter level pruning method for deep neural network compression. In IEEE International Conference on Computer Vision, pages 5058–5066, 2017.
  32. R. Luo, T. Qin, and E. Chen. Understanding and improving one-shot neural architecture optimization. arXiv, 2019.
  33. R. Luo, F. Tian, T. Qin, E. Chen, and T.-Y. Liu. Neural architecture optimization. In Advances in Neural Information Processing Systems, pages 7816–7827, 2018.
  34. N. Nayman, A. Noy, T. Ridnik, I. Friedman, R. Jin, and L. Zelnik. Xnas: Neural architecture search with expert advice. In Advances in Neural Information Processing Systems, pages 1975–1985, 2019.
  35. H. Pham, M. Y. Guan, B. Zoph, Q. V. Le, and J. Dean. Efficient neural architecture search via parameter sharing. In International Conference on Machine Learning, pages 4092–4101, 2018.
  36. E. Real, A. Aggarwal, Y. Huang, and Q. V. Le. Regularized evolution for image classifier architecture search. In AAAI Conference on Artificial Intelligence, pages 4780–4789, 2019.
  37. F. Schroff, D. Kalenichenko, and J. Philbin. Facenet: A Unified Embedding for Face Recognition and Clustering. In IEEE Conference on Computer Vision and Pattern Recognition, pages 815–823, 2015.
  38. C. Sciuto, K. Yu, M. Jaggi, C. Musat, and M. Salzmann. Evaluating the search phase of neural architecture search. In International Conference on Learning Representations, 2019.
  39. P. K. Sen. Estimates of the regression coefficient based on kendall’s tau. Journal of American Statistical Association, 63(324):1379–1389, 1968.
  40. C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich. Going deeper with convolutions. In IEEE Conference on Computer Vision and Pattern Recognition, pages 1–9, 2015.
  41. M. Tan, B. Chen, R. Pang, V. Vasudevan, M. Sandler, A. Howard, and Q. V. Le. Mnasnet: Platform-aware neural architecture search for mobile. In IEEE Conference on Computer Vision and Pattern Recognition, pages 2820–2828, 2019.
  42. B. Wu, X. Dai, P. Zhang, Y. Wang, F. Sun, Y. Wu, Y. Tian, P. Vajda, Y. Jia, and K. Keutzer. Fbnet: Hardware-aware efficient convnet design via differentiable neural architecture search. In IEEE Conference on Computer Vision and Pattern Recognition, pages 10734–10742, 2019.
  43. S. Xie, H. Zheng, C. Liu, and L. Lin. Snas: stochastic neural architecture search. In International Conference on Learning Representations, 2019.
  44. G. Zeng, Y. Chen, B. Cui, and S. Yu. Continual learning of context-dependent processing in neural networks. Nature Machine Intelligence, 1(8):364–372, 2019.
  45. C. Zhang, M. Ren, and R. Urtasun. Graph hypernetworks for neural architecture search. In International Conference on Learning Representations, 2018.
  46. X. Zhang, X. Zhou, M. Lin, and J. Sun. Shufflenet: An extremely efficient convolutional neural network for mobile devices. In IEEE Conference on Computer Vision and Pattern Recognition, pages 6848–6856, 2018.
  47. Y. Zhang, H. Chen, Y. Wei, P. Zhao, J. Cao, et al. From whole slide imaging to microscopy: Deep microscopy adaptation network for histopathology cancer image classification. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pages 360–368, 2019.
  48. Y. Zhang, Y. Wei, P. Zhao, S. Niu, et al. Collaborative unsupervised domain adaptation for medical image diagnosis. In Medical Imaging meets NeurIPS, 2019.
  49. Y. Zhang, P. Zhao, J. Cao, W. Ma, et al. Online adaptive asymmetric active learning for budgeted imbalanced data. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pages 2768–2777. ACM, 2018.
  50. Y. Zhang, P. Zhao, S. Niu, Q. Wu, et al. Online adaptive asymmetric active learning with limited budgets. IEEE Transactions on Knowledge and Data Engineering, 2019.
  51. Y. Zhang, P. Zhao, Q. Wu, B. Li, et al. Cost-sensitive portfolio selection via deep reinforcement learning. IEEE Transactions on Knowledge and Data Engineering, 2020.
  52. P. Zhao, Y. Zhang, M. Wu, S. C. Hoi, M. Tan, and J. Huang. Adaptive cost-sensitive online classification. IEEE Transactions on Knowledge and Data Engineering, 31(2):214–228, 2018.
  53. H. Zhou, M. Yang, J. Wang, and W. Pan. Bayesnas: A bayesian approach for neural architecture search. In International Conference on Machine Learning, pages 7603–7613, 2019.
  54. Z. Zhuang, M. Tan, B. Zhuang, J. Liu, Y. Guo, Q. Wu, J. Huang, and J. Zhu. Discrimination-aware channel pruning for deep neural networks. In Advances in Neural Information Processing Systems, pages 875–886, 2018.
  55. B. Zoph and Q. V. Le. Neural architecture search with reinforcement learning. In International Conference on Learning Representations, 2017.
  56. B. Zoph, V. Vasudevan, J. Shlens, and Q. V. Le. Learning transferable architectures for scalable image recognition. In IEEE Conference on Computer Vision and Pattern Recognition, pages 8697–8710, 2018.