Unchain the Search Space with Hierarchical Differentiable Architecture Search


Abstract

Differentiable architecture search (DAS) has made great progress in searching for high-performance architectures with reduced computational cost. However, DAS-based methods mainly focus on searching for a repeatable cell structure, which is then stacked sequentially in multiple stages to form the networks. This configuration significantly reduces the search space, and ignores the importance of connections between the cells. To overcome this limitation, in this paper, we propose a Hierarchical Differentiable Architecture Search (H-DAS) that performs architecture search both at the cell level and at the stage level. Specifically, the cell-level search space is relaxed so that the networks can learn stage-specific cell structures. For the stage-level search, we systematically study the architectures of stages, including the number of cells in each stage and the connections between the cells. Based on insightful observations, we design several search rules and losses, and manage to search for better stage-level architectures. Such a hierarchical search space greatly improves the performance of the networks without introducing expensive search cost. Extensive experiments on CIFAR10 and ImageNet demonstrate the effectiveness of the proposed H-DAS. Moreover, the searched stage-level architectures can be combined with the cell structures searched by existing DAS methods to further boost the performance. Code is available at: https://github.com/MalongTech/research-HDAS


Malong LLC
{gualiu, jaszhong, sheng, mscott, whuang}@malongtech.com

1 Introduction

A large number of neural network architectures have been designed for various computer vision tasks in recent years Krizhevsky et al. (2012); Simonyan and Zisserman (2014); Szegedy et al. (2015); He et al. (2016); Huang et al. (2017), where human experts played an important role. Such manually-designed architectures have proved to be effective, but they depend heavily on human skill and experience. Recently, Neural Architecture Search (NAS) has attracted increasing attention Baker et al. (2016); Zoph and Le (2016); Pham et al. (2018), and achieved state-of-the-art performance in various computer vision tasks.

Figure 1: DARTS-series methods search for a repeatable cell structure, which is stacked sequentially in multiple stages. The proposed H-DAS enables a stage-specific search of both cell-level and stage-level structures, allowing for a diversity of cell structures, cell distributions and cell connections over different stages. This results in a significantly larger search space and a more meaningful architecture search, which improves the performance.

Reinforcement learning Zoph and Le (2016); Pham et al. (2018) and evolutionary algorithms Real et al. (2019, 2017) have been introduced to NAS, due to the discrete nature of the architecture space. However, these methods usually require up to thousands of GPU days Zoph and Le (2016). A few methods have been developed to reduce the computational cost, such as Pham et al. (2018); Cai et al. (2018); Xie et al. (2018); Dong and Yang (2019); You et al. (2020); Zhong et al. (2020). Among them, differentiable architecture search Liu et al. (2018b) attempted to approximate the discrete search space with a continuous one, where gradient descent can be used to optimize the architectures and model parameters jointly. This line of search approaches, referred to as DARTS-series methods Chen et al. (2019); Xu et al. (2019); Chen and Hsieh (2020), has made significant improvements in search speed, while maintaining comparable performance.

However, these DARTS-series methods have two major limitations in terms of search space. First, they commonly perform a cell-level search and adopt the same searched cell structure repeatedly for multiple stages (separated by the reduction cells), which may make the cell structure sub-optimal within the stages, since the optimal cell structures (including the connections and kernel sizes) at different stages can be significantly different. For example, in the architectures searched by Zoph and Le (2016); Cai et al. (2018), the operations in shallow layers are mainly convolutions with small kernels, while many larger kernels appear in the deeper layers. Second, previous DARTS-series methods mainly focus on searching for a repeatable cell structure, which is stacked sequentially to form networks with three stages. This configuration assumes a simple chain-like structure at the stage level, which reduces the search space considerably, and ignores the importance of stage-level connections and structures. As revealed by Yang et al. (2019), the overall stage-level connections can considerably impact the final performance of the networks.

In this work, we redesign the search space of DARTS-series methods, and propose a Hierarchical Differentiable Architecture Search (H-DAS) that enables the search of both cell-level (micro-architecture) and stage-level structures (macro-architecture) (Figure 1). H-DAS significantly increases the search space compared to previous methods. Specifically, for the micro-architecture, we relax the cell-level search space so that the networks can learn optimized cell structures at different stages. For searching the macro-architecture, we model each stage as a Directed Acyclic Graph (DAG), where each cell is a node of the DAG.

However, naively searching for the macro-architectures inevitably introduces a large number of additional parameters with corresponding computational overhead. More importantly, directly applying the method of cell search to searching the stages can lead to performance degradation, such as flattened stage structures. To address these issues, we carefully design three search rules and a depth loss, which allow us to systematically study the architectures at the stage level. First, we propose a novel yet simple method to search for the distribution of cells over different stages, under a constraint on computational complexity. This allows for a better optimization of the numbers of cells, which has not been investigated in previous DARTS-series methods, where all stages are manually set to have the same number of cells. Second, with the optimized cell distribution computed in the previous step, we then focus on the search of the stage-level architecture. To the best of our knowledge, we are the first to explore stage-level macro-architecture search, by relaxing the topological structures among different stages. We show that the proposed H-DAS considerably improves the performance on image classification. The contributions of this work are summarized as follows:

- We propose a two-level Hierarchical Differentiable Architecture Search (H-DAS) that searches for structures at both the cell level and the stage level. H-DAS includes a cell-level search (Hc-DAS) and a stage-level search (Hs-DAS), for the micro- and macro-architectures, respectively.

- Hc-DAS is able to search for stage-specific cell structures within a greatly enlarged search space, compared to that of DARTS. Hs-DAS includes a number of carefully-designed search rules and losses, which allow it to first search for an optimal distribution of cells over different stages, and then perform the search again to find the optimal structure for each stage.

- We conduct extensive experiments to demonstrate the effectiveness of the proposed H-DAS, which achieves a 2.41% test error on CIFAR10 and a 24.5% top-1 error on ImageNet. Moreover, the proposed stage-level search can be conducted based on the cell structures explored by other DAS methods, which further improves the performance.

Figure 2: The overall pipeline of H-DAS, which includes a cell-level micro-architecture search (Hc-DAS) and a stage-level macro-architecture search (Hs-DAS) (best viewed in color). Hc-DAS searches for the structure of the cell, including operations (nodes) and connections between nodes. It relaxes the cell-level search space to learn stage-specific cell structures over different stages. Hs-DAS searches for the connections and operations between the cells in the stage-level search space.

2 Related Work

Neural architecture search (NAS) has recently been attracting increasing attention, and it can be defined as searching for an optimal operation out of a defined operation set and the best connectivity between the operations, using a Directed Acyclic Graph (DAG) Zela et al. (2020b). The weight-sharing paradigm led to a significant improvement in search efficiency. Differentiable Architecture Search (DARTS) Liu et al. (2018b) relaxed the discrete search space to be continuous, making it possible to search architectures and learn network weights using gradient descent. P-DARTS Chen et al. (2019) focused on bridging the depth gap between the search stage and the evaluation stage. PC-DARTS Xu et al. (2019) performed a more efficient search without compromising the performance, by sampling a small part of the super-network to reduce the redundancy in the network space. In SmoothDARTS Chen and Hsieh (2020), a perturbation-based regularization was proposed to smooth the loss landscape and improve generalizability. FairDARTS Chu et al. (2020b) solved the problem of the aggregation of skip connections. These approaches follow the convention of stacking identical cells to form a chain-like structure.

Our stage-level search is related to that of Liu et al. (2017), where low-level operations are assembled into a high-level motif; however, the concept of stages was not explored, and an evolutionary algorithm was applied to optimize the search, making it much less efficient. In Liang et al. (2019a), a computation reallocation (CR) was developed to search for the stage length for object detection, which inspired the current work, but our approach is more efficient by designing a hierarchical search that performs both cell-level and stage-level search jointly, for a different target task.

3 Methodology

Preliminary.

In this work, we follow the cell-level design of DARTS Liu et al. (2018b). The goal of DARTS is to search for a repeatable cell, which can be stacked to form a convolutional network. Each cell is a directed acyclic graph (DAG) of nodes, where each node can be represented as a network layer. Weighted by the architecture parameters α, each edge (i, j) of the DAG indicates an information flow from node i to node j, and is formulated as:

\bar{o}^{(i,j)}(x_i) = \sum_{o \in \mathcal{O}} \frac{\exp\big(\alpha_o^{(i,j)}\big)}{\sum_{o' \in \mathcal{O}} \exp\big(\alpha_{o'}^{(i,j)}\big)}\, o(x_i)    (1)

where x_i is the feature map at the i-th node, and \mathcal{O} denotes the set of candidate operations. More details, such as the bi-level optimization, can be found in Liu et al. (2018b).
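
To make the relaxation concrete, the following PyTorch-style sketch implements the mixed operation of Eq. (1) on a single edge. It is only an illustration of the mechanism: the candidate set here is a reduced, hypothetical one (the full DARTS space contains 8 operations), and the module definitions are not the exact ones used in this work.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Hypothetical, reduced candidate set; the real DARTS space has 8 operations.
CANDIDATE_OPS = {
    'skip_connect': lambda C: nn.Identity(),
    'avg_pool_3x3': lambda C: nn.AvgPool2d(3, stride=1, padding=1),
    'max_pool_3x3': lambda C: nn.MaxPool2d(3, stride=1, padding=1),
    'conv_3x3':     lambda C: nn.Conv2d(C, C, 3, padding=1, bias=False),
}

class MixedOp(nn.Module):
    """One edge (i, j): a softmax-weighted sum of all candidate operations (Eq. 1)."""
    def __init__(self, channels):
        super().__init__()
        self.ops = nn.ModuleList([build(channels) for build in CANDIDATE_OPS.values()])

    def forward(self, x, alpha_edge):
        # alpha_edge holds one logit per candidate operation of this edge.
        weights = F.softmax(alpha_edge, dim=-1)
        return sum(w * op(x) for w, op in zip(weights, self.ops))

# Toy usage: one edge applied to a 16-channel feature map.
edge = MixedOp(channels=16)
alpha = torch.zeros(len(CANDIDATE_OPS), requires_grad=True)  # learned by gradient descent
out = edge(torch.randn(2, 16, 32, 32), alpha)
print(out.shape)  # torch.Size([2, 16, 32, 32])
```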

Hierarchical search space.

In this work, we redesign the search space of DARTS-series methods, and propose a Hierarchical Differentiable Architecture Search (H-DAS) that enables the search both at the cell level (for the micro-architecture) and at the stage level (for the macro-architecture). As shown in Figure 2, we relax the cell-level search space so that the network can learn stage-specific cell structures for the micro-architecture search. We then model each stage as a DAG for the macro-architecture search, which increases the variety of connections between cells. The two methods are named Hc-DAS (cell level) and Hs-DAS (stage level), respectively.

3.1 Micro-Architecture

To enrich cell structures, we design Hc-DAS, which relaxes the cell-level search space so that the networks can learn more meaningful stage-specific cell structures over different stages. In an extreme case, one can search for a specific structure for each cell, which maximizes the cell-level search space. However, it is not practical to perform NAS with such a large search space, which may make the search process unstable, because the search can be influenced by many factors such as the hyper-parameters and the competition between model weights and architecture parameters during bi-level optimization Liang et al. (2019b). More importantly, it is difficult to set the depth of the networks flexibly for different goals when the network has a unique structure per cell.

Search space relaxation. To relax the cell-level search space while maintaining stability during the search, our Hc-DAS aims to search for stage-specific cell-level structures, where the stages have their own cell structures to capture different levels of semantics, while all cells within a stage share the same searched structure. Similar to DARTS-series methods, we use two reduction cells to divide the spatial resolution by 2 at 1/3 and 2/3 of the total depth of the networks, which separate the entire network into three stages. The cell search involves both connection search and operation search. The goal of Hc-DAS is to find three normal cell structures, each of which is stacked repeatedly to form the corresponding stage. The motivation of this design is that the shallow layers of CNNs often focus on learning low-level image information, such as texture, while the deep layers of CNNs pay more attention to high-level information, such as semantic features. Therefore, the optimal cell structures at different stages should be diverse and play different roles, and our Hc-DAS naturally enriches the searched cell structures and increases the search space of the micro-architecture considerably.
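
A minimal sketch of how this relaxation can be realized is given below: instead of a single set of architecture parameters shared by all normal cells (as in DARTS), one parameter set is kept per stage. The class name, tensor shapes and initialization are illustrative assumptions, not the actual H-DAS implementation.

```python
import torch
import torch.nn as nn

NUM_STAGES = 3   # separated by the two reduction cells
NUM_EDGES = 14   # edges of a DARTS cell with 4 intermediate nodes (2+3+4+5)
NUM_OPS = 8      # size of the cell-level candidate operation set

class CellLevelArchParams(nn.Module):
    """DARTS keeps one alpha shared by all normal cells; Hc-DAS keeps one per stage."""
    def __init__(self, stage_specific=True):
        super().__init__()
        n_sets = NUM_STAGES if stage_specific else 1
        self.alphas_normal = nn.ParameterList(
            [nn.Parameter(1e-3 * torch.randn(NUM_EDGES, NUM_OPS)) for _ in range(n_sets)]
        )
        # The reduction cells keep their own parameters in both cases.
        self.alphas_reduce = nn.Parameter(1e-3 * torch.randn(NUM_EDGES, NUM_OPS))

    def alpha_for_stage(self, stage_idx):
        # All normal cells inside one stage share the same searched structure.
        return self.alphas_normal[stage_idx % len(self.alphas_normal)]

params = CellLevelArchParams(stage_specific=True)
print(params.alpha_for_stage(2).shape)  # torch.Size([14, 8])
```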

3.2 Macro-Architecture

DARTS and its extensions mostly focus on searching for a repeatable cell structure, which is then stacked repeatedly and sequentially in a chain over multiple stages to form the networks. This setting assumes a simple chain-like structure for multiple stages, which significantly reduces the search space, and ignores the diversity in stage-level structures. Similarly, recent NAS methods based on mobile inverted bottleneck Sandler et al. (2018), such as MnasNet Tan et al. (2019), also stack the repeatable MBConvs sequentially to form the networks.

In addition to the search of the cell structure, we find that it is important to build a meaningful high-level macro-architecture by searching for the optimal connections between the cells over different stages. In this work, we introduce a Directed Acyclic Graph (DAG) structure for the stage-level macro-architecture search. As shown in Figure 2, Hs-DAS searches for the connections between the searched (normal) cells in each stage to form a macro-architecture, which allows us to explore the power of stage-level structures. Notably, the searched macro-architectures can vary at different depths of the networks, allowing the networks to learn meaningful cell connections at different stages.

(a) stage-level structure (m = 7)
(b) stage-level structure (m = 4)
(c) stage-level structure (m = 3)
Figure 3: Comparison on the number of candidate input cells, for a stage-level structure with 6 cells. m is the number of preceding candidate cells to which the current cell can connect, i.e., a sliding window of length m.

However, searching for the macro-architecture is non-trivial, and would suffer from several problems. For example, many additional parameters and computational costs would be introduced by performing the search in the new stage-level search space. Moreover, the searched stage-level structures may become very shallow, due to the ease of optimizing shallow networks, which degrades the performance of the networks. In this work, we carefully design three important rules, together with a novel depth loss, to ensure a robust and efficient search of the stage-level structure.

Rule 1: non-parametric connections. We design a set of candidate operations for the stage-level search space based on a key observation: the connections between cells may play a more important role than the operation types. This observation is demonstrated by ablation experiments in the supplementary material (SM), and it was also discussed in Xie et al. (2019). Hence, we define a small set of candidate operations, including avg pooling, max pooling, skip-connect and none, which do not have any learnable parameter, and therefore keep the network capacity similar to that of the conventional sequential structure. The search process can be formulated as in Eq. (1), with the difference that x_i now represents the output of the i-th cell. In this case, the conventional stage-level structure of DARTS can be considered a special case of Hs-DAS, where only a single skip-connect operation is used between the cells.
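
As a sketch, the stage-level candidate set of Rule 1 could be instantiated as follows; the 3x3 pooling kernels follow the naming used in Table 6, while the module construction itself is an assumption of this illustration.

```python
import torch
import torch.nn as nn

class Zero(nn.Module):
    """The 'none' operation: drops the connection by returning zeros."""
    def forward(self, x):
        return torch.zeros_like(x)

# Stage-level candidate operations (Rule 1): all non-parametric, so the searched
# macro-architecture adds no learnable weights beyond the cells themselves.
STAGE_OPS = {
    'avg_pool_3x3': lambda: nn.AvgPool2d(3, stride=1, padding=1),
    'max_pool_3x3': lambda: nn.MaxPool2d(3, stride=1, padding=1),
    'skip_connect': lambda: nn.Identity(),
    'none':         lambda: Zero(),
}

# With only 'skip_connect' kept between consecutive cells, this reduces to the
# conventional chain-like stage structure of DARTS.
ops = nn.ModuleList([ctor() for ctor in STAGE_OPS.values()])
```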

Rule 2: stage output with selective cell aggregation. In the cell-level search, the final output of an entire cell is a depth-wise concatenation of the outputs of all nodes within the cell. In this setting, all nodes can contribute to the output in terms of computation. However, in the stage-level search by Hs-DAS, if the same concatenation strategy is adopted, the channel size of the output of each stage can be significantly increased, which in turn results in a large increase in the number of parameters. To minimize the additional parameters introduced by Hs-DAS, we compute the output of a stage as the concatenated features of the last two cells in the stage. However, this rule introduces another issue: some intermediate cells are not directly connected to any subsequent cells, and hence are not included in the computational graph, e.g., Figure 3 (a). We alleviate this problem by introducing Rule 3 as follows.

Rule 3: constraint on preceding cells. A cell is defined as a dead cell when it is not connected to the stage output by any path in the graph. We empirically found that the performance of the networks is negatively impacted by the number of dead cells, since the complexity and depth of the networks can be largely reduced when there are many dead cells, as shown in Figure 3. To alleviate this problem, we set a constraint on the number of preceding cells to which each cell can connect. Namely, the candidate predecessors of each cell in the stage form a sliding window. For example, a cell can only choose its preceding cells from the three previous cells when we set m = 3 (Figure 3). This search rule can be considered a trade-off between the size of the stage-level search space and the number of active cells in the networks.
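
The sliding-window constraint of Rule 3 can be expressed as a simple helper that enumerates the allowed predecessors of a cell; treating the stage input as a single node 0 is a simplification made for this sketch.

```python
def candidate_predecessors(cell_idx, m):
    """Allowed inputs of cell `cell_idx` under Rule 3 (sliding window of length m).

    Cells inside a stage are numbered from 1; index 0 stands for the stage input
    (a simplification made for this sketch). Only the m immediately preceding
    positions are valid predecessors.
    """
    return list(range(max(0, cell_idx - m), cell_idx))

# With m = 3, cell 5 may only connect to cells 2, 3 and 4.
print(candidate_predecessors(5, m=3))  # [2, 3, 4]
# A small m shrinks the stage-level search space but avoids dead cells (Table 4).
```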

Depth loss. Without any restriction, the cells tend to connect directly to the input nodes, since shallower networks are generally easier to optimize during the search (with respect to the network parameters), but often have lower performance. To alleviate this problem, we introduce a depth loss which takes into account the depth of the networks during the search. Each cell in the stage has a depth number, which indicates the number of intermediate cells between the current cell and the input feature maps. We set the depth number of the input feature maps to be 0. The depth number of a cell is then the weighted sum (weighted by the stage-level architecture parameters β) of the depth numbers of its connected preceding cells, so we can calculate the depth number of each cell in a recursive manner:

D_j = \sum_{i<j} \beta_{i,j}\,\big(D_i + 1\big)    (2)

where β_{i,j} indicates the weight of the stage-level connection between cell i and cell j, D_i is the depth number of cell i, and N is the number of cells in the stage. The depth loss decreases as the depth number D_N of the last cell increases, so minimizing it encourages the networks to go deep, and therefore improves the capability of the networks.
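
The following sketch shows one way to compute the depth numbers and a depth loss in a differentiable manner, assuming softmax-normalized stage-level weights and a sliding window of 3; the exact normalization and loss form used in the paper may differ.

```python
import torch
import torch.nn.functional as F

def depth_numbers(beta_logits, window=3):
    """Differentiable depth number of each cell in a stage (cf. Eq. 2).

    beta_logits[j - 1][i] is the logit of the stage-level connection from the
    i-th candidate predecessor (inside the sliding window) to cell j. The stage
    input has depth 0 and every hop adds 1, weighted by the softmax of beta.
    """
    depths = [torch.tensor(0.0)]                    # depth of the stage input
    for j, logits in enumerate(beta_logits, start=1):
        w = F.softmax(logits, dim=-1)
        preds = depths[max(0, j - window):j]        # candidate predecessors of cell j
        depths.append(sum(w_i * (d_i + 1.0) for w_i, d_i in zip(w, preds)))
    return depths

def depth_loss(beta_logits):
    # Pushing up the depth number of the last cell encourages deeper stages.
    return -depth_numbers(beta_logits)[-1]

# Six cells per stage, each restricted to a window of (at most) 3 predecessors.
betas = [torch.zeros(min(j, 3), requires_grad=True) for j in range(1, 7)]
print(depth_loss(betas))
```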

Search for distribution of cells. With the three search rules and the new depth loss, we can now systematically study our stage-level architecture. First, we search for the number of cells in each stage. Previous DARTS-series methods mainly follow the conventional configuration by manually setting the same number of cells for all stages, and it has not been verified whether such a manual configuration is optimal. We therefore develop a simple method to explore the distribution of cells over different stages. The key idea of this search is to initialize an over-parameterized network (i.e., one containing more cells than necessary), and then remove the less impactful cells in each stage during the search, under a constraint on network capacity or FLOPs. Building on Rule 2, we introduce a parameter γ to encode the importance of the five pairs of adjacent cells that can connect to the output cell, among which only one pair will be selected at the end of the search. The output of a stage can be formulated as:

x_{out} = \sum_{(i,\,i+1)} \frac{\exp\big(\gamma_{i,i+1}\big)}{\sum_{(j,\,j+1)} \exp\big(\gamma_{j,j+1}\big)}\, \big[\, x_i \,;\, x_{i+1} \,\big]    (3)

where γ_{i,i+1} is the weight of the output pair formed by cell i and cell i+1, x_i and x_{i+1} are the output features of these cells, and [· ; ·] denotes concatenation. γ is an architecture parameter that selects the single optimal pair of cells connecting to the final output of each stage, as illustrated in Figure 4.
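
A minimal sketch of the γ-weighted stage output of Eq. (3) is given below; the assumption that the candidate pairs are the last five adjacent pairs of the over-parameterized stage is made purely for illustration.

```python
import torch
import torch.nn.functional as F

def stage_output(cell_feats, gamma_logits):
    """Soft selection of the output-connected pair of adjacent cells (cf. Eq. 3).

    cell_feats: feature maps of the cells of an over-parameterized stage (same
    shape each); gamma_logits: one logit per candidate adjacent pair (i, i+1),
    assumed here to be the last len(gamma_logits) pairs of the stage.
    """
    n, k = len(cell_feats), gamma_logits.numel()
    weights = F.softmax(gamma_logits, dim=-1)
    pairs = [torch.cat([cell_feats[i], cell_feats[i + 1]], dim=1)   # channel concat
             for i in range(n - k - 1, n - 1)]
    return sum(w * p for w, p in zip(weights, pairs))

# Over-parameterized stage with 8 cells and 5 candidate output pairs.
feats = [torch.randn(2, 64, 8, 8) for _ in range(8)]
gamma = torch.zeros(5, requires_grad=True)
print(stage_output(feats, gamma).shape)  # torch.Size([2, 128, 8, 8])
```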

To incorporate γ, Eq. (2) can be reformulated as:

(4)

Crucially, a loss \mathcal{L}_{comp} is adopted to constrain the computational complexity of the whole network, so that the less important cells can be removed during the search. The cells in each stage share the same cell-level structure, and thus the computational complexity is constant for each stage. As a result, the loss with a constraint on the computational complexity can be simplified as follows:

(5)

where N is the number of cells in each stage, and N_min is the minimum number of cells in a stage. The complexity measure can be either multiply-adds (FLOPs) or the number of parameters, depending on the desired constraint; we choose to constrain the FLOPs. In this case, the stages share the same cell structure, and thus the FLOPs of a cell are the same across the three stages. The final loss function is:

\mathcal{L} = \mathcal{L}_{cls} + \lambda_{1}\,\mathcal{L}_{depth} + \lambda_{2}\,\mathcal{L}_{comp}    (6)

where λ1 and λ2 are weighting factors that balance the contributions of the different losses. After training, we choose the pair of cells with the largest γ to connect to the output feature map. By adjusting the weights in Eq. (6), we can explore the number of cells in each stage, and obtain a network with a configurable total number of cells.
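
For illustration, the combined objective of Eq. (6) could be assembled as in the sketch below; the concrete form of the complexity term and the default weights are assumptions, since only the overall structure of the loss is specified above.

```python
import torch

def total_search_loss(cls_loss, depth_losses, expected_cells, cell_flops,
                      lambda_1=1.0, lambda_2=0.1, n_min=4):
    """Overall objective of the cell-distribution search (cf. Eq. 6).

    depth_losses: one depth loss per stage; expected_cells: the (soft) number of
    cells kept per stage; cell_flops: FLOPs of a single cell. The shape of the
    complexity term and the default values of lambda_1, lambda_2 and n_min are
    illustrative assumptions.
    """
    depth_term = sum(depth_losses)
    comp_term = sum(cell_flops * max(n - n_min, 0.0) for n in expected_cells)
    return cls_loss + lambda_1 * depth_term + lambda_2 * comp_term

loss = total_search_loss(cls_loss=torch.tensor(1.2),
                         depth_losses=[torch.tensor(-3.0)] * 3,
                         expected_cells=[6.0, 6.0, 6.0],
                         cell_flops=0.5)   # hypothetical per-cell FLOPs (arbitrary unit)
print(loss)
```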

Notably, the weight λ1 of the depth loss influences the final performance of the searched architecture. The optimal weight can be explored empirically; the corresponding experimental results regarding λ1 can be found in the SM.

(a) Over-parameterized structure and candidate cell pairs
(b) Selection of an output-connected pair of cells
Figure 4: Search for the distribution of cells in each stage by γ.

Search for stage-level structure. Once the number of cells in each stage has been determined, we fix it and then focus on searching for the stage-level architecture. When the search is done, the whole network is constructed in the same manner as DARTS. We could directly derive the network structure from the cell-distribution search, but we empirically found that searching again with a fixed number of cells in each stage improves the performance. This may be explained by the tighter search space, which makes the search easier. With the relaxation of the topological structures among stages, we obtain a hierarchical structure that covers the search of nodes, cells and stages. The stage-level structure searched by Hs-DAS can be combined with cell-level structures searched by existing DAS methods to further boost the performance.
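
Once the stage-level search converges, a discrete structure has to be derived from the continuous weights. The sketch below mirrors the cell-level derivation rule of DARTS (keep the strongest non-'none' incoming edges per cell); the exact post-processing used by Hs-DAS may differ.

```python
import torch

def derive_stage_edges(beta_logits, op_names, k=2, window=3):
    """Discretize a searched stage: for each cell, keep the k strongest incoming
    edges (ignoring 'none') among its sliding-window candidates."""
    keep = [i for i, name in enumerate(op_names) if name != 'none']
    edges = []
    for j, logits in enumerate(beta_logits, start=1):      # logits: [num_candidates, num_ops]
        probs = torch.softmax(logits, dim=-1)[:, keep]
        best_w, best_op = probs.max(dim=1)                  # best non-'none' op per candidate
        top = best_w.topk(min(k, best_w.numel())).indices   # strongest predecessors
        lo = max(0, j - window)                             # window offset (0 = stage input)
        edges.append([(lo + int(p), op_names[keep[int(best_op[p])]]) for p in top])
    return edges

ops = ['avg_pool_3x3', 'max_pool_3x3', 'skip_connect', 'none']
betas = [torch.randn(min(j, 3), len(ops)) for j in range(1, 7)]   # 6 cells, window of 3
print(derive_stage_edges(betas, ops))
```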

Search space complexity. The proposed methods significantly increase the scale of the search space. Specifically, Hc-DAS enlarges the cell-level search space of DARTS, which contains roughly 10^18 architectures, by allowing for stage-specific cells, raising it to roughly the fifth power of the single-cell space (about 10^45). Furthermore, by unchaining the conventional macro-architecture and searching for a specific macro-architecture for each stage, Hs-DAS adds a large stage-level search space for a 20-cell network, even without considering the cell-level search space. In the search of the cell distribution by Hs-DAS, two additional cells are added in each stage, which further enlarges the stage-level search space. Including the cell-level search space, the full search space of a 20-cell Hs-DAS is considerably larger than those of previous DAS methods, and naturally leads to higher performance. The details of the complexity calculation are described in the SM.

Architecture    Test Err. (%)    Params (M)    Search Cost (GPU-days)    Search Method
ResNet He et al. (2016) 4.61 1.7 - manual
DenseNet-BC Huang et al. (2017) 3.46 25.6 - manual
NASNet-A Zoph et al. (2018) 2.65 3.3 2000 RL
AmoebaNet-A Real et al. (2019) 3.34 3.2 3150 evolution
Hierarchical evolution Liu et al. (2017) 3.75 15.7 300 evolution
ENAS Pham et al. (2018) 2.89 4.6 0.5 RL
Arch2Vec Yan et al. (2020) 2.56 3.6 100 BO
ProxylessNAS Cai et al. (2018) 2.08 5.7 4 gradient
DARTS(2nd order) Liu et al. (2018b) 2.76 3.3 1 gradient
SNAS (mild) Xie et al. (2018) 2.98 2.9 1.5 gradient
GDAS(FRC) Dong and Yang (2019) 2.82 2.5 0.17 gradient
P-DARTS Chen et al. (2019) 2.62 / 2.50 3.4 0.3 gradient
PC-DARTS Xu et al. (2019) 2.57 3.6 0.1 gradient
NoisyDARTS Chu et al. (2020a) 2.65 / 2.39 3.6 0.4 gradient
RDARTS  Zela et al. (2020a) 2.95 - 1.6 gradient
SDARTS-ADV Chen and Hsieh (2020) 2.61 3.3 1.3 gradient
FairDARTS Chu et al. (2020b) 2.54 3.3 0.41 gradient
ISTA-NAS Yang et al. (2020) 2.54 3.3 0.05 gradient
Hc-DAS 2.66 2.3 0.4 gradient
Hs-DAS 2.41 3.4 0.7 gradient
Hs-DAS (with cell in P-DARTS) 2.30 3.5 0.3 gradient
Hs-DAS (with cell in NoisyDARTS) 2.34 3.6 0.3 gradient
Hs-DAS-autoAugment Cubuk et al. (2018) 1.99 3.6 0.7 gradient
Table 1: Comparison with state-of-the-art architectures on CIFAR10. : Our implementation, obtained by training the best cell architecture provided by the authors using the code of H-DAS. : Obtained on a different search space with PyramidNet Han et al. (2017) as the backbone. : The search cost contains 0.4 GPU-days for the cells and 0.3 GPU-days for the stages.

4 Experiments and Results

4.1 Implementation Details

We conduct experiments on CIFAR10 Krizhevsky and Hinton (2009) and ImageNet Deng et al. (2009). In the search of the cell-level structure, we follow DARTS Liu et al. (2018b) by using the same search space, hyper-parameters and training scheme, and we fix the minimum number of cells N_min in Eq. (5). The stage-level search space contains 4 non-parametric operations which connect the cells: average pooling, skip connection, max pooling, and no connection (none). For training a single model, we use the same strategy and data processing methods as DARTS. More details can be found in the SM.

Architecture    Top-1 Err. (%)    Top-5 Err. (%)    Params (M)    Mult-Adds (M)    Search Cost (GPU-days)    Search Method
Inception-v1 Szegedy et al. (2015) 30.2 10.1 6.6 1448 - manual
MobileNet-v2 Sandler et al. (2018) 25.3 - 6.9 585 - manual
ShuffleNet 2x (v2) Ma et al. (2018) 25.1 - 7.4 591 - manual
NASNet-A Zoph et al. (2018) 26.0 8.4 5.3 564 1800 RL
AmoebaNet-C Real et al. (2019) 24.3 7.6 6.4 570 3150 evolution
PNAS Liu et al. (2018a) 25.8 8.1 5.1 588 225 SMBO
MnasNet-92 Tan et al. (2019) 25.2 8.0 4.4 388 - RL
MobileNet-v3-large Howard et al. (2019) 24.8 - 5.4 219 - RL
DARTS (2nd order) Liu et al. (2018b) 26.7 8.7 4.7 574 1 gradient
GDAS Dong and Yang (2019) 26.0 8.5 5.3 581 0.21 gradient
SNAS (mild) Xie et al. (2018) 27.3 9.2 4.3 522 1.5 gradient
P-DARTS Chen et al. (2019) 25.1 / 24.4 7.7 / 7.4 4.9 557 0.3 gradient
SinglePath-NAS Stamoulis et al. (2019) 25.0 7.8 - - 0.15 gradient
ProxylessNAS (GPU) Cai et al. (2018) 24.9 7.5 7.1 465 8.3 gradient
PC-DARTS Xu et al. (2019) 25.1 7.8 5.3 586 0.1 gradient
RandWire-WS Xie et al. (2019) 25.3 7.8 5.6 583 - random
SDARTS-ADV Chen and Hsieh (2020) 25.2 7.8 - - 1.3 gradient
FairDARTS Chu et al. (2020b) 24.9 7.5 4.8 541 0.4 gradient
ISTA-NAS Yang et al. (2020) 25.1 7.7 4.78 550 2.3 gradient
Hc-DAS 25.9 8.4 5.0 578 0.5 gradient
Hs-DAS 24.5 7.7 5.1 572 0.3 gradient
Table 2: Comparison with state-of-the-art architectures on ImageNet (mobile setting).   : Our implementation by training the best architecture provided by the authors using the code of H-DAS.   : Searched on CIFAR10.   : Searched on ImageNet.

4.2 Search for Cell Distribution

The search for the cell distribution over the three stages is performed under a constraint of a certain computational complexity. Interestingly, different stages can have different numbers of cells at the beginning of or during the search, but the numbers of cells in all stages become the same at the end of the search. As discussed, adjusting the weighting factor λ2 in Eq. (6) leads to a different total number of cells in the networks. We therefore repeated the search several times with various values of λ2, resulting in different total numbers of cells in the networks, but the cell distribution remains the same. Based on these observations, for a fair comparison, we set the number of cells in each stage in our Hs-DAS to 6 for CIFAR10 and 4 for ImageNet, which is the same as other DARTS-series methods.

4.3 Results on CIFAR10

The results on CIFAR10 are compared in Table 1. Hc-DAS has many non-parametric connections in the cells of the last stage, and thus its parameter size is small, about 30% smaller than that of DARTS. Nevertheless, it still performs better than DARTS, suggesting that enlarging the cell-level search space and learning stage-specific cell-level structures brings a clear improvement. It also shows that searching for a single repeatable cell structure to form a chain-like network is not optimal.

Furthermore, by combining with the cell structures searched by recent strong DAS methods Chen et al. (2019); Chu et al. (2020a), our Hs-DAS, which searches for more meaningful cell connections over different stages, improves the state-of-the-art methods by achieving an error of 2.30% on CIFAR10, which is the best among DAS methods. Notice that ProxylessNAS has a better result, but with 60-100% more parameters and search time compared to other methods, including our Hs-DAS.

We also found that Hs-DAS performs better than Hc-DAS, suggesting that the macro-architecture plays an important role in neural architectures. A simple chain-like sequential structure significantly limits the search space, and disregards the importance of stage-level structures. It would be beneficial to unchain the search space and pay more attention to developing more meaningful macro-architectures in future research.

4.4 Results on ImageNet

To verify the generalization of our searched structures, we train our networks on ImageNet using the architectures searched on CIFAR10. Results of our Hc-DAS and Hs-DAS on ImageNet are compared in Table 2. We transfer the cell-level and stage-level structures, with four normal cells per stage, searched on CIFAR10 to ImageNet. Again, Hs-DAS outperforms Hc-DAS by a large margin, demonstrating the superiority of meaningful cell connections in stage-level structures. Additionally, our Hs-DAS is comparable to other state-of-the-art DAS methods, without using advanced network blocks such as MBConv.

Search Space    Cell (Sh. / Spec.)    Stage (Sh. / Spec.)    Test Err. (%) C10 / ImgNet
DARTS - - 2.76 26.70
Hc-DAS - - 2.66 25.91
Hs-DAS 2.58 25.45
Hs-DAS 2.51 25.80
Hs-DAS 2.30 24.50
Table 3: Performance of the cell-level search and the stage-level search, using either structures shared over all stages (Sh.) or stage-specific architectures for different stages (Spec.). DARTS and Hc-DAS do not perform the stage-level search.

Structure    m    # Dead cells    # Depth    Test Err. (%)
Fig.3(a) 7 4 2 3.08
Fig.3(b) 4 1 3 2.91
Fig.3(c) 3 0 5 2.58
Table 4: Constraint on input cells. The three macro-architectures are retrained on CIFAR10 with the same configuration. 'm': the length of the sliding window. '# Dead cells': the number of cells that are not connected to any subsequent cell. '# Depth': the number of cells in the longest path from input to output.

4.5 Ablation Study

To demonstrate the effectiveness of the cell-level search and the stage-level search, we conduct ablation experiments on CIFAR10 and ImageNet with various configurations. The details are reported in Table 3. We can summarize that: (1) our Hc-DAS outperforms DARTS by searching for different cell structures over different stages; (2) compared to DARTS, our Hs-DAS, which performs the stage-level search, improves over the unsearched chain-like structure; (3) Hs-DAS can be further improved by the stage-specific macro-architecture search; (4) interestingly, Hs-DAS with the stage-specific cell search performs worse than Hs-DAS with shared cell structures; the stage-specific cell search results in a larger search space, which might be over-complex for the search algorithm to learn an optimal result.

Additionally, as found empirically in Section 3.2, the number of dead cells in the searched architecture impacts the performance of the networks. We further verify this by training the other two structures in Figure 3 on CIFAR10. As presented in Table 4, a network with fewer dead cells and a deeper macro-architecture obtains higher performance.

Furthermore, to show the effectiveness of our search strategy for the macro-architecture, we keep the same cell-level structures and compare the searched macro-architectures with a random baseline. Following Yang et al. (2019), we calculate a relative improvement over this random baseline as RI = 100 × (Acc_m − Acc_r) / Acc_r, which provides a quality measurement of the search strategy, where Acc_m and Acc_r indicate the top-1 accuracy of the search method and of the random sampling strategy, respectively. Our Hs-DAS achieves an RI of 0.44, a significant improvement over the 0.32 of DARTS.
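
For completeness, the relative-improvement metric can be computed as below; the accuracies in the example are purely hypothetical and only illustrate the scale of RI.

```python
def relative_improvement(acc_method, acc_random):
    """RI = 100 * (Acc_m - Acc_r) / Acc_r, following Yang et al. (2019)."""
    return 100.0 * (acc_method - acc_random) / acc_random

# Hypothetical CIFAR10 top-1 accuracies (percent), only to illustrate the scale:
print(round(relative_improvement(97.70, 97.27), 2))  # ~0.44
```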

5 Conclusions

We have presented a Hierarchical Differentiable Architecture Search (H-DAS) which performs both cell-level and stage-level architecture search. In the cell-level search, H-DAS improves the DARTS baseline by exploring stage-specific cell structures. Importantly, we formulate the stage-level structure as a directed acyclic graph, which allows us to search for a more advanced architecture than the conventional chain-like configuration. Our two-level search enlarges the search space considerably, and significantly improves the performance over previous DARTS-series methods.

6 Appendix

In this appendix, we provide additional material to supplement the main submission, including an additional ablation study, a complexity analysis, datasets, implementation details, and visualizations of the cell-level structures and stage-level structures in Hc-DAS and Hs-DAS.

6.1 Ablation Study

Depth loss.

Once the number of cells in each stage is determined, we fix it and focus on the search of the stage architectures. The loss function of this search period is the classification loss plus the weighted depth loss, \mathcal{L}_{cls} + \lambda_{1}\,\mathcal{L}_{depth}. To show the importance of the depth loss, we conduct ablation experiments on CIFAR10. As shown in Table 5, the depth loss brings a clear improvement in performance. We show the stage-level structure searched without the sliding window and the depth loss in Fig. 5(a), the structure searched with the sliding window but without the depth loss in Fig. 5(b), and the structure searched with both the sliding window and the depth loss in Fig. 5(c). We define the depth of an architecture as the average length of the paths from input to output; clearly, the depths of the three stage-level structures increase in this order.

Connections and operations. A key observation is that the connections between cells may play a more important role than the operation types; we show an example in Table 6. Therefore, in the stage-level structures, we only use a few non-parametric operations.

6.2 Complexity Analysis

In this section, we analyze the complexity of the search spaces of Hc-DAS and Hs-DAS.

In DARTS, each discretized cell allows approximately 10^9 possible DAGs without considering graph isomorphism. Since the normal and reduction cells are learned jointly, the total number of architectures is approximately 10^18.

In Hc-DAS, we have three stage-specific normal cells and two different reduction cells, so the total number of architectures is approximately the fifth power of the single-cell space, i.e., about 10^45.

In Hs-DAS, each stage-level structure also allows a very large number of possible DAGs without considering graph isomorphism (recall that we have 3 non-zero operations, 2 input nodes, and 6 intermediate cells with 2 predecessors each). Introducing the sliding-window constraint reduces this stage-level space. Combined with the cell-level structures, the total number of architectures for CIFAR10 grows by many further orders of magnitude.

The search space is even larger in the cell-distribution search, as there are 8 cells per stage, which further enlarges the stage-level search space of each stage and of the whole network, both with and without the sliding-window constraint, and even before taking the cell-level search space into account.

λ1    Test Err. on CIFAR10 (%)
0 2.59
0.33 2.47
1 2.34
1.5 2.53
Table 5: Depth loss. λ1 is used to balance the classification loss and the depth loss.

6.3 Datasets

We perform experiments on CIFAR10 Krizhevsky and Hinton (2009) and ImageNet Deng et al. (2009), which are two image classification benchmarks for evaluating neural architecture search.

CIFAR10 consists of 50K training images and 10K testing images. These images have a spatial resolution of 32x32 and are equally distributed over 10 classes. ImageNet is a large-scale and well-known benchmark for image classification. It contains 1000 object categories, 1.28M training images, and 50K validation images. Following Liu et al. (2018b), the input size is fixed to 224x224.

Operations    Top-1 (%)
none, skip-connect, max-pool-3x3, avg-pool-3x3    97.49
none, skip-connect    97.48
Table 6: Connections and operations. In the stage-level search space, the two structures share the same connections, with only some operations changed. We evaluate each structure three times on CIFAR10 and report the average accuracy; the two results are almost identical.

6.4 Implementation Details

Architecture search in cell-level search space.

The entire experiment has three stages: the architecture search for cells, the architecture search for macro-architectures, and the architecture evaluation. Both Hc-DAS and Hs-DAS require the cell search and the architecture evaluation, while Hs-DAS additionally requires the macro-architecture search. The search space for cells is the same as in DARTS, which has 8 candidate operations:

- 3x3 depthwise-separable conv   - 3x3 average pooling

- 5x5 depthwise-separable conv   - 3x3 max pooling

- 3x3 dilated conv                       - skip connection

- 5x5 dilated conv                       - no connection (none)

Meanwhile, the search space for the macro-architecture has only 4 non-parametric candidate operations including:

- average pooling                 - skip connection

- max pooling                      - no connection (none)

We narrowed down the choice of stage-level operations because we do not want to introduce too many learnable parameters into Hs-DAS.

For CIFAR10, Hc-DAS uses the same structure as other one-shot methods, stacking 8 cells (6 normal cells and 2 reduction cells). In our method, the 8-layer network is separated into three stages by the 2 different reduction cells. Each stage contains 2 normal cells, and the normal cells in different stages are different. The whole network is thus stacked as 2 normal cells in the first stage, 1 reduction cell, 2 normal cells in the second stage, 1 reduction cell, and 2 normal cells in the third stage. Each cell consists of 7 nodes (2 input nodes, 4 intermediate nodes and 1 output node). We search for the architecture of the cells for 50 epochs with 16 initial channels and a batch size of 64. Half of the training data is used to update the model weights, while the other half is used to update the architecture parameters. The model weights are optimized by momentum SGD with an initial learning rate of 0.025, a momentum of 0.9 and the same weight decay as DARTS. The architecture parameters are optimized by Adam with a fixed learning rate, momentum (0.5, 0.999) and weight decay following the DARTS settings.
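
As a sketch, the bi-level optimizers described above can be set up as follows; the weight-decay and architecture learning-rate values are the usual DARTS defaults, written in as assumptions because the exact numbers were not preserved in the text.

```python
import torch

def build_search_optimizers(model_weights, arch_params):
    """Bi-level optimizers for the cell-level search: momentum SGD for the network
    weights and Adam for the architecture parameters. The weight-decay and Adam
    learning-rate values below are the usual DARTS defaults (an assumption)."""
    w_opt = torch.optim.SGD(model_weights, lr=0.025, momentum=0.9,
                            weight_decay=3e-4)            # assumed DARTS default
    a_opt = torch.optim.Adam(arch_params, lr=3e-4,        # assumed DARTS default
                             betas=(0.5, 0.999), weight_decay=1e-3)
    return w_opt, a_opt

# Stand-ins for the super-network and its architecture parameters.
net = torch.nn.Linear(8, 8)
alpha = torch.nn.Parameter(torch.zeros(14, 8))
w_opt, a_opt = build_search_optimizers(net.parameters(), [alpha])
```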

For ImageNet, the one-shot model starts with three convolution layers of stride 2 to reduce the resolution of the input images from 224x224 to 28x28. The cells we use are derived on CIFAR10, and are shown in Figures 6 and 7.

Architecture search in the stage-level search space. For the macro-architecture search on CIFAR10, the initial channel number is 16, the batch size is 64 and the number of training epochs is 50. The optimizers for the model weights and the architecture parameters are SGD and Adam, respectively, with the same settings as in the cell search. In order to construct a network of 20 cells, we search for three macro-structures, each of which has 6 normal cells.

For the macro-architecture search on ImageNet, the cells are derived on CIFAR10. The initial channel number is 16, the batch size is 64 and the number of training epochs is 50. Considering the evaluation stage on ImageNet with 14 stacked cells, we search for three macro-architectures, each containing four normal cells. Each cell in a macro-architecture can only choose its preceding cells from the three previous cells. Due to the difficulty of bi-level optimization on ImageNet, we search the macro-architectures on CIFAR10, splitting half of its training set for updating the model weights and the other half for updating the architecture parameters of the macro-architectures. The model weights are optimized by SGD with an initial learning rate of 0.025, a momentum of 0.9 and the same weight decay as in the cell search. The architecture parameters of the macro-architectures are optimized by Adam with a fixed learning rate, momentum (0.5, 0.999) and weight decay following the DARTS settings. We visualize the macro-architectures for CIFAR10 in Fig. 8 and those for ImageNet in Fig. 9.

In short, for a fair comparison, we inherit the configuration of the original DARTS as much as possible. For CIFAR10, the cell search takes 0.4 GPU-days on a single Titan X and the macro-architecture search takes 0.3 GPU-days. For ImageNet, we transfer the cells searched on CIFAR10, and only search for the macro-architectures, which takes 0.3 GPU-days.

Architecture evaluation. The evaluation stage of Hc-DAS is the same as that of DARTS. For Hs-DAS on CIFAR10, the network is composed of 20 cells, and each macro-architecture is constructed from six identical normal cells. The initial number of channels is 36 and all training images are used. The network is trained from scratch for 600 epochs with a batch size of 128. We use the SGD optimizer with an initial learning rate of 0.025, a momentum of 0.9, the same weight decay as DARTS, and gradient clipping at 5 for the weights. An auxiliary loss weight of 0.4 and a drop-path probability of 0.2 are used for regularization.

The evaluation stage on ImageNet also starts with three convolution layers of stride 2 to reduce the resolution from 224x224 to 28x28. The network stacks three macro-architectures with two reduction cells, and each stage contains four normal cells. In order to keep the number of multiply-add operations under the mobile setting, the initial channel number is changed to 39 and the number of training epochs is 250 with a batch size of 128. We use the SGD optimizer with a momentum of 0.9, an initial learning rate of 0.1 and the same weight decay as DARTS. Additional enhancements are adopted, including label smoothing and an auxiliary loss tower with a weight of 0.4.

6.5 Visualization

Colors. When plotting cell-level structures, we use the colors darkseagreen2, lightblue and palegoldenrod, while for stage-level structures the colors are darkgoldenrod1, indianRed1 and honeydew2; these represent the input features, the nodes (or cells in stage-level structures) and the output feature, respectively.

Cell-level structures. In Fig. 6, we visualize the three stage-specific normal cells of Hc-DAS, and Fig. 7 shows the reduction cells. All of them are searched on CIFAR10 and used for both CIFAR10 and ImageNet in Hc-DAS. An interesting discovery is that the normal cells in the 1st stage (Fig. 6(a)) are full of convolutions with small kernels (3x3), while the normal cells in the 3rd stage (Fig. 6(c)) prefer convolutions with various kernels (3x3 and 5x5).

Stage-level structures. The connections between cells in the stage-level structures of Hs-DAS are non-parametric. In Fig. 8, we construct a 20-cell network for CIFAR10, each stage of which has 6 cells. Because of the input restriction on cells, each cell can connect to 3 candidate preceding cells in its stage. In Fig. 9, we construct a 14-cell network for ImageNet, each stage of which has 4 cells.

Loss of computational complexity. The search for the distribution of cells over the three stages is conducted under a constraint of a certain computational complexity. Adjusting the weighting factor λ2 in Eq. (6) leads to a different total number of cells in the network, but the cell distribution remains the same, i.e., the numbers of cells in the three stages are equal at the end of the search. In Fig. 10 and Fig. 11, we show how the weighting factor changes the number of cells in each stage.

(a) w/o Sliding Window, w/o Depth Loss
(b) Sliding Window, w/o Depth Loss
(c) Sliding Window, Depth Loss
Figure 5: The depths of the stage-level structures increase as the sliding window (the input restriction on cells) and the depth loss are added.
(a) normal cell in the 1st stage
(b) normal cell in the 2nd stage
(c) normal cell in the 3rd stage
Figure 6: The best normal cells of Hc-DAS searched on CIFAR10.
(a) reduction cell 1
(b) reduction cell 2
Figure 7: The best reduction cells of Hc-DAS searched on CIFAR10.
(a) the 1st macro-architecture
(b) the 2nd macro-architecture
(c) the 3rd macro-architecture
Figure 8: Stage-level structures of Hs-DAS for CIFAR10. Each stage has 6 cells.
(a) the 1st macro-architecture
(b) the 2nd macro-architecture
(c) the 3rd macro-architecture
Figure 9: Stage-level structures of Hs-DAS for ImageNet. Each stage has 4 cells.
(a) the 1st macro-structure for CIFAR10
(b) the 2nd macro-structure for CIFAR10
(c) the 3rd macro-structure for CIFAR10
Figure 10: A larger λ2 introduces fewer cells into the stage-level structures of Hs-DAS. The cell distribution remains the same.
(a) the 1st macro-structure for CIFAR10
(b) the 2nd macro-structure for CIFAR10
(c) the 3rd macro-structure for CIFAR10
Figure 11: A smaller λ2 introduces more cells into the stage-level structures of Hs-DAS. The cell distribution remains the same.

References

  1. Baker et al. (2016). Designing neural network architectures using reinforcement learning. arXiv preprint arXiv:1611.02167.
  2. Cai et al. (2018). ProxylessNAS: direct neural architecture search on target task and hardware. arXiv preprint arXiv:1812.00332.
  3. Chen and Hsieh (2020). Stabilizing differentiable architecture search via perturbation-based regularization. In ICML.
  4. Chen et al. (2019). Progressive differentiable architecture search: bridging the depth gap between search and evaluation. arXiv preprint arXiv:1904.12760.
  5. Chu et al. (2020a). Noisy differentiable architecture search. arXiv preprint arXiv:2005.03566.
  6. Chu et al. (2020b). Fair DARTS: eliminating unfair advantages in differentiable architecture search. In ECCV.
  7. Cubuk et al. (2018). AutoAugment: learning augmentation policies from data. arXiv preprint arXiv:1805.09501.
  8. Deng et al. (2009). ImageNet: a large-scale hierarchical image database. In CVPR.
  9. Dong and Yang (2019). Searching for a robust neural architecture in four GPU hours. In CVPR.
  10. Han et al. (2017). Deep pyramidal residual networks. In CVPR, pp. 5927-5935.
  11. He et al. (2016). Deep residual learning for image recognition. In CVPR.
  12. Howard et al. (2019). Searching for MobileNetV3. In ICCV.
  13. Huang et al. (2017). Densely connected convolutional networks. In CVPR.
  14. Krizhevsky and Hinton (2009). Learning multiple layers of features from tiny images. Technical report.
  15. Krizhevsky et al. (2012). ImageNet classification with deep convolutional neural networks. In NIPS.
  16. Liang et al. (2019a). Computation reallocation for object detection. arXiv preprint arXiv:1912.11234.
  17. Liang et al. (2019b). DARTS+: improved differentiable architecture search with early stopping. arXiv preprint arXiv:1909.06035.
  18. Liu et al. (2018a). Progressive neural architecture search. In ECCV.
  19. Liu et al. (2017). Hierarchical representations for efficient architecture search. arXiv preprint arXiv:1711.00436.
  20. Liu et al. (2018b). DARTS: differentiable architecture search. arXiv preprint arXiv:1806.09055.
  21. Ma et al. (2018). ShuffleNet V2: practical guidelines for efficient CNN architecture design. In ECCV.
  22. Pham et al. (2018). Efficient neural architecture search via parameter sharing. arXiv preprint arXiv:1802.03268.
  23. Real et al. (2019). Regularized evolution for image classifier architecture search. In AAAI.
  24. Real et al. (2017). Large-scale evolution of image classifiers. In ICML.
  25. Sandler et al. (2018). MobileNetV2: inverted residuals and linear bottlenecks. In CVPR.
  26. Simonyan and Zisserman (2014). Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556.
  27. Stamoulis et al. (2019). Single-Path NAS: designing hardware-efficient ConvNets in less than 4 hours. arXiv preprint arXiv:1904.02877.
  28. Szegedy et al. (2015). Going deeper with convolutions. In CVPR.
  29. Tan et al. (2019). MnasNet: platform-aware neural architecture search for mobile. In CVPR.
  30. Xie et al. (2019). Exploring randomly wired neural networks for image recognition. arXiv preprint arXiv:1904.01569.
  31. Xie et al. (2018). SNAS: stochastic neural architecture search. arXiv preprint arXiv:1812.09926.
  32. Xu et al. (2019). PC-DARTS: partial channel connections for memory-efficient differentiable architecture search. arXiv preprint arXiv:1907.05737.
  33. Yan et al. (2020). Does unsupervised architecture representation learning help neural architecture search? Advances in Neural Information Processing Systems 33.
  34. Yang et al. (2019). NAS evaluation is frustratingly hard. arXiv preprint arXiv:1912.12522.
  35. Yang et al. (2020). ISTA-NAS: efficient and consistent neural architecture search by sparse coding. Advances in Neural Information Processing Systems 33.
  36. You et al. (2020). GreedyNAS: towards fast one-shot NAS with greedy supernet. In CVPR, pp. 1999-2008.
  37. Zela et al. (2020a). Understanding and robustifying differentiable architecture search. In ICLR.
  38. Zela et al. (2020b). NAS-Bench-1Shot1: benchmarking and dissecting one-shot neural architecture search. arXiv preprint arXiv:2001.10422.
  39. Zhong et al. (2020). Representation sharing for fast object detector search and beyond. arXiv preprint arXiv:2007.12075.
  40. Zoph and Le (2016). Neural architecture search with reinforcement learning. arXiv preprint arXiv:1611.01578.
  41. Zoph et al. (2018). Learning transferable architectures for scalable image recognition. In CVPR.