RAPDARTS: Resource-Aware Progressive Differentiable Architecture Search
Abstract
Early neural network architectures were designed by so-called “grad student descent”. Since then, the field of Neural Architecture Search (NAS) has developed with the goal of algorithmically designing architectures tailored to a dataset of interest. Recently, gradient-based NAS approaches have been created to rapidly perform the search. Gradient-based approaches impose more structure on the search than alternative NAS methods, enabling faster search-phase optimization. In the real world, neural architecture performance is measured by more than just high accuracy. There is increasing demand for efficient neural architectures, where resources such as model size or latency must also be considered. Gradient-based NAS is also suitable for such multi-objective optimization. In this work we extend a popular gradient-based NAS method to support one or more resource costs. We then perform in-depth analysis on the discovery of architectures satisfying single-resource constraints for classification of CIFAR-10.
I. Introduction
THE optimal design of a neural architecture depends on 1) the target dataset, 2) the set of primitive operations (e.g. convolutional filters, skip-connections, nonlinearity functions, pooling), 3) how the primitive operations are composed into a neural architecture and optimized, and 4) resource constraints like hardware cost, minimum accuracy, or maximum latency. In this paper, we assume the target dataset has been provided, and we offer guidelines and analysis for searching for neural architectures under one or more hardware resource constraints.
Convolutional layers and fully-connected layers are parameter-heavy operations. Those, along with other lighter primitive operations, like pooling layers or batch normalization, may be composed into an endless variety of neural architectures. But what is the optimal neural architecture for a given dataset? There is no closed-form solution to that question.
Historically, the highest performing neural architectures have been found by applying heuristics and a large amount of compute. Some well known examples of modern hand-crafted architectures include AlexNet [1], VGG16 [2], ResNet [3], and the Inception series [4, 5, 6]. None of these examples consider hardware, and they pursue classification performance at all costs.
Neural Architecture Search (NAS) methods automate strategies for the discovery of high-performing neural architectures. A reinforcement learning (RL) based approach was the first post-AlexNet NAS method with state-of-the-art performance on CIFAR-10 [7, 8]. The RL approach was quickly followed by a high-performance Evolutionary Strategy (ES) based method [9]. While both the RL and ES methods discovered high-performance architectures, their use came at the cost of thousands of GPU hours.
Gradient-based NAS (GBNAS) methods have the benefit of being directly optimized through gradient descent and consequently complete the search faster than other NAS methods. The basic idea of GBNAS is given in Figure 1. The search process alternates between temporarily fixing one set of parameters, i.e. treating them as constants, and updating the other set. This approach has no convergence guarantees, but it works well in practice.
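The alternating first-order scheme can be sketched with a toy coupled objective standing in for the real training and validation losses. The quadratic function, learning rate, and step count below are illustrative choices, not part of any published GBNAS method:

```python
def objective(w, alpha):
    # Toy smooth objective coupling the two parameter sets.
    return (w - 1.0) ** 2 + (alpha + 2.0) ** 2 + 0.1 * w * alpha

def grad_w(w, alpha):
    # Partial derivative with respect to w (alpha held constant).
    return 2.0 * (w - 1.0) + 0.1 * alpha

def grad_alpha(w, alpha):
    # Partial derivative with respect to alpha (w held constant).
    return 2.0 * (alpha + 2.0) + 0.1 * w

w, alpha, lr = 0.0, 0.0, 0.1
for _ in range(500):
    w -= lr * grad_w(w, alpha)          # update w; alpha is "locked"
    alpha -= lr * grad_alpha(w, alpha)  # update alpha; w is "locked"

# At the fixed point of the alternating scheme, both partial
# gradients vanish, even though neither update ever sees the
# other parameter's gradient.
```

In the real setting, `w` and `alpha` are large tensors and the two gradients come from backpropagation through the search network on training and validation batches, respectively.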
Because neural models are now widely deployed on systems like edge devices, in cars, and running in servers, available hardware resources also have an impact on what may be considered an “optimal” neural architecture design. Hardware resource constraints are often summarized as size, weight, and power (SWaP). Resource constraints could also include maximum latency, minimum throughput, or a manufacturing budget, which determines whether a custom ASIC is an option, a COTS device is sufficient, or something semi-custom, like an FPGA, is appropriate. For example, during the design of Google’s TPUv1, architects were given a budget of 7 ms per inference (including server communication time) for user-facing workloads [10].
Recent efforts described below implement NAS strategies incorporating hardware resource constraints into the search. GBNAS methods capture hardware resource constraints within a differentiable loss function. This approach enables the architecture search to yield network architectures biased toward satisfying resource constraints.
In this work we have modified PDARTS [11], which in turn is based on another popular gradient-based NAS algorithm, DARTS [12], to support resource costs. We use our modified GBNAS algorithm to search for many neural architectures under various resource-consumption penalties. We then use our results and observations to answer the following questions:

- What is the computational cost of searching for satisficing architectures?
- What heuristics can be used to guide the search and training process to reduce compute time?
- How reproducible are search results under random initial conditions?
II. Related Work
The first competitive NAS approach applied to modern image classification tasks was based on reinforcement learning (RL) [7]. In this work, an LSTM-based RL agent was trained to output primitive operations which were then chained together into a directed acyclic graph. After training and evaluating the graph, the agent was then encouraged or discouraged, via a positive or negative reward derived from classification accuracy, to generate similar graphs in the future or to explore and make new graphs.
The reinforcement learning NAS approach worked well and achieved high accuracy, but at unheard-of computational expense: it required 3,150 GPU-days to discover one of its published architectures.
Related approaches to sampling neural architectures include Markov chain Monte Carlo methods [13], evolutionary strategies [14], and genetic algorithms [15]. Similar to RL approaches, all of these optimization methods generate populations of neural architectures. The populations are then trained and a fitness value is derived from the classifier’s final test performance. The fitness value is used to encourage or discourage the design of the next population of architectures.
Reinforcement learning, Markov chain Monte Carlo methods, evolutionary strategies, and genetic algorithms discover high-performance architectures, but they are incredibly expensive. These methods often require 100 to 1,000 times more compute than gradient-based methods [16].
Gradient-based neural architecture search has recently become popular because of its efficiency [12, 17, 18, 11]. GBNAS methods maintain two sets of parameters: network parameters $w$ and architecture parameters $\alpha$. Previous GBNAS methods have introduced various ways to optimize and use the two parameter sets. In the simplest case, optimization is achieved by alternately optimizing one set of parameters and then the other. This first-order optimization approach is illustrated in Figure 1.
Differentiable Architecture Search (DARTS) is a GBNAS technique that uses mixed operations to compute multiple primitive operations in parallel, followed by element-wise summation [12]. The mixed operations are scaled by architecture parameters prior to summation. For example, as illustrated in Figure 2, a $3 \times 3$ convolutional filter and a $5 \times 5$ convolutional filter can be designed such that both receive the same input feature map and both generate additively conformable output feature maps.
Extending this technique, DARTS composes 14 mixed operations into a cell. Eight cells are then chained to create the network. Each cell has the same connectivity and architecture parameters ($\alpha$) for mixed operations, but the network parameters ($w$) are learned independently in each primitive operation and in each cell. An illustration of the DARTS cell connectivity is given in Figure 3.
DARTS has a limitation: the entire neural network (i.e. all cells and all mixed operations) must fit in GPU memory. This limits the depth of the neural network as well as the batch size during training. Progressive Differentiable Architecture Search (PDARTS) mitigates the memory limitation of DARTS by 1) gradually growing the depth of the neural network, while simultaneously 2) gradually reducing the number of primitive operations per mixed operation, thus reducing model size [11].
ProxylessNAS also extends DARTS [18]. ProxylessNAS treats the architecture parameters of each mixed operation as a probability distribution. It stores a large over-parameterized network in system memory, because the network is too large to fit on a GPU. During evaluation, a subnetwork is sampled and transferred to the GPU for evaluation. Gradients are calculated and used to update the shared weights of the over-parameterized network.
Addressing the need for architectures which not only strive for high accuracy but also meet additional performance constraints, hardware-aware NAS techniques have been pursued. ProxylessNAS is particularly relevant for hardware-aware GBNAS because it formalizes the approach to incorporating resource costs during the search. In the context of classification, ProxylessNAS creates a loss function that incorporates both a cross-entropy loss for classification accuracy and a resource loss for latency.
In this work we augment PDARTS with a ProxylessNAS-style resource loss and analyze its impact on architectures discovered during the search phase.
III. Method
III-A. Resource-Aware Differentiable Neural Architecture Search
When training a convolutional neural network for classification, the goal is to obtain a model that best predicts labels from observations drawn from an underlying distribution of interest. Fitting a neural model to an underlying distribution is achieved by finding optimal network parameters that minimize expected prediction error on an available dataset:
$\min_{w} \; J(w) = \mathbb{E}_{(x, y) \sim \hat{p}_{\text{data}}} \big[ L(f(x; w), y) \big]$  (1)

where $J$ is the objective function, $x$ are dataset observations, $y$ are dataset labels, $\hat{p}_{\text{data}}$ is the empirical distribution, $L$ is a prediction error loss function, and $f$ is the neural network parameterized by $w$.
Gradient-based NAS methods introduce another set of architecture parameters, $\alpha$, producing:

$\min_{w, \alpha} \; J(w, \alpha) = \mathbb{E}_{(x, y) \sim \hat{p}_{\text{data}}} \big[ L(g(x; w, \alpha), y) \big]$  (2)

We refer to $g$ as a directed acyclic graph, or simply graph, to highlight that it is composed of a neural network whose control flow is modified by other, non-network architecture parameters. Note the distinction between $f$ used in Equation 1, which is parameterized only by network parameters, and $g$ used in Equation 2, which is parameterized by both network and architecture parameters.
Architecture parameters, like network parameters, are scalar-valued tensors. Architecture parameters are used to control either the weight of primitive operations, as in [12, 11], or the probability that primitive operations will take place, as in [19, 18]. In both cases, the scalar values are interpreted as one or more probability distributions through processing by the softmax function. In our case, the probability distribution is then used for evaluation of a mixed operation.
A mixed operation is illustrated in Figure 2, and it is formalized as:

$\bar{o}(x) = \sum_{i=1}^{N} p_i \, o_i(x), \qquad p_i = \frac{\exp(\alpha_i)}{\sum_{j=1}^{N} \exp(\alpha_j)}$  (3)

where $o_i$ is a primitive operation and $\bar{o}(x)$ is equivalent to the expected value of the $N$ primitive operations. This formalism extends the mixed operation to the inclusion of $N$ primitive operations that are evaluated in parallel and designed such that their outputs are additively conformable. In practice many mixed operations are used, with unique subsets of $\alpha$ and $w$ used for the calculation of each expected value, but we show only a single mixed operation here for clarity.
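A minimal numerical sketch of the mixed operation in Equation 3, assuming three toy primitives acting on a 1-D feature vector (this primitive set is illustrative, not the full DARTS operation set):

```python
import numpy as np

def softmax(alpha):
    # p_i = exp(alpha_i) / sum_j exp(alpha_j), computed stably.
    e = np.exp(alpha - alpha.max())
    return e / e.sum()

# Hypothetical primitives designed to be additively conformable:
# each maps an input vector to an output vector of the same shape.
primitives = [
    lambda x: x,                 # skip-connect
    lambda x: np.zeros_like(x),  # zero operation
    lambda x: np.maximum(x, 0),  # stand-in for a parameterized op
]

def mixed_op(x, alpha):
    p = softmax(alpha)
    outs = np.stack([op(x) for op in primitives])
    return np.tensordot(p, outs, axes=1)  # sum_i p_i * o_i(x)

x = np.array([-1.0, 2.0])
alpha = np.zeros(3)     # equal alphas -> uniform weights
y = mixed_op(x, alpha)  # the average of the primitive outputs
```

As the search sharpens one $\alpha_i$, the softmax weight of that primitive approaches one and the mixed operation approaches that single primitive, which is what makes the final discretization step reasonable.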
The inclusion of architecture parameters implies there are now two objective functions to be optimized:

$\min_{w} \; J_{w}(w) = \mathbb{E} \big[ L(g_{\alpha}(x; w), y) \big], \qquad \min_{\alpha} \; J_{\alpha}(\alpha) = \mathbb{E} \big[ L(g_{w}(x; \alpha), y) \big]$  (4)

The graph evaluations in Equation 4 are now denoted $g_{\alpha}$ and $g_{w}$. This notation highlights that in the case of $g_{\alpha}$ the graph is evaluated at input and architecture parameter constants and optimized using network parameters $w$. In the second case of $g_{w}$ the graph is evaluated at input and network parameter constants and optimized using architecture parameters $\alpha$. Therefore the following bilevel optimization must be solved:

$\min_{\alpha} \; J_{\alpha}\big(w^{*}(\alpha), \alpha\big) \quad \text{s.t.} \quad w^{*}(\alpha) = \arg\min_{w} \; J_{w}(w, \alpha)$  (5)
When using first-order differentiable methods, this bilevel optimization is solved by alternately “locking” one set of parameters and updating the other with gradient descent. Second-order optimization methods, which involve calculation of the Hessian, are also possible and slightly better in terms of accuracy, but at significant computational cost. However, it is possible to approximate the second-order optimization with reduced computational cost [12].
Our method extends PDARTS to discover neural architectures biased toward the satisfaction of resource constraints. We do this by including one or more “expected resource cost” loss terms. As mentioned previously, each of the primitive operations in a mixed operation is associated with a unique architecture parameter. PDARTS uses 14 mixed operations in the search phase of cell architecture discovery, and there are eight primitive operations per mixed operation, so there are $14 \times 8 = 112$ architecture parameters total.
The expected value of a single mixed operation was given in Equation 3. We temporarily make the index of the mixed operation explicit here for clarity:

$\bar{o}^{(m)}(x) = \sum_{i=1}^{8} p_i^{(m)} \, o_i^{(m)}(x)$  (6)

where $m$ is the mixed operation index. Note here that the probability distributions, $p^{(m)}$, are now tied to a particular mixed operation. This calculation is equivalent to the addition node in Figure 2.
As introduced in ProxylessNAS, the probabilities used in the mixed operation calculation are also conducive to calculating the expected value of various resource costs. For example, if there is a cost function $C$ that takes as input the description of each primitive operation (including the input feature map dimension information) and outputs a resource cost, it may be used to calculate an expected resource cost of the mixed operation:

$\mathbb{E}\big[\text{cost}^{(m)}\big] = \sum_{i=1}^{8} p_i^{(m)} \, C\big(o_i^{(m)}\big)$  (7)
The cost function may be an analytical function, e.g. number of bytes required by the model, or the cost function could be based on a simulation or a surrogate model trained from data collected from a physical device.
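Equation 7 can be sketched directly: the same softmax distribution that mixes primitive outputs weights each primitive's cost. The per-operation costs below are invented parameter counts for illustration, not measurements of any real device or of the PDARTS operation set:

```python
import numpy as np

def softmax(alpha):
    e = np.exp(alpha - alpha.max())
    return e / e.sum()

alpha = np.array([2.0, 0.0, -1.0])          # architecture parameters
op_costs = np.array([9408.0, 0.0, 1312.0])  # hypothetical C(o_i) values

p = softmax(alpha)
expected_cost = float(p @ op_costs)  # E[cost] = sum_i p_i * C(o_i)
```

Because the expected cost is a convex combination of the per-operation costs, it always lies between the cheapest and most expensive primitive, and shifting probability mass toward cheap primitives lowers it smoothly.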
The expected cost of the mixed operation is differentiable with respect to the mixed operation’s architecture parameters. Accordingly, the partial derivative of the expected resource cost with respect to architecture parameter $\alpha_j$ is given as:

$\frac{\partial \, \mathbb{E}[\text{cost}]}{\partial \alpha_j} = \sum_{i=1}^{8} C_i \frac{\partial p_i}{\partial \alpha_j} = \sum_{i=1}^{8} C_i \, p_i \, (\delta_{ij} - p_j)$  (8)

where we have abbreviated $C(o_i)$ as $C_i$, $\delta_{ij} = 1$ if $i$ equals $j$ and $\delta_{ij} = 0$ otherwise, and we have dropped the mixed operation index for brevity.
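After substituting the softmax derivative, Equation 8 collapses to $p_j (C_j - \mathbb{E}[\text{cost}])$. The sketch below verifies that closed form against a central-difference approximation; the alpha and cost values are arbitrary illustrative choices:

```python
import numpy as np

def softmax(a):
    e = np.exp(a - a.max())
    return e / e.sum()

def expected_cost(a, C):
    return softmax(a) @ C

def grad_expected_cost(a, C):
    # Equation 8: sum_i C_i p_i (delta_ij - p_j) = p_j (C_j - E[cost]).
    p = softmax(a)
    return p * (C - p @ C)

a = np.array([0.5, -1.0, 2.0])
C = np.array([3.0, 1.0, 7.0])

analytic = grad_expected_cost(a, C)

# Central differences in each coordinate direction.
eps = 1e-6
numeric = np.array([
    (expected_cost(a + eps * np.eye(3)[j], C)
     - expected_cost(a - eps * np.eye(3)[j], C)) / (2 * eps)
    for j in range(3)
])
```

One useful consequence of the closed form: the gradient components sum to zero, so the resource penalty only redistributes probability mass among primitives rather than shrinking all architecture parameters at once.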
We denote the sum of expected mixed operation costs as:

$C_k = \sum_{m=1}^{14} \mathbb{E}\big[\text{cost}_k^{(m)}\big]$  (9)

Note that unique $k$ correspond to unique resource costs, e.g. $C_1$ could be the sum of expected mixed operation parameter sizes, and $C_2$ could be the sum of expected mixed operation latencies.
We denote the sum of the classification and resource losses as:

$L_{\text{total}} = L_{\text{CE}} + \sum_{k=1}^{K} \lambda_k C_k$  (10)

where $K$ is the number of resource costs to satisfy, and $\lambda_k$ is the resource-cost hyperparameter that controls how important resource cost $C_k$ is relative to accuracy, as well as to the other resource costs.
The bilevel optimization in Equation 5 may now be slightly rewritten as:

$\min_{\alpha} \; J'_{\alpha}\big(w^{*}(\alpha), \alpha\big) \quad \text{s.t.} \quad w^{*}(\alpha) = \arg\min_{w} \; J_{w}(w, \alpha)$  (11)

where only $J_{\alpha}$ has been replaced by $J'_{\alpha}$, which incorporates the resource losses of Equation 10. As before, this may be optimized using first- or second-order approaches. For intuition on the continued use of a single loss function $L_{\text{total}}$, consider Figure 4. Under the assumption that a change in network parameters creates no change in cost (given a fixed input feature map and primitive operation), the gradient of $C_k$ with respect to $w$ is zero. On the other hand, a change in architecture parameters creates a change in both the classification loss and $C_k$. So calculating the gradient of $L_{\text{total}}$ with respect to both $w$ and $\alpha$ results in the correct values.
Using the method above, we created Resource-Aware PDARTS (RAPDARTS). Practically, the modification to PDARTS requires the total expected resource cost to be returned during the forward pass of an input tensor. To achieve this, during calculation of each mixed operation (Equation 6), we also calculate the expected resource cost (Equation 7). The expected cost for all mixed operations is accumulated (Equation 9) and added to the classification loss (Equation 10). If multiple costs are required, e.g. model size and latency, each cost requires its own version of Equation 7 and must be accumulated separately from the other costs.
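A sketch of how the total loss is assembled across mixed operations (Equations 9 and 10). The alpha values, cost tables, lambda weights, and cross-entropy stand-in below are randomly generated placeholders; in the real search the costs come from the cost function $C$ and the classification loss from the network output:

```python
import numpy as np

def softmax(a):
    e = np.exp(a - a.max())
    return e / e.sum()

num_mixed_ops, num_primitives = 14, 8
rng = np.random.default_rng(0)

# One row of architecture parameters per mixed operation.
alphas = rng.normal(size=(num_mixed_ops, num_primitives))

# One cost table per resource, e.g. parameter counts and latencies.
cost_tables = {
    "params": rng.uniform(0.0, 1e5, size=(num_mixed_ops, num_primitives)),
    "latency_ms": rng.uniform(0.0, 5.0, size=(num_mixed_ops, num_primitives)),
}
lambdas = {"params": 1e-6, "latency_ms": 1e-2}  # hypothetical weights

def total_expected_cost(alphas, table):
    # Equation 9: sum over mixed operations of Equation 7.
    return sum(softmax(a) @ c for a, c in zip(alphas, table))

ce_loss = 0.42  # stand-in for the cross-entropy classification loss

# Equation 10: classification loss plus weighted resource losses.
total_loss = ce_loss + sum(
    lambdas[k] * total_expected_cost(alphas, cost_tables[k])
    for k in cost_tables
)
```

Keeping each resource's accumulator separate, as the dictionary does here, is what allows each $\lambda_k$ to weight its resource independently.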
IV. Experiments and Results
We use RAPDARTS to search for CIFAR-10 neural architectures. We follow the architecture discovery algorithm of PDARTS and search for cell architectures containing the same primitive operations as used by DARTS and PDARTS, namely:

- Zero*
- Skip-Connect*
- 3×3 Avg-Pool*
- 3×3 Max-Pool*
- 3×3 Separable Conv.
- 5×5 Separable Conv.
- 3×3 Dilated Conv.
- 5×5 Dilated Conv.
All of the above primitive operations are standard convolutional network operations except Zero, which allows a cell to learn not to pass information. Skip-connect is a parameter-free operation which allows information to pass through the mixed operation without modification. Parameter-free primitive operations are marked with an asterisk.
In an effort to simulate a real-world constraint, we restrict ourselves such that discovered CIFAR-10 architectures must have fewer than 3 M parameters. This constrained optimization problem may be captured as:
$\min_{w, \alpha} \; J(w, \alpha) \quad \text{subject to} \quad C_1(\alpha) < 3 \times 10^{6}$  (12)

where $C_1$ is the expected number of model parameters.
We perform NAS adhering to this constraint using the RAPDARTS framework above.
For the purpose of baseline calculations, we first consider the unconstrained results from PDARTS. The authors of PDARTS provided a reference architecture discovered through their algorithm [20]. We trained and evaluated that architecture eight times using the latest version of the PDARTS code [21]. We then used the results from the repeated training to obtain performance statistics of the published architecture.
We use these repeated trainings to characterize the error distribution of the published architecture on the CIFAR-10 validation dataset. Additionally, the published PDARTS architecture requires 3.4 M parameters.
We then executed the PDARTS architecture search code four times to test the ability to rediscover architectures with the performance of the published architecture. The four searches resulted in nine architectures. However, per the PDARTS algorithm, we eliminated one architecture with more than two skip-connections in the normal cell (see the PDARTS paper for details on the two cell types).
None of the eight valid architectures were the same as the official PDARTS CIFAR-10 architecture, but this is not surprising given the size of the PDARTS architecture search space. Because of this, we compare our results to the statistics of the various architectures discovered during our search, instead of the statistics of the single published architecture. We trained the resulting models and recorded their CIFAR-10 errors and parameter counts; even the smallest of these PDARTS models exceeded 3 M parameters.
We now explore the impact of different hyperparameter values on the unconstrained, multi-objective version of Equation 12:

$\min_{w, \alpha} \; J(w, \alpha) + \lambda_1 C_1(\alpha)$  (13)

where $C_1$ is the sum of expected numbers of parameters in the model. As introduced in Equation 10, the scalar $\lambda_1$ is a hyperparameter which determines the relative importance of the resource cost explicitly, and the relative importance of the accuracy of the network implicitly.
TABLE I: CIFAR-10 (C10) test error, model size, and search cost for architectures with parameter counts near or below 3 M.

Architecture                              | C10 Test Err Best (%) | C10 Test Err Avg (%) | Params (M) | Search Cost (GPU-days) | Search Method
AmoebaNet-B + cutout [22]                 | N/A                   |                      | 2.8        | 3150                   | evolution
ASHA [23]                                 | 2.85                  |                      | 2.2        | 9                      | random
DARTS [12]                                | 2.94                  | N/A                  | 2.9        | 0.4                    | gradient-based
DSO-NAS [24]                              |                       | N/A                  | 3.0        | 1                      | gradient-based
SNAS + moderate constraint + cutout [17]  | 2.85                  | N/A                  | 2.3        | 1.5                    | gradient-based
RAPDARTS + cutout (ours)                  | 2.68                  |                      | 2.8        | 12                     | gradient-based
As stated in this section’s introduction, our self-imposed resource budget is 3 M parameters. The default PDARTS search does not generate models that small; however, by using RAPDARTS we are able to satisfy this constraint. To achieve this, we need to discover a $\lambda_1$ value to guide the architecture search. That is accomplished by finding a coarse range of suitable $\lambda_1$ values and then identifying a refined $\lambda_1$.
The coarse $\lambda_1$ range is identified by performing various architecture searches with $\lambda_1$ sampled randomly from a uniform distribution. Each search requires 0.3 GPU-days.
Results from the coarse search are shown in Figure 5. Beyond a threshold value of $\lambda_1$, architectures begin to be generated which meet the parameter count constraint. Parameter counts reduce dramatically as $\lambda_1$ grows, but we have observed that models with higher capacity tend to perform better than models with lower capacity, so it is unlikely that architectures derived from the largest $\lambda_1$ values are preferred over those closer to the 3 M parameter threshold.
Figure 6 “zooms in” on the previous figure, focusing on $\lambda_1$ sampled uniformly from a narrower range. Near the upper end of that range, architectures are generated that often require less than 3 M parameters.
One final search is then performed with $\lambda_1$ sampled uniformly from the refined range. This test resulted in 48 valid architectures with resulting models between 2.1 M and 2.96 M parameters. We then trained the 16 largest resulting architectures. The best resulting model achieved 2.68% CIFAR-10 validation error and required 2.8 M parameters. The results for all 16 trained models are plotted in Figure 7. As can be seen, there is no linear relationship at this scale between parameter count and CIFAR-10 accuracy. For statistical confidence, we retrained the best model eight times with different seeds to obtain validation error statistics.
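The coarse-to-fine $\lambda_1$ sweep can be sketched as follows. Here `run_search` is a hypothetical stand-in for one full RAPDARTS search returning the discovered model's parameter count; its exponential response and all numeric ranges are synthetic so the example is self-contained:

```python
import numpy as np

rng = np.random.default_rng(1)

def run_search(lam):
    # Hypothetical stand-in for a full RAPDARTS search: larger
    # penalties yield smaller models, plus some run-to-run noise.
    return 4.5e6 * np.exp(-2.0e6 * lam) * (1.0 + 0.05 * rng.normal())

budget = 3.0e6  # 3 M parameter budget

# Coarse phase: sample lambda_1 broadly and record feasible values.
coarse = rng.uniform(0.0, 5.0e-6, size=20)
feasible = [lam for lam in coarse if run_search(lam) < budget]

# Fine phase: resample near the smallest feasible penalty, since
# higher-capacity models just under the budget tend to perform best.
lo = min(feasible)
fine = rng.uniform(0.8 * lo, 1.2 * lo, size=20)
```

The design choice mirrors the text: refine toward the weakest penalty that still satisfies the budget, so the final models sit just under the 3 M threshold rather than far below it.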
The discovered cells corresponding to the 2.68% CIFAR-10 validation error are shown in Figure 8. The DARTS-based algorithms use two cell types: a “normal” cell, which maintains input and output feature map dimensionality, and a “reduce” cell, which decreases the output feature map dimensionality.
The cell architectures discovered by RAPDARTS are noteworthy in several respects. First, the normal cell has a “deep” design, similar to that discovered by PDARTS, but only lightweight convolutional operations are used. Second, all pooling operations have been moved to the reduce cells.
Table I compares the RAPDARTS architecture with the performance of recent architectures with parameter counts less than 3 M. RAPDARTS competes favorably with the others.
We report the actual number of hours spent searching for our winning architecture, not merely the search time for a single architecture. Including both the coarse- and fine-search phases, 40 different $\lambda_1$ values were used. This took a total of 12 GPU-days to compute.
We trained 16 of the fine-search phase models to completion. Each model required less than 20 hours to train, so the 16 fine-search models took less than 14 GPU-days total to train. All experiments were performed using an NVIDIA V100 GPU.
V. Conclusion and Future Work
Classification accuracy achieved by neural architecture search methods now surpasses that of hand-designed neural models. First-generation NAS methods include those based on evolutionary search and reinforcement learning. Second-generation NAS methods use gradient-based optimization. In this work we presented RAPDARTS, which augments a popular gradient-based NAS method with the ability to target neural architectures meeting specified resource constraints. We used RAPDARTS to identify a neural architecture achieving 2.68% test error on CIFAR-10. This is competitive with existing results for models with fewer than 3 M parameters.
We believe third-generation methods will be gradient-based and will attempt to make more aspects of the search differentiable. For example, the PDARTS (and RAPDARTS) search begins with five cells, then grows the search network to 11 cells, and finally to 17 cells. At the same time, as the network grows, less important primitive operations are dropped. The “gradual” adjustments introduced by this technique enable architecture parameters learned by gradient descent in one phase to remain useful in the next. It would be preferable to make these changes even more gradually. We leave that for future work.
In conclusion, we have presented an example that optimizes two objectives: minimizing accuracy loss while keeping the number of model parameters below a resource constraint threshold. A limitation of our work is that the number of parameters required by our discovered models may not optimize other constraints, e.g. minimum latency. To address this concern, future work will focus on multiple resource constraints guided by more hardwarespecific costs.
Acknowledgment
Sandia National Laboratories is a multimission laboratory managed and operated by National Technology and Engineering Solutions of Sandia, LLC., a wholly owned subsidiary of Honeywell International, Inc., for the U.S. Department of Energy’s National Nuclear Security Administration under contract DENA0003525.
The views expressed in the article do not necessarily represent the views of the U.S. Department of Energy or the United States Government.
References
 [1] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification with deep convolutional neural networks,” in Advances in neural information processing systems, 2012, pp. 1097–1105.
 [2] K. Simonyan and A. Zisserman, “Very deep convolutional networks for largescale image recognition,” arXiv preprint arXiv:1409.1556, 2014.
 [3] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 770–778.
 [4] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich, “Going deeper with convolutions,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2015, pp. 1–9.
 [5] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna, “Rethinking the inception architecture for computer vision,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 2818–2826.
 [6] C. Szegedy, S. Ioffe, V. Vanhoucke, and A. A. Alemi, “Inceptionv4, inceptionresnet and the impact of residual connections on learning,” in ThirtyFirst AAAI Conference on Artificial Intelligence, 2017.
 [7] B. Zoph and Q. V. Le, “Neural architecture search with reinforcement learning,” arXiv preprint arXiv:1611.01578, 2016.
 [8] A. Krizhevsky, V. Nair, and G. Hinton, “CIFAR-10 (Canadian Institute for Advanced Research).” [Online]. Available: http://www.cs.toronto.edu/~kriz/cifar.html
 [9] E. Real, S. Moore, A. Selle, S. Saxena, Y. L. Suematsu, J. Tan, Q. V. Le, and A. Kurakin, “Large-scale evolution of image classifiers,” in Proceedings of the 34th International Conference on Machine Learning, Volume 70. JMLR.org, 2017, pp. 2902–2911.
 [10] N. P. Jouppi, C. Young, N. Patil, D. Patterson, G. Agrawal, R. Bajwa, S. Bates, S. Bhatia, N. Boden, A. Borchers et al., “Indatacenter performance analysis of a tensor processing unit,” in 2017 ACM/IEEE 44th Annual International Symposium on Computer Architecture (ISCA). IEEE, 2017, pp. 1–12.
 [11] X. Chen, L. Xie, J. Wu, and Q. Tian, “Progressive differentiable architecture search: Bridging the depth gap between search and evaluation,” arXiv preprint arXiv:1904.12760, 2019.
 [12] H. Liu, K. Simonyan, and Y. Yang, “Darts: Differentiable architecture search,” arXiv preprint arXiv:1806.09055, 2018.
 [13] S. C. Smithson, G. Yang, W. J. Gross, and B. H. Meyer, “Neural networks designing neural networks: multiobjective hyperparameter optimization,” in Proceedings of the 35th International Conference on ComputerAided Design. ACM, 2016, p. 104.
 [14] T. Elsken, J. H. Metzen, and F. Hutter, “Efficient multiobjective neural architecture search via lamarckian evolution,” arXiv preprint arXiv:1804.09081, 2018.
 [15] Z. Lu, I. Whalen, V. Boddeti, Y. Dhebar, K. Deb, E. Goodman, and W. Banzhaf, “Nsganet: neural architecture search using multiobjective genetic algorithm,” in Proceedings of the Genetic and Evolutionary Computation Conference. ACM, 2019, pp. 419–427.
 [16] M. Wistuba, A. Rawat, and T. Pedapati, “A survey on neural architecture search,” arXiv preprint arXiv:1905.01392, 2019.
 [17] S. Xie, H. Zheng, C. Liu, and L. Lin, “Snas: stochastic neural architecture search,” arXiv preprint arXiv:1812.09926, 2018.
 [18] H. Cai, L. Zhu, and S. Han, “Proxylessnas: Direct neural architecture search on target task and hardware,” arXiv preprint arXiv:1812.00332, 2018.
 [19] Z. Guo, X. Zhang, H. Mu, W. Heng, Z. Liu, Y. Wei, and J. Sun, “Single path oneshot neural architecture search with uniform sampling,” arXiv preprint arXiv:1904.00420, 2019.
 [20] “PDARTS published CIFAR-10 genotype,” https://github.com/chenxin061/pdarts/blob/b1575e101aedb7396a89d8a7f74d0318877a1156/genotypes.py, accessed: 2019-10-24.
 [21] “PDARTS source code,” https://github.com/chenxin061/pdarts/tree/05addf3489b26edcf004fc4005bbc110b56e0075, accessed: 2019-10-24.
 [22] E. Real, A. Aggarwal, Y. Huang, and Q. V. Le, “Regularized evolution for image classifier architecture search,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, 2019, pp. 4780–4789.
 [23] L. Li and A. Talwalkar, “Random search and reproducibility for neural architecture search,” arXiv preprint arXiv:1902.07638, 2019.
 [24] X. Zhang, Z. Huang, and N. Wang, “You only search once: Single shot neural architecture search via direct sparse optimization,” arXiv preprint arXiv:1811.01567, 2018.