MS-RANAS: Multi-Scale Resource-Aware Neural Architecture Search
Neural Architecture Search (NAS) has proved effective in producing alternatives that outperform handcrafted neural networks. In this paper we analyse the benefits of NAS for image classification tasks under strict computational constraints. Our aim is to automate the design of highly efficient deep neural networks that offer fast and accurate predictions and can be deployed on a low-memory, low-power system-on-chip. The task thus becomes a three-party trade-off between accuracy, computational complexity, and memory requirements. To address this concern, we propose Multi-Scale Resource-Aware Neural Architecture Search (MS-RANAS). We employ a one-shot architecture search approach in order to obtain a reduced search cost and we focus on an anytime prediction setting. Through the usage of multiple-scaled features and early classifiers, we achieve state-of-the-art results in terms of accuracy-speed trade-off.
Tens of thousands of images are captured every second by the world’s population, which has led to an increased interest in image classification tasks. Subsequently, remarkable progress has been made in designing systems that can achieve satisfactory results on object recognition benchmark datasets, such as ImageNet or COCO. Nowadays, a paradigm shift is under way regarding where the inference process takes place. Specifically, instead of using multi-node clusters for simple object recognition tasks, more applications [19, 10] are designed following the edge computing principle, which brings data storage and computation closer together, thereby saving bandwidth and improving response time.
In an edge computing setting, three main metrics should be taken into account when evaluating a classification system, namely the accuracy in solving a specific task, the computational latency, and the on-chip memory size of the deployed model. Only by considering these three metrics can one create a model able to offer fast and accurate predictions under low-memory, low-power constraints.
The design of efficient, accurate neural networks is a challenging task for computer vision experts. Choosing an adequate network architecture has proven to be crucial for the system’s performance and, when one has to take into account multiple metrics, the process becomes all the more laborious. Because of this, a rapidly growing research topic addresses the development of algorithms capable of automating the discovery of optimal architectures, namely neural architecture search (NAS) algorithms.
Our main focus is the trade-off between accuracy and latency. We therefore address the problem of anytime prediction, that is, the network’s ability to offer predictions at any given point. We propose Multi-Scale Resource-Aware Neural Architecture Search (MS-RANAS), a novel approach towards obtaining efficient neural networks, capable of successfully accomplishing the aforementioned tasks. The tedious challenge of handcrafting the structure of the network is alleviated through the usage of NAS algorithms, while the quality and complexity of the features passed throughout the network are preserved by introducing a multi-scale structure. Lastly, the latency issue is managed through the usage of early exits.
2 Related work
2.1 Neural architecture search
Several approaches have been proposed to automate the generation of robust neural networks. These include reinforcement learning [26, 1], evolutionary algorithms, and surrogate model-based optimisation [11, 15]. Nevertheless, their prolonged search time can be infeasible in particular scenarios; we are thus compelled to use a different method.
One-shot architecture search is defined by Wistuba et al. as a system which, during the entire search process, trains a single neural network, used afterwards to derive different architectures throughout the search space as possible candidates for the imposed optimisation problem. The neural architecture search space represents a subspace that includes all the nodes and the set of operations that have to be applied by each node in order to compute its output. As certain constraints can be imposed on the final architecture, the concept of search space can be restricted to the set of feasible solutions such a NAS technique may return. The search space can be divided into two classes: a global search space, where a graph is used to define the entire architecture, and a cell-based search space, in which several cells are replicated throughout the architecture to form the final network. Empirical evidence provided by Pham et al. shows that the cell-based approach outperforms the global one both in terms of accuracy and in number of parameters.
Subsequently, Liu et al. proposed differentiable architecture search (DARTS, see Figure 1). As opposed to previously researched techniques, which involved applying reinforcement learning or using evolutionary algorithms over a discrete, non-differentiable search space, DARTS introduces a continuous relaxation of the architecture representation, thus favouring the use of gradient descent for a more efficient search of the network architecture. The relatively short GPU time necessary to discover an efficient architecture (i.e. 4 GPU days) and its high accuracy, together with the comparably small number of parameters and the modularity of the structure per se, which results from the cell-based search space approach, are the main reasons for choosing this method as the foundation of our work.
2.2 Latency reduction
Recent hardware and algorithmic advances have stimulated the design of networks with increased depths, with the end goal of obtaining a better performance. The result was an explosion in the number of layers, and implicitly in the number of parameters, from VGG-16, which achieved 74.4% accuracy on the ImageNet dataset with 138 million parameters, to ResNeXt-101 32x48d, with 85.4% accuracy using 829 million parameters. Apart from the on-chip memory that these fast growing networks take up, one has to take into account other factors as well, such as the latency and the energy required for feedforward inference. In many real world scenarios, and especially in an edge computing setting, the end-user is willing to trade several percentage points of accuracy in order to consume less energy and to have a shorter runtime.
Different methods have been used to alleviate the latency issue and, inherently, to reduce the power consumption. Whilst initially introduced to mitigate the vanishing gradient problem, the method of integrating skip connections (i.e. dynamically adjusting which layers of a network are to be executed during inference) was successfully used to reduce the inference time as well [7, 6, 22]. While these models successfully address the latency issue, they generate an overhead in terms of network parameters; therefore, a different approach has to be taken in order to obtain a satisfactory trade-off between accuracy, latency, and memory.
The second approach aimed at reducing the latency in a deep neural network is represented by early exits. This assumes the placement of additional classifiers after intermediate layers of the network, thus allowing the system to finish the execution before reaching the final layer. Consequently, the final classifier is expected to be reached only for complex data, which results in a reduced computational effort. Huang et al. presented Multi-Scale Dense Networks (MSDNet), shown in Figure 2, where they follow the principle of anytime inference. Their work addresses a relevant issue of early exits, namely the fact that the first layers of a deep neural network extract fine-grained features, whilst the deep ones operate with coarse-grained features, so the scale of the features varies throughout the network. In order to alleviate this problem, the authors propose maintaining features of all scales at any point in the network. By always having both fine- and coarse-grained features to learn from, classifiers can use the coarse-level ones to make their predictions, whilst the others can be used as the main source of knowledge propagation. Early exits are used as well in the work of Zhang et al., who proposed Graph HyperNetwork (GHN). Their approach relies on neural architecture search, as opposed to the handcrafted network presented by Huang et al. A mixture of graph neural networks and hypernetworks is used, thus allowing for the generation of weights based on the computation graph representation of the architecture.
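The early-exit principle described above can be illustrated with a minimal Python sketch. It assumes that every intermediate classifier returns a class-probability vector and that inference stops at the first exit whose top probability reaches a threshold. The threshold rule, the function name `early_exit_predict`, and the toy probabilities are illustrative assumptions rather than details of MSDNet or GHN.

```python
from typing import Sequence, Tuple


def early_exit_predict(
    exit_probs: Sequence[Sequence[float]], threshold: float
) -> Tuple[int, int]:
    """Return (predicted_class, exit_index) for one sample.

    exit_probs[k] is the class-probability vector of the k-th classifier.
    In a real network each vector would only be computed if the previous
    exits were not confident enough; here they are precomputed (assumption).
    """
    if not exit_probs:
        raise ValueError("at least one classifier is required")
    last = len(exit_probs) - 1
    for k, probs in enumerate(exit_probs):
        best = max(range(len(probs)), key=lambda c: probs[c])
        # Stop at the first sufficiently confident exit; the last exit
        # always answers.
        if probs[best] >= threshold or k == last:
            return best, k
    raise AssertionError("unreachable")


# Hypothetical outputs of three exits for an "easy" and a "hard" sample.
easy = [[0.05, 0.90, 0.05], [0.02, 0.96, 0.02], [0.01, 0.98, 0.01]]
hard = [[0.40, 0.35, 0.25], [0.30, 0.45, 0.25], [0.20, 0.60, 0.20]]

print(early_exit_predict(easy, threshold=0.8))  # stops at the first exit
print(early_exit_predict(hard, threshold=0.8))  # falls through to the last exit
```

Only the hard sample pays for the full depth of the network, which is where the reduction in average computational effort comes from.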
2.3 Model compression
As discussed in subsection 2.2, the goal of designing high-accuracy networks led to deeper and deeper architectures. The clear downside of this trend is the rise of memory-expensive and computationally intensive models. Nevertheless, several methods can be applied to help compress the models while simultaneously preserving their performance. According to Cheng et al., these techniques can be divided into four classes: (1) parameter pruning and sharing based methods, which exploit the redundancy of the model parameters by removing the redundant and unimportant ones, (2) low-rank factorisation based methods, which use tensor decomposition to estimate the valuable parameters of the network, (3) the transfer or the compression of convolutional filters, aimed at reducing the parameter space, and (4) knowledge distillation techniques, which allow a larger model to be reproduced using a shallower one. Additionally, we have to mention quantisation, the reduction in the number of bits of a network’s parameters, which reduces both bandwidth and storage requirements with little to no drop in performance.
3 Multi-Scale Resource-Aware Neural Architecture Search
3.1 Differentiable architecture search
Following Liu et al., we propose the search of a computation cell as a building block that is replicated throughout the final architecture. From that perspective, a cell is a directed acyclic graph comprising an ordered set of nodes, a description that follows Bauer’s formalisation.
Each node $x^{(i)}$ represents a feature map and is associated with an operation $o^{(i,j)}$ that acts as a directed edge between the node that will be processed (i.e. $x^{(i)}$) and the node’s child (i.e. $x^{(j)}$). Instead of assuming that each operation from the search space is either part of the final network or it is not (i.e. a binary weight in $\{0, 1\}$), Liu et al. propose a relaxation of this assumption, namely they consider a linearly weighted combination in which the weight of each operation can take any real value in the range $[0, 1]$, which allows for a soft decision on each path. Considering $\mathcal{O}$ as the set of candidate operations, where each operation $o(\cdot)$ is a function that will be applied on $x^{(i)}$, the relaxation of the categorical choice of a specific operation to a softmax over all possible operations leads to the following equation:

$$\bar{o}^{(i,j)}(x) = \sum_{o \in \mathcal{O}} \frac{\exp\left(\alpha_o^{(i,j)}\right)}{\sum_{o' \in \mathcal{O}} \exp\left(\alpha_{o'}^{(i,j)}\right)} \, o(x) \qquad (1)$$

The softmax performed in Equation (1) ensures that, for every edge $(i, j)$, the weights of the candidate operations are non-negative and sum to one.
Therefore, the task of searching for a specific architecture can be reduced to learning a set of continuous variables $\alpha = \{\alpha^{(i,j)}\}$, as shown in Figure 1. Thus, we can consider $\alpha$ to be the encoding of the architecture, or the architecture itself.
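To make the continuous relaxation concrete, the following minimal Python sketch computes a softmax-weighted mixture of candidate operations on a single edge and then derives a discrete choice from the learned variables. The three toy operations, the names `mixed_op` and `discretise`, and the simple arg-max selection are assumptions made for illustration; the real search space contains convolutions and pooling operations, and the architecture variables are learned by gradient descent rather than set by hand.

```python
import math
from typing import Callable, Dict, List

Vector = List[float]


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax; the outputs are positive and sum to one."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Toy candidate operations standing in for convolutions, pooling, etc.
CANDIDATE_OPS: Dict[str, Callable[[Vector], Vector]] = {
    "identity": lambda x: list(x),
    "zero": lambda x: [0.0 for _ in x],
    "double": lambda x: [2.0 * v for v in x],
}


def mixed_op(x: Vector, alpha: Dict[str, float]) -> Vector:
    """Softmax-weighted sum of every candidate operation applied to x."""
    names = list(CANDIDATE_OPS)
    weights = softmax([alpha[n] for n in names])
    out = [0.0] * len(x)
    for name, w in zip(names, weights):
        for i, v in enumerate(CANDIDATE_OPS[name](x)):
            out[i] += w * v
    return out


def discretise(alpha: Dict[str, float]) -> str:
    """Simplified final step: keep the operation with the largest variable."""
    return max(alpha, key=lambda name: alpha[name])


uniform = {"identity": 0.0, "zero": 0.0, "double": 0.0}
print(mixed_op([1.0, 2.0], uniform))  # each operation contributes one third
print(discretise({"identity": 0.1, "zero": -1.0, "double": 0.7}))
```

Because the mixture is differentiable with respect to the architecture variables, they can be optimised jointly with the network weights, which is what makes the search gradient-based.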
3.2 Multi-scale feature maps
As we presented in subsection 2.2, the memory-favourable approach to reducing the latency of a network is the introduction of early exits, which yields accurate-enough predictions with a short runtime and traverses the entire set of layers only for complex inputs. Given the results obtained by Huang et al. through the usage of multi-scaled feature maps, we implement a similar technique in the current work. Namely, we use up to three scales, the largest one having the same size as the input, whilst the height and the width of each subsequent scale are halved in comparison to its predecessor. The convolution of features of similar size from one layer to another is hereinafter named same-scale convolution or horizontal convolution. Secondly, in order to go from a fine-level feature to a coarser one, we apply a strided convolution, referred to as diagonal convolution. Additionally, we use the term horizontal level, or simply level, to denote all the equally-sized feature maps throughout the entire network. For the upper level, where the features have the same size as the input, we use only same-scale convolutions. For all the other levels, the input of an MS-RANAS cell (i.e. a layer derived using our neural architecture search technique) is a concatenation of a diagonal convolution of the previous larger-scale feature and a horizontal convolution of the previous same-scale feature. In addition, the first layer is used only to generate all the scales, through a strided convolution known as a vertical convolution. In short, horizontal convolutions help preserve and feed forward high-resolution information, which allows for the creation of relevant coarse-grained features in deep layers, whilst the diagonal convolutions facilitate the access of classifiers to both fine- and coarse-level features. An overview of MS-RANAS can be observed in Figure 3.
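The wiring described in this subsection can be summarised by a short Python sketch that lists the spatial size of each level and the connections feeding a cell at a given layer and scale. The 0-based indexing, the function names, and the 32-pixel input size are assumptions chosen for illustration; the sketch only enumerates connections and performs no convolution.

```python
from typing import List, Tuple


def scale_sizes(input_size: int, num_scales: int) -> List[int]:
    """Spatial size per level: the first matches the input, each next is halved."""
    return [input_size // (2 ** s) for s in range(num_scales)]


def cell_inputs(layer: int, scale: int) -> List[Tuple[str, int, int]]:
    """List (connection type, source layer, source scale) for one position.

    Layers and scales are 0-indexed. Layer 0 only generates the scales;
    a source layer of -1 denotes the input image.
    """
    if layer == 0:
        # Scale 0 comes from the image; coarser scales come from a strided
        # ("vertical") convolution of the scale directly above.
        return [("input", -1, 0)] if scale == 0 else [("vertical", 0, scale - 1)]
    inputs = [("horizontal", layer - 1, scale)]
    if scale > 0:
        # Coarser levels also receive a strided ("diagonal") convolution of
        # the previous layer's finer-scale feature map.
        inputs.insert(0, ("diagonal", layer - 1, scale - 1))
    return inputs


print(scale_sizes(32, 3))   # [32, 16, 8]
print(cell_inputs(2, 0))    # [('horizontal', 1, 0)]
print(cell_inputs(2, 1))    # [('diagonal', 1, 0), ('horizontal', 1, 1)]
```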
3.3 Learning procedure
The final network is composed of stacked identical cells, each of them learning its own parameters. Unless stated otherwise, we use four intermediate nodes, each with two inputs, whose output is a simple addition of the inputs. The sole purpose of these nodes is to create the context for intermediate operations, thus allowing a variety of possible interactions of the inputs from one layer to another. The complete set of operations that can take place between the inputs is known in the literature as the NASNet search space. The reason behind using depth-wise separable convolutions (i.e. a depth-wise convolution, followed by a point-wise convolution) and dilated convolutions (i.e. a convolution with a larger receptive field) lies in their lower number of multiplications and additions compared with the common 2D convolution. The final output is obtained as a concatenation of all intermediate nodes’ results.
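The saving offered by depth-wise separable convolutions can be checked with a small back-of-the-envelope computation. The sketch below counts multiply-accumulate operations for a standard and a separable convolution, assuming stride 1 and "same" padding; the layer dimensions are hypothetical, and the dilated case is not covered.

```python
def standard_conv_macs(height: int, width: int, c_in: int, c_out: int, kernel: int) -> int:
    """Multiply-accumulates of a standard convolution (stride 1, same padding)."""
    return height * width * c_in * c_out * kernel * kernel


def separable_conv_macs(height: int, width: int, c_in: int, c_out: int, kernel: int) -> int:
    """Depth-wise convolution followed by a 1x1 point-wise convolution."""
    depthwise = height * width * c_in * kernel * kernel
    pointwise = height * width * c_in * c_out
    return depthwise + pointwise


# Hypothetical layer: 32x32 feature map, 16 input and output channels, 3x3 kernel.
standard = standard_conv_macs(32, 32, 16, 16, 3)
separable = separable_conv_macs(32, 32, 16, 16, 3)
print(standard, separable, round(separable / standard, 3))  # 2359296 409600 0.174
```

The ratio equals 1/c_out + 1/kernel², so the relative saving grows with the number of output channels and with the kernel size.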
During both the search for an architecture and its training, we use the cross-entropy loss function for all the classifiers. Our goal is to minimise the weighted cumulative loss:

$$\mathcal{L} = \frac{1}{|\mathcal{D}|} \sum_{(x, y) \in \mathcal{D}} \sum_{k} w_k \, L\left(f_k(x), y\right)$$

Here, $\mathcal{D}$ refers to the training set and $w_k$ is the weight of the $k$-th classifier, whose output is denoted $f_k(x)$, while $L$ is the cross-entropy loss. Intuitively, having a larger weight for the early classifiers would lead to more precise results in the first layers. Yet, given the conclusions of Huang et al. and the fact that we want to limit the constraints imposed on the searching algorithm, we decided to have an equal weight for all the predictors.
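As a worked illustration of the weighted cumulative loss, the Python sketch below averages, over a batch, the weighted sum of the cross-entropy losses of all classifiers. The data layout (one probability vector per exit and sample), the function names, and the toy values are assumptions; equal weights correspond to the setting adopted here.

```python
import math
from typing import List, Sequence, Tuple

# One sample: (probability vector of each classifier, ground-truth label).
Sample = Tuple[List[List[float]], int]


def cross_entropy(probs: Sequence[float], label: int) -> float:
    """Cross-entropy of one prediction given the index of the true class."""
    return -math.log(probs[label])


def cumulative_loss(batch: Sequence[Sample], weights: Sequence[float]) -> float:
    """Mean over the batch of the weighted sum of per-classifier losses."""
    total = 0.0
    for per_exit_probs, label in batch:
        if len(per_exit_probs) != len(weights):
            raise ValueError("one weight per classifier is required")
        for w, probs in zip(weights, per_exit_probs):
            total += w * cross_entropy(probs, label)
    return total / len(batch)


# Toy batch with two classifiers; equal weights for all predictors.
batch = [
    ([[0.5, 0.5], [0.25, 0.75]], 1),
    ([[0.6, 0.4], [0.9, 0.1]], 0),
]
print(cumulative_loss(batch, weights=[1.0, 1.0]))
```

Raising the first weight relative to the second would push the optimisation towards the early classifier, which is the intuition discussed above.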
4 Experiments and results
4.1 Enhancing the search process
In order to assess the impact of the previously presented techniques, we have selected a 5-layer, 16-initial channels, single-classifier (i.e. a singular exit point, located after the last layer), single-scaled network to represent our baseline model. Additionally, during the training phase the same network architecture was implemented, the only difference being the MS-RANAS cell structure, obtained through different search processes. The results are presented in Figure 4.
In terms of data preprocessing, we implemented the findings of DeVries and Taylor , namely the use of cutout (i.e. randomly masking out square regions of the input in the training phase). As shown in Figure 4, applying this technique solely during the search process led to a considerable increase in accuracy when compared to the baseline network.
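For readers unfamiliar with cutout, the following minimal Python sketch masks one square region of a single-channel image stored as nested lists. Centring the square on a random pixel and clipping it at the image borders is an assumption about the formulation, as are the function name and the mask size; real pipelines would operate on tensors with several channels.

```python
import random
from typing import List, Optional

Image = List[List[float]]


def cutout(image: Image, size: int, rng: Optional[random.Random] = None) -> Image:
    """Return a copy of a 2D image with one square region set to zero.

    The square is centred on a random pixel and clipped at the borders,
    so the masked area can be smaller than size x size near the edges.
    """
    rng = rng or random.Random()
    height, width = len(image), len(image[0])
    cy, cx = rng.randrange(height), rng.randrange(width)
    top, bottom = max(0, cy - size // 2), min(height, cy + size // 2)
    left, right = max(0, cx - size // 2), min(width, cx + size // 2)
    out = [row[:] for row in image]  # leave the input untouched
    for y in range(top, bottom):
        for x in range(left, right):
            out[y][x] = 0.0
    return out


image = [[1.0] * 8 for _ in range(8)]
masked = cutout(image, size=4, rng=random.Random(0))
print(sum(v == 0.0 for row in masked for v in row), "pixels masked")
```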
Secondly, we introduced early classifiers during the searching phase, the classifiers being included on each layer except for the first one. While we could have integrated classifiers on the first layer as well, we decided not to do it, as that might have over-optimised features for the first classifier excessively early with respect to the total depth. When using only one scale during the search process, the use of early exits largely improves the overall accuracy of the network, as seen in Figure 4. We postulate that this behaviour is linked with the MS-RANAS cells being forced to be more agile in order to offer good predictions in the early stages. Another aspect that we have to take into account is the relationship between adding intermediate classifiers and widening the architecture. As the number of scales increases while searching for the optimal cell structure, the positive impact of multiple classifiers is reduced. A similar behaviour is observed when analysing the impact of early exits when the search process includes the usage of cutout. We therefore conclude that early exits act as regularizers for the model, applying penalties on layer activity or layer parameters during the optimisation phase.
Lastly, we observed that, by increasing the number of scales (e.g. three scales) during the search process, the network is able to offer more accurate classifications. Specifically, we obtained an increase of almost 2% in accuracy when using three scales instead of only one while searching for the architecture, despite the fact that only one scale was used during training. We attribute this to the fact that each cell forces its predecessors to produce higher-quality results. This behaviour is therefore similar to implementing a deeper network but, instead of adding more layers and hence expanding the depth, we add scales and thus increase the width. The advantage of having a wide rather than deep model lies in the increased interaction between weights, as presented in 3.2: during backpropagation, each operation directly influences two sets of weights, as opposed to only one set for a single-scaled model.
4.2 Anytime inference and accuracy-latency trade-off
Our main focus is the accuracy-latency trade-off, analysed under the anytime inference setting. Anytime prediction implies the existence of a non-deterministic computational budget $B$ for each test sample $x$. The budget is drawn from the joint distribution $P(x, B)$. We therefore aim to minimise the expected loss

$$L(f) = \mathbb{E}\left[ L\left(f(x), B\right) \right]_{P(x, B)}$$

where $L(\cdot)$ represents the loss function, whilst $f$ is our model.
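The anytime setting can be illustrated with a small Python sketch in which each test sample arrives with its own budget and the prediction is taken from the deepest exit that the budget can pay for. The cumulative exit costs, the per-exit losses, the fallback loss used when no exit is affordable, and all numeric values are hypothetical; the average over samples is a simple Monte Carlo estimate of the expected loss under the budget distribution.

```python
from typing import List, Sequence, Tuple


def deepest_affordable_exit(exit_costs: Sequence[float], budget: float) -> int:
    """Index of the last exit whose cumulative cost fits the budget, or -1.

    exit_costs must be sorted in increasing order (cumulative cost per exit).
    """
    chosen = -1
    for k, cost in enumerate(exit_costs):
        if cost <= budget:
            chosen = k
    return chosen


def expected_loss(
    samples: Sequence[Tuple[List[float], float]],
    exit_costs: Sequence[float],
    fallback_loss: float,
) -> float:
    """Average loss over (per-exit losses, budget) pairs drawn from P(x, B)."""
    total = 0.0
    for per_exit_losses, budget in samples:
        k = deepest_affordable_exit(exit_costs, budget)
        # If even the first exit is too expensive, charge a fallback loss.
        total += per_exit_losses[k] if k >= 0 else fallback_loss
    return total / len(samples)


exit_costs = [1.0, 2.0, 4.0]  # hypothetical cumulative cost of each exit
samples = [
    ([0.9, 0.5, 0.2], 2.5),   # budget reaches the second exit
    ([1.2, 0.8, 0.3], 0.5),   # budget reaches no exit
    ([0.7, 0.4, 0.1], 10.0),  # budget reaches the last exit
]
print(expected_loss(samples, exit_costs, fallback_loss=2.0))
```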
In order to adhere to the anytime prediction principles, early classifiers had to be introduced in the final model. As mentioned in 2.2, the implementation of early exits in a neural network leads to a reduction in accuracy when compared to its single-classifier counterpart. This behaviour is visible when comparing the results depicted in Figure 5 and Figure 4, respectively. Therefore, to alleviate this issue we exploited and further researched the findings presented in 4.1.
Firstly, due to the promising results obtained through the usage of cutout while searching for the MS-RANAS cell structure, we implemented the same technique, followed by introducing the so-obtained cell in a multi-classifier network. Whilst the cutout might have brought an improvement of the final accuracy, the insertion of early exits overshadowed it, contributing to unsatisfactory results, as shown in Figure 5. We consider that this behaviour is a consequence of the network over-optimising the early features in order not to be penalised by the loss function applied to the first classifiers. Hence, as the usage of cutout only during the search process was insufficient, we made additional use of it during the training phase, which led to an increase in the model’s accuracy. This was followed by an additional improvement when the searching phase benefited from the implementation of multiple classifiers, thus proving the consistency of the conclusions presented in 4.1.
When we addressed the impact of multiple feature maps during the training process and both training and searching phase, respectively, a comparable behaviour, yet with a larger magnitude, was observed, as seen in Figure 5. Namely, a multiple-scaled model broadly outperforms a single-scaled one when early exits are implemented. Nevertheless, driven by the encouraging results discussed in 4.1, we have introduced intermediate classifiers not only in the final architecture, but also while searching for the optimal cell, thus marginally augmenting the quality of the predictions.
4.3 Memory impact in anytime inference
If an application has latency as the main target and accuracy is only a second-order metric, the predictions can be taken solely from an early classifier instead of averaging out the required budget with respect to a desired accuracy. To address this case, in Figure 6 we present a comparison of the absolute accuracy on each intermediate classifier of several architectures. We observed that a smaller difference between the errors of the first and last classifier is achieved through the usage of early classifiers and multiple scales during both the searching and training phases. Namely, in the triple-scaled network there is a difference of less than 1.25% in accuracy between the prediction obtained through the first exit and the one retrieved from the last classifier, but with the usage of only 12% of the initial amount of millions of floating-point operations (MFLOPS) and memory requirements.
When the on-chip memory represents a third metric used to evaluate a model, it would appear that the total amount of model parameters decreases when an earlier classifier is used. Nonetheless, the storage space required to deploy the entire network is still given by the total amount of parameters, namely the value that is depicted for the last classifier. In order to de facto reduce the model parameters, one should apply the techniques presented in Section 2.3, their implementation being outside the scope of this paper.
4.4 Comparison with state-of-the-art
In order to compare our MS-RANAS to the current state-of-the-art models in the field of anytime prediction, the network size (i.e. number of parameters) has been chosen to match the dimension of the handcrafted MSDNet network presented by Huang et al., thus allowing for an impartial comparison, whilst the size of the GHN model introduced by Zhang et al. was unavailable. Given this constraint, we could not focus on a true three-party trade-off between accuracy, computational complexity, and memory requirements; the latter therefore became a hard limitation instead of an adjustable property of our model. This led to the usage of two intermediate nodes in each cell, as opposed to the four proposed by Liu et al. An additional benefit of this decision was that, compared to their one-shot architecture search strategy, our search cost in terms of GPU days was further reduced. This is justified by the fact that decreasing the number of nodes within an MS-RANAS cell reduces the number of edges in the acyclic graph, which translates into a smaller search space on which the neural architecture search algorithm operates. The final MS-RANAS cell structure is presented in Figure 8, together with the reduction cell shown in Figure 8. For the same reason of maintaining the memory requirements within the aforementioned bounds, the final architecture is comprised of only three scales and seven layers, thus also allowing for the introduction of multiple early exits.
Under this specific setting, we analysed the relation between the amount of MFLOPS and the accuracy of the network on CIFAR-10, presented in Figure 9. As Zhang et al. reported the usage of cutout during GHN training, we followed their approach and also used cutout, which allowed us to successfully surpass their results as well. We also outperformed MSDNet after 0.7 MFLOPS. When compared to the state-of-the-art in anytime classification on CIFAR-100, our model outperformed MSDNet after 0.6 MFLOPS, as shown in Figure 10. MS-RANAS therefore achieved scalable, competitive results in terms of accuracy-latency trade-off with handcrafted (MSDNet) or optimised (GHN) state-of-the-art architectures, while employing a limited number of parameters.
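The effect of the number of intermediate nodes on the size of the search space can be made explicit with a short count. The sketch below assumes a DARTS-style cell in which every intermediate node may connect to each cell input and to every earlier intermediate node, with two cell inputs and one architecture variable per candidate operation on each edge; the number of candidate operations (eight) is a placeholder, not a figure reported here.

```python
def num_candidate_edges(intermediate_nodes: int, cell_inputs: int = 2) -> int:
    """Edges in the cell graph when each intermediate node may connect to
    all cell inputs and all earlier intermediate nodes (assumed topology)."""
    return sum(cell_inputs + i for i in range(intermediate_nodes))


def num_architecture_variables(
    intermediate_nodes: int, num_ops: int, cell_inputs: int = 2
) -> int:
    """One continuous variable per (edge, candidate operation) pair."""
    return num_candidate_edges(intermediate_nodes, cell_inputs) * num_ops


for nodes in (4, 2):
    # num_ops=8 is a placeholder for the size of the operation set.
    print(nodes, num_candidate_edges(nodes), num_architecture_variables(nodes, num_ops=8))
```

Under these assumptions, going from four to two intermediate nodes reduces the candidate edges from 14 to 5, which is consistent with the smaller search space and lower search cost described above.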
5 Conclusions and future work
We introduced MS-RANAS, a novel approach towards automating the generation of efficient neural networks. Its flexibility in terms of intermediate nodes per cell, scales throughout the architecture, and positioning of early exits allows for its usage in a large variety of applications. We showed that the process of searching for the optimal MS-RANAS cell can be augmented through the usage of cutout, multiple scales, and intermediate classifiers, thus obtaining a more accurate network despite the absence of these techniques in the final model. When the main aim of the system is the trade-off between speed and accuracy (e.g. anytime prediction), the MS-RANAS network outperforms the existing state-of-the-art architectures, either handcrafted or not, while requiring a comparable number of parameters. This makes our work an ideal candidate for object recognition tasks in an edge computing setting, where there exists a joint focus on accuracy, latency, and on-chip memory.
We plan to further analyse two directions which could lead to improved results. First of all, as we have presented in subsection 2.3, several techniques such as pruning and quantisation could be used to further reduce the dimensionality of the network. Secondly, in order to allow for accurate predictions under a scarce computational budget, utilising the area under the accuracy-MFLOPS curve as a selection criterion during the architecture search phase might be beneficial, thus generating a model that operates efficiently under any computational constraints. Additionally, directly optimising latency instead of using a surrogate (i.e. FLOPS), as proposed by Tan et al., could lead to a better accuracy-latency trade-off.
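As an illustration of the proposed selection criterion, the sketch below computes the area under an accuracy-MFLOPS curve with the trapezoidal rule and uses it to rank two candidate architectures. The normalisation by the MFLOPS range, the function names, and the two sets of points are hypothetical and do not correspond to measured results.

```python
from typing import Sequence, Tuple


def area_under_curve(mflops: Sequence[float], accuracy: Sequence[float]) -> float:
    """Trapezoidal area under an accuracy-vs-MFLOPS curve.

    Points must be sorted by strictly increasing MFLOPS.
    """
    if len(mflops) != len(accuracy) or len(mflops) < 2:
        raise ValueError("need at least two matching points")
    area = 0.0
    for i in range(1, len(mflops)):
        width = mflops[i] - mflops[i - 1]
        if width <= 0:
            raise ValueError("MFLOPS values must be strictly increasing")
        area += 0.5 * (accuracy[i] + accuracy[i - 1]) * width
    return area


def normalised_auc(mflops: Sequence[float], accuracy: Sequence[float]) -> float:
    """Area divided by the MFLOPS range, i.e. the mean accuracy over budgets."""
    return area_under_curve(mflops, accuracy) / (mflops[-1] - mflops[0])


Curve = Tuple[Sequence[float], Sequence[float]]
# Hypothetical (MFLOPS, accuracy) points for two candidate architectures.
candidate_a: Curve = ([1.0, 2.0, 4.0], [0.80, 0.90, 0.92])
candidate_b: Curve = ([1.0, 2.0, 4.0], [0.70, 0.88, 0.93])

best = max([candidate_a, candidate_b], key=lambda c: normalised_auc(*c))
print(normalised_auc(*candidate_a), normalised_auc(*candidate_b))
```

Candidate B reaches a higher final accuracy, yet candidate A has the larger area because it is more accurate at small budgets, which is the behaviour such a criterion would reward.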
- All our code and models will be released.
- (2016) Designing neural network architectures using reinforcement learning. CoRR abs/1611.02167. Cited by: §2.1.
- (1974-03) Computational graphs and rounding error. SIAM Journal on Numerical Analysis 11, pp. 87–96. Cited by: §3.1.
- (2017) A survey of model compression and acceleration for deep neural networks. CoRR abs/1710.09282. Cited by: §2.3.
- (2009) ImageNet: A Large-Scale Hierarchical Image Database. In CVPR09, Cited by: §1, §2.2.
- (2017) Improved regularization of convolutional neural networks with cutout. CoRR abs/1708.04552. Cited by: §4.1.
- (2017) Spatially adaptive computation time for residual networks. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 1790–1799. Cited by: §2.2.
- (2016) Adaptive computation time for recurrent neural networks. CoRR abs/1603.08983. Cited by: §2.2.
- (2012-21–23 Apr) SpeedBoost: anytime prediction with uniform near-optimality. In Proceedings of the Fifteenth International Conference on Artificial Intelligence and Statistics, N. D. Lawrence and M. Girolami (Eds.), Proceedings of Machine Learning Research, Vol. 22, La Palma, Canary Islands, pp. 458–466. Cited by: §4.2.
- (2017) Multi-scale dense convolutional networks for efficient prediction. CoRR abs/1703.09844. Cited by: Figure 2, §2.2, §3.2, §3.3, Figure 10, Figure 9, §4.4, §4.4.
- (2018-09) AI benchmark: running deep neural networks on android smartphones. In The European Conference on Computer Vision (ECCV) Workshops, Cited by: §1.
- (2018) Neural architecture search with bayesian optimisation and optimal transport. In Advances in Neural Information Processing Systems 31, S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi and R. Garnett (Eds.), pp. 2016–2025. Cited by: §2.1.
- (2009) Learning multiple layers of features from tiny images. Technical report. Cited by: §4.4, §4.
- (2014) Microsoft coco: common objects in context. In Computer Vision – ECCV 2014, D. Fleet, T. Pajdla, B. Schiele and T. Tuytelaars (Eds.), Cham, pp. 740–755. Cited by: §1.
- (2018) DARTS: differentiable architecture search. CoRR abs/1806.09055. Cited by: Figure 1, §2.1, §3.1, §3.1, §4.4, §4.
- (2018) Neural architecture optimization. In Advances in Neural Information Processing Systems 31, S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi and R. Garnett (Eds.), pp. 7816–7827. Cited by: §2.1.
- (2018) Exploring the limits of weakly supervised pretraining. In Computer Vision – ECCV 2018, V. Ferrari, M. Hebert, C. Sminchisescu and Y. Weiss (Eds.), Cham, pp. 185–201. Cited by: §2.2.
- (2018-10–15 Jul) Efficient neural architecture search via parameters sharing. In Proceedings of the 35th International Conference on Machine Learning, J. Dy and A. Krause (Eds.), Proceedings of Machine Learning Research, Vol. 80, Stockholmsmässan, Stockholm Sweden, pp. 4095–4104. Cited by: §2.1.
- (2018) Regularized evolution for image classifier architecture search. CoRR abs/1802.01548. Cited by: §2.1.
- (2016-10) Edge computing: vision and challenges. IEEE Internet of Things Journal 3 (5), pp. 637–646. Cited by: §1.
- (2015) Very deep convolutional networks for large-scale image recognition. CoRR abs/1409.1556. Cited by: §2.2.
- (2018) MnasNet: platform-aware neural architecture search for mobile. CoRR abs/1807.11626. Cited by: §5.
- (2017) SkipNet: learning dynamic routing in convolutional networks. CoRR abs/1711.09485. Cited by: §2.2.
- (2019) A survey on neural architecture search. CoRR abs/1905.01392. Cited by: §1, §2.1, §2.1.
- (2018) Graph hypernetworks for neural architecture search. CoRR abs/1810.05749. Cited by: §2.2, Figure 9, §4.4, §4.4.
- (2017) Incremental network quantization: towards lossless cnns with low-precision weights. CoRR abs/1702.03044. Cited by: §2.3.
- (2016) Neural architecture search with reinforcement learning. CoRR abs/1611.01578. Cited by: §2.1.
- (2017) Learning transferable architectures for scalable image recognition. CoRR abs/1707.07012. Cited by: §3.3.