Energy-Aware Neural Architecture Optimization with Fast Splitting Steepest Descent
Designing energy-efficient networks is of critical importance for enabling state-of-the-art deep learning in mobile and edge settings where the computation and energy budgets are highly limited. Recently, splitting2019 framed the search of efficient neural architectures into a continuous splitting process: it iteratively splits existing neurons into multiple off-springs to achieve progressive loss minimization, thus finding novel architectures by gradually growing the neural network. However, this method was not specifically tailored for designing energy-efficient networks, and is computationally expensive on large-scale benchmarks. In this work, we substantially improve splitting2019 in two significant ways: 1) we incorporate the energy cost of splitting different neurons to better guide the splitting process, thereby discovering more energy-efficient network architectures; 2) we substantially speed up the splitting process of splitting2019, which requires expensive eigen-decomposition, by proposing a highly scalable Rayleigh-quotient stochastic gradient algorithm. Our fast algorithm allows us to reduce the computational cost of splitting to the same level of typical back-propagation updates and enables efficient implementation on GPU. Extensive empirical results show that our method can train highly accurate and energy-efficient networks on challenging datasets such as ImageNet, improving a variety of baselines, including the pruning-based methods and expert-designed architectures.
Deep neural networks (DNNs) have demonstrated remarkable performance in solving various challenging problems such as image classification (e.g. simonyan2014very; he2016deep; huang2017densely), object detection (e.g. he2017mask) and language understanding (e.g. devlin2018bert). Although large-scale deep networks have good empirical performance, their large sizes cause slow computation and high energy cost in the inference phase. This poses a great challenge for improving the applicability of deep networks to more real-world domains, especially on mobile and edge devices where the computation and energy budgets are highly limited. There is an urgent demand for practical approaches that automate the design of small, highly energy-efficient DNN architectures that are still sufficiently accurate for real-world AI systems.
Unfortunately, neural architecture optimization has been known to be notoriously difficult. Compared with the easier task of learning the parameters of DNNs, which has been well addressed by back-propagation (rumelhart1988learning), optimizing the network structures casts a much more challenging discrete optimization problem, with excessively large search spaces and high evaluation cost. Furthermore, for neural architecture optimization in energy-efficient settings, extra difficulties arise due to strict constraints on resource usage.
Recently, splitting2019 investigated similar notions of gradient descent for learning network architectures and framed the architecture optimization problem as a continuous optimization in an infinite-dimensional configuration space, on which novel notions of steepest descent can be derived for incremental updates of the neural architectures. In practice, the algorithm optimizes a neural network through a cycle of parametric updating and splitting phases. In the parametric updating phase, the algorithm performs standard gradient descent to reach a stable local minimum; in the splitting phase, the algorithm expands the network by splitting a subset of existing neurons into several off-springs in an optimal way. A key observation is that the previous local minimum can be turned into a saddle point in the new higher-dimensional space induced by splitting, which can be escaped easily; this enables implicit exploration of the architecture space and achieves monotonic loss decrease.
However, the splitting algorithm in splitting2019 treats each neuron equally, without taking into account the different amounts of energy consumption incurred by different neurons, and thus finds models that may not be applicable in resource-constrained environments. To close the gap between DNN design via splitting and energy efficiency optimization, we propose an energy-aware splitting procedure that improves over splitting2019 by explicitly incorporating energy-related metrics to guide the splitting process.
Another practical issue of splitting2019 is that it requires eigen-computation of the splitting matrices, which causes a time complexity of $O(nd^3)$ and a space complexity of $O(nd^2)$ when implemented exactly, where $n$ is the number of neurons in the network and $d$ is the dimension of the weights of each neuron. This makes it difficult to implement the algorithm on GPUs for modern neural networks with thousands of neurons, mainly because storing all splitting matrices exhausts GPU memory, which prohibits efficient parallel calculation on GPUs. In this work, we address this problem by proposing a fast gradient-based approximation of splitting2019 that reduces both the time and the space complexity (see Section 4). Critically, our new fast gradient-based approximation can be efficiently implemented on GPUs, hence making it possible to split very large networks, such as those for ImageNet.
Our method achieves promising empirical results on challenging benchmarks. Compared with prior art pruning baselines that improve the efficiency by removing the least significant neurons (e.g. liu2017learning; li2016pruning; gordon2018morphnet), our method produces a better accuracy-flops trade-off on CIFAR-100. On the large-scale ImageNet dataset, our method finds more flops-efficient network architectures that achieve 1.0% and 0.8% improvements in top-1 accuracy compared with prior expert-designed MobileNet (howard2017mobilenets) and MobileNetV2 (sandler2018mobilenetv2), respectively. The gain is even more significant on the low-flops regime.
2 Splitting Steepest Descent
Our work builds upon a recently proposed splitting steepest descent approach (splitting2019), which transforms the co-optimization of neural architectures and parameters into a continuous optimization, solved by a generalized steepest descent on a functional space. To illustrate the key idea, assume the neural network structure is specified by a set of size parameters $m = \{m_1, \ldots, m_K\}$, where each $m_\ell$ denotes the number of neurons in the $\ell$-th layer, or the number of a certain type of neurons. Denote by $\Theta_m$ the set of possible parameters of networks of size $m$; then $\Theta_\infty = \bigcup_m \Theta_m$, which we call the configuration space, is the space of all possible neural architectures and parameters.
In this view, learning parameters of a fixed network structure is minimizing the loss inside an individual sub-region $\Theta_m$. In contrast, optimizing in the overall configuration space $\Theta_\infty$ admits the co-optimization of both architectures and parameters. The key observation is that the optimization in $\Theta_\infty$ is in fact continuous (despite being infinite dimensional), for which (generalized) steepest descent procedures can be introduced to yield efficient and practical algorithms.
In particular, splitting2019 considered a splitting steepest descent on $\Theta_\infty$, which consists of two phases: 1) the parametric descent inside each $\Theta_m$ with a fixed network structure $m$, which reduces to the typical steepest descent on parameters, and 2) the architecture descent crossing the boundaries of different sub-regions $\Theta_m$, which, in the case of splitting2019, corresponds to “growing” the network structures by splitting a set of critical neurons into multiple off-springs (see Figure 1a).
From the perspective of non-convex optimization, the architecture descent across boundaries of $\Theta_m$ can be viewed as escaping saddle points in the configuration space. As shown in Figure 1b, when the parametric training inside a fixed $\Theta_m$ gets saturated, architecture descent allows us to escape local optima by jumping into a higher dimensional sub-region of a larger network structure. The idea is that the local optima inside $\Theta_m$ can be turned into a saddle point when viewed from the higher dimensional space of larger networks (Figure 1c), which is escaped using splitting descent.
Escaping local minima via splitting
Deriving a steepest descent algorithm requires fixing a proper notion of distance on the configuration space. In splitting2019, steepest descent with the $\infty$-Wasserstein distance was considered, which is shown to naturally correspond to the practical procedure of splitting neurons. Here we only introduce the intuitive idea from the practical perspective of optimally splitting neurons. Readers are referred to splitting2019 for more theoretical discussion.
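Consider the simplest case of splitting a single neuron. Let $\sigma(\theta, x)$ be a neuron inside a neural network that we want to learn from data, where $\theta$ is the parameter of the neuron and $x$ its input variable. Assume the loss function of $\theta$ has a general form of
$$L(\theta) := \mathbb{E}_{x \sim \mathcal{D}}\big[\Phi(\sigma(\theta, x))\big], \tag{1}$$
where $\mathcal{D}$ is the data distribution, and $\Phi$ is the map from the output of the neuron to the final loss. In this work, the word “neuron” broadly refers to repeatable modules in neural networks, such as typical hidden neurons or filters in CNNs.
Assume we have achieved a stable local optimum of $L(\theta)$, that is, $\nabla L(\theta) = 0$ and $\nabla^2 L(\theta) \succ 0$, so that we cannot further decrease the loss by local descent on the parameters. In this case, splitting steepest descent enables further descent by introducing more neurons via splitting. Specifically, we split $\theta$ into $m$ off-springs $\boldsymbol{\theta} := \{\theta_i\}_{i=1}^m$, and replace the neuron $\sigma(\theta, x)$ with a weighted sum of the off-spring neurons $\sum_{i=1}^m w_i \sigma(\theta_i, x)$, where $\boldsymbol{w} := \{w_i\}_{i=1}^m$ is a set of positive weights assigned to the off-springs, satisfying $\sum_{i=1}^m w_i = 1$, $w_i > 0$. See Figure 1a for an illustration. This yields an augmented loss function on $\boldsymbol{\theta}$ and $\boldsymbol{w}$:
$$L_m(\boldsymbol{\theta}, \boldsymbol{w}) := \mathbb{E}_{x \sim \mathcal{D}}\Big[\Phi\Big(\sum_{i=1}^m w_i \sigma(\theta_i, x)\Big)\Big]. \tag{2}$$
It is easy to see that if we set $\theta_i = \theta$ for all the off-springs, the network remains unchanged. Therefore, as we change $\theta_i$ in a small neighborhood of $\theta$, it introduces a smooth change on the loss function. Splitting steepest descent is derived by considering the optimal splitting strategies to achieve the steepest descent on loss in a small neighborhood of the original parameters.
Deriving splitting steepest descent
Deriving the optimal splitting strategy involves deciding the number of off-springs $m$, the values of the weights $\{w_i\}$ and the parameters $\{\theta_i\}$ of the off-springs. In splitting2019, this is formulated as the following optimization problem:
$$\min_{m, \boldsymbol{\theta}, \boldsymbol{w}} \Big\{ L_m(\boldsymbol{\theta}, \boldsymbol{w}) - L(\theta) \;:\; \|\theta_i - \theta\| \le \epsilon,\ \sum_{i=1}^m w_i = 1,\ w_i > 0,\ \forall i = 1, \ldots, m \Big\}, \tag{3}$$
where the parameters of the off-springs are restricted within an infinitesimal $\epsilon$-ball of the original parameter $\theta$, that is, $\|\theta_i - \theta\| \le \epsilon$, with $\epsilon$ a small positive step size parameter. Note that the number of off-springs $m$ is also optimized, yielding an infinite dimensional optimization.
Fortunately, when $\epsilon$ is very small, the optimum of Equation (3) is achieved by either $m = 1$ (no splitting) or $m = 2$ (two off-springs). The property of the optimal solution is characterized (asymptotically) by the following key splitting matrix $S(\theta)$,
$$S(\theta) := \mathbb{E}_{x \sim \mathcal{D}}\big[\Phi'(\sigma(\theta, x))\, \nabla^2_{\theta\theta} \sigma(\theta, x)\big],$$
which is a $d \times d$ symmetric matrix (and $d$ is the dimension of $\theta$). The optimum of Equation (3), when $L(\theta)$ reaches a stable local optimum (i.e., $\nabla L(\theta) = 0$, $\nabla^2 L(\theta) \succ 0$), is determined by $S(\theta)$ via
$$\text{optimal value of (3)} = \frac{\epsilon^2}{2} \min\big(\lambda_{\min}(S(\theta)),\, 0\big) + O(\epsilon^3), \tag{4}$$
where $\lambda_{\min}(S(\theta))$ denotes the minimum eigenvalue of $S(\theta)$ and is called the splitting index.
When $\lambda_{\min}(S(\theta)) \ge 0$, the loss cannot be improved by any splitting strategy following (4). When $\lambda_{\min}(S(\theta)) < 0$, the maximum decrease of loss, which equals $\frac{\epsilon^2}{2} |\lambda_{\min}(S(\theta))|$ to leading order, can be achieved by a simple strategy of splitting the neuron into two copies with equal weights, whose parameters are updated along the minimum eigenvector $v_{\min}(S(\theta))$ of $S(\theta)$, that is,
$$m = 2, \qquad \theta_1 = \theta + \epsilon\, v_{\min}(S(\theta)), \qquad \theta_2 = \theta - \epsilon\, v_{\min}(S(\theta)), \qquad w_1 = w_2 = \tfrac{1}{2}. \tag{5}$$
In this case, splitting allows us to escape the parametric local optima to enable further improvement.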
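The splitting rule can be illustrated on a toy neuron with two parameters. The sketch below is a minimal example under stated assumptions, not the authors' implementation: the splitting matrix is taken as given, the helper names `min_eig_sym_2x2` and `split_neuron` are hypothetical, and the closed-form eigen-solver only covers symmetric 2x2 matrices.

```python
import math


def min_eig_sym_2x2(S):
    """Return (lambda_min, unit eigenvector) of a symmetric 2x2 matrix S."""
    a, b, c = S[0][0], S[0][1], S[1][1]
    lam = 0.5 * (a + c) - math.hypot(0.5 * (a - c), b)
    if abs(b) > 1e-12:
        v = (b, lam - a)  # solves (S - lam * I) v = 0
    else:
        v = (1.0, 0.0) if a <= c else (0.0, 1.0)  # S is already diagonal
    norm = math.hypot(v[0], v[1])
    return lam, (v[0] / norm, v[1] / norm)


def split_neuron(theta, S, eps=0.01):
    """Return a list of (weight, parameters) pairs for the off-springs."""
    lam, v = min_eig_sym_2x2(S)
    if lam >= 0:  # non-negative splitting index: keep the neuron as it is
        return [(1.0, tuple(theta))]
    plus = tuple(t + eps * vi for t, vi in zip(theta, v))
    minus = tuple(t - eps * vi for t, vi in zip(theta, v))
    return [(0.5, plus), (0.5, minus)]


# Eigenvalues of [[1, 2], [2, 1]] are 3 and -1, so this neuron is split.
offspring = split_neuron((0.3, -0.2), [[1.0, 2.0], [2.0, 1.0]], eps=0.1)
# A positive definite splitting matrix leaves the neuron unchanged.
unchanged = split_neuron((0.3, -0.2), [[2.0, 0.0], [0.0, 1.0]], eps=0.1)
```

The two off-springs are placed symmetrically around the original parameter, so their weighted average reproduces it and the network output changes smoothly as `eps` grows from zero.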
Splitting deep neural networks
As shown in splitting2019, the result above can be naturally extended to more general cases when we need to split multiple neurons in deep neural networks. Consider a neural network with $n$ neurons $\theta^{[1:n]} = \{\theta^{[1]}, \ldots, \theta^{[n]}\}$. Assume we split a subset $\mathcal{I} \subseteq \{1, \ldots, n\}$ of neurons with the optimal strategy in Equation (5), each following its own splitting matrix; the improvement of the overall loss then equals the sum of individual gains:
$$\Delta L = \frac{\epsilon^2}{2} \sum_{\ell \in \mathcal{I}} \lambda_{\min}^{[\ell]} + O(\epsilon^3),$$
where $\Delta L$ is the change in loss caused by the split and $\lambda_{\min}^{[\ell]}$ denotes the minimum eigenvalue of the splitting matrix associated with neuron $\ell$. Therefore, given a budget of splitting at most a given number of neurons, the optimal subset of neurons to split consists of the top-ranked neurons with the smallest (and negative) minimum eigenvalues. Overall, the splitting descent in splitting2019 alternates between parametric updates with fixed network architectures, and splitting top-ranked neurons to augment the architectures, until a stopping criterion is reached.
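The ranking rule described above amounts to sorting neurons by their splitting index and keeping the most negative ones. The following sketch assumes the per-neuron minimum eigenvalues are already available as a list; the function names are hypothetical and the numbers are made up for illustration.

```python
def select_neurons_vanilla(split_indexes, max_splits):
    """Indices of at most max_splits neurons with the most negative splitting index."""
    negative = sorted((lam, i) for i, lam in enumerate(split_indexes) if lam < 0)
    return [i for _, i in negative[:max_splits]]


def predicted_loss_change(split_indexes, chosen, eps):
    """Leading-order change in loss: (eps^2 / 2) times the sum of the chosen indexes."""
    return 0.5 * eps ** 2 * sum(split_indexes[i] for i in chosen)


indexes = [-0.5, 0.2, -2.0, -0.1, 0.0]   # illustrative values only
chosen = select_neurons_vanilla(indexes, max_splits=2)
change = predicted_loss_change(indexes, chosen, eps=0.1)
```

Neurons with a non-negative index are never selected, even if the budget would allow more splits, because splitting them cannot reduce the loss to leading order.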
3 Neural architecture optimization via energy-aware splitting
The method above allows us to select the best subset of neurons to split to yield the steepest descent on the loss function. In practice, however, splitting different neurons incurs a different amount of increase on the model size, computational cost, and physical energy consumption. For example, splitting a neuron connecting to a large number of inputs and outputs increases the size and computational cost of the network much more significantly than splitting the neurons with fewer inputs and outputs. In practice, convolutional layers close to inputs often have larger feature maps which lead to a high energy cost, and layers closer to outputs have smaller feature maps and hence lower computational cost. A better splitting strategy should take the cost of different neurons into account.
To address this problem, we propose to explicitly incorporate the energy cost to better guide the splitting process. Specifically, for a neural network with $n$ neurons, we propose to decide the optimal splitting set by solving the following constrained optimization:
$$\beta^* = \arg\min_{\beta} \sum_{\ell=1}^n \beta_\ell\, \lambda_{\min}^{[\ell]} \quad \text{s.t.} \quad \sum_{\ell=1}^n \beta_\ell\, e_\ell \le e_{\text{budget}}, \quad \beta_\ell \in \{0, 1\}, \tag{6}$$
where $\lambda_{\min}^{[\ell]}$ is the splitting index of the $\ell$-th neuron. Here $\beta = \{\beta_\ell\}_{\ell=1}^n$ is a binary mask, in which $\beta_\ell$ indicates whether the $\ell$-th neuron should be split ($\beta_\ell = 1$) or not ($\beta_\ell = 0$), and $e_\ell$ represents the cost of splitting the $\ell$-th neuron at the current iteration. We search for the optimal subset of neurons that yields the largest descent on the loss (in terms of the splitting index), while incurring a total energy cost no larger than a budget threshold $e_{\text{budget}}$. The optimization in Equation (6) is a standard knapsack problem. The exact solution of knapsack problems can be very expensive due to their NP-hardness. In practice, we use a linear programming relaxation for fast approximation by relaxing the integrality constraints to linear constraints such that $0 \le \beta_\ell \le 1$. The continuous relaxation can then be solved efficiently using standard linear programming tools (dantzig1998linear). Finally, we define the optimal splitting set as $\{\ell : \beta_\ell^* = 1\}$. Each neuron in this set is split into two equally weighted off-springs along its splitting gradient, following Equation (5).
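For a single budget constraint, the linear programming relaxation of a knapsack problem has a well-known greedy solution: take items in order of value per unit cost and let at most one item be fractional. The sketch below uses that greedy rule in place of a general-purpose LP solver; it is an illustration under the assumption that all costs are strictly positive, and the function name and the numbers are hypothetical.

```python
def energy_aware_selection(split_indexes, costs, budget):
    """Greedy solution of the LP relaxation of the splitting knapsack.

    Returns (beta, split_set): the relaxed mask with entries in [0, 1] and the
    indices whose mask entry equals 1. Assumes every cost is strictly positive.
    """
    beta = [0.0] * len(split_indexes)
    # Only neurons with a negative splitting index can decrease the loss.
    candidates = [i for i, lam in enumerate(split_indexes) if lam < 0]
    # Largest loss decrease per unit of energy first.
    candidates.sort(key=lambda i: split_indexes[i] / costs[i])
    remaining = budget
    for i in candidates:
        if costs[i] <= remaining:
            beta[i] = 1.0
            remaining -= costs[i]
        else:
            beta[i] = remaining / costs[i]  # the single fractional entry
            break
    split_set = [i for i, b in enumerate(beta) if b == 1.0]
    return beta, split_set


beta, split_set = energy_aware_selection(
    split_indexes=[-2.0, -1.0, -3.0, 0.5],   # illustrative values only
    costs=[4.0, 1.0, 2.0, 1.0],
    budget=4.0,
)
```

In this example neuron 0 has a strongly negative index but is expensive, so it only receives a fractional mask value and is left out of the splitting set, while the cheaper neurons 1 and 2 are split.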
In this work, we take $e_\ell$ to be the energy cost, and estimate it by the increase in flops incurred if we split the $\ell$-th neuron starting from the current network structure. Note that the cost of splitting the same neuron changes when the network size changes across iterations. Therefore, we re-evaluate the cost of every neuron at each splitting stage, based on the architecture of the current network.
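To make the cost estimate concrete, the sketch below counts the extra multiply-accumulates caused by adding one output channel to a pointwise (1x1) convolution that feeds a depthwise-separable block. The cost model is an assumption made for illustration (one new pointwise filter, one new depthwise filter in the next block, and one extra input channel for the next pointwise convolution), and `split_cost_macs` is a hypothetical helper, not a function from the authors' code.

```python
def split_cost_macs(c_in, out_hw, next_c_out, next_hw, dw_kernel=3):
    """Extra multiply-accumulates from adding one output channel to a pointwise conv."""
    h, w = out_hw
    nh, nw = next_hw
    new_pointwise_filter = c_in * h * w
    new_depthwise_filter = dw_kernel * dw_kernel * nh * nw
    extra_input_to_next_pointwise = next_c_out * nh * nw
    return new_pointwise_filter + new_depthwise_filter + extra_input_to_next_pointwise


# Same channel counts, but feature maps near the input are much larger.
early = split_cost_macs(c_in=8, out_hw=(32, 32), next_c_out=8, next_hw=(32, 32))
late = split_cost_macs(c_in=8, out_hw=(4, 4), next_c_out=8, next_hw=(4, 4))
# After the next layer has grown from 8 to 12 channels, the same split costs more.
early_after_growth = split_cost_macs(c_in=8, out_hw=(32, 32), next_c_out=12, next_hw=(32, 32))
```

Under this model the cost of a split depends on the current widths of neighbouring layers, which is why it has to be recomputed at every splitting stage.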
4 Fast Splitting with Rayleigh-Quotient Gradient Descent
A practical issue of splitting steepest descent is the high computational cost of the eigen-computation of the splitting matrices. The time complexity of evaluating all splitting indexes is $O(nd^3)$, where $n$ is the number of neurons and $d$ is the number of parameters of each neuron. Meanwhile, the space complexity is $O(nd^2)$. Although this is manageable for networks of small or moderate size, an immediate difficulty for modern deep networks with thousands of high-dimensional neurons is that we are not able to store all splitting matrices on GPUs, which necessitates slow calculation on CPUs. It is desirable to further speed up the calculation for very large scale problems. In this section, we propose an approach for computing the splitting indexes and gradients without explicitly expanding the splitting matrices, based on fast (stochastic) gradient descent on the Rayleigh quotient.
Rayleigh-Quotient Gradient Descent for Minimum Eigenvalues
The key idea is to note that the minimum eigenvalue and eigenvector of a matrix $S$ can be obtained by minimizing the Rayleigh quotient (parlett1998symmetric),
$$\lambda_{\min}(S) = \min_{v} R_S(v), \qquad v_{\min}(S) \in \arg\min_{v} R_S(v), \qquad R_S(v) := \frac{v^\top S v}{v^\top v}, \tag{7}$$
which can be solved using gradient descent or other numerical methods. Although this problem is non-convex, $v_{\min}$ can be shown to be the unique global minimum of $R_S(v)$ (up to scaling), and all the other stationary points, corresponding to the other eigenvectors, are saddle points and can be escaped with random perturbation. Therefore, stochastic or noisy gradient descent on $R_S(v)$ is expected to converge to $v_{\min}$. The gradient of $R_S(v)$ w.r.t. $v$ can be written as follows,
$$\nabla_v R_S(v) = \frac{2}{\|v\|^2}\big(S v - R_S(v)\, v\big),$$
which depends on $S$ only through the matrix-vector product $Sv$. A significant saving in computation can be obtained by directly calculating $Sv$ at each iteration, without explicitly expanding the whole matrix. This can be achieved by the following auto-differentiation trick.
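A minimal sketch of gradient descent on the Rayleigh quotient is given below. It only accesses the matrix through a `matvec` callable, mirroring the point that the full matrix never needs to be formed; in this toy example the callable is backed by an explicit 3x3 matrix so that the result can be checked. The function name, step size and iteration count are assumptions chosen for the example, and plain (non-stochastic) gradient descent with renormalisation is used.

```python
import math
import random


def rayleigh_min_eig(matvec, dim, steps=500, lr=0.1, seed=0):
    """Approximate (lambda_min, v_min) of a symmetric matrix given only v -> S v."""
    rng = random.Random(seed)
    v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    for _ in range(steps):
        norm = math.sqrt(sum(x * x for x in v))
        v = [x / norm for x in v]                      # renormalise so that v^T v = 1
        Sv = matvec(v)
        r = sum(a * b for a, b in zip(v, Sv))          # Rayleigh quotient at unit norm
        grad = [2.0 * (a - r * b) for a, b in zip(Sv, v)]
        v = [x - lr * g for x, g in zip(v, grad)]
    norm = math.sqrt(sum(x * x for x in v))
    v = [x / norm for x in v]
    Sv = matvec(v)
    return sum(a * b for a, b in zip(v, Sv)), v


# Toy symmetric matrix with eigenvalues -sqrt(2), 0 and sqrt(2).
S = [[0.0, -1.0, 0.0], [-1.0, 0.0, -1.0], [0.0, -1.0, 0.0]]


def matvec(v):
    return [sum(s * x for s, x in zip(row, v)) for row in S]


lam_min, v_min = rayleigh_min_eig(matvec, dim=3)
```

The random initialisation matters: starting exactly at another eigenvector would leave the iterate at a saddle point, which is the situation that noise in stochastic gradient descent helps to avoid.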
Automatic Differentiation Trick
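Recall that the splitting matrix of a single neuron is $S(\theta) = \mathbb{E}_{x \sim \mathcal{D}}\big[\Phi'(\sigma(\theta, x))\, \nabla^2_{\theta\theta} \sigma(\theta, x)\big]$. To calculate $S(\theta) v$ for any vector $v \in \mathbb{R}^d$, we construct the following auxiliary function,
$$F(\epsilon, v) := \mathbb{E}_{x \sim \mathcal{D}}\Big[\Phi\big(\sigma(\theta, x) + \epsilon^\top \nabla^2_{\theta\theta} \sigma(\theta, x)\, v\big)\Big], \qquad \epsilon \in \mathbb{R}^d,$$
with which it is easy to show that $S(\theta) v = \nabla_\epsilon F(0, v)$. Here $F(\epsilon, v)$ corresponds to simply adding an extra term $\epsilon^\top \nabla^2_{\theta\theta} \sigma(\theta, x)\, v$ on top of the neuron’s output and can be constructed conveniently.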
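The identity behind the trick follows from one application of the chain rule. The short derivation below writes the auxiliary function as $F(\epsilon, v)$ with an auxiliary variable $\epsilon \in \mathbb{R}^d$ and the neuron as $\sigma(\theta, x)$; this notation is assumed here for illustration.

```latex
\begin{aligned}
F(\epsilon, v)
  &= \mathbb{E}_{x \sim \mathcal{D}}\Big[\Phi\big(\sigma(\theta, x)
     + \epsilon^\top \nabla^2_{\theta\theta}\sigma(\theta, x)\, v\big)\Big], \\
\nabla_\epsilon F(\epsilon, v)
  &= \mathbb{E}_{x \sim \mathcal{D}}\Big[\Phi'\big(\sigma(\theta, x)
     + \epsilon^\top \nabla^2_{\theta\theta}\sigma(\theta, x)\, v\big)\,
     \nabla^2_{\theta\theta}\sigma(\theta, x)\, v\Big], \\
\nabla_\epsilon F(0, v)
  &= \mathbb{E}_{x \sim \mathcal{D}}\Big[\Phi'\big(\sigma(\theta, x)\big)\,
     \nabla^2_{\theta\theta}\sigma(\theta, x)\Big]\, v
   \;=\; S(\theta)\, v .
\end{aligned}
```

Only a gradient with respect to a $d$-dimensional variable is needed, so the $d \times d$ matrix is never stored.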
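In the case of deep neural networks with $n$ neurons $\theta^{[1:n]}$, we can calculate the matrix-vector products $S^{[\ell]} v_\ell$ for all the neurons jointly with a single differentiation process. More precisely, for each neuron $\ell$, we add a term $\epsilon_\ell^\top \nabla^2_{\theta\theta} \sigma_\ell(\theta^{[\ell]}, x)\, v_\ell$ (denoted as the auxiliary activation) to its own output (see Figure 2). Thus, we obtain a joint function $F(\epsilon_{1:n}, v_{1:n})$, for which it is easy to see that $S^{[\ell]} v_\ell = \nabla_{\epsilon_\ell} F(0, v_{1:n})$ for all $\ell = 1, \ldots, n$. Therefore, simply differentiating $F$ with respect to $\epsilon_{1:n}$ at zero allows us to obtain all $S^{[\ell]} v_\ell$ simultaneously.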
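The identity between the matrix-vector product and the gradient of the auxiliary function can be sanity-checked numerically on a toy neuron. The sketch below is an illustration only: it assumes a quadratic neuron $(\theta^\top x)^2$ with a squared loss, and it uses central finite differences as a stand-in for automatic differentiation so that it runs with the standard library alone. All function names and data values are hypothetical.

```python
def sigma(theta, x):
    return sum(t * xi for t, xi in zip(theta, x)) ** 2   # toy neuron (theta . x)^2


def phi(s, y):
    return 0.5 * (s - y) ** 2                            # squared loss on the output


def phi_prime(s, y):
    return s - y


def hess_vec(x, v):
    # The Hessian of sigma w.r.t. theta is 2 x x^T, so Hessian @ v = 2 (x . v) x.
    xv = sum(xi * vi for xi, vi in zip(x, v))
    return [2.0 * xv * xi for xi in x]


def explicit_Sv(theta, data, v):
    """Splitting matrix times v, averaging Phi'(sigma) * Hessian @ v over the data."""
    out = [0.0] * len(theta)
    for x, y in data:
        coeff = phi_prime(sigma(theta, x), y)
        out = [o + coeff * h / len(data) for o, h in zip(out, hess_vec(x, v))]
    return out


def F(eps, theta, data, v):
    """Auxiliary loss: the neuron output plus the extra term eps^T Hessian v."""
    total = 0.0
    for x, y in data:
        extra = sum(e * h for e, h in zip(eps, hess_vec(x, v)))
        total += phi(sigma(theta, x) + extra, y)
    return total / len(data)


def grad_eps_F_at_zero(theta, data, v, h=1e-4):
    """Central finite-difference gradient of F w.r.t. eps, evaluated at eps = 0."""
    grad = []
    for j in range(len(theta)):
        e = [0.0] * len(theta)
        e[j] = h
        plus = F(e, theta, data, v)
        e[j] = -h
        minus = F(e, theta, data, v)
        grad.append((plus - minus) / (2 * h))
    return grad


theta = [0.5, -1.0]
data = [([1.0, 2.0], 0.3), ([-1.0, 0.5], 1.2), ([0.2, -0.7], -0.4)]
v = [0.6, 0.8]
direct = explicit_Sv(theta, data, v)
via_aux = grad_eps_F_at_zero(theta, data, v)
```

In a deep learning framework the finite-difference step would be replaced by one backward pass with respect to the auxiliary variables of all neurons at once.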
Stochastic Gradient on Rayleigh quotient
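Note that we still need to average over the whole dataset to measure the Rayleigh quotient gradients, which is computationally expensive in the case of big data. However, we can conveniently address this by approximating the expectation over the data with subsampled mini-batches $\mathcal{B}$. In the case of single-neuron networks, that is,
$$S(\theta)\, v \approx \nabla_\epsilon \hat{F}_{\mathcal{B}}(0, v), \qquad \hat{F}_{\mathcal{B}}(\epsilon, v) := \frac{1}{|\mathcal{B}|} \sum_{x \in \mathcal{B}} \Phi\big(\sigma(\theta, x) + \epsilon^\top \nabla^2_{\theta\theta} \sigma(\theta, x)\, v\big).$$
Assume we sweep the training data $T$ times to minimize the Rayleigh quotient to convergence (see Equation (7)). In this way, approximating all splitting indexes and gradients costs about as much as $T$ passes of back-propagation over the data ($T$ is often a small constant). More importantly, a significant advantage of our gradient-based approximation is that the space complexity is only $O(nd)$, since only one $d$-dimensional vector per neuron is stored instead of a $d \times d$ matrix. In this way, all calculations can be performed efficiently on GPUs. This gives us an algorithm for splitting that is almost as efficient as back-propagation.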
Our overall algorithm is shown in Algorithm 1, which improves over splitting2019 by offering much lower time and space complexity, and the flexibility of incorporating energy and other costs of different neurons. It can be implemented easily using modern deep learning frameworks such as PyTorch (paszke2017automatic). Our code is available at https://github.com/dilinwang820/fast-energy-aware-splitting.
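The alternation between parametric updates and energy-aware splitting can be sketched as a short driver loop. This is not the authors' Algorithm 1 or their released code: every callable passed to `energy_aware_splitting` is a hypothetical placeholder, and the toy stand-ins below exist only so that the loop can be executed end to end.

```python
def energy_aware_splitting(net, train, splitting_indexes, split_costs,
                           select, split, total_flops, growth_ratio, num_stages):
    """Alternate parametric updates with energy-aware splitting stages."""
    for _ in range(num_stages):
        train(net)                                    # parametric update phase
        indexes, directions = splitting_indexes(net)  # e.g. Rayleigh-quotient descent
        costs = split_costs(net)                      # re-evaluated on the current net
        budget = growth_ratio * total_flops(net)
        chosen = select(indexes, costs, budget)
        if not chosen:                                # nothing worth splitting
            break
        split(net, chosen, directions)
    train(net)                                        # final parametric update
    return net


# Toy stand-ins (hypothetical): a "network" is just a dict with a flops counter.
def toy_train(net):
    net["train_calls"] += 1


def toy_indexes(net):
    return [-1.0, -2.0], [None, None]


def toy_costs(net):
    return [10.0, 15.0]


def toy_select(indexes, costs, budget):
    order = sorted((i for i, lam in enumerate(indexes) if lam < 0),
                   key=lambda i: indexes[i] / costs[i])
    chosen = []
    for i in order:
        if costs[i] > budget:
            break
        chosen.append(i)
        budget -= costs[i]
    return chosen


def toy_split(net, chosen, directions):
    net["flops"] += sum(toy_costs(net)[i] for i in chosen)


net = energy_aware_splitting(
    {"flops": 100.0, "train_calls": 0}, toy_train, toy_indexes, toy_costs,
    toy_select, toy_split, total_flops=lambda n: n["flops"],
    growth_ratio=0.2, num_stages=2)
```

Passing the individual steps in as callables keeps the loop independent of any particular framework; a real implementation would supply training, index estimation and splitting routines for an actual network.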
5 Experiments
We apply our method to split small variants of MobileNetV1 (howard2017mobilenets) and MobileNetV2 (sandler2018mobilenetv2), on both the CIFAR-100 (https://www.cs.toronto.edu/~kriz/cifar.html) and ImageNet datasets. We show that our method finds networks that are more accurate and also more energy-efficient compared to expert-designed architectures and pruned models.
Settings of Our Algorithm
In all our tests of our Algorithm 1, we restrict the increase of the energy cost to be smaller than a budget at each splitting stage. We set adaptively to be proportional to the total flops of the current network such that the flops of the augmented network obtained by splitting cannot exceed times of the previous one. We denote by the growth ratio and set unless otherwise specified.
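The adaptive budget can be expressed in a few lines. The sketch below assumes the budget at each stage is the growth ratio times the flops of the current network, so that one stage can enlarge the network by at most that fraction; the growth ratio of 0.25 and the seed size are illustrative values, not settings reported in the text.

```python
def splitting_budget(current_flops, growth_ratio):
    """Largest flops increase allowed in one splitting stage."""
    return growth_ratio * current_flops


def flops_cap_per_stage(seed_flops, growth_ratio, num_stages):
    """Upper bound on network flops after each stage if every budget is used up."""
    caps = [float(seed_flops)]
    for _ in range(num_stages):
        caps.append(caps[-1] + splitting_budget(caps[-1], growth_ratio))
    return caps


caps = flops_cap_per_stage(seed_flops=1_000_000, growth_ratio=0.25, num_stages=3)
```

Because the budget is proportional to the current size, the cap grows geometrically with the number of splitting stages.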
For our fast splitting indexes approximation (see section 4), we set batch size to be 64 and use RMSprop (tieleman2012lecture) optimizer with 0.001 learning rate. We find the Rayleigh-Quotient converges fast in general: for small CIFAR-10/100 datasets, we train 10 epochs (T=10); for the large-scale ImageNet set, we find a small T (=2) is sufficient.
5.1 Testing Importance of Energy-Aware Splitting (Results on CIFAR-10)
To study the importance of our energy-aware splitting, we compare our method (denoted as splitting (energy-aware)) to splitting2019 (denoted as splitting (vanilla)), which does not use energy metrics to guide the splitting process. In this experiment, we apply both splitting algorithms to grow small variants of MobileNet (howard2017mobilenets) trained on the CIFAR-10 dataset.
We test our algorithm on two variants of MobileNet, each of which consists of one regular convolution layer, followed by $k = 3$ and $k = 6$ MobileNet blocks (howard2017mobilenets), respectively. In both variants, the resolution is reduced 3 times evenly, and one extra MobileNet block with a fully connected layer is attached for classification. Note that each MobileNet block consists of a depthwise convolutional layer and a pointwise convolutional layer. In our implementation, we only split the convolutional filters in the pointwise convolutional layers and duplicate the corresponding depthwise convolution filters accordingly during splitting. We start with small networks that have the same number of channels (=8) across all layers to better study how neurons are split. We set the batch size to 256 and the learning rate to 0.1 for 160 epochs, with the learning rate dropped 10x at epochs 80 and 120, for both variants ($k = 3$ and $k = 6$).
Our results are shown in Figure 3, which shows that our splitting (energy-aware) approach yields better trade-offs between accuracy and flops than splitting (vanilla) in both cases ($k = 3$ and $k = 6$). We find that splitting (vanilla) does discover networks with small model size (fewer parameters, see Figure 3 (b) and (d)), but does not yield lower energy consumption in practice. These results highlight the importance of using real energy cost for guiding the splitting process in order to optimize for the best energy-efficiency.
Figure 3: panels (a) k = 3, (b) k = 3, (c) k = 6, (d) k = 6.
5.2 Results on CIFAR-100
We compare our method with several state-of-the-art pruning baselines on the CIFAR-100 dataset. We also show that our fast gradient-based splitting approximation in Section 4 achieves the same accuracy as the exact eigen-computation, while significantly reducing the overall splitting time.
We again apply splitting on a small version of MobileNet (howard2017mobilenets) (with the same network topology) to obtain a sequence of increasingly large models. Specifically, we set the number of channels of the base model to be for each layer, respectively. We compare our method with a simple but competitive width multiplier (howard2017mobilenets) baseline, which prunes filters uniformly across layers (denoted as Width multiplier) from the original full-size MobileNet. We also experiment with three state-of-the-art structured pruning methods: Pruning (Bn) (liu2017learning), Pruning (L1) (li2016pruning) and MorphNet (gordon2018morphnet). The implementation of all the baselines is based on liu2018rethinking. For all methods, we normalize the inputs using channel means and standard deviations. We use stochastic gradient descent with momentum 0.9, weight decay 1e-4 and batch size 128. We set the initial learning rate to 0.1 and train for 160 epochs, with the learning rate decreased by 10x at epochs 80 and 120. For all pruned models, we report the fine-tuned performance with the same training settings. For MorphNet, we grid search the best sparsity hyper-parameter in the range [1e-8, 5e-8, 1e-9, 5e-9, 1e-10] and report the best models found.
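The optimizer settings and the step schedule described above can be written down compactly. The sketch below is framework-agnostic plain Python; the dictionary keys and the function name are hypothetical, and it assumes the learning rate drops at the start of epochs 80 and 120 when epochs are counted from zero.

```python
TRAIN_CONFIG = {
    "optimizer": "sgd",
    "momentum": 0.9,
    "weight_decay": 1e-4,
    "batch_size": 128,
    "epochs": 160,
    "base_lr": 0.1,
    "milestones": (80, 120),
    "gamma": 0.1,  # 10x decrease at each milestone
}


def step_lr(epoch, base_lr, milestones, gamma):
    """Learning rate at a given epoch for a multi-step decay schedule."""
    drops = sum(1 for m in milestones if epoch >= m)
    return base_lr * gamma ** drops


schedule = [
    step_lr(e, TRAIN_CONFIG["base_lr"], TRAIN_CONFIG["milestones"], TRAIN_CONFIG["gamma"])
    for e in range(TRAIN_CONFIG["epochs"])
]
```

Keeping the same schedule for splitting checkpoints and fine-tuned pruned models makes the comparison depend on the architecture rather than on the training recipe.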
Figure 4 (a) shows the results on CIFAR-100, in which our method achieves the best accuracy when targeting similar flops. To draw a further comparison between the splitting and pruning approaches, we prune the final network learned by our splitting algorithm to obtain a sequence of increasingly smaller models using Pruning (Bn) (liu2017learning). As shown in Figure 4 (b), our splitting checkpoints (red circles) form a better flops-accuracy trade-off curve than the models obtained by pruning the same model (green pentagons). This confirms the advantage of our method in neural architecture optimization, especially in the low-flops regime.
Figure 4: (a) and (b) test accuracy vs. Log10(flops).
Figure 5: (a) test accuracy vs. Log10(flops); (b) splitting time (s) vs. #split.
In Figure 5 (a-b), we examine the accuracy and speed of our fast gradient-based eigen-approximation. We run all methods on a server with one V100 GPU and 16 CPU cores and report the wall-clock time. We can see that our fast method (red dots and bars) achieves almost the same accuracy as the splitting based on exact eigen-decomposition (blue dots and bars), while achieving significant gain in computational time (see Figure 5 (b)).
5.3 Results on ImageNet
We conduct experiments on the large-scale ImageNet dataset, on which our method again shows clear advantages over existing methods. Note that splitting based on exact eigen-decomposition is no longer feasible on ImageNet, so our fast gradient-based approximation must be used.
The ImageNet dataset (deng2009imagenet) consists of about 1.2 million training images and 50,000 validation images, classified into 1,000 distinct classes. We resize the images to $224 \times 224$, and adopt the standard data augmentation scheme (mirroring and shifting) for training images (e.g. howard2017mobilenets; sandler2018mobilenetv2).
We choose both MobileNetV1 (howard2017mobilenets) and MobileNetV2 (sandler2018mobilenetv2) as our base nets for splitting. Both are strong baselines that were hand-designed and heavily tuned to optimize accuracy under a flops constraint on the ImageNet dataset.
For parametric updates, we follow standard training settings on the ImageNet dataset using MobileNets. Specifically, we train with a batch size of 128 per GPU on 4 GPUs (total batch size 512). We use stochastic gradient descent with an initial learning rate and for MobileNetV1 and MobileNetV2, respectively. We apply cosine learning rate annealing and use label smoothing (0.1), following he2019bag.
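Cosine annealing and label smoothing are both short formulas. The sketch below shows one common form of each, under stated assumptions: the learning rate is annealed to zero over the whole run, the smoothing mass of 0.1 is spread uniformly over all classes, and `base_lr` is left as a parameter because the initial learning rates are not given in the text above. Function names are hypothetical.

```python
import math


def cosine_lr(step, total_steps, base_lr, min_lr=0.0):
    """Cosine-annealed learning rate, from base_lr at step 0 to min_lr at the end."""
    progress = step / total_steps
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


def smoothed_targets(label, num_classes, smoothing=0.1):
    """Target distribution with the smoothing mass spread uniformly over all classes."""
    off = smoothing / num_classes
    target = [off] * num_classes
    target[label] += 1.0 - smoothing
    return target


lr_start = cosine_lr(0, 100, base_lr=1.0)     # base_lr=1.0 is a placeholder value
lr_middle = cosine_lr(50, 100, base_lr=1.0)
lr_end = cosine_lr(100, 100, base_lr=1.0)
target = smoothed_targets(label=2, num_classes=4)
```

Other conventions exist, for example spreading the smoothing mass only over the incorrect classes, so the exact variant should be checked against the training code being reproduced.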
For our method, we start with relatively small models (denoted by Splitting-0 (seed)) by shrinking the network uniformly with a width multiplier of 0.3, and gradually grow the network via energy-aware splitting. We use Splitting-$i$ to represent the model we discovered at the $i$-th splitting stage. We report the single-center-crop validation error of different models.
In Table 1, we find that our method generally improves top-1 accuracy when targeting similar flops. In the low-flops regime, our method achieves 3.06% higher top-1 accuracy compared with MobileNet (0.5X) (with width multiplier 0.5). The model found by our method also outperforms a prior pruning method (AMC (he2018amc)) when compared with checkpoints at similar flops.
From Table 2, we find that our splitting models yield better performance than prior expert-designed architectures in all flops regimes. Specifically, our Splitting-3 reaches 72.84% top-1 accuracy, a 0.8% improvement over its corresponding baseline model. In the low-flops regime, our Splitting-2 achieves a 1.96% top-1 accuracy improvement over MobileNetV2 (0.75x), and our Splitting-1 is 1.1% higher than MobileNetV2 (0.5x). Our performance is also about 0.9% higher than AMC when targeting 70% flops.
Table 1: | Model | MACs (G) | Top-1 Accuracy | Top-5 Accuracy |
Table 2: | Model | MACs (G) | Top-1 Accuracy | Top-5 Accuracy |
5.4 Ablation study
In our algorithm, the growth ratio controls how many neurons we can split at each splitting stage. In this section, we perform an in-depth analysis of the effect of different growth ratios. We also examine the robustness of our splitting method with respect to randomness during network training and splitting (e.g. parameter initialization and data shuffling).
Impact of growth ratio
To find the optimal growth ratio , we ran multiple experiments with different growth ratio under the same settings as section 5.2. Figure 6 (a) shows the performance of various runs. We find that the growth ratio in the range of tend to perform similarly well. However, the smaller growth ratio of tends to give lower accuracy, this may be because with a small growth ratio, the neurons in the layers close to the input may never be selected because of their higher energy cost for splitting, hence yielding sub-optimal networks.
Figure 6: (a) impact of the growth ratio and (b) robustness across random seeds; x-axis: Log10(flops).
We apply our method to grow a small MobileNet (howard2017mobilenets) using different random seeds for parameters initialization and data shuffle under the same setting as Figure 6 (a) with a growth ratio 0.5. Figure 6 (b) shows the test performance of different models learned. As we can see from Figure 6 (b), all runs perform similarly well with small variations.
6 Related Work
Neural architecture search (NAS) has been found to be a powerful tool for automating energy-efficient architecture design. Most existing NAS methods are based on black-box optimization techniques, including reinforcement learning (e.g. zoph2016neural; zoph2018learning) and evolutionary algorithms (e.g. real2019regularized; real2017large). However, these methods are often extremely time-consuming due to the enormous search space of possible architectures and the high cost of evaluating the performance of each candidate network. More recent approaches have made the search more efficient by using weight-sharing (e.g. pham2018efficient; liu2018darts; cai2018proxylessnas), which, however, suffers from the so-called multi-model forgetting problem (benyahia2019overcoming) that causes training instability and performance degradation during search. Overall, designing the best architectures using NAS still requires a lot of expert knowledge and trial and error.
In contrast, pruning-based methods construct smaller networks from a pretrained over-parameterized neural network by gradually removing the least important neurons. Various pruning strategies have been developed based on different heuristics (e.g., han2015deep; li2016pruning; luo2017thinet; he2017channel; peng2019collaborative), including energy-aware pruning methods that use energy consumption related metrics to guide the pruning process (e.g., yang2017designing; gordon2018morphnet; he2018amc; yang2018ecc). However, a common issue of these methods is that they alter the standard training objective with sparsity-inducing regularization, which necessitates sensitive hyper-parameter tuning. Furthermore, the final performance is largely limited by the initial hand-crafted network, which may not be optimal in the first place.
7 Conclusion
In this work, we present a fast energy-aware splitting steepest descent approach for resource-efficient neural architecture optimization that generalizes splitting2019. Empirical results on the large-scale ImageNet benchmark using MobileNetV1 and MobileNetV2 demonstrate the effectiveness of our method.
This work is supported in part by NSF CRII 1830161 and NSF CAREER 1846421.