Embedded deep learning platforms have witnessed two simultaneous improvements. First, the accuracy of convolutional neural networks (CNNs) has been significantly improved through the use of automated neural-architecture search (NAS) algorithms to determine CNN structure. Second, there has been increasing interest in developing application-specific platforms for CNNs that provide improved inference performance and energy consumption as compared to GPUs. Embedded deep learning platforms differ in the amount of compute resources and memory-access bandwidth, which would affect performance and energy consumption of CNNs. It is therefore critical to consider the available hardware resources in the network architecture search. To this end, we introduce TEA-DNN, a NAS algorithm targeting multi-objective optimization of execution time, energy consumption, and classification accuracy of CNN workloads on embedded architectures. TEA-DNN leverages energy and execution time measurements on embedded hardware when exploring the Pareto-optimal curves across accuracy, execution time, and energy consumption and does not require additional effort to model the underlying hardware. We apply TEA-DNN for image classification on actual embedded platforms (NVIDIA Jetson TX2 and Intel Movidius Neural Compute Stick). We highlight the Pareto-optimal operating points that emphasize the necessity to explicitly consider hardware characteristics in the search process. To the best of our knowledge, this is the most comprehensive study of Pareto-optimal models across a range of hardware platforms using actual measurements on hardware to obtain objective values.
TEA-DNN: the Quest for Time-Energy-Accuracy Co-optimized Deep Neural Networks
Lile Cai* Anne-Maelle Barneche* Arthur Herbout* Chuan Sheng Foo Jie Lin Vijay Ramaseshan Chandrasekhar Mohamed M. Sabry
*Equal contribution.
Deep convolutional neural networks (CNNs) have achieved state-of-the-art performance in image classification, object detection and many other applications [1]. To achieve better accuracy, CNN models have become increasingly deep and require more computational resources [2, 3]. This poses a challenge when these models are deployed on resource-limited devices, such as mobile and embedded platforms: the memory on these devices may not be large enough to hold the models, or running the model may consume more power than the device can supply.
Much effort has been devoted to designing CNN models that can run efficiently on these devices, for instance, by manually designing more efficient convolution operations and network architectures [4, 5, 6]. However, this approach demands expert knowledge and obtaining an optimal model is difficult as one has to carefully balance the trade-off between accuracy and computational resources. An alternative approach is to use automated neural architecture search (NAS) algorithms to find optimal models under hardware constraints [7, 8, 9]. NAS algorithms usually consist of a controller, which is responsible for sampling models from the search space, a trainer, which is responsible for training the sampled models and an evaluator, which is responsible for evaluating the objective values (like model accuracy) on currently sampled models. The parameters of the controller are then updated to increase the likelihood that it subsequently samples better models.
Due to the variation in device hardware/software configurations, models optimized for one device can be suboptimal for another. Consider two hardware platforms, namely the TITAN X GPU [10] and the Intel Movidius Neural Compute Stick (NCS) [11], which respectively exemplify a high-performance and an embedded platform. Figure 1 displays the Pareto-optimal CNN models searched for the GPU and the NCS, respectively, for the trade-off between classification error and inference time; both sets are then executed on the TITAN X GPU (for measurement details see Section IV). It can be seen that the Pareto-optimal models searched for the NCS are far from the Pareto curve for the GPU, implying that a platform-agnostic NAS may result in highly suboptimal models, in this case resulting in up to increase in execution time to achieve comparable accuracy.
The problem revealed in Fig. 1 demonstrates that, in order to obtain an optimal model for a hardware platform, its characteristics have to be taken into consideration during the search process. To this end, we introduce TEA-DNN (Time-Energy-Accuracy co-optimized Deep Neural Networks), a NAS framework that explicitly considers two hardware metrics (inference time and energy consumption) in addition to classification accuracy as objective metrics. We formulate this problem as a multi-objective optimization problem and leverage Bayesian optimization to search for the Pareto-optimal points. While Bayesian optimization has been used to obtain hardware-aware neural networks [12], it was only used to search for several hyper-parameters with a fixed network architecture. To the best of our knowledge, our work is the first to apply Bayesian optimization for neural architecture search. Furthermore, TEA-DNN does not require modeling the hardware platform and instead leverages the ability to directly measure energy and execution time on actual hardware. We summarize our contributions as follows:
A time, energy and accuracy co-optimization framework for CNNs.
Employing Bayesian optimization to search for CNN structures that yield Pareto-optimal operating conditions.
We demonstrate how different device configurations can lead to different trade-off behaviors.
We demonstrate that optimal models searched on one hardware platform are not optimal for another and thus reiterate the importance of hardware-aware NAS.
II Related Work
II-A Neural Architecture Search (NAS)
Early versions of NAS algorithms [7] employed recurrent neural networks (RNNs) to predict the architecture of a target CNN, where the weights of the RNN are updated using reinforcement learning. [8] follows the same framework as proposed in [7], but instead of using an RNN to predict the entire network architecture, the algorithm only predicts the optimal structure for one convolutional module (or “cell”). Identical cells are then stacked multiple times to form the full network. [9] replaced reinforcement learning with progressive search, which can yield better models with fewer samples.
II-B Hardware-Aware NAS
Explicitly incorporating hardware constraints into NAS has been an active research topic in recent years. HyperPower [12] approximates the power and memory consumption of a network using linear regression, and the approximated functions are then used in the acquisition function of Bayesian optimization to avoid sampling models that violate the constraints. MnasNet [13] focused on searching optimal networks for mobile devices and used inference time as one of the objectives. DPP-Net [14] performs the search on different devices and considers more objective metrics, e.g., error rate, number of parameters, FLOPs, memory, and inference time.
Our work is closely related to MnasNet and DPP-Net in that all three search for the Pareto-optimal networks for a specific device. Our approach is unique in two aspects: 1) we perform true multi-objective optimization instead of combining several objectives into a single objective as done in MnasNet; 2) unlike DPP-Net, we do not use any surrogate functions to approximate the optimization objectives. Instead, we directly measure the real-world values for all three objectives (i.e., time, energy and accuracy). This eliminates the need to model the targeted hardware, which is a challenging task given the diversity of hardware platform configurations.
III TEA-DNN Optimization Framework
III-A System Overview
We formulate the neural network architecture search problem as a multi-objective optimization problem in which we wish to find a network architecture, parameterized as described in Section III-B, that minimizes classification error, energy consumption, and inference time. We do not assume a closed-form model for energy consumption or inference time, but evaluate them directly on actual hardware to measure real-world performance. Networks are trained and evaluated on GPUs for efficiency, as we assume that classification error is not affected by the specific hardware a network is run on. As formulated, this is an instance of a black-box optimization problem where the objective functions can only be evaluated (and are not differentiable), and where function evaluations (especially classification error, which requires training the model) are costly. Note that no single “best” point exists for a multi-objective optimization problem. A solution is instead defined by a Pareto-optimal set of points, for which no objective function can be improved without worsening at least one other objective.
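To make the notion of a Pareto-optimal set concrete, the following minimal sketch filters a list of objective vectors down to the non-dominated ones, assuming every objective is minimized. The function names and the (error, time, energy) values are hypothetical and chosen only for illustration; they are not measurements from this work.

```python
from typing import List, Sequence


def dominates(a: Sequence[float], b: Sequence[float]) -> bool:
    """Return True if point a dominates point b (all objectives minimized)."""
    no_worse = all(x <= y for x, y in zip(a, b))
    strictly_better = any(x < y for x, y in zip(a, b))
    return no_worse and strictly_better


def pareto_front(points: List[Sequence[float]]) -> List[int]:
    """Return the indices of the points that no other point dominates."""
    return [
        i for i, p in enumerate(points)
        if not any(dominates(q, p) for j, q in enumerate(points) if j != i)
    ]


# Hypothetical (error, time, energy) triples, made up for illustration only.
models = [
    (0.10, 5.0, 20.0),
    (0.08, 7.0, 30.0),
    (0.12, 6.0, 25.0),  # dominated by the first model
    (0.08, 9.0, 35.0),  # dominated by the second model
]
print(pareto_front(models))  # [0, 1]
```

The first two models survive because each is better than the other in at least one objective; the quadratic-time filter is adequate for the few hundred candidates a search of this kind produces.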
We chose to employ a Bayesian optimization algorithm [15] (detailed in Section III-C) to solve this optimization problem. We provide a brief overview and refer the reader to the comprehensive review in [16]. Bayesian optimization algorithms perform a sequential exploration of the parameter space while building a surrogate probabilistic model to approximate the objective functions. This model is used to select the points at which to evaluate the objective functions next, and the obtained function values are then used to update the model. The algorithm proceeds iteratively following this select-evaluate-update loop, such that points in the Pareto-optimal set are selected more frequently as the algorithm progresses. We stop the algorithm after a specified number of iterations or when a time limit is reached. A schematic overview of our search algorithm is shown in Fig. 2.
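The select-evaluate-update loop with its two stopping criteria can be sketched as below. This is only a skeleton under stated assumptions: `search_loop`, `random_select` and `toy_evaluate` are hypothetical names, the selector proposes random points rather than maximizing an acquisition function, and the toy objectives stand in for training a network and measuring it on hardware.

```python
import random
import time
from typing import Callable, List, Optional, Tuple

Point = Tuple[int, ...]
Objectives = Tuple[float, ...]
History = List[Tuple[Point, Objectives]]


def search_loop(
    select: Callable[[History], Point],
    evaluate: Callable[[Point], Objectives],
    max_iters: int,
    time_limit_s: Optional[float] = None,
) -> History:
    """Run select-evaluate-update until the iteration or time budget is hit."""
    history: History = []
    start = time.monotonic()
    for _ in range(max_iters):
        if time_limit_s is not None and time.monotonic() - start > time_limit_s:
            break
        x = select(history)     # select: propose the next point from past data
        y = evaluate(x)         # evaluate: measure all objectives at that point
        history.append((x, y))  # update: a real selector would refit its model on this
    return history


# Stand-ins (assumptions): random proposals and a toy two-objective function.
rng = random.Random(0)


def random_select(history: History) -> Point:
    return tuple(rng.randrange(8) for _ in range(4))


def toy_evaluate(x: Point) -> Objectives:
    return (float(sum(x)), float(28 - sum(x)))


history = search_loop(random_select, toy_evaluate, max_iters=10)
print(len(history))  # 10
```

Passing the full history to `select` keeps the skeleton agnostic to the surrogate model: a Bayesian selector would refit its model on the history before proposing the next point.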
III-B Search Space
We search over the subset of network architectures that can be described as repetitions of a modular network “cell”, as proposed by . The overall network architecture is predefined (Fig. 3(a)) and consists of cells with either stride 1 or 2. As a common heuristic, the number of filter channels is doubled after the stride 2 cells. As such, the network architecture is uniquely determined by the initial filter channel number, the number of cell repeats and the cell structure. The initial filter channel number and the number of cell repeats are pre-specified hyper-parameters, and the cell structure is searched using Bayesian optimization.
Specifically, each cell is composed of 5 building blocks, and each building block (illustrated in Fig. 3(b)) is parameterized by 4 parameters, giving a 20-dimensional parameter space. Two of the parameters denote the inputs, and the other two specify the operations applied to the respective inputs. The input space of each building block consists of the outputs of all preceding blocks in the current cell as well as the outputs from the two preceding cells. The operation space includes the following eight functions commonly used in top performing CNNs:
: depthwise-separable convolution
: depthwise-separable convolution
: depthwise-separable convolution
identity: identity mapping
: average pooling
: max pooling
: convolution followed by convolution
: dilated convolution with dilation rate = 2
The search space described above has an order of (). The outputs of the two operations are then combined by element-wise addition. The final output of the cell is the concatenation of all unused building block outputs.
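As a rough check on the scale of this search space, the sketch below counts the possible cells under one particular counting convention: each block picks two inputs (ordered, repeats allowed) from the two preceding cells plus all earlier blocks, and two of the eight operations, with no symmetries removed. That convention, the constant names and the integer encoding are assumptions made for illustration, so the result need not match the figure intended in the text.

```python
import math
import random

NUM_BLOCKS = 5       # building blocks per cell (from the text)
NUM_OPS = 8          # candidate operations (from the text)
NUM_CELL_INPUTS = 2  # outputs of the two preceding cells (from the text)


def num_input_choices(block_index: int) -> int:
    """Number of inputs available to a block (0-based index within the cell)."""
    return NUM_CELL_INPUTS + block_index


def search_space_size() -> int:
    """Count ordered (input, input, op, op) choices over all blocks."""
    return math.prod(
        (num_input_choices(b) * NUM_OPS) ** 2 for b in range(NUM_BLOCKS)
    )


def sample_cell(rng: random.Random) -> list:
    """Sample a random 20-dimensional cell encoding, 4 integers per block."""
    cell = []
    for b in range(NUM_BLOCKS):
        n = num_input_choices(b)
        cell += [rng.randrange(n), rng.randrange(n),
                 rng.randrange(NUM_OPS), rng.randrange(NUM_OPS)]
    return cell


print(search_space_size())                 # 556627761561600, about 5.6e14
print(len(sample_cell(random.Random(0))))  # 20
```

Treating the two inputs of a block as interchangeable would reduce the count, which is why the convention matters when quoting an order of magnitude.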
III-C Multi-Objective Bayesian Optimization
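Bayesian optimization is a sequential model-based approach that approximates each objective function with a Gaussian process (GP) model. For a particular objective function (e.g., classification error), let $f$ be its surrogate GP model, $x_1, \dots, x_n$ be the evaluated network architectures in the search space, $f_i = f(x_i)$ be the objective function value for network $x_i$, and $y_i$ be the actual measured function value. The GP model assumes that $\mathbf{f} = (f_1, \dots, f_n)$ are jointly Gaussian with mean $\mathbf{m}$ and covariance $\mathbf{K}$, and that the observations $\mathbf{y} = (y_1, \dots, y_n)$ are normally distributed given $\mathbf{f}$, with observation-noise variance $\sigma^2$:
$$\mathbf{f} \mid x_{1:n} \sim \mathcal{N}(\mathbf{m}, \mathbf{K}), \qquad \mathbf{y} \mid \mathbf{f}, \sigma^2 \sim \mathcal{N}(\mathbf{f}, \sigma^2 \mathbf{I}).$$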
Each iteration of Bayesian optimization consists of 3 steps:
Selecting the next point (network architecture to evaluate) by maximizing an acquisition function, which specifies a likely candidate that improves the objective(s). We used the PESMO (Predictive Entropy Search Multi-objective) [15] acquisition function in our experiments, which chooses points that maximally reduce the entropy of the current posterior distribution given by the GPs over the Pareto set.
Evaluating the objective functions at the selected point.
Updating the mean and covariance parameters of the GP models.
To employ Bayesian optimization for neural architecture search, we use the 20-dimensional parameterization of the search space as described in Section III-B. Our three objective functions are 1) the error rate (i.e., $1 - \text{accuracy}$), 2) the inference time and 3) the energy consumption, and we used the open-source PESMO implementation in Spearmint for our experiments.
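The role of the GP surrogate can be illustrated with a small NumPy sketch that computes the posterior mean and variance of one objective at unseen points. It is not the Spearmint or PESMO code: the zero prior mean, the squared-exponential kernel, the length scale, the noise variance and the one-dimensional toy data are all assumptions made for illustration.

```python
import numpy as np


def rbf_kernel(a: np.ndarray, b: np.ndarray, length_scale: float = 1.0) -> np.ndarray:
    """Squared-exponential covariance between the rows of a and b."""
    sq_dist = (np.sum(a**2, axis=1)[:, None]
               + np.sum(b**2, axis=1)[None, :]
               - 2.0 * a @ b.T)
    return np.exp(-0.5 * np.maximum(sq_dist, 0.0) / length_scale**2)


def gp_posterior(x_train, y_train, x_test, noise_var=1e-4):
    """Posterior mean and variance of a zero-mean GP at the rows of x_test."""
    k_tt = rbf_kernel(x_train, x_train) + noise_var * np.eye(len(x_train))
    k_ts = rbf_kernel(x_train, x_test)
    k_ss = rbf_kernel(x_test, x_test)
    chol = np.linalg.cholesky(k_tt)  # k_tt = chol @ chol.T
    alpha = np.linalg.solve(chol.T, np.linalg.solve(chol, y_train))
    mean = k_ts.T @ alpha
    v = np.linalg.solve(chol, k_ts)
    var = np.diag(k_ss) - np.sum(v**2, axis=0)
    return mean, var


# Hypothetical 1-D inputs and error rates, made up for illustration only.
x_train = np.array([[0.0], [1.0], [2.0]])
y_train = np.array([0.30, 0.20, 0.25])
x_test = np.array([[1.0], [5.0]])

mean, var = gp_posterior(x_train, y_train, x_test)
print(mean, var)
```

Near an evaluated point the posterior variance collapses towards the noise level, while far from the data it returns to the prior variance; an acquisition function exploits exactly this difference when deciding where to evaluate next.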
IV Experimental Setup
We evaluate TEA-DNN models on different deep-learning hardware platforms, representing embedded and server-based systems. Table I summarizes the properties of these platforms.
|Property||GTX TITAN X||Jetson TX2||Movidius|
|Processing Unit||3072 CUDA cores||256 CUDA cores||Myriad 2 VPU|
|FLOPS||6.7T FP32||1.5T FP32||2T FP16|
|Memory||12 GByte GDDR5||8 GByte LPDDR4||4 Gbit LPDDR3|
|Mem. Bandwidth||336.6 GByte/s||59.7 GByte/s||4 Gbit/s|
|Power||250 W||15 W||1 W|
IV-A Training Setup
In our experiments, models are trained and tested on the CIFAR-10 dataset, which is a popular benchmarking dataset for image classification. CIFAR-10 has 50,000 training images and 10,000 test images of dimension $32 \times 32$. We removed 5,000 images from the training set for use as a validation set and train on the remaining 45,000 images. During the search process, each model is trained for 20 epochs with a batch size of 32. We use the RMSProp optimizer [18] with momentum and decay both set to 0.9. The learning rate is set to 0.01 and decayed by a factor of 0.94 every 2 epochs. Weight decay is set to 0.00004. The data augmentation technique we used is as described in . The initial channel number is set to 24 and the number of cell repeats is set to 2 in the search process.
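The learning-rate schedule used during the search can be written out as a short function. The sketch assumes a staircase schedule with zero-based epoch indices, i.e. the rate is multiplied by 0.94 once every two completed epochs; the text does not say whether the decay is applied in steps or continuously, and `learning_rate` is a hypothetical helper name.

```python
def learning_rate(epoch: int, base_lr: float = 0.01,
                  decay: float = 0.94, decay_every: int = 2) -> float:
    """Staircase decay: multiply by `decay` once every `decay_every` epochs."""
    return base_lr * decay ** (epoch // decay_every)


# Print the rate at a few epochs of the 20-epoch search-time training run.
for epoch in (0, 2, 10, 19):
    print(epoch, round(learning_rate(epoch), 6))
```

Under this assumption the rate in the last of the 20 epochs is 0.01 × 0.94⁹, a little more than half of its initial value.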
IV-B Time and Energy Measurement
We describe how we measure inference time and energy consumption on each of the hardware platforms in Table I:
TITAN X GPU [10]. We run the model on the 5,000 validation images with a batch size of 100 and report the total inference time. We use the NVIDIA Management Library (NVML) [19] to monitor power consumption during evaluation and compute energy consumption by integrating the collected power values over the total inference time.
Jetson TX2 [20]. We run the model on the 5,000 validation images and report the total inference time. The batch size is set to 1 to match the actual usage scenario. We use the Python library provided by [21] to monitor power and compute energy consumption by integrating the collected power values over the total inference time.
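Computing energy from sampled power amounts to a numerical integration over time. The sketch below uses the trapezoidal rule on timestamped power samples; the choice of rule, the function name `energy_joules` and the sample values are assumptions for illustration, and the power-sampling step itself (e.g. through NVML) is not shown.

```python
from typing import Sequence


def energy_joules(timestamps_s: Sequence[float], power_w: Sequence[float]) -> float:
    """Integrate sampled power (W) over time (s) with the trapezoidal rule."""
    if len(timestamps_s) != len(power_w):
        raise ValueError("timestamps and power samples must have equal length")
    energy = 0.0
    for i in range(1, len(timestamps_s)):
        dt = timestamps_s[i] - timestamps_s[i - 1]
        energy += 0.5 * (power_w[i] + power_w[i - 1]) * dt
    return energy


# Hypothetical samples: a constant 10 W draw over 2 s, i.e. 20 J.
t = [0.0, 0.5, 1.0, 1.5, 2.0]
p = [10.0, 10.0, 10.0, 10.0, 10.0]
print(energy_joules(t, p))  # 20.0
```

Using the recorded timestamps rather than a nominal sampling period keeps the estimate correct when the power monitor delivers samples at irregular intervals.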
V Results and Discussions
V-A Evolution of Pareto Curve
To demonstrate the effectiveness of Bayesian optimization in searching for time-energy-accuracy co-optimized DNN models, we plot the Pareto curves for the TITAN X GPU as the search progresses in Fig. 4. We see that the Pareto curve evolves towards the bottom-left corner as the search progresses, implying that models with better trade-offs are being found. After about 500 iterations, the curve changes little, suggesting that the optimization has converged.
V-B Accuracy Benchmarking
We evaluate the performance of TEA-DNN models on the CIFAR-10 test set. For each device, we select the model that achieves the best accuracy and train it for 300 epochs. The starting learning rate is set to 0.025 and decayed by a factor of 0.1 every 150 epochs. The initial filter channel number is set to 48 and the number of cell repeats is set to 3 (Section III-B). The results are reported in Table II. It can be seen that models searched on different devices achieve similar error rates. However, the numbers of parameters and multiply-add operations of the model for the Movidius NCS are half those of the model for the GPU. Indeed, the extremely limited resources on the NCS caused TEA-DNN to explore network structures that lower the energy consumption on an embedded device, but would not affect the execution time or energy consumption on the higher-performance TITAN X GPU (CNNs with a low number of parameters do not fully utilize the parallel resources in a GPU). This is further illustrated in Fig. 5, where the cell structures of the models evaluated in Table II are shown. It can be seen that, compared with those for the TITAN X and Jetson, the Pareto-optimal structure for the Movidius NCS uses more and operations, as they use far fewer parameters than normal convolutions and thus allow the use of a larger kernel size, which helps to achieve high accuracy.
|Device||Error Rate (%)||#Parameters||#Mult-Adds|
|TITAN X GPU||7.16||15.3M||2.7B|
V-C Pareto Curve on Different Devices
We obtain the Pareto points for each pair of objectives (error-time, error-energy and energy-time) and illustrate the resulting curves for TITAN X, Jetson TX2 and Movidius in Fig. 6, Fig. 7 and Fig. 8, respectively. We see that time or energy has to increase to reduce the error rate for all hardware platforms, which is an intuitive result – deeper CNNs reduce error rates by using more compute operations. However, for the energy-time trade-off, the three platforms exhibit different behaviors. For the TITAN X, there is only one optimal point. For Jetson and Movidius, however, there is a trade-off between time and energy. This behavior is likely related to the following three factors: the relatively small input images, the amount of on-chip memory, and the bandwidth to off-chip memory. A model with fewer parameters and a larger number of activations can consume more time and less energy on a platform with limited on-chip memory—a substantial amount of time would be spent on waiting for memory-access requests to be serviced while computing units sit idle.
V-D Cross-Device Evaluation of Pareto-Optimal Models
We evaluate whether a set of Pareto-optimal models searched for one platform is also Pareto-optimal for another. For brevity, we consider the error rate vs. time trade-off. First, we evaluate the Pareto-optimal models searched for the TITAN X GPU on the Jetson TX2 (Fig. 9(a)) and the Movidius NCS (Fig. 9(b)). For the Jetson TX2, none of the Pareto-optimal models searched for the GPU is Pareto-optimal. For the Movidius NCS, only 3 out of the 9 models are Pareto-optimal. This clearly indicates that incorporating the energy and execution time of the targeted platform is key in TEA-DNN.
In addition, we evaluate the sets of Pareto-optimal models for the Jetson TX2 and the Movidius NCS on the TITAN X GPU, as illustrated in Fig. 10. For both the Jetson TX2 and the Movidius NCS, only two models are Pareto-optimal on the GPU. We also note that the models searched for the embedded systems fall within a limited time range on the GPU, which suggests that the limited resources on embedded platforms yield CNN architectures that cannot fully leverage the compute resources available in the high-performance GPU.
In this work, we propose the TEA-DNN framework that employs Bayesian optimization to search for time-energy-accuracy co-optimized CNN models. We apply TEA-DNN on three different devices: TITAN X GPU, Jetson TX2 and Movidius NCS. Experimental results show that TEA-DNN can effectively find Pareto-optimal models within a few hundred iterations. By analyzing the Pareto curve of the search results, we demonstrate that different device configurations can lead to different trade-off behaviors. Cross-device evaluation of Pareto-optimal models demonstrates that optimal models searched for one hardware platform are not optimal for another and thus reiterates the importance of explicitly considering hardware characteristics in NAS.
[1] Ian Goodfellow, Yoshua Bengio, and Aaron Courville, Deep learning, vol. 1, MIT press Cambridge, 2016.
[2] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun, “Identity mappings in deep residual networks,” in European conference on computer vision. Springer, 2016, pp. 630–645.
[3] Christian Szegedy, Sergey Ioffe, Vincent Vanhoucke, and Alexander A Alemi, “Inception-v4, inception-resnet and the impact of residual connections on learning,” in AAAI, 2017, vol. 4, p. 12.
[4] Andrew G Howard, Menglong Zhu, Bo Chen, Dmitry Kalenichenko, Weijun Wang, Tobias Weyand, Marco Andreetto, and Hartwig Adam, “Mobilenets: Efficient convolutional neural networks for mobile vision applications,” arXiv preprint arXiv:1704.04861, 2017.
[5] X Zhang, X Zhou, M Lin, and J Sun, “Shufflenet: An extremely efficient convolutional neural network for mobile devices,” arXiv preprint arXiv:1707.01083, 2017.
[6] Gao Huang, Shichen Liu, Laurens van der Maaten, and Kilian Q Weinberger, “Condensenet: An efficient densenet using learned group convolutions,” group, vol. 3, no. 12, pp. 11, 2017.
[7] Barret Zoph and Quoc V Le, “Neural architecture search with reinforcement learning,” arXiv preprint arXiv:1611.01578, 2016.
[8] Barret Zoph, Vijay Vasudevan, Jonathon Shlens, and Quoc V Le, “Learning transferable architectures for scalable image recognition,” arXiv preprint arXiv:1707.07012, vol. 2, no. 6, 2017.
[9] Chenxi Liu, Barret Zoph, Jonathon Shlens, Wei Hua, Li-Jia Li, Li Fei-Fei, Alan Yuille, Jonathan Huang, and Kevin Murphy, “Progressive neural architecture search,” arXiv preprint arXiv:1712.00559, 2017.
[10] TITAN X GPU, https://www.geforce.com/hardware/desktop-gpus/geforce-gtx-titan-x.
[11] Movidius Neural Compute Stick, https://developer.movidius.com/.
[12] Dimitrios Stamoulis, Ermao Cai, Da-Cheng Juan, and Diana Marculescu, “Hyperpower: Power-and memory-constrained hyper-parameter optimization for neural networks,” in Design, Automation & Test in Europe Conference & Exhibition (DATE), 2018, pp. 19–24.
[13] Mingxing Tan, Bo Chen, Ruoming Pang, Vijay Vasudevan, and Quoc V Le, “Mnasnet: Platform-aware neural architecture search for mobile,” arXiv preprint arXiv:1807.11626, 2018.
[14] Jin-Dong Dong, An-Chieh Cheng, Da-Cheng Juan, Wei Wei, and Min Sun, “Dpp-net: Device-aware progressive search for pareto-optimal neural architectures,” arXiv preprint arXiv:1806.08198, 2018.
[15] Daniel Hernández-Lobato, Jose Hernandez-Lobato, Amar Shah, and Ryan Adams, “Predictive entropy search for multi-objective bayesian optimization.”
[16] B. Shahriari, K. Swersky, Z. Wang, R. P. Adams, and N. de Freitas, “Taking the human out of the loop: A review of bayesian optimization,” Proceedings of the IEEE, vol. 104, no. 1, pp. 148–175, Jan 2016.
[17] Alex Krizhevsky and Geoffrey Hinton, “Learning multiple layers of features from tiny images,” Tech. Rep., Citeseer, 2009.
[18] Tijmen Tieleman and Geoffrey Hinton, “Lecture 6.5-rmsprop: Divide the gradient by a running average of its recent magnitude,” COURSERA: Neural networks for machine learning, vol. 4, no. 2, pp. 26–31, 2012.
[19] NVIDIA Management Library (NVML), https://developer.nvidia.com/nvidia-management-library-nvml.
[20] Jetson TX2, https://developer.nvidia.com/embedded/buy/jetson-tx2.
[21] Lukas Cavigelli, “Convenient power measurements on the jetson tx2/tegra x2 board,” 2018.
[22] Intel Movidius Neural Compute SDK, https://movidius.github.io/ncsdk/.
[23] Power-Z USB TD Tester, https://www.unionrepair.com/how-to-use-power-z-usb-pd-tester-voltage-current-type-c-meter-km001/.