Abstract
Embedded deep learning platforms have witnessed two simultaneous improvements. First, the accuracy of convolutional neural networks (CNNs) has been significantly improved through the use of automated neural architecture search (NAS) algorithms to determine CNN structure. Second, there has been increasing interest in developing application-specific platforms for CNNs that provide improved inference performance and lower energy consumption compared to GPUs. Embedded deep learning platforms differ in the amount of compute resources and memory-access bandwidth, which affects the performance and energy consumption of CNNs. It is therefore critical to consider the available hardware resources in the network architecture search. To this end, we introduce TEA-DNN, a NAS algorithm targeting multi-objective optimization of the execution time, energy consumption, and classification accuracy of CNN workloads on embedded architectures. TEA-DNN leverages energy and execution time measurements on embedded hardware when exploring the Pareto-optimal curves across accuracy, execution time, and energy consumption, and does not require additional effort to model the underlying hardware. We apply TEA-DNN for image classification on actual embedded platforms (the NVIDIA Jetson TX2 and the Intel Movidius Neural Compute Stick). We highlight Pareto-optimal operating points that emphasize the necessity of explicitly considering hardware characteristics in the search process. To the best of our knowledge, this is the most comprehensive study of Pareto-optimal models across a range of hardware platforms using actual measurements on hardware to obtain objective values.
TEA-DNN: the Quest for Time-Energy-Accuracy Co-optimized Deep Neural Networks Lile Cai* Anne-Maelle Barneche* Arthur Herbout* Chuan Sheng Foo Jie Lin Vijay Ramaseshan Chandrasekhar Mohamed M. Sabry ^{*}Equal contribution.
I Introduction
Deep convolutional neural networks (CNNs) have achieved state-of-the-art performance in image classification, object detection and many other applications [1]. To achieve better accuracy, CNN models have become increasingly deeper and require more computational resources [2, 3]. This poses a challenge when these models are deployed to run on resource-limited devices, such as mobile and embedded platforms, as the memory on these devices may not be large enough to hold the models, or running the model may consume more power than the device can supply.
Much effort has been devoted to designing CNN models that can run efficiently on these devices, for instance, by manually designing more efficient convolution operations and network architectures [4, 5, 6]. However, this approach demands expert knowledge, and obtaining an optimal model is difficult as one has to carefully balance the trade-off between accuracy and computational resources. An alternative approach is to use automated neural architecture search (NAS) algorithms to find optimal models under hardware constraints [7, 8, 9]. NAS algorithms usually consist of a controller, which is responsible for sampling models from the search space; a trainer, which is responsible for training the sampled models; and an evaluator, which is responsible for evaluating the objective values (such as model accuracy) of the sampled models. The parameters of the controller are then updated to increase the likelihood that it subsequently samples better models.
Due to variations in device hardware/software configurations, models optimized for one device may be suboptimal for another. Consider two hardware platforms, the TITAN X GPU [10] and the Intel Movidius Neural Compute Stick (NCS) [11], which respectively exemplify a high-performance and an embedded platform. Figure 1 displays the Pareto-optimal CNN models searched for the GPU and the NCS under the trade-off between classification error and inference time, with both sets then executed on the TITAN X GPU (for measurement details see Section IV). The Pareto-optimal models searched for the NCS are far from the Pareto curve for the GPU, implying that a platform-agnostic NAS may result in highly suboptimal models, in this case a substantially longer execution time to achieve comparable accuracy.
The problem revealed in Fig. 1 demonstrates that, in order to obtain an optimal model for a hardware platform, its corresponding characteristics have to be taken into consideration during the search process. To this end, we introduce TEA-DNN (Time-Energy-Accuracy co-optimized Deep Neural Networks), a NAS framework that explicitly considers two hardware metrics, inference time and energy consumption, in addition to classification accuracy as objective metrics. We formulate this problem as a multi-objective optimization problem and leverage Bayesian optimization to search for the Pareto-optimal points. While Bayesian optimization has been used to obtain hardware-aware neural networks [12], it was only used to search over a few hyperparameters with a fixed network architecture. To the best of our knowledge, our work is the first to apply Bayesian optimization to neural architecture search. Furthermore, TEA-DNN does not require modeling the hardware platform and instead leverages the ability to directly measure energy and execution time on actual hardware. We summarize our contributions as follows:

A time, energy, and accuracy co-optimization framework for CNNs.

Employing Bayesian optimization to search for CNN structures that yield Pareto-optimal operating conditions.

We demonstrate how different device configurations can lead to different trade-off behaviors.

We demonstrate that optimal models searched on one hardware platform are not optimal for another and thus reiterate the importance of hardware-aware NAS.
II Related Work
II-A Neural Architecture Search (NAS)
Early versions of NAS algorithms [7] employed recurrent neural networks (RNNs) to predict the architecture of a target CNN, with the weights of the RNN updated using reinforcement learning. [8] follows the same framework as [7], but instead of using an RNN to predict the entire network architecture, the algorithm predicts only the optimal structure for one convolutional module (or “cell”). Identical cells are then stacked multiple times to form the full network. [9] replaced reinforcement learning with progressive search, which can yield better models with fewer samples.
II-B Hardware-Aware NAS
Explicitly incorporating hardware constraints into NAS has been an active research topic in recent years. HyperPower [12] approximates the power and memory consumption of a network using linear regression, and the approximated functions are then used in the acquisition function of Bayesian optimization to avoid sampling constraint-violating models. MnasNet [13] focuses on searching for optimal networks for mobile devices and uses inference time as one of the objectives. DPP-Net [14] performs the search on different devices and considers more objective metrics, e.g., error rate, number of parameters, FLOPs, memory, and inference time.
Our work is closely related to MnasNet and DPP-Net in that all three search for Pareto-optimal networks for a specific device. Our approach is unique in two aspects: 1) we perform true multi-objective optimization instead of combining several objectives into a single objective as done in MnasNet; and 2) unlike DPP-Net, we do not use surrogate functions to approximate the optimization objectives. Instead, we directly measure the real-world values of all three objectives (i.e., time, energy, and accuracy). This eliminates the need to model the targeted hardware, which is a challenging task given the diversity of hardware platform configurations.
III TEA-DNN Optimization Framework
III-A System Overview
We formulate the neural network architecture search problem as a multi-objective optimization problem in which we wish to find a network architecture, parameterized by a 20-dimensional vector (see Section III-B for details), that minimizes classification error, energy consumption, and inference time. We do not assume a closed-form model for energy consumption or inference time, but evaluate them directly on actual hardware to measure real-world performance. Networks are trained and evaluated on GPUs for efficiency, as we assume that classification error is not affected by the specific hardware a network is run on. As formulated, this is an instance of a black-box optimization problem in which the objective functions can only be evaluated (they are not differentiable), and function evaluations (especially of classification error, which requires training the model) are costly. Note that no single “best” point exists for a multi-objective optimization problem. A solution is instead defined by a Pareto-optimal set of points, for which no objective can be improved without degrading some other objective.
We employ a Bayesian optimization algorithm [15] (detailed in Section III-C) to solve this optimization problem. We provide a brief overview here and refer the reader to the comprehensive review in [16]. Bayesian optimization algorithms perform a sequential exploration of the parameter space while building a surrogate probabilistic model to approximate the objective functions. This model is used to select the points at which the objective functions are next evaluated, and the obtained function values are then used to update the model. The algorithm proceeds iteratively following this select-evaluate-update loop, such that points in the Pareto-optimal set are selected more frequently as the algorithm progresses. We stop the algorithm after a specified number of iterations or when a time limit is reached. A schematic overview of our search algorithm is shown in Fig. 2.
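To make the notion of a Pareto-optimal set concrete, the following sketch (not the PESMO implementation; the function name and sample values are illustrative) extracts the non-dominated subset of evaluated points, assuming all objectives are minimized:

```python
def pareto_front(points):
    """Return the subset of points not dominated by any other point.

    A point q dominates p if q is <= p in every objective and
    strictly < in at least one (all objectives minimized).
    """
    front = []
    for p in points:
        dominated = any(
            all(o <= v for o, v in zip(q, p)) and any(o < v for o, v in zip(q, p))
            for q in points
        )
        if not dominated:
            front.append(p)
    return front

# (error rate, inference time, energy) triples -- illustrative values only
evaluated = [(0.08, 1.0, 5.0), (0.07, 2.0, 6.0), (0.09, 1.5, 5.5), (0.07, 2.5, 7.0)]
print(pareto_front(evaluated))  # the first two points; the others are dominated
```

Improving any one objective of a point on this front necessarily worsens another, which is exactly the trade-off surface the search explores.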
III-B Search Space
We search over the subset of network architectures that can be described as repetitions of a modular network “cell”, as proposed by [9]. The overall network architecture is predefined (Fig. 3(a)) and consists of cells with either stride 1 or 2. Following common practice, the number of filter channels is doubled after each stride-2 cell. The network architecture is thus uniquely determined by the initial number of filter channels, the number of cell repeats, and the cell structure. The first two are pre-specified hyperparameters, and the cell structure is searched using Bayesian optimization.
Specifically, each cell is composed of 5 building blocks, and each building block (illustrated in Fig. 3(b)) is parameterized by 4 parameters, giving a 20-dimensional parameter space: two parameters specify the block's inputs and two specify the operations applied to the respective inputs. The input space of each building block consists of the outputs of all preceding blocks in the current cell as well as the outputs of the two preceding cells. The operation space includes the following eight functions commonly used in top-performing CNNs:

3×3 depthwise-separable convolution

5×5 depthwise-separable convolution

7×7 depthwise-separable convolution

identity: identity mapping

3×3 average pooling

3×3 max pooling

1×7 convolution followed by 7×1 convolution

3×3 dilated convolution with dilation rate = 2
The outputs of a block's two operations are combined by element-wise addition, and the final output of the cell is the concatenation of all unused building-block outputs. The search space described above is combinatorially large.
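The cell parameterization above can be sketched as follows. This is an illustrative encoding, not the authors' implementation: the helper names are hypothetical, and the size computation assumes each block independently chooses its two inputs and two of the eight operations.

```python
import random

NUM_OPS = 8  # eight candidate operations (Section III-B)

def random_cell(num_blocks=5, seed=0):
    """Sample a cell encoding: for each block, two input indices and two
    operation indices. Inputs 0 and 1 are the outputs of the two preceding
    cells; input i >= 2 is the output of block i - 2 in the current cell."""
    rng = random.Random(seed)
    cell = []
    for b in range(num_blocks):
        n_inputs = 2 + b  # two preceding cells plus all earlier blocks
        cell.append((rng.randrange(n_inputs), rng.randrange(n_inputs),
                     rng.randrange(NUM_OPS), rng.randrange(NUM_OPS)))
    return cell

def search_space_size(num_blocks=5, num_ops=NUM_OPS):
    """Number of distinct cell encodings under this parameterization."""
    size = 1
    for b in range(num_blocks):
        size *= (2 + b) ** 2 * num_ops ** 2
    return size
```

Under these assumptions the encoding space has (2·3·4·5·6)² · 8¹⁰ ≈ 5.6 × 10¹⁴ points, which makes exhaustive enumeration infeasible and motivates the sample-efficient search.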
III-C Multi-Objective Bayesian Optimization
Bayesian optimization is a sequential model-based approach that approximates each objective function with a Gaussian process (GP) model. For a particular objective function $f$ (e.g., classification error), let $x_{1:n}$ denote the evaluated network architectures in the search space, $f_i = f(x_i)$ the objective function value for network $x_i$, and $y_i$ the actual measured function value. The GP model assumes that $f_{1:n}$ are jointly Gaussian with mean $m(x_{1:n})$ and covariance $K(x_{1:n}, x_{1:n})$, and that each observation $y_i$ is normally distributed given $f_i$:

$f_{1:n} \sim \mathcal{N}\big(m(x_{1:n}),\, K(x_{1:n}, x_{1:n})\big), \qquad y_i \mid f_i \sim \mathcal{N}(f_i, \sigma^2).$  (1)
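As a simplified concrete illustration of such a surrogate (a generic zero-mean GP with an RBF kernel, not Spearmint's actual model; function names are illustrative), the posterior mean and variance at new points can be computed as:

```python
import numpy as np

def rbf_kernel(A, B, length_scale=1.0):
    """Squared-exponential covariance between the row vectors of A and B."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-0.5 * d2 / length_scale ** 2)

def gp_posterior(X, y, X_new, noise_var=1e-2):
    """Posterior mean and variance at X_new for a zero-mean GP with RBF
    kernel, given noisy observations y_i = f(x_i) + N(0, noise_var)."""
    K = rbf_kernel(X, X) + noise_var * np.eye(len(X))
    K_s = rbf_kernel(X, X_new)
    K_ss = rbf_kernel(X_new, X_new)
    mean = K_s.T @ np.linalg.solve(K, y)       # posterior mean
    cov = K_ss - K_s.T @ np.linalg.solve(K, K_s)  # posterior covariance
    return mean, np.diag(cov)
```

Near an already-evaluated architecture the posterior mean tracks the observed value and the variance shrinks, which is what lets the acquisition function trade off exploration against exploitation.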
Each iteration of Bayesian optimization consists of 3 steps:

Selecting the next point (network architecture) to evaluate by maximizing an acquisition function, which identifies a likely candidate to improve the objective(s). In our experiments we used the PESMO (Predictive Entropy Search Multi-objective) [15] acquisition function, which chooses points that maximally reduce the entropy, over the Pareto set, of the current posterior distribution given by the GPs.

Evaluating the objective functions at the selected point.

Updating the mean and covariance parameters of the GP models.
To employ Bayesian optimization for neural architecture search, we use the 20-dimensional parameterization of the search space described in Section III-B. Our three objective functions are the 1) error rate (i.e., 1 − accuracy), 2) inference time, and 3) energy consumption, and we used the open-source PESMO implementation in Spearmint [15] for our experiments.
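The three steps above can be sketched as a loop. This is a toy skeleton: `sample_arch`, `evaluate`, and `acquisition` are hypothetical placeholders standing in for the search space, the hardware measurements, and PESMO respectively, and the GP refit is elided.

```python
import random

def bo_loop(sample_arch, evaluate, acquisition, n_iter=10, pool_size=16, seed=0):
    """Skeleton of the select-evaluate-update loop."""
    rng = random.Random(seed)
    history = []  # list of (architecture, objective values) pairs
    for _ in range(n_iter):
        pool = [sample_arch(rng) for _ in range(pool_size)]
        x = max(pool, key=lambda a: acquisition(a, history))  # 1) select
        history.append((x, evaluate(x)))                      # 2) evaluate
        # 3) update: refit the GP surrogates on `history` (omitted here)
    return history

# Toy usage: scalar "architectures", objective x^2, acquisition prefers small x^2.
hist = bo_loop(lambda rng: rng.uniform(-1, 1),
               lambda x: (x * x,),
               lambda a, h: -a * a)
```

In the real system the evaluate step is the expensive one (training the sampled model and measuring it on hardware), which is why a sample-efficient acquisition function matters.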
IV Experimental Setup
We evaluate TEA-DNN models on different deep-learning hardware platforms, representing embedded and server-based systems. Table I summarizes the properties of these platforms.
GTX TITAN X  Jetson TX2  Movidius  

Processing Unit  3072 CUDA cores  256 CUDA cores  Myriad 2 VPU 
FLOPS  6.7T FP32  1.5T FP32  2T FP16 
Memory  12GByte GDDR5  8GByte LPDDR4  4GBit LPDDR3 
Mem. Bandwidth  336.6 GBytes/s  59.7 GBytes/s  4 GBits/s 
Power  250 W  15 W  1 W 
IV-A Training Setup
In our experiments, models are trained and tested on the CIFAR-10 dataset [17], a popular benchmark dataset for image classification. CIFAR-10 has 50,000 training images and 10,000 test images of dimension 32×32. We removed 5,000 images from the training set for use as a validation set and train on the remaining 45,000 images. During the search process, each model is trained for 20 epochs with a batch size of 32. We use the RMSProp optimizer [18] with momentum and decay both set to 0.9. The learning rate is set to 0.01 and decayed by 0.94 every 2 epochs. Weight decay is set to 0.00004. We use the data augmentation technique described in [9]. The initial channel number is set to 24 and the number of cell repeats to 2 during the search process.
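The learning-rate schedule above is a stepwise exponential decay, which can be written as a one-line helper (an illustrative sketch, not the training code itself):

```python
def learning_rate(epoch, base_lr=0.01, decay=0.94, decay_every=2):
    """Stepwise exponential schedule used during the search:
    the rate is multiplied by `decay` once every `decay_every` epochs."""
    return base_lr * decay ** (epoch // decay_every)
```

Over the 20 search epochs this decays the rate by 0.94 nine times, ending near 0.0057.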
IV-B Time and Energy Measurement
We describe how we measure inference time and energy consumption on each of the hardware platforms in Table I:

TITAN X GPU [10]. We run the model on the 5,000 validation images with a batch size of 100 and report the total inference time. We use the NVIDIA Management Library (NVML) [19] to monitor power consumption during evaluation and compute energy consumption by integrating the collected power values over the total inference time.

Jetson TX2 [20]. We run the model on the 5,000 validation images and report the total inference time. The batch size is set to 1 to match the actual usage scenario. We use the Python library provided by [21] to monitor power and compute energy consumption by integrating the collected power values over the total inference time.
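On both platforms, energy is obtained by integrating sampled power over the inference interval. A minimal sketch of this integration, assuming a trapezoidal rule over timestamped power samples (the function name is illustrative):

```python
def energy_joules(power_watts, timestamps_s):
    """Integrate sampled power (W) over time (s) with the trapezoidal
    rule to obtain energy in joules."""
    energy = 0.0
    for i in range(1, len(power_watts)):
        dt = timestamps_s[i] - timestamps_s[i - 1]
        energy += 0.5 * (power_watts[i] + power_watts[i - 1]) * dt
    return energy

# Constant 10 W draw over 2 s yields 20 J.
print(energy_joules([10.0, 10.0, 10.0], [0.0, 1.0, 2.0]))  # 20.0
```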
V Results and Discussions
V-A Evolution of Pareto Curve
To demonstrate the effectiveness of Bayesian optimization in searching for time-energy-accuracy co-optimized DNN models, we plot the Pareto curves for the TITAN X GPU as the search progresses in Fig. 4. The Pareto curve evolves towards the bottom-left corner as the search progresses, implying that models with better trade-offs are being found. By about 500 iterations the curve changes little, suggesting that the optimization has converged.
V-B Accuracy Benchmarking
We evaluate the performance of TEA-DNN models on the CIFAR-10 test subset. For each device, we select the model that achieves the best accuracy and train it for 300 epochs. The starting learning rate is set to 0.025 and decayed by 0.1 every 150 epochs. The initial channel number is set to 48 and the number of cell repeats to 3 (Section III-B). The results are reported in Table II. Models searched on different devices achieve similar error rates; however, the model searched on the Movidius NCS has roughly half the parameters and multiply-add operations of the model searched on the GPU. Indeed, the extremely limited resources of the NCS drove TEA-DNN to explore network structures that lower energy consumption on an embedded device but would not affect execution time or energy consumption on the higher-performance TITAN X GPU (CNNs with few parameters do not fully utilize the parallel resources of a GPU). This is further illustrated in Fig. 5, which shows the cell structures of the models evaluated in Table II. Compared with those for the TITAN X and Jetson, the Pareto-optimal structure for the Movidius NCS uses more depthwise-separable convolution operations with larger kernels: these require far fewer parameters than standard convolutions, which allows the use of larger kernel sizes that help achieve high accuracy.
Error Rate (%)  #Parameters  #MultAdds  

TITAN X GPU  7.16  15.3M  2.7B 
Jetson TX2  7.23  10.1M  1.8B 
Movidius NCS  6.99  7.3M  1.1B 
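The parameter savings of depthwise-separable convolutions noted above can be checked with a quick count (a sketch with hypothetical helper names; biases and batch-norm parameters are omitted):

```python
def conv_params(k, c_in, c_out):
    """Weight count of a standard k x k convolution (biases omitted)."""
    return k * k * c_in * c_out

def sep_conv_params(k, c_in, c_out):
    """Weight count of a depthwise-separable convolution:
    a k x k depthwise step followed by a 1 x 1 pointwise step."""
    return k * k * c_in + c_in * c_out

# 5x5 convolution with 48 input and 48 output channels:
print(conv_params(5, 48, 48))      # 57600 weights (standard)
print(sep_conv_params(5, 48, 48))  # 3504 weights (depthwise-separable)
```

The separable variant here uses over 16× fewer weights, which is why the search can afford larger kernels on the memory-constrained NCS.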
V-C Pareto Curve on Different Devices
We obtain the Pareto points for each pair of objectives (error-time, error-energy, and energy-time) and illustrate the resulting curves for the TITAN X, Jetson TX2, and Movidius in Fig. 6, Fig. 7, and Fig. 8, respectively. For all hardware platforms, time or energy has to increase to reduce the error rate, which is an intuitive result: deeper CNNs reduce error rates by using more compute operations. For the energy-time trade-off, however, the three platforms exhibit different behaviors. For the TITAN X there is only one optimal point, whereas for the Jetson and Movidius there is a trade-off between time and energy. This behavior is likely related to three factors: the relatively small input images, the amount of on-chip memory, and the bandwidth to off-chip memory. A model with fewer parameters and a larger number of activations can consume more time but less energy on a platform with limited on-chip memory, as a substantial amount of time is spent waiting for memory-access requests to be serviced while compute units sit idle.
V-D Cross-Device Evaluation of Pareto-Optimal Models
We evaluate whether a set of Pareto-optimal models searched for one platform is also Pareto-optimal for another. For brevity, we consider the error rate vs. time trade-off. First, we evaluate Pareto-optimal models searched for the TITAN X GPU on the Jetson TX2 (Fig. 9(a)) and Movidius NCS (Fig. 9(b)). For the Jetson TX2, none of the Pareto-optimal models searched for the GPU is Pareto-optimal. For the Movidius NCS, only 3 out of the 9 models are Pareto-optimal. This clearly indicates that incorporating the energy and execution time of the targeted platform is key in TEA-DNN.
In addition, we evaluate the sets of Pareto-optimal models for the Jetson TX2 and Movidius NCS on the TITAN X GPU, as illustrated in Fig. 10. For each of the Jetson TX2 and Movidius NCS, only two models remain Pareto-optimal on the GPU. We also note that the models searched for the embedded systems fall within a limited range of execution times on the GPU, which suggests that the limited resources of embedded platforms yield CNN architectures that cannot fully leverage the compute resources available in the high-performance GPU.
VI Conclusions
In this work, we propose the TEA-DNN framework, which employs Bayesian optimization to search for time-energy-accuracy co-optimized CNN models. We apply TEA-DNN on three different devices: the TITAN X GPU, Jetson TX2, and Movidius NCS. Experimental results show that TEA-DNN can effectively find Pareto-optimal models within a few hundred iterations. By analyzing the Pareto curves of the search results, we demonstrate that different device configurations can lead to different trade-off behaviors. Cross-device evaluation of Pareto-optimal models demonstrates that optimal models searched for one hardware platform are not optimal for another and thus reiterates the importance of explicitly considering hardware characteristics in NAS.
References
 [1] Ian Goodfellow, Yoshua Bengio, and Aaron Courville, Deep learning, vol. 1, MIT Press, Cambridge, 2016.
 [2] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun, “Identity mappings in deep residual networks,” in European conference on computer vision. Springer, 2016, pp. 630–645.
 [3] Christian Szegedy, Sergey Ioffe, Vincent Vanhoucke, and Alexander A Alemi, “Inception-v4, Inception-ResNet and the impact of residual connections on learning,” in AAAI, 2017, vol. 4, p. 12.
 [4] Andrew G Howard, Menglong Zhu, Bo Chen, Dmitry Kalenichenko, Weijun Wang, Tobias Weyand, Marco Andreetto, and Hartwig Adam, “Mobilenets: Efficient convolutional neural networks for mobile vision applications,” arXiv preprint arXiv:1704.04861, 2017.
 [5] Xiangyu Zhang, Xinyu Zhou, Mengxiao Lin, and Jian Sun, “Shufflenet: An extremely efficient convolutional neural network for mobile devices,” arXiv preprint arXiv:1707.01083, 2017.
 [6] Gao Huang, Shichen Liu, Laurens van der Maaten, and Kilian Q Weinberger, “Condensenet: An efficient densenet using learned group convolutions,” arXiv preprint arXiv:1711.09224, 2017.
 [7] Barret Zoph and Quoc V Le, “Neural architecture search with reinforcement learning,” arXiv preprint arXiv:1611.01578, 2016.
 [8] Barret Zoph, Vijay Vasudevan, Jonathon Shlens, and Quoc V Le, “Learning transferable architectures for scalable image recognition,” arXiv preprint arXiv:1707.07012, vol. 2, no. 6, 2017.
 [9] Chenxi Liu, Barret Zoph, Jonathon Shlens, Wei Hua, LiJia Li, Li FeiFei, Alan Yuille, Jonathan Huang, and Kevin Murphy, “Progressive neural architecture search,” arXiv preprint arXiv:1712.00559, 2017.
 [10] TITAN X GPU, https://www.geforce.com/hardware/desktopgpus/geforcegtxtitanx.
 [11] Movidius Neural Compute Stick, https://developer.movidius.com/.
 [12] Dimitrios Stamoulis, Ermao Cai, Da-Cheng Juan, and Diana Marculescu, “Hyperpower: Power- and memory-constrained hyperparameter optimization for neural networks,” in Design, Automation & Test in Europe Conference & Exhibition (DATE), 2018, pp. 19–24.
 [13] Mingxing Tan, Bo Chen, Ruoming Pang, Vijay Vasudevan, and Quoc V Le, “Mnasnet: Platformaware neural architecture search for mobile,” arXiv preprint arXiv:1807.11626, 2018.
 [14] Jin-Dong Dong, An-Chieh Cheng, Da-Cheng Juan, Wei Wei, and Min Sun, “DPP-Net: Device-aware progressive search for Pareto-optimal neural architectures,” arXiv preprint arXiv:1806.08198, 2018.
 [15] Daniel Hernández-Lobato, José Miguel Hernández-Lobato, Amar Shah, and Ryan P. Adams, “Predictive entropy search for multi-objective Bayesian optimization,” in International Conference on Machine Learning, 2016.
 [16] B. Shahriari, K. Swersky, Z. Wang, R. P. Adams, and N. de Freitas, “Taking the human out of the loop: A review of bayesian optimization,” Proceedings of the IEEE, vol. 104, no. 1, pp. 148–175, Jan 2016.
 [17] Alex Krizhevsky and Geoffrey Hinton, “Learning multiple layers of features from tiny images,” Tech. Rep., Citeseer, 2009.
 [18] Tijmen Tieleman and Geoffrey Hinton, “Lecture 6.5rmsprop: Divide the gradient by a running average of its recent magnitude,” COURSERA: Neural networks for machine learning, vol. 4, no. 2, pp. 26–31, 2012.
 [19] NVIDIA Management Library (NVML), https://developer.nvidia.com/nvidiamanagementlibrarynvml.
 [20] Jetson TX2, https://developer.nvidia.com/embedded/buy/jetsontx2.
 [21] Lukas Cavigelli, “Convenient power measurements on the jetson tx2/tegra x2 board,” 2018.
 [22] Intel Movidius Neural Compute SDK, https://movidius.github.io/ncsdk/.
 [23] Power-Z USB PD Tester, https://www.unionrepair.com/howtousepowerzusbpdtestervoltagecurrenttypecmeterkm001/.