DyVEDeep: Dynamic Variable Effort
Deep Neural Networks
Abstract
Deep Neural Networks (DNNs) have advanced the state-of-the-art in a variety of machine learning tasks and are deployed in increasing numbers of products and services. However, the computational requirements of training and evaluating large-scale DNNs are growing at a much faster pace than the capabilities of the underlying hardware platforms that they are executed upon. In this work, we propose Dynamic Variable Effort Deep Neural Networks (DyVEDeep) to reduce the computational requirements of DNNs during inference. Previous efforts propose specialized hardware implementations for DNNs, statically prune the network, or compress the weights. Complementary to these approaches, DyVEDeep is a dynamic approach that exploits the heterogeneity in the inputs to DNNs to improve their compute efficiency with comparable classification accuracy. DyVEDeep equips DNNs with dynamic effort mechanisms that, in the course of processing an input, identify how critical a group of computations is to classifying the input. DyVEDeep dynamically focuses its compute effort only on the critical computations, while skipping or approximating the rest. We propose 3 effort knobs that operate at different levels of granularity, viz. the neuron, feature and layer levels. We build DyVEDeep versions for 5 popular image recognition benchmarks: one for CIFAR-10 and four for ImageNet (AlexNet, OverFeat, VGG-16, and weight-compressed AlexNet). Across all benchmarks, DyVEDeep achieves a 2.1×-2.6× reduction in the number of scalar operations, which translates to a 1.8×-2.3× performance improvement over a Caffe-based implementation, with a loss in accuracy of no more than 0.5%.
Sanjay Ganapathy 

Department of Computer Science and Engineering 
Indian Institute of Technology Madras 
Chennai, Tamil Nadu, India 
sanjaygana@gmail.com 
Swagath Venkataramani (currently a Research Staff Member at IBM T.J. Watson Research Center, Yorktown Heights, NY)

Department of Electrical and Computer Engineering 
Purdue University 
West Lafayette, IN, USA 
venkata0@purdue.edu 
Balaraman Ravindran 

Department of Computer Science and Engineering 
Indian Institute of Technology Madras 
Chennai, Tamil Nadu, India 
ravi@cse.iitm.ac.in 
Anand Raghunathan 

Department of Electrical and Computer Engineering 
Purdue University 
West Lafayette, IN, USA 
raghunathan@purdue.edu 
1 Introduction
Deep Neural Networks (DNNs) have greatly advanced the state-of-the-art on a variety of machine learning tasks from different modalities including image, video, text, and natural language processing. However, from a computational standpoint, DNNs are highly compute- and data-intensive workloads. For example, DNN topologies that have won the ImageNet Large-Scale Visual Recognition Contest (ILSVRC) for the past 5 years contain between 60-150 million parameters and require 2-20 giga operations of compute to classify a single image. These requirements are only projected to increase in the future, as data sets of larger sizes and topologies of larger complexity (more layers, features and feature sizes) are actively explored. Indeed, the growth in computational requirements of DNNs has far outpaced improvements in the capabilities of commodity computational platforms in recent years.
Two key scenarios exemplify the computational challenges imposed by DNNs: (i) Large-scale training, in which DNNs are trained on massive datasets using high-performance server clusters or in the cloud, and (ii) Low-power inference, in which DNN models are evaluated on energy-constrained platforms such as mobile and deeply-embedded (Internet-of-Things) devices. Towards addressing the latter challenge, we propose Dynamic Variable Effort Deep neural networks (DyVEDeep), a new dynamic approach to improve the computational efficiency of DNN inference.
Related Research Directions. Prior research efforts to improve the computational efficiency of DNNs can be classified into 4 broad directions. The first comprises parallel implementations of DNNs on commercial multicore and GPGPU platforms. Parallelization strategies such as model, data and hybrid parallelism (Krizhevsky (2014); Das et al. (2016)), and techniques such as asynchronous SGD (Dean et al. (2012)) and 1-bit SGD (Seide et al. (2014)) to alleviate communication overheads, are representative examples. The next set of efforts design specialized hardware accelerators to realize DNNs, trading off programmability, the cost of specialized hardware and design effort for efficiency. A spectrum of architectures ranging from low-power IP cores to large-scale systems have been proposed (Farabet et al. (2011); Chen et al. (2014); Jouppi ()). The third set of efforts focus on developing new device technologies whose characteristics intrinsically match the computational primitives in neural networks, leading to improvements in energy efficiency (Liu et al. (2015b); Ramasubramanian et al. (2014)). The final set of efforts exploit the fact that DNNs are typically over-parametrized (Denil et al. (2013)) due to the non-convex nature of the optimization space (Hinton et al. (2012)). Therefore, they approximate DNNs by statically pruning network connections, representing weights with reduced bit precision and/or in a compressed format, thereby improving compute efficiency for a negligible loss in classification accuracy (LeCun et al. (1989); Han et al. (2015b); Liu et al. (2014); Venkataramani et al. (2014); Anwar et al. (2015); Tan & Sim (2016)).
DyVEDeep: Motivation and Concept. In contrast to the above efforts, our proposal, Dynamic Variable Effort Deep neural networks (DyVEDeep), leverages the heterogeneity in the characteristics of inputs to a DNN to improve its compute efficiency. (The name stems from the notion that a network should “dive deep”, or expend computational effort, judiciously as and where it is needed.) The motivation behind DyVEDeep stems from the following key insights.
First, in real-world data, not all inputs are created equal, i.e., inputs vary considerably in their “difficulty”. Intuitively, only inputs that lie very close to the decision boundary require the full effort of the classifier, while the rest could be classified with a much simpler (e.g., linear) decision boundary. In the context of DNNs, we can see that increasing network size provides a valuable, but nevertheless diminishing increase in accuracy. For example, in the context of ImageNet, increasing a network’s computational requirements by over 15× (from AlexNet to VGG) yields an additional 16% increase in classification accuracy. This raises the question of whether some of the inputs can be classified with substantially fewer computations, while expending increased effort only for inputs that require it.
Second, for a given input, the effort should be distributed unevenly across different parts of the network, since not all computations matter equally. For example, in an image recognition problem, the computations corresponding to neurons that operate on the image region where an object of interest is located are more critical to the classification output than the others. Also, some features may be less relevant than others in the context of a given input. For example, features that detect sharp edges may be less relevant if the current input consists mostly of curved surfaces.
Notwithstanding the above observations, state-of-the-art DNNs are static, i.e., they are computationally agnostic to the nature of the input being processed and expend the same (worst case) computational effort on all inputs, which leads to significant inefficiency. DyVEDeep addresses this limitation by dynamically predicting which computations are critical to classify a given input and focusing compute effort only on those computations, while skipping or approximating the rest. In effect, the network expends computational effort on different subsets of computations for each input, reducing computational requirements in each case without sacrificing classification accuracy.
Dynamic Effort Knobs. The key to the efficiency of DyVEDeep lies in favorably navigating the tradeoff between the cost of identifying critical computations vs. the benefits accrued by skipping or approximating computations. To this end, we identify three dynamic effort mechanisms at different levels of granularity, viz. the neuron, feature and layer levels. These mechanisms employ runtime criteria to dynamically evaluate the criticality of groups of computations and appropriately skip or approximate those that are deemed to be less critical.

Saturation Prediction and Early Termination (SPET) operates at the neuron level. It monitors the intermediate output of each neuron after processing a subset of its inputs (partial dot product between a subset of inputs and corresponding weights) and predicts the likelihood of the neuron eventually saturating after applying the activation function. If the partial sum is deep within the saturation regime (e.g., a large negative value in the case of ReLU), all further computations corresponding to the neuron are deemed to be non-critical and skipped.

Significance-driven Selective Sampling (SDSS) operates within each feature map, and exploits the spatial locality between neuron activations. A uniformly spatially sampled version of the feature is first computed. The activation of each remaining neuron is either approximated or accurately computed based on the magnitude and variance of its neighbors.

Similarity-based Feature Map Approximation (SFMA) operates at the layer level, and examines the similarity between neuron activations in each feature map. If all neuron activations are similar, the convolution operation on the feature map is approximated by a single scalar multiplication of the average neuron activation value with the precomputed sum of kernel weights.
We develop a systematic methodology to identify the hyperparameters for each of these mechanisms during the training phase for any given DNN. We built DyVEDeep versions for 5 popular DNN benchmarks, viz. CIFAR-10, AlexNet, OverFeat-accurate, VGG-16 and a weight-compressed AlexNet model. Our experiments demonstrate that by dynamically exploiting the heterogeneity across inputs, DyVEDeep achieves a 2.1×-2.6× reduction in the total number of scalar operations for a loss in classification accuracy of no more than 0.5%. The reduction in scalar operations translates to a 1.8×-2.3× improvement in performance in our software implementation of DyVEDeep using the Caffe deep learning framework on an Intel Xeon 2.7 GHz server with 128 GB memory.
The rest of the paper is organized as follows. Section 2 describes prior research efforts related to DyVEDeep. Section 3 details the proposed dynamic effort mechanisms and how they are integrated in DyVEDeep. Section 4 outlines the methodology used in our experiments. The experimental results are presented in Section 5, and Section 6 concludes the paper.
2 Related Work
In this section, we provide a brief summary of prior research efforts related to DyVEDeep, and highlight the distinguishing features of our work. Prior research on improving the computational efficiency of DNNs follows 4 distinct directions.
The first class of efforts focus on parallelizing DNNs on commercial multicores and GPGPU platforms. Different work distribution strategies such as model, data and hybrid parallelism (Krizhevsky (2014); Das et al. (2016)), and hardware-transparent on-chip memory allocation/management schemes such as virtualized DNNs (Rhu et al. (2016)) are representative examples. The second class of efforts design specialized hardware accelerators that realize the key computation kernels in DNNs. A range of architectures, from those targeting low-power mobile devices (Farabet et al. (2011)) to high-performance server clusters (Chen et al. (2014); Jouppi ()), have been explored. The third set of efforts investigate new device technologies whose characteristics intrinsically match the compute primitives present in DNNs. Memristor-based crossbar array architectures (Liu et al. (2015b)) and spintronic neuron designs (Ramasubramanian et al. (2014)) are representative examples.
The final set of efforts improve efficiency by approximating computations in the DNN. DyVEDeep falls under this category, as we propose to dynamically skip or approximate computations based on their criticality in the context of a given input. Therefore, we describe the approaches that fall under this category in more detail. To this end, we classify these approaches into static vs. dynamic optimizations.
Static Techniques. Almost all efforts that approximate computations in DNNs are static in nature, i.e., they apply the same approximation uniformly across all inputs. Static techniques primarily reduce the model size of DNNs by using mechanisms such as pruning connections (LeCun et al. (1989); Han et al. (2015b); Liu et al. (2014)), reducing the precision of computations (Venkataramani et al. (2014); Anwar et al. (2015)), and storing weights in a compressed format (Han et al. (2015a)). For example, in the context of fully connected layers, HashNets (Chen et al. (2015)) use a hash function to randomly group weights into bins, which share a common parameter value, thereby reducing the number of parameters needed to represent the network. Deep compression (Han et al. (2015a)) attempts to prune connections in the network by adding a regularization term during training, and removing connections with weights below a certain threshold.
In the context of convolution layers, Denton et al. (2014) and Jaderberg et al. (2014) exploit the linear structure of the network to find a suitable low-rank approximation. On the other hand, Liu et al. (2015a) propose sparse convolutional DNNs, wherein a large fraction of the parameters in the kernels are zeroed out by adding a weight sparsity term to the objective function. In contrast, Mathieu et al. (2013) demonstrate that performing convolution in the Fourier domain can yield substantial improvement in efficiency. Finally, Figurnov et al. (2015) propose perforated CNNs, in which only a subset of the neurons in a feature are evaluated. The neurons to be evaluated for each feature are determined statically at training time.
Dynamic Techniques. Dynamic optimizations adapt the computations that are approximated based on the input currently being processed. Dynamic techniques are more powerful than statically optimized DNNs, as they can capture additional input-dependent opportunities for efficiency that static methods lack. Notwithstanding this, very little focus has been devoted to developing dynamic DNN approximation techniques. One of the first efforts in this direction (Bengio (2013)) utilizes stochastic neurons to gate regions within the DNN. Along similar lines, Ba & Frey (2013) propose Standout, where the dropout probability of each neuron is estimated using a binary belief network. The dropout mask is computed for the network in one shot, conditioned on the input to the network. Bengio et al. (2015) extend a similar idea, wherein the dropout distribution of each layer is computed based on the output of the preceding layer.
The dynamic effort mechanisms proposed in DyVEDeep are qualitatively different from the aforementioned efforts. Rather than stochastically dropping computations, effort knobs in DyVEDeep exploit properties such as the saturating nature of the activation function to directly predict the effect of approximation on the neuron output. Further, prior dynamic approaches have only been applied to fully connected networks trained on small datasets. Their applicability to large-scale DNNs remains unexplored. On the other hand, DyVEDeep is naturally applicable to both convolutional and fully connected layers, and we demonstrate substantial benefits on large-scale networks for ImageNet.
3 DyVEDeep: Design Approach and Dynamic Effort Knobs
The key idea behind DyVEDeep is to improve the computational efficiency of DNNs by modulating the effort that they expend based on the input that is being processed. As shown in Figure 1, we achieve this by equipping the DNN with dynamic effort mechanisms (“effort knobs”) that dynamically predict criticality of groups of computations with very low overhead, and correspondingly skip or approximate them, thereby improving efficiency with negligible impact on classification accuracy. We identify three such dynamic effort mechanisms in DNNs that operate at different levels of granularity. We also propose a methodology to tune the hyperparameters associated with these mechanisms so that variable effort versions of any DNN can be obtained with negligible loss in classification accuracy.
3.1 Saturation Prediction and Early Termination
Saturation Prediction and Early Termination (SPET) works at the finest level of granularity, which is at the level of each neuron in the DNN. In this case, we leverage the fact that almost all convolutional and fully connected layers are followed by an activation function that saturates on at least one side. For example, the commonly used Rectified Linear Unit (ReLU) activation function saturates at one end by truncating the negative inputs to zero, while passing the positive inputs as is.
The key idea in SPET is that the actual value of the weighted sum (dot product between a neuron’s inputs and weights) does not impact the neuron’s output, provided the sum will eventually cause the neuron’s activation function to saturate. In the case of ReLU, it is unnecessary to compute the actual sum if it will eventually be a negative value, as any negative value would result in a neuron output of zero. Based on the above observation, as shown in Figure 2, SPET monitors the partial weighted sum of a neuron after a predefined fraction of its inputs have been multiplied and accumulated. SPET then predicts whether the final sum would cause the neuron’s activation function to saturate. To this end, we introduce the following hyperparameters:

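$SPET_{lThresh}$ and $SPET_{uThresh}$: We set a lower and an upper threshold on the partial sum value of each neuron. At the time of prediction, as shown in Equation 1, if the partial sum is found to be smaller than $SPET_{lThresh}$ or greater than $SPET_{uThresh}$, the partial sum computation is terminated early, and the appropriate saturated activation function value is returned as the neuron’s output. If not, we continue to completely evaluate the weighted sum for the neuron.
$$
\text{Output} =
\begin{cases}
\text{ActFn}_{\text{sat,low}} & \text{if } \text{PartialSum} < SPET_{lThresh} \\
\text{ActFn}_{\text{sat,high}} & \text{if } \text{PartialSum} > SPET_{uThresh} \\
\text{ActFn}(\text{FullSum}) & \text{otherwise}
\end{cases}
\qquad (1)
$$
Here, $\text{ActFn}_{\text{sat,low}}$ and $\text{ActFn}_{\text{sat,high}}$ denote the lower and upper saturation values of the activation function. We note that if the activation function saturates in just one direction, only one of the SPET thresholds will be useful to predict saturation. For example, in the case of ReLU, only the lower threshold $SPET_{lThresh}$ is used to predict saturation.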
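The early-termination rule can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' Caffe implementation: the function name `spet_neuron`, its parameters, and the choice of a ReLU activation are hypothetical, and the prediction point is expressed as a fraction of the inputs.

```python
def spet_neuron(inputs, weights, lower_thresh=0.0, upper_thresh=None,
                sat_high=None, predict_fraction=0.5):
    """Evaluate one ReLU neuron with saturation prediction (illustrative sketch).

    Returns (activation, number_of_multiply_accumulates_performed).
    upper_thresh and sat_high are only meaningful for activations that also
    saturate on the high side; they stay unused for ReLU.
    """
    n = len(inputs)
    split = int(n * predict_fraction)

    # Partial dot product over the first `split` inputs.
    partial = sum(x * w for x, w in zip(inputs[:split], weights[:split]))

    # Predict saturation from the partial sum and terminate early.
    if partial < lower_thresh:
        return 0.0, split  # ReLU saturates at zero on the low side
    if upper_thresh is not None and partial > upper_thresh:
        return sat_high, split

    # Not predicted to saturate: finish the dot product and apply ReLU.
    total = partial + sum(x * w for x, w in zip(inputs[split:], weights[split:]))
    return max(total, 0.0), n
```

With a prediction fraction of 0.5, a neuron that is predicted to saturate costs half the multiply-accumulate operations of a full evaluation. The prediction can be wrong when the remaining inputs would have pushed the sum back across the threshold, which is the accuracy risk that the threshold is tuned to control.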
To demonstrate the potential benefits from SPET, Figure 3 shows the fraction of neurons in the convolutional layers of the CIFAR-10 DNN that saturate. We find that between 50%-73% of the neuron activations are zeros due to the ReLU activation function. Figure 3 also reveals that the fraction of neurons saturating increases as we proceed deeper into the network. We observed similar trends for larger networks such as AlexNet and OverFeat. Since a majority of neuron activations saturate in typical DNNs, SPET has the potential to achieve significant improvements in processing efficiency.
Saturation Prediction Interval. A key aspect of SPET is the interval at which we predict for saturation. On the one hand, predicting saturation after processing a small number of inputs to each neuron would frequently result in the prediction being incorrect, leading to a loss in classification accuracy. On the other hand, a larger prediction interval yields progressively smaller computational savings. Quantifying the above tradeoff, Figure 4 illustrates, for the CIFAR-10 DNN, the fraction of neurons that were predicted to be saturated correctly at various prediction intervals. For the illustration in Figure 4, we assume a lower threshold $SPET_{lThresh}$ of 0, i.e., a neuron is predicted to saturate if its partial sum at the point of prediction is negative. We find that the fraction of neurons predicted correctly increases with the prediction interval.
The lower and upper SPET thresholds ($SPET_{lThresh}$ and $SPET_{uThresh}$) are determined during DNN training. We note that the prediction interval could also be learnt during the training process. However, we found that a simpler scheme where we fix the prediction interval at 50% (i.e., we predict for saturation after half the inputs to a neuron have been processed) worked quite well in practice.
Rearranging Neuron Inputs. For SPET to be most effective, the weights should be processed in decreasing order of magnitude, as larger weights are likely to have the most impact on the partial sum. However, this is not feasible in practice, as it affects the regularity in the memory access pattern, directly offsetting the savings from skipping computations. Also, in the case of convolutional layers, if the prediction interval is set to 50%, inputs from half of the feature maps are ignored at the time of prediction. To maximize the range of inputs processed before prediction, while maintaining regularity in the memory access pattern, we rearrange the neuron inputs such that all odd-indexed inputs are processed first, after which the prediction is made. The even-indexed inputs are computed only if the neuron was not predicted to saturate.
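A small sketch of this interleaved ordering follows. It assumes a flat list of inputs and a ReLU activation; the helper names `odd_first_order` and `spet_odd_even` are illustrative and do not come from the paper.

```python
def odd_first_order(n):
    """Split 0-based positions into 1-based odd-indexed and even-indexed groups."""
    odds = list(range(0, n, 2))   # 1st, 3rd, 5th, ... inputs
    evens = list(range(1, n, 2))  # 2nd, 4th, 6th, ... inputs
    return odds, evens


def spet_odd_even(inputs, weights, lower_thresh=0.0):
    """ReLU neuron that predicts saturation after the odd-indexed inputs."""
    first, second = odd_first_order(len(inputs))

    # A strided pass covers the whole input range with a regular access pattern.
    partial = sum(inputs[i] * weights[i] for i in first)
    if partial < lower_thresh:
        return 0.0, len(first)  # predicted to saturate: skip the even inputs

    total = partial + sum(inputs[i] * weights[i] for i in second)
    return max(total, 0.0), len(inputs)
```

Because the first pass strides across the entire input range, every feature map contributes to the prediction, unlike a contiguous first half that would ignore the later maps.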
3.2 Significance-driven Selective Sampling
Significance-driven Selective Sampling (SDSS) operates at the granularity of each feature in the convolutional layers of the DNN. SDSS leverages the spatial locality in neuron activations within each feature. For example, in the context of images, adjacent pixels in the input image frequently take similar values. As the neuron activations are computed by sliding the kernel over the image, the spatial locality naturally permeates to the feature outputs of convolutional layers. This behavior is also observed in deeper layers in the network. In fact, the saturating nature of the activation function enhances locality, as variations in the weighted sum between neighbors are masked if they both fall within the same saturation regime.
SDSS adopts a 2-step process to exploit the spatial locality within features.
Uniform Feature Sampling. In the first step, we compute the activation values for a subset of neurons in the feature by uniformly sampling the feature. For this purpose, we define a sampling period parameter that denotes the periodicity of sampling in each dimension. Its value is chosen based on the size of the feature and the correlation between adjacent neuron activations. In our experiments, we used a sampling period of 2 across all convolutional layers in a DNN.
Significance-driven Selective Evaluation. In the second step, as shown in Figure 5, we selectively approximate activation values of neurons that were not sampled in the first step. To this end, we define the following two hyperparameters: (i) Maximum Activation Value Threshold ($MAV_{Thresh}$), (ii) Delta Activation Value Threshold ($DAV_{Thresh}$). For each neuron in the feature that is yet to be computed, we examine the activation values of its immediate neighbors in all directions, and compute the maximum and range (difference between max and min) of the neighbors’ activation values. If the maximum value is below $MAV_{Thresh}$ and the range is less than $DAV_{Thresh}$, then the activation value of the neuron is approximated to be the average of its neighbors. If not, the actual activation value of the neuron is evaluated.
Thus, the SDSS effort knob utilizes the magnitude and variance of neighbors to gauge whether a neuron lies within a region of interest, and accordingly expends computational effort to compute its activation value.
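The two SDSS steps can be sketched as follows. This is a simplified illustration under several assumptions: `compute(r, c)` is a hypothetical callback returning the exact activation of the neuron at row `r`, column `c`; only neighbors that were sampled in the first step are consulted; and the parameter names `mav_thresh` and `dav_thresh` stand for the two thresholds described above.

```python
def sdss_feature(compute, height, width, mav_thresh, dav_thresh, period=2):
    """Fill a feature map with exact or approximated activations (sketch).

    Returns (feature_map, number_of_exact_evaluations).
    """
    out = [[None] * width for _ in range(height)]
    exact = 0

    # Step 1: uniform feature sampling.
    for r in range(0, height, period):
        for c in range(0, width, period):
            out[r][c] = compute(r, c)
            exact += 1

    # Step 2: significance-driven selective evaluation of the remaining neurons.
    for r in range(height):
        for c in range(width):
            if r % period == 0 and c % period == 0:
                continue  # already sampled in step 1
            neigh = [out[rr][cc]
                     for rr in range(max(r - 1, 0), min(r + 2, height))
                     for cc in range(max(c - 1, 0), min(c + 2, width))
                     if rr % period == 0 and cc % period == 0]
            low_and_flat = (neigh
                            and max(neigh) < mav_thresh
                            and max(neigh) - min(neigh) < dav_thresh)
            if low_and_flat:
                out[r][c] = sum(neigh) / len(neigh)  # approximate by the average
            else:
                out[r][c] = compute(r, c)            # region of interest: exact
                exact += 1
    return out, exact
```

On a flat 4×4 feature with a sampling period of 2, only the four sampled neurons are evaluated exactly, while the remaining twelve are filled in by averaging. Neurons near large or rapidly changing activations fall back to exact evaluation.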
3.3 Similarity-based Feature Map Approximation
Similarity-based Feature Map Approximation (SFMA) also exploits the correlation between activation values in a feature, but in a very different way. In SDSS, the spatial locality was exploited in computing the neuron activations themselves. In contrast, in the case of SFMA, the spatial locality is used to approximate computations that use the feature as their input. Consider a convolutional layer in which one of the input features has all of its neuron activations similar to each other. When a convolution operation is performed on this input feature by sliding the kernel matrix, all the entries in the convolution output are likely to be close to each other. Therefore, as shown in Figure 6, we approximate the entire convolution operation as follows. First, the average value of all neuron activations in the feature is computed. Next, the sum of all weights in the kernel matrix is evaluated. We note that the sum can be precomputed and stored along with the kernel matrix. We then approximate all outputs of the convolution as the product of the average input activation and the sum of all kernel weights.
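Mathematically, the above approximation can be expressed as follows.
$$
y = \sum_{i=1}^{k}\sum_{j=1}^{k} w_{ij}\,x_{ij}
  = \mu \sum_{i=1}^{k}\sum_{j=1}^{k} w_{ij} + \sum_{i=1}^{k}\sum_{j=1}^{k} w_{ij}\,(x_{ij} - \mu)
  \approx \mu \sum_{i=1}^{k}\sum_{j=1}^{k} w_{ij}
$$
In the above equation, $y$ is the convolution output for a window of size $k \times k$, where $k$ is the kernel size, $w_{ij}$ are the kernel weights and $x_{ij}$ are the input activations within the window. $\mu$ is the mean of all the activation values in the feature. This approximation is valid when the residual term $\sum_{i,j} w_{ij}\,(x_{ij} - \mu)$ is negligible.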
To determine on which convolutions to apply the aforementioned approximation, we define the following 2 hyperparameters:

Weight Significance Threshold ($WS_{Thresh}$): We set this threshold on the sum of absolute values of the kernel weights. This is an approximate measure of the significance of the current convolution to the output feature.

Feature Variance Threshold ($FV_{Thresh}$): We set this threshold on the variance of the neuron activations in the feature.
Given the hyperparameters, the convolution is approximated when (i) the sum of the absolute values of the kernel weights is below $WS_{Thresh}$, indicating that the convolution is relatively less significant to the output feature, and (ii) the variance of neuron activations in the feature is below $FV_{Thresh}$, indicating that the error due to replacing the entire feature with its average is tolerable.
When the feature sizes are large, we do not check for the variance across the entire feature. Instead, we split the feature into multiple regions that overlap on each dimension by the size of the kernel window. We check for variance within each region, and if the variance is below $FV_{Thresh}$, the kernel windows that fit entirely within the region are approximated.
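The SFMA decision can be sketched for a single input feature and kernel as shown below. This is an illustrative sketch only: it checks the variance over the whole feature rather than over overlapping regions, it uses a plain "valid" convolution without padding or stride, and the function names and the `ws_thresh` and `fv_thresh` parameters are assumptions made for the example.

```python
def variance(values):
    """Population variance of a flat list of numbers."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)


def conv2d_valid(feature, kernel):
    """Exact sliding-window product of a square kernel over a 2D feature."""
    k = len(kernel)
    h, w = len(feature), len(feature[0])
    return [[sum(kernel[i][j] * feature[r + i][c + j]
                 for i in range(k) for j in range(k))
             for c in range(w - k + 1)]
            for r in range(h - k + 1)]


def sfma_conv(feature, kernel, ws_thresh, fv_thresh):
    """Return (output, approximated_flag) for one feature/kernel pair."""
    flat = [v for row in feature for v in row]
    abs_weight_sum = sum(abs(wt) for row in kernel for wt in row)

    if abs_weight_sum < ws_thresh and variance(flat) < fv_thresh:
        mean = sum(flat) / len(flat)
        weight_sum = sum(wt for row in kernel for wt in row)  # can be precomputed
        k = len(kernel)
        value = mean * weight_sum  # one scalar multiply replaces every window
        out = [[value] * (len(feature[0]) - k + 1)
               for _ in range(len(feature) - k + 1)]
        return out, True

    return conv2d_valid(feature, kernel), False
```

For a perfectly uniform feature the approximation is exact, since the residual term in the equation above vanishes. As the variance grows, the residual grows with it, which is why the variance threshold bounds the tolerated error.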
3.4 Integrating Effort Knobs
We now describe how the different effort knobs (SPET, SDSS and SFMA) are combined in DyVEDeep. Since each effort knob operates at a different level of granularity, they can be easily integrated with each other. To combine SPET and SDSS, each neuron activation in the uniformly sampled features of SDSS is computed with SPET. However, we do not apply SPET to the neurons that are selectively computed in SDSS, as they are located in the midst of neurons with large activation values and/or variance, and are hence unlikely to saturate. SFMA fundamentally amounts to grouping a set of inputs (within a convolution window) to a neuron into a single input, and therefore directly fits with the process of evaluating a neuron with SPET/SDSS.
In summary, the SPET effort knob applies to both convolutional and fully connected layers of DNNs, and is most effective when a majority of the neurons saturate. Since the convolutional layers towards the middle of the DNN have a large number of inputs per neuron and contain a substantial fraction of saturated neurons, we expect SPET to be most beneficial for those layers. The SDSS effort knob primarily applies only to convolutional layers, and is most effective when the feature sizes are large. Therefore, the initial convolutional layers would benefit the most from SDSS. On the other hand, SFMA works best when there are a large number of features in the layer and when the feature sizes are small. Hence the middle and later convolutional layers are likely to benefit from SFMA.
3.5 Hyperparameter Tuning
As described in the previous subsections, the dynamic effort knobs together contain 6 hyperparameters, viz. $SPET_{lThresh}$, $SPET_{uThresh}$, $MAV_{Thresh}$, $DAV_{Thresh}$, $WS_{Thresh}$ and $FV_{Thresh}$. These hyperparameters control how aggressively the effort knobs skip or approximate computations, thereby yielding a direct tradeoff between computational savings vs. classification accuracy. Using a pretrained network and a training dataset, we systematically determine the DyVEDeep hyperparameters before the DNN model is deployed. Ideally, we could define these parameters uniquely for each neuron in the DNN. For example, each neuron could have its unique threshold to predict when it saturates (SPET), or threshold to deem if an input feature map can be approximated during its partial sum evaluation (SFMA). Clearly, this results in a prohibitively large hyperparameter search space, and adds substantial overhead to the overall size of the DNN model. Since neurons in a given layer are computationally similar (same set of inputs, number of computations, etc.), we define the hyperparameters at a layer-wise granularity, i.e., all neurons within a layer share the same set of hyperparameters. Also, since all our benchmarks utilized the ReLU activation function, we ignored the upper SPET threshold $SPET_{uThresh}$ when identifying the hyperparameter configuration.
Algorithm 1 shows the pseudocode for the hyperparameter tuning process. Empirically, we observed that parameters corresponding to each effort knob can be independently tuned. Therefore, we adopt a strategy wherein we first identify a range of possible values for each hyperparameter. Since computational savings monotonically increase or decrease with the value of each parameter, we perform a greedy binary search on its range. The range of each parameter can be identified as follows. The $SPET_{lThresh}$ and $MAV_{Thresh}$ parameters vary over the entire range of values the partial sum of neurons can take in a layer. However, we typically observe that zero is a good lower bound for these parameters, as ReLU sets all negative values to 0. The upper bound is determined by evaluating the DNN on each input in the training dataset and recording the maximum partial sum value for each layer. The other parameters $DAV_{Thresh}$, $WS_{Thresh}$ and $FV_{Thresh}$ are naturally lower bounded by 0 as they are thresholds on absolute magnitudes. Similar to $SPET_{lThresh}$ and $MAV_{Thresh}$, the upper limits of the other parameters are also estimated by evaluating the DNN on the training set.
Given a hyperparameter and its range, the highest possible value for the parameter yields the maximum computation savings but adversely affects the classification accuracy. On the other extreme, the lowest value of the parameter does not impact the classification accuracy. However, it yields no computation savings and in fact adds a penalty for criticality prediction. Therefore, we perform a binary search on the range to identify the highest value of the parameter that yields negligible loss in classification accuracy (within 0.5% in our experiments). In the case of SFMA, we observed that the two hyperparameters ($WS_{Thresh}$ and $FV_{Thresh}$) need to be searched together. Since one of the two has a coarser range than the other, we loop over the values of the coarser parameter and search for possible values of the other in each case.
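The per-parameter search can be sketched as a standard binary search. This is not a transcription of Algorithm 1: the function `tune_threshold`, the callback `evaluate_accuracy` (assumed to run the network on the tuning set with the given threshold and return its accuracy), and the fixed iteration count are all assumptions made for illustration. It also assumes that accuracy does not increase as the threshold grows.

```python
def tune_threshold(evaluate_accuracy, low, high, baseline_acc,
                   max_loss=0.005, iterations=20):
    """Find the largest threshold whose accuracy loss stays within max_loss."""
    best = low  # the lowest value never hurts accuracy, so it is always safe
    for _ in range(iterations):
        mid = (low + high) / 2.0
        if baseline_acc - evaluate_accuracy(mid) <= max_loss:
            best = mid   # acceptable loss: try a more aggressive threshold
            low = mid
        else:
            high = mid   # too much loss: back off toward smaller thresholds
    return best
```

For the two SFMA thresholds, one way to follow the description above would be to call this routine for one parameter inside an outer loop over candidate values of the other, keeping the pair that yields the largest savings.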
In summary, by embedding dynamic effort knobs into DNNs, DyVEDeep seamlessly varies computational effort across inputs to achieve significant computational savings while maintaining classification accuracy.
4 Experimental Methodology
In this section, we describe the methodology used in our experiments to evaluate DyVEDeep.
Benchmarks. To evaluate DyVEDeep, we utilized pretrained DNN models available publicly on the Caffe Model Zoo (BVLC (a)) benchmark repository. This reinforces DyVEDeep’s ability to adapt to any given trained network. We used the following 5 DNN benchmarks in our experiments: the CIFAR-10 Caffe network (BVLC (b)) for the CIFAR-10 dataset (Krizhevsky (2009)), and AlexNet (Krizhevsky et al. (2012)), OverFeat-accurate (Sermanet et al. ()), VGG-16 (Simonyan & Zisserman (2014)), and compressed AlexNet (Han et al. (2015a)) for the ImageNet ILSVRC 2012 data set (Deng et al. (2009)). The inputs for the ImageNet dataset are generated by using a center crop of the images in the test set. We randomly selected 5% of the test inputs and used them as a validation set to tune the hyperparameters. We report speedup and classification accuracy results on the remaining 95% of the test inputs.
Performance Measurement. We implemented DyVEDeep in C++ within the Caffe deep learning framework (Jia et al. (2014)). However, we could not directly integrate DyVEDeep into Caffe's existing layers, as Caffe composes all computations within a layer for a given batch size into a single GEMM (GEneral Matrix Multiplication) operation, which is offered by BLAS (Basic Linear Algebra Subprograms) libraries. BLAS libraries specifically optimize matrix operations at the assembly level. Since DyVEDeep requires more fine-grained computation skipping and approximation, we were unable to incorporate it directly within these routines. Therefore, we prototyped our own implementation of the convolutional layers within Caffe and used it in our experiments.
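To illustrate why the GEMM formulation leaves no room for fine-grained skipping, the sketch below contrasts the two styles on a single-channel 2-D convolution (computed as cross-correlation, as is common in DNN frameworks). It is written in plain Python for readability and is not the C++ prototype described above; `should_stop` is a hypothetical callback that stands in for any per-neuron early-exit decision and does not implement a specific DyVEDeep knob.

```python
def im2col(image, k):
    """Unroll every k x k patch of a 2-D image (list of lists) into one row."""
    h, w = len(image), len(image[0])
    return [
        [image[r + i][c + j] for i in range(k) for j in range(k)]
        for r in range(h - k + 1)
        for c in range(w - k + 1)
    ]


def conv_as_gemm(image, kernel):
    """Convolution as one matrix-vector product: every output is computed in full."""
    k = len(kernel)
    flat_kernel = [v for row in kernel for v in row]
    return [sum(a * b for a, b in zip(patch, flat_kernel)) for patch in im2col(image, k)]


def conv_direct(image, kernel, should_stop=None):
    """Same convolution with explicit loops.

    Returns (outputs, number of multiply-accumulates actually performed).
    The partial sum of each output is visible inside the loop, so a decision
    to stop early can be taken per neuron.
    """
    k = len(kernel)
    h, w = len(image), len(image[0])
    outputs, macs = [], 0
    for r in range(h - k + 1):
        for c in range(w - k + 1):
            partial, done = 0.0, False
            for i in range(k):
                for j in range(k):
                    partial += image[r + i][c + j] * kernel[i][j]
                    macs += 1
                    if should_stop is not None and should_stop(partial, i * k + j + 1):
                        done = True
                        break
                if done:
                    break
            outputs.append(partial)
    return outputs, macs
```

Without a callback, `conv_direct` produces the same outputs as `conv_as_gemm`; with one, the multiply-accumulate count drops, at the price of losing the hand-tuned BLAS kernel.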
Our experiments were conducted on an Intel Xeon server operating at 2.7 GHz with 128 GB of memory. We added performance counters to both DyVEDeep and the baseline DNN implementation to measure the software execution time. All our timing results are reported for single-threaded sequential execution. Also, in our experiments, we introduced dynamic effort knobs only in the convolutional layers of the DNN, as they dominated the overall runtime for all our benchmarks. However, we note that the reported execution times and performance benefits include the time taken by all layers in the network.
5 Results
In this section, we present the results of our experiments that demonstrate the benefits of DyVEDeep.
5.1 Improvement in Scalar Operations and Execution Time
We first present the reduction in scalar operations and execution time achieved by DyVEDeep in Figure 7. Note that the Y-axis in Figure 7 uses a normalized scale to represent the benefits in both scalar operations and runtime. We find that, across all benchmarks, DyVEDeep consistently achieves a substantial reduction in operation count, ranging between 2.1× and 2.6×. This translates to 1.8× to 2.3× benefits in software execution time. In all the above cases, the difference in classification accuracy between the baseline DNN and DyVEDeep was 0.5%. On average, the runtime overhead of the dynamic effort knobs in DyVEDeep was 5% of the baseline DNN. Also, while the runtime benefits with DyVEDeep are quite significant, they are smaller than the reduction in scalar operations. This is expected, as applying the knobs requires us to alter memory access patterns and perform additional bookkeeping operations. In addition, control operations such as loop counters, which are inherent to any software implementation, limit the fraction of runtime that DyVEDeep can improve.
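One way to see why the runtime gain trails the operation-count reduction is a simple Amdahl-style model. The sketch below is an illustration only and was not used to produce the reported results: `op_bound_fraction`, the share of baseline runtime that scales with the number of scalar operations, is a hypothetical parameter, and the function name `runtime_speedup` is likewise an assumption.

```python
def runtime_speedup(op_reduction, op_bound_fraction, overhead_fraction):
    """Estimate end-to-end speedup from a reduction in scalar operations.

    op_reduction:      factor by which scalar operations shrink (e.g. 2.6)
    op_bound_fraction: hypothetical share of baseline runtime that scales with
                       scalar operations (the rest is control and memory time)
    overhead_fraction: extra runtime of the effort knobs, relative to baseline
    """
    if op_reduction <= 0:
        raise ValueError("op_reduction must be positive")
    new_time = (
        op_bound_fraction / op_reduction   # time that shrinks with the operations
        + (1.0 - op_bound_fraction)        # time the knobs cannot touch
        + overhead_fraction                # cost of criticality prediction
    )
    return 1.0 / new_time
```

For instance, with a 2.6× operation reduction, a 5% knob overhead and an assumed `op_bound_fraction` of 0.9, the model gives a speedup of about 2.0×, noticeably below 2.6×.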
5.2 Layer-wise and Knob-wise Breakdown of Compute Savings
Figure 8a shows the breakdown of runtime savings across different layers of AlexNet, with the layers plotted on the X-axis and the average runtime per layer, normalized to the total baseline DNN runtime, on the Y-axis. We achieve a 1.5× reduction in runtime in the initial convolutional layers (C1, C2), which increases to 2.6× in the deeper convolutional layers (C3-C5). The C1 layer in AlexNet has a kernel size of 11×11 and operates with a stride of 4. Hence, its output is less likely to have the correlation that SDSS expects. Moreover, since there are very few input features, SFMA is not very effective either. Finally, the fraction of neurons that saturate is relatively small in the first layers, which limits the effectiveness of SPET. Hence, we achieve better savings in the deeper convolutional layers than in the initial ones.
Figure 8b compares the contribution of each effort knob to the overall savings for each convolutional layer in AlexNet. Over all layers, the SDSS knob yields the highest savings, reducing 31% of the total scalar operations. The SPET and SFMA knobs contribute 19% and 7% respectively. We find that the effectiveness of each knob is more pronounced in the deeper convolutional layers.
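As a quick consistency check, the per-knob percentages can be combined into an overall operation-reduction factor. The sketch assumes that the three percentages are fractions of the same baseline operation count for AlexNet and that they add up without overlap; the function name `reduction_factor` is hypothetical.

```python
def reduction_factor(knob_savings):
    """Convert per-knob fractions of eliminated operations into an overall factor.

    Assumes the fractions refer to the same baseline and do not overlap.
    """
    total_saved = sum(knob_savings.values())
    if not 0.0 <= total_saved < 1.0:
        raise ValueError("total savings must lie in [0, 1)")
    return 1.0 / (1.0 - total_saved)


# Fractions taken from the text above.
alexnet_knobs = {"SDSS": 0.31, "SPET": 0.19, "SFMA": 0.07}
alexnet_factor = reduction_factor(alexnet_knobs)
```

Under these assumptions the three knobs together remove 57% of the scalar operations, which corresponds to a reduction of roughly 2.3× and falls inside the overall range reported in Section 5.1.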
5.3 Visualisation of Effort Map of DyVEDeep
Figures 10 and 11 illustrate the normalised effort map of DyVEDeep for all features in layer C1 for two sample images (Figure 9) from the CIFAR-10 dataset. We use layer C1, as it is the layer closest to the actual image and allows for better visualization. The normalization is done with respect to the number of operations that would have been performed to compute the neuron had our knobs not been in place. Darker regions represent more computations. It is remarkable to see that DyVEDeep focuses more effort on precisely the regions of the image that contain the object of interest. We compare this with the activation map of the corresponding features. Here, the darker regions represent activated neurons. This has been done to highlight the correlation between the activation values and the effort that DyVEDeep expends on the corresponding neurons. The activation map demonstrates that regions where the activation values of neurons are high have a higher variance in the values, which makes it harder to approximate them. However, the parameter ensures that DyVEDeep constrains the effort spent in regions with uniform activation values. These effort maps corroborate our knobs’ effectiveness in identifying the critical computations for the current input.
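The normalization behind such an effort map can be sketched as follows. The function names, the character ramp used for rendering and the example operation counts are hypothetical; the only element taken from the text is that each neuron's operation count is divided by the count it would have required without the knobs.

```python
SHADES = " .:-=+*#%@"  # light to dark; darker means more effort


def normalized_effort_map(ops_performed, baseline_ops):
    """Divide per-neuron operation counts by the knob-free baseline count.

    ops_performed: 2-D list with the operations actually spent on each neuron
    baseline_ops:  operations needed per neuron when no knob is applied
    """
    if baseline_ops <= 0:
        raise ValueError("baseline_ops must be positive")
    return [[ops / baseline_ops for ops in row] for row in ops_performed]


def render(effort_map):
    """Render an effort map (values in [0, 1]) as lines of text."""
    last = len(SHADES) - 1
    return [
        "".join(SHADES[min(int(value * len(SHADES)), last)] for value in row)
        for row in effort_map
    ]
```

A value of 1.0 marks a neuron that was computed in full, while values close to 0 mark neurons that were skipped or approximated almost entirely.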
6 Conclusion
Deep Neural Networks have significantly impacted the field of machine learning by enabling state-of-the-art functional accuracies on a variety of machine learning problems involving image, video, text, speech and other modalities. However, their large-scale structure renders them compute and data intensive, which remains a key challenge. We observe that state-of-the-art DNNs are static, i.e., they perform the same set of computations on all inputs. However, in many real-world datasets, there exists significant heterogeneity in the compute effort required to classify each input. Leveraging this opportunity, we propose Dynamic Variable Effort Deep Neural Networks (DyVEDeep), or DNNs that modulate their compute effort by dynamically ascertaining which computations are critical to classify a given input. We build DyVEDeep versions of 5 popular image recognition benchmarks. Our experiments demonstrate that DyVEDeep achieves a 2.1× to 2.6× reduction in scalar operations and a 1.8× to 2.3× reduction in runtime on a Caffe-based sequential software implementation, while maintaining the same level of classification accuracy.
References
 Anwar et al. (2015) Sajid Anwar, Kyuyeon Hwang, and Wonyong Sung. Fixed point optimization of deep convolutional neural networks for object recognition. In 2015 IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2015, South Brisbane, Queensland, Australia, April 19-24, 2015, pp. 1131–1135, 2015. doi: 10.1109/ICASSP.2015.7178146. URL http://dx.doi.org/10.1109/ICASSP.2015.7178146.
 Ba & Frey (2013) Lei Jimmy Ba and Brendan J. Frey. Adaptive dropout for training deep neural networks. In Advances in Neural Information Processing Systems 26: 27th Annual Conference on Neural Information Processing Systems 2013. Proceedings of a meeting held December 5-8, 2013, Lake Tahoe, Nevada, United States., pp. 3084–3092, 2013. URL http://papers.nips.cc/paper/5032adaptivedropoutfortrainingdeepneuralnetworks.
 Bengio et al. (2015) Emmanuel Bengio, Pierre-Luc Bacon, Joelle Pineau, and Doina Precup. Conditional computation in neural networks for faster models. CoRR, abs/1511.06297, 2015. URL http://arxiv.org/abs/1511.06297.
 Bengio (2013) Yoshua Bengio. Estimating or propagating gradients through stochastic neurons. CoRR, abs/1305.2982, 2013. URL http://arxiv.org/abs/1305.2982.
 BVLC (a) BVLC. Caffe model zoo. a. URL https://github.com/BVLC/caffe/wiki/ModelZoo.
 BVLC (b) BVLC. Caffe cifar10 network. b. URL https://github.com/BVLC/caffe/blob/master/examples/cifar10/cifar10_quick_train_test.prototxt.
 Chen et al. (2015) Wenlin Chen, James T. Wilson, Stephen Tyree, Kilian Q. Weinberger, and Yixin Chen. Compressing neural networks with the hashing trick. CoRR, abs/1504.04788, 2015. URL http://arxiv.org/abs/1504.04788.
 Chen et al. (2014) Yunji Chen, Tao Luo, Shaoli Liu, Shijin Zhang, Liqiang He, Jia Wang, Ling Li, Tianshi Chen, Zhiwei Xu, Ninghui Sun, and Olivier Temam. Dadiannao: A machine-learning supercomputer. In Proceedings of the 47th Annual IEEE/ACM International Symposium on Microarchitecture, MICRO-47, pp. 609–622, Washington, DC, USA, 2014. IEEE Computer Society. ISBN 9781479969982. doi: 10.1109/MICRO.2014.58. URL http://dx.doi.org/10.1109/MICRO.2014.58.
 Das et al. (2016) Dipankar Das, Sasikanth Avancha, Dheevatsa Mudigere, Karthikeyan Vaidyanathan, Srinivas Sridharan, Dhiraj D. Kalamkar, Bharat Kaul, and Pradeep Dubey. Distributed deep learning using synchronous stochastic gradient descent. CoRR, abs/1602.06709, 2016. URL http://arxiv.org/abs/1602.06709.
 Dean et al. (2012) Jeffrey Dean, Greg S. Corrado, Rajat Monga, Kai Chen, Matthieu Devin, Quoc V. Le, Mark Z. Mao, Marc’Aurelio Ranzato, Andrew Senior, Paul Tucker, Ke Yang, and Andrew Y. Ng. Large scale distributed deep networks. In NIPS, 2012.
 Deng et al. (2009) Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Fei-Fei Li. Imagenet: A large-scale hierarchical image database. In 2009 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR 2009), 20-25 June 2009, Miami, Florida, USA, pp. 248–255, 2009. doi: 10.1109/CVPRW.2009.5206848. URL http://dx.doi.org/10.1109/CVPRW.2009.5206848.
 Denil et al. (2013) Misha Denil, Babak Shakibi, Laurent Dinh, Marc’Aurelio Ranzato, and Nando de Freitas. Predicting parameters in deep learning. In Advances in Neural Information Processing Systems 26: 27th Annual Conference on Neural Information Processing Systems 2013. Proceedings of a meeting held December 5-8, 2013, Lake Tahoe, Nevada, United States., pp. 2148–2156, 2013. URL http://papers.nips.cc/paper/5025predictingparametersindeeplearning.
 Denton et al. (2014) Emily Denton, Wojciech Zaremba, Joan Bruna, Yann LeCun, and Rob Fergus. Exploiting linear structure within convolutional networks for efficient evaluation. CoRR, abs/1404.0736, 2014. URL http://arxiv.org/abs/1404.0736.
 Farabet et al. (2011) C. Farabet, B. Martini, B. Corda, P. Akselrod, E. Culurciello, and Y. LeCun. Neuflow: A runtime reconfigurable dataflow processor for vision. In CVPR 2011 WORKSHOPS, pp. 109–116, June 2011. doi: 10.1109/CVPRW.2011.5981829.
 Graves (2016) Alex Graves. Adaptive computation time for recurrent neural networks. CoRR, abs/1603.08983, 2016. URL http://arxiv.org/abs/1603.08983.
 Han et al. (2015a) Song Han, Huizi Mao, and William J. Dally. Deep compression: Compressing deep neural network with pruning, trained quantization and huffman coding. CoRR, abs/1510.00149, 2015a. URL http://arxiv.org/abs/1510.00149.
 Han et al. (2015b) Song Han, Jeff Pool, John Tran, and William J. Dally. Learning both weights and connections for efficient neural networks. CoRR, abs/1506.02626, 2015b. URL http://arxiv.org/abs/1506.02626.
 Hinton et al. (2012) Geoffrey E. Hinton, Nitish Srivastava, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Improving neural networks by preventing co-adaptation of feature detectors. CoRR, abs/1207.0580, 2012. URL http://arxiv.org/abs/1207.0580.
 Jaderberg et al. (2014) Max Jaderberg, Andrea Vedaldi, and Andrew Zisserman. Speeding up convolutional neural networks with low rank expansions. In British Machine Vision Conference, BMVC 2014, Nottingham, UK, September 1-5, 2014, 2014. URL http://www.bmva.org/bmvc/2014/papers/paper073/index.html.
 Jia et al. (2014) Yangqing Jia, Evan Shelhamer, Jeff Donahue, Sergey Karayev, Jonathan Long, Ross Girshick, Sergio Guadarrama, and Trevor Darrell. Caffe: Convolutional architecture for fast feature embedding. arXiv preprint arXiv:1408.5093, 2014.
 (21) Norman Jouppi. Google supercharges machine learning tasks with custom chip: https://cloudplatform.googleblog.com/2016/05/googlesuperchargesmachinelearningtaskswithcustomchip.html.
 Krizhevsky (2009) Alex Krizhevsky. Learning multiple layers of features from tiny images. Technical report, 2009.
 Krizhevsky (2014) Alex Krizhevsky. One weird trick for parallelizing convolutional neural networks. CoRR, abs/1404.5997, 2014. URL http://arxiv.org/abs/1404.5997.
 Krizhevsky et al. (2012) Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. Imagenet classification with deep convolutional neural networks. In F. Pereira, C. J. C. Burges, L. Bottou, and K. Q. Weinberger (eds.), Advances in Neural Information Processing Systems 25, pp. 1097–1105. Curran Associates, Inc., 2012. URL http://papers.nips.cc/paper/4824imagenetclassificationwithdeepconvolutionalneuralnetworks.pdf.
 LeCun et al. (1989) Yann LeCun, John S. Denker, and Sara A. Solla. Optimal brain damage. In Advances in Neural Information Processing Systems 2, [NIPS Conference, Denver, Colorado, USA, November 27-30, 1989], pp. 598–605, 1989. URL http://papers.nips.cc/paper/250optimalbraindamage.
 Liu et al. (2015a) Baoyuan Liu, Min Wang, Hassan Foroosh, Marshall F. Tappen, and Marianna Pensky. Sparse convolutional neural networks. In IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2015, Boston, MA, USA, June 7-12, 2015, pp. 806–814, 2015a. doi: 10.1109/CVPR.2015.7298681. URL http://dx.doi.org/10.1109/CVPR.2015.7298681.
 Liu et al. (2014) Chao Liu, Zhiyong Zhang, and Dong Wang. Pruning deep neural networks by optimal brain damage. In INTERSPEECH 2014, 15th Annual Conference of the International Speech Communication Association, Singapore, September 14-18, 2014, pp. 1092–1095, 2014. URL http://www.iscaspeech.org/archive/interspeech_2014/i14_1092.html.
 Liu et al. (2015b) Xiaoxiao Liu, Mengjie Mao, Beiye Liu, Hai Li, Yiran Chen, Boxun Li, Yu Wang, Hao Jiang, Mark Barnell, Qing Wu, and Jianhua Yang. Reno: A high-efficient reconfigurable neuromorphic computing accelerator design. In Proceedings of the 52nd Annual Design Automation Conference, DAC ’15, pp. 66:1–66:6, New York, NY, USA, 2015b. ACM. ISBN 9781450335201. doi: 10.1145/2744769.2744900. URL http://doi.acm.org/10.1145/2744769.2744900.
 Mathieu et al. (2013) Michaël Mathieu, Mikael Henaff, and Yann LeCun. Fast training of convolutional networks through ffts. CoRR, abs/1312.5851, 2013. URL http://arxiv.org/abs/1312.5851.
 Ramasubramanian et al. (2014) Shankar Ganesh Ramasubramanian, Rangharajan Venkatesan, Mrigank Sharad, Kaushik Roy, and Anand Raghunathan. Spindle: Spintronic deep learning engine for large-scale neuromorphic computing. In Proceedings of the 2014 International Symposium on Low Power Electronics and Design, ISLPED ’14, pp. 15–20, New York, NY, USA, 2014. ACM. ISBN 9781450329750. doi: 10.1145/2627369.2627625. URL http://doi.acm.org/10.1145/2627369.2627625.
 Rhu et al. (2016) Minsoo Rhu, Natalia Gimelshein, Jason Clemons, Arslan Zulfiqar, and Stephen W. Keckler. Virtualizing deep neural networks for memory-efficient neural network design. CoRR, abs/1602.08124, 2016. URL http://arxiv.org/abs/1602.08124.
 Seide et al. (2014) Frank Seide, Hao Fu, Jasha Droppo, Gang Li, and Dong Yu. 1-bit stochastic gradient descent and application to data-parallel distributed training of speech DNNs. In Interspeech 2014, September 2014.
 (33) Pierre Sermanet, David Eigen, Xiang Zhang, Michael Mathieu, Rob Fergus, and Yann Lecun. Overfeat: Integrated recognition, localization and detection using convolutional networks. http://arxiv.org/abs/1312.6229.
 Simonyan & Zisserman (2014) Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. CoRR, abs/1409.1556, 2014. URL http://arxiv.org/abs/1409.1556.
 Tan & Sim (2016) Shawn Tan and Khe Chai Sim. Towards implicit complexity control using variable-depth deep neural networks for automatic speech recognition. In 2016 IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2016, Shanghai, China, March 20-25, 2016, pp. 5965–5969, 2016. doi: 10.1109/ICASSP.2016.7472822. URL http://dx.doi.org/10.1109/ICASSP.2016.7472822.
 Venkataramani et al. (2014) Swagath Venkataramani, Ashish Ranjan, Kaushik Roy, and Anand Raghunathan. Axnn: Energy-efficient neuromorphic systems using approximate computing. In Proceedings of the 2014 International Symposium on Low Power Electronics and Design, ISLPED ’14, pp. 27–32, New York, NY, USA, 2014. ACM. ISBN 9781450329750. doi: 10.1145/2627369.2627613. URL http://doi.acm.org/10.1145/2627369.2627613.