Pruning Convolutional Neural Networks
for Resource Efficient Inference
Abstract
We propose a new formulation for pruning convolutional kernels in neural networks to enable efficient inference. We interleave greedy criteria-based pruning with fine-tuning by backpropagation, a computationally efficient procedure that maintains good generalization in the pruned network. We propose a new criterion based on Taylor expansion that approximates the change in the cost function induced by pruning network parameters. We focus on transfer learning, where large pre-trained networks are adapted to specialized tasks. The proposed criterion demonstrates superior performance compared to other criteria, e.g. the norm of kernel weights or feature map activation, for pruning large CNNs after adaptation to fine-grained classification tasks (Birds-200 and Flowers-102), relying only on first-order gradient information. We also show that pruning can lead to a substantial theoretical reduction in computation for adapted 3D-convolutional filters with a small drop in accuracy in a recurrent gesture classifier. Finally, we show results for the large-scale ImageNet dataset to emphasize the flexibility of our approach.
Pavlo Molchanov, Stephen Tyree, Tero Karras, Timo Aila, Jan Kautz 

NVIDIA 
{pmolchanov, styree, tkarras, taila, jkautz}@nvidia.com 
1 Introduction
Convolutional neural networks (CNNs) are used extensively in computer vision applications, including object classification and localization, pedestrian and car detection, and video classification. Many problems like these focus on specialized domains for which there are only small amounts of carefully curated training data. In these cases, accuracy may be improved by fine-tuning an existing deep network previously trained on a much larger labeled vision dataset, such as images from ImageNet (Russakovsky et al., 2015) or videos from Sports-1M (Karpathy et al., 2014). While transfer learning of this form supports state-of-the-art accuracy, inference is expensive due to the time, power, and memory demanded by the heavyweight architecture of the fine-tuned network.
While modern deep CNNs are composed of a variety of layer types, runtime during prediction is dominated by the evaluation of convolutional layers. With the goal of speeding up inference, we prune entire feature maps so the resulting networks may be run efficiently even on embedded devices. We interleave greedy criteria-based pruning with fine-tuning by backpropagation, a computationally efficient procedure that maintains good generalization in the pruned network.
Neural network pruning was pioneered in the early development of neural networks (Reed, 1993). Optimal Brain Damage (LeCun et al., 1990) and Optimal Brain Surgeon (Hassibi & Stork, 1993) leverage a second-order Taylor expansion to select parameters for deletion, using pruning as regularization to improve training and generalization. These methods require computation of the Hessian matrix, partially or completely, which adds memory and computational costs to standard fine-tuning.
In line with our work, Anwar et al. (2015) describe structured pruning in convolutional layers at the level of feature maps and kernels, as well as strided sparsity to prune with regularity within kernels. Pruning is accomplished by particle filtering wherein configurations are weighted by misclassification rate. The method demonstrates good results on small CNNs, but larger CNNs are not addressed.
Han et al. (2015) introduce a simpler approach by fine-tuning with a strong regularization term and dropping parameters with values below a predefined threshold. Such unstructured pruning is very effective for network compression, and this approach demonstrates good performance for intra-kernel pruning. But compression may not translate directly to faster inference, since modern hardware exploits regularities in computation for high throughput, so specialized hardware may be needed for efficient inference of a network with intra-kernel sparsity (Han et al., 2016). This approach also requires long fine-tuning times that may exceed the original network training time by a large factor. Group-sparsity-based regularization of network parameters was proposed to penalize unimportant parameters (Wen et al., 2016; Zhou et al., 2016; Alvarez & Salzmann, 2016; Lebedev & Lempitsky, 2016). Regularization-based pruning techniques require per-layer sensitivity analysis, which adds extra computation. In contrast, our approach relies on global rescaling of criteria for all layers and does not require sensitivity estimation. Moreover, our approach is faster, as we directly prune unimportant parameters instead of waiting for their values to be made sufficiently small by optimization under regularization.
Other approaches include combining parameters with correlated weights (Srinivas & Babu, 2015), reducing precision (Gupta et al., 2015; Rastegari et al., 2016), or tensor decomposition (Kim et al., 2015). These approaches usually require a separate training procedure or significant fine-tuning, but potentially may be combined with our method for additional speedups.
2 Method
The proposed method for pruning consists of the following steps: 1) Fine-tune the network until convergence on the target task; 2) alternate iterations of pruning and further fine-tuning; 3) stop pruning after reaching the target trade-off between accuracy and the pruning objective, e.g. floating point operations (FLOPs) or memory utilization.
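The three steps above amount to a greedy loop, which can be sketched as follows. This is an illustrative sketch only: `model`, `finetune`, `saliency`, and `flops` are hypothetical stand-ins for a concrete training setup, not code from this work.

```python
def prune_iteratively(model, finetune, saliency, flops, target_flops,
                      maps_per_iter=1):
    """Alternate criteria-based pruning with fine-tuning until the
    FLOPs target is reached (step 3 of the procedure above)."""
    while flops(model) > target_flops:
        scores = saliency(model)  # one importance score per feature map
        # Remove the least important feature map(s) (step 2).
        for fmap in sorted(scores, key=scores.get)[:maps_per_iter]:
            model.remove_feature_map(fmap)
        finetune(model)  # brief fine-tuning to recover accuracy
    return model
```

In the experiments below a single feature map is pruned per iteration, so the criterion can be re-evaluated after each removal.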
The procedure is simple, but its success hinges on employing the right pruning criterion. In this section, we introduce several efficient pruning criteria and related technical considerations.
Consider a set of training examples D = {X, Y}, where X and Y represent inputs and target outputs, respectively. The network's parameters W (a "parameter" might represent an individual weight, a convolutional kernel, or the entire set of kernels that compute a feature map; our experiments operate at the level of feature maps) are optimized to minimize a cost value C(D|W). The most common choice for a cost function C(·) is a negative log-likelihood. A cost function is selected independently of pruning and depends only on the task to be solved by the original network. In the case of transfer learning, we adapt a large network initialized with parameters pre-trained on a related but distinct dataset.
During pruning, we refine a subset of parameters W' which preserves the accuracy of the adapted network, C(D|W') ≈ C(D|W). This corresponds to a combinatorial optimization:
    min_{W'} |C(D|W') − C(D|W)|   s.t.   ||W'||_0 ≤ B,    (1)
where the ℓ0 norm ||W'||_0 bounds B, the number of non-zero parameters in W'. Intuitively, if W' = W we reach the global minimum of the error function, but ||W'||_0 also attains its maximum.
Finding a good subset of parameters while maintaining a cost value as close as possible to the original is a combinatorial problem, requiring 2^{|W|} evaluations of the cost function on a selected subset of data. For current networks this is impossible to compute: for example, VGG-16 has 4224 convolutional feature maps. While it is impossible to solve this optimization exactly for networks of any reasonable size, in this work we investigate a class of greedy methods.
Starting with a full set of parameters W, we iteratively identify and remove the least important parameters, as illustrated in Figure 1. By removing parameters at each iteration, we ensure the eventual satisfaction of the bound ||W'||_0 ≤ B.
Since we focus our analysis on pruning feature maps from convolutional layers, let us denote a set of image feature maps by z_ℓ ∈ R^{H_ℓ × W_ℓ × C_ℓ} with dimensionality H_ℓ × W_ℓ and C_ℓ individual maps (or channels). (While our notation is at times specific to 2D convolutions, the methods are applicable to 3D convolutions, as well as fully connected layers.) The feature maps can either be the input to the network, z_0, or the output from a convolutional layer, z_ℓ with ℓ ∈ [1, 2, ..., L]. Individual feature maps are denoted z_ℓ^{(k)} for k ∈ [1, 2, ..., C_ℓ]. A convolutional layer ℓ applies the convolution operation (∗) to a set of input feature maps z_{ℓ−1} with kernels parameterized by w_ℓ^{(k)} ∈ R^{C_{ℓ−1} × p × p}:
    z_ℓ^{(k)} = g_ℓ^{(k)} R ( z_{ℓ−1} ∗ w_ℓ^{(k)} + b_ℓ^{(k)} ),    (2)
where z_ℓ^{(k)} ∈ R^{H_ℓ × W_ℓ} is the result of convolving each of the C_{ℓ−1} kernels of size p × p with its respective input feature map and adding bias b_ℓ^{(k)}, and R(·) is a nonlinear activation such as the ReLU. We introduce a pruning gate g_ℓ ∈ {0, 1}^{C_ℓ}, an external switch which determines if a particular feature map is included or pruned during feed-forward propagation, such that when g is vectorized: W' = gW.
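As a concrete illustration of the gate, a minimal sketch (array shapes and names are assumptions for illustration, not the paper's code):

```python
import numpy as np

def gate_feature_maps(z, g):
    """Apply binary pruning gates to conv-layer outputs.

    z: (C, H, W) output feature maps; g: (C,) gates in {0, 1}.
    A zeroed gate removes the corresponding map from the forward pass.
    """
    return z * g[:, None, None]
```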
2.1 Oracle pruning
Minimizing the difference in accuracy between the full and pruned models depends on the criterion for identifying the “least important” parameters, called saliency, at each step. The best criterion would be an exact empirical evaluation of each parameter, which we denote the oracle criterion, accomplished by ablating each nonzero parameter in turn and recording the cost’s difference.
We distinguish two ways of using this oracle estimation of importance: 1) oracle-loss quantifies importance as the signed change in loss, ΔC = C(D|W') − C(D|W), and 2) oracle-abs adopts the absolute difference, |ΔC|. While both discourage pruning which increases the loss, the oracle-loss version encourages pruning which may decrease the loss, while oracle-abs penalizes any pruning in proportion to its change in loss, regardless of the direction of change.
While the oracle is optimal for this greedy procedure, it is prohibitively costly to compute, requiring ||W'||_0 evaluations on a training dataset, one evaluation for each remaining non-zero parameter. Since estimation of parameter importance is key to both the accuracy and the efficiency of this pruning approach, we propose and evaluate several criteria in terms of performance and estimation cost.
2.2 Criteria for pruning
There are many heuristic criteria which are much more computationally efficient than the oracle. For the specific case of evaluating the importance of a feature map (and implicitly the set of convolutional kernels from which it is computed), reasonable criteria include: the combined ℓ2-norm of the kernel weights; the mean, standard deviation, or percentage of positive values of the feature map's activation; and the mutual information between activations and predictions. We describe these criteria in the following paragraphs and propose a new criterion based on the Taylor expansion.
Minimum weight.
Pruning by magnitude of kernel weights is perhaps the simplest possible criterion, and it does not require any additional computation during the fine-tuning process. In the case of pruning according to the ℓ2-norm of a set of weights, the criterion is evaluated as Θ_MW(w) = (1/|w|) Σ_i w_i², where |w| is the dimensionality of the set of weights after vectorization. The motivation for this type of pruning is that a convolutional kernel with a low ℓ2-norm detects less important features than one with a high norm. This can be aided during training by applying ℓ1 or ℓ2 regularization, which pushes unimportant kernels toward smaller values.
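A minimal sketch of this criterion: score each output feature map by the mean squared magnitude of its kernel weights, (1/|w|) Σ_i w_i². The array layout is an assumption for illustration.

```python
import numpy as np

def weight_criterion(W):
    """W: (C_out, C_in, p, p) conv kernels; returns one score
    per output feature map: the mean of the squared weights."""
    flat = W.reshape(W.shape[0], -1)
    return (flat ** 2).mean(axis=1)
```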
Activation.
One reason for the popularity of the ReLU activation is the sparsity it induces in activations, allowing convolutional layers to act as feature detectors. It is therefore reasonable to assume that if an activation value (an output feature map) is small, then the corresponding feature detector is not important for the prediction task at hand. We may evaluate this by the mean activation, Θ_MA(a) = (1/|a|) Σ_i a_i for activation a = z_ℓ^{(k)}, or by the standard deviation of the activation.
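A hedged sketch of the activation criteria, computing per-feature-map mean and standard deviation of activations over a batch (shapes are assumptions):

```python
import numpy as np

def activation_criteria(a):
    """a: (N, C, H, W) activations over N examples;
    returns (mean, std) scores, one pair per feature map."""
    mean_score = a.mean(axis=(0, 2, 3))
    std_score = a.std(axis=(0, 2, 3))
    return mean_score, std_score
```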
Mutual information.
Mutual information (MI) is a measure of how much information one variable carries about another. We apply MI as a criterion for pruning, Θ_MI(a) = MI(a, y), where y is the target of the neural network. MI is defined for continuous variables, so to simplify computation we replace it with information gain (IG), which is defined for quantized variables: IG(y|x) = H(x) + H(y) − H(x, y), where H(x) is the entropy of variable x. We accumulate statistics on activations and ground truth for a number of updates, then quantize the values and compute IG.
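A small sketch of the information-gain computation under these definitions (the binning strategy and variable names are assumptions, not the paper's exact procedure):

```python
import numpy as np

def entropy(labels):
    """Shannon entropy (bits) of a discrete variable."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

def information_gain(act, target, bins=4):
    """IG(y|x) = H(x) + H(y) - H(x, y) after quantizing activations."""
    edges = np.histogram_bin_edges(act, bins=bins)
    x = np.digitize(act, edges[1:-1])  # quantized activation per sample
    # Encode the joint variable (x, y) as a single integer.
    joint = x.astype(np.int64) * (int(target.max()) + 1) + target
    return entropy(x) + entropy(target) - entropy(joint)
```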
Taylor expansion.
We phrase pruning as an optimization problem: find W' with a bounded number of non-zero elements that minimizes |ΔC(h_i)| = |C(D|W') − C(D|W)|. With an approach based on the Taylor expansion, we directly approximate the change in the loss function from removing a particular parameter. Let h_i be the output produced from parameter i. In the case of feature maps, h = {z_0^{(1)}, z_0^{(2)}, ..., z_L^{(C_L)}}. For notational convenience, we consider the cost function equally dependent on parameters and on the outputs computed from them: C(D|h_i) = C(D|(w, b)_i). Assuming independence of parameters, we have:
    |ΔC(h_i)| = |C(D, h_i = 0) − C(D, h_i)|,    (3)
where C(D, h_i = 0) is the cost value if output h_i is pruned, while C(D, h_i) is the cost if it is not pruned. While parameters are in reality interdependent, we already make an independence assumption at each gradient step during training.
To approximate ΔC(h_i), we use the first-degree Taylor polynomial. For a function f(x), the Taylor expansion at point x = a is
    f(x) = Σ_{p=0}^{P} ( f^{(p)}(a) / p! ) (x − a)^p + R_P(x),    (4)
where f^{(p)}(a) is the p-th derivative of f evaluated at point a, and R_P(x) is the P-th order remainder. Approximating C(D, h_i = 0) with a first-order Taylor polynomial near h_i = 0, we have:
    C(D, h_i = 0) = C(D, h_i) − (∂C/∂h_i) h_i + R_1(h_i = 0).    (5)
The remainder R_1(h_i = 0) can be calculated through the Lagrange form:
    R_1(h_i = 0) = (∂²C/∂h_i²)|_{h_i = ξ} · (h_i² / 2),    (6)
where ξ is a real number between 0 and h_i. We neglect this remainder, largely due to the significant calculation required, but also in part because the widely used ReLU activation function encourages a smaller second-order term. Substituting Eq. 5 into Eq. 3 and dropping the remainder yields the Taylor expansion criterion:

    Θ_TE(h_i) = |ΔC(h_i)| = |C(D, h_i) − (∂C/∂h_i) h_i − C(D, h_i)| = |(∂C/∂h_i) h_i|.    (7)
Intuitively, this criterion prunes parameters that have an almost flat gradient of the cost function w.r.t. feature map h_i. It requires accumulating the product of the activation and the gradient of the cost function w.r.t. the activation, both of which are readily available during backpropagation. Θ_TE is computed for a multivariate output, such as a feature map, by
    Θ_TE(z_ℓ^{(k)}) = | (1/M) Σ_m (∂C/∂z_{ℓ,m}^{(k)}) z_{ℓ,m}^{(k)} |,    (8)
where M is the length of the vectorized feature map. For a minibatch with T examples, the criterion is computed for each example separately and averaged over T.
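A minimal sketch of Eq. 8 (the array layout is an assumption; in practice the activations and gradients come from the forward and backward passes):

```python
import numpy as np

def taylor_criterion(z, dCdz):
    """Taylor criterion per feature map.

    z, dCdz: (T, C, M) arrays of activations and cost gradients for
    T examples, C feature maps, and M spatial positions per map.
    """
    per_example = np.abs((z * dCdz).mean(axis=2))  # |1/M sum_m dC/dz * z|
    return per_example.mean(axis=0)                # average over the batch
```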
Independently of our work, Figurnov et al. (2016) proposed a similar metric based on the Taylor expansion, called impact, to evaluate the importance of spatial cells in a convolutional layer. This shows that the same metric can be applied to evaluate the importance of different groups of parameters.
Relation to Optimal Brain Damage.
The Taylor criterion proposed above relies on approximating the change in loss caused by removing a feature map. The core idea is the same as in Optimal Brain Damage (OBD) (LeCun et al., 1990). Here we consider the differences more carefully.
The primary difference is the treatment of the first-order term of the Taylor expansion, in our notation y = (∂C/∂h) h for cost function C and hidden layer activation h. After sufficient training epochs, the gradient term tends to zero: ∂C/∂h → 0 and E(y) = 0. At face value, y offers little useful information; hence OBD regards the term as zero and focuses on the second-order term.
However, the variance of y is non-zero and correlates with the stability of the local function w.r.t. activation h. By considering the absolute change in the cost induced by pruning (as in Eq. 3), we use the absolute value of the first-order term, |y|. (OBD approximates the signed difference in loss, while our method approximates the absolute difference; we find in our results that pruning based on the absolute difference yields better accuracy.) Under the assumption that samples come from an independent and identical distribution, E(|y|) = σ√2/√π, where σ is the standard deviation of y; this is the expected value of the half-normal distribution. So, while y tends to zero, the expectation of |y| is proportional to the variance of y, a value which is empirically more informative as a pruning criterion.
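The half-normal expectation invoked above follows from a zero-mean Gaussian assumption on y; sketching the standard derivation:

```latex
% Assume y ~ N(0, sigma^2); then |y| is half-normal distributed.
\mathbb{E}|y|
  = \int_{-\infty}^{\infty} |t|\,\frac{1}{\sigma\sqrt{2\pi}}\,
      e^{-t^2/(2\sigma^2)}\,dt
  = \frac{2}{\sigma\sqrt{2\pi}} \int_{0}^{\infty} t\,
      e^{-t^2/(2\sigma^2)}\,dt
  = \frac{2}{\sigma\sqrt{2\pi}}\,\sigma^2
  = \sigma\sqrt{\tfrac{2}{\pi}}.
```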
As an additional benefit, we avoid the computation of the second-order Taylor expansion term, or its simplification (the diagonal of the Hessian), as required in OBD.
We found it important to compare the proposed Taylor criterion to OBD. As described in the original papers (LeCun et al., 1990; 1998), OBD can be implemented similarly to the standard backpropagation algorithm, doubling backward propagation time and memory usage when used together with standard fine-tuning. An efficient implementation of the original OBD algorithm might require significant changes to frameworks based on automatic differentiation, like Theano, in order to compute only the diagonal of the Hessian instead of the full matrix. Several researchers have tackled this problem with approximation techniques (Martens, 2010; Martens et al., 2012). In our implementation, we use an efficient way of computing the Hessian-vector product (Pearlmutter, 1994) and the matrix diagonal approximation proposed by Bekas et al. (2007); please refer to the appendix for more details. With the current implementation, OBD is 30 times slower than the Taylor technique for saliency estimation, and 3 times slower for iterative pruning; however, a different implementation could be only 50% slower, as mentioned in the original paper.
Average Percentage of Zeros (APoZ).
Hu et al. (2016) proposed exploring sparsity in activations for network pruning. The ReLU activation function imposes sparsity during inference, and the average percentage of positive activations at the output can determine the importance of a neuron. Intuitively, this is a reasonable criterion; however, feature maps in the first layers have similar APoZ regardless of the network's target task, as they learn to be Gabor-like filters. We use APoZ to estimate the saliency of feature maps.
2.3 Normalization
Some criteria return "raw" values, whose scale varies with the depth of the parameter's layer in the network. A simple layer-wise ℓ2-normalization achieves adequate rescaling across layers:

    Θ̂(z_ℓ^{(k)}) = Θ(z_ℓ^{(k)}) / sqrt( Σ_j Θ(z_ℓ^{(j)})² ).
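A sketch of this layer-wise ℓ2 rescaling (the list-of-arrays layout is an assumption for illustration):

```python
import numpy as np

def normalize_layerwise(raw):
    """raw: list of 1-D arrays of raw criterion values, one per layer.
    Divide each value by the l2 norm of its own layer's values."""
    return [theta / np.sqrt((theta ** 2).sum()) for theta in raw]
```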
2.4 FLOPs regularized pruning
One of the main reasons to apply pruning is to reduce the number of operations in the network. Feature maps from different layers require different amounts of computation due to the number and sizes of input feature maps and convolution kernels. To take this into account, we introduce FLOPs regularization:
    Θ(z_ℓ^{(k)}) = Θ(z_ℓ^{(k)}) − λ Θ_ℓ^{flops},    (9)
where λ controls the amount of regularization; for our experiments, we use λ = 10^{-3}. Θ_ℓ^{flops} is computed under the assumption that convolution is implemented as a sliding window (see Appendix). Other regularization conditions may be applied, e.g. storage size, kernel sizes, or memory footprint.
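Under the sliding-window assumption, the per-feature-map cost behind Θ_ℓ^{flops} can be sketched as follows, counting one multiply and one add per kernel weight per output position (the paper's exact accounting is detailed in its appendix; this is an illustrative convention):

```python
def conv_flops_per_map(c_in, p, h_out, w_out):
    """FLOPs to compute one output feature map of a conv layer:
    each of the h_out * w_out output positions applies a
    c_in x p x p kernel (one multiply and one add per weight)."""
    return 2 * c_in * p * p * h_out * w_out
```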
3 Results
We empirically study the pruning criteria and procedure detailed in the previous section for a variety of problems. We focus many experiments on transfer learning problems, a setting where pruning seems to excel. We also present results for pruning large networks on their original tasks for more direct comparison with the existing pruning literature. Experiments are performed within Theano (Theano Development Team, 2016). Training and pruning are performed on the respective training sets for each problem, while results are reported on appropriate holdout sets, unless otherwise indicated. For all experiments we prune a single feature map at every pruning iteration, allowing finetuning and reevaluation of the criterion to account for dependency between parameters.
3.1 Characterizing the oracle ranking
We begin by explicitly computing the oracle for a single pruning iteration of a visual transfer learning problem. We fine-tune the VGG-16 network (Simonyan & Zisserman, 2014) for classification of bird species using the Caltech-UCSD Birds 200-2011 dataset (Wah et al., 2011), which consists of nearly 6,000 training images and 5,700 test images covering 200 species. We fine-tune VGG-16 on this dataset and report test accuracy using uncropped images.
To compute the oracle, we evaluate the change in loss caused by removing each individual feature map from the fine-tuned VGG-16 network. (See Appendix A.3 for additional analysis.) We rank feature maps by their contributions to the loss, where rank 1 indicates the most important feature map (removing it results in the highest increase in loss) and the lowest rank indicates the least important. Statistics of global ranks are shown in Fig. 3, grouped by convolutional layer. We observe: (1) median global importance tends to decrease with depth; (2) layers with max-pooling tend to be more important than those without (VGG-16 has pooling after layers 2, 4, 7, 10, and 13); however, (3) maximum and minimum ranks show that every layer has some feature maps that are globally important and others that are globally less important. Taken together with the results of subsequent experiments, we opt for encouraging a balanced pruning that distributes selection across all layers.
Next, we iteratively prune the network using the precomputed oracle ranking. In this experiment, we do not update the parameters of the network or the oracle ranking between iterations. Training accuracy over many pruning iterations is illustrated in Fig. 3. Surprisingly, pruning by smallest absolute change in loss (oracle-abs) yields higher accuracy than pruning by the net effect on loss (oracle-loss). Even though the oracle indicates that removing some feature maps individually may decrease loss, instability accumulates due to the large absolute changes that are induced. These results support pruning by absolute difference in cost, as constructed in Eq. 1.
3.2 Evaluating proposed criteria versus the oracle
To evaluate computationally efficient criteria as substitutes for the oracle, we compute Spearman's rank correlation, an estimate of how well two predictors provide monotonically related outputs, even if their relationship is not linear. Given the difference d_i between the oracle rank (we use oracle-abs because of its better performance in the previous experiment) and a criterion's rank for each parameter i, the rank correlation is computed:
    S = 1 − ( 6 / (N(N² − 1)) ) Σ_{i=1}^{N} d_i²,    (10)
where N is the number of parameters (and the highest rank). This correlation coefficient takes values in [−1, 1], where −1 implies full negative correlation, 0 no correlation, and 1 full positive correlation.
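Eq. 10 translates directly into code; a small sketch (rank arrays are assumed to be permutations of 0..N−1 with no ties):

```python
import numpy as np

def spearman(rank_a, rank_b):
    """Spearman's rank correlation from two tie-free rankings."""
    d = np.asarray(rank_a) - np.asarray(rank_b)
    n = len(d)
    return 1.0 - 6.0 * float((d ** 2).sum()) / (n * (n ** 2 - 1))
```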
Table 1: Spearman's rank correlation of each pruning criterion with the oracle-abs ranking, reported per layer, across all layers, and across all layers with ℓ2 normalization. Network/dataset pairs: AlexNet on Flowers-102, Birds-200, and ImageNet; VGG-16 on Birds-200 and Flowers-102. Criteria: weight magnitude; activation statistics (mean, s.d., APoZ); OBD; Taylor; and mutual information (VGG-16 / Birds-200 only).
We show Spearman's correlation in Table 1, comparing the oracle-abs ranking to rankings by the different criteria on a set of networks/datasets, some of which are introduced later. Data-dependent criteria (all except weight magnitude) are computed on training data during fine-tuning, before or between pruning iterations. As a sanity check, we evaluate a random ranking and observe near-zero correlation across all layers. "Per layer" analysis shows ranking within each convolutional layer, while "All layers" describes ranking across layers. While several criteria do not scale well across layers with raw values, a layer-wise normalization significantly improves performance. The Taylor criterion has the highest correlation among the criteria, both within layers and across layers (with normalization). OBD shows the best correlation across layers when no normalization is used; it also shows the best correlation on the ImageNet dataset. (See Appendix A.2 for further analysis.)
3.3 Pruning fine-tuned ImageNet networks
We now evaluate the full iterative pruning procedure on two transfer learning problems, focusing on reducing the number of convolutional feature maps and the total estimated floating point operations (FLOPs). Fine-grained recognition is difficult for relatively small datasets without relying on transfer learning: Branson et al. (2014) show that training a CNN from scratch on the Birds-200 dataset achieves much lower test accuracy. We compare results to training a randomly initialized CNN with half the number of parameters per layer, denoted "from scratch".
Fig. 4 shows pruning of VGG-16 after fine-tuning on the Birds-200 dataset (as described previously). At each pruning iteration, we remove a single feature map and then perform a short round of minibatch SGD updates with momentum and weight decay. The figure depicts accuracy relative to the pruning rate (left) and estimated GFLOPs (right). The Taylor criterion shows the highest accuracy for nearly the entire range of pruning ratios, and with FLOPs regularization it demonstrates the best performance relative to the number of operations. OBD shows slightly worse pruning performance in terms of parameters, and significantly worse performance in terms of FLOPs.
In Fig. 5, we show pruning of the CaffeNet implementation of AlexNet (Krizhevsky et al., 2012) after adapting it to the Oxford Flowers-102 dataset (Nilsback & Zisserman, 2008), which contains training and test images from 102 species of flowers. Criteria correlation with oracle-abs is summarized in Table 1. We initially fine-tune the network to a final test accuracy of 80.1%. Pruning then proceeds as previously described for Birds-200, except with fewer minibatch updates between pruning iterations. We observe the superior performance of the Taylor and OBD criteria in both the number of parameters and GFLOPs.
We observed that the Taylor criterion shows the best performance, closely followed by OBD with a slightly lower Spearman's rank correlation coefficient. Implementing OBD takes more effort because of the computation of the diagonal of the Hessian, and it is 50% to 300% slower than the Taylor criterion, which relies on first-order gradients only.
Fig. 7 shows pruning with the Taylor technique and a varying number of fine-tuning updates between pruning iterations. Increasing the number of updates results in higher accuracy, but at the cost of additional runtime for the pruning procedure.
During pruning we observe a small drop in accuracy. One reason is the limited fine-tuning between pruning iterations. The accuracy of the initial network can be improved with longer fine-tuning and a search for better optimization parameters: for example, the accuracy of the unpruned VGG-16 network on Birds-200 increases after an extra 128k updates, and AlexNet on Flowers-102 likewise improves after 130k updates. Note that with further fine-tuning of pruned networks we can achieve higher accuracy as well, so a one-to-one comparison of accuracies is only approximate.
3.4 Pruning a recurrent 3DCNN network for hand gesture recognition
Molchanov et al. (2016) learn to recognize dynamic hand gestures in streaming video with a large recurrent neural network, constructed by adding recurrent connections to a 3D-CNN pretrained on the Sports-1M video dataset (Karpathy et al., 2014) and fine-tuning on a gesture dataset. The full network achieves an accuracy of 80.7% when trained on the depth modality, but a single inference requires an estimated 37.8 GFLOPs, too much for deployment on an embedded GPU. After several iterations of pruning with the Taylor criterion and FLOPs regularization, we reduce inference to 3 GFLOPs, as shown in Fig. 7. While pruning increases classification error, additional fine-tuning restores much of the lost accuracy, yielding a final pruned network with a 12.6x reduction in GFLOPs and only a 2.5% loss in accuracy.
3.5 Pruning networks for ImageNet
We also test our pruning scheme on the large-scale ImageNet classification task. In the first experiment, we begin with a trained CaffeNet implementation of AlexNet and fine-tune with SGD (with momentum, weight decay, and dropout) between pruning iterations. Using a subset of training images, we compute oracle-abs and Spearman's rank correlation with the criteria, as shown in Table 1. Pruning traces are illustrated in Fig. 8.
We observe: 1) Taylor performs better than random or minimum-weight pruning when updates are used between pruning iterations; the difference with random pruning is small when results are displayed w.r.t. FLOPs, but larger when plotted against the number of feature maps pruned. 2) Increasing the number of updates between pruning iterations improves pruning performance significantly for both the Taylor criterion and random pruning.
For a second experiment, we prune a trained VGG-16 network with the same parameters as before, except enabling FLOPs regularization. We stop pruning at two points, 11.5 and 8.0 GFLOPs, and fine-tune both models for an additional five epochs. Fine-tuning after pruning significantly improves results: the network pruned to 11.5 GFLOPs reaches 87.0% top-5 validation accuracy, and the network pruned to 8.0 GFLOPs reaches 84.5%.
3.6 Speed up measurements
During pruning, we measured the reduction in computation by FLOPs, which is a common practice (Han et al., 2015; Lavin, 2015a; b). Improvements in FLOPs result in monotonically decreasing inference time because entire feature maps are removed from each layer. However, the time consumed by inference depends on the particular implementation of the convolution operator, the parallelization algorithm, hardware, scheduling, memory transfer rate, etc. Therefore we measure the improvement in inference time for selected networks, to see the real speed-up compared to unpruned networks, in Table 2. We observe significant speed-ups from the proposed pruning scheme.
Table 2: Inference accuracy and timing before and after pruning, measured on several hardware platforms. Accuracy deltas and speed-ups are relative to the unpruned network.

AlexNet / Flowers-102, 1.46 GFLOPs unpruned; pruned to 41% feature maps (0.4 GFLOPs) and 19.5% feature maps (0.2 GFLOPs):

| Hardware | Batch | Accuracy | Time, ms | Accuracy | Time (speed-up) | Accuracy | Time (speed-up) |
|---|---|---|---|---|---|---|---|
| CPU: Intel Core i7-5930K | 16 | 80.1% | 226.4 | 79.8% (-0.3%) | 121.4 (1.9x) | 74.1% (-6.0%) | 87.0 (2.6x) |
| GPU: GeForce GTX TITAN X (Pascal) | 16 | | 4.8 | | 2.4 (2.0x) | | 1.9 (2.5x) |
| GPU: GeForce GTX TITAN X (Pascal) | 512 | | 88.3 | | 36.6 (2.4x) | | 27.4 (3.2x) |
| GPU: NVIDIA Jetson TX1 | 32 | | 169.2 | | 73.6 (2.3x) | | 58.6 (2.9x) |

VGG-16 / ImageNet, 30.96 GFLOPs unpruned; pruned to 66% feature maps (11.5 GFLOPs) and 52% feature maps (8.0 GFLOPs):

| Hardware | Batch | Accuracy | Time, ms | Accuracy | Time (speed-up) | Accuracy | Time (speed-up) |
|---|---|---|---|---|---|---|---|
| CPU: Intel Core i7-5930K | 16 | 89.3% | 2564.7 | 87.0% (-2.3%) | 1483.3 (1.7x) | 84.5% (-4.8%) | 1218.4 (2.1x) |
| GPU: GeForce GTX TITAN X (Pascal) | 16 | | 68.3 | | 31.0 (2.2x) | | 20.2 (3.4x) |
| GPU: NVIDIA Jetson TX1 | 4 | | 456.6 | | 182.5 (2.5x) | | 138.2 (3.3x) |

R3DCNN / nvGesture, 37.8 GFLOPs unpruned; pruned to 25% feature maps (3 GFLOPs):

| Hardware | Batch | Accuracy | Time, ms | Accuracy | Time (speed-up) |
|---|---|---|---|---|---|
| GPU: GeForce GT 730M | 1 | 80.7% | 438.0 | 78.2% (-2.5%) | 85.0 (5.2x) |
4 Conclusions
We propose a new scheme for iteratively pruning deep convolutional neural networks. We find: 1) CNNs may be successfully pruned by iteratively removing the least important parameters (feature maps in this case) according to heuristic selection criteria; 2) a Taylor expansion-based criterion demonstrates significant improvement over other criteria; 3) per-layer normalization of the criterion is important to obtain global scaling.
References
 Alvarez & Salzmann (2016) Jose M Alvarez and Mathieu Salzmann. Learning the Number of Neurons in Deep Networks. In D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, and R. Garnett (eds.), Advances in Neural Information Processing Systems 29, pp. 2262–2270. Curran Associates, Inc., 2016.
 Anwar et al. (2015) Sajid Anwar, Kyuyeon Hwang, and Wonyong Sung. Structured pruning of deep convolutional neural networks. arXiv preprint arXiv:1512.08571, 2015. URL http://arxiv.org/abs/1512.08571.
 Bekas et al. (2007) Costas Bekas, Effrosyni Kokiopoulou, and Yousef Saad. An estimator for the diagonal of a matrix. Applied numerical mathematics, 57(11):1214–1229, 2007.
 Branson et al. (2014) Steve Branson, Grant Van Horn, Serge Belongie, and Pietro Perona. Bird species categorization using pose normalized deep convolutional nets. arXiv preprint arXiv:1406.2952, 2014.
 Dauphin et al. (2015) Yann Dauphin, Harm de Vries, and Yoshua Bengio. Equilibrated adaptive learning rates for non-convex optimization. In Advances in Neural Information Processing Systems, pp. 1504–1512, 2015.
 Figurnov et al. (2016) Mikhail Figurnov, Aizhan Ibraimova, Dmitry P Vetrov, and Pushmeet Kohli. PerforatedCNNs: Acceleration through elimination of redundant convolutions. In Advances in Neural Information Processing Systems, pp. 947–955, 2016.
 Gupta et al. (2015) Suyog Gupta, Ankur Agrawal, Kailash Gopalakrishnan, and Pritish Narayanan. Deep learning with limited numerical precision. CoRR, abs/1502.02551, 2015. URL http://arxiv.org/abs/1502.02551.
 Han et al. (2015) Song Han, Jeff Pool, John Tran, and William Dally. Learning both weights and connections for efficient neural network. In Advances in Neural Information Processing Systems, pp. 1135–1143, 2015.
 Han et al. (2016) Song Han, Xingyu Liu, Huizi Mao, Jing Pu, Ardavan Pedram, Mark A. Horowitz, and William J. Dally. EIE: Efficient inference engine on compressed deep neural network. In Proceedings of the 43rd International Symposium on Computer Architecture, ISCA ’16, pp. 243–254, Piscataway, NJ, USA, 2016. IEEE Press.
 Hassibi & Stork (1993) Babak Hassibi and David G. Stork. Second order derivatives for network pruning: Optimal brain surgeon. In Advances in Neural Information Processing Systems (NIPS), pp. 164–171, 1993.
 Hu et al. (2016) Hengyuan Hu, Rui Peng, Yu-Wing Tai, and Chi-Keung Tang. Network trimming: A data-driven neuron pruning approach towards efficient deep architectures. arXiv preprint arXiv:1607.03250, 2016.
 Karpathy et al. (2014) Andrej Karpathy, George Toderici, Sanketh Shetty, Thomas Leung, Rahul Sukthankar, and Li Fei-Fei. Large-scale video classification with convolutional neural networks. In CVPR, 2014.
 Kim et al. (2015) Yong-Deok Kim, Eunhyeok Park, Sungjoo Yoo, Taelim Choi, Lu Yang, and Dongjun Shin. Compression of deep convolutional neural networks for fast and low power mobile applications. In Proceedings of the International Conference on Learning Representations (ICLR), 2015.
 Krizhevsky et al. (2012) Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems, pp. 1097–1105, 2012.
 Lavin (2015a) Andrew Lavin. maxDNN: An Efficient Convolution Kernel for Deep Learning with Maxwell GPUs. CoRR, abs/1501.06633, 2015a. URL http://arxiv.org/abs/1501.06633.
 Lavin (2015b) Andrew Lavin. Fast algorithms for convolutional neural networks. arXiv preprint arXiv:1509.09308, 2015b.
 Lebedev & Lempitsky (2016) Vadim Lebedev and Victor Lempitsky. Fast convnets using groupwise brain damage. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2554–2564, 2016.
 LeCun et al. (1990) Yann LeCun, J. S. Denker, S. Solla, R. E. Howard, and L. D. Jackel. Optimal brain damage. In Advances in Neural Information Processing Systems (NIPS), 1990.
 LeCun et al. (1998) Yann LeCun, Leon Bottou, Genevieve B. Orr, and Klaus-Robert Müller. Efficient BackProp, pp. 9–50. Springer Berlin Heidelberg, Berlin, Heidelberg, 1998.
 Martens (2010) James Martens. Deep learning via Hessian-free optimization. In Proceedings of the 27th International Conference on Machine Learning (ICML-10), pp. 735–742, 2010.
 Martens et al. (2012) James Martens, Ilya Sutskever, and Kevin Swersky. Estimating the Hessian by backpropagating curvature. arXiv preprint arXiv:1206.6464, 2012.
 Molchanov et al. (2016) Pavlo Molchanov, Xiaodong Yang, Shalini Gupta, Kihwan Kim, Stephen Tyree, and Jan Kautz. Online detection and classification of dynamic hand gestures with recurrent 3d convolutional neural network. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2016.
 Nilsback & Zisserman (2008) M.-E. Nilsback and A. Zisserman. Automated flower classification over a large number of classes. In Proceedings of the Indian Conference on Computer Vision, Graphics and Image Processing, Dec 2008.
 Pearlmutter (1994) Barak A. Pearlmutter. Fast Exact Multiplication by the Hessian. Neural Computation, 6:147–160, 1994.
 Rastegari et al. (2016) Mohammad Rastegari, Vicente Ordonez, Joseph Redmon, and Ali Farhadi. XNOR-Net: ImageNet classification using binary convolutional neural networks. CoRR, abs/1603.05279, 2016. URL http://arxiv.org/abs/1603.05279.
 Reed (1993) Russell Reed. Pruning algorithms - a survey. IEEE Transactions on Neural Networks, 4(5):740–747, 1993.
 Russakovsky et al. (2015) Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, Alexander C. Berg, and Li FeiFei. ImageNet Large Scale Visual Recognition Challenge. International Journal of Computer Vision (IJCV), 115(3):211–252, 2015.
 Simonyan & Zisserman (2014) K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. CoRR, abs/1409.1556, 2014.
 Srinivas & Babu (2015) Suraj Srinivas and R. Venkatesh Babu. Data-free parameter pruning for deep neural networks. In Xianghua Xie, Mark W. Jones, and Gary K. L. Tam (eds.), Proceedings of the British Machine Vision Conference (BMVC), pp. 31.1–31.12. BMVA Press, September 2015.
 Theano Development Team (2016) Theano Development Team. Theano: A Python framework for fast computation of mathematical expressions. arXiv e-prints, abs/1605.02688, May 2016. URL http://arxiv.org/abs/1605.02688.
 Wah et al. (2011) Catherine Wah, Steve Branson, Peter Welinder, Pietro Perona, and Serge Belongie. The Caltech-UCSD Birds-200-2011 dataset. Technical report, California Institute of Technology, 2011.
 Wen et al. (2016) Wei Wen, Chunpeng Wu, Yandan Wang, Yiran Chen, and Hai Li. Learning structured sparsity in deep neural networks. In Advances in Neural Information Processing Systems, pp. 2074–2082, 2016.
 Zhou et al. (2016) Hao Zhou, Jose M. Alvarez, and Fatih Porikli. Less is more: Towards compact CNNs. In European Conference on Computer Vision, pp. 662–677, Amsterdam, the Netherlands, October 2016.
Appendix A
A.1 FLOPs computation
To compute the number of floating-point operations (FLOPs), we assume convolution is implemented as a sliding window and that the nonlinearity is computed for free. For convolutional kernels we have:

$$\text{FLOPs} = 2HW(C_{in}K^2 + 1)C_{out}, \qquad (11)$$

where $H$, $W$ and $C_{in}$ are the height, width and number of channels of the input feature map, $K$ is the kernel width (assumed to be symmetric), and $C_{out}$ is the number of output channels.
For fully connected layers we compute FLOPs as:

$$\text{FLOPs} = (2I - 1)O, \qquad (12)$$

where $I$ is the input dimensionality and $O$ is the output dimensionality.
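As a concrete illustration, the two counts can be written as small helpers (a sketch; the function names are ours, and we assume the standard sliding-window counts with a bias term per output channel):

```python
def conv_flops(h, w, c_in, k, c_out):
    """FLOPs of a convolutional layer: 2 * H * W * (C_in * K^2 + 1) * C_out,
    where the +1 accounts for the bias term."""
    return 2 * h * w * (c_in * k * k + 1) * c_out

def fc_flops(i, o):
    """FLOPs of a fully connected layer: (2 * I - 1) * O
    (I multiplies and I - 1 adds per output)."""
    return (2 * i - 1) * o

# Example: the first convolutional layer of VGG-16 on a 224x224 RGB input.
first_layer = conv_flops(224, 224, 3, 3, 64)
```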
We apply FLOPs regularization during pruning to prune neurons with higher FLOPs first; the FLOPs per convolutional neuron in every layer follow from the computation above.
A.2 Normalization across layers
Scaling a criterion across layers is very important for pruning. If the criterion is not properly scaled, a hand-tuned multiplier must be selected for each layer. Statistics of the feature map ranking by different criteria are shown in Fig. 10. Without normalization (Fig. 13(a)–13(d)), the weight magnitude criterion tends to rank feature maps from the first layers as more important than those from the last layers; the activation criterion ranks middle layers as more important; and Taylor ranks the first layers higher. After normalization (Fig. 9(d)–9(f)), all criteria have a shape more similar to the oracle, in which each layer has some feature maps that are highly important and others that are unimportant.
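As an illustration, one such rescaling (per-layer ℓ2) can be sketched in a few lines of numpy (the function name is ours):

```python
import numpy as np

def l2_normalize_per_layer(criteria_per_layer):
    """Rescale raw saliencies so each layer's criterion vector has unit
    l2 norm, making scores comparable across layers."""
    return [c / np.sqrt(np.sum(c ** 2)) for c in criteria_per_layer]

# Two toy layers whose raw criteria differ by an order of magnitude.
layers = [np.array([3.0, 4.0]), np.array([0.3, 0.4])]
normed = l2_normalize_per_layer(layers)
# After rescaling, both layers yield the same relative scores,
# so their feature maps can be ranked against each other directly.
```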
A.3 Oracle computation for VGG-16 on Birds-200
We compute the change in the loss caused by removing individual feature maps from the VGG-16 network, after fine-tuning on the Birds-200 dataset. Results are illustrated in Fig. 10(a) and 10(b) for each feature map of the two layers considered, respectively. To compute the oracle estimate for a feature map, we remove that feature map and compute the network prediction for each image in the training set, using the central crop with no data augmentation or dropout. We draw the following conclusions:

- The contribution of feature maps ranges from positive (above the red line) to slightly negative (below the red line), implying the existence of some feature maps which decrease the training cost when removed.
- There are many feature maps with little contribution to the network output, indicated by an almost zero change in loss when removed.
- Both layers contain a small number of feature maps which induce a significant increase in the loss when removed.
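The per-map loss changes behind these observations amount to simple bookkeeping, sketched here with made-up loss values (the helper is ours, not the paper's code):

```python
# Toy sketch: given the training loss of the full network and the loss
# measured after removing each feature map in turn, the oracle score of
# a map is the induced change in loss (positive = removal hurts).
def oracle_ranking(base_loss, losses_without_map):
    return {m: loss - base_loss for m, loss in losses_without_map.items()}

# Hypothetical values: map 0 is crucial, removing map 1 slightly lowers
# the loss, and map 2 barely matters.
deltas = oracle_ranking(1.0, {0: 3.0, 1: 0.9, 2: 1.0001})
```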
Table 3: Spearman's rank correlation with the oracle-abs ranking for the MI, weight magnitude, activation (mean, s.d., APoZ), OBD, and Taylor criteria, reported per layer (layers 1–13 and their mean) and, across all layers, under four schemes: no normalization, ℓ1 normalization, ℓ2 normalization, and min-max normalization.
Table 3 contains a layer-by-layer listing of Spearman's rank correlation of several criteria with the ranking of oracle-abs. In this more detailed comparison, we see that the Taylor criterion shows higher correlation for all individual layers. For several methods, including Taylor, the worst correlations are observed for the middle layers of the network. We also evaluate several techniques for normalizing the raw criteria values for comparison across layers. The table shows that the best performance is obtained with ℓ2 normalization, hence we select it for our method.
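For reference, Spearman's rank correlation used throughout this comparison can be computed directly from ranks (a sketch assuming no tied values; the helper name is ours):

```python
import numpy as np

def spearman_rho(a, b):
    """Spearman's rank correlation: 1 - 6 * sum(d^2) / (n * (n^2 - 1)),
    where d is the difference between the ranks of a and b (no ties)."""
    ranks_a = np.argsort(np.argsort(a))
    ranks_b = np.argsort(np.argsort(b))
    d = ranks_a - ranks_b
    n = len(a)
    return 1.0 - 6.0 * np.sum(d ** 2) / (n * (n ** 2 - 1))

oracle = np.array([0.9, 0.1, 0.5, 0.3])
criterion = np.array([10.0, 1.0, 7.0, 2.0])  # same ordering as the oracle
rho = spearman_rho(oracle, criterion)        # -> 1.0
```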
A.4 Comparison with weight regularization
Han et al. (2015) find that fine-tuning with high ℓ1 or ℓ2 regularization causes unimportant connections to be suppressed. Connections with energy lower than some threshold can then be removed, on the assumption that they do not contribute much to subsequent layers. The same work also finds that thresholds must be set separately for each layer, depending on its sensitivity to pruning. The procedure to evaluate sensitivity is time-consuming, as it requires pruning each layer independently during evaluation.
The idea of pruning with high regularization can be extended to removing all kernels of an entire feature map if the norm of those kernels falls below a predefined threshold. We compare our approach with this regularization-based pruning on the task of pruning the last convolutional layer of VGG-16 fine-tuned for Birds-200. By considering only a single layer, we avoid the need to compute layer-wise sensitivity. Parameters for optimization during fine-tuning are the same as in other experiments with the Birds-200 dataset. For the regularization technique, the pruning threshold is kept fixed while we vary the regularization coefficient of the norm on each feature map's kernels (in our implementation, the regularization coefficient is multiplied by the learning rate). We prune only kernel weights, while keeping the bias to maintain the same expected output.
A comparison between pruning based on regularization and our greedy scheme is illustrated in Fig. 12. Our approach achieves higher test accuracy for the same number of remaining unpruned feature maps once a large fraction of the feature maps has been pruned. We also observe that with high regularization all weights tend toward zero, not only the unimportant ones, in contrast to what Han et al. (2015) observe for ImageNet networks. The intuition here is that regularization pushes all weights down and can therefore affect connections that are important for transfer learning, whereas our iterative procedure removes only unimportant parameters and leaves the others untouched.
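The group-wise thresholding baseline described above can be sketched as follows (shapes, names, and the threshold value are ours, not the paper's):

```python
import numpy as np

# After fine-tuning with a regularization penalty on each feature map's
# kernels, the baseline drops every map whose kernel norm falls below a
# fixed threshold.
def maps_to_prune(kernels, threshold):
    """kernels: array of shape (n_maps, ...) -- one kernel group per map.
    Returns the indices of maps whose group l2 norm is below threshold."""
    norms = np.sqrt((kernels ** 2).sum(axis=tuple(range(1, kernels.ndim))))
    return np.flatnonzero(norms < threshold)

k = np.zeros((3, 2, 3, 3))
k[0] += 0.5    # strong map: kept
k[2] += 1e-4   # suppressed by regularization: pruned (as is all-zero map 1)
pruned = maps_to_prune(k, threshold=1e-2)
```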
A.5 Combination of criteria
One possibility for improving saliency estimation is to combine several criteria. A straightforward combination is Taylor and the mean activation of the neuron: we compute the joint criterion as $\hat{\Theta} = \lambda\,\Theta_{Taylor} + (1-\lambda)\,\Theta_{Activation}$ and perform a grid search over the parameter $\lambda$ in Fig. 13. The highest correlation value for each dataset is marked with a vertical bar, together with the corresponding $\lambda$ and gain. We observe that the gain from linearly combining the criteria is negligibly small (see the $\Delta$'s in the figure).
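The grid search can be sketched on toy data (helper names and values are ours; each candidate λ is scored by rank correlation with a toy oracle):

```python
import numpy as np

def spearman_rho(a, b):
    """Spearman's rank correlation (no ties assumed)."""
    ra = np.argsort(np.argsort(a))
    rb = np.argsort(np.argsort(b))
    n = len(a)
    d = ra - rb
    return 1.0 - 6.0 * np.sum(d ** 2) / (n * (n ** 2 - 1))

def best_lambda(taylor, activation, oracle, grid=np.linspace(0.0, 1.0, 21)):
    """Search lam in [0, 1] for the combined criterion
    lam * taylor + (1 - lam) * activation, scored against the oracle."""
    scores = [spearman_rho(lam * taylor + (1 - lam) * activation, oracle)
              for lam in grid]
    i = int(np.argmax(scores))
    return grid[i], scores[i]

# Toy data: Taylor agrees with the oracle, activation is anti-correlated.
taylor = np.array([1.0, 2.0, 3.0, 4.0])
activation = np.array([40.0, 30.0, 20.0, 10.0])
lam, score = best_lambda(taylor, activation, oracle=taylor)
```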
A.6 Optimal Brain Damage implementation
OBD computes the saliency of a parameter as the product of the squared magnitude of the parameter and the corresponding element on the diagonal of the Hessian. For many deep learning frameworks, an efficient implementation of the diagonal evaluation is not straightforward and approximation techniques must be applied. Our implementation of the Hessian diagonal computation was inspired by the work of Dauphin et al. (2015), where the technique proposed by Bekas et al. (2007) was used to evaluate SGD preconditioned with the Jacobi preconditioner. They show that the diagonal of the Hessian can be approximated as:

$$\mathrm{diag}(\mathbf{H}) \approx \mathbb{E}\big[\mathbf{v} \odot \nabla(\nabla C \cdot \mathbf{v})\big], \qquad (13)$$

where $\odot$ is the element-wise product, $\mathbf{v}$ are random vectors with entries $\pm 1$, and $\nabla$ is the gradient operator. To compute saliency with OBD, we randomly draw $\mathbf{v}$ and compute the diagonal estimate over 10 iterations per mini-batch, for 1000 mini-batches; we found this number of mini-batches necessary to obtain a close approximation of the Hessian's diagonal. Computing saliency this way is too expensive for iterative pruning, so we use a slightly different but more efficient procedure. Before the first pruning iteration, the saliency is initialized from values computed offline with 1000 mini-batches and 10 iterations each, as described above. Then, at every mini-batch we compute the OBD criterion with only one iteration and apply an exponential moving average. We verified that this still yields a close approximation of the Hessian's diagonal.
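On a small explicit Hessian, the estimator can be sketched directly (names and values are ours; in a network, the product Hv would instead come from backpropagating the scalar ∇C·v, as in Pearlmutter (1994)):

```python
import numpy as np

rng = np.random.default_rng(0)

def hutchinson_diag(H, n_samples=2000):
    """Estimate diag(H) as the average of v * (H v) over random
    Rademacher vectors v with entries +/-1 (Bekas et al., 2007)."""
    n = H.shape[0]
    est = np.zeros(n)
    for _ in range(n_samples):
        v = rng.choice([-1.0, 1.0], size=n)
        est += v * (H @ v)
    return est / n_samples

# Toy symmetric "Hessian" with small off-diagonal coupling.
H = np.array([[2.0, 0.1],
              [0.1, -1.0]])
est = hutchinson_diag(H)  # entries close to [2.0, -1.0]
```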
A.7 Correlation of Taylor criterion with gradient and activation
The Taylor criterion is composed of both an activation term and a gradient term. In Figure 14, we depict the correlation between the Taylor criterion and each constituent part. We consider the expected absolute value of the gradient instead of its mean, because the latter otherwise tends to zero. The plots are computed from pruning criteria for an unpruned VGG network fine-tuned on the Birds-200 dataset. (Values are shown after layer-wise normalization.) Figures 14(a) and 14(b) depict the Taylor criterion on the y-axis for all neurons, w.r.t. the gradient and activation components, respectively. The bottom-ranked neurons (lowest Taylor criterion, most likely to be pruned) are depicted in red, while the top-ranked neurons are shown in green. Considering all neurons, both the gradient and activation components exhibit a linear trend with the Taylor criterion. However, for the bottom-ranked neurons, as shown in Figures 14(c) and 14(d), the activation criterion shows a much stronger correlation, with lower activations indicating lower Taylor scores.