Recovering from Random Pruning: On the Plasticity of Deep Convolutional Neural Networks
Abstract
Recently there has been a lot of work on pruning filters from deep convolutional neural networks (CNNs) with the intention of reducing computations. The key idea is to rank the filters based on a certain criterion (say, norm, average percentage of zeros, etc.) and retain only the top ranked filters. Once the low scoring filters are pruned away the remainder of the network is fine tuned and is shown to give performance comparable to the original unpruned network. In this work, we report experiments which suggest that the comparable performance of the pruned network is not due to the specific criterion chosen but due to the inherent plasticity of deep neural networks which allows them to recover from the loss of pruned filters once the rest of the filters are finetuned. Specifically, we show counterintuitive results wherein by randomly pruning 25-50% of the filters from deep CNNs we are able to obtain the same performance as obtained by using state of the art pruning methods. We empirically validate our claims by doing an exhaustive evaluation with VGG16 and ResNet50. Further, we also evaluate a real world scenario where a CNN trained on all 1000 ImageNet classes needs to be tested on only a small set of classes at test time (say, only animals). We create a new benchmark dataset from ImageNet to evaluate such class specific pruning and show that even here a random pruning strategy gives close to state of the art performance. Lastly, unlike existing approaches which mainly focus on the task of image classification, in this work we also report results on object detection. We show that using a simple random pruning strategy we can achieve significant speed up in object detection (74% improvement in fps) while retaining the same accuracy as that of the original Faster RCNN model.
1 Introduction
Over the past few years, deep convolutional neural networks (CNNs) have been very successful in a wide range of computer vision tasks such as image classification [37, 4, 24], object detection [10, 9, 33, 32, 27] and image segmentation [3, 28]. In general, with each passing year, these networks are becoming deeper and deeper with a corresponding increase in the performance [13, 18, 36]. However, this increase in performance is accompanied by an increase in the number of parameters and computations. This makes it difficult to port these models to embedded and mobile devices where storage, computation and power are limited. In such cases, it is crucial to have small, computationally efficient models which can achieve performance on par with or close to large networks. This practical requirement has led to an increasing interest in model compression where the aim is to either (i) design efficient small networks [20, 16] or (ii) efficiently prune weights from existing deep networks [12, 39, 11] or (iii) efficiently prune filters from deep convolutional networks [26, 23, 30, 17] or (iv) replace expensive floating point weights by binary or quantized weights [5, 31, 11, 41] or (v) guide the training of a smaller network using a larger (teacher) network [2, 15].
In this work, we focus on pruning filters from deep convolutional neural networks. The filters in the convolution layers typically account for fewer parameters than the fully connected layers (the ratio is 10:90 for VGG16 [26]), but they account for most of the floating point operations done by the model (99% for VGG16 [26]). Hence reducing the number of filters effectively reduces the computation (and thus power) requirements of the model. All existing works on filter pruning [26, 23, 30, 17] follow a very similar recipe. The filters are first ranked based on a specific criterion such as the $\ell_1$ norm [26] or the percentage of zeros in the filter [17]. The scoring criterion essentially determines the importance of the filter for the end task, typically image classification [24]. Only the top-$m$ ranked filters are retained and the resulting pruned network is then fine tuned. It is observed that when pruning up to 50% of the filters using different proposed criteria, the pruned network almost recovers the original performance after finetuning. The claim is that this recovery is due to the soundness of the criterion chosen for pruning. However, in this work we argue that this recovery is not due to the specific pruning criterion but due to the inherent plasticity of deep CNNs. Specifically, we show that even if we prune filters randomly we can match the performance of state-of-the-art pruning methods.
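The parameter and computation shares quoted above can be checked with a short back-of-the-envelope count. The sketch below is only illustrative: it assumes the standard VGG16 layout (3x3 kernels, a 224x224 RGB input, three FC layers), ignores biases, and counts multiply-accumulate operations as a proxy for floating point operations. The names `CONV_LAYERS`, `FC_LAYERS` and `vgg16_cost` are ours and do not come from the cited works.

```python
# (in_channels, out_channels, output spatial size) for each 3x3 CONV layer of
# standard VGG16 with a 224x224 RGB input. Biases are ignored for simplicity.
CONV_LAYERS = [
    (3, 64, 224), (64, 64, 224),
    (64, 128, 112), (128, 128, 112),
    (128, 256, 56), (256, 256, 56), (256, 256, 56),
    (256, 512, 28), (512, 512, 28), (512, 512, 28),
    (512, 512, 14), (512, 512, 14), (512, 512, 14),
]
# (inputs, outputs) for each fully connected layer.
FC_LAYERS = [(512 * 7 * 7, 4096), (4096, 4096), (4096, 1000)]


def vgg16_cost():
    """Return (conv_params, fc_params, conv_macs, fc_macs)."""
    conv_params = sum(3 * 3 * c_in * c_out for c_in, c_out, _ in CONV_LAYERS)
    # Each filter weight is applied once per output position.
    conv_macs = sum(3 * 3 * c_in * c_out * s * s for c_in, c_out, s in CONV_LAYERS)
    fc_params = sum(n_in * n_out for n_in, n_out in FC_LAYERS)
    fc_macs = fc_params  # every FC weight is used exactly once per image
    return conv_params, fc_params, conv_macs, fc_macs


conv_params, fc_params, conv_macs, fc_macs = vgg16_cost()
param_share = conv_params / (conv_params + fc_params)
mac_share = conv_macs / (conv_macs + fc_macs)
print(f"CONV share of parameters: {param_share:.1%}")
print(f"CONV share of multiply-accumulates: {mac_share:.1%}")
```

Under these assumptions the CONV layers hold roughly a tenth of the weights but perform about 99% of the multiply-accumulates, which is why filter pruning targets computation rather than storage.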
To effectively prove our point, it is crucial that we look at factors/measures other than the final performance of the pruned model. To do so we draw an analogy with the human brain and observe that the process of pruning filters from a deep CNN is akin to causing damage to certain portions of the brain. It is known that the human brain has a high plasticity and over time can recover from such damage with appropriate treatment [19]. In our case, the process of finetuning would be akin to such post-damage (post-pruning) treatment. If the injury damages only redundant or unimportant portions of the brain then the recovery should be complete, quick and require minimal treatment. Similarly, we could argue that if the pruning criterion is indeed good and prunes away only unimportant filters then (i) the performance of the model should not drop much (ii) the model should be able to regain its full performance after finetuning (iii) this recovery should be fast (i.e., with fewer iterations of fine tuning) and (iv) the quantum of data used for finetuning should be small. None of the existing works on filter pruning do a thorough comparison w.r.t. these factors. We not only consider these factors but also present counterintuitive results which show that a random pruning criterion is comparable to state of the art pruning methods on all these factors. Note that we are not claiming that we can always recover the full performance of the unpruned network. For example, it should be obvious that in the degenerate case where 90% of the filters are pruned it would be almost impossible to recover. The claim being made is that, at different pruning levels (25%, 50%, 75%), a random pruning strategy is not much worse than state of the art pruning strategies.
To further prove our point, we wanted to check if such recovery from pruning is task agnostic. In other words, in addition to showing that a network trained for image classification (task 1) can be pruned efficiently, we also show that the same can be done with a network trained for object detection (task 2). Here again, we show that a random pruning strategy works on par with state of the art pruning methods. Stretching this idea further and continuing the above analogy, we note that once the brain recovers from such damage, it is desirable that in addition to recovering its performance on the tasks that it was good at before the injury, it should also be able to do well on newer tasks. In our case, the corresponding situation would be to take a network pruned and finetuned for image classification (old task) and plug it into a model for object detection (new task). Specifically, we show that when we plug a randomly pruned and fine tuned VGG16 network into a Faster RCNN model we can get the same performance on object detection as obtained by plugging in (i) the original unpruned network or (ii) a network pruned using a state of the art pruning method. This once again hints at the inherent plasticity of deep CNNs which allows them to recover (up to a certain level) irrespective of the pruning strategy.
Finally, we consider the case of class specific pruning which has not been studied in the literature. We note that in many real world scenarios, it is possible that while we have trained an image classification network on a large dataset containing many classes, at test time we may be interested in only a few classes. A case in point is the task of object detection using the Pascal VOC dataset [8]. RCNN and its variants [10, 9, 33] use as a subcomponent an image classification model trained on all the 1000 ImageNet classes. We hypothesize that this is overkill and instead create a class specific benchmark dataset from ImageNet which contains only those 52 classes which correspond to the 20 classes in Pascal VOC. Ideally, one would expect that a network trained, pruned and finetuned only for these 52 classes when plugged into Faster RCNN should do better than a network trained, pruned and finetuned on a random set of 52 classes (which are very different from the classes in Pascal VOC). However, we observe that irrespective of which of these networks is plugged into Faster RCNN the final performance after finetuning is the same, once again showing the ability to recover from unfavorable situations.
To the best of our knowledge, this is a first of its kind work on pruning filters which:

Proposes that while assessing the performance of a pruning method, we should consider factors such as amount of damage (drop in performance before finetuning), amount of recovery (performance after finetuning), speed of recovery and quantum of data required for recovery.

Performs extensive evaluation using two image classification networks (VGG16 and ResNet) and shows that a random pruning strategy gives comparable performance to that of state of the art pruning strategies w.r.t. all the above factors.

Shows that such behavior is task agnostic and a random pruning strategy works well even for the task of object detection. Specifically, we show that by randomly pruning filters from an object detection model we can get a 74% improvement in fps while maintaining almost the same accuracy (1% drop) as the original unpruned network.

Shows that pruned networks can adapt with ease to newer tasks

Proposes a new benchmark for evaluating class specific pruning
2 Related Work
In this section, we review existing work on making deep convolutional neural networks efficient w.r.t. their memory and computation requirements while not compromising much on the accuracy. These approaches can be broadly classified into the following categories (i) pruning unimportant weights (ii) low rank factorization (iii) knowledge distillation (iv) designing compact networks from scratch or (v) using binary or quantized weights and (vi) pruning unimportant filters. Below, we first quickly review the related work for the first five categories listed above and then discuss approaches on pruning filters which is the main focus of our work.
Optimal Brain Damage [25] and Optimal Brain Surgeon [7] are two examples of approaches which prune the unimportant weights in the network. A weight is considered unimportant if the output is not very sensitive to this weight. They show that pruning such weights leads to a minimal drop in the overall performance of the network. However, these methods are computationally expensive as they require the computation of the Hessian (second order derivative). Another approach is to use low rank factorization of the weight tensor/matrices to reduce the computations [21, 38, 22, 6, 40]. For example, instead of directly multiplying a high dimensional weight tensor $W$ with the input tensor $X$, we could first compute a low rank approximation $W \approx U \Sigma V^{T}$, where the dimensions of $U$, $\Sigma$ and $V$ are much smaller than the dimensions of $W$. This essentially boils down to decomposing the larger matrix multiplication operation into smaller operations. Also, the low rank approximation ensures that only the important information in the weight matrix is retained. Alternately, researchers have also explored designing compact networks from scratch which have fewer layers and/or parameters and/or computations [20]. There are also some approaches which quantize [11] or binarize [31, 5] the weights of a network to reduce both memory footprint and computation time. Another line of work focuses on transferring the knowledge from a bigger trained network (or an ensemble of networks) to a smaller (thin) network [2, 15].
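To make the low rank idea concrete, the following minimal sketch factorizes a weight matrix with a truncated SVD and applies the two small factors in sequence instead of the full matrix. It is an illustration under our own assumptions (a 2-D weight matrix, a toy matrix that is exactly rank 8, and the hypothetical helper name `low_rank_factors`), not the specific method of any of the cited works.

```python
import numpy as np


def low_rank_factors(W, rank):
    """Return A (out x rank) and B (rank x in) such that A @ B approximates W."""
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    A = U[:, :rank] * s[:rank]  # fold the singular values into the left factor
    B = Vt[:rank, :]
    return A, B


rng = np.random.default_rng(0)
# Toy weight matrix that is exactly rank 8, so the factorization is exact
# up to floating point rounding.
W = rng.standard_normal((256, 8)) @ rng.standard_normal((8, 512))
x = rng.standard_normal(512)

A, B = low_rank_factors(W, rank=8)
y_full = W @ x        # one large multiplication
y_low = A @ (B @ x)   # two small multiplications

full_cost = W.size            # multiply-accumulates for W @ x
low_cost = A.size + B.size    # multiply-accumulates for A @ (B @ x)
print(full_cost, low_cost)
```

For real networks the weight matrices are only approximately low rank, so the chosen rank trades accuracy against the saving in multiplications.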
The main focus of our work is on pruning filters from deep CNNs with the intention of reducing computations. As mentioned earlier, while the convolution filters do not account for a large number of parameters, they account for almost all the computations that happen in the network. Here, the idea is to rank the filters using a scoring function and then retain only the top scoring filters. For example, in [26], the authors have used the $\ell_1$ norm of the filters to rank their importance. The argument is that filters having a lower $\ell_1$ norm will produce smaller activation values which will contribute minimally to the output of that layer. Alternately, in [29] the authors have proposed entropy as a measure to calculate the importance of a filter. If a filter has high entropy then the filter is more informative and hence more important. On the other hand, [17] calculate the average percentage of zeros in the corresponding activation maps of filters and hypothesize that filters having a higher average percentage of zeros in their activations are less important. In [30] the authors have used a Taylor series expansion that approximates the change in the cost function caused by pruning filters. Unlike [25], this method uses information from the first derivative only. Another work on pruning filters [23] proposes that instead of pruning filters based on the current layer's statistics they should be pruned based on the next layer's statistics. Essentially the idea of [23] is to look at the activation map of layer $i+1$ and prune out the channel whose removal gives the minimum change in output, along with its corresponding filter in layer $i$. In [14] the authors proposed a similar idea to [23] but instead of removing the filters one by one they proposed to use LASSO regression. Lastly, in [1] the authors have used particle filtering to prune out the filters.
3 Methodology
In this section, we first formally define the problem of pruning filters and give a generic algorithm for pruning filters using any appropriate scoring function. We then discuss existing scoring functions along with some new variants that we propose.
3.1 Problem Statement
Suppose there are $L$ convolutional layers in a CNN and suppose the layer $l$ contains $n_l$ filters. We use $F_{l,i}$ to denote the $i$-th filter in the $l$-th layer. Each such filter is a three dimensional tensor, $F_{l,i} \in \mathbb{R}^{c_l \times w_l \times h_l}$, where $c_l$ is the number of input channels for layer $l$ and $w_l, h_l$ are the width and height of the $i$-th filter in the $l$-th layer. Our goal is to rank all the filters in layer $l$, and then retain the top $m$ filters where $m$ is a hyperparameter which indicates the desired pruning. For example, based on available computation resources, if we want to reduce the number of computations in this layer by half then we can set $m = n_l / 2$. Let the original output of layer $l$ be denoted by $O_l \in \mathbb{R}^{W_l \times H_l \times n_l}$ where $W_l, H_l$ are the width and height and $n_l$ is the number of channels which is the same as the number of filters. After pruning and retaining only the top $m$ filters the size of the output will be reduced to $W_l \times H_l \times m$. Thus, pruning filters not only reduces the number of computations in this layer but also reduces the size of the input to the next layer (which is the same as the output of this layer). The same process of pruning can then be repeated across all layers of the CNN. The main task here is to find the right scoring function for ranking the filters.
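The effect of pruning on tensor shapes can be sketched in a few lines of NumPy. This is a minimal illustration, assuming weights are stored as (filters, input channels, height, width), which is our choice of layout, and using a simple $\ell_1$ score as a stand-in for an arbitrary scoring function. The helper name `prune_layer` is hypothetical.

```python
import numpy as np


def prune_layer(weights, next_weights, keep_idx):
    """Keep only the filters in keep_idx and shrink the next layer's input to match.

    weights:      (n_l, c_l, h, w)     filters of the layer being pruned
    next_weights: (n_next, n_l, h, w)  filters of the following layer
    """
    keep_idx = np.sort(np.asarray(keep_idx))
    pruned = weights[keep_idx]               # drop whole filters (output channels)
    pruned_next = next_weights[:, keep_idx]  # drop the matching input channels
    return pruned, pruned_next


rng = np.random.default_rng(0)
W_l = rng.standard_normal((64, 32, 3, 3))     # toy layer with 64 filters
W_next = rng.standard_normal((128, 64, 3, 3))  # toy next layer

scores = np.abs(W_l).sum(axis=(1, 2, 3))  # stand-in scoring function
m = W_l.shape[0] // 2                     # retain half of the filters
keep = np.argsort(scores)[-m:]            # indices of the m highest scores

W_l_pruned, W_next_pruned = prune_layer(W_l, W_next, keep)
print(W_l_pruned.shape, W_next_pruned.shape)
```

Removing a filter from one layer removes one output channel, so the next layer loses the corresponding input channel as well; this is the source of the additional saving mentioned above.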
3.2 A Generic Algorithm for Pruning
Algorithm 1 summarizes the generic recipe used by different approaches for pruning filters. As shown in the algorithm, pruning typically starts from the outermost layer. Once the low scoring filters from this layer are pruned, the network is finetuned and the same process is repeated for the layers before it. Once all the layers are pruned and finetuned, the entire network is tuned for a few more epochs.
Existing methods for pruning filters differ in the scoring function that they use for ranking the filters. We alternately refer to this scoring function as the pruning criterion, as discussed in the next subsection.
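The recipe described in this subsection can be sketched as follows. This is our own minimal rendering under stated assumptions, not a reproduction of Algorithm 1: the network is reduced to a list of convolution weight arrays with layout (filters, input channels, height, width), `finetune` is a caller-supplied callback, and the names `prune_network`, `keep_fraction` and `no_finetune` are hypothetical.

```python
import numpy as np


def prune_network(layers, scoring_function, keep_fraction, finetune, final_epochs=1):
    """Prune every layer from last to first, finetuning after each layer."""
    layers = [w.copy() for w in layers]
    for l in reversed(range(len(layers))):  # start from the last CONV layer
        scores = scoring_function(layers[l])
        n_keep = max(1, int(round(keep_fraction * len(scores))))
        keep = np.sort(np.argsort(scores)[-n_keep:])  # top scoring filters
        layers[l] = layers[l][keep]
        if l + 1 < len(layers):
            # The next layer loses the input channels of the removed filters.
            layers[l + 1] = layers[l + 1][:, keep]
        layers = finetune(layers, epochs=1)  # recover after pruning this layer
    return finetune(layers, epochs=final_epochs)  # final tuning of the whole network


def no_finetune(layers, epochs):
    """Placeholder: a real implementation would train the pruned network here."""
    return layers


rng = np.random.default_rng(0)
toy_layers = [rng.standard_normal(s) for s in [(8, 3, 3, 3), (16, 8, 3, 3), (32, 16, 3, 3)]]
random_scores = lambda w: rng.random(w.shape[0])  # random pruning criterion

pruned = prune_network(toy_layers, random_scores, keep_fraction=0.5, finetune=no_finetune)
print([w.shape for w in pruned])
```

Swapping `random_scores` for any other function that maps a weight array to one score per filter changes the pruning criterion without touching the loop, which mirrors the observation that existing methods differ only in how they score filters.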
3.3 Pruning Criteria
We now describe various pruning criteria which are used by existing approaches and also introduce some new variants of existing pruning criteria. These criteria are essentially used as the scoring function in Algorithm 1.

Mean Activation [30] : Most deep CNNs for image classification use ReLU as the activation function which results in very sparse activations (as all negative outputs are set to 0). We could compute the mean activation of the feature map corresponding to a filter across all images in the training data. If this mean activation is very low (because most of the activations are 0) then this feature map and hence the corresponding filter is not going to contribute much to the discriminatory power of the network (since the filter rarely fires for any input). Hence, [30] uses the mean activation as a scoring function for ranking filters.

$\ell_1$ Norm [26] : The authors of [26] suggest that the $\ell_1$ norm ($\|F\|_1$) of a filter $F$ can also be used as an indicator of the importance of the filter. The argument is that if the $\ell_1$ norm of a filter is small then on average the weights in the filter will be small and hence produce very small activations. These small activations will not influence the output of the network and hence the corresponding filters can be pruned away. One important benefit of this method is that apart from computing the $\ell_1$ norm, it does not need any extra computation during pruning and finetuning.

Entropy [29] : If the feature map corresponding to a filter produces the same output for every input (image) then this feature map and hence the corresponding filter may not be very important (because it does not play any discriminatory role). In other words, we are interested in feature maps (and hence filters) which are more informative or have a high entropy. If we divide the possible range of the average output of a feature map into $b$ bins then we could compute the entropy of the $i$-th feature map (or filter) [29] as:
$$E_i = -\sum_{j=1}^{b} p_{ij} \log p_{ij}$$
where $p_{ij}$ is the probability that the output of the $i$-th feature map lies in the $j$-th bin. This probability can be computed as the fraction of input images for which the average output of the feature map lies in this bin.

Average Percentage of Zeros (APoZ) [17] : As mentioned earlier, when ReLU is used as the activation function, the output activations are very sparse. If most of the neurons in a feature map are zero then this feature map is not likely to contribute much to the output of the network. The Average Percentage of Zeros in the output of each filter can thus be used to compute the importance of the filter (the lesser the better).

Sensitivity : We could compute the gradient of the loss function (i.e., cross entropy) w.r.t. a filter. If a filter has a high influence on the loss function then the value of this gradient would be high. The norm of this gradient averaged over all images can thus be used to compute the importance of a filter.

Scaled Entropy : We propose a new variant of the entropy based criterion. We observe that a filter may have a high entropy but if all its activations are very low (belonging to lower bins) then this filter is not likely to contribute much to the output. We thus propose to use a combination of entropy and mean activation by scaling the entropy by the mean activation of the filter. This scaled entropy of the $i$-th filter can be computed as:
$$SE_i = \mu_i \cdot E_i$$
where $E_i$ is the entropy of the $i$-th filter as defined above and $\mu_i$ is the average activation of the $i$-th filter over all input images.

Class Specific Importance : In this work, we are also interested in a more practical scenario, where a network trained for detecting all the 1000 classes from ImageNet is required to detect only $k$ ($k < 1000$) of these classes at test time (say, only animals). Intuitively, we should then devise a scoring function which retains only those filters which are important for these $k$ classes. To do so we once again compute the gradient of the loss function w.r.t. the filter. However, now instead of averaging the norm of this gradient over all images in the training data, we compute the average over only those images in the training data which correspond to the $k$ classes of interest. This class specific average is then used to rank the filters.

Random Pruning : One of the main contributions of this work is to show that even if we randomly prune the filters from a CNN, its performance after finetuning is not much worse than any of the above approaches.
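The criteria listed above can be written as small scoring functions that return one value per filter. The sketch below is illustrative and rests on several assumptions of ours: activations are post-ReLU arrays of shape (images, filters, height, width), weights have shape (filters, input channels, height, width), the histogram range for the entropy is taken from the observed values of each filter, and the per-image gradient norms (`grad_norms`, shape (images, filters)) are assumed to have been computed beforehand by whatever training framework is in use. All function names are hypothetical.

```python
import numpy as np


def mean_activation(acts):
    """Average activation of each feature map over all images and positions."""
    return acts.mean(axis=(0, 2, 3))


def l1_norm(weights):
    """Sum of absolute weights of each filter."""
    return np.abs(weights).sum(axis=(1, 2, 3))


def apoz(acts):
    """Fraction of zero activations per filter (a higher value means less important)."""
    return (acts == 0).mean(axis=(0, 2, 3))


def entropy(acts, bins=10):
    """Entropy of the per-image average output of each feature map."""
    avg = acts.mean(axis=(2, 3))  # shape (images, filters)
    scores = np.empty(avg.shape[1])
    for i in range(avg.shape[1]):
        counts, _ = np.histogram(avg[:, i], bins=bins)
        p = counts / counts.sum()
        p = p[p > 0]  # empty bins contribute nothing to the sum
        scores[i] = -(p * np.log(p)).sum()
    return scores


def scaled_entropy(acts, bins=10):
    """Entropy scaled by the mean activation of the filter."""
    return mean_activation(acts) * entropy(acts, bins)


def sensitivity(grad_norms):
    """Gradient norm of each filter averaged over all images."""
    return grad_norms.mean(axis=0)


def class_specific(grad_norms, labels, classes_of_interest):
    """Gradient norm averaged only over images of the classes of interest."""
    mask = np.isin(labels, list(classes_of_interest))
    return grad_norms[mask].mean(axis=0)


def random_score(num_filters, rng):
    """Random pruning: scores carry no information about the filters."""
    return rng.random(num_filters)


rng = np.random.default_rng(0)
acts = np.maximum(rng.standard_normal((16, 4, 5, 5)), 0.0)  # toy post-ReLU feature maps
weights = rng.standard_normal((4, 3, 3, 3))
ranking = np.argsort(-l1_norm(weights))  # most important filter first
print(ranking)
```

Note that APoZ is the only criterion here for which a lower value indicates a more important filter, so its sign has to be flipped before it is used to select the top scoring filters.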
4 Experiments: Image Classification
In this section, we focus on the task of image classification using the ImageNet [34] dataset. The dataset is split into three sets: training (1.3M images), validation (50K images), and testing (100K images with held-out class labels). We experiment with two popular networks, viz., VGG16 and ResNet50. We first train these networks using the full ImageNet training data and then prune them using Algorithm 1. We compare the performance of the different scoring functions listed in Section 3.3.
4.1 Comparison of different pruning methods on VGG16
VGG16 [35] has 13 convolutional (CONV) and three fully connected (FC) layers. The number of filters in each CONV layer in the standard VGG16 network [35] is {64, 64, 128, 128, 256, 256, 256, 512, 512, 512, 512, 512, 512}. We first train this network as it is (i.e., with the standard number of filters in each layer) using the ImageNet training data. When evaluated on the standard ImageNet test set, this trained model gives us a top-1 accuracy of 69%, which is comparable to the accuracy reported elsewhere in the literature. We now prune this network, one layer at a time, starting from the last convolution layer. We prune away $m$% of the filters from each layer, where we chose the value of $m$ to be {25, 50, 75} (here $m$ denotes the percentage of filters pruned). We use one of the scoring functions described in Section 3.3 to select the top $(100 - m)$% of the filters. We drop the remaining $m$% of the filters from this layer and then finetune the pruned network for 1 epoch. We then repeat the same process for the lower layers and use the same value of $m$ across all layers. Once the network is pruned till layer 1, we fine tune the entire pruned network for 12 epochs using 1/10th of the training data picked randomly. The only reason for not using the entire training data is that it is quite computationally expensive. We did not see any improvement in the performance on the validation set by finetuning beyond 12 epochs. We then evaluate this pruned and finetuned network on the test set. Below, we discuss the performance of the final pruned and finetuned network obtained using different pruning strategies.
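As a rough guide to what these pruning levels mean in practice, the sketch below computes the number of filters left in each CONV layer and the resulting share of convolution multiply-accumulates. It treats the percentage as the fraction of filters removed from every layer and assumes 3x3 kernels and a 224x224 RGB input; the helper names `remaining_filters` and `conv_macs` are ours.

```python
VGG16_FILTERS = [64, 64, 128, 128, 256, 256, 256, 512, 512, 512, 512, 512, 512]
# Output spatial size of each CONV layer for a 224x224 input (assumed).
SPATIAL = [224, 224, 112, 112, 56, 56, 56, 28, 28, 28, 14, 14, 14]


def remaining_filters(filters, prune_percent):
    """Number of filters kept in each layer after pruning prune_percent of them."""
    return [n * (100 - prune_percent) // 100 for n in filters]


def conv_macs(filters):
    """Multiply-accumulates of the 3x3 CONV layers for one image."""
    c_in, total = 3, 0  # the first layer sees the 3 RGB channels
    for n, s in zip(filters, SPATIAL):
        total += 3 * 3 * c_in * n * s * s
        c_in = n  # pruned output channels shrink the next layer's input
    return total


full = conv_macs(VGG16_FILTERS)
for percent in (25, 50, 75):
    kept = remaining_filters(VGG16_FILTERS, percent)
    ratio = conv_macs(kept) / full
    print(f"{percent}% pruned: first layers {kept[:3]}, CONV MACs at {ratio:.1%} of original")
```

Because both the input and the output channels of most layers shrink, the computation falls roughly quadratically with the fraction of filters kept; only the first layer, whose input is the image itself, scales linearly.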
Performance of pruned network after finetuning: In Table 1, we report the performance of the final pruned network after fine tuning. We observe that random pruning works better than most of the other pruning methods described earlier. The $\ell_1$ norm is the only scoring function which does better than random, and that too by a small margin. In fact, if we finetune the final pruned network using the entire training data then we observe that there is hardly any difference between random and $\ell_1$ norm (see Table 2). This provides empirical evidence for our claim that the amount of recovery (i.e., final performance after finetuning) is not due to the soundness of the pruning criterion. Even with random pruning, the performance of the pruned network is comparable. Of course, as the percentage of pruning increases (as $m$ increases) it becomes harder for the pruned network to recover the full performance of the original network (but the point is that it is equally hard irrespective of the pruning method used). Thus, w.r.t. the amount of recovery after damage (pruning), a random pruning strategy is as good as any other pruning strategy. We further drive home this point in Figure 0(a) where we show that after pruning and fine tuning for every layer, the amount of recovery after fine tuning is comparable across different pruning strategies.
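Table 1: Accuracy of the pruned VGG16 network after finetuning with one-tenth of the training data, for different pruning percentages.
Heuristic          25%     50%     75%
Random             0.650   0.569   0.415
Mean Activation    0.652   0.570   0.409
Entropy            0.641   0.549   0.405
Scaled Entropy     0.637   0.550   0.401
$\ell_1$ norm      0.667   0.593   0.436
APoZ               0.647   0.564   0.422
Sensitivity        0.636   0.543   0.379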
As a side note we would like to mention that we do not include the performance of ThiNet [23] in Table 1. This is because it uses a slightly different methodology. In particular there are two major differences. First, in ThiNet pruning is done only till layer 10 and not up to layer 11 as is the case for all numbers reported in Table 1. Secondly, in ThiNet, if a CONV layer appears before a max pooling layer then it is finetuned for an extra epoch to compensate for the downsampling in the max pooling layer. For a fair comparison, we followed exactly the same strategy as ThiNet but using a random pruning criterion. In this setup, a randomly pruned network was able to achieve 68% top-1 accuracy after 50% pruning which is comparable to the performance of the corresponding ThiNet (69%).
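Table 2: Accuracy of the pruned VGG16 network (50% pruning) after finetuning with the entire training data.
Heuristic          50%
Random             0.6701
Mean Activation    0.6662
Entropy            0.6635
Scaled Entropy     0.6625
$\ell_1$ norm      0.6759
APoZ               0.6706
Sensitivity        0.6659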
Amount of initial damage caused by different pruning strategies: One might argue that while a random pruning strategy is equivalent to other pruning strategies w.r.t. final performance after fine tuning, it is possible that the amount of initial damage caused by a careful pruning strategy may be less than that caused by random pruning. This could be important in cases where enough time or resources are not available for finetuning after pruning. To evaluate this, we compute the accuracy of the network just after pruning (and before finetuning) at each layer. Figure 0(b) compares this performance for different pruning strategies. Here again we observe that the damage caused by a random pruning strategy is not worse than that caused by other pruning strategies. The only exception is when we prune the first 4 layers, in which case the damage caused by $\ell_1$ norm based pruning is less than that caused by random pruning. We hypothesize that this is because the first 4 layers have very few filters and hence one needs to be careful while pruning filters from these layers. In fact, in hindsight we would recommend not pruning any filters from these 4 layers because the computation savings are small compared to the drop in accuracy.
Speed of recovery and quantum of data for finetuning: Another important criterion is the speed of recovery, i.e., the number of iterations for which the network needs to be finetuned after pruning. It is conceivable that a carefully pruned network may be able to recover and reach its best performance faster than a randomly pruned network. However, as shown in Figure 0(c), almost all the pruning strategies (including random) reach their peak after 2 epochs when finetuned with one-tenth of the data. Even if we increase the quantum of data, this behavior does not change, as shown in Figure 0(d) (for $\ell_1$ norm based pruning and random pruning). Of course, as we increase the quantum of data the amount of recovery increases, i.e., the peak performance of the pruned network increases. However, the important point is that a random strategy is no worse than a careful pruning strategy w.r.t. speed of recovery and quantum of data required.
4.2 Pruning ResNet50 using $\ell_1$ Norm and Random
While the above set of experiments focused on VGG16, we now turn our attention to ResNet50 [13] which gives state of the art results on ImageNet. We took a trained ResNet50 model which gave 74.5% top-1 accuracy on the ImageNet test set, which is again comparable to the accuracy reported elsewhere in the literature. ResNet50 contains 16 residual blocks wherein each block contains 3 layers with a skip connection from the first layer to the third layer. The standard practice is to either prune the first layer of each block or the first two layers of each block. In the first case, out of the total 48 convolution layers (16 * 3) we will end up pruning 16 and in the second case we will end up pruning 32. As before, for each pruned layer we vary the percentage of pruning over 25%, 50% and 75%. Here, we only compare the performance of the $\ell_1$ norm with random pruning as these were the top performing strategies on VGG16. This was just to save time and resources since, given the deep structure of ResNet50, it would have been very expensive to run all pruning strategies. Once again from Table 3, we observe that random pruning performs on par with (in fact, slightly better than) $\ell_1$ norm based pruning. Note that in this case the pruned models were trained with only one-tenth of the data. The performance of both methods is likely to improve further if we were to finetune the pruned network on the entire training data.
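Table 3: Accuracy of the pruned ResNet50 network after finetuning with one-tenth of the training data, when pruning 16 or 32 layers at different pruning percentages.
Heuristic        #Layers Pruned   25%     50%     75%
Random           16               0.722   0.683   0.617
$\ell_1$ norm    16               0.714   0.677   0.610
Random           32               0.696   0.637   0.518
$\ell_1$ norm    32               0.691   0.633   0.514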
5 Experiments: Class specific pruning
Existing work on pruning filters (or model compression, in general) focuses on the scenario where we have a network trained for detecting all the 1000 classes in ImageNet and at test time it is again evaluated using data belonging to all of these 1000 classes. However, in many real world scenarios, at test time we may be interested in fewer classes. A case in point is the Pascal VOC dataset which contains only 20 classes. Intuitively, if we are interested in only a few classes at test time then we should be able to prune the network to cater to only these classes. Alternately, we could train the original network itself using data corresponding to these classes only. To enable these experiments, we first create a new benchmark from ImageNet which contains only those 52 classes which correspond to the 20 classes in Pascal VOC. Note that the 52-to-20 mapping arises because ImageNet has more fine-grained classes. For example, there is only one class for 'dog' in Pascal VOC but ImageNet contains many subclasses of 'dog' (different breeds of dogs). We manually went over all the classes in ImageNet and picked out the classes which correspond to the 20 classes in Pascal VOC. In some cases, we ignored ImageNet classes which were too fine-grained and only considered those classes which were immediate hyponyms of a class in Pascal VOC. We then extracted the train, test and validation images for these classes from the original ImageNet dataset. We refer to this subset of ImageNet as ImageNet52P (where P stands for Pascal VOC). We refer to the original ImageNet dataset as ImageNet1000. Note that the train, test and validation splits of ImageNet52P are subsets of the corresponding splits of ImageNet1000. In particular, the training split of ImageNet1000 does not overlap with the test or validation splits of ImageNet52P.
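The construction of such a class specific subset amounts to filtering a labelled dataset through a fine-grained to coarse class mapping. The sketch below illustrates the idea only: the mapping entries, file names and the helper `build_subset` are hypothetical stand-ins, not the actual 52-class mapping used for ImageNet52P.

```python
# Hypothetical, tiny stand-in for the manual ImageNet -> Pascal VOC class mapping.
FINE_TO_VOC = {
    "beagle": "dog",
    "pug": "dog",
    "tabby cat": "cat",
    "school bus": "bus",
}


def build_subset(samples, fine_to_coarse):
    """Keep only samples whose fine-grained label maps to a class of interest.

    The fine-grained labels are preserved, so the subset still has one class
    per retained fine-grained category.
    """
    return [(path, label) for path, label in samples if label in fine_to_coarse]


# Toy (image path, fine-grained label) pairs.
samples = [("img0.jpg", "beagle"), ("img1.jpg", "volcano"), ("img2.jpg", "tabby cat")]
subset = build_subset(samples, FINE_TO_VOC)
covered = sorted({FINE_TO_VOC[label] for _, label in subset})
print(subset, covered)
```

Applying the same filter separately to the train, validation and test splits keeps each split of the subset inside the corresponding split of the full dataset.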
We first compare the performance in the following two setups: (i) a model trained on ImageNet1000 and evaluated on the test split of ImageNet52P and (ii) a model trained on ImageNet52P and evaluated on the test split of ImageNet52P. We observe that while in the first setup we get a top-1 accuracy of 74%, in the second setup we get an accuracy of 87%. This suggests that the model trained on ImageNet1000 is overloaded with extra information about the remaining 948 classes and hence performs poorly on the 52 classes of interest. We should thus be able to prune the network effectively to cater to only the 52 classes of interest. Note that in practice it is desirable to have just one network trained on ImageNet1000 and then prune it for the different subsets of classes that we are interested in, instead of training a separate network from scratch for each of these subsets. We again compare the different pruning strategies listed earlier, except that now when finetuning (after each layer and at the end of all layers) we only use ImageNet52P. In other words, we finetune using only data corresponding to the 52 classes. Once again, we observe that there is not much difference between random pruning and other pruning strategies. Also, with 25% pruning, we are able to almost match the performance of a network trained only on these 52 classes (i.e., 87%).
Heuristics         25%    50%    75%
Random             0.859  0.820  0.692
Mean Activation    0.866  0.816  0.698
Entropy            0.860  0.802  0.684
Scaled Entropy     0.863  0.813  0.691
norm               0.867  0.823  0.729
APoZ               0.858  0.811  0.700
Important Classes  0.857  0.795  0.655
Sensitivity        0.849  0.793  0.634
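To make the random baseline concrete, the following is a minimal sketch of randomly pruning the filters of one convolutional layer, using NumPy arrays as stand-ins for layer weights. It assumes an (out_channels, in_channels, height, width) weight layout and that the pruned layer feeds directly into another convolution; random_prune_conv and its arguments are names chosen for this illustration, and the finetuning step that follows pruning is not shown.

```python
import numpy as np


def random_prune_conv(weight, bias, next_weight, fraction, rng):
    """Randomly drop a fraction of the filters of one conv layer.

    weight:      (out_ch, in_ch, kh, kw) filters of the layer being pruned.
    bias:        (out_ch,) biases of that layer.
    next_weight: (next_out, out_ch, kh2, kw2) filters of the following layer,
                 whose input channels must shrink to match.
    fraction:    fraction of filters to remove, e.g. 0.25.
    Returns the pruned arrays and the sorted indices of the filters kept.
    """
    out_ch = weight.shape[0]
    n_keep = max(1, out_ch - int(round(fraction * out_ch)))
    # No ranking criterion is used: the kept filters are a uniform random subset.
    keep = np.sort(rng.choice(out_ch, size=n_keep, replace=False))
    return weight[keep], bias[keep], next_weight[:, keep], keep


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    w = rng.standard_normal((8, 3, 3, 3))
    b = np.zeros(8)
    nw = rng.standard_normal((16, 8, 3, 3))
    pw, pb, pnw, keep = random_prune_conv(w, b, nw, 0.25, rng)
    print(pw.shape, pb.shape, pnw.shape, keep)
```

A criterion-based strategy would differ only in how `keep` is chosen (for example, by sorting filters on a score), which is why the comparison in the table isolates the effect of the ranking criterion.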
6 Experiments: Faster Object Detection
The above experiments have shown that with reasonable levels of pruning (25–50%) and enough finetuning (using the entire data) the pruned network is able to recover and almost match the performance of the unpruned network on the original task (image classification), even with a random pruning strategy. However, it is possible that when such a pruned network is used for a new task, say object detection, a randomly pruned network may not give the same performance as a carefully pruned network. To check this, we perform experiments using the Faster RCNN model for object detection. Note that the Faster RCNN model uses a VGG16 model as a base component and then adds other components which are specific to object detection. We experiment with the PASCAL VOC 2007 dataset [8], which consists of 9,963 images containing 24,640 annotated objects. We first plug a standard trained VGG16 network into Faster RCNN and then train Faster RCNN for 70K iterations (as is the standard practice). This model gives a mean Average Precision (mAP) value of 0.66. The idea is to now plug a pruned VGG16 model into Faster RCNN instead of the original unpruned model and check the performance. Table 5 again shows that the specific choice of pruning strategy does not have much impact on the final performance on object detection. Of course, as earlier, as the level of pruning increases the performance drops (but the drop is consistent across all pruning strategies). We now report some more interesting experiments on pruning Faster RCNN.
Directly pruning Faster RCNN: Instead of plugging a pruned VGG16 model into Faster RCNN, we could alternatively take a trained Faster RCNN model and prune it directly. Here again, we use a simple random pruning strategy and observe that the performance of the pruned model comes very close to that of the unpruned model. In particular, with 50% pruning we are able to achieve a mAP of 0.648 with a 74% speedup in terms of frames per second.
Plugging in a VGG16 model trained using ImageNet52P: Since we are only interested in the 52 classes corresponding to Pascal VOC, we wanted to check what happens if we plug in a VGG16 model trained, pruned and finetuned only on ImageNet52P. As shown in Table 7, we do not get much benefit from plugging this specialized model into Faster RCNN. In fact, in a separate experiment we observed that even if we train a VGG16 model on a completely random set of 52 classes (different from the 52 classes corresponding to Pascal VOC) and then plug this model into Faster RCNN, the final performance of the Faster RCNN model remains the same. This is indeed surprising and further demonstrates the ability of these networks to recover from unfavorable situations.
Heuristics         25%    50%    75%
Random             0.647  0.600  0.505
Mean Activation    0.647  0.601  0.489
Entropy            0.635  0.584  0.501
Scaled Entropy     0.640  0.593  0.507
norm               0.628  0.608  0.520
APoZ               0.646  0.598  0.514
Sensitivity        0.636  0.592  0.485
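The claim that the choice of heuristic matters little can be checked directly from the numbers in the table above. The sketch below copies the reported values and computes, for each pruning level, the spread between the best and worst heuristic and the gap between the best heuristic and random pruning; the names MAP_BY_HEURISTIC and spread_and_random_gap are introduced only for this illustration.

```python
# mAP values copied from the table above (columns: 25%, 50%, 75% pruning).
MAP_BY_HEURISTIC = {
    "Random": (0.647, 0.600, 0.505),
    "Mean Activation": (0.647, 0.601, 0.489),
    "Entropy": (0.635, 0.584, 0.501),
    "Scaled Entropy": (0.640, 0.593, 0.507),
    "norm": (0.628, 0.608, 0.520),
    "APoZ": (0.646, 0.598, 0.514),
    "Sensitivity": (0.636, 0.592, 0.485),
}
LEVELS = (25, 50, 75)


def spread_and_random_gap(table, levels):
    """Per pruning level: (best - worst) over heuristics and (best - Random)."""
    result = {}
    for col, level in enumerate(levels):
        scores = [row[col] for row in table.values()]
        best, worst = max(scores), min(scores)
        result[level] = (round(best - worst, 3), round(best - table["Random"][col], 3))
    return result


if __name__ == "__main__":
    for level, (spread, gap) in spread_and_random_gap(MAP_BY_HEURISTIC, LEVELS).items():
        print(f"{level}% pruning: spread={spread:.3f}, best minus random={gap:.3f}")
```

On these values the gap between the best heuristic and random pruning stays at or below 0.015 mAP at every level, while moving from 25% to 75% pruning costs more than 0.1 mAP for every heuristic.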
Faster RCNN  Baseline  25%    50%    75%
mAP          0.66      0.655  0.648  0.530
fps          7.5       10     13     16
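The trade-off in the table above can be worked out explicitly. The sketch below computes the relative speedup in frames per second and the absolute drop in mAP at each pruning level from the tabulated values; summarize is a name chosen for this illustration. With the rounded fps values shown, 50% pruning comes out at roughly 73%, in line with the 74% figure quoted in the text (the small difference is presumably due to rounding of the reported fps, which is an assumption here).

```python
# Values copied from the table above.
BASELINE_MAP = 0.66
BASELINE_FPS = 7.5
PRUNED = {25: (0.655, 10.0), 50: (0.648, 13.0), 75: (0.530, 16.0)}  # level: (mAP, fps)


def summarize(baseline_map, baseline_fps, pruned):
    """Return {level: (speedup in percent, absolute mAP drop)} relative to the baseline."""
    rows = {}
    for level, (m_ap, fps) in pruned.items():
        speedup_pct = 100.0 * (fps - baseline_fps) / baseline_fps
        rows[level] = (round(speedup_pct, 1), round(baseline_map - m_ap, 3))
    return rows


if __name__ == "__main__":
    for level, (speedup, drop) in sorted(summarize(BASELINE_MAP, BASELINE_FPS, PRUNED).items()):
        print(f"{level}% pruning: +{speedup}% fps, mAP drop {drop}")
```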
Heuristics         25%    50%    75%
Random             0.647  0.580  0.469
Mean Activation    0.644  0.583  0.454
Entropy            0.642  0.578  0.47
Scaled Entropy     0.645  0.580  0.443
norm               0.648  0.601  0.487
APoZ               0.641  0.585  0.466
Important Classes  0.631  0.568  0.432
Sensitivity        0.637  0.576  0.4345
7 Conclusion and Future Work
We evaluated the performance of various pruning strategies based on (i) the drop in performance after pruning, (ii) the amount of recovery after pruning, (iii) the speed of recovery and (iv) the amount of data required. We performed extensive evaluations with two networks (VGG16 and ResNet50) and presented counterintuitive results which show that, w.r.t. all these factors, a random pruning strategy performs on par with principled pruning strategies. We also showed that even when such a randomly pruned network is used for a completely new task it performs well. Finally, we presented results for pruning Faster RCNN and showed that even a random pruning strategy can give a 74% speedup w.r.t. frames per second while incurring only a 1% drop in performance.
References
 [1] S. Anwar, K. Hwang, and W. Sung. Structured pruning of deep convolutional neural networks. ACM Journal on Emerging Technologies in Computing Systems (JETC), 13(3):32, 2017.
 [2] J. Ba and R. Caruana. Do deep nets really need to be deep? In Z. Ghahramani, M. Welling, C. Cortes, N. D. Lawrence, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 27, pages 2654–2662. Curran Associates, Inc., 2014.
 [3] V. Badrinarayanan, A. Kendall, and R. Cipolla. Segnet: A deep convolutional encoder-decoder architecture for scene segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2017.
 [4] F. Chollet. Xception: Deep learning with depthwise separable convolutions. In 2017 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017, Honolulu, HI, USA, July 21–26, 2017, pages 1800–1807, 2017.
 [5] M. Courbariaux, I. Hubara, D. Soudry, R. El-Yaniv, and Y. Bengio. Binarized neural networks: Training deep neural networks with weights and activations constrained to +1 or -1. arXiv preprint arXiv:1602.02830, 2016.
 [6] M. Denil, B. Shakibi, L. Dinh, N. de Freitas, et al. Predicting parameters in deep learning. In Advances in Neural Information Processing Systems, pages 2148–2156, 2013.
 [7] C. Endisch, C. Hackl, and D. Schröder. Optimal Brain Surgeon for General Dynamic Neural Networks, pages 15–28. Springer Berlin Heidelberg, Berlin, Heidelberg, 2007.
 [8] M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman. The PASCAL Visual Object Classes Challenge 2007 (VOC2007) Results. http://www.pascal-network.org/challenges/VOC/voc2007/workshop/index.html.
 [9] R. Girshick. Fast R-CNN. In Proceedings of the International Conference on Computer Vision (ICCV), 2015.
 [10] R. Girshick, J. Donahue, T. Darrell, and J. Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2014.
 [11] S. Han, H. Mao, and W. J. Dally. Deep compression: Compressing deep neural networks with pruning, trained quantization and huffman coding. arXiv preprint arXiv:1510.00149, 2015.
 [12] S. Han, J. Pool, J. Tran, and W. Dally. Learning both weights and connections for efficient neural network. In C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, and R. Garnett, editors, Advances in Neural Information Processing Systems 28, pages 1135–1143. Curran Associates, Inc., 2015.
 [13] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In 2016 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2016, Las Vegas, NV, USA, June 27–30, 2016, pages 770–778, 2016.
 [14] Y. He, X. Zhang, and J. Sun. Channel pruning for accelerating very deep neural networks. arXiv preprint arXiv:1707.06168, 2017.
 [15] G. E. Hinton, O. Vinyals, and J. Dean. Distilling the knowledge in a neural network. CoRR, abs/1503.02531, 2015.
 [16] A. G. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang, T. Weyand, M. Andreetto, and H. Adam. Mobilenets: Efficient convolutional neural networks for mobile vision applications. CoRR, abs/1704.04861, 2017.
 [17] H. Hu, R. Peng, Y. Tai, and C. Tang. Network trimming: A data-driven neuron pruning approach towards efficient deep architectures. CoRR, abs/1607.03250, 2016.
 [18] G. Huang, Z. Liu, L. van der Maaten, and K. Q. Weinberger. Densely connected convolutional networks. In 2017 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017, Honolulu, HI, USA, July 21–26, 2017, pages 2261–2269, 2017.
 [19] B. E. Huseyinsinoglu, A. R. Ozdincler, and Y. Krespi. Bobath concept versus constraint-induced movement therapy to improve arm functional recovery in stroke patients: a randomized controlled trial. Clinical rehabilitation, 26(8):705–715, 2012.
 [20] F. N. Iandola, S. Han, M. W. Moskewicz, K. Ashraf, W. J. Dally, and K. Keutzer. Squeezenet: Alexnet-level accuracy with 50x fewer parameters and <0.5mb model size. arXiv preprint arXiv:1602.07360, 2016.
 [21] Y. Ioannou, D. Robertson, J. Shotton, R. Cipolla, and A. Criminisi. Training cnns with low-rank filters for efficient image classification. arXiv preprint arXiv:1511.06744, 2015.
 [22] M. Jaderberg, A. Vedaldi, and A. Zisserman. Speeding up convolutional neural networks with low rank expansions. arXiv preprint arXiv:1405.3866, 2014.
 [23] J.-H. Luo, J. Wu, and W. Lin. ThiNet: A Filter Level Pruning Method for Deep Neural Network Compression. In International Conference on Computer Vision (ICCV), October 2017.
 [24] A. Krizhevsky, I. Sutskever, and G. E. Hinton. Imagenet classification with deep convolutional neural networks. In F. Pereira, C. J. C. Burges, L. Bottou, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 25, pages 1097–1105. Curran Associates, Inc., 2012.
 [25] Y. LeCun, J. S. Denker, and S. A. Solla. Optimal brain damage. In D. S. Touretzky, editor, Advances in Neural Information Processing Systems 2, pages 598–605. Morgan-Kaufmann, 1990.
 [26] H. Li, A. Kadav, I. Durdanovic, H. Samet, and H. P. Graf. Pruning filters for efficient convnets. arXiv preprint arXiv:1608.08710, 2016.
 [27] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C.-Y. Fu, and A. C. Berg. Ssd: Single shot multibox detector. In European conference on computer vision, pages 21–37. Springer, 2016.
 [28] J. Long, E. Shelhamer, and T. Darrell. Fully convolutional networks for semantic segmentation. CVPR (to appear), Nov. 2015.
 [29] J.-H. Luo and J. Wu. An entropy-based pruning method for cnn compression. arXiv preprint arXiv:1706.05791, 2017.
 [30] P. Molchanov, S. Tyree, T. Karras, T. Aila, and J. Kautz. Pruning convolutional neural networks for resource efficient inference. 2016.
 [31] M. Rastegari, V. Ordonez, J. Redmon, and A. Farhadi. Xnor-net: Imagenet classification using binary convolutional neural networks. In European Conference on Computer Vision, pages 525–542. Springer, 2016.
 [32] J. Redmon, S. K. Divvala, R. B. Girshick, and A. Farhadi. You only look once: Unified, real-time object detection. In 2016 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2016, Las Vegas, NV, USA, June 27–30, 2016, pages 779–788, 2016.
 [33] S. Ren, K. He, R. Girshick, and J. Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. In Advances in Neural Information Processing Systems (NIPS), 2015.
 [34] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei. ImageNet Large Scale Visual Recognition Challenge. International Journal of Computer Vision (IJCV), 115(3):211–252, 2015.
 [35] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
 [36] R. K. Srivastava, K. Greff, and J. Schmidhuber. Highway networks. CoRR, abs/1505.00387, 2015.
 [37] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich. Going deeper with convolutions. In Computer Vision and Pattern Recognition (CVPR), 2015.
 [38] C. Tai, T. Xiao, Y. Zhang, X. Wang, et al. Convolutional neural networks with low-rank regularization. arXiv preprint arXiv:1511.06067, 2015.
 [39] W. Wen, C. Wu, Y. Wang, Y. Chen, and H. Li. Learning structured sparsity in deep neural networks. In Advances in Neural Information Processing Systems, pages 2074–2082, 2016.
 [40] X. Zhang, J. Zou, X. Ming, K. He, and J. Sun. Efficient and accurate approximations of nonlinear convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1984–1992, 2015.
 [41] A. Zhou, A. Yao, Y. Guo, L. Xu, and Y. Chen. Incremental network quantization: Towards lossless cnns with low-precision weights. arXiv preprint arXiv:1702.03044, 2017.