Progressive Gradient Pruning for Classification, Detection and Domain Adaptation
Abstract
Although deep neural networks (NNs) have achieved state-of-the-art accuracy in many visual recognition tasks, the growing computational complexity and energy consumption of networks remain an issue, especially for applications on platforms with limited resources that require real-time processing. Filter pruning techniques have recently shown promising results for the compression and acceleration of convolutional NNs (CNNs). However, these techniques involve numerous steps and complex optimisations, because some only prune after training CNNs, while others prune from scratch during training by integrating sparsity constraints or modifying the loss function.
In this paper, we propose a new Progressive Gradient Pruning (PGP) technique for iterative filter pruning during training. In contrast to previous progressive pruning techniques, it relies on a novel filter selection criterion that measures the change in filter weights, uses a new hard and soft pruning strategy, and effectively adapts momentum tensors during the backward propagation pass. Experimental results obtained after training various CNNs on image data for classification, object detection and domain adaptation benchmarks indicate that the PGP technique can achieve a better trade-off between classification accuracy and network (time and memory) complexity than PSFP and other state-of-the-art filter pruning techniques.
1 Introduction
Convolutional neural networks (CNNs) learn discriminant feature representations from labeled training data, and have achieved state-of-the-art accuracy across a wide range of visual recognition tasks, e.g., image classification, object detection, and assisted medical diagnosis. Since the breakthrough results achieved with AlexNet in the 2012 ImageNet Challenge [20], CNN accuracy has continually improved with architectures like VGG [39], ResNet [11] and DenseNet [17], at the expense of growing complexity (deeper and wider networks) that requires more training samples and computational resources [18]. In particular, the speed of CNNs can degrade significantly with such increased complexity.
In order to deploy these powerful CNN architectures on compact platforms with limited resources (e.g., embedded systems, mobile phones, portable devices) and for real-time processing (e.g., video surveillance and monitoring, virtual reality), the time and memory complexity and energy consumption of CNNs should be reduced. This is particularly true when the model also has to adapt to a new environment/domain. For instance, the application of CNN-based architectures to real-time face detection in video surveillance remains a challenging task [33], because of the complexity of real-time detection and the handling of different scenes, environments and camera angles. Currently, the more accurate detectors such as region proposal networks are too slow for real-time applications [38, 6], faster detectors such as single-shot detectors are less accurate [26, 36], and none of these is a universal detector that can work in any new environment without domain adaptation. Consequently, effective methods to accelerate and compress deep networks, especially during training/domain adaptation, are required to provide a reasonable trade-off between accuracy, efficiency and deployment time.
This paper focuses on filter-level pruning techniques. While they do not provide the compression level of unstructured pruning, the reduction in parameters can be converted into a real speed-up while preserving network accuracy [23, 32]. These techniques attempt to remove filters and input channels at each convolutional layer using various criteria based on, e.g., the L1 norm [23] or the product of feature maps and gradients [32]. Pruning techniques can be applied under two different scenarios: either (1) from a pre-trained network, or (2) from scratch. In the first scenario, pruning is applied as a post-processing procedure, once the network has already been trained, through one-time pruning (followed by fine-tuning) [23], through a complex iterative process [32] using a validation dataset [23, 30], or by minimizing the reconstruction error [29]. In the second scenario, pruning is applied from scratch by introducing sparsity constraints and/or modifying the loss function used to train the network [27, 42, 46]. The latter scenario can have more difficulty converging to accurate network solutions (due to the modified loss function), thereby increasing the computational complexity of the optimisation process. For greater training efficiency, the progressive soft filter pruning (PSFP) method was recently introduced [12], allowing iterative pruning from scratch, where filters are set to zero (instead of being removed) so that the network preserves a greater learning capacity. This method, however, does not account for the optimization of soft-pruned weights, which can have a negative impact on accuracy, because pruned weights are still updated with old momentum values accumulated over previous epochs.
In this paper, a new Progressive Gradient-based Pruning (PGP) technique is proposed for iterative filter pruning that provides a better trade-off between accuracy and complexity. To this end, filters are efficiently pruned in a progressive fashion while training a network from scratch, and accuracy is maintained without requiring validation data or additional optimisation constraints. In particular, PGP improves on PSFP by integrating hard and soft pruning strategies and by effectively adapting the momentum tensor during the backward propagation pass. It also integrates an improved version of the Taylor selection criterion [32] that relies on the gradient w.r.t. the weights (instead of the output feature maps), and is more suitable for progressive filter-based pruning. For performance evaluation, the accuracy and complexity of the proposed and state-of-the-art filter pruning techniques are compared using ResNet, LeNet and VGG networks trained to address benchmark image classification (MNIST and CIFAR-10 datasets), object detection (PASCAL VOC dataset) and domain adaptation (Office-31) problems. Our experiments show that the proposed approach performs comparably to or better than most previous techniques on image classification as well as object detection. Additionally, we found that, by performing pruning and domain adaptation jointly, our method outperforms approaches based on two separate steps.
2 Compression and Acceleration of CNNs
In general, the time complexity of a CNN depends mostly on the convolutional layers, while the fully connected layers contain most of the parameters. Therefore, CNN acceleration methods typically target lowering the complexity of the convolutional layers, while compression methods usually target reducing the complexity of the fully connected layers [9, 10]. This section provides an overview of recent acceleration and compression approaches for CNNs, namely quantization, low-rank approximation, and network pruning. Finally, a brief survey of filter pruning methods and challenges is presented.
2.1 Overview of methods:
Quantization:
A deep neural network can be accelerated by reducing the precision of its parameters. Such techniques are often used on general embedded systems, where a low-precision representation, e.g., 8-bit integer, provides faster processing than a higher-precision one, e.g., 32-bit floating point. There are two main approaches to quantizing a neural network: the first quantizes only the weights [9, 45], and the second quantizes both weights and activations [7, 4]. These techniques can be either scalable [9, 45] or non-scalable [3, 7, 4, 35], where scalable means that an already quantized network can be further compressed.
Low-rank decomposition:
Low-rank approximation (LRA) can accelerate CNNs by decomposing a tensor into lower-rank approximations expressed as vector products [19, 40, 21]. There are different ways of decomposing a convolution tensor. Techniques like [19, 40] focus on approximating a tensor by a low-rank tensor, obtained either layer by layer [19] or by scanning the whole network [40]. [43] proposes to force filters to coordinate more information into a lower-rank space during training and then decompose them once the model is trained. Another technique employs CP decomposition (Canonical Polyadic decomposition), with which a good trade-off between accuracy and efficiency is achieved [21].
Pruning:
Pruning is a family of techniques that remove non-useful parameters from a neural network. There are several approaches to pruning deep neural networks. The first is weight pruning, where individual weights are pruned. This approach has been shown to significantly compress and accelerate deep neural networks [9, 44, 10]. Weight pruning techniques usually rely on sparse convolution algorithms [24, 37]. The other approach is output channel or filter pruning, where complete output channels or filters are pruned [23, 29, 12, 46]. Since this paper proposes a method for filter pruning, we provide more details on this approach in the next section.
2.2 Filter pruning:
Following the work on Optimal Brain Damage [5], one of the first papers to show the efficiency of filter-level pruning was [23], where the weight norm is used to identify and prune weak filters, i.e., filters that do not contribute much to the network. Afterwards, several works proposed pruning procedures and filter importance metrics. These methods can be organized into five pruning approaches: 1) pruning as a one-time post-processing step followed by fine-tuning, which is simple and easy to apply [23]; 2) pruning iteratively once the model has been trained, where alternating pruning and fine-tuning increases the chance of recovering the accuracy lost right after a filter is pruned [32, 31]; 3) pruning by minimizing the reconstruction error at each layer, which allows the model to approximate the original performance [29, 15, 46]; 4) pruning by using sparsity constraints with a modified objective function, so that the network takes pruning into account during optimization [27, 2, 1, 22]; 5) pruning progressively while training from scratch or from a pretrained model, where soft pruning [13, 14] sets filters to zero instead of actually removing them (hard pruning), which leaves the network with more capacity to learn [12].
While the first three approaches are capable of reducing the complexity of a model, they can only be applied once the model is already trained; it would be more beneficial to start pruning from scratch during training. The fourth approach can start pruning from scratch by adding sparsity constraints and modifying the optimization objective, but this makes the loss harder and more sensitive to optimize. This can be problematic when the original loss function is already difficult to optimize, since modifying it may prevent the model from converging to a good solution. The fifth approach eases this process by not removing filters and by keeping the original loss function. However, we believe this approach can be improved: it does not handle pruning in the backward pass and only sets the weak filters to zero. Moreover, it computes the L2 criterion separately from the parameter updates, i.e., not while iterating inside an epoch. In our approach, we compute the criterion directly during the updates, i.e., while iterating through an epoch and updating the parameters.
Another important aspect of filter pruning is the capacity to evaluate the importance of a filter. In the literature, many criteria have been used to evaluate the importance of filters, e.g., L1 [23], APoZ [16], entropy [30], L2 [12] and Taylor [32]. Among these, the Taylor criterion [32] has the most potential for pruning during training, since it is derived by minimizing the impact of pruning a filter.
3 Progressive Gradient Pruning
3.1 Pruning strategy with momentum:
In a regular CNN, the weight tensor of a convolutional layer $l$ can be defined as $W^l \in \mathbb{R}^{C^l_{out} \times C^l_{in} \times k \times k}$, where $C^l_{in}$ and $C^l_{out}$ are the number of input and output channels (filters), respectively. The weight tensor of filter $i$ can then be defined as $F^l_i = W^l_{i,:,:,:} \in \mathbb{R}^{C^l_{in} \times k \times k}$. In order to select the weak filters of a layer, we evaluate the importance of a filter using a criterion function $\Theta$, usually defined as $\Theta: \mathbb{R}^{C^l_{in} \times k \times k} \rightarrow \mathbb{R}$. Given a filter, it yields a scalar that represents its rank, e.g., the L1 norm [23] or, in our case, the gradient norm.
In order to prune a convolutional layer progressively, an exponential decay function is defined such that the ratio of remaining filters always stays strictly positive. (This is slightly different from [12], where the decay function can reach zero.) This decay function is used to select the number of weak filters at each epoch, and is defined as the ratio of filters remaining after training epoch $e$:
$v(e) = (1 - r)^{\,e/E}$   (1)
where $r$ is a hyperparameter that defines the target ratio of filters to be pruned, $e$ is the current epoch and $E$ is the total number of training epochs. Since we prune progressively, layer by layer and epoch by epoch, we calculate the number of weak filters (equivalently, the number of remaining filters) at each layer, $n^l_w(e)$. Given the ratio $v(e)$ at epoch $e$, the number of weak filters for any layer $l$ is defined as:
$n^l_w(e) = \left\lfloor \big(1 - v(e)\big) \cdot n^l \right\rfloor$   (2)
where $n^l$ is the original number of filters of layer $l$. Using the number of weak filters $n^l_w(e)$ and a pruning criterion function $\Theta$, we obtain a subset $\mathcal{F}_{weak}$ of filters with the smallest criterion values. This subset is further divided into two subsets, $\mathcal{F}_{hard}$ and $\mathcal{F}_{soft}$, using a hyperparameter $\alpha$ that decides the ratio of filters to remove permanently (hard pruning). The filters in $\mathcal{F}_{hard}$ are removed completely, while the filters in $\mathcal{F}_{soft}$ are reset to zero, and the indexes of both subsets are kept for the backward pass. Additionally, hard pruning is performed on the input channels of the next layer using the indexes of $\mathcal{F}_{hard}$.
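To make the selection step concrete, the sketch below (assuming PyTorch tensors and hypothetical helper names; the exponential schedule follows Equ. 1) ranks the filters of one convolutional layer by their criterion values, derives the number of weak filters from the decay schedule, and splits them into hard- and soft-pruned subsets:

```python
import torch

def select_weak_filters(scores, n_orig, epoch, total_epochs, r, alpha):
    """Sketch of the progressive filter selection for one convolutional layer.

    scores:       1-D tensor with one criterion value per remaining output filter
    n_orig:       original number of filters of the layer
    r:            target ratio of filters to prune
    alpha:        ratio of the weak filters that are hard-pruned (removed);
                  the remaining weak filters are soft-pruned (zeroed)
    """
    # Equ. (1): ratio of filters kept after `epoch` epochs (exponential decay).
    keep_ratio = (1.0 - r) ** (epoch / total_epochs)
    # Equ. (2): number of weak filters for this layer at this epoch.
    n_weak = min(int((1.0 - keep_ratio) * n_orig), scores.numel())
    # Filters with the smallest criterion values are considered weak.
    weak_idx = torch.argsort(scores)[:n_weak]
    n_hard = int(alpha * n_weak)
    hard_idx = weak_idx[:n_hard]   # removed from this layer and from the input
                                   # channels of the next layer
    soft_idx = weak_idx[n_hard:]   # reset to zero, kept so they can recover
    return hard_idx, soft_idx
```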
Figure 1 illustrates the hard and soft pruning strategy of the PGP technique, with the momentum tensor $V^l$ defined with the same dimensions as the weight tensor $W^l$. Using the indexes of $\mathcal{F}_{soft}$, we set the corresponding filters of the momentum tensor to zero, and we hard prune the filters of the momentum tensor corresponding to the indexes of $\mathcal{F}_{hard}$. In existing progressive pruning techniques like [12], only the weights are set to zero during training, without handling the previously accumulated momentum, which is critical for the optimization. As illustrated in Figure 2, momentum pruning is important for the optimization process.
Let us take a closer look at the typical equations for the update of the weight and momentum tensors:

$v_{t+1} = \mu \cdot v_t + \nabla_W \mathcal{L}(W_t)$   (3)

$W_{t+1} = W_t - \eta \cdot v_{t+1}$   (4)

where $W_t$ and $v_t$ are respectively the weight and momentum tensors at iteration $t$, and $\eta$ and $\mu$ are the learning rate and momentum hyperparameters, respectively. By expanding $v_{t+1}$ in Equ. 4:

$W_{t+1} = W_t - \eta \cdot \big(\mu \cdot v_t + \nabla_W \mathcal{L}(W_t)\big)$   (5)

The tensor $v_t$ depends on the gradient of the weights at previous iterations. With a soft pruning technique (like PSFP), updating the weights using $v_t$ is meaningless if $W$ was soft pruned at iteration $t$, since the weights have been reset and the optimization point is no longer the same. It is therefore important to adapt the momentum tensor during soft pruning. Our solution is to soft prune the momentum as well, so that the weight tensor is correctly optimized.
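A minimal sketch of this momentum adaptation, assuming a PyTorch SGD optimizer with momentum (the `momentum_buffer` entry of `optimizer.state` is where PyTorch stores the accumulated velocity) and a hypothetical `soft_idx` index tensor produced by the selection step:

```python
import torch

def soft_prune_with_momentum(optimizer, weight, soft_idx):
    """Reset soft-pruned filters and the corresponding momentum entries, so
    that stale velocity accumulated before pruning stops driving the update."""
    # Zero the weights of the soft-pruned filters (they may recover later).
    weight.data[soft_idx] = 0.0
    # Zero the matching entries of the momentum buffer, if it exists already.
    state = optimizer.state.get(weight, {})
    buf = state.get('momentum_buffer', None)
    if buf is not None:
        buf[soft_idx] = 0.0
```

Hard-pruned filters are handled differently: their rows are removed from the weight, gradient and momentum tensors altogether, as detailed in the Supplementary Material.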
3.2 Selection criteria:
Molchanov et al. [32] proposed the following criterion to measure the importance of the feature map $h_i$ produced by filter $i$, computed at each layer and for each filter:
$\Theta_{TE}(h_i) = \big| \mathcal{L}(\mathcal{D}, h_i = 0) - \mathcal{L}(\mathcal{D}, h_i) \big| \approx \left| \frac{\partial \mathcal{L}}{\partial h_i} \cdot h_i \right|$   (6)
The term $\mathcal{L}(\mathcal{D}, h_i = 0)$ refers to the loss of the model on a labeled dataset $\mathcal{D}$ when the feature map $h_i$ is pruned, while $\mathcal{L}(\mathcal{D}, h_i)$ is the original loss before the model has been pruned. In summary, the criterion of Equ. 6 is the difference between the loss of a pruned model and that of the original model, and it grows with the impact of the feature map. This criterion has been shown to work well on some trained networks. However, in the scenario where the network is pruned from scratch, we argue that information measured from the feature maps is not informative, since the model is not trained. Empirical results in Section 4 also show that the criterion of Equ. 6 is not as effective as other criteria for progressive pruning.
Instead of using $h_i$ to prune a feature map [25] or filter, we can replace $h_i$ with the filter weights $F_i$, since setting a filter to zero is equivalent to pruning it [12]. The same Taylor expansion from [32] can then be applied with $F_i$, resulting in:
$\Theta(F_i) = \big| \mathcal{L}(\mathcal{D}, F_i = 0) - \mathcal{L}(\mathcal{D}, F_i) \big| \approx \left| \frac{\partial \mathcal{L}}{\partial F_i} \odot F_i \right|$   (7)
Equ. 7 can be further simplified by taking into account the soft pruning nature of our approach. Since the product in Equ. 7 is element-wise, we can decompose it as:
$\left| \frac{\partial \mathcal{L}}{\partial F_i} \odot F_i \right| = \left| \frac{\partial \mathcal{L}}{\partial F_i} \right| \odot \left| F_i \right|$   (8)
where $|F_i|$ is the absolute value of the weights of filter $i$. This means that $|F_i|$ can be zero or very close to zero if $F_i$ is one of the filters that was soft-pruned. In this case, the filter has little chance to recover, since it will likely be selected for pruning again. In order to encourage more recovery of soft-pruned filters, we propose to remove the $|F_i|$ term:
$\Theta(F_i) = \left| \frac{\partial \mathcal{L}}{\partial F_i} \right|$   (9)
where $\Theta(F_i)$ is the criterion of our approach for filter $i$. There are two ways of calculating this criterion:

PGP: performs a training epoch without updating the model, and computes the pruning criterion during this pass. This amounts to batch gradient descent without the parameter update at the end, and can provide better performance, since the estimate is less noisy than with SGD.

RPGP: computes the pruning criterion directly during the forward-backward passes of training (while updating). This approach uses an SGD optimizer and calculates the criterion directly during the optimization and update of the model.
In either case, the criterion is accumulated over several iterations, so there are two ways of interpreting Equ. 9. One natural way is to accumulate the gradients, summing them into the total gradient of a filter; since PGP goes through the entire epoch without updates, we can then use an L1 norm to sum up the variation inside a filter, giving the $GN_G$ criterion:
$\Theta_{GN_G}(F_i) = \left\| \sum_{t=1}^{T} \frac{\partial \mathcal{L}_t}{\partial F_i} \right\|_1$   (10)
where $\frac{\partial \mathcal{L}_t}{\partial F_i}$ is the gradient tensor of filter $i$ at iteration $t$ inside an epoch, and $T$ is the number of iterations in the epoch. Equ. 10 measures the amount of global change of a filter at the end of an epoch, which makes it most suitable for PGP. The second way is to accumulate the actual changes of a filter at each update, using the $GN_S$ criterion:
$\Theta_{GN_S}(F_i) = \sum_{t=1}^{T} \left\| \frac{\partial \mathcal{L}_t}{\partial F_i} \right\|_1$   (11)
Equ. 11 computes the L1 norm of the gradient tensor of a filter at each iteration during an epoch. Thus, instead of measuring only the global change at the end, as in Equation 10, it measures the gradual changes during an epoch. This criterion is most suitable for RPGP, since the weights are updated at the same time as the gradient is accumulated. PGP is summarized in Algo. 1. The algorithm for RPGP is similar, but the criterion is calculated directly at the training step, as sketched below.
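The sketch below, for a single convolutional layer and with hypothetical function names, illustrates how the two criteria are accumulated: `gn_global` corresponds to the $GN_G$ criterion of Equ. 10, computed by PGP from gradients accumulated over a separate, update-free epoch, while `rpgp_epoch` corresponds to RPGP, where the $GN_S$ criterion of Equ. 11 is computed during the same pass that updates the parameters:

```python
import torch

def gn_global(grad_sum):
    """Equ. (10): L1 norm of the gradient accumulated over a whole epoch
    (PGP runs an extra forward-backward epoch without parameter updates
    to obtain `grad_sum` for each layer)."""
    # grad_sum has shape (C_out, C_in, k, k); one score per output filter.
    return grad_sum.abs().sum(dim=(1, 2, 3))

def rpgp_epoch(model, conv_layer, loader, optimizer, loss_fn):
    """One RPGP training epoch: accumulate the GN_S criterion (Equ. 11)
    while updating the model; a full implementation loops over all
    prunable layers and then calls the selection/pruning steps above."""
    scores = torch.zeros(conv_layer.weight.shape[0])
    for images, targets in loader:
        optimizer.zero_grad()
        loss = loss_fn(model(images), targets)
        loss.backward()
        # Equ. (11): per-filter L1 norm of the gradient at every iteration.
        scores += conv_layer.weight.grad.detach().abs().sum(dim=(1, 2, 3))
        optimizer.step()
    return scores
```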
4 Experiments
Our experiments consider three different visual recognition tasks: (1) image classification, (2) object detection and (3) domain adaptation. For image classification, we evaluate our approach on the MNIST and CIFAR-10 datasets with several commonly used network architectures. Additionally, we conduct an ablation study to better understand which parts of our approach are important for good performance. For object detection, we progressively prune the VGG16 backbone of Faster R-CNN and show a good trade-off between detection performance and computation. Finally, for domain adaptation, we evaluate our method on the Office-31 dataset. As our method is progressive, we can prune and adapt to the target domain simultaneously, which is beneficial. The source code for our paper will be available at https://github.com/Anon6627/PruningPGP.
4.1 Classification
In this section, we compare the experimental results obtained using the proposed PGP and RPGP techniques against state-of-the-art filter pruning techniques that are representative of each family described in Section 2.2: L1-norm pruning (prunes once), Taylor pruning (prunes iteratively), DCP (specialised loss function and reconstruction-error minimization) and PSFP (progressive pruning). Performance is measured in terms of accuracy, and in terms of memory and time complexity (number of parameters and number of FLOPS). For techniques like our PGP, as well as PSFP, DCP and L1, it is possible to set a target pruning rate hyperparameter. For a fixed pruning rate, the complexity (number of FLOPS and parameters) is identical for these techniques, so we can compare them in terms of accuracy for a given complexity. In contrast, techniques like Taylor prune some filters at each iteration and can be stopped at a given condition, such as a number of FLOPS or once 99% of the filters are pruned. Due to the skip connections, pruning ResNet needs a special strategy. We follow the popular pruning strategy proposed in [23]: pruning the downsampling layer and then using the same indexes to prune the last convolution of the residual branch. Techniques are compared using ResNet, LeNet and VGG networks trained to address benchmark problems.
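As an illustration of this strategy, the hedged sketch below (assuming a basic ResNet block with `conv2` as the last convolution of the residual branch and `downsample[0]` as the 1x1 projection; attribute names vary between implementations) prunes both convolutions with the same surviving-filter indexes, so that the element-wise addition of the two branches stays consistent:

```python
import torch.nn as nn

def prune_residual_block_outputs(block, keep_idx):
    """Prune the downsampling conv and the last conv of the residual branch
    with the same output-filter indexes (the BatchNorm layers that follow
    each conv would need to be sliced with the same indexes as well)."""
    def slim(conv):
        new_conv = nn.Conv2d(conv.in_channels, len(keep_idx),
                             kernel_size=conv.kernel_size, stride=conv.stride,
                             padding=conv.padding, bias=conv.bias is not None)
        new_conv.weight.data = conv.weight.data[keep_idx].clone()
        if conv.bias is not None:
            new_conv.bias.data = conv.bias.data[keep_idx].clone()
        return new_conv

    block.downsample[0] = slim(block.downsample[0])  # projection on the skip path
    block.conv2 = slim(block.conv2)                  # last conv of the residual branch
    return block
```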
MNIST:
On this dataset, we use the same hyperparameters as in the original papers; the same settings are used for LeNet5 and ResNet20. With PGP and RPGP, we use a learning rate of 0.01, momentum of 0.9, 40 epochs and a remove rate of 50%. For PSFP, we use the same settings, except for the removal rate, which does not apply since PSFP only performs soft pruning. For Taylor [32], we iteratively remove 5 filters at a time and then fine-tune for 5 epochs; this varies slightly from the original procedure, but this configuration does not collapse and returns the best results. For L1 pruning, we fine-tune for 20 epochs after pruning. For DCP, we run the authors' code on MNIST over 40 epochs, with 20 epochs for the filter pruning and 20 epochs for fine-tuning.
Table 1: LeNet5 on MNIST.

Methods            Pruning rate  Params  FLOPS  Error % (gap)
Baseline LeNet5    0%            61K     446K   0.84 (0)
L1 [23]            30%           34.1K   304K   0.90 (+0.06)
                   50%           18K     152K   1.05 (+0.21)
                   70%           8.4K    82K    2.22 (+1.38)
Taylor [32]        30%           38K     286K   0.90 (+0.06)
                   50%           24K     76K    1.05 (+0.21)
                   70%           13K     34K    1.22 (+0.38)
DCP [46]           30%           42.7K   325K   2.75 (+1.91)
                   50%           30.5K   232K   4.18 (+3.34)
                   70%           30.5K   232K   6.28 (+5.44)
PSFP [12]          30%           34.1K   304K   1.32 (+0.48)
                   50%           18K     152K   2.27 (+1.43)
                   70%           8.4K    82K    2.99 (+2.15)
PGP_GN_G (ours)    30%           34.1K   304K   0.87 (+0.03)
                   50%           18K     152K   1.08 (+0.24)
                   70%           8.4K    82K    1.74 (+0.90)
RPGP_GN_S (ours)   30%           34.1K   304K   0.90 (+0.06)
                   50%           18K     152K   1.25 (+0.41)
                   70%           8.4K    82K    1.75 (+0.91)
Table 2: ResNet20 on MNIST.

Methods            Pruning rate  Params  FLOPS  Error % (gap)
Baseline ResNet20  0%            272K    41M    0.74 (0)
L1 [23]            30%           137K    22M    0.75 (+0.01)
                   50%           68K     10M    1.09 (+0.35)
                   70%           27K     4.2M   2.02 (+1.28)
Taylor [32]        30%           149K    17.7M  0.87 (+0.13)
                   50%           87K     7.8M   0.95 (+0.21)
                   70%           36K     2.6M   1.04 (+0.30)
DCP [46]           30%           193K    30.3M  1.11 (+0.37)
                   50%           138K    21.1M  0.62 (-0.12)
                   70%           87.7K   13.5M  1.19 (+0.45)
PSFP [12]          30%           137K    22M    0.50 (-0.24)
                   50%           68K     10M    0.61 (-0.13)
                   70%           27K     4.2M   0.72 (-0.02)
PGP_GN_G (ours)    30%           137K    22M    0.40 (-0.34)
                   50%           68K     10M    0.51 (-0.23)
                   70%           27K     4.2M   0.57 (-0.17)
RPGP_GN_S (ours)   30%           137K    22M    0.40 (-0.34)
                   50%           68K     10M    0.48 (-0.29)
                   70%           27K     4.2M   0.50 (-0.24)
Results in Tab. 1 show that our PGP methods compare favorably against state-of-the-art techniques like L1, Taylor and PSFP. Similar tendencies are seen in Tab. 2. We also see that PGP performs slightly better than DCP in some cases. Finally, since both PGP_GN_G and RPGP_GN_S rely on the gradient norm criterion, the results show that it is their procedure that makes the difference. The slightly better performance of PGP_GN_G can be explained by the fact that its pruning criterion is calculated using batch gradient descent instead of stochastic gradient descent.
CIFAR-10:
In this case, we use a VGG19 adapted to CIFAR-10, with a learning rate of 0.1, momentum of 0.9 and 400 epochs, and we decrease the learning rate by a factor of 10 at epochs 160 and 240. We also use a ResNet56 adapted to CIFAR-10 with the same settings, except with 500 epochs. For PGP and RPGP, we set the remove rate hyperparameter to 0.5 (50%), fine-tune for 100 epochs after pruning, and keep the best score. We use the same settings for PSFP, except for the removal rate, which does not apply. For Taylor, 5 filters are removed iteratively at each step, followed by 5 epochs of fine-tuning. We slightly changed the procedure compared to the original paper, because the original procedure prunes one feature map per iteration, which is inefficient on a large model; empirically, we found that removing 5 feature maps per iteration gives the best accuracy. For L1 pruning, 100 epochs of fine-tuning are used after pruning to find the best score. For DCP, the settings provided by the original authors are found to give the best performance.
Table 3: VGG19 on CIFAR-10.

Methods            Pruning rate  Params  FLOPS  Error % (gap)
Baseline VGG19     0%            20M     400M   6.23 (0)
L1 [23]            30%           9M      198M   16.94 (+8.41)
                   50%           5M      100M   16.51 (+7.98)
                   70%           1M      37M    16.17 (+7.64)
Taylor [32]        30%           10M     156M   9.82 (+2.29)
                   50%           5M      72M    11.94 (+3.41)
                   70%           1.9M    24M    16.85 (+8.32)
DCP [46]           30%           10M     221M   5.80 (-0.65)
                   50%           6M      158M   7.76 (+1.53)
                   70%           6M      158M   7.86 (+1.63)
PSFP [12]          30%           9M      198M   8.98 (+2.75)
                   50%           5M      100M   11.20 (+4.97)
                   70%           1M      37M    12.06 (+5.83)
PGP_GN_G (ours)    30%           9M      198M   7.37 (+1.14)
                   50%           5M      100M   8.38 (+2.15)
                   70%           1M      37M    9.70 (+3.47)
RPGP_GN_S (ours)   30%           9M      198M   7.65 (+1.42)
                   50%           5M      100M   8.79 (+2.56)
                   70%           1M      37M    10.56 (+4.33)
Table 4: ResNet56 on CIFAR-10.

Methods            Pruning rate  Params  FLOPS  Error % (gap)
Baseline ResNet56  0%            855K    128M   6.02 (0)
L1 [23]            30%           431K    67M    13.34 (+7.32)
                   50%           215K    32M    15.57 (+9.55)
                   70%           84K     13M    17.89 (+11.87)
Taylor [32]        40%           491K    51M    13.90 (+7.88)
                   50%           268K    23M    15.34 (+9.32)
                   70%           100K    8M     22.10 (+16.08)
DCP [46]           30%           600K    90M    5.67 (-0.35)
                   50%           430K    65M    6.43 (+0.41)
                   70%           270K    41M    7.18 (+1.16)
PSFP [12]          30%           431K    67M    8.94 (+2.92)
                   50%           215K    32M    10.93 (+4.91)
                   70%           84K     13M    14.18 (+8.16)
PGP_GN_G (ours)    30%           431K    67M    8.95 (+2.93)
                   50%           215K    32M    10.59 (+4.57)
                   70%           84K     13M    13.02 (+7.00)
RPGP_GN_S (ours)   30%           431K    67M    9.37 (+3.35)
                   50%           215K    32M    10.46 (+4.44)
                   70%           84K     13M    14.16 (+8.14)
From Tabs. 3 and 4, our techniques consistently perform better than the state-of-the-art techniques L1, Taylor and PSFP on VGGNet. For ResNet, PSFP uses a different pruning strategy: it does not prune the downsampling layer and therefore does not prune the last convolutional layer of the residual branch. This translates into slightly better accuracy under some settings. Our ablation study also provides a comparison of techniques using the same pruning strategy on ResNet, and shows the importance of momentum pruning. DCP performs better than our methods on this dataset, mainly because of the additional losses that help select discriminative filters. However, a direct comparison is difficult, since the methods do not yield the same number of FLOPS and parameters, and DCP starts from a trained model and requires more computational power.
4.2 Ablation study
The training and pruning time of a model are important factors for a technique, for instance when deploying or adapting a model in an operational environment. One advantage of progressive pruning techniques is the reduction of processing time at each epoch, since filters are removed while training. Tab. 5 presents the training and pruning times of the evaluated techniques. For progressive pruning techniques, the values include both pruning and training times, while for DCP, L1 and iterative (Taylor) pruning, the values represent (training time) + pruning and retraining times. Experiments are conducted on the CIFAR-10 dataset with the same settings as above.
Table 5: Training and pruning times (in minutes) on CIFAR-10, for pruning rates of 0.5 and 0.9.

                 VGG19                        ResNet56
Methods          0.5           0.9            0.5           0.9
Baseline         219m          219m           307m          307m
L1 [23]          (219) + 32m   (219) + 32m    (307) + 48m   (307) + 48m
Taylor [32]      (219) + 254m  (219) + 457m   (307) + 488m  (307) + 878m
DCP [46]         -             -              (307) + 489m  (307) + 443m
PSFP [12]        219m          219m           307m          307m
PGP (ours)       329m          329m           441m          441m
RPGP (ours)      211m          168m           263m          241m
From Tab. 5, the fastest pruning method (without considering training time) is L1. However, the original training of the model takes around 219 minutes for VGG19 and 307 minutes for ResNet56, so when training time is also taken into account, L1 is slower than our approach. Other techniques like Taylor prune iteratively, removing several feature maps and fine-tuning at each step; this can be very slow, depending on the number of filters pruned at each iteration. DCP is particularly slow, since it needs to start from an already trained model and then run the filter-pruning optimization followed by fine-tuning. PSFP has a time similar to the original training, since it does not actually change the size of the model during training. Between PGP and RPGP, the difference is the use of an entire extra epoch to compute the pruning criterion with PGP, versus the direct computation of the criterion during a training epoch with RPGP. Also, since we hard-prune filters at each epoch, the epoch time becomes shorter as the model is pruned/trained. Overall, the progressive pruning methods train and prune in considerably less time than the other methods.
To compare the selection criteria, we use the same configuration as in the general comparison for RPGP on CIFAR-10, except that we vary the criterion and set a pruning rate of 50%.
Table 6: Error (%) of RPGP on CIFAR-10 with different selection criteria.

Networks   L2      Taylor   TW      GN_G    GN_S
VGG19      8.47%   9.27%    8.78%   8.47%   8.79%
ResNet56   10.30%  10.97%   10.46%  10.24%  10.28%
In Tab. 6, we can see that our criterion performs better than the others in the context of progressive pruning, and similarly to the L2 norm. The comparison between Taylor on weights (TW) and the gradient norm (GN) shows that a small gradient norm during training may be a good indicator of the importance of a filter. From the table, we can also see that Taylor on weights performs better than the original Taylor criterion. Overall, $GN_G$, which uses the batch gradient to capture changes, seems to work best with progressive pruning. The similarity between the L2 and gradient norm criteria is explained in the Supplementary Material.
In this experiment on momentum pruning, the same strategy, hyperparameters and L2 criterion are used for both RPGP and PSFP. The only difference is that RPGP performs momentum pruning.
Table 7: Error (%) with and without momentum pruning (L2 criterion, CIFAR-10).

Method   VGG19    ResNet56
PSFP     11.20%   10.93%
RPGP     8.47%    10.09%
From Tab. 7, in both cases (VGG19 and ResNet56), our proposed method performs better than the state-of-the-art PSFP method. Since everything is identical in this setting except the momentum pruning, this clearly shows the advantage of pruning the momentum during progressive pruning.
As described, PSFP does not prune the downsampling layer of ResNet56 and, thus, does not prune the last layer of the residual connection. The performance of PSFP and RPGP is compared using the same strategy on ResNet56, i.e., the downsampling layer and last layer of the residual connection are not pruned, on the CIFAR-10 dataset with the same hyperparameters as in previous experiments. The results in Tab. 8 indicate that the RPGP approach typically performs better than PSFP. Interestingly, when no pruning is performed on the downsampling layer and the last layer of the residual connection, our method performs much better. The residual connection is sensitive to pruning and may require a different pruning strategy.
Table 8: Error (%) on CIFAR-10 with ResNet56 when the downsampling layer and the last layer of the residual connection are not pruned.

Methods        30%    50%    70%    90%
PSFP           8.94   10.93  14.18  28.09
RPGP (GN_S)    8.87   10.09  11.02  13.94
4.3 Detection
PASCAL VOC:
In this case, the PGP, RPGP and PSFP techniques are adapted to an object detection problem. We progressively prune a Faster R-CNN with a VGG16 backbone, using a learning rate of 0.001 and momentum of 0.9, with 10 epochs of progressive pruning and early stopping for fine-tuning over a few epochs. For L1 pruning, 50% of the filters of a trained model are pruned, and the model is then fine-tuned on PASCAL VOC. For this experiment, we set the pruning rate hyperparameter to 0.5 (50%) and report the mean average precision (mAP) for comparison. We skip pruning of the last layer, since it would mean pruning the input of the RPN layer, which we empirically found to result in a significant performance reduction. In Tab. 9, PGP and RPGP perform better than PSFP, the current state-of-the-art progressive pruning method. However, PGP needs more time to prune, due to the calculation of the criterion in a separate epoch. RPGP provides slightly better performance (possibly due to stochasticity), with much less pruning time. The difference in accuracy between RPGP and PSFP highlights the importance of momentum pruning with these approaches. The significant difference in training time between RPGP and PSFP also suggests that adding hard pruning to existing soft pruning during training can reduce training time.
Table 9: Pruning Faster R-CNN (VGG16 backbone) on PASCAL VOC.

Methods            Params  FLOPS  mAP    Training Time
Baseline VGG16     137M    250G   69.6%  428m
L1 [23]            125M    174G   62.3%  (428) + 31m
PSFP [12]          125M    174G   63.5%  428m
PGP_GN_G (ours)    125M    174G   65.5%  769m
RPGP_GN_S (ours)   125M    174G   66.0%  281m
4.4 Domain Adaptation
Office-31:
We adapted our pruning technique to domain adaptation. Among unsupervised domain adaptation techniques [8, 28, 41], we chose the Deep Adaptation Network (DAN) [28] with a VGG16 backbone, since it is a popular technique; however, our pruning technique could also be adapted to other unsupervised domain adaptation methods such as ADDA [41] or DANN [8]. For this experiment, we train our model for 400 epochs of joint domain adaptation and pruning, with an exponentially decaying learning rate starting at 0.001, a pruning rate of 20% and a remove rate of 0.3. In Tab. 10, we see that our method performs better than TCP in the majority of cases, while having a higher FLOPS reduction. We explain the improvement over TCP mainly by two factors. First, the use of our pruning criterion: as shown in our ablation study, a Taylor criterion based on weights is better than its feature-map counterpart. Second, we think that progressive pruning is better suited for domain adaptation than the commonly used scheme in which the model is first adapted to the target domain, then pruned, and finally fine-tuned. In our approach, we perform pruning and domain adaptation simultaneously in a single training (see the sketch after Tab. 10). In this way, instead of trying to adapt all filters of the model, we can directly remove those filters that are specific to a certain domain and do not help in bridging the domain discrepancy. This eases the learning of transferable features in the case of DAN [28] or domain confusion in the case of ADDA [41].
Table 10: Accuracy (%) on the Office-31 transfer tasks (rows) and their average, with a VGG16 backbone.

                     Source-only  Baseline VGG  TCP    RPGP_GN_S (ours)
FLOPS Reduction %    0%           0%            26%    35%
                     68.5         74.0          76.1   78.2
                     61.1         72.3          76.2   77.7
                     41.6         55.2          51.2   51.6
                     94.3         97.5          99.8   99.4
                     94.5         94.0          96.1   96.5
                     50.3         54.1          47.9   48.0
Average              68.3         74.5          74.5   75.2
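A minimal sketch of this joint scheme is given below. It assumes a DAN-style setup in which the backbone returns both features and class logits, and a maximum-mean-discrepancy loss `mmd_loss` between source and target features; both are hypothetical names used for illustration only. The RPGP criterion is accumulated on the adaptation loss itself, so the filters that are progressively removed are those that contribute least to it:

```python
import torch

def joint_adaptation_pruning_epoch(model, conv_layer, src_loader, tgt_loader,
                                   optimizer, cls_loss, mmd_loss, lam=1.0):
    """One epoch of joint DAN-style adaptation and RPGP criterion accumulation."""
    scores = torch.zeros(conv_layer.weight.shape[0])
    for (xs, ys), (xt, _) in zip(src_loader, tgt_loader):
        optimizer.zero_grad()
        feat_s, logits_s = model(xs)          # source batch: labeled
        feat_t, _ = model(xt)                 # target batch: unlabeled
        loss = cls_loss(logits_s, ys) + lam * mmd_loss(feat_s, feat_t)
        loss.backward()
        # GN_S criterion (Equ. 11) accumulated on the full adaptation loss.
        scores += conv_layer.weight.grad.detach().abs().sum(dim=(1, 2, 3))
        optimizer.step()
    return scores   # then select and prune weak filters as in Section 3.1
```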
5 Conclusion
In this paper, we show that it is possible to efficiently prune a deep learning model from scratch with the PGP technique, while improving the trade-off between compression, accuracy and training time. PGP is a new progressive pruning technique that relies on the change in filter weights and applies hard and soft pruning strategies that allow for pruning along the back-propagation path. The filter selection criterion is well adapted to progressive pruning from scratch when the norm of the gradient is considered. Results obtained from pruning various CNNs on image data for classification and object detection problems show that the proposed PGP maintains a high level of accuracy with compact networks. Results also show that PGP can achieve better CNN optimisations than PSFP, often translating into a higher level of accuracy for the same pruning rate as PSFP and other state-of-the-art techniques. On domain adaptation problems, our technique outperforms the current state-of-the-art technique while pruning more. Future research will involve analyzing the performance of different CNNs pruned using the proposed method on larger datasets from real-world visual recognition problems (e.g., tracking and recognition of persons in video surveillance) and on different domain adaptation architectures.
Supplementary Material
Appendix A Additional Experimental Results
A.1 Implementation Details
One of the problems of pruning during training is how to handle the shape of the gradient and momentum tensors during the backward pass. In the case of PyTorch [34], the shapes of the gradient and momentum tensors are usually handled by the optimizer, which does not necessarily update them during the forward pass. Also, naively redefining a new optimizer with the newly pruned model would result in losing all values accumulated in the momentum buffer. One way to overcome this is to also prune the gradient and momentum tensors, using the same indexes used to prune the weight tensor, and then transfer them to a newly defined optimizer.
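A minimal sketch of this workaround is given below. It assumes the pruned momentum buffers have already been built by indexing the old buffers with the same kept-filter indexes used for the weights; the mapping `pruned_buffers` is a hypothetical name used for illustration:

```python
import torch

def rebuild_optimizer(model, lr, momentum, pruned_buffers):
    """Create a fresh SGD optimizer after hard pruning changed parameter shapes,
    and copy the (already pruned) momentum buffers into its state so that the
    accumulated velocity is not lost.

    pruned_buffers: dict mapping each new (pruned) Parameter to its momentum
                    buffer, e.g. old_buffer[keep_idx] for a conv weight.
    """
    optimizer = torch.optim.SGD(model.parameters(), lr=lr, momentum=momentum)
    for param, buf in pruned_buffers.items():
        optimizer.state[param]['momentum_buffer'] = buf.clone()
    return optimizer
```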
A.2 Graphical comparison on CIFAR-10 with VGG:
The results presented in this section are similar to the ones shown in Tabs. 1 to 4 of our paper. In the main paper, we could only compare the performance of the methods at a few pruning rates due to space constraints. In this section, we compare the performance of the methods using the same experimental settings (as in our paper), but with 10 pruning rates, for L1 [23], Taylor [32], PSFP [12] and our approach. Since the number of remaining parameters can differ slightly from one algorithm to the other, some of the values on the X-axis are rounded for better visualization.
Results in Figure 3 show the proposed PGP and RPGP pruning methods consistently outperforming the other methods. Note that the proposed methods maintain a low level of error even with an important increase in the pruning rate.
A.3 L2 vs. Gradient Norm:
From the ablation study, we noticed that the performance of the L2 and gradient norm criteria is very similar in the case of soft pruning. This can be understood by considering the following (ignoring the momentum term for simplicity):
$W_T = W_0 - \eta \sum_{t=0}^{T-1} \frac{\partial \mathcal{L}_t}{\partial W_t}$   (12)
where $W_t$ represents the weights of a filter at iteration $t$ in an epoch, $\eta$ is the learning rate, and $\mathcal{L}_t$ denotes the loss function at iteration $t$. From Equ. 12, we can observe that the difference between the L2 criterion and the gradient norm lies in the initial value $W_0$. Taking into account the partial soft pruning nature of our approach, $W_0$ can be zero when the filter has been soft pruned. Therefore, the two criteria tend to have similar values (since $\eta$ is a scalar, it is not important in this context).
A.4 Progressive pruning from scratch vs. pretrained:
Tab. 11 shows the performance obtained by a model that was randomly initialized (scratch) versus one that was pretrained on CIFAR-10, using the same settings as before.
Table 11: Error (%) when pruning from scratch vs. from a pretrained model (CIFAR-10).

Training Scenario   VGG19    ResNet56
Scratch             8.79%    10.46%
Pretrained          8.23%    9.51%
From Tab. 11, the difference in accuracy between a network pruned from scratch and a network pruned after training is quite small and can vary depending on the architecture. Overall, instead of starting from a trained model and pruning it, the proposed techniques can attain similar performance starting from a randomly initialized model, and thus with reduced training and pruning time, which makes them more suitable for fast deployment.
A.5 Hard vs. soft pruning:
RPGP is used with our gradient criterion, a target pruning rate of 50% and the same hyperparameters as before. The removal rate is varied in order to see the impact of allowing more or less recovery.
Table 12: Error (%) of RPGP on CIFAR-10 for different removal rates.

Networks    0.3     0.5     0.7     1.0
VGG19       8.74%   8.79%   8.99%   8.92%
ResNet56    10.57%  10.46%  11.03%  10.78%
The results in Tab. 12 show that a remove rate of 0.3 (30%) or 0.5 (50%) offers the best balance between the amount of hard and soft pruning. It is also interesting to see that, without any soft pruning (remove rate of 1.0), the performance of the approach is still close to that of the other removal rates.
Footnotes
 Since DCP’s code, provided by the authors, did not handle nonresidual architecture, we had to modified the original code. Pruning rate above 50% are struck on LeNet and VGG19
References
(2016) Learning the number of neurons in deep networks. In Advances in Neural Information Processing Systems 29, pp. 2270–2278.
(2017) Compression-aware training of deep networks. In Advances in Neural Information Processing Systems 30, pp. 856–867.
(2017) Deep learning with low precision by half-wave gaussian quantization. CoRR abs/1702.00953.
(2016) BinaryNet: training deep neural networks with weights and activations constrained to +1 or -1. CoRR abs/1602.02830.
(1990) Optimal brain damage. In Advances in Neural Information Processing Systems, pp. 598–605.
(2016) R-FCN: object detection via region-based fully convolutional networks. In Advances in Neural Information Processing Systems 29, pp. 379–387.
(2018) SYQ: learning symmetric quantization for efficient deep neural networks. CoRR abs/1807.00301.
(2014) Unsupervised domain adaptation by backpropagation. arXiv:1409.7495.
(2015) Deep compression: compressing deep neural networks with pruning, trained quantization and huffman coding. CoRR abs/1510.00149.
(2015) Learning both weights and connections for efficient neural networks. In Advances in Neural Information Processing Systems 28, pp. 1135–1143.
(2015) Deep residual learning for image recognition. CoRR abs/1512.03385.
(2018) Progressive deep neural networks acceleration via soft filter pruning. CoRR abs/1808.07471.
(2018) Soft filter pruning for accelerating deep convolutional neural networks. CoRR abs/1808.06866.
(2019) Filter pruning via geometric median for deep convolutional neural networks acceleration. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
(2017) Channel pruning for accelerating very deep neural networks. CoRR abs/1707.06168.
(2016) Network trimming: a data-driven neuron pruning approach towards efficient deep architectures. CoRR abs/1607.03250.
(2016) Densely connected convolutional networks. CoRR abs/1608.06993.
(2016) Speed/accuracy trade-offs for modern convolutional object detectors. CoRR abs/1611.10012.
(2014) Speeding up convolutional neural networks with low rank expansions. CoRR abs/1405.3866.
(2017) ImageNet classification with deep convolutional neural networks. Communications of the ACM 60 (6), pp. 84–90.
(2014) Speeding-up convolutional neural networks using fine-tuned CP-decomposition. CoRR abs/1412.6553.
(2019) Structured pruning of neural networks with budget-aware regularization. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
(2016) Pruning filters for efficient convnets. CoRR abs/1608.08710.
(2015) Sparse convolutional neural networks. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
(2019) Channel pruning based on mean gradient for accelerating convolutional neural networks. Signal Processing 156, pp. 84–91.
(2016) SSD: single shot multibox detector. In ECCV (1), Lecture Notes in Computer Science, Vol. 9905, pp. 21–37.
(2017) Learning efficient convolutional networks through network slimming. CoRR abs/1708.06519.
(2015) Learning transferable features with deep adaptation networks. CoRR abs/1502.02791.
(2017) ThiNet: a filter level pruning method for deep neural network compression. CoRR abs/1707.06342.
(2017) An entropy-based pruning method for CNN compression. CoRR abs/1706.05791.
(2019) Importance estimation for neural network pruning. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
(2016) Pruning convolutional neural networks for resource efficient transfer learning. CoRR abs/1611.06440.
(2017) A comparison of CNN-based face and head detectors for real-time video surveillance applications. In 2017 Seventh International Conference on Image Processing Theory, Tools and Applications (IPTA), pp. 1–7.
(2017) Automatic differentiation in PyTorch. In NIPS Workshops.
(2016) XNOR-Net: imagenet classification using binary convolutional neural networks. CoRR abs/1603.05279.
(2018) YOLOv3: an incremental improvement. CoRR abs/1804.02767.
(2018) SBNet: sparse blocks network for fast inference. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
(2015) Faster R-CNN: towards real-time object detection with region proposal networks. CoRR abs/1506.01497.
(2014) Very deep convolutional networks for large-scale image recognition. CoRR abs/1409.1556.
(2015) Convolutional neural networks with low-rank regularization. CoRR abs/1511.06067.
(2017) Adversarial discriminative domain adaptation. arXiv:1702.05464.
(2016) Learning structured sparsity in deep neural networks. CoRR abs/1608.03665.
(2017) Coordinating filters for faster deep neural networks. CoRR abs/1703.09746.
(2018) A systematic DNN weight pruning framework using alternating direction method of multipliers. In The European Conference on Computer Vision (ECCV).
(2017) Incremental network quantization: towards lossless CNNs with low-precision weights. CoRR abs/1702.03044.
(2018) Discrimination-aware channel pruning for deep neural networks. In Advances in Neural Information Processing Systems 31, pp. 883–894.