Structured Pruning of Neural Networks with Budget-Aware Regularization


Pruning methods have been shown to be effective at reducing the size of deep neural networks while keeping accuracy almost intact. Among the most effective methods are those that prune a network while training it with a sparsity prior loss and learnable dropout parameters. A shortcoming of these approaches, however, is that neither the size nor the inference speed of the pruned network can be controlled directly; yet this is a key feature for targeting deployment of CNNs on low-power hardware. To overcome this, we introduce a budgeted regularized pruning framework for deep CNNs. Our approach naturally fits into traditional neural network training as it consists of a learnable masking layer, a novel budget-aware objective function, and the use of knowledge distillation. We also provide insights on how to prune a residual network and how this can lead to new architectures. Experimental results reveal that CNNs pruned with our method are more accurate and less compute-hungry than those pruned with state-of-the-art methods. Also, our approach is more effective at preventing accuracy collapse in case of severe pruning; this allows pruning factors of up to 16x without significant accuracy drop.¹


1 Introduction

Convolutional Neural Networks (CNNs) have proven to be effective feature extractors for many computer vision tasks [12, 15, 18, 30]. The design of several CNNs involves many heuristics, such as using increasing powers of two as the number of feature maps, or width, of each layer. While such heuristics allow achieving excellent results, they may be too crude in situations where the amount of compute power and memory is restricted, such as with mobile platforms. Thus arises the problem of finding the right number of layers that solve a given task while respecting a budget. Since the number of layers depends highly on the effectiveness of the learned filters (and their combination), one cannot determine these hyper-parameters a priori.

Convolution operations constitute the main computational burden of a CNN. The execution of these operations benefits from a high degree of parallelism, which requires them to have regular structures. This implies that one cannot remove isolated neurons from a CNN filter, as filters must be full grids. To achieve the same effect as removing a neuron, one can zero out its weights. While doing this reduces the theoretical size of the model, it reduces neither the computational demands of the model nor the amount of feature map memory. Therefore, to accelerate a CNN and reduce its memory footprint, one has to rely on structured sparsity pruning methods that aim at reducing the number of feature maps and not just individual neurons.

By removing unimportant filters from a network and retraining it, one can shrink it while maintaining good performance [10, 19]. This can be explained by the following hypothesis: the initial value of a filter’s weights is not guaranteed to allow the learning of a useful feature; thus, a trained network might contain many expendable features [7].

Among the structured pruning methods, those that implement a sparsity learning (SL) framework have been shown to be effective, as pruning and training are done simultaneously [1, 17, 21, 22, 24, 26]. Unfortunately, most SL methods cannot prune a network while respecting a neuron budget imposed by the very nature of the device on which the network shall be deployed. As of today, pruning a network while respecting a budget can only be done by trial and error, typically by training a network multiple times with various pruning hyperparameters.

In this paper, we present an SL framework which allows learning and selecting filters of a CNN while respecting a neuron budget. Our main contributions are:

  • We present a novel objective function which includes a variant of the log-barrier [2] function for simultaneously training and pruning a CNN while respecting a total neuron budget;

  • We propose a variant of the barrier method [2] for optimizing a CNN;

  • We demonstrate the effectiveness of combining SL and knowledge distillation [14];

  • We empirically confirm the existence of the automatic depth determination property of residual networks pruned with filter-wise methods, and give insights on how to ensure the viability of the pruned network by preventing “fatal pruning”;

  • We propose a new mixed-connectivity block which roughly doubles the effective pruning factors attainable with our method.

2 Previous Works

The fact that neural networks can be compressed without much loss of accuracy implies that they are often over-parametrized. Denil et al. [5] have shown that typical neural networks are over-parametrized; in the best case of their experiments, they could predict 95% of the network weights from the others. Recent work by Frankle et al. [7] supports the hypothesis that a large proportion (typically 90%) of weights in standard neural networks are initialized to a value that will lead to an expendable feature. In this section, we review five categories of methods for reducing the size of a neural network.

Neural network compression aims to reduce the storage requirements of the network’s weights. In [6, 16], low-rank approximation through matrix factorization, such as singular-value decomposition, is used to factorize the weight matrices. The factors’ rank is reduced by keeping only the leading singular values and their associated singular vectors. In [8], quantization is used to reduce the storage taken by the model; both scalar quantization and vector quantization (VQ) have been considered. Using VQ, a weight matrix can be reconstructed from a list of indices and a dictionary of vectors. Thus, substantial storage savings can be obtained. Unfortunately, most network compression methods do not decrease the memory and compute usage during inference.

Neural network pruning consists of identifying and removing neurons that are not necessary for achieving high performance. Some of the first approaches used the second-order derivative to determine the sensitivity of the network to the value of each weight [19, 11]. A more recent, very simple and effective approach selects which neurons to remove by thresholding the magnitude of their weights; smaller magnitudes are associated with unimportant neurons [10]. The resulting network is then finetuned for better performance. Nonetheless, experimental results (c.f. Section 5) show that variational pruning methods (discussed below) outperform the previously mentioned works.

Sparsity Learning (SL) methods aim at pruning a network while training it. Some methods add to the training loss a regularization function such as $L_1$ [21], Group LASSO [33], or an approximation of the $L_0$ norm [22, 27]. Several variational methods have also been proposed [1, 17, 26, 24]. These methods formalize the problem of network pruning as a problem of learning the parameters of a dropout probability density function (PDF) via the reparametrization trick [17]. Pruning is enforced via a sparsity prior that derives from a variational evidence lower bound (ELBO). In general, SL methods do not apply an explicit constraint to limit the number of neurons used. To enforce a budget, one has to turn towards budgeted pruning.

Budgeted pruning is an approach that provides a direct control on the size of the pruned network via some “network size” hyper-parameter. MorphNet [9] alternates between training with a sparsifying regularizer and applying a width multiplier to the layer widths to enforce the budget. Contrary to our method, this work does not leverage dropout-based SL. Budgeted Super Networks [32] is a method that finds an architecture satisfying a resource budget by sparsifying a super network at the module level. This method is less convenient to use than ours, as it requires “neural fabric” training through reinforcement learning. Another budgeted pruning approach is “Learning-Compression” [3], which uses the method of auxiliary coordinates [4] instead of back-propagation. Contrary to this method, our approach adopts a usual gradient descent optimization scheme, and does not rely on the magnitude of the weights as a surrogate of their importance.

Architecture search (AS) is an approach that led to efficient neural networks in terms of performance and parameterization. Using reinforcement learning and vast amounts of processing power, NAS [35] has learned novel architectures: some advanced the state of the art, while others had relatively few parameters compared to similarly effective hand-crafted models. PNAS [20] and ENAS [29] have extended this work by cutting the necessary compute resources. These works have been aggregated by EPNAS [28]. AS is orthogonal to our line of work, as the learned architectures could be pruned by our method. In addition, AS is more complicated to implement as it requires learning a controller model by reinforcement learning. In contrast, our method features tools widely used in CNN training.

3 Our Approach

3.1 Dropout Sparsity Learning

Before we introduce the specifics of our approach, let us first summarize the fundamental concepts of Dropout Sparsity Learning (DSL).

Let $h_\ell$ be the output of the $\ell$-th hidden layer of a CNN, computed by $f_\ell(h_{\ell-1})$, a transformation of the previous layer, typically a convolution followed by a batch norm and a non-linearity. As mentioned before, one way of reducing the size of a network is to shut down neurons with an element-wise product between the output of layer $\ell$ and a binary tensor $z_\ell$:

$$h_\ell = f_\ell(h_{\ell-1}) \odot z_\ell. \tag{1}$$

To enforce structured pruning and shut down feature maps (not just individual neurons), one can redefine $z_\ell$ as a vector of size $d_\ell$, where $d_\ell$ is the number of feature maps in $h_\ell$. Then, $z_\ell$ is applied over the spatial dimensions by performing an element-wise product with $f_\ell(h_{\ell-1})$.

As one might notice, Eq. (1) is the same as that of dropout [31], for which $z_\ell$ is a tensor of i.i.d. random variables drawn from a Bernoulli distribution. To prune a network, DSL redefines $z_\ell$ as random variables sampled from a distribution whose parameters $\Phi$ can be learned while training the model. In this way, the network can learn which feature maps to drop and which ones to keep.

Since the operation of sampling $z$ from a distribution is not differentiable, it is common practice to redefine it with the reparametrization trick [17]:

$$z = g(\Phi, \epsilon), \tag{2}$$

where $g$ is a continuous function differentiable with respect to $\Phi$ and stochastic with respect to $\epsilon$, a random variable typically sampled from $\mathcal{N}(0, 1)$ or $\mathcal{U}(0, 1)$.
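As an illustration of the masking and reparametrization steps above, the sketch below samples one gate per feature map with a Hard-Concrete style function (the distribution this paper adopts in Section 3.3, following [22]) and scales each feature map by its gate. It is a minimal pure-Python sketch, not the authors' implementation: the stretch limits `GAMMA`, `ZETA`, the temperature `BETA`, and the function names are assumptions chosen for illustration.

```python
import math
import random

# Stretch limits and temperature of the gate. These are commonly used
# Hard-Concrete defaults and are assumptions, not values from this paper.
GAMMA, ZETA, BETA = -0.1, 1.1, 2.0 / 3.0


def hard_concrete_gate(log_alpha, eps):
    """Deterministic map g(phi, eps): noise eps in (0, 1) -> gate in [0, 1].

    The map is differentiable w.r.t. the learnable parameter log_alpha,
    while all the randomness comes from eps.
    """
    logit = (math.log(eps) - math.log(1.0 - eps) + log_alpha) / BETA
    s = 1.0 / (1.0 + math.exp(-logit))
    stretched = s * (ZETA - GAMMA) + GAMMA
    return min(1.0, max(0.0, stretched))


def mask_feature_maps(feature_maps, log_alphas, rng):
    """Scale each feature map (a 2-D list) by its own sampled gate."""
    masked = []
    for fmap, log_alpha in zip(feature_maps, log_alphas):
        eps = rng.uniform(1e-6, 1.0 - 1e-6)  # avoid log(0)
        z = hard_concrete_gate(log_alpha, eps)
        masked.append([[z * value for value in row] for row in fmap])
    return masked
```

A strongly negative `log_alpha` drives the gate to exactly zero for almost all noise values, which is what makes an entire feature map removable later on.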

In order to enforce network pruning, one usually incorporates a two-term loss:

$$L(W, \Phi) = L_D(W, \Phi) + \lambda L_S(\Phi), \tag{3}$$

where $\lambda$ is the prior’s weight, $W$ are the parameters of the network, $L_D$ is a data loss that measures how well the model fits the training data (e.g. the cross-entropy loss) and $L_S$ is a sparsity loss that measures how sparse the model is. While $L_S$ varies from one method to another, the KL divergence between the learned distribution of $z$ and some prior distribution is typically used by variational approaches [17, 24]. Note that during inference, one can make the network deterministic by replacing the random variable $\epsilon$ by its mean.

3.2 Soft and hard pruning

As mentioned before, $g$ is a continuous function differentiable with respect to $\Phi$. Thus, instead of being binary, the pruning of Eq. (2) becomes continuous (soft pruning), so there is always a non-zero probability that a feature map will be activated during training. However, to achieve practical speedups, one eventually needs to “hard-prune” filters. To do so, once training is over, the values of $z$ are thresholded to select which filters to remove completely. Then, the network may be fine-tuned for several epochs with the $L_D$ loss only, to let the network adapt to hard-pruning. We call this the “fine-tuning phase”, and the earlier epochs constitute the “training phase”.

3.3 Budget-Aware Regularization (BAR)

In our implementation, a budget is the maximum number of neurons a “hard-pruned” network is allowed to have. To compute this metric, one may replace $\epsilon$ by its mean so feature maps with $z = 0$ have no effect and can be removed, while the others are kept. The network size is thus the total activation volume $V$ of the structurally “hard-pruned” network:

$$V = \sum_{\ell} A_\ell \sum_{i=1}^{d_\ell} \mathbb{1}(z_{\ell,i} > 0), \tag{4}$$

where $A_\ell$ is the area of the output feature maps of layer $\ell$, $d_\ell$ is its number of feature maps, $z_{\ell,i}$ is the mask value of its $i$-th feature map, and $\mathbb{1}$ is the indicator function. Our training process is described in Algorithm 1.
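The volume metric described above (count the feature maps whose deterministic mask is non-zero, weighted by the spatial area of each layer) can be sketched as follows. The function name and the list-based data layout are assumptions made for illustration.

```python
def hard_pruned_volume(mean_gates, areas):
    """Total activation volume of the hard-pruned network.

    mean_gates: one list per layer, holding the deterministic mask value of
                each feature map (the gate evaluated at the mean of the noise).
    areas:      one integer per layer, the height * width of its output maps.
    """
    total = 0
    for gates, area in zip(mean_gates, areas):
        kept = sum(1 for z in gates if z > 0.0)  # indicator function
        total += kept * area
    return total
```

For example, with a first layer of 32x32 maps keeping two of three feature maps and a second layer of 16x16 maps keeping one of three, `hard_pruned_volume([[0.0, 0.7, 1.0], [0.0, 0.0, 0.2]], [1024, 256])` returns `2304`.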

Data: $W$: network weights; $\Phi$: r.v. parametrization; TeacherLogits: the class-wise scores for all samples of the dataset; all the hyperparameters of the method (including the budget); Prog: progress of the training process; $g$: function introduced in Eq. (2) of the paper; the predicted class-wise logits.
Result: PrunedNet: the pruned neural network object including its weights.
    $W$ ← TrainUnprunedNetwork()
    TeacherLogits ← PredictWholeDataset()
    for each training iteration do
        ForwardPass()
        BARLoss()
        BackwardPass()
        OptimizationStep()
    PruningMasks ← masks obtained by thresholding $z$ (see Section 3.2)
    PrunedNet ← ConvertNet($W$, PruningMasks)
Algorithm 1 BAR Training

A budget constraint requires $V$ to be smaller than the allowed budget $b$. If embedded in a sparsity loss, that constraint makes the loss go to infinity when $V > b$, and zero otherwise. This is a typical inequality constrained minimization problem whose binary (and non-differentiable) behavior is not suited to gradient descent optimization. One typical solution to such a problem is the log-barrier method [2]. The idea of this barrier method is to approximate the zero-to-infinity constraint by a differentiable logarithmic function: $-(1/t)\log(b - V)$, where $t > 0$ is a parameter that adjusts the accuracy of the approximation and whose value increases at each optimization iteration (c.f. Algo 11.1 in [2]).

Unfortunately, the log-barrier method requires beginning optimization with a feasible solution (i.e. $V < b$), and this brings two major problems. First, we need to compute $\Phi$ such that $V < b$, which is no trivial task. Second, this induces a setting similar to training an ensemble of pruned networks, as the probability that a feature map is “turned on” is very low. This means that filters will receive little gradient and will train very slowly. To avoid this, we need to start training with a $V$ larger than the budget.

We thus implemented a modified version of the barrier algorithm. First, as will be shown in the rest of this section, we propose a barrier function $f(V, a, b)$ as a replacement for the log barrier function (c.f. Fig. 3). Second, instead of having a fixed budget and a hardness parameter $t$ that grows at each iteration as required by the barrier method, we eliminate the hardness parameter and instead decrease the budget constraint $b$ at each iteration. This budget updating schedule is discussed in Section 3.4.

Figure 3: Comparing barrier functions. (a) Common (logarithmic) barrier function, shown for several values of the hardness parameter. (b) Our barrier function, shown for several values of the budget.

Our barrier function is designed such that:

  • it has an infinite value when the volume used by a network exceeds the budget, i.e. $V > b$;

  • it has a value of zero when the budget is comfortably respected, i.e. $V \le a$;

  • it has $C^1$ continuity.

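Instead of having a jump from zero to infinity at the point where $V = b$, we define a range $[a, b]$ where a smooth transition occurs. To do so, we first perform a linear mapping of $V$:

$$x = \frac{V - a}{b - a},$$

such that $x \le 0$ when $V \le a$ (the budget is comfortably respected), and $x \ge 1$ when $V \ge b$ (our constraint is violated). Then, we use the following function:

$$f(x) = \frac{x^2}{1 - x},$$

which has three useful properties: (i) $f(0) = 0$ and $f'(0) = 0$, (ii) $\lim_{x \to 1^-} f(x) = \infty$ and (iii) it has $C^1$ continuity. Those properties correspond to the ones mentioned before. To obtain the desired function, we substitute $x$ in $f(x)$ and simplify:

$$f(V, a, b) = \begin{cases} 0 & V \le a \\ \dfrac{(V - a)^2}{(b - a)(b - V)} & a < V < b \\ \infty & V \ge b \end{cases}$$

As shown in Fig. 3, like for the log barrier, $V = b$ is an asymptote, as we require $V < b$. However, $a < V < b$ corresponds to a respected budget, and for $V \le a$ the budget is respected with a comfortable margin, which corresponds to a penalty of zero.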
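To make the behavior of the barrier concrete, the sketch below evaluates a piecewise penalty that is zero below the lower margin, infinite at or above the upper margin, and smooth in between. The exact expression used inside the transition range, `x**2 / (1 - x)` with `x = (V - a) / (b - a)`, is an assumption: it is one function that satisfies the three listed properties, not necessarily the authors' exact formula. The function name is also hypothetical.

```python
import math


def barrier(volume, a, b):
    """Budget barrier with lower margin a and upper margin b (a < b).

    Assumed form: f(x) = x**2 / (1 - x) with x = (volume - a) / (b - a),
    written directly in terms of volume, a and b.
    """
    if volume <= a:
        return 0.0          # budget comfortably respected: no penalty
    if volume >= b:
        return math.inf     # budget violated
    return (volume - a) ** 2 / ((b - a) * (b - volume))
```

With `a = 100` and `b = 200`, a volume of 150 sits halfway through the transition range (`x = 0.5`) and gives a penalty of `0.5`, while a volume of 199 already gives a penalty of about 98: the penalty grows without bound as the volume approaches the upper margin.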

Our proposed prior loss is as follows:

$$L_{BAR}(\Phi, V, a, b) = L_S(\Phi)\, f(V, a, b),$$

where $a$ and $b$ are the lower and upper budget margins, $V$ is the current “hard-pruned” volume as computed by Eq. (4), and $L_S(\Phi)$ is a differentiable approximation of $V$. Note that since $f(V, a, b)$ is not differentiable w.r.t. $\Phi$, we cannot solely optimize $f(V, a, b)$.

The content of $L_S(\Phi)$ is bound to $g(\Phi, \epsilon)$. In our case, we use the Hard-Concrete distribution (which is a smoothed version of the Bernoulli distribution), as well as its corresponding prior loss, both introduced in [22]. This prior loss measures the expectation of the number of feature maps currently unpruned. To account for the spatial dimensions of the output tensors of convolutions, we use:

$$L_S(\Phi) = \sum_{\ell} A_\ell\, L_{HC}(\Phi_\ell),$$

where $L_{HC}$ is the hard-concrete prior loss [22] and $A_\ell$ is the area of the output feature maps of layer $\ell$. Thus, $L_S(\Phi)$ measures the expectation of the activation volume of all convolution operations in the network.

Note that $V$ could also be replaced by another metric, such as the total FLOPs used by the network. In this case, $L_S(\Phi)$ should also include the expectation of the number of feature maps of the preceding layer.
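The area-weighted prior loss described above can be sketched as the expected number of active gates per layer multiplied by that layer's feature map area. The closed-form probability that a Hard-Concrete gate is non-zero follows the formulation of [22]; the constants `GAMMA`, `ZETA`, `BETA` and the function names are assumptions chosen for illustration.

```python
import math

# Assumed Hard-Concrete constants (common defaults, not from this paper).
GAMMA, ZETA, BETA = -0.1, 1.1, 2.0 / 3.0


def prob_gate_active(log_alpha):
    """Probability that a Hard-Concrete gate is non-zero."""
    shift = BETA * math.log(-GAMMA / ZETA)
    return 1.0 / (1.0 + math.exp(-(log_alpha - shift)))


def expected_volume(log_alphas_per_layer, areas):
    """Differentiable surrogate of the volume: sum over layers of
    (area of the layer's maps) * (expected number of unpruned maps)."""
    total = 0.0
    for log_alphas, area in zip(log_alphas_per_layer, areas):
        total += area * sum(prob_gate_active(la) for la in log_alphas)
    return total
```

Unlike the hard count of Eq. (4), this quantity varies smoothly with every gate parameter, which is why it can carry gradient while the barrier factor only scales it.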

3.4 Setting the budget margins

As mentioned earlier, initializing the network with a volume that respects the budget (as required by the barrier method) leads to severe optimization issues. Instead, we iteratively shift the pruning target $b$ during training. Specifically, we shift it from $b = V_F$ at the beginning, to $b = B$ at the end (where $V_F$ is the unpruned network’s volume and $B$ the maximum allowed budget).

As shown in Fig. 3b, doing so induces a lateral shift to the “barrier”. This is unlike the barrier method in which the hardness parameter evolves in time (c.f. Fig. 3a). Mathematically, the budget evolves as follows:

$$b = (1 - T(i))\, V_F + T(i)\, B,$$

while $a$ is fixed. Here $T(i)$ is a transition function which goes from zero at the first iteration all the way to one at the last iteration. While $T(i)$ could be a linear transition schedule, experimental results reveal that when $V$ approaches $b$, some gradients suffer from extreme spikes due to the nature of $f$. This leads to erratic behavior towards the end of the training phase that can hurt performance. One may also implement an exponential transition schedule. This could compensate for the shape of $f$ by having $b$ change quickly during the first epochs and slowly towards the end of training. While this gives good results for severe pruning (up to 16x), the increased stress at the beginning yields sub-optimal performance for low pruning factors.

For our method, we propose a sigmoidal schedule, where $b$ changes slowly at the beginning and at the end of the training phase, but quickly in the middle. This puts most of the “pruning stress” in the middle of the training phase, which accounts for the difficulty of pruning (1) during the first epochs, where the filters’ relevance is still unknown, and (2) during the last epochs, where more compromises might have to be made. The sigmoidal transition function is illustrated in Fig. 4 (c.f. Supp. materials for details).
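The budget schedule can be sketched as an interpolation between the unpruned volume and the target budget, driven by a sigmoid of the training progress. The exact parametrization is given in the supplementary materials and is not reproduced here; the steepness value, the rescaling to hit exactly 0 and 1 at the endpoints, and the function names below are assumptions.

```python
import math


def sigmoid_transition(progress, steepness=10.0):
    """Transition from 0 (progress = 0) to 1 (progress = 1).

    A logistic curve centered at mid-training, rescaled so that the
    endpoints are exactly 0 and 1. The steepness is an assumed value.
    """
    def raw(p):
        return 1.0 / (1.0 + math.exp(-steepness * (p - 0.5)))

    low, high = raw(0.0), raw(1.0)
    return (raw(progress) - low) / (high - low)


def upper_budget_margin(progress, full_volume, budget):
    """Upper margin of the barrier, shifted from full_volume to budget."""
    t = sigmoid_transition(progress)
    return (1.0 - t) * full_volume + t * budget
```

With this shape, a quarter of the way through training the margin has moved by only about 7% of the total shift, so most of the pruning pressure is applied in the middle of the training phase.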

Figure 4: Sigmoidal transition function.

3.5 Knowledge Distillation

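Knowledge Distillation (KD) [14] is a method for facilitating the training of a small neural network (the student) by having it reproduce the output of a larger network (the teacher). The loss proposed by Hinton et al. [14] is:

$$L_D(W, \Phi) = (1 - \alpha)\, L_{CE}(P_s, Y_{gt}) + \alpha \tau^2 L_{CE}(P_s, P_t),$$

where $L_{CE}$ is a cross-entropy, $Y_{gt}$ is the groundtruth, $o_s$ and $o_t$ are the output logits of the student and teacher networks, $P_s = \mathrm{softmax}(o_s / \tau)$, $P_t = \mathrm{softmax}(o_t / \tau)$, and $\tau$ is a temperature parameter used to smooth the softmax output: $\mathrm{softmax}(o / \tau)_i = \exp(o_i / \tau) / \sum_j \exp(o_j / \tau)$.

In our case, the unpruned network is the teacher and the pruned network is the student. As such, our final loss is:

$$L(W, \Phi) = L_D(W, \Phi) + \lambda L_{BAR}(\Phi, V, a, b),$$

where $\lambda$, $\alpha$ and $\tau$ are fixed parameters.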
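A distillation data loss of this kind can be sketched in a few lines. This is a sketch under assumptions rather than the authors' code: the default values of `alpha` and `temperature` are placeholders, and the hard-label term is evaluated at temperature 1 by default, as in the common formulation of [14] (passing `hard_temperature=temperature` evaluates both terms on the same smoothed student output).

```python
import math


def softmax(logits, temperature=1.0):
    """Temperature-smoothed softmax, computed in a numerically stable way."""
    scaled = [v / temperature for v in logits]
    peak = max(scaled)
    exps = [math.exp(v - peak) for v in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(pred, target):
    """Cross-entropy between a predicted and a target distribution."""
    return -sum(t * math.log(p) for p, t in zip(pred, target) if t > 0.0)


def distillation_loss(student_logits, teacher_logits, label,
                      alpha=0.9, temperature=4.0, hard_temperature=1.0):
    """Weighted sum of a hard-label term and a teacher-matching term."""
    one_hot = [1.0 if i == label else 0.0 for i in range(len(student_logits))]
    hard = cross_entropy(softmax(student_logits, hard_temperature), one_hot)
    soft = cross_entropy(softmax(student_logits, temperature),
                         softmax(teacher_logits, temperature))
    # The temperature**2 factor keeps the soft-term gradients on a scale
    # comparable to the hard term when the temperature changes.
    return (1.0 - alpha) * hard + alpha * temperature ** 2 * soft
```

In the setting of this paper, `teacher_logits` would come from the unpruned network (computed once over the dataset) and `student_logits` from the network being pruned.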

4 Pruning Residual Networks

While our method can prune any CNN, pruning a CNN without residual connections does not affect the connectivity patterns of the architecture, and simply selects the width at each layer [9]. In this paper, we are interested in allowing any feature map of a residual network to be pruned. This pruning regime can reduce the depth of the network, and generally results in architectures with atypical connectivity that require special care in their implementation to obtain maximum efficiency.

4.1 Automatic Depth Determination

We found, as in [9], that filter-wise pruning can successfully prune entire ResBlocks and change the network depth. This effect was named Automatic Depth Determination in [25]. Since a ResBlock computes a delta that is aggregated with the main (residual) signal by addition (c.f. Fig. 5a), such block can generally be removed without preventing the flow of signal through the network. This is because the main signal’s identity connections cannot be pruned as they lack prunable filters.

Figure 5: Typical ResBlock vs. pooling block. (a) A typical ResBlock. The “B” arrow is the sequence of convolutions done inside the block. (b) A pooling block at the beginning of a ResNet Layer, which deals with the change in spatial dimensions and number of feature maps. Notice that it breaks the continuity of the residual signal. The arrow labeled “1x1” is a convolution with stride 2; the first convolution of “B” also has stride 2. If all convolutions (arrows) are removed, no signal can pass.

However, some ResBlocks, which we call “pooling blocks”, change the spatial dimensions and feature dimensionality of the signal. This type of block breaks the continuity of the residual signal (c.f. Fig. 5b). As such, the convolutions inside this block cannot be completely pruned, as this would prevent any signal from flowing through it (a situation we call “fatal pruning”). As a solution, we clamp the highest value of $z$ to ensure that at least one feature map is kept in the 1x1 conv operation.
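One way to picture the safeguard against fatal pruning is a helper that guarantees the largest gate of a layer never falls to zero. The paper does not spell out how the clamping is implemented, so the sketch below is purely illustrative: the function name, the `floor` value and the choice of clamping from below are assumptions.

```python
def prevent_fatal_pruning(gates, floor=1e-3):
    """Return a copy of `gates` whose largest entry is at least `floor`.

    Applied to the gates of the convolution that carries the signal through
    a pooling block, this keeps at least one feature map alive.
    """
    if not gates:
        return []
    best = max(range(len(gates)), key=lambda i: gates[i])
    clamped = list(gates)
    clamped[best] = max(clamped[best], floor)
    return clamped
```

Gates that are already non-zero are left untouched, so the helper only intervenes when the whole layer is about to be pruned.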

4.2 Atypical connectivity of pruned ResNets

Our method allows any feature map in the output of a convolution to be pruned (except for the 1x1 conv of the pooling block). This produces three types of atypical residual connectivity that require special care (see Fig. 6). For example, there could be a feature from the residual signal that would pass through without another signal being added to it (Fig. 6b). New feature maps can also be created and concatenated (Fig. 6c). Furthermore, new feature maps could be created while others could pass through (Fig. 6d).

Figure 6: Connectivity allowed by our approach. (a) A 3-feature ResBlock with typical connectivity. Arrows represent one or more convolutions. (b) With one feature map pruned, only two features are computed and added to the residual signal; one feature from the residual signal is left unchanged. (c) a new feature is created and concatenated to the residual signal. (d) a combination of (b) and (c) as a new feature is concatenated to the residual signal, one feature from the residual is left unchanged, and a third feature has typical connectivity (best viewed in color).

To leverage the speedup incurred by a pruned feature map, the three cases in Fig. 6 must be taken into account through a mixed-connectivity block which allows these unorthodox configurations. Without this special implementation, some zeroed-out feature maps would still be computed because the summations of residual and refinement signals must have the same number of feature maps. In fact, a naive implementation does not allow refining only a subset of the features of the main signal (as in Fig. 6b), nor does it allow having a varying number of features in the main signal (as in Fig. 6c).

Figure 7: (a) A 4-feature chunk of a ResNet Layer pruned by our method. Dotted feature maps are zeroed-out by their associated mask. An arrow labeled B represents a Block operation, which consists of a sequence of convolutions. Inner convolutions of the Block can be pruned, but only the output of the last convolution is shown (for clarity). (b) The same pruned subgraph, illustrated without the pruned feature maps. The resulting subgraph is shallower and narrower than its “full” counterpart (best viewed in color).

Fig. 7 shows the benefit of a mixed-connectivity block. In (a) is a ResNet Layer pruned by our method. Using a regular ResBlock implementation, all feature maps in pairs of tensors that are summed together need to have matching width. This means that, in Fig. 7, all feature maps of the first, third and fourth rows (features) are computed, even if they are dotted. Only the second row can be fully removed. On the other hand, by using mixed-connectivity, only unpruned feature maps are computed, yielding architectures such as the one in Fig. 7b, which save substantial compute (c.f. Section 5).

Technical details on our mixed-connectivity block are provided in the Supplementary materials.
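Those details are not reproduced here. As a rough illustration of the idea only, the sketch below merges a residual signal with the output of a pruned block by channel index, which covers the three cases of Fig. 6: channels that pass through unchanged, channels that receive an addition, and newly created channels. The dictionary-based representation, the use of plain numbers as stand-ins for feature maps, and the function name are all assumptions; an efficient implementation would use indexed tensor operations instead.

```python
def mixed_connectivity_add(residual, delta, delta_channels):
    """Merge a block's output into the residual signal, channel by channel.

    residual:       dict mapping channel index -> feature map (here a number).
    delta:          list of feature maps actually computed by the pruned block.
    delta_channels: channel index targeted by each entry of `delta`.
    """
    merged = dict(residual)  # untouched channels simply pass through
    for channel, value in zip(delta_channels, delta):
        if channel in merged:
            merged[channel] = merged[channel] + value  # typical residual add
        else:
            merged[channel] = value  # new feature concatenated to the signal
    return merged
```

For instance, `mixed_connectivity_add({0: 1.0, 1: 2.0}, [10.0, 20.0], [1, 2])` returns `{0: 1.0, 1: 12.0, 2: 20.0}`: channel 0 passes through, channel 1 is refined, and channel 2 is created, without ever computing a zeroed-out feature map.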

5 Experiments

5.1 Experimental Setup

We tested our pruning framework on two residual architectures and report results on four datasets. We pruned Wide-ResNet [34] on CIFAR-10, CIFAR-100 and TinyImageNet (with a width multiplier of 12 as per [34]), and ResNet50 [13] on Mio-TCD [23], a larger and more complex dataset devoted to traffic analysis. TinyImageNet and Mio-TCD samples are resized to and , respectively. Since this ResNet50 has a larger input and is deeper than its CIFAR counterpart, we do not opt for the “wide” version and thus save significant training time. Both network architectures have approximately the same volume.

For all experiments, we use the Adam optimizer with an initial learning rate of and a weight decay of . For CIFAR and TinyImageNet, we use a batch size of 64. For our objective function, we use , , and . We use PyTorch and its standard image preprocessing. For experiments on Mio-TCD, we start training/pruning with the weights of the unpruned network whereas we initialize with random values for CIFAR and TinyImageNet. Please refer to the Supplementary materials for the number of epochs used in each training phase.

We compare our approach to the following methods:

  • Random. This approach randomly selects feature maps to be removed.

  • Weight Magnitude (WM) [10]. This method uses the absolute sum of the weights in a filter as a surrogate of its importance. Lower magnitude filters are removed.

  • Vector Quantization (VQ) [8]. This approach vectorizes the filters and quantizes them into $k$ clusters, where $k$ is the target width for the layer. The clusters’ centers are used as the new filters.

  • Interpolative Decomposition (ID). This method is based on low-rank approximation for network compression [6, 16]. This algorithm factorizes the filters of each layer into two factors, one of which has a specific number of rows corresponding to the budget. That factor replaces the original filters, and the other factor is multiplied into the next layer to approximate the original sequence of transformations.

  • $L_0$ regularization (LZR) [22]. This DSL method is the closest to our method. However, it incorporates no budget, penalizes layer width instead of activation tensor volume, and does not use Knowledge Distillation.

  • Information Bottleneck (IB) [1]. This DSL method uses a factorized Gaussian distribution (with parameters ) to mask the feature maps as well as the following prior loss : .

  • MorphNet [9]. This approach uses the scaling parameter of Batch Norm modules as a learnable mask over features. These parameters are driven to zero by an $L_1$ objective that considers the resources used by a filter (e.g. FLOPs). This method computes a new width for each layer by counting the non-zero parameters. We set the sparsity trade-off parameter after a hyperparameter search, with as the target pruning factor for CIFAR-10.

For every method, we set a budget of tensor activation volume corresponding to $1/2$, $1/4$, $1/8$ and $1/16$ of the unpruned volume $V_F$. Since LZR and IB do not allow setting a budget, we went through trial and error to find the hyperparameter value that yields the desired resource usage. For Random, WM, VQ, and ID we scale the width of all layers uniformly to satisfy the budget and implement the pruning scheme which proved to be the most effective (c.f. Supplementary materials). We also apply our mixed-connectivity block to the output of every method for a fair comparison.

5.2 Results

Figure 8: Pruning results. Plots showing test accuracy w.r.t. volume and FLOP reduction factor (best viewed in color).

Results for every method executed on all four datasets are shown in Fig. 8. The first row shows test accuracies w.r.t. the network volume reduction factor for CIFAR-10, CIFAR-100, TinyImageNet and Mio-TCD. As one can see, our method is above the others (or competitive) for CIFAR-10 and CIFAR-100. It is also above every other method on TinyImageNet and Mio-TCD except for MorphNet, which is better for pruning factors of 2 and 4. However, MorphNet suffers a severe drop in accuracy at 16x, a phenomenon we observed as well on CIFAR-10 and CIFAR-100. Our method is also always better than IB and LZR, the other two DSL methods. Overall, our method is resilient to severe (16x) pruning ratios.

Furthermore, for every dataset, networks pruned with our method (as well as some others) get better results than the initial unpruned network. This illustrates the fact that Wide-ResNet and ResNet-50 are overparameterized for certain tasks and that decreasing their number of feature maps reduces overfitting and thus improves test accuracy.

We then took every pruned network and computed their FLOP reduction factor (we considered operations from convolutions only). This is illustrated in the second row of Fig. 8. There again, our method outperforms (or is competitive with) the others for CIFAR-10 and CIFAR-100. Our method reduces FLOPs by up to a factor of x on CIFAR-10, x on CIFAR-100 and x on Mio-TCD without decreasing test accuracy. We get similar results as LZR for pruning ratios around x on CIFAR-10 and CIFAR-100 and x on Mio-TCD. MorphNet gets better accuracy for pruning ratios of x and x on Mio-TCD, but then drops significantly around x. Results are similar for TinyImageNet.
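For reference, a FLOP reduction factor restricted to convolutions can be computed as sketched below. Counting one multiply-accumulate per kernel weight and output position is a common convention and an assumption here, as are the function names and the tuple layout describing each layer.

```python
def conv_flops(c_in, c_out, kernel, out_h, out_w):
    """Multiply-accumulate count of one convolution layer (bias ignored)."""
    return c_in * c_out * kernel * kernel * out_h * out_w


def flop_reduction_factor(full_layers, pruned_layers):
    """Ratio between the convolution FLOPs of the full and pruned networks.

    Each layer is a tuple (c_in, c_out, kernel, out_h, out_w).
    """
    full = sum(conv_flops(*layer) for layer in full_layers)
    pruned = sum(conv_flops(*layer) for layer in pruned_layers)
    return full / pruned
```

As an example, pruning the two-layer stack `[(16, 32, 3, 32, 32), (32, 32, 3, 32, 32)]` down to `[(16, 16, 3, 32, 32), (16, 16, 3, 32, 32)]` halves the activation volume but gives a FLOP reduction factor of `3.0`, because the cost of a convolution depends on both its input and output widths. This is why the volume and FLOP factors reported for the same network differ.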

In Table 1, we report results of an ablation study on WideResNet-CIFAR-10 with two pruning factors. We replaced the Knowledge Distillation data loss (c.f. Section 3.5) with a cross-entropy loss, and replaced the Sigmoid pruning schedule (c.f. Section 3.4) with a linear one. As can be seen, removing either of those reduces accuracy, thus showing their effectiveness. We also studied the impact of not using the mixed-connectivity block introduced in Section 4.2. As shown in Table 2, when replacing our mixed-connectivity blocks by regular ResBlocks, we get a drop of the effective pruned volume of more than 50% for 16x (even up to 58% for CIFAR-10).

Configuration                     Pruning factor
                                  2x         16x
Our method                        92.70%     91.62%
w/o Knowledge Distillation        -1.37%     -0.40%
w/o Sigmoid pruning schedule      -0.87%     -0.92%
Table 1: Test Accuracy for different configurations of our method (using WideResNet-CIFAR-10). The test accuracy of the unpruned network is .

Dataset       2x      4x      8x      16x
CIFAR-10      12%     43%     53%     58%
CIFAR-100     14%     49%     55%     57%
MIO-TCD       32%     37%     40%     52%
Table 2: Reduction of the effective pruned volume when removing the mixed-connectivity block.

We illustrate in Fig. 9 results of our pruning method for CIFAR-10 (for the other datasets, see supplementary materials). The figure shows the number of neurons per residual block for the full network, and for the networks pruned with varying pruning factors. These plots show that our method has the capability of eliminating entire residual blocks (especially around 1.3 and 1.4). Also, the pruning configurations follow no obvious trend thus showing the inherent plasticity of a DSL method such as ours.

Figure 9: Result of pruning with our method on WideResNet-CIFAR-10. Total number of active neurons in the full networks and with four different pruning rates. Sections without an orange (8x) or red (16x) bar are those for which a res-Block has been eliminated.

As mentioned in Section 3.3, instead of the volume metric (Eq. (4)) the budget could be set w.r.t. a FLOP metric by accounting for the expectation of the number of feature maps in the preceding layer. We compare in Fig. 10 the results given by these two budget metrics for WideResNet-CIFAR-10. As one might expect, pruning a network with a volume metric (V-Trained) yields significantly better performance w.r.t. the volume pruning factor, whereas pruning a network with a FLOP metric (F-Trained) yields better performance w.r.t. the FLOP reduction factor, although by a slight margin. In light of these results, we conclude that the volume metric (Eq. (4)) is overall a better choice.

Figure 10: Comparison of objective metrics. Test accuracy versus the volume pruning factor and the FLOP reduction factor for our method with a volume metric (V-Trained) and a FLOP metric (F-Trained).

6 Conclusion

We presented a structured budgeted pruning method based on a dropout sparsity learning framework. We proposed a knowledge distillation loss function combined with a budget-constrained sparsity loss whose formulation is that of a barrier function. Since the log-barrier solution is ill-suited for pruning a CNN, we proposed a novel barrier function as well as a novel optimization schedule. We provided concrete insights on how to prune residual networks and used a novel mixed-connectivity block. Results obtained on two ResNet architectures and four datasets reveal that our method outperforms (or is competitive with) seven other pruning methods.


We thank Christian Desrosiers for his insights. This work was supported by FRQ-NT scholarship #257800 and Mitacs grant IT08995. Supercomputers from Compute Canada and Calcul Quebec were used.


  1. Our code is available here:


  1. D. P. W. Bin Dai (2018) Compressing neural networks using the variational information bottleneck. proc. of ICML. Cited by: §1, §2, 6th item.
  2. S. Boyd and L. Vandenberghe (2004) Convex optimization. Cambridge University Press. External Links: ISBN 0521833787 Cited by: 1st item, 2nd item, §3.3.
  3. M. A. Carreira-Perpinán and Y. Idelbayev (2018) Learning-compression algorithms for neural net pruning. In Proc. of CVPR, pp. 8532–8541. Cited by: §2.
  4. M. Carreira-Perpinan and W. Wang (2014) Distributed optimization of deeply nested systems. In AI and Stats, pp. 10–19. Cited by: §2.
  5. M. Denil, B. Shakibi, L. Dinh and N. De Freitas (2013) Predicting parameters in deep learning. In proc of NIPS, pp. 2148–2156. Cited by: §2.
  6. E. L. Denton, W. Zaremba, J. Bruna, Y. LeCun and R. Fergus (2014) Exploiting linear structure within convolutional networks for efficient evaluation. In proc of NIPS, pp. 1269–1277. Cited by: §2, 4th item.
  7. J. Frankle and M. Carbin (2018) The lottery ticket hypothesis: training pruned neural networks. arXiv preprint arXiv:1803.03635. Cited by: §1, §2.
  8. Y. Gong, L. Liu, M. Yang and L. Bourdev (2014) Compressing deep convolutional networks using vector quantization. arXiv preprint arXiv:1412.6115. Cited by: §2, 3rd item.
  9. A. Gordon, E. Eban, O. Nachum, B. Chen, H. Wu, T. Yang and E. Choi (2018) Morphnet: fast & simple resource-constrained structure learning of deep networks. In Proc. of CVPR, Cited by: §2, §4.1, §4, 7th item.
  10. S. Han, J. Pool, J. Tran and W. Dally (2015) Learning both weights and connections for efficient neural network. In proc of NIPS, pp. 1135–1143. Cited by: §1, §2, 2nd item.
  11. B. Hassibi and D. G. Stork (1993) Second order derivatives for network pruning: optimal brain surgeon. In proc of NIPS, pp. 164–171. Cited by: §2.
  12. K. He, G. Gkioxari, P. Dollár and R. Girshick (2017) Mask r-cnn. In proc. of ICCV, pp. 2980–2988. Cited by: §1.
  13. K. He, X. Zhang, S. Ren and J. Sun (2016-06) Deep residual learning for image recognition. In Proc. of CVPR, Cited by: §5.1.
  14. G. Hinton, O. Vinyals and J. Dean (2015) Distilling the knowledge in a neural network. proc of NIPS DLRL Workshop. Cited by: 3rd item, §3.5.
  15. G. Huang, Z. Liu, L. Van Der Maaten and K. Q. Weinberger (2017) Densely connected convolutional networks.. In proc. of CVPR, Cited by: §1.
  16. M. Jaderberg, A. Vedaldi and A. Zisserman (2014) Speeding up convolutional neural networks with low rank expansions. proc of BMVC. Cited by: §2, 4th item.
  17. D. P. Kingma, T. Salimans and M. Welling (2015) Variational dropout and the local reparameterization trick. In proc of NIPS, pp. 2575–2583. Cited by: §1, §2, §3.1, §3.1.
  18. A. Krizhevsky, I. Sutskever and G. E. Hinton (2012) Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems, pp. 1097–1105. Cited by: §1.
  19. Y. LeCun, J. S. Denker and S. A. Solla (1990) Optimal brain damage. In proc of NIPS, pp. 598–605. Cited by: §1, §2.
  20. C. Liu, B. Zoph, J. Shlens, W. Hua, L. Li, L. Fei-Fei, A. Yuille, J. Huang and K. Murphy (2017) Progressive neural architecture search. arXiv preprint arXiv:1712.00559. Cited by: §2.
  21. Z. Liu, J. Li, Z. Shen, G.Huang, S. Yan and C.Zhang (2017) Learning efficient convolutional networks through network slimming. In proc of ICCV, Cited by: §1, §2.
  22. C. Louizos, M. Welling and D. P. Kingma (2018) Learning sparse neural networks through regularization. In proc. of ICLR, Cited by: §1, §2, §3.3, 5th item.
  23. Z. Luo, F. B-Charron, C. Lemaire, J. Konrad, S. Li, A. Mishra, A. Achkar, J. Eichel and P. Jodoin (2018) MIO-tcd: a new benchmark dataset for vehicle classification and localization. IEEE TIP 27 (10), pp. 5129–5141. Cited by: §5.1.
  24. D. Molchanov, A. Ashukha and D. Vetrov (2017) Variational dropout sparsifies deep neural networks. proc of ICML. Cited by: §1, §2, §3.1.
  25. E. Nalisnick and P. Smyth (2018) Unifying the dropout family through structured shrinkage priors. arXiv preprint arXiv:1810.04045. Cited by: §4.1.
  26. K. Neklyudov, D. Molchanov, A. Ashukha and D. P. Vetrov (2017) Structured bayesian pruning via log-normal multiplicative noise. In proc of NIPS, Cited by: §1, §2.
  27. W. Pan, H. Dong and Y. Guo (2016) DropNeuron: simplifying the structure of deep neural networks. In arXiv preprint arXiv:1606.07326, Cited by: §2.
  28. J. Pérez-Rúa, M. Baccouche and S. Pateux (2018) Efficient progressive neural architecture search. arXiv preprint arXiv:1808.00391. Cited by: §2.
  29. H. Pham, M. Y. Guan, B. Zoph, Q. V. Le and J. Dean (2018) Efficient neural architecture search via parameter sharing. arXiv preprint arXiv:1802.03268. Cited by: §2.
  30. O. Ronneberger, P. Fischer and T. Brox (2015) U-net: convolutional networks for biomedical image segmentation. In proc of MICCAI, pp. 234–241. Cited by: §1.
  31. N. Srivastava (2014) Dropout: a simple way to prevent neural networks from overfitting. Journal of ML Research 15, pp. 1929–1958. Cited by: §3.1.
  32. T. Veniat and L. Denoyer (2018) Learning time-efficient deep architectures with budgeted super networks. In Proc. of CVPR, Cited by: §2.
  33. W.Wen, C. Wu, Y.Wang, Y. Chen and H.Li (2016) Learning structured sparsity in deep neural networks. In proc of NIPS, Cited by: §2.
  34. S. Zagoruyko and N. Komodakis (2016) Wide residual networks. In proc. of BMVC, Cited by: §5.1.
  35. B. Zoph and Q. V. Le (2017) Neural architecture search with reinforcement learning. proc of ICLR. Cited by: §2.