Structured Pruning of Neural Networks with Budget-Aware Regularization
Abstract
Pruning methods have been shown to be effective at reducing the size of deep neural networks while keeping accuracy almost intact. Among the most effective methods are those that prune a network while training it with a sparsity prior loss and learnable dropout parameters. A shortcoming of these approaches, however, is that neither the size nor the inference speed of the pruned network can be controlled directly; yet this is a key feature for targeting deployment of CNNs on low-power hardware. To overcome this, we introduce a budgeted regularized pruning framework for deep CNNs. Our approach naturally fits into traditional neural network training as it consists of a learnable masking layer, a novel budget-aware objective function, and the use of knowledge distillation. We also provide insights on how to prune a residual network and how this can lead to new architectures. Experimental results reveal that CNNs pruned with our method are more accurate and less compute-hungry than those pruned with state-of-the-art methods. Also, our approach is more effective at preventing accuracy collapse in case of severe pruning; this allows pruning factors of up to 16x without significant accuracy drop.
1 Introduction
Convolutional Neural Networks (CNN) have proven to be effective feature extractors for many computer vision tasks [12, 15, 18, 30]. The design of several CNNs involves many heuristics, such as using increasing powers of two as the number of feature maps, or width, of each layer. While such heuristics allow achieving excellent results, they may be too crude in situations where the amount of compute power and memory is restricted, such as with mobile platforms. Thus arises the problem of finding the right number of layers that solve a given task while respecting a budget. Since the number of layers depends highly on the effectiveness of the learned filters (and their combination), one cannot determine these hyperparameters a priori.
Convolution operations constitute the main computational burden of a CNN. The execution of these operations benefits from a high degree of parallelism, which requires them to have regular structures. This implies that one cannot remove isolated neurons from a CNN filter as they must be full grids. To achieve the same effect as removing a neuron, one can zero out its weights. While doing this reduces the theoretical size of the model, it does not reduce the computational demands of the model nor the amount of feature map memory. Therefore, to accelerate a CNN and reduce its memory footprint, one has to rely on structured sparsity pruning methods that aim at reducing the number of feature maps and not just individual neurons.
By removing unimportant filters from a network and retraining it, one can shrink it while maintaining good performance [10, 19]. This can be explained by the following hypothesis: the initial value of a filter’s weights is not guaranteed to allow the learning of a useful feature; thus, a trained network might contain many expendable features [7].
Among the structured pruning methods, those that implement a sparsity learning (SL) framework have been shown to be effective, as pruning and training are done simultaneously [1, 17, 21, 22, 24, 26]. Unfortunately, most SL methods cannot prune a network while respecting a neuron budget imposed by the very nature of the device on which the network shall be deployed. As of today, pruning a network while respecting a budget can only be done by trial and error, typically by training a network multiple times with various pruning hyperparameters.
In this paper, we present an SL framework which allows learning and selecting filters of a CNN while respecting a neuron budget. Our main contributions are:

We present a novel objective function which includes a variant of the log-barrier function [2] for simultaneously training and pruning a CNN while respecting a total neuron budget;

We propose a variant of the barrier method [2] for optimizing a CNN;

We demonstrate the effectiveness of combining SL and knowledge distillation [14];

We empirically confirm the existence of the automatic depth determination property of residual networks pruned with filter-wise methods, and give insights on how to ensure the viability of the pruned network by preventing “fatal pruning”;

We propose a new mixed-connectivity block which roughly doubles the effective pruning factors attainable with our method.
2 Previous Works
The fact that neural networks can be compressed without much loss of accuracy suggests that they are often overparametrized. Denil et al. [5] have shown that typical neural networks are overparametrized; in the best case of their experiments, they could predict 95% of the network weights from the others. Recent work by Frankle et al. [7] supports the hypothesis that a large proportion (typically 90%) of weights in standard neural networks are initialized to a value that will lead to an expendable feature. In this section, we review five categories of methods for reducing the size of a neural network.
Neural network compression aims to reduce the storage requirements of the network’s weights. In [6, 16], low-rank approximation through matrix factorization, such as singular-value decomposition, is used to factorize the weight matrices. The factors’ rank is reduced by keeping only the leading singular values and their associated singular vectors. In [8], quantization is used to reduce the storage taken by the model; both scalar quantization and vector quantization (VQ) have been considered. Using VQ, a weight matrix can be reconstructed from a list of indices and a dictionary of vectors. Thus, practical computation savings can be obtained. Unfortunately, most network compression methods do not decrease the memory and compute usage during inference.
Neural network pruning consists of identifying and removing neurons that are not necessary for achieving high performance. Some of the first approaches used the second-order derivative to determine the sensitivity of the network to the value of each weight [19, 11]. A more recent, very simple and effective approach selects which neurons to remove by thresholding the magnitude of their weights; smaller magnitudes are associated with unimportant neurons [10]. The resulting network is then fine-tuned for better performance. Nonetheless, experimental results (cf. Section 5) show that variational pruning methods (discussed below) outperform the previously mentioned works.
Sparsity Learning (SL) methods aim at pruning a network while training it. Some methods add to the training loss a regularization function such as $L_1$ [21], Group LASSO [33], or an approximation of the $L_0$ norm [22, 27]. Several variational methods have also been proposed [1, 17, 26, 24]. These methods formalize the problem of network pruning as a problem of learning the parameters of a dropout probability density function (PDF) via the reparametrization trick [17]. Pruning is enforced via a sparsity prior that derives from a variational evidence lower bound (ELBO). In general, SL methods do not apply an explicit constraint to limit the number of neurons used. To enforce a budget, one has to turn towards budgeted pruning.
Budgeted pruning is an approach that provides direct control over the size of the pruned network via some “network size” hyperparameter. MorphNet [9] alternates between training with a sparsifying regularizer and applying a width multiplier to the layer widths to enforce the budget. Contrary to our method, this work does not leverage dropout-based SL. Budgeted Super Networks [32] is a method that finds an architecture satisfying a resource budget by sparsifying a super network at the module level. This method is less convenient to use than ours, as it requires “neural fabric” training through reinforcement learning. Another budgeted pruning approach is “Learning-Compression” [3], which uses the method of auxiliary coordinates [4] instead of backpropagation. Contrary to this method, our approach adopts a usual gradient descent optimization scheme, and does not rely on the magnitude of the weights as a surrogate of their importance.
Architecture search (AS) is an approach that has led to efficient neural networks in terms of performance and parameterization. Using reinforcement learning and vast amounts of processing power, NAS [35] has learned novel architectures: some advanced the state of the art, others had relatively few parameters compared to similarly effective hand-crafted models. PNAS [20] and ENAS [29] have extended this work by cutting the necessary compute resources. These works have been aggregated by EPNAS [28]. AS is orthogonal to our line of work as the learned architectures could be pruned by our method. In addition, AS is more complicated to implement as it requires learning a controller model by reinforcement learning. In contrast, our method features tools widely used in CNN training.
3 Our Approach
3.1 Dropout Sparsity Learning
Before we introduce the specifics of our approach, let us first summarize the fundamental concepts of Dropout Sparsity Learning (DSL).
Let $h_\ell$ be the output of the $\ell$-th hidden layer of a CNN computed by $f_\ell(h_{\ell-1})$, a transformation of the previous layer, typically a convolution followed by a batch norm and a nonlinearity. As mentioned before, one way of reducing the size of a network is by shutting down neurons with an elementwise product between the output of layer $\ell$ and a binary tensor $z_\ell$:
$$\tilde{h}_\ell = h_\ell \odot z_\ell \qquad (1)$$
To enforce structured pruning and shut down feature maps (not just individual neurons), one can redefine $z_\ell$ as a vector of size $d_\ell$, where $d_\ell$ is the number of feature maps in $h_\ell$. Then, $z_\ell$ is applied over the spatial dimensions by performing an elementwise product with $h_\ell$.
As one might notice, Eq. (1) is the same as that of dropout [31], for which $z_\ell$ is a tensor of i.i.d. random variables following a Bernoulli distribution. To prune a network, DSL redefines $z_\ell$ as random variables sampled from a distribution $p(z \mid \Phi)$ whose parameters $\Phi$ can be learned while training the model. In this way, the network can learn which feature maps to drop and which ones to keep.
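As a minimal sketch of the structured masking in Eq. (1), the snippet below draws one Bernoulli variable per feature map and broadcasts it over the spatial dimensions. The helper names, the nested-list tensor layout and the keep probability are illustrative assumptions, not part of the method itself.

```python
import random

def sample_channel_mask(num_maps, keep_prob, rng):
    """Draw one Bernoulli keep/drop decision per feature map."""
    return [1.0 if rng.random() < keep_prob else 0.0 for _ in range(num_maps)]

def apply_channel_mask(h, z):
    """Multiply every spatial position of feature map i by z[i].

    h is a list of feature maps, each stored as a list of rows (H x W).
    """
    return [[[z_i * v for v in row] for row in fmap] for fmap, z_i in zip(h, z)]

rng = random.Random(0)
h = [[[1.0, 2.0], [3.0, 4.0]] for _ in range(4)]  # 4 feature maps of size 2x2
z = sample_channel_mask(len(h), keep_prob=0.5, rng=rng)
h_masked = apply_channel_mask(h, z)
```

A dropped feature map is zeroed everywhere, which is what later allows the corresponding filter to be removed entirely.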
Since the operation of sampling $z$ from a distribution is not differentiable, it is common practice to redefine it with the reparametrization trick [17]:
$$z = g(\Phi, \epsilon) \qquad (2)$$
where $g$ is a continuous function differentiable with respect to $\Phi$ and stochastic with respect to $\epsilon$, a random variable typically sampled from $\mathcal{N}(0, 1)$ or $\mathcal{U}(0, 1)$.
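To illustrate the reparametrization trick, here is a small sketch using a Gaussian gate, chosen only because its derivatives are easy to read; the function names and parameter values are hypothetical. The noise is sampled outside the function, so the gate is a deterministic, differentiable function of its parameters.

```python
import random

def gate(mu, sigma, eps):
    """Reparametrized Gaussian gate: deterministic in (mu, sigma), stochastic in eps."""
    return mu + sigma * eps

def gate_gradients(eps):
    """Partial derivatives of the gate with respect to (mu, sigma) for a fixed eps."""
    return 1.0, eps

rng = random.Random(0)
eps = rng.gauss(0.0, 1.0)            # noise drawn from N(0, 1)
z = gate(mu=1.0, sigma=0.1, eps=eps)
dz_dmu, dz_dsigma = gate_gradients(eps)
```

Because the randomness is isolated in `eps`, gradients can flow to the distribution parameters through ordinary backpropagation.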
In order to enforce network pruning, one usually incorporates a two-term loss:
$$\mathcal{L}(W, \Phi) = \mathcal{L}_D(W, \Phi) + \lambda \mathcal{L}_S(\Phi) \qquad (3)$$
where $\lambda$ is the prior’s weight, $W$ are the parameters of the network, $\mathcal{L}_D$ is a data loss that measures how well the model fits the training data (e.g. the cross-entropy loss) and $\mathcal{L}_S$ is a sparsity loss that measures how sparse the model is. While $\mathcal{L}_S$ varies from one method to another, the KL divergence between $p(z \mid \Phi)$ and some prior distribution is typically used by variational approaches [17, 24]. Note that during inference, one can make the network deterministic by replacing the random variable $\epsilon$ by its mean.
3.2 Soft and hard pruning
As mentioned before, $g$ is a continuous function differentiable with respect to $\Phi$. Thus, instead of being binary, the pruning of Eq. (2) becomes continuous (soft pruning), so there is always a nonzero probability that a feature map will be activated during training. However, to achieve practical speedups, one eventually needs to “hard-prune” filters. To do so, once training is over, the values of $\Phi$ are thresholded to select which filters to remove completely. Then, the network may be fine-tuned for several epochs with the $\mathcal{L}_D$ loss only, to let the network adapt to hard pruning. We call this the “fine-tuning phase”, and the earlier epochs constitute the “training phase”.
3.3 BudgetAware Regularization (BAR)
In our implementation, a budget is the maximum number of neurons a “hard-pruned” network is allowed to have. To compute this metric, one may replace $\epsilon$ by its mean so feature maps with $z = 0$ have no effect and can be removed, while the others are kept. The network size is thus the total activation volume $V$ of the structurally “hard-pruned” network:
$$V = \sum_\ell \sum_{i=1}^{d_\ell} \mathbb{1}\left(\mathbb{E}[z_{\ell,i} \mid \Phi] > 0\right) A_\ell \qquad (4)$$
where $A_\ell$ is the area of the output feature maps of layer $\ell$ and $\mathbb{1}$ is the indicator function. Our training process is described in Algorithm 1.
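The activation volume can be computed with a few lines of code. The sketch below assumes that the mean gate value of every feature map is already available as a nested list; the names and example numbers are hypothetical.

```python
def network_volume(mean_gates, areas):
    """Total activation volume of the hard-pruned network.

    mean_gates[l][i] is the mean gate value of feature map i in layer l,
    areas[l] is the spatial area (height x width) of that layer's output.
    """
    total = 0
    for gates, area in zip(mean_gates, areas):
        kept = sum(1 for m in gates if m > 0)  # indicator function
        total += kept * area
    return total

mean_gates = [[0.9, 0.0, 0.4], [0.0, 0.0, 1.0, 0.2]]
areas = [32 * 32, 16 * 16]
volume = network_volume(mean_gates, areas)  # 2 * 1024 + 2 * 256 = 2560
```

Note that a feature map counts fully as soon as its mean gate is positive, which is why this quantity is not differentiable with respect to the gate parameters.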
A budget constraint requires $V$ to be smaller than the allowed budget $b$. If embedded in a sparsity loss, that constraint makes the loss go to infinity when $V > b$, and zero otherwise. This is a typical inequality-constrained minimization problem whose binary (and non-differentiable) behavior is not suited to gradient descent optimization. One typical solution to such a problem is the log-barrier method [2]. The idea of this barrier method is to approximate the zero-to-infinity constraint by a differentiable logarithmic function, $-(1/t)\log(b - V)$, where $t$ is a parameter that adjusts the accuracy of the approximation and whose value increases at each optimization iteration (cf. Algo 11.1 in [2]).
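The standard log barrier from [2] can be sketched as follows, written here as $-(1/t)\log(b - V)$ for a volume and budget normalized to a unit scale. The function name and the example values are assumptions made for illustration.

```python
import math

def log_barrier(volume, budget, t):
    """Log-barrier approximation of the constraint volume < budget."""
    if volume >= budget:
        return math.inf  # infeasible point: the constraint is violated
    return -math.log(budget - volume) / t

budget = 1.0
for t in (1.0, 10.0, 100.0):
    far = log_barrier(0.5, budget, t)    # comfortably inside the budget
    near = log_barrier(0.99, budget, t)  # close to the budget
    print(f"t={t:6.1f}  far={far:.4f}  near={near:.4f}")
```

As `t` grows, the penalty inside the feasible region shrinks towards zero while the wall at the budget stays infinite, so the function approaches the ideal zero-to-infinity constraint.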
Unfortunately, the log-barrier method requires beginning optimization with a feasible solution (i.e. $V < b$), and this brings two major problems. First, we need to compute $\Phi$ such that $V < b$, which is no trivial task. Second, this induces a setting similar to training an ensemble of pruned networks, as the probability that a feature map is “turned on” is very low. This means that filters will receive little gradient and will train very slowly. To avoid this, we need to start training with a $V$ larger than the budget.
We thus implemented a modified version of the barrier algorithm. First, as will be shown in the rest of this section, we propose a barrier function $f(V, a, b)$ as a replacement for the log-barrier function (cf. Fig. 3). Second, instead of having a fixed budget and a hardness parameter $t$ that grows at each iteration as required by the barrier method, we eliminate the hardness parameter and instead decrease the budget constraint $b$ at each iteration. This budget updating schedule is discussed in Section 3.4.
Our barrier function is designed such that:

it has an infinite value when the volume used by a network exceeds the budget, i.e. $V > b$;

it has a value of zero when the budget is comfortably respected, i.e. $V \le a$;

it has $C^1$ continuity.
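Instead of having a jump from zero to infinity at the point where $V = b$, we define a range $a < V < b$ where a smooth transition occurs. To do so, we first perform a linear mapping of $V$:
$$c(V, a, b) = \frac{V - a}{b - a}$$
such that $c = 0$ when $V = a$ (the budget is comfortably respected), and $c = 1$ when $V = b$ (our constraint is violated). Then, we use the following function:
$$\psi(c) = \frac{c^2}{1 - c}$$
which has three useful properties: (1) $\psi(0) = 0$ and $\psi'(0) = 0$, (2) $\lim_{c \to 1^-} \psi(c) = \infty$ and (3) it has $C^1$ continuity. Those properties correspond to the ones mentioned before. To obtain the desired function, we substitute $c(V, a, b)$ in $\psi$ and simplify:
$$f(V, a, b) = \begin{cases} 0 & V \le a \\ \dfrac{(V - a)^2}{(b - V)(b - a)} & a < V < b \\ \infty & V \ge b \end{cases} \qquad (5)$$
As shown in Fig. 3, like for the log barrier, $V = b$ is an asymptote, as we require $V < b$. However, $a \le V < b$ corresponds to a respected budget and for $V \le a$, the budget is respected with a comfortable margin, and this corresponds to a penalty of zero.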
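A minimal sketch of such a barrier is given below, assuming the transition between the two margins takes the form $c^2 / (1 - c)$ with $c = (V - a)/(b - a)$, which satisfies the three properties listed above. The function name and the sample values are illustrative.

```python
import math

def barrier(volume, a, b):
    """Piecewise barrier: zero below a, smooth growth on (a, b), infinite from b on."""
    if volume <= a:
        return 0.0
    if volume >= b:
        return math.inf
    # Equivalent to c**2 / (1 - c) with c = (volume - a) / (b - a).
    return (volume - a) ** 2 / ((b - volume) * (b - a))

a, b = 1.0, 2.0
samples = [barrier(v, a, b) for v in (0.5, 1.0, 1.5, 1.9, 2.0)]
```

Unlike the log barrier, this function is exactly zero once the volume is below the lower margin, so a network that comfortably respects the budget receives no further pruning pressure.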
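Our proposed prior loss is as follows:
$$\mathcal{L}_{BAR}(\Phi, V, a, b) = \mathcal{L}_S(\Phi)\, f(V, a, b) \qquad (6)$$
where $a$ and $b$ are the lower and upper budget margins, $V$ is the current “hard-pruned” volume as computed by Eq. (4), and $\mathcal{L}_S(\Phi)$ is a differentiable approximation of $V$. Note that since $V$ is not differentiable w.r.t. $\Phi$, we cannot solely optimize $f(V, a, b)$.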
The content of $\mathcal{L}_S(\Phi)$ is bound to $p(z \mid \Phi)$. In our case, we use the Hard-Concrete distribution (which is a smoothed version of the Bernoulli distribution), as well as its corresponding prior loss, both introduced in [22]. This prior loss measures the expectation of the number of feature maps currently unpruned. To account for the spatial dimensions of the output tensors of convolutions, we use:
$$\mathcal{L}_S(\Phi) = \sum_\ell A_\ell\, \mathcal{L}_{HC}(\Phi_\ell)$$
where $\mathcal{L}_{HC}$ is the hard-concrete prior loss [22] and $A_\ell$ is the area of the output feature maps of layer $\ell$. Thus, $\mathcal{L}_S$ measures the expectation of the activation volume of all convolution operations in the network.
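The sketch below shows how an area-weighted hard-concrete prior could be combined with a budget barrier. It relies on two assumptions that should be checked against the released code: the probability of a non-zero gate follows the closed form given in [22] with its suggested constants, and the prior loss is the product of the expected volume and the barrier value. All function names and example parameters are hypothetical.

```python
import math

# Stretch limits and temperature of the hard-concrete distribution, as suggested in [22].
GAMMA, ZETA, BETA = -0.1, 1.1, 2.0 / 3.0

def prob_gate_open(log_alpha):
    """Probability that a hard-concrete gate with location log_alpha is non-zero."""
    shift = BETA * math.log(-GAMMA / ZETA)
    return 1.0 / (1.0 + math.exp(-(log_alpha - shift)))

def expected_volume(log_alphas, areas):
    """Area-weighted expected number of unpruned feature maps."""
    return sum(
        area * sum(prob_gate_open(la) for la in layer)
        for layer, area in zip(log_alphas, areas)
    )

def barrier(volume, a, b):
    """Zero below a, smooth growth on (a, b), infinite from b on."""
    if volume <= a:
        return 0.0
    if volume >= b:
        return math.inf
    return (volume - a) ** 2 / ((b - volume) * (b - a))

def bar_loss(log_alphas, areas, volume, a, b):
    """Assumed form of the prior loss: expected volume scaled by the barrier."""
    return expected_volume(log_alphas, areas) * barrier(volume, a, b)

log_alphas = [[3.0, -4.0], [0.5, 2.0, -6.0]]  # hypothetical gate parameters
areas = [1024, 256]
loss = bar_loss(log_alphas, areas, volume=1750.0, a=1500.0, b=2000.0)
```

In this form the barrier only rescales the differentiable term: gradients reach the gate parameters through the expected volume, while the hard-pruned volume decides how strongly they are pushed.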
Note that $V$ could also be replaced by another metric, such as the total FLOPs used by the network. In this case, $\mathcal{L}_S$ should also include the expectation of the number of feature maps of the preceding layer.
3.4 Setting the budget margins
As mentioned earlier, initializing the network with a volume that respects the budget (as required by the barrier method) leads to severe optimization issues. Instead, we iteratively shift the pruning target $b$ during training. Specifically, we shift it from $b = V_F$ at the beginning, to $b = B$ at the end (where $V_F$ is the unpruned network’s volume and $B$ the maximum allowed budget).
As shown in Fig. 3b, doing so induces a lateral shift to the “barrier”. This is unlike the barrier method in which the hardness parameter $t$ evolves in time (cf. Fig. 3a). Mathematically, the budget evolves as follows:
$$b = (1 - T(i))\, V_F + T(i)\, B \qquad (7)$$
while $a$ is fixed. Here $T(i)$ is a transition function which goes from zero at the first iteration all the way to one at the last iteration. While $T(i)$ could be a linear transition schedule, experimental results reveal that when $V$ approaches $b$, some gradients suffer from extreme spikes due to the nature of $f$. This leads to erratic behavior towards the end of the training phase that can hurt performance. One may also implement an exponential transition schedule. This could compensate for the shape of $f$ by having $b$ change quickly during the first epochs and slowly towards the end of training. While this gives good results for severe pruning (up to 16x), the increased stress at the beginning yields suboptimal performance for low pruning factors.
For our method, we propose a sigmoidal schedule, where $b$ changes slowly at the beginning and at the end of the training phase, but quickly in the middle. This puts most of the “pruning stress” in the middle of the training phase, which accounts for the difficulty of pruning (1) during the first epochs, where the filters’ relevance is still unknown, and (2) during the last epochs, where more compromises might have to be made. The sigmoidal transition function is illustrated in Fig. 4 (cf. Supp. materials for details).
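One possible sigmoidal schedule is sketched below. The exact transition function is given in the supplementary materials, so the rescaled logistic curve and its `steepness` value are assumptions chosen only to obtain a transition that starts at zero and ends at one.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def transition(i, n_iters, steepness=10.0):
    """Sigmoidal transition, rescaled to start at 0 and end at 1."""
    lo, hi = sigmoid(-steepness / 2), sigmoid(steepness / 2)
    return (sigmoid(steepness * (i / n_iters - 0.5)) - lo) / (hi - lo)

def budget_at(i, n_iters, full_volume, final_budget):
    """Upper budget margin at iteration i, moving from the full volume to the target."""
    t = transition(i, n_iters)
    return (1.0 - t) * full_volume + t * final_budget

n_iters = 100
schedule = [budget_at(i, n_iters, 1000.0, 250.0) for i in range(n_iters + 1)]
```

With this shape the margin moves little during the first and last iterations and fastest around the midpoint, which matches the intent of concentrating the pruning stress in the middle of training.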
3.5 Knowledge Distillation
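Knowledge Distillation (KD) [14] is a method for facilitating the training of a small neural network (the student) by having it reproduce the output of a larger network (the teacher). The loss proposed by Hinton et al. [14] is:
$$\mathcal{L}_D(W) = (1 - \alpha)\, \mathcal{L}_{CE}(y, P_s) + \alpha \tau^2\, \mathcal{L}_{CE}(P_t^\tau, P_s^\tau)$$
where $\mathcal{L}_{CE}$ is a cross-entropy, $y$ is the ground truth, $o_s$ and $o_t$ are the output logits of the student and teacher networks, $P_s = \mathrm{softmax}(o_s)$, and $\tau$ is a temperature parameter used to smooth the softmax output: $P^\tau = \mathrm{softmax}(o / \tau)$.
In our case, the unpruned network is the teacher and the pruned network is the student. As such, our final loss is:
$$\mathcal{L}(W, \Phi) = \mathcal{L}_D(W) + \lambda\, \mathcal{L}_{BAR}(\Phi, V, a, b)$$
where $\lambda$, $\alpha$ and $\tau$ are fixed parameters.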
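For readers unfamiliar with distillation, the following sketch implements a common form of the loss from [14]: a weighted sum of a hard-label cross-entropy and a temperature-softened cross-entropy between teacher and student. The weighting scheme, the default values of `alpha` and `temperature`, and the function names are assumptions for illustration.

```python
import math

def softmax(logits, temperature=1.0):
    """Numerically stable softmax with an optional temperature."""
    scaled = [v / temperature for v in logits]
    m = max(scaled)
    exps = [math.exp(v - m) for v in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(target, predicted):
    """Cross-entropy between a target distribution and a predicted one."""
    return -sum(t * math.log(p) for t, p in zip(target, predicted) if t > 0)

def kd_loss(student_logits, teacher_logits, label, alpha=0.9, temperature=4.0):
    """Weighted sum of the hard-label loss and the softened teacher-student loss."""
    one_hot = [1.0 if k == label else 0.0 for k in range(len(student_logits))]
    hard = cross_entropy(one_hot, softmax(student_logits))
    soft = cross_entropy(
        softmax(teacher_logits, temperature),
        softmax(student_logits, temperature),
    )
    return (1.0 - alpha) * hard + alpha * temperature ** 2 * soft

agreeing = kd_loss([4.0, 0.0, 0.0], [5.0, 0.0, 0.0], label=0)
disagreeing = kd_loss([0.0, 4.0, 0.0], [5.0, 0.0, 0.0], label=0)
```

The squared temperature factor keeps the gradient magnitude of the softened term comparable to that of the hard-label term, as recommended in [14].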
4 Pruning Residual Networks
While our method can prune any CNN, pruning a CNN without residual connections does not affect the connectivity patterns of the architecture, and simply selects the width at each layer [9]. In this paper, we are interested in allowing any feature map of a residual network to be pruned. This pruning regime can reduce the depth of the network, and generally results in architectures with atypical connectivity that require special care in their implementation to obtain maximum efficiency.
4.1 Automatic Depth Determination
We found, as in [9], that filter-wise pruning can successfully prune entire ResBlocks and change the network depth. This effect was named Automatic Depth Determination in [25]. Since a ResBlock computes a delta that is aggregated with the main (residual) signal by addition (cf. Fig. 5a), such a block can generally be removed without preventing the flow of signal through the network. This is because the main signal’s identity connections cannot be pruned as they lack prunable filters.
However, some ResBlocks, which we call “pooling blocks”, change the spatial dimensions and feature dimensionality of the signal. This type of block breaks the continuity of the residual signal (cf. Fig. 5b). As such, the convolutions inside this block cannot be completely pruned, as this would prevent any signal from flowing through it (a situation we call “fatal pruning”). As a solution, we clamp the highest value of $\Phi$ to ensure that at least one feature map is kept in the conv operation.
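One way to read the clamping described above is sketched here: the largest gate parameter of a protected layer is never allowed to fall below a floor, so at least one feature map survives. The function name and the floor value are hypothetical, and the actual mechanism may differ.

```python
def prevent_fatal_pruning(gate_params, floor=0.0):
    """Return a copy in which the largest gate parameter is at least `floor`."""
    best = max(range(len(gate_params)), key=lambda i: gate_params[i])
    clamped = list(gate_params)
    clamped[best] = max(clamped[best], floor)
    return clamped

almost_dead = [-5.0, -3.0, -9.0]
protected = prevent_fatal_pruning(almost_dead)  # [-5.0, 0.0, -9.0]
```

Only the strongest gate is touched, so the layer remains free to prune all of its other feature maps.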
4.2 Atypical connectivity of pruned ResNets
Our method allows any feature map in the output of a convolution to be pruned (except for the conv of the pooling block). This produces three types of atypical residual connectivity that require special care (see Fig. 6). For example, there could be a feature from the residual signal that would pass through without another signal being added to it (Fig. 6b). New feature maps can also be created and concatenated (Fig. 6c). Furthermore, new feature maps could be created while others could pass through (Fig. 6d).
To leverage the speedup incurred by a pruned feature map, the three cases in Fig. 6 must be taken into account through a mixed-connectivity block which allows these unorthodox configurations. Without this special implementation, some zeroed-out feature maps would still be computed because the summations of residual and refinement signals must have the same number of feature maps. In fact, a naive implementation does not allow refining only a subset of the features of the main signal (as in Fig. 6b), nor does it allow having a varying number of features in the main signal (as in Fig. 6c).
Fig. 7 shows the benefit of a mixed-connectivity block. In (a) is a ResNet layer pruned by our method. Using a regular ResBlock implementation, all feature maps in pairs of tensors that are summed together need to have matching width. This means that, in Fig. 7, all feature maps of the first, third and fourth rows (features) are computed, even if they are dotted. Only the second row can be fully removed. On the other hand, by using mixed-connectivity, only unpruned feature maps are computed, yielding architectures such as in Fig. 7b, which save substantial compute (cf. Section 5).
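The difference between the two implementations can be illustrated with a simplified count of computed feature maps. The sketch assumes that each block's output is described by a binary mask over feature rows, and that a regular ResBlock must compute every row that is alive anywhere along the summed tensors; the masks are made up and do not reproduce Fig. 7 exactly.

```python
def computed_maps_regular(block_masks):
    """Rows alive in any block must be computed by every block (matching widths)."""
    alive_rows = [any(row) for row in zip(*block_masks)]
    return sum(alive_rows) * len(block_masks)

def computed_maps_mixed(block_masks):
    """With mixed connectivity, only unpruned feature maps are computed."""
    return sum(sum(mask) for mask in block_masks)

# Three blocks over four feature rows; 1 means the feature map is kept.
block_masks = [
    [1, 0, 0, 1],
    [0, 0, 1, 0],
    [1, 0, 0, 0],
]
regular = computed_maps_regular(block_masks)  # rows 1, 3 and 4 in all 3 blocks
mixed = computed_maps_mixed(block_masks)      # only the 4 kept maps
```

In this toy layout only the second row is pruned everywhere, so the regular implementation removes just that row while the mixed one skips every pruned map.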
Technical details on our mixed-connectivity block are provided in the Supplementary materials.
5 Experiments
5.1 Experimental Setup
We tested our pruning framework on two residual architectures and report results on four datasets. We pruned WideResNet [34] on CIFAR10, CIFAR100 and TinyImageNet (with a width multiplier of 12 as per [34]), and ResNet50 [13] on MioTCD [23], a larger and more complex dataset devoted to traffic analysis. TinyImageNet and MioTCD samples are resized to and , respectively. Since this ResNet50 has a larger input and is deeper than its CIFAR counterpart, we do not opt for the “wide” version and thus save significant training time. Both network architectures have approximately the same volume.
For all experiments, we use the Adam optimizer with an initial learning rate of and a weight decay of . For CIFAR and TinyImageNet, we use a batch size of 64. For our objective function, we use , , and . We use PyTorch and its standard image preprocessing. For experiments on MioTCD, we start training/pruning with the weights of the unpruned network whereas we initialize with random values for CIFAR and TinyImageNet. Please refer to the Supplementary materials for the number of epochs used in each training phase.
We compare our approach to the following methods:

Random. This approach randomly selects feature maps to be removed.

Weight Magnitude (WM) [10]. This method uses the absolute sum of the weights in a filter as a surrogate of its importance. Lower magnitude filters are removed.

Vector Quantization (VQ) [8]. This approach vectorizes the filters and quantizes them into $k$ clusters, where $k$ is the target width for the layer. The cluster centers are used as the new filters.

Interpolative Decomposition (ID). This method is based on low-rank approximation for network compression [6, 16]. This algorithm factorizes each filter into , where has a specific number of rows corresponding to the budget. replaces , and is multiplied at the next layer (i.e. ) to approximate the original sequence of transformations.

$L_0$ regularization (LZR) [22]. This DSL method is the closest to our method. However, it incorporates no budget, penalizes layer width instead of activation tensor volume, and does not use Knowledge Distillation.

Information Bottleneck (IB) [1]. This DSL method uses a factorized Gaussian distribution (with parameters ) to mask the feature maps as well as the following prior loss : .

MorphNet [9]. This approach uses the scaling parameter of Batch Norm modules as a learnable mask over features. The said parameters are driven to zero by an $L_1$ objective that considers the resources used by a filter (e.g. FLOPs). This method computes a new width for each layer by counting the nonzero parameters. We set the sparsity tradeoff parameter after a hyperparameter search, with as the target pruning factor for CIFAR10.
For every method, we set a budget of tensor activation volume corresponding to 1/2, 1/4, 1/8 and 1/16 of the unpruned volume. Since LZR and IB do not allow setting a budget, we went through trial and error to find the hyperparameter values that yield the desired resource usage. For Random, WM, VQ, and ID, we scale the width of all layers uniformly to satisfy the budget and implement the pruning scheme that proved to be the most effective (cf. Supplementary materials). We also apply our mixed-connectivity block to the output of every method for a fair comparison.
5.2 Results
Results for every method executed on all four datasets are shown in Fig. 8. The first row shows test accuracies w.r.t. the network volume reduction factor for CIFAR10, CIFAR100, TinyImageNet and MioTCD. As one can see, our method is above the others (or competitive) for CIFAR10 and CIFAR100. It is also above every other method on TinyImageNet and MioTCD except for MorphNet, which is better for pruning factors of 2 and 4. However, MorphNet suffers a severe drop in accuracy at 16x, a phenomenon we observed as well on CIFAR10 and CIFAR100. Our method is also always better than IB and LZR, the other two DSL methods. Overall, our method is resilient to severe (16x) pruning ratios.
Furthermore, for every dataset, networks pruned with our method (as well as some others) get better results than the initial unpruned network. This illustrates the fact that WideResNet and ResNet50 are overparameterized for certain tasks and that decreasing their number of feature maps reduces overfitting and thus improves test accuracy.
We then took every pruned network and computed their FLOP reduction factor (we considered operations from convolutions only). This is illustrated in the second row of Fig. 8. There again, our method outperforms (or is competitive with) the others for CIFAR10 and CIFAR100. Our method reduces FLOPs by up to a factor of x on CIFAR10, x on CIFAR100 and x on MioTCD without decreasing test accuracy. We get similar results as LZR for pruning ratios around x on CIFAR10 and CIFAR100 and x on MioTCD. MorphNet gets better accuracy for pruning ratios of x and x on MioTCD, but then drops significantly around x. Results are similar for TinyImageNet.
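As an illustration of how a FLOP reduction factor relates to layer widths, the sketch below counts multiply-accumulate operations for a plain chain of convolutions. The layer widths, kernel size and output resolutions are hypothetical, and residual connections are ignored for simplicity.

```python
def conv_flops(c_in, c_out, kernel, out_h, out_w):
    """Multiply-accumulate count of one convolution layer (bias ignored)."""
    return kernel * kernel * c_in * c_out * out_h * out_w

def network_flops(widths, kernel, out_sizes, in_channels=3):
    """Total convolution FLOPs of a chain of layers with the given widths."""
    total, c_in = 0, in_channels
    for c_out, (h, w) in zip(widths, out_sizes):
        total += conv_flops(c_in, c_out, kernel, h, w)
        c_in = c_out  # the next layer consumes this layer's feature maps
    return total

sizes = [(32, 32), (16, 16), (8, 8)]
full = network_flops([64, 128, 256], 3, sizes)
pruned = network_flops([32, 64, 128], 3, sizes)
flop_reduction = full / pruned
```

Because the cost of a convolution depends on both its input and output widths, halving every width (a 2x volume reduction) reduces FLOPs by close to 4x in this example, which is why the volume and FLOP reduction factors are reported separately.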
In Table 1, we report results of an ablation study on WideResNet-CIFAR10 with two pruning factors. We replaced the Knowledge Distillation data loss (cf. Section 3.5) with a cross-entropy loss, and the sigmoid pruning schedule (cf. Section 3.4) with a linear one. As can be seen, removing either of those reduces accuracy, thus showing their usefulness. We also studied the impact of not using the mixed-connectivity block introduced in Section 4.2. As shown in Table 2, when replacing our mixed-connectivity blocks with regular ResBlocks, the effective pruned volume drops by more than 50% for 16x (and by up to 58% for CIFAR10).
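Table 1: Ablation study on WideResNet-CIFAR10. The first row gives test accuracy; the other rows give the change in accuracy relative to our method.

Configuration                  Pruning factor
                               2x        16x
Our method                     92.70%    91.62%
w/o Knowledge Distillation     -1.37%    -0.40%
w/o Sigmoid pruning schedule   -0.87%    -0.92%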
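
Table 2: Drop in effective pruned volume when mixed-connectivity blocks are replaced by regular ResBlocks.

Dataset     2x     4x     8x     16x
CIFAR10     12%    43%    53%    58%
CIFAR100    14%    49%    55%    57%
MIOTCD      32%    37%    40%    52%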
We illustrate in Fig. 9 results of our pruning method for CIFAR10 (for the other datasets, see supplementary materials). The figure shows the number of neurons per residual block for the full network, and for the networks pruned with varying pruning factors. These plots show that our method has the capability of eliminating entire residual blocks (especially around 1.3 and 1.4). Also, the pruning configurations follow no obvious trend thus showing the inherent plasticity of a DSL method such as ours.
As mentioned in Section 3.3, instead of the volume metric (Eq. (4)), the budget could be set w.r.t. a FLOP metric by accounting for the expectation of the number of feature maps in the preceding layer. We compare in Fig. 10 the results given by these two budget metrics for WideResNet-CIFAR10. As one might expect, pruning a network with a volume metric (V-Trained) yields significantly better performance w.r.t. the volume pruning factor, whereas pruning a network with a FLOP metric (F-Trained) yields better performance w.r.t. the FLOP reduction factor, although by a slight margin. In light of these results, we conclude that the volume metric (Eq. (4)) is overall a better choice.
6 Conclusion
We presented a structured budgeted pruning method based on a dropout sparsity learning framework. We proposed a knowledge distillation loss function combined with a budget-constrained sparsity loss whose formulation is that of a barrier function. Since the log-barrier solution is ill-suited for pruning a CNN, we proposed a novel barrier function as well as a novel optimization schedule. We provided concrete insights on how to prune residual networks and used a novel mixed-connectivity block. Results obtained on two ResNet architectures and four datasets reveal that our method outperforms (or is competitive with) 7 other pruning methods.
Acknowledgements
We thank Christian Desrosiers for his insights. This work was supported by FRQNT scholarship #257800 and Mitacs grant IT08995. Supercomputers from Compute Canada and Calcul Quebec were used.
Footnotes
 Our code is available here: https://tinyurl.com/lemaire2019
References
[1] (2018) Compressing neural networks using the variational information bottleneck. In Proc. of ICML.
[2] (2004) Convex optimization. Cambridge University Press. ISBN 0521833787.
[3] (2018) Learning-compression algorithms for neural net pruning. In Proc. of CVPR, pp. 8532–8541.
[4] (2014) Distributed optimization of deeply nested systems. In AI and Stats, pp. 10–19.
[5] (2013) Predicting parameters in deep learning. In Proc. of NIPS, pp. 2148–2156.
[6] (2014) Exploiting linear structure within convolutional networks for efficient evaluation. In Proc. of NIPS, pp. 1269–1277.
[7] (2018) The lottery ticket hypothesis: training pruned neural networks. arXiv preprint arXiv:1803.03635.
[8] (2014) Compressing deep convolutional networks using vector quantization. arXiv preprint arXiv:1412.6115.
[9] (2018) MorphNet: fast & simple resource-constrained structure learning of deep networks. In Proc. of CVPR.
[10] (2015) Learning both weights and connections for efficient neural network. In Proc. of NIPS, pp. 1135–1143.
[11] (1993) Second order derivatives for network pruning: optimal brain surgeon. In Proc. of NIPS, pp. 164–171.
[12] (2017) Mask R-CNN. In Proc. of ICCV, pp. 2980–2988.
[13] (2016) Deep residual learning for image recognition. In Proc. of CVPR.
[14] (2015) Distilling the knowledge in a neural network. In Proc. of NIPS DLRL Workshop.
[15] (2017) Densely connected convolutional networks. In Proc. of CVPR.
[16] (2014) Speeding up convolutional neural networks with low rank expansions. In Proc. of BMVC.
[17] (2015) Variational dropout and the local reparameterization trick. In Proc. of NIPS, pp. 2575–2583.
[18] (2012) ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pp. 1097–1105.
[19] (1990) Optimal brain damage. In Proc. of NIPS, pp. 598–605.
[20] (2017) Progressive neural architecture search. arXiv preprint arXiv:1712.00559.
[21] (2017) Learning efficient convolutional networks through network slimming. In Proc. of ICCV.
[22] (2018) Learning sparse neural networks through $L_0$ regularization. In Proc. of ICLR.
[23] (2018) MIO-TCD: a new benchmark dataset for vehicle classification and localization. IEEE TIP 27 (10), pp. 5129–5141.
[24] (2017) Variational dropout sparsifies deep neural networks. In Proc. of ICML.
[25] (2018) Unifying the dropout family through structured shrinkage priors. arXiv preprint arXiv:1810.04045.
[26] (2017) Structured Bayesian pruning via log-normal multiplicative noise. In Proc. of NIPS.
[27] (2016) DropNeuron: simplifying the structure of deep neural networks. arXiv preprint arXiv:1606.07326.
[28] (2018) Efficient progressive neural architecture search. arXiv preprint arXiv:1808.00391.
[29] (2018) Efficient neural architecture search via parameter sharing. arXiv preprint arXiv:1802.03268.
[30] (2015) U-Net: convolutional networks for biomedical image segmentation. In Proc. of MICCAI, pp. 234–241.
[31] (2014) Dropout: a simple way to prevent neural networks from overfitting. Journal of ML Research 15, pp. 1929–1958.
[32] (2018) Learning time-efficient deep architectures with budgeted super networks. In Proc. of CVPR.
[33] (2016) Learning structured sparsity in deep neural networks. In Proc. of NIPS.
[34] (2016) Wide residual networks. In Proc. of BMVC.
[35] (2017) Neural architecture search with reinforcement learning. In Proc. of ICLR.