Data Agnostic Filter Gating for Efficient Deep Networks
Abstract
To deploy a well-trained CNN model on low-end edge devices, the model usually has to be compressed or pruned under a given computation budget (e.g., FLOPs). Current filter pruning methods mainly leverage feature maps to generate importance scores for filters and prune those with smaller scores, which ignores the fact that variance across input batches leads to differences in the sparse structure over filters. In this paper, we propose a data-agnostic filter pruning method that uses an auxiliary network, named the Dagger module, to induce pruning; it takes pretrained weights as input to learn the importance of each filter. In addition, to help prune filters under a given FLOPs constraint, we leverage an explicit FLOPs-aware regularization to directly promote pruning filters toward the target FLOPs. Extensive experimental results on the CIFAR-10 and ImageNet datasets indicate our superiority to other state-of-the-art filter pruning methods. For example, our 50% FLOPs ResNet-50 achieves 76.1% Top-1 accuracy on the ImageNet dataset, surpassing many other filter pruning methods.
Keywords: Deep learning; Filter pruning; Model compression; Data agnostic; Dagger module; FLOPs-aware regularization.
xisu5992@uni.sydney.edu.au
youshan@sensetime.com
huangtao@sensetime.com
tjdxxhy@tju.edu.cn
wangfei@sensetime.com
qianchen@sensetime.com
zcs@mail.tsinghua.edu.cn
c.xu@sydney.edu.au
1 Introduction
Recently, artificial intelligence (AI) engines built on deep learning techniques have achieved remarkable success in various tasks Li et al. (2020); Wang et al. (2018b); Du et al. (2020); Shi et al. (2019); Wei et al. (2020); Ming et al. (2019); Yang et al. (); Yang et al. (2018). Networks such as convolutional neural networks (CNNs) are thus favored in cloud and edge computing and are widely deployed on terminal devices such as mobile phones, tablets, AR glasses, wearable watches, and onboard surveillance equipment. However, because they target state-of-the-art accuracy, conventionally trained CNN models in industrial model zoos are usually very large, which makes them cumbersome to deploy on low-end computational devices. A natural question is therefore how to develop, beyond basic accuracy, a ready-to-deploy model under a given computation budget such as FLOPs. Thanks to progress in model compression and acceleration, pruning has become an efficient way to derive light models from existing cumbersome models while retaining their baseline performance.
Currently, pruning can be divided into weight pruning and filter pruning. Filter pruning is more competitive than weight pruning, since it results in a lightweight model that keeps the network structure of the pretrained model and is friendly to current off-the-shelf deep learning frameworks. Moreover, filter pruning is complementary to other compression techniques: the pruned networks can usually be further compressed using quantization Chen et al. (); Han et al. (2015), low-rank decomposition Denton et al. (); Sindhwani et al. () or knowledge distillation You et al. (2017, 2018); Kong et al. (2020). One effective way to remove redundant filters is to prune them based on pretrained weights; in this paper, filters that are less important to network performance are referred to as redundant filters. Basically, filter pruning works by first finding and pruning the redundant filters, then retraining (fine-tuning) the pruned network to recover its performance.
Identifying redundant filters is the key step in filter pruning. Specifically, many filter pruning methods leverage filter-wise auxiliary parameters to find redundant filters, e.g., scaling factors You et al. (2019) and trainable auxiliary parameters Xiao et al. (2019). However, these auxiliary parameters are usually optimized simultaneously with the network parameters in the form of a multiplication, and the collaboration or competition between the two may lead to unexpected results. For example, the filter weights corresponding to small auxiliary parameters may become large to compensate, so it may be inaccurate to judge redundant filters directly from the auxiliary parameters. Besides, to determine the redundancy of a specific filter within the overall convolution kernel, it is better to use the information of all feature maps or filters together rather than filter-wise auxiliary parameters. To identify filter redundancy globally, many filter pruning algorithms construct a gate network that takes feature maps as input to generate importance scores for filters. However, the variance of input batches may lead to differences in the sparse structure over filters. Thus those filters with non-support scores are to be pruned.
In addition, filter pruning methods mostly neglect the allocation of the FLOPs budget during training. To ensure that the pruned network stays under a given FLOPs budget, they have to resort to various sparsity proxies, e.g., the norm of filters, or the norm of filter weights or scaling factors. Nevertheless, the network induced by these sparsity proxies is not necessarily optimal for the constrained FLOPs budget, and a careful hyperparameter setup for the sparsity proxy is usually needed so that the FLOPs of the pruned network match the given budget exactly.
In this paper, we propose to prune redundant filters through the Dagger module, which takes kernel weights as input to deduce the redundancy of different filters. This has three advantages. First, the gates of filters are generated by the Dagger module, which avoids the joint optimization of gates and filter weights. Second, the kernel weights have been optimized over the whole training dataset for several epochs, so they contain information related to the whole dataset. Third, we can easily and quickly adapt the pretrained network to different budgets for the pruned network. Based on the Dagger module, we also propose a FLOPs-aware regularizer to directly prune redundant filters from the pretrained model toward a target FLOPs budget. Concretely, we allocate a binary gate to each filter, where 0 means that the filter should be pruned and 1 means that it should be retained. In this way, the status of all binary gates corresponds to a certain filter configuration and FLOPs value. However, binary gates are hard to optimize, so we relax them into real-valued gates in the interval $[0, 1]$ and model them with a Dagger module that uses the pretrained weights. These Dagger modules can therefore sufficiently exploit the information within the pretrained filters and serve as a decent surrogate for binary gates, from which a corresponding FLOPs regularization term is derived. With the weights of the pretrained model fixed, the gates generated by the Dagger module can be optimized to maintain accuracy while reducing FLOPs, so that a smaller network is induced. Besides, to comply with a fixed FLOPs budget, we optimize the Dagger module in a greedy manner, so that the model is pruned gradually and we can check whether the valid FLOPs of the current pruned network match the budget.
We have conducted extensive experiments on the benchmark CIFAR-10 dataset Krizhevsky et al. (2014) and the large-scale ImageNet dataset Deng et al. (2009). The experimental results show that under the same FLOPs budget or acceleration rate, our method achieves higher accuracy than other state-of-the-art filter pruning methods. For example, with half of the FLOPs (2× acceleration) of ResNet-50 He et al. (2016), we achieve 76.1% Top-1 accuracy on the ImageNet dataset, exceeding other filter pruning methods. Our main contributions can be summarized as follows.

We adopt a Dagger module based on pretrained weights to model the gates for filter pruning, such that redundant filters can be identified more accurately with respect to the entire dataset, avoiding the issue of joint optimization of auxiliary parameters and kernel weights.

We propose to use FLOPs as an explicit regularization that, alongside the classification performance, guides the pruning of redundant filters.

Our method is easy to implement, and experimental results indicate our superiority to other state-of-the-art pruning methods.
2 Related Work
To enable an over-parameterized convolutional neural network to be deployed on low-end computational devices, various methods have been developed to reduce the model capacity and FLOPs, such as weight pruning Tung and Mori (2018); Dong et al. (2017); Guo et al. (2016); Carreira-Perpinan and Idelbayev (2018), filter pruning He et al. (2017); Luo et al. (2017); Huang and Wang (2018); Tang et al. (2020); Liu et al. (2017); Zhuang et al. (2018b); Tang et al. (2019), parameter quantization Chen et al. (); Wu et al. (); Han et al. (2015), low-rank approximation Denton et al. (); Sindhwani et al. () and so on. Essentially, weight pruning optimizes the weights in an unstructured way, which makes the result hard to deploy on low-end computational devices and often requires a special design to achieve acceleration. Parameter quantization and low-rank approximation, in turn, can be applied as complements to filter pruning to further reduce the computation budget. Our method falls into the filter pruning category.
Filter-wise scaling factors. Filter pruning is designed to speed up network inference by pruning redundant filters, and an important task is to assess the redundancy of filters. Many methods leverage filter-wise auxiliary parameters for this purpose. For example, Liu et al. (2017) used filter-wise scaling factors to prune a large network into a reduced model with comparable precision. Huang and Wang (2018) proposed to use filter-wise scaling factors to indicate the redundancy of filters. Xiao et al. (2019) proposed to prune filters by optimizing a set of trainable auxiliary parameters instead of the original weights. You et al. (2019) proposed to leverage filter-wise scaling factors to select redundant filters. However, in these algorithms the filter-wise auxiliary parameters are inevitably optimized simultaneously with the network parameters through multiplication, so pruning filters directly according to the auxiliary parameters, without considering the scale of the filter weights, is unreliable.
Gate network for pruning. To solve the above issue, some algorithms use a gate network to induce pruning. For example, Zhuang et al. (2018b) proposed a discrimination-aware network with an additional loss to select the filters that really contribute to discriminative power. Liu et al. (2019) proposed to leverage a meta-network to help identify the number of filters in each layer. Wang et al. (2018a) proposed to use spectral clustering on filters to select redundant filters. Veit and Belongie (2018) proposed to leverage gates to define the network topology conditioned on the input batches. AutoPruner Luo and Wu (2018) can be regarded as a separate layer that is attached to any convolution layer to automatically prune its filters. These methods neglect the fact that variance across input batches leads to differences in the sparse structure over filters, and thus to variance in the selected redundant filters.
AutoML methods. Since filter pruning is generally regarded as an optimization problem Liu et al. (2018b), some works adopt AutoML (i.e., NAS) methods Liu et al. (2018a); Guo et al. (2019); Mei et al. (2020); You et al. (2020); Yang et al. (2020) to automatically search for the best network structure under a fixed FLOPs budget. Although AutoML-based approaches usually achieve competitive performance, they cannot take full advantage of a pretrained model and are usually computationally expensive. A typical AutoML pipeline optimizes a wide network with various operations from a huge space as a performance evaluator, searches for the structure with the best performance according to that evaluator, and then trains it from scratch. In contrast, our method focuses on leveraging pretrained weights to investigate the redundancy in filters and aims to obtain a compact network for a given FLOPs budget.
Contributions. We highlight the contributions and advantages of our method as follows. (1) Our method uses a Dagger module to generate a gate for each filter by taking filter parameters as input, so the generated gates are related to the whole dataset. This is more suitable for pruning filters, since in our method different input batches share the same sparse structure over filters. Other methods (e.g., You et al. (2019); Xiao et al. (2019); Hu et al. (2019)) take feature maps as input, which makes their gates depend on the input batches. (2) Pruning methods that use filter-wise scaling factors (e.g., Huang and Wang (2018); Li et al. (2016)) use only one additional parameter per filter and therefore lack the capacity to model filter redundancy adaptively. (3) We propose a FLOPs-aware regularizer to directly prune redundant filters from the pretrained model toward the target FLOPs, while other filter pruning methods generally resort to various sparsity proxies, e.g., the norm of filters, or the norm of filter weights or scaling factors.
3 Modeling Redundancy of Filters with Dagger Module
Filter pruning intends to identify redundant filters of the pretrained model so that a compact pruned network with a smaller FLOPs budget can be derived. Current methods use different scaling factors, gate networks or sparsity proxies to prune redundant filters, such as sparse CNN filters Zhuang et al. (2018b), scaling factors of BN layers Liu et al. (2017), auxiliary factors (gates) You et al. (2019); Xiao et al. (2019); Hu et al. (2019) and scaling methods Huang and Wang (2018); Li et al. (2016). However, in practice, these methods generally use additional scaling factors, or gates modeled with batch information as input, to select the redundant filters. This makes them batch dependent and may be harmful to locating redundant filters, since different input batches may have different sparse structures over filters. In addition, there is usually a requirement that the pretrained model be pruned to a specified FLOPs level or accelerated by a certain factor. To solve the above problems, we propose to leverage the Dagger module to identify redundant filters, using pretrained weights as input, to directly obtain a network structure with a given FLOPs budget. Denote the original network as $\mathcal{N}$ with pretrained weights $W_0$.
3.1 Binary gates
Suppose the network $\mathcal{N}$ has $L$ layers, and the feature maps of the $l$-th layer are denoted as $F^l \in \mathbb{R}^{N \times C_l \times h_l \times w_l}$, where $N$ is the batch size, $C_l$ is the number of filters, and $h_l$ and $w_l$ are the spatial height and width of the feature map, respectively. To identify redundant filters based on the pretrained model, we allocate a binary gate $g \in \{0, 1\}$ to each filter, where $1$ means the corresponding filter should be retained, while $0$ means it is to be pruned. With the introduced gates, the feature maps can be augmented filter-wise. Mathematically, for the $l$-th layer, the augmented feature map is expressed as a multiplication of a tensor by a scalar, i.e.,
$$\tilde{F}^l_c = g^l_c \cdot F^l_c, \quad c = 1, \dots, C_l, \tag{1}$$
where $F^l_c$ is the $c$-th feature map of $F^l$ and $g^l_c$ is the gate of the $c$-th filter in the $l$-th layer. As Eq. (1) shows, $g^l_c$ controls whether the corresponding filter in $F^l$ is activated. When $g^l_c$ is 0, the corresponding filter in the feature map $F^l$ is deactivated and pruned.
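As an illustration of Eq. (1), the following is a minimal sketch of filter-wise gating for a single sample, written with plain Python lists. The function name `apply_gates` and the toy values are assumptions made for this example only and are not part of the original method description.

```python
from typing import List

FeatureMap = List[List[float]]  # one H x W feature map


def apply_gates(feature_maps: List[FeatureMap], gates: List[int]) -> List[FeatureMap]:
    """Multiply the c-th feature map by the c-th gate (hypothetical helper)."""
    if len(feature_maps) != len(gates):
        raise ValueError("need exactly one gate per filter")
    return [
        [[g * v for v in row] for row in fmap]
        for fmap, g in zip(feature_maps, gates)
    ]


# Two 2x2 feature maps; the second filter is gated off (gate = 0).
maps = [[[1.0, 2.0], [3.0, 4.0]], [[5.0, 6.0], [7.0, 8.0]]]
gated = apply_gates(maps, [1, 0])
print(gated)
```

A gate of 1 leaves the feature map untouched, while a gate of 0 zeroes it out, which is what allows the corresponding filter to be removed afterwards.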
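As a result, the number of retained filters of the pruned network is directly controlled by the binary gates $G = \{g^l\}_{l=1}^{L}$, where $g^l = (g^l_1, \dots, g^l_{C_l})$ collects the gates of the $l$-th layer. Concretely, for the feature map $F^l$ of the $l$-th layer, its valid filter number is exactly the number of nonzero gates, i.e.,
$$\|g^l\|_0 = \sum_{c=1}^{C_l} \mathbb{I}(g^l_c \neq 0), \tag{2}$$
where $\mathbb{I}(\cdot)$ is an indicator function and $\|\cdot\|_0$ is the $\ell_0$ norm. Specifically, for the non-pruned pretrained model, the gates are all ones, and $\|g^l\|_0 = C_l$. Given a pretrained model, its FLOPs depend only on the valid number of filters, since the operations and spatial sizes are fixed. The FLOPs of the pruned network are then determined by the gates $G$, which can be written as
$$\mathcal{R}(G) = \sum_{l=1}^{L} \mathcal{R}_l\left(\|g^{l-1}\|_0, \|g^l\|_0\right), \tag{3}$$
where $\mathcal{R}_l(\cdot)$ is the FLOPs calculator of the $l$-th layer and $g^0$ denotes the all-one gates of the network input channels. For example, if the $l$-th layer has $C_l$ filters and spatial output size $h_l \times w_l$, then for a $1 \times 1$ convolutional layer, its FLOPs calculator with gates $g^l$ is
$$\mathcal{R}_l = h_l \cdot w_l \cdot \|g^{l-1}\|_0 \cdot \|g^l\|_0. \tag{4}$$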
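The counting in Eqs. (2)-(4) can be sketched in a few lines of Python. This is a hedged illustration that assumes a plain chain of 1×1 convolutions and counts one FLOP per multiply-accumulate; the helper names (`valid_filters`, `conv1x1_flops`, `total_flops`) and the toy layer sizes are hypothetical.

```python
from typing import List, Sequence, Tuple


def valid_filters(gates: Sequence[int]) -> int:
    """Number of non-zero gates in one layer (the l0 norm of the gate vector)."""
    return sum(1 for g in gates if g != 0)


def conv1x1_flops(height: int, width: int,
                  in_gates: Sequence[int], out_gates: Sequence[int]) -> int:
    """FLOPs of a 1x1 convolution: H * W * valid inputs * valid outputs."""
    return height * width * valid_filters(in_gates) * valid_filters(out_gates)


def total_flops(spatial: List[Tuple[int, int]], gates: List[Sequence[int]]) -> int:
    """Sum per-layer FLOPs. gates[0] belongs to the network input,
    gates[l + 1] to the output of the l-th convolution."""
    return sum(
        conv1x1_flops(h, w, gates[l], gates[l + 1])
        for l, (h, w) in enumerate(spatial)
    )


# Toy network: 3 input channels -> 4 filters (8x8 output) -> 2 filters (4x4 output).
spatial = [(8, 8), (4, 4)]
full = [[1, 1, 1], [1, 1, 1, 1], [1, 1]]
pruned = [[1, 1, 1], [1, 0, 1, 0], [1, 1]]
print(total_flops(spatial, full))    # 64*3*4 + 16*4*2 = 896
print(total_flops(spatial, pruned))  # 64*3*2 + 16*2*2 = 448
```

Note that pruning half of the filters in the first layer reduces the cost of both layers, because the gates of one layer also determine the number of valid input channels of the next.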
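As a result, we can formulate the selection of filters as a mixed 0-1 optimization problem,
$$\min_{W, G} \ \mathcal{L}(W, G; \mathcal{D}) \quad \text{s.t.} \quad \mathcal{R}(G) \le \mathcal{B}, \quad g^l_c \in \{0, 1\}, \tag{5}$$
where $W$ denotes the weights of the network $\mathcal{N}$, initialized with the pretrained weights $W_0$, $\mathcal{L}$ is the training loss, $\mathcal{D}$ is the training dataset, $\mathcal{R}(G)$ is the FLOPs of the network under gates $G$, and $\mathcal{B}$ is the FLOPs budget. Note that the hard constraint in Eq. (5) is equivalent to a version expressed with an acceleration rate $r$, i.e.,
$$\mathcal{R}(G) \le \mathcal{R}(\mathbf{1}) / r, \tag{6}$$
where $\mathcal{R}(\mathbf{1})$ is the overall FLOPs of the pretrained model $\mathcal{N}$, i.e., the FLOPs with all gates equal to one. The number of filters (or gates) is then learned to minimize the training cost under a given FLOPs budget $\mathcal{B}$.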
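The relation between a FLOPs budget and an acceleration rate in Eq. (6) is simple arithmetic; the sketch below spells it out. The function names are hypothetical, and the numbers are the ResNet-50 figures quoted later in this paper (4.1G FLOPs pretrained, 2.0G after pruning).

```python
def flops_budget(pretrained_flops: float, acceleration: float) -> float:
    """Budget implied by an acceleration rate r: pretrained FLOPs divided by r."""
    if acceleration < 1.0:
        raise ValueError("acceleration rate must be at least 1")
    return pretrained_flops / acceleration


def within_budget(pruned_flops: float, pretrained_flops: float,
                  acceleration: float) -> bool:
    """Check the hard FLOPs constraint for a candidate pruned network."""
    return pruned_flops <= flops_budget(pretrained_flops, acceleration)


# A 2x acceleration of a 4.1G-FLOPs model gives a 2.05G budget.
print(flops_budget(4.1e9, 2.0))
print(within_budget(2.0e9, 4.1e9, 2.0))  # a 2.0G model satisfies it
```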
3.2 Modeling gates with the Dagger module
However, the 0-1 optimization in Eq. (5) is an NP-hard problem. We therefore relax the original problem by considering real-valued gates in the interval $[0, 1]$. Moreover, to better model the real-valued gates, we attach to the main network an auxiliary network, named the Dagger module, which generates the gates from the trained weights $W$. The generated filter-wise gates are applied directly to the output feature maps by multiplication in order to identify redundant filters. The proposed framework is shown in Fig. 1.
Previous filter pruning algorithms usually take batch information (e.g., images) as input to identify redundant filters. Since different input batches may have different sparse structures over filters, the selected gates can be modeled as
$$g^l = \mathcal{G}(X_i; \phi), \quad X_i \subset \mathcal{D}, \tag{7}$$
where $X_i$ denotes the $i$-th batch from the training dataset $\mathcal{D}$, and $\mathcal{G}$ is the gate network with parameters $\phi$. However, different batches may lead to different redundant filters, and it is almost impossible to infer the globally optimal set of redundant filters for the entire dataset from the locally optimal solutions associated with individual batches.
Therefore, directly using information related to the whole dataset $\mathcal{D}$ as input to the Dagger module can lead to a globally optimal solution for redundant filters. In detail, we use the pretrained convolution weights as the input of the Dagger module, since these weights have been updated with information from the entire dataset $\mathcal{D}$ through gradient descent for several epochs. The gates are generated by taking kernel weights as input, which makes the computation of the Dagger module independent of the input batches, thereby eliminating the batch variation over filters and avoiding the issue of joint optimization with kernel weights. The generation of gates can be modeled via a network (i.e., the Dagger module) denoted as $\mathcal{G}$ with Dagger weights $\phi$, namely,
$$g^l = \mathcal{G}(W^l; \phi), \tag{8}$$
where $W^l$ denotes the kernel weights of the $l$-th layer.
The structure of our Dagger module is shown in Fig. 2(a). To reduce the computational complexity of the Dagger module, we first merge the filters $W^l$ of the $l$-th layer by average pooling so that the result has the same size as the gates $g^l$. The merged filters are then passed through two simple fully-connected (FC) layers and activated by a sigmoid function $\sigma(\cdot)$, i.e.,
$$g^l = \sigma\left(\mathrm{FC}_2\left(\mathrm{FC}_1\left(\mathrm{AvgPool}(W^l)\right)\right)\right). \tag{9}$$
Note that the sigmoid function maps the gates into the interval $[0, 1]$. Besides, since CNN filters may differ in the magnitude of their values, we also normalize the input to the two FC layers by subtracting its mean.
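The forward pass described above (average pooling, mean subtraction, two FC layers, sigmoid) can be sketched in pure Python as follows. Several details are assumptions made for this illustration rather than facts from the text: the kernel is taken to have shape (C_out, C_in, k, k) and is pooled to one scalar per output filter, a ReLU is placed between the two FC layers, and the hidden width equals the number of filters. All function names are hypothetical.

```python
import math
import random
from typing import List

Vector = List[float]
Matrix = List[Vector]
Kernel = List[List[List[List[float]]]]  # (C_out, C_in, k, k)


def average_pool_filters(kernel: Kernel) -> Vector:
    """Collapse each output filter to a single scalar by averaging its weights."""
    pooled = []
    for filt in kernel:
        values = [v for channel in filt for row in channel for v in row]
        pooled.append(sum(values) / len(values))
    return pooled


def subtract_mean(x: Vector) -> Vector:
    """Normalization step: remove the mean of the pooled filter vector."""
    mean = sum(x) / len(x)
    return [v - mean for v in x]


def linear(x: Vector, weight: Matrix, bias: Vector) -> Vector:
    """Fully-connected layer: weight @ x + bias."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weight, bias)]


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def dagger_gates(kernel: Kernel, w1: Matrix, b1: Vector,
                 w2: Matrix, b2: Vector) -> Vector:
    """Generate one gate per output filter from the kernel weights alone."""
    pooled = subtract_mean(average_pool_filters(kernel))
    hidden = [max(0.0, h) for h in linear(pooled, w1, b1)]  # ReLU is an assumption
    return [sigmoid(z) for z in linear(hidden, w2, b2)]


rng = random.Random(0)
c_out, c_in, k = 4, 2, 3
kernel = [[[[rng.uniform(-1, 1) for _ in range(k)] for _ in range(k)]
           for _ in range(c_in)] for _ in range(c_out)]
w1 = [[rng.uniform(-0.5, 0.5) for _ in range(c_out)] for _ in range(c_out)]
w2 = [[rng.uniform(-0.5, 0.5) for _ in range(c_out)] for _ in range(c_out)]
zeros = [0.0] * c_out
gates = dagger_gates(kernel, w1, zeros, w2, zeros)
print(gates)
```

Because no input image appears anywhere in `dagger_gates`, the same gates are obtained for every batch, which is the data-agnostic property the section argues for.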
With the Dagger module, the gates can be modeled continuously. Gates close to zero are regarded as redundant, so their corresponding filters are to be pruned. Remark. Unlike the squeeze-and-excitation (SE) module Hu et al. (2018) and AutoPruner Luo and Wu (2018), our Dagger module can be discarded after pruning, since it is independent of the input batches. In contrast, the modules in SE and AutoPruner both use feature maps as input to the auxiliary network, which makes their gates batch dependent, so they can only produce suboptimal, batch-related pruning results.
4 Pruning with FLOPs-aware Regularization
With the gates generated by the Dagger module, the original Eq. (5) is relaxed into a continuous optimization problem. However, the hard FLOPs constraint in Eq. (5) depends on the $\ell_0$ norm of the gates, which is not computationally feasible for optimization. We therefore approximate it with the surrogate $\ell_1$ norm. For example, the FLOPs of Eq. (4) can be estimated as
$$\hat{\mathcal{R}}_l = h_l \cdot w_l \cdot \|g^{l-1}\|_1 \cdot \|g^l\|_1, \tag{10}$$
where $g^l$ denotes the gates of the $l$-th layer and $h_l \times w_l$ is its spatial output size, and the total FLOPs in Eq. (3) can be estimated as
$$\hat{\mathcal{R}}(G) = \sum_{l=1}^{L} \hat{\mathcal{R}}_l\left(\|g^{l-1}\|_1, \|g^l\|_1\right). \tag{11}$$
In this case, if the augmented network is initialized with all-one gates, $\hat{\mathcal{R}}(G)$ is an exact estimate of the FLOPs, since the $\ell_1$ and $\ell_0$ norms coincide. Minimizing $\hat{\mathcal{R}}(G)$ then drives the gates down to different values. The differences among gates reflect their different importance and sensitivity with respect to the FLOPs calculation, which amounts to different redundancy over filters. As a result, $\hat{\mathcal{R}}(G)$ can be regarded as a regularization for reducing the FLOPs of the pretrained model. Moreover, since the regularization is continuous, it can be optimized with various gradient-based optimizers, such as stochastic gradient descent (SGD).
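To make the surrogate concrete, the sketch below compares the exact count-based FLOPs of a single 1×1 convolution with the norm-based estimate, and also gives the gradient of the estimate with respect to one output gate. It assumes non-negative gates (as produced by a sigmoid) and one FLOP per multiply-accumulate; the function names and toy values are hypothetical.

```python
from typing import Sequence


def l0(gates: Sequence[float]) -> int:
    """Number of non-zero gates."""
    return sum(1 for g in gates if g != 0)


def l1(gates: Sequence[float]) -> float:
    """Sum of absolute gate values, the continuous surrogate for l0."""
    return sum(abs(g) for g in gates)


def flops_exact(h: int, w: int, in_gates, out_gates) -> int:
    return h * w * l0(in_gates) * l0(out_gates)


def flops_estimate(h: int, w: int, in_gates, out_gates) -> float:
    return h * w * l1(in_gates) * l1(out_gates)


def estimate_grad_wrt_out_gate(h: int, w: int, in_gates) -> float:
    """d(estimate)/d(g_c) for any non-negative output gate g_c."""
    return h * w * l1(in_gates)


ones_in, ones_out = [1, 1, 1], [1, 1, 1, 1]
relaxed_out = [1.0, 0.5, 0.25, 0.25]

print(flops_exact(8, 8, ones_in, ones_out), flops_estimate(8, 8, ones_in, ones_out))
print(flops_exact(8, 8, ones_in, relaxed_out), flops_estimate(8, 8, ones_in, relaxed_out))
print(estimate_grad_wrt_out_gate(8, 8, ones_in))
```

With all-one gates both quantities agree, whereas with relaxed gates the estimate (384) falls below the exact count (768): this is the estimation gap that motivates the alignment and greedy pruning described next.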
4.1 Pruning filters under accurate estimation
To identify the redundant filters, the gates should also preserve accuracy. Hence, we learn them under the supervision of the classification loss together with the FLOPs-aware regularization, using the Dagger module to generate them. However, the constrained problem defined in Eq. (5) cannot be optimized directly through gradient descent. To solve this issue, we reformulate Eq. (5) with a Lagrange multiplier Wah et al. (2000) and use Eq. (11) as the estimate of FLOPs. The optimization problem then becomes
$$\min_{W, \phi} \ \mathcal{L}\left(W, \mathcal{G}(W; \phi); \mathcal{D}\right) + \lambda \cdot \hat{\mathcal{R}}\left(\mathcal{G}(W; \phi)\right), \tag{12}$$
where $\mathcal{G}(W; \phi)$ denotes the gates generated by the Dagger module with Dagger weights $\phi$, $\hat{\mathcal{R}}$ is the FLOPs estimate of Eq. (11), and the balance parameter $\lambda$ is the coefficient of the Lagrange multiplier, which is discussed in detail in Sec. 5. However, the regularization $\hat{\mathcal{R}}$ is not always a good estimate because of the gap between the $\ell_0$ and $\ell_1$ norms, which means that it might not be sensible to simply adopt Eq. (12) for learning gates in an end-to-end manner.
Inspired by the fact that the regularization term of Eq. (11) is an exact estimate of FLOPs when all gates are ones, we propose to prune redundant filters greedily. Concretely, we start from the pretrained model with all gates initialized to one. We can then safely minimize the loss in Eq. (12) to learn the gates, since the FLOPs estimate is exact at this point. Under the supervision of the classification loss and the FLOPs guidance, gates with different values are obtained, and these values indicate the current differences in redundancy over the corresponding filters. We then greedily prune the filters whose gates have the smallest values.
Nevertheless, since the gates are imposed on the feature maps, gates and weights are highly coupled in magnitude. This implies that greedily pruning gates while simultaneously optimizing the weights may not work well. Therefore, we adopt an iterative update strategy: during learning, we optimize one while fixing the other. The resulting algorithm is illustrated in Algorithm 1.
4.2 Iterative optimization with FLOPs examination
For clarity, we refer to the original network as the main network, in contrast with the Dagger module. As previously described, we iteratively update the gates (i.e., the Dagger module $\mathcal{G}$) and the weights $W$ of the main network, as presented in Algorithm 1 and elaborated below.
Greedy pruning gates with fixed weights
Specifically, with the main network fixed, the Dagger module acts as a pruner that generates gates for the main network, which provides guidance on how to prune redundant filters in the main network. In turn, the main network supplies the gates with a classification evaluation, so that the retained gates maintain the classification performance as much as possible. As a result, the Dagger module can be optimized with the following objective:
$$\min_{\phi} \ \mathcal{L}\left(W, \mathcal{G}(W; \phi); \mathcal{D}\right) + \lambda \cdot \hat{\mathcal{R}}\left(\mathcal{G}(W; \phi)\right), \tag{13}$$
where, in contrast to Eq. (12), the weights $W$ are fixed and only the Dagger weights $\phi$ are updated. The gates are therefore optimized under the joint supervision of the classification loss and the FLOPs-aware regularization.
However, before we optimize the Dagger module and the gates, we need to close the estimation gap of the FLOPs regularization. To this end, we align the gates generated by the Dagger module, as in line 4 of Algorithm 1. Concretely, we first retrain the Dagger module so that its output gates equal $0.5$, which amounts to the output of the second FC layer in Fig. 2 being equal to $0$. We then add $0.5$ to the output gates so that their values equal $1$. This alignment has two advantages. First, after alignment the gates are all ones, so the regularization is an exact estimate of FLOPs for further regularizing the redundant gates. Second, aligning the pre-activation values to $0$ places the sigmoid at its maximum slope, which strengthens the impact of the regularization when optimizing the gates.
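The two properties used in the alignment step follow directly from the sigmoid function and can be checked numerically. The short sketch below assumes that alignment drives the pre-activation to zero and then adds a constant offset of 0.5; the function names are hypothetical.

```python
import math


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def sigmoid_slope(z: float) -> float:
    """Derivative of the sigmoid: s(z) * (1 - s(z))."""
    s = sigmoid(z)
    return s * (1.0 - s)


def aligned_gate(pre_activation: float, offset: float = 0.5) -> float:
    """Gate value after alignment: sigmoid output shifted by a constant offset."""
    return sigmoid(pre_activation) + offset


print(aligned_gate(0.0))    # pre-activation 0 gives a gate of exactly 1
print(sigmoid_slope(0.0))   # the slope is largest at 0
print([round(sigmoid_slope(z), 4) for z in (-3.0, -1.0, 1.0, 3.0)])
```

Starting every gate at the point of maximum slope means that a given gradient from the regularizer moves the gate value as much as possible, which is the second advantage named above.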
After the alignment, we can safely optimize Eq. (13) to obtain the gates. However, the estimation gap may grow again during optimization. We therefore prune redundant filters greedily, i.e., we prune a few gates at a time, repeatedly, until the retained gates satisfy the FLOPs budget. Generally, the closer a gate is to zero, the smaller its contribution to the network. Since the gates are all ones after the alignment, we optimize Eq. (13) for some steps (line 5 of Algorithm 1) and then prune the gates with the smallest values. We usually set a pruning ratio (e.g., 0.6%) to control the number of gates pruned in each update (line 6 of Algorithm 1). In addition, to meet a hard FLOPs constraint, we simply run a FLOPs examination each time gates are pruned (line 8 of Algorithm 1). Once the retained gates satisfy the FLOPs budget, the redundant filters are considered to be well identified. Thanks to the greedy procedure and the pretrained weights, our algorithm needs only a small number of input batches to prune redundant filters, which makes filter pruning fast.
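The greedy loop with FLOPs examination can be sketched as follows. This is a simplified illustration under stated assumptions, not the paper's Algorithm 1: the network is a chain of 1×1 convolutions, gates are ranked globally across layers, at least one filter is kept per layer, and the gate optimization between pruning rounds is replaced by an optional `optimize_fn` hook (left out in the toy run, so gate values stay fixed). All names are hypothetical.

```python
import math
from typing import Callable, List, Optional, Tuple

Gates = List[List[float]]


def chain_flops(gates: Gates, spatial: List[Tuple[int, int]], in_channels: int) -> int:
    """Exact FLOPs of a chain of 1x1 convolutions, counting non-zero gates."""
    counts = [in_channels] + [sum(1 for g in layer if g != 0.0) for layer in gates]
    return sum(h * w * counts[l] * counts[l + 1] for l, (h, w) in enumerate(spatial))


def greedy_prune(gates: Gates, flops_fn: Callable[[Gates], int], budget: int,
                 prune_ratio: float,
                 optimize_fn: Optional[Callable[[Gates], Gates]] = None) -> Gates:
    """Prune the smallest gates a few at a time until the FLOPs budget is met."""
    gates = [list(layer) for layer in gates]
    total = sum(len(layer) for layer in gates)
    per_step = max(1, math.ceil(prune_ratio * total))
    while flops_fn(gates) > budget:               # FLOPs examination
        if optimize_fn is not None:
            gates = optimize_fn(gates)            # placeholder for gate learning
        alive = sorted((g, l, c) for l, layer in enumerate(gates)
                       for c, g in enumerate(layer) if g != 0.0)
        removed = 0
        for _, l, c in alive:
            if removed == per_step:
                break
            if sum(1 for v in gates[l] if v != 0.0) > 1:  # never empty a layer
                gates[l][c] = 0.0
                removed += 1
        if removed == 0:
            raise RuntimeError("budget unreachable without emptying a layer")
    # Retained gates are reset to one before fine-tuning the weights.
    return [[1.0 if g != 0.0 else 0.0 for g in layer] for layer in gates]


spatial = [(8, 8), (4, 4)]
learned = [[0.9, 0.2, 0.7, 0.4], [0.8, 0.3, 0.6, 0.1]]
flops = lambda g: chain_flops(g, spatial, in_channels=3)
binary = greedy_prune(learned, flops, budget=512, prune_ratio=0.125)
print(binary, flops(binary))
```

In this toy run the full network costs 1024 FLOPs; one gate is removed per round (12.5% of 8 gates), and after four rounds the examination passes at 448 FLOPs, below the budget of 512.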
Fine-tuning weights with fixed gates
After some gates are pruned (fixed to zero), we need to fine-tune the weights of the main network with the gates fixed. However, after the greedy pruning, the values of the retained gates are no longer exactly one. Fine-tuning the weights on top of them would further worsen the coupling issue, since the weights would be trained with biased gates. We therefore set all retained gates to one (line 9 of Algorithm 1) and fine-tune afterward. The objective is
$$\min_{W} \ \mathcal{L}(W, G; \mathcal{D}), \tag{14}$$
where the binary gates $G$ are fixed. This compensates for the information lost in the retained filters due to filter pruning.
4.3 Dealing with skipping layers
To construct the FLOPs-aware regularization, the FLOPs need to be calculated. For a regular layer (e.g., the 1×1 convolution in Eq. (4)), the numbers of filters in different layers are independent, so we can use the norm of the gates of each layer to represent its number of filters and calculate the FLOPs in a simple form. However, for bottlenecks with skipping layers, such as those in ResNet He et al. (2016) and MobileNetV2 Sandler et al. (2018), there is a structural constraint that the input and output of the bottleneck should have the same number of filters.
For a bottleneck with skipping layers, as in Fig. 2(b), each layer has its own gates. Denote the gates of the input and output as $g^{in}$ and $g^{out}$, respectively. For a pretrained model, $g^{in}$ and $g^{out}$ have the same size. To calculate the FLOPs of this bottleneck, the valid gates of $g^{in}$ and $g^{out}$ should be their union, as Fig. 2(b) shows, i.e.,
$$\tilde{g}^{in} = \tilde{g}^{out} = g^{in} \cup g^{out}, \tag{15}$$
where $\cup$ is the union operation. Based on the valid gates, the FLOPs can then be calculated as for regular layers.
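For binary gates, the union of the input and output gates of a bottleneck is an element-wise logical OR. A minimal sketch, with a hypothetical helper name and toy gate vectors:

```python
from typing import List, Sequence


def union_gates(in_gates: Sequence[int], out_gates: Sequence[int]) -> List[int]:
    """A channel stays valid if it is kept on either side of the skip connection."""
    if len(in_gates) != len(out_gates):
        raise ValueError("input and output gates must have the same size")
    return [1 if (a != 0 or b != 0) else 0 for a, b in zip(in_gates, out_gates)]


g_in = [1, 0, 0, 1]
g_out = [0, 0, 1, 1]
valid = union_gates(g_in, g_out)
print(valid, sum(valid))  # [1, 0, 1, 1] -> 3 valid channels
```

Only the channel that is pruned on both sides (the second one here) can actually be removed, because the element-wise addition of the skip connection needs matching channels.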
5 Experimental Results
In this section, we conduct extensive experiments on the benchmark CIFAR-10 and ImageNet datasets to validate the superiority of our proposed method. We also conduct ablation studies to further investigate how our method contributes to filter pruning.
5.1 Configuration and settings
Comparison methods. To compare pruning performance, we select several state-of-the-art filter pruning methods: AutoPruner Luo and Wu (2018), LEGR Chin et al. (2019), SFP He et al. (2018), FPGM He et al. (2019), DCP Zhuang et al. (2018b), ThiNet Luo et al. (2017), CP He et al. (2017), Slimming Liu et al. (2017) and PFS Wang et al. (2019). Besides, since our method aims to identify redundant filters, we also include two vanilla baselines. The first is Uniform, i.e., shrinking the width of a network by a fixed rate to meet the FLOPs budget. The second, denoted as Random, uses a random set of filters within each layer: we randomly adjust the number of filters of the Uniform configuration within a certain range so that the FLOPs budget is met. The Random method is run 10 times, and we report the average performance.
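The Uniform baseline can be illustrated with a small search for the largest width rate that still meets the budget. The sketch below is an assumption-laden toy: it uses a chain of 1×1 convolutions, rounds widths down, keeps at least one filter per layer, and finds the rate by bisection. None of these details are specified in the text, and the function names are hypothetical.

```python
import math
from typing import List, Tuple


def chain_flops(widths: List[int], spatial: List[Tuple[int, int]], in_channels: int) -> int:
    """FLOPs of a chain of 1x1 convolutions with the given layer widths."""
    counts = [in_channels] + list(widths)
    return sum(h * w * counts[l] * counts[l + 1] for l, (h, w) in enumerate(spatial))


def uniform_widths(widths: List[int], rate: float) -> List[int]:
    """Shrink every layer by the same rate, keeping at least one filter."""
    return [max(1, math.floor(rate * c)) for c in widths]


def uniform_rate_for_budget(widths: List[int], spatial: List[Tuple[int, int]],
                            in_channels: int, budget: int, tol: float = 1e-4) -> float:
    """Largest rate (up to tol) whose uniformly shrunk network meets the budget."""
    if chain_flops(uniform_widths(widths, 0.0), spatial, in_channels) > budget:
        raise ValueError("budget is unreachable even with one filter per layer")
    if chain_flops(widths, spatial, in_channels) <= budget:
        return 1.0
    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if chain_flops(uniform_widths(widths, mid), spatial, in_channels) <= budget:
            lo = mid   # still within budget: try a wider network
        else:
            hi = mid   # over budget: shrink further
    return lo


spatial = [(8, 8), (4, 4)]
rate = uniform_rate_for_budget([4, 4], spatial, in_channels=3, budget=512)
slim = uniform_widths([4, 4], rate)
print(round(rate, 3), slim, chain_flops(slim, spatial, 3))
```

Bisection is valid here because the FLOPs of the shrunk network never decrease as the rate grows.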
Training. Starting from a pretrained model, we prune redundant filters until the FLOPs budget is satisfied. Specifically, in each update of Algorithm 1 we optimize the Dagger module and the main network for a fixed number of iterations before pruning the gates; the number of iterations and the batch size are set separately for the ImageNet and CIFAR-10 datasets. The pruning rate per update (line 6 in Algorithm 1) is set to 0.6% and the balance parameter $\lambda$ is set to 8 for all networks. We use the SGD optimizer with momentum 0.9, Nesterov acceleration, and weight decay. The learning rate is annealed with a cosine strategy from an initial value of 0.001 for the Dagger module (main network). Once the FLOPs budget is reached, we fine-tune the pruned weights with the learning rate initialized to 0.01. For the CIFAR-10 dataset, we fine-tune for 100 epochs and the learning rate is divided by 10 at the 75th and 112th epochs. For the ImageNet dataset, we use a cosine learning rate to fine-tune the network for 60 epochs. All experiments are implemented with PyTorch Paszke et al. (2017) on NVIDIA 1080 Ti GPUs.
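For reference, a cosine learning-rate schedule of the kind mentioned above can be written in a few lines. This sketch assumes the common half-cosine form that decays to a minimum of zero and is stepped once per epoch; the text does not state these details, and the function name is hypothetical.

```python
import math


def cosine_lr(step: int, total_steps: int, base_lr: float, min_lr: float = 0.0) -> float:
    """Half-cosine decay from base_lr (step 0) to min_lr (step total_steps)."""
    progress = min(max(step, 0), total_steps) / total_steps
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


# Fine-tuning setting quoted above: initial learning rate 0.01, 60 epochs.
for epoch in (0, 15, 30, 45, 60):
    print(epoch, cosine_lr(epoch, 60, 0.01))
```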
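Table 1: Pruning results on CIFAR-10 for MobileNetV2 and VGGNet. A dash marks a value that is not given.

MobileNetV2

| Groups | Methods | FLOPs | Params | Acc |
|---|---|---|---|---|
| 200M | DCP Zhuang et al. (2018a) | 218M | - | 94.69% |
| | Uniform | 207M | 1.5M | 94.57% |
| | Random | 207M | - | 94.20% |
| | Dagger | 207M | 1.9M | 94.91% |
| 148M | MuffNet Chen et al. (2019) | 175M | - | 94.71% |
| | Uniform | 148M | 1.1M | 94.32% |
| | Random | 148M | - | 93.85% |
| | Dagger | 148M | 1.7M | 94.83% |
| 88M | AutoSlim Yu and Huang (2019) | 88M | 1.5M | 93.20% |
| | Uniform | 88M | 0.6M | 94.32% |
| | Random | 88M | - | 93.85% |
| | Dagger | 88M | 1.1M | 94.49% |
| | AutoSlim Yu and Huang (2019) | 59M | 0.7M | 93.00% |

VGGNet

| Groups | Methods | FLOPs | Params | Acc |
|---|---|---|---|---|
| 200M | DCP Zhuang et al. (2018a) | 199M | 10.4M | 94.16% |
| | Slimming Liu et al. (2017) | 199M | 10.4M | 93.80% |
| | SSS Huang and Wang (2018) | 199M | 5.0M | 93.63% |
| | PFS Wang et al. (2019) | 199M | - | 93.71% |
| | VCN Zhao et al. (2019) | 190M | 3.92M | 93.18% |
| | Uniform | 199M | 10.0M | 93.45% |
| | Random | 199M | - | 93.02% |
| | Dagger | 199M | 6.0M | 94.25% |
| 119M | AOFP Ding et al. (2019) | 124M | - | 93.84% |
| | CGNets Hua et al. (2018) | 117M | - | 92.88% |
| | Uniform | 119M | 6.1M | 93.03% |
| | Random | 119M | - | 92.22% |
| | Dagger | 119M | 2.7M | 93.91% |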
5.2 Experiments on CIFAR-10 dataset
Dataset and networks. The CIFAR-10 dataset includes 60,000 RGB images of size 32×32 from 10 exclusive categories, with 50,000 images for training and 10,000 images for testing. We conduct filter pruning on the benchmark VGGNet-19 Simonyan and Zisserman (2014) and the compact MobileNetV2 Sandler et al. (2018). Concretely, VGGNet-19 has 20M parameters and 399M FLOPs with an error rate of 6.01%. In contrast, MobileNetV2 has only 2.2M parameters and 297M FLOPs, with an error rate of 5.53%. The results are reported in Table 1.
Results. As shown in Table 1, our method achieves the best accuracy under different FLOPs budgets on both MobileNetV2 and VGGNet. In detail, our pruned 50% FLOPs VGGNet outperforms the state-of-the-art DCP, Slimming, and PFS by 0.09%, 0.45%, and 0.54%, respectively, and even surpasses the pretrained model by 0.26%, which indicates that our method can efficiently prune redundant filters and thereby improve performance. In addition, compared with the two baselines Uniform and Random, our pruned VGGNet-19 improves the accuracy by at least 0.80%. Unlike VGGNet-19, MobileNetV2 is more compact and has many skipping layers. As shown in Table 1, our 207M MobileNetV2 outperforms DCP and the pretrained model by 0.22% and 0.44%, respectively. Besides, in the case of a tiny budget (i.e., 88M FLOPs), our pruned MobileNetV2 still achieves an accuracy of 94.49%, which is 1.29% higher than AutoSlim, showing that our method achieves promising results even with small budgets.
Table 2: Pruned ResNet50 on ImageNet.
Methods  FLOPs  Params  Top1 ACC  Top5 ACC
SFP He et al. (2018)  2.4G  -  74.6%  92.1%
FPGM He et al. (2019)  2.4G  -  75.6%  92.6%
LEGR Chin et al. (2019)  2.4G  -  75.7%  92.7%
AutoPruner Luo and Wu (2018)  2.0G  -  74.8%  92.2%
MetaPruning Liu et al. (2019)  2.0G  -  75.4%  -
Uniform  2.0G  10.2M  74.1%  90.6%
Random  2.0G  -  73.2%  90.4%
Dagger  2.0G  11.7M  76.1%  92.8%
Table 3: Pruned MobileNetV2 on ImageNet.
Groups  Methods  FLOPs  Params  Top1 ACC  Top5 ACC
140M  MetaPruning Liu et al. (2019)  140M  -  68.2%  -
140M  GS Ye et al. (2020)  137M  2.0M  68.8%  -
140M  Uniform  140M  2.7M  67.6%  88.2%
140M  Random  140M  -  67.1%  87.9%
140M  Dagger  140M  2.84M  69.5%  88.8%
106M  MetaPruning Liu et al. (2019)  105M  -  65.0%  -
106M  GS Ye et al. (2020)  106M  1.9M  66.9%  -
106M  Uniform  106M  1.5M  64.1%  84.2%
106M  Random  106M  -  63.5%  84.0%
106M  Dagger  106M  2.46M  67.2%  86.8%
5.3 Experiments on ImageNet dataset
Dataset. The ImageNet (ILSVRC12) dataset consists of 1.28 million training images and 50k validation images from 1000 categories. Specifically, we report accuracy on the validation set, following Zhuang et al. (2018b); Liu et al. (2017). We implement pruning on two benchmark networks, i.e., ResNet50 and MobileNetV2. The pretrained models are those released by PyTorch.
Results of ResNet50. The pretrained ResNet50 has 25.5M parameters and 4.1G FLOPs with 76.6% Top1 accuracy. As shown in Table 2, our algorithm outperforms SFP He et al. (2018), FPGM He et al. (2019) and LEGR Chin et al. (2019) by 1.5%, 0.5% and 0.4%, respectively, while our pruned ResNet50 has even fewer FLOPs (by 0.4G). In comparison with AutoPruner Luo and Wu (2018) and MetaPruning Liu et al. (2019) at the same 2.0G FLOPs, our pruned network also achieves the best performance, with 1.3% and 0.7% improvement in Top1 accuracy.
Results of MobileNetV2. The pretrained MobileNetV2 has 3.5M parameters and 300M FLOPs with 68.2% Top1 accuracy. We prune the network under two FLOPs budgets, 140M and 106M. As shown in Table 3, when pruned to 140M FLOPs, our MobileNetV2 outperforms the pretrained MobileNetV2 by 1.3%. Compared with the Random and Uniform baselines, our method improves Top1 accuracy by 2.4% and 1.9% at 140M FLOPs, and by 3.7% and 3.1% at 106M FLOPs according to Table 3. Moreover, under a comparable FLOPs budget, our method surpasses MetaPruning by a large margin of 1.3% at 140M FLOPs (2.2% at 106M FLOPs).
5.4 Ablation studies
Effect of balance parameter
According to the Lagrange multiplier method, the optimal value of the balance parameter is attained where the derivative with respect to that variable in Eq. (13) is 0, i.e., where the loss reaches its minimum. However, since the relationship between the loss and the gates cannot be expressed explicitly, we cannot obtain the exact value of the balance parameter by calculation. Instead, we experimentally search for the value that gives the lowest loss, or equivalently the best accuracy. In detail, we implement pruning with different values of the tradeoff parameter. When the tradeoff parameter becomes larger, the impact of the classification loss becomes smaller and the network pays more attention to reducing FLOPs, resulting in a rapid increase in the classification loss as more iterations are involved. On the other hand, if the tradeoff parameter is too small, the network is likely to be pruned almost randomly, leading to poor retraining results. The accuracy of each pruned network under different values of the tradeoff parameter is shown in Fig. 3. Empirically, we set it to 8 for all networks.
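To make the role of the balance parameter concrete, the following minimal sketch adds a weighted FLOPs penalty to a classification loss. The penalty form (relative FLOPs excess over the target), the function names, and the example layer sizes are illustrative assumptions and are not the exact regularizer of Eq. (13); only the value 8 is taken from the text.

```python
def conv_flops(k, c_in, c_out, h_out, w_out):
    """Multiply-accumulate count of a k x k convolution layer."""
    return k * k * c_in * c_out * h_out * w_out


def total_objective(cls_loss, current_flops, target_flops, balance=8.0):
    """Classification loss plus a weighted FLOPs penalty (assumed form)."""
    # Only FLOPs above the budget are penalized, as a relative excess.
    excess = max(current_flops - target_flops, 0) / target_flops
    return cls_loss + balance * excess


# Hypothetical single layer: 3x3 conv, 64 -> 128 channels, 32x32 output.
flops = conv_flops(k=3, c_in=64, c_out=128, h_out=32, w_out=32)
target = flops // 2  # a 50% FLOPs budget

# With a small balance value the classification loss dominates;
# with a large one the FLOPs term dominates the objective.
loss_small = total_objective(0.5, flops, target, balance=0.1)
loss_large = total_objective(0.5, flops, target, balance=8.0)
print(flops, loss_small, loss_large)
```

In this toy setting the FLOPs excess is the same in both calls, so the difference between the two objectives comes entirely from the balance value, which mirrors the tradeoff discussed above.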
Effect of greedy pruning
In our method, the main network is updated after each pruning process. In this way, the pruning ratio in line 6 of Algorithm 1 not only controls the pruning speed of each update but also changes the pruning strategy. Since we subtract the mean of filters in the Dagger module, a larger pruning ratio causes the gates to be pruned more evenly across layers, and vice versa. To investigate the effect of the pruning ratio, we prune VGGNet19 on the CIFAR10 dataset and ResNet50 on the ImageNet dataset with several different pruning ratios. As shown in Fig. 3, a medium pruning ratio works best, since a large value tends to prune gates uniformly over all layers while a small value greedily prunes a certain layer. We find that 0.6% is empirically a good option.
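The loop structure of such greedy pruning can be sketched as follows. This is a simplified illustration under stated assumptions, not Algorithm 1 itself: the importance scores are fixed hypothetical numbers (in the method they are re-learned by the Dagger module between updates), each filter has an assumed constant FLOPs cost, and all names are invented for the example.

```python
import math


def greedy_prune(scores, filter_cost, budget, ratio=0.006):
    """Drop the lowest-scoring filters, a fraction `ratio` per update,
    until the total cost is within `budget`.

    scores: dict layer -> list of importance scores (hypothetical).
    filter_cost: dict layer -> assumed fixed FLOPs cost of one filter.
    """
    # Every (layer, index) pair starts as a kept filter (gate = 1).
    kept = {(layer, i) for layer, s in scores.items() for i in range(len(s))}
    cost = sum(filter_cost[layer] for layer, _ in kept)
    updates = 0
    while cost > budget:
        n_drop = max(1, math.floor(ratio * len(kept)))
        ranked = sorted(kept, key=lambda g: (scores[g[0]][g[1]], g))
        dropped = 0
        for gate in ranked:
            if dropped == n_drop:
                break
            layer = gate[0]
            # Keep at least one filter per layer so no layer disappears.
            if sum(1 for l, _ in kept if l == layer) > 1:
                kept.remove(gate)
                cost -= filter_cost[layer]
                dropped += 1
        if dropped == 0:
            break  # nothing left that can be pruned
        updates += 1
    return kept, cost, updates


toy_scores = {"conv1": [0.9, 0.1, 0.5, 0.7], "conv2": [0.2, 0.8, 0.3, 0.6]}
toy_cost = {"conv1": 10, "conv2": 20}
kept, cost, updates = greedy_prune(toy_scores, toy_cost, budget=60, ratio=0.25)
print(sorted(kept), cost, updates)
```

A small `ratio` removes only a few gates per update and therefore needs many updates, while a large `ratio` removes many gates at once from a single ranking; the toy example uses 0.25 only so that the loop finishes in a few steps.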
Visualization of learned number of filters
Fig. 4 shows our pruned results for VGGNet19, MobileNetV2 and ResNet50 on the CIFAR10 and ImageNet datasets. For the two networks with skipping layers (ResNet50 and MobileNetV2), the pruning is smoothly distributed over all layers. However, MobileNetV2 pays more attention to pruning layers that contain skipping layers, while ResNet50 does not. VGGNet19 does not contain any skipping layers, so its pruning results are more uneven than those of the networks with skipping layers. In addition, the last layers of all three networks are well preserved after pruning, possibly because they contribute more to the final classification.
To analyze how the same network is pruned under different FLOPs budgets, we prune MobileNetV2 on the ImageNet dataset with 140M and 106M FLOPs budgets, as shown in Fig. 4. When MobileNetV2 is pruned from 300M to 140M, the pruning is mainly concentrated on the non-skipping layers and the layers near the front of the network. However, when the FLOPs budget is set to 106M, the number of filters at the end of the network begins to decrease, implying that the front layers of the network are easier to prune than the end layers.
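Table 4: Pruning results on CIFAR10 with pretrained models of different quality.
Model  Pretrain Epochs  Pretrain ACC  Finetune ACC
MobileNetV2  10  87.89%  94.51%
MobileNetV2  40  89.86%  94.63%
MobileNetV2  70  90.74%  94.74%
MobileNetV2  100  91.89%  94.77%
MobileNetV2  150  93.40%  94.78%
MobileNetV2  300  94.47%  94.83%
VGGNet  10  85.25%  94.07%
VGGNet  40  89.51%  94.13%
VGGNet  70  90.57%  94.17%
VGGNet  100  91.55%  94.22%
VGGNet  150  92.76%  94.21%
VGGNet  300  93.99%  94.25%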
Effect of quality of pretrained models
To examine how the quality of the pretrained model affects the learned filter numbers, we use different checkpoints of the pretrained MobileNetV2 (VGGNet) on the CIFAR10 dataset, which have different classification errors. We then apply our method Dagger to these pretrained models, and the results are shown in Table 4. In detail, we take checkpoints at different pretraining epochs as the pretrained model; their accuracy is reported as "Pretrain ACC". After pruning with 2× acceleration, we finetune the retained weights for 100 epochs and report the result as "Finetune ACC". It can be seen that as the classification performance of the pretrained models improves, our pruned networks improve as well. Moreover, the improvement tends to be steady as long as the quality of the pretrained model is not too bad. For example, when the pretraining of MobileNetV2 (VGGNet) is reduced from 300 to 70 epochs, the pretrained accuracy degrades by 3.73% (3.42%), while the finetuned accuracy of the pruned network shows only a 0.09% (0.07%) gap. This implies that our method has low sensitivity to the quality of the pretrained model; to obtain a good pruned network, the pretrained model does not need to be state-of-the-art, only not too bad.
Effect of Dagger in retraining epochs
To examine the effect of the number of finetuning epochs in our method Dagger, we finetune the 50% FLOPs pruned MobileNetV2 and VGGNet on the CIFAR10 dataset for different numbers of epochs. In detail, we inherit the weights of the retained filters after pruning and adopt the same training strategy as described before. Specifically, for VGGNet, the learning rate is initialized to 0.1 and divided by 10 at 50% and 75% of the total epochs. As shown in Fig. 5, the test accuracy of the pruned models is initially relatively low, which means that the information lost with the pruned filters has a certain effect on the overall performance. However, as the number of finetuning epochs increases, the accuracy of the pruned models improves sharply, reaching its highest value at about 100 epochs, which demonstrates the effectiveness of our method.
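The step schedule described for VGGNet can be written as a small function. This is a minimal sketch; the function name, the 0-based epoch indexing, and the choice to apply each drop exactly at the 50% and 75% boundaries are assumptions for illustration.

```python
def step_lr(epoch, total_epochs, base_lr=0.1):
    """Learning rate at a 0-based epoch: base_lr, divided by 10 once
    50% of the epochs have passed and again once 75% have passed."""
    lr = base_lr
    if epoch >= 0.5 * total_epochs:
        lr /= 10
    if epoch >= 0.75 * total_epochs:
        lr /= 10
    return lr


# Learning rates around the two decay points of a 100-epoch run.
for epoch in (0, 49, 50, 74, 75, 99):
    print(epoch, step_lr(epoch, total_epochs=100))
```

Because the decay points are expressed as fractions of the total, the same function covers the runs with different finetuning lengths compared in Fig. 5.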
Efficiency of Dagger in pruning filters
To investigate the efficiency of our method in pruning filters, we report the time cost of pruning under different pruning ratios in Table 5. All experiments are run on 8 NVIDIA 1080 Ti GPUs.
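Table 5: Time cost of pruning under different pruning ratios.
Model  Pruning ratios  Params  FLOPs  Time cost (h)
ResNet50  30%  16.5M  2.9G  1.58
ResNet50  50%  11.7M  2.0G  2.58
ResNet50  70%  9.2M  1.2G  3.48
MobileNetV2  30%  3.19M  210M  1.08
MobileNetV2  50%  2.96M  150M  1.48
MobileNetV2  70%  2.54M  90M  1.88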
As shown in Table 5, our method can quickly reach the desired model size starting from the pretrained model. In each update, we optimize the Dagger module and the main network for 400 (100) iterations with a batch size of 320 (64) and a pruning ratio of 0.6% on the ImageNet (CIFAR10) dataset. Therefore, taking pruning to 50% FLOPs as an example, we only need to go through about 8 (10) epochs on the ImageNet (CIFAR10) dataset.
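The epoch counts above can be related to the per-update settings with simple arithmetic. The sketch below uses the iteration counts, batch sizes, and training-set sizes stated in this section; the number of updates it derives is an implied quantity under the assumption that every update processes fresh training images, not a figure reported in the paper.

```python
def epochs_per_update(iterations, batch_size, train_images):
    """Fraction of one training epoch consumed by a single pruning update."""
    return iterations * batch_size / train_images


# Settings from the text: 400 x 320 on ImageNet (1.28M training images),
# 100 x 64 on CIFAR10 (50,000 training images).
imagenet_frac = epochs_per_update(400, 320, 1_280_000)
cifar10_frac = epochs_per_update(100, 64, 50_000)

# Updates implied by the reported budgets of about 8 and 10 epochs.
imagenet_updates = 8 / imagenet_frac
cifar10_updates = 10 / cifar10_frac

print(round(imagenet_frac, 3), round(cifar10_frac, 3))
print(round(imagenet_updates), round(cifar10_updates))
```

Under these assumptions one update costs roughly a tenth of an epoch on either dataset, so both reported epoch counts correspond to a similar number of updates, on the order of 80.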
5.5 Visualization of feature maps
To intuitively check the gates learned by our method Dagger, we visualize in Fig. 6 the feature maps of filters whose gates are zero (pruned) and one (retained). All the feature maps are from the first convolution of the second bottleneck in MobileNetV2 on the ImageNet dataset. As shown in Fig. 6, the feature maps of the retained filters (gate one) are more visually informative than those of the pruned filters (gate zero). Besides, the pruned filters usually contain more background information, e.g., for the bird in the fourth row of Fig. 6. In contrast, the retained filters carry rich information about the target and suppress the background, as for the snakes and dogs in the second and third rows of Fig. 6.
6 Conclusion
In this paper, we propose to leverage a Dagger module that takes pretrained weights as input, together with FLOPs as an explicit regularization, to guide the pruning of redundant filters. Concretely, we assign a binary gate to each filter to indicate whether the filter should be retained or pruned. The binary gates are modeled by a Dagger module that uses the filter parameters of the CNN as input, which helps to generate dataset-related gates. Based on the gates, we obtain an accurate estimate of FLOPs, which guides the learning of redundant filters in addition to the classification performance. We prune the redundant filters from the pretrained model with a greedy algorithm until the effective FLOPs of the current pruned network match the preset budget. Extensive experiments on the benchmark CIFAR10 dataset and the large-scale ImageNet dataset show the superiority of our proposed method over other state-of-the-art filter pruning methods.
Footnotes
 Work was done during internship at SenseTime.
 https://en.wikipedia.org/wiki/Support_(mathematics)
 https://pytorch.org/docs/stable/torchvision/models.html
References
"Learning-compression" algorithms for neural net pruning. pp. 8532–8541.
MuffNet: multi-layer feature federation for mobile deep learning.
Compressing neural networks with the hashing trick.
LEGR: filter pruning via learned global ranking. arXiv preprint arXiv:1904.12368.
ImageNet: a large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255.
Exploiting linear structure within convolutional networks for efficient evaluation.
Approximated oracle filter pruning for destructive CNN width optimization. pp. 1607–1616.
Learning to prune deep neural networks via layer-wise optimal brain surgeon. pp. 4857–4867.
On the learnability of quantum neural networks. arXiv preprint arXiv:2007.12369.
Dynamic network surgery for efficient DNNs. pp. 1387–1395.
Single path one-shot neural architecture search with uniform sampling. arXiv: Computer Vision and Pattern Recognition.
Deep compression: compressing deep neural networks with pruning, trained quantization and Huffman coding. arXiv preprint arXiv:1510.00149.
Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778.
Soft filter pruning for accelerating deep convolutional neural networks. arXiv preprint arXiv:1808.06866.
Filter pruning via geometric median for deep convolutional neural networks acceleration. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4340–4349.
Channel pruning for accelerating very deep neural networks. In Proceedings of the IEEE International Conference on Computer Vision, pp. 1389–1397.
Squeeze-and-excitation networks. IEEE Transactions on Pattern Analysis and Machine Intelligence, pp. 1–1.
Squeeze-and-excitation networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7132–7141.
Channel gating neural networks. arXiv: Learning.
Data-driven sparse structure selection for deep neural networks. In The European Conference on Computer Vision (ECCV).
Learning student networks with few data. In AAAI, pp. 4469–4476.
The CIFAR-10 dataset. Online: http://www.cs.toronto.edu/kriz/cifar.html.
Speech enhancement using progressive learning-based convolutional recurrent neural network. Applied Acoustics 166, pp. 107347.
Pruning filters for efficient ConvNets. arXiv: Computer Vision and Pattern Recognition.
DARTS: differentiable architecture search. arXiv: Learning.
MetaPruning: meta learning for automatic neural network channel pruning. arXiv preprint arXiv:1903.10258.
Learning efficient convolutional networks through network slimming. pp. 2755–2763.
Rethinking the value of network pruning. arXiv preprint arXiv:1810.05270.
ThiNet: a filter level pruning method for deep neural network compression. In Proceedings of the IEEE International Conference on Computer Vision, pp. 5058–5066.
AutoPruner: an end-to-end trainable filter pruning method for efficient deep model inference. arXiv preprint arXiv:1805.08941.
AtomNAS: fine-grained end-to-end neural architecture search.
Group sampling for scale invariant face detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3446–3456.
PyTorch: tensors and dynamic neural networks in Python with strong GPU acceleration.
MobileNetV2: inverted residuals and linear bottlenecks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4510–4520.
Reinforced molecule generation with heterogeneous states. In 2019 IEEE International Conference on Data Mining (ICDM), pp. 548–557.
Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556.
Structured transforms for small-footprint deep learning.
Reborn filters: pruning convolutional neural networks with limited data. In AAAI, pp. 5972–5980.
Bringing giant neural networks down to earth with unlabeled data. arXiv preprint arXiv:1907.06065.
CLIP-Q: deep network compression learning by in-parallel pruning-quantization. pp. 7873–7882.
Convolutional networks with adaptive inference graphs. pp. 3–18.
Improving the performance of weighted Lagrange-multiplier methods for nonlinear constrained optimization. Information Sciences 124 (1), pp. 241–272.
Exploring linear relationship in feature map subspace for ConvNets compression. arXiv: Computer Vision and Pattern Recognition.
The devil of face recognition is in the noise. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 765–780.
Pruning from scratch. arXiv: Computer Vision and Pattern Recognition.
Point-set anchors for object detection, instance segmentation and pose estimation. arXiv preprint arXiv:2007.02846.
Quantized convolutional neural networks for mobile devices.
AutoPrune: automatic network pruning by regularizing auxiliary parameters. pp. 13681–13691.
Shared predictive cross-modal deep quantization. IEEE Transactions on Neural Networks and Learning Systems 29 (11), pp. 5292–5303.
Deep Bayesian hashing with center prior for multi-modal neuroimage retrieval. IEEE Transactions on Medical Imaging.
ISTA-NAS: efficient and consistent neural architecture search by sparse coding. arXiv preprint arXiv:2010.06176.
Good subnetworks provably exist: pruning via greedy forward selection. arXiv: Learning.
GreedyNAS: towards fast one-shot NAS with greedy supernet. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1999–2008.
Learning from multiple teacher networks. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 1285–1294.
Learning with single-teacher multi-student. In Thirty-Second AAAI Conference on Artificial Intelligence.
Gate decorator: global filter pruning method for accelerating deep convolutional neural networks. pp. 2130–2141.
AutoSlim: towards one-shot architecture search for channel numbers. arXiv preprint arXiv:1903.11728.
Variational convolutional neural network pruning. pp. 2780–2789.
Discrimination-aware channel pruning for deep neural networks. In Advances in Neural Information Processing Systems, pp. 875–886.