Data Agnostic Filter Gating for Efficient Deep Networks


Abstract

To deploy a well-trained CNN model on low-end edge devices, it is usually necessary to compress or prune the model under a certain computation budget (e.g., FLOPs). Current filter pruning methods mainly leverage feature maps to generate importance scores for filters and prune those with smaller scores, ignoring that the variance of input batches leads to differences in the sparse structure over filters. In this paper, we propose a data agnostic filter pruning method that uses an auxiliary network named Dagger module to induce pruning; it takes the pretrained weights as input to learn the importance of each filter. In addition, to prune filters under a certain FLOPs constraint, we leverage an explicit FLOPs-aware regularization that directly drives the pruned network toward the target FLOPs. Extensive experimental results on the CIFAR-10 and ImageNet datasets indicate our superiority over other state-of-the-art filter pruning methods. For example, our 50% FLOPs ResNet-50 achieves 76.1% Top-1 accuracy on the ImageNet dataset, surpassing many other filter pruning methods.

keywords:
Deep learning; Filter pruning; Model compression; Data agnostic; Dagger module; FLOPs-aware regularization.

xisu5992@uni.sydney.edu.au
youshan@sensetime.com
huangtao@sensetime.com
tjdxxhy@tju.edu.cn
wangfei@sensetime.com
qianchen@sensetime.com
zcs@mail.tsinghua.edu.cn
c.xu@sydney.edu.au

1 Introduction

Recently, artificial intelligence (AI) engines built on deep learning techniques have achieved remarkable success in various tasks Li et al. (2020); Wang et al. (2018b); Du et al. (2020); Shi et al. (2019); Wei et al. (2020); Ming et al. (2019); Yang et al. (); Yang et al. (2018), and deep networks (e.g., convolutional neural networks, CNNs) are thus favored in cloud and edge computing, where they are mainly deployed on terminal devices such as mobile phones, tablets, AR glasses, wearable watches, and onboard surveillance equipment. However, since they are designed for state-of-the-art accuracy, conventional trained CNN models in industrial model zoos usually have a huge model size and are clumsy to deploy on low-end computational devices. A natural question is therefore how, besides maintaining accuracy, we can develop a ready-to-deploy model under a certain computation budget, such as a FLOPs limit. Fortunately, with the development of model compression and acceleration, pruning has become an efficient way to obtain lightweight models from existing heavyweight models while preserving their baseline performance.

Currently, pruning can be divided into weight pruning and filter pruning. Filter pruning is generally more competitive, since it results in a lightweight model whose structure is consistent with the pre-trained model and thus friendly to off-the-shelf deep learning frameworks. Moreover, filter pruning is complementary to other compression techniques: the pruned networks can usually be further compressed using quantization Chen et al. (); Han et al. (2015), low-rank decomposition Denton et al. (); Sindhwani et al. () or knowledge distillation You et al. (2017, 2018); Kong et al. (2020). One effective way to prune redundant filters is to start from the pre-trained weights; the filters with less importance to network performance are referred to as redundant filters in this paper. Basically, filter pruning works by first finding and pruning the redundant filters, and then retraining (fine-tuning) the pruned network to recover its performance.

Identifying redundant filters matters in filter pruning. Specifically, many filter pruning methods leverage auxiliary parameters to find redundant filters, e.g., scaling factors You et al. (2019) and trainable auxiliary parameters Xiao et al. (2019). However, all of these methods rely on filter-wise auxiliary parameters that are usually optimized simultaneously with the network parameters in the form of multiplication. The collaboration or competition between these parameters may lead to unexpected results, e.g., the filter weights corresponding to smaller auxiliary parameters may become large to compensate, so it may be inaccurate to judge the redundancy of filters directly from the auxiliary parameters. Besides, to determine the redundancy of a specific filter in the overall convolution kernel, it is better to use the information of all feature maps or filters together rather than filter-wise auxiliary parameters alone. To identify filter redundancy globally, many filter pruning algorithms construct a gate network that takes feature maps as input to generate importance scores for filters; the filters whose scores fall outside the support are then pruned, where non-supporting gates represent the subset of gates mapped to zero values. However, since feature maps are subject to input batches, different batches may generate distinct importance scores and correspondingly different supports, which makes it difficult to determine an optimal score support for all input batches and causes a performance gap accordingly.

In addition, filter pruning methods mainly neglect the allocation of the FLOPs budget during training. In order to ensure that the pruned network stays under a given FLOPs budget, they have to resort to various sparsity proxies, e.g., the $\ell_0$-norm of filters, or the $\ell_1$-norm of filter weights or scaling factors. Nevertheless, the network induced by these sparsity proxies is not necessarily optimal for the constrained FLOPs budget, and it usually requires a careful hyper-parameter setup for the sparsity proxy so that the FLOPs of the pruned network exactly matches the given budget.

In this paper, we propose to prune redundant filters through a Dagger module that uses the kernel weights as input to deduce the redundancy of different filters, which has three advantages. Firstly, the gates of filters are generated by the Dagger module, which avoids the joint optimization of gates and filter weights. Secondly, the kernel weights have been optimized over the whole training dataset for several epochs, so they contain information related to the entire dataset. Thirdly, we can easily and quickly adapt the pre-trained network to pruned networks with different budgets. Besides, based on the Dagger module, we also propose a FLOPs-aware regularizer to directly prune redundant filters from the pre-trained model under a target FLOPs budget. Concretely, we allocate a binary gate for each filter, where 0 means the filter should be pruned and 1 means it should be retained. In this way, the status of all binary gates corresponds to a certain filter configuration and FLOPs value. However, binary gates are hard to optimize, thus we relax them into real-valued gates in the interval $(0, 1)$ and model them by a Dagger module that takes the pre-trained weights as input. Therefore, the Dagger module can sufficiently exploit the information within the pre-trained filters and serve as a decent surrogate for the binary gates, thereby deriving a corresponding FLOPs regularization term. With the weights of the pre-trained model fixed, the gates generated by the Dagger module can be optimized to maintain the accuracy performance while reducing the FLOPs, such that a smaller network can be induced. Besides, to adhere to a fixed FLOPs budget, we propose to optimize the Dagger module in a greedy manner, so that the model is pruned gradually and we can check whether the valid FLOPs of the current pruned network matches the budget.

We have conducted extensive experiments on the benchmark CIFAR-10 dataset Krizhevsky et al. (2014) and the large-scale ImageNet dataset Deng et al. (2009). The experimental results show that under the same FLOPs budget or acceleration rate, our method achieves higher accuracy than other state-of-the-art filter pruning methods. For example, with half of the FLOPs (2× acceleration) of ResNet-50 He et al. (2016), we can achieve 76.1% Top-1 accuracy on the ImageNet dataset, far exceeding other filter pruning methods. Our main contributions can be summarized as follows.

  • We adopt a Dagger module based on pre-trained weights to model the gates for filter pruning, such that redundant filters can be identified more accurately with respect to the entire dataset, avoiding the issue of jointly optimizing auxiliary parameters and kernel weights.

  • We propose to involve FLOPs as an explicit regularization to guide the pruning of redundant filters alongside the classification performance.

  • Our method is easy to implement, and experimental results indicate our superiority to other state-of-the-art pruning methods.

2 Related Work

Figure 1: Overall framework of our proposed method. The Dagger module uses filter parameters to generate gates for pruning redundant filters under FLOPs regularization, and can be removed after pruning without affecting the rest of the network.

To enable an over-parameterized convolutional neural network to be deployed on low-end computational devices, various methods have been developed to reduce the model capacity and FLOPs, such as weight pruning Tung and Mori (2018); Dong et al. (2017); Guo et al. (2016); Carreiraperpinan and Idelbayev (2018), filter pruning He et al. (2017); Luo et al. (2017); Huang and Wang (2018); Tang et al. (2020); Liu et al. (2017); Zhuang et al. (2018b); Tang et al. (2019), parameter quantization Chen et al. (); Wu et al. (); Han et al. (2015), low-rank approximation Denton et al. (); Sindhwani et al. () and so on. Essentially, weight pruning optimizes the weights in an unstructured way, which makes the result hard to deploy on low-end computational devices and often requires special designs to achieve acceleration. Parameter quantization and low-rank approximation, on the other hand, can be applied as complementary methods to filter pruning for further reducing the computation budget. In general, our method can be cast into the filter pruning category.

Filter-wise scaling factors. Filter pruning is designed to speed up the inference of a network by pruning its redundant filters. An important task in filter pruning is to assess the redundancy of filters. Many methods leverage filter-wise auxiliary parameters to obtain the redundancy of filters. For example, Liu et al. Liu et al. (2017) used filter-wise scaling factors to prune a large network into a reduced model with comparable precision. Huang et al. Huang and Wang (2018) proposed to use filter-wise scaling factors to indicate the redundancy of filters. Xiao et al. Xiao et al. (2019) proposed to prune filters by optimizing a set of trainable auxiliary parameters instead of the original weights. You et al. You et al. (2019) proposed to leverage filter-wise scaling factors to select redundant filters. However, in these algorithms, the filter-wise auxiliary parameters are inevitably optimized simultaneously with the network parameters through multiplication, and it does not make sense to prune redundant filters directly according to the auxiliary parameters without considering the scale of the filter weights.

Gate network for pruning. To solve the above issue, some algorithms involve a gate network to induce pruning. For example, Zhuang et al. Zhuang et al. (2018b) proposed a discrimination-aware network with an additional loss to select the filters that really contribute to the discriminative power. Liu et al. Liu et al. (2019) proposed to leverage a meta-network to help identify the number of filters in each layer. Wang et al. Wang et al. (2018a) proposed to use spectral clustering on filters to select redundant filters. Veit et al. Veit and Belongie (2018) proposed to leverage gates to define the network topology conditioned on the input batches. AutoPruner Luo and Wu (2018), proposed by Luo et al., can be regarded as a separate layer that is attached to any convolution layer to automatically prune filters. These methods neglect that the variance of input batches leads to differences in the sparse structure over filters, and thus the identified redundant filters vary across batches.

AutoML methods. Since filter pruning can generally be regarded as an optimization problem Liu et al. (2018b), some works adopt AutoML (i.e., NAS) methods Liu et al. (2018a); Guo et al. (2019); Mei et al. (2020); You et al. (2020); Yang et al. (2020) to automatically search for the best network structure given a fixed FLOPs budget. Although AutoML-based approaches usually achieve competitive performance, they cannot take full advantage of a pre-trained model and are usually computationally expensive. A typical AutoML pipeline optimizes a wide network with various operations from a huge search space as a performance evaluator, and then searches for the best-performing candidate to train from scratch. In contrast, our method leverages pre-trained weights to investigate the redundancy of filters and aims to obtain a compact network under a certain FLOPs budget.

Contributions. We would like to highlight the contributions and advantages of our method as follows: (1) our method uses a Dagger module to generate a gate for each filter by leveraging the filter parameters as input, and is thus able to generate dataset-related gates; this is more suitable for pruning filters since different input batches then share the same sparse structure over filters, whereas other methods (e.g., You et al. (2019); Xiao et al. (2019); Hu et al. (2019)) take feature maps as input, which makes their gates depend on the input batches; (2) pruning methods that use filter-wise scale factors (e.g., Huang and Wang (2018); Li et al. (2016)) employ only one additional parameter per filter, and thus lack the fitting capacity to model filter redundancy adaptively; (3) we propose a FLOPs-aware regularizer to directly prune redundant filters from the pre-trained model under a target FLOPs budget, while other filter pruning methods generally resort to various sparsity proxies, e.g., the $\ell_0$-norm of filters or the $\ell_1$-norm of filter weights or scaling factors.

3 Modeling Redundancy of Filters with Dagger Module

Figure 2: (a) Architecture of the Dagger module. The Dagger module uses convolutional filter weights to generate filter-wise gates, which are directly multiplied with the convolutional feature map to produce the final output. (b) FLOPs calculation with binary gates when dealing with skipping layers.

Filter pruning intends to identify redundant filters of the pre-trained model so that a compact pruned network with a smaller FLOPs budget can be derived. Current methods use different scaling factors, gate networks or sparsity proxies to prune redundant filters, such as sparse CNN filters Zhuang et al. (2018b), scaling factors of BN layers Liu et al. (2017), auxiliary factors (gates) You et al. (2019); Xiao et al. (2019); Hu et al. (2019) and scaling methods Huang and Wang (2018); Li et al. (2016). However, in practice, these methods generally use additional scaling factors or gates modeled from batch information to select the redundant filters, which makes them batch-dependent and may be harmful to locating redundant filters, since different input batches may have various sparse structures over filters. In addition, there is usually a requirement that the pre-trained model should be pruned to a specified FLOPs level or accelerated by a certain factor. To solve the above problems, we propose to leverage the Dagger module, which takes the pre-trained weights as input, to identify redundant filters and directly obtain a network structure under a certain FLOPs budget. Denote the original network as $\mathcal{N}$ with pre-trained weights $\mathbf{W}$.

3.1 Binary gates

Suppose the network has $L$ layers, and the feature map of the $l$-th layer is denoted as $\mathbf{X}_l \in \mathbb{R}^{B \times c_l \times h_l \times w_l}$, where $B$ is the batch size, $c_l$ is the number of filters, and $h_l$ and $w_l$ are the spatial height and width of the feature map, respectively. In this way, to identify redundant filters based on the pre-trained model, we can allocate a binary gate $g_l^i \in \{0, 1\}$ for each filter, where $g_l^i = 1$ means the corresponding filter should be retained, while $g_l^i = 0$ means it is to be pruned. With the introduced gates, the feature maps can be augmented filter-wisely. Mathematically, for the $l$-th layer, the augmented feature map is thus expressed as a multiplication of the tensor by the scalar gate, i.e.,

$\tilde{\mathbf{X}}_l^i = g_l^i \cdot \mathbf{X}_l^i, \quad i = 1, \dots, c_l,$   (1)

where $\mathbf{X}_l^i$ is the $i$-th feature map of $\mathbf{X}_l$, and $g_l^i$ corresponds to the gate of the $i$-th filter w.r.t. the $l$-th layer. As Eq.(1) shows, $g_l^i$ controls whether the corresponding filter in $\mathbf{X}_l$ is activated. When $g_l^i$ is 0, the corresponding filter and its feature map are deactivated and pruned.

As a result, the number of retained filters of the pruned network is directly controlled by the binary gates. Concretely, for the feature map $\mathbf{X}_l$ w.r.t. the $l$-th layer, its valid filter number $\tilde{c}_l$ is exactly the number of non-zero gates, i.e.,

$\tilde{c}_l = \sum_{i=1}^{c_l} \mathbb{1}[g_l^i \neq 0] = \|\mathbf{g}_l\|_0,$   (2)

where $\mathbb{1}[\cdot]$ is an indicator function and $\|\cdot\|_0$ denotes the $\ell_0$-norm. Specifically, for the non-pruned pre-trained model, its gates are all ones and $\tilde{c}_l = c_l$. Given a pre-trained model, its FLOPs are only up to the valid number of filters, since the operations and spatial sizes have been fixed. Then, the FLOPs of the pruned network is determined by the gates $\mathbf{G} = \{\mathbf{g}_1, \dots, \mathbf{g}_L\}$, which can be written as

$\mathrm{FLOPs}(\mathbf{G}) = \sum_{l=1}^{L} F_l(\mathbf{g}_l),$   (3)

where $F_l(\cdot)$ is the FLOPs calculator w.r.t. the $l$-th layer. For example, if the $l$-th layer is a 1×1 convolutional layer, its FLOPs calculator with gates will be

$F_l(\mathbf{g}_l) = \|\mathbf{g}_{l-1}\|_0 \cdot \|\mathbf{g}_l\|_0 \cdot h_l \cdot w_l.$   (4)

As a result, we can formulate learning the number of filters as a mixed 0-1 (binary) optimization problem,

$\min_{\mathbf{W}, \mathbf{G}} \; \mathcal{L}(\mathbf{W}, \mathbf{G}; \mathcal{D}) \quad \mathrm{s.t.} \quad \mathrm{FLOPs}(\mathbf{G}) \leq F_b, \;\; g_l^i \in \{0, 1\},$   (5)

where $\mathbf{W}$ denotes the weights of the network $\mathcal{N}$ initialized with the pretrained weights, $\mathcal{D}$ is the training dataset, and $F_b$ is the FLOPs budget. Note that the hard constraint in Eq.(5) can amount to a version indicated by the acceleration rate $a$, i.e.,

$F_b = \mathrm{FLOPs}(\mathbf{1}) / a,$   (6)

where $\mathrm{FLOPs}(\mathbf{1})$ is the overall FLOPs of the pretrained model $\mathcal{N}$. Then the number of filters (or gates) will be learned to minimize the training loss under the FLOPs budget $F_b$.
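To make Eqs.(2)-(4) concrete, the following sketch accumulates the gate-dependent FLOPs of a small network layer by layer. It is only an illustration under our own assumptions: the layer descriptions, the generalization to k×k kernels, and the helper name `gated_flops` are not part of any released implementation.

```python
from typing import Dict, List


def gated_flops(layers: List[Dict], gates: List[List[int]]) -> float:
    """Accumulate the FLOPs of a pruned network from binary filter gates (Eqs. (2)-(4)).

    layers[l] describes the l-th conv layer (kernel size and output spatial size);
    gates[l] is the 0/1 gate vector over its output filters.
    """
    total = 0.0
    prev_valid = layers[0]["in_channels"]            # input image channels are never gated
    for layer, g in zip(layers, gates):
        valid = sum(1 for gi in g if gi != 0)        # ||g_l||_0, Eq.(2)
        k = layer["kernel_size"]
        h, w = layer["out_size"]
        total += prev_valid * valid * k * k * h * w  # Eq.(4), generalized to k x k kernels
        prev_valid = valid                           # pruned outputs shrink the next layer's inputs
    return total


# Toy example: a two-layer network whose second layer keeps only half of its filters.
layers = [
    {"in_channels": 3, "kernel_size": 3, "out_size": (32, 32)},
    {"in_channels": 16, "kernel_size": 1, "out_size": (32, 32)},
]
gates = [[1] * 16, [1] * 8 + [0] * 8]
print(gated_flops(layers, gates))
```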

3.2 Modeling gates with the Dagger module

However, the 0-1 optimization in Eq.(5) is an NP-hard problem. Thus we relax the original problem by considering real-valued gates in the interval $(0, 1)$. Moreover, to better model the real-valued gates, we accompany the main network with an auxiliary network named Dagger module, which generates the gates with the help of the trained weights $\mathbf{W}$. The generated filter-wise gates are directly multiplied with the output feature maps to identify redundant filters. The proposed framework is shown in Fig. 1.

Besides, previous filter pruning algorithms usually take batch information (e.g., images) as input to prune redundant filters; since different input batches may have various sparse structures over filters, the gates they select can be modeled as

$\mathbf{G}_b = \mathcal{M}(\mathbf{X}_b),$   (7)

where $\mathbf{X}_b$ denotes the $b$-th batch of the training dataset $\mathcal{D}$ and $\mathcal{M}$ is the batch-dependent gate generator. However, different batches may lead to different redundant filters, and it is almost impossible to infer the globally optimal set of redundant filters for the entire dataset from the locally optimal solutions corresponding to individual batches.

Therefore, directly using information related to the entire dataset $\mathcal{D}$ as input to the Dagger module can lead to a globally optimal solution for the redundant filters. In detail, we use the pre-trained convolution weights as the input to the Dagger module, since these weights have been updated with information from the entire dataset $\mathcal{D}$ through gradient descent over several epochs. The gates are thus generated by taking the kernel weights as input, which makes the computation of the Dagger module independent of the input batches, thereby eliminating the batch variation over filters and avoiding the issue of joint optimization with the kernel weights. The generation of gates can be modeled via a network (i.e., the Dagger module) denoted as $\mathcal{A}$ with Dagger-weights $\boldsymbol{\theta}$, namely,

$\mathbf{g}_l = \mathcal{A}_l(\mathbf{W}_l; \boldsymbol{\theta}_l).$   (8)

The structure of our adopted Dagger module is shown in Fig. 2(a). To reduce the computational complexity of the Dagger module, we first merge each filter by average pooling so that the result has the same size as the gates $\mathbf{g}_l$. Then the merged filters are passed through two simple fully-connected (FC) layers and further activated via a sigmoid function $\sigma(\cdot)$, i.e.,

$\mathbf{g}_l = \sigma\big(\mathrm{FC}_2(\mathrm{FC}_1(\mathrm{AvgPool}(\mathbf{W}_l)))\big).$   (9)

Note that the sigmoid function is used for mapping the gates into the interval $(0, 1)$. Besides, since the CNN filters may have different magnitudes of values, we also apply a normalization before the two FC layers by subtracting the mean.
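To make the description of Fig. 2(a) concrete, below is a minimal PyTorch sketch of how such a Dagger module could be written: the convolution weights are average-pooled to one value per output filter, mean-normalized, passed through two FC layers, and squashed by a sigmoid, and the resulting gates multiply the convolution output as in Eq.(1). The class name, the reduction ratio of the hidden FC layer, and the ReLU between the two FC layers are our own assumptions rather than the authors' exact configuration.

```python
import torch
import torch.nn as nn


class DaggerModule(nn.Module):
    """Generates one gate in (0, 1) per filter from the conv kernel weights themselves."""

    def __init__(self, out_channels: int, reduction: int = 4):
        super().__init__()
        self.fc1 = nn.Linear(out_channels, out_channels // reduction)
        self.fc2 = nn.Linear(out_channels // reduction, out_channels)

    def forward(self, conv_weight: torch.Tensor) -> torch.Tensor:
        # conv_weight: (c_out, c_in, k, k) -> average-pool to one scalar per filter
        merged = conv_weight.detach().mean(dim=(1, 2, 3))
        merged = merged - merged.mean()                 # normalization before the FC layers
        gates = torch.sigmoid(self.fc2(torch.relu(self.fc1(merged))))
        return gates                                    # shape (c_out,), values in (0, 1)


# Usage: gate the output feature map of a convolution, as in Eq.(1).
conv = nn.Conv2d(16, 32, kernel_size=3, padding=1)
dagger = DaggerModule(out_channels=32)
x = torch.randn(4, 16, 28, 28)
gates = dagger(conv.weight)
y = conv(x) * gates.view(1, -1, 1, 1)
```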

By using the Dagger module, the gates can be modeled continuously, and gates approaching zero are reckoned to be redundant, so their corresponding filters are supposed to be pruned. In contrast to SE Hu et al. (2018) and AutoPruner Luo and Wu (2018), we do not model the gates with feature maps as input; since our Dagger module is independent of the input batches, it can simply be discarded after pruning. The modules in SE and AutoPruner both use feature maps as input to the auxiliary network, which causes their gates to be batch-dependent, so they can only produce sub-optimal, batch-related pruning results.

4 Pruning with FLOPs-aware Regularization

With the gates generated by the Dagger module, the original Eq.(5) has been relaxed into a continuous optimization problem. However, the hard FLOPs constraint in Eq.(5) depends on the $\ell_0$-norm of the gates, which is not computationally feasible for optimization. In this case, we approximate it by adopting the $\ell_1$-norm as a surrogate. For example, the FLOPs of Eq.(4) can be estimated as

$\hat{F}_l(\mathbf{g}_l) = \|\mathbf{g}_{l-1}\|_1 \cdot \|\mathbf{g}_l\|_1 \cdot h_l \cdot w_l,$   (10)

and the total FLOPs in Eq.(3) can also be estimated as

$\widehat{\mathrm{FLOPs}}(\mathbf{G}) = \sum_{l=1}^{L} \hat{F}_l(\mathbf{g}_l).$   (11)

In this case, if the augmented network is initialized with all-one gates, $\widehat{\mathrm{FLOPs}}(\mathbf{G})$ will be an accurate estimation of the FLOPs since the two are equal to each other. Then minimizing $\widehat{\mathrm{FLOPs}}(\mathbf{G})$ will lead the gates to decrease to different values. The differences among gates reflect their different importance and sensitivity with respect to the FLOPs calculation, which amounts to the different redundancy over filters. As a result, $\widehat{\mathrm{FLOPs}}(\mathbf{G})$ can be regarded as a regularization for reducing the FLOPs of the pre-trained model. Moreover, since the regularization is continuous, the optimization can resort to various gradient-based optimizers, such as stochastic gradient descent (SGD).
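A minimal differentiable version of this surrogate, assuming a plain chain of 1×1 convolutions so that Eq.(10) applies directly, could look as follows; the function name and layer shapes are illustrative only.

```python
import torch


def flops_regularizer(gates, spatial_sizes, in_channels):
    """l1-based FLOPs estimate of Eqs.(10)-(11) for a chain of 1x1 convolutions.

    gates: list of 1-D tensors (one per layer) with values in (0, 1);
    spatial_sizes: list of (h_l, w_l); in_channels: channel count of the network input.
    """
    est = gates[0].new_zeros(())
    prev = torch.tensor(float(in_channels))
    for g, (h, w) in zip(gates, spatial_sizes):
        cur = g.abs().sum()                  # ||g_l||_1 replaces ||g_l||_0
        est = est + prev * cur * h * w       # Eq.(10) per layer, summed as in Eq.(11)
        prev = cur
    return est


# With all-one gates the estimate equals the true FLOPs of the unpruned network;
# its gradient then drives the gates (and hence the estimated FLOPs) downwards.
gates = [torch.ones(16, requires_grad=True), torch.ones(32, requires_grad=True)]
reg = flops_regularizer(gates, [(32, 32), (32, 32)], in_channels=3)
reg.backward()
```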

4.1 Pruning filters under accurate estimation

To identify the redundant filters, the gates are also supposed to accommodate a good accuracy performance. Hence, we propose to learn them under the supervision of the classification performance as well as the Dagger module and the FLOPs-aware regularization. However, the constrained objective defined in Eq.(5) cannot be directly optimized through gradient descent. To solve this issue, we reformulate Eq.(5) with a Lagrange multiplier Wah et al. (2000) and leverage Eq.(11) as the estimation of FLOPs. Then the optimization problem can be formulated as

$\min_{\mathbf{W}, \boldsymbol{\theta}} \; \mathcal{L}\big(\mathbf{W}, \mathcal{A}(\mathbf{W}; \boldsymbol{\theta}); \mathcal{D}\big) + \lambda \cdot \widehat{\mathrm{FLOPs}}\big(\mathcal{A}(\mathbf{W}; \boldsymbol{\theta})\big),$   (12)

where the balance parameter $\lambda$ is the Lagrange-multiplier coefficient and will be discussed in detail in Section 5. However, the regularization $\widehat{\mathrm{FLOPs}}$ is not always a good estimation due to the gap between the $\ell_0$- and $\ell_1$-norms, which means that it might not be sensible to simply adopt Eq.(12) for learning gates in an end-to-end manner.

Inspired by the fact that $\widehat{\mathrm{FLOPs}}$ is an accurate estimation of FLOPs when all gates are ones, we propose to greedily prune redundant filters. Concretely, we proceed from the pre-trained model, and all gates are initialized to ones. In consequence, we can safely minimize the loss in Eq.(12) to learn gates, since the estimation is now exact and accurate. Under the supervision of the classification loss and the FLOPs guidance, gates with different values will then be obtained; the different values indicate the current redundancy differences over the corresponding filters. As a result, we greedily prune the gates with smaller values.

Nevertheless, since the gates are imposed on the feature maps, the gates and weights are highly coupled in magnitude. This implies that greedily pruning gates while simultaneously optimizing the weights may not work well. Therefore, we adopt an iterative update strategy, i.e., during learning we optimize one while fixing the other. Our proposed algorithm works in this iterative manner, as illustrated in Algorithm 1.

0:  Input: a well-trained model $\mathcal{N}$ with weights $\mathbf{W}$, training dataset $\mathcal{D}$, FLOPs budget $F_b$, set of all gates $\mathbf{G}$
1:  initialize the Dagger module $\mathcal{A}$ with Dagger-weights $\boldsymbol{\theta}$
2:  one-gate set $\mathbf{G}_1 \leftarrow \mathbf{G}$, zero-gate set $\mathbf{G}_0 \leftarrow \emptyset$
3:  while $\mathrm{FLOPs}(\mathbf{G}_1) > F_b$ do
4:     align gates in the one-gate set $\mathbf{G}_1$
5:     optimize the gates via Eq.(13) with fixed weights $\mathbf{W}$
6:     get the smallest gates with ratio $r$ as $\mathbf{G}_r$
7:     update $\mathbf{G}_1 \leftarrow \mathbf{G}_1 \setminus \mathbf{G}_r$ and $\mathbf{G}_0 \leftarrow \mathbf{G}_0 \cup \mathbf{G}_r$
8:     calculate the valid FLOPs via Eq.(3)
9:     fine-tune the weights $\mathbf{W}$ via Eq.(14) with all gates in $\mathbf{G}_1$ fixed to ones
10:  end while
Output: retained gates $\mathbf{G}_1$
Algorithm 1 Data agnostic filter gating for efficient deep networks
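For readers who prefer code, below is a self-contained toy walk-through of Algorithm 1 on a single 1×1 convolution with synthetic data. Everything concrete here (the layer sizes, the tiny gate generator, the synthetic regression task, the hyper-parameters, and the normalized balance coefficient) is our own illustrative assumption; only the control flow mirrors the algorithm: optimize the gates under Eq.(13) with frozen weights, prune the smallest ratio of active gates, check the FLOPs budget, then fine-tune the weights with hard gates.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
c_in, c_out, hw = 8, 16, 6
conv = nn.Conv2d(c_in, c_out, kernel_size=1)


class TinyDagger(nn.Module):
    def __init__(self, c):
        super().__init__()
        self.fc1, self.fc2 = nn.Linear(c, c // 2), nn.Linear(c // 2, c)

    def forward(self, w):
        m = w.detach().mean(dim=(1, 2, 3))
        m = m - m.mean()
        return torch.sigmoid(self.fc2(torch.relu(self.fc1(m))))


dagger = TinyDagger(c_out)
x = torch.randn(4, c_in, hw, hw)
target = torch.randn(4, c_out, hw, hw)        # synthetic regression stand-in for the task loss


def est_flops(g):                             # l1 surrogate of Eq.(4) for this single layer
    return c_in * g.abs().sum() * hw * hw


mask = torch.ones(c_out)                      # 1 = one-gate set, 0 = zero-gate set
budget = 0.5 * est_flops(mask)                # prune to 50% FLOPs
lam = 8.0 / est_flops(mask).item()            # balance coefficient, normalized for this toy
ratio = 0.125                                 # exaggerated pruning ratio per update

while est_flops(mask) > budget:               # line 3: FLOPs examination
    # (line 4) alignment is skipped in this toy; gates are simply read out
    opt = torch.optim.SGD(dagger.parameters(), lr=0.01)
    for _ in range(20):                       # line 5: optimize gates with fixed weights
        g = dagger(conv.weight) * mask
        task = ((conv(x) * g.view(1, -1, 1, 1) - target) ** 2).mean()
        loss = task + lam * est_flops(g)      # Eq.(13): task loss + FLOPs regularizer
        opt.zero_grad(); loss.backward(); opt.step()
    g = (dagger(conv.weight) * mask).detach()
    active = torch.nonzero(mask).squeeze(1)   # line 6: smallest gates among the active ones
    k = max(1, int(ratio * c_out))
    prune = active[torch.topk(g[active], k, largest=False).indices]
    mask[prune] = 0                           # line 7: move them to the zero-gate set
    w_opt = torch.optim.SGD(conv.parameters(), lr=0.01)
    for _ in range(20):                       # line 9: fine-tune weights with hard 0/1 gates
        out = conv(x) * mask.view(1, -1, 1, 1)
        loss = ((out - target) ** 2).mean()
        w_opt.zero_grad(); loss.backward(); w_opt.step()

print("retained filters:", int(mask.sum()))
```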

4.2 Iterative optimization with FLOPs examination

For a clear presentation, we refer to the original network as the main network, in contrast with the Dagger module. As previously illustrated, we iteratively update the gates (i.e., the Dagger module $\mathcal{A}$) and the weights $\mathbf{W}$ of the main network, as presented in Algorithm 1 and elaborated as follows.

Greedy pruning gates with fixed weights

Specifically, with the main network fixed, the Dagger module acts as a pruner that generates gates for the main network, providing guidance on how to prune its redundant filters. In turn, the main network supplies the gates with a classification evaluation, so that the retained gates can maintain the classification performance as much as possible. As a result, the Dagger module can be optimized with the following objective:

$\min_{\boldsymbol{\theta}} \; \mathcal{L}\big(\mathbf{W}, \mathcal{A}(\mathbf{W}; \boldsymbol{\theta}); \mathcal{D}\big) + \lambda \cdot \widehat{\mathrm{FLOPs}}\big(\mathcal{A}(\mathbf{W}; \boldsymbol{\theta})\big),$   (13)

where the weights $\mathbf{W}$ are fixed compared to Eq.(12). Therefore, the gates can be optimized under the joint supervision of the classification loss and the FLOPs-aware regularization.

However, before we optimize the Dagger module and the gates, we need to fix the estimation gap of the FLOPs regularization $\widehat{\mathrm{FLOPs}}$. In our method, we propose to align the gates generated by the Dagger module, as in line 4 of Algorithm 1. Concretely, we first retrain the Dagger module so that its output gates equal 0.5, which amounts to the output of the second FC layer in Fig. 2 being equal to 0. Then we add 0.5 to the output gates so that the values of the gates are equal to 1. The advantages of this gate alignment are two-fold. First, after the alignment the gates are all ones, so $\widehat{\mathrm{FLOPs}}$ is an accurate estimation of FLOPs for further regularizing the redundant gates. Second, aligning the gates at a pre-sigmoid value of 0 corresponds to the maximum slope of the sigmoid, which in a way enhances the impact of the regularization on optimizing the gates.
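Under this reading, the alignment step could be sketched as follows: the Dagger module is first retrained so that its pre-sigmoid output is close to 0 (i.e., a sigmoid output of 0.5, the point of maximum slope), and 0.5 is then added so every gate starts at exactly 1. The small gate generator below is a stand-in we define only for this illustration.

```python
import torch
import torch.nn as nn

conv = nn.Conv2d(16, 32, kernel_size=3, padding=1)
dagger = nn.Sequential(nn.Linear(32, 8), nn.ReLU(), nn.Linear(8, 32), nn.Sigmoid())


def gates_of(weight):
    m = weight.detach().mean(dim=(1, 2, 3))   # one scalar per output filter
    return dagger(m - m.mean())


opt = torch.optim.SGD(dagger.parameters(), lr=0.1)
for _ in range(200):                          # pull the sigmoid outputs towards 0.5
    loss = ((gates_of(conv.weight) - 0.5) ** 2).mean()
    opt.zero_grad(); loss.backward(); opt.step()

aligned_gates = gates_of(conv.weight) + 0.5   # all gates start at ~1 after alignment
```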

After the alignment, we can safely optimize Eq.(13) to obtain gates. However, the estimation gap may be enlarged by the optimization. We therefore prune redundant filters greedily, i.e., we prune a few gates at a time over multiple rounds, until the retained gates satisfy the FLOPs budget. Generally, the closer a gate is to zero, the smaller its contribution to the network. Since the gates are all ones after the alignment, we optimize Eq.(13) for some steps (line 5 of Algorithm 1) and then prune the gates with the smallest values. Usually, we set a pruning ratio $r$ (e.g., 0.6%) to control the number of gates pruned in each update (line 6 of Algorithm 1). In addition, to meet the hard FLOPs constraint, we simply perform a FLOPs examination after each pruning step (line 8 of Algorithm 1); once the retained gates satisfy the FLOPs budget, the redundant filters are expected to have been identified. Thanks to the greedy procedure and the pretrained weights, our algorithm needs only a small number of input batches to prune redundant filters, enabling rapid filter pruning.

Fine-tuning weights with fixed gates

After some gates are pruned (fixed to zero), we need to fine-tune the weights of the main network with the gates fixed. However, after the greedy pruning, the values of the retained gates are no longer ones but lie in $(0, 1)$. If we fine-tune the weights based on them, the coupling issue will worsen, since the weights would be trained against biased gates. We therefore propose to reset all retained gates to ones (line 9 of Algorithm 1) and then fine-tune. The objective goes as

$\min_{\mathbf{W}} \; \mathcal{L}(\mathbf{W}, \mathbf{G}^*; \mathcal{D}),$   (14)

where $\mathbf{G}^*$ denotes the fixed binary gates. This fine-tuning compensates for the information lost in the retained filters due to filter pruning.

4.3 Dealing with skipping layers

To construct the FLOPs-aware regularization $\widehat{\mathrm{FLOPs}}$, the FLOPs needs to be calculated. For a regular layer (e.g., the 1×1 convolution in Eq.(4)), the numbers of filters in different layers are independent, so we can use the norm of the gates of each layer to represent its number of filters and calculate the FLOPs in a simple form. However, for bottlenecks with skipping layers, there is a structural constraint that the input and output of the bottleneck must have the same number of filters, as in ResNet He et al. (2016) and MobileNetV2 Sandler et al. (2018).

For a bottleneck with skipping layers as in Fig. 2(b), each layer has its own gates. Denote the gates of the input and output as $\mathbf{g}_{in}$ and $\mathbf{g}_{out}$, respectively. For a pretrained model, $\mathbf{g}_{in}$ and $\mathbf{g}_{out}$ have the same size. Then, to calculate the FLOPs of this bottleneck, the valid gates of the input and output should be their union, as Fig. 2(b) shows, i.e.,

$\tilde{\mathbf{g}} = \mathbf{g}_{in} \cup \mathbf{g}_{out},$   (15)

where $\cup$ is the union operation. Then, based on the valid gates, the FLOPs can be calculated as for regular layers.
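A toy illustration of Eq.(15), assuming gates are stored as 0/1 tensors: the valid channel set of a residual bottleneck is the union of the input and output gates, since both tensors must keep the same channel layout for the element-wise addition.

```python
import torch

g_in = torch.tensor([1., 0., 1., 0.])             # gates of the bottleneck input
g_out = torch.tensor([1., 1., 0., 0.])            # gates of the bottleneck output
g_valid = torch.clamp(g_in + g_out, max=1.0)      # union of the two gate sets, Eq.(15)
valid_channels = int(g_valid.sum())               # fed into the FLOPs calculator
print(g_valid, valid_channels)                    # tensor([1., 1., 1., 0.]) 3
```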

5 Experimental Results

In this section, we conduct extensive experiments on the benchmark CIFAR-10 and ImageNet datasets to validate the superiority of our proposed method. Besides, we also conduct ablation studies to further investigate how the components of our method contribute to filter pruning.

5.1 Configuration and settings

Comparison methods. In order to compare pruning performance, we select several state-of-the-art filter pruning methods: AutoPruner Luo and Wu (2018), LEGR Chin et al. (2019), SFP He et al. (2018), FPGM He et al. (2019), DCP Zhuang et al. (2018b), ThiNet Luo et al. (2017), CP He et al. (2017), Slimming Liu et al. (2017) and PFS Wang et al. (2019). Besides, since our method aims to identify redundant filters, we also include two vanilla baselines. The first one is Uniform, i.e., shrinking the width of the network by a fixed rate to meet the FLOPs budget. The second one, denoted as Random, is a variant that randomly sets the number of filters within each layer. Concretely, we randomly adjust the filter numbers of Uniform within a certain range to meet the FLOPs budget. The Random method is run 10 times, and we report the average performance.

Training. Based on a pre-trained model, we prune redundant filters until the FLOPs budget is satisfied. Specifically, we optimize the Dagger module and main network for 400 (100) iterations with a batch size of 320 (64) for the ImageNet (CIFAR-10) dataset before pruning the gates in each update of Algorithm 1. The pruning ratio per update (line 6 in Algorithm 1) is set to 0.6%, and the balance parameter $\lambda$ is set to 8 for all networks. We use the SGD optimizer with momentum 0.9 and Nesterov acceleration. The weight decay is set to . Besides, the learning rate is annealed with a cosine strategy from an initial value of 0.001 for the Dagger module (main network). Once the FLOPs budget is achieved, we finetune the pruned weights with the learning rate initialized to 0.01. For the CIFAR-10 dataset, we finetune for 100 epochs and the learning rate is divided by 10 at the 75th and 112th epochs. For the ImageNet dataset, we use a cosine learning-rate schedule to finetune the network for 60 epochs. All experiments are implemented with PyTorch Paszke et al. (2017) on NVIDIA 1080 Ti GPUs.
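For reference, one possible realization of the described optimization setup is sketched below; the dummy layer and the weight-decay value are placeholders of ours, since the exact decay is not stated above.

```python
import torch
import torch.nn as nn

model = nn.Conv2d(3, 16, kernel_size=3)          # stand-in for the Dagger module / main network
optimizer = torch.optim.SGD(model.parameters(), lr=0.001, momentum=0.9,
                            nesterov=True, weight_decay=0.0)  # decay value: placeholder
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=400)
```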

MobileNetV2:
Groups Methods FLOPs Params Acc
200M DCP Zhuang et al. (2018a) 218M - 94.69%
200M Uniform 207M 1.5M 94.57%
200M Random 207M - 94.20%
200M Dagger 207M 1.9M 94.91%
148M MuffNet Chen et al. (2019) 175M - 94.71%
148M Uniform 148M 1.1M 94.32%
148M Random 148M - 93.85%
148M Dagger 148M 1.7M 94.83%
88M AutoSlim Yu and Huang (2019) 88M 1.5M 93.20%
88M Uniform 88M 0.6M 94.32%
88M Random 88M - 93.85%
88M Dagger 88M 1.1M 94.49%
88M AutoSlim Yu and Huang (2019) 59M 0.7M 93.00%

VGGNet:
Groups Methods FLOPs Params Acc
200M DCP Zhuang et al. (2018a) 199M 10.4M 94.16%
200M Slimming Liu et al. (2017) 199M 10.4M 93.80%
200M SSS Huang and Wang (2018) 199M 5.0M 93.63%
200M PFS Wang et al. (2019) 199M - 93.71%
200M VCN Zhao et al. (2019) 190M 3.92M 93.18%
200M Uniform 199M 10.0M 93.45%
200M Random 199M - 93.02%
200M Dagger 199M 6.0M 94.25%
119M AOFP Ding et al. (2019) 124M - 93.84%
119M CGNets Hua et al. (2018) 117M - 92.88%
119M Uniform 119M 6.1M 93.03%
119M Random 119M - 92.22%
119M Dagger 119M 2.7M 93.91%

Table 1: Performance comparison of MobileNetV2 and VGGNet on CIFAR-10.

5.2 Experiments on CIFAR-10 dataset

Dataset and networks. The CIFAR-10 dataset includes 60,000 RGB images of size 32×32 from 10 exclusive categories. The dataset includes 50,000 images for training and 10,000 images for testing. We conduct filter pruning on the benchmark VGGNet-19 Simonyan and Zisserman (2014) and the compact MobileNetV2 Sandler et al. (2018). Concretely, VGGNet-19 has 20M parameters and 399M FLOPs with an error rate of 6.01%. In contrast, MobileNetV2 only has 2.2M parameters and 297M FLOPs but with an error rate of 5.53%. The results are reported in Table 1.

Results. As shown in Table 1, our method achieves the best accuracy w.r.t. different FLOPs budgets on MobileNetV2 and VGGNet. In detail, for VGGNet, our pruned 50% FLOPs VGGNet outperforms the state-of-the-art DCP, Slimming, and PFS by 0.26%, 0.45%, and 0.54%, respectively, and even surpasses the pretrained model by 0.26%, which means our method can efficiently prune redundant filters and thereby improve performance. In addition, compared with the two baselines Uniform and Random, our pruned VGGNet-19 improves the accuracy by more than 0.80%. Different from VGGNet-19, MobileNetV2 is more compact and has many skipping layers. As shown in Table 1, our 207M MobileNetV2 outperforms DCP and the pretrained model by 0.22% and 0.44%, respectively. Besides, in the case of a tiny budget (i.e., 88M FLOPs), our pruned MobileNetV2 can still achieve an accuracy of 94.49%, which is 1.29% higher than AutoSlim, showing that our method can achieve promising results even with small budgets.

Methods FLOPs Params Top-1 ACC Top-5 ACC
SFP He et al. (2018) 2.4G - 74.6% 92.1%
FPGM He et al. (2019) 2.4G - 75.6% 92.6%
LEGR Chin et al. (2019) 2.4G - 75.7% 92.7%
AutoPruner Luo and Wu (2018) 2.0G - 74.8% 92.2%
MetaPruning Liu et al. (2019) 2.0G - 75.4% -
Uniform 2.0G 10.2M 74.1% 90.6%
Random 2.0G - 73.2% 90.4%
Dagger 2.0G 11.7M 76.1% 92.8%
Table 2: Performance comparison of pruned ResNet-50 on ImageNet dataset with 2.0G FLOPs budget.
Groups Methods FLOPs Params Top-1 ACC Top-5 ACC
140M MetaPruning Liu et al. (2019) 140M - 68.2% -
GS Ye et al. (2020) 137M 2.0M 68.8% -
Uniform 140M 2.7M 67.6% 88.2%
Random 140M - 67.1% 87.9%
Dagger 140M 2.84M 69.5% 88.8%
106M MetaPruning Liu et al. (2019) 105M - 65.0% -
GS Ye et al. (2020) 106M 1.9M 66.9% -
Uniform 106M 1.5M 64.1% 84.2%
Random 106M - 63.5% 84.0%
Dagger 106M 2.46M 67.2% 86.8%
Table 3: Performance Comparison of pruned MobileNetV2 on ImageNet dataset with two FLOPs budget.

5.3 Experiments on ImageNet dataset

Dataset. The ImageNet (ILSVRC-12) dataset consists of 1.28 million training images and 50k validation images from 1,000 categories. Specifically, we report the accuracy on the validation set, following Zhuang et al. (2018b); Liu et al. (2017). We then implement pruning on two benchmark networks, i.e., ResNet-50 and MobileNetV2. The pretrained models are those released by PyTorch.

Results of ResNet-50. The pretrained ResNet-50 has 25.5M parameters and 4.1G FLOPs with 76.6% Top-1 accuracy. As shown in Table 2, our algorithm outperforms SFP He et al. (2018), FPGM He et al. (2019) and LEGR Chin et al. (2019) by 1.5%, 0.5% and 0.4%, respectively, while our pruned ResNet-50 has even smaller FLOPs (by 0.4G). Besides, in comparison with AutoPruner Luo and Wu (2018) and MetaPruning Liu et al. (2019), our pruned network also achieves the best performance, with 1.3% and 0.7% improvements in Top-1 accuracy.

Results of MobileNetV2. The pretrained MobileNetV2 has 3.5M parameters and 300M FLOPs with 68.2% Top-1 accuracy. We prune the network under two different FLOPs budgets (140M and 106M). As shown in Table 3, by pruning MobileNetV2 to 140M FLOPs, our pruned MobileNetV2 outperforms the pretrained MobileNetV2 by 1.3%. Our method also leads to a 2.4% (3.7%) and 1.9% (3.1%) increase in Top-1 accuracy compared with the Random and Uniform baselines for 140M (106M) FLOPs. Moreover, with the same FLOPs budget, our method surpasses MetaPruning by a large margin of 1.3% (2.2%) for 140M (106M) FLOPs.

Figure 3: Ablation studies. Classification performance of the pruned networks under different balance parameters $\lambda$ in Eq.(13) (left) and pruning ratios (right). Note that the blue lines refer to the Top-1 accuracy (%) of VGGNet-19 on the CIFAR-10 dataset, while the red lines are for the Top-1 accuracy (%) of ResNet-50 on the ImageNet dataset.

5.4 Ablation studies

Effect of balance parameter

According to the Lagrange-multiplier method, the optimal value of $\lambda$ is achieved when the derivative of the objective in Eq.(13) with respect to the variables is 0, which means the loss has reached a minimum point. Since the relationship between the loss and the gates cannot be expressed explicitly, we cannot obtain the exact value of $\lambda$ through calculation. However, we can experimentally find the value of $\lambda$ for which the loss reaches its minimum (or the accuracy its maximum). In detail, we implement pruning with different trade-off parameters $\lambda$. When $\lambda$ becomes larger, the impact of the classification loss becomes smaller and the network pays more attention to reducing FLOPs, resulting in a rapid increase in the classification loss as more iterations are involved. On the other hand, if $\lambda$ is too small, the network is likely to be pruned almost randomly, leading to poor retraining results. The accuracy of each pruned network with different $\lambda$ is shown in Fig. 3. Empirically, we set $\lambda$ to 8 for all networks.

Effect of greedy pruning

In our method, the main network is updated after each pruning step. In this way, the pruning ratio $r$ in line 6 of Algorithm 1 not only controls the pruning speed of each update but also changes the pruning strategy. Since we subtract the mean of the filters in the Dagger module, when the pruning ratio is chosen to be a larger value, the gates are pruned more evenly, and vice versa. In order to investigate the effect of the pruning ratio, we prune VGGNet-19 on the CIFAR-10 dataset and ResNet-50 on the ImageNet dataset with different pruning ratios. As shown in Fig. 3, the pruning ratio favors a medium value, since a large value tends to prune gates uniformly over all layers while a smaller value greedily prunes a certain layer. We find that 0.6% is empirically a good option.

Figure 4: Visualization of the learned number of filters w.r.t. different networks and datasets.

Visualization of learned number of filters

Based on the CIFAR-10 and ImageNet datasets, Fig. 4 shows our pruning results for VGGNet-19, MobileNetV2 and ResNet-50. For the two networks with skipping layers (ResNet-50 and MobileNetV2), the pruning is smoothly distributed over all layers. However, MobileNetV2 pays more attention to pruning layers that are involved in skip connections, while ResNet-50 does not. The network structure of VGGNet-19 does not contain any skipping layers, so its pruning results are more uneven than those of the networks with skipping layers. In addition, the last layers of the above three networks are well preserved after pruning, which may be because they contribute more to the final classification.

To analyze the pruning process of the same network when different FLOPs budgets are given, as shown in Fig. 4, we prune MobileNetV2 on the ImageNet dataset with 140M and 106M FLOPs budgets, respectively. When MobileNetV2 is pruned from 300M to 140M, the pruning is mainly concentrated on the non-skipping layers and the layers near the front of the network. However, when the FLOPs budget is set to 106M, the number of filters at the end of the network begins to decrease, implying that the layers at the front of the network are easier to prune than the later layers.

Model Pretrain Epochs Pretrain ACC Finetune ACC
MobileNetV2 10 87.89% 94.51%
40 89.86% 94.63%
70 90.74% 94.74%
100 91.89% 94.77%
150 93.40% 94.78%
300 94.47% 94.83%
VGGNet 10 85.25% 94.07%
40 89.51% 94.13%
70 90.57% 94.17%
100 91.55% 94.22%
150 92.76% 94.21%
300 93.99% 94.25%
Table 4: Classification accuracy (%) of pruned networks on the CIFAR-10 dataset with a 2× acceleration rate w.r.t. different checkpoints of the pretrained models.
Figure 5: Finetuning epochs of 2×-accelerated MobileNetV2 and VGGNet on the CIFAR-10 dataset.

Effect of quality of pretrained models

To examine the effect of the quality of the pretrained model on the learned filter numbers, we use different checkpoints of the pretrained MobileNetV2 (VGGNet) on the CIFAR-10 dataset, which have different classification errors. Then we implement our method Dagger based on these pretrained models, and the results are shown in Table 4. In detail, we use pretrained models w.r.t. different pre-training epochs as the starting point for Dagger; the accuracy of the pretrained models is reported as "Pretrain Acc". After pruning with 2× acceleration, we finetune the retained weights for 100 epochs and report the result as "Finetune Acc". It can be seen that as the classification performance of the pretrained models improves, our pruned networks improve as well. Moreover, the improvement tends to be steady if the quality of the pretrained model is not too bad. For example, when the pretraining epochs of MobileNetV2 (VGGNet) are reduced from 300 to 70, the pretrained accuracy degrades by 3.73% (3.42%), while the finetuned accuracy of our pruned results only has a 0.09% (0.07%) performance gap. This implies that our method has low sensitivity to the quality of the pretrained model; the pretrained models do not necessarily need to be state-of-the-art, but merely not too bad, if we expect a good pruned network.

Effect of Dagger in retraining epochs

To examine the effect of the finetuning epochs in our method Dagger, we finetune the pruned 50% FLOPs MobileNetV2 and VGGNet on the CIFAR-10 dataset for different numbers of epochs. In detail, we inherit the weights of the retained filters after pruning and adopt the same training strategy as illustrated before. Specifically, for VGGNet, the learning rate is initialized to 0.1 and divided by 10 at 50% and 75% of the total epochs. As shown in Fig. 5, the test accuracy of the pruned models is relatively low at the beginning, which means the information lost with the pruned filters has a certain effect on the overall performance. However, as the number of finetuning epochs increases, the accuracy of the pruned models improves sharply, reaching its highest value at about 100 epochs, which demonstrates the effectiveness of our method.

Efficiency of Dagger in pruning filters

To investigate the efficiency of our method in pruning filters, we report the time cost on pruning w.r.t. different pruning ratios in Table 5. All experiments are implemented with 8 NVIDIA 1080 Ti GPUs.

Model Pruning ratios Params FLOPs Time cost(h)
ResNet-50 30% 16.5M 2.9G 1.58
50% 11.7M 2.0G 2.58
70% 9.2M 1.2G 3.48
MobileNetV2 30% 3.19M 210M 1.08
50% 2.96M 150M 1.48
70% 2.54M 90M 1.88
Table 5: Efficiency of pruned ResNet-50 and MobileNetV2 on ImageNet Dataset w.r.t. different pruning ratios.

As shown in Table 5, our method can quickly reach the desired model size based on the pretrained model. We optimize the Dagger module and main network for 400 (100) iterations with a batch size of 320 (64) and a pruning ratio of 0.6% in each update for the ImageNet (CIFAR-10) dataset. Therefore, taking pruning to 50% FLOPs as an example, we only need to go through about 8 (10) epochs for the ImageNet (CIFAR-10) dataset.
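As a rough back-of-the-envelope check of these epoch counts (our own arithmetic, assuming that removing roughly 50% of the gates at 0.6% per update takes about $50\%/0.6\% \approx 83$ updates): on ImageNet, $400 \times 320 / 1.28\times10^{6} \approx 0.1$ epoch per update, so $83 \times 0.1 \approx 8$ epochs; on CIFAR-10, $100 \times 64 / 5\times10^{4} \approx 0.13$ epoch per update, so $83 \times 0.13 \approx 10$ epochs.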

Figure 6: Visualization of feature maps w.r.t. pruned (middle column) and retained (right column) filters in the second bottleneck of MobileNetV2 on the ImageNet dataset.

5.5 Visualization of feature maps

To intuitively check the gates learned by our method Dagger, we visualize the feature maps w.r.t. different filters with zero gates (pruned) and gates of one (retained) in Fig. 6. All the feature maps are from the first convolution of the second bottleneck in MobileNetV2 on the ImageNet dataset. As shown in Fig. 6, the feature maps of retained filters (gates of one) are more visually informative than those of the pruned ones (zero gates). Besides, the pruned filters usually contain more background information, e.g., the bird in the fourth row of Fig. 6. In contrast, our retained filters capture much information about the target and suppress the background instead, such as the snakes and dogs in the second and third rows of Fig. 6.

6 Conclusion

In this paper, we propose to leverage a Dagger module that takes pre-trained weights as input, and to involve FLOPs as an explicit regularization to guide the pruning of redundant filters. Concretely, we assign a binary gate to each filter to indicate whether the filter should be retained or pruned. The binary gates can be well modeled by a Dagger module that uses the filter parameters as input, which helps to generate dataset-related gates. Then, based on the aligned gates, we obtain an accurate estimation of FLOPs, which guides the learning of redundant filters alongside the classification performance. We prune the redundant filters from the pre-trained model with a greedy algorithm until the effective FLOPs of the current pruned network matches the pre-set budget. Extensive experiments on the benchmark CIFAR-10 dataset and the large-scale ImageNet dataset show the superiority of our proposed method over other state-of-the-art filter pruning methods.

Footnotes

  1. Work was done during internship at SenseTime.
  2. journal: Journal of LaTeX Templates
  3. https://en.wikipedia.org/wiki/Support_(mathematics)
  4. https://pytorch.org/docs/stable/torchvision/models.html

References

  1. "Learning-compression" algorithms for neural net pruning. pp. 8532–8541.
  2. MuffNet: multi-layer feature federation for mobile deep learning. pp. 0–0.
  3. W. Chen, J. T. Wilson, S. Tyree, K. Q. Weinberger and Y. Chen. Compressing neural networks with the hashing trick.
  4. LEGR: filter pruning via learned global ranking. arXiv preprint arXiv:1904.12368.
  5. ImageNet: a large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255.
  6. E. Denton, W. Zaremba, J. Bruna, Y. LeCun and R. Fergus. Exploiting linear structure within convolutional networks for efficient evaluation.
  7. Approximated oracle filter pruning for destructive CNN width optimization. pp. 1607–1616.
  8. Learning to prune deep neural networks via layer-wise optimal brain surgeon. pp. 4857–4867.
  9. On the learnability of quantum neural networks. arXiv preprint arXiv:2007.12369.
  10. Dynamic network surgery for efficient DNNs. pp. 1387–1395.
  11. Single path one-shot neural architecture search with uniform sampling. arXiv: Computer Vision and Pattern Recognition.
  12. Deep compression: compressing deep neural networks with pruning, trained quantization and Huffman coding. arXiv preprint arXiv:1510.00149.
  13. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778.
  14. Soft filter pruning for accelerating deep convolutional neural networks. arXiv preprint arXiv:1808.06866.
  15. Filter pruning via geometric median for deep convolutional neural networks acceleration. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4340–4349.
  16. Channel pruning for accelerating very deep neural networks. In Proceedings of the IEEE International Conference on Computer Vision, pp. 1389–1397.
  17. Squeeze-and-excitation networks. IEEE Transactions on Pattern Analysis and Machine Intelligence, pp. 1–1.
  18. Squeeze-and-excitation networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7132–7141.
  19. Channel gating neural networks. arXiv: Learning.
  20. Data-driven sparse structure selection for deep neural networks. In The European Conference on Computer Vision (ECCV).
  21. Learning student networks with few data. In AAAI, pp. 4469–4476.
  22. The CIFAR-10 dataset. Online: http://www.cs.toronto.edu/kriz/cifar.html.
  23. Speech enhancement using progressive learning-based convolutional recurrent neural network. Applied Acoustics 166, pp. 107347.
  24. Pruning filters for efficient convnets. arXiv: Computer Vision and Pattern Recognition.
  25. DARTS: differentiable architecture search. arXiv: Learning.
  26. MetaPruning: meta learning for automatic neural network channel pruning. arXiv preprint arXiv:1903.10258.
  27. Learning efficient convolutional networks through network slimming. pp. 2755–2763.
  28. Rethinking the value of network pruning. arXiv preprint arXiv:1810.05270.
  29. ThiNet: a filter level pruning method for deep neural network compression. In Proceedings of the IEEE International Conference on Computer Vision, pp. 5058–5066.
  30. AutoPruner: an end-to-end trainable filter pruning method for efficient deep model inference. arXiv preprint arXiv:1805.08941.
  31. AtomNAS: fine-grained end-to-end neural architecture search.
  32. Group sampling for scale invariant face detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3446–3456.
  33. PyTorch: tensors and dynamic neural networks in Python with strong GPU acceleration.
  34. MobileNetV2: inverted residuals and linear bottlenecks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4510–4520.
  35. Reinforced molecule generation with heterogeneous states. In 2019 IEEE International Conference on Data Mining (ICDM), pp. 548–557.
  36. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556.
  37. V. Sindhwani, T. N. Sainath and S. Kumar. Structured transforms for small-footprint deep learning.
  38. Reborn filters: pruning convolutional neural networks with limited data. In AAAI, pp. 5972–5980.
  39. Bringing giant neural networks down to earth with unlabeled data. arXiv preprint arXiv:1907.06065.
  40. CLIP-Q: deep network compression learning by in-parallel pruning-quantization. pp. 7873–7882.
  41. Convolutional networks with adaptive inference graphs. pp. 3–18.
  42. Improving the performance of weighted Lagrange-multiplier methods for nonlinear constrained optimization. Information Sciences 124 (1), pp. 241–272.
  43. Exploring linear relationship in feature map subspace for convnets compression. arXiv: Computer Vision and Pattern Recognition.
  44. The devil of face recognition is in the noise. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 765–780.
  45. Pruning from scratch. arXiv: Computer Vision and Pattern Recognition.
  46. Point-set anchors for object detection, instance segmentation and pose estimation. arXiv preprint arXiv:2007.02846.
  47. J. Wu, C. Leng, Y. Wang, Q. Hu and J. Cheng. Quantized convolutional neural networks for mobile devices.
  48. AutoPrune: automatic network pruning by regularizing auxiliary parameters. pp. 13681–13691.
  49. Shared predictive cross-modal deep quantization. IEEE Transactions on Neural Networks and Learning Systems 29 (11), pp. 5292–5303.
  50. E. Yang, M. Liu, D. Yao, B. Cao, C. Lian, P. Yap and D. Shen. Deep Bayesian hashing with center prior for multi-modal neuroimage retrieval. IEEE Transactions on Medical Imaging.
  51. ISTA-NAS: efficient and consistent neural architecture search by sparse coding. arXiv preprint arXiv:2010.06176.
  52. Good subnetworks provably exist: pruning via greedy forward selection. arXiv: Learning.
  53. GreedyNAS: towards fast one-shot NAS with greedy supernet. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1999–2008.
  54. Learning from multiple teacher networks. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 1285–1294.
  55. Learning with single-teacher multi-student. In Thirty-Second AAAI Conference on Artificial Intelligence.
  56. Gate decorator: global filter pruning method for accelerating deep convolutional neural networks. pp. 2130–2141.
  57. AutoSlim: towards one-shot architecture search for channel numbers. arXiv preprint arXiv:1903.11728.
  58. Variational convolutional neural network pruning. pp. 2780–2789.
  59. Discrimination-aware channel pruning for deep neural networks. In Advances in Neural Information Processing Systems, pp. 875–886.
  60. Discrimination-aware channel pruning for deep neural networks. In Advances in Neural Information Processing Systems, pp. 875–886.