Pruning Filter in Filter


Pruning has become a powerful and effective technique for compressing and accelerating modern neural networks. Existing pruning methods can be grouped into two categories: filter pruning (FP) and weight pruning (WP). FP wins on hardware compatibility but loses on compression ratio compared with WP. To combine the strengths of both, we propose to prune the filter in the filter (PFF). Specifically, we treat a filter of kernel size $K \times K$ as $K \times K$ stripes, i.e., $1 \times 1$ filters; by pruning stripes instead of whole filters, PFF achieves finer granularity than traditional FP while remaining hardware friendly. PFF is implemented by introducing a novel learnable matrix called the Filter Skeleton, whose values reflect the optimal shape of each filter. Since recent work has shown that the pruned architecture is more crucial than the inherited important weights, we argue that the architecture of a single filter, i.e., the Filter Skeleton, also matters. Through extensive experiments, we demonstrate that PFF is more effective than previous FP-based methods and achieves state-of-the-art pruning ratios on the CIFAR-10 and ImageNet datasets without an obvious accuracy drop.

1 Introduction


Deep Neural Networks (DNNs) have achieved remarkable progress in many areas including speech recognition [7], computer vision [21, 32], natural language processing [42], etc. However, model deployment is sometimes costly due to the large number of parameters in DNNs. To relieve such a problem, numerous approaches have been proposed to compress DNNs and reduce the amount of computation. These methods can be classified into two main categories: weight pruning (WP) and filter (channel) pruning (FP).

WP is a fine-grained pruning method that prunes individual weights, e.g., those whose value is nearly 0, inside the network [12, 11], resulting in a sparse network without sacrificing prediction performance. However, since the positions of the non-zero weights are irregular and random, an extra record of the weight positions is needed, and the sparse network pruned by WP cannot be represented in a structured fashion as with FP. This makes WP unable to achieve acceleration on general-purpose processors. By contrast, FP-based methods [25, 18, 30] prune filters or channels within the convolution layers, so the pruned network remains well structured and can easily be accelerated on general-purpose processors. A standard filter pruning pipeline is as follows: 1) train a larger model until convergence; 2) prune the filters according to some criterion; 3) fine-tune the pruned network. [31] observes that training the pruned model from random initialization can also achieve high performance; thus it is the network architecture, rather than the trained weights, that matters. In this paper, we suggest that not only the architecture of the network but also the architecture of the filter itself is important. [36, 37] draw a similar conclusion, namely that filters with a larger kernel size may lead to better performance, although at a higher computation cost. Thus, for a given input feature map, [36, 37] use filters with different kernel sizes (e.g., $1 \times 1$, $3 \times 3$, and $5 \times 5$) to perform convolution and concatenate all the output feature maps. However, the kernel size of each filter is set manually, and designing an efficient network structure this way requires professional experience and knowledge. We therefore ask whether the optimal kernel size of each filter can be learned by pruning. Our assumption is that each filter can be regarded as a combination of stripes, some of which may be redundant to the network. If we can learn the optimal shape of the filter, the redundant stripes can be removed without causing the network to lose information. Compared to traditional FP-based pruning, this pruning paradigm achieves finer granularity since we operate on stripes rather than whole filters. The pruned network can also be inferred efficiently (see Section 3).

Similarly, shape-wise pruning, introduced in [22, 39], also achieves finer granularity than filter/channel pruning by removing the weights located in the same position among all the filters in a certain layer. However, shape-wise pruning breaks the independence assumption on the filters. For example, the invalid positions of weights may differ from filter to filter. When the network is regularized by shape-wise pruning, it may lose representation ability under a large pruning ratio. In this paper, we also offer a comparison to shape-wise pruning in the experiments. Figure 1 further visualizes the average norm of the filters along the channel dimension in VGG19. It can be seen that not all the stripes in a filter contribute equally: some stripes have a very low norm and can be removed. We therefore propose PFF, which learns the optimal shape of each filter and performs stripe selection within each filter. PFF keeps the filters independent of each other and thus does not break the independence assumption among the filters. Throughout the experiments, PFF achieves a higher pruning ratio than filter-wise, channel-wise, and shape-wise pruning methods. We summarize our main contributions below:

  • We propose a new pruning paradigm called PFF. PFF achieves finer granularity than traditional filter pruning, and the pruned network can still be inferred efficiently.

  • We introduce the Filter Skeleton (FS) to efficiently learn the optimal shape of each filter and analyze the working mechanism of FS in depth. Using FS, we achieve state-of-the-art pruning ratios on the CIFAR-10 and ImageNet datasets without an obvious accuracy drop.

Figure 1: The visualization of filters according to their norm in Conv3 of VGG19 network.

2 Related Work

Weight pruning: Weight pruning (WP) dates back to optimal brain damage and optimal brain surgeon [23, 13], which prune weights based on the Hessian of the loss function. [12] prunes the network weights based on a norm criterion and retrains the network to restore the performance; this technique can be incorporated into the deep compression pipeline of pruning, quantization, and Huffman coding [11]. [9] reduces the network complexity through on-the-fly connection pruning, which incorporates connection splicing into the whole process to avoid incorrect pruning and turns pruning into continual network maintenance. [1] removes connections at each DNN layer by solving a convex optimization program that seeks a sparse set of weights at each layer while keeping the layer inputs and outputs consistent with the originally trained model. [29] proposes a frequency-domain dynamic pruning scheme to exploit the spatial correlations in CNNs. The frequency-domain coefficients are pruned dynamically in each iteration, and different frequency bands are pruned discriminatively according to their importance for accuracy. However, one drawback of these unstructured pruning methods is that the resulting weight matrices are sparse, which cannot lead to compression and speedup without dedicated hardware/libraries [10].

Filter/Channel Pruning: Filter (Channel) Pruning (FP) prunes at the level of filters, channels, or even layers. Since the original convolution structure is preserved, no dedicated hardware/libraries are required to realize the benefits. Similar to weight pruning [12], [25] also adopts a norm criterion to prune unimportant filters. Instead of pruning filters, [18] proposes to prune channels through LASSO regression-based channel selection and least-squares reconstruction. [30] optimizes the scaling factor in the BN layer as a channel selection indicator to decide which channels are unimportant and can be removed. [33] introduces ThiNet, which formally establishes filter pruning as an optimization problem and reveals that filters should be pruned based on statistics computed from the next layer, not the current layer. Similarly, [41] optimizes the reconstruction error of the final response layer and propagates an ‘importance score’ for each channel. [17] is the first to use AutoML for model compression, leveraging reinforcement learning to provide the model compression policy. [28] proposes an effective structured pruning approach that jointly prunes filters as well as other structures in an end-to-end manner. Specifically, the authors introduce a soft mask to scale the output of these structures and define a new objective function with sparsity regularization to align the output of the baseline with that of the masked network. [24] introduces a budgeted regularized pruning framework for deep CNNs that fits naturally into traditional neural network training. The framework consists of a learnable masking layer, a novel budget-aware objective function, and the use of knowledge distillation. [40] proposes a global filter pruning algorithm called Gate Decorator, which transforms a vanilla CNN module by multiplying its output by channel-wise scaling factors, i.e., gates, and achieves state-of-the-art results on the CIFAR dataset. [6, 31] analyze in depth how initialization affects pruning through extensive experimental results.

Shape-wise Pruning: [39] introduces shape-wise (group-wise) pruning, which learns a structured sparsity in neural networks using group lasso regularization. Shape-wise pruning can still be processed efficiently using the ‘im2col’ implementation, as with filter-wise and channel-wise pruning. [34] further explores a complete range of pruning granularities and evaluates how they affect the prediction accuracy. [38] improves shape-wise pruning by proposing a dynamic regularization method. However, shape-wise pruning removes the weights located in the same position among all the filters in a certain layer. Since the invalid positions of each filter may be different, shape-wise pruning may cause the network to lose valid information. In contrast, our approach keeps the filters independent of each other and can thus lead to a more efficient network structure.

3 PFF: Pruning Filter in Filter

Figure 2 shows the implementation of PFF. The overall process can be summarized in the following four steps:

  • Step 1: We first train a standard DNN with a Filter Skeleton (FS). FS is a matrix related to the stripes, and each convolutional layer has a corresponding FS. Suppose the weight $W^l$ of the $l$-th convolutional layer is of size $N \times C \times K \times K$, where $N$ is the number of filters, $C$ is the channel dimension, and $K$ is the kernel size. Then the size of the FS in this layer is $N \times K \times K$, i.e., each value in FS corresponds to a stripe in the filter. The FS in each layer is initialized as an all-one matrix. During training, we multiply the filters' weights by FS and impose a regularizer on FS. Mathematically, this process is represented by:

    $$\mathcal{L} = \sum_{(x, y)} \mathrm{loss}\big(f(x, W \odot I), y\big) + \alpha\, g(I), \tag{1}$$

    where $I$ represents the FS, $\odot$ denotes the dot product (each FS value scales its whole stripe, i.e., $I$ is applied along the channel dimension), $g(I)$ is a regularizer on $I$, and $\alpha$ is its weighting coefficient. From (1), the function of $g$ is to create a sparse $I$. For values in $I$ that are close to 0, the corresponding stripes contribute little to the network output and can be pruned. In this paper, we adopt the $\ell_1$ norm penalty on $I$, which is commonly used in many pruning approaches [25, 18, 30]. Specifically, $g(I)$ is written as:

    $$g(I) = \sum_{l=1}^{L} \big\|I^l\big\|_1 = \sum_{l=1}^{L} \sum_{n=1}^{N} \sum_{i=1}^{K} \sum_{j=1}^{K} \big|I^l_{n,i,j}\big|, \tag{2}$$

    where $L$ is the number of convolutional layers. Thus not only the filter weights but also the FS is optimized during training. FS implicitly learns the optimal architecture of each filter. In Section 4.3, we visualize the shape of filters to further show this phenomenon.

  • Step 2: After training, we merge $I$ into the filter weights $W$, i.e., we perform the dot product of $W$ with $I$ and then directly remove $I$ from the network. Thus no additional cost is brought to the network.

  • Step 3: During pruning, we first break each filter into stripes and set a threshold $\delta$. A stripe whose corresponding value in FS is smaller than $\delta$ is pruned, as shown in Figure 2.

  • Step 4: After pruning, many stripes in the filters have been removed and the network is sparse. However, when performing inference on the pruned network, we cannot directly use a filter as a whole to perform convolution on the input feature map, since the filter is broken. Instead, we use each stripe independently to perform convolution and sum the feature maps produced by the stripes. Mathematically, the convolution process in PFF is written as:

    $$X^{l+1}_{n,h,w} = \sum_{c=1}^{C} \sum_{i=1}^{K} \sum_{j=1}^{K} W^l_{n,c,i,j}\, X^l_{c,\,h+i-\frac{K+1}{2},\,w+j-\frac{K+1}{2}} = \sum_{i=1}^{K} \sum_{j=1}^{K} \Big( \sum_{c=1}^{C} W^l_{n,c,i,j}\, X^l_{c,\,h+i-\frac{K+1}{2},\,w+j-\frac{K+1}{2}} \Big), \tag{3}$$

    where $X^{l+1}_{n,h,w}$ is one point of the feature map in the $(l+1)$-th layer. From (3), PFF only modifies the calculation order of the conventional convolution process, so no additional operations (FLOPs) are added to the network. It is worth noting that, since each stripe has its own position in the filter, PFF needs to record the indexes of all the stripes. However, this costs little compared to the whole set of network parameters. Suppose the weight $W^l$ of the $l$-th convolutional layer is of size $N \times C \times K \times K$. For PFF, we need to record at most $N \times K \times K$ indexes. Compared to individual weight pruning, which records $N \times C \times K \times K$ indexes, we reduce the number of indexes by a factor of $C$. Also, we do not need to record the indexes of a filter if all of its stripes are removed from the network, in which case PFF degenerates to conventional filter-wise pruning. For a fair comparison with traditional FP-based methods, we add the number of indexes when calculating the number of network parameters.

    Figure 2: The implementation of PFF.
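Steps 2 to 4 can be illustrated with a minimal, pure-Python sketch for a single filter. It assumes stride 1, zero padding of $(K-1)/2$, and that a stripe is kept when the absolute value of its FS entry is at least the threshold; the helper names (`conv_full`, `merge_and_prune`, `conv_stripes`) and the toy FS values are hypothetical and not taken from the authors' implementation. The check at the end compares stripe-wise convolution against convolving the whole filter with the pruned positions set to zero.

```python
import random
from itertools import product

def conv_full(x, w):
    """Stride-1, zero-padded convolution of one filter w[C][K][K] over x[C][H][W]."""
    C, H, Wd, K = len(x), len(x[0]), len(x[0][0]), len(w[0])
    pad = (K - 1) // 2
    out = [[0.0] * Wd for _ in range(H)]
    for h, v in product(range(H), range(Wd)):
        for c, i, j in product(range(C), range(K), range(K)):
            hh, vv = h + i - pad, v + j - pad
            if 0 <= hh < H and 0 <= vv < Wd:
                out[h][v] += w[c][i][j] * x[c][hh][vv]
    return out

def merge_and_prune(w, fs, delta):
    """Steps 2-3: fold fs[K][K] into the weights, drop stripes with |fs| < delta.
    Returns the kept stripes as (i, j, weights_over_channels)."""
    K = len(fs)
    return [(i, j, [w[c][i][j] * fs[i][j] for c in range(len(w))])
            for i, j in product(range(K), range(K)) if abs(fs[i][j]) >= delta]

def conv_stripes(x, stripes, K):
    """Step 4: apply each kept stripe as a 1x1 filter on a shifted input and sum."""
    C, H, Wd = len(x), len(x[0]), len(x[0][0])
    pad = (K - 1) // 2
    out = [[0.0] * Wd for _ in range(H)]
    for i, j, ws in stripes:
        for h, v in product(range(H), range(Wd)):
            hh, vv = h + i - pad, v + j - pad
            if 0 <= hh < H and 0 <= vv < Wd:
                out[h][v] += sum(ws[c] * x[c][hh][vv] for c in range(C))
    return out

random.seed(0)
C, H, Wd, K, delta = 2, 4, 4, 3, 0.05
x = [[[random.uniform(-1, 1) for _ in range(Wd)] for _ in range(H)] for _ in range(C)]
w = [[[random.uniform(-1, 1) for _ in range(K)] for _ in range(K)] for _ in range(C)]
fs = [[1.0, 0.01, 0.9], [0.0, 0.5, 0.02], [0.03, 0.04, 0.6]]  # toy FS values

stripes = merge_and_prune(w, fs, delta)
# Reference: the same filter with pruned positions zeroed, convolved as a whole.
w_masked = [[[w[c][i][j] * fs[i][j] if abs(fs[i][j]) >= delta else 0.0
              for j in range(K)] for i in range(K)] for c in range(C)]
reference = conv_full(x, w_masked)
striped = conv_stripes(x, stripes, K)
max_diff = max(abs(reference[h][v] - striped[h][v])
               for h, v in product(range(H), range(Wd)))
print(len(stripes), "of", K * K, "stripes kept; max difference:", max_diff)
```

The two results agree up to floating-point rounding, which is the reordering of sums stated in (3); only the kept stripes and their positions need to be stored.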

There are two advantages of PFF compared to traditional FP-based pruning:

  • Suppose the kernel size is $K \times K$; then each filter consists of $K^2$ stripes that can be pruned individually, so PFF achieves finer granularity than traditional FP-based pruning, which leads to a higher pruning ratio.

  • The network pruned by PFF keeps high performance even without a fine-tuning process. This separates PFF from many other FP-based pruning methods that require multiple fine-tuning procedures. The reason is that FS learns an optimal shape for each filter. By pruning unimportant stripes, the filter does not lose much useful information. In contrast, FP directly removes whole filters, which may damage the information learned by the network.
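The index-cost argument from Step 4 and the granularity claim above come down to simple counting. The sketch below assumes one position index per stripe for PFF and one per weight for weight pruning, both in the worst case where nothing has been pruned yet; the function name and the example layer size are illustrative.

```python
def index_counts(n_filters, channels, kernel):
    """Worst-case number of position indexes for one convolutional layer."""
    stripe_indexes = n_filters * kernel * kernel             # one per stripe (PFF)
    weight_indexes = n_filters * channels * kernel * kernel  # one per weight (WP)
    return stripe_indexes, weight_indexes

# Example: a 3x3 layer with 64 filters and 64 input channels.
stripe_idx, weight_idx = index_counts(n_filters=64, channels=64, kernel=3)
reduction = weight_idx // stripe_idx   # equals the channel dimension C
units_per_filter = 3 * 3               # prunable units per filter under PFF vs. 1 under FP
print(stripe_idx, weight_idx, reduction)  # 576 36864 64
```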

4 Experiments

This section is organized as follows. In Section 4.1, we introduce the implementation details; in Section 4.2, we show that PFF achieves state-of-the-art pruning ratios on the CIFAR-10 and ImageNet datasets compared to filter-wise, channel-wise, and shape-wise pruning; in Section 4.3, we analyze in depth how PFF prunes the network; in Section 4.4, we show that PFF can continue pruning networks already pruned by other methods; in Section 4.5, we perform ablation studies on how the hyper-parameters influence PFF.

4.1 Implementation Details

Datasets and Models: CIFAR-10 [20] and ImageNet [2] are two popular datasets and are adopted in our experiments. The CIFAR-10 dataset contains 50K training images and 10K test images for 10 classes. ImageNet contains 1.28 million training images and 50K test images for 1000 classes. On CIFAR-10, we evaluate our method on two popular network structures: VGGNet [35] and ResNet [14]. On ImageNet, we adopt ResNet18.

Baseline Setting: Our baseline setting is consistent with [30]. For CIFAR-10, the model is trained for 160 epochs with a batch size of 64. The initial learning rate is set to 0.1 and is divided by 10 at epochs 80 and 120. Simple data augmentation (random crop and random horizontal flip) is used for the training images. For ImageNet, we follow the official PyTorch implementation, which trains the model for 90 epochs with a batch size of 256. The initial learning rate is set to 0.1 and is divided by 10 every 30 epochs. Images are resized to $256 \times 256$, and a $224 \times 224$ area is then randomly cropped from the image for training. Testing is performed on the center crop of $224 \times 224$ pixels.
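The two step schedules described above can be written as one small function. This is a sketch only: it assumes 0-based epoch counting and that the decay takes effect at the start of the milestone epoch, neither of which is stated in the text, and the function name is hypothetical.

```python
def step_lr(epoch, base_lr=0.1, milestones=(80, 120), gamma=0.1):
    """Learning rate for a 0-based epoch: multiplied by gamma per milestone reached."""
    passed = sum(1 for m in milestones if epoch >= m)
    return base_lr * gamma ** passed

# CIFAR-10: 160 epochs, divide by 10 at epochs 80 and 120.
cifar_lrs = [step_lr(e) for e in (0, 79, 80, 119, 120, 159)]

# ImageNet: 90 epochs, divide by 10 every 30 epochs.
imagenet_lrs = [step_lr(e, milestones=(30, 60)) for e in (0, 29, 30, 59, 60, 89)]

print([round(lr, 6) for lr in cifar_lrs])
print([round(lr, 6) for lr in imagenet_lrs])
```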

PFF setting: The basic hyper-parameter setting is consistent with the baseline. The weighting coefficient $\alpha$ in (1) is set to 1e-5 and the pruning threshold $\delta$ is set to 0.05. For CIFAR-10, we do not fine-tune the network after stripe selection. For ImageNet, we perform one-shot fine-tuning after pruning.
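To make the role of the weighting coefficient concrete, the following sketch evaluates a regularized objective of the form "task loss plus coefficient times the sum of absolute FS values". The additive form, the helper names, and the toy FS values are assumptions for illustration; the default coefficient is the 1e-5 setting given above.

```python
def l1_penalty(skeletons):
    """Sum of absolute FS values over all layers.
    skeletons: list of per-layer FS, each a nested list of shape [N][K][K]."""
    return sum(abs(v) for fs in skeletons for f in fs for row in f for v in row)

def total_loss(task_loss, skeletons, alpha=1e-5):
    """Regularized objective: task loss plus alpha times the FS penalty."""
    return task_loss + alpha * l1_penalty(skeletons)

def l1_subgradient(value):
    """Subgradient of |value|; the penalty pushes each FS entry toward zero."""
    return (value > 0) - (value < 0)

# One layer with a single 3x3 filter (N=1, K=3); values are illustrative.
fs_layer = [[[1.0, 0.02, 0.3], [0.0, 0.8, 0.04], [0.06, 0.01, 0.5]]]
penalty = l1_penalty([fs_layer])
loss = total_loss(0.5, [fs_layer])
print(round(penalty, 6), loss)
```

With a coefficient this small the penalty barely changes the loss value at any single step; its effect comes from the constant-magnitude pull on every FS entry accumulated over training.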

4.2 Comparing PFF with state-of-art methods

We compare PFF with recent state-of-the-art pruning methods. Table 1 and Table 2 list the comparisons on CIFAR-10 and ImageNet, respectively. In Table 1, IR [38] is a shape-wise pruning method; the others, except PFF, are filter-wise or channel-wise methods. We can see that GBN [40] even outperforms the shape-wise pruning method. From our analysis, shape-wise pruning regularizes the network's weights in the same positions among all the filters, which may cause the network to lose useful information, so shape-wise pruning may not be the best choice. PFF, however, outperforms the other methods by a large margin. For example, when pruning VGG16, PFF reduces the number of parameters by 92.66% and the number of FLOPs by 71.16% without losing network performance. Figure 3 shows the validation accuracy of PFF over the training epochs. It can be observed that training with a Filter Skeleton does not cause an accuracy drop; in the early training stages, the network even shows better accuracy than the baseline. On ImageNet, PFF also achieves better performance than recent benchmark approaches. For example, PFF reduces the FLOPs by 54.58% without an obvious accuracy drop. We want to emphasize that even though PFF introduces indexes for the stripes, their cost is small. These indexes are included in the parameter counts reported in Table 1 and Table 2. The pruning ratio of PFF is still significant and achieves state-of-the-art results.

Backbone   Method                      Params reduction (%)   FLOPs reduction (%)   Accuracy drop (%)
VGG16      L1 [25] (ICLR 2017)         64                     34.3                  -0.15
VGG16      ThiNet [33] (ICCV 2017)     63.95                  64.02                 2.49
VGG16      SSS [19] (ECCV 2018)        73.8                   41.6                  0.23
VGG16      SFP [15] (IJCAI 2018)       63.95                  63.91                 1.17
VGG16      GAL [28] (CVPR 2019)        77.6                   39.6                  1.22
VGG16      Hinge [26] (CVPR 2020)      80.05                  39.07                 -0.34
VGG16      HRank [27] (CVPR 2020)      82.9                   53.5                  -0.18
VGG16      PFF (Ours)                  92.66                  71.16                 -0.4

ResNet56   L1 [25] (ICLR 2017)         13.7                   27.6                  -0.02
ResNet56   CP [18] (ICCV 2017)         -                      50                    1.00
ResNet56   NISP [41] (CVPR 2018)       42.6                   43.6                  0.03
ResNet56   DCP [43] (NeurIPS 2018)     70.3                   47.1                  -0.01
ResNet56   IR [38] (IJCNN 2019)        -                      67.7                  0.4
ResNet56   C-SGD [3] (CVPR 2019)       -                      60.8                  -0.23
ResNet56   GBN [40] (NeurIPS 2019)     66.7                   70.3                  0.03
ResNet56   HRank [27] (CVPR 2020)      68.1                   74.1                  2.38
ResNet56   PFF (Ours)                  77.7                   75.6                  0.12
Table 1: Comparing PFF with state-of-the-art FP-based methods on the CIFAR-10 dataset. Params and FLOPs are reported as reductions relative to the baseline; accuracy is reported as the drop relative to the baseline, so a negative value means an improvement. The baseline accuracy of ResNet56 is 93.1% [40], while VGG16's baseline accuracy is 93.25% [25].
Backbone   Method                     FLOPs reduction (%)   Top-1 drop (%)   Top-5 drop (%)
ResNet18   LCCL [4] (CVPR 2017)       35.57                 3.43             2.14
ResNet18   SFP [15] (IJCAI 2018)      42.72                 2.66             1.3
ResNet18   FPGM [16] (CVPR 2019)      42.72                 1.35             0.6
ResNet18   TAS [5] (NeurIPS 2019)     43.47                 0.61             -0.11
ResNet18   DMCP [8] (CVPR 2020)       42.81                 0.56             -
ResNet18   PFF (5e-6)                 50.48                 -0.23            -0.22
ResNet18   PFF (2e-5)                 54.58                 0.17             0.04
Table 2: Comparing PFF with state-of-the-art pruning methods on the ImageNet dataset. All the methods use ResNet18 as the backbone, and the baseline top-1 and top-5 accuracies are 69.76% and 89.08%, respectively. Accuracy is reported as the drop relative to the baseline, so a negative value means an improvement.
Figure 3: This figure shows the validation accuracy with training epochs of PFF compared to the baseline training on CIFAR-10. The left figure is based on VGG and the right figure is based on ResNet56.

4.3 Analysis for PFF

The success of PFF comes from the fact that the Filter Skeleton (FS) can find the optimal shape of each filter. Removing the unimportant stripes of each filter causes little information loss. In this section, we further analyze the working mechanism of PFF through experiments.

Does the shape of the filter matter? To verify that the shape of the filter really matters, we perform the experiment shown in Figure 4. We first fix the filters' weights and then train the network with a learnable Filter Skeleton. Surprisingly, the network still achieves 80.58% test accuracy with only 12.64% of the parameters left. This observation shows that even though the weights of the filters are randomly initialized, the network still has good representation capability if an optimal shape of the filters can be found. After learning the shape of each filter, we fix the architecture of the network and fine-tune the weights. The network ultimately achieves 91.93% accuracy on the test set.

Does the Filter Skeleton change the distribution of the weights? Figure 5 displays the weight distributions of the baseline network and the network trained with the Filter Skeleton (FS). We find that with the Filter Skeleton, the weights of the network become more stable. It is worth noting that in this experiment, we do not impose norm regularization on the Filter Skeleton; the network trained with the Filter Skeleton still exhibits good properties. Since the weights are more stable, the network is more robust to variations of the input data or features.

What do the pruned filters look like? We visualize the filters of VGG19 to show what the sparse network looks like after pruning by PFF. The kernel size of VGG19 is $3 \times 3$, so there are 9 stripes in each filter. Each filter has $2^9$ possible forms, since each stripe can be either removed or preserved. We display the filters of each layer according to the frequency of each form. Figure 6 shows the visualization results. There are some interesting phenomena:

  • In each layer, most filters are pruned entirely, i.e., all of their stripes are removed.

  • In the middle layers, most preserved filters have only one stripe. However, in the layers close to the input, most preserved filters have multiple stripes. This suggests that redundancy mostly occurs in the middle layers.

We believe this visualization may lead to a better understanding of CNNs. In the past, the filter was always regarded as the smallest unit in a CNN. However, our experiments show that the architecture of the filter itself is also important and can be learned by pruning. More visualization results can be found in the supplementary material.
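The form statistics behind such a visualization can be gathered with a short counting routine. The sketch below encodes each filter's shape as a binary tuple of length 9 and counts how often each form occurs in a layer; the threshold default of 0.05, the use of absolute values, the helper names, and the toy FS values are assumptions for illustration.

```python
from collections import Counter

def filter_form(fs_filter, delta=0.05):
    """Binary shape of one filter: a K*K tuple, 1 = stripe kept, 0 = stripe pruned."""
    return tuple(1 if abs(v) >= delta else 0 for row in fs_filter for v in row)

def form_frequencies(fs_layer, delta=0.05):
    """Count how many filters in a layer share each form."""
    return Counter(filter_form(f, delta) for f in fs_layer)

n_forms = 2 ** 9  # each of the 9 stripes is either kept or removed

# Toy FS for a layer with three 3x3 filters.
fs_layer = [
    [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0], [0.0, 0.0, 0.0]],    # fully pruned
    [[0.01, 0.0, 0.0], [0.0, 0.02, 0.0], [0.0, 0.0, 0.0]],  # fully pruned (below threshold)
    [[0.0, 0.0, 0.0], [0.0, 0.9, 0.0], [0.0, 0.0, 0.0]],    # only the centre stripe kept
]
freq = form_frequencies(fs_layer)
most_common_form, count = freq.most_common(1)[0]
print(n_forms, most_common_form, count)
```

Sorting `freq.most_common()` gives the per-layer ordering by frequency that a figure of this kind would display from top to bottom.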

Figure 4: The left figure shows training the network with fixed filter weights, where only the Filter Skeleton is optimized. The right figure shows training the network with a fixed Filter Skeleton, where the filter weights are optimized. The experiment is based on CIFAR-10 with ResNet56.
Figure 5: The left and right figures show the weight distributions of the baseline and FS on the first convolution layer, respectively. In this layer, each filter has 9 stripes. Each mini-figure shows the norm of the stripes located in the same position across all the filters. The mean and std are also reported.
Figure 6: The visualization of the VGG19 filters pruned by PFF. From top to bottom, we display the filters according to their frequency in such layer. White color denotes the corresponding strip in the filter is removed by PFF.

4.4 Continual Pruning in PFF

Since PFF achieves finer granularity than traditional filter pruning methods, we can use PFF to continue pruning a network already pruned by other methods without an obvious accuracy drop. Table 3 shows the experimental results. It can be observed that PFF helps other FP-based pruning methods reach higher pruning ratios.

Backbone   Method                    Params (M)   FLOPs (M)   Accuracy (%)
VGG16      Baseline                  14.72        627.36      93.63
VGG16      Network Slimming [30]     1.44         272.83      93.60
VGG16      Network Slimming + PFF    1.09         204.02      93.62
VGG19      Baseline                  20.04        797.61      93.92
VGG19      DCP [43]                  10.36        398.42      93.6
VGG19      DCP + PFF                 3.40         253.24      93.4
ResNet56   Baseline                  0.86         251.49      93.1
ResNet56   GBN [40]                  0.30         112.77      92.89
ResNet56   GBN + PFF                 0.24         81.26       92.67
Table 3: Results of different pruning methods on the CIFAR-10 dataset. A + B denotes first pruning the network with method A and then continuing to prune it with method B.

4.5 Ablation Study

Hyper-parameters Study

In this section, we study how different hyper-parameters affect the pruning results. We mainly study the weighting coefficient $\alpha$ in (1) and the pruning threshold $\delta$. Table 4 shows the experimental results. We find that $\alpha = $ 1e-5 and $\delta = 0.05$ give an acceptable pruning ratio and test accuracy.

α              0.8e-5   1.2e-5   1.4e-5   1e-5     1e-5    1e-5    1e-5    1e-5
δ              0.05     0.05     0.05     0.01     0.03    0.05    0.07    0.09
Params (M)     0.25     0.21     0.2      0.45     0.34    0.21    0.16    0.12
FLOPs (M)      61.16    47.71    41.23    111.68   74.83   56.10   41.59   29.72
Accuracy (%)   92.73    92.43    92.12    93.25    92.82   92.98   92.43   91.83
Table 4: How the weighting coefficient α and the pruning threshold δ affect the PFF results. The experiment is conducted on CIFAR-10. The network is ResNet56.

Filter Skeleton vs. Group Lasso

In this paper, we use the Filter Skeleton (FS) to learn the optimal shape of each filter and prune the unimportant stripes. However, there exist other techniques that regularize the network to make it sparse, for example the Lasso-based regularizer [39], which directly regularizes the network weights. We offer a comparison to the Group Lasso regularizer in this section. Figure 7 shows the results. We can see that, under the same number of parameters or FLOPs, PFF with the Filter Skeleton achieves higher performance.

Figure 7: Comparing Filter Skeleton with Lasso regularizer on CIFAR-10. The backbone is VGG16.

PFF vs. Shape-wise Pruning

In this paper, we argue that shape-wise pruning breaks the filter independence assumption and may cause the network to lose useful information. In contrast, PFF learns the optimal shape of each filter, which keeps the filters independent of each other. We provide further evidence for this in this section. Since shape-wise pruning can also be implemented via the Filter Skeleton, we perform both shape-wise pruning and PFF based on the Filter Skeleton. Figure 8 shows the results. We can see that, under the same number of parameters or FLOPs, PFF achieves higher performance than shape-wise pruning.
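The difference in granularity between the two schemes can be seen from how a mask is shared. The sketch below is an illustration under stated assumptions rather than the authors' implementation: it assumes that the PFF mask has one entry per filter and kernel position (shared across channels), and that a shape-wise mask has one entry per channel and kernel position (shared across all filters in the layer). The helper names and the tiny $2 \times 2$ kernels are hypothetical.

```python
def apply_stripe_mask(w, mask):
    """PFF-style: mask[n][i][j] is shared by all channels of filter n."""
    return [[[[w[n][c][i][j] * mask[n][i][j] for j in range(len(mask[n][i]))]
              for i in range(len(mask[n]))] for c in range(len(w[n]))]
            for n in range(len(w))]

def apply_shape_mask(w, mask):
    """Shape-wise: mask[c][i][j] is shared by all filters in the layer."""
    return [[[[w[n][c][i][j] * mask[c][i][j] for j in range(len(mask[c][i]))]
              for i in range(len(mask[c]))] for c in range(len(w[n]))]
            for n in range(len(w))]

N, C, K = 2, 2, 2
w = [[[[1.0] * K for _ in range(K)] for _ in range(C)] for _ in range(N)]  # w[N][C][K][K]

stripe_mask = [[[1, 0], [0, 0]],   # filter 0 keeps only position (0, 0)
               [[0, 0], [0, 1]]]   # filter 1 keeps only position (1, 1)
shape_mask = [[[1, 0], [0, 1]],    # channel 0: same kept positions for every filter
              [[1, 0], [0, 1]]]    # channel 1

pff = apply_stripe_mask(w, stripe_mask)
shape = apply_shape_mask(w, shape_mask)
print(pff[0][0], pff[1][0])      # the two filters end up with different shapes
print(shape[0] == shape[1])      # shape-wise forces identical sparsity patterns
```

Under these assumptions, a stripe mask lets each filter choose its own shape, whereas a shape-wise mask imposes one sparsity pattern on every filter, which is the loss of independence discussed above.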

Figure 8: Comparing PFF with shape-wise pruning on CIFAR-10. The backbone is VGG16.

5 Conclusion

In this paper, we propose a new pruning paradigm called PFF. Instead of pruning whole filters, PFF regards each filter as a combination of multiple stripes and performs pruning on the stripes. We also introduce the Filter Skeleton (FS) to efficiently learn the optimal shape of the filters for pruning. Through extensive experiments and analyses, we demonstrate the effectiveness of the PFF framework. Future work could develop a more efficient regularizer to further optimize DNNs.

6 Supplementary Material

In this section, we show what the networks pruned by PFF look like. Figure 9 shows the visualization results of ResNet56 on CIFAR-10. It can be observed that (1) PFF has a higher pruning ratio on the middle layers, e.g., layer 2.3 to layer 2.9; (2) the pruning ratio of each stripe is different and varies across layers. Table 5 shows the pruned network on ImageNet. For example, layer1.1.conv2 originally has 64 filters of size $64 \times 3 \times 3$. After pruning, 300 stripes of size $62 \times 1 \times 1$ remain. The pruning ratio in this layer is $1 - \frac{300 \times 62}{64 \times 64 \times 3 \times 3} \approx 49.5\%$.

Figure 9: The ratio of remaining stripes in each layer. Each filter has 9 stripes, indexed by their position in the $3 \times 3$ kernel.
keys modules
(conv1): Strip(3,324)
(bn1): BatchNorm(64)
(layer1.0.conv1): Strip(64,102)
(layer1.0.bn1): BatchNorm(57)
(layer1.0.conv2): Strip(57,164)
(layer1.0.bn2): BatchNorm(64)
(layer1.1.conv1): Strip(64,175)
(layer1.1.bn1): BatchNorm(62)
(layer1.1.conv2): Strip(62,300)
(layer1.1.bn2): BatchNorm(64)
(layer2.0.conv1): Strip(64,475,stride=2)
(layer2.0.bn1): BatchNorm(119)
(layer2.0.conv2): Strip(119,636)
(layer2.0.bn2): BatchNorm(128)
(layer2.1.conv1): Strip(128,662)
(layer2.1.bn1): BatchNorm(128)
(layer2.1.conv2): Strip(128,648)
(layer2.1.bn2): BatchNorm(128)
(layer3.0.conv1): Strip(128,995,stride=2)
(layer3.0.bn1): BatchNorm(252)
(layer3.0.conv2): Strip(252,1502)
(layer3.0.bn2): BatchNorm(256)
(layer3.1.conv1): Strip(256,1148)
(layer3.1.bn1): BatchNorm(256)
(layer3.1.conv2): Strip(256,944)
(layer3.1.bn2): BatchNorm(256)
(layer4.0.conv1): Strip(256,1304,stride=2)
(layer4.0.bn1): BatchNorm(498)
(layer4.0.conv2): Strip(498, 2448)
(layer4.0.bn2): BatchNorm(512)
(layer4.1.conv1): Strip(512, 3111)
(layer4.1.bn1): BatchNorm(512)
(layer4.1.conv2): Strip(512, 2927)
(layer4.1.bn2): BatchNorm(512)
(fc): Linear(512,1000)
Table 5: This table shows the structure of pruned ResNet18 on ImageNet.
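An entry of Table 5 can be turned into a per-layer pruning ratio with a few lines of arithmetic. The sketch below reads `Strip(c, s)` as "s remaining stripes, each spanning c input channels", and assumes the unpruned layer1.1.conv2 of ResNet18 has 64 filters of size 64 × 3 × 3; the function name is hypothetical and the stripe position indexes are not counted here.

```python
def stripe_layer_pruning_ratio(orig_filters, orig_channels, kernel,
                               kept_channels, kept_stripes):
    """Fraction of a layer's weights removed, given the remaining stripes."""
    original = orig_filters * orig_channels * kernel * kernel
    remaining = kept_stripes * kept_channels  # each stripe is kept_channels x 1 x 1
    return 1.0 - remaining / original

# layer1.1.conv2 in Table 5: Strip(62, 300)
ratio = stripe_layer_pruning_ratio(orig_filters=64, orig_channels=64, kernel=3,
                                   kept_channels=62, kept_stripes=300)
print(f"{ratio:.4f}")  # 0.4954
```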



  1. A. Aghasi, A. Abdi, N. Nguyen and J. Romberg (2017) Net-trim: convex pruning of deep neural networks with performance guarantee. In Advances in Neural Information Processing Systems, pp. 3177–3186. Cited by: §2.
  2. J. Deng, W. Dong, R. Socher, L. Li, K. Li and L. Fei-Fei (2009) Imagenet: a large-scale hierarchical image database. In 2009 IEEE conference on computer vision and pattern recognition, pp. 248–255. Cited by: §4.1.
  3. X. Ding, G. Ding, Y. Guo and J. Han (2019) Centripetal sgd for pruning very deep convolutional networks with complicated structure. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4943–4953. Cited by: Table 1.
  4. X. Dong, J. Huang, Y. Yang and S. Yan (2017) More is less: a more complicated network with less inference complexity. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5840–5848. Cited by: Table 2.
  5. X. Dong and Y. Yang (2019) Network pruning via transformable architecture search. In Advances in Neural Information Processing Systems, pp. 759–770. Cited by: Table 2.
  6. J. Frankle and M. Carbin (2018) The lottery ticket hypothesis: finding sparse, trainable neural networks. arXiv preprint arXiv:1803.03635. Cited by: §2.
  7. A. Graves, A. Mohamed and G. Hinton (2013) Speech recognition with deep recurrent neural networks. In 2013 IEEE international conference on acoustics, speech and signal processing, pp. 6645–6649. Cited by: §1.
  8. S. Guo, Y. Wang, Q. Li and J. Yan (2020) DMCP: differentiable markov channel pruning for neural networks. arXiv preprint arXiv:2005.03354. Cited by: Table 2.
  9. Y. Guo, A. Yao and Y. Chen (2016) Dynamic network surgery for efficient dnns. In Advances in neural information processing systems, pp. 1379–1387. Cited by: §2.
  10. S. Han, X. Liu, H. Mao, J. Pu, A. Pedram, M. A. Horowitz and W. J. Dally (2016) EIE: efficient inference engine on compressed deep neural network. ACM SIGARCH Computer Architecture News 44 (3), pp. 243–254. Cited by: §2.
  11. S. Han, H. Mao and W. J. Dally (2015) Deep compression: compressing deep neural networks with pruning, trained quantization and huffman coding. arXiv preprint arXiv:1510.00149. Cited by: §1, §2.
  12. S. Han, J. Pool, J. Tran and W. Dally (2015) Learning both weights and connections for efficient neural network. In Advances in neural information processing systems, pp. 1135–1143. Cited by: §1, §2, §2.
  13. B. Hassibi and D. G. Stork (1993) Second order derivatives for network pruning: optimal brain surgeon. In Advances in neural information processing systems, pp. 164–171. Cited by: §2.
  14. K. He, X. Zhang, S. Ren and J. Sun (2016) Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770–778. Cited by: §4.1.
  15. Y. He, G. Kang, X. Dong, Y. Fu and Y. Yang (2018) Soft filter pruning for accelerating deep convolutional neural networks. arXiv preprint arXiv:1808.06866. Cited by: Table 1, Table 2.
  16. Y. He, P. Liu, Z. Wang, Z. Hu and Y. Yang (2019) Filter pruning via geometric median for deep convolutional neural networks acceleration. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4340–4349. Cited by: Table 2.
  17. Y. He, J. Lin, Z. Liu, H. Wang, L. Li and S. Han (2018) Amc: automl for model compression and acceleration on mobile devices. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 784–800. Cited by: §2.
  18. Y. He, X. Zhang and J. Sun (2017) Channel pruning for accelerating very deep neural networks. In Proceedings of the IEEE International Conference on Computer Vision, pp. 1389–1397. Cited by: §1, §2, 1st item, Table 1.
  19. Z. Huang and N. Wang (2018) Data-driven sparse structure selection for deep neural networks. In Proceedings of the European conference on computer vision (ECCV), pp. 304–320. Cited by: Table 1.
  20. A. Krizhevsky and G. Hinton (2009) Learning multiple layers of features from tiny images. Cited by: §4.1.
  21. A. Krizhevsky, I. Sutskever and G. E. Hinton (2012) Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems, pp. 1097–1105. Cited by: §1.
  22. V. Lebedev and V. Lempitsky (2016) Fast convnets using group-wise brain damage. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2554–2564. Cited by: §1.
  23. Y. LeCun, J. S. Denker and S. A. Solla (1990) Optimal brain damage. In Advances in neural information processing systems, pp. 598–605. Cited by: §2.
  24. C. Lemaire, A. Achkar and P. Jodoin (2019) Structured pruning of neural networks with budget-aware regularization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 9108–9116. Cited by: §2.
  25. H. Li, A. Kadav, I. Durdanovic, H. Samet and H. P. Graf (2016) Pruning filters for efficient convnets. arXiv preprint arXiv:1608.08710. Cited by: §1, §2, 1st item, Table 1.
  26. Y. Li, S. Gu, C. Mayer, L. Van Gool and R. Timofte (2020) Group sparsity: the hinge between filter pruning and decomposition for network compression. arXiv preprint arXiv:2003.08935. Cited by: Table 1.
  27. M. Lin, R. Ji, Y. Wang, Y. Zhang, B. Zhang, Y. Tian and L. Shao (2020) HRank: filter pruning using high-rank feature map. arXiv preprint arXiv:2002.10179. Cited by: Table 1.
  28. S. Lin, R. Ji, C. Yan, B. Zhang, L. Cao, Q. Ye, F. Huang and D. Doermann (2019) Towards optimal structured CNN pruning via generative adversarial learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2790–2799. Cited by: §2, Table 1.
  29. Z. Liu, J. Xu, X. Peng and R. Xiong (2018) Frequency-domain dynamic pruning for convolutional neural networks. In Advances in Neural Information Processing Systems, pp. 1043–1053. Cited by: §2.
  30. Z. Liu, J. Li, Z. Shen, G. Huang, S. Yan and C. Zhang (2017) Learning efficient convolutional networks through network slimming. In Proceedings of the IEEE International Conference on Computer Vision, pp. 2736–2744. Cited by: §1, §2, 1st item, §4.1, Table 3.
  31. Z. Liu, M. Sun, T. Zhou, G. Huang and T. Darrell (2018) Rethinking the value of network pruning. arXiv preprint arXiv:1810.05270. Cited by: §1, §2.
  32. W. Lotter, G. Kreiman and D. Cox (2016) Deep predictive coding networks for video prediction and unsupervised learning. arXiv preprint arXiv:1605.08104. Cited by: §1.
  33. J. Luo, J. Wu and W. Lin (2017) ThiNet: a filter level pruning method for deep neural network compression. In Proceedings of the IEEE International Conference on Computer Vision, pp. 5058–5066. Cited by: §2, Table 1.
  34. H. Mao, S. Han, J. Pool, W. Li, X. Liu, Y. Wang and W. J. Dally (2017) Exploring the granularity of sparsity in convolutional neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pp. 13–20. Cited by: §2.
  35. K. Simonyan and A. Zisserman (2014) Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556. Cited by: §4.1.
  36. C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke and A. Rabinovich (2015) Going deeper with convolutions. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Cited by: §1.
  37. C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens and Z. Wojna (2016) Rethinking the inception architecture for computer vision. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2818–2826. Cited by: §1.
  38. H. Wang, Q. Zhang, Y. Wang, L. Yu and H. Hu (2019) Structured pruning for efficient convnets via incremental regularization. In 2019 International Joint Conference on Neural Networks (IJCNN), pp. 1–8. Cited by: §2, §4.2, Table 1.
  39. W. Wen, C. Wu, Y. Wang, Y. Chen and H. Li (2016) Learning structured sparsity in deep neural networks. In Advances in Neural Information Processing Systems, pp. 2074–2082. Cited by: §1, §2, §4.5.2.
  40. Z. You, K. Yan, J. Ye, M. Ma and P. Wang (2019) Gate decorator: global filter pruning method for accelerating deep convolutional neural networks. In Advances in Neural Information Processing Systems, pp. 2130–2141. Cited by: §2, §4.2, Table 1, Table 3.
  41. R. Yu, A. Li, C. Chen, J. Lai, V. I. Morariu, X. Han, M. Gao, C. Lin and L. S. Davis (2018) NISP: pruning networks using neuron importance score propagation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 9194–9203. Cited by: §2, Table 1.
  42. X. Zhang and Y. LeCun (2015) Text understanding from scratch. arXiv preprint arXiv:1502.01710. Cited by: §1.
  43. Z. Zhuang, M. Tan, B. Zhuang, J. Liu, Y. Guo, Q. Wu, J. Huang and J. Zhu (2018) Discrimination-aware channel pruning for deep neural networks. In Advances in Neural Information Processing Systems, pp. 875–886. Cited by: Table 1, Table 3.