Model-Agnostic Structured Sparsification with Learnable Channel Shuffle



Recent advances in convolutional neural networks (CNNs) usually come at the expense of considerable computational overhead and memory footprint. Network compression aims to alleviate this issue by training compact models with comparable performance. However, existing compression techniques either entail dedicated expert design or incur a moderate performance drop. To this end, we propose a model-agnostic structured sparsification method for efficient network compression. The proposed method automatically induces structurally sparse representations of the convolutional weights, thereby facilitating the implementation of the compressed model with the highly optimized group convolution.

We further address the problem of inter-group communication with a learnable channel shuffle mechanism. The proposed approach is model-agnostic and highly compressible with a negligible performance drop. Extensive experimental results and analysis demonstrate that our approach performs favorably against the state-of-the-art network pruning methods. The code will be publicly available after the review process.


1 Introduction



Figure 1: Trade-off between accuracy and complexity on the ImageNet Russakovsky et al. (2015) dataset. Our method (MASS) is highlighted with the solid lines (upper left is better).

Convolutional Neural Networks (CNNs) have made significant advances in a wide range of vision and learning tasks Krizhevsky et al. (2012); Gehring et al. (2017); Long et al. (2015); Girshick (2015). However, the performance gains usually entail heavy computational costs, which make the deployment of CNNs on portable devices difficult. To meet the memory and computational constraints in real-world applications, numerous model compression techniques have been developed.

Existing network compression techniques are mainly based on weight quantization Chen et al. (2015); Courbariaux et al. (2016); Rastegari et al. (2016); Wu et al. (2016), knowledge distillation Hinton et al. (2014); Chen et al. (2017); Yim et al. (2017), and network pruning Li et al. (2017); He et al. (2017); Liu et al. (2017); Molchanov et al. (2019). Weight quantization methods use low bit-width numbers to represent weights and activations, which usually brings a moderate performance degradation. Knowledge distillation schemes transfer knowledge from a large teacher network to a compact student network, but are typically sensitive to the teacher/student network architecture Mirzadeh et al. (2020); Liu et al. (2019b). Closely related to our work, network pruning approaches reduce the model size by removing a proportion of model parameters that are considered unimportant. Notably, filter pruning algorithms Liu et al. (2017); He et al. (2017); Li et al. (2017) remove entire filters and result in structured architectures that can be readily accelerated by modern BLAS libraries.

Identifying unimportant filters is critical to pruning methods. It is well known that the weight norm can serve as a good indicator of the corresponding filter importance Li et al. (2017); Liu et al. (2017). Filters corresponding to smaller weight norms are considered to contribute less to the outputs. Furthermore, L1 regularization can be used to increase sparsity Liu et al. (2017). Despite the advances, several issues in the existing pruning methods can be improved: 1) pruning a large proportion of convolutional filters will result in severe performance degradation; 2) pruning alters the input/output feature dimensions, and thus meticulous adaptation is required to handle special network architectures (e.g., residual connections He et al. (2016) and dense connections Huang et al. (2017)).

Before presenting the proposed method, we briefly introduce the group convolution (GroupConv) Krizhevsky et al. (2012), which plays an important role in this work. For the typical convolution operation, the output features are densely connected with the entire input features, while for the GroupConv, the input features are equally split into several groups and transformed within each group independently. Essentially, each output channel is connected with only a proportion of the input channels, which leads to sparse neuron connections. Therefore, deep CNNs with GroupConvs can be trained on less powerful GPUs with a smaller amount of memory. In this work, we propose a novel approach for network compression where unimportant neuron connections are pruned to facilitate the usage of GroupConvs. Nevertheless, converting vanilla convolutions into GroupConvs is a challenging task. First, not every sparse neuron connectivity pattern corresponds to a valid GroupConv; certain requirements must be satisfied, e.g., mutual exclusiveness of different groups. To guarantee the desired structured sparsity, we impose structured regularization upon the convolutional weights and zero out the sparsified weights. Another challenge is that stacking multiple GroupConvs sequentially will hinder the inter-group information flow. The ShuffleNet Zhang et al. (2018) method proposes a channel shuffle mechanism, i.e., gathering channels from distinct groups, to ensure the inter-group communication, though the order of permutation is hand-crafted. In contrast, we solve the problem of channel shuffle in a learning-based scheme. Concretely, we formulate the learning of channel shuffle as a linear programming problem, which can be solved by efficient algorithms like the network simplex method Bonneel et al. (2011). Since the structured sparsity is induced among the convolutional weights, our method is named Model-Agnostic Structured Sparsification, abbreviated as MASS.

The proposed structured sparsification method is designed for three goals. First, our approach is model-agnostic. A wide range of backbone architectures are amenable to our method without the need for any special adaptation. Second, our method is capable of achieving high compression rates. In modern efficient network architectures, the complexity of 3×3 convolutions is highly compressed, while the computation bottleneck becomes the point-wise convolutions (i.e., 1×1 convolutions) Zhang et al. (2018). For example, the point-wise convolutions occupy 81.5% of the total FLOPs in the MobileNet-V2 Sandler et al. (2018) backbone and 93.4% in ResNeXt Xie et al. (2017). Our method is applicable to all convolution operators so that a high compression rate is reachable. Third, our approach brings a negligible performance drop. As all of the filters are preserved under our methodology, we retain stronger representational capacity of the compressed model and achieve a better accuracy-complexity trade-off than the pruning-based counterparts (see Fig. 1).

The main contributions of this work are three-fold:

  • We propose a learnable channel shuffle mechanism (Sec. 3.2) in which the permutation of the convolutional weight norm matrix is learned via linear programming.

  • Upon the permuted weight norm matrix, we impose structured regularization (Sec. 3.3) to obtain valid GroupConvs by zeroing out the sparsified weights.

  • With the structurally sparse convolutional weights, we design the criteria of learning cardinality (Sec. 3.4) in which unimportant neuron connections are pruned with minimal impact on the entire network.

Incorporating the learnable channel shuffle mechanism, the structured regularization and the grouping criteria, the proposed structured sparsification method performs favorably against the state-of-the-art network pruning techniques on both CIFAR Krizhevsky and Hinton (2009) and ImageNet Russakovsky et al. (2015) datasets.

2 Related Work

Network Compression.

Compression methods for deep models can be broadly categorized based on weight quantization, knowledge distillation, and network pruning. Closely related to our work are the network pruning approaches based on filter pruning. It is widely acknowledged that filters with a smaller weight norm make a negligible contribution to the outputs and can be pruned. Li et al. (2017) prune filters with smaller norm, while Liu et al. (2017) remove those corresponding to smaller batch-norm scaling factors, on which an L1 regularization term is imposed to increase sparsity.

However, techniques that remove entire filters based on the weight norm may significantly reduce the representational capacity. Instead of removing entire filters, the proposed structured sparsification method enforces structured sparsity among neuron connections and merely removes certain unimportant connections while all filters are preserved. As such, the network capacity is less affected than in pruning-based approaches Li et al. (2017); Liu et al. (2017); He et al. (2019); Molchanov et al. (2019). Furthermore, our method does not alter the input/output dimensions, and can be easily incorporated into numerous backbone models.

Group Convolution.

Group convolution (GroupConv) is introduced in the AlexNet Krizhevsky et al. (2012) to overcome the GPU memory constraints. GroupConv partitions the input features into mutually exclusive groups and transforms the features within each group in parallel. Compared with the vanilla (i.e., densely connected) convolution, a GroupConv with $G$ groups can reduce the computational cost and number of parameters by a factor of $G$. The ResNeXt Xie et al. (2017) designs a multi-branch architecture by employing GroupConvs and defines cardinality as the number of parallel transformations, which is simply the group number in each GroupConv. If the cardinality equals the number of channels, GroupConv becomes the depthwise convolution, the building block of the depthwise separable convolution that is widely used in recent lightweight neural architectures Howard et al. (2017); Sandler et al. (2018); Zhang et al. (2018); Ma et al. (2018); Chollet (2017).
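The saving obtained from grouping can be checked with a few lines of arithmetic. The sketch below is illustrative only: the helper `conv_cost` and the layer sizes are hypothetical and not taken from the paper, and the count covers the multiply-accumulates of a bias-free convolution.

```python
def conv_cost(c_in, c_out, kernel, out_h, out_w, groups=1):
    """Return (parameters, multiply-accumulates) of a bias-free 2D convolution."""
    if c_in % groups or c_out % groups:
        raise ValueError("channel counts must be divisible by the number of groups")
    # Each output channel only sees c_in / groups input channels.
    params = c_out * (c_in // groups) * kernel * kernel
    # Every output position reuses the same weights once.
    macs = params * out_h * out_w
    return params, macs


# Hypothetical layer: 64 -> 128 channels, 3x3 kernel, 32x32 output map.
dense = conv_cost(64, 128, 3, 32, 32)
grouped = conv_cost(64, 128, 3, 32, 32, groups=4)
ratio = dense[0] / grouped[0]
```

With four groups, both the parameter count and the multiply-accumulate count shrink by a factor of four, matching the group number.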

However, the aforementioned methods all treat the cardinality as a hyper-parameter, and the connectivity patterns between consecutive features are hand-crafted as well. On the other hand, there is also a line of research focusing on learnable GroupConvs Huang et al. (2018); Wang et al. (2019); Zhang et al. (2019). Both CondenseNet Huang et al. (2018) and FLGC Wang et al. (2019) pre-define the cardinality of each GroupConv and learn the connectivity patterns. We note that the work by Zhang et al. (2019) learns both the cardinality and neuron connectivity simultaneously. Essentially, this dynamic grouping convolution is modeled by a binary relationship matrix whose entries indicate whether a given input channel is connected to a given output channel. To guarantee that the resulting operator is a valid GroupConv, the relationship matrix is constructed using a Kronecker product of several binary symmetric matrices. Nevertheless, the Kronecker product gives a sufficient but not necessary condition, and the space of all valid relationship matrices is not fully exploited.

Our method decouples the learning of cardinality and connectivity. Motivated by the norm-based criterion in the network pruning methods Li et al. (2017); Liu et al. (2017), we quantify the importance of each neuron connection by the corresponding weight norm and learn the connectivity pattern by permuting the weight norm matrix. Besides, the structured regularization is imposed on the permuted weight norm matrix and the cardinality is learned accordingly. The essential difference between our approach and prior art Zhang et al. (2019) is that all possible neuron connectivity patterns, i.e., relationship matrices, can be reached by our method.

Channel Shuffle Mechanism.

The ShuffleNet Zhang et al. (2018) combines the channel shuffle mechanism with GroupConv for efficient network design, in which channels from different groups are gathered so as to facilitate inter-group communication. Without channel shuffle, stacking multiple GroupConvs will eliminate the information flow among different groups and weaken the representational capacity. Different from the hand-crafted counterpart Zhang et al. (2018), the proposed channel shuffle operation is learnable over the space of all possible channel permutations. Furthermore, without bells and whistles, our channel shuffle only involves a simple permutation along the channel dimension, which can be conveniently implemented by an index operation.
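To make the "index operation" concrete, the following sketch applies a channel permutation by plain indexing and undoes it with the inverse permutation. It uses Python lists in place of feature tensors, and the function names and the example permutation are illustrative assumptions rather than the authors' code.

```python
def shuffle_channels(features, perm):
    """Reorder channels so that output channel k is input channel perm[k]."""
    if sorted(perm) != list(range(len(features))):
        raise ValueError("perm must be a permutation of the channel indices")
    return [features[src] for src in perm]


def inverse_permutation(perm):
    """Return the permutation that undoes `perm`."""
    inverse = [0] * len(perm)
    for dst, src in enumerate(perm):
        inverse[src] = dst
    return inverse


channels = ["c0", "c1", "c2", "c3"]  # stand-ins for per-channel feature maps
perm = [2, 0, 3, 1]                  # hypothetical learned permutation
shuffled = shuffle_channels(channels, perm)
restored = shuffle_channels(shuffled, inverse_permutation(perm))
```

Because the shuffle is a pure gather, it adds no parameters and no arithmetic to the compressed model.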

Neural Architecture Search.

Neural Architecture Search (NAS) Zoph and Le (2017); Baker et al. (2017); Zoph et al. (2018); Real et al. (2019); Wu et al. (2019) aims to automate the process of learning neural architectures within certain budgets of computational resources. Existing NAS algorithms are developed based on reinforcement learning Zoph and Le (2017); Baker et al. (2017); Zoph et al. (2018), evolutionary search Real et al. (2017, 2019), and differentiable approaches Liu et al. (2019a); Wu et al. (2019). Our method can be viewed as a special case of hyper-parameter (i.e., cardinality) optimization and neuron connectivity search. However, different from existing approaches that evaluate on numerous architectures, the proposed method can determine the compressed architecture in one single training pass and is more scalable than most NAS methods.

3 Model-Agnostic Structured Sparsification

3.1 Overview

The structured sparsification method is designed to zero out a proportion of the convolutional weights so that the vanilla convolutions can be converted into group convolutions (GroupConvs), and meanwhile the optimal neuron connectivity can be learned. We adopt the “train, compress, finetune” pipeline, in a way similar to the recent pruning approaches Liu et al. (2017). Concretely, we first train a large model with structured regularization, then convert vanilla convolutions into GroupConvs under certain criteria, and finally finetune the compressed model to recover accuracy. The connectivity patterns are learned in the process, as the structured regularization heavily depends on them. As such, three issues need to be addressed: 1) how to learn the connectivity patterns (Sec. 3.2); 2) how to design the structured regularization (Sec. 3.3); and 3) how to decide the grouping criteria (Sec. 3.4). Additional details of our pipeline are presented in Sec. 3.5.


[Figure 2 panels: (a) channel connectivity; (b) weight norm matrix. Stage labels: channel shuffle, structured regularization.]

Figure 2: Illustration of the learnable channel shuffle mechanism. The original convolutional weights (first column) are shuffled along the input/output channel dimensions in order to solve Eq. 4. The structured regularization is imposed upon the permuted weight norm matrix (second column) to increase structured sparsity, and connections with small weight norms are discarded (third column). In the original ordering of channels, a structurally sparse connectivity pattern is learned (fourth column), and notably every valid connectivity pattern can be reached in this manner.

3.2 Learning Connectivity with Linear Programming

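Let $\mathbf{F} \in \mathbb{R}^{C^{in} \times H \times W}$ be the input feature map, where $C^{in}$ denotes the number of input channels. We apply a vanilla convolution with weights $\mathbf{W} \in \mathbb{R}^{C^{out} \times C^{in} \times K \times K}$ to $\mathbf{F}$, i.e., $\mathbf{F}' = \mathbf{W} * \mathbf{F}$, where $\mathbf{F}' \in \mathbb{R}^{C^{out} \times H' \times W'}$ with $C^{out}$ denoting the number of output channels. Each entry of $\mathbf{F}'$ is a weighted sum of a local patch of $\mathbf{F}$, namely,

$$F'_{j,y,x} = \sum_{i=1}^{C^{in}} \sum_{u=1}^{K} \sum_{v=1}^{K} W_{j,i,u,v} \, F_{i,\,y+u,\,x+v}. \tag{1}$$

In Eq. 1, the $i$-th channel of $\mathbf{F}$ relates to the $j$-th channel of $\mathbf{F}'$ via weights $\mathbf{W}_{j,i,:,:}$. Motivated by the norm-based importance estimation in filter pruning Li et al. (2017); Liu et al. (2017), we quantify the importance of the connection between the $i$-th channel of $\mathbf{F}$ and the $j$-th channel of $\mathbf{F}'$ by $\|\mathbf{W}_{j,i,:,:}\|$. Thus, the importance matrix $\mathbf{S} \in \mathbb{R}^{C^{out} \times C^{in}}$ can be defined as the norm along the “kernel size” dimensions of $\mathbf{W}$, i.e., $S_{j,i} = \|\mathbf{W}_{j,i,:,:}\|$.

Next, we extend our formulation to GroupConvs with cardinality $G$. A GroupConv can be considered as a convolution with sparse neuron connectivity, in which only a proportion of input channels is visible to each output channel. Without loss of generality, we assume both $C^{in}$ and $C^{out}$ are divisible by $G$, and Eq. 1 can be adapted as

$$F'_{j,y,x} = \sum_{i=(g_j-1)n+1}^{g_j n} \sum_{u=1}^{K} \sum_{v=1}^{K} W_{j,i,u,v} \, F_{i,\,y+u,\,x+v}, \tag{2}$$

where $g_j$ indicates that the $j$-th output channel belongs to the $g_j$-th group, and $n = C^{in}/G$ denotes the number of input channels within each group. Clearly, the valid entries of $\mathbf{W}$ form a block diagonal matrix with $G$ equally-split blocks at the input/output channel dimensions. Thus, the GroupConv module requires $C^{out} C^{in} K^2 / G$ parameters and $C^{out} C^{in} K^2 H' W' / G$ FLOPs for processing the feature $\mathbf{F}$, and the computational complexity is reduced by a factor of $G$ compared with the vanilla counterpart.

We note that if a vanilla convolution operator can be converted into GroupConv without affecting its functional property (we call such convolution operators groupable), the convolutional weights must be block diagonal after certain permutations along the input/output channel dimensions. Due to the positive definiteness of the norm and the fact that permuting $\mathbf{W}$ corresponds to permuting $\mathbf{S}$, a necessary and sufficient condition of a convolution operator being groupable is that

$$\exists \, \mathbf{P} \in \mathcal{P}^{C^{out}}, \ \mathbf{Q} \in \mathcal{P}^{C^{in}} \ \ \text{s.t.} \ \ \mathbf{P}\mathbf{S}\mathbf{Q} \ \text{is block diagonal with equally-split blocks}, \tag{3}$$

where $\mathcal{P}^{N}$ denotes the set of $N \times N$ permutation matrices. Here, the permutation matrices $\mathbf{P}$ and $\mathbf{Q}$ shuffle the channels of the output and input features, respectively, and thus determine the connectivity pattern between $\mathbf{F}$ and $\mathbf{F}'$ (see Fig. 2).

However, a randomly initialized and trained convolution operator is by no means groupable unless special sparsity constraints are imposed. To this end, we resort to permuting $\mathbf{S}$ so as to make $\mathbf{P}\mathbf{S}\mathbf{Q}$ “as block diagonal as possible”. The next question is how to rigorously define the term “as block diagonal as possible”. Here, we assume both $C^{in}$ and $C^{out}$ are powers of 2, an assumption satisfied by the most widely used backbone architectures (e.g., VGG Simonyan and Zisserman (2015) and ResNet He et al. (2016)). Then, the potential cardinality $G$ is also a power of 2. As the cardinality grows, more and more non-diagonal blocks are zeroed out (see Fig. 3(c)). As illustrated in Fig. 3(b), we define the cost matrix $\mathbf{R}$ to progressively penalize the non-zero entries of the non-diagonal blocks. Finally, we utilize $\langle \mathbf{P}\mathbf{S}\mathbf{Q}, \mathbf{R} \rangle$ as a metric of the “block-diagonality” of the matrix $\mathbf{P}\mathbf{S}\mathbf{Q}$, where $\langle \cdot, \cdot \rangle$ indicates element-wise multiplication and summation over all entries, i.e., $\langle \mathbf{A}, \mathbf{B} \rangle = \sum_{j,i} A_{j,i} B_{j,i}$. Formally, the optimization problem is formulated as follows:

$$\min_{\mathbf{P} \in \mathcal{P}^{C^{out}}, \ \mathbf{Q} \in \mathcal{P}^{C^{in}}} \ \langle \mathbf{P}\mathbf{S}\mathbf{Q}, \mathbf{R} \rangle. \tag{4}$$
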
Solving Eq. 4 gives the optimal connectivity pattern between the adjacent layers.
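The quantities entering Eq. 4 can be illustrated on a toy layer. The sketch below rests on several assumptions that the text does not fix: the norm is taken to be the L1 norm, group level g is taken to mean 2**g groups, and the cost matrix adds `decay**(g-1)` to every entry that falls outside the diagonal blocks at level g. All function names are hypothetical.

```python
def importance_matrix(weights, p=1):
    """weights[j][i] is the K x K kernel from input channel i to output channel j.
    Returns the matrix of per-kernel p-norms (p=1 is an assumption)."""
    return [[sum(abs(w) ** p for row in kernel for w in row) ** (1.0 / p)
             for kernel in filters] for filters in weights]


def cost_matrix(c_out, c_in, max_level, decay=0.5):
    """Hypothetical cost: entry (j, i) gains decay**(g-1) for every level g
    (2**g groups) at which output j and input i fall into different groups."""
    cost = [[0.0] * c_in for _ in range(c_out)]
    for level in range(1, max_level + 1):
        groups = 2 ** level
        for j in range(c_out):
            for i in range(c_in):
                if j * groups // c_out != i * groups // c_in:
                    cost[j][i] += decay ** (level - 1)
    return cost


def block_diagonality(norms, cost):
    """Element-wise product summed over all entries; 0 means no penalized connection."""
    return sum(s * r for s_row, r_row in zip(norms, cost) for s, r in zip(s_row, r_row))


# Two output and two input channels with 2x2 kernels; only the diagonal kernels are non-zero.
weights = [
    [[[1, -1], [0, 2]], [[0, 0], [0, 0]]],
    [[[0, 0], [0, 0]], [[3, 0], [0, -4]]],
]
norms = importance_matrix(weights)
cost = cost_matrix(2, 2, max_level=1)
score_grouped = block_diagonality(norms, cost)
# Swapping the input channels moves all weight norm into the penalized blocks.
score_swapped = block_diagonality([row[::-1] for row in norms], cost)
```

A score of zero means the layer is already groupable in the given channel order, while a positive score measures how much weight norm would be discarded by grouping.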


[Figure 3 panels, shown for several group levels: (a) permuted weight norm matrix; (b) structured regularization; (c) relationship matrix.]

Figure 3: Illustration of the structured regularization matrix $\mathbf{R}_g$ and the relationship matrix $\mathbf{U}_g$ corresponding to the group level $g$. (a) Heat map of the permuted weight norm matrix $\mathbf{P}\mathbf{S}\mathbf{Q}$. Non-diagonal blocks of the weight norm are sparsified. (b) Structured regularization matrix $\mathbf{R}_g$. The regularization coefficient decays exponentially as the group level grows. A special case with a decay rate of 0.5 is demonstrated. Besides, the matrix $\mathbf{R}_g$ depends on the current group level $g$, and when the maximal possible group level is achieved, the matrix becomes the cost matrix in Eq. 4. (c) Relationship matrix $\mathbf{U}_g$. The entries of the permuted weight norm matrix corresponding to the zero entries of the relationship matrix will be zeroed out during grouping.

However, minimization over the set of permutation matrices is a non-convex and NP-hard problem that requires combinatorial search. To this end, we relax the feasible space to its convex hull. The Birkhoff-von Neumann theorem Birkhoff (1946) states that the convex hull of the set of permutation matrices is the set of doubly-stochastic matrices, known as the Birkhoff polytope:

$$\mathcal{B}^{N} = \left\{ \mathbf{X} \in \mathbb{R}_{+}^{N \times N} : \ \mathbf{X}\mathbf{1}_N = \mathbf{1}_N, \ \mathbf{X}^{\top}\mathbf{1}_N = \mathbf{1}_N \right\}, \tag{5}$$

where $\mathbf{1}_N$ denotes the $N$-dimensional column vector composed of ones.

We solve Eq. 4 with coordinate descent. That is, we iteratively update the permutation matrices $\mathbf{P}$ and $\mathbf{Q}$ until convergence. When updating one variable, we consider the other as fixed. For example, when optimizing $\mathbf{P}$, the objective function, with $\mathbf{S}$ the weight norm matrix and $\mathbf{R}$ the cost matrix, can be transformed as follows:

$$\min_{\mathbf{P} \in \mathcal{B}^{C^{out}}} \ \langle \mathbf{P}\mathbf{S}\mathbf{Q}, \mathbf{R} \rangle = \min_{\mathbf{P} \in \mathcal{B}^{C^{out}}} \ \langle \mathbf{P}, \mathbf{R}\mathbf{Q}^{\top}\mathbf{S}^{\top} \rangle. \tag{6}$$

As the objective is a linear function of $\mathbf{P}$ and the Birkhoff polytope is a convex polytope defined by linear constraints, Eq. 6 is a linear programming problem, which can be solved by efficient algorithms such as the network simplex method Bonneel et al. (2011). In addition, the theory of linear programming guarantees that at least one of the solutions is attained at a vertex of the polytope, and the vertices of the Birkhoff polytope are precisely the permutation matrices Birkhoff (1946). Thus, in Eq. 6, minimization over the Birkhoff polytope is equivalent to minimization over the set of permutation matrices, and the solution is naturally a permutation matrix without the need for post-processing. Furthermore, Eq. 6 has the same formulation as the optimal transport problem, and mature computation libraries are available for efficient linear programming.
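The alternating scheme can be sketched on a toy 4×4 example. Each update below is a linear assignment problem; for brevity it is solved by exhaustive search, which stands in for the network simplex solver and is only viable for tiny sizes. Permutations are stored as index lists, and all names and the example matrices are hypothetical.

```python
from itertools import permutations


def transpose(matrix):
    return [list(col) for col in zip(*matrix)]


def solve_assignment(cost):
    """Exhaustive stand-in for an LP / network simplex solver (tiny n only)."""
    n = len(cost)
    return list(min(permutations(range(n)),
                    key=lambda p: sum(cost[a][p[a]] for a in range(n))))


def row_subproblem(S, R, col_perm):
    """Best row order for a fixed column order.
    cost[a][r] = sum_b R[a][b] * S[r][col_perm[b]] is linear in the assignment."""
    cost = [[sum(R[a][b] * S[r][col_perm[b]] for b in range(len(col_perm)))
             for r in range(len(S))] for a in range(len(R))]
    return solve_assignment(cost)


def objective(S, R, row_perm, col_perm):
    """Cost-weighted sum of the permuted weight norms."""
    return sum(R[a][b] * S[row_perm[a]][col_perm[b]]
               for a in range(len(row_perm)) for b in range(len(col_perm)))


def coordinate_descent(S, R, max_iters=10):
    row_perm, col_perm = list(range(len(S))), list(range(len(S[0])))
    best = objective(S, R, row_perm, col_perm)
    for _ in range(max_iters):
        row_perm = row_subproblem(S, R, col_perm)
        # The column update is the row update applied to the transposed problem.
        col_perm = row_subproblem(transpose(S), transpose(R), row_perm)
        current = objective(S, R, row_perm, col_perm)
        if current >= best:  # no further improvement
            break
        best = current
    return row_perm, col_perm, best


# Weight norms whose rows are out of order; R penalizes the two off-diagonal 2x2 blocks.
S = [[0, 0, 5, 1], [4, 2, 0, 0], [0, 0, 2, 3], [1, 6, 0, 0]]
R = [[0, 0, 1, 1], [0, 0, 1, 1], [1, 1, 0, 0], [1, 1, 0, 0]]
start = objective(S, R, [0, 1, 2, 3], [0, 1, 2, 3])
row_perm, col_perm, best = coordinate_descent(S, R)
```

In this example a single row update already reaches a zero objective. In general, coordinate descent only guarantees a non-increasing objective and may stop at a local optimum.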

3.3 Structured Regularization

Permutation alone does not suffice to induce structurally sparse convolutional weights, and we still need to impose special sparsity regularization to achieve the desired sparsity structure. Inspired by the sparsity-inducing penalty in Liu et al. (2017), we impose the structured regularization on the permuted weight norm matrix $\mathbf{P}\mathbf{S}\mathbf{Q}$. We first define the group level $g$ as illustrated in Fig. 3, which indexes the current cardinality achieved (a power of 2) and is determined in Sec. 3.4. Then, given the current group level $g$, the structured regularization is formulated as $\mathcal{L}_{reg} = \langle \mathbf{P}\mathbf{S}\mathbf{Q}, \mathbf{R}_g \rangle$, where $\mathbf{R}_g$ denotes the structured regularization matrix as illustrated in Fig. 3(b). Intuitively, at group level $g$, additional regularization is imposed upon the non-diagonal blocks that will be zeroed out if the next group level is achieved. Furthermore, the regularization coefficient decays exponentially as the group level grows, as we desire a balanced cardinality distribution across the network. In the end, the overall loss becomes

$$\mathcal{L} = \mathcal{L}_{data} + \lambda \, \mathcal{L}_{reg}, \tag{7}$$

where $\mathcal{L}_{data}$ denotes the regular data loss (standard classification loss in the following experiments) and $\lambda$ is the balancing scalar.

3.4 Criteria of Learning Cardinality

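With the structurally sparsified convolutional weights, the next step is to determine the cardinality. The core idea of our criteria is that the weight norms corresponding to the valid connections constitute at least a certain proportion of the total weight norms. At group level $g$, the following requirement should be satisfied:

$$\frac{\sum_{j,i} \left( \mathbf{P}\mathbf{S}\mathbf{Q} \odot \mathbf{U}_g \right)_{j,i}}{\sum_{j,i} \left( \mathbf{P}\mathbf{S}\mathbf{Q} \right)_{j,i}} \ \geq \ p, \tag{8}$$

where $\odot$ denotes element-wise multiplication, $p$ is a threshold set to 0.9 in all of our experiments, and $\mathbf{U}_g$ is the relationship matrix Zhang et al. (2019) as illustrated in Fig. 3(c). The matrix $\mathbf{U}_g$ specifies the valid neuron connections at group level $g$. Therefore, the current group level can be determined by

$$g = \max \left\{ g' : \ \text{Eq. 8 holds at group level } g' \right\}. \tag{9}$$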
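The selection rule can be sketched as follows, under the assumption that candidate cardinalities are the powers of 2 and that the weight norm matrix passed in is already permuted. The function names and the example matrix are hypothetical; the threshold of 0.9 follows the text.

```python
def relationship_matrix(c_out, c_in, groups):
    """Binary block-diagonal mask: 1 where output j and input i share a group."""
    return [[1 if j * groups // c_out == i * groups // c_in else 0
             for i in range(c_in)] for j in range(c_out)]


def retained_ratio(norms, groups):
    """Share of the total weight norm carried by the connections kept at `groups` groups."""
    mask = relationship_matrix(len(norms), len(norms[0]), groups)
    kept = sum(s * u for s_row, u_row in zip(norms, mask) for s, u in zip(s_row, u_row))
    return kept / sum(map(sum, norms))


def largest_cardinality(norms, threshold=0.9):
    """Largest power-of-2 group count whose retained ratio still meets the threshold.
    The ratio never increases with the group count because the blocks are nested."""
    groups = 1
    while (2 * groups <= min(len(norms), len(norms[0]))
           and retained_ratio(norms, 2 * groups) >= threshold):
        groups *= 2
    return groups


# Hypothetical permuted weight norms: two strong 2x2 diagonal blocks, weak elsewhere.
norms = [
    [5.0, 4.0, 0.2, 0.1],
    [4.0, 5.0, 0.1, 0.2],
    [0.1, 0.2, 5.0, 4.0],
    [0.2, 0.1, 4.0, 5.0],
]
chosen = largest_cardinality(norms)
```

Here two groups retain about 97% of the total norm while four groups retain only about 54%, so the rule stops at two groups.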


Algorithm 1 Training Pipeline.
Initially update the permutation matrices P and Q.
for epoch = 1 to #epochs do
       Train for 1 epoch with the structured regularization;
       Solve Eq. 4 to update the matrices P and Q;
       Determine the current group levels by Eq. 9;
       Update the structured regularization matrices;
       Adjust the coefficient λ.
end for

3.5 Pipeline Details

Methods          #Params. ()      FLOPs ()         Acc. (%)
Baseline         2.20             3.53             91.70 (0.12)
-----------------------------------------------------------------
Slimming-40%     1.91 (0.00)      3.10 (0.02)      91.74 (0.35)
MASS-20%         1.76 (0.00)      3.18 (0.07)      91.79 (0.23)
-----------------------------------------------------------------
Slimming-60%     1.36 (0.02)      2.24 (0.01)      89.68 (0.38)
MASS-40%         1.31 (0.01)      2.58 (0.00)      91.42 (0.04)

Baseline         5.90             9.16             93.50 (0.19)
-----------------------------------------------------------------
Slimming-60%     4.15 (0.03)      5.75 (0.10)      93.10 (0.25)
MASS-30%         4.08 (0.05)      7.17 (0.20)      94.19 (0.16)
MASS-50%         2.96 (0.03)      4.81 (0.03)      93.70 (0.06)
-----------------------------------------------------------------
Slimming-80%     2.33 (0.04)      3.50 (0.02)      91.01 (0.02)
MASS-60%         2.34 (0.08)      4.20 (0.08)      93.48 (0.13)
MASS-70%         1.80 (0.00)      3.52 (0.16)      93.25 (0.02)

Baseline (ResNet-110)  11.47      17.59            94.62 (0.22)
-----------------------------------------------------------------
Slimming-40%     9.24 (0.03)      12.55 (0.00)     94.49 (0.12)
MASS-20%         9.12 (0.06)      14.76 (0.02)     94.78 (0.11)
MASS-40%         6.69 (0.24)      11.60 (0.01)     94.55 (0.18)
-----------------------------------------------------------------
Slimming-60%     8.15 (0.03)      10.66 (0.00)     94.29 (0.11)
MASS-30%         7.89 (0.03)      12.47 (0.01)     94.69 (0.08)
MASS-60%         5.41 (0.02)      10.66 (0.01)     94.42 (0.04)
Table 1: Network compression results on the CIFAR-10 Krizhevsky and Hinton (2009) dataset. “Baseline” means the network without compression. The percentages in our method indicate the compression rate (measured by the reduction of “#Params.”), while those in other methods indicate the pruning ratio.

Implementation Details.

Our implementation is based on the PyTorch Steiner et al. (2019) library. The proposed method is applied to the ResNet He et al. (2016) and DenseNet Huang et al. (2017) families, and evaluated on the CIFAR Krizhevsky and Hinton (2009) and ImageNet Russakovsky et al. (2015) datasets. For the CIFAR dataset, we follow the common practice of data augmentation He et al. (2016); Liu et al. (2017); Xie et al. (2017): zero-padding of 4 pixels on each side of the image, random crop of a 32×32 patch, and random horizontal flip. For fair comparisons, we utilize the same network architecture as Liu et al. (2017), and the model is trained on a single GPU with a batch size of 64. For the ImageNet dataset, we adopt the standard data augmentation strategy Simonyan and Zisserman (2015); He et al. (2016); Xie et al. (2017): image resize such that the shortest edge is 256 pixels, random crop of a 224×224 patch, and random horizontal flip. The overall batch size is 256, which is distributed across 4 GPUs. For both datasets, we employ the SGD optimizer with momentum 0.9. The source code and trained models will be made available to the public upon acceptance.

Training Protocol.

For the first stage, we train a large model from scratch with the structured regularization described in Sec. 3.3. At the end of each epoch, we update the permutation matrices as in Sec. 3.2, determine the current group levels as in Sec. 3.4, and adjust the structured regularization matrices accordingly. We train with a fixed learning rate of 0.1 for 100 epochs on the CIFAR dataset and 60 epochs on the ImageNet dataset and exclude the weight decay due to the existence of the structured regularization. The coefficient is dynamically adjusted to meet the desired compression rate (see the supplementary materials). The training pipeline is summarized in Alg. 1.

Finetune Protocol.

The remaining parameters are restored from the training stage and the compressed model is finetuned with an initial learning rate of 0.1. We finetune for 160 epochs on the CIFAR dataset and the learning rate decays by a factor of 10 at 50% and 75% of the total epochs. On the ImageNet dataset, the learning rate is decayed according to the cosine annealing strategy Loshchilov and Hutter (2017) within 120 epochs. For both datasets, a standard weight decay of is adopted to prevent overfitting.
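The two finetuning schedules can be written as plain functions of the epoch index. This is a minimal sketch under stated assumptions: epochs are zero-indexed, the step schedule drops the rate at the start of the milestone epochs, and the cosine schedule anneals to zero (the final learning rate is not specified in the text). The function names are hypothetical.

```python
import math


def step_lr(epoch, total_epochs=160, base_lr=0.1, milestones=(0.5, 0.75), gamma=0.1):
    """CIFAR-style schedule: multiply by gamma at 50% and 75% of the total epochs."""
    drops = sum(epoch >= int(m * total_epochs) for m in milestones)
    return base_lr * gamma ** drops


def cosine_lr(epoch, total_epochs=120, base_lr=0.1, min_lr=0.0):
    """ImageNet-style schedule: cosine annealing from base_lr to min_lr (assumed 0)."""
    return min_lr + 0.5 * (base_lr - min_lr) * (1 + math.cos(math.pi * epoch / total_epochs))


cifar_schedule = [step_lr(e) for e in (0, 79, 80, 120, 159)]
imagenet_schedule = [cosine_lr(e) for e in (0, 60, 120)]
```

In PyTorch, the same behavior is available through the built-in `MultiStepLR` and `CosineAnnealingLR` schedulers in `torch.optim.lr_scheduler`.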

Methods          #Params. (M)     GFLOPs           Acc. (%)
Baseline (ResNet-50)   25.6       4.14             77.10
-----------------------------------------------------------------
NISP-A           18.6             2.97             72.75
Slimming-20%     17.8             2.81             75.12
Taylor-19%       17.9             2.66             75.48
FPGM-30%         N/A              2.39             75.59
MASS-35%         17.2             3.12             76.82
-----------------------------------------------------------------
ThiNet-30%       16.9             2.62             72.04
NISP-B           14.3             2.29             72.07
Taylor-28%       14.2             2.25             74.50
FPGM-40%         N/A              1.93             74.83
MASS-65%         10.3             1.67             75.10
-----------------------------------------------------------------
ThiNet-50%       12.4             1.83             71.01
Taylor-44%       7.9              1.34             71.69
Slimming-50%     11.1             1.87             71.99
MASS-85%         5.6              0.90             72.47

Baseline (ResNet-101)  44.5       7.87             78.64
-----------------------------------------------------------------
FPGM-30%         N/A              4.55             77.32
Taylor-25%       31.2             4.70             77.35
MASS-40%         26.7             5.05             78.16
-----------------------------------------------------------------
BN-ISTA-v1       17.3             3.69             74.56
BN-ISTA-v2       23.6             4.47             75.27
Taylor-45%       20.7             2.85             75.95
Slimming-50%     20.9             3.16             75.97
MASS-65%         16.5             2.98             77.62
-----------------------------------------------------------------
Taylor-60%       13.6             1.76             74.16
MASS-80%         10.6             1.70             75.73

Baseline         20.0             4.39             77.88
-----------------------------------------------------------------
Taylor-40%       12.5             3.02             76.51
MASS-38%         13.1             3.53             77.43
-----------------------------------------------------------------
Taylor-64%       9.0              2.21             75.28
MASS-60%         9.2              2.10             75.86
Table 2: Network compression results on the ImageNet Russakovsky et al. (2015) dataset. The center-crop validation accuracy is reported. “Baseline” means the network without compression. The percentages in the table have the same meaning as those in Tab. 1.

4 Experiments and Analysis

In this section, we present the experimental results on the CIFAR and ImageNet datasets. In addition, we carry out ablation studies to demonstrate the effectiveness of components of the proposed method.

4.1 Results on CIFAR

We first compare our proposed method with the Network Slimming Liu et al. (2017) approach on the CIFAR-10 dataset. The Network Slimming approach is a representative pruning method that compresses CNN models by pruning less important filters. As the experimental results on the CIFAR-10 dataset are subject to run-to-run variance, we repeat the train-compress-finetune pipeline 10 times and record the mean and standard deviation (std). As shown in Tab. 1, the proposed MASS method performs favorably under various compression rates. For ResNet-110, with 60% of the parameters compressed, MASS can still achieve 94.42% top-1 accuracy, which is close to the 94.62% of the uncompressed baseline. Compared with Network Slimming, MASS achieves higher accuracy at comparable parameter counts, especially under high compression rates. Experiments on the CIFAR-10 dataset demonstrate that MASS is able to compress CNNs with a negligible performance drop and favorable accuracy against pruning methods such as Network Slimming.

4.2 Results on ImageNet

Tab. 2 shows the evaluation results of the proposed method against the state-of-the-art network pruning approaches, including ThiNet Luo et al. (2017), Slimming Liu et al. (2017), NISP Yu et al. (2018), BN-ISTA Ye et al. (2018), FPGM He et al. (2019), and Taylor Molchanov et al. (2019). Overall, the MASS method performs favorably against the state-of-the-art network compression methods under different settings. These performance gains achieved by the MASS method can be attributed to the fact that discarding the entire filters will negatively affect the representational strength of the network model, especially when the pruning ratio is high, e.g., 50%. In contrast, the MASS method removes only a proportion of neuron connections and preserves all of the filters, thereby making a mild impact on the model capacity. In addition, it is known that pruning neuron connections would eliminate the information flow and affect performance. To alleviate this issue, the learnable channel shuffle mechanism assists the information exchange among different groups, thereby minimizing the potential negative impact.

4.3 Ablation Studies

Accuracy vs. Complexity.

As shown in Fig. 1, the proposed MASS method is designed to make a sound accuracy-complexity trade-off. On the ImageNet Russakovsky et al. (2015) dataset, MASS trades a slight top-1 accuracy drop of 0.28% for about 25% complexity reduction on the ResNet-50 backbone, and an accuracy loss of 1.02% for about 60% reduction on ResNet-101. Furthermore, high compression rates can be achieved in our methodology while maintaining competitive performance. It is worth noting that our method achieves an accuracy of 72.47% with only about 20% of the complexity of the ResNet-50 backbone, which compares favorably against pruning methods with twice the complexity.
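The percentages quoted above follow directly from the GFLOPs and accuracy columns of Tab. 2. The short check below recomputes them; the helper name is hypothetical and the inputs are the ResNet-50 and ResNet-101 rows of the table.

```python
def relative_reduction(baseline, compressed):
    """Fraction of the baseline cost removed by compression."""
    return 1.0 - compressed / baseline


# ResNet-50: baseline vs. MASS-35% (GFLOPs and top-1 accuracy from Tab. 2).
r50_flops_cut = relative_reduction(4.14, 3.12)
r50_acc_drop = 77.10 - 76.82

# ResNet-101: baseline vs. MASS-65%.
r101_flops_cut = relative_reduction(7.87, 2.98)
r101_acc_drop = 78.64 - 77.62

# ResNet-50: MASS-85% keeps roughly a fifth of the baseline GFLOPs.
r50_remaining = 0.90 / 4.14
```

The results are roughly 24.6% and 62.1% complexity reduction for accuracy drops of 0.28 and 1.02 points, and about 21.7% remaining complexity for MASS-85%.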

Config.          ResNet-50-65%          ResNet-101-65%
Acc.             Top-1      Top-5       Top-1      Top-5
Finetune         75.10      92.52       77.62      93.72
FromScratch      75.02      92.46       77.14      93.53
ShuffleNet       74.97      92.41       76.91      93.38
Random           69.45      89.45       73.16      91.44
NoShuffle        73.30      91.39       75.31      92.64
Table 3: Ablation study of different channel shuffle operations on the ImageNet dataset Russakovsky et al. (2015).

Learned Channel Shuffle Mechanism.

We evaluate the effectiveness of our learned channel shuffle mechanism on the ResNet backbone with a compression rate of 65%. We use the following five settings for performance evaluation:

  • Finetune: The preserved parameters after compression are restored and the compressed model is finetuned. For the other four settings, the parameters of the compressed model are re-initialized for the finetune stage.

  • FromScratch: We keep the learned channel connectivity, i.e., the learned permutation matrices, from the training stage, and train the model from randomly re-initialized weights.

  • ShuffleNet: The same channel shuffle operation as in ShuffleNet Zhang et al. (2018) is adopted. Specifically, for a group convolution, the channel shuffle operation is equivalent to reshaping the output channel dimension into a two-dimensional array of shape (cardinality, channels per group), transposing it, and flattening it back. In contrast to ShuffleNet, the channel shuffle in our method, i.e., in Finetune and FromScratch, is learned rather than pre-defined.

  • Random: The two permutation matrices are randomly generated, independent of the learned ones.

  • NoShuffle: The channel shuffle operations are removed, i.e., both permutation matrices are identity matrices.
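The three pre-defined (non-learned) settings above differ only in the channel permutation placed between group convolutions. The following is a minimal NumPy sketch of these permutations, not the authors' code: the function names are illustrative, and each permutation is stored as an index array instead of a permutation matrix.

```python
import numpy as np


def shufflenet_shuffle(num_channels, groups):
    """ShuffleNet-style shuffle: reshape to (groups, channels per group), transpose, flatten."""
    if num_channels % groups != 0:
        raise ValueError("num_channels must be divisible by groups")
    idx = np.arange(num_channels)
    return idx.reshape(groups, num_channels // groups).T.reshape(-1)


def random_shuffle(num_channels, seed=0):
    """'Random' setting: a permutation drawn independently of any learned one."""
    return np.random.default_rng(seed).permutation(num_channels)


def no_shuffle(num_channels):
    """'NoShuffle' setting: the identity permutation."""
    return np.arange(num_channels)


def apply_shuffle(features, perm):
    """Reorder the channel axis (axis 0): output channel i reads input channel perm[i]."""
    return features[perm]


perm = shufflenet_shuffle(6, groups=2)
print(perm)  # [0 3 1 4 2 5]
```

With 6 channels in 2 groups, the shuffled order interleaves the two groups, so each group of a following 2-group convolution would receive channels from both input groups.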

The results are shown in Tab. 3. First, the finetuned models perform slightly better than those trained from scratch, which implies that the preserved parameters play an essential role in the final performance. Furthermore, the models with the learned channel shuffle mechanism, i.e., the learned neuron connectivity, perform the best among all settings. The channel shuffle mechanism of ShuffleNet Zhang et al. (2018) is effective, as it outperforms the no-shuffle counterpart, but it can be further improved by a learning-based strategy. Interestingly, the random channel shuffle scheme performs the worst, even worse than the no-shuffle scheme. This implies that learning the channel shuffle operation is a challenging task, and that randomly gathering channels from different groups gives no benefit.

4.4 Discussion

To the best of our knowledge, our work is the first to introduce structured sparsification for network compression. As there is still room for improvement, we discuss three potential directions for future work.

  1. Data-Driven Structured Sparsification. Currently, the gradients of the data loss and those of the sparsity regularization are computed independently (Eq. 7) in each training iteration. As a result, the structured regularization is imposed uniformly on the convolutional layers, and the learned cardinality distribution is task-agnostic and prone to uniformity. A better cardinality distribution may be achieved if the structured sparsification is guided by the back-propagated signals of the data loss. Optimization-based meta-learning techniques Finn et al. (2017) can be exploited for this purpose.

  2. Progressive Sparsification Solution. Typically, finetune-free compression techniques are desired in practical applications Cheng et al. (2018). Therefore, the sparsified weights can be removed progressively during training, and the architecture search as well as model training can be completed in a single training pass.

  3. Combination with Filter Pruning Techniques. As the entire feature maps are preserved in our approach, the reduction of memory footprint is limited. This issue can be addressed by combining our approach with filter pruning techniques, which is non-trivial as uniform filter pruning is required within each group. It is of great interest to exploit group sparsity constraints Yoon and Hwang (2017) to achieve such uniform sparsity.

5 Conclusion

In this work, we propose a method for efficient network compression. Our approach induces structurally sparse representations of the convolutional weights, and the compressed model can be readily incorporated into modern deep learning libraries thanks to their support for group convolution. The problem of inter-group communication is further solved via the learnable channel shuffle mechanism. Our approach is model-agnostic and highly compressible with negligible performance degradation, which is validated by extensive experiments on the CIFAR and ImageNet datasets. In addition, experimental evaluation against the state-of-the-art compression approaches shows that structured sparsification can be a fruitful direction for future research.

Appendix A Structured Regularization in General Form

Generally, we can relax the constraint that both $C^{\text{in}}$ and $C^{\text{out}}$, the numbers of input and output channels, are powers of 2, and only assume that both have many factors of 2. Under this assumption, the potential candidates of cardinality are still restricted to powers of 2. Specifically, if the greatest common divisor of $C^{\text{in}}$ and $C^{\text{out}}$ can be factored as

$$\gcd(C^{\text{in}}, C^{\text{out}}) = 2^{u} \cdot z,$$

where $z$ is an odd integer, then there is one candidate group level for each cardinality in $\{1, 2, \dots, 2^{u}\}$. For example, if the minimal $u$ is 4 among all convolutional layers (footnote 6), the potential candidates of cardinality are $\{1, 2, 4, 8, 16\}$, giving adequate flexibility of the compressed model. The structured regularization and the relationship matrix corresponding to each group level are designed in a similar way. For clarity, we provide an exemplar implementation based on the NumPy library.

import numpy as np


def struc_reg(dim1, dim2, level=None, power=0.5):
    r"""
    Compute the structured regularization matrix.

    Args:
        dim1 (int): number of output channels.
        dim2 (int): number of input channels.
        level (int or None): current group level.
                             Specify 'None' if the cost matrix is desired.
        power (float): decay rate of the regularization coefficients.

    Returns:
        Structured regularization matrix.
    """
    reg = np.zeros((dim1, dim2))
    assign_val(reg, 1., level, power)
    return reg


def assign_val(arr, val, level, power):
    # Fill `arr` in place; NumPy slices are views, so recursive writes propagate.
    dim1, dim2 = arr.shape
    if dim1 % 2 != 0 or dim2 % 2 != 0 or level == 0:
        return
    next_level = None if level is None else level - 1
    # Assign the current coefficient to the two off-diagonal blocks.
    arr[dim1//2:, :dim2//2] = val
    arr[:dim1//2, dim2//2:] = val
    # Recurse into the two diagonal blocks with a decayed coefficient.
    assign_val(arr[dim1//2:, dim2//2:], val*power, next_level, power)
    assign_val(arr[:dim1//2, :dim2//2], val*power, next_level, power)
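The factorization at the start of this appendix can be illustrated with a short helper. This is a sketch under the assumption that the candidate cardinalities are exactly the powers of 2 dividing the greatest common divisor of the two channel numbers; the function name `candidate_cardinalities` is hypothetical and does not come from the paper.

```python
import math


def candidate_cardinalities(c_in, c_out):
    """Return the powers of 2 that divide gcd(c_in, c_out), i.e. 1, 2, ..., 2**u."""
    if c_in <= 0 or c_out <= 0:
        raise ValueError("channel numbers must be positive")
    g = math.gcd(c_in, c_out)
    u = 0
    # Strip factors of 2 from the gcd; what remains is the odd part.
    while g % 2 == 0:
        g //= 2
        u += 1
    return [2 ** i for i in range(u + 1)]


# gcd(48, 144) = 48 = 2**4 * 3, so the exponent of 2 is 4.
print(candidate_cardinalities(48, 144))  # [1, 2, 4, 8, 16]
```

As a complementary check of the listing above, `struc_reg(4, 4, level=1)` fills the two off-diagonal 2×2 blocks with 1 and leaves the two diagonal blocks at 0.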

Appendix B Dynamic Penalty Adjustment

As the desired compression rate is customized according to user preference, manually choosing an appropriate regularization coefficient in Eq. (7) of Sec. 3.3 for each experimental setting is extremely inefficient. To alleviate this issue, we dynamically adjust the coefficient based on the sparsification progress. The algorithm is summarized in Alg. 2.

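Concretely, after the $t$-th training epoch, we first determine the current group level of each convolutional layer according to Eq. (9) of Sec. 3.4. Then, we define the model sparsity based on the reduction of model parameters. For the $l$-th convolutional layer, the number of parameters is reduced by a factor of $G_l$, where $G_l$ is the cardinality. Thus, the original number of parameters and the reduced one are given by

$$p = \sum_{l} C^{\text{out}}_l \, C^{\text{in}}_l \, k_l^2 \quad \text{and} \quad p' = \sum_{l} \frac{C^{\text{out}}_l \, C^{\text{in}}_l \, k_l^2}{G_l}, \tag{11}$$

respectively. Here, $C^{\text{out}}_l$, $C^{\text{in}}_l$, and $k_l$ denote the output channel number, the input channel number, and the kernel size of the $l$-th convolutional layer, respectively. Therefore, the current model sparsity is calculated as

$$s_t = 1 - \frac{p'}{p}. \tag{12}$$

Afterwards, we assume the model sparsity grows linearly, and calculate the expected sparsity gain. If the expected sparsity gain is not met, i.e.,

$$s_t - s_{t-1} < \frac{s^{*} - s_{t-1}}{T - t + 1},$$

where $T$ is the total training epoch number and $s^{*}$ is the target sparsity, we increase the regularization coefficient $\lambda$ by $\Delta\lambda$. If the model sparsity exceeds the target, i.e., $s_t > s^{*}$, we decrease $\lambda$ by $\Delta\lambda$.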

In all experiments, the coefficient is initialized from and is set to .
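The parameter counting behind the model sparsity can be sketched as follows. This is an illustrative helper, not the authors' implementation: it assumes square kernels, describes each layer by a hypothetical tuple `(c_out, c_in, kernel_size, cardinality)`, and assumes that a group convolution divides the parameter count of its layer by the cardinality.

```python
def model_sparsity(layers):
    """Fraction of convolutional parameters removed by grouping.

    layers: iterable of (c_out, c_in, kernel_size, cardinality) tuples
            (an assumed layout chosen for this sketch).
    """
    # Parameters of the dense convolutions.
    original = sum(co * ci * k * k for co, ci, k, _ in layers)
    # Parameters after each layer is replaced by a group convolution.
    reduced = sum(co * ci * k * k / g for co, ci, k, g in layers)
    return 1.0 - reduced / original


# A toy bottleneck: 1x1 (dense), 3x3 (4 groups), 1x1 (2 groups).
toy_layers = [(64, 64, 1, 1), (64, 64, 3, 4), (256, 64, 1, 2)]
print(model_sparsity(toy_layers))  # 0.625
```

In the toy example the dense layers hold 57,344 parameters and the grouped ones 21,504, so 62.5% of the parameters are removed.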

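Initialize the regularization coefficient and its adjustment step
for each training epoch do
    Train for 1 epoch
    Determine the current group levels
    Compute the current sparsity by Eq. 11 and 12
    if the expected sparsity gain is not met then
        Increase the coefficient by the adjustment step
    else if the current sparsity exceeds the target then
        Decrease the coefficient by the adjustment step
    end if
end for
Algorithm 2: Dynamically adjust the regularization coefficient.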
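One step of the adjustment rule can be sketched in Python as below. The exact form of the "expected sparsity gain" is an assumption of this sketch: it takes the gap to the target sparsity and spreads it evenly over the remaining epochs, which is one plausible reading of the linear-growth assumption. The function and argument names are hypothetical.

```python
def adjust_penalty(lam, step, s_prev, s_curr, target, epoch, total_epochs):
    """Return the updated regularization coefficient after one epoch.

    lam, step:       current coefficient and its adjustment step.
    s_prev, s_curr:  model sparsity before and after this epoch.
    target:          target sparsity.
    epoch:           1-based index of the epoch that just finished.
    """
    # Assumed linear schedule: close the remaining gap evenly over the
    # remaining epochs (including the current one).
    expected_gain = (target - s_prev) / (total_epochs - epoch + 1)
    if s_curr - s_prev < expected_gain:
        return lam + step  # sparsification is too slow
    if s_curr > target:
        return lam - step  # sparsification overshoots the target
    return lam


lam = adjust_penalty(1.0, 0.5, s_prev=0.0, s_curr=0.0,
                     target=0.5, epoch=1, total_epochs=10)
print(lam)  # 1.5
```

The order of the two checks follows Alg. 2: the coefficient is decreased only when the expected gain has been met and the target is exceeded.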

Appendix C Experimental Details

In this section, we provide more results and details of our experiments. We provide the loss and accuracy curves along with the performance after each stage in Sec. C.1, and analyze the compressed model architectures in Sec. C.2.

C.1 Training Dynamics

Backbone ResNet-50 ResNet-101 DenseNet-201
Compression Rate 35% 65% 85% 40% 65% 80% 38% 60%
Pre-compression Acc. 69.07 66.36 64.30 69.56 67.13 64.20 69.10 66.26
Post-compression Acc. 60.92 42.78 8.82 65.78 58.63 18.57 66.15 17.35
Finetune Acc. 76.82 75.10 72.47 78.16 77.62 75.73 77.43 75.86
threshold 0.127 0.115 0.125 0.095 0.090 0.103 0.098 0.115
Table 4: Performance along the timeline of our approach. The evaluation is performed on the ImageNet dataset.

We first provide the pre- and post-compression accuracy along with the finetune accuracy of our pipeline in Tab. 4. During compression, we use a binary search to decide the threshold of the grouping criteria (Eq. (9)) so that the network is compressed at the desired compression rate. The searched thresholds are also listed in Tab. 4. Apart from this, we further provide the training and finetune curves in Fig. 4. In the training stage, the accuracy gradually increases until saturation, and then the compression leads to a performance drop. Finally, the performance is recovered in the finetune stage.
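The threshold search can be sketched as a standard bisection. Since Eq. (9) is not restated here, the sketch treats the mapping from threshold to achieved compression rate as a hypothetical callable `sparsity_at` and assumes it is non-decreasing in the threshold; if the relation is decreasing in practice, the comparison should be flipped.

```python
def search_threshold(sparsity_at, target, lo=0.0, hi=1.0, tol=1e-4, max_iter=50):
    """Bisect for the smallest threshold whose compression rate reaches `target`.

    sparsity_at: hypothetical callable, threshold -> compression rate,
                 assumed non-decreasing with sparsity_at(hi) >= target.
    """
    for _ in range(max_iter):
        if hi - lo <= tol:
            break
        mid = 0.5 * (lo + hi)
        if sparsity_at(mid) >= target:
            hi = mid  # target reached: try a smaller threshold
        else:
            lo = mid  # target missed: a larger threshold is needed
    return hi


# Toy monotone relation standing in for the real compression procedure.
threshold = search_threshold(lambda t: min(1.0, 5.0 * t), target=0.65)
print(round(threshold, 3))  # 0.13
```

In the actual pipeline, each evaluation of `sparsity_at` would apply the grouping criteria to every layer and measure the resulting compression rate.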



Figure 4: Training dynamics of the full MASS pipeline. We plot the training and finetune curves of the DenseNet-201 backbone with a compression rate of 38%. At the end of the 60th epoch of the training stage, we compress the network following our criteria. Then, we finetune for 120 epochs to recover performance.

C.2 Compressed Architectures

We illustrate the compressed architectures by showing the cardinality of each convolution layer in Fig. 5 and 6. Note that our method is applied to all convolution operators, i.e., both $1 \times 1$ and $3 \times 3$ convolutions, so a high compression rate, e.g., 80%, can be achieved. Besides, as discussed in Sec. 4.4, the learned cardinality distribution is prone to uniformity, but there are still certain patterns. For example, shallow layers are relatively more difficult to compress. A possible explanation is that shallow layers have fewer filters, so a large cardinality will inevitably eliminate the communication between certain groups. Moreover, we observe that $3 \times 3$ convolutions are generally more compressible than $1 \times 1$ convolutions. This is intuitive as $3 \times 3$ convolutions have more parameters, thus leading to heavier redundancy.



Figure 5: Learned cardinalities of the ResNet-50 backbone with the compression rates of 35% and 65%.


Figure 6: Learned cardinalities of the ResNet-101 backbone with the compression rates of 40% and 80%.

Besides, we illustrate the learned neuron connectivity and compare it with the ShuffleNet Zhang et al. (2018) counterpart. Here, we consider the channel permutation between two group convolutions (GroupConvs) and demonstrate the connectivity via the confusion matrix. Specifically, we assume the first GroupConv is of cardinality $G_1$ and the second of $G_2$; then the confusion matrix $M$ is a $G_1 \times G_2$ matrix with $M_{ij}$ denoting the number of channels that come from the $i$-th group of the first GroupConv and belong to the $j$-th group of the second.

In Tab. 5, we can see that the inter-group communication is guaranteed as there are connections between every two groups. Furthermore, the learnable channel shuffle scheme is more flexible. The ShuffleNet Zhang et al. (2018) scheme uniformly partitions and distributes channels within each group, while our approach allows small variations of the number of connections for each group. In this way, the network can itself control the information flow from each group by customizing its neuron connectivity. More examples can be found in Fig. 7. All models illustrated in this section are trained on the ImageNet dataset.
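The confusion matrix of a given channel permutation can be computed as sketched below. This is an illustrative NumPy example with hypothetical function names; it represents the permutation as an index array in which output channel `i` reads input channel `perm[i]`, and assumes that both cardinalities divide the channel number. The setting of 512 channels and cardinality 8 on both sides is inferred from the entries of Tab. 5.

```python
import numpy as np


def shufflenet_permutation(num_channels, groups):
    """ShuffleNet-style shuffle as an index array."""
    idx = np.arange(num_channels)
    return idx.reshape(groups, num_channels // groups).T.reshape(-1)


def confusion_matrix(perm, groups_first, groups_second):
    """Entry (i, j) counts channels from group i of the first GroupConv
    that are routed to group j of the second GroupConv."""
    perm = np.asarray(perm)
    num_channels = perm.size
    src_group = perm // (num_channels // groups_first)
    dst_group = np.arange(num_channels) // (num_channels // groups_second)
    counts = np.zeros((groups_first, groups_second), dtype=int)
    np.add.at(counts, (src_group, dst_group), 1)  # accumulate repeated pairs
    return counts


uniform = confusion_matrix(shufflenet_permutation(512, 8), 8, 8)
identity = confusion_matrix(np.arange(512), 8, 8)
print(uniform[0])   # every entry equals 8
print(identity[0])  # 64 on the diagonal, 0 elsewhere
```

The ShuffleNet permutation reproduces the uniform matrix of Tab. 5, while the identity permutation (the NoShuffle setting) yields a diagonal matrix, i.e., no inter-group communication.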

Left (learned):
     G1  G2  G3  G4  G5  G6  G7  G8
G1    6   6  10   8   9   6  13   6
G2    9   8   7   9  11   8   4   8
G3   11   8  11   6   4   8   7   9
G4   16   9   5   9  10   4   6   5
G5    7   9   7   7   8  10   9   7
G6    5   7  10   6   7  11   7  11
G7    4   8   7  14   6   8   7  10
G8    6   9   7   5   9   9  11   8

Right (ShuffleNet):
     G1  G2  G3  G4  G5  G6  G7  G8
G1    8   8   8   8   8   8   8   8
G2    8   8   8   8   8   8   8   8
G3    8   8   8   8   8   8   8   8
G4    8   8   8   8   8   8   8   8
G5    8   8   8   8   8   8   8   8
G6    8   8   8   8   8   8   8   8
G7    8   8   8   8   8   8   8   8
G8    8   8   8   8   8   8   8   8
Table 5: Confusion matrices of the adjacent GroupConvs. Here, the neuron connectivity between “Layer4-Bottleneck1-conv1” and “Layer4-Bottleneck1-conv2” of the ResNet-50-85% model is demonstrated. Left: the learned neuron connectivity; Right: the neuron connectivity of the ShuffleNet Zhang et al. (2018).

Panels: (a) DenseNet-201-60%, Block4-Layer24-conv1-2; (b) ResNet-50-85%, Layer1-Bottleneck1-conv2-3; (c) ResNet-101-80%, Layer4-Bottleneck2-conv2-3; (d) ResNet-50-85%, Layer3-Bottleneck4-conv1-2; (e) ResNet-101-80%, Layer3-Bottleneck1-conv1-2.

Figure 7: More examples of the confusion matrices.


  1. For simplicity, we omit the bias term from Eq. 1, and assume the convolution operator is of stride 1 with proper paddings.
  2. Similar reasoning can be applied if both the input and output channel numbers have many factors of 2. See the supplementary materials for details.
  3. Doubly-stochastic matrices are non-negative square matrices whose rows and columns sum to one.
  5. Here, we simply compute the regularization of a single convolutional layer. In the experiments, the regularization is the summation of those of all the convolution layers.
  6. The standard DenseNet Huang et al. (2017) family satisfies this condition.


  1. Designing neural network architectures using reinforcement learning. In International Conference on Learning Representations (ICLR), Cited by: §2.
  2. Three observations on linear algebra. Univ. Nac. Tucumán, Rev. Ser. A 5, pp. 147–151. Cited by: §3.2, §3.2.
  3. Displacement interpolation using lagrangian mass transport. In ACM Transactions on Graphics (SIGGRAPH Asia), pp. 1–12. Cited by: §1, §3.2.
  4. Learning efficient object detection models with knowledge distillation. In Neural Information Processing Systems (NIPS), pp. 742–751. Cited by: §1.
  5. Compressing neural networks with the hashing trick. In International Conference on Machine Learning (ICML), pp. 2285–2294. Cited by: §1.
  6. Recent advances in efficient computation of deep convolutional neural networks. Frontiers of Information Technology & Electronic Engineering 19 (1), pp. 64–77. Cited by: item 2.
  7. Xception: deep learning with depthwise separable convolutions. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 1251–1258. Cited by: §2.
  8. Binarized neural networks: training deep neural networks with weights and activations constrained to +1 or -1. arXiv preprint arXiv:1602.02830. Cited by: §1.
  9. Model-agnostic meta-learning for fast adaptation of deep networks. In International Conference on Machine Learning (ICML), pp. 1126–1135. Cited by: item 1.
  10. Convolutional sequence to sequence learning. In International Conference on Machine Learning (ICML), pp. 1243–1252. Cited by: §1.
  11. Fast R-CNN. In IEEE International Conference on Computer Vision (ICCV), Cited by: §1.
  12. Deep residual learning for image recognition. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 770–778. Cited by: §1, §3.2, §3.5.
  13. Filter pruning via geometric median for deep convolutional neural networks acceleration. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 4340–4349. Cited by: §2, §4.2.
  14. Channel pruning for accelerating very deep neural networks. In IEEE International Conference on Computer Vision (ICCV), pp. 1398–1406. Cited by: §1.
  15. Distilling the knowledge in a neural network. In Neural Information Processing Systems (NIPS), Cited by: §1.
  16. Mobilenets: efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861. Cited by: §2.
  17. Condensenet: an efficient densenet using learned group convolutions. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2752–2761. Cited by: §2.
  18. Densely connected convolutional networks. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 4700–4708. Cited by: §1, §3.5, footnote 6.
  19. Learning multiple layers of features from tiny images. Technical report Citeseer. Cited by: §1, §3.5, Table 1.
  20. Imagenet classification with deep convolutional neural networks. In Neural Information Processing Systems (NIPS), pp. 1097–1105. Cited by: §1, §1, §2, §4.3.
  21. Pruning filters for efficient convnets. In International Conference on Learning Representations (ICLR), Cited by: §1, §1, §2, §2, §2, §3.2.
  22. DARTS: differentiable architecture search. In International Conference on Learning Representations (ICLR), Cited by: §2.
  23. Search to distill: pearls are everywhere but not the eyes. arXiv preprint arXiv:1911.09074. Cited by: §1.
  24. Learning efficient convolutional networks through network slimming. In IEEE International Conference on Computer Vision (ICCV), pp. 2736–2744. Cited by: §1, §1, §2, §2, §2, §3.1, §3.2, §3.3, §3.5, §4.1, §4.2.
  25. Fully convolutional networks for semantic segmentation. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 3431–3440. Cited by: §1.
  26. SGDR: stochastic gradient descent with warm restarts. In International Conference on Learning Representations (ICLR), Cited by: §3.5.
  27. Thinet: a filter level pruning method for deep neural network compression. In IEEE International Conference on Computer Vision (ICCV), pp. 5058–5066. Cited by: §4.2.
  28. Shufflenet v2: practical guidelines for efficient cnn architecture design. In European Conference on Computer Vision (ECCV), pp. 116–131. Cited by: §2.
  29. Improved knowledge distillation via teacher assistant: bridging the gap between student and teacher. In Association for the Advancement of Artificial Intelligence (AAAI), Cited by: §1.
  30. Importance estimation for neural network pruning. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 11264–11272. Cited by: §1, §2, §4.2.
  31. Xnor-net: imagenet classification using binary convolutional neural networks. In European Conference on Computer Vision (ECCV), pp. 525–542. Cited by: §1.
  32. Regularized evolution for image classifier architecture search. In Association for the Advancement of Artificial Intelligence (AAAI), Vol. 33, pp. 4780–4789. Cited by: §2.
  33. Large-scale evolution of image classifiers. In International Conference on Machine Learning (ICML), pp. 2902–2911. Cited by: §2.
  34. ImageNet Large Scale Visual Recognition Challenge. International Journal on Computer Vision (IJCV) 115 (3), pp. 211–252. Cited by: Figure 1, §1, §3.5, Table 2, Table 3.
  35. Mobilenetv2: inverted residuals and linear bottlenecks. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 4510–4520. Cited by: §1, §2.
  36. Very deep convolutional networks for large-scale image recognition. In International Conference on Learning Representations (ICLR), Cited by: §3.2, §3.5.
  37. PyTorch: an imperative style, high-performance deep learning library. In Neural Information Processing Systems (NIPS), Cited by: §3.5.
  38. Fully learnable group convolution for acceleration of deep neural networks. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 9049–9058. Cited by: §2.
  39. FBNet: hardware-aware efficient convnet design via differentiable neural architecture search. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 10734–10742. Cited by: §2.
  40. Quantized convolutional neural networks for mobile devices. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 4820–4828. Cited by: §1.
  41. Aggregated residual transformations for deep neural networks. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 1492–1500. Cited by: §1, §2, §3.5.
  42. Rethinking the smaller-norm-less-informative assumption in channel pruning of convolution layers. In International Conference on Learning Representations (ICLR), Cited by: §4.2.
  43. A gift from knowledge distillation: fast optimization, network minimization and transfer learning. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 4133–4141. Cited by: §1.
  44. Combined group and exclusive sparsity for deep neural networks. In International Conference on Machine Learning (ICML), pp. 3958–3966. Cited by: item 3.
  45. NISP: pruning networks using neuron importance score propagation. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 9194–9203. Cited by: §4.2.
  46. Shufflenet: an extremely efficient convolutional neural network for mobile devices. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 6848–6856. Cited by: §C.2, §C.2, Table 5, §1, §1, §2, §2, 3rd item, §4.3.
  47. Differentiable learning-to-group channels via groupable convolutional neural networks. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 3542–3551. Cited by: §2, §2, §3.4.
  48. Neural architecture search with reinforcement learning. In International Conference on Learning Representations (ICLR), Cited by: §2.
  49. Learning transferable architectures for scalable image recognition. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 8697–8710. Cited by: §2.