Model-Agnostic Structured Sparsification with Learnable Channel Shuffle
Abstract
Recent advances in convolutional neural networks (CNNs) usually come at the expense of considerable computational overhead and memory footprint. Network compression aims to alleviate this issue by training compact models with comparable performance. However, existing compression techniques either entail dedicated expert design or compromise with a moderate performance drop. To this end, we propose a model-agnostic structured sparsification method for efficient network compression. The proposed method automatically induces structurally sparse representations of the convolutional weights, thereby facilitating the implementation of the compressed model with the highly-optimized group convolution.
We further address the problem of inter-group communication with a learnable channel shuffle mechanism. The proposed approach is model-agnostic and highly compressible with a negligible performance drop. Extensive experimental results and analysis demonstrate that our approach performs favorably against the state-of-the-art network pruning methods. The code will be publicly available after the review process.
1 Introduction
Convolutional Neural Networks (CNNs) have made significant advances in a wide range of vision and learning tasks Krizhevsky et al. (2012); Gehring et al. (2017); Long et al. (2015); Girshick (2015). However, the performance gains usually entail heavy computational costs, which make the deployment of CNNs on portable devices difficult. To meet the memory and computational constraints in real-world applications, numerous model compression techniques have been developed.
Existing network compression techniques are mainly based on weight quantization Chen et al. (2015); Courbariaux et al. (2016); Rastegari et al. (2016); Wu et al. (2016), knowledge distillation Hinton et al. (2014); Chen et al. (2017); Yim et al. (2017), and network pruning Li et al. (2017); He et al. (2017); Liu et al. (2017); Molchanov et al. (2019). Weight quantization methods use low bit-width numbers to represent weights and activations, which usually brings a moderate performance degradation. Knowledge distillation schemes transfer knowledge from a large teacher network to a compact student network, and are typically sensitive to the teacher/student network architecture Mirzadeh et al. (2020); Liu et al. (2019b). Closely related to our work, network pruning approaches reduce the model size by removing a proportion of model parameters that are considered unimportant. Notably, filter pruning algorithms Liu et al. (2017); He et al. (2017); Li et al. (2017) remove entire filters and result in structured architectures that can be readily incorporated into modern BLAS libraries.
Identifying unimportant filters is critical to pruning methods. It is well-known that the weight norm can serve as a good indicator of the corresponding filter importance Li et al. (2017); Liu et al. (2017). Filters corresponding to smaller weight norms are considered to contribute less to the outputs. Furthermore, $\ell_1$ regularization can be used to increase sparsity Liu et al. (2017). Despite the advances, several issues in the existing pruning methods can be improved: 1) pruning a large proportion of convolutional filters will result in severe performance degradation; 2) pruning alters the input/output feature dimensions, and thus meticulous adaptation is required to handle special network architectures (e.g., residual connections He et al. (2016) and dense connections Huang et al. (2017)).
Before presenting the proposed method, we briefly introduce the group convolution (GroupConv) Krizhevsky et al. (2012), which plays an important role in this work. For the typical convolution operation, the output features are densely connected with the entire input features, while for the GroupConv, the input features are equally split into several groups and transformed within each group independently. Essentially, each output channel is connected with only a proportion of the input channels, which leads to sparse neuron connections. Therefore, deep CNNs with GroupConvs can be trained on less powerful GPUs with a smaller amount of memory. In this work, we propose a novel approach for network compression where unimportant neuron connections are pruned to facilitate the usage of GroupConvs. Nevertheless, converting vanilla convolutions into GroupConvs is a challenging task. First, not all sparse neuron connectivities correspond to valid GroupConvs, as certain requirements must be satisfied, e.g., mutual exclusiveness of different groups. To guarantee the desired structured sparsity, we impose structured regularization upon the convolutional weights and zero out the sparsified weights. Another challenge is that stacking multiple GroupConvs sequentially will hinder the inter-group information flow. The ShuffleNet Zhang et al. (2018) method proposes a channel shuffle mechanism, i.e., gathering channels from distinct groups, to ensure the inter-group communication, though the order of permutation is handcrafted. In contrast, we solve the problem of channel shuffle in a learning-based scheme. Concretely, we formulate the learning of channel shuffle as a linear programming problem, which can be solved by efficient algorithms like the network simplex method Bonneel et al. (2011). Since the structured sparsity is induced among the convolutional weights, our method is named Model-Agnostic Structured Sparsification, abbreviated to MASS.
The proposed structured sparsification method is designed for three goals. First, our approach is model-agnostic. A wide range of backbone architectures are amenable to our method without the need for any special adaptation. Second, our method is capable of achieving high compression rates. In modern efficient network architectures, the complexity of $3 \times 3$ convolutions is highly compressed, while the computation bottleneck becomes the pointwise convolutions (i.e., $1 \times 1$ convolutions) Zhang et al. (2018). For example, the pointwise convolutions occupy 81.5% of the total FLOPs in the MobileNetV2 Sandler et al. (2018) backbone and 93.4% in ResNeXt Xie et al. (2017). Our method is applicable to all convolution operators so that a high compression rate is reachable. Third, our approach brings a negligible performance drop. As all of the filters are preserved under our methodology, we retain stronger representational capacity of the compressed model and achieve a better accuracy-complexity trade-off than the pruning-based counterparts (see Fig. 1).
The main contributions of this work are threefold:

We propose a learnable channel shuffle mechanism (Sec. 3.2) in which the permutation of the convolutional weight norm matrix is learned via linear programming.

Upon the permuted weight norm matrix, we impose structured regularization (Sec. 3.3) to obtain valid GroupConvs by zeroing out the sparsified weights.

With the structurally sparse convolutional weights, we design the criteria of learning cardinality (Sec. 3.4) in which unimportant neuron connections are pruned with minimal impact on the entire network.
Incorporating the learnable channel shuffle mechanism, the structured regularization and the grouping criteria, the proposed structured sparsification method performs favorably against the state-of-the-art network pruning techniques on both CIFAR Krizhevsky and Hinton (2009) and ImageNet Russakovsky et al. (2015) datasets.
2 Related Work
Network Compression.
Compression methods for deep models can be broadly categorized based on weight quantization, knowledge distillation, and network pruning. Closely related to our work are the network pruning approaches based on filter pruning. It is well-acknowledged that filters with smaller weight norms make negligible contributions to the outputs and can be pruned. Li et al. (2017) prune filters with smaller $\ell_1$ norm, while Liu et al. (2017) remove those corresponding to smaller batch-norm scaling factors, on which an $\ell_1$ regularization term is imposed to increase sparsity.
However, techniques that remove entire filters based on the weight norm may significantly degrade the representational capacity. Instead of removing entire filters, the proposed structured sparsification method enforces structured sparsity among neuron connections and merely removes certain unimportant connections while all filters are preserved. As such, the network capacity is less affected than in pruning-based approaches Li et al. (2017); Liu et al. (2017); He et al. (2019); Molchanov et al. (2019). Furthermore, our method does not alter the input/output dimensions, and can be easily incorporated into numerous backbone models.
Group Convolution.
Group convolution (GroupConv) is introduced in the AlexNet Krizhevsky et al. (2012) to overcome the GPU memory constraints. GroupConv partitions the input features into mutually exclusive groups and transforms the features within each group in parallel. Compared with the vanilla (i.e., densely connected) convolution, a GroupConv with $G$ groups can reduce the computational cost and number of parameters by a factor of $G$. The ResNeXt Xie et al. (2017) designs a multi-branch architecture by employing GroupConvs and defines cardinality as the number of parallel transformations, which is simply the group number in each GroupConv. If the cardinality equals the number of channels, GroupConv becomes the depthwise convolution, which is widely used in recent lightweight neural architectures Howard et al. (2017); Sandler et al. (2018); Zhang et al. (2018); Ma et al. (2018); Chollet (2017).
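The reduction factor mentioned above can be checked with a few lines of arithmetic. The following is a minimal sketch; the function name `conv_cost` and the stride-1, same-padding, bias-free setting are illustrative assumptions rather than part of the proposed method.

```python
def conv_cost(c_in, c_out, k, h, w, groups=1):
    """Return (parameters, multiply-accumulates) of a k x k convolution.

    Assumes stride 1, 'same' padding and no bias, so the output keeps the
    h x w spatial size (illustrative assumptions).
    """
    if c_in % groups or c_out % groups:
        raise ValueError("channel counts must be divisible by groups")
    # Each output channel only sees c_in / groups input channels.
    params = (c_in // groups) * c_out * k * k
    macs = params * h * w
    return params, macs


dense_params, dense_macs = conv_cost(64, 128, 3, 32, 32)
group_params, group_macs = conv_cost(64, 128, 3, 32, 32, groups=4)
print(dense_params / group_params, dense_macs / group_macs)
```

With four groups, both the parameter count and the multiply-accumulate count shrink by the group number, as the text states.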
However, the aforementioned methods all treat the cardinality as a hyperparameter, and the connectivity patterns between consecutive features are handcrafted as well. On the other hand, there is also a line of research focusing on learnable GroupConvs Huang et al. (2018); Wang et al. (2019); Zhang et al. (2019). Both CondenseNet Huang et al. (2018) and FLGC Wang et al. (2019) predefine the cardinality of each GroupConv and learn the connectivity patterns. We note that the work by Zhang et al. (2019) learns both the cardinality and neuron connectivity simultaneously. Essentially, this dynamic grouping convolution is modeled by a binary relationship matrix, each entry of which indicates the connectivity between an input channel and an output channel. To guarantee that the resulting operator is a valid GroupConv, the relationship matrix is constructed using a Kronecker product of several binary symmetric matrices. Nevertheless, the Kronecker product gives a sufficient but not necessary condition, and the space of all valid relationship matrices is not fully exploited.
Our method decouples the learning of cardinality and connectivity. Motivated by the norm-based criterion in the network pruning methods Li et al. (2017); Liu et al. (2017), we quantify the importance of each neuron connection by the corresponding weight norm and learn the connectivity pattern by permuting the weight norm matrix. Besides, the structured regularization is imposed on the permuted weight norm matrix and the cardinality is learned accordingly. The essential difference between our approach and prior art Zhang et al. (2019) is that all possible neuron connectivity patterns, i.e., relationship matrices, can be reached by our method.
Channel Shuffle Mechanism.
The ShuffleNet Zhang et al. (2018) combines the channel shuffle mechanism with GroupConv for efficient network design, in which channels from different groups are gathered so as to facilitate intergroup communication. Without channel shuffle, stacking multiple GroupConvs will eliminate the information flow among different groups and weaken the representational capacity. Different from the handcrafted counterpart Zhang et al. (2018), the proposed channel shuffle operation is learnable over the space of all possible channel permutations. Furthermore, without bells and whistles, our channel shuffle only involves a simple permutation along the channel dimension, which can be conveniently implemented by an index operation.
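To illustrate the remark that a channel permutation reduces to an index operation, the following minimal sketch treats a feature map as a plain list of channels. The helper names `shuffle_channels` and `invert` are hypothetical; on an (N, C, H, W) tensor in PyTorch the same effect would typically be obtained by indexing the channel dimension, e.g. `x[:, perm]`.

```python
def shuffle_channels(features, perm):
    """Reorder channels so that output channel i is input channel perm[i]."""
    if sorted(perm) != list(range(len(features))):
        raise ValueError("perm must be a permutation of the channel indices")
    return [features[p] for p in perm]


def invert(perm):
    """Return the inverse permutation, which undoes shuffle_channels."""
    inverse = [0] * len(perm)
    for position, source in enumerate(perm):
        inverse[source] = position
    return inverse


channels = ["c0", "c1", "c2", "c3"]
perm = [2, 0, 3, 1]
shuffled = shuffle_channels(channels, perm)
restored = shuffle_channels(shuffled, invert(perm))
print(shuffled, restored)
```

Since the shuffle is a pure gather, it adds no parameters and no arithmetic operations to the compressed model.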
Neural Architecture Search.
Neural Architecture Search (NAS) Zoph and Le (2017); Baker et al. (2017); Zoph et al. (2018); Real et al. (2019); Wu et al. (2019) aims to automate the process of learning neural architectures within certain budgets of computational resources. Existing NAS algorithms are developed based on reinforcement learning Zoph and Le (2017); Baker et al. (2017); Zoph et al. (2018), evolutionary search Real et al. (2017, 2019), and differentiable approaches Liu et al. (2019a); Wu et al. (2019). Our method can be viewed as a special case of hyperparameter (i.e., cardinality) optimization and neuron connectivity search. However, different from existing approaches that evaluate on numerous architectures, the proposed method can determine the compressed architecture in one single training pass and is more scalable than most NAS methods.
3 Model-Agnostic Structured Sparsification
3.1 Overview
The structured sparsification method is designed to zero out a proportion of the convolutional weights so that the vanilla convolutions can be converted into group convolutions (GroupConvs), and meanwhile the optimal neuron connectivity can be learned. We adopt the “train, compress, finetune” pipeline, in a way similar to the recent pruning approaches Liu et al. (2017). Concretely, we first train a large model with structured regularization, then convert vanilla convolutions into GroupConvs under certain criteria, and finally finetune the compressed model to recover accuracy. The connectivity patterns are learned during training, as the structured regularization heavily depends on them. As such, three issues need to be addressed: 1) how to learn the connectivity patterns (Sec. 3.2); 2) how to design the structured regularization (Sec. 3.3); and 3) how to decide the grouping criteria (Sec. 3.4). Additional details of our pipeline are presented in Sec. 3.5.
3.2 Learning Connectivity with Linear Programming
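Let $F \in \mathbb{R}^{C^{\mathrm{in}} \times H \times W}$ be the input feature map, where $C^{\mathrm{in}}$ denotes the number of input channels. We apply a vanilla convolution with weights $W \in \mathbb{R}^{C^{\mathrm{out}} \times C^{\mathrm{in}} \times k \times k}$ to $F$ and obtain the output feature map $F'$ with $C^{\mathrm{out}}$ channels:
$$F'_j = \sum_{i=1}^{C^{\mathrm{in}}} W_{j,i} * F_i, \quad j = 1, \dots, C^{\mathrm{out}}. \tag{1}$$
In Eq. 1, the $i$-th channel of $F$ relates to the $j$-th channel of $F'$ via weights $W_{j,i}$. Motivated by the norm-based importance estimation in filter pruning Li et al. (2017); Liu et al. (2017), we quantify the importance of the connection between the $i$-th channel of $F$ and the $j$-th channel of $F'$ by $\|W_{j,i}\|$. Thus, the importance matrix $S \in \mathbb{R}^{C^{\mathrm{out}} \times C^{\mathrm{in}}}$ can be defined as the norm along the “kernel size” dimensions of $W$, i.e., $S_{j,i} = \|W_{j,i}\|$.
Next, we extend our formulation to GroupConvs with cardinality $G$. A GroupConv can be considered as a convolution with sparse neuron connectivity, in which only a proportion of input channels is visible to each output channel. Without loss of generality, we assume both $C^{\mathrm{in}}$ and $C^{\mathrm{out}}$ are divisible by $G$, and Eq. 1 can be adapted as
$$F'_j = \sum_{i=(g_j - 1) n + 1}^{g_j n} W_{j,i} * F_i, \tag{2}$$
where $g_j$ indicates that the $j$-th output channel belongs to the $g_j$-th group, and $n = C^{\mathrm{in}} / G$ denotes the number of input channels within each group. Clearly, the valid entries of $W$ form a block diagonal matrix with equally-split blocks at the input/output channel dimensions. Thus, the GroupConv module requires $C^{\mathrm{out}} C^{\mathrm{in}} k^2 / G$ parameters and $C^{\mathrm{out}} C^{\mathrm{in}} k^2 H W / G$ FLOPs for processing the feature $F$, and the computational complexity is reduced by a factor of $G$ compared with the vanilla counterpart.
We note that if a vanilla convolution operator can be converted into a GroupConv without affecting its functional property (we call such convolution operators groupable), the convolutional weights $W$ must be block diagonal after certain permutations along the input/output channel dimensions. Due to the positive definiteness of the norm and the fact that permuting $W$ corresponds to permuting $S$, a necessary and sufficient condition for a convolution operator to be groupable is that
$$\exists\, P \in \mathcal{P}^{C^{\mathrm{out}}},\ Q \in \mathcal{P}^{C^{\mathrm{in}}} \ \text{such that}\ P S Q \ \text{is block diagonal with equally-split blocks}, \tag{3}$$
where $\mathcal{P}^{n}$ denotes the set of $n \times n$ permutation matrices. Here, the permutation matrices $P$ and $Q$ shuffle the channels of the output and input features, respectively, and thus determine the connectivity pattern between $F$ and $F'$ (see Fig. 2).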
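The groupability condition can be made concrete on a toy weight norm matrix. The sketch below is only an illustration: the matrix values, the helper names `permute` and `is_block_diagonal`, and the representation of permutations as index lists are assumptions, not part of the original formulation.

```python
def permute(matrix, row_perm, col_perm):
    """Apply a row (output channel) and column (input channel) permutation."""
    return [[matrix[r][c] for c in col_perm] for r in row_perm]


def is_block_diagonal(matrix, groups, tol=0.0):
    """Check that all mass lies in `groups` equally-split diagonal blocks."""
    rows, cols = len(matrix), len(matrix[0])
    if rows % groups or cols % groups:
        raise ValueError("dimensions must be divisible by groups")
    row_block, col_block = rows // groups, cols // groups
    for j in range(rows):
        for i in range(cols):
            off_block = j // row_block != i // col_block
            if off_block and abs(matrix[j][i]) > tol:
                return False
    return True


# Toy weight norm matrix: rows are output channels, columns are input channels.
S = [[1.0, 0.0, 2.0, 0.0],
     [0.0, 3.0, 0.0, 1.0],
     [4.0, 0.0, 1.0, 0.0],
     [0.0, 2.0, 0.0, 5.0]]

before = is_block_diagonal(S, groups=2)
after = is_block_diagonal(permute(S, [0, 2, 1, 3], [0, 2, 1, 3]), groups=2)
print(before, after)
```

The matrix is not block diagonal as stored, but becomes so once output channels and input channels are both reordered as (0, 2, 1, 3), which is exactly what makes the corresponding operator groupable with two groups.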
However, a randomly initialized and trained convolution operator can by no means be groupable unless special sparsity constraints are imposed. To this end, we resort to permuting the weight norm matrix so as to make it “as block diagonal as possible”. The next question is how to rigorously define the term “as block diagonal as possible”. Here, we assume that the numbers of input and output channels are both powers of 2, an assumption satisfied by the most widely-used backbone architectures (e.g., VGG Simonyan and Zisserman (2015) and ResNet He et al. (2016)).
(4) 
Solving Eq. 4 gives the optimal connectivity pattern between the adjacent layers. However, minimization over the set of permutation matrices is a non-convex and NP-hard problem that requires combinatorial search. To this end, we relax the feasible space to its convex hull. The Birkhoff-von Neumann theorem Birkhoff (1946) states that the convex hull of the set of $n \times n$ permutation matrices is the set of doubly-stochastic matrices, known as the Birkhoff polytope:
$$\mathcal{B}^{n} = \{ X \in \mathbb{R}_{+}^{n \times n} : X \mathbf{1}_n = \mathbf{1}_n,\ X^{\top} \mathbf{1}_n = \mathbf{1}_n \}, \tag{5}$$
where $\mathbf{1}_n$ denotes the $n$-dimensional column vector composed of ones.
We solve Eq. 4 with coordinate descent. That is, we iteratively update the two permutation matrices $P$ and $Q$ until convergence. When updating one variable, we consider the other as fixed. For example, when optimizing $P$ with $Q$ fixed, the objective function can be transformed as follows:
(6) 
As the objective is a linear function of the permutation matrix being optimized and the Birkhoff polytope is a convex polytope, Eq. 6 is a linear programming problem, which can be solved by efficient algorithms such as the network simplex method Bonneel et al. (2011). In addition, the theory of linear programming guarantees that at least one of the solutions is achieved at a vertex of the polytope, and the vertices of the Birkhoff polytope are precisely the permutation matrices Birkhoff (1946). Thus, in Eq. 6, minimization over the Birkhoff polytope is equivalent to minimization over the set of permutation matrices, and the solution is naturally a permutation matrix without the need for post-processing. Furthermore, Eq. 6 has the same formulation as the optimal transport problem, and sophisticated computation library
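A single coordinate-descent step can be sketched as a linear assignment problem. The sketch assumes that the objective has the form $\langle P M, R \rangle$, where $M$ is the weight norm matrix after the fixed input-channel permutation and $R$ is a hypothetical penalty matrix that charges off-block entries; neither this form nor the names `assignment_cost` and `solve_assignment` are taken from the original text. Exhaustive search over permutations stands in for the network simplex method here, which is only feasible for toy sizes.

```python
from itertools import permutations


def assignment_cost(penalty, m):
    """cost[a][c]: penalty paid when original row c of m is placed at row a.

    Follows from <P M, R> = sum_{a,c} P[a][c] * sum_b R[a][b] * M[c][b].
    """
    return [[sum(p * v for p, v in zip(p_row, m_row)) for m_row in m]
            for p_row in penalty]


def solve_assignment(cost):
    """Brute-force minimizer over permutations (a stand-in for an LP solver)."""
    n = len(cost)
    return min(permutations(range(n)),
               key=lambda perm: sum(cost[a][perm[a]] for a in range(n)))


# M: toy weight norms with the column permutation already applied (assumed).
M = [[1, 2, 0, 0], [0, 0, 3, 1], [4, 1, 0, 0], [0, 0, 2, 5]]
# R: assumed penalty, 1 on entries outside the two diagonal blocks.
R = [[0, 0, 1, 1], [0, 0, 1, 1], [1, 1, 0, 0], [1, 1, 0, 0]]

cost = assignment_cost(R, M)
perm = solve_assignment(cost)
objective = sum(cost[a][perm[a]] for a in range(len(perm)))
permuted = [M[c] for c in perm]
print(perm, objective, permuted)
```

Because the minimizer is itself a permutation, the result can be applied directly as a row reordering, in line with the observation that no post-processing is needed.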
3.3 Structured Regularization
Permutation alone does not suffice to induce structurally sparse convolutional weights, and we still need to impose special sparsity regularization to achieve the desired sparsity structure. Inspired by the sparsity-inducing penalty in Liu et al. (2017), we impose the structured regularization on the permuted weight norm matrix. We first define the group level as illustrated in Fig. 3, which indicates the current cardinality achieved and is determined in Sec. 3.4. Then, given the current group level, the structured regularization is formulated as
(7) 
where denotes the regular data loss (standard classification loss in the following experiments) and is the balancing scalar.
3.4 Criteria of Learning Cardinality
With the structurally sparsified convolutional weights, the next step is to determine the cardinality. The core idea of our criteria is that the weight norms corresponding to the valid connections constitute at least a certain proportion of the total weight norms. At group level , the following requirement should be satisfied:
(8) 
where is a threshold set to 0.9 in all of our experiments, and is the relationship matrix Zhang et al. (2019) as illustrated in Fig. 3(c). The matrix specifies the valid neuron connections at group level . Therefore, the current group level can be determined by
(9) 
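The criterion can be sketched as follows, under the assumption that the requirement compares the sum of weight norms inside the diagonal blocks with the total sum, and that candidate cardinalities are powers of 2. The function names and the toy matrix are hypothetical, and the matrix is assumed to be already permuted.

```python
def valid_mass_ratio(norms, groups):
    """Share of the total weight norm that lies in the diagonal blocks."""
    rows, cols = len(norms), len(norms[0])
    row_block, col_block = rows // groups, cols // groups
    total = sum(sum(row) for row in norms)
    valid = sum(norms[j][i]
                for j in range(rows) for i in range(cols)
                if j // row_block == i // col_block)
    return valid / total


def select_cardinality(norms, threshold=0.9):
    """Largest power-of-2 cardinality whose valid connections keep enough mass.

    Blocks for 2G are nested in blocks for G, so the ratio never increases
    and the search can stop at the first failure.
    """
    best, groups = 1, 2
    while (groups <= min(len(norms), len(norms[0]))
           and len(norms) % groups == 0 and len(norms[0]) % groups == 0):
        if valid_mass_ratio(norms, groups) < threshold:
            break
        best = groups
        groups *= 2
    return best


# Permuted toy weight norms with small residual off-block entries.
S_perm = [[1.0, 2.0, 0.1, 0.0],
          [4.0, 1.0, 0.0, 0.0],
          [0.0, 0.0, 3.0, 1.0],
          [0.0, 0.1, 2.0, 5.0]]

cardinality = select_cardinality(S_perm)
print(cardinality, valid_mass_ratio(S_perm, 2), valid_mass_ratio(S_perm, 4))
```

In this example two groups retain about 99% of the total norm and pass the 0.9 threshold, whereas four groups retain only about 52%, so the layer would be converted with cardinality 2.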
3.5 Pipeline Details
Methods          #Params. ($10^5$)   FLOPs ($10^7$)   Acc. (%)
ResNet-20
Baseline         2.20                3.53             91.70 (0.12)
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Slimming-40%     1.91 (0.00)         3.10 (0.02)      91.74 (0.35)
MASS-20%         1.76 (0.00)         3.18 (0.07)      91.79 (0.23)
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Slimming-60%     1.36 (0.02)         2.24 (0.01)      89.68 (0.38)
MASS-40%         1.31 (0.01)         2.58 (0.00)      91.42 (0.04)
ResNet-56
Baseline         5.90                9.16             93.50 (0.19)
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Slimming-60%     4.15 (0.03)         5.75 (0.10)      93.10 (0.25)
MASS-30%         4.08 (0.05)         7.17 (0.20)      94.19 (0.16)
MASS-50%         2.96 (0.03)         4.81 (0.03)      93.70 (0.06)
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Slimming-80%     2.33 (0.04)         3.50 (0.02)      91.01 (0.02)
MASS-60%         2.34 (0.08)         4.20 (0.08)      93.48 (0.13)
MASS-70%         1.80 (0.00)         3.52 (0.16)      93.25 (0.02)
ResNet-110
Baseline         11.47               17.59            94.62 (0.22)
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Slimming-40%     9.24 (0.03)         12.55 (0.00)     94.49 (0.12)
MASS-20%         9.12 (0.06)         14.76 (0.02)     94.78 (0.11)
MASS-40%         6.69 (0.24)         11.60 (0.01)     94.55 (0.18)
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Slimming-60%     8.15 (0.03)         10.66 (0.00)     94.29 (0.11)
MASS-30%         7.89 (0.03)         12.47 (0.01)     94.69 (0.08)
MASS-60%         5.41 (0.02)         10.66 (0.01)     94.42 (0.04)
Implementation Details.
Our implementation is based on the PyTorch Steiner et al. (2019) library. The proposed method is applied to the ResNet He et al. (2016) and DenseNet Huang et al. (2017) families, and evaluated on the CIFAR Krizhevsky and Hinton (2009) and ImageNet Russakovsky et al. (2015) datasets. For the CIFAR dataset, we follow the common practice of data augmentation He et al. (2016); Liu et al. (2017); Xie et al. (2017): zero-padding of 4 pixels on each side of the image, random crop of a $32 \times 32$ patch, and random horizontal flip. For fair comparisons, we utilize the same network architecture as Liu et al. (2017), and the model is trained on a single GPU with a batch size of 64. For the ImageNet dataset, we adopt the standard data augmentation strategy Simonyan and Zisserman (2015); He et al. (2016); Xie et al. (2017): image resize such that the shortest edge is of 256 pixels, random crop of a $224 \times 224$ patch, and random horizontal flip. The overall batch size is 256, which is distributed across 4 GPUs. For both datasets, we employ the SGD optimizer with momentum 0.9. The source code and trained models will be made available to the public upon acceptance.
Training Protocol.
For the first stage, we train a large model from scratch with the structured regularization described in Sec. 3.3. At the end of each epoch, we update the permutation matrices as in Sec. 3.2, determine the current group levels as in Sec. 3.4, and adjust the structured regularization matrices accordingly. We train with a fixed learning rate of 0.1 for 100 epochs on the CIFAR dataset and 60 epochs on the ImageNet dataset, and exclude the weight decay because the structured regularization is already in place. The coefficient $\lambda$ is dynamically adjusted to meet the desired compression rate (see the supplementary materials). The training pipeline is summarized in Alg. 1.
Finetune Protocol.
The remaining parameters are restored from the training stage and the compressed model is finetuned with an initial learning rate of 0.1. We finetune for 160 epochs on the CIFAR dataset and the learning rate decays by a factor of 10 at 50% and 75% of the total epochs. On the ImageNet dataset, the learning rate is decayed according to the cosine annealing strategy Loshchilov and Hutter (2017) within 120 epochs. For both datasets, a standard weight decay of is adopted to prevent overfitting.
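The two finetuning schedules can be written down explicitly. The sketch below assumes epoch-level updates with 0-based epoch indices and a final learning rate of zero for cosine annealing; these details, as well as the function names, are assumptions and not stated in the text.

```python
import math


def step_lr(epoch, total_epochs, base_lr=0.1, factor=0.1):
    """Step schedule: decay by `factor` at 50% and 75% of the total epochs."""
    lr = base_lr
    for milestone in (0.5, 0.75):
        if epoch >= milestone * total_epochs:
            lr *= factor
    return lr


def cosine_lr(epoch, total_epochs, base_lr=0.1, min_lr=0.0):
    """Cosine annealing from base_lr down to min_lr over total_epochs."""
    progress = epoch / total_epochs
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


# CIFAR: 160 epochs with step decay; ImageNet: 120 epochs with cosine annealing.
cifar_lrs = [step_lr(e, 160) for e in (0, 79, 80, 119, 120, 159)]
imagenet_lrs = [cosine_lr(e, 120) for e in (0, 60, 120)]
print(cifar_lrs, imagenet_lrs)
```

Under these assumptions the CIFAR learning rate is 0.1 for epochs 0 to 79, 0.01 for epochs 80 to 119, and 0.001 afterwards, while the ImageNet learning rate decreases smoothly from 0.1 and reaches half its initial value at the midpoint.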
Methods          #Params. ($10^6$)   GFLOPs   Acc. (%)
ResNet-50
Baseline         25.6                4.14     77.10
- - - - - - - - - - - - - - - - - - - - - - - - - - - - -
NISP-A           18.6                2.97     72.75
Slimming-20%     17.8                2.81     75.12
Taylor-19%       17.9                2.66     75.48
FPGM-30%         N/A                 2.39     75.59
MASS-35%         17.2                3.12     76.82
- - - - - - - - - - - - - - - - - - - - - - - - - - - - -
ThiNet-30%       16.9                2.62     72.04
NISP-B           14.3                2.29     72.07
Taylor-28%       14.2                2.25     74.50
FPGM-40%         N/A                 1.93     74.83
MASS-65%         10.3                1.67     75.10
- - - - - - - - - - - - - - - - - - - - - - - - - - - - -
ThiNet-50%       12.4                1.83     71.01
Taylor-44%       7.9                 1.34     71.69
Slimming-50%     11.1                1.87     71.99
MASS-85%         5.6                 0.90     72.47
ResNet-101
Baseline         44.5                7.87     78.64
- - - - - - - - - - - - - - - - - - - - - - - - - - - - -
FPGM-30%         N/A                 4.55     77.32
Taylor-25%       31.2                4.70     77.35
MASS-40%         26.7                5.05     78.16
- - - - - - - - - - - - - - - - - - - - - - - - - - - - -
BN-ISTA-v1       17.3                3.69     74.56
BN-ISTA-v2       23.6                4.47     75.27
Taylor-45%       20.7                2.85     75.95
Slimming-50%     20.9                3.16     75.97
MASS-65%         16.5                2.98     77.62
- - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Taylor-60%       13.6                1.76     74.16
MASS-80%         10.6                1.70     75.73
DenseNet-201
Baseline         20.0                4.39     77.88
- - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Taylor-40%       12.5                3.02     76.51
MASS-38%         13.1                3.53     77.43
- - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Taylor-64%       9.0                 2.21     75.28
MASS-60%         9.2                 2.10     75.86
4 Experiments and Analysis
In this section, we present the experimental results on the CIFAR and ImageNet datasets. In addition, we carry out ablation studies to demonstrate the effectiveness of components of the proposed method.
4.1 Results on CIFAR
We first compare our proposed method with the Network Slimming Liu et al. (2017) approach on the CIFAR-10 dataset. The Network Slimming approach is a representative pruning method that compresses CNN models by pruning less important filters. As the experimental results on the CIFAR-10 dataset are subject to random variation, we repeat the train-compress-finetune pipeline 10 times and report the mean and standard deviation (std). As shown in Tab. 1, the proposed MASS method performs favorably under various compression rates. For ResNet-110, with 60% of the parameters compressed, MASS can still achieve 94.42% top-1 accuracy, which is close to the 94.62% of the uncompressed baseline. Compared with Network Slimming, MASS consistently performs better, especially under high compression rates. Experiments on the CIFAR-10 dataset demonstrate that MASS is able to compress CNNs with a negligible performance drop and compares favorably in accuracy with pruning methods such as Network Slimming.
4.2 Results on ImageNet
Tab. 2 shows the evaluation results of the proposed method against the state-of-the-art network pruning approaches, including ThiNet Luo et al. (2017), Slimming Liu et al. (2017), NISP Yu et al. (2018), BN-ISTA Ye et al. (2018), FPGM He et al. (2019), and Taylor Molchanov et al. (2019). Overall, the MASS method performs favorably against the state-of-the-art network compression methods under different settings. These performance gains can be attributed to the fact that discarding entire filters negatively affects the representational strength of the network model, especially when the pruning ratio is high, e.g., 50%. In contrast, the MASS method removes only a proportion of neuron connections and preserves all of the filters, thereby having only a mild impact on the model capacity. In addition, it is known that pruning neuron connections would eliminate the information flow and affect performance. To alleviate this issue, the learnable channel shuffle mechanism assists the information exchange among different groups, thereby minimizing the potential negative impact.
4.3 Ablation Studies
Accuracy vs. Complexity.
As shown in Fig. 1, the proposed MASS method is designed to make a sound accuracy-complexity trade-off. On the ImageNet Krizhevsky et al. (2012) dataset, a slight top-1 accuracy drop of 0.28% is traded for about 25% complexity reduction on the ResNet-50 backbone, and an accuracy loss of 1.02% for about 60% reduction on ResNet-101. Furthermore, high compression rates can be achieved in our methodology while maintaining competitive performance. It is worth noting that our method achieves an accuracy of 72.47% with only about 20% of the complexity of the ResNet-50 backbone, which compares favorably with pruning methods of twice the complexity.
Config.         ResNet-50-65%        ResNet-101-65%
Acc.            Top-1     Top-5      Top-1     Top-5
Finetune        75.10     92.52      77.62     93.72
FromScratch     75.02     92.46      77.14     93.53
ShuffleNet      74.97     92.41      76.91     93.38
Random          69.45     89.45      73.16     91.44
NoShuffle       73.30     91.39      75.31     92.64
Learned Channel Shuffle Mechanism.
We evaluate the effectiveness of our learned channel shuffle mechanism on the ResNet backbone with a compression rate of 65%. We use the following five settings for performance evaluation:

Finetune: The preserved parameters after compression are restored and the compressed model is finetuned. For the other four settings, the parameters of the compressed model are reinitialized for the finetune stage.

FromScratch: We keep the learned channel connectivity, i.e., the permutation matrices $P$ and $Q$, from the training stage, and train the model from randomly reinitialized weights.

ShuffleNet: The same channel shuffle operation as in the ShuffleNet Zhang et al. (2018) is adopted. Specifically, if a convolution is of cardinality $G$ and has $G \times n$ output channels, then the channel shuffle operation is equivalent to reshaping the output channel dimension into $(G, n)$, transposing and flattening it back. Compared with ShuffleNet, the way of channel shuffle is learned rather than predefined in our method, i.e., Finetune and FromScratch.

Random: The permutation matrices $P$ and $Q$ are randomly given, independent of the learned ones.

NoShuffle: The channel shuffle operations are removed, i.e., the permutation matrices $P$ and $Q$ are identity matrices.
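The handcrafted and baseline shuffles in the settings above can be expressed as index lists. The following is a minimal sketch with hypothetical function names; the random variant uses a seeded generator purely for reproducibility of the illustration.

```python
import random


def shufflenet_permutation(groups, channels_per_group):
    """Index list equivalent to reshape (G, n) -> transpose -> flatten."""
    return [g * channels_per_group + c
            for c in range(channels_per_group)
            for g in range(groups)]


def identity_permutation(num_channels):
    """NoShuffle setting: every channel stays in place."""
    return list(range(num_channels))


def random_permutation(num_channels, seed=0):
    """Random setting: a permutation drawn independently of any learning."""
    order = list(range(num_channels))
    random.Random(seed).shuffle(order)
    return order


shufflenet = shufflenet_permutation(groups=2, channels_per_group=3)
no_shuffle = identity_permutation(6)
rand = random_permutation(6)
print(shufflenet, no_shuffle, rand)
```

For two groups of three channels, the ShuffleNet rule interleaves the groups, so that each group of the next layer receives channels from both groups of the current one. The learned setting replaces this fixed index list with the one obtained from training.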
The results are shown in Tab. 3. First, the finetuned models perform slightly better than those trained from scratch, which implies that the preserved parameters play an essential role in the final performance. Furthermore, the model with the learned channel shuffle mechanism, i.e., neuron connectivity, performs the best among all settings. The channel shuffle mechanism in the ShuffleNet Zhang et al. (2018) is effective as it outperforms the no-shuffle counterpart. However, it can be further improved by a learning-based strategy. Interestingly, the random channel shuffle scheme performs the worst, even worse than the no-shuffle scheme. This implies that learning the channel shuffle operation is a challenging task, and randomly gathering channels from different groups gives no benefits.
4.4 Discussion
To the best of our knowledge, our work is the first to introduce structured sparsification for network compression. As there is still room for improvement, we discuss three potential directions for future work.

Data-Driven Structured Sparsification. Currently, the gradients of the data loss and those of the sparsity regularization are computed independently (Eq. 7) in each training iteration. Thus, the structured regularization is imposed uniformly on the convolutional layers, and the learned cardinality distribution is task-agnostic and prone to uniformity. However, a better cardinality distribution may be achieved if the structured sparsification is guided by the back-propagated signals of the data loss. Thus, optimization-based meta-learning techniques Finn et al. (2017) can be exploited for this purpose.

Progressive Sparsification Solution. Typically, finetune-free compression techniques are desired in practical applications Cheng et al. (2018). To this end, the sparsified weights could be removed progressively during training, so that the architecture search and the model training are completed in a single training pass.

Combination with Filter Pruning Techniques. As the entire feature maps are preserved in our approach, the reduction of memory footprint is limited. This issue can be addressed by combining our approach with filter pruning techniques, which is non-trivial as uniform filter pruning is required within each group. It is of great interest to exploit group sparsity constraints Yoon and Hwang (2017) to achieve such uniform sparsity.
5 Conclusion
In this work, we propose a method for efficient network compression. Our approach induces structurally sparse representations of the convolutional weights, and the compressed model can be readily incorporated into modern deep learning libraries thanks to their support for the group convolution. The problem of inter-group communication is further solved via the learnable channel shuffle mechanism. Our approach is model-agnostic and highly compressible with negligible performance degradation, which is validated by extensive experiments on the CIFAR and ImageNet datasets. In addition, experimental evaluation against the state-of-the-art compression approaches shows that structured sparsification can be a fruitful direction for future research.
Appendix A Structured Regularization in General Form
Generally, we can relax the constraint that the numbers of input and output channels are both powers of 2, and only assume that both have many factors of 2. Under this assumption, the potential candidates of cardinality are still restricted to powers of 2. Specifically, if the greatest common divisor of the two channel numbers can be factored as
(10) 
where is an odd integer, then the potential candidates of the group level are .
For example, if the minimal is 4 among all convolutional layers
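The relaxed condition can be illustrated with a minimal sketch, assuming that the candidate cardinalities are exactly the powers of 2 dividing the greatest common divisor of the input and output channel numbers. The function name `candidate_cardinalities` and its arguments are hypothetical and are not taken from the paper.

```python
from math import gcd


def candidate_cardinalities(c_in: int, c_out: int) -> list[int]:
    """Return the powers of 2 that divide gcd(c_in, c_out).

    Assumption: a cardinality is admissible only if it divides both
    channel numbers, and only powers of 2 are considered.
    """
    remainder = gcd(c_in, c_out)
    candidates = [1]
    # Strip factors of 2 from the gcd; what is left is the odd part.
    while remainder % 2 == 0:
        candidates.append(candidates[-1] * 2)
        remainder //= 2
    return candidates


# gcd(96, 48) = 48 = 2^4 * 3, so the candidates stop at 16.
print(candidate_cardinalities(96, 48))
```

When both channel numbers are powers of 2, the odd part is 1 and the sketch reduces to the original setting.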
Appendix B Dynamic Penalty Adjustment
As the desired compression rate is customized according to user preference, manually choosing an appropriate regularization coefficient in Eq. (7) of Sec. 3.3 for each experimental setting is extremely inefficient. To alleviate this issue, we dynamically adjust the coefficient based on the sparsification progress. The algorithm is summarized in Alg. 2.
Concretely, after each training epoch, we first determine the current group level of each convolutional layer according to Eq. (9) of Sec. 3.4. Then, we define the model sparsity based on the reduction of model parameters. For each convolutional layer, the number of parameters is reduced by a factor equal to its cardinality. Thus, the original number of parameters and the reduced one are given by
(11) 
respectively. Here, and denote the input channel number and the kernel size of the convolutional layer, respectively. Therefore, the current model sparsity is calculated as
(12) 
Afterwards, we assume the model sparsity grows linearly, and calculate the expected sparsity gain. If the expected sparsity gain is not met, i.e.,
(13) 
where is the total training epoch number and is the target sparsity, we increase by . If the model sparsity exceeds the target, i.e., , we decrease by .
In all experiments, the coefficient is initialized from and is set to .
Initialize , , ,
for to do train for 1 epoch
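The adjustment rule described in this appendix can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the sparsity is taken to be one minus the ratio of reduced to original parameters, the expected per-epoch gain is taken to be the remaining gap to the target divided by the remaining epochs, and all names (`model_sparsity`, `adjust_penalty`, `lam`, `delta`) are hypothetical. The initial coefficient and the step size are left as arguments.

```python
def model_sparsity(layers):
    """Sparsity from the parameter reduction of grouped convolutions.

    layers: iterable of (c_in, c_out, kernel_size, cardinality).
    Assumption: sparsity = 1 - reduced_params / original_params.
    """
    layers = list(layers)
    original = sum(c_in * c_out * k * k for c_in, c_out, k, _ in layers)
    # A group convolution with cardinality g keeps 1/g of the weights.
    reduced = sum(c_in * c_out * k * k // g for c_in, c_out, k, g in layers)
    return 1.0 - reduced / original


def adjust_penalty(lam, delta, sparsity_prev, sparsity_now,
                   epoch, total_epochs, target):
    """Return the coefficient to use after `epoch` (1-indexed).

    Assumption: linear growth means the expected gain per epoch is the
    gap to the target spread evenly over the remaining epochs.
    """
    if sparsity_now > target:
        return lam - delta          # overshoot: relax the penalty
    remaining = total_epochs - epoch + 1
    expected_gain = (target - sparsity_prev) / remaining
    if sparsity_now - sparsity_prev < expected_gain:
        return lam + delta          # too slow: strengthen the penalty
    return lam                      # on schedule: keep the coefficient
```

In a training loop, `adjust_penalty` would be called once per epoch after the group levels have been re-evaluated.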
Appendix C Experimental Details
In this section, we provide more results and details of our experiments. We provide the loss and accuracy curves along with the performance after each stage in Sec. C.1, and analyze the compressed model architectures in Sec. C.2.
C.1 Training Dynamics
Table 4: Accuracy (%) after each stage of the pipeline and the searched threshold.
Backbone  ResNet-50  ResNet-50  ResNet-50  ResNet-101  ResNet-101  ResNet-101  DenseNet-201  DenseNet-201
Compression Rate  35%  65%  85%  40%  65%  80%  38%  60%
Pre-compression Acc.  69.07  66.36  64.30  69.56  67.13  64.20  69.10  66.26
Post-compression Acc.  60.92  42.78  8.82  65.78  58.63  18.57  66.15  17.35
Fine-tune Acc.  76.82  75.10  72.47  78.16  77.62  75.73  77.43  75.86
Threshold  0.127  0.115  0.125  0.095  0.090  0.103  0.098  0.115
We first provide the pre- and post-compression accuracy along with the fine-tune accuracy of our pipeline in Tab. 4. During compression, we use a binary search to decide the threshold of the grouping criteria (Eq. (9)) so that the network is compressed at the desired compression rate. The searched thresholds are also listed. Apart from this, we provide the training and fine-tune curves in Fig. 4. In the training stage, the accuracy gradually increases until saturation, and the compression then leads to a performance drop. Finally, the performance is recovered in the fine-tune stage.
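The threshold search can be sketched as a standard bisection. The sketch assumes that the compression rate is a non-decreasing function of the threshold on the search interval, which the text does not state. The callable `compression_rate`, the interval bounds and the tolerance are hypothetical placeholders.

```python
def search_threshold(compression_rate, target,
                     lo=0.0, hi=1.0, tol=1e-4, max_iter=50):
    """Bisect for a threshold whose compression rate matches `target`.

    compression_rate: callable mapping a threshold to the resulting
        compression rate (assumed non-decreasing in the threshold).
    """
    for _ in range(max_iter):
        mid = 0.5 * (lo + hi)
        rate = compression_rate(mid)
        if abs(rate - target) <= tol:
            return mid
        if rate < target:
            lo = mid    # not compressed enough: raise the threshold
        else:
            hi = mid    # compressed too much: lower the threshold
    # The rate changes in discrete steps, so an exact match may not exist.
    return 0.5 * (lo + hi)


# Toy stand-in for the real model: rate grows linearly, then saturates.
threshold = search_threshold(lambda t: min(1.0, 4.0 * t), target=0.5)
```

If the rate were instead decreasing in the threshold, the two branch updates would simply be swapped.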
Figure 4: Training and fine-tune curves.
C.2 Compressed Architectures
We illustrate the compressed architectures by showing the cardinality of each convolutional layer in Figs. 5 and 6. Note that our method is applied to all convolution operators, i.e., both 1×1 and 3×3 convolutions, so a high compression rate, e.g., 80%, can be achieved. Besides, as discussed in Sec. 4.4, the learned cardinality distribution is prone to uniformity, but there are still certain patterns. For example, shallow layers are relatively more difficult to compress. A possible explanation is that shallow layers have fewer filters, so a large cardinality will inevitably eliminate the communication between certain groups. Moreover, we observe that 3×3 convolutions are generally more compressible than 1×1 convolutions. This is intuitive, as 3×3 convolutions have more parameters and thus heavier redundancy.
Figure: Cardinality of each convolutional layer (ResNet-50).
Figure: Cardinality of each convolutional layer (ResNet-101).
Besides, we illustrate the learned neuron connectivity and compare it with the ShuffleNet Zhang et al. (2018) counterpart. Here, we consider the channel permutation between two group convolutions (GroupConvs) and demonstrate the connectivity via the confusion matrix. Specifically, the confusion matrix has one row for each group of the first GroupConv and one column for each group of the second, and each entry is the number of channels that come from the corresponding group of the first GroupConv and belong to the corresponding group of the second.
In Tab. 5, we can see that the inter-group communication is guaranteed, as there are connections between every two groups. Furthermore, the learnable channel shuffle scheme is more flexible. The ShuffleNet Zhang et al. (2018) scheme uniformly partitions and distributes channels within each group, while our approach allows small variations in the number of connections for each group. In this way, the network itself can control the information flow from each group by customizing its neuron connectivity. More examples can be found in Fig. 7. All models illustrated in this section are trained on the ImageNet dataset.
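The confusion matrix can be computed with a short sketch. It assumes that a shuffle is given as a list `perm`, where `perm[j]` is the index of the output channel of the first GroupConv that is fed to input position `j` of the second, and that groups are contiguous blocks of channels. The function names are hypothetical; `shufflenet_perm` reproduces the fixed reshape-and-transpose shuffle of ShuffleNet for comparison.

```python
def shuffle_confusion(perm, g1, g2):
    """Count channels moving from each source group to each target group.

    perm[j]: source channel index placed at target position j.
    g1, g2: cardinalities of the first and second GroupConv.
    """
    c = len(perm)
    if c % g1 or c % g2:
        raise ValueError("channel count must be divisible by both cardinalities")
    matrix = [[0] * g2 for _ in range(g1)]
    for dst, src in enumerate(perm):
        # Groups are assumed to be contiguous blocks of channels.
        matrix[src // (c // g1)][dst // (c // g2)] += 1
    return matrix


def shufflenet_perm(c, groups):
    """Fixed ShuffleNet shuffle: reshape to (groups, c // groups), transpose, flatten."""
    n = c // groups
    return [g * n + i for i in range(n) for g in range(groups)]


# The fixed shuffle spreads every source group evenly over the target groups.
print(shuffle_confusion(shufflenet_perm(8, 2), 2, 2))
```

A learned permutation would be passed in the same way, and its matrix may deviate slightly from the uniform one, which is the flexibility discussed above.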
Figure: Confusion matrices of the channel shuffle.
Footnotes
 For simplicity, we omit the bias term from Eq. 1, and assume the convolution operator is of stride 1 with proper paddings.
 Similar reasoning can be applied if the numbers of input and output channels both have many factors of 2. See the supplementary materials for details.
 Doubly stochastic matrices are non-negative square matrices whose rows and columns sum to one.
 https://github.com/rflamary/POT/
 Here, we simply compute the regularization of a single convolutional layer. In the experiments, the regularization is the summation of those of all the convolution layers.
 The standard DenseNet Huang et al. (2017) family satisfies this condition.
References
 Designing neural network architectures using reinforcement learning. In International Conference on Learning Representations (ICLR).
 Three observations on linear algebra. Univ. Nac. Tucumán, Rev. Ser. A 5, pp. 147–151.
 Displacement interpolation using Lagrangian mass transport. In ACM Transactions on Graphics (SIGGRAPH Asia), pp. 1–12.
 Learning efficient object detection models with knowledge distillation. In Neural Information Processing Systems (NIPS), pp. 742–751.
 Compressing neural networks with the hashing trick. In International Conference on Machine Learning (ICML), pp. 2285–2294.
 Recent advances in efficient computation of deep convolutional neural networks. Frontiers of Information Technology & Electronic Engineering 19 (1), pp. 64–77.
 Xception: deep learning with depthwise separable convolutions. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 1251–1258.
 Binarized neural networks: training deep neural networks with weights and activations constrained to +1 or -1. arXiv preprint arXiv:1602.02830.
 Model-agnostic meta-learning for fast adaptation of deep networks. In International Conference on Machine Learning (ICML), pp. 1126–1135.
 Convolutional sequence to sequence learning. In International Conference on Machine Learning (ICML), pp. 1243–1252.
 Fast R-CNN. In IEEE International Conference on Computer Vision (ICCV).
 Deep residual learning for image recognition. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 770–778.
 Filter pruning via geometric median for deep convolutional neural networks acceleration. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 4340–4349.
 Channel pruning for accelerating very deep neural networks. In IEEE International Conference on Computer Vision (ICCV), pp. 1398–1406.
 Distilling the knowledge in a neural network. In Neural Information Processing Systems (NIPS).
 Mobilenets: efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861.
 Condensenet: an efficient densenet using learned group convolutions. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2752–2761.
 Densely connected convolutional networks. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 4700–4708.
 Learning multiple layers of features from tiny images. Technical report, Citeseer.
 Imagenet classification with deep convolutional neural networks. In Neural Information Processing Systems (NIPS), pp. 1097–1105.
 Pruning filters for efficient convnets. In International Conference on Learning Representations (ICLR).
 DARTS: differentiable architecture search. In International Conference on Learning Representations (ICLR).
 Search to distill: pearls are everywhere but not the eyes. arXiv preprint arXiv:1911.09074.
 Learning efficient convolutional networks through network slimming. In IEEE International Conference on Computer Vision (ICCV), pp. 2736–2744.
 Fully convolutional networks for semantic segmentation. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 3431–3440.
 SGDR: stochastic gradient descent with warm restarts. In International Conference on Learning Representations (ICLR).
 Thinet: a filter level pruning method for deep neural network compression. In IEEE International Conference on Computer Vision (ICCV), pp. 5058–5066.
 Shufflenet v2: practical guidelines for efficient cnn architecture design. In European Conference on Computer Vision (ECCV), pp. 116–131.
 Improved knowledge distillation via teacher assistant: bridging the gap between student and teacher. In Association for the Advancement of Artificial Intelligence (AAAI).
 Importance estimation for neural network pruning. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 11264–11272.
 XNOR-Net: imagenet classification using binary convolutional neural networks. In European Conference on Computer Vision (ECCV), pp. 525–542.
 Regularized evolution for image classifier architecture search. In Association for the Advancement of Artificial Intelligence (AAAI), Vol. 33, pp. 4780–4789.
 Large-scale evolution of image classifiers. In International Conference on Machine Learning (ICML), pp. 2902–2911.
 ImageNet Large Scale Visual Recognition Challenge. International Journal of Computer Vision (IJCV) 115 (3), pp. 211–252.
 Mobilenetv2: inverted residuals and linear bottlenecks. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 4510–4520.
 Very deep convolutional networks for large-scale image recognition. In International Conference on Learning Representations (ICLR).
 PyTorch: an imperative style, high-performance deep learning library. In Neural Information Processing Systems (NIPS).
 Fully learnable group convolution for acceleration of deep neural networks. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 9049–9058.
 FBNet: hardware-aware efficient convnet design via differentiable neural architecture search. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 10734–10742.
 Quantized convolutional neural networks for mobile devices. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 4820–4828.
 Aggregated residual transformations for deep neural networks. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 1492–1500.
 Rethinking the smaller-norm-less-informative assumption in channel pruning of convolution layers. In International Conference on Learning Representations (ICLR).
 A gift from knowledge distillation: fast optimization, network minimization and transfer learning. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 4133–4141.
 Combined group and exclusive sparsity for deep neural networks. In International Conference on Machine Learning (ICML), pp. 3958–3966.
 NISP: pruning networks using neuron importance score propagation. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 9194–9203.
 Shufflenet: an extremely efficient convolutional neural network for mobile devices. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 6848–6856.
 Differentiable learning-to-group channels via groupable convolutional neural networks. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 3542–3551.
 Neural architecture search with reinforcement learning. In International Conference on Learning Representations (ICLR).
 Learning transferable architectures for scalable image recognition. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 8697–8710.