Self-grouping Convolutional Neural Networks
Abstract
Although group convolution operators are increasingly used in deep convolutional neural networks to improve computational efficiency and to reduce the number of parameters, most existing methods construct their group convolution architectures by a predefined partitioning of the filters of each convolutional layer into multiple regular filter groups with an equal spatial group size and data-independence, which prevents a full exploitation of their potential. To tackle this issue, we propose a novel method of designing self-grouping convolutional neural networks, called SG-CNN, in which the filters of each convolutional layer group themselves based on the similarity of their importance vectors. Concretely, for each filter, we first evaluate the importance values of its input channels to identify its importance vector, and then group these vectors by clustering. Using the resulting data-dependent centroids, we prune the less important connections, which implicitly minimizes the accuracy loss of the pruning, thus yielding a set of diverse group convolution filters. Subsequently, we develop two fine-tuning schemes, i.e. (1) both local and global fine-tuning and (2) global-only fine-tuning, which experimentally deliver comparable results, to recover the recognition capacity of the pruned network. Comprehensive experiments carried out on the CIFAR-10/100 and ImageNet datasets demonstrate that our self-grouping convolution method adapts to various state-of-the-art CNN architectures, such as ResNet and DenseNet, and delivers superior performance in terms of compression ratio, speedup and recognition accuracy. We demonstrate the ability of SG-CNN to generalise by transfer learning, including domain adaptation and object detection, showing competitive results. Our source code is available at https://github.com/QingbeiGuo/SGCNN.git.
keywords:
Deep Neural Network, Group Convolution, Compression, Acceleration
1 Introduction
Recently, enormous progress has been made with deep neural networks in various computer vision tasks, such as image classification Krizhevsky et al. (2012); Simonyan and Zisserman (2014); Zhang et al. (2016), object detection Girshick et al. (2014); Girshick (2015); Ren et al. (2015), semantic segmentation Long et al. (2015); Ronneberger et al. (2015); Zhou et al. (2018) and visual tracking Xu et al. (2018). Increasingly deep network architectures are designed to improve performance by optimising a huge set of parameters, involving heavy computation. However, most embedded systems and mobile platforms cannot afford such huge memory requirements and intensive computation due to their constrained resources Guo et al. (2019). This severely impedes the application of deep neural networks. Ample evidence shows that deep neural networks tend to be over-parameterised, and can be compressed with little or no loss of accuracy. Many methods have been proposed to compress and accelerate deep neural networks, including pruning Han et al. (2015); Hu et al. (2016); Wen et al. (2016); Luo et al. (2018), quantization Courbariaux et al. (2015); Rastegari et al. (2016); Li et al. (2016a); Zhu et al. (2016), low-rank decomposition Masana et al. (2017); Peng et al. (2018), and the design of compact architectures Iandola et al. (2016); Howard et al. (2017); Zhang et al. (2018); Huang et al. (2017).
The key processing step in convolutional neural networks is convolution, in which each output channel corresponds to one filter applied over all of the input channels. Different from regular convolution, group convolution divides the input channels into multiple disjoint filter groups, and convolutions are performed independently within each group to reduce the computation budget and parameter cost. Since group convolution has an efficient, compact structure, and is particularly suitable for mobile and embedded applications, it has been attracting increasing interest as a means to compress and accelerate deep neural networks. These two convolutional architectures are illustrated in Fig. 1 (a) and (b), respectively. Group convolution was first used in AlexNet Krizhevsky et al. (2012) to handle the shortage of GPU memory, and surprisingly it delivered remarkable image classification performance on ImageNet. Inspired by this idea, Xie et al. (2017) constructed an efficient architecture, named ResNeXt, by combining a stacking strategy and a multi-branch architecture with group convolution, achieving a better classification result on ImageNet than its ResNet counterpart at a lower computational complexity. Zhang et al. (2017a) presented a modularized neural network built by stacking interleaved group convolution (IGC) blocks, composed of primary and secondary group convolutions. To improve the representational power, IGC permutes the output channels of the primary group convolutions to serve as input channels of the secondary group convolutions. Similarly, ShuffleNet Zhang et al. (2018) introduced an efficient architecture in which the two operations of pointwise group convolution and channel shuffle are adopted to significantly reduce the computational complexity without degrading classification accuracy. Based on a similar idea, Gao et al. (2018) used a channel-wise convolution to fuse the features output by prior independent groups. These methods permute the output channels of each group and distribute them over all the groups of the subsequent convolutional layer, such that the features of different groups interact with each other in a predefined manner. This type of architecture, shown in Fig. 1 (c), is called permuting group convolution. Huang et al. (2018) proposed a learned group convolution, in which a compact network architecture, termed CondenseNet, is constructed using dense connectivity, as shown in Fig. 1 (d). CondenseNet is distinguished from the above methods in that each input channel is incorporated into a filter group by learning, rather than being predetermined. It exhibits a better computational efficiency than MobileNet Howard et al. (2017) and ShuffleNet Zhang et al. (2018) at the same level of accuracy.
The above methods aim at selecting input channels for each filter group to improve the performance of deep neural networks. However, they are constrained by predefined group structures. A fixed assignment of filters to independent groups is not conducive to enhancing the recognition capability of deep neural networks. Firstly, the initial filter grouping in predefined designs is data-independent. Secondly, because of their simplicity, these group convolution architectures, in which each group has the same number of filters and input channels, are prevented from realising their potential representation capacity. We hypothesise that filter groups should not be homogeneous, but rather diverse in spatial group size, so that the diversity of the architectural features of group convolution can exploit the representational potential of deep neural networks. PolyNet Zhang et al. (2017b) has verified that structural diversity can improve image recognition performance, acting as an additional dimension of optimisation beyond depth and width in network design.
In this paper, we propose a novel method of self-grouping convolutional neural networks, which automatically groups the filters of each convolutional layer by clustering, instead of by a predefined partition, to compress and accelerate deep neural networks. A neural network guides each filter to learn different representations from its input information through training, and each input channel plays a different role in such representations. For each filter, we first evaluate the importance of its input channels by an importance vector, each element of which conveys the importance value of the corresponding input channel. We then learn the filter groups by clustering the importance vectors, which is data-dependent. Considering the redundancy of parameters, a sparse structure is imposed on each filter group by pruning its unimportant connections based on the cluster centroids. In this way, we convert regular convolutions into self-grouping convolutions, where the diversity of group structures is promoted by differences in spatial group size. This distinguishes the proposed method from existing group convolutions Krizhevsky et al. (2012); Zhang et al. (2018, 2017a); Huang et al. (2018). Subsequently, we compensate for the accuracy loss from pruning by two fine-tuning schemes, namely (1) global-only fine-tuning and (2) both local and global fine-tuning. The computational complexity of the resulting efficient and compact self-grouping convolutional neural network and its memory requirements are further reduced by extending the proposed self-grouping approach to the fully-connected layers.
In Fig. 1, we illustrate the evolution of group convolutions, from regular group convolution, through permuting group convolution and learned group convolution, to our self-grouping convolution. Through comprehensive experiments using various state-of-the-art CNN architectures, including ResNet He et al. (2016) and DenseNet Huang et al. (2017), we show that our SG-CNN significantly reduces the size of network models and accelerates inference on the popular vision datasets CIFAR-10/100 Krizhevsky and Hinton (2009) and ImageNet Russakovsky et al. (2015), achieving superior performance. We present an ablation study that compares the performance of the proposed scheme under different conditions, providing a deeper insight into the properties of SG-CNNs. Furthermore, we also investigate the amenability of our SG-CNN to generalisation by transfer learning, such as domain adaptation and object detection.
The main contributions of this paper are summarized as follows:

A self-grouping convolution method for the compression and acceleration of deep neural networks, which automatically converts regular convolutions into data-dependent group convolutions with diverse group structures, learned by filter clustering based on importance vectors and by network pruning based on cluster centroids.

Our self-grouping method applies to the fully-connected layers as well as the convolutional layers, for extreme compression and acceleration.

The proposed self-grouping method supports global-only fine-tuning for efficient network compression, preserving most of the information flow through data-dependent and diverse group structures.

Comprehensive experiments testify that our self-grouping approach can be effectively applied to various state-of-the-art CNN architectures, including ResNet and DenseNet, with a high compression ratio, low FLOPs, and little or no loss in accuracy.
The rest of this paper is organized as follows: we first review the related work in Section 2. We present our self-grouping convolution method in Section 3. In Section 4, our self-grouping convolution is compared with previous group convolutions via matrix decomposition, to elaborate its data-dependence and structural diversity. Subsequently, we validate SG-CNN and show its superior performance through comprehensive experiments involving various network models and datasets in Section 5. We present an ablation study, which enhances the understanding of SG-CNNs, in Section 6. We also investigate the generalization ability of SG-CNNs by transfer learning in Section 7. Finally, we conclude the paper in Section 8.
2 Related Work
Pruning methods. Pruning is one of the most widely used methods to compress and accelerate deep neural networks. Pruning methods can be divided into structural and non-structural ones, according to the sparsity of the resulting spatial patterns. Han et al. (2015) proposed a simple non-structural pruning strategy to compress deep neural networks by removing the connections corresponding to unimportant weights. Structural pruning methods have received considerable attention because they are a very direct way to obtain structurally sparse architectures. Mao et al. (2017) explored the effect of different pruning granularities on deep neural networks, and suggested coarse-grained pruning, such as connection-wise Huang et al. (2018), channel-wise He et al. (2017), filter-wise Li et al. (2016b); Lin et al. (2019), and even layer-wise Wen et al. (2016) pruning, to compress and accelerate deep neural networks. He et al. (2017) introduced a channel pruning method which removes redundant channels through LASSO regularization, and reduces the error accumulated from pruning by minimizing the reconstruction error of the output feature maps. Li et al. (2016b) estimate the importance of each filter according to the absolute sum of its kernel weights, and remove the unimportant filters based on a threshold, implying that filters with low-magnitude weights tend to yield weak feature maps. Recently, Lin et al. (2019) proposed a very effective pruning method based on structured sparsity regularization, achieving superior performance in terms of accuracy and speedup. In CondenseNet Huang et al. (2018), less important connections are removed from filter groups during the condensing stage to directly obtain structurally sparse patterns. In this paper, we also adopt a connection-wise pruning method to design structurally sparse architectures for filter groups.
Designing compact architectures. The motivation for deploying deep neural networks on devices with constrained resources also encourages the design of efficient and compact network architectures. AlexNet Krizhevsky et al. (2012) was a pioneering study in designing a group convolution architecture, although the main motivation for its design was to address the shortage of GPU resources. ResNeXt Xie et al. (2017) applied group convolutions in its building blocks to reduce the computational complexity and the number of parameters. Zhang et al. (2017a) proposed an interleaved group convolutional neural network (IGCNet) in which each building block consists of two separate group convolution layers; to enhance the representation power of the building blocks, the input channels of the secondary group convolutions are linked to each primary group convolution. Similar to Zhang et al. (2017a), Zhang et al. (2018) introduced a channel shuffle operation for multiple group convolutions to improve the representation power. These methods exhibit recognition accuracy comparable to that of the original networks, while achieving low computational complexity. But they have one drawback in common: the composition of the input channels, as well as the output channels, of each group is predetermined rather than learned. Huang et al. (2018) recently presented a learned group convolution, in which the input channels of each group are learned. However, the filter partitions are still predefined; moreover, the grouping is learned only for selected group convolutions, leaving the remaining convolutional and fully-connected layers untouched. In contrast, our self-grouping method can be applied to all of these layers. Recently, Wang et al. (2019) proposed a fully learnable group convolution (FLGC) method to dynamically optimize the grouping structure, focusing on the convolutional layers for acceleration while achieving better accuracy than CondenseNet. Additionally, although the group structure is fully learnable, the binary selection matrices for input channels and filters are approximately optimized by applying a softmax function to confront the problem of performance degradation. Compared to Wang et al. (2019), our motivation is similar, but we automatically construct the grouping structure by clustering based on importance vectors and by pruning based on cluster centroids. What is more, our self-grouping approach can be applied not only to the convolutional layers but also to the fully-connected layers for simultaneous compression and acceleration.
Depthwise separable convolution is also a significant building block, which consists of two separate layers Howard et al. (2017); Zoph et al. (2018); Sandler et al. (2018); Chollet (2017). The first layer is a depthwise convolution, which performs spatial filtering over each input channel; it can be viewed as a special group convolution in which each filter group independently contains only one input channel. The other is called pointwise convolution, which projects the output of the depthwise convolution into a new feature space by performing a 1×1 convolution over all of its input channels. Many state-of-the-art network architectures, such as MobileNet Howard et al. (2017) and NASNet Zoph et al. (2018), have adopted such a building block to trade off accuracy against model size. Moreover, in order to preserve representational power, the nonlinearity between the two layers is usually removed from the depthwise separable convolution Sandler et al. (2018); Chollet (2017). Recently, Gao et al. (2018) proposed an efficient and compact channel-wise convolution which can be combined with group convolution and depthwise separable convolution to achieve a better trade-off between efficiency and accuracy.
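To make this structure concrete, the following minimal PyTorch sketch (our own illustration, not code from any of the cited papers) builds a depthwise separable block as described above, with no nonlinearity between the two layers:

```python
import torch
import torch.nn as nn

def depthwise_separable(in_ch, out_ch):
    """Depthwise 3x3 convolution (groups == channels, i.e. one input channel
    per filter group) followed by a pointwise 1x1 convolution that mixes
    channels; no nonlinearity is inserted between the two layers."""
    return nn.Sequential(
        nn.Conv2d(in_ch, in_ch, kernel_size=3, padding=1, groups=in_ch),
        nn.Conv2d(in_ch, out_ch, kernel_size=1),
    )

block = depthwise_separable(32, 64)
print(block(torch.randn(1, 32, 56, 56)).shape)  # torch.Size([1, 64, 56, 56])
```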
3 SG-CNNs
In this section, we first introduce the notation and preliminaries. Next, given a well-trained neural network, we introduce the concept of importance and the use of importance vectors in filter evaluation. Then, we present our self-grouping method, which automatically clusters filters based on the similarity of their importance vectors. A centroid-based pruning scheme is proposed for both the convolutional and fully-connected layers to compress and accelerate the neural network computation, followed by optional local fine-tuning and obligatory global fine-tuning for performance recovery. The outcome is a compact and efficient neural network with data-dependent and diverse group structures. We illustrate the overall pipeline of our self-grouping convolution in Fig. 2.
3.1 Notations and Preliminaries
Given an $L$-layer deep convolutional neural network, we denote the weights of its $l$-th convolutional layer as $W \in \mathbb{R}^{O \times I \times k \times k}$, where $I$ and $O$ are the number of input channels and output channels, respectively, and $k \times k$ is the kernel size. $x$ is an input tensor which is obtained by sampling the input of the layer with a sliding window. Here, $W$ and $x$ can be viewed as a matrix with shape $O \times I$ (whose entries are $k \times k$ kernels) and a vector with shape $I \times 1$ (whose entries are input patches), respectively, such that we have

$$ y_o = \sum_{i=1}^{I} W_{oi} \ast x_i, \quad o = 1, 2, \ldots, O, \tag{1} $$

where $y$ is the corresponding output vector, $\ast$ denotes convolution, and $W_{oi}$ corresponds to the kernel of the $i$-th input channel for the $o$-th output one in the $l$-th layer. For simplicity, we omit the bias term. In this paper, if not otherwise specified, all the notations indicate the parameters in the $l$-th layer.
In order to reduce the computation cost and memory overhead, the regular group convolution approach restricts each convolution operation to a subset of the filters and input channels. Suppose we partition the $O$ filters and $I$ input channels into $G$ groups, denoted as $\{g_1, g_2, \ldots, g_G\}$, making each group contain $O/G$ filters and $I/G$ input channels. Then the regular group convolution can be formulated as follows,

$$ y^{(g)} = W^{(g)} \ast x^{(g)}, \quad g = 1, 2, \ldots, G, \tag{2} $$

where $x^{(g)}$ is the input vector of group $g$, $y^{(g)}$ is the corresponding output vector of group $g$, and $W^{(g)}$ denotes the weight block matrix of group $g$. Let $\widetilde{W} = \mathrm{diag}(W^{(1)}, W^{(2)}, \ldots, W^{(G)})$, which is a quasi-diagonal matrix, assuming an equal group size, such that $y = \widetilde{W} \ast x$.
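As a concrete illustration of Eq. (2), the sketch below (our own, using PyTorch's built-in grouped convolution) shows how partitioning into $G = 8$ groups divides the parameter count by roughly $G$:

```python
import torch
import torch.nn as nn

# Regular convolution: every filter spans all I input channels.
regular = nn.Conv2d(in_channels=64, out_channels=64, kernel_size=3, padding=1)

# Regular group convolution with G = 8: each filter only spans the I/G input
# channels of its own group, shrinking parameters and FLOPs by a factor of G.
grouped = nn.Conv2d(in_channels=64, out_channels=64, kernel_size=3, padding=1,
                    groups=8)

x = torch.randn(1, 64, 32, 32)
assert regular(x).shape == grouped(x).shape          # identical output shape
print(sum(p.numel() for p in regular.parameters()))  # 36928
print(sum(p.numel() for p in grouped.parameters()))  # 4672
```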
For a fully-connected layer, we treat each of its neurons as a convolutional channel with a $1 \times 1$ spatial size, i.e., $k = 1$, such that we can obtain

$$ y = W x, \tag{3} $$

where $x \in \mathbb{R}^{I}$ is an input vector, $W \in \mathbb{R}^{O \times I}$ is the weight matrix of the fully-connected layer, and $y \in \mathbb{R}^{O}$ is the corresponding output vector. Each $W_{oi}$ is a scalar, and denotes the weight value of the $i$-th input neuron for the $o$-th output one. We also omit the bias term for simplicity.
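The equivalence between a fully-connected layer and a $1 \times 1$ convolution used in Eq. (3) can be checked numerically; a small sketch of our own:

```python
import torch
import torch.nn as nn

fc = nn.Linear(in_features=512, out_features=10, bias=False)

# Treat each neuron as a convolutional channel of 1x1 spatial size: copy the
# (O, I) weight matrix into an (O, I, 1, 1) convolution kernel.
conv = nn.Conv2d(512, 10, kernel_size=1, bias=False)
conv.weight.data = fc.weight.data.view(10, 512, 1, 1)

x = torch.randn(4, 512)
y_fc = fc(x)
y_conv = conv(x.view(4, 512, 1, 1)).view(4, 10)
print(torch.allclose(y_fc, y_conv, atol=1e-6))  # True
```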
By analogy, for the fully-connected layer, Eq. (3) can be rewritten in grouped form, i.e.,

$$ y^{(g)} = W^{(g)} x^{(g)}, \quad g = 1, 2, \ldots, G, \tag{4} $$

where $x^{(g)}$ is the input vector of group $g$, $y^{(g)}$ is the corresponding output vector of group $g$, and $W^{(g)}$ denotes the block weight matrix of group $g$.
However, the limited spatial operation restricts the expressive power of the regular group convolution. To avoid this shortcoming, we propose a self-grouping convolution that relaxes the spatial restriction. This is achieved by clustering the filters based on the similarity of the so-called "importance vectors" and pruning the unimportant connections with a centroid-based pruning strategy.
3.2 Importance Vectors
For a well-trained deep neural network, shown in Fig. 2 (a), its parameters are trained to attain a local or global optimum. Note that the training of neural networks effectively identifies the important parameters, while inhibiting the less important connections. The distribution of these parameters conveys information about their relative importance. Generally, parameters with low magnitudes tend to produce feature maps with weak activations, representing minor contributions to the network output Han et al. (2015); Hu et al. (2016). On the contrary, parameters of high magnitude are destined to make significant contributions. However, a single scalar cannot represent the information contained in a distribution. Considering that group convolutions are closely related to multiple filters and input channels, we introduce a novel concept, referred to as the importance vector, which represents, for each filter, the importance of all its input channels.
For the $l$-th layer, we define $V = \{v_1, v_2, \ldots, v_O\}$ as the set of the importance vectors of all its filters. $v_o$ corresponds to the $o$-th filter, such that $v_o = (v_{o1}, v_{o2}, \ldots, v_{oI})$ ($o = 1, 2, \ldots, O$), where $v_{oi}$ stands for the importance value of the $i$-th input channel to the $o$-th filter. We estimate $v_{oi}$ by the norm of its corresponding kernel $W_{oi}$, as

$$ v_{oi} = \| W_{oi} \|. \tag{5} $$

Similarly, for the fully-connected layers, we denote their importance vector set as $V = \{v_1, v_2, \ldots, v_O\}$, with $v_o = (v_{o1}, v_{o2}, \ldots, v_{oI})$ ($o = 1, 2, \ldots, O$). The importance value is estimated by the absolute value of its corresponding weight $W_{oi}$, as follows

$$ v_{oi} = | W_{oi} |. \tag{6} $$
Unlike the conventional methods in which the importance of these parameters is defined by scalars Hu et al. (2016); Huang et al. (2018); Molchanov et al. (2016), our method assesses their importance in terms of vectors. This concept suggests that the importance of weights should be gauged by the importance distribution over the input channels of a filter, which allows different filters to be assigned to different groups by clustering based on the similarity of their importance distributions.
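A minimal sketch of Eqs. (5) and (6) in PyTorch follows (the helper names are ours; the order of the kernel norm is an assumption, exposed here as a parameter):

```python
import torch

def importance_vectors_conv(weight, p=1):
    """Eq. (5): importance vector per filter, one entry per input channel,
    estimated as the (assumed p-)norm of the corresponding k x k kernel.
    `weight` has shape (O, I, k, k); the result has shape (O, I)."""
    return weight.abs().pow(p).sum(dim=(2, 3)).pow(1.0 / p)

def importance_vectors_fc(weight):
    """Eq. (6): for a fully-connected layer with weight shape (O, I), the
    importance of input neuron i for output neuron o is simply |W_oi|."""
    return weight.abs()

V = importance_vectors_conv(torch.randn(64, 32, 3, 3))  # shape (64, 32)
```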
3.3 Self-grouping Filters by Clustering
In this part, we present how to automatically group filters by clustering based on the similarity of importance vectors. Clustering is an efficient way to generate multiple filter groups in which the behaviors of the input channels are similar within each group but divergent between groups. Therefore, for the $l$-th layer, we partition its importance vector set $V$ into $G$ groups $\{V_1, V_2, \ldots, V_G\}$ by the k-means clustering method so as to minimize the within-group sum of squared Euclidean distances, as follows,

$$ \min_{\{V_1, \ldots, V_G\}} \sum_{g=1}^{G} \sum_{v \in V_g} \| v - c_g \|^2. \tag{7} $$

Here, $V = V_1 \cup V_2 \cup \cdots \cup V_G$, and $c_g = \frac{1}{|V_g|} \sum_{v \in V_g} v$ is the centroid vector of $V_g$, whose element $c_{gi}$ corresponds to the $i$-th input channel of group $g$. As shown in Fig. 2 (b), all the filters of the convolutional layer are grouped into three groups, and each group has a different spatial group size. Certainly, other clustering methods (e.g. k-medoids) could also be used for grouping the filters with similar importance vectors, which is beyond the scope of this paper.
Likewise, we apply k-means clustering based on the similarity of importance vectors to the fully-connected layers, thus obtaining groups $\{V_1, V_2, \ldots, V_G\}$ satisfying the following condition,

$$ \min_{\{V_1, \ldots, V_G\}} \sum_{g=1}^{G} \sum_{v \in V_g} \| v - c_g \|^2, \tag{8} $$

where $V = V_1 \cup V_2 \cup \cdots \cup V_G$, and $c_g$ stands for the centroid vector of $V_g$, such that $c_g = \frac{1}{|V_g|} \sum_{v \in V_g} v$. Here, $c_{gi}$ corresponds to the $i$-th input neuron of group $g$.
The existing methods have aimed to design distinct group convolutions in which the filters are assigned to specific groups in a predefined manner and each group has the same number of filters, so that these designs are data-independent Xie et al. (2017); Zhang et al. (2017a, 2018); Gao et al. (2018). In contrast, we automatically determine the filters of each group by clustering, instead of fixing them a priori. Each group may have a different number of filters, which is data-dependent. Therefore, self-grouping filters by clustering helps to enhance the representation potential of group convolutions.
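A sketch of the clustering step of Eq. (7), using scikit-learn's k-means on the importance vectors (function and variable names are ours):

```python
import torch
from sklearn.cluster import KMeans

def group_filters(V, G):
    """Cluster the O importance vectors V (shape (O, I)) into G filter groups
    by minimizing the within-group sum of squared distances (Eq. (7))."""
    km = KMeans(n_clusters=G, n_init=10).fit(V.cpu().numpy())
    labels = torch.as_tensor(km.labels_, dtype=torch.long)  # group of each filter
    centroids = torch.as_tensor(km.cluster_centers_)        # (G, I) centroids c_g
    return labels, centroids

labels, centroids = group_filters(torch.rand(64, 32), G=8)
```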
3.4 Centroid-based Pruning Scheme
The requirement of group sparsity attracts increasing attention due to its beneficial effect on compression and acceleration Wen et al. (2016); Alvarez and Salzmann (2016). Connection-based pruning can generate such a structured sparse architecture for group convolutions by removing the connections with negligible weights from the groups, enabling parameter reduction and efficient computation Han et al. (2015). Furthermore, considering that the cluster centroids are representative importance vectors of their corresponding groups, we use them to determine the incoming input channels of each group. The result is a centroid-based pruning scheme that constructs our self-grouping convolution.
To be specific, we arrange the elements of all the centroid vectors in ascending order to obtain a sorted set $S$, as follows,

$$ S = \{ s_1, s_2, \ldots, s_{G \times I} \}, \quad s_1 \le s_2 \le \cdots \le s_{G \times I}. \tag{9} $$

Here, $s_t$ indicates the element of $S$ whose rank is $t$, and each element corresponds to multiple connections of its group (one per filter in the group). Then, we truncate the $m$ smallest values as follows,

$$ T = \{ s_1, s_2, \ldots, s_m \}, \tag{10} $$

and prune their corresponding weakest connections in the $l$-th layer.
Correspondingly, for the fully-connected layers, the sorted set and the $m$ smallest values are defined as follows,

$$ S = \{ s_1, s_2, \ldots, s_{G \times I} \}, \quad s_1 \le s_2 \le \cdots \le s_{G \times I}, \tag{11} $$

$$ T = \{ s_1, s_2, \ldots, s_m \}. \tag{12} $$

Note that for a centroid vector $c_g$, if some of its elements fall in $T$, then the corresponding part of the connections of its group is discarded. In the extreme cases, if all its elements fall in $T$, then the corresponding whole group is discarded; on the contrary, if all of them lie above the threshold $s_m$, then the whole group is retained. As a consequence, different groups have different numbers of input channels. Moreover, an input channel can be shared by different groups, and can also be neglected by all the groups, which is similar to Huang et al. (2018).
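Under the notation above, the centroid-based pruning of Eqs. (9)-(12) reduces to thresholding the pooled centroid entries; a sketch of our own, reusing `labels` and `centroids` from the clustering sketch above:

```python
import torch

def centroid_pruning_mask(centroids, labels, m):
    """Sort all centroid entries (Eq. (9)), truncate the m smallest (Eq. (10))
    and keep, for each group, only the input channels whose centroid entry
    survives. Returns an (O, I) binary mask, one row per filter."""
    flat = centroids.flatten().sort().values        # the sorted set S
    threshold = flat[m - 1] if m > 0 else float('-inf')
    group_mask = (centroids > threshold).float()    # (G, I): kept channels per group
    return group_mask[labels]                       # broadcast to (O, I)

# Zero the pruned connections; the mask broadcasts over the k x k kernels.
# The layer's pruned fraction is simply the zero fraction of the mask.
mask = centroid_pruning_mask(centroids, labels, m=128)
conv_weight = torch.randn(64, 32, 3, 3) * mask[:, :, None, None]
r_l = 1.0 - mask.mean().item()
```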
In this way, the compression ratio of the $l$-th layer can be calculated as follows,

$$ r^{l} = \frac{\sum_{g=1}^{G} O_g \, m_g}{O \times I}, \tag{13} $$

where $O_g$ denotes the number of filters in group $g$, and $m_g$ is the number of elements of $c_g$ that belong to $T$, such that $\sum_{g=1}^{G} m_g = m$. Further, the compression ratio of the whole neural network can be calculated as follows,

$$ r = \frac{\sum_{l=1}^{L} r^{l} P^{l}}{\sum_{l=1}^{L} P^{l}}, \tag{14} $$

where $P^{l}$ denotes the number of parameters in the $l$-th layer.

At each pruning iteration, the pruning step can differ between layers, but for simplicity the same pruning step $\Delta r$ is set for the $l$-th layer, which means that an identical proportion of connections is removed from the $l$-th layer each time; the number of truncated elements $m$ is determined accordingly. In other words, after $t$ iterations, we truncate an appropriate number of elements from $S$ to delete their corresponding connections, while satisfying the condition $r^{l} = t \, \Delta r$.
So far, a self-grouping convolution with diverse structures has been formed by the remaining sparse connections. Such diverse structures preserve the majority of the information flow in each pruned layer, which helps to exploit the representation potential of group convolutions. The self-grouping convolution is shown in Fig. 2 (c). Obviously, the connection pattern in self-grouping convolutions is controlled jointly by the number of groups $G$, the compression ratio $r$ and the training dataset: $G$ determines the number of filter groups, the filters of each group depend on the training dataset, and $r$ decides the number of input channels retained in each filter group. In Section 6, we investigate the effect of different $G$ and $r$ on the network performance in detail, to guide their setting.
In summary, our self-grouping convolution method affords many advantages compared to the existing pruning methods: (1) by virtue of a novel centroid-based pruning scheme, we exploit the full knowledge of weight-parameter importance conveyed by the importance vector distribution; (2) our proposed method preserves the majority of the information flowing through the network, which helps to achieve better recognition performance; (3) as our proposed method is applicable to the fully-connected layers as well as the convolutional layers, they can be pruned together for efficient compression and acceleration; (4) different from the existing methods, whose layer-by-layer grouping in a fixed manner impairs the compression efficiency as the depth increases, our method prunes the parameters of different layers in parallel. Therefore, it does not depend on the depth of the network but on the pruning step. This helps to improve the compression efficiency, especially for deep neural networks.
3.5 Fine-tuning
Although our proposed method minimises the performance degradation caused by the centroid-based pruning scheme, the cumulative error will damage the overall performance of the original neural network. Therefore, fine-tuning to compensate for the loss of accuracy caused by pruning is desirable. There are two forms of fine-tuning: local fine-tuning and global fine-tuning. The former repeats a short fine-tuning after each pruning iteration to recover the performance of the network Hu et al. (2016); Molchanov et al. (2016); Liu et al. (2017); this increases the computational time, while helping to maintain the network performance. The latter performs a single global fine-tuning to strengthen the remaining part of the network and enhance its expressive ability Lin et al. (2018b). Considering both performance and efficiency, we investigate two fine-tuning schemes: (1) global-only fine-tuning and (2) both local and global fine-tuning. In Section 5, our extensive experiments on ImageNet testify that our self-grouping method obtains comparable results with each of these two schemes, which convincingly shows that our method preserves the majority of the information flow through data-dependent and diverse group structures.
We depict the complete process by which SG-CNNs compress and accelerate deep network models in Algorithm 1. Our self-grouping method prunes the unimportant connections from a well-trained neural network to reduce the size of the model and to accelerate inference. The whole framework consists of five basic steps: (1) compute the importance vector of each filter; (2) group the filters by clustering their importance vectors; (3) prune unimportant connections with the centroid-based pruning scheme; (4) optionally fine-tune the pruned network locally; (5) fine-tune the pruned network globally.
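Putting the previous sketches together, one pruning iteration of Algorithm 1 for a single convolutional layer might look as follows (a schematic outline built from our assumed helpers above, not the authors' released code; fine-tuning happens outside this function):

```python
def self_group_layer(conv, G, m):
    """Steps (1)-(3) of the pipeline for one layer: importance vectors,
    filter clustering, and centroid-based pruning of the weakest connections."""
    W = conv.weight.data                                  # (O, I, k, k)
    V = importance_vectors_conv(W)                        # step (1)
    labels, centroids = group_filters(V, G)               # step (2)
    mask = centroid_pruning_mask(centroids, labels, m)    # step (3)
    conv.weight.data = W * mask[:, :, None, None]         # zero pruned connections
    return labels, mask
```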
3.6 Deployment
When the compressed model is deployed on mobile devices or embedded platforms, we convert it into a network with regular connection patterns for inference speedups. Specifically, for each filter group, we duplicate the reused feature maps and delete the ignored feature maps. Afterwards, we rearrange these feature maps. The output channels are also rearranged so that the filters of the same group are located together. As a result, we obtain a regular group convolution with diverse group structures, which requires no special libraries or hardware for efficient inference, as shown in Fig. 2 (d). The conversion process can easily be implemented by permutation matrices, as described in detail in Section 4.
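A sketch of this conversion for one group (our own helper; it gathers the group's filters and the input channels it retains into a small dense convolution, duplicating reused feature maps by indexing):

```python
import torch
import torch.nn as nn

def deploy_group(conv, labels, mask, g):
    """Extract group g of a self-grouped layer as a regular dense convolution.
    All filters of a group share one channel mask, so any row of the group's
    mask identifies the input channels that the group actually uses."""
    filt_idx = (labels == g).nonzero(as_tuple=True)[0]       # filters of group g
    chan_idx = mask[filt_idx[0]].nonzero(as_tuple=True)[0]   # channels kept by g
    small = nn.Conv2d(len(chan_idx), len(filt_idx), conv.kernel_size,
                      padding=conv.padding, bias=False)
    small.weight.data = conv.weight.data[filt_idx][:, chan_idx]
    return small, chan_idx  # at inference, this branch consumes x[:, chan_idx]
```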
4 Analysis
Regular group convolution is highly restricted in its representational ability due to the limited scope of spatial calculations for each group. To enhance its representation power, many methods have been introduced to relax the spatial restrictions, such as permuting output channels Zhang et al. (2017a), shuffling channels Zhang et al. (2018), introducing channel-wise convolutions Gao et al. (2018), and using learned group convolutions Huang et al. (2018), all of which amount to a deliberate selection of input channels for each disjoint group. However, they are rather simplistic in the composition of the filters of each group. In the following, we compare our self-grouping convolution with these group convolutions via matrix decomposition, to elaborate its data-dependence and structural diversity.
Permuting group convolution. For IGCNets, permuting the output channels of the primary group convolution can be interpreted as a specific selection of the input channels for each partition of the secondary group convolution, so that the input channels of the same secondary partition lie in different primary partitions Zhang et al. (2017a). Similarly, shuffling channels can also be viewed as not only an organized rearrangement of input channels but also an intentional selection of input channels for each filter group to improve the representation capacity Zhang et al. (2018). The channelwise convolution computes the output channels of each group from all input channels, while maintaining sparsity, which improves the interactions among filter groups for more representational power Gao et al. (2018).
The above networks have something in common. They have the same number of filters and input channels in each group, and a similar way to rearrange the input channels, as illustrated in Fig. 3 (a). We formulate the permuting group convolution as follows,

$$ y = \widetilde{W} P x, \tag{15} $$

where $P$ is a permutation matrix used to rearrange the order of the input channels. It should be noted that $P$ is a constant matrix due to the predefined permutation design. $\widetilde{W}$ is a quasi-diagonal matrix, and its block structure is also predefined. That is to say, the sparse patterns of $P$ and $\widetilde{W}$ are known before training.
Learned group convolution. By contrast with the above methods, learned group convolution also predefines the filters of each group, but learns the input channels of each group based on its condensation criterion Huang et al. (2018). We show the equivalent group convolution in Fig. 3 (b), and formulate it as follows,

$$ y = \widetilde{W} P x. \tag{16} $$

Here, $P$ is a permutation matrix which is used to rearrange the input channels. Unlike in Eq. (15), $P$ is learnable, so as to reuse the important input features and to ignore the less important ones. $\widetilde{W}$ has the same block structure as in Eq. (15), which is predefined.
Self-grouping convolution. In our self-grouping convolution, the filters, as well as the input channels, cluster into different groups by learning. We split the filters of the same convolutional layer into multiple groups by clustering, in contrast to prefixing them. The input channels of each group are determined by centroid-based pruning. In this convolution pattern, the numbers of filters and input channels differ among groups. The input channels may be reused by different groups, and may even be ignored by all the groups, as shown in Fig. 3 (c) and (d). Thus, we produce a diverse group convolution with data-dependence, which is mathematically formulated as follows,

$$ y = Q^{\top} \widetilde{W} P x, \tag{17} $$

where both $P$ and $Q$ are permutation matrices, but different in function. $P$ is used to rearrange the order of the input channels, serving the same function as in Eqs. (15) and (16). Distinguished from the other methods, we introduce a novel permutation $Q$ to organize the filters into multiple distinct groups, such that the sparse weight matrix $W$ is transformed into the block-diagonal matrix $\widetilde{W} = Q W P^{\top}$. More importantly, these two permutation matrices are learned, rather than predefined, by clustering based on the similarity of importance vectors and by pruning based on the cluster centroids. In contrast to the $\widetilde{W}$ in Eqs. (15) and (16), which have blocks of equal size, the $\widetilde{W}$ in Eq. (17) is a block-diagonal matrix that may have blocks of different sizes. The design is data-dependent because its block structure strongly depends on the training dataset. As a result, our self-grouping convolution operation is very effective and diverse, and does not require manually predefined permutation operations to improve the interaction of groups for better performance. This is verified by the experiments in Section 5.
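For the fully-connected case, the permutation identity behind Eq. (17) is easy to verify numerically; a small NumPy sketch with random permutations (all names ours):

```python
import numpy as np

rng = np.random.default_rng(0)
O, I = 6, 8
W = rng.standard_normal((O, I))     # a (sparse, self-grouped) weight matrix

# Q permutes the output channels (filters), P permutes the input channels.
Q = np.eye(O)[rng.permutation(O)]
P = np.eye(I)[rng.permutation(I)]

W_tilde = Q @ W @ P.T               # block-diagonal once filters are grouped
x = rng.standard_normal(I)

# y = W x  ==  Q^T (W_tilde (P x)): the permutations cancel exactly.
assert np.allclose(W @ x, Q.T @ (W_tilde @ (P @ x)))
```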
5 Experiments
In this section, we empirically demonstrate the efficacy and efficiency of our proposed SG-CNN on highly competitive computer vision recognition benchmarks, i.e., CIFAR-10/100 Krizhevsky and Hinton (2009) and ImageNet Russakovsky et al. (2015). Comprehensive experiments are carried out on several state-of-the-art network architectures, including ResNet He et al. (2016) and DenseNet Huang et al. (2017). All the experiments are implemented in PyTorch and run on an NVIDIA TITAN Xp GPU with 12 GB of memory and 128 GB of RAM. For simplicity, the same number of groups is set for each compressed layer. Additionally, there are few parameters in the first convolutional layer (e.g. 3 channels for RGB images and 1 channel for gray images), but they are crucial as they provide the original input information to the network. Therefore, in order to retain enough input information, we do not compress the first convolutional layer.
5.1 Datasets
CIFAR-10/100. These two datasets each consist of 50,000 images for training and 10,000 images for testing, with a resolution of 32×32. They contain 10 and 100 categories, respectively. Due to the limited number of training samples, we augment the training sets by random cropping with padding and by horizontal flipping, the same data augmentation technique as adopted in Li et al. (2016a).
ImageNet. ILSVRC2012, a subset of the ImageNet dataset, contains 1.2M training images and 50K validation images used as test samples. The images are categorized into 1000 classes. We follow the data augmentation scheme described in He et al. (2016), i.e., each training sample is randomly cropped to 224×224 from the rescaled 256×256 image, and horizontally flipped. At test time, we apply a 224×224 center crop to each test sample from the rescaled 256×256 image.
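The augmentation pipelines described above map directly onto torchvision transforms; a sketch, where the CIFAR padding size (4 pixels) is our assumption since the text does not specify it:

```python
import torchvision.transforms as T

# CIFAR-10/100 training: random crop with padding plus horizontal flip.
cifar_train = T.Compose([
    T.RandomCrop(32, padding=4),   # padding size assumed
    T.RandomHorizontalFlip(),
    T.ToTensor(),
])

# ImageNet: random 224x224 crop from the rescaled 256x256 image plus a flip
# for training; a 224x224 center crop from the rescaled image at test time.
imagenet_train = T.Compose([
    T.Resize(256), T.RandomCrop(224), T.RandomHorizontalFlip(), T.ToTensor(),
])
imagenet_test = T.Compose([
    T.Resize(256), T.CenterCrop(224), T.ToTensor(),
])
```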
5.2 DenseNet on CIFAR-10/100
Model. For CIFAR-10/100, we use two modified versions of DenseNet-121 as our baselines, and train them with the same hyper-parameters for 200 epochs from scratch. The batch size is set to 64, weight decay to 1e-4 and momentum to 0.9. We follow the learning rate schedule: 0.1 for the first 100 epochs, 0.01 until epoch 150, and 0.001 until epoch 200. Finally, we obtain baseline models with 95.23% top-1 accuracy and 99.86% top-5 accuracy on CIFAR-10, and 78.67% top-1 accuracy and 94.55% top-5 accuracy on CIFAR-100.
Implementation. For both DenseNet-121 models on CIFAR-10/100, the number of groups is set to 8 for each layer. We simultaneously prune both the convolutional and fully-connected layers, discarding 5% of their parameters at each pruning iteration. After each pruning, we locally fine-tune the pruned network for 4 epochs with a constant learning rate of 0.001. Finally, we globally fine-tune the pruned networks for 200 epochs with the same hyper-parameters as in training, i.e., batch size, weight decay, momentum and learning rate decay schedule, except for an initial learning rate of 0.01.
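The baseline training recipe above corresponds to a standard SGD setup with a stepped learning rate; a sketch (`model` and `train_one_epoch` are hypothetical placeholders):

```python
import torch

# Batch size 64, weight decay 1e-4, momentum 0.9; learning rate 0.1 for
# epochs 1-100, 0.01 until epoch 150, and 0.001 until epoch 200.
optimizer = torch.optim.SGD(model.parameters(), lr=0.1,
                            momentum=0.9, weight_decay=1e-4)
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer,
                                                 milestones=[100, 150],
                                                 gamma=0.1)
for epoch in range(200):
    train_one_epoch(model, optimizer)   # hypothetical training helper
    scheduler.step()
```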
Results. We report the compression results of DenseNet on CIFAR-10 in Table 1. When the compression ratio is not more than 85%, our approach achieves higher recognition accuracy and lower FLOPs than the original network model.
Firstly, we compare our SG-CNN with several state-of-the-art group convolution methods to demonstrate its efficacy. Compared with IGC Zhang et al. (2017a); Xie et al. (2018); Sun et al. (2018), our SG-CNN achieves an accuracy 0.37% higher than the best result of IGC's three versions at an approximately equal model size (1.71M vs. 2.2M). Also, compared with CondenseNet Huang et al. (2018), our SG-CNN achieves comparable recognition performance at an approximately equal model size (0.68M vs. 0.52M). It is slightly inferior to CondenseNet at almost the same model size (0.34M vs. 0.33M), while achieving about 30% lower FLOPs, which means that our SG-CNN has a faster inference speed. As for FLGC Wang et al. (2019), with fully learned group convolutions, our SG-CNN is better by up to 1.8% at the same model size (0.68M vs. 0.68M).
Secondly, our SG-CNN is also compared with other pruning methods. We can see that our SG-CNN surpasses Slimming Liu et al. (2017) by 0.37% and 0.22% in top-1 accuracy at comparable model sizes (1.37M vs. 1.44M and 0.68M vs. 0.66M), while achieving lower FLOPs. Compared with DMRNet Zhao et al. (2018), SG-CNN outperforms it by 0.36% in top-1 accuracy at almost the same model size (1.71M vs. 1.7M). For Variational Pruning Zhao et al. (2019), the gap reaches 1.16% in top-1 accuracy and more than 2× in FLOPs (0.34M vs. 0.42M). And compared to root Ioannou et al. (2017), our SG-CNN is better by a large margin in terms of both FLOPs and top-1 accuracy.
Finally, it is worth highlighting that our SG-CNN even surpasses counterparts constructed with shift operations Chen et al. (2019); Wu et al. (2018); Jeon and Kim (2018). At comparable model sizes (0.68M vs. 0.55M, 1.71M vs. 1.76M and 1.03M vs. 0.99M), our SG-CNN surpasses them by 1.23%, 2.61% and 0.86% in top-1 accuracy, respectively, while achieving lower FLOPs.
Table 1: Compression results of DenseNet-121 on CIFAR-10.

| Model | Params | FLOPs | Top-1 (%) | Top-5 (%) | Epochs |
|---|---|---|---|---|---|
| Baseline (k = 32) | 6.89M | 888.36M | 95.23 | 99.86 | 200 |
| DenseNet (Conv75/FC75) | 1.71M | 221.90M | 95.40 | 99.91 | 200+15×4+200 |
| DenseNet (Conv80/FC80) | 1.37M | 177.72M | 95.29 | 99.91 | 200+16×4+200 |
| DenseNet (Conv85/FC85) | 1.03M | 134.10M | 95.39 | 99.90 | 200+17×4+200 |
| DenseNet (Conv90/FC90) | 0.68M | 89.77M | 95.03 | 99.93 | 200+18×4+200 |
| DenseNet (Conv95/FC95) | 0.34M | 45.76M | 94.32 | 99.89 | 200+19×4+200 |
| IGC-L4M8 Zhang et al. (2017a) | 0.96M | 145M | 90.12 | – | 400 |
| IGC-L4M8 Zhang et al. (2017a) | 0.57M | 86.2M | 92.81 | – | 400 |
| IGC-L24M2 Zhang et al. (2017a) | 0.52M | 94.8M | 90.88 | – | 400 |
| IGC-L24M2 Zhang et al. (2017a) | 0.31M | 57.1M | 92.86 | – | 400 |
| IGCV2*-C416 Xie et al. (2018) | 0.65M | – | 94.51 | – | 400 |
| IGCV3 Sun et al. (2018) | 2.2M | – | 95.03 | – | 400 |
| CondenseNet Huang et al. (2018) | 0.52M | 122M | 95 | – | 300 |
| CondenseNet Huang et al. (2018) | 0.33M | 65M | 95 | – | 300 |
| ResNet-50-FLGC-2 Wang et al. (2019) | 0.68M | 44M | 93.23 | – | – |
| ResNet-50-FLGC-1 Wang et al. (2019) | 0.22M | 23M | 92.05 | – | – |
| MobileNetV2-FLGC (G=2) Wang et al. (2019) | 1.18M | 158M | 94.11 | – | – |
| MobileNetV2-FLGC (G=3) Wang et al. (2019) | 0.85M | 122M | 94.20 | – | – |
| MobileNetV2-FLGC (G=4) Wang et al. (2019) | 0.68M | 103M | 94.16 | – | – |
| MobileNetV2-FLGC (G=8) Wang et al. (2019) | 0.43M | 76M | 93.09 | – | – |
| ResNet-Slimming Liu et al. (2017) | 1.44M | 381M | 94.92 | – | 160+160 |
| DenseNet-Slimming Liu et al. (2017) | 0.66M | 381M | 94.81 | – | 160+160 |
| DMRNet Zhao et al. (2018) | 1.7M | – | 95.04 | – | – |
| DenseNet-40 Pruned Zhao et al. (2019) | 0.42M | 156M | 93.16 | – | 300 |
| root-2 Ioannou et al. (2017) | 1.64M | 737M | 92.09 | – | – |
| root-4 Ioannou et al. (2017) | 1.23M | 455M | 92.02 | – | – |
| root-8 Ioannou et al. (2017) | 1.03M | 315M | 92.15 | – | – |
| root-16 Ioannou et al. (2017) | 0.93M | 245M | 91.67 | – | – |
| ShiftResNet (SSL) Chen et al. (2019) | 0.55M | 166M | 93.8 | – | – |
| ShiftResNet Wu et al. (2018) | 1.76M | 279M | 92.79 | – | – |
| ShiftResNet Wu et al. (2018) | 0.87M | 151M | 92.74 | – | – |
| ShiftResNet Wu et al. (2018) | 0.28M | 67M | 91.69 | – | – |
| ASNet Jeon and Kim (2018) | 0.99M | – | 94.53 | – | – |
Table 2 shows the compression results of DenseNet on CIFAR-100. From the results, we note that our SG-CNN achieves 0.11% higher top-1 accuracy than the original network model at a compression ratio of 70%. As the compression ratio increases, the recognition accuracy gradually degrades, while the FLOPs decrease further.
Our proposed method is compared with existing group convolution methods to show its effectiveness. Compared with IGC Zhang et al. (2017a); Sun et al. (2018), it is significantly better, demonstrating that our self-grouping convolutions are more expressive. CondenseNet Huang et al. (2018) is outperformed by 0.37% at a comparable model size (0.71M vs. 0.52M). However, our method slightly underperforms at an approximately equal model size (0.36M vs. 0.33M), while achieving about 1/4 lower FLOPs than CondenseNet.
Compared with Slimming Liu et al. (2017), our method achieves over 1% and over 2% higher top-1 accuracy, with roughly 2× and 4× lower FLOPs, at approximately equal model sizes (1.40M vs. 1.46M and 0.71M vs. 0.66M). Compared with DMRNet Zhao et al. (2018), our SG-CNN achieves 2.81% higher top-1 accuracy at almost the same model size (1.75M vs. 1.7M). For Variational Pruning Zhao et al. (2019), our SG-CNN surpasses it by up to 4.54% in top-1 accuracy at a comparable model size (0.71M vs. 0.65M), while achieving about 1.5× lower FLOPs.
Finally, in contrast to the methods constructed with shift operations Chen et al. (2019); Wu et al. (2018); Jeon and Kim (2018), our SG-CNN is better by 4.33%, 4.3% and 1.45% in top-1 accuracy at comparable model sizes (0.71M vs. 0.55M, 1.75M vs. 1.76M and 1.06M vs. 0.99M), while achieving lower FLOPs.
Based on our observations, we find that most of the training time in the experiments on both CIFAR-10 and CIFAR-100 is dominated by local fine-tuning, which has a negative impact on training efficiency. To this end, we try the other strategy, i.e., the global-only fine-tuning strategy, in the following experiments on ImageNet, and show the effectiveness and efficiency of our self-grouping convolution method by comparing the two.
Table 2: Compression results of DenseNet-121 on CIFAR-100.

| Model | Params | FLOPs | Top-1 (%) | Top-5 (%) | Epochs |
|---|---|---|---|---|---|
| Baseline (k = 32) | 6.99M | 888.45M | 78.67 | 94.55 | 200 |
| DenseNet (Conv70/FC70) | 2.10M | 266.60M | 78.78 | 94.51 | 200+14×4+200 |
| DenseNet (Conv75/FC75) | 1.75M | 222.14M | 78.40 | 94.19 | 200+15×4+200 |
| DenseNet (Conv80/FC80) | 1.40M | 176.46M | 78.24 | 94.28 | 200+16×4+200 |
| DenseNet (Conv85/FC85) | 1.06M | 133.46M | 78.18 | 94.34 | 200+17×4+200 |
| DenseNet (Conv90/FC90) | 0.71M | 89.86M | 76.73 | 94.04 | 200+18×4+200 |
| DenseNet (Conv95/FC95) | 0.36M | 45.67M | 74.37 | 93.33 | 200+19×4+200 |
| IGC-L4M8 Zhang et al. (2017a) | 0.96M | 145M | 64.48 | – | 400 |
| IGC-L4M8 Zhang et al. (2017a) | 0.57M | 86.2M | 67.81 | – | 400 |
| IGC-L24M2 Zhang et al. (2017a) | 0.52M | 94.8M | 66.59 | – | 400 |
| IGC-L24M2 Zhang et al. (2017a) | 0.31M | 57.1M | 70.32 | – | 400 |
| IGCV3 Sun et al. (2018) | 2.2M | – | 78.34 | – | 400 |
| CondenseNet Huang et al. (2018) | 0.52M | 122M | 76.36 | – | 300 |
| CondenseNet Huang et al. (2018) | 0.33M | 65M | 75.92 | – | 300 |
| ResNet-Slimming Liu et al. (2017) | 1.46M | 333M | 77.13 | – | 160+160 |
| DenseNet-Slimming Liu et al. (2017) | 0.66M | 371M | 74.72 | – | 160+160 |
| DMRNet Zhao et al. (2018) | 1.7M | – | 75.59 | – | – |
| DenseNet-40 Pruned Zhao et al. (2019) | 0.65M | 218M | 72.19 | – | 300 |
| ShiftResNet (SSL) Chen et al. (2019) | 0.55M | 166M | 72.4 | – | – |
| ShiftResNet Wu et al. (2018) | 1.76M | 279M | 74.10 | – | – |
| ShiftResNet Wu et al. (2018) | 0.87M | 151M | 73.64 | – | – |
| ShiftResNet Wu et al. (2018) | 0.28M | 67M | 69.82 | – | – |
| ASNet Jeon and Kim (2018) | 0.99M | – | 76.73 | – | – |
5.3 ResNet and DenseNet on ImageNet
Model. In this experiment, we investigate the proposed SG-CNN on two state-of-the-art CNN architectures, i.e., ResNet-50 and DenseNet-201. For a fair comparison, we use their network models pretrained on ImageNet as our baselines instead of models trained from scratch.
Implementation. For ResNet-50 and DenseNet-201, we set the number of groups to 16. These two models are pruned with a compression step of 10% to obtain a series of models of different sizes. Considering the difference in redundancy between the convolutional and fully-connected layers, the compression ratio ranges from 10% to 80% for the convolutional layers and from 10% to 60% for the fully-connected ones. We apply the two fine-tuning schemes, i.e. local-and-global fine-tuning and global-only fine-tuning, to verify the effectiveness and efficiency of our method. For the former scheme, local fine-tuning is performed for 20 epochs after each pruning with a constant learning rate of 0.0001. For the global fine-tuning in both schemes, the learning rate is set to 0.01 and divided by 10 at epochs 30 and 60, for a total of 90 epochs. The other hyper-parameters are set as follows: batch size 128, weight decay 0.0001 and momentum 0.9.
Results. We present the compression results of ResNet-50 on ImageNet in Table 3. For the models with and without local fine-tuning, there is little loss at low compression ratios. The loss gradually increases at high compression ratios, but the loss of top-1 accuracy remains below 3%. Additionally, there is a very small gap between the two fine-tuning strategies, namely less than 0.2% in top-1 accuracy. This result demonstrates that our proposed method preserves the flow of relevant information after pruning, which benefits the network performance.
In order to show the efficacy of our method, SG-CNN is compared with state-of-the-art CNNs. First, we compare our SG-CNN with other group convolution methods, such as IGCV1 Zhang et al. (2017a) and root Ioannou et al. (2017). The gap between our SG-CNN and IGCV1 reaches 4.6% and 2.87% in top-1 and top-5 accuracy at comparable model size (11.88M vs. 11.329M), respectively. For root, we observe that it is outperformed by our SG-CNN by a significant margin in recognition performance, while our models have a smaller size and a lower computational complexity.
Second, our SG-CNN is compared with other compression methods, including ThiNet Luo et al. (2018), SSR Lin et al. (2019), GDP Lin et al. (2018b) and LRDKT Lin et al. (2018a). It outperforms ThiNet by more than 3% and 5% in top-1 accuracy at comparable model sizes (11.88M vs. 12.38M and 7.76M vs. 8.66M), respectively. In comparison with SSR, our SG-CNN achieves better recognition performance, smaller model size and lower FLOPs, so our proposed method is clearly superior to SSR. Compared to GDP, which focuses on accelerating deep convolutional neural networks, our proposed method achieves lower FLOPs, while delivering better accuracy. For LRDKT, our SG-CNN obtains comparable recognition performance at a low compression ratio (9.83M vs. 9.8M), and surpasses LRDKT by more than 4% at a high compression ratio (7.76M vs. 6.3M).
Table 3: Compression results of ResNet-50 on ImageNet.

| Model | Params | FLOPs | Top-1 (%) | Top-5 (%) | Epochs |
|---|---|---|---|---|---|
| Baseline | 25.55M | 4.09G | 76.13 | 92.86 | 90 |
| ResNet-G (Conv60/FC60) | 11.88M | 1.91G | 75.20 | 92.55 | 90+90 |
| ResNet-G (Conv70/FC60) | 9.83M | 1.55G | 74.43 | 92.30 | 90+90 |
| ResNet-G (Conv80/FC60) | 7.76M | 1.20G | 73.22 | 91.70 | 90+90 |
| ResNet-LG (Conv60/FC60) | 11.87M | 1.91G | 75.12 | 92.59 | 90+6×20+90 |
| ResNet-LG (Conv70/FC60) | 9.83M | 1.56G | 74.42 | 92.31 | 90+7×20+90 |
| ResNet-LG (Conv80/FC60) | 7.76M | 1.20G | 73.38 | 91.69 | 90+8×20+90 |
| IGCV1 Zhang et al. (2017a) | 11.329M | 2.2G | 70.6 | 89.68 | 95 |
| IGCV1 Zhang et al. (2017a) | 11.205M | 1.9G | 69.23 | 89.01 | 95 |
| IGCV1 Zhang et al. (2017a) | 8.61M | 1.3G | 73.05 | 91.08 | 95 |
| root-2 Ioannou et al. (2017) | 25.4M | 3.86G | 72.7 | 91.2 | – |
| root-4 Ioannou et al. (2017) | 25.1M | 3.37G | 73.4 | 91.8 | – |
| root-8 Ioannou et al. (2017) | 23.2M | 2.86G | 73.4 | 91.8 | – |
| root-16 Ioannou et al. (2017) | 18.7M | 2.43G | 73.2 | 91.8 | – |
| root-32 Ioannou et al. (2017) | 16.4M | 2.22G | 72.9 | 91.5 | – |
| root-64 Ioannou et al. (2017) | 15.3M | 2.11G | 73.2 | 91.5 | – |
| ThiNet Luo et al. (2018) | 16.94M | 2.44G | 74.03 | 92.11 | 196+48 |
| ThiNet Luo et al. (2018) | 12.38M | 1.70G | 72.03 | 90.99 | 196+48 |
| ThiNet Luo et al. (2018) | 8.66M | 1.10G | 68.17 | 88.86 | 196+48 |
| SSR-L2,1 Lin et al. (2019) | 15.9M | 1.9G | 72.13 | 90.57 | 90+30 |
| SSR-L2,0 Lin et al. (2019) | 15.5M | 1.9G | 72.29 | 90.73 | 90+30 |
| GDP Lin et al. (2018b) | – | 2.24G | 72.61 | 91.05 | 90+20 |
| GDP Lin et al. (2018b) | – | 1.88G | 71.89 | 90.71 | 90+20 |
| GDP Lin et al. (2018b) | – | 1.57G | 70.93 | 90.14 | 90+20 |
| LRDKT Lin et al. (2018a) | 9.8M | – | 74.64 | 91.86 | 90+15 |
| LRDKT Lin et al. (2018a) | 6.3M | – | 69.07 | 88.5 | 90+15 |
For DenseNet-201, we summarize the performance on ImageNet in Table 4. For the models with and without local fine-tuning, the loss of top-1 accuracy is less than 2%. The gap between them is less than 0.2%, and the models without local fine-tuning even achieve higher accuracy than those with local fine-tuning at the same compression ratio. This result verifies that our method preserves the relevant flow of information. Our SG-CNN achieves over 4× FLOPs reduction at the higher compression ratios.
First, we compare our method with several state-of-the-art group convolution methods, showing outstanding recognition accuracy. We compare our best results with two versions of ShuffleNet Zhang et al. (2018); Ma et al. (2018), achieving 1.47% and 1.31% higher top-1 accuracy (4.32M vs. 5.3M and 6.00M vs. 7.4M), respectively. Additionally, two versions of MobileNet Howard et al. (2017); Sandler et al. (2018) are also compared with our best results. We observe that SG-CNN outperforms MobileNetV2 by about 1.5% in top-1 accuracy (6.00M vs. 6.9M), and MobileNetV1 by up to 4.39% in top-1 accuracy (4.32M vs. 4.2M). In contrast with SENet Hu et al. (2018), SG-CNN with a smaller model size achieves slightly higher top-1 accuracy (4.32M vs. 4.7M). We also compare our SG-CNN with IGCV2 Xie et al. (2018) and IGCV3 Sun et al. (2018). The gap reaches 4.47% and 1.66% in top-1 accuracy at comparable model sizes (4.32M vs. 4.1M and 6.00M vs. 7.2M), respectively. Compared to ChannelNet Gao et al. (2018), the largest gap is as much as 4.67% in top-1 accuracy (4.32M vs. 3.7M). Compared to CondenseNet Huang et al. (2018) and CondenseNet-FLGC Wang et al. (2019), our method achieves 1.37% and 0.47% higher top-1 accuracy, and 0.9% and 0.5% higher top-5 accuracy, for DenseNet-LG (4.32M vs. 4.8M); it obtains 1.19% and 0.29% higher top-1 accuracy, and 0.85% and 0.45% higher top-5 accuracy, for DenseNet-G (4.32M vs. 4.8M). Both CondenseNet and CondenseNet-FLGC obtain lower FLOPs than our SG-CNN, mainly because they are deployed on a new dense architecture that is instrumental in achieving a low computational complexity. Overall, our self-grouping method outperforms these state-of-the-art group convolution methods at similar compression ratios.
Compared to KSE Li et al. (2019), our SG-CNN performs better by 1.27% in top-1 accuracy and by 0.66% in top-5 accuracy at an approximately equal model size (4.32M vs. 4.21M). Moreover, at an approximately equal computational complexity (0.99G vs. 0.9G), the gap reaches 2.14% and 1.4% in top-1 and top-5 accuracy, respectively.
Finally, our SG-CNN is compared with auto-searched networks, such as NASNet Zoph et al. (2018), PNASNet Liu et al. (2018) and MnasNet Tan et al. (2018), which consume far more time and GPU resources to complete the search process. Clearly, our SG-CNN again achieves competitive recognition performance at the same model size.
We also observe that the global-only strategy significantly improves the training efficiency, and achieves accuracy as good as, or even better than, the strategy with local fine-tuning. These experimental results fully show that our self-grouping convolution preserves considerable representation ability after pruning, even without local fine-tuning.
Table 4: Compression results of DenseNet-201 on ImageNet.

| Model | Params | FLOPs | Top-1 (%) | Top-5 (%) | Epochs |
|---|---|---|---|---|---|
| Baseline | 19.82M | 4.29G | 76.90 | 93.37 | 90 |
| DenseNet-G (Conv70/FC60) | 6.00M | 1.34G | 76.21 | 93.07 | 90+90 |
| DenseNet-G (Conv80/FC60) | 4.32M | 0.99G | 74.99 | 92.55 | 90+90 |
| DenseNet-LG (Conv70/FC60) | 6.00M | 1.34G | 76.12 | 93.06 | 90+7×20+90 |
| DenseNet-LG (Conv80/FC60) | 4.32M | 0.99G | 75.17 | 92.60 | 90+8×20+90 |
| ShuffleNetV1 Zhang et al. (2018) | 5.3M | 524M | 73.7 | – | 240 |
| ShuffleNetV2 Ma et al. (2018) | 7.4M | 591M | 74.9 | – | 240 |
| MobileNetV1 Howard et al. (2017) | 4.2M | 569M | 70.6 | – | – |
| MobileNetV2 Sandler et al. (2018) | 6.9M | 585M | 74.7 | – | – |
| SE-MobileNet Hu et al. (2018) | 4.7M | 572M | 74.7 | 92.1 | 100 |
| SE-ShuffleNet Hu et al. (2018) | 2.4M | 142M | 68.3 | 88.3 | 100 |
| IGCV2 Xie et al. (2018) | 4.1M | 564M | 70.7 | – | 100+20 |
| IGCV2 Xie et al. (2018) | 1.3M | 156M | 65.5 | – | 100+20 |
| IGCV2 Xie et al. (2018) | 0.5M | 46M | 54.9 | – | 100+20 |
| IGCV3 Sun et al. (2018) | 7.2M | 610M | 74.55 | – | 480+50 |
| IGCV3 Sun et al. (2018) | 3.5M | 318M | 72.2 | – | 480+50 |
| ChannelNet-v1 Gao et al. (2018) | 3.7M | 407M | 70.5 | – | 80 |
| ChannelNet-v2 Gao et al. (2018) | 2.7M | – | 69.5 | – | 80 |
| ChannelNet-v3 Gao et al. (2018) | 1.7M | – | 66.7 | – | 80 |
| CondenseNet Huang et al. (2018) | 4.8M | 529M | 73.8 | 91.7 | 120 |
| CondenseNet Huang et al. (2018) | 2.9M | 274M | 71.0 | 90.0 | 120 |
| CondenseNet-FLGC Wang et al. (2019) | 4.8M | 529M | 74.7 | 92.1 | – |
| KSE DenseNet-169-A Li et al. (2019) | 7.00M | 1.28G | 75.79 | 92.87 | 90+21 |
| KSE DenseNet-121-A Li et al. (2019) | 4.21M | 1.24G | 73.9 | 91.94 | 90+21 |
| KSE DenseNet-121-B Li et al. (2019) | 3.37M | 0.9G | 73.03 | 91.2 | 90+21 |
| NASNet-A Zoph et al. (2018) | 5.3M | 564M | 74.0 | 91.6 | – |
| NASNet-B Zoph et al. (2018) | 5.3M | 488M | 72.8 | 91.3 | – |
| NASNet-C Zoph et al. (2018) | 4.9M | 558M | 72.5 | 91.0 | – |
| PNASNet-5 Liu et al. (2018) | 5.1M | 588M | 74.2 | 91.9 | – |
| MnasNet-A3 Tan et al. (2018) | 5.2M | 403M | 76.7 | 93.3 | – |
| MnasNet-A2 Tan et al. (2018) | 4.8M | 340M | 75.6 | 92.7 | – |
| MnasNet-A1 Tan et al. (2018) | 3.9M | 312M | 75.2 | 92.5 | – |
6 Ablation Study
In this part, we conduct an ablation study on DenseNet for the CIFAR-10/100 classification tasks, to investigate the effect of parameters such as the number of groups, the pruning step, and pruning the Conv vs. FC layers.
Effect of the group number. Fig. 4 (a) and (b) show the effect of different numbers of groups on DenseNet-121 for CIFAR-10/100. Thanks to reusing and ignoring shared input channels across groups, multiple group sizes can coexist at the same compression ratio. We fix the pruning step to 5% for all the models, which means that the same number of parameters is removed from these models each time, and further fine-tune the pruned networks. From the results, we observe that a larger number of groups tends to achieve a better recognition accuracy. The gap in accuracy gradually increases with the compression ratio. This suggests that increasing the number of groups enhances the structural diversity of group convolutions, while preserving the information flow, which substantially improves the expressive power of the pruned networks.
Effect of the pruning step. Fig. 5 (a) and (b) illustrate the effect of different pruning steps on DenseNet-121 for CIFAR-10/100. We vary the pruning step from 5% to 30% and fix the number of groups to 8. The results indicate that smaller pruning steps tend to achieve higher recognition accuracy. However, a small pruning step also reduces compression efficiency, since more pruning iterations are needed to reach a given compression ratio. We therefore argue that the pruning step should be chosen as a trade-off between good performance and efficient compression. A sketch of this iterative schedule is given below.
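As an illustration of this trade-off, here is a hedged sketch of such an iterative schedule. Simple global magnitude pruning stands in for the paper's centroid-guided criterion, and `finetune` is a hypothetical callback representing the local fine-tuning pass that the "L" schemes run between steps.

```python
import torch
import torch.nn as nn

def prune_to_ratio(model: nn.Module, ratio: float):
    """Zero out the globally smallest `ratio` fraction of Conv/FC weights
    (a magnitude-based stand-in for the paper's centroid-guided pruning)."""
    weights = [m.weight.data for m in model.modules()
               if isinstance(m, (nn.Conv2d, nn.Linear))]
    magnitudes = torch.cat([w.abs().flatten() for w in weights])
    k = max(1, int(ratio * magnitudes.numel()))
    threshold = magnitudes.kthvalue(k).values
    for w in weights:
        w[w.abs() <= threshold] = 0.0

def iterative_prune(model, target=0.70, step=0.05, finetune=None):
    pruned = 0.0
    while pruned < target - 1e-9:
        pruned = min(pruned + step, target)   # smaller steps -> more iterations
        prune_to_ratio(model, pruned)
        if finetune is not None:              # local fine-tuning between steps
            finetune(model)                   # (a full implementation would also
                                              #  keep masks so pruned weights stay zero)
    return model
```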
Effect of Conv vs. FC layers. The convolutional and fully-connected layers differ greatly in their degree of redundancy. To investigate this difference, we evaluate three pruning schemes: pruning only the Conv layers, pruning only the FC layers, and pruning both simultaneously. The same pruning step of 5% is used for all models, without fine-tuning. Fig. 6 (a) and (b) compare these schemes. All curves remain steady while the compression ratio is below 25% for DenseNet-121 on CIFAR-10 and below 15% on CIFAR-100, which experimentally confirms that both types of layers contain redundancy. Beyond that point, the Conv and Conv+FC curves drop quickly with increasing compression ratio, whereas the FC curve remains almost unchanged until the compression ratio reaches 85% on CIFAR-10 and 65% on CIFAR-100. This indicates that pruning the convolutional layers excessively degrades recognition performance; in other words, the fully-connected layers carry more redundancy than the convolutional ones, and the two cannot be treated identically. Moreover, at low compression ratios, pruning the fully-connected layers has no significant influence on network performance. To optimise the overall compression ratio, it is therefore important to assess the degree of redundancy in the fully-connected layers. The sketch below shows how each scheme selects its pruning targets.
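The three schemes differ only in which layer types feed the pruning pool; a minimal selector might look as follows (the scheme names are ours, matching the curves in Fig. 6).

```python
import torch.nn as nn

def prunable_weights(model: nn.Module, scheme: str):
    """Collect the weights targeted by each pruning scheme:
    'conv' (Conv layers only), 'fc' (FC layers only), or 'conv+fc' (both)."""
    layer_types = {'conv': (nn.Conv2d,),
                   'fc': (nn.Linear,),
                   'conv+fc': (nn.Conv2d, nn.Linear)}[scheme]
    return [m.weight for m in model.modules() if isinstance(m, layer_types)]
```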
7 Generalization Ability
In this section, we further evaluate the generalization ability of our SG-CNN in transfer learning, including domain adaptation on CUB-200 Wah et al. (2011) and object detection on MS COCO Lin et al. (2014). We adopt ResNet-50 and DenseNet-201 as our baseline models.
Domain Adaptation. The CUB-200 dataset contains 11,788 images of 200 bird species, of which 5,994 are used for training and 5,794 for testing. To evaluate the capacity for domain adaptation, we transfer the model compressed on ImageNet to another domain, i.e., CUB-200, by fine-tuning. The same hyper-parameters and numbers of epochs are used for a fair comparison; a minimal sketch of this transfer setup is given below.
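The following sketch illustrates the transfer setup with a standard PyTorch fine-tuning loop. A stock DenseNet-201 stands in for the compressed SG-CNN checkpoint, and the hyper-parameter values are illustrative rather than the paper's exact settings.

```python
import torch
import torch.nn as nn
from torchvision import models

# Stand-in for the compressed ImageNet model; in practice the pruned
# SG-CNN checkpoint would be loaded here.
model = models.densenet201(weights=None)
# Replace the 1000-way ImageNet head with a 200-way CUB-200 classifier.
model.classifier = nn.Linear(model.classifier.in_features, 200)

optimizer = torch.optim.SGD(model.parameters(), lr=0.01,
                            momentum=0.9, weight_decay=1e-4)
criterion = nn.CrossEntropyLoss()

def finetune_epoch(train_loader):
    model.train()
    for images, labels in train_loader:   # CUB-200 training batches
        optimizer.zero_grad()
        loss = criterion(model(images), labels)
        loss.backward()
        optimizer.step()
```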
The fine-grained classification results are presented in Table 5. We observe that our SG-CNN is effective in transfer learning: the models built on ImageNet also perform well on CUB-200. Among the compressed models, those with only local fine-tuning yield the best performance, even surpassing the baseline model by a significant margin for ResNet. Compared with SSR Lin et al. (2019), LRDKT Lin et al. (2018a) and MobileNetV2 Sandler et al. (2018), our compressed models achieve superior performance for both ResNet and DenseNet. Our SG-CNN models therefore generalize well to other domains and datasets.
Table 5: Fine-grained classification results on CUB-200.
Model  Params  FLOPs  Top-1 (%)  Top-5 (%)
Baseline (ResNet-50)  23.86M  4.09G  74.37  94.43
ResNet-L (Conv60/FC60)  11.46M  1.91G  76.82  94.96
ResNet-L (Conv70/FC60)  9.42M  1.56G  76.61  94.96
ResNet-L (Conv80/FC60)  7.35M  1.20G  75.18  94.60
ResNet-G (Conv60/FC60)  11.47M  1.91G  73.04  94.01
ResNet-G (Conv70/FC60)  9.42M  1.55G  72.75  93.65
ResNet-G (Conv80/FC60)  7.35M  1.20G  71.92  92.89
ResNet-LG (Conv60/FC60)  11.46M  1.91G  73.25  93.63
ResNet-LG (Conv70/FC60)  9.42M  1.56G  73.11  93.22
ResNet-LG (Conv80/FC60)  7.35M  1.20G  71.94  93.32
Baseline (DenseNet-201)  18.28M  4.29G  78.65  95.46
DenseNet-L (Conv70/FC60)  5.61M  1.34G  77.93  95.44
DenseNet-L (Conv80/FC60)  3.93M  0.94G  77.17  95.05
DenseNet-G (Conv70/FC60)  5.66M  1.35G  77.20  94.70
DenseNet-G (Conv80/FC60)  3.94M  0.94G  75.73  94.25
DenseNet-LG (Conv70/FC60)  5.61M  1.34G  77.46  94.98
DenseNet-LG (Conv80/FC60)  3.93M  0.94G  75.66  94.56
SSR-L2,1 Lin et al. (2019)  124.6M  4.5G  71.30  –
SSR-L2,1-GAP Lin et al. (2019)  8.8M  4.4G  70.45  –
LRDKT Lin et al. (2018a)  9.5M  1.31G  75.10  –
LRDKT Lin et al. (2018a)  3.7M  0.64G  63.18  –
MobileNetV2 Sandler et al. (2018) (our impl.)  2.45M  0.30G  68.85  91.75
Object Detection. To evaluate the ability to detect objects, we deploy the compressed model as the backbone of the Faster R-CNN Ren et al. (2015) detection framework, using the publicly released PyTorch implementation Yang et al. (2017) with default settings. The models are trained on the COCO train+val set excluding the 5K minival images, and evaluated on the minival set with input resolutions of 300 and 600. A minimal sketch of such a deployment follows.
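For illustration, here is a hedged sketch of wiring a compressed backbone into Faster R-CNN via torchvision's detection API (the experiments themselves use the jwyang PyTorch implementation cited above); a stock ResNet-50 stands in for the pruned SG-CNN backbone.

```python
import torch.nn as nn
from torchvision import models
from torchvision.models.detection import FasterRCNN
from torchvision.models.detection.anchor_utils import AnchorGenerator

# Stand-in for the compressed SG-CNN backbone: a ResNet-50 with its
# average-pooling and FC head removed.
resnet = models.resnet50(weights=None)
backbone = nn.Sequential(*list(resnet.children())[:-2])
backbone.out_channels = 2048      # channel depth of the last ResNet-50 stage

anchor_generator = AnchorGenerator(sizes=((32, 64, 128, 256, 512),),
                                   aspect_ratios=((0.5, 1.0, 2.0),))
# 80 COCO categories plus background.
detector = FasterRCNN(backbone, num_classes=81,
                      rpn_anchor_generator=anchor_generator)
```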
Table 6 shows the comparison results at the two input resolutions. Compared to their baseline models, our SG-CNN offers a 3× smaller model size and 3.5× fewer FLOPs for ResNet-50, and a 4× smaller model size and 4.5× fewer FLOPs for DenseNet-201, while obtaining comparable or even better object detection performance. Our models with only global fine-tuning achieve detection results comparable to their counterparts with both local and global fine-tuning, showing that our method preserves the flow of relevant information at each layer after pruning. Both MobileNet Howard et al. (2017) and ShuffleNet Zhang et al. (2018) are outperformed by our self-grouping method by a significant margin at both resolutions. Our method thus exhibits excellent generalization ability in object detection.
Table 6: Object detection results on the COCO minival set with Faster R-CNN, at input resolutions of 300 and 600.
Model  Params  FLOPs  mAP@300 (%)  mAP@600 (%)
Baseline (ResNet-50)  24.44M  4.09G  24.5  30.9
ResNet-G (Conv60/FC60)  12.00M  1.91G  24.8  30.9
ResNet-G (Conv70/FC60)  9.95M  1.55G  24.2  29.5
ResNet-G (Conv80/FC60)  7.88M  1.20G  23.2  28.6
ResNet-LG (Conv60/FC60)  11.99M  1.91G  24.9  30.9
ResNet-LG (Conv70/FC60)  9.95M  1.56G  24.2  30.0
ResNet-LG (Conv80/FC60)  7.88M  1.20G  23.0  28.2
Baseline (DenseNet-201)  18.78M  4.29G  26.0  32.8
DenseNet-G (Conv70/FC60)  6.15M  1.35G  23.9  30.3
DenseNet-G (Conv80/FC60)  4.44M  0.94G  22.7  28.7
DenseNet-LG (Conv70/FC60)  6.11M  1.34G  24.0  30.3
DenseNet-LG (Conv80/FC60)  4.43M  0.94G  23.0  28.9
MobileNetV1 Howard et al. (2017)  4.25M  516.80M  16.4  19.8
ShuffleNetV1 Zhang et al. (2018)  4.25M  516.80M  18.7  25.0
8 Conclusion
We have presented a self-grouping convolutional neural network, named SG-CNN, which improves on existing group convolution methods for the compression and acceleration of deep neural networks, targeting deployment on mobile and embedded devices with constrained memory and computation. We automatically group the filters of each convolutional layer by clustering their importance vectors, and further enhance the sparsity of each group by pruning guided by the cluster centroids, yielding a self-grouping convolution that is data-dependent and has diverse group structures. Furthermore, SG-CNN operates on the fully-connected layers as well as the convolutional layers, simultaneously accelerating inference and reducing memory consumption. Both local and global fine-tuning further improve the recognition accuracy of the pruned network. We empirically demonstrated the effectiveness and efficiency of our approach on a variety of state-of-the-art CNN architectures, such as ResNet and DenseNet, on several popular datasets, including CIFAR-10/100 and ImageNet. The experimental results show that our self-grouping method achieves superior performance. In particular, for ResNet-50, SG-CNN achieves over a 3× compression rate and about a 3.5× FLOPs reduction with 73.38% top-1 and 91.69% top-5 accuracy on ImageNet. For DenseNet-201, SG-CNN achieves over a 4.5× compression rate and over a 4× FLOPs reduction, while delivering 75.17% top-1 and 92.60% top-5 accuracy on ImageNet. We further evaluated the generalization ability of SG-CNN on both domain adaptation and object detection, obtaining competitive results.
Acknowledgment
This work was supported by the National Key R&D Program of China (Grant No. 2018YFB1004901), the Independent Innovation Team Project of Jinan City (No. 2019GXRC013), the National Natural Science Foundation of China (Grant Nos. 61672265 and U1836218), the 111 Project of the Ministry of Education of China (Grant No. B12018), UK EPSRC Grant EP/N007743/1, and MURI/EPSRC/DSTL Grant EP/R018456/1.
References
Learning the number of neurons in deep networks. In Advances in Neural Information Processing Systems, pp. 2270–2278.
All you need is a few shifts: designing efficient convolutional neural networks for image classification. arXiv preprint arXiv:1903.05285.
Xception: deep learning with depthwise separable convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1251–1258.
BinaryConnect: training deep neural networks with binary weights during propagations. In Advances in Neural Information Processing Systems, pp. 3123–3131.
ChannelNets: compact and efficient convolutional neural networks via channel-wise convolutions. In Advances in Neural Information Processing Systems, pp. 5197–5205.
Rich feature hierarchies for accurate object detection and semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 580–587.
Fast R-CNN. In Proceedings of the IEEE International Conference on Computer Vision, pp. 1440–1448.
A localization method avoiding flip ambiguities for micro-UAVs with bounded distance measurement errors. IEEE Transactions on Mobile Computing 18, pp. 1718–1730.
Learning both weights and connections for efficient neural network. In Advances in Neural Information Processing Systems, pp. 1135–1143.
Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778.
Channel pruning for accelerating very deep neural networks. In Proceedings of the IEEE International Conference on Computer Vision, pp. 1389–1397.
MobileNets: efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861.
Network trimming: a data-driven neuron pruning approach towards efficient deep architectures. arXiv preprint arXiv:1607.03250.
Squeeze-and-excitation networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7132–7141.
CondenseNet: an efficient DenseNet using learned group convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2752–2761.
Densely connected convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4700–4708.
SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5 MB model size. arXiv preprint arXiv:1602.07360.
Deep roots: improving CNN efficiency with hierarchical filter groups. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1231–1240.
Constructing fast network through deconstruction of convolution. In Advances in Neural Information Processing Systems, pp. 5951–5961.
Learning multiple layers of features from tiny images. Technical report, Citeseer.
ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pp. 1097–1105.
Ternary weight networks. arXiv preprint arXiv:1605.04711.
Pruning filters for efficient ConvNets. arXiv preprint arXiv:1608.08710.
Exploiting kernel sparsity and entropy for interpretable CNN compression. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 8847–8856.
Holistic CNN compression via low-rank decomposition with knowledge transfer. IEEE Transactions on Pattern Analysis and Machine Intelligence.
Towards compact ConvNets via structure-sparsity regularized filter pruning. IEEE Transactions on Neural Networks and Learning Systems.
Accelerating convolutional networks via global & dynamic filter pruning. In IJCAI, pp. 2425–2432.
Microsoft COCO: common objects in context. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 740–755.
Progressive neural architecture search. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 19–34.
Learning efficient convolutional networks through network slimming. In Proceedings of the IEEE International Conference on Computer Vision, pp. 2736–2744.
Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3431–3440.
ThiNet: pruning CNN filters for a thinner net. IEEE Transactions on Pattern Analysis and Machine Intelligence.
ShuffleNet V2: practical guidelines for efficient CNN architecture design. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 116–131.
Exploring the regularity of sparse structure in convolutional neural networks. arXiv preprint arXiv:1705.08922.
Domain-adaptive deep network compression. In Proceedings of the IEEE International Conference on Computer Vision, pp. 4289–4297.
Pruning convolutional neural networks for resource efficient transfer learning. arXiv preprint arXiv:1611.06440.
Extreme network compression via filter group approximation. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 300–316.
XNOR-Net: ImageNet classification using binary convolutional neural networks. In European Conference on Computer Vision, pp. 525–542.
Faster R-CNN: towards real-time object detection with region proposal networks. In Advances in Neural Information Processing Systems, pp. 91–99.
U-Net: convolutional networks for biomedical image segmentation. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pp. 234–241.
ImageNet large scale visual recognition challenge. International Journal of Computer Vision 115 (3), pp. 211–252.
Inverted residuals and linear bottlenecks: mobile networks for classification, detection and segmentation. arXiv preprint arXiv:1801.04381.
Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556.
IGCV3: interleaved low-rank group convolutions for efficient deep neural networks. arXiv preprint arXiv:1806.00178.
MnasNet: platform-aware neural architecture search for mobile. arXiv preprint arXiv:1807.11626.
The Caltech-UCSD Birds-200-2011 dataset. Technical report.
Fully learnable group convolution for acceleration of deep neural networks. arXiv preprint arXiv:1904.00346.
Learning structured sparsity in deep neural networks. In Advances in Neural Information Processing Systems, pp. 2074–2082.
Shift: a zero FLOP, zero parameter alternative to spatial convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 9127–9135.
Interleaved structured sparse convolutional neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 8847–8856.
Aggregated residual transformations for deep neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1492–1500.
Learning adaptive discriminative correlation filters via temporal consistency preserving spatial feature selection for robust visual tracking. CoRR abs/1807.11348.
A faster PyTorch implementation of Faster R-CNN. https://github.com/jwyang/fasterrcnn.pytorch.
Interleaved group convolutions. In Proceedings of the IEEE International Conference on Computer Vision, pp. 4373–4382.
ShuffleNet: an extremely efficient convolutional neural network for mobile devices. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6848–6856.
Accelerating very deep convolutional networks for classification and detection. IEEE Transactions on Pattern Analysis and Machine Intelligence 38 (10), pp. 1943–1955.
PolyNet: a pursuit of structural diversity in very deep networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 718–726.
Variational convolutional neural network pruning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2780–2789.
Deep convolutional neural networks with merge-and-run mappings. In Proceedings of the 27th International Joint Conference on Artificial Intelligence, pp. 3170–3176.
UNet++: a nested U-Net architecture for medical image segmentation. In Deep Learning in Medical Image Analysis and Multimodal Learning for Clinical Decision Support, pp. 3–11.
Trained ternary quantization. arXiv preprint arXiv:1612.01064.
Learning transferable architectures for scalable image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 8697–8710.