ACNet: Strengthening the Kernel Skeletons for Powerful CNN via Asymmetric Convolution Blocks
Abstract
Designing an appropriate Convolutional Neural Network (CNN) architecture for a given application usually involves heavy human effort or numerous GPU hours, so the research community is soliciting architecture-neutral CNN structures, which can be easily plugged into multiple mature architectures to improve performance on real-world applications. We propose the Asymmetric Convolution Block (ACB), an architecture-neutral CNN building block that uses 1D asymmetric convolutions to strengthen square convolution kernels. For an off-the-shelf architecture, we replace the standard square-kernel convolutional layers with ACBs to construct an Asymmetric Convolutional Network (ACNet), which can be trained to reach a higher level of accuracy. After training, we equivalently convert the ACNet into the original architecture, so no extra computation is required at inference time. We have observed that ACNet improves the performance of various models on CIFAR and ImageNet by a clear margin. Through further experiments, we attribute the effectiveness of ACB to its capability of enhancing the model’s robustness to rotational distortions and strengthening the central skeleton parts of square convolution kernels.
1 Introduction
Convolutional Neural Networks (CNNs) have achieved great success in visual understanding, which makes them useful for various applications in wearable devices, security systems, mobile phones, automobiles, etc. As front-end devices are usually limited in computational resources and demand real-time inference, these applications require CNNs that deliver high accuracy within a limited computational budget. Thus it may not be practical to enhance a model by simply employing more trainable parameters and complicated connections. Therefore, we consider it meaningful to improve the performance of CNNs with no extra inference-time computation, memory footprint, or energy consumption.
On the other hand, along with the advancements in the CNN architecture design literature, the performance of off-the-shelf models has been significantly improved. However, when the existing models cannot meet our specific needs, we may not be able to afford customizing a new architecture at the cost of heavy human effort or numerous GPU hours [36]. Recently, the research community has been soliciting innovative architecture-neutral CNN structures, e.g., SE blocks [14] and quasi-hexagonal kernels [30], which can be directly combined with various up-to-date architectures to improve performance on real-world applications.
Some recent investigations on CNN architectures focus on 1) how the layers are connected with each other, e.g., simply stacked together [20, 28], through identity mapping [13, 31, 35] or densely connected [15], and 2) how the outputs of different layers are combined to increase the quality of learned representations [16, 31, 32, 33]. Considering this, in quest of a generic architecture-neutral CNN structure which can be combined with numerous architectures, we seek to strengthen standard convolutional layers by digging into an orthogonal aspect: the relationship between the weights and their spatial locations in the kernels.
In this paper, we propose the Asymmetric Convolution Block (ACB), a building block that replaces the standard convolutional layers with square kernels, e.g., the $3\times3$ layers that are widely used in modern CNNs. Concretely, to replace a $d\times d$ layer, we construct an ACB comprising three parallel layers with $d\times d$, $1\times d$ and $d\times 1$ kernels, respectively, whose outputs are summed to enrich the feature space (Fig. 1). As the introduced $1\times d$ and $d\times 1$ layers have non-square kernels, we refer to them as asymmetric convolutional layers, following [33]. Given an off-the-shelf architecture, we construct an Asymmetric Convolutional Network (ACNet) by replacing every square-kernel layer with an ACB and train it until convergence. After that, we equivalently convert the ACNet into the original architecture by adding the asymmetric kernels in each ACB onto the corresponding positions of the square kernels. Due to the additivity of convolutions with compatible kernel sizes (Fig. 2), which is obvious but has long been ignored, the resulting model produces the same outputs as the training-time ACNet. As will be shown in our experiments (Sect. 4.1, 4.2), doing so improves the performance of several benchmark models on CIFAR [19] and ImageNet [3] by a clear margin. Better still, ACNet 1) introduces NO hyperparameters, so it can be combined with different architectures without careful tuning; 2) is simple to implement on mainstream CNN frameworks like PyTorch [26] and TensorFlow [1]; 3) requires NO extra inference-time computation compared to the original architecture.
Through further experiments, we have partly explained the effectiveness of ACNet. We observe that a square convolution kernel distributes its learned knowledge unequally: the weights on the central crisscross positions (referred to as the “skeleton” of the kernel) are usually larger in magnitude, and removing them causes a larger accuracy drop than removing those in the corners. In each ACB, we add the horizontal and vertical kernels onto the skeletons, thus explicitly making the skeletons more powerful, following the nature of square kernels. Interestingly, the weights on the corresponding positions of the square, horizontal and vertical kernels are randomly initialized and may grow opposite in sign, so summing them up may result in a stronger or a weaker skeleton. However, we have empirically observed a consistent phenomenon: the model always learns to enhance the skeletons at every layer. This observation may shed light on future research on the relationship among the weights at different spatial locations. The code is available at https://github.com/ShawnDing1994/ACNet.
Our contributions are summarized as follows.

We propose to use asymmetric convolutions to explicitly enhance the representational power of a standard square-kernel layer in a way that the asymmetric convolutions can be fused into the square kernels with NO extra inference-time computations needed, rather than to approximate a square-kernel layer like many prior works [4, 17, 18, 23, 25, 33].

We propose ACB as a novel architecture-neutral CNN building block. We can construct an ACNet by simply replacing every square-kernel convolutional layer in a mature architecture with an ACB without introducing any hyperparameters, such that its effectiveness can be combined with the numerous advancements in the CNN architecture design literature.

We have improved the accuracy of several common benchmark models on CIFAR-10, CIFAR-100, and ImageNet by a clear margin.

We have justified the significance of skeletons in standard square convolution kernels and demonstrated the effectiveness of ACNet in enhancing such skeletons.

We have shown that ACNet can enhance the model’s robustness to rotational distortions, which may inspire further studies on the rotational invariance problem.
2 Related work
2.1 Asymmetric convolutions
Asymmetric convolutions are typically used to approximate an existing square-kernel convolutional layer for compression and acceleration. Some prior works [4, 17] have shown that a standard $d\times d$ convolutional layer can be factorized as a sequence of two layers with $d\times 1$ and $1\times d$ kernels to reduce the parameters and required computations. The theory behind this is quite simple: if a 2D kernel has a rank of one, the operation can be equivalently transformed into a series of 1D convolutions. However, as the learned kernels in deep networks have distributed eigenvalues, their intrinsic rank is higher than one in practice, so applying the transformation directly to the kernels results in significant information loss [18]. Denton et al. [4] tackled this problem by finding a low-rank approximation in an SVD-based manner and then fine-tuning the upper layers to restore the performance. Jaderberg et al. [17] succeeded in learning the horizontal and vertical kernels by minimizing the $\ell_2$ reconstruction error. Jin et al. [18] applied structural constraints to make the 2D kernels separable and obtained performance comparable to conventional CNNs with a speedup.
On the other hand, asymmetric convolutions are also widely employed as an architectural design element to save parameters and computations. For example, in Inception-v3 [33], the $7\times7$ convolutions are replaced by a sequence of $1\times7$ and $7\times1$ convolutions. However, the authors found that such a replacement is not equivalent, as it did not work well on the low-level layers. ENet [25] also adopted this approach for the design of an efficient semantic segmentation network, where the $5\times5$ convolutions are decomposed, allowing the receptive field to be increased within a reasonable computational budget. EDANet [23] used a similar method to decompose the $3\times3$ convolutions, resulting in a 33% saving in the number of parameters and required computations with minor performance degradation.
In contrast, we use 1D asymmetric convolutions not to factorize any layers as part of the architectural design, but to enrich the feature space during training and then fuse their learned knowledge into the square-kernel layers.
2.2 Architecture-neutral CNN structures
We intend not to modify the CNN architecture but to use architecture-neutral structures to enhance off-the-shelf models. Thus the effectiveness of our method is supplementary to the advancements achieved by innovative architectures. Specifically, a CNN structure can be called architecture-neutral if it 1) makes no assumptions about the specific architecture, and thus can be applied to various models, and 2) brings universal benefits. For example, SE blocks [14] can be appended after a convolutional layer to rescale the feature map channels with learned weights, resulting in a clear accuracy improvement at a reasonable cost of extra parameters and computation. As another example, an auxiliary classifier [32] can be inserted into the model to assist in supervising the learning process, which can indeed improve the performance by an observable margin but requires extra human effort to tune the hyperparameters.
By contrast, ACNet introduces NO hyperparameters during training and requires NO extra parameters or computations during inference. Therefore, in real-world applications, the developer can use ACNet to enhance a variety of models without exhaustive parameter tuning, and the end-users can enjoy the performance improvement without slowing down the inference. Better still, since we introduce no custom structures into the deployed model, it can be further compressed via techniques including connection pruning [9, 12], channel pruning [6, 5, 22, 24], quantization [2, 10, 27], feature map compacting [34], etc.
3 Asymmetric Convolutional Network
3.1 Formulation
For a convolutional layer with a kernel size of $H\times W$ and $D$ filters which takes a $C$-channel feature map as input, we use $F\in\mathbb{R}^{H\times W\times C}$ to denote the 3D convolution kernel of a filter, $M\in\mathbb{R}^{U\times V\times C}$ for the input, which is a feature map with a spatial resolution of $U\times V$ and $C$ channels, and $O\in\mathbb{R}^{R\times T\times D}$ for the output with $D$ channels, respectively. For the $j$-th filter at such a layer, the corresponding output feature map channel is
$$O_{:,:,j} = \sum_{k=1}^{C} M_{:,:,k} * F^{(j)}_{:,:,k}, \tag{1}$$
where $*$ is the 2D convolution operator, $M_{:,:,k}$ is the $k$-th channel of $M$ in the form of a $U\times V$ matrix, and $F^{(j)}_{:,:,k}$ is the $k$-th input channel of $F^{(j)}$, i.e., a 2D kernel of $H\times W$.
In modern CNN architectures, batch normalization [16] is widely adopted to reduce overfitting and accelerate the training process. As a common practice, a batch normalization layer is usually followed by a linear scaling transformation to enhance the representational power. Compared to Eq. 1, the output channel then becomes
$$O_{:,:,j} = \Big(\sum_{k=1}^{C} M_{:,:,k} * F^{(j)}_{:,:,k} - \mu_j\Big)\frac{\gamma_j}{\sigma_j} + \beta_j, \tag{2}$$
where $\mu_j$ and $\sigma_j$ are the values of the channel-wise mean and standard deviation of batch normalization, and $\gamma_j$ and $\beta_j$ are the learned scaling factor and bias term, respectively.
3.2 Exploiting the additivity of convolution
We seek to employ asymmetric convolutions in a way that they can be equivalently fused into the standard square-kernel layers, such that no extra inference-time computation is introduced. We notice a useful property of convolution: if several 2D kernels with compatible sizes operate on the same input with the same stride to produce outputs of the same resolution, and their outputs are summed up, we can add up these kernels on the corresponding positions to obtain an equivalent kernel which will produce the same output. That is, the additivity may hold for 2D convolutions, even with different kernel sizes,
$$I * K^{(1)} + I * K^{(2)} = I * \big(K^{(1)} \oplus K^{(2)}\big), \tag{3}$$
where $I$ is a matrix, $K^{(1)}$ and $K^{(2)}$ are two 2D kernels with compatible sizes, and $\oplus$ is the element-wise addition of the kernel parameters on the corresponding positions. Note that $I$ may need to be appropriately clipped or padded.
Here compatible means that we can “patch” the smaller kernel onto the bigger. Formally, this kind of transformation on layer $p$ and $q$, with inputs $M^{(p)}$ and $M^{(q)}$, kernel sizes $H_p\times W_p$ and $H_q\times W_q$, and $D_p$ and $D_q$ filters, is feasible if
$$M^{(p)} = M^{(q)},\quad H_p \le H_q,\quad W_p \le W_q,\quad D_p = D_q. \tag{4}$$
E.g., $3\times1$ and $1\times3$ kernels are compatible with $3\times3$.
This can be easily verified by investigating the calculation of convolution in the form of sliding windows (Fig. 2). For a certain filter with kernel $F$, a certain point $y$ on the output channel is given by
$$y = \sum_{c=1}^{C}\sum_{h=1}^{H}\sum_{w=1}^{W} F_{h,w,c}\, X_{h,w,c}, \tag{5}$$
where $X$ is the corresponding sliding window on input $M$. Obviously, when we sum up two output channels produced by two filters, the additivity (Eq. 3) holds if, for each point $y$ on one channel, its corresponding point on the other channel shares the same sliding window.
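The additivity property can be checked numerically on a toy example. The sketch below is an illustration only, not the authors' implementation: it uses dependency-free Python, a single input channel, stride 1, and hypothetical helper names (`conv2d`, `patch_onto_square`). Zero padding is chosen per kernel (1 on both axes for the square kernel, columns only for the horizontal kernel, rows only for the vertical kernel) so that the three outputs have the same resolution and corresponding output points share aligned sliding windows.

```python
def conv2d(image, kernel, pad_h, pad_w):
    """Stride-1 sliding-window convolution (cross-correlation, as in CNNs)
    with pad_h zero rows and pad_w zero columns added on each side."""
    kh, kw = len(kernel), len(kernel[0])
    h, w = len(image), len(image[0])
    padded = [[0] * (w + 2 * pad_w) for _ in range(h + 2 * pad_h)]
    for r in range(h):
        for c in range(w):
            padded[r + pad_h][c + pad_w] = image[r][c]
    return [[sum(kernel[i][j] * padded[r + i][c + j]
                 for i in range(kh) for j in range(kw))
             for c in range(w + 2 * pad_w - kw + 1)]
            for r in range(h + 2 * pad_h - kh + 1)]


def add_matrices(a, b):
    """Element-wise sum of two equally sized matrices."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def patch_onto_square(square, horizontal, vertical):
    """Element-wise addition on corresponding positions: the 1x3 kernel goes
    onto the middle row and the 3x1 kernel onto the middle column."""
    fused = [row[:] for row in square]
    for j in range(3):
        fused[1][j] += horizontal[0][j]
    for i in range(3):
        fused[i][1] += vertical[i][0]
    return fused


image = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]]
square = [[1, 0, -1], [2, 0, -2], [1, 0, -1]]   # 3x3 kernel
horizontal = [[1, 2, 3]]                        # 1x3 kernel
vertical = [[-1], [4], [2]]                     # 3x1 kernel

# Left-hand side of Eq. 3: convolve with each kernel, then sum the outputs.
summed = add_matrices(
    add_matrices(conv2d(image, square, 1, 1), conv2d(image, horizontal, 0, 1)),
    conv2d(image, vertical, 1, 0))

# Right-hand side of Eq. 3: add the kernels first, then convolve once.
fused = conv2d(image, patch_onto_square(square, horizontal, vertical), 1, 1)

print(summed == fused)
```

Integer inputs are used so that the comparison is exact; with floating-point weights the two sides agree only up to rounding error.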
3.3 ACB for free inference-time improvements
In this paper, we focus on $3\times3$ convolutions, which are heavily used in modern CNN architectures. Given an architecture, we construct an ACNet by simply replacing every $3\times3$ layer (together with the following batch normalization layer, if any) with an ACB which comprises three parallel layers with kernel size $3\times3$, $1\times3$ and $3\times1$, respectively. Similar to the common practice in standard CNNs, each of the three layers is followed by batch normalization, which together we refer to as a branch, and the outputs of the three branches are summed up as the output of the ACB. Note that we can train the ACNet using the same configurations as the original model without any extra hyperparameters to be tuned.
As will be shown in Sect. 4.1 and Sect. 4.2, the ACNet can be trained to reach a higher level of accuracy. When the training is completed, we convert every ACB to a standard convolutional layer which produces identical outputs. By doing so, we obtain a more powerful network which requires no extra computation compared to a normally trained counterpart. This conversion is achieved through two steps, namely, BN fusion and branch fusion.
BN fusion.
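The homogeneity of convolution allows the following batch normalization and linear scaling transformation to be equivalently fused into the convolutional layer with an added bias. It can be observed from Eq. 2 that, for each branch, if we construct a new kernel $\frac{\gamma_j}{\sigma_j}F^{(j)}$ (the original kernel scaled by the ratio of the learned scaling factor to the batch-normalization standard deviation) along with an added bias term $-\frac{\mu_j\gamma_j}{\sigma_j}+\beta_j$, we produce the same output, which can be easily verified.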
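As a minimal sketch of BN fusion, the following pure-Python example evaluates one output point (one sliding window) before and after fusing. The helper names and the numeric values are illustrative assumptions; in real frameworks the standard deviation is typically computed as the square root of the running variance plus a small epsilon, which is not modeled here.

```python
import math


def conv_point(kernel, window):
    """One output point: sum of element-wise products over a sliding window."""
    return sum(k * x for k_row, x_row in zip(kernel, window)
               for k, x in zip(k_row, x_row))


def batch_norm(y, mu, sigma, gamma, beta):
    """Inference-time BN followed by the linear scaling transformation."""
    return (y - mu) * gamma / sigma + beta


def fuse_bn(kernel, mu, sigma, gamma, beta):
    """Return the rescaled kernel and the bias that absorb the BN layer."""
    scale = gamma / sigma
    fused_kernel = [[scale * k for k in row] for row in kernel]
    bias = beta - mu * scale
    return fused_kernel, bias


kernel = [[1.0, 0.0, -1.0], [2.0, 0.0, -2.0], [1.0, 0.0, -1.0]]
window = [[3.0, 1.0, 4.0], [1.0, 5.0, 9.0], [2.0, 6.0, 5.0]]
mu, sigma, gamma, beta = 0.5, 2.0, 1.5, 0.25   # illustrative BN statistics

# Convolution followed by BN, versus a single convolution with a bias.
reference = batch_norm(conv_point(kernel, window), mu, sigma, gamma, beta)
fused_kernel, bias = fuse_bn(kernel, mu, sigma, gamma, beta)
fused = conv_point(fused_kernel, window) + bias

print(math.isclose(reference, fused))
```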
Branch fusion.
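We merge the three BN-fused branches into a standard convolutional layer by adding the asymmetric kernels onto the corresponding positions of the square kernel. In practice, this transformation is implemented by building a network of the original structure and using the fused weights for initialization, so that we produce the same outputs as the ACNet with the same computational budget as the original architecture. Formally, for every filter $j$, let $F'^{(j)}$ be the fused 3D kernel, $b_j$ be the obtained bias term, and $\bar{F}^{(j)}$ and $\hat{F}^{(j)}$ be the kernels of the corresponding filter at the $1\times3$ and $3\times1$ layer, respectively, with barred and hatted symbols denoting the batch normalization parameters of those two branches. Then we have
$$F'^{(j)} = \frac{\gamma_j}{\sigma_j}F^{(j)} \oplus \frac{\bar{\gamma}_j}{\bar{\sigma}_j}\bar{F}^{(j)} \oplus \frac{\hat{\gamma}_j}{\hat{\sigma}_j}\hat{F}^{(j)}, \tag{6}$$
$$b_j = -\frac{\mu_j\gamma_j}{\sigma_j} - \frac{\bar{\mu}_j\bar{\gamma}_j}{\bar{\sigma}_j} - \frac{\hat{\mu}_j\hat{\gamma}_j}{\hat{\sigma}_j} + \beta_j + \bar{\beta}_j + \hat{\beta}_j. \tag{7}$$
Then we can easily verify that for an arbitrary filter $j$,
$$O_{:,:,j} + \bar{O}_{:,:,j} + \hat{O}_{:,:,j} = \sum_{k=1}^{C} M_{:,:,k} * F'^{(j)}_{:,:,k} + b_j, \tag{8}$$
where $O_{:,:,j}$, $\bar{O}_{:,:,j}$ and $\hat{O}_{:,:,j}$ are the outputs of the original $3\times3$, $1\times3$ and $3\times1$ branch, respectively. Fig. 3 shows an example on a single input channel for more intuition.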
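To make the two fusion steps concrete, the sketch below fuses a toy ACB on a single input channel and compares the training-time and inference-time outputs at one sliding window. It is a hedged illustration rather than the released code: `fuse_acb`, the tuple layout `(mu, sigma, gamma, beta)`, and all numeric values are assumptions made for this example.

```python
import math


def bn(y, mu, sigma, gamma, beta):
    """Inference-time batch normalization with linear scaling."""
    return (y - mu) * gamma / sigma + beta


def fuse_acb(square, horizontal, vertical, bn_square, bn_horizontal, bn_vertical):
    """Fuse a 3x3, a 1x3 and a 3x1 branch (each with its own BN) into a single
    3x3 kernel plus bias. Each bn_* is a (mu, sigma, gamma, beta) tuple."""
    fused = [[0.0] * 3 for _ in range(3)]
    bias = 0.0
    branches = [
        (bn_square, [(i, j, square[i][j]) for i in range(3) for j in range(3)]),
        (bn_horizontal, [(1, j, horizontal[j]) for j in range(3)]),  # middle row
        (bn_vertical, [(i, 1, vertical[i]) for i in range(3)]),      # middle column
    ]
    for (mu, sigma, gamma, beta), entries in branches:
        scale = gamma / sigma
        for i, j, weight in entries:
            fused[i][j] += scale * weight   # BN fusion, then branch fusion
        bias += beta - mu * scale
    return fused, bias


square = [[0.2, -0.1, 0.3], [0.5, 1.0, -0.4], [0.1, 0.7, -0.2]]
horizontal = [0.3, -0.6, 0.9]    # 1x3 kernel
vertical = [-0.5, 0.8, 0.4]      # 3x1 kernel
bn_square = (0.1, 1.2, 0.9, 0.05)
bn_horizontal = (-0.2, 0.8, 1.1, -0.03)
bn_vertical = (0.3, 1.5, 0.7, 0.02)

window = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0]]   # one sliding window

# Training-time ACB: three branch outputs, each batch-normalized, then summed.
y_square = sum(square[i][j] * window[i][j] for i in range(3) for j in range(3))
y_horizontal = sum(horizontal[j] * window[1][j] for j in range(3))
y_vertical = sum(vertical[i] * window[i][1] for i in range(3))
acb_output = (bn(y_square, *bn_square) + bn(y_horizontal, *bn_horizontal)
              + bn(y_vertical, *bn_vertical))

# Inference-time equivalent: one 3x3 kernel and one bias.
fused_kernel, fused_bias = fuse_acb(square, horizontal, vertical,
                                    bn_square, bn_horizontal, bn_vertical)
fused_output = fused_bias + sum(fused_kernel[i][j] * window[i][j]
                                for i in range(3) for j in range(3))

print(math.isclose(acb_output, fused_output))
```

Only the skeleton positions receive contributions from the asymmetric branches; the four corners of the fused kernel are the BN-scaled corners of the square kernel.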
Note that although an ACB can be equivalently transformed into a standard layer, the equivalence only holds at inference time: the training dynamics are different, which gives rise to different performance. The non-equivalence of the training process is due to the random initialization of kernel weights and to the gradients derived from the different computation flows they participate in.
4 Experiments
We have conducted extensive experiments to verify the effectiveness of ACNet in improving the performance of CNNs across a range of datasets and architectures. Concretely, we pick an off-the-shelf architecture as the baseline, build an ACNet counterpart, train it from scratch, convert it into the same structure as the baseline, and test it to collect the accuracy. For comparability, all the models are trained until complete convergence, and every pair of baseline and ACNet uses identical configurations, e.g., learning rate schedules and batch sizes.
4.1 Performance improvements on CIFAR
In order to preliminarily evaluate our method on various CNN architectures, we experiment with several representative benchmark models including Cifar-quick [29], VGG-16 [28], ResNet-56 [13], WRN-16-8 [35] and DenseNet-40 [15] on CIFAR-10 and CIFAR-100 [19].
For Cifar-quick, VGG-16, ResNet-56, and DenseNet-40, we train the models using a staircase learning rate of 0.1, 0.01, 0.001 and 0.0001 following the common practice. For WRN-16-8, we follow the training configurations reported in the original paper [35]. We use the data augmentation techniques adopted by [13], i.e., padding to $40\times40$, random cropping and left-right flipping.
As can be observed from Table 1 and Table 2, the performance of all the models is consistently lifted by a clear margin, suggesting that the benefits of ACBs can be combined with various architectures.
Table 1: Top-1 accuracy (%) on CIFAR-10.
Model  Base Top-1  ACNet Top-1  Top-1 gain
Cifar-quick  83.13  84.24  1.11
VGG  94.12  94.47  0.35
ResNet-56  94.31  95.09  0.78
WRN-16-8  95.56  96.15  0.59
DenseNet-40  94.29  94.84  0.55
Table 2: Top-1 accuracy (%) on CIFAR-100.
Model  Base Top-1  ACNet Top-1  Top-1 gain
Cifar-quick  53.22  54.30  1.08
VGG  74.56  75.20  0.64
ResNet-56  73.58  74.04  0.46
WRN-16-8  80.01  80.49  0.48
DenseNet-40  73.14  73.41  0.27
4.2 Performance improvements on ImageNet
Table 3: Single-crop Top-1 and Top-5 accuracy (%) on ImageNet.
Model  Base Top-1  ACNet Top-1  Top-1 gain  Base Top-5  ACNet Top-5  Top-5 gain
AlexNet  55.92  57.44  1.52  79.53  80.73  1.20
ResNet-18  70.36  71.14  0.78  89.61  89.96  0.35
DenseNet-121  75.15  75.82  0.67  92.45  92.77  0.32
We then validate the effectiveness of our method on real-world applications through a series of experiments on ImageNet [3], which comprises 1.28M images for training and 50K for validation from 1000 classes. We use AlexNet [20], ResNet-18 [13] and DenseNet-121 [15] as the representatives for the plain-style, residual and densely connected architectures, respectively. Every model is trained with a batch size of 256 for 150 epochs, which is longer than the usually adopted benchmarks (e.g., 90 epochs [13]), such that the accuracy improvement cannot be simply attributed to incomplete convergence of the base models. For data augmentation, we employ the standard pipeline including bounding box distortion, left-right flipping and color shift, as a common practice. In particular, the plain version of AlexNet we use comes from the TensorFlow GitHub repository [8], and is composed of five stacked convolutional layers and three fully-connected layers with no local response normalization (LRN) or cross-GPU connections. For faster convergence, we apply batch normalization [16] on every one of its convolutional layers. Note that since the first two layers of AlexNet use $11\times11$ and $5\times5$ kernels, respectively, it is possible to extend ACBs to have larger asymmetric kernels. However, we still only use $1\times3$ and $3\times1$ convolutions for these two layers, because such large-scale convolutions are becoming less favored in modern CNNs, making large ACBs less useful.
As shown in Table 3, the single-crop Top-1 accuracy of AlexNet, ResNet-18 and DenseNet-121 is lifted by 1.52%, 0.78% and 0.67%, respectively. In practice, aiming at the same target accuracy, we can use ACNet to enhance a more efficient model to achieve the target with less inference time, energy consumption, and storage space. On the other hand, with the same constraints on computational budget or model size, we can use ACNet to improve the accuracy by a clear margin, such that the gained performance can be viewed as a free benefit from the viewpoint of end-users.
4.3 Ablation studies
Table 4: Top-1 accuracy (%) of the ablation studies on ImageNet with original and rotationally distorted validation images (✓ = used, - = not used).
Model  Horizontal kernel  Vertical kernel  BN in branch  Original input  Rotate 90°  Rotate 180°  Up-down flip
AlexNet  -  -  -  55.92  28.18  31.41  31.62
AlexNet  -  ✓  ✓  57.10  29.65  32.86  33.02
AlexNet  ✓  -  ✓  57.25  29.97  33.74  33.74
AlexNet  ✓  ✓  ✓  57.44  30.49  33.98  33.82
AlexNet  ✓  ✓  -  56.18  28.81  32.12  32.33
ResNet-18  -  -  -  70.36  41.00  41.95  41.86
ResNet-18  -  ✓  ✓  70.78  41.61  42.47  42.66
ResNet-18  ✓  -  ✓  70.70  42.06  43.22  43.05
ResNet-18  ✓  ✓  ✓  71.14  42.20  42.89  43.10
ResNet-18  ✓  ✓  -  70.82  41.70  42.92  42.90
Though we have empirically justified the effectiveness of ACNet, we still wish to find some explanations. In this subsection, we investigate ACNet through a series of ablation studies. Specifically, we focus on the following three design decisions: the usage of 1) horizontal kernels, 2) vertical kernels, and 3) batch normalization in every branch. For comparability, we train several AlexNet and ResNet-18 models on ImageNet with different ablations using the same training configurations. Note that if the batch normalizations in the branches are removed, we batch-normalize the output of the whole ACB instead, i.e., the position of the batch normalization layer is changed from pre-summation to post-summation.
As can be observed from Table 4, removing any of the three designs degrades the model. However, though the horizontal and vertical convolutions can both improve the performance, there may exist some difference because the horizontal and vertical directions are treated unequally in practice, e.g., we usually perform random left-right but no up-down image flipping to augment the training data. Therefore, if an upside-down image is fed into the model, the original square-kernel layers should produce meaningless results, which is natural, but a horizontal kernel will produce the same outputs as on the original image at the axially symmetric locations (Fig. 4). That is, a part of the ACB can still extract the correct features. Considering this, we assume that ACBs may enhance the model’s robustness to rotational distortions, enabling the model to generalize better on unseen data.
We then test the formerly trained models with rotationally distorted images from the whole validation set, including 90° counterclockwise rotation, 180° rotation, and up-down flipping. Naturally, the accuracy of every model is significantly reduced, but the models with horizontal kernels deliver observably higher accuracy on the 180° rotated and up-down flipped images. E.g., the ResNet-18 equipped with only horizontal kernels delivers an accuracy slightly lower than that of the counterpart with only vertical kernels on the original inputs, but 0.75% higher on the 180° rotated inputs. And when compared with the base model, its accuracy is 0.34% / 1.27% higher on the original / 180° rotated images, respectively. Predictably, the models show similar performance on the 180° rotated and up-down flipped inputs, as 180° rotation plus left-right flipping is equivalent to up-down flipping, and the model is robust to left-right flipping due to the data augmentation methods.
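The two geometric facts used in this argument can be checked on a toy image. The sketch below is an illustration in plain Python with hypothetical helper names: it verifies that a 180° rotation followed by a left-right flip equals an up-down flip, and that a horizontal ($1\times3$) kernel commutes with up-down flipping. For contrast, it also shows that one particular non-symmetric vertical ($3\times1$) kernel does not commute with the flip on this image; a vertically symmetric kernel would.

```python
def flip_ud(m):
    """Up-down flip: reverse the order of the rows."""
    return [row[:] for row in m[::-1]]


def flip_lr(m):
    """Left-right flip: reverse every row."""
    return [row[::-1] for row in m]


def rot180(m):
    """180-degree rotation: reverse rows and columns."""
    return [row[::-1] for row in m[::-1]]


def conv_rows(image, k):
    """1x3 horizontal kernel slid along every row (no padding)."""
    return [[sum(k[j] * row[c + j] for j in range(3))
             for c in range(len(row) - 2)] for row in image]


def conv_cols(image, k):
    """3x1 vertical kernel slid along every column (no padding)."""
    return [[sum(k[i] * image[r + i][c] for i in range(3))
             for c in range(len(image[0]))] for r in range(len(image) - 2)]


image = [[3, 1, 4, 1], [5, 9, 2, 6], [5, 3, 5, 8], [9, 7, 9, 3]]
k = [1, 2, -1]   # a non-symmetric 1D kernel, used for both orientations

# 180-degree rotation plus left-right flipping is an up-down flip.
same_transform = flip_lr(rot180(image)) == flip_ud(image)

# Horizontal kernel: flipping the input just flips the output.
horizontal_commutes = conv_rows(flip_ud(image), k) == flip_ud(conv_rows(image, k))

# Vertical kernel: the flipped input yields different features here.
vertical_commutes = conv_cols(flip_ud(image), k) == flip_ud(conv_cols(image, k))

print(same_transform, horizontal_commutes, vertical_commutes)
```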
In summary, we have shown that ACBs, especially the horizontal kernels inside, can enhance the model’s robustness to rotational distortions by an observable margin. Though this may not be the primary reason for the effectiveness of ACNet, we consider it promising for inspiring further research on the rotational invariance problem.
4.4 ACB enhances the skeletons of square kernels
Intuitively, as adding the horizontal and vertical kernels onto the square kernel can be viewed as a means to explicitly enhance the skeleton part, we seek to explain the effectiveness of ACNet by investigating the difference between the skeleton and the weights at the corners.
Inspired by the CNN pruning methods [9, 11, 12], we start by removing some weights at different spatial locations and observing the performance drop using ResNet-56 on CIFAR-10. Concretely, we randomly set some individual weights in the kernels to zero and test the model. As shown in Fig. 4(a), for the curve labeled as corner, we randomly select the weights from the four corners of every $3\times3$ kernel and set them to zero in order to attain a given global sparsity ratio for every convolutional layer. Note that as $4/9 \approx 44.4\%$, a sparsity ratio of 44% means removing most of the weights at the four corners. For skeleton, we randomly select the weights only from the skeleton of every kernel. For global, every individual weight in the kernel has an equal chance to be chosen. The experiments are repeated five times with different random seeds, and the mean±std curves are depicted.
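The selection procedure can be sketched as follows. This is a hypothetical illustration of the position-restricted random pruning described above, not the authors' script: the function name `prune_layer`, the group names, and the toy layer of 100 all-ones kernels are assumptions. It also makes the arithmetic explicit: the corners hold 4 of the 9 weights of a $3\times3$ kernel and the skeleton holds the other 5.

```python
import random

CORNERS = [(0, 0), (0, 2), (2, 0), (2, 2)]
SKELETON = [(0, 1), (1, 0), (1, 1), (1, 2), (2, 1)]
GROUPS = {"corner": CORNERS, "skeleton": SKELETON, "global": CORNERS + SKELETON}


def prune_layer(kernels, group, sparsity, rng):
    """Zero randomly chosen weights of one layer, drawn only from `group`,
    until a fraction `sparsity` of ALL the layer's weights is zero.
    `kernels` is a list of 3x3 nested lists (every 2D kernel of the layer)."""
    candidates = [(n, i, j) for n in range(len(kernels)) for i, j in GROUPS[group]]
    n_remove = round(sparsity * 9 * len(kernels))
    if n_remove > len(candidates):
        raise ValueError("sparsity ratio not reachable with this group")
    pruned = [[row[:] for row in kernel] for kernel in kernels]
    for n, i, j in rng.sample(candidates, n_remove):
        pruned[n][i][j] = 0.0
    return pruned


def count_zeros(kernels, positions):
    """Number of zeroed weights at the given positions across all kernels."""
    return sum(kernel[i][j] == 0.0 for kernel in kernels for i, j in positions)


rng = random.Random(0)
layer = [[[1.0] * 3 for _ in range(3)] for _ in range(100)]   # 100 toy kernels

corner_pruned = prune_layer(layer, "corner", 0.44, rng)
skeleton_pruned = prune_layer(layer, "skeleton", 0.13, rng)

# 44% global sparsity removes 396 of the 400 corner weights;
# 13% global sparsity removes 117 of the 500 skeleton weights (23.4%).
print(count_zeros(corner_pruned, CORNERS), count_zeros(skeleton_pruned, SKELETON))
```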
As can be observed, all the curves show a tendency of decreasing as the sparsity ratio increases, but not monotonically, due to the random effects. It is obvious that removing the weights from the corners causes less damage to the model, but pruning the skeletons does more harm. This phenomenon suggests that the skeleton weights are more important to the model’s representational capacity.
We continue to verify whether this observation holds for ACNet. We convert the ACNet counterpart via BN and branch fusion, then conduct the same experiments on it. As shown in Fig. 4(b), we observe an even more significant gap, e.g., pruning almost all the corner weights only degrades the model’s accuracy to above 60%. On the other hand, pruning the skeletons causes more damage, as the model is destroyed when the global sparsity ratio attained by pruning the skeletons merely reaches 13%, i.e., $13\%\times 9/5 = 23.4\%$ of the weights of the skeletons are removed.
Then we explore the cause of the above observations by investigating the numeric values of the kernels. We use the magnitude (i.e., absolute value) as the metric for the importance of parameters, which is adopted by many prior CNN pruning works [7, 9, 12, 21]. Specifically, we add up all the fused 2D kernels in a convolutional layer, perform a layer-wise normalization by the max value, and finally obtain an average of the normalized kernels of all the layers. Formally, let $F^{(i,j)}$ be the 3D kernel of the $j$-th filter at the $i$-th $3\times3$ layer, $n$ be the number of all such layers, and $\max$ and $\mathrm{abs}$ be the max and element-wise absolute value, respectively. The average kernel magnitude matrix is computed as
$$A = \frac{1}{n}\sum_{i=1}^{n}\frac{S^{(i)}}{\max\big(S^{(i)}\big)}, \tag{9}$$
where the sum of absolute kernels of layer $i$, which has $D$ filters and $C$ input channels, is
$$S^{(i)} = \sum_{j=1}^{D}\sum_{k=1}^{C}\mathrm{abs}\big(F^{(i,j)}\big)_{:,:,k}. \tag{10}$$
We present the $A$ values of the normally trained ResNet-56 and the fused ACNet counterpart in Fig. 5(a) and Fig. 5(b), where the numeric value and color at a certain grid cell indicate the average relative importance of the parameter on the corresponding position across all the layers, i.e., a larger value and darker background color indicate a higher average importance of the parameter.
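The computation in Eq. 9 and Eq. 10 can be sketched in a few lines. The example below is an illustrative, dependency-free version under an assumed data layout (a layer is a list of filters, and a filter is a list of $3\times3$ 2D kernels, one per input channel); the two toy layers are made up for the demonstration.

```python
import math


def sum_abs_kernels(layer):
    """Eq. 10: add up the absolute values of every 2D kernel of one layer."""
    total = [[0.0] * 3 for _ in range(3)]
    for filt in layer:              # filters j = 1..D
        for kernel in filt:         # input channels k = 1..C
            for i in range(3):
                for j in range(3):
                    total[i][j] += abs(kernel[i][j])
    return total


def average_kernel_magnitude(layers):
    """Eq. 9: normalize each layer's sum by its max entry, then average."""
    average = [[0.0] * 3 for _ in range(3)]
    for layer in layers:
        total = sum_abs_kernels(layer)
        peak = max(max(row) for row in total)
        for i in range(3):
            for j in range(3):
                average[i][j] += total[i][j] / peak / len(layers)
    return average


# Two toy layers, each with one filter and one input channel.
layer_a = [[[[1.0, -2.0, 1.0], [-2.0, 4.0, -2.0], [1.0, -2.0, 1.0]]]]
layer_b = [[[[0.0, 1.0, 0.0], [1.0, -2.0, 1.0], [0.0, 1.0, 0.0]]]]

A = average_kernel_magnitude([layer_a, layer_b])
for row in A:
    print(row)
```

In this toy case the centre has the largest magnitude in both layers, so its entry in the averaged matrix is 1, while the corners average to 0.125 and the remaining skeleton positions to 0.5.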
As can be observed from Fig. 5(a), the normally trained ResNet-56 distributes the magnitude of the parameters in an imbalanced manner, i.e., the central point has the largest magnitude, and the points at the four corners have the smallest. Fig. 5(b) shows that ACNet aggravates such imbalance, as the values of the four corners are decreased to below 0.400, and the skeleton points have values above 0.666. In particular, the central point has an $A$ value of 1.000, which means that this location has a dominant importance consistently in every single $3\times3$ layer. It is noteworthy that the weights on the corresponding positions of the square, horizontal and vertical kernels may grow opposite in sign, so summing them up may result in a larger or smaller magnitude. But we have observed a consistent phenomenon: the model always learns to enhance the skeletons at every layer.
We continue to study how the model behaves if we add the asymmetric kernels onto other positions rather than the central skeletons. Specifically, we train an ACNet counterpart of ResNet-56 using the same training configurations as before, but shift the horizontal convolutions one pixel towards the bottom on the inputs and shift the vertical convolutions towards the right. Accordingly, during branch fusion, we add the BN-fused asymmetric kernels to the bottom-right borders of the square kernels (Fig. 5(c)) in order to obtain an equivalent resulting network. It is observed that such ACBs can also enhance the borders, but not as intensively as the regular ACBs do to the skeletons. The model delivers an accuracy of 94.67%, which is 0.42% lower than the regular ACNet (Table 1). Moreover, similar pruning experiments are conducted on the fused model (Fig. 4(c)). As observed, pruning the corners still delivers the best accuracy, and pruning the enhanced bottom-right borders gives no better results than pruning the top-left $2\times2$ squares, i.e., though the magnitudes of the borders have increased, the other parts remain essential to the whole kernels.
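The equivalence claimed for the shifted variant can be illustrated in the same spirit as Eq. 3. The sketch below is an assumption-laden toy (single channel, no batch normalization, hypothetical helper `correlate` with explicit window offsets): the horizontal kernel is evaluated one pixel towards the bottom and the vertical kernel one pixel towards the right, and the sum of the three branches matches a single $3\times3$ convolution whose bottom row and right column received the asymmetric weights.

```python
def correlate(image, kernel, top, left):
    """Stride-1 sliding window with zero padding outside the image.
    Kernel entry (i, j) multiplies the input pixel (r + i - top, c + j - left)
    when computing output pixel (r, c)."""
    h, w = len(image), len(image[0])

    def pixel(r, c):
        return image[r][c] if 0 <= r < h and 0 <= c < w else 0

    return [[sum(kernel[i][j] * pixel(r + i - top, c + j - left)
                 for i in range(len(kernel)) for j in range(len(kernel[0])))
             for c in range(w)] for r in range(h)]


def add_matrices(a, b):
    """Element-wise sum of two equally sized matrices."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


image = [[3, 1, 4, 1], [5, 9, 2, 6], [5, 3, 5, 8], [9, 7, 9, 3]]
square = [[1, 0, -1], [2, 0, -2], [1, 0, -1]]   # 3x3 kernel
horizontal = [[1, 2, 3]]                        # 1x3 kernel
vertical = [[-1], [4], [2]]                     # 3x1 kernel

# The square branch is centred (top=1, left=1). The horizontal branch reads
# the row below the output pixel (top=-1) and the vertical branch reads the
# column to its right (left=-1).
branch_sum = add_matrices(
    add_matrices(correlate(image, square, 1, 1),
                 correlate(image, horizontal, -1, 1)),
    correlate(image, vertical, 1, -1))

# Branch fusion onto the bottom-right borders of the square kernel.
fused_kernel = [row[:] for row in square]
for j in range(3):
    fused_kernel[2][j] += horizontal[0][j]   # bottom border
for i in range(3):
    fused_kernel[i][2] += vertical[i][0]     # right border
fused = correlate(image, fused_kernel, 1, 1)

print(branch_sum == fused)
```

Note that the bottom-right corner receives a weight from both asymmetric kernels, just as the centre does in a regular ACB.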
In summary: 1) the skeletons are inherently more important than the corners in standard square kernels; 2) ACB can significantly enhance the skeletons, resulting in improved performance; 3) adding the horizontal and vertical kernels to the borders degrades the model’s performance compared to regular ACBs; 4) doing so can also increase the magnitude of the borders but cannot diminish the importance of the other parts. Therefore, we partly attribute the effectiveness of ACNet to its capability of further strengthening the skeletons. Intuitively, ACNet follows the nature of the square convolution kernels.
5 Conclusion
In order to improve the performance of various CNN architectures, we proposed the Asymmetric Convolution Block (ACB), which sums up the outputs of three convolutional branches with square, horizontal and vertical kernels, respectively. We construct an Asymmetric Convolutional Network (ACNet) by replacing the square-kernel layers in a mature architecture with ACBs and convert it into the original architecture after training. We have evaluated ACNet by improving various plain-style, residual and densely connected models on CIFAR and ImageNet. We have shown that ACNet can enhance the model’s robustness to rotational distortions by an observable margin, and explicitly strengthen the skeletons following the nature of square kernels. Note that ACNet introduces NO hyperparameters to be tuned, requires NO extra inference-time computations, and is simple to implement using mainstream frameworks.
Acknowledgement
This work was supported by the National Key R&D Program of China (No. 2018YFC0807500), National Natural Science Foundation of China (No. 61571269), National Postdoctoral Program for Innovative Talents (No. BX20180172), and the China Postdoctoral Science Foundation (No. 2018M640131). Corresponding author: Guiguang Ding, Yuchen Guo.
References
[1] (2016) TensorFlow: a system for large-scale machine learning. In OSDI, Vol. 16, pp. 265–283.
[2] (2016) Binarized neural networks: training deep neural networks with weights and activations constrained to +1 or -1. arXiv preprint arXiv:1602.02830.
[3] (2009) Imagenet: a large-scale hierarchical image database. In Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on, pp. 248–255.
[4] (2014) Exploiting linear structure within convolutional networks for efficient evaluation. In Advances in Neural Information Processing Systems, pp. 1269–1277.
[5] (2019) Approximated oracle filter pruning for destructive cnn width optimization. In International Conference on Machine Learning, pp. 1607–1616.
[6] (2019) Centripetal sgd for pruning very deep convolutional networks with complicated structure. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4943–4953.
[7] (2018) Auto-balanced filter pruning for efficient convolutional neural networks. In Thirty-Second AAAI Conference on Artificial Intelligence.
[8] (2017) Tensorflow-alexnet. https://github.com/tensorflow/models/blob/master/research/slim/nets/alexnet.py
[9] (2016) Dynamic network surgery for efficient dnns. In Advances in Neural Information Processing Systems, pp. 1379–1387.
[10] (2015) Deep learning with limited numerical precision. In International Conference on Machine Learning, pp. 1737–1746.
[11] (2015) Deep compression: compressing deep neural networks with pruning, trained quantization and huffman coding. arXiv preprint arXiv:1510.00149.
[12] (2015) Learning both weights and connections for efficient neural network. In Advances in Neural Information Processing Systems, pp. 1135–1143.
[13] (2016) Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778.
[14] (2018) Squeeze-and-excitation networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7132–7141.
[15] (2017) Densely connected convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Vol. 1, pp. 3.
[16] (2015) Batch normalization: accelerating deep network training by reducing internal covariate shift. In International Conference on Machine Learning, pp. 448–456.
[17] (2014) Speeding up convolutional neural networks with low rank expansions. arXiv preprint arXiv:1405.3866.
[18] (2014) Flattened convolutional neural networks for feedforward acceleration. arXiv preprint arXiv:1412.5474.
[19] (2009) Learning multiple layers of features from tiny images.
[20] (2012) Imagenet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pp. 1097–1105.
[21] (2016) Pruning filters for efficient convnets. arXiv preprint arXiv:1608.08710.
 (2017) Learning efficient convolutional networks through network slimming. In 2017 IEEE International Conference on Computer Vision (ICCV), pp. 2755–2763. Cited by: §2.2.
 (2018) Efficient dense modules of asymmetric convolution for realtime semantic segmentation. arXiv preprint arXiv:1809.06323. Cited by: 1st item, §2.1.
 (2017) Thinet: a filter level pruning method for deep neural network compression. In Proceedings of the IEEE international conference on computer vision, pp. 5058–5066. Cited by: §2.2.
 (2016) Enet: a deep neural network architecture for realtime semantic segmentation. arXiv preprint arXiv:1606.02147. Cited by: 1st item, §2.1.
 (2017) Automatic differentiation in pytorch. In NIPSW, Cited by: §1.
 (2016) Xnornet: imagenet classification using binary convolutional neural networks. In European Conference on Computer Vision, pp. 525–542. Cited by: §2.2.
 (2014) Very deep convolutional networks for largescale image recognition. arXiv preprint arXiv:1409.1556. Cited by: §1, §4.1.
 (2012) Practical bayesian optimization of machine learning algorithms. In Advances in neural information processing systems, pp. 2951–2959. Cited by: §4.1.
 (2016) Design of kernels in convolutional neural networks for image classification. In European Conference on Computer Vision, pp. 51–66. Cited by: §1.
 (2017) Inceptionv4, inceptionresnet and the impact of residual connections on learning. In ThirtyFirst AAAI Conference on Artificial Intelligence, Cited by: §1.
 (2015) Going deeper with convolutions. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 1–9. Cited by: §1, §2.2.
 (2016) Rethinking the inception architecture for computer vision. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 2818–2826. Cited by: 1st item, §1, §1, §2.1.
 (2017) Beyond filters: compact feature map for portable deep model. In International Conference on Machine Learning, pp. 3703–3711. Cited by: §2.2.
 (2016) Wide residual networks. arXiv preprint arXiv:1605.07146. Cited by: §1, §4.1, §4.1.
 (2018) Learning transferable architectures for scalable image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 8697–8710. Cited by: §1.