ACNet: Strengthening the Kernel Skeletons for Powerful CNN via Asymmetric Convolution Blocks
As designing an appropriate Convolutional Neural Network (CNN) architecture for a given application usually involves heavy human effort or numerous GPU hours, the research community is seeking architecture-neutral CNN structures, which can be easily plugged into multiple mature architectures to improve performance on real-world applications. We propose the Asymmetric Convolution Block (ACB), an architecture-neutral CNN building block which uses 1D asymmetric convolutions to strengthen the square convolution kernels. For an off-the-shelf architecture, we replace the standard square-kernel convolutional layers with ACBs to construct an Asymmetric Convolutional Network (ACNet), which can be trained to reach a higher level of accuracy. After training, we equivalently convert the ACNet into the original architecture, so that no extra computation is required at inference time. We have observed that ACNet can improve the performance of various models on CIFAR and ImageNet by a clear margin. Through further experiments, we attribute the effectiveness of ACB to its capability of enhancing the model’s robustness to rotational distortions and strengthening the central skeleton parts of square convolution kernels.
1 Introduction
Convolutional Neural Networks (CNNs) have achieved great success in visual understanding, which makes them useful for various applications in wearable devices, security systems, mobile phones, automobiles, etc. As the front-end devices are usually limited in computational resources and demand real-time inference, these applications require CNNs that deliver high accuracy within a fixed computational budget. Thus it may not be practical to enhance the model by simply employing more trainable parameters and complicated connections. Therefore, we consider it meaningful to improve the performance of CNNs with no extra inference-time computations, memory footprint, or energy consumption.
On the other hand, along with the advancements in the CNN architecture design literature, the performance of the off-the-shelf models has been significantly improved. However, when the existing models cannot meet our specific needs, we may not be able to afford customizing a new architecture at the cost of heavy human effort or numerous GPU hours [zoph2018learning]. Recently, the research community has been seeking innovative architecture-neutral CNN structures, e.g., SE blocks [hu2018squeeze] and quasi-hexagonal kernels [sun2016design], which can be directly combined with various up-to-date architectures to improve performance on real-world applications.
Some recent investigations on CNN architectures focus on 1) how the layers are connected with each other, e.g., simply stacked together [krizhevsky2012imagenet, simonyan2014very], through identity mapping [he2016deep, szegedy2017inception, zagoruyko2016wide] or densely connected [huang2017densely] and 2) how the outputs of different layers are combined to increase the quality of learned representations [ioffe2015batch, szegedy2017inception, szegedy2015going, szegedy2016rethinking]. Considering this, in quest of a generic architecture-neutral CNN structure which can be combined with numerous architectures, we seek to strengthen standard convolutional layers by digging into an orthogonal aspect: the relationship between the weights and their spatial locations in the kernels.
In this paper, we propose the Asymmetric Convolution Block (ACB), an innovative structure as a building block to replace the standard convolutional layers with square kernels, e.g., $3\times 3$ layers, which are widely used in modern CNNs. Concretely, for the replacement of a $d\times d$ layer, we construct an ACB comprising three parallel layers with $d\times d$, $1\times d$ and $d\times 1$ kernels, respectively, of which the outputs are summed up to enrich the feature space (Fig. 1). As the introduced $1\times d$ and $d\times 1$ layers have non-square kernels, we refer to them as the asymmetric convolutional layers, following [szegedy2016rethinking]. Given an off-the-shelf architecture, we construct an Asymmetric Convolutional Network (ACNet) by replacing every square-kernel layer with an ACB and train it until convergence. After that, we equivalently convert the ACNet into the original architecture by adding the asymmetric kernels in each ACB onto the corresponding positions of the square kernels. Due to the additivity of convolutions with compatible kernel sizes (Fig. 2), which is obvious but has long been ignored, the resulting model produces the same outputs as the training-time ACNet. As will be shown in our experiments (Sect. 4.1, 4.2), doing so can improve the performance of several benchmark models on CIFAR [krizhevsky2009learning] and ImageNet [deng2009imagenet] by a clear margin. Better still, ACNet 1) introduces NO hyper-parameters, such that it can be combined with different architectures without careful tuning; 2) is simple to implement on the mainstream CNN frameworks like PyTorch [paszke2017automatic] and TensorFlow [abadi2016tensorflow]; 3) requires NO extra inference-time computational burdens compared to the original architecture.
Through our further experiments, we have partly explained the effectiveness of ACNet. It is observed that a square convolution kernel distributes its learned knowledge unequally: the weights on the central crisscross positions (which are referred to as the “skeleton” of the kernel) are usually larger in magnitude, and removing them causes a higher accuracy drop, compared to those in the corners. In each ACB, we add the horizontal and vertical kernels onto the skeletons, thus explicitly making the skeletons more powerful, following the nature of square kernels. Interestingly, the weights on the corresponding positions of the square, horizontal and vertical kernels are randomly initialized and may grow opposite in sign, thus summing them up may result in a stronger or weaker skeleton. However, we have empirically observed a consistent phenomenon that the model always learns to enhance the skeletons at every layer. This observation may shed light on future research on the relationship among the weights at different spatial locations. The code is available at https://github.com/ShawnDing1994/ACNet.
Our contributions are summarized as follows.
We propose to use asymmetric convolutions to explicitly enhance the representational power of a standard square-kernel layer in a way that the asymmetric convolutions can be fused into the square kernels with NO extra inference-time computations needed, rather than approximate a square-kernel layer like many prior works [denton2014exploiting, jaderberg2014speeding, jin2014flattened, lo2018efficient, paszke2016enet, szegedy2016rethinking].
We propose ACB as a novel architecture-neutral CNN building block. We can construct an ACNet by simply replacing every square-kernel convolutional layer in a mature architecture with an ACB without introducing any hyper-parameters, such that its effectiveness can be combined with the numerous advancements in the CNN architecture designing literature.
We have improved the accuracy of several common benchmark models on CIFAR-10, CIFAR-100, and ImageNet by a clear margin.
We have justified the significance of skeletons in standard square convolution kernels and demonstrated the effectiveness of ACNet in enhancing such skeletons.
We have shown that ACNet can enhance the model’s robustness to rotational distortions, which may inspire further studies on the rotational invariance problem.
2 Related work
2.1 Asymmetric convolutions
Asymmetric convolutions are typically used to approximate an existing square-kernel convolutional layer for compression and acceleration. Some prior works [denton2014exploiting, jaderberg2014speeding] have shown that a standard $d\times d$ convolutional layer can be factorized as a sequence of two layers with $d\times 1$ and $1\times d$ kernels to reduce the parameters and required computations. The theory behind this is quite simple: if a 2D kernel has a rank of one, the operation can be equivalently transformed into a series of 1D convolutions. However, as the learned kernels in deep networks have distributed eigenvalues, their intrinsic rank is higher than one in practice, thus applying the transformation directly to the kernels results in significant information loss [jin2014flattened]. Denton et al. [denton2014exploiting] tackled this problem by finding a low-rank approximation in an SVD-based manner and then finetuning the upper layers to restore the performance. Jaderberg et al. [jaderberg2014speeding] succeeded in learning the horizontal and vertical kernels by minimizing the $\ell_2$ reconstruction error. Jin et al. [jin2014flattened] applied structural constraints to make the 2D kernels separable and obtained performance comparable to conventional CNNs with a speed-up.
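The rank-one case can be checked numerically. The following is a minimal sketch in plain Python, not taken from the cited works: `conv2d_valid`, the vectors `v` and `h`, and the toy input are hypothetical names and values chosen for illustration. It builds a rank-one square kernel as an outer product and confirms that a vertical 1D pass followed by a horizontal 1D pass gives the same result as the 2D kernel.

```python
def conv2d_valid(image, kernel):
    """Sliding-window 2D convolution (cross-correlation form), stride 1, no padding."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(image[i + a][j + b] * kernel[a][b]
                for a in range(kh) for b in range(kw))
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]

# Hypothetical 1D factors; their outer product is a rank-one 3x3 kernel.
v = [1, 2, 1]    # vertical factor (used as a 3x1 kernel)
h = [-1, 0, 1]   # horizontal factor (used as a 1x3 kernel)
square = [[vi * hj for hj in h] for vi in v]

# Toy integer input so that the comparison below is exact.
image = [[(3 * r + 5 * c) % 7 for c in range(6)] for r in range(5)]

direct = conv2d_valid(image, square)

vertical = [[x] for x in v]   # 3x1 kernel
horizontal = [h]              # 1x3 kernel
sequential = conv2d_valid(conv2d_valid(image, vertical), horizontal)

print(direct == sequential)
```

For a kernel of rank higher than one, no such pair of 1D factors reproduces the 2D kernel exactly, which is why the cited approaches rely on approximation or on structural constraints.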
On the other hand, asymmetric convolutions are also widely employed as an architectural design element to save parameters and computations. For example, in Inception-v3 [szegedy2016rethinking], the $7\times 7$ convolutions are replaced by a sequence of $1\times 7$ and $7\times 1$ convolutions. However, the authors found that such a replacement is not equivalent, as it did not work well on the low-level layers. ENet [paszke2016enet] also adopted this approach for the design of an efficient semantic segmentation network, where the $5\times 5$ convolutions are decomposed, making it possible to increase the receptive field within a reasonable computational budget. EDANet [lo2018efficient] used a similar method to decompose the $3\times 3$ convolutions, resulting in a 33% saving in the number of parameters and required computations with minor performance degradation.
In contrast, we use 1D asymmetric convolutions not to factorize any layers as part of the architectural design but to enrich the feature space during training and then fuse their learned knowledge into the square-kernel layers.
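As a rough check on the 33% figure, assume the decomposed convolutions are $3\times 3$ and that the $3\times 1$ and $1\times 3$ replacements keep the same numbers of input and output channels $C$ and $D$ (both assumptions made here for illustration; bias terms are ignored). The weight counts then compare as:

```latex
\frac{(3\cdot 1 + 1\cdot 3)\,C D}{(3\cdot 3)\,C D} \;=\; \frac{6}{9} \;\approx\; 0.67
\quad\Longrightarrow\quad \text{a saving of about } 33\%.
```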
2.2 Architecture-neutral CNN structures
We intend not to modify the CNN architecture but to use architecture-neutral structures to enhance the off-the-shelf models. Thus the effectiveness of our method is supplementary to the advancements achieved by innovative architectures. Specifically, a CNN structure can be called architecture-neutral if it 1) makes no assumptions about the specific architecture, and thus can be applied to various models, and 2) brings universal benefits. For example, SE blocks [hu2018squeeze] can be appended after a convolutional layer to rescale the feature map channels with learned weights, resulting in a clear accuracy improvement at a reasonable cost of extra parameters and computational burdens. As another example, auxiliary classifiers [szegedy2015going] can be inserted into the model to assist in supervising the learning process, which can indeed improve the performance by an observable margin but requires extra human effort to tune the hyper-parameters.
By contrast, ACNet introduces NO hyper-parameters during training and requires NO extra parameters or computations during inference. Therefore, in real-world applications, the developer can use ACNet to enhance a variety of models without exhaustive parameter tuning, and the end-users can enjoy the performance improvement without slowing down the inference. Better still, since we introduce no custom structures into the deployed model, it can be further compressed via techniques including connection pruning [guo2016dynamic, han2015learning], channel pruning [ding2019centripetal, ding2019approximated, liu2017learning, luo2017thinet], quantization [courbariaux2016binarized, gupta2015deep, rastegari2016xnor], feature map compacting [wang2017beyond], etc.
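3 Asymmetric Convolutional Network
3.1 Formulation
For a convolutional layer with a kernel size of $H\times W$ and $D$ filters which takes a $C$-channel feature map as input, we use $F\in\mathbb{R}^{H\times W\times C}$ to denote the 3D convolution kernel of a filter, $M\in\mathbb{R}^{U\times V\times C}$ for the input, which is a feature map with a spatial resolution of $U\times V$ and $C$ channels, and $O\in\mathbb{R}^{R\times T\times D}$ for the output with $D$ channels, respectively. For the $j$-th filter at such a layer, the corresponding output feature map channel is
$$O_{:,:,j} = \sum_{k=1}^{C} M_{:,:,k} * F^{(j)}_{:,:,k}, \tag{1}$$
where $*$ is the 2D convolution operator, $M_{:,:,k}$ is the $k$-th channel of $M$ in the form of a $U\times V$ matrix, and $F^{(j)}_{:,:,k}$ is the $k$-th input channel of $F^{(j)}$, i.e., a 2D kernel of $H\times W$.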
In modern CNN architectures, batch normalization [ioffe2015batch] is widely adopted to reduce overfitting and accelerate the training process. As a common practice, a batch normalization layer is usually followed by a linear scaling transformation to enhance the representational power. Compared to Eq. 1, the output channel then becomes
$$O_{:,:,j} = \Big(\sum_{k=1}^{C} M_{:,:,k} * F^{(j)}_{:,:,k} - \mu_j\Big)\frac{\gamma_j}{\sigma_j} + \beta_j, \tag{2}$$
where $\mu_j$ and $\sigma_j$ are the values of channel-wise mean and standard deviation of batch normalization, and $\gamma_j$ and $\beta_j$ are the learned scaling factor and bias term, respectively.
3.2 Exploiting the additivity of convolution
We seek to employ asymmetric convolutions in a way that they can be equivalently fused into the standard square-kernel layers, such that no extra inference-time computational burdens are introduced. We notice a useful property of convolution: if several 2D kernels with compatible sizes operate on the same input with the same stride to produce outputs of the same resolution, and their outputs are summed up, we can add up these kernels on the corresponding positions to obtain an equivalent kernel which will produce the same output. That is, the additivity may hold for 2D convolutions, even with different kernel sizes,
$$I * K^{(1)} + I * K^{(2)} = I * \big(K^{(1)} \oplus K^{(2)}\big), \tag{3}$$
where $I$ is a matrix, $K^{(1)}$ and $K^{(2)}$ are two 2D kernels with compatible sizes, and $\oplus$ is the element-wise addition of the kernel parameters on the corresponding positions. Note that $I$ may need to be appropriately clipped or padded.
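Here compatible means that we can “patch” the smaller kernel onto the bigger. Formally, this kind of transformation on layer $p$ and $q$ is feasible if
$$M^{(p)} = M^{(q)},\quad H_p \le H_q,\quad W_p \le W_q,\quad D_p = D_q. \tag{4}$$
E.g., $3\times 1$ and $1\times 3$ kernels are compatible with $3\times 3$.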
This can be easily verified by investigating the calculation of convolution in the form of sliding windows (Fig. 2). For a certain filter with kernel $F^{(j)}$, a certain point $y$ on the output channel $O_{:,:,j}$ is given by
$$y = \sum_{c=1}^{C}\sum_{h=1}^{H}\sum_{w=1}^{W} F^{(j)}_{h,w,c}\, X_{h,w,c}, \tag{5}$$
where $X$ is the corresponding sliding window on input $M$. Obviously, when we sum up two output channels produced by two filters, the additivity (Eq. 3) holds if for each point $y$ on one channel, its corresponding point on the other channel shares the same sliding window on $M$.
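The additivity property can be illustrated with a small numerical check. This is a sketch in plain Python under stated assumptions, not the authors' implementation: `conv2d`, the kernel values and the toy input are hypothetical, stride is 1, and each branch is zero-padded only along the dimension in which its kernel extends, so that all three outputs have the same resolution and share aligned sliding windows.

```python
def conv2d(image, kernel, pad_h, pad_w):
    """Stride-1 sliding-window convolution with zero padding of pad_h rows
    (top and bottom) and pad_w columns (left and right)."""
    kh, kw = len(kernel), len(kernel[0])
    h, w = len(image), len(image[0])
    padded = [[0] * (w + 2 * pad_w) for _ in range(h + 2 * pad_h)]
    for r in range(h):
        for c in range(w):
            padded[r + pad_h][c + pad_w] = image[r][c]
    out_h = h + 2 * pad_h - kh + 1
    out_w = w + 2 * pad_w - kw + 1
    return [
        [
            sum(padded[i + a][j + b] * kernel[a][b]
                for a in range(kh) for b in range(kw))
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]

# Hypothetical integer kernels and input (integers keep the comparison exact).
square = [[1, -2, 0], [3, 1, -1], [0, 2, 1]]   # 3x3
horizontal = [[2, -1, 4]]                      # 1x3
vertical = [[-3], [5], [1]]                    # 3x1
image = [[(7 * r + 3 * c) % 11 for c in range(6)] for r in range(5)]

# Sum of the three branch outputs, all of resolution 5x6.
out_sq = conv2d(image, square, 1, 1)
out_hor = conv2d(image, horizontal, 0, 1)
out_ver = conv2d(image, vertical, 1, 0)
summed = [[a + b + c for a, b, c in zip(ra, rb, rc)]
          for ra, rb, rc in zip(out_sq, out_hor, out_ver)]

# Patch the 1D kernels onto the middle row and middle column of the square kernel.
fused = [row[:] for row in square]
for b in range(3):
    fused[1][b] += horizontal[0][b]
for a in range(3):
    fused[a][1] += vertical[a][0]

print(summed == conv2d(image, fused, 1, 1))
```

The padding choice matters: if the horizontal branch were padded vertically as well, its output would no longer align with the middle row of the square kernel's sliding window and the equality would break.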
3.3 ACB for free inference-time improvements
In this paper, we focus on $3\times 3$ convolutions, which are heavily used in modern CNN architectures. Given an architecture, we construct an ACNet by simply replacing every $3\times 3$ layer (together with the following batch normalization layer, if any) with an ACB which comprises three parallel layers with kernel size $3\times 3$, $1\times 3$ and $3\times 1$, respectively. Similar to the common practice in standard CNNs, each of the three layers is followed by batch normalization, which together form a branch, and the outputs of the three branches are summed up as the output of the ACB. Note that we can train the ACNet using the same configurations as the original model without any extra hyper-parameters to be tuned.
As will be shown in Sect. 4.1 and Sect. 4.2, the ACNet can be trained to reach a higher level of accuracy. When the training is completed, we seek to convert every ACB to a standard convolutional layer which produces identical outputs. By doing so, we can obtain a more powerful network which requires no extra computations, compared to a normally trained counterpart. This conversion is achieved through two steps, namely, BN fusion and branch fusion.
BN fusion. The homogeneity of convolution allows the following batch normalization and linear scaling transformation to be equivalently fused into the convolutional layer with an added bias. It can be observed from Eq. 2 that for each branch, if we construct a new kernel as $\frac{\gamma_j}{\sigma_j}F^{(j)}$ along with an added bias term $-\frac{\mu_j\gamma_j}{\sigma_j}+\beta_j$, we will produce the same output, which can be easily verified.
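The fusion of batch normalization into a convolution can be checked on a toy example. The sketch below is an illustration under assumptions, not the released code: it uses a single input channel, hypothetical statistics `mu`, `sigma`, `gamma`, `beta`, and treats `sigma` as the standard deviation exactly as in Eq. 2 (no numerical-stability epsilon). Exact rational arithmetic from the standard library is used so that the equality test is not affected by rounding.

```python
from fractions import Fraction

def conv2d_valid(image, kernel):
    """Sliding-window 2D convolution, stride 1, no padding."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(image[i + a][j + b] * kernel[a][b]
                for a in range(kh) for b in range(kw))
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]

def batch_norm(channel, mu, sigma, gamma, beta):
    """Inference-time batch normalization followed by linear scaling."""
    return [[(x - mu) * gamma / sigma + beta for x in row] for row in channel]

def fuse_bn(kernel, mu, sigma, gamma, beta):
    """Return the rescaled kernel and the added bias of the fused layer."""
    scale = gamma / sigma
    fused_kernel = [[scale * w for w in row] for row in kernel]
    fused_bias = beta - mu * scale
    return fused_kernel, fused_bias

# Hypothetical kernel, input and BN statistics.
kernel = [[1, -2, 0], [3, 1, -1], [0, 2, 1]]
image = [[(5 * r + 2 * c) % 9 for c in range(6)] for r in range(5)]
mu, sigma = Fraction(1, 2), Fraction(3, 2)
gamma, beta = Fraction(2), Fraction(-1, 4)

reference = batch_norm(conv2d_valid(image, kernel), mu, sigma, gamma, beta)

fused_kernel, fused_bias = fuse_bn(kernel, mu, sigma, gamma, beta)
fused = [[x + fused_bias for x in row]
         for row in conv2d_valid(image, fused_kernel)]

print(reference == fused)
```

In a framework implementation the stored quantity is typically the running variance together with a small epsilon, so the scale would be computed from those instead; that detail is outside what the text specifies.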
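Branch fusion. We merge the three BN-fused branches into a standard convolutional layer by adding the asymmetric kernels onto the corresponding positions of the square kernel. In practice, this transformation is implemented by building a network of the original structure and using the fused weights for initialization, thus we can produce the same outputs as the ACNet with the same computational budget as the original architecture. Formally, for every filter $j$, let $F'^{(j)}$ be the fused 3D kernel, $b_j$ be the obtained bias term, and $\bar{F}^{(j)}$ and $\hat{F}^{(j)}$ be the kernels of the corresponding filter at the $1\times 3$ and $3\times 1$ layer, respectively. We have
$$F'^{(j)} = \frac{\gamma_j}{\sigma_j}F^{(j)} \oplus \frac{\bar{\gamma}_j}{\bar{\sigma}_j}\bar{F}^{(j)} \oplus \frac{\hat{\gamma}_j}{\hat{\sigma}_j}\hat{F}^{(j)}, \tag{6}$$
$$b_j = -\frac{\mu_j\gamma_j}{\sigma_j} - \frac{\bar{\mu}_j\bar{\gamma}_j}{\bar{\sigma}_j} - \frac{\hat{\mu}_j\hat{\gamma}_j}{\hat{\sigma}_j} + \beta_j + \bar{\beta}_j + \hat{\beta}_j. \tag{7}$$
Then we can easily verify that for an arbitrary filter $j$,
$$O_{:,:,j} + \bar{O}_{:,:,j} + \hat{O}_{:,:,j} = \sum_{k=1}^{C} M_{:,:,k} * F'^{(j)}_{:,:,k} + b_j, \tag{8}$$
where $O_{:,:,j}$, $\bar{O}_{:,:,j}$ and $\hat{O}_{:,:,j}$ are the outputs of the original $3\times 3$, $1\times 3$ and $3\times 1$ branch, respectively. Fig. 3 shows an example on a single input channel for more intuition.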
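The two fusion steps can be combined into one end-to-end check on a toy block. The following is a hedged sketch rather than the released implementation: it uses one input channel and one filter, hypothetical kernels and batch normalization statistics stored in a `BRANCHES` table, stride 1, zero padding, and exact rational arithmetic. It verifies that the summed outputs of the three branches equal a single square-kernel convolution with the fused kernel and bias.

```python
from fractions import Fraction as Fr

def conv2d(image, kernel, pad_h, pad_w):
    """Stride-1 sliding-window convolution with zero padding."""
    kh, kw = len(kernel), len(kernel[0])
    h, w = len(image), len(image[0])
    padded = [[0] * (w + 2 * pad_w) for _ in range(h + 2 * pad_h)]
    for r in range(h):
        for c in range(w):
            padded[r + pad_h][c + pad_w] = image[r][c]
    return [
        [
            sum(padded[i + a][j + b] * kernel[a][b]
                for a in range(kh) for b in range(kw))
            for j in range(w + 2 * pad_w - kw + 1)
        ]
        for i in range(h + 2 * pad_h - kh + 1)
    ]

# Hypothetical branches: (kernel, pad_h, pad_w, mu, sigma, gamma, beta).
BRANCHES = [
    ([[1, -2, 0], [3, 1, -1], [0, 2, 1]], 1, 1, Fr(1, 2), Fr(3, 2), Fr(2), Fr(-1)),
    ([[1, 0, -1]], 0, 1, Fr(0), Fr(2), Fr(1, 3), Fr(1, 4)),       # horizontal
    ([[2], [1], [-1]], 1, 0, Fr(-1), Fr(1, 2), Fr(3, 2), Fr(0)),  # vertical
]

def acb_forward(image):
    """Training-time form: three conv + BN branches, outputs summed."""
    total = None
    for kernel, ph, pw, mu, sigma, gamma, beta in BRANCHES:
        out = [[(x - mu) * gamma / sigma + beta for x in row]
               for row in conv2d(image, kernel, ph, pw)]
        if total is None:
            total = out
        else:
            total = [[a + b for a, b in zip(ra, rb)] for ra, rb in zip(total, out)]
    return total

def fuse_acb():
    """BN fusion followed by branch fusion into one 3x3 kernel and a bias."""
    fused = [[Fr(0) for _ in range(3)] for _ in range(3)]
    bias = Fr(0)
    for kernel, ph, pw, mu, sigma, gamma, beta in BRANCHES:
        scale = gamma / sigma
        row0, col0 = 1 - ph, 1 - pw   # offset that centers the smaller kernel
        for a, row in enumerate(kernel):
            for b, w in enumerate(row):
                fused[row0 + a][col0 + b] += scale * w
        bias += beta - mu * scale
    return fused, bias

image = [[(4 * r + 7 * c) % 10 for c in range(6)] for r in range(5)]
fused_kernel, fused_bias = fuse_acb()
deployed = [[x + fused_bias for x in row]
            for row in conv2d(image, fused_kernel, 1, 1)]

print(acb_forward(image) == deployed)
```

With multiple channels and filters the same per-filter arithmetic applies independently to each filter $j$, with the sum over input channels carried inside the convolution.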
Of note is that though an ACB can be equivalently transformed into a standard layer, the equivalence only holds at inference-time because the training dynamics are different, thus giving rise to different performance. The non-equivalence of the training process is due to the random initialization of kernel weights, and the gradients derived by different computation flows they participate in.
4 Experiments
We have conducted extensive experiments to verify the effectiveness of ACNet in improving the performance of CNNs across a range of datasets and architectures. Concretely, we pick an off-the-shelf architecture as the baseline, build an ACNet counterpart, train it from scratch, convert it into the same structure as the baseline, and test it to collect the accuracy. For comparability, all the models are trained until complete convergence, and every pair of baseline and ACNet uses identical configurations, e.g., learning rate schedules and batch sizes.
4.1 Performance improvements on CIFAR
In order to preliminarily evaluate our method on various CNN architectures, we experiment with several representative benchmark models including Cifar-quick [snoek2012practical], VGG-16 [simonyan2014very], ResNet-56 [he2016deep], WRN-16-8 [zagoruyko2016wide] and DenseNet-40 [huang2017densely] on CIFAR-10 and CIFAR-100 [krizhevsky2009learning].
For Cifar-quick, VGG-16, ResNet-56, and DenseNet-40, we train the models using a staircase learning rate of 0.1, 0.01, 0.001 and 0.0001 following the common practice. For WRN-16-8, we follow the training configurations reported in the original paper [zagoruyko2016wide]. We use the data augmentation techniques adopted by [he2016deep], i.e., padding to $40\times 40$, random cropping and left-right flipping.
As can be observed from Table 1 and Table 2, the performance of all the models is consistently lifted by a clear margin, suggesting that the benefits of ACBs can be combined with various architectures.
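Table 1: Top-1 accuracy on CIFAR-10. Columns: Model | Base Top-1 | ACNet Top-1 | Top-1 improvement.
Table 2: Top-1 accuracy on CIFAR-100. Columns: Model | Base Top-1 | ACNet Top-1 | Top-1 improvement.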
4.2 Performance improvements on ImageNet
Table 3: Accuracy on ImageNet. Columns: Model | Base Top-1 | ACNet Top-1 | Top-1 improvement | Base Top-5 | ACNet Top-5 | Top-5 improvement.
We then validate the effectiveness of our method on real-world applications through a series of experiments on ImageNet [deng2009imagenet], which comprises 1.28M images for training and 50K for validation from 1000 classes. We use AlexNet [krizhevsky2012imagenet], ResNet-18 [he2016deep] and DenseNet-121 [huang2017densely] as the representatives for the plain-style, residual and densely connected architectures, respectively. Every model is trained with a batch size of 256 for 150 epochs, which is longer than the usually adopted benchmarks (e.g., 90 epochs [he2016deep]), such that the accuracy improvement cannot be simply attributed to the incomplete convergence of the base models. For data augmentation, we employ the standard pipeline including bounding box distortion, left-right flipping and color shift, as a common practice. In particular, the plain version of AlexNet we use comes from the TensorFlow GitHub [Tensorflow-AlexNet], and is composed of five stacked convolutional layers and three fully-connected layers with no local response normalizations (LRN) or cross-GPU connections. For faster convergence, we apply batch normalization [ioffe2015batch] to each of its convolutional layers. Of note is that since the first two layers of AlexNet use $11\times 11$ and $5\times 5$ kernels, respectively, it is possible to extend ACBs to have larger asymmetric kernels. However, we still only use $1\times 3$ and $3\times 1$ convolutions for these two layers, because such large-scale convolutions are becoming less favored in modern CNNs, making large ACBs less useful.
As shown in Table 3, the single-crop Top-1 accuracy of AlexNet, ResNet-18 and DenseNet-121 is lifted by 1.52%, 0.78% and 1.18%, respectively. In practice, aiming at the same target accuracy, we can use ACNet to enhance a more efficient model to achieve the target with less inference time, energy consumption, and storage space. On the other hand, with the same constraints on computational budget or model size, we can use ACNet to improve the accuracy by a clear margin such that the gained performance can be viewed as a free benefit, from the viewpoint of end-users.
4.3 Ablation studies
Table 4: Ablation studies on ImageNet. Columns: Model | Horizontal kernel | Vertical kernel | BN in branch | Original input | Rotate 90° | Rotate 180° | Up-down flip.
Though we have empirically justified the effectiveness of ACNet, we still desire to find some explanations. In this subsection, we seek to investigate ACNet through a series of ablation studies. Specifically, we focus on the following three design decisions: the usage of 1) horizontal kernels, 2) vertical kernels, and 3) batch normalization in every branch. For comparability, we train several AlexNet and ResNet-18 models on ImageNet with different ablations using the same training configurations. Of note is that if the batch normalizations in the branches are removed, we batch-normalize the output of the whole ACB instead, i.e., the position of the batch normalization layer is changed from pre-summation to post-summation.
As can be observed from Table 4, removing any of the three designs degrades the model. However, though the horizontal and vertical convolutions can both improve the performance, there may exist some difference because the horizontal and vertical directions are treated unequally in practice, e.g., we usually perform random left-right but no up-down image flipping to augment the training data. Therefore, if an upside-down image is fed into the model, the original $3\times 3$ layers should produce meaningless results, which is natural, but a horizontal kernel will produce the same outputs as on the original image at the axially symmetric locations (Fig. 4). I.e., a part of the ACB can still extract the correct features. Considering this, we assume that ACBs may enhance the model’s robustness to rotational distortions, enabling the model to generalize better on unseen data.
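The behavior of a horizontal kernel under up-down flipping can be reproduced on a toy input. This is a minimal illustrative sketch in plain Python with hypothetical kernels and data, not an experiment from the paper: it shows that a $1\times 3$ kernel applied to a vertically flipped input yields exactly the vertically flipped output, whereas a vertically asymmetric square kernel does not.

```python
def conv2d_valid(image, kernel):
    """Sliding-window 2D convolution, stride 1, no padding."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(image[i + a][j + b] * kernel[a][b]
                for a in range(kh) for b in range(kw))
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]

def flip_up_down(matrix):
    """Reverse the row order (up-down flip)."""
    return matrix[::-1]

# Hypothetical toy input and kernels.
image = [[r * r + c for c in range(5)] for r in range(4)]
horizontal = [[1, -2, 3]]                       # 1x3 kernel
square = [[1, 2, 3], [0, 0, 0], [-1, -2, -3]]   # 3x3, not symmetric top-to-bottom

# A 1x3 kernel touches a single row at a time, so flipping the rows of the
# input simply flips the rows of the output.
hor_equivariant = (
    conv2d_valid(flip_up_down(image), horizontal)
    == flip_up_down(conv2d_valid(image, horizontal))
)

# The square kernel mixes rows with position-dependent weights, so its
# response to the flipped input is not the flipped response in general.
square_equivariant = (
    conv2d_valid(flip_up_down(image), square)
    == flip_up_down(conv2d_valid(image, square))
)

print(hor_equivariant, square_equivariant)
```

The square kernel would pass the same check only if it happened to be symmetric about its middle row, which learned kernels generally are not.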
We then test the previously trained models with rotationally distorted images from the whole validation set, including 90° counterclockwise rotation, 180° rotation, and up-down flipping. Naturally, the accuracy of every model is significantly reduced, but the models with horizontal kernels deliver observably higher accuracy on the 180° rotated and up-down flipped images. E.g., the ResNet-18 equipped with only horizontal kernels delivers an accuracy slightly lower than that of the counterpart with only vertical kernels on the original inputs, but 0.75% higher on the 180° rotated inputs. And when compared with the base model, its accuracy is 0.34% / 1.27% higher on the original / flipped images, respectively. Predictably, the models exhibit similar performance on the 180° rotated and up-down flipped inputs, as 180° rotation plus left-right flipping is equivalent to up-down flipping, and the model is robust to left-right flipping due to the data augmentation methods.
In summary, we have shown that ACBs, especially the horizontal kernels inside, can enhance the model’s robustness to rotational distortions by an observable margin. Though this may not be the primary reason for the effectiveness of ACNet, we consider it promising to inspire further research on the rotational invariance problem.
4.4 ACB enhances the skeletons of square kernels
Intuitively, as adding the horizontal and vertical kernels onto the square kernel can be viewed as a means to explicitly enhance the skeleton part, we seek to explain the effectiveness of ACNet by investigating the difference between the skeleton and the weights at the corners.
Inspired by the CNN pruning methods [guo2016dynamic, han2015deep, han2015learning], we start from removing some weights at different spatial locations and observing the performance drop using ResNet-56 on CIFAR-10. Concretely, we randomly set some individual weights in the kernels to zero and test the model. As shown in Fig. 4(a), for the curve labeled as corner, we randomly select the weights from the four corners of every $3\times 3$ kernel and set them to zero in order to attain a given global sparsity ratio of every convolutional layer. Note that as $4/9 \approx 44.4\%$, a sparsity ratio of 44% means removing most of the weights at the four corners. For skeleton, we randomly select the weights only from the skeleton of every kernel. For global, every individual weight in the kernel has an equal chance to be chosen. The experiments are repeated five times with different random seeds, and the mean ± std curves are depicted.
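The position-restricted pruning procedure can be sketched as follows. This is an assumption-laden illustration, not the authors' script: `prune_layer`, the position lists and the all-ones toy layer are hypothetical, kernels are taken to be $3\times 3$, and the global sparsity ratio is interpreted as the fraction of all weights in the layer that are set to zero.

```python
import random

# Spatial positions (row, col) inside a 3x3 kernel.
CORNERS = [(0, 0), (0, 2), (2, 0), (2, 2)]
SKELETON = [(0, 1), (1, 0), (1, 1), (1, 2), (2, 1)]
GLOBAL = CORNERS + SKELETON

def prune_layer(kernels, positions, sparsity, rng):
    """Zero randomly chosen weights located at `positions` until the given
    fraction of ALL weights in the layer has been removed."""
    candidates = [(k, r, c) for k in range(len(kernels)) for r, c in positions]
    n_remove = round(sparsity * 9 * len(kernels))
    if n_remove > len(candidates):
        raise ValueError("sparsity exceeds the weights available at these positions")
    pruned = [[row[:] for row in kernel] for kernel in kernels]
    for k, r, c in rng.sample(candidates, n_remove):
        pruned[k][r][c] = 0.0
    return pruned

def count_zeros(kernels, positions):
    """Count zero-valued weights at the given positions across all kernels."""
    return sum(1 for kernel in kernels for r, c in positions if kernel[r][c] == 0.0)

rng = random.Random(0)
# Toy layer: 100 kernels of all ones, so every zero afterwards was pruned.
layer = [[[1.0] * 3 for _ in range(3)] for _ in range(100)]

corner_pruned = prune_layer(layer, CORNERS, 0.44, rng)
skeleton_pruned = prune_layer(layer, SKELETON, 0.13, rng)

# 0.44 * 900 = 396 of the 400 corner weights are removed.
print(count_zeros(corner_pruned, CORNERS), len(layer) * len(CORNERS))
# 0.13 * 900 = 117 of the 500 skeleton weights are removed.
print(count_zeros(skeleton_pruned, SKELETON), len(layer) * len(SKELETON))
```

Because corners make up only 4 of the 9 positions, the corner curve cannot extend beyond a global sparsity of about 44.4%, and the skeleton curve is likewise capped at 5/9.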
As can be observed, all the curves show a tendency of decreasing as the sparsity ratio increases, but not monotonically, due to the random effects. It is obvious that removing the weights from the corners causes less damage to the model, but pruning the skeletons does more harm. This phenomenon suggests that the skeleton weights are more important to the model’s representational capacity.
We continue to verify whether this observation holds for ACNet. We convert the ACNet counterpart via BN and branch fusion, then conduct the same experiments on it. As shown in Fig. 4(b), we observe an even more significant gap, e.g., pruning almost all the corner weights only degrades the model’s accuracy to above 60%. On the other hand, pruning the skeletons causes more damage, as the model is destroyed when the global sparsity ratio attained by pruning the skeletons merely reaches 13%, i.e., $13\%\times 9/5 = 23.4\%$ of the skeleton weights are removed.
Then we explore the cause of the above observations by investigating the numeric values of the kernels. We use the magnitude (i.e., absolute value) as the metric for the importance of parameters, which is adopted by many prior CNN pruning works [ding2018auto, guo2016dynamic, han2015learning, li2016pruning]. Specifically, we add up all the fused 2D kernels in a convolutional layer, perform a layer-wise normalization by the max value, and finally obtain an average of the normalized kernels of all the layers. Formally, let $F^{(i,j)}$ be the 3D kernel of the $j$-th filter at the $i$-th $3\times 3$ layer, $n$ be the number of all such layers, and $\max$ and $\mathrm{abs}$ be the max and element-wise absolute value, respectively. The average kernel magnitude matrix is computed as
$$A = \frac{1}{n}\sum_{i=1}^{n}\frac{S^{(i)}}{\max(S^{(i)})}, \tag{9}$$
where the sum of absolute kernels of layer $i$ is
$$S^{(i)} = \sum_{j=1}^{D}\sum_{k=1}^{C}\mathrm{abs}\big(F^{(i,j)}_{:,:,k}\big). \tag{10}$$
We present the $A$ values of the normally trained ResNet-56 and the fused ACNet counterpart in Fig. 5(a) and Fig. 5(b), where the numeric value and color at a certain grid indicate the average relative importance of the parameter on the corresponding position across all the layers, i.e., a larger value and darker background color indicate a higher average importance of the parameter.
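The average kernel magnitude computation described above can be sketched in a few lines. This is an illustrative sketch, not the authors' code: `average_kernel_magnitude` and the nested-list layout (layers, then filters, then one 2D kernel per input channel) are assumptions, and the two toy layers are hypothetical values chosen so that the result is easy to verify by hand.

```python
def average_kernel_magnitude(layers):
    """layers[i][j][k] is the 2D kernel (list of rows) of filter j, input
    channel k, at layer i. Returns the average of the per-layer sums of
    absolute kernels, each normalized by its own maximum entry."""
    size = len(layers[0][0][0])
    average = [[0.0] * size for _ in range(size)]
    for filters in layers:
        # Sum of absolute 2D kernels over all filters and input channels.
        layer_sum = [[0.0] * size for _ in range(size)]
        for filt in filters:
            for kernel in filt:
                for r in range(size):
                    for c in range(size):
                        layer_sum[r][c] += abs(kernel[r][c])
        # Layer-wise normalization by the max value, then averaging over layers.
        peak = max(max(row) for row in layer_sum)
        for r in range(size):
            for c in range(size):
                average[r][c] += layer_sum[r][c] / peak / len(layers)
    return average

# Two hypothetical layers, each with one filter and one input channel.
layer_a = [[[[1, -2, 1], [-2, 4, -2], [1, -2, 1]]]]
layer_b = [[[[0, 1, 0], [1, 2, 1], [0, 1, 0]]]]

A = average_kernel_magnitude([layer_a, layer_b])
for row in A:
    print(row)
```

In this toy case the center entry is 1.0 because the center holds the largest summed magnitude in both layers, which mirrors how a value of 1.000 at a position indicates that it dominates in every single layer.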
As can be observed from Fig. 5(a), the normally trained ResNet-56 distributes the magnitude of the parameters in an imbalanced manner, i.e., the central point has the largest magnitude, and the points at the four corners have the smallest. Fig. 5(b) shows that ACNet aggravates such imbalance, as the $A$ values of the four corners are decreased to below 0.400, and the skeleton points have $A$ values above 0.666. In particular, the central point has an $A$ value of 1.000, which means that this location has a dominant importance consistently in every single layer. It is noteworthy that the weights on the corresponding positions of the square, horizontal and vertical kernels may grow opposite in sign, thus summing them up may result in a larger or smaller magnitude. But we have observed a consistent phenomenon that the model always learns to enhance the skeletons at every layer.
We continue to study how the model will behave if we add the asymmetric kernels onto other positions rather than the central skeletons. Specifically, we train an ACNet counterpart of ResNet-56 using the same training configurations as before, but shift the horizontal convolutions one pixel towards the bottom on the inputs and shift the vertical convolutions towards the right. Accordingly, during branch fusion, we add the BN-fused asymmetric kernels to the bottom-right borders of the square kernels (Fig. 5(c)) in order to obtain an equivalent network. It is observed that such ACBs can also enhance the borders, but not as intensively as the regular ACBs do to the skeletons. The model delivers an accuracy of 94.67%, which is 0.42% lower than the regular ACNet (Table 1). Moreover, similar pruning experiments are conducted on the fused model (Fig. 4(c)). As observed, pruning the corners still delivers the best accuracy, and pruning the enhanced bottom-right borders gives no better results than the top-left $2\times 2$ squares, i.e., though the magnitudes of the borders have increased, the other parts remain essential to the whole kernels.
In summary: 1) the skeletons are inherently more important than the corners in standard square kernels; 2) ACB can significantly enhance the skeletons, resulting in improved performance; 3) adding the horizontal and vertical kernels to the borders degrades the model’s performance compared to regular ACBs; 4) doing so can also increase the magnitude of the borders but cannot diminish the importance of the other parts. Therefore, we partly attribute the effectiveness of ACNet to its capability of further strengthening the skeletons. Intuitively, ACNet follows the nature of the square convolution kernels.
5 Conclusion
In order to improve the performance of various CNN architectures, we proposed the Asymmetric Convolution Block (ACB), which sums up the outputs of three convolutional branches with square, horizontal and vertical kernels, respectively. We construct an Asymmetric Convolutional Network (ACNet) by replacing the square-kernel layers in a mature architecture with ACBs and convert it into the original architecture after training. We have evaluated ACNet by improving various plain-style, residual and densely connected models on CIFAR and ImageNet. We have shown that ACNet can enhance the model’s robustness to rotational distortions by an observable margin, and explicitly strengthen the skeletons following the nature of square kernels. Of note is that ACNet introduces NO hyper-parameters to be tuned, requires NO extra inference-time computations, and is simple to implement using mainstream frameworks.
This work was supported by the National Key R&D Program of China (No. 2018YFC0807500), National Natural Science Foundation of China (No. 61571269), National Postdoctoral Program for Innovative Talents (No. BX20180172), and the China Postdoctoral Science Foundation (No. 2018M640131). Corresponding author: Guiguang Ding, Yuchen Guo.