LinearConv: Regenerating Redundancy in Convolution Filters as Linear Combinations for Parameter Reduction
Convolutional Neural Networks (CNNs) show state-of-the-art performance in computer vision tasks. However, convolutional layers of CNNs are known to learn redundant features, making them inefficient in terms of memory. In this work, we explore the redundancy of learned features in the form of correlation between convolutional filters, and propose a novel layer to reproduce it efficiently. The proposed "LinearConv" layer generates a portion of its convolutional filters as a learnable linear combination of the rest of the filters, and introduces a correlation-based regularization to achieve flexibility and control over the correlation between filters, and in turn, over the number of parameters. It is developed as a plug-in layer that conveniently replaces a conventional convolutional layer without any modification to the network architecture. Our experiments verify that LinearConv-based models achieve performance on-par with their counterparts, with a notable reduction in parameters, while having the same computational requirement at run time.
Deep Learning has been widely adopted in recent years over feature design and hand-picked feature extraction. This development was supported by improvements in compute-power availability and large-scale public datasets. In such a resourceful setting, the research community has put forth deep learning models with exceptional performance at the cost of heavy computation and memory usage. Recent studies suggest that, although the automated feature learning process captures more meaningful and high-level features, the accuracy gain comes with considerable redundancy in the learned features [denil2013predicting, chakraborty2019feature, song2012sparselet]. Such inefficiency hinders the deployment of deep learning models in resource-constrained environments. To this end, it is interesting to investigate the possibility of controlling the redundancy in features without sacrificing performance.
CNNs have become the backbone of deep neural networks with their vast success in feature extraction. A set of convolutional filters running through the input of each layer produces output feature maps, extracting and combining localized features. A general architecture comprising a cascade of such convolutional layers, with the resolution of feature maps decreasing and the depth increasing, followed by a couple of fully-connected layers, results in state-of-the-art performance in most computer vision tasks. However, when these weights (filters) are learned by optimizing the loss function, they converge to a point where the final learned filters of each layer are correlated [shang2016understanding, chen2015correlative, wang2017building]. This means that the set of filters in each layer is linearly dependent, and hence, at least theoretically, the filter subspace could be spanned with fewer filters. In other words, the exact same performance could be achieved with fewer parameters if the weights were carefully optimized. In practice, however, an over-complete spanning set of filters is allowed, to reach fine-grained performance improvements. Even so, enabling better control over this redundancy may reveal ways of efficiently replicating the same behavior.
In this regard, recent literature has explored the possibility of inducing sparsity [liu2015sparse, song2012sparselet] and having separability [chollet2017xception, xie2017aggregated, howard2017mobilenets] in convolutional filters. Although some works consider the inherent correlation in learned filters for various improvements [shang2016understanding, wang2017building], it has been overlooked for parameter reduction in deep networks. Moreover, previous works fall short in conveniently controlling the feature redundancy identified as the correlation between features.
In this paper, we discuss an approach for gaining control over the feature redundancy, as seen in convolutional filter correlation in CNNs, and utilizing it to improve efficiency. To do this, we propose a novel convolutional element, which we call LinearConv, as presented in Fig.1, that consists of two sets of filters: primary filters, a set of convolutional filters learned as usual but with adjustable correlation, and secondary filters, generated by linearly combining the former. The coefficients that produce this linear combination are co-learned along with the primary filters. Here, the intuition is to control the correlation between the filters in each layer and to efficiently replicate the required redundancy. The memory efficiency, i.e., the reduction in trainable parameters, comes from the set of secondary filters being expanded by fewer parameters, namely the set of linear coefficients.
The main contributions of this paper are as follows:
We propose a novel LinearConv layer that comprises a learned set of filters and a learned linear combination of these initial filters to replace the convolutional layers in CNNs. We experimentally validate the proposed LinearConv-based models to achieve a performance on-par with counterparts with a reduced number of parameters.
We propose a novel correlation-based regularization loss for convolutional layers which gives flexibility and control over the correlation between convolutional filters. The proposed regularization loss and the LinearConv layer are designed to be conveniently plugged into existing CNN architectures without any modifications.
2 Related Work
The capacity of deep neural networks became widely recognized after the proposal of AlexNet [krizhevsky2012imagenet], which achieved state-of-the-art performance in ILSVRC-2012 [ILSVRC15]. Since then, deep CNN architectures such as VGG [simonyan2014very], first introducing very deep networks, and ResNet [he2016deep] and DenseNet [huang2017densely], proposing better learning with depth, have flourished, improving different facets of deep learning. In parallel to working on better architectures and optimization techniques, the community has looked into improving the resource efficiency of networks over the years. ResNeXt [xie2017aggregated] and ShuffleNet [zhang2018shufflenet] utilize group convolutions to reduce channel-wise redundancy in learned feature maps, which is taken a step further in Xception [chollet2017xception] and MobileNet [howard2017mobilenets, sandler2018mobilenetv2] with depth-wise separable convolutions. In [han2015deep], a pruning technique is proposed to remove non-significant features and fine-tune the network, which achieves similar performance with reduced complexity. OctConv [chen2019drop] is proposed to process low-frequency and high-frequency features separately to reduce the number of computations and parameters. In [chakraborty2019feature], the authors randomly drop a certain amount of feature maps, hoping to reduce redundancy, whereas in [denil2013predicting], the authors predict a majority of the weights using a small subset. In contrast, we approach this redundancy by observing it as the correlation between convolutional filters and controlling it.
The correlation between feature maps, and the resulting redundancy, have been identified in recent literature. In [shang2016understanding], the authors observe a pair-wise negative correlation in low-level features of CNNs. Motivated by this, a novel activation function called concatenated ReLU is proposed to preserve both positive and negative phase information, mitigating the need to process both features in a correlated pair. A similar property of correlation is identified in [chen2015correlative, wang2017building, chen2018static], where the authors suggest generating such features from a separate set of correlation filters rather than learning all the redundant features. From these directions, it is evident that previous works have utilized feature correlation up to a certain extent. However, they fall short in subtly manipulating the correlation to gain an advantage. To address this, we propose a correlation-based regularization method for optimizing convolutional weights.
Linear combinations linked with convolutional layers have been proposed to improve CNNs in multiple aspects. Separable filters [rigamonti2013learning] and Sparselet models [song2012sparselet] explore the idea of approximating a set of convolutional filters as a linear combination of a smaller set of basis filters. One other direction suggests linearly combining feature maps, rather than the filters which generate them, to efficiently impose the feature redundancy [jaderberg2014speeding]. In [chen2015correlative, wang2017building], the authors generate correlated filters via matrix multiplication with a set of correlation matrices; they first use static correlation matrices, and later enable their parametric learning. Each primary filter is mapped one-to-one to a dependent filter based on these learnable correlation matrices. We follow a similar procedure, but instead of learning a one-to-one mapping, we linearly combine a group of learnable filters, scaled by learnable linear coefficients, to generate a group of correlated filters.
In essence, previous works have identified feature correlation and redundancy, utilizing them to improve the efficiency of CNNs. However, all these approaches have limited control over the correlation and thus, a narrow outlook on the redundancy and its replication. In contrast, we achieve a finer manipulation of correlation through the proposed regularization technique and a flexible replication of the redundancy. All this, in a form that can be directly plugged into existing architectures without any additional effort, enables its fast and convenient adoption.
The proposition of this work is to flexibly control the correlation between convolutional filters and regenerate their inherent redundancy efficiently, without sacrificing performance. In our perspective, this is a two-step process: first, we restrict the convolutional filters to learn linearly independent features, and second, we combine these primary filters to generate correlated filters in a learnable manner. Therefore, we introduce a regularization loss which applies to convolutional filters, followed by the proposed LinearConv layer, which can manipulate the correlation and replicate the redundancy through learnable linear combinations.
The intuition for the weight regularization is to reduce the inherent redundancy in the learned convolutional filters by making them as linearly independent as possible, whilst providing space to learn. In other words, we want each filter of a given layer to learn distinctive features. Therefore, when calculating the regularization loss, we flatten the weights of the filters and treat them as vectors to be made linearly independent. Ideally, when the filters are fully decorrelated, the correlation matrix of these vectors should be the identity matrix of the same dimensionality. Hence, the element-wise absolute sum of the difference between the correlation matrix and the identity matrix is used as the desired loss. In this sense, the proposed method is an extension of L1 regularization, applied to the correlation matrix of the filters rather than to the filters themselves. The steps of calculating this correlation-based regularization loss are elaborated in Algorithm 1.
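The steps above can be sketched as follows in NumPy; this is a minimal illustration of the idea, not the paper's Algorithm 1 verbatim, and it assumes Pearson correlation of the flattened filter vectors as the correlation measure:

```python
import numpy as np

def correlation_regularization(filters):
    """Correlation-based regularization loss (illustrative sketch).

    `filters` has shape (n_filters, in_channels, k, k); the name and
    signature are hypothetical, not taken from the paper's code.
    """
    n = filters.shape[0]
    vecs = filters.reshape(n, -1)   # flatten each filter into a vector
    corr = np.corrcoef(vecs)        # (n, n) pairwise correlation matrix
    # L1 penalty on the deviation from the identity matrix, i.e., on any
    # correlation between distinct filters.
    return np.abs(corr - np.eye(n)).sum()
```

The loss is zero exactly when all distinct filter pairs are uncorrelated, and grows as filters become redundant.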
When training the network, the regularization loss is scaled by a constant and added to the output loss. In back-propagation, the gradient of this regularization term affects only the weight updates of the respective layers. This results in convenient adoption of the regularization in existing CNNs.
3.2 LinearConv operation
To replicate the inherent redundancy, we propose a novel LinearConv layer to replace conventional Conv layers in CNNs, with added flexibility and control over correlation. Here, the intuition is to have a primary set of conventional convolutional filters which can be trained with controlled regularization, and a secondary set of strictly linearly-dependent filters.
The operation of the proposed LinearConv layer is depicted in Listing 1. This is a basic version of the class definition with default bias, stride, padding, and group configurations, which can be easily extended when required in different models. In addition to the basic initialization parameters such as the number of input filters (channels), the number of output filters and the kernel size, the LinearConv layer takes a parameter $\alpha$, which defines the fraction of primary filters. As trainable parameters, the layer holds the weights of the primary filters and the coefficients used to generate their linear combinations. In each forward pass, the layer first calculates the set of secondary filters as linear combinations of the primary filters, and then convolves the input with the concatenation of the two sets of filters. It is important to note that the input is subjected to a single convolution operation, the same as in a conventional convolutional layer. In the backward pass, the primary filters get their weights updated directly, whereas the linear coefficients get updated through the secondary filters.
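The weight construction performed in each forward pass can be sketched as follows; this is a NumPy stand-in for the corresponding step of Listing 1, with illustrative names (`primary`, `coeffs`) rather than the paper's actual identifiers:

```python
import numpy as np

def linearconv_filters(primary, coeffs):
    """Assemble the full LinearConv filter bank (illustrative sketch).

    primary: (p, c_in, k, k) directly learned filters, p = alpha * c_out.
    coeffs:  (s, p) learnable linear coefficients, s = (1 - alpha) * c_out.
    Returns: (c_out, c_in, k, k) filters used in a single convolution.
    """
    p = primary.shape[0]
    flat = primary.reshape(p, -1)                        # (p, c_in*k*k)
    # Each secondary filter is a linear combination of all primaries.
    secondary = (coeffs @ flat).reshape(-1, *primary.shape[1:])
    # The input is then convolved once with the concatenated bank,
    # exactly as with a conventional layer of c_out filters.
    return np.concatenate([primary, secondary], axis=0)
```

In training, gradients flow to `coeffs` through the generated secondary filters, while `primary` receives gradients both directly and through the combinations.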
The number of trainable parameters of a LinearConv layer is less than that of an equivalent Conv layer if and only if:

$$\alpha c_{out} \, c_{in} k^2 + \alpha(1-\alpha) c_{out}^2 \;<\; c_{out} \, c_{in} k^2 \;\;\Longleftrightarrow\;\; \alpha \, c_{out} < c_{in} k^2, \quad (1)$$

where $c_{in}$ represents the number of input filters, $c_{out}$, the number of output filters, and $k$, the kernel width of a square kernel. The left-hand side counts the $\alpha c_{out}$ primary filters and the $(1-\alpha)c_{out} \times \alpha c_{out}$ linear coefficients. Both the above equation and the proposed layer stand only when $0 < \alpha < 1$. For convenience, we choose $\alpha$ so that both $\alpha c_{out}$ and $(1-\alpha)c_{out}$ are integers in our experiments. Eq.1 proves to be true for the majority of convolutional layers in common CNN architectures, except at the input, where $c_{in}$ is 1 or 3 for images.
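This parameter-count comparison can be checked with a short pure-Python sketch (function names are illustrative): a typical mid-network layer saves parameters, while an input layer with $c_{in}=3$ typically does not.

```python
def conv_params(c_in, c_out, k):
    """Weights of a standard convolutional layer (bias ignored)."""
    return c_out * c_in * k * k

def linearconv_params(c_in, c_out, k, alpha):
    """Weights of a LinearConv layer: primary filters + linear coefficients."""
    p = int(alpha * c_out)    # number of primary filters
    s = c_out - p             # number of secondary (generated) filters
    return p * c_in * k * k + s * p

# LinearConv saves parameters exactly when alpha * c_out < c_in * k^2,
# e.g., for c_in=128, c_out=256, k=3, alpha=0.5: 128 < 1152 holds.
```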
As evident from Listing 1, the linear combinations need to be calculated in every forward pass of a batch. This increases the computational requirement of the layer compared to a conventional convolutional layer. The extra cost is that of the matrix multiplication, which amounts to a considerable number of computations given that this operation is performed in every forward pass. This cost can be reduced by decomposing the matrix of linear coefficients as a product of two lower-rank matrices, which reduces the rank of the original matrix in the process, as in:

$$\mathbf{W} = \mathbf{W}_1 \mathbf{W}_2, \quad (2)$$

where $\mathbf{W} \in \mathbb{R}^{(1-\alpha)c_{out} \times \alpha c_{out}}$ is the matrix of linear coefficients, $\mathbf{W}_1 \in \mathbb{R}^{(1-\alpha)c_{out} \times r}$, $\mathbf{W}_2 \in \mathbb{R}^{r \times \alpha c_{out}}$, and $r < \min(\alpha, 1-\alpha)\,c_{out}$. Moreover, the calculation of linear combinations is required in the forward pass for the training phase only. At run time, since there is no requirement for back-propagation and weight updates, this computation can be done once in the initialization phase of the layer, making it a one-time cost. Therefore, although the proposed LinearConv layer has an increased computational requirement in training, trained models can be deployed with no additional computational requirement at run time.
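Both of these optimizations, factoring the coefficient matrix into two low-rank matrices and precomputing the secondary filters once at deployment, can be sketched as follows; the shapes and names are illustrative assumptions, not taken from the paper's code:

```python
import numpy as np

def coeff_param_counts(s, p, r):
    """Parameters of the full (s, p) coefficient matrix vs. its rank-r factors."""
    return s * p, s * r + r * p

rng = np.random.default_rng(0)
s_f, p_f, r = 32, 32, 8            # secondary count, primary count, rank

# Low-rank storage: W (s, p) is kept as W1 (s, r) @ W2 (r, p).
w1 = rng.standard_normal((s_f, r))
w2 = rng.standard_normal((r, p_f))
coeffs = w1 @ w2                    # rank at most r by construction

# At deployment there are no weight updates, so the secondary filters are
# computed once at initialization and the concatenated bank is stored;
# each forward pass then costs exactly one convolution.
primary = rng.standard_normal((p_f, 16, 3, 3))
secondary = (coeffs @ primary.reshape(p_f, -1)).reshape(-1, 16, 3, 3)
filters = np.concatenate([primary, secondary], axis=0)
```

With $s = p = 32$ and $r = 8$, the factorization stores 512 coefficients instead of 1024, halving that term of the layer's parameter count.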