BasisConv: A method for compressed representation and learning in CNNs
It is well known that Convolutional Neural Networks (CNNs) have significant redundancy in their filter weights. Various methods have been proposed in the literature to compress trained CNNs. These include techniques like pruning weights, filter quantization, and representing filters in terms of basis functions. Our approach falls in this latter class of strategies, but is distinct in that we show both compressed learning and representation can be achieved without significant modifications of popular CNN architectures. Specifically, any convolution layer of the CNN is easily replaced by two successive convolution layers: the first is a set of fixed filters (that represent the knowledge space of the entire layer and do not change), which is followed by a layer of one-dimensional filters (that represent the learned knowledge in this space). For pre-trained networks, the fixed layer is just the truncated eigen-decomposition of the original filters. The 1D filters are initialized as the weights of the linear combination, but are fine-tuned to recover any performance loss due to the truncation. For training networks from scratch, we use a set of random orthogonal fixed filters (that never change), and learn the 1D weight vectors directly from the labeled data. Our method substantially reduces i) the number of learnable parameters during training, and ii) the number of multiplication operations and filter storage requirements during implementation. It does so without requiring any special operators in the convolution layer, and extends to all known popular CNN architectures. We demonstrate the generality of the proposed approach by applying it to four well known network architectures with three different datasets. The results show a consistent reduction in i) the number of operations by up to a factor of 5, and ii) the number of learnable parameters by up to a factor of 18, with less than a 3% drop in performance on the CIFAR100 dataset.
Muhammad Tayyab (http://www.mtayyab.com), Center for Research in Computer Vision, Department of Computer Science, University of Central Florida, USA, email@example.com
Abhijit Mahalanobis, Center for Research in Computer Vision, Department of Computer Science, University of Central Florida, USA, firstname.lastname@example.org
Preprint. Under review.
1 Introduction
While there has been a tremendous surge in convolutional neural networks and their applications in computer vision, relatively little is understood about how information is learned and stored in the network. This is evidenced by the fact that researchers have successfully proposed many different approaches for compressing a network after it has been trained Cheng2017ASO (), including techniques like pruning weights optimalbrain (); Surgeon (); Han2016DeepCC (); Han2015LearningBW (); Srinivas2015DatafreePP (); Chen2015CompressingNN (), assuming row-column separability Jaderberg2014SpeedingUC (), applying low-rank approximations for computational gains Denton2014ExploitingLS (), and using basis representations Jaderberg2014SpeedingUC (); Qiu2018DCFNetDN (). It is clear that CNNs do not need to explicitly learn the large number of coefficients that they acquire in the manner in which they are currently trained. Based on this observation, we take a different view of the key component in CNNs - the filtering operation - and propose a fundamentally different approach that combines a "fixed" convolution operator (that is never trained or learned) with a learnable one-dimensional kernel. This is motivated by the salient observation that the filters are points in a hyper-dimensional space that is learned via the training process. We claim that the filters themselves are not important in the end; rather, it is the representation of the space itself that is key.
For networks that have already been trained, the underlying knowledge space of a layer can be easily represented as a truncated eigen decomposition of the filters. We can then efficiently fine-tune the coefficients of linear combination to find new points in this lower-dimensional space which recover any loss in performance, and discard the original filters. As we will show, this approach dramatically reduces the number of filtering operations and filter storage requirements, without a notable drop in performance. The same construct can also be used to train a network from scratch without having to explicitly learn the filter kernels across the network. For this scenario, we show that random basis functions can be used as fixed convolution kernels (which never require training), with one-dimensional weight vectors that learn the relevant information. We refer to this ability to learn in a compressed format as "compressed learning": instead of learning the 3D filter parameters, we only need to learn relatively fewer parameters that describe where these filters reside in the hyper-dimensional information space in a given layer of the CNN. Thus, this paper unifies the goals of compressing previously trained networks, and training networks in a compressed format when learning new information from scratch.
2 Compressed Representation and Learning
Consider the fundamental convolution operation in any given layer of a convolutional neural network, depicted on the left in Figure 1. Assume that an input block of data $X$ (such as the activations or output of the previous layer) is convolved with a set of $N$ 3D filters $f_i$. The output can be expressed as

$$y_i = f_i * X, \qquad i = 1, \dots, N, \qquad (1)$$

where $*$ represents the convolution operation. The right side of Figure 1 shows how the same output can be obtained using two successive convolution stages. Here, we assume that the filters $f_i$ can be expressed as a linear combination of $Q$ basis functions $\phi_q$, such that

$$f_i = \sum_{q=1}^{Q} w_{iq}\,\phi_q, \qquad (2)$$

where $w_{iq}$ are the weights of the linear combination. Using this representation, the output can be expressed as

$$y_i = \sum_{q=1}^{Q} w_{iq}\,(\phi_q * X). \qquad (3)$$

The key observation is that the $Q$ convolution terms $\phi_q * X$ need to be computed only once, and they are common to all outputs $y_i$. These can be stacked together to form the 3D intermediate result $Z$, while the weights $w_i = [w_{i1}, \dots, w_{iQ}]^T$ can be treated as a $1 \times 1 \times Q$ filter. Therefore, the outputs are simply the convolution of the two, i.e.

$$y_i = w_i * Z. \qquad (4)$$

We refer to this construct using two successive convolutions as BasisConv.
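As a concrete illustration, the BasisConv construct can be sketched in pytorch as two stacked `Conv2d` modules: the first holds frozen basis filters, the second holds the learnable 1×1 recombination weights. The class and argument names below are illustrative stand-ins, not the authors' implementation:

```python
import torch
import torch.nn as nn

class BasisConv2d(nn.Module):
    """Sketch of the BasisConv construct: a fixed convolution with Q basis
    filters followed by a learnable 1x1 convolution that recombines them."""
    def __init__(self, basis_filters, num_out, stride=1, padding=0):
        super().__init__()
        q, c, k, _ = basis_filters.shape  # Q basis filters of size k x k x c
        self.basis = nn.Conv2d(c, q, k, stride=stride, padding=padding, bias=False)
        self.basis.weight.data.copy_(basis_filters)
        self.basis.weight.requires_grad = False   # fixed: never trained
        # learnable 1D weight vectors, realized as a 1x1 convolution over Q maps
        self.combine = nn.Conv2d(q, num_out, kernel_size=1, bias=True)

    def forward(self, x):
        return self.combine(self.basis(x))

# usage: stand in for a 5x5 conv layer with 32 output channels
layer = BasisConv2d(torch.randn(8, 3, 5, 5), num_out=32, padding=2)
y = layer(torch.randn(1, 3, 32, 32))
print(y.shape)  # torch.Size([1, 32, 32, 32])
```

Note that only `combine` contributes learnable parameters; the basis filters are excluded from gradient updates entirely.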
2.1 Compression of pretrained Networks
It is well known that eigen decomposition results in a compact basis that minimizes the reconstruction error achieved by a linear combination of basis functions. We therefore choose the $\phi_q$ as the eigen filters that represent the sub-space in which the original filters lie. To obtain the eigen filters, we define the $d$-dimensional vector $v_i$ (with $d = k^2 c$ for filters of size $k \times k \times c$) as a vectorized representation of $f_i$, and construct the matrix $A$ with the $v_i$ as its columns. The eigenvectors of $AA^T$ represent the sub-space of the filters, and satisfy the relation $AA^T u_q = \lambda_q u_q$, where $u_q$ are the eigenvectors and $\lambda_q$ are the corresponding eigenvalues. The eigen filter $\phi_q$ is readily obtained by re-ordering the elements of the eigenvector $u_q$ into a $k \times k \times c$ array. Although the number of possible eigenvectors is equal to the dimensionality $d$ of the space, we select the small subset of $Q$ eigenvectors corresponding to the largest eigenvalues, which best represent the dominant coordinates of the filters' subspace. Since the eigenvalues represent the information present in each eigenvector, in practice we use the metric $\sum_{q=1}^{Q} \lambda_q \,/\, \sum_{q=1}^{d} \lambda_q \geq t$ to choose $Q$ such that most of the relevant information is retained in the selected eigenvectors.
The decomposition of the filters of any given layer of the network can be succinctly expressed in matrix-vector notation by defining $\Phi = [u_1, u_2, \dots, u_Q]$ (i.e. the matrix of eigenvectors of the filters for that layer), so that

$$v_i = \Phi w_i, \qquad (5)$$

where $w_i$ is a vector of weights. Since $\Phi^T \Phi = I$ (i.e. the identity matrix), the weights are easily obtained by computing

$$w_i = \Phi^T v_i. \qquad (6)$$
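The eigen-filter construction and the projection of the original filters onto the retained basis can be sketched with numpy as follows; the layer sizes and variable names are illustrative, not taken from the paper:

```python
import numpy as np

# Illustrative eigen-filter step for one layer: W holds N filters of size
# k x k x c from a (hypothetical) trained convolution layer.
N, k, c, Q = 64, 3, 16, 10
rng = np.random.default_rng(0)
W = rng.standard_normal((N, k, k, c))

A = W.reshape(N, -1).T                     # d x N matrix; columns are vectorized filters
lam, U = np.linalg.eigh(A @ A.T)           # eigh returns ascending eigenvalues
lam, U = lam[::-1], U[:, ::-1]             # sort descending
Phi = U[:, :Q]                             # d x Q matrix of leading eigenvectors
eigen_filters = Phi.T.reshape(Q, k, k, c)  # reshape each column into a k x k x c filter

# weights of linear combination: w_i = Phi^T v_i  (since Phi^T Phi = I)
weights = Phi.T @ A                        # Q x N
recon = Phi @ weights                      # best rank-Q approximation of the filters
err = np.linalg.norm(A - recon) / np.linalg.norm(A)
```

With random weights the relative error `err` is large for small Q; for real trained filters the leading eigenvalues dominate, so a small Q suffices.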
Depending on the choices of $N$ and $Q$, this can lead to a substantial reduction in the number of multiplication operations. Specifically, let $M$ represent the number of multiplications for one convolution operation (between $X$ and either $f_i$ or $\phi_q$). If the size of the filters is $k \times k \times c$ and the size of the input data is $n \times n \times c$, it is easy to show that $M = n^2 k^2 c$. Therefore, the number of multiplications required in Eq. (1) is

$$N M,$$

while the number of multiplications required in Eq. (3) is

$$Q M + N Q n^2,$$

since the $Q$ basis convolutions are followed by $N$ linear combinations of $Q$ terms at each of the $n^2$ output locations. We see that the ratio of the two is

$$\frac{Q M + N Q n^2}{N M} = \frac{Q}{N} + \frac{Q}{k^2 c}.$$

Thus, as long as $Q \ll k^2 c$, the number of multiplications will be reduced by a factor close to $N/Q$ (i.e. the ratio of the original number of filters and the number of basis filters used).
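This ratio is easy to verify numerically; the layer sizes below are illustrative:

```python
# Quick check of the multiplication-count ratio with illustrative sizes.
n, k, c = 32, 3, 64      # input n x n x c, filters k x k x c
N, Q = 256, 64           # original filter count vs. retained basis filters

M = n * n * k * k * c    # multiplications for one 3D convolution
orig = N * M                     # Eq. (1): N full convolutions
basis = Q * M + N * Q * n * n    # Eq. (3): Q convolutions + N 1x1 combinations

ratio = basis / orig
print(ratio)             # matches Q/N + Q/(k*k*c)
```

Here `ratio` equals Q/N + Q/(k²c) exactly, so the savings approach N/Q whenever Q is small relative to k²c.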
2.2 Compressed Learning
The architecture shown in Figure 1 is not only amenable to reducing the filter storage requirements and multiplications required for each convolution layer, but also to learning in the compressed space, where the number of learnable parameters is substantially reduced. Recall that the number of learnable parameters in each original filter is $k^2 c$. Since there are $N$ such filters, the total number of original learnable parameters is $N k^2 c$. However, the total number of learnable parameters for BasisConv is $N Q$ (depicted in Figure 1 as $N$ one-dimensional filters of length $Q$). Therefore, the reduction in the number of learnable parameters is a factor of $k^2 c / Q$. If $Q \ll k^2 c$, it is clear that the number of scalar weights that need to be refined is substantially less than the original number of learnable parameters.
For pretrained networks, fine-tuning is achieved by retraining while freezing the eigen filters in each basis convolution layer. The reason is that the weight vectors are the lower-dimensional embeddings of the original filters in the sub-space represented by the informative eigenvectors. Thus, while the eigenvectors represent the "knowledge space" captured in a given layer of the network, the weights represent the specific points within this space where each filter resides. This observation allows us to fine-tune the weights directly to mitigate the approximation errors at each layer (without having to fine-tune the filters explicitly).
The more interesting scenario arises when training a network from scratch in compressed format. Of course, if the final values of the filters are not known, then it is not possible to use the eigen-space representation. Therefore, for compressed learning from scratch, we propose to initialize the columns of $\Phi$ with random vectors that remain fixed, and to allow only the coefficients to update during training. In other words, we never need to train 3D filter coefficients, but just the weights of the linear combination.
Here, we assume $\Phi$ is a random matrix whose columns $u_q$, $q = 1, \dots, Q$, are orthonormal random vectors of dimension $d$, so that $\Phi^T \Phi$ is the identity matrix. The question is how such random vectors should be chosen to represent the underlying knowledge space of the filters. Of course, an ideal (but unknown) filter $v$ (which is of size $d = k^2 c$) can be exactly represented as a linear combination of $d$ such orthogonal random vectors. However, since $\Phi$ only has $Q < d$ columns, the error between the ideal filter and its linear approximation is

$$e = v - \Phi w.$$
The minimum squared-error solution is $w = \Phi^T v$, which yields

$$e = (I - \Phi \Phi^T)\, v.$$

Therefore, the relative error is bounded by

$$1 - \lambda_{\max} \;\leq\; \frac{\|e\|^2}{\|v\|^2} \;\leq\; 1 - \lambda_{\min},$$

where $\lambda_{\min}$ and $\lambda_{\max}$ are the minimum and maximum eigenvalues of $\Phi \Phi^T$. In other words, the upper bound on the relative approximation error can be minimized by making $\lambda_{\min}$ as large as possible, while the lower bound can be reduced by ensuring that $\lambda_{\max}$ is also as large as possible. The sample realizations of the random vectors used for actual experiments can be judiciously chosen to achieve these objectives, ensuring that they serve as a reasonable choice of basis filters.
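A small numpy sketch of this construction: orthonormal random basis vectors obtained via a QR factorization, and the resulting relative error of projecting a random "ideal" filter onto their span. The sizes are illustrative:

```python
import numpy as np

# Build Q orthonormal random basis vectors (columns of Phi) via QR.
rng = np.random.default_rng(0)
d, Q = 144, 48                          # d = k*k*c, illustrative sizes
G = rng.standard_normal((d, Q))
Phi, _ = np.linalg.qr(G)                # d x Q, with Phi^T Phi = I

v = rng.standard_normal(d)              # ideal (unknown) filter, vectorized
w = Phi.T @ v                           # minimum squared-error weights
e = v - Phi @ w                         # approximation error
rel_err = (e @ e) / (v @ v)
# For an isotropic random v, the expected relative error is 1 - Q/d.
print(rel_err)
```

Because the columns are exactly orthonormal here, the relative error for a worst-case filter orthogonal to the span is 1; the averaged figure 1 − Q/d only holds for directions drawn uniformly at random.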
3 Background and Related Work
Filter pruning is probably the earliest explored research direction for compression and efficient implementation of CNNs. Le Cun et al. optimalbrain () and Hassibi et al. Surgeon () showed that the second derivative of the loss can be used to reduce the number of connections in a network. This strategy not only yields an efficient network but also improves generalization. However, these methods are only applicable when training a network from scratch. More recently, there has been growing interest in pruning redundancies from a pre-trained network. Han et al. Han2015LearningBW () proposed a compression method which aims to learn not only the weights but also the connections between neurons from training data, while Srinivas et al. Srinivas2015DatafreePP () proposed a data-free method to prune individual neurons instead of whole filters. Chen et al. Chen2015CompressingNN () proposed a hash-based parameter sharing strategy which in turn reduces storage requirements.
Filter quantization has also been used for network compression. These methods aim to reduce the number of bits required to represent the filters, which can in turn lead to efficient CNN implementations. Quantization using k-means clustering has been explored by Gong et al. Gong2014CompressingDC () and Wu et al. Wu2016QuantizedCN (). Similarly, Vanhoucke et al. Vanhoucke2011ImprovingTS () showed that 8-bit quantization of the parameters can result in significant speed-up with minimal loss of accuracy. In contrast, Han et al. Han2016DeepCC () combined quantization with pruning. A special case of quantized networks is binary networks, which use only one bit to represent the filter values. Some of the works exploring this direction are BinaryConnect Courbariaux2015BinaryConnectTD (), BinaryNet Courbariaux2016BinaryNet () and XNOR-Networks Rastegari2016XNORNetIC (). Their main idea is to directly learn binary weights or activations during model training.
Knowledge distillation methods train a smaller network to mimic the output(s) of a larger pre-trained network. Bucila et al. Bucila2006ModelC () is one of the earliest works exploring this idea: they trained a smaller model from a complex ensemble of classifiers without significant loss in accuracy. More recently, Hinton et al. Hinton2015DistillingTK () further developed this method and proposed a knowledge distillation framework, which eased the training of such networks. Another adaptation of Bucila2006ModelC () is Ba2014DoDN (), which aims to compress deep and wide networks into shallower ones. Belagiannis et al. Belagiannis2018AdversarialNC () also used this idea to transfer knowledge from larger networks to much shallower ones using an adversarial loss.
Our work relates to a class of techniques that rely on basis functions to represent the convolution filters, but differs in several key respects. For instance, in Jaderberg2014SpeedingUC (), Jaderberg et al. proposed a similar two-stage decomposition in terms of basis functions followed by 1D convolutions for recombining the outputs of the basis filters. However, to achieve processing speed, their focus is on approximating full-rank filter banks using rank-1 basis filters, which were optimized to reconstruct the original filters and the response of the CNN to the training data. It was shown that this method leads to a significant speed-up of a four-stage CNN for character recognition. However, the authors do not address the problem of learning in compressed format, nor how this method might impact the performance of other well known CNN architectures on standard datasets. Qiu et al. Qiu2018DCFNetDN () have also observed that a conventional convolution can be represented as two successive convolutions involving a basis set and projection coefficients, but their construct differs from the one proposed in Figure 1. Their focus is on 2D Fourier-Bessel functions as a basis set for reducing the number of operations required within a given 3D filter kernel, while noting that random basis functions also tend to perform well. Although this method learns with fewer parameters than conventional CNNs, our approach exploits the redundancy in the full 3D structure of the convolution layer (across all channels and filters) and therefore requires even fewer learnable parameters.
4 Experiments
As described in section 2, we can use BasisConv to represent all pre-trained convolution layers of a traditional ConvNet in compressed form. Additionally, we can also train such networks (referred to as BasisNet) from scratch in this format. We now describe our experiments in detail.
4.1 Datasets and Models
We performed our experiments on three publicly available image classification datasets: CIFAR10, CIFAR100 and SVHN. All three datasets contain 32×32 pixel RGB images. CIFAR10 and SVHN contain 10 object classes while CIFAR100 has 100 classes.
We tested four different CNN architectures with BasisConv: Alexnet Krizhevsky2012ImageNetCW (), VGG16 Simonyan2015VeryDC (), Resnet110 He2016DeepRL () and Densenet190 Huang2017DenselyCC (). We used the pytorch implementations and pre-trained weights of these networks provided by bearpaw (), since this implementation is suitable for 32×32 input images, unlike the originally proposed networks which are designed for a 224×224 input size.
4.2 Network Compression
To compress a pretrained network we replace each convolution layer in the network with a BasisConv layer; the resulting network is referred to as BasisNet. The BasisConv layer implements two operations: i) convolution with the basis filters, and ii) linear combination of the outputs using the projection coefficients, which is implemented as convolution with 1×1 filters. The parameters of the BasisConv layer are computed from the weights of the original convolution layer using eigen decomposition, as explained in section 2.1. Compression emerges from the fact that only a small number (Q) of basis filters are needed to reconstruct the output of the convolution layer, and the rest can be safely discarded. To determine Q, we first sort the eigenvectors such that their corresponding eigenvalues are in descending order. The first Q eigenvectors are then selected such that the ratio of the sum of their eigenvalues to the sum of all eigenvalues exceeds a threshold t. Naturally, the maximum value of t is 1.0, at which point each BasisConv layer retains all basis filters and hence all of the information contained in the original convolution layer. As we reduce t, we are able to discard more of the filters corresponding to the smaller eigenvalues, with some drop in test accuracy. Figure 2 compares the compression potential of all four networks, pre-trained on CIFAR100. Figure 2(a) shows the percentage of retained filters in the compressed network as we reduce t from 1.0 to 0.7, while Figure 2(b) plots the accuracy against the percentage of retained filters. In this figure we see that all four networks can discard 20% of their filters with little change in test accuracy, with Densenet190 being the most compressible: it can discard up to 60% of its filters with only a 3% drop in accuracy.
It should be noted that these plots show the performance of the networks prior to fine-tuning of the learnable parameters (the one-dimensional filters in Figure 1), and the red dots indicate the compression points selected for performance optimization by subsequent fine-tuning.
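The threshold-based selection of Q can be sketched as a small helper; the function name and the sample eigenvalues are illustrative, not the authors' code:

```python
import numpy as np

def choose_q(eigenvalues, t):
    """Smallest Q such that the top-Q eigenvalues hold at least a fraction t
    of the total (eigenvalues assumed non-negative)."""
    lam = np.sort(np.asarray(eigenvalues, dtype=float))[::-1]  # descending
    frac = np.cumsum(lam) / lam.sum()
    # clamp to guard against cumulative floating-point round-off at t = 1.0
    return int(min(np.searchsorted(frac, t) + 1, lam.size))

lam = [5.0, 3.0, 1.0, 0.5, 0.3, 0.2]
print(choose_q(lam, 1.0))   # 6: keep every basis filter
print(choose_q(lam, 0.85))  # 3: the top three eigenvalues hold 90% of the total
```

Reducing t from 1.0 toward 0.7 shrinks Q layer by layer, which is exactly the sweep shown in Figure 2.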
The table below compares each original ConvNet (left-hand columns) with its compressed BasisNet (columns from t onward):

| Model | Filters | GFlops¹ | Accuracy | t | Filters | GFlops | Params (%) | Accuracy |
|---|---|---|---|---|---|---|---|---|
| Alexnet | 1152 | 0.24 | 43.9% | 0.85 | 358 | 0.085 | 7.1% | 42.5% |
| VGG16 | 4187 | 0.313 | 68.7% | 0.85 | 1855 | 0.18 | 54.1% | 67.3% |
| Resnet110 | 4096 | 0.253 | 72.0% | 0.85 | 1719 | 0.11 | 24.7% | 69.9% |
| Densenet190 | 20117 | 18.678 | 82.8% | 0.70 | 3525 | 3.5 | 14.2% | 80.7% |

¹GFlops refers to the number of multiplications in billions, counting only the multiplications in convolutional layers.
4.3 Fine tuning of learnable parameters
As we further reduce t we obtain more compression, but test accuracy also drops significantly. To mitigate this, we train each network in two steps for a total of 25 epochs. In step one, we train only the projection coefficients (i.e. the 1D filters) for 15 epochs with SGD. Since our network has significantly fewer learnable parameters (see Table 1), 15 epochs are enough to re-train these coefficients. We used a step learning-rate schedule, starting at 0.1 and dividing by 10 every 5 epochs. In step two, we update all non-convolutional parameters in the network (including the fully connected layers, but holding the basis filters constant) for another 10 epochs with a learning rate of 5e-4. This process enables us to recover test accuracy even when a large number of basis filters is discarded. Table 2 compares the maximum compression we were able to achieve for all four networks, pre-trained on CIFAR100, while keeping the accuracy within 3% of the original network. We can see that Densenet190 is the most compressible, with a reduction in the number of filters by a factor of 5.7 and a reduction in multiplications by a factor of more than 5.3.
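The two-step schedule can be sketched in pytorch as follows, assuming the basis filters are frozen via `requires_grad=False`. The tiny model and the way the parameter groups are selected are illustrative stand-ins, not the authors' training code:

```python
import torch
import torch.nn as nn

# Toy BasisConv block: frozen basis filters followed by learnable 1x1 weights.
model = nn.Sequential(
    nn.Conv2d(3, 8, 5, padding=2, bias=False),  # basis filters (fixed)
    nn.Conv2d(8, 32, kernel_size=1),            # 1D projection coefficients
    nn.BatchNorm2d(32),
)
model[0].weight.requires_grad = False           # never updated

# Step 1: retrain only the projection coefficients for 15 epochs,
# starting at lr 0.1 and dividing by 10 every 5 epochs.
opt1 = torch.optim.SGD(model[1].parameters(), lr=0.1)
sched1 = torch.optim.lr_scheduler.StepLR(opt1, step_size=5, gamma=0.1)

# Step 2: update all remaining (non-basis) parameters for 10 epochs at lr 5e-4.
trainable = [p for p in model.parameters() if p.requires_grad]
opt2 = torch.optim.SGD(trainable, lr=5e-4)
```

Filtering on `requires_grad` in step two automatically excludes the frozen basis filters while picking up the 1×1 weights, batch-norm parameters, and (in a full network) the fully connected layers.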
4.4 Network Compression vs Dataset
Intuitively, it is clear that the complexity of the information learned by a network during training must depend on the complexity of the dataset it was trained on. This means that the same network architecture trained on different datasets will have different compressibility. To verify this, we trained VGG16 on the three image classification datasets mentioned in section 4.1. Figure 3 plots the test accuracy for these datasets against the percentage of filters retained in the compressed network. These trends are consistent with our intuition: the SVHN dataset is the simplest, and the resulting trained network is highly compressible. On the other hand, CIFAR100 is the most complex of the three datasets, which is reflected in the faster drop in performance with increasing compression. Not surprisingly, as the complexity of the problem increases, more knowledge is stored in each convolution layer, and a larger number of eigen filters is required to capture the most relevant information.
4.5 Training from scratch
To illustrate the process of learning with random basis sets, we describe an example provided in Matlab 2018b that trains a simple CNN for image classification on the CIFAR10 dataset. The original network configuration (shown in Table 3, left) has three convolution layers (two of size 5x5x32 and one of size 5x5x64). This network is trained for 40 epochs and achieves a classification test accuracy of 74%. The number of learnable parameters in the three convolution layers is 79,328. On the right, each conventional layer is replaced by the BasisConv structure, which reduces the number of learnable parameters to 6,400 (i.e. a reduction by a factor of 12). In this configuration, the 3D filters in the layers marked "conv_1", "conv_3", and "conv_5" are initialized as orthonormal random functions and then held frozen during the learning process, while the 1D filters marked "conv_2", "conv_4", and "conv_6" are allowed to update. After 40 epochs, the configuration on the right achieves a test accuracy of 71%. Setting aside the fully connected layers (which are common to both configurations), this experiment illustrates how BasisConv reduces the number of learnable parameters by an order of magnitude without significant loss in performance. Additionally, we also trained Alexnet and VGG16 from scratch with random bases in pytorch. In these experiments we normalized the intermediate tensor with BatchNormalization before convolving with the 1D filters. Recall that the conventional versions of these networks achieve 43.9% and 68.7% accuracy on the CIFAR100 dataset, respectively. We achieved 42.5% and 66.3% using BasisConv for Alexnet and VGG16 respectively, within about 2% of the accuracy of the original networks, while reducing the number of learnable parameters by factors of 7.2 and 7.9 respectively.
5 Conclusion
In summary, we have presented a general method for network compression and efficient implementation which can be easily incorporated into existing CNN architectures. For pre-trained networks, each convolution layer is replaced by two successive convolutions: first with eigen basis filters (that capture the underlying knowledge space of the layer), followed by 1D kernels (that can be fine-tuned) to generate the activations. We used four network architectures and three datasets to show that our method consistently reduces i) the number of learnable parameters by an order of magnitude, and ii) multiplications and filter storage by as much as a factor of 5, with less than 3% degradation in performance. Finally, using random basis functions and significantly fewer learnable parameters, BasisNet achieves performance comparable to conventional CNNs when learning from scratch.
-  Yu Cheng, Duo Wang, Pan Zhou, and Tao Zhang. A survey of model compression and acceleration for deep neural networks. CoRR, abs/1710.09282, 2017.
-  Yann Le Cun, John S. Denker, and Sara A. Solla. Optimal brain damage. In Advances in Neural Information Processing Systems, pages 598–605. Morgan Kaufmann, 1990.
-  Babak Hassibi and David G. Stork. Second order derivatives for network pruning: Optimal brain surgeon. In Advances in Neural Information Processing Systems 5, pages 164–171. Morgan Kaufmann, 1993.
-  Song Han, Huizi Mao, and William J. Dally. Deep compression: Compressing deep neural network with pruning, trained quantization and huffman coding. CoRR, abs/1510.00149, 2016.
-  Song Han, Jeff Pool, John Tran, and William J. Dally. Learning both weights and connections for efficient neural networks. In NIPS, 2015.
-  Suraj Srinivas and R. Venkatesh Babu. Data-free parameter pruning for deep neural networks. In BMVC, 2015.
-  Wenlin Chen, James T. Wilson, Stephen Tyree, Kilian Q. Weinberger, and Yixin Chen. Compressing neural networks with the hashing trick. In ICML, 2015.
-  Max Jaderberg, Andrea Vedaldi, and Andrew Zisserman. Speeding up convolutional neural networks with low rank expansions. CoRR, abs/1405.3866, 2014.
-  Emily L. Denton, Wojciech Zaremba, Joan Bruna, Yann LeCun, and Rob Fergus. Exploiting linear structure within convolutional networks for efficient evaluation. In NIPS, 2014.
-  Qiang Qiu, Xiuyuan Cheng, A. Robert Calderbank, and Guillermo Sapiro. Dcfnet: Deep neural network with decomposed convolutional filters. In ICML, 2018.
-  Yunchao Gong, Liu Liu, Ming Yang, and Lubomir D. Bourdev. Compressing deep convolutional networks using vector quantization. CoRR, abs/1412.6115, 2014.
-  Jiaxiang Wu, Cong Leng, Yuhang Wang, Qinghao Hu, and Jian Cheng. Quantized convolutional neural networks for mobile devices. 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 4820–4828, 2016.
-  Vincent Vanhoucke, Andrew W. Senior, and Mark Z. Mao. Improving the speed of neural networks on cpus. 2011.
-  Matthieu Courbariaux, Yoshua Bengio, and Jean-Pierre David. Binaryconnect: Training deep neural networks with binary weights during propagations. In NIPS, 2015.
-  M. Courbariaux and Y. Bengio. Binarynet: Training deep neural networks with weights and activations constrained to +1 or -1. CoRR, abs/1602.02830, 2016.
-  Mohammad Rastegari, Vicente Ordonez, Joseph Redmon, and Ali Farhadi. Xnor-net: Imagenet classification using binary convolutional neural networks. In ECCV, 2016.
-  Cristian Bucila, Rich Caruana, and Alexandru Niculescu-Mizil. Model compression. In KDD, 2006.
-  Geoffrey E. Hinton, Oriol Vinyals, and Jeffrey Dean. Distilling the knowledge in a neural network. CoRR, abs/1503.02531, 2015.
-  Jimmy Ba and Rich Caruana. Do deep nets really need to be deep? In NIPS, 2014.
-  Vasileios Belagiannis, Azade Farshad, and Fabio Galasso. Adversarial network compression. In ECCV Workshops, 2018.
-  Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. Imagenet classification with deep convolutional neural networks. Commun. ACM, 60:84–90, 2012.
-  Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. CoRR, abs/1409.1556, 2015.
-  Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 770–778, 2016.
-  Gao Huang, Zhuang Liu, and Kilian Q. Weinberger. Densely connected convolutional networks. 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 2261–2269, 2017.
-  Wei Yang. Classification with pytorch. https://github.com/bearpaw/pytorch-classification, 2017.