Compression of Fully-Connected Layer in Neural Network by Kronecker Product
In this paper we propose and study a technique to reduce the number of parameters and the computation time of fully-connected layers in neural networks using the Kronecker product, at a mild cost in prediction quality. The technique replaces fully-connected (FC) layers with so-called Kronecker Fully-Connected (KFC) layers, in which the weight matrices of the FC layers are approximated by linear combinations of multiple Kronecker products of smaller matrices. In particular, given a model trained on the SVHN dataset, we are able to construct a new KFC model with a 73% reduction in the total number of parameters, while the error rises only mildly. In contrast, a low-rank method achieves only a 35% reduction in the total number of parameters for a similar allowance of quality degradation. If we compare only the KFC layer with its counterpart fully-connected layer, the reduction in the number of parameters exceeds 99%. The amount of computation is also reduced, as the product with a large matrix in an FC layer is replaced by products with a few smaller matrices in a KFC layer. Further experiments on MNIST, SVHN and some Chinese character recognition models also demonstrate the effectiveness of our technique.
Model approximation aims at reducing the number of parameters and amount of computation of neural network models, while keeping the quality of prediction results mostly the same.
In general, given a neural network $f$, we want to construct another neural network $g$ within some pre-specified resource constraint, and minimize the differences between the outputs of the two functions on the possible inputs. An example setup is to directly minimize the differences between the outputs of the two functions:
$$\min_{g} \sum_{x} d\big(f(x), g(x)\big) \tag{1}$$
where $d$ is some distance function and $x$ runs over all input data.
Equation 1 does not impose any constraint relating the structures of the two networks, meaning that any model can be used to approximate another model. In practice, a structurally similar model is often used. In this case, model approximation may be approached in a modular fashion, layer by layer.
2.1 Low Rank Model Approximation
Low rank approximation in linear regression dates back to Anderson. Sainath et al. and Xue et al. use low rank approximation of fully-connected layers, while Jaderberg et al. and Denton et al. consider low rank approximation of convolution layers. Zhang, Zou et al. consider approximation of multiple layers with nonlinear activations.
We first outline the low rank approximation method below. The fully-connected layer widely used in neural network construction may be formulated as:
$$L_a = h(L_{a-1} M_a + b_a),$$
where $L_i$ is the output of the $i$-th layer of the neural network, $h$ is the activation function, $M_i$ is often referred to as the “weight term” and $b_i$ as the “bias term” of the $i$-th layer.
As the coefficients of the weight term in the fully-connected layers are organized into matrices, it is possible to perform low-rank approximation of these matrices to achieve an approximation of the layer, and consequently of the whole model. Given the Singular Value Decomposition of a matrix $M = U D V^{*}$, where $U, V$ are unitary matrices and $D$ is a diagonal matrix with the diagonal made up of the singular values of $M$, a rank-$k$ approximation of $M$ is:
$$M \approx \tilde{M}_k = \tilde{U} \tilde{D} \tilde{V}^{*},$$
where $\tilde{U}$ and $\tilde{V}$ are the first $k$ columns of $U$ and $V$ respectively, and $\tilde{D}$ is a diagonal matrix made up of the largest $k$ entries of $D$.
In this case approximation by SVD is optimal in the sense that the following holds (Horn and Johnson):
$$\|M - \tilde{M}_k\|_F = \min_{\operatorname{rank}(M') \le k} \|M - M'\|_F .$$
The approximate fully connected layer induced by SVD is:
$$L_a = h(L_{a-1} \tilde{U} \tilde{D} \tilde{V}^{*} + b_a).$$
In the modular representation of the neural network, this means that the original fully connected layer is now replaced by two consecutive fully-connected layers.
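The truncated-SVD construction above can be illustrated with a small NumPy sketch. It is only an illustration under assumed sizes, not the implementation used in the experiments; the function name `truncated_svd` and the dimensions `m`, `n`, `k` are hypothetical. It checks that the Frobenius error equals the norm of the discarded singular values, and that one layer with an `m`-by-`n` weight matrix can be evaluated as two layers with `m`-by-`k` and `k`-by-`n` weight matrices.

```python
import numpy as np


def truncated_svd(M, k):
    """Return factors (A, B) such that A @ B is the rank-k truncated SVD of M."""
    U, s, Vh = np.linalg.svd(M, full_matrices=False)
    A = U[:, :k] * s[:k]   # (m, k): first k left singular vectors, scaled
    B = Vh[:k, :]          # (k, n): first k right singular vectors
    return A, B


rng = np.random.default_rng(0)
m, n, k = 64, 32, 8        # assumed sizes, for illustration only
M = rng.standard_normal((m, n))
A, B = truncated_svd(M, k)

# The Frobenius error equals the norm of the discarded singular values.
s = np.linalg.svd(M, compute_uv=False)
err = np.linalg.norm(M - A @ B)
tail = np.sqrt(np.sum(s[k:] ** 2))

# One m-by-n layer becomes two layers of shapes m-by-k and k-by-n.
x = rng.standard_normal((5, m))      # a mini-batch of 5 inputs
y_two_layers = (x @ A) @ B
y_one_layer = x @ (A @ B)

params_full = m * n                  # 2048
params_low_rank = k * (m + n)        # 768
```

Evaluating `(x @ A) @ B` rather than `x @ (A @ B)` is what yields the savings: the full `m`-by-`n` matrix is never formed.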
However, the above post-processing approach only ensures an optimal approximation of the weight matrix $M$ under the rank constraint; there is still no guarantee that such an approximation is optimal w.r.t. the input data. That is, for some given input $X$, the optimum of the following may well be different from the rank-$k$ approximation obtained by SVD:
$$\min_{\operatorname{rank}(M') \le k} \|X M - X M'\|_F .$$
Hence it is often necessary for the resulting low-rank model to be trained for a few more epochs on the input, which is also known as the “fine-tuning” process.
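Alternatively, we note that the rank constraint can be enforced by the following structural requirement for a matrix $M' \in \mathbb{R}^{m \times n}$:
$$\operatorname{rank}(M') \le k \iff M' = A B \ \text{ for some } A \in \mathbb{R}^{m \times k},\ B \in \mathbb{R}^{k \times n}.$$
In light of this, if we want to impose a rank constraint on a fully-connected layer $L_a$ in a neural network where $M_a \in \mathbb{R}^{m \times n}$, we can replace that layer with two consecutive layers $L_{a_1}$ and $L_{a_2}$, where $M_{a_1} \in \mathbb{R}^{m \times k}$ and $M_{a_2} \in \mathbb{R}^{k \times n}$, so that the product $M_{a_1} M_{a_2}$ takes the place of $M_a$, and then train the structurally constrained neural network on the training data.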
As a third method, a regularization term inducing low rank matrices may be imposed on the weight matrices. In this case, the training of an $N$-layer model is modified to be:
$$\min_{\{M_a\}, \{b_a\}} \ \mathcal{L}\big(\{M_a\}, \{b_a\}\big) + \sum_{a=1}^{N} r(M_a),$$
where $\mathcal{L}$ is the original training loss and $r$ is the regularization term. For the weight term of the FC layers, conceptually we may use the matrix rank function as the regularization term. However, as the rank function is only well-defined for infinite-precision numbers, the nuclear norm may be used as its convex proxy (Recht et al.).
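As a small illustration of the nuclear norm as a proxy for rank, the sketch below computes it as the sum of singular values and adds it to a loss value. The helper names `nuclear_norm` and `regularized_loss`, the penalty weight `lam`, and the example matrices are assumptions made for illustration; they do not come from the paper.

```python
import numpy as np


def nuclear_norm(M):
    """Sum of the singular values of M."""
    return np.linalg.svd(M, compute_uv=False).sum()


def regularized_loss(data_loss, weights, lam):
    """Data loss plus a nuclear-norm penalty on every weight matrix.

    `lam` is a hypothetical penalty weight chosen by the user.
    """
    return data_loss + lam * sum(nuclear_norm(M) for M in weights)


# Two matrices with the same Frobenius norm (1.0) but different ranks.
u = np.array([1.0, 0.0, 0.0, 0.0])
rank_one = np.outer(u, u)            # rank 1
full_rank = 0.5 * np.eye(4)          # rank 4

nn_low = nuclear_norm(rank_one)      # 1.0
nn_full = nuclear_norm(full_rank)    # 2.0
total = regularized_loss(0.25, [rank_one, full_rank], lam=0.1)  # 0.25 + 0.1 * 3.0
```

At equal Frobenius norm the rank-1 matrix has the smaller nuclear norm, which is why penalizing the nuclear norm tends to favor low-rank weight matrices.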
3 Model Approximation by Kronecker Product
Next we propose to use Kronecker product of matrices of particular shapes for model approximation in Section 3.1. We also outline the relationship between the Kronecker product approximation and low-rank approximation in Section 3.2.
Below we measure the reduction in the amount of computation by the number of floating point operations. In particular, we will assume the computation complexity of multiplying two matrices of dimensions $m \times k$ and $k \times n$ to be $O(mkn)$, as many neural network implementations (Bastien et al.; Bergstra et al.; Collobert et al.; Jia et al.) have not used algorithms of lower computation complexity for the typical inputs of neural networks. Our analysis is mostly immune to the “hidden constant” problem in computation complexity analysis, as the underlying computations of the transformed model may also be carried out by matrix products.
3.1 Weight Matrix Approximation by Kronecker Product
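We next discuss how to use the Kronecker product to approximate weight matrices of FC layers, leading to the construction of a new kind of layer which we call the Kronecker Fully-Connected layer. The idea originates from the observation that for a matrix $M \in \mathbb{R}^{m \times n}$ where the dimensions $m$ and $n$ are not prime, we may look for an approximation of the form
$$M \approx A \otimes B,$$
where $m = m_1 m_2$, $n = n_1 n_2$, $A \in \mathbb{R}^{m_1 \times n_1}$, $B \in \mathbb{R}^{m_2 \times n_2}$.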
Any factors of $m$ and $n$ may be selected as $m_1$ and $n_1$ in the above formulation. However, in a Convolutional Neural Network, the input to an FC layer may be a tensor of order 4, which has some natural shape constraints that we will try to leverage in the discussion of 4D tensor input below. Otherwise, when the input is a matrix, we do not have natural choices of $m_1$ and $n_1$. We will explore heuristics to pick $m_1$ and $n_1$ in the discussion of matrix input below.
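To make the shapes and the parameter saving concrete, here is a minimal NumPy sketch using `np.kron`. The factor sizes and the names `A` and `B` for the two small matrices are arbitrary assumptions chosen for illustration.

```python
import numpy as np

# Assumed factor sizes: m = m1 * m2 = 32 inputs, n = n1 * n2 = 15 outputs.
m1, m2, n1, n2 = 4, 8, 3, 5
rng = np.random.default_rng(0)
A = rng.standard_normal((m1, n1))
B = rng.standard_normal((m2, n2))

M = np.kron(A, B)                 # shape (m1 * m2, n1 * n2) = (32, 15)

# Block structure: block (i, j) of M equals A[i, j] * B.
i, j = 2, 1
block = M[i * m2:(i + 1) * m2, j * n2:(j + 1) * n2]

params_full = M.size              # 32 * 15 = 480
params_kron = A.size + B.size     # 12 + 40 = 52
```

The large matrix is fully determined by the two small ones, so only `m1 * n1 + m2 * n2` numbers need to be stored instead of `m * n`.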
Kronecker product approximation for fully-connected layer with 4D tensor input
In a convolutional layer processing images, the input data may be a tensor of order 4 as $T_{nchw}$, where $n$ runs over different instances of data, $c$ runs over channels of the given images, $h$ runs over rows of the images, and $w$ runs over columns of the images. $T_{nchw}$ is often reshaped into a matrix $D_{nj}$ before being fed into a fully connected layer, where $n$ runs over the different instances of data and $j$ runs over the combined dimension of channel, height, and width of images. The weights of the fully-connected layer would then be a matrix $M_{jk}$, where $j$ runs over the same combined dimension and $k$ runs over the output channels. I.e., the layer may be written as:
$$L_{nk} = h\Big(\sum_{j} D_{nj} M_{jk} + b_k\Big).$$
Though the reshaping transformation from the tensor of order 4 to the matrix does not incur any loss in pixel values of data, we note that the dimension information of the tensor is lost in the matrix representation. As a consequence, the weight matrix has $C H W K$ parameters, where $C$, $H$ and $W$ are the number of channels, the height and the width of the input, and $K$ is the number of outputs.
Due to the shape of the input tensor, we may propose a few kinds of structural constraint on the weight matrix by requiring it to be a Kronecker product of matrices of particular shapes.
Formulation I. In this formulation, we require , where , and . The number of parameters is reduced to . The underlying assumption for this model is that the transformation is invariant across rows and columns of the images.
Formulation II. In this formulation, we require , where , and . The number of parameters is reduced to . The underlying assumption for this model is that the channel transformation should be decoupled from the spatial transformation.
Formulation III. In this formulation, we require , where , and . The number of parameters is reduced to . The underlying assumption for this model is that the transformation w.r.t. columns may be decoupled.
Formulation IV. In this formulation, we require , where , and . The number of parameters is reduced to . The underlying assumption for this model is that the transformation w.r.t. rows may be decoupled.
Note that the above four formulations may be linearly combined to produce more possible kinds of formulations. It would be a design choice with respect to trade off between the number of parameters, amount of computation and the particular formulation to select.
Kronecker product approximation for matrix input
For fully-connected layers whose inputs are matrices, there are no natural dimensions to adopt for the shape of the smaller weight matrices in KFC. Through experiments, we find it possible to arbitrarily pick a decomposition of the input matrix dimensions to enforce the Kronecker product structural constraint. We will refer to this formulation as KFCM.
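Concretely, when the input to a fully-connected layer is $X \in \mathbb{R}^{b \times m}$ and the weight matrix of the layer is $M \in \mathbb{R}^{m \times n}$, we can construct an approximation of $M$ as:
$$M \approx A \otimes B,$$
where $m = m_1 m_2$, $n = n_1 n_2$, $A \in \mathbb{R}^{m_1 \times n_1}$ and $B \in \mathbb{R}^{m_2 \times n_2}$.
The computation complexity will be reduced from $O(b m n)$ to $O(b\, m_1 n_2 (m_2 + n_1))$ when the two factors are applied one after the other, while the number of parameters will be reduced from $m n$ to $m_1 n_1 + m_2 n_2$.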
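The reduction in computation comes from never forming the large matrix. The sketch below shows one way to do this in NumPy, by reshaping the input and applying the two small factors in turn; the function name `kron_matmul`, the factor names `A` and `B`, and all sizes are assumptions for illustration, and the operation counts are rough multiply-add counts rather than measured running times.

```python
import numpy as np


def kron_matmul(X, A, B):
    """Compute X @ np.kron(A, B) without forming the Kronecker product."""
    b = X.shape[0]
    m1, n1 = A.shape
    m2, n2 = B.shape
    X3 = X.reshape(b, m1, m2)      # split the m = m1 * m2 input features
    T = X3 @ B                     # (b, m1, n2): about b * m1 * m2 * n2 multiply-adds
    Y3 = A.T @ T                   # (b, n1, n2): about b * n1 * m1 * n2 multiply-adds
    return Y3.reshape(b, n1 * n2)


rng = np.random.default_rng(0)
b, m1, m2, n1, n2 = 16, 4, 8, 3, 5     # assumed sizes
X = rng.standard_normal((b, m1 * m2))
A = rng.standard_normal((m1, n1))
B = rng.standard_normal((m2, n2))

Y_fast = kron_matmul(X, A, B)
Y_dense = X @ np.kron(A, B)            # reference: builds the full (32, 15) matrix

flops_dense = b * (m1 * m2) * (n1 * n2)    # 7680
flops_fast = b * m1 * n2 * (m2 + n1)       # 3520
```

Applying the factors in the other order gives a cost of about `b * m2 * n1 * (m1 + n2)` instead, so the cheaper order depends on the chosen shapes.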
Through experiments, we have found it sensible to pick and .
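As the choice of $m_1$ and $n_1$ above is arbitrary, we may use a linear combination of Kronecker products of matrices of different shapes for approximation:
$$M \approx \sum_{j=1}^{J} A_j \otimes B_j, \tag{2}$$
where $A_j \in \mathbb{R}^{m_1^{(j)} \times n_1^{(j)}}$ and $B_j \in \mathbb{R}^{m_2^{(j)} \times n_2^{(j)}}$, with $m_1^{(j)} m_2^{(j)} = m$ and $n_1^{(j)} n_2^{(j)} = n$ for every $j$.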
3.2 Relationship between Kronecker Product Constraint and Low Rank Constraint
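It turns out that factorization by Kronecker product is closely related to the low rank approximation method. In fact, approximating a matrix $M$ with the Kronecker product of two matrices may be cast as a Nearest Kronecker Product problem:
$$\min_{A, B} \|M - A \otimes B\|_F .$$
An equivalent formulation of the above problem is given by Van Loan and Pitsianis as:
$$\min_{A, B} \|M - A \otimes B\|_F = \min_{A, B} \|\mathcal{R}(M) - \operatorname{vec}(A) \operatorname{vec}(B)^{\top}\|_F, \tag{3}$$
where $\mathcal{R}(M)$ is a matrix formed by a fixed reordering of the entries of $M$.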
Note that the right-hand side of Equation 3 is a rank-1 approximation problem for the reordered matrix, hence it has a closed form solution given by SVD. However, the above approximation is only optimal w.r.t. the parameters of the weight matrices, but not w.r.t. the prediction quality over input data.
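The closed form solution can be sketched in NumPy as follows. This is an illustration of the rearrangement idea rather than the authors' code: the helper names `rearrange` and `nearest_kron` are hypothetical, and the reordering here flattens blocks in row-major order, which is one valid choice of the fixed reordering (a column-major convention works equally well as long as it is used consistently).

```python
import numpy as np


def rearrange(M, m1, n1, m2, n2):
    """Reorder M (m1*m2 by n1*n2) into a matrix of shape (m1*n1, m2*n2).

    Row (i, j) of the result holds the flattened m2-by-n2 block (i, j) of M.
    """
    return M.reshape(m1, m2, n1, n2).transpose(0, 2, 1, 3).reshape(m1 * n1, m2 * n2)


def nearest_kron(M, m1, n1, m2, n2):
    """Return A (m1 by n1) and B (m2 by n2) minimizing ||M - kron(A, B)||_F."""
    U, s, Vh = np.linalg.svd(rearrange(M, m1, n1, m2, n2), full_matrices=False)
    A = np.sqrt(s[0]) * U[:, 0].reshape(m1, n1)
    B = np.sqrt(s[0]) * Vh[0].reshape(m2, n2)
    return A, B


rng = np.random.default_rng(0)
m1, n1, m2, n2 = 3, 4, 5, 2            # assumed sizes

# An exact Kronecker product is recovered (up to a scalar shared by A and B).
A0 = rng.standard_normal((m1, n1))
B0 = rng.standard_normal((m2, n2))
A, B = nearest_kron(np.kron(A0, B0), m1, n1, m2, n2)
exact_err = np.linalg.norm(np.kron(A0, B0) - np.kron(A, B))

# For a generic matrix the error equals the norm of the discarded singular values.
M = rng.standard_normal((m1 * m2, n1 * n2))
A, B = nearest_kron(M, m1, n1, m2, n2)
err = np.linalg.norm(M - np.kron(A, B))
s = np.linalg.svd(rearrange(M, m1, n1, m2, n2), compute_uv=False)
tail = np.sqrt(np.sum(s[1:] ** 2))
```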
Similarly, though there are iterative algorithms for rank-1 approximation of tensors (Friedland et al.; De Lathauwer et al.), the optimality of the approximation is lost once the input data distribution is taken into consideration.
Hence in practice, we only use the Kronecker Product constraint to construct KFC layers and optimize the values of the weights through the training process on the input data.
3.3 Extension to Sum of Kronecker Product
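Just as low-rank approximation may be extended beyond rank 1 to an arbitrary rank, one could extend the Kronecker Product approximation to a Sum of Kronecker Product approximation. Concretely, one notes the following decomposition of $M$:
$$M = \sum_{i=1}^{r} A_i \otimes B_i,$$
where $r$ is the rank of the reordered matrix $\mathcal{R}(M)$ of Section 3.2. Hence it is possible to find rank-$k$ approximations:
$$M \approx \sum_{i=1}^{k} A_i \otimes B_i .$$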
We can then generalize Formulation I-IV in Section 3.1 to the case of sum of Kronecker Product.
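The sum of Kronecker products can be obtained from the same reordering used for the single-product case, by keeping several singular triplets of the reordered matrix instead of one. The sketch below is an illustration under assumed sizes; the function name `kron_sum_approx` is hypothetical, and, as noted in Section 3.2, such a post-hoc factorization is optimal only w.r.t. the weights, not the input data.

```python
import numpy as np


def kron_sum_approx(M, m1, n1, m2, n2, k):
    """Return k pairs (A_i, B_i) whose Kronecker products sum to approximate M."""
    # Reorder M so that each Kronecker term becomes a rank-1 term.
    R = M.reshape(m1, m2, n1, n2).transpose(0, 2, 1, 3).reshape(m1 * n1, m2 * n2)
    U, s, Vh = np.linalg.svd(R, full_matrices=False)
    return [(np.sqrt(s[i]) * U[:, i].reshape(m1, n1),
             np.sqrt(s[i]) * Vh[i].reshape(m2, n2)) for i in range(k)]


rng = np.random.default_rng(0)
m1, n1, m2, n2 = 3, 4, 5, 2                 # assumed sizes
M = rng.standard_normal((m1 * m2, n1 * n2))
max_terms = min(m1 * n1, m2 * n2)           # number of singular values, here 10

errors = []
for k in range(1, max_terms + 1):
    terms = kron_sum_approx(M, m1, n1, m2, n2, k)
    approx = sum(np.kron(A, B) for A, B in terms)
    errors.append(np.linalg.norm(M - approx))
```

The error shrinks as more terms are kept and vanishes once all `max_terms` terms are used, mirroring how a low-rank approximation improves with rank.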
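We may further combine the multiple shape formulation of Equation 2 to get the general form of the KFC layer:
$$L_a = h\Big(L_{a-1} \sum_{j=1}^{J} \sum_{i=1}^{k} A_{ij} \otimes B_{ij} + b_a\Big),$$
where $A_{ij} \in \mathbb{R}^{m_1^{(j)} \times n_1^{(j)}}$ and $B_{ij} \in \mathbb{R}^{m_2^{(j)} \times n_2^{(j)}}$.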
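A forward pass of such a general KFC layer can be sketched as below. This is a minimal NumPy illustration, not the Theano implementation used in the experiments: the names `apply_kron` and `kfc_forward`, the choice of ReLU as activation, and the shapes are all assumptions. Each term is applied through reshapes so that no large weight matrix is built, and the result is compared with a dense reference.

```python
import numpy as np


def apply_kron(X, A, B):
    """Compute X @ np.kron(A, B) without materializing the Kronecker product."""
    b = X.shape[0]
    (m1, n1), (m2, n2) = A.shape, B.shape
    T = X.reshape(b, m1, m2) @ B             # (b, m1, n2)
    return (A.T @ T).reshape(b, n1 * n2)     # (b, n1 * n2)


def kfc_forward(X, terms, bias):
    """KFC layer output for a list of (A, B) factor pairs.

    ReLU is an assumption made for this sketch.
    """
    Z = sum(apply_kron(X, A, B) for A, B in terms) + bias
    return np.maximum(Z, 0.0)


rng = np.random.default_rng(0)
b, m, n = 4, 24, 12
# Each entry is (m1, n1, m2, n2) with m1 * m2 == m and n1 * n2 == n.
# The first shape is used twice (a two-term sum), the second shape once.
shapes = [(4, 3, 6, 4), (4, 3, 6, 4), (6, 2, 4, 6)]
terms = [(rng.standard_normal((m1, n1)), rng.standard_normal((m2, n2)))
         for m1, n1, m2, n2 in shapes]
bias = rng.standard_normal(n)
X = rng.standard_normal((b, m))

Y = kfc_forward(X, terms, bias)

# Dense reference, built only to check the result.
W = sum(np.kron(A, B) for A, B in terms)
Y_dense = np.maximum(X @ W + bias, 0.0)

n_params = sum(A.size + B.size for A, B in terms) + bias.size   # 120
n_params_dense = m * n + n                                       # 300
```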
4 Empirical Evaluation of Kronecker product method
We next empirically study the properties and efficacy of the Kronecker product method and compare it with some other common low rank model approximation methods.
To make a fair comparison, for each dataset, we train a convolutional neural network with a fully-connected layer as a baseline. Then we replace the fully-connected layer with different layers according to the different methods and train the new network until the quality metrics stabilize. We then compare the KFC method with the low-rank method and the baseline model in terms of number of parameters and prediction quality. The experiments are based on an implementation of KFC layers in the Theano framework (Bergstra et al.; Bastien et al.).
As the running time may depend on particular implementation details of the KFC layers and the Theano framework, we do not report running time below. However, there is no noticeable slowdown in our experiments, and the complexity analysis suggests that there should be a significant reduction in the amount of computation.
4.1 MNIST
The MNIST dataset (LeCun et al.) consists of $28 \times 28$ grey scale images of handwritten digits. There are 60000 training images and 10000 test images. We select the last 10000 training images as the validation set.
Our baseline model has 8 layers and the first 6 layers consist of four convolutional layers and two pooling layers. The 7th layer is the fully-connected layer and the 8th is the softmax output. The input of the fully-connected layer is of size , where is the number of channels and is the side length of image patches (the mini-batch size is omitted). The output of the fully-connected layer is of size 256, so the weight matrix is of size .
CNN training is done with Adam (Kingma and Ba) with weight decay of 0.0001. Dropout (Hinton et al.) of 0.5 is used on the fully-connected layer and KFC layer. is used as activation function. Initial learning rate is for Adam.
Test results are listed in the table below. The number of layer parameters means the number of parameters of the fully-connected layer or its counterpart layer(s). The number of model parameters is the number of parameters of the whole model. The test error is the test error of the model with the minimum validation error.
In the Cut-96 method, we use 96 output neurons instead of 256 in the fully-connected layer. In the LowRank-96 method, we replace the fully-connected layer with two fully-connected layers, where the first FC layer output size is 96 and the second FC layer output size is 256. In the KFC-II method, we replace the fully-connected layer with a KFC layer using formulation II with and . In the KFC-Combined method, we replace the fully-connected layer with a KFC layer that linearly combines formulations II, III and IV ( in formulation II, in formulation III and IV).
| Methods | # of Layer Params (% Reduction) | # of Model Params (% Reduction) | Test Error |
4.2 Street View House Numbers
The SVHN dataset (Netzer et al.) is a real-world digit recognition dataset consisting of photos of house numbers in Google Street View images. The dataset comes in two formats and we consider the second format: 32-by-32 colored images centered around a single character. There are 73257 digits for training, 26032 digits for testing, and 531131 less difficult samples which can be used as extra training data. To build a validation set, we randomly select 400 images per class from the training set and 200 images per class from the extra training set, as Sermanet et al. did.
Here we use a neural network similar to, but larger than, the one used for MNIST as the baseline. The input of the fully-connected layer is of size . The fully-connected layer has 256 output neurons. Other implementation details are not changed. Test results are listed in the table below. In the Cut- method, we use output neurons instead of 256 in the fully-connected layer. In the LowRank- method, we replace the fully-connected layer with two fully-connected layers, where the first FC layer output size is and the second FC layer output size is 256. In the KFC-II method, we replace the fully-connected layer with a KFC layer using formulation II with and . In the KFC-Combined method, we replace the fully-connected layer with a KFC layer that linearly combines formulations II, III and IV ( in formulation II, in formulation III and IV). In the KFC-Rank method, we use KFC formulation II with and extend it to rank with as described above.
| Methods | # of Layer Params (% Reduction) | # of Model Params (% Reduction) | Test Error |
4.3 Chinese Character Recognition
We also evaluate the application of KFC to a Chinese character recognition model. Our experiments are done on a private dataset for the moment and may be extended to other established Chinese character recognition datasets like HCL2000 (Zhang et al.) and CASIA-HWDB (Liu et al.).
For this task we also use a convolutional neural network. The distinguishing feature of the neural network is that, following the convolution and pooling layers, it has two FC layers, one with hidden size 1536 and the other with hidden size of more than 6000.
The two FC layers happen to be of different types. The 1st FC layer accepts a tensor as input and the 2nd FC layer accepts a matrix as input. We apply the KFC-I formulation to the 1st FC layer and KFCM to the 2nd FC layer.
| Methods | % Reduction of 1st FC Layer Params | % Reduction of 2nd FC Layer Params | % Reduction of Total Params | Test Error |
It can be seen that KFC can significantly reduce the number of parameters. In the case of “KFC and KFCM (rank=1)”, however, this also leads to serious degradation of prediction quality. By increasing the rank from 1 to 10, we are able to recover most of the lost prediction quality, and the rank-10 model is still very small compared to the baseline model.
5 Conclusion and Future Work
In this paper, we propose and study methods for approximating the weight matrices of fully-connected layers with sums of Kronecker products of smaller matrices, resulting in a new type of layer which we call the Kronecker Fully-Connected layer. We consider both the case when the input to the fully-connected layer is a tensor of order 4 and the case when the input is a matrix. We have found that using the KFC layer can significantly reduce the number of parameters and the amount of computation in experiments on MNIST, SVHN and Chinese character recognition.
As future work, we note that when the weight parameters of a convolutional layer are a tensor of order 4 as , they can be represented as a collection of matrices . We can then approximate each matrix by Kronecker products following the KFCM formulation, and apply the other techniques outlined in this paper. The Kronecker product technique may also be applied to other neural network architectures like Recurrent Neural Networks, for example by approximating transition matrices with linear combinations of Kronecker products.
- In some circumstances, as a smaller number of model parameters reduces the effect of overfitting, model approximation leads to more accurate predictions.
- In case any of $m$ and $n$ is prime, it is possible to add some extra dummy feature or output class to make the dimensions divisible.
- Estimating linear restrictions on regression coefficients for multivariate normal distributions.
Theodore Wilbur Anderson. The Annals of Mathematical Statistics
- Theano: new features and speed improvements.
Frédéric Bastien, Pascal Lamblin, Razvan Pascanu, James Bergstra, Ian J. Goodfellow, Arnaud Bergeron, Nicolas Bouchard, and Yoshua Bengio. Deep Learning and Unsupervised Feature Learning NIPS 2012 Workshop, 2012.
- Theano: a CPU and GPU math expression compiler.
James Bergstra, Olivier Breuleux, Frédéric Bastien, Pascal Lamblin, Razvan Pascanu, Guillaume Desjardins, Joseph Turian, David Warde-Farley, and Yoshua Bengio. In Proceedings of the Python for Scientific Computing Conference (SciPy), June 2010.
- Torch7: A matlab-like environment for machine learning.
Ronan Collobert, Koray Kavukcuoglu, and Clément Farabet. In BigLearn, NIPS Workshop, number EPFL-CONF-192376, 2011.
- Exploiting linear structure within convolutional networks for efficient evaluation.
Emily L Denton, Wojciech Zaremba, Joan Bruna, Yann LeCun, and Rob Fergus. In Advances in Neural Information Processing Systems, pages 1269–1277, 2014.
- On best rank one approximation of tensors.
Shmuel Friedland, Volker Mehrmann, Renato Pajarola, and SK Suter. Numerical Linear Algebra with Applications
- Maxout networks.
Ian J Goodfellow, David Warde-Farley, Mehdi Mirza, Aaron Courville, and Yoshua Bengio. arXiv preprint arXiv:1302.4389
- Improving neural networks by preventing co-adaptation of feature detectors.
Geoffrey E Hinton, Nitish Srivastava, Alex Krizhevsky, Ilya Sutskever, and Ruslan R Salakhutdinov. arXiv preprint arXiv:1207.0580
- Topics in matrix analysis.
Roger A Horn and Charles R Johnson.
- Speeding up convolutional neural networks with low rank expansions.
Max Jaderberg, Andrea Vedaldi, and Andrew Zisserman. arXiv preprint arXiv:1405.3866
- Caffe: Convolutional architecture for fast feature embedding.
Yangqing Jia, Evan Shelhamer, Jeff Donahue, Sergey Karayev, Jonathan Long, Ross Girshick, Sergio Guadarrama, and Trevor Darrell. In Proceedings of the ACM International Conference on Multimedia, pages 675–678. ACM, 2014.
- Adam: A method for stochastic optimization.
Diederik Kingma and Jimmy Ba. arXiv preprint arXiv:1412.6980
- On the best rank-1 and rank-(r1,r2,. . .,rn) approximation of higher-order tensors.
Lieven De Lathauwer, Bart De Moor, and Joos Vandewalle. SIAM J. Matrix Anal. Appl.
- Speeding-up convolutional neural networks using fine-tuned cp-decomposition.
Vadim Lebedev, Yaroslav Ganin, Maksim Rakhuba, Ivan V. Oseledets, and Victor S. Lempitsky. CoRR
- Gradient-based learning applied to document recognition.
Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Proceedings of the IEEE
- Large scale deep neural network acoustic modeling with semi-supervised training data for youtube video transcription.
Hank Liao, Erik McDermott, and Andrew Senior. In Automatic Speech Recognition and Understanding (ASRU), 2013 IEEE Workshop on, pages 368–373. IEEE, 2013.
- Online and offline handwritten chinese character recognition: benchmarking on new databases.
Cheng-Lin Liu, Fei Yin, Da-Han Wang, and Qiu-Feng Wang. Pattern Recognition
- Reading digits in natural images with unsupervised feature learning.
Yuval Netzer, Tao Wang, Adam Coates, Alessandro Bissacco, Bo Wu, and Andrew Y Ng. In NIPS workshop on deep learning and unsupervised feature learning, volume 2011, page 5. Granada, Spain, 2011.
- Guaranteed minimum-rank solutions of linear matrix equations via nuclear norm minimization.
Benjamin Recht, Maryam Fazel, and Pablo A. Parrilo. SIAM Rev.
- Learning separable filters.
Roberto Rigamonti, Amos Sironi, Vincent Lepetit, and Pascal Fua. In Computer Vision and Pattern Recognition (CVPR), 2013 IEEE Conference on, pages 2754–2761. IEEE, 2013.
- Low-rank matrix factorization for deep neural network training with high-dimensional output targets.
Tara N Sainath, Brian Kingsbury, Vikas Sindhwani, Ebru Arisoy, and Bhuvana Ramabhadran. In Acoustics, Speech and Signal Processing (ICASSP), 2013 IEEE International Conference on, pages 6655–6659. IEEE, 2013.
- Convolutional neural networks applied to house numbers digit classification.
Pierre Sermanet, Sandhya Chintala, and Yann LeCun. In Pattern Recognition (ICPR), 2012 21st International Conference on, pages 3288–3291. IEEE, 2012.
- The ubiquitous kronecker product.
Charles F Van Loan. Journal of computational and applied mathematics
- Approximation with Kronecker products.
Charles F Van Loan and Nikos Pitsianis.
- Restructuring of deep neural network acoustic models with singular value decomposition.
Jian Xue, Jinyu Li, and Yifan Gong. In INTERSPEECH, pages 2365–2369, 2013.
- Hcl2000-a large-scale handwritten chinese character database for handwritten character recognition.
Honggang Zhang, Jun Guo, Guang Chen, and Chunguang Li. In Document Analysis and Recognition, 2009. ICDAR’09. 10th International Conference on, pages 286–290. IEEE, 2009.
- Efficient and accurate approximations of nonlinear convolutional networks.
Xiangyu Zhang, Jianhua Zou, Xiang Ming, Kaiming He, and Jian Sun. arXiv preprint arXiv:1411.4229
- Extracting deep neural network bottleneck features using low-rank matrix factorization.
Yu Zhang, Ekapol Chuangsuwanich, and James Glass. In Proc. ICASSP, 2014b.