# Rank-1 Convolutional Neural Network

###### Abstract

In this paper, we propose a convolutional neural network (CNN) with 3-D rank-1 filters, which are composed of the outer products of 1-D vectors. After training, the 3-D rank-1 filters can be decomposed into 1-D filters at test time for fast inference. The reason that we train 3-D rank-1 filters in the training stage, instead of training consecutive 1-D filters directly, is that a better gradient flow can be obtained with this setting, which makes training possible even in cases where a network with consecutive 1-D filters cannot be trained. The 3-D rank-1 filters are updated by both the gradient flow and the outer product of the 1-D vectors in every epoch, where the gradient flow tries to obtain a solution which minimizes the loss function, while the outer product operation constrains the parameters of the filter to lie on a rank-1 subspace. Furthermore, we show that the convolution with rank-1 filters results in low-rank outputs, constraining the final output of the CNN to live on a low-dimensional subspace as well.


## I Introduction

Nowadays, deep convolutional neural networks (CNNs) achieve top results in many difficult image classification tasks. However, the number of parameters in CNN models is high, which limits the use of deep models on devices with limited resources, such as smartphones and embedded systems.
Meanwhile, it is known that there exists a lot of redundancy among the parameters and the feature maps in deep models, i.e., that CNN models are over-parametrized.
The reason that over-parametrized CNN models are used instead of small CNN models is that over-parametrization makes the training of the network easier, as has been shown in the experiments in [1]. This phenomenon is believed to occur because the gradient flow in networks with many parameters achieves a better trained network than the gradient flow in small networks.
Therefore, a well-known traditional principle of designing good neural networks is to build a network with a large number of parameters and then use regularization techniques to avoid over-fitting, rather than building a network with a small number of parameters from the beginning.

However, it has been shown in [2] that even with the use of regularization methods, there still exists excessive capacity in trained networks, which means that the redundancy between the parameters is still large.
This again implies that the parameters or the feature maps can be expressed in a structured subspace with a smaller number of coefficients.
Finding the underlying structure that exists between the parameters in CNN models, and reducing the redundancy of parameters and feature maps, are the topics of the deep compression field.
As is well summarized in [3], research on the compression of deep models can be categorized into works which try to eliminate unnecessary weight parameters [4], works which try to compress the parameters by projecting them onto a low-rank subspace [5][6][7], and works which try to group similar parameters and represent them by representative features [8][9][10][11][12].
These works follow the common framework shown in Fig. 1(a), i.e.,
they first train the original uncompressed CNN model by back-propagation to obtain the uncompressed parameters, and then try to find a compressed expression for these parameters to construct a new compressed CNN model.

In comparison, research which tries to restrict the number of parameters in the first place, by proposing small networks, is also actively in progress (Fig. 1(b)). However, as mentioned above, the reduction in the number of parameters changes the gradient flow, so such networks have to be designed carefully to achieve a trained network with good performance.
For example, MobileNets [13] and Xception networks [14] use depthwise separable convolution filters, while the Squeezenet [15] uses a bottleneck approach to reduce the number of parameters.
Other models use 1-D filters to reduce the size of networks such as the highly factorized Flattened network [16], or the models in [17] where 1-D filters are used together with other filters of different sizes.
Recently, Google’s Inception model has also adopted 1-D filters in version 4.
One difficulty in using 1-D filters is that they are not easy to train; therefore, they are used only partially, as in Google's Inception model or the models in [17], except for the Flattened network, which consists of consecutive 1-D filters only.
However, even the Flattened network uses only three layers of 1-D filters in its experiments, due to the difficulty of training 1-D filters with many layers.

In this paper, we propose a rank-1 CNN, where the rank-1 3-D filters are constructed by the outer products of 1-D vectors.
At the outer-product-based composition step in each epoch of training, the number of parameters in the 3-D filters becomes the same as in the filters of standard CNNs, allowing a good gradient flow to pass through the network. This gradient flow also updates the parameters in the 1-D vectors, from which the 3-D filters are composed. At the next composition step, the weights in the 3-D filters are updated again, not by the gradient flow but by the outer product operation, to be projected onto the rank-1 subspace. By iterating this two-step update, all the 3-D filters in the network are trained to minimize the loss function while maintaining their rank-1 property.
This is different from approaches which try to approximate the trained filters by low rank approximation
after the training has finished, e.g., like the low rank approximation in [20]. The composition operation is included in the training phase in our network,
which directs the gradient flow in a different direction from that of standard CNNs, directing the solution to live on a rank-1 subspace.
In the testing phase, we no longer need the outer product operation, and can directly filter the input channels with the trained 1-D vectors, treating them now as 1-D filters. That is, we take consecutive 1-D convolutions with the trained 1-D vectors, since the result is the same as filtering with the 3-D filter constituted of the trained 1-D vectors. Therefore, the inference speed is exactly the same as that of the Flattened network. However, due to the better gradient flow, better parameters for the 1-D filters can be found with the proposed method, and more importantly, the network can be trained even in cases where the Flattened network cannot.
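The equivalence used at test time can be checked numerically. The following is a minimal NumPy sketch (our own toy example, not the authors' code), for the 2-D single-channel case: filtering with a rank-1 filter $\mathbf{v}\mathbf{h}^{\top}$ gives the same result as two consecutive 1-D convolutions.

```python
import numpy as np

def corr2d(x, w):
    """Valid-mode 2-D cross-correlation (a simple reference implementation)."""
    H, W = x.shape
    kh, kw = w.shape
    out = np.empty((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(x[i:i + kh, j:j + kw] * w)
    return out

rng = np.random.default_rng(0)
x = rng.standard_normal((8, 8))   # toy input
v = rng.standard_normal(3)        # vertical 1-D filter
h = rng.standard_normal(3)        # horizontal 1-D filter

# one pass with the composed rank-1 2-D filter v h^T
full = corr2d(x, np.outer(v, h))
# two consecutive 1-D passes: columns first, then rows
separable = corr2d(corr2d(x, v[:, None]), h[None, :])

assert np.allclose(full, separable)
```

The same argument extends to the 3-D case, where the lateral 1-D convolution over the channels is applied as a third pass.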

We will also show that the convolution with rank-1 filters results in rank-deficient outputs, where the rank of the output is upper-bounded by a smaller bound than in normal CNNs.
Therefore, the output feature vectors are constrained to live on a rank-deficient subspace in a high dimensional space. This coincides with the well-known belief that the feature vectors corresponding to images live on a low-dimensional manifold in a high dimensional space, and the fact that we get similar accuracy results with the rank-1 net can be another proof for this belief.

We also explain, in analogy to the bilateral-projection based 2-D principal component analysis (B2DPCA), what the 1-D vectors are trying to learn, and why the redundancy in the parameters becomes reduced with the rank-1 network.
The reduction of the redundancy between the parameters is expressed by the reduced number of effective parameters, i.e., the number of parameters in the 1-D vectors.
Therefore, the rank-1 net can be thought of as a compressed version of the standard CNN, and the reduced number of parameters as a smaller upper bound for the effective capacity of the standard CNN.
Compared with regularization methods such as stochastic gradient descent and drop-out, which do not reduce the excessive capacities of deep models as much as expected, the rank-1 projection reduces the capacity proportionally to the ratio of decrease in the number of parameters, and therefore may help to define a better upper bound for the effective capacity of deep networks.

## II Related Works

The following works are related to ours. The work on the B2DPCA gave us the insight for the rank-1 net. After we designed the rank-1 net, we found out that similar research, i.e., the work on the Flattened network, had been done in the past. We explain both works below.

### II-A Bilateral-projection based 2DPCA

In [18], a bilateral-projection based 2D principal component analysis (B2DPCA) has been proposed, which minimizes the following energy functional:

$$\min_{P,Q}\ \|X - PCQ^{\top}\|_F^2, \tag{1}$$

where $X$ is the two-dimensional image, $P$ and $Q$ are the left- and right-multiplying projection matrices, respectively, and $C = P^{\top}XQ$ is the extracted feature matrix for the image $X$. The optimal projection matrices $P$ and $Q$ are simultaneously constructed, where $P$ projects the column vectors of $X$ to a subspace, while $Q$ projects the row vectors of $X$ to another one. To see why $P$ is projecting the column vectors of $X$ to a subspace, consider a simple example where $P$ has two column vectors:

$$P = [\,\mathbf{p}_1\ \ \mathbf{p}_2\,]. \tag{2}$$

Then, left-multiplying $P^{\top}$ to the image $X = [\,\mathbf{x}_1\ \cdots\ \mathbf{x}_n\,]$ results in:

$$P^{\top}X = \begin{bmatrix}\mathbf{p}_1^{\top}\mathbf{x}_1 & \cdots & \mathbf{p}_1^{\top}\mathbf{x}_n\\ \mathbf{p}_2^{\top}\mathbf{x}_1 & \cdots & \mathbf{p}_2^{\top}\mathbf{x}_n\end{bmatrix}, \tag{3}$$

where it can be observed that all the components in $P^{\top}X$ are the projections of the column vectors of $X$ onto the column vectors of $P$. Meanwhile, the right-multiplication of the matrix $Q = [\,\mathbf{q}_1\ \ \mathbf{q}_2\,]$ to $X$ results in,

$$XQ = \begin{bmatrix}\tilde{\mathbf{x}}_1^{\top}\mathbf{q}_1 & \tilde{\mathbf{x}}_1^{\top}\mathbf{q}_2\\ \vdots & \vdots \\ \tilde{\mathbf{x}}_m^{\top}\mathbf{q}_1 & \tilde{\mathbf{x}}_m^{\top}\mathbf{q}_2\end{bmatrix}, \tag{4}$$

where $\tilde{\mathbf{x}}_i^{\top}$ denotes the $i$-th row of $X$, and the components of $XQ$ are the projections of the row vectors of $X$ onto the column vectors of $Q$. From the above observation, we can see that the components of the feature matrix $C = P^{\top}XQ$ are the result of simultaneously projecting the row vectors of $X$ onto the column vectors of $Q$, and the column vectors of $X$ onto the column vectors of $P$. It has been shown in [18] that the advantage of the bilateral projection over the unilateral-projection scheme is that the image can be represented effectively with a smaller number of coefficients than in the unilateral case, i.e., a small-sized matrix $C$ can well represent the image $X$. This means that the bilateral projection effectively removes the redundancies among both the rows and the columns of the image. Furthermore, since

$$c_{ij} = \mathbf{p}_i^{\top}X\mathbf{q}_j = \big\langle X,\ \mathbf{p}_i\mathbf{q}_j^{\top}\big\rangle, \tag{5}$$

it can be seen that the components of $C$ are the 2-D projections of the image $X$ onto the 2-D planes made up by the outer products of the column vectors of $P$ and $Q$. The 2-D planes have a rank of one, since they are the outer products of two 1-D vectors. Therefore, the fact that $X$ can be well represented by a small-sized $C$ also implies that $X$ can be well represented by a few rank-1 2-D planes, i.e., only a few 1-D vectors $\mathbf{p}_i$ and $\mathbf{q}_j$, where $\mathbf{p}_i \in \mathbb{R}^{m}$ and $\mathbf{q}_j \in \mathbb{R}^{n}$.
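As a concrete illustration of the bilateral projection in (5) (a toy example of our own, not code from [18]), the following sketch checks that each entry of the feature matrix $C = P^{\top}XQ$ equals the inner product of the image $X$ with a rank-1 plane $\mathbf{p}_i\mathbf{q}_j^{\top}$:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.standard_normal((5, 4))   # toy "image"
P = rng.standard_normal((5, 2))   # left projection matrix
Q = rng.standard_normal((4, 2))   # right projection matrix

# bilateral feature matrix C = P^T X Q
C = P.T @ X @ Q

# each entry c_ij is the 2-D projection of X onto the rank-1 plane p_i q_j^T
for i in range(2):
    for j in range(2):
        plane = np.outer(P[:, i], Q[:, j])   # rank-1 2-D plane
        assert np.isclose(C[i, j], np.sum(X * plane))
```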

In the case of (1), the learned 2-D planes try to minimize the loss function

$$\min_{P,Q}\ \big\|X - P\big(P^{\top}XQ\big)Q^{\top}\big\|_F^2, \tag{6}$$

i.e., they try to learn to best approximate $X$. A natural question arises: can good rank-1 2-D planes also be obtained to minimize other loss functions, e.g., loss functions related to the image classification problem, such as

$$\min_{P,Q}\ \ell\big(y^{*},\ f(X;P,Q)\big), \tag{7}$$

where $y^{*}$ denotes the true classification label for a certain input image $X$, and $f(X;P,Q)$ is the output of the network constituted by the outer products of the column vectors in the learned matrices $P$ and $Q$? In this paper, we will show that it is possible to learn such rank-1 2-D planes, i.e., 2-D filters, if they are used in a deep structure. Furthermore, we extend the rank-1 2-D filter case to the rank-1 3-D filter case, where the rank-1 3-D filter is constituted as the outer product of three column vectors from three different learned matrices.

### II-B Flattened Convolutional Neural Networks

In [16], the ‘Flattened CNN’ has been proposed for fast feed-forward execution by separating the conventional 3-D convolution filter into three consecutive 1-D filters. The 1-D filters sequentially convolve the input over different directions, i.e., the lateral, horizontal, and vertical directions. Figure 2 shows the network structure of the Flattened CNN. The Flattened CNN uses the same network structure in both the training and the testing phases. This is in comparison with our proposed model, where we use a different network structure in the training phase as will be seen later.

However, the consecutive use of 1-D filters in the training phase makes the training difficult. This is due to the fact that the gradient path becomes longer than in a normal CNN, and therefore, the gradient vanishes faster while the error accumulates more. Another reason is that the reduction in the number of parameters causes a gradient flow different from that of the standard CNN, which makes it more difficult to find an appropriate solution. This coincides with the experiments in [1], which show that the gradient flow in a network with a small number of parameters cannot find good parameters. Therefore, a particular weight initialization method has to be used in this setting. Furthermore, in [16], the networks in the experiments have only three layers of convolution, which may be due to the difficulty of training networks with more layers.

## III Proposed Method

In comparison with other CNN models using 1-D rank-1 filters, we propose the use of 3-D rank-1 filters $\mathbf{w}$ in the training stage, where the 3-D rank-1 filters are constructed by the outer product of three 1-D vectors, say $\mathbf{t}$, $\mathbf{v}$, and $\mathbf{h}$:

$$\mathbf{w} = \mathbf{t} \otimes \mathbf{v} \otimes \mathbf{h}. \tag{8}$$

This is an extension of the 2-D rank-1 planes used in the B2DPCA, where the 2-D planes are constructed by $\mathbf{p}_i\mathbf{q}_j^{\top}$. Figure 3 shows the training and the testing phases of the proposed method. The structure of the proposed network differs between the training phase and the testing phase. In comparison with the Flattened network (Fig. 2), in the training phase, the gradient flow first flows through the 3-D rank-1 filters and then through the 1-D vectors. Therefore, the gradient flow is different from that of the Flattened network, resulting in a different and better solution for the parameters in the 1-D vectors. With the proposed method, a solution can be obtained even in large networks for which the gradient flow in the Flattened network cannot obtain a solution at all. Furthermore, at test time, i.e., at the end of optimization, we can use the 1-D vectors directly as 1-D filters in the same manner as in the Flattened network, resulting in the same inference speed as the Flattened network (Fig. 3).

Figure 4 explains the training process with the proposed network structure in detail. At every epoch of the training phase, we first take the outer product of the three 1-D vectors $\mathbf{t}$, $\mathbf{v}$, and $\mathbf{h}$. Then, we assign the result of the outer product to the weight values of the 3-D convolution filter, i.e., for every weight value in the 3-D convolution filter $\mathbf{w}$, we assign

$$w(c,i,j) = \mathbf{t}(c)\,\mathbf{v}(i)\,\mathbf{h}(j), \tag{9}$$

where $(c,i,j)$ corresponds to the 3-D coordinates in $\Omega$, the 3-D domain of the 3-D convolution filter $\mathbf{w}$.
Since a tensor constructed by the outer product of vectors always has a rank of one, the 3-D convolution filter $\mathbf{w}$ is a rank-1 filter.

During the back-propagation phase, every weight value in $\mathbf{w}$ will be updated by

$$w'(c,i,j) = w(c,i,j) - \eta\,\frac{\partial L}{\partial w(c,i,j)}, \tag{10}$$

where $\frac{\partial L}{\partial w(c,i,j)}$ denotes the gradient of the loss function $L$ with respect to the weight $w(c,i,j)$, and $\eta$ is the learning rate.
In normal networks, $w'$ in (10) is the final updated weight value. However, the updated filter $\mathbf{w}'$ is normally not a rank-1 filter. This is due to the fact that the update in (10) is done in a direction which considers only the minimization of the loss function and not the rank of the filter.

With the proposed training network structure, we take a further update step, i.e., we update the 1-D vectors $\mathbf{t}$, $\mathbf{v}$, and $\mathbf{h}$:

$$\mathbf{t}' = \mathbf{t} - \eta\,\frac{\partial L}{\partial \mathbf{t}}, \tag{11}$$

$$\mathbf{v}' = \mathbf{v} - \eta\,\frac{\partial L}{\partial \mathbf{v}}, \tag{12}$$

$$\mathbf{h}' = \mathbf{h} - \eta\,\frac{\partial L}{\partial \mathbf{h}}. \tag{13}$$

Here, $\frac{\partial L}{\partial \mathbf{t}}$, $\frac{\partial L}{\partial \mathbf{v}}$, and $\frac{\partial L}{\partial \mathbf{h}}$ can be calculated by applying the chain rule to (9):

$$\frac{\partial L}{\partial \mathbf{t}(c)} = \sum_{i}\sum_{j}\frac{\partial L}{\partial w(c,i,j)}\,\mathbf{v}(i)\,\mathbf{h}(j), \tag{14}$$

$$\frac{\partial L}{\partial \mathbf{v}(i)} = \sum_{c}\sum_{j}\frac{\partial L}{\partial w(c,i,j)}\,\mathbf{t}(c)\,\mathbf{h}(j), \tag{15}$$

$$\frac{\partial L}{\partial \mathbf{h}(j)} = \sum_{c}\sum_{i}\frac{\partial L}{\partial w(c,i,j)}\,\mathbf{t}(c)\,\mathbf{v}(i). \tag{16}$$

At the next feed-forward step of the back-propagation, an outer product of the updated 1-D vectors $\mathbf{t}'$, $\mathbf{v}'$, and $\mathbf{h}'$ is taken to compose them back into the 3-D convolution filter $\mathbf{w}''$:

$$\mathbf{w}'' = \mathbf{t}' \otimes \mathbf{v}' \otimes \mathbf{h}', \tag{17}$$

where

$$w''(c,i,j) = \mathbf{t}'(c)\,\mathbf{v}'(i)\,\mathbf{h}'(j). \tag{18}$$

As the outer product of 1-D vectors always results in a rank-1 filter, $\mathbf{w}''$ is a rank-1 filter, as compared with $\mathbf{w}'$, which is not. Comparing (10) with (17), we get

$$\mathbf{w}'' = \mathbf{w}' + \boldsymbol{\epsilon}. \tag{19}$$

Therefore, $\boldsymbol{\epsilon}$ is the incremental update which projects $\mathbf{w}'$ back onto the rank-1 subspace.
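The two-step update of (9)-(19) can be sketched as follows. This is a minimal NumPy mock-up with a random stand-in gradient, not the authors' implementation; the toy sizes and the symbol names are our own.

```python
import numpy as np

rng = np.random.default_rng(2)
C_, d1, d2 = 4, 3, 3
t = rng.standard_normal(C_)   # lateral (channel) vector
v = rng.standard_normal(d1)   # vertical vector
h = rng.standard_normal(d2)   # horizontal vector
eta = 0.1

# (9) compose the rank-1 3-D filter from the 1-D vectors
w = np.einsum('c,i,j->cij', t, v, h)

# (10) gradient step on the 3-D filter (G is a random stand-in for dL/dW)
G = rng.standard_normal((C_, d1, d2))
w_prime = w - eta * G

# (11)-(16) chain-rule updates of the 1-D vectors
t_new = t - eta * np.einsum('cij,i,j->c', G, v, h)
v_new = v - eta * np.einsum('cij,c,j->i', G, t, h)
h_new = h - eta * np.einsum('cij,c,i->j', G, t, v)

# (17)-(19) recomposition projects the filter back onto the rank-1 subspace
w_2 = np.einsum('c,i,j->cij', t_new, v_new, h_new)
eps = w_2 - w_prime   # the incremental projection step of (19)

# the recomposed filter is rank-1: its mode-1 unfolding has matrix rank 1
assert np.linalg.matrix_rank(w_2.reshape(C_, -1)) == 1
```

Note that `w_prime` itself generally has full rank; only the recomposition step restores the rank-1 structure.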

## IV Properties of rank-1 filters

Below, we explain some properties of the 3-D rank-1 filters.

### IV-A Multilateral property of 3-D rank-1 filters

We explain the bilateral property of the 2-D rank-1 filters in analogy to the B2DPCA. The extension to the multilateral property of the 3-D rank-1 filters is then straightforward. We first observe that a 2-D convolution can be seen as shifting inner products, where each component at position $(x,y)$ of the output matrix $O$ is computed as the inner product of a 2-D filter $W$ and the image patch $X_{(x,y)}$ centered at $(x,y)$:

$$O(x,y) = \big\langle W,\ X_{(x,y)}\big\rangle. \tag{20}$$

If $W = \mathbf{v}\mathbf{h}^{\top}$ is a 2-D rank-1 filter, then,

$$O(x,y) = \big\langle \mathbf{v}\mathbf{h}^{\top},\ X_{(x,y)}\big\rangle = \mathbf{v}^{\top}X_{(x,y)}\mathbf{h}. \tag{21}$$

As has been explained in the case of the B2DPCA, since $\mathbf{h}$ is multiplied to the rows of $X_{(x,y)}$, $\mathbf{h}$ tries to extract those features from the rows of $X_{(x,y)}$ which can minimize the loss function. That is, $\mathbf{h}$ searches the rows in all patches for some common features which can reduce the loss function, while $\mathbf{v}$ looks for the features in the columns of the patches. This is in analogy to the B2DPCA, where the bilateral projection removes the redundancies among the rows and columns in the 2-D filters. Therefore, by easy extension, the 3-D rank-1 filters which are learned by the multilateral projection will have less redundancy among the rows, columns, and channels than the normal 3-D filters in standard CNNs.
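The identity in (21) can be verified numerically; the sketch below uses toy sizes of our own choosing:

```python
import numpy as np

rng = np.random.default_rng(3)
patch = rng.standard_normal((3, 3))   # image patch X_(x,y)
v = rng.standard_normal(3)            # vertical vector
h = rng.standard_normal(3)            # horizontal vector

# <v h^T, X> (element-wise inner product) equals the bilateral product v^T X h
lhs = np.sum(np.outer(v, h) * patch)
rhs = v @ patch @ h
assert np.isclose(lhs, rhs)
```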

### IV-B Property of projecting onto a low dimensional subspace

In this section, we show that the convolution with the rank-1 filters projects the output channels onto a low dimensional subspace. In [19], it has been shown via the block Hankel matrix formulation that the auto-reconstructing U-Net with an insufficient number of filters results in a low-rank approximation of its input. Using the same block Hankel matrix formulation for the 3-D convolution, we can show that the 3-D rank-1 filter projects the input onto a low dimensional subspace in a high dimension. To avoid confusion, we use the same definitions and notations as in [19]. A wrap-around Hankel matrix of a function $f = [f[1], \dots, f[n]]^{\top}$ with respect to the number of columns $d$ is defined as

$$\mathbb{H}_d(f) = \begin{bmatrix} f[1] & f[2] & \cdots & f[d] \\ f[2] & f[3] & \cdots & f[d+1] \\ \vdots & \vdots & \ddots & \vdots \\ f[n] & f[1] & \cdots & f[d-1] \end{bmatrix} \in \mathbb{R}^{n \times d}. \tag{22}$$

Using the Hankel matrix, a convolution operation with a 1-D filter $h$ of length $d$ can be expressed in matrix-vector form as

$$y = f \circledast h = \mathbb{H}_d(f)\,\overline{h}, \tag{23}$$

where $\overline{h}$ is the flipped version of $h$, and $y$ is the output result of the convolution.
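A minimal sketch of (22)-(23) follows (the indexing convention for the wrap-around is our own choice; this is not code from [19]): the product $\mathbb{H}_d(f)\,\overline{h}$ reproduces the circular convolution of $f$ and $h$, up to a fixed circular shift that depends on the indexing convention.

```python
import numpy as np

def hankel_wrap(f, d):
    """Wrap-around Hankel matrix of f with d columns, as in Eq. (22)."""
    n = len(f)
    return np.array([[f[(i + k) % n] for k in range(d)] for i in range(n)])

rng = np.random.default_rng(4)
n, d = 8, 3
f = rng.standard_normal(n)
h = rng.standard_normal(d)

# matrix-vector form of Eq. (23): Hankel matrix times the flipped filter
y = hankel_wrap(f, d) @ h[::-1]

# circular convolution computed independently via the FFT
circ = np.real(np.fft.ifft(np.fft.fft(f) * np.fft.fft(h, n)))
assert np.allclose(y, np.roll(circ, -(d - 1)))
```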

The 2-D convolution can be expressed using the block Hankel matrix expression of the input channel. The block Hankel matrix of a 2-D input $X = [\mathbf{x}_1, \dots, \mathbf{x}_{n_2}] \in \mathbb{R}^{n_1 \times n_2}$, with $\mathbf{x}_i$ being the columns of $X$, becomes

$$\mathbb{H}_{d_1,d_2}(X) = \begin{bmatrix} \mathbb{H}_{d_1}(\mathbf{x}_1) & \mathbb{H}_{d_1}(\mathbf{x}_2) & \cdots & \mathbb{H}_{d_1}(\mathbf{x}_{d_2}) \\ \mathbb{H}_{d_1}(\mathbf{x}_2) & \mathbb{H}_{d_1}(\mathbf{x}_3) & \cdots & \mathbb{H}_{d_1}(\mathbf{x}_{d_2+1}) \\ \vdots & \vdots & \ddots & \vdots \\ \mathbb{H}_{d_1}(\mathbf{x}_{n_2}) & \mathbb{H}_{d_1}(\mathbf{x}_1) & \cdots & \mathbb{H}_{d_1}(\mathbf{x}_{d_2-1}) \end{bmatrix}, \tag{24}$$

where $\mathbb{H}_{d_1}(\mathbf{x}_i) \in \mathbb{R}^{n_1 \times d_1}$ and $\mathbb{H}_{d_1,d_2}(X) \in \mathbb{R}^{n_1 n_2 \times d_1 d_2}$. With the block Hankel matrix, a single-input single-output 2-D convolution with a 2-D filter $W$ of size $d_1 \times d_2$ can be expressed in matrix-vector form,

$$\mathrm{vec}(Y) = \mathbb{H}_{d_1,d_2}(X)\,\mathrm{vec}\big(\overline{W}\big), \tag{25}$$

where $\mathrm{vec}(\cdot)$ denotes the vectorization operation performed by stacking up the column vectors of a 2-D matrix, and $\overline{W}$ is the flipped version of $W$.

In the case of multiple input channels $X^{(1)}, \dots, X^{(C)}$, the block Hankel matrix is extended to

$$\mathbb{H}_{d_1,d_2|C}(X) = \big[\,\mathbb{H}_{d_1,d_2}\big(X^{(1)}\big)\ \ \mathbb{H}_{d_1,d_2}\big(X^{(2)}\big)\ \ \cdots\ \ \mathbb{H}_{d_1,d_2}\big(X^{(C)}\big)\,\big], \tag{26}$$

and a single output $Y_j$ of the multi-input convolution with multiple filters becomes

$$\mathrm{vec}(Y_j) = \sum_{c=1}^{C}\mathbb{H}_{d_1,d_2}\big(X^{(c)}\big)\,\mathrm{vec}\big(\overline{W}_j^{(c)}\big), \qquad j = 1, \dots, q, \tag{27}$$

where $q$ is the number of filters and $W_j^{(c)}$ denotes the 2-D slice of the $j$-th 3-D filter which convolves with the $c$-th input channel. Last, the matrix-vector form of the multi-input multi-output convolution resulting in multiple outputs can be expressed as

$$Y = \mathbb{H}_{d_1,d_2|C}(X)\,\overline{W}, \tag{28}$$

where

$$Y = \big[\,\mathrm{vec}(Y_1)\ \ \cdots\ \ \mathrm{vec}(Y_q)\,\big], \tag{29}$$

and

$$\overline{W} = \begin{bmatrix} \mathrm{vec}\big(\overline{W}_1^{(1)}\big) & \cdots & \mathrm{vec}\big(\overline{W}_q^{(1)}\big) \\ \vdots & \ddots & \vdots \\ \mathrm{vec}\big(\overline{W}_1^{(C)}\big) & \cdots & \mathrm{vec}\big(\overline{W}_q^{(C)}\big) \end{bmatrix}. \tag{30}$$

To calculate the upper bound of the rank of $Y$, we use the rank inequality

$$\mathrm{rank}(AB) \le \min\big(\mathrm{rank}(A),\ \mathrm{rank}(B)\big) \tag{31}$$

on (28) to get

$$\mathrm{rank}(Y) \le \min\Big(\mathrm{rank}\,\mathbb{H}_{d_1,d_2|C}(X),\ \mathrm{rank}\,\overline{W}\Big). \tag{32}$$

Now, to investigate the rank of $\overline{W}$, we first observe that

$$\overline{W}_j^{(c)} = \mathbf{t}_j(c)\,\overline{\mathbf{v}_j\mathbf{h}_j^{\top}}, \tag{33}$$

i.e., all the 2-D slices of the rank-1 3-D filter $\mathbf{w}_j = \mathbf{t}_j \otimes \mathbf{v}_j \otimes \mathbf{h}_j$ are scalar multiples of the same rank-1 matrix, as can be seen in Fig. 5.

Then, expressing $\overline{W}$ as the stack of its sub-matrices,

$$\overline{W} = \begin{bmatrix} \overline{W}^{(1)} \\ \vdots \\ \overline{W}^{(C)} \end{bmatrix}, \tag{34}$$

where

$$\overline{W}^{(c)} = \big[\,\mathrm{vec}\big(\overline{W}_1^{(c)}\big)\ \ \cdots\ \ \mathrm{vec}\big(\overline{W}_q^{(c)}\big)\,\big], \tag{35}$$

whose columns are the vectorized forms of the 2-D slices in the 3-D filters which convolve with the $c$-th input channel. We observe that all the sub-matrices $\overline{W}^{(c)}$ have a rank of 1, since all the column vectors in $\overline{W}^{(c)}$ are in the same direction and differ only in their magnitudes, i.e., by the different values of $\mathbf{t}_j(c)$.
Therefore, the upper bound of $\mathrm{rank}(\overline{W})$ is $\min(C,\,q)$ instead of $\min(Cd_1d_2,\,q)$, which is the upper bound we get if we use non-rank-1 filters.
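Observation (33) can be checked numerically (toy sizes of our own choosing; not the authors' code): for each rank-1 3-D filter, the matrix collecting its vectorized 2-D slices across the channel axis has rank 1, since every slice is a scalar multiple of $\mathbf{v}_j\mathbf{h}_j^{\top}$.

```python
import numpy as np

rng = np.random.default_rng(5)
C_, d1, d2, q = 3, 3, 3, 5
for j in range(q):                 # q independent rank-1 filters
    t = rng.standard_normal(C_)
    v = rng.standard_normal(d1)
    h = rng.standard_normal(d2)
    # the C 2-D slices t(c) * v h^T of the rank-1 3-D filter, per Eq. (33)
    slices = np.stack([t[c] * np.outer(v, h) for c in range(C_)])
    V = slices.reshape(C_, -1).T   # columns: vectorized slices
    assert np.linalg.matrix_rank(V) == 1
```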

As a result, the rank of the output $Y$ is upper-bounded as

$$\mathrm{rank}(Y) \le r, \tag{36}$$

where

$$r = \min\bigg(\sum_{c=1}^{C}\mathrm{rank}\,\mathbb{H}_{d_1,d_2}\big(X^{(c)}\big),\ \ C,\ \ q\bigg). \tag{37}$$

As can be seen from (37), the upper bound is determined by the ranks of the Hankel matrices of the input channels, or by the numbers of input channels or filters. In common deep neural network structures, the number of filters is normally larger than the number of input channels; e.g., the VGG-16 uses in every layer a number of filters larger than or equal to the number of input channels. So if we use the same structure for the proposed rank-1 network as in the VGG-16 model, the upper bound will be determined mainly by the number of input channels. Therefore, the outputs of the layers in the proposed CNN are constrained to live on sub-spaces of lower rank than the sub-spaces on which the outputs of the layers in standard CNNs live. Since the output of a certain layer becomes the input of the next layer, the difference in rank between the standard and the proposed rank-1 CNN accumulates over the higher layers. Therefore, the final output of the proposed rank-1 CNN lives on a sub-space of much lower rank than the output of the standard CNN.

## V Experiments

We compared the performance of the proposed model with the standard CNN and the Flattened CNN model [16].
We used the same number of layers for all the models, where for the Flattened CNN we regarded the combination of the lateral, vertical, and horizontal 1-D convolutional layers as a single layer. Furthermore, we used the same numbers of input and output channels in each layer for all the models, and also the same ReLU, Batch normalization, and dropout operations.
The code for the proposed rank-1 CNN will be released at https://github.com/petrasuk/Rank-1-CNN.

Tables 1-3 show the structures of the models used for each dataset in the training stage.
The outer product operation composing the three 1-D filters $\mathbf{t}$, $\mathbf{v}$, and $\mathbf{h}$ into a 3-D rank-1 filter is denoted as $\mathbf{t} \otimes \mathbf{v} \otimes \mathbf{h}$ in the tables.
The datasets that we used in the experiments are the MNIST, the CIFAR10, and the 'Dog and Cat' (https://www.kaggle.com/c/dogs-vs-cats) datasets.
We used different structures for the different datasets.
For the experiments on the MNIST and the CIFAR10 datasets, we trained on 50,000 images, and then tested on 100 batches, each consisting of 100 random images, and calculated the overall average accuracy. The sizes of the images in the MNIST and the CIFAR10 datasets are 28×28 and 32×32, respectively.
For the 'Dog and Cat' dataset, we trained on 24,900 training images, and tested on a set of 100 test images.

The proposed rank-1 CNN achieved a slightly higher testing accuracy on the MNIST dataset than the other two models (Fig. 6). This may be due to the fact that the MNIST dataset is by nature low-ranked, in which case the proposed method can find a good approximation, since it constrains the solution to a low-rank sub-space. On the CIFAR10 dataset, the accuracy is slightly lower than that of the standard CNN, which may be because the images in the CIFAR10 dataset are of higher rank than those in the MNIST dataset.
However, the testing accuracy of the proposed CNN is higher than that of the Flattened CNN, which shows that the better gradient flow in the proposed CNN model achieves a better solution. The 'Dog and Cat' dataset was used in the experiments to verify the performance of the proposed CNN on real-sized images and on a deep structure. In this case, we could not train the Flattened network due to memory issues: the TensorFlow API requires much more GPU memory for the Flattened network than for the proposed rank-1 network.
We also believe that, even if there were no memory issue, the Flattened network could not find good parameters with this deep structure, due to the poor gradient flow in the deep structure.
The standard CNN and the proposed CNN achieved similar test accuracy, as can be seen in Fig. 8.

Table 1: Structures of the Standard CNN, the Flattened CNN, and the Proposed CNN. The three models share the same layer sequence; in each convolutional layer, the Standard CNN uses full 3-D filters, the Flattened CNN uses three consecutive 1-D convolutions, and the Proposed CNN uses filters composed as the outer product of three 1-D filters.

- Conv1: 64 filters
- Conv2: 64 filters
- Max Pool
- Conv3: 144 filters
- Conv4: 144 filters
- Max Pool
- Conv5: 144 filters
- Conv6: 256 filters
- Conv7: 256 filters
- FC 2048 + Batch Normalization + ReLU + Drop Out (Prob. = 0.5)
- FC 1024 + Batch Normalization + ReLU + Drop Out (Prob. = 0.5)
- FC 10 + ReLU + Drop Out (Prob. = 0.5)
- Soft-Max

Table 2: Structures of the Standard CNN, the Flattened CNN, and the Proposed CNN (same convention as in Table 1).

- Conv1: 64 filters
- ReLU + Batch Normalization
- Conv2: 64 filters
- ReLU + Max Pool + Drop Out (Prob. = 0.5)
- Conv3: 144 filters
- ReLU + Batch Normalization
- Conv4: 144 filters
- ReLU + Max Pool + Drop Out (Prob. = 0.5)
- Conv5: 256 filters
- ReLU + Batch Normalization
- Conv6: 256 filters
- ReLU + Max Pool + Drop Out (Prob. = 0.5)
- FC 1024 + Batch Normalization + ReLU + Drop Out (Prob. = 0.5)
- FC 512 + Batch Normalization + ReLU + Drop Out (Prob. = 0.5)
- FC 10
- Soft-Max

Table 3: Structures of the Standard CNN and the Proposed CNN. The two models share the same layer sequence; the Standard CNN uses full 3-D filters, while the Proposed CNN uses filters composed as the outer product of three 1-D filters.

- Conv1: 64 filters
- Conv2: 64 filters
- Batch Normalization + ReLU + Max Pool
- Conv3: 144 filters
- ReLU
- Conv4: 144 filters
- Batch Normalization + ReLU + Max Pool
- Conv5: 256 filters
- ReLU
- Conv6: 256 filters
- Batch Normalization + ReLU + Max Pool
- Conv7: 256 filters
- ReLU
- Conv8: 484 filters
- ReLU
- Conv9: 484 filters
- Batch Normalization + ReLU + Max Pool
- Conv10: 484 filters
- ReLU
- Conv11: 484 filters
- Batch Normalization + ReLU + Max Pool
- FC 1024 + Batch Normalization + ReLU
- FC 512 + Batch Normalization + ReLU
- FC 2
- Soft-Max

## References

- [1] R. Livni, S. Shalev-Shwartz, and O. Shamir, On the Computational Efficiency of Training Neural Networks, Advances in Neural Information Processing Systems(NIPS), pp. 855-863, 2014.
- [2] C. Zhang, S. Bengio, M. Hardt, B. Recht, and O. Vinyals, Understanding deep learning requires rethinking generalization, ICLR 2017.
- [3] Xiyu Yu, Tongliang Liu, Xinchao Wang, and Dacheng Tao, On Compressing Deep Models by Low Rank and Sparse Decomposition, CVPR, pp.7370-7379, 2017.
- [4] S. Han, J. Pool, J. Tran, and W. Dally, Learning both weights and connections for efficient neural network, In Advances in Neural Information Processing Systems (NIPS), pp. 1135-1143, 2015.
- [5] E. L. Denton, W. Zaremba, J. Bruna, Y. LeCun, and R. Fergus, Exploiting linear structure within convolutional networks for efficient evaluation, In Advances in Neural Information Processing Systems (NIPS), pp. 1269-1277, 2014.
- [6] M. Jaderberg, A. Vedaldi, and A. Zisserman, Speeding up convolutional neural networks with low rank expansions, arXiv preprint arXiv:1405.3866, 2014.
- [7] X. Zhang, J. Zou, K. He, and J. Sun, Accelerating very deep convolutional networks for classification and detection, IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2015.
- [8] W. Chen, J. T. Wilson, S. Tyree, K. Q. Weinberger, and Y. Chen, Compressing neural networks with the hashing trick, 2015.
- [9] Y. Gong, L. Liu, M. Yang, and L. Bourdev, Compressing deep convolutional networks using vector quantization, arXiv preprint arXiv:1412.6115, 2014.
- [10] S. Han, H. Mao, and W. J. Dally, Deep compression: Compressing deep neural network with pruning, trained quantization and huffman coding, International Conference on Learning Representations (ICLR), 2016.
- [11] M. Rastegari, V. Ordonez, J. Redmon, and A. Farhadi, XNOR-Net: ImageNet classification using binary convolutional neural networks, arXiv preprint arXiv:1603.05279, 2016.
- [12] Y. Wang, C. Xu, S. You, D. Tao, and C. Xu, CNNpack: Packing convolutional neural networks in the frequency domain, In Advances In Neural Information Processing Systems (NIPS), pp. 253-261, 2016.
- [13] A. G. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang, T. Weyand, M. Andreetto, and H. Adam, MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications, arXiv preprint arXiv:1704.04861, 2017.
- [14] F. Chollet, Xception: Deep learning with depthwise separable convolutions, arXiv preprint arXiv:1610.02357v2, 2016.
- [15] F. N. Iandola, M. W. Moskewicz, K. Ashraf, S. Han, W. J. Dally, and K. Keutzer, Squeezenet: Alexnet-level accuracy with 50x fewer parameters and 1mb model size, arXiv preprint arXiv:1602.07360, 2016.
- [16] J. Jin, A. Dundar, and E. Culurciello, Flattened convolutional neural networks for feedforward acceleration, arXiv preprint arXiv:1412.5474, 2014.
- [17] Y. Ioannou, D. Robertson, J. Shotton, R. Cipolla, A. Criminisi, Training CNNs with Low-Rank Filters for Efficient Image Classification, arXiv preprint arXiv:1511.06744, 2016.
- [18] H. Kong, L. Wang, E. K. Teoh, X. Li, J.-G. Wang, R. Venkateswarlu, Generalized 2D principal component analysis for face image representation and recognition, Neural Networks, Vol. 18, Issues 5-6, pp. 585-594, 2005.
- [19] J. C. Ye, Y. Han, and E. Cha, Deep Convolutional Framelets: A General Deep Learning Framework for Inverse Problems, arXiv preprint arXiv:1707.00372, 2018.
- [20] M. Jaderberg, A. Vedaldi,and A. Zisserman, Speeding up Convolutional Neural Networks with Low Rank Expansions, CoRR, abs/1405.3866, 2014.