Tensorized Spectrum Preserving Compression for Neural Networks
Abstract
Modern neural networks can have tens of millions of parameters, and are often ill-suited for smartphones or IoT devices. In this paper, we describe an efficient mechanism for compressing large networks by tensorizing network layers: i.e., mapping layers onto high-order matrices, for which we introduce new tensor decomposition methods. Compared to previous compression methods, some of which use tensor decomposition, our techniques preserve more of the network's invariance structure. Coupled with a new data-reconstruction-based learning method, we show that tensorized compression outperforms existing techniques for both convolutional and fully-connected layers on state-of-the-art networks.
1 Introduction
Neural networks have become de facto structures for many learning problems, including object detection [22], image classification [6], and many forms of prediction and forecasting [11]. Modern network structures provide unprecedented accuracy on difficult problems, and this success has led to neural network solutions being applied to many important domains, such as security and autonomous driving.
Success and versatility notwithstanding, application of neural networks to real-world problems must still overcome several challenges: (1) New, sophisticated networks have tens of millions of parameters, which increases both training time and memory requirements. (2) The very high number of parameters can (and often does) lead to overfitting [19], making the network susceptible to noisy (perhaps adversarial/poisonous) training examples. (3) Even when trained, the large parameter space requires relatively capable devices to load the entire network into memory and to execute tests quickly. While network size is not a problem for testing outsourced to cloud-resident GPUs or even powerful desktops, neural network applications are being deployed on far more constrained devices, such as smartphones and IoT cameras, where the testing time and network size are a practical bottleneck.
A recent approach to addressing the scalability of neural networks is to compress the network layers, which reduces both the number of parameters that have to be trained and the size of the network for testing. Compressing a successful network, while maintaining its accuracy, is nontrivial. Many approaches have been employed, including recasting layers of the network as the concatenation of two layers with a (smaller) hidden layer, and performing SVD on the weight matrices of the original layers with configurable rank [10, 4, 23]. More recent approaches cast the network layers as tensors, and compress these layers using tensor decomposition methods [17, 12, 20, 5].
Tensors are higher-order generalizations of matrices. Much as methods such as SVD can decompose a matrix, tensors can be decomposed into smaller factors using a variety of algorithmic methods [1, 13, 21]. Tensor decompositions have previously been used to find low rank approximations of higher order objects, and are useful for compressing neural network layers, both convolutional and fully-connected. The key insight is that the convolutional kernels of neural network layers (often) lie in low dimensional spaces, and tensor decompositions can effectively find low rank approximations of such kernels (and hence reduce the number of parameters by a factor polynomial in the tensor dimension). SVD can also be used to compress fully-connected layers, with the SVD rank controlling the degree of parameter compression.
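To make the SVD route concrete, the following sketch (illustrative shapes and variable names, not the paper's code) compresses a fully-connected layer's weight matrix to a configurable rank:

```python
import numpy as np

# Hypothetical sketch: compress a fully-connected layer's weight matrix with a
# truncated SVD of configurable rank r. Shapes and names are illustrative.
def svd_compress(W, r):
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    A = U[:, :r] * s[:r]   # (m, r): left factors scaled by singular values
    B = Vt[:r, :]          # (r, n): right factors
    return A, B

rng = np.random.default_rng(0)
W = rng.standard_normal((256, 512))
A, B = svd_compress(W, r=16)

# Parameter count drops from m*n to r*(m+n); A @ B is the best rank-16
# approximation of W in Frobenius norm.
assert A.size + B.size == 16 * (256 + 512)
assert A.size + B.size < W.size
```

The rank r directly trades accuracy for compression, mirroring the configurable-rank SVD compression of [10, 4, 23].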
While "regular" tensor decompositions can compress convolutional layers, in this paper we consider further tensorized networks and their decompositions. The basic idea is to map an existing kernel, say a 4-order tensor, into a higher order tensor, a process called tensorization; tensorization is a reshaping, so it preserves the kernel's entries. Considering rank-R tensor decompositions of the tensorized kernel, the number of parameters needed is further reduced compared to decomposing the original kernel. We refer to this method of decomposing tensorized kernels as Tensorized Spectrum Preserving Compression (tSPC).
tSPC is particularly effective at capturing and preserving invariant structures often present in neural networks. Consider a vector consisting of repeated "subvectors" of a fixed length. This vector can be mapped into a matrix whose columns are those repeated subvectors. This matrix can now be decomposed with rank reduced to 1 without losing information. We apply this basic idea in tSPC.
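This invariance argument can be checked directly; the sketch below (dimensions chosen for illustration) reshapes a vector of repeated subvectors into a rank-1 matrix:

```python
import numpy as np

# Illustrative check: a vector made of m repeated copies of a length-d
# subvector reshapes into a d x m matrix of rank 1.
d, m = 4, 6
sub = np.arange(1.0, d + 1.0)   # the repeated subvector
v = np.tile(sub, m)             # the length d*m vector
M = v.reshape(m, d).T           # d x m matrix; every column equals sub
assert np.linalg.matrix_rank(M) == 1
# A rank-1 factorization stores d + m numbers instead of d*m.
```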
In general, decomposing higher order tensors is challenging, and known methods are not guaranteed to converge to the minimum-error decompositions [8]. Hence, we must fine-tune the tensorized decompositions in order to achieve good performance. Fortunately, the optimal performance for the layers that are compressed is already known (from the uncompressed network). We introduce a new training method based on data reconstruction, called sequential training, that minimizes the difference between the training output of the uncompressed layers and that of the compressed ones. Unlike traditional end-to-end backpropagation, sequential training trains individual compressed "blocks" one at a time, reducing the resources required during training.
Contributions
We introduce new methods for tensorized decomposition of higher order kernels. In particular, we present a new architecture for tensorizing network weights/kernels such that CP and Tucker decompositions can be applied to the higher order tensor.
We introduce the sequential training scheme for minimizing the errors accruing from individual layers using data reconstruction optimization. Unlike end-to-end backpropagation, sequential training only fits one individual block of compressed layers into (GPU) memory during each training cycle, reducing memory requirements by a factor of the total number of blocks. This reduction is crucial, since the trained component's parameters (the product of output channels, filters, and image height and width) can occupy many megabytes of memory. This product is further scaled linearly by the number of images in a batch. Beyond memory occupancy, the computation requirements are also high, since all of these parameters must be updated during training.
Sequential training enables a space-time tradeoff allowing large networks to be trained on modest devices. As our results will show, parameters for sequential layers converge quickly in practice, and hence sequential training can reduce training time as well.
We implement our ideas and present an evaluation comparing tSPC with existing methods. Our code will be available online. Our results on CIFAR-10 and MNIST show that together, tSPC and sequential training can maintain high accuracy even at high compression rates, outperforming end-to-end training and non-tensorized compression. We also show that tSPC can effectively compress fully-connected layers, and that the performance scales to large datasets such as ImageNet (2012) applied to ResNet-50.
Related work
Matrix and tensor decompositions have been used to reduce the number of parameters for convolutional kernels/weights. A straightforward way to exploit the low rank structure is via singular value decomposition, which has been proposed in [10, 4, 23]. The pioneering work in [17] shows that the parameters in the format of a high-order tensor can be directly factorized by CP decomposition. Tucker decomposition [12] and Tensor-Train decomposition [5] have also been used to decompose the parameters directly. We compare our proposed method against these, and show that by exploiting the invariance structure within filters, we are able to obtain further benefit.
Prior work [20] also shows that parameters in fully-connected layers can be compressed by reshaping the parameters into a high-order tensor and applying Tensor-Train decomposition. Previous techniques [16, 15] have further reduced the number of parameters by introducing novel tensor operations that we show are equivalent to decomposing the reshaped parameters using CP and Tucker decomposition.
2 Tensor Preliminaries and Notations
Indices start from 0, and reverse indices count backwards from the end of an array. A multidimensional array is defined as a tensor whose order (number of modes) is the number of its dimensions. For a tensor, a fiber is the vector obtained by fixing all modes but one, and a slice is the matrix obtained by fixing all modes but two. We define a few generalized operations on arbitrary order tensors in Table 1. These operations are combined to construct more complicated operations.
Table 1: Generalized tensor operations (operator, notation, and definition); the precise definitions of these operations appear in Appendix B.
(1) [Compound Operation] Simultaneous multi-operations between two tensors. For example, given two tensors, we define a compound operation as performing a mode-(0,0) partial outer product, a mode-(1,1) tensor convolution, and a mode-(2,2) tensor contraction simultaneously.
(2) [Compound Operation] Simultaneous operations between a tensor and multiple tensors. For example, given , , and , we can define a compound operation (i.e., ) as the mode(2,0) tensor contraction of and , the mode(0,0) tensor partial outer product of and , the mode(1,0) tensor convolution of and simultaneously.
3 Compression of Neural Networks using Tensor Decompositions
Standard convolutional layer: In modern convolutional neural networks (CNNs), the major building block is the convolutional layer. A standard convolutional layer is parameterized by a 4-order kernel: two modes give the height and width of the filters (which are typically equal), and the other two give the numbers of input channels and output channels, respectively. The convolutional layer maps a 3-order input tensor (the input feature map with its height, width, and input-channel modes) to another 3-order output tensor (the output feature map)
(1) 
The convolutional layer performs a compound operation of two tensor convolutions and one tensor contraction between the input tensor and the parameter kernel. The number of parameters in a standard convolutional layer is the product of the filter height, filter width, and the input and output channel counts, and the number of operations needed to evaluate the output is that product times the height and width of the output feature map (in terms of floating point multiplications).
Neyshabur et al. [19] empirically showed that bounding the spectral norm of the weights increases the robustness of the neural network, and therefore the network generalizes better. Based on this observation, we find low rank kernel approximations, thereby reducing the number of parameters, and implement four types of decompositions of the kernel: singular value decomposition (SVD), CANDECOMP/PARAFAC (CP) decomposition, Tucker (Tk) decomposition, and Tensor-Train (TT) decomposition. We derive the steps to evaluate the output of the compressed network (forward propagation) sequentially using the factors returned by the decompositions.
Kernel Decomposition
Consider a CP decomposition of the kernel, which decomposes it into three factor tensors as follows
(2) 
The structure of the convolutional layer is thus transformed into a new structure, which we call a CP-convolutional layer. The number of parameters is reduced relative to the original layer, as the CP-convolutional layer only requires storing the three factor tensors. We next derive the number of operations required during forward propagation.
Forward Propagation
We substitute Expression (2) into Expression (1), and break the procedure of evaluating the output into three steps, one per factor tensor:
Here, two intermediate tensors are produced along the way. Effectively, the standard convolutional layer is transformed into three sub-layers: the first and third steps can be interpreted as convolutional layers, and the second step is known as a depthwise convolutional layer [2]. The number of operations needed during forward propagation is the sum over the three steps.
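The three sub-layers can be sketched as follows; the shapes, factor names, and the 'valid' correlation choice are illustrative assumptions rather than the paper's implementation:

```python
import numpy as np

# Sketch of the three sub-layers of a CP-convolutional layer. Shapes, factor
# names, and the 'valid' correlation are illustrative assumptions, not the
# paper's implementation.
def cp_conv2d(X, Ks, Kt, Kc):
    # X: (S, H, W) input; Ks: (R, S) input-channel mixing; Kt: (R, h, w)
    # per-component spatial filters; Kc: (T, R) output-channel mixing.
    R, h, w = Kt.shape
    U = np.einsum('rs,shw->rhw', Ks, X)        # step 1: channel-mixing conv
    Ho, Wo = X.shape[1] - h + 1, X.shape[2] - w + 1
    V = np.zeros((R, Ho, Wo))
    for r in range(R):                         # step 2: depthwise correlation
        for i in range(Ho):
            for j in range(Wo):
                V[r, i, j] = np.sum(U[r, i:i + h, j:j + w] * Kt[r])
    return np.einsum('tr,rhw->thw', Kc, V)     # step 3: channel-mixing conv

rng = np.random.default_rng(0)
X = rng.standard_normal((3, 8, 8))
Y = cp_conv2d(X, rng.standard_normal((4, 3)),
              rng.standard_normal((4, 3, 3)),
              rng.standard_normal((5, 4)))
assert Y.shape == (5, 6, 6)
```

Note that the input interacts with each factor in turn; the full kernel is never reconstructed.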
Due to space limits, we defer the compression procedures for convolutional layers using SVD, Tk, and TT, and their analysis, to Appendix D, but we summarize the number of parameters and operations for each scheme in Table 2. The compression schemes using tensor decompositions for fully-connected layers are deferred to Appendix E.
Table 2: Number of parameters and number of operations for each decomposition scheme: None (uncompressed), CP, Tk, TT, tCP, tTk, and tTT.
4 Tensorized Spectrum Preserving Compression of Neural Networks
Intuition for Tensorization
Consider a matrix that is the concatenation of several copies of a single vector. Obviously, as a rank-1 matrix it can be perfectly represented by two vectors, which already reduces the parameter count. However, if we construct a tensor by "reshaping" the matrix into a 3-order tensor, the tensor can be factorized by CP decomposition, which shows that the matrix can be expressed with even fewer parameters. This argument generalizes to kernel tensors in convolutional layers: reshaping the convolution kernels into higher order tensors can further exploit invariances in the kernel, and thus further compress the network.
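The reshaping argument can be verified numerically; in the sketch below (illustrative dimensions), a matrix built from m copies of a block becomes a 3-order tensor whose copy mode is redundant:

```python
import numpy as np

# Illustrative dimensions: a d1 x (d2*m) matrix formed by concatenating m
# copies of a d1 x d2 block, reshaped into a 3-order tensor whose "copy" mode
# carries no extra information.
d1, d2, m = 5, 4, 7
B = np.arange(float(d1 * d2)).reshape(d1, d2)
M = np.concatenate([B] * m, axis=1)   # d1 x (d2*m)
T = M.reshape(d1, m, d2)              # 3-order tensor
# Every slice along the copy mode equals B, so T factors as B times a vector
# of m ones: d1*d2 + m numbers instead of d1*d2*m.
for k in range(m):
    assert np.allclose(T[:, k, :], B)
```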
Definition 4.1 (Tensorized Convolutional Layer).
A tensorized convolutional layer with a kernel is defined as
(3) 
where the input is and the output .
Given the tensorized convolutional layer defined in Definition 4.1, we establish its equivalence to the standard convolutional layer.
Lemma 4.2.
If and are reshaped versions of and , then is a reshaped version of , where , .
4.1 Compression of Tensorized Convolutional Layer
Now we decompose the kernel using tensor decompositions to compress the tensorized convolutional layer. The intuition behind these decompositions is to reduce the number of parameters while preserving some invariance structure in the kernel. Before tensorization, the invariance structure across the filters might not be captured by decomposing the original kernel; after tensorization, it can be captured by decomposing the tensorized kernel.
Kernel Decomposition. Consider a CP decomposition of the tensorized kernel, which decomposes it into factors
(4) 
A CP decomposition of the kernel transforms the structure of the tensorized convolutional layer in (3) into a new tensorized CP-convolutional layer.
The number of parameters is reduced from its value in the standard convolutional layer, first in the CP-convolutional layer, and finally further in the tensorized CP-convolutional layer. Now the question is how many operations are needed in the forward propagation process.
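As a back-of-the-envelope check under assumed sizes (a 3x3 kernel, 256 input and output channels, CP rank 64, the spatial modes kept in one factor, and each 256-channel mode tensorized into two modes of dimension 16; the exact formulas are summarized in Table 2), the parameter counts compare as follows:

```python
# Back-of-the-envelope parameter counts under assumed sizes: a 3x3 kernel,
# S = T = 256 channels, CP rank R = 64, spatial modes kept in one factor, and
# each 256-channel mode tensorized into two modes of dimension 16.
h = w = 3
S = T = 256
R = 64

standard = h * w * S * T            # full 4-order kernel
cp = R * (S + h * w + T)            # three CP factor tensors
tcp = R * (h * w + 4 * 16)          # CP factors of the tensorized kernel

assert standard == 589824
assert cp == 33344
assert tcp == 4672
assert tcp < cp < standard
```

Under these assumed sizes, tensorization shrinks the channel factors from R*256 each to two factors of R*16, roughly a further 7x reduction over plain CP.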
Forward Propagation
The procedure of evaluating the output proceeds step by step, one factor at a time; we account for the operation count of each step below.
(5)  
(6)  
(7) 
Here, intermediate tensors are produced along the way, and a final permutation reorders the modes to match those of the output. Effectively, the tensorized convolutional layer is transformed into a sequence of sub-layers, where the first can be interpreted as a convolutional layer, the intermediate ones as depthwise convolutional layers, and the last step as a standard convolutional layer. We show that the input tensor can interact with the factors sequentially to evaluate the output, so that reconstructing the full kernel is avoided. The number of operations needed during forward propagation is the sum over these steps.
Due to space limits, we defer the detailed description of tensorized convolutional compression layers using Tk and TT to Appendix F. In Table 3, we list all three compression mechanisms and their decomposition forms; the numbers of parameters and operations are summarized in Table 2.
Table 3: Decomposition form and forward propagation procedure for the CP, Tk, and TT tensorized compression mechanisms.
Back Propagation for Data Reconstruction Optimization
Tensor decompositions provide weight estimates for the tensorized convolutional layers. However, convergence is not guaranteed in general due to the hardness of tensor decomposition [8]. Model parameters can be further fine-tuned using standard back propagation and SGD by minimizing the difference between the output of the uncompressed convolutional layer and the output of the tensorized CP-convolutional layer. We refer to this scheme as Data Reconstruction Optimization.
Using the tensorized CP-convolutional layer as an example, we define a loss function (usually the mean squared error after the activation function) between the output of the tensorized convolutional layer and the output of the tensorized CP-convolutional layer.
In order to backpropagate to minimize the loss, we need to derive the partial derivatives with respect to the factor tensors. As we show in Appendix B, all tensor operations are linear, and therefore the derivatives can be easily derived and the backpropagation implemented using existing libraries.
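As an illustration of data reconstruction optimization in its simplest setting, the sketch below (a linear fully-connected layer with illustrative sizes, standing in for the tensorized layer) fits low-rank factors by gradient descent so that the compressed layer reproduces the uncompressed layer's outputs:

```python
import numpy as np

# Data reconstruction optimization in its simplest setting: fit low-rank
# factors A, B by gradient descent so that (A @ B) @ x matches the
# uncompressed layer's output W @ x on a fixed batch of inputs. A linear
# fully-connected layer with illustrative sizes stands in for the tensorized
# layer.
rng = np.random.default_rng(0)
m, n, r = 20, 30, 5
W = rng.standard_normal((m, r)) @ rng.standard_normal((r, n))  # rank-r weights
A = 0.1 * rng.standard_normal((m, r))
B = 0.1 * rng.standard_normal((r, n))
X = rng.standard_normal((n, 256))   # a fixed batch of training inputs

lr = 1e-2
for _ in range(2000):
    G = (A @ B @ X - W @ X) @ X.T / X.shape[1]  # grad of 0.5*MSE w.r.t. A @ B
    A, B = A - lr * G @ B.T, B - lr * A.T @ G   # chain rule through the factors

# The compressed layer reproduces the uncompressed layer closely.
assert np.linalg.norm(A @ B - W) < 0.05 * np.linalg.norm(W)
```

Because the map is linear in each factor, the gradients factor cleanly through the decomposition, which is what makes standard SGD applicable to the tensorized layers as well.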
Training the Entire Network
Analogous to [9], our training procedure, called sequential training, trains the network sequentially, block by block, minimizing the squared error between the output of each uncompressed block and the output of its compressed counterpart.
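A schematic of sequential training, with plain linear maps standing in for ResNet blocks and a closed-form least-squares fit standing in for SGD (all names and sizes here are illustrative):

```python
import numpy as np

# Schematic of sequential training: each compressed block is fit on its own to
# reproduce the corresponding uncompressed block's outputs, so only one block
# must reside in memory at a time. Plain linear maps stand in for network
# blocks, and a least-squares + truncated-SVD fit stands in for SGD.
rng = np.random.default_rng(0)
d = 16
uncompressed = [rng.standard_normal((d, d)) for _ in range(3)]

def fit_block(W_ref, X, r):
    # Match W_ref's outputs on inputs X, then truncate to rank r.
    W_hat = (W_ref @ X) @ np.linalg.pinv(X)
    U, s, Vt = np.linalg.svd(W_hat, full_matrices=False)
    return (U[:, :r] * s[:r]) @ Vt[:r]

X = rng.standard_normal((d, 128))
compressed = []
for W_ref in uncompressed:
    W_c = fit_block(W_ref, X, r=8)
    compressed.append(W_c)
    X = W_c @ X   # the next block is trained on the compressed data stream

assert all(Wc.shape == Wr.shape for Wc, Wr in zip(compressed, uncompressed))
```

Only the block currently being fit (and the activations feeding it) is needed at any time, which is the source of the memory savings described above.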
5 Experiments
We present an empirical study of both the compression and the accuracy achieved by our methods. By default, we use the CIFAR-10 dataset on the ResNet-34 network [7]; we evaluate fully-connected layers using the MNIST dataset, and the scalability of our method using the ImageNet (2012) dataset on the ResNet-50 network [7].
We refer to traditional backpropagation-based training of the network as end-to-end (E2E), and to our proposed technique that trains each block individually as sequential training (Seq.). We refer to spectrum-preserving compression using tensor decompositions directly on kernels as SPC, and to SPC applied with a specific decomposition, e.g., CP decomposition, as SPC-CP. Analogously, we refer to our proposed tensorized spectrum-preserving compression as tSPC; similarly, tSPC-CP refers to tSPC applied with CP decomposition.
Tensorized Spectrum Preserving Compression with Sequential Training
Table 4: Test accuracy (%) on CIFAR-10 for SPC with end-to-end training versus tSPC with sequential training, at various compression rates.

           SPC, E2E                            tSPC, Seq.
Method     5%      10%     20%     40%         2%      5%      10%     20%
CP         84.02   86.93   88.75   88.75       85.7    89.86   91.28   -
Tk         83.57   86.00   88.03   89.35       61.06   71.34   81.59   87.11
TT         77.44   82.92   84.13   86.64       78.95   84.26   87.89   -
Our primary contribution in this paper is a new framework that compresses networks via tensorized spectrum preserving compression, together with a demonstration that sequential training of these new networks can maintain high accuracy even when the networks are very highly compressed. Table 4 tests this hypothesis, comparing SPC-compressed networks with end-to-end training [17, 12, 5] against tSPC with sequential training on the CIFAR-10 dataset. Using CP decomposition, reducing the original network to 10% of its original size with end-to-end training reduces the accuracy to 86.93%. However, with tSPC and sequential training at the same number of parameters (10% of the original), testing accuracy increases to 91.28%. We observe similar trends (higher compression and higher accuracy) for Tensor-Train decomposition. The structure of the Tucker decomposition (see Appendix F) makes it less effective at very high compression, since the "spine" of the network reduces to very low rank, effectively losing necessary information. Increasing the network size to 20% of the original provides reasonable performance on CIFAR-10 for Tucker as well.
Sequential Training, Tensorized Compression, or Both?
Table 4 shows that sequential training combined with tensorized spectrum preserving compression outperforms end-to-end training with SPC compression. In this section, we address the following question: is one factor (sequential training or tensorized compression) primarily responsible for the increased performance, or is the benefit due to synergy between the two?
Table 5: Test accuracy (%) on SPC-compressed networks, reported as (Seq., E2E) pairs.

Method     5%               10%              20%              40%
CP         (83.19, 84.02)   (88.5, 86.93)    (90.72, 88.75)   (89.75, 88.75)
Tk         (80.11, 83.57)   (86.75, 86.00)   (89.55, 88.03)   (91.3, 89.35)
TT         (80.77, 77.44)   (87.08, 82.92)   (89.14, 84.13)   (91.21, 86.64)
Table 6: Test accuracy (%) under sequential training, reported as (tSPC, SPC) pairs.

Method     5%               10%
CP         (89.86, 83.19)   (91.28, 88.5)
Tk         (71.34, 80.11)   (81.59, 86.73)
TT         (84.26, 80.77)   (87.89, 87.08)
Table 7: Test accuracy (%) of tSPC applied to the fully-connected layers of a LeNet-5 variant on MNIST.

Method     0.2%    0.5%    1%
CP         97.21   97.92   98.65
Tk         97.71   98.56   98.52
TT         97.69   98.43   98.63
Table 5 isolates the performance of the different training methods as applied to SPC-compressed decompositions. Other than at the very highest compression ratio (the 5% column in Table 5), sequential training consistently outperforms end-to-end training. Table 6 analogously isolates the performance of tSPC. Interestingly, if tSPC is used, the testing accuracy is restored even at very high compression ratios (Tucker remains an exception at high compression, due to the low rank internal structure discussed previously). This confirms the existence of extra invariance in the kernels, which is captured by tensorization combined with low rank approximation, but not by low rank approximation alone. Thus, our results show that tSPC and sequential training are symbiotic, and both are necessary to simultaneously obtain high accuracy and compression.
Convergence Rate
Compared to end-to-end training, an ancillary benefit of sequential training is much faster and more stable convergence. Figure 1 plots training error against the number of gradient updates for various methods. (This experiment uses SPC tensor methods, with network parameters compressed to 10% of the original.) There are three salient points. First, sequential training has very high training error in the beginning, while the "early" blocks of the network are being trained (and the rest of the network is left at its tensor decomposition initialization). Second, once the final block is trained, the training error drops to near its minimum immediately; in comparison, end-to-end training requires 50-100% more gradient updates to achieve stable performance. Finally, the results also show that for each block, sequential training converges very quickly (and nearly monotonically), which produces the stair-step pattern, since extra training of a block does not appreciably improve performance.
Performance on FullyConnected Layers
An extra advantage of tSPC compression is that it applies flexibly to fully-connected as well as convolutional layers of a neural network. Table 7 shows the results of applying tSPC compression with various tensor decompositions to a variant of the LeNet-5 network [18]. The convolutional layers of the LeNet-5 network were not compressed, trained, or updated in these experiments. The uncompressed network achieves 99.31% accuracy. Table 7 shows that the fully-connected layers can be compressed by 500x while losing only about 2% accuracy. In fact, reducing the dense layers to 1% of their original size reduces accuracy by less than 1%, demonstrating the extreme efficacy of tSPC compression when applied to fully-connected neural network layers.
Scalability
Finally, we show that our techniques scale to state-of-the-art large networks by evaluating performance on the ImageNet 2012 dataset with a 50-layer ResNet. Table 8 shows the accuracy of tSPC-TT decomposition with sequential training compared to SPC-TT trained end-to-end, for the ResNet compressed to 10% of its original size. The results are normalized to the accuracy of the original network trained over the same number of epochs. Table 8 shows that sequential training of tSPC-compressed networks converges faster than the alternatives. This is an important result because it empirically validates our hypotheses that (1) tSPC decompositions capture the invariance structure of the convolutional layers better than regular decompositions, (2) data reconstruction optimization is effective even on the largest networks and datasets, and (3) our proposed methods scale to state-of-the-art neural networks.
6 Conclusion
We described an efficient mechanism for compressing neural networks by tensorizing network layers. We implement tensor decompositions to find rank-R approximations of the tensorized kernel, potentially preserving invariance structures missed by decompositions applied to the original kernels. We extend vector/matrix operations to their higher order tensor counterparts, providing systematic notation for the tensorization of neural networks and higher order tensor decompositions.
References
 [1] Anima Anandkumar, Rong Ge, and Majid Janzamin. Guaranteed non-orthogonal tensor decomposition via alternating rank-1 updates. arXiv preprint arXiv:1402.5180, 2014.
 [2] François Chollet. Xception: Deep learning with depthwise separable convolutions. arXiv preprint arXiv:1610.02357, 2016.
 [3] Andrzej Cichocki, Namgil Lee, Ivan V. Oseledets, Anh Huy Phan, Qibin Zhao, and D. Mandic. Low-rank tensor networks for dimensionality reduction and large-scale optimization problems: Perspectives and challenges part 1. arXiv preprint arXiv:1609.00893, 2016.
 [4] Emily L Denton, Wojciech Zaremba, Joan Bruna, Yann LeCun, and Rob Fergus. Exploiting linear structure within convolutional networks for efficient evaluation. In Advances in neural information processing systems, pages 1269–1277, 2014.
 [5] Timur Garipov, Dmitry Podoprikhin, Alexander Novikov, and Dmitry Vetrov. Ultimate tensorization: compressing convolutional and fc layers alike. arXiv preprint arXiv:1611.03214, 2016.
 [6] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.
 [7] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Identity mappings in deep residual networks. In European Conference on Computer Vision, pages 630–645. Springer, 2016.
 [8] Christopher J. Hillar and Lek-Heng Lim. Most tensor problems are NP-hard. Journal of the ACM (JACM), 60(6):45, 2013.
 [9] Furong Huang, Jordan Ash, John Langford, and Robert Schapire. Learning deep resnet blocks sequentially using boosting theory. arXiv preprint arXiv:1706.04964, 2017.
 [10] Max Jaderberg, Andrea Vedaldi, and Andrew Zisserman. Speeding up convolutional neural networks with low rank expansions. arXiv preprint arXiv:1405.3866, 2014.
 [11] Valentin Khrulkov, Alexander Novikov, and Ivan Oseledets. Expressive power of recurrent neural networks. arXiv preprint arXiv:1711.00811, 2017.
 [12] YongDeok Kim, Eunhyeok Park, Sungjoo Yoo, Taelim Choi, Lu Yang, and Dongjun Shin. Compression of deep convolutional neural networks for fast and low power mobile applications. arXiv preprint arXiv:1511.06530, 2015.
 [13] Tamara G Kolda. Orthogonal tensor decompositions. SIAM Journal on Matrix Analysis and Applications, 23(1):243–255, 2001.
 [14] Tamara G Kolda and Brett W Bader. Tensor decompositions and applications. SIAM review, 51(3):455–500, 2009.
 [15] Jean Kossaifi, Aran Khanna, Zachary Lipton, Tommaso Furlanello, and Anima Anandkumar. Tensor contraction layers for parsimonious deep nets. In Computer Vision and Pattern Recognition Workshops (CVPRW), 2017 IEEE Conference on, pages 1940–1946. IEEE, 2017.
 [16] Jean Kossaifi, Zachary C Lipton, Aran Khanna, Tommaso Furlanello, and Anima Anandkumar. Tensor regression networks. arXiv preprint arXiv:1707.08308, 2017.
 [17] Vadim Lebedev, Yaroslav Ganin, Maksim Rakhuba, Ivan Oseledets, and Victor Lempitsky. Speeding-up convolutional neural networks using fine-tuned CP-decomposition. arXiv preprint arXiv:1412.6553, 2014.
 [18] Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
 [19] Behnam Neyshabur, Srinadh Bhojanapalli, David McAllester, and Nati Srebro. Exploring generalization in deep learning. In Advances in Neural Information Processing Systems, pages 5949–5958, 2017.
 [20] Alexander Novikov, Dmitrii Podoprikhin, Anton Osokin, and Dmitry P Vetrov. Tensorizing neural networks. In Advances in Neural Information Processing Systems, pages 442–450, 2015.
 [21] Ivan V. Oseledets. Tensor-train decomposition. SIAM Journal on Scientific Computing, 33(5):2295–2317, 2011.
 [22] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. In Advances in Neural Information Processing Systems, pages 91–99, 2015.
 [23] Xiangyu Zhang, Jianhua Zou, Xiang Ming, Kaiming He, and Jian Sun. Efficient and accurate approximations of nonlinear convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1984–1992, 2015.
Appendix: Tensorized Spectrum Preserving Compression for Neural Networks
Appendix A Notations
Symbols: Lower case letters are used to denote column vectors, upper case letters to denote matrices, and calligraphic letters to denote higher-order arrays (tensors). For a tensor, we refer to the number of indices as its order, each individual index as a mode, and the length along one mode as a dimension; thus an order-m tensor has one dimension at each of its m modes. Various tensor operations are used extensively in this paper: the (partial) outer product, tensor convolution, and tensor contraction (tensor multiplication) each have their own operator symbol, and each operator is equipped with subscripts and superscripts in practice to indicate the modes involved, for example a subscripted contraction operator denoting a mode-wise tensor contraction. Furthermore, a compound-operation symbol is used to denote simultaneously performing several such operations between two tensors, for example a tensor convolution and a tensor contraction at the same time.
Indexing: (1) Indices start from 0, and reverse indices count backwards from the end, so the first entry of a vector has index 0. (2) For an array (vector/matrix/tensor), a subscript denotes an entry or a subarray within the array, while a superscript indexes among a sequence of arrays; for example, a double subscript denotes the entry at a given row and column of a matrix. (3) The colon symbol ':' is used to slice an array; for example, it selects a frontal slice of a 3-order tensor. (4) Big-endian notation is adopted; specifically, the flattening operation maps a high-order tensor into a vector in big-endian order. (5) A permutation operation is used to reorder the modes of a tensor as needed; for example, it can convert one tensor into another whose modes are a permutation of the original's.
Appendix B Basic tensor operations
In this section, we introduce all necessary tensor operations required to build a tensor network. All operations used in this paper are linear, and simple enough that their derivatives (with respect to both operands) can be easily derived. Therefore, the backpropagation algorithm can be directly applied to train a network involving tensor operations.
Tensor contraction
Given a order tensor and another order tensor , which share the same dimension at mode of and mode of , i.e. . The mode contraction of and , denoted as , returns a order tensor of size , whose entries are computed as
(8) 
Notice that tensor contraction is a direct generalization of the matrix product to higher-order tensors, and the partial derivatives of the result with respect to the two operands can be easily calculated at the entry level:
(9)  
(10) 
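A minimal numerical example of mode contraction (illustrative shapes):

```python
import numpy as np

# Mode-(2, 0) contraction of a 2x3x4 tensor with a 4x5 matrix gives a 2x3x5
# tensor -- a direct generalization of the matrix product.
X = np.arange(24.0).reshape(2, 3, 4)
Y = np.arange(20.0).reshape(4, 5)
Z = np.einsum('abk,kc->abc', X, Y)
assert Z.shape == (2, 3, 5)
# Equivalent to flattening X to a matrix, multiplying, and reshaping back:
assert np.allclose(Z, (X.reshape(6, 4) @ Y).reshape(2, 3, 5))
```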
Tensor multiplication (Tensor Product)
For the special case where the second operand is a matrix, the operation is also known as tensor multiplication or the tensor product, and is defined slightly differently from tensor contraction. Given a tensor and a matrix, where the dimension of the tensor at the selected mode matches the number of rows in the matrix, the mode-wise tensor multiplication yields a tensor of the same order, whose entries are computed as
(11) 
The derivatives of with respect to and can be derived similarly to tensor contraction.
(12)  
(13) 
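The in-place behavior of the mode-n product can be seen in a small example (illustrative shapes):

```python
import numpy as np

# Mode-1 tensor multiplication of a 2x3x4 tensor with a 5x3 matrix: the
# contracted mode is replaced in place, giving a 2x5x4 tensor (rather than
# being moved to the end as in tensor contraction).
X = np.arange(24.0).reshape(2, 3, 4)
M = np.arange(15.0).reshape(5, 3)
Z = np.einsum('ajc,bj->abc', X, M)
assert Z.shape == (2, 5, 4)
```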
Tensor convolution
Given a tensor and another tensor, the mode-wise convolution of the two returns a tensor whose entries can be computed using the convolution operation defined for two vectors:
(14) 
Here we deliberately do not give an exact definition of the convolution operation. In fact, convolution can be defined in different ways depending on the use case; interestingly, the "convolution" used in convolutional neural networks actually performs correlation rather than convolution. The resulting dimension at the convolved mode depends on the chosen type of convolution; for example, the most commonly used "convolution" in neural networks performs non-circular correlation with zero padding and unit stride. For simplicity, we only derive the partial derivatives for the case where the operation is circular convolution defined for two vectors of the same length.
(15)  
(16) 
where computes a circulant matrix for a given vector.
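The circulant-matrix view used in these derivatives can be checked on a small example (the `circ` helper here is illustrative):

```python
import numpy as np

# The circulant-matrix view of circular convolution: circ(a) @ b equals the
# circular convolution of a and b (`circ` is an illustrative helper).
def circ(a):
    n = len(a)
    return np.array([[a[(i - j) % n] for j in range(n)] for i in range(n)])

a = np.array([1.0, 2.0, 3.0, 4.0])
b = np.array([0.0, 1.0, 0.0, 0.0])
conv = circ(a) @ b
# Convolving with a one-hot shift vector circularly shifts a by one position.
assert np.allclose(conv, [4.0, 1.0, 2.0, 3.0])
# Cross-check via the FFT: circular convolution is a pointwise product in the
# frequency domain.
assert np.allclose(conv, np.fft.ifft(np.fft.fft(a) * np.fft.fft(b)).real)
```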
Tensor outer product
Given a order tensor and another order tensor , the outer product of and , denoted , concatenates all the indices together and returns a order tensor of size , whose entries are computed as
(17) 
Tensor outer product is a direct generalization of the outer product of two vectors. The derivatives of the result with respect to