# Tensorized Spectrum Preserving Compression for Neural Networks

Jiahao Su
Department of Electrical & Computer Engineering
University of Maryland
jiahaosu@terpmail.umd.edu

Jingling Li
Department of Computer Science
University of Maryland
jingling@cs.umd.edu

Bobby Bhattacharjee
Department of Computer Science
University of Maryland
bobby@cs.umd.edu

Furong Huang
Department of Computer Science
University of Maryland
furongh@cs.umd.edu

###### Abstract

Modern neural networks can have tens of millions of parameters, and are often ill-suited for smartphones or IoT devices. In this paper, we describe an efficient mechanism for compressing large networks by tensorizing network layers: i.e., mapping layers onto high-order tensors, for which we introduce new tensor decomposition methods. Compared to previous compression methods, some of which also use tensor decomposition, our techniques preserve more of the network's invariance structure. Coupled with a new data reconstruction-based learning method, we show that tensorized compression outperforms existing techniques for both convolutional and fully-connected layers on state-of-the-art networks.

## 1 Introduction

Neural networks have become de-facto structures for many learning problems, including object detection [22], image classification [6], and many forms of prediction and forecasting [11]. Modern network structures provide unprecedented accuracy on difficult problems, and this success has led to neural network solutions being applied in many important domains such as security and autonomous driving.

Success and versatility notwithstanding, application of neural networks to real-world problems must still overcome several challenges: (1) New, sophisticated networks have tens of millions of parameters, which increases both training time and memory requirements. (2) The very high number of parameters can (and often does) lead to overfitting [19], making the network susceptible to noisy (perhaps adversarial/poisonous) training examples. (3) Even when trained, the large parameter space requires relatively capable devices to load the entire network into memory and to execute tests quickly. While network size is not a problem for testing outsourced to cloud-resident GPUs or even powerful desktops, neural network applications are being deployed on far more constrained devices, such as smartphones and IoT cameras, where testing time and network size are practical bottlenecks.

A recent approach to addressing the scalability of neural networks is to compress the network layers, which reduces both the number of parameters that have to be trained and the size of the network for testing. Compressing a successful network, while maintaining its accuracy, is non-trivial. Many approaches have been employed, including re-casting layers of the network as the concatenation of two layers with a (smaller sized) hidden layer, and performing SVD on the weight matrices of the original layers with configurable rank [10, 4, 23]. More recent approaches cast the network layers as tensors, and compress these layers using tensor decomposition methods [17, 12, 20, 5].

Tensors are higher-order generalizations of matrices. Much like methods such as SVD can be used to decompose a matrix, tensors can also be decomposed into smaller factors using different algorithmic methods [1, 13, 21]. Tensor decompositions have previously been used to find low rank approximations of higher order objects, and are useful in compressing neural network layers, both convolutional and fully-connected. The key insight is that the convolutional kernels of neural network layers (often) lie in low dimensional spaces, and tensor decompositions can effectively find low rank approximations of such kernels (and hence reduce the number of parameters by a factor polynomial in the tensor dimension). SVD can also be used to compress fully-connected layers, with the SVD rank controlling the degree of parameter compression.

While "regular" tensor decompositions can compress convolutional layers, in this paper we go further and consider tensorized networks and their decompositions. The basic idea is to map an existing kernel, say a 4-order tensor $\mathcal{K}$, into a higher $m$-order tensor $\mathcal{K}'$ with $m > 4$, a process we call tensorization. Tensorization guarantees that $\mathcal{K}'$ contains exactly the same elements as $\mathcal{K}$. Considering rank-$R$ tensor decompositions of $\mathcal{K}'$, the number of parameters needed is further reduced relative to decomposing $\mathcal{K}$ directly. We refer to this method of decomposing tensorized kernels as Tensorized Spectrum Preserving Compression (t-SPC).

t-SPC is particularly effective at capturing and preserving invariant structures often present in neural networks. Consider a vector of dimension $n$ that consists of $k$ repeated "sub-vectors" of length $n/k$. This vector can be mapped into an $(n/k) \times k$ matrix whose columns are identical. Such a matrix can be decomposed with rank reduced to 1 without losing information. We apply this basic idea in t-SPC.
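This reshaping argument can be checked numerically. The sketch below (hypothetical sizes) builds a vector of three repeated length-4 sub-vectors, reshapes it into a matrix with identical columns, and recovers it exactly from a rank-1 factorization:

```python
import numpy as np

# Hypothetical sizes: a length-12 vector made of 3 repeats of a length-4 sub-vector.
sub = np.array([1.0, 2.0, 3.0, 4.0])
v = np.tile(sub, 3)                       # 12 parameters

# "Tensorize": reshape into a 4 x 3 matrix whose columns are all identical.
M = v.reshape(3, 4).T

# The matrix has rank 1, so two small factors represent it exactly.
assert np.linalg.matrix_rank(M) == 1
M_rank1 = np.outer(sub, np.ones(3))       # 4 + 3 = 7 parameters instead of 12
assert np.allclose(M, M_rank1)
```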

In general, decomposing higher-order tensors is challenging, and known methods are not guaranteed to converge to the minimum-error decomposition [8]. Hence, we must fine-tune the tensorized decompositions in order to achieve good performance. Fortunately, the optimal performance for the layers that are compressed is already known (from the uncompressed network). We introduce a new training method based on data reconstruction, called sequential training, that minimizes the difference between the output of the uncompressed layers and that of the compressed layers. Unlike traditional end-to-end backpropagation, sequential training trains individual compressed "blocks" one at a time, reducing the resources required during training.

#### Contributions

We introduce new methods for tensorized decomposition of higher order kernels. In particular, we present a new architecture for tensorizing network weights/kernels such that CP and Tucker decompositions can be applied on the higher order tensor.

We introduce the sequential training scheme for minimizing the errors accruing from individual layers using data reconstruction optimization. Unlike end-to-end backpropagation, sequential training only fits one individual block of compressed layers into (GPU) memory during each training cycle, reducing memory requirements by a factor equal to the total number of blocks. This reduction is crucial, since each trained component's memory footprint (the product of the number of output channels, filters, and image height and width) can occupy many megabytes; this product is further scaled linearly by the number of images in a batch. Beyond memory occupancy, the computation requirements are also high, since all of these parameters must be updated during training.

Sequential training enables a space-time tradeoff allowing large networks to be trained on modest devices. As our results will show, parameters for sequential layers converge quickly in practice, and hence sequential training can reduce training time as well.

We implement our ideas and present an evaluation comparing t-SPC with existing methods. Our code will be available online. Our results on CIFAR-10 and MNIST show that together t-SPC and sequential training maintain high accuracy even at high compression rates, outperforming end-to-end training and non-tensorized compression. We also show that t-SPC can effectively compress fully-connected layers, and that the performance scales to large datasets such as ImageNet (2012) applied to ResNet-50.

#### Related work

Matrix and tensor decompositions have been used to reduce the number of parameters in convolutional kernels/weights. A straightforward way to exploit low-rank structure is via singular value decomposition, as proposed in [10, 4, 23]. The pioneering work in [17] shows that the parameters, in the form of a high-order tensor, can be directly factorized by CP decomposition. Tucker decomposition [12] and Tensor-Train decomposition [5] are also used to decompose the parameters directly. We compare our proposed method against these, and show that by exploiting the invariance structure within filters, we are able to obtain further benefit.

Prior work [20] also shows that parameters in fully-connected layers can be compressed by reshaping the parameters into a high-order tensor and applying Tensor-Train decomposition. Previous techniques [16, 15] have further reduced the number of parameters by introducing novel tensor operations that we show are equivalent to decomposing the reshaped parameters using CP and Tucker decomposition.

## 2 Tensor Preliminaries and Notations

Let $[n] = \{0, 1, \cdots, n-1\}$. For a vector $v \in \mathbb{R}^n$, denote its $i$-th element as $v_i$. Indices start from 0, and reverse indices start from $-1$; therefore the first entry of a vector $v$ is $v_0$ and the last one is $v_{-1}$. For a matrix $M$, denote the element at the $i$-th row and $j$-th column as $M_{i,j}$. A multi-dimensional array with $m$ dimensions is defined as an $m$-order (or $m$-mode) tensor, denoted as $\mathcal{T} \in \mathbb{R}^{I_0 \times \cdots \times I_{m-1}}$. For a tensor $\mathcal{T}$, its entry is denoted by $\mathcal{T}_{i_0, \cdots, i_{m-1}}$; its mode-$k$ fiber, which is a vector along mode-$k$, is denoted by $\mathcal{T}_{i_0, \cdots, i_{k-1}, :, i_{k+1}, \cdots, i_{m-1}}$; its mode-$(k,l)$ slice, which is a matrix along modes $k$ and $l$, is denoted analogously with two colons. We define a few generalized operations on arbitrary-order tensors in Table 1. These operations are combined to construct more complicated operations.
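As a concrete illustration of entries, fibers, and slices (hypothetical sizes; numpy's zero-based indexing matches the convention above):

```python
import numpy as np

# A 3-order tensor with dimensions 2 x 3 x 4 (hypothetical sizes).
T = np.arange(24).reshape(2, 3, 4)

entry = T[1, 2, 3]       # a single entry; indices start from 0
fiber = T[1, :, 3]       # a mode-1 fiber: a vector along mode 1
slice_ = T[1, :, :]      # a mode-(1,2) slice: a matrix along modes 1 and 2
last = T[-1, -1, -1]     # reverse indices: the last entry along every mode
```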

(1) [Compound Operation] Simultaneous multi-operations between two tensors. For example, given two 3-order tensors $\mathcal{T}^{(0)}$ and $\mathcal{T}^{(1)}$, we can define a compound operation $\mathcal{T}^{(0)}\ (\otimes_0^0 \circ \ast_1^1 \circ \times_2^2)\ \mathcal{T}^{(1)}$ that performs a mode-(0,0) partial outer product, a mode-(1,1) tensor convolution, and a mode-(2,2) tensor contraction simultaneously.

(2) [Compound Operation] Simultaneous operations between a tensor and multiple tensors. For example, given a 3-order tensor $\mathcal{T}$ and three tensors $\mathcal{T}^{(0)}$, $\mathcal{T}^{(1)}$ and $\mathcal{T}^{(2)}$, we can define a compound operation that performs the mode-(2,0) tensor contraction of $\mathcal{T}$ and $\mathcal{T}^{(0)}$, the mode-(0,0) tensor partial outer product of $\mathcal{T}$ and $\mathcal{T}^{(1)}$, and the mode-(1,0) tensor convolution of $\mathcal{T}$ and $\mathcal{T}^{(2)}$ simultaneously.

## 3 Compression of Neural Networks using Tensor Decompositions

Standard convolutional layer: In modern convolutional neural networks (CNNs), the major building block is the convolutional layer. A standard convolutional layer is parameterized by a 4-order kernel $\mathcal{K} \in \mathbb{R}^{H \times W \times S \times T}$, where $H$ and $W$ are the height and width of the filters (which are typically equal), and $S$ and $T$ are the numbers of input and output channels respectively. The convolutional layer maps a 3-order input tensor $\mathcal{U} \in \mathbb{R}^{X \times Y \times S}$ (with an input feature map of height $X$ and width $Y$) to another 3-order output tensor $\mathcal{V} \in \mathbb{R}^{X' \times Y' \times T}$ (with an output feature map of height $X'$ and width $Y'$):

$$\mathcal{V} = \mathcal{U}\ \left(\ast_0^0 \circ \ast_1^1 \circ \times_2^2\right)\ \mathcal{K} \tag{1}$$

The convolutional layer performs a compound operation consisting of two tensor convolutions and one tensor contraction between the input tensor and the parameter kernel. The number of parameters in a standard convolutional layer is $HWST$, and the number of operations needed to evaluate the output is $HWSTX'Y'$ (in terms of floating-point multiplications).

Neyshabur et al. [19] empirically showed that bounding the spectral norm of the weights increases the robustness of the neural network, and therefore the network generalizes better. Based on this observation, we find low-rank kernel approximations, thereby reducing the number of parameters, and implement four types of decompositions of the kernel $\mathcal{K}$: singular value decomposition (SVD), CANDECOMP/PARAFAC (CP) decomposition, Tucker (Tk) decomposition, and Tensor-Train (TT) decomposition. We derive the steps to evaluate the output of the compressed network (forward propagation) sequentially using the factors returned by the decompositions.

#### Kernel Decomposition

Consider a CP decomposition of the kernel $\mathcal{K}$: it is decomposed into three factor tensors $\mathcal{K}^{(0)}$, $\mathcal{K}^{(1)}$ and $\mathcal{K}^{(2)}$ as follows

$$\mathcal{K} = \mathbf{1} \times_0^2 \left(\mathcal{K}^{(1)} \otimes_2^1 \mathcal{K}^{(0)} \otimes_2^0 \mathcal{K}^{(2)}\right) \tag{2}$$

where $\mathbf{1}$ is an all-ones vector of length $R$. The structure of the convolutional layer is thus transformed into a new structure, which we call a CP-convolutional layer. The number of parameters is reduced from $HWST$ in the original layer to $R(HW + S + T)$, as the CP-convolutional layer only requires storing the three factor tensors $\mathcal{K}^{(0)} \in \mathbb{R}^{S \times R}$, $\mathcal{K}^{(1)} \in \mathbb{R}^{H \times W \times R}$ and $\mathcal{K}^{(2)} \in \mathbb{R}^{R \times T}$. We next derive the number of operations required during forward propagation.
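As a concrete (hypothetical) illustration of this reduction, assuming CP factor tensors of sizes $S \times R$, $H \times W \times R$ and $R \times T$:

```python
# Hypothetical layer sizes: 3x3 filters, 256 input and 256 output channels, rank 64.
H, W, S, T, R = 3, 3, 256, 256, 64

standard_params = H * W * S * T        # dense 4-order kernel: H*W*S*T
cp_params = R * (H * W + S + T)        # three CP factor tensors
compression_ratio = standard_params / cp_params
```

For these sizes the dense kernel stores 589,824 values while the CP factors store 33,344, a roughly 17x reduction before any tensorization.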

#### Forward Propagation

We substitute expression (2) into (1), and break the procedure of evaluating $\mathcal{V}$ into three steps, which require $SRXY$, $HWRX'Y'$ and $RTX'Y'$ operations respectively:

where $\mathcal{U}^{(0)}$ and $\mathcal{U}^{(1)}$ are two intermediate tensors. Effectively, the standard convolutional layer is transformed into three sub-layers: the first and third steps can be interpreted as convolutional layers, and the second step is known as a depthwise convolutional layer [2]. The number of operations needed during forward propagation is $R(SXY + HWX'Y' + TX'Y')$.
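The three sub-layers can be sketched in numpy as follows. This is an illustrative sketch with hypothetical sizes, "valid" cross-correlation, and loop-based depthwise convolution, not the paper's implementation; the final check confirms the three steps equal one standard convolution with the reconstructed rank-$R$ kernel:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: S input channels, T output channels, HxW filters, rank R.
S, T, H, W, R = 6, 8, 3, 3, 4
X, Y = 10, 10                                  # input feature map size

U = rng.standard_normal((X, Y, S))             # input tensor
K0 = rng.standard_normal((S, R))               # input-channel factor
K1 = rng.standard_normal((H, W, R))            # spatial (depthwise) factor
K2 = rng.standard_normal((R, T))               # output-channel factor

# Step 1: 1x1 convolution = contraction over input channels (S -> R).
U0 = np.tensordot(U, K0, axes=([2], [0]))      # shape (X, Y, R)

# Step 2: depthwise "convolution" (valid cross-correlation) per rank channel.
Xo, Yo = X - H + 1, Y - W + 1
U1 = np.zeros((Xo, Yo, R))
for r in range(R):
    for i in range(Xo):
        for j in range(Yo):
            U1[i, j, r] = np.sum(U0[i:i+H, j:j+W, r] * K1[:, :, r])

# Step 3: 1x1 convolution = contraction over the rank mode (R -> T).
V = np.tensordot(U1, K2, axes=([2], [0]))      # shape (Xo, Yo, T)

# Sanity check: this equals a standard convolution with the reconstructed
# rank-R kernel K[h, w, s, t] = sum_r K1[h, w, r] * K0[s, r] * K2[r, t].
K = np.einsum('hwr,sr,rt->hwst', K1, K0, K2)
V_ref = np.zeros_like(V)
for i in range(Xo):
    for j in range(Yo):
        V_ref[i, j] = np.einsum('hws,hwst->t', U[i:i+H, j:j+W], K)
assert np.allclose(V, V_ref)
```

The kernel is never reconstructed in practice; the `K` above exists only for the correctness check.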

Due to space limits, we defer the compression procedures for convolutional layers using SVD, Tk and TT, and their analysis, to Appendix D, but we summarize the number of parameters and operations for each scheme in Table 2. The compression schemes using tensor decompositions for fully-connected layers are deferred to Appendix E.

## 4 Tensorized Spectrum Preserving Compression of Neural Networks

#### Intuition for Tensorization

Consider a matrix $M \in \mathbb{R}^{m \times n}$ that is the concatenation of $n$ copies of a vector $v \in \mathbb{R}^m$. Obviously, $M$ as a rank-1 matrix can be perfectly represented by the vectors $v$ and $\mathbf{1}_n$, resulting in $m + n$ parameters. However, if we construct a tensor $\mathcal{T} \in \mathbb{R}^{m \times p \times q}$ (with $n = pq$) by "reshaping" the matrix into a 3-order tensor, the tensor can be factorized by a rank-1 CP decomposition as $\mathcal{T} = v \otimes \mathbf{1}_p \otimes \mathbf{1}_q$, which shows that $M$ can be expressed with only $m + p + q$ parameters. This argument generalizes to kernel tensors in convolutional layers: reshaping the convolution kernels to higher-order tensors can further exploit invariances in the kernel, and thus further compress the network.
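The arithmetic above can be verified directly (hypothetical sizes $m = 4$, $p = 2$, $q = 3$):

```python
import numpy as np

# Hypothetical sizes: v has length m = 4 and is repeated n = p * q = 6 times.
m, p, q = 4, 2, 3
v = np.array([1.0, -2.0, 0.5, 3.0])
M = np.tile(v[:, None], (1, p * q))    # m x n, rank 1: m + n = 10 parameters

# Reshape the repeat mode into two modes: a 3-order tensor of shape (m, p, q).
Tsr = M.reshape(m, p, q)

# Rank-1 CP form: outer product of v with two all-ones vectors,
# i.e. only m + p + q = 9 parameters.
cp = np.einsum('i,j,k->ijk', v, np.ones(p), np.ones(q))
assert np.allclose(Tsr, cp)
```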

###### Definition 4.1 (Tensorized Convolutional Layer).

A tensorized convolutional layer with a kernel $\mathcal{K}'$ is defined as

 (3)

where the input is $\mathcal{U}'$ and the output is $\mathcal{V}'$.

Given the tensorized convolutional layer defined in Definition 4.1, we establish its equivalence to the standard convolutional layer.

###### Lemma 4.2.

If $\mathcal{U}'$ and $\mathcal{K}'$ are reshaped versions of $\mathcal{U}$ and $\mathcal{K}$, then $\mathcal{V}'$ is a reshaped version of $\mathcal{V}$.

###### Remark.

The tensorized convolutional layer defined in Equation 3 is equivalent to a standard convolutional layer defined in Equation 1.

### 4.1 Compression of Tensorized Convolutional Layer

Now we decompose the kernel using tensor decompositions to compress the tensorized convolutional layer. The intuition behind these decompositions is to reduce the number of parameters while preserving some invariance structure in the kernel. Before tensorization, the invariance structure across the filters might not be picked up by decomposing $\mathcal{K}$. However, after tensorization, the invariance structure might be picked up by decomposing $\mathcal{K}'$.

Kernel Decomposition Consider a CP decomposition of the kernel $\mathcal{K}'$: it is decomposed into $m+1$ factors

$$\mathcal{K}' = \mathbf{1} \times_0^0 \left(\mathcal{K}'^{(0)} \otimes_0^0 \cdots \otimes_0^0 \mathcal{K}'^{(m)}\right) \tag{4}$$

where $\mathcal{K}'^{(0)}, \cdots, \mathcal{K}'^{(m)}$ are the factor tensors and $\mathbf{1}$ is an all-ones vector of length $R$. A CP decomposition of the kernel transforms the structure of the tensorized convolutional layer in (3) into a new tensorized CP-convolutional layer.

The number of parameters is reduced from $HWST$ in the standard convolutional layer, to $R(HW + S + T)$ in the CP-convolutional layer, and further still in the tensorized CP-convolutional layer, since each higher-order factor is small. Now the question is how many operations are needed in the forward propagation process.

#### Forward Propagation

The procedure of evaluating $\hat{\mathcal{V}}'$ requires $m+1$ steps: the first $m$ steps contract the input with the factors, and the last step performs the remaining convolutions:

$$\text{Step } 1: \quad \mathcal{U}'^{(0)} = \mathrm{swapaxes}\left(\mathcal{U}' \times_2^1 \mathcal{K}'^{(0)}\right) \tag{5}$$
$$\text{Step } l+1: \quad \mathcal{U}'^{(l)} = \mathcal{U}'^{(l-1)}\ \left(\otimes_0^0 \circ \times_3^1\right)\ \mathcal{K}'^{(l)} \tag{6}$$
$$\text{Step } m+1: \quad \hat{\mathcal{V}}' = \mathcal{U}'^{(m-1)}\ \left(\times_0^0 \circ \ast_1^1 \circ \ast_2^2\right)\ \mathcal{K}'^{(m)} \tag{7}$$

where the $\mathcal{U}'^{(l)}$ are intermediate tensors, and $\mathrm{swapaxes}$ permutes the ordering of modes to match the ones in the output. Effectively, the tensorized convolutional layer is transformed into $m+1$ sub-layers, where the first one can be interpreted as a convolutional layer, the intermediate ones as depthwise convolutional layers, and the last step as a standard convolutional layer. Note that we interact the input tensor with the factors sequentially in order to evaluate the output, so that reconstructing the kernel is avoided. The number of operations needed during forward propagation is summarized in Table 2.

Due to space limits, we defer the detailed description of tensorized convolutional compression layers using Tk and TT to Appendix F. In Table 3, we list all three compression mechanisms and their decomposition forms; the numbers of parameters and operations are summarized in Table 2.

#### Back Propagation for Data Reconstruction Optimization

Tensor decompositions provide weight estimates in the tensorized convolutional layers. However, convergence is not guaranteed in general due to the hardness of general tensor decomposition [8]. Model parameters can be further fine-tuned using standard backpropagation and SGD by minimizing the difference between the output of the uncompressed convolutional layer and the output of the tensorized CP-convolutional layer. We refer to this scheme as Data Reconstruction Optimization.

Using the tensorized CP-convolutional layer as an example, we define a loss function $L(\mathcal{V}', \hat{\mathcal{V}}')$ (usually the mean squared error after the activation function), where $\mathcal{V}'$ is the output of the tensorized convolutional layer, and $\hat{\mathcal{V}}'$ is the output of the tensorized CP-convolutional layer.

In order to backpropagate to minimize the loss $L$, we need to derive the partial derivatives with respect to the factors $\mathcal{K}'^{(l)}$. As we show in Appendix B, all tensor operations are linear, and therefore the derivatives can easily be derived and the backpropagation implemented using existing libraries.

#### Training the Entire Network

Analogous to [9], our training procedure, called sequential training, trains the network layer by layer, minimizing the squared error between the output of each uncompressed block and the output of its compressed counterpart.
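The following is a schematic numpy sketch of data-reconstruction fitting for a single block. For clarity it uses a plain low-rank linear block and one least-squares refit rather than the tensorized layers and SGD used in the paper; all sizes and names are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: one uncompressed linear block W (64 x 64), rank-8 compression.
n, r, batch = 64, 8, 256
W = rng.standard_normal((n, n))
X = rng.standard_normal((batch, n))    # activations entering this block
Y = X @ W.T                            # "teacher" output of the uncompressed block

# Initialize the compressed block A @ B from a truncated SVD of W ...
Uw, s, Vt = np.linalg.svd(W)
A = Uw[:, :r] * s[:r]                  # (n, r)
B = Vt[:r]                             # (r, n)

# ... then fine-tune only this block to reproduce the teacher output,
# minimizing ||Y - X @ B.T @ A.T||^2 (one least-squares pass on A shown here).
A = np.linalg.lstsq(X @ B.T, Y, rcond=None)[0].T
relative_error = np.linalg.norm(Y - X @ B.T @ A.T) / np.linalg.norm(Y)
```

Because each block is fitted against the frozen teacher outputs, only that block's factors (and its input/output activations) need to be resident in memory at any time.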

## 5 Experiments

We present an empirical study of both the compression and accuracy that can be achieved by our methods. By default, we use the CIFAR-10 dataset on the ResNet-34 network [7]; we evaluate fully-connected layers using the MNIST dataset, and scalability of our method using ImageNet (2012) dataset on the ResNet-50 network [7].

We refer to traditional backpropagation-based training of the network as end-to-end (E2E), and to our proposed technique that trains each block individually as sequential training (Seq.). We refer to spectrum-preserving compression using tensor decompositions directly on kernels as SPC. We refer to SPC compression applied to a specific decomposition, e.g., CP decomposition, as SPC-CP. Analogously, we refer to our proposed tensorized spectrum-preserving compression as t-SPC. Similarly, t-SPC-CP refers to t-SPC applied to CP decompositions.

#### Tensorized Spectrum Preserving Compression with Sequential Training

Our primary contribution in this paper is a new framework that compresses networks via tensorized spectrum preserving compression, together with a demonstration that sequential training of these new networks can maintain high accuracy even when the networks themselves are very highly compressed. Table 4 tests this hypothesis, comparing SPC-compressed networks with end-to-end training [17, 12, 5] to t-SPC with sequential training, on the CIFAR-10 dataset. Using CP decomposition, reducing the original network to 10% of its original size while retaining end-to-end training reduces the accuracy to 86.93%. However, with t-SPC and sequential training, with the same number of parameters (10% of the original), testing accuracy increases to 91.28%. We observe similar trends (higher compression and higher accuracy) for Tensor-Train decomposition. The structure of the Tucker decomposition (see Appendix F) makes it less effective at very high compression, since the "spine" of the network reduces to very low rank, effectively losing necessary information. Increasing the network size to 20% of the original provides reasonable performance on CIFAR-10 for Tucker as well.

#### Sequential Training, tensorized compression, or Both?

Table 4 shows that sequential training combined with Tensorized Spectrum Preserving Compression outperforms end-to-end training with SPC compression. In this section, we address the following question: is one factor (sequential training or tensorized compression) primarily responsible for increased performance, or is the benefit due to synergy between the two?

Table 5 isolates the performance of the different training methods as they are applied to SPC-compressed decompositions. Other than at very high compression ratios (the 5% column in Table 5), sequential training consistently outperforms end-to-end training. Table 7 analogously isolates the performance of t-SPC. Interestingly, if t-SPC is used, the testing accuracy is restored even at very high compression ratios (Tucker remains an exception at high compression, due to the low-rank internal structure that we previously discussed). This confirms the existence of extra invariance in the kernels, which is picked up by tensorization combined with low-rank approximation, but not by low-rank approximation by itself. Thus, our results show that t-SPC and sequential training are symbiotic, and both are necessary to simultaneously obtain high accuracy and compression.

#### Convergence Rate

Compared to end-to-end training, an ancillary benefit of sequential training is faster and more stable convergence. Figure 1 plots training error against the number of gradient updates for various methods. (This experiment is for SPC tensor methods, with network parameters compressed to 10% of the original.) There are three salient points. First, sequential training has very high training error in the beginning, while the "early" blocks of the network are being trained (and the rest of the network is left at its tensor-decomposition initialization). However, as the final block is trained (towards the right of the figure), the training error drops to nearly its minimum immediately. In comparison, end-to-end training requires 50–100% more gradient updates to achieve stable performance. Finally, the result also shows that for each block, sequential training converges very quickly (and nearly monotonically), which results in the stair-step pattern, since extra training of a block does not appreciably improve performance.

#### Performance on Fully-Connected Layers

An additional advantage of t-SPC compression is that it applies flexibly to fully-connected as well as convolutional layers of a neural network. Table 7 shows the results of applying t-SPC compression with various tensor decompositions on a variant of the LeNet-5 network [18]. The convolutional layers of the LeNet-5 network were not compressed, trained or updated in these experiments. The uncompressed network achieves 99.31% accuracy. Table 7 shows that the fully-connected layers can be compressed by a factor of 500 while losing about 2% accuracy. In fact, reducing the dense layers to 1% of their original size reduces accuracy by less than 1%, demonstrating the extreme efficacy of t-SPC compression when applied to fully-connected neural network layers.

#### Scalability

Finally, we show that our techniques scale to state-of-the-art large networks by evaluating performance on the ImageNet 2012 dataset with a 50-layer ResNet. Table 8 shows the accuracy of t-SPC-TT with sequential training compared to SPC-TT trained end-to-end, for the ResNet compressed to 10% of its original size. The results are normalized to the accuracy of the original network trained over the same number of epochs. Table 8 shows that sequential training of t-SPC-compressed networks converges faster than the alternatives. This is an important result because it empirically validates our hypotheses that (1) t-SPC decompositions capture the invariance structure of the convolutional layers better than regular decompositions, (2) data reconstruction optimization is effective even on the largest networks and datasets, and (3) our proposed methods scale to state-of-the-art neural networks.

## 6 Conclusion

We describe an efficient mechanism for compressing neural networks by tensorizing network layers. We implement tensor decomposition to find the rank-R approximations of the tensorized kernel, potentially preserving invariance structures missed by implementing decompositions on the original kernels. We extend vector/matrix operations to their higher order tensor counterparts, providing systematic notations for tensorization of neural networks and higher order tensor decompositions.

## References

• [1] Anima Anandkumar, Rong Ge, and Majid Janzamin. Guaranteed non-orthogonal tensor decomposition via alternating rank-1 updates. arXiv preprint arXiv:1402.5180, 2014.
• [2] François Chollet. Xception: Deep learning with depthwise separable convolutions. arXiv preprint arXiv:1610.02357, 2016.
• [3] Andrzej Cichocki, Namgil Lee, Ivan V Oseledets, Anh Huy Phan, Qibin Zhao, and D Mandic. Low-rank tensor networks for dimensionality reduction and large-scale optimization problems: Perspectives and challenges part 1. arXiv preprint arXiv:1609.00893, 2016.
• [4] Emily L Denton, Wojciech Zaremba, Joan Bruna, Yann LeCun, and Rob Fergus. Exploiting linear structure within convolutional networks for efficient evaluation. In Advances in neural information processing systems, pages 1269–1277, 2014.
• [5] Timur Garipov, Dmitry Podoprikhin, Alexander Novikov, and Dmitry Vetrov. Ultimate tensorization: compressing convolutional and fc layers alike. arXiv preprint arXiv:1611.03214, 2016.
• [6] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.
• [7] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Identity mappings in deep residual networks. In European Conference on Computer Vision, pages 630–645. Springer, 2016.
• [8] Christopher J Hillar and Lek-Heng Lim. Most tensor problems are np-hard. Journal of the ACM (JACM), 60(6):45, 2013.
• [9] Furong Huang, Jordan Ash, John Langford, and Robert Schapire. Learning deep resnet blocks sequentially using boosting theory. arXiv preprint arXiv:1706.04964, 2017.
• [10] Max Jaderberg, Andrea Vedaldi, and Andrew Zisserman. Speeding up convolutional neural networks with low rank expansions. arXiv preprint arXiv:1405.3866, 2014.
• [11] Valentin Khrulkov, Alexander Novikov, and Ivan Oseledets. Expressive power of recurrent neural networks. arXiv preprint arXiv:1711.00811, 2017.
• [12] Yong-Deok Kim, Eunhyeok Park, Sungjoo Yoo, Taelim Choi, Lu Yang, and Dongjun Shin. Compression of deep convolutional neural networks for fast and low power mobile applications. arXiv preprint arXiv:1511.06530, 2015.
• [13] Tamara G Kolda. Orthogonal tensor decompositions. SIAM Journal on Matrix Analysis and Applications, 23(1):243–255, 2001.
• [14] Tamara G Kolda and Brett W Bader. Tensor decompositions and applications. SIAM review, 51(3):455–500, 2009.
• [15] Jean Kossaifi, Aran Khanna, Zachary Lipton, Tommaso Furlanello, and Anima Anandkumar. Tensor contraction layers for parsimonious deep nets. In Computer Vision and Pattern Recognition Workshops (CVPRW), 2017 IEEE Conference on, pages 1940–1946. IEEE, 2017.
• [16] Jean Kossaifi, Zachary C Lipton, Aran Khanna, Tommaso Furlanello, and Anima Anandkumar. Tensor regression networks. arXiv preprint arXiv:1707.08308, 2017.
• [17] Vadim Lebedev, Yaroslav Ganin, Maksim Rakhuba, Ivan Oseledets, and Victor Lempitsky. Speeding-up convolutional neural networks using fine-tuned cp-decomposition. arXiv preprint arXiv:1412.6553, 2014.
• [18] Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
• [19] Behnam Neyshabur, Srinadh Bhojanapalli, David McAllester, and Nati Srebro. Exploring generalization in deep learning. In Advances in Neural Information Processing Systems, pages 5949–5958, 2017.
• [20] Alexander Novikov, Dmitrii Podoprikhin, Anton Osokin, and Dmitry P Vetrov. Tensorizing neural networks. In Advances in Neural Information Processing Systems, pages 442–450, 2015.
• [21] Ivan V Oseledets. Tensor-train decomposition. SIAM Journal on Scientific Computing, 33(5):2295–2317, 2011.
• [22] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster r-cnn: Towards real-time object detection with region proposal networks. In Advances in neural information processing systems, pages 91–99, 2015.
• [23] Xiangyu Zhang, Jianhua Zou, Xiang Ming, Kaiming He, and Jian Sun. Efficient and accurate approximations of nonlinear convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1984–1992, 2015.

Appendix: Tensorized Spectrum Preserving Compression for Neural Networks

## Appendix A Notations

Symbols: Lower case letters (e.g. $v$) are used to denote column vectors, upper case letters (e.g. $M$) to denote matrices, and calligraphic letters (e.g. $\mathcal{T}$) to denote higher-order arrays (tensors). For a tensor, we refer to the number of indices as its order, each individual index as a mode, and the length at one mode as a dimension. Therefore, we say that $\mathcal{T} \in \mathbb{R}^{I_0 \times \cdots \times I_{m-1}}$ is an $m$-order tensor which has dimension $I_k$ at mode-$k$. Various tensor operations are used extensively in this paper: the (partial) outer product is denoted as $\otimes$, tensor convolution is denoted as $\ast$, and $\times$ denotes tensor contraction or tensor multiplication. Each of these operators is equipped with a subscript and superscript when used in practice; for example, $\times_k^l$ denotes mode-$(k,l)$ tensor contraction. Furthermore, the symbol $\circ$ is used to construct compound operations; for example, $(\ast_1^1 \circ \times_2^2)$ is a compound operator simultaneously performing tensor convolution and tensor contraction between two tensors.

Indexing: (1) Indices start from 0, and reverse indices start from $-1$. Therefore the first entry of a vector $v$ is $v_0$ and the last one is $v_{-1}$. (2) For an array (vector/matrix/tensor), a subscript is used to denote an entry or a subarray within the array, while a superscript indexes among a sequence of arrays. For example, $M_{i,j}$ denotes the entry located at the $i$-th row and $j$-th column of a matrix $M$, and $M^{(k)}$ is the $k$-th matrix among a sequence of matrices. (3) The colon symbol ':' is used to slice an array. For example, $\mathcal{T}_{:,:,k}$ denotes a frontal slice of a 3-order tensor $\mathcal{T}$. (4) Big-endian notation is adopted; specifically, the operation $\mathrm{vec}(\cdot)$ flattens a high-order tensor into a vector accordingly. (5) The operation $\mathrm{swapaxes}(\cdot)$ is used to permute the ordering of modes of a tensor as needed.

## Appendix B Basic tensor operations

In this section, we introduce all necessary tensor operations required to build a tensor network. All operations used in this paper are linear, and simple enough that their derivatives (with respect to both operands) can be easily derived. Therefore, the backpropagation algorithm can be directly applied to train a network involving tensor operations.

#### Tensor contraction

Given an $m$-order tensor $\mathcal{T}^{(0)} \in \mathbb{R}^{I_0 \times \cdots \times I_{m-1}}$ and an $n$-order tensor $\mathcal{T}^{(1)} \in \mathbb{R}^{J_0 \times \cdots \times J_{n-1}}$ which share the same dimension at mode-$k$ of $\mathcal{T}^{(0)}$ and mode-$l$ of $\mathcal{T}^{(1)}$, i.e. $I_k = J_l$, the mode-$(k,l)$ contraction of $\mathcal{T}^{(0)}$ and $\mathcal{T}^{(1)}$, denoted as $\mathcal{T} = \mathcal{T}^{(0)} \times_k^l \mathcal{T}^{(1)}$, returns an $(m+n-2)$-order tensor of size $I_0 \times \cdots \times I_{k-1} \times I_{k+1} \times \cdots \times I_{m-1} \times J_0 \times \cdots \times J_{l-1} \times J_{l+1} \times \cdots \times J_{n-1}$, whose entries are computed as

$$\mathcal{T}_{i_0,\cdots,i_{k-1},i_{k+1},\cdots,i_{m-1},j_0,\cdots,j_{l-1},j_{l+1},\cdots,j_{n-1}} = \sum_{r=0}^{I_k-1} \mathcal{T}^{(0)}_{i_0,\cdots,i_{k-1},r,i_{k+1},\cdots,i_{m-1}}\ \mathcal{T}^{(1)}_{j_0,\cdots,j_{l-1},r,j_{l+1},\cdots,j_{n-1}} \tag{8}$$

Notice that tensor contraction is a direct generalization of the matrix product to higher-order tensors, and the partial derivatives of $\mathcal{T}$ with respect to $\mathcal{T}^{(0)}$ and $\mathcal{T}^{(1)}$ can be easily calculated at the entry level:

$$\frac{\partial\, \mathcal{T}_{i_0,\cdots,i_{k-1},i_{k+1},\cdots,i_{m-1},j_0,\cdots,j_{l-1},j_{l+1},\cdots,j_{n-1}}}{\partial\, \mathcal{T}^{(0)}_{i_0,\cdots,i_{k-1},r,i_{k+1},\cdots,i_{m-1}}} = \mathcal{T}^{(1)}_{j_0,\cdots,j_{l-1},r,j_{l+1},\cdots,j_{n-1}} \tag{9}$$
$$\frac{\partial\, \mathcal{T}_{i_0,\cdots,i_{k-1},i_{k+1},\cdots,i_{m-1},j_0,\cdots,j_{l-1},j_{l+1},\cdots,j_{n-1}}}{\partial\, \mathcal{T}^{(1)}_{j_0,\cdots,j_{l-1},r,j_{l+1},\cdots,j_{n-1}}} = \mathcal{T}^{(0)}_{i_0,\cdots,i_{k-1},r,i_{k+1},\cdots,i_{m-1}} \tag{10}$$
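Equations (8)–(10) can be checked numerically. The sketch below (hypothetical sizes) compares a mode-(1,0) contraction against `numpy.tensordot` and verifies the derivative by finite differences:

```python
import numpy as np

rng = np.random.default_rng(0)

# Mode-(1,0) contraction of a (2,3,4) tensor with a (3,5) tensor: shared dim 3.
T0 = rng.standard_normal((2, 3, 4))
T1 = rng.standard_normal((3, 5))
T = np.tensordot(T0, T1, axes=([1], [0]))        # shape (2, 4, 5)

# Entry-wise: T[i0, i2, j1] = sum_r T0[i0, r, i2] * T1[r, j1].
manual = np.einsum('irk,rj->ikj', T0, T1)
assert np.allclose(T, manual)

# Finite-difference check of the derivative: perturbing T0[0, 1, 2]
# changes T[0, 2, :] by eps * T1[1, :], as the contraction is linear.
eps = 1e-6
T0p = T0.copy()
T0p[0, 1, 2] += eps
Tp = np.tensordot(T0p, T1, axes=([1], [0]))
grad = (Tp[0, 2, :] - T[0, 2, :]) / eps
assert np.allclose(grad, T1[1, :], atol=1e-4)
```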

#### Tensor multiplication (Tensor Product)

For the special case that the second operand is a matrix, the operation is also known as tensor multiplication or tensor product, and is defined slightly differently from tensor contraction. Given an $m$-order tensor $\mathcal{U} \in \mathbb{R}^{I_0 \times \cdots \times I_{m-1}}$ and a matrix $M \in \mathbb{R}^{I_k \times J}$, where the dimension at mode-$k$ of $\mathcal{U}$ matches the number of rows of $M$, the mode-$k$ tensor multiplication of $\mathcal{U}$ and $M$, denoted as $\mathcal{V} = \mathcal{U} \times_k M$, yields an $m$-order tensor of size $I_0 \times \cdots \times I_{k-1} \times J \times I_{k+1} \times \cdots \times I_{m-1}$, whose entries are computed as

$$\mathcal{V}_{i_0,\cdots,i_{k-1},j,i_{k+1},\cdots,i_{m-1}} = \sum_{r=0}^{I_k-1} \mathcal{U}_{i_0,\cdots,i_{k-1},r,i_{k+1},\cdots,i_{m-1}}\ M_{r,j} \tag{11}$$

The derivatives of $\mathcal{V}$ with respect to $\mathcal{U}$ and $M$ can be derived similarly to tensor contraction:

$$\frac{\partial\, \mathcal{V}_{i_0,\cdots,i_{k-1},j,i_{k+1},\cdots,i_{m-1}}}{\partial\, \mathcal{U}_{i_0,\cdots,i_{k-1},r,i_{k+1},\cdots,i_{m-1}}} = M_{r,j} \tag{12}$$
$$\frac{\partial\, \mathcal{V}_{i_0,\cdots,i_{k-1},j,i_{k+1},\cdots,i_{m-1}}}{\partial\, M_{r,j}} = \mathcal{U}_{i_0,\cdots,i_{k-1},r,i_{k+1},\cdots,i_{m-1}} \tag{13}$$

#### Tensor convolution

Given an $m$-order tensor $\mathcal{T}^{(0)} \in \mathbb{R}^{I_0 \times \cdots \times I_{m-1}}$ and an $n$-order tensor $\mathcal{T}^{(1)} \in \mathbb{R}^{J_0 \times \cdots \times J_{n-1}}$, the mode-$(k,l)$ convolution of $\mathcal{T}^{(0)}$ and $\mathcal{T}^{(1)}$, denoted as $\mathcal{T} = \mathcal{T}^{(0)} \ast_k^l \mathcal{T}^{(1)}$, returns an $(m+n-1)$-order tensor whose dimension at mode-$k$ depends on the chosen type of convolution. The entries of $\mathcal{T}$ can be computed using the convolution operation $\ast$ defined for two vectors:

$$\mathcal{T}_{i_0,\cdots,i_{k-1},:,i_{k+1},\cdots,i_{m-1},j_0,\cdots,j_{l-1},j_{l+1},\cdots,j_{n-1}} = \mathcal{T}^{(0)}_{i_0,\cdots,i_{k-1},:,i_{k+1},\cdots,i_{m-1}} \ast \mathcal{T}^{(1)}_{j_0,\cdots,j_{l-1},:,j_{l+1},\cdots,j_{n-1}} \tag{14}$$

Here we deliberately do not give an exact definition of the convolution operation $\ast$. In fact, convolution can be defined in different ways depending on the use case; interestingly, the "convolution" used in convolutional neural networks actually performs correlation instead of convolution. The resulting dimension at mode-$k$ depends on the chosen type of convolution; for example, the most commonly used "convolution" in neural networks performs non-circular correlation with zero padding and unit stride. For simplicity, we only derive the partial derivatives of $\mathcal{T}$ with respect to $\mathcal{T}^{(0)}$ and $\mathcal{T}^{(1)}$ in the case that the operation $\ast$ is circular convolution defined for two vectors of the same length:

$$\frac{\partial\, \mathcal{T}_{i_0,\cdots,i_{k-1},:,i_{k+1},\cdots,i_{m-1},j_0,\cdots,j_{l-1},j_{l+1},\cdots,j_{n-1}}}{\partial\, \mathcal{T}^{(0)}_{i_0,\cdots,i_{k-1},:,i_{k+1},\cdots,i_{m-1}}} = \mathrm{Cir}\left(\mathcal{T}^{(1)}_{j_0,\cdots,j_{l-1},:,j_{l+1},\cdots,j_{n-1}}\right)^{\top} \tag{15}$$
$$\frac{\partial\, \mathcal{T}_{i_0,\cdots,i_{k-1},:,i_{k+1},\cdots,i_{m-1},j_0,\cdots,j_{l-1},j_{l+1},\cdots,j_{n-1}}}{\partial\, \mathcal{T}^{(1)}_{j_0,\cdots,j_{l-1},:,j_{l+1},\cdots,j_{n-1}}} = \mathrm{Cir}\left(\mathcal{T}^{(0)}_{i_0,\cdots,i_{k-1},:,i_{k+1},\cdots,i_{m-1}}\right)^{\top} \tag{16}$$

where $\mathrm{Cir}(\cdot)$ computes the circulant matrix of a given vector.
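For intuition, the circulant structure of the derivative follows from the fact that circular convolution itself is a circulant matrix–vector product, which can be checked numerically (the helper `cir` below is illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

def cir(v):
    """Circulant matrix of v: column j is v cyclically shifted down by j."""
    return np.column_stack([np.roll(v, j) for j in range(len(v))])

a = rng.standard_normal(5)
b = rng.standard_normal(5)

# Circular convolution of two equal-length vectors, computed via the FFT ...
c = np.fft.ifft(np.fft.fft(a) * np.fft.fft(b)).real

# ... equals a matrix-vector product with the circulant matrix of either
# operand, which is why the derivative of the convolution is circulant.
assert np.allclose(c, cir(a) @ b)
assert np.allclose(c, cir(b) @ a)
```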

#### Tensor outer product

Given an $m$-order tensor $\mathcal{T}^{(0)} \in \mathbb{R}^{I_0 \times \cdots \times I_{m-1}}$ and an $n$-order tensor $\mathcal{T}^{(1)} \in \mathbb{R}^{J_0 \times \cdots \times J_{n-1}}$, the outer product of $\mathcal{T}^{(0)}$ and $\mathcal{T}^{(1)}$, denoted $\mathcal{T} = \mathcal{T}^{(0)} \otimes \mathcal{T}^{(1)}$, concatenates all the indices together and returns an $(m+n)$-order tensor of size $I_0 \times \cdots \times I_{m-1} \times J_0 \times \cdots \times J_{n-1}$, whose entries are computed as

$$\mathcal{T}_{i_0,\cdots,i_{m-1},j_0,\cdots,j_{n-1}} = \mathcal{T}^{(0)}_{i_0,\cdots,i_{m-1}}\ \mathcal{T}^{(1)}_{j_0,\cdots,j_{n-1}} \tag{17}$$
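Equation (17) corresponds directly to `numpy.multiply.outer` (hypothetical small tensors):

```python
import numpy as np

A = np.arange(6.0).reshape(2, 3)   # a 2-order tensor
B = np.arange(4.0)                 # a 1-order tensor (vector)

# The outer product concatenates all modes: the result has shape (2, 3, 4),
# with T[i0, i1, j0] = A[i0, i1] * B[j0].
T = np.multiply.outer(A, B)
assert T.shape == (2, 3, 4)
assert np.isclose(T[1, 2, 3], A[1, 2] * B[3])
```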

The tensor outer product is a direct generalization of the outer product of two vectors. The derivatives of $\mathcal{T}$ with respect to