Learning Compact Recurrent Neural Networks with Block-Term Tensor Decomposition
Abstract
Recurrent Neural Networks (RNNs) are powerful sequence modeling tools. However, when dealing with high-dimensional inputs, the training of RNNs becomes computationally expensive due to the large number of model parameters. This hinders RNNs from solving many important computer vision tasks, such as Action Recognition in Videos and Image Captioning. To overcome this problem, we propose a compact and flexible structure, namely the Block-Term tensor decomposition, which greatly reduces the parameters of RNNs and improves their training efficiency. Compared with alternative low-rank approximations, such as the Tensor Train RNN (TT-RNN), our method, the Block-Term RNN (BT-RNN), is not only more concise (when using the same rank), but also able to attain a better approximation to the original RNNs with far fewer parameters. On three challenging tasks, including Action Recognition in Videos, Image Captioning and Image Generation, BT-RNN outperforms TT-RNN and the standard RNN in terms of both prediction accuracy and convergence rate. Specifically, BT-LSTM utilizes 17,388 times fewer parameters than the standard LSTM to achieve an accuracy improvement of over 15.6% in the Action Recognition task on the UCF11 dataset.
1 Introduction
Best known for sequence-to-sequence learning, Recurrent Neural Networks (RNNs) belong to a class of neural architectures designed to capture the dynamic temporal behavior of data. The vanilla fully-connected RNN utilizes a feedback loop to memorize previous information, but it struggles with long sequences because the gradient vanishes exponentially over time [14, 2]. Unlike the vanilla RNN, which passes information between layers with direct matrix-vector multiplications, the Long Short-Term Memory (LSTM) introduces a number of gates and passes information with element-wise operations [15]. This improvement drastically alleviates the gradient vanishing issue; therefore LSTM and its variants, e.g., the Gated Recurrent Unit (GRU) [6], are widely used in various Computer Vision (CV) tasks [3, 23, 40] to model the long-term correlations in sequences.
The current formulation of LSTM, however, suffers from an excess of parameters, making it notoriously difficult to train and susceptible to overfitting. The LSTM can be described by the following equations:
g_t = tanh(W_g x_t + U_g h_{t-1} + b_g)   (1)
i_t = σ(W_i x_t + U_i h_{t-1} + b_i)   (2)
f_t = σ(W_f x_t + U_f h_{t-1} + b_f)   (3)
o_t = σ(W_o x_t + U_o h_{t-1} + b_o)   (4)
c_t = f_t ⊙ c_{t-1} + i_t ⊙ g_t   (5)
h_t = o_t ⊙ tanh(c_t)   (6)
where ⊙ denotes the element-wise product, σ(·) denotes the sigmoid function and tanh(·) is the hyperbolic tangent function. The weight matrices W_* ∈ ℝ^{J×I} and U_* ∈ ℝ^{J×J} transform the input x_t ∈ ℝ^I and the hidden state h_{t-1} ∈ ℝ^J, respectively, into the cell update g_t and the three gates i_t, f_t, and o_t. Please note that given an image feature vector x_t fetched from a Convolutional Neural Network (CNN), the input dimension I rises to 4096 and 1536 w.r.t. VGG16 [36] and Inception-v4 [38]. If the hidden state has J units, the total number of parameters in calculating the four W_* x_t terms is 4 × I × J, which quickly reaches millions of weights. Therefore, the giant matrix-vector multiplication, i.e., W_* x_t, is the major source of inefficiency – the current parameter-intensive design not only makes the model difficult to train, but also leads to high computational complexity and memory usage.
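To make the scale concrete, a tiny sketch counts the input-to-hidden parameters; the feature sizes 4096 and 1536 come from VGG16 and Inception-v4 fc features, while the hidden size of 256 is an illustrative assumption:

```python
def lstm_input_to_hidden_params(input_dim: int, hidden_dim: int) -> int:
    """Count the weights in the four input-to-hidden matrices W_g, W_i, W_f, W_o."""
    return 4 * input_dim * hidden_dim

# VGG16 fc features (4096-dim) with an assumed hidden size of 256:
print(lstm_input_to_hidden_params(4096, 256))  # 4194304
# Inception-v4 features (1536-dim):
print(lstm_input_to_hidden_params(1536, 256))  # 1572864
```

Even at this modest hidden size, the input-to-hidden transformation alone holds millions of weights, which is exactly the product the BTD layer below will approximate.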
In addition, each W_* x_t essentially represents a fully-connected operation that transforms the input vector into the hidden state vector. However, extensive research on CNNs has shown that dense connections are significantly inefficient at extracting the spatially latent local structures and local correlations naturally exhibited in images [21, 11]. Recent leading CNN architectures, e.g., DenseNet [16], ResNet [12] and Inception-v4 [38], also try to circumvent a single huge, cumbersome dense layer [39]. But discussions of improving the dense connections in RNNs are still quite limited [29, 33]. It is imperative to seek a more efficient design to replace W_* x_t.
In this work, we propose a sparsely connected tensor representation, the Block-Term decomposition (BTD) [8], to replace the redundant and densely connected operation W x_t in LSTM (we focus on LSTM in this paper, but the proposed approach also applies to other variants such as GRU). The Block-Term decomposition is a low-rank approximation method that decomposes a high-order tensor into a sum of multiple Tucker decomposition models [42]. In detail, we represent the four weight matrices (i.e., W_g, W_i, W_f, W_o) and the input data x_t as higher-order tensors. During RNN training, the BTD layer automatically learns inter-parameter correlations to implicitly prune the redundant dense connections rendered by W x_t. By plugging the new BTD layer into current RNN formulations, we present a new BT-RNN model with a similar representation power but several orders of magnitude fewer parameters. The refined LSTM model with the Block-Term representation is illustrated in Fig. 1.
The major merits of BT-RNN are as follows:

The low-rank BTD can compress the dense connections in the input-to-hidden transformation while still retaining the current design philosophy of LSTM. By reducing the number of model parameters by several orders of magnitude, BT-LSTM converges faster than the traditional LSTM architecture, significantly enhancing the training speed.

Thanks to the core tensors, each dimension of the input data can share weights with all the other dimensions, so the BT representation maintains strong connections between different dimensions, enhancing its ability to capture sufficient local correlations. Empirical results show that, compared with the Tensor Train model [32], the BT model has better representation power with the same number of model parameters.

The design of multiple Tucker models can significantly reduce the sensitivity to noisy input data and widen the network, leading to a more robust RNN model. In contrast to Tensor Train based tensor approaches [48, 31], the BT model does not suffer from the difficulty of rank setting, releasing researchers from the intolerable work of choosing hyper-parameters.
To demonstrate the performance of the BT-LSTM model, we design three challenging computer vision tasks – Action Recognition in Videos, Image Captioning and Image Generation – to quantitatively and qualitatively evaluate the proposed BT-LSTM against the baseline LSTM and other low-rank variants such as the Tensor Train LSTM (TT-LSTM). Experimental results demonstrate the promising performance of the BT-LSTM model.
2 Related Work
The poor image modeling efficiency of the full connections in the perceptron architecture, i.e., W x [44], has been widely recognized by the Computer Vision (CV) community. The most prominent example is the great success of Convolutional Neural Networks (CNNs) in general image recognition. Instead of using the dense connections of multi-layer perceptrons, CNNs rely on sparsely connected convolutional kernels to extract the latent regional features of an image. Hence, going sparse on connections is key to the success of CNNs [9, 17, 30, 13, 37]. Despite the extensive discussion of efficient CNN designs, discussion of improving the dense connections in RNNs remains quite limited [29, 33].
Compared with the aforementioned explicit structural changes, low-rank methods are an orthogonal approach that implicitly prunes the dense connections. Low-rank tensor methods have been widely studied in machine learning [18, 52, 25, 22, 5] and successfully applied to the redundant dense connection problem in CNNs [31, 48, 1, 41, 19, 27]. Since the key operation in one perceptron is W x, Sainath et al. [34] decompose W with Singular Value Decomposition (SVD), reducing up to 30% of the parameters in W, but also demonstrating up to 10% accuracy loss [47]. The accuracy loss mainly results from losing high-order spatial information, as the intermediate data after image convolutions are intrinsically 4D.
To capture high-order spatial correlations, tensor methods were recently introduced into neural networks to approximate W x. For example, the Tensor Train (TT) method was employed to alleviate the large computation and reduce the number of parameters [31, 48, 41]. Yu et al. [49] also used a Tensor Train representation to forecast long-term information. Since this approach targets long historic states, it introduces additional parameters, making training difficult. Other tensor decomposition methods have also been applied in Deep Neural Networks (DNNs) for various purposes [20, 50, 19].
Although TT decomposition has achieved great success in addressing the dense connection problem, some limitations prevent the TT method from achieving better performance: 1) The optimal setting of TT-ranks has small ranks in the border cores and large ranks in the middle cores, e.g., like an olive [51]. However, in most applications, TT-ranks are set equal, which hinders TT's representation ability. 2) TT-ranks carry a strong constraint that the ranks of the border tensors must be set to 1, leading to seriously limited representation ability and flexibility [48, 51].
Instead of struggling to find the optimal TT-rank setting, BTD offers these advantages: 1) The Tucker decomposition introduces a core tensor to represent the correlations between different dimensions, achieving better weight sharing. 2) The ranks in the core tensor can be set equal, avoiding unbalanced weight sharing across dimensions and leading to a model that is robust to different permutations of the input data. 3) BTD uses a sum of multiple Tucker models to approximate a high-order tensor, breaking a large Tucker decomposition into several smaller models, widening the network and increasing representation ability. Meanwhile, multiple Tucker models also make the RNN model more robust to noisy input data.
3 Tensorizing Recurrent Neural Networks
The core concept of this work is to approximate W x_t with far fewer parameters, while still preserving the memorization mechanism in existing RNN formulations. The technique we use for the approximation is the Block Term Decomposition (BTD), which represents W as a series of lightweight small tensor products. During RNN training, the BTD layer automatically learns inter-parameter correlations to implicitly prune the redundant dense connections rendered by W x_t. By plugging the new BTD layer into current RNN formulations, we present a new BT-RNN model with several orders of magnitude fewer parameters while maintaining the representation power.
This section elaborates the details of the proposed methodology. It starts by exploring the background of tensor representations and BTD, before delving into the transformation of a regular RNN model into the BT-RNN; then we present the backpropagation procedures for the BT-RNN; finally, we analyze the time and memory complexity of the BT-RNN compared with the regular one.
3.1 Preliminaries and Background
Tensor Representation.
We use a boldface Euler script letter, e.g., 𝒜, to denote a tensor. A d-order tensor 𝒜 ∈ ℝ^{I_1 × ⋯ × I_d} represents a d-dimensional multi-way array; thereby a vector and a matrix are a 1-order tensor and a 2-order tensor, respectively. An element in a d-order tensor is denoted as 𝒜_{i_1, …, i_d}.
Tensor Product and Contraction.
Two tensors can be multiplied along their k-th order if the corresponding dimensions match. Let •_k denote the tensor-tensor product along the k-th order [18]. Given two d-order tensors 𝒜 ∈ ℝ^{I_1 × ⋯ × I_d} and ℬ ∈ ℝ^{J_1 × ⋯ × J_d} with I_k = J_k, the tensor product along the k-th order is:
(𝒜 •_k ℬ)_{i_1 … i_{k-1} i_{k+1} … i_d, j_1 … j_{k-1} j_{k+1} … j_d} = ∑_{p=1}^{I_k} 𝒜_{i_1 … i_{k-1} p i_{k+1} … i_d} ℬ_{j_1 … j_{k-1} p j_{k+1} … j_d}   (7)
To simplify the notation, we use i_{<k} to denote the indices i_1, …, i_{k-1}, while i_{>k} denotes i_{k+1}, …, i_d. The whole index tuple is denoted as i = (i_1, …, i_d). As we can see, each tensor product is computed along one dimension, which is consistent with the matrix product.
Contraction is an extension of the tensor product [7]; it conducts tensor products on multiple orders at the same time. For example, if I_1 = J_1 and I_2 = J_2, we can conduct a tensor product along the 1st and 2nd orders:
(𝒜 •_{1,2} ℬ)_{i_3 … i_d, j_3 … j_d} = ∑_{p_1=1}^{I_1} ∑_{p_2=1}^{I_2} 𝒜_{p_1 p_2 i_3 … i_d} ℬ_{p_1 p_2 j_3 … j_d}   (8)
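As a sanity check on Eqs. (7) and (8), the following sketch (arbitrary small shapes, chosen only for illustration) realizes the order-k product and a two-order contraction with `numpy.tensordot`:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((2, 3, 4))  # 3-order tensor with dims I = (2, 3, 4)
B = rng.standard_normal((5, 3, 6))  # 3-order tensor with dims J = (5, 3, 6); I_2 = J_2

# Tensor product on the 2nd order (Eq. 7): sum over the shared dimension of size 3.
prod = np.tensordot(A, B, axes=([1], [1]))  # remaining A dims, then remaining B dims
assert prod.shape == (2, 4, 5, 6)

# Contraction on the 1st and 2nd orders (Eq. 8).
C = rng.standard_normal((2, 3, 7))  # matches A on the first two orders
contr = np.tensordot(A, C, axes=([0, 1], [0, 1]))
assert contr.shape == (4, 7)

# Element-wise check against the definition of Eq. (8).
manual = sum(A[p1, p2, 0] * C[p1, p2, 0] for p1 in range(2) for p2 in range(3))
assert np.isclose(contr[0, 0], manual)
```

The index ordering of the result (remaining indices of the first operand, then of the second) matches the convention in Eqs. (7) and (8).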
Block Term Decomposition (BTD)
Block Term decomposition is a combination of the CP decomposition [4] and the Tucker decomposition [42]. Given a d-order tensor 𝒳 ∈ ℝ^{I_1 × ⋯ × I_d}, BTD decomposes it into N block terms; each term conducts tensor products between a core tensor 𝒢_n ∈ ℝ^{R_1 × ⋯ × R_d} and d factor matrices A_n^{(k)} ∈ ℝ^{I_k × R_k} along the k-th dimension, where n ∈ [1, N] and k ∈ [1, d] [8]. The formulation of BTD is as follows:
𝒳 = ∑_{n=1}^{N} 𝒢_n •_1 A_n^{(1)} •_2 A_n^{(2)} ⋯ •_d A_n^{(d)}   (9)
We call N the CP-rank, {R_1, …, R_d} the Tucker-rank, and d the Core-order. Fig. 2 demonstrates an example of how a 3-order tensor is decomposed into N block terms.
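To illustrate Eq. (9) concretely, this sketch (hypothetical small shapes) reconstructs a 3-order tensor from N block terms with `numpy.einsum`, each term being one core tensor multiplied by three factor matrices:

```python
import numpy as np

def btd_reconstruct(cores, factors):
    """Sum of N Tucker terms: X = sum_n G_n x_1 A_n^(1) x_2 A_n^(2) x_3 A_n^(3) (Eq. 9)."""
    return sum(
        np.einsum('abc,ia,jb,kc->ijk', G, A1, A2, A3)
        for G, (A1, A2, A3) in zip(cores, factors)
    )

rng = np.random.default_rng(1)
N, (R1, R2, R3), (I1, I2, I3) = 2, (2, 2, 2), (4, 5, 6)
cores = [rng.standard_normal((R1, R2, R3)) for _ in range(N)]
factors = [(rng.standard_normal((I1, R1)),
            rng.standard_normal((I2, R2)),
            rng.standard_normal((I3, R3))) for _ in range(N)]
X = btd_reconstruct(cores, factors)
print(X.shape)  # (4, 5, 6)
```

Since the reconstruction is linear in the cores, doubling every core doubles the reconstructed tensor, which is a quick way to check an implementation like this one.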
3.2 BT-RNN Model
This section demonstrates the core steps of the BT-RNN model. 1) We transform x_t ∈ ℝ^I and W ∈ ℝ^{I × J} into tensor representations 𝒳 and 𝒲; 2) then we decompose 𝒲 into several low-rank core tensors and their corresponding factor tensors using BTD; 3) subsequently, the original product W x_t is approximated by the tensor contraction between the decomposed weight tensor 𝒲 and the input tensor 𝒳; 4) finally, we present the gradient calculations amid Back Propagation Through Time (BPTT) [46, 45] to demonstrate the learning procedure of the BT-RNN model.
Although our discussion mainly focuses on LSTM, the proposed method can easily be generalized to other RNN variants such as the Gated Recurrent Unit (GRU) by following the above procedures.
Tensorizing x_t and W
We tensorize the input vector x_t into a high-order tensor 𝒳 to capture the spatial information of the input data, while we tensorize the weight matrix W into the decomposed weight tensor 𝒲 with BTD.
Formally, given an input vector x ∈ ℝ^I, we define the notation tensorize(x) to denote the tensorization operation. It can be either a stack operation or a reshape operation. We use the reshape operation for tensorization as it does not need to duplicate elements of the data; essentially, reshaping regroups the data. Fig. 3 outlines how we reshape a vector and a matrix into 3-order tensors.
Decomposing 𝒲 with BTD
Given a 2-dimensional weight matrix W ∈ ℝ^{I × J}, we can tensorize it into a d-dimensional tensor 𝒲 ∈ ℝ^{I_1 J_1 × ⋯ × I_d J_d}, where I = ∏_{k=1}^{d} I_k and J = ∏_{k=1}^{d} J_k. Following the BTD in Eq. (9), we can decompose 𝒲 into:
𝒲 = ∑_{n=1}^{N} 𝒢_n •_1 𝒜_n^{(1)} •_2 𝒜_n^{(2)} ⋯ •_d 𝒜_n^{(d)}   (10)
where 𝒢_n ∈ ℝ^{R_1 × ⋯ × R_d} denotes the core tensor, 𝒜_n^{(k)} ∈ ℝ^{I_k × J_k × R_k} denotes the factor tensor, N is the CP-rank and d is the Core-order. From the mathematical property of BT's ranks [18], we have R_k ≤ I_k (and R_k ≤ J_k), ∀k ∈ [1, d]. If R_k > I_k (or R_k > J_k), it is difficult for the model to obtain a bonus in performance. What's more, to obtain a robust model, in practice we set each Tucker-rank to be equal, e.g., R_1 = R_2 = ⋯ = R_d = R, to avoid unbalanced weight sharing in different dimensions and to alleviate the difficulty of hyper-parameter setting.
Computation between 𝒲 and 𝒳
After substituting the matrix-vector product with the BT representation and the tensorized input vector, we replace the input-to-hidden matrix-vector product W x_t with the following form:
𝒴 = BTD(𝒲) •_{(i_1, …, i_d)} 𝒳 = ∑_{n=1}^{N} (𝒢_n •_1 𝒜_n^{(1)} ⋯ •_d 𝒜_n^{(d)}) •_{(i_1, …, i_d)} 𝒳   (11)
where the tensor contraction is computed along all I_k dimensions in BTD(𝒲) and 𝒳, yielding 𝒴 ∈ ℝ^{J_1 × ⋯ × J_d}, which has the same size as the original output in its element-wise form. Fig. 4 demonstrates the substitution intuitively.
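A minimal sketch of Eq. (11) for d = 3 (all shapes are illustrative): the BT layer contracts the tensorized input with the cores and factor tensors, and the result matches multiplying by the explicitly reconstructed weight tensor of Eq. (10):

```python
import numpy as np

def bt_forward(x, cores, factors):
    """y[u,v,w] = sum over i,j,k and ranks of G * A1[i,u,.] A2[j,v,.] A3[k,w,.] x[i,j,k]."""
    return sum(
        np.einsum('abc,iua,jvb,kwc,ijk->uvw', G, A1, A2, A3, x)
        for G, (A1, A2, A3) in zip(cores, factors)
    )

rng = np.random.default_rng(2)
(I1, I2, I3), (J1, J2, J3), R, N = (4, 4, 4), (2, 3, 2), 2, 2
cores = [rng.standard_normal((R, R, R)) for _ in range(N)]
factors = [(rng.standard_normal((I1, J1, R)),
            rng.standard_normal((I2, J2, R)),
            rng.standard_normal((I3, J3, R))) for _ in range(N)]
x = rng.standard_normal((I1, I2, I3))

y = bt_forward(x, cores, factors)

# Cross-check against the dense product with the reconstructed weight tensor (Eq. 10).
W = sum(np.einsum('abc,iua,jvb,kwc->iujvkw', G, A1, A2, A3)
        for G, (A1, A2, A3) in zip(cores, factors))
y_dense = np.einsum('iujvkw,ijk->uvw', W, x)
assert np.allclose(y, y_dense)
print(y.shape)  # (2, 3, 2)
```

The BT layer never materializes the large weight tensor W; the dense reconstruction here exists only to verify the contraction.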
Training BT-RNN
The gradients of an RNN are computed by Back Propagation Through Time (BPTT) [46]. We derive the gradients within the BPTT framework for the proposed BT-RNN model.
Following the regular LSTM backpropagation procedure, the gradient ∂L/∂y can be computed by the original BPTT algorithm, where y = W x_t. Applying the same tensorization operation as for 𝒳, we obtain the tensorized gradient ∂L/∂𝒴 ∈ ℝ^{J_1 × ⋯ × J_d}. For a more intuitive understanding, we rewrite Eq. (11) in the element-wise case:
𝒴(j_1, …, j_d) = ∑_{n=1}^{N} ∑_{i_1, …, i_d} ∑_{r_1, …, r_d} 𝒢_n(r_1, …, r_d) ∏_{k=1}^{d} 𝒜_n^{(k)}(i_k, j_k, r_k) 𝒳(i_1, …, i_d)   (12)
Here, for simplicity of writing, we use i, j, and r to denote the index tuples (i_1, …, i_d), (j_1, …, j_d), and (r_1, …, r_d), respectively. Since the right-hand side of Eq. (12) is a scalar, the element-wise gradients for the parameters in BT-RNN are as follows:
∂L/∂𝒜_n^{(k)}(i_k, j_k, r_k) = ∑_{j∖j_k} ∂L/∂𝒴(j) ∑_{i∖i_k} ∑_{r∖r_k} 𝒢_n(r) ∏_{m≠k} 𝒜_n^{(m)}(i_m, j_m, r_m) 𝒳(i)   (13)
∂L/∂𝒢_n(r) = ∑_{j} ∂L/∂𝒴(j) ∑_{i} ∏_{k=1}^{d} 𝒜_n^{(k)}(i_k, j_k, r_k) 𝒳(i)   (14)
3.3 Hyper-Parameters and Complexity Analysis
3.3.1 Hyper-Parameter Analysis
Total # Params
Summing the entries of the N core tensors and the N × d factor tensors in Eq. (10), the total number of parameters of one BT representation is

P_BT = N (∑_{k=1}^{d} I_k J_k R_k + ∏_{k=1}^{d} R_k)   (15)
# Params w.r.t. Core-order (d)
The Core-order d is the most significant factor affecting the total number of parameters, through the term ∏_k R_k = R^d in Eq. (15). It determines the total dimensions of the core tensors, the number of factor tensors, and the total dimensions of the input and output tensors. If we set d = 1, the model degenerates to the original matrix-vector product with the largest number of parameters and the highest complexity. Fig. 5 demonstrates how the total number of parameters varies w.r.t. different Core-orders d. If the Tucker-rank R > 1, the total number of parameters first decreases with increasing d until it reaches a minimum, then starts increasing afterwards. This mainly results from the nonlinear behavior of R^d in Eq. (15).
Hence, a proper choice of d is particularly important. Enlarging d is the simplest way to reduce the number of parameters in the factor tensors. But due to the second term in Eq. (15), enlarging d will also increase the number of parameters in the core tensors, resulting in high computational complexity and memory usage. With increasing Core-order d, each dimension of the input tensor decreases logarithmically; however, an extremely high-order BT model loses important spatial information. In practice, a moderate Core-order is recommended.
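The trade-off can be inspected numerically. The sketch below evaluates Eq. (15) with equal Tucker-ranks for several illustrative factorizations of a 4096 → 256 layer (the factorizations themselves are assumptions for this sketch, not settings from the experiments):

```python
def bt_params(in_shape, out_shape, R, N=1):
    """Eq. (15) with equal Tucker-ranks: N * (R * sum_k I_k*J_k + R**d)."""
    d = len(in_shape)
    factor_params = sum(i * j * R for i, j in zip(in_shape, out_shape))
    return N * (factor_params + R ** d)

# 4096 -> 256 factorized more and more finely (illustrative factorizations):
print(bt_params((4096,), (256,), R=2))             # d = 1: 2097154 (near the dense layer)
print(bt_params((64, 64), (16, 16), R=2))          # d = 2: 4100
print(bt_params((16, 16, 16), (8, 8, 4), R=2))     # d = 3: 648
print(bt_params((8, 8, 8, 8), (4, 4, 4, 4), R=2))  # d = 4: 272
```

At these small ranks the factor term still dominates, so the count keeps falling; with a larger R, the R^d core term eventually takes over and the curve turns upward, matching Fig. 5.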
# Params w.r.t. Tucker-rank (R)
The Tucker-rank R controls the complexity of the Tucker decomposition. This hyper-parameter is conceptually similar to the number of singular values in the Singular Value Decomposition (SVD). Eq. (15) and Fig. 5 also suggest that the total number of parameters is sensitive to R. In particular, BTD degenerates to a CP decomposition if we set R_1 = ⋯ = R_d = 1. Since R_k ≤ I_k (and R_k ≤ J_k), the choice of R is limited to a small value range, releasing researchers from heavy hyper-parameter tuning.
# Params w.r.t. CP-rank (N)
The CP-rank N controls the number of block terms. If N = 1, BTD degenerates to a Tucker decomposition. As we can see from Table 1, N does not affect the memory usage of the forward and backward passes, so if we need a more memory-saving model, we can enlarge N while decreasing d and R at the same time.
3.3.2 Computational Complexity Analysis
Complexity in Forward Process
Under a left-to-right computational order, Eq. (10) reaches its computation peak at the last tensor product •_d 𝒜_n^{(d)}, where the nearly reconstructed weight tensor is at its largest. However, we can reorder the computations to further reduce the total model complexity. The reordering is:
𝒴 = ∑_{n=1}^{N} 𝒢_n •_{(r_1, …, r_d)} (𝒳 •_1 𝒜_n^{(1)} •_2 𝒜_n^{(2)} ⋯ •_d 𝒜_n^{(d)})   (16)
The main difference is that in Eq. (11) the tensor products are first computed among the core and factor tensors, reconstructing a large weight tensor, while in Eq. (16) the small factor tensors are first contracted into 𝒳, one dimension at a time. Since BTD is a low-rank decomposition method, i.e., R_k ≤ I_k and R_k ≤ J_k, the new computation order keeps every intermediate tensor small and thus significantly reduces the complexity of the last tensor product, and consequently the total complexity of the model. Moreover, decreasing the Tucker-rank R further reduces the computational cost under both orderings.
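The reordering in Eq. (16) can be verified numerically: contracting the factor tensors into 𝒳 one order at a time and applying the core last gives the same output as the all-at-once contraction (a d = 3, single-block-term sketch with assumed shapes):

```python
import numpy as np

rng = np.random.default_rng(3)
(I1, I2, I3), (J1, J2, J3), R = (4, 4, 4), (3, 2, 3), 2
G = rng.standard_normal((R, R, R))
A1 = rng.standard_normal((I1, J1, R))
A2 = rng.standard_normal((I2, J2, R))
A3 = rng.standard_normal((I3, J3, R))
x = rng.standard_normal((I1, I2, I3))

# All-at-once contraction (Eq. 11, single block term):
y_direct = np.einsum('abc,iua,jvb,kwc,ijk->uvw', G, A1, A2, A3, x)

# Reordered (Eq. 16): fold each small factor tensor into x, apply the core last.
t = np.einsum('ijk,iua->jkua', x, A1)      # contract along I_1
t = np.einsum('jkua,jvb->kuavb', t, A2)    # contract along I_2
t = np.einsum('kuavb,kwc->uavbwc', t, A3)  # contract along I_3
y_reordered = np.einsum('uavbwc,abc->uvw', t, G)

assert np.allclose(y_direct, y_reordered)
```

Every intermediate tensor in the reordered path has at most J and R sized modes appended while the I modes are consumed, which is the source of the savings discussed above.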
Complexity in Backward Process
To derive the computational complexity of the backward process, we present the gradients in tensor product form. Consistent with the element-wise gradients in Eqs. (13) and (14), the gradients of the factor tensors and core tensors are:

∂L/∂𝒜_n^{(k)} = ((𝒳 •_1 𝒜_n^{(1)} ⋯ •_{k-1} 𝒜_n^{(k-1)} •_{k+1} 𝒜_n^{(k+1)} ⋯ •_d 𝒜_n^{(d)}) •_{(r_m: m≠k)} 𝒢_n) •_{(j_m: m≠k)} ∂L/∂𝒴   (17)
∂L/∂𝒢_n = (𝒳 •_1 𝒜_n^{(1)} •_2 𝒜_n^{(2)} ⋯ •_d 𝒜_n^{(d)}) •_{(j_1, …, j_d)} ∂L/∂𝒴   (18)
Since Eqs. (17) and (18) follow the same computation pattern as Eq. (11), the backward computational complexity is of the same order as that of the forward pass.
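The gradient formulas can be verified numerically. Below, a d = 2, single-block-term sketch (shapes are illustrative) computes the core and factor gradients of a squared-error loss following Eqs. (13)-(14) and checks one entry against a central finite difference:

```python
import numpy as np

rng = np.random.default_rng(4)
(I1, I2), (J1, J2), R = (3, 4), (2, 3), 2
G  = rng.standard_normal((R, R))
A1 = rng.standard_normal((I1, J1, R))
A2 = rng.standard_normal((I2, J2, R))
x  = rng.standard_normal((I1, I2))
target = rng.standard_normal((J1, J2))

def forward(G, A1, A2, x):
    """Eq. (12) for d = 2: Y[u,v] = sum G[a,b] A1[i,u,a] A2[j,v,b] x[i,j]."""
    return np.einsum('ab,iua,jvb,ij->uv', G, A1, A2, x)

def loss(G, A1, A2, x):
    return 0.5 * np.sum((forward(G, A1, A2, x) - target) ** 2)

dY = forward(G, A1, A2, x) - target                    # dL/dY
dG  = np.einsum('uv,iua,jvb,ij->ab', dY, A1, A2, x)    # Eq. (14)
dA1 = np.einsum('uv,ab,jvb,ij->iua', dY, G, A2, x)     # Eq. (13), k = 1

# Central finite-difference check on one core entry.
eps = 1e-5
Gp = G.copy(); Gp[0, 0] += eps
Gm = G.copy(); Gm[0, 0] -= eps
num = (loss(Gp, A1, A2, x) - loss(Gm, A1, A2, x)) / (2 * eps)
assert abs(num - dG[0, 0]) < 1e-4
```

Because each gradient is itself a contraction of ∂L/∂𝒴 with 𝒳 and the remaining parameters, it can be evaluated with the same reordering trick as the forward pass.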
Table 1 compares the time complexity and memory usage of the forward and backward passes of the vanilla RNN, TT-RNN, and BT-RNN. For the vanilla RNN, both passes are dominated by the dense product W x_t, taking 𝒪(IJ) time and memory; TT-RNN and BT-RNN replace this product with low-rank contractions whose cost is governed by the small ranks, as analyzed above.
Complexity Comparisons
We analyze the time complexity and memory usage of the RNN, the Tensor Train RNN, and the BT-RNN; the statistics are shown in Table 1. We observe that both TT-RNN and BT-RNN have lower computational complexity and memory usage than the vanilla RNN, since the extra hyper-parameters (the ranks) are several orders of magnitude smaller than I or J. Given that the suggested Core-order d is small, the complexities of TT-RNN and BT-RNN are comparable.
Since all dimensions share weights through the core tensors in the BT model, while TT only shares weights between adjacent dimensions, the BT model can capture sufficient spatial information across multiple dimensions.
4 Experiments
The RNN is a versatile and powerful modeling tool widely used in various computer vision tasks. We design three challenging computer vision tasks – Action Recognition in Videos, Image Captioning and Image Generation – to quantitatively and qualitatively evaluate the proposed BT-LSTM against the baseline LSTM and other low-rank variants such as the Tensor Train LSTM (TT-LSTM). Finally, we design a control experiment to elucidate the effects of different hyper-parameters.
4.1 Implementations
Since the operations producing g_t, i_t, f_t, and o_t follow the same computation pattern, we merge them together by concatenating W_g, W_i, W_f, and W_o into one giant W ∈ ℝ^{4J × I}, and likewise for U. This observation leads to the following simplified LSTM formulation:
[g_t; i_t; f_t; o_t] = [tanh; σ; σ; σ](W x_t + U h_{t-1} + b)   (19)
c_t = f_t ⊙ c_{t-1} + i_t ⊙ g_t,   h_t = o_t ⊙ tanh(c_t)   (20)
We implemented BT-LSTM on top of the simplified LSTM formulation with Keras and TensorFlow. The baseline LSTM models use the default initialization settings in Keras and TensorFlow, and we use the Adam optimizer with the same learning rate (lr) across the different tasks.
4.2 A Quantitative Evaluation of BT-LSTM on the Task of Action Recognition in Videos
Method  Accuracy

Orthogonal Approaches  Original [28]  0.712
  Spatial-temporal [26]  0.761
  Visual Attention [35]  0.850
RNN Approaches  LSTM  0.697
  TT-LSTM [48]  0.796
  BT-LSTM  0.853
We use the UCF11 YouTube Action dataset [28] for action recognition in videos. The dataset contains 1600 video clips falling into 11 action categories. Each category contains 25 video groups, each of which contains at least 4 clips. All video clips are converted to 29.97 fps MPG. The dataset is available at http://crcv.ucf.edu/data/UCF_YouTube_Action.php. We scale the original frames down to 160 × 120, then we sample 6 random frames in ascending order from each video clip as the input data. For more details on the preprocessing, please refer to [48].
We use a single LSTM cell as the model architecture to evaluate BT-LSTM against LSTM and TT-LSTM in Fig. 6. Please note that there are other orthogonal approaches aimed at improving the model, such as visual attention [35] and spatial-temporal [26]. Since our discussion is limited to a single LSTM cell, we can always replace the LSTM cells in those high-level models with BT-LSTM to acquire better accuracy. BT-LSTM and TT-LSTM share the same factor tensor count, input tensor shape, and hidden shape; the rank of TT-LSTM is fixed to 4, while BT-LSTM is evaluated under various Tucker-ranks.
Fig. 6 demonstrates the training loss and validation accuracy of BT-LSTM against LSTM and TT-LSTM under different settings, and Table 2 lists the top accuracies of the different models. From these experiments, we claim that:
1) Orders-of-magnitude parameter reductions: The vanilla LSTM has 58.9 million parameters in its input-to-hidden matrices, while BT-LSTM delivers better accuracy even with several orders of magnitude fewer parameters. The total number of parameters in BT-LSTM follows Eq. (15). At Tucker-ranks 1, 2, and 4, BT-LSTM uses 721, 1470, and 3387 parameters, demonstrating compression ratios of 81693×, 40069×, and 17388×, respectively.
2) Faster convergence: BT-LSTM demonstrates a significant convergence improvement over the vanilla LSTM in terms of both training loss and validation accuracy in Fig. 6(a) and Fig. 6(b). BT-LSTM reaches 60% validation accuracy at epoch 16 while LSTM takes 230 epochs, a 14× convergence speedup. It is widely acknowledged that a model with fewer parameters is easier to train, so the speedup mainly results from the drastic parameter reduction. With nearly the same number of parameters, the training loss of BT-LSTM-4 also decreases faster than that of TT-LSTM-4 over epochs [0, 50], substantiating that the BT model captures spatial information better than the Tensor Train model.
3) Better model efficiency: Despite several orders of magnitude fewer parameters, BT-LSTM demonstrates an extra 15.6% accuracy over LSTM. In addition, BT-LSTM demonstrates an extra 7.2% accuracy over TT-LSTM with a comparable number of parameters. Under different Tucker-ranks, BT-LSTM converges to nearly identical losses, but increasing the Tucker-rank also improves the accuracy. This is consistent with intuition, since higher-rank models capture additional relevant information. Therefore, we claim that BT-LSTM has better efficiency.
4.3 Qualitative Evaluations of BT-LSTM on the Tasks of Image Generation and Image Captioning
We also conduct experiments on Image Generation and Image Captioning to further substantiate the efficiency of BT-LSTM.
Task 1: Image Generation
Image generation intends to learn a latent representation from images, and then tries to generate new images of the same style from the learned model. The model for this task is the Deep Recurrent Attentive Writer (DRAW) [10]. It uses an encoder RNN to encode images into latent representations; then a decoder RNN decodes the latent representations to construct an image. We substitute the LSTM in the encoder network with our BT-LSTM.
In this task, the encoder network must capture sufficient local correlations and visual features from the raw images to generate high-quality feature vectors. As shown in Fig. 7, the LSTM-based and BT-LSTM-based models generate comparable images.
Task 2: Image Captioning
Image Captioning intends to describe the content of an image. We use the model in Neural Image Caption [43] to evaluate the performance of BT-LSTM by replacing its LSTM cells.
The training dataset is MSCOCO [24], a large-scale dataset for object detection, segmentation, and captioning. It contains 82,783 images for training and 40,775 images for testing. Each image is scaled to 224 × 224 in RGB, and the channel means (103.939, 116.779, 123.68) are subtracted, before it is fed into a pretrained VGG16 model.
Fig. 8 demonstrates the image captions generated by BT-LSTM and LSTM. BT-LSTM generates more accurate sentences to describe the content of an image than the baseline LSTM. Since the input to the BT model is a compact feature vector merging the embedded image features from VGG16 with the language features from a word-embedding network, our model demonstrates a qualitative improvement in captioning. The results also suggest that BT-LSTM captures local correlations missed by the traditional LSTM.
4.4 Sensitivity Analysis on Hyper-Parameters
There are 3 key hyper-parameters in BT-LSTM: the Core-order d, the Tucker-rank R, and the CP-rank N. In order to scrutinize the impact of these hyper-parameters, we design a control experiment to illustrate their effects.
We sample y from the distribution determined by a ground-truth weight matrix W, i.e., y is generated by plugging x back into W x, with a small noise term added to avoid overfitting; each x is generated from a Gaussian distribution. Given x and y, we randomly initialize the BT parameters and start training. Eventually, the reconstructed weight matrix should be similar to W, since x and y are drawn from the distribution it determines. Please note that the purpose of this experiment is to evaluate the impact of the BT model under different parameter settings, even though there are many other good methods, such as L1 regularization and Lasso regularization, to recover the weight matrix.
Core-order (d): The number of parameters goes down as d grows with the ranks fixed. The parameters reduce by about 1.3× from Fig. 9(d) to Fig. 9(f) as d increases from 2 to 4. With fewer parameters, the reconstructed weight matrix deteriorates quickly. We claim that a high Core-order loses important spatial information, as each mode of the tensor becomes too small to capture enough latent correlations. This result is consistent with our earlier analysis.
Tucker-rank (R): The rank affects the number of parameters exponentially, through the R^d term. Comparing Fig. 9(c) and Fig. 9(d), when R increases from 1 to 4, the BT model has more parameters to capture sufficient information from the input data, yielding a more robust model.
CP-rank (N): The CP-rank contributes to the number of parameters linearly, playing an important role when the other ranks are small. Comparing Fig. 9(c) and Fig. 9(e), the latter result has less noise, showing that a proper CP-rank setting leads to a more robust model, since multiple Tucker models jointly capture information from the input data.
5 Conclusion
We proposed a Block-Term RNN architecture to address the redundancy problem in RNNs. By using the Block Term tensor decomposition to prune the connections in the input-to-hidden weight matrix of RNNs, we provide a new RNN model with far fewer parameters and stronger correlation modeling between feature dimensions, leading to easier model training and improved performance. Experimental results on a video action recognition dataset show that our BT-RNN architecture can not only consume several orders of magnitude fewer parameters but also improve the model performance over the standard LSTM and the TT-LSTM. Future work is to 1) explore the sparsity in the factor tensors and core tensors of the BT model, further reducing the number of model parameters; 2) concatenate hidden states and input data over a period of time, extracting temporal features via tensor methods; 3) quantize the factor tensors and core tensors to reduce memory usage.
6 Acknowledgment
This paper was in part supported by Grants from the Natural Science Foundation of China (No. 61572111), the National High Technology Research and Development Program of China (863 Program) (No. 2015AA015408), a 985 Project of UESTC (No. A1098531023601041) and a Fundamental Research Fund for the Central Universities of China (No. A03017023701).
References
 [1] M. Bai, B. Zhang, and J. Gao. Tensorial recurrent neural networks for longitudinal data analysis. arXiv preprint arXiv:1708.00185, 2017.
 [2] Y. Bengio, P. Simard, and P. Frasconi. Learning long-term dependencies with gradient descent is difficult. IEEE transactions on neural networks, 5(2):157–166, 1994.
 [3] W. Byeon, T. M. Breuel, F. Raue, and M. Liwicki. Scene labeling with lstm recurrent neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3547–3555, 2015.
 [4] J. D. Carroll and J.-J. Chang. Analysis of individual differences in multidimensional scaling via an n-way generalization of "Eckart-Young" decomposition. Psychometrika, 35(3):283–319, 1970.
 [5] S. Chen, M. R. Lyu, I. King, and Z. Xu. Exact and stable recovery of pairwise interaction tensors. In Advances in Neural Information Processing Systems, pages 1691–1699, 2013.
 [6] K. Cho, B. Van Merriënboer, C. Gulcehre, D. Bahdanau, F. Bougares, H. Schwenk, and Y. Bengio. Learning phrase representations using rnn encoder-decoder for statistical machine translation. arXiv preprint arXiv:1406.1078, 2014.
 [7] A. Cichocki. Era of big data processing: A new approach via tensor networks and tensor decompositions. arXiv preprint arXiv:1403.2048, 2014.
 [8] L. De Lathauwer. Decompositions of a higher-order tensor in block terms - part II: Definitions and uniqueness. SIAM Journal on Matrix Analysis and Applications, 30(3):1033–1066, 2008.
 [9] M. Denil, B. Shakibi, L. Dinh, N. de Freitas, et al. Predicting parameters in deep learning. In Advances in Neural Information Processing Systems, pages 2148–2156, 2013.
 [10] K. Gregor, I. Danihelka, A. Graves, D. J. Rezende, and D. Wierstra. Draw: A recurrent neural network for image generation. arXiv preprint arXiv:1502.04623, 2015.
 [11] S. Han, J. Pool, J. Tran, and W. Dally. Learning both weights and connections for efficient neural network. In Advances in Neural Information Processing Systems, pages 1135–1143, 2015.
 [12] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.
 [13] G. Hinton, O. Vinyals, and J. Dean. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531, 2015.
 [14] S. Hochreiter. Untersuchungen zu dynamischen neuronalen netzen. Diploma, Technische Universität München, 91, 1991.
 [15] S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural computation, 9(8):1735–1780, 1997.
 [16] G. Huang, Z. Liu, K. Q. Weinberger, and L. van der Maaten. Densely connected convolutional networks. arXiv preprint arXiv:1608.06993, 2016.
 [17] M. Jaderberg, A. Vedaldi, and A. Zisserman. Speeding up convolutional neural networks with low rank expansions. arXiv preprint arXiv:1405.3866, 2014.
 [18] T. G. Kolda and B. W. Bader. Tensor decompositions and applications. SIAM review, 51(3):455–500, 2009.
 [19] J. Kossaifi, Z. C. Lipton, A. Khanna, T. Furlanello, and A. Anandkumar. Tensor regression networks. arXiv preprint arXiv:1707.08308, 2017.
 [20] V. Lebedev, Y. Ganin, M. Rakhuba, I. Oseledets, and V. Lempitsky. Speeding-up convolutional neural networks using fine-tuned cp-decomposition. arXiv preprint arXiv:1412.6553, 2014.
 [21] Y. LeCun, Y. Bengio, et al. Convolutional networks for images, speech, and time series. The handbook of brain theory and neural networks, 3361(10):1995, 1995.
 [22] G. Li, Z. Xu, L. Wang, J. Ye, I. King, and M. Lyu. Simple and efficient parallelization for probabilistic temporal tensor factorization. arXiv preprint arXiv:1611.03578, 2016.
 [23] X. Liang, X. Shen, D. Xiang, J. Feng, L. Lin, and S. Yan. Semantic object parsing with local-global long short-term memory. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3185–3193, 2016.
 [24] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick. Microsoft COCO: Common objects in context. In European conference on computer vision, pages 740–755. Springer, 2014.
 [25] B. Liu, Z. Xu, and Y. Li. Tensor decomposition via variational autoencoder. arXiv preprint arXiv:1611.00866, 2016.
 [26] D. Liu, M.L. Shyu, and G. Zhao. Spatialtemporal motion information integration for action detection and recognition in nonstatic background. In Information Reuse and Integration (IRI), 2013 IEEE 14th International Conference on, pages 626–633. IEEE, 2013.
 [27] H. Liu, L. He, H. Bai, and Z. Xu. Efficient structured inference for stochastic recurrent neural networks. 2017.
 [28] J. Liu, J. Luo, and M. Shah. Recognizing realistic actions from videos âin the wildâ. In Computer vision and pattern recognition, 2009. CVPR 2009. IEEE conference on, pages 1996–2003. IEEE, 2009.
 [29] Z. Lu, V. Sindhwani, and T. N. Sainath. Learning compact recurrent neural networks. In Acoustics, Speech and Signal Processing (ICASSP), 2016 IEEE International Conference on, pages 5960–5964. IEEE, 2016.
 [30] P. Molchanov, S. Tyree, T. Karras, T. Aila, and J. Kautz. Pruning convolutional neural networks for resource efficient inference. 2016.
 [31] A. Novikov, D. Podoprikhin, A. Osokin, and D. P. Vetrov. Tensorizing neural networks. In Advances in Neural Information Processing Systems, pages 442–450, 2015.
 [32] I. V. Oseledets. Tensor-train decomposition. SIAM Journal on Scientific Computing, 33(5):2295–2317, 2011.
 [33] R. Prabhavalkar, O. Alsharif, A. Bruguier, and L. McGraw. On the compression of recurrent neural networks with an application to LVCSR acoustic modeling for embedded speech recognition. In Acoustics, Speech and Signal Processing (ICASSP), 2016 IEEE International Conference on, pages 5970–5974. IEEE, 2016.
 [34] T. N. Sainath, B. Kingsbury, V. Sindhwani, E. Arisoy, and B. Ramabhadran. Low-rank matrix factorization for deep neural network training with high-dimensional output targets. In Acoustics, Speech and Signal Processing (ICASSP), 2013 IEEE International Conference on, pages 6655–6659. IEEE, 2013.
 [35] S. Sharma, R. Kiros, and R. Salakhutdinov. Action recognition using visual attention. arXiv preprint arXiv:1511.04119, 2015.
 [36] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
 [37] N. Srivastava, G. E. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov. Dropout: a simple way to prevent neural networks from overfitting. Journal of machine learning research, 15(1):1929–1958, 2014.
 [38] C. Szegedy, S. Ioffe, V. Vanhoucke, and A. A. Alemi. Inception-v4, Inception-ResNet and the impact of residual connections on learning. In AAAI, pages 4278–4284, 2017.
 [39] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich. Going deeper with convolutions. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1–9, 2015.
 [40] L. Theis and M. Bethge. Generative image modeling using spatial LSTMs. In Advances in Neural Information Processing Systems, pages 1927–1935, 2015.
 [41] A. Tjandra, S. Sakti, and S. Nakamura. Compressing recurrent neural network with tensor train. arXiv preprint arXiv:1705.08052, 2017.
 [42] L. R. Tucker. Some mathematical notes on three-mode factor analysis. Psychometrika, 31(3):279–311, 1966.
 [43] O. Vinyals, A. Toshev, S. Bengio, and D. Erhan. Show and tell: A neural image caption generator. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 3156–3164, 2015.
 [44] L. Wang, W. Wu, Z. Xu, J. Xiao, and Y. Yang. BLASX: A high performance level-3 BLAS library for heterogeneous multi-GPU computing. In Proceedings of the 2016 International Conference on Supercomputing, page 20. ACM, 2016.
 [45] L. Wang, Y. Yang, R. Min, and S. Chakradhar. Accelerating deep neural network training with inconsistent stochastic gradient descent. Neural Networks, 2017.
 [46] P. J. Werbos. Backpropagation through time: what it does and how to do it. Proceedings of the IEEE, 78(10):1550–1560, 1990.
 [47] J. Xue, J. Li, and Y. Gong. Restructuring of deep neural network acoustic models with singular value decomposition. In Interspeech, pages 2365–2369, 2013.
 [48] Y. Yang, D. Krompass, and V. Tresp. Tensor-train recurrent neural networks for video classification. arXiv preprint arXiv:1707.01786, 2017.
 [49] R. Yu, S. Zheng, A. Anandkumar, and Y. Yue. Long-term forecasting using tensor-train RNNs. arXiv preprint arXiv:1711.00073, 2017.
 [50] C. Yunpeng, J. Xiaojie, K. Bingyi, F. Jiashi, and Y. Shuicheng. Sharing residual units through collective tensor factorization in deep neural networks. arXiv preprint arXiv:1703.02180, 2017.
 [51] Q. Zhao, M. Sugiyama, and A. Cichocki. Learning efficient tensor representations with ring structure networks. arXiv preprint arXiv:1705.08286, 2017.
 [52] S. Zhe, K. Zhang, P. Wang, K.-c. Lee, Z. Xu, Y. Qi, and Z. Ghahramani. Distributed flexible nonlinear tensor factorization. In Advances in Neural Information Processing Systems, pages 928–936, 2016.