Wider and Deeper, Cheaper and Faster:
Tensorized LSTMs for Sequence Learning
Abstract
Long Short-Term Memory (LSTM) is a popular approach to boosting the ability of Recurrent Neural Networks to store longer-term temporal information. The capacity of an LSTM network can be increased by widening and adding layers. However, usually the former introduces additional parameters, while the latter increases the runtime. As an alternative we propose the Tensorized LSTM in which the hidden states are represented by tensors and updated via a cross-layer convolution. By increasing the tensor size, the network can be widened efficiently without additional parameters since the parameters are shared across different locations in the tensor; by delaying the output, the network can be deepened implicitly with little additional runtime since deep computations for each timestep are merged into temporal computations of the sequence. Experiments conducted on five challenging sequence learning tasks show the potential of the proposed model.
31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.
1 Introduction
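We consider the time-series prediction task of producing a desired output $y_t$ at each timestep $t \in \{1, \dots, T\}$ given an observed input sequence $x_{1:t} = \{x_1, x_2, \dots, x_t\}$, where $x_t \in \mathbb{R}^R$ and $y_t \in \mathbb{R}^S$ are vectors (assumed to be in row form throughout this paper). The Recurrent Neural Network (RNN) (Rumelhart et al., 1986; Elman, 1990) is a powerful model that learns how to use a hidden state vector $h_t \in \mathbb{R}^M$ to encapsulate the relevant features of the entire input history $x_{1:t}$ up to timestep $t$. Let $h^{cat}_{t-1} \in \mathbb{R}^{R+M}$ be the concatenation of the current input $x_t$ and the previous hidden state $h_{t-1}$:
$$h^{cat}_{t-1} = [x_t, h_{t-1}] \tag{1}$$
The update of the hidden state $h_t$ is defined as:
$$a_t = h^{cat}_{t-1} W^h + b^h \tag{2}$$
$$h_t = \phi(a_t) \tag{3}$$
where $W^h \in \mathbb{R}^{(R+M) \times M}$ is the weight, $b^h \in \mathbb{R}^M$ the bias, $a_t \in \mathbb{R}^M$ the hidden activation, and $\phi(\cdot)$ the element-wise tanh function. Finally, the output $y_t$ at timestep $t$ is generated by:
$$y_t = \varphi(h_t W^y + b^y) \tag{4}$$
where $W^y \in \mathbb{R}^{M \times S}$ and $b^y \in \mathbb{R}^S$, and $\varphi(\cdot)$ can be any differentiable function, depending on the task.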
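As a minimal sketch of the vanilla RNN update described above (concatenate the current input with the previous hidden state, apply an affine map, then an element-wise tanh), the following pure-Python code uses row vectors stored as lists. The function names, the toy sizes, and the constant weights are illustrative assumptions, and the task-specific output nonlinearity is left as the identity.

```python
import math

def rnn_step(x_t, h_prev, W_h, b_h):
    """One vanilla RNN update in row-vector form: tanh([x_t, h_prev] W_h + b_h)."""
    h_cat = x_t + h_prev  # list concatenation, length R + M
    M = len(b_h)
    a_t = [sum(h_cat[i] * W_h[i][j] for i in range(len(h_cat))) + b_h[j]
           for j in range(M)]
    return [math.tanh(a) for a in a_t]

def rnn_output(h_t, W_y, b_y):
    """Affine read-out; the output nonlinearity is task-dependent and omitted here."""
    S = len(b_y)
    return [sum(h_t[i] * W_y[i][j] for i in range(len(h_t))) + b_y[j]
            for j in range(S)]

# Toy sizes (hypothetical): R = 2 inputs, M = 3 hidden units, S = 1 output.
R, M, S = 2, 3, 1
W_h = [[0.1] * M for _ in range(R + M)]
b_h = [0.0] * M
W_y = [[1.0] for _ in range(M)]
b_y = [0.0]

h = [0.0] * M
for x in ([1.0, 0.0], [0.0, 1.0], [1.0, 1.0]):
    h = rnn_step(x, h, W_h, b_h)
y = rnn_output(h, W_y, b_y)
```

Note that the weight matrix has $(R+M) \times M$ entries, which is the source of the quadratic parameter growth discussed next.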
However, this vanilla RNN has difficulties in modeling long-range dependencies due to the vanishing/exploding gradient problem (Bengio et al., 1994). Long Short-Term Memories (LSTMs) (Hochreiter and Schmidhuber, 1997; Gers et al., 2000) alleviate these problems by employing memory cells to preserve information for longer, and adopting gating mechanisms to modulate the information flow. Given the success of the LSTM in sequence modeling, it is natural to consider how to increase the complexity of the model and thereby increase the set of tasks for which the LSTM can be profitably applied.
We consider the capacity of a network to consist of two components: the width (the amount of information handled in parallel) and the depth (the number of computation steps) (Bengio, 2009). A naive way to widen the LSTM is to increase the number of units in a hidden layer; however, the parameter number scales quadratically with the number of units. To deepen the LSTM, the popular Stacked LSTM (sLSTM) stacks multiple LSTM layers (Graves et al., 2013); however, runtime is proportional to the number of layers and information from the input is potentially lost (due to gradient vanishing/explosion) as it propagates vertically through the layers.
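To make the two costs concrete, the sketch below counts the weights and biases of one standard LSTM layer (four affine maps over the concatenated input and hidden state, no peephole connections) and the number of layer evaluations that a naive stacked LSTM must run one after another. The helper names and the sizes are illustrative assumptions, not values from the experiments.

```python
def lstm_param_count(input_size, hidden_size):
    """Weights and biases of one standard LSTM layer (candidate plus three gates)."""
    return 4 * ((input_size + hidden_size) * hidden_size + hidden_size)

def stacked_lstm_sequential_steps(seq_len, num_layers):
    """Layer evaluations executed one after another in a naive stacked LSTM."""
    return seq_len * num_layers

input_size = 64  # hypothetical input dimensionality
narrow = lstm_param_count(input_size, 256)
wide = lstm_param_count(input_size, 512)
growth = wide / narrow  # approaches 4 as the hidden size dominates the input size

steps_shallow = stacked_lstm_sequential_steps(seq_len=100, num_layers=1)
steps_deep = stacked_lstm_sequential_steps(seq_len=100, num_layers=4)
```

Doubling the hidden size multiplies the parameter count by a factor close to four, while adding layers multiplies the number of sequential layer evaluations.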
In this paper, we introduce a way to both widen and deepen the LSTM whilst keeping the parameter number and runtime largely unchanged. In summary, we make the following contributions:

(a) We tensorize RNN hidden state vectors into higher-dimensional tensors which allow more flexible parameter sharing and can be widened more efficiently without additional parameters.

(b) Based on (a), we merge RNN deep computations into its temporal computations so that the network can be deepened with little additional runtime, resulting in a Tensorized RNN (tRNN).

(c) We extend the tRNN to an LSTM, namely the Tensorized LSTM (tLSTM), which integrates a novel memory cell convolution to help prevent vanishing/exploding gradients.
2 Method
2.1 Tensorizing Hidden States
It can be seen from (2) that in an RNN, the parameter number scales quadratically with the size of the hidden state. A popular way to limit the parameter number when widening the network is to organize parameters as higher-dimensional tensors which can be factorized into lower-rank sub-tensors that contain significantly fewer elements (Taylor and Hinton, 2009; Sutskever et al., 2011; Denil et al., 2013; Irsoy and Cardie, 2015; Novikov et al., 2015; Wu et al., 2016b; Bertinetto et al., 2016; Garipov et al., 2016; Krause et al., 2017), which is known as tensor factorization. This implicitly widens the network since the hidden state vectors are in fact broadcast to interact with the tensorized parameters. Another common way to reduce the parameter number is to share a small set of parameters across different locations in the hidden state, similar to Convolutional Neural Networks (CNNs) (LeCun et al., 1989, 1998).
We adopt parameter sharing to cut down the parameter number for RNNs, since compared with factorization, it has the following advantages: (i) scalability, i.e., the number of shared parameters can be set independently of the hidden state size, and (ii) separability, i.e., the information flow can be carefully managed by controlling the receptive field, allowing one to shift RNN deep computations to the temporal domain (see Sec. 2.2). We also explicitly tensorize the RNN hidden state vectors, since compared with vectors, tensors offer better: (i) flexibility, i.e., one can specify the dimensions along which parameters are shared and then increase the size of those dimensions without introducing additional parameters, and (ii) efficiency, i.e., with higher-dimensional tensors, the network can be widened faster w.r.t. its depth when fixing the parameter number (see Sec. 2.3).
For ease of exposition, we first consider 2D tensors (matrices): we tensorize the hidden state $h_t \in \mathbb{R}^M$ to become $H_t \in \mathbb{R}^{P \times M}$, where $P$ is the tensor size, and $M$ the channel size. We locally-connect the first dimension of $H_t$ in order to share parameters, and fully-connect the second dimension of $H_t$ to allow global interactions. This is analogous to the CNN which fully-connects one dimension (e.g., the RGB channel for input images) to globally fuse different feature planes. Also, if one compares $H_t$ to the hidden state of a Stacked RNN (sRNN) (see Fig. 1(a)), then $P$ is akin to the number of stacked hidden layers, and $M$ the size of each hidden layer. We start by describing our model based on 2D tensors, and finally show how to strengthen the model with higher-dimensional tensors.
2.2 Merging Deep Computations
Since an RNN is already deep in its temporal direction, we can deepen an input-to-output computation by associating the input $x_t$ with a (delayed) future output. In doing this, we need to ensure that the output $y_t$ is separable, i.e., not influenced by any future input $x_{t'}$ ($t' > t$). Thus, we concatenate the projection of $x_t$ to the top of the previous hidden state $H_{t-1}$, then gradually shift the input information down when the temporal computation proceeds, and finally generate $y_t$ from the bottom of $H_{t+L-1}$, where $L-1$ is the number of delayed timesteps for computations of depth $L$. An example with $L = 3$ is shown in Fig. 1(b). This is in fact a skewed sRNN as used in (Appleyard et al., 2016) (also similar to (van den Oord et al., 2016)). However, our method does not need to change the network structure and also allows different kinds of interactions as long as the output is separable, e.g., one can increase the local connections and use feedback (see Fig. 1(c)), which can be beneficial for sRNNs (Chung et al., 2015). In order to share parameters, we update $H_t$ using a convolution with a learnable kernel. In this manner we increase the complexity of the input-to-output mapping (by delaying outputs) and limit parameter growth (by sharing transition parameters using convolutions).
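To describe the resulting tRNN model, let $H^{cat}_{t-1} \in \mathbb{R}^{(P+1) \times M}$ be the concatenated hidden state, and $p \in \mathbb{Z}_+$ the location at a tensor. The channel vector $h^{cat}_{t-1,p} \in \mathbb{R}^M$ at location $p$ of $H^{cat}_{t-1}$ is defined as:
$$h^{cat}_{t-1,p} = \begin{cases} x_t W^x + b^x & \text{if } p = 1 \\ h_{t-1,p-1} & \text{if } p > 1 \end{cases} \tag{5}$$
where $W^x \in \mathbb{R}^{R \times M}$ and $b^x \in \mathbb{R}^M$. Then, the update of tensor $H_t$ is implemented via a convolution:
$$A_t = H^{cat}_{t-1} \circledast \{W^h, b^h\} \tag{6}$$
$$H_t = \phi(A_t) \tag{7}$$
where $W^h \in \mathbb{R}^{K \times M^i \times M^o}$ is the kernel weight of size $K$, with $M^i = M$ input channels and $M^o = M$ output channels, $b^h \in \mathbb{R}^{M^o}$ is the kernel bias, $A_t \in \mathbb{R}^{P \times M^o}$ is the hidden activation, and $\circledast$ is the convolution operator (see Appendix A.1 for a more detailed definition). Since the kernel convolves across different hidden layers, we call it the cross-layer convolution. The kernel enables interaction, both bottom-up and top-down, across layers. Finally, we generate $y_t$ from the channel vector $h_{t+L-1,P} \in \mathbb{R}^M$ which is located at the bottom of $H_{t+L-1}$:
$$y_t = \varphi(h_{t+L-1,P} W^y + b^y) \tag{8}$$
where $W^y \in \mathbb{R}^{M \times S}$ and $b^y \in \mathbb{R}^S$. To guarantee that the receptive field of $y_t$ only covers the current and previous inputs $x_{1:t}$ (see Fig. 1(c)), $L$, $P$, and $K$ should satisfy the constraint:
$$L = \left\lceil \frac{2P}{K - K \bmod 2} \right\rceil \tag{9}$$
where $\lceil \cdot \rceil$ is the ceil operation. For the derivation of (9), please see Appendix B.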
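The relation between depth, tensor size, and kernel size can be checked numerically. The sketch below assumes the constraint has the form $L = \lceil 2P / (K - K \bmod 2) \rceil$, i.e., the input information moves down by $\lfloor K/2 \rfloor$ locations per timestep; the function name and the rejection of $K < 2$ are our own illustrative choices.

```python
def required_depth(P, K):
    """Depth L = ceil(2P / (K - K mod 2)) for tensor size P and kernel size K."""
    if P < 1 or K < 2:
        raise ValueError("need P >= 1 and K >= 2 so that information can move down")
    even_part = K - K % 2                 # 2 for K in {2, 3}, 4 for K in {4, 5}, ...
    return -(-2 * P // even_part)         # integer ceiling division

depth_k2 = required_depth(P=3, K=2)
depth_k3 = required_depth(P=3, K=3)
depth_k5 = required_depth(P=5, K=5)
```

Under this assumption, kernel sizes 2 and 3 both give $L = P$, while larger kernels reach the bottom of the tensor in fewer timesteps.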
We call the model defined in (5)-(8) the Tensorized RNN (tRNN). The model can be widened by increasing the tensor size $P$, whilst the parameter number remains fixed (thanks to the convolution). Also, unlike the sRNN of runtime complexity $O(TL)$, the tRNN breaks down the runtime complexity to $O(T+L)$, which means that increasing either the sequence length $T$ or the network depth $L$ would not significantly increase the runtime.
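The following pure-Python sketch illustrates one cross-layer convolution update of a 2D hidden state and how an input pulse travels from the top of the tensor to the bottom. The exact kernel alignment is defined in Appendix A.1, which is not reproduced here, so the alignment used below (output location $p$ reads concatenated rows $p, \dots, p+K-1$, zero-padded past the bottom) is an assumption, as are the function name, the all-ones kernel, and the toy sizes.

```python
import math

def trnn_step(x_proj, H_prev, W, b):
    """One cross-layer convolution update of a 2D hidden state.

    x_proj : projected input, length M
    H_prev : previous hidden state, P rows of length M
    W      : kernel with W[k][i][o] for K taps, M input and M output channels
    b      : kernel bias, length M
    """
    P, M, K = len(H_prev), len(b), len(W)
    H_cat = [x_proj] + H_prev              # (P + 1) x M, input projection on top
    zero = [0.0] * M
    H_new = []
    for p in range(P):
        out = []
        for o in range(M):
            a = b[o]
            for k in range(K):
                # ASSUMED alignment: tap k reads row p + k, zero-padded past the bottom.
                row = H_cat[p + k] if p + k < P + 1 else zero
                a += sum(row[i] * W[k][i][o] for i in range(M))
            out.append(math.tanh(a))
        H_new.append(out)
    return H_new

P, M, K = 3, 1, 3
W = [[[1.0]] for _ in range(K)]            # all-ones kernel, purely illustrative
b = [0.0]
kernel_params = K * M * M + M              # does not depend on the tensor size P

H = [[0.0] * M for _ in range(P)]
first_nonzero_step = None
for step in range(1, 6):
    x_proj = [1.0] if step == 1 else [0.0]  # a single input pulse at the first timestep
    H = trnn_step(x_proj, H, W, b)
    if first_nonzero_step is None and H[-1][0] != 0.0:
        first_nonzero_step = step
```

With $P = 3$ and $K = 3$, the bottom location first responds at the third step, so an input needs a depth of three updates to reach the output location, and the kernel parameter count is unaffected by $P$.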
2.3 Extending to LSTMs
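To allow the tRNN to capture long-range temporal dependencies, one can straightforwardly extend it to an LSTM by replacing the tRNN tensor update equations of (6)-(7) as follows:
$$[A^g_t, A^i_t, A^f_t, A^o_t] = H^{cat}_{t-1} \circledast \{W^h, b^h\} \tag{10}$$
$$[G_t, I_t, F_t, O_t] = [\phi(A^g_t), \sigma(A^i_t), \sigma(A^f_t), \sigma(A^o_t)] \tag{11}$$
$$C_t = G_t \odot I_t + C_{t-1} \odot F_t \tag{12}$$
$$H_t = \phi(C_t) \odot O_t \tag{13}$$
where the kernel $\{W^h, b^h\}$ is of size $K$, with $M^i = M$ input channels and $M^o = 4M$ output channels, $A^g_t, A^i_t, A^f_t, A^o_t \in \mathbb{R}^{P \times M}$ are activations for the new content $G_t$, input gate $I_t$, forget gate $F_t$, and output gate $O_t$, respectively, $\sigma(\cdot)$ is the element-wise sigmoid function, and $C_t \in \mathbb{R}^{P \times M}$ is the memory cell. However, since in (12) the previous memory cell $C_{t-1}$ is only gated along the temporal direction (see Fig. 1(d)), long-range dependencies from the input to output might be lost when the tensor size $P$ becomes large.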
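Memory Cell Convolution. To capture long-range dependencies from multiple directions, we additionally introduce a novel memory cell convolution, by which the memory cells can have a larger receptive field (see Fig. 1(e)). We also dynamically generate this convolution kernel so that it is both time- and location-dependent, allowing for flexible control over long-range dependencies from different directions. This results in our tLSTM tensor update equations:
$$[A^g_t, A^i_t, A^f_t, A^o_t, A^q_t] = H^{cat}_{t-1} \circledast \{W^h, b^h\} \tag{14}$$
$$[G_t, I_t, F_t, O_t, Q_t] = [\phi(A^g_t), \sigma(A^i_t), \sigma(A^f_t), \sigma(A^o_t), \varsigma(A^q_t)] \tag{15}$$
$$W^c_t(p) = \mathrm{reshape}(q_{t,p}, [K, 1, 1]) \tag{16}$$
$$C^{conv}_{t-1} = C_{t-1} \circledast W^c_t(p) \tag{17}$$
$$C_t = G_t \odot I_t + C^{conv}_{t-1} \odot F_t \tag{18}$$
$$H_t = \phi(C_t) \odot O_t \tag{19}$$
where, in contrast to (10)-(13), the kernel $\{W^h, b^h\}$ has additional $\langle K \rangle$ output channels (the operator $\langle \cdot \rangle$ returns the cumulative product of all elements in the input variable) to generate the activation $A^q_t \in \mathbb{R}^{P \times \langle K \rangle}$ for the dynamic kernel bank $Q_t \in \mathbb{R}^{P \times \langle K \rangle}$, $q_{t,p} \in \mathbb{R}^{\langle K \rangle}$ is the vectorized adaptive kernel at the location $p$ of $Q_t$, and $W^c_t(p) \in \mathbb{R}^{K \times 1 \times 1}$ is the dynamic kernel of size $K$ with a single input/output channel, which is reshaped from $q_{t,p}$ (see Fig. 2(a) for an illustration). In (17), each channel of the previous memory cell $C_{t-1}$ is convolved with $W^c_t(p)$ whose values vary with $p$, forming a memory cell convolution (see Appendix A.2 for a more detailed definition), which produces a convolved memory cell $C^{conv}_{t-1} \in \mathbb{R}^{P \times M}$. Note that in (15) we employ a softmax function $\varsigma(\cdot)$ to normalize the channel dimension of $Q_t$, which, similar to (Leifert et al., 2016), can stabilize the value of memory cells and help prevent vanishing/exploding gradients (see Appendix C for details).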
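To illustrate why a softmax-normalized dynamic kernel keeps the memory cell values bounded, the sketch below applies a location-dependent kernel to each channel of a small 2D memory cell. It is a simplified illustration rather than the definition in Appendix A.2: the centred kernel alignment, the zero padding, and all names and numbers are assumptions.

```python
import math

def softmax(v):
    m = max(v)
    e = [math.exp(x - m) for x in v]
    s = sum(e)
    return [x / s for x in e]

def memory_cell_conv(C_prev, A_q):
    """C_prev: P x M memory cell; A_q: P x K kernel activations -> P x M convolved cell."""
    P, M, K = len(C_prev), len(C_prev[0]), len(A_q[0])
    half = (K - 1) // 2                      # ASSUMED centring of the kernel
    C_conv = []
    for p in range(P):
        q = softmax(A_q[p])                  # one kernel per location, shared by all channels
        row = []
        for m in range(M):
            acc = 0.0
            for k in range(K):
                src = p + k - half
                if 0 <= src < P:             # zero padding outside the tensor
                    acc += q[k] * C_prev[src][m]
            row.append(acc)
        C_conv.append(row)
    return C_conv

C_prev = [[1.0, -2.0], [0.5, 4.0], [-3.0, 0.0]]
A_q = [[0.0, 1.0, -1.0], [2.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
C_conv = memory_cell_conv(C_prev, A_q)

largest_before = max(abs(v) for row in C_prev for v in row)
largest_after = max(abs(v) for row in C_conv for v in row)
```

Because the kernel weights at each location are non-negative and sum to one, every convolved value is a weighted average of neighbouring cell values (and padding zeros), so its magnitude cannot exceed the largest magnitude in the previous memory cell.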
The idea of dynamically generating network weights has been used in many works (Schmidhuber, 1992; Sutskever et al., 2011; Denil et al., 2013; Bertinetto et al., 2016; De Brabandere et al., 2016; Ha et al., 2017), where in (De Brabandere et al., 2016) location-dependent convolutional kernels are also dynamically generated to improve CNNs. In contrast to these works, we focus on broadening the receptive field of tLSTM memory cells. Whilst the flexibility is retained, fewer parameters are required to generate the kernel since the kernel is shared by different memory cell channels.
Channel Normalization. To improve training, we adapt Layer Normalization (LN) (Ba et al., 2016) to our tLSTM. Similar to the observation in (Ba et al., 2016) that LN does not work well in CNNs where channel vectors at different locations have very different statistics, we find that LN is also unsuitable for tLSTM where lower-level information is near the input while higher-level information is near the output. We therefore normalize the channel vectors at different locations with their own statistics, forming a Channel Normalization (CN), with its operator $\mathrm{CN}(\cdot)$:
$$\hat{Z} = \mathrm{CN}(Z; \Gamma, B) \tag{20}$$
where $Z, \hat{Z}, \Gamma, B \in \mathbb{R}^{P \times M^z}$ are the original tensor, normalized tensor, gain parameter, and bias parameter, respectively. The $m^z$-th channel of $Z$, i.e. $z_{m^z} \in \mathbb{R}^P$, is normalized element-wise:
$$\hat{z}_{m^z} = \gamma_{m^z} \odot \frac{z_{m^z} - z^{\mu}}{z^{\sigma}} + b_{m^z} \tag{21}$$
where $z^{\mu}, z^{\sigma} \in \mathbb{R}^P$ are the mean and standard deviation along the channel dimension of $Z$, respectively, and $\hat{z}_{m^z} \in \mathbb{R}^P$ is the $m^z$-th channel of $\hat{Z}$. Note that the number of parameters introduced by CN/LN can be neglected as it is very small compared to the number of other parameters in the model.
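A minimal sketch of the per-location normalization described above is given below for a 2D tensor stored as a list of channel vectors. Each location is normalized with the mean and standard deviation of its own channel vector. The small constant `eps`, the use of the population standard deviation, and the function name are assumptions added for the illustration.

```python
import math

def channel_norm(Z, gain, bias, eps=1e-5):
    """Z, gain, bias: P x M nested lists; each location uses its own channel statistics."""
    out = []
    for z, g, b in zip(Z, gain, bias):
        mu = sum(z) / len(z)
        var = sum((v - mu) ** 2 for v in z) / len(z)
        sigma = math.sqrt(var + eps)         # eps is an added assumption for stability
        out.append([g_m * (v - mu) / sigma + b_m for v, g_m, b_m in zip(z, g, b)])
    return out

# Two locations whose channel vectors differ in scale by a factor of 100.
Z = [[1.0, 2.0, 3.0, 4.0], [100.0, 200.0, 300.0, 400.0]]
ones = [[1.0] * 4 for _ in range(2)]
zeros = [[0.0] * 4 for _ in range(2)]
Z_hat = channel_norm(Z, ones, zeros)
```

After normalization the two locations have nearly identical values despite their very different input statistics, which is the behaviour that motivates normalizing each location separately.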
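Using Higher-Dimensional Tensors. One can observe from (9) that when fixing the kernel size $K$, the tensor size $P$ of a 2D tLSTM grows linearly w.r.t. its depth $L$. How can we expand the tensor volume more rapidly so that the network can be widened more efficiently? We can achieve this goal by leveraging higher-dimensional tensors. Based on previous definitions for 2D tLSTMs, we replace the 2D tensors with $D$-dimensional ($D > 2$) tensors, obtaining $H_t, C_t \in \mathbb{R}^{P_1 \times P_2 \times \dots \times P_{D-1} \times M}$ with the tensor size $P = [P_1, P_2, \dots, P_{D-1}]$. Since the hidden states are no longer matrices, we concatenate the projection of $x_t$ to one corner of $H_{t-1}$, and thus (5) is extended as:
$$h^{cat}_{t-1,p} = \begin{cases} x_t W^x + b^x & \text{if } p_d = 1 \text{ for } d = 1, 2, \dots, D-1 \\ h_{t-1,p-1} & \text{if } p_d > 1 \text{ for } d = 1, 2, \dots, D-1 \\ 0 & \text{otherwise} \end{cases} \tag{22}$$
where $h^{cat}_{t-1,p} \in \mathbb{R}^M$ is the channel vector at location $p \in \mathbb{Z}^{D-1}_+$ of the concatenated hidden state $H^{cat}_{t-1} \in \mathbb{R}^{(P_1+1) \times (P_2+1) \times \dots \times (P_{D-1}+1) \times M}$. For the tensor update, the convolution kernels $W^h$ and $W^c_t(\cdot)$ also increase their dimensionality with kernel size $K = [K_1, K_2, \dots, K_{D-1}]$. Note that $W^c_t(\cdot)$ is reshaped from the vector, as illustrated in Fig. 2(b). Correspondingly, we generate the output $y_t$ from the opposite corner of $H_{t+L-1}$, and therefore (8) is modified as:
$$y_t = \varphi(h_{t+L-1,P} W^y + b^y) \tag{23}$$
For convenience, we set $P_d = P$ and $K_d = K$ for $d = 1, 2, \dots, D-1$ so that all dimensions of $P$ and $K$ can satisfy (9) with the same depth $L$. In addition, CN still normalizes the channel dimension of tensors.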
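The gain from higher-dimensional tensors can be quantified with a short calculation. The sketch assumes that every spatial axis has the same size and that the depth constraint takes the form $L = \lceil 2P / (K - K \bmod 2) \rceil$, so that the largest admissible axis size for a given depth is $L (K - K \bmod 2) / 2$; the helper names are hypothetical.

```python
def max_tensor_size(depth, kernel=3):
    """Largest per-axis size P compatible with the given depth (assumed constraint)."""
    return depth * (kernel - kernel % 2) // 2

def tensor_locations(depth, order, kernel=3):
    """Number of locations in an order-D tensor with D - 1 equally sized spatial axes."""
    return max_tensor_size(depth, kernel) ** (order - 1)

locations_2d = tensor_locations(depth=4, order=2)
locations_3d = tensor_locations(depth=4, order=3)
locations_4d = tensor_locations(depth=4, order=4)
```

For a fixed depth, the number of locations (and hence the width of the network) grows linearly for 2D tensors, quadratically for 3D tensors, and cubically for 4D tensors, while the kernel, and therefore the parameter number, is shared across all locations.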
3 Experiments
We evaluate tLSTM on five challenging sequence learning tasks under different configurations:

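(a) sLSTM (baseline): our implementation of sLSTM (Graves, 2013) with parameters shared across all layers.

(b) 2D tLSTM: the standard 2D tLSTM, as defined in (14)-(19).

(c) 2D tLSTM–M: removing (–) memory (M) cell convolutions from (b), as defined in (10)-(13).

(d) 2D tLSTM–F: removing (–) feedback (F) connections from (b).

(e) 3D tLSTM: tensorizing (b) into 3D tLSTM.

(f) 3D tLSTM+LN: applying (+) LN (Ba et al., 2016) to (e).

(g) 3D tLSTM+CN: applying (+) CN to (e), as defined in (20).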
To compare different configurations, we also use $L$ to denote the number of layers of an sLSTM, and $M$ to denote the hidden size of each sLSTM layer. We set the kernel size $K$ to 2 for 2D tLSTM–F and 3 for other tLSTMs, in which case we have $L = P$ according to (9).
For each configuration, we fix the parameter number and increase the tensor size to see if the performance of tLSTM can be boosted without increasing the parameter number. We also investigate how the runtime is affected by the depth, where the runtime is measured by the average GPU milliseconds spent by a forward and backward pass over one timestep of a single example. Next, we compare tLSTM against the state-of-the-art methods to evaluate its ability. Finally, we visualize the internal working mechanism of tLSTM. Please see Appendix D for training details.
3.1 Wikipedia Language Modeling
The Hutter Prize Wikipedia dataset (Hutter, 2012) consists of 100 million characters taken from 205 different characters including alphabets, XML markups and special symbols. We model the dataset at the character level, and try to predict the next character of the input sequence.
We fix the parameter number to 10M, corresponding to channel sizes of 1120 for sLSTM and 2D tLSTM–F, 901 for other 2D tLSTMs, and 522 for 3D tLSTMs. All configurations are evaluated with depths . We use Bits-per-character (BPC) to measure the model performance.
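For reference, BPC is the average negative base-2 log-probability that the model assigns to the correct next character. The following sketch computes it from a list of such probabilities; the function name and the example probabilities are illustrative.

```python
import math

def bits_per_character(probs_of_correct_char):
    """Average negative log2-probability assigned to the true next character."""
    total = sum(math.log2(p) for p in probs_of_correct_char)
    return -total / len(probs_of_correct_char)

# A model that always splits its belief evenly between two candidates.
coin_flip_bpc = bits_per_character([0.5, 0.5, 0.5, 0.5])

# A model that guesses uniformly over the 205 characters of the dataset.
uniform_bpc = bits_per_character([1.0 / 205] * 10)
```

A uniform guess over 205 characters costs $\log_2 205$ bits per character, which gives a sense of scale for the BPC values reported for this dataset.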
Results are shown in Fig. 3. When , sLSTM and 2D tLSTM–F outperform other models because of a larger . With increasing, the performances of sLSTM and 2D tLSTM–M improve but become saturated when , while tLSTMs with memory cell convolutions improve with increasing and finally outperform both sLSTM and 2D tLSTM–M. When , 2D tLSTM–F is surpassed by 2D tLSTM, which is in turn surpassed by 3D tLSTM. The performance of 3D tLSTM+LN benefits from LN only when . However, 3D tLSTM+CN consistently improves 3D tLSTM with different .
Table 1: Test BPC on the Wikipedia dataset.

| Model | BPC | # Param. |
|---|---|---|
| MI-LSTM (Wu et al., 2016b) | 1.44 | 17M |
| mLSTM (Krause et al., 2017) | 1.42 | 20M |
| HyperLSTM+LN (Ha et al., 2017) | 1.34 | 26.5M |
| HM-LSTM+LN (Chung et al., 2017) | 1.32 | 35M |
| Large RHN (Zilly et al., 2017) | 1.27 | 46M |
| Large FS-LSTM-4 (Mujika et al., 2017) | 1.245 | 47M |
| 2 × Large FS-LSTM-4 (Mujika et al., 2017) | 1.198 | 94M |
| 3D tLSTM+CN (, ) | 1.264 | 50.1M |
Whilst the runtime of sLSTM is almost proportional to $L$, it is nearly constant in each tLSTM configuration and largely independent of $L$.
We compare a larger model, i.e. a 3D tLSTM+CN with and , to the state-of-the-art methods on the test set, as reported in Table 1. Our model achieves 1.264 BPC with 50.1M parameters, and is competitive with the best performing methods (Zilly et al., 2017; Mujika et al., 2017) with similar parameter numbers.
3.2 Algorithmic Tasks
(a) Addition: The task is to sum two 15-digit integers. The network first reads two integers with one digit per timestep, and then predicts the summation. We follow the processing of (Kalchbrenner et al., 2016), where a dedicated symbol is used to delimit the integers as well as pad the input/target sequence. A 3-digit integer addition task is of the form:
(b) Memorization: The goal of this task is to memorize a sequence of 20 random symbols. Similar to the addition task, we use 65 different symbols. A 5-symbol memorization task is of the form:
We evaluate all configurations with on both tasks, where is 400 for addition and 100 for memorization. The performance is measured by the symbol prediction accuracy.
Fig. 4 shows the results. In both tasks, large degrades the performances of sLSTM and 2D tLSTM–M. In contrast, the performance of 2D tLSTM–F steadily improves with increasing, and is further enhanced by using feedback connections, higher-dimensional tensors, and CN, while LN helps only when . Note that in both tasks, the correct solution can be found (when test accuracy is achieved) due to the repetitive nature of the task. In our experiment, we also observe that for the addition task, 3D tLSTM+CN with outperforms other configurations and finds the solution with only 298K training samples, while for the memorization task, 3D tLSTM+CN with beats other configurations and achieves perfect memorization after seeing 54K training samples. Also, unlike in sLSTM, the runtime of all tLSTMs is largely unaffected by .
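Table 2: Accuracy and number of training samples on the two algorithmic tasks.

| Model | Addition Acc. | Addition # Samp. | Memorization Acc. | Memorization # Samp. |
|---|---|---|---|---|
| Stacked LSTM (Graves, 2013) | 51% | 5M | 50% | 900K |
| Grid LSTM (Kalchbrenner et al., 2016) | 99% | 550K | 99% | 150K |
| 3D tLSTM+CN () | 99% | 298K | 99% | 115K |
| 3D tLSTM+CN () | 99% | 317K | 99% | 54K |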
We further compare the best performing configurations to the state-of-the-art methods for both tasks (see Table 2). Our models solve both tasks significantly faster (i.e., using fewer training samples) than other models, achieving new state-of-the-art results.
3.3 MNIST Image Classification
The MNIST dataset (LeCun et al., 1998) consists of 50000/10000/10000 handwritten digit images of size $28 \times 28$ for training/validation/test. We have two tasks on this dataset:
(a) Sequential MNIST: The goal is to classify the digit after sequentially reading the pixels in a scanline order (Le et al., 2015). It is therefore a 784 timestep sequence learning task where a single output is produced at the last timestep; the task requires very long range dependencies in the sequence.
(b) Sequential Permuted MNIST: We permute the original image pixels in a fixed random order as in (Arjovsky et al., 2016), resulting in a permuted MNIST (pMNIST) problem that has even longer range dependencies across pixels and is harder.
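The two input orderings can be sketched as follows. The code flattens a $28 \times 28$ image (784 pixels, matching the 784 timesteps mentioned above) in scanline order and applies one fixed random permutation to every image. The seed value and function names are hypothetical; the actual permutation used in the experiments is not specified here.

```python
import random

def scanline_sequence(image):
    """Flatten a 28 x 28 image (a list of rows) into a 784-step pixel sequence."""
    return [pixel for row in image for pixel in row]

def make_fixed_permutation(length, seed=0):
    """One random pixel order, generated once and reused for every image."""
    order = list(range(length))
    random.Random(seed).shuffle(order)
    return order

def permuted_sequence(image, order):
    seq = scanline_sequence(image)
    return [seq[i] for i in order]

# Toy image whose pixel value encodes its scanline position.
image = [[r * 28 + c for c in range(28)] for r in range(28)]
order = make_fixed_permutation(28 * 28)

mnist_input = scanline_sequence(image)
pmnist_input = permuted_sequence(image, order)
```

Because the same permutation is applied to all images, the task remains learnable, but neighbouring pixels in the image are no longer neighbours in the sequence.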
In both tasks, all configurations are evaluated with and . The model performance is measured by the classification accuracy.
Results are shown in Fig. 5. sLSTM and 2D tLSTM–M no longer benefit from the increased depth when . Both increasing the depth and tensorization boost the performance of 2D tLSTM. However, removing feedback connections from 2D tLSTM seems not to affect the performance. On the other hand, CN enhances the 3D tLSTM and when it outperforms LN. 3D tLSTM+CN with achieves the highest performances in both tasks, with a validation accuracy of 99.1% for MNIST and 95.6% for pMNIST. The runtime of tLSTMs is negligibly affected by , and all tLSTMs become faster than sLSTM when .
Table 3: Test accuracy (%) on sequential MNIST and pMNIST.

| Model | MNIST | pMNIST |
|---|---|---|
| iRNN (Le et al., 2015) | 97.0 | 82.0 |
| LSTM (Arjovsky et al., 2016) | 98.2 | 88.0 |
| uRNN (Arjovsky et al., 2016) | 95.1 | 91.4 |
| Full-capacity uRNN (Wisdom et al., 2016) | 96.9 | 94.1 |
| sTANH (Zhang et al., 2016) | 98.1 | 94.0 |
| BN-LSTM (Cooijmans et al., 2017) | 99.0 | 95.4 |
| Dilated GRU (Chang et al., 2017) | 99.2 | 94.6 |
| Dilated CNN (Oord et al., 2016) in (Chang et al., 2017) | 98.3 | 96.7 |
| 3D tLSTM+CN () | 99.2 | 94.9 |
| 3D tLSTM+CN () | 99.0 | 95.7 |
We also compare the configurations of the highest test accuracies to the state-of-the-art methods (see Table 3). For sequential MNIST, our 3D tLSTM+CN with performs as well as the state-of-the-art Dilated GRU model (Chang et al., 2017), with a test accuracy of 99.2%. For the sequential pMNIST, our 3D tLSTM+CN with has a test accuracy of 95.7%, which is close to the state-of-the-art of 96.7% produced by the Dilated CNN (Oord et al., 2016) in (Chang et al., 2017).
3.4 Analysis
The experimental results of different model configurations on different tasks suggest that the performance of tLSTMs can be improved by increasing the tensor size and network depth, requiring no additional parameters and little additional runtime. As the network gets wider and deeper, we found that the memory cell convolution mechanism is crucial to maintain improvement in performance. Also, we found that feedback connections are useful for tasks of sequential output (e.g., our Wikipedia and algorithmic tasks). Moreover, tLSTM can be further strengthened via tensorization or CN.
It is also intriguing to examine the internal working mechanism of tLSTM. Thus, we visualize the memory cell which gives insight into how information is routed. For each task, the best performing tLSTM is run on a random example. We record the channel mean (the mean over channels, e.g., it is of size $P \times P$ for 3D tLSTMs) of the memory cell at each timestep, and visualize the diagonal values of the channel mean from location $(1, 1)$ (near the input) to $(P, P)$ (near the output).
Visualization results in Fig. 6 reveal the distinct behaviors of tLSTM when dealing with different tasks: (i) Wikipedia: the input can be carried to the output location with less modification if it is sufficient to determine the next character, and vice versa; (ii) addition: the first integer is gradually encoded into memories and then interacts (performs addition) with the second integer, producing the sum; (iii) memorization: the network behaves like a shift register that continues to move the input symbol to the output location at the correct timestep; (iv) sequential MNIST: the network is more sensitive to the pixel value change (representing the contour, or topology of the digit) and can gradually accumulate evidence for the final prediction; (v) sequential pMNIST: the network is sensitive to high value pixels (representing the foreground digit), and we conjecture that this is because the permutation destroys the topology of the digit, making each high value pixel potentially important.
From Fig. 6 we can also observe common phenomena for all tasks: (i) at each timestep, the values at different tensor locations are markedly different, implying that wider (larger) tensors can encode more information, with less effort to compress it; (ii) from the input to the output, the values become increasingly distinct and are shifted by time, revealing that deep computations are indeed performed together with temporal computations, with long-range dependencies carried by memory cells.
4 Related Work
Convolutional LSTMs. Convolutional LSTMs (cLSTMs) are proposed to parallelize the computation of LSTMs when the input at each timestep is structured (see Fig. 7(a)), e.g., a vector array (van den Oord et al., 2016), a vector matrix (Xingjian et al., 2015; Romera-Paredes and Torr, 2016; Patraucean et al., 2016; Wu et al., 2016a), or a vector tensor (Stollenga et al., 2015; Chen et al., 2016). Unlike cLSTMs, tLSTM aims to increase the capacity of LSTMs when the input at each timestep is non-structured, i.e., a single vector, and is advantageous over cLSTMs in that: (i) it performs the convolution across different hidden layers whose structure is independent of the input structure, and integrates information bottom-up and top-down; while cLSTM performs the convolution within each hidden layer whose structure is coupled with the input structure, and thus will fall back to the vanilla LSTM if the input at each timestep is a single vector; (ii) it can be widened efficiently without additional parameters by increasing the tensor size; while cLSTM can be widened by increasing the kernel size or kernel channel, which significantly increases the number of parameters; (iii) it can be deepened with little additional runtime by delaying the output; while cLSTM can be deepened by using more hidden layers, which significantly increases the runtime; (iv) it captures long-range dependencies from multiple directions through the memory cell convolution; while cLSTM struggles to capture long-range dependencies from multiple directions since memory cells are only gated along one direction.
Deep LSTMs. Deep LSTMs (dLSTMs) extend sLSTMs by making them deeper (see Fig. 7(b)-(d)). To keep the parameter number small and ease training, Kalchbrenner et al. (2016); Graves (2016); Zilly et al. (2017); Mujika et al. (2017) apply another RNN/LSTM along the depth direction of dLSTMs, which, however, multiplies the runtime. Though there are implementations to accelerate the deep computation (Appleyard et al., 2016; Diamos et al., 2016), they generally aim at simple architectures such as sLSTMs. Compared with dLSTMs, tLSTM performs the deep computation with little additional runtime, and employs a cross-layer convolution to enable the feedback mechanism. Moreover, the capacity of tLSTM can be increased more efficiently by using higher-dimensional tensors, whereas in dLSTM all hidden layers as a whole amount only to a 2D tensor (i.e., a stack of hidden vectors), the dimensionality of which is fixed.
Other Parallelization Methods. Some methods (Kaiser and Sutskever, 2016; Kaiser and Bengio, 2016; Oord et al., 2016; Bradbury et al., 2017; Lei and Zhang, 2017; Chang et al., 2017) parallelize the temporal computation of the sequence (e.g., use the temporal convolution, as in Fig. 7(e)) during training, in which case full input/target sequences are accessible. However, during online inference, when the input arrives sequentially, temporal computations can no longer be parallelized and will be blocked by deep computations of each timestep, making these methods potentially unsuitable for real-time applications that demand a high sampling/output frequency. Unlike these methods, tLSTM can speed up not only training but also online inference for many tasks since it performs the deep computation by the temporal computation, which is also human-like: we convert each signal to an action and meanwhile receive new signals in a non-blocking way. Note that for the online inference of tasks that use the previous output for the current input (e.g., autoregressive sequence generation), tLSTM cannot parallelize the deep computation since it requires a delay of $L-1$ timesteps to get $y_t$.
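The non-blocking behaviour and its limitation for autoregressive generation can be summarized with a small scheduling sketch. It assumes, following Sec. 2.2, that an output of depth $L$ is produced $L-1$ timesteps after its input arrives, and it counts one tensor update per timestep for the tensorized model versus one layer evaluation per layer and timestep for a stacked model. The function names are hypothetical.

```python
def output_schedule(T, L):
    """Map each input timestep t (1-based) to the step at which its output is ready."""
    return {t: t + L - 1 for t in range(1, T + 1)}

def total_steps_tensorized(T, L):
    """Sequential tensor updates needed to emit all T outputs with delay L - 1."""
    return T + L - 1

def total_steps_stacked(T, L):
    """Sequential layer evaluations of a stacked model run layer by layer."""
    return T * L

schedule = output_schedule(T=3, L=3)
tensorized = total_steps_tensorized(T=3, L=3)
stacked = total_steps_stacked(T=3, L=3)
```

New inputs keep entering while earlier ones are still being processed, so the total number of sequential steps grows with $T + L$ rather than $T \cdot L$. If each input depended on the previous output, however, the next input could not be fed until that output had emerged $L-1$ steps later, and the pipeline would stall.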
5 Conclusion
We introduced the Tensorized LSTM, which employs tensors to share parameters and utilizes the temporal computation to perform the deep computation for sequential tasks. We validated our model on a variety of tasks, showing its potential over other popular approaches.
Acknowledgements
This work is supported by the Alan Turing Institute under the EPSRC grant EP/N510129/1.
References
 Appleyard et al. [2016] Jeremy Appleyard, Tomas Kocisky, and Phil Blunsom. Optimizing performance of recurrent neural networks on GPUs. arXiv preprint arXiv:1604.01946, 2016.
 Arjovsky et al. [2016] Martin Arjovsky, Amar Shah, and Yoshua Bengio. Unitary evolution recurrent neural networks. In ICML, 2016.
 Ba et al. [2016] Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E Hinton. Layer normalization. arXiv preprint arXiv:1607.06450, 2016.
 Bengio et al. [1994] Yoshua Bengio, Patrice Simard, and Paolo Frasconi. Learning long-term dependencies with gradient descent is difficult. IEEE TNN, 5(2):157–166, 1994.
 Bengio [2009] Yoshua Bengio. Learning deep architectures for AI. Foundations and Trends in Machine Learning, 2009.
 Bertinetto et al. [2016] Luca Bertinetto, João F Henriques, Jack Valmadre, Philip Torr, and Andrea Vedaldi. Learning feed-forward one-shot learners. In NIPS, 2016.
 Bradbury et al. [2017] James Bradbury, Stephen Merity, Caiming Xiong, and Richard Socher. Quasi-recurrent neural networks. In ICLR, 2017.
 Chang et al. [2017] Shiyu Chang, Yang Zhang, Wei Han, Mo Yu, Xiaoxiao Guo, Wei Tan, Xiaodong Cui, Michael Witbrock, Mark Hasegawa-Johnson, and Thomas Huang. Dilated recurrent neural networks. In NIPS, 2017.
 Chen et al. [2016] Jianxu Chen, Lin Yang, Yizhe Zhang, Mark Alber, and Danny Z Chen. Combining fully convolutional and recurrent neural networks for 3d biomedical image segmentation. In NIPS, 2016.
 Chung et al. [2015] Junyoung Chung, Caglar Gulcehre, Kyunghyun Cho, and Yoshua Bengio. Gated feedback recurrent neural networks. In ICML, 2015.
 Chung et al. [2017] Junyoung Chung, Sungjin Ahn, and Yoshua Bengio. Hierarchical multiscale recurrent neural networks. In ICLR, 2017.
 Collobert et al. [2011] Ronan Collobert, Koray Kavukcuoglu, and Clément Farabet. Torch7: A Matlab-like environment for machine learning. In NIPS Workshop, 2011.
 Cooijmans et al. [2017] Tim Cooijmans, Nicolas Ballas, César Laurent, and Aaron Courville. Recurrent batch normalization. In ICLR, 2017.
 De Brabandere et al. [2016] Bert De Brabandere, Xu Jia, Tinne Tuytelaars, and Luc Van Gool. Dynamic filter networks. In NIPS, 2016.
 Denil et al. [2013] Misha Denil, Babak Shakibi, Laurent Dinh, Nando de Freitas, et al. Predicting parameters in deep learning. In NIPS, 2013.
 Diamos et al. [2016] Greg Diamos, Shubho Sengupta, Bryan Catanzaro, Mike Chrzanowski, Adam Coates, Erich Elsen, Jesse Engel, Awni Hannun, and Sanjeev Satheesh. Persistent RNNs: Stashing recurrent weights on-chip. In ICML, 2016.
 Elman [1990] Jeffrey L Elman. Finding structure in time. Cognitive science, 14(2):179–211, 1990.
 Garipov et al. [2016] Timur Garipov, Dmitry Podoprikhin, Alexander Novikov, and Dmitry Vetrov. Ultimate tensorization: compressing convolutional and fc layers alike. In NIPS Workshop, 2016.
 Gers et al. [2000] Felix A Gers, Jürgen Schmidhuber, and Fred Cummins. Learning to forget: Continual prediction with LSTM. Neural computation, 12(10):2451–2471, 2000.
 Graves et al. [2013] Alex Graves, Abdelrahman Mohamed, and Geoffrey Hinton. Speech recognition with deep recurrent neural networks. In ICASSP, 2013.
 Graves [2013] Alex Graves. Generating sequences with recurrent neural networks. arXiv preprint arXiv:1308.0850, 2013.
 Graves [2016] Alex Graves. Adaptive computation time for recurrent neural networks. arXiv preprint arXiv:1603.08983, 2016.
 Ha et al. [2017] David Ha, Andrew Dai, and Quoc V Le. Hypernetworks. In ICLR, 2017.
 Hochreiter and Schmidhuber [1997] Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural computation, 9(8):1735–1780, 1997.
 Hutter [2012] Marcus Hutter. The human knowledge compression contest. URL http://prize.hutter1.net, 2012.
 Irsoy and Cardie [2015] Ozan Irsoy and Claire Cardie. Modeling compositionality with multiplicative recurrent neural networks. In ICLR, 2015.
 Jozefowicz et al. [2015] Rafal Jozefowicz, Wojciech Zaremba, and Ilya Sutskever. An empirical exploration of recurrent network architectures. In ICML, 2015.
 Kaiser and Bengio [2016] Łukasz Kaiser and Samy Bengio. Can active memory replace attention? In NIPS, 2016.
 Kaiser and Sutskever [2016] Łukasz Kaiser and Ilya Sutskever. Neural gpus learn algorithms. In ICLR, 2016.
 Kalchbrenner et al. [2016] Nal Kalchbrenner, Ivo Danihelka, and Alex Graves. Grid long short-term memory. In ICLR, 2016.
 Kingma and Ba [2015] Diederik Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In ICLR, 2015.
 Krause et al. [2017] Ben Krause, Liang Lu, Iain Murray, and Steve Renals. Multiplicative lstm for sequence modelling. In ICLR Workshop, 2017.
 Le et al. [2015] Quoc V Le, Navdeep Jaitly, and Geoffrey E Hinton. A simple way to initialize recurrent networks of rectified linear units. arXiv preprint arXiv:1504.00941, 2015.
 LeCun et al. [1989] Yann LeCun, Bernhard Boser, John S Denker, Donnie Henderson, Richard E Howard, Wayne Hubbard, and Lawrence D Jackel. Backpropagation applied to handwritten zip code recognition. Neural computation, 1(4):541–551, 1989.
 LeCun et al. [1998] Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
 Lei and Zhang [2017] Tao Lei and Yu Zhang. Training rnns as fast as cnns. arXiv preprint arXiv:1709.02755, 2017.
 Leifert et al. [2016] Gundram Leifert, Tobias Strauß, Tobias Grüning, Welf Wustlich, and Roger Labahn. Cells in multidimensional recurrent neural networks. JMLR, 17(1):3313–3349, 2016.
 Mujika et al. [2017] Asier Mujika, Florian Meier, and Angelika Steger. Fast-slow recurrent neural networks. In NIPS, 2017.
 Novikov et al. [2015] Alexander Novikov, Dmitrii Podoprikhin, Anton Osokin, and Dmitry P Vetrov. Tensorizing neural networks. In NIPS, 2015.
 Oord et al. [2016] Aaron van den Oord, Sander Dieleman, Heiga Zen, Karen Simonyan, Oriol Vinyals, Alex Graves, Nal Kalchbrenner, Andrew Senior, and Koray Kavukcuoglu. Wavenet: A generative model for raw audio. arXiv preprint arXiv:1609.03499, 2016.
 Patraucean et al. [2016] Viorica Patraucean, Ankur Handa, and Roberto Cipolla. Spatiotemporal video autoencoder with differentiable memory. In ICLR Workshop, 2016.
 Romera-Paredes and Torr [2016] Bernardino Romera-Paredes and Philip Hilaire Sean Torr. Recurrent instance segmentation. In ECCV, 2016.
 Rumelhart et al. [1986] David E Rumelhart, Geoffrey E Hinton, and Ronald J Williams. Learning representations by back-propagating errors. Nature, 323(6088):533–536, 1986.
 Schmidhuber [1992] Jürgen Schmidhuber. Learning to control fast-weight memories: An alternative to dynamic recurrent networks. Neural Computation, 4(1):131–139, 1992.
 Stollenga et al. [2015] Marijn F Stollenga, Wonmin Byeon, Marcus Liwicki, and Juergen Schmidhuber. Parallel multidimensional lstm, with application to fast biomedical volumetric image segmentation. In NIPS, 2015.
 Sutskever et al. [2011] Ilya Sutskever, James Martens, and Geoffrey E Hinton. Generating text with recurrent neural networks. In ICML, 2011.
 Taylor and Hinton [2009] Graham W Taylor and Geoffrey E Hinton. Factored conditional restricted boltzmann machines for modeling motion style. In ICML, 2009.
 van den Oord et al. [2016] Aaron van den Oord, Nal Kalchbrenner, and Koray Kavukcuoglu. Pixel recurrent neural networks. In ICML, 2016.
 Wisdom et al. [2016] Scott Wisdom, Thomas Powers, John Hershey, Jonathan Le Roux, and Les Atlas. Full-capacity unitary recurrent neural networks. In NIPS, 2016.
 Wu et al. [2016a] Lin Wu, Chunhua Shen, and Anton van den Hengel. Deep recurrent convolutional networks for video-based person re-identification: An end-to-end approach. arXiv preprint arXiv:1606.01609, 2016.
 Wu et al. [2016b] Yuhuai Wu, Saizheng Zhang, Ying Zhang, Yoshua Bengio, and Ruslan Salakhutdinov. On multiplicative integration with recurrent neural networks. In NIPS, 2016.
 Xingjian et al. [2015] SHI Xingjian, Zhourong Chen, Hao Wang, Dit-Yan Yeung, Wai-kin Wong, and Wang-chun Woo. Convolutional lstm network: A machine learning approach for precipitation nowcasting. In NIPS, 2015.
 Zhang et al. [2016] Saizheng Zhang, Yuhuai Wu, Tong Che, Zhouhan Lin, Roland Memisevic, Ruslan R Salakhutdinov, and Yoshua Bengio. Architectural complexity measures of recurrent neural networks. In NIPS, 2016.
 Zilly et al. [2017] Julian Georg Zilly, Rupesh Kumar Srivastava, Jan Koutník, and Jürgen Schmidhuber. Recurrent highway networks. In ICML, 2017.
Appendix A Mathematical Definition for Cross-Layer Convolutions
A.1 Hidden State Convolution
The hidden state convolution in (6) is defined as:
(24) 
where and zero padding is applied to keep the tensor size.
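The following is a minimal NumPy sketch of a size-preserving convolution along the tensor dimension with zero padding, as described above. It is not the authors' Torch7 implementation: the names `cross_layer_conv`, `h_cat`, `w`, and `b`, the array shapes, and the kernel alignment (center rounded up for even kernel sizes, no kernel flip) are assumptions made for illustration.

```python
import numpy as np


def cross_layer_conv(h_cat, w, b):
    """Convolve along the tensor axis, keeping the tensor size.

    h_cat: (P, M_in)  input tensor (P locations, M_in channels); assumed layout
    w:     (K, M_in, M_out) kernel shared across all P locations
    b:     (M_out,)   bias
    Returns an array of shape (P, M_out).
    """
    P, _ = h_cat.shape
    K, _, m_out = w.shape
    # Assumed alignment: the kernel center is rounded up when K is even.
    left = K // 2
    right = K - 1 - left
    # Zero padding keeps the output at P locations.
    padded = np.pad(h_cat, ((left, right), (0, 0)))
    out = np.empty((P, m_out))
    for p in range(P):
        window = padded[p:p + K]  # (K, M_in) slice centered on location p
        out[p] = np.einsum("ki,kio->o", window, w) + b
    return out
```

Because `w` does not depend on the location `p`, increasing `P` in this sketch widens the tensor without adding parameters, which is the property the paper attributes to the tensorized hidden state.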
A.2 Memory Cell Convolution
The memory cell convolution in (17) is defined as:
(25) 
To prevent the stored information from being flushed away, is padded with the replication of its boundary values instead of zeros or input projections.
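As a rough illustration of why replication padding matters here, the sketch below applies a softmax-normalized kernel, shared across channels, to a memory tensor padded with its boundary values. The function name `memory_cell_conv`, the shapes, the kernel alignment, and the use of one kernel per location are assumptions for this example and are not taken from the paper's equations.

```python
import numpy as np


def memory_cell_conv(c, kernel_logits):
    """Convolve memory cells with a normalized kernel shared across channels.

    c:             (P, M) memory tensor (P locations, M channels); assumed layout
    kernel_logits: (P, K) unnormalized kernel values, one kernel per location
                   (the per-location layout is an assumption of this sketch)
    Returns an array of shape (P, M).
    """
    P, M = c.shape
    K = kernel_logits.shape[1]
    # Softmax over the kernel axis, so each kernel sums to one.
    z = kernel_logits - kernel_logits.max(axis=1, keepdims=True)
    weights = np.exp(z)
    weights /= weights.sum(axis=1, keepdims=True)
    left = K // 2
    right = K - 1 - left
    # Replicate boundary values instead of padding with zeros.
    padded = np.pad(c, ((left, right), (0, 0)), mode="edge")
    out = np.empty((P, M))
    for p in range(P):
        # (K,) @ (K, M) -> (M,): the same weights are used for every channel.
        out[p] = weights[p] @ padded[p:p + K]
    return out
```

With this padding, a memory tensor that is constant along the tensor dimension passes through unchanged, whereas zero padding would shrink the boundary values at every step.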
Appendix B Derivation for the Constraint of , , and
Here we derive the constraint of , , and that is defined in (9). The kernel center location is rounded up (ceiled) when the kernel size is even. The kernel radius can then be calculated by:
(26) 
As shown in Fig. 8, to guarantee the receptive field of covers while does not cover , the following constraint should be satisfied:
(27) 
which means:
(28) 
Plugging (26) into (28), we get:
(29) 
Appendix C Memory Cell Convolution Helps to Prevent the Vanishing/Exploding Gradients
Leifert et al. [2016] have proved that the lambda gate, which is very similar to our memory cell convolution kernel, can help to prevent vanishing/exploding gradients (see Theorems 17-18 in [Leifert et al., 2016]). Our approach differs from their lambda gate in two ways: (i) we normalize the kernel values through a softmax function, while they normalize the gate values by dividing them by their sum, and (ii) we share the kernel across all channels, while they do not. However, as neither modification affects the conditions of validity for Theorems 17-18 in [Leifert et al., 2016], our memory cell convolution can also help to prevent vanishing/exploding gradients.
Appendix D Training Details
D.1 Objective Function
The training objective is to minimize the negative log-likelihood (NLL) of the training sequences w.r.t. the parameter (vectorized), i.e.,
(30) 
where is the number of training sequences, the length of the th training sequence, and the likelihood of target conditioned on its prediction . Since all experiments are classification problems, is represented as the one-hot encoding of the class label, and the output function is defined as a softmax function, which is used to generate the class distribution . Then, the likelihood can be calculated by .
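The objective described above can be sketched in NumPy as follows. The function names, the use of pre-softmax scores (`logits`) as input, and the averaging over the number of sequences are assumptions for illustration; the exact normalization in (30) is not recoverable from the text here.

```python
import numpy as np


def softmax(z):
    """Numerically stable softmax over the last axis."""
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)


def sequence_nll(logits_per_seq, labels_per_seq):
    """Negative log-likelihood over a set of variable-length sequences.

    logits_per_seq: list of (T_n, C) arrays of pre-softmax scores
    labels_per_seq: list of (T_n,) integer class labels
    """
    total = 0.0
    for logits, labels in zip(logits_per_seq, labels_per_seq):
        probs = softmax(logits)                    # class distribution per timestep
        one_hot = np.eye(logits.shape[1])[labels]  # one-hot targets, shape (T_n, C)
        # Likelihood of each target: inner product of one-hot target and prediction.
        likelihood = (one_hot * probs).sum(axis=1)
        total += -np.log(likelihood).sum()
    # Averaging over sequences is an assumption of this sketch.
    return total / len(logits_per_seq)
```

For tasks where the loss is taken only at the last timestep, the same function can be applied to the final row of each sequence.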
D.2 Common Settings
In all tasks, the NLL (see (30)) is used as the training objective and is minimized by Adam [Kingma and Ba, 2015] with a learning rate of 0.001. Forget gate biases are set to 4 for image classification tasks and 1 [Jozefowicz et al., 2015] for others. All models are implemented in Torch7 [Collobert et al., 2011] and accelerated by cuDNN on Tesla K80 GPUs.
We apply CN only to the output of the tLSTM hidden state: among the combinations we tried, this was the most robust and improved performance on all tasks. With CN, the output of the hidden state becomes:
(31) 
D.3 Wikipedia Language Modeling
As in [Chung et al., 2015], we split the dataset into 90M/5M/5M for training/validation/test. In each iteration, we feed the model with a minibatch of 100 subsequences of length 50. During the forward pass, the hidden values at the last timestep are preserved to initialize the next iteration. We terminate training after 50 epochs.
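One way to arrange the data so that the final hidden values of one iteration can initialize the next is to split the corpus into parallel contiguous streams and cut each stream into consecutive chunks. The sketch below assumes next-symbol prediction targets and uses a hypothetical helper name, `stateful_batches`; it illustrates the batching scheme only and is not the authors' data pipeline.

```python
import numpy as np


def stateful_batches(data, batch_size=100, seq_len=50):
    """Yield (inputs, targets) pairs of shape (batch_size, seq_len).

    Row i of consecutive batches is contiguous in the corpus, so the hidden
    state left by row i at the end of one batch is a valid initial state for
    row i of the next batch.
    """
    data = np.asarray(data)
    # One symbol is reserved so that every input has a next-symbol target.
    stream_len = (len(data) - 1) // batch_size
    n_steps = stream_len // seq_len
    x = data[:batch_size * stream_len].reshape(batch_size, stream_len)
    y = data[1:batch_size * stream_len + 1].reshape(batch_size, stream_len)
    for i in range(n_steps):
        s = slice(i * seq_len, (i + 1) * seq_len)
        yield x[:, s], y[:, s]
```

In a training loop, the hidden values returned at the last timestep of each batch would be passed as the initial state for the following batch, with gradients not propagated across the batch boundary.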
D.4 Algorithmic Tasks
Following [Kalchbrenner et al., 2016], for both tasks we randomly generate 5M samples for training and 100 samples for test, and set the minibatch size to 15. Training proceeds for at most 1 epoch (to simulate the online learning process, we use all training samples only once) and will be terminated if test accuracy is achieved.
D.5 MNIST Image Classification
We set the minibatch size to 50 and use early stopping for training. The training loss is calculated at the last timestep.