# Wider and Deeper, Cheaper and Faster: Tensorized LSTMs for Sequence Learning

## Abstract

Long Short-Term Memory (LSTM) is a popular approach to boosting the ability of Recurrent Neural Networks to store longer term temporal information. The *capacity* of an LSTM network can be increased by widening and adding layers. However, usually the former introduces additional parameters, while the latter increases the runtime. As an alternative we propose the *Tensorized LSTM* in which the hidden states are represented by *tensors* and updated via a *cross-layer convolution*. By increasing the tensor size, the network can be widened efficiently without additional parameters since the parameters are shared across different locations in the tensor; by delaying the output, the network can be deepened implicitly with little additional runtime since deep computations for each timestep are merged into temporal computations of the sequence. Experiments conducted on five challenging sequence learning tasks show the potential of the proposed model.

## 1 Introduction

We consider the time-series prediction task of producing a desired output $y_t$ at each timestep $t$ given an observed input sequence $x_{1:t} = \{x_1, x_2, \cdots, x_t\}$, where $x_t$ and $y_t$ are vectors^{1}.

The update of the hidden state $h_t$ is defined as:

$$a_t = [x_t, h_{t-1}]\, W^h + b^h, \qquad h_t = \phi(a_t)$$

where $W^h$ is the weight, $b^h$ the bias, $a_t$ the hidden activation, and $\phi(\cdot)$ the element-wise tanh function. Finally, the output $y_t$ at timestep $t$ is generated by:

$$y_t = \varphi(h_t W^y + b^y)$$

where $W^y$ is the weight, $b^y$ the bias, and $\varphi(\cdot)$ can be any differentiable function, depending on the task.

However, this vanilla RNN has difficulties in modeling long-range dependencies due to the vanishing/exploding gradient problem [?]. Long Short-Term Memories (LSTMs) [?] alleviate these problems by employing memory cells to preserve information for longer, and adopting gating mechanisms to modulate the information flow. Given the success of the LSTM in sequence modeling, it is natural to consider how to increase the complexity of the model and thereby increase the set of tasks for which the LSTM can be profitably applied.

We consider the *capacity* of a network to consist of two components: the *width* (the amount of information handled in parallel) and the *depth* (the number of computation steps) [?]. A naive way to widen the LSTM is to increase the number of units in a hidden layer; however, the parameter number scales quadratically with the number of units. To deepen the LSTM, the popular Stacked LSTM (*s*LSTM) stacks multiple LSTM layers [?]; however, runtime is proportional to the number of layers and information from the input is potentially lost (due to gradient vanishing/explosion) as it propagates vertically through the layers.
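For a rough sense of these two costs, consider (as a back-of-the-envelope sketch, not a number from any experiment) an LSTM layer whose input and hidden vectors both have size $M$. Its four gates together require about

$$4\,(M \cdot M + M \cdot M + M) \;=\; 8M^{2} + 4M$$

parameters (an input-to-hidden matrix, a hidden-to-hidden matrix, and a bias per gate), so doubling the width $M$ roughly quadruples the parameter count, whereas stacking a second such layer roughly doubles both the parameter count and the per-timestep runtime.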

In this paper, we introduce a way to both widen and deepen the LSTM whilst keeping the parameter number and runtime largely unchanged. In summary, we make the following contributions:

- (a) We tensorize RNN hidden state vectors into higher-dimensional tensors, which allow more flexible parameter sharing and can be widened more efficiently without additional parameters.
- (b) Based on (a), we merge RNN deep computations into its temporal computations so that the network can be deepened with little additional runtime, resulting in a *Tensorized RNN (tRNN)*.
- (c) We extend the *t*RNN to an LSTM, namely the *Tensorized LSTM (tLSTM)*, which integrates a novel memory cell convolution to help prevent the vanishing/exploding gradients.

## 2 Method

### 2.1 Tensorizing Hidden States

It can be seen from (Equation 1) that in an RNN, the parameter number scales quadratically with the size of the hidden state. A popular way to limit the parameter number when widening the network is to organize the parameters as higher-dimensional tensors which can be factorized into lower-rank sub-tensors that contain significantly fewer elements [?], which is known as tensor factorization. This implicitly widens the network, since the hidden state vectors are in fact broadcast to interact with the tensorized parameters. Another common way to reduce the parameter number is to share a small set of parameters across different locations in the hidden state, similar to Convolutional Neural Networks (CNNs) [?].

We adopt parameter sharing to cut down the parameter number for RNNs, since compared with factorization, it has the following advantages: (i) *scalability*, i.e., the number of shared parameters can be set independently of the hidden state size, and (ii) *separability*, i.e., the information flow can be carefully managed by controlling the *receptive field*, allowing one to shift RNN deep computations to the temporal domain (see Section 2.2). We also explicitly tensorize the RNN hidden state vectors, since compared with vectors, tensors offer better: (i) *flexibility*, i.e., one can specify which dimensions share parameters and then increase the size of those dimensions without introducing additional parameters, and (ii) *efficiency*, i.e., with higher-dimensional tensors, the network can be widened faster w.r.t. its depth when fixing the parameter number (see Section 2.3).

For ease of exposition, we first consider 2D tensors (matrices): we tensorize the hidden state to become $H_t \in \mathbb{R}^{P \times M}$, where $P$ is the *tensor size* and $M$ the *channel size*. We locally-connect the first dimension of $H_t$ in order to share parameters, and fully-connect the second dimension of $H_t$ to allow global interactions. This is analogous to the CNN, which fully-connects one dimension (e.g., the RGB channel for input images) to globally fuse different feature planes. Also, if one compares $H_t$ to the hidden state of a Stacked RNN (*s*RNN) (see Figure 1(a)), then $P$ is akin to the number of stacked hidden layers, and $M$ to the size of each hidden layer. We start by describing our model based on 2D tensors, and finally show how to strengthen it with higher-dimensional tensors.
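As a concrete illustration of this parameter sharing (a minimal NumPy sketch assuming the shapes described above, not the authors' code), a kernel acting along the first dimension has a parameter count that depends only on the kernel size $K$ and channel size $M$, not on the tensor size $P$:

```python
import numpy as np

P, M, K = 5, 8, 3                      # tensor size, channel size, kernel size
H = np.zeros((P, M))                   # 2D tensorized hidden state
W = 0.1 * np.random.randn(K, M, M)     # shared kernel: K taps, M -> M channels
b = np.zeros(M)                        # shared kernel bias

n_shared = W.size + b.size             # K*M*M + M parameters, independent of P
n_dense = (P * M) ** 2                 # a dense hidden-to-hidden matrix for the
                                       # same width P*M would need this many
print(n_shared, n_dense)               # 200 vs 1600 for these toy sizes
```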

### 2.2 Merging Deep Computations

Since an RNN is already *deep* in its temporal direction, we can deepen an input-to-output computation by associating the input $x_t$ with a (delayed) future output. In doing this, we need to ensure that the output $y_t$ is *separable*, i.e., not influenced by any future input $x_{t'}$ with $t' > t$. Thus, we concatenate the projection of $x_t$ to the *top* of the previous hidden state $H_{t-1}$, then gradually shift the input information down as the temporal computation proceeds, and finally generate $y_t$ from the *bottom* of $H_{t+L-1}$, where $L-1$ is the number of delayed timesteps for computations of *depth* $L$. An example is shown in Figure 1(b). This is in fact a *skewed* *s*RNN as used in [?] (also similar to [?]). However, our method does not need to change the network structure and also allows different kinds of interactions as long as the output is separable; e.g., one can increase the local connections and use feedback (see Figure 1(c)), which can be beneficial for *s*RNNs [?]. In order to share parameters, we update the hidden state using a convolution with a learnable kernel. In this manner we increase the complexity of the input-to-output mapping (by delaying outputs) and limit parameter growth (by sharing transition parameters using convolutions).

To describe the resulting *t*RNN model, let $H^{cat}_{t-1} \in \mathbb{R}^{(P+1) \times M}$ be the concatenated hidden state, and $p \in \mathbb{Z}_+$ the *location* in a tensor. The channel vector $h^{cat}_{t-1,p} \in \mathbb{R}^M$ at location $p$ of $H^{cat}_{t-1}$ is defined as:

$$h^{cat}_{t-1,p} = \begin{cases} x_t W^x + b^x & \text{if } p = 1 \\ h_{t-1,\,p-1} & \text{otherwise} \end{cases}$$

where $W^x$ is the input projection weight and $b^x$ the corresponding bias. Then, the update of the tensor $H_t$ is implemented via a convolution:

$$A_t = H^{cat}_{t-1} \circledast \{W^h, b^h\}, \qquad H_t = \phi(A_t)$$

where $W^h$ is the *kernel weight* of size $K$, with $M$ input channels and $M$ output channels, $b^h$ is the *kernel bias*, $A_t$ is the hidden activation, and $\circledast$ is the convolution operator (see Appendix A.1 for a more detailed definition). Since the kernel convolves across different hidden layers, we call it the *cross-layer convolution*. The kernel enables interaction, both bottom-up and top-down, across layers. Finally, we generate the output from the channel vector located at the *bottom* of $H_t$:

where $W^y$ and $b^y$ are the output weight and bias. To guarantee that the *receptive field* of the output only covers the current and previous inputs (see Figure 1(c)), the tensor size $P$, kernel size $K$, and depth $L$ should satisfy the constraint:

where $\lceil \cdot \rceil$ is the ceiling operation. For the derivation of (Equation 5), please see Appendix B.

We call the model defined in (Equation 2)-(Equation 4) the *Tensorized RNN (tRNN)*. The model can be widened by increasing the tensor size $P$, whilst the parameter number remains fixed (thanks to the convolution). Also, unlike the *s*RNN, whose runtime complexity is $O(TL)$ for a length-$T$ sequence, the *t*RNN breaks the runtime complexity down to $O(T+L)$, which means that neither increasing the sequence length nor the network depth significantly increases the runtime.
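To make the preceding description concrete, here is a minimal NumPy sketch of one *t*RNN step as we read it; the shapes and the exact placement of the nonlinearity are our assumptions, not the authors' code:

```python
import numpy as np

def trnn_step(x_t, H_prev, Wx, bx, Wh, bh, Wy, by):
    """x_t: (R,) input; H_prev: (P, M) hidden tensor; Wh: (K, M, M) shared kernel."""
    P, M = H_prev.shape
    K = Wh.shape[0]
    # concatenate the input projection to the *top* of the previous hidden state
    H_cat = np.vstack([(x_t @ Wx + bx)[None, :], H_prev])          # (P+1, M)
    # cross-layer convolution along the first (layer-like) dimension, zero padded
    pad = K // 2
    H_pad = np.pad(H_cat, ((pad, pad), (0, 0)))
    A = np.stack([sum(H_pad[p + k] @ Wh[k] for k in range(K)) + bh
                  for p in range(P)])                               # (P, M)
    H_new = np.tanh(A)
    # the output is read from the *bottom* channel vector, i.e. it is delayed
    y = H_new[-1] @ Wy + by
    return H_new, y

# Usage: Wx is (R, M), Wy is (M, S); iterating trnn_step over a sequence merges
# the depth-L computation into the temporal computation.
```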

### 2.3 Extending to LSTMs

To allow the *t*RNN to capture long-range temporal dependencies, one can straightforwardly extend it to an LSTM by replacing the *t*RNN tensor update equations of (Equation 3)-( ?) as follows:

where the kernel $\{W^h, b^h\}$ is of size $K$, with $M$ input channels and $4M$ output channels, $A^g_t$, $A^i_t$, $A^f_t$, $A^o_t$ are activations for the new content $G_t$, input gate $I_t$, forget gate $F_t$, and output gate $O_t$, respectively, $\sigma(\cdot)$ is the element-wise sigmoid function, and $C_t$ is the memory cell. However, since in ( ?) the previous memory cell $C_{t-1}$ is only gated along the temporal direction (see Figure 1(d)), long-range dependencies from the input to the output might be lost when the tensor size $P$ becomes large.
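For readers who prefer code, the following NumPy sketch shows one such update with the gates wired in the standard LSTM way; it is our interpretation of the description above, with assumed shapes, not the authors' implementation:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def tlstm_step_no_cell_conv(H_cat, C_prev, Wh, bh):
    """H_cat: (P+1, M) concatenated hidden state; C_prev: (P, M) memory cell;
    Wh: (K, M, 4*M) cross-layer kernel producing all four gate activations."""
    P, M = C_prev.shape
    K = Wh.shape[0]
    pad = K // 2
    H_pad = np.pad(H_cat, ((pad, pad), (0, 0)))
    A = np.stack([sum(H_pad[p + k] @ Wh[k] for k in range(K)) + bh
                  for p in range(P)])                       # (P, 4*M)
    Ag, Ai, Af, Ao = np.split(A, 4, axis=-1)                # new content + 3 gates
    G, I, F, O = np.tanh(Ag), sigmoid(Ai), sigmoid(Af), sigmoid(Ao)
    C = G * I + C_prev * F        # the memory cell is gated only along time here
    H = np.tanh(C) * O
    return H, C
```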

Memory Cell Convolution.

To capture long-range dependencies from multiple directions, we additionally introduce a novel *memory cell convolution*, by which the memory cells can have a larger receptive field (see Figure 1(e)). We also dynamically generate this convolution kernel so that it is both time- and location-dependent, allowing for flexible control over long-range dependencies from different directions. This results in our *t*LSTM tensor update equations:

where, in contrast to (Equation 6)-( ?), the kernel $\{W^h, b^h\}$ has additional output channels^{2} to generate the dynamic kernel, and *each channel* of the previous memory cell $C_{t-1}$ is convolved with a kernel whose values vary with the location $p$, forming a *memory cell convolution* (see Appendix A.2 for a more detailed definition), which produces a convolved memory cell $C^{conv}_{t-1}$. Note that in ( ?) we employ a softmax function to normalize the channel dimension of the generated kernel, which, similar to [?], can stabilize the value of memory cells and help to prevent the vanishing/exploding gradients (see Appendix C for details).
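A hedged sketch of the mechanism follows (our reading of the description above; how the kernel activations are produced and shaped is an assumption): at every location a small kernel is generated, softmax-normalized over its taps, and applied to every channel of the previous memory cell.

```python
import numpy as np

def softmax(z, axis=-1):
    e = np.exp(z - z.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def memory_cell_convolution(C_prev, A_q):
    """C_prev: (P, M) previous memory cell; A_q: (P, K) dynamically generated
    kernel activations (one length-K kernel per location, shared over channels)."""
    P, M = C_prev.shape
    K = A_q.shape[1]
    Q = softmax(A_q)                       # normalize each location's kernel
    pad = K // 2
    # replicate boundary values so stored information is not flushed away
    C_pad = np.pad(C_prev, ((pad, pad), (0, 0)), mode="edge")
    C_conv = np.zeros_like(C_prev)
    for p in range(P):                     # a different kernel at every location...
        for k in range(K):                 # ...but shared across all M channels
            C_conv[p] += Q[p, k] * C_pad[p + k]
    return C_conv
```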

The idea of dynamically generating network weights has been used in many works [?]; in [?], location-dependent convolutional kernels are also dynamically generated to improve CNNs. In contrast to these works, we focus on broadening the receptive field of *t*LSTM memory cells. Whilst the flexibility is retained, fewer parameters are required to generate the kernel, since the kernel is shared by different memory cell channels.

Channel Normalization.

To improve training, we adapt Layer Normalization (LN) [?] to our *t*LSTM. Similar to the observation in [?] that LN does not work well in CNNs, where channel vectors at different locations have very different statistics, we find that LN is also unsuitable for *t*LSTM, where lower-level information is near the input while higher-level information is near the output. We therefore normalize the channel vectors at different locations with their own statistics, forming a *Channel Normalization (CN)*, with its operator $\mathrm{CN}(\cdot)$:

$$\mathrm{CN}(H; \Gamma, B) = \hat{H} \odot \Gamma + B$$

where $H$, $\hat{H}$, $\Gamma$, $B \in \mathbb{R}^{P \times M}$ are the original tensor, normalized tensor, *gain* parameter, and *bias* parameter, respectively, and $\odot$ denotes element-wise multiplication. The $m$-th channel of $H$, i.e. $h_m \in \mathbb{R}^P$, is normalized element-wisely:

$$\hat{h}_m = (h_m - \mu) / \sigma$$

where $\mu$, $\sigma \in \mathbb{R}^P$ are the *mean* and *standard deviation* along the channel dimension of $H$, respectively, and $\hat{h}_m$ is the $m$-th channel of $\hat{H}$. Note that the number of parameters introduced by CN/LN can be neglected, as it is very small compared to the number of other parameters in the model.
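In code, CN amounts to normalizing each location with the statistics of its own channel vector (a minimal sketch under the shapes above; `eps` is our addition for numerical stability):

```python
import numpy as np

def channel_norm(H, gain, bias, eps=1e-5):
    """H, gain, bias: arrays of shape (P, M); statistics are taken over channels."""
    mu = H.mean(axis=-1, keepdims=True)        # per-location mean, shape (P, 1)
    sigma = H.std(axis=-1, keepdims=True)      # per-location std,  shape (P, 1)
    return (H - mu) / (sigma + eps) * gain + bias

# Layer Normalization would instead use a single mean/std shared by all P
# locations, mixing statistics of locations near the input with those near the
# output, which is what CN is designed to avoid.
```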

Using Higher-Dimensional Tensors.

One can observe from (Equation 5) that, when fixing the kernel size $K$, the tensor size $P$ of a 2D *t*LSTM grows linearly w.r.t. its depth $L$. How can we expand the tensor volume more rapidly so that the network can be widened more efficiently? We can achieve this goal by leveraging higher-dimensional tensors. Based on the previous definitions for 2D *t*LSTMs, we replace the 2D tensors with $D$-dimensional ($D > 2$) tensors, obtaining a hidden state $H_t$ with tensor size $P = [P_1, P_2, \ldots, P_{D-1}]$. Since the hidden states are no longer matrices, we concatenate the projection of $x_t$ to one *corner* of $H_{t-1}$, and thus (Equation 2) is extended as:

where $h^{cat}_{t-1,p}$ is the channel vector at location $p$ of the concatenated hidden state $H^{cat}_{t-1}$. For the tensor update, the convolution kernel $\{W^h, b^h\}$ and the memory cell convolution kernel also increase their dimensionality, with kernel size $K = [K_1, K_2, \ldots, K_{D-1}]$. Note that the memory cell convolution kernel is reshaped from a vector, as illustrated in Figure 2(b). Correspondingly, we generate the output from the opposite *corner* of $H_t$, and therefore (Equation 4) is modified as:

For convenience, we set $P_1 = P_2 = \cdots = P_{D-1}$ and $K_1 = K_2 = \cdots = K_{D-1}$ so that all dimensions of $P$ and $K$ can satisfy (Equation 5) with the same depth $L$. In addition, CN still normalizes the channel dimension of tensors.
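As a small shape-bookkeeping example (illustrative sizes only, not the paper's settings), moving from 2D to 3D lets the tensor volume grow much faster in the tensor size while the shared kernel stays independent of it:

```python
import numpy as np

P, M, K = 4, 16, 3
H2 = np.zeros((P, M))              # 2D tLSTM hidden state
H3 = np.zeros((P, P, M))           # 3D tLSTM hidden state: tensor size [P, P]
W2 = np.zeros((K, M, M))           # 2D cross-layer kernel
W3 = np.zeros((K, K, M, M))        # 3D kernel grows only via the extra K dimension
print(H2.size, H3.size)            # 64 vs 256: the volume grows quadratically in P
print(W2.size, W3.size)            # 768 vs 2304: still independent of P
```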

## 3 Experiments

We evaluate *t*LSTM on five challenging sequence learning tasks under different configurations:

- (a) *s*LSTM (baseline): our implementation of *s*LSTM [?] with parameters shared across all layers.
- (b) 2D *t*LSTM: the standard 2D *t*LSTM, as defined in (Equation 7)-( ?).
- (c) 2D *t*LSTM–M: removing (–) memory (M) cell convolutions from (b), as defined in (Equation 6)-( ?).
- (d) 2D *t*LSTM–F: removing (–) feedback (F) connections from (b).
- (e) 3D *t*LSTM: tensorizing (b) into 3D *t*LSTM.
- (f) 3D *t*LSTM+LN: applying (+) LN [?] to (e).
- (g) 3D *t*LSTM+CN: applying (+) CN to (e), as defined in (Equation 8).

To compare different configurations, we also use $L$ to denote the number of layers of an *s*LSTM, and $M$ to denote the hidden size of each *s*LSTM layer. We set the kernel size $K$ to 2 for 2D *t*LSTM–F and 3 for the other *t*LSTMs, in which case the depth $L$ is determined by the tensor size $P$ according to (Equation 5).

For each configuration, we fix the parameter number and increase the tensor size to see if the performance of *t*LSTM can be boosted without increasing the parameter number. We also investigate how the runtime is affected by the depth, where the runtime is measured as the average GPU milliseconds spent on *a forward and backward pass over one timestep of a single example*. Next, we compare *t*LSTM against the state-of-the-art methods to evaluate its ability. Finally, we visualize the internal working mechanism of *t*LSTM. Please see Appendix D for training details.

### 3.1 Wikipedia Language Modeling

The Hutter Prize Wikipedia dataset [?] consists of 100 million characters drawn from 205 distinct characters, including letters, XML markup, and special symbols. We model the dataset at the character level, and try to predict the next character of the input sequence.

We fix the parameter number to 10M, corresponding to channel sizes $M$ of 1120 for *s*LSTM and 2D *t*LSTM–F, 901 for other 2D *t*LSTMs, and 522 for 3D *t*LSTMs. All configurations are evaluated over a range of depths $L$. We use Bits-per-character (BPC) to measure the model performance.

Results are shown in Fig. ?. At the smallest depth, *s*LSTM and 2D *t*LSTM–F outperform the other models because of their larger channel size $M$. As $L$ increases, the performance of *s*LSTM and 2D *t*LSTM–M improves but then saturates, while *t*LSTMs with memory cell convolutions keep improving with $L$ and finally outperform both *s*LSTM and 2D *t*LSTM–M. At larger depths, 2D *t*LSTM–F is surpassed by 2D *t*LSTM, which is in turn surpassed by 3D *t*LSTM. The performance of 3D *t*LSTM+LN benefits from LN only at certain depths, whereas 3D *t*LSTM+CN consistently improves over 3D *t*LSTM across depths.

Whilst the runtime of *s*LSTM is almost proportional to $L$, the runtime of each *t*LSTM configuration is nearly constant and largely independent of $L$.

We compare a larger 3D *t*LSTM+CN model to the state-of-the-art methods on the test set, as reported in Table ?. Our model achieves 1.264 BPC with 50.1M parameters, and is competitive with the best performing methods [?] of similar parameter number.

### 3.2 Algorithmic Tasks

(a) **Addition**: The task is to sum two 15-digit integers. The network first reads two integers with one digit per timestep, and then predicts the summation. We follow the processing of [?], where a symbol '`-`' is used to delimit the integers as well as pad the input/target sequence. A 3-digit integer addition task is of the form:

(b) **Memorization**: The goal of this task is to memorize a sequence of 20 random symbols. Similar to the addition task, we use 65 different symbols. A 5-symbol memorization task is of the form:

We evaluate all configurations on both tasks, where $M$ is 400 for *addition* and 100 for *memorization*. The performance is measured by the symbol prediction accuracy.

Fig. ? shows the results. In both tasks, a large $L$ degrades the performance of *s*LSTM and 2D *t*LSTM–M. In contrast, the performance of 2D *t*LSTM–F steadily improves as $L$ increases, and is further enhanced by using feedback connections, higher-dimensional tensors, and CN, while LN helps only at certain depths. Note that in both tasks the correct solution can be found (when 100% test accuracy is achieved) due to the repetitive nature of the task. In our experiments, we also observe that for the addition task, 3D *t*LSTM+CN outperforms the other configurations and finds the solution with only 298K training samples, while for the memorization task, 3D *t*LSTM+CN beats the other configurations and achieves perfect memorization after seeing 54K training samples. Also, unlike in *s*LSTM, the runtime of all *t*LSTMs is largely unaffected by $L$.

We further compare the best performing configurations to the state-of-the-art methods for both tasks (see Table ?). Our models solve both tasks significantly faster (i.e., using fewer training samples) than other models, achieving the new state-of-the-art results.

### 3.3 MNIST Image Classification

The MNIST dataset [?] consists of 50000/10000/10000 handwritten digit images of size 28×28 for training/validation/test. We consider two tasks on this dataset:

(a) **Sequential MNIST**: The goal is to classify the digit after sequentially reading the pixels in a scanline order [?]. It is therefore a 784 timestep sequence learning task where a single output is produced at the last timestep; the task requires very long range dependencies in the sequence.

(b) **Sequential Permuted MNIST**: We permute the original image pixels in a fixed random order as in [?], resulting in a permuted MNIST (*p*MNIST) problem that has even longer range dependencies across pixels and is harder.

In both tasks, all configurations are evaluated over a range of depths $L$ with the parameter number fixed. The model performance is measured by the classification accuracy.

Results are shown in Fig. ?. *s*LSTM and 2D *t*LSTM–M no longer benefit from increased depth beyond a point. Both increasing the depth and tensorization boost the performance of 2D *t*LSTM, whereas removing feedback connections from 2D *t*LSTM seems not to affect the performance. On the other hand, CN enhances the 3D *t*LSTM and outperforms LN at larger depths. 3D *t*LSTM+CN achieves the highest performance in both tasks, with a validation accuracy of 99.1% for MNIST and 95.6% for *p*MNIST. The runtime of *t*LSTMs is negligibly affected by $L$, and all *t*LSTMs become faster than *s*LSTM once the depth is large enough.

We also compare the configurations with the highest test accuracies to the state-of-the-art methods (see Table ?). For sequential MNIST, our 3D *t*LSTM+CN performs as well as the state-of-the-art Dilated GRU model [?], with a test accuracy of 99.2%. For sequential *p*MNIST, our 3D *t*LSTM+CN achieves a test accuracy of 95.7%, which is close to the state-of-the-art of 96.7% produced by the Dilated CNN [?] in [?].

### 3.4 Analysis

The experimental results of the different model configurations on the different tasks suggest that the performance of *t*LSTM can be improved by increasing the tensor size and network depth, with no additional parameters and little additional runtime. As the network gets wider and deeper, we find that the memory cell convolution mechanism is crucial for maintaining the improvement in performance. We also find that feedback connections are useful for tasks with sequential output (e.g., our Wikipedia and algorithmic tasks). Moreover, *t*LSTM can be further strengthened via tensorization or CN.

It is also intriguing to examine the internal working mechanism of *t*LSTM. Thus, we visualize the memory cell, which gives insight into how information is routed. For each task, the best performing *t*LSTM is run on a random example. We record the channel mean (the mean over channels; e.g., it is of size $P \times P$ for 3D *t*LSTMs) of the memory cell at each timestep, and visualize the diagonal values of the channel mean from the location nearest the input to the location nearest the output.
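A small sketch of how such a visualization can be computed (our reconstruction of the procedure, with assumed shapes):

```python
import numpy as np

def diagonal_channel_mean(C):
    """C: memory cell of a 3D tLSTM at one timestep, shape (P, P, M)."""
    channel_mean = C.mean(axis=-1)      # (P, P): mean over the M channels
    return np.diag(channel_mean)        # length-P profile, input corner -> output corner

# Stacking these profiles over all timesteps yields a (T, P) image in which one
# can watch information being shifted from the input side towards the output side.
```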

Visualization results in Fig. ? reveal the distinct behaviors of *t*LSTM when dealing with different tasks: (i) Wikipedia: the input can be carried to the output location with less modification if it is sufficient to determine the next character, and vice versa; (ii) addition: the first integer is gradually encoded into memories and then interacts (performs addition) with the second integer, producing the sum; (iii) memorization: the network behaves like a shift register that continues to move the input symbol to the output location at the correct timestep; (iv) sequential MNIST: the network is more sensitive to the pixel value change (representing the contour, or topology, of the digit) and can gradually accumulate evidence for the final prediction; (v) sequential *p*MNIST: the network is sensitive to high value pixels (representing the foreground digit), and we conjecture that this is because the permutation destroys the topology of the digit, making each high value pixel potentially important.

From Fig. ? we can also observe common phenomena for all tasks: (i) at each timestep, the values at different tensor locations are markedly different, implying that wider (larger) tensors can encode more information, with less effort to compress it; (ii) from the input to the output, the values become increasingly distinct and are shifted in time, revealing that deep computations are indeed performed together with temporal computations, with long-range dependencies carried by memory cells.

## 4 Related Work

Convolutional LSTMs.

Convolutional LSTMs (*c*LSTMs) are proposed to parallelize the computation of LSTMs when the input at each timestep is *structured* (see Figure 4(a)), e.g., a vector array [?], a vector matrix [?], or a vector tensor [?]. Unlike *c*LSTMs, *t*LSTM aims to increase the capacity of LSTMs when the input at each timestep is *non-structured*, i.e., a single vector, and is advantageous over *c*LSTMs in that: (i) it performs the convolution across different hidden layers, whose structure is independent of the input structure, and integrates information bottom-up and top-down; the *c*LSTM performs the convolution within each hidden layer, whose structure is coupled with the input structure, and thus falls back to the vanilla LSTM if the input at each timestep is a single vector; (ii) it can be widened efficiently without additional parameters by increasing the tensor size, while the *c*LSTM can be widened by increasing the kernel size or the number of kernel channels, which significantly increases the number of parameters; (iii) it can be deepened with little additional runtime by delaying the output, while the *c*LSTM can be deepened by using more hidden layers, which significantly increases the runtime; (iv) it captures long-range dependencies from multiple directions through the memory cell convolution, while the *c*LSTM struggles to capture long-range dependencies from multiple directions since its memory cells are only gated along one direction.

Deep LSTMs.

Deep LSTMs (*d*LSTMs) extend *s*LSTMs by making them deeper (see Figure 4(b)-(d)). To keep the parameter number small and ease training, [?] apply another RNN/LSTM along the *depth* direction of *d*LSTMs, which, however, multiplies the runtime. Though there are implementations to accelerate the deep computation [?], they generally target simple architectures such as *s*LSTMs. Compared with *d*LSTMs, *t*LSTM performs the deep computation with little additional runtime, and employs a cross-layer convolution to enable the feedback mechanism. Moreover, the capacity of *t*LSTM can be increased more efficiently by using higher-dimensional tensors, whereas in a *d*LSTM all hidden layers as a whole form only a 2D tensor (i.e., a stack of hidden vectors), the dimensionality of which is fixed.

Other Parallelization Methods.

Some methods [?] parallelize the temporal computation of the sequence (e.g., with a temporal convolution, as in Figure 4(e)) during training, in which case the full input/target sequences are accessible. However, during online inference, when the input arrives sequentially, the temporal computation can no longer be parallelized and is blocked by the deep computation at each timestep, making these methods potentially unsuitable for real-time applications that demand a high sampling/output frequency. Unlike these methods, *t*LSTM can speed up not only training but also online inference for many tasks, since it performs the deep computation by means of the temporal computation. This is also human-like: we convert each signal to an action and *meanwhile* receive new signals in a non-blocking way. Note that for the online inference of tasks that use the previous output as the current input (e.g., autoregressive sequence generation), *t*LSTM cannot parallelize the deep computation, since generating each output requires a delay of $L-1$ timesteps.

## 5 Conclusion

We introduced the Tensorized LSTM, which employs tensors to share parameters and utilizes the temporal computation to perform the deep computation for sequential tasks. We validated our model on a variety of tasks, showing its potential over other popular approaches.

## Acknowledgements

This work is supported by the Alan Turing Institute under the EPSRC grant EP/N510129/1.

## A Mathematical Definition for Cross-Layer Convolutions

### A.1 Hidden State Convolution

The hidden state convolution in (Equation 3) is defined as:

where zero padding is applied to keep the tensor size fixed.

### A.2 Memory Cell Convolution

The memory cell convolution in ( ?) is defined as:

To prevent the stored information from being flushed away, the previous memory cell $C_{t-1}$ is padded with the replication of its boundary values instead of with zeros or input projections.
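The difference between the two padding schemes can be seen in a couple of lines (toy values; `mode="edge"` is NumPy's replication padding):

```python
import numpy as np

c = np.array([[3.0, 1.0],            # toy memory cell with P = 2 locations
              [5.0, 2.0]])           # and M = 2 channels
zero_pad = np.pad(c, ((1, 1), (0, 0)))                 # used for the hidden state
edge_pad = np.pad(c, ((1, 1), (0, 0)), mode="edge")    # used for the memory cell
print(zero_pad[0], edge_pad[0])      # [0. 0.] vs [3. 1.]: boundaries are replicated
```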

## B Derivation for the Constraint of $P$, $K$, and $L$

Here we derive the constraint on $P$, $K$, and $L$ that is defined in (Equation 5). The kernel center location is obtained with a ceiling operation in case the kernel size $K$ is not odd. The kernel radius can then be calculated by:

As shown in Figure 5, to guarantee that the receptive field of the output covers the current input $x_t$ while not covering the future input $x_{t+1}$, the following constraint should be satisfied:

which means:

Plugging (Equation 14) into (Equation 16), we get:

## C Memory Cell Convolution Helps to Prevent the Vanishing/Exploding Gradients

[?] have proved that the *lambda gate*, which is very similar to our memory cell convolution kernel, can help to prevent the vanishing/exploding gradients (see Theorems 17-18 in [?]). The differences between our approach and their *lambda gate* are: (i) we normalize the kernel values through a softmax function, while they normalize the gate values by dividing them by their sum, and (ii) we share the kernel across all channels, while they do not. However, as neither modification affects the conditions of validity for Theorems 17-18 in [?], our memory cell convolution can also help to prevent the vanishing/exploding gradients.

## D Training Details

### D.1 Objective Function

The training objective is to minimize the negative log-likelihood (NLL) of the training sequences w.r.t. the (vectorized) parameters $\theta$, i.e.,

where $N$ is the number of training sequences, $T_n$ the length of the $n$-th training sequence, and $p(y_t \mid \hat{y}_t)$ the likelihood of the target $y_t$ conditioned on its prediction $\hat{y}_t$. Since all experiments are classification problems, $y_t$ is represented as the one-hot encoding of the class label, and the output function $\varphi(\cdot)$ is defined as a softmax function, which is used to generate the class distribution $\hat{y}_t$. Then, the likelihood can be calculated as the inner product $p(y_t \mid \hat{y}_t) = y_t\, \hat{y}_t^{\top}$.
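For one timestep, the computation described above reduces to a few lines (a toy sketch with made-up numbers):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

logits = np.array([2.0, 0.5, -1.0])    # model output for a 3-class problem
y_hat = softmax(logits)                 # predicted class distribution
y = np.array([1.0, 0.0, 0.0])           # one-hot target
likelihood = float(y @ y_hat)           # probability assigned to the true class
nll = -np.log(likelihood)               # this timestep's contribution to the NLL
print(likelihood, nll)
```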

### D.2 Common Settings

In all tasks, the NLL (see (Equation 18)) is used as the training objective and is minimized by Adam [?] with a learning rate of 0.001. Forget gate biases are set to 4 for image classification tasks and 1 [?] for others. All models are implemented by Torch7 [?] and accelerated by cuDNN on Tesla K80 GPUs.

We only apply CN to the output of the *t*LSTM hidden state, as we have tried different combinations and found this to be the most robust choice that consistently improves performance across all tasks. With CN, the output of the hidden state becomes:

### D.3 Wikipedia Language Modeling

As in [?], we split the dataset into 90M/5M/5M for training/validation/test. In each iteration, we feed the model with a mini-batch of 100 subsequences of length 50. During the forward pass, the hidden values at the last timestep are preserved to initialize the next iteration. We terminate training after 50 epochs.

### D.4 Algorithmic Tasks

Following [?], for both tasks we randomly generate 5M samples for training and 100 samples for test, and set the mini-batch size to 15. Training proceeds for at most 1 epoch^{3}

### D.5 MNIST Image Classification

We set the mini-batch size to 50 and use early stopping for training. The training loss is calculated at the last timestep.

### Footnotes

- Vectors are assumed to be in row form throughout this paper.
- The operator returns the cumulative product of all elements in the input variable.
- To simulate the online learning process, we use all training samples only once.