Wider and Deeper, Cheaper and Faster: Tensorized LSTMs for Sequence Learning
Long Short-Term Memory (LSTM) is a popular approach to boosting the ability of Recurrent Neural Networks to store longer term temporal information. The capacity of an LSTM network can be increased by widening and adding layers. However, usually the former introduces additional parameters, while the latter increases the runtime. As an alternative we propose the Tensorized LSTM in which the hidden states are represented by tensors and updated via a cross-layer convolution. By increasing the tensor size, the network can be widened efficiently without additional parameters since the parameters are shared across different locations in the tensor; by delaying the output, the network can be deepened implicitly with little additional runtime since deep computations for each timestep are merged into temporal computations of the sequence. Experiments conducted on five challenging sequence learning tasks show the potential of the proposed model.
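We consider the time-series prediction task of producing a desired output $y_t$ at each timestep $t \in \{1, \dots, T\}$ given an observed input sequence $x_{1:t} = \{x_1, x_2, \dots, x_t\}$, where $x_t \in \mathbb{R}^R$ and $y_t \in \mathbb{R}^S$ are vectors. A Recurrent Neural Network (RNN) summarizes the input history $x_{1:t}$ in a hidden state vector $h_t \in \mathbb{R}^M$. Let $h^{cat}_{t-1} = [x_t, h_{t-1}] \in \mathbb{R}^{R+M}$ be the concatenation of the current input and the previous hidden state.
The update of the hidden state $h_t$ is defined as:
$$a_t = h^{cat}_{t-1} W^h + b^h, \qquad h_t = \phi(a_t) \tag{1}$$
where $W^h \in \mathbb{R}^{(R+M) \times M}$ is the weight, $b^h \in \mathbb{R}^M$ the bias, $a_t \in \mathbb{R}^M$ the hidden activation, and $\phi(\cdot)$ the element-wise tanh function. Finally, the output $y_t$ at timestep $t$ is generated by:
$$y_t = \varphi(h_t W^y + b^y)$$
where $W^y \in \mathbb{R}^{M \times S}$ and $b^y \in \mathbb{R}^S$, and $\varphi(\cdot)$ can be any differentiable function, depending on the task.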
However, this vanilla RNN has difficulties in modeling long-range dependencies due to the vanishing/exploding gradient problem [?]. Long Short-Term Memories (LSTMs) [?] alleviate these problems by employing memory cells to preserve information for longer, and adopting gating mechanisms to modulate the information flow. Given the success of the LSTM in sequence modeling, it is natural to consider how to increase the complexity of the model and thereby increase the set of tasks for which the LSTM can be profitably applied.
We consider the capacity of a network to consist of two components: the width (the amount of information handled in parallel) and the depth (the number of computation steps) [?]. A naive way to widen the LSTM is to increase the number of units in a hidden layer; however, the parameter number scales quadratically with the number of units. To deepen the LSTM, the popular Stacked LSTM (sLSTM) stacks multiple LSTM layers [?]; however, runtime is proportional to the number of layers and information from the input is potentially lost (due to gradient vanishing/explosion) as it propagates vertically through the layers.
In this paper, we introduce a way to both widen and deepen the LSTM whilst keeping the parameter number and runtime largely unchanged. In summary, we make the following contributions:
(a) We tensorize RNN hidden state vectors into higher-dimensional tensors which allow more flexible parameter sharing and can be widened more efficiently without additional parameters.
(b) Based on (a), we merge RNN deep computations into its temporal computations so that the network can be deepened with little additional runtime, resulting in a Tensorized RNN (tRNN).
(c) We extend the tRNN to an LSTM, namely the Tensorized LSTM (tLSTM), which integrates a novel memory cell convolution to help to prevent the vanishing/exploding gradients.
2.1 Tensorizing Hidden States
It can be seen from (Equation 1) that in an RNN, the parameter number scales quadratically with the size of the hidden state. A popular way to limit the parameter number when widening the network is to organize parameters as higher-dimensional tensors which can be factorized into lower-rank sub-tensors that contain significantly fewer elements [?], which is known as tensor factorization. This implicitly widens the network since the hidden state vectors are in fact broadcast to interact with the tensorized parameters. Another common way to reduce the parameter number is to share a small set of parameters across different locations in the hidden state, similar to Convolutional Neural Networks (CNNs) [?].
We adopt parameter sharing to cut down the parameter number for RNNs, since compared with factorization, it has the following advantages: (i) scalability, i.e., the number of shared parameters can be set independently of the hidden state size, and (ii) separability, i.e., the information flow can be carefully managed by controlling the receptive field, allowing one to shift RNN deep computations to the temporal domain (see Section 2.2). We also explicitly tensorize the RNN hidden state vectors, since compared with vectors, tensors have better: (i) flexibility, i.e., one can specify which dimensions share parameters and then increase the size of those dimensions without introducing additional parameters, and (ii) efficiency, i.e., with higher-dimensional tensors, the network can be widened faster w.r.t. its depth when fixing the parameter number (see Section 2.3).
For ease of exposition, we first consider 2D tensors (matrices): we tensorize the hidden state $h_t \in \mathbb{R}^M$ to become $H_t \in \mathbb{R}^{P \times M}$, where $P$ is the tensor size, and $M$ the channel size. We locally-connect the first dimension of $H_t$ in order to share parameters, and fully-connect the second dimension of $H_t$ to allow global interactions. This is analogous to the CNN which fully-connects one dimension (e.g., the RGB channel for input images) to globally fuse different feature planes. Also, if one compares $H_t$ to the hidden state of a Stacked RNN (sRNN) (see Figure 1(a)), then $P$ is akin to the number of stacked hidden layers, and $M$ the size of each hidden layer. We start to describe our model based on 2D tensors, and finally show how to strengthen the model with higher-dimensional tensors.
2.2 Merging Deep Computations
Since an RNN is already deep in its temporal direction, we can deepen an input-to-output computation by associating the input $x_t$ with a (delayed) future output. In doing this, we need to ensure that the output $y_t$ is separable, i.e., not influenced by any future input ($x_{t+1:T}$). Thus, we concatenate the projection of $x_t$ to the top of the previous hidden state $H_{t-1}$, then gradually shift the input information down when the temporal computation proceeds, and finally generate $y_t$ from the bottom of $H_{t+L-1}$, where $L-1$ is the number of delayed timesteps for computations of depth $L$. An example is shown in Figure 1(b). This is in fact a skewed sRNN as used in [?] (also similar to [?]). However, our method does not need to change the network structure and also allows different kinds of interactions as long as the output is separable, e.g., one can increase the local connections and use feedback (see Figure 1(c)), which can be beneficial for sRNNs [?]. In order to share parameters, we update $H_t$ using a convolution with a learnable kernel. In this manner we increase the complexity of the input-to-output mapping (by delaying outputs) and limit parameter growth (by sharing transition parameters using convolutions).
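To describe the resulting tRNN model, let $H^{cat}_{t-1} \in \mathbb{R}^{(P+1) \times M}$ be the concatenated hidden state, and $p \in \mathbb{Z}_+$ the location at a tensor. The channel vector $h^{cat}_{t-1,p} \in \mathbb{R}^M$ at location $p$ of $H^{cat}_{t-1}$ is defined as:
$$h^{cat}_{t-1,p} = \begin{cases} x_t W^x + b^x & \text{if } p = 1 \\ h_{t-1,p-1} & \text{if } p > 1 \end{cases} \tag{2}$$
where $W^x \in \mathbb{R}^{R \times M}$ and $b^x \in \mathbb{R}^M$. Then, the update of tensor $H_t$ is implemented via a convolution:
$$A_t = H^{cat}_{t-1} \circledast \{W^h, b^h\}, \qquad H_t = \phi(A_t) \tag{3}$$
where $W^h \in \mathbb{R}^{K \times M^i \times M^o}$ is the kernel weight of size $K$, with $M^i = M$ input channels and $M^o = M$ output channels, $b^h \in \mathbb{R}^{M^o}$ is the kernel bias, $A_t \in \mathbb{R}^{P \times M^o}$ is the hidden activation, and $\circledast$ is the convolution operator (see Appendix A.1 for a more detailed definition). Since the kernel convolves across different hidden layers, we call it the cross-layer convolution. The kernel enables interaction, both bottom-up and top-down across layers. Finally, we generate $y_t$ from the channel vector $h_{t+L-1,P} \in \mathbb{R}^M$ which is located at the bottom of $H_{t+L-1}$:
$$y_t = \varphi(h_{t+L-1,P} W^y + b^y) \tag{4}$$
where $W^y \in \mathbb{R}^{M \times S}$ and $b^y \in \mathbb{R}^S$. To guarantee that the receptive field of $y_t$ only covers the current and previous inputs $x_{1:t}$ (see Figure 1(c)), $L$, $P$, and $K$ should satisfy the constraint:
$$L = \left\lceil \frac{2P}{K - K \bmod 2} \right\rceil \tag{5}$$
We call the model defined in (Equation 2)-(Equation 4) the Tensorized RNN (tRNN). The model can be widened by increasing the tensor size $P$, whilst the parameter number remains fixed (thanks to the convolution). Also, unlike the sRNN of runtime complexity $O(TL)$, tRNN breaks down the runtime complexity to $O(T+L)$, which means that increasing either the sequence length $T$ or the network depth $L$ would not significantly increase the runtime.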
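The tRNN update can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation: the helper names (`cross_layer_conv`, `trnn_step`, `run`, `make_params`), the random initialization, and the exact index alignment of the kernel taps are assumptions made for illustration. Here `P` is the tensor size, `M` the channel size, `K` the kernel size and `R` the input size. The kernel has the same shape for any `P`, and with `P = 3`, `K = 3` the bottom channel vector after three updates depends on the first input only.

```python
import math
import random

def kernel_radius(K):
    # Radius of a kernel whose centre is ceiled when K is even.
    return (K - K % 2) // 2

def cross_layer_conv(H_cat, W, b):
    """Convolve a (P+1) x M concatenated state with a K x M x M_out kernel.

    Assumed alignment: 0-based output location p reads rows p+1-r .. p+K-r
    of H_cat (r = kernel radius); rows outside H_cat are treated as zeros.
    """
    P, K, r = len(H_cat) - 1, len(W), kernel_radius(len(W))
    A = []
    for p in range(P):
        row = list(b)                          # start from the kernel bias
        for k in range(K):
            q = p + 1 - r + k                  # row of H_cat seen by tap k
            if 0 <= q <= P:                    # zero padding elsewhere
                for mi, h in enumerate(H_cat[q]):
                    for mo in range(len(b)):
                        row[mo] += h * W[k][mi][mo]
        A.append(row)
    return A

def trnn_step(x, H_prev, Wx, bx, W, b):
    """Project x_t, place it on top of H_{t-1}, convolve, apply tanh."""
    proj = [sum(xi * Wx[i][m] for i, xi in enumerate(x)) + bx[m]
            for m in range(len(bx))]
    A = cross_layer_conv([proj] + H_prev, W, b)
    return [[math.tanh(a) for a in row] for row in A]

def run(xs, P, M, params):
    """Run the tRNN over a sequence from a zero state; return the last H_t."""
    H = [[0.0] * M for _ in range(P)]
    for x in xs:
        H = trnn_step(x, H, *params)
    return H

def make_params(R, M, K, seed=0):
    # Hypothetical initializer: uniform weights in [-1, 1].
    rng = random.Random(seed)
    u = lambda: rng.uniform(-1.0, 1.0)
    Wx = [[u() for _ in range(M)] for _ in range(R)]
    bx = [u() for _ in range(M)]
    W = [[[u() for _ in range(M)] for _ in range(M)] for _ in range(K)]
    b = [u() for _ in range(M)]
    return Wx, bx, W, b
```

Because `make_params` never takes `P`, the same parameters can drive a wider tensor, e.g. `run(xs, 6, M, params)`, which is the sense in which widening is free of additional parameters.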
2.3 Extending to LSTMs
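To allow the tRNN to capture long-range temporal dependencies, one can straightforwardly extend it to an LSTM by replacing the tRNN tensor update of (Equation 3) as follows:
$$\begin{aligned} [A^g_t, A^i_t, A^f_t, A^o_t] &= H^{cat}_{t-1} \circledast \{W^h, b^h\} \\ [G_t, I_t, F_t, O_t] &= [\phi(A^g_t), \sigma(A^i_t), \sigma(A^f_t), \sigma(A^o_t)] \\ C_t &= G_t \odot I_t + C_{t-1} \odot F_t \\ H_t &= \phi(C_t) \odot O_t \end{aligned} \tag{6}$$
where the kernel $\{W^h, b^h\}$ is of size $K$, with $M^i = M$ input channels and $M^o = 4M$ output channels, $A^g_t, A^i_t, A^f_t, A^o_t \in \mathbb{R}^{P \times M}$ are activations for the new content $G_t$, input gate $I_t$, forget gate $F_t$, and output gate $O_t$, respectively, $\sigma(\cdot)$ is the element-wise sigmoid function, $\odot$ is the element-wise multiplication, and $C_t \in \mathbb{R}^{P \times M}$ is the memory cell. However, since in (Equation 6) the previous memory cell $C_{t-1}$ is only gated along the temporal direction (see Figure 1(d)), long-range dependencies from the input to output might be lost when the tensor size $P$ becomes large.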
Memory Cell Convolution.
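To capture long-range dependencies from multiple directions, we additionally introduce a novel memory cell convolution, by which the memory cells can have a larger receptive field (see Figure 1(e)). We also dynamically generate this convolution kernel so that it is both time- and location-dependent, allowing for flexible control over long-range dependencies from different directions. This results in our tLSTM tensor update equations:
$$\begin{aligned} [A^g_t, A^i_t, A^f_t, A^o_t, A^q_t] &= H^{cat}_{t-1} \circledast \{W^h, b^h\} \\ [G_t, I_t, F_t, O_t, Q_t] &= [\phi(A^g_t), \sigma(A^i_t), \sigma(A^f_t), \sigma(A^o_t), \varsigma(A^q_t)] \\ W^c_t(p) &= \mathrm{reshape}(q_{t,p}, [K, 1, 1]) \\ C^{conv}_{t-1} &= C_{t-1} \circledast W^c_t(p) \\ C_t &= G_t \odot I_t + C^{conv}_{t-1} \odot F_t \\ H_t &= \phi(C_t) \odot O_t \end{aligned} \tag{7}$$
where, in contrast to (Equation 6), the kernel $\{W^h, b^h\}$ has additional $\langle K \rangle$ output channels (the product of the kernel sizes, i.e., $K$ for 2D tensors) to generate the activation $A^q_t \in \mathbb{R}^{P \times \langle K \rangle}$ for the dynamic kernel bank $Q_t \in \mathbb{R}^{P \times \langle K \rangle}$, $q_{t,p} \in \mathbb{R}^{\langle K \rangle}$ is the vectorized adaptive kernel at location $p$ of $Q_t$, and $W^c_t(p) \in \mathbb{R}^{K \times 1 \times 1}$ is the dynamic kernel of size $K$ with a single input/output channel, reshaped from $q_{t,p}$. Here $\varsigma(\cdot)$ is a softmax along the channel dimension, so that the kernel values at each location sum to one, and every channel of $C_{t-1}$ is convolved with the same $W^c_t(p)$ (see Appendix A.2).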
The idea of dynamically generating network weights has been used in many works [?], where in [?] location-dependent convolutional kernels are also dynamically generated to improve CNNs. In contrast to these works, we focus on broadening the receptive field of tLSTM memory cells. Whilst the flexibility is retained, fewer parameters are required to generate the kernel since the kernel is shared by different memory cell channels.
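The memory cell convolution can be sketched in a few lines of pure Python for the 2D case. This is an illustration under stated assumptions rather than the authors' code: the function names are hypothetical, the kernel at each location is a softmax of `K` activations shared by all channels, the kernel tap alignment follows a ceiled kernel centre, and the memory cell is padded by replicating its boundary values.

```python
import math

def softmax(v):
    m = max(v)                                  # subtract max for stability
    e = [math.exp(x - m) for x in v]
    s = sum(e)
    return [x / s for x in e]

def memory_cell_conv(C_prev, A_q):
    """C_prev: P x M memory cell; A_q: P x K kernel activations.

    Each location p gets its own softmax-normalized kernel, and that
    kernel is applied to every channel of C_prev.
    """
    P, M, K = len(C_prev), len(C_prev[0]), len(A_q[0])
    r = (K - K % 2) // 2                        # assumed kernel radius
    C_conv = []
    for p in range(P):
        w = softmax(A_q[p])
        row = [0.0] * M
        for k in range(K):
            q = min(max(p - r + k, 0), P - 1)   # replicate boundary values
            for m in range(M):
                row[m] += w[k] * C_prev[q][m]
        C_conv.append(row)
    return C_conv

def cell_update(G, I, F, C_prev, A_q):
    """C_t = G * I + conv(C_{t-1}) * F, element-wise over P x M tensors."""
    C_conv = memory_cell_conv(C_prev, A_q)
    return [[g * i + c * f for g, i, c, f in zip(gr, ir, cr, fr)]
            for gr, ir, cr, fr in zip(G, I, C_conv, F)]
```

A kernel that puts almost all of its weight on the first tap copies the memory of the location above, so stored content can move toward the output without passing through the hidden state; a kernel concentrated on the centre tap reduces to the ordinary LSTM cell update.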
To improve training, we adapt Layer Normalization (LN) [?] to our tLSTM. Similar to the observation in [?] that LN does not work well in CNNs where channel vectors at different locations have very different statistics, we find that LN is also unsuitable for tLSTM where lower level information is near the input while higher level information is near the output. We therefore normalize the channel vectors at different locations with their own statistics, forming a Channel Normalization (CN), with its operator $\mathrm{CN}(\cdot)$:
$$\hat{Z} = \mathrm{CN}(Z; \Gamma, B) \tag{8}$$
where $Z, \hat{Z}, \Gamma, B \in \mathbb{R}^{P \times M^z}$ are the original tensor, normalized tensor, gain parameter, and bias parameter, respectively. The $m^z$-th channel of $Z$, i.e. $z_{m^z} \in \mathbb{R}^P$, is normalized element-wise:
$$\hat{z}_{m^z} = \gamma_{m^z} \odot \frac{z_{m^z} - z^{\mu}}{z^{\sigma}} + b_{m^z}$$
where $z^{\mu}, z^{\sigma} \in \mathbb{R}^P$ are the mean and standard deviation along the channel dimension of $Z$, respectively, and $\hat{z}_{m^z} \in \mathbb{R}^P$ is the $m^z$-th channel of $\hat{Z}$. Note that the number of parameters introduced by CN/LN can be neglected as it is very small compared to the number of other parameters in the model.
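As a small illustration of Channel Normalization, the sketch below normalizes each location of a `P x M` tensor with its own mean and standard deviation over channels, then applies a per-element gain and bias. The function name and the stabilizing constant `eps` are assumptions added here; the text above does not specify one. Layer Normalization would instead pool the statistics over all locations.

```python
import math

def channel_norm(Z, gain, bias, eps=1e-5):
    """Normalize each row (location) of Z over its channels.

    Z, gain, bias: P x M nested lists. eps is an assumed constant that
    keeps the division stable when a row has near-zero variance.
    """
    out = []
    for z, g, b in zip(Z, gain, bias):
        mu = sum(z) / len(z)                                  # per-location mean
        var = sum((v - mu) ** 2 for v in z) / len(z)          # per-location variance
        sigma = math.sqrt(var + eps)
        out.append([gi * (v - mu) / sigma + bi
                    for v, gi, bi in zip(z, g, b)])
    return out
```

With unit gain and zero bias, two locations whose channel vectors differ only by scale map to almost the same normalized vector, which is the behavior wanted when locations near the input and near the output have very different statistics.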
Using Higher-Dimensional Tensors.
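One can observe from (Equation 5) that when fixing the kernel size $K$, the tensor size $P$ of a 2D tLSTM grows linearly w.r.t. its depth $L$. How can we expand the tensor volume more rapidly so that the network can be widened more efficiently? We can achieve this goal by leveraging higher-dimensional tensors. Based on previous definitions for 2D tLSTMs, we replace the 2D tensors with $D$-dimensional ($D > 2$) tensors, obtaining $H_t, C_t \in \mathbb{R}^{P_1 \times P_2 \times \dots \times P_{D-1} \times M}$ with the tensor size $P = [P_1, P_2, \dots, P_{D-1}]$. Since the hidden states are no longer matrices, we concatenate the projection of $x_t$ to one corner of $H_{t-1}$, and thus (Equation 2) is extended as:
$$h^{cat}_{t-1,p} = \begin{cases} x_t W^x + b^x & \text{if } p_d = 1 \text{ for } d = 1, \dots, D-1 \\ h_{t-1,p-1} & \text{if } p_d > 1 \text{ for } d = 1, \dots, D-1 \\ 0 & \text{otherwise} \end{cases}$$
where $h^{cat}_{t-1,p} \in \mathbb{R}^M$ is the channel vector at location $p \in \mathbb{Z}_+^{D-1}$ of the concatenated hidden state $H^{cat}_{t-1} \in \mathbb{R}^{(P_1+1) \times \dots \times (P_{D-1}+1) \times M}$. For the tensor update, the convolution kernels $W^h$ and $W^c_t(\cdot)$ also increase their dimensionality with kernel size $K = [K_1, K_2, \dots, K_{D-1}]$. Note that $W^c_t(\cdot)$ is reshaped from the vector, as illustrated in Figure 2(b). Correspondingly, we generate the output $y_t$ from the opposite corner of $H_{t+L-1}$, and therefore (Equation 4) is modified as:
$$y_t = \varphi(h_{t+L-1,p} W^y + b^y), \qquad p_d = P_d \text{ for } d = 1, \dots, D-1$$
For convenience, we set $P_d = P$ and $K_d = K$ for $d = 1, \dots, D-1$ so that all dimensions of $P$ and $K$ can satisfy (Equation 5) with the same depth $L$. In addition, CN still normalizes the channel dimension of tensors.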
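The gain from higher-dimensional tensors can be made concrete with a short calculation. The helpers below are hypothetical names introduced for illustration; they assume the same size in every tensor dimension and take the largest per-dimension size that the depth constraint allows for a given depth `L` and kernel size `K`.

```python
def kernel_radius(K):
    # Radius of a kernel whose centre is ceiled when K is even.
    return (K - K % 2) // 2

def max_tensor_size(L, K):
    # Largest per-dimension size P for which ceil(P / radius) equals L.
    return L * kernel_radius(K)

def tensor_volume(L, K, D):
    # Number of locations in a D-dimensional hidden tensor:
    # D - 1 location dimensions of equal size, plus one channel dimension.
    return max_tensor_size(L, K) ** (D - 1)

if __name__ == "__main__":
    for L in (1, 2, 3, 4):
        print(L, tensor_volume(L, 3, 2), tensor_volume(L, 3, 3))
```

For a fixed kernel size the number of locations grows linearly with the depth for 2D tensors and quadratically for 3D tensors, which is the sense in which the network widens faster with respect to its depth.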
We evaluate tLSTM on five challenging sequence learning tasks under different configurations:
(a) sLSTM (baseline): our implementation of sLSTM [?] with parameters shared across all layers.
(b) 2D tLSTM: the standard 2D tLSTM, as defined in (Equation 7).
(c) 2D tLSTM–M: removing (–) memory (M) cell convolutions from (b), as defined in (Equation 6).
(d) 2D tLSTM–F: removing (–) feedback (F) connections from (b).
(e) 3D tLSTM: tensorizing (b) into 3D tLSTM.
(f) 3D tLSTM+LN: applying (+) LN [?] to (e).
(g) 3D tLSTM+CN: applying (+) CN to (e), as defined in (Equation 8).
To compare different configurations, we also use $L$ to denote the number of layers of an sLSTM, and $M$ to denote the hidden size of each sLSTM layer. We set the kernel size $K$ to 2 for 2D tLSTM–F and 3 for other tLSTMs, in which case we have $L = P$ according to (Equation 5).
For each configuration, we fix the parameter number and increase the tensor size to see if the performance of tLSTM can be boosted without increasing the parameter number. We also investigate how the runtime is affected by the depth, where the runtime is measured by the average GPU milliseconds spent by a forward and backward pass over one timestep of a single example. Next, we compare tLSTM against the state-of-the-art methods to evaluate its ability. Finally, we visualize the internal working mechanism of tLSTM. Please see Appendix D for training details.
3.1 Wikipedia Language Modeling
The Hutter Prize Wikipedia dataset [?] consists of 100 million characters taken from 205 different characters including alphabets, XML markups and special symbols. We model the dataset at the character-level, and try to predict the next character of the input sequence.
We fix the parameter number to 10M, corresponding to channel sizes of 1120 for sLSTM and 2D tLSTM–F, 901 for other 2D tLSTMs, and 522 for 3D tLSTMs. All configurations are evaluated with depths . We use Bits-per-character (BPC) to measure the model performance.
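The stated channel sizes can be sanity-checked against the 10M budget with a rough count of the dominant weight tensors. The accounting below is an assumption made for illustration, not the authors' bookkeeping: it counts only the recurrent kernel weights, ignores biases and the input/output projections, assumes a kernel of `K` taps per location dimension with `4M` gate channels plus one extra channel per tap for the memory cell kernel, and models the shared sLSTM layer as a `2M x 4M` matrix.

```python
def tlstm_kernel_weights(M, K, D, memory_conv=True):
    """Approximate weight count of a tLSTM kernel (biases ignored)."""
    taps = K ** (D - 1)                          # kernel locations
    out_channels = 4 * M + (taps if memory_conv else 0)
    return taps * M * out_channels

def shared_slstm_weights(M):
    """Approximate weight count of one LSTM layer reused at every depth."""
    return 2 * M * 4 * M                         # [input from below, h] -> 4 gates

counts = {
    "sLSTM, M=1120": shared_slstm_weights(1120),
    "2D tLSTM-F, K=2, M=1120": tlstm_kernel_weights(1120, 2, 2),
    "2D tLSTM, K=3, M=901": tlstm_kernel_weights(901, 3, 2),
    "3D tLSTM, K=3, M=522": tlstm_kernel_weights(522, 3, 3),
}

if __name__ == "__main__":
    for name, n in counts.items():
        print(f"{name}: {n / 1e6:.2f}M")
```

Under these assumptions every configuration lands within a few percent of 10M, with the remainder plausibly taken up by the projections and biases that the sketch leaves out.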
Results are shown in Fig. ?. When , sLSTM and 2D tLSTM–F outperform other models because of a larger . With increasing, the performances of sLSTM and 2D tLSTM–M improve but become saturated when , while tLSTMs with memory cell convolutions improve with increasing and finally outperform both sLSTM and 2D tLSTM–M. When , 2D tLSTM–F is surpassed by 2D tLSTM, which is in turn surpassed by 3D tLSTM. The performance of 3D tLSTM+LN benefits from LN only when . However, 3D tLSTM+CN consistently improves 3D tLSTM with different .
Whilst the runtime of sLSTM is almost proportional to $L$, it is nearly constant in each tLSTM configuration and largely independent of $L$.
We compare a larger model, i.e. a 3D tLSTM+CN with and , to the state-of-the-art methods on the test set, as reported in Table ?. Our model achieves 1.264 BPC with 50.1M parameters, and is competitive to the best performing methods [?] with similar parameter numbers.
(a) Addition: The task is to sum two 15-digit integers. The network first reads two integers with one digit per timestep, and then predicts the summation. We follow the processing of [?], where a symbol '-' is used to delimit the integers as well as pad the input/target sequence. A 3-digit integer addition task is of the form:
(b) Memorization: The goal of this task is to memorize a sequence of 20 random symbols. Similar to the addition task, we use 65 different symbols. A 5-symbol memorization task is of the form:
We evaluate all configurations with on both tasks, where is 400 for addition and 100 for memorization. The performance is measured by the symbol prediction accuracy.
Fig. ? shows the results. In both tasks, large $L$ degrades the performances of sLSTM and 2D tLSTM–M. In contrast, the performance of 2D tLSTM–F steadily improves with $L$ increasing, and is further enhanced by using feedback connections, higher-dimensional tensors, and CN, while LN helps only when . Note that in both tasks, the correct solution can be found (when 100% test accuracy is achieved) due to the repetitive nature of the task. In our experiment, we also observe that for the addition task, 3D tLSTM+CN with outperforms other configurations and finds the solution with only 298K training samples, while for the memorization task, 3D tLSTM+CN with beats other configurations and achieves perfect memorization after seeing 54K training samples. Also, unlike in sLSTM, the runtime of all tLSTMs is largely unaffected by $L$.
We further compare the best performing configurations to the state-of-the-art methods for both tasks (see Table ?). Our models solve both tasks significantly faster (i.e., using fewer training samples) than other models, achieving the new state-of-the-art results.
3.3 MNIST Image Classification
The MNIST dataset [?] consists of 50000/10000/10000 handwritten digit images of size $28 \times 28$ for training/validation/test. We have two tasks on this dataset:
(a) Sequential MNIST: The goal is to classify the digit after sequentially reading the pixels in a scanline order [?]. It is therefore a 784 timestep sequence learning task where a single output is produced at the last timestep; the task requires very long range dependencies in the sequence.
(b) Sequential Permuted MNIST: We permute the original image pixels in a fixed random order as in [?], resulting in a permuted MNIST (pMNIST) problem that has even longer range dependencies across pixels and is harder.
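The two input encodings can be sketched as follows. The function names and the seed are illustrative assumptions; the only properties taken from the text are that pixels are read in scanline order, giving 784 timesteps for a 28 x 28 image, and that pMNIST applies one fixed random permutation to every image.

```python
import random

def to_sequence(image):
    """Flatten a 2D image (list of rows) into a scanline-order pixel sequence."""
    return [px for row in image for px in row]

def make_permutation(n, seed=0):
    """Draw one random pixel order; the same order is reused for all images."""
    order = list(range(n))
    random.Random(seed).shuffle(order)
    return order

def permute(seq, order):
    """Reorder a pixel sequence according to a fixed permutation."""
    return [seq[i] for i in order]

if __name__ == "__main__":
    image = [[r * 28 + c for c in range(28)] for r in range(28)]  # dummy image
    seq = to_sequence(image)
    order = make_permutation(len(seq))
    print(len(seq), permute(seq, order)[:5])
```

Since neighbouring pixels end up at arbitrary distances in the permuted sequence, local image structure turns into long-range dependencies, which is why the permuted task is harder.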
In both tasks, all configurations are evaluated with and . The model performance is measured by the classification accuracy.
Results are shown in Fig. ?. sLSTM and 2D tLSTM–M no longer benefit from the increased depth when . Both increasing the depth and tensorization boost the performance of 2D tLSTM. However, removing feedback connections from 2D tLSTM seems not to affect the performance. On the other hand, CN enhances the 3D tLSTM and when it outperforms LN. 3D tLSTM+CN with achieves the highest performances in both tasks, with a validation accuracy of 99.1% for MNIST and 95.6% for pMNIST. The runtime of tLSTMs is negligibly affected by , and all tLSTMs become faster than sLSTM when .
We also compare the configurations of the highest test accuracies to the state-of-the-art methods (see Table ?). For sequential MNIST, our 3D tLSTM+CN with performs as well as the state-of-the-art Dilated GRU model [?], with a test accuracy of 99.2%. For the sequential pMNIST, our 3D tLSTM+CN with has a test accuracy of 95.7%, which is close to the state-of-the-art of 96.7% produced by the Dilated CNN [?] in [?].
The experimental results of different model configurations on different tasks suggest that the performance of tLSTMs can be improved by increasing the tensor size and network depth, requiring no additional parameters and little additional runtime. As the network gets wider and deeper, we found that the memory cell convolution mechanism is crucial to maintain improvement in performance. Also, we found that feedback connections are useful for tasks of sequential output (e.g., our Wikipedia and algorithmic tasks). Moreover, tLSTM can be further strengthened via tensorization or CN.
It is also intriguing to examine the internal working mechanism of tLSTM. Thus, we visualize the memory cell which gives insight into how information is routed. For each task, the best performing tLSTM is run on a random example. We record the channel mean (the mean over channels, e.g., it is of size for 3D tLSTMs) of the memory cell at each timestep, and visualize the diagonal values of the channel mean from location (near the input) to (near the output).
Visualization results in Fig. reveal the distinct behaviors of tLSTM when dealing with different tasks: (i) Wikipedia: the input can be carried to the output location with less modification if it is sufficient to determine the next character, and vice versa; (ii) addition: the first integer is gradually encoded into memories and then interacts (performs addition) with the second integer, producing the sum; (iii) memorization: the network behaves like a shift register that continues to move the input symbol to the output location at the correct timestep; (iv) sequential MNIST: the network is more sensitive to the pixel value change (representing the contour, or topology of the digit) and can gradually accumulate evidence for the final prediction; (v) sequential pMNIST: the network is sensitive to high value pixels (representing the foreground digit), and we conjecture that this is because the permutation destroys the topology of the digit, making each high value pixel potentially important.
From Fig. we can also observe common phenomena for all tasks: (i) at each timestep, the values at different tensor locations are markedly different, implying that wider (larger) tensors can encode more information, with less effort to compress it; (ii) from the input to the output, the values become increasingly distinct and are shifted by time, revealing that deep computations are indeed performed together with temporal computations, with long-range dependencies carried by memory cells.
Convolutional LSTMs (cLSTMs) are proposed to parallelize the computation of LSTMs when the input at each timestep is structured (see Figure 4(a)), e.g., a vector array [?], a vector matrix [?], or a vector tensor [?]. Unlike cLSTMs, tLSTM aims to increase the capacity of LSTMs when the input at each timestep is non-structured, i.e., a single vector, and is advantageous over cLSTMs in that: (i) it performs the convolution across different hidden layers whose structure is independent of the input structure, and integrates information bottom-up and top-down; while cLSTM performs the convolution within each hidden layer whose structure is coupled with the input structure, thus will fall back to the vanilla LSTM if the input at each timestep is a single vector; (ii) it can be widened efficiently without additional parameters by increasing the tensor size; while cLSTM can be widened by increasing the kernel size or kernel channel, which significantly increases the number of parameters; (iii) it can be deepened with little additional runtime by delaying the output; while cLSTM can be deepened by using more hidden layers, which significantly increases the runtime; (iv) it captures long-range dependencies from multiple directions through the memory cell convolution; while cLSTM struggles to capture long-range dependencies from multiple directions since memory cells are only gated along one direction.
Deep LSTMs (dLSTMs) extend sLSTMs by making them deeper (see Figure 4(b)-(d)). To keep the parameter number small and ease training, [?] apply another RNN/LSTM along the depth direction of dLSTMs, which, however, multiplies the runtime. Though there are implementations to accelerate the deep computation [?], they generally aim at simple architectures such as sLSTMs. Compared with dLSTMs, tLSTM performs the deep computation with little additional runtime, and employs a cross-layer convolution to enable the feedback mechanism. Moreover, the capacity of tLSTM can be increased more efficiently by using higher-dimensional tensors, whereas in dLSTM all hidden layers as a whole are only equal to a 2D tensor (i.e., a stack of hidden vectors), the dimensionality of which is fixed.
Other Parallelization Methods.
Some methods [?] parallelize the temporal computation of the sequence (e.g., use the temporal convolution, as in Figure 4(e)) during training, in which case full input/target sequences are accessible. However, during online inference, when the input arrives sequentially, temporal computations can no longer be parallelized and will be blocked by deep computations of each timestep, making these methods potentially unsuitable for real-time applications that demand a high sampling/output frequency. Unlike these methods, tLSTM can speed up not only training but also online inference for many tasks since it performs the deep computation by the temporal computation, which is also human-like: we convert each signal to an action and meanwhile receive new signals in a non-blocking way. Note that for the online inference of tasks that use the previous output $y_{t-1}$ for the current input $x_t$ (e.g., autoregressive sequence generation), tLSTM cannot parallelize the deep computation since it requires a delay of $L-1$ timesteps to get $y_t$.
We introduced the Tensorized LSTM, which employs tensors to share parameters and utilizes the temporal computation to perform the deep computation for sequential tasks. We validated our model on a variety of tasks, showing its potential over other popular approaches.
This work is supported by the Alan Turing Institute under the EPSRC grant EP/N510129/1.
A Mathematical Definition for Cross-Layer Convolutions
A.1 Hidden State Convolution
The hidden state convolution in (Equation 3) is defined as:
where and zero padding is applied to keep the tensor size.
A.2 Memory Cell Convolution
The memory cell convolution in ( ?) is defined as:
To prevent the stored information from being flushed away, is padded with the replication of its boundary values instead of zeros or input projections.
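B Derivation for the Constraint of $L$, $P$, and $K$
Here we derive the constraint of $L$, $P$, and $K$ that is defined in (Equation 5). The kernel center location is ceiled in case the kernel size $K$ is not odd. Then, the kernel radius $K^r$ can be calculated by:
$$K^r = \frac{K - K \bmod 2}{2}$$
As shown in Figure 5, to guarantee that the receptive field of $y_t$ covers $x_{1:t}$ but does not cover $x_{t+1:T}$, the following constraint should be satisfied:
$$\frac{P}{L} \le K^r < \frac{P}{L-1}$$
which gives $L = \left\lceil P / K^r \right\rceil = \left\lceil \frac{2P}{K - K \bmod 2} \right\rceil$.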
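The constraint can also be checked numerically by tracking which tensor locations an input can influence after each update. The sketch below is an illustration with hypothetical function names; it assumes that a kernel of size `K` lets location `p` read the offsets `-r, ..., K-1-r` around itself, where `r` is the kernel radius with a ceiled centre, and that slot 0 holds the projection of the newest input.

```python
def kernel_radius(K):
    return (K - K % 2) // 2

def depth_from_constraint(P, K):
    # ceil(P / radius) computed with integers
    return -(-P // kernel_radius(K))

def depth_by_simulation(P, K):
    """Count updates until the input in slot 0 can influence location P."""
    r = kernel_radius(K)
    reached, updates = {0}, 0          # slot 0 = projection of x_t
    while P not in reached:
        # After the first update slot 0 carries a newer input, so only
        # locations 1..P are kept in the reachable set.
        reached = {p for p in range(1, P + 1)
                   if any(p + d in reached for d in range(-r, K - r))}
        updates += 1
    return updates

if __name__ == "__main__":
    for K in (2, 3, 5):
        print(K, [(P, depth_by_simulation(P, K)) for P in (1, 2, 4, 8)])
```

The simulated count is the smallest number of updates after which the bottom location sees the input. An input that arrives one step later has had one update fewer and therefore cannot yet reach the bottom, which is the separability requirement on the output.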
C Memory Cell Convolution Helps to Prevent the Vanishing/Exploding Gradients
[?] have proved that the lambda gate, which is very similar to our memory cell convolution kernel, can help to prevent the vanishing/exploding gradients (see Theorem 17-18 in [?]). The differences between our approach and their lambda gate are: (i) we normalize the kernel values through a softmax function, while they normalize the gate values by dividing them by their sum, and (ii) we share the kernel for all channels, while they do not. However, as neither modification affects the conditions of validity for Theorem 17-18 in [?], our memory cell convolution can also help to prevent the vanishing/exploding gradients.
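The training objective is to minimize the negative log-likelihood (NLL) of the training sequences w.r.t. the parameter $\theta$ (vectorized), i.e.,
$$\min_{\theta} \frac{1}{N} \sum_{n=1}^{N} \sum_{t=1}^{T_n} -\ln p\big(y_{n,t} \mid f(x_{n,1:t}; \theta)\big) \tag{18}$$
where $N$ is the number of training sequences, $T_n$ the length of the $n$-th training sequence, and $p(y_{n,t} \mid f(x_{n,1:t}; \theta))$ the likelihood of target $y_{n,t}$ conditioned on its prediction $f(x_{n,1:t}; \theta)$. Since all experiments are classification problems, $y_{n,t}$ is represented as the one-hot encoding of the class label, and the output function $\varphi(\cdot)$ is defined as a softmax function, which is used to generate the class distribution $f(x_{n,1:t}; \theta)$. Then, the likelihood can be calculated by taking the entry of the class distribution at the position where $y_{n,t}$ equals one.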
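A minimal sketch of this objective for softmax outputs is given below. The function names are hypothetical, targets are passed as class indices (equivalent to one-hot vectors), and the normalization by the number of sequences with a sum over timesteps is an assumption about the exact form of the objective.

```python
import math

def log_softmax(logits):
    m = max(logits)                                   # subtract max for stability
    log_z = m + math.log(sum(math.exp(v - m) for v in logits))
    return [v - log_z for v in logits]

def sequence_nll(logit_seqs, target_seqs):
    """Average over sequences of the summed per-timestep NLL.

    logit_seqs: list of sequences, each a list of per-timestep logit vectors.
    target_seqs: list of sequences of class indices with matching lengths.
    """
    total = 0.0
    for logits, targets in zip(logit_seqs, target_seqs):
        for step_logits, label in zip(logits, targets):
            # Picking the log-probability of the labelled class is the same
            # as taking the dot product with a one-hot target.
            total -= log_softmax(step_logits)[label]
    return total / len(logit_seqs)
```

For uniform logits over `S` classes every timestep contributes `ln S`, which gives a quick check that an untrained model starts near the expected loss.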
In all tasks, the NLL (see (Equation 18)) is used as the training objective and is minimized by Adam [?] with a learning rate of 0.001. Forget gate biases are set to 4 for image classification tasks and 1 [?] for others. All models are implemented by Torch7 [?] and accelerated by cuDNN on Tesla K80 GPUs.
We only apply CN to the output of the tLSTM hidden state as we have tried different combinations and found this is the most robust way that can always improve the performance for all tasks. With CN, the output of hidden state becomes:
D.3 Wikipedia Language Modeling
As in [?], we split the dataset into 90M/5M/5M for training/validation/test. In each iteration, we feed the model with a mini-batch of 100 subsequences of length 50. During the forward pass, the hidden values at the last timestep are preserved to initialize the next iteration. We terminate training after 50 epochs.
Following [?], for both tasks we randomly generate 5M samples for training and 100 samples for test, and set the mini-batch size to 15. Training proceeds for at most 1 epoch.
D.5 MNIST Image Classification
We set the mini-batch size to 50 and use early stopping for training. The training loss is calculated at the last timestep.
- Vectors are assumed to be in row form throughout this paper.
- The operator $\langle \cdot \rangle$ returns the cumulative product of all elements in the input variable.
- To simulate the online learning process, we use all training samples only once.