Long-term Forecasting using Tensor-Train RNNs
Abstract
We present the Tensor-Train RNN (TT-RNN), a novel family of neural sequence architectures for multivariate forecasting in environments with nonlinear dynamics. Long-term forecasting in such systems is highly challenging, since they exhibit long-term temporal dependencies, higher-order correlations and sensitivity to error propagation. Our proposed tensor recurrent architecture addresses these issues by learning the nonlinear dynamics directly using higher-order moments and higher-order state transition functions. Furthermore, we decompose the higher-order structure using the tensor-train (TT) decomposition to reduce the number of parameters while preserving the model performance. We theoretically establish the approximation properties of Tensor-Train RNNs for general sequence inputs; such guarantees are not available for standard RNNs. We also demonstrate significant long-term prediction improvements over general RNN and LSTM architectures on a range of simulated environments with nonlinear dynamics, as well as on real-world climate and traffic data.
Rose Yu¹, Stephan Zheng¹, Anima Anandkumar, Yisong Yue (¹Equal contribution.)

Department of Computing and Mathematical Sciences, Caltech 
{rose,stephan,anima,yyue}@caltech.edu 
1 Introduction
One of the central questions in science is forecasting: given the past history, how well can we predict the future? In many domains with complex multivariate correlation structures and nonlinear dynamics, forecasting is highly challenging, since the system has long-term temporal dependencies and higher-order dynamics. Examples of such systems abound in science and engineering, from biological neural network activity and fluid turbulence to climate and traffic systems (see Figure 1). Since current forecasting systems are unable to faithfully represent the higher-order dynamics, they have limited ability to forecast accurately over the long term.
Therefore, a key challenge is accurately modeling nonlinear dynamics and obtaining stable long-term predictions, given a dataset of realizations of the dynamics. Here, the forecasting problem can be stated as follows: how can we efficiently learn a model that, given only a few initial states, can reliably predict a sequence of future states over a long time horizon?
Common approaches to forecasting involve linear time series models, such as autoregressive moving average (ARMA); state space models, such as the hidden Markov model (HMM); and deep neural networks. We refer readers to the survey on time series forecasting by (Box et al., 2015) and the references therein. Recurrent neural networks (RNNs), as well as their memory-based extensions such as the LSTM, are a class of models that have achieved good performance on sequence prediction tasks from demand forecasting (Flunkert et al., 2017) to speech recognition (Soltau et al., 2016) and video analysis (LeCun et al., 2015). Although these methods can be effective for short-term, smooth dynamics, neither analytic nor data-driven learning methods tend to generalize well to capturing long-term nonlinear dynamics and predicting them over long time horizons.
To address this issue, we propose a novel family of tensor-train recurrent neural networks that can learn stable long-term forecasting. These models have two key features: 1) they explicitly model the higher-order dynamics, by using a longer history of previous hidden states and higher-order state interactions with multiplicative memory units; and 2) they are scalable, by using tensor trains, a structured low-rank tensor decomposition that greatly reduces the number of model parameters while mostly preserving the correlation structure of the full-rank model.
In this work, we analyze Tensor-Train RNNs theoretically, and also validate them experimentally over a wide range of forecasting domains. Our contributions can be summarized as follows:

We describe how TT-RNNs encode higher-order non-Markovian dynamics and higher-order state interactions. To address the memory issue, we propose a tensor-train (TT) decomposition that makes learning tractable and fast.

We provide theoretical guarantees for the representation power of TT-RNNs for nonlinear dynamics, and establish the connection between the target dynamics and the TT-RNN approximation. In contrast, no such theoretical results are known for standard recurrent networks.

We validate TT-RNNs on simulated data and two real-world environments with nonlinear dynamics (climate and traffic). Here, we show that TT-RNNs can forecast accurately over significantly longer time horizons compared to standard RNNs and LSTMs.
2 Forecasting using Tensor-Train RNNs
Forecasting Nonlinear Dynamics
Our goal is to learn an efficient model f for sequential multivariate forecasting in environments with nonlinear dynamics. Such systems are governed by dynamics that describe how a system state x_t evolves using a set of nonlinear differential equations:

ξ(x, t, dx/dt, d²x/dt², …; θ) = 0   (1)
where ξ can be an arbitrary (smooth) function of the state x(t) and its derivatives, with parameters θ. Continuous-time dynamics are usually described by differential equations, while difference equations are employed for discrete time. In continuous time, a classic example is the first-order Lorenz attractor, whose realizations showcase the "butterfly effect": a characteristic set of double-spiral orbits. In discrete time, a non-trivial example is the 1-dimensional Genz dynamics, whose difference equation is:

x_{t+1} = (c⁻² + (x_t + w)²)⁻¹   (2)
where x_t denotes the system state at time t and c, w are the parameters. Due to the nonlinear nature of the dynamics, such systems exhibit higher-order correlations, long-term dependencies and sensitivity to error propagation, and thus form a challenging setting for learning. Given a sequence of initial states x_0, …, x_t, the forecasting problem aims to learn a model f

f : (x_0, …, x_t) ↦ (y_t, …, y_T),   y_t = x_{t+1}   (3)

that outputs a sequence of future states. Hence, accurately approximating the dynamics is critical to learning a good forecasting model and predicting accurately over long time horizons.
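As a concrete illustration, the Genz "product peak" recurrence in (2) can be rolled out numerically. This is a minimal sketch; the parameter values c, w and the initial state below are illustrative, not the ones used in the experiments:

```python
import numpy as np

def genz_product_peak(x0, steps, c=0.5, w=0.3):
    """Roll out the 1-d Genz 'product peak' difference equation
    x_{t+1} = 1 / (c^{-2} + (x_t + w)^2).
    Parameter values here are illustrative."""
    xs = [x0]
    for _ in range(steps):
        x = xs[-1]
        xs.append(1.0 / (c ** -2 + (x + w) ** 2))
    return np.array(xs)

# One realization of the dynamics from a single initial point.
traj = genz_product_peak(x0=0.1, steps=100)
```

Each iterate is bounded by c² (here 0.25), so a trajectory quickly settles into a bounded, nonlinear orbit that a forecasting model must reproduce.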
First-order Markov Models
In deep learning, common approaches for modeling dynamics usually employ first-order hidden-state models, such as recurrent neural networks (RNNs). An RNN with a single RNN cell recursively computes the output y_t from a hidden state h_t using:

h_t = f(x_t, h_{t-1}; θ),   y_t = g(h_t; θ)   (4)
where f is the state transition function, g is the output function and θ are the model parameters. An RNN therefore learns a model for a Markov process of order 1 (only the previous time-step is considered). A common choice is to model the transition in (4) as a nonlinear activation function applied to a linear map of x_t and h_{t-1}:

h_t = σ(W^{hx} x_t + W^{hh} h_{t-1} + b)   (5)
where σ is the activation function (e.g. sigmoid or tanh) for the state transition, W^{hx}, W^{hh} are the transition weight matrices and b is a bias. RNNs have many variations, including LSTMs (Hochreiter & Schmidhuber, 1997) and GRUs (Chung et al., 2014). For instance, LSTM cells use a memory state, which mitigates the vanishing-gradient problem and allows RNNs to propagate information over longer time horizons. Although RNNs are very expressive, they compute h_t using only the previous state h_{t-1} and input x_t. Such models do not explicitly model higher-order dynamics and only implicitly model long-term dependencies between all historical states h_0, …, h_t, which limits their forecasting effectiveness in environments with nonlinear dynamics.
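The first-order update in (5) can be sketched in a few lines; the sizes, weight scale and tanh activation below are illustrative choices, not the experimental configuration:

```python
import numpy as np

def rnn_step(x_t, h_prev, Whx, Whh, b):
    """First-order RNN transition: h_t = tanh(Whx @ x_t + Whh @ h_prev + b).
    Only the single previous hidden state enters the update (order-1 Markov)."""
    return np.tanh(Whx @ x_t + Whh @ h_prev + b)

rng = np.random.default_rng(0)
d, H = 3, 8                      # input and hidden sizes (illustrative)
Whx = rng.normal(size=(H, d)) * 0.1
Whh = rng.normal(size=(H, H)) * 0.1
b = np.zeros(H)

h = np.zeros(H)
for t in range(5):               # unroll a few steps on random inputs
    h = rnn_step(rng.normal(size=d), h, Whx, Whh, b)
```

Note that all information about x_0, …, x_{t-1} must be squeezed through the single vector h_{t-1}; the higher-order model in the next section lifts exactly this restriction.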
2.1 Tensorized Recurrent Neural Networks
To effectively learn nonlinear dynamics, we propose Tensor-Train RNNs, or TT-RNNs, a class of higher-order models that can be viewed as a higher-order generalization of RNNs. We developed TT-RNNs with two goals in mind: explicitly modeling 1) order-L Markov processes with L steps of temporal memory and 2) polynomial interactions between the hidden states and x_t.
First, we consider a longer "history": we keep L historic states h_{t-1}, …, h_{t-L}:

h_t = f(x_t, h_{t-1}, …, h_{t-L}; θ)

where f is an activation function. In principle, early work (Giles et al., 1989) has shown that, with a large enough hidden state size, such recurrent structures are capable of approximating any dynamics.
Second, to learn the nonlinear dynamics efficiently, we also use higher-order moments to approximate the state transition function. We construct a higher-order transition tensor by modeling a degree-P polynomial interaction between the hidden states. Hence, the TT-RNN with a standard RNN cell is defined by:

[h_t]_α = σ( [W^{hx} x_t]_α + Σ_{i_1,…,i_P} W_{α i_1 ⋯ i_P} [s_{t-1}]_{i_1} ⋯ [s_{t-1}]_{i_P} )   (6)

where W is a (P+1)-dimensional tensor, the indices i_1, …, i_P index the hidden states and P is the polynomial degree. Here, we defined the lag-L hidden state as:

s_{t-1} = [1; h_{t-1}; …; h_{t-L}]
We included the bias unit 1 to model all possible polynomial expansions up to order P in a compact form. The TT-RNN with an LSTM cell, or "TLSTM", is defined analogously: the gates and the candidate state are computed as in (6) (with sigmoid activations for the gates and tanh for the candidate), followed by the standard LSTM memory updates

c_t = c_{t-1} ∘ f_t + i_t ∘ g_t,   h_t = c_t ∘ o_t   (7)

where ∘ denotes the Hadamard (element-wise) product. Note that the bias units are again included. TT-RNN serves as a module for the sequence-to-sequence (Seq2Seq) framework (Sutskever et al., 2014), which consists of an encoder-decoder pair (see Figure 3). We use tensor-train recurrent cells in both the encoder and the decoder. The encoder receives the initial states and the decoder predicts the future states. For each time-step t, the decoder uses its previous prediction as an input.
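To make the update in (6) concrete, the following sketch evaluates one step of a TT-RNN cell with the *unfactorized* transition tensor W at toy sizes (the sizes H, L, P, the weight scales and the tanh activation are all illustrative assumptions; Section 2.2 replaces W by its tensor-train cores):

```python
import numpy as np

def ttrnn_step_full(h_hist, x_t, W_hx, W_tensor, P):
    """One higher-order update with the full transition tensor:
    s = [1; h_{t-1}; ...; h_{t-L}] and
    [h_t]_a = tanh([W_hx x_t]_a + sum_{i1..iP} W[a,i1,..,iP] s_i1 ... s_iP)."""
    s = np.concatenate([[1.0]] + list(h_hist))   # bias unit + lagged states
    higher = W_tensor
    for _ in range(P):                           # contract each tensor mode with s
        higher = higher @ s
    return np.tanh(W_hx @ x_t + higher)

rng = np.random.default_rng(1)
H, d, L, P = 4, 3, 2, 2                          # small sizes for illustration
n = 1 + L * H                                    # length of s with the bias unit
W_hx = rng.normal(size=(H, d)) * 0.1
W = rng.normal(size=(H,) + (n,) * P) * 0.01      # full (P+1)-way tensor

h_hist = [np.zeros(H) for _ in range(L)]
h_t = ttrnn_step_full(h_hist, rng.normal(size=d), W_hx, W, P)
```

Even at these toy sizes the full tensor W has H·n^P = 4·81 entries, which illustrates why the factorized form of Section 2.2 is needed at realistic hidden sizes.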
2.2 Tensor-train Networks
Unfortunately, due to the "curse of dimensionality", the number of parameters in W with hidden size H grows exponentially as O(H(LH)^P), which makes the higher-order model prohibitively large to train. To overcome this difficulty, we utilize tensor networks to approximate the weight tensor W. Such networks encode a structural decomposition of tensors into low-dimensional components and have been shown to provide the most general approximation to smooth tensors (Orús, 2014). The most commonly used tensor networks are linear tensor networks (LTN), also known as tensor-trains in numerical analysis or matrix-product states in quantum physics (Oseledets, 2011).
A tensor-train model decomposes a P-dimensional tensor W into a network of sparsely connected low-dimensional tensors {A^p ∈ R^{r_{p-1} × n_p × r_p}} as:

W_{i_1 ⋯ i_P} = Σ_{α_0, …, α_P} A¹_{α_0 i_1 α_1} A²_{α_1 i_2 α_2} ⋯ A^P_{α_{P-1} i_P α_P},   r_0 = r_P = 1,

as depicted in Figure 3. The r_p are called the tensor-train ranks. With the tensor-train decomposition, we can reduce the number of parameters of the TT-RNN from O(H(LH)^P) to O(P L H r²), with r the upper bound on the tensor-train rank. Thus, a major benefit of tensor-trains is that they do not suffer from the curse of dimensionality, in sharp contrast to many classical tensor decompositions, such as the Tucker decomposition.
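A minimal sketch of this parameter saving: reconstruct a full order-P tensor from tensor-train cores and compare parameter counts (the mode size n, rank r and order P below are illustrative):

```python
import numpy as np

def tt_reconstruct(cores):
    """Rebuild a full order-P tensor from tensor-train cores
    A^p of shape (r_{p-1}, n_p, r_p), with boundary ranks r_0 = r_P = 1."""
    full = cores[0]                              # shape (1, n_1, r_1)
    for core in cores[1:]:
        # contract the trailing rank index with the next core's leading rank index
        full = np.tensordot(full, core, axes=([-1], [0]))
    return full.squeeze(axis=(0, -1))            # drop the two boundary rank axes

n, r, P = 5, 2, 3                                # mode size, TT-rank, order (illustrative)
rng = np.random.default_rng(2)
ranks = [1] + [r] * (P - 1) + [1]
cores = [rng.normal(size=(ranks[p], n, ranks[p + 1])) for p in range(P)]
full = tt_reconstruct(cores)

tt_params = sum(c.size for c in cores)           # O(P * n * r^2) parameters
full_params = full.size                          # n^P parameters
```

Here the TT form stores 40 numbers versus 125 for the full tensor; the gap widens exponentially as n and P grow, which is exactly the scaling argument above.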
3 Approximation results for TT-RNN
A significant benefit of using tensor-trains is that we can theoretically characterize the representation power of tensor-train neural networks for approximating high-dimensional functions. We do so by analyzing a class of functions that satisfies a regularity condition. For such functions, tensor-train decompositions preserve weak differentiability and yield a compact representation. We combine this property with neural network estimation theory to bound the approximation error for a TT-RNN with one hidden layer in terms of: 1) the regularity of the target function f, 2) the dimension of the input space, 3) the tensor-train rank and 4) the order of the tensor.
In the context of TT-RNN, the target function f(s), with s = s_{t-1}, describes the state transitions of the system dynamics, as in (6). Let us assume that f is a Sobolev function: f ∈ H^k_μ, defined on the input space I = I_1 × I_2 × ⋯ × I_d, where each I_i is a set of vectors. The space H^k_μ is defined as the set of functions that have bounded derivatives up to some order k and are integrable:

H^k_μ = { f ∈ L²_μ(I) : Σ_{|i| ≤ k} ‖D^{(i)} f‖² < +∞ }   (8)

where D^{(i)} f is the i-th weak derivative of f and μ ≥ 0.¹ [¹ A weak derivative generalizes the derivative concept to (non-)differentiable functions: v is a weak derivative of u if, for all smooth φ with compact support, ∫ u(t) φ′(t) dt = −∫ v(t) φ(t) dt.] Any Sobolev function admits a Schmidt decomposition: f(·) = Σ_{i=0}^∞ √λ_i γ(·; i) φ(·; i), where {λ_i} are the eigenvalues and {γ, φ} are the associated eigenfunctions. Hence, we can decompose the target function f ∈ H^k_μ as:
f(x_1, …, x_d) = Σ_{α_0, …, α_d = 1}^{∞} A¹(α_0, x_1, α_1) ⋯ A^d(α_{d-1}, x_d, α_d)   (9)

where {A^j(α_{j-1}, ·, α_j)} are basis functions satisfying ⟨A^j(i, ·, m), A^j(i, ·, n)⟩ = δ_{mn}. We can truncate (9) to a low-dimensional subspace (r < ∞) and obtain the functional tensor-train (FTT) approximation of the target function f:

f_{TT}(x) = Σ_{α_0 = 1}^{r_0} ⋯ Σ_{α_d = 1}^{r_d} A¹(α_0, x_1, α_1) ⋯ A^d(α_{d-1}, x_d, α_d)   (10)
In practice, the TT-RNN implements a polynomial expansion of the state s as in (6), using powers s^p to approximate f_{TT}, where p is the degree of the polynomial. We can then bound the approximation error of the TT-RNN, viewed as a neural network with one hidden layer:
Theorem 3.1.
Let the state transition function f ∈ H^k_μ be a Hölder continuous function defined on the input domain I = I_1 × ⋯ × I_d, with bounded derivatives up to order k and finite Fourier magnitude distribution C_f. Then a single-layer Tensor-Train RNN can approximate f with an estimation error of ε using H hidden units, where H scales as C_f²/ε² times a factor that decreases as k, r and p grow (the exact bound is derived in the Appendix).
Here k is the regularity of f, d is the size of the state space, r is the tensor-train rank and p is the degree of the higher-order polynomial, i.e., the order of the tensor.
For the full proof, see the Appendix. From this theorem we see that: 1) if the target f becomes smoother, it is easier to approximate; and 2) polynomial interactions are more efficient than linear ones: if the polynomial order p increases, we require fewer hidden units H. This result applies to the full family of TT-RNNs, including those using a vanilla RNN or LSTM as the recurrent cell, as long as we are given the state transitions (e.g. the state transition function learned by the encoder).
4 Experiments
4.1 Datasets
We validated the accuracy and efficiency of TT-RNN on one synthetic and two real-world datasets, as described below; detailed preprocessing steps and data statistics are deferred to the Appendix.
Genz dynamics
The Genz "product peak" (see Figure 4) is one of the Genz functions (Genz, 1984), which are often used as a basis for high-dimensional function approximation. In particular, (Bigoni et al., 2016) used them to analyze tensor-train decompositions. We generated samples using (2) with random initial points.
Traffic
The traffic data (see Figure 4) of the Los Angeles County highway network is collected from the California Department of Transportation (http://pems.dot.ca.gov/). The prediction task is to predict the speed readings for locations across LA, aggregated at regular intervals. After upsampling and processing the data for missing values, we obtained sequences of equal length.
Climate
The climate data (see Figure 4) is collected from the U.S. Historical Climatology Network (USHCN) (http://cdiac.ornl.gov/ftp/ushcn_daily/). The prediction task is to predict the daily maximum temperature at multiple stations. The data spans many years. After preprocessing, we obtained sequences of equal length.
4.2 Long-term Forecasting Evaluation
Experimental Setup
To validate that TT-RNNs effectively perform the long-term forecasting task in (3), we experiment with a Seq2Seq architecture with TT-RNN using LSTM as the recurrent cell (TLSTM). For all experiments, we used an initial sequence as input and varied the forecasting horizon T. We trained all models using stochastic gradient descent on the sequence regression loss L(y, ŷ) = Σ_t ‖y_t − ŷ_t‖², where y_t and ŷ_t are the ground truth and the model prediction, respectively. For more details on training and hyperparameters, see the Appendix.
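The training objective above can be sketched directly; whether the loss is summed or averaged over time-steps is not specified in the source, so this sketch sums:

```python
import numpy as np

def seq_regression_loss(y_true, y_pred):
    """Sequence regression loss sum_t ||y_t - yhat_t||_2^2 over a length-T
    forecast; inputs have shape (T, output_dim)."""
    return float(np.sum((np.asarray(y_true) - np.asarray(y_pred)) ** 2))

# Toy usage: a 2-step, 3-dimensional forecast that is off by 1 everywhere.
loss = seq_regression_loss(np.zeros((2, 3)), np.ones((2, 3)))
```

Because the decoder feeds its own predictions back as inputs, errors at early time-steps propagate into all later terms of this sum, which is what makes the long-horizon setting hard.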
We compared TT-RNN against two sets of natural baselines: 1st-order RNNs (vanilla RNN, LSTM) and matrix RNNs (vanilla MRNN, MLSTM), which use matrix products of multiple hidden states without factorization (Soltani & Jiang, 2016). We observed that the TT-RNN with RNN cells outperforms vanilla RNN and MRNN, but using LSTM cells performs best in all experiments. We also evaluated the classic ARIMA time series model and observed that it performs worse than the LSTM.
Long-term Accuracy
For traffic, we forecast several hours ahead given a window of initial inputs. For climate, we forecast multiple days ahead given an initial window of observations. For Genz dynamics, we forecast many steps ahead given a short initial sequence. All results are averaged over multiple runs.
We now present the long-term forecasting accuracy of TLSTM in nonlinear systems. Figure 5 shows the test prediction error (in RMSE) for varying forecasting horizons on the different datasets. We can see that TLSTM notably outperforms all baselines on all datasets in this setting. In particular, TLSTM is more robust to long-term error propagation. We observe two salient benefits of using TT-RNNs over the unfactorized models. First, MRNN and MLSTM can suffer from overfitting as the number of weights increases. Second, on traffic, the unfactorized models also show considerable instability in their long-term predictions. These results suggest that tensor-train neural networks learn more stable representations that generalize better over long horizons.
Visualization of Predictions
To get intuition for the learned models, we visualize the best-performing TLSTM and the baselines in Figure 6 for the Genz function "corner peak" and its state-transition function. We can see that TLSTM can almost perfectly recover the original function, while LSTM and MLSTM only correctly predict the mean. These baselines cannot fully capture the dynamics, often predicting an incorrect range and phase for the dynamics.
In Figure 7 we show predictions for the real-world traffic and climate datasets. We can see that TLSTM tracks the ground truth significantly better in long-term forecasting. As the ground-truth time series is highly chaotic and noisy, LSTM often deviates from the general trend. While both MLSTM and TLSTM can correctly learn the trend, TLSTM captures more detailed curvatures due to its inherent higher-order structure.
Speed-Performance Trade-off
We now investigate potential trade-offs between accuracy and computation. Figure 8 displays the validation loss with respect to the number of training steps for the best-performing models on long-term forecasting. We see that TT-RNNs converge significantly faster than the other models and achieve a lower validation loss. This suggests that TT-RNN has a more efficient representation of the nonlinear dynamics and can learn much faster as a result.
Hyperparameter Analysis
The TLSTM model is equipped with a set of hyperparameters, such as the tensor-train rank and the number of lags. We performed a random grid search over these hyperparameters and show the results in Table 1. In the top row, we report the prediction RMSE for the largest forecasting horizon w.r.t. the tensor rank for all datasets with a fixed lag. When the rank is too low, the model does not have enough capacity to capture the nonlinear dynamics; when the rank is too high, the model starts to overfit. In the bottom row, we report the effect of changing the number of lags (the order of the Markovian dynamics). For each setting, the best rank is determined by cross-validation. For different forecasting horizons, the best lag value also varies.
Chaotic Nonlinear Dynamics
We have also evaluated TT-RNN on long-term forecasting for chaotic dynamics, such as the Lorenz dynamics (see Figure 9). Such dynamics are highly sensitive to input perturbations: two close points can move exponentially far apart under the dynamics. This makes long-term forecasting highly challenging, as small errors can lead to catastrophic long-term errors. Figure 9 shows that TT-RNN can predict a number of steps into the future, but diverges quickly beyond that. We found that no state-of-the-art prediction model is stable in this setting.
5 Related Work
Classic work in time series forecasting has studied auto-regressive models, such as the ARMA and ARIMA models (Box et al., 2015), which model a process linearly and so do not capture nonlinear dynamics. Our method contrasts with these approaches by explicitly modeling higher-order dependencies. Using neural networks to model time series has a long history. More recently, they have been applied to room temperature prediction, weather forecasting, traffic prediction and other domains. We refer to (Schmidhuber, 2015) for a detailed overview of the relevant literature.
From a modeling perspective, (Giles et al., 1989) considers a higher-order RNN to simulate a deterministic finite state machine and recognize regular grammars. That work considers a second-order mapping from inputs and hidden states to the next state; however, the model only considers the most recent state and is limited to two-way interactions. (Sutskever et al., 2011) proposes a multiplicative RNN that allows each hidden state to specify a different factorized hidden-to-hidden weight matrix. A similar approach also appears in (Soltani & Jiang, 2016), but without the factorization. Our method can be seen as an efficient generalization of these works. Moreover, hierarchical RNNs have been used to model sequential data at multiple resolutions, e.g. to learn both short-term and long-term human behavior (Zheng et al., 2016).
Tensor methods have tight connections with neural networks. For example, (Cohen et al., 2016) shows that convolutional neural networks are equivalent to hierarchical tensor factorizations. (Novikov et al., 2015; Yang et al., 2017) employ tensor-trains to compress large neural networks and reduce the number of weights. (Stoudenmire & Schwab, 2016) propose to parameterize supervised learning models with matrix-product states for image classification. To the best of our knowledge, however, this is the first work to consider tensor networks in RNNs for sequential prediction tasks in environments with nonlinear dynamics.
6 Conclusion and Discussion
In this work, we considered forecasting under nonlinear dynamics. We proposed a novel class of RNNs: TT-RNN. We provided approximation guarantees for TT-RNN and characterized its representation power. We demonstrated the ability of TT-RNN to forecast accurately over significantly longer time horizons in both synthetic and real-world multivariate time series data.
As we observed, chaotic dynamics still present a significant challenge to any sequential prediction model. Hence, it would be interesting to study how to learn robust models for chaotic dynamics. In other sequential prediction settings, such as natural language processing, there does not exist (or is not known to exist) a succinct analytical description of the data-generating process. It would be interesting to further investigate the effectiveness of TT-RNNs in such domains as well.
References
 Barron (1993) Andrew R Barron. Universal approximation bounds for superpositions of a sigmoidal function. IEEE Transactions on Information theory, 39(3):930–945, 1993.
 Bigoni et al. (2016) Daniele Bigoni, Allan P. Engsig-Karup, and Youssef M. Marzouk. Spectral tensor-train decomposition. SIAM Journal on Scientific Computing, 38(4):A2405–A2439, 2016.
 Box et al. (2015) George EP Box, Gwilym M Jenkins, Gregory C Reinsel, and Greta M Ljung. Time series analysis: forecasting and control. John Wiley & Sons, 2015.
 Chung et al. (2014) Junyoung Chung, Caglar Gulcehre, Kyunghyun Cho, and Yoshua Bengio. Empirical evaluation of gated recurrent neural networks on sequence modeling. arXiv preprint arXiv:1412.3555, 2014.
 Cohen et al. (2016) Nadav Cohen, Or Sharir, and Amnon Shashua. On the expressive power of deep learning: a tensor analysis. In 29th Annual Conference on Learning Theory, pp. 698–728, 2016.
 Flunkert et al. (2017) Valentin Flunkert, David Salinas, and Jan Gasthaus. DeepAR: Probabilistic forecasting with autoregressive recurrent networks. arXiv preprint arXiv:1704.04110, 2017.
 Genz (1984) Alan Genz. Testing multidimensional integration routines. In Proc. of International Conference on Tools, Methods and Languages for Scientific and Engineering Computation, pp. 81–94, New York, NY, USA, 1984. Elsevier North-Holland, Inc. ISBN 0444875700. URL http://dl.acm.org/citation.cfm?id=2837.2842.
 Giles et al. (1989) C. Lee Giles, Guo-Zheng Sun, Hsing-Hen Chen, Yee-Chun Lee, and Dong Chen. Higher order recurrent networks and grammatical inference. In NIPS, pp. 380–387, 1989.
 Hochreiter & Schmidhuber (1997) Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.
 LeCun et al. (2015) Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. Deep learning. Nature, 521(7553):436–444, 2015.
 Novikov et al. (2015) Alexander Novikov, Dmitrii Podoprikhin, Anton Osokin, and Dmitry P Vetrov. Tensorizing neural networks. In Advances in Neural Information Processing Systems, pp. 442–450, 2015.
 Orús (2014) Román Orús. A practical introduction to tensor networks: Matrix product states and projected entangled pair states. Annals of Physics, 349:117–158, 2014.
 Oseledets (2011) Ivan V. Oseledets. Tensor-train decomposition. SIAM Journal on Scientific Computing, 33(5):2295–2317, 2011.
 Schmidhuber (2015) Jürgen Schmidhuber. Deep learning in neural networks: An overview. Neural networks, 61:85–117, 2015.
 Soltani & Jiang (2016) Rohollah Soltani and Hui Jiang. Higher order recurrent neural networks. arXiv preprint arXiv:1605.00064, 2016.
 Soltau et al. (2016) Hagen Soltau, Hank Liao, and Hasim Sak. Neural speech recognizer: Acoustic-to-word LSTM model for large vocabulary speech recognition. arXiv preprint arXiv:1610.09975, 2016.
 Stoudenmire & Schwab (2016) Edwin Stoudenmire and David J Schwab. Supervised learning with tensor networks. In Advances in Neural Information Processing Systems, pp. 4799–4807, 2016.
 Sutskever et al. (2011) Ilya Sutskever, James Martens, and Geoffrey E. Hinton. Generating text with recurrent neural networks. In Proceedings of the 28th International Conference on Machine Learning (ICML-11), pp. 1017–1024, 2011.
 Sutskever et al. (2014) Ilya Sutskever, Oriol Vinyals, and Quoc V Le. Sequence to sequence learning with neural networks. In Advances in neural information processing systems, pp. 3104–3112, 2014.
 Yang et al. (2017) Yinchong Yang, Denis Krompass, and Volker Tresp. Tensor-train recurrent neural networks for video classification. In International Conference on Machine Learning, pp. 3891–3900, 2017.
 Zheng et al. (2016) Stephan Zheng, Yisong Yue, and Patrick Lucey. Generating long-term trajectories using deep hierarchical networks. In Advances in Neural Information Processing Systems, pp. 1543–1551, 2016.
Appendix A
A.1 Theoretical Analysis
We provide theoretical guarantees for the proposed TT-RNN model by analyzing a class of functions that satisfy a regularity condition. For such functions, tensor-train decompositions preserve weak differentiability and yield a compact representation. We combine this property with neural network estimation theory to bound the approximation error for a TT-RNN with one hidden layer, in terms of: 1) the regularity of the target function f, 2) the dimension of the input space, and 3) the tensor-train rank.
In the context of TT-RNN, the target function f(s), with s = s_{t-1}, is the system dynamics that describes the state transitions, as in (6). Let us assume that f is a Sobolev function: f ∈ H^k_μ, defined on the input space I = I_1 × I_2 × ⋯ × I_d, where each I_i is a set of vectors. The space H^k_μ is defined as the set of functions that have bounded derivatives up to some order k and are integrable:

H^k_μ = { f ∈ L²_μ(I) : Σ_{|i| ≤ k} ‖D^{(i)} f‖² < +∞ }   (11)

where D^{(i)} f is the i-th weak derivative of f and μ ≥ 0.² [² A weak derivative generalizes the derivative concept to (non-)differentiable functions: v is a weak derivative of u if, for all smooth φ with compact support, ∫ u(t) φ′(t) dt = −∫ v(t) φ(t) dt.]
Any Sobolev function admits a Schmidt decomposition: f(·) = Σ_{i=0}^∞ √λ_i γ(·; i) φ(·; i), where {λ_i} are the eigenvalues and {γ, φ} are the associated eigenfunctions. Hence, we can decompose the target function f ∈ H^k_μ as:

f(x_1, …, x_d) = Σ_{α_0, …, α_d = 1}^{∞} A¹(α_0, x_1, α_1) ⋯ A^d(α_{d-1}, x_d, α_d)   (12)

where {A^j(α_{j-1}, ·, α_j)} are basis functions satisfying ⟨A^j(i, ·, m), A^j(i, ·, n)⟩ = δ_{mn}. We can truncate (12) to a low-dimensional subspace (r < ∞) and obtain the functional tensor-train (FTT) approximation of the target function f:

f_{TT}(x) = Σ_{α_0 = 1}^{r_0} ⋯ Σ_{α_d = 1}^{r_d} A¹(α_0, x_1, α_1) ⋯ A^d(α_{d-1}, x_d, α_d)   (13)
The FTT approximation in Eqn (13) projects the target function onto a subspace with a finite basis, and the approximation error can be bounded using the following lemma:
Lemma A.1 (FTT Approximation, Bigoni et al. (2016)).
Let f ∈ H^k_μ be a Hölder continuous function, defined on a bounded domain I = I_1 × ⋯ × I_d with exponent k. Then the FTT approximation error can be upper bounded as

‖f − f_TT‖² ≤ ‖f‖² (d − 1) (r + 1)^{−(k−1)} / (k − 1)   (14)

for k > 1, and

‖f − f_TT‖ → 0 as r → ∞   (15)

for k = 1.
Lemma A.1 relates the approximation error to the dimension d, the tensor-train rank r, and the regularity k of the target function f. In practice, TT-RNN implements a polynomial expansion of the input states s, using powers s^p to approximate f_TT, where p is the degree of the polynomial. We can further use classic spectral approximation theory to connect the TT-RNN structure with the degree of the polynomial, i.e., the order of the tensor. Let I ⊂ R. Given a function f and its polynomial expansion P_N f of degree N, the approximation error is bounded as follows:
Lemma A.2 (Polynomial Approximation).
Let f ∈ H^k_μ for k > 0, and let P_N f be the approximating polynomial of degree N. Then

‖f − P_N f‖ ≤ C N^{−k} |f|_{H^k_μ}

Here |f|_{H^k_μ} is the semi-norm of the space H^k_μ and C is the coefficient of the spectral expansion. By definition, H^k_μ is equipped with a norm ‖·‖_{H^k_μ} and a semi-norm |·|_{H^k_μ}. For notational simplicity, we omit the subscript μ and write H^k for H^k_μ.
So far, we have bounded the tensor-train approximation error in terms of the regularity of the target function f. Next, we connect the tensor-train approximation to the estimation error of a neural network with one layer of hidden units. Given a neural network with one hidden layer and a sigmoid activation function, the following lemma describes the classic result on the error between a target function and the single-hidden-layer neural network that best approximates it:
Lemma A.3 (NN Approximation, Barron (1993)).
Given a function f with finite Fourier magnitude distribution C_f, there exists a neural network f_n with n hidden units such that

‖f − f_n‖ ≤ C_f / √n   (16)

where C_f = ∫ |ω|₁ |f̂(ω)| dω, with Fourier representation f(x) = ∫ e^{iωx} f̂(ω) dω.
We can now generalize Barron's approximation result (Lemma A.3) to TT-RNN. The target function we are approximating is the state transition function f(s). We express f using the FTT approximation, followed by the polynomial expansion of the state concatenation s; combining Lemmas A.1–A.3 then bounds the approximation error of TT-RNN, viewed as a network with one hidden layer, as stated in Theorem 3.1, where p is the order of the tensor and r is the tensor-train rank. As the rank of the tensor-train and the polynomial order increase, the required number of hidden units becomes smaller, up to a constant that depends on the regularity of the underlying dynamics f.
A.2 Training and Hyperparameter Search
We trained all models using the RMSProp optimizer with a learning rate decay schedule. We performed an exhaustive search over the hyperparameters for validation. Table 2 reports the hyperparameter search ranges used in this work.
Hyperparameter search range

learning rate | hidden state size
tensor-train rank | number of lags
number of orders | number of layers
For all datasets, we used a train-validation-test split and trained for a fixed maximum number of steps. We compute the moving average of the validation loss and use it as an early-stopping criterion. We did not employ scheduled sampling, as we found that training became highly unstable under a range of annealing schedules.
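The moving-average early-stopping criterion can be sketched as follows; the window size and patience values are illustrative assumptions, as the source does not specify them:

```python
class MovingAverageEarlyStopper:
    """Stop training when the moving average of the validation loss stops
    improving for `patience` consecutive checks (a sketch of the criterion
    described above; window and patience values are illustrative)."""

    def __init__(self, window=10, patience=5):
        self.window, self.patience = window, patience
        self.history, self.best, self.wait = [], float("inf"), 0

    def update(self, val_loss):
        self.history.append(val_loss)
        recent = self.history[-self.window:]
        avg = sum(recent) / len(recent)          # moving average of the loss
        if avg < self.best:
            self.best, self.wait = avg, 0        # improvement: reset patience
        else:
            self.wait += 1
        return self.wait >= self.patience        # True => stop training

# Toy usage: losses plateau after a few steps, eventually triggering a stop.
stopper = MovingAverageEarlyStopper(window=2, patience=2)
flags = [stopper.update(v) for v in [1.0, 0.9, 0.8, 0.8, 0.8, 0.8]]
```

Averaging over a window makes the criterion robust to the noisy per-step validation losses typical of stochastic gradient training.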
A.3 Dataset Details
Genz
Genz functions are often used as a basis for evaluating high-dimensional function approximation. In particular, they have been used to analyze tensor-train decompositions (Bigoni et al., 2016). There are in total six different Genz function families: (1) oscillatory, (2) product peak, (3) corner peak, (4) Gaussian, (5) continuous and (6) discontinuous. For each function, we generated a dataset of samples using (2) with random initial points drawn from a bounded range.
Traffic
We use the traffic data of the Los Angeles County highway network, collected from the California Department of Transportation (http://pems.dot.ca.gov/). The dataset consists of several months of speed readings aggregated at regular intervals. Due to the large number of missing values in the raw data, we impute the missing values using the average values of non-missing entries from other sensors at the same time. After processing, we treat each sequence as the traffic readings of one day. We upsample the dataset to a coarser temporal resolution, which results in a dataset of sequences of daily measurements, and select a subset of sensors as a joint forecasting task.
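The cross-sensor imputation step described above can be sketched as follows (a minimal NumPy version; the actual preprocessing pipeline is not published with the text):

```python
import numpy as np

def impute_cross_sensor(readings):
    """Replace NaNs with the mean of the non-missing sensors at the same
    timestamp. `readings` has shape (T, num_sensors)."""
    out = readings.copy()
    row_mean = np.nanmean(out, axis=1, keepdims=True)   # per-timestamp mean
    mask = np.isnan(out)
    out[mask] = np.broadcast_to(row_mean, out.shape)[mask]
    return out

# Toy usage: two timestamps, three sensors, one missing reading each.
X = np.array([[1.0, np.nan, 3.0],
              [np.nan, 2.0, 2.0]])
filled = impute_cross_sensor(X)
```

This exploits the strong spatial correlation between nearby highway sensors: at any timestamp, a missing reading is usually close to the average of its neighbors.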
Climate
We use the daily maximum temperature data from the U.S. Historical Climatology Network (USHCN) daily dataset (http://cdiac.ornl.gov/ftp/ushcn_daily/), which contains daily measurements of several climate variables collected over many decades across a large number of locations. We analyze an area in California containing multiple stations. We removed the earliest years of data, most of which have no observations. We treat the temperature readings of each year as one sequence and impute missing observations using non-missing entries from other stations across years. We augment the dataset by rotating each sequence by a fixed number of days, which results in a larger set of sequences.
We also perform a Dickey-Fuller test, which tests the null hypothesis that a unit root is present in an autoregressive model. The test statistics for the traffic and climate data are shown in Table 3 and demonstrate the non-stationarity of the time series.
 | Traffic | Traffic | Climate | Climate
Test Statistic | 0.00003 | 0 | 3e-7 | 0
p-value | 0.96 | 0.96 | 1.12e-13 | 2.52e-7
Number of Lags Used | 2 | 7 | 0 | 1
Critical Value (1%) | -3.49 | -3.51 | -3.63 | -2.70
Critical Value (5%) | -2.89 | -2.90 | -2.91 | -3.70
Critical Value (10%) | -2.58 | -2.59 | -2.60 | -2.63
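The core of the Dickey-Fuller regression behind Table 3 can be sketched as follows. This is a simplified statistic without constant, trend, or lagged-difference terms; the full augmented test, as implemented in standard statistics packages, adds those:

```python
import numpy as np

def df_statistic(x):
    """Simplified Dickey-Fuller t-statistic: regress dx_t on x_{t-1}.
    A coefficient near 0 (statistic near 0) is consistent with a unit root;
    a strongly negative statistic rejects the unit-root null."""
    dx = np.diff(x)
    xlag = x[:-1]
    rho = (xlag @ dx) / (xlag @ xlag)                    # OLS slope
    resid = dx - rho * xlag
    se = np.sqrt(resid @ resid / (len(dx) - 1)) / np.sqrt(xlag @ xlag)
    return rho / se                                      # t-statistic of rho

rng = np.random.default_rng(3)
walk = np.cumsum(rng.normal(size=5000))                  # unit-root process
noise = rng.normal(size=5000)                            # stationary series
```

On the random walk the statistic stays near zero (the unit-root null is not rejected), whereas on stationary noise it is strongly negative, mirroring the contrast between the traffic and climate columns above.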
A.4 Prediction Visualizations
Genz functions are basis functions for multidimensional function approximation. Figure 10 visualizes the different Genz functions, realizations of the dynamics, and predictions from TLSTM and the baselines. We can see that for "oscillatory", "product peak" and "Gaussian", TLSTM better captures the complex dynamics, leading to more accurate predictions.
A.5 More Chaotic Dynamics Results
Chaotic dynamics such as the Lorenz attractor are notoriously difficult to learn. In such systems, the dynamics are highly sensitive to perturbations in the input state: two close points can move exponentially far apart under the dynamics. We also evaluated tensor-train neural networks on long-term forecasting for the Lorenz attractor and report the results below.
Lorenz
The Lorenz attractor system describes a two-dimensional flow of fluids (see Figure 9):

dx/dt = σ(y − x),   dy/dt = x(ρ − z) − y,   dz/dt = xy − βz
This system has chaotic solutions (for certain parameter values) that revolve around the so-called Lorenz attractor. We simulated trajectories with a discretized time interval and sampled from each trajectory at fixed intervals in Euclidean distance. The initial condition of each trajectory is sampled uniformly at random from a bounded interval.
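A simulation of the Lorenz system and its sensitivity to initial conditions can be sketched as follows. The Euler integrator, step size and the classic chaotic parameters (σ=10, ρ=28, β=8/3) are illustrative assumptions; the exact integrator and parameters used for the experiments are not specified:

```python
import numpy as np

def simulate_lorenz(x0, steps, dt=0.01, sigma=10.0, rho=28.0, beta=8.0 / 3.0):
    """Euler-discretized Lorenz system with the classic chaotic parameters."""
    traj = np.empty((steps + 1, 3))
    traj[0] = x0
    for t in range(steps):
        x, y, z = traj[t]
        traj[t + 1] = traj[t] + dt * np.array([
            sigma * (y - x),        # dx/dt
            x * (rho - z) - y,      # dy/dt
            x * y - beta * z,       # dz/dt
        ])
    return traj

traj = simulate_lorenz(np.array([1.0, 1.0, 1.0]), steps=2000)
# A tiny perturbation of the initial state diverges over the same horizon,
# illustrating why long-term forecasting is so hard here.
traj2 = simulate_lorenz(np.array([1.0, 1.0, 1.0 + 1e-6]), steps=2000)
divergence = np.linalg.norm(traj[-1] - traj2[-1])
```

The exponential growth of `divergence` from a 1e-6 perturbation is the quantitative face of the "two close points move exponentially far apart" statement above.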
Figure 11 shows multi-step-ahead predictions for all models. HORNN is the full-tensor TT-RNN using a vanilla RNN unit without the tensor-train decomposition. We can see that all tensor models perform better than vanilla RNN or MRNN. TT-RNN shows a slight improvement at the beginning state.