ForecastNet: A Time-Variant Deep Feed-Forward Neural Network Architecture for Multi-Step-Ahead Time-Series Forecasting
Abstract
Recurrent and convolutional neural networks are the most common architectures used for time series forecasting in the deep learning literature. These networks use parameter sharing by repeating a set of fixed architectures with fixed parameters over time or space. The result is that the overall architecture is time-invariant (shift-invariant in the spatial domain) or stationary. We argue that time-invariance can reduce the capacity to perform multi-step-ahead forecasting, where modelling the dynamics at a range of scales and resolutions is required. We propose ForecastNet, which uses a deep feed-forward architecture to provide a time-variant model. An additional novelty of ForecastNet is interleaved outputs, which we show assist in mitigating vanishing gradients. ForecastNet is demonstrated to outperform statistical and deep learning benchmark models on several datasets.
1 Introduction
Multi-step-ahead forecasting involves the prediction of some variable several time-steps into the future, given past and present data. Over the set of time-steps, various time-series components such as complex trends, seasonality, and noise may be observed at a range of scales or resolutions. Increasing the number of steps ahead that are forecast increases the range of scales that must be modelled. An accurate forecasting method is required to model all these components over the complete range of scales.
There is significant interest in recurrent neural network (RNN) and sequence-to-sequence models for forecasting (Kuznetsov & Mariet, 2018). Several studies have shown success with variants of these models (Zhu & Laptev, 2017; Laptev et al., 2017; Wen et al., 2017; Maddix et al., 2018; Salinas et al., 2019; Xing et al., 2019; Hewamalage et al., 2019; Nguyen et al., 2019; Du et al., 2020). The recurrence in the RNN produces a set of cells, each with a fixed architecture, that are replicated over time. This replication results in a time-invariant model. Similarly, parameter sharing and convolution in the convolutional neural network (CNN) result in a shift-invariant model in the spatial domain, which is equivalent to time-invariance in the time domain. Our conjecture is that time-invariance can reduce the capacity of the model to learn various scales of dynamics across multiple steps in time, especially in multi-step-ahead forecasts.
To address this, we propose ForecastNet, a deep feed-forward architecture with interleaved outputs. Between interleaved outputs are network structures that differ in architecture and parameters over time. The result is a model that varies across time. We demonstrate four variations of ForecastNet to highlight its flexibility. These four variations are compared with state-of-the-art benchmark models. We show that ForecastNet is accurate and robust in terms of performance variation.
The contributions of this study are: (1) ForecastNet, a model for multi-step-ahead forecasting; (2) addressing the time-invariance problem, which (to our knowledge) has not previously been considered in the deep learning time-series forecasting literature; and (3) a comparison of several state-of-the-art models on seasonal time-series datasets.
2 Motivations and Related Work
2.1 Recurrent Neural Networks
The RNN comprises a set of cell structures with parameters that are replicated over time. These replications (cells and parameters) remain constant over time. The replication has benefits (such as parameter sharing), but it can reduce the capacity to model complex dependencies over time. For example, a model can adapt to long-term dynamics more easily if its parameters are able to change over time.
Another challenge with RNNs is that learning long sequences can be difficult due to complex dependencies over time and vanishing gradients (Chang et al., 2017). Vanishing gradients in the RNN have been addressed in the LSTM (Hochreiter & Schmidhuber, 1997; Gers et al., 1999) and the Gated Recurrent Unit (GRU) (Cho et al., 2014b,a) by introducing gate-like structures. These LSTM and GRU cell structures are, however, recurrent and constant over time.
ForecastNet comprises a set of parameters and an architecture that change over the sequence of inputs and forecast outputs. The result is that ForecastNet is not a time-invariant model. Furthermore, ForecastNet mitigates vanishing gradient problems by using shortcut connections and by interleaving outputs between hidden cells. The shortcut-connection approach was introduced in ResNet (He et al., 2015a) and the Highway Network (Srivastava et al., 2015). It has been shown to be highly effective in addressing vanishing and shattered gradient problems (Balduzzi et al., 2017). Additionally, shortcut connections allow for a much deeper network (He et al., 2015a).
2.2 Sequence Models
State-of-the-art deep sequence models link multiple RNNs in various configurations (Zhang et al., 2019). A prominent configuration is the sequence-to-sequence (encoder-decoder) model (Sutskever et al., 2014). The sequence-to-sequence model sequentially links two RNNs (an encoder and a decoder) through a fixed-size vector, such as the last encoder cell state. This can be limiting as it forms a potential bottleneck between the encoder and decoder. Furthermore, earlier inputs have to pass through several layers to reach the decoder.
The attention model (Bahdanau et al., 2015) addresses this sequence-to-sequence bottleneck by adding a network structure from all the encoder cells to each decoder cell. This structure, called an attention mechanism, ascribes relevance to particular encoder cells. The attention mechanism is not necessarily time-invariant, so it may reduce the time-invariance of the overall model. However, the time-series dynamics are primarily modelled by the RNN encoder and decoder, which are time-invariant.
Unlike the sequence-to-sequence and attention models, ForecastNet does not have a separate encoder and decoder. The challenges of linking these components are thus removed.
2.3 Convolutional Neural Networks
The convolutional neural network (CNN) has shown promising results in modelling sequential data (Binkowski et al., 2018; Mehrkanoon, 2019; Sen et al., 2019). The CNN uses convolution and parameter sharing to achieve shift-invariance (or translation-invariance) (LeCun et al., 1995). This is a key feature in image processing; however, in time-series applications it translates to time-invariance.
WaveNet (Oord et al., 2016) is a seminal CNN model which uses multiple layers of dilated causal convolutions for raw audio synthesis. This model has been generalised for broader sequence modelling problems, and this generalisation is referred to as the Temporal Convolutional Network (TCN) (Bai et al., 2018). We select this model as a benchmark for comparison with ForecastNet. Furthermore, we demonstrate how ForecastNet is able to accommodate convolutional layers in its hidden cells.
2.4 The Transformer and Self-Attention
A model that has successfully departed from the RNN and CNN architectures is the transformer (Vaswani et al., 2017). This model comprises a sequence of encoders and decoders. The encoders include a self-attention mechanism and a position-wise feed-forward neural network. The decoder has the same architecture as the encoder, but with an additional attention mechanism over the encoding.
Though the transformer has been highly successful in natural language processing, it has limitations in time series analysis. The first limitation is that it does not assume any sequential structure of the inputs (Vaswani et al., 2017). Positional encoding in the form of a sinusoid is injected into the inputs to provide information on the sequence order. Temporal structure is key in time-series modelling and is what time-series models are usually designed to capture. Li et al. (2019) propose including causal convolution to model local context. Convolutions are, however, time-invariant.
A second limitation of the transformer is that, to promote parallelisation, a large portion of the processing operates over the dimension of the input embedding rather than over time. For example, the multiple attention heads and the position-wise feed-forward neural networks operate over the embedding dimension. The transformer is thus not designed to operate on low-dimensional time-series signals such as the univariate time-series considered in this study. Wu et al. (2020) use a feed-forward neural network to increase the dimensionality of the input space. We, however, did not find such an approach effective on the datasets used in this study.
ForecastNet addresses both of these transformer limitations: it is specifically designed to model the temporal structure of the inputs, and it is not limited to multivariate time-series.
2.5 Uncertainty Representation
Realistically, forecasting methods should provide some form of uncertainty or confidence interval around estimates (Makridakis et al., 2018b). Neural networks do not naturally provide this capability. Several approaches to incorporate uncertainty have, however, been proposed. These include: empirical approaches such as bootstrapped residuals (Hyndman & Athanasopoulos, 2018) or Monte Carlo dropout (Zhu & Laptev, 2017); ensembles (Lakshminarayanan et al., 2017); variational-inference-based models (Bayer & Osendorfer, 2014; Krishnan et al., 2015; Chung et al., 2015; Fraccaro et al., 2016; Doerr et al., 2018; Rangapuram et al., 2018); and predictive distributions (Salinas et al., 2019). The predictive distribution approach (based on mixture density networks (Bishop, 1994; Graves, 2013)) is effective as the distribution parameters are learned directly through gradient descent. ForecastNet thus adopts this mixture density approach.
3 ForecastNet Architecture
As illustrated in Figure 1, ForecastNet is a feed-forward neural network comprising a set of n_{I} inputs, a set of n_{O} outputs, and a set of sequentially connected hidden cells (a term borrowed from the RNN literature). A detailed diagram of a simple form of ForecastNet is presented in Figure 2.
3.1 Inputs
3.2 Hidden Cells
A hidden cell represents some form of feed-forward neural network such as a multi-layered perceptron (MLP), a CNN, or self-attention. The hidden cells can be heterogeneous in architecture. Even if the architecture of each hidden cell is identical (as in this study), each cell is provided with its own unique set of parameters. This is in contrast to RNNs, where the RNN cell architecture and parameters are duplicated at each sequence step. ForecastNet is thus sequential, but it is not recurrent.
The hidden cells are intended to model the time-series dynamics. Links between adjacent hidden cells model local dynamics, and the chain of cells together models longer-term dynamics.
3.3 Outputs
Each output in ForecastNet provides a forecast one step into the future. The deeper the network, the more outputs there are; ForecastNet thus naturally scales in complexity with increased forecast reach.
Following the idea of mixture density networks (Salinas et al., 2019; Bishop, 1994), each output models a probability distribution. In this study the normal distribution is used, as illustrated in Figure 2. The mean and standard deviation of the normal distribution at output layer l are given by
\displaystyle\mu^{[l]}=W_{\mu}^{[l]T}\mathbf{a}^{[l-1]}+\mathbf{b}^{[l]}_{\mu},  (1)
\displaystyle\sigma^{[l]}=\log(1+\exp(W_{\sigma}^{[l]T}\mathbf{a}^{[l-1]}+\mathbf{b}^{[l]}_{\sigma})),  (2)
where \mathbf{a}^{[l-1]} are the outputs of the previous hidden cell, W_{\mu}^{[l]} and \mathbf{b}^{[l]}_{\mu} are the weights and biases of the mean's layer, and W_{\sigma}^{[l]} and \mathbf{b}^{[l]}_{\sigma} are the weights and biases of the standard deviation's layer.
The forecast associated with layer l is produced by sampling from \mathcal{N}(\mu^{[l]},\sigma^{[l]}). During forecasting, the sampled forecast is fed to the next layer. During training, the forecast is fully observable through the training data. The network is trained with gradient descent to optimise the normal distribution loglikelihood function (Salinas et al., 2019).
The mixture density output can be replaced with a linear output and the squared error loss function. No uncertainty will be available in this form; however, the model is not required to optimise over two parameters per output. This form is demonstrated in this study as one of the variations of ForecastNet.
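As a concrete illustration of equations (1) and (2), the sketch below computes the mean and softplus-constrained standard deviation of one output layer with numpy. Layer sizes, random weights, and variable names are illustrative assumptions, not taken from the paper.

```python
import numpy as np

# Sketch of one mixture-density output layer, per equations (1)-(2):
# the mean is a linear map of the previous hidden cell's activations a^[l-1],
# and the standard deviation passes through a softplus to guarantee positivity.
rng = np.random.default_rng(0)

hidden_dim = 24                          # assumed size of a hidden cell output
a_prev = rng.standard_normal(hidden_dim)  # a^[l-1]

W_mu, b_mu = rng.standard_normal(hidden_dim), 0.0
W_sigma, b_sigma = rng.standard_normal(hidden_dim), 0.0

mu = W_mu @ a_prev + b_mu                              # equation (1)
sigma = np.log1p(np.exp(W_sigma @ a_prev + b_sigma))   # equation (2): softplus

# During forecasting, the one-step forecast is a sample from N(mu, sigma):
forecast = rng.normal(mu, sigma)
```

The softplus (log(1 + exp(x))) keeps sigma strictly positive for any pre-activation value, which is why it is preferred over a plain linear layer for a scale parameter.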
4 ForecastNet Properties
4.1 TimeVariance
A time-invariant system is defined as a system for which a time shift of the input sequence causes a corresponding shift in the output sequence (Oppenheim & Schafer, 2009).
Theorem 1.
ForecastNet with inputs \mathbf{x}_{t}, hidden states \mathbf{h}_{t}, and outputs \mathbf{y}_{t} given by
\displaystyle\mathbf{y}_{t}=f_{t}(\mathbf{h}_{t})  (3)
\displaystyle\mathbf{h}_{t}=g_{t}(\mathbf{x}_{t},\mathbf{h}_{t-1},\mathbf{y}_{t-1})  (4)
is not time invariant.
Proof.
Given two inputs \mathbf{x}_{t} and \mathbf{x^{\prime}}_{t}, two outputs \mathbf{y}_{t} and \mathbf{y^{\prime}}_{t}, and two hidden states \mathbf{h}_{t} and \mathbf{h^{\prime}}_{t}, let \mathbf{x^{\prime}}_{t} be \mathbf{x}_{t} shifted by t_{0} such that \mathbf{x^{\prime}}_{t}=\mathbf{x}_{t-t_{0}}. Similarly, let \mathbf{h^{\prime}}_{t}=\mathbf{h}_{t-t_{0}}. Time-invariance requires that \mathbf{y}_{t-t_{0}}=\mathbf{y^{\prime}}_{t}. From (3) and (4), \mathbf{y}_{t-t_{0}} is given by
\displaystyle\mathbf{y}_{t-t_{0}}=f_{t-t_{0}}(g_{t-t_{0}}(\mathbf{x}_{t-t_{0}},\mathbf{h}_{t-t_{0}-1},\mathbf{y}_{t-t_{0}-1}))
and \mathbf{y^{\prime}}_{t} is given by
\displaystyle\mathbf{y^{\prime}}_{t}=f_{t}(g_{t}(\mathbf{x^{\prime}}_{t},\mathbf{h^{\prime}}_{t-1},\mathbf{y^{\prime}}_{t-1}))
\displaystyle\phantom{\mathbf{y^{\prime}}_{t}}=f_{t}(g_{t}(\mathbf{x}_{t-t_{0}},\mathbf{h}_{t-t_{0}-1},\mathbf{y^{\prime}}_{t-1})).
Since f_{t-t_{0}}\neq f_{t} and g_{t-t_{0}}\neq g_{t} in general, \mathbf{y}_{t-t_{0}}\neq\mathbf{y^{\prime}}_{t} and ForecastNet is not time-invariant. ∎
ForecastNet is not time-invariant because its parameters (and optionally its architecture) vary in time, over the sequence of inputs and outputs. Compare this with the RNN, whose fixed parameters are reused at each time step, resulting in a time-invariant model.
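A toy numeric illustration of this point (not the full ForecastNet recursion, and with hypothetical maps chosen for clarity): when the map applied at time t depends on t, shifting the input sequence no longer simply shifts the output sequence.

```python
import numpy as np

def time_variant(x):
    # f_t scales by (t + 1): the "parameters" change with the time index t
    t = np.arange(len(x))
    return (t + 1) * x

def time_invariant(x):
    # the same map is applied at every time step
    return 2 * x

def shift(y):
    # delay a sequence by one step, padding with zero
    return np.concatenate(([0.0], y[:-1]))

x = np.array([1.0, 2.0, 3.0, 0.0, 0.0])
x_shift = shift(x)  # the same input, delayed by one step

invariant_holds = np.allclose(shift(time_invariant(x)), time_invariant(x_shift))
variant_holds = np.allclose(shift(time_variant(x)), time_variant(x_shift))
```

Here `invariant_holds` is true (shifting commutes with the fixed map) while `variant_holds` is false, mirroring the conclusion of Theorem 1.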
Theorem 2.
An RNN with inputs \mathbf{x}_{t}, hidden states \mathbf{h}_{t}, and outputs \mathbf{y}_{t} given by
\displaystyle\mathbf{y}_{t}=f(\mathbf{h}_{t})  (5)
\displaystyle\mathbf{h}_{t}=g(\mathbf{x}_{t},\mathbf{h}_{t-1})  (6)
is time-invariant when \mathbf{h}_{t} is initialised to some initial value \mathbf{h}_{0} immediately before the first input is received.
Proof.
Given two inputs \mathbf{x}_{t} and \mathbf{x^{\prime}}_{t}, two outputs \mathbf{y}_{t} and \mathbf{y^{\prime}}_{t}, and two hidden states \mathbf{h}_{t} and \mathbf{h^{\prime}}_{t}, let \mathbf{x^{\prime}}_{t} be \mathbf{x}_{t} shifted by t_{0} such that \mathbf{x^{\prime}}_{t}=\mathbf{x}_{t-t_{0}}. Similarly, let \mathbf{h^{\prime}}_{t}=\mathbf{h}_{t-t_{0}}. Time-invariance requires that \mathbf{y}_{t-t_{0}}=\mathbf{y^{\prime}}_{t}. From (5) and (6), \mathbf{y}_{t-t_{0}} is given by
\displaystyle\mathbf{y}_{t-t_{0}}=f(g(\mathbf{x}_{t-t_{0}},\mathbf{h}_{t-t_{0}-1}))
and \mathbf{y^{\prime}}_{t} is given by
\displaystyle\mathbf{y^{\prime}}_{t}=f(g(\mathbf{x^{\prime}}_{t},\mathbf{h^{\prime}}_{t-1}))
\displaystyle\phantom{\mathbf{y^{\prime}}_{t}}=f(g(\mathbf{x}_{t-t_{0}},\mathbf{h}_{t-t_{0}-1})).
Thus \mathbf{y}_{t-t_{0}}=\mathbf{y^{\prime}}_{t} and the RNN is time-invariant. ∎
Note that a requirement for time-invariance in the RNN is that \mathbf{h}_{t} is initialised to \mathbf{h}_{0} immediately before the first input is received. If \mathbf{h}_{t} were initialised several time-steps before the first input was received, a set of zeros (padding) would have to be provided as inputs until the first \mathbf{x}_{t} was received. This could result in \mathbf{h^{\prime}}_{t}\neq\mathbf{h}_{t-t_{0}}.
4.2 Interleaved Outputs
The vanishing/exploding gradient problem has been referred to as “deep learning’s fundamental problem” (Schmidhuber, 2015). The problem stems from the repeated application of the chain rule of calculus in computing the gradient (Caterini & Chang, 2018). The chain rule produces a chain of factors. Especially long chains associated with deep networks can either vanish to zero or diverge (explode). By interleaving outputs between hidden layers in ForecastNet, the chain is broken into a sum of terms. This sum of terms is more stable than a product of factors. The intuition is that interleaved outputs provide localised information to inner hidden layers of the network during training. This decreases the effective depth of the network.
To show this, consider an L-layered ForecastNet with a single hidden neuron and a single output neuron. A summation results when the output of a hidden layer is split between the next hidden layer and the next output. Repeated application of the chain rule results in the following expression
\displaystyle\frac{\partial\mathcal{L}}{\partial W^{[l]}}=\sum_{k=0}^{\frac{L-1-l}{2}}\frac{\partial\mathcal{L}}{\partial\mathbf{z}^{[l+2k+1]}}\frac{\partial\mathbf{z}^{[l+2k+1]}}{\partial\mathbf{a}^{[l+2k]}}\Psi_{k}\frac{\partial\mathbf{a}^{[l]}}{\partial W^{[l]}}  (7)
where
\displaystyle\Psi_{k}=\begin{cases}1&k=0\\ \displaystyle\prod_{j=1}^{k}\frac{\partial\mathbf{z}^{[l+2j]}}{\partial\mathbf{a}^{[l+2(j-1)]}}&k>0\end{cases}  (8)
Proof.
Consider an L-layered ForecastNet with a single hidden neuron in each hidden cell and a single linear output neuron, as illustrated in Figure 3 (the inputs are not shown as they do not contribute to the error backpropagated through hidden cells). For some layer l in this network, W^{[l]} is the weight matrix, \bar{b}^{[l]} is the bias vector, \mathbf{a}^{[l]} is the output vector, and \mathbf{z}^{[l]}=W^{[l]T}\mathbf{a}^{[l-1]}+\bar{b}^{[l]}. Using the chain rule of calculus, the derivative of the loss function \mathcal{L} with respect to the weights W^{[l]} at layer l is given by
\displaystyle\frac{\partial\mathcal{L}}{\partial W^{[l]}}  \displaystyle=\frac{\partial\mathcal{L}}{\partial\mathbf{a}^{[l]}}\frac{% \partial\mathbf{a}^{[l]}}{\partial W^{[l]}} 
Hidden layer l links to layers l+1 and l+2. Thus, the derivative with respect to \mathbf{a}^{[l]} is expanded as follows
\displaystyle\frac{\partial\mathcal{L}}{\partial W^{[l]}}=\left(\frac{\partial\mathcal{L}}{\partial\mathbf{z}^{[l+1]}}\frac{\partial\mathbf{z}^{[l+1]}}{\partial\mathbf{a}^{[l]}}+\frac{\partial\mathcal{L}}{\partial\mathbf{z}^{[l+2]}}\frac{\partial\mathbf{z}^{[l+2]}}{\partial\mathbf{a}^{[l]}}\right)\frac{\partial\mathbf{a}^{[l]}}{\partial W^{[l]}}
\displaystyle\phantom{\frac{\partial\mathcal{L}}{\partial W^{[l]}}}=\frac{\partial\mathcal{L}}{\partial\mathbf{z}^{[l+1]}}\frac{\partial\mathbf{z}^{[l+1]}}{\partial\mathbf{a}^{[l]}}\frac{\partial\mathbf{a}^{[l]}}{\partial W^{[l]}}+\frac{\partial\mathcal{L}}{\partial\mathbf{z}^{[l+2]}}\frac{\partial\mathbf{z}^{[l+2]}}{\partial\mathbf{a}^{[l]}}\frac{\partial\mathbf{a}^{[l]}}{\partial W^{[l]}}
Layer l+2 links to layers l+3 and l+4, and \partial\mathcal{L}/\partial\mathbf{a}^{[l+2]} can be expanded in a similar manner to the above. This expansion process is continued until the final output layer L is reached, which produces (7). ∎
Similarly, consider a chain of neurons in a standard Llayered feedforward neural network with loss function \mathcal{L}. Repeated application of the chain rule of calculus results in the following product
\displaystyle\frac{\partial\mathcal{L}}{\partial W^{[l]}}=\frac{\partial\mathcal{L}}{\partial\mathbf{a}^{[L]}}\frac{\partial\mathbf{a}^{[L]}}{\partial\mathbf{z}^{[L]}}\frac{\partial\mathbf{z}^{[L]}}{\partial\mathbf{a}^{[L-1]}}\frac{\partial\mathbf{a}^{[L-1]}}{\partial\mathbf{z}^{[L-1]}}\cdots\frac{\partial\mathbf{z}^{[l+2]}}{\partial\mathbf{a}^{[l+1]}}\frac{\partial\mathbf{a}^{[l+1]}}{\partial\mathbf{z}^{[l+1]}}\frac{\partial\mathbf{z}^{[l+1]}}{\partial\mathbf{a}^{[l]}}\frac{\partial\mathbf{a}^{[l]}}{\partial\mathbf{z}^{[l]}}\frac{\partial\mathbf{z}^{[l]}}{\partial W^{[l]}}  (9)
This equation can be expressed in the form \lambda^{L} (Caterini & Chang, 2018; Goodfellow et al., 2016). If \lambda<1, the term vanishes towards 0 as L grows. If \lambda>1, the term explodes as L grows.
Equation (9) contains a product of factors, whereas (7) contains a sum of terms. A sum of terms with values less than one does not tend to zero (vanish) as a product of factors with values less than one would.
The interleaved outputs mitigate, but do not eliminate, the vanishing gradient problem. The term \Psi_{k} is a product of derivatives formed by the chain rule of calculus. The number of factors in this product grows in proportion to k. For a distant layer where k is large, \Psi_{k} will have many factors, so gradients backpropagated from distant layers remain susceptible to vanishing gradients. However, for nearby outputs where k is small, \Psi_{k} will have few factors, so gradients from nearby outputs are less likely to vanish. Thus, nearby outputs can provide guidance to local parameters during training, resulting in improved convergence as the effective depth of the network is reduced.
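The contrast between (9) and (7) can be made concrete with a small numeric sketch. This is an illustration of the structural argument, not the paper's exact gradients: each chain-rule factor is taken to have magnitude 0.5, and the interleaved-output gradient is modelled as a sum of progressively longer products, as in equations (7)-(8).

```python
import numpy as np

L = 40                          # assumed network depth for illustration
factors = np.full(L, 0.5)       # each chain-rule factor has magnitude 0.5

# Standard deep chain, equation (9): one long product of L factors -> ~0.5^L
product_grad = np.prod(factors)

# Interleaved outputs, equation (7): a sum over k of products of k factors.
# The k = 0 term contributes 1 (Psi_0 = 1), so the sum cannot vanish.
sum_grad = sum(np.prod(factors[:k]) for k in range(L))
```

Here `product_grad` is about 1e-12 while `sum_grad` is close to 2: the long product vanishes, but the sum is dominated by its short-chain terms, matching the argument above.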
5 Material and Methods
5.1 Datasets
A set of models is compared on a synthetic dataset and nine real-world datasets sourced from the weather, environmental, energy, aquaculture, and meteorological domains. Datasets with seasonal components and complex trends were hand-picked to ensure that they provide a sufficiently challenging problem. Properties such as varying seasonal amplitude, varying seasonal shape, and noise were sought. For example, the shape of the seasonal cycle is non-stationary and changes over time in most datasets. Additionally, properties such as seasonality assist in demonstrating the ability of models to learn long-term dependencies in the data.
The synthetic dataset is used to provide a baseline. The data is generated according to x_{t}=2\sin\left(2\pi ft\right)+\frac{1}{3}\sin\left(\frac{2\pi ft}{5}\right), where f is the frequency and t denotes time. The low-frequency sinusoid emulates a long-term time-varying trend, whereas the high-frequency sinusoid emulates seasonality. Models are expected to perform well on this dataset because it contains no noise. The properties of all the datasets are summarised in Table 1. In figures and tables the datasets are referred to by their abbreviations.
Dataset  Abbreviation  Resolution  Period  Length  Minimum  Maximum  Mean  Std. Dev. 

Synthetic  Synth.    20  4320  -2.33  2.33  0.00  1.43 
England temperature Hyndman (2014)  Weath.  Monthly  12  3261  0.10  18.80  9.27  4.75 
River flow Hyndman (2014)  River  Monthly  12  1492  3290  66500  23157.60  13087.40 
Electricity AEMO (2019)  Elect.  Hourly  24  19224  5514  14580  8709.79  1360.29 
Traffic Volume UCI (2019)  Traff.  Hourly  24  8776  125  7217  3269.26  2021.57 
Lake levels Hyndman (2014)  Lake  Monthly  12  648  10  20  15.08  2.00 
Dissolved Oxygen Dabrowski et al. (2018)  DO  Hourly  24  2422  5.66  7.94  6.50  0.53 
pH Dabrowski et al. (2018)  pH  Hourly  24  2422  8.07  11.15  8.56  0.21 
Pond temperature Dabrowski et al. (2018)  Temp.  Hourly  24  2422  24.38  31.97  27.74  1.85 
Ozone Hyndman (2014)  Ozone  Monthly  12  516  266  430  338.00  38.30 
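The synthetic series can be reproduced in a few lines. The sketch below assumes f = 1/20 (implied by the period of 20 samples in Table 1) and the length of 4320 from the same table; these are inferences from the table, not stated formulas.

```python
import numpy as np

# Synthetic series: x_t = 2 sin(2*pi*f*t) + (1/3) sin(2*pi*f*t / 5).
# f = 1/20 gives the seasonal period of 20 samples reported in Table 1;
# the second sinusoid has a period five times longer (the slow "trend").
f = 1.0 / 20
t = np.arange(4320)
x = 2 * np.sin(2 * np.pi * f * t) + (1 / 3) * np.sin(2 * np.pi * f * t / 5)
```

The resulting series matches the summary statistics in Table 1: extremes of about ±2.33, mean 0.00, and standard deviation about 1.43.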
5.2 Models
Four deep-learning-based benchmark models are compared to four variations of ForecastNet. These models include deepAR (Salinas et al., 2019), the TCN (Bai et al., 2018), the sequence-to-sequence (encoder-decoder) model (Sutskever et al., 2014), and the attention model (Bahdanau et al., 2015). For completeness, a single-layer MLP, a free-form seasonal Dynamic Linear Model (DLM) (West & Harrison, 1997), and a seasonal autoregressive integrated moving average (SARIMA) model are included in the comparison. The DLM (a state-space model) and the SARIMA model are well-known statistical models used for time-series forecasting (Hyndman & Athanasopoulos, 2018).
Four variations of ForecastNet are tested (note: we also tested the use of self-attention mechanisms in the hidden cells, but were unable to achieve any significant performance improvement compared with FN):
FN: Densely connected hidden layers in each hidden cell and a Gaussian mixture density output layer.
cFN: CNNs in the hidden cell and a Gaussian mixture density output layer.
FN2: Identical to FN, but with a linear output layer instead of a mixture density output layer.
cFN2: Identical to cFN, but with a linear output layer instead of a mixture density output layer.
The set of models is tested on datasets that have a seasonal component with period denoted by \tau. The number of inputs in all models is set to 2\tau and the number of outputs (forecast steps) is set to \tau. Thus, the models are trained to forecast one seasonal cycle ahead in time, given the two previous cycles of data.
To avoid possible bias between the models, they are configured to be as similar as possible. This is achieved by limiting the number of neurons in the models to similar values. Configuration details are provided in Table 2. In figures and tables, the sequencetosequence model is denoted by ‘seq2seq’ and the attention model is denoted by ‘Att’.
Model  Configuration 

FN  Each hidden cell comprises two densely connected hidden layers, each with 24 neurons. The rectified linear unit (ReLU) with He initialisation (He et al., 2015b) is used as the activation function. 
FN2  Identical to FN but with a linear output layer instead of a mixture density layer. 
cFN  Each hidden cell comprises a convolutional layer with 24 filters, each with a kernel size of 2, followed by an average pooling layer with a pool size of 2 and a stride of 1. The convolutional and pooling layers are duplicated and followed by a dense layer with 24 ReLU neurons. 
cFN2  Identical to cFN but with a linear output layer instead of a mixture density layer. 
DeepAR  A sequence-to-sequence architecture with single-layered LSTMs in the encoder and decoder. The mixture density output of the network is a Gaussian (normal) distribution. 
TCN  The TCN contains a convolutional layer with 32 filters, each with a kernel size of 2 for the Synthetic, Weather, Elect., and River datasets. For the remaining datasets, the TCN contains a convolutional layer with 64 filters, each with a kernel size of 3. The output contains a dense layer with \tau linear units. 
Attention  The encoder has a bidirectional LSTM and the decoder has a single-layered LSTM. The LSTM cells are configured with 24 ReLU units. 
Seq2Seq  The encoder and decoder use a single-layered LSTM. The LSTM cells are configured with 24 ReLU units. 
MLP  A feed-forward MLP with a single hidden layer comprising 4\tau ReLU hidden neurons. A set of 2\tau inputs is provided and a set of \tau linear outputs is used. 
DLM  The DLM used is the free-form seasonal model with a zero-order trend component (West & Harrison, 1997). Model fitting is performed using a modified Kalman filter (Wang, 2016). 
SARIMA  Standard-form SARIMA(p,d,q)(P,D,Q)_s with: Synthetic: SARIMA(1,1,1)(1,1,0)_{20}; Weather: SARIMA(2,0,3)(0,1,0)_{12}; Elect.: SARIMA(3,1,4)(0,1,0)_{24}; River: SARIMA(2,0,4)(0,1,0)_{12}; Traff.: SARIMA(3,1,1)(0,1,0)_{24}; Lake: SARIMA(2,0,8)(0,1,0)_{12}; DO: SARIMA(2,0,6)(0,1,0)_{24}; pH: SARIMA(4,1,3)(0,1,0)_{24}; Temp.: SARIMA(2,1,4)(0,1,0)_{24}; and Ozone: SARIMA(3,0,4)(0,1,0)_{12}. 
5.3 Training and Testing
The datasets are scaled to the range [0,1] for training. Each dataset is split into a training and a test set, where the last 10% of the data are used for the test set. The training and test sets each comprise a long sequence of values. These sequences are converted into a set of samples that the models can process. A sample is extracted using a sliding window of length 3\tau. The first 2\tau values in this window form the input sequence to the model, and the last \tau values form the forecast targets. The window is slid across the dataset sequence to produce a set of samples. The set of training samples is shuffled prior to training.
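The sliding-window extraction can be sketched as follows. A window step of one and a univariate series are assumptions for illustration; the function name is hypothetical.

```python
import numpy as np

def make_samples(series, tau):
    """Slide a window of length 3*tau over the series; the first 2*tau values
    of each window are model inputs, the last tau values are forecast targets."""
    window = 3 * tau
    inputs, targets = [], []
    for start in range(len(series) - window + 1):
        w = series[start:start + window]
        inputs.append(w[:2 * tau])    # first 2*tau values: input sequence
        targets.append(w[2 * tau:])   # last tau values: forecast targets
    return np.array(inputs), np.array(targets)

# Example: a 100-value series with an assumed period tau = 24 (hourly data)
series = np.arange(100.0)
X, Y = make_samples(series, tau=24)
```

For a series of length 100 and tau = 24 this yields 29 samples, each with 48 input values and 24 target values; the training samples would then be shuffled before fitting.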
The ADAM algorithm (Kingma & Ba, 2014) is used to minimise the mean squared error in all machine learning models. The learning rate is searched over the range 10^{-i}, i\in\{2,\dots,6\}. Early stopping is used to address overfitting and defines the number of epochs. The Mean Absolute Scaled Error (MASE) (Hyndman & Koehler, 2006) is used to evaluate the performance of the models. For completeness, results with the Symmetric Mean Absolute Percentage Error (SMAPE) (Hyndman & Athanasopoulos, 2018) are additionally provided.
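A sketch of the MASE metric of Hyndman & Koehler (2006), which scales the mean absolute forecast error by the in-sample mean absolute error of a one-step naive forecast. Whether the paper uses the seasonal-naive variant of the scaling term is not stated here, so the plain naive scaling is assumed.

```python
import numpy as np

def mase(y_true, y_pred, y_train):
    """Mean Absolute Scaled Error: forecast MAE divided by the in-sample MAE
    of the one-step naive forecast (y_hat_t = y_{t-1})."""
    mae = np.mean(np.abs(np.asarray(y_true) - np.asarray(y_pred)))
    scale = np.mean(np.abs(np.diff(y_train)))  # naive one-step in-sample MAE
    return mae / scale

# Toy example: in-sample data and a two-step forecast
y_train = np.array([1.0, 3.0, 2.0, 4.0])
score = mase([5.0, 6.0], [4.0, 8.0], y_train)
```

A MASE below 1 indicates the forecast beats the naive method on average; values above 1 (such as several entries in Table 3) indicate the opposite over the forecast horizon.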
6 Results and Discussion
6.1 TimeInvariance Test
Before comparing models on the datasets, the difference between time-invariant and time-variant models is demonstrated using a variation of the synthetic dataset given by
\displaystyle x_{t}=\frac{1}{2}\sin\left(\frac{2\pi f}{6}t\right)\left(\frac{3}{5}\sin\left(2\pi ft\right)+\frac{1}{5}\sin\left(\frac{2\pi f}{5}t\right)\right)
The second and third sinusoids emulate a seasonal cycle and a time-varying trend, as previously presented. The first sinusoid is included to perform amplitude modulation.
The amplitude modulation repeats every 6 cycles of the seasonal period. A model is only presented with two seasonal cycles as inputs; it thus never observes the complete pattern, which is the full cycle of the amplitude modulation. A time-variant model's parameters are able to change over time, which enables the model to adapt to the variations in amplitude. A time-invariant model is less agile and is expected to struggle with large amplitude variations.
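The amplitude-modulated test signal can be sketched as below. The value f = 1/20 is carried over from the original synthetic dataset as an assumption (the paper does not restate it here), and the series length is illustrative.

```python
import numpy as np

# Amplitude-modulated variant of the synthetic series: a slow envelope
# (period 6 times the seasonal period) multiplies the seasonal-plus-trend
# signal, so no two-cycle input window ever sees the full modulation pattern.
f = 1.0 / 20
t = np.arange(2400)
envelope = 0.5 * np.sin(2 * np.pi * f * t / 6)
x = envelope * (0.6 * np.sin(2 * np.pi * f * t)
                + 0.2 * np.sin(2 * np.pi * f * t / 5))
```

Because the envelope is bounded by 0.5 and the inner sum by 0.8, the signal magnitude never exceeds 0.4, while its local amplitude drifts across successive seasonal cycles.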
The seq2seq, attention, FN2, and cFN2 models are trained on this dataset. Forecasts from inputs starting at time indices 0, 50, 150 and 200 are presented in Figure 4. As expected, the time-invariant seq2seq model struggles to maintain accurate forecasts when there are large variations in the amplitude. The attention model performs better due to its time-variant attention mechanism. The time-variant ForecastNet models, however, adapt well to the large variations in the signal amplitude. We argue that this is due to the time-variant properties of ForecastNet.
6.2 Model Comparison Error Results
FN  cFN  FN2  cFN2  deepAR  Seq2Seq  Attention  TCN  MLP  DLM  SARIMA  
Synth.  0.00 (1.5)  0.01 (1.7)  0.00 (1.3)  0.00 (1.4)  0.03 (2.3)  0.01 (1.7)  0.04 (2.8)  0.05 (3.0)  0.01 (1.6)  0.64 (23.0)  0.29 (11.7) 
Weath.  0.46 (17.1)  0.40 (15.3)  0.46 (16.7)  0.31 (12.3)  0.46 (17.3)  0.43 (16.0)  0.37 (14.3)  0.47 (17.3)  0.47 (17.1)  0.52 (18.9)  0.61 (22.0) 
Elect.  1.12 (12.5)  1.04 (11.9)  0.89 (10.9)  0.54 (6.7)  1.77 (18.2)  1.00 (11.7)  1.39 (14.7)  1.09 (12.9)  1.34 (15.4)  1.73 (19.1)  1.26 (15.1) 
River  0.71 (24.1)  0.66 (23.6)  0.66 (25.2)  0.39 (15.4)  0.86 (28.7)  0.57 (21.9)  0.53 (19.4)  0.85 (30.5)  0.85 (30.3)  0.77 (26.8)  0.87 (29.5) 
Traff.  2.23 (46.0)  1.95 (40.3)  1.44 (32.5)  0.82 (21.2)  2.01 (40.3)  1.78 (36.6)  1.94 (39.3)  2.20 (44.4)  2.36 (47.8)  2.40 (43.5)  2.32 (42.9) 
Lake  1.42 (10.1)  1.61 (12.0)  1.58 (12.3)  1.69 (13.0)  1.61 (12.3)  1.73 (13.4)  1.56 (11.2)  2.03 (13.8)  1.57 (12.1)  1.95 (15.5)  1.69 (12.1) 
DO  0.54 (6.1)  0.62 (7.2)  0.62 (6.9)  0.54 (6.2)  0.71 (8.2)  0.73 (8.6)  2.11 (26.5)  0.78 (9.1)  0.77 (8.6)  0.77 (8.7)  0.64 (7.0) 
pH  1.41 (13.6)  1.23 (12.0)  1.26 (12.4)  1.01 (9.6)  2.64 (25.1)  1.70 (16.6)  1.42 (14.4)  1.35 (13.1)  1.90 (18.3)  1.29 (12.7)  1.68 (16.5) 
Temp.  1.90 (9.0)  1.90 (10.0)  1.66 (8.5)  2.00 (10.7)  1.95 (9.8)  2.13 (10.6)  2.23 (11.4)  3.18 (14.2)  2.10 (9.7)  3.67 (18.4)  1.64 (7.0) 
Ozone  0.72 (33.9)  0.78 (36.2)  0.69 (33.9)  0.79 (36.0)  1.03 (44.7)  1.50 (61.8)  0.58 (28.6)  0.69 (33.5)  0.66 (32.5)  0.76 (30.6)  0.89 (36.5) 
Borda Count  74  76  90  90  43  57  65  42  51  31  41 
The average MASE and SMAPE over all forecasts on each dataset's test set are provided in Table 3. ForecastNet produces the best results on 8 of the 10 datasets, with the cFN2 variation achieving the best results on 4 of these 8. This result is reinforced by the Borda counts provided in the last row. A Borda count ranks a set of M models with the integers 1, \dots, M such that the model with the lowest MASE is assigned a value of M (a higher vote) and the model with the highest MASE is assigned a value of 1 (a lower vote). Borda counts thus provide a more aggregated evaluation. FN2 and cFN2 are voted the best models with the highest Borda counts, followed by cFN, FN, and the attention model respectively.
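The Borda count described above can be sketched as follows. This is an illustrative implementation only; the MASE values in the example are made up, and ties are not treated specially.

```python
# Borda count over models ranked by MASE: on each dataset, the model with
# the lowest MASE receives M votes and the model with the highest MASE
# receives 1 vote; votes are summed across datasets.

def borda_counts(scores_per_dataset):
    """scores_per_dataset: list of dicts mapping model name -> MASE."""
    totals = {m: 0 for m in scores_per_dataset[0]}
    for scores in scores_per_dataset:
        # Sort models from highest MASE (vote 1) to lowest MASE (vote M).
        ranked = sorted(scores, key=lambda m: scores[m], reverse=True)
        for vote, model in enumerate(ranked, start=1):
            totals[model] += vote
    return totals

# Two hypothetical datasets with three models:
datasets = [
    {"FN2": 0.5, "MLP": 0.9, "SARIMA": 1.2},
    {"FN2": 0.7, "MLP": 0.6, "SARIMA": 1.1},
]
print(borda_counts(datasets))  # {'FN2': 5, 'MLP': 5, 'SARIMA': 2}
```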
The attention model produced the lowest error on the ozone dataset. It is a relatively complex model, and its attention mechanism assists in modelling long-term dependencies. FN2 provides strong competition to the attention model over the remaining datasets, despite having an arguably simpler architecture with no gating structures to reduce vanishing gradients. We argue that a key reason ForecastNet performs so well is that it is not a time-invariant model.
Increasing the model complexity generally resulted in improved forecast performance in this study; for example, cFN generally outperforms FN. However, simpler models do not necessarily fail on these datasets: the MLP, for example, provided comparably accurate forecasts despite its simplicity.
As expected, the DLM and SARIMA statistical models performed well despite being linear models (Makridakis et al., 2018a). For example, the SARIMA model achieved the lowest error on the pond temperature dataset, which suggests that the dynamics of this dataset are more linear. However, on the other datasets, with their nonlinear trends, amplitudes, and cyclic shapes, the DLM and SARIMA models did not perform as well as the nonlinear neural network-based models. It has, however, been shown that such linear statistical models can outperform complex machine learning models when the sample size is small (Cerqueira et al., 2019).
Of the deep neural network-based models, the TCN performed the worst on several datasets. Bai et al. (2018) note that their model is presented in a simplified form and that improved results may be possible with a more advanced architecture. Furthermore, the TCN is designed to perform dilated convolutions over many samples. Of the datasets used in this study, the maximum number of input samples was 48, for the datasets with a period of 24 hours. This may be too few samples to demonstrate the effectiveness of the TCN.
Several MASE results are above unity. In multi-step-ahead forecasting, this does not necessarily imply that a forecasting model produces results worse than the Naive model. In the MASE calculation, the Naive model produces one-step-ahead forecasts with in-sample data provided for each previous step, whereas the forecasting model is evaluated out-of-sample. That is, the Naive model has access to the ground-truth data in the MASE calculation, whereas the forecasting model does not.
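The MASE computation described above can be sketched as follows, following Hyndman & Koehler (2006). This is a minimal non-seasonal sketch with made-up numbers, not the paper's evaluation code: the out-of-sample forecast error is scaled by the in-sample MAE of the Naive one-step-ahead forecast, which sees the ground truth at every step.

```python
# MASE: out-of-sample forecast MAE divided by the in-sample MAE of the
# Naive one-step-ahead forecast (previous value predicts the next).

def mase(insample, actual, forecast):
    # Out-of-sample MAE of the model's multi-step-ahead forecast.
    mae_model = sum(abs(a - f) for a, f in zip(actual, forecast)) / len(actual)
    # In-sample MAE of the Naive forecast, which uses ground truth at each step.
    naive_errors = [abs(insample[t] - insample[t - 1])
                    for t in range(1, len(insample))]
    mae_naive = sum(naive_errors) / len(naive_errors)
    return mae_model / mae_naive

history = [10.0, 12.0, 11.0, 13.0]  # in-sample data
truth = [14.0, 15.0]                # test-set ground truth
pred = [13.0, 13.0]                 # model's two-step-ahead forecast
print(mase(history, truth, pred))   # ~0.9 (= 1.5 / (5/3))
```

A value above 1 means the multi-step forecast has a larger MAE than the in-sample one-step Naive baseline, which, as noted above, is a much easier task.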
6.3 Model Comparison Box-Whisker Plots
Box-whisker plots of the results over all forecasts in each dataset's test set are provided in Figure 5. ForecastNet consistently produced small boxes with low median values. The small boxes indicate little variation in accuracy over the set of forecasts, suggesting a form of robustness in the ForecastNet model. The low median values indicate a high level of accuracy.
There was significant variation across the models in the synthetic dataset's box-whisker plots. For this dataset, the densely connected networks such as FN, FN2, and the MLP have small boxes. DeepAR had a large box, indicating higher variation in its forecasts and thus lower robustness on this dataset.
The models generally perform well on the weather dataset. This may be due to its more consistent seasonal amplitude compared with the other real-world datasets. The lake and pH datasets have varying trends, amplitudes, and seasonal shapes, resulting in higher errors than on other datasets. ForecastNet and the attention model seem to model these variations better, given their lower errors.
In datasets such as electricity, traffic, and pH, ForecastNet produced low errors with small boxes, indicating reliable and accurate forecasts. Especially in the electricity and traffic datasets, it is evident that increasing model complexity and removing the mixture density output result in lower errors and a more robust model. The mixture density output can reduce accuracy because the learning algorithm seeks to simultaneously minimise two variables in the normal distribution's log-likelihood function, as opposed to minimising a single variable in the mean squared error loss function used with a linear output layer.
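A minimal numeric sketch of this point (illustrative values only, not the paper's loss implementation): the Gaussian negative log-likelihood used by a mixture density output depends jointly on the mean and the standard deviation, whereas the squared error depends on the mean alone.

```python
import math

def mse(y, mu):
    # Squared error: a single quantity (the mean) is optimised.
    return (y - mu) ** 2

def gaussian_nll(y, mu, sigma):
    # Negative log-likelihood of N(y | mu, sigma^2): two quantities
    # (mean and standard deviation) must be optimised simultaneously.
    return 0.5 * math.log(2 * math.pi * sigma ** 2) + (y - mu) ** 2 / (2 * sigma ** 2)

y = 1.0
print(mse(y, 0.8))                # depends only on mu
print(gaussian_nll(y, 1.0, 0.5))  # nonzero even when mu is exact...
print(gaussian_nll(y, 1.0, 2.0))  # ...and changes with sigma alone
```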
7 Summary and Conclusion
In this study, ForecastNet is proposed as a novel deep neural network architecture for multi-step-ahead time-series forecasting. Its architecture breaks from the convention of structuring a model around the RNN or CNN. The result is a time-variant model, in contrast with the RNN and CNN, which are time-invariant.
We provide a comparison with seven state-of-the-art deep learning and statistical models for forecasting seasonal time-series data. The comparison is performed on 10 seasonal time-series datasets selected from various domains. We demonstrate that ForecastNet is both accurate and robust on all datasets. It outperforms the other models in terms of MASE and SMAPE on 8 of the 10 datasets and is ranked as the best-performing model overall by Borda counts.
In future work, shortcut connections within and between hidden cells could be investigated. Furthermore, integrating self-attention into the model could provide benefits relating to model interpretability. Lastly, by avoiding parameter sharing to achieve a time-variant model, ForecastNet can require more memory; memory reduction techniques (such as quantization) could be explored.
References
 AEMO (2019) AEMO. Electricity demand data. https://www.aemo.com.au/Electricity/NationalElectricityMarketNEM/Datadashboard, January 2019. Last accessed: January 2019.
 Bahdanau et al. (2015) Bahdanau, D., Cho, K., and Bengio, Y. Neural machine translation by jointly learning to align and translate. In Proc. International Conference on Learning Representations, 2015. URL http://arxiv.org/abs/1409.0473.
 Bai et al. (2018) Bai, S., Kolter, J. Z., and Koltun, V. An empirical evaluation of generic convolutional and recurrent networks for sequence modeling. arXiv preprint arXiv:1803.01271, 2018.
 Balduzzi et al. (2017) Balduzzi, D., Frean, M., Leary, L., Lewis, J. P., Ma, K. W.D., and McWilliams, B. The shattered gradients problem: If resnets are the answer, then what is the question? In Precup, D. and Teh, Y. W. (eds.), Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pp. 342–350, International Convention Centre, Sydney, Australia, 06–11 Aug 2017. PMLR. URL http://proceedings.mlr.press/v70/balduzzi17b.html.
 Bayer & Osendorfer (2014) Bayer, J. and Osendorfer, C. Learning stochastic recurrent networks. arXiv preprint arXiv:1411.7610, 2014.
 Binkowski et al. (2018) Binkowski, M., Marti, G., and Donnat, P. Autoregressive convolutional neural networks for asynchronous time series. In International Conference on Machine Learning, pp. 579–588, 2018.
 Bishop (1994) Bishop, C. M. Mixture density networks. Technical Report NCRG94004, Aston University, 1994.
 Caterini & Chang (2018) Caterini, A. L. and Chang, D. E. Deep Neural Networks in a Mathematical Framework. Springer International Publishing, Cham, 2018. ISBN 9783319753041. doi: 10.1007/9783319753041. URL https://doi.org/10.1007/9783319753041.
 Cerqueira et al. (2019) Cerqueira, V., Torgo, L., and Soares, C. Machine learning vs statistical methods for time series forecasting: Size matters. arXiv preprint arXiv:1909.13316, 2019.
 Chang et al. (2017) Chang, S., Zhang, Y., Han, W., Yu, M., Guo, X., Tan, W., Cui, X., Witbrock, M., HasegawaJohnson, M. A., and Huang, T. S. Dilated recurrent neural networks. In Guyon, I., Luxburg, U. V., Bengio, S., Wallach, H., Fergus, R., Vishwanathan, S., and Garnett, R. (eds.), Advances in Neural Information Processing Systems 30, pp. 77–87. Curran Associates, Inc., 2017. URL http://papers.nips.cc/paper/6613dilatedrecurrentneuralnetworks.pdf.
 Cho et al. (2014a) Cho, K., Van Merriënboer, B., Bahdanau, D., and Bengio, Y. On the properties of neural machine translation: Encoderdecoder approaches. arXiv preprint arXiv:1409.1259, 2014a.
 Cho et al. (2014b) Cho, K., Van Merriënboer, B., Gulcehre, C., Bahdanau, D., Bougares, F., Schwenk, H., and Bengio, Y. Learning phrase representations using rnn encoderdecoder for statistical machine translation. arXiv preprint arXiv:1406.1078, 2014b.
 Chung et al. (2015) Chung, J., Kastner, K., Dinh, L., Goel, K., Courville, A. C., and Bengio, Y. A recurrent latent variable model for sequential data. In Cortes, C., Lawrence, N. D., Lee, D. D., Sugiyama, M., and Garnett, R. (eds.), Advances in Neural Information Processing Systems 28, pp. 2980–2988. Curran Associates, Inc., 2015. URL http://papers.nips.cc/paper/5653arecurrentlatentvariablemodelforsequentialdata.pdf.
 Dabrowski et al. (2018) Dabrowski, J. J., Rahman, A., George, A., Arnold, S., and McCulloch, J. State space models for forecasting water quality variables: An application in aquaculture prawn farming. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, KDD ’18, pp. 177–185, New York, NY, USA, 2018. ACM. ISBN 9781450355520. doi: 10.1145/3219819.3219841. URL http://doi.acm.org/10.1145/3219819.3219841.
 Doerr et al. (2018) Doerr, A., Daniel, C., Schiegg, M., NguyenTuong, D., Schaal, S., Toussaint, M., and Trimpe, S. Probabilistic recurrent statespace models. arXiv preprint arXiv:1801.10395, 2018.
 Du et al. (2020) Du, S., Li, T., Yang, Y., and Horng, S.J. Multivariate time series forecasting via attention-based encoder-decoder framework. Neurocomputing, 2020. ISSN 09252312. doi: https://doi.org/10.1016/j.neucom.2019.12.118. URL http://www.sciencedirect.com/science/article/pii/S0925231220300606.
 Fraccaro et al. (2016) Fraccaro, M., Sønderby, S. K., Paquet, U., and Winther, O. Sequential neural models with stochastic layers. In Lee, D. D., Sugiyama, M., Luxburg, U. V., Guyon, I., and Garnett, R. (eds.), Advances in Neural Information Processing Systems 29, pp. 2199–2207. Curran Associates, Inc., 2016. URL http://papers.nips.cc/paper/6039sequentialneuralmodelswithstochasticlayers.pdf.
 Gers et al. (1999) Gers, F., Schmidhuber, J., and Cummins, F. Learning to forget: continual prediction with lstm. IET Conference Proceedings, pp. 850–855(5), January 1999. URL http://digitallibrary.theiet.org/content/conferences/10.1049/cp_19991218.
 Goodfellow et al. (2016) Goodfellow, I., Bengio, Y., and Courville, A. Deep learning. MIT press, 2016.
 Graves (2013) Graves, A. Generating sequences with recurrent neural networks. arXiv preprint arXiv:1308.0850, 2013.
 He et al. (2015a) He, K., Zhang, X., Ren, S., and Sun, J. Deep residual learning for image recognition. arXiv preprint arXiv:1512.03385, 2015a.
 He et al. (2015b) He, K., Zhang, X., Ren, S., and Sun, J. Delving deep into rectifiers: Surpassing humanlevel performance on imagenet classification. In 2015 IEEE International Conference on Computer Vision (ICCV), pp. 1026–1034, Dec 2015b. doi: 10.1109/ICCV.2015.123.
 Hewamalage et al. (2019) Hewamalage, H., Bergmeir, C., and Bandara, K. Recurrent neural networks for time series forecasting: Current status and future directions. arXiv preprint arXiv:1909.00590, 2019.
 Hochreiter & Schmidhuber (1997) Hochreiter, S. and Schmidhuber, J. Long shortterm memory. Neural Computation, 9(8):1735–1780, 1997. doi: 10.1162/neco.1997.9.8.1735. URL https://doi.org/10.1162/neco.1997.9.8.1735.
 Hyndman (2014) Hyndman, R. Time series data library. https://robjhyndman.com/tsdl/, February 2014. Last accessed: July 2018.
 Hyndman & Athanasopoulos (2018) Hyndman, R. and Athanasopoulos, G. Forecasting: principles and practice. OTexts, Melbourne, Australia, 2nd edition, 2018. URL https://otexts.com/fpp2. Accessed on 27 April, 2019.
 Hyndman & Koehler (2006) Hyndman, R. J. and Koehler, A. B. Another look at measures of forecast accuracy. International Journal of Forecasting, 22(4):679 – 688, 2006. ISSN 01692070. doi: https://doi.org/10.1016/j.ijforecast.2006.03.001. URL http://www.sciencedirect.com/science/article/pii/S0169207006000239.
 Kingma & Ba (2014) Kingma, D. P. and Ba, J. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
 Krishnan et al. (2015) Krishnan, R. G., Shalit, U., and Sontag, D. Deep kalman filters. arXiv preprint arXiv:1511.05121, 2015.
 Kuznetsov & Mariet (2018) Kuznetsov, V. and Mariet, Z. Foundations of sequencetosequence modeling for time series. arXiv preprint arXiv:1805.03714, 2018.
 Lakshminarayanan et al. (2017) Lakshminarayanan, B., Pritzel, A., and Blundell, C. Simple and scalable predictive uncertainty estimation using deep ensembles. In Guyon, I., Luxburg, U. V., Bengio, S., Wallach, H., Fergus, R., Vishwanathan, S., and Garnett, R. (eds.), Advances in Neural Information Processing Systems 30, pp. 6402–6413. Curran Associates, Inc., 2017. URL http://papers.nips.cc/paper/7219simpleandscalablepredictiveuncertaintyestimationusingdeepensembles.pdf.
 Laptev et al. (2017) Laptev, N., Yosinski, J., Li, L. E., and Smyl, S. Timeseries extreme event forecasting with neural networks at uber. In International Conference on Machine Learning, number 34, pp. 1–5, 2017.
 LeCun et al. (1995) LeCun, Y., Bengio, Y., et al. Convolutional networks for images, speech, and time series. The handbook of brain theory and neural networks, 3361(10):1995, 1995.
 Li et al. (2019) Li, S., Jin, X., Xuan, Y., Zhou, X., Chen, W., Wang, Y.X., and Yan, X. Enhancing the locality and breaking the memory bottleneck of transformer on time series forecasting. In Advances in Neural Information Processing Systems, pp. 5244–5254, 2019.
 Maddix et al. (2018) Maddix, D. C., Wang, Y., and Smola, A. Deep factors with gaussian processes for forecasting. arXiv preprint arXiv:1812.00098, 2018.
 Makridakis et al. (2018a) Makridakis, S., Spiliotis, E., and Assimakopoulos, V. The m4 competition: Results, findings, conclusion and way forward. International Journal of Forecasting, 34(4):802 – 808, 2018a. ISSN 01692070. doi: https://doi.org/10.1016/j.ijforecast.2018.06.001. URL http://www.sciencedirect.com/science/article/pii/S0169207018300785.
 Makridakis et al. (2018b) Makridakis, S., Spiliotis, E., and Assimakopoulos, V. Statistical and machine learning forecasting methods: Concerns and ways forward. PLOS ONE, 13(3):1–26, 03 2018b. doi: 10.1371/journal.pone.0194889. URL https://doi.org/10.1371/journal.pone.0194889.
 Mehrkanoon (2019) Mehrkanoon, S. Deep shared representation learning for weather elements forecasting. KnowledgeBased Systems, 179:120 – 128, 2019. ISSN 09507051. doi: https://doi.org/10.1016/j.knosys.2019.05.009. URL http://www.sciencedirect.com/science/article/pii/S0950705119302151.
 Nguyen et al. (2019) Nguyen, L., Yang, Z., Li, J., Pan, Z., Cao, G., and Jin, F. Forecasting people’s needs in hurricane events from social network. IEEE Transactions on Big Data, pp. 1–1, 2019. doi: 10.1109/TBDATA.2019.2941887.
 Oord et al. (2016) Oord, A. v. d., Dieleman, S., Zen, H., Simonyan, K., Vinyals, O., Graves, A., Kalchbrenner, N., Senior, A., and Kavukcuoglu, K. Wavenet: A generative model for raw audio. arXiv preprint arXiv:1609.03499, 2016.
 Oppenheim & Schafer (2009) Oppenheim, A. V. and Schafer, R. W. DiscreteTime Signal Processing. Pearson education signal processing series. Pearson, 3rd edition, 2009. ISBN 9780131988422.
 Rangapuram et al. (2018) Rangapuram, S. S., Seeger, M. W., Gasthaus, J., Stella, L., Wang, Y., and Januschowski, T. Deep state space models for time series forecasting. In Bengio, S., Wallach, H., Larochelle, H., Grauman, K., CesaBianchi, N., and Garnett, R. (eds.), Advances in Neural Information Processing Systems 31, pp. 7785–7794. Curran Associates, Inc., 2018. URL http://papers.nips.cc/paper/8004deepstatespacemodelsfortimeseriesforecasting.pdf.
 Salinas et al. (2019) Salinas, D., Flunkert, V., Gasthaus, J., and Januschowski, T. Deepar: Probabilistic forecasting with autoregressive recurrent networks. International Journal of Forecasting, 2019. ISSN 01692070. doi: https://doi.org/10.1016/j.ijforecast.2019.07.001. URL http://www.sciencedirect.com/science/article/pii/S0169207019301888.
 Schmidhuber (2015) Schmidhuber, J. Deep learning in neural networks: An overview. Neural Networks, 61:85 – 117, 2015. ISSN 08936080. doi: https://doi.org/10.1016/j.neunet.2014.09.003. URL http://www.sciencedirect.com/science/article/pii/S0893608014002135.
 Sen et al. (2019) Sen, R., Yu, H.F., and Dhillon, I. Think globally, act locally: A deep neural network approach to highdimensional time series forecasting. arXiv preprint arXiv:1905.03806, 2019.
 Srivastava et al. (2015) Srivastava, R. K., Greff, K., and Schmidhuber, J. Training very deep networks. In Cortes, C., Lawrence, N. D., Lee, D. D., Sugiyama, M., and Garnett, R. (eds.), Advances in Neural Information Processing Systems 28, pp. 2377–2385. Curran Associates, Inc., 2015. URL http://papers.nips.cc/paper/5850trainingverydeepnetworks.pdf.
 Sutskever et al. (2014) Sutskever, I., Vinyals, O., and Le, Q. V. Sequence to sequence learning with neural networks. In Ghahramani, Z., Welling, M., Cortes, C., Lawrence, N. D., and Weinberger, K. Q. (eds.), Advances in Neural Information Processing Systems 27, pp. 3104–3112. Curran Associates, Inc., 2014. URL http://papers.nips.cc/paper/5346sequencetosequencelearningwithneuralnetworks.pdf.
 UCI (2019) UCI. Uc irvine machine learning repository. https://archive.ics.uci.edu/ml/index.php, 2019. Last accessed: November 2019.
 Vaswani et al. (2017) Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., and Polosukhin, I. Attention is all you need. In Guyon, I., Luxburg, U. V., Bengio, S., Wallach, H., Fergus, R., Vishwanathan, S., and Garnett, R. (eds.), Advances in Neural Information Processing Systems 30, pp. 5998–6008. Curran Associates, Inc., 2017. URL http://papers.nips.cc/paper/7181attentionisallyouneed.pdf.
 Wang (2016) Wang, X. Pydlm, 2016. URL https://pydlm.github.io/index.html.
 Wen et al. (2017) Wen, R., Torkkola, K., Narayanaswamy, B., and Madeka, D. A multihorizon quantile recurrent forecaster. arXiv preprint arXiv:1711.11053, 2017.
 West & Harrison (1997) West, M. and Harrison, J. Bayesian Forecasting and Dynamic Models. Springer Series in Statistics. Springer New York, 1997. ISBN 9780387227771. doi: 10.1007/0387227776. URL https://doi.org/10.1007/0387227776.
 Wu et al. (2020) Wu, N., Green, B., Ben, X., and O’Banion, S. Deep transformer models for time series forecasting: The influenza prevalence case. arXiv preprint arXiv:2001.08317, 2020.
 Xing et al. (2019) Xing, F. Z., Cambria, E., and Zhang, Y. Sentimentaware volatility forecasting. KnowledgeBased Systems, 176:68 – 76, 2019. ISSN 09507051. doi: https://doi.org/10.1016/j.knosys.2019.03.029. URL http://www.sciencedirect.com/science/article/pii/S0950705119301546.
 Zhang et al. (2019) Zhang, Y., Thorburn, P., Xiang, W., and Fitch, P. SSIM a deep learning approach for recovering missing time series sensor data. IEEE Internet of Things Journal, 2019. ISSN 23274662. doi: 10.1109/JIOT.2019.2909038.
 Zhu & Laptev (2017) Zhu, L. and Laptev, N. Deep and confident prediction for time series at uber. In 2017 IEEE International Conference on Data Mining Workshops (ICDMW), pp. 103–110, Nov 2017. doi: 10.1109/ICDMW.2017.19.
Appendix A Additional Results
A.1 Forecast Plots
To illustrate forecasting ability, a plot of the forecasts for each model on the river flow dataset is presented in Figure 6. The uncertainty (standard deviation) is plotted for the FN, cFN, deepAR, DLM, and SARIMA models.
As illustrated in Figure 6, the river flow data has a varying amplitude and an irregular seasonal curve. The first peak has a lower amplitude than the second and third peaks. Furthermore, the shape of each seasonal cycle is unique. An accurate model is one that is able to adapt to these changes by learning the underlying dynamics that generate them.
Despite the variations over seasonal cycles, cFN2 provided highly accurate forecasts in which fine intricacies in the data were captured. In this example, the attention model also produces an accurate forecast; however, cFN2 outperforms the attention model on this dataset. The DeepAR and sequence-to-sequence models provided smoother forecasts. DeepAR has the advantage of providing uncertainty with its forecast.
The DLM and SARIMA models forecast a curve similar in form to the curve of the previous cycle. This is expected, as both models consider one cycle of past data to compute the forecast. Furthermore, these statistical models assume a linear trend, whereas the actual trend is nonlinear and appears more stochastic in nature. The nonlinear models are more capable of modelling these nonlinear dynamics.
A.2 Computational Complexity
Table 4: Average epoch training time for each model on each dataset, expressed as a multiple of the MLP's average epoch time.
Dataset | FN | cFN | FN2 | cFN2 | deepAR | Seq2Seq | Attention | TCN | MLP
Synth. | 12.62 | 36.84 | 5.21 | 32.48 | 39.38 | 36.68 | 69.60 | 5.04 | 1.00
Weath. | 6.55 | 18.88 | 3.38 | 18.22 | 23.39 | 22.14 | 36.93 | 2.28 | 1.00
Elect. | 15.07 | 77.90 | 6.09 | 74.79 | 48.26 | 45.60 | 101.55 | 13.96 | 1.00
River | 7.05 | 18.63 | 3.67 | 17.52 | 22.61 | 21.18 | 35.83 | 2.85 | 1.00
Traff. | 14.06 | 60.68 | 5.56 | 56.94 | 44.38 | 41.54 | 86.20 | 15.86 | 1.00
Lake | 7.52 | 18.41 | 3.68 | 16.42 | 23.65 | 22.52 | 37.64 | 2.62 | 1.00
DO | 17.41 | 46.64 | 6.61 | 39.28 | 47.29 | 44.70 | 88.71 | 8.96 | 1.00
pH | 17.44 | 45.24 | 6.69 | 42.48 | 46.21 | 44.79 | 89.40 | 9.08 | 1.00
Temp. | 16.97 | 48.62 | 6.57 | 41.41 | 47.27 | 44.39 | 88.32 | 8.94 | 1.00
Ozone | 7.89 | 17.45 | 3.91 | 15.68 | 23.50 | 21.93 | 36.53 | 2.87 | 1.00
Median | 13.34 | 41.04 | 5.38 | 35.88 | 41.88 | 39.11 | 77.90 | 6.99 | 1.00
To demonstrate the computational complexity of the models, their training times were considered. Each model was trained on each dataset for 10 epochs and these epoch times were logged. The results are presented in Table 4. To provide some independence from the platform on which the models were trained, the results are presented as an average epoch time of the particular model divided by the average epoch time of the MLP. The results are thus represented as a multiple of the epoch duration of the MLP.
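The normalisation described above can be sketched as follows. This is a hypothetical sketch of the methodology, not the paper's code; `train_epoch` is a stand-in for one epoch of training of a given model.

```python
import time

# Measure mean epoch time per model and normalise by the MLP baseline,
# so results are reported as a multiple of the MLP's epoch duration.

def mean_epoch_time(train_epoch, n_epochs=10):
    times = []
    for _ in range(n_epochs):
        start = time.perf_counter()
        train_epoch()  # stand-in for one training epoch
        times.append(time.perf_counter() - start)
    return sum(times) / len(times)

def relative_times(models, baseline="MLP"):
    """models: dict mapping model name -> zero-argument epoch function."""
    means = {name: mean_epoch_time(fn) for name, fn in models.items()}
    return {name: t / means[baseline] for name, t in means.items()}
```

By construction, the baseline model always reports a relative time of 1.00, as in the MLP column of Table 4.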
The MLP had the simplest architecture and thus provided the shortest epoch times. Second to the MLP are the FN2 and the TCN. These models required 5 to 7 times the amount of time to train an epoch compared to the MLP. Note that FN2 is ranked as a top model and also has the second lowest median computation time, resulting in a highly attractive model.
Using a mixture density output on ForecastNet significantly increased the epoch time. This is evident when the duration of FN is compared to that of FN2, and the duration of cFN to that of cFN2. However, even with this increased computational complexity, the median durations of cFN and cFN2 were less than those of the sequence-to-sequence, attention, and deepAR models. The attention model in particular had a high epoch duration due to its complex architecture. These results provide empirical evidence that ForecastNet has reduced computational complexity compared with the benchmark models.
A.3 Vanishing Gradients
To demonstrate that ForecastNet is able to mitigate vanishing gradients, it was compared with a deep MLP on the synthetic dataset. The deep MLP was selected for this purpose as it has no guard against vanishing gradients. The MLP was configured with 40 inputs, 20 hidden layers, and 20 outputs. Similarly, ForecastNet was configured with 20 linear outputs and a single hidden layer in each cell, giving a total of 20 hidden layers, 20 outputs, and 40 inputs, as for the deep MLP. Thus, the primary difference between the models is that ForecastNet uses interleaved outputs, whereas the deep MLP's outputs are all located at the output layer. Both models used the sigmoid activation function with Xavier normal initialisation in the hidden layers. The models were trained over 10 epochs with a learning rate of 10^{-4}.
The mean absolute gradients of the weights in the first and last hidden layers are plotted in the top graph of Figure 7, and the training losses in the bottom graph. The gradient in the first layer of the MLP remained close to zero over all 10 epochs, indicating a vanishing gradient problem. Furthermore, as the loss plot shows, the MLP did not converge to an optimal solution. In comparison, the gradients of both layers in ForecastNet were non-zero, indicating that ForecastNet mitigated vanishing gradients. Furthermore, ForecastNet converged to a better solution than the deep MLP.
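A toy numeric sketch of why this happens (not the reported experiment): with sigmoid activations, the gradient reaching layer 1 of an output-at-the-end chain contains a product of L sigmoid derivatives, each at most 0.25, whereas an interleaved output contributes a loss term attached directly to the early layer with no such product (the k = 0 term of the sum derived in Appendix C).

```python
import math

def sigmoid_deriv(z):
    # Derivative of the sigmoid, maximised at z = 0 with value 0.25.
    s = 1.0 / (1.0 + math.exp(-z))
    return s * (1.0 - s)

L = 20                          # depth matching the experiment above
d = sigmoid_deriv(0.0)          # 0.25, the best case for each factor

deep_grad = d ** L              # end-of-chain loss: product of L factors
interleaved_grad = d            # k = 0 interleaved term: no product

print(deep_grad)        # ~9.1e-13: effectively vanished
print(interleaved_grad) # 0.25
```

Even in this best case for the deep chain, the gradient factor shrinks to roughly 1e-12 after 20 layers, consistent with the stalled first-layer gradient observed for the deep MLP.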
Appendix B Additional Model Details
Appendix C Detailed Derivation of the Interleaved Output Chain Rule
The outputs in ForecastNet are interleaved between hidden cells. Consider an L-layered ForecastNet with a single hidden neuron in each hidden cell and a single linear output neuron, as illustrated in Figure 10. The inputs are not shown as they do not contribute to the backpropagated error across hidden cells. For some layer l in this network, W^{[l]} is the weight matrix, \bar{b}^{[l]} is the bias vector, \mathbf{a}^{[l]} is the output vector, and \mathbf{z}^{[l]}=W^{[l]T}\mathbf{a}^{[l-1]}+\bar{b}^{[l]}. Using the chain rule of calculus, the derivative of the loss function \mathcal{L} with respect to the weights W^{[l]} at layer l is given by
\frac{\partial\mathcal{L}}{\partial W^{[l]}} = \frac{\partial\mathcal{L}}{\partial\mathbf{a}^{[l]}}\frac{\partial\mathbf{a}^{[l]}}{\partial W^{[l]}}
Layer l links to layers l+1 and l+2. Thus, the derivative with respect to \mathbf{a}^{[l]} is expanded as follows
\frac{\partial\mathcal{L}}{\partial W^{[l]}} = \left(\frac{\partial\mathcal{L}}{\partial\mathbf{z}^{[l+1]}}\frac{\partial\mathbf{z}^{[l+1]}}{\partial\mathbf{a}^{[l]}} + \frac{\partial\mathcal{L}}{\partial\mathbf{z}^{[l+2]}}\frac{\partial\mathbf{z}^{[l+2]}}{\partial\mathbf{a}^{[l]}}\right)\frac{\partial\mathbf{a}^{[l]}}{\partial W^{[l]}}
= \frac{\partial\mathcal{L}}{\partial\mathbf{z}^{[l+1]}}\frac{\partial\mathbf{z}^{[l+1]}}{\partial\mathbf{a}^{[l]}}\frac{\partial\mathbf{a}^{[l]}}{\partial W^{[l]}} + \frac{\partial\mathcal{L}}{\partial\mathbf{z}^{[l+2]}}\frac{\partial\mathbf{z}^{[l+2]}}{\partial\mathbf{a}^{[l]}}\frac{\partial\mathbf{a}^{[l]}}{\partial W^{[l]}}
The term \partial\mathcal{L}/\partial\mathbf{z}^{[l+1]} is computed with respect to a target value, as layer l+1 is an output layer. Layer l+2 is a hidden layer. The term \partial\mathcal{L}/\partial\mathbf{a}^{[l+2]} is thus expanded as follows.
\frac{\partial\mathcal{L}}{\partial W^{[l]}} = \frac{\partial\mathcal{L}}{\partial\mathbf{z}^{[l+1]}}\frac{\partial\mathbf{z}^{[l+1]}}{\partial\mathbf{a}^{[l]}}\frac{\partial\mathbf{a}^{[l]}}{\partial W^{[l]}} + \left(\frac{\partial\mathcal{L}}{\partial\mathbf{z}^{[l+3]}}\frac{\partial\mathbf{z}^{[l+3]}}{\partial\mathbf{a}^{[l+2]}} + \frac{\partial\mathcal{L}}{\partial\mathbf{z}^{[l+4]}}\frac{\partial\mathbf{z}^{[l+4]}}{\partial\mathbf{a}^{[l+2]}}\right)\frac{\partial\mathbf{z}^{[l+2]}}{\partial\mathbf{a}^{[l]}}\frac{\partial\mathbf{a}^{[l]}}{\partial W^{[l]}}
which expands into the sum
\frac{\partial\mathcal{L}}{\partial W^{[l]}} = \frac{\partial\mathcal{L}}{\partial\mathbf{z}^{[l+1]}}\frac{\partial\mathbf{z}^{[l+1]}}{\partial\mathbf{a}^{[l]}}\frac{\partial\mathbf{a}^{[l]}}{\partial W^{[l]}} + \frac{\partial\mathcal{L}}{\partial\mathbf{z}^{[l+3]}}\frac{\partial\mathbf{z}^{[l+3]}}{\partial\mathbf{a}^{[l+2]}}\frac{\partial\mathbf{z}^{[l+2]}}{\partial\mathbf{a}^{[l]}}\frac{\partial\mathbf{a}^{[l]}}{\partial W^{[l]}} + \frac{\partial\mathcal{L}}{\partial\mathbf{z}^{[l+4]}}\frac{\partial\mathbf{z}^{[l+4]}}{\partial\mathbf{a}^{[l+2]}}\frac{\partial\mathbf{z}^{[l+2]}}{\partial\mathbf{a}^{[l]}}\frac{\partial\mathbf{a}^{[l]}}{\partial W^{[l]}}
Similarly, layer l+3 is an output layer and layer l+4 is a hidden layer. The term \partial\mathcal{L}/\partial\mathbf{a}^{[l+4]} is expanded as follows.
\frac{\partial\mathcal{L}}{\partial W^{[l]}} = \frac{\partial\mathcal{L}}{\partial\mathbf{z}^{[l+1]}}\frac{\partial\mathbf{z}^{[l+1]}}{\partial\mathbf{a}^{[l]}}\frac{\partial\mathbf{a}^{[l]}}{\partial W^{[l]}} + \frac{\partial\mathcal{L}}{\partial\mathbf{z}^{[l+3]}}\frac{\partial\mathbf{z}^{[l+3]}}{\partial\mathbf{a}^{[l+2]}}\frac{\partial\mathbf{z}^{[l+2]}}{\partial\mathbf{a}^{[l]}}\frac{\partial\mathbf{a}^{[l]}}{\partial W^{[l]}} + \left(\frac{\partial\mathcal{L}}{\partial\mathbf{z}^{[l+5]}}\frac{\partial\mathbf{z}^{[l+5]}}{\partial\mathbf{a}^{[l+4]}} + \frac{\partial\mathcal{L}}{\partial\mathbf{z}^{[l+6]}}\frac{\partial\mathbf{z}^{[l+6]}}{\partial\mathbf{a}^{[l+4]}}\right)\frac{\partial\mathbf{z}^{[l+4]}}{\partial\mathbf{a}^{[l+2]}}\frac{\partial\mathbf{z}^{[l+2]}}{\partial\mathbf{a}^{[l]}}\frac{\partial\mathbf{a}^{[l]}}{\partial W^{[l]}}
which expands into the sum
\frac{\partial\mathcal{L}}{\partial W^{[l]}} = \frac{\partial\mathcal{L}}{\partial\mathbf{z}^{[l+1]}}\frac{\partial\mathbf{z}^{[l+1]}}{\partial\mathbf{a}^{[l]}}\frac{\partial\mathbf{a}^{[l]}}{\partial W^{[l]}} + \frac{\partial\mathcal{L}}{\partial\mathbf{z}^{[l+3]}}\frac{\partial\mathbf{z}^{[l+3]}}{\partial\mathbf{a}^{[l+2]}}\frac{\partial\mathbf{z}^{[l+2]}}{\partial\mathbf{a}^{[l]}}\frac{\partial\mathbf{a}^{[l]}}{\partial W^{[l]}} + \frac{\partial\mathcal{L}}{\partial\mathbf{z}^{[l+5]}}\frac{\partial\mathbf{z}^{[l+5]}}{\partial\mathbf{a}^{[l+4]}}\frac{\partial\mathbf{z}^{[l+4]}}{\partial\mathbf{a}^{[l+2]}}\frac{\partial\mathbf{z}^{[l+2]}}{\partial\mathbf{a}^{[l]}}\frac{\partial\mathbf{a}^{[l]}}{\partial W^{[l]}} + \frac{\partial\mathcal{L}}{\partial\mathbf{z}^{[l+6]}}\frac{\partial\mathbf{z}^{[l+6]}}{\partial\mathbf{a}^{[l+4]}}\frac{\partial\mathbf{z}^{[l+4]}}{\partial\mathbf{a}^{[l+2]}}\frac{\partial\mathbf{z}^{[l+2]}}{\partial\mathbf{a}^{[l]}}\frac{\partial\mathbf{a}^{[l]}}{\partial W^{[l]}}
This expansion process is continued until the final output layer L is reached. The final result is
\frac{\partial\mathcal{L}}{\partial W^{[l]}} = \sum_{k=0}^{\frac{L-1-l}{2}}\frac{\partial\mathcal{L}}{\partial\mathbf{z}^{[l+2k+1]}}\frac{\partial\mathbf{z}^{[l+2k+1]}}{\partial\mathbf{a}^{[l+2k]}}\,\Psi_{k}\,\frac{\partial\mathbf{a}^{[l]}}{\partial W^{[l]}}
where
\Psi_{k}=\begin{cases}1 & k=0\\ \displaystyle\prod_{j=1}^{k}\frac{\partial\mathbf{z}^{[l+2j]}}{\partial\mathbf{a}^{[l+2(j-1)]}} & k>0\end{cases}
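The structure of this result can be checked numerically with a scalar sketch (hypothetical values, not the paper's network): each interleaved output contributes one term to the gradient at the first hidden layer, and the k-th term carries a product of k cross-cell factors playing the role of \Psi_k. The sum of these terms is verified against a finite difference.

```python
import math

# Scalar interleaved network: h_j = sigmoid(w_j*h_{j-1} + b), output
# o_j = v*h_j + c at every cell, and L = sum_j 0.5*(o_j - y_j)^2.

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def loss(w0, x=0.7, K=4):
    w = [w0] + [0.5] * (K - 1)   # hidden-cell weights; w[0] is under test
    b, v, c = 0.1, 1.3, -0.2     # shared scalars for brevity
    y = [0.4, 0.6, 0.5, 0.7]     # one target per interleaved output
    h, total, hs = x, 0.0, []
    for j in range(K):
        h = sigmoid(w[j] * h + b)
        hs.append(h)
        total += 0.5 * (v * h + c - y[j]) ** 2
    return total, hs, w, b, v, c, x, y

def analytic_grad_w0(w0):
    # Gradient of L w.r.t. w0 as a sum over outputs k, mirroring the
    # derivation above: the k-th term carries a product of k factors (psi).
    total, hs, w, b, v, c, x, y = loss(w0)
    grad, psi = 0.0, 1.0
    dh0_dw0 = hs[0] * (1 - hs[0]) * x
    for k in range(len(hs)):
        if k > 0:
            psi *= w[k] * hs[k] * (1 - hs[k])  # one more cross-cell factor
        grad += (v * hs[k] + c - y[k]) * v * psi * dh0_dw0
    return grad

# Compare with a central finite difference.
w0, eps = 0.9, 1e-6
fd = (loss(w0 + eps)[0] - loss(w0 - eps)[0]) / (2 * eps)
print(abs(analytic_grad_w0(w0) - fd) < 1e-8)  # True
```

Note that the k = 0 term has no product attached, which is the mechanism by which interleaved outputs mitigate vanishing gradients at early layers.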