ForecastNet: A Time-Variant Deep Feed-Forward Neural Network Architecture for Multi-Step-Ahead Time-Series Forecasting

ForecastNet: A Time-Variant Deep Feed-Forward Neural Network Architecture for Multi-Step-Ahead Time-Series Forecasting


Recurrent and convolutional neural networks are the most common architectures used for time series forecasting in deep learning literature. These networks use parameter sharing by repeating a set of fixed architectures with fixed parameters over time or space. The result is that the overall architecture is time-invariant (shift-invariant in the spatial domain) or stationary. We argue that time-invariance can reduce the capacity to perform multi-step-ahead forecasting, where modelling the dynamics at a range of scales and resolutions is required. We propose ForecastNet which uses a deep feed-forward architecture to provide a time-variant model. An additional novelty of ForecastNet is interleaved outputs, which we show assist in mitigating vanishing gradients. ForecastNet is demonstrated to outperform statistical and deep learning benchmark models on several datasets.


1 Introduction

Multi-step-ahead forecasting involves the prediction of some variable several time-steps into the future, given past and present data. Over the set of time-steps, various time-series components such as complex trends, seasonality, and noise may be observed at a range of scales or resolutions. Increasing the number of steps ahead that are forecast increases the range of scales that are required to be modelled. An accurate forecasting method is required to model all these components over the complete range of scales.

There is significant interest in the recurrent neural network (RNN) and sequence-to-sequence models for forecasting Kuznetsov & Mariet (2018). Several studies have shown success with variants of these models Zhu & Laptev (2017); Laptev et al. (2017); Wen et al. (2017); Maddix et al. (2018); Salinas et al. (2019); Xing et al. (2019); Hewamalage et al. (2019); Nguyen et al. (2019); Du et al. (2020). The recurrence in the RNN produces a set of cells, each with fixed architecture, that are replicated over time. This replication results in a time-invariant model. Similarly, parameter sharing and convolution in the convolutional neural network (CNN) result in a shift-invariant model in the spatial domain; which is equivalent to time-invariance in the time domain. Our conjecture is that time-invariance can reduce the capacity for the model to learn various scales of dynamics across multiple steps in time; especially in multi-step-ahead forecasts.

To address this, we propose ForecastNet, which is a deep feed-forward architecture with interleaved outputs. Between interleaved outputs are network structures that differ in architecture and parameters over time. The result is a model that varies across time. We demonstrate four variations of ForecastNet to highlight its flexibility. These four variations are compared with state-of-the-art benchmark models. We show that ForecastNet is accurate and robust in terms of performance variation.

The contributions of this study are: (1) ForecastNet: a model for multi-step-ahead forecasting; (2) we address the time-invariance problem which (to our knowledge) has not been considered in deep learning time-series forecasting literature before; and (3) provide a comparison of several state-of-the-art models on seasonal time-series datasets.

2 Motivations and Related Work

2.1 Recurrent Neural Networks

The RNN comprises a set of cell structures with parameters that are replicated over time. These replications (cells and parameters) remain constant over time. The replication has its benefits (such as parameter sharing) but it can reduce capacity to model the complex dependencies over time. For example, a model can adapt to long-term dynamics more easily if its parameters are able to change over time.

Another challenge with RNNs is that learning long sequences can be difficult due to complex dependencies over time and vanishing gradients (Chang et al., 2017). Vanishing gradients in the RNN have been addressed in the LSTM (Hochreiter & Schmidhuber, 1997; Gers et al., 1999) and the Gated Recurrent Unit (GRU) (Cho et al., 2014b, a), by introducing gate-like structures. These LSTM and GRU cell structures are however recurrent and are constant over time.

ForecastNet comprises a set of parameters and an architecture that changes over the sequence of inputs and forecast outputs. The result is that ForecastNet is not a time-invariant model. Furthermore, ForecastNet mitigates vanishing gradient problems by using shortcut-connections and by interleaving outputs between hidden cells. The shortcut-connection approach was introduced in ResNet (He et al., 2015a) and Highway Network (Srivastava et al., 2015). It has been shown to be highly effective in addressing vanishing and shattered gradient problems (Balduzzi et al., 2017). Additionally, shortcut-connections allow for a much deeper network (He et al., 2015a).

2.2 Sequence Models

State-of-the-art deep sequence models include multiple RNNs linked in various configurations (Zhang et al., 2019). A prominent configuration is the sequence-to-sequence (encoder-decoder) model (Sutskever et al., 2014). The sequence-to-sequence model sequentially links two RNNs (an encoder and a decoder) through a fixed size vector, such as the last encoder cell state. This can be limiting as it forms a potential bottleneck between the encoder and decoder. Furthermore, earlier inputs have to pass through several layers to reach the decoder.

The attention model (Bahdanau et al., 2015) addresses the sequence-to-sequence model problem by adding a network structure from all the encoder cells to each decoder cell. This structure, called an attention mechanism, ascribes relevance to the particular encoder cell. The attention mechanism is not necessarily time-invariant, so it may help reduce the time-invariance properties of the overall model. However, the time series dynamics are primarily modelled with the RNN encoder and decoder, which are time invariant.

Unlike the sequence-to-sequence and attention models, ForecastNet does not have a separate encoder and decoder. Challenges with linking these entities are thus removed.

2.3 Convolutional Neural Networks

The convolutional neural network (CNN) has shown promising results in modelling sequential data (Binkowski et al., 2018; Mehrkanoon, 2019; Sen et al., 2019). The CNN uses convolution and parameter sharing to achieve shift-invariance (or translation-invariance) LeCun et al. (1995). This is a key feature in image processing, however, in time-series applications, it translates to time-invariance.

WaveNet Oord et al. (2016) is a seminal CNN model which uses multiple layers of dilated causal convolutions for raw audio synthesis. This model has been generalised for broader sequence modelling problems and this generalisation is referred to as the Temporal Convolutional Network (TCN) Bai et al. (2018). We select this model as a benchmark for comparing ForecastNet. Furthermore, we demonstrate how ForecastNet is able to accommodate convolutional layers in hidden cells.

2.4 The Transformer and Self-Attention

A model that has successfully departed from the RNN and CNN architectures is the transformer model Vaswani et al. (2017). This model comprises a sequence of encoders and decoders. The encoders include a self-attention mechanism and a position-wise feed-forward neural network. The decoder has the same architecture as the encoder, but with an additional attention mechanism over the encoding.

Though the transformer has been highly successful in natural language processing, it has limitations in time series analysis. The first limitation is that it does not assume any sequential structure of the inputs Vaswani et al. (2017). Positional encoding in the form of a sinusoid is injected into the inputs to provide information on the sequence order. Temporal structure is key in time series modelling and is what time-series models are usually designed to model. Li et al. (2019) propose including causal convolution to model local context. Convolutions are however time-invariant.

A second limitation of the transformer is that (to promote parallelisation) a large portion of the processing operates over the dimension of the input embedding rather than over time. For example, the multiple attention heads and the position-wise feed-forward neural networks operate over the embedding dimension. The transformer is thus not designed to operate on low dimensional time series signals such as univariate time-series as considered in this study. Wu et al. (2020) use a feed-forward neural network to increase the dimensionality of the input space. We however did not find such an approach effective on the datasets used in this study.

ForecastNet addresses both transformer model limitations. It is specifically designed to model the temporal structure of the inputs and it is not limited to multivariate time-series.

2.5 Uncertainty Representation

Realistically, forecasting methods should provide some form of uncertainty or confidence interval around estimates Makridakis et al. (2018b). Neural networks do not naturally provide this capability. Several approaches to incorporate uncertainty have however been proposed. These include: empirical approaches (such as bootstrapped residuals Hyndman & Athanasopoulos (2018), or Monte Carlo dropout Zhu & Laptev (2017); ensembles (such as Lakshminarayanan et al. (2017)); variational inference-based models (such as Bayer & Osendorfer (2014); Krishnan et al. (2015); Chung et al. (2015); Fraccaro et al. (2016); Doerr et al. (2018); Rangapuram et al. (2018)); or predictive distributions (such as Salinas et al. (2019)). The predictive distribution approach (based on mixture density networks (Bishop, 1994; Graves, 2013)) is effective as the distribution parameters are learned directly through gradient descent. ForecastNet thus adopts this mixture density approach.

3 ForecastNet Architecture

As illustrated in Figure 1, ForecastNet is a feed-forward neural network comprising a set of n_{I} inputs, a set of n_{O} outputs, and a set of sequentially connected hidden cells (a term borrowed from RNN literature). A detailed diagram of a simple form of ForecastNet is presented in Figure 2.












Figure 1: General ForecastNet structure to provide a forecast \hat{x}_{t+1:t+4} given \mathbf{x}=x_{t-n_{I}+1:t} as inputs (circles). A hidden cell (squares) comprises some form of feed forward neural network structure. Each hidden cell and output is illustrated with a different shade to indicate heterogeneity over the sequence.



















Figure 2: An example of a simple form of ForecastNet with a single densely connected hidden layer in the hidden cell (more complex structures can be chosen for the hidden cell). Outputs take the form of a normal or Gaussian distribution. Note that though the architecture at each sequence step is identical, the weights will differ (i.e. they are not recurrent).

3.1 Inputs

ForecastNet’s inputs \mathbf{x}=x_{t-n_{I}+1:t} are a set of lagged values of the dependent variable. The dependent variable can be univariate or multivariate. The set of inputs are presented to every hidden cell in the network as illustrated in Figure 1 and Figure 2.

3.2 Hidden Cells

A hidden cell represents some form of feed forward neural network such as a multi-layered perceptron (MLP), a CNN, or self-attention. Each hidden cell can be heterogeneous in terms of architecture. As a feed-forward network, even if the architecture of each hidden cell is identical (as used in this study), each cell is provided with its own unique set of parameters. This is in contrast to RNNs where the RNN cell architecture and parameters are duplicated at each sequence step. ForecastNet is thus sequential, but it is not recurrent.

The hidden cells are intended to model the time-series dynamics. Links between hidden cells model local dynamics and cells together model longer-term dynamics.

3.3 Outputs

Each output in ForecastNet provides a forecast one-step into the future. The deeper the network, the more outputs there are. ForecastNet thus naturally scales in complexity with increased forecast reach.

Using the idea of mixture density networks (Salinas et al., 2019; Bishop, 1994), each output models a probability distribution. In this study the normal distribution is used as illustrated in Figure 2. The mean and standard deviation of the normal distribution at output layer l are given by

\displaystyle\mu^{[l]}=W_{\mu}^{[l]T}\mathbf{a}^{[l-1]}+\mathbf{b}^{[l]}_{\mu} (1)
\displaystyle\sigma^{[l]}=\log(1+\exp(W_{\sigma}^{[l]T}\mathbf{a}^{[l-1]}+% \mathbf{b}^{[l]}_{\sigma})), (2)

where \mathbf{a}^{[l-1]} are the outputs of the previous hidden cell, W_{\mu}^{[l]T} and \mathbf{b}^{[l]}_{\mu} are the weights and biases of the mean’s layer, and W_{\sigma}^{[l]T} and \mathbf{b}^{[l]}_{\sigma} are the weights and biases of the standard deviation’s layer.

The forecast associated with layer l is produced by sampling from \mathcal{N}(\mu^{[l]},\sigma^{[l]}). During forecasting, the sampled forecast is fed to the next layer. During training, the forecast is fully observable through the training data. The network is trained with gradient descent to optimise the normal distribution log-likelihood function (Salinas et al., 2019).

The mixture density output can be replaced with a linear output and the squared error loss function. No uncertainty will be available in this form, however the model be required to optimise over two parameters. This form is demonstrated in this study as one of the variations of ForecastNet.

4 ForecastNet Properties

4.1 Time-Variance

A time-invariant system is defined as a system for which a time shift of the input sequence causes a corresponding shift in the output sequence Oppenheim & Schafer (2009).

Theorem 1.

ForecastNet with inputs \mathbf{x}_{t}, hidden states \mathbf{h}_{t}, and outputs \mathbf{y}_{t} given by

\displaystyle\mathbf{y}_{t}=f_{t}(\mathbf{h}_{t}) (3)
\displaystyle\mathbf{h}_{t}=g_{t}(\mathbf{x}_{t},\mathbf{h}_{t-1},\mathbf{y}_{% t-1}) (4)

is not time invariant.


Given two inputs, \mathbf{x}_{t} and \mathbf{x^{\prime}}_{t}, two outputs \mathbf{y}_{t} and \mathbf{y^{\prime}}_{t}, and two hidden states \mathbf{h}_{t} and \mathbf{h^{\prime}}_{t}. Let \mathbf{x^{\prime}}_{t} be \mathbf{x}_{t}, shifted by t_{0} such that \mathbf{x^{\prime}}_{t}=\mathbf{x}_{t-t_{0}}. Similarly, let \mathbf{h^{\prime}}_{t}=\mathbf{h}_{t-t_{0}}. Time-invariance requires that \mathbf{y}_{t-t_{0}}=\mathbf{y^{\prime}}_{t}. From (3) and (4), \mathbf{y}_{t-t_{0}} is given by

\displaystyle\mathbf{y}_{t-t_{0}}=f_{t-t_{0}}(g_{t-t_{0}}(\mathbf{x}_{t-t_{0}}% ,\mathbf{h}_{t-t_{0}-1},\mathbf{y}_{t-t_{0}-1}))

and \mathbf{y^{\prime}}_{t} is given by

\displaystyle\mathbf{y^{\prime}}_{t} \displaystyle=f_{t}(g_{t}(\mathbf{x^{\prime}}_{t},\mathbf{h^{\prime}}_{t-1},% \mathbf{y^{\prime}}_{t-1}))
\displaystyle=f_{t}(g_{t}(\mathbf{x}_{t-t_{0}},\mathbf{h}_{t-t_{0}-1},\mathbf{% y^{\prime}}_{t-1})).

Thus, \mathbf{y}_{t-t_{0}}\neq\mathbf{y^{\prime}}_{t} and ForecastNet is not time-invariant. ∎

ForecastNet is not time-invariant as its parameters (and optionally architecture) vary in time (over the sequence of inputs and outputs). This is compared with the RNN that has fixed parameters which are reused each time step, resulting in a time-invariant model.

Theorem 2.

A RNN with inputs \mathbf{x}_{t}, hidden states \mathbf{h}_{t}, and outputs \mathbf{y}_{t} given by

\displaystyle\mathbf{y}_{t}=f(\mathbf{h}_{t}) (5)
\displaystyle\mathbf{h}_{t}=g(\mathbf{x}_{t},\mathbf{h}_{t-1}) (6)

is time-invariant when \mathbf{h}_{t} is initialised to some initial value \mathbf{h}_{0} immediately before the first input is received.


Given two inputs, \mathbf{x}_{t} and \mathbf{x^{\prime}}_{t}, two outputs \mathbf{y}_{t} and \mathbf{y^{\prime}}_{t}, and two hidden states \mathbf{h}_{t} and \mathbf{h^{\prime}}_{t}. Let \mathbf{x^{\prime}}_{t} be \mathbf{x}_{t}, shifted by t_{0} such that \mathbf{x^{\prime}}_{t}=\mathbf{x}_{t-t_{0}}. Similarly, let \mathbf{h^{\prime}}_{t}=\mathbf{h}_{t-t_{0}}. Time-invariance requires that \mathbf{y}_{t-t_{0}}=\mathbf{y^{\prime}}_{t}. From (5) and (6), \mathbf{y}_{t-t_{0}} is given by

\displaystyle\mathbf{y}_{t-t_{0}}=f(g(\mathbf{x}_{t-t_{0}},\mathbf{h}_{t-t_{0}% -1}))

and \mathbf{y^{\prime}}_{t} is given by

\displaystyle\mathbf{y^{\prime}}_{t} \displaystyle=f(g(\mathbf{x^{\prime}}_{t},\mathbf{h^{\prime}}_{t-1}))

Thus \mathbf{y}_{t-t_{0}}=\mathbf{y^{\prime}}_{t} and the RNN is time-invariant. ∎

Note that a requirement for time-invariance in the RNN is that \mathbf{h}_{t} is initialised to \mathbf{h}_{0} before the first input is received. If \mathbf{h}_{t} were initialised several time-steps before the first input was received, a set of zeros (padding) would have to be provided as inputs until the first \mathbf{x}_{t} were received. This could result in \mathbf{h^{\prime}}_{t}\neq\mathbf{h}_{t-t_{0}}.

4.2 Interleaved Outputs

The vanishing/exploding gradient problem has been referred to as “deep learning’s fundamental problem” (Schmidhuber, 2015). The problem stems from the repeated application of the chain rule of calculus in computing the gradient (Caterini & Chang, 2018). The chain rule produces a chain of factors. Especially long chains associated with deep networks can either vanish to zero or diverge (explode). By interleaving outputs between hidden layers in ForecastNet, the chain is broken into a sum of terms. This sum of terms is more stable than a product of factors. The intuition is that interleaved outputs provide localised information to inner hidden layers of the network during training. This decreases the effective depth of the network.

To show this, consider an L-layered ForecastNet with a single hidden neuron and a single output neuron. A summation results when the output of hidden layer is split between the next hidden layer and the next output. The repeated application of the chain rule results in the following expression

\displaystyle\frac{\partial\mathcal{L}}{\partial W^{[l]}} \displaystyle=\sum_{k=0}^{\frac{L-1-l}{2}}\frac{\partial\mathcal{L}}{\partial% \mathbf{z}^{[l+2k+1]}}\frac{\partial\mathbf{z}^{[l+2k+1]}}{\partial\mathbf{a}^% {[l+2k]}}\Psi_{k}\frac{\partial\mathbf{a}^{[l]}}{\partial W^{[l]}} (7)


\displaystyle\Psi_{k}=\begin{cases}1&k=0\\ \displaystyle\prod_{j=1}^{k}\frac{\partial\mathbf{z}^{[l+2j]}}{\partial\mathbf% {a}^{[l+2(j-1)]}}&k>0\end{cases} (8)










Figure 3: Figure for the derivation of equation (C). The top row of nodes are hidden neurons. The bottom row of nodes are output neurons. Inputs are not shown.

Consider an L-layered ForecastNet with a single hidden neuron in the hidden cell and a single linear output neuron as illustrated in Figure 3 (the inputs are not shown as they do not contribute to the error backpropagated through hidden cells). For some layer l in this network, W^{[l]} is the weight matrix, \bar{b}^{[l]} is the bias vector, \mathbf{a}^{[l]} is the output vector, and \mathbf{z}^{[l]}=W^{[l]T}\mathbf{a}^{[l-1]}+\bar{b}^{[l]}. Using the chain rule of calculus, the derivative of the loss function \mathcal{L} with respect to the weights W^{[l]} at layer l is given by

\displaystyle\frac{\partial\mathcal{L}}{\partial W^{[l]}} \displaystyle=\frac{\partial\mathcal{L}}{\partial\mathbf{a}^{[l]}}\frac{% \partial\mathbf{a}^{[l]}}{\partial W^{[l]}}

Hidden layer l links to layers l+1 and l+2. Thus, the derivative with respect to \mathbf{a}^{[l]} is expanded as follows

\displaystyle\frac{\partial\mathcal{L}}{\partial W^{[l]}} \displaystyle=\left(\frac{\partial\mathcal{L}}{\partial\mathbf{z}^{[l+1]}}% \frac{\partial\mathbf{z}^{[l+1]}}{\partial\mathbf{a}^{[l]}}+\frac{\partial% \mathcal{L}}{\partial\mathbf{z}^{[l+2]}}\frac{\partial\mathbf{z}^{[l+2]}}{% \partial\mathbf{a}^{[l]}}\right)\frac{\partial\mathbf{a}^{[l]}}{\partial W^{[l% ]}}
\displaystyle=\frac{\partial\mathcal{L}}{\partial\mathbf{z}^{[l+1]}}\frac{% \partial\mathbf{z}^{[l+1]}}{\partial\mathbf{a}^{[l]}}\frac{\partial\mathbf{a}^% {[l]}}{\partial W^{[l]}}+\frac{\partial\mathcal{L}}{\partial\mathbf{z}^{[l+2]}% }\frac{\partial\mathbf{z}^{[l+2]}}{\partial\mathbf{a}^{[l]}}\frac{\partial% \mathbf{a}^{[l]}}{\partial W^{[l]}}

Layer l+2 links to layers l+3 and l+4 and \nicefrac{{\partial\mathcal{L}}}{{\partial\mathbf{a}^{[l+2]}}} can be expanded in a similar manner to the above. This expansion process is continued until the final output layer L is reached, which produces (C). ∎

Similarly, consider a chain of neurons in a standard L-layered feed-forward neural network with loss function \mathcal{L}. Repeated application of the chain rule of calculus results in the following product

\displaystyle\frac{\partial\mathcal{L}}{\partial W^{[l]}}= \displaystyle\frac{\partial\mathcal{L}}{\partial\mathbf{a}^{[L]}}\frac{% \partial\mathbf{a}^{[L]}}{\partial\mathbf{z}^{[L]}}\frac{\partial\mathbf{z}^{[% L]}}{\partial\mathbf{a}^{[L-1]}}\frac{\partial\mathbf{a}^{[L-1]}}{\partial% \mathbf{z}^{[L-1]}}\cdots
\displaystyle\frac{\partial\mathbf{z}^{[l+2]}}{\partial\mathbf{a}^{[l+1]}}% \frac{\partial\mathbf{a}^{[l+1]}}{\partial\mathbf{z}^{[l+1]}}\frac{\partial% \mathbf{z}^{[l+1]}}{\partial\mathbf{a}^{[l]}}\frac{\partial\mathbf{a}^{[l]}}{% \partial\mathbf{z}^{[l]}}\frac{\partial\mathbf{z}^{[l]}}{\partial W^{[l]}} (9)

This equation can be expressed in the form \lambda^{L} (Caterini & Chang, 2018; Goodfellow et al., 2016). If |\lambda|<1, the term vanishes towards 0 as L grows. If |\lambda|>1, the term explodes as L grows.

Equation (4.2) contains a product of factors whereas (C) contains a sum of terms. A sum of terms with values less than one does not tend to zero (vanish) as a product of factors with values less than one would.

The interleaved outputs mitigate, but do not eliminate the vanishing gradient problem. The term \Psi_{k} is a product of derivatives formed by the chain rule of calculus. The number of factors in this product grows proportional to k. For a distant layer where k is large, \Psi_{k} will have many factors. The result is that gradients back-propagated from distant layers are still susceptible to vanishing gradient problems. However, for nearby outputs where k is small, \Psi_{k} will have few factors and so gradients from nearby outputs are less likely to experience vanishing gradient problems. Thus, nearby outputs can provide guidance to local parameters during training, resulting in improved convergence as the effective depth of the network is reduced.

5 Material and Methods

5.1 Datasets

A set of models are compared on a synthetic dataset and nine real-world datasets sourced from various domains. These include weather, environmental, energy, aquaculture, and meteorological domains. Datasets with seasonal components and complex trends are hand-picked to ensure that they provide a sufficiently challenging problem. Properties such as varying seasonal amplitude, varying seasonal shape, and noise were sought. For example, the shape of the seasonal cycle is non-stationary and changes over time in most datasets. Additionally, properties such as seasonality assist in demonstrating the ability for models to learn long-term dependencies in the data.

The synthetic dataset is used to provide a baseline. The data is generated according to x_{t}=2\sin\left(2\pi ft\right)+\nicefrac{{1}}{{3}}\sin\left(\nicefrac{{2\pi ft% }}{{5}}\right), where f is the frequency and t denotes time. The low frequency sinusoid emulates a long-term time-varying trend, whereas the high frequency sinusoid emulates seasonality. Models are expected to perform well on this dataset because it contains no noise. The properties of all the datasets are summarised in Table 1. In figures and tables the datasets are referred to by their abbreviations.

Table 1: Dataset properties.
Dataset Abbreviation Resolution Period Length Minimum Maximum Mean Std. Dev.
Synthetic Synth. - 20 4320 -2.33 2.33 -0.00 1.43
England temperature Hyndman (2014) Weath. Monthly 12 3261 0.10 18.80 9.27 4.75
River flow Hyndman (2014) River Monthly 12 1492 3290 66500 23157.60 13087.40
Electricity AEMO (2019) Elect. Hourly 24 19224 5514 14580 8709.79 1360.29
Traffic Volume UCI (2019) Traff. Hourly 24 8776 125 7217 3269.26 2021.57
Lake levels Hyndman (2014) Lake Monthly 12 648 10 20 15.08 2.00
Dissolved Oxygen Dabrowski et al. (2018) DO Hourly 24 2422 5.66 7.94 6.50 0.53
pH Dabrowski et al. (2018) pH Hourly 24 2422 8.07 11.15 8.56 0.21
Pond temperature Dabrowski et al. (2018) Temp. Hourly 24 2422 24.38 31.97 27.74 1.85
Ozone Hyndman (2014) Ozone Monthly 12 516 266 430 338.00 38.30

5.2 Models

Four deep-learning based benchmark models are compared to four variations of ForecastNet. These models include deepAR (Salinas et al., 2019), the TCN (Bai et al., 2018), the sequence-to-sequence (encoder-decoder) model (Sutskever et al., 2014), and the attention model (Bahdanau et al., 2015). For completeness, a single layer MLP, a free-form seasonal Dynamic Linear Model (DLM) (West & Harrison, 1997), and a seasonal autoregressive moving average (SARIMA) model are included in the comparison. The DLM (a state space model) and the SARIMA model are well-known statistical models that are used for time-series forecasting (Hyndman & Athanasopoulos, 2018).

Four variations of ForecastNet are tested111Note that we tested the use of self-attention mechanisms in the hidden cells. We were unable to achieve any significant performance improvement compared with FN.:


Densely connected hidden layers in each hidden cell and a Gaussian mixture density output layer.


CNNs in the hidden cell and a Gaussian mixture density output layer.


This is identical to FN, but with a linear output layer instead of a mixture density output layer.


This is identical to cFN, but with a linear output layer instead of a mixture density output layer.

The set of models are tested on a datasets that have a seasonal component with period denoted by \tau. The number of inputs in all models is set to 2\tau and the number of outputs (forecast-steps) is set to \tau. Thus, the models are trained to forecast one seasonal cycle ahead in time, given the two previous cycles of data.

To avoid possible bias between the models, they are configured to be as similar as possible. This is achieved by limiting the number of neurons in the models to similar values. Configuration details are provided in Table 2. In figures and tables, the sequence-to-sequence model is denoted by ‘seq2seq’ and the attention model is denoted by ‘Att’.

Table 2: Model configuration. \tau is the seasonal period of the dataset and ReLU is the rectified linear unit.
Model Configuration
FN Each hidden cell comprises two densely connected hidden layers, each with 24 ReLU neurons. The rectified linear unit (ReLU) with He initialisation (He et al., 2015b) is used as the activation function.
FN2 Identical to FN2 but with a linear output layer instead of a mixture density layer.
cFN Each hidden cell comprises a convolutional layer with 24 filters, each with a kernel size of 2, followed by a average pooling layer with a pool size of 2 and stride of 1. The convolutional and pooling layers are duplicated and followed by a dense layer with 24 ReLU neurons.
cFN2 Identical to cFN2 but with a linear output layer instead of a mixture density layer.
DeepAR The sequence-to-sequence architecture with single layered LSTMs are used in the encoder and decoder. The mixture density output of the network is a Gaussian (normal) distribution.
TCN The TCN contains a convolutional layer with 32 filters, each with a kernel size of 2 for the Synthetic, Weather, Elect., and River datasets. For the remaining datasets, the TCN contains a convolutional layer with 64 filters, each with a kernel size of 3. The output contains a dense layer with \tau linear units.
Attention Encoder has a bidirectional LSTM and the decoder has a single layered LSTM. The LSTM cells are configured with 24 ReLU units.
Seq2Seq Encoder and decoder use a single layered LSTM. The LSTM cells are configured with 24 RelU units.
MLP Feed forward MLP with a single hidden layer comprising 4\tau ReLU hidden neurons. A set of 2\tau inputs are provided and a set of \tau linear outputs are used.
DLM The DLM used is the free-form seasonal model with a zero order trend component (West & Harrison, 1997). Model fitting is performed using a modified Kalman filter (Wang, 2016)
SARIMA Standard form SARIMA(p,d,q)(P,D,Q)s with: Synthetic: SARIMA(1,1,1)(1,1,0)20, Weather: SARIMA(2,0,3)(0,1,0)12, Elect.: SARIMA(3,1,4)(0,1,0)24, River: SARIMA(2,0,4)(0,1,0)12, Traff.: SARIMA(3,1,1)(0,1,0)24, Lake: SARIMA(2,0,8)(0,1,0)12, DO: SARIMA(2,0,6)(0,1,0)24, pH: SARIMA(4,1,3)(0,1,0)24, Temp.: SARIMA(2,1,4)(0,1,0)24, and Ozone: SARIMA(3,0,4)(0,1,0)12,

5.3 Training and Testing

The datasets are scaled to the range of [0,1] for training. Each dataset is split into a training and a test set, where the last 10% of the data are used for the test set. The training and test sets both comprise a long sequence of values. These sequences are converted into a set of samples that the models can process. A sample is extracted using a sliding window of length 3\tau. The first 2\tau samples in this window form the input sequence to the model. The last \tau samples form the forecast target values. The sliding window is slid across the dataset sequence to produce a set of samples. The set of training samples are shuffled prior to training.

The ADAM algorithm (Kingma & Ba, 2014) is used to minimise the mean squared error in all machine learning models. The learning rate is searched over the range 10^{-i},i\in[2,\dots,6]. Early stopping is used to address overfitting and defines the number of epochs. The Mean Absolute Scaled Error (MASE) (Hyndman & Koehler, 2006) is used to evaluate performance of the models. For completeness, results with the Symmetric Mean Absolute Percentage Error (SMAPE) Hyndman & Athanasopoulos (2018) are additionally provided.

6 Results and Discussion

6.1 Time-Invariance Test

Figure 4: Synthetic dataset forecasts for for the seq2seq, Att., FN2, and cFN2 models at starting time indices 0, 50, 150 and 200.

Before comparing models on the datasets, the difference between time-invariant and time-variant models is demonstrated using a variation of the synthetic dataset given by

\displaystyle x_{t}=\frac{1}{2}\sin\left(\frac{2\pi f}{6}t\right)\left(\frac{3% }{5}\sin\left(2\pi ft\right)+\frac{1}{5}\sin\left(\frac{2\pi f}{5}t\right)\right)

The second and third sinusoids emulate a seasonal cycle time-varying trend as previously presented. The first sinusoid is included to perform amplitude modulation.

The amplitude modulation repeats every 6 cycles of the seasonal period. A model is only presented with two seasonal cycles as inputs. The model thus never observes a complete pattern, which is the full cycle of the amplitude modulation. A time-variant model’s parameters are able to change over time, which enables the model to adapt to the variations in amplitude. A time-invariant model is less agile and is expected to struggle with large amplitude variations.

The seq2seq, attention, FN2, and cFN2 models are trained on the dataset. Results of forecasts from inputs starting at time indices 0, 50, 150 and 200 are presented in Figure 4. As expected, the time-invariant seq2seq model struggles maintaining accurate forecasts when there are large variations in the amplitude. The attention model performs better due to the time-variant attention mechanism. However, the time-variant ForecastNet models adapt well to the large variations in the signal amplitude. We argue that this is due to the time-variant properties of ForecastNet.

6.2 Model Comparison Error Results

Table 3: Average MASE and SMAPE (in brackets) of the models results over the test datasets. The last row indicates the sum of Borda counts of the models over the datasets (a higher value indicates more points in the voting score). Boldface numbers highlight top results.
FN cFN FN2 cFN2 deepAR Seq2Seq Attention TCN MLP DLM SARIMA
Synth. 0.00 (1.5) 0.01 (1.7) 0.00 (1.3) 0.00 (1.4) 0.03 (2.3) 0.01 (1.7) 0.04 (2.8) 0.05 (3.0) 0.01 (1.6) 0.64 (23.0) 0.29 (11.7)
Weath. 0.46 (17.1) 0.40 (15.3) 0.46 (16.7) 0.31 (12.3) 0.46 (17.3) 0.43 (16.0) 0.37 (14.3) 0.47 (17.3) 0.47 (17.1) 0.52 (18.9) 0.61 (22.0)
Elect. 1.12 (12.5) 1.04 (11.9) 0.89 (10.9) 0.54 (6.7) 1.77 (18.2) 1.00 (11.7) 1.39 (14.7) 1.09 (12.9) 1.34 (15.4) 1.73 (19.1) 1.26 (15.1)
River 0.71 (24.1) 0.66 (23.6) 0.66 (25.2) 0.39 (15.4) 0.86 (28.7) 0.57 (21.9) 0.53 (19.4) 0.85 (30.5) 0.85 (30.3) 0.77 (26.8) 0.87 (29.5)
Traff. 2.23 (46.0) 1.95 (40.3) 1.44 (32.5) 0.82 (21.2) 2.01 (40.3) 1.78 (36.6) 1.94 (39.3) 2.20 (44.4) 2.36 (47.8) 2.40 (43.5) 2.32 (42.9)
Lake 1.42 (10.1) 1.61 (12.0) 1.58 (12.3) 1.69 (13.0) 1.61 (12.3) 1.73 (13.4) 1.56 (11.2) 2.03 (13.8) 1.57 (12.1) 1.95 (15.5) 1.69 (12.1)
DO 0.54 (6.1) 0.62 (7.2) 0.62 (6.9) 0.54 (6.2) 0.71 (8.2) 0.73 (8.6) 2.11 (26.5) 0.78 (9.1) 0.77 (8.6) 0.77 (8.7) 0.64 (7.0)
pH 1.41 (13.6) 1.23 (12.0) 1.26 (12.4) 1.01 (9.6) 2.64 (25.1) 1.70 (16.6) 1.42 (14.4) 1.35 (13.1) 1.90 (18.3) 1.29 (12.7) 1.68 (16.5)
Temp. 1.90 (9.0) 1.90 (10.0) 1.66 (8.5) 2.00 (10.7) 1.95 (9.8) 2.13 (10.6) 2.23 (11.4) 3.18 (14.2) 2.10 (9.7) 3.67 (18.4) 1.64 (7.0)
Ozone 0.72 (33.9) 0.78 (36.2) 0.69 (33.9) 0.79 (36.0) 1.03 (44.7) 1.50 (61.8) 0.58 (28.6) 0.69 (33.5) 0.66 (32.5) 0.76 (30.6) 0.89 (36.5)
Borda Count 74 76 90 90 43 57 65 42 51 31 41

The average MASE and SMAPE over all forecasts on each dataset’s test set is provided in Table 3. ForecastNet produces the best results on 8 of the 10 datasets. The cFN2 variation of ForecastNet achieves the best results on 4 of these 8 datasets. This result is reinforced with Borda counts provided in the last row. A Borda count ranks a set of M models with integers (1,\dots,M) such that the model with the lowest MASE is assigned a value of M (a higher vote) and the model with the highest MASE is assigned a value of 1 (a lower vote). Borda counts thus provide a more aggregated evaluation. FN2 and cFN2 are voted as the best models with the highest Borda counts. These are followed by cFN, FN and the attention models respectively.

The attention model produced the lowest error for the ozone dataset. The attention model is a relatively complex model and its attention mechanism assists in modelling long-term dependencies. FN2 provides strong competition to the attention model over the remaining datasets. This is despite it having an arguably a simpler architecture with no gating structures to reduce vanishing gradients. We argue that a key reason why ForecastNet performs so well is that it is not a time-invariant model.

Increasing the model complexity generally resulted in improved forecast performance in this study. For example, cFN generally outperforms FN. However, simpler models do not fail on the datasets. For example, the MLP provided comparably accurate forecasts despite its simplicity.

As expected, the DLM and SARIMA statistical models performed well despite being linear models (Makridakis et al., 2018a). For example, the SARIMA model achieved the lowest error on the pond temperature dataset. This suggests that the dynamics of this dataset are more linear. However, with the non-linear trends, amplitudes, and cyclic shapes in the other datasets, the DLM and SARIMA models did not perform as well the non-linear neural network-based models. It has however been shown that such linear statistical models can outperform complex machine learning models when the sample size is small (Cerqueira et al., 2019).

Of the deep neural network-based models, the TCN performed the worst on several datasets. Bai et al. (2018) suggest that the model is in a simplified form and improved results may be possible by using a more advanced architecture. Furthermore, the TCN is designed to perform dilated convolutions over many samples. Of the datasets used in this study, the maximum number of input samples was 48 for datasets with a period of 24 hours. This may be too few to demonstrate the effectiveness of the TCN.

Several MASE results are above unity. In multi-step-ahead forecasting, this does not necessarily imply that a forecasting model produces results worse than the Naive model. In the MASE calculation, Naive is produces one-step-ahead forecasts with in-sample data provided for each previous step. The forecasting model is computed out-of-sample. That is, Naive has access to the ground-truth data in the MASE calculation, whereas the forecasting model does not.

6.3 Model Comparison Box-Whisker Plots

Figure 5: Box plot of the MASE over the set of forecasts produced for each training dataset. The DLM and SARIMA boxes are outside of the plot range for the synthetic dataset.

Box-whisker plots of the results over all the forecasts in each dataset’s test set are provided in Figure 5. ForecastNet consistently produced small boxes with low median values. The small boxes indicate that there is little variation in the accuracy over the set of forecasts. This indicates some form of robustness in the ForecastNet model. The low median values indicate a high level of accuracy.

There was significant variation over the different models in the synthetic dataset box-whisker plots. For this dataset, the densely connected networks such as FN, FN2 and MLP have small boxes. DeepAR had a large box which indicates higher variation in the forecasts. This indicates that DeepAR is less robust for this dataset.

The models generally perform well on the weather dataset. This may be due to a more consistent seasonal amplitude in this dataset compared with the other real-world datasets. The lake and pH datasets have varying trends, amplitudes, and seasonal shape resulting in higher errors than other datasets. ForecastNet and the attention model seem to model these variations better given their lower errors.

In datasets such as electricity, traffic, and pH, ForecastNet produced low errors with small boxes indicating reliable and accurate forecasts. Especially in the electricity and traffic datasets, it is evident that increased model complexity and removing the mixture density output results in lower errors and a more robust model. The mixture density outputs can reduce accuracy because the learning algorithm seeks to simultaneously minimise two variables in the normal distribution’s log likelihood function. This is opposed to minimising a single variable in mean squared error loss function used for a linear output layer.

7 Summary and Conclusion

In this study, ForecastNet is proposed as novel deep neural architecture for multi-step-ahead time series forecasting. Its architecture breaks from convention of structuring a model around the RNN or CNN. The result is a model that is time-variant compared with the RNN and CNN, which are time-invariant.

We provide comparison over seven state-of-the-art deep learning and statistical models for forecasting seasonal time series data. The comparison is performed on 10 seasonal time-series datasets selected from various domains. We demonstrate that ForecastNet is both accurate and robust on all datasets. It outperforms other models in terms of MASE and SMAPE on 8 of the 10 datasets and is ranked as the best performing model overall with Borda counts.

In future work, shortcut-connections within and between hidden cells could be investigated. Furthermore, more work into integrating self-attention into the model will provide benefits relating to model interpretability. Lastly, by avoiding parameter sharing to achieve a time-variant model, ForecastNet can require more memory. An investigation into using memory reduction techniques (such as quantization) could be explored.


  • AEMO (2019) AEMO. Electricity demand data., January 2019. Last accessed: January 2019.
  • Bahdanau et al. (2015) Bahdanau, D., Cho, K., and Bengio, Y. Neural machine translation by jointly learning to align and translate. In Proc. International Conference on Learning Representations, 2015. URL
  • Bai et al. (2018) Bai, S., Kolter, J. Z., and Koltun, V. An empirical evaluation of generic convolutional and recurrent networks for sequence modeling. arXiv preprint arXiv:1803.01271, 2018.
  • Balduzzi et al. (2017) Balduzzi, D., Frean, M., Leary, L., Lewis, J. P., Ma, K. W.-D., and McWilliams, B. The shattered gradients problem: If resnets are the answer, then what is the question? In Precup, D. and Teh, Y. W. (eds.), Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pp. 342–350, International Convention Centre, Sydney, Australia, 06–11 Aug 2017. PMLR. URL
  • Bayer & Osendorfer (2014) Bayer, J. and Osendorfer, C. Learning stochastic recurrent networks. arXiv preprint arXiv:1411.7610, 2014.
  • Binkowski et al. (2018) Binkowski, M., Marti, G., and Donnat, P. Autoregressive convolutional neural networks for asynchronous time series. In International Conference on Machine Learning, pp. 579–588, 2018.
  • Bishop (1994) Bishop, C. M. Mixture density networks. Technical Report NCRG94004, Aston University, 1994.
  • Caterini & Chang (2018) Caterini, A. L. and Chang, D. E. Deep Neural Networks in a Mathematical Framework. Springer International Publishing, Cham, 2018. ISBN 978-3-319-75304-1. doi: 10.1007/978-3-319-75304-1. URL
  • Cerqueira et al. (2019) Cerqueira, V., Torgo, L., and Soares, C. Machine learning vs statistical methods for time series forecasting: Size matters. arXiv preprint arXiv:1909.13316, 2019.
  • Chang et al. (2017) Chang, S., Zhang, Y., Han, W., Yu, M., Guo, X., Tan, W., Cui, X., Witbrock, M., Hasegawa-Johnson, M. A., and Huang, T. S. Dilated recurrent neural networks. In Guyon, I., Luxburg, U. V., Bengio, S., Wallach, H., Fergus, R., Vishwanathan, S., and Garnett, R. (eds.), Advances in Neural Information Processing Systems 30, pp. 77–87. Curran Associates, Inc., 2017. URL
  • Cho et al. (2014a) Cho, K., Van Merriënboer, B., Bahdanau, D., and Bengio, Y. On the properties of neural machine translation: Encoder-decoder approaches. arXiv preprint arXiv:1409.1259, 2014a.
  • Cho et al. (2014b) Cho, K., Van Merriënboer, B., Gulcehre, C., Bahdanau, D., Bougares, F., Schwenk, H., and Bengio, Y. Learning phrase representations using rnn encoder-decoder for statistical machine translation. arXiv preprint arXiv:1406.1078, 2014b.
  • Chung et al. (2015) Chung, J., Kastner, K., Dinh, L., Goel, K., Courville, A. C., and Bengio, Y. A recurrent latent variable model for sequential data. In Cortes, C., Lawrence, N. D., Lee, D. D., Sugiyama, M., and Garnett, R. (eds.), Advances in Neural Information Processing Systems 28, pp. 2980–2988. Curran Associates, Inc., 2015. URL
  • Dabrowski et al. (2018) Dabrowski, J. J., Rahman, A., George, A., Arnold, S., and McCulloch, J. State space models for forecasting water quality variables: An application in aquaculture prawn farming. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery &#38; Data Mining, KDD ’18, pp. 177–185, New York, NY, USA, 2018. ACM. ISBN 978-1-4503-5552-0. doi: 10.1145/3219819.3219841. URL
  • Doerr et al. (2018) Doerr, A., Daniel, C., Schiegg, M., Nguyen-Tuong, D., Schaal, S., Toussaint, M., and Trimpe, S. Probabilistic recurrent state-space models. arXiv preprint arXiv:1801.10395, 2018.
  • Du et al. (2020) Du, S., Li, T., Yang, Y., and Horng, S.-J. Multivariate time series forecasting via attention-based encoder–decoder framework. Neurocomputing, 2020. ISSN 0925-2312. doi: URL
  • Fraccaro et al. (2016) Fraccaro, M., Sø nderby, S. r. K., Paquet, U., and Winther, O. Sequential neural models with stochastic layers. In Lee, D. D., Sugiyama, M., Luxburg, U. V., Guyon, I., and Garnett, R. (eds.), Advances in Neural Information Processing Systems 29, pp. 2199–2207. Curran Associates, Inc., 2016. URL
  • Gers et al. (1999) Gers, F., Schmidhuber, J., and Cummins, F. Learning to forget: continual prediction with lstm. IET Conference Proceedings, pp. 850–855(5), January 1999. URL
  • Goodfellow et al. (2016) Goodfellow, I., Bengio, Y., and Courville, A. Deep learning. MIT press, 2016.
  • Graves (2013) Graves, A. Generating sequences with recurrent neural networks. arXiv preprint arXiv:1308.0850, 2013.
  • He et al. (2015a) He, K., Zhang, X., Ren, S., and Sun, J. Deep residual learning for image recognition. arXiv preprint arXiv:1512.03385, 2015a.
  • He et al. (2015b) He, K., Zhang, X., Ren, S., and Sun, J. Delving deep into rectifiers: Surpassing human-level performance on imagenet classification. In 2015 IEEE International Conference on Computer Vision (ICCV), pp. 1026–1034, Dec 2015b. doi: 10.1109/ICCV.2015.123.
  • Hewamalage et al. (2019) Hewamalage, H., Bergmeir, C., and Bandara, K. Recurrent neural networks for time series forecasting: Current status and future directions. arXiv preprint arXiv:1909.00590, 2019.
  • Hochreiter & Schmidhuber (1997) Hochreiter, S. and Schmidhuber, J. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997. doi: 10.1162/neco.1997.9.8.1735. URL
  • Hyndman (2014) Hyndman, R. Time series data library., February 2014. Last accessed: July 2018.
  • Hyndman & Athanasopoulos (2018) Hyndman, R. and Athanasopoulos, G. Forecasting: principles and practice. OTexts, Melbourne, Australia, 2nd edition, 2018. URL Accessed on 27 April, 2019.
  • Hyndman & Koehler (2006) Hyndman, R. J. and Koehler, A. B. Another look at measures of forecast accuracy. International Journal of Forecasting, 22(4):679 – 688, 2006. ISSN 0169-2070. doi: URL
  • Kingma & Ba (2014) Kingma, D. P. and Ba, J. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
  • Krishnan et al. (2015) Krishnan, R. G., Shalit, U., and Sontag, D. Deep kalman filters. arXiv preprint arXiv:1511.05121, 2015.
  • Kuznetsov & Mariet (2018) Kuznetsov, V. and Mariet, Z. Foundations of sequence-to-sequence modeling for time series. arXiv preprint arXiv:1805.03714, 2018.
  • Lakshminarayanan et al. (2017) Lakshminarayanan, B., Pritzel, A., and Blundell, C. Simple and scalable predictive uncertainty estimation using deep ensembles. In Guyon, I., Luxburg, U. V., Bengio, S., Wallach, H., Fergus, R., Vishwanathan, S., and Garnett, R. (eds.), Advances in Neural Information Processing Systems 30, pp. 6402–6413. Curran Associates, Inc., 2017. URL
  • Laptev et al. (2017) Laptev, N., Yosinski, J., Li, L. E., and Smyl, S. Time-series extreme event forecasting with neural networks at uber. In International Conference on Machine Learning, number 34, pp. 1–5, 2017.
  • LeCun et al. (1995) LeCun, Y., Bengio, Y., et al. Convolutional networks for images, speech, and time series. The handbook of brain theory and neural networks, 3361(10):1995, 1995.
  • Li et al. (2019) Li, S., Jin, X., Xuan, Y., Zhou, X., Chen, W., Wang, Y.-X., and Yan, X. Enhancing the locality and breaking the memory bottleneck of transformer on time series forecasting. In Advances in Neural Information Processing Systems, pp. 5244–5254, 2019.
  • Maddix et al. (2018) Maddix, D. C., Wang, Y., and Smola, A. Deep factors with gaussian processes for forecasting. arXiv preprint arXiv:1812.00098, 2018.
  • Makridakis et al. (2018a) Makridakis, S., Spiliotis, E., and Assimakopoulos, V. The m4 competition: Results, findings, conclusion and way forward. International Journal of Forecasting, 34(4):802 – 808, 2018a. ISSN 0169-2070. doi: URL
  • Makridakis et al. (2018b) Makridakis, S., Spiliotis, E., and Assimakopoulos, V. Statistical and machine learning forecasting methods: Concerns and ways forward. PLOS ONE, 13(3):1–26, 03 2018b. doi: 10.1371/journal.pone.0194889. URL
  • Mehrkanoon (2019) Mehrkanoon, S. Deep shared representation learning for weather elements forecasting. Knowledge-Based Systems, 179:120 – 128, 2019. ISSN 0950-7051. doi: URL
  • Nguyen et al. (2019) Nguyen, L., Yang, Z., Li, J., Pan, Z., Cao, G., and Jin, F. Forecasting people’s needs in hurricane events from social network. IEEE Transactions on Big Data, pp. 1–1, 2019. doi: 10.1109/TBDATA.2019.2941887.
  • Oord et al. (2016) Oord, A. v. d., Dieleman, S., Zen, H., Simonyan, K., Vinyals, O., Graves, A., Kalchbrenner, N., Senior, A., and Kavukcuoglu, K. Wavenet: A generative model for raw audio. arXiv preprint arXiv:1609.03499, 2016.
  • Oppenheim & Schafer (2009) Oppenheim, A. V. and Schafer, R. W. Discrete-Time Signal Processing. Pearson education signal processing series. Pearson, 3rd edition, 2009. ISBN 9780131988422.
  • Rangapuram et al. (2018) Rangapuram, S. S., Seeger, M. W., Gasthaus, J., Stella, L., Wang, Y., and Januschowski, T. Deep state space models for time series forecasting. In Bengio, S., Wallach, H., Larochelle, H., Grauman, K., Cesa-Bianchi, N., and Garnett, R. (eds.), Advances in Neural Information Processing Systems 31, pp. 7785–7794. Curran Associates, Inc., 2018. URL
  • Salinas et al. (2019) Salinas, D., Flunkert, V., Gasthaus, J., and Januschowski, T. Deepar: Probabilistic forecasting with autoregressive recurrent networks. International Journal of Forecasting, 2019. ISSN 0169-2070. doi: URL
  • Schmidhuber (2015) Schmidhuber, J. Deep learning in neural networks: An overview. Neural Networks, 61:85 – 117, 2015. ISSN 0893-6080. doi: URL
  • Sen et al. (2019) Sen, R., Yu, H.-F., and Dhillon, I. Think globally, act locally: A deep neural network approach to high-dimensional time series forecasting. arXiv preprint arXiv:1905.03806, 2019.
  • Srivastava et al. (2015) Srivastava, R. K., Greff, K., and Schmidhuber, J. Training very deep networks. In Cortes, C., Lawrence, N. D., Lee, D. D., Sugiyama, M., and Garnett, R. (eds.), Advances in Neural Information Processing Systems 28, pp. 2377–2385. Curran Associates, Inc., 2015. URL
  • Sutskever et al. (2014) Sutskever, I., Vinyals, O., and Le, Q. V. Sequence to sequence learning with neural networks. In Ghahramani, Z., Welling, M., Cortes, C., Lawrence, N. D., and Weinberger, K. Q. (eds.), Advances in Neural Information Processing Systems 27, pp. 3104–3112. Curran Associates, Inc., 2014. URL
  • UCI (2019) UCI. Uc irvine machine learning repository., 2019. Last accessed: November 2019.
  • Vaswani et al. (2017) Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, L. u., and Polosukhin, I. Attention is all you need. In Guyon, I., Luxburg, U. V., Bengio, S., Wallach, H., Fergus, R., Vishwanathan, S., and Garnett, R. (eds.), Advances in Neural Information Processing Systems 30, pp. 5998–6008. Curran Associates, Inc., 2017. URL
  • Wang (2016) Wang, X. Pydlm, 2016. URL
  • Wen et al. (2017) Wen, R., Torkkola, K., Narayanaswamy, B., and Madeka, D. A multi-horizon quantile recurrent forecaster. arXiv preprint arXiv:1711.11053, 2017.
  • West & Harrison (1997) West, M. and Harrison, J. Bayesian Forecasting and Dynamic Models. Springer Series in Statistics. Springer New York, 1997. ISBN 978-0-387-22777-1. doi: 10.1007/0-387-22777-6. URL
  • Wu et al. (2020) Wu, N., Green, B., Ben, X., and O’Banion, S. Deep transformer models for time series forecasting: The influenza prevalence case. arXiv preprint arXiv:2001.08317, 2020.
  • Xing et al. (2019) Xing, F. Z., Cambria, E., and Zhang, Y. Sentiment-aware volatility forecasting. Knowledge-Based Systems, 176:68 – 76, 2019. ISSN 0950-7051. doi: URL
  • Zhang et al. (2019) Zhang, Y., Thorburn, P., Xiang, W., and Fitch, P. SSIM -a deep learning approach for recovering missing time series sensor data. IEEE Internet of Things Journal, 2019. ISSN 2327-4662. doi: 10.1109/JIOT.2019.2909038.
  • Zhu & Laptev (2017) Zhu, L. and Laptev, N. Deep and confident prediction for time series at uber. In 2017 IEEE International Conference on Data Mining Workshops (ICDMW), pp. 103–110, Nov 2017. doi: 10.1109/ICDMW.2017.19.

Appendix A Additional Results

A.1 Forecast Plots

Figure 6: Plots of the forecasts for each model on the river flow dataset. The grey curve plots the sensor data and the red curve plots the forecasts. The standard deviation for the relevant models is plotted as the red shaded region.

To illustrate forecasting ability, a plot of the forecasts for each model on the river flow dataset is presented in Figure 6. The uncertainty (standard deviation) is plotted for the FN, cFN, deepAR, DLM, and SARIMA models.

As illustrated in Figure 6, the river flow data has a varying amplitude and an irregular seasonal curve. The first peak has a lower amplitude than the second and third peak. Furthermore, the shape of each the seasonal cycle is unique. An accurate model is one that is able to adapt to these changes by learning the underlying dynamics which generate these changes.

Despite the variations over seasonal cycles, cFN2 provided highly accurate forecasts, where fine intricacies in the data were forecast. In this example, the attention model also produces an accurate forecast, however cFN2 outperforms the attention model on this dataset. The DeepAR and the sequence-to-sequence models provided more smooth forecasts. DeepAR has the advantage of providing uncertainty with the forecast.

The DLM and SARIMA models forecast a curve that is similar in form to the curve from the previous cycle. This is expected as both of these models consider one cycle of data in the past to compute the forecast. Furthermore, these statistical models assume a linear trend, whereas the actual trend is nonlinear and appears more stochastic in nature. The nonlinear models are more capable of modelling the nonlinear dynamics.

A.2 Computational Complexity

Table 4: Average epoch duration over 10 epochs. The times are represented as a multiple of the epoch duration of the MLP for each respective dataset.
FN cFN FN2 cFN2 deepAR Seq2Seq Attention TCN MLP
Synth. 12.62 36.84 5.21 32.48 39.38 36.68 69.60 5.04 1.00
Weath. 6.55 18.88 3.38 18.22 23.39 22.14 36.93 2.28 1.00
Elect. 15.07 77.90 6.09 74.79 48.26 45.60 101.55 13.96 1.00
River 7.05 18.63 3.67 17.52 22.61 21.18 35.83 2.85 1.00
Traff. 14.06 60.68 5.56 56.94 44.38 41.54 86.20 15.86 1.00
Lake 7.52 18.41 3.68 16.42 23.65 22.52 37.64 2.62 1.00
DO 17.41 46.64 6.61 39.28 47.29 44.70 88.71 8.96 1.00
pH 17.44 45.24 6.69 42.48 46.21 44.79 89.40 9.08 1.00
Temp. 16.97 48.62 6.57 41.41 47.27 44.39 88.32 8.94 1.00
Ozone 7.89 17.45 3.91 15.68 23.50 21.93 36.53 2.87 1.00
Median 13.34 41.04 5.38 35.88 41.88 39.11 77.90 6.99 1.00

To demonstrate the computational complexity of the models, their training times were considered. Each model was trained on each dataset for 10 epochs and these epoch times were logged. The results are presented in Table 4. To provide some independence from the platform on which the models were trained, the results are presented as an average epoch time of the particular model divided by the average epoch time of the MLP. The results are thus represented as a multiple of the epoch duration of the MLP.

The MLP had the simplest architecture and thus provided the shortest epoch times. Second to the MLP are the FN2 and the TCN. These models required 5 to 7 times the amount of time to train an epoch compared to the MLP. Note that FN2 is ranked as a top model and also has the second lowest median computation time, resulting in a highly attractive model.

Using a density mixture output on ForecastNet significantly increased the epoch time. This is evident when the duration of FN is compared to FN2 and the duration of cFN is compared to cFN2. However, even with increased computational complexity, the median duration of cFN and cFN2 was less than the sequence-to-sequence, attention, and deepAR models. The attention model in particular had a high epoch duration due to its complex architecture. The results provide empirical evidence that ForecastNet provided reduced computational complexity compared with the benchmark models.

A.3 Vanishing Gradients

To demonstrate that ForecastNet is able to mitigate vanishing gradients, it was compared with a deep MLP on the synthetic dataset. The deep MLP was selected for this purpose as it has no guard against vanishing gradients. The MLP was configured with 40 inputs, 20 hidden layers, and 20 outputs. Similarly, ForecastNet was configured with 20 linear outputs and a single hidden layer in each cell. This results in a total of 20 hidden layers, 20 outputs, and 40 inputs as for the deep MLP. Thus, the primary difference between the models was that ForecastNet uses interleaved outputs, whereas the deep MLP’s outputs are all located at the output layer. Both models used the sigmoid activation function with Xavier normal initialisation in hidden layers. The models were trained over 10 epochs with a learning rate of 10^{-4}.

The absolute mean value of the weights for the first and last hidden layers are plotted in the top graph of Figure 7. The training losses are plotted in the bottom graph of Figure 7. The gradient in the first layer of the MLP remained close to zero over all 10 epochs. This indicates a vanishing gradient problem. Furthermore, as indicated in the loss plot, the model did not converge to an optimal solution. In comparison, the gradients of both layers in ForecastNet were non-zero. This indicates that ForecastNet had mitigated vanishing gradients. Furthermore, ForecastNet converged to a more optimal solution compared to the deep MLP model.

Figure 7: Gradient and loss plots for ForecastNet and a deep MLP. Layer 1 of the MLP experiences vanishing gradients. The gradient plot is cut-off at 0.6\times 10^{-4}

Appendix B Additional Model Details

A more detailed diagram of FN and cFN are provided in Figure 8 and Figure 9 respectively.






Figure 8: ForecastNet with a densely connected hidden cell structure as used in FN and FN2. The number of hidden neurons in each layer is denoted by h.



(f=24, k=2)Pool(p=2, s=1)Conv.(f=24, k=2)Pool(p=2, s=1)Dense(h=24)



Figure 9: ForecastNet with a CNN hidden cell structure as used in cFN and cFN2. The number of filters, kernel size, padding, strides, and number of hidden neurons are denoted by f, k, p, s, and h respectively.

Appendix C Detailed Derivation of Interleaved Output Chain Rule in Equation (C)












Figure 10: Figure for the derivation of equation (C). The top row of nodes are hidden neurons. The bottom row of nodes are output neurons.

The outputs in ForecastNet are interleaved between hidden cells. Consider an L-layered ForecastNet with a single hidden neuron in the hidden cell and a single linear output neuron as illustrated in Figure 10. The inputs are not shown as they do not contribute to the backpropagated error across hidden cells. For some layer l in this network, W^{[l]} is the weight matrix, \bar{b}^{[l]} is the bias vector, \mathbf{a}^{[l]} is the output vector, and \mathbf{z}^{[l]}=W^{[l]T}\mathbf{a}^{[l-1]}+\bar{b}^{[l]}. Using the chain rule of calculus, the derivative of the loss function \mathcal{L} with respect to the weights W^{[l]} at layer l is given by

\displaystyle\frac{\partial\mathcal{L}}{\partial W^{[l]}} \displaystyle=\frac{\partial\mathcal{L}}{\partial\mathbf{a}^{[l]}}\frac{% \partial\mathbf{a}^{[l]}}{\partial W^{[l]}}

Layer l links to layers l+1 and l+2. Thus, the derivative with respect to \mathbf{a}^{[l]} is expanded as follows

\displaystyle\frac{\partial\mathcal{L}}{\partial W^{[l]}} \displaystyle=\left(\frac{\partial\mathcal{L}}{\partial\mathbf{z}^{[l+1]}}% \frac{\partial\mathbf{z}^{[l+1]}}{\partial\mathbf{a}^{[l]}}+\frac{\partial% \mathcal{L}}{\partial\mathbf{z}^{[l+2]}}\frac{\partial\mathbf{z}^{[l+2]}}{% \partial\mathbf{a}^{[l]}}\right)\frac{\partial\mathbf{a}^{[l]}}{\partial W^{[l% ]}}
\displaystyle=\frac{\partial\mathcal{L}}{\partial\mathbf{z}^{[l+1]}}\frac{% \partial\mathbf{z}^{[l+1]}}{\partial\mathbf{a}^{[l]}}\frac{\partial\mathbf{a}^% {[l]}}{\partial W^{[l]}}+\frac{\partial\mathcal{L}}{\partial\mathbf{z}^{[l+2]}% }\frac{\partial\mathbf{z}^{[l+2]}}{\partial\mathbf{a}^{[l]}}\frac{\partial% \mathbf{a}^{[l]}}{\partial W^{[l]}}

The term \nicefrac{{\partial\mathcal{L}}}{{\partial\mathbf{z}^{[l+1]}}} is computed with respect to a target value as layer l+1 is an output layer. Layer l+2 is a hidden layer. The term \nicefrac{{\partial\mathcal{L}}}{{\partial\mathbf{a}^{[l+2]}}} is thus expanded as follows.

\displaystyle\frac{\partial\mathcal{L}}{\partial W^{[l]}} \displaystyle=\frac{\partial\mathcal{L}}{\partial\mathbf{z}^{[l+1]}}\frac{% \partial\mathbf{z}^{[l+1]}}{\partial\mathbf{a}^{[l]}}\frac{\partial\mathbf{a}^% {[l]}}{\partial W^{[l]}}
\displaystyle~{}~{}~{}~{}~{}+\left(\frac{\partial\mathcal{L}}{\partial\mathbf{% z}^{[l+3]}}\frac{\partial\mathbf{z}^{[l+3]}}{\partial\mathbf{a}^{[l+2]}}+\frac% {\partial\mathcal{L}}{\partial\mathbf{z}^{[l+4]}}\frac{\partial\mathbf{z}^{[l+% 4]}}{\partial\mathbf{a}^{[l+2]}}\right)
\displaystyle~{}~{}~{}~{}~{}\times\left(\frac{\partial\mathbf{z}^{[l+2]}}{% \partial\mathbf{a}^{[l]}}\frac{\partial\mathbf{a}^{[l]}}{\partial W^{[l]}}\right)

Which is expanded into the sum

\displaystyle\frac{\partial\mathcal{L}}{\partial W^{[l]}} \displaystyle=\frac{\partial\mathcal{L}}{\partial\mathbf{z}^{[l+1]}}\frac{% \partial\mathbf{z}^{[l+1]}}{\partial\mathbf{a}^{[l]}}\frac{\partial\mathbf{a}^% {[l]}}{\partial W^{[l]}}
\displaystyle~{}~{}~{}~{}~{}+\frac{\partial\mathcal{L}}{\partial\mathbf{z}^{[l% +3]}}\frac{\partial\mathbf{z}^{[l+3]}}{\partial\mathbf{a}^{[l+2]}}\frac{% \partial\mathbf{z}^{[l+2]}}{\partial\mathbf{a}^{[l]}}\frac{\partial\mathbf{a}^% {[l]}}{\partial W^{[l]}}
\displaystyle~{}~{}~{}~{}~{}+\frac{\partial\mathcal{L}}{\partial\mathbf{z}^{[l% +4]}}\frac{\partial\mathbf{z}^{[l+4]}}{\partial\mathbf{a}^{[l+2]}}\frac{% \partial\mathbf{z}^{[l+2]}}{\partial\mathbf{a}^{[l]}}\frac{\partial\mathbf{a}^% {[l]}}{\partial W^{[l]}}

Similarly, layer l+3 is an output layer and layer l+4 is a hidden layer. The term \nicefrac{{\partial\mathcal{L}}}{{\partial\mathbf{a}^{[l+4]}}} is expanded as follows

\displaystyle\frac{\partial\mathcal{L}}{\partial W^{[l]}} \displaystyle=\frac{\partial\mathcal{L}}{\partial\mathbf{z}^{[l+1]}}\frac{% \partial\mathbf{z}^{[l+1]}}{\partial\mathbf{a}^{[l]}}\frac{\partial\mathbf{a}^% {[l]}}{\partial W^{[l]}}
\displaystyle~{}~{}~{}~{}+\frac{\partial\mathcal{L}}{\partial\mathbf{z}^{[l+3]% }}\frac{\partial\mathbf{z}^{[l+3]}}{\partial\mathbf{a}^{[l+2]}}\frac{\partial% \mathbf{z}^{[l+2]}}{\partial\mathbf{a}^{[l]}}\frac{\partial\mathbf{a}^{[l]}}{% \partial W^{[l]}}
\displaystyle~{}~{}~{}~{}+\left(\frac{\partial\mathcal{L}}{\partial\mathbf{z}^% {[l+5]}}\frac{\partial\mathbf{z}^{[l+5]}}{\partial\mathbf{a}^{[l+4]}}+\frac{% \partial\mathcal{L}}{\partial\mathbf{z}^{[l+6]}}\frac{\partial\mathbf{z}^{[l+6% ]}}{\partial\mathbf{a}^{[l+4]}}\right)
\displaystyle~{}~{}~{}~{}\times\left(\frac{\partial\mathbf{z}^{[l+4]}}{% \partial\mathbf{a}^{[l+2]}}\frac{\partial\mathbf{z}^{[l+2]}}{\partial\mathbf{a% }^{[l]}}\frac{\partial\mathbf{a}^{[l]}}{\partial W^{[l]}}\right)

Which is expanded into the sum

\displaystyle\frac{\partial\mathcal{L}}{\partial W^{[l]}} \displaystyle=\frac{\partial\mathcal{L}}{\partial\mathbf{z}^{[l+1]}}\frac{% \partial\mathbf{z}^{[l+1]}}{\partial\mathbf{a}^{[l]}}\frac{\partial\mathbf{a}^% {[l]}}{\partial W^{[l]}}
\displaystyle~{}~{}~{}~{}+\frac{\partial\mathcal{L}}{\partial\mathbf{z}^{[l+3]% }}\frac{\partial\mathbf{z}^{[l+3]}}{\partial\mathbf{a}^{[l+2]}}\frac{\partial% \mathbf{z}^{[l+2]}}{\partial\mathbf{a}^{[l]}}\frac{\partial\mathbf{a}^{[l]}}{% \partial W^{[l]}}
\displaystyle~{}~{}~{}~{}+\frac{\partial\mathcal{L}}{\partial\mathbf{z}^{[l+5]% }}\frac{\partial\mathbf{z}^{[l+5]}}{\partial\mathbf{a}^{[l+4]}}\frac{\partial% \mathbf{z}^{[l+4]}}{\partial\mathbf{a}^{[l+2]}}\frac{\partial\mathbf{z}^{[l+2]% }}{\partial\mathbf{a}^{[l]}}\frac{\partial\mathbf{a}^{[l]}}{\partial W^{[l]}}
\displaystyle~{}~{}~{}~{}+\frac{\partial\mathcal{L}}{\partial\mathbf{z}^{[l+6]% }}\frac{\partial\mathbf{z}^{[l+6]}}{\partial\mathbf{a}^{[l+4]}}\frac{\partial% \mathbf{z}^{[l+4]}}{\partial\mathbf{a}^{[l+2]}}\frac{\partial\mathbf{z}^{[l+2]% }}{\partial\mathbf{a}^{[l]}}\frac{\partial\mathbf{a}^{[l]}}{\partial W^{[l]}}

This expansion process is continued until the final output layer L is reached. The final result is

\displaystyle\frac{\partial\mathcal{L}}{\partial W^{[l]}} \displaystyle=\sum_{k=0}^{\frac{L-1-l}{2}}\frac{\partial\mathcal{L}}{\partial% \mathbf{z}^{[l+2k+1]}}\frac{\partial\mathbf{z}^{[l+2k+1]}}{\partial\mathbf{a}^% {[l+2k]}}\Psi_{k}\frac{\partial\mathbf{a}^{[l]}}{\partial W^{[l]}}


\displaystyle\Psi_{k}=\begin{cases}1&k=0\\ \displaystyle\prod_{j=1}^{k}\frac{\partial\mathbf{z}^{[l+2j]}}{\partial\mathbf% {a}^{[l+2(j-1)]}}&k>0\end{cases}
Comments 0
Request Comment
You are adding the first comment!
How to quickly get a good reply:
  • Give credit where it’s due by listing out the positive aspects of a paper before getting into which changes should be made.
  • Be specific in your critique, and provide supporting evidence with appropriate references to substantiate general statements.
  • Your comment should inspire ideas to flow and help the author improves the paper.

The better we are at sharing our knowledge with each other, the faster we move forward.
The feedback must be of minimum 40 characters and the title a minimum of 5 characters
Add comment
Loading ...
This is a comment super asjknd jkasnjk adsnkj
The feedback must be of minumum 40 characters
The feedback must be of minumum 40 characters

You are asking your first question!
How to quickly get a good answer:
  • Keep your question short and to the point
  • Check for grammar or spelling errors.
  • Phrase it like a question
Test description