LSTM-MSNet: Leveraging Forecasts on Sets of Related Time Series with Multiple Seasonal Patterns
Generating forecasts for time series with multiple seasonal cycles is an important use-case for many industries nowadays. Accounting for the multi-seasonal patterns becomes necessary to generate more accurate and meaningful forecasts in these contexts. In this paper, we propose Long Short-Term Memory Multi-Seasonal Net (LSTM-MSNet), a decomposition-based, unified prediction framework to forecast time series with multiple seasonal patterns. The current state of the art in this space are typically univariate methods, in which the model parameters of each time series are estimated independently. Consequently, these models are unable to include key patterns and structures that may be shared by a collection of time series. In contrast, LSTM-MSNet is a globally trained Long Short-Term Memory network (LSTM), where a single prediction model is built across all the available time series to exploit the cross-series knowledge in a group of related time series. Furthermore, our methodology combines a series of state-of-the-art multi-seasonal decomposition techniques to supplement the LSTM learning procedure. In our experiments, we are able to show that on datasets from disparate data sources, like e.g. the popular M4 forecasting competition, a decomposition step is beneficial, whereas in the common real-world situation of homogeneous series from a single application, exogenous seasonal variables or no seasonal preprocessing at all are better choices. All options are readily included in the framework and allow us to achieve competitive results for both cases, outperforming many state-of-the-art multi-seasonal forecasting methods.
Time series forecasting has become a key-enabler of modern day business planning by landscaping the short-term, medium-term and long-term goals in an organisation. As such, generating accurate and reliable forecasts is becoming a perpetual endeavour for many organisations, leading to significant savings and cost reductions. The complex nature of the properties present in a time series, such as seasonality, trend, and level, may bring numerous challenges to produce accurate forecasts. In terms of seasonality, a time series may exhibit complex behaviour such as multiple seasonal patterns, non-integer seasonality, calendar effects, etc.
As sensors and data storage capabilities advance, time series with higher sampling rates (sub-hourly, hourly, daily) are becoming more common in many industries, e.g. in the utility demand industry (electricity and water usage). Fig. 1 illustrates an example of half-hourly energy consumption of an Australian household that exhibits both daily (period = 48) and weekly (period = 336) seasonal patterns. A longer version of this time series may even exhibit a yearly seasonality (period = 17532), representing seasonal effects such as summer and winter. Here, the variations among energy consumption patterns within a day, i.e., morning and evening, workday and weekday cause daily and weekly seasonal patterns in the time series. While the daily and weekly seasonality can be used to determine the short-term energy demand in households, the yearly seasonality may explain the seasonal effects towards the energy consumption, which can be beneficial in long-term energy planning. Therefore, accurate modelling of such multiple seasonal cycles is necessary to estimate the demand on various time horizons. Particularly in the energy industry, accurate short-term and long-term load forecasting may lead to better demand planning and efficient resource management. In addition to the utility demand industry, the demand variations in the transportation, tourist, and healthcare industries can also be largely influenced by multiple seasonal cycles.
The current methods to handle multiple seasonal patterns are mostly statistical forecasting techniques [2, 3] that are univariate. Thus, they treat each time series as an independent sequence of observations, and forecast it in isolation. The univariate time series forecasting is not able to exploit any cross series information available in a set of time series that may be correlated and share a large amount of common features. This is a common characteristic observed in the realm of “Big Data,” where often large collections of related time series are available. Examples for these are sales demand of related product assortments in retail, server performance measures in computer centres, household smart meter data, etc. This can be applied to the time series shown in Fig. 2, in which these energy consumption patterns of various households can be similar and may share key properties in common. As a result, efforts to build global models across multiple related time series is becoming increasingly popular, and these methods have achieved state-of-the-art performance in recent studies [4, 5, 6, 7, 8, 9]. The recent success is mainly around Recurrent Neural Networks (RNN) and Long Short-Term Memory Networks (LSTM) that are naturally suited in modelling sequence data.
Although several unified models have been proposed to learn better under these circumstances, how to handle multiple-seasonal patterns in a set of time series has not yet been thoroughly studied. Moreover, the competitiveness of such global models highly rely on the characteristics of the time series. To this end, in this paper, we propose LSTM-MSNet, a novel forecasting framework using LSTMs that effectively accounts for the multiple seasonal periods present in a time series. Following the recent success, our model borrows the strength across a set of related time series to improve the forecast accuracy. This enables our model to untap the common seasonality structures and behaviours available in a collection of time series. As a part of the LSTM-MSNet architecture, we introduce a host of decomposition techniques to supplement the LSTM learning procedure, following the recommendations of [10, 11, 12, 7]. Nevertheless, competitiveness of such global models can be affected by the homogeneous characteristics present in the collection of time series . Therefore, LSTM-MSNet introduces two training paradigms to accommodate both homogeneous and inhomogeneous groups of time series. Our model is evaluated using several time series databases, including a competition dataset and real-world datasets, which contain multiple seasonal patterns, exhibiting different levels of seasonal homogeneity.
Ii Background and related work
The traditional approaches to model time series with seasonal cycles are mostly state-of-the-art univariate statistical forecasting methods such as exponential smoothing methods  and autoregressive integrated moving-average (ARIMA) models . The basic forms of these algorithms are only suited in modelling a single seasonality, and unable to account for multiple seasonal patterns.
Nonetheless, over the past decade, numerous studies have been conducted to extend the traditional statistical forecasting models to accommodate multiple seasonal patterns [14, 15, 16, 17, 18]. An early study developed by Harvey et al.  introduces a model to suit time series with two seasonal periods. Later, Taylor  adapts the simple Holt-Winters method to capture seasonalities by introducing multiple seasonal components to the linear version of the model. Gould et al.  propose an innovation state space approach to model multiple seasonalities, in which various forms of seasonal patterns can be incorporated, i.e., additive seasonality and multiplicative seasonality. Later, Taylor and Snyder  overcome the limitations of Gould et al.  by introducing a parsimonious version of the seasonal exponential smoothing approach. However, the majority of these techniques suffer from over-parameterisation, optimisation problems, and are also unable to model complex seasonal patterns in a time series. For more detailed discussions of these weaknesses, we refer to De Livera et al. . A more flexible and parsimonious version of an innovation state space modelling framework was developed by De Livera et al. , aiming to address various challenges associated with seasonal time series, such as modelling multiple seasonal periods, non-integer seasonality, and calendar effects. Today, these proposed seasonal models, i.e., BATS and TBATS, are considered state-of-the-art statistical techniques to model time series with multiple seasonal patterns.
Time series decomposition is another popular strategy to handle time series with complex seasonal patterns [19, 2]. Here, the time series is decomposed into a trend, seasonal, and residual component. Each component is modelled separately, so that the model complexity is less than forecasting the original time series as a whole. For example, this approach is applied by Nowicka-Zagrajek and Weron , where those authors initially decompose time series using a moving average technique. The seasonally adjusted time series is then modelled separately using an ARMA process. Moreover, Lee and Ko  use a lifting scheme, a different decomposition technique, to separate the original time series at different load frequency levels. Afterwards, individual ARIMA models are built to forecast each decomposed sub series separately.
In parallel to these developments, neural networks (NNs) have been advocated as a strong alternative to traditional statistical forecasting methods in forecasting seasonal time series. The favourable properties towards forecasting, such as universal function approximation [20, 21], in theory position NNs as a competitive machine learning approach to model underlying seasonality in a time series. Though early studies postulate the suitability of NNs in modelling seasonal patterns [22, 23], more recent studies advise that deseasonalising the time series prior to modelling is useful to achieve better forecasting accuracy from NNs [10, 12, 11, 24, 25]. Here, deseasonalisation refers to the process of removing the seasonal component from a time series. More specifically, Nelson et al.  and Ben Taieb et al.  empirically show the accuracy gains by including a deseasonalisation process with NNs. Furthermore, Zhang and Qi  highlight that NNs are unable to model trend or seasonality directly, thus detrending or deseasonalisation is necessary to produce accurate forecasts with NNs. Meanwhile, Dudek  develops a local learning based approach to deal with multiple seasonal cycles. Though this obviates the need of time series decomposition, the local learning procedure that matches similar seasonal patterns in a time series tends to weaken the global generalisability of the model.
More recently, deep neural networks have drawn significant attention among forecasting practitioners. In particular, RNNs and convolutional neural networks (CNN) have exhibited promising results, outperforming many state-of-the-art statistical forecasting methods [4, 5, 6, 7, 8, 27]. Nevertheless, in spite of the substantial literature available on deep learning in time series forecasting, only few attempts have been undertaken to explicitly handle multiple seasonal patterns in a time series [8, 28]. Lai et al. in  introduce a combination of CNN and RNN architectures to model short and long term dependencies in a time series, and employ a skipped connection architecture to model different seasonal periods. Bianchi et al.  implement a seasonal differencing strategy to select the most significant seasonal pattern, i.e., single seasonality present in a time series to forecast the short-term energy load. Moreover, the winning submission of the recently concluded M4 forecasting competition , Exponential Smoothing-Recurrent Neural Network (ES-RNN), uses a hybrid approach to forecast the hourly time series category with two seasonalities. However, the original implementation of ES-RNN restricts the number of seasonalities to two, and also due to the limitations of the underlying models (Holt-Winters) that operate in this approach, the ES-RNN is not suitable to handle long term seasonalities in a time series (e.g., yearly seasonality in an hourly time series) .
Iii LSTM-MSNet Framework
In this section, we first formally define the problem of forecasting with multiple seasonal patterns, and then discuss the components of the proposed LSTM-MSNet architecture shown in Fig. 3 in detail.
Iii-a Problem Statement
Let be the th time series from time series in our database. The past observations of the time series are given by , where represents the length of the time series . We introduce the seasonality periods of time series as , where is the highest seasonal period present in the time series . The primary objective of this study is to develop a global prediction model , which uses previous observations of all the time series, i.e., to forecast number of future data points , i.e., }, while accounting for all the available seasonal periods present in the time series. Here, is the intended forecasting horizon of time series . The model can be defined as follows:
Here, are the model parameters of our LSTM-MSNet prediction model.
Fig. 3 illustrates a schematic overview of the LSTM-MSNet forecasting framework. LSTM-MSNet is a forecasting framework designed to forecast time series with multiple seasonal patterns. The architecture of LSTM-MSNet is a fusion of statistical decomposition techniques and recurrent neural networks. The LSTM-MSNet has three layers, namely: 1) the pre-processing layer, which consists of a normalisation and variance stabilising phase, and a seasonal decomposition phase, 2) the recurrent layer, which consists of an LSTM based stacking architecture to train the network, and 3) a post-processing layer to denormalise and reseasonalise the time series to derive the final forecasts. The proposed framework can be used with any RNN variant such as LSTMs, Gated Recurrent Units (GRUs), and others. In this paper, we select LSTMs, a promising RNN variant, as our primary network training module. In the following sections, we discuss each layer of the LSTM-MSNet in detail.
Iii-B Normalisation and Variance Stabilisation Layer
The proposed LSTM-MSNet is a global model that is built across a group of time series. Therefore, performing a data normalisation strategy becomes necessary as in a collection of time series, each time series may contain observations with different value ranges. Hence, we use the mean-scale transformation strategy, which uses the mean of a time series as the scaling factor. This scaling strategy can be defined as follows:
Here, represents the normalised observation, and represents the number of observations of time series .
After normalising the time series, we stabilise the variance in the group of time series by transforming each time series to a logarithmic scale. Apart from the variance stabilisation, the log transformation also enables the conversion of the seasonality form in a given time series to an additive form. This is a necessary requirement for additive time series decomposition techniques employed in our decomposition layer. The transformation can be defined in the following way:
Here, denotes a time series, and is the corresponding log transformed time series .
Iii-C Seasonal Decomposition
As highlighted in Section II, when modelling seasonal time series with NNs, many studies suggest applying a prior seasonal adjustment, i.e., deseasonalisation to the time series [10, 12, 11]. The main intention of this approach is to minimise the complexity of the original time series, and thereby reducing the subsequent effort of the NN’s learning process. In line with these recommendations, LSTM-MSNet initially uses a deseasonalisation strategy to detach the multi-seasonal components from a time series. To accommodate this, we use a series of statistical decomposition techniques that support separating multi-seasonal patterns in a time series. We also configure these methods to extract various forms of seasonality, i.e., deterministic and stochastic seasonality to assess their sensitivity towards the forecast accuracy. Next, we briefly describe the different types of decomposition techniques used in our study. An overview of the methods is given in Table I.
Iii-C1 Multiple STL Decomposition (MSTL)
MSTL extends the original version of Seasonal-Trend Decomposition (STL) , to allow for decomposition of a time series with multiple seasonal cycles. The STL method additively decomposes a time series into trend, seasonal, and remainder components. In other words, the original series can be reconstructed by summing the decomposed parts of the time series. The additive decomposition can be formulated as follows:
Here, represents the observation at time , and , , refers to the seasonal, trend, and the remainder components of the observation, respectively.
In MSTL, the STL procedure is used iteratively to estimate the multiple seasonal components in a time series. So, the original version of Equation 4 can be extended to reflect the decomposition of MSTL as follows:
Here, denotes the number of distinct seasonal patterns decomposed by the MSTL. In our study, we use the R  implementation of the MSTL algorithm,
mstl, from the
forecast package [33, 34]. MSTL also supports controlling the smoothness of the change of seasonal components extracted from the time series, i.e., configuring the
s.window parameter. For example, by adjusting the
s.window parameter to “periodic”, the MSTL decomposition limits the change in the seasonal components to zero. This enables us to separate the deterministic seasonality from a time series. In Table I, we give the two
s.window parameter values used in our experiments.
Iii-C2 Seasonal-Trend decomposition by Regression (STR)
STR is a regression based decomposition technique introduced by Dokumentov et al. . The division is additive, hence the decomposition accords with Equation 5. In contrast to STL, STR is capable of incorporating multiple external regressors to the decomposition procedure, while allowing to account for external factors that may influence the seasonal patterns in a time series. However, to make our comparisons unbiased, we use STR in the default mode, without including any exogenous regressors. In R, the STR algorithm is available through the
AutoSTR function from the
stR package .
Iii-C3 Trigonometric, Box-Cox, ARMA, Trend, Seasonal (TBATS)
As highlighted in Section II, the TBATS model was developed to handle complex seasonal patterns present in a time series . This method is currently established as a state-of-the-art technique to forecast time series with multiple seasonal cycles. Particularly, the inclusion of trigonometric expression terms has enabled TBATS to identify sophisticated seasonal terms in a time series (for details see Livera et al. )).
Prophet is an automated forecasting framework developed by Taylor and Letham . The main aim of this framework is to address the challenges involved in forecasting at Facebook, the employer of those authors at that time. The challenges include the task of forecasting time series with multiple seasonal cycles. The underlying model of Prophet uses an additive decomposition layer similar to Equation 5. However, this division introduces an additional term to model holidays as seasonal covariates. After including the holiday terms, Equation 5 can be rewritten as follows:
Here, denotes the holiday covariates in the model that represent the effects of holidays. Likewise in TBATS, we use Prophet in the Decomposition layer to obtain the multiple seasonal components present in a time series. We achieve this by applying the Prophet algorithm available through the
prophet package in R .
Iii-C5 Fourier Transformation
Fourier terms are a flexible approach to model periodic effects in a time series . For example, let be an observation of time series at time . The seasonal terms relevant to can be approximated by Fourier terms as follows:
Here, refers to the th seasonal periodicity in the time series. Thereby, we can define an amount of seasonal periodicities available in a time series. The parameter in Equation 7 is the number of , pairs used for the transformation process. This essentially controls the momentum of the seasonality, where a higher allows to represent a seasonal pattern that changes more quickly, compared to a lower . In our case, for each seasonal periodicity in the time series, a separate must be introduced. We generate these Fourier terms using the
fourier function available in the
forecast package. In our experiments, we use a parameter grid, which ranges from to , to determine the optimal values in Fourier terms. Moreover, we consider (with least number of ) as a special use case in Fourier terms and report separately as a variant of LSTM-MSNet.
The overall summary of the aforementioned methods is shown in Table I. Here, the Package column provides a reference to the software implementation used in our experiments. The table furthermore indicates the type of seasonalities extracted by each method.
Iii-D Recurrent Layer
The second layer, the Recurrent Layer, is the primary prediction module of LSTM-MSNet, equipped with LSTMs. RNNs, and in particular LSTMs, have been embraced by many fields that involve sequence modelling tasks, such as Natural Language Processing , speech recognition , image generation , and more recently have received a great amount of attention in time series research [43, 5, 6, 7, 25].
In the LSTM, the gating mechanism together with the self-contained memory cell enables the network to capture non-linear long-term temporal dependencies in a sequence. We configure the input and forget gates of the LSTM network to include the previous state of the memory cell (). This configuration is also known as “LSTM with peephole connections”, in which the hidden state () at time can be computed from the following equations:
Here, , , , and represent the weight matrices of input gate, forget gate, output gate, and memory cell gates respectively, while is the input at time . Also, , , , and denote the corresponding input weight matrices, and , , are the respective peephole weight matrices. The biases of the gates are represented by , , , and . refers to the candidate cell state, which is used to update the state of the original memory cell (see Equation 11). Moreover, is the element-wise multiplication operation, while represents the sigmoid activation function. We use a hyperbolic tangent function, i.e., tanh as our hidden update activation function in Equation 10.
Iii-D1 Moving Window Transformation
As a preprocessing step, we transform the past observations of time series () into multiple pairs of input and output frames using a Moving Window (MW) strategy. Later, these frames are used as the primary training source of LSTM-MSNet.
In summary, the MW strategy converts a time series of length into records, where each record has an amount of observations. Here, refers to the length of the output window, and is the length of the input window. These frames are generated according to the Multi-Input Multi-Output (MIMO) principle used in multi-step forecasting, which directly predicts all the future observations up to the intended forecasting horizon . Training the NNs in this way has the advantage of avoiding the potential error accumulation at each forecasting step [12, 6, 44]. Therefore, we choose the size of the output window equivalent to the length of the intended forecast horizon . Fig. 4 illustrates an example of applying the MW approach to the hourly time series T48 of the M4 dataset.
To train the LSTM-MSNet, we use many observations from time series and reserve the last output window for network validation.
Iii-D2 Training Paradigms
In this study, we propose to use the output of the decomposition layer in two different ways. These paradigms can be distinguished by the time series components used in the MW process, and later in the LSTM-MSNet training procedure. In the following, we provide a short overview of these two paradigms.
Deseasonalised Approach (DS)
This approach uses seasonally adjusted time series as MW patches to train the LSTM-MSNet. Since the seasonal components are not included in DS for the training procedure, a reseasonalisation technique is later introduced in the Post-processing layer of LSTM-MSNet to ascertain the corresponding multiple seasonal components of the time series.
Seasonal Exogenous Approach (SE)
This second approach uses the output of the pre-processing layer, together with the seasonal components extracted from the multi-seasonal decomposition as external variables. Here, in addition to the normalised time series (without the deseasonalisation phase), the seasonal components relevant to the last observation of the input window are used as exogenous variables in each input window. As the original components of the time series are used in the training phase of SE, the LSTM-MSNet is expected to forecast all the components of a time series, including the relevant multi-seasonal patterns. Therefore, a reseasonalisation stage is not required by SE.
In summary, DS supplements the LSTM-MSNet by excluding the seasonal factors in the LSTM-MSNet training procedure. This essentially minimises the overall training complexity of the LSTM-MSNet. In contrast, SE supplements LSTM-MSNet in the form of exogenous variables that assist modelling the seasonal trajectories of a time series.
Iii-D3 LSTM Learning Scheme
As highlighted earlier, we use the past observations of time series , in the form of input and output windows to train the LSTM-MSNet. In our work, we follow the LSTM design guidelines recommended by Hewamalage et al. . Fig. 5 illustrates the primary LSTM learning architecture of LSTM-MSNet. This consists of four components, namely: Training input window layer, LSTM stacking layer, Dense layer and Training output window layer. Here, represents the input window at time step . Also, the projected cell output of the LSTM at time step is represented by . Here represents the size of the output window, which is identical to the forecasting horizon . Moreover, the hidden and the cell states of the LSTM are denoted by and . We use an affine neural layer (a fully connected layer; ), excluding the bias component to map each LSTM cell output to the dimension of the output window .
Before feeding these windows to the network for training, each input and output window is subjected to a local normalisation process to avoid possible network saturation effects caused by the bounds of the network activation functions . In the DS approach, we use the trend component of the last value of the input window as a local normalisation factor, whereas in SE we use the mean value of each input window as the normalisation factor. Afterwards, these factors are subtracted from each data point in the corresponding input and output window. The above LSTM learning scheme is implemented using TensorFlow .
Iii-D4 Loss Function
We use the L1-norm, as the primary learning objective function, which essentially minimises the absolute differences between the target values and the estimated values. This has the advantage of being more robust to anomalies in the time series. The L1-loss is given by:
Here, refers to the actual observations of values in the output window at time step . The cell output of the LSTM at time step is defined by . Also, is the set of time steps used for training. We include an L2-regularisation term to minimise possible overfitting of the network. In Equation 14, is the regularisation parameter, refers to the network weights and is the number of weights in the network.
Iii-E Post-processing Layer
The reseasonalisation and renormalisation is the main component of the post processing layer in LSTM-MSNet. Here, in the reseasonalisation stage, the relevant seasonal components of the time series are added to the forecasts generated by the LSTM. This is computed by repeating the last seasonal components of the time series to the intended forecast horizon. As outlined in Section III-D2, SE does not require this phase. Next, in the renormalisation phase, the generated forecasts are back-transformed to their original scale by adding back the corresponding local normalisation factor, and taking the exponent of the values. The final forecasts are obtained by multiplying this vector by the scaling factor used for the normalisation process.
In this section, we evaluate the proposed variants of the LSTM-MSNet framework on three time series datasets. First, we describe the datasets, error metrics, hyper-parameter selection method, and benchmarks used in our experimental setup. Then, we provide a detailed analysis of the results obtained.
We use three benchmark time series datasets which present multiple seasonal cycles. Following is a brief overview of these datasets:
M4-Hourly Dataset : Hourly dataset from the M4 forecasting competition.
AusGrid-Energy Dataset : Half-hourly dataset, representing energy consumption of 300 households in Australia. We select general consumption (GC letter code in the dataset) as the primary measure of energy consumption in households. Firstly, we extract a subset of 3 months of half-hourly data (2012 July - 2012 October). Then, to evaluate LSTM-MSNet on multiple seasonal patterns, we aggregate the original half-hourly time series to hourly time series and extend the extraction to 2 years (2010 July - 2012 July), considering three seasonalities: daily, weekly, and yearly.
Table II summarises statistics of the datasets used in our experiments. Here, denotes the number of time series, denotes the length of each time series, denotes the sampling rate of the time series, represents the different seasonal cycles present in the time series, and is the relevant forecast horizon. Moreover, we choose the size of the input window equivalent to , following the heuristic proposed in [7, 25].
|M4-Hourly Dataset||414||700||hourly||(24, 168)||48|
|AusGrid-Energy Dataset (3 Months)||300||4704||half-hourly||(48, 336)||96|
|AusGrid-Energy Dataset (2 Years)||300||17600||hourly||(24, 168, 8766)||24|
We plot the seasonal components of our benchmark datasets to investigate the diversity of their seasonal distributions. For simplicity, we use MSTL as the primary decomposition technique to extract the multiple seasonal components from a time series. From Fig. 6, it is evident that there exists a high variation of seasonality among the time series in the M4 dataset. This can be attributed to a less homogeneous nature of the time series and different start/end calendar dates. On the other hand, it is clear that the distribution of the seasonal components is similar among the time series in the AusGrid-Energy datasets, and shows less variation compared to the M4 dataset, as the time series are homogeneous in the sense that they are all related to household energy consumption, and follow identical calendar dates. This seasonal diversity present in our benchmark datasets enables us to assess the robustness of the LSTM-MSNet framework under different seasonality conditions, i.e., inhomogeneous and homogeneous seasonalities.
Iv-B Error Metrics
To assess the accuracy of LSTM-MSNet against the benchmarks, we use two evaluation metrics commonly found in the forecasting literature, namely the symmetric Mean Absolute Percentage Error (sMAPE) and the Mean Absolute Scaled Error (MASE) . The sMAPE and MASE are defined as follows:
Here, represents the observation at time , and is the generated forecast. Also, denotes the number of data points in the test set and is the number of observations in the training set of a time series. We define , as the frequency of the highest available seasonality in the given time series (e.g., S = 168, S = 336, and S = 8766 for the M4-Hourly and AusGrid-Energy datasets, respectively). Furthermore, when calculating the sMAPE values for the energy datasets, to avoid problems for zero forecasts and actual observations, we add a constant term of to the denominator of Equation 15. To provide a broader overview of the error distributions, we compute the mean and median of these primary error measures. This includes, mean of the sMAPEs (Mean sMAPE), median of the sMAPEs (Median sMAPE), mean of the MASEs (Mean MASE), and median MASEs (Median MASE).
Iv-C Statistical tests of the results
We use the non-parametric Friedman rank-sum test to assess the statistically significant differences within the compared forecasting methods. Also, Hochberg’s post-hoc procedure is used to further examine these differences 111More information can be found on the thematic web site of SCI2S about Statistical Inference in Computational Intelligence and Data Mining http://sci2s.ugr.es/sicidm. The sMAPE error measure is used to perform the statistical testing, with a significance level of = 0.05.
Iv-D Hyper-parameter Tuning and Optimisation
The recurrent layer in LSTM-MSNet has various hyper-parameters. This includes LSTM cell dimension, number of epochs, hidden-layers, mini-batch-size and model regularisation terms. In the optimisation framework, we use COntinuous COin Betting  as our primary learning algorithm to train the network, which does not require the tuning of learning rate. In this way, we minimise the overall amount of hyper-parameters to be tuned in the network.
Furthermore, we automate the hyper-parameter selection process by employing a Bayesian global optimisation methodology that autonomously discovers the optimal set of parameters of an unknown function. Compared to other parameter optimisation selection techniques, such as Random Search and Grid Search, the Bayesian optimisation strategy is considered as a more systematic and/or more efficient approach. This is because, when determining the current pool of potential hyper-parameters for validation, it takes into account the previously visited hyper-parameter values in a non-trivial way . In our experiments, we use the python implementation of the sequential model-based algorithm configuration (SMAC)  that implements a version of the Bayesian hyper-parameter optimisation process. Table III summarises the bounds of the hyper-parameter values used throughout the LSTM-MSNet learning process, represented by the respective minimum and maximum values.
|Model Parameter||Minimum value||Maximum value|
Iv-E Benchmarks and LSTM-MSNet variants
We compare our developments against a collection of current state-of-the-art techniques in forecasting with multiple seasonal cycles. This includes Tbats , Prophet , and FFORMA . We also use two variants of Dynamic-Harmonic-Regression  as the benchmarks. In our experiments, Dynamic-Harmonic-Regression (T) represents the variant that uses the
tslm function from the
forecast package, whereas the Dynamic-Harmonic-Regression (A) variant uses the
auto.arima function from the
Based on the training paradigms defined in Section III-D2, we introduce variants of the LSTM-MSNet methodology for our comparative evaluation. These methods are summarised in Table IV, represented by the corresponding training paradigm and decomposition technique. Moreover, as the baseline, we use LSTM-MSNet, excluding the seasonal decomposition phase. In other words, we use original observations of the time series, without using DS or SE, to train the LSTM-MSNet. This is referred to as LSTM-Baseline in our experiments.
|LSTM Variant||Training Paradigm||Decomposition Technique|
|LSTM-Fourier-SE ()||SE||Fourier Transformation|
Iv-F Computational Performance
We also report the computational costs in execution time of our proposed LSTM-MSNet variants and the benchmark models on the aggregated hourly energy dataset. The experiments are run on an Intel(R) i7 processor (3.2 GHz), with 2 threads per core, 6 cores per socket, and 64GB of main memory.
Table V summarises the evaluation results of all the LSTM-MSNet variants and benchmarks for the 414 hourly series of the M4 competition dataset, ordered by the first column, which is the Mean sMAPE. For each column, the results of the best performing method(s) are marked in boldface. According to Table V, the proposed LSTM-MSTL-DS variant obtains the best Mean SMAPE, while FFORMA achieves the best Median SMAPE. We see that regarding the mean MASE, Dynamic-Harmonic-Regression with the auto.arima variant performs better than the rest of the benchmarks, whereas FFORMA outperforms the proposed LSTM variants, in terms of the median MASE. Also, on average, the LSTM-MSNet variants with the DS training paradigm achieve better accuracies compared to those of SE. Furthermore, the proposed LSTM-MSTL-DS variant consistently outperforms many state-of-the-art methods, such as TBATS, Prophet, and Dynamic-Harmonic-Regression variants in terms of Mean sMAPE. It is also noteworthy to mention that all the proposed LSTM-MSNet variants outperform our baseline model, LSTM-Baseline, in terms of mean sMAPE and mean MASE.
|Method||Mean sMAPE||Median sMAPE||Mean MASE||Median MASE|
Table VI shows the results of the statistical testing evaluation. Adjusted -values calculated from the Friedman test with Hochberg’s post-hoc procedure are presented. A horizontal line is used to separate the methods that perform significantly worse than the best performing method. The overall result of the Friedman rank sum test is a -value of , which is highly significant. The FFORMA method performs best and is used as the control method. Also, according to Table VI, we see that the FFORMA achieves significantly better results than the other methods.
Table VII shows the evaluation summary for the 300 half-hourly series of the AusGrid-Energy dataset. The NA values in the table represent models that could not complete the execution within a time frame of 6 days. It can be seen that the proposed LSTM-MSTL-SE variant outperforms all the benchmarks in terms of the Mean sMAPE and Median sMAPE. Meanwhile, the proposed LSTM-Prophet-DS variant achieves the best accuracy with respect to Mean MASE and Median MASE. Here, in the majority of the cases, the LSTM variants that use SE as the training paradigm obtain better forecasts, which is contrary to our previous findings from the M4 competition dataset. Also, several LSTM-MSNet variants with the DS training paradigm display poor performance compared to the LSTM-Baseline. Most importantly, we observe that the proposed LSTM variants LSTM-MSTL-SE and LSTM-Prophet-DS consistently surpass the current state of the art in all performance metrics.
|Method||Mean sMAPE||Median sMAPE||Mean MASE||Median MASE|
The Friedman rank sum test gives an overall -value of . Therefore, the differences among the benchmarks are highly significant. According to Table VIII, we see that the LSTM-MSTL-SE performs best and is used as the control method. Moreover, the LSTM-MSNet variants, LSTM-TBATS-SE, LSTM-Prophet-DS do not perform significantly worse than the control method.
Table IX provides the results for the evaluations on the aggregated hourly time series of the AusGrid-Energy dataset. We see that the proposed LSTM-Fourier-SE () variant achieves the best results on each performance metric, and outperforms the rest of the benchmarks. Also, among the proposed variants, we observe that the LSTM-MSNet variants with the SE training paradigm outperform their counterparts, the LSTM-MSNet variants with the DS training paradigm. Furthermore, the LSTM-Baseline method performs better than the LSTM-MSNet variants with DS training paradigm. Nevertheless, consistent with our previous findings from Table VII, the proposed LSTM-MSNet variants outperform the current state-of-the-art techniques; FFORMA, TBATS, and Prophet.
|Method||Mean sMAPE||Median sMAPE||Mean MASE||Median MASE|
The overall result of the Friedman rank sum test is a -value of , which means the results are highly significant. According to Table X, the LSTM-Fourier-SE () performs best and is used as the control method. We see that the LSTM-MSNet variants, LSTM-MSTL-SE, LSTM-TBATS-SE, LSTM-Prophet-SE, LSTM-Fourier-SE and FFORMA do not perform significantly worse than the control method.
Also, with respect to computational cost of the proposed variants and benchmarks, according to Table XI, except for LSTM-TBATS-DS and LSTM-TBATS-SE, which use TBATS as the decomposition technique, we see that the proposed LSTM-MSNet variants have a lower execution time compared to TBATS and FFORMA. The Dynamic-Harmonic-Regression (T) variant and Prophet are more computationally efficient than the LSTM-MSNet variants. However, we notice that both these methods do not display competitive results compared to the LSTM-MSNet variants, according to Tables V, VII, and IX. In contrary, LSTM-MSNet promises to deliver better performance than TBATS and FFORMA, with respect to both accuracy and computation time.
|Method||Pre-processing||Model-Training & Post-processing||Total-time|
It can be seen that the LSTM-Baseline results for the M4 and AusGrid-Energy datasets are contradictory. Even though the LSTM-Baseline model cannot outperform both DS and SE learning paradigms of LSTM-MSNet in the M4 dataset, it obtains better results than the DS learning paradigm in both AusGrid-Energy datasets. These results can be interpreted by the seasonal characteristics present in these datasets (Fig. 6). As discussed in Section IV-A, the seasonal components of the M4 dataset are less homogeneous. So, in this scenario, learning the seasonality directly from the original time series becomes difficult for LSTM-MSNet. Hence, removing the seasonal components prior to training (DS learning paradigm) has positively contributed towards the LSTM-MSNet results in the M4 dataset. This also explains the poor results of the LSTM-Baseline variant, which attempts to learn seasonality directly from the time series, without any assistance of DS and SE. We also note that lengths of the series are likely to have an effect here, in the sense that for longer series, we assume that the patterns can be learned easier, whereas for shorter series, deseasonalisation is beneficial. However, we do not analyse this in our current study. Furthermore, we observe the highly competitive results of FFORMA in the M4 dataset. FFORMA was the second-best performing method in the overall M4 competition, so it can be arguably seen as optimised for this particular dataset. Also, due to its nature as being an ensemble of simpler univariate techniques, it is inherently more suitable for this situation of a dataset of inhomogeneous series, whereas on the AusGrid-Energy datasets, it is less performant. There, due to the high presence of homogeneous seasonality, exclusive learning of the seasonality becomes viable for LSTM-MSNet. As a result, LSTM-Baseline achieves better results compared to LSTM-MSNet with the DS learning paradigm. However, in this scenario, we observe that the variants of LSTM-MSNet with the SE learning paradigm, achieve better results than the LSTM-Baseline model. This suggests that the extracted seasonal components have supplemented the LSTM-MSNet training procedure, in the form of external variables that assist to determine the trajectory of the multiple seasonal cycles.
In terms of decomposition techniques, MSTL, STR, and Prophet are the best performing methods in the M4 dataset. Whereas, on the AusGrid-Energy datasets, we see that the LSTM-MSNet variants with MSTL and Prophet decomposition techniques give better results. Also, the STR based LSTM-MSNet variants are unstable on the AusGrid-Energy datasets, which can be attributed to the longer lengths of the time series. Furthermore, according to Table XI, the MSTL and Prophet methods are computationally efficient decomposition techniques compared to TBATS.
These results indicate that our proposed LSTM-MSNet forecasting framework can be easily adapted, depending on the seasonal characteristics in a group of time series. The LSTM-MSNet with the DS learning paradigm is more suitable for situations where the origins of the time series are unknown and time series are inhomogeneous. Whereas its counter part, SE is better if the time series are homogeneous and share similar shapes of the seasonal components. Another important finding from this study is that the RNNs are competitive in situations where groups of time series are highly homogeneous. As discussed in Section I, this is the case in many real-world applications. However, in situations like in the M4 dataset, where the groups of time series exhibit highly heterogeneous characteristics, better competitiveness of the RNNs can be achieved by applying additional preprocessing steps such as deseasonalisation (DS learning paradigm) that supplements the RNN training procedure. Apart from these observations, we also see that for longer time series, the majority of the LSTM-MSNet variants are computationally more efficient than the state-of-the-art univariate forecasting techniques, such as TBATS and FFORMA.
In this paper, we have presented the LSTM-MSNet methodology, a novel, three-layered forecasting framework that is capable of forecasting a group of related time series with multiple seasonal cycles. Our methodology is based on time series decomposition and LSTM recurrent neural networks, to overcome the limitations of the current univariate state-of-the-art models by training a unified model that exploits key structures, behaviours, and patterns common within a group of time series.
We have utilised a series of decomposition techniques from the literature to extract the various forms of seasonal components in time series. Moreover, we have also discussed two training paradigms, the Deseasonalised approach and the Seasonal Exogenous approach, highlighting how these decomposition techniques can be used to supplement the LSTM learning procedure. We have identified that the choice of these learning paradigms can be determined by the characteristics of the time series, where a deseasonalised approach is more suitable for situations where the series have different seasonal patterns, and the start/end dates of the series are different. Whereas its counter part, the seasonal exogenous approach, is better if the time series are homogeneous and share similar shapes of the seasonal components. In general, MSTL, and Prophet are stable decomposition techniques for both shorter and longer time series, and achieve competitive results with a lower computational cost. We have evaluated the proposed forecasting framework using a competition dataset, and two energy consumption time series datasets that contain multiple seasonal patterns. Also, with respect to accuracy and computational time, we observe that our framework can be a competitive approach among the current state-of-the-art in his research space. Furthermore, somewhat contrary to widespread beliefs, the globally trained LSTM-MSNet can be computationally more efficient than many univariate forecasting methods.
As a possible future work, a hybrid version of this approach can be introduced to handle seasonalities in longer time series. Here, the deseasonalised approach can be used to model shorter seasonalities, whereas the seasonal exogenous approach can be applied to address the longer seasonalities.
This research was supported by the Australian Research Council under grant DE190100045, Facebook Statistics for Improving Insights and Decisions research award, Monash Institute of Medical Engineering seed funding, and MASSIVE - High performance computing facility, Australia.
-  AusGrid, “Innovation and research - ausgrid,” https://www.ausgrid.com.au/Industry/Innovation-and-research/, 2019, accessed: 2019-5-16.
-  C.-M. Lee and C.-N. Ko, “Short-term load forecasting using lifting scheme and ARIMA models,” Expert Syst. Appl., vol. 38, no. 5, pp. 5902–5911, May 2011.
-  G. E. P. Box, G. M. Jenkins, G. C. Reinsel, and G. M. Ljung, Time Series Analysis: Forecasting and Control. John Wiley & Sons, 2015.
-  A. Borovykh, S. Bohte, and C. W. Oosterlee, “Conditional time series forecasting with convolutional neural networks,” Mar. 2017. [Online]. Available: http://arxiv.org/abs/1703.04691
-  D. Salinas, V. Flunkert, and J. Gasthaus, “DeepAR: Probabilistic forecasting with autoregressive recurrent networks,” Apr. 2017. [Online]. Available: http://arxiv.org/abs/1704.04110
-  R. Wen, K. Torkkola, B. Narayanaswamy, and D. Madeka, “A Multi-Horizon quantile recurrent forecaster,” Nov. 2017. [Online]. Available: http://arxiv.org/abs/1711.11053
-  K. Bandara, C. Bergmeir, and S. Smyl, “Forecasting across time series databases using recurrent neural networks on groups of similar series: A clustering approach,” Expert Syst. Appl., vol. 140, Feb. 2020.
-  G. Lai, W.-C. Chang, Y. Yang, and H. Liu, “Modeling long- and Short-Term temporal patterns with deep neural networks,” in The 41st International ACM SIGIR Conference on Research & Development in Information Retrieval, ser. SIGIR ’18. New York, NY, USA: ACM, 2018, pp. 95–104.
-  K. Bandara, P. Shi, C. Bergmeir, H. Hewamalage, Q. Tran, and B. Seaman, “Sales demand forecast in e-commerce using a long Short-Term memory neural network methodology,” Jan. 2019. [Online]. Available: https://arxiv.org/abs/1901.04028
-  M. Nelson, T. Hill, W. Remus, and M. O’Connor, “Time series forecasting using neural networks: should the data be deseasonalized first?” J. Forecast., vol. 18, no. 5, pp. 359–367, 1999.
-  G. P. Zhang and M. Qi, “Neural network forecasting for seasonal and trend time series,” Eur. J. Oper. Res., vol. 160, no. 2, pp. 501–514, 2005.
-  S. Ben Taieb, G. Bontempi, A. F. Atiya, and A. Sorjamaa, “A review and comparison of strategies for multi-step ahead time series forecasting based on the NN5 forecasting competition,” Expert Syst. Appl., vol. 39, no. 8, pp. 7067–7083, Jun. 2012.
-  R. Hyndman, A. B. Koehler, J. Keith Ord, and R. D. Snyder, Forecasting with Exponential Smoothing: The State Space Approach. Springer Science & Business Media, 2008.
-  A. Harvey, S. J. Koopman, and M. Riani, “The modeling and seasonal adjustment of weekly observations,” J. Bus. Econ. Stat., vol. 15, no. 3, pp. 354–368, 1997.
-  J. W. Taylor, “Short-term electricity demand forecasting using double seasonal exponential smoothing,” J. Oper. Res. Soc., vol. 54, no. 8, pp. 799–805, Aug. 2003.
-  P. G. Gould, A. B. Koehler, J. K. Ord, R. Snyder, R. J. Hyndman, and F. Vahid-Araghi, “Forecasting time series with multiple seasonal patterns,” Eur. J. Oper. Res., vol. 191, no. 1, pp. 207–222, Nov. 2008.
-  J. W. Taylor and R. Snyder, “Forecasting intraday time series with multiple seasonal cycles using parsimonious seasonal exponential smoothing,” Omega, vol. 40, no. 6, pp. 748–757, Dec. 2012.
-  A. M. De Livera, R. J. Hyndman, and R. Snyder, “Forecasting time series with complex seasonal patterns using exponential smoothing,” J. Am. Stat. Assoc., vol. 106, no. 496, pp. 1513–1527, Dec. 2011.
-  J. Nowicka-Zagrajek and R. Weron, “Modeling electricity loads in california: ARMA models with hyperbolic noise,” Signal Processing, vol. 82, no. 12, pp. 1903–1915, Dec. 2002.
-  G. Cybenko, “Approximation by superpositions of a sigmoidal function,” Math. Control Signals Systems, vol. 2, no. 4, pp. 303–314, Dec. 1989.
-  K. Hornik, “Approximation capabilities of multilayer feedforward networks,” Neural Netw., vol. 4, no. 2, pp. 251–257, Jan. 1991.
-  Zaiyong Tang, C. de Almeida, and P. A. Fishwick, “Time series forecasting using neural networks vs. box- jenkins methodology,” Simulation, vol. 57, no. 5, pp. 303–310, 1991.
-  M. Marseguerra, S. Minoggio, A. Rossi, and E. Zio, “Neural networks prediction and fault diagnosis applied to stationary and non stationary ARMA modeled time series,” Prog. Nuclear Energy, vol. 27, no. 1, pp. 25–36, 1992.
-  W. Yan, “Toward automatic time-series forecasting using neural networks,” IEEE Trans Neural Netw Learn Syst, vol. 23, no. 7, pp. 1028–1039, 2012.
-  H. Hewamalage, C. Bergmeir, and K. Bandara, “Recurrent neural networks for time series forecasting: Current status and future directions,” arXiv [cs.LG], 2019. [Online]. Available: https://arxiv.org/abs/1909.00590
-  G. Dudek, “Forecasting time series with multiple seasonal cycles using neural networks with local learning,” in Artificial Intelligence and Soft Computing. Springer Berlin Heidelberg, 2013, pp. 52–63.
-  M. Han and M. Xu, “Laplacian echo state network for multivariate time series prediction,” IEEE Trans Neural Netw Learn Syst, vol. 29, no. 1, pp. 238–244, Jan. 2018.
-  F. M. Bianchi, E. Maiorino, M. C. Kampffmeyer, A. Rizzi, and R. Jenssen, “An overview and comparative analysis of recurrent neural networks for short term load forecasting,” May 2017. [Online]. Available: http://arxiv.org/abs/1705.04378
-  S. Makridakis, E. Spiliotis, and V. Assimakopoulos, “The M4 competition: Results, findings, conclusion and way forward,” Int. J. Forecast., vol. 34, no. 4, pp. 802–808, 2018.
-  S. Smyl, “A hybrid method of exponential smoothing and recurrent neural networks for time series forecasting,” Int. J. Forecast., Jul. 2019.
-  R. B. Cleveland, W. S. Cleveland, and I. Terpenning, “STL: A seasonal-trend decomposition procedure based on loess,” J. Off. Stat., vol. 6, no. 1, p. 3, 1990.
-  R Core Team, R: A Language and Environment for Statistical Computing, R Foundation for Statistical Computing, Vienna, Austria, 2013. [Online]. Available: http://www.R-project.org/
-  R. J. Hyndman, G. Athanasopoulos, S. Razbash, D. Schmidt, Z. Zhou, Y. Khan, C. Bergmeir, and E. Wang, “forecast: Forecasting functions for time series and linear models,” R package version, vol. 6, no. 6, p. 7, 2015.
-  Y. Khandakar and R. J. Hyndman, “Automatic time series forecasting: the forecast package for R,” J. Stat. Softw., vol. 27, no. 03, 2008.
-  A. Dokumentov, R. J. Hyndman, and Others, “STR: A seasonal-trend decomposition procedure based on regression,” Monash University, Department of Econometrics and Business Statistics, Tech. Rep., 2015.
-  A. Dokumentov and R. J. Hyndman, “str: STR decomposition,” 2018.
-  S. J. Taylor and B. Letham, “Forecasting at scale,” PeerJ Preprints, Tech. Rep. e3190v2, Sep. 2017.
-  S. Taylor and B. Letham, “prophet: Automatic forecasting procedure,” 2018.
-  A. C. Harvey and N. Shephard, “10 structural time series models,” in Handbook of Statistics. Elsevier, Jan. 1993, vol. 11, pp. 261–302.
-  T. Mikolov, M. Karafiát, L. Burget, J. Cernock\‘y, and S. Khudanpur, “Recurrent neural network based language model,” in Interspeech, vol. 2. fit.vutbr.cz, 2010, p. 3.
-  A. Graves, A. r. Mohamed, and G. Hinton, “Speech recognition with deep recurrent neural networks,” in 2013 IEEE International Conference on Acoustics, Speech and Signal Processing. ieeexplore.ieee.org, 2013, pp. 6645–6649.
-  K. Gregor, I. Danihelka, A. Graves, D. J. Rezende, and D. Wierstra, “DRAW: A recurrent neural network for image generation,” Feb. 2015.
-  H.-G. Zimmermann, C. Tietz, and R. Grothmann, “Forecasting with recurrent neural networks: 12 tricks,” in Neural Networks: Tricks of the Trade, ser. Lecture Notes in Computer Science. Springer, Berlin, Heidelberg, 2012, pp. 687–707.
-  S. Ben Taieb and A. F. Atiya, “A bias and variance analysis for Multistep-Ahead time series forecasting,” IEEE Trans Neural Netw Learn Syst, vol. 27, no. 1, pp. 62–76, Jan. 2016.
-  M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G. S. Corrado, A. Davis, J. Dean, M. Devin, S. Ghemawat, I. Goodfellow, A. Harp, G. Irving, M. Isard, Y. Jia, R. Jozefowicz, L. Kaiser, M. Kudlur, J. Levenberg, D. Mane, R. Monga, S. Moore, D. Murray, C. Olah, M. Schuster, J. Shlens, B. Steiner, I. Sutskever, K. Talwar, P. Tucker, V. Vanhoucke, V. Vasudevan, F. Viegas, O. Vinyals, P. Warden, M. Wattenberg, M. Wicke, Y. Yu, and X. Zheng, “TensorFlow: Large-Scale machine learning on heterogeneous distributed systems,” Mar. 2016.
-  R. J. Hyndman and A. B. Koehler, “Another look at measures of forecast accuracy,” Int. J. Forecast., 2006.
-  S. García, A. Fernández, J. Luengo, and F. Herrera, “Advanced nonparametric tests for multiple comparisons in the design of experiments in computational intelligence and data mining: Experimental analysis of power,” Inf. Sci., vol. 180, no. 10, pp. 2044–2064, May 2010.
-  F. Orabona and T. Tommasi, “Training deep networks without learning rates through coin betting,” in Proceedings of the 31st International Conference on Neural Information Processing Systems, ser. NIPS’17, USA, 2017, pp. 2157–2167.
-  J. Snoek, H. Larochelle, and R. P. Adams, “Practical bayesian optimization of machine learning algorithms,” in Advances in Neural Information Processing Systems 25, F. Pereira, C. J. C. Burges, L. Bottou, and K. Q. Weinberger, Eds. Curran Associates, Inc., 2012, pp. 2951–2959.
-  F. Hutter, H. H. Hoos, and K. Leyton-Brown, “Sequential model-based optimization for general algorithm configuration,” in Proceedings of the 5th International Conference on Learning and Intelligent Optimization, 2011, pp. 507–523.
-  P. Montero-Manso, G. Athanasopoulos, R. J. Hyndman, and T. S. Talagala, “FFORMA: Feature-based forecast model averaging,” Tech. Rep. 19/18, 2018.