Making Good on LSTMs' Unfulfilled Promise
Abstract
LSTMs promise much for financial timeseries analysis and for temporal and cross-sectional inference, but we find that they do not deliver in a real-world financial management task. We examine an alternative called Continual Learning (CL), a memory-augmented approach that can provide transparent explanations: which memory did what, and when. This work has implications for many financial applications, including credit, time-varying fairness in decision making and more. We make three important new observations. Firstly, as well as being more explainable, timeseries CL approaches outperform both an LSTM and a simple sliding window learner (a feedforward neural net (FFNN)). Secondly, we show that CL based on a sliding window learner (FFNN) is more effective than CL based on a sequential learner (LSTM). Thirdly, we examine how real-world timeseries noise impacts several similarity approaches used in CL memory addressing. We provide these insights using an approach called Continual Learning Augmentation (CLA), tested on a complex real-world problem: emerging market equities investment decision making. CLA provides a test bed as it can be based on different types of timeseries learner, allowing LSTM and sliding window (FFNN) learners to be tested side by side. CLA is also used to test several distance approaches used in a memory recall-gate: euclidean distance (ED), dynamic time warping (DTW), autoencoder (AE) distance and a novel hybrid approach, warpAE. We find that CLA outperforms simple LSTM and FFNN learners, and that CLA based on a sliding window learner (CLA-FFNN) outperforms an LSTM-based implementation (CLA-LSTM). For memory addressing, ED underperforms DTW and AE, while warpAE shows the best overall performance in a real-world financial task.
Keywords: Continual learning, timeseries, LSTM, similarity, DTW, autoencoder
1 Introduction
Both LSTMs Hochreiter_1997 and a wide range of timeseries approaches suffer from a common problem: how to deal with long versus short term dependencies Koutn14; Mozer:1991:IMT:2986916.2986950; Gers_Schidhuber_LSTMs_Timeseries_2001. This problem is a manifestation of catastrophic forgetting (CF), one of the major impediments to the development of artificial general intelligence (AGI), where the ability of a learner to generalise to older tasks is corrupted by learning newer tasks French1999CatastrophicFI; MCCLOSKEY1989109. To address this problem, Continual Learning (CL) has been developed. Although very few timeseries CL approaches exist, some have the advantage of interpretable memory addressing Philps_2018, in contrast to typical LSTMs. Better interpretability would be a significant advantage for real-world financial problems, but there are a number of open questions for timeseries CL. Firstly, should a financial timeseries CL approach be based on a recurrent architecture (e.g. LSTM) or a sliding window architecture (e.g. feedforward neural net (FFNN))? Secondly, what impact does timeseries noise have on the memory functions of a CL approach? Thirdly, how would these choices translate to performance in a real-world financial problem when compared with LSTMs and a simple sliding window approach (FFNN)?
This study empirically examines these questions using a flexible and explainable CL approach called Continual Learning Augmentation (CLA), applied to the complex real-world problem of driving stock selection investment decisions in emerging market equities.
2 Design Choices for Timeseries CL
Machine learning based timeseries approaches can be broadly separated into:

Sliding window: Dividing a timeseries into a series of discrete modelling steps. E.g. FFNN.
Sequential: Attempting to model a timeseries process. E.g. LSTM.
A sliding window allows the choice of a wide range of learners, such as OLS regression Fama93commonrisk or feed forward neural networks (FFNN), applied in a step-forward fashion. More specialist models have been developed, such as time delayed neural nets (TDNN) Waibel_TDNN_1989, but these are still constrained by the choice of time delays to use. The major shortcoming of sliding window approaches is that, as time steps on, all information that moves out of the sliding window is forgotten.
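The step-forward use of a sliding window can be sketched as follows. This is a minimal illustration rather than the study's code: the function name `sliding_windows`, the window length and the sample values are all assumptions made for the example. It shows how, at each step, only the most recent points are visible to the learner and older points drop out.

```python
from typing import List, Sequence, Tuple


def sliding_windows(series: Sequence[float], window: int) -> List[Tuple[List[float], float]]:
    """Split a series into (training window, next value) pairs, one step at a time."""
    pairs = []
    for end in range(window, len(series)):
        history = list(series[end - window:end])  # only the last `window` points are visible
        target = series[end]                      # the value to forecast at this step
        pairs.append((history, target))
    return pairs


# Illustrative values only (hypothetical quarterly returns).
quarterly = [0.01, 0.03, -0.02, 0.04, 0.00, 0.05, -0.01]
steps = sliding_windows(quarterly, window=4)
for history, target in steps:
    print(history, "->", target)
```

By the final step the first two observations are no longer in the window, which is the forgetting behaviour described above.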
Sequential approaches, while still requiring a sliding window for longer series, otherwise address the arbitrary window size concern, but they have to model potentially complex cross-sectional and short and long term temporal relationships sequentially. While there are many types of sequential learner Thomas_Sequential_2002, LSTMs are probably the most popular. Although LSTMs are able to solve timeseries problems that sliding window approaches cannot, sliding window approaches have outperformed LSTMs on seemingly simpler timeseries problems Gers_Schidhuber_LSTMs_Timeseries_2001. The interpretability of LSTMs has also been questioned Guo2019_.
In either case, FFNNs and LSTMs applied to a time-evolving dataset are exposed to CF Schak_2019, which occurs when a learner is applied in a step-forward fashion. For example, a FFNN that is first trained to accurately approximate a model in one time-period, and is then trained in a later time-period, may see a deterioration in accuracy when applied to the first time-period again. CL has been proposed to address CF, using implicit or explicit memory for the purpose. In many cases similarity plays a part in driving this liu_2015; Fei:2016:LCB:2939672.2939835; Shu_2018, but in the real world of complex, noisy timeseries, similarity can be more subjective. Temporal dependencies, changing modalities and more Schlimmer1986 all complicate gauging time-varying similarity and generally add computational expense. The impact of these effects on a learner is sometimes called concept drift Schlimmer1986; Widmer1996.
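The retraining example in this paragraph can be reproduced with a deliberately tiny model. The sketch below swaps the FFNN for a one-parameter least-squares fit so that it stays self-contained; the two "states" (y = 2x and y = -x) and all names are hypothetical and chosen only to make the effect visible.

```python
from typing import Sequence


def fit_slope(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Least-squares slope of y = w * x (no intercept)."""
    return sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)


def mean_abs_error(w: float, xs: Sequence[float], ys: Sequence[float]) -> float:
    """Average absolute forecast error of the one-parameter model."""
    return sum(abs(w * x - y) for x, y in zip(xs, ys)) / len(xs)


xs = [1.0, 2.0, 3.0, 4.0]
period_a = [2.0 * x for x in xs]    # hypothetical earlier state: y = 2x
period_b = [-1.0 * x for x in xs]   # hypothetical later state:   y = -x

w = fit_slope(xs, period_a)                      # train on the earlier period
error_a_before = mean_abs_error(w, xs, period_a)

w = fit_slope(xs, period_b)                      # retrain on the later period, overwriting w
error_a_after = mean_abs_error(w, xs, period_a)  # re-apply to the earlier period

print(error_a_before, error_a_after)
```

Because the single parameter is overwritten by retraining, the error on the earlier period rises once the later period has been learned. A memory that kept the earlier parameter would avoid this.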
This piece addresses whether a sequential or a sliding window approach is best in a noisy real-world financial task and examines timeseries similarity in a CL context. In the following sections we introduce relevant research, describe the CLA approach and then conduct simulation tests with analysis and conclusions. Firstly, we test a sliding window approach (FFNN) and a sequential learner (LSTM), finding that the FFNN applied as a sliding window is the best performer. Secondly, we test different timeseries similarity approaches, which are used to drive CLA’s memory recall-gate. We find that simple euclidean distance (ED) underperforms the noise invariant similarity approaches, dynamic time warping (DTW) and autoencoders (AE), and that the best performing similarity approach is a novel hybrid, warpAE.
3 Related work
While CF remains an open question, many techniques have been developed to address it, including gated neural networks Hochreiter_1997, explicit memory structures Weston, prototypical addressing Snell_2017, weight adaptation Hinton_Distilling_2015; Sprechmann_2018, task rehearsal Silver_2002 and encoder based lifelong learning Triki2017, to name a few. As researchers have addressed the initial challenges of CL, other problems have emerged, such as the overhead of external memory structures Rae_2016_sparsereads, problems with weight saturation Kirkpatrick_2017, transfer learning LopezPaz_2017 and the drawbacks of outright complexity DBLP:journals/corr/ZarembaS15. While most CL approaches aim to learn sequentially, only a fraction have focused on timeseries Kadous_TS_2002:; Graves:2006:CTC:1143844.1143891; Lipton_TS_Modeling; Thomas_TS_2017. It is also unclear how effective these approaches would be for state-based CL in long term, noisy, non-stationary timeseries, particularly those commonly found in finance. As most CL approaches are applied to well defined, generally labelled and typically stylized tasks LopezPaz_2017, this is a motivating point for timeseries CL.
3.1 Remembering
Regime switching models KimNelson1999 and change point detection Pettitt1979 provide a simplified answer to identifying changing states in timeseries. Their major disadvantage is that change points between regimes (or states) are notoriously difficult to identify out of sample fabozzi2010quantitative, and existing econometric approaches are limited by long term, parametric assumptions Engle_1999; Zhang_2010; Siegmund_2013. There is also no guarantee that a change point represents a significant change in the accuracy of an applied model, which is a more useful perspective for learning different states. Residual change instead observes change in the absolute error of a learner, aiming to capture as much information as possible about changes in the relation between independent and dependent variables.
Different forms of residual change have been developed Brown_1975; Jandhyala_1986; Jandhyala_1989; MacNeilt_1985; Bai_1991; Gama_COnceptDriftAdapt_2014. However, most approaches assume a single or known number of change points in a series and are less applicable to a priori change points or multivariate series Yu_2007. Some drift detection approaches address these issues Gama_COnceptDriftAdapt_2014 but tend to be applied to simple, generally instance based memory Widmer1996; Maloof2000; Klinkenberg_2004; Gomes_2010 exhibiting gradual forgetting Koychev00gradualforgetting, rather than the task-oriented CL memory used to address CF. With the advent of timeseries CL, residual change may have another interesting application.
3.2 Recalling
CL approaches that use external memory structures require an appropriate memory addressing mechanism (a way of storing and recalling a memory). Memory addressing is generally based on a similarity measure such as cosine similarity Graves_14; graves2016hybrid; Park_2017, kernel weighting Vinyals_2016, use of linear models Snell_2017 or instance-based similarities, many using K-nearest neighbours Kaiser_2017; Sprechmann_2018. More recently, autoencoders (AE) have been used to gauge similarity in the context of multi-task learning (MTL) Aljundi17 and for memory consolidation Triki2017. However, these methods, as they have been applied, are not obviously well suited to assessing similarity in noisy multivariate timeseries. In contrast, noise invariant timeseries distance measures have been researched extensively Cha2007, generally for timeseries classification (TSC). While simple euclidean distance (ED) offers a rudimentary approach for comparing timeseries, it is highly sensitive to the timing of datapoints, something that has been addressed by dynamic time warping (DTW) Sakoe1978. However, DTW requires normalized data and is computationally expensive, although some mitigating measures have been developed Zhang:2017:DTW:3062405.3062585. A relatively small subset of data mining research has used deep learning based approaches, such as convolutional neural nets (CNN) Zheng_CNN_2014. While results have been encouraging, interpretability is still an open question. Another interesting possibility is to use AEs to cope with timeseries noise by varying manifold dimensionality and by using simple activation functions (e.g. ReLU) to introduce sparsity.
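The sensitivity of ED to the timing of datapoints, and the way DTW relaxes it, can be seen on a toy pair of series. The sketch below is a textbook dynamic-programming DTW with an absolute-difference cost, not the implementation used in the study; the two series are invented for illustration.

```python
import math
from typing import Sequence


def euclidean(a: Sequence[float], b: Sequence[float]) -> float:
    """Point-by-point distance; both series must have equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def dtw(a: Sequence[float], b: Sequence[float]) -> float:
    """Classic O(len(a) * len(b)) dynamic time warping with absolute-difference cost."""
    inf = float("inf")
    cost = [[inf] * (len(b) + 1) for _ in range(len(a) + 1)]
    cost[0][0] = 0.0
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            step = abs(a[i - 1] - b[j - 1])
            # Extend the cheapest of the three admissible predecessor alignments.
            cost[i][j] = step + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[len(a)][len(b)]


spike = [0.0, 0.0, 1.0, 0.0, 0.0, 0.0]
shifted = [0.0, 0.0, 0.0, 1.0, 0.0, 0.0]   # same shape, one step later

print(euclidean(spike, shifted))  # penalises the timing difference
print(dtw(spike, shifted))        # warping aligns the two spikes
```

The nested loop over both series is also where the computational expense noted above comes from.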
4 The Testbed: Continual Learning Augmentation
Continual learning augmentation (CLA) memory-augments a conventional learner for timeseries regression. The aim is to allow well understood learners to be used in a CL framework in an interpretable way. CLA’s memory functions are applied as a sliding window stepping forward through time, over input data of one or more timeseries. The approach is initialized with an empty memory structure and a chosen, parameterized base learner. This base learner can be a sequential approach or a sliding window approach and can be applied to a multivariate input series of several variables over a number of timesteps. The chosen base learner produces a forecast value in each period as time steps forward. A remember-gate appends a new memory to the memory structure on a remember cue, defined by the change in the base learner’s absolute error at a given time point. A recall-gate balances a mixture of base and memory forecasts to produce the final outcome. Figure 1 shows the functional steps of remembering and recalling learner-memories.
4.1 Memory management
Repeating patterns are required in subsequences of the input data to provide memory cues to remember and recall different past states. Learner parameters trained in a given past state can then be applied if that state approximately reoccurs in the future. When CLA forms a memory, it is stored as a column in an explicit memory structure, similar to Ciresan_2012, which changes in size over time as new memories are remembered and old ones forgotten. Each memory column consists of a copy of a past base learner parameterization and a representation of the training data used to learn those parameters. As the sliding window steps into a new time period, CLA recalls one or more learner-memories by comparing the latest input data with the representation of the training data stored in each memory column. Memories with training data that are more similar to the current input series will have a higher weight applied to their output and therefore make a greater contribution to the final CLA output.
4.2 Remember-Gate
Remembering is triggered by changes in the absolute error series of the base learner as the approach steps forward through time:
(1) 
CLA interrogates the base learner for changes in out-of-sample error, which are assumed to be associated with changes in state. The remember-gate both learns to define a change and triggers on it, storing a pairing of the parameterization of the base learner and a contextual reference. Figure 1 shows how a change is detected by the remember-gate, which then results in a new memory column being appended to the memory structure:
(2) 
Immediately after the remember event has occurred, a new base learner is trained on the current input, overwriting the previous base learner parameterization.
Theoretically, for a fair model of a state, the base learner’s error would have an approximately zero valued mean. The current base model would therefore cease to be a fair representation of the current state when its error exceeds a certain confidence interval, in turn implying a change in state. A critical level for the absolute error indicates that a change point in state has occurred. Memories are only stored when the observed absolute error series spikes above this critical level.
The critical level is a hyperparameter, optimized at every timestep to give the sensitivity to remembering that forms the external memory with the lowest empirical forecasting error for the CLA approach over the study term up to the current time:
(3) 
Here the CLA approach is expressed as a function of the input series and the critical level, yielding the absolute error of the base learner at each time. The candidate critical levels form a 20 point, equidistant set between the minimum and the maximum values of the base learner’s absolute error, representing five-percent intervals in its empirical distribution.
4.3 Recall-Gate
The recall of memories takes place in the recall-gate, which calculates a mixture of the predictions from the current base learner and from learner-memories.
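The remember cue and the search over candidate critical levels can be sketched as below. This is an illustration under stated assumptions, not the authors' implementation: the function names, the inclusive treatment of the grid endpoints, the sample error values and the stand-in scoring function are all hypothetical. In the approach described above, each candidate level would be scored by running CLA over the study term to date, which is represented here by a caller-supplied `cla_error` function.

```python
from typing import Callable, List, Sequence


def candidate_levels(abs_errors: Sequence[float], points: int = 20) -> List[float]:
    """Equidistant grid between the smallest and largest observed absolute error."""
    lo, hi = min(abs_errors), max(abs_errors)
    step = (hi - lo) / (points - 1)   # assumption: both endpoints are included
    return [lo + k * step for k in range(points)]


def remember_cue(abs_error_now: float, critical_level: float) -> bool:
    """True when the latest absolute error spikes above the critical level."""
    return abs_error_now > critical_level


def best_level(levels: Sequence[float], cla_error: Callable[[float], float]) -> float:
    """Pick the level with the lowest error reported by a caller-supplied evaluation."""
    return min(levels, key=cla_error)


errors = [0.02, 0.03, 0.02, 0.15, 0.04, 0.03, 0.21, 0.05]   # illustrative values only
levels = candidate_levels(errors)

# Stand-in for a full CLA back-test: pretend the error is lowest near a level of 0.10.
chosen = best_level(levels, cla_error=lambda level: abs(level - 0.10))

memory = []   # each entry pairs a parameter snapshot with a contextual reference
for t, err in enumerate(errors):
    if remember_cue(err, chosen):
        memory.append({"time": t, "params": f"snapshot-{t}", "context": f"window-{t}"})

print(round(chosen, 4), [m["time"] for m in memory])
```

A lower critical level makes the gate more sensitive and stores more memories, while a higher one stores fewer, which is the trade-off the optimization is described as resolving.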
(4) 
The mixture coefficients are derived by comparing the similarity of the current time varying context with the contextual references stored with each individual memory. Memories that are more similar to the current context have a greater weight in CLA’s final outcome.
4.4 Recall: Testing Measures of Similarity
Several approaches for calculating contextual similarity are tested separately, using the CLA approach. Each is used to define the contextual reference stored with a memory, either by simply storing past training examples or by using a process of contextual learning, essentially learning a representation of base learner training data.
ED and then DTW are applied first. Both approaches require the contextual reference to be raw training examples, which must be stored in each respective memory column, making both approaches relatively resource hungry. Secondly, AE distance is used through a process of contextual learning. Rather than needing to store many training examples in a memory column, only the AE parameters are needed to form a reconstruction of the training data, with the disadvantage that an AE must be trained at every timestep. Thirdly, we introduce a DTW filtered AE distance, which is intended to phase adjust the AE distance calculation. We call this warpAE. Again, an AE needs to be trained at every timestep, but DTW processing expense is reduced as it is only run on AE reconstructions. We describe each approach in turn.
ED and DTW are applied only to a subset of randomly sampled instances from the current input and from the stored training examples, sampling over rows, each of which represents a different security in the dataset:
(5) 
(6) 
Here the result is the dissimilarity, computed over a chosen number of samples whose row indices are drawn as random integers.
AE distance is used in a similar fashion to Aljundi et al. (2017) Aljundi17, using ReLU activations to avoid overfitting. However, CLA’s use of AEs is different: AEs are used for contextual learning for memory management, to cope with noisy, real-world, multivariate timeseries. The use of ReLU units aims to allow generalisation over the noise of otherwise similar timeseries subsequences. Additionally, the similarities returned from CLA’s AE implementation are also used to balance memory weightings:
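A row-sampled dissimilarity of this kind might look like the sketch below. Several details are assumptions made for the example and are not taken from the study: rows are compared index by index (the same security in both matrices), sampling is with replacement, the sampled distances are averaged, and the names `sampled_dissimilarity` and `row_distance` are hypothetical. ED is used for the row distance; a DTW function could be substituted.

```python
import math
import random
from typing import List, Sequence


def row_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Euclidean distance between two rows of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def sampled_dissimilarity(current: List[List[float]], stored: List[List[float]],
                          n_samples: int, seed: int = 0) -> float:
    """Mean row-wise distance over randomly sampled rows (one row per security)."""
    rng = random.Random(seed)                                   # seeded for repeatability
    n_rows = min(len(current), len(stored))
    picks = [rng.randrange(n_rows) for _ in range(n_samples)]   # sampled with replacement
    return sum(row_distance(current[i], stored[i]) for i in picks) / n_samples


# Illustrative data: three securities, four lagged values each.
current = [[0.1, 0.2, 0.3, 0.4], [0.5, 0.4, 0.3, 0.2], [0.0, 0.1, 0.0, 0.1]]
stored = [[v + 1.0 for v in row] for row in current]   # every value shifted by one

print(sampled_dissimilarity(current, current, n_samples=2))
print(sampled_dissimilarity(current, stored, n_samples=2))
```

Sampling keeps the cost of the comparison roughly fixed as the number of securities grows, at the price of run-to-run variation, which is why repeated simulations are useful.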
(7) 
The AE distance is the reconstruction loss of the current input, calculated as a euclidean distance, using the AE’s encoder and decoder functions. warpAE is designed to gain the AE’s benefit of lower memory usage than DTW while also benefiting from the phase invariant loss of DTW:
(8) 
These (dis)similarities are used to determine which memories to recall from the memory structure and also how to weight the contribution of each memory to CLA’s final outcome.
These different similarity functions were each tested in CLA’s memory recall-gate in turn, to gain new insights into the effectiveness of each similarity approach in a CL system when applied to a complex multivariate timeseries problem.
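The two AE-based dissimilarities can be illustrated with a hand-set, untrained autoencoder. Everything specific in this sketch is an assumption: the weights are fixed by hand rather than learned, the function names are hypothetical, and warpAE is read here as "DTW between the input and its AE reconstruction", which is one plausible interpretation of the description above rather than a confirmed formula.

```python
import math
from typing import List, Sequence

Matrix = List[List[float]]


def matvec(w: Matrix, x: Sequence[float]) -> List[float]:
    return [sum(wi * xi for wi, xi in zip(row, x)) for row in w]


def reconstruct(w_enc: Matrix, w_dec: Matrix, x: Sequence[float]) -> List[float]:
    """Encode with ReLU activations, then decode linearly."""
    code = [max(0.0, h) for h in matvec(w_enc, x)]
    return matvec(w_dec, code)


def dtw(a: Sequence[float], b: Sequence[float]) -> float:
    """Dynamic time warping with absolute-difference cost."""
    inf = float("inf")
    cost = [[inf] * (len(b) + 1) for _ in range(len(a) + 1)]
    cost[0][0] = 0.0
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            step = abs(a[i - 1] - b[j - 1])
            cost[i][j] = step + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[len(a)][len(b)]


def ae_distance(w_enc: Matrix, w_dec: Matrix, x: Sequence[float]) -> float:
    """Reconstruction loss as a euclidean distance."""
    recon = reconstruct(w_enc, w_dec, x)
    return math.sqrt(sum((xi - ri) ** 2 for xi, ri in zip(x, recon)))


def warp_ae_distance(w_enc: Matrix, w_dec: Matrix, x: Sequence[float]) -> float:
    """Assumed warpAE reading: DTW between the input and its reconstruction."""
    return dtw(x, reconstruct(w_enc, w_dec, x))


# Hand-set 4-2-4 autoencoder that averages adjacent pairs (not trained).
w_enc = [[0.5, 0.5, 0.0, 0.0], [0.0, 0.0, 0.5, 0.5]]
w_dec = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]

familiar = [1.0, 1.0, 2.0, 2.0]     # pattern this toy AE reproduces exactly
unfamiliar = [1.0, 3.0, 2.0, 2.0]   # pattern it cannot reproduce

print(ae_distance(w_enc, w_dec, familiar), ae_distance(w_enc, w_dec, unfamiliar))
print(warp_ae_distance(w_enc, w_dec, familiar), warp_ae_distance(w_enc, w_dec, unfamiliar))
```

In both cases only the AE weights need to be kept with a memory, and an input resembling the memory's training data is expected to reconstruct with a lower loss than one that does not.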
4.5 Balancing
The base learner and all recalled memories are weighted by similarity to produce CLA’s final outcome, using the recall-gate:
(9) 
The mixture is taken over all of the memories in the memory structure. Previous research indicated that this was a more powerful approach than selecting a single memory Philps_2018. (Notably, both of these balancing approaches significantly outperform equal weighting of all memories, indicating that CLA is gaining significantly more than a simple ensemble effect.)
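A similarity-weighted mixture of this kind can be sketched as follows. The specific weighting rule, normalised inverse dissimilarity, is an assumption chosen for illustration and is not taken from the study; the function names and the numbers are likewise hypothetical.

```python
from typing import List, Sequence


def similarity_weights(dissimilarities: Sequence[float], eps: float = 1e-12) -> List[float]:
    """Turn dissimilarities into weights that sum to one; smaller distance, larger weight."""
    inverse = [1.0 / (d + eps) for d in dissimilarities]   # eps guards against division by zero
    total = sum(inverse)
    return [v / total for v in inverse]


def balance(forecasts: Sequence[float], dissimilarities: Sequence[float]) -> float:
    """Similarity-weighted mixture of the base forecast and the recalled memory forecasts."""
    weights = similarity_weights(dissimilarities)
    return sum(w * f for w, f in zip(weights, forecasts))


# Entry 0 is the base learner; entries 1 and 2 are two recalled memories (illustrative numbers).
forecasts = [0.02, 0.10, -0.04]
dissimilarities = [0.5, 0.25, 1.0]

weights = similarity_weights(dissimilarities)
mixed = balance(forecasts, dissimilarities)
print([round(w, 3) for w in weights], round(mixed, 4))
```

Setting all dissimilarities equal reduces this to an equal-weighted ensemble, the baseline that the balancing approaches are reported to outperform.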
5 Investment Simulation Setup
CLA is used as a test bed for different learners and similarity approaches in a regression task to forecast future expected returns of individual equity securities. This is used to drive equities investment simulations, a real-world task using noisy timeseries. The data set consisted of stock level characteristics at each timestep. Tests were conducted to show the relative performance of a sliding window base learner, FFNN, and a sequential base learner, LSTM. Different similarity approaches were also used to drive the memory recall-gate: ED, DTW, AE and warpAE.
Base learners were batch trained over all stocks at each timestep, forecasting US$ total returns 12 months ahead for each stock. For the sliding window learner, a year-long, fixed length sliding window of four quarters was used for training; for the sequential learner, all historic data up to the current time was used for training. A stock level forecast in the top (bottom) decile of the stocks in a time-period was interpreted as a buy (sell) signal.
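The decile rule for turning forecasts into signals can be sketched as below. The function name, the tie handling (plain sort order) and the hypothetical tickers are assumptions made for the example.

```python
from typing import Dict, List


def decile_signals(forecasts: Dict[str, float]) -> Dict[str, List[str]]:
    """Label the top tenth of forecasts as buys and the bottom tenth as sells."""
    ranked = sorted(forecasts, key=forecasts.get, reverse=True)   # best forecast first
    n = max(1, len(ranked) // 10)                                 # size of one decile
    return {"buy": ranked[:n], "sell": ranked[-n:]}


# Hypothetical tickers with illustrative 12-month return forecasts.
forecasts = {f"stock_{i:02d}": i / 100.0 for i in range(20)}
signals = decile_signals(forecasts)
print(signals)
```

With 20 stocks, two names fall in each extreme decile, so the long and short sides each hold two positions.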
Although CLA is designed to use non-traditional driver variables, stock level characteristics are commonly expressed using factor loadings. These were estimated in-sample at each timestep by regressing each stock level US$ excess return stream against style factor excess returns. The variables in this regression are the excess return of each stock in each period, the excess return of the Emerging Market Equities Index and the relative return of the Emerging Market Value Equities Index.
Stock level factor loadings populated a matrix which comprised the input data. Each row represented a stock appearing in the index at a given time (up to 5,500 stocks) and each column related to a coefficient calculated on a specific time lag. The final input resulted from winsorizing the raw input to eliminate outliers.
Long/short model portfolios were constructed (i.e. rebalanced) every six months over the study term, using equal weighted long positions (buys) and shorts (sells). The simulation encompassed 5,500 equities in total, covering 26 countries across emerging markets, corresponding to an Emerging Market Equities Index between 2006 and 2017. To account for the DTW sampling approach used and differences in random initialisation of neural components, several simulations were carried out per test.
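Winsorizing can be sketched as clamping values beyond chosen percentiles to the percentile values. The cut-offs used here (10th and 90th percentiles), the nearest-rank percentile rule and the sample loadings are assumptions for illustration; the cut-offs used in the study are not stated.

```python
from typing import List, Sequence


def winsorize(values: Sequence[float], lower_pct: float = 0.10,
              upper_pct: float = 0.90) -> List[float]:
    """Clamp values outside the chosen empirical percentiles to the percentile values."""
    ordered = sorted(values)
    last = len(ordered) - 1
    lo = ordered[int(round(lower_pct * last))]   # nearest-rank lower bound
    hi = ordered[int(round(upper_pct * last))]   # nearest-rank upper bound
    return [min(max(v, lo), hi) for v in values]


# Illustrative factor loadings with two extreme estimates.
loadings = [0.9, 1.1, 1.0, 0.8, 1.2, 1.05, 0.95, 1.15, 0.85, 12.0, -9.0]
clean = winsorize(loadings)
print(clean)
```

Unlike trimming, this keeps every stock in the cross-section and only limits how far an extreme loading can pull the learner.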
6 Performance and Interpretation
7 Simulation Results
CLA results showed a significant augmentation benefit for both base learners (see 2 b), while tests of similarity approaches favoured noise invariant approaches over simple ED.
Sliding window learner tests, CLA-FFNN, outperformed all the equivalent sequential learner tests, CLA-LSTM, in terms of total return (TR), and their Sharpe ratios (see 2 a) were also superior (although none were significant at the 5% level). However, augmentation benefit, gauged by relative return (RR) and information ratio (Info Ratio), was superior for CLA-LSTM (2 b), with most augmentation tests for both learners statistically significant at the 5% level. In these tests, although CLA-LSTM saw the better augmentation benefit (RR), CLA-FFNN saw the strongest outright performance (TR), followed by unaugmented FFNN (given by TR minus RR), then CLA-LSTM. By far the weakest outright performer (TR) was unaugmented LSTM (also given by TR minus RR).
Tests of different similarity approaches, used in the recall-gate, saw ED underperform DTW in TR terms and also in terms of augmentation benefit. This was true for both learners tested. This would imply that the invariance to phase that DTW provides is an important consideration in a real-world context. AE distance tests showed higher TRs than DTW and demonstrated statistically significant augmentation benefits at the 5% level for both learners, indicating that AE distance is an appropriate approach to use in this context. warpAE generated the highest RR and information ratios of all similarity tests, implying that adding a DTW filter to AE distance was the most interesting similarity approach tested.
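For readers less familiar with these measures, the sketch below uses standard textbook definitions of the Sharpe ratio and the information ratio, with the augmented-minus-unaugmented return as the active return. The annualisation factor, the zero risk-free rate, the function names and the return figures are assumptions for illustration; the exact conventions used in the study are not given here.

```python
import statistics
from typing import Sequence


def sharpe_ratio(returns: Sequence[float], risk_free: float = 0.0,
                 periods_per_year: int = 2) -> float:
    """Annualised mean excess return divided by its sample standard deviation."""
    excess = [r - risk_free for r in returns]
    return statistics.mean(excess) / statistics.stdev(excess) * periods_per_year ** 0.5


def information_ratio(augmented: Sequence[float], unaugmented: Sequence[float],
                      periods_per_year: int = 2) -> float:
    """Annualised mean active return divided by its sample standard deviation."""
    active = [a - u for a, u in zip(augmented, unaugmented)]   # per-period relative return
    return statistics.mean(active) / statistics.stdev(active) * periods_per_year ** 0.5


# Illustrative six-monthly returns for an augmented learner and its unaugmented base.
augmented = [0.06, 0.02, 0.08, 0.04]
unaugmented = [0.03, 0.01, 0.04, 0.02]

print(sharpe_ratio(augmented), information_ratio(augmented, unaugmented))
```

The information ratio isolates what augmentation adds over the base learner, which is why a learner can show the larger augmentation benefit without the higher total return.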
7.1 Interpretable Memory
CLA produces outcomes that can be explained and attributed to its memories. Figure 3 shows an example of one of the simulation runs, CLA-FFNN with AE similarity, and shows how certain memories were applied at certain time points to result in specific outcomes. In this case at least three memories are remembered (lower chart, black lines) and recalled at different future times. A learner-memory remembered in January 2007, a period of turbulence in financial markets, adds the most value. It proves more appropriate than the base learner in the period of the 2008 financial crisis and its aftermath, which involved concerted fiscal stimulus (Sept 2008 to Dec 2010). It was recalled again in 2013 and then in 2016, both also periods where fiscal stimulus dominated market returns (in Europe and China respectively).
8 Conclusion
We have empirically demonstrated that, when applied to a real-world financial task involving noisy timeseries, a CL augmented sliding window learner (CLA-FFNN) is superior to an LSTM and superior to a CL augmented LSTM learner (CLA-LSTM). Testing of different similarity approaches, applied to a recall-gate, showed poor performance of simple euclidean distance (ED) when compared to dynamic time warping (DTW). This strongly implies that the timing of datapoints is crucial in this task and likely in other real-world problems involving noisy timeseries. Simulation tests also showed that AE distance is a good alternative to DTW. These results imply that AE dimensionality reduction and generalisation (using ReLU units in this case) are almost equivalent to DTW driven memory recall. warpAE was proposed to benefit from both the AE’s dimensionality reduction and DTW’s phase invariance, an approach that produced the strongest investment performance and augmentation benefit of the similarity approaches tested. We also show that timeseries CL not only outperforms an LSTM base learner but can also provide a transparent explanation of which memory did what, and when. In summary, the most successful CL choices were found to be a sliding window CLA-FFNN learner combined with a recall-gate using warpAE similarity. These tests also affirm Continual Learning Augmentation (CLA) as a real-world timeseries CL approach, with the flexibility to augment different types of learners.
8.1 Future work
We have tested our approach on many financial datasets, but it could, in principle (and by design), be used on many other financial timeseries problems. This might include applications to credit scoring, analysis of time/state-varying fairness in decision making and more.