Making Good on LSTMs' Unfulfilled Promise
Abstract
LSTMs promise much to financial time-series analysis, both temporal and cross-sectional inference, but we find they do not deliver in a real-world financial management task. We examine an alternative called Continual Learning (CL), a memory-augmented approach, which can provide transparent explanations: which memory did what and when. This work has implications for many financial applications, including credit, time-varying fairness in decision making and more. We make three important new observations. Firstly, as well as being more explainable, time-series CL approaches outperform LSTM and a simple sliding window learner (a feed-forward neural network (FFNN)). Secondly, we show that CL based on a sliding window learner (FFNN) is more effective than CL based on a sequential learner (LSTM). Thirdly, we examine how real-world, time-series noise impacts several similarity approaches used in CL memory addressing. We provide these insights using an approach called Continual Learning Augmentation (CLA), tested on a complex real-world problem: emerging market equities investment decision making. CLA provides a testbed as it can be based on different types of time-series learner, allowing testing of LSTM and sliding window (FFNN) learners side by side. CLA is also used to test several distance approaches used in a memory recall-gate: Euclidean distance (ED), dynamic time warping (DTW), autoencoder (AE) distance and a novel hybrid approach, warpAE. We find CLA outperforms simple LSTM and FFNN learners, and CLA based on a sliding window (CLA-FFNN) outperforms an LSTM (CLA-LSTM) implementation. For memory addressing, ED underperforms DTW and AE, while warpAE shows the best overall performance in a real-world financial task.
Keywords: Continual learning, time-series, LSTM, similarity, DTW, autoencoder
1 Introduction
Both LSTMs Hochreiter_1997 and a wide range of time-series approaches suffer from a common problem: how to deal with long- versus short-term dependencies Koutn14; Mozer:1991:IMT:2986916.2986950; Gers_Schidhuber_LSTMs_Timeseries_2001. This problem is a manifestation of catastrophic forgetting (CF), one of the major impediments to the development of artificial general intelligence (AGI), where the ability of a learner to generalise to older tasks is corrupted by learning newer tasks French1999CatastrophicFI; MCCLOSKEY1989109. To address this problem, Continual Learning (CL) has been developed. Although very few time-series CL approaches exist, some have the advantage of interpretable memory addressing Philps_2018, in contrast to LSTMs. The advantages of better interpretability would be significant for real-world financial problems, but there are a number of open questions for time-series CL. Firstly, should a financial time-series CL approach be based on a recurrent architecture (e.g. LSTM) or a sliding window architecture (e.g. a feed-forward neural network (FFNN))? What impact does time-series noise have on the memory functions of a CL approach? How would these choices translate to performance in a real-world financial problem when compared with LSTMs and a simple sliding window approach (FFNN)? This study empirically examines these questions in the context of the complex real-world problem of driving stock selection investment decisions in emerging market equities.
The rest of this paper is organised as follows. Section 2 introduces common design choices for time-series CL. Section 3 reviews related work, while section 4 discusses Continual Learning Augmentation: remember-gates, recall-gates and memory balancing. Section 5 describes the experimental setup, results and interpretability of the complex, real-world tests conducted, while section 6 concludes this paper.
2 Background
Machine learning based time-series approaches can be broadly separated into:

Sliding window: dividing a time-series into a series of discrete modelling steps, e.g. FFNN.

Sequential: attempting to model the time-series process itself, e.g. LSTM.
A sliding window allows the choice of a wide range of learners, such as OLS regression Fama93commonrisk or feed-forward neural networks (FFNN) applied in a step-forwards fashion. Specialist models have also been developed, such as time-delay neural networks (TDNN) Waibel_TDNN_1989, but these are still constrained by choices relating to which time delays to use. The major shortcoming of sliding-window approaches is that, as time steps on, all information that moves out of the sliding window is forgotten.
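As a minimal sketch of the sliding-window framing described above (the function name, window conventions and defaults are our own illustrative choices, not from the paper or any specific library):

```python
import numpy as np

def sliding_windows(series, window, horizon=1):
    """Frame a 1-D series as (lagged window -> horizon-ahead target) pairs.

    Hypothetical helper: each training example sees only the last
    `window` observations, so anything older is simply forgotten.
    """
    X, y = [], []
    for t in range(len(series) - window - horizon + 1):
        X.append(series[t:t + window])               # lagged inputs
        y.append(series[t + window + horizon - 1])   # future target
    return np.asarray(X), np.asarray(y)

X, y = sliding_windows(np.arange(10.0), window=3)
# first window is [0, 1, 2] with target 3
```

Any step-forward learner (OLS, FFNN) can then be fit on `X, y` at each timestep.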
Sequential approaches, while still requiring a sliding window for longer series, attempt to avoid this fairly arbitrary window-sizing problem. However, to do so these approaches have to model greater degrees of complexity Zhang_RNNComplexity_Skip_16, in a sequential fashion: cross-sectional dependencies, temporal dependencies and short- versus long-term dependencies. While there are many types of sequential learner Thomas_Sequential_2002, LSTMs are probably the most popular. While they are able to solve time-series problems that sliding window approaches cannot, sliding window approaches have outperformed LSTMs on seemingly simpler time-series problems Gers_Schidhuber_LSTMs_Timeseries_2001. Interpretability of LSTMs is also challenged Guo2019_. In either case, FFNNs and LSTMs applied to a time-evolving dataset are exposed to CF Schak_2019, which occurs when a learner is trained sequentially through time. For example, an FFNN that is first trained to accurately approximate a relationship in one time period, and is then retrained in a later time period, may see a deterioration in accuracy when applied to the earlier time period again. CL has been proposed to address CF, using implicit or explicit memory for the purpose. In many cases similarity plays a part in driving this liu_2015; Fei:2016:LCB:2939672.2939835; Shu_2018, but in the real world of complex, noisy time-series, similarity can be more subjective. Temporal dependencies, changing modalities and more Schlimmer1986 all complicate gauging time-varying similarity and generally add computational expense. The impact of these effects on a learner is sometimes called concept drift Schlimmer1986; Widmer1996.
This paper addresses whether a sequential or sliding window approach is best in a noisy real-world financial task and examines time-series similarity in a CL context. We test a sliding window approach (FFNN) and then a sequential learner (LSTM), finding that the FFNN applied as a sliding window is the best performer. Secondly, we test different time-series similarity approaches, which are used to drive CLA's memory recall-gate. We find simple Euclidean distance (ED) underperforms noise-invariant similarity approaches: dynamic time warping (DTW) and autoencoders (AE). We find the best performing similarity approach is a novel hybrid, warpAE.
3 Related Work
While CF remains an open question, many techniques have been developed to address it, including gated neural networks Hochreiter_1997, explicit memory structures Weston, prototypical addressing Snell_2017, weight adaptation Hinton_Distilling_2015; Sprechmann_2018, task rehearsal Silver_2002 and encoder based lifelong learning Triki2017, to name a few. As researchers have addressed the initial challenges of CL, other problems have emerged, such as the overhead of external memory structures Rae_2016_sparsereads, problems with weight saturation Kirkpatrick_2017, transfer learning LopezPaz_2017 and the drawbacks of outright complexity DBLP:journals/corr/ZarembaS15. While most CL approaches aim to learn sequentially, only a fraction have been focused on time-series Kadous_TS_2002; Graves:2006:CTC:1143844.1143891; Lipton_TS_Modeling; Thomas_TS_2017. It is also unclear how effective these approaches would be when applied to open-world, state-based, temporal learning in long-term, noisy, non-stationary time-series, particularly those commonly found in finance. As most CL approaches are applied to usually well-defined, generally labelled and typically stylized tasks LopezPaz_2017, this is a key motivator for a time-series specific CL.
3.1 Remembering
Regime-switching models KimNelson1999 and change point detection Pettitt1979 provide a simplified answer to identifying changing states in time-series, with the major disadvantage that change points between regimes (or states) are notoriously difficult to identify out of sample fabozzi2010quantitative, and existing econometric approaches are limited by long-term, parametric assumptions in their attempts Engle_1999; Zhang_2010; Siegmund_2013. There is also no guarantee that a change point represents a significant change in the accuracy of an applied model, a more useful perspective for learning different states. Residual change detection instead observes change in the absolute error of a learner, aiming to capture as much information as possible regarding changes in the relation between independent and dependent variables.
Different forms of residual change detection have been developed Brown_1975; Jandhyala_1986; Jandhyala_1989; MacNeilt_1985; Bai_1991; Gama_COnceptDriftAdapt_2014. However, most approaches assume a single or known number of change points in a series and are less applicable to a priori unknown change points or multivariate series Yu_2007. Some drift adaptation approaches address these issues Gama_COnceptDriftAdapt_2014 but tend to be applied to simple, generally instance-based memory Widmer1996; Maloof2000; Klinkenberg_2004; Gomes_2010, tending to suffer from gradual forgetting Koychev00gradualforgetting. This contrasts with the explicit, task-oriented memory structures of the sort used by CL to address CF. With the advent of time-series CL, residual change may have another interesting application.
3.2 Recalling
CL approaches that use external memory structures require an appropriate addressing mechanism (a way of storing and recalling a memory). Memory addressing is generally based on a similarity measure such as cosine similarity Graves_14; graves2016hybrid; Park_2017, kernel weighting Vinyals_2016, use of linear models Snell_2017 or instance-based similarities, many using K-nearest neighbours Kaiser_2017; Sprechmann_2018. More recently, autoencoders (AE) have been used to gauge similarity in the context of multi-task learning (MTL) Aljundi17 and for memory consolidation Triki2017. However, these methods, as they have been applied, are not obviously well suited to assessing similarity in noisy multivariate time-series. In contrast, researchers have extensively researched noise-invariant time-series distance measures Cha2007, generally for time-series classification (TSC). While simple Euclidean distance (ED) offers a rudimentary approach for comparing time-series, it has a high sensitivity to the timing of data points, something that has been addressed by dynamic time warping (DTW) Sakoe1978. However, DTW requires normalized data and is computationally expensive, although some mitigating measures have been developed Zhang:2017:DTW:3062405.3062585. A relatively small subset of data-mining research has used deep learning based approaches, such as convolutional neural networks (CNN) Zheng_CNN_2014. While results have been encouraging, interpretability is still an open question. Another interesting possibility is to use AEs to cope with time-series noise, by varying manifold dimensionality and by using simple activation functions (i.e. ReLU) to introduce sparsity.
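To make the contrast with ED concrete, the classic DTW recurrence can be sketched as follows (a textbook full dynamic-programming implementation with no warping window; mitigations such as windowing or lower bounds are omitted):

```python
import numpy as np

def dtw_distance(a, b):
    """Dynamic time warping distance between two 1-D sequences.

    O(len(a) * len(b)) dynamic program: each cell accumulates the local
    cost plus the cheapest of the three admissible warping moves,
    making the measure invariant to local phase shifts that would
    inflate a plain euclidean distance.
    """
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]

# A phase-shifted copy of the same pattern: DTW distance collapses to 0.
a = [0.0, 1.0, 2.0, 1.0, 0.0]
b = [0.0, 0.0, 1.0, 2.0, 1.0, 0.0]
```

Note that DTW happily compares sequences of different lengths, which plain ED cannot.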
4 Continual Learning Augmentation
Continual learning augmentation (CLA) memory-augments a conventional learner for time-series regression. The aim is to allow well understood learners to be used in a CL framework in an interpretable way. CLA's memory functions are applied as a sliding window stepping forward through time, over input data of one or more time-series. The approach is initialized with an empty memory structure and a chosen, parameterized base learner. This base learner can be a sequential approach or a sliding window approach applied to a multivariate input series with several variables over many timesteps. The chosen base learner produces a forecast value in each period as time steps forward. A remember-gate appends a new memory to the memory structure on a remember cue, defined by the change in the base learner's absolute error at that time point. A recall-gate balances a mixture of base and memory forecasts to produce the final outcome. Figure 1 shows the functional steps of remembering, recalling and balancing learner-memories.
4.1 Memory management
Repeating patterns are required in subsequences of the input data to provide memory cues to remember and recall different past states. Learner parameters trained in a given past state can then be applied if that state approximately reoccurs in the future. When CLA forms a memory, it is stored as a column in an explicit memory structure, similar to Ciresan_2012, which changes in size over time as new memories are remembered and old ones forgotten. Each memory column consists of a copy of a past base learner parameterization and a representation of the training data used to learn those parameters. As the sliding window steps into a new time period, CLA recalls one or more learner-memories by comparing the latest input data with the representation of the training data stored in each memory column. Memories with training data that are more similar to the current input series have a higher weight applied to their output and therefore make a greater contribution to the final CLA output.
4.2 Remember-Gate
Remembering is triggered by changes in the absolute error series of the base learner as the approach steps forward through time:
(1) 
CLA interrogates the base learner for changes in out-of-sample error, which are assumed to be associated with changes in state. The remember-gate both learns to define and to trigger a change event, which stores a pairing of the parameterization of the base learner and a contextual reference. Figure 1 shows how a change is detected by the remember-gate, which then results in a new memory column being appended to the memory structure:
(2) 
Immediately after the remember event has occurred, a new base learner is trained on the current input, overwriting the previous base parameterization.
Theoretically, for a fair model of a state, the error series would be approximately stationary with a zero-valued mean. Therefore the current base model would cease to be a fair representation of the current state when its absolute error exceeds a certain confidence interval, in turn implying a change in state. A critical level of absolute error indicates that a change point has occurred in state. This can be interpreted as a cue to remember the past state, triggered when the observed absolute error series spikes above this critical level.
This critical error level is essentially a hyperparameter of the CLA approach, estimated empirically. It can be optimized at every timestep to yield a level of sensitivity to remembering that forms an external memory resulting in the lowest empirical forecasting error for the CLA approach over the study term up to the current time:
(3) 
Here the CLA approach is expressed as a function of the input series and the critical error level, yielding the absolute error of the base learner at each time point. The candidate critical levels form an equidistant set between the minimum and maximum values of the absolute error observed up to the current time, representing a discretization of its empirical distribution from which to solve for the critical level empirically. In our testing the number of candidate levels was initialised to 20, representing five-percent buckets in the empirical range of the error.
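The remember cue and the candidate-level grid described above can be sketched as follows (a minimal illustration of Sec. 4.2; function names and the bucket convention are our own, and the paper additionally scores each candidate level against CLA's own forecast error, which is omitted here):

```python
import numpy as np

def epsilon_grid(abs_errors, n_buckets=20):
    """Equidistant candidate critical levels spanning the empirical
    range of the base learner's absolute errors (20 buckets gives
    five-percent steps, as in the paper's initialisation)."""
    lo, hi = float(np.min(abs_errors)), float(np.max(abs_errors))
    return np.linspace(lo, hi, n_buckets + 1)[1:]  # drop the minimum

def remember_triggered(abs_error_t, epsilon):
    """Remember-gate cue: fire when the out-of-sample absolute error
    spikes above the critical level epsilon."""
    return abs_error_t > epsilon

grid = epsilon_grid(np.array([0.0, 1.0]))  # 20 levels from 0.05 to 1.0
```

In CLA proper, the level actually used would be the grid member minimising empirical forecasting error up to the current time.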
4.3 Recall-Gate
The recall of memories takes place in the recall-gate, which calculates a mixture of the predictions from the current base learner and from learner-memories:
(4) 
The mixture coefficients are derived by comparing the similarity of the current time-varying context with the contextual references stored with each individual memory. Memories that are more similar to the current context have a greater weight in CLA's final outcome.
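A similarity-weighted mixture of this kind can be sketched as below. The inverse-distance weighting is our own illustrative choice: the paper specifies only that more similar memories receive greater weight, not this particular functional form.

```python
import numpy as np

def recall_mixture(base_pred, memory_preds, dissims):
    """Mix the base forecast with recalled memory forecasts.

    Smaller dissimilarity -> larger mixture coefficient. The base
    learner is treated as being at distance zero from the current
    context (an assumption for illustration).
    """
    preds = np.concatenate(([base_pred], memory_preds))
    sims = 1.0 / (1.0 + np.asarray([0.0] + list(dissims)))
    w = sims / sims.sum()          # coefficients sum to one
    return float(w @ preds)

# One memory at distance 0 is weighted equally with the base learner.
out = recall_mixture(0.0, [3.0], [0.0])
```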
4.4 Recall-Gate Similarity Choices
Several approaches for calculating contextual similarity are tested separately using the CLA approach. Each is used to define the contextual reference, either by simply storing past training examples or by a process of contextual learning: essentially learning a representation of the base learner's training data.
ED and then DTW are applied first. Both require the contextual reference to be raw training examples stored in each memory column, making both approaches relatively resource hungry. Secondly, AE distance is used through a process of contextual learning: rather than storing many training examples in a memory column, only the AE parameters are needed to form a reconstruction of the training data, with the disadvantage that an AE must be trained at every timestep. Thirdly, we introduce a DTW-filtered AE distance, intended to phase-adjust the AE reconstruction, which we call warpAE. Again, an AE needs to be trained at every timestep, but DTW processing expense is reduced as it is run only on AE reconstructions. We describe each approach in turn.
ED and DTW are applied only to a subset of randomly sampled instances from the current input and the stored training data, sampling over rows (i.e. cross-sectional data), each of which represents a different security at a given point in time:
(5) 
(6) 
Here the dissimilarity is averaged over a chosen number of sampled rows, with row indices drawn as random integers between 1 and the number of rows.
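The row-sampled ED variant can be sketched as follows (a minimal reading of Eq. (5); the sample size, shared indices for both matrices and mean aggregation are our own assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed for reproducibility

def sampled_euclidean(current, stored, n_samples=32):
    """Dissimilarity between the current input matrix and a memory's
    stored training matrix, averaged over randomly sampled rows
    (each row being one security at a point in time)."""
    n = min(len(current), len(stored))
    idx = rng.integers(0, n, size=n_samples)       # random row indices
    return float(np.mean(np.linalg.norm(current[idx] - stored[idx], axis=1)))

A = np.ones((10, 4))  # toy stock-by-factor matrix
```

The DTW variant of Eq. (6) would apply a warping distance to the same sampled rows instead of the row-wise norm.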
AE distance is used in a similar fashion to Aljundi et al. Aljundi17, using ReLU activations to avoid overfitting. However, CLA's use of AEs is different: AEs are used for contextual learning in memory management, to cope with noisy, real-world, multivariate time-series. The use of ReLU units aims to allow generalisation over the noise of otherwise similar time-series subsequences. Additionally, the similarities returned from CLA's AE implementation are also used to balance memory weightings:
(7) 
The reconstruction loss of the current input is calculated as a Euclidean distance between the input and its reconstruction through the encoder and decoder functions.
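Reconstruction-loss-as-similarity can be sketched with a deliberately tiny autoencoder (the class, sizes and random weights below are illustrative only; in CLA each memory's AE is trained on that memory's training data, and a low loss on new input signals a similar context):

```python
import numpy as np

rng = np.random.default_rng(0)

class TinyAE:
    """Minimal linear-encoder / ReLU autoencoder for illustration."""

    def __init__(self, n_in, n_hidden):
        # untrained random weights; a real memory AE would be fitted
        self.We = rng.normal(scale=0.1, size=(n_in, n_hidden))
        self.Wd = rng.normal(scale=0.1, size=(n_hidden, n_in))

    def encode(self, X):
        return np.maximum(X @ self.We, 0.0)  # ReLU introduces sparsity

    def decode(self, H):
        return H @ self.Wd

    def reconstruction_loss(self, X):
        # euclidean distance between input and its reconstruction
        return float(np.linalg.norm(X - self.decode(self.encode(X))))

ae = TinyAE(n_in=4, n_hidden=2)
X = np.ones((3, 4))
loss = ae.reconstruction_loss(X)
```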
warpAE is designed to retain the AE's lower memory usage relative to DTW while benefiting from DTW's phase-invariant loss:
(8) 
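One plausible reading of the warpAE combination is DTW applied to the AE reconstruction rather than to raw stored examples, so the warping cost is paid only on the (smaller, denoised) reconstruction. The sketch below assumes that reading; `reconstruct` stands in for a trained memory AE's decoder-of-encoder.

```python
import numpy as np

def dtw(a, b):
    """1-D dynamic time warping distance (full DP, no warping window)."""
    D = np.full((len(a) + 1, len(b) + 1), np.inf)
    D[0, 0] = 0.0
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            D[i, j] = abs(x - y) + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[len(a), len(b)]

def warp_ae_distance(x, reconstruct):
    """warpAE sketch: DTW between the current input and a memory AE's
    reconstruction of it, giving a phase-adjusted reconstruction loss."""
    return dtw(x, reconstruct(x))
```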
Each of these (dis)similarities can be used to determine which memories to recall from the memory structure and how to weight the contribution of each memory to CLA's final outcome. Each was tested in turn in CLA's recall-gate, aiming to gain new insights into the effectiveness of each similarity approach when used in a CL system applied to a complex multivariate time-series problem.
4.5 Balancing
The base learner and all recalled memories are weighted by similarity to produce CLA's final outcome, using the recall-gate:
(9) 
Here the weighting runs over all memories in the CLA memory structure. Previous research indicated this was the most powerful approach compared with selecting the single best memory Philps_2018. (Notably, both these balancing approaches significantly outperform equal weighting of all memories, indicating CLA is gaining significantly more than a simple ensemble effect.)
5 Experimental Results
5.1 Setup
CLA is used as a testbed for different learners and similarity approaches in a regression task to forecast future expected returns of individual equity securities. This is used to drive equities investment simulations, a real-world task using noisy time-series. The dataset consisted of stock-level characteristics at each timestep, for many stocks over many timesteps. Tests were conducted to show the relative performance of a sliding window base learner (FFNN) and a sequential base learner (LSTM). Different similarity approaches were also used to drive the memory recall-gate: ED, DTW, AE and warpAE.
Base learners were batch trained over all stocks at each timestep, forecasting US$ total returns 12 months ahead for each stock. For the sliding window learner, a year-long, fixed-length sliding window of four quarters was used for training; for the sequential learner, all historic data up to the current time was used. A stock-level forecast in the top (bottom) decile of stocks in a time period was interpreted as a buy (sell) signal.
Although CLA is designed to use non-traditional driver variables, stock-level characteristics are commonly expressed using factor loadings. These were estimated in-sample at each timestep by regressing style factor excess returns against each stock-level US$ excess return stream, where the excess return of each stock in each period is regressed on the excess return of the Emerging Market Equities Index and the relative return of the Emerging Market Value Equities Index.
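The loading estimation amounts to an OLS regression per stock, which can be sketched as below (synthetic data with known betas; the two-factor setup with an intercept is a minimal reading of the paper's specification, and the column order is our assumption):

```python
import numpy as np

def factor_loadings(stock_excess, factor_returns):
    """OLS loadings of one stock's excess returns on style factor
    return streams (market and value here), fit with an intercept."""
    T = len(stock_excess)
    X = np.column_stack([np.ones(T), factor_returns])  # intercept + factors
    coef, *_ = np.linalg.lstsq(X, stock_excess, rcond=None)
    return coef[1:]  # drop the intercept, keep the factor betas

# Synthetic factor streams and a stock with known loadings 1.5 and -0.5.
mkt = np.array([0.010, -0.020, 0.030, 0.015])
val = np.array([0.005, 0.010, -0.010, 0.000])
stock = 1.5 * mkt - 0.5 * val
```

Repeating this per stock and per timestep fills the rows of the input matrix described next.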
Stock-level factor loadings populated a matrix which comprised the input data. Each row represented a stock appearing in the index at a given time (up to 5,500 stocks) and each column related to a coefficient calculated at a specific time lag. The raw input was winsorized at the fifth and ninety-fifth percentiles to eliminate outliers.
Long/short model portfolios were constructed (i.e. rebalanced) every six months over the study term, using equal-weighted long positions (buys) and shorts (sells). The simulation encompassed 5,500 equities in total, covering 26 countries across emerging markets, corresponding to an Emerging Market Equities Index between 2006 and 2017. To account for the sampling approach used for the ED and DTW similarities and for differences in the random initialisation of neural components, several simulations were carried out per test.
5.2 Simulation Results
CLA results showed a significant augmentation benefit for both base learners (see Figure 2b), while tests of similarity approaches favoured noise-invariant approaches over simple ED.
Sliding window learner tests (CLA-FFNN) outperformed all equivalent sequential learner tests (CLA-LSTM) in terms of total return (TR), and Sharpe ratios (see Figure 2a) were also superior, although no positive Sharpe ratio was significant at the 5% level. However, the augmentation benefit, gauged by relative return (RR) and information ratio (Info Ratio), was superior for CLA-LSTM (Figure 2b), with most augmentation tests for both learners statistically significant at the 5% level. In these tests, although CLA-LSTM saw a better augmentation benefit (RR), CLA-FFNN saw the strongest outright performance (TR), followed by the unaugmented FFNN (given by TR minus RR), then CLA-LSTM. By far the weakest outright performer (TR) was the unaugmented LSTM (given by TR minus RR).
Tests of the different similarity approaches used in the recall-gate saw ED underperform DTW both in TR terms and in augmentation benefit, for both learners tested. This implies that the invariance to phase that DTW provides is an important consideration in a real-world context. AE distance tests showed higher TRs than DTW and demonstrated statistically significant augmentation benefits at the 5% level for both learners, indicating that AE distance is an appropriate approach in this context. warpAE generated the highest RR and information ratios of all similarity tests, implying that adding a DTW filter to AE distance was the most effective similarity approach tested.
5.3 Interpretable Memory
CLA produces outcomes that can be attributed to specific, recalled memories. Figure 3 shows an example of one of the simulation runs, CLA-FFNN with AE similarity. The upper panel shows the hypothetical investment returns of the unaugmented base learner (line chart) compared to the returns of the CLA augmented base learner (area). The lower panel shows an expanding memory triangle, growing by one row at each timestep forwards, indicating the possibility of the system adding a new memory at each step. The black horizontal lines show specifically when and how certain memories were recalled and applied to result in specific return outcomes (in the upper panel). In this example, at least three memories are remembered and recalled at later times. Qualitatively, we can see that a memory remembered in January 2007, a period of turbulence in financial markets, adds materially to the strategy return in later periods, when recalled. It proves more appropriate than the base learner in the period of the 2008 financial crisis and its aftermath involving concerted fiscal stimulus (Sept 2008 to Dec 2010). It was again recalled in 2013 and then in 2016, both also periods where fiscal stimulus dominated market returns (in Europe and China respectively). The sparsity of the memory structure is also notable, with only three principal memories recalled in the entire period. As a rule of thumb, fewer memories will be remembered and recalled for noisier datasets. This occurs either through a lower learned sensitivity to remembering, from a higher learned critical error level, or through less recalling, from less discernible contexts.
6 Conclusion
We have empirically demonstrated that when applied to a real-world financial task involving noisy time-series, a CL augmented sliding window learner (CLA-FFNN) is superior to LSTM and to a CL augmented LSTM learner (CLA-LSTM). Testing of different similarity approaches, applied to a recall-gate, showed poor performance of simple Euclidean distance (ED) when compared to dynamic time warping (DTW). This strongly implies that the timing of data points is crucial in this task and likely in other real-world problems involving noisy time-series. Simulation tests also showed that AE distance is a good alternative to DTW. These results imply that AE dimensionality reduction and generalisation (using ReLU units in this case) are almost equivalent to DTW driven memory recall. warpAE was proposed to benefit from both AE's dimensionality reduction and DTW's phase invariance, an approach that produced the strongest investment performance and augmentation benefit of the similarity approaches tested. We also show that time-series CL not only outperforms an LSTM base learner but can provide a transparent explanation of which memory did what and when. In summary, the most successful CL choices were found to be a sliding window CLA-FFNN learner combined with a recall-gate using warpAE similarity. These tests also affirm Continual Learning Augmentation (CLA) as a real-world time-series CL approach, with the flexibility to augment different types of learners.
7 Bibliography
\printbibliography[heading=none]