Making Good on LSTMs' Unfulfilled Promise
LSTMs promise much to financial time-series analysis, both temporal and cross-sectional inference, but we find they do not deliver in a real-world financial management task. We examine an alternative from Continual Learning (CL): a memory-augmented approach that can provide transparent explanations of which memory did what and when. This work has implications for many financial applications, including credit, time-varying fairness in decision making and more. We make three important new observations. Firstly, as well as being more explainable, time-series CL approaches outperform LSTMs and a simple sliding window learner (a feed-forward neural net, FFNN). Secondly, we show that CL based on a sliding window learner (FFNN) is more effective than CL based on a sequential learner (LSTM). Thirdly, we examine how real-world time-series noise impacts several similarity approaches used in CL memory addressing. We provide these insights using an approach called Continual Learning Augmentation (CLA), tested on a complex real-world problem: emerging market equities investment decision making. CLA provides a test-bed as it can be based on different types of time-series learner, allowing LSTM and sliding window (FFNN) learners to be tested side by side. CLA is also used to test several distance approaches used in a memory recall-gate: euclidean distance (ED), dynamic time warping (DTW), autoencoder (AE) and a novel hybrid approach, warp-AE. We find CLA out-performs simple LSTM and FFNN learners, and that CLA based on a sliding window (CLA-FFNN) out-performs an LSTM (CLA-LSTM) implementation. For memory-addressing, ED under-performs DTW and AE, while warp-AE shows the best overall performance in a real-world financial task.
Keywords: Continual learning, time-series, LSTM, similarity, DTW, autoencoder
1 Introduction
Both LSTMs Hochreiter_1997 and a wide range of time-series approaches suffer from a common problem: how to deal with long versus short term dependencies Koutn14; Mozer:1991:IMT:2986916.2986950; Gers_Schidhuber_LSTMs_Timeseries_2001. This problem is a manifestation of catastrophic forgetting (CF), one of the major impediments to the development of artificial general intelligence (AGI), where the ability of a learner to generalise to older tasks is corrupted by learning newer tasks French1999CatastrophicFI; MCCLOSKEY1989109. Continual Learning (CL) has been developed to address this problem. Although very few time-series CL approaches exist, some have the advantage of interpretable memory addressing Philps_2018, in contrast to LSTMs. Better interpretability would be significant for real-world financial problems, but a number of open questions remain for time-series CL. Firstly, should a financial time-series CL approach be based on a recurrent architecture (e.g. LSTM) or a sliding window architecture (e.g. a feed-forward neural net, FFNN)? Secondly, what impact does time-series noise have on the memory functions of a CL approach? Thirdly, how do these choices translate to performance in a real-world financial problem when compared with LSTMs and a simple sliding window approach (FFNN)? This study empirically examines these questions in the context of the complex real-world problem of driving stock selection investment decisions in emerging market equities.
The rest of this paper is organised as follows. Section 2 introduces common design choices for time-series CL. Section 3 reviews related work, while Section 4 discusses Continual Learning Augmentation's remember-gate, recall-gate and memory balancing. Section 5 describes the experimental setup, results and interpretability of the complex, real-world tests conducted, while Section 6 concludes this paper.
Machine learning based time-series approaches can be broadly separated into:
- Sliding window: dividing a time-series into a series of discrete modelling steps, e.g. FFNN.
- Sequential: attempting to model the time-series process itself, e.g. LSTM.
A sliding window allows the choice of a wide range of learners, such as OLS regression Fama93commonrisk or feed forward neural networks (FFNN), applied in a step-forwards fashion. Specialist models have also been developed, such as time delayed neural nets (TDNN) Waibel_TDNN_1989, but these are still constrained by choices relating to which time-delays to use. The major shortcoming of sliding-window approaches is that, as time steps forward, all information that moves out of the sliding window is forgotten.
Sequential approaches, while still requiring a sliding window for longer series, attempt to avoid this fairly arbitrary window sizing problem. However, to do this these approaches have to model greater degrees of complexity Zhang_RNNComplexity_Skip_16, in a sequential fashion: cross-sectional, temporal and short versus long dependencies. While there are many types of sequential learner Thomas_Sequential_2002, LSTMs are probably the most popular. Although they are able to solve time-series problems that sliding window approaches cannot, sliding window approaches have outperformed LSTMs on seemingly simpler time-series problems Gers_Schidhuber_LSTMs_Timeseries_2001. The interpretability of LSTMs is also challenged Guo2019_. In either case, FFNNs and LSTMs applied to a time-evolving data-set are exposed to CF Schak_2019, which occurs when a learner is applied in a sequential, continual fashion. For example, a FFNN that is first trained to accurately approximate a model in one time-period, and is then trained in a later time-period, may see a deterioration in accuracy when applied to the first time-period again. CL has been proposed to address CF, using implicit or explicit memory for the purpose. In many cases similarity plays a part in driving this memory liu_2015; Fei:2016:LCB:2939672.2939835; Shu_2018, but in the real-world of complex, noisy time-series, similarity can be more subjective. Temporal dependencies, changing modalities and more Schlimmer1986 all complicate gauging time-varying similarity and generally add computational expense. The impact of these effects on a learner is sometimes called concept drift Schlimmer1986; Widmer1996.
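The forgetting effect described above can be reproduced in a few lines. The following is a minimal sketch of our own, using a linear model fit by gradient descent as a degenerate stand-in for an FFNN: training to convergence on one state, then continuing to train on a second state, corrupts accuracy on the first.

```python
import numpy as np

rng = np.random.default_rng(0)

def train(w, X, y, lr=0.1, epochs=200):
    """Plain gradient descent on squared error for a linear model
    (a degenerate FFNN, used only to keep the demo small)."""
    for _ in range(epochs):
        grad = X.T @ (X @ w - y) / len(X)
        w = w - lr * grad
    return w

def mse(w, X, y):
    return float(np.mean((X @ w - y) ** 2))

X = rng.normal(size=(200, 3))
y_a = X @ np.array([1.0, -0.5, 0.2])   # "state" A: one data-generating model
y_b = X @ np.array([-1.0, 0.5, 3.0])   # "state" B: a later, different model

w = train(np.zeros(3), X, y_a)
err_a_before = mse(w, X, y_a)          # low: the learner fits state A
w = train(w, X, y_b)                   # keep training as state B arrives
err_a_after = mse(w, X, y_a)           # accuracy on state A has collapsed
```

The learner has not "chosen" to forget; gradient updates for the new state simply overwrite the parameters that encoded the old one.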
This paper addresses whether a sequential or a sliding window approach is best in a noisy real-world financial task and examines time-series similarity in a CL context. We first test a sliding window learner (FFNN) against a sequential learner (LSTM), finding that the FFNN applied as a sliding window is the best performer. Secondly, we test different time-series similarity approaches, which are used to drive CLA's memory recall-gate. We find simple euclidean distance (ED) under-performs noise-invariant similarity approaches: dynamic time warping (DTW) and autoencoders (AE). The best performing similarity approach is a novel hybrid, warp-AE.
3 Related Work
While CF remains an open question, many techniques have been developed to address it, including gated neural networks Hochreiter_1997, explicit memory structures Weston, prototypical addressing Snell_2017, weight adaptation Hinton_Distilling_2015; Sprechmann_2018, task rehearsal Silver_2002 and encoder based lifelong learning Triki2017, to name a few. As researchers have addressed the initial challenges of CL, other problems have emerged, such as the overhead of external memory structures Rae_2016_sparsereads, weight saturation Kirkpatrick_2017, transfer learning Lopez-Paz_2017 and the drawbacks of outright complexity DBLP:journals/corr/ZarembaS15. While most CL approaches aim to learn sequentially, only a fraction have focused on time-series Kadous_TS_2002; Graves:2006:CTC:1143844.1143891; Lipton_TS_Modeling; Thomas_TS_2017. It is also unclear how effective these approaches would be when applied to open-world, state-based, temporal learning in long term, noisy, non-stationary time-series, particularly those commonly found in finance. Most CL approaches are applied to well defined, generally labelled and typically stylized tasks Lopez-Paz_2017; this is a key motivator for a time-series specific CL.
Regime switching models KimNelson1999 and change point detection Pettitt1979 provide a simplified answer to identifying changing states in time-series, with the major disadvantage that change points between regimes (or states) are notoriously difficult to identify out of sample fabozzi2010quantitative, and existing econometric approaches are limited by long term, parametric assumptions Engle_1999; Zhang_2010; Siegmund_2013. There is also no guarantee that a change point represents a significant change in the accuracy of an applied model, which is a more useful perspective for learning different states. Residual change detection instead observes change in the absolute error of a learner, aiming to capture as much information as possible regarding changes in the relation between independent and dependent variables.
Different forms of residual change detection have been developed Brown_1975; Jandhyala_1986; Jandhyala_1989; MacNeilt_1985; Bai_1991; Gama_COnceptDriftAdapt_2014. However, most approaches assume a single or known number of change points in a series and are less applicable when change points are not known a priori, or to multivariate series Yu_2007. Some drift adaptation approaches address these issues Gama_COnceptDriftAdapt_2014 but tend to be applied to simple, generally instance based memory Widmer1996; Maloof2000; Klinkenberg_2004; Gomes_2010, tending to suffer from gradual forgetting Koychev00gradualforgetting. This contrasts with the explicit, task-oriented memory structures of the sort used by CL to address CF. With the advent of time-series CL, residual change may have another interesting application.
CL approaches that use external memory structures require an appropriate addressing mechanism (a way of storing and recalling a memory). Memory addressing is generally based on a similarity measure such as cosine similarity Graves_14; graves2016hybrid; Park_2017, kernel weighting Vinyals_2016, use of linear models Snell_2017 or instance-based similarities, many using K-nearest neighbours Kaiser_2017; Sprechmann_2018. More recently, autoencoders (AE) have been used to gauge similarity in the context of multitask learning (MTL) Aljundi17 and for memory consolidation Triki2017. However, these methods, as they have been applied, are not obviously well suited to assessing similarity in noisy multivariate time-series. In contrast, researchers have extensively studied noise invariant time-series distance measures Cha2007, generally for time-series classification (TSC). While simple euclidean distance (ED) offers a rudimentary approach for comparing time-series, it has a high sensitivity to the timing of data-points, something that has been addressed by dynamic time warping (DTW) Sakoe1978. However, DTW requires normalized data and is computationally expensive, although some mitigating measures have been developed Zhang:2017:DTW:3062405.3062585. A relatively small subset of data-mining research has used deep learning based approaches, such as convolutional neural nets (CNN) Zheng_CNN_2014. While results have been encouraging, interpretability is still an open question. Another interesting possibility is to use AEs to cope with time-series noise by varying manifold dimensionality and by using simple activation functions (i.e. ReLU) to introduce sparsity.
4 Continual Learning Augmentation
Continual learning augmentation (CLA) memory-augments a conventional learner for time-series regression. The aim is to allow well understood learners to be used in a CL framework in an interpretable way. CLA's memory functions are applied as a sliding window stepping forward through time, over input data of one or more time-series. The approach is initialized with an empty memory structure, M, and a chosen base learner, b, parameterized by θ. This base learner can be a sequential approach or a sliding window approach applied to a multivariate input series, X, with v variables over t time-steps. The chosen base learner produces a forecast value, ŷ_t, in each period as time steps forward. A remember-gate, R, appends a new memory, m_j, to M, on a remember cue defined by the change in the base learner's absolute error at time point t. A recall-gate, G, balances a mixture of base and memory forecasts to result in the final outcome, ŷ*_t. Figure 1 shows the functional steps of remembering, recalling and balancing learner-memories.
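The functional steps above can be sketched in code. Everything here (the class layout, an OLS stand-in base learner, a mean-vector context summary and inverse-distance recall weights) is our illustrative assumption, not the authors' reference implementation:

```python
import numpy as np

class CLA:
    """Minimal sketch of the CLA loop: base learner + explicit memory,
    a remember-gate cued by error spikes and a recall-gate blending
    base and memory forecasts."""

    def __init__(self, eps):
        self.eps = eps        # remember threshold on error jumps
        self.memory = []      # memory columns (theta_j, context_j)
        self.theta = None     # current base learner parameters
        self.prev_err = None

    @staticmethod
    def fit(X, y):
        # Base learner: ordinary least squares as a simple stand-in.
        return np.linalg.lstsq(X, y, rcond=None)[0]

    def step(self, X_train, y_train, x_now, err_now):
        """One sliding-window step: maybe remember, then recall."""
        if self.theta is None:
            self.theta = self.fit(X_train, y_train)
        # Remember-gate: a spike in the base learner's absolute error cues
        # storing (theta, context summary) as a new memory column, after
        # which a fresh base learner is trained on the current window.
        if self.prev_err is not None and err_now - self.prev_err > self.eps:
            self.memory.append((self.theta, X_train.mean(axis=0)))
            self.theta = self.fit(X_train, y_train)
        self.prev_err = err_now
        # Recall-gate: similarity-weighted mixture of base and memory
        # forecasts (base learner treated as fully similar to the present).
        preds = [float(x_now @ self.theta)]
        weights = [1.0]
        for theta_j, ctx_j in self.memory:
            preds.append(float(x_now @ theta_j))
            weights.append(1.0 / (1.0 + np.linalg.norm(x_now - ctx_j)))
        w = np.array(weights) / np.sum(weights)
        return float(w @ np.array(preds))
```

The sketch keeps only what the text requires: parameters paired with a context per memory column, an error-jump remember cue and similarity-weighted recall.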
4.1 Memory management
Repeating patterns are required in sub-sequences of the input data to provide memory cues to remember and recall different past states. Learner parameters, θ, trained in a given past state can then be applied if that state approximately reoccurs in the future. When CLA forms a memory, it is stored as a column in an explicit memory structure, similar to Ciresan_2012, which changes in size over time as new memories are remembered and old ones forgotten. Each memory column, m_j, consists of a copy of a past base learner parameterization, θ_j, and a representation, c_j, of the training data used to learn those parameters. As the sliding window steps into a new time period, CLA recalls one or more learner-memories by comparing the latest input data, X_t, with the representation of training data stored in each memory column, c_j. Memories with training data that are more similar to the current input series will have a higher weight applied to their output, ŷ_j, and therefore make a greater contribution to the final CLA output, ŷ*_t.
Remembering is triggered by changes in the absolute error series, e_t = |y_t − ŷ_t|, of the base learner as the approach steps forward through time.
CLA interrogates the base learner for changes in its out-of-sample error, e_t, which are assumed to be associated with changes in state. The remember-gate, R, both learns to define and triggers a change event, which stores a pairing of the parameterization of the base learner, θ_t, and a contextual reference, c_t. Figure 1 shows how a change is detected by R, resulting in a new memory column, m_j = (θ_t, c_t), being appended to M.
Immediately after the remember event has occurred, a new base learner is trained on the current input, overwriting θ.
Theoretically, for a fair model of a state, the residuals would be approximately stationary with a zero-valued mean. Therefore the current base model would cease to be a fair representation of the current state when e_t exceeds a certain confidence interval, in turn implying a change in state. A threshold, ε, represents a critical level for e_t, indicating a change point has occurred in state. This can be interpreted as a cue to remember the past state, triggered when the observed absolute error series, e_t, spikes above this critical level.
The critical level, ε, is essentially a hyperparameter of the CLA approach, estimated empirically. It can be optimized at every time-step to a level of sensitivity to remembering that forms an external memory, M, resulting in the lowest empirical forecasting error for the CLA approach over the study term up until time t:

ε* = argmin_{ε ∈ E} Σ_{τ=1}^{t} CLA(X_τ, ε)

Where CLA(X_τ, ε) is the CLA approach expressed as a function of the input series, X, and the threshold, ε, yielding the absolute forecasting error at time τ. E is an equidistant set of values between the minimum and the maximum of the error series, e, observed up until time t. This represents a discretization of the empirical distribution of e from which to empirically solve for ε. In our testing, |E| was initialised to a value of 20, representing five-percent buckets in the empirical range of e.
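A minimal sketch of this empirical threshold search, under our assumption of a callback that replays the CLA approach for a candidate ε and returns its realised error:

```python
import numpy as np

def estimate_eps(err_hist, cla_error_fn, n_buckets=20):
    """Grid-search the remember threshold eps over an equidistant
    discretization E of the empirical error range (20 buckets, i.e.
    five-percent steps), choosing the value that minimises CLA's
    realised forecasting error up to the current time.
    `cla_error_fn(eps)` is a hypothetical callback that replays the
    approach with threshold eps and returns its cumulative error."""
    grid = np.linspace(min(err_hist), max(err_hist), n_buckets)
    losses = [cla_error_fn(eps) for eps in grid]
    return float(grid[int(np.argmin(losses))])
```

Re-running this at every step keeps the remember sensitivity adapted to the error levels actually observed so far.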
The recall of memories takes place in the recall-gate, G, which calculates ŷ*_t, a mixture of the predictions from the current base learner and from learner-memories.
The mixture coefficients are derived by comparing the similarity of the current time-varying context with the contextual references, c_j, stored with each individual memory. Memories that are more similar to the current context have a greater weight in CLA's final outcome.
4.4 Recall-gate Similarity Choices
Several approaches for calculating contextual similarity are tested separately using the CLA approach. Each is used to define c_j, either by simply storing past training examples or by a process of contextual learning: essentially learning a representation of the base learner's training data.
ED and then DTW are applied first. Both require c_j to be raw training examples, which must be stored in each respective memory column, making both approaches relatively resource hungry. Secondly, AE distance is used through a process of contextual learning. Rather than storing many training examples in a memory column, only the AE parameters are needed to form a reconstruction of the training data, with the disadvantage that an AE must be trained at every time-step. Thirdly, we introduce a DTW-filtered AE distance, intended to phase-adjust the AE reconstruction, which we call warp-AE. Again, an AE needs to be trained at every time-step, but DTW processing expense is reduced as it is only run on AE reconstructions. We describe each approach in turn.
ED and DTW are applied only to a subset of randomly sampled instances from X_t and c_j, sampling over rows (i.e. cross-sectional data), each of which represents a different security at a given point in time:

d(X_t, c_j) = Σ_{s=1}^{S} dist(X_t[r_s], c_j[r_s])

Where d is the dissimilarity, dist is ED or DTW, S is the number of samples to take and r_1, …, r_S are random integers between 1 and the number of rows.
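A sketch of the row-sampled comparison, together with the textbook O(nm) DTW recurrence (illustrative only; the paper's sampling sizes and DTW settings may differ):

```python
import numpy as np

def sampled_ed(X_now, ctx, n_samples=32, seed=0):
    """Euclidean dissimilarity over a random subset of rows (securities),
    a cheap stand-in for comparing the full cross-sections."""
    rng = np.random.default_rng(seed)
    rows = rng.integers(0, min(len(X_now), len(ctx)), size=n_samples)
    return float(np.linalg.norm(X_now[rows] - ctx[rows]))

def dtw(a, b):
    """Classic dynamic time warping distance between two 1-D series,
    invariant to local phase shifts, O(len(a) * len(b))."""
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return float(D[n, m])
```

Note how a series and a phase-shifted copy of it have zero DTW distance but non-zero euclidean distance, which is exactly the noise-timing sensitivity the text describes.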
AE distance is used in a similar fashion to Aljundi et al. Aljundi17, using ReLU activations to avoid over-fit. However, CLA's use of AEs is different: AEs are used for contextual learning for memory management, to cope with noisy, real-world, multivariate time-series. The use of ReLU units aims to allow generalisation over the noise of otherwise similar time-series sub-sequences. Additionally, the similarities returned from CLA's AE implementation are also used to balance memory weightings:

L_j = ED(X_t, dec_j(enc_j(X_t)))

Where L_j is the reconstruction loss of the current input, X_t, calculated as a euclidean distance, and enc_j and dec_j are the encoder and decoder functions of memory j respectively.
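A toy one-hidden-layer ReLU autoencoder, trained with plain gradient descent, illustrates the point that only the AE parameters need storing per memory column, with the euclidean reconstruction loss of the current input serving as the dissimilarity. The architecture and training details here are our assumptions:

```python
import numpy as np

def train_ae(X, hidden=2, lr=0.01, epochs=500, seed=0):
    """Train a one-hidden-layer ReLU autoencoder on context data X.
    Only (W1, b1, W2, b2) need storing per memory column."""
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    W1 = rng.normal(scale=0.1, size=(d, hidden)); b1 = np.zeros(hidden)
    W2 = rng.normal(scale=0.1, size=(hidden, d)); b2 = np.zeros(d)
    for _ in range(epochs):
        H = np.maximum(X @ W1 + b1, 0.0)   # ReLU encoder
        R = H @ W2 + b2                    # linear decoder
        G = 2.0 * (R - X) / len(X)         # gradient of mean squared loss
        gW2 = H.T @ G; gb2 = G.sum(0)
        GH = (G @ W2.T) * (H > 0)          # backprop through ReLU
        gW1 = X.T @ GH; gb1 = GH.sum(0)
        W1 -= lr * gW1; b1 -= lr * gb1
        W2 -= lr * gW2; b2 -= lr * gb2
    return W1, b1, W2, b2

def ae_distance(params, X_now):
    """Euclidean reconstruction loss of the current input under a
    memory column's stored autoencoder."""
    W1, b1, W2, b2 = params
    R = np.maximum(X_now @ W1 + b1, 0.0) @ W2 + b2
    return float(np.linalg.norm(X_now - R))
```

Inputs resembling the data a given AE was trained on reconstruct well and so score as similar; unfamiliar inputs reconstruct poorly and score as dissimilar.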
warp-AE is designed to retain the AE's lower memory usage compared with DTW while benefiting from the phase-invariant loss of DTW:

L_j = DTW(X_t, dec_j(enc_j(X_t)))
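A sketch of the warp-AE idea, assuming the AE reconstruction of the current input is already available: the reconstruction is scored with a column-wise DTW loss rather than euclidean distance, so the comparison tolerates local phase shifts.

```python
import numpy as np

def dtw(a, b):
    """Textbook DTW distance between two 1-D series."""
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            D[i, j] = abs(a[i - 1] - b[j - 1]) + min(
                D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return float(D[n, m])

def warp_ae_distance(X_now, R_now):
    """warp-AE dissimilarity: DTW loss between the current input X_now
    and its AE reconstruction R_now, summed over variables (columns).
    DTW runs only on the (already denoised) reconstruction, which keeps
    its cost down relative to raw-data DTW."""
    return sum(dtw(X_now[:, k], R_now[:, k]) for k in range(X_now.shape[1]))
```

A reconstruction that is merely phase-shifted relative to the input incurs no warp-AE penalty, whereas a euclidean loss would penalise it heavily.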
Each of these (dis)similarities can be used to determine which memories to recall from M and how to weight the contribution of each memory to CLA's final outcome, ŷ*_t. Each was tested in turn in CLA's memory recall-gate, aiming to gain new insights about the effectiveness of each similarity approach when used in a CL system and applied to a complex multivariate time-series problem.
The base learner and all recalled memories are weighted by similarity to produce CLA's final outcome, using the recall-gate, G:

ŷ*_t = (w_0 ŷ_t + Σ_{j=1}^{J} w_j ŷ_j) / (Σ_{j=0}^{J} w_j)

Where J is the number of memories in the CLA memory structure, M, and the weights w_j are derived from the (dis)similarities above. Previous research indicated this was the most powerful approach over selecting only the single most similar memory Philps_2018. (Notably, both these balancing approaches significantly outperform equal weighting of all memories, indicating CLA is gaining significantly more than a simple ensemble effect.)
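One plausible reading of this balancing step (the inverse-dissimilarity weighting is our assumption; the text specifies only that more similar memories receive greater weight):

```python
import numpy as np

def recall_gate(base_pred, mem_preds, dissims):
    """Blend the base forecast with recalled memory forecasts.
    Memories are weighted by inverse dissimilarity, the base learner is
    given full weight 1, and weights are normalised to sum to one."""
    weights = np.array([1.0] + [1.0 / (1.0 + d) for d in dissims])
    preds = np.array([base_pred] + list(mem_preds))
    weights = weights / weights.sum()
    return float(weights @ preds)
```

With no memories recalled, the gate degenerates to the unaugmented base forecast, which is the behaviour an empty memory structure should produce.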
5 Experimental Results
5.1 Experimental Setup
CLA is used as a test bed for different learners and similarity approaches in a regression task to forecast future expected returns of individual equity securities. This is used to drive equities investment simulations, a real-world task using noisy time-series. The data set consisted of stock-level characteristics at each time-step, for many stocks over many time-steps. Tests were conducted to show the relative performance of a sliding window base learner, FFNN, and a sequential base learner, LSTM. Different similarity approaches were also used to drive the memory recall-gate: ED, DTW, AE and warp-AE.
Base learners were batch trained over all stocks at each time-step, forecasting US$ total returns 12 months ahead for each stock. For the sliding window learner, a year-long, fixed-length sliding window of four quarters was used for training; for the sequential learner, all historic data up to the current time was used. A stock-level forecast in the top (bottom) decile of stocks in a time-period was interpreted as a buy (sell) signal.
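The decile rule can be stated compactly (a sketch of our reading of the signal construction):

```python
import numpy as np

def decile_signals(forecasts):
    """Convert cross-sectional 12-month return forecasts into trading
    signals: +1 (buy) for the top decile, -1 (sell) for the bottom
    decile, 0 otherwise."""
    lo, hi = np.percentile(forecasts, [10, 90])
    return np.where(forecasts >= hi, 1, np.where(forecasts <= lo, -1, 0))
```

Applied to each rebalancing date's cross-section, this yields roughly equal-sized buy and sell lists for the long/short portfolio.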
Although CLA is designed to use non-traditional driver variables, stock-level characteristics are commonly expressed using factor loadings. These were estimated in-sample at each time-step by regressing style factor excess returns against each stock-level US$ excess return stream:

r_{i,t} = α_i + β_i m_t + γ_i v_t + ε_{i,t}

where r_{i,t} is the excess return of stock i in period t, m_t is the excess return of the Emerging Market Equities Index and v_t is the relative return of the Emerging Market Value Equities Index.
Stock-level factor loadings populated a matrix, X_t, which comprised the input data. Each row represented a stock appearing in the index at time t (up to 5,500 stocks) and each column related to a coefficient calculated on a specific time lag. X_t resulted from a fifth and ninety-fifth percentile winsorizing of the raw input to eliminate outliers.
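The winsorizing step, sketched with numpy at the percentile levels described above:

```python
import numpy as np

def winsorize(X, lower=5.0, upper=95.0):
    """Clamp each column of the raw factor loadings to its fifth and
    ninety-fifth percentiles, eliminating outliers before the loadings
    populate the input matrix."""
    lo = np.percentile(X, lower, axis=0)
    hi = np.percentile(X, upper, axis=0)
    return np.clip(X, lo, hi)
```

Clamping (rather than dropping) extreme loadings keeps the cross-section complete at every date while bounding the influence of outliers.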
Long/short model portfolios were constructed (i.e. rebalanced) every six months over the study term, using equal weighted long positions (buys) and shorts (sells). The simulation encompassed 5,500 equities in total, covering 26 countries across emerging markets, corresponding to an Emerging Market Equities Index from 2006 to 2017. To account for the sampling approach used for ED and DTW similarities and for differences in random initialisation of neural components, several simulations were carried out per test.
5.2 Simulation Results
CLA results showed a significant augmentation benefit for both base learners (see Figure 2b), while tests of similarity approaches favoured noise-invariant approaches over simple ED.
Sliding window learner tests (CLA-FFNN) outperformed all the equivalent sequential learner tests (CLA-LSTM) in terms of total return (TR), and Sharpe ratios (see Figure 2a) were also superior (although no positive Sharpe ratio was significant at the 5% level). However, the augmentation benefit, gauged by relative return (RR) and information ratio (Info Ratio), was superior for CLA-LSTM (Figure 2b), with most augmentation tests for both learners statistically significant at the 5% level. In these tests, although CLA-LSTM saw a better augmentation benefit (RR), CLA-FFNN saw the strongest outright performance (TR), followed by the unaugmented FFNN (given by TR−RR), then CLA-LSTM. By far the weakest outright performer (TR) was the unaugmented LSTM (given by TR−RR).
Tests of different similarity approaches used in the recall-gate saw ED under-perform DTW in TR terms and in terms of augmentation benefit, for both learners tested. This implies that the invariance to phase that DTW provides is an important consideration in a real-world context. AE distance tests showed higher TRs than DTW and demonstrated statistically significant augmentation benefits at the 5% level for both learners, indicating that AE distance is an appropriate approach in this context. warp-AE generated the highest RR and information ratios of all similarity tests, implying that adding a DTW filter to AE distance was the most effective similarity approach tested.
5.3 Interpretable Memory
CLA produces outcomes that can be attributed to specific, recalled memories. Figure 3 shows an example of one of the simulation runs, CLA-FFNN with AE similarity. The upper panel shows the hypothetical investment returns of the unaugmented base learner (line chart) compared to the returns of the CLA augmented base learner (area). The lower panel shows an expanding-memory-triangle, expanding by one row at each time-step forwards, indicating the possibility of the system adding a new memory at each step. Note the black horizontal lines, which show specifically when and how certain memories were recalled and applied to result in specific return outcomes (in the upper panel). In this example, at least three memories are remembered and recalled at later times. Qualitatively, we can see that a memory remembered in January 2007, a period of turbulence in financial markets, adds materially to the strategy return in later periods, when recalled. It proves more appropriate than the base learner in the period of the 2008 financial crisis and its aftermath involving concerted fiscal stimulus (Sept 2008-Dec 2010). It was again recalled in 2013 and then in 2016, both also periods where fiscal stimulus dominated market returns (in Europe and China respectively). The sparsity of the memory structure is also notable, with only three principal memories recalled/used in the entire period. As a rule of thumb, fewer memories will be remembered/recalled for noisier data-sets. This occurs through either a lower learned sensitivity to remembering, from a higher learned value of ε, or through less recalling, from less discernible contexts.
6 Conclusion
We have empirically demonstrated that when applied to a real-world financial task involving noisy time-series, a CL augmented sliding window learner (CLA-FFNN) is superior to an LSTM and to a CL augmented LSTM learner (CLA-LSTM). Testing of different similarity approaches, applied to a recall-gate, showed poor performance of simple euclidean distance (ED) when compared to dynamic time warping (DTW). This strongly implies that the timing of data-points is crucial in this task, and likely in other real-world problems involving noisy time-series. Simulation tests also showed that AE distance is a good alternative to DTW. These results imply that AE dimensionality reduction and generalisation (using ReLU units in this case) are almost equivalent to DTW driven memory recall. warp-AE was proposed to benefit from both AE's dimensionality reduction and DTW's phase invariance, and produced the strongest investment performance and augmentation benefit of the similarity approaches tested. We also show that time-series CL not only outperforms an LSTM base learner but can provide a transparent explanation of which memory did what and when. In summary, the most successful CL choices were found to be a sliding window CLA-FFNN learner combined with a recall-gate using warp-AE similarity. These tests also affirm Continual Learning Augmentation (CLA) as a real-world time-series CL approach, with the flexibility to augment different types of learners.