# Making Good on LSTMs’ Unfulfilled Promise

Daniel Philps
Rothko Investment Strategies
Department of Computer Science
City, University of London
daniel.philps@city.ac.uk

Tillman Weyde
Department of Computer Science
City, University of London
t.e.weyde@city.ac.uk

Artur d’Avila Garcez
Department of Computer Science
City, University of London
a.garcez@city.ac.uk

###### Abstract

LSTMs promise much for financial time-series analysis and for temporal and cross-sectional inference, but we find they do not deliver in a real-world financial management task. We examine an alternative called Continual Learning (CL), a memory-augmented approach that can provide transparent explanations: which memory did what, and when. This work has implications for many financial applications, including credit, time-varying fairness in decision making and more. We make three important new observations. Firstly, as well as being more explainable, time-series CL approaches outperform LSTM and a simple sliding window learner (a feed-forward neural net (FFNN)). Secondly, we show that CL based on a sliding window learner (FFNN) is more effective than CL based on a sequential learner (LSTM). Thirdly, we examine how real-world time-series noise impacts several similarity approaches used in CL memory addressing. We provide these insights using an approach called Continual Learning Augmentation (CLA), tested on a complex real-world problem: emerging market equities investment decision making. CLA provides a test-bed as it can be based on different types of time-series learner, allowing LSTM and sliding window (FFNN) learners to be tested side by side. CLA is also used to test several distance approaches used in a memory recall-gate: euclidean distance (ED), dynamic time warping (DTW), auto-encoder (AE) distance and a novel hybrid approach, warp-AE. We find that CLA outperforms simple LSTM and FFNN learners, and that CLA based on a sliding window (CLA-FFNN) outperforms an LSTM (CLA-LSTM) implementation. For memory addressing, ED under-performs DTW and AE, while warp-AE shows the best overall performance in a real-world financial task.

Keywords: Continual learning, time-series, LSTM, similarity, DTW, auto-encoder

## 1 Introduction

Both LSTMs Hochreiter_1997 and a wide range of time-series approaches suffer from a common problem: how to deal with long versus short term dependencies Koutn14; Mozer:1991:IMT:2986916.2986950; Gers_Schidhuber_LSTMs_Timeseries_2001. This problem is a manifestation of catastrophic forgetting (CF), one of the major impediments to the development of artificial general intelligence (AGI), where the ability of a learner to generalise to older tasks is corrupted by learning newer tasks French1999CatastrophicFI; MCCLOSKEY1989109. To address this problem, Continual Learning (CL) has been developed. Although very few time-series CL approaches exist, some have the advantage of interpretable memory addressing Philps_2018, in contrast to typical LSTMs. The advantages of better interpretability would be significant for real-world financial problems, but there are a number of open questions for time-series CL. Firstly, should a financial time-series CL approach be based on a recurrent (e.g. LSTM) or a sliding window architecture (e.g. feed-forward neural net (FFNN))? Secondly, what impact does time-series noise have on the memory functions of a CL approach? Thirdly, how would these choices translate to performance in a real-world financial problem when compared with LSTMs and a simple sliding window approach (FFNN)?

This study empirically examines these questions using a flexible and explainable CL approach called Continual Learning Augmentation (CLA), applied to the complex real-world problem of driving stock selection investment decisions in emerging market equities.

## 2 Design Choices for Time-series CL

Machine learning based time-series approaches can be broadly separated into:

Sliding window

: Dividing a time-series into a series of discrete modelling steps. E.g. FFNN.

Sequential

: Attempting to model a time-series process. E.g. LSTM.

A sliding window allows the choice of a wide range of learners, such as OLS regression Fama93commonrisk or feed-forward neural networks (FFNN), applied in a step-forward fashion. More specialist models have been developed, such as time delayed neural nets (TDNN) Waibel_TDNN_1989, but these are still constrained by the choice of time-delays to use. The major shortcoming of sliding window approaches is that, as time steps on, all information that moves out of the sliding window is forgotten.
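To make the step-forward idea concrete, the following minimal sketch splits a series into fixed-length training windows. The helper name `sliding_windows` and the window length are illustrative assumptions and are not taken from CLA itself.

```python
from typing import List, Sequence, Tuple


def sliding_windows(series: Sequence[float], window: int) -> List[Tuple[List[float], float]]:
    """Return (training window, next value) pairs, stepping forward one period at a time.

    Hypothetical helper: anything older than `window` steps is dropped,
    which is the forgetting behaviour of a sliding window learner.
    """
    pairs = []
    for end in range(window, len(series)):
        history = list(series[end - window:end])  # only the last `window` points survive
        target = series[end]                      # value the learner would forecast
        pairs.append((history, target))
    return pairs


steps = sliding_windows([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], window=4)
```

In this toy run the second window no longer contains the first observation, so a learner trained on it has no access to that information.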

Sequential approaches, while still requiring a sliding window for longer series, otherwise address the arbitrary window size concern, but have to model potentially complex cross-sectional and short and long term temporal relationships sequentially. While there are many types of sequential learner Thomas_Sequential_2002, LSTMs are probably the most popular. While they are able to solve time-series problems that sliding window approaches cannot, sliding window approaches have outperformed LSTMs on seemingly simpler time-series problems Gers_Schidhuber_LSTMs_Timeseries_2001. Interpretability of LSTMs is also challenged Guo2019_.

In either case, FFNNs and LSTMs applied to a time-evolving data-set are exposed to CF Schak_2019, which occurs when a learner is applied in a step-forward fashion. For example, a FFNN that is first trained to accurately approximate a model in one time-period, and is then trained in a later time-period, may see a deterioration in accuracy when applied to the earlier time-period again. CL has been proposed to address CF, using implicit or explicit memory for the purpose. In many cases similarity plays a part in driving this liu_2015; Fei:2016:LCB:2939672.2939835; Shu_2018, but in the real world of complex, noisy time-series, similarity can be more subjective. Temporal dependencies, changing modalities and more Schlimmer1986 all complicate gauging time-varying similarity and generally add computational expense. The impact of these effects on a learner is sometimes called concept drift Schlimmer1986; Widmer1996.
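The effect can be illustrated with a deliberately tiny example. The sketch below is not a neural network: it fits a one-parameter linear model by least squares, and a full refit on the later period stands in for continued training. The two toy regimes are assumptions made purely for illustration.

```python
from typing import List, Tuple

Data = List[Tuple[float, float]]


def fit_slope(data: Data) -> float:
    """Least-squares slope of y = w * x (no intercept)."""
    num = sum(x * y for x, y in data)
    den = sum(x * x for x, _ in data)
    return num / den


def mean_abs_error(w: float, data: Data) -> float:
    return sum(abs(w * x - y) for x, y in data) / len(data)


# Two toy regimes (assumed for illustration): y = 2x, then y = -x.
period_a = [(x, 2.0 * x) for x in (1.0, 2.0, 3.0)]
period_b = [(x, -1.0 * x) for x in (1.0, 2.0, 3.0)]

w_a = fit_slope(period_a)                        # learner after training on period A
error_a_before = mean_abs_error(w_a, period_a)

w_b = fit_slope(period_b)                        # same learner after retraining on period B
error_a_after = mean_abs_error(w_b, period_a)    # accuracy on period A has deteriorated
```

Keeping a copy of `w_a` alongside `w_b` is, in miniature, what an explicit memory offers: the earlier parameters remain available if the earlier regime returns.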

This paper addresses whether a sequential or sliding window approach is best in a noisy real-world financial task and examines time-series similarity in a CL context. In the next section we introduce relevant research, then describe the CLA approach and finally conduct simulation tests with analysis and conclusions. Firstly, we test a sliding window approach (FFNN) and a sequential learner (LSTM), finding that the FFNN applied as a sliding window is the best performer. Secondly, we test different time-series similarity approaches, which are used to drive CLA’s memory recall-gate. We find that simple euclidean distance (ED) under-performs noise invariant similarity approaches, namely dynamic time warping (DTW) and auto-encoders (AE). We find the best performing similarity approach is a novel hybrid, warp-AE.

## 3 Related work

While CF remains an open question, many techniques have been developed to address it, including gated neural networks Hochreiter_1997, explicit memory structures Weston, prototypical addressing Snell_2017, weight adaptation Hinton_Distilling_2015; Sprechmann_2018, task rehearsal Silver_2002 and encoder based lifelong learning Triki2017, to name a few. As researchers have addressed the initial challenges of CL, other problems have emerged, such as the overhead of external memory structures Rae_2016_sparsereads, problems with weight saturation Kirkpatrick_2017, transfer learning Lopez-Paz_2017 and the drawbacks of outright complexity DBLP:journals/corr/ZarembaS15. While most CL approaches aim to learn sequentially, only a fraction of CL approaches have been focused on time-series Kadous_TS_2002; Graves:2006:CTC:1143844.1143891; Lipton_TS_Modeling; Thomas_TS_2017. It is also unclear how effective these approaches would be for state-based CL in long term, noisy, non-stationary time-series, particularly those commonly found in finance. As most CL approaches are applied to well defined, generally labelled and typically stylized tasks Lopez-Paz_2017, this is a motivating point for time-series CL.

### 3.1 Remembering

Regime switching models KimNelson1999 and change point detection Pettitt1979 provide a simplified answer to identifying changing states in time-series. Their major disadvantage is that change points between regimes (or states) are notoriously difficult to identify out of sample fabozzi2010quantitative, and existing econometric approaches are limited by long term, parametric assumptions Engle_1999; Zhang_2010; Siegmund_2013. There is also no guarantee that a change point represents a significant change in the accuracy of an applied model, which is a more useful perspective for learning different states. Residual change observes change in the absolute error of a learner, aiming to capture as much information as possible regarding changes in the relation between independent and dependent variables.

Different forms of residual change have been developed Brown_1975; Jandhyala_1986; Jandhyala_1989; MacNeilt_1985; Bai_1991; Gama_COnceptDriftAdapt_2014. However, most approaches assume a single or known number of change points in a series and are less applicable to a priori change points or multivariate series Yu_2007. Some drift detection approaches address these issues Gama_COnceptDriftAdapt_2014 but tend to be applied to simple, generally instance based memory Widmer1996; Maloof2000; Klinkenberg_2004; Gomes_2010 exhibiting gradual forgetting Koychev00gradualforgetting, rather than to the task-oriented CL memory used to address CF. With the advent of time-series CL, residual change may have another interesting application.

### 3.2 Recalling

CL approaches that use external memory structures require an appropriate memory addressing mechanism (a way of storing and recalling a memory). Memory addressing is generally based on a similarity measure such as cosine similarity Graves_14; graves2016hybrid; Park_2017, kernel weighting Vinyals_2016, use of linear models Snell_2017 or instance-based similarities, many using K-nearest neighbours Kaiser_2017; Sprechmann_2018. More recently, autoencoders (AE) have been used to gauge similarity in the context of multitask learning (MTL) Aljundi17 and for memory consolidation Triki2017. However, these methods, as they have been applied, are not obviously well suited to assessing similarity in noisy multivariate time-series. In contrast, noise invariant time-series distance measures have been researched extensively Cha2007, generally for time-series classification (TSC). While simple euclidean distance (ED) offers a rudimentary approach for comparing time-series, it is highly sensitive to the timing of data-points, something that has been addressed by dynamic time warping (DTW) Sakoe1978. However, DTW requires normalized data and is computationally expensive, although some mitigating measures have been developed Zhang:2017:DTW:3062405.3062585. A relatively small subset of data mining research has used deep learning based approaches, such as convolutional neural nets (CNN) Zheng_CNN_2014. While results have been encouraging, interpretability is still an open question. Another interesting possibility is to use AEs to cope with time-series noise by varying manifold dimensionality and by using simple activation functions (i.e. ReLU) to introduce sparsity.
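The sensitivity of ED to timing, and the way DTW removes it, can be seen on a toy pair of series. The sketch below uses the textbook dynamic-programming form of DTW with an absolute-difference local cost; the example series are assumptions chosen for illustration, and none of the speed-ups mentioned above are included.

```python
import math
from typing import Sequence


def euclidean(a: Sequence[float], b: Sequence[float]) -> float:
    """Point-by-point distance; both series must have equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def dtw(a: Sequence[float], b: Sequence[float]) -> float:
    """Classic dynamic-programming DTW with absolute difference as local cost."""
    n, m = len(a), len(b)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(a[i - 1] - b[j - 1])
            cost[i][j] = local + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]


pulse = [0.0, 0.0, 1.0, 0.0, 0.0, 0.0]
shifted = [0.0, 0.0, 0.0, 1.0, 0.0, 0.0]  # same pulse, one step later

ed_gap = euclidean(pulse, shifted)   # penalises the timing difference
dtw_gap = dtw(pulse, shifted)        # warps the time axis to align the pulses
```

The nested loop visits every pair of time points, which is the source of the computational expense noted above.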

## 4 The Test-bed: Continual Learning Augmentation

Continual learning augmentation (CLA) memory-augments a conventional learner for time-series regression. The aim is to allow well understood learners to be used in a CL framework in an interpretable way. CLA’s memory functions are applied as a sliding window stepping forward through time, over input data of one or more time-series. The approach is initialized with an empty memory structure, $M$, and a chosen base learner, $\phi$, parameterized by $\theta$. This base learner can be a sequential approach or a sliding window approach and can be applied to a multivariate input series, $X$, of several variables observed over a number of time-steps. The chosen base learner produces a forecast value in each period as time steps forward. A remember-gate appends a new memory, $(\tilde{X}, \theta)$, to $M$ on a remember cue defined by the change in the base learner’s absolute error at the current time point. A recall-gate, $g$, balances a mixture of base and memory forecasts to produce the final outcome, $\hat{y}_{t+1}$. Figure 1 shows the functional steps of remembering and recalling learner-memories.

### 4.1 Memory management

Repeating patterns are required in sub-sequences of the input data to provide memory cues to remember and recall different past states. Learner parameters trained in a given past state, $\theta_m$, can then be applied if that state approximately reoccurs in the future. When CLA forms a memory, it is stored as a column in an explicit memory structure, similar to Ciresan_2012, which changes in size over time as new memories are remembered and old ones forgotten. Each memory column consists of a copy of a past base learner parameterization, $\theta_m$, and a representation, $\tilde{X}_m$, of the training data used to learn those parameters. As the sliding window steps into a new time period, CLA recalls one or more learner-memories by comparing the latest input data ($X_t$) with a representation of the training data stored in each memory column ($\tilde{X}_m$). Memories with training data that are more similar to the current input series will have a higher weight applied to their output ($\phi(X_t, \theta_m)$) and therefore make a greater contribution to the final CLA output ($\hat{y}_{t+1}$).
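One way to picture the memory structure is as a growing list of columns, each pairing a frozen parameter copy with its context. The class and field names below (`MemoryColumn`, `Memory`, `created_at`, `forget_oldest`) are hypothetical, and the drop-the-oldest rule is an assumption, since the forgetting policy is not specified here.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class MemoryColumn:
    """One learner-memory: frozen parameters plus the context they were trained on."""
    theta: List[float]            # copy of the base learner parameters
    context: List[List[float]]    # representation of the training data
    created_at: int               # time-step of the remember event (assumed field)


@dataclass
class Memory:
    columns: List[MemoryColumn] = field(default_factory=list)

    def remember(self, theta: List[float], context: List[List[float]], t: int) -> None:
        # Store copies so later retraining of the base learner cannot alter the memory.
        self.columns.append(MemoryColumn(list(theta), [list(row) for row in context], t))

    def forget_oldest(self) -> None:
        # Assumed policy for illustration only.
        if self.columns:
            self.columns.pop(0)


memory = Memory()
theta = [0.1, 0.2]
memory.remember(theta, [[1.0, 2.0]], t=3)
theta[0] = 9.9  # the base learner keeps training; the stored copy is unaffected
```

Because each column records when it was formed, a later forecast can be attributed to a specific memory, which is the basis of the interpretability discussed in this paper.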

### 4.2 Remember-Gate

Remembering is triggered by changes in the absolute error series, $\epsilon_B$, of the base learner as the approach steps forward through time:

$$\epsilon_B = \{\,|\hat{y}_t - y_t|,\ |\hat{y}_n - y_n|\,\} \tag{1}$$

CLA interrogates the base learner for changes in out-of-sample error, $\epsilon_B$, which are assumed to be associated with changes in state. The remember-gate both learns to define and triggers a change, which stores a pairing of the parameterization of the base learner, $\theta$, and a contextual reference, $\tilde{X}$. Figure 1 shows how a change is detected by the remember-gate, which then results in a new memory column being appended to $M$:

$$M = \{(\tilde{X}_1, \theta_1), \ldots, (\tilde{X}_M, \theta_M)\} \tag{2}$$

Immediately after the remember event has occurred, a new base learner is trained on the current input, overwriting the existing base learner parameterization.

Theoretically, for a fair model of a state, $\epsilon_B$ would have an approximately zero valued mean. Therefore the current base model would cease to be a fair representation of the current state when $\epsilon_B$ exceeds a certain confidence interval, in turn implying a change in state. $J_{Crit}$ represents a critical level for $\epsilon_B$, indicating that a change point in state has occurred. Memories are only stored when the observed absolute error series, $\epsilon_B$, spikes above the critical level, $J_{Crit}$.

$J_{Crit}$ is a hyperparameter, optimized at every time-step to give the level of sensitivity to remembering that forms the external memory, $M$, with the lowest empirical forecasting error for the CLA approach over the study term up until time $t$:

$$J_{Crit} = \underset{j_{Crit} \in j_{grid}}{\arg\min}\ f(X_t, j_{Crit}) \tag{3}$$

Where $f$ is the CLA approach expressed as a function of the input series, $X_t$, and $j_{Crit}$, yielding the absolute error of the base learner at time $t$. $j_{grid}$ is a 20 point, equidistant set between the minimum and the maximum values of $\epsilon_B$, representing five-percent intervals in the empirical distribution of $\epsilon_B$.
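The two ingredients of the remember-gate, a grid of candidate thresholds and a spike test, can be sketched as follows. The function names and the toy error series are assumptions. The grid follows the equidistant description given above, and the outer search over the grid, which would re-run the whole approach once per candidate, is omitted.

```python
from typing import List, Sequence


def j_grid(errors: Sequence[float], points: int = 20) -> List[float]:
    """Equidistant candidate thresholds between the smallest and largest observed error."""
    lo, hi = min(errors), max(errors)
    step = (hi - lo) / (points - 1)
    return [lo + k * step for k in range(points)]


def remember_events(errors: Sequence[float], j_crit: float) -> List[int]:
    """Time indices at which the absolute error spikes above the critical level."""
    return [t for t, e in enumerate(errors) if e > j_crit]


abs_errors = [0.1, 0.2, 0.1, 0.9, 0.2, 0.1, 1.0, 0.3]  # toy absolute error series
grid = j_grid(abs_errors)
events = remember_events(abs_errors, j_crit=0.5)
```

A lower threshold makes the gate more sensitive and produces more memories, while a higher one produces fewer; this is the trade-off that the search over the grid resolves.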

### 4.3 Recall-Gate

The recall of memories takes place in the recall-gate, $g$, which calculates a mixture of the predictions from the current base learner and from learner-memories.

$$\hat{y}_{t+1} = g(X_t, M_t) \tag{4}$$

The mixture coefficients are derived by comparing the similarity of the current time varying context with the contextual references stored with each individual memory. Memories that are more similar to the current context have a greater weight in CLA’s final outcome.

### 4.4 Recall: Testing Measures of Similarity

Several approaches for calculating contextual similarity are tested separately, using the CLA approach. Each is used to define $\tilde{X}$, either by simply storing past training examples or by using a process of contextual learning, essentially learning a representation of base learner training data.

ED and then DTW are applied first. Both approaches require $\tilde{X}$ to consist of raw training examples, which must be stored in each respective memory column, making both approaches relatively resource hungry. Secondly, AE distance is used through a process of contextual learning. Rather than needing to store many training examples in a memory column, only the AE parameters are needed to form a reconstruction of the training data, with the disadvantage that an AE must be trained in every time-step. Thirdly, we introduce a DTW filtered AE distance, which is intended to phase adjust the AE distance calculation; we call this warp-AE. Again, an AE needs to be trained at every time-step, but DTW processing expense is reduced as it is only run on AE reconstructions. We describe each approach in turn.

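ED and DTW are applied only to a subset of randomly sampled instances from $\tilde{X}_m$ and $X_t$, sampling over rows, each of which represents a different security in the data-set:

$$\hat{D}_{ED}(\tilde{X}_m, X_t) = \frac{1}{N} \sum_{i=0}^{N} ED\big(\tilde{X}_{m, r_1(D)}, X_{t, r_2(D)}\big) \tag{5}$$

$$\hat{D}_{DTW}(\tilde{X}_m, X_t) = \frac{1}{N} \sum_{i=0}^{N} DTW\big(\tilde{X}_{m, r_1(D)}, X_{t, r_2(D)}\big) \tag{6}$$

Where $\hat{D}$ is the dissimilarity, $N$ is the number of samples to take and $r_1(D)$, $r_2(D)$ are random integers between 1 and $D$.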
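The sampled estimate can be sketched as an average over randomly paired rows. The function name, the fixed seed and the toy matrices are assumptions, and the sketch draws exactly `n_samples` pairs. Passing a DTW function as `distance` would give the DTW variant.

```python
import math
import random
from typing import Callable, List, Sequence

Row = Sequence[float]


def euclidean(a: Row, b: Row) -> float:
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def sampled_dissimilarity(
    memory_rows: List[Row],
    current_rows: List[Row],
    n_samples: int,
    distance: Callable[[Row, Row], float] = euclidean,
    seed: int = 0,
) -> float:
    """Average distance over randomly paired rows, one row drawn from each matrix."""
    rng = random.Random(seed)  # seeded so repeated runs give the same estimate
    total = 0.0
    for _ in range(n_samples):
        r1 = rng.randrange(len(memory_rows))   # random row of the memory context
        r2 = rng.randrange(len(current_rows))  # random row of the current input
        total += distance(memory_rows[r1], current_rows[r2])
    return total / n_samples


memory_rows = [[0.0, 0.0], [0.0, 0.0]]
current_rows = [[3.0, 4.0], [3.0, 4.0]]
d_ed = sampled_dissimilarity(memory_rows, current_rows, n_samples=10)
```

Sampling keeps the cost proportional to the number of draws rather than to the number of securities, at the price of some run-to-run variation when the seed is not fixed.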

AE distance is used in a similar fashion to Aljundi et al. 2017 Aljundi17, using ReLU activations to avoid over-fitting. However, CLA’s use of AEs is different: AEs are used for contextual learning for memory management, to cope with noisy, real-world, multivariate time-series. The use of ReLU units aims to allow generalisation over the noise of otherwise similar time-series sub-sequences. Additionally, the similarities returned from CLA’s AE implementation are also used to balance memory weightings:

$$\hat{D}_{AE}(\tilde{X}_m, X_t) = \frac{1}{N} \sum_{i=0}^{N} ED\big(X_t, a(h(X_t))\big) \tag{7}$$

$\hat{D}_{AE}$ is the reconstruction loss of the current input, $X_t$, calculated as a euclidean distance. $h$ and $a$ are the encoder and decoder functions respectively. warp-AE is designed to gain the AE’s benefit of lower memory usage than DTW while also benefiting from the phase invariant loss of DTW:

$$\hat{D}_{wAE}(\tilde{X}_m, X_t) = \frac{1}{N} \sum_{i=0}^{N} DTW\big(X_t, a(h(X_t))\big) \tag{8}$$

These (dis)similarities are used to determine which memories to recall from $M$ and also how to weight the contribution of each memory to CLA’s final outcome, $\hat{y}_{t+1}$.
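The only difference between the AE and warp-AE dissimilarities is the distance applied to the reconstruction, which the sketch below makes explicit. Note the loud assumption: `encode` and `decode` are hand-written placeholders (a one-number bottleneck holding the row mean), not a trained autoencoder. In CLA the encoder and decoder would be learned from a memory’s training data.

```python
import math
from typing import Callable, List, Sequence

Row = Sequence[float]


def euclidean(a: Row, b: Row) -> float:
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def dtw(a: Row, b: Row) -> float:
    """Dynamic time warping with absolute difference as the local cost."""
    inf = float("inf")
    cost = [[inf] * (len(b) + 1) for _ in range(len(a) + 1)]
    cost[0][0] = 0.0
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            step = min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
            cost[i][j] = abs(a[i - 1] - b[j - 1]) + step
    return cost[len(a)][len(b)]


def encode(row: Row) -> List[float]:
    """Placeholder encoder h: a one-number bottleneck (the row mean). NOT a trained AE."""
    return [sum(row) / len(row)]


def decode(code: List[float], length: int) -> List[float]:
    """Placeholder decoder a: repeat the code to the original length."""
    return [code[0]] * length


def reconstruction_distance(rows: List[Row], distance: Callable[[Row, Row], float]) -> float:
    """Mean distance between each row and its reconstruction a(h(row))."""
    total = sum(distance(row, decode(encode(row), len(row))) for row in rows)
    return total / len(rows)


current = [[1.0, 1.0, 1.0], [0.0, 2.0]]
d_ae = reconstruction_distance(current, euclidean)   # AE distance: ED on reconstructions
d_wae = reconstruction_distance(current, dtw)        # warp-AE: DTW on reconstructions
```

Only the encoder and decoder need to be stored per memory, rather than the raw training rows, which is the memory saving described above.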

These different similarity functions were each tested in CLA’s memory recall-gate in turn, gaining new insights about the effectiveness of each similarity approach in a CL system, when applied to a complex multivariate time-series problem.

### 4.5 Balancing

The base learner and all recalled memories are weighted by similarity to produce CLA’s final outcome, using the recall-gate, $g$:

$$\hat{y}_{t+1} = \sum_{m=1}^{M} \phi(X_t, \theta_m) \cdot \left[ 1 - \frac{\hat{D}(\tilde{X}_m, X_t)}{\sum_{m=1}^{M} \hat{D}(\tilde{X}_m, X_t)} \right] \tag{9}$$

Where $M$ is the number of memories in the memory structure. Previous research indicated that this was a more powerful approach than selecting a single memory Philps_2018. (Notably, both these balancing approaches significantly outperform equal weighting of all memories, indicating that CLA is gaining significantly more than a simple ensemble effect.)
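The weighting in equation (9) can be sketched in a few lines. The function name `balance` is hypothetical, every forecast is assumed to arrive with its own dissimilarity, and the equal-average fallback for an all-zero dissimilarity vector is an assumption added to avoid dividing by zero.

```python
from typing import Sequence


def balance(forecasts: Sequence[float], dissimilarities: Sequence[float]) -> float:
    """Similarity-weighted mixture with weight_m = 1 - D_m / sum(D)."""
    total = sum(dissimilarities)
    if total == 0.0:
        # Every context matches the current input exactly: plain average (assumption).
        return sum(forecasts) / len(forecasts)
    weights = [1.0 - d / total for d in dissimilarities]
    return sum(w * f for w, f in zip(weights, forecasts))


# Two learner outputs: the first has the more similar context, so it dominates.
mixed = balance(forecasts=[0.10, 0.02], dissimilarities=[1.0, 3.0])
```

With two entries the weights are 0.75 and 0.25. As written, the weights over $M$ entries sum to $M - 1$, so with more than two entries a renormalisation may be wanted; whether one is applied is not stated here.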

## 5 Investment Simulation Setup

CLA is used as a test bed for different learners and similarity approaches in a regression task to forecast future expected returns of individual equity securities. This is used to drive equities investment simulations, a real world task using noisy time-series. The data set consisted of stock level characteristics at each time-step. Tests were conducted to show the relative performance of a sliding window base learner, FFNN, and a sequential base learner, LSTM. Different similarity approaches were also used to drive the memory recall-gate; ED, DTW, AE and warp-AE.

Base learners were batch trained over all stocks at each time-step, forecasting USD total returns 12 months ahead for each stock. For the sliding window learner, a year-long, fixed length sliding window of four quarters was used for training, and for the sequential learner all historic data up to the current time was used for training. A stock level forecast in the top (bottom) decile of the stocks in a time-period was interpreted as a buy (sell) signal. Although CLA is designed to use non-traditional driver variables, stock level characteristics are commonly expressed using factor loadings. These were estimated in-sample at each time-step by regressing style factor excess returns against each stock level USD excess return stream. The regression related the excess return of each stock in each period to the excess return of the Emerging Market Equities Index and the relative return of the Emerging Market Value Equities Index.

Stock level factor loadings populated a matrix, $X$, which comprised the input data. Each row represented a stock appearing in the index at time $t$ (up to 5,500 stocks) and each column related to a coefficient calculated on a specific time lag. $X$ resulted from winsorizing the raw input to eliminate outliers.
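Winsorizing clips extreme values to chosen quantiles rather than discarding them. The sketch below is one simple way to do this; the quantile limits, the nearest-rank rule and the toy loadings are assumptions, since the limits used in the study are not stated.

```python
from typing import List, Sequence


def winsorize(values: Sequence[float], lower: float = 0.05, upper: float = 0.95) -> List[float]:
    """Clip values to the [lower, upper] empirical quantiles (nearest-rank method)."""
    ordered = sorted(values)
    last = len(ordered) - 1
    lo = ordered[int(round(lower * last))]
    hi = ordered[int(round(upper * last))]
    return [min(max(v, lo), hi) for v in values]


# Toy factor loadings with one extreme value at each end.
loadings = [-9.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 12.0]
clipped = winsorize(loadings, lower=0.1, upper=0.9)
```

Every stock keeps its row in the input matrix; only the magnitude of its outlying loadings is reduced.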

Long/short model portfolios were constructed (i.e. rebalanced) every six months over the study term, using equal weighted long positions (buys) and shorts (sells). The simulation encompassed 5,500 equities in total, covering 26 countries across emerging markets, corresponding to an Emerging Market Equities Index between 2006 and 2017. To account for the DTW sampling approach used and differences in random initialisation of neural components, several simulations were carried out per test.
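The step from forecasts to an equal weighted long/short book can be sketched as below. The function names and tickers are hypothetical, and the convention that longs sum to +1 and shorts to -1 is an assumption about how the equal weighting is scaled.

```python
from typing import Dict, List, Tuple


def decile_signals(forecasts: Dict[str, float]) -> Tuple[List[str], List[str]]:
    """Return (buys, sells): the top and bottom tenth of stocks ranked by forecast return."""
    ranked = sorted(forecasts, key=lambda name: forecasts[name], reverse=True)
    k = max(1, len(ranked) // 10)   # decile size; at least one stock per side
    return ranked[:k], ranked[-k:]


def equal_weights(buys: List[str], sells: List[str]) -> Dict[str, float]:
    """Equal-weighted long/short book: longs sum to +1, shorts to -1 (assumed scaling)."""
    weights = {s: 1.0 / len(buys) for s in buys}
    weights.update({s: -1.0 / len(sells) for s in sells})
    return weights


forecasts = {f"STOCK{i:02d}": i / 100.0 for i in range(20)}  # hypothetical tickers
buys, sells = decile_signals(forecasts)
book = equal_weights(buys, sells)
```

Repeating this at each six-monthly rebalance date, with fresh forecasts, gives the sequence of model portfolios whose returns are reported in the results.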

## 7 Simulation Results

CLA results showed a significant augmentation benefit for both base learners (see Figure 2b), while tests of similarity approaches favoured noise invariant approaches over simple ED.

Sliding window learner tests, CLA-FFNN, outperformed all the equivalent sequential learner tests, CLA-LSTM, in terms of total return (TR), and Sharpe ratios (see Figure 2a) were also superior (although none were significant at the 5% level). However, augmentation benefit, gauged by relative return (RR) and information ratio (Info Ratio), was superior for CLA-LSTMs (Figure 2b), with most augmentation tests for both learners statistically significant at the 5% level. In these tests, although CLA-LSTM saw a better augmentation benefit (RR), CLA-FFNN saw the strongest outright performance (TR), followed by unaugmented FFNN (given by TR-RR), then CLA-LSTM. By far the weakest outright performer (TR) was unaugmented LSTM (given by TR-RR).

Tests of different similarity approaches, used in the recall-gate, saw ED under-perform DTW in TR terms and also in terms of augmentation benefit. This was true for both learners tested. This would imply that the phase invariance that DTW provides is an important consideration in a real-world context. AE distance tests showed higher TRs than DTW and demonstrated statistically significant augmentation benefits at the 5% level for both learners, indicating that AE distance is an appropriate approach to use in this context. warp-AE generated the highest RR and information ratios of all similarity tests, implying that adding a DTW filter to AE distance was the most interesting similarity approach tested.

### 7.1 Interpretable Memory

CLA produces outcomes that can be explained and attributed to its memories. Figure 3 shows an example from one of the simulation runs, CLA-FFNN with AE similarity, and shows how certain memories were applied at certain time points to result in specific outcomes. In this case at least three memories are remembered (lower chart, black lines) and recalled at different future times. A learner memory remembered in January 2007, a period of turbulence in financial markets, adds the most value. It proves more appropriate than the base learner in the period of the 2008 financial crisis and its aftermath, which involved concerted fiscal stimulus (Sept 2008-Dec 2010). It was recalled again in 2013 and then in 2016, both also periods where fiscal stimulus dominated market returns (in Europe and China respectively).

## 8 Conclusion

We have empirically demonstrated that, when applied to a real-world financial task involving noisy time-series, a CL augmented sliding window learner (CLA-FFNN) is superior both to LSTM and to a CL augmented LSTM learner (CLA-LSTM). Testing of different similarity approaches, applied to a recall-gate, showed poor performance of simple euclidean distance (ED) when compared to dynamic time warping (DTW). This strongly implies that the timing of data-points is crucial in this task and likely in other real-world problems involving noisy time-series. Simulation tests also showed that AE distance is a good alternative to DTW. These results imply that AE dimensionality reduction and generalisation (using ReLU units in this case) are almost equivalent to DTW driven memory recall. warp-AE was proposed to benefit from both the AE’s dimensionality reduction and DTW’s phase invariance, and it produced the strongest investment performance and augmentation benefit of the similarity approaches tested. We also show that time-series CL not only outperforms an LSTM base learner but can provide a transparent explanation of which memory did what, and when. In summary, the most successful CL choices were found to be a sliding window CLA-FFNN learner combined with a recall-gate using warp-AE similarity. These tests also affirm Continual Learning Augmentation (CLA) as a real-world time-series CL approach, with the flexibility to augment different types of learners.

### 8.1 Future work

We have tested our approach on many financial data-sets but this approach could, in principle (and by design) be used on many other financial time-series problems. This might include applications to credit scoring, analysis of time/state-varying fairness in decision making and more.
