An empirical study of neural networks for trend detection in time series


Abstract

Detecting structure in noisy time series is a difficult task. One intuitive feature, which is of particular interest in financial applications, is the notion of trend. Starting from theoretical hints and using simulated time series, we empirically investigate the ability of standard recurrent neural networks (RNNs) to detect trends. We show the overall superiority and versatility of certain standard RNN structures over various other estimators. These RNNs could be used as basic blocks to build more complex time series trend estimators.

1 Introduction

When looking at any dataset, the human brain is wired to detect patterns [8]. Time series are no exception, and quite naturally we see “trends” when shown a plot of share prices. Trends therefore seem a relevant feature for any forecasting mechanism for time series. In this article, we focus on univariate time series having a conspicuous trend component, as commonly found in financial data. Trending time series are not unique to finance, and our work extends to other domains. The main contributions of this article are:

  • Framing trend detection as a classification problem and emphasizing the usefulness of simulated data

  • Building a general trend estimator for a wide range of dynamics

  • Showing in a simple case why RNNs are good trend estimators

  • Showing empirically the superiority of RNNs over standard estimators

  • Deriving theoretical maximum likelihood estimators for the considered dynamics

We first describe our general framework, establishing trend detection as a sequence-to-sequence classification problem. We then define the time series dynamics used in our simulations. Next, we explore the use of recurrent neural networks to detect trends. Thereupon, we empirically compare the performance of standard RNN structures. We then build a general-purpose trend estimator called the RNN baseline. We benchmark its performance against other estimators such as convolutional networks. Finally, we compare its performance against estimators based on parameter estimation (MLE) of the modelled dynamic. Mathematical details and full results are deferred to the appendix.

2 Framework and data set

In this section we define our framework, which addresses the question: what setup should one consider to find a “good” general-purpose estimator of trend in time series?

2.1 The thought process

Trends can be interpreted as the slopes of a smooth function around which the time series oscillates. The simplest choice, and probably the closest to human intuition, would be to use piecewise linear functions as described in [9]. The issue with these filtering approaches is that they tend to be good ex post but slow to detect changes of trend. This is a real problem when the whole time series is not known in advance.
We take a slightly different approach. If the future value of the time series is expected to be higher [respectively lower, equal] than the current one, then the time series is said to be trending up [respectively trending down, not trending]. At each time step, we assign a unique trend label: the time series is

  • trending downward at a given time if its expected future value is lower than its current value

  • not trending if the expected future value equals the current value

  • trending upward if the expected future value is higher than the current value

We can translate this intuition directly into mathematical terms. Consider a process X_t adapted to a filtration F_t; under some technical conditions, the Doob-Meyer theorem applies and X_t can be decomposed in a unique way as

X_t = X_0 + A_t + M_t

where A_t is a predictable increasing [respectively decreasing, zero] process starting at 0 if X_t is a sub-martingale [respectively super-martingale, martingale], and M_t is a martingale. Obviously, we can map our intuitive definition to these more precise concepts:
X is trending downward if A is decreasing, not trending if A is null, and trending upward if A is increasing.
The monotonicity of the process A will be our definition of the trend of X, which turns trend detection into a classification task with three labels for downward, flat and upward trend. Consider an Itô process

dX_t = μ_t dt + σ_t dW_t

where W_t is a Wiener process. We can track the changing monotonicity of the drift component via the sign of μ_t, which will be our practical definition of trend.
The challenge at hand is to build an estimator of the sign of μ_t, which will be our classification label. In the following, we will consider various time series dynamics where we control the sign of μ_t. This gives us a framework to analyse the performance of various estimators, while controlling for the statistical properties of the dataset.
The classification task relies on the labelling of the training set. When using historical data, labelling is not easy: the definition of trend is subjective and usually depends on the choice of a time window or of a performance criterion. On the contrary, when using simulated data, labelling of the training set is easy. A general-purpose estimator of trend in a simulated environment is a useful building block for handling more complex real-life cases where no trend labels are available. It gives us a robust starting point on which we can build.

2.2 Time series dynamics

Our idea is to generate as many realistic datasets as possible and to train trend estimators on those datasets. If we train our estimator on a dataset rich enough to capture all the possible scenarios, we can hope to obtain an estimator robust to real-life conditions. In the following, we consider three different types of dynamics, hopefully rich and diverse enough to cover much real-life behaviour:

  • a noisy piecewise linear process

  • a piecewise Ornstein-Uhlenbeck process [16]

  • a Markovian switching process [5]

The first two are piecewise, meaning that we divide time into intervals on each of which the time series follows the chosen dynamic. A simple continuity constraint is applied to “glue” these different periods together.
In the rest of the section we define:

  • a global time interval on which each time series is simulated

  • for piecewise processes, a number of sub-intervals of possibly different lengths

Noisy Line Process

We define a Noisy Line Process as a process that, on each interval, oscillates around a straight line: a deterministic linear component of slope a plus an additive noise term, where

  • a is a slope parameter randomly chosen in a range bounded by the maximum slope

  • σ is a noise parameter

  • the noise terms are i.i.d. normal variables

The trend here is given by the sign of the slope a. Figure 1 displays some possible trajectories; a simulation sketch is given after the figure.

(a) Flat process
(b) Trending up process
(c) Two periods but very noisy
(d) Several periods with less noise
Figure 1: Noisy line process samples. Up in green, down in red and flat in blue
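A minimal simulation sketch of this process (the exact way slopes and flat periods are drawn is not specified above, so the sampling choices below are assumptions):

```python
import numpy as np

def noisy_line(n_periods=3, points_per_period=300, a_max=1.4, sigma=0.07, seed=0):
    """Sample a piecewise noisy-line trajectory and its per-step trend labels.

    Assumptions: slopes are uniform in [-a_max, a_max] with an extra chance of
    being exactly zero, the time step is 1/points_per_period and the periods
    are glued continuously.
    """
    rng = np.random.default_rng(seed)
    dt = 1.0 / points_per_period
    line, values, labels = 0.0, [], []
    for _ in range(n_periods):
        a = 0.0 if rng.random() < 0.25 else rng.uniform(-a_max, a_max)
        for _ in range(points_per_period):
            line += a * dt                                        # the underlying line
            values.append(line + sigma * rng.standard_normal())   # noisy observation
            labels.append(int(np.sign(a)))                        # label = sign of the slope
    return np.array(values), np.array(labels)

series, labels = noisy_line()
```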

Piecewise Ornstein-Uhlenbeck dynamic

We define a Piecewise Ornstein-Uhlenbeck Process as a process that, on each interval, follows an Ornstein-Uhlenbeck dynamic. If the intervals are long enough, the process settles around its reversion level and the trend label will be determined by

(1)

Samples of the piecewise Ornstein-Uhlenbeck process are shown in figure 2; a simulation sketch is given after the figure.

(a) Three periods Ornstein-Uhlenbeck process with weak “pull”
(b) Four periods Ornstein-Uhlenbeck process with strong “pull”
Figure 2: Piecewise Ornstein-Uhlenbeck processes. Up in green, down in red and flat in blue
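A minimal simulation sketch of this dynamic. The labelling rule of equation (1) is not reproduced above, so labelling each period by the sign of the gap between the reversion target and the starting value is an assumption:

```python
import numpy as np

def piecewise_ou(n_periods=3, points_per_period=300, theta=5.0, sigma=0.1,
                 level_scale=1.0, seed=0):
    """Sample a piecewise Ornstein-Uhlenbeck trajectory with per-period labels."""
    rng = np.random.default_rng(seed)
    dt = 1.0 / points_per_period
    x, values, labels = 0.0, [], []
    for _ in range(n_periods):
        m = x + level_scale * rng.choice([-1.0, 0.0, 1.0])   # reversion target
        label = int(np.sign(m - x))                          # assumed labelling rule
        for _ in range(points_per_period):
            # Euler-Maruyama step of dX = theta * (m - X) dt + sigma dW
            x += theta * (m - x) * dt + sigma * np.sqrt(dt) * rng.standard_normal()
            values.append(x)
            labels.append(label)
    return np.array(values), np.array(labels)
```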

Switching Markovian dynamic

The trend is given by a Markov chain on a finite state space, which drives the increments of the process through a slope process and a positive noise process. In practice, the slope and the noise attached to each state are constant in time, each constant being randomly drawn from a discrete distribution. This process exhibits a rich set of trajectories, as seen in figure 3; a simulation sketch is given after the figure.

(a) Trendy process with noise
(b) Trendy process with low noise
(c) “Earthquake” process
(d) Rapidly changing trend
Figure 3: Some trajectories from our model with a three-state Markov chain. Up in green, down in red and flat in blue
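A minimal simulation sketch of this switching dynamic (the transition matrix and the discrete slope and noise values below are illustrative assumptions):

```python
import numpy as np

def markov_switch(n_points=1000, slopes=(-1.0, 0.0, 1.0), sigma=0.05,
                  stay_prob=0.99, seed=0):
    """Three-state Markov chain picks the current slope; increments are slope * dt plus noise."""
    rng = np.random.default_rng(seed)
    n_states = len(slopes)
    P = np.full((n_states, n_states), (1 - stay_prob) / (n_states - 1))
    np.fill_diagonal(P, stay_prob)                  # mostly stay in the same state
    dt = 1.0 / n_points
    state, x, values, labels = 1, 0.0, [], []
    for _ in range(n_points):
        state = rng.choice(n_states, p=P[state])
        x += slopes[state] * dt + sigma * np.sqrt(dt) * rng.standard_normal()
        values.append(x)
        labels.append(int(np.sign(slopes[state])))  # label = sign of the current slope
    return np.array(values), np.array(labels)
```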

2.3 Training and Validation sets

Training sets are made of 1000 time series, each containing roughly 1000 data points, randomly drawn:

  • from either one of the three previous dynamics (see section 2.2)

  • or from all of the previous dynamics. This will be named mixed dynamic in the following

Model selection is made on validation sets composed of 300 time series: 100 samples from each of the three dynamics described in section 2.2. Each sample has between 500 and 1000 points, depending on the dynamic and the draw. Figure 4 shows random samples from the validation set. This validation set offers a rich set of scenarios and can be used to assess the ability of an estimator to detect trends. Hyper-parameters are chosen using a separate test set, which is a new random draw from the same distribution as the training set.

Figure 4: Some samples of a validation set

2.4 From empirical data to stylised time series dynamics

One important question arising from the chosen approach is the relevance of the simulated data. Even though they are not designed to simulate market dynamics, the simulated processes can show behaviours relatively similar to actual asset prices. As an example, figure 5 plots daily time series of real assets against random samples from our three dynamics.

(a) Oil future contract
(b) EUR-USD exchange rate
(c) S&P 500 index
(d) USD 10-year swap rate
Figure 5: Real assets versus various samples of simulated dynamics

We see that the trajectories can be visually similar but that the distribution of daily returns may differ greatly. We must bear in mind that our aim is not to simulate market data but to detect trend, defined as the sign of the drift term. We think that our dynamics are good enough to reproduce this property of real time series. One general method to get simulated dynamics close to empirical market data is the following (a code sketch is given after the list):

  1. Choose a dynamic

  2. Compute the distribution of returns of the market time series of interest

  3. Sample time series from the dynamic and compute the distributions of their returns

  4. Compute the average distance between the sampled distributions and the empirical one

  5. Minimize this distance over the dynamic's parameters using black-box Bayesian optimization
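A sketch of the corresponding objective, assuming `simulate` wraps one of the simulators above and using the Wasserstein distance as a stand-in for the distance left unspecified here:

```python
import numpy as np
from scipy.stats import wasserstein_distance

def distribution_gap(params, market_prices, simulate, n_samples=50):
    """Average distance between simulated and empirical return distributions.

    `simulate(params, seed)` is assumed to return (series, labels); plain
    increments stand in for returns.
    """
    market_returns = np.diff(market_prices)
    dists = []
    for seed in range(n_samples):
        sim, _ = simulate(params, seed)
        dists.append(wasserstein_distance(np.diff(sim), market_returns))
    return float(np.mean(dists))

# Hand this objective to any black-box Bayesian optimizer (e.g. scikit-optimize's
# gp_minimize) to tune the parameters of the chosen dynamic.
```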

3 Using Recurrent Neural Networks to detect trends

We motivate here the use of Recurrent Neural Networks (RNNs) for our classification problem. Starting from a simple intuition, we prove their benefit in a simple case.

3.1 Motivation: moving average filtering and its extension as a RNN

One of the most common ways to detect trends is to adopt a filtering approach, comparing smoothed versions of the initial process. For example, we could aggregate several moving averages like:

m_t^λ = (1 − λ) m_{t−1}^λ + λ X_t    (2)

with various values of λ ∈ (0, 1). Determining the optimal λ might be difficult if we want to build an estimator adapted to various dynamics. To circumvent this difficulty, we can aggregate the values m_t^λ for different λ as the components of a vector m_t evolving through time.
For example, we might want to consider the concatenation of a fast, a medium and a slow moving average. We might compare:

  • the slow and the fast moving averages, by looking at the sign of their difference

  • or maybe the slow one versus an average of the medium and slow ones, via the sign of the corresponding difference

  • or whatever weighted combination we fancy, via the sign of a linear combination of the components

Generally speaking, we look at the signs of the components of the vector C m_t, where C is a given weight matrix. The rows of C define hyperplanes. The half-spaces determined by these hyperplanes are given by the signs of the components of C m_t. Detecting a trend is simply trying to locate m_t with regard to the convex polytopes determined by these half-spaces.
Generalizing equation (2) to higher dimensions, we have:

m_t = A m_{t−1} + b X_t

where A is a positive matrix and b a positive vector such that each row of A and the corresponding entry of b sum to one.
The trend is determined by the sign of C m_t, but we could use any other activation function instead of the sign function.

These equations are exactly the update equations of a RNN composed of

  • a vanilla RNN

    • with the identity as activation function

    • with one hidden layer

    • with convex constraints on the weight matrix

  • with a simple linear layer and an activation function on top

Such a RNN will be called a “convex net” in the following. This shows that RNNs can be considered as generalizations of some basic moving average comparisons. As a working example, we consider the case of the Noisy Line Process, whose noise terms are independent random variables.
For a net with constrained weights it can be shown (see annex B for details) that:

  • without trend (zero slope), the state becomes centered around a random variable of finite variance

  • with a trend (non-zero slope), the state diverges

If we now introduce a hyperbolic tangent activation function instead of the identity:

  • with zero slope, near zero the cell operates in its linear regime and we should expect the state to stay bounded around the origin

  • with a non-zero trend, the state should saturate, i.e. navigate near the faces of the hypercube

For a practical illustration see annex C.
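As an illustration, here is a minimal sketch of such a convex net with hand-picked (untrained) weights: a bank of three exponential moving averages written as a constrained linear recurrence, with the trend read from the signs of a linear read-out. The read-out matrix is an arbitrary example:

```python
import numpy as np

# Recurrence m_t = A m_{t-1} + b * x_t with A diagonal and positive and
# A.sum(axis=1) + b = 1 (the "convex" constraint): three exponential moving
# averages with fast, medium and slow decays.
lambdas = np.array([0.5, 0.9, 0.99])
A = np.diag(lambdas)
b = 1.0 - lambdas
C = np.array([[1.0, 0.0, -1.0],   # fast minus slow
              [0.0, 1.0, -1.0]])  # medium minus slow

def convex_net_trend(series):
    m = np.zeros(3)
    signs = []
    for x in series:
        m = A @ m + b * x             # moving-average update
        signs.append(np.sign(C @ m))  # locate m with respect to the hyperplanes
    return np.array(signs)
```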

3.2 Overview of RNNs and data

Standard Recurrent Neural Nets

In subsection 2.1, we turned the trend estimation problem into a sequence-to-sequence classification task, for which RNNs can be used. We consider three standard structures:

  • Vanilla RNN as defined in [3]

  • LSTM as introduced in [7]

  • GRU as introduced in [2]

RNNs contain cycles: the hidden state can depend on the entire past input sequence. We refer to [4] for details. These three standard RNNs have different structures but they share similar update equations of the form

g_t = φ(W x_t + U h_{t−1} + b)

where

  • g_t is a vector representing some internal cells at time t

  • φ is a block-wise activation function

  • x_t is the input at time t

  • h_{t−1} is the state at time t − 1

  • W, U and b are weight matrices and bias vectors

φ is applied elementwise (block by block) and the products are matrix products.
Depending on the RNN, the new state h_t is a combination of blocks of g_t and possibly of h_{t−1}.
Essentially, h_t = F(h_{t−1}, x_t), where F is a possibly complex mapping from the previous state and the current input to the new state. We refer the reader to [3], [7] and [2] for more details.
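For concreteness, a minimal PyTorch sketch of the kind of sequence-to-sequence classifier considered here (a GRU; the sizes echo table 6 but the exact architecture is an assumption):

```python
import torch.nn as nn

class TrendRNN(nn.Module):
    """Sequence-to-sequence trend classifier: one label (down/flat/up) per step."""

    def __init__(self, hidden_size=20, num_layers=2, dropout=0.2, n_classes=3):
        super().__init__()
        self.rnn = nn.GRU(input_size=1, hidden_size=hidden_size,
                          num_layers=num_layers, dropout=dropout,
                          batch_first=True)
        self.head = nn.Linear(hidden_size, n_classes)

    def forward(self, x):        # x: (batch, time, 1)
        h, _ = self.rnn(x)       # h: (batch, time, hidden_size)
        return self.head(h)      # logits: (batch, time, 3)
```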

Training RNNs

For training and validation, we use simulated time series according to section 2.3. Our aim is to give a precise empirical comparison of these three structures taking into account the possible influence of the training dynamic. We train triplets of the form:

  • a RNN chosen among Vanilla, LSTM or GRU

  • some meta-parameters like the number of recurrent layers, the dimension of hidden layer(s), dropout (see [15] for definition)…

  • a time series dynamic chosen among Noisy Line Process, Piecewise Ornstein-Uhlenbeck, Markovian Switch or a mixed dynamic

Each of these triplets is trained and validated against the training and validation sets described in subsection 2.3. This gives us more than 400 triplets to train and validate. Roughly 100 triplets hit convergence issues during training and are excluded from the validation phase. Details of the parameters can be found in annex D.1. Also, to get more robust results, we performed a complete training using two different gradient-based optimizers (a minimal training-loop sketch is given after the list):

  • Adam (see [10] for details) as it is commonly used and has some theoretical convergence properties to a stationary point (see [1] for details)

  • RMSprop algorithm (see [6] for details)
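A minimal training-loop sketch for one triplet; the batching and the per-step cross-entropy loss are assumptions:

```python
import torch
import torch.nn as nn

def train(model, loader, optimizer_name="adam", lr=5e-3, epochs=200):
    """`loader` yields (series, labels) batches: series of shape (batch, time, 1),
    integer labels in {0, 1, 2} of shape (batch, time)."""
    opt_cls = {"adam": torch.optim.Adam, "rmsprop": torch.optim.RMSprop}[optimizer_name]
    optimizer = opt_cls(model.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(epochs):
        for series, labels in loader:
            logits = model(series)                            # (batch, time, 3)
            loss = loss_fn(logits.reshape(-1, 3), labels.reshape(-1))
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return model
```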

3.3 Empirical findings

We train our triplets as described in subsection 3.2.2 for both Adam and RMSprop and validate each triplet on our 300 validation samples (see section 2.3). The loss is a binary (0–1) loss on the predicted labels.
Table 1 shows the coefficients of the linear regression of the loss against binary variables indicating the training dynamic, the net type, the optimization type and the validation dynamic. Each categorical feature is translated into binary on/off variables, dropping one modality; the dropped modality corresponds to all the other indicators being set to zero. A positive coefficient means that the highlighted modality increases the average loss of the sample and, conversely, a negative coefficient means it decreases the average loss. Full details can be found in annex D.2; a regression sketch is given after table 1.

Feature[Modality] Coefficient
Intercept 0.48
Training dynamic[Markovian Switch]
Training dynamic[Ornstein-Uhlenbeck] 0.029
Training dynamic[Noisy Line]
Net Type[LSTM] 0.037
Net Type[Vanilla] 0.17
Optimization[RMSP] 0.0234
Validation dynamic[Ornstein-Uhlenbeck] -0.1
Validation dynamic[Noisy Line] -0.036
Table 1: Ordinary least squares (OLS) model of the loss onto the various features. Left hand column is the feature column with the specified modality in brackets. Positive coefficient means that the presence of the modality in brackets is detrimental to performance
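A sketch of how such a regression can be run; the data frame below is a random placeholder, not the paper's results table:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n = 300
df = pd.DataFrame({
    "loss": rng.uniform(0.1, 0.7, n),
    "train_dynamic": rng.choice(["nl", "ou", "ms", "mix"], n),
    "net_type": rng.choice(["gru", "lstm", "vanilla"], n),
    "optimizer": rng.choice(["adam", "rmsprop"], n),
    "val_dynamic": rng.choice(["nl", "ou", "ms"], n),
})

# C(...) expands each categorical feature into on/off indicators,
# dropping one reference modality, as described in the text.
fit = smf.ols("loss ~ C(train_dynamic) + C(net_type) + C(optimizer) + C(val_dynamic)",
              data=df).fit()
print(fit.params)
```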

From figure 6:

  • training on the Ornstein-Uhlenbeck dynamic seems to worsen performance

  • GRU seems to be the best net type and Vanilla not a great choice

  • the optimization algorithm RMSProp has a negative impact on performance. Adam leads to better results

  • the validation loss on the Markovian Switch dynamic is higher than on the two other dynamics

Figure 6: Box-plotting losses by optimization, net type and training dynamic. In dashed red the overall median loss, in dash-dotted blue the overall loss for a given optimization type. Dynamic of the training data is nl for Noisy Line, ou for Piecewise Ornstein-Uhlenbeck, ms for Markovian Switch and mix for the mixed dynamic

Training dynamic has an impact on validation performance. Choosing two dynamics, e.g. Noisy Line versus Piecewise Ornstein-Uhlenbeck, we select the triplets trained on those dynamics only and bootstrap them. For each bootstrap iteration, we compute the difference between the median losses of one dynamic and the other (a sketch of this procedure is given after table 2). The results can be seen in table 2. Even if all intervals contain zero, so no robust conclusion can be drawn, the median loss seems lower when training on the Noisy Line or Markovian Switch dynamics.

type 1 - type 2 Median loss difference 1% confidence interval
nl - ou -0.04 -0.19 0.10
nl - ms 0.01 -0.15 0.17
nl - mix -0.009 -0.17 0.15
ou - ms 0.05 -0.10 0.21
ou - mix 0.04 -0.12 0.20
ms - mix -0.02 -0.20 0.16
Table 2: Difference of median loss for training type 1 - median loss for training type 2 using bootstrapping percentile confidence interval. In red, negative values, blue, positive values, in confidence interval columns
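A sketch of the bootstrap procedure behind tables 2 to 4 (the number of resamples is an arbitrary choice):

```python
import numpy as np

def bootstrap_median_diff(losses_1, losses_2, n_boot=10000, alpha=0.01, seed=0):
    """Percentile bootstrap confidence interval for the difference of median
    losses between two groups of triplets (e.g. trained on 'nl' vs 'ou')."""
    rng = np.random.default_rng(seed)
    diffs = np.empty(n_boot)
    for i in range(n_boot):
        s1 = rng.choice(losses_1, size=len(losses_1), replace=True)
        s2 = rng.choice(losses_2, size=len(losses_2), replace=True)
        diffs[i] = np.median(s1) - np.median(s2)
    lo, hi = np.quantile(diffs, [alpha / 2, 1 - alpha / 2])
    return np.median(losses_1) - np.median(losses_2), (lo, hi)
```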

Net structures are compared using the same bootstrapping procedure in table 3. The Vanilla RNN is consistently worse than LSTM and GRU at the 99% confidence level. As a result, in the following, we will ignore triplets with a Vanilla RNN. The Vanilla RNN is barely better than a dummy estimator that predicts the trend correctly only by chance (see annex D.3).

net 1 - net 2 Median loss difference 1% confidence interval
vanilla - lstm 0.14 -0.005 0.28
vanilla - gru 0.18 0.04 0.32
lstm - gru 0.05 -0.15 0.25
Table 3: Difference of median loss for net structure 1 - median loss for net structure 2 using bootstrap percentile confidence interval. Highlighted in yellow the underperformance of Vanilla RNN

Optimizer impact: results seem to indicate a slightly better performance of Adam over RMSprop.
Net structure and training dynamic interaction: using only the triplets where the net structure is either GRU or LSTM, we run the same bootstrapping procedure on the training dynamic separately for each net structure. The results are given in table 4. All the intervals contain 0, and it is difficult to find a combination which does significantly better than the others.

type 1 - type 2 Median loss difference 1% confidence interval
nl - ou -0.05 -0.25 0.15
nl - ms 0.002 -0.18 0.19
nl - mix -0.002 -0.19 0.19
ou - ms 0.05 -0.10 0.20
ou - mix 0.05 -0.12 0.22
ms - mix -0.005 -0.18 0.17
(a) Training bootstrap for LSTM only
type 1 - type 2 Median loss difference 1% confidence interval
nl - ou -0.025 -0.21 0.17
nl - ms 0.05 -0.13 0.24
nl - mix 0.06 -0.13 0.26
ou - ms 0.08 -0.11 0.26
ou - mix 0.09 -0.08 0.25
ms - mix 0.008 -0.18 0.20
(b) Training bootstrap for GRU only
Table 4: Interaction between the net structure GRU or LSTM and the training type Noisy Line (nl), Piecewise Ornstein-Uhlenbeck (ou) or Markovian Switch (ms). The loss difference is the loss of the first element of the pair minus the loss of the second

3.4 RNN baseline selection

We would like to choose a RNN estimator having good overall performance on validation data. As we have seen, it is difficult to single out a particular training type or net structure (GRU or LSTM) as being significantly better. One way to build a baseline would be, for example, to pool the estimated probabilities of the best trained estimators. The pooling function here is a simple average of the estimated probabilities from the selected estimators (a sketch is given after table 5). This indeed gives good results on validation data, as can be seen in table 5. We note little difference in performance when pooling more than five estimators.

Validation dynamic type Median loss First quartile Third quartile IQR
Mixed 0.22 0.11 0.39 0.28
Ornstein-Uhlenbeck 0.21 0.14 0.31 0.17
Markovian Switch 0.37 0.21 0.52 0.31
Noisy Line 0.11 0.05 0.23 0.18
Table 5: Loss and Interquartile Range (IQR) of loss for the pooled net of 5 best RNN estimators
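A minimal sketch of the pooling step, assuming each selected estimator outputs per-step class probabilities of shape (time, 3):

```python
import numpy as np

def pooled_probabilities(prob_list):
    """Average the per-step class probabilities of several estimators."""
    return np.mean(np.stack(prob_list, axis=0), axis=0)

def pooled_labels(prob_list):
    # Predicted label at each step = argmax of the averaged probabilities.
    return np.argmax(pooled_probabilities(prob_list), axis=-1)
```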

Yet, choosing such an estimator would give RNNs an advantage compared to other estimators. To be as fair as possible and to favour simplicity over performance, we choose to optimize hyper-parameters for a GRU network trained on the piecewise Noisy Line dynamic using Adam optimization. Some details of the RNN baseline can be found in table 6.

It is interesting to note that adding training epochs seems to slightly increase the median error on the test set but gives a noticeable decrease of the interquartile range, by roughly 25%.

Net structure type GRU
Dropout 0.2
Number of hidden recurrent layers 2
Dimension of hidden recurrent layers 20
Learning rate 0.005
Number of epochs 200
Training type Noisy Line
Max noise level 0.07
Max line slope 1.4
Table 6: Parameters of RNN baseline

Running the training with hyper-parameters not too far from the ones obtained by optimization gives fairly similar results. The comparison of the RNN baseline versus the pooled estimator is given in table 7, and figure 7 shows the loss distributions. Even if our RNN baseline is not the best, it still offers good performance.

Figure 7: Comparing validation loss distribution for pooled estimator in orange with red median and RNN baseline in blue with cyan median
Dynamic RNN Pooled estimator
All 0.25 0.22
Ornstein-Uhlenbeck 0.25 0.24
Noisy Line 0.13 0.13
Markovian Switch 0.49 0.37
Table 7: Median losses for RNN baseline or pooled estimator for various dynamics on validation set

4 Non model based estimation

By “non model based”, we mean estimators which are not based on an explicit modelling of the underlying dynamic. We compare the RNN baseline of subsection 3.4 against a simple moving average estimator, its generalization (see subsection 3.1) and a Convolutional Neural Network (CNN, see [11]). Overall, the RNN baseline exhibits much stronger validation performance.

4.1 Comparison with moving average

One of the most intuitive ways to detect a trend is to compare two moving averages of different speeds. We compare our RNN baseline with both the simplest moving average filter and the convex net generalization of subsection 3.1.

Simple moving average

We first compare the RNN baseline with a basic estimator computing two moving averages: a slow one (s) and a fast one (f).

Given a no-trend threshold, the trend prediction is the sign of the difference between the fast and the slow moving averages whenever this difference exceeds the threshold in absolute value, and flat otherwise.

Obviously, these parameters have a big impact on the estimator's performance. Using Bayesian optimization, we find the parameters shown in table 8 (a sketch of the classifier is given after the table).

Parameter Value
0.95
0.48
0.1
Table 8: Parameters of Moving average baseline
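A sketch of such a two-moving-average classifier; which optimized value in table 8 corresponds to which parameter is not identified above, so the assignment below is an assumption:

```python
import numpy as np

def ma_trend(series, lam_slow=0.95, lam_fast=0.48, threshold=0.1):
    """Return +1 / 0 / -1 at each step from the gap between a fast and a slow
    exponential moving average, thresholded by the no-trend level."""
    slow = fast = series[0]
    labels = []
    for x in series:
        slow = lam_slow * slow + (1 - lam_slow) * x   # slow exponential MA
        fast = lam_fast * fast + (1 - lam_fast) * x   # fast exponential MA
        diff = fast - slow
        labels.append(int(np.sign(diff)) if abs(diff) > threshold else 0)
    return np.array(labels)
```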

Figure 8 shows the loss distribution of the RNN baseline versus the loss distribution of the moving average estimator over all dynamics.

Figure 8: Comparing validation loss distribution for MA estimator in orange with red median and RNN baseline in blue with cyan median

On average, the RNN baseline is consistently better than the moving average estimator, as seen in table 9. The Markovian Switch dynamic is sometimes extremely difficult to apprehend due to highly volatile regime switching. For this dynamic, we see that both estimators are equally bad, which is not unexpected given the difficulty of the task.

Dynamic RNN MA
All 0.26 0.43
Ornstein-Uhlenbeck 0.23 0.31
Noisy Line 0.14 0.48
Markovian Switch 0.51 0.53
Table 9: Median loss for RNN or MA estimator for various dynamics on validation set

Comparison with moving average generalization

We compare the baseline RNN with the estimator built according to subsection 3.1. Basically, this is a Vanilla RNN without any activation function, whose weights are constrained to form a stochastic matrix. It turns out, a bit surprisingly to us, that the performance is quite poor and far worse than that of the RNN baseline. Further investigation is needed, but training seems to fail somehow, as the trained weights are all very close to zero. As a result, the input plays little role in the prediction and the estimator surely cannot do much better than a dummy one. For reference, basic results are shown in table 10.

Dynamic RNN Generalized moving average
All 0.27 0.61
Ornstein-Uhlenbeck 0.26 0.61
Noisy Line 0.12 0.62
Markovian Switch 0.47 0.61
Table 10: Median loss for RNN baseline and convex net estimator for various dynamics on validation set

4.2 Comparison with CNN

One-dimensional CNNs are sometimes seen as good tools to analyse time series. We use a standard CNN structure stacking a convolutional layer followed by a pooling layer. To keep the architectures similar in terms of number of parameters, we use two convolution + pooling blocks. After optimization, we obtain the hyper-parameters shown in table 11 (a sketch of the architecture is given after the table). Interestingly, both the channel dimension and the kernel size have taken the maximum value in the range we tested.

Parameter Value
Learning rate 0.004
Channel dimension 20
Kernel size 20
Table 11: Parameters of CNN baseline
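A minimal PyTorch sketch of such a two-block convolutional classifier; the padding and pooling choices are assumptions made to keep one output per time step:

```python
import torch.nn as nn

class TrendCNN(nn.Module):
    """Two convolution + pooling blocks followed by a per-step linear head."""

    def __init__(self, channels=20, kernel=20, n_classes=3):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv1d(1, channels, kernel, padding="same"), nn.ReLU(),
            nn.AvgPool1d(kernel_size=3, stride=1, padding=1),
            nn.Conv1d(channels, channels, kernel, padding="same"), nn.ReLU(),
            nn.AvgPool1d(kernel_size=3, stride=1, padding=1),
        )
        self.head = nn.Conv1d(channels, n_classes, kernel_size=1)

    def forward(self, x):                      # x: (batch, 1, time)
        return self.head(self.features(x))     # logits: (batch, 3, time)
```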

Yet, we are unable to reproduce the supposed general efficiency of CNNs in our setup, as seen in figure 9. Actually, CNN performance is barely better than that of a dummy classifier, as seen in table 12.

Figure 9: Comparing validation loss distribution for CNN estimator in orange with red median and RNN baseline in blue with cyan median
Dynamic RNN CNN
All 0.25 0.58
Ornstein-Uhlenbeck 0.27 0.48
Noisy Line 0.13 0.65
Markovian Switch 0.41 0.64
Table 12: Median loss for RNN baseline and CNN estimator for various dynamics on validation set

5 Model based estimators

In this section, we compare the performance of the RNN baseline with classifiers based on maximum likelihood estimation (MLE) of the process parameters. These estimators therefore incorporate knowledge about the underlying data generating process. For each dynamic (see subsection 2.2), we compute the MLE estimator of the trend parameter. Then, we use this value at each time step to compute a trend label. This approach, which converts a numerical estimate of the trend into a label, is described in the following subsection.

In subsections 5.2, 5.3 and 5.4 we recall the formulas of the MLE trend estimators and present their empirical performance in comparison with the RNN baseline. Overall, the baseline shows good performance against these estimators. Theoretical details of MLE derivations are included in annex A.

5.1 From MLE to trend classifier

As a reminder, the training data used for the learning step of the neural networks comprises piecewise trajectories of the dynamics with randomized model parameters. Taking this additional randomness into account in an MLE framework would make the theory intractable. In order to compare MLE-based trend classification with neural networks, we use a sliding window mechanism. For a sliding window of a given length:

  • we compute the value of the trend estimator on the window

  • we map this value to a label using the sign function (for a given no-trend threshold) and predict this label

We only need this mechanism for the Noisy Line Process and the Piecewise Ornstein-Uhlenbeck Process; a sketch is given below.
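A minimal sketch of this sliding-window labelling, usable with any window-based trend estimator:

```python
import numpy as np

def mle_trend_labels(series, window, estimator, threshold):
    """`estimator` maps a window of values to a scalar trend estimate (e.g. the
    noisy-line slope MLE); the estimate is thresholded into down/flat/up."""
    labels = np.zeros(len(series), dtype=int)
    for t in range(window, len(series)):
        est = estimator(series[t - window:t])
        labels[t] = int(np.sign(est)) if abs(est) > threshold else 0
    return labels
```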

5.2 Noisy Line Estimator

Derivation of MLE estimator on an interval

Deriving the maximum likelihood estimator of the slope is easy, as any finite sample on a subdivision is a Gaussian vector with diagonal covariance matrix. Maximizing the likelihood yields the slope formula (see annex A.1 for mathematical details):

(3)

The MLE estimator of the slope follows a normal distribution centred on the true slope. For a subdivision with constant time step, its variance decreases with the number of observations.

Empirical performance

Using the same procedure as in section 4, we compare the performance of the resulting Noisy Line Estimator against our RNN baseline in figure 10 and table 13.

Figure 10: Comparing validation loss distribution for Noisy Line Estimator in orange with red median and RNN baseline in blue with cyan median
Dynamic RNN NLE
All 0.28 0.53
Ornstein-Uhlenbeck 0.29 0.42
Noisy Line 0.14 0.56
Markovian Switch 0.47 0.61
Table 13: Median loss for RNN or Noisy Line Estimator for various dynamics on validation set

The Noisy Line Estimator is easily outperformed by the RNN baseline, even on the simple noisy line dynamic.

5.3 Piecewise OU process

Derivation of MLE estimator on an interval

Estimating the parameters of continuous-time diffusions is a difficult task. One way to construct such estimators is to derive the likelihood function on a discrete grid of price observations. Due to non-independent samples, the likelihood can be hard to derive and its maximisation might require numerical optimization procedures. In the present study, we leverage the theoretical results of [13, 12] that express the likelihood function in a simple stochastic integral form. In the case of the Ornstein-Uhlenbeck process with a linear trend diffusion:

the formulas for the estimators are given by:

(4)
(5)

To some extent, an analogy can be drawn with classical OLS estimators and their variance scaling term. The reader can refer to annex A.2 for mathematical details. When dealing with discrete time observations, the integrals are approximated using the sample values and the discrete time increments. Simulations show that these estimators exhibit good empirical properties, although they are biased; the biases of both estimators are given in annex A.2.

In practical applications, the expectations involved in the bias are computed by first evaluating the residuals over the observed values of the process and then approximating the integrals by summation of the weighted increments.

Empirical performance

We design a trend estimator using the sliding window mechanism of subsection 5.1. We compare its performance against our RNN baseline in figure 11 and table 14. Interestingly, its performance on the Ornstein-Uhlenbeck dynamic is markedly better than on the other dynamics and comparable to that of the RNN baseline on the same dynamic.

Figure 11: Comparing validation loss distribution for Ornstein-Uhlenbeck Estimator in orange with red median and RNN baseline in blue with cyan median
Dynamic RNN OUE
All 0.28 0.50
Ornstein-Uhlenbeck 0.28 0.34
Noisy Line 0.12 0.53
Markovian Switch 0.41 0.58
Table 14: Median loss for RNN or Ornstein-Uhlenbeck Estimator (OUE) for various dynamics on validation set

5.4 Markovian switch process

Derivation of MLE estimator

The Markovian Switch dynamic described in section 2.2.3 is actually the dynamic of a Hidden Markov Model (HMM) with Gaussian emission probabilities on the log returns, the hidden state being a simple discrete three-state Markov chain. We then use classic techniques (see [14] for example) to get an estimate of the hidden states which have generated the observations.
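A sketch of this HMM-based classifier using hmmlearn; plain increments stand in for the log returns, and the mapping of hidden states to labels by sorting emission means anticipates the procedure described below:

```python
import numpy as np
from hmmlearn.hmm import GaussianHMM

def hmm_trend_labels(series):
    """Fit a three-state Gaussian HMM on the returns and map its hidden states
    to down/flat/up by sorting the emission means."""
    returns = np.diff(series).reshape(-1, 1)
    hmm = GaussianHMM(n_components=3, covariance_type="diag", n_iter=100)
    hmm.fit(returns)
    states = hmm.predict(returns)
    order = np.argsort(hmm.means_.ravel())               # increasing emission means
    state_to_label = {s: lab for lab, s in zip((-1, 0, 1), order)}
    return np.array([state_to_label[int(s)] for s in states])
```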

Empirical performance

We train a three-state HMM with Gaussian emission probabilities on the four time series dynamics (as described in subsection 2.2). Performance is similar regardless of the training dynamic. It is not obvious that the hidden states of the HMM will fit our up, down, flat trend categories. To be able to compute a loss for the HMM, we first map the three states of the HMM to trend labels using the mean of the emission distribution given the hidden state: we sort the means in increasing order and map them to the down, flat and up states. We would expect the sorted means to be negative, close to zero and positive respectively. Actually, only estimators trained on the mixed or Markovian Switch dynamics exhibit means which are clearly separated into a negative, a near-zero and a positive value. Performance being similar, we use as baseline the estimator trained on the Markovian Switch dynamic, which seems the most natural choice. Globally, the HMM has a hard time predicting the trend of any dynamic. This might be a bit surprising, especially for the Markovian Switch dynamic. We note however that the best validation score is obtained when the HMM is trained on the Markovian Switch dynamic. As seen in figure 12 and table 15, the HMM does not provide a good estimator of trend and is easily outperformed by the RNN approach.

Figure 12: Comparing validation loss distribution for HMM estimator in orange with red median and RNN baseline in blue with cyan median
Dynamic RNN HMM
All 0.30 0.70
Ornstein-Uhlenbeck 0.28 0.84
Noisy Line 0.17 0.74
Markovian Switch 0.50 0.64
Table 15: Median loss for RNN or HMM estimators for various dynamics on validation set

6 Summary

In this paper, we have investigated the use of several trend estimators on time series behaving similarly to the ones encountered in finance. We have derived theoretical maximum likelihood estimators of trends for two standard dynamics and implemented them. We have shown that certain RNNs are, in a way, a generalization of simple moving average techniques. For a simple dynamic, we have shown that this generalization transforms the trend estimation problem into locating the state vector of the network. Finally, we have shown empirically that GRU or LSTM cells are on average the best building blocks for detecting trends in time series, compared to a broad range of estimators. Putting the emphasis on learning from stylized simulated data and then transferring to real data, rather than building complex structures fitted to historical data, is also an important takeaway of this paper. Ongoing preliminary research seems to validate our approach for financial applications. This might pave the way to building efficient market estimators protected against over-fitting.

Appendix A MLE estimators theory

A.1 Simple noisy line estimator

On a discrete time grid we consider the “noisy line” dynamics:

(6)

where is a collection of i.i.d. normal random variables .

One can easily show that is a Gaussian vector with diagonal covariance matrix. The likelihood function is expressed as

Writing the log-likelihood and solving the first-order condition yields expression (3).

By expressing as

one can show that and .

Simulations of trajectories of (6), used to compute sample estimates, are in agreement with the above results.

A.2 Linear trend with diffusion estimator

We consider the diffusion with the dynamics

where is a Wiener process and are unknown scalar quantities to be estimated from observations. In an infinitesimal time period , the price moves linearly by an amount and fluctuates around this trend term by an amount equal to .

We seek to construct estimation techniques for both the trend and the fluctuation parameters. In the setting of discrete observations, various estimation approaches can be used. For instance, one can first de-trend the observed price series and then estimate the fluctuation speed using standard OLS techniques. The drawbacks of such an approach are twofold. Firstly, the estimation is conducted regardless of the joint distribution of the two estimated quantities. Secondly, classical OLS assumptions are most likely to fail in the case of a diffusion price process. As a consequence of the non-stationarity of the residuals, it can be shown that the OLS estimator of the trend is biased. Such behaviours are studied in depth in [17].

Our approach follows the results from [12] in which the authors estimate drift parameters in a continuous likelihood maximization framework. Let us recall the main results from [13, 12].

Theorem 1.

Let be a process satisfying the stochastic differential equation (SDE)

where is a non-anticipative function.

Under the assumption that - almost surely,

then the measures and are equivalent. Moreover, -almost surely, the Radon-Nikodym derivative of with respect to is given by:

(7)

The reader can refer to [13], Theorem 7.7, for a formal statement and proof. The issue of parametric drift estimation is addressed in [12] by considering the diffusion process:

(8)

Using the result above with and under similar assumption on one can show that the measures and are equivalent and that the likelihood function can be expressed as

It is easy to show that the log-likelihood is a concave function of the parameter and that its maximum is attained for such that .

As a consequence, under the assumption that

(9)

and under the condition that -a.s. the maximum likelihood estimation of is expressed as:

(10)

When dealing with real data, the numerical value of is computed using numerical integration techniques along the observed path . From now on, we adopt the lighter notations:

so that the MLE estimator (10) is expressed as .

For most drift functions the estimator has non-zero bias. An approximation of the bias can be easily derived by substituting the expression of in (10):

Hence the bias can be computed by approximating the expectation:

In the following, we extend (8) to the 2D parametric drift case:

(11)
Theorem 2.

Let be a process satisfying the diffusion equation (11) where both and satisfy the condition (9).

Under the condition that -a.s. the maximum likelihood estimation of is expressed as:

(12)
(13)
Proof.

By substituting the drift term in (11) into (7) one obtains

To ensure the concavity of the log-likelihood, one must verify that its Hessian matrix is negative definite.

Computing the Hessian yields

hence of the form

The eigenvalues of are given by

Its largest eigenvalue being negative is equivalent to a condition that reduces to the Cauchy-Schwarz inequality. Hence these conditions are almost surely verified, ensuring the concavity of the log-likelihood. Finally, one can deduce equations (12) and (13) by solving the first-order conditions

We now consider the diffusion:

(14)

From the results above the MLE estimators for both and are given by:

(15)
(16)

To obtain these formulas, we apply formulas (12) and (13) to the corresponding drift components. Using Itô's lemma, one can show that:

Appendix B Asymptotic state behaviour in a simple case

We prove in this annex the results stated in the worked example of section 3.1. We consider the following process

where and and

Considering a simple noisy line process, we have:

being the trend and an i.i.d. noise process with zero expectation and unit variance.
Without trend, i.e. with a zero slope, we have

We note the Perron-Frobenius eigenvalue of . All eigenvalues of different from satisfy . If is a corresponding eigenvector then

Noting

So and is bijective. We can define by

Then,

Simplifying notations with , and