An empirical study of neural networks for trend detection in time series
Detecting structure in noisy time series is a difficult task. One intuitive feature, of particular interest in financial applications, is the notion of trend. Guided by theoretical arguments and using simulated time series, we empirically investigate how efficiently standard recurrent neural networks (RNNs) detect trends. We show the overall superiority and versatility of certain standard RNN structures over various other estimators. These RNNs could be used as basic blocks to build more complex time series trend estimators.
When looking at any dataset, the human brain is wired to detect patterns. Time series are no exception and quite naturally we see “trends” when shown a plot of share prices. Trends seem a relevant feature of any forecasting mechanism for time series. In this article, we focus on univariate time series having a conspicuous trend component, as commonly found in financial data. Trending time series are not unique to finance and our work extends to other domains. The main contributions of this article are:
Framing the problem into a classification problem emphasizing the usefulness of simulated data
Building a general trend estimator for a wide range of dynamics
Showing in a simple case why RNNs are good trend estimators
Showing empirically the superiority of RNNs over standard estimators
Deriving theoretical maximum likelihood estimators for the considered dynamics
We first describe our general framework, establishing trend detection as a sequence to sequence classification problem. We then define the time series dynamics used in our simulations. Next, we explore the use of recurrent neural networks to detect trends and empirically compare the performance of standard RNN structures. We then build a general purpose trend estimator called the RNN baseline. We benchmark its performance against other estimators such as convolutional networks. Finally, we compare its performance against estimators based on parameter estimation (MLE) of the modelled dynamic. Mathematical topics and detailed results are deferred to the appendix.
2 Framework and data set
In this section we define our framework, which addresses the following question: what setup should one consider to find a “good” general purpose estimator of trend in time series?
2.1 The thought process
Trends can be interpreted as the slopes of a smooth function around which the time series oscillates. The simplest choice, and probably the closest to human intuition, would be to use piecewise linear functions. The issue with such filtering approaches is that they tend to be good ex post but slow to detect changes of trend. This is a real problem when the whole time series is not known in advance.
We take a slightly different approach. If the future value of the time series is expected to be higher [respectively lower, equal] than the current one, then the time series is said to be trending up [respectively trending down, not trending]. At each time step, we assign a unique trend value noted , the time-series is:
trending downward at if
not trending at if
trending upward at if
We can directly translate this intuition into mathematical terms. Consider a process $X$ adapted to a filtration $(\mathcal{F}_t)_{t \geq 0}$. Under some technical conditions, the Doob-Meyer theorem applies and $X$ can be decomposed in a unique way as
$$X_t = M_t + A_t$$
where $M$ is a martingale and $A$ is a predictable process starting at 0, which is increasing [respectively decreasing, zero] if $X$ is a sub-martingale [respectively super-martingale, martingale].
We can therefore map our intuitive definition to more precise concepts:
$X$ is trending downward $\Leftrightarrow$ $A$ is decreasing
$X$ is not trending $\Leftrightarrow$ $A$ is null
$X$ is trending upward $\Leftrightarrow$ $A$ is increasing
The monotonicity of the process $A$ will be our definition of the trend of $X$, which yields a classification task with three labels for downward, flat and upward trend. Consider an Itô process
$$dX_t = \mu_t\,dt + \sigma_t\,dW_t$$
where $W$ is a Wiener process. Here $A_t = \int_0^t \mu_s\,ds$, so we can track the changing monotonicity of $A$ via the sign of $\mu_t$, which will be our practical definition of trend.
The challenge at hand is to build an estimator of the sign of $\mu_t$, which will be our classification label. In the following, we will consider various time series dynamics where we control the sign of $\mu_t$. This gives us a framework to analyse the performance of various estimators, while controlling for the statistical properties of the dataset.
The classification task relies on the labelling of the training set. When using historical data, labelling is not easy: the definition of trend is subjective and usually depends on the choice of a time window or of a performance criterion. By contrast, when using simulated data, labelling the training set is easy. A general-purpose estimator of trend in a simulated environment is a useful building block for handling more complex real-life cases where no trend labels are available. It gives us a robust starting point on which we can build.
2.2 Time series dynamics
Our idea is to generate as many realistic datasets as possible, and to train trend estimators on those datasets. If we train our estimator on a dataset rich enough to capture all the possible scenarios, we can hope to have an estimator robust to real-life conditions. In the following, we consider three different types of dynamics, hopefully rich and diverse enough to match a lot of the real-life behaviour:
a noisy piecewise linear process
a piecewise Ornstein-Uhlenbeck process 
a Markovian switching process 
The first two are piecewise meaning that we divide time into intervals on which the time series follows the chosen dynamic. A simple continuity constraint is applied to “glue” together these different periods.
In the rest of the section we define:
a time interval
for piecewise processes, a number of intervals of possibly different lengths
Noisy Line Process
We define a Noisy Line Process
is a slope parameter randomly chosen in , where is the maximum slope and
is a noise parameter
are i.i.d. normal variables
The trend here is given by the sign of . Figure 1 displays some possible trajectories.
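As an illustration, the following is a minimal sketch of how such labelled trajectories could be simulated. The exact functional form is an assumption (a linear level with slope drawn uniformly per interval, plus i.i.d. Gaussian noise, glued by continuity), and the function name, interval lengths and time step are hypothetical. The default noise level and maximum slope simply reuse the values 0.07 and 1.4 reported for the RNN baseline.

```python
import random


def simulate_noisy_line(n_intervals=5, min_len=50, max_len=200,
                        max_slope=1.4, sigma=0.07, dt=0.01, seed=0):
    """Simulate a piecewise noisy line together with its trend labels.

    Assumed model (illustrative only): on each interval the noiseless level
    moves linearly with a slope drawn uniformly in [-max_slope, max_slope],
    and i.i.d. Gaussian noise of scale sigma is added to each observation.
    """
    rng = random.Random(seed)
    values, labels = [], []
    level = 0.0  # noiseless level, carried over so that the pieces are continuous
    for _ in range(n_intervals):
        slope = rng.uniform(-max_slope, max_slope)
        length = rng.randint(min_len, max_len)
        label = (slope > 0) - (slope < 0)  # sign of the slope: -1, 0 or +1
        for _ in range(length):
            level += slope * dt
            values.append(level + sigma * rng.gauss(0.0, 1.0))
            labels.append(label)
    return values, labels


series, trend = simulate_noisy_line()
```

Each observation comes with its label, which is what makes simulated data convenient for the sequence to sequence classification task.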
Piecewise Ornstein-Uhlenbeck dynamic
We define a Piecewise Ornstein-Uhlenbeck Process as a process such that
where and . If the intervals are big enough, , and the trend label will be determined by
Samples of the piecewise Ornstein-Uhlenbeck process are shown in figure 2.
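A possible way to simulate such a process is an Euler-Maruyama discretization, sketched below. The mean-reverting form dX = theta (mu - X) dt + sigma dW, the parameter values, and the labelling rule (the direction in which the new long-run level pulls the process at the start of each interval) are assumptions made for illustration, not the exact specification used in this study.

```python
import math
import random


def simulate_piecewise_ou(n_intervals=4, length=250, theta=2.0, sigma=0.1,
                          dt=0.01, seed=0):
    """Euler-Maruyama simulation of a piecewise Ornstein-Uhlenbeck path.

    Assumed dynamic on each interval: dX = theta * (mu - X) dt + sigma dW,
    with a new long-run level mu drawn at the start of every interval.
    """
    rng = random.Random(seed)
    x = 0.0  # carried across intervals, which keeps the path continuous
    values, labels = [], []
    for _ in range(n_intervals):
        mu = rng.uniform(-1.0, 1.0)
        # Assumed label: direction of the pull towards the new long-run level.
        label = (mu > x) - (mu < x)
        for _ in range(length):
            dw = rng.gauss(0.0, math.sqrt(dt))
            x += theta * (mu - x) * dt + sigma * dw
            values.append(x)
            labels.append(label)
    return values, labels


ou_series, ou_trend = simulate_piecewise_ou()
```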
Switching Markovian dynamic
The trend is given by a Markov chain on finite states . The process is defined by
where is a slope process, a positive noise process and . In practice, and are constant with time, the constant being randomly chosen in a discrete distribution. This process exhibits a rich set of trajectories as seen on figure 3.
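The sketch below shows one way to generate a regime-switching trajectory of this kind. It assumes three states (down, flat, up) coded as -1, 0 and 1, Gaussian log returns whose mean is the state times a constant slope, and a single probability of staying in the current state. These choices and all parameter values are hypothetical.

```python
import math
import random


def simulate_markov_switch(n_steps=1000, stay_prob=0.98, slope=0.5,
                           sigma=0.1, dt=0.01, seed=0):
    """Simulate a price path whose drift sign follows a 3-state Markov chain."""
    rng = random.Random(seed)
    states = (-1, 0, 1)
    state = rng.choice(states)
    log_price = 0.0
    prices, labels = [], []
    for _ in range(n_steps):
        # Leave the current state with probability 1 - stay_prob.
        if rng.random() > stay_prob:
            state = rng.choice([s for s in states if s != state])
        # Assumed Gaussian log return with state-dependent mean.
        log_return = state * slope * dt + sigma * math.sqrt(dt) * rng.gauss(0.0, 1.0)
        log_price += log_return
        prices.append(math.exp(log_price))
        labels.append(state)
    return prices, labels


ms_prices, ms_trend = simulate_markov_switch()
```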
2.3 Training and Validation sets
Training sets are made of 1000 time series containing roughly 1000 data points, randomly drawn:
from either one of the three previous dynamics (see section 2.2)
or from all of the previous dynamics. This will be named mixed dynamic in the following
Model selection is made on validation sets composed of 300 time series: 100 samples from each of the three dynamics described in section 2.2. Each sample has between 500 and 1000 points depending on the dynamics and the draw. Figure 4 shows random samples from the validation set. This validation set offers a rich set of scenarios and can be used to assess the ability of an estimator to detect trends. Hyper-parameters are chosen using a separate test set which is a new random draw of the training set.
2.4 From empirical data to stylised time series dynamics
One important question arising from the chosen approach is the relevance of the simulated data. Even though they are not designed to simulate market dynamics, our dynamics can show behaviours relatively similar to actual asset prices. As an example, figure 5 plots daily time series of real assets against a random sample from our three dynamics.
We see that the trajectories can be visually similar but that the distributions of daily returns may differ greatly. We must bear in mind that our aim is not to simulate market data but to detect trend, defined as the sign of the drift term. We think that our dynamics are good enough to simulate this property of real time series. One general method to get simulated dynamics close to empirical market data is the following:
Choose a dynamic
Compute the distribution of returns of the market time series of interest
Sample time series of the dynamic and compute their distributions of returns
Compute the average distance between the sampled distributions and the empirical one
Minimize this function over the dynamic parameters using black-box Bayesian optimization
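The steps above can be sketched as follows. This is only an illustration under several assumptions: the distance is a two-sample Kolmogorov-Smirnov distance (the text does not specify which one is used), the dynamic is a toy Gaussian random walk with a single parameter `sigma`, the "market" returns are a synthetic stand-in, and a plain grid search replaces the black-box Bayesian optimization.

```python
import random


def returns(series):
    """First differences of a series."""
    return [b - a for a, b in zip(series, series[1:])]


def ks_distance(sample_a, sample_b):
    """Two-sample Kolmogorov-Smirnov distance between empirical CDFs."""
    a, b = sorted(sample_a), sorted(sample_b)
    i = j = 0
    best = 0.0
    while i < len(a) and j < len(b):
        x = min(a[i], b[j])
        while i < len(a) and a[i] <= x:
            i += 1
        while j < len(b) and b[j] <= x:
            j += 1
        best = max(best, abs(i / len(a) - j / len(b)))
    return best


def sample_dynamic(sigma, n, rng):
    """Toy dynamic (assumption): Gaussian random walk with step scale sigma."""
    x, path = 0.0, [0.0]
    for _ in range(n):
        x += sigma * rng.gauss(0.0, 1.0)
        path.append(x)
    return path


def objective(sigma, market_returns, n_paths=5, seed=0):
    """Average distance between sampled and empirical return distributions."""
    rng = random.Random(seed)
    dists = [
        ks_distance(returns(sample_dynamic(sigma, len(market_returns), rng)),
                    market_returns)
        for _ in range(n_paths)
    ]
    return sum(dists) / len(dists)


# Synthetic stand-in for market returns (not real data).
market_rng = random.Random(1)
market_returns = [0.02 * market_rng.gauss(0.0, 1.0) for _ in range(500)]

# Grid search used here in place of Bayesian optimization.
candidates = [0.005, 0.01, 0.02, 0.04, 0.08]
best_sigma = min(candidates, key=lambda s: objective(s, market_returns))
```

Any optimizer able to minimize a noisy black-box function of the dynamic parameters could be substituted for the grid search.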
3 Using Recurrent Neural Networks to detect trends
We motivate here the use of Recurrent Neural Networks (RNN) for our classification problem. Drawing from simple intuition, we provably show their benefits in a simple case.
3.1 Motivation: moving averages filtering and its extension as RNN
One of the most common ways to detect trends is to adopt a filtering approach, comparing smoothed versions of the initial process. For example, we could aggregate several moving averages like:
with various values of . Determining the optimal might be difficult if we want to build an estimator adapted to various dynamics. To circumvent this difficulty, we can aggregate the values for different as the components of vectors through time
For example, we might want to consider the concatenation of fast, medium and slow moving averages. We might compare:
the slow and the fast moving averages by looking at the sign of
or maybe the slow versus an average of the medium and slow with the sign of
or whatever weighted combination we fancy with the sign of
Generally speaking, we look at the signs of components of the vector where is a given
Generalizing equation (2) to higher dimensions, we have:
where is a positive matrix and a positive vector such that
The trend is determined by but we could use any other activation function instead of the sign function.
These equations are exactly the update equations of an RNN composed of
a vanilla RNN
with the identity as activation function
with one hidden layer
with convex constraints on the weight matrix
with a simple linear layer and activation function
Such an RNN will be called a “convex net” in the following. This shows that RNNs can be considered as generalizations of some basic moving average comparisons.
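To make the link between moving averages and a recurrent update concrete, here is a minimal sketch. It assumes the linear state update h_t = A h_{t-1} + b x_t, with non-negative weights such that each row of [A | b] sums to one, and a linear readout followed by the sign function. The diagonal choice of A (three exponential moving averages), the smoothing factors and the readout weights are illustrative assumptions.

```python
def convex_step(state, x, weights, input_weights):
    """One update h_t = A h_{t-1} + b x_t, each row of [A | b] summing to 1."""
    return [
        sum(a * h for a, h in zip(row, state)) + b * x
        for row, b in zip(weights, input_weights)
    ]


# Fast, medium and slow exponential moving averages as a diagonal recurrence.
alphas = [0.5, 0.1, 0.02]
A = [[(1 - a) if i == j else 0.0 for j in range(3)] for i, a in enumerate(alphas)]
b = alphas
readout = [1.0, 0.0, -1.0]  # compares the fast and the slow components


def trend_signal(series):
    """Sign of the readout of the state at each time step."""
    state = [series[0]] * 3
    out = []
    for x in series:
        state = convex_step(state, x, A, b)
        score = sum(w * h for w, h in zip(readout, state))
        out.append((score > 0) - (score < 0))
    return out


signals = trend_signal([0.1 * t for t in range(100)])
```

On a noiseless rising line the fast average stays above the slow one, so the signal is positive from the second step onwards. A trained recurrent net replaces the hand-picked `A`, `b` and `readout` by learned weights.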
As a working example, we consider the case of the Noisy Line Process where are independent noise random variables .
For a net with constrained weights it can be shown (see annex B for details):
without trend, , then becomes centered around a variable of finite variance
with trend, then diverges
If we now introduce a hyperbolic tangent activation function instead of identity:
if , near zero the cell is in the linear part and we should expect the state to stay bounded around the origin
if the trend then the state should go towards i.e. to navigate near the faces of the hypercube
For a practical illustration see annex C.
3.2 Overview of RNNs and data
Standard Recurrent Neural Nets
In subsection 2.1, we turned the trend estimation problem into a sequence to sequence classification task, for which RNNs can be used. We consider three standard structures: Vanilla RNN, LSTM and GRU.
RNNs contain cycles: hidden state cell can depend on the entire past input sequence. We refer to  for details. These three standard RNNs have different structures but they share similar update equations like:
is a vector representing some internal cells at
is a block-wise activation function
is the input at time
is the state at time
are matrices and vectors
is an elementwise application operator
Depending on the RNN, is a combination of blocks of and possibly .
Essentially, where is a possibly complex mapping from the previous state and actual input values to the new state. We refer the reader to ,  and  for more details.
For training and validation, we use simulated time series according to section 2.3. Our aim is to give a precise empirical comparison of these three structures taking into account the possible influence of the training dynamic. We train triplets of the form:
an RNN chosen among Vanilla, LSTM or GRU
some meta-parameters like the number of recurrent layers, the dimension of hidden layer(s), dropout (see  for definition)…
a time series dynamic chosen among Noisy Line Process, Piecewise Ornstein-Uhlenbeck, Markovian Switch or a mixed dynamic
Each of these triplets is trained and validated against the training and validation sets described in subsection 2.3. This gives us more than 400 triplets to train and validate. Roughly 100 triplets hit convergence issues during training and are excluded from the validation phase. Some parameter details can be found in annex D.1. Also, to get more robust results, we ran the complete training with two different gradient-based optimizers:
Adam algorithm
RMSprop algorithm (see  for details)
3.3 Empirical findings
We train our triplets as described in subsection 3.2.2 for both Adam and RMSprop and validate each triplet on our 300 validation samples (see section 2.3). The loss is a binary loss on the labels.
Table 1 shows the coefficients of the linear regression of loss against binary variables indicating the training dynamic, the net type, the optimization type and the validation dynamic. Each feature is translated into binary on/off variables with one less modality. The missing modality is on if all others are set to zero. A positive coefficient means that the highlighted feature increases the average loss of the sample, and conversely, a negative coefficient decreases the average loss. Full details can be found in annex D.2.
|Training dynamic[Markovian Switch]|
|Training dynamic[Noisy Line]|
|Validation dynamic[Noisy Line]||-0.036|
From figure 6:
training on Ornstein-Uhlenbeck dynamic seems to worsen performance
GRU seems to be the best net type and Vanilla not a great choice
the optimization algorithm RMSProp has a negative impact on performance. Adam leads to better results
the validation loss for Markovian Switch is higher than the two other dynamics
Training dynamic has an impact on validation performance. To compare two dynamics, e.g. Noisy Line versus Piecewise Ornstein-Uhlenbeck, we select the losses obtained with those two dynamics only and bootstrap them. For each bootstrap iteration, we compute the difference between the median losses of one dynamic versus the other. The results are shown in table 2. Even though all intervals contain zero, so that no robust conclusion can be drawn, the median loss seems lower when training on the Noisy Line or Markovian Switch dynamics.
|type 1 - type 2||Median loss difference||99% CI lower bound||99% CI upper bound|
|nl - ou||-0.04||-0.19||0.10|
|nl - ms||0.01||-0.15||0.17|
|nl - mix||-0.009||-0.17||0.15|
|ou - ms||0.05||-0.10||0.21|
|ou - mix||0.04||-0.12||0.20|
|ms - mix||-0.02||-0.20||0.16|
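The bootstrapping procedure behind this table can be sketched as follows. The resampling scheme (independent resampling with replacement of each group), the number of iterations and the percentile interval are assumptions, and the loss values in the usage example are made up for illustration.

```python
import random
import statistics


def bootstrap_median_diff(losses_a, losses_b, n_boot=2000, level=0.99, seed=0):
    """Bootstrap the difference between the median losses of two groups.

    Returns the median of the bootstrapped differences and a percentile
    confidence interval at the requested level.
    """
    rng = random.Random(seed)
    diffs = []
    for _ in range(n_boot):
        resample_a = rng.choices(losses_a, k=len(losses_a))
        resample_b = rng.choices(losses_b, k=len(losses_b))
        diffs.append(statistics.median(resample_a) - statistics.median(resample_b))
    diffs.sort()
    tail = (1.0 - level) / 2.0
    lower = diffs[int(tail * n_boot)]
    upper = diffs[int((1.0 - tail) * n_boot) - 1]
    return statistics.median(diffs), lower, upper


# Made-up validation losses for two hypothetical groups of triplets.
losses_a = [0.30 + 0.01 * i for i in range(20)]
losses_b = [0.60 + 0.01 * i for i in range(20)]
median_diff, ci_low, ci_high = bootstrap_median_diff(losses_a, losses_b)
```

A difference is deemed significant at the chosen level when the interval excludes zero, which is the case for these made-up groups and for none of the rows of the table.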
Net structures are compared using the same bootstrapping procedure in table 3. Vanilla RNN is worse than GRU at the 99% confidence level; against LSTM, the interval only marginally contains zero. As a result, in the following, we will ignore triplets with Vanilla RNN. Vanilla RNN is barely better than a dummy estimator guessing the trend at random (see annex D.3).
|net 1 - net 2||Median loss difference||99% CI lower bound||99% CI upper bound|
|vanilla - lstm||0.14||-0.005||0.28|
|vanilla - gru||0.18||0.04||0.32|
|lstm - gru||0.05||-0.15||0.25|
Optimizer impact: results seem to indicate a slightly better performance of Adam versus RMSprop.
Net structure and training dynamic interaction: using only the triplets where the net structure is either GRU or LSTM, we run the same bootstrapping procedure for each training dynamic. The results are given in table 4. All the intervals contain 0 and it is difficult to find a combination which does significantly better than the others.
3.4 RNN baseline selection
We would like to choose an RNN estimator with good overall performance on validation data. As we have seen, it is difficult to single out a particular training type or net structure (GRU or LSTM) as being significantly better.
One way to build a baseline would be, for example, to pool the estimated probabilities of the best trained estimators. The pooling function here is a simple average of the probabilities estimated by the selected estimators.
|Validation dynamic type||Median loss||First quartile||Third quartile||IQR|
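A minimal sketch of such a pooling function is given below. It assumes that each estimator outputs, at every time step, a probability vector over the three classes ordered as (down, flat, up). The function names and the probabilities in the usage example are hypothetical.

```python
def pool_probabilities(per_model_probs):
    """Average class probabilities across models, time step by time step."""
    n_models = len(per_model_probs)
    pooled = []
    for step_probs in zip(*per_model_probs):
        pooled.append([sum(p) / n_models for p in zip(*step_probs)])
    return pooled


def predict_labels(pooled, classes=(-1, 0, 1)):
    """Pick the class with the highest pooled probability at each step."""
    return [classes[max(range(len(row)), key=row.__getitem__)] for row in pooled]


# Made-up outputs of two estimators over two time steps.
model_a = [[0.7, 0.2, 0.1], [0.2, 0.3, 0.5]]
model_b = [[0.5, 0.3, 0.2], [0.1, 0.2, 0.7]]
pooled = pool_probabilities([model_a, model_b])
pooled_labels = predict_labels(pooled)
```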
Yet, choosing such an estimator would give RNNs an advantage compared to other estimators. To be as fair as possible and to favour simplicity over performance, we choose to optimize hyper-parameters for a GRU network trained on the Piecewise Noisy Line dynamic using Adam optimization. Some details of the RNN baseline can be found in table 6.
It is interesting to note that adding training epochs
|Net structure type||GRU|
|Number of hidden recurrent layers||2|
|Dimension of hidden recurrent layers||20|
|Number of epochs||200|
|Training type||Noisy Line|
|Max noise level||0.07|
|Max line slope||1.4|
Running the training with hyper-parameters not too far from the ones obtained by optimization gives fairly similar results. The comparison of the RNN baseline versus the pooled estimator is given in table 7 and figure 7 for the loss distributions. Even if our RNN baseline is not the best it still offers good performance.
4 Non model based estimation
By “non model based”, we mean estimators which are not based on an explicit modelling of the underlying dynamic. We compare the RNN baseline of subsection 3.4 against a simple moving average estimator, its generalization (see section 3.1) and a Convolutional Neural Network (CNN see ). Overall, the RNN baseline exhibits much stronger validation performance.
4.1 Comparison with moving average
One of the most intuitive ways to detect a trend is to compare two moving averages of different speeds. We compare our RNN baseline with both the simplest moving average filtering and the convex net generalization approach.
Simple moving average
We first compare the RNN baseline with a basic estimator computing two moving averages: a “s = slow” one and a “f = fast” one
Given , a no trend threshold, the trend prediction is made by
Obviously, the parameters have a big impact on the estimator performance. Using Bayesian optimization we find the parameters shown in table 8.
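A minimal sketch of such an estimator follows. It assumes simple rolling means (the study may use another kind of moving average), a symmetric no-trend band around zero, and hypothetical default values for the two windows and the threshold, which are not the optimized parameters of table 8.

```python
def moving_average(series, window):
    """Rolling mean over the last `window` points (shorter at the start)."""
    out, running = [], 0.0
    for i, x in enumerate(series):
        running += x
        if i >= window:
            running -= series[i - window]
        out.append(running / min(i + 1, window))
    return out


def ma_trend(series, fast=5, slow=20, threshold=0.01):
    """Label each step by comparing the fast and the slow moving averages."""
    fast_ma = moving_average(series, fast)
    slow_ma = moving_average(series, slow)
    labels = []
    for f, s in zip(fast_ma, slow_ma):
        gap = f - s
        labels.append(1 if gap > threshold else -1 if gap < -threshold else 0)
    return labels


up_labels = ma_trend([0.1 * t for t in range(60)])
down_labels = ma_trend([-0.1 * t for t in range(60)])
flat_labels = ma_trend([1.0] * 60)
```

The three parameters trade off reactivity against robustness to noise, which is why they have to be tuned for each family of dynamics.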
On figure 8 we see the loss distribution of the baseline RNN versus the loss distribution of the moving average estimator for all dynamics.
On average, the RNN baseline is consistently better than the moving average estimator, as seen in table 9. The Markovian Switch dynamic is sometimes extremely difficult to capture due to highly volatile regime switching. For this dynamic, both estimators perform equally badly, which is not unexpected given the difficulty of the task.
Comparison with moving average generalization
We compare the baseline RNN with the estimator built according to subsection 3.1. Basically, this is a Vanilla RNN without any activation function. Also, weights are constrained to be a stochastic matrix. It turns out, a bit surprisingly to us, that the performance is quite poor and way worse than the RNN baseline. Further investigation is needed, but training seems to fail somehow as the trained weights are all very close to zero. As a result, the input plays little role in the prediction and surely can’t do much better than a dummy estimator. For reference, basic results are shown in table 10.
|Dynamic||RNN||Generalized moving average|
4.2 Comparison with CNN
One-dimensional CNNs are sometimes seen as a good tool to analyse time series. We use a standard CNN structure stacking a convolutional layer followed by a pooling layer. To keep the net architectures similar in terms of number of parameters, we use two layers of convolution + pooling.
After optimization, we get the hyper-parameters shown in table 11. Interestingly, both the number of channels and the kernel size have taken the maximum value in the range we tested.
5 Model based estimators
In this section, we compare the performance of the RNN baseline with classifiers based on maximum likelihood estimation (MLE) of the process parameters. These estimators therefore incorporate knowledge about the underlying data generative process. For each dynamic (see subsection 2.2), we compute the MLE estimator of the trend parameter. Then, we use this value at each time step to compute a trend label . This approach, which converts a numerical estimate of the trend to a label, is described in the following subsection.
In subsections 5.2, 5.3 and 5.4 we recall the formulas of the MLE trend estimators and present their empirical performance in comparison with the RNN baseline. Overall, the baseline shows good performance against these estimators. Theoretical details of MLE derivations are included in annex A.
5.1 From MLE to trend classifier
As a reminder, the training data used for the learning step of the neural networks is comprised of piecewise trajectories of the dynamics and uses randomized model parameters. Taking into account this additional randomness in a MLE estimation framework would make the theory intractable. In order to compare MLE based trend classification with neural networks, we use a sliding window mechanism. For a sliding window of length :
we compute the value of the trend estimator
we map the value of to a label using the sign function (for a given threshold ) and predict this label with probability .
We only need this mechanism for the Noisy Line Process and the Piecewise Ornstein-Uhlenbeck Process.
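The sliding window mechanism can be sketched as follows. The estimator passed in the usage example is a simple stand-in (the mean increment over the window), not one of the MLE estimators of this section. The function names and the convention that the threshold defines a symmetric no-trend band are assumptions.

```python
def sliding_window_labels(series, window, estimator, threshold=0.0):
    """Apply a trend estimator on each sliding window and map it to a label."""
    labels = []
    for end in range(window, len(series) + 1):
        value = estimator(series[end - window:end])
        labels.append(1 if value > threshold else -1 if value < -threshold else 0)
    return labels


def mean_increment(chunk):
    """Stand-in estimator: average increment over the window."""
    return (chunk[-1] - chunk[0]) / (len(chunk) - 1)


window_labels = sliding_window_labels(
    [0.5 * t for t in range(30)], window=10, estimator=mean_increment
)
```

The first label is only available once a full window has been observed, so the output is shorter than the input by `window - 1` points.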
5.2 Noisy Line Estimator
Derivation of MLE estimator on an interval
Deriving the maximum likelihood estimator for the slope is easy, as any finite sample on a subdivision is a Gaussian vector with diagonal covariance matrix. Maximizing the likelihood yields the slope formula (see annex A.1 for mathematical details):
The MLE estimator for the slope follows a normal distribution with mean and variance . For a subdivision with constant time step the variance is given by:
hence decreasing with the number of observations at the rate .
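As a numerical sanity check, the sketch below simulates the estimator under an assumed form of the model: observations x_i = a t_i + sigma eps_i on the grid t_i = i dt for i = 1..n, without intercept. Under that assumption the maximum likelihood slope is sum(t_i x_i) / sum(t_i^2) and its variance is 6 sigma^2 / (dt^2 n (n + 1) (2n + 1)), which decreases as n^-3. These closed forms follow from the assumed model and may differ in detail from the formulas of annex A.1.

```python
import random
import statistics


def slope_mle(times, values):
    """Maximum likelihood slope of a line through the origin (assumed model)."""
    return sum(t * x for t, x in zip(times, values)) / sum(t * t for t in times)


def theoretical_variance(n, dt, sigma):
    """Variance of the slope estimator on the grid t_i = i * dt, i = 1..n."""
    return 6.0 * sigma ** 2 / (dt ** 2 * n * (n + 1) * (2 * n + 1))


rng = random.Random(0)
a, sigma, dt, n = 0.5, 0.2, 0.1, 50
times = [dt * i for i in range(1, n + 1)]

# Monte Carlo estimate of the mean and variance of the slope estimator.
estimates = []
for _ in range(4000):
    values = [a * t + sigma * rng.gauss(0.0, 1.0) for t in times]
    estimates.append(slope_mle(times, values))

empirical_mean = statistics.mean(estimates)
empirical_var = statistics.variance(estimates)
expected_var = theoretical_variance(n, dt, sigma)
```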
The Noisy Line Estimator is easily outperformed by the RNN baseline, even on the simple noisy line dynamic.
5.3 Piecewise OU process
Derivation of MLE estimator on an interval
Estimating the parameters of continuous-time diffusions is a difficult task. One way to construct such estimators is to derive the likelihood function on a discrete grid of price observations. Because the samples are not independent, the likelihood can be hard to derive and its maximisation might require numerical optimization procedures. In the present study we leverage the theoretical results of [13, 12], which express the likelihood function in a simple stochastic integral form. In the case of the Ornstein-Uhlenbeck process with linear trend diffusion:
the formulas for the estimators are given by:
To some extent, an analogy can be drawn with classical OLS estimators where the variance scaling term corresponds to the term . The reader can refer to the technical addendum A.2 for mathematical details. When dealing with discrete time observations, the integrals are approximated using the sample values and discrete time increments. Simulations show that these estimators exhibit good empirical properties, although they are biased. It can be shown that the biases for both estimators are given by:
In practical applications, the expectations above are computed by first evaluating the residuals over the observed values of and then approximating the integrals by summation of the weighted increments.
We design a trend estimator using the sliding window mechanism of subsection 5.1. We compare its performance against our RNN baseline on figure 11 and table 14. Interestingly, the performance on the Ornstein-Uhlenbeck dynamic is markedly better and comparable to the performance of the RNN on the Ornstein-Uhlenbeck dynamic.
5.4 Markovian switch process
Derivation of MLE estimator
The Markovian Switch dynamic described in section 2.2.3 is actually the dynamic of a Hidden Markov Model (HMM) with Gaussian emission probabilities on log returns:
where is a simple discrete three-state Markov chain. We then use classic techniques (see  for example) to get an estimate of the hidden states which have generated .
We train a three-state HMM with Gaussian emission probabilities on the four time series dynamics (as described in subsection 2.2). Performance is similar regardless of the training dynamic. It is not obvious that the hidden states of the HMM will fit our up, down and flat trend categories. To be able to compute a loss for the HMM, we first map the three states of the HMM using the mean of the emission distribution given the hidden state: we sort the means in increasing order and map them to the down, flat and up states. We would expect to get a sequence of means being negative, close to zero and positive. Actually, only estimators trained on the mixed or Markovian Switch dynamics exhibit means which are clearly separated into a negative, a near-zero and a positive value. Performance being similar, we use as baseline the estimator trained on the Markovian Switch dynamic, which seems the most natural choice. Globally, the HMM has a hard time predicting the trend of any dynamic. This might be a bit surprising, especially for the Markovian Switch dynamic. We note however that the best validation score is obtained when the HMM is trained on the Markovian Switch dynamic. As seen in figure 12 and table 15, the HMM does not provide a good estimator of trend and is easily outperformed by the RNN approach.
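The mapping from hidden states to trend labels described above can be sketched as follows. The fitted means and the decoded state sequence in the usage example are made up, and the function name is hypothetical. Fitting the HMM itself is outside the scope of this sketch.

```python
def map_states_to_trend(state_means):
    """Map HMM hidden states to down/flat/up by sorting their emission means."""
    order = sorted(range(len(state_means)), key=lambda k: state_means[k])
    trend_labels = (-1, 0, 1)  # down, flat, up
    return {state: label for state, label in zip(order, trend_labels)}


# Made-up emission means of a fitted three-state HMM on log returns.
state_means = [0.002, -0.003, 0.0001]
mapping = map_states_to_trend(state_means)

# Made-up decoded hidden state sequence, converted to trend labels.
decoded_states = [1, 1, 2, 0, 0]
hmm_trend = [mapping[s] for s in decoded_states]
```

The mapping is only meaningful when the sorted means are clearly negative, near zero and positive, which is the caveat raised in the paragraph above.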
In this paper, we have investigated the use of several trend estimators on time series behaving similarly to the ones encountered in finance. We have derived theoretical maximum likelihood estimators of trends for two standard dynamics and implemented them. We have shown that certain RNNs are in a way a generalization of simple moving average techniques. For a simple dynamic, we have shown that this generalization transforms the trend estimation problem into locating the state vector. Finally, we have shown empirically that GRU or LSTM cells are on average the best building blocks to use, compared to a broad range of estimators, in order to detect trends in time series. Putting the emphasis on learning from stylized data and then transferring to real data, rather than building complex structures fitted to data, is also an important takeaway of this paper. Ongoing preliminary research seems to validate our approach for financial applications. This might pave the way to building efficient market estimators protected against over-fitting.
Appendix A MLE estimators theory
A.1 Simple noisy line estimator
On a discrete time grid we consider the “noisy line” dynamics:
where is a collection of i.i.d. normal random variables .
One can easily show that is a Gaussian vector with diagonal covariance matrix. The likelihood function is expressed as
Let denote the log-likelihood. Solving yields expression (3).
By expressing as
one can show that and .
Simulations of trajectories (6) to compute samples estimates of are in agreement with the above result.
A.2 Linear trend with diffusion estimator
We consider the diffusion with the dynamics
where is a Wiener process and are unknown scalar quantities to be estimated from observations. In an infinitesimal time period , the price moves linearly by an amount and fluctuates around this trend term by an amount equal to .
We seek to construct estimation techniques for and . In the setting of discrete observations various estimation approaches can be used. For instance, one can first de-trend the observed price series and then estimate the fluctuation speed using standard OLS techniques. The drawbacks of such an approach are twofold. Firstly, estimation is conducted regardless of the joint distribution of . Secondly, classical OLS assumptions are most likely to fail in the case of a diffusion price process. As a consequence of non-stationarity of residuals, it can be shown that the OLS estimator of is biased. Such behaviours are studied in depth in .
Let be a process satisfying the stochastic differential equation (SDE)
where is a non-anticipative function.
Under the assumption that - almost surely,
then the measures and are equivalent. Moreover, -almost surely, the Radon-Nikodym derivative of with respect to is given by:
Using the result above with and under similar assumption on one can show that the measures and are equivalent and that the likelihood function can be expressed as
It is easy to show that the log-likelihood is a concave function of the parameter and that its maximum is attained for such that .
As a consequence, under the assumption that
and under the condition that -a.s. the maximum likelihood estimation of is expressed as:
When dealing with real data, the numerical value of is computed using numerical integration techniques along the observed path . From now on, we adopt the lighter notations:
so that the MLE estimator (10) is expressed as .
For most drift functions the estimator has non-zero bias. An approximation of the bias can be easily derived by substituting the expression of in (10):
Hence the bias can be computed by approximating the expectation:
In the following, we extend (8) to the 2D parametric drift case:
Under the condition that -a.s. the maximum likelihood estimation of is expressed as:
Let denote the log-likelihood. To ensure the concavity of one must verify that its Hessian matrix is negative definite.
Deriving the Hessian yields
hence of the form
The eigenvalues of are given by
For its largest eigenvalue to be negative is equivalent to , that is . This latter expression is equivalent to the Cauchy-Schwarz inequality. Hence these conditions are -a.s. verified, ensuring the concavity of . Finally one can deduce the equations (12) and (13) by solving the first order conditions
We now consider the diffusion:
From the results above the MLE estimators for both and are given by:
Appendix B Asymptotic state behaviour in a simple case
We prove in this annex the results stated in the worked example of section 3.1.
We consider the following process
where and and
For a simple noisy line process, we have:
being the trend and an i.i.d noise process with expectation equal to zero and unit variance.
Without trend i.e. , we have
We note the Perron-Frobenius eigenvalue of . All eigenvalues of different from satisfy . If is a corresponding eigenvector then
So and is bijective. We can define by
Simplifying notations with , and