An empirical study of neural networks for trend detection in time series
Abstract
Detecting structure in noisy time series is a difficult task. One intuitive feature, which is of particular interest in financial applications, is the notion of trend. Guided by theoretical hints and using simulated time series, we empirically investigate how efficiently standard recurrent neural networks (RNNs) detect trends. We show the overall superiority and versatility of certain standard RNN structures over various other estimators. These RNNs could be used as basic blocks to build more complex time series trend estimators.
1 Introduction
When looking at any dataset, the human brain is wired to detect patterns [8]. Time series are no exception and quite naturally we see “trends” when shown a plot of share prices. Trends seem a relevant feature of any forecasting mechanism for time series. In this article, we focus on univariate time series having a conspicuous trend component, as commonly found in financial data. Trending time series are not unique to finance and our work extends to other domains. The main contributions of this article are:

Framing the problem into a classification problem emphasizing the usefulness of simulated data

Building a general trend estimator for a wide range of dynamics

Showing in a simple case why RNNs are good trend estimators

Showing empirically the superiority of RNNs over standard estimators

Deriving theoretical maximum likelihood estimators for the considered dynamics
We first describe our general framework, establishing trend detection as a sequence-to-sequence classification problem. We then define the time series dynamics used in our simulations. Next, we explore the use of recurrent neural networks to detect trends and empirically compare the performance of standard RNN structures. We then build a general-purpose trend estimator called RNN baseline. We benchmark its performance against other estimators such as convolutional networks. Finally, we compare its performance against estimators based on parameter estimation (MLE) of the modelled dynamic. Mathematical topics and detailed results are deferred to the appendix.
2 Framework and data set
In this section we define our framework, which addresses the following question: what setup should one consider to find a “good” general-purpose estimator of trend in time series?
2.1 The thought process
Trends can be interpreted as the slopes of a smooth function around which the time series oscillates. The simplest approach, and probably the closest to human intuition, would be to use piecewise linear functions as described in [9]. The issue with these filtering approaches is that they tend to be good ex post but slow to detect changes of trend. This is a real problem when the whole time series is not known in advance.
We take a slightly different approach. If the future value of the time series is expected to be higher [respectively lower, equal] than the current one, then the time series is said to be trending up [respectively trending down, not trending].
At each time step, we assign a unique trend value noted , the timeseries is:

trending downward at if

not trending at if

trending upward at if
We can directly translate this intuition into mathematical terms. Consider a process $X$ adapted to a filtration $(\mathcal{F}_t)$. Under some technical conditions, the Doob-Meyer theorem applies and $X$ can be decomposed in a unique way as
$$X_t = M_t + A_t$$
where $A$ is a predictable increasing [respectively decreasing, zero] process starting at 0 if $X$ is a submartingale [respectively supermartingale, martingale], and $M$ is a martingale.
We can therefore map our intuitive definition to more precise concepts.
$X$ is:

trending downward if $A$ is decreasing

not trending if $A$ is null

trending upward if $A$ is increasing
The monotonicity of the process $A$ will be our definition of the trend of $X$, which yields a classification task with three labels for downward, flat and upward trend.
Consider an Itô process
$$dX_t = \mu_t\,dt + \sigma_t\,dW_t$$
where $W$ is a Wiener process. We can track the changing monotonicity of the predictable part $A$ of its decomposition via the sign of the drift $\mu_t$, which will be our practical definition of trend.
The challenge at hand is to build an estimator of the sign of $\mu_t$, which will be our classification label.
In the following, we will consider various time series dynamics where we control the sign of $\mu_t$. This gives us a framework to analyse the performance of various estimators, while controlling for the statistical properties of the dataset.
The classification task relies on the labelling of the training set. When using historical data, labelling is not easy to do: the definition of trend is subjective and usually depends on the choice of a time window or of a performance criterion.
On the contrary, when using simulated data, labelling of the training set is easy.
A general-purpose estimator of trend in a simulated environment is a useful building block for handling more complex real-life cases where no trend labels are available. It gives us a robust starting point on which we can build.
2.2 Time series dynamics
Our idea is to generate as many realistic datasets as possible, and to train trend estimators on those datasets. If we train our estimator on a dataset rich enough to capture all the possible scenarios, we can hope to have an estimator robust to real-life conditions. In the following, we consider three different types of dynamics, hopefully rich and diverse enough to match much of the real-life behaviour:

a noisy piecewise linear process

a piecewise Ornstein-Uhlenbeck process [16]

a Markovian switching process [5]
The first two are piecewise meaning that we divide time into intervals on which the time series follows the chosen dynamic. A simple continuity constraint is applied to “glue” together these different periods.
In the rest of the section we define:

a time interval

for piecewise processes, a number of intervals of possibly different lengths
Noisy Line Process
We define a Noisy Line Process
where

is a slope parameter randomly chosen in , where is the maximum slope and

is a noise parameter

are i.i.d. normal variables
The trend here is given by the sign of . Figure 1 displays some possible trajectories.
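As an illustration of how labelled Noisy Line samples could be generated, here is a minimal sketch. The exact functional form, the uniform slope distribution, the `flat_tol` rule used to create flat segments and every function and parameter name are assumptions made for this example; they are not taken from the definition above.

```python
import random


def simulate_noisy_line(n_points=1000, n_intervals=5, max_slope=1.0,
                        sigma=0.05, flat_tol=0.1, seed=0):
    """Simulate a noisy piecewise linear path and its trend labels.

    Assumed model (illustrative only): on each interval the noise-free
    level moves linearly with a slope drawn uniformly in
    [-max_slope, max_slope], and i.i.d. Gaussian noise of scale sigma is
    added to each observation. Labels are -1 (down), 0 (flat), +1 (up).
    """
    rng = random.Random(seed)
    dt = 1.0 / n_points
    # Random interior breakpoints split the grid into n_intervals pieces.
    cuts = sorted(rng.sample(range(1, n_points), n_intervals - 1)) + [n_points]
    values, labels = [], []
    level, start = 0.0, 0
    for end in cuts:
        slope = rng.uniform(-max_slope, max_slope)
        # Hypothetical rule: small slopes are snapped to zero ("flat").
        if abs(slope) < flat_tol * max_slope:
            slope = 0.0
        label = (slope > 0) - (slope < 0)
        for _ in range(start, end):
            # Continuity: each piece starts where the previous one ended.
            level += slope * dt
            values.append(level + sigma * rng.gauss(0.0, 1.0))
            labels.append(label)
        start = end
    return values, labels
```

Because the slope is known at simulation time, the label is simply its sign, which is what makes labelling trivial on simulated data.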
Piecewise Ornstein-Uhlenbeck dynamic
We define a Piecewise Ornstein-Uhlenbeck Process as a process such that
where and . If the intervals are big enough, , and the trend label will be determined by
(1) 
Samples of the piecewise Ornstein-Uhlenbeck process are shown in figure 2.
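A possible way to simulate such a process is an Euler scheme, sketched below. The mean-reverting form $dX_t = \theta(\mu_i - X_t)\,dt + \sigma\,dW_t$ on each interval, the uniform draw of the target level and the labelling rule (sign of the target minus the value at the start of the interval) are all assumptions for this example; the names are hypothetical.

```python
import math
import random


def simulate_piecewise_ou(n_points=1000, n_intervals=4, theta=20.0,
                          sigma=0.1, seed=0):
    """Euler scheme for an assumed piecewise mean-reverting process.

    On each interval the path reverts towards a freshly drawn target
    level. The label of the interval is the sign of (target - starting
    value), i.e. the direction in which the path is pulled.
    """
    rng = random.Random(seed)
    dt = 1.0 / n_points
    seg = n_points // n_intervals
    x, values, labels = 0.0, [], []
    for _ in range(n_intervals):
        target = rng.uniform(-1.0, 1.0)
        label = (target > x) - (target < x)
        for _ in range(seg):
            noise = sigma * math.sqrt(dt) * rng.gauss(0.0, 1.0)
            x += theta * (target - x) * dt + noise
            values.append(x)
            labels.append(label)
    return values, labels
```

The path is continuous across intervals by construction, since each interval starts from the last simulated value.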
Switching Markovian dynamic
The trend is given by a Markov chain on finite states . The process is defined by
where is a slope process, a positive noise process and . In practice, and are constant with time, the constant being randomly chosen in a discrete distribution. This process exhibits a rich set of trajectories as seen on figure 3.
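The following sketch shows one way such a switching process could be simulated. It assumes a three-state chain with states -1, 0 and +1, a constant slope and noise level, Gaussian log returns, and a simple "stay or jump uniformly" transition rule; all of these choices and all names are illustrative assumptions.

```python
import math
import random


def simulate_markov_switch(n_points=1000, stay_prob=0.98, slope=0.002,
                           sigma=0.01, seed=0):
    """Simulate a price path whose drift sign follows a Markov chain.

    Assumed model: log returns are slope * state + sigma * noise, where
    state is a Markov chain on {-1, 0, +1} that keeps its value with
    probability stay_prob and otherwise jumps to one of the two other
    states with equal probability.
    """
    rng = random.Random(seed)
    states = (-1, 0, 1)
    state = rng.choice(states)
    log_price, prices, labels = 0.0, [], []
    for _ in range(n_points):
        if rng.random() > stay_prob:
            state = rng.choice([s for s in states if s != state])
        log_price += slope * state + sigma * rng.gauss(0.0, 1.0)
        prices.append(math.exp(log_price))
        labels.append(state)  # the hidden state is the trend label
    return prices, labels
```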
2.3 Training and Validation sets
Training sets are made of 1000 time series containing roughly 1000 data points, randomly drawn:

from either one of the three previous dynamics (see section 2.2)

or from all of the previous dynamics. This will be named mixed dynamic in the following
Model selection is made on validation sets composed of 300 time series: 100 samples from each of the three dynamics described in section 2.2. Each sample has between 500 and 1000 points depending on the dynamics and the draw. Figure 4 shows random samples from the validation set. This validation set offers a rich set of scenarios and can be used to assess the ability of an estimator to detect trends. Hyperparameters are chosen using a separate test set which is a new random draw of the training set.
2.4 From empirical data to stylised time series dynamics
One important question arising from the chosen approach is the relevance of the simulated data. The dynamics can show behaviours that, even if not designed to simulate market dynamics, can be relatively similar to actual asset prices. As an example, figure 5 plots daily time series of real assets against a random sample from our three dynamics.
We see that the trajectories can be visually similar but that the distribution of daily returns may differ greatly. We must bear in mind that our aim is not to simulate market data but to detect trend, defined as the sign of the drift term. We think that our dynamics are good enough to simulate this property of real time series. One general method to get simulated dynamics close to empirical market data is the following:

Choose a dynamic

Compute the distribution of returns of the market time series of interest

Sample time series from the dynamic and compute their distributions of returns

Compute the average distance between the sampled distributions and the empirical one

Minimize this function over the dynamic parameters using black-box Bayesian optimization
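The steps above can be sketched as follows. This is only an illustration under several assumptions: returns are taken as first differences, the distance is a one-dimensional Wasserstein distance between equal-size samples, the dynamic is a toy driftless random walk with a single parameter, and a plain grid search stands in for the black-box Bayesian optimization. All names are hypothetical.

```python
import random


def returns(series):
    """First differences of a series (assumed definition of returns)."""
    return [b - a for a, b in zip(series, series[1:])]


def wasserstein_1d(u, v):
    """W1 distance between two equal-size samples."""
    assert len(u) == len(v)
    return sum(abs(a - b) for a, b in zip(sorted(u), sorted(v))) / len(u)


def simulate(sigma, n, rng):
    """Toy dynamic: driftless Gaussian random walk with step scale sigma."""
    x, path = 0.0, [0.0]
    for _ in range(n):
        x += sigma * rng.gauss(0.0, 1.0)
        path.append(x)
    return path


def average_distance(sigma, target_returns, n_samples, rng):
    """Average distance between sampled and target return distributions."""
    n = len(target_returns)
    dists = [wasserstein_1d(returns(simulate(sigma, n, rng)), target_returns)
             for _ in range(n_samples)]
    return sum(dists) / n_samples


def calibrate(target_returns, candidates, n_samples=20, seed=0):
    """Grid search (a stand-in for Bayesian optimization) over sigma."""
    rng = random.Random(seed)
    scores = {s: average_distance(s, target_returns, n_samples, rng)
              for s in candidates}
    return min(scores, key=scores.get)
```

With a real optimizer, `average_distance` would play the role of the objective function and `candidates` would be replaced by a search space over the parameters of the chosen dynamic.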
3 Using Recurrent Neural Networks to detect trends
We motivate here the use of Recurrent Neural Networks (RNNs) for our classification problem. Starting from simple intuition, we prove their benefits in a simple case.
3.1 Motivation: moving averages filtering and its extension as RNN
One of the most common ways to detect trends is to adopt a filtering approach, comparing smoothed versions of the initial process. For example, we could aggregate several moving averages like:
(2) 
with various values of . Determining the optimal might be difficult if we want to build an estimator adapted to various dynamics. To circumvent this difficulty, we can aggregate the values for different as the components of vectors through time
For example, we might want to consider the concatenation of fast, medium and slow moving averages. We might compare:

the slow and the fast moving averages by looking at the sign of

or maybe the slow versus an average of the medium and slow with the sign of

or whatever weighted combination we fancy with the sign of
Generally speaking, we look at the signs of components of the vector where is a given
Generalizing equation (2) to higher dimensions, we have:
where is a positive matrix and a positive vector such that
The trend is determined by but we could use any other activation function instead of the sign function.
These equations are exactly the update equations of an RNN composed of

a vanilla RNN

with the identity as activation function

with one hidden layer

with convex constraints on the weight matrix

with a simple linear layer and activation function
Such an RNN will be called a “convex net” in the following. This shows that RNNs can be considered as generalizations of some basic moving average comparisons.
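To make the link between moving averages and the recurrence concrete, the sketch below runs a linear recurrence $h_t = A h_{t-1} + b\,x_t$ in which each row of $[A \mid b]$ is non-negative and sums to one. This particular reading of the convex constraint, the zero initial state and the function names are assumptions made for illustration.

```python
def convex_net_states(xs, A, b):
    """Run h_t = A h_{t-1} + b * x_t from h_0 = 0 and return all states.

    Assumed constraint: every entry of A and b is non-negative and each
    row of [A | b] sums to one, so each component of h_t is a convex
    combination of the previous state and the current input.
    """
    d = len(b)
    for i in range(d):
        assert all(w >= 0 for w in A[i]) and b[i] >= 0
        assert abs(sum(A[i]) + b[i] - 1.0) < 1e-9
    h = [0.0] * d
    states = []
    for x in xs:
        h = [sum(A[i][j] * h[j] for j in range(d)) + b[i] * x
             for i in range(d)]
        states.append(h)
    return states


def trend_signal(h, w):
    """Sign of the linear read-out w . h (e.g. fast minus slow average)."""
    s = sum(wi * hi for wi, hi in zip(w, h))
    return (s > 0) - (s < 0)
```

With a diagonal matrix such as `A = [[0.5, 0.0], [0.0, 0.9]]` and `b = [0.5, 0.1]`, the two components are a fast and a slow exponential moving average, and the read-out `w = [1, -1]` recovers the usual fast-versus-slow comparison. Off-diagonal entries in `A` give the more general combinations.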
As a working example, we consider the case of the Noisy Line Process where are independent noise random variables .
For a net with constrained weights it can be shown (see annex B for details):

without trend, , then becomes centered around a variable of finite variance

with trend, then diverges
If we now introduce a hyperbolic tangent activation function instead of identity:

if , near zero the cell is in the linear part and we should expect the state to stay bounded around the origin

if the trend then the state should go towards i.e. to navigate near the faces of the hypercube
For a practical illustration see annex C.
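The qualitative behaviour described above can also be checked numerically on a toy example. The sketch below feeds a noisy line to a two-dimensional recurrence with fixed convex weights, once with the identity and once with a hyperbolic tangent activation. The weight values, the noise level and the function name are arbitrary choices for this illustration.

```python
import math
import random


def final_state(slope, activation, n_steps=200, sigma=0.1, seed=0):
    """Final state of h_t = act(A h_{t-1} + b x_t) on a noisy line.

    The input is x_t = slope * t + sigma * noise. The rows of [A | b]
    are non-negative and sum to one (illustrative fixed weights).
    """
    rng = random.Random(seed)
    A = [[0.5, 0.2], [0.1, 0.6]]
    b = [0.3, 0.3]
    h = [0.0, 0.0]
    for t in range(n_steps):
        x = slope * t + sigma * rng.gauss(0.0, 1.0)
        h = [activation(A[i][0] * h[0] + A[i][1] * h[1] + b[i] * x)
             for i in range(2)]
    return h


identity_state = final_state(0.05, lambda v: v)   # grows with the input
tanh_trend_state = final_state(0.05, math.tanh)   # saturates near +1
tanh_flat_state = final_state(0.0, math.tanh)     # stays near the origin
```

In this toy setting, the identity state follows the diverging input, while the tanh state moves to a corner of the hypercube when there is a trend and stays close to zero otherwise.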
3.2 Overview of RNNs and data
Standard Recurrent Neural Nets
In subsection 2.1, we turned the trend estimation problem into a sequence-to-sequence classification task, for which RNNs can be used. We consider three standard structures: Vanilla RNN, LSTM and GRU.
RNNs contain cycles: the hidden state can depend on the entire past input sequence. We refer to [4] for details. These three standard RNNs have different structures but they share similar update equations like:
where

is a vector representing some internal cells at

is a block-wise activation function

is the input at time

is the state at time

are matrices and vectors
is an element-wise application operator
Depending on the RNN, is a combination of blocks of and possibly .
Essentially, where is a possibly complex mapping from the previous state and current input values to the new state. We refer the reader to [3], [7] and [2] for more details.
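As a concrete instance of such a mapping, the sketch below writes a single GRU step in a commonly used formulation (update gate, reset gate, candidate state). The scalar input, the dictionary layout of the parameters and the helper `zero_params` are illustrative assumptions and not the implementation used for the experiments.

```python
import math


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def zero_params(d):
    """All-zero parameters for a GRU with scalar input and d hidden units."""
    p = {}
    for gate in ("z", "r", "n"):
        p["W" + gate] = [0.0] * d                       # input weights
        p["U" + gate] = [[0.0] * d for _ in range(d)]   # recurrent weights
        p["b" + gate] = [0.0] * d                       # biases
    return p


def gru_step(x, h, p):
    """One GRU update: maps (previous state h, input x) to the new state."""
    d = len(h)

    def affine(gate, vec):
        W, U, b = p["W" + gate], p["U" + gate], p["b" + gate]
        return [W[i] * x + sum(U[i][j] * vec[j] for j in range(d)) + b[i]
                for i in range(d)]

    z = [sigmoid(v) for v in affine("z", h)]        # update gate
    r = [sigmoid(v) for v in affine("r", h)]        # reset gate
    gated = [r[i] * h[i] for i in range(d)]
    cand = [math.tanh(v) for v in affine("n", gated)]  # candidate state
    # New state: convex combination of the old state and the candidate.
    return [(1.0 - z[i]) * h[i] + z[i] * cand[i] for i in range(d)]
```

The gates are the internal cells, sigmoid and tanh are applied block by block, and the last line shows why the new state stays in $[-1, 1]$ whenever the previous one does.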
Training RNNs
For training and validation, we use simulated time series according to section 2.3. Our aim is to give a precise empirical comparison of these three structures taking into account the possible influence of the training dynamic. We train triplets of the form:

an RNN chosen among Vanilla, LSTM or GRU

some hyperparameters like the number of recurrent layers, the dimension of hidden layer(s), dropout (see [15] for definition)…

a time series dynamic chosen among Noisy Line Process, Piecewise Ornstein-Uhlenbeck, Markovian Switch or a mixed dynamic
Each of these triplets is trained and validated against the training and validation sets described in subsection 2.3. This gives us more than 400 triplets to train and validate. Roughly 100 triplets hit convergence issues during training and are excluded from the validation phase. Some parameter details can be found in annex D.1. Also, to get more robust results, we ran the complete training with two different gradient-based optimizers:

Adam algorithm

RMSprop algorithm (see [6] for details)
3.3 Empirical findings
We train our triplets as described in subsection 3.2.2 for both Adam and RMSprop and validate each triplet on our 300 validation samples (see section 2.3). The loss is a binary loss on the labels.
Table 1 shows the coefficients of the linear regression of loss against binary variables indicating the training dynamic, the net type, the optimization type and the validation dynamic. Each feature is translated into binary on/off variables with one less modality. The missing modality is on if all others are set to zero. A positive coefficient means that the highlighted feature increases the average loss of the sample, and conversely, a negative coefficient decreases the average loss. Full details can be found in annex D.2.
Feature[Modality]  Coefficient 
Intercept  0.48 
Training dynamic[Markovian Switch]  
Training dynamic[Ornstein-Uhlenbeck]  0.029 
Training dynamic[Noisy Line]  
Net Type[LSTM]  0.037 
Net Type[Vanilla]  0.17 
Optimization[RMSP]  0.0234 
Validation dynamic[Ornstein-Uhlenbeck]  0.1 
Validation dynamic[Noisy Line]  0.036 
From figure 6:

training on the Ornstein-Uhlenbeck dynamic seems to worsen performance

GRU seems to be the best net type and Vanilla not a great choice

the optimization algorithm RMSProp has a negative impact on performance. Adam leads to better results

the validation loss for Markovian Switch is higher than the two other dynamics
Training dynamic has an impact on validation performance. Choosing two dynamics, e.g. Noisy Line versus Piecewise Ornstein-Uhlenbeck, we select data from those only and bootstrap. For each bootstrapping iteration, we compute the difference between the medians of losses of one dynamic versus the other. The result can be seen in table 2. Even though all intervals contain zero, so that no robust conclusion can be drawn, the median loss seems lower when training using the Noisy Line or Markovian Switch dynamics.
type 1  type 2  Median loss difference  1% confidence interval (lower)  1% confidence interval (upper)

nl  ou  0.04  0.19  0.10 
nl  ms  0.01  0.15  0.17 
nl  mix  0.009  0.17  0.15 
ou  ms  0.05  0.10  0.21 
ou  mix  0.04  0.12  0.20 
ms  mix  0.02  0.20  0.16 
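The bootstrap comparison of medians described above could be implemented along the following lines. The percentile interval, the number of resamples and the function name are assumptions for this sketch; the exact procedure behind the table may differ.

```python
import random
import statistics


def bootstrap_median_diff(losses_a, losses_b, n_boot=2000, alpha=0.01,
                          seed=0):
    """Bootstrap the difference median(losses_a) - median(losses_b).

    Returns the median of the bootstrapped differences together with the
    lower and upper bounds of a percentile interval at level 1 - alpha.
    """
    rng = random.Random(seed)
    diffs = []
    for _ in range(n_boot):
        # Resample each group with replacement, keeping its size.
        a = rng.choices(losses_a, k=len(losses_a))
        b = rng.choices(losses_b, k=len(losses_b))
        diffs.append(statistics.median(a) - statistics.median(b))
    diffs.sort()
    lower = diffs[int((alpha / 2) * n_boot)]
    upper = diffs[int((1 - alpha / 2) * n_boot) - 1]
    return statistics.median(diffs), lower, upper
```

An interval that contains zero means the two groups cannot be separated at the chosen level, which is how the rows of the table should be read.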
Net structures are compared using the same bootstrapping procedure in table 3. Vanilla RNN is consistently worse than LSTM and GRU at the 99% confidence level. As a result, in the following, we will ignore triplets with Vanilla RNN. Vanilla RNN is barely better than a dummy estimator having chance of correctly predicting the trend (see annex D.3).
net 1  net 2  Median loss difference  1% confidence interval (lower)  1% confidence interval (upper)

vanilla  lstm  0.14  0.005  0.28 
vanilla  gru  0.18  0.04  0.32 
lstm  gru  0.05  0.15  0.25 
Optimizer impact: results seem to indicate a slightly better performance of Adam versus RMSprop.
Net structure and training dynamic interaction: using only the triplets where the net structure is either GRU or LSTM, we run the same bootstrapping procedure for each training dynamic. The results are given in table 4. All the intervals contain 0 and it is difficult to find a combination which does significantly better than the others.


3.4 RNN baseline selection
We would like to choose a RNN estimator having a good overall performance on validation data. As we have seen, it is difficult to choose a particular training type or net structure (GRU or LSTM) as being significantly better.
A way to build a baseline would be, for example, to pool the estimated probabilities of the best trained estimators. The pooling function here is a simple average of the estimated probabilities from the selected estimators.
Validation dynamic type  Median loss  First quartile  Third quartile  IQR 

Mixed  0.22  0.11  0.39  0.28 
Ornstein-Uhlenbeck  0.21  0.14  0.31  0.17 
Markovian Switch  0.37  0.21  0.52  0.31 
Noisy Line  0.11  0.05  0.23  0.18 
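The pooling by simple averaging mentioned before the table can be sketched as follows. The input layout (one list of per-time-step class probabilities per estimator), the class order (down, flat, up) and the function names are assumptions for this example.

```python
def pool_probabilities(prob_list):
    """Average class probabilities across estimators, step by step.

    prob_list[e][t] is the list of class probabilities produced by
    estimator e at time step t.
    """
    n = len(prob_list)
    pooled = []
    for step in zip(*prob_list):
        n_classes = len(step[0])
        pooled.append([sum(p[k] for p in step) / n
                       for k in range(n_classes)])
    return pooled


def predict_labels(pooled, classes=(-1, 0, 1)):
    """Pick the class with the highest pooled probability at each step."""
    return [classes[max(range(len(p)), key=p.__getitem__)] for p in pooled]
```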
Yet, choosing such an estimator would give RNNs an advantage compared to other estimators. To be as fair as possible and favour simplicity over performance we choose to optimize hyperparameters for a GRU network trained on the Piecewise Noisy Line dynamic using Adam optimization. Some details of the RNN baseline can be found in table 6.
It is interesting to note that adding training epochs
Net structure type  GRU 
Dropout  0.2 
Number of hidden recurrent layers  2 
Dimension of hidden recurrent layers  20 
Learning rate  0.005 
Number of epochs  200 
Training type  Noisy Line 
Max noise level  0.07 
Max line slope  1.4 
Running the training with hyperparameters not too far from the ones obtained by optimization gives fairly similar results. The comparison of the RNN baseline versus the pooled estimator is given in table 7 and figure 7 for the loss distributions. Even if our RNN baseline is not the best it still offers good performance.
Dynamic  RNN  Pooled estimator 

All  0.25  0.22 
Ornstein-Uhlenbeck  0.25  0.24 
Noisy Line  0.13  0.13 
Markovian Switch  0.49  0.37 
4 Non model based estimation
By “non model based”, we mean estimators which are not based on an explicit modelling of the underlying dynamic. We compare the RNN baseline of subsection 3.4 against a simple moving average estimator, its generalization (see section 3.1) and a Convolutional Neural Network (CNN, see [11]). Overall, the RNN baseline exhibits much stronger validation performance.
4.1 Comparison with moving average
One of the most intuitive ways to detect trend is to compare the speed of two moving averages. We compare our RNN baseline with both the simplest moving average filtering and the convex net generalization approach.
Simple moving average
We first compare the RNN baseline with a basic estimator computing two moving averages: a “slow” one (s) and a “fast” one (f)
Given , a no trend threshold, the trend prediction is made by
otherwise 
Obviously, the parameters have a big impact on the estimator performance. Using Bayesian optimization we find the parameters shown in table 8.
Parameter  Value 

0.95  
0.48  
0.1 
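A two moving average estimator of this kind can be sketched as below. The exponential form of the averages, the default parameter values and the function names are assumptions for illustration and are not meant to reproduce the optimized estimator.

```python
def ema(xs, alpha):
    """Exponential moving average m_t = alpha * m_{t-1} + (1 - alpha) * x_t."""
    out, m = [], xs[0]
    for x in xs:
        m = alpha * m + (1.0 - alpha) * x
        out.append(m)
    return out


def ma_trend(xs, alpha_slow=0.95, alpha_fast=0.5, threshold=0.1):
    """Label each step by comparing a fast and a slow moving average.

    Returns +1 when fast - slow exceeds the threshold, -1 when it is
    below minus the threshold, and 0 (no trend) otherwise.
    """
    slow, fast = ema(xs, alpha_slow), ema(xs, alpha_fast)
    labels = []
    for f, s in zip(fast, slow):
        gap = f - s
        if abs(gap) <= threshold:
            labels.append(0)
        else:
            labels.append(1 if gap > 0 else -1)
    return labels
```

The threshold trades off reactivity against false alarms: a larger value keeps the estimator in the "no trend" state for longer after a change of regime.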
On figure 8 we see the loss distribution of the baseline RNN versus the loss distribution of the moving average estimator for all dynamics.
On average, the RNN baseline is consistently better than the moving average estimator, as seen in table 9. The Markovian Switch dynamic is sometimes extremely difficult to capture due to highly volatile regime switching. For this dynamic, we see that both estimators are equally bad, which is not unexpected given the task difficulty.
Dynamic  RNN  MA 

All  0.26  0.43 
Ornstein-Uhlenbeck  0.23  0.31 
Noisy Line  0.14  0.48 
Markovian Switch  0.51  0.53 
Comparison with moving average generalization
We compare the baseline RNN with the estimator built according to subsection 3.1. Basically, this is a Vanilla RNN without any activation function. Also, weights are constrained to form a stochastic matrix. Somewhat surprisingly, the performance is quite poor and much worse than the RNN baseline. Further investigation is needed, but training seems to fail, as the trained weights are all very close to zero. As a result, the input plays little role in the prediction and the estimator cannot do much better than a dummy estimator. For reference, basic results are shown in table 10.
Dynamic  RNN  Generalized moving average 

All  0.27  0.61 
Ornstein-Uhlenbeck  0.26  0.61 
Noisy Line  0.12  0.62 
Markovian Switch  0.47  0.61 
4.2 Comparison with CNN
One-dimensional CNNs are sometimes seen as a good tool to analyse time series. We use a standard CNN structure stacking a convolutional layer followed by a pooling layer. To keep the net architectures similar in terms of parameters, we use two layers of convolution + pooling.
After optimization, we get the hyperparameters shown in table 11. Interestingly, both channel dimension and kernel size have taken the maximum value in the range we tested.
Parameter  Value 

Learning rate  0.004 
Channel dimension  20 
Kernel size  20 
Yet, we are unable to find the supposed general efficiency of CNNs in our setup as seen on figure 9. Actually, CNN performance is barely better than a dummy classifier as seen on table 12.
Dynamic  RNN  CNN 

All  0.25  0.58 
Ornstein-Uhlenbeck  0.27  0.48 
Noisy Line  0.13  0.65 
Markovian Switch  0.41  0.64 
5 Model based estimators
In this section, we compare the performance of the RNN baseline with classifiers based on maximum likelihood estimation (MLE) of the process parameters. These estimators therefore incorporate knowledge about the underlying data generative process. For each dynamic (see subsection 2.2), we compute the MLE estimator of the trend parameter. Then, we use this value at each time step to compute a trend label . This approach, which converts a numerical estimate of the trend to a label, is described in the following subsection.
In subsections 5.2, 5.3 and 5.4 we recall the formulas of the MLE trend estimators and present their empirical performance in comparison with the RNN baseline. Overall, the baseline shows good performance against these estimators. Theoretical details of MLE derivations are included in annex A.
5.1 From MLE to trend classifier
As a reminder, the training data used for the learning step of the neural networks is comprised of piecewise trajectories of the dynamics and uses randomized model parameters. Taking into account this additional randomness in a MLE estimation framework would make the theory intractable. In order to compare MLE based trend classification with neural networks, we use a sliding window mechanism. For a sliding window of length :

we compute the value of the trend estimator

we map the value of to a label using the sign function (for a given threshold ) and predict this label with probability .
We only need this mechanism for the Noisy Line Process and the Piecewise Ornstein-Uhlenbeck Process.
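The sliding window mechanism can be sketched generically, with the trend estimator passed in as a function. The handling of the first points (shorter windows, label 0 when fewer than two points are available), the example estimator `mean_increment` and all names are assumptions made for this illustration.

```python
def sliding_window_labels(xs, window, estimator, threshold):
    """Turn a numerical trend estimator into a per-step trend label.

    At each step the estimator is applied to the last `window` points
    (fewer at the start of the series). The label is the sign of the
    estimate, or 0 when its absolute value is below the threshold.
    """
    labels = []
    for t in range(len(xs)):
        chunk = xs[max(0, t - window + 1): t + 1]
        if len(chunk) < 2:
            labels.append(0)  # not enough data to estimate a trend
            continue
        value = estimator(chunk)
        if abs(value) <= threshold:
            labels.append(0)
        else:
            labels.append(1 if value > 0 else -1)
    return labels


def mean_increment(chunk):
    """Placeholder estimator: average increment over the window."""
    return (chunk[-1] - chunk[0]) / (len(chunk) - 1)
```

Any model-based estimator, for example a maximum likelihood slope estimate computed on the window, can be plugged in place of `mean_increment`.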
5.2 Noisy Line Estimator
Derivation of MLE estimator on an interval
Deriving the maximum likelihood estimator for the slope is easy as any finite sample on a subdivision is a Gaussian vector with diagonal covariance matrix. Maximizing the likelihood of yields the slope formula (see annex A.1 for mathematical details):
(3) 
The MLE estimator for the slope follows a normal distribution with mean and variance . For a subdivision with constant time step the variance is given by:
hence decreasing with the number of observations at the rate .
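For illustration, the sketch below computes a slope estimate and its variance under one specific assumption about the model, namely $x_{t_i} = a\,t_i + \sigma\,\epsilon_i$ with i.i.d. standard normal noise and a line through the origin. Under that assumption the Gaussian maximum likelihood estimate coincides with least squares through the origin. The model form and the function names are assumptions, so the formulas may differ from equation (3).

```python
def slope_mle(ts, xs):
    """Least-squares slope through the origin: sum(t * x) / sum(t * t).

    This is the Gaussian MLE of a under the assumed model
    x_i = a * t_i + sigma * eps_i.
    """
    return sum(t * x for t, x in zip(ts, xs)) / sum(t * t for t in ts)


def slope_variance(sigma, n, dt):
    """Variance sigma^2 / sum(t_i^2) on the grid t_i = i * dt, i = 1..n.

    Uses sum(i^2, i = 1..n) = n (n + 1) (2 n + 1) / 6.
    """
    return 6.0 * sigma ** 2 / (dt ** 2 * n * (n + 1) * (2 * n + 1))
```

Under this assumed model and for a fixed time step, the variance behaves like $3\sigma^2 / (\delta^2 n^3)$ for large $n$, so the estimate sharpens quickly as the window grows.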
Empirical performance
Using the same procedure as in section 4, we compare its performance against our RNN baseline on figure 10 and table 13.
Dynamic  RNN  NLE 

All  0.28  0.53 
Ornstein-Uhlenbeck  0.29  0.42 
Noisy Line  0.14  0.56 
Markovian Switch  0.47  0.61 
The Noisy Line Estimator is easily overtaken by the RNN baseline, even on the simple noisy line dynamic.
5.3 Piecewise OU process
Derivation of MLE estimator on an interval
Estimating the parameters of continuous-time diffusions is a difficult task. One way to construct such estimators is to derive the likelihood function on a discrete grid of price observations. Due to non-independent samples, the likelihood can be hard to derive and its maximisation might require the use of numerical optimization procedures. In the present study we leverage the theoretical results of [13, 12] that express the likelihood function in a simple stochastic integral form. In the case of the Ornstein-Uhlenbeck process with linear trend diffusion:
the formulas for the estimators are given by:
(4) 
(5) 
To some extent, an analogy can be drawn with classical OLS estimators where the variance scaling term corresponds to the term . The reader can refer to the technical addendum A.2 for mathematical details. When dealing with discrete time observations, the integrals are approximated using the sample values and discrete time increments. Simulations show that these estimators exhibit good empirical properties, although they are biased. It can be shown that the biases for both estimators are given by:
In practical applications, the expectations above are computed by first evaluating the residuals over the observed values of and then approximating the integrals by summation of the weighted increments.
Empirical performance
We design a trend estimator using the sliding window mechanism of subsection 5.1. We compare its performance against our RNN baseline in figure 11 and table 14. Interestingly, the performance on the Ornstein-Uhlenbeck dynamic is markedly better and comparable to the performance of the RNN on the same dynamic.
Dynamic  RNN  OUE 

All  0.28  0.50 
Ornstein-Uhlenbeck  0.28  0.34 
Noisy Line  0.12  0.53 
Markovian Switch  0.41  0.58 
5.4 Markovian switch process
Derivation of MLE estimator
The Markovian Switch dynamic described in section 2.2.3 is actually the dynamic of a Hidden Markov Model (HMM) with Gaussian emission probabilities on log returns:
where is a simple discrete three-state Markov chain. We then use classic techniques (see [14] for example) to get an estimate of the hidden states which have generated .
Empirical performance
We train a three-state HMM with Gaussian emission probabilities on the four time series dynamics (as described in subsection 2.2). Performance is similar regardless of the training dynamic. It is not obvious that the hidden states of the HMM will fit our up, down and flat trend categories. To be able to compute a loss for the HMM, we first map the three states of the HMM using the mean of the emission distribution given the hidden state: we sort the means in increasing order and map them to the down, flat and up states. We would expect to get a sequence of means being negative, close to zero and positive. Actually, only estimators trained on the mixed or Markovian Switch dynamics exhibit means which are clearly separated into a negative, a near-zero and a positive value. Performance being similar, we use as baseline the estimator trained on the Markovian Switch dynamic, which seems the most natural choice. Globally, the HMM has a hard time predicting the trend of any dynamic. This might be a bit surprising, especially for the Markovian Switch dynamic. We note however that the best validation score is obtained when the HMM is trained on the Markovian Switch dynamic. As seen in figure 12 or table 15, the HMM does not provide a good estimator of trend and is easily overtaken by the RNN approach.
Dynamic  RNN  HMM 

All  0.30  0.70 
Ornstein-Uhlenbeck  0.28  0.84 
Noisy Line  0.17  0.74 
Markovian Switch  0.50  0.64 
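The mapping from hidden states to trend labels used for the HMM can be sketched as follows. It assumes that the fitted state means and the decoded state sequence are already available; the function names and the label encoding (-1, 0, +1) are illustrative choices.

```python
def map_states_to_trends(state_means):
    """Map each hidden state index to a trend label by sorting the means.

    The state with the smallest mean becomes -1 (down), the middle one
    0 (flat) and the largest one +1 (up).
    """
    assert len(state_means) == 3
    order = sorted(range(3), key=lambda k: state_means[k])
    return {state: label for state, label in zip(order, (-1, 0, 1))}


def decode_trends(hidden_states, state_means):
    """Translate a decoded hidden state sequence into trend labels."""
    mapping = map_states_to_trends(state_means)
    return [mapping[s] for s in hidden_states]
```

This mapping is only meaningful when the three means are clearly separated around zero, which, as noted above, is not the case for every training dynamic.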
6 Summary
In this paper, we have investigated the use of several trend estimators on time series behaving similarly to the ones encountered in finance. We have derived theoretical maximum likelihood estimators of trends for two standard dynamics and implemented them. We have shown that certain RNNs are in a way a generalization of simple moving average techniques. For a simple dynamic, we have shown that this generalization transforms the trend estimation problem into locating the state vector. Finally, we have shown empirically that GRU or LSTM cells are on average the best building blocks to use, compared to a broad range of estimators, in order to detect trends in time series. Putting the emphasis on learning stylized data and then transferring to real data, rather than building complex structures fitted to data, is also an important takeaway of this paper. Ongoing preliminary research seems to validate our approach for financial applications. This might pave the way to building efficient market estimators protected against overfitting.
Appendix A MLE estimators theory
A.1 Simple noisy line estimator
On a discrete time grid we consider the “noisy line” dynamics:
(6) 
where is a collection of i.i.d. normal random variables .
One can easily show that is a Gaussian vector with diagonal covariance matrix. The likelihood function is expressed as
Let denote the log-likelihood. Solving yields the expression (3).
By expressing as
one can show that and .
Simulations of trajectories (6) to compute sample estimates of are in agreement with the above result.
A.2 Linear trend with diffusion estimator
We consider the diffusion with the dynamics
where is a Wiener process and are unknown scalar quantities to be estimated from observations. In an infinitesimal time period , the price moves linearly by an amount and fluctuates around this trend term by an amount equal to .
We seek to construct estimation techniques for and . In the setting of discrete observations, various estimation approaches can be used. For instance, one can first detrend the observed price series and then estimate the fluctuation speed using standard OLS techniques. The drawbacks of such an approach are twofold. Firstly, estimation is conducted regardless of the joint distribution of . Secondly, classical OLS assumptions are most likely to fail in the case of a diffusion price process. As a consequence of the non-stationarity of residuals, it can be shown that the OLS estimator of is biased. Such behaviours are studied in depth in [17].
Our approach follows the results from [12] in which the authors estimate drift parameters in a continuous likelihood maximization framework. Let us recall the main results from [13, 12].
Theorem 1.
Let be a process satisfying the stochastic differential equation (SDE)
where is a non-anticipative function.
Under the assumption that  almost surely,
then the measures and are equivalent. Moreover, almost surely, the Radon-Nikodym derivative of with respect to is given by:
(7) 
The reader can refer to [13], Theorem 7.7, for a formal statement and proof. The issue of the drift parametric estimation is addressed in [12] by considering the diffusion process:
(8) 
Using the result above with and under similar assumption on one can show that the measures and are equivalent and that the likelihood function can be expressed as
It is easy to show that the log-likelihood is a concave function of the parameter and that its maximum is attained for such that .
As a consequence, under the assumption that
(9) 
and under the condition that a.s. the maximum likelihood estimation of is expressed as:
(10) 
When dealing with real data, the numerical value of is computed using numerical integration techniques along the observed path . From now on, we adopt the lighter notations:
so that the MLE estimator (10) is expressed as .
For most drift functions the estimator has nonzero bias. An approximation of the bias can be easily derived by substituting the expression of in (10):
Hence the bias can be computed by approximating the expectation:
In the following, we extend (8) to the 2D parametric drift case:
(11) 
Theorem 2.
Under the condition that a.s. the maximum likelihood estimation of is expressed as:
(12) 
(13) 
Proof.
Let denote the log-likelihood. To ensure the concavity of one must verify that its Hessian matrix is negative definite.
Deriving the Hessian yields
hence of the form
The eigenvalues of are given by
For its largest eigenvalue to be negative is equivalent to , that is . This latter expression is equivalent to the Cauchy-Schwarz inequality. Hence these conditions are a.s. verified, ensuring the concavity of . Finally one can deduce the equations (12) and (13) by solving the first order conditions
∎
We now consider the diffusion:
(14) 
From the results above the MLE estimators for both and are given by:
(15) 
(16) 
To obtain these formulas we use the formulas (12) and (13) with , , and . Using Itô’s lemma one can show that:
Appendix B Asymptotic state behaviour in a simple case
We prove in this annex the results stated in the worked example of section 3.1.
We consider the following process
where and and
Let’s consider a simple noisy line process we have:
being the trend and an i.i.d. noise process with expectation equal to zero and unit variance.
Without trend i.e. , we have
We note the Perron-Frobenius eigenvalue of . All eigenvalues of different from satisfy . If is a corresponding eigenvector then
Noting
So and is bijective. We can define by
Then,
Simplifying notations with , and