# Approximating intractable short rate model distribution with neural network

## Abstract

We propose an algorithm which predicts each subsequent time step relative to the previous time step of intractable short rate model (when adjusted for drift and overall distribution of previous percentile result) and show that the method achieves superior outcomes to the unbiased estimate both on the trained dataset and different validation data.

## 1 Introduction

Let a stochastic process with the following properties be denoted as :

(1) |

where

(2) |

(3) |

where all the parameters including and are known in advance. Then is normal with the mean

(4) |

and the variance

(5) |

Since the distribution is normal, best unbiased estimate is also normally distributed, and should not need moments beyond the first two. The formulation above is a kernel of an existing problem of short rate computation within two factor log-normal models.

Given the distribution of we can deduce the distribution . The problem can be formulated as follows: given the distribution find the distribution . We accept the hypothesis of stationarity of distribution of quantiles, i.e. . However, this requires to be random which it cannot be. We use the data generated with the stochastic equations to train a neural network in order to test, whether on an average sample it can outperform the theoretical prediction. The structure makes no assumption about the underlying distribution (e.g. that the underlying distribution is Gaussian, which would require only two quantiles to derive the distribution parameters).

The same Neural network can then outperform the Method of moments in a set generated from different parameters data.

## 2 Generating training and validation data

### 2.1 Basic definitions

In order to train the neural network to perform the same functions as stochastic equations, a large data series was generated. In order to improve the precision of computation in line with the possible evolution of paths, we use a branching technique, which increases the number of unique paths by a multiple of 4 at every timestep (discretization of continuous process).

### 2.2 Generating data from Stochastic equations

We generate a large dataset (5,000 simulations based on individual realisations of the model, for each simulation). This is achieved by branching each of the outcomes in the previous step by multiple of 4 (i.e. 1st step has 4 results per simulation, the next 16 and so on, until 12th) - see Appendix 2 for more information. In order to derive formulation for our neural network we will have to reduce the model to percentile outcomes.

Given formulation of 2-factor Black-Karasinski model as:

(6) | |||

(7) |

where

and are constants, define a new state variable where:

It can be seen that . Then we can re-write the original equation as:

This is a so-called 2 factor Hull-White model. This then allows [3] the G2++ formulation of the model, where the state variables are decoupled. The new formulation will have the following terms:

(8) |

Note, that (8) is a deterministic equation relative to time t.

Reformulation of the above into two independent Vasicek equations is obvious. The attempt to estimate the original series from the previous data (i.e. decomposition and re-composition independent series) creates additional estimation error with its own noise. Hence in order to reduce the method derived error we will attempt to do the estimation on the joint series.

Explicitly this would mean that each percentile value of the outcome would need to be adjusted by the joint drift over the time step, and then amended for the standard deviation. However, having tested various possible drift functions, we came to conclusion that not introducing any drift to the previous percentile values gives the most stringent version of the test.

## 3 Method of Moments

### 3.1 Basid definitions

Typically intractable short rate are approximated by binomial trees [1] which rely on theoretical moments of the function. We use the moments of theoretical distribution to compare the results generated by running the model in the process described in Appendix 2.

### 3.2 Estimating next timestep with Method of Moments

Method of moments in this paper approach describes a theoretical measure: based on the theoretical standard deviation and mean of the distribution we derive the quantiles of the distribution values. The expectations for the next step are easily derived using basic Euler scheme:

Standard deviation over a particular time step are derived from orthogonal decomposition of the above model into two independent Vasicek processes. This approach is standard when approximating the distribution of path with binomial trees. This results in the following expression:

Where

(9) | |||

(10) | |||

(11) | |||

(12) | |||

(13) |

Given a realization of the process from (4) at step , we can index it by in order to arrive at the expected values over the timestep . From original formulation of the problem .

Having adjusted for the mean, we can then compute the realized percentiles from the theoretical standard deviation of the distribution â effectively deriving the percentiles from the theoretical first two moments of the distribution. is an array of for at .

Where denotes an 2D array shifted by +1 along the t.

Where is defined as the variance and covariance of the two processes.

## 4 Model formulation - Neural Network

### 4.1 Basic defintions

In order to show the possibility of additional information captured by looking at distribution as a whole we use the simplest neural net possible, and attempt not to overfit the data. We verify the results against the mean scenario, and scale it by standard deviation of resulting from stochastic errors between all scenarios for each percentile.

We can create a state space representation of required function of recurrent neural network as such:

where optimising for D matricies and vectors with stochastic gradient descend method [2] . Where is a 10 by 200 matrix, and is a 200 by a 10 matrix, and is a 10 parameter vector, and is a 200 parameter vector.

### 4.2 Neural Network Specification

We will then train a simple (single layer, 10 nodes) neural network to derive additional information (relying on the values of other percentiles) about the distribution from the percentiles of the previous timestep at time t (i.e. n= 200 values corresponding to 0.5% to 100% quantiles of the distribution), to predict the outcomes at time t+1.

Such that the trained neural network is able to produce outcomes that lie within 1 standard deviation (this is relative error in graphs below) from true average of the process in Figure 1. 1. Further refinements can be made to reduce the error (which is almost constant, relative to t).

We used MLPRegressors ([5]). This is a single hidden layer âMultilayer Perceptron regressorâ [4], and limited the number of nodes in a single hidden layer to 10 regressors.

The network uses 10 sigmoid nodes to regress the input (in form of percentile values) against the output (percentile values at t+1). It fits the weights and intercepts to inputs to produce outputs, and this is then repeated for each incremental month from 2nd to 12th, for all 5,000 scenarios.

The notable effect of this is constraining the amount of information carried from one example to the next. It is also useful to note, that when the number of regressors increases the fit is increasingly aligned with the method of moments.

The graphs below present the forecast error on average for each of the quantiles at 3rd, 6th, 9th and 12th timestep standardized by standard error for the 5,000 scenarios, on that dataset.

Figure 1 goes here

## 5 Simulation

The original hypothesis is that the previous value of distribution carries no bearing on the way the distribution looks at the next time step if this statement is true then the neural network (we train on data at t=0) should not be able to learn any features that would enable it to beat method of moments.

However, here the variance of the process increases with predetermined amount sigma (as per theoretical distribution), so if we have previous distribution correctly standardized against the theoretical distribution in the previous step, the distribution of the next step would be best predicted by the percentile outcomes of the previous step.

As both processes utilize Brownian motion, we can use these percentiles directly to predict the future standardized percentiles. By relying on sigmoid/logistic function, we can use the fact that the normal distribution is relatively well approximated by the logistic equation in terms of CDF and derive the new values of CDF.

time t= | Neural | NN out of | Method of | MofM out |
---|---|---|---|---|

Network | sample | Moments | of sample | |

2 | 0.00726 | 0.18239 | 0.63372 | 0.57288 |

3 | 0.01659 | 0.02198 | 0.06242 | 0.06547 |

4 | 0.00361 | 0.00324 | 0.0608 | 0.06267 |

5 | 0.00548 | 0.00392 | 0.03186 | 0.03334 |

6 | 0.00175 | 0.00614 | 0.01015 | 0.01017 |

7 | 0.00191 | 0.00651 | 0.00623 | 0.00608 |

8 | 0.00142 | 0.00549 | 0.00431 | 0.00426 |

9 | 0.0009 | 0.00483 | 0.00323 | 0.00317 |

10 | 0.00058 | 0.0041 | 0.00253 | 0.00247 |

11 | 0.00081 | 0.00342 | 0.00207 | 0.00212 |

12 | 0.00132 | 0.00284 | 0.00173 | 0.0017 |

We evaluate the quality of fit on the mean of the process (which is not part of training dataset), in order to remove stochastic error from the evaluation process. We also standardize the estimation by stochastic error of each percentile. In order to verify the veracity of the method we test it against a different dataset (not used for training, and with 1000 trails on different values) to verify that the method does indeed produce superior performance to the method of moments. (In the case where only part of these parameters are changed, the case for using Neural Networks is much stronger). This result holds across different time points across the 12 month evolution of the process.

## 6 Conclusion

This shows that the stochastic error can be predicted up until certain number of simulations - at least on a single timestep basis. If we ran shocks at timestep 1 we would end up with method of moments beating the neural networks.

Compared with method of moments the neural network produces lower RMSE for all calibrations except entirely new ones, and still is able to outperform the method of moments on a single timestep on sample size less than 16 thousand projections. Hence due to constant changes in the calibration inputs this method is superior, because it’s more accurate and faster.

Two things of note: the errors appears to have a constant structure which suggests training a time dependent neural net, or indeed a simple regression may further reduce the error. The fit of the method of moments is better around the mean, hence a further process can be added to place more reliance on theoretical mean around that point.

## 7 Appendix 1

Table of calibrated values

Here the mean reversion parameters are the log of

Variable | Period 1 | Validation |
---|---|---|

0.1759 | 0.1776 | |

0.0785 | 0.0819 | |

0.3423 | 0.3407 | |

0.2242 | 0.2177 | |

0.0377 | 0.0377 | |

Rate at t=0 | 0.0307 | 0.0394 |

## 8 Appendix 2

Data generation

The dataset for training the model is generated by running the full formulation of the original equation of Black Karasinski and Euler discretization scheme[6] reformulating equation

we know from table in Appendix 1, hence we can compute the next step at following the Euler’s discretization scheme the step size is a month, as we are using 12 monthly sub periods over a year.

Since there is a linear mapping between a discretisation of a non-exponential stochastic equations (e.g. to ) and the above formulation we can discretize the process to give:

at monthly timestep with four new results for generating from two Normal non-symmetric shocks for each node (total of 8 random numbers per every time steps), over 12 monthly timesteps. We record the percentiles over each time step. We do not record the full dataset because at the last timestep there are unique interest rate paths with floating rate precision for each trial which amounts to over 4GB of storage. We then simplify this data set by condensing this dataset to its percentiles (total of 200 from 0.5% to 100%)

We take advantage of the fact that values the are closer to the point of origin require less dispersion - I.e. there are fewer significantly different paths, but this number grows with time.

## 9 Data availability statement

Data available on request from the primary author

## Acknowledgements

With special thanks to Andrey Marchenko for thorough review and summary, and my husband Mark for making this possible.

### References

- (1998) Binomial term structure models. Mathematica in Education and Research 7, pp. 11–19. Cited by: §3.1.
- (2010) Stochastic gradient descent. Note: \urlhttps://leon.bottou.org/projects/sgdAccessed: 2010-09-30 Cited by: §4.1.
- (2001) Interest rate models - theory and practice. Springer Finance, Springer. External Links: ISBN 9783540417729, LCCN 01034206, Link Cited by: §2.2.
- (1990) Connectionist learning procedures. artificial intelligence, 40 1-3: 185 234, 1989. reprinted in j. carbonell, editor,”. Machine Learning: Paradigms and Methods”, MIT Press. Cited by: §4.2.
- (2011) Scikit-learn: machine learning in Python. Journal of Machine Learning Research 12, pp. 2825–2830. Cited by: §4.2.
- (1999) An introduction to numerical methods for stochastic differential equations. Acta numerica 8, pp. 197–246. Cited by: §8.