Differential Machine Learning

Differential machine learning combines automatic adjoint differentiation (AAD) with modern machine learning (ML) in the context of risk management of financial Derivatives. We introduce novel algorithms for training fast, accurate pricing and risk approximations, online, in real time, with convergence guarantees. Our machinery is applicable to arbitrary Derivatives instruments or trading books, under arbitrary stochastic models of the underlying market variables. It effectively resolves computational bottlenecks of Derivatives risk reports and capital calculations.

Differential ML is a general extension of supervised learning, where ML models are trained on examples of not only inputs and labels but also differentials of labels wrt inputs. It is also applicable in many situations outside finance, where high quality first-order derivatives wrt training inputs are available. Applications in Physics, for example, may leverage differentials known from first principles to learn function approximations more effectively.

In finance, AAD computes pathwise differentials with remarkable efficacy so differential ML algorithms provide extremely effective pricing and risk approximations. We can produce fast analytics in models too complex for closed form solutions, extract the risk factors of complex transactions and trading books, and effectively compute risk management metrics like reports across a large number of scenarios, backtesting and simulation of hedge strategies, or regulations like XVA, CCR, FRTB or SIMM-MVA.

TensorFlow implementation is available on





Standard ML trains neural networks (NN) and other supervised ML models on punctual examples, whereas differential ML teaches them the shape of the target function from the differentials of training labels wrt training inputs. The result is a vastly improved performance, especially in high dimension with small datasets, as we illustrate with numerical examples from both idealized and real-world contexts in Section 2.

We focus on deep learning in the main text, where the simple mathematical structure of neural networks simplifies the exposition. In the appendices, we generalize the ideas to other kinds of ML models, like classic regression or principal component analysis (PCA), with equally remarkable results.

We posted a TensorFlow implementation on GitHub. The notebooks run on Google Colab, reproduce some of our numerical examples and discuss many practical implementation details not covered in the text.

We could not have achieved these results without the contribution and commitment of Danske Bank’s Ove Scavenius and our colleagues from Superfly Analytics, the Bank’s quantitative research department. The advanced numerical results of Section 2 were computed with Danske Bank’s production risk management system. The authors also thank Bruno Dupire, Jesper Andreasen and Leif Andersen for many insightful discussions, suggestions and comments, resulting in a considerable improvement of the contents.

Pricing approximation and machine learning

Pricing function approximation is critical for Derivatives risk management, where the value and risk of transactions and portfolios must be computed rapidly. Exact closed-form formulas a la Black and Scholes are only available for simple instruments and simple models. More realistic stochastic models and more complicated exotic transactions require numerical pricing by finite difference methods (FDM) or Monte-Carlo (MC), which is too slow for many practical applications. Researchers experimented with e.g. moment matching approximations for Asian and Basket options, or Taylor expansions for stochastic volatility models, as early as the 1980s. Iconic expansion results were derived in the 1990s, including Hagan’s SABR formula [10] or Musiela’s swaption pricing formula in the Libor Market Model [4], and allowed the deployment of sophisticated models on trading desks. New results are being published regularly, either in traditional form [1] [3] or by application of advances in machine learning.

Although pricing approximations were traditionally derived by hand, automated techniques borrowed from the fields of artificial intelligence (AI) and ML have gained traction in recent years. The general format is classic supervised learning: approximate asset pricing functions $f(x)$ of a set of inputs $x$ (market variables, path-dependencies, model and instrument parameters), with a function $\hat{f}(x; w)$ subject to a collection of adjustable weights $w$, learned from a training set of $m$ examples of inputs $x^{(i)}$ (each a vector of dimension $n$) paired with labels $y^{(i)}$ (typically real numbers), by minimization of a cost function (often the mean squared error between predictions and labels).

For example, the recent [16] and [12] trained neural networks to price European calls, respectively in the SABR model and the ’rough’ volatility family of models [7]. The training sets included a vast number of examples, labeled by ground truth prices, computed by numerical methods. This approach essentially interpolates prices in parameter space. The computation of the training set takes considerable time and computational expense. The approximation is trained offline, also at a significant computation cost, but the trained model may be reused in many different situations. Like Hagan or Musiela’s expansions in their time, effective ML approximations make sophisticated models like rough Heston practically usable e.g. to simultaneously fit SP500 and VIX smiles [8].

ML models generally learn approximations from training data alone, without additional knowledge of the generative simulation model or financial instrument. Although performance may be considerably improved on a case by case basis with contextual information such as the nature of the transaction, the most powerful and most widely applicable ML implementations achieve accurate approximations from data alone. Neural networks, in particular, are capable of learning accurate approximations from data, as seen in [16] and [12] among many others. A trained NN computes prices and risks with near analytic speed. Inference is as fast as a few matrix by vector products in limited dimension, and differentiation is performed in similar time by backpropagation.

Online approximation with sampled payoffs

While it is the risk management of large Derivatives books that initially motivated the development of pricing approximations, they also found a major application in the context of regulations like XVA, CCR, FRTB or SIMM-MVA, where the values and risk sensitivities of Derivatives trading books are repeatedly computed in many different market states. An effective pricing approximation could execute the repeated computations orders of magnitude faster and resolve the considerable bottlenecks of these computations.

However, the offline approach of [16] or [12] is not viable in this context. Here, we learn the value of a given trading book as a function of market state. The learned function is used in a set of risk reports or capital calculations and is not reusable in other contexts. Such disposable approximations are trained online, i.e. as a part of the risk computation, so training must be performed quickly and automatically. In particular, we cannot afford the computational complexity of numerical ground truth prices.

It is much more efficient to train approximations on sampled payoffs in place of ground truth prices, as in the classic Least Squares Method (LSM) of [15] and [5]. In this context, a training label is a payoff simulated on one Monte-Carlo path conditional on the corresponding input. The entire training set is simulated for a cost comparable to one pricing by Monte-Carlo, and labels remain unbiased (but noisy) estimates of ground truth prices (since prices are expected payoffs).

More formally, a training set of sampled payoffs consists in $m$ independent realizations $(x^{(i)}, y^{(i)})$ of the random variables $(X, Y)$, where $X$ is the initial state and $Y$ is the final payoff. Informally:

$$f(x) = E\left[Y \mid X = x\right], \qquad f = \arg\min_{h} E\left[\left(Y - h(X)\right)^2\right]$$

Hence, universal approximations like neural networks, trained on datasets of sampled payoffs by minimization of the mean squared error (MSE), converge to the correct pricing function. The initial state $X$ is sampled over the domain of application for the approximation $\hat{f}$, whereas the final payoff $Y$ is sampled with a conditional MC path. See Appx 1 for a more detailed formal exposition.
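As a minimal sketch of learning from sampled payoffs, the following NumPy code regresses one-path call payoffs on the initial spot and compares the fitted function with the closed-form Black and Scholes price. The setup is an assumption made for illustration only: zero rates, a degree-5 polynomial regression standing in for a neural network, and hypothetical parameter values and function names (`bs_call`, `sample_payoffs`).

```python
import math
import numpy as np

def bs_call(spot, strike, vol, t):
    """Black-Scholes call price (zero rates and dividends), used only as a reference."""
    spot = np.asarray(spot, dtype=float)
    d1 = (np.log(spot / strike) + 0.5 * vol * vol * t) / (vol * math.sqrt(t))
    d2 = d1 - vol * math.sqrt(t)
    erf = np.vectorize(math.erf)
    ncdf = lambda d: 0.5 * (1.0 + erf(d / math.sqrt(2.0)))
    return spot * ncdf(d1) - strike * ncdf(d2)

def sample_payoffs(m, strike=100.0, vol=0.2, t1=1.0, t2=2.0, seed=0):
    """One initial state and one conditional path (hence one noisy payoff) per example."""
    rng = np.random.default_rng(seed)
    # Initial states X: spot at T1, sampled over the domain of application.
    s1 = 100.0 * np.exp(-0.5 * vol**2 * t1 + vol * math.sqrt(t1) * rng.standard_normal(m))
    # Final payoffs Y: a single path from T1 to T2 for each state.
    dt = t2 - t1
    s2 = s1 * np.exp(-0.5 * vol**2 * dt + vol * math.sqrt(dt) * rng.standard_normal(m))
    return s1, np.maximum(s2 - strike, 0.0)

x, y = sample_payoffs(100_000)

# Least-squares fit of payoffs on a polynomial of the (rescaled) state.
coeffs = np.polynomial.polynomial.polyfit(x / 100.0, y, 5)

# The fitted function approximates the price at T1 of the call expiring at T2.
x_test = np.linspace(80.0, 120.0, 9)
approx = np.polynomial.polynomial.polyval(x_test / 100.0, coeffs)
exact = bs_call(x_test, 100.0, 0.2, 1.0)
max_err = float(np.max(np.abs(approx - exact)))
print(f"max abs error vs Black-Scholes: {max_err:.3f}")
```

No ground truth price enters the training set: the regression recovers an approximation of the conditional expectation from noisy payoffs alone.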

NN approximate prices more effectively than classic linear models. Neural networks are resilient in high dimension and effectively resolve the long standing curse of dimensionality by learning regression features from data. The extension of LSM to deep learning was explored in many recent works like [14], with the evidence of a considerable improvement, in the context of Bermudan options, although the conclusions carry over to arbitrary schedules of cash-flows. We further investigate the relationship of NN to linear regression in Appx 4.

Training with derivatives

We found, in agreement with recent literature, that the performance of modern deep learning remains insufficient for online application with complex transactions or trading books. A vast number of training examples (often in the hundreds of thousands or millions) is necessary to learn accurate approximations, and even a training set of sample payoffs cannot be simulated in reasonable time. Training on noisy payoffs is prone to overfitting, and unrealistic dataset sizes are necessary even in the presence of classic regularization. In addition, risk sensitivities converge considerably slower than values and often remain too approximate even with training sets in the hundreds of thousands of examples.

This article proposes to resolve these problems by training ML models on datasets augmented with differentials of labels wrt inputs:

$$\left(x^{(i)},\; y^{(i)},\; \frac{\partial y^{(i)}}{\partial x^{(i)}}\right)$$

This is a somewhat natural idea, which, along with the adequate training algorithm, enables ML models to learn accurate approximations even from small datasets of noisy payoffs, making ML approximations tractable in the context of trading books and regulations.

When learning from ground truth labels, the input $x^{(i)}$ is one example parameter set of the pricing function. If we were learning Black and Scholes’ pricing function, for instance (without using the formula, which is what we would be trying to approximate), $x^{(i)}$ would be one possible set of values for the initial spot price, volatility, strike and expiry (ignoring rates or dividends). The label $y^{(i)}$ would be the (ground truth) call price computed with these inputs (by MC or FDM since we don’t know the formula), and the derivatives labels $\partial y^{(i)} / \partial x^{(i)}$ would be the Greeks.

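When learning from simulated payoffs, the input $x^{(i)}$ is one example state. In the Black and Scholes example, $x^{(i)}$ would be the spot price $S_{T_1}^{(i)}$ sampled on some present or future date $T_1 \geq 0$, called exposure date in the context of regulations, or horizon date in other contexts. The label $y^{(i)}$ would be the payoff $\left(S_{T_2}^{(i)} - K\right)^+$ of a call expiring on a later date $T_2$, sampled on that same path number $i$. The exercise is to learn a function of $S_{T_1}$ approximating the value of the $T_2$ call measured at $T_1$. In this case, the differential labels $\partial y^{(i)} / \partial x^{(i)}$ are the pathwise derivatives of the payoff at $T_2$ wrt the state at $T_1$ on path number $i$. In Black and Scholes:

$$\frac{\partial y^{(i)}}{\partial x^{(i)}} = \frac{\partial \left(S_{T_2}^{(i)} - K\right)^+}{\partial S_{T_2}^{(i)}} \, \frac{\partial S_{T_2}^{(i)}}{\partial S_{T_1}^{(i)}} = 1_{\left\{S_{T_2}^{(i)} > K\right\}} \, \frac{S_{T_2}^{(i)}}{S_{T_1}^{(i)}}$$
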
This simple exercise exhibits some general properties of pathwise differentials. First, we computed the Black and Scholes pathwise derivative analytically with an application of the chain rule. The resulting formula is computationally efficient: the derivative is computed together with the payoff along the path, and there is no need to regenerate the path, unlike e.g. differentiation by finite differences. This efficacy is not limited to European calls in Black and Scholes: pathwise differentials are always efficiently computable by a systematic application of the chain rule, also known as adjoint differentiation or AD. Furthermore, automated implementations of AD, or AAD, perform those computations by themselves, behind the scenes.

Secondly, $\partial y / \partial x$ is a $T_2$-measurable random variable, and its $T_1$-expectation is $N(d_1)$, the Black and Scholes delta. This property too is general: assuming appropriate smoothing of discontinuous cash-flows, expectation and differentiation commute so risk sensitivities are expectations of pathwise differentials. Turning it upside down, pathwise differentials are unbiased (noisy) estimates of ground truth Greeks.
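The two properties above can be checked numerically. The sketch below, under assumed parameter values (zero rates, volatility 0.2, strike 110, dates 1 and 2) and with hypothetical function names, computes the payoff and its pathwise derivative on the same path, then compares the average pathwise derivative with the closed-form delta.

```python
import math
import numpy as np

def pathwise_call_sample(s1, strike, vol, t1, t2, z):
    """Payoff and pathwise derivative wrt s1, one Black-Scholes path per entry of z."""
    dt = t2 - t1
    growth = np.exp(-0.5 * vol * vol * dt + vol * math.sqrt(dt) * z)  # S_T2 / S_T1
    s2 = s1 * growth
    payoff = np.maximum(s2 - strike, 0.0)
    # Chain rule: d payoff / d S_T2 = 1{S_T2 > K}, and d S_T2 / d S_T1 = S_T2 / S_T1.
    dpayoff = (s2 > strike) * growth
    return payoff, dpayoff

def bs_delta(s1, strike, vol, t):
    """Closed-form Black-Scholes delta N(d1) with zero rates, for reference."""
    d1 = (math.log(s1 / strike) + 0.5 * vol * vol * t) / (vol * math.sqrt(t))
    return 0.5 * (1.0 + math.erf(d1 / math.sqrt(2.0)))

rng = np.random.default_rng(42)
z = rng.standard_normal(400_000)
payoff, dpayoff = pathwise_call_sample(100.0, 110.0, 0.2, 1.0, 2.0, z)

mc_delta = float(dpayoff.mean())               # average of pathwise differentials
exact_delta = bs_delta(100.0, 110.0, 0.2, 1.0)  # time to expiry is T2 - T1 = 1
print(f"pathwise estimate {mc_delta:.4f}, closed form {exact_delta:.4f}")
```

The derivative reuses the random numbers of the payoff path, so it comes at almost no additional cost and requires no bumped resimulation.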

Therefore, we can compute pathwise differentials efficiently and use them for training as unbiased estimates of ground truth risks, irrespective of the transaction or trading book, and irrespective of the stochastic simulation model. Learning from ground truth labels is slow, but the learned function is reusable in many contexts. This is the correct manner to learn e.g. European option pricing functions in stochastic volatility models. Learning from simulated payoffs is fast, but the learned approximation is a function of the state, specific to a given financial instrument or trading book, under a given calibration of the stochastic model. This is how we can quickly approximate the value and risks of complex transactions and trading books, e.g. in the context of regulations. Differential labels vastly improve performance in both cases, as we see next.

Classic numerical analysis applies differentials as constraints in the context of interpolation, or penalties in the context of regularization. Regularization generally penalises the norm of differentials, e.g. the size of second order differentials, expressing a preference for linear functions. Our proposition is different. We do not express preferences, we enforce differential correctness, measured by proximity of predicted risk sensitivities to differential labels. An application of differential labels was independently proposed in [6], in the context of high dimensional semi-linear partial differential equations. Our algorithm is general. It applies to either ground truth learning (closely related to interpolation) or sample learning (related to regression). It consumes derivative sensitivities for ground truth learning or pathwise differentials for sample learning. It relies on an effective computation of the differential labels, achieved with automatic adjoint differentiation (AAD).

Effective differential labels with AAD

Differential ML consumes the differential labels from an augmented training set. The differentials must be accurate or the optimizer might get lost chasing wrong targets, and they must be computed quickly, even in high dimension, for the method to be applicable in realistic contexts. Conventional differentiation algorithms like finite differences fail on both counts. This is where the superior AAD algorithm steps in, and automatically computes the differentials of arbitrary calculations, with analytic accuracy, for a computation cost proportional to one evaluation of the price, irrespective of dimension.
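To make the idea concrete, here is a deliberately tiny reverse-mode differentiation sketch in plain Python. It is an illustration only, not the AAD implementation used in this study: the `Var` class, the `exp` helper and the `backward` function are hypothetical names, and production AAD libraries are considerably more elaborate. The calculation is recorded once while it is evaluated, then a single reverse sweep produces the derivatives wrt all inputs, whereas finite differences would need one extra evaluation per input.

```python
import math

class Var:
    """Node of the computation graph recorded during evaluation (a minimal tape)."""
    def __init__(self, value, parents=()):
        self.value = value
        self.parents = parents   # tuples (parent node, local partial derivative)
        self.adjoint = 0.0

    def __add__(self, other):
        other = other if isinstance(other, Var) else Var(other)
        return Var(self.value + other.value, ((self, 1.0), (other, 1.0)))
    __radd__ = __add__

    def __mul__(self, other):
        other = other if isinstance(other, Var) else Var(other)
        return Var(self.value * other.value,
                   ((self, other.value), (other, self.value)))
    __rmul__ = __mul__

def exp(x):
    v = math.exp(x.value)
    return Var(v, ((x, v),))

def backward(output):
    """One reverse sweep: propagates adjoints from the output to every input."""
    order, seen = [], set()
    def visit(node):
        if id(node) not in seen:
            seen.add(id(node))
            for parent, _ in node.parents:
                visit(parent)
            order.append(node)       # parents are listed before their consumers
    visit(output)
    output.adjoint = 1.0
    for node in reversed(order):     # consumers are processed before their parents
        for parent, partial in node.parents:
            parent.adjoint += node.adjoint * partial

# Example: f(x) = x_0 * exp(sum_i a_i x_i) with five inputs.
xs = [Var(0.1 * (i + 1)) for i in range(5)]
s = Var(0.0)
for i, x in enumerate(xs):
    s = s + (0.1 * (i + 1)) * x
f = xs[0] * exp(s)
backward(f)
grads = [x.adjoint for x in xs]      # all five derivatives from one sweep
print(grads)
```

The cost of the reverse sweep is proportional to the number of recorded operations, not to the number of inputs, which is the property that makes differential labels affordable in high dimension.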

AAD was introduced to finance in the ground breaking ’Smoking Adjoints’ [9]. It is closely related to backpropagation, which powers modern deep learning and has largely contributed to its recent success. In finance, AAD produces risk reports in real time, including for exotic books or XVA. In the context of Monte-Carlo or LSM, AAD produces exact pathwise differentials for a very small cost. AAD made differentials massively available in quantitative finance. Besides evident applications to instantaneous calibration or real-time risk reports, the vast amount of information contained in differentials may be leveraged in creative ways, see e.g. [17] for an original application.

To a large extent, differential ML is another strong application of AAD. For reasons of memory and computation efficiency, AAD always computes differentials path by path when applied with Monte-Carlo, effectively estimating risk sensitivities in a vast number of different scenarios. Besides its formidable speed and accuracy, AAD therefore produces a massive amount of information. Risk reports correspond to average sensitivities across paths, they only provide a much flattened view of the pathwise differential information. Differential ML, on the other hand, leverages its full extent in order to learn value and risk, not as fixed numbers only relevant in the current state, but as functions of state capable of computing prices and Greeks very quickly in different market scenarios.

In the interest of brevity, we refer to [20] for a comprehensive description of AAD, including all details of how training differentials were obtained in this study, or the video tutorial [21], which explains its main ideas in 15 minutes.

The main article is voluntarily kept rather concise. Practical implementation details are deferred to the online notebook, and mathematical formalism is treated in the appendices along with generalizations and extensions. We present differential ML in Section 1 in the context of feedforward neural networks, numerical results in Section 2 and important extensions in Section 3. Appx 1 deploys the mathematical formalism of the machinery. Appx 2 introduces differential PCA and Appx 3 applies differential ML as a superior regularization in the context of classic linear regression. Appx 4 discusses neural architectures and asymptotic control algorithms with convergence guarantees necessary for online operation.

Chapter 1 Differential Machine Learning

This section describes differential training in the context of feedforward neural networks, although everything carries over to NN of arbitrary complexity in a straightforward manner. At this stage, we assume the availability of a training set augmented with differential labels. The dataset consists of arbitrary schedules of cash-flows simulated in an arbitrary stochastic model. Because we learn from simulated data alone, there are no restrictions on the sophistication of the model or the complexity of the cash-flows. The cash-flows of the transaction or trading book could be described with a general scripting language, and the model could be a hybrid ’model of everything’ often used for e.g. XVA computations, with dynamic parameters calibrated to current market data.

The text focuses on a mathematical and qualitative description of the algorithm, leaving the discussion of practical implementation to the online notebook, along with TensorFlow implementation code.

1.1 Notations

1.1.1 Feedforward equations

Let us first introduce notations for the description of feedforward networks. Define the input (row) vector $x \in \mathbb{R}^n$ and the predicted value $y \in \mathbb{R}$. For every layer $l = 1, \dots, L$ in the network, define a scalar ’activation’ function $g_{l-1}: \mathbb{R} \to \mathbb{R}$. Popular choices are relu, elu and softplus, with the convention that $g_0(x) = x$ is the identity. The notation $g_{l-1}(x)$ denotes elementwise application. We denote $w_l \in \mathbb{R}^{n_{l-1} \times n_l}$ and $b_l \in \mathbb{R}^{n_l}$ the weights and biases of layer $l$.

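The network is defined by its feedforward equations:

$$\begin{aligned} z_0 &= x \\ z_l &= g_{l-1}(z_{l-1})\, w_l + b_l, \quad l = 1, \dots, L \\ y &= z_L \end{aligned} \tag{1.1}$$

where $z_l \in \mathbb{R}^{n_l}$ is the row vector containing the $n_l$ pre-activation values, also called units or neurons, in layer $l$. Figure 1.1 illustrates a feedforward network together with backpropagation.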

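1.1.2 Backpropagation

Feedforward networks are efficiently differentiated by backpropagation, which is generally applied to compute the derivatives of some cost function wrt the weights and biases for optimization. For now, we are not interested in those differentials, but in the differentials of the predicted value $y = z_L$ wrt the inputs $x = z_0$. Recall that inputs are states and predictions are prices, hence, these differentials are predicted risk sensitivities (Greeks), obtained by differentiation of the lines in (1.1), in the reverse order:

$$\begin{aligned} \bar{z}_L &= \bar{y} = 1 \\ \bar{z}_{l-1} &= \left(\bar{z}_l\, w_l^T\right) \circ g'_{l-1}(z_{l-1}), \quad l = L, \dots, 1 \\ \bar{x} &= \bar{z}_0 \end{aligned} \tag{1.2}$$

with the adjoint notation $\bar{x} = \partial y / \partial x$, $\bar{z}_l = \partial y / \partial z_l$, $\bar{y} = \partial y / \partial y = 1$, and where $\circ$ is the elementwise (Hadamard) product.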

Figure 1.1: feedforward neural network with backpropagation

Notice the similarity between (1.1) and (1.2). In fact, backpropagation defines a second feedforward network with inputs $\bar{y}, z_0, \dots, z_L$ and output $\bar{x} \in \mathbb{R}^n$, where the weights are shared with the first network and the units in the second network are the adjoints of the corresponding units in the original network.

Backpropagation is easily generalized to arbitrary network architectures, as explained in deep learning literature. Generalized to arbitrary computations unrelated to deep learning or AI, backpropagation becomes AD, or AAD when implemented automatically. Modern frameworks like TensorFlow include an implementation of backpropagation/AAD and implicitly invoke it in training loops.
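The feedforward pass (1.1) and the backpropagation to inputs (1.2) can be sketched in a few lines of NumPy. This is an illustrative sketch, not the notebook implementation: layer sizes and random weights are arbitrary assumptions, softplus is used for every hidden activation, and the function names are hypothetical. The result is checked against central finite differences.

```python
import numpy as np

def softplus(z):
    return np.logaddexp(0.0, z)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))   # derivative of softplus

def feedforward(x, weights, biases):
    """Feedforward equations: returns the prediction and all pre-activations z_0..z_L."""
    zs = [x]
    for k, (w, b) in enumerate(zip(weights, biases)):
        a = zs[-1] if k == 0 else softplus(zs[-1])   # the first activation is the identity
        zs.append(a @ w + b)
    return zs[-1], zs

def backprop_inputs(zs, weights):
    """Backpropagation equations: adjoints propagated from the output back to the inputs."""
    zbar = np.ones_like(zs[-1])                      # adjoint of the output is 1
    for k in range(len(weights) - 1, -1, -1):
        zbar = zbar @ weights[k].T
        if k > 0:
            zbar = zbar * sigmoid(zs[k])             # elementwise product with g'(z_k)
    return zbar

rng = np.random.default_rng(0)
sizes = [3, 5, 3, 1]                                 # arbitrary illustrative architecture
weights = [rng.standard_normal((a, b)) / np.sqrt(a) for a, b in zip(sizes[:-1], sizes[1:])]
biases = [0.1 * rng.standard_normal(b) for b in sizes[1:]]

x = rng.standard_normal((4, 3))                      # four examples in rows
y, zs = feedforward(x, weights, biases)
xbar = backprop_inputs(zs, weights)                  # predicted differentials, shape (4, 3)

# Check against central finite differences, one input dimension at a time.
eps = 1e-6
fd = np.zeros_like(x)
for j in range(x.shape[1]):
    bump = np.zeros(x.shape[1])
    bump[j] = eps
    up = feedforward(x + bump, weights, biases)[0]
    down = feedforward(x - bump, weights, biases)[0]
    fd[:, j] = ((up - down) / (2.0 * eps))[:, 0]

print(np.max(np.abs(xbar - fd)))
```

The backward pass reuses the pre-activations stored by the forward pass and performs one matrix product per layer, so all input sensitivities come for roughly the cost of a second forward pass.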

1.2 Twin networks

We can combine feedforward (1.1) and backpropagation (1.2) equations into a single network representation, or twin network, corresponding to the computation of a prediction (approximate price) together with its differentials wrt inputs (approximate risk sensitivities).

The first half of the twin network (Figure 1.2) is the original network, traversed with feedforward induction to predict a value. The second half is computed with the backpropagation equations to predict risk sensitivities. It is the mirror image of the first half, with shared connection weights.

A mathematical description of the twin network is simply obtained by concatenation of equations (1.1) and (1.2). The evaluation of the twin network returns a predicted value $y$, and its differentials $\bar{x}$ wrt the $n$ inputs $x$. The combined computation evaluates a feedforward network of twice the initial depth. Like feedforward induction, backpropagation computes a sequence of matrix by vector products. The twin network, therefore, predicts prices and risk sensitivities for twice the computation complexity of value prediction alone, irrespective of the number of risks. Hence, a trained twin net approximates prices and risk sensitivities, wrt potentially many states, in a particularly efficient manner. Note from (1.2) that the units of the second half are activated with the differentials $g'_l$ of the original activations $g_l$. If we are going to backpropagate through the twin network, we need continuous activation throughout. Hence, the initial activation must be $C^1$, ruling out, e.g. ReLU.

Figure 1.2: twin network

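1.2.1 Training with differential labels

The purpose of the twin network is to estimate the correct pricing function $f(x)$ by an approximate function $\hat{f}(x; \{w_l, b_l\}_{l=1,\dots,L})$. It learns optimal weights and biases from an augmented training set $\{x^{(i)}, y^{(i)}, \bar{x}^{(i)}\}_{i=1,\dots,m}$, where $\bar{x}^{(i)} = \partial y^{(i)} / \partial x^{(i)}$ are the differential labels.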

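Here, we describe the mechanics of differential training and discuss its effectiveness. As is customary with ML, we stack training data in matrices, with examples in rows and units in columns:

$$X = \begin{bmatrix} x^{(1)} \\ \vdots \\ x^{(m)} \end{bmatrix} \in \mathbb{R}^{m \times n}, \quad Y = \begin{bmatrix} y^{(1)} \\ \vdots \\ y^{(m)} \end{bmatrix} \in \mathbb{R}^{m}, \quad \bar{X} = \begin{bmatrix} \bar{x}^{(1)} \\ \vdots \\ \bar{x}^{(m)} \end{bmatrix} \in \mathbb{R}^{m \times n}$$

Notice that equations (1.1) and (1.2) identically apply to matrices or row vectors. Hence, the evaluation of the twin network computes the matrices:

$$Z_l \in \mathbb{R}^{m \times n_l} \quad \text{and} \quad \bar{Z}_l = \frac{\partial Y}{\partial Z_l} \in \mathbb{R}^{m \times n_l}$$

respectively in the first and second half of its structure. Training consists in finding weights and biases minimizing some cost function $C$: $\{w_l, b_l\}_{l=1,\dots,L} = \arg\min C\left(\{w_l, b_l\}_{l=1,\dots,L}\right)$.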

Classic training with payoffs alone

Let us first recall classic deep learning. We have seen that the approximation obtained by global minimization of the MSE converges to the correct pricing function (modulo finite capacity bias), hence:

$$C = MSE = \frac{1}{m} \left(Z_L - Y\right)^T \left(Z_L - Y\right)$$

The second half of the twin network does not affect cost, hence, training is performed by backpropagation through the standard feedforward network alone. The many practical details of the optimization are covered in the online notebook.

Differential training with differentials alone

Let us change gears and train with pathwise differentials $\bar{X}$ instead of payoffs $Y$, by minimization of the MSE (denoted $\overline{MSE}$) between the differential labels (pathwise differentials) and predicted differentials (estimated risk sensitivities):

$$C = \overline{MSE} = \frac{1}{m} \operatorname{tr}\left[\left(\bar{Z}_0 - \bar{X}\right)^T \left(\bar{Z}_0 - \bar{X}\right)\right]$$

Here, we must evaluate the twin network in full to compute $\bar{Z}_0$, effectively doubling the cost of training. Gradient-based methods minimize $\overline{MSE}$ by backpropagation through the twin network, effectively accumulating second-order differentials in its second half. A deep learning framework, like TensorFlow, performs this computation seamlessly. As we have seen, the second half of the twin network may represent backpropagation but, in the end, it is just another sequence of matrix operations, easily differentiated by another round of backpropagation, carried out silently, behind the scenes. The implementation in the demonstration notebook is identical to training with payoffs, save for the definition of the cost function. TensorFlow automatically invokes the necessary operations, evaluating the feedforward network when minimizing $MSE$ and the twin network when minimizing $\overline{MSE}$.

In practice, we must also assign appropriate weights to the costs of wrong differentials in the definition of the $\overline{MSE}$. This is discussed in the implementation notebook, and in more detail in Appx 2.

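Let us now discuss what it means to train approximations by minimization of the $\overline{MSE}$ between pathwise differentials $\bar{x}^{(i)} = \partial y^{(i)} / \partial x^{(i)}$ and predicted risks $\partial \hat{f}(x^{(i)}) / \partial x^{(i)}$. Given appropriate smoothing, expectation and differentiation commute so the (true) risk sensitivities are expectations of pathwise differentials:

$$\frac{\partial f(x)}{\partial x} = \frac{\partial E\left[Y \mid X = x\right]}{\partial x} = E\left[\frac{\partial Y}{\partial X} \,\middle|\, X = x\right]$$

It follows that pathwise differentials are unbiased estimates of risk sensitivities, and approximations trained by minimization of the $\overline{MSE}$ converge (modulo finite capacity bias) to a function with correct differentials, hence, the right pricing function, modulo an additive constant.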

Therefore, we can choose to train by minimization of value or derivative errors, and converge near the correct pricing function all the same. This consideration is, however, an asymptotic one. Training with differentials converges near the same approximation, but it converges much faster, allowing us to train accurate approximations with much smaller datasets, as we see in the numerical examples, because:

The effective size of the dataset is much larger: evidently, with $m$ training examples we have $nm$ differentials ($n$ being the dimension of the inputs $x^{(i)}$). With AAD, we effectively simulate a much larger dataset for a minimal additional cost, especially in high dimension (where classical training struggles most).

The neural net picks up the shape of the pricing function, learning from slopes rather than points, resulting in much more stable and potent learning, even with few examples.

The neural approximation learns to produce correct Greeks by construction, not only correct values. By learning the correct shape, the ML approximation also correctly orders values in different scenarios, which is critical in applications like value at risk (VAR) or expected loss (EL), including for FRTB.

Differentials act as an effective, bias-free regularization, as we see next.

Differential training with everything

The best numerical results are obtained in practice by combining values and derivatives errors in the cost function:

$$C = MSE + \lambda\, \overline{MSE}$$

which is the one implemented in the demonstration notebook, with the two previous strategies as particular cases. Notice the similarity with classic regularization of the form $C = MSE + \lambda \cdot \text{penalty}$. Ridge (Tikhonov) and Lasso regularizations impose a penalty for large weights (respectively in $L^2$ and $L^1$ metrics), effectively preventing overfitting small datasets by stopping attempts to fit noisy labels. In return, classic regularization reduces the effective capacity of the model and introduces a bias, along with a strong dependency on the hyperparameter $\lambda$. This hyperparameter controls regularization strength and tunes the vastly documented bias-variance tradeoff. If one sets $\lambda$ too high, the trained approximation ends up a horizontal line.

Differential training also stops attempts to fit noisy labels, with a penalty for wrong differentials. It is, therefore, a form of regularization, but a very different kind. It doesn’t introduce bias, since we have seen that training on differentials alone converges to the correct approximation too. This breed of regularization comes without bias-variance tradeoff. It reduces variance for free. Increasing $\lambda$ hardly affects results in practice.

Differential regularization is more similar to data augmentation in computer vision, which is, in turn, a more powerful regularization. Differentials are additional training data. Like data augmentation, differential regularization reduces variance by increasing the size of the dataset for little cost. Differentials are new data of a different kind, and they share inputs with existing data, but they reduce variance all the same, without introducing bias.
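The combined cost can be illustrated without a neural network. The sketch below swaps in a polynomial least-squares regression for the twin network, in a one-dimensional Bachelier setup with assumed parameters (volatility 20, strike 100, dates 1 and 2). The weight `lam` on derivative errors and its normalization by the ratio of mean squared payoffs to mean squared differentials are assumptions of this sketch, as are all function names. Setting `lam` to zero recovers classic regression on payoffs alone.

```python
import math
import numpy as np

def bachelier_call(spot, strike, std):
    """Closed-form Bachelier price, used only as a reference."""
    d = (np.asarray(spot, dtype=float) - strike) / std
    cdf = 0.5 * (1.0 + np.vectorize(math.erf)(d / math.sqrt(2.0)))
    pdf = np.exp(-0.5 * d * d) / math.sqrt(2.0 * math.pi)
    return std * (d * cdf + pdf)

def simulate(m, rng, strike=100.0, vol=20.0):
    s1 = 100.0 + vol * rng.standard_normal(m)   # state at T1 = 1
    s2 = s1 + vol * rng.standard_normal(m)      # one path from T1 to T2 = 2
    payoff = np.maximum(s2 - strike, 0.0)
    dpayoff = (s2 > strike).astype(float)       # pathwise derivative wrt s1
    return s1, payoff, dpayoff

def basis(s, degree=5):
    """Monomials of the rescaled state and their derivatives wrt the state."""
    v = (np.asarray(s, dtype=float) - 100.0) / 20.0
    phi = np.vander(v, degree + 1, increasing=True)
    dphi = np.zeros_like(phi)
    dphi[:, 1:] = phi[:, :-1] * np.arange(1, degree + 1) / 20.0
    return phi, dphi

def fit(s1, payoff, dpayoff, lam):
    """Least squares on value errors plus lam times (normalized) derivative errors."""
    phi, dphi = basis(s1)
    scale = math.sqrt(lam * np.mean(payoff**2) / np.mean(dpayoff**2))
    a = np.vstack([phi, scale * dphi])
    b = np.concatenate([payoff, scale * dpayoff])
    w, *_ = np.linalg.lstsq(a, b, rcond=None)
    return w

rng = np.random.default_rng(1)
s1, payoff, dpayoff = simulate(1024, rng)

s_test = np.linspace(80.0, 120.0, 9)
exact = bachelier_call(s_test, 100.0, 20.0)
phi_test, _ = basis(s_test)

err_classic = float(np.max(np.abs(phi_test @ fit(s1, payoff, dpayoff, 0.0) - exact)))
err_diff = float(np.max(np.abs(phi_test @ fit(s1, payoff, dpayoff, 1.0) - exact)))
print(f"values only: {err_classic:.3f}, values and differentials: {err_diff:.3f}")
```

The differential labels enter as extra rows of the regression, sharing inputs with the payoff rows, which mirrors the view of differentials as additional training data rather than as a penalty on the weights.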

Chapter 2 Numerical results

Let us now review some numerical results and compare the performance of differential and conventional ML. We picked three examples from relevant textbook and real-world situations, where neural networks learn pricing and risk approximations from small datasets.

We kept neural architecture constant in all the examples, with four hidden layers of 20 softplus-activated units. We train neural networks on mini-batches of normalized data, with the ADAM optimizer and a one-cycle learning rate schedule. The demonstration notebook and appendices discuss all the details. A differential training set takes 2-5 times longer to simulate with AAD, and twin nets take twice as long to train as standard ones. In return, we are going to see that differential ML performs up to thousandfold better on small datasets.

2.1 Basket options

The first (textbook) example is a basket option in a correlated Bachelier model for seven assets:

$$dS_t = \sigma\, dW_t$$

where $S_t \in \mathbb{R}^7$ and $dW_t^j\, dW_t^k = \rho_{jk}\, dt$. The task is to learn the pricing function of a 1y call option on a basket, with strike 110 (we normalized asset prices at 100 without loss of generality and basket weights sum to 1). The basket price is also Gaussian in this model; hence, Bachelier’s formula gives the correct price. This example is also of particular interest because, although the input space is seven-dimensional, we know from the mathematics that actual pricing is one-dimensional. Can the network learn this property from data?
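A sketch of how payoffs and pathwise differentials could be generated for this example is given below. It is not the notebook code: the random correlation matrix, the volatility of 20 per asset, equal basket weights and the function names are assumptions chosen for illustration. To allow a comparison with Bachelier’s formula, all paths here start from the same state, whereas a training set would sample one initial state per example.

```python
import math
import numpy as np

def random_covariance(n, rng, vol=20.0):
    """Random correlation structure with the same volatility for every asset."""
    g = rng.standard_normal((n, 2 * n))
    c = g @ g.T
    d = np.sqrt(np.diag(c))
    return vol * vol * c / np.outer(d, d)

def simulate_basket(s1, weights, cov, strike, horizon, rng):
    """s1 has shape (m, n). Returns payoffs (m,) and pathwise differentials (m, n)."""
    chol = np.linalg.cholesky(cov * horizon)
    s2 = s1 + rng.standard_normal(s1.shape) @ chol.T      # correlated Gaussian moves
    basket = s2 @ weights
    payoff = np.maximum(basket - strike, 0.0)
    # d payoff / d s1_j = 1{basket > strike} * weight_j, since d s2 / d s1 is the identity.
    dpayoff = (basket > strike)[:, None] * weights[None, :]
    return payoff, dpayoff

def bachelier_basket(s, weights, cov, strike, horizon):
    """Closed-form price and deltas: the basket is Gaussian in this model."""
    std = math.sqrt(weights @ cov @ weights * horizon)
    d = (s @ weights - strike) / std
    cdf = 0.5 * (1.0 + math.erf(d / math.sqrt(2.0)))
    pdf = math.exp(-0.5 * d * d) / math.sqrt(2.0 * math.pi)
    return std * (d * cdf + pdf), cdf * weights

rng = np.random.default_rng(7)
n, m = 7, 200_000
weights = np.full(n, 1.0 / n)
cov = random_covariance(n, rng)
s0 = np.full(n, 100.0)

payoff, dpayoff = simulate_basket(np.tile(s0, (m, 1)), weights, cov, 110.0, 1.0, rng)
price, deltas = bachelier_basket(s0, weights, cov, 110.0, 1.0)

print(f"MC price {payoff.mean():.4f}, Bachelier {price:.4f}")
print("MC deltas", np.round(dpayoff.mean(axis=0), 4))
print("Bachelier deltas", np.round(deltas, 4))
```

Every pathwise differential is a multiple of the weight vector, which reflects the one-dimensional structure of the problem: the payoff depends on the seven assets only through the basket.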

Figure 2.1: basket option in Bachelier model, dimension 7

We have trained neural networks and predicted values and derivatives in 1024 independent test scenarios, with initial basket values on the horizontal axis and option prices/deltas on the vertical axis (we show one of the seven derivatives), compared with the correct results computed with Bachelier’s formula. We trained networks on 1024 (1k) and 65536 (64k) paths, with cross-validation and early stopping. The twin network with 1k examples performs better than the classical net with 64k examples for values, and a lot better for derivatives. In particular, it learned that the option price and deltas are a fixed function of the basket, as evidenced by the thinness of the approximation curve. The classical network doesn’t learn this property well, even with 64k examples. It overfits training data and predicts different values or deltas for various scenarios on the seven assets with virtually identical baskets.

We also compared test errors with standard MC errors (also with 1k and 64k paths). The main point of pricing approximation is to avoid nested simulations with similar accuracy. We see that the error of the twin network is, indeed, close to MC. Classical deep learning error is an order of magnitude larger. Finally, we trained with eight million samples, and verified that both networks converge to similarly low errors (not zero, due to finite capacity) while MC error converges to zero. The twin network gets there hundreds of times faster.

All those results are reproduced in the online TensorFlow notebook.

2.2 Worst-of autocallables

As a second (real-world) example, we approximate an exotic instrument, a four-underlying version of the popular worst-of autocallable trade, in a more complicated model, a collection of 4 correlated local volatility models a la Dupire:

where . The example is relevant, not only due to popularity, but also, because of the stress path-dependence, barriers and massive final digitals impose on numerical models. Appropriate smoothing was applied so pathwise differentials are well defined.

We do not have a closed form solution for reference, so performance is measured against nested Monte-Carlo simulations (a very slow process). In Figure 2.2, we show prediction results for 128 independent examples, with correct numbers on the horizontal axis, as given by the nested simulations, and predicted results on the vertical axis. Performance is measured by distance to the 45-degree line.

The classical network is trained on 32768 (32k) samples, without derivatives, with cross-validation and early stopping. The twin network is trained on 8192 (8k) samples with pathwise derivatives produced with AAD. Both sets were generated in around 0.4 sec in Superfly, Danske Bank’s proprietary derivatives pricing and risk management system.

Figure 2.2: worst-of-four autocallable with correlated local volatility models

Figure 2.2 shows the results for the value and the delta to the second underlying, together with the script for the instrument, written in Danske Bank’s Jive scripting language. Note that the barriers and the digitals are explicitly smoothed with the keyword ’choose’. It is evident that the twin network with only 8k training data produces a virtually perfect approximation in values, and a decent approximation on deltas. The classical network also approximates values correctly, although not on a straight line, which may cause problems when ordering is critical, e.g. for expected loss or FRTB. Its deltas are essentially random, which rules them out for approximation of risk sensitivities, e.g. for SIMM-MVA.

Absolute standard errors are 1.2 on values and 32.5 on deltas with the classical network trained on 32k examples, against 0.4 and 2.5 respectively with the differential network trained on 8k examples. For comparison, the Monte-Carlo pricing error is 0.2 with 8k paths, similar to the twin net. The error on the classical net, with 4 times the training size, is larger for values and an order of magnitude larger for differentials.

2.3 Derivatives trading books

For the last example, we picked a real netting set from Danske Bank’s portfolio, including single and cross currency swaps and swaptions in 10 different currencies, eligible for XVA, CCR or other regulated computations. Simulations are performed in Danske Bank’s model of everything (the ’Beast’), where interest rates are simulated each with a four-factor version of Andreasen’s take on multi-factor Cheyette [2], and correlated between one another and with forex rates.

This is an important example, because it is representative of how we want to apply twin nets in the real world. In addition, this is a stress test for neural networks. The Markov dimension of the four-factor non-Gaussian Cheyette model is 16 per currency, that is 160 inputs, 169 with forexes, and over 1000 with all the path-dependencies in this real-world book. Of course, the value effectively only depends on a small number of combinations of inputs, something the neural net is supposed to identify. In reality, the extraction of effective risk factors is considerably more effective in the presence of differential labels (see Appx 2), which explains the results in Figure 2.3.

Figure 2.3 shows the values predicted by a twin network trained on 8192 (8k) samples with AAD pathwise derivatives, compared to a vanilla net, trained on 65536 (64k) samples, all simulated in Danske Bank’s XVA system. The difference in performance is evident in the chart. The twin approximation is virtually perfect with only 8k examples. The classical deep approximation is much rougher with 64k examples. As with the previous example, the predicted values for an independent set of 128 examples are shown on the vertical axis, with correct values on the horizontal axis. The ’correct’ values for the chart were produced with nested Monte-Carlo overnight. The entire training process for the twin network (on entry level GPU), including the generation of the 8192 examples (on multithreaded CPU), took a few seconds on a standard workstation.

Figure 2.3: real-world netting set – twin network trained on 8k samples vs classical net trained on 64k samples

We have shown in this figure the predicted values, not derivatives, because we have too many of them, often wrt obscure model parameters like accumulated covariances in Cheyette. For these derivatives to make sense, they must be turned into market risks by application of inverse Jacobian matrices [19], something we skipped in this exercise.

Standard errors are 12.85M with classical 64k and 1.77M with differential 8k, for a range of 200M for the 128 test examples, generated with the calibrated hybrid model. On this example too, twin 8k error is very similar to the Monte-Carlo pricing error (1.70M with 8k paths). Again in this very representative example, the twin network has the same degree of approximation as orders of magnitude slower nested Monte-Carlo.

Chapter 3 Extensions

We have presented algorithms in the context of single value prediction to avoid confusion and heavy notations. To conclude, we discuss two advanced extensions, allowing the network to predict multiple values and higher-order derivatives simultaneously.

3.1 Multiple outputs

One innovation in [12] is to predict call prices of multiple strikes and expiries in a single network, exploiting correlation and shared factors, and encouraging the network to learn global features like no-arbitrage conditions. We can combine our approach with this idea by an extension of the twin network to compute multiple predictions, meaning $n_L > 1$ and $y = z_L \in \mathbb{R}^{n_L}$. The adjoints are no longer well defined as vectors. Instead, we now define them as directional differentials wrt some specified linear combination of the outputs $y c^T$ where $c \in \mathbb{R}^{n_L}$ has the coordinates of the desired direction in a row vector:

$$\bar{x} = \frac{\partial \left[ y c^T \right]}{\partial x}, \qquad \bar{z}_l = \frac{\partial \left[ y c^T \right]}{\partial z_l}$$

Given a direction $c$, all the previous equations apply identically, except that the boundary condition for $\bar{y}$ in the backpropagation equations is no longer the number 1, but the row vector $c$. For example, $c = e_1$ means that adjoints are defined as derivatives of the first output $y_1$. We can repeat this for $c = e_1, \dots, e_{n_L}$ to compute the derivatives of all the outputs wrt all the inputs $\partial y / \partial x \in \mathbb{R}^{n_L \times n_0}$, i.e. the Jacobian matrix. Written in matrix terms, the boundary is the identity matrix $I_{n_L}$ and the backpropagation equations are written as follows:

$$\bar{z}_L = \bar{y} = I_{n_L}, \qquad \bar{z}_{l-1} = \left( \bar{z}_l w_l^T \right) \circ g'_{l-1}(z_{l-1}), \qquad \bar{x} = \bar{z}_0$$

where $\bar{z}_l \in \mathbb{R}^{n_L \times n_l}$. In particular, $\bar{x} \in \mathbb{R}^{n_L \times n_0}$ is (indeed) the Jacobian matrix $\partial y / \partial x$. To compute a full Jacobian, the theoretical order of calculations is $n_L$ times the vanilla network. Notice however, that the implementation of the multiple backpropagation in the matrix form above on a system like TensorFlow automatically benefits from CPU or GPU parallelism. Therefore, the additional computation complexity will be experienced as sublinear.
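As an illustration of backpropagation with an identity boundary, here is a minimal NumPy sketch. It is not the paper's TensorFlow implementation: the layer sizes, the softplus activation, the random weights and the helper names (`init_params`, `forward`, `jacobian`) are all assumptions made for the example. The network uses a row-vector convention, with each layer computing `activation(z) @ w + b`.

```python
import numpy as np

def softplus(z):
    return np.logaddexp(0.0, z)

def sigmoid(z):
    # derivative of softplus
    return 1.0 / (1.0 + np.exp(-z))

def init_params(sizes, seed=0):
    """Random weights and biases for a feedforward net (illustrative only)."""
    rng = np.random.default_rng(seed)
    return [(rng.normal(size=(n_in, n_out)) / np.sqrt(n_in), rng.normal(size=n_out))
            for n_in, n_out in zip(sizes[:-1], sizes[1:])]

def forward(params, x):
    """Return the output vector and the list [x, z_1, ..., z_L] of pre-activations."""
    zs = [x, x @ params[0][0] + params[0][1]]   # first layer acts on raw inputs
    for w, b in params[1:]:
        zs.append(softplus(zs[-1]) @ w + b)
    return zs[-1], zs

def jacobian(params, x):
    """Backpropagate an identity boundary: one row of adjoints per output."""
    y, zs = forward(params, x)
    adj = np.eye(y.shape[0])                    # boundary: identity matrix
    for l in range(len(params) - 1, 0, -1):
        adj = (adj @ params[l][0].T) * sigmoid(zs[l])
    return adj @ params[0][0].T                 # shape (n_outputs, n_inputs)

params = init_params([3, 5, 4, 2])              # 3 inputs, 2 outputs (hypothetical)
x = np.array([0.3, -1.2, 0.7])
J = jacobian(params, x)
```

All output directions are propagated together as rows of one matrix, so each backward step is a single matrix product. This is the form that parallel hardware handles well.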

3.2 Higher order derivatives

The twin network can also predict higher-order derivatives. For simplicity, revert to the single prediction case where $n_L = 1$. The twin network predicts $\bar{x}$ as a function of the input $x$. The neural network, however, doesn’t know anything about derivatives. It just computes numbers by a sequence of equations. Hence, we might as well consider the prediction of differentials as multiple outputs.

As previously, in what is now considered a multiple prediction network, we can compute the adjoints of the outputs in the twin network. These are now the adjoints of the adjoints:

$$\bar{\bar{x}} = \frac{\partial \bar{x}}{\partial x} = \frac{\partial^2 y}{\partial x^2}$$

in other terms, the Hessian matrix of the value prediction $y$. Note that the original activation functions must be $C^2$ for this computation. The computation of the full Hessian is of order $n_0$ times the original network. These additional calculations generate a lot more data, one value, $n_0$ derivatives and $\frac{1}{2} n_0 (n_0 + 1)$ second-order derivatives for the cost of $2 n_0$ times the value prediction alone. In a parallel system like TensorFlow, the experience also remains sublinear. We can extend this argument to arbitrary order $q$, with the only restriction that the (original) activation functions are $C^q$.
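To make the role of the smoothness requirement concrete, the following sketch writes out the value, gradient and Hessian of a one-hidden-layer network in closed form. The architecture, the softplus activation (which is twice differentiable) and the function name `value_grad_hess` are assumptions for illustration. A general implementation would obtain the same quantities by differentiating the twin network automatically.

```python
import numpy as np

def softplus(z):
    return np.logaddexp(0.0, z)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def value_grad_hess(x, w1, b1, w2, b2):
    """Scalar prediction y = softplus(x @ w1 + b1) @ w2 + b2 with its derivatives."""
    z = x @ w1 + b1                            # hidden pre-activations, shape (h,)
    s = sigmoid(z)                             # first derivative of softplus
    y = softplus(z) @ w2 + b2                  # value
    grad = w1 @ (s * w2)                       # dy/dx, shape (n,)
    hess = (w1 * (s * (1.0 - s) * w2)) @ w1.T  # d2y/dx2, uses the second derivative
    return y, grad, hess

rng = np.random.default_rng(0)
w1, b1 = rng.normal(size=(3, 4)), rng.normal(size=4)
w2, b2 = rng.normal(size=4), 0.1
x = np.array([0.5, -0.3, 1.1])
y, grad, hess = value_grad_hess(x, w1, b1, w2, b2)
```

With a ReLU activation the second derivative vanishes almost everywhere and the Hessian above would be identically zero, which is why a smooth activation is needed.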


Throughout our analysis we have seen that ’learning the correct shape’ from differentials is crucial to the performance of regression models, including neural networks, in such complex computational tasks as the pricing and risk approximation of arbitrary Derivatives trading books. The unreasonable effectiveness of what we called ’differential machine learning’ makes it possible to train ML models accurately on a small number of simulated payoffs, in real time, suitable for online learning. Differential networks apply to real-world problems, including regulations and risk reports with multiple scenarios. Twin networks predict prices and Greeks with almost analytic speed, and their empirical test error remains of comparable magnitude to nested Monte-Carlo.

Our machinery learns from data alone and applies in very general situations, with arbitrary schedules of cash-flows, scripted or not, and arbitrary simulation models. Differential ML also applies to many families of approximations, including classic linear combinations of fixed basis functions, and neural networks of arbitrary complex architecture. Differential training consumes differentials of labels wrt inputs and requires clients to somehow provide high-quality first-order differentials. In finance, they are obtained with AAD, in the same way we compute Monte-Carlo risk reports, with analytic accuracy and very little computation cost.

One of the main benefits of twin networks is their ability to learn effectively from small datasets. Differentials inject meaningful additional information, eventually resulting in better results with small datasets of 1k to 8k examples than can be obtained otherwise with training sets orders of magnitude larger. Learning effectively from small datasets is critical in the context of e.g. regulations, where the pricing approximation must be learned quickly, and the expense of a large training set cannot be afforded.

The penalty enforced for wrong differentials in the cost function also acts as a very effective regularizer, superior to classical forms of regularization like Ridge, Lasso or Dropout, which enforce arbitrary penalties to mitigate overfitting, whereas differentials meaningfully augment data. Standard regularizers are very sensitive to the regularization strength $\lambda$, a manually tweaked hyperparameter. Differential training is virtually insensitive to $\lambda$, because, even with infinite regularization, we train on derivatives alone and still converge to the correct approximation, modulo an additive constant.

Appx 2 and Appx 3 apply the same ideas to PCA and classic regression, respectively. In the context of regression, differentials act as a very effective regularizer. Like Tikhonov regularization, differential regularization is analytic and works with SVD. Appx 3 derives a variation of the normal equation adjusted for differential regularization. Unlike Tikhonov, differential regularization does not introduce bias. Differential PCA, unlike classic PCA, is able to extract from data the principal risk factors of a given transaction, and it can be applied as a preprocessing step to safely reduce dimension without loss of relevant information.

Differential training also appears to stabilize the training of neural networks, and improved resilience to hyperparameters like network architecture, seeding of weights or learning rate schedule was consistently observed, although to explain exactly why is a topic for further research.

Standard machine learning may often be considerably improved with contextual information not contained in data, such as the nature of the relevant features from knowledge of the transaction and the simulation model. For example, we know that the continuation value of a Bermudan option on some call date mainly depends on the swap rate to maturity and the discount rate to the next call. We can learn pricing functions much more effectively with hand-engineered features. But it has to be done manually, on a case by case basis, depending on the transaction and the simulation model. If the Bermudan model is upgraded with stochastic volatility, volatility state becomes an additional feature that cannot be ignored, and hand-engineered features must be updated. Differential machine learning learns just as well, or better, from data alone, the vast amount of information contained in pathwise differentials playing a role similar to, and sometimes more effective than, manual adjustments from contextual information.

Differential machine learning is similar to data augmentation in computer vision, a technique consistently applied in that field with documented success, where multiple labeled images are produced from a single one, by cropping, zooming, rotation or recoloring. In addition to extending the training set for a negligible cost, data augmentation encourages the ML model to learn important invariances. Similarly, derivatives labels not only increase the amount of information in the training set, but also encourage the model to learn the shape of the pricing function.




Appx 1 Learning Prices from Samples


When learning Derivatives pricing and risk approximations, the main computation load belongs to the simulation of the training set. For complex transactions and trading books, it is not viable to learn from examples of ground truth prices. True prices are computed numerically, generally by Monte-Carlo. Even a small dataset of say, 1000 examples, is therefore simulated for the computation cost of 1000 Monte-Carlo pricings, a highly unrealistic cost in a practical context. Alternatively, sample datasets a la Longstaff-Schwartz (2001) are produced for the computation cost of one Monte-Carlo pricing, where each example is not a ground truth price, but one sample of the payoff, simulated for the cost of one Monte-Carlo path. This methodology, also called LSM (for Least Square Method as it is called in the founding paper), simulates training sets and allows us to learn pricing approximations in realistic time.

This being said, we now expect the machine learning model to learn correct pricing functions without having ever seen a price. Consider a simple example: to learn the pricing function for a European call in Black and Scholes, we simulate a training set of call payoffs $y^{(i)} = (S^{(i)}_{T_2} - K)^+$ at expiry $T_2$ given initial states $x^{(i)} = S^{(i)}_{T_1}$ on an earlier date $T_1$. The result is a random looking cloud of points $(x^{(i)}, y^{(i)})$, and we expect the machine to learn from this data the correct pricing function given by Black and Scholes’ formula.

It is not given at all, and it may even seem somewhat magical, that training a machine learning model on this data should converge to the correct function. When we train on ground truth prices, we essentially interpolate prices in input space, where it is clear and intuitive that arbitrary functions are approximated to arbitrary accuracy by growing the size of the training set and the capacity of the model. In fact, the same holds with LSM datasets, and this appendix discusses some important intuitions and presents sketches of mathematical proof of why this is the case.

In the first section, we recall LSM in detail and frame it in machine learning terms. Readers familiar with the Longstaff-Schwartz algorithm may browse through this section quickly, although skipping it altogether is not recommended, since this is where we set important notations. In the second section, we discuss universal approximators, formalize their training process on LSM samples, and demonstrate convergence to true prices. In the third section, we define pathwise differentials, formalize differential training and show that it too converges to true risk sensitivities.

The purpose of this document is to explain and formalize important mathematical intuitions, not to provide complete formal proofs. We often skip important mathematical technicalities so our demonstrations should really be qualified as ’sketches of proof’.

Appx 1.1 LSM datasets

Appx 1.1.1 Markov States
Model state

First, we formalize the definition of a LSM dataset. LSM datasets are simulated with a Monte-Carlo implementation of a dynamic pricing model. Dynamic models are parametric assumptions of the diffusion of a state vector $S_t$, of the form:

$$dS_t = \mu(S_t, t)\, dt + \sigma(S_t, t)\, dW_t$$

where $S_t$ is a vector of dimension $n_0$, $\mu(S_t, t)$ is a vector valued function of dimension $n_0$, $\sigma(S_t, t)$ is a matrix valued function of dimension $n_0 \times p$ and $W_t$ is a $p$ dimensional standard Brownian motion under the pricing measure. The number $n_0$ is called the Markov dimension of the model, the number $p$ is called the number of factors. Some models are non-diffusive, for example, jump diffusion models a la Merton or rough volatility models a la Gatheral. All the arguments of this note carry over to more general models, but in the interest of concision and simplicity, we only consider diffusions in the exposition. Dynamic models are implemented in Monte-Carlo simulations, e.g. with Euler’s scheme:

$$S^{(i)}_{t_{j+1}} = S^{(i)}_{t_j} + \mu\left(S^{(i)}_{t_j}, t_j\right) (t_{j+1} - t_j) + \sigma\left(S^{(i)}_{t_j}, t_j\right) \sqrt{t_{j+1} - t_j}\; N^{(i)}_j$$

where $i$ is the index of the path, $j$ is the index of the time step and the $N^{(i)}_j$ are independent Gaussian vectors in dimension $p$.
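Euler's scheme for a general diffusion can be sketched in a few lines of NumPy. The function name `euler_paths`, its signature and the callable conventions for the drift and diffusion coefficients are assumptions made for this example, not part of any system mentioned in the text.

```python
import numpy as np

def euler_paths(s0, mu, sigma, times, n_paths, p, seed=0):
    """Simulate dS = mu(S,t) dt + sigma(S,t) dW with Euler's scheme.

    s0:    initial state, shape (n0,)
    mu:    callable (S, t) -> drift, shape (n_paths, n0)
    sigma: callable (S, t) -> diffusion matrix, shape (n_paths, n0, p)
    times: increasing simulation dates, starting at today
    Returns an array of shape (len(times), n_paths, n0).
    """
    rng = np.random.default_rng(seed)
    s = np.tile(np.asarray(s0, dtype=float), (n_paths, 1))
    out = [s]
    for t0, t1 in zip(times[:-1], times[1:]):
        dt = t1 - t0
        n = rng.standard_normal((n_paths, p))          # independent Gaussian vectors
        shock = np.einsum("ijk,ik->ij", sigma(s, t0), n)
        s = s + mu(s, t0) * dt + np.sqrt(dt) * shock
        out.append(s)
    return np.stack(out)

# Usage: a one-dimensional state with unit drift and no volatility
times = np.linspace(0.0, 1.0, 5)
drift = lambda s, t: np.ones_like(s)
no_vol = lambda s, t: np.zeros((s.shape[0], s.shape[1], 1))
paths = euler_paths([100.0], drift, no_vol, times, n_paths=4, p=1)
```

The state is carried as one array across all paths, so each time step is vectorized over the whole simulation.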

The definition of the state vector depends on the model. In Black and Scholes or local volatility extensions a la Dupire, the state is the underlying asset price. With stochastic volatility models like SABR or Heston, the bi-dimensional state is the pair (current asset price, current volatility). In Hull and White / Cheyette interest rate models, the state is a low dimensional latent representation of the yield curve. In general Heath-Jarrow-Morton / Libor Market models, the state is the collection of all forward rates in the yield curve.

We call model state $S_t$ on date $t$ the state vector of the model on this date.

Transaction state

Derivatives transactions also carry a state, in the sense that the transactions evolve and mutate during their lifetime. The state of a barrier option depends on whether the barrier was hit in the past. The state of a Bermudan swaption depends on whether it was exercised. Even the state of a swap depends on the coupons fixed in the past and not yet paid. European options don’t carry state until expiry, but then, they may exercise into an underlying schedule of cashflows.

We denote $U_t$ the transaction state at time $t$ and $n_1$ its dimension. For a barrier option, the transaction state is of dimension one and contains the indicator of having hit the barrier prior to $t$. For a real-world trading book, the dimension $n_1$ may be in the thousands and it may be necessary to split the book to avoid dimension overload. The transaction state is simulated together with the model state in a Monte-Carlo implementation. In a system where event driven cashflows are scripted, the transaction state is also the script state, i.e. the collection of variables in the script evaluated over the Monte-Carlo path up to time $t$.

Training inputs

The exercise is to learn the pricing function for a given transaction or a set of transactions, in a given model, on a given date $T_1 \geq 0$, sometimes called the exposure date or horizon date. The price evidently depends on both the state of the model $S_{T_1}$ and the state of the transaction $U_{T_1}$. The concatenation $X_{T_1} = [S_{T_1}, U_{T_1}]$ of these two vectors constitutes the complete Markov state of the system, in the sense that the true prices of transactions at $T_1$ are deterministic (but unknown) functions of $X_{T_1}$.

The dimension of the state vector is $n = n_0 + n_1$.

The training inputs are a collection of examples $x^{(i)}$ of the Markov state $X_{T_1}$ in dimension $n$. They may be sampled by Monte-Carlo simulation between today ($T_0 = 0$) and $T_1$, or otherwise. The distribution of $X_{T_1}$ in the training set should reflect the intended use of the trained approximation. For example, in the context of value at risk (VAR) or expected loss (FRTB), we need an accurate approximation in extreme scenarios, hence, we need them well represented in the training set, e.g. with a Monte-Carlo simulation with increased volatility. In low dimension, the training states may be put on a regular grid over a relevant domain. In higher dimension, they may be sampled over a relevant domain with a low discrepancy sequence like Sobol. When the exposure date is today or close, sampling with Monte-Carlo is nonsensical; an appropriate sampling distribution must be applied depending on context.

Appx 1.1.2 Pricing
Cashflows and transactions

A cashflow $CF_k$ paid at time $T_k$ is formally defined as a $T_k$-measurable random variable. This means that the cashflow is revealed on or before its payment date. In the world described by the model, this is a functional of the path of the state vector from $T_0 = 0$ to the payment date $T_k$ and may be simulated by Monte-Carlo.

A transaction is a collection of cashflows $\{CF_k\}$. A European call of strike $K$ expiring at $T$ is a unique cashflow, paid at $T$, defined as $(s_T - K)^+$. A barrier option also defines a unique cashflow, for instance, for an up-and-out call with barrier $B$:

$$1_{\{\max_{0 \leq t \leq T} s_t < B\}}\, (s_T - K)^+$$

An interest rate swap defines a schedule of cashflows paid on its fixed leg and another one paid on its floating leg. Scripting conveniently and consistently describes all cashflows, as functionals of market variables, in a language designed for this purpose.

A netting set or trading book is a collection of transactions, hence, ultimately, a collection of cashflows. In what follows, the word ’transaction’ refers to an arbitrary collection of cashflows, maybe netting sets or trading books. The payment date of the last cashflow is called the maturity of the transaction and denoted $T_2$.


The payoff of a transaction is defined as the discounted sum of all its cashflows:

$$\pi = \sum_k D_k\, CF_k$$

where $D_k$ denotes the (stochastic) discount factor to the payment date $T_k$; hence, the payoff is a $T_2$-measurable random variable, which can be sampled by Monte-Carlo simulation.

For the purpose of learning the pricing function of a transaction on an exposure date $T_1$, we only consider cashflows after $T_1$, and discount them to the exposure date. In the interest of simplicity, we incorporate discounting to $T_1$ in the functional definition of the cashflows. Hence:

$$\pi = \sum_{k:\, T_k > T_1} CF_k$$

The payoff is still a $T_2$-measurable random variable. It can be sampled by Monte-Carlo simulation conditional to state $X_{T_1}$ at $T_1$ by seeding the simulation with state $X_{T_1}$ at $T_1$ and simulating up to $T_2$.


Assuming a complete, arbitrage-free model, we immediately get the price of the transaction from the fundamental theorem of asset pricing:

$$V_{T_1} = E\left[\pi \mid \mathcal{F}_{T_1}\right]$$

where expectations are taken in the pricing measure defined by the model and $\mathcal{F}_{T_1}$ is the filtration at $T_1$ (loosely speaking, the information available at $T_1$). Since by assumption $X_{T_1}$ is the complete Markov state of the system at $T_1$:

$$V_{T_1} = E\left[\pi \mid X_{T_1}\right] = h(X_{T_1})$$

Hence, the true price is a deterministic (but unknown) function $h$ of the Markov state.

Training labels

We see that the price corresponding to the input example $x^{(i)}$ is:

$$v^{(i)} = h(x^{(i)}) = E\left[\pi \mid X_{T_1} = x^{(i)}\right]$$

and that its computation, in the general case, involves averaging payoffs over a number of Monte-Carlo simulations from $T_1$ to $T_2$, all identically seeded with $X_{T_1} = x^{(i)}$. This is also called nested simulations because a set of simulations is necessary to compute the value of each example, the initial states having themselves been sampled somehow. If the initial states were sampled with Monte-Carlo simulations, they are called outer simulations. Hence, we have simulations within simulations, an extremely costly and inefficient procedure.

Instead, for each example $x^{(i)}$, we draw one single payoff $y^{(i)}$ from its distribution conditional to $X_{T_1} = x^{(i)}$, by simulation of one Monte-Carlo path from $T_1$ to $T_2$, seeded with $x^{(i)}$ at $T_1$. The labels in our dataset correspond to these random draws:

$$y^{(i)} = \pi \mid \{X_{T_1} = x^{(i)}\}$$

Notice (dropping the conditioning on $X_{T_1} = x^{(i)}$ to simplify notations) that, while labels no longer correspond to true prices, they are unbiased (if noisy) estimates of true prices:

$$E\left[y^{(i)}\right] = h(x^{(i)})$$

in other terms:

$$y^{(i)} = h(x^{(i)}) + \epsilon^{(i)}$$

where the $\epsilon^{(i)}$ are independent noise with $E[\epsilon^{(i)}] = 0$. This is why universal approximators trained on LSM datasets converge to true prices despite having never seen one.
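The unbiasedness of single-path labels can be checked numerically in the simplest setting. The sketch below assumes a European call in Black and Scholes with zero rates and dividends. The parameter values and the helper names `bs_call` and `lsm_labels` are illustrative choices. One state is repeated many times only to expose the noise: in an actual LSM dataset every state differs and receives a single draw.

```python
import numpy as np
from math import erf, log, sqrt

def bs_call(s, k, vol, tau):
    """Black and Scholes call price, assuming zero rates and dividends."""
    d1 = (log(s / k) + 0.5 * vol * vol * tau) / (vol * sqrt(tau))
    d2 = d1 - vol * sqrt(tau)
    cdf = lambda d: 0.5 * (1.0 + erf(d / sqrt(2.0)))
    return s * cdf(d1) - k * cdf(d2)

def lsm_labels(x, k, vol, tau, rng):
    """One sampled payoff per input state, each from a single Monte-Carlo path."""
    z = rng.standard_normal(x.shape)
    s_end = x * np.exp(-0.5 * vol * vol * tau + vol * np.sqrt(tau) * z)
    return np.maximum(s_end - k, 0.0)

rng = np.random.default_rng(42)
k, vol, tau = 100.0, 0.2, 1.0
x = np.full(400_000, 105.0)          # the same state, repeated
y = lsm_labels(x, k, vol, tau, rng)  # individual labels are very noisy

label_mean = y.mean()                # estimate of E[y | x]
true_price = bs_call(105.0, k, vol, tau)
print(label_mean, true_price, y.std())
```

The individual labels scatter widely around the price, yet their average lands on the closed form value up to Monte-Carlo error.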

Appx 1.2 Machine learning with LSM datasets

Appx 1.2.1 Universal approximators

Having simulated a training set of examples $(x^{(i)}, y^{(i)})$ we proceed to train approximators, defined as functions $\hat{h}(x; w)$ of the input vector $x$ of dimension $n$, parameterized by a vector $w$ of learnable weights of dimension $d$. This is a general definition of approximators. In classic regression, $w$ are the regression weights, often denoted $\beta$. In a neural network, $w$ is the collection of all connection matrices and bias vectors in the multiple layers of the network.

The capacity of the approximator is an informal measure of both its computational complexity and its ability to approximate functions by matching discrete sets of datapoints. A classic formal definition of capacity is the Vapnik-Chervonenkis dimension, defined as the largest number of arbitrary datapoints the approximator can match exactly. We settle for a weaker definition of capacity as the number of learnable parameters, sufficient for our purpose.

A universal approximator is one guaranteed to approximate any function to arbitrary accuracy when its capacity is grown to infinity. Examples of universal approximators include classic linear regression, as long as the regression functions form a complete basis of the function space. Polynomial, harmonic (Fourier) and radial basis regressors are all universal approximators. Famously, neural networks are universal approximators too, a result known as the Universal Approximation Theorem.

Appx 1.2.2 LSM approximation theorem

Training an approximator means setting the value of its learnable parameters $w$ in order to minimize a cost function, generally the mean square error (MSE) between the approximations and labels over a training set of $m$ examples:

$$w^* = \arg\min_w MSE(w), \qquad MSE(w) = \frac{1}{m} \sum_{i=1}^m \left[\hat{h}(x^{(i)}; w) - y^{(i)}\right]^2$$

The following theorem justifies the practice of training approximators on LSM datasets:

A universal approximator $\hat{h}$ trained by minimization of the MSE over a training set of independent examples of Markov states at $T_1$ coupled with conditional sample payoffs at $T_2$, converges to the true pricing function

$$h(x) = E\left[\pi \mid X_{T_1} = x\right]$$

when the size $m$ of the training set and the capacity $d$ of the approximator both grow to infinity.

We provide a sketch of proof, skipping important mathematical technicalities to highlight intuitions and important properties.

First, notice that the training set consists in $m$ independent, identically distributed realizations $(x^{(i)}, y^{(i)})$ of the couple $(X, Y)$ where $X$ is the Markov state at $T_1$, sampled from a distribution reflecting the intended application of the approximator, and $Y$ is the conditional payoff at $T_2$, sampled from the pricing distribution defined by the model by conditional Monte-Carlo simulation.

Hence, the true pricing function $h$ satisfies:

$$h(X) = E\left[Y \mid X\right]$$

By definition, the conditional expectation is the function of $X$ closest to $Y$ in $L^2$:

$$h = \arg\min_g E\left[\left(g(X) - Y\right)^2\right]$$

Hence, pricing can be framed as an optimization problem in the space of functions. By universal approximation property:

$$\min_w E\left[\left(\hat{h}(X; w) - Y\right)^2\right] \to \min_g E\left[\left(g(X) - Y\right)^2\right]$$

when the capacity $d$ grows to infinity, and:

$$MSE(w) = \frac{1}{m} \sum_{i=1}^m \left[\hat{h}(x^{(i)}; w) - y^{(i)}\right]^2 \to E\left[\left(\hat{h}(X; w) - Y\right)^2\right]$$

when $m$ grows to infinity, by assumption of an IID training set, sampled from the correct distributions. Hence:

$$\hat{h}(\cdot\,; w^*) \to h$$

when both $m$ and $d$ grow to infinity. This is the theoretical basis for training machine learning models on LSM samples, and it applies to all universal approximators, including neural networks. This is why regression or neural networks trained on samples ’magically’ converge to the correct pricing function, as observed e.g. in our demonstration notebook with European calls in Black and Scholes and basket options in Bachelier. The theorem is general and equally guarantees convergence for arbitrary (complete and arbitrage-free) models and schedules of cashflows.
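The convergence can be observed on a small scale with the simplest universal approximator, a polynomial regression. The sketch below is not the demonstration notebook: it assumes a European call in Black and Scholes with zero rates, uniformly sampled initial states, and a degree 5 polynomial, all chosen for illustration.

```python
import numpy as np
from math import erf, sqrt

def bs_call(s, k, vol, tau):
    """Black and Scholes call price (zero rates assumed), vectorized over s."""
    s = np.asarray(s, dtype=float)
    d1 = (np.log(s / k) + 0.5 * vol**2 * tau) / (vol * np.sqrt(tau))
    d2 = d1 - vol * np.sqrt(tau)
    cdf = np.vectorize(lambda d: 0.5 * (1.0 + erf(d / sqrt(2.0))))
    return s * cdf(d1) - k * cdf(d2)

rng = np.random.default_rng(0)
k, vol, tau, m = 100.0, 0.2, 1.0, 200_000

# LSM dataset: states at the exposure date, one sampled payoff per state
x = rng.uniform(80.0, 120.0, size=m)
z = rng.standard_normal(m)
y = np.maximum(x * np.exp(-0.5 * vol**2 * tau + vol * np.sqrt(tau) * z) - k, 0.0)

# Least squares fit of a polynomial on noisy payoffs (inputs rescaled for conditioning)
u = (x - 100.0) / 20.0
coef = np.polyfit(u, y, deg=5)

# Compare the fitted function with the closed form price on a grid
grid = np.linspace(85.0, 115.0, 31)
fitted = np.polyval(coef, (grid - 100.0) / 20.0)
max_err = np.max(np.abs(fitted - bs_call(grid, k, vol, tau)))
print(max_err, y.std())
```

The regression never sees a price, only payoffs whose standard deviation is of the same order as the prices themselves, and still recovers the pricing function to within a small fraction of that noise.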

Appx 1.3 Differential Machine Learning with LSM datasets

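Appx 1.3.1 Pathwise differentials

By definition, pathwise differentials $\partial \pi / \partial X_{T_1}$ are $T_2$-measurable random variables equal to the gradient of the payoff at $T_2$ wrt the state variables at $T_1$.

For example, for a European call in Black and Scholes, pathwise derivatives are equal to:

$$\frac{\partial \pi}{\partial S_{T_1}} = \frac{\partial (S_{T_2} - K)^+}{\partial S_{T_2}} \frac{\partial S_{T_2}}{\partial S_{T_1}} = 1_{\{S_{T_2} > K\}} \frac{S_{T_2}}{S_{T_1}}$$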

In a general context, pathwise differentials are conveniently and efficiently computed with automatic adjoint differentiation (AAD) over Monte-Carlo paths as explained in the founding paper Smoking Adjoints (Giles and Glasserman, Risk 2006) and the vast amount of literature that followed. We posted a video tutorial explaining the main ideas in 15 minutes.

Pathwise differentials are not well defined for discontinuous cashflows, like digitals or barriers. This is classically resolved by smoothing, i.e. the replacement of discontinuous cashflows with close continuous approximations. Digitals are typically represented as tight call spreads, and barriers are represented as soft barriers. Smoothing has been a standard practice on Derivatives trading desks for several decades. For an overview of smoothing, including generalization in terms of fuzzy logic and a systematic smoothing algorithm, see our presentation.

Provided that all cashflows are differentiable by smoothing (and some additional, generally satisfied technical requirements), the expectation and differentiation operators commute so that true risks are (conditional) expectations of pathwise differentials:

$$\frac{\partial h(x)}{\partial x} = E\left[\frac{\partial \pi}{\partial X_{T_1}} \,\middle|\, X_{T_1} = x\right]$$

This theorem is demonstrated in stochastic literature, the most general demonstration being found in Functional Ito Calculus, also called Dupire Calculus, see Quantitative Finance Volume 19, 2019, Issue 5. It also applies to pathwise differentials wrt model parameters, and justifies the practice of Monte-Carlo risk reports by averaging pathwise derivatives.
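For a European call in Black and Scholes, the statement can be verified directly: the average of pathwise derivatives should match the closed form delta. The sketch below assumes zero rates, and the parameter values are illustrative. The pathwise derivative is written analytically here, whereas AAD would produce it automatically in a general setting.

```python
import numpy as np
from math import erf, log, sqrt

rng = np.random.default_rng(1)
k, vol, tau, s1, m = 100.0, 0.2, 1.0, 105.0, 400_000

# Simulate the terminal spot from the same initial state on every path
z = rng.standard_normal(m)
s2 = s1 * np.exp(-0.5 * vol**2 * tau + vol * np.sqrt(tau) * z)

# Pathwise derivative of the call payoff wrt the initial spot
pathwise = (s2 > k) * s2 / s1

# Closed form delta for comparison (zero rates assumed)
d1 = (log(s1 / k) + 0.5 * vol**2 * tau) / (vol * sqrt(tau))
delta = 0.5 * (1.0 + erf(d1 / sqrt(2.0)))

pathwise_mean = pathwise.mean()
print(pathwise_mean, delta)
```

The call payoff is continuous, so no smoothing is needed in this case. A digital payoff would make the indicator's derivative a Dirac mass and the same average would miss it entirely.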

Appx 1.3.2 Training on pathwise differentials

LSM datasets consist of inputs $x^{(i)}$ with labels $y^{(i)}$. Pathwise differentials are therefore the gradients of labels wrt inputs $\partial y^{(i)} / \partial x^{(i)}$. The main proposition of the working paper is to augment training datasets with those differentials and implement an adequate training on the augmented dataset, with the result of vastly improved approximation performance.

Suppose first that we are training an approximator on differentials alone:

$$\frac{\partial \hat{h}(x^{(i)}; w)}{\partial x^{(i)}} \approx \frac{\partial y^{(i)}}{\partial x^{(i)}}$$

with predicted derivatives on the left hand side (LHS) and differential labels on the right hand side (RHS). Note that the LHS is the predicted sensitivity but the RHS is not the true sensitivity $\partial h(x^{(i)}) / \partial x^{(i)}$. It is the pathwise differential, a random variable with expectation the true sensitivity and additional sampling noise.

We have already seen this exact same situation while training approximators on LSM samples, and demonstrated that the trained approximator converges to the true conditional expectation, in this case, the expectation of pathwise differentials, a.k.a. the true risk sensitivities.

The trained approximator will therefore converge to a function $\hat{h}$ with all the same differentials as the true pricing function $h$. It follows that on convergence $\hat{h} = h$ modulo an additive constant $c$, trivially computed at the term of training by matching means:

$$c = \frac{1}{m} \sum_{i=1}^m \left[y^{(i)} - \hat{h}(x^{(i)})\right]$$
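Training on differentials alone can be sketched with a polynomial regression, again for a European call in Black and Scholes with zero rates. Everything specific below is an assumption for illustration: the polynomial degree, the sampling domain and the variable names. The polynomial's derivative is fitted to the pathwise derivatives by least squares, and the additive constant is then recovered by matching means against the payoff labels.

```python
import numpy as np
from math import erf, sqrt

def bs_call(s, k, vol, tau):
    """Black and Scholes call price (zero rates assumed), vectorized over s."""
    s = np.asarray(s, dtype=float)
    d1 = (np.log(s / k) + 0.5 * vol**2 * tau) / (vol * np.sqrt(tau))
    d2 = d1 - vol * np.sqrt(tau)
    cdf = np.vectorize(lambda d: 0.5 * (1.0 + erf(d / sqrt(2.0))))
    return s * cdf(d1) - k * cdf(d2)

rng = np.random.default_rng(0)
k, vol, tau, m = 100.0, 0.2, 1.0, 200_000

# Augmented LSM dataset: states, sampled payoffs and pathwise derivatives
x = rng.uniform(80.0, 120.0, size=m)
s2 = x * np.exp(-0.5 * vol**2 * tau + vol * np.sqrt(tau) * rng.standard_normal(m))
y = np.maximum(s2 - k, 0.0)
dydx = (s2 > k) * s2 / x

# Model: f(x) = c + sum_j a_j u^j with u = (x - 100) / 20, j = 1..5
u = (x - 100.0) / 20.0
powers = np.arange(1, 6)
dphi = powers * u[:, None] ** (powers - 1) / 20.0    # derivatives of basis wrt x

# Step 1: fit the derivative of the model to differential labels only
a, *_ = np.linalg.lstsq(dphi, dydx, rcond=None)

# Step 2: recover the additive constant by matching means on payoff labels
c = np.mean(y - (u[:, None] ** powers) @ a)

grid = np.linspace(85.0, 115.0, 31)
ug = (grid - 100.0) / 20.0
fitted = (ug[:, None] ** powers) @ a + c
max_err = np.max(np.abs(fitted - bs_call(grid, k, vol, tau)))
print(max_err)
```

The payoff labels enter only through their mean, which fixes the level of the function. Its entire shape comes from the differentials.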


We reviewed the details of LSM simulation framed in machine learning terms, and demonstrated that training approximators on LSM datasets effectively converges to the true pricing functions. We then proceeded to demonstrate that the same is true of differential training, i.e. training approximators on pathwise differentials also converges to the true pricing functions.

These are asymptotic results. They justify standard practices and guarantee consistency and meaningfulness of classical and differential training on LSM datasets, classical or augmented e.g. with AAD. They don’t say anything about speed of convergence. In particular, they don’t provide a quantification of errors with finite capacity $d$ and finite datasets of size $m$. They don’t explain the vastly improved performance of differential training, consistently observed across examples of practical relevance in the working paper. Both methods have the same asymptotic guarantees; where they differ is in the magnitude of errors with finite capacity and size. To quantify those is a more complex problem and a topic for further research.

Appx 2 Taking the First Step: Differential PCA




We review traditional data preparation in deep learning (DL) including principal component analysis (PCA), which effectively performs orthonormal transformation of input data, filtering constant and redundant inputs, and enabling more effective training of neural networks (NN). Of course, PCA is also useful in its own right, providing a lower dimensional latent representation of data variation along orthogonal axes.

In the context of differential DL, training data also contains differential labels (differentials of training labels wrt training inputs, computed e.g. with automatic adjoint differentiation -AAD- as explained in the working paper), and thus requires additional preprocessing.

We will see that differential labels also enable remarkably effective data preparation, which we call differential PCA. Like classic PCA, differential PCA provides a hierarchical, orthogonal representation of data. Unlike classic PCA, differential PCA represents input data in terms how it affects the target measured by training labels, a notion we call relevance. For this reason, differential PCA may be safely applied to aggressively remove irrelevant factors and considerably reduce dimension.

In the context of data generated by financial Monte-Carlo paths, differential PCA exhibits the principal risk factors of the target transaction or trading book from data alone. It is therefore a very useful algorithm in its own right, besides its effectiveness in preparing data for training NN.

The first section describes and justifies elementary data preparation, as implemented in the demonstration notebook DifferentialML.ipynb on https://github.com/differential-machine-learning. Section 2 discusses the mechanism, benefits and limits of classic PCA. Section 3 introduces and derives differential PCA and discusses the details of its implementation and benefits. Section 4 brings it all together in pseudocode.

Appx 2.1 Elementary data preparation

Dataset normalization is known as a crucial, if somewhat mundane preparation step in deep learning (DL), highlighted in all DL textbooks and manuals. Recall from the working paper that we are working with augmented datasets:

with labels in dimension 1 and inputs and differentials in dimension , stacked in rows in the matrices , and . In the context of financial Monte-Carlo simulations a la Longstaff-Schwartz, inputs are Markov states on a horizon date , labels are payoffs sampled on a later date and differentials are pathwise derivatives, produced with AAD.
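For reference, one consistent way to write such an augmented dataset is sketched below. The symbols ($m$ examples, $n$ inputs, and the matrices $X$, $Y$, $Z$) are notation assumed here for illustration, not necessarily the notation of the working paper.

```latex
\[
\left( X^{(i)},\; Y^{(i)},\; Z^{(i)} = \frac{\partial Y^{(i)}}{\partial X^{(i)}} \right),
\quad 1 \le i \le m,
\qquad
X \in \mathbb{R}^{m \times n}, \quad
Y \in \mathbb{R}^{m \times 1}, \quad
Z \in \mathbb{R}^{m \times n}.
\]
```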

The normalization of augmented datasets must take additional steps compared to conventional preparation of classic datasets consisting of only inputs and labels.

Appx 2.1.1 Taking the first (and last) step A first, trivial observation is that the scale of labels carries over to the gradients of the cost functions and the size of gradient descent optimization steps. To avoid manual scaling of the learning rate, gradient descent and its variants are best implemented with labels normalized by subtraction of mean and division by standard deviation. This is the case for all models trained with gradient descent, including classic regression in high dimension where the closed form solution is intractable.

Contrary to classic regression, training a neural network is a nonconvex problem, hence its result is sensitive to the starting point. Correctly seeding connection weights is therefore a crucial step for successful training. The best practice Xavier-Glorot initialization provides a powerful seeding heuristic, implemented in modern frameworks like TensorFlow. It is based on the implicit assumption that the units in the network, including inputs, are centred and orthonormal. It therefore performs best when the inputs are at the very least normalized by mean and standard deviation, and ideally orthogonal. This is specific to neural networks. Training classic regression models, analytically or numerically, is a convex problem, so there is no need to normalize inputs or seed weights in a particular manner.

Training deep learning models therefore always starts with a normalization step and ends with an ‘un-normalization’ step, where predictions are scaled back to original units. These first and last steps may be seen, and implemented, as additional layers in the network with fixed (non-learnable) weights. They may even be merged into the input and output layers of the network for maximum efficiency. In this document, we present normalization as a preprocessing step in the interest of simplicity.

First step

We implemented basic preprocessing in the demonstration notebook:


and similarly for standard deviations of labels and inputs .

The differentials computed by the prediction model (e.g. the twin network of the working paper) are:

hence, we adjust differential labels accordingly:
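A minimal NumPy sketch of this first step follows. It is not the code of the demonstration notebook: the function name `normalize_augmented`, the array names and the small `eps` guard are assumptions made here for illustration. The only substantive point is the chain rule: dividing labels by their standard deviation and inputs by theirs multiplies each differential by the ratio of input to label standard deviation.

```python
import numpy as np

def normalize_augmented(X, Y, Z, eps=1.0e-8):
    """Normalize inputs X (m, n), labels Y (m, 1) and differentials Z (m, n)."""
    mu_x, sigma_x = X.mean(axis=0), X.std(axis=0) + eps   # per-input statistics
    mu_y, sigma_y = Y.mean(), Y.std() + eps               # label statistics
    X_tilde = (X - mu_x) / sigma_x
    Y_tilde = (Y - mu_y) / sigma_y
    # Chain rule: d(Y_tilde)/d(X_tilde_j) = (sigma_x_j / sigma_y) * dY/dX_j
    Z_tilde = Z * sigma_x / sigma_y
    stats = {"mu_x": mu_x, "sigma_x": sigma_x, "mu_y": mu_y, "sigma_y": sigma_y}
    return X_tilde, Y_tilde, Z_tilde, stats

# Toy augmented dataset (hypothetical): Y = sum_j X_j^2, hence Z = 2 X
rng = np.random.default_rng(0)
X = rng.normal(loc=100.0, scale=20.0, size=(1000, 3))
Y = np.sum(X ** 2, axis=1, keepdims=True)
Z = 2.0 * X

X_tilde, Y_tilde, Z_tilde, stats = normalize_augmented(X, Y, Z)
```

The returned statistics are kept so that predictions can later be scaled back to original units.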

Training step

The value labels are centred and unit scaled but the differential labels are not; they are merely re-expressed in units of ‘standard deviations of the label per standard deviation of the input’. To avoid summing apples and oranges in the combined cost function, as commented in the working paper, we scale the cost as follows:

and proceed to find the optimal biases and connection weights by minimizing this scaled cost with respect to them.
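One plausible form of such a scaled cost is sketched below in NumPy, assuming that the squared error on each differential column is divided by the mean square of the corresponding differential labels, so that every column contributes on a comparable, dimensionless scale. The function name `differential_cost`, the weight `lam` and the `eps` guard are assumptions made here for illustration, not the working paper's exact formula.

```python
import numpy as np

def differential_cost(y_pred, y_true, z_pred, z_true, lam=1.0, eps=1.0e-8):
    """Value MSE plus lam times the scaled MSE on differentials."""
    value_mse = np.mean((y_pred - y_true) ** 2)
    # Size of the differential labels, column by column (shape (n,))
    z_scale = np.mean(z_true ** 2, axis=0) + eps
    # Per-column squared error, each divided by the size of its labels
    deriv_mse = np.mean(np.mean((z_pred - z_true) ** 2, axis=0) / z_scale)
    return value_mse + lam * deriv_mse

rng = np.random.default_rng(1)
y_true = rng.normal(size=(500, 1))
# Differential labels with wildly different magnitudes per column
z_true = rng.normal(size=(500, 3)) * np.array([0.01, 1.0, 100.0])

cost_perfect = differential_cost(y_true, y_true, z_true, z_true)
# Predicting zero differentials costs about lam, whatever the column scales
cost_zero_grad = differential_cost(y_true, y_true, np.zeros_like(z_true), z_true)
```

With this scaling, a column of large differentials does not dominate a column of small ones. The guard `eps` only postpones the problem of columns that are identically zero, which need to be filtered before training.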

Last step

The trained model expects normalized inputs and predicts a normalized value, along with its gradient to the normalized inputs. Those results must be scaled back to original units:

where we divided two row vectors to mean elementwise division, and:
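In code, this last step can be sketched as follows, under the assumption that normalization used means and standard deviations as described above. The function name `unnormalize` and the stand-in for a trained model (an exact linear function expressed in normalized units) are hypothetical and only serve to check the scaling.

```python
import numpy as np

def unnormalize(y_tilde_pred, z_tilde_pred, mu_y, sigma_y, sigma_x):
    """Scale normalized predictions and gradients back to original units."""
    y_pred = mu_y + sigma_y * y_tilde_pred
    # Elementwise division of each gradient row by the row vector sigma_x
    z_pred = sigma_y * z_tilde_pred / sigma_x
    return y_pred, z_pred

# Check on a linear function y = x @ w + b, whose gradient is w everywhere
rng = np.random.default_rng(2)
w, b = np.array([1.5, -2.0, 0.5]), 3.0
X = rng.normal(loc=10.0, scale=[1.0, 5.0, 0.1], size=(200, 3))
Y = (X @ w + b)[:, None]

mu_x, sigma_x = X.mean(axis=0), X.std(axis=0)
mu_y, sigma_y = Y.mean(), Y.std()
X_tilde = (X - mu_x) / sigma_x

# Stand-in for a perfectly trained model working in normalized units
y_tilde_pred = (((X_tilde * sigma_x + mu_x) @ w + b) - mu_y)[:, None] / sigma_y
z_tilde_pred = np.tile(sigma_x * w / sigma_y, (X.shape[0], 1))

y_pred, z_pred = unnormalize(y_tilde_pred, z_tilde_pred, mu_y, sigma_y, sigma_x)
```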

Appx 2.1.2 Limitations Basic data normalization is sufficient for textbook examples but more thorough processing is necessary in production, where datasets generated by arbitrary schedules of cashflows simulated in arbitrary models may contain a mass of constant, redundant or irrelevant inputs. Although neural networks are supposed to correctly sort data and identify relevant features during training, in practice, nonconvex optimization is much more reliable when at least linear redundancies and irrelevances are filtered in a preprocessing step, lifting those concerns from the training algorithm and letting it focus on the extraction of nonlinear features.

In addition, it is best, although not strictly necessary, to train on orthogonal inputs. As is well known, normalization and orthogonalization of input data, along with filtering of constant and linearly redundant inputs, are all jointly performed in a principled manner by eigenvalue decomposition of the input covariance matrix, in a classic procedure called principal component analysis or PCA.

Appx 2.2 Principal Component Analysis

Appx 2.2.1 Mechanism We briefly recall the mechanism of data preparation with classic PCA. First, normalize labels and center inputs:

i.e. what we now call is the matrix of centred inputs. Perform its eigenvalue decomposition:

where is the orthonormal matrix of eigenvectors (in columns) and is the diagonal matrix of eigenvalues .

Filter numerically constant or redundant inputs identified by eigenvalues lower than a threshold . The filter matrix has rows and columns and is obtained from the identity matrix by removal of columns corresponding to insignificant eigenvalues . Denote:

the reduced eigenvalue and eigenvector matrices of respective shapes and , and apply the following linear transformation to centred input data:

The transformed data has shape , with constant and linearly redundant columns filtered out. It is evidently centred, and easily proved orthonormal:

Note for what follows that orthonormal property is preserved by rotation, i.e. right product by any orthonormal matrix :

To update differential labels, we apply a result from elementary multivariate calculus:

Given two row vectors and in dimension and a square non singular matrix of shape such that , and a scalar, then:

The proof is left as an exercise.
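A numerical sanity check can stand in for the proof. The sketch below assumes the result takes the following form: if the transformed row vector is the original row vector right-multiplied by a non-singular matrix $A$, then the gradient wrt the transformed vector is the original gradient right-multiplied by the inverse transpose of $A$. The test function `f` and the matrix `A` are arbitrary choices made here for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 4

# Upper triangular with nonzero diagonal: non-singular and non-symmetric
A = np.triu(rng.normal(size=(n, n)), k=1) + np.diag([1.0, 2.0, 3.0, 4.0])
A_inv = np.linalg.inv(A)
w = rng.normal(size=n)

def f(x):
    """Arbitrary smooth scalar function of the row vector x."""
    return np.sin(x @ w) + 0.5 * np.sum(x ** 2)

def grad_f(x):
    """Analytic gradient of f wrt x."""
    return np.cos(x @ w) * w + x

def g(x_tilde):
    """The same scalar expressed in transformed coordinates x_tilde = x @ A."""
    return f(x_tilde @ A_inv)

x = rng.normal(size=n)
x_tilde = x @ A

# Claimed result: dY/dX_tilde = dY/dX @ A^{-T}
analytic = grad_f(x) @ A_inv.T

# Central finite differences of g wrt each coordinate of x_tilde
h = 1.0e-6
numeric = np.array([(g(x_tilde + h * e) - g(x_tilde - h * e)) / (2.0 * h)
                    for e in np.eye(n)])
```

The two gradients agree up to finite-difference error, which supports the assumed form of the result.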

It follows that:

Or the other way around:

We therefore train the ML model more effectively on transformed data:

by minimization of the cost function (Appx 2.1.1) in the learnable weights. The trained model takes inputs in the tilde basis and predicts normalized values and differentials . Finally, we translate predictions back into the original units:

PCA performs an orthonormal transformation of input data, removing constant and linearly redundant columns, effectively cleaning data to facilitate training of NN. PCA is also useful in its own right. It identifies the main axes of variation of a data matrix and may result in a lower dimensional latent representation, with many applications in finance and elsewhere, covered in vast amounts of classic literature.
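The PCA preparation of an augmented dataset described in this section can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions: the covariance matrix is taken as the centred inputs' cross-product divided by the number of examples, the absolute `threshold` on eigenvalues is arbitrary, and all names (`pca_prepare`, `differentials_to_original`, the toy dataset) are hypothetical.

```python
import numpy as np

def pca_prepare(X, Y, Z, threshold=1.0e-10):
    """Classic PCA preparation of inputs X, labels Y and differentials Z."""
    m = X.shape[0]
    mu_x = X.mean(axis=0)
    mu_y, sigma_y = Y.mean(), Y.std()
    Xc = X - mu_x                                  # centred inputs
    Y_tilde = (Y - mu_y) / sigma_y                 # normalized labels
    eigvals, P = np.linalg.eigh(Xc.T @ Xc / m)     # eigenvectors in columns
    keep = eigvals > threshold                     # drop constant and redundant axes
    P_r, d_r = P[:, keep], eigvals[keep]
    X_tilde = Xc @ P_r / np.sqrt(d_r)              # orthonormal inputs, shape (m, p)
    Z_tilde = Z @ P_r * np.sqrt(d_r) / sigma_y     # chain rule on differentials
    return X_tilde, Y_tilde, Z_tilde, (mu_x, mu_y, sigma_y, P_r, d_r)

def differentials_to_original(Z_tilde_pred, sigma_y, P_r, d_r):
    """Translate differentials from the tilde basis back to original units."""
    return sigma_y * (Z_tilde_pred / np.sqrt(d_r)) @ P_r.T

# Toy dataset: column 2 is the sum of columns 0 and 1, column 3 is constant
rng = np.random.default_rng(3)
m = 2000
base = rng.normal(size=(m, 2))
X = np.column_stack([base[:, 0], base[:, 1], base[:, 0] + base[:, 1], np.full(m, 5.0)])
Y = (X[:, 0] ** 2 + X[:, 1] + 0.5 * X[:, 2])[:, None]
Z = np.column_stack([2.0 * X[:, 0], np.ones(m), 0.5 * np.ones(m), np.zeros(m)])

X_tilde, Y_tilde, Z_tilde, (mu_x, mu_y, sigma_y, P_r, d_r) = pca_prepare(X, Y, Z)
gram = X_tilde.T @ X_tilde / m                     # should be the identity
Z_back = differentials_to_original(Z_tilde, sigma_y, P_r, d_r)
```

In this example, four inputs reduce to two orthonormal columns. Mapping differentials back to the original basis recovers their projection on the retained axes, since the filtered directions carry no variation in the inputs.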

PCA is limited to linear transformation and filtering of linearly redundant inputs. A nonlinear extension is given by autoencoders (AE), a special breed of neural networks with bottleneck layers. AE are to PCA what neural networks are to regression, a powerful extension able to identify lower dimensional nonlinear latent representations, at the cost of nonconvex numerical optimization. Therefore, AE themselves require careful data preparation and are not well suited to prepare data for training other DL models.

Appx 2.2.2 Limitations
Further processing required

In the context of a differential dataset, we cannot stop preprocessing with PCA. Recall, we train by minimization of the cost function (Appx 2.1.1), where derivative errors are scaled by the size of differential labels. We will experience numerical instabilities when some differential columns are identically zero or numerically insignificant. This means the corresponding inputs are irrelevant in the sense that they don’t affect labels in any of the training examples. They really should not be part of the training set; all they do is unnecessarily increase dimension, confuse optimizers and cause numerical errors. But PCA cannot eliminate them because it operates on inputs alone and disregards labels and how inputs affect them. PCA ignores relevance.

Irrelevances may even appear in the orthogonal basis, even when inputs looked all relevant in the original basis. To see that clearly, consider a simple example in dimension 2, where and are sampled from 2 standard Gaussian distributions with correlation and . Differential labels are constant across examples with and . Both differentials are clearly nonzero and both inputs appear to be relevant. PCA projects data on orthonormal axes and with eigenvalues and , and:

so after PCA transformation, one of the columns clearly appears irrelevant. Note that this is a coincidence: we would not see that if correlation between and were different from . PCA is not able to identify axes of relevance; it only identifies axes of variation. By doing so, it may accidentally land on axes with zero or insignificant relevance.
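The effect is easy to reproduce numerically. The sketch below is in the same spirit but uses values assumed here for illustration, not necessarily those of the example above: two standard Gaussian inputs with correlation 0.5, and labels equal to the sum of the two inputs, so that both differentials equal 1 in the original basis. For simplicity, it works with the population covariance matrix rather than a sampled one.

```python
import numpy as np

rho = 0.5                                    # assumed correlation
cov = np.array([[1.0, rho], [rho, 1.0]])     # population covariance of (X1, X2)

# Eigenvalues in ascending order, eigenvectors in the columns of P
eigvals, P = np.linalg.eigh(cov)

# Y = X1 + X2: differentials are constant and nonzero in the original basis
Z = np.array([[1.0, 1.0]])

# Differentials in the PCA basis, by the chain rule on the orthonormal transform
Z_tilde = Z @ P * np.sqrt(eigvals)

# The axis with the smaller eigenvalue lies along (1, -1) / sqrt(2), and the
# labels do not vary at all in that direction: its differential is zero.
irrelevant = np.abs(Z_tilde[0]) < 1.0e-12
```

Both inputs are relevant in the original basis, yet one of the two PCA axes carries zero differentials and would destabilize a cost function scaled by the size of differential labels.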

It appears from this example that further processing is not only necessary: it is also desirable to eliminate irrelevant inputs and combinations of inputs, in the same way that PCA eliminated constant and redundant inputs. Note that we don’t want to replace PCA. We want to train on orthonormal inputs and filter constants and redundancies. What we want is to combine PCA with a similar treatment of relevance.

Limited dimension reduction

The eventual amount of dimension reduction PCA can provide is limited, precisely because it ignores relevance. Consider the problem of a basket option in a correlated Bachelier model, as in section 3.1 of the working paper. The states are realizations of the stock prices at and the labels are option payoffs, conditionally sampled at . Recall that the price at of a basket option expiring at is a nonlinear scalar function (given by Bachelier’s formula) of a linear combination of the stock prices at , where is the vector of weights in the basket. The basket option, whose payoff is measured by , is only affected (in a nonlinear manner) by one linear risk factor of . Although the input space is in dimension , the subspace of relevant risk factors is in dimension