Deep neural networks algorithms for stochastic control problems on finite horizon: numerical applications
Abstract
This paper presents several numerical applications of deep learningbased algorithms that have been introduced in [11]. Numerical and comparative tests using TensorFlow illustrate the performance of our different algorithms, namely control learning by performance iteration (algorithms NNcontPI and ClassifPI), control learning by hybrid iteration (algorithms HybridNow and HybridLaterQ), on the dimensional nonlinear PDEs examples from [7] and on quadratic backward stochastic differential equations as in [6]. We also performed tests on lowdimension control problems such as an option hedging problem in finance, as well as energy storage problems arising in the valuation of gas storage and in microgrid management. Numerical results and comparisons to quantizationtype algorithms Qknn, as an efficient algorithm to numerically solve lowdimensional control problems, are also provided; and some corresponding codes are available on https://github.com/comeh/.
Key words: Deep learning, algorithms, performance iteration, value iterations, MonteCarlo, quantization.
1 Introduction
This paper is devoted to the numerical resolution of discretetime stochastic control problem over a finite horizon. The dynamics of the controlled state process valued in is given by
(1.1) 
where is a sequence of i.i.d. random variables valued in some Borel space , and defined on some probability space equipped with the filtration generated by the noise ( is the trivial algebra), the control is an adapted process valued in , and is a measurable function from into . Given a running cost function defined on and a terminal cost function defined on , the cost functional associated with a control process is
(1.2) 
The set of admissible controls is the set of control processes satisfying some integrability conditions ensuring that the cost functional is welldefined and finite. The control problem, also called Markov decision process (MDP), is formulated as
(1.3) 
and the goal is to find an optimal control , i.e., attaining the optimal value: . Notice that problem (1.1)(1.3) may also be viewed as the time discretization of a continuous time stochastic control problem, in which case, is typically the Euler scheme for a controlled diffusion process.
It is wellknown that the global dynamic optimization problem (1.3) can be reduced to local optimization problems via the dynamic programming (DP) approach, which allows to determine the value function in a backward recursion by
(1.4)  
Moreover, when the infimum is attained in the DP formula (1.4) at any time by , we get an optimal control in feedback form (policy) given by: where is the Markov process defined by
The practical implementation of the DP formula may suffer from the curse of dimensionality and large complexity when the state space dimension and the control space dimension are high. In [11], we proposed algorithms relying on deep neural networks for approximating/learning the optimal policy and then eventually the value function by performance/policy iteration or hybrid iteration with Monte Carlo regressions now or later. This research led to three algorithms, namely algorithms NNcontPI, HybridNow and HybridLaterQ that are recalled in Section 2. In Section 3, we perform some numerical and comparative tests for illustrating the efficiency of our different algorithms, on dimensional nonlinear PDEs examples as in [7] and quadratic Backward Stochastic Differential equations as in [6]. We present numerical results for an option hedging problem in finance, and energy storage problems arising in the valuation of gas storage and in microgrid management. Numerical results and comparisons to quantizationtype algorithms Qknn, introduced in this paper as an efficient algorithm to numerically solve lowdimensional control problems, are also provided. Finally, we conclude in Section 4 with some comments about possible extensions and improvements of our algorithms.
2 Algorithms
We introduce in this section four neural networkbased algorithms for solving the discretetime stochastic control problem (1.1)(1.3). The convergence of these algorithms have been analyzed in detail in our companion paper [11]. We also introduce at the end of this section a quantization and nearestneighborbased algorithm (Qknn) that will be used as benchmark when testing our algorithms on lowdimensional control problems.
We are given a class of deep neural networks (DNN) for the control policy represented by the parametric functions , with parameters , and a class of DNN for the value function represented by the parametric functions: , with parameters . Recall that these DNN functions and are compositions of linear combinations and nonlinear activation functions, see [8].
Additionally, we shall be given a sequence of probability measures on the state space , that we call training measure and denoted , and which should be seen as dataset providers to learn the optimal strategies and the value functions at time .
Remark 2.1 (Training sets design)
Two cases are considered for the choice of the training measure used to generate the training sets on which will be computed the estimates at time . The first one is a knowledgebased selection, relevant when the controller knows with a certain degree of confidence where the process has to be driven in order to optimize her cost functional. The second case is when the controller has no idea where or how to drive the process to optimize the cost functional.
(1) Exploitation only strategy
In the knowledgebased setting, there is no need for exhaustive and expensive (in time mainly) exploration of the state space, and the controller can take a training measure that assigns more points in the region of the state space that is likely to be visited by the optimallydriven process.
In practice, at time , assuming we know that the optimal process is likely to lie in a region , we choose a training measure in which the density assigns lot of weight to the points of , for example , the uniform distribution in .
(2) Explore first, exploit later When the controller has no idea where or how to drive the process to optimize the cost functional, we suggest the latter to adopt the following twostep approach:

Explore first: If the agent has no idea of where to drive the process to receive large rewards, she can always proceed to an exploration step to discover favorable subsets of the state space. To do so, the training sets at time , for , can be built as uniform grids that cover a large part of the state space, or can be chosen uniform on such domain. It is essential to explore far enough to acquire a good understanding of where to drive and where not to drive the process.

Exploit later: The estimates for the optimal controls at time , that come up from the Explore first step, are relatively good in the way that they manage to avoid the poor regions when driving the process. However, the training sets that have been used to compute the estimated optimal control are too sparse to ensure accuracy on the estimation. In order to improve the accuracy, the natural idea is to build new training sets by simulating times the process using the estimates on the optimal strategy computed from the Explore first step, and then proceed to another estimation of the optimal strategies using the new training sets. This trick can be seen as a twostep algorithm that improves the final estimate of the optimal control.
Remark 2.2 (Choice of Neural Networks)
Unless otherwise specified, we use Feedback Neural Networks with two or three hidden layers and +10 neurons per hidden layer. We tried sigmoid, tanh, ReLU and ELU activation functions and noticed that ELU is most often the one providing the best results in our applications. We normalize the input data of each neural network in order to speed up the training of the latter.
Remark 2.3 (Neural Networks Training)
We use the Adam optimizer, as implemented in TensorFlow, with learning rate set to 0.001 or 0.005, which are the default values in TensorFlow, to train the optimal strategy and the value function computed from the algorithms described later. In order to force the weights of biases of the neurons to stay small, we use an regularization with parameter mainly set to 0.01, but the value can change in order to make sure that the regularization term is neither too strong or too weak when added to the loss when training neural networks.
We consider a large enough number of minibatches of size 64 or 128 for the training, depending essentially empirically on the dimension of the problem. We use at least 10 epochs^{e}^{e}eWe denote by epoch one pass of the full training set. and stop the training when the loss computed on a validation set of size 100 stops decreasing. We noticed that taking more than one epoch really improves the quality of the estimates.
Remark 2.4 (Constraints)
The proposed algorithms can deal with state and control constraints at any time, which is useful in several applications:
where is some given subset of . In this case, in order to ensure that the set of admissible controls is not empty, we assume that the sets
are non empty for all , and the DP formula now reads
From a computational point of view, it may be more convenient to work with unconstrained state/control variables, hence by relaxing the state/control constraint and introducing into the running cost a penalty function : , and . For example, if the constraint set is in the form: , for some functions , then one can take as penalty functions:
where are penalization coefficients (large in practice).
2.1 Control Learning by Performance Iteration
We present in this section Algorithm 1, which combines an optimal policy estimation by neural networks and the dynamic programming principle. We rely on the performance iteration procedure, i.e. paths are always recomputed up to the terminal time .
2.1.1 Algorithm NNContPI
Our first algorithm, refereed to as NNContPI, and described in Algorithm 1, is welldesigned for control problems with continuous control space such as or a ball in . The main idea is to first parametrize the optimal control using a neural network in which the activation function for the output layer takes values in the control space. For example, one can take the identity function as activation function for the output layer if the control space is ; or the sigmoid function if the control space is , or more generally a bounded set. Then, it remains to learn the optimal parameters of the neural network using a training set, which is given as initial positions of law and random noises.
2.1.2 Algorithm ClassifPI
In the special case where the control space is finite, i.e., Card with , a classification method can be used: consider a DNN that takes state as input and returns a probability vector with parameters . Algorithm 2, presented below, is based on this idea, and is called ClassifPI.
(2.2)  
Note that, when using Algorithms 1 and 2, the estimate of the optimal strategy at time highly relies on the estimates of the optimal strategy at time , that have been computed previously. In particular, the practitioner who wants to use Algorithms 1 and 2 needs to keep track of the estimates of the optimal strategy at time in order to compute the estimate of the optimal strategy at time .
Remark 2.5
In practice, for , one should approximate the expectations (LABEL:computa) and (LABEL:computPI) by its empirical mean, i.e. consider a sample from for the initial position at time , and other samples from the law , for , in order to generate a finite number of paths on which to estimate the expectations (LABEL:computa) and (LABEL:computPI) using Monte Carlo method.
2.2 Control and value function learning by double DNN
We present in this section two algorithms, which in contrast with Algorithms 1 or 2, only keep track on the estimates of the value function and optimal control at time in order to build an estimate of the value function and optimal control at time .
2.2.1 Regress Now (HybridNow)
The Algorithm 3, refereed to as HybridNow, combines optimal policy estimation by neural networks and dynamic programming principle, and relies on an hybrid procedure between value and performance iteration.
(2.4) 
(2.5) 
2.2.2 Regress Later and Quantization (HybridLaterQ)
The Algorithm 4, called HybridLaterQ, combines regresslater and quantization methods to build estimates of the value function. The main idea behind Algorithm 4 is to first interpolate the value function at time by a set of basis functions, which is in the spirit of the regresslaterbased algorithms, and secondly regress the interpolation at time using quantization. The usual regresslater approach requires the ability to compute closedform conditional expectations, which limits the stochastic dynamics and regression bases that can be considered. The use of quantization avoids this limitation and makes the regresslater algorithm more generally applicable.
Let us first recall the basic ingredients of quantization. We denote by a quantizer of the valued random variable (typically a Gaussian random variable), that is a discrete random variable on a grid defined by
where , , are Voronoi tesselations of , i.e., Borel partitions of the Euclidian space satisfying
The discrete law of is then characterized by
The grid points which minimize the quantization error lead to the socalled optimal quantizer, and can be obtained by a stochastic gradient descent method, known as Kohonen algorithm or competitive learning vector quantization (CLVQ) algorithm, which also provides as a byproduct an estimation of the associated weights . We refer to [13] for a description of the algorithm, and mention that for the normal distribution, the optimal grids and the weights of the Voronoi tesselations are precomputed on the website http://www.quantize.mathsfi.com.
Quantization is mainly used in Algorithm 4 to efficiently approximate the expectations: recalling the dynamics (1.1), the conditional expectation operator for any functional is equal to
that we shall approximate analytically by quantization via:
(2.6) 
(2.7) 
Observe that the solution to (2.7) actually provides a neural network that interpolates . Hence the Algorithm 4 contains an interpolation step, and moreover, any kind of distance in can be chosen as a loss to compute . In (2.7), we decide to take the loss, mainly because it is the one that worked the best in our applications.
Remark 2.7 (Quantization)
In dimension 1, we used the optimal grids and weights with points, to quantize the reduced and centered normal law ; and took 100 points to quantize the reduced and centered normal law in dimension 2, i.e. . All the grids and weights for the optimal quantization of the normal law in dimension are available in http://www.quantize.mathsfi.com for .
2.2.3 Some remarks on Algorithms 3 and 4
As in Remark 2.5, all the expectations written in our pseudocodes in Algorithm 3 and 4 should be approximated by empirical mean using a finite training set.
Algorithms 3 or 4 are quite efficient to use in the usual case where the value function and the optimal control at time are very close to the value function and the optimal control at time , which happens e.g. when the value function and the optimal control are approximations of the time discretization of a continuous in time value function and an optimal control. In this case, it is recommended to follow the two steps procedure:

initialize the parameters (i.e. weights and bias) of the neural network approximations of the value function and the optimal control at time to the ones of the neural network approximations of the value function and the optimal control at time .

take a very small learning rate parameter, for the Adam optimizer, that guarantees the stability of the parameters’ updates from the gradientdescent based learning procedure.
Doing so, one obtains stable estimates of the value function and optimal control, which is desirable. We highlight the fact that this stability procedure cannot be implemented in most of the algorithms proposed in the literature (for example the ones presented in [2] which are based on regressnow, regresslater or quantization).
2.3 Quantization with knearestneighbors (Qknnalgorithm)
Algorithm 5 presents the pseudocode of an algorithm based on the quantization and nearest neighbors methods, called Qknn, which will be the benchmark in all the lowdimensional control problems that will be considered in Section 3 to test NNContPI, ClassifPI, HybridNow and HybridLater. Also, comparisons of Algorithm 5 to other wellknown algorithms on various control problems in lowdimension are performed in [2], which show in particular that Algorithm 5 works very well to solve lowdimensional control problems. Actually, in our experiments, Algorithm 5 always outperforms the other algorithms based either on regressnow or regresslater methods whenever the dimension of the problem is low enough for Algorithm 5 to be feasible.
Just as it has been done in Section 2.2.2, we consider an optimal quantizer of the noise , i.e. a discrete random variable valued in a grid of points in , and with weights . We also consider grids , of points in , which are assumed to properly cover the region of that is likely to be visited by the optimally driven process at time . These grids can be viewed as samples of wellchosen training distributions where more points are taken in the region that is likely to be visited by the optimally driven controlled process (see Remark 2.1 for details on the choice of the training measure).
Remark 2.8
The estimate of the Qvalue at time given by (2.8) is not continuous w.r.t. the control variable , which might cause some stability issues when running Qknn, especially during the optimization procedure (2.9). We refer to Section 3.2.2. in [2] for a detailed presentation of an extension of Algorithm 5 where the estimates of the value function is continuous w.r.t. the control variable.
(2.8) 
(2.9) 
3 Numerical applications
In this section, we test the NeuralNetworksbased algorithms presented in Section 2 on different examples. In highdimension, we took the same example as already considered in [7] so that we can directly compare our results to theirs. In lowdimension, we compared the results of our algorithms to the ones provided by Qknn, which has been introduced in Section 2 as an excellent benchmark for lowdimensional control problems. Some codes, written in Python, TensorFlow or Julia, are available on https://github.com/comeh/.
3.1 A semilinear PDE
We consider the following semilinear PDE with quadratic growth in the gradient:
(3.1) 
By observing that for any ,  , the PDE (3.1) can be written as a HamiltonJacobiBellman equation
(3.2) 
hence associated with the stochastic control problem
(3.3) 
where is the controlled process governed by
is a dimensional Brownian motion, and the control process is valued in . The time discretization (with time step ) of the control problem (3.3) leads to the discretetime control problem (1.1)(1.2)(1.3) with
where is a sequence of i.i.d. random variables of law , and the cost functional
On the other hand, it is known that an explicit solution to (3.1) (or equivalently (3.2)) can be obtained via a HopfCole transformation (see e.g. [6]), and is given by
(3.4) 
We choose to run tests on two different examples that have already been considered in the literature:
Test 1
Some recent numerical results have been obtained in [7] (see Section 4.3 in [7]) when and in dimension (see Table 2 and Figure 3 in [7]). Their method is based on neural network regression to solve the BSDE representation associated to the PDE (3.1), and provide estimates of the value function at time 0 and state 0 for different values of a coefficient . We plotted the results of the HybridNow algorithm in Figure 1. HybridNow took one hour to achieves a relative error of 0.11%, using a 4cores 3GHz intel Core i7 CPU. We want to highlight the fact that the algorithm presented in [7] only needed 330 seconds to provide a relative error of 0.17%. However, to our experience, it is difficult to reduce the relative error from 0.17% to 0.11% using their algorithm. Also, we believe that the computation time of our algorithm can easily be alleviated; some ideas in that direction are discussed in Section 4.
We also considered the same problem in dimension , for which we plotted the first component of w.r.t. time in Figure 2, for five different paths of the Brownian motion, where for each , the agent follows either the naive ( ) or the HybridNow strategy. One can see that both strategies are very similar when the terminal time is far; but the HybridNow strategy clearly forces to get closer to 0 when the terminal time gets closer, in order to reduce the terminal cost.
Test 2
Tests of the algorithms have also been run in dimension 1 with the terminal cost and . This problem has already been considered in [14], where the author used the algorithm presented in [15], which is based on the BSDE representation of the PDE (3.1). Their estimates of the value function at time 0 and state 0, when , are available in [14], and have been reported in column of Table 1. Also, the exact values for the value function have been computed for these values of by Monte Carlo using the closedform formula (3.4), and are reported in the column Bench of Table 1. Tests of the HybridNow and HybridLaterQ algorithms have been run, and the estimates of the value function at time 0 and state are reported in the HybridNow and HybridLaterQ columns. We also tested Qknn and reported its results in column Qknn. Note that Qknn is particularly wellsuited to 1dimensional control problems. In particular, it is not timeconsuming since the dimension of the state space is =1. Actually, it provides the fastest results, which is not surprising since the other algorithms need time to learn the optimal strategy and value function through gradientdescent method at each time step . Moreover, Table 1 reveals that Qknn is the most accurate algorithm on this example, probably because it uses local methods in space to estimate the conditional expectation that appears in the expression of the value.
Y&R  HybridLaterQ  HybridNow  Qknn  Bench  

1.0  0.402  0.456  0.460  0.461  0.464 
0.5  0.466  0.495  0.507  0.508  0.509 
0.1  0.573  0.572  0.579  0.581  0.586 
0.0  0.620  1.000  1.000  1.000  1.000 
We end this paragraph by giving some implementation details for the different algorithms as part of Test 2:

Y&R: The algorithm Y&R converged only when using a Lipschitz version of . The following approximation was used to obtained the results in Table 1:

HybridNow: We used time steps for the timediscretization of . The value functions and optimal controls at time are estimated using neural networks with 3 hidden layers and 10+5+5 neurons.

HybridLaterQ: We used time steps for the timediscretization of . The value functions and optimal controls at time are estimated using neural networks with 3 hidden layers containing 10+5+5 neurons; and 51 points for the quantization of the exogenous noise.

Qknn: We used time steps for the timediscretization of . We take 51 points to quantize the exogenous noise, , for ; and decided to use the 200 points of the optimal grid of for the state space discretization.
The main conclusion regarding the results in this semilinear PDE problem is that HybridNow provides better estimates of the solution to the PDE in dimension =100 than the previous results available in [7] but requires more time to do so.
HybridNow and HybridLater provide better results than those available in [15] to solve the PDE in dimension 2; but are outperformed by Qknn, which is arguably very accurate.
3.2 Option hedging
Our second example comes from a classical hedging problem in finance. We consider an investor who trades in stocks with (positive) price process , and we denote by valued in the amount held in these assets on the period . We assume for simplicity that the price of the riskless asset is constant equal to (zero interest rate). It is convenient to introduce the return process as: , , so that the selffinanced wealth process of the investor with a portfolio strategy , and starting from some capital , is governed by
Given an option payoff , the objective of the agent is to minimize over her portfolio strategies her expected square replication error
where is a convex function on . Assuming that the returns , are i.i.d, we are in a dimensional framework of Section 1 with with valued in , with the dynamics function
the running cost function and the terminal cost . We test our algorithm in the case of a square loss function, i.e. , and when there is no portfolio constraints , and compare our numerical results with the explicit solution derived in [3]: denote by the distribution of , by its mean, and by assumed to be invertible; we then have
where the functions , and are given in backward induction, starting from the terminal condition
and for , by
so that , where is the initial stock price. Moreover, the optimal portfolio strategy is given in feedback form by , where is the function
and is the optimal wealth associated with , i.e., . Moreover, the initial capital that minimizes , and called (quadratic) hedging price is given by
Test
Take , and consider one asset with returns modeled by a trinomial tree:
with , , , . Take , and consider the call option with . The price of this option is defined as the initial value of the portfolio that minimizes the terminal quadratic loss of the agent when the latter follows the optimal strategy associated with the initial value of the portfolio. In this test, we want to determine the price of the call and the associated optimal strategy using different algorithms.
Remark 3.1
The option hedging problem is linearquadratic, hence belongs to the class of problems where the agent has ansatzes on the optimal control and the value function. Indeed, we expect here the optimal control to be affine w.r.t. and the value function to be quadratic w.r.t. . For these kind of problems, the algorithms presented in Section 2 can easily be adapted so that the expressions of the estimators satisfy the ansatzes. See (3.5) and (3.6) for the option hedging problem.
Numerical results
In Figure 3, we plot the value function at time 0 w.r.t , the initial value of the portfolio, when the agent follows the theoretical optimal strategy (benchmark), and the optimal strategy estimated by the HybridNow or HybridLaterQ algorithms. We perform forward Monte Carlo using 10,000 samples to approximate the lower bound of the value function at time 0 (see [9] for details on how to get an approximation of the upperbound of the value function via duality). One can observe that while all the algorithms give a call option price approximately equal to 4.5, HybridLaterQ clearly provides a better strategy than HybridNow to reduce the quadratic risk of the terminal loss.
We plot in Figure 4 three different paths of the value of the portfolio w.r.t the time , when the agent follows either the theoretical optimal strategy (red), or the estimated one using HybridNow (blue) or HybridLaterQ (green). We set for these simulations.
Comments on HybridNow and HybridLaterQ
The Option Hedging problem belongs to the class of the linearquadratic control problems for which we expect the optimal control to be affine w.r.t. and the value function to be quadratic w.r.t. . It is then natural to consider the following classes of controls and functions to properly approximate the optimal controls and the values functions at time =:
(3.5) 
(3.6) 
where describes the parameters (weights+bias) associated with the neural network and describes those associated with the neural network . The notation stands for the transposition, and for the inner product. Note that there are 2 (resp. 3) neurons in the output layer of (resp. ), so that the inner product is welldefined in (3.6) and (3.5).
3.3 Valuation of energy storage
We present a discretetime version of the energy storage valuation problem studied in [4]. We consider a commodity (gas) that has to be stored in a cave, e.g. salt domes or aquifers. The manager of such a cave aims to maximize the real options value by optimizing over a finite horizon the dynamic decisions to inject or withdraw gas as time and market conditions evolve. We denote by the gas price, which is an exogenous realvalued Markov process modeled by the following meanreverting process:
(3.7) 
where , and is the stationary value of the gas price. The current inventory in the gas storage is denoted by and depends on the manager’s decisions represented by a control process valued in : (resp. ) means that she injects (resp. withdraws) gas with an injection (resp. withdrawal) rate (resp. ) requiring (causing) a purchase (resp. sale) of (resp. ), and means that she is doing nothing. The difference between and (resp. and ) indicates gas loss during injection/withdrawal. The evolution of the inventory is then governed by