A neural network based policy iteration algorithm with global superlinear convergence for stochastic games on domains
Abstract. In this work, we propose a class of numerical schemes for solving semilinear HamiltonJacobiBellmanIsaacs (HJBI) boundary value problems which arise naturally from exit time problems of diffusion processes with controlled drift. We exploit policy iteration to reduce the semilinear problem into a sequence of linear Dirichlet problems, which are subsequently approximated by a multilayer feedforward neural network ansatz. We establish that the numerical solutions converge globally in the norm, and further demonstrate that this convergence is superlinear, by interpreting the algorithm as an inexact Newton iteration for the HJBI equation. Moreover, we construct the optimal feedback controls from the numerical value functions and deduce convergence. The numerical schemes and convergence results are then extended to HJBI boundary value problems corresponding to controlled diffusion processes with oblique boundary reflection. Numerical experiments on the stochastic Zermelo navigation problem are presented to illustrate the theoretical results and to demonstrate the effectiveness of the method.
Key words. HamiltonJacobiBellmanIsaacs equations, neural networks, policy iteration, inexact semismooth Newton method, global convergence, superlinear convergence.
AMS subject classifications. 82C32, 91A15, 65M12
1 Introduction
In this article, we propose a class of efficient numerical schemes for solving HamiltonJacobiBellmanIsaacs (HJBI) boundary value problems of the following form:
(1.1) 
where is an open bounded domain, is the (nonconvex) Hamiltonian given by
(1.2) 
and is a boundary operator, i.e., if is the identity operator, (1.1) is an HJBI Dirichlet problem, while if with some functions , (1.1) is an HJBI oblique derivative problem. Above and hereafter, when there is no ambiguity, we shall adopt the summation convention as in [16], i.e., repeated equal dummy indices indicate summation from to .
It is wellknown that the value function of zerosum stochastic differential games in domains satisfies the HJBI equation (1.1), and the optimal feedback controls can be constructed from the derivatives of the solutions (see e.g. [27] and references within; see also Section 6 for a concrete example). In particular, the HJBI Dirichlet problem corresponds to exit time problems of diffusion processes with controlled drift (see e.g. [27, 7, 32]), while the HJBI oblique derivative problem corresponds to state constraints (see e.g. [31, 29]). A nonconvex HJBI equation as above also arises from a penalty approximation of hybrid control problems involving continous controls, optimal stopping and impulse controls, where the HJB (quasi)variational inequality can be reduced to an HJBI equation by penalizing the difference between the value function and the obstacles (see e.g. [23, 42, 34, 35]). As (1.1) in general cannot be solved analytically, it is important to construct effective numerical schemes to find the solution of (1.1) and its derivatives.
The standard approach to solving (1.1) is to “discretize, then optimize”, where one first discretizes the operators in (1.1) by finite difference or finite element methods, and then solves the resulting nonlinear discretized equations by using policy iteration, also known as Howard’s algorithm, or generally (finitedimensional) semismooth Newton methods (see e.g. [14, 5, 40, 34]). However, this approach has the following drawbacks, as do most meshbased methods: (1) it can be difficult to generate meshes and to construct consistent numerical schemes for problems in domains with complicated geometries; (2) the number of unknowns in general grows exponentially with the dimension , i.e., it suffers from Bellman’s curse of dimensionality, and hence this approach is infeasible for solving highdimensional control problems. Moreover, since policy iteration is applied to a fixed finitedimensional equation resulting from a particular discretization, it is difficult to infer whether the same convergence rate of policy iteration remains valid as the mesh size tends to zero ([37, 5]).
Recently, numerical methods based on deep neural networks have been designed to solve highdimensional partial differential equations (PDEs) (see e.g. [30, 11, 3, 12, 22, 39]). Most of these methods reformulate (1.1) into a nonlinear leastsquares problem:
(1.3) 
where is a collection of neural networks with a smooth activation function. Based on collocation points chosen randomly from the domain, (1.3) is then reduced into an empirical risk minimization problem, which is subsequently solved by using stochastic optimization algorithms, in particular the Stochastic Gradient Descent (SGD) algorithm or its variants. Since these methods avoid mesh generation, they can be adapted to solve PDEs in highdimensional domains with complex geometries. Moreover, the choice of smooth activation functions leads to smooth numerical solutions, whose values can be evaluated everywhere without interpolations. In the following, we shall refer to these methods as the Direct Method, due to the fact that there is no policy iteration involved.
We observe, however, that the Direct Method also has several serious drawbacks, especially for solving nonlinear nonsmooth equations including (1.1). Firstly, the nonconvexity of both the deep neural networks and the Hamiltonian leads to a nonconvex empirical minimization problem, for which there is no theoretical guarantee on the convergence of SGD to a minimizer (see e.g. [38]). In practice, training a network with a desired accuracy could take hours or days (with hundreds of thousands of iterations) due to the slow convergence of SGD. Secondly, each SGD iteration requires the evaluation of (with respect to and ) on sample points, but is not necessarily defined everywhere due to the nonsmoothness of . Moreover, evaluating the function (again on a large set of sample points) can be expensive, especially when the sets A and B are of high dimensions, as we do not require more regularity than continuity of the coefficients with respect to the controls, so that approximate optimization may only be achieved by exhaustive search over a discrete coverage of the compact control set. Finally, as we shall see in Remark 4.1, merely including an norm of the boundary data in the loss function (1.3) does not generally lead to convergence of the derivatives of numerical solutions or the corresponding feedback control laws.
In this work, we propose an efficient neural network based policy iteration algorithm for solving (1.1). At the th iteration, , we shall update the control laws by performing pointwise maximization/minimization of the Hamiltonian based on the previous iterate , and obtain the next iterate by solving a linear boundary value problem, whose coefficients involve the control laws . This reduces the (nonconvex) semilinear problem into a sequence of linear boundary value problems, which are subsequently approximated by a multilayer neural network ansatz. Note that compared to Algorithm Ho3 in [5] for discrete HJBI equations, which requires to solve a nonlinear HJB subproblem (involving minimization over the set B) for each iteration, our algorithm only requires to solve a linear subproblem for each iteration, hence it is in general more efficient, especially when the dimension of B is high.
Policy iteration (or Successive Galerkin Approximation) was employed recently in [24, 25] to solve convex HJB equations on the whole space . Specifically, [24] approximates the solution to each linear equation via a separable polynomial ansatz (without concluding any convergence result), while [25] assumes each linear equation is solved sufficiently accurately (without specifying a numerical method), and deduces pointwise linear convergence. In this paper, we propose an easily implementable accuracy criterion for the numerical solutions of the linear PDEs which ensures the numerical solutions converge superlinearly in a suitable function space for nonconvex HJBI equations from an arbitrary initial guess.
Our algorithm enjoys the main advantage of the Direct Method, i.e., it is a meshfree method and can be applied to solve highdimensional stochastic games. Moreover, by utilizing the superlinear convergence of policy iteration, our algorithm effectively reduces the number of pointwise maximization/minimization over the sets A and B, and significantly reduces the computational cost of the Direct Method, especially for high dimensional control sets. The superlinear convergence of policy iteration also helps to eliminate the oscillation caused by SGD, which leads to smoother and more rapidly decaying loss curves in both the training and validation processes (see Figure 6.3). Our algorithm further allows training of the feedback controls on a separate network architecture from that representing the value function, or adaptively adjusting the architecture of networks for each policy iteration.
A major theoretical contribution of this work is the proof of global superlinear convergence of the policy iteration algorithm for the HJBI equation (1.1) in , which is novel even for HJB equations (i.e., one of the sets A and B is singleton). Although the (local) superlinear convergence of policy iteration for discrete equations has been proved in various works (e.g. [33, 37, 14, 5, 42, 40, 34]), to the best of our knowledge, there is no published work on the superlinear convergence of policy iteration for HJB PDEs in a function space, nor on the global convergence of policy iteration for solving nonconvex HJBI equations.
Moreover, this is the first paper which demonstrates the convergence of neural network based methods for the solutions and their (first and second order) derivatives of nonlinear PDEs with merely measurable coefficients (cf. [18, 19, 22, 39]). We will also prove the pointwise convergence of the numerical solutions and their derivatives, which subsequently enables us to construct the optimal feedback controls from the numerical value functions and deduce convergence.
Let us briefly comment on the main difficulties encountered in studying the convergence of policy iteration for HJBI equations. Recall that at the th iteration, we need to solve a linear boundary value problem, whose coefficients involve the control laws , obtained by performing pointwise maximization/minimization of the Hamiltonian . The uncountability of the state space and the nonconvexity of the Hamiltonian require us to exploit several technical measurable selection arguments to ensure the measurability of the controls , which is essential for the welldefinedness of the linear boundary value problems and the algorithm.
Moreover, the nonconvexity of the Hamiltonian prevents us from following the arguments in [37, 14, 5] for discrete HJB equations to establish the global convergence of our inexact policy iteration algorithm for HJBI equations. In fact, a crucial step in the arguments for discrete HJB equations is to use the discrete maximum principle and show the iterates generated by policy iteration converge monotonically with an arbitrary initial guess, which subsequently implies the global convergence of the iterates. However, this monotone convergence is in general false for the iterates generated by the inexact policy iteration algorithm, due to the nonconvexity of the Hamiltonian and the fact that each linear equation is only solved approximately. We shall present a novel analysis technique for establishing the global convergence of our inexact policy iteration algorithm, by first establishing a uniform bound of the norm of the iterates, and then exploiting a compactness argument to demonstrate convergence.
Finally, we remark that the proof of superlinear convergence of our algorithm is significantly different from the arguments for discrete equations. Instead of working with the supnorm for (finitedimensional) discrete equations as in [37, 14, 5, 42, 34], we employ a twonorm framework to establish the generalized differentiability of HJBI operators, where the norm gap is essential as has already been pointed out in [20, 41, 40]. Moreover, by taking advantage of the fact that the Hamiltonian only involves low order terms, we further demonstrate that the inverse of the generalized derivative is uniformly bounded. Furthermore, we include a suitable fractional Sobolev norm of the boundary data in the loss functions used in the training process, which is crucial for the superlinear convergence of the neural network based policy iteration algorithm.
We organize this paper as follows. Section 2 states the main assumptions and recalls basic results for HJBI Dirichlet problems. In Section 3 we propose a policy iteration scheme for HJBI Dirichlet problems and establish its global superlinear convergence. Then in Section 4, we shall introduce the neural network based policy iteration algorithm, establish its various convergence properties, and construct convergent approximations to optimal feedback controls. We extend the algorithm and convergence results to HJBI oblique derivative problems in Section 5. Numerical examples for twodimensional stochastic Zermelo navigation problems are presented in Section 6 to confirm the theoretical findings and to illustrate the effectiveness of our algorithms. The Appendix collects some basic results which are used in this article, and gives a proof for the main result on the HJBI oblique derivative problem.
2 HJBI Dirichlet problems
In this section, we introduce the HJBI Dirichlet boundary value problems of our interest, recall the appropriate notion of solutions, and state the main assumptions on its coefficients. We start with several important spaces used frequently throughout this work.
Let and be a bounded domain in , i.e., a bounded open connected subset of with a boundary. For each integer and real with , we denote by the standard Sobolev space of real functions with their weak derivatives of order up to in the Lebesgue space . When , we use to denote . We further denote by and the spaces of traces from and , respectively (see [17, Proposition 1.1.17]), which can be equivalently defined by using the surface measure and the local coordinate charts of the boundaries (see e.g. [15]).
We shall consider the following HJBI equation with nonhomogeneous Dirichlet boundary data:
(2.1a)  
(2.1b) 
where the nonlinear Hamiltonian is given as in (1.1):
(2.2) 
Throughout this paper, we shall focus on the strong solution to (2.1), i.e., a twice weakly differentiable function satisfying the HJBI equation (2.1a) almost everywhere in , and the boundary values on will be interpreted as traces of the corresponding Sobolev space. For instance, on in (2.1b) means that the trace of is equal to in , where denotes the trace operator (see [15, Proposition 1.1.17]). See Section 5 for boundary conditions involving the derivatives of solutions.
We now list the main assumptions on the coefficients of (2.1).
H. 1.
Let , be a bounded domain, A be a nonempty finite set, and B be a nonempty compact subset of a complete separable metric space. Let , satisfy the following ellipticity condition with a constant :
and satisfy that on , and that is continuous, for all and .
As we shall see in Theorem 3.3, the finiteness of the set A enables us to establish the semismoothness of the HJBI operator (2.1a), whose coefficients involve a general nonlinear dependence on the parameters and . If all coefficients of (2.1a) are in a separable form, i.e., it holds for all that for some functions (e.g. the penalized equation for variational inequalities with bilateral obstacles in [23]), then we can relax the finiteness of A to the same conditions on B.
Moreover, in this work we focus on boundary value problems in a domain to simplify the presentation, but the numerical schemes and their convergence analysis can be extended to problems in nonsmooth convex domains with sufficiently regular coefficients (see e.g. [17, 40]).
We end this section by proving the uniqueness of solutions to the Dirichlet problem (2.1) in . The existence of strong solutions shall be established constructively via policy iteration below (see Theorem 3.5).
Proposition 2.1.
Proof.
Let be two strong solutions to (2.1), we consider the following linear homogeneous Dirichlet problem:
(2.3) 
where we define the following measurable functions: for each ,
with the Hamiltonian defined as in (2.2). Note that the boundedness of coefficients implies that , and . Moreover, one can directly verify that the following inequality holds for all parametrized functions : for all ,
which together with (H.1) leads to the estimate that on the set , and hence we have a.e. . Then, we can deduce from Theorem A.1 that the Dirichlet problem (2.3) admits a unique strong solution and . Since satisfies (2.3) a.e. in and , we see that is a strong solution to (2.3) and hence , which subsequently implies the uniqueness of strong solutions to the Dirichlet problem (2.1). ∎
3 Policy iteration for HJBI Dirichlet problems
In this section, we propose a policy iteration algorithm for solving the Dirichlet problem (2.1). We shall also establish the global superlinear convergence of the algorithm, which subsequently gives a constructive proof for the existence of a strong solution to the Dirichlet problem (2.1).
We start by presenting the policy iteration scheme for the HJBI equations in Algorithm 1, which extends the policy iteration algorithm (or Howard’s algorithm) for discrete HJB equations (see e.g. [14, 5, 34]) to the continuous setting.

Choose an initial guess in , and set .

Given the iterate , update the following control laws: for all ,
(3.1) (3.2) 
Solve the linear Dirichlet problem for :
(3.3) where for .

If , then terminate with outputs and , otherwise increment by one and go to step 2.
The remaining part of this section is devoted to the convergence analysis of Algorithm 1. For notational simplicity, we first introduce two auxiliary functions: for each with , we shall define the following functions
(3.4)  
(3.5) 
We then recall several important concepts, which play a pivotal role in our subsequent analysis. The first concept ensures the existence of measurable feedback controls and the wellposedness of Algorithm 1.
Definition 3.1.
Let be a measurable space, and let and be topological spaces. A function is a Carathéodory function if:

for each , the function is measurable; and

for each , the function is continuous.
Remark 3.1.
It is wellknown that if are two complete separable metric spaces, and is a Carathéodory function, then for any given measurable function , the composition function is measurable (see e.g. [2, Lemma 8.2.3]). It is clear that (H.1) implies that the coefficients are Carathéodory functions (with and ). Moreover, one can easily check that both and are Carathéodory functions, i.e., (resp. ) is continuous in (resp. ) and measurable in (see Theorem A.3 for the measurability of in ).
We now recall a generalized differentiability concept for nonsmooth operators between Banach spaces, which is referred as semismoothness in [41] and slant differentiability in [9, 20]. It is wellknown (see e.g. [5, 40, 34]) that the HJBI operator in (2.1a) is in general nonFréchetdifferentiable, and this generalized differentiability is essential for showing the superlinear convergence of policy iteration applied to HJBI equations.
Definition 3.2.
Let be defined on a open subset of the Banach space with images in the Banach space . In addition, let be a given a setvalued mapping with nonempty images, i.e., for all . We say is semismooth in if for any given , we have that is continuous near , and
The setvalued mapping is called a generalized differential of in .
Remark 3.2.
As in [41], we always require that has a nonempty image, and hence the semismooth of in shall automatically imply that the image of is nonempty on .
Now we are ready to analyze Algorithm 1. We first prove the semismoothness of the Hamiltonian defined as in (2.2), by viewing it as the composition of a pointwise maximum operator and a family of HJB operators parameterized by the control . Moreover, we shall simultaneously establish that, for each iteration, one can select measurable control laws to ensure the measurability of the controlled coefficients in the linear problem (3.3), which is essential for the wellposedness of strong solutions to (3.3), and the welldefinedness of Algorithm 1.
The following proposition establishes the semismoothness of a parameterized family of firstorder HJB operators, which extends the result for scalarvalued HJB operators in [40]. Moreover, by taking advantage of the fact that the operators involve only firstorder derivatives, we are able to establish that they are semismoothness from to for some (cf. [40, Theorem 13]), which is essential for the superlinear convergence of Algorithm 1.
Proposition 3.1.
Suppose (H.1) holds. Let be a given constant satisfying if and if , and let be the HJB operator defined by
Then is Lipschitz continuous and semismooth in with a generalized differential
defined as follows: for any , we have
(3.6) 
where is any jointly measurable function such that for all and ,
(3.7) 
Proof.
Since A is a finite set, we shall assume without loss of generality that, the Banach space is endowed with the usual product norm , i.e., for all , . Note that the Sobolev embedding theorem shows that the following injections are continuous: , for all , and , for all . Thus for any given satisfying the conditions in Proposition 3.1, we can find such that the injection is continuous. Then, the boundedness of implies that the mappings and are welldefined, and is Lipschitz continuous.
Now we show that the mapping has a nonempty image from to , where we choose such that the injection is continuous, and naturally extend the operators and from to . For each , we consider the Carathéodory function such that for all , where is defined by (3.5). Theorem A.3 shows there exists a function satisfying (3.7), i.e.,
and is jointly measurable with respect to the product algebra on . Hence is nonempty for all .
We proceed to show that the operator is in fact semismooth from to , which implies the desired conclusion due to the continuous embedding . For each , we denote by the th component of , and by the th component of . Theorem A.4 and the continuity of in show that for each , the setvalued mapping
is upper hemicontinuous, from which, by following precisely the steps in the arguments for [40, Theorem 13], we can prove that is semismooth. Then, by using the fact that a direct product of semismooth operators is again semismooth with respect to the direct product of the generalized differentials of the components (see [41, Proposition 3.6]), we can deduce that is semismooth with respect to the generalized differential and finishes the proof. ∎
We then establish the semismoothness of a general pointwise maximum operator, by extending the result in [20] for the maxfunction .
Proposition 3.2.
Let be a given constant, A be a finite set, and be a bounded subset of . Let be the pointwise maximum operator such that for each ,
(3.8) 
Then is semismooth in with a generalized differential
defined as follows: for any , we have
where is any measurable function such that
(3.9) 
Moreover, is uniformly bounded (in the operator norm) for all .
Proof.
Let the Banach space be endowed with the product norm defined as in the proof of Proposition 3.1. We first show the mappings and are welldefined, has nonempty images, and is uniformly bounded for .
The finiteness of A implies that any can also be viewed as a Carathéodory function . Hence for any given , we can deduce from Theorem A.3 the existence of a measurable function satisfying (3.9). Moreover, for any given measurable function and , the function remains Lebesgue measurable (see Remark 3.1). Then, for any given with , one can easily check that , and , which subsequently implies that and are welldefined, and the image of is nonempty on . Moreover, for any , Hölder’s inequality leads to the following estimate:
which shows that for all .
Now we prove by contradiction that the operator is semismooth. Suppose there exists a constant and functions such that as , and
(3.10) 
where for each , is defined with some measurable function . Then, by passing to a subsequence, we may assume that for all , the sequence converges to zero pointwise a.e. in , as .
For notational simplicity, we define for all and . Then for a.e. , we have for all , for all . By using the finiteness of A and the convergence of , it is straightforward to prove by contradiction that for all such , for all large enough .
We now derive an upper bound of the lefthand side of (3.10). For a.e. , we have
from any . Thus, for each , we have for a.e. that,
where, by applying Theorem A.3 twice, we can see that both the setvalued mapping and the function are measurable.
We then introduce the set for each . The measurability of the setvalued mapping implies the associated distance function is a Carathéodory function (see [1, Theorem 18.5]), which subsequently leads to the measurability of for all . Hence we can deduce that
which leads to the following estimate:
where we have used the bounded convergence theorem and the fact that for a.e. , for all large enough . This contradicts to the hypothesis (3.10), and hence finishes our proof. ∎
Now we are ready to conclude the semismoothness of the HJBI operator. Note that the argument in [40] does not apply directly to the HJBI operator, due to the nonconvexity of the Hamiltonian defined as in (2.2).
Theorem 3.3.
Proof.
Note that we can decompose the HJBI operator into , where is the linear operator , is the HJB operator defined in Proposition 3.1, is the pointwise maximum operator defined in Proposition 3.2, and is a constant satisfying if , and if .
Proposition 3.1 shows that is Lipschitz continuous and semismooth with respect to the generalized differential defined by (3.6), while Proposition 3.2 shows that is semismooth with respect to the uniformly bounded generalized differential defined by (3.8). Hence, we know the composed operator is semismooth with respect to the composition of the generalized differentials (see [41, Proposition 3.8]), i.e., for all . Consequently, by using the fact that is Fréchet differentiable with the derivative , we can conclude from Propositions 3.1 and 3.2 that is semismooth on , and that (3.11) is a desired generalized derivative of at . ∎
The above characterization of the generalized differential of the HJBI operator enables us to demonstrate the superlinear convergence of Algorithm 1 by reformulating it as a semismooth Newton method for an operator equation.
Theorem 3.4.
Proof.
Note that the Dirichlet problem (2.1) can be written as an operator equation with the following operator