Deep Fictitious Play for Stochastic Differential Games
Abstract
In this paper, we apply the idea of fictitious play to design deep neural networks (DNNs), and develop deep learning theory and algorithms for computing the Nash equilibrium of asymmetric N-player non-zero-sum stochastic differential games. We refer to the resulting multi-stage learning process as deep fictitious play. Specifically, at each stage, we let each individual player optimize her own payoff subject to the other players’ previous actions, which is equivalent to solving N decoupled stochastic control optimization problems, each approximated by DNNs. Therefore, the fictitious play strategy leads to a structure consisting of N DNNs, which only communicate at the end of each stage. The resulting deep learning algorithm based on fictitious play is scalable, parallel and model-free, i.e., using GPU parallelization, it can be applied to any N-player stochastic differential game with different symmetries and heterogeneities (e.g., existence of major players). We illustrate the performance of the deep learning algorithm by comparing it to the closed-form solution of the linear-quadratic game. Moreover, we prove the convergence of fictitious play over a small time horizon, and verify that the convergent limit forms an open-loop Nash equilibrium. We also discuss extensions to other strategies designed upon fictitious play and to closed-loop Nash equilibria at the end.
Keywords: Stochastic differential game, fictitious play, deep learning, Nash equilibrium
1 Introduction
In stochastic differential games, a Nash equilibrium refers to a set of strategies from which no player has an incentive to deviate. Finding a Nash equilibrium is one of the core problems in noncooperative game theory. However, due to the notorious intractability of N-player games, computing the Nash equilibrium has been shown to be extremely time-consuming and memory-demanding, especially for large N [16]. On the other hand, a rich literature on game theory has been developed to study the consequences of strategies on interactions between a large group of rational “agents”, e.g., systemic risk caused by inter-bank borrowing and lending, price impacts imposed by agents’ optimal liquidation, and market prices from monopolistic competition. This makes it crucial to develop efficient theory and fast algorithms for computing the Nash equilibrium of N-player stochastic differential games.
Deep neural networks with many layers have recently been shown to perform remarkably well in artificial intelligence (e.g., [2, 39]). The underlying idea is to use compositions of simple functions to approximate complicated ones, and there are approximation theorems showing that a wide class of functions on compact subsets can be approximated by a single-hidden-layer neural network (e.g., [53]). This opens the possibility of solving high-dimensional systems using deep neural networks, and in fact, these techniques have been successfully applied to stochastic control problems [20, 29, 1].
In this paper, we propose to build deep neural networks by using strategies of fictitious play, and develop deep learning algorithms for computing the Nash equilibrium of asymmetric N-player non-zero-sum stochastic differential games. We consider a stochastic differential game with N players, and each player controls her own private state by taking an action in the control set . The dynamics of the controlled state process on are given by
(1.1) 
where are dimensional independent Brownian motions, are deterministic functions: . The dynamics are coupled since all private states and all the controls
Each player’s control lives in the space of progressively measurable valued processes satisfying the integrability condition:
(1.2) 
Using the strategy , the cost associated to player is of the form:
(1.3) 
where the running cost and terminal cost are deterministic measurable functions.
In solving stochastic differential games, the notion of optimality of common interest is the Nash equilibrium. A set of strategies is called a Nash equilibrium if
(1.4) 
where represents strategies of players other than the th one
In fact, depending on the space where one searches for actions (the information structure available to the players), the types of equilibria include open-loop (), closed-loop (), and closed-loop in feedback form (). We start with the setup (1.4), which corresponds to the open-loop case, and shall comment on the generalization of deep learning theory to closed-loop cases in Section 5.3.
An alternative method of solving N-player stochastic differential games is via mean-field games, introduced by Lasry and Lions in [36, 37, 38] and by Huang, Malhamé and Caines in [28, 27]. The idea is to approximate the Nash equilibrium by the solution of the mean-field equilibrium (the formal limit as N → ∞) under mild conditions [9], which leads to an approximation error of order assuming that the players are indistinguishable, i.e., all coefficients are free of . We refer to the books [10, 11] and the references therein for further background on mean-field games. However, beyond the case of a continuum of infinitesimal agents with or without major players, the mean-field equilibrium may not be a good approximation in general. In addition, the mean-field game often exhibits multiple equilibria, some of which do not correspond to the limit of N-player games as N → ∞, e.g., in the optimal stopping games [49]. Moreover, when the number of players is of moderate size (e.g., ), the approximation error made by the mean-field equilibrium is large, while direct solvers based on forward-backward stochastic differential equations (FBSDEs) or on partial differential equations (PDEs) are still computationally unaffordable. Therefore, new theory and algorithms for solving the N-player game are needed.
The idea proposed in this paper is natural and motivated by fictitious play, a learning process in game theory first introduced by Brown in the static case [6, 7] and recently adapted to the mean-field case by Cardaliaguet [8, 5]. In fictitious play, after some arbitrary initial moves at the first stage, the players myopically choose their best responses against the empirical strategy distribution of others’ actions at every subsequent stage. It is hoped that such a learning process will converge and lead to a Nash equilibrium. In fact, Robinson [56] showed that this holds for zero-sum games, and Miyazawa [43] extended it to 2×2 games. However, Shapley’s famous counterexample [57] indicates that this is not always true. Since then, many attempts have been made to identify classes of games where global convergence holds [42, 46, 47, 24, 14, 3, 23], and where the process breaks down [31, 44, 19, 34], to name a few.
Based on fictitious play, we propose a deep learning theory and algorithm for computing open-loop Nash equilibria. Unlike closed-loop strategies in feedback form, which can be reformulated as the solution to coupled Hamilton-Jacobi-Bellman (HJB) equations by the dynamic programming principle (DPP), open-loop strategies are usually identified through FBSDEs. The existence of explicit solutions to both equations depends heavily on the symmetry of the problem; in particular, in most cases where explicit solutions are available, the players are statistically identical. Traditional ways of solving FBSDEs run into the curse of dimensionality. Motivated by the impressive results obtained by deep learning on various challenging problems [2, 35, 39], we use deep neural networks to overcome the dimensionality problem for moderately large N and asymmetric games. We first decompose the game into N stochastic control subproblems, which are conditionally independent given past play at each stage. Since we first focus on open-loop equilibria (as opposed to closed-loop ones), in each subproblem the strategies are considered as general progressively measurable processes (as opposed to functions of ). Therefore, without the feedback effects, one can design a deep neural network to solve each stochastic control subproblem individually. The control at each time step is approximated by a feedforward subnetwork, whose inputs are the initial states and noises, in line with the definition of open-loop equilibria. For player ’s control problem, are generated using strategies from the past, i.e., they are considered fixed while player optimizes her own control.
Main contribution. The contribution of deep fictitious play is threefold. Firstly, our algorithm is scalable: in each round of play, the subproblems can be solved in parallel, which can be accelerated using multiple GPUs. Secondly, we propose a deep neural network for solving general stochastic control problems where strategies are general processes instead of being in feedback form. In the absence of the DPP, algorithms from reinforcement learning are no longer available. We approximate the optimal control directly, in contrast to approximating value functions [54]. Thirdly, the algorithm can be applied to asymmetric games, as each player has a corresponding neural network.
Related literature. Most of the literature on deep learning and reinforcement learning algorithms for stochastic control problems uses the DPP, with which the problem can be solved backwardly, i.e., one finds the optimal control at the terminal time and then decides the previous decisions. Among them, we mention the recent works [29, 1], which approximate the optimal policy by neural networks in the spirit of deep reinforcement learning, and obtain the approximated optimal policy in a backward manner. In our algorithm, by contrast, we stack these subnetworks together to form a deep network and train them simultaneously. In fact, our structure is inspired by Han and E [20], who also train a stack of subnetworks but for the sake of feedback-form controls.
Organization of the paper. In Section 2, we systematically introduce the deep fictitious play theory and the implementation of the deep learning algorithms using Keras with GPU acceleration. In Section 3, we apply deep fictitious play to linear-quadratic games, and prove the convergence of fictitious play over a small time horizon, with the limit forming an open-loop Nash equilibrium. The performance of the deep learning algorithms is presented in Section 4, where we simulate stochastic differential games with a large number of players (e.g., ). We make concluding remarks, and discuss extensions to other strategies of fictitious play and to closed-loop cases in Section 5.
2 Deep fictitious play
In this section, we describe the theory and algorithms of deep fictitious play, which, as the name suggests, builds on fictitious play and deep learning. We first summarize the notation that shall be used below. Given a probability space , we consider

, a vector of dimensional independent Brownian motions;

, the augmented filtration generated by ;

, the space of all progressively measurable valued stochastic processes such that .

, the space of admissible strategies, i.e., elements in satisfy (1.2). , a product of copies of ;

, a collection of all players’ strategy profiles. With a negative superscript, means the strategy profiles excluding player ’s. If a nonnegative superscript appears (e.g., ), this N-tuple stands for the strategies from stage . When both exist, is a tuple representing strategies excluding player at stage . We use the same notation for other stochastic processes (e.g., ).
We assume that the players start with an initial smooth belief . At the beginning of stage , is observable by all players. Player then chooses the best response to her beliefs about her opponents, described by their play at the previous stage . Then, player faces an optimization problem:
(2.1) 
where are state processes controlled by :
(2.2) 
Denote by the minimizer in (2.1):
(2.3) 
We assume that exists throughout the paper. More precisely, is player ’s optimal strategy at stage when her opponents’ dynamics (1.1) evolve according to , . All players find their best responses simultaneously, which together form .
Remark 2.1.
Note that the above learning process is slightly different from the usual simultaneous fictitious play, where the belief is described by the time average of past play: . We shall discuss this in more detail in Section 5.1.
As discussed in the introduction, in general one cannot expect the players’ actions to always converge. However, if the sequence ever admits a limit, denoted by , we expect it to form an open-loop Nash equilibrium under mild assumptions. Intuitively, in the limiting situation, when all other players are using strategies , , by some stability argument, player ’s optimal strategy for the control problem (2.1) should be , meaning that she will not deviate from , which makes an open-loop equilibrium by definition. Therefore, finding an open-loop Nash equilibrium consists of iterating this play until it converges.
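As a minimal illustration of this iteration, the sketch below runs the previous-stage best-response scheme on a toy static game rather than on a stochastic differential game. Everything in it is an assumption made for illustration: each player picks a scalar action, player i's cost is 0.5 * (a_i - b_i - c * mean of the others' actions)^2, and the names `cost`, `best_response`, `fictitious_play` and `max_unilateral_gain` are hypothetical. For |c| < 1 the best-response map of this toy game is a contraction, so the iteration converges; no such guarantee is claimed for general games.

```python
def cost(i, a, b, c):
    """Toy cost of player i: 0.5 * (a_i - b_i - c * mean of the others' actions)^2."""
    others = [a[j] for j in range(len(a)) if j != i]
    return 0.5 * (a[i] - b[i] - c * sum(others) / len(others)) ** 2


def best_response(i, a_prev, b, c):
    """Minimizer of player i's toy cost when the opponents repeat a_prev."""
    others = [a_prev[j] for j in range(len(a_prev)) if j != i]
    return b[i] + c * sum(others) / len(others)


def fictitious_play(b, c, a0, tol=1e-10, max_stages=200):
    """All players best-respond simultaneously to the previous stage's actions."""
    a = list(a0)
    for stage in range(1, max_stages + 1):
        a_next = [best_response(i, a, b, c) for i in range(len(a))]
        gap = max(abs(x - y) for x, y in zip(a_next, a))
        a = a_next
        if gap < tol:  # actions stopped moving: treat as converged
            break
    return a, stage


def max_unilateral_gain(a, b, c):
    """Largest cost reduction any single player can obtain by deviating alone."""
    gains = []
    for i in range(len(a)):
        deviated = list(a)
        deviated[i] = best_response(i, a, b, c)
        gains.append(cost(i, a, b, c) - cost(i, deviated, b, c))
    return max(gains)


limit, stages = fictitious_play(b=[1.0, 2.0, 3.0], c=0.5, a0=[0.0, 0.0, 0.0])
print(limit, stages, max_unilateral_gain(limit, b=[1.0, 2.0, 3.0], c=0.5))
```

At the limit, no player can lower her cost by deviating alone, which is the defining property of a Nash equilibrium; in the differential game the scalar actions are replaced by control processes and each best response is itself a stochastic control problem.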
We here give an argument under a general problem setup using the Pontryagin stochastic maximum principle (SMP). For simplicity, we present the case of uncontrolled volatility without common noise: , , , and refer to [11, Chapter 1] for generalizations. The Hamiltonian for player at stage is defined by:
(2.4) 
where the dependence on is introduced by . We assume all coefficients are continuously differentiable with respect to ; is convex and continuously differentiable with respect to ; is convex; the function is convex almost surely in . By the sufficient part of SMP, we look for a control of the form:
(2.5) 
and solve the resulting forward-backward stochastic differential equations (FBSDEs):
(2.6) 
If there exists a solution , then an optimal control to problem (2.1) is given by plugging the solution into the function :
(2.7) 
Now suppose that (2.6) is solvable and that the sequence given in (2.7) converges to as . Denote by the solution of (2.6) with being replaced by . If the system possesses stability, then is also the limit of . In this case, given other players using , the optimal control of player is
(2.8) 
where we have used the stability of (2.6) and the continuous dependence of on the parameter for the first identity, the solvability of (2.6) for the second identity, and the convergence of for the last identity. Therefore, one can put appropriate conditions on to ensure these, and we refer to [52, 51, 40, 41] for detailed discussions. Note that all assumptions are satisfied in the case of linear-quadratic games, and thus all the above arguments go through. We will give more details in Section 3.
In general, problem (2.3) is not analytically tractable, and one needs to solve it numerically. Next, we present a novel DNN architecture and a deep learning algorithm that can be parallelized. We start with a brief introduction to deep learning, followed by the detailed deep fictitious play algorithm.
2.1 Preliminaries on deep learning
Inspired by neurons in human brains, a neural network (NN) is designed for computers to learn from observational data. It has become an effective tool in many fields including computer vision, speech recognition, social network filtering, image analysis, etc., where results produced by NNs are comparable or even superior to those of human experts. An example of NNs performing well is image classification, where the task is to identify which of a set of categories a new observation belongs to, on the basis of a training set of data containing observations of known category membership. Denote by the observation and by its category. This problem consists of efficient and accurate learning of the mapping from observations to categories , which can be complicated and nontrivial. Thanks to the universal approximation theorem and the Kolmogorov-Arnold representation theorem [15, 33, 25], NNs are able to provide good approximations to nontrivial mappings.
Our goal is to use deep neural networks to solve the stochastic control problem (2.3). NNs are made by stacking layers one on top of another. Layers with different functions or neuron structures have different names, including fully-connected layers, convolutional layers, pooling layers, recurrent layers, etc. As our Algorithm 1 will focus on fully-connected layers, we here give an example of a feedforward NN using fully-connected layers in Figure 1. Nodes in the figure represent neurons and arrows represent the information flow. As shown, information is constantly “fed forward” from one layer to the next. The first layer (leftmost column) is called the input layer, and the last layer (rightmost column) is called the output layer. Layers in between are called hidden layers, as they have no connection with the external world. In this case, there is only one hidden layer with four neurons.
We now explain how information is processed in NNs. For fully-connected layers, every neuron has two kinds of parameters, the weights and the bias . Each layer can choose an activation function; an input passing through a neuron then gives . In the above example, the data fed to neuron outputs , , which yields as the input of neuron . The final output is . In traditional classification problems, the categorical information associated with the input is known, and the optimal weights and biases are chosen to minimize a loss function :
(2.9) 
where is the output of the NNs, as functions of , and is given from the data. The process of finding optimal parameters is called the training of an NN.
The activation function and loss function are chosen according to the user’s preference; common choices are sigmoid and ReLU for the activation, and mean squared error and cross entropy for the loss in (2.9). Finding the optimal parameters in (2.9) is in general a high-dimensional optimization problem, usually solved by various stochastic gradient descent methods (e.g., Adam [32, 55], NADAM [17]). For further discussions, we refer to [26, Section 2.1] and [29, Section 2.2].
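To make the forward pass concrete, here is a minimal sketch of a fully-connected network with a single hidden layer of four neurons, evaluated with a ReLU hidden activation, a sigmoid output activation and a mean squared error loss. The input dimension (3), output dimension (1), the parameter values and the function names are assumptions chosen for illustration; training, i.e., adjusting the parameters by stochastic gradient descent to reduce the loss, is not shown.

```python
import math


def relu(z):
    return max(z, 0.0)


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def dense(x, weights, biases, activation):
    """Fully-connected layer: neuron j outputs activation(w_j . x + b_j)."""
    return [activation(sum(w * xi for w, xi in zip(row, x)) + b)
            for row, b in zip(weights, biases)]


def forward(x, params):
    """Input -> hidden layer (4 neurons, ReLU) -> output layer (1 neuron, sigmoid)."""
    hidden = dense(x, params["W1"], params["b1"], relu)
    return dense(hidden, params["W2"], params["b2"], sigmoid)


def mean_squared_error(prediction, target):
    return sum((p - t) ** 2 for p, t in zip(prediction, target)) / len(target)


# Hypothetical parameters: 3 inputs, 4 hidden neurons, 1 output.
params = {
    "W1": [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0], [-1.0, -1.0, -1.0]],
    "b1": [0.0, 0.0, 0.0, 0.0],
    "W2": [[1.0, -1.0, 0.0, 5.0]],
    "b2": [1.0],
}
output = forward([1.0, 2.0, 3.0], params)
loss = mean_squared_error(output, [1.0])
print(output, loss)
```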
However, solving (2.3) is not in line with the above procedure, in the sense that there is no target category assigned to each input , and consequently, the loss function is not a distance between the network output and . We aim at approximating the optimal strategy at each stage by feedforward NNs. What we actually exploit is the NN’s ability to approximate complex relations by compositions of simple functions (by stacking fully-connected layers) and to find the (sub)optimizer with its well-developed built-in stochastic gradient descent (SGD) solvers. We shall further explain the structure of the NNs in the following section.
2.2 Deep learning algorithms
We introduce the algorithms of deep learning based on fictitious play by describing two key parts as below.
Part I: solve a stochastic control problem using DNN
We in fact solve a time-discretized version of problem (2.3). Partition into equally spaced intervals, with time step . Denote by the “discretized” filtration with . A discrete-time analogue of (2.3) is:
(2.10) 
where
(2.11) 
and each entry in follows the Euler scheme of (1.1) associated to the strategy if , and to if :
(2.12)  
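The Euler update can be sketched as follows. This is an illustration under stated assumptions, not the paper's implementation: states are scalar, the volatility `sigma` is a constant, `drift` is a user-supplied hypothetical function, and each control is a plain Python callable of the step index, the initial states and the noise increments seen so far (mirroring the open-loop information structure). Player `i` uses the candidate control being optimized, while every other player uses her control from the previous stage.

```python
import math
import random


def simulate_states(x0, prev_controls, i, candidate, n_steps, dt, drift, sigma, rng):
    """Euler scheme for all players' states on a uniform time grid.

    Player i uses `candidate`; every other player j uses prev_controls[j].
    Each control maps (step index, initial states, past noise) to an action.
    """
    n = len(x0)
    x = list(x0)
    past_noise = [[] for _ in range(n)]
    path = [list(x)]
    for k in range(n_steps):
        alpha = [(candidate if j == i else prev_controls[j])(k, x0, past_noise)
                 for j in range(n)]
        # Brownian increments over one time step have variance dt.
        dw = [rng.gauss(0.0, math.sqrt(dt)) for _ in range(n)]
        x = [x[j] + drift(j, x, alpha) * dt + sigma * dw[j] for j in range(n)]
        for j in range(n):
            past_noise[j].append(dw[j])
        path.append(list(x))
    return path


def own_control_drift(j, x, alpha):
    """Hypothetical drift: each player's state moves at the rate of her own control."""
    return alpha[j]


# Noise switched off (sigma = 0) so that the example is deterministic.
previous = [lambda k, x0, noise: 1.0 for _ in range(3)]
path = simulate_states([0.0, 0.0, 0.0], previous, i=0,
                       candidate=lambda k, x0, noise: 2.0,
                       n_steps=4, dt=0.25, drift=own_control_drift,
                       sigma=0.0, rng=random.Random(0))
print(path[-1])
```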
In the discrete setting, is interpreted as . Our task is to approximate the functional dependence of the control on noises. Similar to the strategy used in [20], we implement this by a multilayer feedforward subnetwork:
(2.13) 
where denotes the collection of all weights and biases in the subnetwork for player . Then, at stage , the optimization problem for player becomes
(2.14) 
Denote by the minimizer of (2.14); the approximated optimal strategy is then given by (2.13) evaluated at . Note that even though we only write explicitly the dependence of ’s on , it affects all ’s through the interactions in (2.12). In fact, depends on , for all . Therefore, finding the gradient when minimizing (2.14) is a nontrivial task. Thanks to a key feature of NNs, this computation can be done via a forward-backward propagation algorithm derived from the chain rule [48].
The architecture of the NN for finding is presented in Figure 2: “InputLayer” are inputs of this network; “Rcost” and “Tcost”, representing running and terminal cost, contribute to the total cost ; “Sequential” is a multilayer feedforward subnetwork for control approximation at each time step; “Concatenate” is an auxiliary layer combining some of previous layers as inputs of “Sequential”.
There are three main kinds of information flows in the network for each period , :

given by the “Sequential” layer. It is an layer feedforward subnetwork that approximates the control of player at time , and it contains the parameters to be optimized.

given by the “Rcost” layer. This layer serves two purposes. Firstly, it computes the running cost at time using , where is produced by the previous step. The cost is then added to the final output. Secondly, it updates the state values via the dynamics (2.12), using for player and using for player , which are inputs of the network. No parameter is optimized in this layer.

given by the “Concatenate” layer. This layer combines the two previous ones, preparing the input of the “Sequential” layer. No parameter is optimized in this layer.
At time , the terminal cost is calculated using and added to the final output via “Tcost” layer. With these preparations, we introduce the deep fictitious play as below.
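The information flow just described can be sketched for a single sample path of a single player as follows. This is a simplified stand-in under loud assumptions: the “Sequential” subnetwork is replaced by a hypothetical affine map, the state is scalar, other players' states are omitted, and `running_cost`, `terminal_cost` and `drift` are user-supplied placeholder functions rather than the coefficients of the paper.

```python
def affine_subnetwork(features, weights, bias):
    """Stand-in for the "Sequential" subnetwork: an affine map of its inputs."""
    return bias + sum(w * f for w, f in zip(weights, features))


def total_cost(x0, dw, subnets, running_cost, terminal_cost, drift, sigma, dt):
    """Forward pass for one sample path of one player's control problem."""
    x, cost, seen_noise = x0, 0.0, []
    for k, (weights, bias) in enumerate(subnets):
        features = [x0] + seen_noise                         # "Concatenate"
        alpha = affine_subnetwork(features, weights, bias)   # "Sequential"
        cost += running_cost(x, alpha) * dt                  # "Rcost": add running cost
        x = x + drift(x, alpha) * dt + sigma * dw[k]         # "Rcost": update the state
        seen_noise.append(dw[k])
    return cost + terminal_cost(x)                           # "Tcost"


# Two time steps; each subnetwork has zero weights and bias -1 (constant control).
value = total_cost(
    x0=1.0,
    dw=[0.0, 0.0],
    subnets=[([0.0], -1.0), ([0.0, 0.0], -1.0)],
    running_cost=lambda x, a: 0.5 * a * a,
    terminal_cost=lambda x: 0.5 * x * x,
    drift=lambda x, a: a,
    sigma=1.0,
    dt=0.5,
)
print(value)
```

In an actual training loop, this quantity would be averaged over many sampled noise paths and minimized over the subnetwork parameters by a gradient-based solver.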
Part II: find an equilibrium by fictitious play
Here we use a flowchart to describe the algorithm of deep fictitious play (see Algorithm 1).
2.3 Implementation
Computing environment. Algorithm 1 described in Section 2.2.2 is implemented in Python using the high-level neural network API Keras [13]. Numerical examples will be presented in Section 4. All experiments are performed using Amazon EC2 services, which provide a variety of instances for computing acceleration. All computations use NVIDIA K80 GPUs with 12 GiB of GPU memory on a Deep Learning Amazon Machine Image running Ubuntu 16.04.
Parallelizability. As N becomes relatively large, to keep the computation manageable, one can distribute Step to several GPUs, that is, assign each available GPU the task of training a subset of the neural networks, where this subset is fixed from stage to stage. This will reduce the computation time significantly, as peer-to-peer GPU communication is not needed in the designed algorithm.
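One simple way to fix such a split is sketched below. The round-robin rule and the function name `assign_players_to_devices` are assumptions made for illustration; the text only requires that the subset handled by each GPU stay the same from stage to stage.

```python
def assign_players_to_devices(n_players, n_devices):
    """Round-robin split of player indices; the split is reused at every stage."""
    assignment = {device: [] for device in range(n_devices)}
    for player in range(n_players):
        assignment[player % n_devices].append(player)
    return assignment


assignment = assign_players_to_devices(n_players=5, n_devices=2)
print(assignment)  # {0: [0, 2, 4], 1: [1, 3]}
```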
Input, output and parameters for neural networks. Before training, we sample , which, together with the initial states and initial belief , are the inputs of the NNs. Adam, a variant of SGD that adaptively estimates lower-order moments, is chosen to optimize the parameters . The hyperparameters of the Adam solver follow the original paper [32]. Regarding the architecture of “Sequential”, it is a layered subnetwork. We set , with 1 input layer, 2 hidden layers, and 1 output layer containing nodes. The rectified linear unit is chosen for the hidden layers, while no activation is applied to the output layer. We also add Batch Normalization [30] to the hidden layers before activation. This method normalizes each training mini-batch to reduce the internal covariate shift phenomenon, and thus frees us from delicate parameter initialization. It also acts as a regularizer, in some cases eliminating the need for Dropout.
Parameters of the network are initialized at Step 1. In Step 7, training continues from the previous stage without reinitialization. This is because, although opponents’ policies change from stage to stage, they will not vary significantly, and parameter values from the previous stage should be better than a random initialization. For a fixed computational budget, instead of using the stopping criterion in Step 12, one can terminate the loop once the stage index reaches a predetermined upper bound . In Step 7, the number of epochs used to train the model at every single stage does not need to be large (on the scale of hundreds). This is because we are not aiming at a one-time accurate approximation of the optimal policy. Especially in the first few rounds, when opponents’ policies are far from optimal, pursuing an accurate approximation is not meaningful. Instead, by using a small budget to obtain moderate accuracy at each iteration, we are able to repeat the game more times. In summary, of the two computational schemes, many stages with few epochs each and few stages with many epochs each, the former is better. If opponents’ policies stayed the same from stage to stage, the two schemes would achieve the same accuracy. In reality, the opponents’ policies are updated from stage to stage, and the former scheme enables us to obtain player ’s reaction with a more accurate belief about her opponents. Steps 15-19 are not computationally costly, and the value functions usually converge after several () iterations.
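The trade-off between the two schemes can be illustrated on a toy static game, which is an assumption made purely for illustration and not the paper's setting. Each player's cost is 0.5 * (a_i - b_i - c * mean of the others' actions)^2, "training" is a few plain gradient steps on that cost against the opponents' previous-stage actions, and each stage is warm-started from the player's previous action. The names `partial_training` and `staged_play` are hypothetical. Both runs below spend the same total number of gradient steps per player.

```python
def partial_training(i, a_start, a_prev, b, c, n_epochs, lr):
    """A few gradient steps on player i's toy cost, warm-started at a_start."""
    others = [a_prev[j] for j in range(len(a_prev)) if j != i]
    target = b[i] + c * sum(others) / len(others)
    a_i = a_start
    for _ in range(n_epochs):
        a_i -= lr * (a_i - target)  # gradient of 0.5 * (a_i - target)^2
    return a_i


def staged_play(b, c, n_stages, n_epochs, lr):
    a = [0.0] * len(b)  # parameters are initialized once, before the first stage
    for _ in range(n_stages):
        # every player trains against the same previous-stage actions `a`
        a = [partial_training(i, a[i], a, b, c, n_epochs, lr) for i in range(len(b))]
    return a


b, c = [1.0, 2.0, 3.0], 0.5
equilibrium = [3.2, 4.0, 4.8]  # solves a_i = b_i + c * mean of the others' actions

many_short = staged_play(b, c, n_stages=100, n_epochs=2, lr=0.5)
few_long = staged_play(b, c, n_stages=2, n_epochs=100, lr=0.5)

error_many_short = max(abs(x - y) for x, y in zip(many_short, equilibrium))
error_few_long = max(abs(x - y) for x, y in zip(few_long, equilibrium))
print(error_many_short, error_few_long)
```

In this toy example the many-stage run ends much closer to the equilibrium than the few-stage run, because an accurate best response to an outdated belief is of little use; whether the same gap appears in a given differential game has to be checked numerically.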
3 LinearQuadratic games
Although the deep fictitious play theory and algorithm can be applied to any N-player game, the proof of convergence is in general hard. Here we consider a special case of linear-quadratic symmetric N-player games, and analyze the convergence of defined in (2.3). The strategy analyzed here will provide an open-loop Nash equilibrium, as proved at the end of this section.
We follow the linear-quadratic model proposed in [12], where the players’ dynamics interact through their empirical mean:
(3.1) 
Here are independent standard Brownian motions (BMs). Each player controls the drift by in order to minimize the cost functional
(3.2) 
with the running cost defined by
(3.3) 
and the terminal cost function by
(3.4) 
All parameters are nonnegative, and is imposed so that is convex in . In [12], is viewed as the log-monetary reserves of bank at time . For further interpretation, we refer to [12].
In the spirit of fictitious play, the N-player game is recast into N individual optimal control problems played iteratively. The players start with a smooth belief of their opponents’ actions . At stage , the players have observed the same past controls ’s, and then each player optimizes her control problem individually, assuming the other players will follow their choice at stage . That is, for player ’s problem, her dynamics are controlled through , while the other players’ states evolve according to the past strategies :
(3.5)  
(3.6) 
Player faces an optimal control problem:
(3.7)  
The space where we search for the optimal is the space of square-integrable progressively measurable valued processes on , to be consistent with open-loop equilibria. Denote by the minimizer of this control problem at stage :
(3.8) 
In what follows, we shall show:

exists, that is, the minimal cost in (3.7) is always attained;

the family converges;

the limit of forms a Nash equilibrium.
3.1 The probabilistic approach
Observing that the cost functional in (3.7) solely depends on the process and the control , we make the following simplification. Notice that (3.5) and (3.6) imply
(3.9) 
Then, player ’s problem is equivalent to:
(3.10) 
In what follows, we show the existence of a unique minimizer, denoted by , using the stochastic maximum principle (SMP). The Hamiltonian for player at stage reads
(3.11) 
For a given admissible control , the adjoint processes satisfy the backward stochastic differential equation (BSDE):
(3.12) 
with the terminal condition . Standard results on BSDEs [50], together with the estimates on the controlled state , guarantee the existence and uniqueness of the adjoint processes. The Pontryagin stochastic maximum principle (SMP) suggests the form of the optimizer:
(3.13) 
Plugging this candidate into the system (3.9)-(3.12) produces a system of affine FBSDEs:
(3.14) 
The sufficient condition of the SMP suggests that if we solve (3.14), we obtain the optimal control by plugging its solution into equation (3.13). In fact, the coefficients satisfy the monotonicity property in [52], thus the system is uniquely solvable in , and the resulting optimal control is indeed admissible. This answers question (a). For the other two questions, we need to further analyze (3.14).
Note that the system can be decoupled using:
(3.15) 
where satisfies the Riccati equation:
(3.16) 
and the decoupled processes satisfy: