# On the Convergence of Model Free Learning in Mean Field Games

## Abstract

Learning by experience in Multi-Agent Systems (MAS) is a difficult and exciting task, due to the lack of stationarity of the environment, whose dynamics evolve as the population learns. In order to design scalable algorithms for systems with a large population of interacting agents (e.g., swarms), this paper focuses on Mean Field MAS, where the number of agents is asymptotically infinite. Recently, a very active and burgeoning field has studied the effects of diverse reinforcement learning algorithms for agents that have no prior information on a stationary Mean Field Game (MFG) and learn their policy through repeated experience. We adopt a high-level perspective on this problem and analyze in full generality the convergence of a fictitious play iterative scheme using any single-agent learning algorithm at each step. We quantify the quality of the computed approximate Nash equilibrium in terms of the accumulated errors arising at each learning iteration step. Notably, we show for the first time convergence of model free learning algorithms towards non-stationary MFG equilibria, relying only on classical assumptions on the MFG dynamics. We illustrate our theoretical results with a numerical experiment in a continuous action-space environment, where the approximate best response of the iterative fictitious play scheme is computed with a deep RL algorithm.

## 1 Introduction

In Multi-Agent Systems (MAS), several autonomous robots or agents interact and cooperate, compete or coordinate in order to complete their task. The difficult nature of the task at hand, combined with the large number of possible situations, implies that the agents have to learn by experience. In comparison to the single-agent case, the derivation of efficient learning algorithms in this context is difficult due to the lack of stationarity of the environment, whose dynamics evolve as the population learns [5]. This gives rise to research topics lying at the intersection of game theory and reinforcement learning. Nevertheless, in typical examples, the number of interacting agents can be very large (e.g., swarm systems) and exceeds the scalability of most learning algorithms. For anonymous identical agents, a key simplification in game theory is the introduction of the asymptotic limit where the number of agents is infinite, leading to the modeling intuition behind the theory of Mean Field Games (MFG). This calls for an analysis of model free learning schemes for MAS in terms of MFG.

MFG were introduced by Lasry and Lions [MR2269875, MR2271747] and Huang, Caines and Malhamé [MR2346927-HuangCainesMalhame-2006-closedLoop] in order to model the dynamic equilibrium between a large number of anonymous identical interacting agents. Such systems encompass the modeling of numerous applications such as traffic jam dynamics, swarm systems, financial market equilibrium, crowd evacuation, smart grid control, web advertising auctions, vaccination dynamics and rumor spreading on social media, among others. In a sequential game theory setting, each player needs to take into account its impact on the strategies of the other players. Studying games with an infinite number of players is easier from this point of view, as the impact of one single player on the others can be neglected. Hence, the asymptotic limit with infinite population size considered in MFG becomes highly relevant. A solution to a dynamic MFG is determined via the optimal policy of a representative agent in response to the flow of the entire population. A mean field (MF) Nash equilibrium arises when the distribution of the best response policies over the population generates the exact same population flow. In most cases, an MF Nash equilibrium provides an approximate Nash equilibrium for an analogous game with a finite number of players [8, 3, 10].

In the abundant literature on MFG, most papers consider planning problems where agents are fully informed about the game operation scheme, the reward function and the MF population dynamics. Only a few contributions focus on learning problems in MFG, see e.g. [31, 7, 17, 16] for model based approaches. Very recently, a rapidly growing literature has aimed to approximate the solution of stationary MFG in the realistic setting where agents with no prior information on the game learn their best response policy through repeated experience. These contributions are restricted to a stationary setting and focus on specific Reinforcement Learning (RL) algorithms: Q-learning [15, 30], fictitious play [23] or policy gradient methods [28], and sometimes rely on assumptions that are hard to verify.

In this paper, we take a step back and adopt a general high-level perspective on the convergence of model free learning algorithms in possibly non-stationary MFG and emphasize their potential for MAS with a large number of agents. Our approach investigates how any single-agent learning algorithm can perform in an MFG setting, in order to learn a (possibly approximate) Nash equilibrium, via repeated experiences and without any prior knowledge. Namely, we quantify precisely how the convergence of model free iterative learning algorithms reduces to the error analysis of each learning iteration step, in analogy with how the convergence of RL algorithms reduces to the aggregation of repeated supervised learning approximation errors [12, 26]. For this purpose, our approach relies on a model free Fictitious Play (FP) iterative learning scheme for repeated games, where each agent calibrates its belief to the empirical frequency of the previously observed population flows. The FP approach is very natural when agents are trying to learn how to play a game by experience, while interacting with others. Before a new round of experience, they need to anticipate the behavior of the other players, and FP ergodic averaging is well suited to this purpose. This algorithm is typically useful for building collaboration and cooperation patterns from experience in a MAS using a decentralized learning scheme. In our framework of interest, all agents are identical (as usual in MFG), and we consistently suppose that they use the same learning scheme. Whenever the agents can compute their exact best response to any population flow, FP is proved to reach asymptotically a Nash equilibrium in some (but not all [27]) classes of games, such as first order monotone MFG [17]. However, in a realistic setting, the agents are not able to compute their exact best response and can only attain an approximate version of it. This induces at each iteration a learning approximation error, which propagates through the FP learning scheme.

The main contribution of this paper is theoretical, as we provide a rigorous study of the error propagation in Approximate FP algorithms for MFGs, using an innovative line of proof in comparison to the standard two time scale approximation convergence results [21, 4]. Our convergence results are derived under easily verifiable assumptions on possibly non-stationary MFG dynamics and cost, which are highly classical in the MFG literature (namely first order monotone MFGs). This allows discussing the convergence to a (possibly approximate) MF Nash equilibrium, when using any standard single-agent learning algorithm as an inner step embedded in an FP iterative scheme. In particular, our theoretical framework encompasses the convergence of RL algorithms to MFG equilibria in non-stationary settings, which, as far as we know, is new in the literature. We illustrate our theoretical results on an authoritative MFG numerical experiment on crowd congestion, where the approximate best response of the iterative FP scheme is computed with a deep RL algorithm. This provides for the first time a model free learning example on MFG in a continuous state-action environment.

## 2 Background

#### Mean Field Games.

MFGs were introduced by Lasry and Lions [MR2269875, MR2271747] and by Huang, Caines and Malhamé [MR2346927-HuangCainesMalhame-2006-closedLoop] and correspond to the asymptotic limit of a differential game, where the number of agents is infinite. Since all agents are assumed to be identical and indistinguishable, individual interactions are irrelevant in the limit and only the distribution of states matters (see [10] for a complete overview). Most of the MFG literature is set in continuous time, but we choose to present our analysis in a discrete time setting in order to lighten the presentation and emphasize the fruitful connections with the learning literature.

Finding a Mean Field Nash equilibrium boils down to identifying the equilibrium distribution dynamics of the population as well as the best response (or optimal policy) of a representative agent to this population mean field flow. Since the number of players is infinite, each agent has an infinitesimal influence on the population distribution. Yet, since all agents are rational, at equilibrium the state distribution generated by the optimal policy must coincide with the population distribution.

Notations. Let $\mathcal{X}$ and $\mathcal{A}$ be compact convex subsets of finite-dimensional Euclidean spaces, which represent the state and action spaces common to every agent. Let $T$ be a time horizon and let $\mathcal{T}$ denote the corresponding time sequence. We denote by $\mathcal{P}(\mathcal{X})$ the set of probability measures on $\mathcal{X}$ and consider flows of population state distributions $\mu = (\mu_t)_{t \in \mathcal{T}}$. The initial distribution of the population is an atomless measure on $\mathcal{X}$ denoted by $\mu_0$. For $t \in \mathcal{T}$, $\mu_t$ represents the distribution at time $t$ of the state occupation of the entire population.

State dynamics and mean field population flow. At any time $t$, each agent is in a state $x_t \in \mathcal{X}$ and picks an action $a_t \in \mathcal{A}$. For a sequence of actions $(a_t)_t$, the dynamics of $(x_t)_t$ is governed by a Markov Decision Process (MDP) with (possibly non-stationary) transition density parameterized by the mean field flow $\mu$ of the population. This indexation transcribes the interactions with the other agents, through their state distribution $\mu$. Typically, the dynamics of $(x_t)_t$ is described by an equation of the form

$$x_{t+1} = x_t + b(x_t, a_t, \mu_t) + \epsilon_{t+1}, \tag{1}$$

where $b$ is a drift function and $(\epsilon_t)_t$ is a dynamic source of noise. We stress that the mean field term $\mu_t$ represents the whole population distribution and not just the average state, as e.g. in [30].

We consider policies (or controls) which are feedback in the state: at time $t$, an agent using policy $\pi$ while in state $x_t$ plays the action $a_t$ prescribed by $\pi$ for that time and state. The process controlled by $\pi$ is denoted $(x^\pi_t)_t$.

Agent’s reward scheme. An infinitesimal agent starting at time $0$ in state $x_0$ chooses a policy $\pi$ in order to maximize the following discounted expected sum of rewards:

$$J(x_0, \pi, \mu) := \mathbb{E}\Big[\sum_{t=0}^{T-1} \gamma^t\, r(x^\pi_t, \mu_t, a_t)\Big], \tag{2}$$

while interacting with the population MF flow $\mu$. At time $t$, the agent’s rewards are impacted by $\mu_t$, which represents the aggregate state distribution of all the other agents (i.e. of the whole population). Since the agents are anonymous, only the MF distribution flow of the states matters.

As $\mu_0$ denotes the state distribution at time $0$, the average reward for a representative agent is given by

$$J(\pi, \mu) := \mathbb{E}_{x_0 \sim \mu_0}\big[J(x_0, \pi, \mu)\big], \tag{3}$$

when this agent uses policy $\pi$, while the mean field population flow is $\mu$.
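As a rough illustration of how (1), (2) and (3) fit together, the following sketch estimates the average reward by Monte Carlo: it rolls out the dynamics from sampled initial states and averages the discounted returns. The function names, the toy reward and drift, and the choice of representing each $\mu_t$ by an arbitrary placeholder object are assumptions made for this example only, not part of the paper's method.

```python
import random

def rollout_return(x0, policy, flow, reward, drift, gamma, rng, noise_std=0.0):
    """Discounted return (2) along one trajectory started at x0.

    flow[t] stands in for mu_t (any object the reward and drift understand).
    """
    x, total = x0, 0.0
    for t, mu_t in enumerate(flow):               # t = 0, ..., T-1
        a = policy(t, x)
        total += (gamma ** t) * reward(x, mu_t, a)
        # One step of the dynamics (1): drift plus (optional) Gaussian noise.
        x = x + drift(x, a, mu_t) + rng.gauss(0.0, noise_std)
    return total

def average_return(initial_states, policy, flow, reward, drift, gamma, rng):
    """Monte Carlo estimate of (3): mean of (2) over samples x0 ~ mu_0."""
    returns = [rollout_return(x0, policy, flow, reward, drift, gamma, rng)
               for x0 in initial_states]
    return sum(returns) / len(returns)

# Hypothetical toy instance: constant action, quadratic action cost, no noise.
rng = random.Random(0)
flow = [None, None, None]                         # T = 3; this toy reward ignores mu_t
J = average_return(
    initial_states=[rng.random() for _ in range(100)],
    policy=lambda t, x: 1.0,
    flow=flow,
    reward=lambda x, mu_t, a: -0.5 * a * a,
    drift=lambda x, a, mu_t: a,
    gamma=0.5,
    rng=rng,
)
```

In this toy instance every trajectory collects $-\tfrac12(1 + \gamma + \gamma^2)$, so the estimate does not depend on the sampled initial states; a reward that actually depends on $\mu_t$ would simply read it from `flow[t]`.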

###### Definition 1 (Best response).

A policy maximizing $\pi \mapsto J(\pi, \mu)$ is called a best response of the representative agent to the MF population dynamic flow $\mu$.

MF Nash Equilibrium. While interacting, the agents may or may not reach a Nash equilibrium, whose definition, based on the previous best response policy characterization, reads as follows:

###### Definition 2 (Mean Field Nash equilibrium).

A pair $(\pi^*, \mu^*)$ consisting of a policy and an MF population distribution flow is called an MF Nash equilibrium if it satisfies

• Agent rationality: $\pi^*$ is a best response to $\mu^*$;

• Population consistency: for all $t$, $\mu^*_t$ is the distribution of $x^{\pi^*}_t$, starting with distribution $\mu_0$ and controlled by policy $\pi^*$.

Namely, if the mean field population flow is $\mu^*$, the policy $\pi^*$ is optimal, and if all the agents play according to $\pi^*$, the induced mean field population flow coincides with $\mu^*$. Hence, $(\pi^*, \mu^*)$ is an MF Nash equilibrium.

Observe that reaching an MF Nash equilibrium requires the computation of the exact best response policy, which can be difficult in practice. We are concerned with the design of an iterative learning scheme, where the available best response is partially accurate and typically approximated by RL through repeated experiences. For example, this realistic situation arises when agents are repeatedly optimizing their daily driving trajectories, without any prior information on the traffic jam dynamics.

## 3 Fictitious Play Algorithms for MFG

Fictitious play [25] is an iterative learning scheme for repeated games, where each agent calibrates its belief to the empirical frequency of previously observed strategies of other agents, and plays optimally according to its beliefs. This constitutes its best response. Even in simple two-player games, the convergence of FP to a Nash equilibrium is not guaranteed [27]. However, the convergence of FP has recently been proved for some classes of MFG [17, 6].

Yet, in most cases, agents do not have access to the exact best response policy but use an approximate version of it instead, in the spirit of [21, 24]. At iteration $n$, the agent only has access to an approximate version $\hat\pi^{(n+1)}$ of the best response $\pi^{*,(n+1)}$ to the anticipated mean field flow $\bar\mu^{(n)}$, defined precisely in Algorithm 1.
At iteration step $n$, the learning scheme induces an average additional error defined as

$$\ell_n := J(\pi^{*,(n+1)}, \bar\mu^{(n)}) - J(\hat\pi^{(n+1)}, \bar\mu^{(n)}) \ge 0, \tag{4}$$

for $n \in \mathbb{N}$. Observe that $\ell_n$ is the expected loss over the entire population at step $n$, when replacing the exact best response $\pi^{*,(n+1)}$ by the approximate policy $\hat\pi^{(n+1)}$.
In Section 4 below, we quantify the propagation of approximating errors and clarify the convergence properties of Algorithm 1 for any type of learning procedure at each intermediate step. The specific setting where the approximate optimal policy is computed using single-agent RL algorithms is discussed in Sec 4.3.
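To make the iteration concrete, here is a minimal sketch of a fictitious play loop on a hypothetical toy game: a ring of `S` states, actions in $\{-1, 0, 1\}$, noiseless moves and a congestion-averse reward. None of these modelling choices come from the paper. The best response is computed by exact backward induction, used here as a stand-in for the learning step (so the learning error is zero), and the mean flow is updated by the running average $\bar\mu^{(n+1)} = \frac{n\,\bar\mu^{(n)} + \hat\mu^{(n+1)}}{n+1}$, which is the form assumed for the averaging step.

```python
import math

S, T, GAMMA = 8, 5, 0.9                      # hypothetical toy sizes
ACTIONS = (-1, 0, 1)

def reward(x, a, mu_t):
    # Hypothetical reward: geographic term, action cost, aversion to crowds.
    geo = math.cos(2 * math.pi * x / S)
    return geo - 0.5 * a * a - math.log(mu_t[x] + 1e-3)

def best_response(flow):
    """Exact backward induction against a fixed flow (stand-in for an RL step)."""
    value, policy = [0.0] * S, [None] * T
    for t in reversed(range(T)):
        new_value, pol_t = [0.0] * S, [0] * S
        for x in range(S):
            q = {a: reward(x, a, flow[t]) + GAMMA * value[(x + a) % S] for a in ACTIONS}
            pol_t[x] = max(q, key=q.get)
            new_value[x] = q[pol_t[x]]
        value, policy[t] = new_value, pol_t
    return policy

def induced_flow(policy, mu0):
    """Population flow (mu_0, ..., mu_{T-1}) when every agent follows `policy`."""
    flow, mu = [], list(mu0)
    for t in range(T):
        flow.append(mu)
        nxt = [0.0] * S
        for x in range(S):
            nxt[(x + policy[t][x]) % S] += mu[x]
        mu = nxt
    return flow

def evaluate(policy, flow, mu0):
    """Average discounted reward of a representative agent using `policy` against `flow`."""
    total, mu = 0.0, list(mu0)
    for t in range(T):
        nxt = [0.0] * S
        for x in range(S):
            a = policy[t][x]
            total += (GAMMA ** t) * mu[x] * reward(x, a, flow[t])
            nxt[(x + a) % S] += mu[x]
        mu = nxt
    return total

def fictitious_play(n_iter, mu0):
    """Return the exploitability of the averaged policy at each iteration."""
    first = best_response(induced_flow([[0] * S] * T, mu0))   # arbitrary start
    policies, mean_flow = [first], induced_flow(first, mu0)
    exploitability = []
    for n in range(1, n_iter + 1):
        br = best_response(mean_flow)                         # response to the mean flow
        avg = sum(evaluate(p, mean_flow, mu0) for p in policies) / len(policies)
        exploitability.append(evaluate(br, mean_flow, mu0) - avg)
        new_flow = induced_flow(br, mu0)
        mean_flow = [[(n * m + h) / (n + 1) for m, h in zip(mt, ht)]
                     for mt, ht in zip(mean_flow, new_flow)]  # running average
        policies.append(br)
    return exploitability

exploitabilities = fictitious_play(30, [1.0 / S] * S)
```

Replacing `best_response` with any approximate single-agent learner would introduce a nonzero learning error at each step, which is the situation analyzed in the text.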

#### Approximate Nash equilibrium

At each step $n$, we denote by $\bar\pi^{(n)}$ the representative agent's belief on the aggregate population policy, defined as an equally randomized version of all previous approximate best responses $\hat\pi^{(1)}, \dots, \hat\pi^{(n)}$: for each time and state, it is the probability distribution on the set of actions according to which the player picks uniformly at random among the actions prescribed by these $n$ policies.
With a slight abuse of notation, we write

$$J(x_0, \bar\pi^{(n)}, \bar\mu^{(n)}) := \frac{1}{n} \sum_{k=1}^{n} J(x_0, \hat\pi^{(k)}, \bar\mu^{(n)}), \qquad n \in \mathbb{N},$$

and modify the definition of $J(\bar\pi^{(n)}, \bar\mu^{(n)})$ in (3) accordingly. Observe for later use that, by construction, $\bar\mu^{(n)}$ defined in Algorithm 1 coincides with the population MF flow induced by the policy $\bar\pi^{(n)}$. In order to assess the quality of $(\bar\pi^{(n)}, \bar\mu^{(n)})$ as an (approximate) MF Nash equilibrium, we introduce, for $n \in \mathbb{N}$,

$$e_n := J(\pi^{*,(n+1)}, \bar\mu^{(n)}) - J(\bar\pi^{(n)}, \bar\mu^{(n)}) \ge 0.$$

The exploitability $e_n$ quantifies at iteration $n$ the expected gain for a typical agent, when shifting its belief $\bar\pi^{(n)}$ to the exact best response $\pi^{*,(n+1)}$, while interacting with the MF population flow $\bar\mu^{(n)}$. After $n$ iterations in Algorithm 1, $e_n$ is a quantitative measure of the quality of $(\bar\pi^{(n)}, \bar\mu^{(n)})$ as an MF Nash equilibrium. For the sake of clarification, let us introduce a more precise, weaker notion of MF Nash equilibrium, inspired by [9].

###### Definition 3 (Approximate MF Nash equilibrium).

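For $\epsilon, \delta > 0$, a pair $(\pi^*_{\epsilon,\delta}, \mu^*_{\epsilon,\delta})$ consisting of a policy and a population distribution flow is called an $(\epsilon,\delta)$-MF Nash equilibrium if

$$\mu_0\big(\{x_0 \,;\, J(x_0, \pi^*_{\epsilon,\delta}, \mu^*_{\epsilon,\delta}) \ge J(x_0, \pi', \mu^*_{\epsilon,\delta}) - \epsilon, \ \forall \pi'\}\big)$$

is at least $1 - \delta$, and $\mu^*_{\epsilon,\delta}$ coincides with the MF distribution flow starting from $\mu_0$, when every agent uses policy $\pi^*_{\epsilon,\delta}$.

An $(\epsilon,\delta)$-MF Nash equilibrium is a weak equilibrium which is $\epsilon$-optimal for at least a fraction $1 - \delta$ of the population. We are now in position to clarify how the exploitability quantifies the quality of $(\bar\pi^{(n)}, \bar\mu^{(n)})$ as an MF Nash equilibrium.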

###### Theorem 4.

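If $e_n \le \epsilon\delta$ for some $\epsilon, \delta > 0$, then $(\bar\pi^{(n)}, \bar\mu^{(n)})$ is an $(\epsilon,\delta)$-MF Nash equilibrium in the sense of Definition 3. If $e_n$ goes to 0 as $n \to \infty$, any accumulation point of $(\bar\pi^{(n)}, \bar\mu^{(n)})_n$ is an MF Nash equilibrium.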

###### Proof.

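Fix $\epsilon, \delta > 0$ and assume $e_n \le \epsilon\delta$. Let us introduce

$$\varphi(x_0) := J(x_0, \pi^{*,(n+1)}, \bar\mu^{(n)}) - J(x_0, \bar\pi^{(n)}, \bar\mu^{(n)}) \ge 0,$$

as $\pi^{*,(n+1)}$ is the best response to the MF flow $\bar\mu^{(n)}$.
Using Markov’s inequality and the bound on $e_n$ we obtain

$$\mu_0\big(\{x_0 \in \mathcal{X} : \varphi(x_0) \ge \epsilon\}\big) = \mathbb{P}_{x_0 \sim \mu_0}\big[\varphi(x_0) \ge \epsilon\big] \le \frac{\mathbb{E}_{x_0 \sim \mu_0}[\varphi(x_0)]}{\epsilon} = \frac{e_n}{\epsilon},$$

which is smaller than $\delta$. Collecting the terms and using the definition of $\varphi$, we deduce that

$$\mu_0\big(\{x_0 \,;\, J(x_0, \bar\pi^{(n)}, \bar\mu^{(n)}) \ge J(x_0, \pi^{*,(n+1)}, \bar\mu^{(n)}) - \epsilon\}\big)$$

is at least $1 - \delta$, so that $(\bar\pi^{(n)}, \bar\mu^{(n)})$ is an $(\epsilon,\delta)$-MF Nash equilibrium.
The second part of the theorem follows directly. ∎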
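The Markov inequality step of this proof can be checked numerically on sampled values. In the sketch below, the nonnegative per-agent gaps are drawn from an exponential distribution purely for illustration (an assumption, not data from the paper); their empirical mean plays the role of the exploitability, and the fraction of agents whose gap reaches a threshold is compared with the Markov bound.

```python
import random

def fraction_exceeding(gaps, eps):
    """Empirical mass of initial states whose optimality gap is at least eps."""
    return sum(1 for g in gaps if g >= eps) / len(gaps)

rng = random.Random(1)
# Hypothetical nonnegative optimality gaps, one per sampled initial state x0.
gaps = [rng.expovariate(10.0) for _ in range(10_000)]

exploitability = sum(gaps) / len(gaps)       # empirical counterpart of the mean gap
eps = 0.5
markov_bound = exploitability / eps
violating_fraction = fraction_exceeding(gaps, eps)
```

For any nonnegative sample, `violating_fraction` cannot exceed `markov_bound`, so at least a fraction `1 - markov_bound` of the sampled agents is `eps`-optimal.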

## 4 Error propagation & Nash equilibrium approximation for first order MFG

Since the exploitability $e_n$ is a relevant quality measure of Algorithm 1 after $n$ iterations, we now evaluate how the individual learning errors aggregate over the iterations. For the sake of simplicity, we focus our discussion on first order MFG, i.e. without source of noise in the dynamics. This allows us to build our reasoning on the analysis of [17, Chapter 3] and to avoid a limiting restriction to second order games with a potential structure, for which similar results should hold as well, see Cardaliaguet and Hadikhanloo [cardaliaguet2017learning].

### 4.1 First order mean field game

The state evolves with dynamics (1), where we take $b(x, a, \mu) = a$ and $\epsilon_t \equiv 0$. In other words, each agent controls exactly its state variation between two time steps and does not endure any noise. While interacting with an MF flow $\mu$, each agent intends to maximize the classical reward scheme given by (2) with a running reward at time $t$ of the form:

$$r(x^\pi_t, \mu_t, a_t) \mapsto \tilde r(x^\pi_t, a_t) + \bar r(x^\pi_t, \mu_t), \tag{5}$$

where the extra term $\bar r$ captures the impact of the other agents’ positions. In Sec. 5, we provide in particular a congestion example where $\bar r$ models an appeal for non-crowded regions. This type of condition translates into the so-called Lasry-Lions monotonicity condition [MR2269875, MR2271747], which ensures uniqueness of the MF Nash equilibrium. More precisely, existence and uniqueness of the solution to the first order MFG of interest hold under the following classical set of assumptions.

###### Assumption 1.

For some constant $C > 0$, the reward functions $\tilde r$ and $\bar r$ satisfy:

• For any $x$, the map $a \mapsto \tilde r(x, a)$ is twice differentiable and

$$\frac{1}{C} I_d \le D_{aa} \tilde r(x, \cdot) \le C I_d,$$
• The function is continuous on and is on  ,

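• We have

$$\|\tilde r(\cdot, \cdot)\|_\infty + \|\bar r(\cdot, \cdot)\|_\infty \le C,$$

• The Lasry-Lions monotonicity condition holds: for all probability measures $m_1 \neq m_2$ on $\mathcal{X}$,

$$\int_{\mathcal{X}} \big[\bar r(\cdot, m_1) - \bar r(\cdot, m_2)\big] \, d[m_1 - m_2] < 0. \tag{6}$$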
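As a quick sanity check of the monotonicity condition (6), one can evaluate a discrete analogue for a congestion-averse coupling. The sketch below assumes a finite state space and the coupling $\bar r(x, m) = -\log m(x)$, in the spirit of the logarithmic penalty used in Section 5; both the discretization and the helper names are illustrative assumptions.

```python
import math
import random

def random_distribution(size, rng):
    """Random strictly positive probability vector on `size` states."""
    weights = [rng.random() + 1e-6 for _ in range(size)]
    total = sum(weights)
    return [w / total for w in weights]

def monotonicity_gap(m1, m2):
    """Discrete analogue of the left-hand side of (6) for r_bar(x, m) = -log m(x)."""
    return sum((-math.log(p) + math.log(q)) * (p - q) for p, q in zip(m1, m2))

rng = random.Random(0)
gaps = [monotonicity_gap(random_distribution(10, rng), random_distribution(10, rng))
        for _ in range(1000)]
```

Each summand has the sign of $-(\log p - \log q)(p - q)$, which is nonpositive because the logarithm is increasing, so every entry of `gaps` is negative for distinct distributions.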

### 4.2 Error propagation in the Fictitious Play algorithm

We now investigate how the learning error propagates through FP for any learning algorithm, while Sec 4.3 focuses on the specific case where the best response is approximated via RL.
The key ingredient of FP iterative learning schemes is the quick stabilization of the sequence of beliefs $(\bar\mu^{(n)})_n$.

###### Lemma 5.

Under Assumption 1, the FP MF flow $(\bar\mu^{(n)})_n$ satisfies:

$$d_1(\bar\mu^{(n)}, \bar\mu^{(n+1)}) \le \frac{C}{n}, \qquad n \in \mathbb{N}, \quad \text{for some } C > 0,$$

where $d_1$ is the Wasserstein distance.

The proof follows from a straightforward adaptation of [17, Lemma 3.3.2] to our setting.

As the sequence of beliefs stabilizes, the impact of recent learning errors reduces and we are in position to quantify the global error of the algorithm after $n$ iteration steps. This is the main result of the paper, whose proof interestingly differs from the more classical two-time scale approximation argument [4].

###### Theorem 6.

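Under Assumption 1, the Nash equilibrium quality $e_n$ satisfies both estimates: for all $n \in \mathbb{N}$,

$$e_n \le \frac{C_1}{n} + \frac{C_1}{n} \sum_{i=1}^{n} d_1(\mu^{*,(i+1)}, \hat\mu^{(i+1)}) + \frac{1}{n} \sum_{i=1}^{n} \ell_i, \tag{7}$$

$$e_n \le \ell_n + \frac{C_2}{n} + \frac{C_2}{n} \sum_{i=1}^{n} d_1(\hat\mu^{(i+1)}, \hat\mu^{(i+2)}) + \sum_{i=1}^{n} \frac{i+1}{n} \ell_i, \tag{8}$$

for some constants $C_1$ and $C_2$.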

###### Sketch of Proof.

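Our argument builds upon the exact FP analysis of [17, Theorem 3.3.1], which extends to the approximate best response setting.
Let us introduce the approximate exploitability

$$\hat e_n := J(\hat\pi^{(n+1)}, \bar\mu^{(n)}) - J(\bar\pi^{(n)}, \bar\mu^{(n)}) \ge 0,$$

so that $e_n = \hat e_n + \ell_n$, for $n \in \mathbb{N}$. In order to control the exploitability $e_n$, we focus our analysis on $\hat e_n$. Denoting by $J_n$ the reward functional associated with the flow $\bar\mu^{(n)}$ (see the Appendix), we get:

$$\begin{aligned}
\hat e_{n+1} - \frac{n}{n+1} \hat e_n
&= \int_{\mathcal{X}_T} J_{n+1} \, d(\hat\mu^{(n+2)} - \bar\mu^{(n+1)}) - \frac{n}{n+1} \int_{\mathcal{X}_T} J_n \, d(\hat\mu^{(n+1)} - \bar\mu^{(n)}) \\
&= n \int_{\mathcal{X}_T} (J_{n+1} - J_n) \, d(\bar\mu^{(n+1)} - \bar\mu^{(n)}) + \int_{\mathcal{X}_T} J_{n+1} \, d(\hat\mu^{(n+2)} - \hat\mu^{(n+1)}),
\end{aligned}$$

where the last equality follows from the definition of $\bar\mu^{(n+1)}$.
The monotonicity of the reward in Assumption 1 implies

$$\hat e_{n+1} - \frac{n}{n+1} \hat e_n \le \int_{\mathcal{X}_T} J_{n+1} \, d(\hat\mu^{(n+2)} - \hat\mu^{(n+1)}).$$

Besides, Assumption 1, together with the compactness of the set of accessible states, [17, Lemma 3.5.2] and Lemma 5, implies that $J_{n+1} - J_n$ is $\frac{C}{n}$-Lipschitz, leading to

$$\hat e_{n+1} - \frac{n}{n+1} \hat e_n \le \int_{\mathcal{X}_T} J_n \, d(\hat\mu^{(n+2)} - \hat\mu^{(n+1)}) + \frac{C}{n} d_1(\hat\mu^{(n+2)}, \hat\mu^{(n+1)}).$$

As $\pi^{*,(n+1)}$ is the best response to the mean field flow $\bar\mu^{(n)}$, recalling the definition of $\ell_n$ in (4), we deduce

$$\hat e_{n+1} - \frac{n}{n+1} \hat e_n \le \ell_n + \frac{C}{n} d_1(\hat\mu^{(n+2)}, \hat\mu^{(n+1)}).$$

Together with the relation $e_n = \hat e_n + \ell_n$ and [17, Lemma 3.3.1], we derive (8) and conclude the proof. ∎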

Bound (7) indicates a nice averaging aggregation of the learning errors $(\ell_i)_i$, but requires a strong additional control on the Wasserstein distance between the MF flows generated by the approximate and exact best responses. Such an estimate is readily available for the numerical approximation of convex stochastic control problems [19] but less classical in the RL literature, as discussed in Sec 4.3. When such an estimate is not available, Bound (8) provides a slower convergence rate, up to a weak regularity of the approximate best response with respect to the mean field flow (recall Lemma 5). Such an estimate is highly classical in the setting of convex stochastic control problems with Lipschitz rewards [13, 19].
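The difference between the two bounds is easiest to see on the learning-error terms alone. The sketch below compares the plain average $\frac{1}{n}\sum_i \ell_i$ appearing in (7) with the weighted term $\ell_n + \sum_i \frac{i+1}{n}\ell_i$ appearing in (8), ignoring the constants and Wasserstein terms; the two loss sequences are hypothetical and chosen only to illustrate the behaviour.

```python
def averaged_error(losses):
    """Learning-error term of (7): (1/n) * sum_i l_i."""
    return sum(losses) / len(losses)

def weighted_error(losses):
    """Learning-error terms of (8): l_n + sum_i ((i + 1) / n) * l_i."""
    n = len(losses)
    return losses[-1] + sum((i + 1) / n * l for i, l in enumerate(losses, start=1))

# Hypothetical loss sequences (not taken from the paper's experiments).
constant_losses = [0.01] * 100                          # learner never improves
decaying_losses = [1.0 / (i * i) for i in range(1, 101)]  # learner improves quickly

avg_const, wgt_const = averaged_error(constant_losses), weighted_error(constant_losses)
avg_decay, wgt_decay = averaged_error(decaying_losses), weighted_error(decaying_losses)
```

With a constant per-step error, the averaged term stays at that error level while the weighted term grows linearly with the number of iterations, which is why (8) requires the learning errors to decrease sufficiently fast.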

At finite distance, the following corollary sums up these properties in terms of MF Nash equilibrium.

###### Corollary 7.

Under Assumption 1, if ever or is bounded by , is an -MF Nash equilibrium, for large enough.

In a similar fashion, we can conclude on the general asymptotic convergence of Algorithm 1 to the unique MF-Nash equilibrium, before discussing the specific implications for RL best response approximation schemes.

###### Corollary 8.

Under Assumption 1, the approximate FP algorithm converges to the unique MF Nash equilibrium whenever one of the following two conditions holds:

1. The approximate best response update procedure is continuous in , and , as  ;

2. The learning and policy approximation errors and converge to .

The convergence of the sequence follows from the tightness and pre-compactness property of this collection of measures with respect to the Wasserstein distance, see e.g. Remark 3.5.3 in [17].

### 4.3 Discussion on the convergence for Best Response RL approximation

The result in Theorem 6 is general and relies on standard assumptions of MFGs. It also relies on a good enough control of the approximation error on the best response at each iteration. Here, we discuss to what extent existing theoretical results for RL algorithms allow satisfying this assumption.

As stated in Corollary 8, in order for the approximate FP to converge to the exact MF Nash equilibrium, the approximate best response should converge quickly enough to the best one, depending on the number of iterations. From an RL perspective, this would require being able to compute the approximate optimal policy to an arbitrary precision, with high probability. As far as we know, such a result is possible only when an exact representation of any value function is possible, that is, in the tabular setting which imposes finite state and action spaces. Notably, convergence and rate of convergence of Q-learning-like algorithms have been studied in the literature, see e.g. [29, 18, 11, 2]. For example, the speedy Q-learning algorithm requires steps to learn an -optimal state-action value function with high probability, with the number of state-action couples. According to Corollary 8, if the error is in with (and if we have continuity in ), then the scheme converges to the Nash equilibrium. This suggests using steps for the RL agent at iteration . Yet, this kind of results does not provide guarantees on the continuity in .

According to Corollary 7, bounding the learning errors and the distance between two iterates of the distribution is sufficient to reach an approximate Nash equilibrium. As approximate FP can be seen as repeated RL problems, RL (or approximate dynamic programming) can be seen as repeated supervised learning problems, and the propagation of errors from supervised to RL is a well studied field, see e.g. [12, 26]. Basically, if the supervised learning steps are bounded by some , then the learning error of the RL algorithm is bounded by , where is a so -called concentrability coefficient, measuring the mismatch between some measures. In principle, we could then propagate the learning error of the supervised learning part up to the FP error, through the RL error. However, these results do not provide any guarantees on the proximity between the estimated optimal policy and the actual one (which would be a sufficient condition for the proximity between population distributions); it only provides a guarantee on the distance between their respective returns. This is due to the fact that in RL, the optimal value function is unique, but not the optimal policy. A perspective would be to consider regularized MDPs [14], where the optimal policy is unique (and greediness is Lipschitz). Yet, this would come at the cost of a bias in the Nash equilibria. The approach in [15] somehow builds partially on this idea in their specific learning scheme.

## 5 Numerical illustration

As an illustration, we consider a stylized authoritative MFG model with congestion in the spirit of [MR3698446]. This application should be seen as a proof of concept showing that the method described above can be applied beyond the framework used for our theoretical results. We compute a model free approximation of the MFG solution combining Algorithm 1 with Deep Deterministic Policy Gradient (DDPG) [22]. As far as we know, this is the first numerical illustration of a model free deep RL algorithm for MFG with continuous states and actions. Our numerical results also demonstrate the empirical convergence of the Fictitious RL scheme in a larger setting, even when the MFG is not of first order type.

As usual in RL, instead of (2) we consider the problem in infinite horizon with the following discounted reward:

$$J(x_0, \pi, \mu) := \mathbb{E}\Big[\sum_{t=0}^{\infty} \gamma^t\, r(x^\pi_t, \mu_t, a_t)\Big], \tag{9}$$

when an infinitesimal player interacts with the population MF flow $\mu$. The goal is to learn the policy which is optimal in the long run, i.e., when the behavior of the population becomes stationary.
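In a simulation, the infinite sum in (9) has to be truncated. A minimal sketch, under the additional assumption (not made in the text) that the per-step reward is bounded in absolute value by some `r_max`: the neglected tail is then at most $\gamma^H r_{\max} / (1 - \gamma)$, which gives a horizon $H$ for any target tolerance. The function names are hypothetical.

```python
import math

def tail_bound(gamma, r_max, horizon):
    """Upper bound on |sum_{t >= horizon} gamma^t r_t| when |r_t| <= r_max."""
    return gamma ** horizon * r_max / (1.0 - gamma)

def truncation_horizon(gamma, r_max, tol):
    """A horizon H such that the discounted tail beyond H is at most tol."""
    h = max(0, math.ceil(math.log(tol * (1.0 - gamma) / r_max) / math.log(gamma)))
    while tail_bound(gamma, r_max, h) > tol:     # guard against rounding
        h += 1
    return h

H = truncation_horizon(gamma=0.99, r_max=1.0, tol=1e-3)
```

Note that a logarithmic congestion penalty is unbounded near zero density, so the boundedness assumption only holds if the density stays away from zero.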

#### Environment

Each agent has a position located on the torus with periodic boundary conditions (for simplicity of explicit solution), whose dynamics is governed by where is the time step of the continuous time process. It receives the per-step reward

$$r(x_t, \mu_t, a_t) = \tilde r(x_t) - \frac{1}{2} |a_t|^2 - \log(\mu_t),$$

where the last term motivates agents to avoid congestion, i.e. the proximity to a region with a large population density. In the continuous time setting with no discounting, a direct PDE argument provides the ergodic solution in closed form [1]

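$$a^* : x \mapsto \pi \cos(2\pi x) \quad \text{and} \quad \mu^* : x \mapsto \frac{e^{2\sin(2\pi x)}}{\int_{\mathbb{T}} e^{2\sin(2\pi y)} \, dy}, \tag{10}$$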

when the geographic reward is of the form . This closed form solution offers a nice benchmark for our experiments and allows to measure the errors made by our algorithm.
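For benchmarking purposes, the closed form (10) can be tabulated numerically. The sketch below evaluates the control and the density exactly as written in (10), on the torus identified with $[0, 1)$, and computes the normalizing integral with a midpoint rule; the number of quadrature points and the function names are illustrative choices.

```python
import math

N_QUAD = 10_000   # hypothetical number of quadrature points on the torus [0, 1)

def unnormalized_density(x):
    return math.exp(2.0 * math.sin(2.0 * math.pi * x))

# Midpoint rule for the normalizing integral in (10).
Z = sum(unnormalized_density((k + 0.5) / N_QUAD) for k in range(N_QUAD)) / N_QUAD

def mu_star(x):
    """Ergodic density of (10), normalized numerically."""
    return unnormalized_density(x) / Z

def a_star(x):
    """Ergodic control of (10), as written in the text."""
    return math.pi * math.cos(2.0 * math.pi * x)

# Total mass of the tabulated density, as a consistency check.
mass = sum(mu_star((k + 0.5) / N_QUAD) for k in range(N_QUAD)) / N_QUAD
```

The density peaks at $x = 1/4$, where the control vanishes, and is lowest at $x = 3/4$; a learned density or control can be compared with `mu_star` and `a_star` on the same grid.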

#### Implemented Algorithm

Model free FP for MFGs takes a somewhat similar approach to Lanctot et al. [20], in the sense that we estimate the best response using a model free RL algorithm (namely DDPG). However, we do not maintain those best responses as in [20] but rather learn the population MF flow of the distribution of the representative agents. The best response approximation through DDPG and the estimation of the population MF are left in the Algorithm. We ran trajectories of DDPG with a trajectory length of . The noise used for exploration is a centered normal noise with variance and we used Adam optimizers with starting learning rate and . At each iteration of FP, we added trajectories of length to the replay buffer. Finally, we estimated the density using classes and doing steps of Adam (with initial learning rate).

#### Results.

Figure 1 presents the learned equilibrium computed for , and uniform initial distribution, as well as the continuous time closed form ergodic solution for , see (10). We emphasize that the variation in together with the difference between the discrete and continuous time settings implies that the theoretical solutions to both problems are close but do not exactly coincide. We keep this benchmark since no ergodic closed form solution is available for . As observed in Figure 1, the explicit ergodic and the learned distributions and controls are close. As expected, the density of players is larger around the point of maximum of the reward, but the distribution is not highly concentrated due to the logarithmic penalty encoding aversion for congested regions. More precisely, Figure 1 indicates that the errors between the distributions and the controls decrease with the number of iterations. The convergence of the control and of the distribution echoes the discussion on error propagation in Section 4.2. This illustrates the numerical convergence of the Deep RL FP mean field algorithm.

## 6 Related work

Recently, model free RL algorithms for solving MFGs were analyzed in the following papers: [15] and [30] study Q-learning, while [23] considers FP but contains several inaccuracies, as already pointed out in [28], which focuses on policy gradient methods. However, these studies are restricted to a stationary setting and focus on particular RL algorithms. Their convergence results hold under assumptions that are often hard to verify in practice. Although not focusing on an MFG, [yang2018mean] uses the idea of MF approximation by considering interactions through the empirical mean action. Numerical illustrations provided in all these papers are in a finite state-action setting, while we present a numerical example in a continuous state-action setting. On a different note, [yang2018deep] studies the link between MFG and inverse RL. Some authors also study “learning” algorithms which use the full knowledge of the model (and hence are not model-free): [yin2010learning] studied an MF oscillator game while [hu2019deep] proposed a decentralized deep FP learning architecture for large MARL, whose convergence holds on linear quadratic MFG examples with explicit solution and small maturity.

## 7 Conclusion and future research

In comparison to the existing literature focusing on specific RL algorithms for MFGs, we took a step back and offered a general perspective on the error propagation in iterative schemes for MFG, using any learning algorithm. We presented a rigorous convergence analysis of model free FP learning algorithms for Mean Field Multi-Agent Systems, encompassing cases where the best response is approximated using any single-agent learning algorithm as well as non-stationary settings. We showed how the convergence of model free iterative FP algorithms reduces to the error analysis of each learning iteration step, just as the convergence of RL algorithms reduces to the aggregation of repeated supervised learning approximation errors [12, 26]. Our theoretical setting covers for the first time non-stationary MFG and relies on reasonable and verifiable assumptions on the MFG of interest. The convergence is illustrated for the first time by numerical experiments in a continuous state-action setting, based on a deep RL algorithm. Our analysis motivates and properly justifies the use of the asymptotic Mean Field approximation for the study of learning by experience schemes in Multi-Agent Systems with a large number of agents.
For RL approximation schemes, our analysis suggests a much faster convergence rate, whenever the best response approximation quality can be controlled in Wasserstein distance. This kind of estimate is classical in the numerical approximation of stochastic control literature but currently not available in the RL literature. The derivation of such estimate deserves to be addressed in future research papers. Finally, we focused on convergence properties for a centralized Multi-Agent learning algorithm, paving the way for addressing such property for a more relevant decentralized one.

## Appendix

This Appendix gathers the technical proofs related to the error propagation bounds on the Approximate Fictitious Play algorithm detailed in Theorem 6. Many arguments reported here are inspired by the results presented in [17] for the exact fictitious play algorithm.

We follow the notations of Section 4. In particular, we recall that $\mathcal{X}_T$ denotes the set of accessible states before time $T$. Since we take and the set is compact, $\mathcal{X}_T$ is compact too and in particular it is a bounded subset of .

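In order to measure the proximity between MF population flows, we denote by $d_1$ the Wasserstein distance defined (using Kantorovich-Rubinstein duality) as: for all $\mu, \mu'$,

$$d_1(\mu,\mu') = \sup_{h\in \mathrm{Lip}_1(\mathcal{X}_T,\mathbb{R})} \sum_{t\in\mathcal{T}} \int_{\mathcal{X}} h(x)\, d(\mu_t-\mu'_t)(x),$$

where $\mathrm{Lip}_1(\mathcal{X}_T,\mathbb{R})$ is the set of $1$-Lipschitz continuous functions from $\mathcal{X}_T$ to $\mathbb{R}$.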

### A. Stability of the FP mean field flow $(\bar\mu^{(n)})_n$

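Let us first provide the proof of Lemma 5, which ensures the closeness in $d_1$ of two consecutive elements of the mean field flow learning sequence $(\bar\mu^{(n)})_n$. Let us first recall from the definition of $\bar\mu^{(n)}$ that we have: for all $n \in \mathbb{N}$,

$$\bar\mu^{(n+1)}-\bar\mu^{(n)} = \frac{1}{n}\left[\hat\mu^{(n+1)}-\bar\mu^{(n+1)}\right] = \frac{1}{n+1}\left[\hat\mu^{(n+1)}-\bar\mu^{(n)}\right]. \tag{11}$$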
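
Both equalities in (11) can be checked by direct computation. The short derivation below is a sketch that assumes the fictitious play averaging rule takes the form $\bar\mu^{(n+1)} = \frac{n}{n+1}\bar\mu^{(n)} + \frac{1}{n+1}\hat\mu^{(n+1)}$, which is the form consistent with (11); the definition itself belongs to Section 4 and is not restated here.

```latex
\begin{aligned}
\bar\mu^{(n+1)} - \bar\mu^{(n)}
  &= \frac{n}{n+1}\bar\mu^{(n)} + \frac{1}{n+1}\hat\mu^{(n+1)} - \bar\mu^{(n)}
   = \frac{1}{n+1}\left[\hat\mu^{(n+1)} - \bar\mu^{(n)}\right], \\
\hat\mu^{(n+1)} - \bar\mu^{(n+1)}
  &= \hat\mu^{(n+1)} - \frac{n}{n+1}\bar\mu^{(n)} - \frac{1}{n+1}\hat\mu^{(n+1)}
   = \frac{n}{n+1}\left[\hat\mu^{(n+1)} - \bar\mu^{(n)}\right]
   = n\left[\bar\mu^{(n+1)} - \bar\mu^{(n)}\right].
\end{aligned}
```

Dividing the second line by $n$ gives the first equality in (11), and the first line is the second equality.
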
###### Proof of Lemma 5.

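Let $h \in \mathrm{Lip}_1(\mathcal{X}_T,\mathbb{R})$. We recall that $\mathcal{X}_T$ is bounded and pick $x_0 \in \mathcal{X}_T$. Then, using (11) together with the definition of $\mathrm{Lip}_1(\mathcal{X}_T,\mathbb{R})$, we compute

$$\begin{aligned}
\left|\int_{\mathcal{X}_T} h(x)\, d\big(\bar\mu^{(n+1)}-\bar\mu^{(n)}\big)(x)\right|
&= \frac{1}{n+1}\left|\int_{\mathcal{X}_T} h(x)\, d\big(\hat\mu^{(n+1)}-\bar\mu^{(n)}\big)(x)\right| \\
&= \frac{1}{n+1}\left|\int_{\mathcal{X}_T} \big(h(x)-h(x_0)\big)\, d\big(\hat\mu^{(n+1)}-\bar\mu^{(n)}\big)(x)\right| \\
&\le \frac{1}{n+1}\int_{\mathcal{X}_T} \|x-x_0\| \left[d\hat\mu^{(n+1)}(x)+d\bar\mu^{(n)}(x)\right] \\
&\le \frac{C}{n+1},
\end{aligned}$$

since $\mathcal{X}_T$ is bounded. The second equality holds because $\hat\mu^{(n+1)}$ and $\bar\mu^{(n)}$ have the same total mass, so the constant $h(x_0)$ integrates to zero against their difference. This result being valid for any $h \in \mathrm{Lip}_1(\mathcal{X}_T,\mathbb{R})$, we obtain

$$d_1\big(\bar\mu^{(n+1)},\bar\mu^{(n)}\big)\le \frac{C}{n+1},\qquad n\in\mathbb{N}.$$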
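
The $C/(n+1)$ decay can be observed numerically on a toy example. The sketch below is only an illustration under simplifying assumptions that are not part of the paper: a single time step, a one-dimensional state space discretized on a uniform grid of $[0,1]$, and randomly drawn histograms standing in for the distributions $\hat\mu^{(n+1)}$ induced by the approximate best responses. All function names are hypothetical.

```python
import random


def w1_on_grid(p, q, dx):
    """Wasserstein-1 distance between two histograms on a uniform 1D grid.

    In one dimension, W1 is the integral of |CDF_p - CDF_q|.
    """
    total, cdf_gap = 0.0, 0.0
    for p_i, q_i in zip(p, q):
        cdf_gap += p_i - q_i
        total += abs(cdf_gap) * dx
    return total


def random_distribution(size, rng):
    """Random probability vector (stand-in for a learned distribution)."""
    weights = [rng.random() for _ in range(size)]
    norm = sum(weights)
    return [w / norm for w in weights]


def check_stability(num_iters=200, size=50, seed=0):
    """Return max over n of (n + 1) * d1(mu_bar^(n+1), mu_bar^(n)) / diameter."""
    rng = random.Random(seed)
    dx = 1.0 / (size - 1)  # grid points cover [0, 1]
    diameter = 1.0         # diameter of the (assumed) state space [0, 1]
    mu_bar = random_distribution(size, rng)
    worst_ratio = 0.0
    for n in range(1, num_iters + 1):
        mu_hat = random_distribution(size, rng)
        # Fictitious play averaging, consistent with identity (11).
        new_mu_bar = [(n * b + h) / (n + 1) for b, h in zip(mu_bar, mu_hat)]
        gap = w1_on_grid(new_mu_bar, mu_bar, dx)
        worst_ratio = max(worst_ratio, gap * (n + 1) / diameter)
        mu_bar = new_mu_bar
    return worst_ratio


if __name__ == "__main__":
    print("max (n+1) * d1 / diameter:", check_stability())
```

In this toy setting the constant $C$ can be taken equal to the diameter of the state space, so the returned ratio never exceeds 1, whatever the sequence of distributions fed to the averaging step.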

### B. Propagation error estimates

This section is dedicated to the rigorous derivation of the bounds (7) and (8) presented in Theorem 6.

We first recall the following useful result, see e.g. [17, Lemma 3.3.1].

###### Lemma 9.

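Let $(\varphi_n)_n$ and $(\lambda_n)_n$ be two sequences of real numbers such that

$$(n+1)\varphi_{n+1}-n\varphi_n\le\lambda_n,\qquad n\in\mathbb{N}.$$

Then, we have the estimate:

$$\varphi_n\le\frac{\varphi_0}{n}+\frac{1}{n}\sum_{i=1}^{n}\lambda_i,\qquad n\in\mathbb{N}.$$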

For ease of notation, we introduce and (which are functions defined over ), for all . More precisely, for ,

$$J_n(x):=J\big(x_0,(x_{t+1}-x_t)_{t=0,\dots,T-1},\bar\mu^{(n)}\big).$$

Observe that this definition is well posed thanks to the first order MFG setting presented in Section 4.1, since there is a bijection between a process trajectory and the pair formed by its initial position and policy.
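
The bijection invoked here can be made concrete with a small sketch. It assumes, as suggested by the increments $x_{t+1}-x_t$ appearing in the definition of $J_n$, that the first order dynamics read $x_{t+1} = x_t + a_t$ with a scalar state; the function names are hypothetical and chosen for illustration only.

```python
def trajectory_from_policy(x0, increments):
    """Build (x_0, ..., x_T) from the initial position and the actions a_t.

    Assumes first order dynamics x_{t+1} = x_t + a_t.
    """
    trajectory = [x0]
    for action in increments:
        trajectory.append(trajectory[-1] + action)
    return trajectory


def policy_from_trajectory(trajectory):
    """Recover (x_0, (x_{t+1} - x_t)_t) from a trajectory."""
    increments = [nxt - cur for cur, nxt in zip(trajectory, trajectory[1:])]
    return trajectory[0], increments


if __name__ == "__main__":
    path = trajectory_from_policy(0.5, [0.25, -0.5, 1.0])
    print(path)
    print(policy_from_trajectory(path))
```

Applying one map after the other returns the original input, which is why $J$ can be written either as a function of the trajectory or of the initial position and the actions.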

###### Proof of estimate (8) in Theorem 6.

We adapt the arguments in the proof of [17, Theorem 3.3.1] to our setting with approximate best responses.

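Let us introduce the approximate learning error $\hat e_n$ defined by: for $n \in \mathbb{N}$,

$$\hat e_n := \mathbb{E}_{x_0\sim\mu_0}\left[J\big(x_0,\hat\pi^{(n+1)},\bar\mu^{(n)}\big)-J\big(x_0,\bar\pi^{(n)},\bar\mu^{(n)}\big)\right]\ge 0,$$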

so that . In order to control , we will focus our analysis on and compute

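$$\begin{aligned}
(n+1)\hat e_{n+1}-n\hat e_n
&= (n+1)\int_{\mathcal{X}_T} J_{n+1}\, d\big(\hat\mu^{(n+2)}-\bar\mu^{(n+1)}\big)-n\int_{\mathcal{X}_T} J_n\, d\big(\hat\mu^{(n+1)}-\bar\mu^{(n)}\big) \\
&= (n+1)\int_{\mathcal{X}_T} J_{n+1}\, d\big(\hat\mu^{(n+1)}-\bar\mu^{(n+1)}\big)-n\int_{\mathcal{X}_T} J_n\, d\big(\hat\mu^{(n+1)}-\bar\mu^{(n)}\big) \\
&\quad +(n+1)\int_{\mathcal{X}_T} J_{n+1}\, d\big(\hat\mu^{(n+2)}-\hat\mu^{(n+1)}\big) \\
&= n(n+1)\int_{\mathcal{X}_T} (J_{n+1}-J_n)\, d\big(\bar\mu^{(n+1)}-\bar\mu^{(n)}\big)+(n+1)\int_{\mathcal{X}_T} J_{n+1}\, d\big(\hat\mu^{(n+2)}-\hat\mu^{(n+1)}\big),
\end{aligned}$$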

where the last equality follows from (11).
Thanks to Assumption 1, the monotonicity of the reward function implies directly

$$(n+1)\hat e_{n+1}-n\hat e_n\le(n+1)\int_{\mathcal{X}_T} J_{n+1}\, d\big(\hat\mu^{(n+2)}-\hat\mu^{(n+1)}\big). \tag{12}$$

By definition of $J_n$ together with the first order MFG dynamics, we have the expression

$$J_n(x)=\sum_{t=0}^{T-1}\gamma^t\left[\tilde r(x_t,x_{t+1}-x_t)+\bar r\big(x_t,\bar\mu^{(n)}_t\big)\right].$$
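
To make this expression concrete, the sketch below evaluates the discounted sum along a trajectory and checks that the difference between two such functions only involves the population-dependent term $\bar r$, since the action term $\tilde r$ cancels. The reward functions are hypothetical placeholders (a quadratic action cost and a crowd-aversion term read from a histogram on $[0,1]$), not the ones used in the paper, and the population flow is represented as one histogram per time step.

```python
def bin_index(x, num_bins):
    """Map a state x in [0, 1] to the index of its histogram bin."""
    return min(int(x * num_bins), num_bins - 1)


def r_tilde(x, a):
    """Hypothetical action cost (placeholder, not from the paper)."""
    return -0.5 * a * a


def r_bar(x, mu_t):
    """Hypothetical crowd-aversion reward: minus the mass of the bin of x."""
    return -mu_t[bin_index(x, len(mu_t))]


def total_reward(traj, mu_flow, gamma):
    """Discounted sum of r_tilde(x_t, x_{t+1} - x_t) + r_bar(x_t, mu_t), t < T."""
    horizon = len(traj) - 1
    return sum(
        gamma ** t
        * (r_tilde(traj[t], traj[t + 1] - traj[t]) + r_bar(traj[t], mu_flow[t]))
        for t in range(horizon)
    )


def population_term_gap(traj, mu_flow_a, mu_flow_b, gamma):
    """Difference of the two total rewards, computed from r_bar terms only."""
    horizon = len(traj) - 1
    return sum(
        gamma ** t * (r_bar(traj[t], mu_flow_a[t]) - r_bar(traj[t], mu_flow_b[t]))
        for t in range(horizon)
    )


if __name__ == "__main__":
    traj = [0.1, 0.4, 0.9]
    flow_a = [[0.5, 0.5], [0.2, 0.8]]
    flow_b = [[0.9, 0.1], [0.6, 0.4]]
    direct = total_reward(traj, flow_a, 0.9) - total_reward(traj, flow_b, 0.9)
    print(direct, population_term_gap(traj, flow_a, flow_b, 0.9))
```

This cancellation is what allows a difference such as $J_{n+1}-J_n$ to be controlled through the regularity of $\bar r$ in its measure argument alone.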

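Moreover, using Assumption 1 and the compactness of $\mathcal{X}_T$, we deduce as in [17, Lemma 3.5.2] the existence of a constant $C>0$ such that for all $x, x'$,

$$\big|J_{n+1}(x)-J_n(x)-J_{n+1}(x')+J_n(x')\big|\le C\,\|x-x'\|_\infty\, d_1\big(\bar\mu^{(n+1)},\bar\mu^{(n)}\big). \tag{13}$$

This property of the reward function together with Lemma 5 indicates that $J_{n+1}-J_n$ is $\frac{C}{n+1}$-Lipschitz, so that

$$\begin{aligned}
(n+1)\hat e_{n+1}-n\hat e_n
&\le (n+1)\int_{\mathcal{X}_T} J_n\, d\big(\hat\mu^{(n+2)}-\hat\mu^{(n+1)}\big)+C\, d_1\big(\hat\mu^{(n+2)},\hat\mu^{(n+1)}\big) \\
&\le (n+1)\int_{\mathcal{X}_T} J_n\, d\big(\mu^{*,(n+1)}-\hat\mu^{(n+1)}\big)+C\, d_1\big(\hat\mu^{(n+2)},\hat\mu^{(n+1)}\big),
\end{aligned}$$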

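where, in the last inequality, we used the fact that $\pi^{*,(n+1)}$ is the best response with respect to $\bar\mu^{(n)}$ and hence

$$\int_{\mathcal{X}_T} J_n\, d\mu^{*,(n+1)} \ge \int_{\mathcal{X}_T} J_n\, d\hat\mu^{(n+2)}. \tag{14}$$

Indeed, by optimality of $\pi^{*,(n+1)}$, we have

$$\int_{\mathcal{X}_T} J_n\, d\mu^{*,(n+1)} = J\big(\pi^{*,(n+1)},\bar\mu^{(n)}\big)\ge J\big(\hat\pi^{(n+2)},\bar\mu^{(n)}\big).$$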
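
For completeness, the first inequality in the chain above can be unpacked as follows. This is a sketch of the intermediate step, starting from (12) and using only (13), Lemma 5 and the Kantorovich-Rubinstein definition of $d_1$; the shorthand $\nu$ is introduced here for readability and $C$ denotes a generic constant.

```latex
\begin{aligned}
&\text{Set } \nu := \hat\mu^{(n+2)} - \hat\mu^{(n+1)}. \text{ Then} \\
&(n+1)\int_{\mathcal{X}_T} J_{n+1}\, d\nu
  = (n+1)\int_{\mathcal{X}_T} J_n\, d\nu
  + (n+1)\int_{\mathcal{X}_T} (J_{n+1}-J_n)\, d\nu, \\
&\text{and, since } J_{n+1}-J_n \text{ is } \tfrac{C}{n+1}\text{-Lipschitz,} \\
&(n+1)\int_{\mathcal{X}_T} (J_{n+1}-J_n)\, d\nu
  \le (n+1)\,\frac{C}{n+1}\, d_1\big(\hat\mu^{(n+2)},\hat\mu^{(n+1)}\big)
  = C\, d_1\big(\hat\mu^{(n+2)},\hat\mu^{(n+1)}\big).
\end{aligned}
```

The factor $n+1$ is thus exactly compensated by the $\frac{1}{n+1}$ stability rate of Lemma 5, which is why only the distance between consecutive approximate best response distributions remains.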

By definition of the learning error in (4), we deduce

$$(n+1)\hat e_{n+1}-n\hat e_n\le(n+1)\ell_n+C\, d_1\big(\hat\mu^{(n+2)},\hat\mu^{(n+1)}\big).$$

By Lemma 9 applied to

$$\varphi_n=\hat e_n,\quad\text{and}\quad\lambda_n=(n+1)\ell_n+C\, d_1\big(\hat\mu^{(n+2)},\hat\mu^{(n+1)}\big),$$

we derive the estimate

 ^en≤ℓn+^e0n+1nn∑i=1(i+1)ℓi+1nn∑i=1d1(^μ(i+2),^μ(i+