Reinforcement Learning for Mean Field Game
Stochastic games provide a framework for interactions among multiple agents and enable a myriad of applications. In these games, agents decide on actions simultaneously, the state of each agent moves to the next state, and each agent receives a reward. However, finding an equilibrium (if one exists) in such a game is often difficult when the number of agents becomes large. This paper focuses on finding a mean-field equilibrium (MFE) in an action-coupled stochastic game setting in an episodic framework. It is assumed that the impact of the other agents can be captured by the empirical distribution of their actions. All agents know the action distribution and employ lower-myopic best response dynamics to choose the optimal oblivious strategy. This paper proposes a posterior sampling based approach for reinforcement learning in the mean-field game, where each agent samples a transition probability from the posterior conditioned on the previous transitions. We show that the policy and action distributions converge to the optimal oblivious strategy and the limiting distribution, respectively, which constitute an MFE.
We live in a world where multiple agents interact repeatedly over a common environment. For example, multiple robots interact to achieve a specific goal. Multi-agent reinforcement learning (MARL) refers to the problem of learning and planning in a sequential decision-making system when the underlying system dynamics are unknown and may need to be learned by trying different options and observing their outcomes. Learning in MARL is fundamentally different from the traditional single-agent reinforcement learning (RL) problem, since agents interact not only with the environment but also with each other. Thus, an agent that tries to learn the underlying system dynamics has to consider the actions taken by the other agents. A change in the policy (or action) of one agent affects the others, and vice versa.
One natural learning approach is to extend RL algorithms to MARL by assuming that the other agents’ actions are independent. However, studies show that a smart agent that learns the joint actions of the others performs better than one that does not. Hence, it is unrealistic to assume that an agent will not try to learn the joint actions of the others. When the agents are strategic, i.e., they only want to take actions that maximize their own utility (or value), Nash equilibrium is often employed as the equilibrium concept. The existing equilibrium-solving approaches can handle only a handful of agents and some restricted games (those in which an adversarial equilibrium or a coordination equilibrium exists). The computational complexity of finding a Nash equilibrium at every stage game prevents those approaches from being applied to games where the number of agents is large.
In this paper, we consider MARL where a large number of agents co-exist. Similar to [10], we resort to the mean-field approach, where we assume that the Q-function of an agent is affected by the mean actions of the others. The mean-field game drastically reduces the complexity, since an agent now only needs to consider the empirical distribution of the actions played by the other agents. Further, we consider an oblivious strategy, where each agent takes an action based only on its own state. An agent does not have to track the policy evolution of the other agents. Unlike prior work, we consider a generalized version of the game where the state can be different for different agents. Further, we do not require an adversarial equilibrium or a coordinated equilibrium to be present. We also do not need to track the actions and the realized Q-values of the other agents, as was the case in earlier approaches.
Mean-field games arise in several domains. For example, a mean-field game is observed in a security game where a large number of agents make individual decisions about their own security. However, the ultimate security depends on the decisions made by the other agents. For example, in a network of computers, if an agent invests heavily in building firewalls, its computer can still be breached if the other agents’ computers are not secure. In the security game, each agent invests a certain amount to attain a security level; however, the investment level depends on the investment made by the other agents. If the number of agents is large, the game can be modeled as a mean-field game, since the average investment per agent impacts the decision of each agent.
Another example of a mean-field game is demand response pricing in the smart grid. The utility company sets a price based on the average demand per household. Hence, if the average demand is high at a certain time, an agent may want to reduce its consumption in order to decrease the price.
Unlike the standard literature on mean-field equilibrium in stochastic games [1, 9, 2], we consider that the transition probabilities are unknown to the agents. Instead, each agent learns the underlying transition probability matrix using a readily implementable posterior sampling approach [3, 6]. All agents employ best response dynamics to choose the best response strategy, which maximizes the discounted payoff for the remaining episode length. We show the asymptotic convergence of the policy and the action distribution to the optimal oblivious strategy and the limiting action distribution, respectively, and the converged point constitutes a mean-field equilibrium (MFE). We use the compactness of the state and action spaces for the convergence. We estimate the value function using update steps similar to the Expected Sarsa algorithm [8] and show that the iterates converge to the optimal value function of the true distribution.
The key contribution of the paper is a novel algorithm that is used by each agent in a multi-agent setting, which is shown to converge to a mean-field equilibrium. The proposed algorithm does not assume the knowledge of transition probabilities, and learns them using a posterior sampling approach.
II-A Stochastic Game
An $N$-player stochastic game is formalized by the tuple $M = (\mathcal{S}, \mathcal{A}, P, R, \gamma, \rho)$, where the parameters are defined as follows. The state of agent $i$ at time $t$ is given by $s_{i,t} \in \mathcal{S}$, where $\mathcal{S}$ is the state space set. $\mathcal{A}(s)$ is defined as the set of feasible actions an agent can take in state $s$. $\mathcal{A}$ is the action space set, defined as $\mathcal{A} = \bigcup_{s \in \mathcal{S}} \mathcal{A}(s)$. If the state of agent $i$ at time $t$ is $s_{i,t}$ and the action taken by the agent is $a_{i,t}$, then the next state is $s_{i,t+1}$ with probability $P(s_{i,t+1} \mid s_{i,t}, a_{i,t}, a_{-i,t})$, which is assumed to be the same for all agents. $R(s_{i,t}, a_{i,t}, a_{-i,t})$ is the probability distribution over the reward realized by agent $i$ when action $a_{i,t}$ is selected in state $s_{i,t}$, where $a_{-i,t}$ denotes the actions taken by all other agents. It is assumed to be the same for each agent, with support [0,1]. The constant $\gamma$ is the discount factor, and $\rho$ is the initial state distribution.
We consider an episodic framework where the length of the time horizon is $\tau$. The state space set $\mathcal{S}$ and the action space set $\mathcal{A}$ are deterministic and need not be learned by the agent. The game is played in episodes $k = 1, 2, \ldots$, each of length $\tau$. In each episode, the game is played in discrete steps $j = 1, \ldots, \tau$. The episodes begin at times $t_k = (k-1)\tau + 1$. At each time $t$, the state of agent $i$ is $s_{i,t}$, the agent selects an action $a_{i,t}$, observes a scalar reward $r_{i,t}$, and transitions to the state $s_{i,t+1}$. Let $H_{i,t}$ denote the history of agent $i$ up to time $t$.
II-B Problem Formulation
In a game with a large number of players, we might expect that the fluctuations of the agents’ actions will “average out”. Since other agents affect a single agent’s payoff only via the actions of the population, it is intuitive that as the number of agents increases, a single agent has a negligible effect on the game. This intuition is formalized as the mean field equilibrium.
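Let $\alpha_{-i,t}(a)$ be the fraction of the agents (excluding agent $i$) that take action $a$ at time $t$. Mathematically, for each $a \in \mathcal{A}$ (where $|\mathcal{A}|$ is the size of the action space set, so that $\alpha_{-i,t} \in [0,1]^{|\mathcal{A}|}$), we have
\[ \alpha_{-i,t}(a) = \frac{1}{N-1} \sum_{j \neq i} \mathbb{1}\{a_{j,t} = a\}, \]
where $\mathbb{1}\{a_{j,t} = a\}$ is the indicator function that the action taken by agent $j$ at time $t$ is $a$.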
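The empirical action distribution described above can be illustrated with a minimal sketch. The function name `empirical_action_distribution` and the encoding of actions as integers `0, ..., num_actions - 1` are assumptions made for this illustration only.

```python
import numpy as np

def empirical_action_distribution(actions, agent, num_actions):
    """Fraction of agents other than `agent` that play each action.

    actions: one integer action per agent at a fixed time step.
    """
    actions = np.asarray(actions)
    others = np.delete(actions, agent)  # exclude the agent itself
    counts = np.bincount(others, minlength=num_actions)
    return counts / others.size

# Five agents, three actions; agent 0 looks at the other four agents.
actions_t = [0, 2, 2, 1, 2]
alpha = empirical_action_distribution(actions_t, agent=0, num_actions=3)
print(alpha)
```

The result is a probability vector over actions, so an agent only needs to store one number per action rather than one action per opponent.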
In a mean field equilibrium, each agent chooses its action based on the average of the population actions. Thus the agent need not concern itself with the actions of the individual agents; it only needs to keep track of the averages of the population actions. The actions of the other players are therefore summarized by $\alpha_{-i,t}$, the proportion of the other players taking each action. Each agent conjectures that the next state is randomly distributed according to the transition probability measure $P$:
\[ s_{i,t+1} \sim P(\cdot \mid s_{i,t}, a_{i,t}, \alpha_{-i,t}). \]
Note that we assume that the transition probabilities are unknown and are learned using the posterior sampling algorithm described in Section III. Since each player is only concerned with its current state and the average action of the population, we describe a set of strategies which is known as oblivious strategy.
In an oblivious strategy, the set of policies available to agent $i$ is given by
\[ \Pi_i = \{ \pi_i : \pi_i(s, \alpha) \in \mathcal{A}(s) \ \text{for all } s \in \mathcal{S} \}. \]
Consider $\pi_i \in \Pi_i$, an oblivious policy followed by agent $i$ that maps each state and average action of the other agents to an action. We now define the value function of agent $i$ for the oblivious policy $\pi_i$ at time step $j$ of an episode as:
\[ V_{\pi_i, j}(s \mid \alpha) = \mathbb{E}\Big[ \sum_{l=j}^{\tau} \gamma^{\,l-j}\, \bar{r}(s_{i,l}, a_{i,l}, \alpha) \,\Big|\, s_{i,j} = s \Big], \]
where $\bar{r}(s, a, \alpha)$ is the expected reward realized by the agent in state $s$ when action $a$ is selected. It is clear that the action $a_{i,l} = \pi_i(s_{i,l}, \alpha)$. Since we mainly consider the decision of one agent, the index $i$ will be dropped wherever the agent index is unambiguous. The index $t$ will also be dropped from $\alpha_{-i,t}$, so that $\alpha$ denotes the limiting distribution, i.e., the long-run average population action distribution. Each agent determines its policy based on the limiting distribution $\alpha$. However, the decisions of the agents impact the limiting distribution $\alpha$. An equilibrium is established if and only if the decisions do not change the value of $\alpha$. Thus the agents need to be aware of the long-run action distribution to make their decisions.
Since the action space and state space are finite, the set of strategies available to the players is also finite. Each player adopts lower-myopic best response dynamics to choose its policy. As time proceeds, the strategies and the action distribution converge to the asymptotic equilibrium $(\pi, \alpha)$. Therefore, we have dropped the time index from the action distribution to represent the limiting distribution.
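The value function defined above satisfies the Bellman property for finite-horizon MDPs, given by
\[ V_{\pi, j}(s \mid \alpha) = \bar{r}(s, a, \alpha) + \gamma \sum_{s' \in \mathcal{S}} P(s' \mid s, a, \alpha)\, V_{\pi, j+1}(s' \mid \alpha), \qquad a = \pi(s, \alpha), \]
where $\bar{r}$ is the expected reward and $P$ is the transition probability.
The proof of the above expression is given in Appendix A.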
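For a fixed action distribution, the Bellman property above can be evaluated by backward induction over the steps of an episode. The following is a minimal sketch, not the paper's Algorithm 1: it assumes the expected rewards and transition probabilities are known arrays (`r_bar` and `P`, hypothetical names), that the value after the last step is zero, and that steps are indexed from 0 in code.

```python
import numpy as np

def backward_induction(r_bar, P, gamma, tau):
    """Finite-horizon dynamic programming for a fixed action distribution.

    r_bar: (S, A) expected rewards; P: (S, A, S) transition probabilities.
    Returns Q of shape (tau, S, A), V of shape (tau + 1, S), and a greedy
    policy of shape (tau, S).
    """
    S, A = r_bar.shape
    V = np.zeros((tau + 1, S))      # V[tau] = 0: no reward after the episode
    Q = np.zeros((tau, S, A))
    policy = np.zeros((tau, S), dtype=int)
    for j in range(tau - 1, -1, -1):
        # Q_j(s, a) = r_bar(s, a) + gamma * sum_s' P(s' | s, a) V_{j+1}(s')
        Q[j] = r_bar + gamma * P @ V[j + 1]
        policy[j] = Q[j].argmax(axis=1)
        V[j] = Q[j].max(axis=1)
    return Q, V, policy

# Two states, two actions: action 0 stays in place, action 1 switches state.
r_bar = np.array([[1.0, 0.0], [0.0, 0.5]])
P = np.zeros((2, 2, 2))
P[0, 0, 0] = P[1, 0, 1] = 1.0
P[0, 1, 1] = P[1, 1, 0] = 1.0
Q, V, policy = backward_induction(r_bar, P, gamma=0.9, tau=3)
print(V[0], policy[0])
```

In the mean-field setting both `r_bar` and `P` would depend on the action distribution, so the recursion would be rerun whenever that distribution changes.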
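We also define the $Q$-function as:
\[ Q_{\pi, j}(s, a \mid \alpha) = \bar{r}(s, a, \alpha) + \gamma \sum_{s' \in \mathcal{S}} P(s' \mid s, a, \alpha)\, V_{\pi, j+1}(s' \mid \alpha). \]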
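We now define the optimal oblivious strategy:
\[ \mathcal{P}(\alpha) = \Big\{ \pi : \pi(s, \alpha) \in \arg\max_{a \in \mathcal{A}(s)} Q_{\pi, j}(s, a \mid \alpha) \ \text{for all } s \in \mathcal{S} \text{ and } j = 1, \ldots, \tau \Big\}. \]
The set $\mathcal{P}(\alpha)$ maps a distribution $\alpha$ to the set of optimal oblivious strategies, which are chosen from the $Q$-function. In other words, for a given $\alpha$, a policy $\pi \in \mathcal{P}(\alpha)$ if and only if
\[ V_{\pi, j}(s \mid \alpha) = \max_{\pi'} V_{\pi', j}(s \mid \alpha) \quad \text{for all } s \text{ and } j. \]
Here, the policy $\pi$ is used at time step $j$ so that the value $V_{\pi, j}(s \mid \alpha)$ is maximized for all states $s$. Note that $\pi$ does not depend on the population state distribution $f$. Hence, it is a state-oblivious strategy where each agent takes its decision based on its own state only. The set $\mathcal{P}(\alpha)$ can be empty; however, in the subsequent section we show that it is non-empty under the assumptions used in this paper. Note that, with slight abuse of notation, we denote by $\pi_k$ the strategy that has been learned up to episode $k$.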
Suppose all agents play the optimal oblivious strategy $\pi \in \mathcal{P}(\alpha)$. Let the initial population state distribution be $f$, and let it evolve with all agents knowing the limiting action distribution $\alpha$ and playing according to $\pi$. If the long-run state distribution is equal to the initial state distribution $f$, then $f$ is said to be invariant under the dynamics induced by $\pi$ and $\alpha$. We denote the set of all such state distributions by the map $\mathcal{D}(\pi, \alpha)$.
We assume that an agent observes the action distribution, however, it does not observe the state of the other agents. Thus, an agent does not know the probability transition matrix and will try to estimate it from the past observations as described in the next section.
III Proposed Algorithm
In this section, we propose an algorithm, which will be shown to converge to the mean field equilibrium (MFE) in the following section. For each agent $i$, the algorithm begins with a prior distribution over the stochastic game with state space set $\mathcal{S}$, action space set $\mathcal{A}$, and time horizon $\tau$. The game is played in episodes $k = 1, 2, \ldots$, each of length $\tau$. In each episode, the game is played in discrete steps $j = 1, \ldots, \tau$. The episodes begin at times $t_k = (k-1)\tau + 1$. At each time $t$, the state of the agent is $s_{i,t}$; it selects an action $a_{i,t}$, observes a scalar reward $r_{i,t}$, and then transitions to the state $s_{i,t+1}$. Let $H_{i,t}$ denote the history of the agent up to time $t$.
At the beginning of each episode $k$, the MDP $M_k$ is sampled from the posterior distribution conditioned on the history $H_{i,t_k}$. The sampled action distribution is represented by $\alpha_k$. We assume that after some samples, $\alpha_k$ has converged. Until $\alpha_k$ converges, the proposed algorithm will not converge. The value function updated in the last iteration is used to calculate the optimal oblivious policy, which the agent follows for the entire episode. Recall that for a given $\alpha$, a policy $\pi \in \mathcal{P}(\alpha)$ if and only if $V_{\pi, j}(s \mid \alpha) = \max_{\pi'} V_{\pi', j}(s \mid \alpha)$ for all $s$ and $j$. To choose the policy from the set $\mathcal{P}(\alpha)$, we use lower-myopic learning dynamics, where at each episode we choose the strategy that is the infimum of the set $\mathcal{P}(\alpha)$.
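The posterior sampling step at the start of an episode can be sketched as follows. The paper does not state the form of the prior here, so this example assumes an independent Dirichlet prior over next states for each state-action pair and treats the action distribution as fixed. The function name `sample_transition_matrix` and the `prior` parameter are hypothetical.

```python
import numpy as np

def sample_transition_matrix(counts, prior=1.0, rng=None):
    """Draw one transition model from a Dirichlet posterior (assumed prior).

    counts: (S, A, S) array where counts[s, a, s2] is the number of observed
    transitions from s to s2 under action a.
    Returns P with P[s, a] a probability vector over next states.
    """
    rng = np.random.default_rng() if rng is None else rng
    S, A, _ = counts.shape
    P = np.empty((S, A, S))
    for s in range(S):
        for a in range(A):
            # Posterior parameters: prior pseudo-counts plus observed counts.
            P[s, a] = rng.dirichlet(counts[s, a] + prior)
    return P

# After many observed self-transitions, the sample concentrates on them.
counts = np.zeros((2, 2, 2))
counts[0, 0] = [1000, 0]
P_sample = sample_transition_matrix(counts, rng=np.random.default_rng(0))
print(P_sample[0, 0])
```

State-action pairs with few observations keep a diffuse posterior, so their sampled rows vary from episode to episode, which is what drives exploration in posterior sampling.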
We note that $M_k$ is used in the algorithm instead of $M^*$, where $M^*$ is the true distribution, since $M^*$ is not known. In order to obtain an estimate, each agent samples a transition probability matrix according to the posterior distribution. Each agent follows the strategy according to the $Q$-values over the episode. Based on the action chosen by each agent, we update the value function and the $Q$-function using the obtained rewards, which depend on the value of $\alpha$. This update is akin to learning the value function in the Expected Sarsa algorithm [8]. The detailed algorithm steps can be seen in Algorithm 1. We note that, after the algorithm converges, the value of $\alpha$ converges, and thus all the transition probabilities and value functions depend on the limiting distribution. We use as a shorter notation to represent
Here is used as an estimate of the value and is used as an estimate for which is updated over the episodes in a way similar to the Expected Sarsa algorithm. The iteration is as follows:
where is the estimate of the where it is obvious that and is the estimate of in the episode.
where is the limiting distribution of other agents at time and denotes the expected changes in the limiting action distribution when an agent selects the action . We make the following assumptions throughout the paper.
We assume that for all and some , the following hold.
where , , and are positive constants.
The above assumption is equivalent to the conditions Lipschitz continuity of and . We further assume that the values of used to update Q-function in the algorithm satisfy the following.
We assume that the step size $\beta_k \to 0$ as $k \to \infty$. Further, $\sum_{k} \beta_k = \infty$ and $\sum_{k} \beta_k^2$ is finite.
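To make the update and the step-size conditions concrete, the sketch below shows the textbook Expected Sarsa update for a finite-horizon problem, with step size 1/n on the n-th visit to a (step, state, action) triple. That choice tends to zero, has a divergent sum, and has a finite sum of squares. This is an assumed standard form, not the exact iteration of Algorithm 1, and all names are hypothetical.

```python
import numpy as np

def expected_sarsa_update(Q, visits, j, s, a, r, s_next, policy_probs, gamma):
    """Update Q[j, s, a] in place with one observed transition.

    Q: (tau + 1, S, A) with Q[tau] fixed at zero.
    visits: (tau, S, A) integer visit counts.
    policy_probs: (tau + 1, S, A) action probabilities of the current policy.
    """
    visits[j, s, a] += 1
    beta = 1.0 / visits[j, s, a]  # step size 1/n on the n-th visit
    # Expectation of the next-step Q-value under the current policy.
    expected_next = policy_probs[j + 1, s_next] @ Q[j + 1, s_next]
    Q[j, s, a] += beta * (r + gamma * expected_next - Q[j, s, a])
    return Q

# One-step episodes (tau = 1): the estimate becomes the running mean reward.
tau, S, A = 1, 2, 2
Q_est = np.zeros((tau + 1, S, A))
visits = np.zeros((tau, S, A), dtype=int)
policy_probs = np.full((tau + 1, S, A), 0.5)
for reward in [1.0, 0.0, 1.0]:
    expected_sarsa_update(Q_est, visits, 0, 0, 0, reward, 1, policy_probs, 0.9)
print(Q_est[0, 0, 0])
```

Using the expectation over next actions instead of a single sampled next action removes one source of variance from the update.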
IV Convergence Result
In this section, we show that if the oblivious strategy is chosen according to the proposed algorithm, then the oblivious strategy and the limiting population action distribution constitute a mean field equilibrium (MFE). More formally, we have the following result.
Theorem 1. The optimal oblivious strategy obtained from Algorithm 1 and the limiting action distribution constitute a mean-field equilibrium, and the value function obtained from the algorithm converges to the optimal value function of the true distribution.
The rest of the section proves this result. We first note that the lower-myopic best response strategy leads to convergence of the action strategy for finite action and state spaces. The key intuition behind the lower-myopic strategy is to avoid conflicts when the strategy that maximizes the value function at an agent is not unique, which could otherwise lead to choosing different strategies at different iterations. Any other consistent way of resolving multiple optima could be used, including upper-myopic, giving the same result. Having shown that the action strategy converges, we now proceed to show that the converged point of the algorithm is an MFE.
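The tie-breaking idea can be sketched in a few lines. This example assumes that actions are ordered by their integer index, so that the "lower" choice is the smallest maximizing index. The ordering, the tolerance `tol`, and the function names are assumptions for illustration.

```python
import numpy as np

def lower_myopic_action(q_values, tol=1e-9):
    """Smallest-index action among those that maximize q_values (up to tol)."""
    q_values = np.asarray(q_values, dtype=float)
    maximizers = np.flatnonzero(q_values >= q_values.max() - tol)
    return int(maximizers.min())

def upper_myopic_action(q_values, tol=1e-9):
    """Largest-index action among those that maximize q_values (up to tol)."""
    q_values = np.asarray(q_values, dtype=float)
    maximizers = np.flatnonzero(q_values >= q_values.max() - tol)
    return int(maximizers.max())

q = [0.2, 0.7, 0.7]
print(lower_myopic_action(q), upper_myopic_action(q))
```

Either rule is deterministic, so an agent facing the same values in two different episodes selects the same action, which is the property the convergence argument relies on.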
We first state the conditions needed for a policy, a population state distribution, and an action distribution to constitute an MFE (Section IV-A). Then, we show that the conditions given in Section IV-A are met for any optimal oblivious strategy (Section IV-B). Thus, the key property required to show the desired result is that the proposed algorithm leads to an optimal oblivious strategy. In order to show that, we first show that the optimal oblivious strategy set is non-empty (Section IV-C). Then, we show that the value function of the sampled distribution converges to that of the true distribution (Section IV-D). However, we do not estimate the value function directly from the samples, but instead use Expected Sarsa updates for the estimate of the value function. Thus, we finally show that the iterates of the value function obtained from the algorithm converge to the optimal value function of the sampled distribution (Section IV-E). Combined with the result in Section IV-D, this shows that the value function updates eventually converge to the value function obtained with knowledge of the true underlying transition probabilities. Hence the proposed algorithm converges to an optimal oblivious strategy, which constitutes a mean field equilibrium, thus proving the theorem.
IV-A Conditions for a Strategy to be an MFE
In this section, we describe the conditions for an oblivious strategy to be an MFE. In Section II-B, we defined two maps, $\mathcal{P}(\alpha)$ and $\mathcal{D}(\pi, \alpha)$. For a given action-coupled stochastic game, the map $\mathcal{P}(\alpha)$, for a given population action distribution $\alpha$, gives the set of optimal oblivious strategies. Further, the map $\mathcal{D}(\pi, \alpha)$, for a given population action distribution $\alpha$ and oblivious strategy $\pi$, gives the set of invariant population state distributions $f$.
We define the map $\hat{\mathcal{D}}(\pi, f)$, which gives the population action distribution induced by the oblivious strategy $\pi$ and the population state distribution $f$. The following lemma gives the conditions under which the stochastic game constitutes a mean field equilibrium. These conditions are established in the literature on mean-field equilibrium in stochastic games [1, 9, 2], to which the reader is referred for further details and the proof of this result.
Lemma 1. An action-coupled stochastic game with the strategy $\pi$, population state distribution $f$, and population action distribution $\alpha$ constitutes a mean field equilibrium if $\pi \in \mathcal{P}(\alpha)$, $f \in \mathcal{D}(\pi, \alpha)$, and $\alpha = \hat{\mathcal{D}}(\pi, f)$.
IV-B Conditions of Lemma 1 are met for any Optimal Oblivious Strategy
In this section, we show that the conditions of Lemma 1 are met for any optimal oblivious strategy. In the mean field equilibrium, each agent plays according to the strategy $\pi \in \mathcal{P}(\alpha)$. If the long-run average population action distribution is $\alpha$ and each agent follows an oblivious strategy, then the states must evolve such that the oblivious strategy on those states leads to an average action distribution of $\alpha$. Let the long-run average state distribution be $f$, i.e.,
\[ f(s) = \lim_{t \to \infty} \frac{1}{N} \sum_{i=1}^{N} \mathbb{1}\{s_{i,t} = s\}, \]
where $s_{i,t}$ is the state of agent $i$. Then the above statement implies that $f$ must satisfy
\[ \alpha(a) = \sum_{s \in \mathcal{S}_a} f(s), \]
where $\mathcal{S}_a$ represents the set of states for which $\pi(s, \alpha) = a$. This is equivalent to saying that if all the agents follow the optimal oblivious strategy $\pi$, then the long-run average population state distribution $f$ and the long-run average population action distribution $\alpha$ satisfy $f \in \mathcal{D}(\pi, \alpha)$ and $\alpha = \hat{\mathcal{D}}(\pi, f)$.
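The two consistency conditions above can be checked numerically on a small example. The sketch below assumes a deterministic oblivious policy given as one action index per state and a known state transition matrix under that policy (`P_pi`, a hypothetical name). It finds an invariant state distribution by power iteration and then sums it over the states mapped to each action.

```python
import numpy as np

def invariant_state_distribution(P_pi, iters=10_000, tol=1e-12):
    """Power iteration for f = f @ P_pi, with P_pi an (S, S) row-stochastic matrix."""
    S = P_pi.shape[0]
    f = np.full(S, 1.0 / S)
    for _ in range(iters):
        f_next = f @ P_pi
        if np.abs(f_next - f).max() < tol:
            return f_next
        f = f_next
    return f

def induced_action_distribution(f, policy, num_actions):
    """alpha(a) = sum of f(s) over the states s with policy[s] == a."""
    return np.bincount(np.asarray(policy), weights=f, minlength=num_actions)

# Three states; the policy plays action 0 in states 0 and 2, action 1 in state 1.
P_pi = np.array([[0.9, 0.1, 0.0],
                 [0.5, 0.0, 0.5],
                 [0.0, 0.5, 0.5]])
policy = [0, 1, 0]
f = invariant_state_distribution(P_pi)
alpha_induced = induced_action_distribution(f, policy, num_actions=2)
print(f, alpha_induced)
```

In the mean-field game, `P_pi` itself depends on the action distribution, so a consistent pair is one where the induced distribution matches the distribution that was used to build `P_pi`.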
IV-C Optimal Oblivious Strategy Set is Non-Empty
In this subsection, we show that there exists an optimal oblivious strategy. More formally, we have the following lemma.
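Lemma 2. For the limiting population average action distribution $\alpha$, the set of optimal oblivious strategies $\mathcal{P}(\alpha)$ is non-empty. A policy $\pi \in \mathcal{P}(\alpha)$ if, for all $s$ and $j$,
\[ V_{\pi, j}(s \mid \alpha) = \max_{\pi'} V_{\pi', j}(s \mid \alpha). \]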
Note that is bounded and Lipschitz continuous. In addition, for each state , the next state is drawn from a countable set. Further,
An oblivious strategy is optimal if and only if it attains a maximum on the right hand side of Eq. (16) for every . Since the reward is bounded, and , thus, is bounded for all which means there exists an optimal oblivious strategy that maximizes the RHS. Hence, the set for a given is non-empty. Therefore for each episode, there exists an optimal oblivious strategy which is given by . ∎
IV-D Sampling does not Lead to a Gap for Expected Value Function
In the last subsection, we proved that there exists an optimal oblivious strategy. In this subsection, we show that the expected optimal value functions achieved by the algorithm and by the true distribution are equal. We first state the Azuma-Hoeffding Lemma, which will be used in the result.
Lemma 3 (Azuma-Hoeffding Lemma). If $Y_n$ is a zero-mean martingale with almost surely bounded increments, $|Y_i - Y_{i-1}| \le C$, then for any $\delta > 0$, with probability at least $1 - \delta$, $Y_n \le C\sqrt{2 n \log(1/\delta)}$.
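As a numerical illustration of the lemma, the sketch below evaluates the bound in its standard form, $C\sqrt{2n\log(1/\delta)}$, and compares it against simulated symmetric random walks, which are zero-mean martingales with increments bounded by 1. The function name and the simulation setup are assumptions for this example.

```python
import math
import numpy as np

def azuma_hoeffding_bound(C, n, delta):
    """High-probability bound C * sqrt(2 n log(1/delta)) on a martingale Y_n."""
    return C * math.sqrt(2.0 * n * math.log(1.0 / delta))

rng = np.random.default_rng(0)
n, delta, C = 100, 0.05, 1.0
# Each row is a symmetric +/-1 random walk of n steps.
steps = rng.choice([-1.0, 1.0], size=(2000, n))
Y_n = steps.sum(axis=1)
bound = azuma_hoeffding_bound(C, n, delta)
violation_rate = float(np.mean(Y_n > bound))
print(bound, violation_rate)
```

The bound grows like the square root of n, so the bound divided by n shrinks as n grows. This is the mechanism by which an averaged difference can be driven to zero.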
At the start of every episode, each agent samples a probability distribution from the posterior distribution. The following result bounds the difference between the optimal value function learned by the true distribution function using the optimal policy which is unknown, and the optimal value function achieved by the sampled distribution from the policy . (Here, is abbreviated to )
Lemma 4. If the sampled distribution is chosen from Algorithm 1, we have the convergence of the optimal value function of the sampled distribution , to the optimal value function of the true distribution , , i.e., for all states as ,
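To prove this, we first show an equivalence between the true distribution and the sampled distribution, which comes from the property of posterior sampling shown in [6]: for any $\sigma(H_{t_k})$-measurable function $g$ of the history, we have
\[ \mathbb{E}[\, g(M^*) \mid H_{t_k} ] = \mathbb{E}[\, g(M_k) \mid H_{t_k} ], \]
where $M^*$ is the true distribution and $M_k$ is the distribution sampled in episode $k$,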
which can be applied to the difference of the optimal functions of the two distributions to show that for all states ,
Note that the length of all episodes is given by and the support of the reward is [0,1]. Therefore for all states , we have . Note that this condition is similar to bounded increments in Azuma-Hoeffding Lemma (Lemma 3).
Since is a zero mean martingale with respect to the filtration , and satisfies the assumptions of Azuma-Hoeffding Lemma, we obtain the result as in the statement of the Lemma. Also, for all states , we have, . So, the difference is a zero-mean martingale and has the bounded increments property. Applying the Azuma-Hoeffding Lemma to the martingale, we have the following result,
For total time of the algorithm, we have . Thus, for all , as , we have
Substituting , the above expression says that times the average difference in an episode which converges to zero as total time , which gives us the convergence of the optimal value functions of the two distributions. Thus, we have
IV-E Value Function Update Steps in Algorithm 1 Converge to the Actual Value Function
We have already shown the bound between the optimal value function achieved by the sampled and the true distribution. In this subsection, we will show that if the policy is chosen from Algorithm 1, the value converges to the optimal function. Further, the value function converges to the optimal function.
Lemma 5. The $Q$-value computed using the algorithm converges to the optimal $Q$-value of the sampled distribution given by i.e.
This result follows along the same lines as the convergence of the Expected Sarsa algorithm [8], and is thus omitted. The updates used in the algorithm use learning rates that satisfy the conditions required for the convergence of Expected Sarsa. ∎
Having shown that the Q function converges, it is easy to see that the value function converges, and thus we have
The value function obtained from Algorithm 1 converges to the optimal value function obtained from the true distribution i.e.
From Lemma 5, we note that the iterates of value function converge to the optimal value function of the sampled distribution given by . Also, by Lemma 4, the optimal value function of the sampled distribution converges to the optimal value function achieved by the true distribution . Combining these, we obtain the result as in the statement of the theorem. ∎
We consider an action-coupled stochastic game consisting of a large number of agents, where the transition probabilities are unknown to the agents. We resort to the concept of mean-field equilibrium, where each agent’s reward and transition probability are impacted only through the mean distribution of the actions of the other agents. When the number of agents grows large, the mean-field equilibrium becomes equivalent to the Nash equilibrium. We propose a posterior sampling based approach where each agent draws a sample using an updated posterior distribution and selects an optimal oblivious strategy accordingly. We show that the proposed algorithm converges to the mean field equilibrium without knowing the transition probabilities a priori.
This paper shows asymptotic convergence to the mean-field equilibrium, while finding the convergence rate is an interesting future direction.
- [1] Sachin Adlakha and Ramesh Johari. Mean field equilibrium in dynamic games with strategic complementarities. Operations Research, 61(4):971–989, 2013.
- [2] Sachin Adlakha, Ramesh Johari, and Gabriel Y Weintraub. Equilibria of dynamic games with many players: Existence, approximation, and market structure. Journal of Economic Theory, 156:269–316, 2015.
- [3] Shipra Agrawal and Randy Jia. Optimistic posterior sampling for reinforcement learning: worst-case regret bounds. In Advances in Neural Information Processing Systems, pages 1184–1194, 2017.
- [4] Junling Hu and Michael P Wellman. Nash Q-learning for general-sum stochastic games. Journal of Machine Learning Research, 4(Nov):1039–1069, 2003.
- [5] Michael L Littman. Friend-or-foe Q-learning in general-sum games. In ICML, volume 1, pages 322–328, 2001.
- [6] Ian Osband, Daniel Russo, and Benjamin Van Roy. (More) efficient reinforcement learning via posterior sampling. In Advances in Neural Information Processing Systems, pages 3003–3011, 2013.
- [7] Liviu Panait and Sean Luke. Cooperative multi-agent learning: The state of the art. Autonomous Agents and Multi-Agent Systems, 11(3):387–434, 2005.
- [8] Harm Van Seijen, Hado Van Hasselt, Shimon Whiteson, and Marco Wiering. A theoretical and empirical analysis of Expected Sarsa. In 2009 IEEE Symposium on Adaptive Dynamic Programming and Reinforcement Learning, pages 177–184. IEEE, 2009.
- [9] Gabriel Y Weintraub, C Lanier Benkard, and Benjamin Van Roy. Markov perfect industry dynamics with many firms. Econometrica, 76(6):1375–1411, 2008.
- [10] Yaodong Yang, Rui Luo, Minne Li, Ming Zhou, Weinan Zhang, and Jun Wang. Mean field multi-agent reinforcement learning. arXiv preprint arXiv:1802.05438, 2018.
Appendix A Dynamic Programming Equations for Finite-Horizon MDPs
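We define the value function at time step $j$ as:
\[ V_{\pi, j}(s \mid \alpha) = \mathbb{E}\Big[ \sum_{l=j}^{\tau} \gamma^{\,l-j}\, \bar{r}(s_l, a_l, \alpha) \,\Big|\, s_j = s \Big]. \]
This can be further re-written as
\[ V_{\pi, j}(s \mid \alpha) = \mathbb{E}\Big[ \bar{r}(s_j, a_j, \alpha) + \gamma \sum_{l=j+1}^{\tau} \gamma^{\,l-j-1}\, \bar{r}(s_l, a_l, \alpha) \,\Big|\, s_j = s \Big]. \]
Separating the two terms inside the expectation and taking $\gamma$ outside the expectation, we get
\[ V_{\pi, j}(s \mid \alpha) = \bar{r}(s, a, \alpha) + \gamma\, \mathbb{E}\Big[ \sum_{l=j+1}^{\tau} \gamma^{\,l-j-1}\, \bar{r}(s_l, a_l, \alpha) \,\Big|\, s_j = s \Big], \qquad a = \pi(s, \alpha). \]
After choosing the action $a$ in state $s$, the agent transitions to the state $s'$ with transition probability $P(s' \mid s, a, \alpha)$, so that
\[ V_{\pi, j}(s \mid \alpha) = \bar{r}(s, a, \alpha) + \gamma \sum_{s' \in \mathcal{S}} P(s' \mid s, a, \alpha)\, \mathbb{E}\Big[ \sum_{l=j+1}^{\tau} \gamma^{\,l-j-1}\, \bar{r}(s_l, a_l, \alpha) \,\Big|\, s_{j+1} = s' \Big]. \]
But $\mathbb{E}\big[ \sum_{l=j+1}^{\tau} \gamma^{\,l-j-1}\, \bar{r}(s_l, a_l, \alpha) \mid s_{j+1} = s' \big] = V_{\pi, j+1}(s' \mid \alpha)$, which gives
\[ V_{\pi, j}(s \mid \alpha) = \bar{r}(s, a, \alpha) + \gamma \sum_{s' \in \mathcal{S}} P(s' \mid s, a, \alpha)\, V_{\pi, j+1}(s' \mid \alpha). \]
Thus, the value functions defined above satisfy the dynamic programming equations for finite-horizon MDPs.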