Policy Gradient With Value Function Approximation
For Collective Multiagent Planning
Abstract
Decentralized (PO)MDPs provide an expressive framework for sequential decision making in a multiagent system. Given their computational complexity, recent research has focused on tractable yet practical subclasses of Dec-POMDPs. We address such a subclass called CDec-POMDP where the collective behavior of a population of agents affects the joint-reward and environment dynamics. Our main contribution is an actor-critic (AC) reinforcement learning method for optimizing CDec-POMDP policies. Vanilla AC has slow convergence for larger problems. To address this, we show how a particular decomposition of the approximate action-value function over agents leads to effective updates, and also derive a new way to train the critic based on local reward signals. Comparisons on a synthetic benchmark and a real world taxi fleet optimization problem show that our new AC approach provides better quality solutions than previous best approaches.
Duc Thien Nguyen Akshat Kumar Hoong Chuin Lau School of Information Systems Singapore Management University 80 Stamford Road, Singapore 178902 {dtnguyen.2014,akshatkumar,hclau}@smu.edu.sg
31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.
1 Introduction
Decentralized partially observable MDPs (Dec-POMDPs) have emerged in recent years as a promising framework for multiagent collaborative sequential decision making (Bernstein et al., 2002). Dec-POMDPs model settings where agents act based on different partial observations about the environment and each other to maximize a global objective. Applications of Dec-POMDPs include coordinating planetary rovers (Becker et al., 2004b), multirobot coordination (Amato et al., 2015) and throughput optimization in wireless networks (Winstein and Balakrishnan, 2013; Pajarinen et al., 2014). However, solving Dec-POMDPs is computationally challenging, being NEXP-Hard even for 2-agent problems (Bernstein et al., 2002).
To increase scalability and application to practical problems, past research has explored restricted interactions among agents such as state transition and observation independence (Nair et al., 2005; Kumar et al., 2011, 2015), event driven interactions (Becker et al., 2004a) and weak coupling among agents (Witwicki and Durfee, 2010). Recently, a number of works have focused on settings where agent identities do not affect interactions among agents. Instead, environment dynamics are primarily driven by the collective influence of agents (Varakantham et al., 2014; Sonu et al., 2015; Robbel et al., 2016; Nguyen et al., 2017), similar to well known congestion games (Meyers and Schulz, 2012). Several problems in urban transportation such as taxi supply-demand matching can be modeled using such collective planning models (Varakantham et al., 2012; Nguyen et al., 2017).
In this work, we focus on the collective Dec-POMDP framework (CDec-POMDP) that formalizes such a collective multiagent sequential decision making problem under uncertainty (Nguyen et al., 2017). Nguyen et al. present a sampling based approach to optimize policies in the CDec-POMDP model. A key drawback of this previous approach is that policies are represented in a tabular form which scales poorly with the size of the observation space of agents. Motivated by the recent success of reinforcement learning (RL) approaches (Mnih et al., 2015; Schulman et al., 2015; Mnih et al., 2016; Foerster et al., 2016; Leibo et al., 2017), our main contribution is an actor-critic (AC) reinforcement learning method (Konda and Tsitsiklis, 2003) for optimizing CDec-POMDP policies.
Policies are represented using a function approximator such as a neural network, thereby avoiding the scalability issues of a tabular policy. We derive the policy gradient and develop a factored action-value approximator based on collective agent interactions in CDec-POMDPs. Vanilla AC is slow to converge on large problems due to known issues of learning with global reward in large multiagent systems (Bagnell and Ng, 2005). To address this, we also develop a new way to train the critic, our action-value approximator, that effectively utilizes the local value functions of agents.
We test our approach on a synthetic multirobot grid navigation domain from (Nguyen et al., 2017), and a real world supply-demand taxi matching problem in a large Asian city with up to 8000 taxis (or agents), showing the scalability of our approach to large multiagent systems. Empirically, our new factored actor-critic approach works better than previous best approaches, providing much higher solution quality. The factored AC algorithm empirically converges much faster than vanilla AC, validating the effectiveness of our new training approach for the critic.
Related work: Our work is based on the framework of policy gradient with approximate value function similar to Sutton et al. (1999). However, as we empirically show, directly applying the original policy gradient from Sutton et al. (1999) to the multiagent setting, and specifically to the CDec-POMDP model, results in a high variance solution. In this work, we show a suitable form of compatible value function approximation for CDec-POMDPs that results in an efficient and low variance policy gradient update. Reinforcement learning for decentralized policies has been studied earlier in Peshkin et al. (2000), Aberdeen (2006). Guestrin et al. (2002) also proposed using REINFORCE to train a softmax policy of a factored value function from the coordination graph. However, in such previous works, the policy gradient is estimated from the global empirical returns instead of a decomposed critic. We show in section 4 that having a decomposed critic along with an individual value function based training of this critic is important for sample-efficient learning. Our empirical results show that our proposed critic training has faster convergence than training with global empirical returns.
2 Collective Decentralized POMDP Model
We first describe the CDec-POMDP model introduced in (Nguyen et al., 2017). A $T$-step Dynamic Bayesian Network (DBN) for this model is shown using the plate notation in figure 1. It consists of the following:

A finite planning horizon $H$.

The number of agents $M$. An agent $m$ can be in one of the states in the state space $S$. The joint state space is $\times_{m=1}^{M} S$. We denote a single state as $i \in S$.

A set of actions $A$ for each agent $m$. We denote an individual action as $j \in A$.

Let $(s_{1:H}, a_{1:H})^m = (s^m_1, a^m_1, s^m_2, \ldots, s^m_H, a^m_H)$ denote the complete state-action trajectory of an agent $m$. We denote the state and action of agent $m$ at time $t$ using random variables $s^m_t$, $a^m_t$. Different indicator functions $\mathbb{I}_t(\cdot)$ are defined in table 1. We define the following count given the trajectory of each agent $m$:
$$n_t(i, j, i') = \sum_{m=1}^{M} \mathbb{I}^m_t(i, j, i') \quad \forall i, i' \in S,\ j \in A$$
As noted in table 1, count $n_t(i, j, i')$ denotes the number of agents in state $i$ taking action $j$ at time step $t$ and transitioning to next state $i'$; other counts, $n_t(i)$ and $n_t(i, j)$, are defined analogously. Using these counts, we can define the count tables $n_{s_t}$ and $n_{s_t a_t}$ for the time step $t$ as shown in table 1.

We assume a general partially observable setting wherein agents can have different observations based on the collective influence of other agents. An agent observes its local state $s^m_t$. In addition, it also observes $o^m_t$ at time $t$ based on its local state $s^m_t$ and the count table $n_{s_t}$. E.g., an agent $m$ in state $i$ at time $t$ can observe the count of other agents also in state $i$ ($= n_t(i)$) or other agents in some neighborhood of the state $i$ ($= \{n_t(i')\ \forall i' \in \mathrm{Nb}(i)\}$).

The transition function is $\phi_t(s^m_{t+1} = i' \mid s^m_t = i, a^m_t = j, n_{s_t})$. The transition function is the same for all the agents. Notice that it is affected by $n_{s_t}$, which depends on the collective behavior of the agent population.

Each agent $m$ has a non-stationary policy $\pi^m_t(j \mid i, o^m_t(i, n_{s_t}))$ denoting the probability of agent $m$ to take action $j$ given its observation $(i, o^m_t(i, n_{s_t}))$ at time $t$. We denote the policy over the planning horizon of an agent $m$ to be $\pi^m = (\pi^m_1, \ldots, \pi^m_H)$.

An agent $m$ receives the reward $r^m_t = r_t(i, j, n_{s_t})$ dependent on its local state and action, and the counts $n_{s_t}$.

Initial state distribution, $b_o = (P(i)\ \forall i \in S)$, is the same for all agents.
We present here the simplest version where all the agents are of the same type having similar state transition, observation and reward models. The model can handle multiple agent types where agents have different dynamics based on their type. We can also incorporate an external state that is unaffected by agents' actions (such as taxi demand in the transportation domain). Our results can be extended to such settings as well.
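Table 1: Notation
$\mathbb{I}^m_t(i) \in \{0, 1\}$: 1 if agent $m$ is at state $i$ at time $t$, i.e., $s^m_t = i$
$\mathbb{I}^m_t(i, j) \in \{0, 1\}$: 1 if agent $m$ takes action $j$ in state $i$ at time $t$, i.e., $(s^m_t, a^m_t) = (i, j)$
$\mathbb{I}^m_t(i, j, i') \in \{0, 1\}$: 1 if agent $m$ takes action $j$ in state $i$ at time $t$ and transitions to state $i'$, i.e., $(s^m_t, a^m_t, s^m_{t+1}) = (i, j, i')$
$n_t(i) \in [0; M]$: Number of agents at state $i$ at time $t$
$n_t(i, j) \in [0; M]$: Number of agents at state $i$ taking action $j$ at time $t$
$n_t(i, j, i') \in [0; M]$: Number of agents at state $i$ taking action $j$ at time $t$ and transitioning to state $i'$ at time $t+1$
$n_{s_t}$: Count table $(n_t(i)\ \forall i \in S)$
$n_{s_t a_t}$: Count table $(n_t(i, j)\ \forall i \in S, j \in A)$
$n_{s_t a_t s_{t+1}}$: Count table $(n_t(i, j, i')\ \forall i, i' \in S, j \in A)$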
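The counts in table 1 can be illustrated with a short NumPy sketch. It assumes agent trajectories are stored as integer arrays `states` and `actions` of shape (M, H); the function name `count_tables` and this array layout are illustrative choices, not part of the original model description.

```python
import numpy as np

def count_tables(states, actions, num_states, num_actions):
    """Build n_t(i), n_t(i, j) and n_t(i, j, i') from per-agent trajectories.

    states, actions: integer arrays of shape (M, H), one row per agent
    (hypothetical layout chosen for this sketch).
    """
    states = np.asarray(states)
    actions = np.asarray(actions)
    M, H = states.shape
    n_s = np.zeros((H, num_states), dtype=int)
    n_sa = np.zeros((H, num_states, num_actions), dtype=int)
    # Transitions exist only for t = 0 .. H-2 (0-indexed).
    n_sas = np.zeros((H - 1, num_states, num_actions, num_states), dtype=int)
    for t in range(H):
        # np.add.at accumulates correctly when several agents share an index.
        np.add.at(n_s[t], states[:, t], 1)
        np.add.at(n_sa[t], (states[:, t], actions[:, t]), 1)
        if t < H - 1:
            np.add.at(n_sas[t], (states[:, t], actions[:, t], states[:, t + 1]), 1)
    return n_s, n_sa, n_sas

# Three agents, two time steps, two states, two actions.
states = [[0, 1], [0, 1], [1, 1]]
actions = [[1, 0], [1, 1], [0, 0]]
n_s, n_sa, n_sas = count_tables(states, actions, num_states=2, num_actions=2)
```

Agent identities are discarded: any permutation of the rows of `states` and `actions` yields the same count tables, which is the anonymity property the model relies on.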
Models such as CDec-POMDPs are useful in settings where the agent population is large, and agent identity does not affect the reward or the transition function. A motivating application of this model is taxi-fleet optimization, where the problem is to compute policies for taxis such that the total profit of the fleet is maximized (Varakantham et al., 2012; Nguyen et al., 2017). The decision making for a taxi is as follows. At time $t$, each taxi observes its current city zone (different zones constitute the state-space $S$), and also the count of other taxis in the current zone and its neighboring zones as well as an estimate of the current local demand. This constitutes the count-based observation $o(\cdot)$ for the taxi. Based on this observation, the taxi must decide whether to stay in the current zone to look for passengers or move to another zone. These decision choices depend on several factors such as the ratio of demand and the count of other taxis in the current zone. Similarly, the environment is stochastic with variable taxi demand at different times. Such historical demand data is often available using GPS traces of the taxi fleet (Varakantham et al., 2012).
Count-based statistic for planning: A key property in the CDec-POMDP model is that the model dynamics depend on the collective interaction among agents rather than agent identities. In settings such as taxi fleet optimization, the agent population size can be quite large ($\approx 8000$ for our real world experiments). Given such a large population, it is not possible to compute a unique policy for each agent. Therefore, similar to previous work (Varakantham et al., 2012; Nguyen et al., 2017), our goal is to compute a homogeneous policy $\pi$ for all the agents. As the policy $\pi$ is dependent on counts, it represents an expressive class of policies.
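For a fixed population $M$, let $\{(s_{1:H}, a_{1:H})^m\ \forall m\}$ denote the state-action trajectories of different agents sampled from the DBN in figure 1. Let $n_{1:H} = \{(n_{s_t}, n_{s_t a_t}, n_{s_t a_t s_{t+1}})\ \forall t = 1:H\}$ be the combined vector of the resulting count tables for each time step $t$. Nguyen et al. show that counts $n$ are the sufficient statistic for planning. That is, the joint-value function of a policy $\pi$ over horizon $H$ can be computed by the expectation over counts as (Nguyen et al., 2017):
$$V(\pi) = \sum_{m=1}^{M} \sum_{T=1}^{H} \mathbb{E}\big[r^m_T\big] = \sum_{n \in \Omega_{1:H}} P(n; \pi) \Big[ \sum_{T=1}^{H} \sum_{i \in S,\, j \in A} n_T(i, j)\, r_T(i, j, n_{s_T}) \Big] \tag{1}$$
Set $\Omega_{1:H}$ is the set of all allowed consistent count tables as:
$$\sum_{i \in S} n_T(i) = M\ \ \forall T; \quad \sum_{j \in A} n_T(i, j) = n_T(i)\ \ \forall i, \forall T; \quad \sum_{i' \in S} n_T(i, j, i') = n_T(i, j)\ \ \forall i \in S, j \in A, \forall T$$
$P(n; \pi)$ is the distribution over counts (detailed expression in appendix). A key benefit of this result is that we can evaluate the policy $\pi$ by sampling counts $n$ directly from $P(n; \pi)$ without sampling individual agent trajectories $(s_{1:H}, a_{1:H})^m$ for different agents, resulting in significant computational savings. Our goal is to compute the optimal policy $\pi$ that maximizes $V(\pi)$. We assume an RL setting with centralized learning and decentralized execution. We assume a simulator is available that can provide count samples from $P(n; \pi)$.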
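As a rough illustration of count-based policy evaluation, the sketch below averages the bracketed term of the count-based value expression over count samples. It assumes the samples are already available as arrays `n_sa[t, i, j]` (for example from a simulator); `empirical_return`, `estimate_policy_value` and the toy `congestion_reward` are hypothetical names introduced only for this example.

```python
import numpy as np

def empirical_return(n_sa, reward_fn):
    """Sum over t, i, j of n_t(i, j) * r_t(i, j, n_{s_t}) for one count sample.

    n_sa: array of shape (H, S, A); reward_fn(t, i, j, n_s_t) returns a float.
    """
    H, S, A = n_sa.shape
    total = 0.0
    for t in range(H):
        n_s_t = n_sa[t].sum(axis=1)  # n_t(i) recovered from n_t(i, j)
        for i in range(S):
            for j in range(A):
                if n_sa[t, i, j] > 0:
                    total += n_sa[t, i, j] * reward_fn(t, i, j, n_s_t)
    return total

def estimate_policy_value(count_samples, reward_fn):
    """Monte Carlo estimate of V(pi) from samples n ~ P(n; pi)."""
    return float(np.mean([empirical_return(n, reward_fn) for n in count_samples]))

def congestion_reward(t, i, j, n_s_t):
    """Toy reward (an assumption): each agent pays the occupancy of its state."""
    return -float(n_s_t[i])

# Two samples with H = 1, two states, one action.
samples = [np.array([[[3], [1]]]), np.array([[[2], [2]]])]
value = estimate_policy_value(samples, congestion_reward)
```

The cost of one evaluation scales with the size of the count table rather than with the number of agents, which is the computational saving noted above.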
3 Policy Gradient for CDec-POMDPs
Previous work proposed an expectation-maximization (EM) (Dempster et al., 1977) based sampling approach to optimize the policy (Nguyen et al., 2017). The policy is represented as a piecewise linear tabular policy over the space of counts where each linear piece specifies a distribution over next actions. However, this tabular representation is limited in its expressive power as the number of pieces is fixed a priori, and the range of each piece has to be defined manually, which can adversely affect performance. Furthermore, exponentially many pieces are required when the observation is multidimensional (i.e., an agent observes counts from some local neighborhood of its location). To address such issues, our goal is to optimize policies in a functional form such as a neural network.
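We first extend the policy gradient theorem of (Sutton et al., 1999) to CDec-POMDPs. Let $\theta$ denote the vector of policy parameters. We next show how to compute $\nabla_\theta V(\pi)$. Let $s_t$, $a_t$ denote the joint-state and joint-actions of all the agents at time $t$. The value function of a given policy $\pi$ in an expanded form is given as:
$$V_t(\pi) = \sum_{s_t, a_t} P^\pi(s_t, a_t \mid b_o, \pi)\, Q^\pi_t(s_t, a_t) \tag{2}$$
where $P^\pi(s_t, a_t \mid b_o, \pi)$ is the distribution of the joint state-action $s_t, a_t$ under the policy $\pi$. The value function $Q^\pi_t(s_t, a_t)$ is computed as:
$$Q^\pi_t(s_t, a_t) = r_t(s_t, a_t) + \sum_{s_{t+1}, a_{t+1}} P^\pi(s_{t+1}, a_{t+1} \mid s_t, a_t)\, Q^\pi_{t+1}(s_{t+1}, a_{t+1}) \tag{3}$$
with joint reward $r_t(s_t, a_t) = \sum_{m} r^m_t$.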
We next state the policy gradient theorem for DecPOMDPs:
Theorem 1.
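For any CDec-POMDP, the policy gradient is given as:
$$\nabla_\theta V_1(\pi) = \sum_{t=1}^{H} \mathbb{E}_{s_t, a_t \mid b_o, \pi} \Big[ Q^\pi_t(s_t, a_t) \sum_{i \in S,\, j \in A} n_t(i, j)\, \nabla_\theta \log \pi_t\big(j \mid i, o(i, n_{s_t})\big) \Big] \tag{4}$$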
The proofs of this theorem and other subsequent results are provided in the appendix.
Notice that computing the policy gradient using the above result is not practical for multiple reasons. The space of joint state-action $(s_t, a_t)$ is combinatorial. Given that the agent population size can be large, sampling each agent's trajectory is not computationally tractable. To remedy this, we later show how to compute the gradient by directly sampling counts $n \sim P(n; \pi)$, similar to policy evaluation in (1). Similarly, one can estimate the action-value function $Q^\pi_t(s_t, a_t)$ using empirical returns as an approximation. This would be the analogue of the standard REINFORCE algorithm (Williams, 1992) for CDec-POMDPs. It is well known that REINFORCE may learn more slowly than other methods that use a learned action-value function (Sutton et al., 1999). Therefore, we next present a function approximator for $Q^\pi_t$, and show the computation of the policy gradient by directly sampling counts $n$.
3.1 Policy Gradient with Action-Value Approximation
One can approximate the action-value function $Q^\pi_t(s_t, a_t)$ in several different ways. We consider the following special form of the approximate value function $f_w$:
$$Q^\pi_t(s_t, a_t) \approx f_w(s_t, a_t) = \sum_{m=1}^{M} f^m_w\big(s^m_t, o(s^m_t, n_{s_t}), a^m_t\big) \tag{5}$$
where each $f^m_w$ is defined for each agent $m$ and takes as input the agent's local state, action and the observation. Notice that different components $f^m_w$ are correlated as they depend on the common count table $n_{s_t}$. Such a decomposable form is useful as it leads to efficient policy gradient computation. Furthermore, an important class of approximate value function having this form for CDec-POMDPs is the compatible value function (Sutton et al., 1999) which results in an unbiased policy gradient (details in appendix).
Proposition 1.
Compatible value function for CDec-POMDPs can be factorized as:
$$f_w(s_t, a_t) = \sum_{m} f^m_w\big(s^m_t, o(s^m_t, n_{s_t}), a^m_t\big)$$
We can directly replace $Q^\pi(\cdot)$ in policy gradient (4) by the approximate action-value function $f_w$. Empirically, we found that variance using this estimator was high. We exploit the structure of $f_w$ and show further factorization of the policy gradient next which works much better empirically.
Theorem 2.
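For any value function having the decomposition as:
$$f_w(s_t, a_t) = \sum_{m} f^m_w\big(s^m_t, o(s^m_t, n_{s_t}), a^m_t\big), \tag{6}$$
the policy gradient can be computed as
$$\nabla_\theta V_1(\pi) = \sum_{t=1}^{H} \mathbb{E}_{s_t, a_t} \Big[ \sum_{m} \nabla_\theta \log \pi\big(a^m_t \mid s^m_t, o(s^m_t, n_{s_t})\big)\, f^m_w\big(s^m_t, o(s^m_t, n_{s_t}), a^m_t\big) \Big] \tag{7}$$
The above result shows that if the approximate value function is factored, then the resulting policy gradient also becomes factored. The above result also applies to agents with multiple types as we assumed the function $f^m_w$ is different for each agent. In the simpler case when all the agents are of the same type, we have the same function $f_w$ for each agent, and also deduce the following:
$$f_w(s_t, a_t) = \sum_{i, j} n_t(i, j)\, f_w\big(i, j, o(i, n_{s_t})\big) \tag{8}$$
Using the above result, we simplify the policy gradient as:
$$\nabla_\theta V_1(\pi) = \sum_{t} \mathbb{E}_{s_t, a_t} \Big[ \sum_{i, j} n_t(i, j)\, \nabla_\theta \log \pi\big(j \mid i, o(i, n_{s_t})\big)\, f_w\big(i, j, o(i, n_{s_t})\big) \Big] \tag{9}$$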
3.2 Count-based Policy Gradient Computation
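Notice that in (9), the expectation is still w.r.t. joint-states and actions $(s_t, a_t)$, which is not efficient in large population sizes. To address this issue, we exploit the insight that the approximate value function in (8) and the inner expression in (9) depend only on the counts generated by the joint-state and action $(s_t, a_t)$.
Theorem 3.
For any value function having the form: $f_w(s_t, a_t) = \sum_{i, j} n_t(i, j)\, f_w\big(i, j, o(i, n_{s_t})\big)$, the policy gradient can be computed as:
$$\nabla_\theta V_1(\pi) = \mathbb{E}_{n_{1:H} \in \Omega_{1:H}} \Big[ \sum_{t=1}^{H} \sum_{i \in S,\, j \in A} n_t(i, j)\, \nabla_\theta \log \pi\big(j \mid i, o(i, n_{s_t})\big)\, f_w\big(i, j, o(i, n_{s_t})\big) \Big] \tag{10}$$
The above result shows that the policy gradient can be computed by sampling count table vectors $n_{1:H}$ from the underlying distribution $P(\cdot)$ analogous to computing the value function of the policy in (1), which is tractable even for large population sizes.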
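The count-based estimator can be sketched for a single count sample under strong simplifying assumptions that are not made in the paper: a stationary tabular softmax policy over the local state only (no count observation), with logits `theta[i, j]`, and critic values supplied as a fixed array `f[t, i, j]`. For this policy class, the gradient of $\log \pi(j \mid i)$ with respect to `theta[i, k]` is $\mathbb{1}[k = j] - \pi(k \mid i)$, which gives the closed form below.

```python
import numpy as np

def softmax(logits):
    """Row-wise softmax with the usual max-shift for numerical stability."""
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def count_based_gradient(theta, n_sa, f):
    """Gradient of sum_t sum_{i,j} n_t(i,j) log pi(j|i) f_t(i,j) w.r.t. theta.

    theta: (S, A) logits of a tabular softmax policy (an assumption of this sketch).
    n_sa:  (H, S, A) count sample n_t(i, j).
    f:     (H, S, A) critic values, treated as constants.
    """
    pi = softmax(theta)                 # pi[i, k] = pi(k | i)
    w = (n_sa * f).sum(axis=0)          # w[i, j] = sum_t n_t(i, j) f_t(i, j)
    # grad[i, k] = sum_j w[i, j] * (1[j == k] - pi[i, k])
    return w - pi * w.sum(axis=1, keepdims=True)

theta = np.zeros((1, 2))                # one state, two actions, uniform policy
n_sa = np.array([[[3, 1]]])             # 3 agents took action 0, 1 took action 1
f = np.array([[[2.0, 0.0]]])            # critic prefers action 0
grad = count_based_gradient(theta, n_sa, f)
```

Averaging `grad` over several count samples would give a Monte Carlo estimate of the expectation; with a neural network policy the same weighted log-likelihood would typically be differentiated automatically instead.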
4 Training Action-Value Function
In our approach, after count samples $n_{1:H}$ are generated to compute the policy gradient, we also need to adjust the parameters $w$ of our critic $f_w$. Notice that as per (8), the action value function $f_w(s_t, a_t)$ depends only on the counts generated by the joint-state and action $(s_t, a_t)$. Training $f_w$ can be done by taking a gradient step to minimize the following loss function:
$$\min_w \sum_{\xi=1}^{K} \sum_{t=1}^{H} \big( f_w(n^\xi_t) - R^\xi_t \big)^2 \tag{11}$$
where $n^\xi_{1:H}$ is a count sample generated from the distribution $P(n; \pi)$; $f_w(n^\xi_t)$ is the action value function and $R^\xi_t$ is the total empirical return for time step $t$ computed using (1):
$$R^\xi_t = \sum_{T=t}^{H} \sum_{i \in S,\, j \in A} n^\xi_T(i, j)\, r_T(i, j, n^\xi_{s_T}) \tag{12}$$
However, we found that the loss in (11) did not work well for training the critic for larger problems. Several count samples were required to reliably train $f_w$, which adversely affects scalability for large problems with many agents. It is already known in multiagent RL that algorithms that solely rely on the global reward signal (e.g. $\sum_m r^m_T$ in our case) may require several more samples than approaches that take advantage of local reward signals (Bagnell and Ng, 2005). Motivated by this observation, we next develop a local reward signal based strategy to train the critic $f_w$.
Individual Value Function: Let $n^\xi_{1:H}$ be a count sample. Given the count sample $n^\xi_{1:H}$, let $V^\xi_t(i, j) = \mathbb{E}\big[\sum_{t'=t}^{H} r^m_{t'} \mid s^m_t = i, a^m_t = j, n^\xi_{1:H}\big]$ denote the total expected reward obtained by an agent that is in state $i$ and takes action $j$ at time $t$. This individual value function can be computed using dynamic programming as shown in (Nguyen et al., 2017). Based on this value function, we next show an alternative reparameterization of the global empirical reward $R^\xi_t$ in (12):
Lemma 1.
The empirical return $R^\xi_t$ for the time step $t$ given the count sample $n^\xi_{1:H}$ can be reparameterized as: $R^\xi_t = \sum_{i \in S,\, j \in A} n^\xi_t(i, j)\, V^\xi_t(i, j)$.
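The sketch below illustrates one way such an individual value function could be computed by backward dynamic programming, and checks the reparameterization numerically. It is an assumption of this sketch (the exact recursion is given in Nguyen et al., 2017, not here) that the empirical transition $n_t(i,j,i')/n_t(i,j)$ and the empirical action frequency $n_{t+1}(i',j')/n_{t+1}(i')$ are used in the recursion; `rewards[t, i, j]` stands for $r_t(i, j, n_{s_t})$ already evaluated on the sample.

```python
import numpy as np

def individual_values(n_sa, n_sas, rewards):
    """Backward recursion for V_t(i, j) on a single count sample.

    n_sa: (H, S, A), n_sas: (H-1, S, A, S), rewards: (H, S, A).
    """
    H, S, A = n_sa.shape
    V = np.zeros((H, S, A))
    V[H - 1] = rewards[H - 1]
    for t in range(H - 2, -1, -1):
        n_next = n_sa[t + 1].sum(axis=1)[:, None]          # n_{t+1}(i')
        # Empirical action frequencies at t+1; unvisited states contribute 0.
        pi_next = np.divide(n_sa[t + 1], n_next,
                            out=np.zeros((S, A)), where=n_next > 0)
        v_next = (pi_next * V[t + 1]).sum(axis=1)           # shape (S,)
        denom = n_sa[t][:, :, None]                         # n_t(i, j)
        trans = np.divide(n_sas[t], denom,
                          out=np.zeros((S, A, S)), where=denom > 0)
        V[t] = rewards[t] + trans @ v_next
    return V

def empirical_returns(n_sa, rewards):
    """R_t = sum over T >= t and (i, j) of n_T(i, j) * r_T(i, j)."""
    per_step = (n_sa * rewards).sum(axis=(1, 2))
    return per_step[::-1].cumsum()[::-1]

# Consistent counts for 3 agents, H = 2, two states, two actions.
n_sa = np.array([[[0, 2], [1, 0]], [[0, 0], [2, 1]]])
n_sas = np.zeros((1, 2, 2, 2), dtype=int)
n_sas[0, 0, 1, 1] = 2
n_sas[0, 1, 0, 1] = 1
rewards = np.array([[[0.0, 1.0], [2.0, 0.0]], [[0.0, 0.0], [3.0, 4.0]]])

V = individual_values(n_sa, n_sas, rewards)
R = empirical_returns(n_sa, rewards)
reparam = (n_sa * V).sum(axis=(1, 2))   # sum_{i,j} n_t(i, j) V_t(i, j)
```

For count tables that satisfy the consistency constraints, `reparam` and `R` agree at every time step, which is the content of the lemma for this toy sample.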
Individual Value Function Based Loss: Given lemma 1, we next derive an upper bound on the true loss (11) which effectively utilizes individual value functions:
$$\sum_{\xi} \sum_{t} \big( f_w(n^\xi_t) - R^\xi_t \big)^2 = \sum_{\xi} \sum_{t} \Big( \sum_{i, j} n^\xi_t(i, j) \big( f_w(i, j, o(i, n^\xi_{s_t})) - V^\xi_t(i, j) \big) \Big)^2 \tag{13}$$
$$\le M \sum_{\xi} \sum_{t} \sum_{i, j} n^\xi_t(i, j) \big( f_w(i, j, o(i, n^\xi_{s_t})) - V^\xi_t(i, j) \big)^2 \tag{14}$$
where the last relation is derived by the Cauchy-Schwarz inequality. We train the critic using the modified loss function in (14). Empirically, we observed that for larger problems, this new loss function in (14) resulted in much faster convergence than the original loss function in (13). Intuitively, this is because the new loss (14) tries to adjust each critic component $f_w(i, j, o(i, n^\xi_{s_t}))$ closer to its counterpart empirical return $V^\xi_t(i, j)$. However, in the original loss function (13), the focus is on minimizing the global loss, rather than adjusting each individual critic factor towards the corresponding empirical return.
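The difference between the two losses can be seen on a toy count sample. The sketch below assumes the critic outputs and the individual values are given as arrays `f[t, i, j]` and `V[t, i, j]` (hypothetical inputs chosen for illustration); the factor $M$ in the factored loss is the one that arises when Cauchy-Schwarz is applied with $\sum_{i,j} n_t(i,j) = M$.

```python
import numpy as np

def global_loss(n_sa, f, V):
    """Sum over t of (sum_{i,j} n_t(i,j) * (f - V))^2: errors can cancel."""
    diff = (n_sa * (f - V)).sum(axis=(1, 2))
    return float((diff ** 2).sum())

def factored_loss(n_sa, f, V):
    """M * sum over t, i, j of n_t(i,j) * (f - V)^2: each component is penalized."""
    M = n_sa[0].sum()
    return float(M * (n_sa * (f - V) ** 2).sum())

# One time step, one state, two actions, three agents.
n_sa = np.array([[[2, 1]]])
f = np.array([[[1.0, 0.0]]])   # critic outputs
V = np.array([[[0.0, 2.0]]])   # individual value targets

g_loss = global_loss(n_sa, f, V)      # 2*(1-0) + 1*(0-2) = 0, so the loss is 0
f_loss = factored_loss(n_sa, f, V)    # 3 * (2*1 + 1*4) = 18
```

Here both critic components are wrong, yet the global loss is zero because the errors cancel in the sum; the factored loss remains positive and therefore still provides a training signal for each component.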
Algorithm 1 shows the outline of our AC approach for CDec-POMDPs. Lines 7 and 8 show two different options to train the critic. Line 7 represents the critic update based on local value functions, also referred to as the factored critic update (fC). Line 8 shows the update based on global reward, or global critic update (C). Line 10 shows the policy gradient computed using theorem 2 (fA). Line 11 shows how the gradient is computed by directly using $f_w$ from eq. (5) in eq. (4).
5 Experiments
This section compares the performance of our AC approach with two other approaches for solving CDec-POMDPs: Soft-Max based flow update (SMFU) (Varakantham et al., 2012), and the Expectation-Maximization (EM) approach (Nguyen et al., 2017). SMFU can only optimize policies where an agent's action only depends on its local state, $\pi(a^m_t \mid s^m_t)$, as it approximates the effect of counts by computing the single most likely count vector during the planning phase. The EM approach can optimize count-based piecewise linear policies where $\pi_t(a^m_t \mid s^m_t, \cdot)$ is a piecewise function over the space of all possible count observations $o_t$.
Algorithm 1 shows two ways of updating the critic (in lines 7, 8) and two ways of updating the actor (in lines 10, 11), leading to 4 possible settings for our actor-critic approach: fAfC, AC, AfC, fAC. We also investigate the properties of these different actor-critic approaches. The neural network structure and other experimental settings are provided in the appendix.
For fair comparisons with previous approaches, we use three different models for the counts-based observation $o_t$. In the `o0' setting, policies depend only on the agent's local state $s^m_t$ and not on counts. In the `o1' setting, policies depend on the local state $s^m_t$ and the single count observation $n_t(s^m_t)$. That is, the agent can only observe the count of other agents in its current state $s^m_t$. In the `oN' setting, the agent observes its local state $s^m_t$ and also the count of other agents from a local neighborhood (defined later) of the state $s^m_t$. The `oN' observation model provides the most information to an agent. However, it is also much more difficult to optimize as policies have more parameters. SMFU only works with the `o0' setting; EM and our actor-critic approach work for all the settings.
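The three observation models can be summarized with a small sketch. The function `observe`, the `neighbors` mapping and the tuple layout are illustrative assumptions; in the taxi domain the `oN' observation additionally includes demand information, which is omitted here.

```python
def observe(i, n_s, neighbors, mode):
    """Return the policy input for an agent in state i under each observation model.

    i: local state index; n_s: sequence with n_t(i) for every state;
    neighbors: dict mapping a state to a list of neighboring states (hypothetical).
    """
    if mode == "o0":
        return (i,)                       # local state only
    if mode == "o1":
        return (i, n_s[i])                # local state and count in that state
    if mode == "oN":
        # local state, own count, then counts in the neighborhood
        return (i, n_s[i], *[n_s[k] for k in neighbors[i]])
    raise ValueError(f"unknown observation model: {mode}")

n_s = [3, 1, 2]
neighbors = {0: [1, 2], 1: [0], 2: [0]}
obs_o0 = observe(0, n_s, neighbors, "o0")
obs_o1 = observe(0, n_s, neighbors, "o1")
obs_oN = observe(0, n_s, neighbors, "oN")
```

The length of the `oN' observation grows with the neighborhood size, which is why a tabular policy over it needs exponentially many pieces while a function approximator only needs a wider input layer.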
Taxi Supply-Demand Matching: We test our approach on this real-world domain described in section 2, and introduced in (Varakantham et al., 2012). In this problem, the goal is to compute taxi policies for optimizing the total revenue of the fleet. The data contains GPS traces of taxi movement in a large Asian city over 1 year. We use the observed demand information extracted from this dataset. On average, there are around 8000 taxis per day (data is not exhaustive over all taxi operators). The city is divided into 81 zones and the planning horizon is 48 half hour intervals over 24 hours. For details about the environment dynamics, we refer to (Varakantham et al., 2012).
Figure 2(a) shows the quality comparisons among different approaches with different observation models (`o0', `o1' and `oN'). We test with the total number of taxis as 4000 and 8000 to see if taxi population size affects the relative performance of different approaches. The y-axis shows the average per day profit for the entire fleet. For the `o0' case, all approaches (fAfC-`o0', SMFU, EM-`o0') give similar quality, with fAfC-`o0' and EM-`o0' performing slightly better than SMFU for the 8000 taxis. For the `o1' case, there is a sharp improvement in quality by fAfC-`o1' over fAfC-`o0', confirming that taking count based observation into account results in better policies. Our approach fAfC-`o1' is also significantly better than the policies optimized by EM-`o1' for both the 4000 and 8000 taxi settings.
To further test the scalability and the ability to optimize complex policies by our approach in the `oN' setting, we define the neighborhood of each state (which is a zone in the city) to be the set of its geographically connected zones based on the zonal decomposition shown in (Nguyen et al., 2017). On average, there are about 8 neighboring zones for a given zone, resulting in 9 count based observations available to the agent for taking decisions. Each agent observes both the taxi count and the demand information from such neighboring zones. In figure 2(a), the fAfC-`oN' result clearly shows that taking multiple observations into account significantly increases solution quality: fAfC-`oN' provides an increase of 64% in quality over fAfC-`o0' and 20% over fAfC-`o1' for the 8000 taxi case. For EM-`oN', we used a bare minimum of 2 pieces per observation dimension (resulting in $2^9 = 512$ pieces per time step). We observed that EM was unable to converge within 30K iterations and provided even worse quality than EM-`o1' at the end. These results show that despite the larger search space, our approach can effectively optimize complex policies whereas the tabular policy based EM approach was ineffective for this case.
Figures 3(a-c) show the quality vs. iterations for different variations of our actor-critic approach (fAfC, AC, AfC, fAC) for the `o0', `o1' and the `oN' observation models. These figures clearly show that using the factored actor and the factored critic update in fAfC is the most reliable strategy over all the other variations and for all the observation models. Variations such as AC and fAC were not able to converge at all despite having exactly the same parameters as fAfC. These results validate the different strategies that we have developed in our work to make vanilla AC converge faster for large problems.
Robot navigation in a congested environment: We also tested on a synthetic benchmark introduced in (Nguyen et al., 2017). The goal is for a population of robots () to move from a set of initial locations to a goal state in a 5x5 grid. If there is congestion on an edge, then each agent attempting to cross the edge has a higher chance of action failure. Similarly, agents also receive a negative reward if there is edge congestion. On successfully reaching the goal state, agents receive a positive reward and transition back to one of the initial states. We set the horizon to 100 steps.
Figure 2(b) shows the solution quality comparisons among different approaches. In the `oN' observation model, the agent observes the count information of its 4 immediate neighbor nodes. In this problem, SMFU performed worst; fAfC and EM both performed much better. As expected, fAfC-`oN' provides the best solution quality over all the other approaches. In this domain, EM is competitive with fAfC because, for this relatively smaller problem with 25 agents, the space of counts is much smaller than in the taxi domain. Therefore, EM's piecewise policy is able to provide a fine grained approximation over the count range.
6 Summary
We addressed the problem of collective multiagent planning where the collective behavior of a population of agents affects the model dynamics. We developed a new actor-critic method for solving such collective planning problems within the CDec-POMDP framework. We derived several new results for CDec-POMDPs such as the policy gradient derivation, and the structure of the compatible value function. To overcome the slow convergence of the vanilla actor-critic method we developed multiple techniques based on value function factorization and training the critic using the individual value functions of agents. Using such techniques, our approach provided significantly better quality than previous approaches, and proved scalable and effective for optimizing policies in a real world taxi supply-demand problem and a synthetic grid navigation problem.
7 Acknowledgments
This research project is supported by National Research Foundation Singapore under its Corp Lab @ University scheme and Fujitsu Limited. First author is also supported by A*STAR graduate scholarship.
References
 Aberdeen (2006) Aberdeen, D. (2006). Policy-gradient methods for planning. In Advances in Neural Information Processing Systems, pages 9–16.
 Amato et al. (2015) Amato, C., Konidaris, G., Cruz, G., Maynor, C. A., How, J. P., and Kaelbling, L. P. (2015). Planning for decentralized control of multiple robots under uncertainty. In IEEE International Conference on Robotics and Automation, ICRA, pages 1241–1248.
 Bagnell and Ng (2005) Bagnell, J. A. and Ng, A. Y. (2005). On local rewards and scaling distributed reinforcement learning. In International Conference on Neural Information Processing Systems, pages 91–98.
 Becker et al. (2004a) Becker, R., Zilberstein, S., and Lesser, V. (2004a). Decentralized Markov decision processes with event-driven interactions. In Proceedings of the 3rd International Conference on Autonomous Agents and Multiagent Systems, pages 302–309.
 Becker et al. (2004b) Becker, R., Zilberstein, S., Lesser, V., and Goldman, C. V. (2004b). Solving transition independent decentralized Markov decision processes. Journal of Artificial Intelligence Research, 22:423–455.
 Bernstein et al. (2002) Bernstein, D. S., Givan, R., Immerman, N., and Zilberstein, S. (2002). The complexity of decentralized control of Markov decision processes. Mathematics of Operations Research, 27:819–840.
 Dempster et al. (1977) Dempster, A. P., Laird, N. M., and Rubin, D. B. (1977). Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B, 39(1):1–38.
 Foerster et al. (2016) Foerster, J. N., Assael, Y. M., de Freitas, N., and Whiteson, S. (2016). Learning to communicate with deep multi-agent reinforcement learning. In Advances in Neural Information Processing Systems, pages 2137–2145.
 Guestrin et al. (2002) Guestrin, C., Lagoudakis, M., and Parr, R. (2002). Coordinated reinforcement learning. In ICML, volume 2, pages 227–234.
 Ioffe and Szegedy (2015) Ioffe, S. and Szegedy, C. (2015). Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167.
 Konda and Tsitsiklis (2003) Konda, V. R. and Tsitsiklis, J. N. (2003). On actor-critic algorithms. SIAM Journal on Control and Optimization, 42(4):1143–1166.
 Kumar et al. (2011) Kumar, A., Zilberstein, S., and Toussaint, M. (2011). Scalable multiagent planning using probabilistic inference. In Proceedings of the Twenty-Second International Joint Conference on Artificial Intelligence, pages 2140–2146, Barcelona, Spain.
 Kumar et al. (2015) Kumar, A., Zilberstein, S., and Toussaint, M. (2015). Probabilistic inference techniques for scalable multiagent decision making. Journal of Artificial Intelligence Research, 53(1):223–270.
 Leibo et al. (2017) Leibo, J. Z., Zambaldi, V. F., Lanctot, M., Marecki, J., and Graepel, T. (2017). Multi-agent reinforcement learning in sequential social dilemmas. In International Conference on Autonomous Agents and Multiagent Systems.
 Meyers and Schulz (2012) Meyers, C. A. and Schulz, A. S. (2012). The complexity of congestion games. Networks, 59:252–260.
 Mnih et al. (2016) Mnih, V., Badia, A. P., Mirza, M., Graves, A., Lillicrap, T., Harley, T., Silver, D., and Kavukcuoglu, K. (2016). Asynchronous methods for deep reinforcement learning. In International Conference on Machine Learning, pages 1928–1937.
 Mnih et al. (2015) Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A. A., Veness, J., Bellemare, M. G., Graves, A., Riedmiller, M. A., Fidjeland, A., Ostrovski, G., Petersen, S., Beattie, C., Sadik, A., Antonoglou, I., King, H., Kumaran, D., Wierstra, D., Legg, S., and Hassabis, D. (2015). Human-level control through deep reinforcement learning. Nature, 518(7540):529–533.
 Nair et al. (2005) Nair, R., Varakantham, P., Tambe, M., and Yokoo, M. (2005). Networked distributed POMDPs: A synthesis of distributed constraint optimization and POMDPs. In AAAI Conference on Artificial Intelligence, pages 133–139.
 Nguyen et al. (2017) Nguyen, D. T., Kumar, A., and Lau, H. C. (2017). Collective multiagent sequential decision making under uncertainty. In AAAI Conference on Artificial Intelligence, pages 3036–3043.
 Pajarinen et al. (2014) Pajarinen, J., Hottinen, A., and Peltonen, J. (2014). Optimizing spatial and temporal reuse in wireless networks by decentralized partially observable Markov decision processes. IEEE Trans. on Mobile Computing, 13(4):866–879.
 Peshkin et al. (2000) Peshkin, L., Kim, K.-E., Meuleau, N., and Kaelbling, L. P. (2000). Learning to cooperate via policy search. In Proceedings of the Sixteenth Conference on Uncertainty in Artificial Intelligence, pages 489–496. Morgan Kaufmann Publishers Inc.
 Robbel et al. (2016) Robbel, P., Oliehoek, F. A., and Kochenderfer, M. J. (2016). Exploiting anonymity in approximate linear programming: Scaling to large multiagent MDPs. In AAAI Conference on Artificial Intelligence, pages 2537–2543.
 Schulman et al. (2015) Schulman, J., Levine, S., Abbeel, P., Jordan, M., and Moritz, P. (2015). Trust region policy optimization. In International Conference on Machine Learning, pages 1889–1897.
 Sonu et al. (2015) Sonu, E., Chen, Y., and Doshi, P. (2015). Individual planning in agent populations: Exploiting anonymity and frame-action hypergraphs. In International Conference on Automated Planning and Scheduling, pages 202–210.
 Sutton et al. (1999) Sutton, R. S., McAllester, D., Singh, S., and Mansour, Y. (1999). Policy gradient methods for reinforcement learning with function approximation. In International Conference on Neural Information Processing Systems, pages 1057–1063.
 van Hasselt et al. (2016) van Hasselt, H., Guez, A., Hessel, M., Mnih, V., and Silver, D. (2016). Learning values across many orders of magnitude. arXiv preprint arXiv:1602.07714.
 Varakantham et al. (2014) Varakantham, P., Adulyasak, Y., and Jaillet, P. (2014). Decentralized stochastic planning with anonymity in interactions. In AAAI Conference on Artificial Intelligence, pages 2505–2511.
 Varakantham et al. (2012) Varakantham, P. R., Cheng, S.-F., Gordon, G., and Ahmed, A. (2012). Decision support for agent populations in uncertain and congested environments. In AAAI Conference on Artificial Intelligence, pages 1471–1477.
 Williams (1992) Williams, R. J. (1992). Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8(3):229–256.
 Winstein and Balakrishnan (2013) Winstein, K. and Balakrishnan, H. (2013). TCP ex Machina: Computer-generated congestion control. In Proceedings of the ACM SIGCOMM 2013 Conference, SIGCOMM '13, pages 123–134.
 Witwicki and Durfee (2010) Witwicki, S. J. and Durfee, E. H. (2010). Influence-based policy abstraction for weakly-coupled Dec-POMDPs. In International Conference on Automated Planning and Scheduling, pages 185–192.
Appendix A Distribution Over Counts
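We give the distribution directly over the count tables $n_{1:H}$, as shown in Nguyen et al. (2017). The distribution $P(n_{1:H};\pi)$ is defined as:
(15) $P(n_{1:H};\pi) = h(n_{1:H})\, f(n_{1:H};\pi)$
where $f(n_{1:H};\pi)$ is given as:
(16) $f(n_{1:H};\pi) = \prod_{i\in S} P(i)^{n_1(i)} \prod_{t=1}^{H-1} \prod_{i,j,i'} \Big[ \pi_t\big(j \mid i, o_t(i,n_{s_t})\big)^{n_t(i,j)}\, \phi_t\big(i' \mid i,j,n_{s_t}\big)^{n_t(i,j,i')} \Big] \times \prod_{i,j} \pi_H\big(j \mid i, o_H(i,n_{s_H})\big)^{n_H(i,j)}$
where $n_{s_t}$ is the count table $(n_t(i)\ \forall i\in S)$ consisting of the count value for each state $i$ at time $t$.
The function $h(n_{1:H})$ counts the total number of ordered $M$ state-action trajectories with sufficient statistic equal to $n_{1:H}$, given as:
(17) $h(n_{1:H}) = \frac{M!}{\prod_{i\in S} n_1(i)!} \prod_{t=1}^{H-1} \prod_{i\in S} \frac{n_t(i)!}{\prod_{j\in A,\, i'\in S} n_t(i,j,i')!} \times \prod_{i\in S} \frac{n_H(i)!}{\prod_{j\in A} n_H(i,j)!} \times \mathbb{I}\big[n_{1:H}\in\Omega_{1:H}\big]$
The set $\Omega_{1:H}$ is the set of all allowed consistent count tables, that is, those satisfying:
(18) $\sum_{i\in S} n_t(i) = M \ \ \forall t; \qquad \sum_{j\in A} n_t(i,j) = n_t(i) \ \ \forall i\in S,\ \forall t$
(19) $\sum_{i'\in S} n_t(i,j,i') = n_t(i,j) \ \ \forall i\in S,\ j\in A,\ \forall t<H; \qquad \sum_{i\in S,\, j\in A} n_t(i,j,i') = n_{t+1}(i') \ \ \forall i'\in S,\ \forall t<H$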
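The counting argument can be sanity-checked by brute force in the smallest case. The sketch below is an illustration only, not code from the paper: it assumes a single time step (H = 1), so only the state counts n(i) and state-action counts n(i,j) matter, and it takes the count of ordered assignments to be M!/∏_i n(i)! multiplied by ∏_i n(i)!/∏_j n(i,j)!. The helper names `h_single_step` and `brute_force_count` are hypothetical.

```python
from collections import Counter
from itertools import product
from math import factorial, prod


def h_single_step(n_sa):
    """Closed-form count for H = 1; n_sa maps (state, action) -> number of agents."""
    num_agents = sum(n_sa.values())
    n_s = Counter()
    for (state, _action), count in n_sa.items():
        n_s[state] += count
    # Ways to split the M ordered agents across states.
    ways = factorial(num_agents) // prod(factorial(c) for c in n_s.values())
    # Within each state, ways to split its agents across actions.
    for state, count in n_s.items():
        denom = prod(factorial(c) for (s, _a), c in n_sa.items() if s == state)
        ways *= factorial(count) // denom
    return ways


def brute_force_count(n_sa, states, actions):
    """Enumerate every ordered joint (state, action) assignment and count matches."""
    num_agents = sum(n_sa.values())
    target = Counter({key: c for key, c in n_sa.items() if c > 0})
    pairs = list(product(states, actions))
    return sum(
        1 for joint in product(pairs, repeat=num_agents) if Counter(joint) == target
    )


counts = {(0, 0): 2, (0, 1): 1, (1, 0): 1}  # M = 4 agents, 2 states, 2 actions
print(h_single_step(counts), brute_force_count(counts, states=[0, 1], actions=[0, 1]))
```

For this count table both functions return 12, i.e. 4!/(2!·1!·1!). The brute-force enumeration grows as (|S||A|)^M, so it is only useful as a check on tiny instances.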
Appendix B Policy gradient in DecPOMDPs
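In the following, we derive the policy gradient in DecPOMDPs with respect to the accumulated reward at the first time period, $V_1(\pi)$. The proof is similar to that of Sutton et al. (1999).
(20) $\nabla_\theta V_1(\pi) = \nabla_\theta \sum_{s_1,a_1} P(s_1,a_1 \mid b_o,\pi)\, Q_1^\pi(s_1,a_1)$
(21) $\quad = \sum_{s_1,a_1} Q_1^\pi(s_1,a_1)\, \nabla_\theta P(s_1,a_1 \mid b_o,\pi) + \sum_{s_1,a_1} P(s_1,a_1 \mid b_o,\pi)\, \nabla_\theta Q_1^\pi(s_1,a_1)$
Using the Q-function definition for DecPOMDPs and taking the derivative (the immediate reward does not depend on $\theta$), we get
(22) $\quad = \sum_{s_1,a_1} Q_1^\pi(s_1,a_1)\, \nabla_\theta P(s_1,a_1 \mid b_o,\pi) + \sum_{s_1,a_1} P(s_1,a_1 \mid b_o,\pi)\, \nabla_\theta \sum_{s_2,a_2} P(s_2,a_2 \mid s_1,a_1)\, Q_2^\pi(s_2,a_2)$
If we continue unrolling the terms in the above expression, and note that only the action probabilities $P(a_t \mid s_t)$ depend on $\theta$, we get
(23) $\quad = \sum_{t=1}^{H} \sum_{s_t,a_t} P(s_t \mid b_o,\pi)\, Q_t^\pi(s_t,a_t)\, \nabla_\theta P(a_t \mid s_t)$
This can be rewritten using the log trick, $\nabla_\theta P(a_t \mid s_t) = P(a_t \mid s_t)\, \nabla_\theta \log P(a_t \mid s_t)$:
(24) $\quad = \sum_{t=1}^{H} \sum_{s_t,a_t} P(s_t,a_t \mid b_o,\pi)\, Q_t^\pi(s_t,a_t)\, \nabla_\theta \log P(a_t \mid s_t)$
(25) $\quad = \sum_{t=1}^{H} \mathbb{E}_{s_t,a_t \mid b_o,\pi}\Big[ Q_t^\pi(s_t,a_t)\, \nabla_\theta \log P(a_t \mid s_t) \Big]$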
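The log-trick step can be checked numerically in the simplest possible setting. The following sketch is an illustration under strong assumptions, not the paper's algorithm: a single decision step, a single softmax policy over a handful of actions, and a fixed reward per action (standing in for the Q-values). It compares the exact score-function expression E[r(a) ∇ log π(a)] against a central finite-difference gradient of the expected reward. All function names are hypothetical.

```python
import math


def softmax(logits):
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def expected_reward(theta, rewards):
    """V(theta) = sum_a pi(a) r(a) for a softmax policy with logits theta."""
    return sum(p * r for p, r in zip(softmax(theta), rewards))


def score_function_gradient(theta, rewards):
    """Exact E_a[ r(a) * grad_theta log pi(a) ], computed by enumerating actions."""
    pi = softmax(theta)
    grad = [0.0] * len(theta)
    for a, (p, r) in enumerate(zip(pi, rewards)):
        for k in range(len(theta)):
            # d/d theta_k of log softmax(theta)[a] is 1[k == a] - pi[k].
            dlog = (1.0 if k == a else 0.0) - pi[k]
            grad[k] += p * r * dlog
    return grad


def finite_difference_gradient(theta, rewards, eps=1e-6):
    """Central finite-difference approximation of grad_theta V(theta)."""
    grad = []
    for k in range(len(theta)):
        up = list(theta)
        down = list(theta)
        up[k] += eps
        down[k] -= eps
        diff = expected_reward(up, rewards) - expected_reward(down, rewards)
        grad.append(diff / (2 * eps))
    return grad


theta = [0.2, -0.5, 1.0]
rewards = [1.0, 3.0, -2.0]
print(score_function_gradient(theta, rewards))
print(finite_difference_gradient(theta, rewards))
```

The two printed vectors should agree up to finite-difference error. In the multi-step case the same identity is applied at every time step, with the reward replaced by the action-value.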
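Next, we simplify the gradient term $\nabla_\theta \log P(a_t \mid s_t)$ as:
Proposition 2.
We have
(26) $\nabla_\theta \log P(a_t \mid s_t) = \sum_{m=1}^{M} \nabla_\theta \log \pi^m\big(a_t^m \mid s_t^m, o(s_t^m, n_{s_t})\big)$
Proof.
We simplify the above gradient as follows:
(27) $\nabla_\theta \log P(a_t \mid s_t) = \nabla_\theta \log \prod_{m=1}^{M} \pi^m\big(a_t^m \mid s_t^m, o(s_t^m, n_{s_t})\big) = \sum_{m=1}^{M} \nabla_\theta \log \pi^m\big(a_t^m \mid s_t^m, o(s_t^m, n_{s_t})\big)$
∎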
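Notice that we have proved the result in a general setting where each agent $m$ has a different policy $\pi^m$. In a homogeneous agent system (where every agent is of the same type and has the same policy $\pi$), the last equation can be simplified by grouping agents that take the same action in the same state, which gives:
(28) $\nabla_\theta \log P(a_t \mid s_t) = \sum_{i\in S,\, j\in A} n_t(i,j)\, \nabla_\theta \log \pi\big(j \mid i, o(i, n_{s_t})\big)$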
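The grouping step for homogeneous agents can be illustrated with a small numerical check. This is a sketch under assumptions that are not in the paper: the shared policy is a tabular softmax with one logit per (state, action) pair, and the count-based observation is ignored, so the policy depends only on the local state. The sum of per-agent score vectors is compared against the count-weighted sum over (state, action) pairs. All names are hypothetical.

```python
import math
from collections import Counter


def softmax(logits):
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def grad_log_pi(theta, state, action):
    """Gradient of log pi(action | state) w.r.t. the full logit table theta."""
    probs = softmax(theta[state])
    grad = [[0.0] * len(row) for row in theta]
    for k, p in enumerate(probs):
        grad[state][k] = (1.0 if k == action else 0.0) - p
    return grad


def accumulate(total, grad, scale=1.0):
    """Add scale * grad into total, in place."""
    for i, row in enumerate(grad):
        for k, value in enumerate(row):
            total[i][k] += scale * value


def per_agent_gradient(theta, states, actions):
    """Sum of grad log pi over individual agents."""
    total = [[0.0] * len(row) for row in theta]
    for s, a in zip(states, actions):
        accumulate(total, grad_log_pi(theta, s, a))
    return total


def count_based_gradient(theta, states, actions):
    """Sum over (state, action) pairs, each weighted by its agent count n(i, j)."""
    total = [[0.0] * len(row) for row in theta]
    for (i, j), n_ij in Counter(zip(states, actions)).items():
        accumulate(total, grad_log_pi(theta, i, j), scale=n_ij)
    return total


theta = [[0.3, -0.1, 0.8], [1.2, 0.0, -0.5]]  # 2 states, 3 actions
states = [0, 0, 1, 0, 1, 1, 0]                 # local state of each of 7 agents
actions = [2, 2, 0, 1, 0, 1, 2]                # action taken by each agent
lhs = per_agent_gradient(theta, states, actions)
rhs = count_based_gradient(theta, states, actions)
print(lhs)
print(rhs)
```

The count-based version touches each distinct (state, action) pair once, so its cost scales with the number of pairs rather than with the number of agents.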
Using the above results, the final policy gradient expression for DecPOMDPs is readily proved.
Theorem 4.
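For any DecPOMDP, the policy gradient is given as:
(29) $\nabla_\theta V_1(\pi) = \sum_{t=1}^{H} \mathbb{E}_{s_t,a_t \mid b_o,\pi}\Big[ Q_t^\pi(s_t,a_t) \sum_{i\in S,\, j\in A} n_t(i,j)\, \nabla_\theta \log \pi\big(j \mid i, o(i, n_{s_t})\big) \Big]$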
Appendix C Action Value Function Approximation For DecPOMDP
We consider a special form of approximate value function
(30) $f_w(s_t,a_t) = \sum_{m=1}^{M} f_w^m\big(s_t^m, o(s_t^m, n_{s_t}), a_t^m\big)$
There are two reasons to consider this form of approximate value function:

This form leads to an efficient update of the policy gradient.

We can train this form efficiently if we can decompose the value function into a sum of individual values. Each component can be understood as the contribution of an individual agent to the total value function.
An important class of approximate value functions having this form is the compatible value function. As shown in Sutton et al. (1999), for compatible value functions, the policy gradient using the function approximator is equal to the true policy gradient.
Proposition 3.
The compatible value function approximation in DecPOMDPs has the form $f_w(s_t,a_t) = \sum_{m=1}^{M} f_w^m\big(s_t^m, o(s_t^m, n_{s_t}), a_t^m\big)$.
Proof.
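Recall from Sutton et al. (1999) that the compatible value function approximates the value function $Q^\pi(s,a)$ with a linear function $f_w(s,a) = w^\top \psi(s,a)$, where $w$ denotes the function parameter vector and $\psi(s,a)$ is the compatible feature vector computed from the policy $\pi$ as
(31) $\psi(s,a) = \nabla_\theta \log \pi(a \mid s)$
Applying this to DecPOMDPs and using the result of Proposition 2, the linear compatible feature in a DecPOMDP is:
(32) $\psi(s_t,a_t) = \nabla_\theta \log P(a_t \mid s_t) = \sum_{m=1}^{M} \nabla_\theta \log \pi^m\big(a_t^m \mid s_t^m, o(s_t^m, n_{s_t})\big)$
We can rearrange $f_w$ as follows:
(33) $f_w(s_t,a_t) = w^\top \psi(s_t,a_t) = w^\top \sum_{m=1}^{M} \nabla_\theta \log \pi^m\big(a_t^m \mid s_t^m, o(s_t^m, n_{s_t})\big)$
(34) $\quad = \sum_{m=1}^{M} w^\top \nabla_\theta \log \pi^m\big(a_t^m \mid s_t^m, o(s_t^m, n_{s_t})\big)$
If we set $f_w^m\big(s_t^m, o(s_t^m, n_{s_t}), a_t^m\big) = w^\top \nabla_\theta \log \pi^m\big(a_t^m \mid s_t^m, o(s_t^m, n_{s_t})\big)$, the proposition is proved. ∎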
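We also prove the next result in a general setting, with each agent $m$ having a different policy $\pi^m$.
Theorem 5.
For any value function having the decomposition:
(35) $f_w(s_t,a_t) = \sum_{m=1}^{M} f_w^m\big(s_t^m, o(s_t^m, n_{s_t}), a_t^m\big)$
the policy gradient can be computed as
(36) $\nabla_\theta V_1(\pi) = \sum_{t=1}^{H} \mathbb{E}_{s_t,a_t \mid b_o,\pi}\Big[ \sum_{m=1}^{M} \nabla_\theta \log \pi^m\big(a_t^m \mid s_t^m, o(s_t^m, n_{s_t})\big)\, f_w^m\big(s_t^m, o(s_t^m, n_{s_t}), a_t^m\big) \Big]$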
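The effect of the decomposition can be checked by exact enumeration on a toy instance. The sketch below is illustrative and rests on assumptions not taken from the paper: one time step with the joint state held fixed (and therefore suppressed), independent softmax policies per agent with their logits concatenated into one parameter vector, and arbitrary per-agent value tables `f[m][a]`. It compares the gradient computed with the full sum of per-agent values against the gradient that pairs each agent's score only with its own value component. All names are hypothetical.

```python
import math
from itertools import product


def softmax(logits):
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def agent_score(thetas, joint_action, m):
    """grad of log pi^m(a^m) w.r.t. the concatenated logits; zero outside block m."""
    out = []
    for idx, logits in enumerate(thetas):
        if idx == m:
            pi = softmax(logits)
            a = joint_action[idx]
            out.extend((1.0 if k == a else 0.0) - pi[k] for k in range(len(logits)))
        else:
            out.extend([0.0] * len(logits))
    return out


def full_and_decomposed_gradients(thetas, f):
    """Return (E[f_total * sum_m score_m], E[sum_m f_m * score_m]) by enumeration."""
    pis = [softmax(t) for t in thetas]
    dim = sum(len(t) for t in thetas)
    full = [0.0] * dim
    decomposed = [0.0] * dim
    for joint in product(*[range(len(t)) for t in thetas]):
        # Agents act independently given the (fixed) joint state.
        prob = math.prod(pis[m][a] for m, a in enumerate(joint))
        total_value = sum(f[m][a] for m, a in enumerate(joint))
        for m, a in enumerate(joint):
            score = agent_score(thetas, joint, m)
            for k in range(dim):
                full[k] += prob * total_value * score[k]
                decomposed[k] += prob * f[m][a] * score[k]
    return full, decomposed


thetas = [[0.1, -0.3], [0.5, 0.0, -1.2]]    # agent 0 has 2 actions, agent 1 has 3
f = [[1.0, -2.0], [0.5, 3.0, -1.0]]         # arbitrary per-agent value tables
full, decomposed = full_and_decomposed_gradients(thetas, f)
print(full)
print(decomposed)
```

The two vectors coincide because the cross terms, an agent's score multiplied by another agent's value component, have zero expectation when actions are drawn independently. The decomposed estimator drops those zero-mean terms, which is what one would expect to reduce variance in a sampled estimate.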
Proof.
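Substituting the approximate value function $f_w$ for $Q_t^\pi$ in the policy gradient formula (25), the policy gradient computed with the approximate value function is
(37) $\nabla_\theta V_1(\pi) = \sum_{t=1}^{H} \mathbb{E}_{s_t,a_t \mid b_o,\pi}\Big[ f_w(s_t,a_t)\, \nabla_\theta \log P(a_t \mid s_t) \Big]$
(38) $\quad = \sum_{t=1}^{H} \mathbb{E}_{s_t,a_t \mid b_o,\pi}\Big[ \sum_{m'} f_w^{m'}\big(s_t^{m'}, o(s_t^{m'}, n_{s_t}), a_t^{m'}\big) \sum_{m} \nabla_\theta \log \pi^m\big(a_t^m \mid s_t^m, o(s_t^m, n_{s_t})\big) \Big]$
(39) $\quad = \sum_{t=1}^{H} \mathbb{E}_{s_t \mid b_o,\pi}\Big[ \sum_{m} \sum_{a_t} P(a_t \mid s_t)\, \nabla_\theta \log \pi^m\big(a_t^m \mid s_t^m, o(s_t^m, n_{s_t})\big) \sum_{m'} f_w^{m'}\big(s_t^{m'}, o(s_t^{m'}, n_{s_t}), a_t^{m'}\big) \Big]$
Let us simplify the inner summation for a specific $m$ by looking at the contribution of the other agents $m' \neq m$:
(40) $\sum_{a_t} P(a_t \mid s_t)\, \nabla_\theta \log \pi^m\big(a_t^m \mid s_t^m, o(s_t^m, n_{s_t})\big) \sum_{m' \neq m} f_w^{m'}\big(s_t^{m'}, o(s_t^{m'}, n_{s_t}), a_t^{m'}\big)$
Given the independence of the value functions of the other agents w.r.t. the action of agent $m$, and writing $a_t^{-m}$ for the actions of all agents other than $m$, we have:
(41) $\quad = \sum_{a_t^{-m}} P(a_t^{-m} \mid s_t) \sum_{m' \neq m} f_w^{m'}\big(s_t^{m'}, o(s_t^{m'}, n_{s_t}), a_t^{m'}\big) \sum_{a_t^m} \pi^m\big(a_t^m \mid s_t^m, o(s_t^m, n_{s_t})\big)\, \nabla_\theta \log \pi^m\big(a_t^m \mid s_t^m, o(s_t^m, n_{s_t})\big)$
(42) $\quad = \sum_{a_t^{-m}} P(a_t^{-m} \mid s_t) \sum_{m' \neq m} f_w^{m'}\big(s_t^{m'}, o(s_t^{m'}, n_{s_t}), a_t^{m'}\big)\, \nabla_\theta \sum_{a_t^m} \pi^m\big(a_t^m \mid s_t^m, o(s_t^m, n_{s_t})\big)$
(43) $\quad = \sum_{a_t^{-m}} P(a_t^{-m} \mid s_t) \sum_{m' \neq m} f_w^{m'}\big(s_t^{m'}, o(s_t^{m'}, n_{s_t}), a_t^{m'}\big)\, \nabla_\theta 1 = 0$
Applying this to (39), we can drop all the terms with $m' \neq m$, which simplifies (39) into (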