Scalable Centralized Deep Multi-Agent Reinforcement Learning via Policy Gradients
Abstract
In this paper, we explore using deep reinforcement learning for problems with multiple agents. Most existing methods for deep multiagent reinforcement learning consider only a small number of agents. When the number of agents increases, the dimensionality of the input and control spaces increases as well, and these methods do not scale well. To address this, we propose casting the multiagent reinforcement learning problem as a distributed optimization problem. Our algorithm assumes that for multiagent settings, policies of individual agents in a given population live close to each other in parameter space and can be approximated by a single policy. With this simple assumption, we show our algorithm to be extremely effective for reinforcement learning in multiagent settings. We demonstrate its effectiveness against existing comparable approaches on cooperative and competitive tasks.
Arbaaz Khan, Clark Zhang, Daniel D. Lee, Vijay Kumar, Alejandro Ribeiro
GRASP Laboratory, University of Pennsylvania
Preprint. Work in progress.
1 Introduction
Leveraging the power of deep neural networks in reinforcement learning (RL) has emerged as a successful approach to designing policies that map sensor inputs to control outputs for complex tasks. These include, but are not limited to, learning to play video games [1,2], learning complex control policies for robot tasks [3], and learning to plan with only sensory information [4,5,6]. While these results are impressive, most of these methods consider only single-agent settings.
In the real world, many applications, especially in fields like robotics and communications, require multiple agents to interact with each other in cooperative or competitive settings. Examples include warehouse management with teams of robots [7], multirobot furniture assembly [8], and concurrent control and communication for teams of robots [9]. Traditionally, these problems were solved by setting up and solving a carefully constructed optimization problem constrained by robot and environment dynamics. Often, these become intractable when simple constraints are added to the problem or when the number of agents is increased [10]. In this paper, we attempt to solve multiagent problems by framing them as multiagent reinforcement learning (MARL) problems and leveraging the power of deep neural networks. In MARL, the environment from the perspective of an agent appears nonstationary. This is because the other agents are also changing their policies (due to learning). Traditional RL paradigms such as Q-learning are ill-suited for such nonstationary environments.
Several recent works have proposed using decentralized-actor, centralized-critic models [11,12]. These have been shown to work well when the number of agents being considered is small. However, setting up a large number of actor networks is computationally expensive. Further, the input space of the critic network grows quickly with the number of agents. Also, in decentralized frameworks, every agent must estimate and track the other agents [13,14]. Most deep RL algorithms are sample inefficient even with only a single agent, so attempting to learn individual policies for multiple agents in a decentralized framework becomes highly inefficient, as we will demonstrate. Thus, attempting to learn multiple policies with limited interaction using decentralized frameworks is often infeasible.
Instead, we propose the use of a centralized model. Here, all agents become aware of the actions of other agents, which mitigates the nonstationarity. To use a centralized framework for MARL, one must collect experiences from individual agents and then learn to combine these to output actions for all agents. One option is to use high-capacity models like neural networks to learn policies that map the joint observations of all agents to the joint actions of all agents. This simple approach works when the number of agents is small but suffers from the curse of dimensionality when the number of agents increases. Another possibility is to learn a policy for one agent and fine-tune it across all agents, but this also turns out to be impractical. To mitigate the problems of scale and limited interaction, we propose using a distributed optimization framework for the MARL problem. The key idea is to learn one policy for all agents that exhibits emergent behaviors when multiple agents interact. This type of policy has been observed in nature [15] as well as in swarm robotics [16]. In this paper, the goal is to learn these policies from raw observations and rewards with reinforcement learning.
Optimizing one policy across all agents is difficult and sometimes intractable (especially when the number of agents is large). Instead, we take a distributed approach where each agent improves the central policy with its local observations. Then, a central controller combines these improvements in a way that refines the overall policy. This can be seen as recasting the original problem of optimizing one policy as the problem of optimizing several policies subject to the constraint that they are identical. After training, there will only be a single policy for all agents to use. This is an optimization technique that has seen success in distributed settings before [17]. Thus, the main contributions of this paper are:

A novel algorithm for solving MARL problems using distributed optimization.

The policy gradient formulation when using distributed optimization for MARL.
2 Related Work
Multiagent reinforcement learning (MARL) has been an actively explored area of research in the field of reinforcement learning [18,19]. Many initial approaches focused on tabular methods to compute Q-values for general-sum Markov games [20]. Another approach has been to remove the nonstationarity in MARL by treating each episode as an iterative game, where the other agent is held constant during its turn. In such a game, the proposed algorithm searches for a Nash equilibrium [21]. Naturally, for complex competitive or collaborative tasks with many agents, finding a Nash equilibrium is nontrivial. Building on the recent success of methods for deep RL, there has been renewed interest in using high-capacity models such as neural networks for solving MARL problems. However, this is not straightforward and is hard to extend to games where the number of agents is more than two [22].
When using deep neural networks for MARL, one method that has worked well in the past is the use of decentralized actors for each agent and a centralized critic with parameter sharing among the agents [11,12]. While this works well for a small number of agents, it is sample inefficient, and training very often becomes unstable when the number of agents in the environment increases.
In our work, we derive the policy gradient for multiple agents. This derivation is very similar to that for policy gradients in meta-learning from [23,24], where the authors use meta-learning to solve continuous task adaptation. In [23] the authors propose a meta-learning algorithm that attempts to mitigate the nonstationarity by treating it as a sequence of stationary tasks and training agents to exploit the dependencies between consecutive tasks such that they can handle similar nonstationarities at execution time. This is in contrast to our work, where we are focused on the MARL problem. In MARL there are often very few inter-task (in the MARL setting this corresponds to inter-agent) dependencies that can be exploited. Instead, we focus on using distributed learning to learn a policy.
3 Collaborative Reinforcement Learning in Markov Teams
We consider policy learning problems in a collaborative Markov team [19]. The team is composed of agents generically indexed by which at any given point in time are described by a state and an action . Observe that we are assuming all agents to have a common state space and a common action space . Individual states and actions of each agent are collected in the vectors and . Since the team is assumed to be Markov, the probability distribution of the state at time is completely determined by the conditional transition probability . We further assume here that agents are statistically identical in that the probability transition kernel is invariant to permutations.
At any point in time , the agents can communicate their states to each other and agents utilize this information to select their actions. This means that each agent executes a policy with the action executed by agent at time being . As agents operate in their environment, they collect individual rewards which depend on the state of the team and their own individual action . The quantity of interest to agent is not this instantaneous reward but rather the long term reward accumulated over a time horizon as discounted by a factor ,
(1) 
The reward in (1) is stochastic as it depends on the trajectory’s realization. In conventional RL problems, agent would define the cost and search for a policy that maximizes this long term expected reward. This expectation, however, neglects the effect of other agents, which we can incorporate competitively or collaboratively. In a competitive formulation agent considers the loss that is integrated not only with respect to its own policy but with respect to the policies of all agents . In the collaborative problems we consider here, agent takes the rewards of other agents into consideration. Thus, the reward of interest to agent is the expected reward accumulated over time and across all agents,
(2) 
where, we recall, denotes the joint policy of the team and we have further defined to group the policies of all agents except .
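As a concrete reading of the quantities in (1) and (2), the following minimal Python sketch computes a discounted return for one agent and sums it across agents for a single sampled trajectory. The names `discounted_return` and `team_return`, and the representation of a trajectory as per-agent lists of rewards, are assumptions made for illustration. The expectation in (2) would then be approximated by averaging `team_return` over many sampled trajectories.

```python
from typing import List


def discounted_return(rewards: List[float], gamma: float) -> float:
    """Sum of gamma**t * r_t over one agent's rewards for a finite horizon."""
    total = 0.0
    # Accumulate backwards so each step is one multiply and one add.
    for r in reversed(rewards):
        total = r + gamma * total
    return total


def team_return(per_agent_rewards: List[List[float]], gamma: float) -> float:
    """Aggregate discounted return across all agents for one trajectory."""
    return sum(discounted_return(r, gamma) for r in per_agent_rewards)


if __name__ == "__main__":
    # Two agents, horizon of three steps (illustrative numbers only).
    rewards = [[1.0, 1.0, 1.0], [0.0, 2.0, 0.0]]
    print(team_return(rewards, gamma=0.5))
```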
The goal in a collaborative reinforcement learning problem is to find policies that optimize the aggregate expected reward in (2). We can write these optimal policies as . The drawback with this problem formulation is that it requires learning separate policies for each individual agent. This is intractable when is large, which motivates a restriction in which all agents are required to execute a common policy. This leads to the optimization problem
(3) 
We reformulate into the more tractable problem
(4) 
In the next section, we present a distributed algorithm to solve this optimization problem.
4 Distributed Optimization for MARL using Policy Gradients
Let us restate the problem in Eqn 3 in terms of the parameterization of the policy and trajectories drawn from the policy. Eqn 3 can be interpreted as the problem of finding the best set of parameters that parameterizes a policy to maximize the sum of rewards for all agents over some time horizon . Specifically, the optimization problem in Eqn 3 can be written as:
(5) 
where are trajectories of agent
(6) 
sampled from the distribution of trajectories induced by the policy . However, as stated above this problem can be intractable for large . Rewriting the parametrized version of the more tractable optimization in Eqn 4 we get:
(7)  
subject to 
where we define the trajectories to be those obtained when agent follows policy and all other agents follow policy . (Footnote 1: This optimization problem is the same as the one in Eqn 4, the difference being that we have now written the optimization in terms of the parametrization of the policies and trajectories drawn from the policies.)
(8) 
The difference between Eqn 5 and Eqn 7 is that we have formed copies of labeled and put a constraint that . This approach allows us to look at the problem in a different light. Similar to distributed optimization methods such as ADMM [17], we can decouple the optimization over from that of . The general approach is an iterative process where:

For each agent , optimize the corresponding

Consolidate the into
This is often realized as a projected gradient descent where for each agent , we apply the gradients as well as applying a gradient . Then, in the next iteration all agents start at where is realized by taking a projection step such that is taken to satisfy the constraint in problem 7. However, when computing this projected gradient step, we need to keep track of all to compute the average. This is infeasible for a large number of agents. Instead, a simple approximation to the projected gradient is used by setting . In the next subsection, we present our algorithm, Distributed Multi-Agent Policy Gradient (DiMAPG), and its practical implementation.
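The two-step iteration above (local improvement, then consolidation) can be sketched on a toy problem. In the sketch below, each agent's objective is a hypothetical concave quadratic with its own optimum, standing in for that agent's expected reward. The consolidation step is the plain average of the locally improved parameters, which is the Euclidean projection onto the constraint that all copies are equal. The function names, step size, and objectives are assumptions for illustration and are not taken from the paper's implementation.

```python
from typing import List


def local_step(theta: float, optimum: float, alpha: float) -> float:
    """One gradient-ascent step on the toy objective -(theta - optimum)**2."""
    grad = -2.0 * (theta - optimum)
    return theta + alpha * grad


def consolidate(thetas: List[float]) -> float:
    """Project the per-agent copies onto the consensus set by averaging."""
    return sum(thetas) / len(thetas)


def distributed_ascent(optima: List[float], alpha: float = 0.1,
                       iterations: int = 200) -> float:
    theta = 0.0  # shared starting point for every agent
    for _ in range(iterations):
        # Step 1: each agent improves its own copy, starting from theta.
        thetas = [local_step(theta, c, alpha) for c in optima]
        # Step 2: consolidate the copies back into a single parameter.
        theta = consolidate(thetas)
    return theta


if __name__ == "__main__":
    # Three agents with different (hypothetical) individual optima.
    print(distributed_ascent([1.0, 2.0, 6.0]))
```

For this toy problem the iterate approaches the maximizer of the summed objectives, which is the mean of the individual optima.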
4.1 Distributed Multi-Agent Policy Gradients (DiMAPG)
In this section, we propose the Distributed Multi-Agent Policy Gradient (DiMAPG) algorithm, which learns a centralized policy that can be deployed across all agents. Consider a population from which statistically identical agents are sampled according to a distribution . The parameters of this agent-specific policy are updated by taking the gradient w.r.t. at the specific value of (where is the current central policy):
(9) 
where is a step-size hyperparameter and is as defined in Eqn 7. Note that uses trajectories generated when all agents follow policies while uses trajectories when agent follows while all other agents follow . We do this because, when the environment is held constant w.r.t. agent , the problem for agent reduces to an MDP [25].
In practice, we can take k gradient steps instead of just one as presented in Eqn 9. This can be done with the following inductive steps:
(10)  
Finally, we update :
(11) 
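The inner and outer updates in Eqns 9 to 11 follow a two-level pattern that can be sketched on scalar toy objectives. This is a minimal sketch, not the paper's implementation: the per-agent objective is a hypothetical quadratic, `alpha` and `beta` are assumed names for the inner and outer step sizes, and the outer step averages the agents' gradients under a first-order approximation. That is, it evaluates each agent's gradient at the adapted parameter and ignores the dependence of that parameter on the central one, a simplification chosen here for brevity.

```python
from typing import List


def grad(theta: float, optimum: float) -> float:
    """Gradient of the toy per-agent objective -(theta - optimum)**2."""
    return -2.0 * (theta - optimum)


def adapt(theta: float, optimum: float, alpha: float, k: int) -> float:
    """Inner loop: k gradient-ascent steps starting from the central theta."""
    theta_i = theta
    for _ in range(k):
        theta_i = theta_i + alpha * grad(theta_i, optimum)
    return theta_i


def outer_update(theta: float, optima: List[float], alpha: float,
                 beta: float, k: int) -> float:
    """Outer loop: move theta along the averaged post-adaptation gradients."""
    adapted = [adapt(theta, c, alpha, k) for c in optima]
    # First-order approximation: d(theta_i)/d(theta) is treated as 1.
    mean_grad = sum(grad(t_i, c) for t_i, c in zip(adapted, optima)) / len(optima)
    return theta + beta * mean_grad


if __name__ == "__main__":
    theta = 0.0
    for _ in range(500):
        theta = outer_update(theta, [1.0, 2.0, 6.0], alpha=0.05, beta=0.1, k=3)
    print(theta)
```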
Numerically, we approximate by drawing trajectories where agent uses policy while all other agents use policy and averaging over the policy gradients [25,26] that each trajectory provides. Recall that the trajectories and are random variables with distributions and respectively. The individual agent policy parameters are also random variables with distribution . The overall optimization can be written as:
(12) 
Assuming we sample N agents, Eqn 12 can be rewritten as:
(13) 
To learn , we use policy gradient methods [25,26], which operate by taking the gradient of Eqn 13. One can also use recently proposed state-of-the-art policy gradient methods [27,28]. The gradient for each agent in Eqn 13 (the quantity inside the sum) w.r.t. can be written as:
(14) 
The policy gradient for each agent consists of two policy gradient terms, one over the trajectories sampled using () and another term over the trajectories sampled using . It may be noted that the terms from the agent specific policy improvement when the other agents are held stationary (Eqn 10) do not appear in the final term. We show that it is possible to marginalize these terms out in the derivation for the gradient and point the reader to the appendix for a full derivation of the policy gradient. The full algorithm for DiMAPG is presented in Algorithm 1.
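The Monte Carlo estimate described above, in which per-trajectory policy gradients are averaged, can be illustrated with the score-function (REINFORCE) estimator on a one-step problem. The sketch below assumes a hypothetical two-action softmax policy with a single logit parameter and a fixed reward per action. It is a generic illustration of the estimator, not the gradient in Eqn 14.

```python
import math
import random
from typing import List


def prob_action_one(theta: float) -> float:
    """Probability of action 1 under a two-action softmax with logits (0, theta)."""
    return 1.0 / (1.0 + math.exp(-theta))


def reinforce_gradient(theta: float, rewards: List[float],
                       num_samples: int, rng: random.Random) -> float:
    """Average of reward * d/dtheta log pi(action) over sampled actions."""
    p1 = prob_action_one(theta)
    total = 0.0
    for _ in range(num_samples):
        action = 1 if rng.random() < p1 else 0
        # Score function: d/dtheta log pi(1) = 1 - p1, d/dtheta log pi(0) = -p1.
        score = (1.0 - p1) if action == 1 else -p1
        total += rewards[action] * score
    return total / num_samples


def exact_gradient(theta: float, rewards: List[float]) -> float:
    """Closed-form gradient of the expected reward, used as a reference value."""
    p1 = prob_action_one(theta)
    return p1 * (1.0 - p1) * (rewards[1] - rewards[0])


if __name__ == "__main__":
    rng = random.Random(0)
    rewards = [0.0, 1.0]  # hypothetical reward for action 0 and action 1
    print(reinforce_gradient(0.0, rewards, 20000, rng), exact_gradient(0.0, rewards))
```

With enough samples the estimate concentrates around the closed-form value, which is the property the averaging step relies on.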
5 Experiments
5.1 Environments
To test the effectiveness of DiMAPG, we perform experiments on both collaborative and competitive tasks. The environments from [12] and the many-agent (MAgent) environment from [29] are adapted for our experiments. We set up the following experiments to test our algorithm:
Cooperative Navigation: This task consists of agents and goals. All agents are identical, and each agent observes the position of the goals and the other agents relative to its own position. The agents are collectively rewarded based on how far any agent is from each goal. Further, the agents get a negative reward for colliding with other agents. This can be seen as a coverage task where all agents must learn to cover all goals without colliding with each other. We test increasing the number of agents and goal regions and report the minimum reward across all agents.
Predator Prey: This task environment consists of two populations: predators and prey. Prey are faster than the predators. The environment is also populated with static obstacles that the agents must learn to avoid or use to their advantage. All agents observe relative positions and velocities of other agents and the positions of the static obstacles. Predators are rewarded positively when they collide with the prey, and the prey are rewarded negatively.
Survival: This task consists of a large number of agents operating in an environment with limited resources or food. Agents get reward for eating food but also get reward for killing other agents (reward for eating food is higher). Agents must either rush to get reward from eating food or monopolize the food by killing other agents. However, when the agents kill other agents they incur a small negative reward. Each agent’s observation consists of a spatial local view component and a non-spatial component. The local view component encodes information about other agents within a range, while the non-spatial component encodes features such as the agent’s ID, last action executed, last reward, and the relative position of the agent in the environment.
5.2 Experimental Results
For all experiments, we use a neural network policy that consists of two hidden layers with 100 units each and uses ReLU nonlinearity. For the Cooperative Navigation task, we use the vanilla policy gradient or REINFORCE [26] to compute updates () and TRPO [28] to compute . For the Predator Prey and Survival tasks we switch to using REINFORCE for both and . To establish baselines, we compare against both centralized and decentralized deep MARL approaches. For decentralized learning, we use MADDPG from [12] using the online implementation open-sourced by the authors. Since the authors of [12] already show that MADDPG agents work better than other methods where individual agents are trained by DDPG, REINFORCE, Actor-Critic, TRPO, or DQN, we do not reimplement those algorithms. Instead, we implement a centralized A3C (Actor-Critic) [2] and a centralized TRPO that take as input the joint space of all agents' observations and output actions over the joint space of all agents. We call this the Kitchensink approach. Details about the policy architecture for A3C_Kitchensink and TRPO_Kitchensink are provided in the appendix. Our experiments are designed using the rllab benchmark suite [30] and use TensorFlow [31] to set up the computation graph for the neural network and compute gradients.
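For concreteness, the forward pass of a policy network with two hidden layers of 100 ReLU units can be sketched without any deep learning library. The sketch below is illustrative only: the observation size, the number of actions, the softmax output over discrete actions, and the random initialization are all assumptions, and the actual experiments build the network in TensorFlow through rllab.

```python
import math
import random
from typing import List

Matrix = List[List[float]]


def init_layer(n_in: int, n_out: int, rng: random.Random) -> Matrix:
    """Weight matrix of shape (n_out, n_in + 1); the last column is the bias."""
    scale = 1.0 / math.sqrt(n_in)
    return [[rng.uniform(-scale, scale) for _ in range(n_in + 1)]
            for _ in range(n_out)]


def dense(weights: Matrix, x: List[float]) -> List[float]:
    """Affine map: each row holds the input weights followed by a bias."""
    return [sum(w * v for w, v in zip(row[:-1], x)) + row[-1] for row in weights]


def relu(x: List[float]) -> List[float]:
    return [max(0.0, v) for v in x]


def softmax(x: List[float]) -> List[float]:
    m = max(x)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in x]
    s = sum(exps)
    return [e / s for e in exps]


def make_policy(obs_dim: int, n_actions: int, hidden: int = 100, seed: int = 0):
    rng = random.Random(seed)
    layers = [init_layer(obs_dim, hidden, rng),
              init_layer(hidden, hidden, rng),
              init_layer(hidden, n_actions, rng)]

    def policy(obs: List[float]) -> List[float]:
        h = relu(dense(layers[0], obs))
        h = relu(dense(layers[1], h))
        return softmax(dense(layers[2], h))  # action probabilities

    return policy


if __name__ == "__main__":
    policy = make_policy(obs_dim=8, n_actions=5)  # hypothetical sizes
    print(policy([0.1] * 8))
```

Because a single set of weights is shared, the same `policy` callable would be applied to each agent's own observation.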
5.2.1 Cooperative Navigation
We set up cooperative navigation as described in Section 5.1. Agents are rewarded for being close to the goals (negative square of distance to the goals) and get negatively rewarded for colliding with each other or when they step out of the environment boundary. We also observe that in order to stabilize training, we need to clip our rewards to the range [-1, 1]. We use a horizon after which episodes are terminated. Additional hyperparameters are provided in the Appendix.
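The reward shaping described above can be summarized in a short sketch. It assumes 2D positions, a hypothetical collision radius and collision penalty, and a coverage term equal to the negative squared distance from each goal to its nearest agent. Only the overall structure and the clipping to [-1, 1] follow the text, and the boundary penalty is omitted for brevity.

```python
from typing import List, Tuple

Point = Tuple[float, float]


def sq_dist(a: Point, b: Point) -> float:
    return (a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2


def clip(value: float, low: float = -1.0, high: float = 1.0) -> float:
    return max(low, min(high, value))


def navigation_reward(agents: List[Point], goals: List[Point],
                      collision_radius: float = 0.1,
                      collision_penalty: float = 1.0) -> float:
    """Shared reward: coverage term plus collision penalties, clipped to [-1, 1]."""
    # Coverage: negative squared distance from each goal to its nearest agent.
    reward = -sum(min(sq_dist(a, g) for a in agents) for g in goals)
    # Collisions: penalize every pair of agents closer than the radius.
    for i in range(len(agents)):
        for j in range(i + 1, len(agents)):
            if sq_dist(agents[i], agents[j]) < collision_radius ** 2:
                reward -= collision_penalty
    return clip(reward)


if __name__ == "__main__":
    agents = [(0.0, 0.0), (0.5, 0.0)]
    goals = [(0.0, 0.1), (0.5, 0.0)]
    print(navigation_reward(agents, goals))
```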
Table 1. Cooperative Navigation.
Method      n=3     n=10
Using       34.8    8
Using       37.19   8.5
Fine Tune   44.17   56.3
We run our proposed algorithm and baselines on this environment when the number of agents is n=3 and n=10. Since the baselines A3C_Kitchensink and TRPO_Kitchensink operate over the joint space, they are set up to maximize the minimum reward across all agents. The training curve for our tasks can be seen in Fig 3. We notice that for the simple case (n=3), A3C_Kitchensink performs very well and quickly converges. This is expected since the number of agents is low and the dimensionality of the input space is not large. TRPO_Kitchensink and MADDPG perform worse, and while they converge, the convergence is only seen after 300-400k episodes. When n is increased to ten, we observe that only DiMAPG is able to quickly learn policies for all agents.
In our initial hypothesis, we sought to use across all agents since we assumed that the policies for all agents in a given population live close to each other in parameter space. We observe from Table 1 that after training, using or (after k-shot adaptation from ) yields similar results, which supports our hypothesis. We also consider the case where we train only one agent and then run the same policy across all agents. We observe that this yields poor results.
5.3 Predator Prey
The goal of this experiment is to evaluate the effectiveness of DiMAPG on competitive tasks. In this task, there exist two populations of agents: predators and prey. Extending our hypothesis to this task, we would like to learn a single policy for all predators and a single policy for all prey. It is important to note that even though the policies are different, they are trained in parallel, which in the centralized setup enables us to condition each agent's trajectory on the actions of other agents even if they are in a different population. We experiment with two scenarios, 12-vs-1 and 3-vs-1 predator prey games, where the prey are faster than the predators. The horizon used is .
Our results are presented in Fig 4. We observe that DiMAPG learns better policies than both MADDPG and the centralized Kitchensink methods on this competitive task. Similar results with DiMAPG are achieved even when the number of predators and prey is increased.
5.4 Survival
The goal of this experiment is to demonstrate the effectiveness of DiMAPG on environments with a large number of agents. The environment is populated with agents and food (the food is static particles at the center). Agents must learn to survive by eating food. To do so they can either rush to gather food and get reward or monopolize the food by first killing other agents (killing other agents results in a small negative reward). We use DiMAPG to learn the central policy that is deployed across all agents by randomly sampling agents from the population. We roll out each episode for a horizon of . Each environment is populated with food particles (eating one food particle yields a reward of +5). For this task, it is infeasible to train the other baselines, and hence we do not benchmark against them in this experiment.
Table 2. Survival.
Statistics       N=230   N=630
Food Left        0       0
Survivors        227     490
Average Reward   946     674
We gauge the performance of DiMAPG on this task by evaluating the number of surviving agents and the food left at the end of the episode, as well as the average reward over agents per episode (Table 2). We observe that when N=230, the agents do not kill each other and instead learn to gather food. When the number of agents is increased to N=630, agents close to the food rush in to gather food while those further away start killing other agents.
6 Conclusion and Outlook
In this work, we have proposed a distributed optimization setup for multiagent reinforcement learning that learns to combine information from all agents into a single policy that works well for large populations. We show that our proposed algorithm performs better than other state-of-the-art deep multiagent reinforcement learning algorithms when the number of agents is increased.
One bottleneck in our work is the significant computation cost involved in computing the second derivatives for the gradient updates. Due to this, in practice we make approximations for the second derivative and are restricted to simple feedforward neural networks. On more challenging tasks, it might be a good idea to try recurrent neural networks and investigate methods such as the one presented in [32] to compute fast gradients. We leave this for future work.
References
 (1) V. Mnih, K. Kavukcuoglu, D. Silver, A. Graves, I. Antonoglou, D. Wierstra, and M. Riedmiller, “Playing atari with deep reinforcement learning,” arXiv preprint arXiv:1312.5602, 2013.
 (2) V. Mnih, A. P. Badia, M. Mirza, A. Graves, T. Lillicrap, T. Harley, D. Silver, and K. Kavukcuoglu, “Asynchronous methods for deep reinforcement learning,” in International Conference on Machine Learning, pp. 1928–1937, 2016.
 (3) S. Levine, C. Finn, T. Darrell, and P. Abbeel, “End-to-end training of deep visuomotor policies,” arXiv preprint arXiv:1504.00702, 2015.
 (4) D. Pathak, P. Agrawal, A. A. Efros, and T. Darrell, “Curiosity-driven exploration by self-supervised prediction,” arXiv preprint arXiv:1705.05363, 2017.
 (5) A. Khan, C. Zhang, N. Atanasov, K. Karydis, V. Kumar, and D. D. Lee, “Memory augmented control networks,” in International Conference on Learning Representations, 2018.
 (6) S. Gupta, J. Davidson, S. Levine, R. Sukthankar, and J. Malik, “Cognitive mapping and planning for visual navigation,” arXiv preprint arXiv:1702.03920, 2017.
 (7) J. Enright and P. R. Wurman, “Optimization and coordinated autonomy in mobile fulfillment systems.,” 2011.
 (8) R. A. Knepper, T. Layton, J. Romanishin, and D. Rus, “Ikeabot: An autonomous multi-robot coordinated furniture assembly system,” in Robotics and Automation (ICRA), 2013 IEEE International Conference on, pp. 855–862, IEEE, 2013.
 (9) J. Stephan, J. Fink, V. Kumar, and A. Ribeiro, “Concurrent control of mobility and communication in multirobot systems,” IEEE Transactions on Robotics, vol. 33, pp. 1248–1254, October 2017.
 (10) K. Solovey and D. Halperin, “On the hardness of unlabeled multi-robot motion planning,” The International Journal of Robotics Research, vol. 35, no. 14, pp. 1750–1759, 2016.
 (11) J. Foerster, G. Farquhar, T. Afouras, N. Nardelli, and S. Whiteson, “Counterfactual multi-agent policy gradients,” arXiv preprint arXiv:1705.08926, 2017.
 (12) R. Lowe, Y. Wu, A. Tamar, J. Harb, O. P. Abbeel, and I. Mordatch, “Multi-agent actor-critic for mixed cooperative-competitive environments,” in Advances in Neural Information Processing Systems, pp. 6382–6393, 2017.
 (13) B. C. Da Silva, E. W. Basso, A. L. Bazzan, and P. M. Engel, “Dealing with non-stationary environments using context detection,” in Proceedings of the 23rd international conference on Machine learning, pp. 217–224, ACM, 2006.
 (14) R. S. Sutton, A. Koop, and D. Silver, “On the role of tracking in stationary environments,” in Proceedings of the 24th international conference on Machine learning, pp. 871–878, ACM, 2007.
 (15) K. A. Potter, H. Arthur Woods, and S. Pincebourde, “Microclimatic challenges in global change biology,” Global change biology, vol. 19, no. 10, pp. 2932–2939, 2013.
 (16) M. Rubenstein, C. Ahler, and R. Nagpal, “Kilobot: A low cost scalable robot system for collective behaviors,” in Robotics and Automation (ICRA), 2012 IEEE International Conference on, pp. 3293–3298, IEEE, 2012.
 (17) S. Boyd, N. Parikh, E. Chu, B. Peleato, J. Eckstein, et al., “Distributed optimization and statistical learning via the alternating direction method of multipliers,” Foundations and Trends® in Machine learning, vol. 3, no. 1, pp. 1–122, 2011.
 (18) L. Busoniu, R. Babuska, and B. De Schutter, “Multi-agent reinforcement learning: A survey,” in Control, Automation, Robotics and Vision, 2006. ICARCV’06. 9th International Conference on, pp. 1–6, IEEE, 2006.
 (19) M. L. Littman, “Markov games as a framework for multiagent reinforcement learning,” in Machine Learning Proceedings 1994, pp. 157–163, Elsevier, 1994.
 (20) J. Hu and M. P. Wellman, “Nash Q-learning for general-sum stochastic games,” Journal of machine learning research, vol. 4, no. Nov, pp. 1039–1069, 2003.
 (21) V. Conitzer and T. Sandholm, “Awesome: A general multiagent learning algorithm that converges in self-play and learns a best response against stationary opponents,” Machine Learning, vol. 67, no. 1-2, pp. 23–43, 2007.
 (22) A. Tampuu, T. Matiisen, D. Kodelja, I. Kuzovkin, K. Korjus, J. Aru, J. Aru, and R. Vicente, “Multiagent cooperation and competition with deep reinforcement learning,” PloS one, vol. 12, no. 4, p. e0172395, 2017.
 (23) M. Al-Shedivat, T. Bansal, Y. Burda, I. Sutskever, I. Mordatch, and P. Abbeel, “Continuous adaptation via meta-learning in nonstationary and competitive environments,” arXiv preprint arXiv:1710.03641, 2017.
 (24) C. Finn, P. Abbeel, and S. Levine, “Model-agnostic meta-learning for fast adaptation of deep networks,” arXiv preprint arXiv:1703.03400, 2017.
 (25) R. S. Sutton and A. G. Barto, Reinforcement learning: An introduction, vol. 1. MIT press Cambridge, 1998.
 (26) R. J. Williams, “Simple statistical gradient-following algorithms for connectionist reinforcement learning,” Machine learning, vol. 8, no. 3-4, pp. 229–256, 1992.
 (27) J. Schulman, P. Moritz, S. Levine, M. Jordan, and P. Abbeel, “High-dimensional continuous control using generalized advantage estimation,” arXiv preprint arXiv:1506.02438, 2015.
 (28) J. Schulman, S. Levine, P. Abbeel, M. Jordan, and P. Moritz, “Trust region policy optimization,” in International Conference on Machine Learning, pp. 1889–1897, 2015.
 (29) L. Zheng, J. Yang, H. Cai, W. Zhang, J. Wang, and Y. Yu, “Magent: A many-agent reinforcement learning platform for artificial collective intelligence,” arXiv preprint arXiv:1712.00600, 2017.
 (30) Y. Duan, X. Chen, R. Houthooft, J. Schulman, and P. Abbeel, “Benchmarking deep reinforcement learning for continuous control,” in International Conference on Machine Learning, pp. 1329–1338, 2016.
 (31) M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G. S. Corrado, A. Davis, J. Dean, M. Devin, S. Ghemawat, I. Goodfellow, A. Harp, G. Irving, M. Isard, Y. Jia, R. Jozefowicz, L. Kaiser, M. Kudlur, J. Levenberg, D. Mané, R. Monga, S. Moore, D. Murray, C. Olah, M. Schuster, J. Shlens, B. Steiner, I. Sutskever, K. Talwar, P. Tucker, V. Vanhoucke, V. Vasudevan, F. Viégas, O. Vinyals, P. Warden, M. Wattenberg, M. Wicke, Y. Yu, and X. Zheng, “TensorFlow: Large-scale machine learning on heterogeneous systems,” 2015. Software available from tensorflow.org.
 (32) J. Martens, J. Ba, and M. Johnson, “Kronecker-factored curvature approximations for recurrent neural networks,” in International Conference on Learning Representations, 2018.
APPENDIX
Appendix A Derivation for Multi-Agent Policy Gradient
Following Section 4.1, the overall optimization problem for distributed meta-learning was given as:
(15) 
where trajectories and are random variables with distributions and respectively. Assuming we sample N agents, Eqn 15 can be rewritten as:
(16) 
Let :
(17) 
Since it is required that we maximize only over theta, we are interested in marginalizing . Expanding all expectations we can write:
(18) 
Assuming we use k gradient steps instead of just one, as presented in Eqn 10 in the main paper, this can be rewritten as:
(19) 
The term in the above Eqn 18 can be integrated out if we assume a delta distribution for :
(20) 
A similar observation can be made for the intermediate terms , , in the above Eqn 19.
Thus, after integrating these terms out (in the above Eqn 18 or 19), we are left with:
(21) 
Taking the gradient of Eqn 21 and rewriting it in expectation form, we get:
(22) 
Appendix B Connection to Meta-Learning
We observe that there exists a natural connection between our proposed distributed learning and gradient-based meta-learning techniques such as the ones used in [23,24]. We briefly introduce gradient-based meta-learning here and draw connections from our work to meta-learning.
B.1 Model-Agnostic Meta-Learning (MAML)
Consider a series of RL tasks that one would like to learn. Each task can be thought of as a Markov Decision Process (MDP) consisting of observations , actions , a state transition function and a reward function . To solve the MDP (for each task), one would like to learn a policy that maximizes the expected sum of rewards over a finite time horizon , . Let the policy be represented by some function where is the initial parameters of the function.
In MAML [24], the authors show that it is possible to learn a policy which can be used on a task to collect a limited number of trajectories or experience and quickly adapt to a task-specific policy that minimizes the task-specific loss . MAML learns the task-specific policy by taking the gradient of w.r.t . It then collects a new set of trajectories or experience using in task . is then updated by taking the gradient of w.r.t over all tasks. The update equations for and are given as:
(23) 
where and are the step-size hyperparameters. The authors of [23] extend MAML, showing that it can be viewed from a probabilistic perspective in which all tasks, trajectories and policies are treated as random variables and is generated from some conditional distribution .
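To make the two-level update concrete, the following is a minimal sketch of gradient-based meta-learning on a toy problem, not the implementation used in [24] or in this paper. It assumes each task has a one-dimensional quadratic loss, so that both the inner gradient and the gradient through the inner step can be written analytically. The names `task_centers`, `alpha` (inner step size) and `beta` (outer step size) are hypothetical and chosen only for illustration.

```python
# Toy MAML sketch on 1-D quadratic task losses L_i(theta) = 0.5 * (theta - c_i)^2.
# All names and values here are illustrative assumptions, not taken from the paper.


def inner_update(theta, center, alpha):
    """One task-specific gradient step: theta_i' = theta - alpha * dL_i/dtheta."""
    grad = theta - center  # gradient of 0.5 * (theta - center)^2
    return theta - alpha * grad


def meta_gradient(theta, centers, alpha):
    """Gradient w.r.t. theta of sum_i L_i(theta_i'), taken through the inner step."""
    total = 0.0
    for c in centers:
        adapted = inner_update(theta, c, alpha)
        # Chain rule: dL_i(theta_i')/dtheta = (theta_i' - c) * d(theta_i')/dtheta,
        # and d(theta_i')/dtheta = 1 - alpha for this quadratic loss.
        total += (adapted - c) * (1.0 - alpha)
    return total


def maml(theta, centers, alpha=0.1, beta=0.1, iterations=200):
    """Repeat the outer update theta <- theta - beta * meta_gradient."""
    for _ in range(iterations):
        theta = theta - beta * meta_gradient(theta, centers, alpha)
    return theta


task_centers = [-1.0, 0.0, 4.0]       # hypothetical per-task optima
theta_star = maml(0.0, task_centers)  # shared initialization after meta-training
```

In this toy setting the shared initialization settles at the mean of the task optima, and a single call to `inner_update` then moves it toward any one task.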
B.2 Distributed Optimization for Multi-Agent Systems
We observe that the meta-policy that MAML attempts to learn and uses as an initialization point for the different tasks is similar in spirit to the central policy that DIMAPG attempts to learn and execute on all agents. In both approaches, captures information across multiple tasks or multiple agents. An important difference between our work and MAML or meta-learning is that during execution (post training) we execute while MAML uses to do a one-shot adaptation for task and then executes on .
Another point to note is the difference between the trajectories that are used by MAML and the trajectory that is used by DIMAPG to update the task- or agent-specific policy or . In distributed optimization for the multiagent setting, due to nonstationarity, we must ensure that the other agents are held constant (to ) while agent is optimizing its task-specific policy . MAML has no such requirement.
Appendix C Experimental Details
C.1 A3C KitchenSink and TRPO KitchenSink
For A3C KitchenSink, we take the agent's observation as input and reshape it into a matrix. This is then fed into a 2D convolution layer with 16 outputs, ELU activation, a kernel size of 2 and a stride of 1. The output from this layer is fed into another 2D convolution layer with 32 outputs, ELU activation, a kernel size of 2 and a stride of 1. The output from this layer is flattened and fed into a fully connected layer with 256 outputs and ELU activation, followed by an LSTM layer with 256 hidden units. The output from the LSTM is then fed into two separate fully connected layers to get the policy estimate and the value function estimate. The actor-critic loss is set up and minimized using Adam with learning rate 1e-4. For TRPO KitchenSink, we set up a similar policy layer and value function layer.
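The layer stack above can be checked with some simple shape arithmetic. The sketch below is only an illustration under stated assumptions: the observation is taken to be a square, single-channel matrix, and the convolutions use no padding (the text does not specify either). The arguments `side` and `num_actions` are hypothetical placeholders for values not given here.

```python
# Shape walk-through for the A3C KitchenSink stack described above.
# Assumptions (not stated in the text): square single-channel input, no padding.


def conv_out(size, kernel=2, stride=1):
    """Spatial output size of an unpadded convolution."""
    return (size - kernel) // stride + 1


def kitchen_sink_shapes(side, num_actions):
    """Return (layer name, output shape) pairs for a side x side observation."""
    h1 = conv_out(side)  # first conv: kernel 2, stride 1
    h2 = conv_out(h1)    # second conv: kernel 2, stride 1
    return [
        ("input", (side, side, 1)),
        ("conv1_elu", (h1, h1, 16)),
        ("conv2_elu", (h2, h2, 32)),
        ("flatten", (h2 * h2 * 32,)),
        ("fc_elu", (256,)),
        ("lstm", (256,)),
        ("policy_head", (num_actions,)),  # policy estimate
        ("value_head", (1,)),             # value function estimate
    ]


# Hypothetical example: a 10 x 10 observation and 5 discrete actions.
for name, shape in kitchen_sink_shapes(side=10, num_actions=5):
    print(f"{name:12s} {shape}")
```

Each unpadded convolution with kernel size 2 and stride 1 shrinks the spatial side by one, so the flattened feature size grows quadratically with the observation side.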
C.2 DIMAPG
For this task, we used a neural network policy with two hidden layers of 100 units each. The network uses a ReLU nonlinearity. Depending on the experiment, we compute agent-specific gradient updates using REINFORCE and use TRPO for the central policy gradient updates. The baseline is fitted separately at each iteration for all agents sampled from the population. We use the standard linear feature baseline. The learning rate for agent-specific policy updates is 0.01. Learning rate for central policy updates . In practice, to adapt to we do multiple gradient steps. We observe that k=3 (number of gradient steps) is a good choice for most tasks. For both and updates, we collect 25 trajectories.
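The settings above can be collected in a small configuration object, together with a sketch of the k-step agent-specific adaptation. This is an illustrative sketch rather than the code used for the experiments: the class and field names are hypothetical, the central learning rate is left unset because its value is not given above, and the adaptation is shown as plain gradient descent on a scalar stand-in loss instead of a REINFORCE update on a policy network.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Tuple


@dataclass(frozen=True)
class DimapgConfig:
    """Hyperparameters as listed in the text; field names are hypothetical."""
    hidden_sizes: Tuple[int, int] = (100, 100)  # two hidden layers, 100 units each
    activation: str = "relu"
    agent_lr: float = 0.01                      # agent-specific update step size
    central_lr: Optional[float] = None          # value not given in the text
    num_gradient_steps: int = 3                 # k
    trajectories_per_update: int = 25


def adapt(theta: float, grad_fn: Callable[[float], float], cfg: DimapgConfig) -> float:
    """Take k gradient steps from the central parameters (scalar stand-in)."""
    for _ in range(cfg.num_gradient_steps):
        theta = theta - cfg.agent_lr * grad_fn(theta)
    return theta


cfg = DimapgConfig()
# Stand-in loss 0.5 * (theta - 1)^2, whose gradient is theta - 1.
adapted = adapt(0.0, lambda th: th - 1.0, cfg)
```

With a step size of 0.01 and k=3, the adapted parameters stay close to the central ones, which is in keeping with the assumption that agent policies live near a single policy in parameter space.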
C.3 Survivor
In this experiment, the environment is populated with agents and food particles. The agents must learn to survive by eating food. To do so, they can either rush to gather food and get reward or monopolize the food by first killing other agents (killing other agents results in a small negative reward). Each agent in this environment also has an orientation. An agent can choose to move to one of 12 neighboring cells, stay in place, or attack any agent or entity in 8 neighboring cells. Finally, the agent can also choose to turn right or left. At every step, the agents receive a "step reward" of 0.01. If the agent dies, it is given a reward of 1. If the agent attacks another agent, it receives a penalty of 0.1. However, if it chooses to attack another agent by forming a group, it receives a reward of 1. The agent also gets a reward of +5 for eating food.
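The action space and reward structure described above can be summarized in a short sketch. It is an illustration only and not the environment code. The magnitudes are copied as printed; the attack penalty is encoded as negative because the text calls it a penalty, while the signs of the step and death rewards are kept as printed and may differ in the actual environment. Summing the individual terms within a step, and replacing the attack penalty with the group reward, are both assumptions.

```python
# Sketch of the Survivor action count and per-step reward (illustrative only).

NUM_MOVE = 12    # move to one of 12 neighboring cells
NUM_STAY = 1     # stay in place
NUM_ATTACK = 8   # attack an agent or entity in one of 8 neighboring cells
NUM_TURN = 2     # turn right or left
NUM_ACTIONS = NUM_MOVE + NUM_STAY + NUM_ATTACK + NUM_TURN

# Magnitudes as printed in the text; see the caveats on signs above.
REWARDS = {
    "step": 0.01,
    "death": 1.0,
    "attack": -0.1,       # described as a penalty
    "group_attack": 1.0,
    "food": 5.0,
}


def step_reward(died=False, attacked=False, in_group=False, ate_food=False,
                rewards=REWARDS):
    """Combine the per-step terms additively (an assumption of this sketch)."""
    total = rewards["step"]
    if died:
        total += rewards["death"]
    if attacked:
        # Assumed: a group attack earns the group reward instead of the penalty.
        total += rewards["group_attack"] if in_group else rewards["attack"]
    if ate_food:
        total += rewards["food"]
    return total
```

Passing a different `rewards` table makes it easy to check how the trade-off between gathering and attacking shifts under other sign conventions.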
As stated in the main paper, when , the agents do not kill each other and instead learn to gather food. When the number of agents is increased to , agents close to the food rush in to gather food while those further away start killing other agents. We present a snapshot of the learned policy in Figure 1 and Figure 2.