Evolutionary Population Curriculum for Scaling Multi-Agent Reinforcement Learning


In multi-agent games, the complexity of the environment can grow exponentially as the number of agents increases, so it is particularly challenging to learn good policies when the agent population is large. In this paper, we introduce Evolutionary Population Curriculum (EPC), a curriculum learning paradigm that scales up Multi-Agent Reinforcement Learning (MARL) by progressively increasing the population of training agents in a stage-wise manner. Furthermore, EPC uses an evolutionary approach to fix an objective misalignment issue throughout the curriculum: agents successfully trained in an early stage with a small population are not necessarily the best candidates for adapting to later stages with scaled populations. Concretely, EPC maintains multiple sets of agents in each stage, performs mix-and-match and fine-tuning over these sets and promotes the sets of agents with the best adaptability to the next stage. We implement EPC on a popular MARL algorithm, MADDPG, and empirically show that our approach consistently outperforms baselines by a large margin as the number of agents grows exponentially. The project page is https://sites.google.com/view/epciclr2020. The source code is released at https://github.com/qian18long/epciclr2020.

1 Introduction

Most real-world problems involve interactions between multiple agents, and the problem becomes significantly harder when there is complex cooperation and competition among agents. Inspired by the tremendous success of deep reinforcement learning (RL) in single-agent applications, such as Atari games (Mnih et al., 2013), robotics manipulation (Levine et al., 2016), and navigation (Zhu et al., 2017; Wu et al., 2018; Yang et al., 2019), it has become a popular trend to apply deep RL techniques to multi-agent applications, including communication (Foerster et al., 2016; Sukhbaatar and Fergus, 2016; Mordatch and Abbeel, 2018), traffic light control (Wu et al., 2017), physical combats (Bansal et al., 2018), and video games (Liu et al., 2019; OpenAI, 2018).

A fundamental challenge for multi-agent reinforcement learning (MARL) is that, as the number of agents increases, the problem becomes significantly more complex and the variance of policy gradients can grow exponentially (Lowe et al., 2017). Despite advances in tackling this challenge via actor-critic methods (Lowe et al., 2017; Foerster et al., 2018), which utilize decentralized actors and centralized critics to stabilize training, recent works still scale poorly and are mostly restricted to fewer than a dozen agents. However, many real-world applications involve a moderately large population of agents, such as algorithmic trading (Wellman et al., 2005), sport team competition (Hausknecht and Stone, 2015), and humanitarian assistance and disaster response (Meier, 2015), where one agent should collaborate and/or compete with all other agents. When directly applying existing MARL algorithms to complex games with a large number of agents, as we will show in Sec. 5.3, the agents may fail to learn good strategies and end up with little interaction with other agents even when collaboration is significantly beneficial. Yang et al. (2018) proposed a provably-converged mean-field formulation to scale up the actor-critic framework by feeding the state information and the average value of nearby agents’ actions to the critic. However, this formulation strongly relies on the assumption that the value function for each agent can be well approximated by the mean of local pairwise interactions. This assumption often does not hold when the interactions between agents become complex, leading to a significant drop in performance.

In this paper, we propose a general learning paradigm called Evolutionary Population Curriculum (EPC), which allows us to scale up the number of agents exponentially. The core idea of EPC is to progressively increase the population of agents throughout the training process. Particularly, we divide the learning procedure into multiple stages with an increasing number of agents in the environment. The agents first learn to play in simpler scenarios with fewer agents and then leverage these experiences to gradually adapt to later stages with more agents and ultimately our desired population.

There are two key components in our curriculum learning paradigm. To process the varying number of agents during the curriculum procedure, the policy/critic needs to be population-invariant. So, we choose a self-attention (Vaswani et al., 2017) based architecture which can generalize to an arbitrary number of agents with a fixed number of parameters. More importantly, we introduce an evolutionary selection process, which helps address the misalignment of learning goals across stages and improves the agents’ performance in the target environment. Intuitively, our within-stage MARL training objective only incentivizes agents to overfit a particular population in the current stage. When moving towards a new stage with a larger population, the successfully trained agents may not adapt well to the scaled environment. To mitigate this issue, we maintain multiple sets of agents in each stage, evolve them through cross-set mix-and-match and parallel MARL fine-tuning in the scaled environment, and select those with better adaptability to the next stage.

EPC is RL-algorithm agnostic and can be potentially integrated with most existing MARL algorithms. In this paper, we illustrate the empirical benefits of EPC by implementing it on a popular MARL algorithm, MADDPG (Lowe et al., 2017), and experimenting on three challenging environments: a predator-prey-style individual survival game, a mixed cooperative-and-competitive battle game, and a fully cooperative food collection game. We show that EPC outperforms baseline approaches by a large margin on all these environments, even as the number of agents grows exponentially. We also demonstrate that our method improves the stability of the training procedure.

2 Related Work

Multi-Agent Reinforcement Learning: There is a long history of applying RL to multi-agent games (Littman, 1994; Shoham et al., 2003; Panait and Luke, 2005; Wright et al., 2019). Recently, deep RL techniques have been applied to multi-agent scenarios to solve complex Markov games, and great algorithmic advances have been achieved. Foerster et al. (2016) and He et al. (2016) explored a multi-agent variant of deep Q-learning; Peng et al. (2017) studied a fully centralized actor-critic variant; Foerster et al. (2018) developed a decentralized multi-agent policy gradient algorithm with a centralized baseline; Lowe et al. (2017) proposed the MADDPG algorithm, which extends DDPG to the multi-agent setting with decentralized policies and centralized Q functions. Our population curriculum approach is a general framework for scaling MARL which can be potentially combined with any of these algorithms. Particularly, we implement our method on top of the MADDPG algorithm in this paper and take different MADDPG variants as baselines in experiments. There are also other recent works studying large-scale MARL (Lin et al., 2018; Jiang and Lu, 2018; Yang et al., 2018; Suarez et al., 2019), which typically simplify the problem by weight sharing and taking only local observations. We consider a much more general setting with global observations and unshared-weight agents. Additionally, our approach is a general learning paradigm which is complementary to the specific techniques proposed in these works.

Attention-Based Policy Architecture: Attention mechanism is widely used in RL policy representation to capture object level information (Duan et al., 2017; Wang et al., 2018), represent relations (Zambaldi et al., 2018; Malysheva et al., 2018; Yang et al., 2019) and extract communication channels (Jiang and Lu, 2018). Iqbal and Sha (2019) use an attention-based critic. In our work, we utilize an attention module in both policy and critic, inspired by the transformer architecture (Vaswani et al., 2017), for the purpose of generalization to an arbitrary number of input entities.

Curriculum Learning: Curriculum learning can be traced back to Elman (1993); its core idea is to “start small”: learn the easier aspects of the task first and then gradually increase the task difficulty. It has been extended to deep neural networks on both vision and language tasks (Bengio et al., 2009) and much beyond: Karras et al. (2017) propose to progressively increase the network capacity for synthesizing high quality images; Murali et al. (2018) apply a curriculum over the control space for robotic manipulation tasks; several works (Wu and Tian, 2016; Pinto et al., 2017; Florensa et al., 2017; Sukhbaatar et al., 2017; Wang et al., 2019) have proposed to first train RL agents on easier goals and switch to harder ones later. Baker et al. (2019) show that multi-agent self-play can also lead to autocurricula in open-ended environments. In our paper, we propose to progressively increase the number of agents as a curriculum for better scaling multi-agent reinforcement learning.

Evolutionary Learning: Evolutionary algorithms, originally inspired by Darwin’s natural selection, have a long history (Bäck and Schwefel, 1993): they train a population of agents in parallel and let them evolve via crossover, mutation and selection. Recently, evolutionary algorithms have been applied to learn deep RL policies with various aims, such as enhancing training scalability (Salimans et al., 2017), tuning hyper-parameters (Jaderberg et al., 2017), evolving intrinsic dense rewards (Jaderberg et al., 2018), learning a neural loss for better generalization (Houthooft et al., 2018), obtaining diverse samples for faster off-policy learning (Khadka and Tumer, 2018), and encouraging exploration (Conti et al., 2018). Leveraging this insight, we apply evolutionary learning to better scale MARL: we train several groups of agents in each curriculum stage and keep evolving them to larger populations for better adaptation towards the desired population scale and improved training stability. Czarnecki et al. (2018) proposed a similar evolutionary mix-and-match training paradigm to progressively increase agent capacity, i.e., larger action spaces and more parameters. Their work considers a fixed environment with an increasingly complex agent and utilizes traditional parameter crossover and mutation during evolution. By contrast, we focus on scaling MARL, namely an increasingly complex environment with a growing number of agents. More importantly, we utilize MARL fine-tuning as an implicit mutation operator rather than the classical way of mutating parameters, which is more efficient, guided, and applicable even to a very small number of evolution individuals. A similar idea of using learning for mutation is also considered by Gangwani and Peng (2018) in the single-agent setting.

3 Background

Markov Games: We consider multi-agent Markov decision processes (MDPs) (Littman, 1994). An $N$-agent Markov game is defined by the state space $\mathcal{S}$ of the game, and the action spaces $\mathcal{A}_1, \dots, \mathcal{A}_N$ and observation spaces $\mathcal{O}_1, \dots, \mathcal{O}_N$ of the $N$ agents. Each agent $i$ receives a private observation $o_i \in \mathcal{O}_i$ correlated with the state $s \in \mathcal{S}$ and produces an action $a_i \in \mathcal{A}_i$ by a stochastic policy $\pi_{\theta_i}(a_i \mid o_i)$ parameterized by $\theta_i$. The next state is produced according to the transition function $\mathcal{T}(s' \mid s, a_1, \dots, a_N)$. The initial state is determined by a distribution $\rho(s)$. Each agent $i$ obtains a reward $r_i(s, a_i)$ as a function of the state $s$ and its action $a_i$, and aims to maximize its own expected return $R_i = \mathbb{E}\left[\sum_{t=0}^{T} \gamma^t r_i(s^t, a_i^t)\right]$, where $\gamma$ is a discount factor and $T$ is the time horizon. To minimize notation, we omit the subscript of the policy $\pi$ when there is no ambiguity.

Multi-Agent Deep Deterministic Policy Gradient (MADDPG): MADDPG (Lowe et al., 2017) is a multi-agent variant of the deterministic policy gradient algorithm (Silver et al., 2014). It learns a centralized Q function for each agent which conditions on global state information to resolve the non-stationarity issue. Consider $N$ agents with deterministic policies $\mu_1, \dots, \mu_N$, where $\mu_i$ is parameterized by $\theta_i$. The policy gradient for agent $i$ is:

$$\nabla_{\theta_i} J(\mu_i) = \mathbb{E}_{x, a \sim \mathcal{D}}\left[\nabla_{\theta_i} \mu_i(o_i)\, \nabla_{a_i} Q_i^{\mu}(x, a_1, \dots, a_N)\big|_{a_i = \mu_i(o_i)}\right].$$

Here $\mathcal{D}$ denotes the replay buffer while $Q_i^{\mu}$ is a centralized action-value function for agent $i$ that takes the actions of all agents and the state information $x$ (i.e., $x = (o_1, \dots, o_N)$, or simply $x = s$ if the underlying state $s$ is available). Let $x'$ denote the next state from the environment transition. The replay buffer $\mathcal{D}$ contains experiences in the form of tuples $(x, x', a_1, \dots, a_N, r_1, \dots, r_N)$. Suppose the centralized critic $Q_i^{\mu}$ is parameterized by $\phi_i$. Then it is updated via:

$$\mathcal{L}(\phi_i) = \mathbb{E}_{x, a, r, x'}\left[\left(Q_i^{\mu}(x, a_1, \dots, a_N) - y\right)^2\right], \qquad y = r_i + \gamma\, Q_i^{\mu'}(x', a_1', \dots, a_N')\big|_{a_j' = \mu_j'(o_j')},$$

where $\mu' = \{\mu_{\theta_1'}, \dots, \mu_{\theta_N'}\}$ is the set of target policies with delayed parameters $\theta_j'$. Note that the centralized critic is only used during training. At execution time, each policy $\mu_i$ remains decentralized and only takes the local observation $o_i$.
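As a concrete, framework-agnostic sketch of the critic update above, the TD target can be computed by querying every agent's target policy on its next observation and feeding the resulting joint action into agent $i$'s target critic. The function and argument names below are illustrative assumptions, not the reference MADDPG implementation:

```python
def maddpg_td_target(reward_i, next_state, next_obs, target_policies,
                     target_critic_i, gamma=0.95):
    """TD target y = r_i + gamma * Q_i'(x', a_1', ..., a_N'), where each
    a_j' comes from agent j's *target* policy mu_j' applied to o_j'.
    `target_policies` and `target_critic_i` are any callables."""
    # joint next action from all agents' delayed (target) policies
    next_actions = [mu_j(o_j) for mu_j, o_j in zip(target_policies, next_obs)]
    # bootstrap with agent i's target critic on the next state
    return reward_i + gamma * target_critic_i(next_state, next_actions)
```

Minimizing the squared difference between $Q_i^{\mu}(x, a)$ and this (held-fixed) target is what the loss $\mathcal{L}(\phi_i)$ above expresses.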

4 Evolutionary Population Curriculum

Figure 1: Our population-invariant Q function: (a) utilizes the attention mechanism to combine the embeddings produced by the observation-action encoder $g^i$ for different agents; (b) is a detailed view of $g^i$, which also utilizes an attention module to combine the different entities within one observation.

In this section, we will first describe the base network architecture with the self-attention mechanism (Vaswani et al., 2017) which allows us to incorporate a flexible number of agents during training. Then we will introduce the population curriculum paradigm and the evolutionary selection process.

4.1 Population-Invariant Architecture

We describe our choice of architecture based on the MADDPG algorithm (Lowe et al., 2017), which is population-invariant in the sense that both the Q function and the policy can take in an arbitrary number of input entities. We first introduce the Q function (Fig. 1) and then the policy.

We adopt the decentralized execution framework, so each agent has its own Q function and policy network. Particularly, for agent $i$, its centralized Q function is represented as follows:

$$Q^i(o_1, \dots, o_N, a_1, \dots, a_N) = f^i\big(\psi^i(h_i),\, q^i\big), \qquad h_j = g^i(o_j, a_j).$$

Here $g^i$ is an observation-action encoder (the green box in Fig. 1(a)) which takes in the observation $o_j$ and the action $a_j$ of agent $j$, and outputs the agent embedding $h_j$ of agent $j$; $q^i$ denotes the global attention embedding (the orange box in Fig. 1(a)) over all the agent embeddings. We will explain $g^i$ and $q^i$ later. $\psi^i$ is a 1-layer fully connected network processing the embedding $h_i$ of the $i$-th agent’s own observation and action. $f^i$ is a 2-layer fully connected network that takes the concatenation of the output of $\psi^i$ and the global attention embedding $q^i$ and outputs the final Q value.

Attention Embedding $q^i$: We define the attention embedding $q^i$ as a weighted sum of the embeddings $h_j$ of the other agents $j \neq i$:

$$q^i = \sum_{j \neq i} \alpha_j h_j. \qquad (4)$$

The coefficient $\alpha_j$ is computed by

$$\beta_j = \big(W_1 h_i\big)^{\top} \big(W_2 h_j\big), \qquad \alpha_j = \operatorname{softmax}_j(\beta_j), \qquad (5)$$

where $W_1$ and $W_2$ are parameters to learn. $\beta_j$ computes the correlation between the embeddings of agent $i$ and every other agent $j$ via an inner product. $\alpha_j$ is then obtained by normalizing $\beta_j$ with a softmax function. Since we represent the observations and actions of the other agents with a weighted mean from Eq. 4, we can model the interactions between agent $i$ and an arbitrary number of other agents, which allows us to easily increase the number of agents in our curriculum training paradigm.
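This weighted-mean attention can be sketched in a few lines. The projection matrices `W1`, `W2` and the exact dot-product form are illustrative assumptions; the key property is that the function accepts any number of agent embeddings, which is what the curriculum exploits:

```python
import numpy as np

def attention_embedding(h, i, W1, W2):
    """Population-invariant pooling for agent i: a softmax-weighted sum of
    the other agents' embeddings h[j], with weights from inner products of
    the (linearly projected) embeddings.  Works for any number of agents."""
    others = [j for j in range(len(h)) if j != i]
    # correlation scores beta_j = <W1 h_i, W2 h_j>
    scores = np.array([(W1 @ h[i]) @ (W2 @ h[j]) for j in others])
    # softmax over the other agents (shifted for numerical stability)
    alpha = np.exp(scores - scores.max())
    alpha /= alpha.sum()
    # weighted mean q^i = sum_j alpha_j h_j
    return sum(a * h[j] for a, j in zip(alpha, others))
```

Because the output dimension is fixed regardless of `len(h)`, the same critic parameters can be reused as the population doubles between curriculum stages.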

Observation-Action Encoder $g^i$: We now define the structure of $g^i$ (Fig. 1(b)). Note that the observation $o_j$ of agent $j$ also includes many entities, i.e., the states of all visible agents and objects in the game. Suppose $o_j$ contains $m$ entities, i.e., $o_j = (e_j^1, \dots, e_j^m)$. $m$ may also vary as the agent population scales over the training procedure, or simply during an episode when some agents die. Thus, we apply another attention module to combine these entity observations in a similar way to how $q^i$ is computed (Eqs. 4 and 5).

In more detail, we first apply an entity encoder for each entity type to obtain entity embeddings of all the entities within that type. For example, in $o_j$ we can have embeddings for agent entities (green boxes in Fig. 1(b)) and landmark/object entities (purple boxes in Fig. 1(b)). Then we apply an attention module over each entity type by attending the entity embedding of agent $j$ to all the entities of this type, producing an attended type embedding (the orange box in Fig. 1(b)). Next, we concatenate all the type embeddings together with the entity embedding of agent $j$ as well as its action embedding. Finally, this concatenated vector is forwarded to a fully connected layer to generate the output of $g^i$. Note that in the overall critic network of agent $i$, the same encoder $g^i$ is applied to every observation-action pair, so the network maintains a fixed number of parameters even when the number of agents increases significantly.
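A minimal sketch of this per-type attend-then-concatenate structure follows; the plain dot-product attention and all weight shapes (e.g., `W_out`) are assumptions for illustration, not the trained architecture:

```python
import numpy as np

def encode_observation_action(self_entity, entities_by_type, action_emb, W_out):
    """Sketch of the observation-action encoder g (Fig. 1(b)): attend the
    agent's own entity embedding to the entities of each type, concatenate
    the per-type attended embeddings with the agent's own embedding and its
    action embedding, then apply a final linear layer."""
    def attend(query, ents):
        # softmax-weighted sum of entity embeddings, keyed by the query
        scores = np.array([query @ e for e in ents])
        alpha = np.exp(scores - scores.max())
        alpha /= alpha.sum()
        return sum(a * e for a, e in zip(alpha, ents))

    # one attended embedding per entity type (agents, landmarks, ...)
    type_embs = [attend(self_entity, ents) for ents in entities_by_type]
    concat = np.concatenate(type_embs + [self_entity, action_emb])
    return W_out @ concat
```

Because each type is pooled into a fixed-size embedding before concatenation, the output dimension does not depend on how many entities each type contains.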

Policy Network: The policy network has a similar structure to the observation-action encoder $g^i$: it uses an attention module over the entities of each type in the observation to adapt to the changing population during training. The only difference is that the action is not included in the input. Notably, we do not share parameters between the Q function and the policy.

4.2 Population Curriculum

We propose to progressively scale the number of agents in MARL with a curriculum. Before combining with the evolutionary selection process, we first introduce a simpler version, the vanilla population curriculum (PC), where we perform the following stage-wise procedure: (i) the initial stage starts with MARL training over a small number of agents using MADDPG and our population-invariant architecture; (ii) we start a new stage and double the number of agents by cloning each of the existing agents; (iii) we apply MADDPG training on this scaled population until convergence; (iv) if the desired number of agents has not been reached, go back to step (ii).

Mathematically, given $N$ trained agents with parameters $\theta_1, \dots, \theta_N$ from the previous stage, we want to increase the number of agents to $2N$ with new parameters $\theta_1', \dots, \theta_{2N}'$ for the next stage. In this vanilla version of population curriculum, we simply initialize the new parameters by setting $\theta_i' = \theta_i$ and $\theta_{N+i}' = \theta_i$ for $1 \le i \le N$, and then continue MADDPG training on the $2N$ agents to get the final policies for the new stage. Although $\theta_i'$ and $\theta_{N+i}'$ are both initialized from $\theta_i$, as training proceeds, they will converge to different policies since the policies are trained in a decentralized manner in MADDPG.
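The cloning initialization can be sketched as follows; deep copies ensure the two copies of each agent hold independent parameters that can diverge under decentralized fine-tuning:

```python
import copy

def clone_population(agent_params):
    """Vanilla PC initialization: double the population by cloning each
    agent's parameters, i.e., theta'_i = theta'_{N+i} = theta_i.  The two
    copies later diverge under decentralized MADDPG fine-tuning."""
    return ([copy.deepcopy(p) for p in agent_params] +
            [copy.deepcopy(p) for p in agent_params])
```

Here `agent_params` is a list of per-agent parameter containers (e.g., state dicts); the representation is an assumption for illustration.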

4.3 Evolutionary Selection

Introducing new agents by directly cloning existing ones from the previous stage has a clear limitation: the policy parameters suitable for the previous environment are not necessarily the best initialization for the current stage, where the population is scaled up. For the purpose of better performance in the final game with our desired population, we need to promote agents with better adaptation abilities during the early stages of training.

Therefore, we propose an evolutionary selection process to improve the agents’ ability to adapt to scaling during the curriculum procedure. Instead of training a single set of agents, we maintain $K$ parallel sets of agents in each stage, and perform crossover, mutation and selection among them for the next stage. This is the last piece of our proposed Evolutionary Population Curriculum (EPC) paradigm, which is essentially population curriculum enhanced by the evolutionary selection process.

Specifically, we assume the agents in the multi-agent game have $W$ different roles. Agents in the same role have the same action set and reward structure. For example, we have $W = 2$ roles in a predator-prey game, namely predators and prey, and $W = 1$ role for a fully cooperative game with homogeneous agents. For notational conciseness, we assume there are $N_1$ agents of role 1, namely $A_1^{(1)}, \dots, A_{N_1}^{(1)}$; $N_2$ agents of role 2, namely $A_1^{(2)}, \dots, A_{N_2}^{(2)}$, and so on. In each stage, we keep $K$ parallel sets for each role of agents, denoted by $P_1^{(w)}, \dots, P_K^{(w)}$ for role $w$, and take a 3-step procedure, i.e., mix-and-match (crossover), MARL fine-tuning (mutation) and selection, as follows to evolve these parallel sets of agents for the next stage.

Mix-and-Match (Crossover): At the beginning of a curriculum stage, we scale the population of agents of each role $w$ from $N_w$ to $2N_w$. Note that we have $K$ parallel agent sets of size $N_w$ for role $w$, namely $P_1^{(w)}, \dots, P_K^{(w)}$. We first perform a mix-and-match over these parallel sets within every role $w$: for each set $P_k^{(w)}$, we pair it with all $K$ sets of the same role, which leads to $K^2$ new scaled agent sets of size $2N_w$. Given these scaled sets of agents, we then perform another mix-and-match across all the roles: we pick one scaled set for each role and combine these selected sets to produce a scaled game. For example, in the case of $W = 2$, we can pick one agent set from the first role and another agent set from the second role to form a scaled game. Thus, there are $(K^2)^W$ different combinations in total through this mix-and-match process. We sample $C$ games from these combinations for mutation in the next step. Since we are mixing parallel sets of agents, this process can be considered the crossover operator in standard evolutionary algorithms.
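A minimal sketch of this mix-and-match step, assuming each role is represented as a list of $K$ parallel sets and each set as a list of agent parameters; the within-role pairing and cross-role product follow the description above:

```python
import itertools
import random

def mix_and_match(role_sets, num_games):
    """Crossover step: for each role, pair every parallel set with every
    set of the same role (K*K doubled sets per role), then pick one doubled
    set per role to form a scaled game, and sample `num_games` games from
    the (K^2)^W combinations."""
    # doubled[w] holds the K^2 scaled sets (size 2*N_w) for role w
    doubled = [[a + b for a in sets for b in sets] for sets in role_sets]
    games = list(itertools.product(*doubled))
    return random.sample(games, min(num_games, len(games)))
```

Enumerating all combinations before sampling is only practical for small $K$ and $W$; for larger settings one would sample the pairings directly.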

MARL Fine-Tuning (Mutation): In standard evolutionary algorithms, mutations are performed directly on the parameters, which is inefficient in high-dimensional spaces and typically requires a large number of mutants to achieve sufficient diversity for evolution. Instead, we adopt MARL fine-tuning in each curriculum stage (step (iii) in vanilla PC) as our guided mutation operator, which naturally and efficiently explores effective directions in the parameter space. Meanwhile, due to training variance, MARL also introduces randomness, which benefits the overall diversity of the evolutionary process. Concretely, we apply parallel MADDPG training on each of the $C$ scaled games generated in the mix-and-match step and obtain $C$ mutated sets of agents for each role.

Selection: Among these $C$ mutated sets of agents for each role, only the best mutants survive. In the case of $W = 1$, the fitness score of a set of agents is computed as their average reward after MARL training. When $W > 1$, given a particular mutated set of agents of a specific role, we randomly generate games pairing this set with mutated sets from the other agent roles. We take its average reward over these randomly generated games as the fitness score for this mutated set. We pick the top-$K$ scored sets of agents in each role to advance to the next curriculum stage.
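Once fitness scores are computed, the selection step reduces to a top-$K$ ranking; a sketch:

```python
def select_top_k(mutated_sets, fitness_scores, k):
    """Selection step: keep the k sets of agents (for one role) with the
    highest fitness, i.e., average reward after MARL fine-tuning."""
    ranked = sorted(zip(fitness_scores, range(len(mutated_sets))),
                    reverse=True)
    return [mutated_sets[i] for _, i in ranked[:k]]
```

The surviving $K$ sets per role become the parallel populations $P_1^{(w)}, \dots, P_K^{(w)}$ of the next curriculum stage.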

Data: environment with $N_w$ agents for each of $W$ roles, desired population $\{M_w\}$, initial population $\{N_w\}$, evolution size $K$, mix-and-match size $C$
Result: a set of best policies for each role
initialize $K$ parallel agent sets $P_1^{(w)}, \dots, P_K^{(w)}$ for each role $w$;
run initial parallel MARL training on $K$ games, one for each index $k$;
while the desired population $\{M_w\}$ is not reached do
       for $c = 1, \dots, C$ do
              for each role $w$: pair two sets sampled from $\{P_k^{(w)}\}$ to form a scaled set (mix-and-match);
       run MARL training in parallel on the $C$ scaled games (guided mutation);
       for each role $w$ do
              compute the fitness score of every mutated set;
              keep the top-$K$ sets w.r.t. fitness as the new $P_1^{(w)}, \dots, P_K^{(w)}$ (selection);
return the best set of agents in each role;
Algorithm 1 Evolutionary Population Curriculum

Overall Algorithm: Finally, when the desired population is reached, we take the best set of agents in each role, based on their last fitness scores, as the output. We summarize the detailed steps of EPC in Alg. 1. Note that in the first curriculum stage, we just train $K$ parallel games without mix-and-match or mutation, so EPC simply selects the best among the initial sets in the first stage, and the evolutionary selection process only takes effect from the second stage onward. We emphasize that although we evolve multiple sets of agents in each stage, the three operators, mix-and-match, MARL fine-tuning and selection, are all perfectly parallel. Thus, the evolutionary selection process adds little to the overall training time. Lastly, EPC is an RL-algorithm-agnostic learning paradigm that can potentially be integrated with any MARL algorithm other than MADDPG.

5 Experiment

We experiment on three challenging environments: a predator-prey-style Grassland game, a mixed cooperative-and-competitive Adversarial Battle game and a fully cooperative Food Collection game. We compare EPC with multiple baseline methods on these environments at different scales of agent population and show consistently large gains over the baselines. In the following, we first introduce the environments and the baselines, and then present both qualitative and quantitative results of the different methods on all three environments.

5.1 Environments

All these environments are built on top of the particle-world environment (Mordatch and Abbeel, 2018), where agents take actions in discrete timesteps in a continuous 2D world.

Figure 2: Environment Visualizations

Grassland: In this game, we have two roles of agents, sheep and wolves, where sheep move twice as fast as wolves. We also have a fixed amount of grass pellets (food for sheep) as green landmarks (Fig. 2a). A wolf is rewarded when it collides with (eats) a sheep, and the eaten sheep obtains a negative reward and becomes inactive (dead). A sheep is rewarded when it comes across a grass pellet, and the grass is then collected and respawned at another random position. Note that in this survival game, each individual agent has its own reward and does not share rewards with others.

Adversarial Battle: This scenario consists of a fixed number of resource units as green landmarks and two teams of agents competing for the resources (Fig. 2b). Both teams have the same number of agents. When an agent collects a unit of resource, the resource is respawned and all the agents in its team receive a positive reward. Furthermore, if more than two agents from team 1 collide with one agent from team 2, the whole of team 1 is rewarded while the trapped agent from team 2 is deactivated (dead) and the whole of team 2 is penalized, and vice versa.

Food Collection: This game has the same number of food locations as fully cooperative agents. The agents need to collaboratively occupy as many food locations as possible within the game horizon (Fig. 2c). Whenever a food location is occupied by any agent, the whole team gets a positive reward at that timestep for that food. The more food occupied, the more reward the team collects.

In addition, we introduce collision penalties as well as auxiliary shaped rewards for each agent in each game for easier training. All the environments are fully observable so that each agent needs to handle a lot of entities and react w.r.t. the global state. More environment details are in Appx. A.

5.2 Methods and Metric

We evaluate the following approaches in our experiments: (1) the MADDPG algorithm (Lowe et al., 2017) with its original architecture (MADDPG); (2) the provably-converged mean-field algorithm (Yang et al., 2018) (mean-field); (3) the MADDPG algorithm with our population-invariant architecture (Att-MADDPG); (4) the vanilla population curriculum without evolutionary selection (vanilla-PC); and (5) our proposed EPC approach (EPC). For the EPC evolution size $K$, we use one setting for Grassland and Adversarial Battle and another for Food Collection; for the mix-and-match size $C$, we simply enumerate all possible mix-and-match combinations instead of random sampling. All the baseline methods are trained for the same cumulative number of episodes as EPC. More training details can be found in Appx. B.

For Grassland and Adversarial Battle ($W = 2$), we evaluate the performance of different methods by competing their trained agents against our EPC-trained agents. Specifically, in Grassland, we let sheep trained by each approach compete with the wolves from EPC and collect the average sheep reward as the evaluation metric for sheep. We take the same measurement for wolves from each method. In Adversarial Battle, since the two teams are symmetric, we simply evaluate the shared reward of one team trained by each baseline against another team trained by EPC as the metric. For Food Collection ($W = 1$), since it is fully cooperative, we take the team reward of each method as the evaluation metric. In addition, for better visualization, we plot normalized scores, obtained by normalizing the rewards of the different methods between 0 and 1 at each scale for each game. More evaluation details are in Appx. C.
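The normalized-score plots can be reproduced with a simple per-scale min-max normalization; the text does not spell out the exact normalization, so min-max is an assumption here:

```python
def normalize_scores(rewards):
    """Min-max normalize a list of per-method rewards to [0, 1] within one
    game scale; a constant list maps to all zeros."""
    lo, hi = min(rewards), max(rewards)
    if hi == lo:
        return [0.0 for _ in rewards]
    return [(r - lo) / (hi - lo) for r in rewards]
```

Normalizing within each scale makes the methods comparable across scales whose raw reward magnitudes differ widely.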

5.3 Qualitative Results

In Grassland, as the number of wolves goes up, it becomes increasingly challenging for sheep to survive; meanwhile, as the sheep become more intelligent, the wolves are incentivized to become more aggressive accordingly. In Fig. 3, we illustrate two representative matches: one using the MADDPG sheep against the EPC wolves (Fig. 3(a)), and the other between the EPC sheep and the MADDPG wolves (Fig. 3(b)). From Fig. 3(a), we observe that the MADDPG sheep are easily eaten up by the EPC wolves (a dark circle means the sheep has been eaten). On the other hand, in Fig. 3(b), we can see that the EPC sheep learn to eat the grass and avoid the wolves at the same time.

In Adversarial Battle, we visualize two matches in Fig. 4, one with agents trained by EPC (Fig. 4(a)) and the other with agents trained by MADDPG (Fig. 4(b)). We can clearly see the collaboration between the EPC agents: although the agents are initially spread over the environment, they learn to quickly gather as a group to protect themselves from being killed. The MADDPG agents, in contrast, show little incentive to cooperate or compete: they stay in their local regions throughout the episode and only collect resources or kill enemies very infrequently.

In Food Collection (Fig. 5), the EPC agents in Fig. 5(a) learn to spread out and occupy as many food locations as possible to maximize the team reward, while only one agent among the MADDPG agents in Fig. 5(b) successfully occupies a food location in the episode.

5.4 Quantitative Results

Quantitative Results in Grassland: In the Grassland game, we perform curriculum training by starting with 3 sheep and 2 wolves and gradually increasing the population of agents. We denote a game with $n$ sheep and $m$ wolves by “scale $n$-$m$”. We start with scale 3-2 and gradually double the game size through scales 6-4 and 12-8 to finally 24-16. For the two curriculum learning approaches, vanilla-PC and EPC, we train for a fixed budget of episodes in the first curriculum stage (scale 3-2) and fine-tune the agents for a further budget of episodes after mix-and-match in each of the following stages. For the other methods, which train the agents from scratch, we use the same cumulative number of training iterations as the curriculum methods for a fair comparison.

(a) MADDPG sheep vs EPC wolves
(b) MADDPG wolves vs EPC sheep
Figure 3: Example matches between EPC and MADDPG trained agents in Grassland
(a) EPC (b) MADDPG
Figure 4: Adversarial Battle: dark particles are dead agents.
(a) EPC (b) MADDPG
Figure 5: Food Collection
(a) Normalized scores of wolves and sheep
(b) Sheep statistics
Figure 6: Results in Grassland. In part (a), we show the normalized scores of wolves and sheep trained by different methods when competing with EPC sheep and EPC wolves respectively. In part (b), we measure the sheep statistics over different scales (x-axis), including the average number of total grass pellets eaten per episode (left) and the average percentage of sheep that survive until the end of episode (right). EPC trained agents (yellow) are consistently better than any baseline method.

Main Results: We report the performance of different methods at each game scale in Fig. 6(a). Overall, there is little difference between the mean-field approach and the original MADDPG algorithm, while using the population-invariant architecture (i.e., Att-MADDPG) generally boosts the performance of MADDPG. For the methods with population curriculum, vanilla-PC performs almost the same as training from scratch (Att-MADDPG) when the number of agents in the environment is small, but the performance gap becomes much more significant as the population grows further. Our proposed EPC method consistently outperforms all the baselines across all the scales. In particular, at the largest scale, EPC sheep receive 10x more rewards than the best baseline sheep trained without a curriculum.

Detailed Statistics: Besides rewards, we also compute statistics of the sheep to understand how the trained sheep behave in the game. We run competitions between sheep trained by different methods and the EPC wolves, and measure the average number of grass pellets eaten per episode (#grass eaten) and the average percentage of sheep that survive until the end of an episode (survival rate) in Fig. 6(b). We observe that as the population increases, it becomes increasingly harder for sheep to survive, while EPC-trained sheep maintain a high survival rate even at the largest scale. Moreover, as more sheep join the game, EPC-trained sheep consistently learn to eat more grass even under strong pressure from the wolves. In contrast, the amount of grass eaten under the MADDPG approach (i.e., Att-MADDPG) drastically decreases when the number of wolves becomes large.

Quantitative Results in Adversarial Battle
Figure 7: Adversarial Battle

In this game, we evaluate on environments with different agent population sizes, denoted by scale N-N. We start the curriculum from scale 4-4, then increase the population size to scale 8-8 and finally 16-16. Both vanilla-PC and EPC take 50000 training episodes in the first stage and then 20000 episodes in each of the following two curriculum stages. We report the normalized scores of different methods in Fig. 7, where agents trained by EPC outperform all the baseline methods by an increasingly significant margin as the agent population grows.

Quantitative Results in Food Collection
Figure 8: Food Collection

In this game, we begin curriculum training with N=3, namely 3 agents and 3 food locations, and progressively increase the population size to 6, 12, and finally 24. Both vanilla-PC and EPC perform 50000 training episodes in the first stage and then 20000 episodes in each of the following curriculum stages. We report the normalized scores for all the methods in Fig. 8, where EPC is always the best among all the approaches by a clear margin. Note that the performance of the original MADDPG and the mean-field approach drops drastically as the population size increases. In particular, the mean-field approach performs even worse than the original MADDPG method. We believe this is because in this game the agents must act collaboratively according to the global team state, so the local approximation assumption in the mean-field approach clearly does not hold.

Ablative Analysis
(a) Stability comparison: Grassland,
(b) Adversarial Battle,
(c) Food Collection.
(d) Normalized scores: Grassland,
(e) Adversarial Battle,
(f) Food Collection.
Figure 9: Ablation analysis on the second curriculum stage in all the games over 3 different training seeds. Stability comparison (top) in (a), (b) and (c): EPC has much lower variance compared to vanilla-PC. Normalized scores during fine-tuning (bottom) in (d), (e) and (f): EPC successfully transfers agents trained with a smaller population to a larger population via fine-tuning.
(a) Environment Generalization: Grassland.
(b) Adversarial Battle.
(c) Food Collection.
Figure 10: Environment Generalization: We take the agents trained on the largest scale and test on an environment with twice the population. We perform experiments on all the games and show that EPC also advances the agents’ generalizability.

Stability Analysis: The evolutionary selection process in EPC not only leads to better final performance but also stabilizes the training procedure. We validate the stability of EPC by computing the variance over 3 training seeds for the same experiment and comparing it with the variance of vanilla-PC, which is also obtained from 3 training seeds. Specifically, we pick the second stage of curriculum learning and visualize the variance of agent scores throughout that stage of training. These scores are computed by competing against the final policy trained by EPC. We perform this analysis in all 3 environments: Grassland at scale 6-4 (Fig. 8(a)), Adversarial Battle at scale 8-8 (Fig. 8(b)) and Food Collection with 6 agents (Fig. 8(c)). The variance of EPC is much smaller than that of vanilla-PC across the different games.

Convergence Analysis: To illustrate that the self-attention based policies trained at a smaller scale are able to adapt well to a larger scale via fine-tuning, we pick a particular mutant produced by EPC in the second curriculum stage and visualize its learning curve throughout fine-tuning for all the environments: Grassland (Fig. 8(d)), Adversarial Battle (Fig. 8(e)) and Food Collection (Fig. 8(f)). The scores are computed in the same way as in the stability analysis. Compared to MADDPG and Att-MADDPG, which train policies from scratch, EPC starts learning with a much higher score, continues to improve during fine-tuning and quickly converges to a better solution. Note that all baselines are in fact trained much longer; the full convergence curves are in App. D.1.

Generalization: We investigate whether the learned policies can generalize to a test environment with an even larger scale than the training ones. To do so, we take the best policies trained by different methods on the largest population and directly apply them to a new environment with a doubled population via self-cloning. We evaluate EPC, vanilla-PC and Att-MADDPG in all the environments and measure the normalized scores of the different methods, computed in the same way as the fitness score. In all cases, we observe a large advantage of EPC over the other two methods, indicating better generalization ability for policies trained by EPC.
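The self-cloning procedure above can be sketched in a few lines. This is a minimal illustration, not the released implementation; the policy handles are placeholders, and we assume (as the paper's attention architecture guarantees) that each policy accepts observations from the larger population without modification.

```python
# Self-cloning sketch: to test a trained population in an environment with
# twice as many agents, each trained policy is simply duplicated. The string
# handles below stand in for trained policy networks.

def self_clone(policies, factor=2):
    """Duplicate each trained policy `factor` times to populate the larger
    test environment. The population-invariant (attention-based) policies
    can consume the larger observations unchanged."""
    return [p for p in policies for _ in range(factor)]

trained = ["sheep_0", "sheep_1", "sheep_2"]  # placeholder policy handles
doubled = self_clone(trained)
# doubled has 6 policies, two copies of each trained one
```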

6 Conclusion

In this paper, we propose to scale multi-agent reinforcement learning by applying curriculum learning over the agent population with evolutionary selection. Our approach has shown significant improvements over baselines not only in performance but also in training stability. Given these encouraging results on different environments, we believe our method is general and can potentially benefit scaling other MARL algorithms. We also hope that learning with a large population of agents can lead to the emergence of swarm intelligence in environments with simple rules in the future.


This research is supported in part by ONR MURI N000141612007, ONR Young Investigator to AG. FF is also supported in part by NSF grant IIS-1850477, a research grant from Lockheed Martin, and the U.S. Army Combat Capabilities Development Command Army Research Laboratory Cooperative Agreement Number W911NF-13-2-0045 (ARL Cyber Security CRA). The views and conclusions contained in this document are those of the authors and should not be interpreted as representing the official policies, either expressed or implied, of the funding agencies. We also sincerely thank Bowen Baker and Ingmar Kanitscheider from OpenAI for valuable suggestions and comments.

Appendix A Environment Details

In the Grassland game, a sheep gets a positive reward when it eats grass and a negative reward when eaten by a wolf; a wolf gets a positive reward when it eats a sheep. We also shape the reward by distance: a sheep receives a less negative reward when it is closer to grass, and a wolf receives a less negative reward when it is closer to sheep. This game is adapted from the original Predator-prey game in the MADDPG paper (Lowe et al., 2017) by introducing grass and allowing agents to die.

In the Adversarial Battle game, an agent gets a positive reward when it eats food and a negative reward when killed by other agents. Agents that jointly kill an enemy are rewarded. We shape the reward by distance: an agent receives a less negative reward when it is closer to other agents and food. This encourages agents to stay close to each other and makes it easier for them to learn to eat. This game is adapted from the mean-field MARL paper (Yang et al., 2018) by converting it from a grid world to a particle world, introducing food, and only allowing 2-agent cooperative kills.

In the Food Collection game, there are N agents and N food locations. Each agent gets a shared reward per timestep for every food location occupied by any agent. If one agent collides with another, all the agents receive a penalty. We shape the reward by distance: an agent receives a less negative reward when it is closer to food. Since the numbers of agents and food locations are equal, the agents should avoid collisions and learn to occupy as many food locations as possible. This is the same game as the Cooperative Navigation game in the MADDPG paper; we slightly change the reward function to ensure it is bounded w.r.t. an arbitrary number of agents.
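The distance-based shaping shared by all three games can be sketched as below. The shaping coefficient and base rewards are illustrative assumptions, not the paper's exact values.

```python
import math

# Distance-based reward shaping (sketch): the event reward (eating, being
# eaten, killing) plus a shaping term that is less negative the closer the
# agent is to its target (grass for sheep, sheep for wolves, food for
# collectors). The coefficient 0.1 is a hypothetical choice.

def shaped_reward(base_reward, agent_pos, target_pos, coef=0.1):
    """Return the base event reward minus a penalty proportional to the
    agent's distance to its target."""
    dist = math.dist(agent_pos, target_pos)
    return base_reward - coef * dist

# An agent nearer to its target receives a higher shaped reward:
near = shaped_reward(0.0, (0.0, 0.0), (0.1, 0.0))
far = shaped_reward(0.0, (0.0, 0.0), (1.0, 0.0))
```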

We use the normalized reward as the score during evaluation. For a particular game at a particular scale, we first collect the reward for each type of agent, namely the average unshaped reward of each individual of that type. Then we re-scale the collected rewards so that the lowest reward among all methods maps to score 0 and the highest to score 1.
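The normalization above is a simple min-max rescaling across methods; a sketch follows, using the 24-16 Grassland wolf rewards from Appendix D.2 as example inputs.

```python
# Min-max normalization of raw evaluation rewards: the lowest reward among
# all methods maps to 0 and the highest to 1.

def normalize_scores(rewards):
    """Map a dict of {method: raw_reward} to normalized scores in [0, 1]."""
    lo, hi = min(rewards.values()), max(rewards.values())
    span = hi - lo
    if span == 0:  # degenerate case: all methods tie
        return {m: 0.0 for m in rewards}
    return {m: (r - lo) / span for m, r in rewards.items()}

# Example with the 24-16 Grassland wolf rewards reported in Appendix D.2:
raw = {"MADDPG": 14.482, "mean-field": 18.272, "Att-MADDPG": 32.8945,
       "vanilla-PC": 47.6365, "EPC": 61.4245}
scores = normalize_scores(raw)
# MADDPG maps to 0.0 and EPC maps to 1.0
```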

Appendix B Training Details

We follow all the hyper-parameters in the original MADDPG paper (Lowe et al., 2017) for both EPC and all the baseline methods considered, including the Adam optimizer with learning rate 0.01, the target-network update rate, the discount factor, the replay buffer size, the network update frequency, and the batch size. All the baseline methods are trained for a number of episodes that equals the accumulative number of episodes that EPC has taken.

We use the same number of parallel agent sets in all the games during training, except that it is reduced in the Food Collection game due to computational constraints. During EPC training in the Grassland game, we train at the scale of 3 sheep and 2 wolves for 100000 episodes, and train for another 50000 episodes every time the agent population doubles. In the Adversarial Battle and Food Collection games, we train the first scale for 50000 episodes, and train for another 20000 episodes every time the agent population doubles.
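The stage-wise schedule above can be expressed compactly. This is a sketch of the episode budget per curriculum stage, with the population doubling between stages; the function name is ours, not from the released code.

```python
# Curriculum schedule sketch: the population doubles at each stage, with a
# larger episode budget for the first stage and a smaller one thereafter.

def curriculum_schedule(initial_scale, first_episodes, later_episodes, num_stages):
    """Yield (scale, episodes) pairs for each curriculum stage."""
    scale = tuple(initial_scale)
    for stage in range(num_stages):
        episodes = first_episodes if stage == 0 else later_episodes
        yield scale, episodes
        scale = tuple(2 * n for n in scale)

# Grassland: 3 sheep / 2 wolves for 100000 episodes, then 50000 per doubling.
grassland = list(curriculum_schedule((3, 2), 100000, 50000, 4))
# → [((3, 2), 100000), ((6, 4), 50000), ((12, 8), 50000), ((24, 16), 50000)]
```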

In the grassland game, the entity types are the agent itself, other sheep, other wolves and food. We thus have four entity encoders, one for each entity type. Similarly, in the adversarial battle game the entity types are the agent itself, teammates, enemies and food, so we also have four entity encoders. Since there is only one group in the food collection game, the entity types are the agent itself, teammates and food, so we have three entity encoders in our network.
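The per-type entity encoders feed an attention pooling step so that the policy input size is independent of the population. The following is a minimal NumPy sketch under our own assumptions (random weights, illustrative dimensions, Grassland's four entity types), not the trained network.

```python
import numpy as np

# Population-invariant observation encoder (sketch): one small encoder per
# entity type, followed by attention pooling over a variable number of
# entities. Weights are randomly initialized here for illustration only.

rng = np.random.default_rng(0)
OBS_DIM, HID = 4, 8  # hypothetical per-entity feature and hidden sizes

def make_encoder():
    W = rng.normal(size=(OBS_DIM, HID))
    return lambda x: np.tanh(x @ W)

# Grassland uses four entity types: the agent itself, sheep, wolves, food.
encoders = {t: make_encoder() for t in ("self", "sheep", "wolf", "food")}

def encode_observation(entities):
    """entities: list of (entity_type, feature_vector). Attention-pool the
    per-entity embeddings so the output size is fixed regardless of the
    number of entities observed."""
    emb = np.stack([encoders[t](np.asarray(x, dtype=float)) for t, x in entities])
    query = emb[0]                      # attend from the agent's own embedding
    logits = emb @ query
    weights = np.exp(logits - logits.max())
    weights /= weights.sum()            # softmax attention weights
    return weights @ emb                # fixed-size pooled vector

small = encode_observation([("self", [0.1] * 4), ("sheep", [0.2] * 4)])
large = encode_observation([("self", [0.1] * 4)] + [("wolf", [0.2] * 4)] * 10)
# both outputs have shape (HID,) despite different entity counts
```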

Appendix C Evaluation Details

To evaluate the agents trained at each scale, we make the two roles of agents, trained with different approaches, compete against each other. Each competition is simulated for 10000 episodes, and the average normalized reward over these episodes is used as the competition score for each side. Note that in our experiments, we let all the methods compete against our EPC approach for evaluation. For the adversarial battle game, we take the average score of the two teams as the model's final evaluation score, since the two teams in this game are completely symmetric.

In the food collection game, since there is only one role, we simply simulate the model for 10000 episodes and use the average normalized reward over these episodes as the model's score.
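The evaluation protocol amounts to averaging episode scores, with the two team scores averaged in the symmetric Adversarial Battle game. A runnable sketch with a stubbed rollout (the stub replaces the actual environment simulation, which is not shown here):

```python
import random

def evaluate(run_episode, n_episodes=10000, symmetric=False):
    """Average the per-episode score over many simulated episodes; for a
    symmetric game, average the two teams' scores per episode."""
    total = 0.0
    for _ in range(n_episodes):
        s1, s2 = run_episode()
        total += 0.5 * (s1 + s2) if symmetric else s1
    return total / n_episodes

# Stub rollout for illustration: both teams score around 1.0 per episode.
rng = random.Random(0)
stub = lambda: (rng.gauss(1.0, 0.1), rng.gauss(1.0, 0.1))
score = evaluate(stub, n_episodes=1000, symmetric=True)
```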

Appendix D Additional Details on Experiment Results

d.1 Full training curves for baselines

All the baseline methods are trained for a number of episodes that equals the accumulative number of episodes that EPC has taken.

The purpose of Figures 8(d), 8(e) and 8(f) is simply to show the transfer performance, i.e., that the initialization produced by EPC in the previous stage is effective and can indeed leverage past experience to warm-start. The x-axis of those plots was shrunk for visualization purposes. Here we illustrate the complete convergence curves of the baselines, i.e., Att-MADDPG and MADDPG, in Figures 10(a), 10(b) and 10(c) for the 3 games respectively. Although Att-MADDPG is trained for much longer, its performance is still far worse than EPC.

(a) Normalized scores: Grassland,
(b) Adversarial Battle,
(c) Food Collection.
Figure 11: Full learning curves on the second curriculum stage in all the games. EPC fine-tunes the policies obtained from the previous stage while MADDPG and Att-MADDPG are trained from scratch for a much longer time.

d.2 Raw reward numbers of evaluation results

In this section, we provide the actual rewards without normalization when comparing all the baselines with EPC. These scores correspond to the histograms reported in Figures 5(a), 7 and 8.

Grassland game, wolf rewards, corresponding to wolf in Figure 5(a):

scale MADDPG mean-field Att-MADDPG vanilla-PC EPC
3-2 0.596 0.877 0.8145 0.8145 1.407
6-4 3.7735 0.9515 2.5905 2.001 3.7735
12-8 3.1915 3.385 10.2125 9.974 14.377
24-16 14.482 18.272 32.8945 47.6365 61.4245

Grassland game, sheep rewards, corresponding to sheep in Figure 5(a):

scale MADDPG mean-field Att-MADDPG vanilla-PC EPC
3-2 -4.0026 -3.9947 2.66 2.66 8.3846
6-4 -20.2494 -20.5107 0.9892 1.1804 10.1455
12-8 -52.863 -53.6338 -42.4801 -11.3736 3.3774
24-16 -119.1327 -118.5668 -111.0656 -70.1981 -44.1031

Adversarial Battle game, rewards of team 1, corresponding to Figure 7:

scale MADDPG mean-field Att-MADDPG vanilla-PC EPC
4-4 7.51555 5.81165 22.6357 22.6357 26.86355
8-8 0.6692 -0.58115 43.7801 46.89595 65.75585
16-16 -46.6398 -35.5978 28.8336 109.4406 189.69775

Food Collection game, team rewards, corresponding to Figure 8:

scale MADDPG mean-field Att-MADDPG vanilla-PC EPC
3 55.06 42.74 61.6488 61.6488 64.822
6 17.01 3.37 49.3626 58.0014 63.7004
12 6.32 6.735 49.45755 52.3625 59.54
24 10.346075 7.830975 33.435 49.998025 59.47035

Food Collection game, coverage rate:

scale MADDPG mean-field Att-MADDPG vanilla-PC EPC
3 0.622 0.610 0.662 0.662 0.723
6 0.198 0.034 0.488 0.588 0.662
12 0.060 0.065 0.424 0.550 0.552
24 0.095 0.077 0.269 0.556 0.568

d.3 Pairwise competition results between all methods in competitive games

For visualization purposes, we only illustrate the scores of the competitions between the baselines and EPC in the main paper. Here we provide the complete competition rewards between every pair of methods in both Grassland and Adversarial Battle with the largest population of agents.

Here we show the wolf rewards in Grassland at scale 24-16. For the wolves trained by each approach, we compare them against the sheep trained by all the methods. EPC wolves always have the highest rewards, as in the bottom row. Correspondingly, when different wolves compete against EPC sheep, they always obtain the lowest rewards, as in the rightmost column.

wolf \ sheep MADDPG mean-field Att-MADDPG Vanilla-PC EPC
MADDPG 66.914 67.0945 66.34 23.048 14.482
mean-field 75.7655 74.23 74.7375 28.3705 18.272
Att-MADDPG 103.22 103.326 98.07 49.557 32.8945
Vanilla-PC 110.333 111.3735 101.8975 64.53 47.6365
EPC 120.9025 121.4325 115.956 82.381 61.4245

Here we show the sheep rewards in Grassland at scale 24-16. For the sheep trained by each approach, we compare them against all the different wolves. EPC sheep always have the highest rewards, as in the last row. Correspondingly, when different sheep compete against EPC wolves, their rewards are always the lowest, as in the rightmost column.

sheep \ wolf MADDPG mean-field Att-MADDPG Vanilla-PC EPC
MADDPG -63.5096 -72.5443 -100.4636 -107.825 -119.1327
mean-field -63.7089 -71.0714 -100.6304 -108.8917 -118.5668
Att-MADDPG -62.9416 -71.3339 -95.0522 -99.1207 -111.0656
Vanilla-PC -5.7086 -7.8011 -31.9936 -49.5186 -70.1981
EPC 9.2135 10.5892 -6.9846 -27.3475 -44.1031

Here we show the rewards of team 1 in Adversarial Battle at scale 16-16. For the agents trained by each approach, we play them as team 1 against all the different methods as team 2. When EPC agents play as team 1, they always get the highest rewards no matter who the opponent is, as in the last row. When other methods compete against EPC, the obtained rewards are always the lowest, as in the rightmost column.

reported \ compared MADDPG mean-field Att-MADDPG Vanilla-PC EPC
MADDPG 61.4555 17.1591 2.8033 -29.9242 -46.6398
mean-field 104.41315 59.01315 27.9004 11.1891 -35.5978
Att-MADDPG 146.9829 117.3804 81.702 18.57425 28.8336
Vanilla-PC 202.9163 155.2805 174.91965 123.7212 109.4406
EPC 339.63255 318.3223 256.8464 198.3621 189.69775

d.4 Variance of Performance Evaluations

We present the performance of all approaches in all three games at the largest scale. We train all the approaches with 3 different seeds and show the normalized scores with variance below. EPC not only gives better results but also much smaller variance.

(a) Normalized scores with variances in Grassland in scale 24-16
(b) Normalized scores with variances in Adversarial battle in scale 16-16
(c) Normalized scores with variances in Food Collection with 24 agents

Appendix E Additional Experiments on the Original Predator-Prey Game

Grassland is adapted from the Predator-prey game introduced by the original MADDPG paper (Lowe et al., 2017). To further validate our empirical results, we additionally study the performances of different algorithms on the unmodified Predator-prey game as follows.

We first report the normalized scores in Fig. 12, comparing all the methods against EPC. EPC is consistently better than all other methods at all scales.

Figure 12: Normalized scores in the original Predator-prey game

We also report the raw reward numbers when competing against EPC. Since the Predator-prey game is a zero-sum game, we simply report the predator rewards (the prey rewards are exactly the negative values).

scale MADDPG mean-field Att-MADDPG vanilla-PC EPC
3-2 10.405 7.39 8.675 8.675 12.662
6-4 31.458 25.529 31.747 45.155 54.546
12-8 74.939 64.377 133.261 200.638 214.328

Furthermore, we also show the results of the full pairwise competition between every pair of methods at scale 12-8 below. Consistently, EPC predators always have the highest scores, as in the last row, and when competing against EPC prey, the lowest rewards are observed.

predator \ prey MADDPG mean-field Att-MADDPG Vanilla-PC EPC
MADDPG 229.887 249.4 247.423 122.878 74.939
mean-field 207.022 210.838 228.469 107.811 64.377
Att-MADDPG 569.862 532.611 373.979 204.743 133.261
Vanilla-PC 758.293 737.067 521.486 303.86 200.638
EPC 827.298 764.417 519.319 299.505 214.328


  2. Generally, we can scale up the population with any constant factor by introducing any amount of cloned agents. We use the factor of 2 as a concrete example here for easier understanding.


  1. An overview of evolutionary algorithms for parameter optimization. Evolutionary computation 1 (1), pp. 1–23. Cited by: §2.
  2. Emergent tool use from multi-agent autocurricula. arXiv preprint arXiv:1909.07528. Cited by: §2.
  3. Emergent complexity via multi-agent competition. In ICLR, Cited by: §1.
  4. Curriculum learning. In ICML, pp. 41–48. Cited by: §2.
  5. Improving exploration in evolution strategies for deep reinforcement learning via a population of novelty-seeking agents. In NeurIPS, pp. 5027–5038. Cited by: §2.
  6. Mix & match agent curricula for reinforcement learning. In ICML, pp. 1095–1103. Cited by: §2.
  7. One-shot imitation learning. In NIPS, pp. 1087–1098. Cited by: §2.
  8. Learning and development in neural networks: the importance of starting small. Cognition 48 (1), pp. 71–99. Cited by: §2.
  9. Reverse curriculum generation for reinforcement learning. In CoRL, pp. 482–495. Cited by: §2.
  10. Learning to communicate with deep multi-agent reinforcement learning. In NIPS, pp. 2137–2145. Cited by: §1, §2.
  11. Counterfactual multi-agent policy gradients. In AAAI, Cited by: §1, §2.
  12. Genetic policy optimization. In ICLR, Cited by: §2.
  13. Deep reinforcement learning in parameterized action space. arXiv preprint arXiv:1511.04143. Cited by: §1.
  14. Opponent modeling in deep reinforcement learning. In International Conference on Machine Learning, pp. 1804–1813. Cited by: §2.
  15. Evolved policy gradients. In NeurIPS, pp. 5400–5409. Cited by: §2.
  16. Actor-attention-critic for multi-agent reinforcement learning. In ICML, pp. 2961–2970. Cited by: §2.
  17. Human-level performance in first-person multiplayer games with population-based deep reinforcement learning. arXiv preprint arXiv:1807.01281. Cited by: §2.
  18. Population based training of neural networks. arXiv preprint arXiv:1711.09846. Cited by: §2.
  19. Learning attentional communication for multi-agent cooperation. In NeurIPS, Cited by: §2, §2.
  20. Progressive growing of gans for improved quality, stability, and variation. arXiv preprint arXiv:1710.10196. Cited by: §2.
  21. Evolution-guided policy gradient in reinforcement learning. In NeurIPS, pp. 1188–1200. Cited by: §2.
  22. End-to-end training of deep visuomotor policies. The Journal of Machine Learning Research 17 (1), pp. 1334–1373. Cited by: §1.
  23. Efficient large-scale fleet management via multi-agent deep reinforcement learning. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pp. 1774–1783. Cited by: §2.
  24. Markov games as a framework for multi-agent reinforcement learning. In ICML, Vol. 157, pp. 157–163. Cited by: §2, §3.
  25. Emergent coordination through competition. In ICLR, Cited by: §1.
  26. Multi-agent actor-critic for mixed cooperative-competitive environments. NIPS. Cited by: Appendix A, Appendix B, Appendix E, §1, §1, §2, §3, §4.1, §5.2.
  27. Deep multi-agent reinforcement learning with relevance graphs. arXiv preprint arXiv:1811.12557. Cited by: §2.
  28. Digital humanitarians: how big data is changing the face of humanitarian response. Routledge. Cited by: §1.
  29. Playing atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602. Cited by: §1.
  30. Emergence of grounded compositional language in multi-agent populations. In AAAI, Cited by: §1, §5.1.
  31. CASSL: curriculum accelerated self-supervised learning. In ICRA, pp. 6453–6460. Cited by: §2.
  32. OpenAI Five. Note: https://blog.openai.com/openai-five/ Cited by: §1.
  33. Cooperative multi-agent learning: the state of the art. AAMAS 11 (3), pp. 387–434. Cited by: §2.
  34. Multiagent bidirectionally-coordinated nets for learning to play starcraft combat games. arXiv preprint arXiv:1703.10069 2. Cited by: §2.
  35. Robust adversarial reinforcement learning. In ICML, pp. 2817–2826. Cited by: §2.
  36. Evolution strategies as a scalable alternative to reinforcement learning. arXiv preprint arXiv:1703.03864. Cited by: §2.
  37. Multi-agent reinforcement learning: a critical survey. Web manuscript. Cited by: §2.
  38. Deterministic policy gradient algorithms. In ICML, pp. 387–395. Cited by: §3.
  39. Neural mmo: a massively multiagent game environment for training and evaluating intelligent agents. arXiv preprint arXiv:1903.00784. Cited by: §2.
  40. Learning multiagent communication with backpropagation. In NIPS, pp. 2244–2252. Cited by: §1.
  41. Intrinsic motivation and automatic curricula via asymmetric self-play. arXiv preprint arXiv:1703.05407. Cited by: §2.
  42. Attention is all you need. In NIPS, pp. 5998–6008. Cited by: §1, §2, §4.
  43. Paired open-ended trailblazer (poet): endlessly generating increasingly complex and diverse learning environments and their solutions. arXiv preprint arXiv:1901.01753. Cited by: §2.
  44. Non-local neural networks. In CVPR, Cited by: §2.
  45. Approximate strategic reasoning through hierarchical reduction of large symmetric games. In AAAI, Cited by: §1.
  46. Iterated deep reinforcement learning in games: history-aware training for improved stability. In Proceedings of the 2019 ACM Conference on Economics and Computation, pp. 617–636. Cited by: §2.
  47. Emergent behaviors in mixed-autonomy traffic. In CoRL, pp. 398–407. Cited by: §1.
  48. Building generalizable agents with a realistic and rich 3d environment. arXiv preprint arXiv:1801.02209. Cited by: §1.
  49. Training agent for first-person shooter game with actor-critic curriculum learning. In ICLR, Cited by: §2.
  50. Visual semantic navigation using scene priors. In ICLR, Cited by: §1, §2.
  51. Mean field multi-agent reinforcement learning. In ICML, pp. 5567–5576. Cited by: Appendix A, §1, §2, §5.2.
  52. Relational deep reinforcement learning. arXiv preprint arXiv:1806.01830. Cited by: §2.
  53. Target-driven visual navigation in indoor scenes using deep reinforcement learning. In ICRA, pp. 3357–3364. Cited by: §1.