Promoting Coordination through Policy Regularization in Multi-Agent Deep Reinforcement Learning


Paul Barde*, 1, 3    Julien Roy*, 1, 2    Félix G. Harvey 1, 2    Derek Nowrouzezahrai 1, 3    Christopher Pal 1, 2
1 Quebec AI Institute (Mila),    2 Polytechnique Montréal,    3 McGill University
* Equal contribution

A central challenge in multi-agent reinforcement learning is the induction of coordination between agents of a team. In this work, we investigate how to promote inter-agent coordination using policy regularization and discuss two possible avenues, based respectively on inter-agent modelling and synchronized sub-policy selection. We test each approach in four challenging continuous control tasks with sparse rewards and compare them against three baselines including MADDPG, a state-of-the-art multi-agent reinforcement learning algorithm. To ensure a fair comparison, we rely on a thorough hyper-parameter selection and training methodology that allows a fixed hyper-parameter search budget for each algorithm and environment. We consequently assess the hyper-parameter sensitivity, sample-efficiency and asymptotic performance of each learning method. Our experiments show that the proposed methods lead to significant improvements on cooperative problems. We further analyse the effects of the proposed regularizations on the behaviors learned by the agents.

1 Introduction

Multi-Agent Reinforcement Learning (MARL) refers to the task of training an agent to maximize its expected return by interacting with an environment that contains other learning agents. It represents a challenging branch of Reinforcement Learning (RL) with interesting developments in recent years (hernandez2018multiagent). A popular framework for MARL is the use of a Centralized Training and a Decentralized Execution (CTDE) procedure (lowe2017multi; foerster2018counterfactual; iqbal2018actor; foerster2018bayesian; rashid2018qmix). It is typically implemented by training critics that approximate the value of the joint observations and actions, which are used to train actors restricted to the observation of a single agent. Such critics, if exposed to coordinated joint actions leading to high returns, can steer the agents’ policies toward these highly rewarding behaviors. However, these approaches depend on the agents stumbling on these actions by chance in order to grasp their benefit. Thus, they might fail in scenarios where coordination is unlikely to occur by chance. We hypothesize that in such scenarios, coordination-promoting inductive biases on the policy search could help discover coordinated behaviors more efficiently and supersede task-specific reward shaping and curriculum learning.

In this work, we explore two different priors for successful coordination and use these to regularize the learned policies. The first avenue, TeamReg, assumes that an agent must be able to predict the behavior of its teammates in order to coordinate with them. The second, CoachReg, supposes that coordinating agents individually recognize different situations and synchronously use different sub-policies to react to them. In the following sections we show how to derive practical regularization terms from these premises and meticulously evaluate them (footnote 1).

Footnote 1: Source code for the algorithms and environments will be made public upon publication of this work.

Our contributions are twofold. First, we propose two novel approaches that aim at promoting coordination in multi-agent systems. Our methods augment CTDE MARL algorithms with additional multi-agent objectives that act as regularizers and are optimized jointly with the main return-maximization objective. Second, we design two new sparse-reward cooperative tasks in the multi-agent particle environment (mordatch2018emergence). We use them along with two standard multi-agent tasks to present a detailed evaluation of our approaches against three different baselines. We also validate our methods’ key components by performing an ablation study. Our experiments suggest that our TeamReg objective provides a dense learning signal that helps to guide the policy towards coordination in the absence of external reward, eventually leading it to the discovery of high-performing team strategies in a number of cooperative tasks. Similarly, by enforcing synchronous sub-policy selections, CoachReg makes it possible to fine-tune a sub-behavior for each recognized situation, yielding significant improvements in overall performance.

2 Background

2.1 Markov Games

In this work we consider the framework of Markov Games (littman1994markov), a multi-agent extension of Markov Decision Processes (MDPs) with $N$ independent agents. A Markov Game is defined by the tuple $\langle \mathcal{S}, \mathcal{T}, \mathcal{P}, \{\mathcal{O}^i, \mathcal{A}^i, \mathcal{R}^i\}_{i=1}^{N} \rangle$. $\mathcal{S}$, $\mathcal{T}$ and $\mathcal{P}$ are respectively the set of all possible states, the transition function and the initial state distribution. While these are global properties of the environment, $\mathcal{O}^i$, $\mathcal{A}^i$ and $\mathcal{R}^i$ are individually defined for each agent $i$. They are respectively the observation functions, the sets of all possible actions and the reward functions. At each time-step $t$, the global state of the environment is given by $s_t \in \mathcal{S}$ and every agent’s individual action vector is denoted by $a^i_t \in \mathcal{A}^i$. To select their action, each agent $i$ has only access to its own observation vector $o^i_t$, which is extracted by its observation function $\mathcal{O}^i$ from the global state $s_t$. The initial global state $s_0$ is sampled from the initial state distribution $\mathcal{P}$ and the next states $s_{t+1}$ of the environment are sampled from the probability distribution over the possible next states given by the transition function $\mathcal{T}$, conditioned on the current state $s_t$ and the joint action $\mathbf{a}_t$. Finally, at each time-step, each agent receives an individual scalar reward $r^i_t$ from its reward function $\mathcal{R}^i$. Agents aim at maximizing their expected discounted return $\mathbb{E}\left[\sum_{t=0}^{T} \gamma^t r^i_t\right]$ over the time horizon $T$, where $\gamma \in [0, 1]$ is a discount factor.
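As a small worked illustration of the objective each agent maximizes, the sketch below computes a discounted return from one agent's reward sequence. The function name `discounted_return` and the example rewards are hypothetical and chosen only for illustration; they are not taken from the paper's code.

```python
def discounted_return(rewards, gamma):
    """Sum of gamma**t * r_t over one agent's reward sequence."""
    total = 0.0
    # Iterate backwards so that each step is a single multiply-add.
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total


# A sparse reward received only at the last of three time-steps.
sparse_return = discounted_return([0.0, 0.0, 1.0], gamma=0.5)
```

With sparse rewards, most terms of the sum are zero, which is why the return carries little learning signal until a rewarding joint behavior has been experienced.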

2.2 Multi-Agent Deep Deterministic Policy Gradient (MADDPG)

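MADDPG (lowe2017multi) is an adaptation of the Deep Deterministic Policy Gradient algorithm (DDPG) (lillicrap2015continuous) to the multi-agent setting. It allows the training of cooperating and competing decentralized policies through the use of a centralized training procedure. In this framework, each agent $i$ possesses its own deterministic policy $\mu^i$ for action selection and critic $Q^i$ for state-action value estimation, which are respectively parametrized by $\theta^i$ and $\phi^i$. All parametric models are trained off-policy from previous transitions $(\mathbf{o}_t, \mathbf{a}_t, \mathbf{r}_t, \mathbf{o}_{t+1})$ uniformly sampled from a replay buffer $\mathcal{D}$. Note that $\mathbf{o}_t$ is the joint observation vector and $\mathbf{a}_t$ is the joint action vector, obtained by concatenating the individual observation vectors $o^i_t$ and action vectors $a^i_t$ of all agents. Each centralized critic is trained to estimate the expected return for a particular agent $i$ using the Deep Q-Network (DQN) (mnih2015human) loss:

$$\mathcal{L}^i(\phi^i) = \mathbb{E}_{(\mathbf{o}_t, \mathbf{a}_t, r^i_t, \mathbf{o}_{t+1}) \sim \mathcal{D}} \left[ \tfrac{1}{2} \left( Q^i(\mathbf{o}_t, \mathbf{a}_t; \phi^i) - y^i_t \right)^2 \right], \qquad y^i_t = r^i_t + \gamma \, Q^i(\mathbf{o}_{t+1}, \mathbf{a}_{t+1}; \bar{\phi}^i) \Big|_{a^j_{t+1} = \mu^j(o^j_{t+1}; \bar{\theta}^j) \ \forall j}$$

For a given set of weights $w$, we define its target counterpart $\bar{w}$, updated from $\bar{w} \leftarrow \tau w + (1 - \tau) \bar{w}$ where $\tau$ is a hyper-parameter. Each policy is updated to maximize the expected discounted return of the corresponding agent $i$:

$$J^i_{PG}(\theta^i) = \mathbb{E}_{\mathbf{o}_t \sim \mathcal{D}} \left[ Q^i(\mathbf{o}_t, \mathbf{a}_t; \phi^i) \Big|_{a^i_t = \mu^i(o^i_t; \theta^i)} \right]$$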

By taking into account all agents’ observation-action pairs when guiding an agent’s policy, the value-functions are trained in a centralized, stationary environment, despite taking place in a multi-agent setting. In addition, this mechanism can allow agents to implicitly learn coordinated strategies that can then be deployed in a decentralized way. However, this procedure does not encourage the discovery of coordinated strategies since high-reward behaviors have to be randomly experienced through unguided exploration. This work aims at alleviating this limitation.
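The two mechanical ingredients of the centralized critic update described in this section, the bootstrapped regression target and the slowly moving target weights, can be sketched with NumPy as follows. This is a minimal illustration under stated assumptions: the function names, the dictionary-of-arrays weight layout and the omission of terminal-state handling are choices made here, not details given in the text.

```python
import numpy as np


def soft_update(target_weights, online_weights, tau):
    """Move every target weight a fraction tau toward its online counterpart."""
    return {
        name: tau * online_weights[name] + (1.0 - tau) * target_weights[name]
        for name in target_weights
    }


def td_target(reward, next_q_value, gamma):
    """One-step bootstrapped target; next_q_value comes from the target critic."""
    return reward + gamma * next_q_value


def critic_loss(q_prediction, target):
    """Half mean squared error between the critic's estimates and the targets."""
    return 0.5 * np.mean((q_prediction - target) ** 2)


online = {"w": np.array([1.0, 1.0])}
target = {"w": np.array([0.0, 0.0])}
target = soft_update(target, online, tau=0.1)
loss = critic_loss(np.array([1.0]), td_target(np.array([1.0]), np.array([2.0]), gamma=0.5))
```

In a full implementation the next joint action fed to the target critic would be produced by the agents' target policies, and the loss would be minimized with an automatic-differentiation framework rather than evaluated by hand.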

3 Related Work

Many works in MARL consider explicit communication channels between the agents and distinguish between communicative actions (e.g. broadcasting a given message) and physical actions (e.g. moving in a given direction) (foerster2016learning; mordatch2018emergence; lazaridou2016multi). Consequently, they often focus on the emergence of language, considering tasks where the agents must discover a common communication protocol in order to succeed. Deriving a successful communication protocol can already be seen as coordination in the communicative action space and can enable, to some extent, successful coordination in the physical action space (ahilan2019feudal). Yet, explicit communication is not a necessary condition for coordination as agents can rely on physical communication (mordatch2018emergence; gupta2017cooperative).

Approaches to shape RL agents’ behaviors with respect to other agents have also been explored. strouse2018learning use the mutual information between the agent’s policy and a goal-independent policy to shape the agent’s behavior towards hiding or spelling out its current goal. However, this approach is only applicable to tasks with an explicit goal representation and is not specifically intended for coordination. jaques2018intrinsic approximate the direct causal effect between agents’ actions and use it as an intrinsic reward to encourage social empowerment. This approximation relies on an agent modelling the other agents’ policies to predict its effect on them. In our work, agent modelling focuses on maximising predictability between agents.

Finally, barton2018measuring propose convergent cross mapping (CCM) to measure the degree of effective coordination between two agents. Although this may represent an interesting avenue for behavior analysis, it fails to provide a tool for effectively enforcing coordination, as CCM must be computed over long time series, which makes it an impractical learning signal for single-step temporal difference methods. In this work, we design two coordination-driven multi-agent approaches that do not rely on the existence of explicit communication channels and allow the learned coordinated behaviors to carry over to test time, when all agents act in a decentralized fashion.

4 Coordination and Policy

Figure 1: Illustration of TeamReg with two agents. Each agent’s policy is equipped with additional heads that are trained to predict other agents’ actions and every agent is regularized to produce actions that its teammates correctly predict. Note that the method is depicted for agent 1 only to avoid cluttering.

Intuitively, coordination can be defined as an agent’s behavior being informed by that of another agent, i.e. structure in the agents’ interactions. Namely, a team where agents act independently of one another would not be coordinated. To promote such structure, our proposed methods rely on team-objectives as regularizers of the common policy gradient update. In this regard, our approach is closely related to General Value Functions and Auxiliary tasks (sutton2018reinforcement) used in Deep RL to learn efficient representations (jaderberg2018human; hern2019agent). However, this work’s novelty lies in the explicit bias of agents’ policy towards either predictability for their teammates or synchronous sub-policy selection. Pseudocodes of our implementations are provided in Appendix C (see Algorithms 1 and 2).

4.1 Team regularization

The structure of coordinated interactions can be leveraged to attain a certain degree of predictability of one agent’s behavior with respect to its teammate(s). We hypothesize that the converse also holds, i.e. that promoting agents’ predictability could foster such team structure and lead to more coordinated behaviors. This assumption is cast into the decentralized framework by training agents to predict their teammates’ actions given only their own observation. For continuous control, the loss is defined as the Mean Squared Error (MSE) between the predicted and true actions of the teammates, yielding a teammate-modelling secondary objective similar to the models of other agents used in jaques2018intrinsic and hern2019agent and often referred to as agent modelling (schadd2007opponent). Most previous works such as hern2019agent focus on stationary, non-learning teammates and exclusively use this approach to improve the richness of the learned representations. In our case however, the same objective is also used to drive the teammates’ behaviors closer to the prediction by leveraging a differentiable action selection mechanism. Therefore, we call this the team-spirit objective $J^{i,j}_{TS}$ between agents $i$ and $j$:

$$J^{i,j}_{TS}(\theta^i, \theta^j) = -\mathbb{E}_{\mathbf{o}_t \sim \mathcal{D}} \left[ \mathrm{MSE}\left( \mu^j(o^j_t; \theta^j), \ \hat{\mu}^{i,j}(o^i_t; \theta^i) \right) \right]$$

where $\mu^j$ is the policy of agent $j$, $\mathcal{D}$ is the replay buffer and $\hat{\mu}^{i,j}$ is the policy head of agent $i$ trying to predict the action of agent $j$. The total gradient for a given agent $i$ becomes:

$$\nabla_{\theta^i} J^i_{total}(\theta^i) = \nabla_{\theta^i} J^i_{PG}(\theta^i) + \lambda_1 \sum_{j \neq i} \nabla_{\theta^i} J^{i,j}_{TS}(\theta^i, \theta^j) + \lambda_2 \sum_{j \neq i} \nabla_{\theta^i} J^{j,i}_{TS}(\theta^j, \theta^i)$$

where $J^i_{PG}$ is the policy gradient objective of Section 2.2, and $\lambda_1$ and $\lambda_2$ are hyper-parameters that respectively weight how well an agent should predict its teammates’ actions, and how predictable an agent should be for its teammates. We call TeamReg this dual regularization from team-spirit objectives. Figure 1 summarizes these interactions.
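The way the two team-spirit terms enter an agent's objective can be sketched numerically as follows. This is only an illustration under assumptions: the function names and the weights `lam1` and `lam2` (standing for the two weighting hyper-parameters) are hypothetical, and NumPy evaluates the values only, whereas an actual implementation would differentiate the combined objective with respect to the agent's parameters.

```python
import numpy as np


def team_spirit(predicted_action, true_action):
    """Negative mean squared error between a prediction and a teammate's action."""
    return -np.mean((predicted_action - true_action) ** 2)


def total_objective(j_pg, ts_as_predictor, ts_as_predicted, lam1, lam2):
    """Return-maximization term plus the two weighted team-spirit sums for one agent.

    ts_as_predictor: terms where this agent predicts each teammate (weighted by lam1).
    ts_as_predicted: terms where each teammate predicts this agent (weighted by lam2).
    """
    return j_pg + lam1 * sum(ts_as_predictor) + lam2 * sum(ts_as_predicted)


# Agent i predicts a 2-D action for teammate j; the teammate actually did something else.
prediction = np.array([[0.0, 0.0]])
teammate_action = np.array([[1.0, 1.0]])
ts_ij = team_spirit(prediction, teammate_action)
objective = total_objective(2.0, [ts_ij], [-0.5], lam1=0.1, lam2=0.2)
```

The second sum is what distinguishes this scheme from plain agent modelling: because action selection is differentiable, its gradient flows into the agent's own policy and nudges its actions toward what teammates predict.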

Figure 2: Illustration of CoachReg with two agents. A central model, the coach, takes all agents’ observations as input and outputs the current mode (policy mask). Agents are regularized to predict the same mask from their local observations only and optimize the corresponding sub-policy.

4.2 Coach regularization

In order to foster structured agent interactions, this method aims at teaching the agents to recognize different situations and synchronously select corresponding sub-behaviors.

4.2.1 Sub-policy selection

Firstly, to enable explicit sub-behavior selection, we propose policy masks that modulate the agents’ policy. A policy mask $u_j$ is a one-hot vector of size $K$ with its $j^{th}$ component set to one. In practice, we use policy masks to perform dropout (srivastava2014dropout) in a structured manner on $\tilde{h}_1 \in \mathbb{R}^M$, the pre-activations of the first hidden layer of the policy network $\mu$. To do so, we construct the vector $\mathbf{u}_j$, which is the concatenation of $C$ copies of $u_j$, in order to reach the dimensionality $M = C \cdot K$. The element-wise product $\mathbf{u}_j \odot \tilde{h}_1$ is then performed and only the units of $\tilde{h}_1$ at indices $m$ such that $m \bmod K = j$ are kept for $m = 0, \dots, M - 1$. In our contribution, each agent $i$ generates $e^i_t$, its own policy mask, from its observation $o^i_t$. Here, a simple linear layer $l^i$ is used to produce a categorical probability distribution $p^i(e^i_t \mid o^i_t)$ from which the one-hot vector is sampled:

$$p^i(e^i_t = u_j \mid o^i_t) = \frac{\exp\left( l^i(o^i_t; \theta^i)_j \right)}{\sum_{k=0}^{K-1} \exp\left( l^i(o^i_t; \theta^i)_k \right)}$$

To our knowledge, while this method bears similarity to the options and hierarchical frameworks (sutton2018reinforcement; ahilan2019feudal) and to policy dropout for exploration (xie2018nadpex), it is the first to introduce an agent-induced modulation of the policy network by a structured dropout that is decentralized at evaluation and without an explicit communication channel. Although the policy masking mechanism enables the agent to swiftly switch between sub-policies, it does not encourage the agents to synchronously modulate their behavior.
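The structured dropout performed by a policy mask can be sketched in a few lines of NumPy. The helper names `apply_policy_mask` and `mask_probabilities` are hypothetical, and the 8-unit layer with 4 masks is an arbitrary example size chosen for readability.

```python
import numpy as np


def apply_policy_mask(pre_activations, mask_index, num_masks):
    """Keep only the hidden units whose index modulo num_masks equals mask_index."""
    hidden_size = pre_activations.shape[-1]
    if hidden_size % num_masks != 0:
        raise ValueError("hidden size must be a multiple of the number of masks")
    one_hot = np.zeros(num_masks)
    one_hot[mask_index] = 1.0
    # Concatenate copies of the one-hot mask until it matches the layer width.
    tiled_mask = np.tile(one_hot, hidden_size // num_masks)
    return pre_activations * tiled_mask


def mask_probabilities(logits):
    """Softmax over the outputs of a linear layer applied to the observation."""
    shifted = logits - np.max(logits)  # subtract the max for numerical stability
    exponentials = np.exp(shifted)
    return exponentials / exponentials.sum()


hidden = np.arange(1.0, 9.0)  # 8 pre-activations: 1, 2, ..., 8
masked = apply_policy_mask(hidden, mask_index=1, num_masks=4)
probabilities = mask_probabilities(np.array([0.0, 0.0, 0.0, 0.0]))
```

Each mask therefore selects a disjoint subset of first-layer units, so switching masks switches which part of the network drives the action while the remaining layers are shared.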

4.2.2 Synchronous sub-policy selection

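To promote synchronization we introduce the coach entity, parametrized by $\psi$, which learns to produce policy-masks $e^c_t$ from the joint observations, i.e. $p^c(e^c_t \mid \mathbf{o}_t; \psi)$. The coach is used at training time only and drives the agents toward synchronously selecting the same behavior mask. In other words, the coach is trained to output masks that (1) yield high returns when used by the agents and (2) are predictable by the agents. Similarly, each agent is regularized so that (1) its private mask matches the coach’s mask and (2) it derives efficient behavior when using the coach’s mask. At evaluation time, the coach is removed and the agents only rely on their own policy masks. The policy gradient loss when agent $i$ is provided with the coach’s mask $e^c_t$ is given by:

$$J^i_{EPG}(\theta^i, \psi) = \mathbb{E}_{\mathbf{o}_t, \mathbf{a}_t \sim \mathcal{D}} \left[ Q^i(\mathbf{o}_t, \mathbf{a}_t) \Big|_{a^i_t = \mu(o^i_t, e^c_t; \theta^i), \ e^c_t \sim p^c(\cdot \mid \mathbf{o}_t; \psi)} \right]$$

The difference between the mask of agent $i$ and the coach’s one is measured from the Kullback–Leibler divergence:

$$J^i_E(\theta^i, \psi) = -\mathbb{E}_{\mathbf{o}_t \sim \mathcal{D}} \left[ D_{KL}\left( p^c(\cdot \mid \mathbf{o}_t; \psi) \ \| \ p^i(\cdot \mid o^i_t; \theta^i) \right) \right]$$

The total gradient for agent $i$ is:

$$\nabla_{\theta^i} J^i_{total}(\theta^i) = \nabla_{\theta^i} J^i_{PG}(\theta^i) + \lambda_1 \nabla_{\theta^i} J^i_E(\theta^i, \psi) + \lambda_2 \nabla_{\theta^i} J^i_{EPG}(\theta^i, \psi)$$

with $\lambda_1$ and $\lambda_2$ the regularization coefficients. Similarly, the coach is trained with the following dual objective, weighted by the $\lambda_3$ coefficient:

$$J^c_{total}(\psi) = \frac{1}{N} \sum_{i=1}^{N} \left( J^i_{EPG}(\theta^i, \psi) + \lambda_3 J^i_E(\theta^i, \psi) \right)$$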

In order to propagate gradients through the sampled policy mask we reparametrized the categorical distribution using the Gumbel-softmax trick (jang2016categorical) with a temperature of 1. We call this coordinated sub-policy selection regularization CoachReg and illustrate it in Figure 2.
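The two numerical building blocks of this section, a Gumbel-softmax relaxation of the sampled mask and a Kullback–Leibler divergence between the coach's and an agent's mask distributions, can be sketched as below. Several choices here are assumptions made for illustration: the function names, the example logits, and the argument order of the divergence (coach first, agent second). NumPy only shows the forward computation; gradients would come from an automatic-differentiation framework.

```python
import numpy as np


def softmax(logits):
    shifted = logits - np.max(logits)  # subtract the max for numerical stability
    exponentials = np.exp(shifted)
    return exponentials / exponentials.sum()


def gumbel_softmax_sample(logits, temperature, rng):
    """Relaxed one-hot sample: softmax of Gumbel-perturbed logits."""
    noise = rng.gumbel(size=logits.shape)
    return softmax((logits + noise) / temperature)


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) between two categorical distributions."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return float(np.sum(p * (np.log(p + eps) - np.log(q + eps))))


rng = np.random.default_rng(0)
coach_probs = softmax(np.array([2.0, 0.5, 0.1, 0.1]))  # computed from joint observations
agent_probs = softmax(np.array([1.5, 0.7, 0.2, 0.0]))  # computed from a local observation
relaxed_mask = gumbel_softmax_sample(np.log(coach_probs), temperature=1.0, rng=rng)
mask_mismatch = kl_divergence(coach_probs, agent_probs)
```

The relaxed sample is a point on the probability simplex rather than an exact one-hot vector, which is what lets gradients reach the logits that produced it.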

5 Training environments

All of our tasks are based on the OpenAI multi-agent particle environments (mordatch2018emergence). SPREAD and CHASE were introduced by lowe2017multi. We use SPREAD as is but with sparse rewards only. CHASE is modified with a prey controlled by repulsion forces and only the predators are learnable, as we wish to focus on coordination in cooperative tasks. Finally we introduce COMPROMISE and BOUNCE where agents are explicitly tied together. While non-zero return can be achieved in these tasks by selfish agents, they all benefit from coordinated strategies and optimal return can only be achieved by agents working closely together. Figure 3 presents visualizations and a brief description of all four tasks. A detailed description is provided in Appendix A. In all tasks, agents receive as observation their own global position and velocity as well as the relative position of other entities. Note that works showcasing experiments on this environment often use discrete action spaces and (dense) reward shaping (e.g. the proximity with the objective) (iqbal2018actor; lowe2017multi; jiang2018learning). However, in our experiments, agents learn with continuous action spaces and from sparse rewards.

Figure 3: Multi-agent tasks used in this work. (a) SPREAD: Agents must spread out and cover a set of landmarks. (b) BOUNCE: Two agents are linked together by a spring and must position themselves so that the falling black ball bounces towards a target. (c) COMPROMISE: Two linked agents must compete or cooperate to reach their own assigned landmark. (d) CHASE: Two agents chase a (non-learning) prey (turquoise) that moves w.r.t repulsion forces from predators and walls.

6 Results and Discussion

The proposed methods offer a way to incorporate new inductive biases in CTDE multi-agent policy search algorithms. In this work, we evaluate them by extending MADDPG, a state-of-the-art algorithm widely used in the MARL literature. We compare against vanilla MADDPG as well as two of its variations in the four cooperative multi-agent tasks described in Section 5. The first variation (DDPG) is the single-agent counterpart of MADDPG (decentralized training). The second (MADDPG + sharing) shares the policy and value-function models across agents.

To offer a fair comparison between all methods, the hyper-parameter search routine is the same for each algorithm and environment (see Appendix D.1). For each search-experiment (one per algorithm per environment), 50 randomly sampled hyper-parameter configurations each using 3 training seeds (total of 150 runs) are used to train the models for episodes. For each algorithm-environment pair, we then select the best hyper-parameter configuration for the final comparison and retrain them on 10 seeds for twice as long. We give more details about the training setup and model selection in Appendix B and D.2. The results of the hyperparameter searches are given in Appendix D.5.

6.1 Asymptotic Performance

From the average learning curves reported in Figure 4 we observe that CoachReg significantly improves performance on three environments (SPREAD, BOUNCE and COMPROMISE) and performs on par with the baselines on the last one (CHASE). The same can be said for TeamReg, except on COMPROMISE, the only task with an adversarial component, where it significantly underperforms compared to the other algorithms. We discuss this specific case in Section 6.3. Finally, parameter sharing is the best performing choice on CHASE, yet this superiority is restricted to this task where the optimal play is to move symmetrically and squeeze the prey into a corner.

Figure 4: Learning curves (mean return over agents) for all algorithms on all four environments. Solid lines are the mean and envelopes are the Standard Error (SE) across the 10 training seeds.
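The aggregation used for such learning curves, a mean line with a standard-error envelope across seeds, can be sketched as follows. The function name and the choice of the sample standard deviation (`ddof=1`) are assumptions made here; the text does not state which estimator was used.

```python
import numpy as np


def mean_and_standard_error(curves):
    """curves has shape (num_seeds, num_points); returns the per-point mean and SE."""
    curves = np.asarray(curves, dtype=float)
    mean = curves.mean(axis=0)
    # Sample standard deviation across seeds divided by sqrt(number of seeds).
    standard_error = curves.std(axis=0, ddof=1) / np.sqrt(curves.shape[0])
    return mean, standard_error


# Two seeds, two evaluation points each.
mean_curve, se_curve = mean_and_standard_error([[1.0, 2.0], [3.0, 2.0]])
```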

6.2 Ablation study

Figure 5: Learning curves (mean return over agents) for the ablated algorithms on all environments. Solid lines are the mean and envelopes are the Standard Error (SE) across the 10 training seeds.

In addition to our two proposed algorithms and the three baselines, we present results for two ablated versions of our methods. The first ablation (MADDPG + agent modelling) is similar to TeamReg but with $\lambda_2 = 0$ (the weight on agent predictability), which results in only enforcing agent modelling (i.e. agent predictability is not encouraged). The second ablation (MADDPG + policy mask) is structurally equivalent to CoachReg, but with all coach regularization coefficients set to zero ($\lambda_1 = \lambda_2 = \lambda_3 = 0$), which means that agents still predict and apply a mask to their own policy, but synchronicity is not encouraged. Figures 12 and 13 (Appendix D.6) present the results of the corresponding hyper-parameter search and Figure 5 shows the learning curves for our full regularization approaches, their respective ablated versions and MADDPG.

The use of unsynchronized policy masks might result in swift and unpredictable behavioral changes and make it difficult for agents to perform together and coordinate. Experimentally, “MADDPG + policy mask” performs similarly to or worse than MADDPG on all but one environment, and never outperforms the full CoachReg approach. However, policy masks alone seem enough to succeed on SPREAD, which is about selecting a landmark from a set. Regarding “MADDPG + agent modelling”, it does not drastically improve on MADDPG except on the SPREAD environment, and the full TeamReg approach shows improvement over its ablated version except on the COMPROMISE task, which we discuss in Section 6.3.

6.3 Effects of enforcing predictable behavior

First, we investigate the reason for TeamReg’s poor performance on COMPROMISE. Then, we analyse how TeamReg might be helpful in other environments.

Figure 6: Average performance difference between the two agents in COMPROMISE for each of the 150 runs of the hyper-parameter searches (left). All occurrences of abnormally high performance difference are associated with high values of $\lambda_2$, the weight on agent predictability (right).

COMPROMISE is the only task with a competitive component (and the only one in which agents do not share their rewards). The two agents being linked, a good policy has both agents reach their landmark successively (maybe by simply having both agents navigate towards the closest landmark). However, if one agent never reaches for its landmark, the optimal strategy for the other one becomes to drag it around and always go for its own, leading to a strong imbalance in the return accumulated by the two agents. While this scenario very rarely occurs for the other algorithms, we found TeamReg to often lead to such domination cases (see Figure 14 in Appendix E). Figure 6 depicts the agents’ performance difference for each of the 150 runs of the hyper-parameter search for TeamReg and the baselines, and shows that (1) TeamReg is the only algorithm that leads to large imbalances in performance between the two agents and (2) these cases where one agent becomes dominant are all associated with high values of $\lambda_2$, the weight that drives the agents to behave in a predictable fashion to one another. However, the dominated agent eventually gets exposed more and more to sparse reward gathered by being dragged (by chance) onto its own landmark, picks up the goal of the task and starts pulling in its own direction, which causes the average return over agents to drop as we see in Figure 4. This experiment demonstrates that using a predictability-based team-regularization in a competitive task can be harmful; quite understandably, you might not want to optimize an objective that aims at making your behavior predictable to your opponent.

On SPREAD and BOUNCE, TeamReg significantly improves the performance over the baselines. We aim to analyze here the effects of $\lambda_2$ (the predictability weight) on cooperative tasks and investigate whether it makes the agent modelling task more successful (by encouraging the agent to be predictable). To this end, we compare the best performing hyper-parameter configuration for TeamReg on the SPREAD environment with its ablated versions. The average return and team-spirit loss defined in Section 4.1 are presented in Figure 7 for these three experiments.

Figure 7: Comparison between enabling and disabling the regularizing weights $\lambda_1$ and $\lambda_2$ for MADDPG+TeamReg on the SPREAD environment. Values are averaged over the 3 agents and over the 3 seeds used in the hyper-parameter exploration.

Initially, due to the weight initialization, the predicted and actual actions both have relatively small norms, yielding small values of team-spirit loss. As training goes on (1000 episodes), the norms of the action-vector increase and the regularization loss becomes more important. As expected, disabling both weights ($\lambda_1 = \lambda_2 = 0$) leads to the highest team-spirit loss, as the agents are then not trained to predict the actions of other agents correctly. When using only the agent-modelling objective ($\lambda_1 > 0$, $\lambda_2 = 0$), the agents significantly decrease the team-spirit loss, but it never reaches values as low as when using the full TeamReg objective. Finally, when also pushing agents to be predictable ($\lambda_1 > 0$, $\lambda_2 > 0$), the agents best predict each others’ actions and performance is also improved. We also notice that the team-spirit loss increases when performance starts to improve, i.e. when agents start to master the task (8000 episodes). Indeed, once the reward maximisation signal becomes stronger, the relative importance of the second task is reduced. We hypothesize that being predictable with respect to one another may push agents to explore in a more structured and informed manner in the absence of reward signal, as similarly pursued by intrinsic motivation approaches (chentanez2005intrinsically).

6.4 Analysis of synchronous sub-policy selection

(a) The ball is on the left side of the target, agents both select the purple policy mask
(b) The ball is on the right side of the target, agents both select the green policy mask
Figure 8: Visualization of two different BOUNCE evaluation episodes. Note that here, the agents’ colors represent their chosen policy mask. Agents have learned to synchronously identify two distinct situations and act accordingly. The coach’s masks (not used at evaluation time) are displayed with the timestep at the bottom of each frame.
Figure 9: (a) Entropy of the policy mask distributions for each task, averaged over agents and training seeds. The entropy of a $k$-CUD (Categorical Uniform Distribution of size $k$) is given for reference. (b) Hamming Proximity between the policy mask sequence of each agent averaged across agent pairs and seeds. rand stands for agents independently sampling their masks from a $k$-CUD. Error bars are SE across seeds.

In this section we aim at experimentally verifying that CoachReg yields the desired behavior: agents synchronously alternating between varied sub-policies. Special attention is given to cases where the sub-policies are interpretable. To this end we record and analyze the agents’ policy masks on 100 different episodes for each task.

From the collected masks, we reconstructed the empirical mask distribution of each agent (see Figure 15 in Appendix F.1) whose entropy provides an indication of the mask diversity used by a given agent. Figure 9 (a) shows the mean entropy for each environment compared to the entropy of Categorical Uniform Distributions of size $k$ ($k$-CUD). It shows that, on all the environments, agents use at least two distinct masks by having non-zero entropy. In addition, agents tend to alternate between masks with more variety (close to uniformly switching between 3 masks) on SPREAD (where there are 3 agents and 3 goals) than on the other environments (comprised of 2 agents). To test if agents are synchronously selecting the same policy mask at test time (without a coach), we compute the Hamming proximity between the agents’ mask sequences with $1 - D_h$ where $D_h$ is the Hamming distance, i.e. the number of timesteps where the two sequences are different divided by the total number of timesteps. From Figure 9 (b) we observe that agents are producing similar mask sequences. Notably, their mask sequences are significantly more similar than those of two agents randomly choosing between two masks at each timestep. Finally, we observe that some settings result in the agents coming up with interesting strategies, like the one depicted in Figure 8 where the agents alternate between two sub-policies depending on the position of the target. More cases where the agents change sub-policies during an episode are presented in Appendix F.1. These results indicate that, in addition to improving the performance on coordination tasks, CoachReg indeed yields the expected behaviors. An interesting follow-up would be to use entropy regularization to increase the mask usage variety and mutual information to further disentangle sub-policies.
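The two diagnostics used in this analysis can be sketched as follows: the entropy of an agent's empirical mask distribution, and the Hamming proximity between two agents' mask sequences (one minus the fraction of timesteps on which they differ). The function names, the natural-logarithm base and the toy sequences are assumptions made for illustration.

```python
import numpy as np


def mask_entropy(mask_sequence, num_masks):
    """Entropy (in nats) of the empirical distribution of selected mask indices."""
    counts = np.bincount(np.asarray(mask_sequence), minlength=num_masks)
    probabilities = counts / counts.sum()
    nonzero = probabilities[probabilities > 0]  # 0 * log(0) is treated as 0
    return float(-np.sum(nonzero * np.log(nonzero)))


def hamming_proximity(sequence_a, sequence_b):
    """One minus the fraction of timesteps where the two mask sequences differ."""
    sequence_a, sequence_b = np.asarray(sequence_a), np.asarray(sequence_b)
    return 1.0 - float(np.mean(sequence_a != sequence_b))


agent_1_masks = [0, 1, 1, 0]
agent_2_masks = [0, 1, 0, 0]
entropy_1 = mask_entropy(agent_1_masks, num_masks=4)
proximity = hamming_proximity(agent_1_masks, agent_2_masks)
```

Under these conventions, an agent that switches uniformly between $k$ masks has entropy $\log k$, and two agents that sample independently and uniformly among $k$ masks have an expected proximity of $1/k$, which is the kind of chance level the measured proximities are compared against.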

6.5 Robustness to hyper-parameters

Stability across hyper-parameter configurations is a recurring challenge in Deep RL. The average performance for each sampled configuration makes it possible to empirically evaluate the robustness of an algorithm w.r.t. its hyper-parameters. We share the full results of the hyper-parameter searches in Figures 10, 11, 12 and 13 in Appendix D.5 and D.6. Figure 11 shows that while most algorithms can perform reasonably well with the correct configuration, our proposed coordination regularizers can improve robustness to hyper-parameters despite the fact that they have more hyper-parameters to search over. Such robustness can be of great value with limited computational budgets.

7 Conclusion

In this work we introduced two policy regularization methods to promote multi-agent coordination within the CTDE framework: TeamReg, which is based on inter-agent action predictability, and CoachReg, which relies on synchronized behavior selection. A thorough empirical evaluation of these methods showed that they significantly improve asymptotic performance on cooperative multi-agent tasks. Interesting avenues for future work would be to study the proposed regularizations on other policy search methods as well as to combine both incentives and investigate how the two coordinating objectives interact. Finally, a limitation of the current formulation is that it relies on single-step metrics, which simplifies off-policy learning but also limits the longer-term coordination opportunities. A promising direction is thus to explore model-based planning approaches to promote long-term multi-agent interactions.


We wish to thank Olivier Delalleau for providing insightful feedback and comments as well as Fonds de Recherche Nature et Technologies (FRQNT), Ubisoft Montréal and Mitacs for providing part of the funding for this work.


Appendix A Tasks descriptions

SPREAD (Figure 3a): In this environment, there are 3 agents (small orange circles) and 3 landmarks (bigger gray circles). At every timestep, agents receive a team-reward where is the number of landmarks occupied by at least one agent and the number of collisions occurring at that timestep. To maximize their return, agents must therefore spread out and cover all landmarks. Initial agents’ and landmarks’ positions are random. Termination is triggered when the maximum number of timesteps is reached.

BOUNCE (Figure 3b): In this environment, two agents (small orange circles) are linked together with a spring that pulls them toward each other when stretched above its relaxation length. At episode’s mid-time a ball (smaller black circle) falls from the top of the environment. Agents must position correctly so as to have the ball bounce on the spring towards the target (bigger beige circle), which turns yellow if the ball’s bouncing trajectory passes through it. They receive a team-reward of if the ball reflects towards the side walls, if the ball reflects towards the top of the environment, and if the ball reflects towards the target. At initialisation, the target’s and ball’s vertical position is fixed, their horizontal positions are random. Agents’ initial positions are also random. Termination is triggered when the ball is bounced by the agents or when the maximum number of timesteps is reached.

COMPROMISE (Figure 3c): In this environment, two agents (small orange circles) are linked together with a spring that pulls them toward each other when stretched above its relaxation length. They both have a distinct assigned landmark (light gray circle for light orange agent, dark gray circle for dark orange agent), and receive a reward of when they reach it. Once a landmark is reached by its corresponding agent, the landmark is randomly relocated in the environment. Initial positions of agents and landmark are random. Termination is triggered when the maximum number of timesteps is reached.

CHASE (Figure 3d): In this environment, two predators (orange circles) are chasing a prey (turquoise circle). The prey follows a scripted policy consisting of repulsion forces from the walls and predators. At each timestep, the learning agents (predators) receive a team-reward of where is the number of predators touching the prey. The prey has a greater max speed and acceleration than the predators. Therefore, to maximize their return, the two agents must coordinate in order to squeeze the prey into a corner or against a wall and effectively trap it there. Termination is triggered when the maximum number of timesteps is reached.

Appendix B Training details

In all of our experiments, we use the Adam optimizer (kingma2014adam) to perform parameter updates. All models (actors and critics) are parametrized by feedforward networks containing two hidden layers of units. We use the Rectified Linear Unit (ReLU) (nair2010rectified) as activation function and layer normalization (ba2016layer) on the pre-activation units to stabilize the learning. We use a buffer-size of entries and a batch-size of . We collect transitions by interacting with the environment for each learning update. For all tasks in our hyper-parameter searches, we train the agents for episodes of steps and then re-train the best configuration for each algorithm-environment pair for twice as long ( episodes) to ensure full convergence for the final evaluation. The scale of the exploration noise is kept constant for the first half of the training time and then decreases linearly to until the end of training. We use a discount factor of and a gradient clipping threshold of in all experiments. Finally, for CoachReg, we fixed the number of available sub-policies to 4.
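The exploration-noise schedule described above (constant for the first half of training, then a linear decrease until the end) can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `noise_scale` and its parameters are hypothetical, and the final scale is left as an argument because its value is not recoverable from the text.

```python
def noise_scale(step: int, total_steps: int, initial: float, final: float) -> float:
    """Exploration-noise scale: constant for the first half, then linear decay.

    All names here are illustrative assumptions; `final` is whatever value
    the schedule is meant to reach at the end of training.
    """
    if total_steps <= 0:
        raise ValueError("total_steps must be positive")
    half = total_steps / 2
    if step <= half:
        # First half of training: keep the initial scale unchanged.
        return initial
    # Fraction of the second half that has elapsed, clipped to 1.
    frac = min((step - half) / (total_steps - half), 1.0)
    return initial + frac * (final - initial)
```

Calling the function once per learning update (or per episode) with the current training progress gives the scale to multiply the noise process by.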

Appendix C Algorithms

  Randomly initialize critic networks and actor networks
  Initialize the target weights
  Initialize one replay buffer
  for episode from 0 to number of episodes do
     Initialize random processes for action exploration
     Receive initial joint observation
     for timestep t from 0 to episode length do
        Select action for each agent
        Execute joint action and observe joint reward and new observation
        Store transition in
     end for
     Sample a random minibatch of transitions from
     for each agent  do
        Evaluate and from Equations (1) and (2)
        for each other agent do
           Evaluate from Equations (3)
           Update actor with
        end for
        Update critic with
        Update actor with
     end for
     Update all target weights
  end for
Algorithm 1 TeamReg

  Randomly initialize critic networks , actor networks and one coach network
  Initialize target networks and
  Initialize one replay buffer
  for episode from 0 to number of episodes do
     Initialize random processes for action exploration
     Receive initial joint observation
     for timestep t from 0 to episode length do
        Select action for each agent
        Execute joint action and observe joint reward and new observation
        Store transition in
     end for
     Sample a random minibatch of transitions from
     for each agent  do
        Evaluate and from Equations (1) and (2)
        Update critic with
        Update actor with
     end for
     for each agent  do
        Evaluate and from Equations (8) and (7)
        Update actor with
     end for
     Update coach with
     Update all target weights
  end for
Algorithm 2 CoachReg
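The "Update all target weights" step that closes each training iteration in both algorithms relies on a soft-update parameter (see Appendix D). A common way to implement such a soft update is Polyak averaging, sketched below on flat lists of floats. The function name `soft_update` and the flat-list representation are assumptions made for illustration; a real implementation would apply the same rule to every tensor of each target network.

```python
def soft_update(target, online, tau):
    """Return target weights moved a fraction `tau` toward the online weights.

    `target` and `online` are equal-length sequences of floats (hypothetical
    flat representation of a network's parameters).
    """
    if len(target) != len(online):
        raise ValueError("target and online must have the same length")
    # tau = 0 leaves the target unchanged; tau = 1 copies the online weights.
    return [(1.0 - tau) * t + tau * w for t, w in zip(target, online)]
```

Small values of `tau` make the targets change slowly, which is the usual motivation for soft updates in DDPG-style methods.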

Appendix D Hyper-parameter search

d.1 Hyper-parameter search ranges

We perform searches over the following hyper-parameters: the learning rate of the actor , the learning rate of the critic relative to the actor (), the target-network soft-update parameter and the initial scale of the exploration noise for the Ornstein-Uhlenbeck noise generating process (uhlenbeck1930theory) as used by lillicrap2015continuous. When using TeamReg and CoachReg, we additionally search over the regularization weights , and . The learning rate of the coach is always equal to the actor’s learning rate (i.e. ), motivated by their similar architectures and learning signals and in order to reduce the search space. Table 1 shows the ranges from which values for the hyper-parameters are drawn uniformly during the searches.

Hyper-parameter Range
Table 1: Ranges for hyper-parameter search, the log base is 10
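As an illustration of how one configuration could be drawn from such ranges, the sketch below samples some hyper-parameters uniformly in log10 space and others uniformly in linear space. The function `sample_config`, the hyper-parameter names, and the numeric ranges in the usage lines are all placeholders, not the values of Table 1.

```python
import random


def sample_config(log10_ranges, linear_ranges, rng):
    """Draw one hyper-parameter configuration.

    log10_ranges:  name -> (low, high) bounds on the base-10 exponent.
    linear_ranges: name -> (low, high) bounds on the value itself.
    Both dictionaries are hypothetical inputs for this sketch.
    """
    config = {}
    for name, (low, high) in log10_ranges.items():
        # Uniform in the exponent, i.e. log-uniform in the value.
        config[name] = 10.0 ** rng.uniform(low, high)
    for name, (low, high) in linear_ranges.items():
        config[name] = rng.uniform(low, high)
    return config


# Placeholder ranges, for illustration only.
rng = random.Random(0)
configs = [
    sample_config({"actor_lr": (-5.0, -3.0)}, {"noise_scale": (0.1, 1.0)}, rng)
    for _ in range(50)
]
```

Sampling the exponent rather than the value spreads the draws evenly across orders of magnitude, which is the usual reason for expressing learning-rate ranges in log scale.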

d.2 Model selection

During training, a policy is evaluated on a set of 10 different episodes every 100 learning steps. At the end of the training, the model at the best evaluation iteration is saved as the best version of the policy for this training, and is re-evaluated on 100 different episodes to have a better assessment of its final performance. The performance of a hyper-parameter configuration is defined as the average performance (across seeds) of the policies learned using this set of hyper-parameter values.
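The selection procedure above can be sketched in two small functions: one picks the best evaluation iteration of a single training run, the other averages the final re-evaluation returns across seeds to score a configuration. Function names and data layouts are assumptions made for this example.

```python
from statistics import mean


def best_checkpoint(eval_history):
    """Return the learning step whose periodic evaluation scored highest.

    eval_history: list of (learning_step, mean_return) pairs, one per
    periodic evaluation (hypothetical layout).
    """
    return max(eval_history, key=lambda item: item[1])[0]


def configuration_score(final_returns_per_seed):
    """Average performance of a hyper-parameter configuration across seeds.

    final_returns_per_seed: one list per seed, holding the returns of the
    saved best model on its re-evaluation episodes.
    """
    return mean(mean(returns) for returns in final_returns_per_seed)
```

Ranking configurations by `configuration_score` and keeping the top one corresponds to the "best found" configuration reported for each algorithm-environment pair.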

d.3 Selected hyper-parameters

Tables 2, 3, 4, and 5 show the best hyper-parameters found by the random searches for each of the environments and each of the algorithms.

Hyper-parameter DDPG MADDPG MADDPG+Sharing MADDPG+TeamReg MADDPG+CoachReg
- - -
- - -
- - - -
Table 2: Best found hyper-parameters for the SPREAD environment
Hyper-parameter DDPG MADDPG MADDPG+Sharing MADDPG+TeamReg MADDPG+CoachReg
- - -
- - -
- - - -
Table 3: Best found hyper-parameters for the BOUNCE environment
Hyper-parameter DDPG MADDPG MADDPG+Sharing MADDPG+TeamReg MADDPG+CoachReg
- - -
- - -
- - - -
Table 4: Best found hyper-parameters for the CHASE environment
Hyper-parameter DDPG MADDPG MADDPG+Sharing MADDPG+TeamReg MADDPG+CoachReg
- - -
- - -
- - - -
Table 5: Best found hyper-parameters for the COMPROMISE environment

d.4 Selected hyper-parameters (ablations)

Tables 6, 7, 8, and 9 show the best hyper-parameters found by the random searches for each of the environments and each of the ablated algorithms.

Hyper-parameter MADDPG+Agent Modelling MADDPG+Policy Mask
Table 6: Best found hyper-parameters for the SPREAD environment
Hyper-parameter MADDPG+Agent Modelling MADDPG+Policy Mask
Table 7: Best found hyper-parameters for the BOUNCE environment
Hyper-parameter MADDPG+Agent Modelling MADDPG+Policy Mask
Table 8: Best found hyper-parameters for the CHASE environment
Hyper-parameter MADDPG+Agent Modelling MADDPG+Policy Mask
Table 9: Best found hyper-parameters for the COMPROMISE environment

d.5 Hyper-parameter search results

The performance of each hyper-parameter configuration is reported in Figure 10, yielding the performance distribution across hyper-parameter configurations for each algorithm on each task. The same distributions are depicted in Figure 11 using box-and-whisker plots. It can be seen that TeamReg and CoachReg both boost the performance of the third quartile, suggesting an increase in robustness across hyper-parameters.

Figure 10: Hyper-parameter tuning results for all algorithms. There is one distribution per (algorithm, environment) pair, each one formed of 50 points (hyper-parameter configuration samples). Each point represents the best model performance averaged over 100 evaluation episodes and averaged over the 3 training seeds for one sampled hyper-parameter configuration (total of 300 performance values per sampled configuration).

Figure 11: Summarized performance distributions of the sampled hyper-parameter configurations for each (algorithm, environment) pair. The box-plots divide into quartiles the 49 lower-performing configurations of each distribution, while the score of the best-performing configuration is highlighted above the box-plots by a single dot.
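The summary used in these box-plots (quartiles over all but the best configuration, with the best one reported separately) can be reproduced along the following lines. The function name `summarize` is hypothetical, and the quartile convention is the default of Python's `statistics.quantiles`, which may differ slightly from the one used by the plotting library behind the figure.

```python
from statistics import quantiles


def summarize(scores):
    """Split configuration scores into the best one and quartiles of the rest."""
    if len(scores) < 3:
        raise ValueError("need at least three scores")
    ordered = sorted(scores)
    best, rest = ordered[-1], ordered[:-1]
    # Three cut points dividing the remaining scores into four groups.
    q1, median, q3 = quantiles(rest, n=4)
    return {"best": best, "q1": q1, "median": median, "q3": q3}
```

With 50 sampled configurations, `rest` holds the 49 lower-performing ones, matching the description in the caption.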

d.6 Hyper-parameter search results (ablations)

Figure 13 suggests that the “policy mask” and “agent modelling” additions provide nearly the same robustness boosts as CoachReg and TeamReg, respectively.

Figure 12: Hyper-parameter tuning results for ablated algorithms compared to their full approach counterparts and MADDPG. There is one distribution per (algorithm, environment) pair, each one formed of 50 points (hyper-parameter configuration samples). Each point represents the best model performance averaged over 100 evaluation episodes and averaged over the 3 training seeds for one sampled hyper-parameter configuration (total of 300 performance values per sampled configuration).

Figure 13: Summarized performance distributions of the sampled hyper-parameter configurations for each (ablated algorithm, environment) pair. The box-plots divide into quartiles the 49 lower-performing configurations of each distribution, while the score of the best-performing configuration is highlighted above the box-plots by a single dot.

Appendix E The effects of enforcing predictability (additional results)

Figure 14: Learning curves for TeamReg and the three baselines on COMPROMISE. While both agents remain equally performant as they improve at the task for the baseline algorithms, TeamReg tends to make one agent much stronger than the other. This domination is optimal as long as the other agent remains docile, as the dominant agent can gather much more reward than if it had to compromise. However, when the dominated agent finally picks up the task, the dominant agent, which has learned a policy that does not compromise, sees its return drop dramatically, and the mean over agents then remains lower than for the baselines.

Appendix F Analysis of sub-policy selection (additional results)

f.1 Mask densities

Figure 15 depicts the mask distribution of each agent for each (seed, environment) experiment. Firstly, in most of the experiments, agents use at least 2 different masks. Secondly, for a given experiment, the agents’ distributions are very similar, suggesting that they use the same masks in the same situations and are therefore synchronized. Finally, agents collapse more often to using a single mask on CHASE, where they also display more dissimilarity between one another. This may explain why CHASE is the only task where CoachReg does not improve performance. Indeed, on CHASE, agents seem neither synchronized nor to leverage multiple sub-policies, which are the priors to coordination behind CoachReg. In brief, we observe that CoachReg is less effective at enforcing those priors to coordination on CHASE, an environment where it neither boosts nor harms performance.

Figure 15: Agents’ policy mask distributions. For each (seed, environment) pair, we collected the masks of each agent over 100 episodes.

f.2 Episodes roll-outs

We render here some episode roll-outs. The agents synchronously switch between policy masks during an episode. In addition, the whole group selects the same mask as the one that would have been suggested by the coach.

Figure 16: Visualization sequences on two different environments. An agent’s color represents its current policy mask. For informative purposes, the policy mask that the coach would have produced if these situations had happened during training is displayed next to the frame’s timestep. Agents synchronously switch between the available policy masks.

f.3 Mask diversity and synchronicity (ablation)

As in Subsection 6.4, we report the mean entropy of the mask distribution and the mean Hamming proximity for the ablated “MADDPG + policy mask” and compare them to the full CoachReg. With “MADDPG + policy mask”, agents are not incentivized to use the same masks. Therefore, in order to assess whether they synchronously change policy masks, we computed, for each agent pair, seed and environment, the Hamming proximity for every possible mask equivalence (mask 3 of agent 1 corresponds to mask 0 of agent 2, etc.) and selected the equivalence that maximised the Hamming proximity between the two sequences.
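The two quantities discussed here can be sketched as follows. This is an illustrative reading rather than the authors' implementation: Hamming proximity is assumed to be the fraction of timesteps at which two mask sequences agree, a "mask equivalence" is modelled as a permutation of one agent's mask labels, and all function names are hypothetical.

```python
import math
from collections import Counter
from itertools import permutations


def mask_entropy(masks):
    """Entropy (in nats) of the empirical distribution of mask indices."""
    total = len(masks)
    counts = Counter(masks)
    return -sum((c / total) * math.log(c / total) for c in counts.values())


def hamming_proximity(seq_a, seq_b):
    """Fraction of timesteps at which both sequences hold the same mask."""
    if len(seq_a) != len(seq_b) or not seq_a:
        raise ValueError("sequences must be non-empty and of equal length")
    return sum(a == b for a, b in zip(seq_a, seq_b)) / len(seq_a)


def best_aligned_proximity(seq_a, seq_b, num_masks):
    """Maximum Hamming proximity over all relabellings of seq_b's masks."""
    return max(
        hamming_proximity(seq_a, [perm[m] for m in seq_b])
        for perm in permutations(range(num_masks))
    )
```

With 4 masks there are only 24 relabellings to try, so the exhaustive search is cheap. For reference, the entropy of a uniform distribution over `num_masks` masks is `math.log(num_masks)`.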

We can observe that while “MADDPG + policy mask” agents display a more diverse mask usage, their selection is less synchronized than with CoachReg. This is easily understandable: the coach tends to reduce diversity in order to have all the agents agree on a common mask, but this agreement is what enables the agents to synchronize their mask selection. In this regard, it should be noted that “MADDPG + policy mask” agents are more synchronized than agents independently sampling their masks from -CUD, suggesting that, even in the absence of the coach, agents tend to synchronize their mask selection.

Figure 17: (a) Entropy of the policy mask distributions for each task and method, averaged over agents and training seeds. is the entropy of a -CUD. (b) Hamming Proximity between the policy mask sequence of each agent averaged across agent pairs and seeds. rand stands for agents independently sampling their masks from -CUD. Error bars are SE across seeds.