Promoting Coordination through Policy Regularization in
Multi-Agent Deep Reinforcement Learning
A central challenge in multi-agent reinforcement learning is the induction of coordination between the agents of a team. In this work, we investigate how to promote inter-agent coordination using policy regularization and discuss two possible avenues based respectively on inter-agent modelling and synchronized sub-policy selection. We test each approach in four challenging continuous control tasks with sparse rewards and compare them against three baselines including MADDPG, a state-of-the-art multi-agent reinforcement learning algorithm. To ensure a fair comparison, we rely on a thorough hyper-parameter selection and training methodology that allows a fixed hyper-parameter search budget for each algorithm and environment. We consequently assess the hyper-parameter sensitivity, sample efficiency and asymptotic performance of each learning method. Our experiments show that the proposed methods lead to significant improvements on cooperative problems. We further analyse the effects of the proposed regularizations on the behaviors learned by the agents.
Multi-Agent Reinforcement Learning (MARL) refers to the task of training an agent to maximize its expected return by interacting with an environment that contains other learning agents. It represents a challenging branch of Reinforcement Learning (RL) with interesting developments in recent years (hernandez2018multiagent). A popular framework for MARL is the use of a Centralized Training and Decentralized Execution (CTDE) procedure (lowe2017multi; foerster2018counterfactual; iqbal2018actor; foerster2018bayesian; rashid2018qmix). It is typically implemented by training critics that approximate the value of the joint observations and actions, which are used to train actors restricted to the observation of a single agent. Such critics, if exposed to coordinated joint actions leading to high returns, can steer the agents’ policies toward these highly rewarding behaviors. However, these approaches depend on the agents luckily stumbling on these actions in order to grasp their benefit. Thus, they might fail in scenarios where coordination is unlikely to occur by chance. We hypothesize that in such scenarios, coordination-promoting inductive biases on the policy search could help discover coordinated behaviors more efficiently and supersede task-specific reward shaping and curriculum learning.
In this work, we explore two different priors for successful coordination and use them to regularize the learned policies. The first avenue, TeamReg, assumes that an agent must be able to predict the behavior of its teammates in order to coordinate with them. The second, CoachReg, supposes that coordinating agents individually recognize different situations and synchronously use different sub-policies to react to them. In the following sections, we show how to derive practical regularization terms from these premises and meticulously evaluate them (source code for the algorithms and environments will be made public upon publication of this work).
Our contributions are threefold. First, we propose two novel approaches that aim at promoting coordination in multi-agent systems. Our methods augment CTDE MARL algorithms with additional multi-agent objectives that act as regularizers and are optimized jointly with the main return-maximization objective. Second, we design two new sparse-reward cooperative tasks in the multi-agent particle environment (mordatch2018emergence). We use them along with two standard multi-agent tasks to present a detailed evaluation of our approaches against three different baselines. Finally, we validate our methods’ key components by performing an ablation study. Our experiments suggest that our TeamReg objective provides a dense learning signal that helps to guide the policy towards coordination in the absence of external reward, eventually leading to the discovery of high-performing team strategies in a number of cooperative tasks. Similarly, by enforcing synchronous sub-policy selection, CoachReg enables fine-tuning a sub-behavior for each recognized situation, yielding significant improvements in overall performance.
2.1 Markov Games
In this work we consider the framework of Markov Games (littman1994markov), a multi-agent extension of Markov Decision Processes (MDPs) with $N$ independent agents. A Markov Game is defined by the tuple $\langle \mathcal{S}, \mathcal{T}, \rho, \{\mathcal{O}^i, \mathcal{A}^i, \mathcal{R}^i\}_{i=1}^N \rangle$. $\mathcal{S}$, $\mathcal{T}$ and $\rho$ respectively are the set of all possible states, the transition function and the initial state distribution. While these are global properties of the environment, $\mathcal{O}^i$, $\mathcal{A}^i$ and $\mathcal{R}^i$ are individually defined for each agent $i$. They are respectively the observation functions, the sets of all possible actions and the reward functions. At each time-step $t$, the global state of the environment is given by $s_t \in \mathcal{S}$ and every agent’s individual action vector is denoted by $a_t^i \in \mathcal{A}^i$. To select its action, each agent has only access to its own observation vector $o_t^i$, which is extracted by its observation function $\mathcal{O}^i$ from the global state $s_t$. The initial global state $s_0$ is sampled from the initial state distribution $\rho$ and the next states of the environment are sampled from the probability distribution over the possible next states $\mathcal{T}(s_{t+1} \mid s_t, a_t^1, \dots, a_t^N)$ given by the transition function. Finally, at each time-step, each agent receives an individual scalar reward $r_t^i$ from its reward function $\mathcal{R}^i$. Agents aim at maximizing their expected discounted return $\mathbb{E}\left[\sum_{t=0}^{T} \gamma^t r_t^i\right]$ over the time horizon $T$, where $\gamma \in [0, 1]$ is a discount factor.
2.2 Multi-Agent Deep Deterministic Policy Gradient (MADDPG)
MADDPG (lowe2017multi) is an adaptation of the Deep Deterministic Policy Gradient algorithm (DDPG) (lillicrap2015continuous) to the multi-agent setting. It allows the training of cooperating and competing decentralized policies through the use of a centralized training procedure. In this framework, each agent $i$ possesses its own deterministic policy $\mu_i$ for action selection and critic $Q_i$ for state-action value estimation, which are respectively parametrized by $\theta_i$ and $\phi_i$. All parametric models are trained off-policy from previous transitions $(x, a, r, x')$ uniformly sampled from a replay buffer $\mathcal{D}$. Note that $x$ is the joint observation vector and $a$ is the joint action vector, obtained by concatenating the individual observation vectors $o_i$ and action vectors $a_i$ of all agents. Each centralized critic is trained to estimate the expected return for a particular agent $i$ using the Deep Q-Network (DQN) (mnih2015human) loss:

$$\mathcal{L}(\phi_i) = \mathbb{E}_{(x, a, r, x') \sim \mathcal{D}}\left[\left(Q_i^{\phi_i}(x, a) - y_i\right)^2\right], \qquad y_i = r_i + \gamma\, Q_i^{\phi_i'}(x', a')\Big|_{a_j' = \mu_j'(o_j')}$$
For a given set of weights $\theta$, we define its target counterpart $\theta'$, updated as $\theta' \leftarrow \tau\theta + (1 - \tau)\theta'$ where $\tau$ is a hyper-parameter. Each policy $\mu_i$ is updated to maximize the expected discounted return of the corresponding agent $i$:

$$\nabla_{\theta_i} J(\mu_i) = \mathbb{E}_{x \sim \mathcal{D}}\left[\nabla_{\theta_i}\mu_i(o_i)\;\nabla_{a_i} Q_i^{\phi_i}(x, a_1, \dots, a_N)\Big|_{a_j = \mu_j(o_j)}\right]$$
By taking into account all agents’ observation-action pairs when guiding an agent’s policy, the value-functions are trained in a centralized, stationary manner, despite the multi-agent setting. In addition, this mechanism can allow the agents to implicitly learn coordinated strategies that can then be deployed in a decentralized way. However, this procedure does not encourage the discovery of coordinated strategies, since high-reward behaviors have to be randomly experienced through unguided exploration. This work aims at alleviating this limitation.
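To make the centralized critic update concrete, the following minimal numpy sketch computes the TD target and squared TD error for two agents, followed by the Polyak soft update of the target weights. The linear critics, dimensions and sampled transition are hypothetical stand-ins for the deep networks and replay buffer used in practice.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions: 2 agents, joint observation of size 6, joint action of size 4.
obs_dim, act_dim, gamma, tau = 6, 4, 0.95, 0.01

# Stand-in linear critics Q_i(x, a) = w_i . [x; a] (deep networks in a real implementation).
w = rng.normal(size=(2, obs_dim + act_dim))   # online critic weights, one row per agent
w_target = w.copy()                           # target critic weights, initialized to online

def q_value(weights, joint_obs, joint_act):
    """Centralized state-action value from the concatenated joint inputs."""
    return weights @ np.concatenate([joint_obs, joint_act])

# One sampled transition (x, a, r, x') from the replay buffer.
x = rng.normal(size=obs_dim)
a = rng.normal(size=act_dim)
r = np.array([1.0, 1.0])                      # per-agent rewards
x_next = rng.normal(size=obs_dim)
a_next = rng.normal(size=act_dim)             # target policies' joint action at x'

# TD target for each agent's critic: y_i = r_i + gamma * Q_i'(x', a').
y = np.array([r[i] + gamma * q_value(w_target[i], x_next, a_next) for i in range(2)])

# Squared TD errors that the DQN-style loss minimizes.
td_loss = np.array([(q_value(w[i], x, a) - y[i]) ** 2 for i in range(2)])

# Polyak soft update of the target weights.
w_target = tau * w + (1 - tau) * w_target
```

In practice the TD error is averaged over a mini-batch and the policies producing `a_next` are the target actors, but the per-transition arithmetic is the same.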
3 Related Work
Many works in MARL consider explicit communication channels between the agents and distinguish between communicative actions (e.g. broadcasting a given message) and physical actions (e.g. moving in a given direction) (foerster2016learning; mordatch2018emergence; lazaridou2016multi). Consequently, they often focus on the emergence of language, considering tasks where the agents must discover a common communication protocol in order to succeed. Deriving a successful communication protocol can already be seen as coordination in the communicative action space and can enable, to some extent, successful coordination in the physical action space (ahilan2019feudal). Yet, explicit communication is not a necessary condition for coordination as agents can rely on physical communication (mordatch2018emergence; gupta2017cooperative).
Approaches to shape RL agents’ behaviors with respect to other agents have also been explored. strouse2018learning use the mutual information between the agent’s policy and a goal-independent policy to shape the agent’s behavior towards hiding or spelling out its current goal. However, this approach is only applicable for tasks with an explicit goal representation and is not specifically intended for coordination. jaques2018intrinsic approximate the direct causal effect between agent’s actions and use it as an intrinsic reward to encourage social empowerment. This approximation relies on an agent modelling the other agents’ policies to predict its effect on them. In our work, agent modelling focuses on maximising predictability between agents.
Finally, barton2018measuring propose convergent cross mapping (CCM) to measure the degree of effective coordination between two agents. Although this may represent an interesting avenue for behavior analysis, it fails to provide a tool for effectively enforcing coordination, as CCM must be computed over long time series, which makes it an impractical learning signal for single-step temporal difference methods. In this work, we design two coordination-driven multi-agent approaches that do not rely on the existence of explicit communication channels and that carry the learned coordinated behaviors over to test time, when all agents act in a decentralized fashion.
4 Coordination and Policy
Intuitively, coordination can be defined as an agent’s behavior being informed by that of another agent, i.e. structure in the agents’ interactions. Namely, a team in which agents act independently of one another would not be coordinated. To promote such structure, our proposed methods rely on team objectives that regularize the common policy gradient update. In this regard, our approach is closely related to General Value Functions and auxiliary tasks (sutton2018reinforcement) used in Deep RL to learn efficient representations (jaderberg2018human; hern2019agent). However, this work’s novelty lies in explicitly biasing agents’ policies towards either predictability for their teammates or synchronized sub-policy selection. Pseudocode for our implementations is provided in Appendix C (see Algorithms 1 and 2).
4.1 Team regularization
The structure of coordinated interactions can be leveraged to attain a certain degree of predictability of one agent’s behavior with respect to its teammate(s). We hypothesize that the reciprocal also holds, i.e. that promoting agents’ predictability could foster such team structure and lead to more coordinated behaviors. This assumption is cast into the decentralized framework by training agents to predict their teammates’ actions given only their own observation. For continuous control, the loss is defined as the Mean Squared Error (MSE) between the predicted and true actions of the teammates, yielding a teammate-modelling secondary objective similar to the models of other agents used in jaques2018intrinsic and hern2019agent and often referred to as agent modelling (schadd2007opponent). Most previous works such as hern2019agent focus on stationary, non-learning teammates and use this approach exclusively to improve the richness of the learned representations. In our case however, the same objective is also used to drive the teammates’ behaviors closer to the predictions by leveraging a differentiable action selection mechanism. We therefore call this the team-spirit objective between agents $i$ and $j$:

$$\mathcal{L}_{TS}^{i,j} = \mathrm{MSE}\left(\hat{\mu}_i^{\,j}(o_i),\; \mu_j(o_j)\right)$$
where $\hat{\mu}_i^{\,j}$ is the policy head of agent $i$ trying to predict the action of agent $j$. The total gradient for a given agent $i$ becomes:

$$\nabla_{\theta_i} J_i^{\text{TeamReg}} = \nabla_{\theta_i} J(\mu_i) - \lambda_1 \nabla_{\theta_i} \sum_{j \neq i} \mathcal{L}_{TS}^{i,j} - \lambda_2 \nabla_{\theta_i} \sum_{j \neq i} \mathcal{L}_{TS}^{j,i}$$
where $\lambda_1$ and $\lambda_2$ are hyper-parameters that respectively weight how well an agent should predict its teammates’ actions, and how predictable an agent should be for its teammates. We call this dual regularization from team-spirit objectives TeamReg. Figure 1 summarizes these interactions.
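The team-spirit computation can be sketched as follows for a two-agent team. The linear policies, prediction heads and coefficient values (named `lambda1`/`lambda2` here) are illustrative assumptions standing in for the deep networks and tuned hyper-parameters of the actual method.

```python
import numpy as np

rng = np.random.default_rng(1)

obs_dim, act_dim = 4, 2   # hypothetical per-agent sizes

# Stand-in linear policies and prediction heads (deep networks in the actual method).
pi = [rng.normal(size=(act_dim, obs_dim)) for _ in range(2)]    # pi[i]: agent i's policy
pred = [rng.normal(size=(act_dim, obs_dim)) for _ in range(2)]  # pred[i]: agent i's head
                                                                # predicting its teammate

obs = [rng.normal(size=obs_dim) for _ in range(2)]
actions = [pi[i] @ obs[i] for i in range(2)]

def mse(u, v):
    return float(np.mean((u - v) ** 2))

# Team-spirit losses: agent i predicts agent j's action from its OWN observation only.
ts_01 = mse(pred[0] @ obs[0], actions[1])   # agent 0 modelling agent 1
ts_10 = mse(pred[1] @ obs[1], actions[0])   # agent 1 modelling agent 0

# For agent 0: lambda1 weights how well it predicts its teammate, and lambda2 weights
# how predictable its own action is to the teammate (ts_10 depends on agent 0's policy).
lambda1, lambda2 = 0.1, 0.1
teamreg_penalty_agent0 = lambda1 * ts_01 + lambda2 * ts_10
```

In the full method this penalty is differentiated along with the return-maximization objective, which is what pulls the true actions toward the predictions rather than only the predictions toward the actions.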
4.2 Coach regularization
In order to foster structured agents interactions, this method aims at teaching the agents to recognize different situations and synchronously select corresponding sub-behaviors.
4.2.1 Sub-policy selection
Firstly, to enable explicit sub-behavior selection, we propose policy masks that modulate the agents’ policy. A policy mask $u^j$ is a one-hot vector of size $K$ with its $j$-th component set to one. In practice, we use policy masks to perform dropout (srivastava2014dropout) in a structured manner on $h_1$, the pre-activations of the first hidden layer of the policy network. To do so, we construct the vector $u_h$, the concatenation of $\dim(h_1)/K$ copies of $u^j$, in order to reach the dimensionality of $h_1$. The element-wise product $h_1 \odot u_h$ is then performed, so that only the units of $h_1$ at indices congruent to $j$ modulo $K$ are kept. In our contribution, each agent $i$ generates its own policy mask $u^{j_i}$ from its observation $o_i$. Here, a simple linear layer is used to produce a categorical probability distribution from which the one-hot vector is sampled:

$$j_i \sim \mathrm{Categorical}\left(\mathrm{softmax}(W_i\, o_i)\right), \qquad u^{j_i} = \mathrm{onehot}(j_i)$$
To our knowledge, while this method bears similarity to the options and hierarchical frameworks (sutton2018reinforcement; ahilan2019feudal) and to policy dropout for exploration (xie2018nadpex), it is the first to introduce an agent-induced modulation of the policy network through a structured dropout that is decentralized at evaluation time and requires no explicit communication channel. Although the policy masking mechanism enables agents to swiftly switch between sub-policies, it does not encourage them to synchronously modulate their behavior.
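The structured-dropout masking described above can be sketched as follows; the hidden width, mask count and layer values are illustrative, and the sketch assumes the hidden width is divisible by the number of masks.

```python
import numpy as np

K = 4                    # number of available policy masks / sub-policies
hidden_dim = 12          # first hidden layer width; assumed divisible by K

rng = np.random.default_rng(2)
h = rng.normal(size=hidden_dim)          # pre-activations of the first policy layer

def apply_policy_mask(h, j, K):
    """Keep only the units whose index is congruent to j modulo K, zeroing the rest."""
    u = np.zeros(K)
    u[j] = 1.0                           # one-hot policy mask u^j
    tiled = np.tile(u, len(h) // K)      # concatenate copies of u^j up to len(h)
    return h * tiled                     # element-wise product: structured dropout

masked = apply_policy_mask(h, j=1, K=K)
# Exactly hidden_dim / K units survive (indices 1, 5, 9 here); the rest are zeroed.
```

Each one-hot choice of `j` thus carves out a disjoint slice of the hidden layer, which is what makes the K sub-policies distinct while sharing the rest of the network.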
4.2.2 Synchronous sub-policy selection
To promote synchronization, we introduce the coach entity, parametrized by $\psi$, which learns to produce policy masks from the joint observations, i.e. a coach mask $u^{j_c}$ with $j_c$ sampled from a categorical distribution conditioned on $x$. The coach is used at training time only and drives the agents toward synchronously selecting the same behavior mask. In other words, the coach is trained to output masks that (1) yield high returns when used by the agents and (2) are predictable by the agents. Similarly, each agent is regularized so that (1) its private mask matches the coach’s mask and (2) it derives efficient behavior when using the coach’s mask. At evaluation time, the coach is removed and the agents only rely on their own policy masks. The policy gradient loss when agent $i$ is provided with the coach’s mask is given by:

$$\nabla_{\theta_i} J_i^c = \mathbb{E}_{x \sim \mathcal{D}}\left[\nabla_{\theta_i}\mu_i(o_i;\, u^{j_c})\;\nabla_{a_i} Q_i^{\phi_i}(x, a)\Big|_{a_i = \mu_i(o_i;\, u^{j_c})}\right]$$
The difference between the mask distribution of agent $i$ and the coach’s one is measured by the Kullback–Leibler divergence:

$$\mathcal{L}_{KL}^i = D_{KL}\left(p_c(\cdot \mid x)\;\big\|\;p_i(\cdot \mid o_i)\right)$$

where $p_c$ and $p_i$ are the coach’s and agent $i$’s categorical mask distributions.
The total gradient for agent $i$ is:

$$\nabla_{\theta_i} J_i^{\text{CoachReg}} = \nabla_{\theta_i} J(\mu_i) + \lambda_1 \nabla_{\theta_i} J_i^c - \lambda_2 \nabla_{\theta_i} \mathcal{L}_{KL}^i$$
with $\lambda_1$ and $\lambda_2$ the regularization coefficients. Similarly, the coach is trained with the following dual objective, weighted by the coefficient $\lambda_3$:

$$\nabla_{\psi} J_c = \sum_i \nabla_{\psi}\left(J_i^c - \lambda_3\, \mathcal{L}_{KL}^i\right)$$
In order to propagate gradients through the sampled policy mask we reparametrized the categorical distribution using the Gumbel-softmax trick (jang2016categorical) with a temperature of 1. We call this coordinated sub-policy selection regularization CoachReg and illustrate it in Figure 2.
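The two mechanical pieces of CoachReg, the Gumbel-softmax relaxation of the mask sample and the KL term pulling the agent's mask distribution toward the coach's, can be sketched in numpy as follows. The logits, helper names and the straight-through hardening step are illustrative assumptions of this sketch.

```python
import numpy as np

rng = np.random.default_rng(3)
K = 4   # number of policy masks

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def gumbel_softmax(logits, temperature=1.0):
    """Differentiable relaxation of a categorical sample (jang2016categorical)."""
    u = np.clip(rng.random(size=logits.shape), 1e-12, 1 - 1e-12)
    g = -np.log(-np.log(u))                       # Gumbel(0, 1) noise
    return softmax((logits + g) / temperature)

def kl(p, q, eps=1e-8):
    """KL divergence between two categorical distributions."""
    return float(np.sum(p * (np.log(p + eps) - np.log(q + eps))))

coach_logits = rng.normal(size=K)   # from the joint observation (training time only)
agent_logits = rng.normal(size=K)   # from the agent's private observation

p_coach = softmax(coach_logits)
p_agent = softmax(agent_logits)

# Regularizer pulling the agent's mask distribution toward the coach's.
kl_term = kl(p_coach, p_agent)

# Relaxed sample (temperature 1, as in the text) lets gradients flow through the mask;
# a hard one-hot can be recovered for execution.
soft_mask = gumbel_softmax(agent_logits, temperature=1.0)
hard_mask = np.eye(K)[np.argmax(soft_mask)]
```

At evaluation time only the agent-side sampling remains, since the coach (and hence the KL term) is dropped.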
5 Training environments
All of our tasks are based on the OpenAI multi-agent particle environments (mordatch2018emergence). SPREAD and CHASE were introduced by lowe2017multi. We use SPREAD as-is but with sparse rewards only. CHASE is modified so that the prey is controlled by repulsion forces and only the predators are learnable, as we wish to focus on coordination in cooperative tasks. Finally, we introduce COMPROMISE and BOUNCE, in which agents are explicitly tied together. While a non-zero return can be achieved in these tasks by selfish agents, they all benefit from coordinated strategies, and the optimal return can only be achieved by agents working closely together. Figure 3 presents visualizations and a brief description of all four tasks. A detailed description is provided in Appendix A. In all tasks, agents receive as observation their own global position and velocity as well as the relative position of other entities. Note that works showcasing experiments on these environments often use discrete action spaces and (dense) reward shaping (e.g. based on proximity to the objective) (iqbal2018actor; lowe2017multi; jiang2018learning). However, in our experiments, agents learn with continuous action spaces and from sparse rewards.
6 Results and Discussion
The proposed methods offer a way to incorporate new inductive biases into CTDE multi-agent policy search algorithms. In this work, we evaluate them by extending MADDPG, a state-of-the-art algorithm widely used in the MARL literature. We compare against vanilla MADDPG as well as two of its variations on the four multi-agent tasks described in Section 5. The first variation (DDPG) is the single-agent counterpart of MADDPG (decentralized training). The second (MADDPG + sharing) shares the policy and value-function models across agents.
To offer a fair comparison between all methods, the hyper-parameter search routine is the same for each algorithm and environment (see Appendix D.1). For each search-experiment (one per algorithm per environment), 50 randomly sampled hyper-parameter configurations, each using 3 training seeds (a total of 150 runs), are used to train the models for episodes. For each algorithm-environment pair, we then select the best hyper-parameter configuration for the final comparison and retrain it on 10 seeds for twice as long. We give more details about the training setup and model selection in Appendix B and D.2. The results of the hyper-parameter searches are given in Appendix D.5.
6.1 Asymptotic Performance
From the average learning curves reported in Figure 4 we observe that CoachReg significantly improves performance on three environments (SPREAD, BOUNCE and COMPROMISE) and performs on par with the baselines on the last one (CHASE). The same can be said for TeamReg, except on COMPROMISE, the only task with an adversarial component, where it significantly underperforms compared to the other algorithms. We discuss this specific case in Section 6.3. Finally, parameter sharing is the best performing choice on CHASE, yet this superiority is restricted to this task where the optimal play is to move symmetrically and squeeze the prey into a corner.
6.2 Ablation study
In addition to our two proposed algorithms and the three baselines, we present results for two ablated versions of our methods. The first ablation (MADDPG + agent modelling) is similar to TeamReg but with $\lambda_2 = 0$, which results in only enforcing agent modelling (i.e. agent predictability is not encouraged). The second ablation (MADDPG + policy mask) is structurally equivalent to CoachReg but trained without the coach, which means that agents still predict and apply a mask to their own policy, but synchronicity is not encouraged. Figures 12 and 13 (Appendix D.6) present the results of the corresponding hyper-parameter search and Figure 5 shows the learning curves for our full regularization approaches, their respective ablated versions and MADDPG.
The use of unsynchronized policy masks might result in swift and unpredictable behavioral changes and make it difficult for agents to perform together and coordinate. Experimentally, “MADDPG + policy mask” performs similarly to or worse than MADDPG on all but one environment, and never outperforms the full CoachReg approach. However, policy masks alone seem sufficient to succeed on SPREAD, which is essentially about selecting a landmark from a set. Regarding “MADDPG + agent modelling”, it does not drastically improve on MADDPG apart from on the SPREAD environment, and the full TeamReg approach improves over its ablated version on all tasks except COMPROMISE, which we discuss in Section 6.3.
6.3 Effects of enforcing predictable behavior
First, we investigate the reason for TeamReg’s poor performance on COMPROMISE. Then, we analyse how TeamReg might be helpful in other environments.
COMPROMISE is the only task with a competitive component (and the only one in which agents do not share their rewards). Since the two agents are linked, a good policy has both agents reach their landmarks successively (perhaps by simply having both agents navigate towards the closest landmark). However, if one agent never reaches for its landmark, the optimal strategy for the other one becomes to drag it around and always go for its own landmark, leading to a strong imbalance in the return accumulated by the two agents. While this scenario very rarely occurs for the other algorithms, we found TeamReg to often lead to such domination cases (see Figure 14 in Appendix E). Figure 6 depicts the agents’ performance difference for all 150 runs of the hyper-parameter search for TeamReg and the baselines, and shows (1) that TeamReg is the only algorithm that leads to large imbalances in performance between the two agents and (2) that the cases where one agent becomes dominant are all associated with high values of $\lambda_2$, the coefficient that drives the agents to behave in a predictable fashion to one another. However, the dominated agent eventually gets exposed more and more to the sparse reward gathered by being dragged (by chance) onto its own landmark, picks up the goal of the task and starts pulling in its own direction, which causes the average return over agents to drop, as seen in Figure 4. This experiment demonstrates that using a predictability-based team regularization in a competitive task can be harmful; quite understandably, one might not want to optimize an objective that aims at making one’s behavior predictable to an opponent.
On SPREAD and BOUNCE, TeamReg significantly improves the performance over the baselines. Here we aim to analyze the effects of $\lambda_2$ on cooperative tasks and investigate whether it does make the agent-modelling task more successful (by encouraging the agents to be predictable). To this end, we compare the best performing hyper-parameter configuration for TeamReg on the SPREAD environment with its ablated versions. The average return and team-spirit loss defined in Section 4.1 are presented in Figure 7 for these three experiments.
Initially, due to the weight initialization, the predicted and actual actions both have relatively small norms, yielding small values of the team-spirit loss. As training goes on (1000 episodes), the norms of the action vectors increase and the regularization loss becomes more important. As expected, the unregularized variant ($\lambda_1 = \lambda_2 = 0$) leads to the highest team-spirit loss, as it is not trained to predict the actions of other agents correctly. When using only the agent-modelling objective ($\lambda_1 > 0$, $\lambda_2 = 0$), the agents significantly decrease the team-spirit loss, but it never reaches values as low as when using the full TeamReg objective. Finally, when also pushing agents to be predictable ($\lambda_2 > 0$), the agents best predict each other’s actions and performance is also improved. We also notice that the team-spirit loss increases when performance starts to improve, i.e. when agents start to master the task (8000 episodes). Indeed, once the reward-maximisation signal becomes stronger, the relative importance of the secondary task is reduced. We hypothesize that being predictable with respect to one another may push agents to explore in a more structured and informed manner in the absence of reward signal, as similarly pursued by intrinsic motivation approaches (chentanez2005intrinsically).
6.4 Analysis of synchronous sub-policy selection
Figure 8: (a) The ball is on the left side of the target and both agents select the purple policy mask. (b) The ball is on the right side of the target and both agents select the green policy mask.
In this section, we aim at experimentally verifying that CoachReg yields the desired behavior: agents synchronously alternating between varied sub-policies. Special attention is given to cases where the sub-policies are interpretable. To this end, we record and analyze the agents’ policy masks on 100 different episodes for each task.
From the collected masks, we reconstructed the empirical mask distribution of each agent (see Figure 15 in Appendix F.1), whose entropy provides an indication of the mask diversity used by a given agent. Figure 9 (a) shows the mean entropy for each environment compared to the entropy of Categorical Uniform Distributions of size $K$ ($K$-CUD). It shows that, on all the environments, agents use at least two distinct masks, as indicated by their non-zero entropy. In addition, agents tend to alternate between masks with more variety (close to uniformly switching between 3 masks) on SPREAD (where there are 3 agents and 3 goals) than on the other environments (comprised of 2 agents). To test whether agents synchronously select the same policy mask at test time (without a coach), we compute the Hamming proximity between the agents’ mask sequences as $1 - h$, where $h$ is the Hamming distance, i.e. the number of timesteps where the two sequences differ divided by the total number of timesteps. From Figure 9 (b) we observe that agents produce similar mask sequences. Notably, their mask sequences are significantly more similar than those of two agents randomly choosing between two masks at each timestep. Finally, we observe that some settings result in the agents coming up with interesting strategies, like the one depicted in Figure 8 where the agents alternate between two sub-policies depending on the position of the target. More cases where the agents change sub-policies during an episode are presented in Appendix F.1. These results indicate that, in addition to improving the performance on coordination tasks, CoachReg indeed yields the expected behaviors. An interesting direction for future work would be to use entropy regularization to increase the mask usage variety and mutual information to further disentangle the sub-policies.
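The two statistics used in this analysis, the entropy of the empirical mask distribution and the Hamming proximity between mask sequences, can be computed as follows; the example sequences are made up for illustration.

```python
import numpy as np

def mask_entropy(mask_sequence, K):
    """Entropy (in nats) of the empirical distribution over the K mask indices."""
    counts = np.bincount(mask_sequence, minlength=K)
    p = counts / counts.sum()
    nz = p[p > 0]                      # 0 * log(0) is taken to be 0
    return float(-(nz * np.log(nz)).sum())

def hamming_proximity(seq_a, seq_b):
    """1 - h, with h the fraction of timesteps where the two mask sequences differ."""
    seq_a, seq_b = np.asarray(seq_a), np.asarray(seq_b)
    return float(1.0 - np.mean(seq_a != seq_b))

# Hypothetical mask index sequences from two agents over a short episode.
agent1 = np.array([0, 0, 2, 2, 2, 1, 1, 0])
agent2 = np.array([0, 0, 2, 2, 1, 1, 1, 0])

h_prox = hamming_proximity(agent1, agent2)   # sequences differ at a single timestep
ent = mask_entropy(agent1, K=4)              # bounded above by log(K) for K-CUD
```

The $K$-CUD reference values in the text correspond to `ent` reaching `log(K)` and two independent uniform samplers reaching a proximity of `1/K` in expectation.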
6.5 Robustness to hyper-parameters
Stability across hyper-parameter configurations is a recurring challenge in Deep RL. The average performance for each sampled configuration allows us to empirically evaluate the robustness of an algorithm w.r.t. its hyper-parameters. We share the full results of the hyper-parameter searches in Figures 10, 11, 12 and 13 in Appendix D.5 and D.6. Figure 11 shows that while most algorithms can perform reasonably well with the correct configuration, our proposed coordination regularizers can improve robustness to hyper-parameters despite having more hyper-parameters to search over. Such robustness can be of great value with limited computational budgets.
In this work we introduced two policy regularization methods to promote multi-agent coordination within the CTDE framework: TeamReg, which is based on inter-agent action predictability, and CoachReg, which relies on synchronized behavior selection. A thorough empirical evaluation of these methods showed that they significantly improve asymptotic performance on cooperative multi-agent tasks. Interesting avenues for future work would be to study the proposed regularizations on other policy search methods as well as to combine both incentives and investigate how the two coordinating objectives interact. Finally, a limitation of the current formulation is that it relies on single-step metrics, which simplifies off-policy learning but also limits the longer-term coordination opportunities. A promising direction is thus to explore model-based planning approaches to promote long-term multi-agent interactions.
We wish to thank Olivier Delalleau for providing insightful feedback and comments as well as Fonds de Recherche Nature et Technologies (FRQNT), Ubisoft Montréal and Mitacs for providing part of the funding for this work.
Appendix A Tasks descriptions
SPREAD (Figure 3a): In this environment, there are 3 agents (small orange circles) and 3 landmarks (bigger gray circles). At every timestep, agents receive a team-reward $r_t = n_t - c_t$ where $n_t$ is the number of landmarks occupied by at least one agent and $c_t$ the number of collisions occurring at that timestep. To maximize their return, agents must therefore spread out and cover all landmarks. Initial agents’ and landmarks’ positions are random. Termination is triggered when the maximum number of timesteps is reached.
BOUNCE (Figure 3b): In this environment, two agents (small orange circles) are linked together by a spring that pulls them toward each other when stretched beyond its relaxation length. At the episode’s mid-time, a ball (smaller black circle) falls from the top of the environment. Agents must position themselves so as to have the ball bounce off the spring towards the target (bigger beige circle), which turns yellow if the ball’s bouncing trajectory passes through it. They receive a team-reward of if the ball reflects towards the side walls, if the ball reflects towards the top of the environment, and if the ball reflects towards the target. At initialisation, the target’s and ball’s vertical positions are fixed and their horizontal positions are random. Agents’ initial positions are also random. Termination is triggered when the ball is bounced by the agents or when the maximum number of timesteps is reached.
COMPROMISE (Figure 3c): In this environment, two agents (small orange circles) are linked together by a spring that pulls them toward each other when stretched beyond its relaxation length. Each has a distinct assigned landmark (light gray circle for the light orange agent, dark gray circle for the dark orange agent), and receives a reward of when reaching it. Once a landmark is reached by its corresponding agent, the landmark is randomly relocated in the environment. Initial positions of agents and landmarks are random. Termination is triggered when the maximum number of timesteps is reached.
CHASE (Figure 3d): In this environment, two predators (orange circles) are chasing a prey (turquoise circle). The prey moves according to a scripted policy consisting of repulsion forces from the walls and predators. At each timestep, the learning agents (predators) receive a team-reward $r_t = n_t$ where $n_t$ is the number of predators touching the prey. The prey has a greater maximum speed and acceleration than the predators. Therefore, to maximize their return, the two agents must coordinate in order to squeeze the prey into a corner or against a wall and effectively trap it there. Termination is triggered when the maximum number of timesteps is reached.
Appendix B Training details
In all of our experiments, we use the Adam optimizer (kingma2014adam) to perform parameter updates. All models (actors and critics) are parametrized by feedforward networks containing two hidden layers of units. We use the Rectified Linear Unit (ReLU) (nair2010rectified) as activation function and layer normalization (ba2016layer) on the pre-activations to stabilize the learning. We use a buffer-size of entries and a batch-size of . We collect transitions by interacting with the environment for each learning update. For all tasks in our hyper-parameter searches, we train the agents for episodes of steps and then re-train the best configuration for each algorithm-environment pair for twice as long ( episodes) to ensure full convergence for the final evaluation. The scale of the exploration noise is kept constant for the first half of the training time and then decreases linearly to until the end of training. We use a discount factor of and a gradient clipping threshold of in all experiments. Finally, for CoachReg, we fixed the number of policy masks to 4, meaning that agents could choose between 4 sub-policies.
Appendix C Algorithms
Appendix D Hyper-parameter search
d.1 Hyper-parameter search ranges
We perform searches over the following hyper-parameters: the learning rate of the actor, the learning rate of the critic relative to the actor, the target-network soft-update parameter $\tau$ and the initial scale of the exploration noise for the Ornstein-Uhlenbeck noise generating process (uhlenbeck1930theory) as used by lillicrap2015continuous. When using TeamReg and CoachReg, we additionally search over the regularization weights $\lambda_1$, $\lambda_2$ and $\lambda_3$. The learning rate of the coach is always equal to the actor’s learning rate, motivated by their similar architectures and learning signals and in order to reduce the search space. Table 1 shows the ranges from which values for the hyper-parameters are drawn uniformly during the searches.
d.2 Model selection
During training, a policy is evaluated on a set of 10 different episodes every 100 learning steps. At the end of the training, the model at the best evaluation iteration is saved as the best version of the policy for this training, and is re-evaluated on 100 different episodes to have a better assessment of its final performance. The performance of a hyper-parameter configuration is defined as the average performance (across seeds) of the policies learned using this set of hyper-parameter values.
d.3 Selected hyper-parameters
d.4 Selected hyper-parameters (ablations)
|Hyper-parameter||MADDPG+Agent Modelling||MADDPG+Policy Mask|
d.5 Hyper-parameter search results
The performance of each hyper-parameter configuration is reported in Figure 10, yielding the performance distribution across hyper-parameter configurations for each algorithm on each task. The same distributions are depicted in Figure 11 using box-and-whisker plots. It can be seen that TeamReg and CoachReg both boost the performance of the third quartile, suggesting an increase in robustness across hyper-parameters.
d.6 Hyper-parameter search results (ablations)
From Figure 13, it seems that the “policy mask” and “agent modelling” additions respectively provide nearly the same robustness boost as CoachReg and TeamReg.
Appendix E The effects of enforcing predictability (additional results)
Appendix F Analysis of sub-policy selection (additional results)
f.1 Mask densities
We depict in Figure 15 the mask distribution of each agent for each (seed, environment) experiment. Firstly, in most of the experiments, agents use at least 2 different masks. Secondly, for a given experiment, agents’ distributions are very similar, suggesting that they use the same masks in the same situations and are therefore synchronized. Finally, agents collapse to using only one mask more often on CHASE, where they also display more dissimilarity between one another. This may explain why CHASE is the only task where CoachReg does not improve performance. Indeed, on CHASE, agents do not seem synchronized nor to leverage multiple sub-policies, which are the coordination priors behind CoachReg. In brief, we observe that CoachReg is less effective at enforcing these coordination priors on CHASE, an environment where it neither boosts nor harms performance.
f.2 Episodes roll-outs
We render here some episode roll-outs in which the agents synchronously switch between policy masks during an episode. In addition, the whole group selects the same mask as the one that would have been suggested by the coach.
f.3 Mask diversity and synchronicity (ablation)
As in Subsection 6.4, we report the mean entropy of the mask distribution and the mean Hamming proximity for the ablated “MADDPG + policy mask” and compare them to the full CoachReg. With “MADDPG + policy mask”, agents are not incentivized to use the same masks. Therefore, in order to assess whether they synchronously change policy masks, we computed, for each agent pair, seed and environment, the Hamming proximity for every possible mask equivalence (mask 3 of agent 1 corresponds to mask 0 of agent 2, etc.) and selected the equivalence that maximised the Hamming proximity between the two sequences.
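The mask-equivalence matching described above can be sketched as a search over relabelings of one agent's mask indices; the sequences below are made up for illustration.

```python
import itertools
import numpy as np

def hamming_proximity(a, b):
    """1 - h, with h the fraction of timesteps where the two sequences differ."""
    a, b = np.asarray(a), np.asarray(b)
    return float(1.0 - np.mean(a != b))

def best_equivalence_proximity(seq_a, seq_b, K):
    """Max Hamming proximity over all K! relabelings of agent b's mask indices."""
    seq_a, seq_b = np.asarray(seq_a), np.asarray(seq_b)
    best = 0.0
    for perm in itertools.permutations(range(K)):
        relabeled = np.asarray(perm)[seq_b]   # map each of b's mask indices through perm
        best = max(best, hamming_proximity(seq_a, relabeled))
    return best

# Agent 2 uses the "same" sub-policies as agent 1 but under swapped labels (1 <-> 3).
agent1 = np.array([0, 1, 1, 2, 3, 3, 0, 1])
agent2 = np.array([0, 3, 3, 2, 1, 1, 0, 3])

raw = hamming_proximity(agent1, agent2)                      # penalized by the labels
matched = best_equivalence_proximity(agent1, agent2, K=4)    # 1.0 after relabeling
```

This brute-force search is cheap for small K (here 4! = 24 permutations); a Hungarian assignment on the mask co-occurrence matrix would be a natural alternative for larger K.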
We can observe that while “MADDPG + policy mask” agents display a more diverse mask usage, their selection is less synchronized than with CoachReg. This is easily understandable, as the coach tends to reduce diversity in order to have all the agents agree on a common mask; on the other hand, this agreement enables the agents to synchronize their mask selection. In this regard, it should be noted that “MADDPG + policy mask” agents are more synchronized than agents independently sampling their masks from $K$-CUD, suggesting that, even in the absence of the coach, agents tend to synchronize their mask selection.