Learning Structured Communication for Multi-agent Reinforcement Learning
This work explores large-scale multi-agent communication mechanisms in a multi-agent reinforcement learning (MARL) setting. We summarize the general categories of communication topology in the MARL literature, which are often manually specified. We then propose a novel framework, termed Learning Structured Communication (LSC), that uses a more flexible and efficient communication topology. Our framework allows for adaptive agent grouping to form different hierarchical formations over episodes, which is generated by an auxiliary task combined with a hierarchical routing protocol. Given each formed topology, a hierarchical graph neural network is learned to enable effective message generation and propagation in inter- and intra-group communication. In contrast to existing communication mechanisms, our method has an explicit yet learnable design for hierarchical communication. Experiments on challenging tasks show that the proposed LSC achieves high communication efficiency, scalability, and global cooperation capability.
Reinforcement learning (RL) has achieved remarkable success in solving single-agent sequential decision problems in interactive and complicated environments, such as games [13, 18] and robotics. In many real-world applications, such as intelligent transportation systems and unmanned systems, the learning task involves not one agent but usually a large number of agents. Such a setting naturally leads to multi-agent reinforcement learning (MARL) problems, where the key research challenges include how to design scalable and efficient learning schemes under a non-stationary environment (caused by partial observation and/or the dynamics of other agents’ policies), with large and/or dynamic problem dimension and complicated, uncertain relationships between agents.
Learning to communicate effectively among agents has been shown to be crucial for strengthening inter-agent collaboration and ultimately improving the quality of the policies learned by MARL. In this paper, we categorize the existing designs for communication topology into four patterns: i) fully-connected, ii) star, iii) tree, and iv) neighboring.
We further analyze the above communication topology patterns by considering the accessibility and comprehension of messages for effective communication. Both the fully-connected structure and the star structure ensure that messages are accessible to all agents. However, as discussed in ATOC, once a large number of messages emerge concurrently, extracting valuable information becomes difficult. The tree structure and the neighboring structure constrain communication to neighbors and are hence able to improve message comprehension. To achieve global accessibility, they define neighbors as the $k$-nearest agents and use multi-round communication. However, due to the lack of a pooling mechanism, DGN performs two rounds of convolution in communication; thus, only the information from agents up to two hops away is aggregated.
In this paper, we aim to improve both message accessibility and message comprehension for large-scale MARL by proposing the Learning Structured Communication (LSC) approach. Specifically, LSC contains a structured communication module and a communication-based policy module. It establishes a hierarchical communication structure for learning the communication pattern as well as the policy. In particular, a hierarchical communication structure (Fig. 1(e)) is established in a distributed fashion by a cluster-based routing protocol. So that the structure is formed dynamically for better cooperation, an auxiliary reinforcement learning task is designed to learn the communication weights in an end-to-end fashion. In the hierarchical structure, agents are divided into groups, and every group is assigned a high-level agent. We design an intra- and inter-group communication mechanism to achieve global communication efficiently: inter-group communication helps agents capture global information, while intra-group communication supports fine-grained message exchange. With these two modules, our experiments show that LSC achieves global communication efficiently. The main highlights of this paper are summarized below.
1) We summarize the four existing categories of communication topology in the MARL literature, namely i) fully-connected, ii) star, iii) tree, and iv) neighboring, which are in general manually specified and fixed. We believe this perspective is enlightening for the design of new communication topology, given the fact that it has not been well organized in the existing literature.
2) We develop a new hierarchical communication topology LSC, which differs from the existing four patterns. Our approach allows for an adaptive formation of agents by dynamically grouping agents via a reinforcement learning procedure combined with a routing protocol. The messages can be jointly extracted and propagated through both intra- and inter-group communications via a hierarchical graph neural network.
3) Experimental results show that the proposed LSC yields promising results on public benchmarks in terms of communication efficiency, scalability, and global cooperation capability.
To the best of our knowledge, this paper is the first work on hierarchical communication learning in MARL. We note that the idea of adopting hierarchical structure learning in MARL recently appeared in HAMA. The differences are fundamental: first, their hierarchical structure is used for learning agents’ relations, not for communication; second, their hierarchy design is fixed, rather than adaptive and dynamically learned as in this paper.
2 Related Work
These approaches try to let agents achieve consensus and cooperation directly from local observations, whereby a centralized training and decentralized execution (CTDE) framework is often used. Methods like MADDPG, QMIX, COMA, and MAAC concatenate all the agents’ observations and/or policies to obtain the state representation. This helps achieve better cooperation. However, the curse of dimensionality arises when a large number of agents are present. HAMA adopts a hierarchical graph attention network to leverage group relationships. However, the groups are clustered by predefined rules, which is not feasible for complex scenarios.
Learning-for-communication. In these approaches, the agents aim to achieve consensus and cooperation through communications. Agents need to learn to communicate with others and process the received messages to enhance collaboration. As mentioned, the communication topology of the existing methods can be categorized as i) fully-connected (FC); ii) star; iii) tree; and iv) neighboring.
Fully-connected structures assume that each agent communicates with all the other agents. DIAL learns what to communicate by back-propagating all the other agents’ gradients to the message generation network. SchedNet, which builds on DIAL, learns a weight-based scheduler to determine the communication priority, but its way of using the communication bandwidth is not scalable. Star structures assume that agents communicate only with a single central agent acting as the hub. CommNet aggregates all the agents’ hidden states as the global message and thus can only be applied to cooperative scenarios. Extending CommNet, IC3 adds a communication gate to decide whether each agent communicates. However, letting one agent handle all the messages in the star network causes a bottleneck at the central agent, both in communication bandwidth and in information extraction. Tree and neighboring structures constrain communication to neighbors and thus avoid the single-point bottleneck. The $k$-nearest neighbor mechanism is often used to define neighbors. However, agents can be distributed unevenly, so choosing a good $k$ is sometimes not easy in practical scenarios. ATOC adopts tree structures, whereby each group in a chain (thus not hierarchical) performs communication sequentially. Although inter-group communication can be achieved through the intersection of two groups, the large time complexity would be unbearable for real-time systems. To address the aforementioned difficulties, DGN uses neighboring structured communication together with a graph convolutional network (GCN). Multiple rounds of communication are adopted to enlarge the receptive field. As is common with GCNs, a shallow GCN without pooling layers can hardly capture rich global information, as discussed in H-GCN.
MARL with Graph Neural Network (GNN). GNNs are powerful in extracting relations among entities, with emerging applications in MARL. RFM designs an auxiliary action prediction task (predicting other agents’ actions) with graph networks, which can help agents learn interpretable intermediate representations. MAGNet uses heuristic rules to learn the relevance graph to help actor and critic learning. DGN learns the GCN together with the relation kernel by minimizing the TD error, which can be applied to dynamic multi-agent RL problems. HAMA adopts a hierarchical graph attention network based on a pre-defined hierarchical graph to help agents capture interrelations. The pre-defined and fixed grouping scheme used in HAMA limits its adaptability in dynamic scenarios.
In this paper, we aim to learn both the underlying topology (with a hierarchical prior design) and the message extraction and propagation process on top of it via GNNs. The message communication mechanism, the underlying topology, and the way of using GNNs in this paper are all novel and differ from existing works.
3 LSC: Learning Structured Communication
3.1 Preliminaries and Overview
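Partially Observable Stochastic Games (POSG). Agents learn policies by maximizing cumulative rewards through interaction with the environment and other agents. A POSG can be characterized as a tuple $\langle \mathcal{I}, \mathcal{S}, b^0, \mathcal{A}, \mathcal{O}, \mathcal{P}, \mathcal{P}_e, \mathcal{R} \rangle$, where $\mathcal{I}$ denotes the set of agents indexed from $1$ to $n$; $\mathcal{S}$ is the finite set of states; $b^0$ represents the initial state distribution; and $\mathcal{A}$ denotes the set of joint actions. $\mathcal{A}_i$ is the action space of agent $i$, and $\mathbf{a} = \langle a_1, \dots, a_n \rangle$ denotes a joint action; $\mathcal{O}$ denotes the set of joint observations, $\mathcal{O}_i$ is the observation space of agent $i$, and $\mathbf{o} = \langle o_1, \dots, o_n \rangle$ denotes a joint observation. $\mathcal{P}$ denotes the Markovian transition distribution, with $\mathcal{P}(s', \mathbf{o} \mid s, \mathbf{a})$ being the probability that state $s$ transits to $s'$ with result $\mathbf{o}$ after taking action $\mathbf{a}$. $\mathcal{P}_e(\mathbf{o} \mid s)$ is the Markovian observation emission probability. $\mathcal{R}: \mathcal{S} \times \mathcal{A} \rightarrow \mathbb{R}^n$ is the reward function, and $\mathbf{r} = \langle r_1, \dots, r_n \rangle$ denotes the joint reward of all agents. The overall task of MARL can be solved by proper objective modeling, which may indicate, e.g., cooperative, competitive, or mixed relationships among agents.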
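Deep Q-Learning. Deep Q-Network (DQN) is popular in deep RL, as it is one of the few RL methods applicable to large-scale MARL. At each step $t$, each agent observes state $s_t$ and takes an action $a_t$ based on policy $\pi$. It receives reward $r_t$ and next state $s_{t+1}$ from the environment. To maximize the cumulative reward after step $t$, $R_t = \sum_{k \ge 0} \gamma^{k} r_{t+k}$ with discount factor $\gamma$, DQN learns the action-value function $Q(s, a; \theta)$ by minimizing $\mathcal{L}(\theta) = \mathbb{E}\big[(y_t - Q(s_t, a_t; \theta))^2\big]$, where $y_t = r_t + \gamma \max_{a'} Q(s_{t+1}, a'; \theta')$ and $\theta'$ denotes the parameters of the target network. The agent follows an $\epsilon$-greedy policy: it selects the action that maximizes the Q-value with probability $1-\epsilon$, or a random action otherwise. Independent Deep Q-Learning (IDQN) extends DQN to the POSG by ignoring other agents: each agent learns a Q-function based on its own observation and reward. Our LSC extends DQN with hierarchical communication.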
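As an illustration of the two ingredients described above, the following minimal sketch shows an epsilon-greedy action choice and the one-step TD target for a tabular list of Q-values. The function names (`epsilon_greedy`, `td_target`, `squared_td_error`) and the `done` flag are our own illustrative assumptions, not part of the original method description.

```python
import random


def epsilon_greedy(q_values, epsilon, rng=random):
    """Return a random action with probability epsilon, else the greedy one."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    # Greedy action: index of the largest Q-value.
    return max(range(len(q_values)), key=lambda a: q_values[a])


def td_target(reward, next_q_values, gamma, done=False):
    """One-step target y = r + gamma * max_a' Q_target(s', a')."""
    if done:  # assumption: no bootstrapping from a terminal state
        return reward
    return reward + gamma * max(next_q_values)


def squared_td_error(q_sa, target):
    """Per-sample loss (y - Q(s, a))^2 that DQN minimizes in expectation."""
    return (target - q_sa) ** 2


if __name__ == "__main__":
    action = epsilon_greedy([0.1, 0.9, 0.3], epsilon=0.0)
    y = td_target(reward=1.0, next_q_values=[0.5, 2.0], gamma=0.9)
    print(action, y, squared_td_error(2.0, y))
```

In IDQN, each agent would apply these same steps to its own observation and reward, with network parameters typically shared across agents.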
Proposed Approach Overview. LSC takes the aforementioned formulation of POSG with the communication mechanism taken into account. Every agent needs to learn both action and communication policies. As discussed in MFRL , the Q-Learning family is more stable than on-policy methods in large-scale MARL. Hence, we choose DQN to learn the action policy with communication and form the global state perception.
To construct a hierarchical communication structure, LSC uses a flexible two-level communication topology, where agents are dynamically divided into high-level agents and low-level agents, indicated by the yellow and blue points in Fig. 1(e). The high-level agents are in charge of forming global perception and coordinating the low-level agents in their group. The low-level agents convey their local information to the high-level agents. LSC includes two key modules: i) the structured communication module and ii) the communication-based policy module, as shown in Fig. 2. The first module establishes the dynamic hierarchical communication topology in a distributed fashion, while the second module contains the GNN-based communication extraction and Q-network components.
3.2 Structured Communication Module
The structured communication module follows three design principles: 1) agents in the same group are more likely to understand each other and cooperate within the group; 2) high-level agents are more likely to capture the global perception through the exchanged messages; 3) high-level agents are distributed sparsely to lower the communication cost. According to ATOC and DGN, nearby agents are more likely to understand each other and cooperate. Thus, we use the local geometric relationship and the policy performance as our guide to establish the hierarchical structure, as shown in Fig. 3.
Specifically, two sub-modules are included: the weight generator and the Cluster-Based Routing Protocol (CBRP). The weight generator sub-module automatically determines the importance of communication for each agent. It is modeled by a neural network whose output weight measures the confidence of an agent to become a high-level agent. The CBRP sub-module then uses the weights of all agents, together with the local geometry, to construct the hierarchical communication network. The CBRP sub-module can be implemented in a distributed fashion, leading to a distributed election of high-level agents. This ensures the applicability of LSC to large-scale scenarios, as demonstrated in the experiments.
The CBRP method is a typical method for establishing a hierarchical routing structure. It takes the cluster radius as a hyper-parameter, and we define an agent’s receptive field as the area within the cluster radius. Each low-level agent checks whether its receptive field contains agents with larger weights or any high-level agent. If no such agent is found, this agent is elected as a high-level agent; otherwise, it remains a low-level agent. Meanwhile, each high-level agent checks whether other high-level agents exist in its receptive field. If none is found, or if the weights of the high-level agents found are smaller than its own, it remains a high-level agent; otherwise, it is downgraded to a low-level agent. After a sufficient number of rounds, the hierarchical structure is established with sparsity: no high-level agent lies within another high-level agent’s receptive field, which benefits communication efficiency. All agents are separated into groups, each with one high-level agent as the group leader. The overall hierarchical communication network is thus established by connecting high-level agents across groups and connecting each low-level agent to its high-level agent.
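The election rounds described above can be sketched as follows. This is a centralized simulation of the literal rules, written under several assumptions of ours: rounds are synchronous, ties between equal weights are broken in favor of the lower agent index, and each agent joins the nearest high-level agent (which may lie outside its receptive field). The names `cbrp_elect` and `assign_groups` are hypothetical and do not refer to the authors' implementation.

```python
import math


def cbrp_elect(positions, weights, radius, is_high=None, max_rounds=10):
    """Return a list of booleans marking which agents end up high-level."""
    n = len(positions)
    high = list(is_high) if is_high is not None else [False] * n
    # Priority key: larger weight wins; ties go to the lower index (assumption).
    key = [(weights[i], -i) for i in range(n)]
    # Receptive field: all other agents within the cluster radius.
    nbrs = [[j for j in range(n)
             if j != i and math.dist(positions[i], positions[j]) <= radius]
            for i in range(n)]
    for _ in range(max_rounds):
        new_high = []
        for i in range(n):
            if high[i]:
                # Stay high-level unless a stronger high-level agent is in range.
                stay = not any(high[j] and key[j] > key[i] for j in nbrs[i])
                new_high.append(stay)
            else:
                # Get elected only if no stronger agent and no leader is in range.
                elect = not any(high[j] or key[j] > key[i] for j in nbrs[i])
                new_high.append(elect)
        if new_high == high:  # fixed point reached
            break
        high = new_high
    return high


def assign_groups(positions, high):
    """Map every agent to its nearest high-level agent (assumption)."""
    leaders = [i for i, h in enumerate(high) if h]
    return {i: min(leaders, key=lambda l: math.dist(positions[i], positions[l]))
            for i in range(len(positions))}


if __name__ == "__main__":
    pos = [(0, 0), (1, 0), (2, 0), (10, 0), (11, 0)]
    high = cbrp_elect(pos, weights=[2, 1, 1, 1, 2], radius=1.5)
    print(high, assign_groups(pos, high))
```

In this toy layout, agents 0 and 4 become leaders of two well-separated groups. A truly distributed version would run the same per-agent check locally, using only messages from agents inside the receptive field.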
A naive way of designing the weight generator is simply to set a fixed weight for all agents. However, improper weights would result in a poor hierarchical communication network, which in turn makes the performance of the communication-based policy vary widely. The experimental results also suggest that the choice of weights has a non-negligible influence on performance, which motivates us to train these two modules end-to-end.
However, the CBRP sub-module is not differentiable, which means that gradients cannot be back-propagated from the communication-based policy module to the weight generator sub-module. Therefore, we introduce an auxiliary RL task for weight generation, in which the action of each agent is to choose its weight, using the same observation and reward as the original task as well as practical constraints. Hence, we obtain a closed-loop, task-driven way of generating communication weights. Specifically, the weight takes values in a discrete set of three levels. IDQN is chosen to implement the weight generator for simplicity. The loss for the weight generator sub-module is:
$$\mathcal{L}(\theta_w) = \mathbb{E}\Big[\sum_{i} \big(y_i^w - Q_w(o_i, w_i; \theta_w)\big)^2\Big],$$
where $y_i^w = r_i + \gamma \max_{w'} Q_w(o_i', w'; \theta_w')$, and $r_i$ denotes the reward for agent $i$.
3.3 Communication-based Policy Module
Once the communication network topology is determined, the communication-based policy module learns the communication message and generates a global collaboration policy. The communication-based policy module consists of two sub-modules: the GNN-based communication sub-module and the Q-Net policy sub-module. The former is used to learn the communication messages and further update overall state perceptions. The latter learns the policy based on the new state perceptions after efficient communication.
As illustrated in Fig. 4, the established hierarchical communication network can be represented by a directed graph. Its node set can be divided into the high-level node set and the low-level node set. For a high-level node, the node feature vector includes the embedding feature, the high-level node feature, and the global feature; for a low-level node, the node feature vector only includes the embedding feature. Each edge carries an edge feature vector. An update embedding function and an aggregate function are used to update and combine these features, respectively. As shown in Fig. 4 and detailed in Table 1, the overall GNN-based communication sub-module consists of three steps.
Step 1) Intra-group aggregation. In each group, the low-level agents embed their local information and send it to the associated high-level agent; the high-level agent aggregates the information from all its associated low-level agents and obtains the cluster perception.
Step 2) Inter-group sharing. Each high-level agent shares its cluster perception with the other high-level agents, and then aggregates all the messages received from them to obtain the global perception.
Step 3) Intra-group sharing. Each high-level agent communicates all its features to its associated low-level agents, while the low-level agents aggregate the information received from their high-level agent. The embedding features of both high-level and low-level agents are then updated.
|Type||Edge||Edge Update Scheme||Node Update Scheme|
|Step 1: intra-group aggregation||,||,|
|Step 2: inter-group sharing||,||,|
|Step 3: intra-group sharing||,||,||,|
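The three communication steps can be illustrated with a small sketch in which the learned update and aggregate functions are replaced by fixed stand-ins: mean pooling for aggregation and concatenation for the final update. These stand-ins, the inclusion of the leader's own embedding in its cluster perception, and the names `hierarchical_comm` and `leader_of` are assumptions made only for illustration.

```python
def mean(vectors):
    """Element-wise mean of equally sized vectors (stand-in aggregator)."""
    n = len(vectors)
    return [sum(v[k] for v in vectors) / n for k in range(len(vectors[0]))]


def hierarchical_comm(embeddings, leader_of):
    """embeddings: {agent: [float, ...]}; leader_of: {agent: its group leader}."""
    leaders = sorted(set(leader_of.values()))

    # Step 1: intra-group aggregation gives each leader a cluster perception.
    cluster = {}
    for l in leaders:
        members = [embeddings[i] for i in embeddings if leader_of[i] == l]
        cluster[l] = mean(members)

    # Step 2: inter-group sharing gives each leader a global perception.
    glob = {}
    for l in leaders:
        others = [cluster[m] for m in leaders if m != l]
        glob[l] = mean(others) if others else cluster[l]

    # Step 3: intra-group sharing; every agent concatenates its own embedding
    # with the cluster and global perceptions held by its leader.
    return {i: embeddings[i] + cluster[leader_of[i]] + glob[leader_of[i]]
            for i in embeddings}


if __name__ == "__main__":
    emb = {0: [1.0], 1: [3.0], 2: [5.0], 3: [7.0]}
    groups = {0: 0, 1: 0, 2: 2, 3: 2}  # agents 0 and 2 are the leaders
    print(hierarchical_comm(emb, groups))
```

After one pass, every agent holds information from all groups even though it only exchanged messages with its own leader, which is the property the hierarchical design relies on.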
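The GNN-based communication sub-module is modeled as a GNN with parameters $\theta_c$, while the subsequent Q-Net of each agent $i$ is parameterized by the shared parameters $\theta_Q$. The gradient can be back-propagated from the Q-Net to the graph neural network. As a result, the overall loss of the communication-based policy module is as follows:
$$\mathcal{L}(\theta_c, \theta_Q) = \mathbb{E}\Big[\sum_{i} \big(y_i - Q(o_i, a_i; \theta_c, \theta_Q)\big)^2\Big],$$
where $y_i = r_i + \gamma \max_{a'} Q(o_i', a'; \theta_c', \theta_Q')$, and $r_i$ is the reward for agent $i$. A soft updating scheme with update rate $\beta$ is used for the target parameters:
$$\theta_c' \leftarrow \beta \theta_c + (1 - \beta)\,\theta_c', \qquad \theta_Q' \leftarrow \beta \theta_Q + (1 - \beta)\,\theta_Q'.$$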
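A soft update of target parameters is commonly written as a convex combination of the online and target values. The sketch below assumes that common form; the rate name `beta` and the flat-list parameter representation are illustrative choices of ours rather than details taken from the paper.

```python
def soft_update(target_params, online_params, beta):
    """Move each target parameter a fraction beta toward its online value."""
    return [beta * w + (1.0 - beta) * t
            for t, w in zip(target_params, online_params)]


if __name__ == "__main__":
    target = [0.0, 1.0]
    online = [1.0, 1.0]
    for _ in range(3):
        target = soft_update(target, online, beta=0.1)
    print(target)  # the first entry drifts slowly from 0.0 toward 1.0
```

A small `beta` keeps the bootstrapping target nearly fixed between steps, which is the usual motivation for soft updates in Q-learning.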
The whole LSC procedure is given in Algorithm 1. The CBRP function establishes the structured communication network automatically and in a distributed manner, based on the learned importance weights. HCOMM denotes the communication-based policy module, which outputs the Q-values based on the GNN-based communication messages. The details of CBRP and HCOMM can be found in the Appendix.
3.4 Communication Efficiency Analysis
We discuss the communication efficiency of the different communication structures, assuming the peer-to-peer mode.
Table 2 compares the communication efficiency of different communication structures. In fully-connected (FC) structures, where each agent communicates with all the others, the message exchanging complexity is , and the max bandwidth for an agent is . In the star structure, agents only need to communicate with the central agent, and thus the message exchanging complexity decreases to . For the tree structure and the neighboring structure, agents only need to communicate with neighbors. We denote the number of groups as and the maximum number of agents in a group as . The tree structure and the neighboring structure need message exchanging complexity. The tree structure lets groups communicate sequentially and thus needs communication steps. Our hierarchical communication structure only needs low-level agents to communicate with their high-level agents, and high-level agents to communicate with each other. Thus the message exchanging complexity is . Since and increase mildly with , the proposed dynamic communication topology in LSC is suitable for large-scale MARL.
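To make the comparison concrete, the following sketch counts directed peer-to-peer messages in one communication round for three of the structures, together with the load (messages sent plus received) on the busiest agent. The counting conventions are assumptions of ours: every exchange is two-way, group sizes include the group leader, and the function name `message_counts` is hypothetical. It is not meant to reproduce the entries of Table 2.

```python
def message_counts(group_sizes):
    """Count messages per round for n agents split into the given groups."""
    n = sum(group_sizes)   # total number of agents
    k = len(group_sizes)   # number of groups, i.e. of high-level agents
    return {
        # Every agent sends one message to every other agent.
        "fully_connected": {"messages": n * (n - 1),
                            "max_load": 2 * (n - 1)},
        # Every non-hub agent sends to and receives from the hub.
        "star": {"messages": 2 * (n - 1),
                 "max_load": 2 * (n - 1)},
        # Low-level agents exchange with their leader; leaders exchange
        # with each other.
        "hierarchical": {"messages": 2 * (n - k) + k * (k - 1),
                         "max_load": 2 * (max(group_sizes) - 1) + 2 * (k - 1)},
    }


if __name__ == "__main__":
    for name, stats in message_counts([16, 16, 16, 16]).items():
        print(name, stats)
```

For 64 agents in four groups of 16, the hierarchical structure sends about as many messages as the star structure (132 versus 126) but lowers the load on the busiest agent from 126 to 36, which illustrates why the single-hub bottleneck is avoided.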
We choose MAgent and MPE as the environments for our experiments.
4.1 Task I: Large-scale Battle Game in MAgent
In this scenario (self-interested cooperative scenario), agents move and fight against enemies. The enemies have higher speed, higher attack power, and better stamina. Thus, agents need to form high-quality cooperation to wipe out enemies. The enemies are controlled by an IDQN  pretrained policy. To evaluate different methods, besides the learning curve, we choose some quantitative evaluation criteria of the battle game, like ‘Mean-reward’ (average per-step reward of all agents), ’’ (average number of kills per episode), ’’ (average number of deaths per episode) and ’’ (kill to death ratio ).
We first compare the performance of different communication structures. All the models are trained with and for episodes. Fig. 5(a) shows the learning curves. The solid lines and shaded areas denote the mean and variance, respectively. As seen, LSC performs better in terms of the converged mean reward. We further test the learned models for rounds, and the results are shown in Table 3. One can observe that LSC achieves a higher mean reward and a larger kill-death ratio. IDQN yields the lowest score due to its lack of a communication scheme. Since LSC has additional intra-group communication compared to LSC-nbor and additional inter-group communication between high-level agents compared to LSC-star, the improvement of LSC over these two baselines demonstrates the benefits of a hierarchical structure with intra- and inter-group communication.
To investigate the impact of the learned importance weight generator, we compare LSC with a counterpart that uses a basic fixed weight generator (all agents are set to the same communication weight) in the battle scenario. In Fig. 5(b), our learned weight generator significantly outperforms the fixed one (LSC-fix). This result indicates that the structured communication module and the communication-based policy module in LSC are strongly coupled. We also compare LSC with LSC-star-gate (which incorporates communication gates in the star structure, as in IC3). In Fig. 5(c), LSC outperforms LSC-star-gate, which suggests that the communication gate brings little benefit in this scenario.
Fig. 6 is presented to better illustrate the strategies learned by these algorithms. Specifically, LSC learns the encirclement and focused-fire strategies shown in Fig. 6(b) and Fig. 6(c). We find that the baselines can hardly handle the situation in which some agents are far away from the enemies. For example, in the situation of Fig. 6(a), the agents in the top right cannot know where to attack without communication. We let every learned model run from the initial state and find that only LSC learns to form an inter-group encirclement and wipe out the enemies (see Fig. 6(d)). IDQN tends to cooperate only within the visual range, so the agents that find no enemy move close to the wall to defend against attacks. Such local cooperation leads to failure, as shown in Fig. 6(e). The star and neighboring structures help agents form global cooperation. However, the agents far away from the majority have difficulty comprehending the global information. In Fig. 6(f) and Fig. 6(g), such remote agents choose to spread out and explore, and thus fail to wipe out the enemies. To sum up, in Fig. 6(d), agents controlled by LSC form a global encirclement strategy through both intra-group and inter-group communication, which wipes out the enemies more efficiently.
Fig. 7 presents two weight-visualization graphs to show how the CBRP sub-module works. Fig. 7(a) visualizes the agent weights at an intermediate stage of the testing procedure for a battle. Red agents denote the enemies. As discussed in Section 3.2, only three kinds of discrete weights can be obtained for each agent. The blue agents have weight 2 and the green ones have weight 1, while no agent has the remaining weight at this stage. Without the CBRP sub-module, all the blue agents with weight 2 would be elected as high-level agents. This leads to an almost dense high-level agent structure, for instance, in the red circle area. Fig. 7(b) visualizes the agent weights after applying the CBRP method. Only three agents are set to be high-level agents, while all others are set to be low-level agents.
|Criteria / Method||LSC||LSC-star||LSC-nbor||IDQN|
4.2 Task II: Cooperative Spread in MPE
We design a new scenario, cooperative spread, based on MPE, to test the performance of LSC in a fully cooperative scenario. There are agents and landmarks in this environment. Every landmark needs to be reached by three agents. However, when more than three agents reach a landmark, the landmark becomes overloaded and penalizes the agents. We train LSC and the baselines with episodes; the learning curves are shown in Fig. 8. We take test rounds on the obtained models, and Table 4 reports several evaluation criteria from the testing procedure, for instance ’’ (number of successful reaches of one landmark by three agents), ’’ (number of reaches of one landmark by more than three agents) and ‘Mean-reward’ (average per-episode reward of all agents).
Fig. 8 and Table 4 show that LSC outperforms the baselines in both training and testing. The high-level agents (in the star or hierarchical topology) can help speed up the learning process by reaching agreement on global information. Thus LSC and the star topology learn faster than the other two. LSC converges to a higher reward than the baselines. Table 4 shows that IDQN’s strategy is passive: agents avoid overload while taking fewer chances to reach a landmark. The star and neighboring structures adopt more aggressive strategies. The star structure cannot obtain fine-grained information from neighbors, which leads to more overloads. The neighboring structure leaves some agents disconnected from the others, so it cannot achieve global communication, which leads to fewer chances of success. LSC achieves the highest reward during testing, with a reasonable number of overloads.
5 Conclusion and Future Work
In this paper, a novel Learning Structured Communication (LSC) algorithm has been proposed for multi-agent reinforcement learning. The hierarchical structure is self-learned with a cluster-based routing protocol. The communication message representation is then naturally embedded and extracted via a graph neural network. Experiments on two scenarios demonstrate that LSC can outperform existing learning-to-communicate algorithms, with better communication efficiency, cooperation capability, and scalability. In the future, it is worthwhile to improve LSC by considering practical constraints such as communication bandwidth and latency.
Appendix A
A.1 Hyperparameters and Experimental Settings
In the MAgent battle, agents fight enemies in a grid world. Each agent on both sides has a perception field and can attack its 8 adjacent grid cells. The speed, attack power, and health points of each agent are , , and , which are increased to , , and for the enemy to increase the difficulty. The reward is for successfully attacking an enemy, for being killed, and for attacking a blank grid cell.
In the cooperative spread, 12 agents need to cooperate so that every landmark is reached by three agents. Each agent can observe the relative positions of the other agents, but it can observe a landmark’s relative position only when the landmark is in its receptive field (distance smaller than ). The action space contains UP, DOWN, LEFT, RIGHT, and STAY. When three agents reach a landmark (distance smaller than 0.2), the landmark rewards all the agents with . However, if more than three agents reach a landmark, all the agents are penalized with . To help agents learn to reach landmarks, we add a dense reward (each landmark gives the negative sum of the distances of its three nearest agents as a reward), as in the typical spread setting.
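The reward rules of the cooperative spread scenario can be sketched as a single shared-reward function. Since the reward and penalty magnitudes are not given here, they are left as required arguments (`reach_reward`, `overload_penalty`); these names, the function name `spread_reward`, and the choice to always add the dense term are assumptions for illustration only.

```python
import math


def spread_reward(agents, landmarks, reach_reward, overload_penalty,
                  reach_dist=0.2):
    """Shared team reward for one step; positions are (x, y) tuples."""
    total = 0.0
    for lm in landmarks:
        dists = sorted(math.dist(a, lm) for a in agents)
        reached = sum(d < reach_dist for d in dists)
        if reached == 3:
            total += reach_reward        # exactly three agents on the landmark
        elif reached > 3:
            total += overload_penalty    # overloaded landmark
        # Dense shaping: negative sum of the three nearest agents' distances.
        total -= sum(dists[:3])
    return total


if __name__ == "__main__":
    far_agents = [(1.0, 0.0), (2.0, 0.0), (3.0, 0.0)]
    print(spread_reward(far_agents, [(0.0, 0.0)],
                        reach_reward=1.0, overload_penalty=-1.0))
```

With no agent inside the reach distance, only the dense term contributes, so agents are still guided toward the landmark before any sparse reward is obtained.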
To enable reproducibility, we summarize the hyperparameters for LSC and the baselines in Table 5. The weight generator’s hyperparameters are the same as those of IDQN, except that its output layer has size 3 (the number of weight levels).
|dimension of msg||—|
A.2 CBRP Function and HCOMM Function
- We use the terms topology and structure interchangeably.
- Communication efficiency varies across communication mechanisms. Here, our analysis assumes the peer-to-peer mode.
References
[1] (2002) A cooperative multi-agent transportation management and route guidance system. Transportation Research Part C: Emerging Technologies 10 (5-6), pp. 433–454.
[2] (2018) Relational inductive biases, deep learning, and graph networks. arXiv preprint arXiv:1806.01261.
[3] (2019) TarMAC: Targeted multi-agent communication. In ICML, pp. 1538–1546.
[4] (2016) Learning to communicate with deep multi-agent reinforcement learning. In NeurIPS, pp. 2137–2145.
[5] (2018) Counterfactual multi-agent policy gradients. In AAAI, pp. 2974–2982.
[6] (2019) Actor-attention-critic for multi-agent reinforcement learning. In ICML, pp. 2961–2970.
[7] (2020) Graph convolutional reinforcement learning. In ICLR.
[8] (2018) Learning attentional communication for multi-agent cooperation. In NeurIPS, pp. 7254–7264.
[9] (2019) Learning to schedule communication in multi-agent reinforcement learning. In ICLR.
[10] (2016) Continuous control with deep reinforcement learning. In ICLR.
[11] (2017) Multi-agent actor-critic for mixed cooperative-competitive environments. In NeurIPS, pp. 6379–6390.
[12] (2018) Deep multi-agent reinforcement learning with relevance graphs. arXiv preprint arXiv:1811.12557.
[13] (2015) Human-level control through deep reinforcement learning. Nature 518 (7540), pp. 529–533.
[14] (2018) QMIX: monotonic value function factorisation for deep multi-agent reinforcement learning. In ICML, pp. 4292–4301.
[15] (2009) Cluster based routing protocol for mobile ad hoc networks. INFOCOMP 8 (1), pp. 30–36.
[16] (2019) Multi-agent actor-critic with hierarchical graph attention network. arXiv preprint arXiv:1909.12557.
[17] (2009) Multi-agent team cooperation: a game theory approach. Automatica 45 (10), pp. 2205–2213.
[18] (2016) Mastering the game of go with deep neural networks and tree search. Nature 529 (7587), pp. 484–489.
[19] (2019) Learning when to communicate at scale in multiagent cooperative and competitive tasks. In ICLR.
[20] (2016) Learning multiagent communication with backpropagation. In NeurIPS, pp. 2244–2252.
[21] (2019) Relational forward models for multi-agent learning. In ICLR.
[22] (2017) Multiagent cooperation and competition with deep reinforcement learning. PLOS ONE 12 (4), pp. 1–15.
[23] (2018) Mean field multi-agent reinforcement learning. In ICML, pp. 5571–5580.
[24] (2018) Hierarchical graph representation learning with differentiable pooling. In NeurIPS, pp. 4800–4810.