Graph Convolutional Reinforcement Learning


Jiechuan Jiang
Peking University
Chen Dun
Rice University
Tiejun Huang & Zongqing Lu
Peking University
Work done at Peking University. Correspondence to Zongqing Lu.

Learning to cooperate is crucially important in multi-agent environments. The key is to understand the mutual interplay between agents. However, multi-agent environments are highly dynamic, which makes it hard to learn abstract representations of their mutual interplay. To tackle these difficulties, we propose graph convolutional reinforcement learning, where graph convolution adapts to the dynamics of the underlying graph of the multi-agent environment, and relation kernels capture the interplay between agents by their relation representations. Latent features produced by convolutional layers from gradually increased receptive fields are exploited to learn cooperation, and cooperation is further boosted by temporal relation regularization for consistency. Empirically, we show that our method substantially outperforms existing methods in a variety of cooperative scenarios.

1 Introduction

Cooperation is a widespread phenomenon in nature, from viruses, bacteria, and social amoebae to insect societies, social animals, and humans (Melis and Semmann, 2010). Humans exceed all other species in the range and scale of cooperation. The development of human cooperation is facilitated by the underlying graph of human societies (Ohtsuki et al., 2006; Apicella et al., 2012), where the mutual interplay between humans is abstracted by their relations.

It is crucially important to enable agents to learn to cooperate in multi-agent environments for many applications, e.g., autonomous driving (Shalev-Shwartz et al., 2016), traffic light control (Wiering, 2000), smart grid control (Yang et al., 2018a), and multi-robot control (Matignon et al., 2012). Multi-agent reinforcement learning (MARL) facilitated by communication (Sukhbaatar et al., 2016; Peng et al., 2017; Jiang and Lu, 2018), mean field theory (Yang et al., 2018b), and causal influence (Jaques et al., 2019) has been exploited for multi-agent cooperation. However, communication among all agents (Sukhbaatar et al., 2016; Peng et al., 2017) makes it hard to extract valuable information for cooperation, while communication with only nearby agents (Jiang and Lu, 2018) may restrain the range of cooperation. MeanField (Yang et al., 2018b) captures the interplay of agents by mean action, but the mean action eliminates the difference among agents and thus incurs the loss of important information that could help cooperation. Causal influence (Jaques et al., 2019) is a measure of action influence, which is the policy change of an agent in the presence of an action of another agent. However, causal influence is not directly related to the reward of the environment and thus may not encourage cooperation. Unlike existing work, we consider the underlying graph of agents, which could potentially help understand agents’ mutual interplay and promote their cooperation as it does in human cooperation (Ohtsuki et al., 2006; Apicella et al., 2012).

In this paper, we propose graph convolutional reinforcement learning, where the multi-agent environment is modeled as a graph, each agent is a node, and the encoding of each agent's local observation is the feature of its node. We apply convolution to the graph of agents. By employing multi-head attention (Vaswani et al., 2017) as the convolution kernel, graph convolution is able to extract the relation representation between nodes and convolve the features from neighboring nodes just like a neuron in a convolutional neural network (CNN). Latent features extracted from gradually increased receptive fields are exploited to learn cooperative policies. Moreover, the relation representation is temporally regularized to help the agent develop a consistent cooperative policy.

Graph convolutional reinforcement learning, namely DGN, is instantiated based on a deep Q network and trained end-to-end. DGN shares weights among all agents, making it easy to scale. DGN abstracts the mutual interplay between agents by relation kernels, extracts latent features by convolution, and induces consistent cooperation by temporal relation regularization. We empirically show the learning effectiveness of DGN in jungle and battle games and routing in packet switching networks. We demonstrate that DGN agents are able to develop cooperative and sophisticated strategies and that DGN outperforms existing methods by a large margin.

By ablation studies, we confirm the following. Graph convolution greatly enhances the cooperation of agents. Unlike other parameter-sharing methods, graph convolution allows the policy to be optimized by jointly considering the agents in the receptive field of an agent, promoting mutual help. Relation kernels that are independent of the input order of features can effectively capture the interplay between agents and abstract relation representation to further improve cooperation. Temporal regularization, which minimizes the KL divergence of relation representations in successive timesteps, boosts the cooperation, helping the agent to form a long-term and consistent policy in the highly dynamic environment with many moving agents.

2 Related Work

MARL. MADDPG (Lowe et al., 2017) and COMA (Foerster et al., 2018) are actor-critic models for the settings of local reward and shared reward, respectively. A centralized critic that takes as input the observations and actions of all agents is used in both, which makes them hard to scale. PS-TRPO (Gupta et al., 2017) solves problems that were previously considered intractable by most MARL algorithms via sharing of policy parameters, which also improves multi-agent cooperation. However, the cooperation is still limited without sharing information among agents. Sharing parameters of the value function among agents is considered in (Zhang et al., 2018), and a convergence guarantee is provided for linear function approximation. However, the proposed algorithms and their convergence are established only in fully observable environments. Value propagation is proposed in (Qu et al., 2019) for networked MARL, which uses softmax temporal consistency to connect value and policy updates. However, this method only works on networked agents with static connectivity. CommNet (Sukhbaatar et al., 2016) and BiCNet (Peng et al., 2017) communicate the encoding of local observation among agents. ATOC (Jiang and Lu, 2018) and TarMAC (Das et al., 2019) enable agents to learn when to communicate and who to send messages to, respectively, using an attention mechanism. These communication models show that communication does help cooperation. However, full communication is costly and inefficient, while restrained communication may limit the range of cooperation.

Graph Convolution and Relation. Many important real-world applications come in the form of graphs, such as social networks (Kipf and Welling, 2017), protein-interaction networks (Duvenaud et al., 2015), and 3D point clouds (Charles et al., 2017). Several frameworks (Henaff et al., 2015; Niepert et al., 2016; Kipf and Welling, 2017; Velickovic et al., 2017) have been architected to extract locally connected features from arbitrary graphs. A graph convolutional network (GCN) takes as input the feature matrix that summarizes the attributes of each node and outputs a node-level feature matrix. The function is similar to the convolution operation in CNNs, where the kernels are convolved across local regions of the input to produce feature maps. Using GCNs, interaction networks can reason about objects, relations and physics in complex systems, which has proven difficult for CNNs. A few interaction frameworks have been proposed to predict the future states and underlying properties, such as IN (Battaglia et al., 2016), VIN (Watters et al., 2017), and VAIN (Hoshen, 2017). Relational reinforcement learning (RRL) (Zambaldi et al., 2018) embeds multi-head dot-product attention (Vaswani et al., 2017) as a relation block into neural networks to learn pairwise interaction representations of a set of entities in the agent’s state, helping the agent solve tasks with complex logic. Relational Forward Models (RFM) (Tacchetti et al., 2019) use supervised learning to predict the actions of all other agents based on global state. However, in partially observable environments, it is hard for RFM to learn to make accurate predictions with only local observation.

3 Method

We construct the multi-agent environment as a graph, where agents in the environment are represented by the nodes of the graph and each node $i$ has a set of neighbors, $\mathbb{B}_i$, which is determined by distance or other metrics, depending on the environment, and varies over time. Moreover, neighboring nodes can communicate with each other. The intuition behind this is that neighboring agents are more likely to interact with and affect each other. In addition, in many multi-agent environments, it may be costly and less helpful to take all other agents into consideration, because receiving a large amount of information requires high bandwidth and incurs high computational complexity, and agents cannot differentiate valuable information from globally shared information (Tan, 1993; Jiang and Lu, 2018). As convolution can gradually increase the receptive field of an agent (the receptive field of an agent at a convolutional layer is the set of agents it perceives at that layer), the scope of cooperation is not restricted. Therefore, it is efficient and effective to consider only neighboring agents. Unlike the static graph considered in GCNs, the graph of the multi-agent environment is dynamic and continuously changing over time as agents move or enter/leave the environment. Therefore, DGN should be able to adapt to the dynamics of the graph and learn as the multi-agent environment evolves.
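As an illustration of a distance-based neighbor set, the following minimal sketch picks the `k` closest agents for every agent. The function name `nearest_neighbors`, the 2-D coordinate layout, and the choice of Euclidean distance with a fixed `k` are assumptions made for this example; the text only states that neighbors are determined by distance or other metrics.

```python
import numpy as np

def nearest_neighbors(positions, k):
    """Return, for each agent, the indices of its k closest other agents.

    positions: (N, 2) array of agent coordinates (hypothetical layout).
    """
    diff = positions[:, None, :] - positions[None, :, :]
    dist = np.linalg.norm(diff, axis=-1)      # (N, N) pairwise Euclidean distances
    np.fill_diagonal(dist, np.inf)            # an agent is not its own neighbor
    return np.argsort(dist, axis=1)[:, :k]    # (N, k) neighbor indices

positions = np.array([[0.0, 0.0], [1.0, 0.0], [5.0, 0.0], [6.0, 0.0]])
neighbors = nearest_neighbors(positions, k=1)
```

Because agents move, such a neighbor set would have to be recomputed at every timestep, which is what makes the graph dynamic.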

3.1 Graph Convolution

Figure 1: DGN consists of three modules: encoder, convolutional layer, and Q network. All agents share weights and gradients are accumulated to update the weights.

We consider partially observable environments, where at each timestep $t$ each agent $i$ receives a local observation $o_i^t$, which is the property of node $i$ in the graph, takes an action $a_i^t$, and gets a reward $r_i^t$. DGN consists of three types of modules: observation encoder, convolutional layer and Q network, as illustrated in Figure 1. The local observation $o_i^t$ is encoded into a feature vector $h_i^t$ by an MLP for low-dimensional input or a CNN for visual input. The convolutional layer integrates the feature vectors in the local region (including node $i$ and its neighbors $\mathbb{B}_i$) and generates the latent feature vector $h_i'^t$. By stacking more convolutional layers, the receptive field of an agent gradually grows, where more information is gathered, and thus the scope of cooperation can also increase. That is, by one convolutional layer, node $i$ can directly acquire the feature vectors from the encoders of nodes in one-hop (i.e., $\mathbb{B}_i$). By stacking two layers, node $i$ can get the output of the first convolutional layer of the nodes in one-hop, which contains the information from nodes in two-hop. However, more convolutional layers will not increase the local region of node $i$, i.e., node $i$ still only communicates with its neighbors. Details of the convolution kernel will be discussed in the next subsection.
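The growth of the receptive field with depth can be sketched as a hop-by-hop expansion over the agent graph. This is a minimal illustration, assuming a boolean neighbor matrix `adj` that stays fixed within one timestep; the function name `receptive_field` is hypothetical.

```python
import numpy as np

def receptive_field(adj, i, num_layers):
    """Agents whose encoder features can reach agent i after `num_layers`
    convolutional layers, i.e. all agents within that many hops of i."""
    reach = np.zeros(adj.shape[0], dtype=bool)
    reach[i] = True
    for _ in range(num_layers):
        # Each layer lets every reached agent pull in its direct neighbors.
        reach = reach | adj[reach].any(axis=0)
    return set(np.flatnonzero(reach).tolist())

# Chain graph 0 - 1 - 2 - 3 (symmetric neighbor relation).
adj = np.array([[0, 1, 0, 0],
                [1, 0, 1, 0],
                [0, 1, 0, 1],
                [0, 0, 1, 0]], dtype=bool)
one_layer = receptive_field(adj, 0, 1)
two_layers = receptive_field(adj, 0, 2)
```

In this toy chain, agent 0 only ever exchanges features with agent 1, yet with two layers its latent feature also depends on agent 2.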

As the number and position of agents vary over time, the underlying graph continuously changes, which brings difficulties to graph convolution. To address the issue, we merge all agents’ feature vectors at time $t$ into a feature matrix $F^t$ with size $N \times L$ in the order of index, where $N$ is the number of agents and $L$ is the length of the feature vector. Then, we construct an adjacency matrix $C_i^t$ with size $(|\mathbb{B}_i|+1) \times N$ for agent $i$, where the first row is the one-hot representation of the index of node $i$, and the $j$th row, $j = 2, \ldots, |\mathbb{B}_i|+1$, is the one-hot representation of the index of the $(j-1)$th neighbor. Then, we can obtain the feature vectors in the local region of node $i$ by $C_i^t \times F^t$.
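The row-selection step can be sketched in NumPy as follows. The names `selection_matrix`, `F`, and `C_2` are chosen for this example only: `F` plays the role of the feature matrix (one row per agent) and `C_2` is the one-hot matrix for agent 2, whose first row selects the agent itself and whose remaining rows select its neighbors.

```python
import numpy as np

def selection_matrix(i, neighbors, num_agents):
    """Build the (len(neighbors) + 1) x N one-hot matrix for agent i.

    Row 0 selects agent i itself; row j (j >= 1) selects its j-th neighbor.
    """
    indices = [i] + list(neighbors)
    C = np.zeros((len(indices), num_agents))
    C[np.arange(len(indices)), indices] = 1.0
    return C

# Toy example: N = 4 agents, feature length L = 3.
F = np.arange(12, dtype=float).reshape(4, 3)   # feature matrix, one row per agent
C_2 = selection_matrix(i=2, neighbors=[0, 3], num_agents=4)
local = C_2 @ F                                # features of agent 2 and its neighbors
```

The product is equivalent to indexing `F[[2, 0, 3]]`; writing it as a matrix product keeps the operation differentiable and lets the neighbor set change from one timestep to the next without changing the network.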

Inspired by DenseNet (Huang et al., 2017), for each agent, the features of all the preceding layers are concatenated and fed into the Q network, so as to assemble and reuse the observation representation and features from different receptive fields, which respectively have distinctive contributions to the strategy that takes the cooperation at different scopes into consideration.
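A minimal sketch of this dense concatenation, assuming an encoder followed by two convolutional layers and a purely linear output head. The layer count, the sizes, the random placeholder features, and the weight matrix `W_head` are all hypothetical and only serve to show the shapes involved.

```python
import numpy as np

rng = np.random.default_rng(0)
num_agents, feat_len, num_actions = 5, 8, 4

# Placeholder features standing in for the outputs of each module.
h0 = rng.normal(size=(num_agents, feat_len))  # encoder output (own observation)
h1 = rng.normal(size=(num_agents, feat_len))  # after 1st conv layer (one-hop information)
h2 = rng.normal(size=(num_agents, feat_len))  # after 2nd conv layer (two-hop information)

# Features of all preceding layers are concatenated per agent.
dense_input = np.concatenate([h0, h1, h2], axis=1)            # shape (5, 24)
W_head = rng.normal(size=(dense_input.shape[1], num_actions)) # hypothetical linear head
action_values = dense_input @ W_head                          # shape (5, 4)
```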

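During training, at each timestep, we store the tuple $(\mathcal{O}, \mathcal{A}, \mathcal{O}', \mathcal{R}, \mathcal{C})$ in the replay buffer, where $\mathcal{O}$ is the set of observations, $\mathcal{A}$ is the set of actions, $\mathcal{O}'$ is the set of next observations, $\mathcal{R}$ is the set of rewards, and $\mathcal{C}$ is the set of adjacency matrices. Note that we drop time $t$ in the notations for simplicity. Then, we sample a random minibatch of size $S$ from the replay buffer and minimize the loss

$$\mathcal{L}(\theta) = \frac{1}{S}\sum_{S}\frac{1}{N}\sum_{i=1}^{N}\left(y_i - Q\left(O_{i,\mathcal{C}}, a_i; \theta\right)\right)^2,$$

where $y_i = r_i + \gamma \max_{a'} Q\left(O'_{i,\mathcal{C}}, a'; \theta'\right)$, $O_{i,\mathcal{C}} \subseteq \mathcal{O}$ denotes the set of observations of the agents in $i$’s receptive fields determined by $\mathcal{C}$, $\gamma$ is the discount factor, and the Q function, parameterized by $\theta$, takes $O_{i,\mathcal{C}}$ as input and outputs the Q value for agent $i$. As the action of an agent can change the graph at the next timestep, which makes it hard to learn the Q function, we keep $\mathcal{C}$ unchanged in two successive timesteps when computing the Q-loss in training to ease this learning difficulty. The gradients of the Q-loss of all agents are accumulated to update the parameters. Then, we softly update the target network as $\theta' = \beta\theta + (1-\beta)\theta'$.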
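The loss and the soft target update can be sketched on plain arrays. This is an illustration under stated assumptions rather than the authors' implementation: the function names, the array layout (minibatch, agents, actions), the mixing rate named `beta`, and the omission of terminal-state handling are all choices made for this example.

```python
import numpy as np

def q_loss(q_taken, rewards, next_q_target, gamma):
    """Mean squared TD error over a minibatch and over all agents.

    q_taken:       (S, N)    online-network values of the actions actually taken
    rewards:       (S, N)    per-agent rewards
    next_q_target: (S, N, A) target-network values at the next observations
    """
    y = rewards + gamma * next_q_target.max(axis=-1)   # TD target per agent
    return np.mean((y - q_taken) ** 2)

def soft_update(theta, theta_target, beta):
    """Move the target parameters a fraction beta towards the online ones."""
    return beta * theta + (1.0 - beta) * theta_target

# One sample, two agents, two actions.
q_taken = np.array([[1.0, 2.0]])
rewards = np.array([[0.5, 0.0]])
next_q_target = np.array([[[1.0, 3.0], [2.0, 0.0]]])
loss = q_loss(q_taken, rewards, next_q_target, gamma=0.5)

new_target = soft_update(np.array([1.0, 1.0]), np.array([0.0, 0.0]), beta=0.1)
```

In the toy numbers, the targets are 0.5 + 0.5 * 3 = 2 and 0 + 0.5 * 2 = 1, so both agents contribute a squared error of 1.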

Like CommNet, DGN can also be seen as a factorization of a centralized policy that outputs actions for all the agents to optimize the average expected return. The factorization is that all agents share $\theta$ and the model of each agent is connected to its neighbors, dynamically determined by the graph of agents at each timestep. More convolutional layers (i.e., larger receptive field) yield a higher degree of centralization that mitigates non-stationarity. In addition, unlike other methods with parameter-sharing, e.g., DQN, that sample experiences from individual agents, DGN samples experiences based on the graph of agents, not individual agents, and thus takes into consideration the interactions between agents. Nevertheless, the parameter-sharing of DGN does not prevent the emergence of sophisticated cooperative strategies, as we will show in the experiments. Note that during execution each agent only requires the (latent) features from its neighbors (e.g., via communication) regardless of the number of agents, which makes DGN easy to scale.

3.2 Relation Kernel

Convolution kernels integrate the features in the receptive field to extract the latent feature. One of the most important properties is that the kernel should be independent of the order of the input feature vectors. The mean operation as in CommNet meets this requirement, but it leads to only marginal performance gain. BiCNet uses a learnable kernel, i.e., an RNN. However, the input order of feature vectors severely impacts the performance, though the effect is alleviated by the bidirectional mechanism. Further, convolution kernels should be able to learn how to abstract the relation between agents so as to integrate their input features.

Figure 2: Illustration of computation of the convolutional layer with relation kernel of multi-head attention.

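Inspired by RRL, we use multi-head dot-product attention as the convolutional kernel to compute interactions between agents. For each agent $i$, let $\mathbb{B}_{+i}$ denote $\mathbb{B}_i$ and $i$. The input feature of each agent is projected to query, key and value representations by each independent attention head. For attention head $m$, the relation between $i$ and $j \in \mathbb{B}_{+i}$ is computed as

$$\alpha_{ij}^{m} = \frac{\exp\left(\tau \cdot W_Q^m h_i \cdot \left(W_K^m h_j\right)^{\mathsf{T}}\right)}{\sum_{k \in \mathbb{B}_{+i}} \exp\left(\tau \cdot W_Q^m h_i \cdot \left(W_K^m h_k\right)^{\mathsf{T}}\right)},$$

where $\tau$ is a scaling factor. For each attention head, the value representations of all the input features are weighted by the relation and summed together. Then, the outputs of the $M$ attention heads for agent $i$ are concatenated and then fed into function $\sigma$, i.e., a one-layer MLP with ReLU non-linearities, to produce the output of the convolutional layer,

$$h_i' = \sigma\left(\mathrm{concatenate}\left[\sum_{j \in \mathbb{B}_{+i}} \alpha_{ij}^{m} W_V^m h_j,\ \forall m \in M\right]\right).$$
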
Figure 2 illustrates the computation of the convolutional layer with relation kernel. Multi-head attention makes the kernel independent of the order of input feature vectors, and allows the kernel to jointly attend to different representation subspaces. More attention heads give more relation representations and make the training more stable empirically (Vaswani et al., 2017). Moreover, with multiple convolutional layers, higher order relation representations can be extracted, which effectively capture the interplay between agents and greatly help to make cooperative decisions.
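The relation kernel for a single agent can be sketched in NumPy as below. This is a sketch under stated assumptions, not the released implementation: the function name `relation_kernel`, the weight shapes, the bias-free one-layer MLP, and the scaling factor set to one over the square root of the head dimension (as in Vaswani et al., 2017) are choices made for this example.

```python
import numpy as np

def relation_kernel(h_local, W_q, W_k, W_v, W_out, tau):
    """One convolutional layer with a multi-head attention kernel, for one agent.

    h_local: (K, L)     features of the agent (row 0) and its K - 1 neighbors
    W_q, W_k, W_v: (M, L, D) per-head query, key and value projections
    W_out:   (M * D, L_out) weights of the one-layer MLP after concatenation
    Returns the new feature (L_out,) and the attention weights (M, K).
    """
    q = np.einsum('l,mld->md', h_local[0], W_q)    # query of the agent itself
    k = np.einsum('kl,mld->mkd', h_local, W_k)     # keys of agent and neighbors
    v = np.einsum('kl,mld->mkd', h_local, W_v)     # values of agent and neighbors
    logits = tau * np.einsum('md,mkd->mk', q, k)
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    alpha = np.exp(logits)
    alpha = alpha / alpha.sum(axis=1, keepdims=True)     # softmax over the local region
    heads = np.einsum('mk,mkd->md', alpha, v)            # weighted sum of values per head
    out = np.maximum(heads.reshape(-1) @ W_out, 0.0)     # concatenate heads, MLP, ReLU
    return out, alpha

rng = np.random.default_rng(0)
K, L, M, D, L_out = 4, 6, 2, 3, 5
h_local = rng.normal(size=(K, L))
W_q, W_k, W_v = (rng.normal(size=(M, L, D)) for _ in range(3))
W_out = rng.normal(size=(M * D, L_out))
out, alpha = relation_kernel(h_local, W_q, W_k, W_v, W_out, tau=1.0 / np.sqrt(D))

# Reordering the neighbors (rows 1..K-1) leaves the output unchanged.
perm = [0, 3, 1, 2]
out_perm, _ = relation_kernel(h_local[perm], W_q, W_k, W_v, W_out, tau=1.0 / np.sqrt(D))
```

The last two lines check the order-independence property discussed in this subsection: the softmax weights are permuted together with the values, so their weighted sum does not change.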

3.3 Temporal Relation Regularization

As we train our model using deep Q-learning, we use the future value estimate as the target for the current estimate. We follow this insight and apply it to the relation kernel in our model. Intuitively, if the relation representation produced by the relation kernel of an upper layer truly captures the abstract relation between surrounding agents and itself, such relation representation should be stable/consistent for at least a short period of time, even when the state/feature of surrounding agents changes. Since in our relation kernel, the relation is represented as the attention weight distribution to the observation of surrounding agents, we use the attention weight distribution in the next state as the target for the current attention weight distribution. This will encourage the agent to form a consistent relation representation and hence consistent cooperation, regardless of consistent action, while RNN/LSTM forces consistent action, regardless of cooperation. As the relation in different states should not be the same but similar, we use KL divergence to compute the distance between the attention weight distributions in the two states.

It should be noted that we do not use the target network to produce the target relation representation as in normal deep Q-learning. This is because the relation representation is highly correlated with the weights of feature extraction, but the update of such weights in the target network always lags behind that of the current network. Since we only focus on the self-consistency of the relation representation based on the current feature extraction network, we apply the current network to the next state to produce the new relation representation instead of the target network as in deep Q-learning.

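Let $\mathcal{G}^{\kappa}(O_{i,\mathcal{C}}; \theta)$ denote the attention weight distribution of relation representations at convolutional layer $\kappa$ for agent $i$. Then, with temporal relation regularization, the loss is modified as below

$$\mathcal{L}(\theta) = \frac{1}{S}\sum_{S}\frac{1}{N}\sum_{i=1}^{N}\left(\left(y_i - Q\left(O_{i,\mathcal{C}}, a_i; \theta\right)\right)^2 + \lambda D_{\mathrm{KL}}\left(\mathcal{G}^{\kappa}(O_{i,\mathcal{C}}; \theta) \,\|\, z_i\right)\right),$$

where $z_i = \mathcal{G}^{\kappa}(O'_{i,\mathcal{C}}; \theta)$ and $\lambda$ is the coefficient for the regularization loss. Temporal relation regularization of the upper layer in DGN helps the agent to form a long-term and consistent action policy in the highly dynamic environment with many moving agents. This will further help agents to form cooperative behavior, since many cooperation tasks need long-term consistent actions of the collaborating agents to get the final reward. We will further analyze this in the experiments.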
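The regularized objective can be sketched by adding a KL term between the attention distributions at the current and next observations to the squared TD error. The function names, the array layout, the clipping constant `eps`, and the coefficient name `lam` are assumptions made for this example; the next-state distribution is treated as a fixed target, and both distributions are assumed to range over the same neighbor set.

```python
import numpy as np

def kl_divergence(p, q, eps=1e-8):
    """KL(p || q) along the last axis, with clipping to avoid log(0)."""
    p = np.clip(p, eps, 1.0)
    q = np.clip(q, eps, 1.0)
    return np.sum(p * np.log(p / q), axis=-1)

def regularized_loss(td_error_sq, attn_now, attn_next, lam):
    """Mean of squared TD error plus lam * KL(attention now || attention next).

    td_error_sq:          (S, N)    squared TD errors per sample and agent
    attn_now, attn_next:  (S, N, K) attention weights over the local region
    """
    kl = kl_divergence(attn_now, attn_next)
    return np.mean(td_error_sq + lam * kl)

td = np.array([[1.0]])
attn_now = np.array([[[0.5, 0.5]]])
attn_next = np.array([[[0.25, 0.75]]])
loss_same = regularized_loss(td, attn_now, attn_now, lam=0.1)   # KL term vanishes
loss_diff = regularized_loss(td, attn_now, attn_next, lam=0.1)  # penalized for the shift
```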

4 Experiments

For the experiments, we adopt a grid-world platform MAgent (Zheng et al., 2017). In the environment, each agent corresponds to one grid and has a local observation that contains a square view with grids centered at the agent and its own coordinates. The discrete actions are moving or attacking. Two scenarios, battle and jungle, are considered to investigate the cooperation among agents. Also, we build an environment, routing, that simulates routing in packet switching networks. These three scenarios are illustrated in Figure 3. In the experiments, we compare DGN with independent Q-learning, DQN, which is fully decentralized, CommNet (Sukhbaatar et al., 2016), and MeanField Q-learning (MFQ) (Yang et al., 2018b). DGN and the baselines are parameter-sharing and their basic hyperparameters are all the same. Moreover, to ensure the comparison is fair, their parameter sizes are also similar. Please refer to the Appendix for hyperparameters and experimental settings. The video ( provides more details about the experiments. The codes of DGN are available at

Figure 3: Illustration of experimental scenarios: battle (left), jungle (mid), and routing (right).

4.1 Battle

In this scenario, agents learn to fight against enemies whose abilities are superior to those of the agents. The moving or attacking range of the agent is the four neighbor grids, whereas the enemy can move to one of the twelve nearest grids or attack one of the eight neighbor grids. Each agent/enemy has six hit points (i.e., it is killed by six attacks). After the death of an agent/enemy, the balance will be easily lost, and hence we add a new agent/enemy at a random location to maintain the balance. By that, we can make a fair comparison among different methods in terms of kills, deaths and kill-death ratio besides reward for given timesteps. The pretrained DQN model built into MAgent takes the role of the enemy. As an individual enemy is much more powerful than an individual agent, an agent has to collaborate with others to develop coordinated tactics to fight enemies. Moreover, as the enemy has six hit points, agents have to consistently cooperate to kill an enemy.

Figure 4: Learning curves in battle.

We trained all the models with the setting of and for episodes. Figure 4 shows their learning curves in terms of mean reward. For all the models, the shadowed area is enclosed by the min and max value of three training runs, and the solid line in the middle is the mean value (same for jungle and routing). DGN converges to a much higher mean reward than the baselines, and its learning curve is more stable. MFQ outperforms CommNet and DQN, which first get relatively high reward but eventually converge to much lower reward. As observed in the experiment, at the beginning of training, DQN and CommNet learn sub-optimum policies such as gathering as a group in a corner to avoid being attacked, since such behaviors generate relatively high reward. However, since the distribution of reward is uneven, i.e., agents at the exterior of the group are easily attacked, learning from the “low reward experiences” produced by the sub-optimum policy, DQN and CommNet converge to more passive policies, which lead to much lower reward. We evaluate DGN and the baselines by running test games, each game unrolled with timesteps. Table 1 shows the mean reward, kills, deaths, and kill-death ratio.

Table 1: Battle. Rows: mean reward, # kills, # deaths, kill-death ratio.

DGN agents learn a series of tactical maneuvers, such as encircling and envelopment of a single flank. For a single enemy, DGN agents learn to encircle and attack it together. For a group of enemies, DGN agents learn to move against and attack one of the enemy’s open flanks, as depicted in Figure 5(a). CommNet agents adopt an active defense strategy. They seldom launch attacks but rather run away or gather together to avoid being attacked. DQN agents driven by self-interest fail to learn a rational policy. They are usually forced into a corner and passively react to the enemy’s attack, as shown in Figure 5(b). MFQ agents do not effectively cooperate with each other because the mean action incurs the loss of important information that could help cooperation. In DGN, relation kernels can extract high order relations between agents through graph convolution, which can be easily exploited to yield cooperation. Therefore, DGN outperforms the baselines.

Figure 5: Illustration of representative behaviors of DGN and DQN agents in battle and jungle: (a) DGN in battle, (b) DQN in battle, (c) DGN in jungle, (d) DQN in jungle.

Ablations. We first remove temporal relation regularization from DGN, denoted as DGN-R. As shown in Figure 4 and Table 1, the performance drops slightly. In the experiment, it is observed that DGN agents indeed behave more consistently and synchronously with each other, while DGN-R agents are more likely to be distracted by the new appearance of an enemy or friend nearby and abandon their originally intended trajectory. This results in fewer successful encirclements of a moving enemy, which might need consistent cooperation of agents to move across the field. DGN agents often overcome such distraction and show more long-term strategy and aim by moving more synchronously to chase the enemy until they encircle and destroy it. From this experiment, we can see that temporal relation regularization indeed helps agents to form more consistent cooperation. We further replace the relation kernels of graph convolution in DGN-R with mean kernels, denoted as DGN-M. Comparing the performance of DGN-R and DGN-M, we confirm that relation kernels that abstract the relation representation between agents indeed help to learn cooperation. Although DGN-M and CommNet both use the mean operation, DGN-M substantially outperforms CommNet. This is because graph convolution can effectively extract latent features from gradually increased receptive fields. The performance of DGN with different receptive fields is available in the Appendix.

4.2 Jungle

Figure 6: Learning curves in jungle.

This scenario is a moral dilemma. There are agents and foods in the field, where foods are stationary. An agent gets positive reward by eating food, but gets higher reward by attacking another agent. At each timestep, each agent can move to or attack one of four neighboring grids. Attacking a blank grid gets a small negative reward (inhibiting excessive attacks). This experiment is to examine whether agents can learn to share resources collaboratively rather than attack each other. We trained all the models in the setting of and for episodes. Figure 6 shows their learning curves. Table 2 shows the mean reward and number of attacks between agents over test runs, each game unrolled with timesteps.

Table 2: Jungle. Rows: mean reward, # attacks.

DGN outperforms all the baselines during training and test in terms of mean reward and number of attacks between agents. It is observed that DGN agents can properly select the close food and seldom hurt each other, and the food can be allocated rationally by the surrounding agents, as shown in Figure 5(c). Moreover, attacks between DGN agents are far fewer than those of the other methods, e.g., less than MFQ. Sneak attack, fierce conflict, and hesitation are the characteristics of CommNet and DQN agents, as illustrated in Figure 5(d), verifying their failure to learn cooperation.

4.3 Routing

The network consists of routers. Each router is randomly connected to a constant number of routers (three in the experiment), and the network topology is stationary. There are data packets with a random size, and each packet is randomly assigned a source and destination router. If there are multiple packets with the sum size larger than the bandwidth of a link, they cannot go through the link simultaneously. In the experiment, data packets are agents, and they aim to quickly reach the destination while avoiding congestion. At each timestep, the observation of a packet is its own attributes (i.e., current location, destination, and data size), the attributes of cables connected to its current location (i.e., load, length), and neighboring data packets (on the connected cables or routers). The number of timesteps a data packet needs to go through a cable is a linear function of the cable length. The action space of a packet is the choices of next hop. Once the data packet arrives at the destination, it leaves the system and another data packet enters the system with random initialization.

We trained all the models with the setting of and for episodes. Figure 7 shows their learning curves. DGN converges to much higher mean reward and more quickly than the baselines. We evaluate all the models by running 10 test games, each game unrolled with timesteps. Table 3 shows the mean reward, mean delay of data packets, and throughput, where the delay of a packet is measured by the timesteps taken from source to destination and the throughput is the number of delivered packets per timestep.
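The two routing metrics defined above can be computed from a log of delivered packets. This is a small sketch; the log format (a list of entry and arrival timesteps) and the function name `delay_and_throughput` are hypothetical.

```python
def delay_and_throughput(deliveries, num_timesteps):
    """Mean delay and throughput from a log of delivered packets.

    deliveries: list of (enter_t, arrive_t) pairs, one per delivered packet
    num_timesteps: length of the test game in timesteps
    """
    delays = [arrive - enter for enter, arrive in deliveries]
    mean_delay = sum(delays) / len(delays) if delays else float('nan')
    throughput = len(deliveries) / num_timesteps   # delivered packets per timestep
    return mean_delay, throughput

# Three packets delivered during a 10-timestep game.
mean_delay, throughput = delay_and_throughput([(0, 4), (2, 5), (3, 9)], num_timesteps=10)
```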

Figure 7: Learning curves in routing.

To better interpret the performance of the models, we calculate the shortest path for every pair of nodes in the network using the Floyd algorithm. Then, during test, we directly calculate the delay and throughput based on the shortest path of each packet, which is Floyd in Table 3. Note that this delay is computed without considering the bandwidth limitation (i.e., data packets can go through any link simultaneously). Thus, this is the ideal case for the routing problem. When considering the bandwidth limit, we let each packet follow its shortest path, and if a link is congested, the packet will wait at the router until the link is unblocked. This is Floyd with BL in Table 3, which can be considered as the practical solution. As shown in Table 3, the performance of DGN is much better than that of the other models and Floyd with BL.
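For reference, the all-pairs shortest-path computation behind the Floyd baseline can be sketched with the standard Floyd-Warshall algorithm. The edge-dictionary input format, the undirected links, and the toy network are assumptions for this example; cable length is used as the edge weight, in line with the traversal time being a linear function of length.

```python
import math

def floyd_warshall(num_nodes, edges):
    """All-pairs shortest path lengths.

    edges: dict mapping (u, v) to the length of an undirected link.
    """
    dist = [[math.inf] * num_nodes for _ in range(num_nodes)]
    for u in range(num_nodes):
        dist[u][u] = 0.0
    for (u, v), length in edges.items():
        dist[u][v] = min(dist[u][v], length)
        dist[v][u] = min(dist[v][u], length)
    # Allow paths through intermediate node k, one k at a time.
    for k in range(num_nodes):
        for i in range(num_nodes):
            for j in range(num_nodes):
                if dist[i][k] + dist[k][j] < dist[i][j]:
                    dist[i][j] = dist[i][k] + dist[k][j]
    return dist

# Toy network with four routers.
edges = {(0, 1): 1.0, (1, 2): 2.0, (0, 2): 5.0, (2, 3): 1.0}
dist = floyd_warshall(4, edges)
```

Here the direct link between routers 0 and 2 has length 5, but the path through router 1 has length 3, so a packet following the shortest path takes the detour. Such a path ignores the load on each link, which is why the bandwidth-limited variant has to make packets wait.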

In the experiment, it is observed that DGN agents tend to select the shortest path to the destination, and more interestingly, learn to select different paths when congestion is about to occur. DQN agents cannot learn the shortest path due to myopia and easily cause congestion at some links without considering the influence of other agents. Communication indeed helps as MFQ and CommNet outperform DQN. However, they are unable to develop the sophisticated strategies as DGN does and eventually converge to much lower performance.

Table 3: Routing. Columns: Floyd, Floyd w/ BL, DGN, MFQ, CommNet, DQN. Rows (for each traffic setting): mean reward, delay, throughput.

To investigate how network traffic affects the performance of the models, we performed the experiments with heavier data traffic, i.e., and , where all the models are directly applied to the setting without retraining. From Table 3, we can see that DGN is much better than Floyd with BL, and MFQ is also better than Floyd with BL. The reason is that Floyd with BL (i.e., simply following the shortest path) is favorable when traffic is light and congestion is rare, while it does not work well when traffic is heavy and congestion easily occurs. We further apply all the models learned in and to the setting of and . DGN still outperforms Floyd with BL, while MFQ becomes worse than Floyd with BL. It is observed in the experiments that DGN without retraining outperforms Floyd with BL up to and , available in the Appendix. From the experiments, we can see that our model trained with fewer agents can generalize well to the setting with many more agents, which demonstrates that the policy that takes as input the integrated features from neighboring agents based on their relations scales well with the number of agents.

5 Conclusions

We have proposed graph convolutional reinforcement learning. DGN adapts to the dynamics of the underlying graph of the multi-agent environment and exploits convolution with relation kernels to extract latent features from gradually increased receptive fields for learning cooperative strategies. Moreover, the relation representation between agents is temporally regularized to make the cooperation more consistent. Empirically, DGN significantly outperforms existing methods in a variety of cooperative multi-agent scenarios.


This work was supported in part by Peng Cheng Lab, Huawei, and NSFC under grant 61872009.


  • C. L. Apicella, F. W. Marlowe, J. H. Fowler, and N. A. Christakis (2012) Social networks and cooperation in hunter-gatherers. Nature 481 (7382), pp. 497.
  • P. Battaglia, R. Pascanu, M. Lai, D. J. Rezende, et al. (2016) Interaction networks for learning about objects, relations and physics. In NeurIPS.
  • R. Q. Charles, H. Su, M. Kaichun, and L. J. Guibas (2017) PointNet: deep learning on point sets for 3D classification and segmentation. In CVPR.
  • A. Das, T. Gervet, J. Romoff, D. Batra, D. Parikh, M. Rabbat, and J. Pineau (2019) TarMAC: targeted multi-agent communication. In ICML.
  • D. K. Duvenaud, D. Maclaurin, J. Iparraguirre, R. Bombarell, T. Hirzel, A. Aspuru-Guzik, and R. P. Adams (2015) Convolutional networks on graphs for learning molecular fingerprints. In NeurIPS.
  • J. Foerster, G. Farquhar, T. Afouras, N. Nardelli, and S. Whiteson (2018) Counterfactual multi-agent policy gradients. In AAAI.
  • J. K. Gupta, M. Egorov, and M. Kochenderfer (2017) Cooperative multi-agent control using deep reinforcement learning. In AAMAS.
  • M. Henaff, J. Bruna, and Y. LeCun (2015) Deep convolutional networks on graph-structured data. arXiv preprint arXiv:1506.05163.
  • Y. Hoshen (2017) VAIN: attentional multi-agent predictive modeling. In NeurIPS.
  • G. Huang, Z. Liu, L. van der Maaten, and K. Q. Weinberger (2017) Densely connected convolutional networks. In CVPR.
  • N. Jaques, A. Lazaridou, E. Hughes, C. Gulcehre, P. A. Ortega, D. Strouse, J. Z. Leibo, and N. de Freitas (2019) Social influence as intrinsic motivation for multi-agent deep reinforcement learning. In ICML.
  • J. Jiang and Z. Lu (2018) Learning attentional communication for multi-agent cooperation. In NeurIPS.
  • T. N. Kipf and M. Welling (2017) Semi-supervised classification with graph convolutional networks. In ICLR.
  • R. Lowe, Y. Wu, A. Tamar, J. Harb, O. P. Abbeel, and I. Mordatch (2017) Multi-agent actor-critic for mixed cooperative-competitive environments. In NeurIPS.
  • L. Matignon, L. Jeanpierre, A. Mouaddib, et al. (2012) Coordinated multi-robot exploration under communication constraints using decentralized Markov decision processes. In AAAI.
  • A. P. Melis and D. Semmann (2010) How is human cooperation different? Philosophical Transactions of the Royal Society of London B: Biological Sciences 365 (1553), pp. 2663–2674.
  • M. Niepert, M. Ahmed, and K. Kutzkov (2016) Learning convolutional neural networks for graphs. In ICML.
  • H. Ohtsuki, C. Hauert, E. Lieberman, and M. A. Nowak (2006) A simple rule for the evolution of cooperation on graphs and social networks. Nature 441 (7092), pp. 502.
  • P. Peng, Y. Wen, Y. Yang, Q. Yuan, Z. Tang, H. Long, and J. Wang (2017) Multiagent bidirectionally-coordinated nets: emergence of human-level coordination in learning to play StarCraft combat games. arXiv preprint arXiv:1703.10069.
  • C. Qu, S. Mannor, H. Xu, Y. Qi, L. Song, and J. Xiong (2019) Value propagation for decentralized networked deep multi-agent reinforcement learning. In NeurIPS.
  • S. Shalev-Shwartz, S. Shammah, and A. Shashua (2016) Safe, multi-agent, reinforcement learning for autonomous driving. arXiv preprint arXiv:1610.03295.
  • S. Sukhbaatar, R. Fergus, et al. (2016) Learning multiagent communication with backpropagation. In NeurIPS.
  • A. Tacchetti, H. F. Song, P. A. Mediano, V. Zambaldi, N. C. Rabinowitz, T. Graepel, M. Botvinick, and P. W. Battaglia (2019) Relational forward models for multi-agent learning. In ICLR.
  • M. Tan (1993) Multi-agent reinforcement learning: independent vs. cooperative agents. In ICML.
  • A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin (2017) Attention is all you need. In NeurIPS.
  • P. Velickovic, G. Cucurull, A. Casanova, A. Romero, P. Lio, and Y. Bengio (2017) Graph attention networks. arXiv preprint arXiv:1710.10903.
  • N. Watters, A. Tacchetti, T. Weber, R. Pascanu, P. Battaglia, and D. Zoran (2017) Visual interaction networks. arXiv preprint arXiv:1706.01433.
  • M. Wiering (2000) Multi-agent reinforcement learning for traffic light control. In ICML.
  • Y. Yang, J. Hao, M. Sun, Z. Wang, C. Fan, and G. Strbac (2018a) Recurrent deep multiagent Q-learning for autonomous brokers in smart grid. In IJCAI.
  • Y. Yang, R. Luo, M. Li, M. Zhou, W. Zhang, and J. Wang (2018b) Mean field multi-agent reinforcement learning. In ICML.
  • V. Zambaldi, D. Raposo, A. Santoro, V. Bapst, Y. Li, I. Babuschkin, K. Tuyls, D. Reichert, T. Lillicrap, E. Lockhart, et al. (2018) Relational deep reinforcement learning. arXiv preprint arXiv:1806.01830.
  • K. Zhang, Z. Yang, H. Liu, T. Zhang, and T. Başar (2018) Fully decentralized multi-agent reinforcement learning with networked agents. In ICML.
  • L. Zheng, J. Yang, H. Cai, W. Zhang, J. Wang, and Y. Yu (2017) MAgent: a many-agent reinforcement learning platform for artificial collective intelligence. arXiv preprint arXiv:1712.00600.

Appendix A Hyperparameters

Table 4 summarizes the hyperparameters used by DGN and the baselines in the experiments.

Hyperparameter DGN CommNet MFQ DQN
discount ()
batch size
buffer capacity
and decay
optimizer Adam
learning rate
# neighbors
# convolutional layers
# attention heads
# encoder MLP layers
# encoder MLP units
network affine transformation affine transformation
MLP activation ReLU
initializer random normal
Table 4: Hyperparameters

Appendix B Experimental Settings

In jungle, the reward is for moving, for attacking (eating) the food, for attacking another agent, for being attacked, and for attacking a blank grid. In battle, the reward is for attacking the enemy, for being killed, and for attacking a blank grid. In routing, the bandwidth of each link is the same and set to . Each data packet has a random size between and . If the link to the next hop selected by a data packet is overloaded, the data packet stays at the current router and is punished with a reward . Once the data packet arrives at the destination, it leaves the system and gets a reward . In the experiments, we fix the size of to , because DGN is currently implemented in TensorFlow, which does not support dynamic computation graphs (varying size of ). We also show below how different sizes of affect DGN’s performance. Indeed, DGN adapts to dynamic environments, no matter how the number of agents changes, how the graph of agents changes, and how many neighbors each agent has.
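The routing rule described above (a packet stays put and is penalized when its chosen link is overloaded, and is rewarded and removed on arrival) can be sketched as follows. This is only an illustration under stated assumptions: every numeric constant is a placeholder rather than the paper's setting, the zero reward for an ordinary hop is assumed, and `Packet`, `step_packet`, and `link_load` are hypothetical names introduced here.

```python
from dataclasses import dataclass

# Placeholder values for illustration only; these are NOT the paper's settings.
LINK_BANDWIDTH = 1.0
BLOCKED_REWARD = -0.2
ARRIVAL_REWARD = 10.0
HOP_REWARD = 0.0  # assumed reward for an ordinary successful hop


@dataclass
class Packet:
    size: float
    router: int
    destination: int


def step_packet(packet, next_router, link_load):
    """Try to move `packet` over the link to `next_router`.

    `link_load` maps an undirected link (a, b) with a < b to the traffic
    already placed on it in this timestep. Returns (reward, delivered).
    """
    link = (min(packet.router, next_router), max(packet.router, next_router))
    load = link_load.get(link, 0.0)
    if load + packet.size > LINK_BANDWIDTH:
        # Overloaded link: the packet stays at its current router.
        return BLOCKED_REWARD, False
    link_load[link] = load + packet.size
    packet.router = next_router
    if packet.router == packet.destination:
        # Arrived: the caller should remove the packet from the system.
        return ARRIVAL_REWARD, True
    return HOP_REWARD, False
```

With these placeholder numbers, two packets of size 0.6 that pick the same link in one timestep would see the first one move and the second one blocked, which is the congestion effect the learned policies have to route around.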

Figure 8: DGN with different numbers of convolutional layers in battle.
Figure 9: DGN with different numbers of neighbors for each agent in jungle.
Figure 10: DGN versus Floyd with BL under increasingly heavier traffic in routing.

Appendix C Additional Experiments

As mentioned above, a larger receptive field yields a higher degree of centralization, which mitigates non-stationarity. We also investigate this in the experiments. First, we examine how DGN performs with different numbers of convolutional layers. As illustrated in Figure 8, two convolutional layers indeed yield a more stable learning curve than one layer, as expected. As the agent’s receptive field is also determined by the size of , we also investigate how it affects the performance of DGN. We set of each agent to , and in jungle. As illustrated in Figure 9, the performance drops as the number of neighbors decreases, as expected.
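To make the link between the number of convolutional layers and the receptive field concrete, the following sketch computes which agents can influence a given agent's features after a number of stacked layers, assuming each layer aggregates only over an agent's direct neighbors. The chain-shaped neighbor graph and the names `receptive_field` and `NEIGHBORS` are hypothetical and used for illustration only.

```python
def receptive_field(neighbors, agent, num_layers):
    """Set of agents whose observations can reach `agent` after `num_layers`
    rounds of neighbor aggregation (the agent itself is always included)."""
    field = {agent}
    for _ in range(num_layers):
        # Each extra layer pulls in the neighbors of everything reached so far.
        field = field | {j for i in field for j in neighbors[i]}
    return field


# Hypothetical neighbor graph (an assumption for this sketch): a chain of five agents.
NEIGHBORS = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
```

On this chain, one layer lets agent 0 see only agent 1, while two layers extend its view to agent 2. Reducing the number of neighbors per agent shrinks the field in the same way that removing a layer does.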

We also conducted additional experiments in routing to compare DGN (learned in the setting of and ) and Floyd with BL under increasingly heavier traffic, in terms of mean delay. As shown in Figure 10, DGN consistently outperforms Floyd with BL up to . After that, Floyd with BL outperforms DGN. The reason is that when the traffic becomes this heavy, the network is fully congested and there is no way to improve the performance. DGN learned in much lighter traffic may still try to find better routes, but this incurs extra delay.
