Scaling Up Multiagent Reinforcement Learning for Robotic Systems: Learn an Adaptive Sparse Communication Graph
Abstract
The complexity of multiagent reinforcement learning (MARL) in multiagent systems increases exponentially with the number of agents. This scalability issue prevents MARL from being applied in large-scale multiagent systems. However, one critical feature in MARL that is often neglected is that the interactions between agents are quite sparse. Without exploiting this sparsity structure, existing works aggregate information from all of the agents and thus have a high sample complexity. To address this issue, we propose an adaptive sparse attention mechanism by generalizing a sparsity-inducing activation function. Then a sparse communication graph in MARL is learned by graph neural networks based on this new attention mechanism. Through this sparsity structure, the agents can communicate both effectively and efficiently by selectively attending only to the agents that matter the most, and thus the scale of the MARL problem is reduced with little loss of optimality. Comparative results show that our algorithm can learn an interpretable sparse structure and outperforms previous works by a significant margin on applications involving a large-scale multiagent system.
I Introduction
Reinforcement Learning (RL) has achieved enormous successes in robotics [16] and gaming [21] in both single and multiagent settings. For example, deep reinforcement learning (DRL) achieved super-human performance in the two-player game Go, which has a very high-dimensional state-action space [27, 28]. However, in multiagent scenarios, the sizes of the state space, joint action space, and joint observation space grow exponentially with the number of agents. As a result of this high dimensionality, existing multiagent reinforcement learning (MARL) algorithms require significant computational resources to learn an optimal policy, which impedes the application of MARL to systems such as swarm robotics [12]. Thus, improving the scalability of MARL is a necessary step towards building large-scale multiagent learning systems for real-world applications.
In MARL, the growth in the complexity of finding an optimal joint policy with respect to the number of agents is a result of coupled interactions between agents [2]. However, in many multiagent scenarios, the interactions between agents are quite sparse. For example, in a soccer game, an agent typically only needs to pay attention to other nearby agents when dribbling because agents far away are not able to intercept. The existence of such sparsity structures in the state transition dynamics (or the state-action-reward relationships) suggests that an agent may only need to attend to information from a small subset of the agents for near-optimal decision-making. Note that the other players that require attention might not be nearby, such as the receiver of a long pass in soccer. In such cases, the agent only needs to selectively attend to agents that “matter the most”. As a result, the agent can spatially and temporally reduce the scale of the planning problem.
In large-scale MARL, sample complexity is a bottleneck of scalability [4]. To reduce the sample complexity, another feature we can exploit is the interchangeability of homogeneous agents: switching two agents’ states/actions will not make any difference to the environment. This interchangeability implies permutation-invariance of the multiagent state-action value function (a.k.a. the centralized Q-function) as well as interchangeability of agent policies. However, many MARL algorithms such as MADDPG [19], VDN [30], and QMIX [23] do not exploit this symmetry and thus have to learn this interchangeability from experience, which increases the sample complexity unnecessarily.
A graph neural network (GNN) is a specific neural network architecture in which permutation-invariance can be embedded via graph pooling operations, so this approach has been applied in MARL [1, 15, 14] to exploit the interchangeability. As MARL is a non-structural scenario in which the links/connections between the nodes/agents are ambiguous, a graph has to be created in advance to apply a GNN to MARL. Refs. [1, 15, 14] apply ad-hoc methods, such as nearest neighbors, hard thresholds, and random dropout, to obtain a graph structure. However, these methods require handcrafted metrics to measure the closeness between agents, which are scenario-specific and thus not general/principled. Inappropriately selecting neighbors based on a poorly designed closeness metric could lead to the failure of learning a useful policy.
While attention mechanisms [31] could be applied to learn the strength of the connections between a pair of agents (i.e., a closeness metric) in a general and principled way, such strengths are often dense, leading to a nearly-complete computation graph that does not benefit scalability. The attention weights are dense because the softmax activation function applied to the raw attention logits generates a probability distribution with full support. One solution to enforce a sparse graph is top-$k$ thresholding [5], which keeps the $k$ largest attention scores and truncates the rest to zero. However, this truncation is a non-differentiable operation that may cause problems for gradient-based optimization algorithms, such as those used in end-to-end training. Therefore, a sparse attention mechanism that preserves the gradient flow necessary for gradient-based training is required.
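To make the hard-truncation baseline concrete, the following minimal sketch keeps the largest attention scores and zeroes out the rest. The function name `top_k_threshold` and the final renormalization step are assumptions made for illustration; the cited work may handle ties and normalization differently.

```python
def top_k_threshold(weights, k):
    """Keep the k largest attention weights and set the others to zero.

    Renormalizing the kept weights so they sum to one is an assumption
    of this sketch, not something stated in the text.
    """
    # Indices of the k largest entries (ties broken by position).
    keep = sorted(range(len(weights)), key=lambda i: weights[i], reverse=True)[:k]
    kept = [w if i in keep else 0.0 for i, w in enumerate(weights)]
    total = sum(kept)
    return [w / total for w in kept]


dense = [0.5, 0.3, 0.2]
sparse = top_k_threshold(dense, k=2)  # the smallest weight is truncated to zero
```

The selection of which entries survive changes abruptly when two scores swap order, which is the source of the non-differentiability noted above.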
To address the non-differentiability issue in sparse attention mechanisms, we generalize sparsemax [20] and obtain a sparsity mechanism whose pattern is adaptive to the environment states. This sparsity mechanism can reduce the complexity of both the forward pass and the back-propagation of the policy and value networks, while preserving the end-to-end trainability, in contrast to hard thresholding. With the introduction of GNN and generalized sparsemax, which can preserve permutation invariance and promote sparsity respectively, the scalability of MARL is improved.
The discussion so far was restricted to homogeneous agents, for which permutation-invariance is desirable. However, in heterogeneous multiagent systems or competitive environments, permutation invariance and interchangeability are no longer valid. For example, in soccer, switching positions of two players from different sides can make a difference to the game. To address this heterogeneity, GNN-based MARL must distinguish the different semantic meanings of the connections between different agent pairs (e.g., friend/friend relationship versus friend/foe relationship). We address this requirement with a multi-relational graph convolution network [25], which passes messages using different graph convolution layers on graph edge connections with different semantic meanings.
To summarize, we propose to learn an adaptive sparse communication graph within the GNN-based framework to improve the scalability of MARL, which applies to both homogeneous and heterogeneous multiagent systems in mixed cooperative-competitive scenarios.
I-A Related Work
One of the existing works exploiting the structure in MARL is the mean-field reinforcement learning (MFRL) [32] algorithm, which takes as input the observation and the mean action of neighboring agents to make the decision, and neglects the actions of all the other agents. This simplification leads to good scalability. However, the mean action cannot distinguish among neighboring agents, and the locality approximation fails to capture information from a distant but important agent for optimal decision-making, which leads to sub-optimal policies. Multi-Actor-Attention-Critic (MAAC) is proposed in [13] to aggregate information from all the other agents using an attention mechanism. Similarly, [1, 14, 7] also employ the attention mechanism to learn a representation for the action-value function. However, the communication graphs used there are either dense or ad-hoc ($k$ nearest neighbors), which makes the learning difficult.
Sparse attention mechanisms were first studied by the natural language processing community in [20], where sparsemax was proposed as a sparse alternative to the activation function softmax. The basic idea is to project the attention logits onto the probability simplex, which can generate zero entries once the projection hits the boundary of the simplex. While generalized sparse attention mechanisms were further studied in [22, 3, 18], they are not adaptive to the state in the context of MARL, in terms of the sparsity pattern.
Given this state of the art, the contributions of this paper are twofold. First, we propose a new adaptive sparse attention mechanism in MARL to learn a sparse communication graph, which improves the scalability of MARL by lowering the sample complexity. Second, we extend our GNN-based MARL to heterogeneous systems in mixed cooperative-competitive settings using multi-relational GNN. The evaluations show that our algorithm significantly outperforms previous approaches on applications involving a large number of agents. This technique can be applied to empower large-scale autonomous systems such as swarm robotics.
II Preliminaries
II-A Multiagent Reinforcement Learning
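As a multiagent extension of Markov decision processes (MDPs), a Markov game is defined as a tuple $\langle \mathcal{N}, \mathcal{S}, \mathcal{O}, \mathcal{A}, \mathcal{T}, \{r_i\}_{i \in \mathcal{N}}, \rho_0, \gamma \rangle$, where $\mathcal{N} = \{1, \dots, N\}$ is a set of agent indices, $\mathcal{S}$ is the set of states, and $\mathcal{O} = \mathcal{O}_1 \times \dots \times \mathcal{O}_N$ and $\mathcal{A} = \mathcal{A}_1 \times \dots \times \mathcal{A}_N$ are the joint observation and joint action sets, respectively. The $i$th agent chooses actions via a stochastic policy $\pi_i : \mathcal{O}_i \times \mathcal{A}_i \to [0, 1]$, which leads to the next state according to the state transition function $\mathcal{T} : \mathcal{S} \times \mathcal{A} \to \mathcal{S}$. The $i$th agent also obtains a reward as a function of the state and the agent's action, $r_i : \mathcal{S} \times \mathcal{A}_i \to \mathbb{R}$, and receives a private observation correlated with the state, $o_i : \mathcal{S} \to \mathcal{O}_i$. The initial states are determined by a distribution $\rho_0 : \mathcal{S} \to [0, 1]$. The $i$th agent aims to maximize its own total expected return $R_i = \sum_{t=0}^{T} \gamma^t r_i^t$, with discount factor $\gamma$ and time horizon $T$.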
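As a small worked example of the return that each agent maximizes, the sketch below accumulates a finite reward sequence with a discount factor. The function name `discounted_return` and the sample rewards are illustrative assumptions, not part of the paper's implementation.

```python
def discounted_return(rewards, gamma):
    """Return sum_t gamma**t * rewards[t] for a finite horizon."""
    total = 0.0
    # Accumulate backwards so each step needs one multiply and one add.
    for r in reversed(rewards):
        total = r + gamma * total
    return total


# One agent's rewards over three timesteps (hypothetical values).
value = discounted_return([1.0, 1.0, 1.0], gamma=0.5)
```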
II-B Multi-head attention
The scaled dot-product attention mechanism was first proposed in [31] for natural language processing. An attention function maps the query and a set of key-value pairs to the output, which is the weighted sum of the values. The weight assigned to each value is calculated via a compatibility function of the query and the corresponding key. In the context of MARL, let $h_i$, $i \in \{1, \dots, N\}$, be the representation of the agents. The key, query and value of agent $i$ are defined as $K_i = W_K h_i$, $Q_i = W_Q h_i$ and $V_i = W_V h_i$, respectively, where $W_K$, $W_Q$ and $W_V$ are parameter matrices. The output for agent $i$ is then
(1) $\mathrm{Att}(Q_i, K, V) = \sum_{j} w_{ij} V_j,$
where $w_i$, the $i$th row of the weight matrix $w$, is defined as
(2) $w_i = \sigma_a\!\left( \frac{Q_i K^\top}{\sqrt{d_K}} \right),$
with $d_K$ being the key dimension and $\sigma_a$ being the softmax function in previous works of GNN-based MARL. The weight $w_i$ is dense, as $\mathrm{softmax}_j(z) > 0$ for any vector $z$ and any index $j$.
To increase the expressiveness, multi-head attention is applied here by simply concatenating the outputs of the single attention functions [31].
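The single-head attention described in this subsection can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions: the helper names (`matvec`, `attention`), the use of identity parameter matrices in the example, and the toy agent representations are all hypothetical. The `normalize` argument defaults to softmax, so every returned weight is strictly positive, which is the density property discussed above.

```python
import math


def softmax(z):
    m = max(z)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(W, h):
    return [sum(w * x for w, x in zip(row, h)) for row in W]


def attention(H, W_q, W_k, W_v, normalize=softmax):
    """Scaled dot-product attention over agent representations H."""
    Q = [matvec(W_q, h) for h in H]
    K = [matvec(W_k, h) for h in H]
    V = [matvec(W_v, h) for h in H]
    d_k = len(K[0])
    outputs, weights = [], []
    for q in Q:
        # Compatibility of this agent's query with every agent's key.
        logits = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in K]
        w = normalize(logits)
        # Output is the weighted sum of the values.
        out = [sum(w_j * v[c] for w_j, v in zip(w, V)) for c in range(len(V[0]))]
        weights.append(w)
        outputs.append(out)
    return outputs, weights


identity = [[1.0, 0.0], [0.0, 1.0]]
H = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = attention(H, identity, identity, identity)
```

Passing a different `normalize` function is the hook through which a sparse activation could replace softmax.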
II-C Relational GNN
In heterogeneous multiagent systems, different agent pairs can have different relations, such as friend or foe in a two-party zero-sum game. As a result, information aggregation from agents with different relations should use different parameters. The work in [25] proposed the relational graph convolutional network to model multi-relational data. The forward-pass update of agent $i$ in a multi-relational graph is as follows
(3) $h_i^{(l+1)} = \sigma\!\left( \sum_{r \in \mathcal{R}} \sum_{j \in \mathcal{N}_i^r} \frac{1}{c_{i,r}} W_r^{(l)} h_j^{(l)} + W_0^{(l)} h_i^{(l)} \right),$
where $\mathcal{N}_i^r$ denotes the set of neighbor indices of agent $i$ under relation $r \in \mathcal{R}$ and $c_{i,r}$ is a normalization constant. To distinguish the heterogeneity in MARL, similar to this convolution-based multi-relational GNN, we apply different attention heads on agent pairs with different relations.
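The relational update can be illustrated with a small sketch in which each relation has its own weight matrix and messages are averaged over the neighbors under that relation. The function name `rgcn_update`, the choice of ReLU as the activation, the neighbor-count normalization, and the toy friend/foe graph are assumptions made for this example.

```python
def matvec(W, h):
    return [sum(w * x for w, x in zip(row, h)) for row in W]


def rgcn_update(h, neighbors, W_rel, W_self):
    """One relational graph-convolution step.

    h:         list of per-agent feature vectors
    neighbors: neighbors[r][i] lists the neighbors of agent i under relation r
    W_rel:     W_rel[r] is the weight matrix for relation r
    W_self:    weight matrix for the agent's own features
    """
    new_h = []
    for i, h_i in enumerate(h):
        acc = matvec(W_self, h_i)  # self-connection term
        for r, W_r in W_rel.items():
            nbrs = neighbors[r].get(i, [])
            if not nbrs:
                continue
            c = len(nbrs)  # normalization constant (assumed: neighbor count)
            for j in nbrs:
                msg = matvec(W_r, h[j])
                acc = [a + m / c for a, m in zip(acc, msg)]
        new_h.append([max(0.0, a) for a in acc])  # ReLU activation (assumed)
    return new_h


h = [[1.0], [2.0], [3.0]]
neighbors = {"friend": {0: [1]}, "foe": {0: [2]}}
W_rel = {"friend": [[1.0]], "foe": [[-1.0]]}
updated = rgcn_update(h, neighbors, W_rel, W_self=[[1.0]])
```

In the toy graph, agent 0 adds its friend's feature and subtracts its foe's, showing how the same neighbor feature is treated differently depending on the relation.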
III Approach
In this section, we present our approach to exploit the sparsity in MARL by generalizing the dense softmax attention to adaptive sparse attention. Moreover, our approach to applying a multi-relational attention mechanism to heterogeneous games involving competitive agents is also introduced.
III-A Learning a communication graph via adaptive sparse attention
The scaled dot-product attention is applied to learn the communication graph in MARL. If an attention weight between a pair of agents is zero, then there is no communication/message passing between them. Thus, the normalization function in (2) is critical to learning a communication graph. In attention mechanisms [31] and in classification, this function is usually set to softmax, which cannot induce sparsity. We propose an adaptive sparse activation function as an alternative to softmax.
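Let $x \in \mathbb{R}^K$ be the raw attention logits and $y$ be the normalized attention strength in the $(K-1)$-dimensional probability simplex defined as $\Delta^{K-1} := \{ y \in \mathbb{R}^K \mid y \geq 0, \mathbf{1}^\top y = 1 \}$. We are interested in the mapping from $\mathbb{R}^K$ to $\Delta^{K-1}$. In other words, such a mapping can transform real weights to a probability distribution, i.e., the normalized attention strength between a pair of agents. The classical softmax, used in most attention mechanisms, is defined componentwise as
(4) $\mathrm{softmax}_i(x) = \frac{\exp(x_i)}{\sum_{j=1}^{K} \exp(x_j)}.$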
A limitation of the softmax transformation is that the resulting probability distribution always has full support, which makes the communication graph dense, resulting in high complexity. In order to reduce the complexity, our idea is to replace the softmax activation function with a generalized activation function, which could adaptively be dense or sparse based on the state. To investigate alternative activation functions to softmax, consider the max operator defined as
(5) $\max(x) := \max_{i \in [K]} x_i = \sup_{y \in \Delta^{K-1}} y^\top x,$
where $[K] = \{1, \dots, K\}$. The second equality holds because the supremum of the linear form $y^\top x$ over a simplex is always achieved at a vertex, i.e., one of the standard basis vectors $e_i$. As a result, the max operator puts all the probability mass onto a single element, or in other words, only one entry of $y$ is nonzero, corresponding to the largest entry of $x$. For example, with $K = 2$, the probability assigned to the first entry as a function of the logit $x_1$ is a step function, as it equals 1 if $x_1 > x_2$ and 0 otherwise. This discontinuity of the step function at $x_1 = x_2$ is not amenable to gradient-based optimization algorithms for training deep neural networks. One solution to the discontinuity issue encountered in (5) is to add a regularizer $\Omega$ to the max operator as
(6) $\Pi_\Omega(x) = \arg\max_{y \in \Delta^{K-1}} \; y^\top x + \lambda\, \Omega(y).$
Different regularizers $\Omega$ produce different mappings with distinct properties (see summary in Table I). Note that with $\Omega$ as the Shannon entropy, $\Pi_\Omega$ recovers softmax. With the states/observations evolving, the ideal profile of $\Pi_\Omega$ should be able to adapt the sparsity extent (controlled via $\lambda$) and the pattern (controlled via the selection of $\Omega$) accordingly.
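The contrast between the unregularized max operator and its entropy-regularized counterpart can be checked numerically for two logits. The sketch below is only an illustration: `hardmax` is a hypothetical name for the mapping that puts all mass on the largest logit, and the probe values around the tie point are arbitrary.

```python
import math


def hardmax(x):
    """Put all probability mass on the largest logit (a simplex vertex)."""
    best = max(range(len(x)), key=lambda i: x[i])
    return [1.0 if i == best else 0.0 for i in range(len(x))]


def softmax(x):
    m = max(x)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in x]
    total = sum(exps)
    return [e / total for e in exps]


# Probe the first output just below and just above the tie x1 = x2 = 0.
for x1 in (-0.01, 0.01):
    logits = [x1, 0.0]
    print(x1, hardmax(logits)[0], softmax(logits)[0])
```

The hardmax output jumps between 0 and 1 across the tie, while the softmax output changes only slightly, which is why the regularized form is the one that suits gradient-based training.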
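Table I: Mappings induced by different regularizers.
Entropy         Resulting mapping    Ref.
Shannon         softmax              [3]
$\ell_2$ norm   sparsemax            [20]
Tsallis         No closed form       [6]
Generalized     No closed form       [17]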
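Table II: Properties (e.g., translation and scaling invariance) and examples of the mapping in (7) for different choices of the learnable network, including the softmax and sparsemax special cases.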
Note that the Tsallis entropy and the generalized entropy in Table I do not have closed-form solutions [3], which will increase the computational burden since iterative numerical algorithms will have to be employed. Sparsemax has a closed-form solution and can induce sparsity, but sparsemax is not adaptive and lacks flexibility, as it is unable to switch from one sparsity pattern to another when necessary. We aim to combine the advantages and avoid the disadvantages using this new formulation
(7) 
with and being a learnable neural network and a scalar, respectively. By choosing different , can exhibit different sparsity patterns including softmax and sparsemax. With fixed, the parameter can control how sparse the output could be, similar to the temperature parameter in softmax. The summary in Table II shows that (7) will lead to a general mapping and can combine properties such as translation and scaling invariance adaptively. The work in [18] proposed sparsehourglass, which can adjust the trade-off between translation and scaling invariance via tunable parameters. However, it is unclear under which circumstances one property is more desirable than the other, so there is little to no prior knowledge on how to tune such parameters. In contrast, our formulation in (7) can balance this trade-off via learning and , while the work in [18] is based on a fixed form of with tunable parameters.
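For reference, the sparsemax special case [20] admits a simple sort-based closed form, sketched here in plain Python. This is the standard sparsemax projection, not the generalized mapping in (7); dividing the logits by a temperature `t` in the usage lines is an assumption used only to illustrate how a scalar can trade sparsity for density.

```python
def sparsemax(z):
    """Euclidean projection of the logits z onto the probability simplex."""
    zs = sorted(z, reverse=True)
    cumsum = 0.0
    tau = 0.0
    for k, v in enumerate(zs, start=1):
        cumsum += v
        # The support condition holds for a prefix of the sorted logits,
        # so the last k that satisfies it fixes the threshold tau.
        if 1 + k * v > cumsum:
            tau = (cumsum - 1) / k
    return [max(v - tau, 0.0) for v in z]


logits = [2.0, 1.0, 0.1]
sparse = sparsemax(logits)                      # mass concentrates on one entry
mixed = sparsemax([1.0, 0.8, 0.1])              # two nonzero entries, one exact zero
dense = sparsemax([v / 4.0 for v in logits])    # flatter logits: all entries nonzero
```

Unlike softmax, the outputs contain exact zeros, and flattening the logits enlarges the support, mirroring the role of the temperature parameter mentioned above.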
While we can let the neural network learn without any restrictions, there is indeed prior knowledge that we can apply, e.g., monotonicity. It is desired to keep the monotonicity of , i.e., , as a larger attention logit should be mapped into a larger attention strength. As sparsemax is monotonic, this requires that , or in other words, the order of the input of coincides with that of the output. To keep this property, is designed componentwise as , with are neural networks with hidden layers. Note that should be coupled with all of the entries of instead of being a univariate function only depending on , as demonstrated in Table II. As the second argument of (i.e., ) is invariant to , the order preservation of is equivalent to the monotonicity of and . In order to keep this monotonicity, we enforce all the weights of the networks and to be positive [8], by applying an absolute value function on the weights. This architecture can accelerate the learning process with extra prior knowledge, as it is monotonic by design.
III-B Message passing in MARL via GNN
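The weight-positivity trick can be illustrated with a scalar-input network whose weights pass through an absolute value. This is a simplified sketch, not the paper's architecture: the function name `monotone_mlp`, the single hidden layer of width 8, the ReLU activation, and the random initialization are all assumptions.

```python
import random


def monotone_mlp(x, W1, b1, W2, b2):
    """Scalar MLP that is non-decreasing in x by construction.

    abs() makes every effective weight non-negative, and ReLU is itself
    non-decreasing, so the composition cannot decrease as x grows.
    Biases are left unconstrained.
    """
    hidden = [max(0.0, abs(w) * x + b) for w, b in zip(W1, b1)]
    return sum(abs(w) * h for w, h in zip(W2, hidden)) + b2


random.seed(0)
W1 = [random.uniform(-1, 1) for _ in range(8)]  # raw weights may be negative
b1 = [random.uniform(-1, 1) for _ in range(8)]
W2 = [random.uniform(-1, 1) for _ in range(8)]
b2 = random.uniform(-1, 1)

xs = [-2.0 + 0.1 * i for i in range(41)]
ys = [monotone_mlp(x, W1, b1, W2, b2) for x in xs]
```

Because the constraint is built into the parameterization, the ordering of the logits is preserved for any parameter values reached during training.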
We now present how the information is aggregated to learn a representation for the per-agent value/policy network using a graph neural network. The scaled dot-product attention mechanism (Section II-B) with our generalized sparsemax as the activation function, denoted as sparseAtt, is applied to learn a communication graph and pass messages through the connections in the graph.
We start with a homogeneous multiagent system, where the relation between any agent pair is identical. A graph is defined as , where represents an agent and the cardinality of is . Moreover, is if agent and can communicate directly (or agent is observable to agent ), and otherwise. This is a restriction on the communication graph and is the set of all possible edges. Then sparseAtt aims to learn a subset of via induced sparsity without compromising much optimality. For agent , let and be its observation and entity encoding respectively, where is the local state and is a learnable agent encoder network. Then the initial observation embedding of agent , denoted as , is
(8) 
where is another learnable network and the operator denotes concatenation. Then at hop (th round of message passing), agent aggregates information from its possible neighbors belonging to the set as follows
(9) 
With , the multi-hop message passing can enable the agent to obtain information from beyond its immediate neighbors. In the message aggregation from all of the agents , identical parameters are used in , which enforces the permutation-invariance. This property is desirable because homogeneous agents are interchangeable.
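The effect of multi-hop aggregation on a sparse graph can be seen in a toy example with scalar features. The update used here (own feature plus the attention-weighted sum of neighbor features) is a simplified stand-in chosen for illustration and is not the aggregation rule in (9); the function name `message_passing` and the path-graph weights are assumptions.

```python
def message_passing(h, weights, hops):
    """Repeat a weighted neighbor aggregation for a number of hops.

    weights[i][j] is the attention weight agent i places on agent j;
    a zero weight means there is no edge and no message is passed.
    """
    for _ in range(hops):
        h = [
            h_i + sum(w * h[j] for j, w in enumerate(row) if w != 0.0)
            for h_i, row in zip(h, weights)
        ]
    return h


# Path graph 0 - 1 - 2: agents 0 and 2 are not directly connected.
weights = [
    [0.0, 1.0, 0.0],
    [0.5, 0.0, 0.5],
    [0.0, 1.0, 0.0],
]
h0 = [0.0, 0.0, 1.0]  # only agent 2 starts with a nonzero feature

one_hop = message_passing(h0, weights, hops=1)
two_hop = message_passing(h0, weights, hops=2)
```

After one hop, agent 0 has received nothing from agent 2; after two hops, agent 2's information has reached it through agent 1.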
However, interchangeability is no longer applicable to heterogeneous systems or mixed cooperative-competitive environments. For example, with being a two-team partition of , agents cooperate with other agents from the same team but compete against agents from the other team. For agent , its teammate neighborhood and enemy neighborhood are and , respectively. The edges connecting teammates and enemies are called positive and negative edges. Then based on multi-relational GNN, agent aggregates information at hop in the following way
where and are different attention heads. Additionally, balance theory [11] suggests that “the teammate of my teammate is my teammate” and “the enemy of my enemy is my teammate.” In a two-team competitive game, any walk (a sequence of nodes and edges of a graph) between an agent pair in the communication graph, comprising both positive and negative edges, will lead to the same relation between the agent pair [9]. This property eliminates the ambiguity that the information aggregated from the same agent (but via a different walk) might have a different teammate/enemy property.
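The balance property can be verified on a small example by multiplying edge signs along a walk: positive for teammates, negative for enemies. The helper names `edge_sign` and `walk_sign` and the four-agent team assignment are assumptions made for this sketch.

```python
def edge_sign(team, i, j):
    """+1 for a teammate (positive) edge, -1 for an enemy (negative) edge."""
    return 1 if team[i] == team[j] else -1


def walk_sign(team, walk):
    """Product of edge signs along a walk given as a list of agent indices."""
    sign = 1
    for i, j in zip(walk, walk[1:]):
        sign *= edge_sign(team, i, j)
    return sign


team = [0, 0, 1, 1]  # agents 0 and 1 on one team, agents 2 and 3 on the other

# Two enemy edges cancel: agent 1 is reached as a teammate of agent 0.
via_enemies = walk_sign(team, [0, 2, 3, 1])
# A direct teammate edge followed by an enemy edge yields an enemy.
via_teammate = walk_sign(team, [0, 1, 2])
```

In a two-team partition, the sign of any walk equals the sign of the direct relation between its endpoints, so the aggregated relation does not depend on the route taken.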
The proposed algorithmic framework is illustrated in Fig. 1. After rounds of message passing, each agent has an updated encoding . This encoding is then fed into the value network and the policy network, which estimate the state value and a probability distribution over all possible actions, respectively. As homogeneous agents are interchangeable, they share all of the parameters, including entity encoding, policy, value and message passing. Proximal policy optimization (PPO) [26] is employed to train the model in an end-to-end manner. As only local information is required, the proposed approach is decentralized. Moreover, our approach maintains the transferability of GNN-based approaches, as all the network dimensions are invariant to the number of agents/entities in the system.
IV Experiments
IV-A Task description
The proposed algorithm is evaluated in three swarm robotics tasks: Coverage, Formation, and ParticleSoccer [24], the first two of which are cooperative and the third of which is competitive. The tasks are simulated in the Multiagent Particle Environment [19].
Coverage: There are agents (light purple) and landmarks (black) in the environment (see illustration in Fig. 1(a)). The objective for the agents is to cover the landmarks with the smallest possible number of timesteps. Agents are not assigned to reach a certain landmark, but instead, have to figure out the assignment via communication such that the task can be finished optimally.
Formation: There are agents (blue) and landmarks (black) in the environment (see illustration in Fig. 1(b)), with being an even natural number. The agents need to split into two subteams of equal size, with each of them building a formation of a regular pentagon. The two regular pentagons with different sizes are both centered at the landmark.
ParticleSoccer: There are agents and 3 landmarks in the environment (see illustration in Fig. 1(c)), with the bigger landmark as a movable ball and the two smaller ones as fixed landmarks. A team wins the game by pushing the black ball into the opponent team’s goal. The goal color of the light blue (red, resp.) team is blue (red, resp.).
IV-B Implementation specifications
The agent encoder and the entity encoder take as input the dimensional agent states and dimensional entity states, respectively. The queries, keys, and values in all of the sparse attention mechanisms are dimensional. The communication hop is . All neural networks are fully connected with the ReLU activation function. In the sparsity-promoting function (7), and all have one hidden layer with dimensions being , and , respectively. The absolute value function is used to keep the weights of the monotonicity-preserving neural network positive.
Evaluation is performed every episodes and PPO update is executed for epochs after collecting experience of timesteps.
IV-C Results
In the cooperative scenarios, i.e., Coverage and Formation, two metrics are used to evaluate the algorithms. The first is the average reward per step and the second is the task success rate. Higher means better performance for both metrics.
We compare our algorithm with two baselines: GNN-based MARL with a dense attention mechanism [1] and MAAC [13]. These two algorithms are considered to be strong baselines as they reported advantageous results against algorithms including MADDPG [19], COMA [10], VDN [29] and QMIX [23]. Public repositories
In simulation, we set and for Coverage and Formation, respectively. Fig. 4 and Fig. 4 demonstrate that our algorithm can achieve higher rewards than the two baselines with fewer episodes. This validates that sparseAtt can accelerate the learning process by aggregating information from the agents that matter the most. Moreover, in terms of the second metric, i.e., success rate, our algorithm consistently outperforms the two baselines by a significant margin (with a much smaller variance), as shown in Fig. 5. The evaluations of both metrics for the two scenarios provide strong support for the advantages of our algorithm.
For the competitive ParticleSoccer task, we set with both red team and blue team of size . As this task is competitive, the above two metrics are no longer applicable. Instead, we let the red (blue, resp.) team play against a blue (red, resp.) team from another algorithm. Table III presents the results of the inter-algorithm competition. The overall score of each algorithm equals the sum of the winning evaluation episodes of its red team and blue team playing against the blue and red teams, respectively, from other algorithms. The overall scores in Table III show that our algorithm can learn strong policies.
Table III: Results of the inter-algorithm competition in ParticleSoccer (rows: red team; columns: blue team).
Red \ Blue    sparseAtt    denseAtt    MAAC
sparseAtt     N/A
denseAtt                   N/A
MAAC          (7,0,43)     (2,0,48)    N/A
IV-D Interpretability of the sparse communication graph
Let us proceed by considering the inherent sparsity in Formation and ParticleSoccer. As mentioned in the description of the Formation scenario, the formation of each pentagon is related to half of the agents, while the sub-team assignments need to be learned. In the implementation, the reward is set to require that the first agents closest to the landmark build the inner pentagon and the remaining agents build the outer pentagon. With the convergence of the learning algorithm, once a sub-team partition is learned to complete the two sub-tasks, the learned agent indexing of each team should not vary, because of the distance sorting and because the two pentagons are relatively far apart. As a result, the reward to complete each sub-task is only related to the corresponding sub-team and hence the two sub-teams are decoupled from each other. The adjacency matrix of the learned communication graph shown in Fig. 5(a) validates that the inter-team communication is very sparse. This adjacency matrix is determined up to a row/column permutation, as the indexing of each sub-team is learned without being known a priori. Moreover, within a sub-team, the algorithm learns a communication graph similar to a star graph. This can be understood as each sub-team selecting a leader. As a star graph is a connected graph with the minimum possible number of edges, this communication protocol is both effective and efficient. Also, the length of the path between any agent pair in a star graph is no greater than 2, which echoes the two-hop communication we used in the simulation. With two-hop message passing, the agents can eventually communicate with agents as far as two edges away, which includes all of the agents in a star graph. Note that the sparsity on the diagonal entries of the communication graph does not mean that the agent’s own information is neglected, as it is separately concatenated; see (9).
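The two structural claims about star graphs (minimum edge count among connected graphs, and every pair within two hops) can be checked with a short breadth-first search. The helper names `star_graph` and `shortest_path_lengths` and the choice of five agents are assumptions made for this sketch.

```python
from collections import deque


def star_graph(n, center=0):
    """Adjacency sets of a star graph: every agent is linked to the center."""
    adj = {i: set() for i in range(n)}
    for i in range(n):
        if i != center:
            adj[center].add(i)
            adj[i].add(center)
    return adj


def shortest_path_lengths(adj, source):
    """Breadth-first search distances (in edges) from the source agent."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


adj = star_graph(5)
num_edges = sum(len(nbrs) for nbrs in adj.values()) // 2
max_hops = max(max(shortest_path_lengths(adj, s).values()) for s in adj)
```

With five agents, the star uses four edges, the fewest that keep the graph connected, and no pair of agents is more than two edges apart.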
Also, in the ParticleSoccer scenario, from each team’s perspective, agents need to coordinate tightly within the team to greedily push the ball to the other team’s goal while only attending to a small number of agents from the other team. This leads to dense intra-team communication but relatively sparse inter-team communication. This is validated by the approximately block-diagonal adjacency matrix of the learned communication graph in Fig. 5(b).
V Conclusions and Future Work
This paper exploits sparsity to scale up multiagent reinforcement learning (MARL), which is motivated by the fact that interactions are often sparse in multiagent systems. We propose a new general and adaptive sparsity-inducing activation function to empower an attention mechanism, which can learn a sparse communication graph among agents. The sparse communication graph can make the message passing both effective and efficient such that the scalability of MARL is improved without compromising optimality. Our algorithm outperforms two baselines by a significant margin on three tasks. Moreover, for scenarios with inherent sparsity, it is shown that the sparsity of the learned communication graph is interpretable.
Future work will focus on combining evolutionary population curriculum learning and graph neural network to further improve the scalability. In addition, robust learning against evolving/learned adversarial attacks is also of great interest.
Acknowledgments
Research is supported by Scientific Systems Company, Inc. under research agreement SC166104. The authors would like to thank Dong-Ki Kim, Samir Wadhwania and Michael Everett for their many useful discussions and Amazon Web Services for computation support.
Footnotes
 https://github.com/openai/multiagent-particle-envs
 https://github.com/sumitsk/matrl.git
 https://github.com/shariqiqbal2810/MAAC
References
[1] (2019) Learning transferable cooperative behavior in multi-agent teams. arXiv preprint arXiv:1906.01202.
[2] (2002) The complexity of decentralized control of Markov decision processes. Mathematics of Operations Research 27 (4), pp. 819–840.
[3] (2018) Learning classifiers with Fenchel-Young losses: generalized entropies, margins, and algorithms. arXiv preprint arXiv:1805.09717.
[4] (2008) A comprehensive survey of multiagent reinforcement learning. IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications and Reviews) 38 (2), pp. 156–172.
[5] (2015) Attention-based models for speech recognition. In Advances in Neural Information Processing Systems, pp. 577–585.
[6] (2019) Adaptively sparse transformers. arXiv preprint arXiv:1909.00015.
[7] (2018) TarMAC: targeted multi-agent communication. arXiv preprint arXiv:1810.11187.
[8] (2009) Incorporating functional knowledge in neural networks. Journal of Machine Learning Research 10 (Jun), pp. 1239–1262.
[9] (2012) Networks, crowds, and markets: reasoning about a highly connected world. Significance 9, pp. 43–44.
[10] (2018) Counterfactual multi-agent policy gradients. In Thirty-Second AAAI Conference on Artificial Intelligence.
[11] (1946) Attitudes and cognitive organization. The Journal of Psychology 21 (1), pp. 107–112.
[12] (2017) Guided deep reinforcement learning for swarm systems. arXiv preprint arXiv:1709.06011.
[13] (2018) Actor-attention-critic for multi-agent reinforcement learning. arXiv preprint arXiv:1810.02912.
[14] (2018) Graph convolutional reinforcement learning for multi-agent cooperation. arXiv preprint arXiv:1810.09202 2 (3).
[15] (2019) Graph policy gradients for large scale robot control. arXiv preprint arXiv:1907.03822.
[16] (2013) Reinforcement learning in robotics: a survey. The International Journal of Robotics Research 32 (11), pp. 1238–1274.
[17] (2013) Concepts and recent advances in generalized information measures and statistics. Bentham Science Publishers.
[18] (2018) On controllable sparse alternatives to softmax. In Advances in Neural Information Processing Systems, pp. 6422–6432.
[19] (2017) Multi-agent actor-critic for mixed cooperative-competitive environments. In Advances in Neural Information Processing Systems, pp. 6379–6390.
[20] (2016) From softmax to sparsemax: a sparse model of attention and multi-label classification. In International Conference on Machine Learning, pp. 1614–1623.
[21] (2015) Human-level control through deep reinforcement learning. Nature 518 (7540), pp. 529–533.
[22] (2017) A regularized framework for sparse and structured neural attention. In Advances in Neural Information Processing Systems, pp. 3338–3348.
[23] (2018) QMIX: monotonic value function factorisation for deep multi-agent reinforcement learning. arXiv preprint arXiv:1803.11485.
[24] (2004) Swarm robotics: from sources of inspiration to domains of application. In International Workshop on Swarm Robotics, pp. 10–20.
[25] (2018) Modeling relational data with graph convolutional networks. In European Semantic Web Conference, pp. 593–607.
[26] (2017) Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347.
[27] (2016) Mastering the game of Go with deep neural networks and tree search. Nature 529 (7587), pp. 484.
[28] (2017) Mastering the game of Go without human knowledge. Nature 550 (7676), pp. 354–359.
[29] (2017) Value-decomposition networks for cooperative multi-agent learning. arXiv preprint arXiv:1706.05296.
[30] (2018) Value-decomposition networks for cooperative multi-agent learning based on team reward. In Proceedings of the 17th International Conference on Autonomous Agents and MultiAgent Systems, pp. 2085–2087.
[31] (2017) Attention is all you need. In Advances in Neural Information Processing Systems, pp. 5998–6008.
[32] (2018) Mean field multi-agent reinforcement learning. arXiv preprint arXiv:1802.05438.