Decentralized Multi-Agent Actor-Critic with Generative Inference
Recent multi-agent actor-critic methods have utilized centralized training with decentralized execution to address the non-stationarity of co-adapting agents. This training paradigm constrains learning to the centralized phase such that only pre-learned policies may be used during the decentralized phase, which performs poorly when agent communications are delayed, noisy, or disrupted. In this work, we propose a new system that can gracefully handle partially-observable information due to communication disruptions during decentralized execution. Our approach augments the multi-agent actor-critic method’s centralized training phase with generative modeling so that agents may infer other agents’ observations when provided with locally available context. Our method is evaluated on three tasks that require agents to combine local and remote observations communicated by other agents. We evaluate our approach by introducing both partial observability and altered environment dynamics during decentralized execution, and show that decentralized training on inferred observations performs as well as or better than existing actor-critic methods.
1 Introduction
Reinforcement learning (RL) with function approximation has been used to solve difficult sequential decision making problems in high dimensional state and action spaces, such as game playing Mnih2013 and robotics Haarnoja2018. Many decision making problems are best modeled as a multi-agent system in which agents learn concurrently with other agents. Two naive approaches that use single-agent RL methods in multi-agent problems are independent learning (IL) and joint action learning (JAL), but these approaches perform poorly. IL agents simply treat other agents as part of the stochastic environment. JAL agents condition on the full joint action and observation spaces for all agents, but these joint spaces grow exponentially with the number of agents and are therefore not scalable.
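As a rough illustration of why joint action learning does not scale, the sketch below counts joint actions for a hypothetical setting in which every agent has the same number of discrete actions. The function name and the numbers are illustrative assumptions, not values from this work.

```python
def joint_action_count(actions_per_agent: int, num_agents: int) -> int:
    """Size of the joint action space when every agent has the same action set."""
    return actions_per_agent ** num_agents


if __name__ == "__main__":
    # Hypothetical example: 5 discrete actions per agent.
    for n in (1, 2, 4, 8):
        print(n, joint_action_count(5, n))
```

An independent learner, by contrast, only ever reasons over its own action set, which is why IL scales but ignores the other agents.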
Most multi-agent reinforcement learning (MARL) methods use decentralized policies where each agent’s policy depends on local observations and actions. Decentralized scenarios usually have partial observations and limited communication. In some problems, the learning must also be decentralized and rely on agent communication Zhang2018a. Often, a simulator may be available so that agents may learn with extra state information while assuming free, instantaneous communication, e.g., for robotics or autonomous vehicles. After centralized learning, the agents execute decentralized policies using only local information. Many recent works have adopted this Centralized Training, Decentralized Execution (CTDE) framework; we review several such methods in the following section.
Agent communication is the main approach to learning in decentralized systems when agents have partial observations. In CTDE methods, the communication during centralized training is implicit: all observations are shared freely and instantly. However, communication networks may be lossy, delayed, or lacking coverage.
We propose to address the limited communication learning gap of CTDE methods in decentralized execution with generative modeling. Specifically, we use a modified context conditional generative adversarial network (CC-GAN) Denton2016 to infer missing joint observations given partial observations. The task of filling in partial observations by generative inference is similar to the image inpainting problem for a missing patch of pixels: with an arbitrary number of missing observations, we would like to infer the most likely observation of the other agents.
We extend the popular MADDPG method Lowe2017 as it appears most amenable to full decentralization. MADDPG agents require both (1) the policies of other agents and (2) other agents’ observations as input to the policies and critics. As other agents’ approximate policies can be learned, the agents only need to learn a model for the other agents’ observations. The generative model will learn this joint distribution of agent observations by training on random combinations of missing agent communications during centralized training. During the decentralized portion, the agents may sample from this model to continue learning under partial observability.
Our contributions are as follows. We review the recent trend of CTDE MARL literature and identify that these methods are ill-suited to learn in the decentralized execution phase without explicit communication. We show how a context conditional generative model can address this problem for a popular CTDE method and provide a modified GAN that can learn the joint observation distribution. We experimentally evaluate our approach on three continuous multi-agent tasks. To the best of our knowledge, this is the first work to use generative models to overcome multi-agent partial observability or address decentralized learning in CTDE methods.
2 Related work
2.1 Centralized Training with Decentralized Execution
Many recent multi-agent actor-critic methods utilize centralized training with decentralized execution. This training procedure lessens the non-stationarity of co-adapting agents by providing additional information in centralized training. The majority of these methods: (1) solve only cooperative tasks by using a shared reward for all agents, (2) use a centralized critic function that conditions on all agents’ observations and actions, and (3) use policy or critic networks that include recurrent components, such as an LSTM Schmidhuber1997. Recurrent networks have been shown to be effective at learning policies in partially observable environments Hausknecht2015.
Gupta et al. solve cooperative, partially observable tasks with recurrent policies and curriculum learning Gupta2017. They compare several versions of the CTDE methods, including using Q-learning vs. actor-critic, centralized vs. decentralized policies with parameter sharing, and feed-forward vs. recurrent policies. Foerster et al. use a centralized critic for all agents with decentralized recurrent policies for cooperative tasks Foerster2018a. In addition, they use counterfactual baselines: difference rewards comparing agents’ actions to a default action.
Instead of learning a single centralized critic for all agents, Lowe et al. introduced multi-agent DDPG (MADDPG), which has a centralized critic for each agent and may be used in cooperative or competitive tasks Lowe2017. While recurrent networks may be used with MADDPG, only feed-forward networks were tested. Rashid et al. also use centralized critics for each agent, but include a centralized mixing network to combine each agent’s critic function Rashid2018. Their method also uses recurrent policies and may be used in cooperative tasks. Foerster et al. learn communication protocols over a limited-bandwidth communication channel Foerster2016. They propose two approaches that use recurrent policies in cooperative settings via parameter sharing or sending gradients over the communication channel.
We chose to extend MADDPG because it appears the most amenable to decentralization: each agent $i$ has its own critic function $Q_i$ (with no mixing network), policy $\mu_i$, approximate policies of other agents $\hat{\mu}_i^j$, and reward function $r_i$ to allow for both cooperative and competitive tasks. In addition, the policies are deterministic, which allows for continuous state and action spaces.
2.2 Decentralized Learning
Traditional decentralized MARL approaches rely on persistent reliable communication so that agents may share local observations and jointly choose actions in an uncertain environment. When states are represented in a factored form, agents may solve a distributed constraint optimization problem over the network to choose a good joint action for all agents Zhang2013. In pure MARL approaches, agents choose actions while sharing information over the communication channel. In some systems agents share local rewards but the state is fully observable Zhang2018a, and others use a communication-based consensus protocol to agree on a global state from local observations before choosing joint actions Zhang2018. In contrast, our approach aims to allow learning despite disruptions in communication. When communication is unavailable, we infer the missing data given the available local observations of neighboring agents.
3 Background
3.1 Reinforcement Learning
Formally, each task in decentralized MARL is represented by a discrete-time partially observable Markov Game Littman1994, a multi-agent extension of the Markov decision process sutton2018reinforcementbook. A Markov Game is a tuple $\langle \mathcal{S}, \{\mathcal{A}_i\}, \{\mathcal{O}_i\}, \mathcal{T}, \{r_i\}, \gamma \rangle$ where a set of $N$ agents choose actions based on local observations to maximize their own expected cumulative reward. At each time step $t$, the environment has a true state $s_t \in \mathcal{S}$ and each agent $i$ simultaneously chooses an action $a_i$ from their individual set of available actions $\mathcal{A}_i$. The environment stochastically transitions to a new state $s_{t+1}$ given by the state transition function $\mathcal{T}$, and each agent then receives a reward according to its own reward function $r_i$. The discount factor $\gamma$ is used for calculating expected return $R_i = \sum_{t=0}^{T} \gamma^t r_i^t$ for time horizon $T$. Each agent receives a private observation $o_i \in \mathcal{O}_i$ correlated with the state $s_t$. Agents choose actions using a stochastic policy $\pi_{\theta_i}$, where $\pi_{\theta_i}$ has parameters $\theta_i$.
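The role of the discount factor in the return can be made concrete with a short sketch. It assumes a finite list of per-step rewards for a single agent; the function name is illustrative.

```python
def discounted_return(rewards, gamma):
    """Sum of gamma**t * rewards[t], accumulated backwards for numerical simplicity."""
    total = 0.0
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total


if __name__ == "__main__":
    # Three steps of reward 1 with gamma = 0.5: 1 + 0.5 + 0.25
    print(discounted_return([1.0, 1.0, 1.0], 0.5))
```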
The three main approaches to RL are action-value methods, policy gradient methods, and the actor-critic hybrid approach sutton2018reinforcementbook. Q-learning estimates the action-value function $Q^{\pi}(s, a)$: the future discounted reward when taking action $a$ from state $s$ while following policy $\pi$. Deep Q-Networks (DQN) used Q-learning with neural network function approximation to play Atari games from pixels Mnih2013. DQN also introduced two stability improvements: target functions that are updated less frequently, and an experience replay buffer that stores environment transitions for decorrelated batch updates.
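A minimal sketch of an experience replay buffer is given below, assuming transitions are opaque Python objects and that uniform sampling without replacement is acceptable. The class name, capacity handling, and seeding are illustrative choices rather than details taken from DQN or from this work.

```python
import random
from collections import deque


class ReplayBuffer:
    """Fixed-capacity buffer; the oldest transitions are discarded first."""

    def __init__(self, capacity, seed=0):
        self.storage = deque(maxlen=capacity)
        self.rng = random.Random(seed)

    def add(self, transition):
        self.storage.append(transition)

    def sample(self, batch_size):
        # Uniform sampling breaks the temporal correlation between consecutive steps.
        k = min(batch_size, len(self.storage))
        return self.rng.sample(list(self.storage), k)

    def __len__(self):
        return len(self.storage)


if __name__ == "__main__":
    buffer = ReplayBuffer(capacity=3)
    for step in range(5):
        buffer.add(("obs", "action", float(step), "next_obs"))
    print(len(buffer), buffer.sample(2))
```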
Instead of learning a value function, policy gradient methods learn a parameterized policy directly sutton2018reinforcementbook. This approach is often more efficient but tends to have high variance. To reduce variance, actor-critic algorithms combine an action-value function along with the parameterized policy to guide policy updates.
Deterministic policy gradient (DPG) methods learn a policy that returns a single action Silver2014. Deep DPG (DDPG) is an off-policy model-free actor-critic algorithm that combines the DQN value function with a deterministic policy Lillicrap2016a. Like DQN, DDPG uses experience replay and target networks for both value and policy networks. Random noise is added to the policy’s output for better exploration. From here on, all policies are assumed deterministic.
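One common way to maintain the target networks mentioned above is a slow moving average of the learned weights, as in DDPG Lillicrap2016a. The sketch below assumes parameters are flat lists of floats; the name `soft_update` and the value of `tau` are illustrative.

```python
def soft_update(target_params, source_params, tau):
    """Move each target parameter a fraction tau of the way toward the learned one."""
    return [(1.0 - tau) * t + tau * s for t, s in zip(target_params, source_params)]


if __name__ == "__main__":
    target = [0.0, 1.0]
    learned = [1.0, 1.0]
    print(soft_update(target, learned, tau=0.1))
```

A small `tau` keeps the bootstrapping target nearly fixed between updates, which is the stabilizing effect target networks are meant to provide.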
3.2 Generative Adversarial Networks
Generative models learn a data distribution and can generate new samples similar to the learned distribution. The most popular class of models is the generative adversarial network (GAN) Goodfellow2014. A GAN is composed of two neural networks with opposing goals: a generator network $G$ that receives noise $z$ as input and produces samples similar to the data distribution, and a discriminator network $D$ that tries to distinguish real data points from those sampled from $G$. While GANs have largely been applied to image generation, they should be able to learn any joint data distribution.
Wasserstein GANs (WGANs) are a variant of GANs that have been shown to have more reliable convergence and less mode collapse Arjovsky2017. WGAN uses a critic rather than a discriminator (outputs are not probabilities), trains using a simple loss metric that approximates the Wasserstein distance when the network enforces a 1-Lipschitz constraint, and allows pre-training the critic to optimality. To avoid confusing the WGAN critic with the MADDPG critic $Q$, we will continue to refer to it as the discriminator $D$, as this distinction makes no difference in our work.
Our work uses the context-conditional generative model, where the model takes a partial input and must generate a complete data sample. The closest computer vision task to our problem is image inpainting, where a patch of pixels from an image is removed and the model must fill the missing patch based on its learned model of the pixels’ joint distribution. The CC-GAN objective function is given by

$$\min_G \max_D \; \mathbb{E}_{x \sim \mathcal{X}} [\log D(x)] + \mathbb{E}_{x \sim \mathcal{X},\, m \sim \mathcal{M}} [\log(1 - D(x_I))]$$

where $m$ denotes a binary mask used to drop a patch from image $x$, and $x_I = (1 - m) \odot x_G + m \odot x$ is the combined inpainted image, with $x_G = G(m \odot x, z)$ the generator output and $\odot$ denoting element-wise multiplication.
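The combination step in inpainting can be sketched in a few lines. The sketch assumes the CC-GAN mask convention of Denton2016, in which a mask value of 1 marks a kept context element and 0 marks a dropped element to be filled from the generator output. Flat Python lists stand in for images, and the function name is illustrative.

```python
def combine_inpainted(mask, context, generated):
    """Keep context where mask is 1 and take the generated value where mask is 0."""
    return [m * x + (1 - m) * g for m, x, g in zip(mask, context, generated)]


if __name__ == "__main__":
    mask = [1, 0, 1]
    context = [5.0, 6.0, 7.0]      # the middle element is the dropped patch
    generated = [9.0, 8.0, 7.5]    # hypothetical generator output
    print(combine_inpainted(mask, context, generated))
```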
Other generative models for image inpainting include autoregressive models and context encoders, but they are not suitable for our approach. Autoregressive models, such as PixelRNN VanOord2016, require a pre-specified ordering over the pixels and thus will not work for arbitrary missing data. Context encoders use a variational autoencoder coupled with adversarial loss Pathak2016, but results tend to be less accurate compared to CC-GANs.
4 Decentralized Fine-Tuning
4.1 Inferring Observations with CC-WGAN
We approach the problem of inferring missing information from partial observations as a generative sampling problem, similar to the task of image inpainting. We use a modified CC-GAN as our generative model Denton2016. Specifically, we train a WGAN with gradient penalty constraints Arjovsky2017; Gulrajani2017 with the CC-GAN random data masking training procedure. We refer to our modified model as the context-conditional WGAN (CC-WGAN). In our experiments, the CC-WGAN was more reliable than the regular CC-GAN for low-dimensional data. Unlike the standard image generation task for GANs, we have no pre-collected training data. Instead, we store joint observations in a replay buffer, just as MADDPG does, to stabilize learning when batch training.
Fig. 1 shows the update procedure for training the CC-WGAN. When the model updates, it randomly samples joint observations from the joint observation replay buffer to randomly mask agent observations. During centralized training, all joint observations are available. If the CC-WGAN is updated in the decentralized phase, inferred observations are mixed into the updates. This is a form of semi-supervised learning because the model updates on its own predictions Goodfellow2016book.
For each joint observation $\mathbf{o} = (o_1, \dots, o_N)$, we randomly mask combinations of missing agents from $\mathbf{o}$ with a binary mask $m$ (1 for observed elements, 0 for masked elements). Masked elements in $\mathbf{o}$ are replaced with random normal noise to get the partial observation $\tilde{\mathbf{o}}$. The masking procedure requires some knowledge of the conditions when observations will become partial, e.g., inter-agent distance greater than communications allow. Fig. 2 shows the masking procedure for distance-based partial observability based on coordinates. In the diagram, $G$ takes joint partial observation $\tilde{\mathbf{o}}$ and binary mask vector $m$, and produces the generated output $\mathbf{o}_G$. We then replace the masked portion in $\tilde{\mathbf{o}}$ with that portion of the generated output to get a combined observation $\hat{\mathbf{o}} = m \odot \tilde{\mathbf{o}} + (1 - m) \odot \mathbf{o}_G$.
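The distance-based masking described above can be sketched as follows. This is a minimal illustration under stated assumptions: agent positions are 2D coordinates, each agent's observation is a flat list of floats, an agent is masked as a whole when it is farther than the distance threshold from the observer, and masked elements are filled with standard normal noise. All function names are hypothetical.

```python
import math
import random


def distance_mask(positions, observer, max_distance):
    """Per-agent mask: 1 if the agent is within max_distance of the observer, else 0."""
    ox, oy = positions[observer]
    return [1 if math.hypot(x - ox, y - oy) <= max_distance else 0
            for x, y in positions]


def mask_joint_observation(joint_obs, agent_mask, rng):
    """Flatten the joint observation, replacing masked agents' elements with N(0, 1) noise.

    Returns the partial observation and the element-wise mask fed to the generator.
    """
    partial, element_mask = [], []
    for obs, keep in zip(joint_obs, agent_mask):
        for value in obs:
            partial.append(value if keep else rng.gauss(0.0, 1.0))
            element_mask.append(keep)
    return partial, element_mask


if __name__ == "__main__":
    rng = random.Random(0)
    positions = [(0.0, 0.0), (0.3, 0.4), (1.0, 1.0)]
    joint_obs = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
    agent_mask = distance_mask(positions, observer=0, max_distance=0.6)
    print(agent_mask, mask_joint_observation(joint_obs, agent_mask, rng))
```

During centralized training the mask would instead be drawn at random over combinations of agents; the distance rule is what produces the mask in the decentralized phase.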
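Where CC-GAN passes only the inpainted patch to the discriminator, we instead pass the combined observation $\hat{\mathbf{o}}$ to the discriminator because any number of agents may be missing from $\tilde{\mathbf{o}}$. $D$ is trained to distinguish batches of size $K$ of real joint observations $\mathbf{o}$ and inferred observations $\hat{\mathbf{o}}$ by minimizing the empirical Wasserstein distance:

$$L_D = \frac{1}{K} \sum_{k=1}^{K} \left[ D(\hat{\mathbf{o}}^{(k)}) - D(\mathbf{o}^{(k)}) + \lambda_{GP} \left( \lVert \nabla_{\bar{\mathbf{o}}^{(k)}} D(\bar{\mathbf{o}}^{(k)}) \rVert_2 - 1 \right)^2 \right]$$

where $\bar{\mathbf{o}}^{(k)} = \epsilon \, \mathbf{o}^{(k)} + (1 - \epsilon) \, \hat{\mathbf{o}}^{(k)}$ with $\epsilon \sim U[0, 1]$, following the gradient penalty of Gulrajani2017. Similarly, $G$ is updated by maximizing:

$$L_G = \frac{1}{K} \sum_{k=1}^{K} D(\hat{\mathbf{o}}^{(k)})$$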
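The Wasserstein terms of the two updates can be sketched directly from discriminator scores. The sketch assumes the scores have already been computed for a batch of real and inferred joint observations, and it deliberately omits the gradient penalty term, which requires automatic differentiation. Function names are illustrative.

```python
def wasserstein_discriminator_loss(real_scores, inferred_scores):
    """Quantity the discriminator minimizes: mean score on inferred minus mean score on real."""
    mean_real = sum(real_scores) / len(real_scores)
    mean_inferred = sum(inferred_scores) / len(inferred_scores)
    return mean_inferred - mean_real


def generator_objective(inferred_scores):
    """Quantity the generator maximizes: mean discriminator score on inferred observations."""
    return sum(inferred_scores) / len(inferred_scores)


if __name__ == "__main__":
    real = [1.0, 3.0]       # hypothetical scores on real joint observations
    inferred = [0.0, 1.0]   # hypothetical scores on inferred observations
    print(wasserstein_discriminator_loss(real, inferred), generator_objective(inferred))
```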
4.2 MADDPG with Inferred Observations
As stated before, we augment MADDPG Lowe2017 with the CC-WGAN because it appears the most flexible CTDE method for decentralization. Each MADDPG agent $i$ learns a deterministic policy $\mu_i$, a centralized critic $Q_i$, and a set of approximate policies $\hat{\mu}_i^j$ for each other agent $j$.
Fig. 1 shows the MADDPG method updates along with the CC-WGAN for both centralized and decentralized phases. During centralized training, the critics and policies are updated exactly as MADDPG. In addition, the CC-WGAN is collecting joint observations in its replay buffer and updating as described above.
In the decentralized phase, local observations may be missing information about other agents. At each time step $t$, each agent $i$ receives a partial observation $\tilde{o}_i$ which consists of the agent’s local information and possibly information about other agents. When updating the centralized critics, if an agent $i$ has information about agent $j$ in its local partial observation $\tilde{o}_i$, then it can also see agent $j$’s partial observation $\tilde{o}_j$. This is because we assume agents within range are “communicating” all local information. Just as in training, the joint partial observation $\tilde{\mathbf{o}}$ is passed to the generator to get $\mathbf{o}_G$ and combined via binary mask $m$ with $\tilde{\mathbf{o}}$ to get the inferred observation $\hat{\mathbf{o}}$.
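Following the derivation in Lowe2017, the gradient of the deterministic policy objective with inferred observations is:

$$\nabla_{\theta_i} J(\mu_i) = \mathbb{E}_{\hat{\mathbf{o}}, a} \left[ \nabla_{\theta_i} \mu_i(\hat{o}_i) \, \nabla_{a_i} Q_i(\hat{\mathbf{o}}, a_1, \dots, a_N) \big|_{a_i = \mu_i(\hat{o}_i)} \right]$$

where $a_j$ for $j \neq i$ are taken from approximate policies such that $a_j = \hat{\mu}_i^j(\hat{o}_j)$. The approximate policies, with parameters $\phi_i^j$, are updated with:

$$L(\phi_i^j) = -\mathbb{E}_{\hat{o}_j, a_j} \left[ \log \hat{\mu}_i^j(a_j \mid \hat{o}_j) + \lambda H(\hat{\mu}_i^j) \right]$$

where $H$ is the entropy of the policy distribution and $\lambda$ is a small weight (0.001 in experiments). The centralized critics are updated with:

$$L(Q_i) = \mathbb{E}_{\hat{\mathbf{o}}, a, r, \hat{\mathbf{o}}'} \left[ \left( Q_i(\hat{\mathbf{o}}, a_1, \dots, a_N) - y \right)^2 \right], \qquad y = r_i + \gamma \, Q_i'(\hat{\mathbf{o}}', a_1', \dots, a_N') \big|_{a_j' = \mu_j'(\hat{o}_j')}$$

where $\mu_j'$ is a target policy and $\hat{\mathbf{o}}'$ is an inferred next observation following observation $\hat{\mathbf{o}}$.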
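The critic update can be illustrated with a small sketch of the bootstrapped target and squared error. It assumes the target policies and target critic are plain Python callables and that the inferred next observation is given per agent; these stand-ins and all names are hypothetical and only mirror the structure of the MADDPG critic update in Lowe2017.

```python
def td_target(reward_i, gamma, target_critic, target_policies, next_obs_per_agent):
    """Reward plus discounted target-critic value at the target policies' next actions."""
    next_actions = [policy(obs) for policy, obs in zip(target_policies, next_obs_per_agent)]
    return reward_i + gamma * target_critic(next_obs_per_agent, next_actions)


def mean_squared_td_error(q_values, targets):
    """Batch average of (Q - y)^2."""
    return sum((q - y) ** 2 for q, y in zip(q_values, targets)) / len(q_values)


if __name__ == "__main__":
    # Toy stand-ins: two agents with one-dimensional observations and actions.
    target_policies = [lambda o: 2.0 * o[0], lambda o: -o[0]]
    target_critic = lambda obs, actions: sum(actions)
    inferred_next_obs = [[1.0], [3.0]]
    y = td_target(0.5, 0.9, target_critic, target_policies, inferred_next_obs)
    print(y, mean_squared_td_error([0.0], [y]))
```

In the decentralized phase, the only change relative to centralized training is that the observations passed to the policies and critic are the inferred ones.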
5 Experiments
5.1 Environments and Setup
We evaluate our method under three continuous scenarios of the Multi-agent Particles Environment (MPE; code: http://github.com/openai/multiagent-particle-envs) introduced in the original MADDPG paper Lowe2017. In order to directly compare to their results, we use the physical deception and predator-prey competitive scenarios also used by Lowe2017. We additionally test on a cooperative navigation scenario where agents share a reward function to show both competitive and cooperative settings. We did not include the communication-based scenarios tested by Lowe et al. because there is no clear way to make them partially observable. In contrast, we evaluate on physical scenarios that are easy to make partially observable: when agents are farther from each other than the partially observable distance $d$, they cannot observe each other’s positions, velocities, etc. (see Fig. 3). We use the same $d$ in all experiments, where the width of the 2D square environment is 2.
When using agent-distance partial observability, learning the coordination of predator-prey appears hardest, followed by physical deception, and lastly cooperative navigation. In predator-prey, three agents must coordinate to catch one faster agent, so there is no stable strategy. In physical deception, two agents should learn to deceive an adversary agent by covering two landmarks to hide which is the correct goal. If the adversary is out of range, this strategy should not change. In cooperative navigation, each agent must move to and remain near different landmarks. With myopic view $d$, agents can still determine if another agent is covering the same landmark and move to another.
In addition to making the decentralized phase partially observable, we approximate real-world deployment by modifying simulation dynamics with scaled random normal noise and translation to both actions and observations. Combined with partial observability, the added noise makes learning more difficult for both the policies and the CC-WGAN and requires fine-tuning to the new distribution.
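A perturbation of this kind can be sketched as an additive shift plus scaled normal noise applied to each element of an action or observation vector. The scale and shift values below are placeholders, not the settings used in the experiments, and the function name is illustrative.

```python
import random


def perturb(values, noise_scale, shift, rng):
    """Translate every element by `shift` and add N(0, 1) noise scaled by `noise_scale`."""
    return [v + shift + noise_scale * rng.gauss(0.0, 1.0) for v in values]


if __name__ == "__main__":
    rng = random.Random(0)
    observation = [0.2, -0.4, 1.0]
    # Placeholder settings; the same function would be applied to actions.
    print(perturb(observation, noise_scale=0.05, shift=0.1, rng=rng))
```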
Each episode has 200 steps with no early termination. All models are updated every 100 steps, and are represented with a three-layer, feed-forward neural network with 64 hidden units. The models use the Adam optimizer and each non-output layer uses a ReLU activation function. Each plot shows the mean and standard deviation shading over 30 independent trials for each algorithm. Each algorithm within a single plot receives the same set of 30 random seeds for accurate comparisons with random exploration. In all plots, a vertical dashed line marks the episode in which the environment becomes decentralized.
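The per-episode mean and standard deviation used for such plots can be computed as sketched below. The sketch assumes each trial yields one reward value per episode and uses the population standard deviation; whether the plots use the population or the sample estimate is not stated here, so that choice is an assumption.

```python
import statistics


def mean_and_std_per_episode(trial_curves):
    """Aggregate equally long reward curves (one per trial) episode by episode."""
    per_episode = list(zip(*trial_curves))
    means = [statistics.mean(values) for values in per_episode]
    stds = [statistics.pstdev(values) for values in per_episode]
    return means, stds


if __name__ == "__main__":
    seeds = list(range(30))  # the same seeds would be reused for every algorithm
    # Two hypothetical trials with two episodes each.
    curves = [[1.0, 2.0], [3.0, 6.0]]
    print(len(seeds), mean_and_std_per_episode(curves))
```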
The following plots compare our approach of augmenting MADDPG with CC-WGAN inference against regular MADDPG and DDPG. We chose DDPG because it performed the best among all IL methods in Lowe2017 and we use the same environment. Agents learn approximate policies for all other agents in MADDPG with and without generative inference. MADDPG and our version are identical during the centralized training because agents only use the CC-WGAN inference in the decentralized phase. After centralized training, we let MADDPG continue learning while treating the partial observability as random noise, whereas our approach infers the missing data. Except for Fig. 6, the policies and CC-WGAN continue updating in the decentralized phase.
We did not test against the other CTDE methods discussed in Related Work because these methods use recurrent critics or policies. As such, they condition on past trajectories and would likely overcome the partial observability implicitly.
In Fig. 4, we show the cooperating agents’ reward for all agents using MADDPG, MADDPG with CC-WGAN inference, and DDPG. In this plot we only introduce partial observability when agents are farther than the partially observable distance $d$. Fig. 5 shows the reward for cooperating agents with both partial observability and altered environment dynamics in the decentralized phase to evaluate the capability of fine-tuning to another environment. The overall reward is much lower here than in Fig. 4, which suggests the CC-WGAN is not well-suited to switching its modeled observation distribution (i.e., for sim-to-real transfer Peng2018).
In Fig. 6, we compare the total reward for our method with four options of decentralized updates, where the decentralized phase has both partial observability and altered environment dynamics. All agents use the CC-WGAN inferred observations when choosing actions. The curves show the difference in whether the policy and CC-WGAN update on inferred observations in the decentralized phase.
Lastly, in Fig. 7, we show the CC-WGAN’s reconstruction mean squared error during training over several partial observability distances $d$. When $d = 0$, the model is effectively using IL in the decentralized phase; when $d$ is large relative to the environment width, the model is usually using complete observations. We show this reconstruction plot because the agents’ observations are low-dimensional (i.e., not images as GANs are usually used for). MSE may not be a good metric for this error; however, it was more informative than cosine similarity. We initially expected the error to be lower for larger $d$, but it is clear that the CC-WGAN reconstruction error has little to do with the partial observability distance. We would like to investigate the conditions under which the CC-WGAN gives better predictions.
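The two reconstruction metrics mentioned above can be sketched for flat observation vectors. Restricting the MSE to masked elements is an assumption made for illustration; the text does not say whether the reported error covers all elements or only the inferred ones. Function names are hypothetical.

```python
import math


def masked_mse(true_obs, inferred_obs, mask):
    """Mean squared error over elements whose mask is 0 (the inferred elements)."""
    errors = [(t - p) ** 2 for t, p, m in zip(true_obs, inferred_obs, mask) if m == 0]
    return sum(errors) / len(errors) if errors else 0.0


def cosine_similarity(u, v):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


if __name__ == "__main__":
    true_obs = [1.0, 2.0, 3.0]
    inferred = [1.0, 0.0, 5.0]
    mask = [1, 0, 0]
    print(masked_mse(true_obs, inferred, mask), cosine_similarity(true_obs, inferred))
```

Cosine similarity ignores the magnitude of the vectors, which is one plausible reason it is less informative than MSE for position-like coordinates.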
The results shown here reveal properties about context-based modeling of observations in MARL and the scenarios for which our approach is appropriate. As CC-WGAN learns a joint observation distribution by sampling joint observations from its replay buffer $\mathcal{B}$, it has no temporal coherence: each inference step is independent from the previous. Without a model of observation trajectories, it is ill-suited for dynamic tasks with no clear stable behaviors under the partial observability.
Fig. 4 illustrates this problem depending on the scenario’s need for non-local information and the stability of optimal behavior. Inferring other agents’ observations is useful when the task requires non-local coordination like the physical deception and predator-prey scenarios. Cooperative navigation agents can move to a different landmark if another agent is covering the same landmark, but may take slightly longer. Without temporal coherence, the CC-WGAN has trouble modeling non-stationary observation distributions like predator-prey. In contrast, physical deception and cooperative navigation have a stable optimal policy. In summation, our approach works best with a stable observation distribution (physical deception and cooperative navigation) and is useful in tasks requiring non-local coordination (physical deception and predator-prey). As such, our reward is significantly higher in physical deception, slightly higher in cooperative navigation, and slightly lower in predator-prey.
As seen in Fig. 6, when the CC-WGAN and policy update on inferred observations the decentralized reward dropoff is more drastic than without updating on the inference. This is due to the co-adaptation between the agents’ policies and CC-WGAN inference: the CC-WGAN must learn based on new observations being generated by agents choosing actions based on the inferred observations from the CC-WGAN. Also it appears that having either policy updates or GAN updates on inferred observations gives roughly the same benefit.
6 Conclusions and Future Work
In this paper we reviewed the recent trend of MARL methods utilizing centralized training with decentralized execution (CTDE) and identified that none of the methods may continue learning in the decentralized phase without adding explicit communication. We proposed to learn a context conditional generative model during the centralized training phase that allows a popular CTDE actor-critic method to continue learning in the decentralized phase, and showed that this addition allows for increased reward and coordination in three continuous multi-agent tasks.
Our approach is useful for completing partial observations in Markovian environments where decentralized environment dynamics closely match the centralized training dynamics. In environments where agents should condition on history, recurrent policies or critics would help solve the problem. Our experiments show that context is useful in settings where there is a stable optimal behavior for agents, but training on trajectories may be able to learn more difficult observation distributions. We would also like to address the domain adaptation problem, possibly by re-training the generative model on decentralized dynamics or by using techniques such as domain randomization.