Multi-Agent Trust Region Policy Optimization
Abstract
We extend trust region policy optimization (TRPO) [26] to multi-agent reinforcement learning (MARL) problems. We show that the policy update of TRPO can be transformed into a distributed consensus optimization problem for multi-agent cases. By making a series of approximations to the consensus optimization model, we propose a decentralized MARL algorithm, which we call multi-agent TRPO (MATRPO). This algorithm can optimize distributed policies based on local observations and private rewards. The agents do not need to know the observations, rewards, policies or value/action-value functions of other agents; they only share a likelihood ratio with their neighbors during training. The algorithm is fully decentralized and privacy-preserving. Our experiments on two cooperative games demonstrate its robust performance on complicated MARL tasks.
I Introduction
Recent developments in multi-agent reinforcement learning (MARL) have enabled individual agents to learn cooperative policies to jointly maximize a collective reward for a common goal. This has significantly extended the traditional RL paradigm from single-agent domains to multi-agent systems (MAS), which have been found to be useful in a wide range of applications, such as multi-player games [24], coordinating self-driving vehicles [12], multi-robot systems [19], traffic signal control [33] and smart grids [25].
Most MARL algorithms can be classified into three broad categories: (1) fully centralized methods, such as joint action learning [5], which learn a centralized policy by reducing the problem to single-agent RL over the global observation and action space, (2) centralized training with decentralized execution methods, which learn distributed local policies (that select actions based on local observations) in a centralized manner by using global information in the training process, such as aggregated observations, joint rewards, or the policy parameters or gradients of other agents, and (3) fully decentralized methods, such as independent learning [30], which learn local policies in a decentralized manner and do not require global information in either the training or the execution stage.
Of the three categories, centralized training with decentralized execution methods are preferred in many studies, because they only need local information to execute the learned policies while being able to exploit global information to ease the non-stationarity issue [20]. The non-stationarity issue typically arises in independent learning: since the policy of each agent changes during the training process, the state transition and reward functions perceived by the other agents change as well. This mutual interference can disturb the training process and make it difficult to converge towards an equilibrium point. In centralized training, however, local observations, rewards, or policies are shared among all the agents, so the environment dynamics perceived by each agent becomes stationary. Based on this idea, the multi-agent deep deterministic policy gradient (MADDPG) algorithm is proposed in [16], which trains a centralized action-value function for each agent based on aggregated observations and inferred actions. The local policy of each agent is updated according to the centralized action-value function and the estimated policies of the other agents. In [28], a value decomposition network is devised to implicitly learn the local action-value function for each individual agent using the aggregated observations and joint rewards. In [8], an actor-attention-critic algorithm is proposed to learn decentralized policies using centrally trained critics. An attention mechanism is incorporated to help select relevant information for each agent to improve the training of the critic.
Nevertheless, the centralized training with decentralized execution framework exhibits many limitations. Firstly, the existence of an unavoidable central controller makes the framework vulnerable to single-point failure and cyber attacks. Secondly, training the policies of all agents in one single controller may consume massive computation and communication resources, which poses a major challenge to the scalability and flexibility of centralized training methods. More importantly, privacy is a major concern in many multi-agent system applications, such as recommendation systems [4], agent-mediated e-commerce [21], semantic web services [27], etc. The requirement of access to other agents' observations, rewards or policies may raise concerns about privacy leakage, which makes centralized training methods questionable for many real-world applications.
TRPO has been successful at solving complex single-agent RL problems [26], and is known for its monotonic improvement guarantee and its effectiveness at optimizing large neural network policies [7]. In this work, we extend the TRPO algorithm to MARL problems. We show that the policy update of TRPO can be equivalently transformed into a distributed consensus optimization problem. We approximately solve the consensus optimization, yielding a decentralized MARL algorithm, which we call multi-agent TRPO (MATRPO). In this algorithm, the policy updates are based on local observations and private rewards. The agents do not need to know the observations, rewards, policies or value/action-value functions of other agents. During the training process, the agents only communicate with their neighbors to share a local likelihood ratio. The algorithm is fully decentralized and privacy-preserving. In our experiments, we show that this algorithm can learn high-quality neural network policies for complex MARL problems.
II Related Work
There exists a series of insightful works in the literature that address the decentralized MARL problem, and a comprehensive overview of decentralized MARL methods can be found in [35, 23]. Like single-agent RL, most decentralized MARL methods can be classified into two categories: (1) policy iteration methods [2], which alternate between policy evaluation and policy improvement; (2) policy gradient methods [29], which update the policy according to an estimated gradient of the expected return.
Most policy iteration methods focus on the distributed policy evaluation task, which estimates the global value function under a fixed joint policy. Generally, the task is converted to the mean squared projected Bellman error (MSPBE) minimization problem, and further reformulated as a distributed saddle-point problem by using linear approximation functions. For example, in [31], a diffusion-based distributed gradient temporal-difference (GTD) algorithm was developed to solve the saddle-point problem, and convergence under sufficiently small step-size updates was established. In [11], a primal-dual distributed GTD algorithm was proposed based on consensus, and asymptotic convergence was established by an ordinary-differential-equation argument. In [18], a gossip-based averaging scheme was investigated to incorporate neighbors' information in distributed TD updates. In [32], a double averaging scheme combining dynamic consensus and the stochastic average gradient algorithm was proposed to solve the saddle-point problem with a convergence guarantee at a global geometric rate. In [6], the finite-time convergence of a consensus-based distributed TD(0) algorithm under constant and time-varying step-sizes was investigated. Unlike the above methods, which estimate the value function, the algorithm in [9] learns the action-value function through distributed Q-factors, which are updated in a distributed manner. In [15], a multi-agent deep Q-network algorithm was proposed to learn distributed optimal action-value functions by aggregating feature information from the neighboring agent group via an attentive relational encoder.
For policy gradient methods, some works learn a local actor with a consensus critic, while others do the reverse. For example, following the former scheme, two decentralized actor-critic algorithms were proposed in [36] to estimate local policy gradients by using a consensus action-value function and a consensus value function, respectively. Instead of learning a consensus critic, the method in [37] derived a distributed off-policy actor-critic algorithm with policy consensus. Through decomposition of the global value and action-value functions, the algorithm avoids learning a centralized critic. Convergence guarantees were provided for both methods under linear function approximation. However, whether learning a consensus critic or a consensus actor, the agents must know the global system state, which is not the case in a partially observable environment. In addition, privacy issues may arise, especially with policy consensus [37], wherein the policy parameters must be shared between agents. To address this issue, a flexible parameter sharing mechanism was introduced in [13], which divides the policy parameters into shared and non-shared parts. This also allows the agents to train distributed policies that depend only on local observations. To overcome the non-stationarity problem, however, this method assumes that all agents can infer the policies of the other agents.
Different from the aforementioned two categories of methods, the proposed algorithm is a local policy search method, which iteratively updates the policy by maximizing the expected return over a local neighborhood of the most recent iterate [26]. Compared to prior works, the proposed algorithm features the following major differences: (1) it is effective for optimizing distributed neural network policies; (2) it is privacy-preserving, which means that the agents do not need to share private information, such as local observations, rewards, or policy or critic parameters; (3) it is suitable for partially observable MARL problems, wherein the agents have different observation spaces and reward functions, and act according to different policies.
III Preliminaries
III-A Markov Game
We consider a partially observable Markov game (POMG) [14], in which $N$ agents communicate with their immediate neighbors through a sparsely connected network to cooperate with each other for a common purpose. The POMG is defined by the tuple $(\mathcal{S}, \{\mathcal{O}^i\}_{i=1}^N, \{\mathcal{A}^i\}_{i=1}^N, P, \{r^i\}_{i=1}^N, \rho_0, \gamma, \mathcal{G})$, where $\mathcal{S}$ is a set of environment states, $\mathcal{O}^i$ is a set of observations of the agent $i$, $\mathcal{A}^i$ is a set of actions that agent $i$ can choose, $P(s'|s,a)$ is the transition probability distribution, $r^i(s,a)$ is the private reward received by agent $i$, $\rho_0$ is the distribution of the initial state $s_0$, $\gamma \in (0,1)$ is the discount factor, and $\mathcal{G} = (\mathcal{V}, \mathcal{E})$ is an undirected graph representing the communication network, where $\mathcal{V} = \{1, \dots, N\}$ is the set of agents and $\mathcal{E}$ is the set of communication links.
Let $\pi^i(a^i|o^i)$ denote a stochastic policy of agent $i$, and $\pi = (\pi^1, \dots, \pi^N)$ denote the joint policy of all agents. For each agent $i$, the goal is to learn an optimal local policy so that the joint policy maximizes the expected sum of the discounted rewards of all agents, $\eta(\pi) = \mathbb{E}_{\tau \sim \pi}\left[\sum_{t=0}^{\infty} \gamma^t \sum_{i=1}^{N} r^i(s_t, a_t)\right]$. Here $\tau$ denotes a trajectory ($\tau = (s_0, a_0, s_1, a_1, \dots)$), and $\tau \sim \pi$ indicates that the distribution over the trajectory depends on $\pi$.
Letting $R^i(\tau) = \sum_{t=0}^{\infty} \gamma^t r^i(s_t, a_t)$ denote the discounted return of a trajectory perceived by agent $i$, we define the local value function as $V^i_{\pi}(s) = \mathbb{E}_{\tau \sim \pi}[R^i(\tau) \,|\, s_0 = s]$, the local action-value function as $Q^i_{\pi}(s, a) = \mathbb{E}_{\tau \sim \pi}[R^i(\tau) \,|\, s_0 = s, a_0 = a]$, and the local advantage function as $A^i_{\pi}(s, a) = Q^i_{\pi}(s, a) - V^i_{\pi}(s)$, respectively, where $a = (a^1, \dots, a^N)$ is the joint action of all agents. With these definitions, we express the global value function by $V_{\pi}(s) = \sum_{i=1}^{N} V^i_{\pi}(s)$, the global action-value function by $Q_{\pi}(s, a) = \sum_{i=1}^{N} Q^i_{\pi}(s, a)$, and the global advantage function by $A_{\pi}(s, a) = \sum_{i=1}^{N} A^i_{\pi}(s, a)$.
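As a small numerical sketch of these definitions (all reward values are made up for illustration), the local return is a discounted sum and the global return is the sum of the local returns:

```python
def discounted_return(rewards, gamma):
    """Discounted return R(tau) = sum_t gamma^t * r_t of one trajectory."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))

# Local rewards of agent 1 along a 3-step trajectory (illustrative values).
rewards_i = [1.0, 0.0, 2.0]
gamma = 0.9
R_i = discounted_return(rewards_i, gamma)  # 1 + 0 + 0.81*2 = 2.62

# The global return sums the local returns over all agents.
rewards_per_agent = [[1.0, 0.0, 2.0], [0.0, 1.0, 0.0]]
R_global = sum(discounted_return(r, gamma) for r in rewards_per_agent)
print(R_i, R_global)
```

Averaging such returns over trajectories starting from a fixed state (and action) gives Monte Carlo estimates of the local value and action-value functions defined above.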
III-B Trust Region Policy Optimization
TRPO is a policy search method with a monotonic improvement guarantee [26]. In TRPO, the policy is iteratively updated by maximizing a surrogate objective over a trust region around the most recent iterate $\pi_{old}$:

$$\max_{\pi} \; \mathbb{E}_{s \sim \rho_{\pi_{old}},\, a \sim \pi_{old}}\left[\frac{\pi(a|s)}{\pi_{old}(a|s)} A_{\pi_{old}}(s, a)\right] \quad \text{s.t. } \bar{D}_{KL}(\pi_{old}, \pi) \le \delta, \tag{1}$$

where the objective is an importance-sampling estimator of the advantage function $A_{\pi_{old}}$, and $\rho_{\pi_{old}}(s) = \sum_{t=0}^{\infty} \gamma^t P(s_t = s)$ is the discounted state-visitation frequency according to the policy $\pi_{old}$.

The trust-region constraint restricts the search to a neighborhood of $\pi_{old}$, defined by the average KL divergence $\bar{D}_{KL}(\pi_{old}, \pi) = \mathbb{E}_{s \sim \rho_{\pi_{old}}}\left[D_{KL}\left(\pi_{old}(\cdot|s) \,\|\, \pi(\cdot|s)\right)\right]$ and the parameter $\delta$.
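A sample-based sketch of the surrogate objective and KL constraint in (1), for a single state and categorical policies with illustrative probabilities:

```python
import numpy as np

rng = np.random.default_rng(0)

# Old and candidate categorical policies over 4 actions at one sampled state.
p_old = np.array([0.4, 0.3, 0.2, 0.1])    # pi_old(.|s)
p_new = np.array([0.35, 0.35, 0.2, 0.1])  # candidate pi(.|s)

# Actions sampled under pi_old, with made-up advantage estimates.
actions = rng.choice(4, size=1000, p=p_old)
advantages = rng.normal(size=1000)

# Importance-sampling surrogate: E_{a~pi_old}[ pi(a|s)/pi_old(a|s) * A(s,a) ].
ratios = p_new[actions] / p_old[actions]
surrogate = np.mean(ratios * advantages)

# KL divergence D_KL(pi_old || pi) at this state, checked against the radius.
kl = np.sum(p_old * np.log(p_old / p_new))
delta = 0.01
print(f"surrogate={surrogate:.4f}, KL={kl:.5f}, feasible={kl <= delta}")
```

In TRPO proper, the surrogate and KL are averaged over states drawn from the visitation distribution, and the constrained problem is solved approximately in parameter space.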
III-C Alternating Direction Method of Multipliers
ADMM solves a separable equality-constrained optimization problem of the form [3]

$$\min_{x, z} \; f(x) + g(z) \quad \text{s.t. } Ax + Bz = c, \tag{2}$$

where the decision variables decompose into two subvectors, $x \in \mathbb{R}^n$ and $z \in \mathbb{R}^m$, and the objective is a sum of convex functions, $f$ and $g$, with respect to these subvectors. The decision vectors $x$ and $z$ are coupled through a linear constraint, where $A \in \mathbb{R}^{p \times n}$, $B \in \mathbb{R}^{p \times m}$, and $c \in \mathbb{R}^p$.

The ADMM algorithm solves the problem (2) by forming its augmented Lagrange function,

$$L_\rho(x, z, y) = f(x) + g(z) + y^\top (Ax + Bz - c) + \frac{\rho}{2} \|Ax + Bz - c\|_2^2, \tag{3}$$

and approximately minimizing the augmented Lagrange function through sequentially updating the primal variables $x$ and $z$, and the dual variable $y$:

$$x^{k+1} = \arg\min_x L_\rho(x, z^k, y^k), \quad z^{k+1} = \arg\min_z L_\rho(x^{k+1}, z, y^k), \quad y^{k+1} = y^k + \rho (Ax^{k+1} + Bz^{k+1} - c). \tag{4}$$
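A minimal scalar instance of (2)-(4), on the hypothetical toy problem $f(x) = (x-1)^2$, $g(z) = (z-3)^2$ with the consensus constraint $x - z = 0$ (so $A = 1$, $B = -1$, $c = 0$; the optimum is $x = z = 2$):

```python
# ADMM iterations (4) with closed-form primal updates for this quadratic toy.
rho = 1.0
x, z, y = 0.0, 0.0, 0.0
for k in range(200):
    # x-update: argmin_x (x-1)^2 + y*(x-z) + rho/2*(x-z)^2.
    x = (2.0 + rho * z - y) / (2.0 + rho)
    # z-update: argmin_z (z-3)^2 + y*(x-z) + rho/2*(x-z)^2.
    z = (6.0 + rho * x + y) / (2.0 + rho)
    # Dual ascent on the residual x - z.
    y = y + rho * (x - z)
print(round(x, 4), round(z, 4), round(y, 4))  # x and z agree at the optimum 2
```

Each agent in a consensus problem plays the role of one primal block, which is exactly the structure exploited in Section IV-B.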
IV Multi-Agent Trust Region Policy Optimization
IV-A Trust Region Optimization for Multiple Agents
Consider the policy optimization problem (1) for the multi-agent case. We decompose the advantage function in the objective as $A_{\pi_{old}} = \sum_{i=1}^{N} A^i_{\pi_{old}}$ and rewrite the policy optimization model as follows:

$$\max_{\pi} \; \sum_{i=1}^{N} \mathbb{E}_{s \sim \rho_{\pi_{old}},\, a \sim \pi_{old}}\left[\frac{\pi(a|s)}{\pi_{old}(a|s)} A^i_{\pi_{old}}(s, a)\right] \quad \text{s.t. } \bar{D}_{KL}(\pi_{old}, \pi) \le \delta. \tag{5}$$
Our purpose is to split the objective into $N$ independent sub-objectives so that the optimization problem can be solved distributedly by the agents. However, the sub-objectives are coupled through the joint policy $\pi$ and the common trust-region constraint.
In our solution, instead of directly optimizing the joint policy $\pi$, we train a local policy $\pi^i(a|s)$ for every agent $i$, where $\pi^i(a|s)$ is a probability distribution of the joint action $a$ conditioned on $s$. When choosing actions, each agent independently generates its own action according to the local marginal distribution $\tilde{\pi}^i(a^i|s) = \sum_{a^{-i}} \pi^i(a|s)$, where $a^{-i}$ denotes the actions of all agents other than $i$. The joint policy can then be expressed as $\pi(a|s) = \prod_{i=1}^{N} \tilde{\pi}^i(a^i|s)$.
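A small numpy sketch of acting from the marginal of a joint-action distribution (shapes and probability values are purely illustrative):

```python
import numpy as np

# Agent 0's local policy over the JOINT action of N=2 agents (4 x 3 actions),
# conditioned on one state; the entries are random illustrative values.
rng = np.random.default_rng(1)
pi_i = rng.random((4, 3))
pi_i /= pi_i.sum()  # normalize to a joint distribution

# Each agent samples only from its own marginal of this joint distribution.
marginal_agent0 = pi_i.sum(axis=1)  # distribution over agent 0's 4 actions
marginal_agent1 = pi_i.sum(axis=0)  # distribution over agent 1's 3 actions

a0 = rng.choice(4, p=marginal_agent0)  # agent 0 acts from its marginal
print(marginal_agent0, a0)
```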
Our principal theoretical result is that if the local policies have a consistent product form, i.e. $\pi^i(a|s) = \prod_{j=1}^{N} \tilde{\pi}^i_j(a^j|s)$, the policy optimization (5) can be equivalently transformed into the following consensus optimization problem (see Appendix A for the proof):

$$\max_{\pi^1, \dots, \pi^N} \; \sum_{i=1}^{N} \mathbb{E}_{s \sim \rho_{\pi_{old}},\, a \sim \pi_{old}}\left[\frac{\pi^i(a|s)}{\pi_{old}(a|s)} A^i_{\pi_{old}}(s, a)\right] \quad \text{s.t. } \bar{D}_{KL}(\pi_{old}, \pi^i) \le \delta, \quad \pi^i = \pi^j \;\; \forall (i, j) \in \mathcal{E}. \tag{6}$$
Now, the objective is separable with respect to the local policies $\pi^1, \dots, \pi^N$. However, the local policy $\pi^i(a|s)$ is conditioned on the global state $s$, which means that in order to learn or act according to $\pi^i$, every agent would have to know $s$. Generally, this is not the case in a partially observable environment. Since the agents act according to their local observations in a partially observable environment, the joint policy becomes $\pi(a|o)$, where $o = (o^1, \dots, o^N)$ is the aggregated observation of all agents. In this case, we introduce the following approximation for partially observable environments:

$$\max_{\pi^1, \dots, \pi^N} \; \sum_{i=1}^{N} \mathbb{E}\left[\frac{\pi^i(a|o^i)}{\pi^i_{old}(a|o^i)} A^i_{\pi_{old}}(s, a)\right] \quad \text{s.t. } \bar{D}_{KL}(\pi^i_{old}, \pi^i) \le \delta/N, \quad \frac{\pi^i(a|o^i)}{\pi^i_{old}(a|o^i)} = \frac{\pi^j(a|o^j)}{\pi^j_{old}(a|o^j)} \;\; \forall (i, j) \in \mathcal{E}, \tag{7}$$
where we substitute the global state $s$ with each agent's local observation $o^i$. In order to meet the trust-region constraint on the KL divergence of the joint policy $\pi$, we introduce new constraints on the KL divergence of the local policies $\pi^i$. According to the additive and non-negative properties of the KL divergence, the following inequality holds:

$$\bar{D}_{KL}(\pi_{old}, \pi) \le \sum_{i=1}^{N} \bar{D}_{KL}(\pi^i_{old}, \pi^i). \tag{8}$$

Consequently, when the new KL-divergence constraints in (7) are satisfied, i.e. $\bar{D}_{KL}(\pi^i_{old}, \pi^i) \le \delta/N$ for every agent $i$, the joint policy will be constrained within the trust region $\bar{D}_{KL}(\pi_{old}, \pi) \le \delta$. It is noted that if the agents have enough observations (e.g. sufficiently long histories) to almost surely infer the global state, the approximation (7) can be sufficiently close to the problem (6) for the fully observable case.
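The additivity underlying (8) can be checked numerically for product (independent) policies: the KL divergence of a product of per-agent distributions equals the sum of the per-agent KL divergences, so bounding each local KL by $\delta/N$ bounds the joint KL by $\delta$. A minimal check with illustrative distributions:

```python
import numpy as np

def kl(p, q):
    """KL divergence between two discrete distributions."""
    return float(np.sum(p * np.log(p / q)))

# Old/new action distributions of two independent agents (illustrative values).
p1, q1 = np.array([0.6, 0.4]), np.array([0.5, 0.5])
p2, q2 = np.array([0.3, 0.7]), np.array([0.4, 0.6])

# Joint (product) distributions over the joint action space.
P = np.outer(p1, p2).ravel()
Q = np.outer(q1, q2).ravel()

# KL of the product equals the sum of the per-agent KLs.
print(kl(P, Q), kl(p1, q1) + kl(p2, q2))
```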
For the policy optimization (7), the joint action $a$ is required to calculate the likelihood ratio $\pi^i(a|o^i)/\pi^i_{old}(a|o^i)$ and the advantage function $A^i_{\pi_{old}}(s, a)$. However, the joint action follows the old policy $\pi_{old}$ rather than $\pi$, which means that the agents only need to know the actions performed by the other agents in the last iterate. Knowing the last-round actions of other agents is not a restrictive assumption and has been widely adopted by existing decentralized algorithms in the literature. It is also worth mentioning that knowing the actions of other agents does not mean knowing their policies, because the local policy $\pi^i$ is conditioned on the local observation $o^i$, which is privately owned by agent $i$.
IV-B Distributed Consensus with ADMM
To solve the distributed consensus optimization problem (7), we employ the asynchronous ADMM algorithm proposed in [34]. Specifically, let $(i, j) \in \mathcal{E}$ denote the agents at the endpoints of a communication link. For each agent $i$, we introduce an estimator $\hat{z}^i_j$ that estimates the likelihood ratio $\lambda^j = \pi^j(a|o^j)/\pi^j_{old}(a|o^j)$ of its neighbor $j$. Then, the consensus constraint for the agents $i$ and $j$ can be reformulated as

$$\lambda^i = \hat{z}^j_i, \tag{9a}$$
$$\lambda^j = \hat{z}^i_j, \tag{9b}$$
$$\hat{z}^i_j = \hat{z}^j_i. \tag{9c}$$
For brevity, we use $\lambda^i$ to represent the likelihood ratio $\pi^i(a|o^i)/\pi^i_{old}(a|o^i)$, and $\hat{z}^i_j$ to represent agent $i$'s estimate of its neighbor's ratio $\lambda^j$. By using Equations (9), the problem (7) can be transformed into

(10)

where $w_{ij}$ is the weight on the information transmitted between the agents at the communication link $(i, j)$; the weights are fixed constants in our study.
We form the augmented Lagrange function of problem (10):

(11)

where $y$ denotes the set of dual variables and $\rho$ is a penalty parameter; the local policy $\pi^i$ is defined in a feasible set determined by its trust-region constraint, and the estimator $\hat{z}^i_j$ is defined in a corresponding feasible set.
To solve problem (10), the asynchronous ADMM minimizes the augmented Lagrange function by sequentially updating the local policies, the estimators, and the dual variables. Specifically, let $\pi^i_k$, $\hat{z}^i_{j,k}$, and $y_k$ denote the values of these variables at iteration $k$. At iteration $k+1$, a communication link $(i, j) \in \mathcal{E}$ is selected according to a certain probability. Then, the two agents at the endpoints of the link are activated to update their policy $\pi^i$, estimator $\hat{z}^i_j$, and dual variable according to:

(12)
Wei and Ozdaglar proved in [34] that if the objective is convex and the feasible sets are nonempty, closed and convex, the algorithm (12) converges almost surely to the optimal solution at a rate of $O(1/k)$ as $k \to \infty$.
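The asynchronous, link-activation pattern of (12) can be illustrated with a much simpler randomized gossip-averaging scheme. This is a stand-in for intuition only, not the ADMM update itself: at each iteration one communication link is activated and its two endpoint agents move their local quantities toward agreement.

```python
import random

random.seed(0)
values = [1.0, 5.0, 3.0, 7.0]             # local quantities to agree on
edges = [(0, 1), (1, 2), (2, 3), (3, 0)]  # ring communication links

for k in range(500):
    i, j = random.choice(edges)           # activate one link per iteration
    avg = 0.5 * (values[i] + values[j])   # the two endpoints average locally
    values[i] = values[j] = avg

print([round(v, 4) for v in values])      # all values approach the mean 4.0
```

As in (12), only the two endpoints of the activated link update their state, and no agent ever needs global information; the ADMM version additionally carries dual variables and solves a local subproblem per activation.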
IV-C Sequential Convexification for Convergence
In the problem (10), we implicitly assume that the policy can be evaluated at every observation and action, and we optimize directly over the policy. However, for parameterized policies $\pi_\theta$, like neural networks, we would like to optimize over the parameters $\theta$. This can lead to convergence issues of the ADMM algorithm due to the non-convexity of the objective with respect to $\theta$ [3, 34]. Since we are considering parameterized policies, we will use the parameter vector $\theta$ to represent the policy $\pi_\theta$ and overload all previous notations accordingly, e.g. $\lambda(\theta)$ and $\bar{D}_{KL}(\theta_{old}, \theta)$.
To address the convergence issue, we borrow an idea from sequential convex programming [17] and approximate the problem (10) with a convex model. Note that, for a small trust region $\delta$, the likelihood ratios can be well approximated by their first-order Taylor expansion around $\theta_{old}$:

$$\lambda(\theta) \approx 1 + \nabla_\theta \lambda(\theta)\big|_{\theta = \theta_{old}}^{\top} (\theta - \theta_{old}), \tag{13}$$

and the KL divergence can be well approximated by its second-order Taylor expansion around $\theta_{old}$:

$$\bar{D}_{KL}(\theta_{old}, \theta) \approx \frac{1}{2} (\theta - \theta_{old})^{\top} H (\theta - \theta_{old}), \tag{14}$$

where $H$ is the Hessian of the average KL divergence evaluated at $\theta_{old}$. The idea is also similar to TRPO [26] and constrained policy optimization (CPO) [1].
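The quality of the second-order model (14) can be checked on a one-parameter Bernoulli policy (an illustrative parameterization, not the paper's network). Here the Hessian of the KL divergence at $\theta_{old}$ reduces to the Fisher information of a Bernoulli with respect to its logit, $p(1-p)$:

```python
import math

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

def kl_bern(p, q):
    """KL divergence between two Bernoulli distributions."""
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

theta_old = 0.3             # single logit parameter (illustrative)
p_old = sigmoid(theta_old)

# Hessian of KL(theta_old, theta) at theta_old = Fisher information p*(1-p).
H = p_old * (1 - p_old)

dtheta = 0.05               # small step, as inside a small trust region
true_kl = kl_bern(p_old, sigmoid(theta_old + dtheta))
quad_kl = 0.5 * H * dtheta ** 2
print(true_kl, quad_kl)     # agree up to a third-order error in dtheta
```

The first-order term of the expansion vanishes because the KL divergence is minimized (and zero) at $\theta = \theta_{old}$, which is why (14) starts at second order.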
IV-D Privacy Protection
During the training process, the agents need to share with their neighbors the information on the local likelihood ratio to reach a consensus. The likelihood ratio does not contain private information of the agents, such as their observations, rewards, or policy parameters. Therefore, MATRPO preserves the agents' privacy.
V Practical Implementation
V-A Sample-Based Approximation of the Consensus Constraint
To meet the consensus constraint, the agents need to reach agreement on the likelihood ratio for every observation and action. This task is challenging when the observation and action spaces are very large. To overcome this challenge, we implement a sample-based approximation to the consensus constraint.
First, we generate a batch of initial states according to the distribution $\rho_0$, and each agent obtains a batch of local observations of these states. Then, we simulate the joint policy for a fixed number of timesteps to generate a batch of trajectories of states and actions, and each agent records a batch of locally observed trajectories. After that, the agents calculate the local likelihood ratios based on the samples from these trajectories. Finally, we constrain the agents to reach consensus on the samples of the likelihood ratio.
We also use this sampling procedure to estimate the local advantage function $A^i$, which is calculated by following the method in [1]:
(15) 
where the local value function $V^i$ is approximated by a neural network based on the local observations and private rewards.
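CPO [1] estimates advantages with generalized advantage estimation (GAE); a minimal sketch, assuming the same estimator is meant here (the exact form of (15) is not reproduced above, so this is an assumption):

```python
import numpy as np

def gae(rewards, values, gamma=0.995, lam=0.97):
    """Generalized advantage estimation over one trajectory.

    `values` has one extra entry for the bootstrap value of the final state.
    This mirrors the common TRPO/CPO-style estimator; the paper's exact
    equation (15) may differ in details.
    """
    T = len(rewards)
    adv = np.zeros(T)
    last = 0.0
    for t in reversed(range(T)):
        delta = rewards[t] + gamma * values[t + 1] - values[t]  # TD residual
        last = delta + gamma * lam * last
        adv[t] = last
    return adv

# Illustrative local rewards and value predictions V(s_0..s_3).
rewards = np.array([0.0, 1.0, 0.0])
values = np.array([0.5, 0.6, 0.4, 0.0])
adv = gae(rewards, values)
print(adv)
```

Each agent would run this on its own trajectories, using its private rewards and its locally trained value network.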
V-B Agreement on the Logarithmic Ratio
We find that it is better to use the logarithmic likelihood ratio for consensus, since it leads to a closed-form solution of the asynchronous ADMM updates (12). As the logarithmic function is monotonic, consensus on the likelihood ratio is equivalent to consensus on its logarithm. Therefore, the following equations are used as the consensus constraints:
(16) 
In addition, we approximate the logarithmic likelihood ratio by its first-order Taylor expansion around $\theta_{old}$:

$$\log \lambda(\theta) \approx \nabla_\theta \log \pi_\theta(a|o)\big|_{\theta = \theta_{old}}^{\top} (\theta - \theta_{old}). \tag{17}$$
Then, using the sample-based approximation procedure in Section V-A, we derive the following Lagrange function:

(18)

where $A^i$ is the vector of the sampled values of the advantage function and $J^i$ is the Jacobian of the sampled likelihood ratios with respect to $\theta^i$. The policy parameters $\theta^i$ are defined in the feasible set $\Theta^i = \{\theta : \frac{1}{2} (\theta - \theta^i_{old})^{\top} H^i (\theta - \theta^i_{old}) \le \delta/N\}$, where $H^i$ is the Hessian of the sample-average KL divergence.
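The first-order model (17) can be verified numerically for a small softmax policy (illustrative parameters; the gradient is taken by finite differences for transparency):

```python
import numpy as np

def log_pi(theta, a):
    """Log probability of action a under a softmax policy over 3 actions."""
    z = theta - theta.max()  # stabilized log-softmax
    return z[a] - np.log(np.exp(z).sum())

theta_old = np.array([0.2, -0.1, 0.4])
a = 1

# Finite-difference score function: grad of log pi at theta_old.
eps = 1e-6
grad = np.array([
    (log_pi(theta_old + eps * np.eye(3)[k], a) - log_pi(theta_old, a)) / eps
    for k in range(3)
])

# First-order model of the log likelihood ratio, as in (17).
dtheta = np.array([0.01, -0.02, 0.005])
lin = grad @ dtheta
true = log_pi(theta_old + dtheta, a) - log_pi(theta_old, a)
print(lin, true)  # agree up to second-order terms in dtheta
```

Within a small trust region the step $\theta - \theta_{old}$ is small, so the linear model's error, which is second order in the step size, stays negligible.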
VI Experiments
The experiments are designed to investigate the following questions:

1. Does MATRPO succeed in learning cooperative policies in a fully decentralized manner when agents have different observation spaces and reward functions?

2. How does MATRPO compare with other baseline methods, such as a fully centralized learning method? Does MATRPO perform better than independent learning?

3. Can MATRPO be used to solve large-scale MARL problems? Does the performance degrade when the number of agents increases?
To answer these questions, we test MATRPO on two cooperative tasks, in which agents only have partial observations of the environment and receive private rewards. We compare the final performance of MATRPO with two baseline methods: (1) a fully centralized method, which uses TRPO to learn a global policy by reducing the problem to single-agent RL, and (2) an independent learning method, which uses TRPO to learn each local policy by maximizing the local expected rewards. We also examine the scalability of our algorithm by increasing the number of agents in the tasks.
VI-A Environments
Our experiments are carried out on the multi-agent particle environment proposed by [22] and extended by [16] and [8]. The Cooperative Navigation and Cooperative Treasure Collection tasks are used to test the proposed algorithm. To make the tasks more challenging in the decentralized training scheme, we modified them; details of the modified tasks are illustrated in Fig. 1 and explained as follows.
Cooperative Navigation. In this task, agents must reach a set of landmarks through coordinated movements. In the original version of this task, the agents are collectively rewarded based on the minimum agent distance to each landmark, and individually punished for colliding with each other. To make it more challenging, we modified the task such that only one of the agents is rewarded. The other agents receive no reward, but they are penalized upon collision. Each agent observes its own position and velocity as well as the relative positions of the other agents and landmarks.
Cooperative Treasure Collection. In this task, treasure hunters and treasure banks search the environment to gather treasures. The treasures are generated with different colors and respawn randomly upon being collected. The hunters are responsible for collecting treasures and depositing them into correctly colored banks. The banks simply gather as much treasure as possible from the hunters. Each agent observes its own position and velocity as well as the relative positions and velocities of the other agents. The agents also observe the colors and respawn positions of the treasures. In our modified version of this task, the banks are rewarded for successful collection and depositing of treasure, and are also negatively rewarded based on the minimum distance of each hunter to the treasures. The hunters do not receive any reward for collecting treasures, but are penalized for colliding with each other. This modification makes the task more challenging than the original version, where both the hunters and the banks are rewarded.
For all experiments, we use neural network policies with two hidden layers of 128 SELU units [10]. The action space is {left, right, up, down, stay}. For decentralized training, we assume that the agents communicate with their neighbors via a ring topology network. We run MATRPO and the baseline algorithms 5 times with different random seeds and compare their average performance on both tasks. Details of the experimental setup and parameters are provided in Table I.
Table I. Experimental setup and parameters.

Parameter            | Cooper. Navigation | Cooper. Treas. Collection
Number of agents     |       |       |       |
Sim. steps per iter. | 10k   | 10k   | 10k   | 10k
Stepsize (δ)         | 0.003 | 0.003 | 0.001 | 0.001
Discount (γ)         | 0.995 | 0.995 | 0.995 | 0.995
Policy iter.         | 500   | 500   | 500   | 500
Num. paths           | 100   | 100   | 100   | 100
Path len.            | 100   | 100   | 100   | 100
ADMM iter.           | 100   | 150   | 200   | 500
Penalty param. (ρ)   | 1.0   | 1.0   | 5.0   | 5.0
VI-B Results
Learning curves of MATRPO and the baseline methods for the two cooperative tasks are shown in Fig. 2. MATRPO is successful at learning collaborative policies and achieves consistently high performance on all of the problems. Apart from the small cooperative navigation task, where Centralized TRPO almost finds the global optimum, MATRPO performs practically as well as Centralized TRPO. Independent TRPO performs poorly on these tasks; in particular, on the cooperative treasure collection task, its performance worsens as training proceeds. Since the hunters are punished for colliding with each other, agents trained by Independent TRPO learn to move far away from each other to avoid collisions. As a result, some hunters move outside the window and end up far away from the treasures.
In addition, the experimental results show that the performance of MATRPO does not deteriorate as the number of agents increases. These results provide empirical evidence that MATRPO can be used to solve large-scale MARL problems.
Note that MATRPO learned all of the collaborative policies in a decentralized way using local observations and privately owned rewards. The agents do not need to know the observations, rewards, or policies of other agents. This is in contrast with most prior methods, which typically rely on aggregated observations and rewards, or on the policy and action-value/value network parameters of other agents.
VII Conclusion
In this paper, we showed that, through a particular decomposition and transformation, the TRPO algorithm can be extended to the optimization of cooperative policies in multi-agent systems. This enabled the development of our MATRPO algorithm, which updates the local policies through distributed consensus optimization.
In our experiments, we demonstrated that MATRPO can train distributed nonlinear policies based only on local observations and rewards, and that the privacy of the agents is protected during the training process. We assessed the performance and scalability of MATRPO on two cooperative tasks by comparing it with centralized TRPO and independent learning. Our work represents a step towards applying MARL to real-world applications, where privacy and scalability are important.
Appendix A
A-A Equivalent Transformation of Policy Optimization in TRPO
Consider a multi-agent system with $N$ agents. Let $\pi^i(a|s)$ be the local policy trained by agent $i$. When choosing actions, each agent acts independently according to the local marginal distribution $\tilde{\pi}^i(a^i|s)$. The joint policy is expressed as $\pi(a|s) = \prod_{i=1}^{N} \tilde{\pi}^i(a^i|s)$.
Theorem 1.
Assume the local policy has the form