Multi-Agent Trust Region Policy Optimization


Abstract

We extend trust region policy optimization (TRPO) [26] to multi-agent reinforcement learning (MARL) problems. We show that the policy update of TRPO can be transformed into a distributed consensus optimization problem for multi-agent cases. By making a series of approximations to the consensus optimization model, we propose a decentralized MARL algorithm, which we call multi-agent TRPO (MATRPO). This algorithm can optimize distributed policies based on local observations and private rewards. The agents do not need to know observations, rewards, policies or value/action-value functions of other agents. The agents only share a likelihood ratio with their neighbors during the training process. The algorithm is fully decentralized and privacy-preserving. Our experiments on two cooperative games demonstrate its robust performance on complicated MARL tasks.

Multi-agent reinforcement learning, trust region policy optimization, alternating direction method of multipliers, privacy protection, decentralized learning.

I Introduction

Recent developments in multi-agent reinforcement learning (MARL) have enabled individual agents to learn cooperative policies to jointly maximize a collective reward for a common goal. This has significantly extended the traditional RL paradigm from single agent domains to multi-agent systems (MAS), which have been found to be useful in a wide range of applications, such as multiplayer games [24], coordinating self-driving vehicles [12], multi-robot systems [19], traffic signal control [33] and smart grids [25].

Most MARL algorithms can be classified into three broad categories: (1) fully centralized methods, such as joint action learning [5], which learn a centralized policy by reducing the problem to a single-agent RL over the global observation and action space, (2) centralized training with decentralized execution methods, which learn distributed local policies (that select actions based on local observations) in a centralized manner by using global information in the training process, such as aggregated observations, joint rewards, policy parameters or gradients of other agents, etc., and (3) fully decentralized methods, such as independent learning [30], which learn local policies in a decentralized manner, and do not require global information in both the training and execution stages.

Of the three categories, centralized training with decentralized execution methods are preferred in many studies, because they only need local information to execute the learned policies while being able to exploit global information to ease the nonstationarity issue [20]. The nonstationarity issue typically arises in independent learning: since the policy of each agent changes during the training process, the state transition and reward functions perceived by the other agents change as well. This mutual interference can disturb the training process and make it difficult to converge towards an equilibrium point. In centralized training, however, local observations, rewards, or policies are shared among all the agents, so the environment dynamics perceived by each agent become stationary. Based on this idea, the multi-agent deep deterministic policy gradient (MADDPG) algorithm is proposed in [16], which trains a centralized action-value function for each agent based on aggregated observations and inferred actions. The local policy of each agent is updated according to the centralized action-value function and estimated policies of other agents. In [28], a value decomposition network is devised to implicitly learn the local action-value function for each individual agent using the aggregated observations and joint rewards. In [8], an actor-attention-critic algorithm is proposed to learn decentralized policies using centrally trained critics. An attention mechanism is incorporated to help select relevant information for each agent to improve the training of the critic.

Nevertheless, the centralized training with decentralized execution framework exhibits several limitations. Firstly, the reliance on a central controller makes the framework vulnerable to single-point failures and cyber-attacks. Secondly, training the policies of all agents in a single controller may consume massive computation and communication resources, which poses a major challenge to the scalability and flexibility of centralized training methods. More importantly, privacy is a major concern in many multi-agent system applications, such as recommendation systems [4], agent-mediated e-commerce [21], and semantic web services [27]. The requirement of access to other agents' observations, rewards or policies may raise concerns about privacy leakage, which makes centralized training methods questionable for many real-world applications.

TRPO has been successful at solving complex single-agent RL problems [26], and is known for its monotonic improvement guarantee and its effectiveness at optimizing large neural network policies [7]. In this work, we extend the TRPO algorithm to MARL problems. We show that the policy update of TRPO can be equivalently transformed into a distributed consensus optimization problem. We approximately solve the consensus optimization, yielding a decentralized MARL algorithm, which we call multi-agent TRPO (MATRPO). In this algorithm, the policy updates are based on local observations and private rewards. The agents do not need to know the observations, rewards, policies or value/action-value functions of other agents. During the training process, the agents only communicate with their neighbors to share a local likelihood ratio. The algorithm is fully decentralized and privacy-preserving. In our experiments, we show that this algorithm can learn high-quality neural network policies for complex MARL problems.

II Related Work

There exists a series of insightful works in the literature that address the decentralized MARL problem, and a comprehensive overview of decentralized MARL methods can be found in [35, 23]. As in single-agent RL, most decentralized MARL methods can be classified into two categories: (1) policy iteration methods [2], which alternate between policy evaluation and policy improvement; (2) policy gradient methods [29], which update the policy according to an estimated gradient of the expected return.

Most policy iteration methods focus on the distributed policy evaluation task, which estimates the global value function under a fixed joint policy. Generally, the task is converted into a mean squared projected Bellman error (MSPBE) minimization problem, and further reformulated as a distributed saddle-point problem by using linear approximation functions. For example, in [31], a diffusion-based distributed gradient temporal-difference (GTD) algorithm was developed to solve the saddle-point problem, and convergence under sufficiently small step-size updates was established. In [11], a primal-dual distributed GTD algorithm was proposed based on consensus, and asymptotic convergence was established by using an ordinary differential equation argument. In [18], a gossip-based averaging scheme was investigated to incorporate neighbors' information into distributed TD updates. In [32], a double averaging scheme combining dynamic consensus and the stochastic average gradient algorithm was proposed to solve the saddle-point problem with a convergence guarantee at a global geometric rate. In [6], the finite-time convergence of a consensus-based distributed TD(0) algorithm under constant and time-varying step-sizes was investigated. Different from the above methods, which estimate the value function, the QD-learning algorithm in [9] learns the action-value function through distributed Q-factors, which are updated using a consensus + innovations scheme. In [15], a multi-agent deep Q-network algorithm was proposed to learn distributed optimal action-value functions by aggregating feature information from groups of neighboring agents via an attentive relational encoder.

For policy gradient methods, some works learn a local actor with a consensus critic, while others do the opposite. For example, following the former scheme, two decentralized actor-critic algorithms were proposed in [36] to estimate local policy gradients by using a consensus action-value function and a consensus value function, respectively. Instead of learning a consensus critic, the method in [37] derived a distributed off-policy actor-critic algorithm with policy consensus. Through a decomposition of the global value and action-value functions, the algorithm avoids learning a centralized critic. Convergence guarantees were provided for both methods when using linear function approximation. However, whether learning a consensus critic or a consensus actor, the agents must know the global system state, which is not available in a partially observable environment. In addition, privacy issues may arise, especially in the policy consensus method [37], wherein the policy parameters must be shared between agents. To address this issue, a flexible parameter-sharing mechanism was introduced in [13], which divides the policy parameters into shared and non-shared parts. This also allows the agents to train distributed policies that only depend on local observations. To overcome the nonstationarity problem, however, this method assumes that all agents can infer the policies of the other agents.

Different from the aforementioned two categories of methods, the proposed algorithm is a local policy search method, which iteratively updates the policy by maximizing the expected return over a local neighborhood of the most recent iterate [26]. Compared to prior works, the proposed algorithm features the following major differences: (1) it is effective for optimizing distributed neural network policies; (2) it is privacy-preserving, which means that the agents do not need to share private information, such as local observations, rewards, or policy and critic parameters; (3) it is suitable for partially observable MARL problems, wherein the agents have different observation spaces and reward functions and act according to different policies.

III Preliminaries

III-A Markov Game

We consider a partially observable Markov game (POMG) [14], in which $N$ agents communicate with their immediate neighbors through a sparsely connected network to cooperate with each other for a common purpose. The POMG is defined by the tuple $(\mathcal{S}, \{\mathcal{O}^{i}\}_{i=1}^{N}, \{\mathcal{A}^{i}\}_{i=1}^{N}, P, \{r^{i}\}_{i=1}^{N}, \rho_{0}, \gamma, \mathcal{G})$, where $\mathcal{S}$ is a set of environment states, $\mathcal{O}^{i}$ is a set of observations of agent $i$, $\mathcal{A}^{i}$ is a set of actions that agent $i$ can choose, $P(s' \mid s, a)$ is the transition probability distribution, $r^{i}(s, a)$ is the private reward received by agent $i$, $\rho_{0}$ is the distribution of the initial state, $\gamma \in (0, 1)$ is the discount factor, and $\mathcal{G} = (\mathcal{V}, \mathcal{E})$ is an undirected graph representing the communication network, where $\mathcal{V}$ is the set of agents and $\mathcal{E}$ is the set of communication links.

Let $\pi^{i}$ denote a stochastic policy of agent $i$, and let $\pi = (\pi^{1}, \dots, \pi^{N})$ denote the joint policy of all agents. For each agent $i$, the goal is to learn an optimal local policy so that the joint policy maximizes the expected sum of the discounted rewards of all agents, $\eta(\pi) = \mathbb{E}_{\tau \sim \pi}\big[\sum_{t=0}^{\infty} \gamma^{t} \sum_{i=1}^{N} r^{i}(s_{t}, a_{t})\big]$. Here $\tau = (s_{0}, a_{0}, s_{1}, a_{1}, \dots)$ denotes a trajectory, and $\tau \sim \pi$ indicates that the distribution over trajectories depends on $\pi$.

Letting $R^{i}(\tau)$ denote the discounted return of a trajectory $\tau$ as perceived by agent $i$, we define the local value function $V^{i}_{\pi}$, the local action-value function $Q^{i}_{\pi}$, and the local advantage function $A^{i}_{\pi}$ of agent $i$, respectively, where the action-value and advantage functions are conditioned on the joint action $a = (a^{1}, \dots, a^{N})$ of all agents. With these definitions, the global value function $V_{\pi}$, the global action-value function $Q_{\pi}$, and the global advantage function $A_{\pi}$ are obtained by aggregating the corresponding local quantities over all agents.
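For concreteness, one consistent set of definitions matching the objective above is sketched below in LaTeX; the specific symbols ($R^{i}$, $V^{i}_{\pi}$, $Q^{i}_{\pi}$, $A^{i}_{\pi}$) and the aggregation by summation are notational assumptions, not the paper's exact display.

```latex
% Local quantities for agent i along a trajectory \tau = (s_0, a_0, s_1, a_1, \dots):
R^i(\tau) = \sum_{t=0}^{\infty} \gamma^t\, r^i(s_t, a_t), \qquad
V^i_{\pi}(s) = \mathbb{E}_{\tau\sim\pi}\!\left[ R^i(\tau) \,\middle|\, s_0 = s \right], \qquad
Q^i_{\pi}(s,a) = \mathbb{E}_{\tau\sim\pi}\!\left[ R^i(\tau) \,\middle|\, s_0 = s,\ a_0 = a \right], \qquad
A^i_{\pi}(s,a) = Q^i_{\pi}(s,a) - V^i_{\pi}(s).

% Global quantities as aggregates over the N agents (summation assumed):
V_{\pi}(s) = \sum_{i=1}^{N} V^i_{\pi}(s), \qquad
Q_{\pi}(s,a) = \sum_{i=1}^{N} Q^i_{\pi}(s,a), \qquad
A_{\pi}(s,a) = \sum_{i=1}^{N} A^i_{\pi}(s,a).
```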

III-B Trust Region Policy Optimization

TRPO is a policy search method with a monotonic improvement guarantee [26]. In TRPO, the policy is iteratively updated by maximizing a surrogate objective over a trust region around the most recent iterate $\pi_{\mathrm{old}}$:

\[
\max_{\pi}\ \ \mathbb{E}_{s\sim\rho_{\pi_{\mathrm{old}}},\,a\sim\pi_{\mathrm{old}}}\!\left[\frac{\pi(a\mid s)}{\pi_{\mathrm{old}}(a\mid s)}\,A_{\pi_{\mathrm{old}}}(s,a)\right]
\quad \text{s.t.}\quad \bar{D}_{\mathrm{KL}}\big(\pi_{\mathrm{old}}\,\|\,\pi\big)\le\delta
\tag{1}
\]

where the objective is an importance-sampling estimator of the advantage function $A_{\pi_{\mathrm{old}}}$, and $\rho_{\pi_{\mathrm{old}}}(s) = \sum_{t=0}^{\infty} \gamma^{t}\, P(s_{t} = s)$ denotes the discounted state visitation frequencies under the policy $\pi_{\mathrm{old}}$.

The trust-region constraint restricts the search to a neighborhood of $\pi_{\mathrm{old}}$, defined by the average KL-divergence $\bar{D}_{\mathrm{KL}}(\pi_{\mathrm{old}}\,\|\,\pi)$ and the step-size parameter $\delta$.
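As a concrete illustration of (1), the short Python sketch below estimates the surrogate objective and the average KL term from a batch of samples collected under the old policy; all array names, shapes, and the value of delta are assumptions made for the example, not details from the paper.

```python
# Minimal sketch (names and shapes are assumptions): Monte Carlo estimates of the
# TRPO surrogate objective and the average KL constraint from samples collected
# under the old policy. `old_probs` / `new_probs` hold per-sample action
# probabilities pi_old(a_t|s_t) and pi(a_t|s_t); `old_dist` / `new_dist` hold the
# full categorical distributions over the (discrete) action set at each state.
import numpy as np

def surrogate_objective(new_probs, old_probs, advantages):
    """Sample average of  pi(a|s)/pi_old(a|s) * A_old(s,a)."""
    ratios = new_probs / old_probs
    return np.mean(ratios * advantages)

def average_kl(old_dist, new_dist, eps=1e-8):
    """Average of KL(pi_old(.|s) || pi(.|s)) over the sampled states."""
    kl = np.sum(old_dist * (np.log(old_dist + eps) - np.log(new_dist + eps)), axis=1)
    return np.mean(kl)

def trust_region_satisfied(old_dist, new_dist, delta=0.01):
    """Check the trust-region constraint of (1): average KL <= delta."""
    return average_kl(old_dist, new_dist) <= delta
```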

III-C Alternating Direction Method of Multipliers

ADMM solves a separable equality-constrained optimization problem of the form [3]

\[
\min_{x,\,z}\ \ f(x) + g(z) \quad \text{s.t.}\quad Ax + Bz = c
\tag{2}
\]

where the decision variable decomposes into two sub-vectors, $x \in \mathbb{R}^{n}$ and $z \in \mathbb{R}^{m}$, and the objective is a sum of convex functions, $f$ and $g$, of these sub-vectors. The decision vectors $x$ and $z$ are coupled through a linear constraint, where $A \in \mathbb{R}^{p \times n}$, $B \in \mathbb{R}^{p \times m}$, and $c \in \mathbb{R}^{p}$.

The ADMM algorithm solves the problem (2) by forming its augmented Lagrange function,

\[
L_{\rho}(x, z, y) = f(x) + g(z) + y^{\top}(Ax + Bz - c) + \frac{\rho}{2}\,\|Ax + Bz - c\|_{2}^{2}
\tag{3}
\]

and approximately minimizing it through sequential updates of the primal variables $x$ and $z$ and the dual variable $y$:

\[
x^{k+1} = \arg\min_{x}\, L_{\rho}(x, z^{k}, y^{k}), \qquad
z^{k+1} = \arg\min_{z}\, L_{\rho}(x^{k+1}, z, y^{k}), \qquad
y^{k+1} = y^{k} + \rho\,(Ax^{k+1} + Bz^{k+1} - c)
\tag{4}
\]
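To make the update pattern in (4) concrete, here is a minimal Python sketch of the two-block iteration on a toy consensus problem; the quadratic objectives, the constraint $x - z = 0$, and all variable names are illustrative assumptions.

```python
# Two-block ADMM (4) on a toy problem: f(x) = 0.5*||x - p||^2, g(z) = 0.5*||z - q||^2,
# subject to x - z = 0 (i.e., A = I, B = -I, c = 0). Both sub-problems have closed forms.
import numpy as np

def admm_consensus(p, q, rho=1.0, iters=100):
    x = np.zeros_like(p)
    z = np.zeros_like(q)
    y = np.zeros_like(p)          # dual variable for the constraint x - z = 0
    for _ in range(iters):
        # x-update: argmin_x  0.5*||x - p||^2 + y^T(x - z) + (rho/2)*||x - z||^2
        x = (p - y + rho * z) / (1.0 + rho)
        # z-update: argmin_z  0.5*||z - q||^2 - y^T z + (rho/2)*||x - z||^2
        z = (q + y + rho * x) / (1.0 + rho)
        # dual ascent on the constraint residual
        y = y + rho * (x - z)
    return x, z

x, z = admm_consensus(np.array([1.0, 2.0]), np.array([3.0, 0.0]))
# x and z both approach the consensus minimizer (p + q) / 2 = [2.0, 1.0].
```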

IV Multi-Agent Trust Region Policy Optimization

IV-A Trust Region Optimization for Multiple Agents

Consider the policy optimization problem (1) for the multi-agent case. We decompose the global advantage function in the objective into the local advantage functions and rewrite the policy optimization model as follows:

\[
\max_{\pi}\ \ \sum_{i=1}^{N}\mathbb{E}_{s\sim\rho_{\pi_{\mathrm{old}}},\,a\sim\pi_{\mathrm{old}}}\!\left[\frac{\pi(a\mid s)}{\pi_{\mathrm{old}}(a\mid s)}\,A^{i}_{\pi_{\mathrm{old}}}(s,a)\right]
\quad \text{s.t.}\quad \bar{D}_{\mathrm{KL}}\big(\pi_{\mathrm{old}}\,\|\,\pi\big)\le\delta
\tag{5}
\]

Our purpose is to split the objective into independent sub-objectives so that the optimization problem can be solved distributedly by the agents. However, the sub-objectives are coupled through the joint policies $\pi$ and $\pi_{\mathrm{old}}$.

In our solution, instead of directly optimizing the joint policy $\pi$, we train a local policy $\pi^{i}(a \mid s)$ for every agent $i$, where $\pi^{i}(a \mid s)$ is a probability distribution over the joint action $a$ conditioned on the state $s$. When choosing actions, each agent $i$ independently generates its own action $a^{i}$ according to the local marginal distribution $\pi^{i}(a^{i} \mid s)$. The joint policy can then be expressed as $\pi(a \mid s) = \prod_{i=1}^{N} \pi^{i}(a^{i} \mid s)$.

Our principal theoretical result is that if each local policy $\pi^{i}$ has the form stated in Theorem 1 (Appendix A), the policy optimization (5) can be equivalently transformed into the following consensus optimization problem (see Appendix A for the proof):

(6)

Now, the objective is separable with respect to the local policies $\pi^{i}$. However, the local policy $\pi^{i}(a \mid s)$ is conditioned on the state $s$, which means that in order to learn or act according to $\pi^{i}$, every agent would need to know the global system state $s$. Generally, this is not the case in a partially observable environment. Since the agents act according to their local observations in a partially observable environment, the joint policy becomes $\pi(a \mid o)$, where $o = (o^{1}, \dots, o^{N})$ is the aggregated observation of all agents. In this case, we introduce the following approximation for partially observable environments:

(7)

where we substitute the global state $s$ with each agent's local observation $o^{i}$. In order to meet the trust-region constraint on the KL-divergence of the joint policy $\pi$, we introduce new constraints on the KL-divergences of the local policies $\pi^{i}$. According to the additivity and non-negativity of the KL-divergence, the following inequality holds:

(8)

Consequently, when the new KL-divergence constraints in (7) are satisfied for every agent, the joint policy $\pi$ is guaranteed to remain within the trust region of size $\delta$. It is noted that if the agents have enough observations (e.g., sufficiently long histories) to almost surely infer the global state, the approximation (7) can be sufficiently close to the problem (6) for the fully observable case.
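As an illustration of the additivity property invoked here, the following LaTeX sketch shows the chain-rule argument for a joint policy that factorizes into per-agent factors; the factorized form is an assumption used only for this derivation.

```latex
% If the joint policy factorizes as \pi(a \mid s) = \prod_{i=1}^{N} \pi^i(a^i \mid s), then
D_{\mathrm{KL}}\!\big(\pi_{\mathrm{old}}(\cdot\mid s)\,\big\|\,\pi(\cdot\mid s)\big)
 = \sum_{a} \prod_{i=1}^{N}\pi^{i}_{\mathrm{old}}(a^{i}\mid s)\,
   \sum_{j=1}^{N}\log\frac{\pi^{j}_{\mathrm{old}}(a^{j}\mid s)}{\pi^{j}(a^{j}\mid s)}
 = \sum_{j=1}^{N} D_{\mathrm{KL}}\!\big(\pi^{j}_{\mathrm{old}}(\cdot\mid s)\,\big\|\,\pi^{j}(\cdot\mid s)\big),
% so bounding each local KL term also bounds the joint KL, which is the logic behind (8).
```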

For the policy optimization (7), the joint action $a$ is required to calculate the likelihood ratio and the advantage function $A^{i}_{\pi_{\mathrm{old}}}$. However, the joint action follows the old policy $\pi_{\mathrm{old}}$ rather than $\pi$, which means that the agents only need to know the actions performed by the other agents in the last iterate. Knowing the last-round actions of other agents is not a restrictive assumption and has been widely adopted by existing decentralized algorithms in the literature. It is also worth mentioning that knowing the actions of other agents does not imply knowing their policies, because the local policy $\pi^{i}$ is conditioned on the local observation $o^{i}$, which is privately owned by agent $i$.

IV-B Distributed Consensus with ADMM

To solve the distributed consensus optimization problem (7), we employ the asynchronous ADMM algorithm proposed in [34]. Specifically, let $i$ and $j$ denote the agents at the endpoints of the communication link $e = (i, j) \in \mathcal{E}$. For each agent $i$, we introduce an estimator to estimate the likelihood ratio of its neighbor $j$. Then, the consensus constraint for agents $i$ and $j$ can be reformulated as

(9a)
(9b)
(9c)

For brevity, we use shorthand notation for the local likelihood ratio of each agent, the estimator held for its neighbor, and the corresponding constraint terms in (9). By using Equations (9), the problem (7) can be transformed into

(10)

where $w_{e}$ is the weight on the information transmitted between the agents at the communication link $e$. The weights must satisfy a normalization condition, and each $w_{e}$ is set to one of two fixed values in our study.

We form the augmented Lagrange function of problem (10):

(11)

where $y$ denotes the collection of dual variables and $\rho$ is a penalty parameter; each local policy $\pi^{i}$ is restricted to a feasible set defined by its trust-region constraint in (7), and each estimator is restricted to a corresponding feasible set.

To solve problem (10), the asynchronous ADMM minimizes the augmented Lagrange function by sequentially updating the local policies, the estimators, and the dual variables. Specifically, at iteration $k$, a communication link $e = (i, j)$ is selected according to a certain probability $p_{e}$. Then, the two agents at the endpoints of the link are activated to update their policies, estimators, and dual variables according to:

(12)

Wei and Ozdaglar proved in [34] that if the objective is convex and the feasible sets are nonempty, closed, and convex, the algorithm (12) converges almost surely to the optimal solution at a rate of $O(1/k)$ as $k \to \infty$.
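The following Python sketch conveys the flavor of randomly link-activated consensus updates; the scalar quadratic local costs, the ring graph, and every name here are illustrative assumptions, and the sketch is not the update (12) applied to the actual policy objective.

```python
# Randomly edge-activated consensus ADMM on a toy problem: each agent i holds a private
# scalar p[i] and a local cost f_i(x_i) = 0.5*(x_i - p[i])**2; per-edge constraints
# x_i = z_e = x_j force consensus across the ring graph.
import numpy as np

rng = np.random.default_rng(0)
p = np.array([1.0, 4.0, 2.0, 7.0])                 # private local data
edges = [(0, 1), (1, 2), (2, 3), (3, 0)]           # ring communication graph
N, E, rho = len(p), len(edges), 1.0

x = np.zeros(N)                                    # local primal variables
z = np.zeros(E)                                    # one auxiliary variable per edge
u = np.zeros((E, 2))                               # duals for x_i - z_e = 0 and x_j - z_e = 0
incident = [[e for e, (i, j) in enumerate(edges) if k in (i, j)] for k in range(N)]

def update_agent(k):
    # argmin_x 0.5*(x - p[k])^2 + sum_{e incident to k} [u*x + (rho/2)*(x - z_e)^2]
    num, den = p[k], 1.0
    for e in incident[k]:
        side = 0 if edges[e][0] == k else 1
        num += rho * z[e] - u[e, side]
        den += rho
    x[k] = num / den

for t in range(2000):
    e = rng.integers(E)                            # activate one random link
    i, j = edges[e]
    update_agent(i)
    update_agent(j)
    z[e] = (x[i] + x[j]) / 2.0 + (u[e, 0] + u[e, 1]) / (2.0 * rho)
    u[e, 0] += rho * (x[i] - z[e])
    u[e, 1] += rho * (x[j] - z[e])

print(x)   # all entries approach the consensus minimizer, mean(p) = 3.5
```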

IV-C Sequential Convexification for Convergence

In the problem (10), we implicitly assume that the policy can be evaluated at every observation and action, and we optimize directly over the policy $\pi^{i}$. However, for parameterized policies $\pi^{i}_{\theta^{i}}$, such as neural networks, we would like to optimize over the parameters $\theta^{i}$. This can lead to convergence issues in the ADMM algorithm due to the non-convexity of the objective with respect to $\theta^{i}$ [3, 34].

Since we are considering parameterized policies, we will use the parameter vector $\theta^{i}$ to represent the policy $\pi^{i}_{\theta^{i}}$ and overload all previous notation accordingly.

To address the convergence issue, we borrow an idea from sequential convex programming [17] and approximate the problem (10) with a convex model. Note that, for a small trust region, the likelihood ratios can be well approximated by a first-order Taylor expansion around the old parameters $\theta^{i}_{\mathrm{old}}$:

\[
\frac{\pi^{i}_{\theta^{i}}(a\mid o^{i})}{\pi^{i}_{\theta^{i}_{\mathrm{old}}}(a\mid o^{i})}
\;\approx\;
1 + \nabla_{\theta^{i}}\log\pi^{i}_{\theta^{i}}(a\mid o^{i})\big|_{\theta^{i}_{\mathrm{old}}}^{\top}\,(\theta^{i} - \theta^{i}_{\mathrm{old}})
\tag{13}
\]

and the KL-divergence can be well approximated by a second-order Taylor expansion around $\theta^{i}_{\mathrm{old}}$:

\[
\bar{D}_{\mathrm{KL}}\big(\pi^{i}_{\theta^{i}_{\mathrm{old}}}\,\|\,\pi^{i}_{\theta^{i}}\big)
\;\approx\;
\tfrac{1}{2}\,(\theta^{i} - \theta^{i}_{\mathrm{old}})^{\top} H^{i}\,(\theta^{i} - \theta^{i}_{\mathrm{old}})
\tag{14}
\]

where $H^{i}$ is the Hessian of the average KL-divergence evaluated at $\theta^{i}_{\mathrm{old}}$. The idea is also similar to TRPO [26] and constrained policy optimization (CPO) [1].

Substituting (13)-(14) into (10) and (11), we can derive a convex approximation to the primal problem and to the augmented Lagrange function, respectively. With this convex approximation, we can use the asynchronous ADMM algorithm (12) to solve for an improved policy for each agent $i$.
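The local convex model can be illustrated with the small Python sketch below for a single-state softmax policy parameterized directly by its logits (an assumption made for the example): the likelihood ratio is linearized as in (13), and the KL-divergence is replaced by its quadratic model as in (14), whose Hessian is the Fisher information matrix of the categorical distribution.

```python
# Sketch of the local convex model: a single-state softmax policy whose parameters are
# its logits (an illustrative assumption). The likelihood ratio is linearized around
# theta_old, ratio ~ 1 + g^T (theta - theta_old), and the KL-divergence is replaced by
# 0.5 * (theta - theta_old)^T H (theta - theta_old), with H the Fisher information matrix.
import numpy as np

def softmax(theta):
    e = np.exp(theta - theta.max())
    return e / e.sum()

def linearized_ratio(theta, theta_old, action):
    """First-order model of  pi_theta(a) / pi_theta_old(a)  around theta_old."""
    p_old = softmax(theta_old)
    grad_logp = np.eye(len(theta_old))[action] - p_old   # grad of log pi_theta(a) at theta_old
    return 1.0 + grad_logp @ (theta - theta_old)

def quadratic_kl(theta, theta_old):
    """Second-order model of  KL(pi_theta_old || pi_theta)  around theta_old."""
    p_old = softmax(theta_old)
    fisher = np.diag(p_old) - np.outer(p_old, p_old)
    d = theta - theta_old
    return 0.5 * d @ fisher @ d

theta_old = np.array([0.2, -0.1, 0.0])
theta = theta_old + np.array([0.05, 0.0, -0.02])
print(linearized_ratio(theta, theta_old, action=1), quadratic_kl(theta, theta_old))
```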

IV-D Privacy Protection

During the training process, the agents need to share with their neighbors only the information on the local likelihood ratio in order to reach a consensus. The likelihood ratio does not contain private information of the agents, such as their observations, rewards, or policy parameters. Therefore, MATRPO preserves the agents' privacy.

V Practical Implementation

V-A Sample-Based Approximation of the Consensus Constraint

To meet the consensus constraint, the agents need to reach agreement on the likelihood ratio $\pi^{i}_{\theta^{i}}(a \mid o^{i}) / \pi^{i}_{\theta^{i}_{\mathrm{old}}}(a \mid o^{i})$ for every observation and action. This task is challenging when the observation and action spaces are very large. To overcome this challenge, we implement a sample-based approximation to the consensus constraint.

First, we generate a batch of initial states according to the distribution $\rho_{0}$, and each agent $i$ obtains a batch of local observations of these states. Then, we simulate the joint policy $\pi_{\mathrm{old}}$ for $T$ timesteps to generate a batch of trajectories of states and actions, so that each agent holds a batch of locally observed trajectories. After that, the agents calculate the local likelihood ratios based on the samples from these trajectories. Finally, we constrain the agents to reach consensus on the sampled values of the likelihood ratio.
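A minimal Python sketch of this sample-based consensus target is given below; the array shapes, the probability values, and the stacking of ratios into per-timestep vectors are assumptions for illustration.

```python
# Sketch (assumed shapes and values): each agent evaluates its local policy on its own
# observations o^i_t of the shared trajectory and on the recorded joint actions a_t,
# producing one likelihood ratio per sampled timestep. The consensus constraint then
# couples these per-sample ratios across neighboring agents.
import numpy as np

def per_sample_ratios(new_probs, old_probs):
    """new_probs, old_probs: shape (T,), probabilities of the sampled joint actions
    under the new and old local policy parameters, respectively."""
    return new_probs / old_probs

# Illustrative numbers for two neighboring agents over T = 4 sampled timesteps.
ratios_i = per_sample_ratios(np.array([0.30, 0.22, 0.41, 0.15]),
                             np.array([0.28, 0.25, 0.40, 0.18]))
ratios_j = per_sample_ratios(np.array([0.27, 0.21, 0.44, 0.16]),
                             np.array([0.26, 0.24, 0.42, 0.17]))
consensus_residual = ratios_i - ratios_j   # driven toward zero by the ADMM updates
```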

We also use this sampling procedure to estimate the local advantage function $A^{i}_{\pi_{\mathrm{old}}}$, which is calculated by following the method in [1]:

(15)

where the local value function $V^{i}$ is approximated by a neural network based on the local observations $o^{i}$ and private rewards $r^{i}$.
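A routine of the kind used in [1] is generalized advantage estimation (GAE); whether equation (15) is exactly GAE($\lambda$) is an assumption here, and the function below is only a sketch with an assumed bootstrap convention.

```python
# Hedged sketch of generalized advantage estimation from local data: `rewards` are one
# agent's private rewards along a path and `values` are its local value-network outputs,
# including one bootstrap value for the state after the final step.
import numpy as np

def gae_advantages(rewards, values, gamma=0.995, lam=0.97):
    """rewards: shape (T,); values: shape (T+1,). Returns advantages of shape (T,)."""
    T = len(rewards)
    advantages = np.zeros(T)
    running = 0.0
    for t in reversed(range(T)):
        delta = rewards[t] + gamma * values[t + 1] - values[t]   # TD residual
        running = delta + gamma * lam * running
        advantages[t] = running
    return advantages
```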

V-B Agreement on Logarithmic Ratio

We find that it is better to use the logarithmic likelihood ratio for consensus, since it leads to a closed-form solution of the asynchronous ADMM updates (12). Because the logarithm is monotonic, consensus on the logarithmic ratio is equivalent to consensus on the ratio itself. Therefore, the following equations are used as the consensus constraints:

(16)

In addition, we approximate the logarithmic likelihood ratio by its first-order Taylor expansion around $\theta^{i}_{\mathrm{old}}$:

\[
\log\frac{\pi^{i}_{\theta^{i}}(a\mid o^{i})}{\pi^{i}_{\theta^{i}_{\mathrm{old}}}(a\mid o^{i})}
\;\approx\;
\nabla_{\theta^{i}}\log\pi^{i}_{\theta^{i}}(a\mid o^{i})\big|_{\theta^{i}_{\mathrm{old}}}^{\top}\,(\theta^{i} - \theta^{i}_{\mathrm{old}})
\tag{17}
\]

Then, using the sample-based approximation procedure in Section V-A, we derive the following Lagrange function:

(18)

where $\hat{A}^{i}$ is a vector of the sampled values of the advantage function $A^{i}_{\pi_{\mathrm{old}}}$, and $J^{i}$ is the Jacobian of the sampled likelihood ratios with respect to $\theta^{i}$. The policy parameters $\theta^{i}$ are restricted to the feasible set defined by the quadratic trust-region constraint, where $H^{i}$ is the Hessian of the sample-average KL-divergence.

The asynchronous ADMM updates for the Lagrange function (18) have the following closed forms (see Appendix A-B for the proof):

(19)

where the intermediate quantities in (19) are formed from the sample estimates described above, and $\mathcal{E}^{i}$ denotes the set of all communication links that directly connect to agent $i$. The pseudocode for our algorithm is given in Algorithm 1.

  Input: Initial local policies $\pi^{i}_{0}$, $i = 1, \dots, N$; tolerance $\epsilon$
  for $k = 0, 1, 2, \dots$ do
     Set the joint policy $\pi_{\mathrm{old}} = \pi_{k}$
     for each agent $i = 1, \dots, N$ do
        Sample a set of trajectories for agent $i$ under $\pi_{\mathrm{old}}$
        Form sample estimates of the advantages, likelihood-ratio Jacobians, and KL Hessian
     end for
     Initialize the estimators and dual variables
     for $t = 0, 1, \dots$ do
        Randomly select a communication link $e = (i, j)$
        For the agents at the endpoints of link $e$, update their policies, estimators, and dual variables according to (26)
        Keep all other policies, estimators, and dual variables unchanged
     end for
  end for
Algorithm 1 Multi-Agent Trust Region Policy Optimization
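Putting the pieces together, the Python skeleton below mirrors the structure of Algorithm 1; the callables for sampling, estimation, and the closed-form link update are placeholders supplied by the caller (assumptions), so this is a structural sketch rather than the paper's implementation.

```python
# Structural sketch of Algorithm 1. `sample_fn`, `estimate_fn`, and `admm_update_fn`
# are assumed, user-supplied callables: trajectory sampling for one agent, formation of
# its sample estimates, and the closed-form link update of (19), respectively.
import numpy as np

def matrpo(agents, edges, sample_fn, estimate_fn, admm_update_fn,
           policy_iters=500, admm_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    for _ in range(policy_iters):
        # Each agent samples local trajectories under the current joint policy and
        # forms its sample estimates (advantages, ratio Jacobians, KL Hessian).
        batches = [estimate_fn(agent, sample_fn(agent)) for agent in agents]
        # Reset the per-link estimators and dual variables, then run asynchronous ADMM.
        link_state = {e: {"estimator": None, "dual": 0.0} for e in edges}
        for _ in range(admm_iters):
            i, j = edges[rng.integers(len(edges))]      # activate one random link
            admm_update_fn(agents[i], agents[j],
                           batches[i], batches[j], link_state[(i, j)])
    return agents
```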

VI Experiments

The experiments are designed to investigate the following questions:

  • Does MATRPO succeed in learning cooperative policies in a fully decentralized manner when agents have different observation spaces and reward functions?

  • How does MATRPO compare with other baseline methods, such as a fully centralized learning method? Does MATRPO perform better than independent learning?

  • Can MATRPO be used to solve large-scale MARL problems? Does the performance degrade when the number of agents increases?

To answer these questions, we test MATRPO on two cooperative tasks, in which agents only have partial observations of the environment and receive private rewards. We compare the final performance of MATRPO with two baseline methods: (1) a fully centralized method, which uses TRPO to learn a global policy by reducing the problem to a single-agent RL problem, and (2) an independent learning method, which uses TRPO to learn each local policy by maximizing the local expected reward. We also examine the scalability of our algorithm by increasing the number of agents in the tasks.

VI-A Environments

Our experiments are carried out on the multi-agent particle environment proposed in [22] and extended by [16] and [8]. The Cooperative Navigation and Cooperative Treasure Collection tasks are used to test the proposed algorithm. To make the tasks more challenging under the decentralized training scheme, we modify them; the modified tasks are illustrated in Fig. 1 and explained as follows.

Fig. 1: (a) Cooperative Navigation. The agents must cover all the landmarks through coordinated movements. While all agents are punished for colliding with each other, only agent 1 is rewarded based on the proximity of any agent to each landmark. The agents communicate through a ring topology network. (b) Cooperative Treasure Collection. The hunters need to collect treasures, which are represented by small colored circles, and deposit them into correctly colored banks. The banks are rewarded for the successful collection and depositing of treasure. The hunters do not receive any reward, but are penalized for colliding with each other. The agents communicate through a ring topology network, i.e., Bank 1 - Bank 2 - Hunter 1 - ... - Hunter 6 - Bank 1.

Cooperative Navigation. In this task, agents must reach a set of landmarks through coordinated movements. In the original version of this task, the agents are collectively rewarded based on the minimum agent distance to each landmark, and individually punished when colliding with each other. To make it more challenging, we modified the task such that only one of the agents is rewarded. The other agents receive no reward, but they are penalized for collisions. Each agent observes its own position and velocity as well as the relative positions of the other agents and the landmarks.

Cooperative Treasure Collection. In this task, treasure hunters and treasure banks move around the environment to gather treasures. The treasures are generated with different colors and re-spawn randomly upon being collected. The hunters are responsible for collecting treasures and depositing them into correctly colored banks. The banks simply gather as much treasure as possible from the hunters. Each agent observes its own position and velocity as well as the relative positions and velocities of the other agents. The agents also observe the colors and re-spawn positions of the treasures. In our modified version of this task, the banks are rewarded for the successful collection and depositing of treasure. They are also negatively rewarded based on the minimum distance of each hunter to the treasures. The hunters do not receive any reward for collecting treasures, but are penalized for colliding with each other. This modification makes the task more challenging than the original version, where both the hunters and the banks are rewarded.

For all experiments, we use neural network policies with two hidden layers of 128 SELU units [10]. The action space is {left, right, up, down, stay}. For decentralized training, we assume that the agents communicate with their neighbors via a ring topology network. We run MATRPO and the baseline algorithms 5 times with different random seeds and compare their average performance on both tasks. Details on the experimental setup and parameters are provided in Table I.
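For reference, a minimal PyTorch sketch of the policy architecture described above (two hidden layers of 128 SELU units with a categorical head over the five movement actions) is shown below; the observation dimension and every detail not stated in the text are assumptions.

```python
# Sketch of a local policy network: two hidden layers of 128 SELU units and a
# categorical output over the discrete action set {left, right, up, down, stay}.
import torch
import torch.nn as nn

class LocalPolicy(nn.Module):
    def __init__(self, obs_dim: int, n_actions: int = 5):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, 128), nn.SELU(),
            nn.Linear(128, 128), nn.SELU(),
            nn.Linear(128, n_actions),
        )

    def forward(self, obs: torch.Tensor) -> torch.distributions.Categorical:
        return torch.distributions.Categorical(logits=self.net(obs))

policy = LocalPolicy(obs_dim=10)          # obs_dim is an assumed example value
action = policy(torch.zeros(10)).sample() # one of the five movement actions
```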

Fig. 2: Learning curves for the Cooperative Navigation and Cooperative Treasure Collection tasks. Error bars are a 95% confidence interval across 5 runs.

Task                        Cooperative Navigation      Cooperative Treasure Collection
Number of agents
Sim. steps per iter.        10k        10k              10k        10k
Stepsize ($\delta$)         0.003      0.003            0.001      0.001
Discount ($\gamma$)         0.995      0.995            0.995      0.995
Policy iter.                500        500              500        500
Num. paths                  100        100              100        100
Path len.                   100        100              100        100
ADMM iter.                  100        150              200        500
Penalty param. ($\rho$)     1.0        1.0              5.0        5.0
TABLE I: Parameters of MATRPO for the Cooperative Navigation and Cooperative Treasure Collection tasks.

VI-B Results

Learning curves of MATRPO and the baseline methods for the two cooperative tasks are shown in Fig. 2. MATRPO is successful at learning collaborative policies and achieves consistently high performance on all of the problems. Apart from the smaller cooperative navigation task, where Centralized TRPO almost finds the global optimum, MATRPO performs practically as well as Centralized TRPO does. Independent TRPO performs poorly on these tasks. In particular, for the cooperative treasure collection task, its performance becomes worse as training proceeds. Since the hunters are punished for colliding with each other, agents trained by Independent TRPO learn to move far away from each other to avoid collisions. As a result, some hunters move outside the window and end up far away from the treasures.

In addition, from the experimental results we find that the performance of MATRPO does not deteriorate when the number of agents increases. These results provide empirical evidence that MATRPO can be used to solve large-scale MARL problems.

Note that MATRPO learned all of the collaborative policies in a decentralized way using local observations and privately owned rewards. The agents do not need to know the observations, rewards, or policies of other agents. This is in contrast with most prior methods, which typically rely on aggregated observations and rewards, or on the policy and value/action-value network parameters of other agents.

VII Conclusion

In this paper, we showed that, through a particular decomposition and transformation, the TRPO algorithm can be extended to the optimization of cooperative policies in multi-agent systems. This enabled the development of our MATRPO algorithm, which updates the local policies through distributed consensus optimization.

In our experiments, we demonstrated that MATRPO can train distributed nonlinear policies based only on local observations and rewards. Furthermore, the privacy of the agents can be well protected during the training process. We evaluated the performance and scalability of MATRPO on two cooperative tasks by comparing it with centralized TRPO and independent learning. Our work represents a step towards applying MARL in real-world applications, where privacy and scalability are important.

Appendix A

A-A Equivalent Transformation of Policy Optimization in TRPO

Consider a multi-agent system with $N$ agents. Let $\pi^{i}(a \mid s)$ be the local policy trained by agent $i$. When choosing actions, each agent $i$ acts independently according to the local marginal distribution $\pi^{i}(a^{i} \mid s)$. The joint policy is expressed as $\pi(a \mid s) = \prod_{i=1}^{N} \pi^{i}(a^{i} \mid s)$.

Theorem 1.

Assume the local policy has the form