
# Distributed off-Policy Actor-Critic Reinforcement Learning with Policy Consensus

Yan Zhang and Michael M. Zavlanos are with the Department of Mechanical Engineering and Materials Science, Duke University, Durham, NC 27708, USA. {yan.zhang2,michael.zavlanos}@duke.edu
###### Abstract

In this paper, we propose a distributed off-policy actor-critic method to solve multi-agent reinforcement learning problems. Specifically, we assume that all agents keep local estimates of the global optimal policy parameter and update their local value function estimates independently. Then, we introduce an additional consensus step to let all the agents asymptotically achieve agreement on the global optimal policy function. We provide a convergence analysis of the proposed algorithm and validate its effectiveness using a distributed resource allocation example. Compared to relevant distributed actor-critic methods, here the agents do not share information about their local tasks, but instead they coordinate to estimate the global policy function.

## I Introduction

Reinforcement learning (RL) algorithms have been widely used to solve decision making and control problems in unknown and stochastic environments, [1, 2]. Existing RL algorithms fall in two main categories, tabular-based methods and methods that use function approximation. Tabular-based methods are generally easier to analyze [3], however, they require the state and action spaces to be discrete and finite. On the other hand, using function approximation, such as Neural Networks [2], makes it possible to solve RL problems in continuous state and action spaces. The goal of these methods is to estimate the value function or policy function over the whole state-action space with a finite number of function parameters. Then, the learning problem can be reduced to finding the optimal function parameters in finite dimensions. However, methods that rely on function approximation can be sensitive to approximation errors and can diverge in some cases [4]. Understanding convergence of RL algorithms with function approximation is an active area of research that is also the focus of this work.

Existing RL algorithms can also be classified as value-based methods [5, 6] or policy gradient methods [7, 8]. Value-based methods parameterize the state-value function or state-action value function and learn these function parameters during the learning process. In these methods, in order to obtain the control signal, a maximization problem needs to be solved at each time step of the execution phase, which is impractical. Instead, policy gradient methods parameterize and directly learn the policy function using stochastic gradient ascent, [7, 8]. It is well-known that in these methods the estimate of the policy gradient typically has large variance, and one popular way to reduce this variance is the Actor-Critic (AC) method, [9]. Essentially, the learning agent keeps a policy function estimator called the Actor and a value function estimator called the Critic. The Critic estimates the value function under the current policy and the Actor uses the feedback from the Critic to improve the policy function parameters.

In this paper, we are interested in distributed Actor-Critic methods. Specifically, we consider networks of agents that have their own tasks, states and actions, and assume that their states depend not only on their own actions but also on the states and actions of the other agents in the network. The goal is to let the agents in the network collaborate to learn a global optimal policy that maximizes the aggregate accumulated rewards over the network. The challenge in applying the AC method in this scenario is that the Critic update needs the local reward information from all the agents. This usually requires a master node to serve as a central critic, [10, 11]. When the network size is large, having a master node introduces significant communication overhead and may also cause privacy issues. The works in [12, 13, 14, 15, 16] employ distributed optimization methods to evaluate a given fixed policy in multi-agent systems. However, these methods do not improve the policy parameters. A different formulation is proposed in [17] that develops a distributed Actor-Critic method for teams of homogeneous agents that learn the same task independently and share their policy parameters with their neighbors. Instead, here we assume that the agents can have different tasks, different state and action spaces, and their behavior can affect each other. The work in [18] studies the multi-agent actor-critic method in mixed competitive-cooperative environments, but provides no convergence analysis.

Perhaps the most relevant works to the method proposed here are [19, 20]. The key idea in these works is to let each agent keep a local estimate of the global value function and introduce an additional consensus step on these local estimates to make the local agents asymptotically aware of the global value function. Compared to [19, 20], here every agent keeps its own local value function estimate associated with its own task and never shares this with its neighbors. Therefore, information about the local tasks is not revealed to other agents, unlike in [19, 20]. Instead, the agents keep local estimates of the global policy function and a consensus step is introduced so that all agents agree on the optimal policy.

The rest of the paper is organized as follows. In Section II, we introduce the distributed reinforcement learning problem under consideration as well as some preliminary results. In Section III, we formulate the decentralized off-policy reinforcement learning problem and formally present our proposed distributed Actor-Critic algorithm. In Section IV, we analyze the convergence of the proposed algorithm. In Section V, we present a numerical example to validate our analysis. In Section VI, we conclude the paper.

## II Preliminaries

### II-A The Reinforcement Learning Problem

Consider a network of N agents. We define the state of the system s = [s_1, …, s_N] ∈ S, where s_i denotes the state of agent i. Moreover, we define the action of the system a = [a_1, …, a_N] ∈ A, where a_i denotes the action of agent i. The state space S and action space A are continuous. We denote the state transition function by s(t+1) = f(s(t), a(t), w(t)), where w(t) represents the noise in the state dynamics at time t. We also assume that the transition process is stationary, that is, the function f and the noise generate a time-invariant state distribution. We assume that the global state and action can be observed by all the agents. This is a common assumption in current reinforcement learning problems, [18, 10, 21, 19].

Let r_i(s(t), a(t)) denote the local reward received by agent i at time t as a result of taking action a(t) at state s(t). Define also the global deterministic policy function π: S → A. Moreover, denote by V^π(s) the state value function and by Q^π(s, a) the state-action value function. The goal is to maximize the infinite-time discounted-reward value function

 J(π) = E_{ρ_π}[V^π(s(0))], (1)

over all policies π in the policy function space Π, where γ ∈ [0, 1) is the discount factor. Assuming that the transition probability and the reward functions are known, the existence of an optimal stationary policy function π* that maximizes the value function (1) is shown in [3], and this policy can be found using planning methods, e.g., policy iteration. However, if the probability and reward functions are unknown, reinforcement learning methods need to be applied to find the optimal policy function π*. In this paper, we are interested in actor-critic methods to find the optimal policy π*.

### II-B Actor Critic Method

Parameterizing the policy function as π_θ(s) = θ^T φ(s), where θ is the policy parameter and φ(s) is a vector of policy feature functions, the problem of finding the optimal policy that maximizes the value function in (1) can be reduced to the following optimization problem in the parameterized function space,

 maxθJ(θ). (2)

Problem (2) can be solved using stochastic gradient ascent methods. Specifically, the gradient is given in [8],

 ∇_θJ(π) = E_{ρ_π}[∇_θπ_θ(s) ∇_aQ^π(s, a)|_{a=π_θ(s)}]. (3)

Since the function Q^π under policy π_θ is unknown, policy evaluation algorithms are needed to approximate Q^π and, in turn, the gradient (3). To do so, the function Q^π can be parameterized as Q_w(s, a) = w^T ψ(s, a), where w is the value function parameter and ψ(s, a) is a vector of feature functions. Then, to solve problem (2) we can use the following actor-critic algorithm consisting of the two time scale updates

 w(t+1) = w(t) + α_w(t)(h(w(t), θ(t)) + M(t+1)),
 θ(t+1) = θ(t) + α_θ(t)(f(w(t), θ(t)) + N(t+1)), (4)

where h represents the update formulas of a policy evaluation algorithm, e.g., [5, 6], f represents the policy gradient update in (3), and M(t+1) and N(t+1) are noise terms coming from sampling during the learning process. The first update in (4) estimates the value function given the current policy, and is named the Critic update. The second update in (4) improves the current policy, and is named the Actor update. Convergence of this Actor-Critic scheme typically depends on analyzing the two time scale updates [22], which we explain in detail in Section IV.
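To illustrate the two time scale structure of (4), the following sketch runs a fast Critic-like recursion that tracks a slow Actor-like recursion on a toy scalar problem. The update maps `h` and `f`, the step-size exponents, and the fixed point θ = 1 are hypothetical choices for illustration, not the updates used later in the paper.

```python
import random

random.seed(0)

# Toy two-time-scale iteration mimicking the structure of update (4):
# the critic parameter w tracks the (hypothetical) target w*(theta) = theta,
# while the actor parameter theta slowly drifts to where f vanishes (theta = 1).
def h(w, theta):          # critic update direction (policy evaluation stand-in)
    return theta - w

def f(w, theta):          # actor update direction (policy gradient stand-in)
    return 1.0 - w

w, theta = 0.0, 5.0
for t in range(1, 200001):
    a_w = t ** -0.6       # fast step size: non-summable, square-summable
    a_th = t ** -0.9      # slow step size: a_th / a_w -> 0 (two time scales)
    noise_w = 0.01 * random.gauss(0.0, 1.0)   # sampling noise M(t+1)
    noise_th = 0.01 * random.gauss(0.0, 1.0)  # sampling noise N(t+1)
    w += a_w * (h(w, theta) + noise_w)        # Critic update (fast)
    theta += a_th * (f(w, theta) + noise_th)  # Actor update (slow)
```

Because α_θ(t)/α_w(t) → 0, the Critic sees an almost-constant policy and equilibrates first, which is exactly the intuition behind the two-time-scale analysis in Section IV.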

## III Problem Formulation

Since the estimation of the global state-action value function Q^π in (4) requires the global reward r = ∑_i r_i, the actor-critic method in (4) is centralized. That is, a master node is required to collect the local rewards from all the agents and execute the update (4). To design a decentralized algorithm, we first decompose the value and action value functions using linearity of the expectation as

 V^π(s) = ∑_{i=1}^N V_i^π(s) and Q^π(s, a) = ∑_{i=1}^N Q_i^π(s, a),

where V_i^π is the local value function and Q_i^π is the local action value function under policy π. Then, problem (1) can be written in the following separable form

 max_θ ∑_{i=1}^N J_i(θ), (5)

where J_i(θ) = E_{ρ_π}[V_i^π(s(0))]. A common approach to solve problem (5) in a distributed way is to introduce local estimates θ_i and use consensus to estimate the global optimal policy parameter θ*:

 θ_i(t+1) = ∑_{j∈N_i} W_ij(θ_j(t) + α_θ(t) ∇_θJ_j(θ)|_{θ=θ_j(t)}). (6)

In the RL problem under consideration, the objective function J_j(θ) is usually nonlinear and the gradient ∇_θJ_j(θ) is usually evaluated in a stochastic way. The convergence of (6) in this case is studied in [23]. The key idea in [23] is to evaluate the local gradient in an on-policy fashion, as in [8], for which all agents need to behave under the current policy. This suggests that to execute the consensus update (6) at time t, every agent j needs to send its policy parameter θ_j(t) to all other agents, which then need to execute the policy π_{θ_j} for multiple time steps so that agent j can collect local rewards to estimate ∇_θJ_j(θ_j(t)). This on-policy scheme is impractical even though its convergence analysis is simpler, as seen in [23].
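To make the structure of the consensus update (6) concrete, the sketch below runs consensus gradient steps on hypothetical scalar objectives J_i(θ) = −(θ − c_i)², whose sum is maximized at the average of the c_i. The targets c_i, the line topology, and the step-size schedule are illustrative assumptions; the gradients here are exact rather than sampled from trajectories.

```python
# Consensus policy-parameter update of the form (6) on a toy problem:
# three agents with local objectives J_i(theta) = -(theta - c_i)^2 should
# agree on the maximizer of the sum, i.e. the average of the c_i.
c = [0.0, 3.0, 6.0]                      # hypothetical local targets
N = len(c)
# Metropolis weights for the line graph 0 - 1 - 2 (doubly stochastic).
W = [[2/3, 1/3, 0.0],
     [1/3, 1/3, 1/3],
     [0.0, 1/3, 2/3]]

theta = [10.0, -5.0, 2.0]                # local estimates theta_i(0)
for t in range(1, 5001):
    alpha = (t + 1) ** -0.7              # diminishing step size
    grad = [-2.0 * (th - ci) for th, ci in zip(theta, c)]
    step = [theta[j] + alpha * grad[j] for j in range(N)]
    # Each agent averages its neighbors' gradient-ascent steps, as in (6).
    theta = [sum(W[i][j] * step[j] for j in range(N)) for i in range(N)]
```

The doubly stochastic weights preserve the network average, so the averaged iterate follows the aggregate gradient while the mixing drives the local estimates toward agreement.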

This motivates us to consider off-policy actor-critic methods as in [8, 24]. The idea is to let all agents behave under a fixed policy, named the behavioral policy β, and optimize an approximate objective function,

 J_β(θ) = E_{ρ_β}[V^π(s)], (7)

instead of the true cost in (1), where the expectation in (7) is taken over the stationary state distribution ρ_β instead of ρ_π as in (1). As with J(θ), the approximate cost J_β(θ) can also be decomposed into local costs as

 max_θ ∑_{i=1}^N J_{i,β}(θ). (8)

To achieve consensus on the local policy parameters θ_i, we can apply a similar update as in (6),

 θ_i(t+1) = ∑_{j∈N_i} W_ij(θ_j(t) + α_θ(t) ∇_θJ_{j,β}(θ)|_{θ=θ_j(t)}). (9)

However, as discussed in [8], the gradient ∇_θJ_{j,β}(θ) cannot be exactly estimated in this off-policy setting; therefore, it is replaced with the approximate gradient

 ^∇J_{j,β}(θ) = E_{ρ_β}[∇_θπ_θ(s) ∇_aQ_j^π(s, a)|_{a=π_θ(s)}]. (10)

Convergence of the Actor-Critic method in (9) using the off-policy gradient in (10) is studied in [24] for the centralized problem where no information is received from neighbors. However, convergence of (9) with the off-policy gradient in (10) for decentralized problems is unknown, and this is the focus of this paper.

In practice, we compute the gradient in (10) using the parameterization Q_{w_j}(s, a). To ensure that this parameterization preserves the gradient (10) computed with the true state-action value function Q_j^π, as discussed in [8], we make the following assumption:

###### Assumption III.1.

(Function Compatibility) All the local value functions are parameterized as Q_{w_i}(s, a) = (a − π_θ(s))^T ∇_θπ_θ(s)^T w_i. That is, the feature function ψ(s, a) for Q_{w_i} satisfies ψ(s, a) = ∇_θπ_θ(s)(a − π_θ(s)).

Given Assumption III.1 and the expression for the off-policy gradient (10), the consensus update (9) of the local policy parameters becomes

 θ_i(t) = ∑_{j∈N_i} W_ij(θ_j(t−1) + α_θ ∇_θπ_θ(s(t)) ∇_θπ_θ(s(t))^T w_j(t)). (11)

To conduct the policy parameter update in (11), we have to compute the local value function parameter w_i(t). To compute this parameter, generally speaking, any off-policy policy evaluation algorithm can be run at the local agents independently. Then, combining these policy evaluation algorithms with the approximate local gradient update in (10), a decentralized Actor-Critic algorithm can be developed. In this paper, we employ the gradient temporal difference (GTD) learning algorithm studied in [5, 6, 8] to estimate the local value function parameters in (11), because this method is known to be stable in the off-policy setting. The proposed distributed Actor-Critic method is presented in Algorithm 1.
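The information flow of the resulting scheme, independent local critics plus a consensus step on the actor parameters as in (11), can be sketched as follows. For brevity, the sketch substitutes a plain TD(0)-style critic for GTD, uses a synthetic i.i.d. state process in place of the behavioral policy, and uses a hypothetical feature map and constant step sizes, so it illustrates only the structure of Algorithm 1, not its exact updates.

```python
import random

random.seed(1)

N, d = 3, 2                                # agents, parameter dimension

def phi(s):                                # hypothetical policy features
    return [1.0, s]

# Metropolis weights for a line graph 0 - 1 - 2 (doubly stochastic).
W = [[2/3, 1/3, 0.0],
     [1/3, 1/3, 1/3],
     [0.0, 1/3, 2/3]]

theta = [[0.0] * d for _ in range(N)]      # local actor parameters theta_i
w = [[0.0] * d for _ in range(N)]          # local critic parameters w_i
gamma, a_w, a_th = 0.9, 0.05, 0.005
s = 0.0

def disagreement(th):
    mean = [sum(x[k] for x in th) / N for k in range(d)]
    return sum((x[k] - mean[k]) ** 2 for x in th for k in range(d))

for t in range(1500):
    s_next = random.uniform(-1.0, 1.0)     # synthetic state transition
    for i in range(N):
        r_i = -abs(s - 0.1 * i)            # hypothetical local reward, never shared
        v = sum(wk * pk for wk, pk in zip(w[i], phi(s)))
        v_next = sum(wk * pk for wk, pk in zip(w[i], phi(s_next)))
        delta = r_i + gamma * v_next - v   # local critic: TD(0)-style stand-in
        w[i] = [wk + a_w * delta * pk for wk, pk in zip(w[i], phi(s))]
    # Local actor step followed by consensus on policy parameters, as in (11).
    g = phi(s)                             # stand-in for grad_theta pi_theta(s)
    step = [[theta[j][k] + a_th * g[k] * sum(gl * wl for gl, wl in zip(g, w[j]))
             for k in range(d)] for j in range(N)]
    theta = [[sum(W[i][j] * step[j][k] for j in range(N)) for k in range(d)]
             for i in range(N)]
    s = s_next
```

Note that only the policy parameters θ_j cross the network; the rewards r_i and critic parameters w_i stay local, which is the privacy property emphasized in the Introduction.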

## IV Convergence Analysis

In this section, we analyze the convergence of Algorithm 1 using the two-time scale technique in [22]. The key idea is that the Critic updates at a faster rate than the Actor. Therefore, to analyze the convergence of the Critic update in line 3 of Algorithm 1, we can assume that each local policy parameter θ_i is fixed. Then, each local Critic can independently estimate its own local value function, and the convergence analysis of this local policy evaluation is the same as that of GTD in [6]. To analyze the convergence of the Actor, we can assume that every local Critic has already converged to the correct value function estimate. Compared to the analysis for the centralized off-policy Actor-Critic method in [24], the challenge here is to analyze the decentralized off-policy Actor-Critic algorithm with the consensus step in line 4 of Algorithm 1. To do so, we first introduce several assumptions that are common in the reinforcement learning literature.

###### Assumption IV.1.

We assume that the behavioral policy β is stationary, and that the Markov chain that governs the state s(t) under policy β is irreducible and aperiodic.

The above assumption ensures that when the agents behave under policy β, the system states reach the stationary state distribution ρ_β.

###### Assumption IV.2.

We assume that for all i, the local reward r_i(s, a) is uniformly bounded for all s and a.

This assumption ensures that the objective function in (8) is upper bounded when the discount factor satisfies γ < 1, so that problem (8) is well defined.

###### Assumption IV.3.

We assume that the stepsizes α_w(t) and α_θ(t) are deterministic and satisfy ∑_t α_w(t) = ∑_t α_θ(t) = ∞ and ∑_t (α_w(t)² + α_θ(t)²) < ∞. Moreover, α_θ(t)/α_w(t) → 0.

This assumption is standard in the literature employing two-time scale analysis, [22, 19, 20]. Furthermore, let W_t be a random weight matrix of the communication graph at time t. Define the filtration F_t to be the σ-algebra generated by the history of the algorithm up to time t. Then, we have the following assumptions on W_t.

###### Assumption IV.4.

We assume that W_t satisfies the following conditions: (a) W_t is row stochastic and E[W_t] is column stochastic for all t, that is, W_t1 = 1 and 1^T E[W_t] = 1^T; (b) the spectral norm ρ_W = ρ(E[W_t^T(I − (1/N)11^T)W_t]) satisfies ρ_W < 1; (c) W_t is conditionally independent of s(t) and r(t) given the filtration F_{t−1}.

The above assumptions are standard in the stochastic consensus optimization literature [23]. They will be used to establish consensus on the policy parameters.
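Condition (b) of Assumption IV.4 can be checked numerically for a given weight-matrix distribution. The sketch below computes ρ_W for a fixed (deterministic) Metropolis weight matrix on a connected 4-node line graph, a hypothetical topology chosen only for illustration, and confirms it is strictly less than 1.

```python
N = 4
# Metropolis weights for the line graph 0 - 1 - 2 - 3: symmetric and doubly
# stochastic, so condition (a) of Assumption IV.4 holds deterministically.
W = [[2/3, 1/3, 0.0, 0.0],
     [1/3, 1/3, 1/3, 0.0],
     [0.0, 1/3, 1/3, 1/3],
     [0.0, 0.0, 1/3, 2/3]]

def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

# Projection I - (1/N) 1 1^T that removes the consensus component.
proj = [[(1.0 if i == j else 0.0) - 1.0 / N for j in range(N)] for i in range(N)]
WT = [[W[j][i] for j in range(N)] for i in range(N)]
M = matmul(matmul(WT, proj), W)          # here E[W_t] reduces to the fixed W

# Power iteration for the spectral radius of the symmetric PSD matrix M.
v = [0.3, 0.1, -0.4, 0.2]
rho_W = 0.0
for _ in range(500):
    u = [sum(M[i][j] * v[j] for j in range(N)) for i in range(N)]
    norm = sum(x * x for x in u) ** 0.5
    rho_W = norm / sum(x * x for x in v) ** 0.5
    v = [x / norm for x in u]
```

For any connected graph with positive Metropolis weights, the second-largest eigenvalue of W has modulus below 1, which is what makes ρ_W < 1 and drives the disagreement to zero in Lemma IV.9.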

###### Assumption IV.5.

We assume that the vector of feature functions ϕ_π(s) is uniformly bounded for all s.

This assumption is common and essential to show stability of reinforcement learning algorithms, see [5, 6, 20, 19].

###### Assumption IV.6.

We assume that, throughout the whole history of the algorithm, θ_i(t) belongs to a compact set for all i and t. We also assume that this compact set contains at least one local maximum of problem (8).

This assumption is necessary to show stability of the policy parameter updates. Moreover, boundedness of the policy parameters is commonly observed in practice when implementing RL algorithms, as mentioned in [24]. It is possible to remove this assumption by projecting the policy parameters onto an appropriately chosen compact set after each update (11). However, this approach raises the question of how to select this projection set so that it contains at least one local maximum of problem (8), as per Assumption IV.6, and also complicates the analysis of off-policy methods, as we discuss after we present Theorem IV.12.

Let χ_i denote the function that maps the policy parameter θ_i to the optimal value function parameter w_i obtained by the policy evaluation algorithm [6]. Then, we can show the following result for the function χ_i.

###### Lemma IV.7.

Let the policy parameter θ_i at agent i be fixed, and let agent i run the gradient TD (GTD) learning algorithm in [6] to evaluate this policy. Then, the local value function parameter w_i(t) converges to χ_i(θ_i) almost surely (a.s.). Moreover, the function χ_i is Lipschitz continuous.

###### Proof.

By the two-time scale nature of Algorithm 1 and the fact that each policy evaluation is performed independently by each agent i, the GTD algorithm run at agent i behaves as a centralized algorithm. Therefore, the results in Lemma IV.7 follow directly from Lemmas 4 and 5 in [24]. ∎

Next, we have the following result on the stability of the value function parameters w_i(t).

###### Lemma IV.8.

Given Assumptions IV.2 and IV.6, the value function parameters w_i(t) are a.s. uniformly bounded over time.

###### Proof.

According to Lemma IV.7, we have that w_i(t) converges to χ_i(θ_i) a.s. Then, the boundedness of w_i(t) follows by combining Assumption IV.6 with the fact that the function χ_i is Lipschitz continuous. ∎

In what follows, we stack all policy function parameters in a vector θ(t) = [θ_1(t)^T, …, θ_N(t)^T]^T. The update (11) can be compactly written as

 θ(t+1) = (W_t ⊗ I)(θ(t) + α_θ(t) ^∇J(θ(t))), (12)

where ⊗ denotes the Kronecker product and I is an identity matrix of the same dimension as θ_i. The expression of ^∇J(θ(t)) is due to Assumption III.1. Moreover, define the disagreement between the local policy parameters as θ_⊥(t) = θ(t) − 1 ⊗ ¯θ(t), where 1 is a vector of dimension N whose entries are all 1, and ¯θ(t) = (1/N) ∑_i θ_i(t). We have the following lemma.

###### Lemma IV.9.

Given Assumptions IV.3, IV.4, IV.5 and IV.6, we have that ∑_t E[∥θ_⊥(t)∥²] < ∞. Therefore, θ_⊥(t) → 0 a.s.

###### Proof.

First, we establish the dynamics of θ_⊥(t). To achieve this, we introduce the operator J_⊥ = I − (1/N)(11^T ⊗ I). Then, multiplying both sides of (12) with J_⊥, and replacing θ(t) with 1 ⊗ ¯θ(t) + θ_⊥(t), we have that

 θ_⊥(t) = J_⊥(W_t ⊗ I)(1 ⊗ ¯θ(t−1) + θ_⊥(t−1) + α_θ(t) ^∇J(θ(t−1))).

Using Assumption IV.4(a), we have that J_⊥(W_t ⊗ I)(1 ⊗ ¯θ(t−1)) = 0 for all t. Therefore, we obtain

 θ_⊥(t) = J_⊥(W_t ⊗ I)(θ_⊥(t−1) + α_θ(t) ^∇J(θ(t−1))).

Taking the square of the Euclidean norm on both sides of the above equation, we have that

 ∥θ_⊥(t)∥² = ∥θ_⊥(t−1) + α_θ(t) ^∇J(θ(t−1))∥²_{W_t^T(I − (1/N)11^T)W_t},

where ∥x∥²_A = x^T A x for any vector x and positive semidefinite matrix A. According to Assumption IV.4(b,c), taking the expectation over the random matrix W_t, given the filtration F_{t−1} and the random samples s(t) and r(t), we have that

 E[∥θ_⊥(t)∥² | F_{t−1}, s(t), r(t)] ≤ ρ_W ∥θ_⊥(t−1) + α_θ(t) ^∇J(θ(t−1))∥² ≤ ρ_W(∥θ_⊥(t−1)∥² + 2α_θ(t) ∥θ_⊥(t−1)∥ ∥^∇J(θ(t−1))∥ + α_θ(t)² ∥^∇J(θ(t−1))∥²),

where the second inequality above follows by expanding the squared norm and using the Cauchy–Schwarz inequality. Taking the expectation of both sides of the above inequality and using Jensen's inequality, we have that

 E[∥θ_⊥(t)∥²] ≤ ρ_W E[∥θ_⊥(t−1)∥²] + 2ρ_W α_θ(t) √(E[∥θ_⊥(t−1)∥² ∥^∇J(θ(t−1))∥²]) + ρ_W α_θ(t)² E[∥^∇J(θ(t−1))∥²]. (13)

Recalling the expression for ^∇J(θ) under (12), and using Assumption IV.5 and Lemma IV.8, we have that ∥^∇J(θ(t))∥ can be bounded by a constant K_1 for all t. Denote v(t) = E[∥θ_⊥(t)∥²]. Then, recalling that ρ_W < 1 due to Assumption IV.4(b), (13) can be written as

 v(t) ≤ ρ_W v(t−1) + 2α_θ(t) K_1 √(v(t−1)) + α_θ(t)² K_1².

The above inequality is the same as (17) in [23]. Since Assumptions IV.4 and IV.3 imply that Assumptions 1 and 2 in [23] are also satisfied, we can use the same proof as in Lemma 1 in [23] to show that v(t) satisfies ∑_t v(t) < ∞ and that θ_⊥(t) → 0 a.s. ∎
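The role of the step-size conditions in Lemma IV.9 can be seen by iterating the recursion above with equality; the constants ρ_W = 0.9 and K_1 = 1 below are arbitrary illustrative values, not quantities derived from the paper.

```python
import math

rho_W, K1 = 0.9, 1.0          # illustrative constants (not from the paper)
v = 1.0                       # v(0) = E[||theta_perp(0)||^2]
total = 0.0
for t in range(1, 20001):
    a = 1.0 / t               # alpha_theta(t): non-summable, square-summable
    # Iterate the disagreement recursion of Lemma IV.9 with equality.
    v = rho_W * v + 2.0 * a * K1 * math.sqrt(v) + a * a * K1 * K1
    total += v

# v(t) settles at the scale of alpha_theta(t)^2, so the partial sums of v(t)
# stay bounded; this summability is what yields theta_perp(t) -> 0 a.s.
```

Intuitively, the contraction ρ_W < 1 forgets past disagreement geometrically while the diminishing step size injects ever-smaller perturbations, so the disagreement vanishes and its expected squared norms are summable.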

Since θ_⊥(t) → 0 a.s., it is sufficient to study the dynamics of ¯θ(t). In what follows, we show that with the policy update in (12), ¯θ(t) asymptotically approaches the following ODE dynamics

 ˙¯θ=F(¯θ), (14)

where

 F(¯θ) = E[(1/N) ϕ_π(s) ϕ_π(s)^T ∑_i χ_i(¯θ)]. (15)

Note that it is standard to study the discrete-time dynamics (11) by relating them to the behavior of the ODE in (14); see, e.g., relevant literature on RL [24, 21, 19] and stochastic optimization [23], as well as Chapter 2 in [22] that establishes conditions on the step sizes and noise terms in the discrete-time dynamics so that they asymptotically approach their continuous-time ODE counterpart.

To show that the discrete-time dynamics of ¯θ(t) asymptotically approach the ODE (14), we first multiply both sides of (12) with (1/N)(1^T ⊗ I) on the left to obtain

 ¯θ(t+1) = (1/N)(1^T ⊗ I)(θ(t) + α_θ(t+1) ^∇J(θ(t))) = ¯θ(t) + α_θ(t+1) (1/N) ϕ_π(s_{t+1}) ϕ_π(s_{t+1})^T ∑_i χ_i(θ_i(t)).

Adding and subtracting χ_i(¯θ(t)) in the above sum, the update of ¯θ(t) can be written in the following form

 ¯θ(t+1)=¯θ(t)+αθ(t+1)F(¯θ(t))+αθ(t+1)ξ(t)+αθ(t+1)r(t), (16)

where we have

 F(¯θ(t)) = E[(1/N) ϕ_π(s_{t+1}) ϕ_π(s_{t+1})^T ∑_i χ_i(¯θ(t))], (17a)
 ξ(t) = (1/N) ϕ_π(s_{t+1}) ϕ_π(s_{t+1})^T ∑_i χ_i(¯θ(t)) − E[(1/N) ϕ_π(s_{t+1}) ϕ_π(s_{t+1})^T ∑_i χ_i(¯θ(t))], (17b)
 r(t) = (1/N) ϕ_π(s_{t+1}) ϕ_π(s_{t+1})^T ∑_i (χ_i(θ_i(t)) − χ_i(¯θ(t))). (17c)

Then, to show that the discrete-time trajectory in (16) approaches the continuous trajectory of (14), we need to define the following functions generated by these trajectories; cf. Chapter 2.1 in [22]. First, let ¯x(n) denote the continuous piecewise linear function that passes through the discrete-time updates in (16), where n denotes the continuous time index obtained by accumulating the step sizes α_θ(t). Moreover, define the function x^s(n) that is the unique solution of the dynamical equation (14) for n ≥ s with initial condition x^s(s) = ¯x(s), and the function x_s(n) that is the unique solution of (14) for n ≤ s with the ending condition x_s(s) = ¯x(s). Then, we can show the following result.

###### Lemma IV.10.

Given Assumption IV.3, IV.4, IV.5 and IV.6, we have that for any ,

 lim_{s→∞} sup_{n∈[s, s+T]} ∥¯x(n) − x^s(n)∥ = 0, a.s.,
 lim_{s→∞} sup_{n∈[s−T, s]} ∥¯x(n) − x_s(n)∥ = 0, a.s.
###### Proof.

According to Lemma 1 in Chapter 2 of [22], it is sufficient to show that the following conditions are satisfied: (i) the function F in (17a) is Lipschitz continuous; (ii) ξ(t) in (17b) is a Martingale difference sequence satisfying E[∥ξ(t)∥² | F_t] ≤ K a.s. for some constant K; and (iii) r(t) → 0 a.s. Note that Lemma 1 in [22] requires conditions (i)-(ii), but only considers the dynamical equation (16) without the noise term r(t). The condition for the dynamical equation with noise to approach its ODE counterpart is given by the third extension of Lemma 1 in Chapter 2.2 of [22], and this extension requires condition (iii).

Combining Assumption IV.5 and Lemma IV.7, and recalling the definition in (17a), condition (i) is satisfied. In addition, from the construction of ξ(t), it is simple to see that E[ξ(t) | F_t] = 0. Particularly, since χ_i(¯θ(t)) is bounded for all t and ϕ_π(s) is also bounded according to Assumption IV.5 and Lemma IV.7, we have that ξ(t) is uniformly bounded for all t. Therefore, the constant K in condition (ii) must exist and this condition is satisfied. Finally, by the Lipschitz property of the function χ_i shown in Lemma IV.7, we have that

 ∥r(t)∥ ≤ (L_χ/N) ∥ϕ_π(s_{t+1}) ϕ_π(s_{t+1})^T∥ ∑_i ∥θ_i(t) − ¯θ(t)∥ ≤ (L_χ/√N) ∥ϕ_π(s_{t+1}) ϕ_π(s_{t+1})^T∥ ∥θ_⊥(t)∥.

Due to the boundedness of ϕ_π(s) and Lemma IV.9, we have that r(t) → 0 a.s. Therefore, condition (iii) is also satisfied. By conditions (i)-(iii), Assumption IV.3 and applying Lemma 1 in [22], the proof is complete. ∎

Before we state our main result, define the set of stationary points of the ODE (14) and make the following assumption.

###### Assumption IV.11.

We assume that the set of stationary points of (14) is compact, and that its image under the objective function J_β has an empty interior.

This assumption is satisfied when the objective function J_β is smooth, according to Sard's theorem. It is a common assumption in the stochastic approximation and optimization literature, e.g., [25, 23].

###### Theorem IV.12.

Given Assumption III.1 and Assumptions IV.1 to IV.11, θ_i(t) converges to the set of stationary points of (14) a.s. for all i.

###### Proof.

Using Lemma IV.9, it suffices to show that ¯θ(t) given by (16) converges to the desired set. Moreover, using Lemma IV.10, it suffices to show that the dynamics (14) converge to this set. To achieve this, define the function V(¯θ) = −J_β(¯θ), where J_β is the objective function of the central problem in (7). We shall prove that the function V can serve as a Lyapunov function to show stability of this set under the dynamics (14). For this, we need to show that V is non-increasing along any solution of the ODE in (14), where n is the continuous time index, and that the decrease is strict outside the set.

If the function F were the gradient of the function J_β, then we could directly use Proposition 4 in [23] to get the desired result. However, due to the proposed off-policy framework discussed in Section III, F is only an approximation of the exact gradient. In what follows, we show that F behaves in a similar way as the exact gradient. Recalling the definition of ^∇J_β(¯θ), we have that

 ^∇J_β(¯θ) = E[∇_θπ_θ(s)|_{θ=¯θ} ∑_i ∇_aQ_i^π(s, a)|_{a=π_θ(s)}].

Since the local value functions are parameterized by Q_{w_i}(s, a), we have that ∇_aQ_i^π(s, a)|_{a=π_θ(s)} = ∇_θπ_θ(s)^T χ_i(¯θ) for all i. Plugging this into the above equation, we get

 ^∇J_β(¯θ) = E[(∇_θπ_θ(s) ∇_θπ_θ(s)^T)|_{θ=¯θ} ∑_i χ_i(¯θ)]. (18)

Furthermore, using the parameterization ϕ_π(s) = ∇_θπ_θ(s), we can see that the function F(¯θ) in (15) is the off-policy gradient ^∇J_β(¯θ). According to the policy improvement theorem (Theorem 1) in [24], there exists a stepsize threshold such that, for all positive stepsizes below it, updating the policy parameter along ^∇J_β(¯θ) does not decrease the objective, and the improvement is strict away from stationary points. Considering the first-order Taylor expansion of the value function, as the stepsize goes to 0, the first-order term dominates the higher-order terms. Combining this with the policy improvement theorem, we have that J_β is non-decreasing along the direction F(¯θ), and increases strictly whenever F(¯θ) ≠ 0.

It follows that V(¯θ) = −J_β(¯θ) is non-increasing along trajectories of (14) and strictly decreasing outside the set of stationary points. We conclude that V is a valid Lyapunov function that can be used to show stability of this set under the dynamics (14). Finally, given Assumption IV.11, combining Lemma IV.10 and Theorem 2 in [23], we have that ¯θ(t) converges to this set a.s. Recalling that θ_⊥(t) → 0 a.s., we obtain the desired result. The proof is complete. ∎

To conclude, we make the following remark on the two-time scale analysis we have employed in this paper. Specifically, we have considered the two-time scale update

 w_i(t+1) = w_i(t) + α_w(t) h(w_i(t), θ_i(t)) + M_i(t+1),
 ¯θ(t+1) = ¯θ(t) + α_θ(t+1)F(¯θ(t)) + α_θ(t+1)ξ(t) + α_θ(t+1)r(t).

To analyze the above concurrent update in the two-time scale framework, according to [22], the noise in the update of ¯θ(t) needs to be a Martingale difference sequence with bounded moments. However, as we have shown in Lemma IV.10, the sequence r(t) is related to the disagreement error θ_⊥(t) and is not necessarily a Martingale difference sequence. Therefore, the analysis of the concurrent update scheme in [22] cannot be directly applied here. However, as mentioned at the end of Chapter 6.1 of [22], the two-time scale effect can also be achieved by a subsampling scheme. Specifically, let the local critic update at the fast time scale, and let the local actor keep its policy estimate fixed and update only after the local critic performs K update steps. Then, according to Lemma IV.7, when K is chosen large enough, the local critic converges. Therefore, the convergence of the local actors can be analyzed in the two-time scale fashion as we have done in this section.

## V Numerical Simulation

In this section, we illustrate our proposed algorithm and theoretical analysis using a distributed resource allocation example. Specifically, we consider N resource dispatch centers in an area of interest. These centers make decisions as to how to allocate available resources amongst each other. For example, the resources can be vehicles or taxis that service passengers in a big city, and the dispatch centers can control the number of vehicles in different neighborhoods of the city. Due to varying passenger demand, the dispatch centers may need to transfer vehicles to one another. We define the state of each center i at time t to capture the number of available resources m_i(t) and the local demand d_i(t). We also define the local action a_ij(t), which denotes the amount of resources sent from center i to its neighboring center j at time t. In this simulation, we assume the 6 centers are located in a grid. Each center communicates with its 1-hop neighboring centers located in its up/down/left/right directions. We assume the demand at each agent is of sinusoidal shape with random noise, i.e.,

 d_i(t) = A_i sin(ω_i t + ϕ_i) + w_i(t),

where A_i, ω_i and ϕ_i are randomly generated for all the agents. We denote by T_i the periodicity of the demand at agent i. The noise w_i(t) is subject to a zero-mean Gaussian distribution. Given this demand model, we let the local state be composed of m_i(t) and ¯t_i(t), where ¯t_i(t) denotes the phase of the local demand. Moreover, we define the state transition function as

 m_i(t+1) = Π_M[m_i(t) + ∑_{j∈N_i} a_ji(t) − ∑_{j∈N_i} a_ij(t) − d_i(t)],
 ¯t_i(t+1) = ¯t_i(t) + δ if ¯t_i(t) + δ ≤ T_i/2, and ¯t_i(t) + δ − T_i otherwise,

where M is a compact set, Π_M is the projection operator onto the set M, and δ is the sampling interval. The local reward function is designed as

 r_i(s_i(t)) = 0 if m_i(t) ≥ 0, and r_i(s_i(t)) = −(−m_i(t))³ if m_i(t) < 0. (19)

This reward function penalizes agents for having negative resources but does not reward them for accumulating too many resources. Since the agents can only observe the states and actions but do not know the models for the demands d_i, the transition function, or the reward functions r_i, this problem becomes a distributed RL problem.
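The resource dynamics and the reward (19) can be simulated directly. In the sketch below, the amplitudes, periods, noise level, capacity bounds, and the naive fixed-transfer policy are all hypothetical choices for illustration; the paper's experiment uses 6 centers on a grid with randomly generated demand parameters.

```python
import math
import random

random.seed(2)

N = 3                                       # a small line network of centers
M_LO, M_HI = -50.0, 50.0                    # hypothetical capacity set M
DELTA = 1.0                                 # sampling interval delta
A = [5.0, 4.0, 6.0]                         # demand amplitudes A_i (illustrative)
T = [24.0, 30.0, 20.0]                      # demand periods T_i (illustrative)
PHI = [0.0, 1.0, 2.0]                       # demand phases phi_i (illustrative)
neighbors = {0: [1], 1: [0, 2], 2: [1]}

def demand(i, t):
    w = random.gauss(0.0, 0.5)              # zero-mean Gaussian demand noise
    return A[i] * math.sin(2 * math.pi * t / T[i] + PHI[i]) + w

def reward(m):                              # local reward, as in (19)
    return 0.0 if m >= 0 else -((-m) ** 3)

m = [10.0, 10.0, 10.0]                      # initial resources m_i(0)
returns = [0.0] * N
for t in range(200):
    # Naive fixed policy: each center sends 1 unit to each neighbor.
    a = {(i, j): 1.0 for i in range(N) for j in neighbors[i]}
    for i in range(N):
        inflow = sum(a[(j, i)] for j in neighbors[i])
        outflow = sum(a[(i, j)] for j in neighbors[i])
        # Projected resource update, clipped to the compact set M.
        m[i] = min(M_HI, max(M_LO, m[i] + inflow - outflow - demand(i, t * DELTA)))
        returns[i] += reward(m[i])
```

Under this fixed policy, resources are steadily drained by the demand, so the cubic penalty in (19) accumulates; it is precisely this penalty that Algorithm 1 learns to avoid by rebalancing transfers.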

We apply Algorithm 1 to solve this problem. Specifically, we parameterize the global policy function using radial Gaussian basis functions (GBF) as π_θ(s) = θ^T ϕ(s), where

 ϕ_k(s) = (1/√(2πσ_k²)) exp(−∥s − c_k∥² / (2σ_k²)).

The feature parameters c_k and σ_k are randomly generated and are adjusted by trial and error. All the agents in the network share the same basis functions for their policies. According to Assumption III.1, the basis functions for the value functions are chosen to be compatible with the policy features. As discussed at the end of Section IV, we utilize the subsampling scheme to achieve a two-time scale effect. Specifically, we let the local actors update once every K updates of the critics. The step sizes α_w(t) and α_θ(t) are chosen to satisfy Assumption IV.3.
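A minimal sketch of the radial Gaussian basis feature map and the resulting linear policy π_θ(s) = θ^T ϕ(s) follows; the number of features, the centers c_k, and the widths σ_k below are placeholders chosen by hand, not the trial-and-error values used in the experiment.

```python
import math
import random

random.seed(3)

DIM_S, K = 2, 5                             # state dimension and number of features
centers = [[random.uniform(-1, 1) for _ in range(DIM_S)] for _ in range(K)]
sigmas = [0.5] * K                          # widths sigma_k (placeholder values)

def features(s):
    """Radial Gaussian basis functions phi_k(s)."""
    phi = []
    for ck, sig in zip(centers, sigmas):
        sq = sum((si - ci) ** 2 for si, ci in zip(s, ck))
        phi.append(math.exp(-sq / (2 * sig ** 2)) / math.sqrt(2 * math.pi * sig ** 2))
    return phi

def policy(theta, s):
    """Linear deterministic policy pi_theta(s) = theta^T phi(s)."""
    return sum(tk * pk for tk, pk in zip(theta, features(s)))

theta = [0.1] * K                           # shared policy parameter estimate
```

Because every agent uses the same basis functions, consensus on θ is equivalent to consensus on the policy function itself, which is what line 4 of Algorithm 1 enforces.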

We run Algorithm 1 for multiple trials with the same initialization. Specifically, all agents start with the same initial policy parameters and value function parameters at the beginning of each trial. The randomness in these trials is due to the noise in the demand model and the random exploration policy β. In Figure 1, we demonstrate that the accumulated return using each agent's policy estimate increases. Each point in the curve in Figure 1 is obtained in the following way: fix the policy parameter at agent i at the current iteration, let all the agents execute this policy for a fixed number of steps and compute the aggregate accumulated reward over the whole network, repeat this process multiple times, and take the mean of the accumulated rewards. This mean value is used as the performance indicator for the current policy parameter at agent i at the current iteration. Finally, observe in Figure 1 that by applying Algorithm 1, the local agents' policies achieve consensus. The policy estimates of agents 1 and 3 improve, while that of agent 2 degrades. The reason is that the distributed RL problem (8) is nonlinear, and Algorithm 1 is only guaranteed to converge to a local optimizer. We observe that the performance of the policy estimate at each local agent degrades slightly as the iterations increase, but eventually converges due to the decreasing step size rule. This behavior is caused by the variance in the policy gradient estimate in the Actor-Critic method, and can also be observed in other literature on Actor-Critic methods, e.g., [21].

## VI Conclusions

In this paper, we proposed a distributed actor-critic algorithm to solve multi-agent reinforcement learning problems. Specifically, we assumed that every agent keeps a local estimate of the global optimal policy function and updates this estimate using its local value function estimate, and we introduced an additional consensus step on these local estimates so that the agents asymptotically achieve agreement on the global optimal policy. We analyzed the convergence of the proposed algorithm and demonstrated its effectiveness on a distributed resource allocation example. Compared to existing distributed actor-critic methods for RL, using policy consensus does not require the agents to share their local tasks with each other.

## References

• [1] V. Mnih, K. Kavukcuoglu, D. Silver, A. Graves, I. Antonoglou, D. Wierstra, and M. Riedmiller, “Playing atari with deep reinforcement learning,” arXiv preprint arXiv:1312.5602, 2013.
• [2] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski et al., “Human-level control through deep reinforcement learning,” Nature, vol. 518, no. 7540, p. 529, 2015.
• [3] D. P. Bertsekas, Dynamic Programming and Optimal Control.   Athena Scientific, 1995.
• [4] L. Baird, “Residual algorithms: Reinforcement learning with function approximation,” in Machine Learning Proceedings 1995.   Elsevier, 1995, pp. 30–37.
• [5] R. S. Sutton, C. Szepesvári, and H. R. Maei, “A convergent O(n) algorithm for off-policy temporal-difference learning with linear function approximation,” Advances in neural information processing systems, vol. 21, no. 21, pp. 1609–1616, 2008.
• [6] H. R. Maei, C. Szepesvári, S. Bhatnagar, and R. S. Sutton, “Toward off-policy learning control with function approximation.” in ICML, 2010, pp. 719–726.
• [7] R. S. Sutton, D. A. McAllester, S. P. Singh, and Y. Mansour, “Policy gradient methods for reinforcement learning with function approximation,” in Advances in neural information processing systems, 2000, pp. 1057–1063.
• [8] D. Silver, G. Lever, N. Heess, T. Degris, D. Wierstra, and M. Riedmiller, “Deterministic policy gradient algorithms,” in ICML, 2014.
• [9] V. R. Konda and J. N. Tsitsiklis, “Actor-critic algorithms,” in Advances in neural information processing systems, 2000, pp. 1008–1014.
• [10]