
## Abstract

Stochastic approximation, a data-driven approach for finding the fixed point of an unknown operator, provides a unified framework for treating many problems in stochastic optimization and reinforcement learning. Motivated by a growing interest in multi-agent and multi-task learning, we consider in this paper a decentralized variant of stochastic approximation. A network of agents, each with their own unknown operator and data observations, cooperatively find the fixed point of the aggregate operator. The agents work by running a local stochastic approximation algorithm using noisy samples from their operators while averaging their iterates with their neighbors’ on a decentralized communication graph. Our main contribution provides a finite-time analysis of this decentralized stochastic approximation algorithm and characterizes the impacts of the underlying communication topology between agents. Our model for the data observed at each agent is that it is sampled from a Markov process; this lack of independence makes the iterates biased and (potentially) unbounded. Under mild assumptions on the Markov processes, we show that the convergence rate of the proposed method is essentially the same as if the samples were independent, differing only by a log factor that represents the mixing time of the Markov process. We also present applications of the proposed method on a number of interesting learning problems in multi-agent systems, including a decentralized variant of Q-learning for solving multi-task reinforcement learning.

## 1 Introduction

With the rapid growth of cloud computing and distributed data centers, decentralized algorithms are becoming an increasingly important tool for learning simultaneously from multiple separate resources. In this paper, we consider the decentralized (nonlinear) stochastic approximation (DCSA) framework, where a network of $N$ agents, under noisy measurements, seeks to cooperatively find the fixed point of a global operator that is the sum of local operators distributed among the agents. Mathematically, each agent $i$ has access to a local operator $F_i(X_i,\theta)$, where the sample $X_i$ takes values in a statistical sample space $\mathcal{X}_i$ with distribution $\mu_i$. Defining

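$$
\bar{F}_i(\theta) \triangleq \mathbb{E}_{\mu_i}[F_i(X_i,\theta)] = \sum_{X_i \in \mathcal{X}_i} \mu_i(X_i)\, F_i(X_i,\theta),
$$

the objective of the agents is to collectively find a solution $\theta^*$ such that

$$
\bar{F}(\theta^*) \triangleq \sum_{i=1}^{N} \bar{F}_i(\theta^*) = 0. \tag{1}
$$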

We note that the distributed version of stochastic gradient descent (SGD) and many algorithms in reinforcement learning, such as Q learning and TD($\lambda$) learning, can be abstracted as (1), which we discuss in more detail in Section 3.

We aim to solve (1) in the decentralized scenario where every agent communicates only with a subset of the other agents (its neighbors). Specifically, communication occurs according to an undirected and connected graph, and two agents communicate if and only if they are joined by an edge of this graph; the neighbors of an agent are the agents adjacent to it in the graph. Our motivation to study the decentralized problem stems from the wide range of applications where centralized communication is expensive or impossible. For example, when unmanned robots are used to collaboratively explore an unknown terrain or boundary, every robot may only communicate with some others within a certain radius (Ovchinnikov et al., 2014).

We propose solving (1) with Algorithm 1. In every iteration $k$, agent $i$ maintains $\theta_i^k$, a local estimate of the optimal $\theta^*$. The estimate is updated by the rule in Algorithm 1, as the mixture of a consensus step and a local improvement step. In the consensus step, agent $i$ collects its neighbors' parameters and performs a weighted average according to the consensus weight matrix, with the aim of reducing the consensus error. In the local improvement step, every agent moves in the direction of $F_i(X_i^k,\theta_i^k)$, which is computed using only agent $i$'s local information. The purpose of this step is to push the local parameter in the direction favored by agent $i$'s own operator.
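To make the two steps concrete, the following is a minimal sketch of one synchronous iteration. It assumes the update takes the common form $\theta_i^{k+1} = \sum_j W_{ij}\theta_j^k + \epsilon F_i(X_i^k,\theta_i^k)$; the exact rule is the one stated in Algorithm 1, and the names `dcsa_step`, `W`, and the callable signature `F(i, x, theta)` are illustrative choices rather than the authors' notation.

```python
def dcsa_step(thetas, W, F, samples, eps):
    """One synchronous iteration: consensus averaging plus a local step.

    thetas  : list of N parameter vectors (lists of floats)
    W       : N x N doubly stochastic weight matrix (list of lists)
    F       : F(i, x, theta) -> list of floats, agent i's noisy operator
              (hypothetical signature, chosen for this sketch)
    samples : list of N current samples, one per agent
    eps     : constant step size
    """
    n, d = len(thetas), len(thetas[0])
    new_thetas = []
    for i in range(n):
        # Consensus step: weighted average of the agents' parameters.
        mixed = [sum(W[i][j] * thetas[j][c] for j in range(n)) for c in range(d)]
        # Local improvement step: move along agent i's own operator.
        direction = F(i, samples[i], thetas[i])
        new_thetas.append([mixed[c] + eps * direction[c] for c in range(d)])
    return new_thetas


# Toy usage: two agents with noiseless operators F_i(theta) = b_i - theta,
# so the root of the aggregate operator is the mean of b_1 and b_2.
targets = [[0.0], [2.0]]
W = [[0.5, 0.5], [0.5, 0.5]]
F = lambda i, x, theta: [targets[i][0] - theta[0]]
thetas = [[5.0], [-1.0]]
for _ in range(200):
    thetas = dcsa_step(thetas, W, F, [None, None], eps=0.1)
average = sum(t[0] for t in thetas) / len(thetas)
```

In this toy run the network average settles at the aggregate root, while the individual iterates remain a small, step-size-dependent distance apart, which is the consensus error discussed in the text.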

Our model for the data observed at each agent is that it is sampled from a Markov process; this lack of independence makes the iterates biased, and the resulting convergence properties are less well understood in the literature. Our motivation to study Markovian randomness comes from the wide use of variants of Algorithm 1 in reinforcement learning, where the i.i.d. data assumption is usually unrealistic. We provide more details on the applications that operate under Markovian data in Section 3.

### 1.1 Main Contribution

In this work, we propose a general framework for the decentralized stochastic approximation problem, and present its finite-time convergence analysis under Markovian data without any assumptions on the boundedness of quantities related to the operators $F_i$. Specifically, we show that under a constant step size $\epsilon$, Algorithm 1 converges linearly to a ball centered at a fixed point of (1) with radius $O(\epsilon\log(1/\epsilon))$. This convergence rate and the asymptotic precision both differ from the rate with i.i.d. noise only by a factor of $\log(1/\epsilon)$, which naturally relates to the mixing time of the Markov chain. In addition, our rate matches the one in centralized stochastic approximation under Markovian data, up to a factor that depends on the structure of the communication graph.

We demonstrate the generality of the DCSA framework by showing how two actively studied topics, namely decentralized Markov chain gradient descent (MCGD) and decentralized TD($\lambda$) learning, fall under the framework as special cases. As a result, we can apply the convergence theorem of DCSA and obtain the convergence rates of decentralized MCGD and decentralized TD($\lambda$), which match the best-known rates derived specifically for the respective problems.

Further, we apply the DCSA framework to solve multi-task reinforcement learning problems. We define and motivate this problem, present the convergence properties, and support the effectiveness of the algorithm with experimental results.

### 1.2 Related Works

Distributed stochastic approximation (SA) and the related distributed SGD have been studied extensively in the literature. Here we provide a brief review of the works most relevant to this paper. Yuan et al. (2016); Lian et al. (2017); Zeng and Yin (2018); Koloskova et al. (2020) study distributed SGD for convex and non-convex problems under i.i.d. data. A subset of these works allows the data at different agents to follow different distributions (i.e. $X_i$ and $X_j$ can be distributed differently if $i \neq j$), but they all require the data at every agent to be distributed i.i.d. according to the local distribution. Most of the works assume every gradient update has bounded energy.

In this work, a key contribution is that we remove the bounded updates/gradients assumption usually made in the literature. As specific examples, Sun et al. (2019) assume that the norm of the operator is uniformly upper bounded by a constant; Wai (2020) relaxes this assumption, but has to assume a bound on the variance of the operator. Such assumptions are difficult to verify in practical problems, and may not hold even in simple least squares regression. Our work eliminates any assumptions on the boundedness of the operator or quantities related to its variance.

A recent work (Doan, 2020) also studied distributed SA without the boundedness assumption, but it considers a centralized communication topology (star graph). When centralized coordination is available and the agents communicate every iteration, the algorithm can be designed such that the local parameters maintained by all agents are always equal. This makes the consensus error zero, which greatly simplifies the analysis. In contrast, our work considers a decentralized communication graph, under which the consensus error is non-zero. The consensus error is very closely related to the magnitude of the operators $F_i$. With non-zero consensus errors, analyzing the convergence without assuming that the $F_i$ are bounded becomes much more difficult.

Our work finds applications in TD learning and Q learning, and is thus related to the vast literature on multi-agent and multi-task reinforcement learning. The most closely related works include Zhang et al. (2018); Assran et al. (2019); Zeng et al. (2020), which blend consensus SGD with variants of policy gradient methods to solve multi-agent or multi-task reinforcement learning. IMPALA (Espeholt et al., 2018) and its variants also tackle multi-task learning using a network of agents, but require centralized coordination.

## 2 Finite-Time Analysis

In this section, we present our convergence analysis of Algorithm 1 for solving the objective (1). Our main theorem relies on some structural properties of the Markov chains $\{X_i^k\}$, the mappings $F_i$, and the network weights, which we detail in the following assumptions.

###### Assumption 1.

The matrix of consensus weights in (2) is doubly stochastic. In addition, the weight that agent $i$ assigns to a different agent $j$ is positive if and only if $i$ and $j$ are connected by an edge of the communication graph.

This is a standard assumption in consensus algorithms. It means that the maximum singular value of the weight matrix is $1$, and the key structural property affecting the rate at which consensus algorithms converge is the spectral gap, which is governed by $\sigma_2$, the second-largest singular value of the weight matrix.
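As an illustration of Assumption 1, the sketch below builds one admissible weight matrix and estimates $\sigma_2$. The Metropolis weighting rule used here is a standard construction and an assumption of this example, not something the paper prescribes; the function names are hypothetical. Because the resulting matrix is symmetric, its singular values are the absolute values of its eigenvalues, so a power iteration with the all-ones direction removed suffices.

```python
import math


def metropolis_weights(n, edges):
    """Build symmetric, doubly stochastic consensus weights for an undirected graph."""
    degree = [0] * n
    for i, j in edges:
        degree[i] += 1
        degree[j] += 1
    W = [[0.0] * n for _ in range(n)]
    for i, j in edges:
        W[i][j] = W[j][i] = 1.0 / (1 + max(degree[i], degree[j]))
    for i in range(n):
        W[i][i] = 1.0 - sum(W[i])  # remaining mass stays on the diagonal
    return W


def second_singular_value(W, iters=200):
    """Estimate sigma_2 of a symmetric W by power iteration off the all-ones direction."""
    n = len(W)
    v = [float(i + 1) for i in range(n)]
    sigma = 0.0
    for _ in range(iters):
        mean = sum(v) / n
        v = [x - mean for x in v]  # project out the averaging direction
        norm_v = math.sqrt(sum(x * x for x in v))
        w = [sum(W[i][j] * v[j] for j in range(n)) for i in range(n)]
        norm_w = math.sqrt(sum(x * x for x in w))
        if norm_v == 0.0 or norm_w == 0.0:
            return 0.0
        sigma = norm_w / norm_v
        v = [x / norm_w for x in w]
    return sigma


ring = [(0, 1), (1, 2), (2, 3), (3, 0)]  # four agents on a ring
W = metropolis_weights(4, ring)
sigma2 = second_singular_value(W)
gap = 1.0 - sigma2 ** 2  # the quantity 1 - sigma_2^2 that appears in Theorem 1
```

For the four-agent ring every weight equals one third, and the estimate of $\sigma_2$ is also one third; sparser or larger graphs push $\sigma_2$ toward one and shrink the gap.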

###### Assumption 2.

The Markov chains $\{X_i^k\}$ for each agent $i$ are irreducible and aperiodic. In addition, $\{X_i^k\}$ and $\{X_j^k\}$ are independent for $i \neq j$.

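Assumption 2 is again standard in studying the convergence properties of Markov chains. Its main consequence is that the chains “mix” at a geometric rate. In the context of this paper, if we define the $\epsilon$-mixing time of $\{X_i^k\}$ as

$$
\tau_i(\epsilon) = \min\left\{ t \geq 1 : \left\| \mathbb{E}[F_i(X_i^k,\theta) \mid X_i^0 = x] - \bar{F}_i(\theta) \right\| \leq \epsilon(\|\theta\| + 1), \ \forall \theta, x, k \geq t \right\},
$$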

then we have $\tau_i(\epsilon) \leq C\log(1/\epsilon)$ for some constant $C$. This is the content of Lemma 1, which we present in the supplementary material. We will denote $\tau(\epsilon) \triangleq \max_i \tau_i(\epsilon)$.
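The logarithmic growth of the mixing time can be seen on a small example. The sketch below uses total variation distance to the stationary distribution as a proxy for the operator-based definition above; this substitution, the two-state chain, and the function name `mixing_time` are assumptions made for illustration only.

```python
def mixing_time(P, pi, eps, max_steps=10_000):
    """Smallest t >= 1 with max_x TV(P^t(x, .), pi) <= eps for a finite chain."""
    n = len(P)
    rows = [[1.0 if j == x else 0.0 for j in range(n)] for x in range(n)]  # P^0
    for t in range(1, max_steps + 1):
        # Advance every starting distribution by one step: row <- row @ P.
        rows = [[sum(r[k] * P[k][j] for k in range(n)) for j in range(n)] for r in rows]
        worst = max(0.5 * sum(abs(r[j] - pi[j]) for j in range(n)) for r in rows)
        if worst <= eps:
            return t
    raise RuntimeError("chain did not mix within max_steps")


P = [[0.7, 0.3], [0.2, 0.8]]  # irreducible and aperiodic two-state chain
pi = [0.4, 0.6]               # its stationary distribution
tau_coarse = mixing_time(P, pi, 1e-2)
tau_fine = mixing_time(P, pi, 1e-4)
```

Squaring the tolerance (from $10^{-2}$ to $10^{-4}$) roughly doubles the mixing time in this example, from 6 to 13 steps, which is the behavior a bound of the form $C\log(1/\epsilon)$ predicts.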

###### Assumption 3.

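The mappings $F_i$ are Lipschitz; there exists a constant $L$ such that for all $i$ and all samples $X_i$

$$
\|F_i(X_i,\theta) - F_i(X_i,\tilde{\theta})\| \leq L\|\theta - \tilde{\theta}\|, \quad \forall \theta, \tilde{\theta} \in \mathbb{R}^d.
$$

As a result of Assumption 3, the mapping $\bar{F}_i$ is also Lipschitz continuous with constant $L$:

$$
\|\bar{F}_i(\theta) - \bar{F}_i(\tilde{\theta})\| \leq L\|\theta - \tilde{\theta}\|, \quad \forall \theta, \tilde{\theta} \in \mathbb{R}^d.
$$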

Let since are finite.

###### Assumption 4.

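The mapping $\bar{F}_i$ is strongly monotone for all $i$; there exists $\alpha > 0$ such that

$$
\langle \bar{F}_i(\theta) - \bar{F}_i(\tilde{\theta}), \theta - \tilde{\theta} \rangle \leq -\alpha\|\theta - \tilde{\theta}\|^2, \quad \forall \theta, \tilde{\theta} \in \mathbb{R}^d.
$$

A direct consequence of this assumption is that there exists a unique $\theta^*$ that solves (1).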

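In our theorem statement below, we concatenate the estimated parameters for each node into $\theta^k = [(\theta_1^k)^\top, \dots, (\theta_N^k)^\top]^\top$ and denote their average as $\bar{\theta}^k$. The total variance of the $\theta_i^k$ from their average (which we call the consensus error below) is $S^k = \sum_{i=1}^{N} \|\theta_i^k - \bar{\theta}^k\|^2$.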

###### Theorem 1.

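Suppose that Assumptions 1-4 hold. Let the step size be a constant $\epsilon$, small enough that

$$
\epsilon\,\tau(\epsilon) \leq \min\left\{ \frac{1}{C_{\epsilon,1}}, \ \frac{\alpha(1-\sigma_2^2)}{C_{\epsilon,2}} \right\}, \tag{3}
$$

where $C_{\epsilon,1}$ and $C_{\epsilon,2}$ are problem-dependent constants whose explicit form is given in the supplementary material. Then the sequence generated by Algorithm 1 satisfies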

 E[∥θki−θ∗∥2] ≤(1−αϵ4)k−τ(ϵ)O(∥¯θ0∥2+S0++B∥θ∗∥21−σ22) +O(NB2α(1−σ22)ϵlog(1ϵ)). (4)

We defer the complete proof of the theorem to the supplementary material, where we give the precise form for (4) and the constants. Under a constant step size $\epsilon$, (4) indicates that Algorithm 1 achieves linear convergence to a ball with radius on the order of $\epsilon\log(1/\epsilon)$ around the solution $\theta^*$. As the step size is selected closer to 0, this error eventually decays to 0. Both the rate of convergence and the radius of the error differ from those in works assuming the $\{X_i^k\}$ are i.i.d. sequences only by a factor of $\log(1/\epsilon)$ (for example, see (Khaled et al., 2019)). In addition, this rate matches those derived in prior works with a centralized communication topology (Doan, 2020).

Our bound depends inversely on $1-\sigma_2^2$, which reflects the impact of the graph and the weight matrix on the convergence. The bound also scales proportionally to $N$, which shows the increasing difficulty of the problem as the number of agents goes up.

### 2.1 Technical Challenges

In this section, we sketch the main technical challenges in our analysis. Our assumptions are more general than those in the existing works in two important ways: we do not assume that the $F_i$ are bounded, and we take the sequences $\{X_i^k\}$ to be Markovian instead of i.i.d. Characterizing the convergence with these weaker assumptions requires a new approach to the analysis.

We separate the error of the parameter estimate at node $i$ at iteration $k$ into two parts:

$$
\theta_i^k - \theta^* = \underbrace{\theta_i^k - \bar{\theta}^k}_{\text{consensus error}} + \underbrace{\bar{\theta}^k - \theta^*}_{\text{optimality error}}
$$

If the operators were bounded, i.e., if their norms were bounded by a constant uniformly over all samples and parameters, then previous work (e.g. Yuan et al. (2016)) has shown that the consensus error is of order $\epsilon$. This effectively allows the consensus error to be made arbitrarily small by choosing a small step size $\epsilon$, and so it can be analyzed independently of the optimality error. Without the boundedness assumption, the errors must be treated jointly.
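The decomposition above can be checked numerically. The sketch below, with made-up parameter values and hypothetical helper names, splits each agent's error into its consensus and optimality parts and also evaluates the elementary bound $\|\theta_i - \theta^*\|^2 \leq 2\|\theta_i - \bar\theta\|^2 + 2\|\bar\theta - \theta^*\|^2$, which is one standard way the two parts are combined.

```python
def split_error(thetas, theta_star):
    """Return per-agent consensus errors and the shared optimality error."""
    n, d = len(thetas), len(theta_star)
    avg = [sum(t[c] for t in thetas) / n for c in range(d)]
    optimality = [avg[c] - theta_star[c] for c in range(d)]
    consensus = [[t[c] - avg[c] for c in range(d)] for t in thetas]
    return consensus, optimality


def sq_norm(v):
    """Squared Euclidean norm of a vector given as a list."""
    return sum(x * x for x in v)


thetas = [[1.0, 2.0], [3.0, 0.0], [2.0, 4.0]]  # illustrative local estimates
theta_star = [1.0, 1.0]                        # illustrative solution
consensus, optimality = split_error(thetas, theta_star)
# Adding the two parts back together recovers theta_i - theta_star.
total = [[c + o for c, o in zip(ci, optimality)] for ci in consensus]
```

The consensus errors sum to zero across agents by construction, so only the optimality error moves the network average.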

Another complication arises from the Markovian samples. A standard argument yields the following bound:

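\begin{align}
\mathbb{E}[\|\bar{\theta}^{k+1} - \theta^*\|^2]
&\leq \mathbb{E}[\|\bar{\theta}^k - \theta^*\|^2] + O(\epsilon^2) \notag \\
&\quad + 2\epsilon \sum_{i=1}^{N} \mathbb{E}[\langle \bar{\theta}^k - \theta^*, F_i(X_i^k,\theta_i^k) - \bar{F}_i(\theta_i^k) \rangle] \notag \\
&\quad + 2\epsilon \sum_{i=1}^{N} \mathbb{E}[\langle \bar{\theta}^k - \theta^*, \bar{F}_i(\theta_i^k) - \bar{F}_i(\bar{\theta}^k) \rangle] \notag \\
&\quad + 2\epsilon \sum_{i=1}^{N} \mathbb{E}[\langle \bar{\theta}^k - \theta^*, \bar{F}_i(\bar{\theta}^k) \rangle] \notag \\
&\leq (1 - 2\alpha\epsilon)\, \mathbb{E}[\|\bar{\theta}^k - \theta^*\|^2] + O(\epsilon^2) \notag \\
&\quad + 2\epsilon \sum_{i=1}^{N} \mathbb{E}[\langle \bar{\theta}^k - \theta^*, F_i(X_i^k,\theta_i^k) - \bar{F}_i(\theta_i^k) \rangle] \tag{5} \\
&\quad + 2\epsilon \sum_{i=1}^{N} \mathbb{E}[\langle \bar{\theta}^k - \theta^*, \bar{F}_i(\theta_i^k) - \bar{F}_i(\bar{\theta}^k) \rangle], \tag{6}
\end{align}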

where the second inequality is due to Assumption 4 and $\bar{F}(\theta^*) = 0$. To get a geometric convergence rate, we need the last two terms above to be dominated by the contraction gained in the first term. When the $X_i^k$ are i.i.d., $F_i(X_i^k,\theta_i^k)$ is an unbiased estimate of $\bar{F}_i(\theta_i^k)$, and the term (5) vanishes. Under Markovian randomness, the samples are biased, and the term (5) is potentially unbounded. Furthermore, when the operators are bounded, combining the independent bound on the consensus error with the Lipschitz condition makes (6) easy to treat, as

$$
(6) \leq O(\epsilon^2 \|\bar{\theta}^k - \theta^*\|).
$$

Our approach is to bound the combination of the consensus and optimality errors directly. The key to this approach is to take advantage of the fast mixing time that Assumption 2 affords: there exists a constant $C$ such that the $\tau_i(\epsilon)$ can be bounded as

$$
\tau_i(\epsilon) \leq C\log(1/\epsilon).
$$

(This is carefully stated and proven as Lemma 1 in the appendix.) This essentially means that samples of the Markov chain are not much affected by those at least $\tau_i(\epsilon)$ steps away in the past, and since $\epsilon\,\tau(\epsilon) \to 0$ as $\epsilon \to 0$, we can choose a step size that allows us to analyze the convergence on a time scale that has been mildly dilated.

For our analysis, we use a Lyapunov function defined on $\bar{\theta}^k$ and $S^k$,

$$
V(\theta^k) = \mathbb{E}[\|\bar{\theta}^k - \theta^*\|^2] + \mathbb{E}[S^k] + \mathbb{E}[S^{k-\tau(\epsilon)}].
$$

The three terms above are each bounded with recursive inequalities, all coupled with one another, and through careful analysis of their dynamics using the fast mixing time property of the Markov chain, we show that $V(\theta^k)$ converges geometrically to a ball around the origin with radius proportional to $\epsilon\log(1/\epsilon)$.

## 3 Applications

In this section, we present three problems in multi-task supervised learning and reinforcement learning. We demonstrate the generality of the DCSA framework by discussing how the three problems fall under the framework as special cases. The first two problems, decentralized Markov chain gradient descent and decentralized TD($\lambda$) with linear function approximation, have been well studied in the existing literature, so we give a brief overview of the problems and highlight their connection to our framework. Our third problem is novel: multi-task Q learning with a network of agents, for which we detail the problem formulation, the algorithm, and the convergence properties obtained through the application of Theorem 1.

### 3.1 Decentralized Markov Chain Gradient Descent

Let $\Xi$ be a statistical sample space with probability distribution $\pi$, and let $f(\xi; x)$ be a function that is convex in its second argument for all $\xi \in \Xi$. We consider the optimization problem

$$
\min_{x \in \mathbb{R}^n} \mathbb{E}_{\xi}[f(\xi; x)] = \sum_{\xi \in \Xi} \pi(\xi) f(\xi; x). \tag{7}
$$

To solve this problem, SGD iteratively updates the iterate as

$$
x_{k+1} = x_k - \epsilon_k \nabla f(\xi_k; x_k), \tag{8}
$$

where $\xi_k$ is the sample drawn at iteration $k$ and $\epsilon_k$ is the step size. When the samples $\{\xi_k\}$ are i.i.d., the convergence of (8) is well understood in the stochastic optimization literature. However, in many cases drawing i.i.d. samples from $\pi$ is computationally expensive or even infeasible, for example when $\pi$ is a high-dimensional probability distribution. Markov chain Monte Carlo (MCMC) sampling is a standard technique to circumvent this issue (Jerrum and Sinclair, 1996). In addition, it has been observed that SGD performs better, i.e., uses less data and computation, when the gradients are sampled from Markov processes rather than from i.i.d. samples, in both convex and nonconvex problems (Sun et al., 2018). In this case, the sampler produces a sequence of gradients in (8) with Markovian “noise” whose stationary distribution is $\pi$ (Sun et al., 2018; Doan et al., 2020).
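A minimal sketch of update (8) with Markovian samples follows. The quadratic loss, the two-state chain, the diminishing step size $1/(k+1)$, and the function name `markov_chain_sgd` are all assumptions chosen to keep the example small; they are not taken from the cited works.

```python
import random


def markov_chain_sgd(P, centers, steps, seed=0):
    """Minimize E_pi[0.5 * (x - c_xi)^2] with samples xi drawn along a Markov chain."""
    rng = random.Random(seed)
    state, x = 0, 0.0
    for k in range(steps):
        grad = x - centers[state]  # gradient of 0.5 * (x - c)^2 at the current sample
        x -= grad / (k + 1)        # diminishing step size 1 / (k + 1)
        # Draw the next sample from the current row of the transition matrix.
        state = rng.choices(range(len(P)), weights=P[state])[0]
    return x


P = [[0.7, 0.3], [0.2, 0.8]]  # stationary distribution is (0.4, 0.6)
x_hat = markov_chain_sgd(P, centers=[0.0, 1.0], steps=20_000)
```

The minimizer of the stationary objective is $0.4 \cdot 0 + 0.6 \cdot 1 = 0.6$, and the iterate lands close to that value even though consecutive samples are correlated.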

A natural extension of MCGD is the distributed setting, where a large amount of data is stored across different machines (agents) (Sun et al., 2019). In this case, we aim to solve

$$
\min_{x \in \mathbb{R}^n} \sum_{i=1}^{N} \mathbb{E}_{\xi_i}[f_i(\xi_i; x)], \tag{9}
$$

where $\xi_i$ is the local data at machine $i$. One can therefore apply Algorithm 1 to solve problem (9) by taking each local operator to be the negative stochastic gradient of the corresponding local objective, in which case Assumption 4 is equivalent to the strong convexity of the local objectives. Our theoretical result in Theorem 1 recovers the known results of decentralized MCGD for solving strongly convex optimization problems (Sun et al., 2019) under constant step sizes. However, unlike Sun et al. (2019), we do not need to assume that the gradients are bounded; the analysis techniques used in this paper are therefore different from theirs. Note that although Sun et al. (2019) do not consider the case of strongly convex objectives, one can generalize their techniques to cover this case.

### 3.2 Decentralized TD(λ) Learning with Linear Function Approximation

Decentralized TD learning considers the policy evaluation problem in multi-agent reinforcement learning, where a group of agents is connected through a graph. At each time step, every agent observes the global state and applies an action. The agents' joint actions decide the next state of the system, and each agent receives an instantaneous reward. The goal of the agents is to collectively evaluate the global value function under a provided policy.

Mathkar and Borkar (2016) and Doan et al. (2019a, b) analyze this problem in the scenario where the estimated value function is restricted to a linear subspace parameterized by $\theta$. In this case, the policy evaluation problem is equivalent to finding $\theta^*$ satisfying

$$
\mathbb{E}_{\pi}[A]\,\theta^* = \sum_{i=1}^{N} \mathbb{E}_{\pi}[b_i], \tag{10}
$$

where $A$ is a (not necessarily symmetric) positive definite matrix and the $b_i$ are vectors, which depend on the geometry of the subspace and on Markovian samples with $\pi$ being the stationary distribution.

In this case, TD($\lambda$) learning is a linear SA problem and an obvious special case of problem (1). Assumptions 3 and 4 are guaranteed to hold in this case (Doan et al., 2019b). By Theorem 1, the error decays linearly to a ball with radius $O(\epsilon\log(1/\epsilon))$, which recovers the best known rate in Doan et al. (2019a, b). Finally, since $A$ is not symmetric, (10) cannot be cast as an optimization problem. Therefore, it does not fit into the SGD framework.

### 3.3 Multi-Task Reinforcement Learning

We consider a multi-task reinforcement learning (MTRL) problem with $N$ tasks. We can view each task as an environment characterized by a different Markov decision process (MDP). We employ a network of $N$ agents and assign one agent to each task. Each agent is restricted to learn in its own environment, but the goal of the agents is to cooperatively find a unified policy, a mapping from the union of state spaces to the common action space, that is jointly optimal across all environments.

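Mathematically, the MDP at agent $i$ is given by a $5$-tuple $(\mathcal{S}_i, \mathcal{A}, \mathcal{P}_i, R_i, \gamma_i)$, where $\mathcal{S}_i$ is the set of states; $\mathcal{A}$ is the set of possible actions, which is the same across agents; $\mathcal{P}_i$ is the transition probabilities that specify the distribution on the next state given the current state and an action; $R_i$ is the reward function; and $\gamma_i$ is the discount factor. We define $\mathcal{S} = \cup_{i} \mathcal{S}_i$, where the $\mathcal{S}_i$ can share common states.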

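Given a policy $\pi$, let $V_i^{\pi}$ be the value function associated with the $i$-th environment,

$$
V_i^{\pi}(s) = \mathbb{E}\left[ \sum_{k=0}^{\infty} \gamma_i^k R_i(s_i^k, a_i^k) \,\Big|\, s_i^0 = s \right], \quad a_i^k \sim \pi(\cdot \mid s_i^k).
$$

We define the average value function of policy $\pi$, which measures the performance of the policy across all environments,

$$
V^{\pi}(s) \triangleq \sum_{i : s \in \mathcal{S}_i} V_i^{\pi}(s).
$$

Similarly, we denote by $Q_i^{\pi}$ and $Q^{\pi}$ the local and average Q functions, respectively:

$$
Q_i^{\pi}(s,a) = \mathbb{E}\left[ \sum_{k=0}^{\infty} \gamma_i^k R_i(s_i^k, a_i^k) \,\Big|\, s_i^0 = s, a_i^0 = a \right], \quad Q^{\pi}(s,a) \triangleq \sum_{i : s \in \mathcal{S}_i} Q_i^{\pi}(s,a).
$$

An optimal policy $\pi^*$ maximizes the average value function for all states at the same time, i.e.

$$
V^{\pi^*}(s) \geq V^{\pi}(s), \quad \forall s \in \mathcal{S}, \pi. \tag{11}
$$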
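The aggregation over environments that share a state can be sketched as follows. The dictionary representation, the state labels, and the numeric values are hypothetical and serve only to show that a state contributes the values of exactly those environments whose state space contains it.

```python
def average_value(local_values, state):
    """Sum V_i(s) over the environments whose state space contains s."""
    return sum(v[state] for v in local_values if state in v)


# Hypothetical local value functions, stored as {state: value} per environment.
local_values = [
    {"s0": 1.0, "s1": 0.5},   # environment 1 contains s0 and s1
    {"s1": 2.0, "s2": -1.0},  # environment 2 contains s1 and s2
]
shared = average_value(local_values, "s1")      # both environments contribute
exclusive = average_value(local_values, "s0")   # only environment 1 contributes
```

A policy that is greedy for one environment's value at a shared state such as `s1` can lower the other environment's value there, which is the interference discussed in the text.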

In single task reinforcement learning, there exists a deterministic optimal policy under mild assumptions (Puterman, 2014). Further, such a policy can be obtained using Q learning under assumptions on the behavior policy (Watkins and Dayan, 1992). However, in the multi-task regime, a deterministic optimal policy may not exist, due to the interference of the dynamics of different environments. This issue prevents us from using Q learning to find a globally optimal policy, since Q learning by design can only produce deterministic policies. However, it can be shown that a deterministic optimal policy exists in two special cases of MTRL: 1) every environment has the same state space and transition probability; 2) the environments all have disjoint state spaces.

#### Multi-Task Q Learning with Linear Function Approximation

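In practical problems, the dimensionality of the state space usually prohibits storing a Q table. To approximate the Q function, we assume that it lies in a $d$-dimensional linear subspace $\mathcal{Q}$. Given the features $\phi(s_i,a_i) \in \mathbb{R}^d$ for $s_i \in \mathcal{S}_i$ and $a_i \in \mathcal{A}$, the Q value is approximated as

$$
\tilde{Q}_{\theta}(s_i,a_i) = \phi(s_i,a_i)^\top \theta, \quad \forall s_i \in \mathcal{S}_i, a_i \in \mathcal{A}.
$$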

In the rest of this section, we focus on formulating the multi-task Q learning problem with linear function approximation in the second special case, where the state spaces of the environments are disjoint. We show that this problem admits a re-formulation that falls under the DCSA framework (Equation (1)). We point out that multi-task Q learning in the first special case can be formulated to fit in the DCSA framework by an almost repetitive argument.

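Using $s_{i,m}$ to denote the $m$-th state in environment $i$, we define the feature matrix

$$
\Phi_i = \begin{bmatrix} \phi(s_{i,1},a_1)^\top \\ \vdots \\ \phi(s_{i,|\mathcal{S}_i|},a_{|\mathcal{A}|})^\top \end{bmatrix} = \begin{bmatrix} | & & | \\ \phi_{1,i} & \cdots & \phi_{d,i} \\ | & & | \end{bmatrix}.
$$

Without loss of generality, we can normalize the feature vectors so that their norms are uniformly bounded. Since the state spaces of the environments are disjoint, the feature matrix $\Phi$ for the global problem is obtained by stacking the local feature matrices $\Phi_1, \dots, \Phi_N$.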

Now the problem becomes finding the best Q function that lies in the linear subspace $\mathcal{Q}$.

With function approximation, there is no guarantee that the optimal Q function truly lies in $\mathcal{Q}$. Therefore, we solve a projected version of the Bellman equation

$$
\Phi\theta = \Pi_{\mathcal{Q}} T[\Phi\theta], \tag{12}
$$

where $\Pi_{\mathcal{Q}}$ is the projection mapping onto the linear subspace $\mathcal{Q}$ with respect to some suitable metric.

We now provide a re-formulation of the objective (12). Suppose that every agent uses a fixed behavior policy $\pi$ to generate trajectories, that is, agent $i$ chooses action $a_i \sim \pi(\cdot \mid s_i)$ in state $s_i$ and observes the next state $s_i'$. Under mild assumptions on $\pi$, we can show (12) is equivalent to

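$$
\sum_{i=1}^{N} \mathbb{E}_{\mu_i}\Big[ \phi(s_i,a_i) \Big( R_i(s_i,a_i) + \gamma \max_{a_i' \in \mathcal{A}} \phi(s_i',a_i')^\top \theta - \phi(s_i,a_i)^\top \theta \Big) \Big] = 0, \tag{13}
$$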

where $\mu_i$ is the stationary distribution induced by the behavior policy in environment $i$.

The problem now resembles (1) by choosing

$$
F_i(X_i,\theta) = \phi(s_i,a_i) \Big( R_i(s_i,a_i) + \gamma \max_{a_i' \in \mathcal{A}} \phi(s_i',a_i')^\top \theta - \phi(s_i,a_i)^\top \theta \Big), \tag{14}
$$

where $X_i = (s_i, a_i, s_i')$ is Markovian with stationary distribution $\mu_i$. This suggests that to solve an MTRL problem, we can perform a decentralized Q learning (DCQL) algorithm, which amounts to using Algorithm 1 with $F_i$ defined in (14).
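The operator in (14) can be written as a short function. This is a sketch under stated assumptions: the feature map, reward function, discount factor, and parameter values below are invented for illustration, and the name `q_learning_operator` and its argument order are hypothetical.

```python
def q_learning_operator(phi, reward, gamma, actions, sample, theta):
    """Evaluate F_i(X_i, theta) from (14) for one transition X_i = (s, a, s_next)."""
    s, a, s_next = sample
    dot = lambda u, v: sum(x * y for x, y in zip(u, v))
    feat = phi(s, a)
    # Temporal-difference error with a greedy bootstrap over the next action.
    best_next = max(dot(phi(s_next, b), theta) for b in actions)
    td_error = reward(s, a) + gamma * best_next - dot(feat, theta)
    # The operator is the feature vector scaled by the TD error.
    return [f * td_error for f in feat]


# Illustrative setup: integer states and actions with a 3-dimensional feature map.
actions = [0, 1]
phi = lambda s, a: [1.0, float(s), float(a)]
reward = lambda s, a: 1.0 if (s, a) == (0, 1) else 0.0
theta = [0.5, 1.0, -1.0]
update = q_learning_operator(phi, reward, 0.5, actions, (0, 1, 1), theta)
```

For this transition the current estimate is $-0.5$, the greedy next-state estimate is $1.5$, and the reward is $1$, so the TD error is $1 + 0.5 \cdot 1.5 + 0.5 = 2.25$ and the operator returns the feature vector scaled by that amount.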

Q learning with linear function approximation is known to diverge in general (Baird, 1995), and we have to make the following regularity assumption to guarantee convergence. This assumption is also made in single task Q learning with linear function approximation (Chen et al., 2019) and can be interpreted as a requirement on the quality of the behavior policy for good exploration.

###### Assumption 5.

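The behavior policy $\pi$ satisfies

$$
\gamma^2\, \mathbb{E}_{\mu_i}\Big[ \max_{a' \in \mathcal{A}} \big( \phi(s,a')^\top \theta \big)^2 \Big] - \mathbb{E}_{\mu_i,\pi}\Big[ \big( \phi(s,a)^\top \theta \big)^2 \Big] \leq -2\alpha\|\theta\|^2, \quad \forall \theta \in \mathbb{R}^d, i \in [N].
$$

Assumption 5 can be re-expressed to guarantee that $\bar{F}_i$ is strongly monotone, which makes DCQL a special case within the DCSA framework. This implies a linear convergence rate when we apply Algorithm 1 to solve multi-task Q learning.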

## 4 Experiments

In this section, we verify the effectiveness of the DCSA framework through two multi-task Q learning experiments. The first experiment is a small-scale GridWorld problem that provides insight on the behavior of the policies learned by the algorithm. The second experiment runs on a large-scale drone navigation platform and shows the scalability of our approach.

In practical MTRL problems, it may be difficult to pick a behavior policy that meets Assumption 5. This topic has been discussed in depth by Chen et al. (2019). In the following experiments we use as behavior policies the $\epsilon$-greedy policy with respect to the current Q function, which we cannot verify to satisfy Assumption 5. The aim of the experiments is to show that Algorithm 1 often performs well empirically even under imperfect behavior policies that may violate Assumption 5.

### 4.1 GridWorld Experiment

We consider a GridWorld problem consisting of multiple mazes (environments). In each environment, the agent is placed in a grid of cells, where each cell has one of three labels: goal, obstacle, or empty. The agent selects an action from a set of 4 actions {up, down, left, right} to move to the next cell. It then receives a reward of +1 if it reaches the goal, -1 if it encounters an obstacle, and 0 otherwise. The state is the current position of the agent. Each agent maximizes its personal cumulative reward when the goal is reached from the initial position in the smallest number of steps.

In the multi-task settings, there are 4 (Figure 1(a)) or 6 (Figure 1(b)) different maze environments of equal size that differ in obstacle and goal positions. We assign one agent to each environment. In this setting, the environments have the same state space and dynamics, which guarantees the existence of a deterministic optimal solution as discussed in Section 3.3. As the state and action spaces are both small, we perform tabular Q learning. We note that tabular learning is a special case of learning with linear function approximation by choosing the feature vector of the state-action pair $(s',a')$ to be $e_{s',a'}$, where

$$
e_{s',a'}(s,a) = \begin{cases} 1, & \text{if } s = s', a = a', \\ 0, & \text{otherwise.} \end{cases}
$$
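The equivalence between a Q table and a linear model with indicator features can be verified directly. In the sketch below the table entries and the row-by-row flattening order are arbitrary illustrative choices, and `one_hot_feature` is a hypothetical helper name.

```python
def one_hot_feature(state, action, n_states, n_actions):
    """Indicator feature vector with a single 1 at the (state, action) entry."""
    vec = [0.0] * (n_states * n_actions)
    vec[state * n_actions + action] = 1.0
    return vec


n_states, n_actions = 3, 2
q_table = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
theta = [q for row in q_table for q in row]  # flatten the table row by row
# Evaluate phi(s, a)^T theta for every state-action pair.
approx = [
    [sum(f * w for f, w in zip(one_hot_feature(s, a, n_states, n_actions), theta))
     for a in range(n_actions)]
    for s in range(n_states)
]
```

Each inner product picks out exactly one entry of `theta`, so the linear approximation reproduces the table entry for entry.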

After being trained with the DCQL algorithm for 2000 episodes, the agents agree on a unified Q value table, whose performance is tested in all environments. The results are presented in Figure 1, where we combine all the results into one grid. Here the dark green cells represent the starting position of the agents. Yellow cells represent the goals, with the white numbers indicating the indices of environments in which the cell is a goal. Red cells represent the obstacle positions, with the black numbers indicating the indices of environments in which the cell is an obstacle. The mixed color cell (Figure 1(b) cell 8f) indicates the cell is a goal position in environment 0, but an obstacle in environment 2 and 3. The light green path is the route dictated by the policy greedy with respect to the learned Q table.

Figure 1(a) shows a 4-environment problem where the algorithm returns an optimal policy that finds the goals of all environments sequentially. Then, in Figure 1(b), we consider a harder problem, where we have 6 environments and introduce conflicts, i.e. the goal position of environment 0 is an obstacle in environments 2 and 3. As the agent is agnostic to which environment it is placed in, the optimal policy should visit the goal positions of all other environments before visiting the goal position of environment 0. The learned DCQL policy indeed takes this strategy and can be verified to be optimal. In contrast, an agent trained independently in one maze without collaborating with other agents would fail to find the goal in any other maze.

### 4.2 Drone Navigation Experiment

In this experiment, we create 4 indoor environments. Each environment is explored by one drone agent, forming a network of 4 agents. The agents aim to collaboratively learn a unified policy for navigating all environments, while each agent is restricted to explore only its own environment.

This experiment is conducted on PEDRA, a 3D realistic drone simulation platform (11). In this platform, each drone agent uses the picture captured by its front-facing camera as the state, which makes the state space of each environment the set of all possible pictures the drone may see. There are a total number of 25 actions, corresponding to the drone controlling its yaw and pitch by various angles. The reward received by the agent is designed to encourage the drone to stay away from obstacles.

Each agent uses a 5-layer neural network (4 convolutional layers followed by a fully connected layer) to approximate the Q function, with the weights of the conv layers pre-trained so that the output of the fourth layer approximates the depth map extracted from the state. The agents freeze the conv layers and only train the fully connected layer. This essentially means the agents use a linear function approximation with the feature vector being the output of the conv layers.

We compare the DCQL policy with policies trained independently by a single agent in the $i$-th environment, which we denote as SA-$i$. We stress that the single agents and DCQL are trained identically, with the only difference being whether agents communicate and average their parameters. In the testing phase, the performance of the policies is measured using the mean safe flight (MSF), the average distance travelled by the agent before it collides with any obstacle. The result is presented in the left half of Table 1.

Intuitively, with the feature vector being the depth map of the current state, the drone navigation problem becomes much simplified, as the agent just has to learn to steer away from the direction with the smallest depth. Indeed, it can be seen from the table that every single agent not only performs well in its own environment, but also learns to generalize to other environments it has not seen before thanks to the suitable choice of feature vectors. However, even in this setup where generalization almost comes for free, the DCQL policy still performs the best in the sense that it has the highest average MSF across all environments.

Then, we consider a more challenging problem (right half of Table 1), where we do not have a pre-trained feature extractor. Instead, each agent uses as the function approximator a neural network (same architecture as in the previous experiment) that it trains from scratch. In this case the function approximator is non-linear, which our theoretical results do not cover.

In the absence of a good feature extractor, we see that the single agent trained independently in one environment loses the ability to generalize to unseen environments, while the DCQL policy performs well across all environments. This demonstrates the value of Algorithm 1 in multi-task reinforcement learning: by communicating with others, the agent immersed in one environment can adapt to all other environments without explicitly learning in those environments. Hence, such an algorithm significantly improves the learning efficiency in applications where every agent needs to perform well in a number of environments (or a large, heterogeneous environment) that have been collectively explored by a network of agents.

## 5 Conclusion

In this work, we study the finite-time convergence of DCSA over a network of agents, where the data at each agent is generated from a Markov process. We remove the restrictive assumptions on bounded updates that are usually made in the existing literature. We discuss three problems in multi-task and distributed learning that can be viewed as special cases of the DCSA framework. Finally, we verify the effectiveness of the DCSA framework by solving two multi-task Q learning problems experimentally. Future directions from this work include studying the DCSA framework under time-varying communication graphs, communication constraints, and/or asynchronous communication.

Finite-Time Analysis of Decentralized Stochastic Approximation with Applications in Multi-Agent and Multi-Task Learning
Supplementary Materials

## Appendix A Proofs

We use the following notations throughout the proofs.

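$$
\theta \triangleq [\theta_1^\top, \theta_2^\top, \dots, \theta_N^\top]^\top \in \mathbb{R}^{Nd}, \quad X \triangleq [X_1, X_2, \dots, X_N], \quad \text{and} \quad F(X;\theta) \triangleq \begin{pmatrix} F_1(X_1,\theta_1) \\ F_2(X_2,\theta_2) \\ \vdots \\ F_N(X_N,\theta_N) \end{pmatrix} \in \mathbb{R}^{Nd}.
$$

We use $I$ and $\mathbf{1}$ to denote the identity matrix and the all-one vector of appropriate dimensions as implied in the context.

### A.1 Proof of Theorem 1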

###### Theorem 1.

Suppose that Assumptions 1-4 hold. Let the step size be a constant $\epsilon$, small enough that

$$
\epsilon\,\tau(\epsilon) \leq \min\left\{ \frac{1}{6NB}, \ \frac{\alpha(1-\sigma_2^2)}{45N + 324NB^2} \right\}. \tag{15}
$$

Then the sequence generated by Algorithm 1 satisfies

 E[∥θki−θ∗∥2] ≤(1−αϵ4)k−τ(ϵ)((64Bϵ9(1−σ22)+23N2)∥¯θ0∥2+(4+16Bϵ9N(1−σ22)+23N2)∥θ0−(1⊗I)¯θ0∥2 +23N2+13C2ϵ3NB)+8CC1+40CC2αϵlog(1ϵ),

where the constants

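$$
\begin{aligned}
C_1 &= \Big( 60NB^2 + \tfrac{45N}{2} + 90NBL + 6B^2 \Big) \|\theta^*\|^2 + \Big( 60NB^2 + \tfrac{45N}{2} + 45NBL + 6B^2 \Big), \quad \text{and} \\
C_2 &= \frac{8NB^2}{1-\sigma_2^2} \big( \|\theta^*\|^2 + 1 \big),
\end{aligned}
$$

and $C$ is defined in Lemma 1 below.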

Proof: We first introduce four technical lemmas.

###### Lemma 1.

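Under Assumptions 2 and 3, there exists a constant $C$ such that, given any $\epsilon > 0$, we have for all $i$, all $\theta$, and all $k \geq \tau(\epsilon)$

$$
\left\| \mathbb{E}[F_i(X_i^k,\theta) \mid X_i^0] - \bar{F}_i(\theta) \right\| \leq \epsilon(\|\theta\| + 1),
$$

where $\tau(\epsilon) \leq C\log(1/\epsilon)$.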

###### Lemma 2.

Let . Then under Assumption 3, we have for all and

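$$
\|F_i(X_i,\theta)\| \leq B(\|\theta\| + 1), \quad \text{and} \quad \|\bar{F}_i(\theta)\| \leq B(\|\theta\| + 1), \quad \forall \theta \in \mathbb{R}^d.
$$

###### Lemma 3.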

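Under Assumptions 1-3, we have for all $k \geq \tau(\epsilon)$

$$
\begin{aligned}
\|\bar{\theta}^k - \bar{\theta}^{k-\tau(\epsilon)}\|
&\leq 3\epsilon B\tau(\epsilon) \Big( \|\bar{\theta}^k\| + \tfrac{1}{N} \|\theta^{k-\tau(\epsilon)} - (\mathbf{1} \otimes I)\bar{\theta}^{k-\tau(\epsilon)}\| + 1 \Big) \\
&\leq \tfrac{1}{2N} \|\bar{\theta}^k\| + \tfrac{1}{2N} \|\theta^{k-\tau(\epsilon)} - (\mathbf{1} \otimes I)\bar{\theta}^{k-\tau(\epsilon)}\| + \tfrac{1}{2N},
\end{aligned}
$$

and

$$
\begin{aligned}
\|\bar{\theta}^k - \bar{\theta}^{k-\tau(\epsilon)}\|
&\leq 2\epsilon B\tau(\epsilon) \Big( \|\bar{\theta}^{k-\tau(\epsilon)}\| + \tfrac{1}{N} \|\theta^{k-\tau(\epsilon)} - (\mathbf{1} \otimes I)\bar{\theta}^{k-\tau(\epsilon)}\| + 1 \Big) \\
&\leq \tfrac{1}{3N} \|\bar{\theta}^{k-\tau(\epsilon)}\| + \tfrac{1}{3N} \|\theta^{k-\tau(\epsilon)} - (\mathbf{1} \otimes I)\bar{\theta}^{k-\tau(\epsilon)}\| + \tfrac{1}{3N}.
\end{aligned}
$$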

For all , we have

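$$
\|\bar{\theta}^k - \bar{\theta}^0\| \leq \tfrac{1}{3N} \|\bar{\theta}^0\| + \tfrac{1}{3N} \|\theta^0 - (\mathbf{1} \otimes I)\bar{\theta}^0\| + \tfrac{1}{3N}.
$$

###### Lemma 4.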

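Under Assumptions 1-3, we have for all $k \geq \tau(\epsilon)$

$$
\begin{aligned}
\mathbb{E}\Big[ \Big\langle \bar{\theta}^k - \theta^*, \sum_{i=1}^{N} \big( F_i(X_i^k,\theta_i^k) - \bar{F}_i(\bar{\theta}^k) \big) \Big\rangle \Big]
&\leq \tfrac{\alpha}{2}\, \mathbb{E}[\|\bar{\theta}^k - \theta^*\|^2] + \Big( \tfrac{45N}{4} + 30NB^2 + 48NBL \Big) \epsilon\tau(\epsilon)\, \mathbb{E}[\|\bar{\theta}^k - \theta^*\|^2] \\
&\quad + \epsilon\tau(\epsilon) \Big( \big( 30NB^2 + \tfrac{45N}{4} + 45NBL \big) \|\theta^*\|^2 + \big( 30NB^2 + \tfrac{45N}{4} + \tfrac{45NBL}{2} \big) \Big) \\
&\quad + \Big( 5B + \tfrac{5}{12}NB + \tfrac{4NL^2}{\alpha} + 5L \Big)\, \mathbb{E}[\|\theta^{k-\tau(\epsilon)} - (\mathbf{1} \otimes I)\bar{\theta}^{k-\tau(\epsilon)}\|^2].
\end{aligned}
$$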

We defer the proofs of lemmas to Section A.2-5 and focus on proving the main theorem now.

The more precise form of the step size we need in the following analysis is

 ϵτ(ϵ)≤ min⎧⎪⎨⎪⎩1−σ222α3+20B+53NB+8NL2α+10L+BN2,α45N+120NB2+192NBL+12B2,16NB,α(1−σ22)192NB2⎫⎪⎬⎪⎭,

which is guaranteed if the step size satisfies the more strict condition (15).

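$$
\begin{aligned}
\|\bar{\theta}^{k+1} - \theta^*\|^2
&\leq \Big\| \bar{\theta}^k + \frac{\epsilon}{N} \sum_{i=1}^{N} F_i(X_i^k,\theta_i^k) - \theta^* \Big\|^2 \\
&\leq \|\bar{\theta}^k - \theta^*\|^2 + \Big\| \frac{\epsilon}{N} \sum_{i=1}^{N} F_i(X_i^k,\theta_i^k) \Big\|^2 + 2\epsilon \Big\langle \bar{\theta}^k - \theta^*, \sum_{i=1}^{N} \bar{F}_i(\bar{\theta}^k) \Big\rangle \\
&\quad + 2\epsilon \Big\langle \bar{\theta}^k - \theta^*, \sum_{i=1}^{N} \big( F_i(X_i^k,\theta_i^k) - \bar{F}_i(\bar{\theta}^k) \big) \Big\rangle \\
&\leq (1 - 2\alpha\epsilon) \|\bar{\theta}^k - \theta^*\|^2 + \Big\| \frac{\epsilon}{N} \sum_{i=1}^{N} F_i(X_i^k,\theta_i^k) \Big\|^2 + 2\epsilon \Big\langle \bar{\theta}^k - \theta^*, \sum_{i=1}^{N} \big( F_i(X_i^k,\theta_i^k) - \bar{F}_i(\bar{\theta}^k) \big) \Big\rangle,
\end{aligned}
\tag{16}
$$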

where in the last inequality we used Assumption 4. We can bound the second term

 ∥ϵNN∑i=1Fi(Xki,θki)∥2 ≤ϵ2NN∑i=1B2(∥θki∥+1)2 ≤ϵ2B2NN∑i=1(3∥¯θk∥2+3∥θki−¯θk∥2+3) ≤3