
# Decentralized Consensus Algorithm with Delayed and Stochastic Gradients

## Abstract.

We analyze the convergence of a decentralized consensus algorithm with delayed gradient information across the network. The nodes in the network privately hold parts of the objective function and collaboratively solve for the consensus optimal solution of the total objective while they can only communicate with their immediate neighbors. In real-world networks, it is often difficult and sometimes impossible to synchronize the nodes, and therefore they have to use stale gradient information during computations. We show that, as long as the random delays are bounded in expectation and a proper diminishing step size policy is employed, the iterates generated by the decentralized gradient descent method converge to a consensual optimal solution. Convergence rates for both the objective and the consensus are derived. Numerical results on a number of synthetic problems and real-world seismic tomography datasets in decentralized sensor networks are presented to demonstrate the performance of the method.

###### 2000 Mathematics Subject Classification:
65K05, 90C25, 65Y05.
B. Sirb and X. Ye are with the Department of Mathematics and Statistics, Georgia State University, Atlanta, Georgia 30303, USA. E-mail: bsirb1@student.gsu.edu, xye@gsu.edu.

## 1. Introduction

In this paper, we consider a decentralized consensus optimization problem arising from emerging technologies such as distributed machine learning [3, 10, 16, 19], sensor networks [13, 29, 35], and smart grids [11, 21]. Given a network G = (V, E), V is the node (also called agent, processor, or sensor) set and E is the edge set. Two nodes i and j are called neighbors if (i, j) ∈ E. The communications between neighbor nodes are bidirectional, meaning that i and j can communicate with each other as long as (i, j) ∈ E.

In a decentralized sensor network G = (V, E), individual nodes can acquire, store, and process data about large-sized objects. Each node i collects data and holds its objective function F_i(x; ξ_i) privately, where ξ_i is random with a fixed but unknown probability distribution to model environmental fluctuations such as noise in data acquisition and/or inaccurate estimation of the objective function or its gradient. Here x is the unknown (e.g., the seismic image) to be solved for, where the domain X is compact and convex. Furthermore, we assume that F_i(x; ξ_i) is convex with respect to x for all i and ξ_i, and we define f_i(x) := E_{ξ_i}[F_i(x; ξ_i)], which is convex with respect to x. The goal of decentralized consensus optimization is to solve the minimization problem

 (1) minimize_{x ∈ X} f(x), where f(x) := ∑_{i=1}^m f_i(x)

with the restrictions that F_i, and hence f_i, are accessible by node i only, and that nodes i and j can communicate only if (i, j) ∈ E during the entire computation.

There are a number of practical issues that need to be taken into consideration in solving the real-world decentralized consensus optimization problem (1):

• The partial objective f_i (and F_i) is held privately by node i, and transferring the local data to a data fusion center is either infeasible or cost-ineffective due to data privacy, the large size of the data, and/or the limited bandwidth and communication power overhead of sensors. Therefore, the nodes can only communicate their own estimates of x with their neighbors in each iteration of a decentralized consensus algorithm.

• Since it is often difficult and sometimes impossible for the nodes to be fully synchronized, they may not have access to the most up-to-date (stochastic) gradient information during computations. In this case, node i has to use the out-of-date (stochastic) gradient g_i(t − τ_i(t)), where x_i(t) is the estimate of x obtained by node i at iteration t, and τ_i(t) is the level of (possibly random) delay of the gradient information at node i in iteration t.

• The estimates x_i(t) by the nodes should tend to be consensual as t increases, and the consensual value is a solution of problem (1). In this case, there is a guarantee of retrieving a good estimate of x from any surviving node in the network even if some nodes are sabotaged, lost, or run out of power during the computation process.

In this paper, we analyze a decentralized consensus algorithm which takes all the factors above into consideration in solving (1). We provide a comprehensive convergence analysis of the algorithm, including the decay rates of the objective function and the disagreements between nodes, in terms of iteration number, level of delays, and network structure.

### 1.1. Related work

Distributed computing on networks is an emerging technology with extensive applications in modern machine learning [10, 16, 19], sensor networks [13, 29, 45, 46], and big data analysis [4, 30]. There are two types of scenarios in distributed computing: centralized and decentralized. In the centralized scenario, computations are carried out locally by worker (slave) nodes, while computations of certain global variables must eventually be processed by a designated master node or at a center of shared memory during each (outer) iteration. A major effort in this scenario has been devoted to updating the global variable more effectively using an asynchronous setting in, for example, distributed centralized alternating direction method of multipliers (ADMM) [5, 7, 20, 40, 43]. In the decentralized scenario considered in this paper, the nodes privately hold parts of the objective function and can only communicate with neighbor nodes during computations. In many real-world applications, decentralized computing is particularly useful when a master-worker network setting is either infeasible or not economical, or the data acquisition and computation have to be carried out by individual nodes which then need to collaboratively solve the optimization problem. Decentralized networks are also more robust to node failure and can better address privacy concerns. For more discussions about motivations and advantages of decentralized computing, see, e.g., [15, 26, 28, 33, 37, 38] and references therein.

Decentralized consensus algorithms take the data distribution and communication restrictions into consideration, so that they can be implemented at individual nodes in the network. In the ideal synchronous case of decentralized consensus, where all the nodes are coordinated to finish computation and then start to exchange information with neighbors in each iteration, a number of developments have been made. One class of methods is to rewrite the consensus constraints for the minimization problem (1) by introducing auxiliary variables between neighbor nodes (i.e., edges), and apply ADMM (possibly with linearization or preconditioning techniques) to derive an implementable decentralized consensus algorithm [6, 12, 14, 22, 34]. Most of these methods require each node to solve a local optimization problem in every iteration before communication, and reach a sublinear convergence rate in terms of the outer iteration (communication) number for general convex objective functions f_i. First-order methods based on decentralized gradient descent require less computational cost at individual nodes: between two communications they only perform one step of a gradient descent-type update at the weighted average of previous iterates obtained from neighbors. In particular, Nesterov's optimal gradient scheme is employed in decentralized gradient descent with diminishing step sizes in [15], where an alternative gradient method that requires excessive communications in each inner iteration is also developed and attains a faster theoretical convergence rate, although it seems to work less efficiently in terms of communications than the former in practice. A correction technique is developed for decentralized gradient descent with constant step size in [33], which results in a saddle-point algorithm as pointed out in [23].
In [46], the authors combine Nesterov's gradient scheme and a multiplier-type auxiliary variable to obtain a fast optimality convergence rate. Other first-order decentralized methods have also been developed recently, such as dual averaging [8]. Additional constraints on the primal variables in decentralized consensus optimization (1) are considered in [42].

In real-world decentralized computing, it is often difficult and sometimes impossible to coordinate all the nodes in the network such that their computation and communication are perfectly synchronized. One practical approach for such asynchronous consensus is using a broadcast scenario where in each (outer) iteration, one node in the network is assumed to wake up at random and broadcast its value to neighbors (but does not hear back from them). A number of algorithms for broadcast consensus are developed, for instance, in [2, 13, 24, 25]. Another important issue in the asynchronous setting is that nodes may have to use out-of-date (stale) gradient information during updates [27]. This delayed scenario in gradient descent is considered in a distributed but not decentralized setting in [1, 18, 36, 44]. In addition, analyses of stochastic gradients in distributed computing are also carried out in [1, 32]. In [9], a linear convergence rate of optimality is derived for strongly convex objective functions with delays. Extending [1], a fixed delay at all nodes is considered in dual averaging [17] and gradient descent [39] in a decentralized setting, but these works do not consider the more practical and useful case of random delays, and they provide no convergence rates on node consensus.

### 1.2. Contributions

The contribution of this paper is threefold.

First, we consider a general decentralized consensus algorithm with randomly delayed and stochastic gradients (Section 2). In this case, the nodes need not be synchronized and may only have access to stale gradient information. This results in stochastic gradients with different random delays at different nodes in their gradient updates, which suits many real-world decentralized computing applications.

Second, we provide a comprehensive convergence analysis of the proposed algorithm (Section 3). More precisely, we derive convergence rates for both the objective function (optimality) and the disagreement (feasibility of the consensus constraint), and show their dependence on the characteristics of the problem, such as the Lipschitz constants of the (stochastic) gradients and the spectral gap of the underlying network.

Third, we conduct a number of numerical experiments on synthetic and real datasets to validate the performance of the proposed algorithm (Section 4). In particular, we examine the convergence on synthetic decentralized least squares, robust least squares, and logistic regression problems. We also present the numerical results on the reconstruction of several seismic images in decentralized wireless sensor networks.

### 1.3. Notations and assumptions

In this paper, all vectors are column vectors unless otherwise noted. We denote by x_i(t) ∈ ℝ^n the estimate of node i at iteration t, and x(t) := (x_1(t), …, x_m(t))^⊤ ∈ ℝ^{m×n}. We denote by ∥x∥ the 2-norm if x is a vector and the Frobenius norm if x is a matrix, which should be clear from the context. For any two vectors x, y of the same dimension, ⟨x, y⟩ denotes their inner product, and ∥x∥_A² := ⟨x, Ax⟩ for a symmetric nonnegative definite matrix A. For notation simplicity, we use ⟨x, y⟩ := ∑_{i=1}^m ⟨x_i, y_i⟩ where x_i and y_i are the i-th rows of the matrices x and y, respectively. Such matrix inner product is also generalized to the weighted norm ∥x∥_A² := ⟨x, Ax⟩ for a matrix x and a symmetric nonnegative definite matrix A. In this paper, we set the domain X := {x ∈ ℝ^n : ∥x∥_∞ ≤ R} for some R > 0, which can be thought of as the maximum pixel intensity in reconstructed images, for instance. We further denote 𝒳 := X^m = {x ∈ ℝ^{m×n} : x_i ∈ X, i = 1, …, m}.

For each node i, we define f_i(x) := E_{ξ_i}[F_i(x; ξ_i)] as the expectation of the objective function, and g_i(t) := ∇F_i(x_i(t); ξ_i(t)) (here the gradient is taken with respect to x) is the stochastic gradient at x_i(t) at node i. We let τ_i(t) be the delay of the gradient at node i in iteration t, and τ(t) := (τ_1(t), …, τ_m(t)). We write g(t − τ(t)) in short for (g_1(t − τ_1(t)), …, g_m(t − τ_m(t)))^⊤, ∇f(x(t)) for (∇f_1(x_1(t)), …, ∇f_m(x_m(t)))^⊤, and ∇f(x(t − τ(t))) for (∇f_1(x_1(t − τ_1(t))), …, ∇f_m(x_m(t − τ_m(t))))^⊤. We assume each f_i is continuously differentiable, ∇f_i has Lipschitz constant L_i, and denote L := max_{1≤i≤m} L_i. Let x* be a solution of (1). Since x* is consensual, we also denote by x* the matrix in 𝒳 whose rows are all equal to x*, which is clear from the context, for instance in ∥x(t) − x*∥. Furthermore, we let y(T) := (1/T) ∑_{t=1}^T x(t+1) be the running average of x(t), and z(t) := (1/m) ∑_{i=1}^m x_i(t) be the consensus average of x_i(t). We denote x̄(t) := Jx(t), where J := (1/m)11^⊤ ∈ ℝ^{m×m}; then x̄(t) = (z(t), …, z(t))^⊤. Note that for all t, x̄(t) is always consensual but x(t) may not be.
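To make the matrix notation concrete, the following NumPy sketch (the variable names and data are ours, purely for illustration) forms the averaging matrix J = (1/m)11^⊤ and checks that Jx stacks the consensus average z(t) in every row, while (I − J)x measures the disagreement:

```python
import numpy as np

m, n = 5, 3                      # m nodes, each holding an estimate in R^n
rng = np.random.default_rng(0)
x = rng.standard_normal((m, n))  # row i plays the role of node i's estimate x_i(t)

J = np.ones((m, m)) / m          # averaging matrix J = (1/m) 1 1^T
xbar = J @ x                     # every row of J x equals the consensus average

z = x.mean(axis=0)               # consensus average z(t) = (1/m) sum_i x_i(t)
assert np.allclose(xbar, np.tile(z, (m, 1)))

# (I - J) x vanishes exactly when x is consensual
disagreement = np.linalg.norm((np.eye(m) - J) @ x)
```

Note that x̄ = Jx is always consensual (its disagreement is zero), whereas a generic x is not.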

Suppose h is a continuously differentiable convex function; then for any x, y we denote the Bregman distance (divergence) between x and y (order matters) by

 (2) D_h(x, y) := h(x) − h(y) − ⟨∇h(y), x − y⟩.

If in addition ∇h is L_h-Lipschitz continuous, then we can verify that, for any x, y, z, w, there is

 (3) ⟨∇h(z) − ∇h(w), x − y⟩ = D_h(y, z) − D_h(x, z) − D_h(y, w) + D_h(x, w) ≤ D_h(y, z) − D_h(x, z) + (L_h/2)∥x − w∥²

where we used the facts that D_h(y, w) ≥ 0 and D_h(x, w) ≤ (L_h/2)∥x − w∥².
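The four-point identity and the upper bound in (3) can be checked numerically. Below is a small sketch (our own test function, not from the paper) using the smooth convex function h(x) = ½x^⊤Ax with A positive semidefinite, whose gradient is Ax and whose Lipschitz constant is the largest eigenvalue of A:

```python
import numpy as np

rng = np.random.default_rng(1)

# Smooth convex test function h(x) = 0.5 x^T A x, A symmetric PSD
B = rng.standard_normal((4, 4))
A = B @ B.T
h = lambda x: 0.5 * x @ A @ x
grad_h = lambda x: A @ x
L_h = np.linalg.eigvalsh(A).max()   # Lipschitz constant of grad h

def D(x, y):
    """Bregman distance D_h(x, y) = h(x) - h(y) - <grad h(y), x - y>."""
    return h(x) - h(y) - grad_h(y) @ (x - y)

x, y, z, w = (rng.standard_normal(4) for _ in range(4))

lhs = (grad_h(z) - grad_h(w)) @ (x - y)
identity = D(y, z) - D(x, z) - D(y, w) + D(x, w)   # four-point identity in (3)
upper = D(y, z) - D(x, z) + 0.5 * L_h * np.linalg.norm(x - w) ** 2

assert np.isclose(lhs, identity)   # the equality in (3)
assert lhs <= upper + 1e-9         # the inequality in (3)
```

The equality part of (3) holds for any differentiable h; only the inequality uses convexity and the Lipschitz bound.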

An important ingredient in decentralized gradient descent is the mixing matrix W = [w_{ij}] ∈ ℝ^{m×m} in (4). For the algorithm to be implementable in practice, w_{ij} ≠ 0 if and only if (i, j) ∈ E or i = j. In this paper, we assume that W is symmetric and ∑_{j=1}^m w_{ij} = 1 for all i, hence W is doubly stochastic, namely W1 = 1 and 1^⊤W = 1^⊤, where 1 := (1, …, 1)^⊤ ∈ ℝ^m. With the assumption that the network G is simple and connected, we know that the eigenvalue 1 of W has multiplicity 1 by the Perron–Frobenius theorem. As a consequence, Wx = x if and only if x is consensual, i.e., x_1 = ⋯ = x_m. We further assume that W is nonnegative definite (otherwise use (I + W)/2 in place of W, since the stochastic matrix W has spectral radius 1). Given a network G, there are different ways to design the mixing matrix W. For some optimal choices of W, see, e.g., [31, 41].
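One common choice satisfying these requirements (not necessarily the optimal one from [31, 41]) is the Metropolis-type weighting, sketched below for a small cycle graph; the function name and the example network are our own:

```python
import numpy as np

def metropolis_weights(edges, m):
    """Symmetric doubly stochastic mixing matrix for an undirected simple graph."""
    W = np.zeros((m, m))
    deg = np.zeros(m, dtype=int)
    for i, j in edges:
        deg[i] += 1
        deg[j] += 1
    for i, j in edges:
        W[i, j] = W[j, i] = 1.0 / (1 + max(deg[i], deg[j]))
    np.fill_diagonal(W, 1.0 - W.sum(axis=1))   # rows sum to one
    return W

# 5-node cycle: w_ij is nonzero only for neighbors or i == j
edges = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 0)]
W = metropolis_weights(edges, 5)

assert np.allclose(W, W.T)              # symmetric
assert np.allclose(W.sum(axis=1), 1.0)  # doubly stochastic
# for a connected graph, the eigenvalue 1 is simple and all other
# eigenvalues have magnitude strictly below 1
ev = np.sort(np.abs(np.linalg.eigvalsh(W)))
```

If some eigenvalue of W is negative, the substitution (I + W)/2 mentioned above makes the matrix nonnegative definite while preserving symmetry, double stochasticity, and the sparsity pattern (plus the diagonal).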

Now we make several mild assumptions that are necessary in our convergence analysis.

1. The network G is undirected, simple, and connected.

2. The stochastic gradient is unbiased, i.e., E_{ξ_i}[∇F_i(x; ξ_i)] = ∇f_i(x) for all i and x ∈ X. Moreover, for all i, x ∈ X, and ξ_i, there are E[∥∇F_i(x; ξ_i) − ∇f_i(x)∥²] ≤ σ² for some σ > 0, and ∥∇f_i(x)∥ ≤ M for some M > 0.

3. The delays τ_i(t) may follow different distributions at different nodes, but their second moments are assumed to be uniformly bounded, i.e., there exists B > 0 such that E[τ_i(t)²] ≤ B² for all nodes i and iterations t. For each node i, we assume each update happens once, i.e., t − τ_i(t) is strictly increasing as t increases.

It is worth pointing out that these assumptions are rather standard and easy to satisfy in practice. For instance, the boundedness of ∥∇f_i(x)∥ is a consequence of the compactness of the domain X and the Lipschitz continuity of f_i. The assumption on random delays in a distributed system is also used in [1]. We further assume that the stochastic error and the random delay are independent.
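As a toy illustration of the bounded-second-moment delay assumption (the distribution choice here is ours, not the paper's), delays drawn from a shifted geometric distribution have an explicitly computable second moment, so the bound E[τ_i(t)²] ≤ B² holds with a known B:

```python
import numpy as np

rng = np.random.default_rng(2)
p = 0.5                                # per-iteration "arrival" probability
# shifted geometric delay tau in {0, 1, 2, ...}: P(tau = k) = p (1 - p)^k
tau = rng.geometric(p, size=200_000) - 1

second_moment = (tau.astype(float) ** 2).mean()
B_sq = (1 - p) * (2 - p) / p ** 2      # exact E[tau^2]; equals 3 for p = 0.5
assert abs(second_moment - B_sq) < 0.1
```

Any delay distribution with uniformly bounded second moment, including node-dependent ones, fits the assumption; a deterministic bounded delay is the special case τ_i(t) ≤ B.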

## 2. Algorithm

Taking into account the delayed stochastic gradients and the constraint that nodes can only communicate with immediate neighbors, we propose the following decentralized delayed stochastic gradient descent method for solving (1). Starting from an initial guess x_i(0), each node i performs the following update iteratively:

 (4) x_i(t+1) = Π_X[∑_{j=1}^m w_{ij} x_j(t) − α(t) g_i(t − τ_i(t))].

Namely, in each iteration t, the nodes exchange their most recent estimates x_i(t) with their neighbors. Then each node takes the weighted average of the received local copies using weights w_{ij}, performs a gradient descent type update using a stochastic gradient with delay τ_i(t) and step size α(t), and projects the result onto X.

Following the matrix notation in Section 1.3, the iteration (4) can be written as

 (5) x(t+1) = Π_𝒳[W x(t) − α(t) g(t − τ(t))].

Here the projection Π_𝒳 is accomplished by each node i projecting x_i onto X due to the definition of 𝒳 in Section 1.3, which does not require any coordination between nodes. Note that the update (5) is also equivalent to

 (6) x(t+1) = argmin_{x ∈ 𝒳} {⟨g(t − τ(t)), x⟩ + (1/(2α(t)))∥x − W x(t)∥²}.

In this paper, we may refer to the proposed decentralized delayed stochastic gradient descent algorithm by any of (4), (5), and (6) since they are equivalent.
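To make the update (4) concrete, here is a minimal simulation on a toy decentralized least-squares problem. The problem data, network, delay model, noise level, and step size constants are our own illustrative choices, not the paper's experimental setup:

```python
import numpy as np

rng = np.random.default_rng(3)
m, n, T, R = 5, 3, 2000, 10.0

# Node i privately holds f_i(x) = 0.5 ||A_i x - b_i||^2; all nodes share a
# common minimizer x_true so the consensus solution is easy to check.
x_true = rng.uniform(-1, 1, n)
A = rng.standard_normal((m, 8, n))
b = np.einsum('ijk,k->ij', A, x_true)

# Simple mixing matrix on a 5-node cycle (symmetric, doubly stochastic).
W = np.zeros((m, m))
for i in range(m):
    W[i, i] = W[i, (i + 1) % m] = W[i, (i - 1) % m] = 1 / 3

x = np.zeros((m, n))
hist = [x.copy()]                 # past iterates, needed for stale gradients
for t in range(1, T + 1):
    alpha = 0.05 / np.sqrt(t)     # diminishing step size
    x_new = np.empty_like(x)
    for i in range(m):
        tau = min(int(rng.geometric(0.5) - 1), t - 1)   # random delay
        x_old = hist[-1 - tau][i]                       # stale local iterate
        g = A[i].T @ (A[i] @ x_old - b[i])              # delayed gradient
        g += 0.1 * rng.standard_normal(n)               # stochastic noise
        x_new[i] = np.clip(W[i] @ x - alpha * g, -R, R) # mix, descend, project
    x = x_new
    hist.append(x.copy())

disagreement = np.linalg.norm(x - x.mean(axis=0))
err = np.linalg.norm(x.mean(axis=0) - x_true)
```

Despite the random delays and gradient noise, the nodes approach consensus at a common minimizer, in line with the analysis of Section 3.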

## 3. Convergence Analysis

In this section, we provide a comprehensive convergence analysis of the proposed algorithm (6) by employing a proper step size policy. In particular, we derive convergence rates for the objective function and disagreement in that order.

###### Lemma 1.

Let {x(t)} be the iterates generated by Algorithm (5); then the following inequality holds for all T ≥ 1:

 (7) ∑_{t=1}^T ⟨∇f(x(t)) − ∇f(x(t−τ(t))), x(t+1) − x*⟩ ≤ 2mnLR²(1 + 2B²) + (L(B+1)²/2) ∑_{t=1}^T ∥x(t+1) − x(t)∥².
###### Proof.

We first observe that

 (8) ⟨∇f(x(t)) − ∇f(x(t−τ(t))), x(t+1) − x*⟩ = ∑_{i=1}^m ⟨∇f_i(x_i(t)) − ∇f_i(x_i(t−τ_i(t))), x_i(t+1) − x*⟩ ≤ ∑_{i=1}^m [D_{f_i}(x*, x_i(t)) − D_{f_i}(x*, x_i(t−τ_i(t))) + (L/2)∥x_i(t+1) − x_i(t−τ_i(t))∥²]

where we applied (3) to get the inequality. We further note that the convexity of ∥·∥² implies

 (9) ∥x_i(t+1) − x_i(t−τ_i(t))∥² ≤ (τ_i(t) + 1) ∑_{s=0}^{τ_i(t)} ∥x_i(t−s+1) − x_i(t−s)∥².

Combining (8) and (9), and taking the sum over t from 1 to T, we obtain

 (10) ∑_{t=1}^T ⟨∇f(x(t)) − ∇f(x(t−τ(t))), x(t+1) − x*⟩ ≤ ∑_{i=1}^m [∑_{t=1}^T (D_{f_i}(x*, x_i(t)) − D_{f_i}(x*, x_i(t−τ_i(t)))) + (L/2) ∑_{t=1}^T (τ_i(t) + 1) ∑_{s=0}^{τ_i(t)} ∥x_i(t−s+1) − x_i(t−s)∥²].

For each i, the sum of the Bregman distance terms for t from 1 to T above telescopes, leaving only those terms whose gradients are not received by the gradient procedure within T iterations, namely

 (11) ∑_{t=1}^T (D_{f_i}(x*, x_i(t)) − D_{f_i}(x*, x_i(t−τ_i(t)))) = ∑_{t ∈ S_i(T)} D_{f_i}(x*, x_i(t))

where S_i(T) ⊂ {1, …, T} is the set of iterations whose Bregman distance terms are not canceled in the telescoping sum. Then by Chebyshev's inequality, we can bound the expected cardinality of S_i(T) by

 (12) E[|S_i(T)|] = ∑_{t=1}^T P(τ_i(T) > T − t) ≤ 1 + ∑_{t=1}^{T−1} B²/(T − t)² ≤ 1 + 2B²,

where we used the fact that P(τ_i(T) > T − t) ≤ E[τ_i(T)²]/(T − t)² ≤ B²/(T − t)². Combining (11) and (12), and using the fact that D_{f_i}(x*, x_i(t)) ≤ 2nLR² for all i and t, we obtain

 (13) ∑_{i=1}^m ∑_{t=1}^T (D_{f_i}(x*, x_i(t)) − D_{f_i}(x*, x_i(t−τ_i(t)))) ≤ 2mnLR²(1 + 2B²).

For each i, the second sum over t from 1 to T on the right side of (10) satisfies

 (14) ∑_{t=1}^T (τ_i(t) + 1) ∑_{s=0}^{τ_i(t)} ∥x_i(t−s+1) − x_i(t−s)∥² ≤ ∑_{t=1}^T N_i(t, T)∥x_i(t+1) − x_i(t)∥²

where the coefficient N_i(t, T) is defined by

 (15) N_i(t, T) := ∑_{{t ≤ s ≤ T : 0 ≤ s − τ_i(s) ≤ t}} (τ_i(s) + 1).

Therefore, we have for each t that

 (16) E[N_i(t, T)] = E[∑_{{t ≤ s ≤ T : 0 ≤ s − τ_i(s) ≤ t}} (τ_i(s) + 1)] = ∑_{s=t}^T ∑_{k=s−t}^s (k + 1) P(τ_i(s) = k) ≤ ∑_{k=0}^T (k + 1)² P(τ_i(s) = k) ≤ E[|τ_i(s) + 1|²] ≤ (B + 1)²

where the first inequality is obtained by listing each possible value k of τ_i(s) in the double sum and upper bounding the number of its occurrences by k + 1, and the last inequality is due to E[τ_i(s)²] ≤ B² and E[τ_i(s)] ≤ B. Therefore, (14) can be bounded by

 (17) ∑_{i=1}^m ∑_{t=1}^T (τ_i(t) + 1) ∑_{s=0}^{τ_i(t)} ∥x_i(t−s+1) − x_i(t−s)∥² ≤ (B + 1)² ∑_{t=1}^T ∥x(t+1) − x(t)∥².

Applying (13) and (17) to (10) completes the proof. ∎

###### Theorem 2.

Let {x(t)} be the iterates generated by Algorithm (5) with step size α(t) = [2(L + η(t))]^{−1}, where η(t) is a nondecreasing function of t; then

 (18) E[f(y(T)) − f(x*)] ≤ (2mnR²[4L + 2η(1) + 2η(T) + L(1 + 2B²)])/T + (2mσ²/T) ∑_{t=1}^T 1/η(t) + (L(B+1)²/(2T)) ∑_{t=1}^T E[∥x(t+1) − x(t)∥²]

where y(T) := (1/T) ∑_{t=1}^T x(t+1) is the running average of x(t).

###### Proof.

We first note that there is

 (19) f(x(t+1)) − f(x*) = ∑_{i=1}^m (f_i(x_i(t+1)) − f_i(x*)) = ∑_{i=1}^m [f_i(x_i(t+1)) − f_i(x_i(t)) + f_i(x_i(t)) − f_i(x*)] ≤ ∑_{i=1}^m [⟨∇f_i(x_i(t)), x_i(t+1) − x_i(t)⟩ + (L_i/2)∥x_i(t+1) − x_i(t)∥² + ⟨∇f_i(x_i(t)), x_i(t) − x*⟩] ≤ ∑_{i=1}^m [⟨∇f_i(x_i(t)), x_i(t+1) − x*⟩ + (L_i/2)∥x_i(t+1) − x_i(t)∥²] ≤ ⟨∇f(x(t)), x(t+1) − x*⟩ + (L/2)∥x(t+1) − x(t)∥²

where we used the L_i-Lipschitz continuity of ∇f_i and the convexity of f_i to obtain the first inequality. Note that x(t+1) is obtained by (6) as

 (20) x(t+1) = argmin_{x ∈ 𝒳} {⟨g(t − τ(t)), x⟩ + (1/(2α(t)))∥x − W x(t)∥²} = argmin_{x ∈ 𝒳} {⟨g(t − τ(t)) + (1/α(t))(I − W)x(t), x⟩ + (1/(2α(t)))∥x − x(t)∥²}.

Therefore, the optimality of x(t+1) in (6) implies that

 (21) ⟨g(t − τ(t)), x(t+1) − x*⟩ ≤ −(1/α(t))⟨(I − W)x(t), x(t+1) − x*⟩ + (1/(2α(t)))[∥x* − x(t)∥² − ∥x(t+1) − x(t)∥² − ∥x* − x(t+1)∥²].

Furthermore, we note that (I − W)x* = 0 since x* is consensual, hence we have

 (22) −(1/α(t))⟨(I − W)x(t), x(t+1) − x*⟩ = −(1/α(t))⟨(I − W)(x(t) − x*), x(t+1) − x*⟩ = (1/(2α(t)))(∥x(t) − x(t+1)∥²_{I−W} − ∥x(t) − x*∥²_{I−W} − ∥x(t+1) − x*∥²_{I−W}) ≤ (1/(4α(t)))∥x(t) − x(t+1)∥²_{I−W}

where we have used the fact that

 ∥x(t) − x(t+1)∥²_{I−W} ≤ 2(∥x(t) − x*∥²_{I−W} + ∥x(t+1) − x*∥²_{I−W})

to obtain the inequality above. We also have that

 ∥x(t) − x(t+1)∥²_{I−W} ≤ ∥x(t) − x(t+1)∥²

with which we can further bound (22) as

 −(1/α(t))⟨(I − W)x(t), x(t+1) − x*⟩ ≤ (1/(4α(t)))∥x(t) − x(t+1)∥².

Now applying the inequality above and (21) to (19), and taking the sum over t from 1 to T, we get

 (23) ∑_{t=1}^T f(x(t+1)) − T f(x*) ≤ ∑_{t=1}^T [(1/(2α(t)))(∥x(t) − x*∥² − ∥x(t+1) − x*∥²) + (L/2 − 1/(4α(t)))∥x(t) − x(t+1)∥²] + ∑_{t=1}^T ⟨∇f(x(t)) − g(t − τ(t)), x(t+1) − x*⟩.

For the last term on the right-hand side of (23), we have

 (24) ∑_{t=1}^T ⟨∇f(x(t)) − g(t − τ(t)), x(t+1) − x*⟩ = ∑_{t=1}^T ⟨∇f(x(t)) − ∇f(x(t−τ(t))), x(t+1) − x*⟩ + ∑_{t=1}^T ⟨∇f(x(t−τ(t))) − g(t − τ(t)), x(t+1) − x*⟩ ≤ 2mnLR²(1 + 2B²) + (L(B+1)²/2) ∑_{t=1}^T ∥x(t+1) − x(t)∥² + ∑_{t=1}^T ⟨∇f(x(t−τ(t))) − g(t − τ(t)), x(t+1) − x*⟩

where we applied Lemma 1 to obtain the inequality.

Note that the running average y(T) satisfies f(y(T)) ≤ (1/T) ∑_{t=1}^T f(x(t+1)) due to the convexity of all f_i. Therefore, together with (23) and (24) and the definition of α(t), we have

 (25) T(f(y(T)) − f(x*)) ≤ ∑_{t=1}^T [(1/(2α(t)))(∥x(t) − x*∥² − ∥x(t+1) − x*∥²) + ((L(B+1)² − η(t))/2)∥x(t) − x(t+1)∥²] + 2mnLR²(1 + 2B²) + ∑_{t=1}^T ⟨∇f(x(t−τ(t))) − g(t − τ(t)), x(t+1) − x*⟩.

Now, by taking expectation on both sides of (25), we obtain

 (26) T E[f(y(T)) − f(x*)] ≤ ∑_{t=1}^T [(1/(2α(t)))(e(t) − e(t+1)) + ((L(B+1)² − η(t))/2) E[∥x(t) − x(t+1)∥²]] + 2mnLR²(1 + 2B²) + ∑_{t=1}^T E⟨∇f(x(t−τ(t))) − g(t − τ(t)), x(t+1) − x*⟩

where we denoted e(t) := E[∥x(t) − x*∥²] for notation simplicity.

Now we work on the last sum of inner products on the right side of (26). First we observe that

 (27) E⟨∇f(x(t−τ(t))) − g(t − τ(t)), x(t+1) − x*⟩ = E⟨∇f(x(t−τ(t))) − g(t − τ(t)), x(t) − x*⟩ + E⟨∇f(x(t−τ(t))) − g(t − τ(t)), x(t+1) − x(t)⟩.

Note that g(t − τ(t)) is used to calculate x(t+1), and hence its stochastic error is independent of x(t) − x*. Therefore, we have

 (28) E⟨∇f(x(t−τ(t))) − g(t − τ(t)), x(t) − x*⟩ = 0.

Furthermore, by Young’s inequality, we have

 (29) E⟨∇f(x(t−τ(t))) − g(t − τ(t)), x(t+1) − x(t)⟩ ≤ (2/η(t)) E[∥∇f(x(t−τ(t))) − g(t − τ(t))∥²] + (η(t)/2) E[∥x(t+1) − x(t)∥²] ≤ 2mσ²/η(t) + (η(t)/2) E[∥x(t+1) − x(t)∥²]

where we used the fact that E[∥∇f_i(x_i(t−τ_i(t))) − g_i(t−τ_i(t))∥²] ≤ σ² for all i and t. Now applying (27), (28), and (29) in (26), we have

 (30) T E[f(y(T)) − f(x*)] ≤ ∑_{t=1}^T (1/(2α(t)))(e(t) − e(t+1)) + 2mnLR²(1 + 2B²) + ∑_{t=1}^T 2mσ²/η(t) + (L(B+1)²/2) ∑_{t=1}^T E[∥x(t+1) − x(t)∥²] ≤ e(1)/(2α(1)) + ∑_{t=2}^T (e(t)/2)(1/α(t) − 1/α(t−1)) + 2mnLR²(1 + 2B²) + ∑_{t=1}^T 2mσ²/η(t) + (L(B+1)²/2) ∑_{t=1}^T E[∥x(t+1) − x(t)∥²].

Note that α(t) is nonincreasing, therefore 1/α(t) − 1/α(t−1) ≥ 0 and hence

 (31) ∑_{t=2}^T (e(t)/2)(1/α(t) − 1/α(t−1)) ≤ 2mnR² ∑_{t=2}^T (1/α(t) − 1/α(t−1)) ≤ 2mnR²/α(T)

where we used the fact that e(t) ≤ 4mnR² for all t. Applying (31) to (30) yields (18). ∎

We have shown that the running average y(T) makes the objective function decay as in (18). However, an important feature of decentralized computing is that the x_i(t) should tend to be consensual. Now we prove that consensus can be achieved by the proposed algorithm (5), and we derive the convergence rate of the disagreement under the employed step size policy.

###### Lemma 3.

For any x ∈ ℝ^{m×n}, its projection onto 𝒳 yields nonincreasing disagreement. That is,

 (32) ∥(I − J)Π_𝒳(x)∥² ≤ ∥(I − J)x∥².
###### Proof.

See Appendix A. ∎
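Lemma 3 can be sanity-checked numerically: projecting each row onto the box X coordinatewise never increases the disagreement norm. A small randomized sketch (the box radius and test sizes are our own choices):

```python
import numpy as np

rng = np.random.default_rng(4)
m, n, R = 6, 4, 1.0
J = np.ones((m, m)) / m

for _ in range(100):
    x = 5 * rng.standard_normal((m, n))     # points, many outside the box
    Px = np.clip(x, -R, R)                  # row-wise projection onto X
    before = np.linalg.norm((np.eye(m) - J) @ x)
    after = np.linalg.norm((np.eye(m) - J) @ Px)
    assert after <= before + 1e-12          # disagreement is nonincreasing
```

Intuitively, clipping is a 1-Lipschitz monotone map applied to each coordinate across nodes, and such maps cannot increase the spread of the node values around their mean.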

###### Lemma 4.

Let λ ∈ (0, 1) and c > 0, and define α(t) := c/√(t+1). Then for any t ≥ 1 there is

 (33) ∑_{s=0}^{t−1} α(s)λ^{t−s−1} ≤ (√π λ^{−2} c)/(2√(t log(λ^{−1}))) = O(1/√t)

for all t ≥ 1.

###### Proof.

See Appendix B. ∎

###### Theorem 5.

Let {x(t)} be the iterates generated by Algorithm (6) with