
# A Non-Asymptotic Analysis of Network Independence for Distributed Stochastic Gradient Descent

*This work was partially supported by the NSF under grants DMS-1664644 and CNS-1645681, by the ONR under MURI grant N00014-16-1-2832, by the NIH under grant 1UL1TR001430 to the Clinical & Translational Science Institute at Boston University, and by the Boston University Digital Health Initiative.*

Alex Olshevsky and Ioannis Ch. Paschalidis (Department of Electrical and Computer Engineering and Division of Systems Engineering, Boston University, Boston, MA; alexols@bu.edu, yannisp@bu.edu), and Shi Pu (Division of Systems Engineering, Boston University, Boston, MA; sp3dw@virginia.edu)
###### Abstract

This paper is concerned with minimizing the average of $n$ cost functions over a network in which agents may communicate and exchange information with their peers. Specifically, we consider the setting where only noisy gradient information is available. To solve the problem, we study the standard distributed stochastic gradient descent (DSGD) method and perform a non-asymptotic convergence analysis. For strongly convex and smooth objective functions, we not only show that DSGD asymptotically achieves the optimal network-independent convergence rate of centralized stochastic gradient descent (SGD), but also explicitly identify the non-asymptotic convergence rate as a function of characteristics of the objective functions and the network. Furthermore, we derive the time needed for DSGD to approach the asymptotic convergence rate, which behaves as $\mathcal{O}\left(\frac{n}{(1-\rho_w)^2}\right)$, where $1-\rho_w$ denotes the spectral gap of the mixing matrix of the communicating agents.

Key words. distributed optimization, convex optimization, stochastic programming, stochastic gradient descent

AMS subject classifications. 90C15, 90C25, 68Q25

## 1 Introduction

In this paper, we consider the distributed optimization problem where a group of $n$ agents collaboratively seek an $x\in\mathbb{R}^p$ that minimizes the average of $n$ cost functions:

$$\min_{x\in\mathbb{R}^p} f(x)\left(=\frac{1}{n}\sum_{i=1}^{n}f_i(x)\right). \tag{1.1}$$

Each local cost function $f_i$ is known to agent $i$ only, and all the agents communicate and exchange information over a network. Problems in the form of (1.1) find applications in multi-agent target seeking [31, 8], distributed machine learning [13, 24, 10, 2, 45, 1, 4], and wireless networks [9, 20, 2], among other scenarios.

In order to solve (1.1), we assume each agent $i$ is able to obtain noisy gradient samples $g_i(x,\xi_i)$ satisfying the following assumption:

###### Assumption 1.1.

For all $i\in\{1,2,\ldots,n\}$ and all $k\ge 0$, each random vector $\xi_i(k)$ is independent, and

$$\mathbb{E}_{\xi_i}\left[g_i(x,\xi_i)\mid x\right]=\nabla f_i(x),\qquad \mathbb{E}_{\xi_i}\left[\left\|g_i(x,\xi_i)-\nabla f_i(x)\right\|^2\mid x\right]\le\sigma^2\ \text{ for some }\sigma>0. \tag{1.2}$$

This condition is satisfied for many distributed learning problems. For example, suppose $f_i(x):=\mathbb{E}_{\xi_i}[F_i(x,\xi_i)]$ represents the expected loss function for agent $i$, where the $\xi_i$ are independent data samples gathered over time. Then, for any $x$ and $\xi_i$, the sampled gradient $g_i(x,\xi_i):=\nabla F_i(x,\xi_i)$ is an unbiased estimator of $\nabla f_i(x)$ satisfying Assumption 1.1. For another example, suppose the overall goal is to minimize an expected risk function $f(x)=\mathbb{E}_\xi[F(x,\xi)]$, and each agent $i$ has a single data sample $\xi_i$. Then, the expected risk function can be approximated by $\frac{1}{n}\sum_{i=1}^{n}f_i(x)$, where $f_i(x):=F(x,\xi_i)$. In this setting, the gradient estimation of $f_i$ can incur noise from various sources, such as approximation error and modeling and discretization errors.
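As a concrete illustration (not taken from the paper), the sketch below builds such an oracle for a hypothetical quadratic cost $f_i(x)=\frac{1}{2}\|x-c_i\|^2$, whose exact gradient is $x-c_i$, and checks the two conditions of (1.2) empirically; the names `noisy_grad`, `c_i`, and `noise_std` are illustrative assumptions.

```python
import numpy as np

# Hypothetical local cost f_i(x) = 0.5 * ||x - c_i||^2, so grad f_i(x) = x - c_i.
# A zero-mean Gaussian perturbation yields an unbiased oracle whose noise
# variance is bounded, matching Assumption 1.1 with sigma^2 = p * noise_std^2.
rng = np.random.default_rng(0)
p = 2
c_i = np.array([1.0, -2.0])
x = np.array([0.5, 0.5])
noise_std = 0.3                 # per-coordinate noise level (assumed)
sigma2 = p * noise_std**2       # the sigma^2 of (1.2) for this oracle

def noisy_grad(x, c_i, rng):
    return (x - c_i) + noise_std * rng.standard_normal(p)

samples = np.stack([noisy_grad(x, c_i, rng) for _ in range(20000)])
bias = samples.mean(axis=0) - (x - c_i)
mse = np.mean(np.sum((samples - (x - c_i))**2, axis=1))

assert np.linalg.norm(bias) < 0.01   # unbiased: E[g_i] = grad f_i
assert mse < sigma2 * 1.05           # noise variance bounded by sigma^2
```

The Monte Carlo averages confirm unbiasedness and the variance bound for this particular oracle; any estimator satisfying (1.2) would do.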

Problem (1.1) has been studied extensively in the literature under various distributed algorithms [42, 25, 26, 19, 15, 16, 38, 11, 34, 23, 44, 33], among which the distributed gradient descent (DGD) method proposed in [25] has drawn the greatest attention. Recently, distributed implementations of stochastic gradient algorithms have received considerable interest [36, 40, 12, 3, 5, 41, 6, 7, 22, 17, 18, 29, 30, 37, 39, 14, 32, 28, 43, 1]. Several recent works [18, 29, 21, 30, 32, 28] have shown that distributed methods can perform comparably to their centralized counterparts under certain conditions. For instance, a recent paper [28] discussed a distributed stochastic gradient method that asymptotically performs as well as the best bounds on centralized stochastic gradient descent (SGD).

In this work, we perform a non-asymptotic analysis for the standard distributed stochastic gradient descent (DSGD) method adapted from DGD. In addition to showing that the algorithm asymptotically achieves the optimal convergence rate enjoyed by a centralized scheme, we precisely identify its non-asymptotic convergence rate as a function of characteristics of the objective functions and the network (e.g., the spectral gap $1-\rho_w$ of the mixing matrix). Furthermore, we characterize the time needed for DSGD to achieve the optimal rate of convergence, as demonstrated in the following corollary.

###### Corollary (Corollary 4.7).

It takes $\mathcal{O}\left(\frac{n}{(1-\rho_w)^2}\right)$ time for DSGD to reach the asymptotic rate of convergence, i.e., when $k=\Omega\left(\frac{n}{(1-\rho_w)^2}\right)$, we have $\mathbb{E}\left[\|\bar x(k)-x^*\|^2\right]=\mathcal{O}\left(\frac{1}{nk}\right)$.

Note that $\mathcal{O}\left(\frac{1}{nk}\right)$ is the asymptotic convergence rate for centralized SGD (see the centralized result in Section 4). Here $\rho_w$ denotes the spectral norm of $W-\frac{1}{n}\mathbf{1}\mathbf{1}^\top$, with $W$ being the mixing matrix for all the agents, $\bar x(k)$ is the average solution at time $k$, and $x^*$ is the optimal solution. Stepsizes are set to $\alpha_k=\frac{\theta}{\mu(k+K)}$ for suitable constants $\theta$ and $K$ (see (3.1)). These results are new to the best of our knowledge.

The rest of this paper is organized as follows. After introducing necessary notation in Section 1.1, we present the DSGD algorithm and some preliminary results in Section 2. In Section 3 we prove the sublinear convergence of the algorithm. The main convergence results and a comparison with the centralized stochastic gradient method are presented in Section 4. We conclude the paper in Section 5.

### 1.1 Notation

Vectors are column vectors unless otherwise specified. Each agent $i$ holds a local copy of the decision vector, denoted by $x_i\in\mathbb{R}^p$, and its value at iteration/time $k$ is written as $x_i(k)$. Let

$$\mathbf{x}:=[x_1,x_2,\ldots,x_n]^\top\in\mathbb{R}^{n\times p},\qquad \bar x:=\frac{1}{n}\mathbf{1}^\top\mathbf{x}\in\mathbb{R}^{1\times p},$$

where $\mathbf{1}$ denotes the all-one vector. Define an aggregate objective function

$$F(\mathbf{x}):=\sum_{i=1}^{n}f_i(x_i),$$

and let

$$\nabla F(\mathbf{x}):=[\nabla f_1(x_1),\nabla f_2(x_2),\ldots,\nabla f_n(x_n)]^\top\in\mathbb{R}^{n\times p},\qquad \bar\nabla F(\mathbf{x}):=\frac{1}{n}\mathbf{1}^\top\nabla F(\mathbf{x}),$$

and

$$\xi:=[\xi_1,\xi_2,\ldots,\xi_n]^\top,\qquad g(\mathbf{x},\xi):=[g_1(x_1,\xi_1),g_2(x_2,\xi_2),\ldots,g_n(x_n,\xi_n)]^\top\in\mathbb{R}^{n\times p}.$$

In what follows, we write $\nabla F(k):=\nabla F(\mathbf{x}(k))$ and $g(k):=g(\mathbf{x}(k),\xi(k))$ for short.

The inner product of two vectors $a,b$ of the same dimension is written as $\langle a,b\rangle$. For two matrices $A,B\in\mathbb{R}^{n\times p}$, let $\langle A,B\rangle:=\sum_{i=1}^{n}\langle A_i,B_i\rangle$, where $A_i$ (respectively, $B_i$) is the $i$-th row of $A$ (respectively, $B$). We use $\|\cdot\|$ to denote the $2$-norm of vectors and the Frobenius norm of matrices.

A graph $\mathcal{G}=(\mathcal{N},\mathcal{E})$ has a set of vertices (nodes) $\mathcal{N}=\{1,2,\ldots,n\}$ and a set of edges $\mathcal{E}\subseteq\mathcal{N}\times\mathcal{N}$ connecting the vertices. We consider $n$ agents that interact over an undirected graph, i.e., $(i,j)\in\mathcal{E}$ if and only if $(j,i)\in\mathcal{E}$.

Denote the mixing matrix of the agents by $W=[w_{ij}]\in\mathbb{R}^{n\times n}$. Two agents $i$ and $j$ are connected if and only if $w_{ij},w_{ji}>0$ ($w_{ij}=w_{ji}=0$ otherwise). Formally, we assume the following condition on the communication among agents:

###### Assumption 1.2.

The graph $\mathcal{G}$ is undirected and connected (there exists a path between any two agents). The mixing matrix $W$ is nonnegative and doubly stochastic, i.e., $W\mathbf{1}=\mathbf{1}$ and $\mathbf{1}^\top W=\mathbf{1}^\top$.

From Assumption 1.2, we have the following contraction property of $W$ (see [34]):

###### Lemma 1.3.

Let Assumption 1.2 hold, and let $\rho_w$ denote the spectral norm of the matrix $W-\frac{1}{n}\mathbf{1}\mathbf{1}^\top$. Then, $\rho_w<1$ and

$$\|W\omega-\mathbf{1}\bar\omega\|\le\rho_w\|\omega-\mathbf{1}\bar\omega\|$$

for all $\omega\in\mathbb{R}^{n\times p}$, where $\bar\omega:=\frac{1}{n}\mathbf{1}^\top\omega$.
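The contraction property is easy to verify numerically. The following sketch (an assumed example, using lazy ring-graph weights that satisfy Assumption 1.2) computes $\rho_w$ as the spectral norm of $W-\frac{1}{n}\mathbf{1}\mathbf{1}^\top$ and checks the inequality of Lemma 1.3 on a random $\omega$.

```python
import numpy as np

# Numerical check of Lemma 1.3 (a sketch): for a doubly stochastic mixing
# matrix W of a connected graph, ||W w - 1 w_bar|| <= rho_w ||w - 1 w_bar||,
# where rho_w is the spectral norm of W - (1/n) 1 1^T.
rng = np.random.default_rng(1)
n, p = 5, 3

# Lazy ring-graph weights (an assumed example topology): doubly stochastic.
W = np.zeros((n, n))
for i in range(n):
    W[i, (i - 1) % n] = W[i, (i + 1) % n] = 0.25
    W[i, i] = 0.5

ones = np.ones((n, 1))
rho_w = np.linalg.norm(W - ones @ ones.T / n, 2)  # spectral norm
assert rho_w < 1                                  # Lemma 1.3: rho_w < 1

w = rng.standard_normal((n, p))
w_bar = ones.T @ w / n                            # 1 x p row average
lhs = np.linalg.norm(W @ w - ones @ w_bar)        # Frobenius norm
rhs = rho_w * np.linalg.norm(w - ones @ w_bar)
assert lhs <= rhs + 1e-12                         # the contraction inequality
```

The check works for any $\omega$ because $W\mathbf{1}\bar\omega=\mathbf{1}\bar\omega$, so $W\omega-\mathbf{1}\bar\omega=W(\omega-\mathbf{1}\bar\omega)$ with every column of $\omega-\mathbf{1}\bar\omega$ orthogonal to $\mathbf{1}$.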

## 2 Distributed Stochastic Gradient Descent

We consider the following standard DSGD method: at each step $k$, every agent $i$ independently performs the update

$$x_i(k+1)=\sum_{j=1}^{n}w_{ij}\left(x_j(k)-\alpha_k g_j(k)\right), \tag{2.1}$$

where $\{\alpha_k\}$ is a sequence of non-increasing stepsizes and $g_j(k):=g_j(x_j(k),\xi_j(k))$. The initial vectors $x_i(0)$ are arbitrary for all $i$. We can rewrite (2.1) in the following compact form:

$$\mathbf{x}(k+1)=W\left(\mathbf{x}(k)-\alpha_k g(k)\right). \tag{2.2}$$
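The update (2.1)/(2.2) takes only a few lines to implement. The toy instance below is a hedged sketch, not the paper's experiment: it assumes quadratic local costs $f_i(x)=\frac{1}{2}\|x-c_i\|^2$ (so the optimum of the average is the mean of the $c_i$), a lazy ring-graph mixing matrix, and a stepsize of the form (3.1).

```python
import numpy as np

# Minimal DSGD sketch for assumed quadratic local costs
# f_i(x) = 0.5 * ||x - c_i||^2, whose average is minimized at mean(c_i).
rng = np.random.default_rng(2)
n, p, sigma = 5, 3, 0.1
c = rng.standard_normal((n, p))
x_star = c.mean(axis=0)

# Lazy ring-graph mixing matrix satisfying Assumption 1.2.
W = np.zeros((n, n))
for i in range(n):
    W[i, (i - 1) % n] = W[i, (i + 1) % n] = 0.25
    W[i, i] = 0.5

mu = L = 1.0                 # each f_i is 1-strongly convex, 1-smooth
theta = 3.0
K = int(np.ceil(2 * theta * L**2 / mu**2))   # as in (3.2)

x = np.zeros((n, p))         # one row per agent
for k in range(5000):
    alpha = theta / (mu * (k + K))                     # stepsize policy (3.1)
    g = (x - c) + sigma * rng.standard_normal((n, p))  # noisy gradients
    x = W @ (x - alpha * g)                            # DSGD update (2.2)

x_bar = x.mean(axis=0)
assert np.linalg.norm(x_bar - x_star) < 0.05  # average iterate near optimum
assert np.linalg.norm(x - x_bar) < 0.05       # agents near consensus
```

Each row of `x` is one agent's local copy; the matrix product `W @ (...)` performs all $n$ updates (2.1) at once.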

Throughout the paper, we make the following standing assumption regarding the objective functions $f_i$ (the assumption can be generalized to the case where the agents have different $\mu_i$ and $L_i$).

###### Assumption 2.1.

Each $f_i:\mathbb{R}^p\to\mathbb{R}$ is $\mu$-strongly convex with $L$-Lipschitz continuous gradients, i.e., for any $x,x'\in\mathbb{R}^p$,

$$\langle\nabla f_i(x)-\nabla f_i(x'),x-x'\rangle\ge\mu\|x-x'\|^2,\qquad \|\nabla f_i(x)-\nabla f_i(x')\|\le L\|x-x'\|. \tag{2.3}$$

Under Assumption 2.1, Problem (1.1) has a unique optimal solution $x^*\in\mathbb{R}^p$, and the following result holds (see [34], Lemma 10).

###### Lemma 2.2.

For any $x\in\mathbb{R}^p$ and $\alpha\in(0,2/L)$, we have

$$\|x-\alpha\nabla f(x)-x^*\|\le\lambda\|x-x^*\|,$$

where $\lambda=\max\{|1-\alpha\mu|,|1-\alpha L|\}$.

Denote $\bar g(k):=\frac{1}{n}\mathbf{1}^\top g(k)$. The following two lemmas will be useful for our analysis later.

###### Lemma 2.3.

Under Assumption 1.1, for all $k\ge 0$,

$$\mathbb{E}\left[\left\|\bar g(k)-\bar\nabla F(\mathbf{x}(k))\right\|^2\right]\le\frac{\sigma^2}{n}. \tag{2.4}$$

###### Proof.

By the definitions of $\bar g(k)$ and $\bar\nabla F(\mathbf{x}(k))$ and Assumption 1.1, we have

$$\mathbb{E}\left[\left\|\bar g(k)-\bar\nabla F(\mathbf{x}(k))\right\|^2\right]=\mathbb{E}\left[\left\|\frac{1}{n}\mathbf{1}^\top g(k)-\frac{1}{n}\mathbf{1}^\top\nabla F(\mathbf{x}(k))\right\|^2\right]=\frac{1}{n^2}\sum_{i=1}^{n}\mathbb{E}\left[\left\|g_i(k)-\nabla f_i(x_i(k))\right\|^2\right]\le\frac{\sigma^2}{n},$$

where the second equality uses the independence of the gradient noises across agents.
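A quick empirical check of Lemma 2.3 (an illustrative sketch, with Gaussian noise calibrated so that each agent's noise satisfies $\mathbb{E}\|g_i-\nabla f_i\|^2=\sigma^2$ exactly): averaging over $n$ independent agents shrinks the noise variance to $\sigma^2/n$.

```python
import numpy as np

# Empirical check of Lemma 2.3 (a sketch): the averaged stochastic gradient
# g_bar(k) has noise variance at most sigma^2 / n.
rng = np.random.default_rng(3)
n, p, sigma, trials = 10, 4, 0.5, 20000

# Per-agent noise with E||g_i - grad f_i||^2 = sigma^2 exactly
# (Gaussian with per-coordinate variance sigma^2 / p).
noise = sigma / np.sqrt(p) * rng.standard_normal((trials, n, p))
g_bar_err = noise.mean(axis=1)               # g_bar - average gradient
mse = np.mean(np.sum(g_bar_err**2, axis=1))  # estimate of E||.||^2

assert mse <= sigma**2 / n * 1.05            # the bound (2.4), with slack
assert mse >= sigma**2 / n * 0.95            # tight here: independent noise
```

The bound (2.4) is tight in this example precisely because the noises are independent across agents, which is what the second equality in the proof exploits.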

###### Lemma 2.4.

Under Assumption 2.1, for all $k\ge 0$,

$$\left\|\nabla f(\bar x(k))-\bar\nabla F(\mathbf{x}(k))\right\|\le\frac{L}{\sqrt n}\left\|\mathbf{x}(k)-\mathbf{1}\bar x(k)\right\|. \tag{2.5}$$

###### Proof.

By definition,

$$\left\|\nabla f(\bar x(k))-\bar\nabla F(\mathbf{x}(k))\right\|=\left\|\frac{1}{n}\sum_{i=1}^{n}\nabla f_i(\bar x(k))-\frac{1}{n}\sum_{i=1}^{n}\nabla f_i(x_i(k))\right\|\le\frac{L}{n}\sum_{i=1}^{n}\|\bar x(k)-x_i(k)\|\le\frac{L}{\sqrt n}\|\mathbf{x}(k)-\mathbf{1}\bar x(k)\|,$$

where the first inequality follows from Assumption 2.1 and the last relation follows from the Cauchy-Schwarz inequality.

### 2.1 Preliminary Results

In this section, we present some preliminary results concerning $\mathbb{E}[\|\bar x(k)-x^*\|^2]$ (expected optimization error) and $\mathbb{E}[\|\mathbf{x}(k)-\mathbf{1}\bar x(k)\|^2]$ (expected consensus error). Specifically, we bound the two terms by linear combinations of their values at the previous iteration. Throughout the analysis we assume that Assumptions 1.1, 1.2, and 2.1 hold.

###### Lemma 2.5.

Under Algorithm (2.2), for all $k\ge 0$, we have

$$\mathbb{E}\left[\|\bar x(k+1)-x^*\|^2\mid\mathbf{x}(k)\right]\le\|\bar x(k)-\alpha_k\nabla f(\bar x(k))-x^*\|^2+\frac{2\alpha_k L}{\sqrt n}\|\bar x(k)-\alpha_k\nabla f(\bar x(k))-x^*\|\,\|\mathbf{x}(k)-\mathbf{1}\bar x(k)\|+\frac{\alpha_k^2 L^2}{n}\|\mathbf{x}(k)-\mathbf{1}\bar x(k)\|^2+\frac{\alpha_k^2\sigma^2}{n}. \tag{2.6}$$

###### Proof.

See the appendix.

The next result is a corollary of Lemma 2.5.

###### Lemma 2.6.

Under Algorithm (2.2), supposing $\alpha_k\le\frac{1}{L}$ for all $k$, we have

$$\mathbb{E}\left[\|\bar x(k+1)-x^*\|^2\right]\le(1-\alpha_k\mu)^2\,\mathbb{E}\left[\|\bar x(k)-x^*\|^2\right]+\frac{2\alpha_k L}{\sqrt n}\,\mathbb{E}\left[\|\bar x(k)-x^*\|\,\|\mathbf{x}(k)-\mathbf{1}\bar x(k)\|\right]+\frac{\alpha_k^2 L^2}{n}\,\mathbb{E}\left[\|\mathbf{x}(k)-\mathbf{1}\bar x(k)\|^2\right]+\frac{\alpha_k^2\sigma^2}{n}. \tag{2.7}$$

###### Proof.

See the appendix.

Concerning the expected consensus error $\mathbb{E}[\|\mathbf{x}(k)-\mathbf{1}\bar x(k)\|^2]$, we have the following lemma.

###### Lemma 2.7.

Under Algorithm (2.2), for all $k\ge 0$,

$$\mathbb{E}\left[\|\mathbf{x}(k+1)-\mathbf{1}\bar x(k+1)\|^2\right]\le\left(\frac{1+\rho_w^2}{2}+2\alpha_k\rho_w^2 L+2\alpha_k^2\rho_w^2 L^2\right)\mathbb{E}\left[\left\|\mathbf{x}(k)-\mathbf{1}\bar x(k)\right\|^2\right]+\rho_w^2\left[\frac{4\alpha_k^2 nL^2}{1-\rho_w^2}\,\mathbb{E}\left[\|\bar x(k)-x^*\|^2\right]+\frac{4\alpha_k^2\|\nabla F(\mathbf{1}x^*)\|^2}{1-\rho_w^2}+\alpha_k^2 n\sigma^2\right].$$

###### Proof.

See the appendix.

## 3 Analysis

We are now ready to derive some preliminary convergence results for Algorithm (2.2). First, we provide a uniform bound on the iterates generated by Algorithm (2.2) (in expectation) for all $k\ge 0$. Then, based on the lemmas established in Section 2.1, we prove the sublinear convergence rates $\mathbb{E}[\|\bar x(k)-x^*\|^2]=\mathcal{O}(1/k)$ and $\mathbb{E}[\|\mathbf{x}(k)-\mathbf{1}\bar x(k)\|^2]=\mathcal{O}(1/k^2)$.

From now on we consider the following stepsize policy:

$$\alpha_k:=\frac{\theta}{\mu(k+K)},\quad\forall k, \tag{3.1}$$

where $\theta>1$ is a constant and

$$K:=\left\lceil\frac{2\theta L^2}{\mu^2}\right\rceil. \tag{3.2}$$

### 3.1 Uniform Bound

We derive a uniform bound on the iterates generated by Algorithm (2.2) (in expectation) for all $k\ge 0$.

###### Lemma 3.1.

For all $k\ge 0$, we have

$$\mathbb{E}\left[\|\mathbf{x}(k)\|^2\right]\le\max\left\{\|\mathbf{x}(0)\|^2,\sum_{i=1}^{n}R_i\right\}, \tag{3.3}$$

where

$$R_i:=\max_{q\in X_i}\left\{\left(1-\frac{\mu^2}{2L^2}\right)q+\frac{\mu}{L^2}\|\nabla f_i(0)\|\sqrt q+\frac{\mu^2}{4L^4}\left(2\|\nabla f_i(0)\|^2+\sigma^2\right)\right\}, \tag{3.4}$$

and the sets $X_i$ are defined in the proof in the appendix.

###### Proof.

See the appendix.

We can further bound $R_i$ as follows. From the definition of $X_i$,

$$\max_{q\in X_i}q\le\frac{8\|\nabla f_i(0)\|^2}{\mu^2}+\frac{3\sigma^2}{4L^2}.$$

Hence

$$\begin{aligned}R_i&=\max_{q\in X_i}\left\{q-\frac{\mu}{2L^2}\left[\mu q-2\|\nabla f_i(0)\|\sqrt q-\frac{\mu}{2L^2}\left(2\|\nabla f_i(0)\|^2+\sigma^2\right)\right]\right\}\\ &\le\max_{q\in X_i}q-\frac{\mu}{2L^2}\min_{q\in X_i}\left\{\mu q-2\|\nabla f_i(0)\|\sqrt q-\frac{\mu}{2L^2}\left(2\|\nabla f_i(0)\|^2+\sigma^2\right)\right\}\\ &\le\frac{8\|\nabla f_i(0)\|^2}{\mu^2}+\frac{3\sigma^2}{4L^2}+\frac{\mu}{2L^2}\left[\frac{\|\nabla f_i(0)\|^2}{\mu}+\frac{\mu}{2L^2}\left(2\|\nabla f_i(0)\|^2+\sigma^2\right)\right]\\ &\le\frac{9\|\nabla f_i(0)\|^2}{\mu^2}+\frac{\sigma^2}{L^2}.\end{aligned} \tag{3.5}$$

In light of Lemma 3.1 and inequality (3.5), and further noticing that the choice of the origin is arbitrary in the proof of Lemma 3.1, we obtain the following uniform bound for $\mathbb{E}[\|\mathbf{x}(k)-\mathbf{1}x^*\|^2]$.

###### Lemma 3.2.

Under Algorithm (2.2), for all $k\ge 0$, we have

$$\mathbb{E}\left[\|\mathbf{x}(k)-\mathbf{1}x^*\|^2\right]\le\tilde X:=\max\left\{\|\mathbf{x}(0)-\mathbf{1}x^*\|^2,\frac{9\sum_{i=1}^{n}\|\nabla f_i(x^*)\|^2}{\mu^2}+\frac{n\sigma^2}{L^2}\right\}. \tag{3.6}$$

### 3.2 Sublinear Rate

Denote

$$U(k):=\mathbb{E}\left[\|\bar x(k)-x^*\|^2\right],\qquad V(k):=\mathbb{E}\left[\|\mathbf{x}(k)-\mathbf{1}\bar x(k)\|^2\right]. \tag{3.7}$$

Using Lemma 2.6 and Lemma 2.7 from Section 2.1, we show below that Algorithm (2.2) enjoys sublinear convergence rates, i.e., $U(k)=\mathcal{O}(1/k)$ and $V(k)=\mathcal{O}(1/k^2)$.

Define a Lyapunov function:

$$W(k):=U(k)+\omega(k)V(k),\quad\forall k, \tag{3.8}$$

where $\omega(k)>0$ is to be determined later.

###### Lemma 3.3.

Let

$$K_1:=\left\lceil\frac{24L^2\theta}{(1-\rho_w^2)\mu^2}\right\rceil, \tag{3.9}$$

and

$$\omega(k):=\frac{12\alpha_k L^2}{\mu n(1-\rho_w^2)}. \tag{3.10}$$

Under Algorithm (2.2), for all $k\ge K_1$, we have

$$U(k)\le\frac{\tilde W}{k}, \tag{3.11}$$

where

$$\tilde W:=\frac{K_1\tilde X}{n}+3(4\theta-3)\left(\frac{\sigma^2\theta^2}{n\mu^2}+\frac{\sigma^2\rho_w^2\theta^2}{2\mu^2}\right)+\frac{6\|\nabla F(\mathbf{1}x^*)\|^2\rho_w^2\theta^2(4\theta-3)}{n\mu^2(1-\rho_w^2)}. \tag{3.12}$$

In addition,

$$V(k)\le p_0^{k-K_1}V(K_1)+\frac{V_1}{k^2}+\frac{V_2}{k^3},$$

where

$$p_0:=\frac{3+\rho_w^2}{4}, \tag{3.13}$$

$$V_1:=\frac{8\theta^2\rho_w^2}{\mu^2(1-\rho_w^2)}\left[\frac{4\|\nabla F(\mathbf{1}x^*)\|^2}{1-\rho_w^2}+n\sigma^2\right],\qquad V_2:=\frac{32\theta^2 nL^2\rho_w^2}{\mu^2(1-\rho_w^2)^2}\tilde W. \tag{3.14}$$

###### Proof.

See the appendix.

## 4 Main Results

Notice that the sublinear rate obtained in Lemma 3.3 is network dependent, i.e., it depends on the spectral gap $1-\rho_w$, a function of the mixing matrix $W$. In this section, we perform a non-asymptotic analysis of network independence for Algorithm (2.2). Specifically, in Theorem 4.2 and its corollary we show that $U(k)\le\frac{\theta^2\sigma^2}{(2\theta-1)n\mu^2 k}+\mathcal{O}\left(\frac{1}{k^{1.5}}\right)$, where the first term is network independent and the second (higher-order) term depends on $1-\rho_w$. In the remainder of the section, we further improve this result and compare it with centralized stochastic gradient descent. We show that asymptotically, the two methods converge at the same rate $\frac{\theta^2\sigma^2}{(2\theta-1)n\mu^2 k}$. In addition, it takes $\mathcal{O}\left(\frac{n}{(1-\rho_w)^2}\right)$ time for Algorithm (2.2) to reach this asymptotic rate of convergence.
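The network-independence phenomenon can be illustrated numerically. In the hedged sketch below (a toy quadratic problem, not the paper's experiments), DSGD is run with identical noise and stepsizes on a ring graph and on a complete graph; after many iterations both reach a comparably small optimization error, consistent with the leading network-independent term.

```python
import numpy as np

# Illustration of network independence (a sketch): DSGD with assumed quadratic
# local costs f_i(x) = 0.5 * ||x - c_i||^2 on two topologies with very
# different spectral gaps; both end with a small optimization error.
rng = np.random.default_rng(4)
n, p, sigma, T = 8, 2, 0.2, 20000
c = rng.standard_normal((n, p))
x_star = c.mean(axis=0)

def mixing_ring(n):
    W = np.zeros((n, n))
    for i in range(n):
        W[i, (i - 1) % n] = W[i, (i + 1) % n] = 0.25
        W[i, i] = 0.5
    return W

def mixing_complete(n):
    return np.full((n, n), 1.0 / n)

def dsgd_error(W, seed):
    rng = np.random.default_rng(seed)
    mu, theta, K = 1.0, 3.0, 6          # stepsize policy (3.1)-(3.2)
    x = np.zeros((n, p))
    for k in range(T):
        alpha = theta / (mu * (k + K))
        g = (x - c) + sigma * rng.standard_normal((n, p))
        x = W @ (x - alpha * g)
    return np.linalg.norm(x.mean(axis=0) - x_star) ** 2

e_ring = dsgd_error(mixing_ring(n), 0)
e_full = dsgd_error(mixing_complete(n), 0)
assert e_ring < 1e-3 and e_full < 1e-3  # both topologies: small final error
```

The ring graph has a much smaller spectral gap than the complete graph, so its consensus error decays more slowly; yet for large $k$ the error of the average iterate is dominated by the topology-free term, which is what the sketch illustrates.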

Our analysis starts with a useful lemma.

###### Lemma 4.1.

For all $k\ge a>2\theta$, we have

$$\prod_{t=a}^{k-1}\left(1-\frac{2\theta}{t}\right)\le\frac{a^{2\theta}}{k^{2\theta}}. \tag{4.1}$$

###### Proof.

See the appendix.
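Lemma 4.1 is easy to sanity-check numerically; the loop below (plain Python with assumed parameter values) verifies the product bound for several pairs $(a,k)$ with $a>2\theta$, where each factor of the product is positive.

```python
# Numerical check of Lemma 4.1 (a sketch):
# prod_{t=a}^{k-1} (1 - 2*theta/t) <= a^{2 theta} / k^{2 theta},
# for a > 2*theta (so every factor is positive).
theta = 1.6  # assumed value; 2*theta = 3.2
for a in (5, 17, 40):
    for k in (a + 1, 2 * a, 10 * a):
        prod = 1.0
        for t in range(a, k):
            prod *= 1.0 - 2.0 * theta / t
        bound = (a / k) ** (2 * theta)
        assert prod <= bound + 1e-12
```

The inequality follows from $1-x\le e^{-x}$ together with $\sum_{t=a}^{k-1}\frac{1}{t}\ge\ln\frac{k}{a}$, which is what the numerical check reflects.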

The following theorem characterizes the non-asymptotic convergence property for Algorithm (LABEL:eq:_x_k).

###### Theorem 4.2.

Under Algorithm (2.2), suppose $\theta>2$ (the condition on $\theta$ can be easily generalized). We have, for all $k\ge K_1$,

$$\begin{aligned}U(k)\le{}&\frac{\theta^2\sigma^2}{(2\theta-1)n\mu^2}\frac{1}{k}+\frac{4\theta L\sqrt{\tilde WV_1}}{(2\theta-1.5)\sqrt n\,\mu}\frac{1}{k^{1.5}}+\frac{3\theta^2(2\theta-1)\sigma^2}{2(\theta-1)n\mu^2}\frac{1}{k^2}+\frac{1}{\theta-1}\left(\theta^2\tilde W+\frac{2\theta L\sqrt{\tilde WV_2}}{\sqrt n\,\mu}\right)\frac{1}{k^2}\\ &+\frac{2\theta^2L^2V_1}{(2\theta-3)n\mu^2}\frac{1}{k^3}+\frac{\theta^2L^2V_2}{(\theta-2)n\mu^2}\frac{1}{k^4}+\left(\frac{K_1^{2\theta}\tilde X}{n}+\frac{4\theta L\sqrt{\tilde W\tilde X}}{\sqrt n\,\mu}\frac{K_1^{2\theta-1.5}}{1-p_0}+\frac{2\theta^2L^2\tilde X}{n\mu^2}\frac{K_1^{2\theta-2}}{1-p_0}\right)\frac{1}{k^{2\theta}}.\end{aligned} \tag{4.2}$$

###### Proof.

For $k\ge K_1$, in light of Lemma 2.5 and Lemma 2.2,

$$\begin{aligned}U(k+1)&\le(1-\alpha_k\mu)^2U(k)+\frac{2\alpha_kL}{\sqrt n}\,\mathbb{E}\left[\|\bar x(k)-x^*\|\,\|\mathbf{x}(k)-\mathbf{1}\bar x(k)\|\right]+\frac{\alpha_k^2L^2}{n}V(k)+\frac{\alpha_k^2\sigma^2}{n}\\ &\le(1-\alpha_k\mu)^2U(k)+\frac{2\alpha_kL}{\sqrt n}\sqrt{U(k)V(k)}+\frac{\alpha_k^2L^2}{n}V(k)+\frac{\alpha_k^2\sigma^2}{n}\\ &=\left(1-\frac{2\theta}{k}\right)U(k)+\frac{\theta^2U(k)}{k^2}+\frac{2\theta L}{\sqrt n\,\mu}\frac{\sqrt{U(k)V(k)}}{k}+\frac{\theta^2L^2}{n\mu^2}\frac{V(k)}{k^2}+\frac{\theta^2\sigma^2}{n\mu^2}\frac{1}{k^2},\end{aligned}$$

where the second inequality follows from the Cauchy-Schwarz inequality. Then,

$$U(k)\le\left(\prod_{t=K_1}^{k-1}\left(1-\frac{2\theta}{t}\right)\right)U(K_1)+\sum_{t=K_1}^{k-1}\left(\prod_{i=t+1}^{k-1}\left(1-\frac{2\theta}{i}\right)\right)\left(\frac{\theta^2\sigma^2}{n\mu^2t^2}+\frac{\theta^2U(t)}{t^2}+\frac{2\theta L}{\sqrt n\,\mu}\frac{\sqrt{U(t)V(t)}}{t}+\frac{\theta^2L^2}{n\mu^2}\frac{V(t)}{t^2}\right).$$

From Lemma 4.1,

$$\begin{aligned}U(k)\le{}&\frac{K_1^{2\theta}}{k^{2\theta}}U(K_1)+\sum_{t=K_1}^{k-1}\frac{(t+1)^{2\theta}}{k^{2\theta}}\left(\frac{\theta^2\sigma^2}{n\mu^2t^2}+\frac{\theta^2U(t)}{t^2}+\frac{2\theta L}{\sqrt n\,\mu}\frac{\sqrt{U(t)V(t)}}{t}+\frac{\theta^2L^2}{n\mu^2}\frac{V(t)}{t^2}\right)\\ ={}&\frac{1}{k^{2\theta}}\frac{\theta^2\sigma^2}{n\mu^2}\sum_{t=K_1}^{k-1}\frac{(t+1)^{2\theta}}{t^2}+\frac{K_1^{2\theta}}{k^{2\theta}}U(K_1)+\sum_{t=K_1}^{k-1}\frac{(t+1)^{2\theta}}{k^{2\theta}}\left(\frac{\theta^2U(t)}{t^2}+\frac{2\theta L}{\sqrt n\,\mu}\frac{\sqrt{U(t)V(t)}}{t}+\frac{\theta^2L^2}{n\mu^2}\frac{V(t)}{t^2}\right).\end{aligned} \tag{4.3}$$

In light of Lemma 3.3, when $k\ge K_1$,

$$U(k)\le\frac{\tilde W}{k},$$

and

$$V(k)\le p_0^{k-K_1}V(K_1)+\frac{V_1}{k^2}+\frac{V_2}{k^3}.$$

Hence

$$\begin{aligned}&U(k)-\frac{1}{k^{2\theta}}\frac{\theta^2\sigma^2}{n\mu^2}\sum_{t=K_1}^{k-1}\frac{(t+1)^{2\theta}}{t^2}-\frac{K_1^{2\theta}}{k^{2\theta}}U(K_1)\\ &\quad\le\sum_{t=K_1}^{k-1}\frac{(t+1)^{2\theta}}{k^{2\theta}}\left[\frac{\theta^2\tilde W}{t^3}+\frac{2\theta L}{\sqrt n\,\mu}\frac{\sqrt{\tilde W}}{t^{1.5}}\sqrt{p_0^{t-K_1}V(K_1)+\frac{V_1}{t^2}+\frac{V_2}{t^3}}+\frac{\theta^2L^2}{n\mu^2}\frac{1}{t^2}\left(p_0^{t-K_1}V(K_1)+\frac{V_1}{t^2}+\frac{V_2}{t^3}\right)\right]\\ &\quad\le\frac{\theta^2\tilde W}{k^{2\theta}}\sum_{t=K_1}^{k-1}\frac{(t+1)^{2\theta}}{t^3}+\sum_{t=K_1}^{k-1}\frac{(t+1)^{2\theta}}{k^{2\theta}}\left[\frac{2\theta L\sqrt{\tilde W}}{\sqrt n\,\mu}\left(\frac{\sqrt{V(K_1)}\sqrt{p_0}^{\,t-K_1}}{t^{1.5}}+\frac{\sqrt{V_1}}{t^{2.5}}+\frac{\sqrt{V_2}}{t^3}\right)+\frac{\theta^2L^2}{n\mu^2}\left(\frac{p_0^{t-K_1}V(K_1)}{t^2}+\frac{V_1}{t^4}+\frac{V_2}{t^5}\right)\right]\\ &\quad=\frac{1}{k^{2\theta}}\frac{2\theta L\sqrt{\tilde WV_1}}{\sqrt n\,\mu}\sum_{t=K_1}^{k-1}\frac{(t+1)^{2\theta}}{t^{2.5}}+\frac{1}{k^{2\theta}}\left(\theta^2\tilde W+\frac{2\theta L\sqrt{\tilde WV_2}}{\sqrt n\,\mu}\right)\sum_{t=K_1}^{k-1}\frac{(t+1)^{2\theta}}{t^3}\\ &\qquad+\frac{1}{k^{2\theta}}\frac{\theta^2L^2V_1}{n\mu^2}\sum_{t=K_1}^{k-1}\frac{(t+1)^{2\theta}}{t^4}+\frac{1}{k^{2\theta}}\frac{\theta^2L^2V_2}{n\mu^2}\sum_{t=K_1}^{k-1}\frac{(t+1)^{2\theta}}{t^5}\\ &\qquad+\frac{1}{k^{2\theta}}\frac{2\theta L\sqrt{\tilde WV(K_1)}}{\sqrt n\,\mu}\sum_{t=K_1}^{k-1}\frac{(t+1)^{2\theta}\sqrt{p_0}^{\,t-K_1}}{t^{1.5}}+\frac{1}{k^{2\theta}}\frac{\theta^2L^2V(K_1)}{n\mu^2}\sum_{t=K_1}^{k-1}\frac{(t+1)^{2\theta}p_0^{t-K_1}}{t^2}.\end{aligned}$$

In addition, we have, for any $b>a\ge 1$,

$$\sum_{t=a}^{b}\frac{(t+1)^{2\theta}}{t^2}\le\sum_{t=a}^{b-2}\left[\frac{(t+1)^{2\theta}}{(t+1)^2}+\frac{3(t+1)^{2\theta}}{(t+1)^3}\right]+\frac{b^{2\theta}}{(b-1)^2}+\frac{(b+1)^{2\theta}}{b^2}\le\int_a^b\left(t^{2\theta-2}+3t^{2\theta-3}\right)dt+\frac{2(b+1)^{2\theta}}{b^2}\le\frac{b^{2\theta-1}}{2\theta-1}+\frac{3b^{2\theta-2}}{2\theta-2}+3b^{2\theta-2},$$

$$\sum_{t=a}^{b}\frac{(t+1)^{2\theta}}{t^{2.5}}\le\frac{2b^{2\theta-1.5}}{2\theta-1.5},\qquad \sum_{t=a}^{b}\frac{(t+1)^{2\theta}}{t^3}\le\frac{2b^{2\theta-2}}{2\theta-2},\qquad \sum_{t=a}^{b}\frac{(t+1)^{2\theta}}{t^4}\le\frac{2b^{2\theta-3}}{2\theta-3},\qquad \sum_{t=a}^{b}\frac{(t+1)^{2\theta}}{t^5}\le\frac{2b^{2\theta-4}}{2\theta-4},$$

and

$$\sum_{t=K_1}^{k-1}\frac{(t+1)^{2\theta}\sqrt{p_0}^{\,t-K_1}}{t^{1.5}}\le 2\int_{K_1}^{\infty}t^{2\theta-1.5}\sqrt{p_0}^{\,t-K_1}\,dt\le\frac{2}{\ln p_0}\int