
A Non-Asymptotic Analysis of Network Independence for Distributed Stochastic Gradient Descent

This work was partially supported by the NSF under grants DMS-1664644 and CNS-1645681, by the ONR under MURI grant N00014-16-1-2832, by the NIH under grant 1UL1TR001430 to the Clinical & Translational Science Institute at Boston University, and by the Boston University Digital Health Initiative.

Alex Olshevsky (Department of Electrical and Computer Engineering and Division of Systems Engineering, Boston University, Boston, MA; alexols@bu.edu), Ioannis Ch. Paschalidis (Department of Electrical and Computer Engineering and Division of Systems Engineering, Boston University, Boston, MA; yannisp@bu.edu), and Shi Pu (Division of Systems Engineering, Boston University, Boston, MA; sp3dw@virginia.edu)
Abstract

This paper is concerned with minimizing the average of $n$ cost functions over a network, in which agents may communicate and exchange information with their peers. Specifically, we consider the setting where only noisy gradient information is available. To solve the problem, we study the standard distributed stochastic gradient descent (DSGD) method and perform a non-asymptotic convergence analysis. For strongly convex and smooth objective functions, we not only show that DSGD asymptotically achieves the optimal network-independent convergence rate of centralized stochastic gradient descent (SGD), but also explicitly identify the non-asymptotic convergence rate as a function of characteristics of the objective functions and the network. Furthermore, we derive the time needed for DSGD to approach the asymptotic convergence rate, which behaves as $\mathcal{O}\left(\frac{n}{(1-\rho_w)^2}\right)$, where $1-\rho_w$ denotes the spectral gap of the mixing matrix of communicating agents.

Key words. distributed optimization, convex optimization, stochastic programming, stochastic gradient descent

AMS subject classifications. 90C15, 90C25, 68Q25

1 Introduction

In this paper, we consider the distributed optimization problem in which a group of $n$ agents collaboratively look for an $x \in \mathbb{R}^p$ that minimizes the average of $n$ cost functions:

(1.1) $\min_{x \in \mathbb{R}^p} f(x) := \frac{1}{n}\sum_{i=1}^{n} f_i(x).$

Each local cost function $f_i : \mathbb{R}^p \rightarrow \mathbb{R}$ is known by agent $i$ only, and all the agents communicate and exchange information over a network. Problems in the form of (1.1) find applications in multi-agent target seeking [31, 8], distributed machine learning [13, 24, 10, 2, 45, 1, 4], and wireless networks [9, 20, 2], among other scenarios.

In order to solve (1.1), we assume each agent $i$ is able to obtain noisy gradient samples $g_i(x, \xi_{i,k})$ satisfying the following assumption:

Assumption 1.1.

For all $i \in \{1, 2, \ldots, n\}$ and all $k \geq 0$, each random vector $\xi_{i,k}$ is independent, and

(1.2) $\mathbb{E}\left[g_i(x, \xi_{i,k}) \mid x\right] = \nabla f_i(x), \qquad \mathbb{E}\left[\left\|g_i(x, \xi_{i,k}) - \nabla f_i(x)\right\|^2 \mid x\right] \leq \sigma_i^2 \ \text{ for some } \sigma_i > 0.$

This condition is satisfied for many distributed learning problems. For example, suppose $f_i(x) := \mathbb{E}_{\xi_i}\left[F_i(x, \xi_i)\right]$ represents the expected loss function for agent $i$, where the $\xi_{i,k}$ are independent data samples gathered over time. Then for any $x$ and $\xi_{i,k}$, the sampled gradient $g_i(x, \xi_{i,k}) := \nabla F_i(x, \xi_{i,k})$ is an unbiased estimator of $\nabla f_i(x)$ satisfying Assumption 1.1. As another example, suppose the overall goal is to minimize an expected risk function, and each agent $i$ has access to a single data sample $\zeta_i$. Then the expected risk can be approximated by the empirical risk $f(x) := \frac{1}{n}\sum_{i=1}^{n} f_i(x)$, where $f_i$ is the loss evaluated at sample $\zeta_i$. In this setting, the gradient estimate of $f_i$ can incur noise from various sources, such as approximation, modeling, and discretization errors.
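As a concrete illustration of Assumption 1.1, the following sketch (ours, not part of the original analysis; the quadratic cost, the matrix A, the vector b, and the noise level sigma are all illustrative assumptions) implements a noisy gradient oracle and empirically checks unbiasedness and bounded variance:

```python
import numpy as np

rng = np.random.default_rng(0)
p = 5  # dimension of the decision variable x

# Hypothetical quadratic local cost: f_i(x) = 0.5 * ||A x - b||^2.
A = rng.standard_normal((8, p))
b = rng.standard_normal(8)

def grad_fi(x):
    """Exact gradient of the local cost f_i."""
    return A.T @ (A @ x - b)

def g_i(x, sigma=0.5):
    """Noisy gradient sample g_i(x, xi): exact gradient plus zero-mean noise."""
    return grad_fi(x) + sigma * rng.standard_normal(p)

# Empirical check of Assumption 1.1 at a fixed point x:
x = rng.standard_normal(p)
samples = np.array([g_i(x) for _ in range(100_000)])
print(np.allclose(samples.mean(axis=0), grad_fi(x), atol=1e-2))  # unbiasedness
print(np.mean(np.sum((samples - grad_fi(x)) ** 2, axis=1)))      # ~ p * sigma^2 = 1.25
```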

Problem (1.1) has been studied extensively in the literature under various distributed algorithms [42, 25, 26, 19, 15, 16, 38, 11, 34, 23, 44, 33], among which the distributed gradient descent (DGD) method proposed in [25] has drawn the greatest attention. Recently, distributed implementations of stochastic gradient algorithms have received considerable interest [36, 40, 12, 3, 5, 41, 6, 7, 22, 17, 18, 29, 30, 37, 39, 14, 32, 28, 43, 1]. Several recent works [18, 29, 21, 30, 32, 28] have shown that distributed methods can perform comparably to their centralized counterparts under certain conditions. For instance, a recent paper [28] discussed a distributed stochastic gradient method that asymptotically matches the best known convergence bounds for centralized stochastic gradient descent (SGD).

In this work, we perform a non-asymptotic analysis for the standard distributed stochastic gradient descent (DSGD) method adapted from DGD. In addition to showing that the algorithm asymptotically achieves the optimal convergence rate enjoyed by a centralized scheme, we precisely identify its non-asymptotic convergence rate as a function of characteristics of the objective functions and the network (e.g., the spectral gap $1-\rho_w$ of the mixing matrix). Furthermore, we characterize the time needed for DSGD to achieve the optimal rate of convergence, as demonstrated in the following corollary.

Corollary (Corollary 4.7).

It takes $K_T = \mathcal{O}\left(\frac{n}{(1-\rho_w)^2}\right)$ time for DSGD to reach the asymptotic rate of convergence, i.e., when $k \geq K_T$, we have $\mathbb{E}\left[\|\bar{x}_k - x^*\|^2\right] = \mathcal{O}\left(\frac{1}{nk}\right)$.

Note that $\mathcal{O}\left(\frac{1}{nk}\right)$ is the asymptotic convergence rate for SGD (see the corresponding result for centralized SGD in Section 4). Here $\rho_w$ denotes the spectral norm of $W - \frac{1}{n}\mathbf{1}\mathbf{1}^\intercal$, with $W$ being the mixing matrix for all the agents, $\bar{x}_k := \frac{1}{n}\sum_{i=1}^{n} x_{i,k}$ is the average solution at time $k$, and $x^*$ is the optimal solution. Stepsizes are set to $\alpha_k = \frac{\theta}{\mu(k+K)}$ for some $\theta > 1$ (see (3.1)). These results are new to the best of our knowledge.

The rest of this paper is organized as follows. After introducing necessary notation in Section 1.1, we present the DSGD algorithm and some preliminary results in Section 2. In Section 3 we prove the sublinear convergence of the algorithm. The main convergence results and a comparison with the centralized stochastic gradient method are presented in Section 4. We conclude the paper in Section 5.

1.1 Notation

Vectors are column vectors unless otherwise specified. Each agent $i$ holds a local copy of the decision vector denoted by $x_i \in \mathbb{R}^p$, and its value at iteration/time $k$ is written as $x_{i,k}$. Let

$\mathbf{x} := [x_1, x_2, \ldots, x_n]^\intercal \in \mathbb{R}^{n \times p}, \qquad \bar{x} := \frac{1}{n}\mathbf{1}^\intercal \mathbf{x} \in \mathbb{R}^{1 \times p},$

where $\mathbf{1}$ is the all one vector. Define an aggregate objective function

$F(\mathbf{x}) := \sum_{i=1}^{n} f_i(x_i),$

and let

$\nabla F(\mathbf{x}) := [\nabla f_1(x_1), \nabla f_2(x_2), \ldots, \nabla f_n(x_n)]^\intercal \in \mathbb{R}^{n \times p}.$

In addition, we denote

$h(\mathbf{x}) := \frac{1}{n}\mathbf{1}^\intercal \nabla F(\mathbf{x}), \qquad G(\mathbf{x}, \boldsymbol{\xi}) := [g_1(x_1, \xi_1), g_2(x_2, \xi_2), \ldots, g_n(x_n, \xi_n)]^\intercal \in \mathbb{R}^{n \times p}.$

In what follows we write $\mathbf{x}_k := [x_{1,k}, x_{2,k}, \ldots, x_{n,k}]^\intercal$ and $\bar{x}_k := \frac{1}{n}\mathbf{1}^\intercal \mathbf{x}_k$ for short.

The inner product of two vectors $a, b$ is written as $\langle a, b \rangle$. For two matrices $A, B \in \mathbb{R}^{n \times p}$, let $\langle A, B \rangle := \sum_{i=1}^{n} \langle A_i, B_i \rangle$, where $A_i$ (respectively, $B_i$) is the $i$-th row of $A$ (respectively, $B$). We use $\|\cdot\|$ to denote the $2$-norm of vectors and the Frobenius norm of matrices.

A graph $\mathcal{G} = (\mathcal{N}, \mathcal{E})$ has a set of vertices (nodes) $\mathcal{N} = \{1, 2, \ldots, n\}$ and a set of edges $\mathcal{E} \subseteq \mathcal{N} \times \mathcal{N}$ connecting vertices. We consider agents interacting over an undirected graph, i.e., $(i, j) \in \mathcal{E}$ if and only if $(j, i) \in \mathcal{E}$.

Denote the mixing matrix of the agents by $W = [w_{ij}] \in \mathbb{R}^{n \times n}$. Two agents $i$ and $j$ are connected if and only if $w_{ij}, w_{ji} > 0$ ($w_{ij} = w_{ji} = 0$ otherwise). Formally, we assume the following condition on the communication among agents:

Assumption 1.2.

The graph $\mathcal{G}$ is undirected and connected (there exists a path between any two agents). The mixing matrix $W$ is nonnegative and doubly stochastic, i.e., $W\mathbf{1} = \mathbf{1}$ and $\mathbf{1}^\intercal W = \mathbf{1}^\intercal$.
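On an undirected connected graph, Assumption 1.2 can be satisfied, for example, by the classical Metropolis weights. The sketch below (our illustration; the ring topology is an arbitrary choice) constructs such a mixing matrix and verifies that it is nonnegative and doubly stochastic:

```python
import numpy as np

def metropolis_weights(adj):
    """Doubly stochastic mixing matrix from an undirected adjacency matrix."""
    n = adj.shape[0]
    deg = adj.sum(axis=1)
    W = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            if adj[i, j]:
                W[i, j] = 1.0 / (1.0 + max(deg[i], deg[j]))
        W[i, i] = 1.0 - W[i].sum()  # self-weight absorbs the remainder
    return W

n = 8
adj = np.zeros((n, n), dtype=int)
for i in range(n):  # ring topology: agent i communicates with i-1 and i+1
    adj[i, (i + 1) % n] = adj[(i + 1) % n, i] = 1

W = metropolis_weights(adj)
print((W >= 0).all(), np.allclose(W.sum(axis=0), 1), np.allclose(W.sum(axis=1), 1))
```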

From Assumption 1.2, we have the following contraction property of $W$ (see [34]):

Lemma 1.3.

Let Assumption 1.2 hold, and let $\rho_w$ denote the spectral norm of the matrix $W - \frac{1}{n}\mathbf{1}\mathbf{1}^\intercal$. Then, $\rho_w < 1$ and

$\left\|W\boldsymbol{\omega} - \mathbf{1}\bar{\omega}\right\| \leq \rho_w \left\|\boldsymbol{\omega} - \mathbf{1}\bar{\omega}\right\|$

for all $\boldsymbol{\omega} \in \mathbb{R}^{n \times p}$, where $\bar{\omega} := \frac{1}{n}\mathbf{1}^\intercal \boldsymbol{\omega}$.
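Lemma 1.3 is straightforward to verify numerically. The following sketch (our illustration, using a lazy averaging matrix on a ring) computes $\rho_w$ as the spectral norm of $W - \frac{1}{n}\mathbf{1}\mathbf{1}^\intercal$ and checks the contraction at a random point:

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 8, 3

# Lazy ring averaging: W = I/2 + (P + P^T)/4 with P the cyclic shift matrix.
P = np.roll(np.eye(n), 1, axis=1)
W = 0.5 * np.eye(n) + 0.25 * (P + P.T)

ones = np.ones((n, 1))
rho_w = np.linalg.norm(W - ones @ ones.T / n, 2)  # spectral norm; here < 1

x = rng.standard_normal((n, p))                   # one row per agent
xbar = x.mean(axis=0, keepdims=True)
lhs = np.linalg.norm(W @ x - ones @ xbar)         # ||W x - 1 xbar||
rhs = rho_w * np.linalg.norm(x - ones @ xbar)     # rho_w ||x - 1 xbar||
print(rho_w, lhs <= rhs + 1e-12)
```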

2 Distributed Stochastic Gradient Descent

We consider the following standard DSGD method: at each step $k \geq 0$, every agent $i$ independently performs the update:

(2.1) $x_{i,k+1} = \sum_{j=1}^{n} w_{ij}\left(x_{j,k} - \alpha_k g_j(x_{j,k}, \xi_{j,k})\right),$

where $\{\alpha_k\}$ is a sequence of non-increasing stepsizes. The initial vectors $x_{i,0}$ are arbitrary for all $i$. We can rewrite (2.1) in the following compact form:

(2.2) $\mathbf{x}_{k+1} = W\left(\mathbf{x}_k - \alpha_k G(\mathbf{x}_k, \boldsymbol{\xi}_k)\right),$ where $\boldsymbol{\xi}_k := [\xi_{1,k}, \xi_{2,k}, \ldots, \xi_{n,k}]^\intercal$.
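To make the update concrete, here is a minimal runnable sketch of DSGD (ours, not the paper's code; the quadratic local costs, noise level, and stepsize constants are illustrative assumptions, and the adapt-then-combine form follows the compact form (2.2) as written above):

```python
import numpy as np

rng = np.random.default_rng(2)
n, p, sigma = 8, 3, 0.5

# Lazy ring mixing matrix (nonnegative, doubly stochastic; Assumption 1.2).
P = np.roll(np.eye(n), 1, axis=1)
W = 0.5 * np.eye(n) + 0.25 * (P + P.T)

# Hypothetical strongly convex local costs: f_i(x) = 0.5 * ||x - c_i||^2,
# so the minimizer of the average cost is the mean of the c_i.
C = rng.standard_normal((n, p))
x_star = C.mean(axis=0)

def G(X):
    """Stacked noisy gradients: row i is g_i(x_i, xi_i) = (x_i - c_i) + noise."""
    return (X - C) + sigma * rng.standard_normal((n, p))

X = rng.standard_normal((n, p))  # arbitrary initial points x_{i,0}
mu, theta, K = 1.0, 2.0, 10      # illustrative O(1/k) stepsize constants

for k in range(5000):
    alpha = theta / (mu * (k + K))
    X = W @ (X - alpha * G(X))   # DSGD step in the compact form (2.2)

print(np.linalg.norm(X.mean(axis=0) - x_star))  # average iterate near x*
```

Because $W$ is doubly stochastic, left-multiplying the update by $\frac{1}{n}\mathbf{1}^\intercal$ yields $\bar{x}_{k+1} = \bar{x}_k - \alpha_k \frac{1}{n}\mathbf{1}^\intercal G(\mathbf{x}_k, \boldsymbol{\xi}_k)$: the average iterate performs an inexact centralized SGD step, which is the observation underlying the comparison with centralized SGD.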

Throughout the paper, we make the following standing assumption regarding the objective functions $f_i$. (The assumption can be generalized to the case where the agents have different $\mu_i$ and $L_i$.)

Assumption 2.1.

Each $f_i$ is $\mu$-strongly convex with $L$-Lipschitz continuous gradients, i.e., for any $x, x' \in \mathbb{R}^p$,

(2.3) $\left\langle \nabla f_i(x) - \nabla f_i(x'), x - x' \right\rangle \geq \mu \|x - x'\|^2, \qquad \left\|\nabla f_i(x) - \nabla f_i(x')\right\| \leq L \|x - x'\|.$

Under Assumption 2.1, Problem (1.1) has a unique optimal solution $x^*$, and the following result holds (see [34], Lemma 10).

Lemma 2.2.

For any $x \in \mathbb{R}^p$ and $\alpha \in (0, 2/L)$, we have

$\left\|x - \alpha \nabla f(x) - x^*\right\| \leq \lambda \left\|x - x^*\right\|,$

where $\lambda = \max\left(|1 - \alpha\mu|, |1 - \alpha L|\right)$.
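A quick numerical sanity check of this contraction (our illustration, on a hypothetical quadratic with $\mu = 0.5$ and $L = 2$, using the contraction factor $\lambda$ stated above):

```python
import numpy as np

rng = np.random.default_rng(3)
mu, L = 0.5, 2.0
H = np.diag([mu, 1.0, L])            # Hessian with spectrum inside [mu, L]
x_star = rng.standard_normal(3)

grad_f = lambda x: H @ (x - x_star)  # f(x) = 0.5 (x - x*)^T H (x - x*)

alpha = 0.8                          # any stepsize in (0, 2/L)
lam = max(abs(1 - alpha * mu), abs(1 - alpha * L))
x = rng.standard_normal(3)
lhs = np.linalg.norm(x - alpha * grad_f(x) - x_star)
print(lhs <= lam * np.linalg.norm(x - x_star) + 1e-12)  # contraction holds
```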

Denote $\bar{\sigma}^2 := \frac{1}{n}\sum_{i=1}^{n} \sigma_i^2$. The following two lemmas will be useful for our analysis later.

Lemma 2.3.

Under Assumption 1.1, for all $k \geq 0$,

(2.4) $\mathbb{E}\left[\left\|\frac{1}{n}\mathbf{1}^\intercal G(\mathbf{x}_k, \boldsymbol{\xi}_k) - h(\mathbf{x}_k)\right\|^2\right] \leq \frac{\bar{\sigma}^2}{n}.$

Proof.

By the definitions of $G(\mathbf{x}_k, \boldsymbol{\xi}_k)$ and $h(\mathbf{x}_k)$, the independence of the $\xi_{i,k}$, and Assumption 1.1, we have

$\mathbb{E}\left[\left\|\frac{1}{n}\mathbf{1}^\intercal G(\mathbf{x}_k, \boldsymbol{\xi}_k) - h(\mathbf{x}_k)\right\|^2\right] = \frac{1}{n^2}\sum_{i=1}^{n} \mathbb{E}\left[\left\|g_i(x_{i,k}, \xi_{i,k}) - \nabla f_i(x_{i,k})\right\|^2\right] \leq \frac{\bar{\sigma}^2}{n}.$
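The $1/n$ factor in (2.4) is the variance reduction obtained by averaging $n$ independent gradient noises; it is what ultimately drives the network-independent $\mathcal{O}\left(\frac{1}{nk}\right)$ rate. A quick empirical check (our illustration, with i.i.d. Gaussian noise as an assumed noise model):

```python
import numpy as np

rng = np.random.default_rng(4)
n, p, sigma, trials = 50, 3, 1.0, 20_000

# i.i.d. zero-mean gradient noise for each of the n agents; averaging over
# agents shrinks the mean-squared noise by a factor of n.
noise = sigma * rng.standard_normal((trials, n, p))
avg = noise.mean(axis=1)                 # the averaged noise, one row per trial
print(np.mean(np.sum(avg ** 2, axis=1)))  # ~ p * sigma^2 / n = 0.06
```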

Lemma 2.4.

Under Assumption 2.1, for all $\mathbf{x} \in \mathbb{R}^{n \times p}$,

(2.5) $\left\|h(\mathbf{x}) - \nabla f(\bar{x})\right\| \leq \frac{L}{\sqrt{n}}\left\|\mathbf{x} - \mathbf{1}\bar{x}\right\|.$

Proof.

By definition,

$\left\|h(\mathbf{x}) - \nabla f(\bar{x})\right\| = \frac{1}{n}\left\|\sum_{i=1}^{n}\left(\nabla f_i(x_i) - \nabla f_i(\bar{x})\right)\right\| \leq \frac{L}{n}\sum_{i=1}^{n}\left\|x_i - \bar{x}\right\| \leq \frac{L}{\sqrt{n}}\left\|\mathbf{x} - \mathbf{1}\bar{x}\right\|,$

where the last relation follows from the Cauchy-Schwarz inequality.
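Inequality (2.5) can also be checked numerically. The sketch below (our illustration) uses hypothetical quadratic local costs whose Hessians are bounded by $L$ and compares the two sides at a random point:

```python
import numpy as np

rng = np.random.default_rng(5)
n, p, L = 8, 3, 2.0

# Quadratic local costs f_i(x) = 0.5 x^T H_i x with ||H_i||_2 <= L,
# so each gradient x -> H_i x is L-Lipschitz.
Hs = [np.diag(rng.uniform(0.5, L, p)) for _ in range(n)]

X = rng.standard_normal((n, p))                          # row i is x_i
xbar = X.mean(axis=0)
h = np.mean([Hs[i] @ X[i] for i in range(n)], axis=0)    # h(x)
gf = np.mean([Hs[i] @ xbar for i in range(n)], axis=0)   # grad f(xbar)

lhs = np.linalg.norm(h - gf)
rhs = L / np.sqrt(n) * np.linalg.norm(X - xbar)          # Frobenius norm
print(lhs <= rhs + 1e-12)
```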

2.1 Preliminary Results

In this section, we present some preliminary results concerning $\mathbb{E}\left[\|\bar{x}_k - x^*\|^2\right]$ (expected optimization error) and $\mathbb{E}\left[\|\mathbf{x}_k - \mathbf{1}\bar{x}_k\|^2\right]$ (expected consensus error). Specifically, we bound the two terms by linear combinations of their values in the last iteration. Throughout the analysis we assume Assumptions 1.1, 1.2 and 2.1 hold.

Lemma 2.5.

Under Algorithm (2.2), for all $k \geq 0$, we have

(2.6)

Proof.

See the Appendix.

The next result is a corollary of Lemma 2.5.

Lemma 2.6.

Under Algorithm (2.2), supposing the stepsizes $\alpha_k$ are sufficiently small, we have

(2.7)

Proof.

See the Appendix.

Concerning the expected consensus error $\mathbb{E}\left[\|\mathbf{x}_k - \mathbf{1}\bar{x}_k\|^2\right]$, we have the following lemma.

Lemma 2.7.

Under Algorithm (2.2), for all $k \geq 0$,

Proof.

See the Appendix.

3 Analysis

We are now ready to derive some preliminary convergence results for Algorithm (2.2). First, we provide a uniform bound on the iterates generated by Algorithm (2.2) (in expectation) for all $k \geq 0$. Then, based on the lemmas established in Section 2.1, we prove the sublinear convergence rates $\mathbb{E}\left[\|\bar{x}_k - x^*\|^2\right] = \mathcal{O}(1/k)$ and $\mathbb{E}\left[\|\mathbf{x}_k - \mathbf{1}\bar{x}_k\|^2\right] = \mathcal{O}(1/k^2)$.

From now on we consider the following stepsize policy:

(3.1) $\alpha_k := \frac{\theta}{\mu(k + K)}, \quad \forall k \geq 0,$

where $\theta > 1$ and

(3.2) $K \geq \frac{2\theta L}{\mu}.$
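In particular, the lower bound (3.2) on $K$ caps the initial stepsize:

$$\alpha_0 = \frac{\theta}{\mu K} \leq \frac{\theta}{\mu} \cdot \frac{\mu}{2\theta L} = \frac{1}{2L},$$

and, since the stepsizes are non-increasing, $\alpha_k \leq \frac{1}{2L}$ for all $k \geq 0$.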

3.1 Uniform Bound

We derive a uniform bound on the iterates generated by Algorithm (2.2) (in expectation) for all $k \geq 0$.

Lemma 3.1.

For all $k \geq 0$, we have

(3.3)

where

(3.4)

and the sets $\mathcal{X}_i$ are as defined therein.

Proof.

See the Appendix.

We can further bound $R_i$ as follows. From the definition of $R_i$,

Hence

(3.5)

In light of Lemma 3.1 and inequality (3.5), further noticing that the choice of the reference point in the proof of Lemma 3.1 is arbitrary, we obtain the following uniform bound for the iterates.

Lemma 3.2.

Under Algorithm (2.2), for all $k \geq 0$, we have

(3.6)

3.2 Sublinear Rate

Denote

(3.7) $U(k) := \mathbb{E}\left[\|\bar{x}_k - x^*\|^2\right], \qquad V(k) := \mathbb{E}\left[\|\mathbf{x}_k - \mathbf{1}\bar{x}_k\|^2\right].$

Using Lemma 2.6 and Lemma 2.7 from Section 2.1, we show below that Algorithm (2.2) enjoys a sublinear convergence rate, i.e., $U(k) = \mathcal{O}(1/k)$ and $V(k) = \mathcal{O}(1/k^2)$.

Define a Lyapunov function:

(3.8)

where the weighting constant in (3.8) is to be determined later.

Lemma 3.3.

Let

(3.9)

and

(3.10)

Under Algorithm (2.2), for all $k \geq 0$, we have

(3.11)

where

(3.12)

In addition,

where

(3.13)
(3.14)

Proof.

See the Appendix.

4 Main Results

Notice that the sublinear rate obtained in Lemma 3.3 is network dependent, i.e., it depends on the spectral gap $1-\rho_w$, a function of the mixing matrix $W$. In this section, we perform a non-asymptotic analysis of network independence for Algorithm (2.2). Specifically, in Theorem 4.2 and its corollary, we show that $\mathbb{E}\left[\|\bar{x}_k - x^*\|^2\right] = \mathcal{O}\left(\frac{1}{nk}\right) + \mathcal{O}\left(\frac{1}{(1-\rho_w)^2 k^2}\right)$, where the first term is network independent and the second (higher-order) term depends on $1-\rho_w$. We then further improve this result and compare it with centralized stochastic gradient descent. We show that, asymptotically, the two methods have the same convergence rate $\mathcal{O}\left(\frac{1}{nk}\right)$. In addition, it takes $K_T = \mathcal{O}\left(\frac{n}{(1-\rho_w)^2}\right)$ time for Algorithm (2.2) to reach this asymptotic rate of convergence.
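The transient-time estimate can be made tangible by evaluating $n/(1-\rho_w)^2$ for specific topologies. The sketch below (our illustration, treating the $\mathcal{O}(\cdot)$ bound as an exact formula purely for comparison) contrasts a poorly connected ring with a complete graph:

```python
import numpy as np

def transient_time(W):
    """n / (1 - rho_w)^2 with rho_w the spectral norm of W - (1/n) 11^T."""
    n = W.shape[0]
    rho_w = np.linalg.norm(W - np.ones((n, n)) / n, 2)
    return n / (1.0 - rho_w) ** 2

n = 16
P = np.roll(np.eye(n), 1, axis=1)
ring = 0.5 * np.eye(n) + 0.25 * (P + P.T)  # poorly connected: small spectral gap
complete = np.ones((n, n)) / n             # best connected: rho_w = 0
print(transient_time(ring), transient_time(complete))  # ~1.1e4 vs. 16
```

As expected, the larger the spectral gap of the network, the sooner DSGD behaves like centralized SGD.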

Our analysis starts with a useful lemma.

Lemma 4.1.
(4.1)

Proof.

See the Appendix.

The following theorem characterizes the non-asymptotic convergence properties of Algorithm (2.2).

Theorem 4.2.

Under Algorithm (2.2), suppose the stepsize policy (3.1)-(3.2) is adopted (the condition can be easily generalized). We have, for all $k \geq 0$,

(4.2)

Proof.

For $k \geq 0$, in light of Lemma 2.5 and Lemma 2.2,

where the second inequality follows from the Cauchy-Schwarz inequality. Then,

From Lemma 4.1,

(4.3)

In light of Lemma 3.3, when $k$ is sufficiently large,

and

Hence

However, we have, for any $k \geq 0$,

and