Robust Distributed Accelerated Stochastic Gradient Methods for Multi-Agent Networks


\name Alireza Fallah* \email afallah@mit.edu
\addr Department of Electrical Engineering and Computer Science
Massachusetts Institute of Technology
Cambridge, MA 02139, United States of America
\AND
\name Mert Gürbüzbalaban* \email mg1366@rutgers.edu
\addr Department of Management Science and Information Systems
Rutgers Business School
Piscataway, NJ 08854, United States of America
\AND
\name Asuman Ozdaglar* \email asuman@mit.edu
\addr Department of Electrical Engineering and Computer Science
Massachusetts Institute of Technology
Cambridge, MA 02139, United States of America
\AND
\name Umut Şimşekli* \email umut.simsekli@telecom-paris.fr
\addr LTCI, Télécom Paris, Institut Polytechnique de Paris
Paris 75013, France, and
Department of Statistics, University of Oxford
24-29 St Giles', Oxford OX1 3LB, United Kingdom
\AND
\name Lingjiong Zhu* \email zhu@math.fsu.edu
\addr Department of Mathematics
Florida State University
Tallahassee, FL 32306, United States of America

* The authors are listed in alphabetical order.
Corresponding author.
Abstract

We study the distributed stochastic gradient (D-SG) method and its accelerated variant (D-ASG) for solving decentralized strongly convex stochastic optimization problems, where the objective function is distributed over several computational units lying on a fixed but arbitrary connected communication graph, subject to local communication constraints, and where noisy estimates of the gradients are available. We develop a framework which allows us to choose the stepsize and the momentum parameters of these algorithms in a way that optimizes performance by systematically trading off the bias, the variance, and the dependence on network effects. When gradients do not contain noise, we also prove that D-ASG can achieve acceleration, in the sense that it requires $\mathcal{O}(\sqrt{\kappa}\log(1/\varepsilon))$ gradient evaluations and communications to converge to the same fixed point as the non-accelerated variant, where $\kappa$ is the condition number and $\varepsilon$ is the target accuracy. For quadratic functions, we also provide finer performance bounds that are tight with respect to the bias and variance terms. Finally, we study a multistage version of D-ASG with parameters carefully varied over stages to ensure exact convergence to the optimal solution. It achieves an optimal and accelerated linear decay in the bias term as well as the optimal $\mathcal{O}(\sigma^2/k)$ rate in the variance term. We illustrate through numerical experiments that our approach results in practical accelerated algorithms that are robust to gradient noise and can outperform existing methods.

\ShortHeadings{Robust Distributed Accelerated Stochastic Gradient Methods}{Fallah, Gürbüzbalaban, Ozdaglar, Şimşekli and Zhu}
\firstpageno{1}

Keywords: Distributed Optimization, Accelerated Methods, Stochastic Optimization, Robustness, Multi-Agent Networks

1 Introduction

Advances in sensing and processing technologies, communication capabilities and smart devices have enabled the deployment of systems where a massive amount of data is collected by many distributed autonomous units to make decisions. There are numerous such examples, including a set of sensors collecting and processing information about a time-varying spatial field (e.g., to monitor temperature levels or chemical concentrations), a collection of mobile robots performing dynamic tasks spread over a region, community-based traffic and navigation systems (such as Waze, a GPS navigation software application owned by Google, which is free to download and use), and autonomous cars providing real-time traffic information and guidance for drivers. In such systems, most of the information is often collected in a decentralized, distributed manner, and processing of information has to go hand-in-hand with its communication and sharing across these units over an undirected network $\mathcal{G} = (\mathcal{V}, \mathcal{E})$ defined by the set $\mathcal{V}$ of agents (computational units) connected by the edges $\mathcal{E}$. In such a setting, we consider the group of $N$ agents (i.e., the nodes) collaboratively solving the following optimization problem:

(1)  $\min_{x \in \mathbb{R}^d} f(x) := \frac{1}{N} \sum_{i=1}^{N} f_i(x),$

where each $f_i$ is known by agent $i$ only and is therefore referred to as its local objective function. We assume each $f_i$ is $\mu$-strongly convex with $L$-Lipschitz gradients (hence $f$ is also $\mu$-strongly convex with $L$-Lipschitz gradients, and we refer to $\kappa := L/\mu$ as its condition number). We also use $x^*$ to denote the unique optimal solution of (1). In addition, we denote the local model of node $i$ at iteration $k$ by $x_i^{(k)} \in \mathbb{R}^d$.

We consider the setting where each agent has access to noisy estimates of the actual gradients satisfying the following assumption.

Assumption 1. Recall that $x_i^{(k)}$ denotes the decision variable of node $i$ at iteration $k$. We assume that at iteration $k$, node $i$ has access to $\tilde\nabla f_i\big(x_i^{(k)}, w_i^{(k)}\big)$, which is an estimate of $\nabla f_i\big(x_i^{(k)}\big)$, where $w_i^{(k)}$ is a random variable independent of the past iterates and of the noise at the other nodes and iterations. Moreover, we assume

$\mathbb{E}\Big[\tilde\nabla f_i\big(x_i^{(k)}, w_i^{(k)}\big) \,\Big|\, x_i^{(k)}\Big] = \nabla f_i\big(x_i^{(k)}\big), \qquad \mathbb{E}\Big[\big\|\tilde\nabla f_i\big(x_i^{(k)}, w_i^{(k)}\big) - \nabla f_i\big(x_i^{(k)}\big)\big\|^2 \,\Big|\, x_i^{(k)}\Big] \le \sigma^2.$

To simplify the notation, we suppress the $w_i^{(k)}$ dependence and write $\tilde\nabla f_i\big(x_i^{(k)}\big)$ for $\tilde\nabla f_i\big(x_i^{(k)}, w_i^{(k)}\big)$. This setting arises naturally in distributed learning problems where $f_i$ represents the expected loss over independent data points collected at node $i$ (see e.g. Pu and Nedić (2018); Lan et al. (2017); Olshevsky et al. (2019)). In this setting, the gradient computed on a fresh data sample is an unbiased estimator of $\nabla f_i$, which we assume satisfies the bounded variance condition of Assumption 1. Note that in our setting, a master node that can coordinate the computations is not available, unlike the master/slave architecture studied in the literature (see e.g. Mishchenko et al. (2018); Agarwal and Duchi (2011); Hakimi et al. (2019); Lee et al. (2018); Meng et al. (2016); Jaggi et al. (2014); Xin and Khan (2018)). Furthermore, our setting covers an arbitrary connected network topology, which is more general than particular network topologies such as the complete graph or the ring graph.
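To make Assumption 1 concrete, the following minimal sketch (hypothetical names; a least-squares local loss with additive Gaussian gradient noise is assumed) implements an oracle satisfying both the unbiasedness and the bounded-variance conditions.

```python
import numpy as np

rng = np.random.default_rng(0)

def grad_fi(x, A_i, b_i):
    """Exact gradient of the local loss f_i(x) = 0.5 * ||A_i x - b_i||^2."""
    return A_i.T @ (A_i @ x - b_i)

def noisy_grad_fi(x, A_i, b_i, noise_std=0.1):
    """Unbiased estimate of grad f_i(x): the noise w is zero-mean and independent
    of x, so E[g] = grad f_i(x) and E||g - grad f_i(x)||^2 = noise_std^2 * dim(x),
    i.e., Assumption 1 holds with sigma^2 = noise_std^2 * dim(x)."""
    w = rng.normal(scale=noise_std, size=x.shape)  # plays the role of w_i^{(k)}
    return grad_fi(x, A_i, b_i) + w
```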

Deterministic variants of problem (1) have been studied extensively in the literature. Much of the work builds on the Distributed Gradient (DG) method proposed in Nedic and Ozdaglar (2009), where each agent keeps a local estimate of the optimal solution of (1) and updates it by combining a weighted average of its neighbors' estimates with a gradient step (scaled by the stepsize $\alpha$) on the local objective function. Nedic and Ozdaglar (2009) analyzed the case with convex and possibly nonsmooth local objective functions, constant stepsize $\alpha$, and agents linked over an undirected connected graph, and showed that the ergodic average of the local estimates of the agents converges at rate $\mathcal{O}(1/k)$ to an $\mathcal{O}(\alpha)$ neighborhood of the optimal solution of problem (1) (where $k$ denotes the number of iterations). Yuan et al. (2016) considered this algorithm for the case where the local functions are smooth, i.e., the gradients $\nabla f_i$ are Lipschitz continuous, and where the $f_i$ are either convex, restricted strongly convex, or strongly convex. For the convex case, they show the network-wide mean estimate converges at rate $\mathcal{O}(1/k)$ to an $\mathcal{O}(\alpha)$ neighborhood of the optimal solution, and for the strongly convex case, all local estimates converge at a linear rate to an $\mathcal{O}(\alpha)$ neighborhood of $x^*$. (For two real-valued functions $g$ and $h$, we say $g = \mathcal{O}(h)$ if there exist positive constants $c$ and $x_0$ such that $g(x) \le c\, h(x)$ for every $x$ in the domain with $x \ge x_0$.)

There have been many recent works on developing new distributed deterministic algorithms with faster convergence rates and exact convergence to the optimal solution $x^*$. We start by summarizing the literature in this area that is most relevant to this work. First, Shi et al. (2015) provide a novel algorithm which can be viewed as a primal-dual algorithm for the constrained reformulation of problem (1) (see Mokhtari and Ribeiro (2016) for this interpretation) that achieves exact convergence at a linear rate to the optimal solution. Second, Qu and Li (2018) propose to modify the DG method so that agents also maintain, exchange, and combine estimates of the gradient of the global objective function of (1). This modification is based on a technique called "gradient tracking" (see e.g. Di Lorenzo and Scutari (2015, 2016)), which enables better control of the global gradient direction and yields a linear rate of convergence to the optimal solution (see Jakovetić (2019) for a unified analysis of these two methods). In a follow-up paper, Qu and Li (2017) also considered an acceleration of their algorithm and achieved a linear convergence rate to the optimal solution. To the best of our knowledge, whether an accelerated primal variant of the DG algorithm can achieve the non-distributed linear rate to a neighborhood of the optimal solution with $\sqrt{\kappa}$ dependence has been an open problem. Alternative distributed first-order methods besides DG have also been studied. In particular, if additional assumptions are made, such as the explicit characterization of the Fenchel dual of the local objective functions (referred to as the dualable setting, as in Scaman et al. (2018); Uribe et al. (2018)), then it is known that the multi-step dual accelerated (MSDA) method of Scaman et al. (2018) achieves the linear rate to the optimum with $\sqrt{\kappa}$ dependence. For deterministic distributed optimization problems under smooth and strongly convex objectives, Dvinskikh and Gasnikov (2019) proposed the PSTM algorithm and provided accelerated convergence guarantees. Recently, Scaman et al. (2019) provided lower bounds which match the upper bounds of Dvinskikh and Gasnikov (2019) up to logarithmic factors (see also Scaman et al. (2019) for a discussion of deterministic optimal algorithms under different assumptions: Lipschitz continuity, strong convexity, smoothness, and a combination of strong convexity and smoothness).

This paper focuses on the Distributed Stochastic Gradient (D-SG) method (a stochastic version of the DG method) and its momentum-enhanced variant, the Distributed Accelerated Stochastic Gradient (D-ASG) method. These methods are relevant for solving distributed learning problems and are natural decentralized versions of stochastic gradient descent and its variant based on Nesterov's momentum averaging (Nesterov (2004); Can et al. (2019)). In this paper, we focus on strongly convex and smooth objectives. Several works studied D-SG under these assumptions, although D-ASG remains relatively understudied outside the deterministic case (see e.g. Jakovetić et al. (2014); Xi et al. (2017); Li et al. (2018); Qu and Li (2016)). We summarize the existing convergence rate results for D-SG in Table 1. (See also Shamir and Srebro (2014) for a different noise model than ours in the mini-batch setting, where each objective can be expressed as a finite sum.) Among these, Rabbat (2015) studied composite stochastic optimization problems and established sublinear convergence rates for D-SG and its mirror descent variant. Koloskova et al. (2019) studied decentralized stochastic gradient algorithms when the nodes compress (e.g. quantize or sparsify) their updates. Olshevsky et al. (2019) provided an asymptotic network-independent sublinear rate. In our approach, we use a dynamical system representation of these iterative algorithms (presented in Lessard et al. (2016) and further used in Hu and Lessard (2017); Aybat et al. (2018, 2019)) to provide rate estimates for the convergence of the local agent iterates to a neighborhood of the optimal solution of problem (1). Our bounds are presented in terms of three components: (i) a bias term that shows the decay rate of the initialization error (i.e., the distance of the initial estimates to the optimal solution) independent of gradient noise; (ii) a variance term that depends on the noise level of the local objective functions' gradients, measuring the "robustness" of the algorithm to noise (in a sense that we will make precise later); (iii) a network effect that highlights the dependence on the structure of the network. In this paper, in addition to the convergence analysis of D-SG and D-ASG, our purpose is to study the trade-offs and interplays between these three terms that affect the performance.

Algorithm   Reference                          Extra assumption
D-SG        Tsianos and Rabbat (2012) (a)      Yes
D-SG        Rabbat (2015)                      No
D-SG        Olshevsky et al. (2019)            No
D-SG        Koloskova et al. (2019) (b)        Yes
D-SG        Corollary 2.3.2 in this paper      No
D-ASG       Theorem 2.3.3 in this paper        No
D-MASG      Corollary 3 in this paper          No

Table 1: Summary of convergence guarantees for D-SG, D-ASG and D-MASG; the rate expressions are stated in the cited references and results. Here $\bar{x}^{(k)}$ denotes the average of the nodes' estimates at time $k$, i.e., $\bar{x}^{(k)} = \frac{1}{N}\sum_{i=1}^{N} x_i^{(k)}$, and $\hat{x}^{(k)}$ is a weighted average defined in Koloskova et al. (2019); the remaining quantities entering the rates are defined in (8), (20) and (28).
(a): The authors analyze a D-SG method with a slightly different update than ours.
(b): The authors make an extra boundedness assumption on the stochastic gradients.

Contributions. We have three sets of contributions.

First, we study the convergence rate of D-SG with constant stepsize, which is used in many practical applications (Alghunaim and Sayed (2019, 2018); Dieuleveut et al. (2017)). Our bounds provide tighter guarantees on the bias term, as well as novel guarantees on the variance term, for this algorithm. For quadratic functions, we provide sharper estimates for the bias, variance, and network effect terms that are tight, in the sense that there exist simple quadratic functions attaining these bounds.

Second, we consider D-ASG with constant stepsize. We show that the bias term decays linearly, at an accelerated rate, to a neighborhood of the optimal solution. We also provide an explicit characterization of this neighborhood, in terms of the noise and network structure parameters, with the variance term dominating for small enough stepsize. When the objectives are all quadratic, we obtain non-asymptotic guarantees that are explicit in terms of the linear convergence rate and the dependence on noise, generalizing known guarantees for ASG to the distributed setting (Can et al. (2019)).

For both algorithms, following earlier work on their non-distributed versions (Aybat et al. (2018)), we use our explicit characterization of the bias, variance, and network effect terms to provide a computational framework that can choose algorithm parameters to trade off these different effects in a systematic manner. In the centralized setting, it has been observed and argued that accelerated algorithms are often more sensitive to noise than non-accelerated algorithms (see e.g. Flammarion and Bach (2015); d'Aspremont (2008); Aybat et al. (2019); Hardt (2014)); however, to our knowledge this behavior has not been systematically studied in the context of decentralized algorithms. We study the asymptotic variance of the D-SG and D-ASG iterates as a measure of robustness to random gradient noise, and provide explicit expressions for this quantity for quadratic objectives as well as upper bounds for strongly convex objectives. This allows us to compare D-SG and D-ASG in terms of their robustness to random noise. Our results (see the discussion after Theorem 2.3.3) show that D-ASG can indeed be less robust than D-SG depending on the choice of the momentum and stepsize parameters, shedding further light on the tuning of hyperparameters (stepsize and momentum) in the distributed setting.

Finally, we study a multistage version of D-ASG, building on the non-distributed method of Aybat et al. (2019), whereby a distributed accelerated stochastic gradient method with constant stepsize and momentum parameter is used at every stage, with parameters carefully varied over stages to ensure exact convergence to the optimal solution $x^*$. Similar to Aybat et al. (2019), a momentum restart is used to enable stitching together the improvements obtained over consecutive stages. We show that our proposed method achieves an optimal and accelerated linear decay in the bias term, as well as the optimal $\mathcal{O}(\sigma^2/k)$ rate in the variance term and an explicit dependence on the spectral gap of the network in the network-effect term (see (8) for a formal definition). Such an optimal dependency was obtained previously for the PBSTM algorithm of Dvinskikh and Gasnikov (2019), which is optimal up to logarithmic terms. However, to the best of our knowledge, this is the first result where such an optimal dependency can be given for the D-ASG algorithm. A summary of all our convergence results is provided in Table 1.

Notation. Let $\mathcal{S}_{\mu,L}(\mathbb{R}^d)$ denote the set of functions from $\mathbb{R}^d$ to $\mathbb{R}$ that are $\mu$-strongly convex and $L$-smooth; that is, for every $x, y \in \mathbb{R}^d$,

$\frac{\mu}{2}\,\|y - x\|^2 \;\le\; f(y) - f(x) - \nabla f(x)^\top (y - x) \;\le\; \frac{L}{2}\,\|y - x\|^2,$

where we define the condition number $\kappa := L/\mu$. Let $0_{m \times n}$ denote the zero matrix with $m$ rows and $n$ columns. Given a collection of square matrices $X_1, \dots, X_N$, the matrix $\mathrm{diag}(X_1, \dots, X_N)$ denotes the block-diagonal square matrix with $i$-th diagonal block equal to $X_i$. For two matrices $A$ and $B$, we denote their Kronecker product by $A \otimes B$. For two functions $g$ and $h$ defined over positive integers, we say $g(k) = \mathcal{O}(h(k))$ if there exist a constant $c > 0$ and a positive integer $k_0$ such that $g(k) \le c\, h(k)$ for every positive integer $k \ge k_0$.

2 Distributed Stochastic Gradient and Its Accelerated Variant

We will first study the distributed stochastic gradient (D-SG) method, which is the stochastic version of the distributed gradient (DG) method introduced in Nedic and Ozdaglar (2009), and then focus on its accelerated variant.

Consider a network $\mathcal{G} = (\mathcal{V}, \mathcal{E})$ whose vertex set $\mathcal{V} = \{1, 2, \dots, N\}$ is connected by the edge set $\mathcal{E}$. We associate this network with an $N \times N$ symmetric, doubly stochastic weight matrix $W$: we have $W_{ij} > 0$ if $(i,j) \in \mathcal{E}$ or $i = j$, and $W_{ij} = 0$ if $(i,j) \notin \mathcal{E}$ and $i \ne j$; finally, $\sum_{j \in \Omega_i} W_{ij} = 1$ for every $i$, where $\Omega_i$ denotes the set of neighbors of node $i$ (we adopt the convention that each node is a neighbor of itself, i.e., $i \in \Omega_i$). The eigenvalues of $W$, ordered in a descending manner, satisfy

$-1 < \lambda_N(W) \le \dots \le \lambda_2(W) < \lambda_1(W) = 1,$

with $\rho := \max\{|\lambda_2(W)|, |\lambda_N(W)|\} < 1$. Such a matrix always exists (see e.g. Boyd et al. (2006)) if the graph is not bipartite, and there can be different choices of $W$ (Shi et al. (2015)). For bipartite graphs, one can also construct such a matrix by considering the transition matrix of a lazy random walk on the graph (see e.g. Chung (1997)).
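As an illustration (not the paper's prescription), one standard way to build such a $W$ is via Metropolis–Hastings weights; the lazy variant $(I + W)/2$ further shifts all eigenvalues into $(0, 1]$, which is the device used for bipartite graphs and later for Assumption 2.3.3. A minimal sketch, assuming the graph is given by a 0/1 adjacency matrix:

```python
import numpy as np

def metropolis_weights(adj):
    """Symmetric, doubly stochastic mixing matrix from a 0/1 adjacency matrix
    (no self-loops in adj; self-weights are filled in so each row sums to one)."""
    n = adj.shape[0]
    deg = adj.sum(axis=1)
    W = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            if adj[i, j]:
                W[i, j] = 1.0 / (1.0 + max(deg[i], deg[j]))
        W[i, i] = 1.0 - W[i].sum()
    return W

adj = np.roll(np.eye(5), 1, 0) + np.roll(np.eye(5), -1, 0)  # 5-node ring graph
W = metropolis_weights(adj)
W_lazy = 0.5 * (np.eye(5) + W)               # eigenvalues shifted into (0, 1]
print(np.sort(np.linalg.eigvalsh(W))[::-1])  # descending; largest equals 1
```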

Next, we make a few definitions for the sake of the subsequent analysis. First, define the average iterate

(2)  $\bar{x}^{(k)} := \frac{1}{N} \sum_{i=1}^{N} x_i^{(k)}.$

Next, we define the column vector

(3)  $x^{(k)} := \big[\, x_1^{(k)};\, x_2^{(k)};\, \dots;\, x_N^{(k)} \,\big] \in \mathbb{R}^{Nd},$

which concatenates the local decision variables into a single vector. We also define $x_*$ as

(4)  $x_* := [\, x^*;\, x^*;\, \dots;\, x^* \,] \in \mathbb{R}^{Nd},$

which is the column vector of length $Nd$ that concatenates $N$ copies of the optimizer $x^*$ of problem (1).

In addition, we define $F : \mathbb{R}^{Nd} \to \mathbb{R}$ as

$F(x) := \sum_{i=1}^{N} f_i(x_i) \qquad \text{for } x = [\, x_1;\, x_2;\, \dots;\, x_N \,], \ x_i \in \mathbb{R}^d,$

where the corresponding stacked stochastic gradient is

$\tilde\nabla F(x) := \big[\, \tilde\nabla f_1(x_1);\, \tilde\nabla f_2(x_2);\, \dots;\, \tilde\nabla f_N(x_N) \,\big],$

which obeys

(5)  $\mathbb{E}\big[\tilde\nabla F(x)\big] = \nabla F(x), \qquad \mathbb{E}\,\big\|\tilde\nabla F(x) - \nabla F(x)\big\|^2 \le N\sigma^2,$

due to Assumption 1. Furthermore, $F$ is $\mu$-strongly convex and $L$-smooth.

2.1 Distributed stochastic gradient (D-SG)

Recall that $x_i^{(k)}$ denotes the decision variable of node $i$ at iteration $k$. The D-SG iterations update this variable by performing a stochastic gradient step with respect to the local cost function $f_i$, together with a weighted averaging with the decision variables of node $i$'s immediate neighbors $\Omega_i$:

(6)  $x_i^{(k+1)} = \sum_{j \in \Omega_i} W_{ij}\, x_j^{(k)} - \alpha\, \tilde\nabla f_i\big(x_i^{(k)}\big),$

where $\alpha > 0$ is the stepsize. Note that we can express the D-SG iterations compactly as

(7)  $x^{(k+1)} = \mathcal{W}\, x^{(k)} - \alpha\, \tilde\nabla F\big(x^{(k)}\big),$

where $\mathcal{W} := W \otimes I_d$.
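A minimal simulation sketch of the update (6), reusing the hypothetical noisy-gradient oracle idea from above (all names are illustrative):

```python
import numpy as np

def d_sg(x0, W, noisy_grads, alpha=0.05, iters=1000):
    """Distributed stochastic gradient (6): row i of x is agent i's local iterate.
    noisy_grads[i] maps x_i to an unbiased estimate of grad f_i(x_i)."""
    x = x0.copy()                              # shape (N, d)
    for _ in range(iters):
        g = np.stack([ng(xi) for ng, xi in zip(noisy_grads, x)])
        x = W @ x - alpha * g                  # consensus averaging + local gradient step
    return x
```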

Without noise, i.e., when $\sigma = 0$, D-SG reduces to the DG algorithm. In this case, Yuan et al. (2016) show that the DG algorithm is inexact in the sense that its iterates do not converge to the optimum in general with constant stepsize, but instead converge linearly to fixed points $x_i^{\infty}$ lying in a neighborhood of the solution satisfying

(8)  $\max_{1 \le i \le N}\, \big\| x_i^{\infty} - x^* \big\| \le D\, \frac{\alpha}{1 - \rho},$

for some constant $D > 0$ with an explicit expression, provided that the stepsize satisfies some conditions (Yuan et al., 2016) (see Lemma A in the Appendix for details).

Similar to (4), we define the column vector

(9)  $x_\infty := \big[\, x_1^{\infty};\, x_2^{\infty};\, \dots;\, x_N^{\infty} \,\big] \in \mathbb{R}^{Nd},$

which concatenates the fixed points $x_i^{\infty}$ over all the nodes. It can be checked that the unique fixed point of (7) in the noiseless setting is the solution to

(10)  $x_\infty = \mathcal{W}\, x_\infty - \alpha\, \nabla F(x_\infty).$

This means that the sequence $\mathbb{E}\,\|x^{(k)} - x_\infty\|^2$ converges to zero with an appropriate choice of the stepsize. The performance of the algorithm can then be measured by the distance of $x_\infty$ to $x_*$ given by (4).

2.2 Distributed accelerated stochastic gradient (D-ASG)

Consider the following variant of D-SG:

(11)  $x_i^{(k+1)} = \sum_{j \in \Omega_i} W_{ij}\, y_j^{(k)} - \alpha\, \tilde\nabla f_i\big(y_i^{(k)}\big), \qquad y_i^{(k)} = (1+\beta)\, x_i^{(k)} - \beta\, x_i^{(k-1)},$

where $\alpha > 0$ is the stepsize and $\beta \ge 0$ is called the momentum parameter. This algorithm has also been considered in the literature by Jakovetić et al. (2014) in the noiseless setting.

We define the average iterates $\bar{x}^{(k)}$ and the column vector $x^{(k)}$ as in (2) and (3), respectively. Also, similar to (3), we define the column vector $y^{(k)} := [\, y_1^{(k)};\, y_2^{(k)};\, \dots;\, y_N^{(k)} \,]$. Then, we can re-write the D-ASG iterates (11) as:

(12)  $x^{(k+1)} = \mathcal{W}\, y^{(k)} - \alpha\, \tilde\nabla F\big(y^{(k)}\big), \qquad y^{(k)} = (1+\beta)\, x^{(k)} - \beta\, x^{(k-1)},$

for $k \ge 0$, starting from the initial values $x^{(0)} = x^{(-1)}$ for each node. Here, $\alpha$ is the stepsize and $\beta$ is the momentum parameter. Note that for $\beta = 0$, D-ASG reduces to the D-SG algorithm. When there is a single node, i.e., $N = 1$, D-ASG also reduces to Nesterov's (non-distributed) accelerated stochastic gradient algorithm (ASG) (Nesterov (2004)). Note that this algorithm is also inexact, in the sense that in the noiseless setting both $x^{(k)}$ and $y^{(k)}$ converge to the same point $x_\infty$, the fixed point of the distributed gradient (DG) algorithm defined by (10).
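Under the same conventions as the d_sg sketch above, a minimal sketch of the D-ASG update (11)-(12); setting beta=0 recovers d_sg:

```python
import numpy as np

def d_asg(x0, W, noisy_grads, alpha=0.05, beta=0.6, iters=1000):
    """Distributed accelerated stochastic gradient (11): Nesterov extrapolation
    y = (1 + beta) x^{(k)} - beta x^{(k-1)}, then a consensus + gradient step from y."""
    x_prev, x = x0.copy(), x0.copy()           # x^{(-1)} = x^{(0)}
    for _ in range(iters):
        y = (1.0 + beta) * x - beta * x_prev
        g = np.stack([ng(yi) for ng, yi in zip(noisy_grads, y)])
        x_prev, x = x, W @ y - alpha * g
    return x
```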

2.3 Convergence Rates and Robustness to Gradient Noise

Consider both the D-SG and D-ASG algorithms, subject to gradient noise satisfying Assumption 1. In this scenario, the noise is persistent, i.e., it does not decay over time, and it is possible that the limit of $\mathbb{E}\,\|x^{(k)} - x_\infty\|^2$ as $k \to \infty$ does not exist (even in the non-distributed setting; see Can et al. (2019)). Therefore, one natural way of defining the robustness of an algorithm to gradient noise (there are other possible ways to define a robustness measure; see e.g. Aybat et al. (2018)) is to consider the worst-case limiting variance along all possible subsequences, i.e.,

(13)  $\mathcal{J} := \limsup_{k \to \infty}\, \frac{1}{\sigma^2}\, \mathbb{E}\,\big\| x^{(k)} - x_\infty \big\|^2.$

This measure has been considered in control theory under the name of the $H_2$ norm of a dynamical system such as (7), and it was recently applied to optimization to develop noise-robust non-distributed algorithms (Aybat et al., 2018). It is equal to the ratio of the output variance to the input noise variance (the variance of the noise in the worst case); therefore, it can be interpreted as a signal-to-noise ratio (SNR) measure quantifying how robust the algorithm is to white noise. In the next sections, we will provide bounds on the robustness level $\mathcal{J}$ and the expected distance to both the fixed point $x_\infty$ and the optimum $x_*$ for the D-SG and D-ASG algorithms. In particular, in the non-distributed setting, it is known that ASG can be less robust to noise than gradient descent (Hardt (2014); Aybat et al. (2018)); we will later obtain bounds in Section 2.3.3 for the robustness of D-ASG and D-SG which suggest a similar behavior in the distributed setting when the stepsize is small enough.
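The measure (13) can also be probed empirically by Monte Carlo: run many independent noisy trajectories to a long horizon and average the squared distance to the noiseless fixed point, normalized by the injected noise variance. A rough sketch under these assumptions (run_algo and x_inf are placeholders for one of the routines above and the fixed point of (10)):

```python
import numpy as np

def estimate_robustness(run_algo, x_inf, sigma2, n_runs=200):
    """Monte Carlo proxy for (13): sample E||x^{(k)} - x_inf||^2 at a large k
    over independent noise realizations and divide by the noise variance."""
    sq = [np.sum((run_algo() - x_inf) ** 2) for _ in range(n_runs)]
    return np.mean(sq) / sigma2

# e.g. J_dsg = estimate_robustness(lambda: d_sg(x0, W, noisy_grads, iters=5000),
#                                  x_inf, sigma2=0.01)
```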

For analysis purposes, we consider the penalized objective function $F_\alpha : \mathbb{R}^{Nd} \to \mathbb{R}$ defined as

(14)  $F_\alpha(x) := F(x) + \frac{1}{2\alpha}\, x^\top \big( I_{Nd} - \mathcal{W} \big)\, x.$

Similar penalized objectives have also been considered in the past to analyze deterministic algorithms (see e.g. (Yuan et al., 2016, Section 2), Mansoori and Wei (2017)). It can be seen that its gradient (with respect to $x$) is $\nabla F_\alpha(x) = \nabla F(x) + \frac{1}{\alpha}\big(I_{Nd} - \mathcal{W}\big)\,x$. Since $0 \preceq I_{Nd} - \mathcal{W} \preceq \big(1 - \lambda_N(W)\big)\, I_{Nd}$, we also have

(15)  $F_\alpha \in \mathcal{S}_{\mu, L_\alpha}\big(\mathbb{R}^{Nd}\big) \quad \text{with} \quad L_\alpha := L + \frac{1 - \lambda_N(W)}{\alpha}.$

Furthermore, the unique minimum of $F_\alpha$ satisfies the first-order conditions

$\nabla F(x) + \frac{1}{\alpha}\big( I_{Nd} - \mathcal{W} \big)\, x = 0.$

Then, it follows from (10) that the minimum of $F_\alpha$ coincides with the limit point $x_\infty$. In fact, we can re-write the D-SG iterations (7) as

(16)  $x^{(k+1)} = x^{(k)} - \alpha\, \tilde\nabla F_\alpha\big(x^{(k)}\big), \qquad \tilde\nabla F_\alpha(x) := \tilde\nabla F(x) + \frac{1}{\alpha}\big(I_{Nd} - \mathcal{W}\big)\,x,$

which is equivalent to running a non-distributed stochastic gradient algorithm for minimizing the alternative objective $F_\alpha$ in dimension $Nd$. We can also re-write the D-ASG iterations (12) as

(17)  $x^{(k+1)} = y^{(k)} - \alpha\, \tilde\nabla F_\alpha\big(y^{(k)}\big), \qquad y^{(k)} = (1+\beta)\, x^{(k)} - \beta\, x^{(k-1)}.$

These iterations are identical to the iterations of (non-distributed) ASG applied to $F_\alpha$. In other words, D-ASG applied to solve problem (1) in dimension $d$ is equivalent to running a non-distributed ASG algorithm for minimizing the alternative objective $F_\alpha$ in dimension $Nd$.

This connection allows us to analyze both D-SG and D-ASG with existing techniques developed for non-distributed algorithms in Aybat et al. (2018, 2019), which build on a dynamical system representation of optimization algorithms.
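This equivalence can be sanity-checked in a few lines: with $\tilde\nabla F_\alpha$ as in (16), one D-SG step (7) and one stepsize-$\alpha$ gradient step on the penalized objective coincide algebraically. A sketch, with W_kron standing for $\mathcal{W} = W \otimes I_d$ (names are illustrative):

```python
import numpy as np

def dsg_step(x, W_kron, alpha, grad_F):
    """One D-SG step (7) on the stacked vector x."""
    return W_kron @ x - alpha * grad_F(x)

def penalized_sg_step(x, W_kron, alpha, grad_F):
    """One plain (stochastic) gradient step on F_alpha, cf. (16)."""
    I = np.eye(W_kron.shape[0])
    grad_F_alpha = grad_F(x) + (I - W_kron) @ x / alpha
    return x - alpha * grad_F_alpha

# identical by construction:  W x - a g  =  x - a * (g + (I - W) x / a)
```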

2.3.1 Dynamical system representation

We first reformulate the D-SG (16) and D-ASG (17) update rules as a discrete-time dynamical system:

(18)  $\xi^{(k+1)} = A\, \xi^{(k)} + B\, \tilde\nabla F_\alpha\big( C\, \xi^{(k)} \big),$

where $\xi^{(k)}$ is the state and $A$, $B$, $C$ are appropriately chosen system matrices. For example, we can represent the D-SG iterates (16) with the choice of

(19)  $\xi^{(k)} = x^{(k)}, \qquad A = C = I_{Nd}, \qquad B = -\alpha\, I_{Nd}.$

Similarly, we can represent the D-ASG iterations (17) as the dynamical system (18) with

(20)  $A = \begin{bmatrix} (1+\beta)\, I_{Nd} & -\beta\, I_{Nd} \\ I_{Nd} & 0_{Nd \times Nd} \end{bmatrix}, \qquad B = \begin{bmatrix} -\alpha\, I_{Nd} \\ 0_{Nd \times Nd} \end{bmatrix}, \qquad C = \begin{bmatrix} (1+\beta)\, I_{Nd} & -\beta\, I_{Nd} \end{bmatrix},$

and where the state is

(21)  $\xi^{(k)} = \begin{bmatrix} x^{(k)} \\ x^{(k-1)} \end{bmatrix}$

(see also Lessard et al. (2016) for such a dynamical system representation in the deterministic case).

For studying the dynamical system (18), we introduce the following Lyapunov function

(22)  $V_P\big(\xi^{(k)}\big) := \big(\xi^{(k)} - \xi_\infty\big)^\top P\, \big(\xi^{(k)} - \xi_\infty\big) + c\,\Big( F_\alpha\big(T\xi^{(k)}\big) - F_\alpha\big(x_\infty\big) \Big),$

where $c \ge 0$ is a scalar, $P$ is a positive semi-definite matrix, $T$ is a fixed matrix that will be specified later, and $\xi_\infty$ is the state corresponding to the fixed point $x_\infty$. Since $x_\infty$ is the minimum of $F_\alpha$, we observe that $V_P$ takes non-negative values. In particular, $V_P(\xi_\infty) = 0$. In the special case when $P$ is the identity matrix and $c = 0$, we obtain

$V_P\big(\xi^{(k)}\big) = \big\| \xi^{(k)} - \xi_\infty \big\|^2.$

In the next section, we obtain convergence results for D-SG and D-ASG for constant stepsize and momentum, which also imply guarantees on the robustness measure $\mathcal{J}$. The analysis is based on studying the Lyapunov function (22) for different choices of the matrix $P$ and the scalar $c$. In particular, for D-SG we can choose $P$ to be the identity matrix and $c = 0$; however, for D-ASG the choice of $P$ is less trivial and in general depends on the choice of the stepsize $\alpha$ and the momentum parameter $\beta$. Here, our choice of the Lyapunov function (22) is motivated by Fazlyab et al. (2018), which studied this Lyapunov function to analyze accelerated gradient methods in the centralized deterministic setting.
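To make the representation concrete, the following sketch assembles the D-ASG matrices in (20)-(21) as reconstructed above and checks that one pass of (18), with a linear test gradient, reproduces one D-ASG step (17); all names are illustrative.

```python
import numpy as np

def dasg_system(Nd, alpha, beta):
    """System matrices of (20): state ξ_k = [x^(k); x^(k-1)], input y_k = C ξ_k,
    update ξ_{k+1} = A ξ_k + B g(C ξ_k) with g a (noisy) gradient of F_alpha."""
    I, Z = np.eye(Nd), np.zeros((Nd, Nd))
    A = np.block([[(1 + beta) * I, -beta * I], [I, Z]])
    B = np.vstack([-alpha * I, Z])
    C = np.hstack([(1 + beta) * I, -beta * I])
    return A, B, C

# one step of (18) vs. one step of (17), for a linear test gradient g(y) = H y
Nd, alpha, beta = 6, 0.1, 0.5
A, B, C = dasg_system(Nd, alpha, beta)
H = np.diag(np.linspace(1.0, 2.0, Nd))
xi = np.arange(2 * Nd, dtype=float)
x, x_prev = xi[:Nd], xi[Nd:]
y = (1 + beta) * x - beta * x_prev
assert np.allclose((A @ xi + B @ (H @ (C @ xi)))[:Nd], y - alpha * H @ y)
```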

2.3.2 Analysis of Distributed Stochastic Gradient

We next provide a performance bound for D-SG in Theorem 2.3.2. It shows that the expected squared distance to the fixed point $x_\infty$ can be bounded as a sum of two terms: (i) a bias term that depends on the initialization and decays at a linear rate $\rho_\alpha^{2k}$, where $\rho_\alpha$ is defined below; and (ii) a variance term that scales linearly with the noise level $\sigma^2$, providing a bound on the asymptotic variance and hence on the robustness level $\mathcal{J}$. When there is no noise ($\sigma = 0$), the variance term is zero, and we obtain a linear convergence rate for the (deterministic) DG algorithm with rate $\rho_\alpha^2$. This improves the previously best known convergence rate for DG obtained in (Yuan et al., 2016), whose guaranteed rate constant can get arbitrarily close to $1$; see Theorem 7 in (Yuan et al., 2016). We also note that the convergence rate and robustness bounds we provide in Theorem 2.3.2 are tight for D-SG, in the sense that they are attained for some quadratic choices of the objective (see Remark C.1 in Appendix C).

For proving Theorem 2.3.2, we exploit the above-mentioned fact that running D-SG on the objective $F$ is equivalent to running (non-distributed) SG on the modified objective $F_\alpha$, and we build on existing results for non-distributed stochastic gradient (Aybat et al., 2018, Prop. 4.3); the proof is given in the appendix.

Theorem 2.3.2. Consider running the D-SG method with stepsize $\alpha \in (0, 2/L_\alpha)$, where $L_\alpha$ is given in (15). Then, for every $k \ge 0$,

(23)  $\mathbb{E}\,\big\| x^{(k)} - x_\infty \big\|^2 \;\le\; \rho_\alpha^{2k}\, \big\| x^{(0)} - x_\infty \big\|^2 + \frac{\alpha^2 N \sigma^2}{1 - \rho_\alpha^2},$

where $\rho_\alpha := \max\{ |1 - \alpha\mu|,\, |1 - \alpha L_\alpha| \} < 1$. As a result, the robustness of the D-SG method satisfies

$\mathcal{J}_{\mathrm{D\text{-}SG}} \;\le\; \frac{\alpha^2 N}{1 - \rho_\alpha^2}.$

We recall that the penalized objective $F_\alpha$ depends on the network and the stepsize, and that the fixed point $x_\infty$ is the minimum of the penalized objective $F_\alpha$. In general, the difference $\|x_\infty - x_*\|$ is not zero, and it depends on the network structure and the stepsize $\alpha$. We call this term the "network effect"; it can be controlled by the inequality (8). The following corollary is obtained by a direct application of the inequality (8) to Theorem 2.3.2.

Corollary 2.3.2. Consider running the D-SG method with stepsize $\alpha \in (0, 2/L_\alpha)$. Then, for every $k \ge 0$,

(24)  $\mathbb{E}\,\big\| x^{(k)} - x_* \big\|^2 \;\le\; 2\,\rho_\alpha^{2k}\, \big\| x^{(0)} - x_\infty \big\|^2 + \frac{2\,\alpha^2 N \sigma^2}{1 - \rho_\alpha^2} + 2\,\big\| x_\infty - x_* \big\|^2,$

which implies that the robustness of the D-SG method, measured with respect to $x_*$, satisfies the same bound up to the additive network-effect term. In addition, if $\alpha$ also satisfies the conditions of (8), we have

(25)  $\mathbb{E}\,\big\| x^{(k)} - x_* \big\|^2 \;\le\; 2\,\rho_\alpha^{2k}\, \big\| x^{(0)} - x_\infty \big\|^2 + \frac{2\,\alpha^2 N \sigma^2}{1 - \rho_\alpha^2} + 2 N D^2\, \frac{\alpha^2}{(1 - \rho)^2},$

where $D$ and $\rho$ are given in (8).

2.3.3 Analysis of Distributed Accelerated Stochastic Gradient

Throughout this section, we state the results under the following assumption.

Assumption 2.3.3. All eigenvalues of $W$ are positive, i.e., $\lambda_N(W) > 0$.

We note that Assumption 2.3.3 is not restrictive, in the sense that even if the weight matrix $W$ does not satisfy this assumption, we can still apply the results in our paper by considering the modified weight matrix $\tilde{W} := (I_N + W)/2$ instead of $W$. Indeed, we have $\lambda_i(\tilde{W}) = (1 + \lambda_i(W))/2 > 0$ for every $i$, and therefore $\tilde{W}$ satisfies Assumption 2.3.3.

The following result extends Aybat et al. (2018) from non-distributed ASG to D-ASG.

Theorem 2.3.3. Assume there exist $\rho \in (0, 1)$ and a positive semi-definite matrix $P$ satisfying the matrix inequality

(26)

where $A$, $B$ and $C$ are defined in (20)-(21). Let $V_P$ be the Lyapunov function (22) associated with this $P$. Then, for every $k \ge 0$,

(27)

Therefore, the robustness of the D-ASG iterations, defined in (13), satisfies the bound obtained by letting $k \to \infty$ in (27). In addition, if $\alpha$ also satisfies the conditions of (8), we obtain an analogous bound on $\mathbb{E}\,\|x^{(k)} - x_*\|^2$, where $D$ and $\rho$ are given in (8).

The results in Theorem 2.3.3 are stated in terms of a matrix $P$ which solves the matrix inequality (26). For any fixed $\rho$, $\alpha$ and $\beta$, (26) is a linear matrix inequality (LMI) in $P$; therefore, we can compute the best certified rate numerically by varying $\rho$, $\alpha$ and $\beta$ on a grid and then solving the resulting LMIs with a software package such as CVX (Grant et al. (2008)) (see also Lessard et al. (2016) for a similar approach). However, in the next result, we obtain explicit performance bounds in the special case when $\beta = \frac{1 - \sqrt{\alpha\mu}}{1 + \sqrt{\alpha\mu}}$; this choice of $\beta$ is motivated by the fact that it is a common choice in the non-distributed and noiseless setting (furthermore, it can be shown that it gives the fastest rate for quadratic objectives in the non-distributed case when there is no noise; see Aybat et al. (2019)). The proof is deferred to the appendix; it is based on the fact that for this choice of $\beta$, an explicit solution $\tilde{P}$ to the matrix inequality (26) is available.
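The grid-search recipe can be prototyped with cvxpy (the Python analogue of CVX). The sketch below is illustrative only: it certifies a rate $\rho$ through the simplest Lyapunov contraction LMI $A^\top P A \preceq \rho^2 P$, standing in for the full inequality (26), whose exact blocks are problem-specific.

```python
import cvxpy as cp
import numpy as np

def smallest_certified_rate(A, rho_grid):
    """For each candidate rho (ascending), test feasibility of an LMI in P;
    return the first rho for which a certificate P exists."""
    n = A.shape[0]
    for rho in sorted(rho_grid):
        P = cp.Variable((n, n), symmetric=True)
        constraints = [P >> np.eye(n), A.T @ P @ A << rho**2 * P]
        problem = cp.Problem(cp.Minimize(0), constraints)
        problem.solve(solver=cp.SCS)
        if problem.status in ("optimal", "optimal_inaccurate"):
            return rho, P.value
    return None, None
```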

Then, plugging $P = \tilde{P}$ into Theorem 2.3.3 and into the bound (27), we obtain performance guarantees in terms of the Lyapunov function $V_{\tilde{P}}$. To simplify the notation in this case, with a slight abuse of notation, we let

(28)

We have the following explicit performance bounds on the convergence and the robustness of D-ASG.

Theorem. Consider running the D-ASG method with $\beta = \frac{1 - \sqrt{\alpha\mu}}{1 + \sqrt{\alpha\mu}}$ and