Robust Distributed Accelerated Stochastic Gradient Methods for Multi-Agent Networks
Abstract
We study the distributed stochastic gradient (DSG) method and its accelerated variant (DASG) for solving decentralized strongly convex stochastic optimization problems in which the objective function is distributed over several computational units lying on a fixed but arbitrary connected communication graph, subject to local communication constraints, and only noisy estimates of the gradients are available. We develop a framework that allows choosing the stepsize and the momentum parameters of these algorithms so as to optimize performance by systematically trading off the bias, the variance, and the dependence on network effects. When gradients do not contain noise, we also prove that DASG can achieve acceleration, in the sense that it requires $\mathcal{O}(\sqrt{\kappa}\log(1/\varepsilon))$ gradient evaluations and communications to converge to the same fixed point as the nonaccelerated variant, where $\kappa$ is the condition number and $\varepsilon$ is the target accuracy. For quadratic functions, we also provide finer performance bounds that are tight with respect to the bias and variance terms. Finally, we study a multistage version of DASG with parameters carefully varied over stages to ensure exact convergence to the optimal solution; it achieves optimal accelerated linear decay in the bias term as well as an optimal $\mathcal{O}(\sigma^2/k)$ rate in the variance term. We illustrate through numerical experiments that our approach results in practical accelerated algorithms that are robust to gradient noise and that can outperform existing methods.
Fallah, Gürbüzbalaban, Ozdaglar, Şimşekli and Zhu
Keywords: Distributed Optimization, Accelerated Methods, Stochastic Optimization, Robustness, Multi-Agent Networks
1 Introduction
Advances in sensing and processing technologies, communication capabilities, and smart devices have enabled the deployment of systems in which a massive amount of data is collected by many distributed autonomous units to make decisions. Examples include a set of sensors collecting and processing information about a time-varying spatial field (e.g., to monitor temperature levels or chemical concentrations), a collection of mobile robots performing dynamic tasks spread over a region, community-based traffic and navigation systems (such as Waze), and autonomous cars providing real-time traffic information and guidance for drivers. In such systems, most of the information is collected in a decentralized, distributed manner, and the processing of information has to go hand in hand with its communication and sharing across these units over an undirected network defined by a set of agents (computational units) connected by edges. In such a setting, we consider a group of $n$ agents (i.e., the nodes) collaboratively solving the following optimization problem:
$$\min_{x \in \mathbb{R}^d} \; f(x) := \frac{1}{n}\sum_{i=1}^{n} f_i(x), \qquad (1)$$
where each $f_i$ is known by agent $i$ only and is therefore referred to as its local objective function. We assume each $f_i$ is strongly convex with Lipschitz gradients (hence $f$ is also strongly convex with a Lipschitz gradient, and we refer to $\kappa$ as its condition number). We also use $x^*$ to denote the unique optimal solution of (1). In addition, we denote the local model of node $i$ at iteration $k$ by $x_i^k$.
We consider the setting where each agent has access to noisy estimates of the actual gradients satisfying the following assumption: {assumption} Recall that $x_i^k$ denotes the decision variable of node $i$ at iteration $k$. We assume that at iteration $k$, node $i$ has access to $\tilde{\nabla} f_i(x_i^k, w_i^k)$, which is an unbiased estimate of $\nabla f_i(x_i^k)$, where $w_i^k$ is a random variable independent of the past iterates and past noise realizations. Moreover, we assume the estimates have bounded variance, i.e., $\mathbb{E}\big[\|\tilde{\nabla} f_i(x_i^k, w_i^k) - \nabla f_i(x_i^k)\|^2 \,\big|\, x_i^k\big] \le \sigma^2$.
To simplify the notation, we suppress the $w_i^k$ dependence and denote the estimate by $\tilde{\nabla} f_i(x_i^k)$. This setting arises naturally in distributed learning problems where $f_i$ represents the expected loss over independent data points collected at node $i$ (see e.g. Pu and Nedić (2018); Lan et al. (2017); Olshevsky et al. (2019)). In this setting, the gradient of the loss at a sampled data point is an unbiased estimator of $\nabla f_i$, which we assume satisfies the bounded variance condition of Assumption 1. Note that in our setting, a master node that can coordinate the computations is not available, unlike the master/slave architecture studied in the literature (see e.g. Mishchenko et al. (2018); Agarwal and Duchi (2011); Hakimi et al. (2019); Lee et al. (2018); Meng et al. (2016); Jaggi et al. (2014); Xin and Khan (2018)). Furthermore, our setting covers arbitrary network topologies, which is more general than particular topologies such as the complete graph or the ring graph.
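As a concrete (hypothetical) illustration of this oracle model, consider a node whose local objective is the empirical average of quadratic losses over its own data; sampling a single data point uniformly at random yields an unbiased gradient estimate with bounded variance, as required by Assumption 1. The data values and function names below are our own illustrative choices, not the paper's.

```python
import random

# Hypothetical illustration of Assumption 1: node i holds data points z and a
# local loss f_i(x) = (1/m) * sum_z 0.5 * (x - z)^2. Sampling one data point
# uniformly yields an unbiased gradient estimate with bounded variance.
data_i = [0.5, 1.5, 2.0, 4.0]  # local data at node i (mean 2.0)

def full_grad(x):
    # exact gradient of f_i: the average of (x - z) over the local data
    return sum(x - z for z in data_i) / len(data_i)

def stoch_grad(x, rng):
    # unbiased stochastic gradient: evaluate at one sampled data point
    return x - rng.choice(data_i)

rng = random.Random(0)
x = 3.0
est = sum(stoch_grad(x, rng) for _ in range(200000)) / 200000
# est concentrates around full_grad(3.0) = 3.0 - 2.0 = 1.0
```

Averaging many oracle calls recovers the exact gradient, which is the unbiasedness property the analysis relies on.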
Deterministic variants of problem (1) have been studied extensively in the literature. Much of the work builds on the Distributed Gradient (DG) method proposed in Nedic and Ozdaglar (2009), where each agent keeps a local estimate of the optimal solution of (1) and updates it by combining a weighted average of its neighbors' estimates with a gradient step (scaled by the stepsize $\alpha$) of the local objective function. Nedic and Ozdaglar (2009) analyzed the case with convex and possibly nonsmooth local objective functions, constant stepsize $\alpha$, and agents linked over an undirected connected graph, and showed that the ergodic average of the agents' local estimates converges at rate $\mathcal{O}(1/k)$ to an $\mathcal{O}(\alpha)$ neighborhood of the optimal solution of problem (1) (where $k$ denotes the number of iterations). Yuan et al. (2016) considered this algorithm for the case where the local functions are smooth, i.e., the gradients $\nabla f_i$ are Lipschitz continuous, and the $f_i$ are either convex, restricted strongly convex, or strongly convex. For the convex case, they show that the network-wide mean estimate converges at rate $\mathcal{O}(1/k)$ to an $\mathcal{O}(\alpha)$ neighborhood of the optimal solution, and for the strongly convex case, all local estimates converge at a linear rate to an $\mathcal{O}(\alpha)$ neighborhood of $x^*$.^{1} (^{1}For two real-valued functions $g$ and $h$, we say $g = \mathcal{O}(h)$ if there exists a positive constant $c$ such that $g \le c\,h$ for every sufficiently large argument in the domain of $g$ and $h$.)
There have been many recent works on developing new distributed deterministic algorithms with faster convergence rates and exact convergence to the optimal solution $x^*$. We start by summarizing the literature in this area that is most relevant to this work. First, Shi et al. (2015) provide a novel algorithm, which can be viewed as a primal-dual algorithm for the constrained reformulation of problem (1) (see Mokhtari and Ribeiro (2016) for this interpretation), that achieves exact linear convergence to the optimal solution. Second, Qu and Li (2018) propose to modify the DG method so that agents also maintain, exchange, and combine estimates of the gradient of the global objective of (1). This update is based on a technique called "gradient tracking" (see e.g. Di Lorenzo and Scutari (2015, 2016)), which enables better control of the global gradient direction and yields a linear rate of convergence to the optimal solution (see Jakovetić (2019) for a unified analysis of these two methods). In a follow-up paper, Qu and Li (2017) also considered an acceleration of their algorithm and achieved a linear convergence rate to the optimal solution. To the best of our knowledge, whether an accelerated primal variant of the DG algorithm can achieve the nondistributed linear rate to a neighborhood of the optimum with $\sqrt{\kappa}$ dependence has been an open problem. Alternative distributed first-order methods besides DG have also been studied. In particular, if additional assumptions are made, such as an explicit characterization of the Fenchel duals of the local objective functions, referred to as the dualable setting (as in Scaman et al. (2018); Uribe et al. (2018)), then it is known that the multistep dual accelerated (MSDA) method of Scaman et al. (2018) achieves a linear rate to the optimum with $\sqrt{\kappa}$ dependence.
For deterministic distributed optimization problems with smooth and strongly convex objectives, Dvinskikh and Gasnikov (2019) proposed the PSTM algorithm and provided accelerated convergence guarantees. Recently, Scaman et al. (2019) provided lower bounds that match the upper bounds of Dvinskikh and Gasnikov (2019) up to logarithmic factors (see also Scaman et al. (2019) for a discussion of deterministic optimal algorithms under different assumptions: Lipschitz continuity, strong convexity, smoothness, and a combination of strong convexity and smoothness).
This paper focuses on the Distributed Stochastic Gradient (DSG) method (a stochastic version of the DG method) and its momentum-enhanced variant, the Distributed Accelerated Stochastic Gradient (DASG) method. These methods are relevant for solving distributed learning problems and are natural decentralized versions of stochastic gradient descent and its variant based on Nesterov's momentum averaging (Nesterov (2004); Can et al. (2019)). In this paper, we focus on strongly convex and smooth objectives. Several works have studied DSG under these assumptions, although DASG remains relatively understudied except in the deterministic case (see e.g. Jakovetić et al. (2014); Xi et al. (2017); Li et al. (2018); Qu and Li (2016)). We summarize the existing convergence rate results for DSG in Table 1.^{2} (^{2}See also Shamir and Srebro (2014) for a different noise model than ours in the minibatch setting, where each objective can be expressed as a finite sum.) Among these, Rabbat (2015) studied composite stochastic optimization problems and established convergence rates for DSG and its mirror descent variant. Koloskova et al. (2019) studied decentralized stochastic gradient algorithms when the nodes compress (e.g. quantize or sparsify) their updates. Olshevsky et al. (2019) provided an asymptotic network-independent sublinear rate. In our approach, we use a dynamical system representation of these iterative algorithms (presented in Lessard et al. (2016) and further used in Hu and Lessard (2017); Aybat et al. (2018, 2019)) to provide rate estimates for the convergence of the local agent iterates to a neighborhood of the optimal solution of problem (1).
Our bounds are presented in terms of three components: (i) a bias term that shows the decay rate of the initialization error (i.e., the distance of the initial estimates to the optimal solution) independent of gradient noise, (ii) a variance term that depends on the error level of the local objective functions' gradients, measuring the "robustness" of the algorithm to noise (in a sense that we will make precise later), and (iii) a network effect that highlights the dependence on the structure of the network. In this paper, in addition to the convergence analysis of DSG and DASG, our purpose is to study the tradeoffs and interplays between these three terms that affect performance.
Table 1: Summary of convergence rate results for DSG and related algorithms from the literature and from this paper (for each algorithm, its convergence rate and additional assumptions).
: The authors analyze a DSG method with a slightly different update than ours.
: The authors make an extra assumption.
Contributions. We have three sets of contributions.
First, we study the convergence rate of DSG with constant stepsize which is used in many practical applications (Alghunaim and Sayed (2019, 2018); Dieuleveut et al. (2017)). Our bounds provide tighter guarantees on the bias term as well as novel guarantees on the variance term for this algorithm. For quadratic functions, we provide sharper estimates for the bias, variance, and network effect terms that are tight, as there exist simple quadratic functions that achieve these bounds.
Second, we consider DASG with constant stepsize. We show that the bias term decays linearly, at an accelerated rate, to a neighborhood of the optimal solution. We also provide an explicit characterization of this neighborhood in terms of the noise and network structure parameters, with the variance term dominating for small enough stepsize. When the objectives are all quadratic, we obtain nonasymptotic guarantees that are explicit in terms of the linear convergence rate and the dependence on noise, generalizing known guarantees for ASG to the distributed setting (Can et al. (2019)).
For both algorithms, following earlier work on their nondistributed versions (Aybat et al. (2018)), we use our explicit characterization of the bias, variance, and network effect terms to provide a computational framework that chooses the algorithm parameters to trade off these different effects in a systematic manner. In the centralized setting, it has been observed and argued that accelerated algorithms are often more sensitive to noise than nonaccelerated algorithms (see e.g. Flammarion and Bach (2015); d'Aspremont (2008); Aybat et al. (2019); Hardt (2014)); however, to our knowledge this behavior has not been systematically studied in the context of decentralized algorithms. We study the asymptotic variance of the DSG and DASG iterates as a measure of robustness to random gradient noise and provide explicit expressions for this quantity for quadratic objectives, as well as upper bounds for strongly convex objectives. This allows us to compare DSG and DASG in terms of their robustness to random noise. Our results (see the discussion after Theorem 2.3.3) show that DASG can indeed be less robust than DSG depending on the choice of the momentum and stepsize parameters, shedding further light on the tuning of hyperparameters (stepsize and momentum) in the distributed setting.
Finally, we study a multistage version of DASG, building on the nondistributed method in Aybat et al. (2019), whereby a distributed accelerated stochastic gradient method with constant stepsize and momentum parameter is used at every stage, with the parameters carefully varied over stages to ensure exact convergence to the optimal solution $x^*$. Similar to Aybat et al. (2019), a momentum restart is used to enable stitching together the improvements obtained over consecutive stages. We show that our proposed method achieves optimal accelerated linear decay in the bias term as well as an optimal rate in the variance term and in the network effect term, where the spectral gap of the network is formally defined via (8). Such an optimal dependence was obtained previously for the PBSTM algorithm of Dvinskikh and Gasnikov (2019), which is optimal up to logarithmic terms. However, to the best of our knowledge, this is the first result establishing such an optimal dependence for the DASG algorithm. A summary of all our convergence results is provided in Table 1.
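To give intuition for the multistage idea, the following nondistributed, scalar sketch runs an accelerated stochastic gradient loop in stages, restarting the momentum and shrinking the stepsize between stages: large early steps kill the bias quickly, while smaller later steps suppress the variance. All parameter choices (stage lengths, stepsize schedule, momentum value) are our own illustrations, not the tuned schedule of the paper.

```python
import random

# Schematic of the multistage idea on f(x) = 0.5 * x^2 with noisy gradients:
# constant stepsize within each stage, momentum restart and a geometrically
# shrinking stepsize across stages. All parameter values are illustrative.
def asg_stage(x0, alpha, beta, iters, rng, sigma=1.0):
    x_prev, x = x0, x0  # momentum restart: both states reset to x0
    for _ in range(iters):
        y = (1 + beta) * x - beta * x_prev
        grad = y + sigma * rng.gauss(0.0, 1.0)  # noisy gradient of 0.5 * y^2
        x_prev, x = x, y - alpha * grad
    return x

rng = random.Random(1)
x, alpha = 5.0, 0.5
for stage in range(6):
    x = asg_stage(x, alpha, beta=0.5, iters=200, rng=rng)
    alpha /= 4.0  # smaller stepsize in later stages -> smaller variance
# x ends up close to the optimum 0 despite persistent gradient noise
```

The early stages remove the initial error of size 5, and each stepsize reduction shrinks the stationary variance of the subsequent stage, mirroring the bias/variance tradeoff discussed above.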
Notation. Let $\mathcal{S}_{\mu,L}(\mathbb{R}^d)$ denote the set of functions from $\mathbb{R}^d$ to $\mathbb{R}$ that are $\mu$-strongly convex and $L$-smooth, that is, for every $x, y \in \mathbb{R}^d$,
$$\frac{\mu}{2}\|y - x\|^2 \;\le\; f(y) - f(x) - \langle \nabla f(x),\, y - x\rangle \;\le\; \frac{L}{2}\|y - x\|^2,$$
where we have the condition number $\kappa := L/\mu$. Let $0_{m,n}$ denote the zero matrix with $m$ rows and $n$ columns. Given a collection of square matrices $A_1, \dots, A_n$, the matrix $\mathrm{diag}(A_1, \dots, A_n)$ denotes the block diagonal square matrix with $i$-th diagonal block equal to $A_i$. For two matrices $A$ and $B$, we denote their Kronecker product by $A \otimes B$. For two functions $g, h$ defined over positive integers, we say $g = \mathcal{O}(h)$ if there exist a constant $c > 0$ and a positive integer $n_0$ such that $g(n) \le c\, h(n)$ for every positive integer $n \ge n_0$.
2 Distributed Stochastic Gradient and Its Accelerated Variant
We will first study the distributed stochastic gradient (DSG) method which is the stochastic version of the distributed gradient (DG) method introduced in Nedic and Ozdaglar (2009), and then focus on its accelerated variant.
Consider a network $\mathcal{G} = (\mathcal{V}, \mathcal{E})$ that is connected, where $\mathcal{V}$ denotes the set of vertices and $\mathcal{E}$ the set of edges. We associate this network with an $n \times n$ symmetric, doubly stochastic weight matrix $W$: we have $W_{ij} > 0$ if $(i,j) \in \mathcal{E}$ or $i = j$, $W_{ij} = 0$ otherwise, and $W_{ii} > 0$ for every $i$.^{3} (^{3}We adopt the convention that a node is a neighbor of itself, i.e. $i \in \Omega_i$.) The eigenvalues of $W$, ordered in a descending manner, satisfy
$$-1 < \lambda_n(W) \le \cdots \le \lambda_2(W) < \lambda_1(W) = 1.$$
Such a matrix always exists (see e.g. Boyd et al. (2006)) if the graph is not bipartite, and there can be different choices of $W$ (Shi et al. (2015)). For bipartite graphs, one can also construct such a matrix by considering the transition matrix of a lazy random walk on the graph (see e.g. Chung (1997)).
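For concreteness, one standard construction of such a $W$ is the Metropolis-Hastings rule, which is symmetric and doubly stochastic by construction (see e.g. Boyd et al. (2006)); the sketch below applies it to an arbitrary illustrative graph, a 4-cycle.

```python
# Sketch: Metropolis-Hastings weights for an undirected graph give a
# symmetric, doubly stochastic W with positive self-weights.
def metropolis_weights(n, edges):
    deg = [0] * n
    for i, j in edges:
        deg[i] += 1
        deg[j] += 1
    W = [[0.0] * n for _ in range(n)]
    for i, j in edges:
        W[i][j] = W[j][i] = 1.0 / (1 + max(deg[i], deg[j]))
    for i in range(n):
        W[i][i] = 1.0 - sum(W[i])  # self-weight so each row sums to one
    return W

W = metropolis_weights(4, [(0, 1), (1, 2), (2, 3), (3, 0)])  # 4-cycle
row_sums = [sum(row) for row in W]  # each row (and by symmetry, column) is 1
```

By symmetry of the weights, row-stochasticity immediately implies double stochasticity, which is what the averaging steps of the algorithms below require.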
Next, we make a few definitions for the sake of subsequent analysis. First, define the average iterate
$$\bar{x}^k := \frac{1}{n}\sum_{i=1}^{n} x_i^k. \qquad (2)$$
Next, we define the column vector
$$x^k := \big[(x_1^k)^\top, \dots, (x_n^k)^\top\big]^\top \in \mathbb{R}^{nd}, \qquad (3)$$
which concatenates the local decision variables into a single vector. We also define $x^*_{(n)}$ as
$$x^*_{(n)} := \mathbf{1}_n \otimes x^* = \big[(x^*)^\top, \dots, (x^*)^\top\big]^\top, \qquad (4)$$
which is the column vector of length $nd$ that concatenates $n$ copies of the optimizer $x^*$ of problem (1).
In addition, we define $F: \mathbb{R}^{nd} \to \mathbb{R}$ as
$$F(x) := \sum_{i=1}^{n} f_i(x_i), \qquad x = \big[x_1^\top, \dots, x_n^\top\big]^\top,$$
where the stacked stochastic gradient is
$$\tilde{\nabla} F(x^k) := \big[\tilde{\nabla} f_1(x_1^k)^\top, \dots, \tilde{\nabla} f_n(x_n^k)^\top\big]^\top,$$
which obeys
$$\mathbb{E}\big[\|\tilde{\nabla} F(x^k) - \nabla F(x^k)\|^2 \,\big|\, x^k\big] \le n\sigma^2 \qquad (5)$$
due to Assumption 1. Furthermore, $F$ is strongly convex and smooth.
2.1 Distributed stochastic gradient (DSG)
Recall that $x_i^k$ denotes the decision variable of node $i$ at iteration $k$. The DSG iterations update this variable by performing a stochastic gradient descent step with respect to the local cost function $f_i$, together with a weighted averaging with the decision variables of node $i$'s immediate neighbors $\Omega_i$:
$$x_i^{k+1} = \sum_{j \in \Omega_i} W_{ij}\, x_j^k - \alpha\, \tilde{\nabla} f_i(x_i^k), \qquad (6)$$
where $\alpha > 0$ is the stepsize. Note that we can express the DSG iterations as
$$x^{k+1} = \mathcal{W} x^k - \alpha\, \tilde{\nabla} F(x^k), \qquad (7)$$
where $\mathcal{W} := W \otimes I_d$.
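As a minimal illustration of the update (6)-(7), the following sketch runs DSG on scalar quadratics $f_i(x) = \frac{1}{2}(x - b_i)^2$ over a four-node ring; the weight matrix and local targets are illustrative choices. In the noiseless case the iterates settle near, but not exactly at, the optimum (here the mean of the $b_i$), consistent with the inexactness of constant-stepsize DG discussed next.

```python
import random

# Minimal sketch of the DSG update (6)-(7) on scalar quadratics
# f_i(x) = 0.5 * (x - b_i)^2 over a 4-node ring. The global optimum is the
# mean of the b_i, here 1.5. Weights and targets are illustrative.
b = [0.0, 1.0, 2.0, 3.0]
W = [[0.5, 0.25, 0.0, 0.25],
     [0.25, 0.5, 0.25, 0.0],
     [0.0, 0.25, 0.5, 0.25],
     [0.25, 0.0, 0.25, 0.5]]  # symmetric, doubly stochastic

def dsg(alpha, sigma, iters, seed=0):
    rng = random.Random(seed)
    x = [0.0] * 4
    for _ in range(iters):
        grads = [x[i] - b[i] + sigma * rng.gauss(0.0, 1.0) for i in range(4)]
        x = [sum(W[i][j] * x[j] for j in range(4)) - alpha * grads[i]
             for i in range(4)]
    return x

x = dsg(alpha=0.1, sigma=0.0, iters=2000)
# with sigma = 0 (plain DG), the iterates settle near, but not exactly at,
# the optimum 1.5: the constant-stepsize fixed point is biased
```

Note that the node average still converges to the optimum because $W$ is doubly stochastic; the bias appears in the individual coordinates.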
Without noise, i.e., when $\sigma = 0$, DSG reduces to the DG algorithm. In this case, Yuan et al. (2016) show that the DG algorithm is inexact in the sense that its iterates do not converge to the optimum in general with a constant stepsize, but instead converge linearly to a fixed point $x_i^\infty$ that lies in a neighborhood of the solution, satisfying
$$\|x_i^\infty - x^*\| = \mathcal{O}(\alpha) \quad \text{for every } i, \qquad (8)$$
where the hidden constant depends on the network structure and admits an explicit expression, provided that the stepsize satisfies certain conditions (Yuan et al., 2016) (see Lemma A in the Appendix for details).
Similar to (4), we define the column vector
$$x^\infty := \big[(x_1^\infty)^\top, \dots, (x_n^\infty)^\top\big]^\top, \qquad (9)$$
which concatenates the fixed points $x_i^\infty$ over all the nodes. It can be checked that the unique fixed point of (7) in the noiseless setting is the solution to
$$x^\infty = \mathcal{W} x^\infty - \alpha\, \nabla F(x^\infty). \qquad (10)$$
This means that the sequence $\|x^k - x^\infty\|$ converges to zero with an appropriate choice of the stepsize. The performance of the algorithm can then be measured by the distance of $x^\infty$ to $x^*_{(n)}$ given by (4).
2.2 Distributed accelerated stochastic gradient (DASG)
Consider the following variant of DSG:
$$x_i^{k+1} = \sum_{j \in \Omega_i} W_{ij}\, y_j^k - \alpha\, \tilde{\nabla} f_i(y_i^k), \qquad y_i^{k+1} = (1+\beta)\, x_i^{k+1} - \beta\, x_i^k, \qquad (11)$$
where $\alpha > 0$ is the stepsize and $\beta \ge 0$ is called the momentum parameter. This algorithm has also been considered in the literature by Jakovetić et al. (2014) in the noiseless setting.
We define the average iterate $\bar{x}^k$ and the column vector $x^k$ as in (2) and (3), respectively. Also, similar to (3), we define the column vector $y^k$ that concatenates the local variables $y_i^k$.
Then, we can rewrite the DASG iterates (11) as:
$$x^{k+1} = \mathcal{W} y^k - \alpha\, \tilde{\nabla} F(y^k), \qquad y^{k+1} = (1+\beta)\, x^{k+1} - \beta\, x^k, \qquad (12)$$
for $k \ge 0$, starting from the initial values $x_i^0 = y_i^0$ for each node $i$. Here, $\alpha$ is the stepsize and $\beta$ is the momentum parameter. Note that for $\beta = 0$, DASG reduces to the DSG algorithm. When there is a single node, i.e. $n = 1$, DASG also reduces to Nesterov's (nondistributed) accelerated stochastic gradient algorithm (ASG) (Nesterov (2004)). Note that this algorithm is also inexact in the sense that both $x^k$ and $y^k$ converge to the same point $x^\infty$ in the noiseless setting, where $x^\infty$ is the fixed point of the distributed gradient (DG) algorithm defined by (10).
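As a minimal illustration of the DASG iterations (11)-(12), the sketch below uses scalar quadratics $f_i(x) = \frac{1}{2}(x - b_i)^2$ on a four-node ring with illustrative weights, targets, stepsize, and momentum; gradients are noiseless here. Setting $\beta = 0$ recovers DSG exactly.

```python
# Minimal sketch of the DASG iterations (11)-(12): a Nesterov-style momentum
# step on top of the mixing-plus-gradient update. Scalar quadratics
# f_i(x) = 0.5 * (x - b_i)^2 on a 4-node ring; all values are illustrative.
b = [0.0, 1.0, 2.0, 3.0]
W = [[0.5, 0.25, 0.0, 0.25],
     [0.25, 0.5, 0.25, 0.0],
     [0.0, 0.25, 0.5, 0.25],
     [0.25, 0.0, 0.25, 0.5]]

def dasg(alpha, beta, iters):
    x = [0.0] * 4
    y = [0.0] * 4  # y^0 = x^0
    for _ in range(iters):
        grads = [y[i] - b[i] for i in range(4)]  # exact local gradients
        x_new = [sum(W[i][j] * y[j] for j in range(4)) - alpha * grads[i]
                 for i in range(4)]
        y = [(1 + beta) * x_new[i] - beta * x[i] for i in range(4)]
        x = x_new
    return x

x = dasg(alpha=0.1, beta=0.3, iters=2000)
# in the noiseless case, converges to the same (inexact) fixed point as DG
```

This mirrors the statement above: momentum changes the convergence speed but not the fixed point, which is still determined by (10).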
2.3 Convergence Rates and Robustness to Gradient Noise
Consider both the DSG and DASG algorithms, subject to gradient noise satisfying Assumption 1. In this scenario, the noise is persistent, i.e., it does not decay over time, and it is possible that the limit of $\mathbb{E}\|x^k - x^\infty\|^2$ as $k \to \infty$ does not exist (even in the nondistributed setting, see Can et al. (2019)); therefore, one natural way^{4} (^{4}There are other possible ways to define a robustness measure, see e.g. Aybat et al. (2018).) of defining the robustness of an algorithm to gradient noise is to consider the worst-case limiting variance along all possible subsequences, i.e.
$$J := \limsup_{k \to \infty} \frac{\mathbb{E}\big[\|x^k - x^\infty\|^2\big]}{\sigma^2}. \qquad (13)$$
This measure has been considered in control theory under the name of the $\mathcal{H}_2$ norm of the dynamical system defined by (7), and it has recently been applied in optimization to develop noise-robust nondistributed algorithms (Aybat et al., 2018). It is equal to the ratio of the output variance to the input noise variance (the noise variance in the worst case); therefore, it can be interpreted as a signal-to-noise ratio (SNR) measure quantifying how robust the algorithm is to white noise. In the next sections, we will provide bounds on the robustness level $J$ and on the expected distance to both the fixed point and the optimum for the DSG and DASG algorithms. In particular, in the nondistributed setting, it is known that ASG can be less robust to noise than gradient descent (Hardt (2014); Aybat et al. (2018)); we will later obtain bounds in Section 2.3.3 for the robustness of DASG and DSG which suggest a similar behavior in the distributed setting when the stepsize is small enough.
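To make the robustness measure concrete in the simplest possible case, consider nondistributed SG on a scalar quadratic with additive gradient noise: the iterate variance then obeys an exact linear recursion whose limit plays the role of the asymptotic variance in (13), and dividing by the input noise variance gives the SNR-style ratio described above. Parameter values are illustrative.

```python
# Nondistributed scalar analogue of (13): for f(x) = (mu/2) * x^2 with
# additive gradient noise of variance sigma2, SG gives
# x_{k+1} = (1 - alpha*mu) * x_k - alpha * e_k, so the iterate variance obeys
# v_{k+1} = rho^2 * v_k + alpha^2 * sigma2 with rho = 1 - alpha*mu.
mu, alpha, sigma2 = 1.0, 0.1, 4.0
rho = 1.0 - alpha * mu

v = 0.0
for _ in range(1000):
    v = rho ** 2 * v + alpha ** 2 * sigma2  # exact variance recursion

v_limit = alpha ** 2 * sigma2 / (1.0 - rho ** 2)  # asymptotic variance
robustness = v_limit / sigma2  # output-variance / input-variance ratio
```

The closed-form limit shows the tradeoff discussed throughout the paper: shrinking the stepsize reduces the asymptotic variance (better robustness) at the cost of a slower bias decay rate $\rho$.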
For analysis purposes, we consider the penalized objective function $F_\alpha$ defined as
$$F_\alpha(x) := F(x) + \frac{1}{2\alpha}\, x^\top \big(I_{nd} - \mathcal{W}\big)\, x. \qquad (14)$$
Similar penalized objectives have also been considered in the past to analyze deterministic algorithms (see e.g. (Yuan et al., 2016, Section 2), Mansoori and Wei (2017)). It can be seen that its gradient (with respect to $x$) is $\nabla F_\alpha(x) = \nabla F(x) + \frac{1}{\alpha}(I_{nd} - \mathcal{W})x$. Since the penalty term is deterministic, the stochastic gradient $\tilde{\nabla} F_\alpha(x) := \tilde{\nabla} F(x) + \frac{1}{\alpha}(I_{nd} - \mathcal{W})x$ also satisfies
$$\mathbb{E}\big[\|\tilde{\nabla} F_\alpha(x^k) - \nabla F_\alpha(x^k)\|^2 \,\big|\, x^k\big] \le n\sigma^2. \qquad (15)$$
Furthermore, the unique minimum of $F_\alpha$ satisfies the first-order conditions
$$\nabla F(x) + \frac{1}{\alpha}\big(I_{nd} - \mathcal{W}\big)\, x = 0.$$
Then, it follows from (10) that the minimum of $F_\alpha$ coincides with the limit point $x^\infty$. In fact, we can rewrite the DSG iterations (7) as
$$x^{k+1} = x^k - \alpha\, \tilde{\nabla} F_\alpha(x^k), \qquad (16)$$
which is equivalent to running a nondistributed stochastic gradient algorithm for minimizing the alternative objective $F_\alpha$ in dimension $nd$. We can also rewrite the DASG iterations (12) as
$$x^{k+1} = y^k - \alpha\, \tilde{\nabla} F_\alpha(y^k), \qquad y^{k+1} = (1+\beta)\, x^{k+1} - \beta\, x^k. \qquad (17)$$
These iterations are identical to those of the (nondistributed) ASG. In other words, DASG applied to problem (1) in dimension $d$ is equivalent to running a nondistributed ASG algorithm for minimizing the alternative objective $F_\alpha$ in dimension $nd$.
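The equivalence between (7) and (16) can be checked numerically: one DSG step coincides, entry by entry, with one gradient step on the penalized objective, since $x - \alpha\big(\nabla F(x) + \frac{1}{\alpha}(I - \mathcal{W})x\big) = \mathcal{W}x - \alpha \nabla F(x)$. The sketch below verifies this for three nodes with scalar variables and illustrative quadratics $f_i(x) = \frac{1}{2}(x - b_i)^2$.

```python
# Numerical check of the equivalence between (7) and (16): one DSG step equals
# one gradient step on the penalized objective (14). Three nodes with scalar
# variables; f_i(x) = 0.5 * (x - b_i)^2. All numeric values are illustrative.
alpha = 0.05
b = [1.0, 2.0, 4.0]
W = [[0.6, 0.4, 0.0],
     [0.4, 0.2, 0.4],
     [0.0, 0.4, 0.6]]  # symmetric, doubly stochastic
x = [0.3, -1.0, 2.5]
n = 3

grad_F = [x[i] - b[i] for i in range(n)]  # gradient of F at x
# DSG step (7): mixing followed by a gradient step
dsg_step = [sum(W[i][j] * x[j] for j in range(n)) - alpha * grad_F[i]
            for i in range(n)]
# gradient step on F_alpha: grad F_alpha(x) = grad F(x) + (1/alpha)(I - W) x
penalty_grad = [(x[i] - sum(W[i][j] * x[j] for j in range(n))) / alpha
                for i in range(n)]
sg_step = [x[i] - alpha * (grad_F[i] + penalty_grad[i]) for i in range(n)]
# dsg_step and sg_step coincide entry by entry
```

The consensus penalty $\frac{1}{2\alpha}x^\top(I - \mathcal{W})x$ simply absorbs the mixing step into the gradient, which is what lets nondistributed analysis machinery be applied directly.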
This connection allows us to analyze both DSG and DASG with existing techniques developed for nondistributed algorithms in Aybat et al. (2018, 2019), which build on a dynamical system representation of optimization algorithms.
2.3.1 Dynamical system representation
Both DSG and DASG can be written as a stochastic dynamical system of the form
$$\xi^{k+1} = A\, \xi^k + B\, \tilde{\nabla} F_\alpha(y^k), \qquad y^k = C\, \xi^k, \qquad (18)$$
where $\xi^k$ is the state, and $A$, $B$, $C$ are system matrices that are appropriately chosen. For example, we can represent the DSG iterates with the choice of
$$\xi^k = x^k, \qquad A = I_{nd}, \qquad B = -\alpha I_{nd}, \qquad C = I_{nd}. \qquad (19)$$
Similarly, we can represent the DASG iterations as the dynamical system (18) with
$$A = \begin{bmatrix} (1+\beta) I_{nd} & -\beta I_{nd} \\ I_{nd} & 0_{nd,nd} \end{bmatrix}, \qquad B = \begin{bmatrix} -\alpha I_{nd} \\ 0_{nd,nd} \end{bmatrix}, \qquad C = \begin{bmatrix} (1+\beta) I_{nd} & -\beta I_{nd} \end{bmatrix}, \qquad (20)$$
and where
$$\xi^k = \begin{bmatrix} x^k \\ x^{k-1} \end{bmatrix}. \qquad (21)$$
(see also Lessard et al. (2016) for such a dynamical system representation in the deterministic case). For studying the dynamical system (18), we introduce the following Lyapunov function
$$V_P(\xi) := (\xi - \xi^\infty)^\top P\, (\xi - \xi^\infty) + c \,\big( F_\alpha(T\xi) - F_\alpha(x^\infty) \big), \qquad (22)$$
where $c \ge 0$ is a scalar, $P$ is a positive semidefinite matrix, $T$ is a fixed matrix that will be specified later, and $\xi^\infty$ denotes the fixed point of (18) in the noiseless setting. Since $x^\infty$ is the minimum of $F_\alpha$, we observe that $V_P$ takes nonnegative values. In particular, $V_P(\xi^\infty) = 0$. In the special case when $P$ is the identity, the first term reduces to the squared distance $\|\xi - \xi^\infty\|^2$.
In the next section, we obtain convergence results for DSG and DASG with constant stepsize and momentum, which also imply guarantees on the robustness measure $J$. The analysis is based on studying the Lyapunov function (22) for different choices of the matrix $P$ and the scalar $c$. In particular, for DSG we can choose $P$ to be the identity matrix; however, for DASG, the choice of $P$ is less trivial and in general depends on the choice of the stepsize $\alpha$ and the momentum parameter $\beta$. Here, our choice of the Lyapunov function (22) is motivated by Fazlyab et al. (2018), which studied this Lyapunov function to analyze accelerated gradient methods in the centralized deterministic setting.
2.3.2 Analysis of Distributed Stochastic Gradient
We next provide a performance bound for DSG in Theorem 2.3.2. It shows that the expected squared distance to the fixed point can be bounded as a sum of two terms: a bias term that depends on the initialization and decays at a linear rate, and a variance term that scales linearly with the noise level $\sigma^2$, providing a bound on the asymptotic variance and hence the robustness level $J$. When there is no noise ($\sigma = 0$), the variance term is zero, and we obtain a linear convergence rate for the (deterministic) DG algorithm. This improves the previously best known convergence rate for DG obtained in Yuan et al. (2016) (see Theorem 7 therein). We also note that the convergence rate and robustness guarantees we provide in Theorem 2.3.2 are tight for DSG in the sense that they are attained for some quadratic choices of the objective (see Remark C.1 in Appendix C).
To prove Theorem 2.3.2, we exploit the abovementioned fact that running DSG on the objective $F$ is equivalent to running nondistributed SG on the modified objective $F_\alpha$, and we build on existing results for nondistributed stochastic gradient (Aybat et al., 2018, Prop. 4.3); the proof is given in the appendix.
Consider running the DSG method with stepsize $\alpha$. Then, for every $k \ge 0$,
(23) 
where . As a result, the robustness of the DSG method satisfies
We recall that the penalized objective $F_\alpha$ depends on the network and the stepsize. The fixed point $x^\infty$ is the minimum of the penalized objective $F_\alpha$. In general, the difference $x^\infty - x^*_{(n)}$ is not zero, and it depends on the network structure and the stepsize $\alpha$. We call this term the "network effect"; it can be controlled by the inequality (8). The following corollary is obtained by a direct application of the inequality (8) to Theorem 2.3.2.
Consider running the DSG method with stepsize $\alpha$. Then, for every $k \ge 0$,
(24) 
which implies that the robustness of the DSG method satisfies
In addition, if , we have
(25)  
where the constants are given in (8).
2.3.3 Analysis of Distributed Accelerated Stochastic Gradient
Throughout this section, we state the results under the following assumption. {assumption} We assume all eigenvalues of $W$ are positive, i.e., we assume that $\lambda_n(W) > 0$.
We note that Assumption 2.3.3 is not restrictive: even if the weight matrix $W$ does not satisfy it, we can still apply the results in our paper by considering the modified weight matrix $\tilde{W} := (W + I_n)/2$ instead of $W$. Indeed, we have $\lambda_i(\tilde{W}) = (\lambda_i(W) + 1)/2 > 0$ for every $i$ since $\lambda_i(W) > -1$, and therefore $\tilde{W}$ satisfies Assumption 2.3.3.
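The remark above can be sketched numerically: for a symmetric doubly stochastic $W$ with a negative eigenvalue, the lazy matrix $(W + I)/2$ shifts every eigenvalue $\lambda$ to $(\lambda + 1)/2 > 0$. The $2 \times 2$ matrix below is an illustrative choice with eigenvalues $1$ and $-0.6$.

```python
import math

# Sketch: the lazy matrix (W + I)/2 maps each eigenvalue t of W to
# (t + 1)/2 > 0 (since t > -1). The 2x2 doubly stochastic W below, with
# eigenvalues 1 and -0.6, is an illustrative choice.
W = [[0.2, 0.8],
     [0.8, 0.2]]

def eig_sym_2x2(M):
    # closed-form eigenvalues of a symmetric 2x2 matrix
    a, off, d = M[0][0], M[0][1], M[1][1]
    mean, r = (a + d) / 2.0, math.hypot((a - d) / 2.0, off)
    return mean - r, mean + r

W_lazy = [[(W[i][j] + (1.0 if i == j else 0.0)) / 2.0 for j in range(2)]
          for i in range(2)]

lo_eig, hi_eig = eig_sym_2x2(W_lazy)  # (-0.6 + 1)/2 = 0.2 and (1 + 1)/2 = 1.0
```

The lazy matrix is still symmetric and doubly stochastic, so all the structural properties required of $W$ are preserved while Assumption 2.3.3 is enforced.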
The following result extends Aybat et al. (2018) from nondistributed ASG to DASG. {theorem} Assume there exist $\rho \in (0, 1)$ and a positive semidefinite matrix $P$ such that
(26) 
where $A$, $B$ and $C$ are defined in (21) and
Let . Then, for every ,
(27) 
Therefore, the robustness of DASG iterations defined in (13) satisfies
In addition, if , we have
where the constants are given in (8).
The results in Theorem 2.3.3 are stated in terms of a matrix $P$ which solves the matrix inequality (26). For any fixed values of the rate $\rho$ and the parameters $\alpha$ and $\beta$, this is a linear matrix inequality (LMI) in $P$. Therefore, we can compute the bounds numerically by varying $\rho$, $\alpha$ and $\beta$ on a grid and then solving the resulting LMIs with a software package such as CVX (Grant et al. (2008)) (see also Lessard et al. (2016) for a similar approach). However, in the next result, we obtain some explicit performance bounds in the special case when the momentum parameter is set to $\beta = \frac{1 - \sqrt{\alpha\mu}}{1 + \sqrt{\alpha\mu}}$; this choice is motivated by the fact that it is a common choice in the nondistributed and noiseless setting.^{5} (^{5}Furthermore, it can be shown that it gives the fastest rate for quadratic objectives in the nondistributed case when there is no noise (Aybat et al. (2019)).) The proof is deferred to the appendix; it is based on exhibiting an explicit solution $P$ to the matrix inequality (26) for this choice of parameters.
Then, plugging this choice of $P$ into Theorem 2.3.3 and into the bound (2.3.3), we obtain performance guarantees in terms of the Lyapunov function $V_P$. To simplify the notation in this case, with a slight abuse of notation, we let
(28) 
We have the following explicit performance bounds on the convergence and the robustness of DASG. {theorem} Consider running the DASG method with