Abstract
Consensusbased distributed optimization methods have recently been advocated as alternatives to parameter server and ring allreduce paradigms for large scale training of machine learning models. In this case, each worker maintains a local estimate of the optimal parameter vector and iteratively updates it by averaging the estimates obtained from its neighbors, and applying a correction on the basis of its local dataset. While theoretical results suggest that worker communication topology should have strong impact on the number of epochs needed to converge, previous experiments have shown the opposite conclusion. This paper sheds lights on this apparent contradiction and show how sparse topologies can lead to faster convergence even in the absence of communication delays.
1 Introduction
In 2014, Google’s Sybil machine learning (ML) platform was processing hundreds of terabytes through thousands of cores to train models with hundreds of billions of parameters (Canini et al., 2014). At this scale, no single machine can solve these problems in a timely manner, and, as time goes on, the need for efficient distributed solutions becomes even more urgent. For example, experiments in (Young et al., 2017) rely on more than computing nodes to iteratively improve the (hyper)parameters of a deep neural network.
The example in (Young et al., 2017) is typical of a large class of iterative ML distributed algorithms. Such algorithms begin with a guess of an optimal vector of parameters and proceed through multiple iterations over the input data to improve the solution. The process evolves in a dataparallel manner: input data is divided among worker threads. Currently, two communication paradigms are commonly used to coordinate the different workers (Google I/O, 2018): parameter server and ring allreduce. Both paradigms are natively supported by TensorFlow (Abadi et al., 2016).
In the first case, a stateful parameter server (PS) (Smola and Narayanamurthy, 2010) maintains the current version of the model parameters. Workers use locally available versions of the model to compute “delta” updates of the parameters (e.g., through a gradient descent step). Updates are then aggregated by the parameter server and combined with its current state to produce a new estimate of the optimal parameter vector.
As an alternative, it is possible to remove the PS, by letting each worker aggregate the inputs of all other workers through the ring allreduce algorithm (Gibiansky, 2017). With workers, each aggregation phase requires communication steps with data transmitted per worker. There are many efficient low level implementations of ring allreduce, e.g., in NVIDIA’s library NCCL.
We observe that both the PS and the ring allreduce paradigms 1) maintain a unique candidate parameter vector at any given time and 2) rely logically on an alltoall communication scheme.

in terms of number of epochs, the convergence speed is almost the same when the communication topology is a ring or a clique, contradicting theoretical findings that predict convergence to be faster on a clique;

in terms of wallclock time, convergence is faster for sparser topologies, an effect attributed in (Lian et al., 2017) to smaller communication load.
In particular, Luo et al. (2019) summarize their findings as follows “in theory, the bigger the spectral gap, [i.e., the more connected the topology] the fewer iterations it takes to converge. However, our experiments do not show a significant difference in the convergence rate w.r.t. iterations, even when spectral gaps are very dissimilar.”
In this paper we contribute to a better understanding of the potential advantages of consensusbased gradient methods. In particular,

we present a refined convergence analysis that helps to explain the apparent contradiction among theoretical results and empirical observations,

we show that sparse topologies can speedup wallclock time convergence even when communication costs are negligible, because they intrinsically mitigate the straggler problem.
2 Notation and Background
The goal of supervised learning is to learn a function that maps an input to an output using examples from a training dataset . Each example is a pair consisting of an input object and an associated target value . In order to find the best statistical model, ML techniques often find the set of parameters that solves the following optimization problem:
(1) 
where function represents the error the model commits on the th element of the dataset when parameter vector is used. The objective function may also include a regularization term that enforces some “simplicity” (e.g., sparseness) on ; such a term is easily taken into account in our analysis.
Due to increases in available data and statistical model complexity, distributed solutions are often required to determine the parameter vector in a reasonable time. The dataset in this case is divided among workers (), possibly with some overlap. For simplicity, we consider that all local datasets have the same size. Problem (1) can be restated in an equivalent form as minimization of the sum of functions local to each node:
(2) 
where .
The distributed system can be represented by a directed dataflow graph , where is the set of nodes (the workers) and an edge indicates that, at each iteration, node waits for updates from node for the previous iteration. We assume the graph is strongly connected. Let denote the inneighborhood of node , i.e., the set of predecessors of node in . Each node maintains a local estimate of the parameter vector and broadcasts it to its successors. The local estimate is updated as follows:
(3) 
The node computes a weighted average (consensus/gossip component) of the estimates of its neighbors and itself, and then corrects it taking into account a stochastic subgradient
where denotes a subgradient of with respect to , and is a random minibatch of size drawn from .
Parameter is the (potentially timevarying) learning rate. is an matrix of nonnegative weights. We call the consensus matrix.
The operation of a synchronous PS or ring allreduce is captured by (3) when the underlying graph is a clique, , where is the vector consisting of all ones, and .
Under some standard technical conditions, (Nedić et al., 2018, Thm. 8) and (Duchi et al., 2012, Thm. 2) conclude, respectively for Distributed Subgradient Method (DSM) and for the Dual Averaging Distributed method, that the number of iterations needed to approximate the minimum objective function value by the desired error is
(4) 
where is the spectral gap of the matrix , i.e., the difference between the moduli of the two largest eigenvalues of . The spectral gap quantifies how information flows in the network. In particular, the spectral gap is maximal for a clique with weights . Motivated by these convergence results, existing theoreticallyoriented literature has concluded that a more connected network topology leads to faster convergence (Nedić et al., 2018; Duchi et al., 2012). But some recent experimental results report that consensusbased gradient methods achieve similar performance after the same number of iterations/epochs on topologies as different as rings and cliques. For example (Lian et al., 2017, Fig. 3) shows almost overlapping training losses for different ResNet architectures trained on CIFAR10 with up to one hundred workers. (Luo et al., 2019, Fig. 20), (Koloskova et al., 2019, Fig. 11 in supplementary material), and our experimental results in Sect. 4 confirm these findings.
Lian et al. (2017) provide a partial explanation for this insensitivity in their Corollary 2, showing that the convergence rate is topologyindependent 1) after a large number of iterations (),
2) for a vanishing learning rate, and 3) when the functions are differentiable with Lipschitzian gradients. Under the additional hypothesis of strong convexity, Pu et al. (2019) prove that topology insensitivity should manifest after iterations.
3 Analysis
A less connected topology requires more iterations to achieve a given precision as indicated by (4). Our detailed analysis below shows that, when consensusbased optimization methods are used for ML training, the increase in the number of iterations is much less pronounced than previous studies predict. This is due to two different effects. First, consensus is affected only by variability in initial estimates and subgradients across nodes, and not by their absolute values. Second, certain configurations of initial estimates and subgradients are more difficult to achieve a consensus over, and would make the training highly dependent on the topology, but they are unlikely to be obtained by randomly partitioning the dataset.
Let be the number of parameters of the model, and and be matrices, whose columns are, respectively, node estimates and subgradients at the completion of iteration . Equation (3) can be rewritten in the form , from which we obtain iteratively
(5) 
We make the following assumptions:
 A1

all functions are convex,
 A2

the set of (global) minimizers is nonempty,
 A3

graph is strongly connected,
 A4

matrix is normal (i.e., ) and doubly stochastic,
 A5

the squared Frobenius norm of subgradient matrix is bounded in expectation over the vector of minibatches randomly drawn at nodes, i.e., there exists , such that .
Assumptions A1A4 are standard ones in the related literature, see for example (Nedić and Ozdaglar, 2009; Duchi et al., 2012; Nedić et al., 2018). Assumption A5 imposes a bound on the (expected) energy of the subgradients, because . In the literature it is often replaced by the stronger requirement that the norm2 of the subgradients is bounded. Let denote the matrix , whose column is the difference between subgradient and the average of subgradients . captures the variability in the subgradients. Assumption A5 also implies that there exist two constants and such that
Similarly, let denote the energy of the initial parameter vectors (or an upper bound for it), i.e., . We also denote by the energy for the difference matrix , i.e., . captures the variability in initial estimates. It holds .
Because of Assumption A4, the consensus matrix has a spectral decomposition with orthogonal projectors where are the distinct eigenvalues of , is the orthogonal projector onto the nullspace of along the range of . We assume that the eigenvalues are ordered so that . Assumptions A3 and A4 imply that , and (Appendix B). Finally, we define
(6) 
where is an upperbound for the normalized fraction of energy in the subspace defined by projector (Appendix D). The quantity can be interpreted as an effective bound for the fraction of the energy that falls in the subspace relative to the second largest eigenvalue .
We are now ready to introduce our main convergence result. We state it for the average model over nodes and time, i.e., for .
We have also derived a similar bound for the local timeaverage model at each node, i.e., for (Appendix D.3).
Proposition 3.1.
Under assumptions A1A5 and that a constant learning rate is used, an upper bound for the objective value at the end of the ()th iteration is given by:
(7)  
Here, denotes the Euclidean distance between vector and set of global minimizers . The proof is in Appendix D.1. The first two terms on the right hand side of (7) also appear when studying the convergence of centralized subgradient methods. The last two terms appear because of the distributed consensus component of the algorithm and depend on . We observe that is the spectral gap of . It measures how well connected the graph is. In particular, the larger the spectral gap (the smaller ), the better the connectivity and the smaller the bound in (7).
From Proposition 3.1, we can derive a looser bound analogous to the bound for DSM in (Nedić and Ozdaglar, 2009). In fact, observing that , , , and , we can prove (Appendix D.2):
Corollary 3.2.
Under assumptions A1A5 and that constant learning rate is used, an upper bound for the objective value at the end of the ()th iteration is given by:
(8) 
In particular, if workers compute fullbatch subgradients and the 2norm of subgradients of functions is bounded by a constant , we obtain:
(9) 
When is large enough, the fourth term in (8) and (9) is dominant, so that the error is essentially proportional to . Note that the constant multiplying in (8) is larger than the corresponding one in (7) by a factor
(10) 
The value roughly indicates how much looser bound (8) is in comparison to bound (7).
Existing theoretical works like (Nedić and Ozdaglar, 2009; Duchi et al., 2012; Nedić et al., 2018) derived bounds similar to (9) and concluded then that one should select the learning rate proportional to to reduce the effect of topology. In particular, one obtains (4) when . Our bound (7) improves bound (8) by replacing in the third terms of (8) by the smaller value , and in the third and fourth terms by the smaller values and , and introducing the new coefficient . We qualitatively describe the effect of these constants.
For , considerations similar to those applying to hold. What matters is the variability of the subgradients. Assume that the dataset is replicated at each node and each node computes the subgradient over the full batch (). In this case all subgradients would be equal, and , , and the fourth term would also vanish. This corresponds to the fact that, when initial parameter vectors, as well as local functions, are the same, the parameter vectors are equal at any iteration and the system evolves exactly as it would under a centralized subgradient method. In general, local subgradients can be expected to be close (and ), if 1) local datasets are representative of the entire dataset (the dataset has been randomly split and ), and 2) large batch sizes are used. On the other hand, when batch sizes are very small, one expects stochastic subgradients to be very noisy, and as a consequence the energy of the matrix to be much larger than the energy of , so that . In both cases, is large (in the first case because , and in the second because ). We quantify these effects below.
From (5) we see that the effect of the subgradients is modulated by , that equals . The energy of the subgradients is spread across the different subspaces defined by the eigenvectors of . The classic bound (8) implicitly assumes that all energy falls in the subspace corresponding to (this occurs if the row of the matrices are aligned with the second eigenvector). In reality, on average each subspace will only get th of the total energy (), and the energy in other subspaces will be dissipated faster than what happens for the subspace corresponding to . quantifies this effect.
A toy example in Appendix F illustrates qualitatively these effects. Here we provide estimates for the expected values of , , and over all possible ways to distribute the dataset randomly across the nodes. We reason as follows. For a given parameter vector , consider the set of subgradients at all dataset points, i.e., . The average subgradient over all datapoints is . Let denote the trace of the covariance matrix of all subgradients. then equals the sum of the variances of all the components of the subgradients. We denote by the expanded dataset where each datapoint is replicated times with . The dataset is split across the nodes. Each node selects a random minibatch from its local dataset and we denote by the corresponding subgradient matrix.
Proposition 3.3.
Consider a uniform random permutation of with the constraint that copies of the same point are placed at different nodes. The following holds
(11)  
We can use (11) to study how , , and vary with dataset size, batch size, and number of replicas, using the following approximations:
(12) 
Figure 1 illustrates the ratio for a particular setting. It also highlights the two regimes discussed above: for large batch sizes and for small ones. As indicates how much looser bound (8) is in comparison to bound (7), and , the figure shows that (8) may indeed overestimate the effect of the topology by many orders of magnitudes. The comparison of Fig. 0(b) and Fig. 0(a) shows that (and then ) does not depend much on the replication factor , but for the fact that the batch size can scale up to .
4 Experiments
Dataset  Model  M  B  


16  128  0.0003  7.92  1.01  1.53  12.23  12.31  1  1  
3250  38.45  1.00  1.64  62.86  60.97  1  1  
100  128  7.75  1.01  1.54  12.05  11.56  1  10  1  
520  15.58  1.00  1.51  23.60  22.96  1  17  1  


16  128  0.1  1.45  1.42  1.49  3.07  2.92  1  16  1  72  
500  2.15  1.14  1.53  3.75  3.71  1  22  40  1  260  
64  128  1.41  1.42  1.51  3.02  3.03  1  10  1  24  
split by digit  10  500  0.01  1.01  1.00  1.42  1.42  3.62  1  3  60  1  7  100  


16  128  0.05  1.07  3.35  1.49  5.34  5.62  1  10  30  1  20  
500  1.18  1.91  1.50  3.40  3.52  1  21  1 
With our experiments we want to 1) evaluate the effect of topology on the number of epochs to converge, and in particular quantify , , , and in practical ML problems, 2) evaluate the effect of topology on the convergence time. We considered three different optimization problems:

Minimization of mean squared error (MSE) for linear regression on the dataset “Relative location of CT slices on axial axis” from (34; F. Graf, H. Kriegel, M. Schubert, S. Pölsterl and A. Cavallaro (2011)). Convexity holds, but gradients are potentially unbounded.

Minimization of crossentropy loss through a neural network with two convolutional layers on MNIST dataset (Lecun et al., 1998). Neither convexity, nor subgradient boundness hold.
We have developed an adhoc Python simulator that allows us to test clusters with a large number of nodes, as well as a distributed application using PyTorch MPI backend to run experiments on a real GPU cluster platform.
Table 1 shows values of , , , and their product for different problems and different settings.
From (7) and (8), we can also compute at which iteration the two bounds predict that the effect of the topology becomes significant, by identifying when the training loss difference between the clique and the ring accounts for a given percentage of the loss decrease over the entire training period. Figure 3 qualitatively illustrates the procedure.
We note that forecasts are very different, despite the fact that, in some settings, our bound is only 3 times tighter than the classic one. Bound (8) predicts that the training loss curves should differ by more than since the first iteration (). The new bound (7) correctly identifies that the topology’s effect becomes evident later, sometimes beyond the total number of iterations performed in the experiment (in this case we indicate ).
Figure 2 shows the training loss evolution for specific settings (one for each ML problem) and two very different topologies (undirected ring and clique), when the dataset is split randomly across the nodes. The behaviour is qualitatively similar to what observed in previous works (Lian et al., 2017, 2018; Luo et al., 2019); despite the remarkable difference in the level of connectivity (quantified also by the spectral gap), the curves are very close, sometimes indistinguishable.
Figure 4 shows the same plot for the case when MNIST images for the same digit have been assigned to the same node. In this case the local datasets are very different and ; the topology has a remarkable effect! This plot warns against extending the empirical finding in (Lian et al., 2017, 2018; Luo et al., 2019) to settings where local datasets can be highly different as it can be for example in the case of federated learning (Konecný et al., 2015).
The experiments above confirm that the communication topology has little influence on the number of epochs needed to converge (when local datasets are statistically similar). Our analysis reconciles (at least in part) theory and experiments by pushing farther the training epoch at which the effect of the topology should be evident.
The conclusion about the role of the topology is radically different if one considers the time to converge. For example, Karakus et al. (2017) and Luo et al. (2019) observe experimentally that sparse topologies can effectively reduce the convergence wallclock time. A possible explanation is that each iteration is faster because less time is spent in the communication phase: the less connected the graph, the smaller the communication load at each node. Lian et al. (2017, 2018) advance this explanation to justify why DSM on ringlike topologies can converge faster than the centralized PS.
Here, we show that sparse topologies can speedup wallclock time convergence even when communication costs are negligible, because they intrinsically mitigate the effect of stragglers, i.e., tasks whose completion time can be occasionally much longer than its typical value. Transient slowdowns are common in computing systems (especially in shared ones) and have many causes, such as resource contention, background OS activities, garbage collection, and (for ML tasks) stopping criteria calculations. Stragglers can significantly reduce computation speed in a multimachine setting (Ananthanarayanan et al., 2013; Karakus et al., 2017; Li et al., 2018). For consensusbased method, one can hope that, when the topology is sparse, a temporary straggler only slows down a limited number of nodes (its outneighbors in ), so that the system can still maintain a high throughput.
Neglia et al. (2019) have proposed approximate formulas to evaluate the throughput of distributed ML systems for some specific random distribution of the computation time (uniform, exponential, and Pareto). Here, we take a more practical approach. Our PyTorchbased distributed application allow us to simulate systems with arbitrary distributions of the computation times and communication delays. We have carried out experiments with zero communication delays (an ideal network) and two different empirical distributions for the computation time. One was obtained by running stochastic gradient descent on a production Spark cluster with sixteen servers using Zoe Analytics (Pace et al., 2017), each with two 8core Intel E52630 CPUs running at 2.40GHz. The other was extracted from ASCIQ supercomputer traces (Petrini et al., 2003, Fig. 4). Figure 5 shows the effect of topology connectivity on the convergence time for a MNIST experiment with Spark computation distribution. We consider in this case undirected regular random graphs. The number of iterations completed per node grows faster the less connected the topology (Fig. 5 (a)). As the training loss is almost independent of the topology (Fig. 5 (b)), the ring achieves the shortest convergence time (Fig. 5 (c)), even if there is no communication delay. Qualitatively similar results for other ML problems and time distributions are in Appendix G.
5 Conclusions
We have explained, both through analysis and experiments, when and why the communication topology does not affect the number of epochs consensusbased optimization methods need to converge, an effect recently observed in many papers, but not thoroughly investigated. We have also shown that, as a consequence of this invariance, a less connected topology achieves a shorter convergence time, not necessarily because it incurs a smaller communication load, but because it mitigates the stragglers’ problem. The distributed operation of consensusbased approaches appears particularly suited for federated and multiagent learning. Our study points out that further research is required for these applications, because the benefits observed until now are dependent on the statistical similarity of the local datasets, an assumption that is not satisfied in federated learning.
6 Acknowledgements
We warmly thank Alain JeanMarie, Pietro Michiardi, and Bruno Ribeiro for their feedback on the manuscript and their help preparing the rebuttal letter. We are also grateful to the OPAL infrastructure from UniversitÃ© CÃ´te d’Azur and Inria Sophia Antipolis  MÃ©diterranÃ©e “NEF” computation platform for providing resources and support. Finally, We would like to thank the anonymous reviewers for their insightful comments and suggestions, which helped us improve this work.
This work was supported in part by ARL under Cooperative Agreement W911NF1720196.
Appendix A Notation
We use an overline to denote an average over all the nodes and a “hat” to denote the timeaverage. For example
(13) 
For a matrix, e.g., , denotes the matrix whose column is , i.e.,
where is the orthogonal projector on the subspace generated by . is used to denote the difference .
Given a matrix , and denote the th row and the th column, respectively.
We use different standard matrix norms, whose definitions are reported here for completeness. Let be a matrix:
(14)  
(15) 
where are the singular values of the matrix and is the largest one.
We will also consider the Frobenius inner product between matrices defined as follows
(16) 
All the results in Appendix D assume that the matrix is irreducible, primitive, doubly stochastic, nonnegative, and normal.
Appendix B Linear algebra reminders
b.1 Irreducible primitive doubly stochastic nonnegative matrices
We remind some results from PerronFrobenius theory (Meyer, 2000, Ch. 8). As our communication graph is strongly connected, and whenever is an edge of , the consensus matrix is irreducible. Moreover, the consensus matrix has nonnull diagonal elements and then it is also primitive. The spectral radius is then itself a simple eigenvalue. Because is also stochastic, its eigenvalue coincides with the spectral radius ().
Let be the orthogonal projector on the subspace generated by the unit vector .
The nonzero singular values of are the positive square roots of the nonzero eigenvalues of . We observe that . The spectrum of is equal to the spectrum of but for one eigenvalue that is replaced by an eigenvalue . It follows that .
b.2 Normal matrices
An matrix is normal if . A matrix is normal if and only if it is unitarily diagonalizable (Meyer, 2000, p. 547), i.e., it exists a complete orthonormal set of eigenvectors such that , where is the diagonal matrix containing the eigenvalues and has the eigenvectors as columns.
Normal matrices have a spectral decomposition with orthogonal projectors (Meyer, 2000, p. 517), i.e.,
where are the eigenvalues of , is the orthogonal projector onto the nullspace of along the range of , for , and . Because the projectors are orthogonal and non null it holds and (Meyer, 2000, p. 433). Moreover, for any vector and , it holds:
(17) 
Symmetric matrices as well as circulant matrices are normal. In fact, a circulant matrix is always diagonalizable by the Fourier matrix and then it is normal.
The nonzero singular values of a normal matrix are the modules of its eigenvalues, i.e., . If the matrix is also imprimitive, irreducible, doubly stochastic and nonnegative, it holds
(18) 
Moreover, observe that in this case .
Appendix C Insensitivity quantification in previous work
Lian et al. (2017) and Pu et al. (2019) have studied the convergence of decentralized stochastic gradient method predicting topology independence after a certain number of iterations. Here, we evaluate quantitatively their predictions on some ML problems and compare them with our experimental observations. The results show that these predictions are very loose and then they do not fully explain why insensitivity is often observed since the beginning of the training phase.
These two papers have slightly different assumptions than ours.
 A1

the consensus matrix is symmetric and doubly stochastic,
 A2

every has LLipschitz continuous gradient, i.e.,
 A3

the expected variance of stochastic gradient is uniformly bounded, i.e.,
 A4

every is strongly convex, i.e.,
We observe that A1 implies A4, A2 implies A5, and A4 implies A1.
Next, we will present the bounds obtained in (Lian et al., 2017; Pu et al., 2019) and how we evaluate them quantitatively.
Corollary C.1 (Corollary 2 in Lian et al. (2017)).
Under assumptions A2A3, A1A3, and that a constant learning rate is used, the convergence rate is independent of the topology if the total number of iterations is sufficiently large, in particular if
(19) 
We estimate the constants , and as follows:
 Estimate of :

, where is a random set of pairs of parameter vectors.
 Estimate of :

, as the dataset is randomly split.
 Estimate of :

.
We observe that in general underestimates the Lipschitz constant , and is likely to overestimate . Both effects lead to underestimate , that makes our conclusions stronger.
We evaluate these quantities for three machine learning problems (SUSY,
Dataset  

SUSY  5.03  2.82  1.0e10 
CT  37.56  1953.27  9.2e11 
MNIST  86.05  12.83  2.3e6 
Theorem C.2 (Theorem (4.2) in (Pu et al., 2019)).
Under assumptions A1A3, A2A4, suppose , learning rate . For all it holds
(20) 
where is the minimizer of , , , ,
, ,
,
and .
Notice that in (20), only the third term is related to the topology as depends on the spectral gap of the consensus matrix. The bound in Corollary C.1 is obtained imposing that the term depending on the spectral gap is smaller than the other terms (Lian et al., 2017, Thm. 1). Here we apply the same idea, requiring the third term of (20) to be bounded by the first and the second term of (20), i.e.,
and then
Then, we have
As and , assuming that , we conclude that the convergence rate is independent of the topology if the total number of iterations is sufficiently large, and more precisely if