Consensus-based distributed optimization methods have recently been advocated as alternatives to parameter server and ring all-reduce paradigms for large scale training of machine learning models. In this case, each worker maintains a local estimate of the optimal parameter vector and iteratively updates it by averaging the estimates obtained from its neighbors, and applying a correction on the basis of its local dataset. While theoretical results suggest that worker communication topology should have strong impact on the number of epochs needed to converge, previous experiments have shown the opposite conclusion. This paper sheds lights on this apparent contradiction and show how sparse topologies can lead to faster convergence even in the absence of communication delays.
In 2014, Google’s Sybil machine learning (ML) platform was processing hundreds of terabytes through thousands of cores to train models with hundreds of billions of parameters (Canini et al., 2014). At this scale, no single machine can solve these problems in a timely manner, and, as time goes on, the need for efficient distributed solutions becomes even more urgent. For example, experiments in (Young et al., 2017) rely on more than computing nodes to iteratively improve the (hyper)parameters of a deep neural network.
The example in (Young et al., 2017) is typical of a large class of iterative ML distributed algorithms. Such algorithms begin with a guess of an optimal vector of parameters and proceed through multiple iterations over the input data to improve the solution. The process evolves in a data-parallel manner: input data is divided among worker threads. Currently, two communication paradigms are commonly used to coordinate the different workers (Google I/O, 2018): parameter server and ring all-reduce. Both paradigms are natively supported by TensorFlow (Abadi et al., 2016).
In the first case, a stateful parameter server (PS) (Smola and Narayanamurthy, 2010) maintains the current version of the model parameters. Workers use locally available versions of the model to compute “delta” updates of the parameters (e.g., through a gradient descent step). Updates are then aggregated by the parameter server and combined with its current state to produce a new estimate of the optimal parameter vector.
As an alternative, it is possible to remove the PS, by letting each worker aggregate the inputs of all other workers through the ring all-reduce algorithm (Gibiansky, 2017). With workers, each aggregation phase requires communication steps with data transmitted per worker. There are many efficient low level implementations of ring all-reduce, e.g., in NVIDIA’s library NCCL.
We observe that both the PS and the ring all-reduce paradigms 1) maintain a unique candidate parameter vector at any given time and 2) rely logically on an all-to-all communication scheme.
in terms of number of epochs, the convergence speed is almost the same when the communication topology is a ring or a clique, contradicting theoretical findings that predict convergence to be faster on a clique;
in terms of wall-clock time, convergence is faster for sparser topologies, an effect attributed in (Lian et al., 2017) to smaller communication load.
In particular, Luo et al. (2019) summarize their findings as follows “in theory, the bigger the spectral gap, [i.e., the more connected the topology] the fewer iterations it takes to converge. However, our experiments do not show a significant difference in the convergence rate w.r.t. iterations, even when spectral gaps are very dissimilar.”
In this paper we contribute to a better understanding of the potential advantages of consensus-based gradient methods. In particular,
we present a refined convergence analysis that helps to explain the apparent contradiction among theoretical results and empirical observations,
we show that sparse topologies can speed-up wall-clock time convergence even when communication costs are negligible, because they intrinsically mitigate the straggler problem.
2 Notation and Background
The goal of supervised learning is to learn a function that maps an input to an output using examples from a training dataset . Each example is a pair consisting of an input object and an associated target value . In order to find the best statistical model, ML techniques often find the set of parameters that solves the following optimization problem:
where function represents the error the model commits on the -th element of the dataset when parameter vector is used. The objective function may also include a regularization term that enforces some “simplicity” (e.g., sparseness) on ; such a term is easily taken into account in our analysis.
Due to increases in available data and statistical model complexity, distributed solutions are often required to determine the parameter vector in a reasonable time. The dataset in this case is divided among workers (), possibly with some overlap. For simplicity, we consider that all local datasets have the same size. Problem (1) can be restated in an equivalent form as minimization of the sum of functions local to each node:
The distributed system can be represented by a directed dataflow graph , where is the set of nodes (the workers) and an edge indicates that, at each iteration, node waits for updates from node for the previous iteration. We assume the graph is strongly connected. Let denote the in-neighborhood of node , i.e., the set of predecessors of node in . Each node maintains a local estimate of the parameter vector and broadcasts it to its successors. The local estimate is updated as follows:
The node computes a weighted average (consensus/gossip component) of the estimates of its neighbors and itself, and then corrects it taking into account a stochastic subgradient
where denotes a subgradient of with respect to , and is a random minibatch of size drawn from .
Parameter is the (potentially time-varying) learning rate. is an matrix of non-negative weights. We call the consensus matrix.
The operation of a synchronous PS or ring all-reduce is captured by (3) when the underlying graph is a clique, , where is the vector consisting of all ones, and .
Under some standard technical conditions, (Nedić et al., 2018, Thm. 8) and (Duchi et al., 2012, Thm. 2) conclude, respectively for Distributed Subgradient Method (DSM) and for the Dual Averaging Distributed method, that the number of iterations needed to approximate the minimum objective function value by the desired error is
where is the spectral gap of the matrix , i.e., the difference between the moduli of the two largest eigenvalues of . The spectral gap quantifies how information flows in the network. In particular, the spectral gap is maximal for a clique with weights . Motivated by these convergence results, existing theoretically-oriented literature has concluded that a more connected network topology leads to faster convergence (Nedić et al., 2018; Duchi et al., 2012). But some recent experimental results report that consensus-based gradient methods achieve similar performance after the same number of iterations/epochs on topologies as different as rings and cliques. For example (Lian et al., 2017, Fig. 3) shows almost overlapping training losses for different ResNet architectures trained on CIFAR-10 with up to one hundred workers. (Luo et al., 2019, Fig. 20), (Koloskova et al., 2019, Fig. 11 in supplementary material), and our experimental results in Sect. 4 confirm these findings.
Lian et al. (2017) provide a partial explanation for this insensitivity in their Corollary 2, showing that the convergence rate is topology-independent 1) after a large number of iterations (),
2) for a vanishing learning rate, and 3) when the functions are differentiable with Lipschitzian gradients. Under the additional hypothesis of strong convexity, Pu et al. (2019) prove that topology insensitivity should manifest after iterations.
A less connected topology requires more iterations to achieve a given precision as indicated by (4). Our detailed analysis below shows that, when consensus-based optimization methods are used for ML training, the increase in the number of iterations is much less pronounced than previous studies predict. This is due to two different effects. First, consensus is affected only by variability in initial estimates and subgradients across nodes, and not by their absolute values. Second, certain configurations of initial estimates and subgradients are more difficult to achieve a consensus over, and would make the training highly dependent on the topology, but they are unlikely to be obtained by randomly partitioning the dataset.
Let be the number of parameters of the model, and and be matrices, whose columns are, respectively, node estimates and subgradients at the completion of iteration . Equation (3) can be rewritten in the form , from which we obtain iteratively
We make the following assumptions:
all functions are convex,
the set of (global) minimizers is non-empty,
graph is strongly connected,
matrix is normal (i.e., ) and doubly stochastic,
the squared Frobenius norm of subgradient matrix is bounded in expectation over the vector of minibatches randomly drawn at nodes, i.e., there exists , such that .
Assumptions A1-A4 are standard ones in the related literature, see for example (Nedić and Ozdaglar, 2009; Duchi et al., 2012; Nedić et al., 2018). Assumption A5 imposes a bound on the (expected) energy of the subgradients, because . In the literature it is often replaced by the stronger requirement that the norm-2 of the subgradients is bounded. Let denote the matrix , whose column is the difference between subgradient and the average of subgradients . captures the variability in the subgradients. Assumption A5 also implies that there exist two constants and such that
Similarly, let denote the energy of the initial parameter vectors (or an upper bound for it), i.e., . We also denote by the energy for the difference matrix , i.e., . captures the variability in initial estimates. It holds .
Because of Assumption A4, the consensus matrix has a spectral decomposition with orthogonal projectors where are the distinct eigenvalues of , is the orthogonal projector onto the nullspace of along the range of . We assume that the eigenvalues are ordered so that . Assumptions A3 and A4 imply that , and (Appendix B). Finally, we define
where is an upper-bound for the normalized fraction of energy in the subspace defined by projector (Appendix D). The quantity can be interpreted as an effective bound for the fraction of the energy that falls in the subspace relative to the second largest eigenvalue .
We are now ready to introduce our main convergence result. We state it for the average model over nodes and time, i.e., for .
We have also derived a similar bound for the local time-average model at each node, i.e., for (Appendix D.3).
Under assumptions A1-A5 and that a constant learning rate is used, an upper bound for the objective value at the end of the ()th iteration is given by:
Here, denotes the Euclidean distance between vector and set of global minimizers . The proof is in Appendix D.1. The first two terms on the right hand side of (7) also appear when studying the convergence of centralized subgradient methods. The last two terms appear because of the distributed consensus component of the algorithm and depend on . We observe that is the spectral gap of . It measures how well connected the graph is. In particular, the larger the spectral gap (the smaller ), the better the connectivity and the smaller the bound in (7).
Under assumptions A1-A5 and that constant learning rate is used, an upper bound for the objective value at the end of the ()th iteration is given by:
In particular, if workers compute full-batch subgradients and the 2-norm of subgradients of functions is bounded by a constant , we obtain:
When is large enough, the fourth term in (8) and (9) is dominant, so that the error is essentially proportional to . Note that the constant multiplying in (8) is larger than the corresponding one in (7) by a factor
Existing theoretical works like (Nedić and Ozdaglar, 2009; Duchi et al., 2012; Nedić et al., 2018) derived bounds similar to (9) and concluded then that one should select the learning rate proportional to to reduce the effect of topology. In particular, one obtains (4) when . Our bound (7) improves bound (8) by replacing in the third terms of (8) by the smaller value , and in the third and fourth terms by the smaller values and , and introducing the new coefficient . We qualitatively describe the effect of these constants.
For , considerations similar to those applying to hold. What matters is the variability of the subgradients. Assume that the dataset is replicated at each node and each node computes the subgradient over the full batch (). In this case all subgradients would be equal, and , , and the fourth term would also vanish. This corresponds to the fact that, when initial parameter vectors, as well as local functions, are the same, the parameter vectors are equal at any iteration and the system evolves exactly as it would under a centralized subgradient method. In general, local subgradients can be expected to be close (and ), if 1) local datasets are representative of the entire dataset (the dataset has been randomly split and ), and 2) large batch sizes are used. On the other hand, when batch sizes are very small, one expects stochastic subgradients to be very noisy, and as a consequence the energy of the matrix to be much larger than the energy of , so that . In both cases, is large (in the first case because , and in the second because ). We quantify these effects below.
From (5) we see that the effect of the subgradients is modulated by , that equals . The energy of the subgradients is spread across the different subspaces defined by the eigenvectors of . The classic bound (8) implicitly assumes that all energy falls in the subspace corresponding to (this occurs if the row of the matrices are aligned with the second eigenvector). In reality, on average each subspace will only get -th of the total energy (), and the energy in other subspaces will be dissipated faster than what happens for the subspace corresponding to . quantifies this effect.
A toy example in Appendix F illustrates qualitatively these effects. Here we provide estimates for the expected values of , , and over all possible ways to distribute the dataset randomly across the nodes. We reason as follows. For a given parameter vector , consider the set of subgradients at all dataset points, i.e., . The average subgradient over all datapoints is . Let denote the trace of the covariance matrix of all subgradients. then equals the sum of the variances of all the components of the subgradients. We denote by the expanded dataset where each datapoint is replicated times with . The dataset is split across the nodes. Each node selects a random minibatch from its local dataset and we denote by the corresponding subgradient matrix.
Consider a uniform random permutation of with the constraint that copies of the same point are placed at different nodes. The following holds
We can use (11) to study how , , and vary with dataset size, batch size, and number of replicas, using the following approximations:
Figure 1 illustrates the ratio for a particular setting. It also highlights the two regimes discussed above: for large batch sizes and for small ones. As indicates how much looser bound (8) is in comparison to bound (7), and , the figure shows that (8) may indeed overestimate the effect of the topology by many orders of magnitudes. The comparison of Fig. 0(b) and Fig. 0(a) shows that (and then ) does not depend much on the replication factor , but for the fact that the batch size can scale up to .
|split by digit||10||500||0.01||1.01||1.00||1.42||1.42||3.62||1||3||60||1||7||100|
With our experiments we want to 1) evaluate the effect of topology on the number of epochs to converge, and in particular quantify , , , and in practical ML problems, 2) evaluate the effect of topology on the convergence time. We considered three different optimization problems:
Minimization of mean squared error (MSE) for linear regression on the dataset “Relative location of CT slices on axial axis” from (34; F. Graf, H. Kriegel, M. Schubert, S. Pölsterl and A. Cavallaro (2011)). Convexity holds, but gradients are potentially unbounded.
Minimization of cross-entropy loss through a neural network with two convolutional layers on MNIST dataset (Lecun et al., 1998). Neither convexity, nor subgradient boundness hold.
We have developed an ad-hoc Python simulator that allows us to test clusters with a large number of nodes, as well as a distributed application using PyTorch MPI backend to run experiments on a real GPU cluster platform.
Table 1 shows values of , , , and their product for different problems and different settings.
From (7) and (8), we can also compute at which iteration the two bounds predict that the effect of the topology becomes significant, by identifying when the training loss difference between the clique and the ring accounts for a given percentage of the loss decrease over the entire training period. Figure 3 qualitatively illustrates the procedure.
We note that forecasts are very different, despite the fact that, in some settings, our bound is only 3 times tighter than the classic one. Bound (8) predicts that the training loss curves should differ by more than since the first iteration (). The new bound (7) correctly identifies that the topology’s effect becomes evident later, sometimes beyond the total number of iterations performed in the experiment (in this case we indicate ).
Figure 2 shows the training loss evolution for specific settings (one for each ML problem) and two very different topologies (undirected ring and clique), when the dataset is split randomly across the nodes. The behaviour is qualitatively similar to what observed in previous works (Lian et al., 2017, 2018; Luo et al., 2019); despite the remarkable difference in the level of connectivity (quantified also by the spectral gap), the curves are very close, sometimes indistinguishable.
Figure 4 shows the same plot for the case when MNIST images for the same digit have been assigned to the same node. In this case the local datasets are very different and ; the topology has a remarkable effect! This plot warns against extending the empirical finding in (Lian et al., 2017, 2018; Luo et al., 2019) to settings where local datasets can be highly different as it can be for example in the case of federated learning (Konecný et al., 2015).
The experiments above confirm that the communication topology has little influence on the number of epochs needed to converge (when local datasets are statistically similar). Our analysis reconciles (at least in part) theory and experiments by pushing farther the training epoch at which the effect of the topology should be evident.
The conclusion about the role of the topology is radically different if one considers the time to converge. For example, Karakus et al. (2017) and Luo et al. (2019) observe experimentally that sparse topologies can effectively reduce the convergence wall-clock time. A possible explanation is that each iteration is faster because less time is spent in the communication phase: the less connected the graph, the smaller the communication load at each node. Lian et al. (2017, 2018) advance this explanation to justify why DSM on ring-like topologies can converge faster than the centralized PS.
Here, we show that sparse topologies can speed-up wall-clock time convergence even when communication costs are negligible, because they intrinsically mitigate the effect of stragglers, i.e., tasks whose completion time can be occasionally much longer than its typical value. Transient slowdowns are common in computing systems (especially in shared ones) and have many causes, such as resource contention, background OS activities, garbage collection, and (for ML tasks) stopping criteria calculations. Stragglers can significantly reduce computation speed in a multi-machine setting (Ananthanarayanan et al., 2013; Karakus et al., 2017; Li et al., 2018). For consensus-based method, one can hope that, when the topology is sparse, a temporary straggler only slows down a limited number of nodes (its out-neighbors in ), so that the system can still maintain a high throughput.
Neglia et al. (2019) have proposed approximate formulas to evaluate the throughput of distributed ML systems for some specific random distribution of the computation time (uniform, exponential, and Pareto). Here, we take a more practical approach. Our PyTorch-based distributed application allow us to simulate systems with arbitrary distributions of the computation times and communication delays. We have carried out experiments with zero communication delays (an ideal network) and two different empirical distributions for the computation time. One was obtained by running stochastic gradient descent on a production Spark cluster with sixteen servers using Zoe Analytics (Pace et al., 2017), each with two 8-core Intel E5-2630 CPUs running at 2.40GHz. The other was extracted from ASCI-Q super-computer traces (Petrini et al., 2003, Fig. 4). Figure 5 shows the effect of topology connectivity on the convergence time for a MNIST experiment with Spark computation distribution. We consider in this case undirected -regular random graphs. The number of iterations completed per node grows faster the less connected the topology (Fig. 5 (a)). As the training loss is almost independent of the topology (Fig. 5 (b)), the ring achieves the shortest convergence time (Fig. 5 (c)), even if there is no communication delay. Qualitatively similar results for other ML problems and time distributions are in Appendix G.
We have explained, both through analysis and experiments, when and why the communication topology does not affect the number of epochs consensus-based optimization methods need to converge, an effect recently observed in many papers, but not thoroughly investigated. We have also shown that, as a consequence of this invariance, a less connected topology achieves a shorter convergence time, not necessarily because it incurs a smaller communication load, but because it mitigates the stragglers’ problem. The distributed operation of consensus-based approaches appears particularly suited for federated and multi-agent learning. Our study points out that further research is required for these applications, because the benefits observed until now are dependent on the statistical similarity of the local datasets, an assumption that is not satisfied in federated learning.
We warmly thank Alain Jean-Marie, Pietro Michiardi, and Bruno Ribeiro for their feedback on the manuscript and their help preparing the rebuttal letter. We are also grateful to the OPAL infrastructure from UniversitÃ© CÃ´te d’Azur and Inria Sophia Antipolis - MÃ©diterranÃ©e “NEF” computation platform for providing resources and support. Finally, We would like to thank the anonymous reviewers for their insightful comments and suggestions, which helped us improve this work.
This work was supported in part by ARL under Cooperative Agreement W911NF-17-2-0196.
Appendix A Notation
We use an overline to denote an average over all the nodes and a “hat” to denote the time-average. For example
For a matrix, e.g., , denotes the matrix whose column is , i.e.,
where is the orthogonal projector on the subspace generated by . is used to denote the difference .
Given a matrix , and denote the -th row and the -th column, respectively.
We use different standard matrix norms, whose definitions are reported here for completeness. Let be a matrix:
where are the singular values of the matrix and is the largest one.
We will also consider the Frobenius inner product between matrices defined as follows
All the results in Appendix D assume that the matrix is irreducible, primitive, doubly stochastic, non-negative, and normal.
Appendix B Linear algebra reminders
b.1 Irreducible primitive doubly stochastic non-negative matrices
We remind some results from Perron-Frobenius theory (Meyer, 2000, Ch. 8). As our communication graph is strongly connected, and whenever is an edge of , the consensus matrix is irreducible. Moreover, the consensus matrix has non-null diagonal elements and then it is also primitive. The spectral radius is then itself a simple eigenvalue. Because is also stochastic, its eigenvalue coincides with the spectral radius ().
Let be the orthogonal projector on the subspace generated by the unit vector .
The non-zero singular values of are the positive square roots of the non-zero eigenvalues of . We observe that . The spectrum of is equal to the spectrum of but for one eigenvalue that is replaced by an eigenvalue . It follows that .
b.2 Normal matrices
An matrix is normal if . A matrix is normal if and only if it is unitarily diagonalizable (Meyer, 2000, p. 547), i.e., it exists a complete orthonormal set of eigenvectors such that , where is the diagonal matrix containing the eigenvalues and has the eigenvectors as columns.
Normal matrices have a spectral decomposition with orthogonal projectors (Meyer, 2000, p. 517), i.e.,
where are the eigenvalues of , is the orthogonal projector onto the nullspace of along the range of , for , and . Because the projectors are orthogonal and non null it holds and (Meyer, 2000, p. 433). Moreover, for any vector and , it holds:
Symmetric matrices as well as circulant matrices are normal. In fact, a circulant matrix is always diagonalizable by the Fourier matrix and then it is normal.
The non-zero singular values of a normal matrix are the modules of its eigenvalues, i.e., . If the matrix is also imprimitive, irreducible, doubly stochastic and non-negative, it holds
Moreover, observe that in this case .
Appendix C Insensitivity quantification in previous work
Lian et al. (2017) and Pu et al. (2019) have studied the convergence of decentralized stochastic gradient method predicting topology independence after a certain number of iterations. Here, we evaluate quantitatively their predictions on some ML problems and compare them with our experimental observations. The results show that these predictions are very loose and then they do not fully explain why insensitivity is often observed since the beginning of the training phase.
These two papers have slightly different assumptions than ours.
the consensus matrix is symmetric and doubly stochastic,
every has L-Lipschitz continuous gradient, i.e.,
the expected variance of stochastic gradient is uniformly bounded, i.e.,
every is -strongly convex, i.e.,
We observe that A1 implies A4, A2 implies A5, and A4 implies A1.
Corollary C.1 (Corollary 2 in Lian et al. (2017)).
Under assumptions A2-A3, A1-A3, and that a constant learning rate is used, the convergence rate is independent of the topology if the total number of iterations is sufficiently large, in particular if
We estimate the constants , and as follows:
- Estimate of :
, where is a random set of pairs of parameter vectors.
- Estimate of :
, as the dataset is randomly split.
- Estimate of :
We observe that in general underestimates the Lipschitz constant , and is likely to overestimate . Both effects lead to underestimate , that makes our conclusions stronger.
We evaluate these quantities for three machine learning problems (SUSY,
Theorem C.2 (Theorem (4.2) in (Pu et al., 2019)).
Under assumptions A1-A3, A2-A4, suppose , learning rate . For all it holds
where is the minimizer of , , , ,
, and .
Notice that in (20), only the third term is related to the topology as depends on the spectral gap of the consensus matrix. The bound in Corollary C.1 is obtained imposing that the term depending on the spectral gap is smaller than the other terms (Lian et al., 2017, Thm. 1). Here we apply the same idea, requiring the third term of (20) to be bounded by the first and the second term of (20), i.e.,
Then, we have
As and , assuming that , we conclude that the convergence rate is independent of the topology if the total number of iterations is sufficiently large, and more precisely if