Communication/Computation Tradeoffs in Consensus-Based Distributed Optimization
We study the scalability of consensus-based distributed optimization algorithms by considering two questions: How many processors should we use for a given problem, and how often should they communicate when communication is not free? Central to our analysis is a problem-specific value $r$ which quantifies the communication/computation tradeoff. We show that organizing the communication among nodes as a $k$-regular expander graph [1] yields speedups, while when all pairs of nodes communicate (as in a complete graph), there is an optimal number of processors that depends on $r$. Surprisingly, a speedup can be obtained, in terms of the time to reach a fixed level of accuracy, by communicating less and less frequently as the computation progresses. Experiments on a real cluster solving metric learning and non-smooth convex minimization tasks demonstrate strong agreement between theory and practice.
How many processors should we use and how often should they communicate for large-scale distributed optimization? We address these questions by studying the performance and limitations of a class of distributed algorithms that solve the general optimization problem
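$$\underset{x \in \mathcal{X}}{\text{minimize}} \quad F(x) = \frac{1}{m} \sum_{j=1}^{m} l_j(x) \tag{1}$$
where each function $l_j(x)$ is convex over a convex set $\mathcal{X} \subseteq \mathbb{R}^d$. This formulation applies widely in machine learning scenarios, where $l_j(x)$ measures the loss of model $x$ with respect to data point $j$, and $F(x)$ is the cumulative loss over all $m$ data points.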
Although efficient serial algorithms exist [2], the increasing size of available data and problem dimensionality are pushing computers to their limits and the need for parallelization arises [3]. Among many proposed distributed approaches for solving (1), we focus on consensus-based distributed optimization [4, 5, 6, 7] where each component function in (1) is assigned to a different node in a network (i.e., the data is partitioned among the nodes), and the nodes interleave local gradient-based optimization updates with communication using a consensus protocol to collectively converge to a minimizer of $F(x)$.
Consensus-based algorithms are attractive because they make distributed optimization possible without requiring centralized coordination or significant network infrastructure (as opposed to, e.g., hierarchical schemes [8]). In addition, they combine simplicity of implementation with robustness to node failures and are resilient to communication delays [9]. These qualities are important in clusters, which are typically shared among many users, so algorithms need to be immune to slow nodes that use part of their computation and communication resources for unrelated tasks. The main drawback of consensus-based optimization algorithms comes from the potentially high communication cost associated with distributed consensus. At the same time, existing convergence bounds in terms of iterations (e.g., (7) below) suggest that increasing the number of processors slows down convergence, which contradicts the intuition that more computing resources are better.
This paper focuses on understanding the limitations and potential for scalability of consensus-based optimization. We build on the distributed dual averaging framework [4]. The key to our analysis is to attach to each iteration a cost that involves two competing terms: a computation cost per iteration which decreases as we add more processors, and a communication cost which depends on the network. Our cost expression quantifies the communication/computation tradeoff by a parameter $r$ that is easy to estimate for a given problem and platform. The role of $r$ is essential; for example, when nodes communicate at every iteration, we show that in complete graph topologies, there exists an optimal number of processors $n_{opt} = \frac{1}{\sqrt{r}}$, while for $k$-regular expander graphs [1], increasing the network size yields a diminishing speedup. Similar results are obtained when nodes communicate every $h$ iterations and even when $h$ increases with time. We validate our analysis with experiments on a cluster. Our results show a remarkable agreement between theory and practice.
In Section II we formalize the distributed optimization problem and summarize the distributed dual averaging algorithm. Section III introduces the communication/computation tradeoff and contains the basic analysis where nodes communicate at every iteration. The general case of sparsifying communication is treated in Section IV. Section V tests our theoretical results on a real cluster implementation and Section VI discusses some future extensions.
II Distributed Convex Optimization
Assume we have at our disposal a cluster with $n$ processors to solve (1), and suppose without loss of generality that $m$ is divisible by $n$. In the absence of any other information, we partition the data evenly among the processors and our objective becomes to solve the optimization problem,
$$\underset{x \in \mathcal{X}}{\text{minimize}} \quad F(x) = \frac{1}{m} \sum_{j=1}^{m} l_j(x) = \frac{1}{n} \sum_{i=1}^{n} \left( \frac{n}{m} \sum_{j=1}^{m/n} l_{j|i}(x) \right) = \frac{1}{n} \sum_{i=1}^{n} f_i(x) \tag{2}$$
where we use the notation $l_{j|i}$ to denote loss associated with the $j$th local data point at processor $i$ (i.e., $l_{j|i} = l_{(i-1)\frac{m}{n} + j}$). The local objective functions $f_i(x)$ at each node are assumed to be $L$-Lipschitz and convex. The recent distributed optimization literature contains multiple consensus-based algorithms with similar rates of convergence for solving this type of problem. We adopt the distributed dual averaging (DDA) framework [4] because its analysis admits a clear separation between the standard (centralized) optimization error and the error due to distributing computation over a network, facilitating our investigation of the communication/computation tradeoff.
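As a concrete illustration of the even partitioning described above, the following minimal sketch (not part of the original experiments) splits a toy dataset among nodes and checks numerically that averaging the local objectives recovers the global objective. The helper names and the absolute-deviation loss are illustrative assumptions.

```python
import numpy as np


def partition_indices(m, n):
    """Split data indices 0..m-1 into n equal contiguous blocks (m divisible by n)."""
    if m % n != 0:
        raise ValueError("m must be divisible by n")
    return np.arange(m).reshape(n, m // n)


def abs_loss(x, point):
    """Hypothetical toy loss: absolute deviation, convex and 1-Lipschitz in x."""
    return abs(x - point)


def global_objective(x, data):
    """Global objective: average loss over all m data points."""
    return float(np.mean([abs_loss(x, d) for d in data]))


def local_objective(x, data, block):
    """Local objective at one node: (n/m) times the sum of its m/n local losses."""
    return float(np.mean([abs_loss(x, data[j]) for j in block]))


data = np.array([0.0, 1.0, 2.0, 3.0, 4.0, 5.0])   # m = 6 toy data points
blocks = partition_indices(len(data), 3)           # n = 3 nodes, 2 points each
x = 1.5
F = global_objective(x, data)
F_from_nodes = float(np.mean([local_objective(x, data, b) for b in blocks]))
print(F, F_from_nodes)  # the two values agree up to rounding
```

The equality holds only because every node receives the same number of points; with uneven blocks the local objectives would need to be weighted by block size.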
II-A Distributed Dual Averaging (DDA)
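In DDA, nodes iteratively communicate and update optimization variables to solve (2). Nodes only communicate if they are neighbors in a communication graph $G = (V, E)$, with the $|V| = n$ vertices being the processors. The communication graph is user-defined (application layer) and does not necessarily correspond to the physical interconnections between processors. DDA requires three additional quantities: a $1$-strongly convex proximal function $\psi : \mathbb{R}^d \to \mathbb{R}$ satisfying $\psi(x) \ge 0$ and $\psi(0) = 0$ (e.g., $\psi(x) = \frac{1}{2} x^T x$); a positive step size sequence $a(t) = O(\frac{1}{\sqrt{t}})$; and a doubly stochastic consensus matrix $P$ with entries $p_{ij} > 0$ only if either $i = j$ or $(j, i) \in E$ and $p_{ij} = 0$ otherwise. The algorithm repeats for each node $i$ in discrete steps $t$, the following updates:
$$z_i(t) = \sum_{j=1}^{n} p_{ij} z_j(t-1) + g_i(t-1) \tag{3}$$
$$x_i(t) = \underset{x \in \mathcal{X}}{\arg\min} \left\{ \langle z_i(t), x \rangle + \frac{1}{a(t)} \psi(x) \right\} \tag{4}$$
$$\hat{x}_i(t) = \frac{1}{t} \left( x_i(t) + (t-1)\, \hat{x}_i(t-1) \right) \tag{5}$$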
where $g_i(t-1) \in \partial f_i(x_i(t-1))$ is a subgradient of $f_i(x)$ evaluated at $x_i(t-1)$. In (3), the variable $z_i(t) \in \mathbb{R}^d$ maintains an accumulated subgradient up to time $t$ and represents node $i$’s belief of the direction of the optimum. To update $z_i(t)$ in (3), each node must communicate to exchange the variables $z_j(t)$ with its neighbors in $G$. If $\psi(x^*) \le R^2$, for the local running averages $\hat{x}_i(T)$ defined in (5), the error from a minimizer $x^*$ of $F(x)$ after $T$ iterations is bounded by (Theorem 1, [4])
where $L$ is the Lipschitz constant, $\|\cdot\|_*$ indicates the dual norm, $\bar{z}(t) = \frac{1}{n} \sum_{i=1}^{n} z_i(t)$, and $\|\bar{z}(t) - z_i(t)\|_*$ quantifies the network error as a disagreement between the direction to the optimum at node $i$ and the consensus direction $\bar{z}(t)$ at time $t$. Furthermore, from Theorem 2 in [4], with $a(t) = \frac{A}{\sqrt{t}}$, after optimizing for $A$ we have a bound on the error,
where $\lambda_2$ is the second largest eigenvalue of $P$. The dependence on the communication topology is reflected through $\lambda_2$, since the sparsity structure of $P$ is determined by $G$. According to (7), increasing $n$ slows down the rate of convergence even if $\lambda_2$ does not depend on $n$.
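To make the updates (3)-(5) concrete, here is a minimal single-process simulation of DDA. It is a sketch under stated assumptions rather than the implementation used in the experiments: the local objectives are $f_i(x) = \|x - c_i\|_1$ with hypothetical centers $c_i$, the proximal function is $\psi(x) = \frac{1}{2} x^T x$ (so that step (4) reduces to a Euclidean projection of $-a(t) z_i(t)$), the constraint set is a box, the step size is $a(t) = 1/\sqrt{t}$, and the consensus matrix is the uniform one for a complete graph.

```python
import numpy as np


def dda(centers, P, T, step=1.0, radius=10.0):
    """Simulate DDA for f_i(x) = ||x - c_i||_1 over the box [-radius, radius]^d.

    centers: (n, d) array, one hypothetical center c_i per node.
    P:       (n, n) doubly stochastic consensus matrix.
    Returns the (n, d) array of running averages after T iterations.
    """
    n, d = centers.shape
    z = np.zeros((n, d))       # accumulated subgradients z_i, one row per node
    x = np.zeros((n, d))       # primal iterates x_i
    x_sum = np.zeros((n, d))   # running sums used for the averages in (5)
    for t in range(1, T + 1):
        g = np.sign(x - centers)                  # subgradient of f_i at x_i(t-1)
        z = P @ z + g                             # consensus plus subgradient step (3)
        a = step / np.sqrt(t)                     # step size a(t)
        x = np.clip(-a * z, -radius, radius)      # step (4) for psi(x) = 0.5 * x^T x
        x_sum += x
    return x_sum / T


def objective(x, centers):
    """Global objective: average of the local L1 distances."""
    return float(np.mean(np.abs(x - centers).sum(axis=1)))


centers = np.array([[1.0], [2.0], [3.0], [4.0], [6.0]])
n = centers.shape[0]
P = np.full((n, n), 1.0 / n)            # complete graph, uniform weights
x_hat = dda(centers, P, T=4000)
x_star = np.array([3.0])                # the median of the centers minimizes F
gap = max(objective(x_hat[i], centers) - objective(x_star, centers) for i in range(n))
print(gap)
```

In this toy problem every node's running average approaches the median of the centers even though each node only ever evaluates subgradients of its own local objective.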
III Communication/Computation Tradeoff
In consensus-based distributed optimization algorithms such as DDA, the communication graph and the cost of transmitting a message have an important influence on convergence speed, especially when communicating one message requires a non-trivial amount of time (e.g., if the dimension of the problem is very high).
We are interested in the shortest time to obtain an $\epsilon$-accurate solution (i.e., $F(\hat{x}_i(T)) - F(x^*) \le \epsilon$). From (7), convergence is faster for topologies with good expansion properties; i.e., when the spectral gap of $P$ does not shrink too quickly as $n$ grows. In addition, it is preferable to have a balanced network, where each node has the same number of neighbors so that all nodes spend roughly the same amount of time communicating per iteration. Below we focus on two particular cases and take $G$ to be either a complete graph (i.e., all pairs of nodes communicate) or a $k$-regular expander [1].
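The role of the topology can be checked numerically. The sketch below compares the second largest eigenvalue (in magnitude) of a uniform consensus matrix on a complete graph with that of a $k$-regular circulant graph. The circulant graph is an assumption chosen because it is easy to construct; it is regular but not an expander, so it illustrates the unfavourable case in which the spectral gap shrinks as the network grows.

```python
import numpy as np


def complete_graph_matrix(n):
    """Uniform averaging matrix: every node weights all n nodes equally."""
    return np.full((n, n), 1.0 / n)


def circulant_regular_matrix(n, k):
    """Symmetric doubly stochastic matrix for a k-regular circulant graph.

    Node i is linked to i +/- 1, ..., i +/- k/2 (mod n); each neighbor and the
    node itself get weight 1 / (k + 1). Requires k even and k < n.
    """
    if k % 2 != 0 or k >= n:
        raise ValueError("need k even and k < n")
    P = np.zeros((n, n))
    for i in range(n):
        P[i, i] = 1.0 / (k + 1)
        for s in range(1, k // 2 + 1):
            P[i, (i + s) % n] = 1.0 / (k + 1)
            P[i, (i - s) % n] = 1.0 / (k + 1)
    return P


def second_largest_eigenvalue(P):
    """Second largest eigenvalue magnitude of a symmetric stochastic matrix."""
    magnitudes = np.sort(np.abs(np.linalg.eigvalsh(P)))
    return float(magnitudes[-2])


lam_complete = second_largest_eigenvalue(complete_graph_matrix(16))
lam_small = second_largest_eigenvalue(circulant_regular_matrix(16, 4))
lam_large = second_largest_eigenvalue(circulant_regular_matrix(64, 4))
print(lam_complete, lam_small, lam_large)
```

For the complete graph the value is zero up to rounding, while for the circulant graph it moves towards one as the number of nodes grows with the degree held fixed.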
By using more processors, the total amount of communication inevitably increases. At the same time, more data can be processed in parallel in the same amount of time. We focus on the scenario where the size of the dataset is fixed but possibly very large. To understand whether there is room for speedup, we move away from measuring iterations and employ a time model that explicitly accounts for communication cost. This will allow us to study the communication/computation tradeoff and draw conclusions based on the total amount of time to reach an $\epsilon$-accurate solution.
III-A Time Model
At each iteration, in step (3), processor $i$ computes a local subgradient on its subset of the data:
$$g_i(t) = \frac{\partial f_i(x_i(t))}{\partial x} = \frac{n}{m} \sum_{j=1}^{m/n} \frac{\partial l_{j|i}(x_i(t))}{\partial x} \tag{8}$$
The cost of this computation increases linearly with the subset size. Let us normalize time so that one processor computes a subgradient on the full dataset of size $m$ in $1$ time unit. Then, using $n$ CPUs, each local gradient will take $\frac{1}{n}$ time units to compute. We ignore the time required to compute the projection in step (4); often this can be done very efficiently and requires negligible time when $m$ is large compared to $n$ and $d$.
We account for the cost of communication as follows. In the consensus update (3), each pair of neighbors in $G$ transmits and receives one variable $z_j(t-1)$. Since the message size depends only on the problem dimension $d$ and does not change with $m$ or $n$, we denote by $r$ the time required to transmit and receive one message, relative to the $1$ time unit required to compute the full gradient on all the data. If every node has $k$ neighbors, the cost of one iteration in a network of $n$ nodes is
$$\frac{1}{n} + k r \quad \text{time units.} \tag{9}$$
Using this time model, we study the convergence rate bound (7) after attaching an appropriate time unit cost per iteration. To obtain a speedup by increasing the number of processors for a given problem, we must ensure that $\epsilon$-accuracy is achieved in fewer time units.
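The time model is simple enough to evaluate directly. The following sketch computes the cost of one iteration as a computation term that shrinks with the number of nodes plus a communication term that grows with the number of neighbors; the function names and the sample values of the tradeoff parameter are illustrative assumptions.

```python
def iteration_cost(n, k, r):
    """Time units for one iteration with n nodes, k neighbors per node and
    per-message cost r (time is normalized so one full gradient costs 1)."""
    if n < 1 or k < 0 or r < 0:
        raise ValueError("need n >= 1, k >= 0, r >= 0")
    return 1.0 / n + k * r


def total_time(num_iterations, n, k, r):
    """Time units for num_iterations iterations when nodes communicate every iteration."""
    return num_iterations * iteration_cost(n, k, r)


# Hypothetical platform: one message costs 1% of a full gradient computation.
r = 0.01
cost_complete = iteration_cost(n=10, k=9, r=r)   # complete graph on 10 nodes
cost_sparse = iteration_cost(n=10, k=3, r=r)     # 3-regular graph on 10 nodes
print(cost_complete, cost_sparse)
```

With the same number of nodes, the sparser graph has the cheaper iteration; whether it also reaches a given accuracy sooner depends on how the topology affects the convergence bound.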
III-B Simple Case: Communicate at Every Iteration
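In the original DDA description (3)-(5), nodes communicate at every iteration. According to our time model, $T$ iterations will cost $\tau = T \left( \frac{1}{n} + k r \right)$ time units. From (7), the time $\tau(\epsilon)$ to reach error $\epsilon$ is found by substituting for $T$ and solving for $\tau(\epsilon)$. Ignoring the log factor in (7), and writing $C_1$ for the leading constant of that bound, we get
$$\tau(\epsilon) = \frac{C_1^2}{\epsilon^2} \left( \frac{1}{n} + k r \right). \tag{10}$$
This simple manipulation reveals some important facts. If communication is free, then $r = 0$. If in addition the network is a $k$-regular expander, then $\lambda_2$ is fixed, $C_1$ is independent of $n$ and $\tau(\epsilon) = \frac{C_1^2}{\epsilon^2 n}$. Thus, in the ideal situation, we obtain a linear speedup by increasing the number of processors, as one would expect. In reality, of course, communication is not free.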
Complete graph. Suppose that $G$ is the complete graph, where $k = n - 1$ and $\lambda_2 = 0$. In this scenario we cannot keep increasing the network size without eventually harming performance due to the excessive communication cost. For a problem with a communication/computation tradeoff $r$, the optimal number of processors is calculated by minimizing $\tau(\epsilon)$ for $n$:
$$\frac{\partial \tau(\epsilon)}{\partial n} = 0 \implies n_{opt} = \frac{1}{\sqrt{r}}. \tag{11}$$
Again, in accordance with intuition, if the communication cost is too high (i.e., $r \ge 1$) and it takes more time to transmit and receive a gradient than it takes to compute it, using a complete graph cannot speed up the optimization. We reiterate that $r$ is a quantity that can be easily measured for a given hardware and a given optimization problem. As we report in Section V, the optimal value predicted by our theory agrees very well with experimental performance on a real cluster.
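The optimal network size for a complete graph can be sanity-checked by brute force. The sketch below assumes that the constant in front of the per-iteration cost does not depend on the number of nodes (as is the case when the second largest eigenvalue is zero), estimates the tradeoff parameter from two hypothetical timing measurements, and compares the closed-form optimum with an exhaustive search over integer network sizes. All names and timing values are illustrative.

```python
import math


def tradeoff_parameter(message_seconds, gradient_seconds):
    """Ratio of the time to send and receive one message to the time one node
    needs for a gradient on the full dataset."""
    return message_seconds / gradient_seconds


def time_to_accuracy_complete(n, r, eps=0.1, c=1.0):
    """(c / eps)^2 iterations, each costing 1/n + (n - 1) * r time units."""
    return (c / eps) ** 2 * (1.0 / n + (n - 1) * r)


def optimal_processors(r):
    """Closed-form minimizer of 1/n + (n - 1) * r over real n > 0."""
    return 1.0 / math.sqrt(r)


# Hypothetical measurements: 0.5 s per message, 50 s per full gradient.
r = tradeoff_parameter(0.5, 50.0)
n_closed_form = optimal_processors(r)
n_brute_force = min(range(1, 101), key=lambda n: time_to_accuracy_complete(n, r))
print(r, n_closed_form, n_brute_force)
```

Because the closed form minimizes over real values, in general the best integer network size is one of the two integers adjacent to it.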
IV General Case: Sparse Communication
The previous section analyzes the case where processors communicate at every iteration. Next we investigate the more general situation where we adjust the frequency of communication.
IV-A Bounded Intercommunication Intervals
Suppose that a consensus step takes place once every iterations. That is, the algorithm repeats cheap iterations (no communication) of cost time units followed by an expensive iteration (with communication) with cost . This strategy clearly reduces the overall average cost per iteration. The caveat is that the network error is higher because of having executed fewer consensus steps.
where counts the number of communication steps in iterations, and if and otherwise. Using the fact that , we obtain
Taking norms, recalling that the are convex and Lipschitz, and since , we arrive at
Using a technique similar to that in  to bound the distance of row of to its stationary distribution as grows, we can show that
for all . Comparing (16) to equation (29) in , the network error within iterations is no more than times larger when a consensus step is only performed once every iterations. Finally, we substitute the network error in (II-A). For , we have , and
We minimize the leading term over to obtain
Of the iterations, only involve communication. So, iterations will take
To achieve -accuracy, ignoring again the logarithmic factor, we need iterations, or
From the last expression, for a fixed number of processors , there exists an optimal value for that depends on the network size and communication graph :
If the network is a complete graph, using yields ; i.e., using more processors hurts performance when not communicating every iteration. On the other hand, if the network is a -regular expander then for constants , and we obtain a diminishing speedup.
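The effect of periodic communication on the time spent per iteration can be sketched as follows. The code assumes that consensus steps happen at iterations $h, 2h, 3h, \dots$ (so $h = 1$ recovers communication at every iteration); this indexing convention and the function name are assumptions made for illustration. The sketch accounts only for elapsed time, not for the larger network error that less frequent communication incurs.

```python
def periodic_schedule_time(num_iterations, n, k, r, h):
    """Return (time_units, communications) when nodes communicate every h iterations.

    Every iteration pays 1/n for the local subgradient; only the iterations
    that are multiples of h additionally pay k * r for communication.
    """
    if h < 1:
        raise ValueError("h must be a positive integer")
    communications = num_iterations // h
    time_units = num_iterations / n + communications * k * r
    return time_units, communications


# Hypothetical setting: 10 nodes in a complete graph, r = 0.01, 1000 iterations.
every_iteration = periodic_schedule_time(1000, n=10, k=9, r=0.01, h=1)
every_fifth = periodic_schedule_time(1000, n=10, k=9, r=0.01, h=5)
print(every_iteration, every_fifth)
```

In this example the same number of iterations takes roughly 190 time units when communicating every iteration and roughly 118 when communicating every fifth iteration; the optimal interval balances this saving against the increase in the error constant.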
IV-B Increasingly Sparse Communication
Next, we consider progressively increasing the intercommunication intervals. This captures the intuition that as the optimization moves closer to the solution, progress slows down and a processor should have “something significantly new to say” before it communicates. Let denote the number of cheap iterations performed between the st and th expensive iteration; i.e., the first communication is at iteration , the second at iteration , and so on. We consider schemes where for . The number of iterations that nodes communicate out of the first total iterations is given by . We have
which means that as . Similar to (15), the network error is bounded as
We split the sum into two terms based on whether or not the powers of have converged. Using the split point , the term is bounded by when is large and by when is small:
since . Substituting this bound into (II-A) and taking the step size sequence to be with and to be determined, we get
The first four summands converge to zero when . Since ,
which converges to zero if . To bound the last term, note that , so the term goes to zero as if . In conclusion, converges no slower than since . If we choose to balance the first three summands, for small , the rate of convergence is arbitrarily close to , while nodes communicate increasingly infrequently as .
Out of total iterations, DDA executes expensive iterations involving communication and cheap iterations without communication, so
In this case, the communication cost becomes a less and less significant proportion of as increases. So for any , if is fixed, we approach a linear speedup behaviour . To get , ignoring the logarithmic factor, we need
From this last equation we see that for we have , so using increasingly sparse communication should, in fact, be faster than communicating at every iteration.
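To see how quickly communication thins out under a growing intercommunication interval, the sketch below assumes a polynomial schedule in which the $j$th gap is $\lceil j^p \rceil$ iterations; the rounding and the function names are assumptions made for illustration. It lists the iterations at which nodes communicate and evaluates the resulting total time under the time model of Section III.

```python
import math


def communication_iterations(num_iterations, p):
    """Iterations in 1..num_iterations at which nodes communicate when the
    j-th intercommunication gap is ceil(j ** p)."""
    if p < 0:
        raise ValueError("p must be non-negative")
    times, t, j = [], 0, 1
    while True:
        t += math.ceil(j ** p)
        if t > num_iterations:
            break
        times.append(t)
        j += 1
    return times


def sparse_schedule_time(num_iterations, n, k, r, p):
    """Total time: every iteration costs 1/n, each communication adds k * r."""
    communications = len(communication_iterations(num_iterations, p))
    return num_iterations / n + communications * k * r


dense = communication_iterations(100, p=0)     # gap 1: communicate every iteration
growing = communication_iterations(100, p=1)   # gaps 1, 2, 3, ...
print(len(dense), len(growing), growing[:4])
print(sparse_schedule_time(100, n=10, k=9, r=0.01, p=1))
```

With linearly growing gaps only 13 of the first 100 iterations involve communication, so the communication cost becomes a vanishing fraction of the total time. The sketch says nothing about convergence, which the analysis above guarantees only for a restricted range of the exponent.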
V Experimental Evaluation
To verify our theoretical findings, we implement DDA on a cluster of nodes with 3.2 GHz Pentium HT processors and GB of memory each, connected via ethernet that allows for roughly MB/sec throughput per node. Our implementation is in C++ using the send and receive functions of OpenMPI v1.4.4 for communication. The Armadillo v2.3.91 library, linked to LAPACK and BLAS, is used for efficient numerical computations.
V-A Application to Metric Learning
Metric learning [11, 12, 13] is a computationally intensive problem where the goal is to find a distance metric such that points that are related have a very small distance under while for unrelated points is large. Following the formulation in , we have a data set with and signifying whether or not is similar to (e.g., similar if they are from the same class). Our goal is to find a symmetric positive semi-definite matrix to define a pseudo-metric of the form . To that end, we use a hinge-type loss function where is a threshold that determines whether two points are dissimilar according to . In the batch setting, we formulate the convex optimization problem
The subgradient of at is zero if . Otherwise
Since DDA uses vectors and , we represent each pair as a dimensional vector. The communication cost is thus quadratic in the dimension. In step (3) of DDA, we use the proximal function , in which case (4) simplifies to taking , followed by projecting to the constraint set by setting and projecting to the set of positive semi-definite matrices by first taking its eigenvalue decomposition and reconstructing after forcing any negative eigenvalues to zero.
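The projection onto the set of positive semi-definite matrices described above can be sketched with NumPy as follows. This is a minimal illustration of the eigenvalue-clipping step only, not the authors' C++/Armadillo code; the function name is hypothetical, and the explicit symmetrization is an added safeguard against rounding error.

```python
import numpy as np


def project_psd(A):
    """Project a symmetric matrix onto the positive semi-definite cone by
    zeroing its negative eigenvalues and reconstructing the matrix."""
    A = (A + A.T) / 2.0                      # guard against slight asymmetry
    eigenvalues, eigenvectors = np.linalg.eigh(A)
    clipped = np.maximum(eigenvalues, 0.0)   # force negative eigenvalues to zero
    return (eigenvectors * clipped) @ eigenvectors.T


A = np.array([[1.0, 2.0],
              [2.0, 1.0]])                   # eigenvalues are 3 and -1
A_psd = project_psd(A)
print(A_psd)
```

A full eigenvalue decomposition costs on the order of the cube of the matrix size, which is one reason the projection can be neglected in the time model only when the dataset is large relative to the problem dimension.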
We use the MNIST digits dataset which consists of pixel images of handwritten digits through . Representing images as vectors, we have and a problem with dimensions trying to learn a matrix . With double precision arithmetic, each DDA message has a size approximately MB. We construct a dataset by randomly selecting pairs from the full MNIST data. One node needs seconds to compute a gradient on this dataset, and sending and receiving MB takes seconds. The communication/computation tradeoff value is estimated as . According to (11), when is a complete graph, we expect to have optimal performance when using nodes. Figure 1(left) shows the evolution of the average function value for to processors connected as a complete graph, where is as defined in (5). There is a very good match between theory and practice since the fastest convergence is achieved with nodes.
In the second experiment, to make closer to , we apply PCA to the original data and keep the top principal components, containing of the energy. The dimension of the problem is reduced dramatically to and the message size to KB. Using random pairs of MNIST data, the time to compute one gradient on the entire dataset with one node is seconds, while the time to transmit and receive KB is only seconds. Again, for a complete graph, Figure 1(right) illustrates the evolution of for to nodes. As we see, increasing speeds up the computation. The speedup we get is close to linear at first, but diminishes since communication is not entirely free. In this case and .
V-B Nonsmooth Convex Minimization
Next we create an artificial problem where the minima of the components at each node are very different, so that communication is essential in order to obtain an accurate optimizer of . We define as a sum of high dimensional quadratics,
where , and are the centers of the quadratics. Figure 2 illustrates again the average function value for nodes in a complete graph topology. The baseline performance is when nodes communicate at every iteration (). For this problem and, from (21), . Naturally communicating every iterations () slows down convergence. Over the duration of the experiment, with , each node communicates with its peers times. We selected for increasingly sparse communication, and got communications per node. As we see, even though nodes communicate as much as the case, convergence is even faster than communicating at every iteration. This verifies our intuition that communication is more important in the beginning. Finally, the case where is shown. This value is out of the permissible range, and as expected DDA does not converge to the right solution.
VI Conclusions and Future Work
The analysis and experimental evaluation in this paper focus on distributed dual averaging and reveal the capability of distributed dual averaging to scale with the network size. We expect that similar results hold for other consensus-based algorithms such as  as well as various distributed averaging-type algorithms (e.g., [15, 16, 17]). In the future we will extend the analysis to the case of stochastic optimization, where could correspond to using increasingly larger mini-batches.
- [1] O. Reingold, S. Vadhan, and A. Wigderson, “Entropy waves, the zig-zag graph product, and new constant-degree expanders,” Annals of Mathematics, vol. 155, no. 2, pp. 157–187, 2002.
- [2] Y. Nesterov, “Primal-dual subgradient methods for convex problems,” Mathematical Programming Series B, vol. 120, pp. 221–259, 2009.
- [3] R. Bekkerman, M. Bilenko, and J. Langford, Scaling up Machine Learning, Parallel and Distributed Approaches. Cambridge University Press, 2011.
- [4] J. Duchi, A. Agarwal, and M. Wainwright, “Dual averaging for distributed optimization: Convergence analysis and network scaling,” IEEE Transactions on Automatic Control, vol. 57, no. 3, pp. 592–606, 2011.
- [5] A. Nedic and A. Ozdaglar, “Distributed subgradient methods for multi-agent optimization,” IEEE Transactions on Automatic Control, vol. 54, no. 1, January 2009.
- [6] B. Johansson, M. Rabi, and M. Johansson, “A randomized incremental subgradient method for distributed optimization in networked systems,” SIAM Journal on Optimization, vol. 20, no. 3, 2009.
- [7] S. S. Ram, A. Nedic, and V. V. Veeravalli, “Distributed stochastic subgradient projection algorithms for convex optimization,” Journal of Optimization Theory and Applications, vol. 147, no. 3, pp. 516–545, 2011.
- [8] A. Agarwal and J. C. Duchi, “Distributed delayed stochastic optimization,” in Neural Information Processing Systems, 2011.
- [9] K. I. Tsianos and M. G. Rabbat, “Distributed dual averaging for convex optimization under communication delays,” in American Control Conference (ACC), 2012.
- [10] F. Chung, Spectral Graph Theory. AMS, 1998.
- [11] E. P. Xing, A. Y. Ng, M. I. Jordan, and S. Russell, “Distance metric learning, with application to clustering with side-information,” in Neural Information Processing Systems, 2003.
- [12] K. Q. Weinberger and L. K. Saul, “Distance metric learning for large margin nearest neighbor classification,” Journal of Machine Learning Research, vol. 10, pp. 207–244, 2009.
- [13] K. Q. Weinberger, F. Sha, and L. K. Saul, “Convex optimizations for distance metric learning and pattern classification,” IEEE Signal Processing Magazine, 2010.
- [14] S. Shalev-Shwartz, Y. Singer, and A. Y. Ng, “Online and batch learning of pseudo-metrics,” in ICML, 2004, pp. 743–750.
- [15] M. A. Zinkevich, M. Weimer, A. Smola, and L. Li, “Parallelized stochastic gradient descent,” in Neural Information Processing Systems, 2010.
- [16] R. McDonald, K. Hall, and G. Mann, “Distributed training strategies for the structured perceptron,” in Annual Conference of the North American Chapter of the Association for Computational Linguistics, 2010, pp. 456–464.
- [17] G. Mann, R. McDonald, M. Mohri, N. Silberman, and D. D. Walker, “Efficient large-scale distributed training of conditional maximum entropy models,” in Neural Information Processing Systems, 2009, pp. 1231–1239.
Appendix A
A-A Proof of equation (12)
Let us stack the local node variables in a vector and . From (3) in matrix form we have after back-substituting in the recursion
and after some algebra
or in general
where counts the number of communication steps in iterations and if and otherwise. From this last expression we take the -th row to get the result.
A-B Proof of equation (16)
If the consensus matrix $P$ is doubly stochastic it is straightforward to show that $P^t \to \frac{1}{n} \mathbf{1}\mathbf{1}^T$ as $t \to \infty$. Moreover, from standard Perron-Frobenius theory it is easy to show (see e.g., [StookDiaconis])
so in our case . Next, demand that the right hand side bound is less than with to be determined:
So with the choice ,
if . When is large and we simply take . The desired bound of (A-B) is now obtained as follows
Since we know that . Moreover, . Using these two facts we arrive at the result.
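The convergence of the powers of a doubly stochastic consensus matrix to the uniform averaging matrix, which underlies the proof above, can be observed numerically. The sketch below uses a symmetric ring-graph matrix as an illustrative assumption and measures the deviation in the spectral norm, which is a simplification: the appendix bounds a row-wise distance instead. For a symmetric matrix the spectral-norm deviation after $t$ steps equals the $t$th power of the second largest eigenvalue magnitude.

```python
import numpy as np


def ring_matrix(n):
    """Symmetric doubly stochastic matrix for a ring of n >= 3 nodes:
    weight 1/3 on the node itself and on each of its two neighbors."""
    if n < 3:
        raise ValueError("need at least 3 nodes")
    P = np.zeros((n, n))
    for i in range(n):
        for j in (i - 1, i, i + 1):
            P[i, j % n] = 1.0 / 3.0
    return P


def deviation_from_average(P, t):
    """Spectral norm of P^t minus the uniform averaging matrix."""
    n = P.shape[0]
    uniform = np.full((n, n), 1.0 / n)
    return float(np.linalg.norm(np.linalg.matrix_power(P, t) - uniform, 2))


P = ring_matrix(8)
lam2 = float(np.sort(np.abs(np.linalg.eigvalsh(P)))[-2])
deviations = [deviation_from_average(P, t) for t in (1, 10, 50)]
print(lam2, deviations)
```

The deviation decays geometrically, so the number of consensus steps needed to reach a given agreement grows only logarithmically in the target precision, with a rate governed by the second largest eigenvalue.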