Impact of Communication Delays on the Convergence Rate of Distributed Optimization Algorithms
Motivated by applications in machine learning and statistics, we study distributed optimization problems over a network of processors, where the goal is to optimize a global objective composed of a sum of local functions. In these problems, due to the large scale of the data sets, the data and computation must be distributed over processors resulting in the need for distributed algorithms. In this paper, we consider a popular distributed gradient-based consensus algorithm, which only requires local computation and communication. An important problem in this area is to analyze the convergence rate of such algorithms in the presence of communication delays that are inevitable in distributed systems. We prove the convergence of the gradient-based consensus algorithm in the presence of uniform, but possibly arbitrarily large, communication delays between the processors. Moreover, we obtain an upper bound on the rate of convergence of the algorithm as a function of the network size, topology, and the inter-processor communication delays.
There has been much recent interest in large-scale optimization problems, especially in machine learning and statistics. Due to the explosion in the size of data sets, it is important to be able to solve such problems efficiently. In addition, very often large data sets, on the order of terabytes, cannot be stored or processed on one single processor. As a result, both the data and computation must be distributed over a network of processors, necessitating the development of distributed algorithms. Moreover, the computation and communication in these algorithms should be efficient enough so that network latencies do not offset the computational gains.
In this paper, we study distributed algorithms for optimization problems that are defined over a network of nodes111The terms nodes and processors will be used interchangeably., while explicitly accounting for network delays, one of the most critical issues in distributed systems. The objective function is defined by a sum of local functions where each function is known by only one node. Problems of this nature arise in a variety of application domains within the information sciences and engineering. A standard example from statistical machine learning (et. al., 2014) is the problem of minimizing an average loss function over large training data. The data is distributed across a network of processors, where each processor computes the empirical loss over a local subset of data. The processors, therefore, must communicate to determine parameters that minimize the loss over the entire data set. Distributed algorithms for these problems have received a surge in interest in recent years. In particular, there are three widely-studied algorithms for distributed optimization:
Alternating direction method of multipliers (ADMM): This method has a provably fast convergence rate, i.e., an exponential convergence rate under assumptions of strong convexity and smoothness of objective functions; see for example the work in (Meteos et al., 2010; Boyd et al., 2011; Shi et al., 2014; Makhdoumi and Ozdaglar, 2014; Wei and Ozdaglar, 2013). However, the computations of ADMM are not truly parallelizable. The algorithm is often said to have a distributed implementation, which means that different processors compute different variables, but the updates of these variables must be performed sequentially.
Distributed dual averaging: In this algorithm, processors maintain estimates of variables and gradient-like quantities, which are exchanged in a truly parallel fashion. However, dual averaging has a slower convergence rate than ADMM; see for example, the work in (Duchi et al., 2012; Tsianos et al., 2012b; Tsianos and Rabbat, 2012a; Tsianos et al., 2012a).
Distributed gradient algorithms: These algorithms are the most popular and well-studied since they have the benefits of both ADMM and dual averaging; see for example, the work in (Touri and Gharesifard, 2015; Nedić and Ozdaglar, 2009; Nedić et al., 2010; Shi et al., 2015; Nedíc et al., 2016; Qu and Li, 2016; Gharesifard and Cortés, 2014)). In particular, distributed gradient algorithms are parallelizable like dual averaging and have fast convergence rates like ADMM. Moreover, the computation cost of each iteration is smaller than either dual averaging or ADMM.
In this paper, we study distributed gradient methods because of the advantages stated above. In particular, we focus on the convergence in the presence of inter-processor communication delays, which has been identified as an important problem in (D.P. Palomar, 2009) (see chapter 10). Communication delay, which is one of the most fundamental issues in distributed systems, has been studied in other contexts, such as distributed dual averaging (Tsianos et al., 2012a). The analysis in (Tsianos et al., 2012a) is based on adding fictitious nodes corresponding to the number of time delay steps, thus requiring a modification of the true network topology. As a result, the influence of the delays on the convergence rate for the original network topology is not clear. Convergence under delays are also considered in distributed consensus algorithms (Blondel et al., 2005; Nedić and Ozdaglar, 2010; Münz et al., 2011; Tsianos and Rabbat, 2012b; Charalambous et al., 2015), which are special cases of distributed gradient algorithms. However, these results do not apply to the general distributed algorithms considered here. Our goal in this paper, therefore, is to address this open problem of proving convergence and obtaining convergence rates for distibuted gradient algorithms with inter-processor communication delays.
Main Contributions. The main contribution of this paper is to derive the convergence rate of distributed gradient algorithms under uniform communication delays between nodes. In particular, we first show that under some appropriate choice of stepsizes the nodes’ estimates asymptotically converge to the solution of the problem, implying that the impact of communication delays is asymptotically negligible. This step allows us to study the rate of convergence of the algorithm, i.e., the convergence occurs at rate , where is the number of processors, is the time variable, and is the delay constant. In addition, is a constant in that depends on , the spectral properties of network connectivity of the processors. We note that such an explicit formula for the convergence rate is not available for dual averaging methods. As remarked, the existing analysis in distributed optimization literature cannot be extended to show this result. We, therefore, introduce a new approach by considering a new candidate Lyapunov functional, which takes into account the impact of delays. Finally, while we do not analyze dual averaging methods in the presence of delays, we provide simulation results comparing it to distributed gradient methods, which indicate that distributed gradient methods perform significantly better.
The remainder of this paper is organized as follows. We give a formal statement of distributed optimization problems in Section 2. We then study distributed gradient algorithms for the uniform delay case in Section 3 and present their convergence results in Section 4. In Section 5 we compare the performances of distributed gradient methods and dual averaging methods by simulations for both the delay-free and uniform delay cases. The proofs of our main results in Sections 4 are given in Section 6. Finally, we conclude this paper with some discussion of potential future extensions in Section 7.
Notation 1 ().
We use boldface to distinguish between vectors in and scalars in . Given any vector , we write and let denote its Euclidean norm. Given a vector and a set we write the projection of on as . Finally we denote by and a vector whose entries are and the identity matrix, respectively.
2. Problem formulation
In this paper, we consider an optimization problem where the objective function is distributed over a network of nodes. In particular, let be an undirected graph over the vertex set with the edge set . Associated with each node is a convex function . The goal of the network is to solve the following minimization problem:
where is compact, convex, and known by the nodes. We assume no central coordination between the nodes and since each node knows only a local function , the nodes are required to cooperatively solve the problem. We are interested in studying distributed consensus-based methods for problem (1) implying that each node maintains its own parameter estimate which is used to estimate the solution of (1). The nodes are only allowed to exchange their estimates with their neighbors through communication constraints imposed by a graph : in particular, node can communicate directly only with its neighbors where is the set of node ’s neighbors. The goal is to asymptotically drive the nodes’ estimates to , a solution of (1).
A concrete motivating example for this problem is distributed linear regression problems solved over a network of processors. Regression problems involving massive amounts of data are common in machine learning applications. Each function is the empirical loss over the local data stored at processor . The objective is to minimize the total loss over the entire dataset. Due to the difficulty of storing the enormous amount of data at a central location, the processors perform local computations over the local data, which are then exchanged to arrive at the globally optimal solution. Distributed gradient methods are a natural choice to solve such problems since they have been observed to be both fast and easily parallelizable in the case where the processors can exchange data instantaneously. The goal of this paper is to show that the algorithm continues to be convergent in the presence of delays, and to derive expressions for the convergence rate as a function of the delays. Another possible application of the model is the problem of estimating the radio frequency in a wireless network of sensors where the goal is to cooperatively estimate the radio-frequency power spectrum density through solving a regression problem (Meteos et al., 2010). In this application, each function is the empirical loss over the local data measured by the sensors, which are scattered across a large geographical area. The objective function is the total loss over the entire measured data, which is the sum of . Due to privacy concerns, the sensors may not be willing to share their measurements, but only their own estimates. Thus, distributed consensus-based methods seem to be a proper choice for this problem.
We conclude this section with additional notation and assumptions which facilitate our development given later. We make the following assumptions throughout the paper.
Assumption 1 ().
The functions are convex and differentiable.
Assumption 2 ().
The graph is undirected and connected.
Under Assumption 1 and since the set is compact, there exists a point which solves problem (1). However, may not be unique. We will use to denote the set of optimal solutions to problem (1). Moreover, given a solution we denote . Under Assumption 1 it is obvious that the functions are Lipschitz continuous, which we present below as a Proposition for future reference.
Proposition 2.1 ().
Let Assumption 1 hold. Then each function is Lipschitz continuous, i.e., there exists a positive constant such that
Given a vector we denote by the set of feasible directions of in , i.e.,
In the sequel we use the following results from (Bertsekas et al., 2004).
Proposition 2.2 (Proposition (Bertsekas et al., 2004)).
Let be a closed convex set. Then the tangent cone at is closed, convex, and , where is the closure of .
Finally, for ease of exposition, in the rest of this paper we consider problem (1) when the variable is a scalar, i.e., . Extensions for the case are presented in the appendix.
3. Distributed Gradient Methods under Communication Delays
Discrete-time distributed gradient methods were studied and first analyzed rigorously in (Nedić and Ozdaglar, 2009; Nedić et al., 2010) for the case of no communication delay; in this framework each node maintains a variable updated as,
where is some sequence of positive stepsizes and is some positive constant. In this paper we focus on the continuous-time version of (4) under the impact of uniform communication delays between nodes. In particular, we assume that at any time node only receives a delayed value of from node , where is a constant representing the time delay of communication between nodes. Each node (for all ) then uses these values to update its estimate as formally stated in (5), where is the tangent cone of at , is some postive constant, and is a sequence of positive stepsizes. The conditions of and to guarantee convergence of the algorithm will be explicitly given later. In addition, the initial conditions, , are assumed to be continuous functions of time. Thus, the estimates are now functionals since they are functions of . We assume that the delays are uniform across agents, represented by the positive constant .
This update has a simple interpretation: at any time , each node first combines its estimate with the weighted, delayed values received from its neighbors , with the goal of seeking consensus on their estimates. Each node then moves along the gradient of its respective objective function to update its estimate, pushing the consensus point toward the optimal set . The projection on the tangent cone guarantees that for all . Here the positive constant represents the weight which node assigns to the value received from node . Moreover, the nodes use the positive constant , which is inversely proportional to the delay constant , to control the speed of their updates. The distributed gradient algorithm with communication delays is formulated in Algorithm 1.
In the sequel, we denote by the weighted adjacency matrix corresponding to the graph , whose -th entries are . We make an assumption on which is standard in the consensus literature to guarantee the convergence of the nodes’ estimates to a consensus point. The assumption given below also imposes a constraint on the communication between the nodes in Algorithm 1 in which the nodes are only allowed to exchange messages with neighboring nodes, i.e., those are connected to them, as defined by .
Assumption 3 ().
is a doubly stochastic matrix, i.e., . Moreover, is assumed to be irreducible and aperiodic. Finally, the weights if and only if otherwise .
We note that the assumption on the irreducibility of can be satisfied when is connected. In addition, the aperiodicity of is guaranteed when at least one of its diagonal is strictly positive. Finally, the double stochasticity of is essential to the distributed consensus averaging problem (Nedić et al., 2009), a special case of problem (1). There has been some work in which this assumption is relaxed to just stochasticity of , however; additional assumptions on the problem are then imposed; see for example, push-sum protocols recently studied in (Nedić and Olshevsky, 2015).
4. Convergence Results
The focus of this section is to analyze the performance of distributed gradient methods under communication delays given in Algorithm 1. In particular, we provide a rigorous analysis which establishes the convergence rate of Algorithm 1. The main steps of the analysis are as follows.
We first show that the distances between the estimates to their average asymptotically converge to zero. We then study the convergence rate of Algorithm 1, where we utilize the standard techniques used in the centralized version of subgradient methods. The key idea of this step is to introduce a candidate Razumikhin-Krasovskii Lyapunov functional, which takes into account the impact of delays on the system. By using this function, we can show that the impact of delays is asymptotically negligible. In particular, we show that if each node maintains a variable to compute the time-weighted averages of the estimates and if the stepsize decays with rate , the algorithm achieves an asymptotic convergence to the optimal value estimated on the variable at a rate , where and . Here represents the algebraic connectivity of the graph .
We start our analysis by first introducing more notation. Given a vector we denote its average as , i.e.,
For convenience, we use the following notation,
We denote by the second largest singular value of , i.e., is the square root of the second largest eigenvalue of . Since is doubly stochastic we have is also doubly stochastic. In addition, also satisfies Assumption 3. Thus, by the Perron-Frobenius theorem (Horn and Johnson, 1985) we have .
Finally, without loss of generality we consider for some real numbers . The multi-dimensional case of is presented in the Appendix. This simplification will allow us to write explicitly the projection on the tangent cone in (5). In particular, given a real number we denote , the positive part of . Similarly, we denote , the negative part of . The update in (5) can now be rewritten as
where denotes the component-wise projection. Moreover, we have
As remarked, the first step in our analysis is to show the asymptotic convergence of to zero under some appropriate choice of stepsizes. The following Lemma, which will be essential for our analysis later, is an important facet of this result.
Lemma 4.1 ().
The main idea in the proof of Lemma 4.1 is to show (12). The analysis of (14) and (15) are consequences of (12) with the given assumptions on stepsizes and proper algebraic manipulations. We, therefore, provide here the key steps for the proof of (12), where the details are delayed to Section 6.1.
Finally, the key step of our analysis is to provide an upper bound for
which is done by applying the - Inequality (Khalil, 2002).
We are now ready to state our main result of this section, which is the convergence rate of Algorithm 1 to the optimal value using standard techniques in the analysis of centralized subgradient methods. One can view the update in (11) as a centralized projected subgradient used to solve problem (1). Specifically, at any time if each node maintains a variable to compute the time-weighted average of its estimate and if the stepsize decays as , the objective function value estimated at each converges to the optimal value with a rate , where and . We also note that this condition on the stepsizes is also used to study the convergence rate of centralized subgradient methods (Nesterov, 2004). The following Theorem is used to show the convergence rate of Algorithm 1, and its proof is given in Section 6.2
Theorem 4.2 ().
Sketch of Proof.
As mentioned previously, the main idea of this proof is to introduce a candidate Lyapunov functional, which takes into account the impact of delays. In particular, a quadractic Lyapunov function, i.e., , is often used in the case of no communication delay. However, since the estimates depends on the interval we consider an extra term to study this impact. Specifically, we consider the following candidate Razumikhin-Krasovskii Lyapunov functional (Hale and Lunel, 1993):
We then show that is suffciently decreasing by considering the following two main steps.
One can show that the derivative of satisfies
Remark 0 ().
Note that the convergence rate in (17) requires each node computing the time-weighted average of its estimate. This can be done iteratively as follows. Let every node stores a variable initialized at time with an arbitrary and for all updated by
where and for . Then we have
In this section, we apply the distributed gradient algorithm to study the well-known linear regression problem in statistical machine learning, which is the most popular technique for data fitting (Hastie et al., 2009; Shalev-Shwartz and Ben-David, 2014). The goal of this problem is to find a linear relationship between a set of variables and some real value outcome. Here, we focus on quadratic loss functions, that is, given a training set for , we want to learn a parameter that minimizes the following least squares problem,
We assume that the data sets are distributedly stored in a network of processors, i.e., each processor knows only the pair .
For the purpose of simulations, we consider the discrete-time version of Algorithm 1, i.e., Eq. (4) with communication delays . We simulate for the case when where , i.e., . We consider simulated training data sets, i.e., are generated randomly with uniform distribution between . We consider the performance of the distributed gradient algorithm on different sizes of network , where each network is generated as follows.
In each network, we first randomly generate the nodes’ coordinates in the plane with uniform distribution.
Then any two nodes are connected if their distance is less than a reference number , e.g, for our simulations.
Finally we check whether the network is connected. If not we return to step and run the program again.
To implement our algorithm, the communication matrix is chosen as a lazy Metropolis matrix corresponding to , i.e.,
It is straightforward to verify that the lazy Metropolis matrix satisfies Assumption 3. In all simulations considered herein, we set the stepsize for and .
In the sequel, we will compare the performance of the discretized version of distributed gradient (DG) with distributed dual averaging (DA) (Duchi et al., 2012; Tsianos et al., 2012a) for solving problem (20) in the delay-free case as well as in the case of constant delays. For DA, we chose the same stepsize as used in our algorithm. Simulations show that the distributed gradient algorithm outperforms distributed dual averaging in both cases.
5.1. Delay-free case
In the delay-free case, i.e., , we simulate DG and DA for three different sizes of networks, namely, , , and . In each simulation, we fix the number of iterations and output the worst-case distance of the function value to the optimal value, i.e., , where . The simulations are shown in Fig. 1.
In these simulations, the performance of the DG algorithm is always slightly better than that of the DA algorithm, but overall they seem to share the same convergence rate , which agrees with the analytical result in Theorem 4.2 and in (Duchi et al., 2012; Nedić and Ozdaglar, 2009).
5.2. Uniform delays
To study the impact of uniform communication delays on the performance of DG and DA, similar to the delay-free case we simulate the two algorithms for three different sizes of networks, namely, , , and . We implement DG and DA for each network, and terminate them when . We let the delay constant run from to and output the number of iterations as a function on . We plot the number of iterations as a function on the number of delay steps. The simulations are shown in Fig. 2.
We first note that the delays do influence the convergence rate of the two algorithms, that is, the greater the delay between nodes the more time the algorithms need to terminate. Second, as shown by the curve for DG the number of iterations seems to increase as a cubic function of the number of delay steps, which agrees with our analysis in Theorem 4.2. Finally, in this example, uniform delays have a bigger impact on the performance of DA, that is, DA requires more iterations to converge than DG under the same number of delay steps.
6. Proofs of Main Results
We provide here the complete proof of the main results presented in Section 4. In the following Lemma, we first study some important properties for the projection error , which can be viewed as the one-dimension version of Lemma A.1 for the general convex set , stated in the Appendix.
Lemma 6.1 ().
Recall that . Moreover, by (7) we have the following three cases for all
If then .
If then we have . If then . Otherwise if then since we have . This implies that
This implies that
Finally, if then . If then implying . Otherwise, if then , which implies
Thus we have
From these three cases, we have , which by (2) implies .
Let be a feasible direction, i.e., satisfies (22). Consider
We now investigate the second term of the previous relation for three cases
If then implying .
If then we have . If then implying . Otherwise if then . Since we have , which implies since
Finally, if then . If then implying . Otherwise, if then . Since we have , which implies since .
6.1. Proof of Lemma 4.1
We start by introducing the following notation for convenience
We first show the details of steps stated in the proof sketch of Lemma 4.1.
Taking the norm of (26), using the triangle inequality, and since we obtain
We first note that by the triangle inequality and (2) we have
Moreover, by (21) we have
Moreover, consider the second term on the right-hand side of (30)