A Distributed, Asynchronous and Incremental Algorithm for Nonconvex Optimization: An ADMM Based Approach
The alternating direction method of multipliers (ADMM) has been popular for solving many signal processing problems, convex or nonconvex. In this paper, we study an asynchronous implementation of the ADMM for solving a nonconvex nonsmooth optimization problem whose objective is the sum of a number of component functions. The proposed algorithm allows the problem to be solved in a distributed, asynchronous and incremental manner. First, the component functions can be distributed to different computing nodes, which perform the updates asynchronously without coordinating with each other. Two sources of asynchrony are covered by our algorithm: one is caused by the heterogeneity of the computational nodes, and the other arises from unreliable communication links. Second, the algorithm can be viewed as implementing an incremental algorithm where at each step the (possibly delayed) gradients of only a subset of component functions are updated. We show that when certain bounds are put on the level of asynchrony, the proposed algorithm converges to the set of stationary solutions (resp. optimal solutions) for the nonconvex (resp. convex) problem. To the best of our knowledge, the proposed ADMM implementation can tolerate the highest degree of asynchrony among all known asynchronous variants of the ADMM. Moreover, it is the first ADMM implementation that can deal with nonconvexity and asynchrony at the same time.
Consider the following nonconvex and nonsmooth problem
where the component functions are smooth and possibly nonconvex, and the regularization term is convex and nonsmooth. In this paper we consider the scenario where the component functions are located at different distributed computing nodes. We seek an algorithm that is capable of computing high quality solutions for problem (1) in a distributed, asynchronous and incremental manner.
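As a point of reference, one standard way to write a problem of this type is sketched below. The symbols $K$, $g_k$, $h$, $x$ and $X$ are notational assumptions introduced here for illustration: $g_k$ stands for the $k$-th smooth (possibly nonconvex) component function, $h$ for the convex nonsmooth regularization term, and $X$ for a constraint set.

```latex
\begin{equation*}
\min_{x \in X} \; f(x) := \sum_{k=1}^{K} g_k(x) + h(x)
\end{equation*}
```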
Dealing with asynchrony is a central theme in designing distributed algorithms. Indeed, in a completely decentralized setting there is often no clock synchronization, little coordination among the distributed nodes, and only minimal mechanisms to ensure reliable communication. Therefore an ideal distributed algorithm should be robust enough to handle different sources of asynchrony, while still producing high quality solutions in a reasonable amount of time. Since the seminal work of Bertsekas and Tsitsiklis [1, 2], there has been a large body of literature focusing on asynchronous implementations of various distributed schemes; see, e.g., [3, 4, 5, 6, 7] for the developments by the optimization and signal processing communities. For example, an incremental and asynchronous gradient-based algorithm has been proposed to solve a convex problem, where at each step certain outdated gradients can be used for the update. In [6, 5], the authors show that the well-known iterative water-filling algorithm [8, 9] can be implemented in a totally asynchronous manner, as long as the interference among the users is weak enough.
The recent interest in optimization and machine learning for problems with massive amounts of data introduces yet another compelling reason for dealing with asynchrony; see [10, Chapter 10]. When large amounts of data are distributed across computing nodes, local computations can be costly and time consuming. If synchronous algorithms are used, then the slowest nodes can drag down the performance of the entire system. To make distributed learning algorithms scalable and efficient, the machine learning community has also started to deal with asynchrony; see recent results in [11, 12, 13, 14, 15, 16]. For example, an asynchronous randomized block coordinate descent method has been developed for solving convex block-structured problems, where the per-block update can utilize delayed gradient information. It has also been shown that asynchrony can be tolerated in stochastic optimization, with a rate of convergence that is more or less independent of the maximum allowable delay, which is an improvement over earlier results.
In this paper, we show that through the lens of the ADMM method, the nonconvex and nonsmooth problem (1) can be optimized in an asynchronous, distributed, and incremental manner. The ADMM, originally developed in the early 1970s [17, 18], has been extensively studied in the last two decades [19, 20, 21, 22, 23, 24, 25, 26, 27]. It is known to be effective in solving large-scale linearly constrained convex optimization problems. Its applications include machine learning, computer vision, signal and image processing, networking, etc.; see [28, 29, 30, 31, 32, 33]. However, despite various successful numerical attempts (see, e.g., [34, 35, 36, 37, 38, 39, 40, 41, 42]), little is known about whether ADMM is capable of handling nonconvex optimization problems, or whether it can be used in an asynchronous setting. There are a few recent results that start to fill these gaps. One recent result shows that the ADMM converges when applied to certain nonconvex consensus and sharing problems, provided that the stepsize is chosen large enough. However, it is not clear whether asynchrony will destroy the convergence. Another work proposes an asynchronous implementation for the convex global consensus problem, where the distributed worker nodes can use outdated information for updates. Two conditions are imposed on the protocol, namely the partial barrier and the bounded delay. That algorithm cannot deal with the asynchrony caused by loss/delay in the communication link, nor does it cover nonconvex problems. In [44, 45], randomized versions of ADMM are proposed for consensus problems, where the nodes are allowed to be randomly activated for updates. We note that the algorithms in [44, 45] still require the nodes to use up-to-date information whenever they update; therefore they are more in line with randomized algorithms than with asynchronous algorithms. Further, it is not known whether the analysis carries over to the case when the problem is nonconvex.
The algorithm proposed in this work is a generalization of the flexible proximal ADMM algorithm proposed in [43, Section 2.3]. The key feature of the proposed algorithm is that it can deal with asynchrony arising from the heterogeneity of the computing nodes as well as from the loss/delay caused by unreliable communication links. The basic requirement here is that the combined effects of these sources lead to a bounded delay on the component gradient evaluation, and that the stepsize of the algorithm is chosen appropriately. Further, we show that the framework studied here can be viewed as a (possibly asynchronous) incremental scheme for nonconvex problems, where at each iteration only a subset of (possibly delayed) component gradients are updated. To the best of our knowledge, asynchronous incremental schemes of this kind have not been studied in the literature; see [46, 47, 48] for recent works on synchronous incremental algorithms for nonconvex problems.
II The ADMM-based Framework
Consider the optimization problem (1). In many practical applications, each component function needs to be handled by a single distributed node, such as a thread or a processor, which motivates the so-called global consensus formulation [49, Section 7]. Suppose there is a master node and a number of distributed nodes available. Let us introduce a set of new variables, one local copy of the optimization variable per distributed node, and transform problem (1) into the following linearly constrained problem
The augmented Lagrangian function is given by
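where the penalty parameter associated with each node is some positive constant, and the dual variables of all nodes are collected together. Applying the vanilla ADMM algorithm, listed below in (4), one obtains a distributed solution where each component function is handled by only a single node at any iteration: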
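For orientation, a standard form of the consensus reformulation, its augmented Lagrangian, and one vanilla ADMM iteration is sketched below. The notation is assumed for illustration only: $x$ is the master variable, $x_k$ the local copy held by node $k$, $y_k$ the multiplier of the constraint $x_k = x$, $\rho_k > 0$ the penalty parameter, $K$ the number of nodes, and $t$ the iteration index.

```latex
% Global consensus reformulation (assumed notation)
\begin{align*}
\min_{x,\{x_k\}} \quad & \sum_{k=1}^{K} g_k(x_k) + h(x) \\
\text{s.t.} \quad & x_k = x, \quad k = 1,\dots,K, \qquad x \in X.
\end{align*}
% Augmented Lagrangian with penalties \rho_k > 0 and multipliers y_k
\begin{equation*}
L(\{x_k\}, x; y) = \sum_{k=1}^{K} g_k(x_k) + h(x)
  + \sum_{k=1}^{K} \langle y_k, x_k - x \rangle
  + \sum_{k=1}^{K} \frac{\rho_k}{2} \| x_k - x \|^2 .
\end{equation*}
% One synchronous ADMM iteration
\begin{align*}
x^{t+1}   &= \arg\min_{x \in X} \; L(\{x_k^{t}\}, x; y^{t}), \\
x_k^{t+1} &= \arg\min_{x_k} \; g_k(x_k) + \langle y_k^{t}, x_k - x^{t+1} \rangle
             + \frac{\rho_k}{2} \| x_k - x^{t+1} \|^2, \quad \forall k, \\
y_k^{t+1} &= y_k^{t} + \rho_k \,( x_k^{t+1} - x^{t+1} ), \quad \forall k.
\end{align*}
```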
At this point, it is important to note that the algorithm described in (4) uses a synchronous protocol, that is:
The set of agents that are selected to update at each iteration act in a coordinated way;
There is no communication delay and/or loss between the agents and the master node;
All local updates are performed assuming that the most up-to-date information is available.
However, in many practical large-scale networks, these assumptions rarely hold. Nodes may have different computational capacities, or they may be assigned jobs that have different computational requirements. Therefore the time consumed to complete local computation can vary significantly among the nodes. This makes it difficult for them to coordinate with each other in terms of when to update, which information to use for the update, and so on. Further, the communication links between the distributed nodes and the master node can have delays or may even be lossy.
Additionally, we want to mention that in certain machine learning and signal processing problems with a large number of component functions, it is desirable that the algorithm is incremental, meaning that at each iteration only a subset of the component functions is used for the update; see [50, 51, 46, 47, 48]. Clearly the vanilla ADMM described in (4) does not belong to this type of algorithm.
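The synchronous protocol of the vanilla ADMM discussed above can be illustrated with a minimal sketch. It is not taken from the paper: it assumes a scalar variable, hypothetical quadratic components 0.5*a[k]*(x - c[k])**2, an l1 regularizer lam*|x|, and a box constraint |x| <= radius, so that every subproblem has a closed form. The function names are illustrative.

```python
def soft_threshold_clip(u, thresh, radius):
    """Minimize thresh*|x| + 0.5*(x - u)**2 over the interval [-radius, radius]."""
    s = max(abs(u) - thresh, 0.0) * (1.0 if u >= 0 else -1.0)
    return max(-radius, min(radius, s))


def sync_consensus_admm(a, c, rho, lam, radius, iters):
    """Synchronous consensus ADMM for sum_k 0.5*a[k]*(x - c[k])**2 + lam*|x|, |x| <= radius."""
    K = len(a)
    x = 0.0               # master variable
    xk = [0.0] * K        # local copies, one per node
    y = [0.0] * K         # dual variables, one per consensus constraint
    beta = sum(rho)
    for _ in range(iters):
        # Master update: minimize the augmented Lagrangian over x.
        u = sum(rho[k] * xk[k] + y[k] for k in range(K)) / beta
        x = soft_threshold_clip(u, lam / beta, radius)
        # Every node updates in lockstep, using the fresh copy of x.
        for k in range(K):
            # Closed-form minimizer of the quadratic local subproblem.
            xk[k] = (a[k] * c[k] - y[k] + rho[k] * x) / (a[k] + rho[k])
            y[k] += rho[k] * (xk[k] - x)
    return x
```

In this sketch every node is touched at every iteration and always sees the newest master variable, which is exactly the coordination that becomes hard to guarantee in a heterogeneous or lossy network.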
II-B The Proposed Algorithm
There are two key features that we want to build into the ADMM-based algorithm. One is to allow the nodes to use stale information for local computation, as long as such information is not “too old” (this notion will be made precise shortly). This enables the nodes to have varying update frequencies, so that faster nodes do not need to wait for the slower ones. The other feature is to take into account scenarios where the communication links among the nodes are lossy or have delays. Below we give a high level description of the proposed scheme.
Suppose there is a master node and a number of distributed nodes in the system. Let the iteration index count the total number of updates that have been performed on the master's variable. The master node takes care of updating all the primal and dual variables, while the distributed nodes compute the gradients of the component functions. At each iteration, the master node first updates its variable. Then it waits a fixed period of time, collects a few (possibly stale) gradients of component functions returned by a subset of local nodes, and proceeds to the next round of updates. On the other hand, each node is in charge of a local component function. Based on the copy of the master's variable passed along by the master node, the node computes the gradient of its component function and returns it to the master node. Note that for data-intensive applications, the computation of the gradient can be time consuming. There can also be communication delays between different nodes in the network. Therefore there is no guarantee that the variable at the master node remains the same during the period in which a gradient is computed and communicated.
To characterize the possible delay involved in the computation and communication, we define a delay-index sequence for each node: at every iteration, it records the index of the copy of the master's variable at which the gradient used by the master node for that node was evaluated.
The proposed algorithm, named Asynchronous Proximal ADMM (Async-PADMM), is given in the following table.
In Algorithm 1, we have used the proximity operator, which is defined below. Consider a (possibly nonsmooth) convex function. For every point, the proximity operator of this function is defined as [52, Section 31]
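For reference, a common scaled form of this definition is written out below. The symbols $u$ and $\beta > 0$ are notational assumptions; setting $\beta = 1$ gives the classical proximity operator.

```latex
\begin{equation*}
\operatorname{prox}^{\beta}_{h}(x) := \arg\min_{u} \; h(u) + \frac{\beta}{2} \, \| x - u \|^2
\end{equation*}
```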
We note that in Step S2, the selected set defines the subset of component functions whose gradients have arrived during the current iteration; again, the delay index of a node is the index of the copy of the master's variable at which the gradient used by the master node at the current iteration was evaluated. For those component functions without new gradient information available, the old gradients continue to be used (indeed, their delay indices are unchanged from the previous iteration). In Step S3, all the local variables, regardless of whether their gradients are fresh or not, are updated according to the following gradient-type scheme:
Despite the fact that the gradients of all the component functions are used at each step, only a subset of them (i.e., those indexed by the selected set) differ from those at the previous iteration. Therefore the algorithm can be classified as an incremental algorithm; see [50, 51] for related incremental algorithms for convex problems.
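The master-side iteration described above can be simulated with the following sketch. It is one reading of the description, not the author's exact algorithm table: it assumes a scalar variable, an l1 regularizer with a box constraint (so the proximity operator is a clipped soft-threshold), gradient-type local updates of the form x_k = x - (grad_k + y_k)/rho_k, and a hypothetical random arrival model in which each gradient is evaluated at a copy that is at most max_delay - 1 iterations old. All function and parameter names are illustrative.

```python
import random


def prox_l1_box(u, lam, beta, radius):
    """argmin over |x| <= radius of lam*|x| + (beta/2)*(x - u)**2 (scalar case)."""
    s = max(abs(u) - lam / beta, 0.0) * (1.0 if u >= 0 else -1.0)
    return max(-radius, min(radius, s))


def async_padmm(grads, rho, lam, radius, max_delay, iters, p_arrive=0.5, seed=0):
    """Simulate the master-side iteration with possibly stale gradients."""
    rng = random.Random(seed)
    K = len(grads)
    beta = sum(rho)
    x_hist = [0.0]      # x_hist[s] is the s-th copy of the master variable
    xk = [0.0] * K      # local variables
    y = [0.0] * K       # dual variables
    idx = [0] * K       # index of the copy at which node k's gradient was evaluated
    x = x_hist[0]
    for _ in range(iters):
        # S1: master update through the proximity operator.
        u = sum(rho[k] * xk[k] + y[k] for k in range(K)) / beta
        x = prox_l1_box(u, lam, beta, radius)
        x_hist.append(x)
        now = len(x_hist) - 1
        # S2: new gradients arrive from a random subset of nodes; each one was
        # evaluated at some recent (possibly out-of-order) copy of x.
        for k in range(K):
            too_old = now - idx[k] >= max_delay
            if too_old or rng.random() < p_arrive:
                idx[k] = rng.randint(max(0, now - max_delay + 1), now)
        # S3-S4: every pair (x_k, y_k) is updated, reusing old gradients
        # for the nodes from which nothing new has arrived.
        for k in range(K):
            g = grads[k](x_hist[idx[k]])
            xk[k] = x - (g + y[k]) / rho[k]
            y[k] += rho[k] * (xk[k] - x)
    return x
```

With max_delay = 1 every gradient is evaluated at the newest copy and the simulation reduces to a synchronous proximal scheme; larger values allow stale and out-of-order gradients, in which case the penalties in rho are taken large relative to the curvature of the components.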
To highlight the asynchronous aspect of the algorithm, below we present an equivalent version of Algorithm 1, from the perspective of the distributed nodes and the master node, respectively. Each distributed node keeps its own local clock, and the master node keeps a separate clock.
It is not hard to see that the scheme described here is equivalent to Algorithm 1, except that in Algorithm 1 every step is measured using the clock at the master node. We have the following remarks regarding the above algorithm descriptions.
(Blocking Events) There is a minimal number of blocking events for both the master node and the distributed agents. In Algorithm 1(a), the master node only needs to wait for a given period of time in Step S3. After the waiting period, it collects the set of new gradients that have arrived during that period. Note that this set is allowed to be empty, meaning that the master node does not block on the arrival of any local gradients. Similarly, each node does not need to wait for the rest of the agents to perform their computation: once it obtains a new copy of the master's variable, its computation starts immediately. As soon as the computation is done, the node can send out the new gradient without checking whether that gradient has arrived at the master node. Admittedly, in Step S1 of Algorithm 1(b), each node needs to wait for a new copy of the master's variable, but this is reasonable because otherwise there is nothing it can do.
(Characterization on the Delays) The proposed algorithm allows communication delays and packet loss between the master and the distributed nodes. For example, the vector broadcast by the master node may arrive at different distributed nodes at different time instances; it may even arrive at a given node out of order, i.e., a later copy arrives before an earlier one. Further, a copy may get lost during transmission and never reach a given node. All these scenarios can happen in the reverse communication direction as well. Comparing Algorithm 1 and Algorithm 1(a)–(b), we see that if a new gradient from a node arrives at the current iteration, then the gap between the current iteration index and the index of the copy used to evaluate that gradient is the total computation time plus the round-trip communication delay, starting from the broadcast of that copy until the updated gradient is received by the master node. If no new gradient arrives from the node, then the gap is the number of times that its gradient has been used so far (or equivalently the number of iterations since the last gradient from that node arrived). Clearly, when there is no delay at all, the system is synchronous and every gradient is evaluated at the most recent copy. In Fig. 1, we illustrate the relationship between these indices, as well as the different types of asynchronous events covered by the algorithm.
(Connection to Existing Algorithms) To the best of our knowledge, the proposed algorithm can tolerate the highest degree of asynchrony among all known asynchronous variants of ADMM. For example, the asynchronous scheme for the convex global consensus problem discussed earlier corresponds to the case where there is no communication delay or loss (all messages sent are received instantaneously by the intended receiver). It is not clear whether that scheme can be generalized to our case (in fact, no proof is provided for it, so it is difficult to see whether its analysis can be extended). The schemes proposed in [44] and [45] require the nodes to use the most up-to-date information, hence they are hardly asynchronous. The second major difference with the existing literature is about the tasks performed by the distributed nodes: in [45, 44, 37] each node directly optimizes the augmented Lagrangian, while here each node computes the gradient of its respective component function. The third difference is on the assumptions made on problem (1): the schemes in [45, 44, 37] handle convex problems in which each component function can be nonsmooth, while we can handle nonconvex functions, but there can be only a single nonsmooth function (see Assumption A1 below). The fourth difference is on the assumed network topology: the schemes in [45, 44] deal with a general topology, where nodes are interconnected according to certain graphs; our work and the asynchronous consensus scheme mentioned above are restricted to the “star” network topology, where all distributed nodes communicate directly with the master node.
(Incrementalism) Algorithm 1 can be viewed as an incremental algorithm, as long as the set of nodes selected at each iteration is a strict subset of all nodes, in which case the gradients of only a subset of component functions are updated. This is in the same spirit as several recent incremental algorithms for convex problems [50, 51], despite the fact that our algorithm has a different form and can further handle nonconvexity and asynchrony.
It is worth noting that Algorithm 1 can be modified to resemble the more traditional incremental algorithms, where at each iteration only those variables with “fresh” gradients are updated. That is, Steps S3 and S4 are replaced with the following steps:
S3)’ Update the local variables with fresh gradients by solving:
S4)’ Update the corresponding dual variables:
However, we found that this variant leads to a much more complicated analysis (it requires defining a few additional sequences, one for each node, to characterize the iteration indices at which each component variable is updated, and imposing that each component variable is updated often enough; see [1, Chapter 7]), more stringent requirements on the range of stepsizes, and, most importantly, slow convergence. Therefore we choose not to discuss the related variants in this paper. We also note that recent works on incremental-type algorithms for solving (1) either do not deal with nonconvex problems [50, 51], or do not consider asynchrony [46, 47, 48].
III Convergence Analysis
In order to reduce the notational burden, our analysis will be based on Algorithm 1, which uses a global clock. We first make a few assumptions.
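A1) (On the Problem) There exists a positive constant such that
Moreover, the regularization term is convex (possibly nonsmooth); the feasible set is closed, convex and compact. The objective function is bounded from below over the feasible set.
A2) (On the Asynchrony) The total delays are bounded, i.e., for each node there exists a finite constant such that, at every iteration, the gradient used for that node was evaluated at a copy generated within that many of the most recent iterations.
A3) (On the Algorithm) For all nodes, the stepsize is chosen large enough such that: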
By Assumption A2, the only requirement on the asynchrony is that, whenever the gradient of a component function is updated, the information used to compute it is one of the copies of the master's variable generated within a bounded number of the most recent iterations. So it is perfectly legitimate if copies of the variable or copies of the gradients get lost due to unsuccessful communication. Also, nothing prevents copies of the variable from arriving at the same node in reversed order (e.g., an older copy arrives after a newer one). Due to this assumption on the boundedness of the asynchrony, Algorithm 1 belongs to the family of “partially asynchronous” algorithms, as opposed to “totally asynchronous” algorithms, in which the delays can potentially be unbounded (in short, the only requirement for a totally asynchronous algorithm is that no node quits forever).
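The bounded-delay requirement can be checked on a recorded trace of gradient arrivals, as in the sketch below. The trace format and the helper names `delay_indices` and `max_staleness` are hypothetical and serve only to illustrate how stale a reused gradient may become.

```python
def delay_indices(num_nodes, arrivals, num_iters):
    """Return, for each iteration t = 1..num_iters, the copy index used for every node.

    arrivals[t] maps a node k to the index of the copy at which the gradient
    arriving at iteration t was evaluated; nodes without an arrival keep
    their previous index (their old gradient is reused).
    """
    idx = [0] * num_nodes
    history = []
    for t in range(1, num_iters + 1):
        for k, s in arrivals.get(t, {}).items():
            idx[k] = s
        history.append(list(idx))
    return history


def max_staleness(history):
    """Largest gap between the iteration index and the copy index in use."""
    return max(t - s for t, row in enumerate(history, start=1) for s in row)
```

A trace satisfies a delay bound of T if `max_staleness(history) <= T`; lost or out-of-order messages are harmless as long as this gap stays bounded.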
From Assumption A3, it is clear that when the system is synchronous, i.e., when there is no delay, the bound for the stepsize becomes
We have the following result.
Lemma III.1 Suppose Assumption A is satisfied. Then for Algorithm 1, the following is true for all nodes
Proof. From the update of in (II-B), we observe that the following is true
Note that both and are updated at each iteration, so we have the following equality for iteration as well
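Suppose first that the node is not in the selected subset, which means that no new gradient information arrives for it. In this case, the gradient used for this node is the same as in the previous iteration, therefore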
It follows that for , (15) is true.
Suppose instead that the node is in the selected subset, then we have
Therefore we have, for all
The above result further implies that,
The desired result is obtained.
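The key identity behind this argument can be sketched as follows, under assumptions that are labeled here rather than taken verbatim from the paper: the local update has the gradient-type form $x_k^{t+1} = x^{t+1} - \frac{1}{\rho_k}\big(\nabla g_k(\hat{x}_k^{t+1}) + y_k^{t}\big)$, the dual update is $y_k^{t+1} = y_k^{t} + \rho_k (x_k^{t+1} - x^{t+1})$, $\hat{x}_k^{t+1}$ denotes the (possibly stale) copy at which node $k$'s gradient was evaluated, and $L_k$ is a Lipschitz constant of $\nabla g_k$.

```latex
\begin{align*}
y_k^{t+1} &= y_k^{t} + \rho_k \,( x_k^{t+1} - x^{t+1} )
           = y_k^{t} - \big( \nabla g_k(\hat{x}_k^{t+1}) + y_k^{t} \big)
           = - \nabla g_k(\hat{x}_k^{t+1}), \\
\| y_k^{t+1} - y_k^{t} \|
          &= \| \nabla g_k(\hat{x}_k^{t+1}) - \nabla g_k(\hat{x}_k^{t}) \|
           \le L_k \, \| \hat{x}_k^{t+1} - \hat{x}_k^{t} \| , \qquad t \ge 1 .
\end{align*}
```

In particular, when no new gradient arrives for node $k$, the two copies coincide and the dual variable does not change; when a new gradient arrives, the change in the dual variable is controlled by how far the copy has moved.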
Next, we upper bound the successive difference of the augmented Lagrangian. To this end, let us define a few new functions, given below
Using these shorthand definitions, we have
The lemma below bounds the difference between and .
Lemma III.2 Suppose Assumption A is satisfied. Let the iterates be generated by Algorithm 1. Then we have the following
Proof. From the definition of and we have the following
Observe that is generated according to (24). Combined with the strong convexity of with respect to , we have
Also we have
Using the strong convexity of , we have the series of inequalities given below
Further, we have the following series of inequalities
The desired result then follows.
Next, we bound the difference of the augmented Lagrangian function values.
Proof. We first bound the successive difference of the augmented Lagrangian, which we decompose as follows
The first term in (33) can be expressed as
To bound the second term in (33), we use Lemma III.2. We have the series of inequalities in (34), where the last inequality follows from Lemma III.2 and the strong convexity of with respect to the variable (with modulus ) at .
Combining the above two inequalities and using Lemma III.1, we obtain the inequality below: