On the Convergence of Asynchronous Parallel Iteration with Unbounded Delays
Abstract
Recent years have witnessed the surge of asynchronous parallel (async-parallel) iterative algorithms due to problems involving very large-scale data and a large number of decision variables. Because of asynchrony, the iterates are computed with outdated information, and the age of the outdated information, which we call delay, is the number of times it has been updated since its creation. Almost all recent works prove convergence under the assumption of a finite maximum delay and set their stepsize parameters accordingly. However, the maximum delay is practically unknown.
This paper presents convergence analysis of an async-parallel method from a probabilistic viewpoint, and it allows for large unbounded delays. An explicit formula of stepsize that guarantees convergence is given depending on delays’ statistics. With $p+1$ identical processors, we empirically measured that delays closely follow the Poisson distribution with parameter $p$, matching our theoretical model, and thus the stepsize can be set accordingly. Simulations on both convex and nonconvex optimization problems demonstrate the validity of our analysis and also show that the existing maximum-delay-induced stepsize is too conservative, often slowing down the convergence of the algorithm.
Keywords:
asynchronous, unbounded delays, nonconvex, convex
1 Introduction
In the “big data” era, the size of the dataset and the number of decision variables involved in many areas such as health care, the Internet, economics, and engineering are becoming tremendously large WH2014big (). This motivates the development of new computational approaches that efficiently utilize modern multicore computers or computing clusters.
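In this paper, we consider the block-structured optimization problem
(1) $\min_{x \in \mathbb{R}^n} \; F(x) \equiv f(x_1, \ldots, x_m) + \sum_{i=1}^m r_i(x_i),$
where $x \in \mathbb{R}^n$ is partitioned into $m$ disjoint blocks $x = (x_1, \ldots, x_m)$, $f$ has a Lipschitz continuous gradient (possibly nonconvex), and the $r_i$’s are (possibly nondifferentiable) proper closed convex functions. Note that the $r_i$’s can be extended-valued, and thus (1) can have block constraints $x_i \in X_i$ by incorporating the indicator function of $X_i$ in $r_i$ for all $i$.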
Many applications can be formulated in the form of (1), and they include classic machine learning problems: support vector machine (squared hinge loss and its dual formulation) cortes1995SVM (), LASSO tibshirani1996Lasso (), and logistic regression (linear or multilinear) zhou2013tensorreg (), and also subspace learning problems: sparse principal component analysis zou2006sparsePCA (), nonnegative matrix or tensor factorization cichocki2009NMFNTF (), just to name a few.
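As an illustration of how one of these applications fits the block structure of (1), the following minimal sketch writes LASSO as a smooth least-squares term plus separable regularizers, treating every scalar coordinate as one block. The function names and the symbols `A`, `b`, and `lam` are illustrative assumptions and are not taken from the paper.

```python
import numpy as np

def smooth_part(A, b, x):
    """Smooth coupling term f(x) = 0.5 * ||A x - b||^2."""
    residual = A @ x - b
    return 0.5 * float(residual @ residual)

def block_regularizers(x, lam):
    """Separable terms r_i(x_i) = lam * |x_i|, one per scalar block (assumed blocks)."""
    return lam * np.abs(x)

def lasso_objective(A, b, x, lam):
    """Composite objective: smooth part plus the sum of the block regularizers."""
    return smooth_part(A, b, x) + float(block_regularizers(x, lam).sum())

def block_gradient(A, b, x, i):
    """Partial gradient of the smooth part with respect to block i (a scalar here)."""
    return float(A[:, i] @ (A @ x - b))
```

Only `block_gradient` and the single regularizer of the chosen block are needed for one block update, which is what makes coordinate-type methods cheap per iteration on problems of this form.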
Toward solutions for these problems with extremely large-scale datasets and many variables, first-order methods and also stochastic methods have become particularly popular because of their scalability to the problem size, such as FISTA beck2009FISTA (), stochastic approximation nemirovski2009robust (), randomized coordinate descent nesterov2012RCD (), and their combinations DangLanSBMD (); XuYin2015_block (). Recently, much effort has been devoted to parallelizing these methods, and in particular, asynchronous parallel (async-parallel) methods attract more attention (e.g., liu2014asynchronous (); Peng_2015_AROCK ()) than their synchronous counterparts, partly due to their better speedup performance.
This paper focuses on the async-parallel block coordinate update (async-BCU) method (see Algorithm 1) for solving (1). To the best of our knowledge, all works on async-BCU before 2013 consider a deterministic selection of blocks, with the exception of Strikwerda2002125 (), and thus they require strong conditions (like a contraction) for convergence. Recent works, e.g., liu2014asynchronous (); liu2015asyncscd (); Peng_2015_AROCK (); hannah2016unbounded (), employ randomized block selection and significantly weaken the convergence requirement. However, all of them require bounded delays and/or are restricted to convex problems. The work hannah2016unbounded () allows unbounded delays but requires convexity, and davis2016asynchronous (); cannelli2016asynchronous () do not assume convexity but require bounded delays. We consider unbounded delays and deal with nonconvex problems.
1.1 Algorithm
We describe the async-BCU method as follows. Assume there are $p+1$ processors, and the data and variable $x$ are accessible to all processors. We let all processors continuously and asynchronously update the variable in parallel. At each time $k$, one processor reads the variable $x$ as $\hat{x}^k$ from the global memory, randomly picks a block $i_k \in \{1, \ldots, m\}$, and renews $x_{i_k}$ by a prox-linear update while keeping all the other blocks unchanged. The pseudocode is summarized in Algorithm 1, where the $\operatorname{prox}$ operator is defined in (3).
The algorithm first appeared in liu2014asynchronous (), where the age of $\hat{x}^k$ relative to $x^k$, which we call the delay of iteration $k$, was assumed to be bounded by a certain integer $\tau$. For general convex problems, sublinear convergence was established, and for the strongly convex case, linear convergence was shown. However, its convergence for nonconvex problems and/or with unbounded delays was unknown. In addition, numerically, the stepsize is difficult to tune because it depends on $\tau$, which is unknown before the algorithm completes.
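(2) $x_{i_k}^{k+1} \leftarrow \operatorname{prox}_{\eta r_{i_k}}\bigl(x_{i_k}^{k} - \eta \nabla_{i_k} f(\hat{x}^k)\bigr),$ where $\eta > 0$ is the stepsize.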
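The update described above can be imitated on a single machine by keeping the past iterates and letting each step read a stale one. The sketch below is only a serial simulation under stated assumptions, not the paper's parallel implementation: the objective is LASSO with scalar blocks, the delays are drawn from a Poisson distribution, and the names `simulate_async_bcu`, `eta`, and `mean_delay` are hypothetical. The stepsize `eta` is left to the caller and is not the stepsize formula derived in the paper.

```python
import numpy as np

def soft_threshold(v, t):
    """Proximal mapping of t * |.| applied to a scalar v."""
    return float(np.sign(v) * max(abs(v) - t, 0.0))

def simulate_async_bcu(A, b, lam, eta, mean_delay, num_iters, seed=0):
    """Serially simulate block coordinate updates that use outdated reads."""
    rng = np.random.default_rng(seed)
    m = A.shape[1]
    # history[k] stores the iterate after k updates (kept in full for clarity,
    # which costs O(num_iters * m) memory and is fine only for small problems).
    history = [np.zeros(m)]
    for k in range(num_iters):
        j = int(rng.poisson(mean_delay))      # simulated delay of this step
        x_hat = history[max(k - j, 0)]        # stale read; clamp at the start
        i = int(rng.integers(m))              # randomly picked block
        grad_i = A[:, i] @ (A @ x_hat - b)    # partial gradient at the stale point
        x_new = history[k].copy()             # only block i changes
        x_new[i] = soft_threshold(x_new[i] - eta * grad_i, eta * lam)
        history.append(x_new)
    return history[-1]
```

Storing every iterate keeps the stale read explicit; a real shared-memory implementation would instead let several threads read and write one array without locks.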
1.2 Contributions
We summarize our contributions as follows.

We analyze the convergence of Algorithm 1 and allow for large unbounded delays following a certain distribution. We require the delays to have certain bounded expected quantities (e.g., expected delay, variance of delay). Our results are more general than those requiring bounded delays such as liu2014asynchronous (); liu2015asyncscd ().

Both nonconvex and convex problems are analyzed, and those problems include both smooth and nonsmooth functions. For nonconvex problems, we establish the global convergence in terms of first-order optimality conditions and show that any limit point of the iterates is a critical point almost surely. It appears to be the first result for an async-BCU method on general nonconvex problems that allows unbounded delays. For weakly convex problems, we establish a sublinear convergence result, and for strongly convex problems, we show linear convergence.

We show that if all $p+1$ processors run at the same speed, the delay follows the Poisson distribution with parameter $p$. In this case, all the relevant expected quantities can be explicitly computed and are bounded. By setting appropriate stepsizes, we can reach a near-linear speedup if $p = o(\sqrt{m})$ for smooth cases and $p = o(\sqrt[4]{m})$ for nonsmooth cases.

When the delay follows the Poisson distribution, we can explicitly set the stepsize based on the delay expectation (which equals the Poisson parameter $p$). We simulate the async-BCU method on one convex problem, LASSO, and one nonconvex problem, nonnegative matrix factorization. The results demonstrate that async-BCU performs consistently better with a stepsize set based on the expected delay than with one based on the maximum delay. The number of processors is known while the maximum delay is not. Hence, the setting based on expected delay is practically more useful.
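For a Poisson-distributed delay, the expected quantities mentioned in the contributions above have standard closed forms: a Poisson variable with parameter $p$ has mean $p$ and second moment $p + p^2$. The short sketch below checks these two values numerically by summing the probability mass function; the function name and the truncation length `tail` are illustrative assumptions.

```python
import math

def poisson_delay_moments(p, tail=200):
    """Return (E[j], E[j^2]) for j ~ Poisson(p) from a truncated pmf sum.

    The truncation at `tail` terms is adequate for moderate p (roughly p <= 50).
    """
    pmf = math.exp(-p)          # P(j = 0)
    mean, second = 0.0, 0.0
    for t in range(1, tail):
        pmf *= p / t            # P(j = t) obtained from P(j = t - 1)
        mean += t * pmf
        second += t * t * pmf
    return mean, second
```

Both moments depend only on the Poisson parameter, which is why a stepsize rule built on them can be evaluated before the algorithm is run.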
Our algorithm updates one (block) coordinate of $x$ in each step and is sharply different from stochastic gradient methods that sample one function in each step to update all coordinates of $x$. While there are async-parallel algorithms in both classes, and handling delays is important to the convergence of both, their basic lines of analysis differ in how they absorb the delay-induced errors. The results of the two classes are in general not comparable. That said, for problems with certain proper structures, it is possible to apply both coordinate-wise update and stochastic sampling (e.g., recht2011hogwild (); XuYin2015_block (); MokhtaiKoppelRibeiro2016_class (); davis2016asynchronous ()), and our results apply to the coordinate part.
1.3 Notation and assumptions
Throughout the paper, bold lowercase letters are used for vectors. We denote $x_i$ as the $i$th block of $x$ and $U_i$ as the $i$th sampling matrix, i.e., $U_i x$ is a vector with $x_i$ as its $i$th block and $0$ for the remaining ones. $E_{i_k}$ denotes the expectation with respect to $i_k$ conditionally on all previous history, and $[m] = \{1, \ldots, m\}$.
We consider the Euclidean norm denoted by $\|\cdot\|$, but all our results can be directly extended to problems with general primal and dual norms in a Hilbert space.
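The projection to a convex set $X$ is defined as $P_X(y) = \arg\min_{x \in X} \|x - y\|^2,$
and the proximal mapping of a convex function $h$ is defined as
(3) $\operatorname{prox}_h(y) = \arg\min_{x} \bigl\{ h(x) + \tfrac{1}{2}\|x - y\|^2 \bigr\}.$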
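Both operators have closed forms for common choices of the set and the function. The sketch below gives three standard instances, assuming the usual convention that the proximal mapping minimizes the function plus half the squared Euclidean distance to its argument; the function names are illustrative.

```python
import numpy as np

def project_box(y, lower, upper):
    """Euclidean projection of y onto the box {x : lower <= x <= upper}."""
    return np.clip(y, lower, upper)

def prox_l1(y, t):
    """Proximal mapping of h(x) = t * ||x||_1, i.e., soft-thresholding."""
    return np.sign(y) * np.maximum(np.abs(y) - t, 0.0)

def prox_indicator_nonneg(y):
    """Proximal mapping of the indicator of the nonnegative orthant.

    For an indicator function the proximal mapping is the projection onto the set.
    """
    return np.maximum(y, 0.0)
```

The last function shows why block constraints can be absorbed into the nonsmooth terms: the proximal step of an indicator function is exactly a projection.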
Definition 1
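(Critical point) A point $x^*$ is a critical point of (1) if $0 \in \nabla f(x^*) + \partial R(x^*)$, where $\partial R(x)$ denotes the subdifferential of $R$ at $x$ and
(4) $R(x) = \sum_{i=1}^m r_i(x_i).$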
Throughout our analysis, we make the following three assumptions on problem (1) and Algorithm 1. Other assumed conditions will be specified if needed.
Assumption 1
The function $F$ is lower bounded. The problem (1) has at least one solution, and the solution set is denoted as $X^*$.
Assumption 2
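$\nabla f(x)$ is Lipschitz continuous with constant $L_f$, namely,
(5) $\|\nabla f(x) - \nabla f(y)\| \le L_f \|x - y\|, \quad \forall x, y.$
In addition, for each $i \in [m]$, fixing all block coordinates but the $i$th one, $\nabla f(x)$ and $\nabla_i f(x)$ are Lipschitz continuous about $x_i$ with $L_r$ and $L_c$, respectively, i.e., for any $x$, $y$, and $i$,
(6) $\|\nabla f(x) - \nabla f(x + U_i y)\| \le L_r \|y_i\|,$
(7) $\|\nabla_i f(x) - \nabla_i f(x + U_i y)\| \le L_c \|y_i\|.$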
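For a least-squares smooth term $\tfrac{1}{2}\|Ax - b\|^2$ with scalar blocks, the three constants in Assumption 2 can be computed directly from the Gram matrix $A^\top A$. The sketch below is an illustration under that assumption; the labels `L_f`, `L_r`, and `L_c` are names chosen here for the gradient constant, the per-block constant of the full gradient, and the per-block constant of the partial gradient.

```python
import numpy as np

def least_squares_lipschitz_constants(A):
    """Lipschitz constants of the gradient of 0.5 * ||A x - b||^2 (scalar blocks)."""
    gram = A.T @ A
    # Full gradient: largest eigenvalue of the Gram matrix (eigvalsh sorts ascending).
    L_f = float(np.linalg.eigvalsh(gram)[-1])
    # Full gradient when one coordinate moves: largest column norm of the Gram matrix.
    L_r = float(np.max(np.linalg.norm(gram, axis=0)))
    # Partial gradient when its own coordinate moves: largest diagonal entry.
    L_c = float(np.max(np.diag(gram)))
    return L_f, L_r, L_c
```

For this quadratic the ordering `L_c <= L_r <= L_f` always holds, since a diagonal entry is one component of its column and a column is the Gram matrix applied to a unit vector.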
Assumption 3
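For each $k$, the reading $\hat{x}^k$ is consistent and delayed by $j_k$, namely, $\hat{x}^k = x^{k - j_k}$. The delay $j_k$ follows an identical distribution as a random variable $j$:
(9) $\operatorname{Prob}(j = t) = q_t, \quad t = 0, 1, 2, \ldots,$
and is independent of $i_k$. We let $c_k := \sum_{t=k}^{\infty} q_t$, $T := E[j]$, and $S := E[j^2]$.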
Remark 1
Although the delay always satisfies $0 \le j_k \le k$, the assumption in (9) is without loss of generality if we make negative iterates and regard $x^k = x^0$ for all $k < 0$. For simplicity, we make the identical distribution assumption, which is the same as that in Strikwerda2002125 (). Our results can still hold for nonidentical distributions; see the analysis for the smooth nonconvex case in the arXiv version of the paper.
2 Related works
We briefly review block coordinate update (BCU) and async-parallel computing methods.
The BCU method is closely related to the Gauss-Seidel method for solving linear equations, which dates back to 1823. In the optimization literature, the BCU method first appeared in Hildreth57 () as the block coordinate descent method, or more precisely, block minimization (BM), for quadratic programming. The convergence of BM was established early for both convex and nonconvex problems, for example luo1992convergence (); GrippoSciandrone00 (); Tseng01 (). However, in general, its convergence rate result was only shown for strongly convex problems (e.g., luo1992convergence ()) until the recent work hong2015iteration () that shows sublinear convergence for weakly convex cases. tseng2009_CGD () proposed a new version of BCU methods, called the coordinate gradient descent method, which mimics proximal gradient descent but only updates a block coordinate every time. The block coordinate gradient or block prox-linear update (BPU) became popular after nesterov2012RCD () proposed to randomly select a block to update. The convergence rate of the randomized BPU is easier to show than that of the deterministic BPU. It was first established for convex smooth problems (both unconstrained and constrained) in nesterov2012RCD () and then generalized to nonsmooth cases in richtarik2014iteration (); Lu_Xiao_rbcd_2015 (). Recently, DangLanSBMD (); XuYin2015_block () incorporated stochastic approximation into the BPU framework to deal with stochastic programming, and both established sublinear convergence for convex problems and also global convergence for nonconvex problems.
The async-parallel computing method (also called chaotic relaxation) first appeared in rosenfeld1969case () to solve linear equations arising in electrical network problems. DW1969chaoticrelax () first systematically analyzed (more general) asynchronous iterative methods for solving linear systems. Assuming bounded delays, it gave a necessary and sufficient condition for convergence. bertsekas1983distributed () proposed an asynchronous distributed iterative method for solving more general fixed-point problems and showed its convergence under a contraction assumption. TB1990partially () weakened the contraction assumption to pseudo-nonexpansiveness but made additional assumptions. FS2000asynreview () made a thorough review of asynchronous methods before 2000. It summarized convergence results under nested sets and synchronous convergence conditions, which are satisfied by P-contraction mappings and isotone mappings.
After it was proposed in 1969, the async-parallel method did not attract much attention until recent years, when the size of data began increasing exponentially in many areas. Motivated by “big data” problems, liu2014asynchronous (); liu2015asyncscd () proposed the async-parallel stochastic coordinate descent method (i.e., Algorithm 1) for solving problems in the form of (1). Their analysis focuses on convex problems and assumes bounded delays. Specifically, they established sublinear convergence for weakly convex problems and linear convergence for strongly convex problems. In addition, near-linear speedup was achieved if the delay bound satisfies $\tau = o(\sqrt{m})$ for unconstrained smooth convex problems and $\tau = o(\sqrt[4]{m})$ for constrained smooth or nonsmooth cases. For nonconvex problems, davis2016asynchronous () introduced an async-parallel coordinate descent method, whose convergence was established under iterate boundedness assumptions and appropriate stepsizes.
3 Convergence results for the smooth case
Throughout this section, let $r_i = 0$ for all $i$, i.e., we consider the smooth optimization problem
(10) $\min_{x \in \mathbb{R}^n} f(x).$
The general (possibly nonsmooth) case will be analyzed in the next section. The results for nonsmooth problems of course also hold for smooth ones. However, the smooth case requires weaker conditions for convergence than those required by the nonsmooth case, and their analysis techniques are different. Hence, we consider the two cases separately.
3.1 Convergence for the nonconvex case
In this subsection, we establish a subsequence convergence result for the general (possibly nonconvex) case. We begin with some technical lemmas. The first lemma deals with certain infinite sums that will appear later in our analysis.
Lemma 1
For any and , let
(11a)  
(11b)  
(11c) 
Then
(12)  
(13) 
Proof
To bound , we bound the first term in (11a). Specifically,
where the last equality holds since We obtain (12) by combining these two equations.
To prove (13), we will use
(14) 
The above inequality yields
where the last inequality follows from . ∎
The second lemma below bounds the cross term that appears in our analysis.
Lemma 2 (Cross term bound)
For any and , it holds that
(15)  
(16)  
Proof
Define . Applying the Cauchy-Schwarz inequality with yields
Since by applying Young’s inequality, we get
(17)  
(18) 
By taking expectation, we have
Now taking expectation on both sides of (17) and using the above equation, we get
(19)  
(20)  
(21) 
Finally, (15) follows from
and
(22)  
(23)  
(24)  
(25)  
(26) 
∎
Using the above lemma, we show a result of running one iteration of the algorithm.
Theorem 3.1 (Fundamental bound)
Set and as in (11). For any , we have
(27)  
(28) 
Proof
Since , we have from (8) that
Taking conditional expectation on gives
(29)  
(30)  
(31)  
(32) 
For the first cross term in (29), we write each summand as
(33) 
and we use Young’s inequality to bound the second cross term by
(34) 
Now taking expectation over both sides of (29), plugging in (33) and (34), and using Lemma 2, we have the desired result. ∎
We are now ready to show the main result in the following theorem.
Theorem 3.2
Remark 2
If , then only weakly depends on the delay. The conditions or being bounded can be dropped if is bounded; see Theorem 4.1.
Proof
Summing up (28) from through and using (142), we have
(36)  
(37) 
Note that as . If or is bounded, by letting in (36) and using the lower boundedness of , we have from Lemma 1 that
Since , we have (35) from the above inequality.
From the Markov inequality, it follows that converges to zero with probability one. Let be a limit point of , i.e., there is a subsequence convergent to . Hence, almost surely as . By (gut2006probability, Theorem 3.4, p.212), there is a sub-subsequence such that almost surely as . This completes the proof. ∎
3.2 Convergence rate for the convex case
In this subsection, we assume the convexity of and establish convergence rate results of Algorithm 1 for solving (10). Besides Assumptions 1 through 3, we make an additional assumption on the delay as follows. It means the delay follows a sub-exponential distribution.
Assumption 4
There is a constant such that
(38) 
The condition in (38) is stronger than , and both of them hold if the delay is uniformly bounded by some number or follows the Poisson distribution; see the discussions in Section 5. Using this additional assumption and choosing an appropriate stepsize, we are able to control the gradient of so that it does not change too fast.
Lemma 3
The proof of Lemma 3 follows an argument similar to liu2014asynchronous (). Since it is rather long, it is included in the appendix. Similar to Lemma 2, we can show the following result.
Lemma 4
For any , it holds that
(41)  
(42)  
(43) 
Proof
Using the above two lemmas, we establish sufficient objective decrease.
Theorem 3.3 (Sufficient progress)
Proof
First note that for any , is dominated by as is sufficiently large. Hence, from (38), and it is easy to see . Also note that
(50) 
We rewrite the cross terms in (29) as
Taking expectation on both sides of (29) and using (41), we have
(51)  
(52) 
The above inequality together with (40) implies
(53)  
(54) 
Note that