(Near) Optimal Parallelism Bound for Fully Asynchronous Coordinate Descent with Linear Speedup

Yun Kuen Cheung
Max Planck Institute for Informatics
Saarland Informatics Campus
   Richard Cole          Yixin Tao
Courant Institute, NYU
Abstract

When solving massive optimization problems in areas such as machine learning, it is a common practice to seek speedup via massive parallelism. However, especially in an asynchronous environment, there are limits on the possible parallelism. Accordingly, we seek tight bounds on the viable parallelism in asynchronous implementations of coordinate descent.

We focus on asynchronous coordinate descent (ACD) algorithms applied to convex functions of the form

F(x) = f(x) + \sum_{k=1}^{n} \Psi_k(x_k),

where $f$ is a smooth convex function, and each $\Psi_k$ is a univariate and possibly non-smooth convex function.

Our approach is to quantify the shortfall in progress compared to standard sequential stochastic coordinate descent. This leads to a truly simple yet optimal analysis of the standard stochastic ACD in a partially asynchronous environment, which already generalizes and improves on the bounds in prior work. We also give a considerably more involved analysis for general asynchronous environments, in which the only constraint is that each update can overlap with at most $q$ others, where $q$ is at most the number of processors times the ratio of the lengths of the longest and shortest updates. The main technical challenge is to demonstrate linear speedup in the latter environment; the difficulty stems from the subtle interplay of asynchrony and randomization. This improves Liu and Wright's [16] lower bound on the maximum degree of parallelism almost quadratically, and we show that our new bound is almost optimal.

Keywords. Asynchronous Coordinate Descent, Asynchronous Optimization, Asynchronous Iterative Algorithm, Coordinate Descent, Amortized Analysis.

1 Introduction

We consider the problem of finding an (approximate) minimum point of a convex function $F: \mathbb{R}^n \to \mathbb{R}$ of the form

F(x) = f(x) + \sum_{k=1}^{n} \Psi_k(x_k),

where $f$ is a smooth convex function, and each $\Psi_k$ is a univariate convex function that may be non-smooth. Such functions occur in many data analysis and machine learning problems, such as linear regression (e.g., the Lasso approach to regularized least squares [27]), where the $\Psi_k$ form an $\ell_1$ regularizer; logistic regression [20]; ridge regression [25], where the non-smooth part is replaced by a quadratic function; and Support Vector Machines [10], where the added term is often a quadratic function or a hinge loss (essentially, a function of the form $\max\{0, 1-z\}$).
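As a concrete illustration (our choice of instance, given only for orientation), the Lasso objective fits this template with a quadratic smooth part and a separable $\ell_1$ non-smooth part:

    % Lasso as an instance of F = f + sum_k Psi_k, shown for illustration only
    F(x) \;=\; \underbrace{\tfrac{1}{2}\lVert Ax - b\rVert_2^2}_{f(x)}
          \;+\; \underbrace{\lambda \sum_{k=1}^{n} \lvert x_k \rvert}_{\sum_k \Psi_k(x_k)},
    \qquad \lambda > 0 .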

Gradient descent is the standard solution approach for the prevalent massive problems of this type. Broadly speaking, gradient descent proceeds by moving iteratively in the direction of the negative gradient of a convex function. Coordinate descent is a commonly studied version of gradient descent. It repeatedly selects and updates a single coordinate of the argument to the convex function. Stochastic versions are standard: at each iteration the next coordinate to update is chosen uniformly at random. (There are also versions in which different coordinates can be selected with different probabilities.)
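To fix ideas, here is a minimal sketch of sequential stochastic coordinate descent for a smooth $f$; the names stochastic_coordinate_descent and grad_f are placeholders for this sketch, and the constant step size is only one standard choice, not the paper's exact rule.

import random

def stochastic_coordinate_descent(x, grad_f, step, iters, seed=0):
    """Sequential stochastic coordinate descent on a smooth function f.

    x      : list of floats, the current iterate (modified in place)
    grad_f : grad_f(x, k) -> float, the k-th partial derivative of f at x
    step   : step size, e.g. 1/L_max for an L_max-Lipschitz-smooth f
    """
    rng = random.Random(seed)
    n = len(x)
    for _ in range(iters):
        k = rng.randrange(n)            # coordinate chosen uniformly at random
        x[k] -= step * grad_f(x, k)     # move against the partial derivative
    return x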

Due to the enormous size of modern problems, there has been considerable interest in parallel versions of coordinate descent in order to achieve speedup, ideally in proportion to the number of processors or cores at hand, called linear speedup.

One important issue in parallel implementations is whether the different processors are all using up-to-date information for their computations. Ensuring this requires considerable synchronization, locking, and consequent waiting. Dropping the up-to-date requirement, i.e. enabling asynchronous updating, was a significant advance. The advantage of asynchronous updating is that it reduces, and potentially eliminates, the need for waiting. At the same time, since some of the data used in calculating updates will be out of date, one has to ensure that this out-of-datedness is bounded in some fashion.

Modeling asynchrony  The study of asynchrony in parallel and distributed computing goes back to Chazan and Miranker [6] for linear systems and to Bertsekas and Tsitsiklis [3] for a wider range of computations. They obtained convergence results for both deterministic and stochastic algorithms, along with rate of convergence results for deterministic algorithms. The first analyses to prove rate of convergence bounds for stochastic asynchronous computations were those by Avron, Druinsky and Gupta [1] (for the Gauss-Seidel algorithm), and by Liu et al. [17] and Liu and Wright [16] (for coordinate descent). Liu et al. [17] imposed a "consistent read" constraint on the asynchrony; the other two works considered a more general "inconsistent read" model. ("Consistent read" means that the coordinate values a core reads may be delayed, but they must all have been present simultaneously at some moment; precisely, the vector of values used by the update at time $t$ must equal $x^{t'}$ for some $t' \le t$. "Inconsistent read" means that the value of coordinate $j$ used by the update at time $t$ can be $x^{t_j}_j$ for any $t_j \le t$, where the $t_j$ may be distinct.) Subsequent to Liu and Wright's work, several overlooked issues were identified by Mania et al. [19] and Sun et al. [26]; we call them Undoing of Uniformity (UoU) and No Common Value. (In [1], the authors raised a similar issue about their asynchronous Gauss-Seidel algorithm.)
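To make the two read models concrete, here is a hedged illustration in Python; the bookkeeping (a list history of past iterates and a per-coordinate delay vector) is hypothetical and exists only to state the definitions, not to describe an actual implementation.

def consistent_read(history, t, s):
    """Consistent read: every coordinate value comes from one (possibly stale)
    iterate x^s for a single s <= t."""
    assert s <= t
    return list(history[s])

def inconsistent_read(history, t, delays):
    """Inconsistent read: coordinate j's value may come from its own stale
    iterate x^{t - delays[j]}, and the delays may differ across coordinates."""
    return [history[max(t - d, 0)][j] for j, d in enumerate(delays)]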

In brief, as the asynchrony assumptions were relaxed, the bounds that could be shown, particularly in terms of achievable speedup, became successively weaker. In this work we ask the following question:

Can we achieve both linear speedup and full asynchrony

when applying coordinate descent to non-smooth functions $F$?

Our answer to this question is "yes": we obtain the maximum possible parallelism while maintaining linear speedup (up to at most a lower-order factor). Our results match the best speedup, namely linear speedup with up to as many processors as in [17], but with no constraints on the asynchrony beyond a requirement that unlimited delays do not occur. Specifically, as in [16], we assume there is a bounded amount of overlap between the various updates. We now state our results for strongly convex functions informally.

Theorem 1 (Informal).

Let $q$ be an upper bound on how many other updates a single update can overlap; the Lipschitz parameters used below are defined in Section 2.

(i) Let $F$ be a strongly convex function with strong convexity parameter $\mu_F$. If $q$ and the Lipschitz parameters satisfy the bounds stated in Section 2, then the expected value of $F$ decreases geometrically in the number of updates, at a rate matching the standard sequential rate up to a constant factor.

(ii) This bound is tight up to a lower-order factor: there is a family of functions such that, once $q$ exceeds our bound by this factor, with high probability the current point is still essentially the starting point even after a large number of updates.

Standard sequential analyses [18, 24] achieve similar bounds with the 3 replaced by 1; i.e. up to a factor of 3, this is the same rate of convergence.

Asynchronicity assumptions  The Uniformity assumption states that the start time ordering of the updates and their commit time ordering are identical. Undoing of Uniformity (UoU) arises because, while each core initiates an update by choosing a coordinate uniformly at random, due to the possibly differing lengths of the different updates, and also due to various asynchronous effects, the coordinate choices seen in commit time order may be far from uniformly distributed. In an experimental study, Sun et al. [26] observed that iteration lengths in coordinate descent problem instances varied by factors of 2–10, indicating that this effect is likely.

The Common Value assumption states that regardless of which coordinate is randomly selected to update, the same values are read in their gradient computation. If coordinates are not being read on the same schedule, as seems likely for sparse problems, it would appear that this assumption too will be repeatedly violated.

Related work  Coordinate Descent is a method that has been widely studied; see Wright [31] for a recent survey. There have been multiple analyses of various asynchronous parallel implementations of coordinate descent [17, 16, 19, 26]. We have already mentioned the results of Liu et al. [17] and Liu and Wright [16]. Both obtained bounds for convex and for "optimally" strongly convex functions (a weakening of standard strong convexity), attaining linear speedup so long as there are not too many cores. Liu et al. [17] obtained bounds similar to ours (see their Corollary 2 and our Theorem 2), but the version they analyzed is more restricted than ours in two respects: first, they imposed the strong assumption of consistent reads, and second, they considered only smooth functions (i.e., no non-smooth univariate components $\Psi_k$). The version analyzed by Liu and Wright [16] is the same as ours, but their result requires both the Uniformity and Common Value assumptions; our bound on the parallelism has a similar flavor, but with an almost quadratically larger limit. The analysis by Mania et al. [19] removed the Uniformity assumption; however, the maximum parallelism they could show was much reduced, and their results applied only to smooth strongly convex functions. We note that a major focus of their work was the analysis of Hogwild, an asynchronous stochastic gradient descent algorithm for functions of the form $\sum_e f_e(x_e)$, where each $f_e$ is convex, and the bounds there were optimal. The analysis in Sun et al. [26] removed the Common Value assumption and partially removed the Uniformity assumption; however, this came at the cost of achieving no parallel speedup. [26] also noted that a hard bound on the parameter $q$ could be replaced by a probabilistic bound, which in practice is more plausible.

Asynchronous methods for solving linear systems have been studied at least since the work of Chazan and Miranker [6] in 1969. See [1] for an account of the development. Avron, Druinsky and Gupta [1] proposed and analyzed an asynchronous and randomized version of the Gauss-Seidel algorithm for solving symmetric and positive definite matrix systems. They pointed out that in practice delays depend on the random choice of direction (which corresponds to coordinate choice in our case), which is indeed one of the sources leading to Undoing of Uniformity. Their analysis bypasses this issue with their Assumption A-4, which states that delays are independent of the coordinate being updated, but the already mentioned experimental study of Sun et al. indicates that this assumption does not hold in general.

Another widely studied approach to speeding up gradient and coordinate descent is the use of acceleration. Very recently, attempts have been made to combine acceleration and parallelism [14, 12, 9]. But at this point, these results do not extend to non-smooth functions.

Organization of the Paper

In Section 2, we describe our model of coordinate descent and state our main results, focusing on the SACD algorithm applied to strongly convex functions. In Section 3, we give a high-level sketch of the structure of our analyses. Then, in Section 4, we show that with the Common Value assumption we can obtain a truly simple analysis of SACD; this analysis achieves the maximum possible speedup (i.e., linear speedup with the maximum admissible number of processors). Note that this is the same assumption as in Mania et al.'s result [19] and less restrictive than the assumptions in Liu and Wright's analysis [16]. We follow this with a discussion of some of the obstacles that need to be overcome in order to remove the Common Value assumption, and some comments on how we achieve this. The full analysis of SACD is deferred to the appendix.

2 Model and Main Results

Recall that we are considering convex functions of the form $F(x) = f(x) + \sum_{k=1}^{n} \Psi_k(x_k)$, where $f$ is a smooth convex function, and each $\Psi_k$ is a univariate and possibly non-smooth convex function. We let $x^*$ denote a minimum point of $F$ and $X^*$ denote the set of all minimum points of $F$. Without loss of generality, we assume that $F^*$, the minimum value of $F$, is $0$.

We recap some standard terminology. Let $e_k$ denote the unit vector along coordinate $k$.

Definition 1.

The function $f$ is $L$-Lipschitz-smooth if for any $x, \Delta x \in \mathbb{R}^n$, $\lVert \nabla f(x + \Delta x) - \nabla f(x) \rVert \le L \cdot \lVert \Delta x \rVert$. For any coordinates $j, k$, the function $f$ is $L_{jk}$-Lipschitz-smooth if for any $x \in \mathbb{R}^n$ and $r \in \mathbb{R}$, $\lvert \nabla_k f(x + r\,e_j) - \nabla_k f(x) \rvert \le L_{jk} \cdot \lvert r \rvert$; it is $L_{\mathrm{res}}$-Lipschitz-smooth if $\lVert \nabla f(x + r\,e_j) - \nabla f(x) \rVert \le L_{\mathrm{res}} \cdot \lvert r \rvert$. Finally, $L_{\overline{\mathrm{res}}} := \max_j \big( \sum_k L_{jk}^2 \big)^{1/2}$ and $L_{\max} := \max_k L_{kk}$.

The difference between $L_{\mathrm{res}}$ and $L_{\overline{\mathrm{res}}}$

In general, $L_{\mathrm{res}} \le L_{\overline{\mathrm{res}}}$; the two are equal when the rates of change of the gradient are constant, as for example in quadratic functions such as $x^{\top} A x$. We need $L_{\overline{\mathrm{res}}}$ because we do not make the Common Value assumption. We use $L_{\overline{\mathrm{res}}}$ to bound sums of the form $\sum_k \lvert \nabla_k f(u^k) - \nabla_k f(v^k) \rvert$, where each pair $(u^k, v^k)$ consists of nearby points and the pairs may differ from coordinate to coordinate, whereas in the analyses with the Common Value assumption, the term being bounded is $\sum_k \lvert \nabla_k f(u) - \nabla_k f(v) \rvert$ for a single pair $(u, v)$ of nearby points; i.e., our bound is over a sum of gradient differences along the coordinate axes for pairs of points which are all nearby, whereas the other sum is over gradient differences along the coordinate axes for the same pair of nearby points. Finally, if the convex function $f$ is $\Delta$-sparse, meaning that each term depends on at most $\Delta$ variables, then $L_{\overline{\mathrm{res}}} \le \sqrt{\Delta} \cdot L_{\mathrm{res}}$. When $n$ is huge, this would appear to be the only feasible case.

By a suitable rescaling of variables, we may assume that the coordinate-wise smoothness parameters $L_{kk}$ are all equal. This is equivalent to using step sizes proportional to $1/L_{kk}$ without rescaling, a common practice.

Next, we define strong convexity.

Definition 2.

Let $F$ be a convex function. $F$ is strongly convex with parameter $\mu_F > 0$ if, for all $x, y$ and every subgradient $\xi \in \partial F(x)$, $F(y) \ge F(x) + \langle \xi, y - x \rangle + \frac{\mu_F}{2} \lVert y - x \rVert^2$.

The update rule

Recall that in a standard coordinate descent, be it sequential or parallel and synchronous, the update rule, applied to coordinate $j$, first computes the accurate gradient $g = \nabla_j f(x)$, and then performs the update given below.

$x_j \leftarrow \operatorname*{arg\,min}_{z \in \mathbb{R}} \big\{ g \cdot (z - x_j) + \tfrac{1}{2\Gamma}(z - x_j)^2 + \Psi_j(z) \big\}$, where $\Gamma$ is a parameter controlling the step size.

However, in an asynchronous environment, an updating core (or processor) might retrieve outdated information $\tilde{x}$ instead of $x$, so the gradient the core computes will be $\tilde{g} = \nabla_j f(\tilde{x})$, instead of the accurate value $\nabla_j f(x)$. Our update rule, which is naturally motivated by its synchronous counterpart, is

$x_j \leftarrow \operatorname*{arg\,min}_{z \in \mathbb{R}} \big\{ \tilde{g} \cdot (z - x_j) + \tfrac{1}{2\Gamma}(z - x_j)^2 + \Psi_j(z) \big\}.$   (1)

We let $W$ denote the amount by which the bracketed expression in (1), evaluated at its minimizer, falls below its value at $z = x_j$.

It is well known that in the synchronous case, $W$ is a lower bound on the reduction in the value of $F$, which we treat as the progress. We let $k_t$ denote the coordinate being updated at time $t$.
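For intuition, here is a hedged sketch of update rule (1) in Python for the special case $\Psi_j(z) = \lambda \lvert z \rvert$, where the arg min has the familiar soft-thresholding closed form; the names gamma and lam are illustrative, and the stale gradient g_tilde is whatever the core computed from its (possibly out-of-date) reads.

def prox_coordinate_update(x_j, g_tilde, gamma, lam):
    """One application of rule (1) when Psi_j(z) = lam*|z| (soft-thresholding).

    Returns argmin_z  g_tilde*(z - x_j) + (z - x_j)^2/(2*gamma) + lam*|z|.
    x_j is the freshest value of the chosen coordinate; g_tilde may have been
    computed from out-of-date values of the other coordinates.
    """
    u = x_j - gamma * g_tilde          # minimizer ignoring the Psi_j term
    if u > gamma * lam:                # soft-threshold u at level gamma*lam
        return u - gamma * lam
    if u < -gamma * lam:
        return u + gamma * lam
    return 0.0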

2.1 The SACD Algorithm

INPUT: The initial point.
Multiple cores use a shared memory. Each core iteratively repeats the following six-step procedure, with no global coordination among them:
1. Choose a coordinate $j$ uniformly at random.
2. Retrieve the coordinate values needed for the gradient computation from the shared memory.
3. Compute the gradient $\tilde{g}$ along coordinate $j$ from the retrieved values.
4. Request a lock on the memory that stores the value of the $j$-th coordinate.
5. Retrieve the most up-to-date $j$-th coordinate value, then update it using rule (1). (Even if the core retrieved the value of the $j$-th coordinate from the shared memory in Step 2, it needs to retrieve it again here, because it needs the most up-to-date value when applying update rule (1).)
6. Release the lock acquired in Step 4.
Algorithm 1: The SACD algorithm.

The coordinate descent process starts at an initial point. Multiple cores then iteratively update the coordinate values. We assume that at each time, there is exactly one coordinate value being updated. In practice, since there will be little coordination between cores, it is possible that multiple coordinate values are updated at the same moment; but by using an arbitrary tie-breaking rule, we can immediately extend our analyses to these scenarios.

In Algorithm 1, we provide the complete description of SACD. The retrieval times for Step 2 plus the gradient-computation time for Step 3 can be non-trivial, and in Step 4 a core might also need to wait if the coordinate it wants to update is locked by another core. Thus, during this period of time other coordinates are likely to be updated. For each update, we call the period of time spent performing the six-step procedure the span of the update. We say that update $B$ interferes with update $A$ if the commit time of update $B$ lies in the span of update $A$.
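To illustrate the control flow (per-coordinate locks, stale reads, no global coordination), here is a minimal multi-threaded sketch; it reuses the prox_coordinate_update sketch given after rule (1), and the gradient oracle grad_f and the parameters are placeholders rather than part of the algorithm's specification.

import random
import threading

def sacd(x, grad_f, step, updates_per_core, num_cores, lam=0.0):
    """Illustrative shared-memory SACD loop with per-coordinate locking.

    x      : shared list of floats (the iterate)
    grad_f : grad_f(snapshot, j) -> float, partial derivative along j
    step   : step-size parameter (gamma in the sketch of rule (1))
    """
    n = len(x)
    locks = [threading.Lock() for _ in range(n)]

    def worker(seed):
        rng = random.Random(seed)
        for _ in range(updates_per_core):
            j = rng.randrange(n)                 # Step 1: pick a coordinate
            snapshot = list(x)                   # Step 2: possibly inconsistent reads
            g = grad_f(snapshot, j)              # Step 3: gradient from (stale) values
            with locks[j]:                       # Step 4: lock coordinate j
                x[j] = prox_coordinate_update(   # Step 5: re-read x[j], apply rule (1)
                    x[j], g, step, lam)
            # Step 6: the lock is released on leaving the `with` block

    threads = [threading.Thread(target=worker, args=(s,)) for s in range(num_cores)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return x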

In Appendix F we discuss why locking is needed and when it can be avoided; we also explain why the random choice of coordinate should be made before retrieving coordinate values.

Timing Scheme and the Undoing of Uniformity

Before stating the SACD result formally, we need to disambiguate our timing scheme. In every asynchronous iterative system, including our SACD algorithm, each procedure runs over a span of time rather than atomically. Generally, these spans are not consistent — it is possible for one update to start later than another one but to commit earlier. To create an analysis, we need a scheme that orders the updates in a consistent manner.

Using the commit times of the updates for the ordering seems the natural choice, since this ensures that future updates do not interfere with the current update; this is the choice made in many prior works. However, it causes uniformity to be undone. To understand why, consider the case where there are three cores and four coordinates, and suppose that the workload for updating one particular coordinate, say coordinate 4, is three times greater than that for updating any of the other coordinates. Then, conditioned on the history of the process, the probability distribution of the coordinate whose update commits next is biased away from coordinate 4. When there are many more cores and coordinates than in this simple case, and when other asynchronous effects (e.g., communication delays, interference from other computations, say due to mutual exclusion when multiple cores commit updates to the same coordinate, and interference from the operating system and CPU scheduling) are taken into account, it is highly uncertain what the exact, or even an approximate, distribution of the next coordinate to commit is, conditioned on knowledge of the history of the process. However, all prior analyses apart from [19] and [26] proceeded by making the idealized assumption that this conditional probability distribution remains uniform, while in fact it may be far from uniform. While it seems plausible that, without conditioning, the coordinate of the $t$-th update to commit is more or less uniformly distributed, the prior analyses needed this property with the conditioning, and they needed it for every update without fail.
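A small simulation makes the bias concrete; the specific workloads (the last coordinate taking three times as long) mirror the hypothetical example above, and the function below is illustrative only.

import random
from collections import Counter

def first_commit_distribution(num_cores=3, num_coords=4, trials=100000, seed=1):
    """Each core picks a coordinate uniformly at random and starts updating it
    at time 0; updating the last coordinate takes 3 time units, the others 1.
    Returns the empirical distribution of the coordinate whose update commits
    FIRST: uniform picks, yet a far-from-uniform commit order."""
    rng = random.Random(seed)

    def duration(j):
        return 3.0 if j == num_coords - 1 else 1.0

    counts = Counter()
    for _ in range(trials):
        picks = [rng.randrange(num_coords) for _ in range(num_cores)]
        # tiny random jitter breaks ties among the equally fast updates
        finish = [(duration(j) + 1e-9 * rng.random(), j) for j in picks]
        counts[min(finish)[1]] += 1
    return {j: counts[j] / trials for j in range(num_coords)}

# The slow coordinate is first to commit only if every core happens to pick it,
# i.e. with probability (1/4)^3, rather than the nominal 1/4.
# print(first_commit_distribution())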

To bypass the above issue, as in [19], we use the starting times of the updates for the ordering; then clearly the history has no influence on the choice of the coordinate $k_t$ updated at time $t$. However, this raises a new issue: future updates can interfere with the current update. Here the term future is used w.r.t. the update ordering, which is by starting time; recall that an update $B$ with an earlier starting time can commit later than another, later-starting update $A$, and therefore $B$ could interfere with $A$.

We discuss the Common Value assumption further in Appendix E.

2.2 Selected Results

We assume that our algorithms are run until exactly $T$ coordinates are selected and then updated, for some pre-specified $T$, with the commit times constrained by the following assumption.

Assumption 1.

There exists a non-negative integer $q$ such that for any update at time $t$, the only updates that can interfere with it are those at times $t-q, \ldots, t-1$ and $t+1, \ldots, t+q$.

When asynchronous effects are moderate, and if the various gradients have similar computational costs, the parameter $q$ will typically be bounded above by a small constant times the number of cores.
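For a rough sense of scale (illustrative numbers, not taken from the paper): with $p$ cores and update durations within a factor $r$ of one another, at most on the order of $p \cdot r$ other updates can overlap a given one, so

    q \;\lesssim\; p \cdot r, \qquad \text{e.g. } p = 40,\ r = 3 \;\Longrightarrow\; q \lesssim 120 .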

Theorem 2 (SACD Upper Bound).

Given initial point , Algorithm 1 is run for exactly iterations by multiple cores. Suppose that Assumption 1 holds, , , and . If is strongly convex with parameter , then

(2)

In the main body of the paper, we focus on the case of strongly convex $F$. The full result (including the plain convex case) is in Appendix B. In Appendix C, we show that the bound on the degree of parallelism given in Theorem 2 is essentially optimal. Theorem 3 states that w.h.p., throughout a long initial sequence of updates, each coordinate value oscillates within a bounded range around its starting value, i.e. there is no progress toward the optimum.

Theorem 3 (SACD Lower Bound).

For any constant and , for all large enough , there exists a convex function with , , minimum point , an initial point satisfying for every coordinate , and an asynchronous schedule for Algorithm 1, such that with , at every one of the first updates, for all but coordinates, with probability .

3 The Basic Framework

Let $k_t$ denote the index of the coordinate that is updated at time $t$, let $g_t$ denote the value of the gradient along coordinate $k_t$ computed at time $t$ using up-to-date values of the coordinates, and let $\tilde{g}_t$ denote the actual value computed, which may use some out-of-date values.

The classical analysis of stochastic (synchronous) coordinate descent proceeds by first showing that for any chosen , . Taking the expectation yields

(3)

By [24, Lemmas 4,6], the RHS of the above inequality is at least ; for completeness, we provide a proof of this result in Appendix A.4. Let . Then ; iterating this inequality yields .
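For orientation, the classical argument has the following standard shape, stated here with a generic per-step contraction factor $c \in (0,1)$ standing in for the paper's exact constants:

    % generic template: per-step expected contraction implies geometric decay
    \mathbb{E}\bigl[F(x^{t+1}) \mid x^{t}\bigr] \;\le\; (1 - c)\,F(x^{t})
    \quad\Longrightarrow\quad
    \mathbb{E}\bigl[F(x^{T})\bigr] \;\le\; (1 - c)^{T} F(x^{0}) .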

To handle the case where inaccurate gradients are used, we employ the following two lemmas.

Lemma 1.

If ,

Lemma 2.

If , .

Proving these results for smooth functions is straightforward. The version for non-smooth functions is less simple; it follows from Lemma 9 in Appendix A.

Combining Lemmas 1 and 2 yields

(4)

The first term on the RHS of the above inequality, after taking the expectation, is more or less what is needed in order to demonstrate progress. To complete the analysis we need to show that

(5)

for then we can conclude that .

4 A Truly Simple Analysis with the Common Value Assumption

Suppose there are a total of $T$ updates. We view the whole stochastic process as a branching tree of height $T$. Each node in the tree corresponds to the moment when some core randomly picks a coordinate to update, and each edge corresponds to a possible choice of coordinate. We use $\pi$ to denote a path from the root down to some leaf of this tree. A superscript of $\pi$ on a variable will denote the instance of the variable on path $\pi$. A double superscript of $(\pi, t)$ will denote the instance of the variable at time $t$ on path $\pi$, i.e. following the $t$-th update. Finally, $\pi_{t \to j}$ will denote the path with the time-$t$ coordinate on path $\pi$ replaced by coordinate $j$. Note that paths $\pi$ and $\pi_{t \to j}$ agree on all choices other than the one at time $t$.

In this section, we give a simple proof which shows that the error term arising from reading out-of-date values can be bounded by a sum of the magnitudes of the updates, on path $\pi$, at the times that can interfere with time $t$, where $k_t$ is the index of the coordinate chosen at time $t$. (Note that by the Common Value assumption, the values read by the time-$t$ update are the same regardless of which coordinate is chosen at time $t$.)

Lemma 3.

With the Common Value assumption,

Proof.

By definition, $g^{\pi,t}$ is the gradient along coordinate $k_t$ at the up-to-date point, and $\tilde{g}^{\pi,t}$ is the gradient at the point actually read from memory, an out-of-date point. By Assumption 1, the difference between these two points consists of a subset of the updates in the time interval $[t-q, t-1]$. (Assumption 1 implies that the updates before time $t-q$ have been written into memory before the update at time $t$ starts.) We denote this subset by:

Viewing as an -vector with a non-zero entry for coordinate and no other, we have:

For simplicity, we define

Then, and . By the definition of and the triangle inequality, we obtain:

(6)

The last inequality followed from applying the Cauchy-Schwarz inequality to the RHS, relaxing to . Note that, for any and , as and by the Common Value assumption. So,

The result follows on applying (6).

To demonstrate the bound in (5), it suffices that . The bound in Theorem 1 then follows readily.

5 Comments on Achieving the Full Result

The SACD analysis

Although the analysis in the previous section is simple, it is not obvious how to obtain a similar bound without the Common Value assumption; we would like a relationship analogous to the one in Lemma 3. We mention several of the challenges we face when we drop the Common Value assumption.

  1. Without the Common Value assumption, the values read may depend on the coordinate being updated at time $t$. The reason is that the updates to two different coordinates may read different subsets of coordinates, so their reads of a common coordinate may occur at different times, and hence may return different updates of this common coordinate. Consequently, the value read now depends on which coordinate is chosen at time $t$; we need to bound the resulting error for every such choice, and the first inequality in (6) may no longer apply.

  2. In addition, may also depend on the coordinate being updated at time . Suppose the updates to coordinates and at time have different read schedules and this affects the timing of an earlier update to coordinate (because the update has to be atomic and so may be slightly delayed if there is a read). Then a read of coordinate by an update to coordinate may occur before ’s update in the scenario with the time update to coordinate and after in the scenario with the time update to coordinate . If the update to coordinate occurs before time then will depend on the coordinate chosen at time . While this may seem esoteric, to rule it out imposes unclear limitations on the asynchronous schedules. So actually, we need to bound .

  3. Without the assumption, a simple bound is that . This is essentially the bound in Sun et al. [26] (except that they use rather than ). But this bound does not enable any parallel speedup because of the term.

  4. Without the assumption, the RHS of (5) becomes . [24, Lemmas 4,6] does not apply to this expression. Instead, we need , where indicates the value of at time had coordinate rather than coordinate been selected, and similarly for . The two expectations would be the same if the Common Value assumption held. Our remedy is to devise new shifting lemmas to bound the cost of changing the arguments in .

To handle the first three issues, roughly speaking, for each path $\pi$, we bound the difference between the maximum and minimum possible updates over all possible asynchronous schedules. By considering the directed acyclic graph induced by the read dependencies, we show these differences are equal for many paths, given some constraints on the asynchronous schedules. With this, we can average over these paths, and by amortizing with these values, we can achieve the desired bound. We emphasize that our analysis considers all possible asynchronous schedules, but the averaging is done over subsets of these schedules.

The lower bound

Our construction applies to a simple family of strongly convex functions. The idea is that w.h.p., at each time step after an initial prefix of updates, among the most recent updates a constant fraction will have had a positive increment and a constant fraction a decrement. Then, for the next update, the asynchronous scheduling allows us to designate whether positive or negative updates are read, and hence to maintain the property. The reason this can last for only a bounded number of steps is that each coordinate needs to alternate the direction of its update, and so eventually too large a fraction of the most recent updates may be in one direction or the other. We organize this as a novel balls-in-urns problem.

Problems with large $L_{\mathrm{res}}$ and $L_{\overline{\mathrm{res}}}$

Both $L_{\mathrm{res}}$ and $L_{\overline{\mathrm{res}}}$ can be as large as $\sqrt{n} \cdot L_{\max}$. For problem instances of this type, our bound on $q$ becomes trivial; i.e., it does not demonstrate any parallel speedup. We conjecture that this is inherent. Even if the conjecture holds, it is still conceivable that parallel speedup will occur in practice, but to provide a confirming analysis would require new assumptions on the asynchronous behavior, and we leave the devising of such assumptions as an open problem.

Acknowledgment

We thank several anonymous reviewers for their helpful and thoughtful suggestions regarding earlier versions of this paper.

Index for the Appendices

For the reader's convenience, we provide an index of the appendices, together with brief descriptions of their topics.

  1. Appendix A: Basic Lemmas and Handling No Common Write.

    We prove some basic lemmas needed to measure progress and to bound the gradient error; these will be used later in the convergence analysis. Also, Appendix A.5 provides the lemmas that bound the costs of shifting arguments; they are needed to handle the No Common Write setting (and elsewhere for that matter).

  2. Appendix B: Full SACD Analysis.

    Here, we give the complete analysis of SACD. Beyond the analysis given in the main part, we show how to bound the relevant gradient differences (to do this we need to create a new ordering that lies between the start-time and commit-time orderings); we then show how to perform the amortization.

  3. Appendix C: Lower Bound on SACD.

    We exhibit a family of functions that yields the lower bound for SACD, demonstrating that our analysis of SACD is tight.

  4. Appendix D: Further Related Work.

  5. Appendix E: The Common Value Assumption.

    We explain how the Common Value assumption is violated due to different retrieval schedules for different choices of coordinates, and due to varying iteration lengths.

  6. Appendix F: Comments on Locking.

  7. References.

Appendix A Some Basic Lemmas and Facts

We now consider the general form of the function, i.e. $F(x) = f(x) + \sum_{k=1}^{n} \Psi_k(x_k)$. Recall that the update rule is

We define , and we set .

Recall we assume that at each time, there is exactly one coordinate value being updated. For any time , let denote the coordinate which is updated at time , and let .

Since, for each coordinate , the parameter and the function remain unchanged throughout the ACD process, to avoid clutter, we use the shorthand

Note that ; thus .

Let denote the set of coordinates . In this proof, will always denote a function which is univariate, proper, convex and lower semi-continuous. Recall the definition of in Definition 1. As is conventional, we write .

It is well-known that for any , and ,

(7)

A.1 Some Lemmas about the Functions

We state a number of technical lemmas concerning the functions and . The following fact, which follows directly from the definition of , will be used multiple times:

(8)
Lemma 4 (Three-Point Property, [7, Lemma 3.2]).

For any proper, convex and lower semi-continuous function and for any , let . Then for any ,

Lemma 5.

For any and , .

Proof.

We apply Lemma 4 with and . Then , and hence , as defined in Lemma 4, equals . These yield

Since and , we are done. ∎

Lemma 6 (Shifting on a parameter).

Let and . Then

Proof.

We use Lemma 4 with , and . Then we have

The above inequality holds for any . In particular, we pick , yielding

By adding to both sides, we have