Asynchronous Stochastic Coordinate Descent: Parallelism and Convergence Properties
We describe an asynchronous parallel stochastic proximal coordinate descent algorithm for minimizing a composite objective function, which consists of a smooth convex function added to a separable convex function. In contrast to previous analyses, our model of asynchronous computation accounts for the fact that components of the unknown vector may be written by some cores simultaneously with being read by others. Despite the complications arising from this possibility, the method achieves a linear convergence rate on functions that satisfy an optimal strong convexity property and a sublinear rate () on general convex functions. Near-linear speedup on a multicore system can be expected if the number of processors is . We describe results from implementation on ten cores of a multicore processor.
Key words. stochastic coordinate descent, asynchronous parallelism, inconsistent read, composite objective
AMS subject classifications. 90C25, 68W20, 68W10, 90C05
We consider the convex optimization problem
where is a smooth convex function and is a separable, closed, convex, and extended real-valued function. “Separable” means that can be expressed as , where denotes the th element of and each , is a closed, convex, and extended real-valued function.
Formulations of the type (LABEL:eqn_mainproblem) arise in many data analysis and machine learning problems, for example, the linear primal or nonlinear dual formulation of support vector machines , the LASSO approach to regularized least squares, and regularized logistic regression. Algorithms based on gradient and approximate / partial gradient information have proved effective in these settings. We mention in particular gradient projection and its accelerated variants , proximal gradient  and accelerated proximal gradient  methods for regularized objectives, and stochastic gradient methods [28, 38]. These methods are inherently serial, in that each iteration depends on the result of the previous iteration. Recently, parallel multicore versions of stochastic gradient and stochastic coordinate descent have been described for problems involving large data sets; see for example [31, 34, 3, 21, 39, 22].
This paper proposes an asynchronous stochastic proximal coordinate-descent algorithm, called AsySPCD, for composite objective functions. The basic step of AsySPCD, executed repeatedly by each core of a multicore system, is as follows: Choose an index ; read from shared memory and evaluate the th element of ; subtract a short, constant, positive of this partial gradient from ; and perform a proximal operation on to account for the regularization term . We use a simple model of computation that matches well to modern multicore architectures. Each core performs its updates on centrally stored vector in an asynchronous, uncoordinated fashion, without any form of locking. A consequence of this model is that the version of that is read by a core in order to evaluate its gradient is usually not the same as the version to which the update is made later, because is updated in the interim by other cores. (Generally, we denote by the version of that is used by a core to evaluate its component of .) We assume, however, that indefinite delays do not occur between reading and updating: There is a bound such no more than component-wise updates to are missed by a core, between the time at which it reads the vector and the time at which it makes its update to the chosen element of . A similar model of parallel asynchronous computation was used in Hogwild!  and AsySCD . However, there is a key difference in this paper: We do not assume that the evaluation vector is a version of that actually existed in the shared memory at some point in time. Rather, we account for the fact that the components of may be updated by multiple cores while in the process of being read by another core, so that may be a “hybrid” version that never actually existed in memory. Our new model, which we call an “inconsistent read” model, is significantly closer to the reality of asynchronous computation, and dispenses with the somewhat unsatisfying “consistent read” assumption of previous work. It also requires a quite distinct style of analysis; our proofs differ substantially from those in previous related works.
We show that, for suitable choices of steplength, our algorithm converges at a linear rate if an “optimal strong convexity” property (LABEL:eq:esc) holds. It attains sublinear convergence at a “” rate for general convex functions. Our analysis also defines a sufficient condition for near-linear speedup in the number of cores used. This condition relates the value of delay parameter (which corresponds closely to the number of cores / threads used in the computation) to the problem dimension . A parameter that quantifies the cross-coordinate interactions in also appears in this relationship. When the Hessian of is nearly diagonal, the minimization problem (LABEL:eqn_mainproblem) is almost separable, so higher degrees of parallelism are possible.
We review related work in Section LABEL:sec_relatedwork. Section LABEL:sec_alg specifies the proposed algorithm. Convergence results are described in Section LABEL:sec_mainresult, with proofs given in the appendix. Computational experience is reported in Section LABEL:sec_exp. A summary and conclusions appear in Section LABEL:sec_conclusion.
Notation and Assumption
We use the following notation in the remainder of the paper.
denotes the intersection of and
denotes the set on which attains its optimal value, which is denoted by .
denotes Euclidean-norm projection onto .
denotes the th natural basis vector in .
Given a matrix , we use to denote its th column and to denote its th row.
denotes the Euclidean norm .
denotes the th iterate generated by the algorithm.
denotes the optimal objective value. (Note that for any .)
We use for the th element of , and for the th element of .
Given a scalar function , define the componentwise proximal operator
Similarly, for the vector function , we denote
Note that the proximal operator is nonexpansive, that is, .
We define the following optimal strong convexity condition for a convex function with respect to the optimal set , with parameter :
This condition is significantly weaker than the usual strong convexity condition; a strongly convex function is an optimally strongly convex function, but the converse is not true in general. We provide several examples of optimally strongly convex functions that are not strongly convex:
, where is a strongly convex function and is any matrix, possibly one with a nontrivial kernel.
with strongly convex , and arbitrary , where is an indicator function defined on a polyhedron set . Note first that is unique for any , from the strong convexity of . The optimal solution set is defined by
The inequality (LABEL:eq:esc) clearly holds for , since the left-hand side is infinite in this case. For , we have by the famous theorem of Hoffman  that there exists such that
Then from the strong convexity of , we have that there exists a positive number such that for any
Squared hinge loss . To verify optimal strong convexity, we reformulate this problem as
and apply the result just derived.
Note that optimal strong convexity (LABEL:eq:esc) is a weaker version of the “essential strong convexity” condition used in . A concept called “restricted strong convexity” proposed in  (See Lemma 4.6) is similar in that it requires a certain quantity to increase quadratically with distance from the solution set, but different in that the objective is assumed to be differentiable. Anitescu  defines a “quadratic growth condition” for (smooth) nonlinear programming in which the objective is assumed to grow at least quadratically with distance to a local solution in some feasible neighborhood of that solution. Since our setting (unconstrained, nonsmooth, convex) is quite different, we believe the use of a different term is warranted here.
Throughout this paper, we make the following assumption.
The solution set of (LABEL:eqn_mainproblem) is nonempty.
We define two different Lipschitz constants and that are critical to the analysis, as follows. is the restricted Lipschitz constant for along the coordinate directions: For any , for any , and any such that , we have
The coordinate Lipschitz constant is defined for , , satisfying the same conditions as above:
We denote the ratio between these two quantities by :
Making the implicit assumption that and are chosen to be the smallest values that satisfy their respective definitions, we have from standard relationships between the and norms that
Besides bounding the nonlinearity of along various directions, the quantities and capture the interactions between the various components in the gradient . In the case of twice continuously differentiable , we can understand these interactions by observing the diagonal and off-diagonal terms of the Hessian . Let us consider upper bounds on the ratio in various situations. For simplicity, we suppose that is quadratic with positive semidefinite Hessian .
If is sparse with at most nonzeros per row/column, we have that
so that in this situation.
If is diagonally dominant, we have for any column that
which, by taking the maximum of both sides, implies that in this case.
Suppose that , where is a random matrix whose entries are i.i.d from . (For example, could be the linear least-squares objective .) We show in  that is upper-bounded roughly by in this case.
2 Related Work
We have surveyed related work on coordinate descent and stochastic gradient methods in a recent report . Our discussion there included non-stochastic, cyclic coordinate descent methods [40, 24, 44, 5, 42, 43, 35], synchronous parallel methods that distribute the work of function and gradient evaluation [16, 25, 18, 7, 11, 1, 10, 37], and asynchronous parallel stochastic gradient methods (including the randomized Kaczmarz algorithm) [31, 22]. We make some additional comments here on related topics, and include some recent references from this active research area.
Stochastic coordinate descent can be viewed as a special case of stochastic gradient, so analysis of the latter approach can be applied, to obtain for example a sublinear rate of convergence in expectation for strongly convex functions; see, for example . However, stochastic coordinate descent is “special” in that it is possible to guarantee improvement in the objective at every step. Nesterov  studied the convergence rate for a stochastic block coordinate descent method for unconstrained and separably constrained convex smooth optimization, proving linear convergence for the strongly convex case and a sublinear rate for the convex case. Richtárik and Takáč  and Lu and Xiao  extended this work to composite minimization, in which the objective is the sum of a smooth convex function and a separable nonsmooth convex function, and obtained similar (slightly stronger) convergence results. Stochastic coordinate descent is extended by Necoara and Patrascu  to convex optimization with a single linear constraint, randomly updating two coordinates at a time to maintain feasibility.
In the class of synchronous parallel methods for coordinate descent, Richtárik and Takáč  studied a synchronized parallel block (or minibatch) coordinate descent algorithm for composite optimization problems of the form (LABEL:eqn_mainproblem), with a block separable regularizer . At each iteration, processors update the randomly selected coordinates concurrently and synchronously. Speedup depends on the sparsity of the data matrix that defines the loss functions. A similar synchronous parallel method was studied in  and ; the latter focuses on the case of . Scherrer et al.  make greedy choices of multiple blocks of variables to update in parallel. Another greedy way of selecting coordinates was considered by Peng et al. , who also describe a parallel implementation of FISTA, an accelerated first-order algorithm due to Beck and Teboulle . Fercoq and Richtárik  consider a variant of (LABEL:eqn_mainproblem) in which is allowed to be nonsmooth. They apply Nesterov’s smoothing scheme to obtain a smoothed version and update multiple blocks of coordinates using block coordinate descent in parallel. Sublinear convergence rate is established for both strongly convex and weakly convex cases. Fercoq and Richtárik  proposed a variant of Nesterov’s accelerated scheme to accelerate the synchronous parallel block coordinate algorithm of , proving an improved sublinear convergence rate for weakly convex problems. This variant avoids the disadvantage of the original Nesterov acceleration scheme , which requires complexity per iteration, even on sparse data. Facchinei, Sagratella, and Scutari  consider a general framework for synchronous block coordinate descent methods with separable regularizers, in which the block subproblems may be solved inexactly. However, the block to be updated at each step is not chosen randomly; it must contain a component that is furthest from optimality, in some sense.
We turn now to asynchronous parallel methods. Bertsekas and Tsitsiklis  described an asynchronous method for fixed-point problems over a separable convex closed feasible region. (The optimization problem (LABEL:eqn_mainproblem) can be formulated in this way by defining for a fixed .) They use an inconsistent-read model of asynchronous computation, and establish linear convergence provided that components are not neglected indefinitely and that the iteration is a maximum-norm contraction. The latter condition is quite strong. In the case of null and convex quadratic in (LABEL:eqn_mainproblem) for instance, it requires the Hessian to satisfy a diagonal dominance condition — a stronger condition than strong convexity. By comparison, AsySCD  guarantees linear convergence under an “essential strong convexity” condition, though it assumes a consistent-read model of asynchronous computation. Elsner et al.  considered the same fixed point problem and architecture as , and describe a similar scheme. Their scheme appears to require locking of the shared-memory data structure for to ensure consistent reading and writing. Frommer and Szyld  give a comprehensive survey of asynchronous methods for solving fixed-point problems.
Liu et al.  followed the asynchronous consistent-read model of Hogwild! to develop an asynchronous stochastic coordinate descent (AsySCD) algorithm and proved sublinear () convergence on general convex functions and a linear convergence rate on functions that satisfy an “essential strong convexity” property. Sridhar et al.  developed an efficient LP solver by relaxing an LP problem into a bound-constrained QP problem, which is then solved by AsySCD.
Liu et al.  developed an asynchronous parallel variant of the randomized Kaczmarz algorithm for solving a general consistent linear system , proving a linear convergence rate. Avron et al.  proposed an asynchronous solver for the system where is a symmetric positive definite matrix, proving a linear convergence rate. This method is essentially an asynchronous stochastic coordinate descent method applied to the strongly convex quadratic optimization problem . The paper considers both inconsistent- and consistent-read cases are considered, with slightly different convergence results.
In our algorithm AsySPCD, multiple processors have access to a shared data structure for the vector , and each processor is able to compute a randomly chosen element of the gradient vector . Each processor repeatedly runs the following proximal coordinate descent process. (Choice of the steplength parameter is discussed further in the next section.)
Choose an index at random, read into the local storage location , and evaluate ;
Update component of the shared by taking a step of length in the direction , follows by a proximal operation defined as follows:111Our analysis assumes that no other process modifies while this proximal operation is being computed. As we explain in Section LABEL:sec_exp, our practical implementation actually assigns each coordinate to a single core, and allows only that core to update , so this issue does not arise. An alternative implementation, pointed out by a referee, would be to use a “compare-and-swap” atomic instruction to implement the update. This operation would perform the update only if was not changed while the update was being computed.
Notice that each step changes just a single element of , that is, the th element. Unlike standard proximal coordinate descent, the value at which the coordinate gradient is calculated usually differs from the value of to which the update is applied, because while the processor is evaluating its gradient, other processors may repeatedly update the value of stored in memory. As mentioned above, we use an “inconsistent read” model of asynchronous computation here, in contrast to the “consistent read” models of AsySCD  and Hogwild! . Figure LABEL:fig_inconsistentread shows how inconsistent reading can occur, as a result of updating of components of while it is being read. Consistent reading can be guaranteed by means of a software lock, but such a mechanism degrades parallel performance significantly. In fact, the implementations of Hogwild! and AsySCD described in the papers [31, 21] do not use any software lock, and in this respect the computations in those papers are not quite compatible with their analysis.
The “global” view of algorithm AsySPCD is shown in Algorithm LABEL:alg_ascd. To obtain this version from the “local” version, we introduce a counter to track the total number of updates applied to , so that is the state of in memory after update is performed. We use to denote the component that is updated at iteration , and for value of that is used in the calculation of the gradient element . The components of may have different ages. Some components may be current at iteration , others may not reflect recent updates made by other processors. We assume however that there is an upper bound of on the age of each component, measured in terms of updates. defines an iterate set such that
One can see that , . Here we assume to be the upper bound on the age of all elements in , for all , so that . We assume further that is ordered from oldest to newest index (that is, smallest to largest). Note that is empty if , that is, if the step is simply an ordinary stochastic coordinate gradient update. The value of corresponds closely to the number of cores involved in the computation provided that computation of the update for each component of costs roughly the same.
4 Main Results
This section presents results on convergence of AsySPCD. The theorem encompasses both the linear rate for optimally strongly convex and the sublinear rate for general convex . The result depends strongly on the delay parameter . The proofs are highly technical, and are relegated to Appendix LABEL:app:con. We note the proof techniques differ significantly from those used for the consistent-read algorithms of  and .
We start by describing the key idea of the algorithm, which is reflected in the way that it chooses the steplength parameter . Denoting by
we can see that
so that . Thus, we have
Therefore, we can view as capturing the expected behavior of . Note that when , we have , a standard negative-gradient step. The choice of steplength parameter entails a tradeoff: We would like to be long enough that significant progress is made at each step, but not so long that the gradient information computed at is stale and irrelevant by the time the update is applied to . We enforce this tradeoff by means of a bound on the ratio of expected squared norms on at successive iterates; specifically,
where is a user defined parameter. The analysis becomes a delicate balancing act in the choice of and steplength between aggression and excessive conservatism. We find, however, that these values can be chosen to ensure steady convergence for the asynchronous method at a linear rate, with rate constants that are almost consistent with a standard short-step proximal full-gradient descent, when the optimal strong convexity condition (LABEL:eq:esc) is satisfied.
Our main convergence result is the following.
Suppose that Assumption LABEL:ass_1 is satisfied. Let be a constant that satisfies , and define the quantities , , and as follows:
Suppose that the steplength parameter satisfies the following two bounds:
Then we have
If the optimal strong convexity property (LABEL:eq:esc) holds with , we have for that
while for general smooth convex function , we have
The following corollary proposes an interesting particular choice for the parameters for which the convergence expressions become more comprehensible. The result requires a condition on the delay bound in terms of and the ratio .
Suppose that Assumption LABEL:ass_1 holds and that
If we choose
then the steplength will satisfy the bounds (LABEL:eq:boundgammac). In addition, when the optimal strong convexity property (LABEL:eq:esc) holds with , we have for that
while for the case of general convex , we have
We note that the linear rate (LABEL:eqn_thm_2_good_c) is broadly consistent with the linear rate for the classical steepest descent method applied to strongly convex functions, which has a rate constant of , where is the standard Lipschitz constant for . Suppose we assume (not unreasonably) that steps of stochastic coordinate descent cost roughly the same as one step of steepest descent, and that . It follows from (LABEL:eqn_thm_2_good_c) that steps of stochastic coordinate descent would achieve a reduction factor of about
so a standard argument would suggest that stochastic coordinate descent would require about times more computation. Since , the stochastic asynchronous approach may actually require less computation. It may also gain an advantage from the parallel asynchronous implementation. A parallel implementation of standard gradient descent would require synchronization and careful division of the work of evaluating , whereas the stochastic approach can be implemented in an asynchronous fashion.
For the general convex case, (LABEL:eqn_thm_3_good_c) defines a sublinear rate, whose relationship with the rate of standard gradient descent for general convex optimization is similar to the previous paragraph.
Note that the results in Theorem LABEL:thm_2 and Corollary LABEL:co:thm_2 are consistent with the analysis for constrained AsySCD in , but this paper considers the more general case of composite optimization and the inconsistent-read model of parallel computation.
As noted in Section LABEL:sec:intro, the parameter corresponds closely to the number of cores that can be involved in the computation, since if all cores are working at the same rate, we would expect each other core to make one update between the times at which is read and (later) updated. If is small enough that (LABEL:eq:boundtauc) holds, the analysis indicates that near-linear speedup in the number of processors is achievable. A small value for the ratio (not much greater than ) implies a greater degree of potential parallelism. As we note at the end of Section LABEL:sec:intro, this ratio tends to closer to than to in some important applications. In these situations, the bound (LABEL:eq:boundtauc) indicates that can vary like without affecting the iteration-wise convergence rate, and yielding near-linear speedup in the number of cores. This quantity is consistent with the analysis for constrained AsySCD in  but weaker than the unconstrained AsySCD (which allows the maximal number of cores being ). A further comparison is with the asynchronous randomized Kaczmarz algorithm  which allows cores to be used efficiently when solving a consistent sparse linear system.
We conclude this section with a high-probability bound. The result follows immediately from Markov’s inequality. See Theorem 3 in  for a related result and complete proof.
Suppose that the conditions of Corollary LABEL:co:thm_2 hold, including the choice of . Then for and , we have that
provided that one of the following conditions holds. In the optimally strongly convex case (LABEL:eq:esc) with , we require
iterations, while in the general convex case, it suffices that
This section presents some results to illustrate the effectiveness of AsySPCD, in particular, the fact that near-linear speedup can be observed on a multicore machine. We note that more comprehensive experiments can be found in  and , for unconstrained and box-constrained problems. Although the analysis in  assumes consistent read, it is not enforced in the implementation, so apart from the fact that we now include a prox-step to account for the regularization term, the implementations in  and  are quite similar to the one employed in this section.
We apply our code for AsySPCD to the following “-” problem:
The elements of are selected i.i.d. from a Gaussian distribution. To construct a sparse true solution , given the dimension and sparsity , we select entries of at random to be nonzero and normally distributed, and set the rest to zero. The measurement vector is obtained by , where elements of the noise vector are i.i.d. , where the value of controls the signal-to-noise ratio.
Our experiments run on to threads of an Intel Xeon machine, with all threads sharing a single memory socket. Our implementations deviate modestly from the version of AsySPCD described in Section LABEL:sec_alg. We compute and offline. and are partitioned into slices (row submatrices) and subvectors (respectively) of equal size, and each thread is assigned one submatrix from and the corresponding subvector from . During the algorithm, each thread updates the elements of corresponding to its slice of , in order. After one scan, or “epoch” is complete, it reorders the indices randomly, then repeats the process. This scheme essentially changes the scheme from sampling with replacement (as analyzed) to sampling without replacement, which has demonstrated empirically better performance on many related problems. (The same advantage is noted in the implementations of Hogwild! .)
We choose with , , and in Figure LABEL:fig:LASSO_1 and , , and in Figure LABEL:fig:LASSO_2. We set (a value of the order of is suggested by compressed sensing theory) and the steplength is set as in both figures. In both cases, we can estimate the ratio roughly by , as suggested at the end of Section LABEL:sec:intro. Our final computed values of have nonzeros in the same locations as the chosen solution , though the values differ, because of the noise in .
The left-hand graph in each figure indicates the number of threads / cores and plots objective function value vs epoch count, where one epoch is equivalent to iterations. Note that the curves are almost overlaid, indicating that the total workload required for AsySPCD is almost independent of the number of cores used in the computation. This observation validates our result in Corollary LABEL:co:thm_2, which indicates that provided is below a certain threshold, it does not seriously affect the rate of convergence, as a function of total computation performed. The right-hand graph in each figure shows speedup when executed on different numbers of cores. Near-linear speedup is observed in Figure LABEL:fig:LASSO_2, while there is a slight dropoff for the larger numbers of cores in Figure LABEL:fig:LASSO_1. The difference can be explain by the smaller dimension of the problem illustrated in Figure LABEL:fig:LASSO_1. Referring to our threshold value (LABEL:eq:boundtauc) that indicates dimensions above which linear speedup should be expected, we have by setting (as discussed above) and (the maximum number of threads used in this experiment) that the left-hand side of (LABEL:eq:boundtauc) is approximately 3000, while the right-hand side is (for Figure LABEL:fig:LASSO_1) and approximately (for Figure LABEL:fig:LASSO_1). As expected, our analysis is quite conservative; near-linear speedup is observed even when the threshold (LABEL:eq:boundtauc) is violated significantly.
This paper proposes an asynchronous parallel proximal stochastic coordinate descent algorithm for minimizing composite objectives of the form (LABEL:eqn_mainproblem). Sublinear convergence (at rate ) is proved for general convex functions, with stronger linear convergence results for problems that satisfy the optimal strong convexity property (LABEL:eq:esc). Our analysis indicates the extent to which parallel implementations can be expected to yield near-linear speedup, in terms of a parameter that quantifies the cross-coordinate interactions in the gradient and a parameter that bounds the delay in updating. Our computational experience confirms that the linear speedup properties suggested by the analysis can be observed in practice.
The authors thank the editor and both referees for their valuable comments. Special thanks to Dr. Yijun Huang for her implementation of AsySPCD, which was used here to obtain computational results.
A Proofs of Main Results
This section provides the proofs for the main convergence results. We start with some preliminaries, then proceed to proofs of Theorem LABEL:thm_2 and Corollary LABEL:co:thm_2.
Note that the component indices in Algorithm LABEL:alg_ascd are independent random variables. We use to denote the expectation over all random variables, and to denote the conditional expectation in term of given . We also denote
and formulate the update in Step LABEL:step_proj of Algorithm LABEL:alg_ascd in the following way:
(Note that for .) From the optimality condition for this formulation (see (41) in ), we have for all that
By rearranging this expression and substituting for , we find that the following inequality is true for all :
From the definition of , and using the notation (LABEL:eq:defdelta), we have
From the definition of in (LABEL:eqn_xj1), we have
so, using (41) from  again, we have
We now define
and note that this definition is consistent with defined in (LABEL:eq:defdelta). From (LABEL:xbarij), we have
Recalling that the indices in are sorted in the increasing order from smallest (oldest) iterate to largest (newest) iterate, we use to denote the -th smallest entry in . For , we define
We have the following relations:
Furthermore, we have
where the second inequality holds because and differ in only a single coordinate.
a.2 Proof of Theorem LABEL:thm_2
Proof. We prove (LABEL:eqn_thm2_1) by induction. First, note that for any vectors and , we have
Thus for all , we have
The second factor in the r.h.s. of (LABEL:eqn_proof2_1) is bounded as follows:
where the fourth inequality uses , since and differ in just one component.
We set , and note that and . In this case, we obtain a bound from (LABEL:eqn_proof2_2)
By substituting this bound in (LABEL:eqn_proof2_1) and setting , and taking expectations, we obtain