Dykstra’s Algorithm, ADMM, and Coordinate Descent: Connections, Insights, and Extensions

# Dykstra’s Algorithm, ADMM, and Coordinate Descent: Connections, Insights, and Extensions

Ryan J. Tibshirani
Statistics Department and Department of Machine Learning
Carnegie Mellon University
Pittsburgh, PA 15213
ryantibs@stat.cmu.edu
###### Abstract

We study connections between Dykstra’s algorithm for projecting onto an intersection of convex sets, the augmented Lagrangian method of multipliers or ADMM, and block coordinate descent. We prove that coordinate descent for a regularized regression problem, in which the (separable) penalty functions are seminorms, is exactly equivalent to Dykstra’s algorithm applied to the dual problem. ADMM on the dual problem is also seen to be equivalent, in the special case of two sets, with one being a linear subspace. These connections, aside from being interesting in their own right, suggest new ways of analyzing and extending coordinate descent. For example, from existing convergence theory on Dykstra’s algorithm over polyhedra, we discern that coordinate descent for the lasso problem converges at an (asymptotically) linear rate. We also develop two parallel versions of coordinate descent, based on the Dykstra and ADMM connections.

## 1 Introduction

In this paper, we study two seemingly unrelated but closely connected convex optimization problems, and associated algorithms. The first is the best approximation problem: given closed, convex sets and , we seek the point in the intersection closest to , and solve

 minu∈Rn∥y−u∥22subjecttou∈C1∩⋯∩Cd. (1)

The second problem is the regularized regression problem: given a response and predictors , and a block decomposition , of the columns of (i.e., these could be columns, or groups of columns), we build a working linear model by applying blockwise regularization over the coefficients, and solve

 minw∈Rp12∥y−Xw∥22+d∑i=1hi(wi), (2)

where , are convex functions, and we write , for the appropriate block decomposition of a coefficient vector (so that ).

Two well-studied algorithms for problems (1), (2) are Dykstra’s algorithm (Dykstra, 1983; Boyle and Dykstra, 1986) and (block) coordinate descent (Warga, 1963; Bertsekas and Tsitsiklis, 1989; Tseng, 1990), respectively. The jumping-off point for our work in this paper is the following fact: these two algorithms are equivalent for solving (1) and (2). That is, for a particular relationship between the sets and penalty functions , the problems (1) and (2) are duals of each other, and Dykstra’s algorithm on the primal problem (1) is exactly the same as coordinate descent on the dual problem (2). We provide details in Section 2.

This equivalence between Dykstra’s algorithm and coordinate descent can be essentially found in the optimization literature, dating back to the late 1980s, and possibly earlier. (We say “essentially” here because, to our knowledge, this equivalence has not been stated for a general regression matrix , and only in the special case ; but, in truth, the extension to a general matrix is fairly straightforward.) Though this equivalence has been cited and discussed in various ways over the years, we feel that it is not as well-known as it should be, especially in light of the recent resurgence of interest in coordinate descent methods. We revisit the connection between Dykstra’s algorithm and coordinate descent, and draw further connections to a third method—the augmented Lagrangian method of multipliers or ADMM (Glowinski and Marroco, 1975; Gabay and Mercier, 1976)—that has also received a great deal of attention recently. While these basic connections are interesting in their own right, they also have important implications for analyzing and extending coordinate descent. Below we give a summary of our contributions.

1. We prove in Section 2 (under a particular configuration relating to seminorms ) that Dykstra’s algorithm for (1) is equivalent to block coordinate descent for (2). (This is a mild generalization of the previously known connection when .)

2. We also show in Section 2 that ADMM is closely connected to Dykstra’s algorithm, in that ADMM for (1), when and is a linear subspace, matches Dykstra’s algorithm.

3. Leveraging existing results on the convergence of Dykstra’s algorithm for an intersection of halfspaces, we establish in Section 3 that coordinate descent for the lasso problem has an (asymptotically) linear rate of convergence, regardless of the dimensions of (i.e., without assumptions about strong convexity of the problem). We derive two different explicit forms for the error constant, which shed light onto how correlations among the predictor variables affect the speed of convergence.

4. Appealing to parallel versions of Dykstra’s algorithm and ADMM, we present in Section 4 two parallel versions of coordinate descent (each guaranteed to converge in full generality).

5. We extend in Section 5 the equivalence between coordinate descent and Dykstra’s algorithm to the case of nonquadratic loss in (2), i.e., non-Euclidean projection in (1). This leads to a Dykstra-based parallel version of coordinate descent for (separably regularized) problems with nonquadratic loss, and we also derive an alternative ADMM-based parallel version of coordinate descent for the same class of problems.

## 2 Preliminaries and connections

### Dykstra’s algorithm.

Dykstra’s algorithm was first proposed by Dykstra (1983), and was extended to Hilbert spaces by Boyle and Dykstra (1986). Since these seminal papers, a number of works have analyzed and extended Dykstra’s algorithm in various interesting ways. We will reference many of these works in the coming sections, when we discuss connections between Dykstra’s algorithm and other methods; for other developments, see the comprehensive books Deutsch (2001); Bauschke and Combettes (2011) and review article Bauschke and Koch (2013).

Dykstra’s algorithm for the best approximation problem (1) can be described as follows. We initialize , , and then repeat, for :

 u(k) =PC[k](u(k−1)+z(k−d)), (3) z(k) =u(k−1)+z(k−d)−u(k),

where denotes the (Euclidean) projection of onto a closed, convex set , and denotes the modulo operator taking values in . What differentiates Dykstra’s algorithm from the classical alternating projections method of von Neumann (1950); Halperin (1962) is the sequence of (what we may call) dual variables , . These track, in a cyclic fashion, the residuals from projecting onto . The simpler alternating projections method will always converge to a feasible point in , but will not necessarily converge to the solution in (1) unless are subspaces (in which case alternating projections and Dykstra’s algorithm coincide). Meanwhile, Dykstra’s algorithm converges in general (for any closed, convex sets with nonempty intersection, see, e.g., Boyle and Dykstra (1986); Han (1988); Gaffke and Mathar (1989)). We note that Dykstra’s algorithm (3) can be rewritten in a different form, which will be helpful for future comparisons. First, we initialize , , and then repeat, for :

 u(k)0=u(k−1)d, (4)

### Coordinate descent.

Coordinate descent methods have a long history in optimization, and have been studied and discussed in early papers and books such as Warga (1963); Ortega and Rheinboldt (1970); Luenberger (1973); Auslender (1976); Bertsekas and Tsitsiklis (1989), though coordinate descent was still likely in use much earlier. (Of course, for solving linear systems, coordinate descent reduces to Gauss-Seidel iterations, which dates back to the 1800s.) Some key papers analyzing the convergence of coordinate descent methods are Tseng and Bertsekas (1987); Tseng (1990); Luo and Tseng (1992, 1993); Tseng (2001). In the last 10 or 15 years, a considerable interest in coordinate descent has developed across the optimization community. With the flurry of recent work, it would be difficult to give a thorough account of the recent progress on the topic. To give just a few examples, recent developments include finite-time (nonasymptotic) convergence rates for coordinate descent, and exciting extensions such as accelerated, parallel, and distributed versions of coordinate descent. We refer to Wright (2015), an excellent survey that describes this recent progress.

In (block) coordinate descent111To be precise, this is cyclic coordinate descent, where exact minimization is performed along each block of coordinates. Randomized versions of this algorithm have recently become popular, as have inexact or proximal versions. While these variants are interesting, they are not the focus of our paper. for (2), we initialize say , and repeat, for :

 w(k)i=argminwi∈Rpi12∥∥∥y−∑jiXjw(k−1)j−Xiwi∥∥∥22+hi(wi),i=1,…,d. (5)

We assume here and throughout that , each have full column rank so that the updates in (5) are uniquely defined (this is used for convenience, and is not a strong assumption; note that it places no restriction on the dimensionality of the full problem in (2), i.e., we could still have with ). The precise form of these updates, of course, depends on the penalty functions , . Suppose that , are seminorms, which we can express in the general form , where is a closed, convex set containing 0, for . Suppose also that , the inverse image of under the linear mapping , for . Then, perhaps surprisingly, it turns out that the coordinate descent iterations (5) are exactly the same as the Dykstra iterations (4), via duality. The key relationship is extracted as a lemma below, for future reference, and then the formal equivalence is stated. Proofs of these results, as with all results in this paper, are given in the supplement.

###### Lemma 1.

Assume that has full column rank and for a closed, convex set containing 0. Then for and any ,

 ^wi=argminwi∈Rpi12∥b−Xiwi∥22+hi(wi)⟺Xi^wi=(Id−PCi)(b).

where denotes the identity mapping.

###### Theorem 1.

Assume the setup in Lemma 1, for each . Then problems (1), (2) are dual to each other, and their solutions, denoted , respectively, satisfy . Further, Dykstra’s algorithm (4) and coordinate descent (5) are equivalent, and satisfy at all iterations :

 z(k)i=Xiw(k)iandu(k)i=y−∑j≤iXjw(k)j−∑j>iXjw(k−1)j,for i=1,…,d.

The equivalence between coordinate descent and Dykstra’s algorithm dates back to (at least) Han (1988); Gaffke and Mathar (1989), under the special case . In fact, Han (1988), presumably unaware of Dykstra’s algorithm, seems to have reinvented the method and established convergence through its relationship to coordinate descent. This work then inspired Tseng (1993) (who must have also been unaware of Dykstra’s algorithm) to improve the existing analyses of coordinate descent, which at the time all assumed smoothness of the objective function. (Tseng continued on to become arguably the single most important contributor to the theory of coordinate descent of the 1990s and 2000s, and his seminal work Tseng (2001) is still one of the most comprehensive analyses to date.)

References to this equivalence can be found speckled throughout the literature on Dykstra’s method, but given the importance of the regularized problem form (2) for modern statistical and machine learning estimation tasks, we feel that the connection between Dykstra’s algorithm and coordinate descent and is not well-known enough and should be better explored. In what follows, we show that some old work on Dykstra’s algorithm, fed through this equivalence, yields new convergence results for coordinate descent for the lasso and a new parallel version of coordinate descent.

The augmented Lagrangian method of multipliers or ADMM was invented by Glowinski and Marroco (1975); Gabay and Mercier (1976). ADMM is a member of a class of methods generally called operator splitting techniques, and is equivalent (via a duality argument) to Douglas-Rachford splitting (Douglas and Rachford, 1956; Lions and Mercier, 1979). Recently, there has been a strong revival of interest in ADMM (and operator splitting techniques in general), arguably due (at least in part) to the popular monograph of Boyd et al. (2011), where it is argued that the ADMM framework offers an appealing flexibility in algorithm design, which permits parallelization in many nontrivial situations. As with coordinate descent, it would be difficult thoroughly describe recent developments on ADMM, given the magnitude and pace of the literature on this topic. To give just a few examples, recent progress includes finite-time linear convergence rates for ADMM (see Nishihara et al. 2015; Hong and Luo 2017 and references therein), and accelerated extensions of ADMM (see Goldstein et al. 2014; Kadkhodaie et al. 2015 and references therein).

To derive an ADMM algorithm for (1), we introduce auxiliary variables and equality constraints to put the problem in a suitable ADMM form. While different formulations for the auxiliary variables and constraints give rise to different algorithms, loosely speaking, these algorithms generally take on similar forms to Dykstra’s algorithm for (1). The same is also true of ADMM for the set intersection problem, a simpler task than the best approximation problem (1), in which we only seek a point in the intersection , and solve

 minu∈Rnd∑i=11Ci(ui), (6)

where denotes the indicator function of a set (equal to 0 on , and otherwise). Consider the case of sets, in which case the translation of (6) into ADMM form is unambiguous. ADMM for (6), properly initialized, appears highly similar to Dykstra’s algorithm for (1); so similar, in fact, that Boyd et al. (2011) mistook the two algorithms for being equivalent, which is not generally true, and was shortly thereafter corrected by Bauschke and Koch (2013).

Below we show that when , is a linear subspace, and , an ADMM algorithm for (1) (and not the simpler set intersection problem (6)) is indeed equivalent to Dykstra’s algorithm for (1). Introducing auxiliary variables, the problem (1) becomes

 minu1,u2∈Rn∥y−u1∥22+1C1(u1)+1C2(u2)subjecttou1=u2,

and the augmented Lagrangian is , where is an augmented Lagrangian parameter. ADMM now repeats, for :

 u(k)1 =PC1(y1+ρ+ρ(u(k−1)2−z(k−1))1+ρ), (7) u(k)2 =PC2(u(k)1+z(k−1)), z(k) =z(k−1)+u(k)1−u(k)2.

Suppose we initialize , , and set . Using linearity of , the fact that , and a simple inductive argument, the above iterations can be rewritten as

 u(k)1 =PC1(u(k−1)2), (8) u(k)2 =PC2(u(k)1+z(k−1)), z(k) =z(k−1)+u(k)1−u(k)2,

which is precisely the same as Dykstra’s iterations (4), once we realize that, due again to linearity of , the sequence , in Dykstra’s iterations plays no role and can be ignored.

Though sets in (1) may seem like a rather special case, the strategy for parallelization in both Dykstra’s algorithm and ADMM stems from rewriting a general -set problem as a 2-set problem, so the above connection between Dykstra’s algorithm and ADMM can be relevant even for problems with , and will reappear in our later discussion of parallel coordinate descent. As a matter of conceptual interest only, we note that for general (and no constraints on the sets being subspaces), Dykstra’s iterations (4) can be viewed as a limiting version of the ADMM iterations either for (1) or for (6), as we send the augmented Lagrangian parameters to or to 0 at particular scalings. See the supplement for details.

## 3 Coordinate descent for the lasso

The lasso problem (Tibshirani, 1996; Chen et al., 1998), defined for a tuning parameter as

 minw∈Rp12∥y−Xw∥22+λ∥w∥1, (9)

is a special case of (2) where the coordinate blocks are of each size 1, so that , are just the columns of , and , are the components of . This problem fits into the framework of (2) with for , for each .

Coordinate descent is widely-used for the lasso (9), both because of the simplicity of the coordinatewise updates, and because careful implementations can achieve state-of-the-art performance, at the right problem sizes. The use of coordinate descent for the lasso was popularized by Friedman et al. (2007, 2010), but was studied earlier or concurrently by several others, e.g., Fu (1998); Sardy et al. (2000); Wu and Lange (2008).

As we know from Theorem 1, the dual of problem (9) is the best approximation problem (1), where is an intersection of two halfspaces, for . This makes an intersection of halfspaces, i.e., a (centrally symmetric) polyhedron. For projecting onto a polyhedron, it is well-known that Dykstra’s algorithm reduces to Hildreth’s algorithm (Hildreth, 1957), an older method for quadratic programming that itself has an interesting history in optimization. Theorem 1 hence shows coordinate descent for the lasso (9) is equivalent not only to Dykstra’s algorithm, but also to Hildreth’s algorithm, for (1).

This equivalence suggests a number of interesting directions to consider. For example, key practical speedups have been developed for coordinate descent for the lasso that enable this method to attain state-of-the-art performance at the right problem sizes, such as clever updating rules and screening rules (e.g., Friedman et al. 2010; El Ghaoui et al. 2012; Tibshirani et al. 2012; Wang et al. 2015). These implementation tricks can now be used with Dykstra’s (Hildreth’s) algorithm. On the flip side, as we show next, older results from Iusem and De Pierro (1990); Deutsch and Hundal (1994) on Dykstra’s algorithm for polyhedra, lead to interesting new results on coordinate descent for the lasso.

###### Theorem 2 (Adaptation of Iusem and De Pierro 1990).

Assume the columns of are in general position, and . Then coordinate descent for the lasso (9) has an asymptotically linear convergence rate, in that for large enough ,

 (10)

where is the lasso solution in (9), , and for , is the active set of , is its size, denotes the columns of indexed by , and denotes the smallest eigenvalue of .

###### Theorem 3 (Adaptation of Deutsch and Hundal 1994).

Assume the same conditions and notation as in Theorem 2. Then for large enough ,

 (11)

where we enumerate , , and we denote by the projection onto the orthocomplement of the column span of .

The results in Theorems 2, 3 both rely on the assumption of general position for the columns of . This is only used for convenience and can be removed at the expense of more complicated notation. Loosely put, the general position condition simply rules out trivial linear dependencies between small numbers of columns of , but places no restriction on the dimensions of (i.e., it still allows for ). It implies that the lasso solution is unique, and that (where ) has full column rank. See Tibshirani (2013) for a precise definition of general position and proofs of these facts. We note that when has full column rank, the bounds in (10), (11) are strictly less than 1.

###### Remark 1 (Comparing (10) and (11)).

Clearly, both the bounds in (10), (11) are adversely affected by correlations among , (i.e., stronger correlations will bring each closer to 1). It seems to us that (11) is usually the smaller of the two bounds, based on simple mathematical and numerical comparisons. More detailed comparisons would be interesting, but is beyond the scope of this paper.

###### Remark 2 (Linear convergence without strong convexity).

One striking feature of the results in Theorems 2, 3 is that they guarantee (asymptotically) linear convergence of the coordinate descent iterates for the lasso, with no assumption about strong convexity of the objective. More precisely, there are no restrictions on the dimensionality of , so we enjoy linear convergence even without an assumption on the smooth part of the objective. This is in line with classical results on coordinate descent for smooth functions, see, e.g., Luo and Tseng (1992). The modern finite-time convergence analyses of coordinate descent do not, as far as we understand, replicate this remarkable property. For example, Beck and Tetruashvili (2013); Li et al. (2016) establish finite-time linear convergence rates for coordinate descent, but require strong convexity of the entire objective.

###### Remark 3 (Active set identification).

The asymptotics developed in Iusem and De Pierro (1990); Deutsch and Hundal (1994) are based on a notion of (in)active set identification: the critical value of after which (10), (11) hold is based on the (provably finite) iteration number at which Dykstra’s algorithm identifies the inactive halfspaces, i.e., at which coordinate descent identifies the inactive set of variables, . This might help explain why in practice coordinate descent for the lasso performs exceptionally well with warm starts, over a decreasing sequence of tuning parameter values (e.g., Friedman et al. 2007, 2010): here, each coordinate descent run is likely to identify the (in)active set—and hence enter the linear convergence phase—at an early iteration number.

## 4 Parallel coordinate descent

### Parallel-Dykstra-CD.

An important consequence of the connection between Dykstra’s algorithm and coordinate descent is a new parallel version of the latter, stemming from an old parallel version of the former. A parallel version of Dykstra’s algorithm is usually credited to Iusem and Pierro (1987) for polyhedra and Gaffke and Mathar (1989) for general sets, but really the idea dates back to the product space formalization of Pierra (1984). We rewrite problem (1) as

 minu=(u1,…,ud)∈Rndd∑i=1γi∥y−ui∥22subjecttou∈C0∩(C1×⋯×Cd), (12)

where , and are weights that sum to 1. After rescaling appropriately to turn (12) into an unweighted best approximation problem, we can apply Dykstra’s algorithm, which sets , , and repeats:

 u(k)0=d∑i=1γiu(k−1)i, (13)

for . The steps enclosed in curly brace above can all be performed in parallel, so that (13) is a parallel version of Dykstra’s algorithm (4) for (1). Applying Lemma 1, and a straightforward inductive argument, the above algorithm can be rewritten as follows. We set , and repeat:

 (14)

for , which we call parallel-Dykstra-CD (with CD being short for coordinate descent). Again, note that each update in (14) can be performed in parallel, so that (14) is a parallel version of coordinate descent (5) for (2). Also, as (14) is just a reparametrization of Dykstra’s algorithm (13) for the 2-set problem (12), it is guaranteed to converge in full generality, as per the standard results on Dykstra’s algorithm (Han, 1988; Gaffke and Mathar, 1989).

###### Theorem 4.

Assume that has full column rank and for a closed, convex set containing 0, for . If (2) has a unique solution, then the iterates in (14) converge to this solution. More generally, if the interior of is nonempty, then the sequence , from (14) has at least one accumulation point, and any such point solves (2). Further, , converges to , the optimal fitted value in (2).

There have been many recent exciting contributions to the parallel coordinate descent literature; two standouts are Jaggi et al. (2014); Richtarik and Takac (2016), and numerous others are described in Wright (2015). What sets parallel-Dykstra-CD apart, perhaps, is its simplicity: convergence of the iterations (14), given in Theorem 4, just stems from the connection between coordinate descent and Dykstra’s algorithm, and the fact that the parallel Dykstra iterations (13) are nothing more than the usual Dykstra iterations after a product space reformulation. Moreover, parallel-Dykstra-CD for the lasso enjoys an (asymptotic) linear convergence rate under essentially no assumptions, thanks once again to an old result on the parallel Dykstra (Hildreth) algorithm from Iusem and De Pierro (1990). The details can be found in the supplement.

As an alternative to the parallel method derived using Dykstra’s algorithm, ADMM can also offer a version of parallel coordinate descent. Since (12) is a best approximation problem with sets, we can refer back to our earlier ADMM algorithm in (7) for this problem. By passing these ADMM iterations through the connection developed in Lemma 1, we arrive at what we call parallel-ADMM-CD, which initializes , , and repeats:

 u(k)0 =(∑di=1ρi)u(k−1)01+∑di=1ρi+y−Xw(k−1)1+∑di=1ρi+X(w(k−2)−w(k−1))1+∑di=1ρi, (15) w(k)i =argminwi∈Rpi12∥∥u(k)0+Xiw(k−1)i/ρi−Xiwi/ρi∥∥22+hi(wi/ρi),i=1,…,d,

for , where are augmented Lagrangian parameters. In each iteration, the updates to , above can be done in parallel. Just based on their form, it seems that (15) can be seen as a parallel version of coordinate descent (5) for problem (2). The next result confirms this, leveraging standard theory for ADMM (Gabay, 1983; Eckstein and Bertsekas, 1992).

###### Theorem 5.

Assume that has full column rank and for a closed, convex set containing 0, for . Then the sequence , in (15) converges to a solution in (2).

The parallel-ADMM-CD iterations in (15) and parallel-Dykstra-CD iterations in (14) differ in that, where the latter uses a residual , the former uses an iterate that seems to have a more complicated form, being a convex combination of and , plus a quantity that acts like a momentum term. It turns out that when sum to 1, the two methods (14), (15) are exactly the same. While this may seem like a surprising coincidence, it is in fact nothing more than a reincarnation of the previously established equivalence between Dykstra’s algorithm (4) and ADMM (8) for a 2-set best approximation problem, as here is a linear subspace.

Of course, with ADMM we need not choose probability weights for , and the convergence in Theorem 5 is guaranteed for any fixed values of these parameters. Thus, even though they were derived from different perspectives, parallel-ADMM-CD subsumes parallel-Dykstra-CD, and it is a strictly more general approach. It is important to note that larger values of can often lead to faster convergence, as we show in numerical experiments in the supplement. More detailed study and comparisons to related parallel methods are worthwhile, but are beyond the scope of this work.

## 5 Extensions and discussion

We studied connections between Dykstra’s algorithm, ADMM, and coordinate descent, leveraging these connections to establish an (asymptotically) linear convergence rate for coordinate descent for the lasso, as well as two parallel versions of coordinate descent (one based on Dykstra’s algorithm and the other on ADMM). Some extensions and possibilities for future work are described below.

### Nonquadratic loss: Dykstra’s algorithm and coordinate descent.

Given a convex function , a generalization of (2) is the regularized estimation problem

 minw∈Rpf(Xw)+d∑i=1hi(wi). (16)

Regularized regression (2) is given by , and e.g., regularized classification (under the logistic loss) by . In (block) coordinate descent for (16), we initialize say , and repeat, for :

 w(k)i=argminwi∈Rpif(∑jiXjw(k−1)j+Xiwi)+hi(wi),i=1,…,d. (17)

On the other hand, given a differentiable and strictly convex function , we can generalize (1) to the following best Bregman-approximation problem,

 minu∈RnDg(u,b)subjecttou∈C1∩⋯∩Cd. (18)

where is the Bregman divergence between and with respect to . When (and ), this recovers the best approximation problem (1). As shown in Censor and Reich (1998); Bauschke and Lewis (2000), Dykstra’s algorithm can be extended to apply to (18). We initialize , , and repeat for :

 u(k)0=u(k−1)d, (19)

where denotes the Bregman (rather than Euclidean) projection of onto a set , and is the conjugate function of . Though it may not be immediately obvious, when the above iterations (19) reduce to the standard (Euclidean) Dykstra iterations in (4). Furthermore, Dykstra’s algorithm and coordinate descent are equivalent in the more general setting.

###### Theorem 6.

Let be a closed, strictly convex, differentiable function. Assume that has full column rank, and for a closed, convex set containing 0, for . Also, let , , and , . Then problems (16), (18) are dual to each other, and their solutions satisfy . Further, Dykstra’s algorithm (19) and coordinate descent (17) are equivalent, i.e., for :

 z(k)i=Xiw(k)iandu(k)i=−∇f(∑j≤iXjw(k)j+∑j>iXjw(k−1)j),%for$i=1,…,d$.

### Nonquadratic loss: parallel coordinate descent methods.

For a general regularized estimation problem (16), parallel coordinate descent methods can be derived by applying Dykstra’s algorithm and ADMM to a product space reformulation of the dual. Interestingly, the subsequent coordinate descent algorithms are no longer equivalent (for a unity augmented Lagrangian parameter), and they feature complementary computational structures. The Dykstra version has a closed-form -update, but its (parallel) -updates require coordinatewise minimizations involving the smooth, convex loss . On the other hand, the ADMM version admits a more difficult -update, but its (parallel) -updates only require coordinatewise minimizations with a quadratic loss (this being typically simpler than the corresponding minimizations for most nonquadratic of interest). The supplement gives details.

### Asynchronous parallel algorithms, and coordinate descent in Hilbert spaces.

We finish with some directions for possible future work. Asynchronous variants of parallel coordinate descent are currently of great interest, e.g., see the review in Wright (2015). Given the link between ADMM and coordinate descent developed in this paper, it would be interesting to investigate the implications of the recent exciting progress on asynchronous ADMM, e.g., see Chang et al. (2016a, b) and references therein, for coordinate descent. In a separate direction, much of the literature on Dykstra’s algorithm emphasizes that this method works seamlessly in Hilbert spaces. It would be interesting to consider the connections to (parallel) coordinate descent in infinite-dimensional function spaces, which we would encounter, e.g., in alternating conditional expectation algorithms or backfitting algorithms in additive models.

Appendix: Proofs, Technical Details, and Numerical Experiments

## a.1 Proofs of Lemma 1 and Theorem 1

These results are direct consequences of the more general Lemma A.2 and Theorem 6, respectively, when (and so ); see Section A.9 below for their proofs.

## a.2 Dykstra’s algorithm and ADMM for the d-set best approximation and set intersection problems

Here we show that, under an inertial-type modification, the ADMM iterations for (6) are in a certain limiting sense equivalent to Dykstra’s iterations for (1). We can introducing auxiliary variables to transform problem (1) into

 minu0,…,ud∈Rnd∑i=11Ci(u)subjecttoud=u0,u0=u1,…,ud−1=ud,

and the corresponding augmented Lagrangian is , where are augmented Lagrangian parameters. ADMM is defined by repeating the updates:

 u(k)i =argminui∈RnL(u(k)0,…,u(k)i−1,ui,u(k−1)i+1,…,u(k−1)d),i=0,…,d, z(k)i =z(k−1)i+u(k)i−1−u(k)i,i=0,…,d,

for , where we use for convenience. Now consider an inertial modification in which, for the update above, we add the term to the augmented Lagrangian in the minimization. A straightforward derivation then leads to the ADMM updates:

 u(k)0 =u(k−1)d1+ρ0+ρ1+ρ0(u(k−1)d+z(k−1)0)1+ρ0+ρ1+ρ1(u(k−1)1−z(k−1)1)1+ρ0+ρ1, (A.1) u(k)i =PCi((u(k)i−1+z(k−1)i)1+ρi+1/ρi+(ρi+1/ρi)(u(k−1)i+1−z(k−1)i+1)1+ρi+1/ρi),i=1,…,d−1, u(k)d =PCd((u(k)d−1+z(k−1)d)1+ρ0/ρd+(ρ0/ρd)(u(k)0−z(k−1)0)1+ρ0/ρd), z(k)i =z(k−1)i+u(k)i−1−u(k)i,i=0,…,d,

for . Under the choices and , , we see that as the ADMM iterations (A.1) exactly coincide with the Dykstra iterations (4). Thus, under the proper initializations, and , the limiting ADMM algorithm for (6) matches Dykstra’s algorithm for (1).

Similar arguments can be used equate ADMM for (1) to Dykstra’s algorithm, again in limiting sense. We rewrite (1) as

 minu0,…,ud∈Rn∥y−u0∥22+d∑i=11Ci(u)subjecttoud=u0,u0=u1,…,ud−1=ud.

Using an inertial modification for the update, where we now add the term to the augmented Lagrangian in the minimization, the ADMM updates become:

 u(k)0 =y1+ρ−1+ρ0+ρ1+ρ−1u(k−1)d1+ρ−1+ρ0+ρ1+ρ0(u(k−1)d+z(k−1)0)1+ρ−1+ρ0+ρ1+ρ1(u(k−1)1−z(k−1)1)1+ρ−1+ρ0+ρ1, (A.2) u(k)i =PCi((u(k)i−1+z(k−1)i)1+ρi+1/ρi+(ρi+1/ρi)(u(k−1)i+1−z(k−1)i+1)1+ρi+1/ρi),i=1,…,d−1, u(k)d =PCd((u(k)d−1+z(k−1)d)1+ρ0/ρd+(ρ0/ρd)(u(k)0−z(k−1)0)1+ρ0/ρd), z(k)i =z(k−1)i+u(k)i−1−u(k)i,i=0,…,d,

for . Setting , , and , , we can see that as , the ADMM iterations (A.2) converge to the Dykstra iterations (4), and therefore with initializations and , the limiting ADMM algorithm for (1) coincides with Dykstra’s algorithm for the same problem.

The links above between ADMM and Dykstra’s algorithm are intended to be of conceptual interest, and the ADMM algorithms (A.1), (A.2) may not be practically useful for arbitrary configurations of the augmented Lagrangian parameters. After all, both of these are multi-block ADMM approaches, and multi-block ADMM has subtle convergence behavior as studied, e.g., in Lin et al. (2015); Chen et al. (2016).

## a.3 Proof of Theorem 2

By Theorem 1, we know that coordinate descent applied to the lasso problem (9) is equivalent to Dykstra’s algorithm on the best approximation problem (1), with , for . In particular, at the end of the th iteration, it holds that

 u(k)p=y−Xw(k),for k=1,2,3,….

By duality, we also have at the solutions in (1), (9), respectively. Therefore any statement about the convergence of Dykstra’s iterates may be translated into a statement about the convergence of the coordinate descent iterates, via the relationship

 ∥u(k)−^u∥2=∥Xw(k)−X^w∥2=∥w(k)−^w∥Σ,for k=1,2,3,…. (A.3)

We seek to apply the main result from Iusem and De Pierro (1990), on the asymptotic convergence rate of Dykstra’s (Hildreth’s) algorithm for projecting onto a polyhedron. One slight complication is that, in the current paramterization, coordinate descent is equivalent to Dykstra’s algorithm on

 C1∩…∩Cp=p⋂i=1{v∈Rn:|XTiv|≤λ},

While polyhedral, the above is not explicitly an intersection of halfspaces (it is an intersection of slabs), which is the setup required by the analysis of Iusem and De Pierro (1990). Of course, we can simply define and , , and then the above intersection is equivalent to

 C+1∩C−1∩…∩C+p∩C−p=p⋂i=1({v∈Rn:XTiv≤λ}∩{v∈Rn:XTiv≥−λ}).

Moreover, one can check that the iterates from Dykstra’s algorithm on match222By this we mean that for all and , if the iterates from Dykstra’s algorithm on are denoted as , . those from Dykstra’s algorithm on , provided that the algorithms cycle over the sets in the order they are written in these intersections. This means that the analysis of Iusem and De Pierro (1990) can be applied to coordinate descent for the lasso.

The error constant from Theorem 1 in Iusem and De Pierro (1990) is based on a geometric quantity that we explicitly lower bound below. It is not clear to us whether our lower bound is the best possible, and a better lower bound would improve the error constant presented in Theorem 2.

###### Lemma A.1.

Let , be hyperplanes, and the -dimensional affine subspace formed by their intersection. For each , denote by the hyperplane among farthest from . Define

 μ=infx∈Rnd(x,Hx)d(x,S),

where is the distance between and , and similarly for . Then

 μ≥σmin(M)√smaxi=1,…,s∥hi∥2>0,

where has columns , and is its smallest nonzero singular value.

###### Proof.

For any , note that , where is the Moore-Penrose pseudoinverse of . Also,