Coordinate Descent Converges Faster with theGauss-Southwell Rule Than Random Selection

# Coordinate Descent Converges Faster with the Gauss-Southwell Rule Than Random Selection

Julie Nutini, Mark Schmidt, Issam H. Laradji, Michael Friedlander, Hoyt Koepke

University of British Columbia, University of California, Davis, Dato
###### Abstract

There has been significant recent work on the theory and application of randomized coordinate descent algorithms, beginning with the work of Nesterov [SIAM J. Optim., 22(2), 2012], who showed that a random-coordinate selection rule achieves the same convergence rate as the Gauss-Southwell selection rule. This result suggests that we should never use the Gauss-Southwell rule, because it is typically much more expensive than random selection. However, the empirical behaviours of these algorithms contradict this theoretical result: in applications where the computational costs of the selection rules are comparable, the Gauss-Southwell selection rule tends to perform substantially better than random coordinate selection. We give a simple analysis of the Gauss-Southwell rule showing that—except in extreme cases—its convergence rate is faster than choosing random coordinates. We also (i) show that exact coordinate optimization improves the convergence rate for certain sparse problems, (ii) propose a Gauss-Southwell-Lipschitz rule that gives an even faster convergence rate given knowledge of the Lipschitz constants of the partial derivatives, (iii) analyze the effect of approximate Gauss-Southwell rules, and (iv) analyze proximal-gradient variants of the Gauss-Southwell rule.

## 1 Coordinate Descent Methods

There has been substantial recent interest in applying coordinate descent methods to solve large-scale optimization problems, starting with the seminal work of Nesterov (2012), who gave the first global rate-of-convergence analysis for coordinate-descent methods for minimizing convex functions. This analysis suggests that choosing a random coordinate to update gives the same performance as choosing the “best” coordinate to update via the more expensive Gauss-Southwell (GS) rule. (Nesterov also proposed a more clever randomized scheme, which we consider later in this paper.) This result gives a compelling argument to use randomized coordinate descent in contexts where the GS rule is too expensive. It also suggests that there is no benefit to using the GS rule in contexts where it is relatively cheap. But in these contexts, the GS rule often substantially outperforms randomized coordinate selection in practice. This suggests that either the analysis of GS is not tight, or that there exists a class of functions for which the GS rule is as slow as randomized coordinate descent.

After discussing contexts in which it makes sense to use coordinate descent and the GS rule, we answer this theoretical question by giving a tighter analysis of the GS rule (under strong-convexity and standard smoothness assumptions) that yields the same rate as the randomized method for a restricted class of functions, but is otherwise faster (and in some cases substantially faster). We further show that, compared to the usual constant step-size update of the coordinate, the GS method with exact coordinate optimization has a provably faster rate for problems satisfying a certain sparsity constraint (Section 5). We believe that this is the first result showing a theoretical benefit of exact coordinate optimization; all previous analyses show that these strategies obtain the same rate as constant step-size updates, even though exact optimization tends to be faster in practice. Furthermore, in Section 6, we propose a variant of the GS rule that, similar to Nesterov’s more clever randomized sampling scheme, uses knowledge of the Lipschitz constants of the coordinate-wise gradients to obtain a faster rate. We also analyze approximate GS rules (Section 7), which provide an intermediate strategy between randomized methods and the exact GS rule. Finally, we analyze proximal-gradient variants of the GS rule (Section 8) for optimizing problems that include a separable non-smooth term.

## 2 Problems of Interest

The rates of Nesterov show that coordinate descent can be faster than gradient descent in cases where, if we are optimizing variables, the cost of performing coordinate updates is similar to the cost of performing one full gradient iteration. This essentially means that coordinate descent methods are useful for minimizing convex functions that can be expressed in one of the following two forms:

 h1(x):=n∑i=1gi(xi)+f(Ax),h2(x):=∑i∈Vgi(xi)+∑(i,j)∈Efij(xi,xj),

where is element of , is smooth and cheap, the are smooth, is a graph, and is a matrix. (It is assumed that all functions are convex.)111We could also consider slightly more general cases like functions that are defined on hyper-edges (Richtárik and Takáč, 2015), provided that we can still perform coordinate updates for a similar cost to one gradient evaluation. The family of functions includes core machine-learning problems such as least squares, logistic regression, lasso, and SVMs (when solved in dual form) (Hsieh et al., 2008). Family includes quadratic functions, graph-based label propagation algorithms for semi-supervised learning (Bengio et al., 2006), and finding the most likely assignments in continuous pairwise graphical models (Rue and Held, 2005).

In general, the GS rule for problem is as expensive as a full gradient evaluation. However, the structure of often allows efficient implementation of the GS rule. For example, if each node has at most neighbours, we can track the gradients of all the variables and use a max-heap structure to implement the GS rule in time (Meshi et al., 2012). This is similar to the cost of the randomized algorithm if (since the average cost of the randomized method depends on the average degree). This condition is true in a variety of applications. For example, in spatial statistics we often use two-dimensional grid-structured graphs, where the maximum degree is four and the average degree is slightly less than . As another example, for applying graph-based label propagation on the Facebook graph (to detect the spread of diseases, for example), the average number of friends is around but no user has more than seven thousand friends. The maximum number of friends would be even smaller if we removed edges based on proximity. A non-sparse example where GS is efficient is complete graphs, since here the average degree and maximum degree are both . Thus, the GS rule is efficient for optimizing dense quadratic functions. On the other hand, GS could be very inefficient for star graphs.

If each column of has at most non-zeroes and each row has at most non-zeroes, then for many notable instances of problem we can implement the GS rule in time by maintaining as well as the gradient and again using a max-heap (see Appendix A). Thus, GS will be efficient if is similar to the number of non-zeroes in divided by . Otherwise, Dhillon et al. (2011) show that we can approximate the GS rule for problem with no functions by solving a nearest-neighbour problem. Their analysis of the GS rule in the convex case, however, gives the same convergence rate that is obtained by random selection (although the constant factor can be smaller by a factor of up to ). More recently, Shrivastava and Li (2014) give a general method for approximating the GS rule for problem with no functions by writing it as a maximum inner-product search problem.

## 3 Existing Analysis

We are interested in solving the convex optimization problem

 minx∈Rn f(x), (1)

where is coordinate-wise -Lipschitz continuous, i.e., for each ,

 |∇if(x+αei)−∇if(x)|≤L|α|,∀x∈Rn and α∈R,

where is a vector with a one in position and zero in all other positions. For twice-differentiable functions, this is equivalent to the assumption that the diagonal elements of the Hessian are bounded in magnitude by . In contrast, the typical assumption used for gradient methods is that is -Lipschitz continuous (note that ). The coordinate-descent method with constant step-size is based on the iteration

 xk+1=xk−1L∇ikf(xk)eik.

The randomized coordinate-selection rule chooses uniformly from the set . Alternatively, the GS rule

 ik=argmaxi |∇if(xk)|,

chooses the coordinate with the largest directional derivative. Under either rule, because is coordinate-wise Lipschitz continuous, we obtain the following bound on the progress made by each iteration:

 f(xk+1) ≤f(xk)+∇ikf(xk)(xk+1−xk)ik+L2(xk+1−xk)2ik (2) =f(xk)−1L(∇ikf(xk))2+L2[1L∇ikf(xk)]2 =f(xk)−12L[∇ikf(xk)]2.

We focus on the case where is -strongly convex, meaning that, for some positive ,

 f(y)≥f(x)+⟨∇f(x),y−x⟩+μ2∥y−x∥2,∀x,y∈Rn, (3)

which implies that

 f(x∗)≥f(xk)−12μ∥∇f(xk)∥2, (4)

where is the optimal solution of (1). This bound is obtained by minimizing both sides of (3) with respect to .

### 3.1 Randomized Coordinate Descent

Conditioning on the -field generated by the sequence , and taking expectations of both sides of (2), when is chosen with uniform sampling we obtain

 E[f(xk+1)] ≤E[f(xk)−12L(∇ikf(xk))2] =f(xk)−12Ln∑i=11n(∇if(xk))2 =f(xk)−12Ln∥∇f(xk)∥2.

Using (4) and subtracting from both sides, we get

 E[f(xk+1)]−f(x∗)≤(1−μLn)[f(xk)−f(x∗)]. (5)

This is a special of case of Nesterov (2012, Theorem 2) with in his notation.

### 3.2 Gauss-Southwell

We now consider the progress implied by the GS rule. By the definition of ,

 (∇ikf(xk))2=∥∇f(xk)∥2∞≥(1/n)∥∇f(xk)∥2. (6)

Applying this inequality to (2), we obtain

 f(xk+1)≤f(xk)−12Ln∥∇f(xk)∥2,

which together with (4), implies that

 f(xk+1)−f(x∗)≤(1−μLn)[f(xk)−f(x∗)]. (7)

This is a special case of Boyd and Vandenberghe (2004, §9.4.3), viewing the GS rule as performing steepest descent in the -norm. While this is faster than known rates for cyclic coordinate selection (Beck and Tetruashvili, 2013) and holds deterministically rather than in expectation, this rate is the same as the randomized rate given in (5).

## 4 Refined Gauss-Southwell Analysis

The deficiency of the existing GS analysis is that too much is lost when we use the inequality in (6). To avoid the need to use this inequality, we instead measure strong-convexity in the -norm, i.e.,

 f(y)≥f(x)+⟨∇f(x),y−x⟩+μ12∥y−x∥21,

which is the analogue of (3). Minimizing both sides with respect to , we obtain

 f(x∗) ≥f(x)−supy{⟨−∇f(x),y−x⟩−μ12∥y−x∥21} (8) =f(x)−(μ12∥⋅∥21)∗(−∇f(x)) =f(x)−12μ1∥∇f(x)∥2∞,

which makes use of the convex conjugate (Boyd and Vandenberghe, 2004, §3.3). Using (8) in (2), and the fact that for the GS rule, we obtain

 f(xk+1)−f(x∗)≤(1−μ1L)[f(xk)−f(x∗)]. (9)

It is evident that if , then the rates implied by (5) and (9) are identical, but (9) is faster if . In Appendix B, we show that the relationship between and can be obtained through the relationship between the squared norms and . In particular, we have

 μn≤μ1≤μ.

Thus, at one extreme the GS rule obtains the same rate as uniform selection (). However, at the other extreme, it could be faster than uniform selection by a factor of (). This analysis, that the GS rule only obtains the same bound as random selection in an extreme case, supports the better practical behaviour of GS.

### 4.1 Comparison for Separable Quadratic

We illustrate these two extremes with the simple example of a quadratic function with a diagonal Hessian . In this case,

 μ=mini λi,andμ1=(n∑i=11λi)−1.

We prove the correctness of this formula for in Appendix C. The parameter achieves its lower bound when all are equal, , in which case

 μ=αandμ1=α/n.

Thus, uniform selection does as well as the GS rule if all elements of the gradient change at exactly the same rate. This is reasonable: under this condition, there is no apparent advantage in selecting the coordinate to update in a clever way. Intuitively, one might expect that the favourable case for the Gauss-Southwell rule would be where one is much larger than the others. However, in this case, is again similar to . To achieve the other extreme, suppose that and with . In this case, we have and

 μ1=βαn−1αn−1+(n−1)βαn−2=βαα+(n−1)β.

If we take , then we have , so . This case is much less intuitive; GS is times faster than random coordinate selection if one element of the gradient changes much more slowly than the others.

### 4.2 ‘Working Together’ Interpretation

In the separable quadratic case above, is given by the harmonic mean of the eigenvalues of the Hessian divided by . The harmonic mean is dominated by its smallest values, and this is why having one small value is a notable case. Furthermore, the harmonic mean divided by has an interpretation in terms of processes ‘working together’ (Ferger, 1931). If each represents the time taken by each process to finish a task (e.g., large values of correspond to slow workers), then is the time needed by the fastest worker to complete the task, and is the time needed to complete the task if all processes work together (and have independent effects). Using this interpretation, the GS rule provides the most benefit over random selection when working together is not efficient, meaning that if the processes work together, then the task is not solved much faster than if the fastest worker performed the task alone. This gives an interpretation of the non-intuitive scenario where GS provides the most benefit: if all workers have the same efficiency, then working together solves the problem times faster. Similarly, if there is one slow worker (large ), then the problem is solved roughly times faster by working together. On the other hand, if most workers are slow (many large ), then working together has little benefit.

### 4.3 Fast Convergence with Bias Term

Consider the standard linear-prediction framework,

 argminx,βm∑i=1f(aTix+β)+λ2∥x∥2+σ2β2,

where we have included a bias variable (an example of problem ). Typically, the regularization parameter of the bias variable is set to be much smaller than the regularization parameter of the other covariates, to avoid biasing against a global shift in the predictor. Assuming that there is no hidden strong-convexity in the sum, this problem has the structure described in the previous section () where GS has the most benefit over random selection.

## 5 Rates with Different Lipschitz Constants

Consider the more general scenario where we have a Lipschitz constant for the partial derivative of with respect to each coordinate ,

 |∇if(x+αei)−∇if(x)|≤Li|α|,∀x∈Rn and α∈R,

and we use a coordinate-dependent step-size at each iteration:

 xk+1=xk−1Lik∇ikf(xk)eik. (10)

By the logic of (2), in this setting we have

 f(xk+1)≤f(xk)−12Lik[∇ikf(xk)]2, (11)

and thus a convergence rate of

 f(xk)−f(x∗)≤[k∏j=1(1−μ1Lij)][f(x0)−f(x∗)]. (12)

Noting that , we have

 k∏j=1(1−μ1Lij)≤(1−μ1L)k. (13)

Thus, the convergence rate based on the will be faster, provided that at least one iteration chooses an with . In the worst case, however, (13) holds with equality even if the are distinct, as we might need to update a coordinate with on every iteration. (For example, consider a separable function where all but one coordinate is initialized at its optimal value, and the remaining coordinate has .) In Section 6, we discuss selection rules that incorporate the to achieve faster rates whenever the are distinct, but first we consider the effect of exact coordinate optimization on the choice of the .

### 5.1 Gauss-Southwell with Exact Optimization

For problems involving functions of the form and , we are often able to perform exact (or numerically very precise) coordinate optimization, even if the objective function is not quadratic (e.g., by using a line-search or a closed-form update). Note that (12) still holds when using exact coordinate optimization rather than using a step-size of , as in this case we have

 f(xk+1) =minα{f(xk+αeik)} (14) ≤f(xk−1Lik∇iif(xk)eik) ≤f(xk)−12Lik[∇ikf(xk)]2,

which is equivalent to (11). However, in practice using exact coordinate optimization leads to better performance. In this section, we show that using the GS rule results in a convergence rate that is indeed faster than (9) for problems with distinct when the function is quadratic, or when the function is not quadratic but we perform exact coordinate optimization.

The key property we use is that, after we have performed exact coordinate optimization, we are guaranteed to have . Because the GS rule chooses , we cannot have , unless is the optimal solution. Hence, we never choose the same coordinate twice in a row, which guarantees that the inequality (13) is strict (with distinct ) and exact coordinate optimization is faster. We note that the improvement may be marginal, as we may simply alternate between the two largest values. However, consider minimizing when the graph is sparse; after updating , we are guaranteed to have for all future iterations until we choose a variable that is a neighbour of node in the graph. Thus, if the two largest are not connected in the graph, GS cannot simply alternate between the two largest .

By using this property, in Appendix D we show that the GS rule with exact coordinate optimization for problem under a chain-structured graph has a convergence rate of the form

 f(xk)−f(x∗)≤O(max{ρG2,ρG3}k)[f(x0)−f(x∗)],

where is the maximizer of among all consecutive nodes and in the chain, and is the maximizer of among consecutive nodes , , and . The implication of this result is that, if the large values are more than two edges from each other in the graph, then we obtain a much better convergence rate. We conjecture that for general graphs, we can obtain a bound that depends on the largest value of among all nodes and connected by a path of length or . Note that we can obtain similar results for problem , by forming a graph that has an edge between nodes and whenever the corresponding variables are both jointly non-zero in at least one row of .

## 6 Rules Depending on Lipschitz Constants

If the are known, Nesterov (2012) showed that we can obtain a faster convergence rate by sampling proportional to the . We review this result below and compare it to the GS rule, and then propose an improved GS rule for this scenario. Although in this section we will assume that the are known, this assumption can be relaxed using a backtracking procedure (Nesterov, 2012, §6.1).

### 6.1 Lipschitz Sampling

Taking the expectation of (11) under the distribution and proceeding as before, we obtain

 E[f(xk+1)]−f(x∗)≤(1−μn¯L)[f(xk)−f(x∗)],

where is the average of the Lipschitz constants. This was shown by Leventhal and Lewis (2010) and is a special case of Nesterov (2012, Theorem 2) with in his notation. This rate is faster than (5) for uniform sampling if any differ.

Under our analysis, this rate may or may not be faster than (9) for the GS rule. On the one extreme, if and any differ, then this Lipschitz sampling scheme is faster than our rate for GS. Indeed, in the context of the problem from Section 4.1, we can make Lipschitz sampling faster than GS by a factor of nearly by making one much larger than all the others (recall that our analysis shows no benefit to the GS rule over randomized selection when only one is much larger than the others). At the other extreme, in our example from Section 4.1 with many large and one small , the GS and Lipschitz sampling rates are the same when , with a rate of . However, the GS rate will be faster than the Lipschitz sampling rate for any when , as the Lipschitz sampling rate is , which is slower than the GS rate of .

### 6.2 Gauss-Southwell-Lipschitz Rule

Since neither Lipschitz sampling nor GS dominates the other in general, we are motivated to consider if faster rules are possible by combining the two approaches. Indeed, we obtain a faster rate by choosing the that minimizes (11), leading to the rule

 ik=argmaxi|∇if(xk)|√Li,

which we call the Gauss-Southwell-Lipschitz (GSL) rule. Following a similar argument to Section 4, but using (11) in place of (2), the GSL rule obtains a convergence rate of

 f(xk+1)−f(x∗)≤(1−μL)[f(xk)−f(x∗)],

where is the strong-convexity constant with respect to the norm . This is shown in Appendix E, and in Appendix F we show that

 max{μn¯L,μ1L}≤μL≤μ1mini{Li}.

Thus, the GSL rule is always at least as fast as the fastest of the GS rule and Lipschitz sampling. Indeed, it can be more than a factor of faster than using Lipschitz sampling, while it can obtain a rate closer to the minimum , instead of the maximum that the classic GS rule depends on.

An interesting property of the GSL rule for quadratic functions is that it is the optimal myopic coordinate update. That is, if we have an oracle that can choose the coordinate and the step-size that decreases by the largest amount, i.e.,

 f(xk+1)=argmini,α{f(xk+αei)}, (15)

this is equivalent to using the GSL rule and the update in (10). This follows because (11) holds with equality in the quadratic case, and the choice yields the optimal step-size. Thus, although faster schemes could be possible with non-myopic strategies that cleverly choose the sequence of coordinates or step-sizes, if we can only perform one iteration, then the GSL rule cannot be improved.

For general , (15) is known as the maximum improvement (MI) rule. This rule has been used in the context of boosting (Rätsch et al., 2001), graphical models (Della Pietra et al., 1997; Lee et al., 2006; Scheinberg and Rish, 2009), Gaussian processes (Bo and Sminchisescu, 2008), and low-rank tensor approximations (Li et al., 2015). Using an argument similar to (14), our GSL rate also applies to the MI rule, improving existing bounds on this strategy. However, the GSL rule is much cheaper and does not require any special structure (recall that we can estimate as we go).

### 6.3 Connection between GSL Rule and Normalized Nearest Neighbour Search

Dhillon et al. (2011) discuss an interesting connection between the GS rule and the nearest-neighbour-search (NNS) problem for objectives of the form

 minx∈IRnF(x)=f(Ax), (16)

This is a special case of with no functions, and its gradient has the special form

 ∇F(x)=ATr(x),

where . We use the symbol because it is the residual vector () in the special case of least squares. For this problem structure the GS rule has the form

 ik =argmaxi|∇if(xk)| =argmaxi|r(xk)Tai|,

where denotes column of for . Dhillon et al. (2011) propose to approximate the above by solving the following NNS problem

 ik=argmini∈[2n]∥r(xk)−ai∥,

where in the range through refers to the negation of column and if the selected is greater than we return . We can justify this approximation using the logic

 ik =argmini∈[2n]∥r(xk)−ai∥ =argmini∈[2n]12∥r(xk)−ai∥2 =argmini∈[2n]12∥r(xk)∥2constant−r(xk)Tai+12∥ai∥2 =argmaxi∈[2n]r(xk)Tai−12∥ai∥2 =argmaxi∈[n]|r(xk)Tai|−12∥ai∥2 =argmaxi∈[n]|∇if(xk)|−12∥ai∥2.

Thus, the NNS computes an approximation to the GS rule that is biased towards coordinates where is small. Note that this formulation is equivalent to the GS rule in the special case that (or any other constant) for all . Shrivastava and Li (2014) have more recently considered the case where and incorporate powers of in the NNS to yield a better approximation.

A further interesting property of the GSL rule is that we can often formulate the exact GSL rule as a normalized NNS problem. In particular, for problem (16) the Lipschitz constants will often have the form for a some positive scalar . For example, least squares has and logistic regression has . When the Lipschitz constants have this form, we can compute the exact GSL rule by solving a normalized NNS problem,

 ik=argmini∈[2n]∣∣∣∣∣∣r(xk)−ai∥ai∥∣∣∣∣∣∣. (17)

The exactness of this formula follows because

 ik =argmini∈[2n]∣∣∣∣∣∣r(xk)−ai∥ai∥∣∣∣∣∣∣ =argmini∈[2n]12∥r(xk)−ai/∥ai∥∥2 =argmini∈[2n]12∥r(xk)∥2constant−r(xk)Tai∥ai∥+12∥ai∥2∥ai∥2constant =argmaxi∈[n]|r(xk)Tai|∥ai∥ =argmaxi∈[n]|r(xk)Tai|√γ∥ai∥ =argmaxi∈[n]|∇if(xk)|√Li.

Thus, the form of the Lipschitz constant conveniently removes the bias towards smaller values of that gets introduced when we try to formulate the classic GS rule as a NNS problem. Interestingly, in this setting we do not need to know to implement the GSL rule as a NNS problem.

## 7 Approximate Gauss-Southwell

In many applications, computing the exact GS rule is too inefficient to be of any practical use. However, a computationally cheaper approximate GS rule might be available. Approximate GS rules under multiplicative and additive errors were considered by Dhillon et al. (2011) in the convex case, but in this setting the convergence rate is similar to the rate achieved by random selection. In this section, we give rates depending on for approximate GS rules.

### 7.1 Multiplicative Errors

In the multiplicative error regime, the approximate GS rule chooses an satisfying

 |∇ikf(xk)|≥∥∇f(xk)∥∞(1−ϵk),

for some . In this regime, our basic bound on the progress (2) still holds, as it was defined for any . We can incorporate this type of error into our lower bound (8) to obtain

 f(x∗) ≥f(xk)−12μ1∥∇f(xk)∥2∞ ≥f(xk)−12μ1(1−ϵk)2|∇ikf(xk)|2.

This implies a convergence rate of

 f(xk+1)−f(x∗)≤(1−μ1(1−ϵk)2L)[f(xk)−f(x∗)].

Thus, the convergence rate of the method is nearly identical to using the exact GS rule for small (and it degrades gracefully with . This is in contrast to having an error in the gradient (Friedlander and Schmidt, 2012), where the error must decrease to zero over time.

In the additive error regime, the approximate GS rule chooses an satisfying

 |∇ikf(xk)|≥∥∇f(xk)∥∞−ϵk,

for some . In Appendix G, we show that under this rule, we have

 f(xk+1)−f(x∗) ≤(1−μ1L)k[f(x0)−f(x∗)+Ak],

where

 Ak≤min{k∑i=1(1−μ1L)−iϵi√2L1L√f(x0)−f(x∗),k∑i=1(1−μ1L)−i(ϵi√2L√f(x0)−f(x∗)+ϵ2i2L)},

where is the Lipschitz constant of with respect to the 1-norm. Note that could be substantially larger than , so the second part of the maximum in is likely to be the smaller part unless the are large. This regime is closer to the case of having an error in the gradient, as to obtain convergence the must decrease to zero. This result implies that a sufficient condition for the algorithm to obtain a linear convergence rate is that the errors converge to zero at a linear rate. Further, if the errors satisfy for some , then the convergence rate of the method is the same as if we used an exact GS rule. On the other hand, if does not decrease to zero, we may end up repeatedly updating the same wrong coordinate and the algorithm will not converge (though we could switch to the randomized method if this is detected).

One of the key motivations for the resurgence of interest in coordinate descent methods is their performance on problems of the form

 minx∈RnF(x)≡f(x)+n∑i=1gi(xi),

where is smooth and convex and the are convex, but possibly non-smooth. This includes problems with -regularization, and optimization with lower and/or upper bounds on the variables. Similar to proximal-gradient methods, we can apply the proximal operator to the coordinate update,

 xk+1=prox1Lgik[xk−1L∇ikf(xk)eik],

where

 proxαgi[y]=argminx∈Rn12∥x−y∥2+αgi(x).

With random coordinate selection, Richtárik and Takáč (2014) show that this method has a convergence rate of

 E[F(xk+1)−F(x∗)]≤(1−μnL)[F(xk)−F(x∗)],

similar to the unconstrained/smooth case.

There are several generalizations of the GS rule to this scenario. Here we consider three possibilities, all of which are equivalent to the GS rule if the are not present. First, the GS- rule chooses the coordinate with the most negative directional derivative. This strategy is popular for -regularization (Shevade and Keerthi, 2003; Wu and Lange, 2008; Li and Osher, 2009) and in general is given by (see Bertsekas, 1999, §8.4)

 ik =argmaxi{mins∈∂gi|∇if(xk)+s|}.

However, the length of the step () could be arbitrarily small under this choice. In contrast, the GS- rule chooses the coordinate that maximizes the length of the step (Tseng and Yun, 2009; Dhillon et al., 2011),

 ik =argmaxi{∣∣∣xki−prox1Lgi[xki−1L∇if(xk)]∣∣∣}.

This rule is effective for bound-constrained problems, but it ignores the change in the non-smooth term (). Finally, the GS- rule maximizes progress assuming a quadratic upper bound on  (Tseng and Yun, 2009),

 ik=argmini{mind{ f(xk)+∇if(xk)d+L2d2+gi(xki+d)−gi(xki)}}.

While the least intuitive rule, the GS- rule seems to have the best theoretical properties. Further, if we use in place of in the GS- rule (which we call the GSL- strategy), then we obtain the GSL rule if the are not present. In contrast, using in place of in the GS- rule (which we call the GSL- strategy) does not yield the GSL rule as a special case.

In Appendix H, we show that using the GS- rule yields a convergence rate of

 F(xk+1)−F(x∗) ≤(1−μLn)[f(xk)−f(x∗)],

thus matching the convergence rate of randomized coordinate descent (but deterministically rather than in expectation). In contrast, in Appendix H we also give counter-examples showing that the above rate does not hold for the GS- or the GS- rule. Thus, any bound for the GS- or the GS- rule would be slower than the expected rate under random selection, while the GS- rule matches this bound. It is an open problem whether the GS- rule obtains the rate in general, but in the next section we discuss special cases where rates depending on can be obtained.

### 8.1 Rates Depending on μ1

First, we note that if the are linear then the GS- rule obtains

 F(xk+1)−F(x∗) ≤(1−μ1L)[f(xk)−f(x∗)], (18)

since in this particular (smooth) case the algorithm and assumptions are identical to the setting of Section 4: the GS-q rule chooses the same coordinate to update as the GS rule applied to , while has the same and as because the are linear.

It is possible to change the update rule in order to obtain a rate depending on for general . In particular, after the publication of this work Song et al. (2017) considered another generalization of the GS rule in the context of -regularization. A generalized version of their update rule is given by

 xk+1 =xk+dk, dk ∈mind∈IRn{⟨∇f(xk),d⟩+L12∥d∥21+g(xk+d)},

where and where is the Lipschitz constant of in the -norm. We call this the GS- rule, and simlar to the other GS- rules it is equivalent to the GS rule if the are not present. This equivalence follows from viewing the GS rule as steepest descent in the -norm (Boyd and Vandenberghe, 2004, Section 9.4.2)Nutini (2018, Appendix A.8) shows that this rule obtains a convergence rate of

 F(xk+1)−F(x∗) ≤(1−μ1L1)[f(xk)−f(x∗)].

Note that , so this rate is slower than the other rates we have shown involving . Further, unlike the other non-smooth generalizations of the GS rule, for non-linear this generalization may select more than one variable to update at each iteration. Thus it would be more appropriate to refer to this as a block coordinate descent method than a coordinate descent method. Although Song et al. (2017) give an efficient way to compute given the gradient in the case of -regularization, computing for other choices of may add an additional computational cost to the method.

Finally, consider the case of that are piecewise-linear. Under a suitable non-degeneracy assumption, coordinate descent methods achieve a particular “active set” property in a finite number of iterations (Wright, 2012; Nutini et al., 2017). Specifically, for values of that occur at non-smooth values of , we will have for all sufficiently large . At this point, none of the four GS- rules would select such coordinates again. Similarly, for values where occurs at smooth values of , the iterates will eventually be confined to a region where the is smooth. Once this “active set” identification happens for piecewise-linear , the iterates will be confined to a region where the selected are linear. At this point, linearity means that only one coordinate will be selected by the GS- rule and it will select the same coordinate as the GS- rule. Further, at this point the analysis of Nutini (2018, Appendix A.8) can be applied with instead for the GS- rule which leads to a rate of as in the smooth case.

## 9 Experiments

We first compare the efficacy of different coordinate selection rules on the following simple instances of . -regularized sparse least squares: Here we consider the problem

 minx12m∥Ax−b∥2+λ2∥x∥2,

an instance of problem . We set to be an by matrix with entries sampled from a distribution (with and ). We then added 1 to each entry (to induce a dependency between columns), multiplied each column by a sample from multiplied by ten (to induce different Lipschitz constants across the coordinates), and only kept each entry of non-zero with probability (a sparsity level that allows the Gauss-Southwell rule to be applied with cost . We set and , where the entries of and were drawn from a distribution. In this setting, we used a step-size of for each coordinate , which corresponds to exact coordinate optimization.

-regularized sparse logistic regression: Here we consider the problem

 minx1mm∑i=1log(1+exp(−biaTix))+λ2∥x∥2.

We set the to be the rows of from the previous problem, and set sign, but randomly flipping each with probability . In this setting, we compared using a step-size of to using exact coordinate optimization.

Over-determined dense least squares: Here we consider the problem

 minx12m∥Ax−b∥2,

but, unlike the previous case, we do not set elements of to zero and we make have dimension by . Because the system is over-determined, it does not need an explicit strongly-convex regularizer to induce global strong-convexity. In this case, the density level means that the exact GS rule is not efficient. Hence, we use a balltree structure (Omohundro, 1989) to implement an efficient approximate GS rule based on the connection to the NNS problem discovered by Dhillon et al. (2011). On the other hand, we can compute the exact GSL rule for this problem as a NNS problem as discussed in Section 6.3.

-regularized underdetermined sparse least squares: Here we consider the non-smooth problem

 minx12m∥Ax−b∥2+λ∥x∥1.

We generate as we did for the -regularized sparse least squares problem, except with the dimension by . This problem is not globally strongly-convex, but will be strongly-convex along the dimensions that are non-zero in the optimal solution.

We plot the objective function (divided by its initial value) of coordinate descent under different selection rules in Figure 1. Even on these simple datasets, we see dramatic differences in performance between the different strategies. In particular, the GS rule outperforms random coordinate selection (as well as cyclic selection) by a substantial margin in all cases. The Lipschitz sampling strategy can narrow this gap, but it remains large (even when an approximate GS rule is used). The difference between GS and randomized selection seems to be most dramatic for the -regularized problem; the GS rules tend to focus on the non-zero variables while most randomized/cyclic updates focus on the zero variables, which tend not to move away from zero.333To reduce the cost of the GS- method in this context, Shevade and Keerthi (2003) consider a variant where we first compute the GS- rule for the non-zero variables and if an element is sufficiently large then they do not consider the zero variables. Exact coordinate optimization and using the GSL rule seem to give modest but consistent improvements. The three non-smooth GS- rules had nearly identical performance despite their different theoretical properties. The GSL- rule gave better performance than the GS- rules, while the the GSL- variant performed worse than even cyclic and random strategies. We found it was also possible to make the GS- rule perform poorly by perturbing the initialization away from zero. While these experiments plot the performance in terms of the number of iterations, in Appendix I we show that the GS- rules can also be advantageous in terms of runtime.

We next consider an instance of problem , performing label propagation for semi-supervised learning in the ‘two moons’ dataset (Zhou et al., 2004). We generate samples from this dataset, randomly label five points in the data, and connect each node to its five nearest neighbours. This high level of sparsity is typical of graph-based methods for semi-supervised learning, and allows the exact Gauss-Southwell rule to be implemented efficiently. We use the quadratic labeling criterion of Bengio et al. (2006), which allows exact coordinate optimization and is normally optimized with cyclic coordinate descent. We plot the performance under different selection rules in Figure 2. Here, we see that even cyclic coordinate descent outperforms randomized coordinate descent, but that the GS and GSL rules give even better performance. We note that the GS and GSL rules perform similarly on this problem since the Lipschitz constants do not vary much.

## 10 Discussion

It is clear that the GS rule is not practical for every problem where randomized methods are applicable. Nevertheless, we have shown that even approximate GS rules can obtain better convergence rate bounds than fully-randomized methods. We have given a similar justification for the use of exact coordinate optimization, and we note that our argument could also be used to justify the use of exact coordinate optimization within randomized coordinate descent methods (as used in our experiments). We have also proposed the improved GSL rule, and considered approximate/proximal variants. We expect our analysis also applies to block updates by using mixed norms , and could be used for accelerated/parallel methods (Fercoq and Richtárik, 2013), for primal-dual rates of dual coordinate ascent (Shalev-Shwartz and Zhang, 2013), for successive projection methods (Leventhal and Lewis, 2010), for boosting algorithms (Rätsch et al., 2001), and for scenarios without strong-convexity under general error bounds (Luo and Tseng, 1993).

## Acknowledgements

We would like to thank the anonymous referees for their useful comments that significantly improved the paper. Julie Nutini is funded by an NSERC Canada Graduate Scholarship.

## Appendix A Efficient calculation of GS rules for sparse problems

We first give additional details on how to calculate the GS rule efficiently for sparse instances of problems and . We will consider the case where each is smooth, but the ideas can be extended to allow a non-smooth . Further, note that the efficient calculation does not rely on convexity, so these strategies can also be used for non-convex problems.

### a.1 Problem h2

Problem has the form

 h2(x):=∑i∈Vgi(xi)+∑(i,j)∈Efij(xi,xj),

where each and are differentiable and is a graph where the number of vertices is the same as the number of variables . If all nodes in the graph have a degree (number of neighbours) bounded above by some constant , we can implement the GS rule in after an time initialization by maintaining the following information about :

1. A vector containing the values .

2. A matrix containing the values in the first column and in the second column.

3. The elements of the gradient vector stored in a binary max heap data structure (see Cormen et al., 2001, Chapter 6).

Given the heap structure, we can compute the GS rule in by simply reading the index value of the root node in the max heap. The costs for initializing these structures are:

1. to compute for all nodes.

2. to compute for all edges.

3. to sum the values in the above structures to compute , and to construct the initial max heap.

Thus, the one-time initialization cost is . The costs of updating the data structures after we update to for the selected coordinate are:

1. to compute .

2. to compute for and or (only such values exist by assumption, and all other are unchanged).

3. to update up to elements of that differ from by using differences in changed values of and , followed by to perform updates of the heap at a cost of for each update.

The most expensive part of the update is modifying the heap, and thus the total cost is .444For less-sparse problems where , using a heap is actually inefficient and we should simply store as a vector. The initialization cost is the same, but we can then perform the GS rule in by simply searching through the vector for the maximum element.

Problem